Understanding HTTP requests, responses and headers is critical to developing a powerful web scraper.
An HTTP request is what the client (e.g. a web browser, web scraper bot or app) sends to a server in order to request data (e.g. to load a webpage or submit a form's data).
Beyond the initial connection to a server, HTTP is the protocol that clients and servers use to communicate in order to browse, load, send and receive data from a website or application.
In its simplest form, an HTTP request looks something like this:
    GET / HTTP/1.1
    Host: www.google.com
Generally, HTTP requests also include other headers, such as User-Agent and Cookie.
    GET / HTTP/1.1
    Host: www.google.com
    User-Agent: My Awesome Browser Version 0.001 beta
    Cookie: session=hereismypassword
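On the wire, the request line and each header sit on their own line, separated by CRLF, and a blank line marks the end of the headers. The following is a minimal sketch of how such a request could be assembled as text; `build_request` is a hypothetical helper written for illustration, not part of any library, and it ignores request bodies.

```python
def build_request(method, path, headers):
    """Return the raw text of an HTTP/1.1 request without a body."""
    # The request line comes first: method, path and protocol version.
    lines = [f"{method} {path} HTTP/1.1"]
    # Each header is written as "Name: value" on its own line.
    lines += [f"{name}: {value}" for name, value in headers.items()]
    # Lines are joined with CRLF; an empty line ends the header section.
    return "\r\n".join(lines) + "\r\n\r\n"


raw_request = build_request("GET", "/", {
    "Host": "www.google.com",
    "User-Agent": "My Awesome Browser Version 0.001 beta",
})
print(raw_request)
```

In practice an HTTP library builds this text for you, but seeing it spelled out makes it easier to read the headers shown in browser developer tools.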
Often a request carries a dozen or so headers, some of them set by the website (cookies, for example), and the site or app may rely on them to work correctly. It is therefore useful to look at the HTTP headers being sent and received under the hood when building a web scraper.
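One way to look at the headers a scraper would send is to build the request object first and inspect it before anything goes over the network. This is a small sketch using Python's standard `urllib.request` module; the URL and User-Agent string are just the example values from above.

```python
import urllib.request

# Describe the request; nothing is sent until urlopen() is called.
req = urllib.request.Request(
    "https://www.google.com/",
    headers={"User-Agent": "My Awesome Browser Version 0.001 beta"},
)

print(req.get_method(), req.full_url)
# urllib stores header names in capitalized form, e.g. "User-agent".
for name, value in req.header_items():
    print(f"{name}: {value}")
```

Further headers, such as Host, are added by the library when the request is actually sent, so the browser's developer tools or a debugging proxy remain the best way to see the complete exchange.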
An HTTP response is what the server sends back to the client in reply to a request. Generally this means sending back a web page.
In its simplest form, an HTTP response looks something like this:
    HTTP/1.1 200 OK
    Content-Type: text/html
Often responses will contain a number of other headers, such as Set-Cookie (to set a cookie after logging in or similar), Cache-Control, Date and more.
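To make the structure concrete, here is a rough sketch that splits a raw response into its status line and headers. `parse_response_head` is a hypothetical helper and the sample response is made up for illustration; a real scraper would normally let its HTTP library do this parsing.

```python
def parse_response_head(raw):
    """Split raw response text into (status code, reason, header list)."""
    # The header section ends at the first blank line; the body follows it.
    head, _, _body = raw.partition("\r\n\r\n")
    status_line, *header_lines = head.split("\r\n")
    # Status line example: "HTTP/1.1 200 OK"
    _version, code, reason = status_line.split(" ", 2)
    headers = []
    for line in header_lines:
        name, _, value = line.partition(":")
        # Keep a list of pairs, since headers like Set-Cookie may repeat.
        headers.append((name.strip().lower(), value.strip()))
    return int(code), reason, headers


sample = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html\r\n"
    "Set-Cookie: session=abc123; Path=/\r\n"
    "\r\n"
    "<html></html>"
)
status, reason, headers = parse_response_head(sample)
print(status, reason)
for name, value in headers:
    print(f"{name}: {value}")
```

Header names are lowercased here because they are case-insensitive in HTTP, which makes later lookups simpler.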
It is useful to understand what the various HTTP responses mean in order to write a scraper bot that crawls the web correctly. For example, the bot may need to store the cookies a server sets and send them back in order to access certain types of data or API endpoints.
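The cookie round trip can be sketched with Python's standard `http.cookies` module: read the value of a Set-Cookie response header, then build the Cookie header for the next request. The cookie name and value here are invented for the example, and the sketch ignores attributes such as expiry and domain matching that a full cookie jar would honour.

```python
from http.cookies import SimpleCookie

# Value of a Set-Cookie header as it might arrive in a response (made up).
set_cookie_value = "session=abc123; Path=/"

jar = SimpleCookie()
jar.load(set_cookie_value)

# Only name=value pairs are sent back; attributes like Path stay client-side.
cookie_header = "; ".join(
    f"{name}={morsel.value}" for name, morsel in jar.items()
)
print("Cookie:", cookie_header)
```

Most HTTP client libraries offer a session or cookie-jar object that handles this automatically, which is usually the better choice for a real scraper.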