Web scraping is the process of extracting information from websites and turning that information into structured data. This data may be loaded into a database, spreadsheet, API or another format for further use. Uses include competitive analysis, lead generation, price monitoring, research, brand monitoring and a wide range of other business use cases.
Web scraping involves programming a bot to crawl a website and process its data. This data may come from various sources and be collected with various techniques, including APIs, GET and POST requests, parsed HTML, autocomplete forms and more.
This type of information gathering allows a business to gain competitive intelligence, because the collected data can be processed using programs, applications, spreadsheets and more. From a research, intelligence and business perspective, you gain a large advantage over browsing the internet as a human.
Scraping websites is simply a form of data gathering, using a bot or program, to save information in a structured manner for later analysis.
Depending on the project, a range of website scraping techniques may be used:
HTML parsing is the most commonly thought of web scraping technique: downloading an HTML page and parsing it with code. For example, a program may save the text of all `<h1>` tags to a database.
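The `<h1>` example above can be sketched with Python's standard library alone. This is a minimal illustration, not a production scraper: the sample HTML, the `headings` table and the names `H1Collector` and `save_h1_tags` are all assumptions made for the example, and the page is an inline string rather than a real download.

```python
import sqlite3
from html.parser import HTMLParser


class H1Collector(HTMLParser):
    """Collects the text content of every <h1> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._in_h1 = False
        self._buffer = []
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self._in_h1 = True
            self._buffer = []

    def handle_data(self, data):
        # Text inside nested tags (e.g. <em>) still belongs to the heading.
        if self._in_h1:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "h1" and self._in_h1:
            self.headings.append("".join(self._buffer).strip())
            self._in_h1 = False


def save_h1_tags(html, conn):
    """Parse html, store each <h1> text in the database, return the list."""
    parser = H1Collector()
    parser.feed(html)
    parser.close()
    conn.execute("CREATE TABLE IF NOT EXISTS headings (text TEXT)")
    conn.executemany(
        "INSERT INTO headings (text) VALUES (?)",
        [(heading,) for heading in parser.headings],
    )
    conn.commit()
    return parser.headings


# Assumed sample page; a real bot would download this HTML first.
SAMPLE_HTML = (
    "<html><body><h1>Widgets</h1><p>Intro</p>"
    "<h1>Gadgets <em>Pro</em></h1></body></html>"
)

if __name__ == "__main__":
    database = sqlite3.connect(":memory:")
    print(save_h1_tags(SAMPLE_HTML, database))
```

Real pages are often malformed, so a project would likely swap in a more forgiving third-party parser; the flow of download, parse and store stays the same.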
Many publicly available API endpoints (usually reached with the HTTP methods GET and POST) already return well-structured data, which saves the bot from parsing HTML. For example, many client-side applications interface with a JSON API which is publicly available. This data can often be read directly into a database.
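To show why structured JSON is convenient, here is a small sketch that loads an API response straight into a database table. The payload shape (a `products` list with `id`, `name` and `price`), the table layout and the function name `load_products` are hypothetical; a real endpoint will have its own schema, and the response here is an inline string rather than a live request.

```python
import json
import sqlite3

# Assumed example of what a JSON API might return.
SAMPLE_RESPONSE = (
    '{"products": ['
    '{"id": 1, "name": "Widget", "price": 9.99}, '
    '{"id": 2, "name": "Gadget", "price": 24.5}'
    "]}"
)


def load_products(payload, conn):
    """Insert every product in the JSON payload; return how many were stored."""
    records = json.loads(payload)["products"]
    conn.execute(
        "CREATE TABLE IF NOT EXISTS products "
        "(id INTEGER PRIMARY KEY, name TEXT, price REAL)"
    )
    # Named placeholders map directly onto the keys of each JSON object.
    conn.executemany(
        "INSERT OR REPLACE INTO products (id, name, price) "
        "VALUES (:id, :name, :price)",
        records,
    )
    conn.commit()
    return len(records)


if __name__ == "__main__":
    database = sqlite3.connect(":memory:")
    print(load_products(SAMPLE_RESPONSE, database), "products stored")
```

Because the keys already line up with the columns, no HTML parsing step is needed.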
The four main costs associated with web scraping are software development, volume of data, quantity of sites and frequency of updates.
Web scraping software can cost anywhere from $250 for a tiny project to tens of thousands of dollars. The average project involves a modest level of complexity, such as working around rate limiting by routing through proxies, handling error messages and coping with badly structured code. The average project may cost around $750 to $1,600 for initial software development.
The next three cost components I will group under 'compute' as they all influence the amount of compute power required to complete the project. Compute power is the amount and size of servers required to collect and process the data.
How much data is required? Are we crawling hundreds, thousands, hundreds of thousands or millions of endpoints?
The answer to this question will greatly influence the compute cost. The more data points, and the more data required, the higher the cost.
Let's give an example. Say you wish to crawl pricing for an eCommerce provider weekly. The eCommerce provider may have rate limiting set up, which only allows us one request per second. Now, let's say you only want 1,000 data points saved each week. This would be easy enough: respecting their rate limiting requirements, your project would take 1,000 seconds (roughly 17 minutes) to complete each week.
Using the same example, let's say you require 1 million data points saved each week. Your bot will now take 1 million seconds to complete its crawl. This is roughly 278 hours, or about 11.6 days: more than a week!
As you can see, the volume of data to be crawled will greatly influence compute cost. In the latter case above, there may be ways to decrease crawl time, such as rotating proxies, randomizing user agents and more. However, if your project requires a large volume, perhaps consider less frequent updates, or consider a smaller data set. Perhaps the research you plan to undertake can be completed with 20,000 rows, not 1,000,000.
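The arithmetic in the example can be checked with a few lines of Python. This is a rough sketch that assumes one request per data point. The `parallel_routes` parameter is a hypothetical simplification: it treats each extra route (for instance a proxy) as being rate limited independently, which a real site may not allow.

```python
SECONDS_PER_WEEK = 7 * 24 * 60 * 60  # 604,800


def crawl_duration_seconds(data_points, seconds_per_request=1.0, parallel_routes=1):
    """Estimated crawl time, assuming one request per data point."""
    return data_points * seconds_per_request / parallel_routes


def fits_in_week(data_points, seconds_per_request=1.0, parallel_routes=1):
    """True if a weekly crawl would finish before the next one is due."""
    duration = crawl_duration_seconds(data_points, seconds_per_request, parallel_routes)
    return duration <= SECONDS_PER_WEEK


if __name__ == "__main__":
    for points in (1_000, 20_000, 1_000_000):
        seconds = crawl_duration_seconds(points)
        print(
            f"{points:>9,} data points: {seconds / 3600:.1f} hours, "
            f"{seconds / 86400:.2f} days, fits in a week: {fits_in_week(points)}"
        )
```

Under these assumptions, the 1 million data point crawl only fits into a week once the work is split across at least two independently limited routes, or the data set is reduced.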
The quantity of sites to be crawled, including subdomains and data points with different representations, will greatly impact compute cost and software cost. If we need to crawl 10 websites, each with different HTML and different GET and POST endpoints, we basically need to rewrite the bot for each, and the compute power required scales with volume multiplied by the quantity of sites.
Also, if your data set requires multiple data types, for example 'products' and 'documentation', the bot will need to include different parsing functionality for each data type.
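One common way to organize per-type parsing is a small registry that maps each data type to its own parser function. The sketch below is illustrative only: the field names (`title`, `price`, `heading`, `body`) and the function names are hypothetical and stand in for whatever each site actually exposes.

```python
def parse_product(raw):
    """Normalize a raw product record (assumed fields: title, price)."""
    return {
        "type": "product",
        "name": raw["title"].strip(),
        "price": float(raw["price"].lstrip("$")),
    }


def parse_documentation(raw):
    """Normalize a raw documentation page (assumed fields: heading, body)."""
    return {
        "type": "documentation",
        "title": raw["heading"].strip(),
        "words": len(raw["body"].split()),
    }


# Each new data type adds one parser function and one entry here.
PARSERS = {
    "products": parse_product,
    "documentation": parse_documentation,
}


def parse_item(data_type, raw):
    """Dispatch a raw record to the parser registered for its data type."""
    try:
        parser = PARSERS[data_type]
    except KeyError:
        raise ValueError(f"no parser registered for data type {data_type!r}") from None
    return parser(raw)


if __name__ == "__main__":
    print(parse_item("products", {"title": " Widget ", "price": "$9.99"}))
    print(parse_item("documentation", {"heading": "Setup", "body": "Install and run"}))
```

Every additional data type or site adds another parser to write and maintain, which is where the extra software cost comes from.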
Frequency is the time interval between updates of your data. If your bot needs to crawl a website daily, the compute cost will of course be far higher than for monthly updates. If your bot is for a one-off research project, this cost will be incurred only once.
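To see how volume, quantity of sites and frequency combine, the following sketch estimates the number of requests a project makes per year. It assumes one request per data point and the same volume on every site, and the names `RUNS_PER_YEAR` and `yearly_requests` are made up for the illustration.

```python
# Approximate number of crawl runs per year for each update frequency.
RUNS_PER_YEAR = {"daily": 365, "weekly": 52, "monthly": 12, "one-off": 1}


def yearly_requests(data_points_per_site, sites, frequency):
    """Requests per year, assuming one request per data point."""
    return data_points_per_site * sites * RUNS_PER_YEAR[frequency]


if __name__ == "__main__":
    for frequency in RUNS_PER_YEAR:
        total = yearly_requests(1_000, 10, frequency)
        print(f"{frequency:>8}: {total:,} requests per year")
```

With 1,000 data points on each of 10 sites, a daily crawl makes roughly 30 times as many requests per year as a monthly one, which is why frequency weighs so heavily on compute cost.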
To summarize, the cost of your website crawling project depends on a number of factors including website complexity, quantity of websites and data points, frequency of updates and volume of data to be collected.