Web scraping is easy to get into, even for those who have never done it. All it takes is a bit of programming knowledge, usually Python, to get a good thing going. That holds, of course, until scaling becomes a necessity. Then everything falls apart.
Homebrew web scrapers work perfectly well for the small tasks they are built for. But once large amounts of data or constant scraping are required, things tend to get messy. Load management becomes a problem, IP blocks ensue, new servers need to be bought, and so on. Without a good foundation, the entire undertaking can become sluggish and unviable.
A bird’s eye view of large-scale scraping
All web scraping begins in the same manner and revolves around the same axis. There's an automated application that performs the scraping, a URL or set of URLs that serve as the data sources, and a delivery destination, usually a file in a format such as JSON or CSV.
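To make those three parts concrete, here is a minimal sketch in Python. The URL, the CSS selector, and the output fields are placeholders rather than a real target, and the requests and BeautifulSoup libraries are assumed to be available.

```python
# A minimal version of the structure above: a scraper, a source URL,
# and a CSV file as the delivery destination.
# NOTE: the URL and the "h2" selector are hypothetical placeholders.
import csv

import requests
from bs4 import BeautifulSoup

SOURCE_URL = "https://example.com/products"  # placeholder source


def scrape(url: str) -> list[dict]:
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # A real page would need its own parsing logic.
    return [{"title": tag.get_text(strip=True)} for tag in soup.select("h2")]


def deliver(rows: list[dict], path: str = "results.csv") -> None:
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["title"])
        writer.writeheader()
        writer.writerows(rows)


if __name__ == "__main__":
    deliver(scrape(SOURCE_URL))
```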
Small-scale scraping projects usually keep the structure outlined above intact. There's no need to add complexity: the only person sending requests is the owner, the database is stored locally, and there's usually no intention of moving it elsewhere.
When scaling, however, upgrades become necessary, for instance a way to interact with the scraper from a distance. That often means developing or using an API to send commands. And once the scraper can be accessed remotely, the results should be delivered remotely as well.
That means developing a way to present the findings over the internet. If they are stored in a file or database, this requires providing access to that storage. Returning the results directly might work for one-off requests; in many cases, however, historical data is also necessary.
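As a rough illustration of what such remote access might look like, here is a hedged sketch of a small HTTP API built with Flask. The endpoint names, payload shapes, and in-memory result store are assumptions made for the example, not a description of any particular production setup.

```python
# A sketch of a thin HTTP layer in front of the scraper, assuming Flask.
# Endpoints, payloads, and the in-memory result store are illustrative.
from flask import Flask, jsonify, request

app = Flask(__name__)

# Stand-in for real storage; a production setup would use a database.
RESULTS: dict[str, list[dict]] = {}


@app.post("/jobs")
def submit_job():
    url = request.json["url"]
    # In a real system this would enqueue the job for a scraping worker;
    # here we just record a placeholder result.
    RESULTS[url] = [{"title": "placeholder result"}]
    return jsonify({"status": "accepted", "url": url}), 202


@app.get("/results")
def get_results():
    url = request.args["url"]
    return jsonify(RESULTS.get(url, []))


if __name__ == "__main__":
    app.run(port=8000)
```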
If historical data becomes a requirement, the database will likely have to be moved off the original machine and onto a dedicated server. At that point, the machines need a reliable way to communicate with one another.
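A dedicated database server changes the delivery step from writing a local file to writing over the network. The sketch below assumes PostgreSQL and the psycopg2 driver; the hostname, credentials, and table schema are illustrative only.

```python
# A sketch of delivering results to a dedicated database server instead of
# a local file, assuming PostgreSQL and psycopg2.
# The hostname, credentials, and table schema are placeholders.
import psycopg2


def store_remotely(rows: list[dict]) -> None:
    conn = psycopg2.connect(
        host="db.internal.example",  # hypothetical dedicated DB server
        dbname="scraping",
        user="scraper",
        password="change-me",
    )
    # "with conn" commits the transaction on success and rolls back on error.
    with conn, conn.cursor() as cur:
        cur.execute(
            "CREATE TABLE IF NOT EXISTS results ("
            "title TEXT, scraped_at TIMESTAMPTZ DEFAULT now())"
        )
        cur.executemany(
            "INSERT INTO results (title) VALUES (%s)",
            [(row["title"],) for row in rows],
        )
    conn.close()
```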
Finally, at some point the number of requests becomes more than the setup can handle. More machines are required, and they have to communicate with a master server that manages load and collects queues. Optimizing those queues and routing jobs to the proper machines then becomes the issue.
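One simple way to picture the master's job is queue-length-based dispatching. The sketch below assumes Redis as the queue backend; the host, queue names, and the "shortest queue wins" rule are illustrative, not a recommendation of a specific design.

```python
# A sketch of the master's dispatch logic: push each incoming job onto the
# least-loaded worker queue. Assumes Redis as the queue backend; the host
# and queue names are placeholders.
import json

import redis

r = redis.Redis(host="queue-master.internal.example", port=6379)
WORKER_QUEUES = ["jobs:worker-1", "jobs:worker-2", "jobs:worker-3"]


def dispatch(job: dict) -> str:
    # Naive load management: send the job to the shortest queue.
    target = min(WORKER_QUEUES, key=lambda q: r.llen(q))
    r.lpush(target, json.dumps(job))
    return target
```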
After all is said and done, the three-step project now works like this (a worker-side sketch follows the list):
- A request is sent through an API.
- A master server receives the request and checks existing queues.
- A load balancer picks up the request.
- The request is routed to a machine that can interpret and execute it.
- Data is then sent either to storage, directly to the requester, or to a temporary server.
- Somewhere along the chain, logging has to happen for troubleshooting purposes.
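Here is a hedged sketch of one worker in that chain: it blocks on its queue, executes the job, pushes the result to storage, and logs each step for troubleshooting. The Redis host, queue names, and simplified fetch logic are assumptions carried over from the earlier sketches.

```python
# A sketch of one worker in the chain above: block on its queue, execute
# the job, push the result to storage, and log every step for troubleshooting.
# The Redis host, queue names, and simplified fetch are placeholders.
import json
import logging

import redis
import requests

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")
r = redis.Redis(host="queue-master.internal.example", port=6379)


def scrape(url: str) -> list[dict]:
    # Stand-in for the real scraping/parsing logic.
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return [{"status": response.status_code, "length": len(response.text)}]


def run_worker(queue_name: str = "jobs:worker-1") -> None:
    while True:
        item = r.brpop(queue_name, timeout=5)
        if item is None:
            continue  # nothing queued; keep polling
        job = json.loads(item[1])
        logging.info("picked up job %s", job)
        try:
            rows = scrape(job["url"])
            r.lpush("results:" + job["url"], json.dumps(rows))
            logging.info("stored %d rows for %s", len(rows), job["url"])
        except Exception:
            logging.exception("job failed: %s", job)


if __name__ == "__main__":
    run_worker()
```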
What was once a single machine executing synchronous code has now become dozens—if not hundreds—of machines executing many different actions asynchronously, communicating with each other, and delivering data to multiple third parties.
In the end, the seemingly simple process proliferates and grows in many different directions until the original project becomes barely recognizable. Rest assured: it doesn’t have to be this way. At our company, Oxylabs, we had to discover the right way to do things ourselves, which I’ll share with you here.
Building the foundation in the correct way
Oxylabs’ Scraper APIs went through similar phases. In fact, our first prototype was used to solve only one particular problem—it wasn’t even scalable. However, over time, we learned the most frequent issues that arise in the scraping process. Getting ahead of them can save a tremendous amount of time and resources.
What are the areas that have to be scaled most frequently? Of course, having enough servers to perform the scraping and/or parsing jobs is necessary. However, in our experience, that's not where the frequent bottlenecks happen, because potential issues there are easy to notice.
Core clusters, however, are ironically easy to miss. In fact, we experienced frequent bottlenecks at our queue master and load balancers. Hindsight is 20/20, but we should have expected that the primary request highways would get flooded eventually. However, in the thick of things, it’s easy to start looking for problems elsewhere without discovering their root cause.
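Catching that kind of flooding early is largely a matter of watching the core cluster's queues rather than the scraping servers themselves. The sketch below periodically samples queue depth and warns when it crosses a threshold; the Redis host, queue names, and threshold are assumptions, not production values.

```python
# A sketch of watching the core cluster for flooding: sample queue depth
# periodically and warn when it crosses a threshold. The Redis host, queue
# names, and threshold are placeholders.
import logging
import time

import redis

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")
r = redis.Redis(host="queue-master.internal.example", port=6379)
QUEUE_DEPTH_ALERT = 10_000  # illustrative threshold, not a recommendation


def watch_queues(queues: list[str], interval_s: int = 30) -> None:
    while True:
        for queue in queues:
            depth = r.llen(queue)
            if depth > QUEUE_DEPTH_ALERT:
                logging.warning("queue %s is flooded: %d pending jobs",
                                queue, depth)
        time.sleep(interval_s)


if __name__ == "__main__":
    watch_queues(["jobs:worker-1", "jobs:worker-2", "jobs:worker-3"])
```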
Logging for troubleshooting purposes, meanwhile, can be a headache of its own. It needs to be verbose enough to provide insight into any potential issue. However, verbose logging for such a complex process means there's a lot to write down, so each log takes up a significant amount of space. Additionally, these logs have to be kept around for a decent amount of time, since deleting them too quickly would make the entire logging process useless.
At Oxylabs, we have had to upgrade our verbose logging servers frequently. The most important decision we've made, however, was settling on the optimal deletion time. As our servers create dozens of terabytes of troubleshooting logs daily, finding a retention window that worked both for our clients and for us was essential, since with proper management it frees up significant resources.
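In code, a retention window of this kind can be enforced with rotating handlers. The sketch below uses Python's standard logging module; the daily rotation and 14-day window are illustrative figures, not the ones we actually use.

```python
# A sketch of enforcing a log retention window with Python's standard
# logging module. Daily rotation and a 14-day window are illustrative
# figures, not the ones used in production.
import logging
from logging.handlers import TimedRotatingFileHandler

handler = TimedRotatingFileHandler(
    "scraper-debug.log",
    when="D",         # rotate once per day
    interval=1,
    backupCount=14,   # files older than 14 rotations are deleted automatically
)
handler.setFormatter(
    logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s")
)

logger = logging.getLogger("scraper")
logger.setLevel(logging.DEBUG)
logger.addHandler(handler)

logger.debug("verbose troubleshooting detail goes here")
```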
To conclude, as web scraping projects improve, they almost always grow in complexity by necessity. It's easy to lose sight of the forest for the trees when upgrading both software and hardware capabilities. There's only so much optimization that can be done without the proper infrastructure. To avoid hectic hardware improvements, keeping the possible bottlenecks outlined above in mind can lend invaluable structure to the entire project.