Web scraping, alternative data, and related practices have been generating plenty of buzz in the private sector. Digital businesses have been looking for ways to diversify their data portfolios for more accurate decision-making and the discovery of new strategies.
Unfortunately, adoption in the public sector hasn't been nearly as successful. Few, if any, institutions employ web scraping, and even fewer have the necessary tools and expertise to work with alternative data.
There are many possible reasons: legacy systems, limited access to technical expertise, and slower adoption of new technologies. Nevertheless, there should be a push towards greater adoption, as these practices could usher in a completely new era for the public sector.
What is alternative data?
Alternative data is defined through negation: rather than being described by a set of features, it is simply data that isn't acquired through traditional methods, such as the data collected by institutions and government bodies for reporting purposes.
Government census data is the perfect example of a traditional source: it takes ages to collect, it's released on a yearly (or even slower) basis, and it's surrounded by numerous regulatory bodies. Company tax filings, if they are public, would be another case of traditional data.
Alternative data is everything the above is not. That might not seem like a helpful definition, but it becomes clearer once you start working with it. In practice, most alternative data consists of images, web-scraped information, and material from a range of other non-traditional sources.
An unfortunate side effect of alternative data is that it’s significantly harder to acquire and work with. Unless it’s bought from third-party alternative data vendors, most of it is unstructured, hard to understand, and requires a lot of pre-processing.
Outside of outright buying the data, web scraping is the only semi-accessible piece of the puzzle that could make the alternative data acquisition process palatable. Newer solutions work almost completely out-of-the-box, allowing users with minimal software development knowledge to collect data from the internet.
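To illustrate how approachable basic collection has become, here is a minimal Python sketch using the widely available requests and BeautifulSoup libraries. The URL and CSS selectors are hypothetical placeholders; a production setup would add proxy rotation, rate limiting, and error handling.

```python
# Minimal sketch: collecting structured data from a public web page.
# The URL and CSS selectors below are hypothetical placeholders.
import requests
from bs4 import BeautifulSoup


def scrape_listings(url: str) -> list[dict]:
    """Download a page and extract title/price pairs from listing cards."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    listings = []
    for card in soup.select("div.listing"):  # hypothetical selector
        title = card.select_one("h2.title")
        price = card.select_one("span.price")
        if title and price:
            listings.append({
                "title": title.get_text(strip=True),
                "price": price.get_text(strip=True),
            })
    return listings


if __name__ == "__main__":
    for item in scrape_listings("https://example.com/listings"):
        print(item)
```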
With all of the data being left online by individuals and businesses, government and other public sector entities could find tremendous value in an organized collection approach. The use cases are almost impossible to enumerate.
Alternative data use cases for the public sector
Supporting law enforcement
Law enforcement is one of the primary areas where governments and other public sector entities can use alternative data and web scraping. There are a wide variety of crimes that can, intentionally or not, leave traces online.
Some infringements or crimes happen purely online. Copyright or intellectual property infringements, for example, are a common occurrence in cyberspace. These, however, are generally handled through civil cases and don’t involve governments.
Criminal offenses, such as the possession and distribution of illegal or harmful content, can be uncovered through the combined use of web scraping and artificial intelligence. Oxylabs have proven that to be the case through the GovTech challenge.
The Communications Regulatory Authority of the Republic of Lithuania (RRT) hosted a hackathon for the detection of images of child abuse within the Lithuanian IP space. Our team at Oxylabs went on to win the competition with an AI-powered solution that involved web scraping.
In simple terms, the application continually scans the internet, downloads images, and runs them through machine learning models; any potentially abusive content it finds is forwarded to the authorities. As of 2022, the tool has been fully integrated into the daily operations of RRT.
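The exact implementation isn't public, but the general shape of such a pipeline can be sketched as below. This is only an illustration of the scan-classify-report pattern, not the actual RRT tool; the classifier is a hypothetical stand-in for a trained model.

```python
# Simplified pipeline sketch (not the actual RRT/Oxylabs tool): fetch image
# URLs, score each image with a classifier, and collect high-confidence hits.
# `classify_image` is a hypothetical stand-in for a trained ML model.
from typing import Callable, Iterable

import requests


def scan_images(image_urls: Iterable[str],
                classify_image: Callable[[bytes], float],
                threshold: float = 0.9) -> list[str]:
    """Return URLs whose content the model scores above the reporting threshold."""
    flagged = []
    for url in image_urls:
        response = requests.get(url, timeout=10)
        if response.status_code != 200:
            continue  # skip unreachable images
        score = classify_image(response.content)
        if score >= threshold:  # only high-confidence hits are kept
            flagged.append(url)
    return flagged

# In a real deployment, flagged URLs would be forwarded to the relevant
# authority rather than simply returned to the caller.
```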
Tracking the shadow economy and money flows
But alternative data can serve the public sector in much simpler ways, without necessitating artificial intelligence or machine learning. A great example is tracking shadow economy trades to produce better estimates of their scale.
The shadow economy is often difficult to track because, by its nature, it is hidden from traditional statistics. Outside of falsified salaries, a significant portion of it thrives on undeclared income from secondary sources.
One of the most common ways to avoid taxes (or launder money) is through in-person sales of household goods. In such cases, the income isn't declared to tax authorities, bolstering the shadow economy.
Luckily, a lot of this activity now happens online through classified ad websites such as Craigslist or user-generated marketplaces like eBay, and any advertisement or offer includes either account or contact information.
As long as a government institution has access to methods for acquiring such data and matching it with internal records, any sales can be tracked on a large scale. As a result, alternative data can enrich tax collection practices.
Additionally, such calculations would help finance departments estimate the scale of the shadow economy more accurately. That, in turn, would make it possible to measure the effectiveness of new policies intended to curb the growth of undeclared income.
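As a purely hypothetical illustration of the matching step, the sketch below counts active listings per normalized contact number and flags those with no corresponding declared income. The field names and threshold are assumptions for illustration, not a description of any real system.

```python
# Hypothetical sketch: matching scraped classified-ad contacts against internal
# records to flag potentially undeclared sales income. Field names are assumed.
import re
from collections import Counter


def normalize_phone(raw: str) -> str:
    """Keep digits only so the same number matches across formatting styles."""
    return re.sub(r"\D", "", raw)


def flag_undeclared_sellers(ads: list[dict],
                            declared_numbers: set[str],
                            min_listings: int = 10) -> list[str]:
    """Return contact numbers with many active listings but no declared income."""
    counts = Counter(normalize_phone(ad["contact"]) for ad in ads)
    return [number for number, n in counts.items()
            if n >= min_listings and number not in declared_numbers]
```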
Conducting independent research
Finally, alternative data and web scraping could help governments enrich their traditional datasets on various topics. Similar attempts have already been made by researchers at the Billion Prices Project (BPP).
Researchers in the various projects associated with BPP employed web scraping, primarily as a way to measure inflation without having to rely on official statistics.
The approach turned out to be extremely effective in countries where such information was manipulated or suppressed, but it would be just as useful as an additional hedge against current ways of measuring inflation, such as the CPI.
The researchers scraped millions of products and their prices in order to track changes in valuations over time. With enough products, that should give a good estimate of changes in the value of money.
Comparisons can then be drawn to get a clearer picture of the changing economy. Web scraping is doubly helpful here: regular data collection methods usually come with significant lag, whereas scraped data arrives nearly instantaneously, which would make predictions significantly more accurate.
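As a rough illustration of the idea (a simplified stand-in, not the BPP methodology itself), a scraped-price index can be computed as the geometric mean of price relatives for products observed in both periods:

```python
# Rough sketch of a scraped-price inflation estimate: the geometric mean of
# price relatives for products present in both snapshots.
import math


def price_index(prices_then: dict[str, float], prices_now: dict[str, float]) -> float:
    """Return an index (1.0 = no change) over products present in both snapshots."""
    common = prices_then.keys() & prices_now.keys()
    if not common:
        raise ValueError("no overlapping products between the two snapshots")
    log_relatives = [math.log(prices_now[p] / prices_then[p]) for p in common]
    return math.exp(sum(log_relatives) / len(log_relatives))


# Example: a uniform 2% price rise across three tracked products.
then = {"milk": 1.00, "bread": 2.00, "eggs": 3.00}
now = {"milk": 1.02, "bread": 2.04, "eggs": 3.06}
print(f"Index: {price_index(then, now):.3f}")  # ~1.020
```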
Conclusion
Alternative data and web scraping can tremendously enhance current public sector practices. They are, however, still rarely used, most likely because they remain hard to access. Developers of solutions and datasets must now focus on ways to make web scraping more accessible, even to those without high-level technical expertise.