
Web data extraction: Custom, commercial offerings ease the task

The World Wide Web is a vast source of information on any conceivable topic. Imagine you could access this information with the same ease as you access structured data in a database, using a SQL-like query language: "Search the WWW for all used car sales in Australia from 1990 to 2010, and calculate the total sales volume, grouped by year, make and model."

Armed with a web-scraping tool like that, you could automate many consumer and business reports that currently require massive manual effort and access the web as if it were one enormous database. For example:

  • industry analysis reports, e.g., trends in the used car market in Australia;

  • aggregated ratings for restaurants or movies, based on reviews and social media mentions;

  • metadata about TV shows, including broadcast channel and time, program duration and rich media about the program; and

  • a comparison-shopping catalog of all online retailers who sell any given product.

Easing web data extraction

Why don't we have such a query language? After all, Google does a pretty good job of finding most of the relevant pages. The real problem is extracting structured data from the largely unstructured mess that is the web. Going back to our example of used car sales in Australia, some of the data exists in tables in Adobe Acrobat PDF files, some data is found in press releases, and still other data may exist in footnotes within articles published by the Australian Bureau of Statistics. To truly turn the web into a database requires the ability to automatically identify and extract structured key-value data from a mass of unstructured textual data stored in a variety of formats.


Both custom solutions and commercial products can facilitate web data extraction with a minimum of effort. For example, import.io provides a point-and-click interface that enables the user to teach the system how to extract data for a given website, enhanced with machine learning to infer extraction patterns for new sites based on knowledge learned from other sites. My company, Ness Digital Engineering, has also worked with clients to develop custom solutions designed to address a particular set of needs.

Guidelines for web data extraction tools

Whichever solution best fits the use case, there are some important guidelines to consider during implementation:

  1. Web crawling must be scheduled very carefully, so as not to harm the responsiveness of the target site. Crawl too many pages in too few seconds, and you will be identified as a denial-of-service attacker and blocked from the site; a throttling sketch appears after this list.

  2. Sites serve different information in different regions, so you may need to use proxies to see all versions of a given site; a proxy sketch appears after this list. For some sites, even a proxy is not enough, and you may need to set up a machine physically located in the desired region.

  3. Some pages download all the data to be scraped along with the initial HTML page, other sites download JavaScript that then executes in the browser to retrieve the desired data via Ajax, and some sites only download needed data in response to user actions, e.g., mouse clicks or scrolling right or down. Be prepared to handle all these cases, and only scrape data after it has entirely arrived in the client browser; a browser-automation sketch appears after this list.

  4. Save the raw page content after extracting data from it. That way, you can go back and determine why a given value was extracted, even if the publisher has subsequently modified the page.

  5. Initially, the system needs to be trained by a human being, who points out content to be extracted from the webpage. Make this system as easy to use as possible, and assume users do not understand HTML or anything that goes on under the hood in a webpage.

  6. The system should learn from user actions over time, so that it begins to recognize textual patterns that contain useful data when it is introduced to new websites. It's this machine learning component that enables a well-trained web data extraction system to approach the goal of fully automated web extraction, even for webpages it has never seen before.

  7. When extracting data, you will inevitably find multiple sources for the same data item, e.g., two sources that describe the same TV show. The data must be deduplicated by finding matching records, which are then merged into a master record for the given data item. The merge must handle conflicting data by giving precedence to more authoritative websites; a merge sketch appears after this list.
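
The sketches below illustrate a few of these guidelines in Python; every URL, selector, name and parameter in them is a placeholder for illustration, not part of any particular product.

For guideline 1, a minimal throttling sketch, assuming the requests library and a hypothetical fetch_politely helper, that spaces out fetches per domain and honors robots.txt:

    import time
    import urllib.robotparser
    from urllib.parse import urlparse

    import requests

    CRAWL_DELAY_SECONDS = 5      # assumed delay; tune per target site
    _last_request_time = {}      # per-domain timestamp of the last fetch
    _robots_cache = {}           # per-domain robots.txt parser

    def fetch_politely(url, user_agent="example-crawler/0.1"):
        domain = urlparse(url).netloc

        # Honor robots.txt before fetching anything else from the domain.
        if domain not in _robots_cache:
            rp = urllib.robotparser.RobotFileParser()
            rp.set_url(f"https://{domain}/robots.txt")
            rp.read()
            _robots_cache[domain] = rp
        if not _robots_cache[domain].can_fetch(user_agent, url):
            return None    # the site asks crawlers to stay away from this URL

        # Throttle: wait until the per-domain delay has elapsed.
        elapsed = time.time() - _last_request_time.get(domain, 0)
        if elapsed < CRAWL_DELAY_SECONDS:
            time.sleep(CRAWL_DELAY_SECONDS - elapsed)

        response = requests.get(url, headers={"User-Agent": user_agent}, timeout=30)
        _last_request_time[domain] = time.time()
        return response.text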
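
For guideline 2, a sketch of fetching the same page through region-specific proxies; the proxy endpoints are placeholders for proxies you would provision in each region:

    import requests

    # Hypothetical region-to-proxy mapping.
    REGIONAL_PROXIES = {
        "au": "http://proxy-au.example.com:8080",
        "us": "http://proxy-us.example.com:8080",
    }

    def fetch_from_region(url, region):
        proxy = REGIONAL_PROXIES[region]
        # Route both HTTP and HTTPS traffic through the regional proxy.
        return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=30)

    # Example: compare the Australian and US versions of one listing page.
    au_html = fetch_from_region("https://example.com/listings", "au").text
    us_html = fetch_from_region("https://example.com/listings", "us").text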
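
For guideline 3, a sketch using Selenium that waits for JavaScript-rendered content to arrive before extracting it; the URL and the .listing-row selector are hypothetical stand-ins for the page and elements you actually care about:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    driver = webdriver.Chrome()
    try:
        driver.get("https://example.com/used-cars")

        # Block until at least one data row has been rendered by the page's
        # JavaScript, rather than scraping the initial (empty) HTML shell.
        WebDriverWait(driver, timeout=20).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, ".listing-row"))
        )

        raw_html = driver.page_source            # keep the raw page (guideline 4)
        rows = driver.find_elements(By.CSS_SELECTOR, ".listing-row")
        data = [row.text for row in rows]
    finally:
        driver.quit()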
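
For guideline 7, a minimal deduplication-and-merge sketch; the natural key (title plus air date), the field names and the source-authority ranking are assumptions made for illustration:

    from collections import defaultdict

    # Higher number = more authoritative source (assumed ranking).
    SOURCE_AUTHORITY = {
        "broadcaster.example.com": 3,
        "tv-guide.example.com": 2,
        "fan-wiki.example.com": 1,
    }

    def merge_records(records):
        """Merge duplicate records into one master record per data item."""
        groups = defaultdict(list)
        for rec in records:
            # Records that agree on the natural key are treated as duplicates.
            key = (rec["title"].strip().lower(), rec["air_date"])
            groups[key].append(rec)

        masters = []
        for recs in groups.values():
            # Sort least authoritative first, so the most authoritative
            # source's fields win whenever two records conflict.
            recs.sort(key=lambda r: SOURCE_AUTHORITY.get(r["source"], 0))
            master = {}
            for rec in recs:
                for field, value in rec.items():
                    if value not in (None, ""):
                        master[field] = value
            masters.append(master)
        return masters

Sorting by authority and overwriting field by field keeps non-conflicting details from every source while letting the most trusted site decide whenever two sources disagree.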

Whether you use a commercial product or a custom system, it is possible today to extract information from the web that drives your business. It's not quite as easy as fetching data from a relational database, but thanks to machine learning, it is getting easier all the time.

