IWMW Mashups Round the Edges: Scraping Tables

Even though most people, most of the time, don’t set out to publish data in a semantically structured way on the web, it’s often possible to drive the semantics from the HTML elements used to represent the content within the page.

The most obvious example of this is the HTML table (see also: HTML Tables and the Data Web). Whilst often misused as a way of organising the visual layout of content on a page by designers who don’t yet understand the DOM or CSS (?!;-), tables are more appropriately used as containers for tabulated data.

They are also eminently amenable to screenscraping, if they are used properly…

So in this post I’ll show you three ways of scraping tabular data from a web page without having to leave the comfort of your own browser, just by using freely available online applications (although you may need to set up a user account with Yahoo and/or Google to play with them…):

– table scraping with YQL;
– table scraping with Yahoo pipes;
– table scraping with Google docs (in fact, with Google spreadsheets).

To illustrate table scraping with each of these tools, we’ll use the page at http://www3.open.ac.uk/study/undergraduate/science/courses/index.htm:

If we look at the HTML source of the page, we see that it contains at least one table:

So let’s see how we can scrape that information about courses out of the web page, so that we can republish it elsewhere.

Table Scraping With YQL

First up, YQL, the Yahoo Query Language. Using the html datable, it’s easy enough to just grab the HTML table out of the page as is, to allow us tho republish it elsewhere.

Inspection of the HTML source of the original page shows that the table containing the course information is identified as follows:

<table class=”courses”>

The following query trivially extracts the table from the original page:

select * from html where url=”http://www3.open.ac.uk/study/undergraduate/science/courses/index.htm&#8221; and xpath=’//table[@class=”courses”]’

The HTML is directly available in the query results field:

This table could then be redisplayed elsewhere, or processed using a YQL Execute script.

Table Scraping With Yahoo Pipes

Another simple screen scraping route can be achieved using Yahoo Pipes. In this example, we’ll start to explore how we can get a bit more of a handle on the information contained in each row, although as we’ll see, Pipes is not necessarily the ideal environment for doing this in this particular case.

Whilst Pipes started out as a drag and drop environment for processing RSS feeds, a Pipe can also import an HTML page at the start of the pipe and then process it in order to produce an RSS feed output.

To get the HTML page into the pipe, we use a Fetch Page block from the Sources menu:

This gives us separate items within the pipe that correspond to each row in the first table in the page. Let’s just tidy up those elements a little bit using the regular expression block:

And a bit more tidying – filter out the first couple of rows:

We’re now going to create a proper RSS feed, with title, link and description elements, so start out by setting each of them to b a copy of the content attribute in the internal feed representation:

Now we use some more regular expressions to define those elements appropriately:

Note that we could further process the description element and maybe reintroduce semantics, for example by specifically and explicitly identifying the level, number of points, next start date, and cost.

The resulting RSS feed output from the pipe can now be used to syndicate details of the OU’s science courses, information that was originally locked down to only appear in the original web page, or, by using the YQL trick shown above, as a simple republication of the original table.

Now I know that the semantic web purists will already be up in arms about the way the semantics were decoupled from the data at the start of the pipe, and that we’re now considering arbitrarily reintroducing them in a completely cavalier way. But so what..? We have to be pragmatic. Yes, we’d need to keep checking that the fields hadn’t changed, but this is always a risk when we’re screen scraping.

But are there any better ways of scraping this data that maybe preserve the columnar ordering in a more semantically transparent way? Maybe…

Table Scraping With Google Spreadsheets

One of the common features of the online spreadsheet offerings from Google and Zoho, as well as online database service DabbleDB is the ability to import tabular data scraped from webpage directly into a spreadsheet, given the URL of the original web page.

In Google spreadsheets, this is as simple of using a formula along the lines of:

=ImportHtml(“http://www3.open.ac.uk/study/undergraduate/science/courses/index.htm&#8221;,”table”,1)

And the result?

What’s particularly handy about this representation is that we have access to the semantics of each cell value via the column headings.

We can see this by viewing the output of the CSV representation of the spreadsheet in a Yahoo Pipe:

(Note that we can’t actually rename these columns directly within the Pipe because the names aren’t well formed and Pipes can’t handle them. But there are workarounds – such as ignoring the first row and defining our own column names… though again, this breaks the direct semantic linkage between the original column name and the pipe’s internal element naming.)

One of the little known features of Google spreadsheets (for the moment, at least), is an API that allows a user to construct queries on the data contained within the spreadsheet. In actual fact, there are two APIs; a GData spreadsheets API, and a ‘datasource’ API defined as a Google Query Language component of the Google Visualization API; it’s the latter, the Google Query Language service we’ll play with :-)

One of the easiest ways that I know of to get started with writing GQL queries (which can all be expressed as RESTful URI calls to a web service that exposes the query language interface to a particular document or data source is to use the Datastore Explorer that I put together to help people write simple queries on spreadsheets collected together in the Guardian datastore.

The user interface is not the best designed interface you’ve ever seen, but when you get to grips with is it can help you put together query URIs quite quickly. (If anyone is interested in working on an open source project with me to try and build a proper system, please drop me an email:-)

Go to the datastore explorer, paste in the URI of the spreadsheet you want to explore and click on Preview table headings. The key for the spreadsheet will be extracted and used to query the spreadsheet. Note that you may also need to enter the sheet number separately. (If there is a gid= argument in the URI, then that’s a big giveaway of the sheet number;-)

You can now start to build up a query to explore the data source, such as the following:

Preview URLs are also generated that will display the table as an HTML table , or as a CSV file.

So for example, here is the HTML preview table for the above query:

Using tools such as the datasource explorer, it’s possible to generate queries on Google spreadsheets that contain data that has been imported into the spreadsheet via a simple table screenscraping function.

What this means in practice is: if you’ve ever published tabular information in simple HTML table, it’s easy enough for a third party to treat that table as a database table :-)

5 comments

  1. Carl-Johan Sveningsson

    Wow, I just have to say how much I love your blog, it is beautiful! Anyway, I am trying to do a small scrape, but one record is distributed over four tr’s otherwise not distinguished from each other. Any idea about the best way to go about that?

    http://twitpic.com/c46xp

    Really would appreciate a comment to this, I’m pretty stuck…

    Best regards
    CJ

  2. RandiR

    Very useful tutorial. When the html is not as-well formed, you can use a script like

    http://www.biterscripting.com/helppages/SS_WebPageToCSV.html

    It extracts an html table from a web page into a CSV (Command Separated Values) format. I use it to scrape stock data. Even when a website needs login/password, you can use the IS (Internet Session) commands to conduct a detailed dialog with the web server automatically.

    I have bunch of such scripts on my desktop. I start them by double clicking an icon. Within a few minutes, all the results are on my computer, neatly stored in files that way I needed.

    Thanks again.