IWMW Mashups Round the Edges: Scraping Tables

Even though most people, most of the time, don’t set out to publish data in a semantically structured way on the web, it’s often possible to derive the semantics from the HTML elements used to represent the content within the page.

The most obvious example of this is the HTML table (see also: HTML Tables and the Data Web). Whilst often misused as a way of organising the visual layout of content on a page by designers who don’t yet understand the DOM or CSS (?!;-), tables are more appropriately used as containers for tabulated data.

They are also eminently amenable to screenscraping, if they are used properly…

So in this post I’ll show you three ways of scraping tabular data from a web page without having to leave the comfort of your own browser, just by using freely available online applications (although you may need to set up a user account with Yahoo and/or Google to play with them…):

– table scraping with YQL;
– table scraping with Yahoo Pipes;
– table scraping with Google Docs (in fact, with Google Spreadsheets).

To illustrate table scraping with each of these tools, we’ll use the page at http://www3.open.ac.uk/study/undergraduate/science/courses/index.htm:

If we look at the HTML source of the page, we see that it contains at least one table:
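(As a rough sketch of the sort of markup we’re dealing with, and assuming columns along the lines of the ones we’ll meet below (level, points, next start date, cost), the table looks something like the following; the cell contents here are purely illustrative rather than copied from the page:)

<table class="courses">
  <tr><th>Course</th><th>Level</th><th>Points</th><th>Next start</th><th>Cost</th></tr>
  <tr><td><a href="...">Course title</a></td><td>1</td><td>60</td><td>...</td><td>...</td></tr>
  ...
</table>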

So let’s see how we can scrape that information about courses out of the web page, so that we can republish it elsewhere.

Table Scraping With YQL

First up, YQL, the Yahoo Query Language. Using the html datatable, it’s easy enough to just grab the HTML table out of the page as is, to allow us to republish it elsewhere.

Inspection of the HTML source of the original page shows that the table containing the course information is identified as follows:

<table class="courses">

The following query trivially extracts the table from the original page:

select * from html where url="http://www3.open.ac.uk/study/undergraduate/science/courses/index.htm" and xpath='//table[@class="courses"]'

The HTML is directly available in the query results field:

This table could then be redisplayed elsewhere, or processed using a YQL Execute script.
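(If you want to run the query from outside the console, YQL queries can also be issued against the public REST endpoint, along the lines of the following, where ENCODED_QUERY is the URL-encoded version of the query above and format can be xml or json:)

http://query.yahooapis.com/v1/public/yql?q=ENCODED_QUERY&format=xml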

Table Scraping With Yahoo Pipes

Another simple screen scraping route can be achieved using Yahoo Pipes. In this example, we’ll start to explore how we can get a bit more of a handle on the information contained in each row, although as we’ll see, Pipes is not necessarily the ideal environment for doing this in this particular case.

Whilst Pipes started out as a drag and drop environment for processing RSS feeds, a Pipe can also import an HTML page at the start of the pipe and then process it in order to produce an RSS feed output.

To get the HTML page into the pipe, we use a Fetch Page block from the Sources menu:
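(For the record, the sort of configuration I’d expect to use here is something like the following; treat the cut points and the delimiter as placeholders to be checked against the actual page source:)

URL: http://www3.open.ac.uk/study/undergraduate/science/courses/index.htm
Cut content from: <table class="courses"> to: </table>
Split using delimiter: </tr>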

This gives us separate items within the pipe that correspond to each row in the first table in the page. Let’s just tidy up those elements a little bit using the regular expression block:

And a bit more tidying – filter out the first couple of rows:

We’re now going to create a proper RSS feed, with title, link and description elements, so start out by setting each of them to be a copy of the content attribute in the internal feed representation:

Now we use some more regular expressions to define those elements appropriately:

Note that we could further process the description element and maybe reintroduce semantics, for example by specifically and explicitly identifying the level, number of points, next start date, and cost.
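(As a sketch of how that might work, and assuming the description text for each course contains a phrase along the lines of “60 points”, we could use a Rename block to Copy As the description into a new points element, and then apply a rule in the Regex block something like the following; the pattern is illustrative rather than tested against the live feed:)

In item.points replace ^.*?(\d+)\s*points.*$ with $1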

The resulting RSS feed output from the pipe can now be used to syndicate details of the OU’s science courses, information that previously only appeared within the original web page; alternatively, the YQL trick shown above gives us a simple republication of the original table.

Now I know that the semantic web purists will already be up in arms about the way the semantics were decoupled from the data at the start of the pipe, and that we’re now considering arbitrarily reintroducing them in a completely cavalier way. But so what..? We have to be pragmatic. Yes, we’d need to keep checking that the fields hadn’t changed, but this is always a risk when we’re screen scraping.

But are there any better ways of scraping this data that maybe preserve the columnar ordering in a more semantically transparent way? Maybe…

Table Scraping With Google Spreadsheets

One of the common features of the online spreadsheet offerings from Google and Zoho, as well as the online database service DabbleDB, is the ability to import tabular data scraped from a web page directly into a spreadsheet, given the URL of the original web page.

In Google spreadsheets, this is as simple as using a formula along the lines of:

=ImportHtml("http://www3.open.ac.uk/study/undergraduate/science/courses/index.htm","table",1)

And the result?

What’s particularly handy about this representation is that we have access to the semantics of each cell value via the column headings.

We can see this by viewing the output of the CSV representation of the spreadsheet in a Yahoo Pipe:
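(The CSV version of a published Google spreadsheet can be pulled from a URL something like the following, where KEY is the spreadsheet key; the exact arguments may vary depending on how the spreadsheet has been published:)

http://spreadsheets.google.com/pub?key=KEY&output=csv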

(Note that we can’t actually rename these columns directly within the Pipe because the names aren’t well formed and Pipes can’t handle them. But there are workarounds – such as ignoring the first row and defining our own column names… though again, this breaks the direct semantic linkage between the original column name and the pipe’s internal element naming.)

One of the little-known features of Google spreadsheets (for the moment, at least) is an API that allows a user to construct queries on the data contained within the spreadsheet. In actual fact, there are two APIs: a GData spreadsheets API, and a ‘datasource’ API defined as a Google Query Language component of the Google Visualization API; it’s the latter, the Google Query Language service, that we’ll play with :-)

One of the easiest ways that I know of to get started with writing GQL queries (which can all be expressed as RESTful URI calls to a web service that exposes the query language interface to a particular document or data source) is to use the Datastore Explorer that I put together to help people write simple queries on spreadsheets collected together in the Guardian datastore.

The user interface is not the best designed interface you’ve ever seen, but when you get to grips with it, it can help you put together query URIs quite quickly. (If anyone is interested in working on an open source project with me to try and build a proper system, please drop me an email:-)

Go to the datastore explorer, paste in the URI of the spreadsheet you want to explore and click on Preview table headings. The key for the spreadsheet will be extracted and used to query the spreadsheet. Note that you may also need to enter the sheet number separately. (If there is a gid= argument in the URI, then that’s a big giveaway of the sheet number;-)

You can now start to build up a query to explore the data source, such as the following:

Preview URLs are also generated that will display the table as an HTML table, or as a CSV file.
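(Stripped back, those preview URIs are just calls to the Google Visualization API ‘tq’ endpoint, something along these lines, where KEY is the spreadsheet key and the tq argument is a URL-encoded query; the column letters in the example query are purely illustrative:)

http://spreadsheets.google.com/tq?tqx=out:html&tq=select%20A%2CB%2CC&key=KEY
http://spreadsheets.google.com/tq?tqx=out:csv&tq=select%20A%2CB%2CC&key=KEY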

So for example, here is the HTML preview table for the above query:

Using tools such as the datastore explorer, it’s possible to generate queries on Google spreadsheets that contain data that has been imported into the spreadsheet via a simple table screenscraping function.

What this means in practice is: if you’ve ever published tabular information in a simple HTML table, it’s easy enough for a third party to treat that table as a database table :-)

IWMW Mashups Round the Edges: YQL, Microformats and Structured Data

Whenever you publish HTML on the web, there is a possibility that someone, somewhere, will think that the information on your page is so interesting or valuable that they want to republish a particular part of it elsewhere.

So for example, if you publish a list of calendar events on a webpage, there may well be someone out there who sees the benefit of making that list more widely available, for example by including the dates in another, more comprehensive aggregating calendar. (A great example of this is Jon Udell’s Elm City project; in an academic context, see Jim Groom’s Aggregating Google Calendars.)

In the case of calendar dates, if you’re feeling helpful you can publish the calendar events using a syndication format, such as iCal. If you’re feeling unhelpful, you can write a mess of HTML with no particular structure, putting dates all over the web page in a variety of formats and using a variety of HTML markup. And somewhere in-between, you could publish the information in a semantically meaningful way, where the HTML structure can be used to identify the different components of an event record (event name, date, location, and so on).

Why would you do this? Well, if semantics are included in the page structure, CSS styling might be able to reveal that meaning in an appropriate presentational way, which makes life as a web page designer easier. And depending on how you semantically mark-up your web page, browser add-ons might be able to exploit that structure to provide additional user functionality, such as adding selected calendar dates on a web page to a personal calendar.

Semantics can be added to calendar information in a web page in an informal way as tabulated data, where separate columns in an HTML table might identify things like the event name, date, location, etc. Semantics can also be associated with each element in an event record using a standardised markup convention such as the hCalendar microformat.

A good example of microformats in action can be found on the University of Bath Semester Timetable:

Here’s how each cell (each ‘event’) is represented in the table using hCalendar:
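(I won’t copy the Bath markup verbatim here, but a typical hCalendar event is marked up along the following lines, with class names such as vevent, summary, dtstart and location carrying the semantics; the event details below are made up for the sake of the example:)

<td class="vevent">
  <span class="summary">Lecture: Introductory Chemistry</span>
  <abbr class="dtstart" title="2009-10-05T09:15">9.15am, 5 October 2009</abbr>
  <span class="location">Room 1.1</span>
</td>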

Because hCalendar is a recognised format (by some people at least!), several tools already exist for scraping it in an efficient way from a web page. For example, Brian Suda’s X2V service, “a BETA implementation of an XSLT file to transform an hCa* encoded XHTML file into the corresponding vCard/iCalendar file”.

Generating the iCal feed at the click of a button gives me something I can subscribe to in my desktop calendar:

And here it is:

Brian’s approach is based on the use of XSLT to extract the microformatted data from the page and represent it. Essentially, using microformats ‘in page’ allows pre-defined screenscraping utilities to effectively implement an API on top of the page that exposes particular data contained within it. The W3C GRDDL Recommendation generalises this sort of approach.
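(In its simplest form, a GRDDL-aware page just declares a profile in the document head and links to the transformation that can be used to extract the data, something like the following sketch; the stylesheet URL is a placeholder for whatever transform is appropriate:)

<head profile="http://www.w3.org/2003/g/data-view">
  <link rel="transformation" href="http://example.org/hcal2ical.xsl" />
</head>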

Another, more recent take on scraping conventionally (or consistently) marked up information from web pages is Yahoo’s YQL. For some time, Yahoo have been providing tools and utilities for scraping structured data from web pages so that it can be used to augment search results listings (Yahoo SearchMonkey), but YQL takes this a whole step further.

YQL offers a SQL-like query language that provides a search query console over the web, as well as over individual pages on the web. So for example, we can scrape all the microformatted entries from a web page using a query of the form:
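(PAGE_URL here stands in for the address of the timetable page, and the exact data table name is worth double checking in the YQL console, but something along these lines:)

select * from microformats where url="PAGE_URL"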

Here’s the result:

A RESTful URI can be constructed to run this query and return the results as XML or JSON, which can then be used elsewhere.
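(Again, this is just the public YQL endpoint with the query URL-encoded and a format argument; the JSON output, optionally with a callback parameter for JSONP use, looks something like:)

http://query.yahooapis.com/v1/public/yql?q=ENCODED_QUERY&format=json&callback=handler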

(Note that YQL can be used equally well to scrape more loosely structured pages – XPath and CSS selector statements can both be used in a YQL query to extract the part of the page you want to get hold of.)

As well as microformats, YQL also sees a wide range of other content as queryable “data tables on the web”, and provides a way for developers to define their own data table interfaces to their own web pages, or equally to pages on third party sites.

YQL Execute extends the power of YQL further by allowing the developer to run server-side JavaScript programs that will process the results obtained from a YQL query howsoever they want (With YQL Execute, the Internet becomes your database). But that’s for another time… ;-)