XML Data Scraping and Screen Scraping with YQL

Although it seems as if the OU’s repository data is, henceforth, going to be made available as Linked Data, I thought I’d post this quick demo that I’ve had lying around for a week or two. It shows how to use YQL to run queries over the XML data files published from the OU’s eprints server, and how to use it to scrape structured data from an HTML page (in this case, from Lanyrd).

Picking the “submissions by author” listing for the first author in the list (which happens to be James Aczel, not that you’d know it from the listing page as shown below…;-), we see a variety of ways of exporting James’ publication list:

export formats from OU ORO/eprints

If we select the “EP3 XML” (ePrints v3?) option, this is what we get:

OU eprints XML

The URL for the XML export view, as for the HTML page, is well structured around a unique identifier for James, although I can’t tell you what it is because the bit of the OU that “owns” the minting of the identifiers is very protective of them, how they’re used, and the extent to which they’re allowed to be made public… doh!

Anyway, given the URL for the XML version of someone’s publications, we can treat that document as a database using YQL.

So for example, here’s a query that will select the id and name of all the authors, organised by publication, on papers where James Aczel was an author:

select eprint.id, eprint.creators from xml where url='http://oro.open.ac.uk/cgi/exportview/person/jca25/XML/jca25.xml'

(I wrote this query because I wanted to start exploring co-author networks around particular authors…)
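For comparison, here is a minimal Python sketch of the same "id plus creators, grouped by publication" selection, done locally with the standard library rather than through YQL. The sample document is a made-up, heavily trimmed stand-in for the EP3 XML export; the element names, the `id` attribute and the second author are assumptions for illustration, and the real export may differ (it may also use an XML namespace, which this sketch ignores).

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for an EP3 XML export. Element names, ids and the
# co-author "Ann Other" are invented for illustration only.
SAMPLE = """<eprints>
  <eprint id="http://example.org/id/eprint/1">
    <title>First paper</title>
    <creators>
      <item><name><family>Aczel</family><given>James</given></name></item>
      <item><name><family>Other</family><given>Ann</given></name></item>
    </creators>
  </eprint>
  <eprint id="http://example.org/id/eprint/2">
    <title>Second paper</title>
    <creators>
      <item><name><family>Aczel</family><given>James</given></name></item>
    </creators>
  </eprint>
</eprints>"""


def authors_by_publication(xml_text):
    """Map each eprint id to the list of creator names on that record."""
    root = ET.fromstring(xml_text)
    result = {}
    for eprint in root.findall("eprint"):
        names = []
        for name in eprint.findall("creators/item/name"):
            given = name.findtext("given", default="")
            family = name.findtext("family", default="")
            names.append(f"{given} {family}".strip())
        result[eprint.get("id")] = names
    return result


coauthors = authors_by_publication(SAMPLE)
for eprint_id, names in coauthors.items():
    print(eprint_id, names)
```

Keeping the names grouped under each publication id is what makes it easy to turn the result into co-author pairs later on.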

Here it is in the YQL Developer console:

Querying eprints XML from the YQL developer console

And by extension, this query gets the authors and paper titles for papers where James has some creator attribution.

If we parameterise the query (as described here: YQL Web Service URLs), we can generate a “nice” parameterised URL for the query that returns the YQL XML query response.
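A query alias is set up inside YQL itself, but the underlying idea, a URL that carries the query, can be sketched in Python. This is a rough illustration only: it builds a plain (non-aliased) REST URL by URL-encoding the query string, and the endpoint shown is the public YQL endpoint as I understand it to have been, which is an assumption and may no longer be live. The helper name `yql_url` is made up.

```python
from urllib.parse import urlencode

# Assumed public YQL endpoint, shown for illustration; it may not be live.
YQL_ENDPOINT = "http://query.yahooapis.com/v1/public/yql"


def yql_url(query, fmt="xml"):
    """Return a GET URL carrying the YQL statement in the q parameter."""
    return YQL_ENDPOINT + "?" + urlencode({"q": query, "format": fmt})


query = (
    "select eprint.id, eprint.creators from xml "
    "where url='http://oro.open.ac.uk/cgi/exportview/person/jca25/XML/jca25.xml'"
)
url = yql_url(query)
print(url)
```

`urlencode` takes care of escaping the spaces, quotes and slashes in the statement, which is the fiddly part of building such URLs by hand.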

YQL query alias

So for example, here’s response data for the user with ID bkh2:


(I don’t know how to construct the URL within YQL from just the ID? If you know, please post the answer in a comment…;-)[See @hapdaniel’s comment below :-)]
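Outside YQL, constructing the export URL from just the ID is a simple template substitution. A minimal Python sketch, assuming the URL pattern seen in the query above holds for every ID (the function name `export_url` is made up for illustration):

```python
from urllib.parse import quote

# URL pattern inferred from the example export URL; assumed to be general.
TEMPLATE = "http://oro.open.ac.uk/cgi/exportview/person/{ID}/XML/{ID}.xml"


def export_url(person_id):
    """Fill both {ID} slots in the template with the escaped identifier."""
    return TEMPLATE.replace("{ID}", quote(person_id, safe=""))


url = export_url("bkh2")
print(url)
```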

You may have noticed that YQL uses a dot notation in the construction of the SELECT component of the query to identify the data fields you want returning from the query. A rather different approach can be taken when using YQL to screenscrape data from an HTML page, in particular by using an XPath expression to direct the query to the parent HTML element you want returning. So for example, we can scrape the list of attendees at an event as listed on Lanyrd, such as today’s #RSWebSci event:

lanyrd HTML

Here’s the YQL to scrape that list, using an XPath expression to home in on the appropriate HTML container element:

select href from html where url='http://lanyrd.com/2010/rswebsci/' and xpath='//div[@class="attendees-placeholder placeholder"]/ul[@class="user-list"]/li/a'

(I think an exact match is being run in this expression on the class attributes?)
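In standard XPath 1.0, a test like @class="..." compares the whole attribute string, so it only matches when the class attribute is exactly that text; whether YQL follows the standard here is my assumption. The sketch below shows that behaviour using Python's built-in ElementTree, which supports a limited XPath subset, on a made-up, well-formed fragment that imitates the attendee markup (the second user and the second div are invented).

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the attendee list markup.
PAGE = """<html><body>
<div class="attendees-placeholder placeholder">
  <ul class="user-list">
    <li><a href="/people/nigel_shadbolt/">Nigel Shadbolt</a></li>
    <li><a href="/people/example_user/">Example User</a></li>
  </ul>
</div>
<div class="placeholder">
  <ul class="user-list">
    <li><a href="/people/not_matched/">Not matched</a></li>
  </ul>
</div>
</body></html>"""

# Same shape as the YQL xpath, written relative to the root element.
PATH = (
    ".//div[@class='attendees-placeholder placeholder']"
    "/ul[@class='user-list']/li/a"
)

root = ET.fromstring(PAGE)
hrefs = [a.get("href") for a in root.findall(PATH)]
print(hrefs)
```

The div whose class is just "placeholder" is skipped, because the predicate compares the full attribute value rather than testing for one class among several. Real-world HTML is rarely well-formed XML, so a forgiving HTML parser would be needed for live pages.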

Here’s the result:

Screenscraping attendees from a Lanyrd event listing

You’ll notice the scraped data is a list of paths off the Lanyrd domain to personal profile pages. So given these users’ URLs, we can scrape their Lanyrd profile pages to find their Twitter IDs:

select href from html where url='http://lanyrd.com/people/nigel_shadbolt/' and xpath='//a[@class="icon twitter url nickname"]'

HTML Screenscrape with YQL
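Chaining the two scrapes needs a little glue: the attendee paths have to be resolved against the Lanyrd domain before they can be used as the url in the second query, and the Twitter ID has to be pulled out of the scraped href. A minimal Python sketch, assuming the Twitter link has the form http://twitter.com/<name> (the function names and the example Twitter URL are made up):

```python
from urllib.parse import urljoin, urlparse

BASE = "http://lanyrd.com/"


def profile_url(path):
    """Resolve a scraped path such as /people/<name>/ against the site root."""
    return urljoin(BASE, path)


def twitter_name(href):
    """Take the first path segment of a Twitter profile URL as the ID."""
    return urlparse(href).path.strip("/").split("/")[0]


print(profile_url("/people/nigel_shadbolt/"))
print(twitter_name("http://twitter.com/example_user"))
```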

If we’re thinking linked data (little ‘l’, little ‘d’;-), we might then use these Twitter URLs to see if we can find other webpages for the same person using Google’s otherme service:

PS just by the by, during the course of the #rswebsci event, I automatically generated a custom search engine around the #rswebsci hashtaggers’ websites and a quick viz of the structure of the hashtaggers’ network.

rswebsci hashtag community

With a bit of luck, I’ll get round to posting the code for grabbing this data from the Twitter API in the next day or two…

Author: Tony Hirst

I'm a Senior Lecturer at The Open University, with an interest in #opendata policy and practice, as well as general web tinkering...

3 thoughts on “XML Data Scraping and Screen Scraping with YQL”

  1. Is this of any help?
    select eprint.id, eprint.creators from xml where url in (select url from uritemplate where template='http://oro.open.ac.uk/cgi/exportview/person/{ID}/XML/{ID}.xml' and ID='bkh2')
