OUseful.Info, the blog…

Trying to find useful things to do with emerging technologies in open education

XML Data Scraping and Screen Scraping with YQL

with 3 comments

Although it seems as if the OU’s repository data is, henceforth, going to be made available as Linked Data, I thought I’d just post this quick demo I’ve had laying around for a week or two demonstrating how to use YQL to run queries over the XML data files published from the OU’s eprints server, as well as how to use it to scrape structured data from an HTML page (in this case, from Lanyrd).

Picking the “submissions by author” listing for the first author in the list (which happens to be James Aczel, not that you’d know it from the listing page as shown below…;-), we see a variety of ways of exporting the James’ publication list:

export formats from OU ORO/eprints

If we select the “EP3 XML” (ePrints v3?) option, this is what we get:

OU eprints XML

The URL for the XML export view, as for the HTML page is well structured around a unique identifier for James, although I can’t tell you what because the bit of the OU that “owns” the minting of the identifiers is very protective of them, how they’re used, and the extent to which they’re allowed to be made public… doh!

Anyway, given the URL for the XML version of someone’s publications, we can treat that document as a database using YQL.

So for example, here’s a query that will select the id and name of all the authors, orgainised by publication, on papers where James Aczel was an author:

select eprint.id, eprint.creators from xml where url='http://oro.open.ac.uk/cgi/exportview/person/jca25/XML/jca25.xml'

(I wrote this query because I wanted to start exploring co-author neworks around particular authors…)

Here it is in the YQL Developer console:

Querying eprints xml from Yql developer console

And by extension, this query gets the authors and paper titles for papers where Jame has some creator attribution.

If we parameterise the query (as described here: YQL Web Service URLs), we can generate a “nice” parameterised URL for the query that returns the YQL XML query response.

YQL query alias

So for example, here’s response data for the user with ID bkh2:

http://query.yahooapis.com/v1/public/yql/psychemedia/test2?url=http://oro.open.ac.uk/cgi/exportview/person/bkh2/XML/bkh2.xml

(I don’t know how to construct the URL within YQL from just the ID? If you know, please post the answer in a comment…;-)[See @hapdaniel's comment below :-)]

You may have noticed that YQL uses a dot notation in the construction of the SELECT component of the query to identify the data fields you want returning from the query. A rather different approach can be taken when using YQL to screenscrape data from an HTML page, in particular by using an XPATH expression to direct the query to the parent HTML element you want returning. So for example, we can scrape the list of attendees at an event as listed on Lanyrd, such as today’s #RSWebSci event:

lanyrd HTML

Here’s the YQL to scrape that list, using an XPATH expression to target in on the appropriate HTML container element:

select href from html where url='http://lanyrd.com/2010/rswebsci/' and xpath='//div[@class="attendees-placeholder placeholder"]/ul[@class="user-list"]/li/a'

(I think an exact match is being run in this expression on the class attributes?)

Here’s the result:

Screenscrpaing attendees from a Lanyrd event listing

You’ll notice the scraped data is a list of paths off the Lanyrd domain to personal profile pages. So given these users’ URLs, we can scrape their Lanyrd profile pages to find their their Twitter IDs:

select href from html where url='http://lanyrd.com/people/nigel_shadbolt/' and xpath='//a[@class="icon twitter url nickname"]'

HTML Screenscrpae with YQL

If we’re thinking linked data (little ‘l’, little ‘d’;-), we might them use these Twitter URLs to see if we can find other webpages for the same person using Google’s otherme service:
http://socialgraph.apis.google.com/otherme?q=http://twitter.com/timbernerslee

PS just by the by, during the course of the #rswebsci event, I automatically generated a custom search engine around the #rswebsci hashtaggers websites and a quick viz of the structure of the hashtaggers’ network.

rswebsci hashtag community

With a bit of luck, I’ll get round to posting the code for grabbing this data from the Twitter API in the next day or two…

Written by Tony Hirst

September 29, 2010 at 12:28 pm

Posted in Tinkering

Tagged with ,

3 Responses

Subscribe to comments with RSS.

  1. Is this of any help?
    select eprint.id,eprint.creators from xml where url in(select url from uritemplate where template=’http://oro.open.ac.uk/cgi/exportview/person/{ID}/XML/{ID}.xml’ and ID=’bkh2′)

    hapdaniel

    September 30, 2010 at 7:36 am

    • Hi Paul
      It very much is :-) So I’m guessing for a short URL, we could end it “and ID=@id”?
      thanks
      tony

      Tony Hirst

      September 30, 2010 at 11:34 am

  2. Yes.
    http://query.yahooapis.com/v1/public/yql/hapdaniel/idtest2?id=bkh2

    I’ve noticed that the query can sometimes time-out. I don’t know if that’s the case for your original query.

    hapdaniel

    September 30, 2010 at 12:12 pm


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

Join 134 other followers