Although it seems as if the OU’s repository data is, henceforth, going to be made available as Linked Data, I thought I’d just post this quick demo I’ve had laying around for a week or two demonstrating how to use YQL to run queries over the XML data files published from the OU’s eprints server, as well as how to use it to scrape structured data from an HTML page (in this case, from Lanyrd).
Picking the “submissions by author” listing for the first author in the list (which happens to be James Aczel, not that you’d know it from the listing page as shown below…;-), we see a variety of ways of exporting the James’ publication list:
If we select the “EP3 XML” (ePrints v3?) option, this is what we get:
The URL for the XML export view, as for the HTML page is well structured around a unique identifier for James, although I can’t tell you what because the bit of the OU that “owns” the minting of the identifiers is very protective of them, how they’re used, and the extent to which they’re allowed to be made public… doh!
Anyway, given the URL for the XML version of someone’s publications, we can treat that document as a database using YQL.
So for example, here’s a query that will select the id and name of all the authors, orgainised by publication, on papers where James Aczel was an author:
select eprint.id, eprint.creators from xml where url='http://oro.open.ac.uk/cgi/exportview/person/jca25/XML/jca25.xml'
(I wrote this query because I wanted to start exploring co-author neworks around particular authors…)
Here it is in the YQL Developer console:
And by extension, this query gets the authors and paper titles for papers where Jame has some creator attribution.
If we parameterise the query (as described here: YQL Web Service URLs), we can generate a “nice” parameterised URL for the query that returns the YQL XML query response.
So for example, here’s response data for the user with ID bkh2:
(I don’t know how to construct the URL within YQL from just the ID? If you know, please post the answer in a comment…;-)[See @hapdaniel’s comment below :-)]
You may have noticed that YQL uses a dot notation in the construction of the SELECT component of the query to identify the data fields you want returning from the query. A rather different approach can be taken when using YQL to screenscrape data from an HTML page, in particular by using an XPATH expression to direct the query to the parent HTML element you want returning. So for example, we can scrape the list of attendees at an event as listed on Lanyrd, such as today’s #RSWebSci event:
Here’s the YQL to scrape that list, using an XPATH expression to target in on the appropriate HTML container element:
select href from html where url='http://lanyrd.com/2010/rswebsci/' and xpath='//div[@class="attendees-placeholder placeholder"]/ul[@class="user-list"]/li/a'
(I think an exact match is being run in this expression on the class attributes?)
Here’s the result:
You’ll notice the scraped data is a list of paths off the Lanyrd domain to personal profile pages. So given these users’ URLs, we can scrape their Lanyrd profile pages to find their their Twitter IDs:
select href from html where url='http://lanyrd.com/people/nigel_shadbolt/' and xpath='//a[@class="icon twitter url nickname"]'
If we’re thinking linked data (little ‘l’, little ‘d’;-), we might them use these Twitter URLs to see if we can find other webpages for the same person using Google’s otherme service:
PS just by the by, during the course of the #rswebsci event, I automatically generated a custom search engine around the #rswebsci hashtaggers websites and a quick viz of the structure of the hashtaggers’ network.
With a bit of luck, I’ll get round to posting the code for grabbing this data from the Twitter API in the next day or two…