XML Data Scraping and Screen Scraping with YQL

Although it seems that the OU’s repository data is going to be made available as Linked Data from now on, I thought I’d post this quick demo, which I’ve had lying around for a week or two. It shows how to use YQL to run queries over the XML data files published from the OU’s eprints server, and how to use it to scrape structured data from an HTML page (in this case, from Lanyrd).

Picking the “submissions by author” listing for the first author in the list (which happens to be James Aczel, not that you’d know it from the listing page as shown below…;-), we see a variety of ways of exporting James’ publication list:

export formats from OU ORO/eprints

If we select the “EP3 XML” (ePrints v3?) option, this is what we get:

OU eprints XML

The URL for the XML export view, like the one for the HTML page, is well structured around a unique identifier for James, although I can’t tell you what it is because the bit of the OU that “owns” the minting of the identifiers is very protective of them, how they’re used, and the extent to which they’re allowed to be made public… doh!

Anyway, given the URL for the XML version of someone’s publications, we can treat that document as a database using YQL.

So for example, here’s a query that will select the id and name of all the authors, organised by publication, on papers where James Aczel was an author:

select eprint.id, eprint.creators from xml where url='http://oro.open.ac.uk/cgi/exportview/person/jca25/XML/jca25.xml'

(I wrote this query because I wanted to start exploring co-author networks around particular authors…)

Here it is in the YQL Developer console:

Querying eprints XML from the YQL developer console
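As a rough sketch of where the co-author idea might go once the creators for each eprint have been pulled out, the Python below counts how often pairs of authors share a paper. The sample XML is a made-up, much simplified stand-in for the EP3 export (no namespaces, invented records and co-author names), and the function names are hypothetical.

```python
import itertools
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for an EP3 XML export.
# The real export is richer (and may use namespaces); these records are invented.
SAMPLE = """
<eprints>
  <eprint id="1">
    <title>Paper one</title>
    <creators>
      <item><name><family>Aczel</family><given>James</given></name></item>
      <item><name><family>Smith</family><given>Ann</given></name></item>
    </creators>
  </eprint>
  <eprint id="2">
    <title>Paper two</title>
    <creators>
      <item><name><family>Aczel</family><given>James</given></name></item>
      <item><name><family>Smith</family><given>Ann</given></name></item>
      <item><name><family>Jones</family><given>Bob</given></name></item>
    </creators>
  </eprint>
</eprints>
"""


def creators_by_eprint(xml_text):
    """Map each eprint id to its list of 'Given Family' creator names."""
    root = ET.fromstring(xml_text)
    result = {}
    for eprint in root.findall("eprint"):
        names = []
        for name in eprint.findall("creators/item/name"):
            given = name.findtext("given", default="")
            family = name.findtext("family", default="")
            names.append(f"{given} {family}".strip())
        result[eprint.get("id")] = names
    return result


def coauthor_edges(creators):
    """Count how often each unordered pair of authors appears on the same paper."""
    counts = {}
    for names in creators.values():
        for pair in itertools.combinations(sorted(set(names)), 2):
            counts[pair] = counts.get(pair, 0) + 1
    return counts
```

Calling coauthor_edges(creators_by_eprint(SAMPLE)) gives a dictionary keyed by author pairs, which is the sort of edge list a network visualisation tool could take as input.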

And by extension, this query gets the authors and paper titles for papers where James has some creator attribution.

If we parameterise the query (as described here: YQL Web Service URLs), we can generate a “nice” parameterised URL for the query that returns the YQL XML query response.

YQL query alias

So for example, here’s response data for the user with ID bkh2:


(I don’t know how to construct the URL within YQL from just the ID? If you know, please post the answer in a comment…;-)[See @hapdaniel’s comment below :-)]
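Outside YQL, building the URL from just the ID is straightforward. The sketch below assumes that every person’s export URL follows the same pattern as the one example URL used in the query earlier (that is a guess from a single example), and that the query is sent to the public YQL REST endpoint in the form it was commonly documented. It only builds strings and makes no network calls; the function names are hypothetical.

```python
from urllib.parse import urlencode

# Assumption: all export URLs follow the pattern of the single example above.
EXPORT_PATTERN = "http://oro.open.ac.uk/cgi/exportview/person/{pid}/XML/{pid}.xml"

# Assumption: the public YQL REST endpoint as it was commonly documented.
YQL_ENDPOINT = "http://query.yahooapis.com/v1/public/yql"


def export_url(person_id):
    """Build the XML export URL for a person ID such as 'jca25'."""
    return EXPORT_PATTERN.format(pid=person_id)


def yql_query(person_id):
    """Build the YQL statement that selects eprint ids and creators."""
    return (
        "select eprint.id, eprint.creators from xml "
        f"where url='{export_url(person_id)}'"
    )


def yql_request_url(person_id):
    """Build a URL-encoded request for the query, asking for an XML response."""
    params = {"q": yql_query(person_id), "format": "xml"}
    return YQL_ENDPOINT + "?" + urlencode(params)
```

For the ID jca25, export_url reproduces the URL used in the original query, and yql_request_url("bkh2") builds the corresponding request for the second example ID.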

You may have noticed that YQL uses a dot notation in the SELECT component of the query to identify the data fields you want returned. A rather different approach can be taken when using YQL to screenscrape data from an HTML page, in particular by using an XPath expression to direct the query to the parent HTML element you want returned. So for example, we can scrape the list of attendees at an event as listed on Lanyrd, such as today’s #RSWebSci event:

lanyrd HTML

Here’s the YQL to scrape that list, using an XPath expression to home in on the appropriate HTML container element:

select href from html where url='http://lanyrd.com/2010/rswebsci/' and xpath='//div[@class="attendees-placeholder placeholder"]/ul[@class="user-list"]/li/a'

(I think an exact match is being run in this expression on the class attributes?)

Here’s the result:

Screenscraping attendees from a Lanyrd event listing
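On the aside about exact matching: in XPath, a test like @class="…" compares the whole attribute string, so the class names have to appear in the same order and with the same spacing. The sketch below illustrates this locally using Python’s ElementTree, which supports only a subset of XPath and needs well-formed markup. The page fragment is invented (only the first profile path comes from this post), and whether YQL normalises the HTML before applying the expression is not something this sketch can show.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment mimicking the attendee markup.
PAGE = """
<html><body>
  <div class="attendees-placeholder placeholder">
    <ul class="user-list">
      <li><a href="/people/nigel_shadbolt/">Nigel Shadbolt</a></li>
      <li><a href="/people/example_person/">Example Person</a></li>
    </ul>
  </div>
  <div class="placeholder attendees-placeholder">
    <ul class="user-list">
      <li><a href="/people/reordered_classes/">Not matched</a></li>
    </ul>
  </div>
</body></html>
"""

# Same shape as the XPath in the YQL query, written in ElementTree's subset.
PATH = (
    ".//div[@class='attendees-placeholder placeholder']"
    "/ul[@class='user-list']/li/a"
)


def attendee_paths(page_text):
    """Return the href of each attendee link matched by PATH."""
    root = ET.fromstring(page_text)
    return [a.get("href") for a in root.findall(PATH)]
```

The second div carries the same two class names in the opposite order, and its link is not returned, because the attribute value is compared as one string rather than as a set of classes.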

You’ll notice the scraped data is a list of paths off the Lanyrd domain to personal profile pages. So given these users’ URLs, we can scrape their Lanyrd profile pages to find their Twitter IDs:

select href from html where url='http://lanyrd.com/people/nigel_shadbolt/' and xpath='//a[@class="icon twitter url nickname"]'

HTML Screenscrape with YQL
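Chaining the two scrapes together needs a little glue: each scraped path has to be turned into a full profile URL, dropped into the second query, and the Twitter ID then read off the link that comes back. A minimal sketch of that glue is given below. It assumes the returned href is an ordinary twitter.com profile URL, and the function names are hypothetical.

```python
from urllib.parse import urljoin, urlparse

LANYRD_BASE = "http://lanyrd.com/"


def profile_url(path):
    """Turn a scraped path such as '/people/nigel_shadbolt/' into a full URL."""
    return urljoin(LANYRD_BASE, path)


def profile_query(path):
    """Build the YQL statement that scrapes the Twitter link from a profile page."""
    return (
        "select href from html where url='" + profile_url(path) + "' "
        "and xpath='//a[@class=\"icon twitter url nickname\"]'"
    )


def twitter_handle(href):
    """Return the screen name from a Twitter profile URL, or None otherwise.

    Assumption: the link looks like http://twitter.com/<screen_name>.
    """
    parsed = urlparse(href)
    if parsed.netloc.lower() not in ("twitter.com", "www.twitter.com"):
        return None
    parts = [part for part in parsed.path.split("/") if part]
    return parts[0] if parts else None
```

Mapping profile_query over the list of scraped attendee paths gives one query per attendee, and twitter_handle can then be applied to each result.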

If we’re thinking linked data (little ‘l’, little ‘d’;-), we might then use these Twitter URLs to see if we can find other webpages for the same person using Google’s otherme service:

PS just by the by, during the course of the #rswebsci event, I automatically generated a custom search engine around the #rswebsci hashtaggers websites and a quick viz of the structure of the hashtaggers’ network.

rswebsci hashtag community

With a bit of luck, I’ll get round to posting the code for grabbing this data from the Twitter API in the next day or two…

So How Does the Twitter Backchannel Work When The Chatham House Rule Is in Place?

As I type, there is a spinoff meeting (that I’m not at) from the #RSWebSci event operating from a location near Milton Keynes (@martinjemoore: “At tremendous Kavli Centre, Royal Society’s base in Buckinghamshire, for satellite meeting about future of the web & web science #rswebsci”) and being held under the Chatham House rule:

“When a meeting, or part thereof, is held under the Chatham House Rule, participants are free to use the information received, but neither the identity nor the affiliation of the speaker(s), nor that of any other participant, may be revealed”.

In the RSWebSci event, from the backchannel we can identify the participants and their affiliations from any tweets they make from the event:

Chatham House Tweet

So for example, when @timdavies mentions: “Nigel Shadbolt asking “What are Chatham house rules for Twitter” at #rswebsci… Opps, er, I mean Someone asking.” we know that both Tim Davies (“Consultant and action researcher focussing on civic engagement and social technology. Specific focus on youth engagement & open data”) and Nigel Shadbolt are at the event. And from tweets like “#RSwebsci hilarious moment when one unattributable person forgets other unattributable person’s name”, we can assume that the originator of that tweet is also at the event (and maybe that they are not either of the unattributable persons mentioned?)

If we know an event is happening, and we know the sorts of people it is likely to attract (e.g. by looking at the Twitterers from the last couple of days of the #rswebsci event), if a Twitter blackout is in operation we can look to Twitter histories to see who was not tweeting during the event who might normally be expected to be tweeting over that period, and tentatively locate them at the event. We can also rule out people who have declared they aren’t there (@cameronneylon: “I decided not to go to #RSWebsci and satellite meeting because I had too much “proper” work to do. Think I probably picked wrong…”), unless they’re bluffing…?!;-)
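As a minimal sketch of that “who has gone quiet?” check, assuming we already hold a list of (screen name, timestamp) pairs for the likely candidates, something like the following would do. The threshold, the example window and the function name are all made up for illustration, and a quiet spell is only weak evidence of attendance.

```python
from datetime import datetime

# Hypothetical event window, for illustration only.
EXAMPLE_START = datetime(2010, 1, 1, 9, 0)
EXAMPLE_END = datetime(2010, 1, 1, 17, 0)


def quiet_regulars(tweets, start, end, min_usual=3, exclude=()):
    """Return screen names that usually tweet but were silent during the event.

    tweets: iterable of (screen_name, datetime) pairs.
    min_usual: how many tweets outside the window count as "usually tweets".
    exclude: names known not to be there (e.g. they have said so).
    """
    inside, outside = {}, {}
    for name, when in tweets:
        bucket = inside if start <= when <= end else outside
        bucket[name] = bucket.get(name, 0) + 1
    skipped = set(exclude)
    return sorted(
        name
        for name, count in outside.items()
        if count >= min_usual and name not in inside and name not in skipped
    )
```

People who have declared they are not at the event go in exclude, unless we suspect they are bluffing.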

From tweets so far, we know via @lescarr that there are several sessions taking place (“‘Breakthroughs in Web Science’, ‘Dark Web’, ‘Networks in web science’, ‘Govt open data’ and ‘Collaborative Science’ sessions at #RSwebsci”). From clustering the folk we know to be at, and suspect to be at, the event, we might tentatively allocate them to different sessions, with a particular probability. If different hashtags are used for each session, the sort of thing @briankelly (who I don’t think is at the event) often lobbies for, it makes conversation analysis maybe a little easier?

On the topic of conversation analysis, or at least time series analysis (using a tool such as TimeFlow, for example?), we might be able to use some form of it to identify who said what from inspecting the timeline. For example, if @ianmulvany is a truth teller, and says at 9.47 “#RSWebsci time to pitch my idea”, we can monitor tweets over the next few minutes to see if any ideas that are reported are the sort of thing he might have come up with, given we can find out easily enough that he works for Mendeley. So when @timdavies mentions at 9.53 that “#rswebsci @? “Crowdsourcing & crowdcurating more an art than a science right now” <– Shd it develop as science? Or best in domain of art…”, the crowdsourcing remark is something that I could imagine Ian saying (P=0.7?). The question of whether it’s science or art is presumably Tim Davies’ own?

Just by the by, Tim’s use of @? comes from a suggestion I made about a possible "chatham bot" that would accept DMs, anonymise the sender and replace any @name attributions with @?. Thinking about it a little more, it would be easy enough for folk to see who was friended by the chatham bot, and narrow down at least the sender of the tweet to someone on that list. [If by implication we assume @? is a twitterer, rather than a participant not on twitter, we might further narrow down who said what in this case to someone on Twitter whose Twitter username the person who ‘mentioned’ them knows.] Chris Gutteridge, who is also not at the event ("@lescarr eh? I didn't know there was a Wednesday bit! #rswebsci any of it streamed?"), suggested "…creat[ing] a rswebscichatham twitter account and tell all people in the room the username/password. #rswebsci" which gets round this problem of preserving anonymity of the sender, which the creation of a Birdherd account (via @jamestoon) might also do?
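The text-rewriting half of such a chatham bot is small enough to sketch. The Python below is only an illustration of the idea: the function name is hypothetical, and receiving the DM and reposting the result are left out entirely.

```python
import re

# An @name mention: "@" followed by word characters, but not when the "@"
# sits inside something like an email address.
MENTION = re.compile(r"(?<!\w)@\w+")


def relay(sender, text):
    """Return the text the bot would repost.

    The sender is deliberately discarded, and every @name attribution in the
    text is replaced with "@?".
    """
    del sender  # never included in the output
    return MENTION.sub("@?", text)
```

Hashtags are left alone, so the reposted tweet still shows up in the event stream.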

Okay, enough of that… except to wonder: what other sorts of traffic analysis might we apply to a hashtag twitter stream and a “likely candidates” twitter use analysis over the duration of the event? Would it be easier to preserve the sense of the Chatham House rule if a hashtag was not used?

PS doh! I forgot to raise the point that first came to mind: how would it be possible to remotely attend a Chatham House event via a public backchannel? (Which is where the chathambot anonymiser came in…)

PPS just to note, as the clock ticks on, and the day warms up, other folk who were at #rswebsci on Monday and Tuesday, but who are not at today’s event, are now tweeting again using the hashtag, which means that the channel now has added noise on top of the discussions from today’s satellite event… The easiest way I can think of following today’s events is to create a list of folk known or suspected to be there, and follow that list through an additional #rswebsci filter?
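That list-plus-hashtag filter might look something like the sketch below, assuming the tweets are already to hand as (screen name, text) pairs. The function name and data shape are hypothetical.

```python
def follow_event(tweets, attendees, hashtag="#rswebsci"):
    """Keep only tweets sent by listed attendees that carry the hashtag.

    tweets: iterable of (screen_name, text) pairs.
    attendees: screen names known or suspected to be at the event.
    """
    tag = hashtag.lower()
    names = {name.lower().lstrip("@") for name in attendees}
    kept = []
    for name, text in tweets:
        # Compare whole words so "#rswebscience" does not count as "#rswebsci".
        words = {word.strip(".,!?:;\"'()") for word in text.lower().split()}
        if name.lower().lstrip("@") in names and tag in words:
            kept.append((name, text))
    return kept
```

Matching is case-insensitive on both the names and the tag, since the hashtag turns up as #rswebsci, #RSWebSci and #RSwebsci in the stream.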

PPPS [via @timdavies] Chatham House rule FAQ covers Twitter as follows:

Q. Can I ‘tweet’ whilst at an event under the Chatham House Rule?
A. The Rule can be used effectively on social media sites such as Twitter as long as the person tweeting or messaging reports only what was said at an event and does not identify – directly or indirectly – the speaker or another participant. This consideration should always guide the way in which event information is disseminated – online as well as offline.

It also says:

Q. Can a list of attendees at the meeting be published?
A. No – the list of attendees should not be circulated beyond those participating in the meeting.

which can in part be inferred from various uses of Twitter, and maybe also any public geolocation services used by participants. Which is to say, if you know where an event is, you can maybe look for people near there..?