Using YQL With Yahoo Pipes

A couple of days ago, @mikechelen asked:

“can yql plug in to pipes for improved development, compared with other cloud platforms that accept standard languages?”

where YQL is the Yahoo Query Language, a SQL-like query language that can run queries on data pulled in from all over the web…

There are at least a couple of ways of doing this: a) calling YQL from Yahoo Pipes; b) calling Yahoo Pipes from within a YQL query.

First up, calling YQL from Yahoo Pipes, using the Pipes YQL block and a trick I learned from @hapdaniel that lets me run a query over a couple of Google spreadsheets, where the results from one spreadsheet are subselected based on the results of a query to a second spreadsheet:

select * from csv(2,500) where url='' and col4 > '70' and col1 in (select col1 from csv(2,500) where url='' and col4 > '70')
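The same query can of course also be run outside Pipes, by hitting the public YQL web service directly. Here's a minimal Python 3 sketch; the endpoint is the standard public YQL endpoint, and the spreadsheet URLs are left blank, as in the query above:

```python
import urllib.parse

# Public YQL web service endpoint (as documented by Yahoo at the time).
YQL_ENDPOINT = "https://query.yahooapis.com/v1/public/yql"

def build_yql_url(query, fmt="json"):
    """Return the REST URL that runs `query` against the public YQL endpoint."""
    params = urllib.parse.urlencode({"q": query, "format": fmt})
    return YQL_ENDPOINT + "?" + params

# The spreadsheet URLs are left blank here, as in the query above.
query = ("select * from csv(2,500) where url='' and col4 > '70' "
         "and col1 in (select col1 from csv(2,500) where url='' and col4 > '70')")
url = build_yql_url(query)
print(url)
```

Fetching that URL (e.g. with `urllib.request.urlopen`) would return the query results as JSON.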

The second approach is to run a YQL query, e.g. with the YQL console, that calls on the JSON output of a Yahoo pipe (in this case, I just happen to be displaying the results from the pipe shown above; that is, a pipe that itself embeds a different YQL query).

Calling Yahoo pipes from YQL

So, there you have it – how to embed YQL in a Yahoo pipe, and how to call a Yahoo pipe from YQL…

PS See also HTML Tables and the Data Web and the Google Visualization API Query Language, which apparently lets you “perform various data manipulations with the query to the data source. The Query Language does not depend on the implementation of any specific data source. These data manipulations are performed by the data source server, reducing the need to perform data manipulations and formatting by developers on the client.”

Grabbing the JSON Description of a Yahoo Pipe from the Pipe Itself

In a series of recent posts (The Yahoo Pipes Documentation Project – Initial Thoughts, Grabbing JSON Data from One Web Page and Displaying it in Another, Starting to Think About a Yahoo Pipes Code Generator) I’ve started exploring some of the various ingredients that might be involved in documenting the structure of a Yahoo Pipe and potentially generating some programme code that will then implement a particular pipe.

One problem I’d come across was how to actually obtain the abstract description of a pipe. I’d found an appropriate Javascript object within an open Pipes editor, but getting that data out was a little laborious…

…and then came a comment on one of the posts from Paul Daniel/@hapdaniel, pointing me to a pipe that included a little trick he was aware of: a trick for grabbing the description of a pipe from the pipe’s own info feed.

Paul used something akin to this YPDP pipe’s internals pipe to grab the data from the info feed of a specified pipe (the URL of which has a standard form) using YQL:

It’s just as easy to grab the JSON feed from YQL, e.g. using a query of the form:
select PIPE.working from json where url=''. The pipe id is the id of the pipe you want the description of.

If you have a Yahoo account, you can try this for yourself in the YQL developer console:

We can then grab the JSON feed either from YQL or the YPDP pipe’s internals pipe into a web page and run whatever we want from it.

So for example, the demo service I have set up at will take an id argument containing the id of a pipe, and display a crude textual description of it. Like this:

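A sketch of that kind of unpacking in Python, using a made-up miniature of a pipe’s modules/wires JSON (the field names here are illustrative stand-ins, not the actual Pipes schema):

```python
import json

# A made-up miniature of a pipe description: a list of modules and the
# wires connecting them. The real PIPE.working object is much richer.
pipe_json = json.loads("""{
  "modules": [
    {"id": "sw-1", "type": "fetch", "conf": {"URL": {"value": ""}}},
    {"id": "sw-2", "type": "output", "conf": {}}
  ],
  "wires": [
    {"src": {"moduleid": "sw-1"}, "tgt": {"moduleid": "sw-2"}}
  ]
}""")

# Build a crude textual description: one line per module, one per wire.
description = []
for m in pipe_json["modules"]:
    description.append("module %s of type %s" % (m["id"], m["type"]))
for w in pipe_json["wires"]:
    description.append("wire %s -> %s" % (w["src"]["moduleid"], w["tgt"]["moduleid"]))

print("\n".join(description))
```

Walking the modules list and the wires list separately like this is essentially what a code generator would have to do: emit one statement per module, then stitch the dataflow together from the wires.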
So what’s next on the “to do” list? Firstly, I want to tidy up – and further unpack – the “documentation” that the above routine produces. Secondly, there’s the longer term goal of producing the code generator. If anyone fancies attacking that problem, you can get hold of the JSON description of a pipe from its ID using either the YPDP internals pipe or the YQL query that are shown above.

XML Data Scraping and Screen Scraping with YQL

Although it seems as if the OU’s repository data is, henceforth, going to be made available as Linked Data, I thought I’d just post this quick demo I’ve had lying around for a week or two demonstrating how to use YQL to run queries over the XML data files published from the OU’s eprints server, as well as how to use it to scrape structured data from an HTML page (in this case, from Lanyrd).

Picking the “submissions by author” listing for the first author in the list (which happens to be James Aczel, not that you’d know it from the listing page as shown below…;-), we see a variety of ways of exporting James’ publication list:

export formats from OU ORO/eprints

If we select the “EP3 XML” (ePrints v3?) option, this is what we get:

OU eprints XML

The URL for the XML export view, as for the HTML page, is well structured around a unique identifier for James, although I can’t tell you what it is because the bit of the OU that “owns” the minting of the identifiers is very protective of them, how they’re used, and the extent to which they’re allowed to be made public… doh!

Anyway, given the URL for the XML version of someone’s publications, we can treat that document as a database using YQL.

So for example, here’s a query that will select the id and name of all the authors, organised by publication, on papers where James Aczel was an author:

select, eprint.creators from xml where url=''

(I wrote this query because I wanted to start exploring co-author networks around particular authors…)

Here it is in the YQL Developer console:

Querying eprints xml from Yql developer console

And by extension, this query gets the authors and paper titles for papers where James has some creator attribution.

If we parameterise the query (as described here: YQL Web Service URLs), we can generate a “nice” parameterised URL for the query that returns the YQL XML query response.

YQL query alias

So for example, here’s response data for the user with ID bkh2:

(I don’t know how to construct the URL within YQL from just the ID; if you know, please post the answer in a comment…;-) [See @hapdaniel’s comment below :-)]

You may have noticed that YQL uses a dot notation in the construction of the SELECT component of the query to identify the data fields you want returning from the query. A rather different approach can be taken when using YQL to screenscrape data from an HTML page, in particular by using an XPATH expression to direct the query to the parent HTML element you want returning. So for example, we can scrape the list of attendees at an event as listed on Lanyrd, such as today’s #RSWebSci event:

lanyrd HTML

Here’s the YQL to scrape that list, using an XPATH expression to target in on the appropriate HTML container element:

select href from html where url='' and xpath='//div[@class="attendees-placeholder placeholder"]/ul[@class="user-list"]/li/a'

(I think an exact match is being run in this expression on the class attributes?)
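That XPath expression can be tried out locally too. Here’s the same path applied to a made-up, well-formed stand-in for the attendee-list markup, using Python’s built-in ElementTree (whose limited XPath subset does cover attribute predicates like these — note that, as with YQL above, the class match is exact):

```python
import xml.etree.ElementTree as ET

# A made-up stand-in for the Lanyrd attendee list markup.
html = """<div class="attendees-placeholder placeholder">
  <ul class="user-list">
    <li><a href="/profile/alice/">Alice</a></li>
    <li><a href="/profile/bob/">Bob</a></li>
  </ul>
</div>"""

root = ET.fromstring(html)
# Same idea as the YQL xpath: drill down to the anchors in the user list.
hrefs = [a.get("href")
         for a in root.findall('.//ul[@class="user-list"]/li/a')]
print(hrefs)
```

(ElementTree needs well-formed XML, which real-world HTML rarely is; that’s exactly the tag-soup cleaning that YQL’s html table is doing for us server-side.)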

Here’s the result:

Screenscraping attendees from a Lanyrd event listing

You’ll notice the scraped data is a list of paths off the Lanyrd domain to personal profile pages. So given these users’ URLs, we can scrape their Lanyrd profile pages to find their Twitter IDs:

select href from html where url='' and xpath='//a[@class="icon twitter url nickname"]'

HTML Screenscrape with YQL

If we’re thinking linked data (little ‘l’, little ‘d’;-), we might then use these Twitter URLs to see if we can find other webpages for the same person using Google’s otherme service:

PS just by the by, during the course of the #rswebsci event, I automatically generated a custom search engine around the #rswebsci hashtaggers’ websites and a quick viz of the structure of the hashtaggers’ network.

rswebsci hashtag community

With a bit of luck, I’ll get round to posting the code for grabbing this data from the Twitter API in the next day or two…

Feed Autodiscovery With YQL

Just a quick follow up to the post on using Beautiful Soup for RSS feed autodetection – it struck me that I should be able to do a similar thing with YQL:

YQL feedautodetect

Remember, feed autodiscovery relies on a web page containing the following construction in the HTML <head> element:
<link rel="alternate" type="application/rss+xml" href="FEED_URL" title="FEED_NAME" />

So to try and autodetect the feed in a web page, we can use the following YQL statement:

select * from html where url="" and
xpath='//link[@rel="alternate" and @type="application/rss+xml"]'

Feed autodetection in YQL

We can then generalise this and create a query alias that allows us to pass in a URL and get any autodetected feeds back:

That is, use the query:

select * from html where url=@url and
xpath='//link[@rel="alternate" and @type="application/rss+xml"]'

We can look for atom feeds too:

select * from html where url=@url and
xpath='//link[@rel="alternate" and (@type="application/rss+xml" or @type="application/atom+xml")]'

In this case, I’ve used the argument url for the original page URL, and specified the query alias feedautodetect, which means I can run a query remotely as follows:

The format=json switch forces the query to provide the response using JSON (example).
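For what it’s worth, the same link-sniffing can be done locally in Python with ElementTree, given the <head> markup of a page (the fragment below is made up for the demo):

```python
import xml.etree.ElementTree as ET

# Made-up <head> fragment with two feed links and one non-feed link.
head = """<head>
  <link rel="alternate" type="application/rss+xml" href="/feed.rss" title="RSS" />
  <link rel="alternate" type="application/atom+xml" href="/feed.atom" title="Atom" />
  <link rel="stylesheet" href="/style.css" />
</head>"""

FEED_TYPES = ("application/rss+xml", "application/atom+xml")

root = ET.fromstring(head)
# Keep only rel="alternate" links whose type is an RSS or Atom MIME type,
# mirroring the xpath filter in the YQL query above.
feeds = [link.get("href") for link in root.findall("link")
         if link.get("rel") == "alternate" and link.get("type") in FEED_TYPES]
print(feeds)
```

The filter is the direct analogue of the `@rel="alternate" and (@type=… or @type=…)` predicate in the YQL xpath.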


PS …though of course I expect that @hapdaniel knows an ever more elegant/powerful/efficient way of doing this?;-)

Blog Details from an RSS/Atom Feed

Picking up on Feed Autodetection With YQL, where I described a YQL custom query for autodetecting RSS and Atom feed URLs in a web page given the web page URL, here’s a complementary YQL custom query function which polls a feed URL through the YQL feed normaliser and returns the title and URL of the alternate HTML page for the feed:

select title,link from feednormalizer where url=@url and output='atom_1.0'

You can call the query using this alias:

(Leave the &format=json off if you want an XML response.)

Here’s an example of that query with a URL instantiated, via a specific query in the YQL developer console:

Feed details via YQL

You’ll notice several alternatives are also given; the HTML page URL is given in the result where rel="alternate", which is somewhat reminiscent of the case of feed autodetection in an HTML page, where rel="alternate" identifies a <link> element that includes the URL for a feed alternative…
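Pulling the title and the alternate (HTML) link out of the normalised Atom is straightforward; here’s a sketch with Python’s ElementTree, over a made-up minimal feed:

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

# A made-up minimal Atom feed, standing in for the feednormalizer output.
feed = ET.fromstring("""<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example Blog</title>
  <link rel="alternate" type="text/html" href="http://example.com/"/>
  <link rel="self" href="http://example.com/feed.atom"/>
</feed>""")

title = feed.find(ATOM + "title").text
# rel="alternate" on a feed-level link points back at the HTML page.
alternate = [link.get("href") for link in feed.findall(ATOM + "link")
             if link.get("rel") == "alternate"]
print(title, alternate)
```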

It’s now easy enough to do a two-pass procedure where we autodetect a feed URL from an HTML blog homepage using the autodetection query described previously, and then lookup the details of the feed using the query described above.

And why exactly might we want to do this? Because in many HTML docs that do specify an alternate RSS/Atom feed, the title element provided is often something like the uninformative “RSS2.0”, rather than the title of the original blog…

Accessing Linked Data in Scraperwiki via YQL

A comment from @frabcus earlier today alerted me to the fact that the Scraperwiki team had taken me up on my suggestion that they make the Python YQL library available in the Scraperwiki environment, so I thought I ought to come up with an example of using it…

YQL provides a general purpose standard query interface “to the web”, interfacing with all manner of native APIs and providing a common way of querying them, and receiving responses from them. YQL is extensible too: if there isn’t a wrapper for your favourite API, you can write one yourself and submit it to the community. (For a good overview of the rationale for, and philosophy behind, YQL, see Christian Heilmann’s The Why of YQL.)

Browsing through the various community tables, I found one for handling SPARQL queries. The YQL wrapper expects a SPARQL query and an endpoint URL, and will return the results in the YQL standard form. (Here’s an example SPARQL query in the YQL developer console using the education datastore.)

The YQL query format is:
select * from where query="YOUR_SPARQL_QUERY" and service="SPARQL_ENDPOINT_URL"
and can be called in Python YQL in the following way (Python YQL usage):

import yql  # the python-yql library, made available in the Scraperwiki environment

def run_sparql_query(query, endpoint):
    y = yql.Public()
    query = 'select * from where query="' + query + '" and service="' + endpoint + '"'
    env = ""
    return y.execute(query, env=env)

For a couple of weeks now, I’ve been looking for an opportunity to try to do something – anything – with the newly released Ordnance Survey Linked Data (read @gothwin’s introduction to it for more details: /location /location /location – exploring Ordnance Survey Linked Data – Part 2).

One of the things the OS Linked Data looks exceedingly good for is acting as glue, mapping between different representations for geographical and organisational areas; the data can also return regions that neighbour on a region, which could make for some interesting “next door to each other” ward, district or county level comparisons.

One of the most obvious ways in to the data is via a postcode. The following Linked Data query to the Ordnance Survey SPARQL endpoint returns the OS district ID, ward and district name that a postcode exists in:
PREFIX skos: <>
PREFIX postcode: <>

select ?district ?wardname ?districtname where {
<> postcode:district ?district; postcode:ward ?ward.
?district skos:prefLabel ?districtname.
?ward skos:prefLabel ?wardname
}
Here it is running in the YQL developer console:

OS Postcode query in YQL developer console

(Just by the by, we can create a query alias for that query if we want, by changing the postcode (MK76AA in the example) to @postcode. This gives us a URL argument/variable called postcode whose value gets substituted in to the query whenever we call it:

[Note we manually need to add the environment variable &env= to the URL created by the query alias generator/wizard.]

YQL query alias for sparql query

So… that’s SPARQL in YQL; but how can we use it in Scraperwiki? The newly added YQL wrapper makes it easy… here’s an example, based on the above:


os_endpoint = ''  # the Ordnance Survey SPARQL endpoint URL (elided above)

# The postcode URI in the query body contains the placeholder token
# MAGIC_POSTCODE, which gets substituted with the space-stripped postcode.
os_query = '''
PREFIX skos: <>
PREFIX postcode: <>

select ?district ?wardname ?districtname where {
<> postcode:district ?district; postcode:ward ?ward.
?district skos:prefLabel ?districtname.
?ward skos:prefLabel ?wardname
}
'''

postcode = "MK7 6AA"
os_query = os_query.replace('MAGIC_POSTCODE', postcode.replace(' ', ''))

def run_sparql_query(query, endpoint):
    y = yql.Public()
    query = 'select * from where query="' + query + '" and service="' + endpoint + '"'
    env = ""
    return y.execute(query, env=env)

result = run_sparql_query(os_query, os_endpoint)

for row in result.rows:
    print postcode, 'is in the', row['result']['wardname']['value'], 'ward of', row['result']['districtname']['value']
    record = {"id": postcode, "ward": row['result']['wardname']['value'], "district": row['result']['districtname']['value']}["id"], record)

I use the MAGIC_POSTCODE substitution to give me the freedom to create a procedure that will take in a postcode argument and add it in to the query. Note that I am probably breaking all sorts of Linked Data rules by constructing the URL that uniquely identifies (reifies?) the postcode in the Ordnance Survey URL namespace (that is, I construct something like <>, which contravenes the “URIs are opaque” rule that some folk advocate, but I’m a pragmatist;-)
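The substitution itself is trivial. Here’s a sketch; the URI prefix is a stand-in placeholder, since the real OS namespace URL is elided above:

```python
def postcode_token(postcode):
    """Strip the space and uppercase: 'MK7 6AA' -> 'MK76AA' (the form
    substituted into the OS postcode URI)."""
    return postcode.replace(' ', '').upper()

# OS_POSTCODE_URI_PREFIX is a placeholder for the elided OS namespace URL;
# MAGIC_POSTCODE is the token the query template carries.
template = "<OS_POSTCODE_URI_PREFIX/MAGIC_POSTCODE>"
query = template.replace('MAGIC_POSTCODE', postcode_token('MK7 6AA'))
print(query)
```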

Anyway, here’s a Scraperwiki example that scrapes a postcode from a web page, and looks up some of its details via the OS: simple Ordnance Survey Linked Data postcode lookup

The next thing I wanted to do was use two different Linked Data services. Here’s the setting. Suppose I know a postcode, and I want to lookup all the secondary schools in the council area that postcode exists in. How do I do that?

The education datastore lets you look up schools in a council area given the council ID. Simon Hume gives some example queries to the education datastore here: Using SPARQL & the school data. The following is a typical example:

prefix sch-ont: <>

SELECT ?name ?reference ?date WHERE {
?school a sch-ont:School;
sch-ont:establishmentName ?name;
sch-ont:uniqueReferenceNumber ?reference ;
sch-ont:districtAdministrative <> ;
sch-ont:openDate ?date ;
sch-ont:phaseOfEducation <> .
}

Here, the secondary schools are being identified according to the district area they are in (00MG in this case).

But all I have is the postcode… Can Linked Data help me get from MK7 6AA to 00MG (or more specifically, from <> to <>)?

Here’s what the OS knows about a postcode:

What the OS knows about a postcode

If we click on the District link, we can see what the OS knows about a district:

Local authority area code lookup in OS Linked Data

The Census Code corresponds to the local council id code used in the Education datastore (thanks to John Goodwin for pointing that out…). The identifier doesn’t provide a Linked Data URI, but we can construct one out of the code value:

(Note that the lookup on the district code does include a sameas URL link back to the OS identifier.)

Here’s how we can get hold of the district code – it’s the admingeo:hasCensusCode you’re looking for:

os_query = '''
PREFIX skos: <>
PREFIX admingeo: <>
PREFIX postcode: <>

select ?district ?nsdistrict ?wardname ?districtname where {
<> postcode:district ?district; postcode:ward ?ward.
?district skos:prefLabel ?districtname.
?ward skos:prefLabel ?wardname .
?district admingeo:hasCensusCode ?nsdistrict.
}
'''

postcode = 'MK7 6AA'
os_query = os_query.replace('MAGIC_POSTCODE', postcode.replace(' ', ''))

result = run_sparql_query(os_query, os_endpoint)

for row in result.rows:
    print row['result']['nsdistrict']['value']
    print postcode, 'is in the', row['result']['wardname']['value'], 'ward of', row['result']['districtname']['value']
    record = {"id": postcode, "ward": row['result']['wardname']['value'], "district": row['result']['districtname']['value']}

So what does that mean… well, we managed to look up the district code from a postcode using the Ordnance Survey API, which means we can insert that code into a lookup on the education datastore to find schools in that council area:

def run_sparql_query(query, endpoint):
    y = yql.Public()
    query = 'select * from where query="' + query + '" and service="' + endpoint + '"'
    env = ""
    return y.execute(query, env=env)

edu_endpoint = ''  # the education datastore SPARQL endpoint URL (elided above)

# The district code found via the OS lookup gets substituted into the
# districtAdministrative URI in the query below (the URI itself is elided).
edu_query = '''
prefix sch-ont: <>

SELECT ?name ?reference ?date WHERE {
?school a sch-ont:School;
sch-ont:establishmentName ?name;
sch-ont:uniqueReferenceNumber ?reference ;
sch-ont:districtAdministrative <> ;
sch-ont:openDate ?date ;
sch-ont:phaseOfEducation <> .
}
'''

result = run_sparql_query(edu_query, edu_endpoint)
for row in result.rows:
    for school in row['result']:
        print school['name']['value'], school['reference']['value'], school['date']['value']
        record = {"id": school['reference']['value'], "name": school['name']['value'], "openingDate": school['date']['value']}["id"], record)

Here’s a Scraperwiki example showing the two separate Linked Data calls chained together (click on the “Edit” tab to see the code).

Linked Data in Scraperwiki

Okay, so that was easy enough (?!;-). We’ve seen:
– how Scraperwiki supports calls to YQL;
– how to make SPARQL/Linked Data queries from Scraperwiki using YQL;
– how to get data from one Linked Data query and use it in another.

A big problem though is how do you know whether there is a linked data path from a data element in one Linked Data store (e.g. from a postcode lookup in the Ordnance Survey data) through to another datastore (e.g. district area codes in the education datastore), when you are a mere mortal and not a Linked Data guru?! Answers on the back of a postcard, please, or via the comments below;-)

PS whilst doing a little digging around, I came across some geo-referencing guidance on the National Statistics website that suggests that postcode areas might change over time (they also publish current and previous postcode info). So what do we assume about the status (currency, validity) of the Ordnance Survey postcode data?

PPS Just by the by, this may be useful to folk looking for Linked Data context around local councils: @pezholio’s First steps to councils publishing their own linked data

Feed Autodiscovery in Javascript

For what it’s worth, I’ve posted a demo showing a couple of feed autodiscovery/autodetection tricks that let you autodiscover feeds in remote pages via a couple of online services: the Google feed api, and YQL (Feed Autodiscovery With YQL).

Try it out: Feed autodiscovery in Javascript (code)

Single page web app: feed autodetection

I’ve also added in a routine that uses the Google feed api to look up historical entries on an RSS feed. As soon as Google is alerted to a feed (presumably by anyone or any means), it starts caching entries. The historical entries API lets you grab up to 250 of the most recent entries from a feed, irrespective of how many items the feed itself currently contains…
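As I remember it, the Feed API also had a plain JSON REST form of the `load` endpoint, with a `scoring=h` parameter for the historical entries; the parameter names below are from memory of the old (long since retired) API docs, so treat this as a sketch rather than gospel:

```python
import urllib.parse

def feed_history_url(feed_url, num=250):
    """Build a request URL for the old Google Feed API `load` endpoint;
    scoring=h asked for historical entries, num for how many (up to 250).
    Parameter names are from memory of the retired API - a sketch only."""
    params = urllib.parse.urlencode({
        "v": "1.0", "q": feed_url, "num": num, "scoring": "h"})
    return "https://ajax.googleapis.com/ajax/services/feed/load?" + params

url = feed_history_url("http://example.com/feed.rss")
print(url)
```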

Why it matters: Public Data Principles: RSS Autodiscovery on Government Department Websites?, Autodiscoverable Feeds and UK HEIs (Again…)

PS Just by the by, I added a Scraperwiki view to my UK HEI autodiscovered feeds Scraperwiki. I added a little bit of logic to try to pull out feeds on a thematic basis too…

UK HE autodiscoverable feeds

On the to do list is to create some OPML output views so you can easily subscribe to, or display, batches of the feeds in one go.

I guess I should also add a table to the scraper to start logging the number of feeds that are autodiscoverably out there over time?