I got a question from @liwazi last week wondering why an SRU request to the Cambridge Library catalogue wasn’t being handled properly in Yahoo Pipes… I think it’s because the Yahoo Pipes XML parser is sometimes like that!
Anyway, here was my fix – to use YQL as a proxy, based around a SRU request URL of the form:
Here’s the form of the YQL query (try it in the YQL developer console):
select * from xml where url='http://search.lib.cam.ac.uk/sru.ashx?version=1.1&recordSchema=dc&query=SEARCH+TERMS&operation=searchRetrieve&maximumRecords=10'
You can find a copy of the pipe here: SRU demo pipe
Note that as well as accessing the data via the pipe, you can also pull the results of a search into a web page directly from YQL as a JSON feed:
If you’re really keen, you might also define a YQL data table that would allow you to make a request of the form “select * from camsru where q='learning perl'”, and then set up a short alias for the query so you could run it using a construction of the form http://query.yahooapis.com/v1/public/yql/EXAMPLEUSER/camsru?q=learning%20perl, based on a YQL query of the form select * from camsru where q=@q
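For anyone who would rather build these request URLs in code than by hand, here is a minimal Python sketch. It only constructs the URL strings and does not fetch anything. The function names are my own, and I'm assuming the usual SRU operation=searchRetrieve parameter plus the q and format=json parameters on the public YQL endpoint.

```python
from urllib.parse import urlencode

SRU_BASE = "http://search.lib.cam.ac.uk/sru.ashx"
YQL_ENDPOINT = "http://query.yahooapis.com/v1/public/yql"


def sru_url(search_terms, max_records=10):
    """Build the SRU request URL for a set of search terms."""
    params = {
        "version": "1.1",
        "recordSchema": "dc",
        "query": search_terms,
        "operation": "searchRetrieve",  # assumed standard SRU parameter
        "maximumRecords": max_records,
    }
    return SRU_BASE + "?" + urlencode(params)


def yql_json_url(search_terms):
    """Wrap the SRU URL in a YQL query and ask for JSON back."""
    yql = "select * from xml where url='{}'".format(sru_url(search_terms))
    # urlencode escapes the inner URL so it survives as a single q value
    return YQL_ENDPOINT + "?" + urlencode({"q": yql, "format": "json"})


print(sru_url("learning perl"))
print(yql_json_url("learning perl"))
```

The point to notice is the double layer of encoding: the SRU URL is itself a parameter value inside the YQL URL, so its own & and = characters have to be escaped.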
[This post is a more complete working of Mash Oop North – Pipes Mashup by Way of an Apology]
Keeping current with journal articles in a particular subject area is one of the challenges that faces many researchers, and by implication the academic and research librarians tasked with supporting the information needs of those researchers.
This relatively simple recipe shows how to create a “two dimensional” search that allows a user to provide two sets of keywords, one to identify a set of journals in a particular subject area, the other to filter the current articles down to a particular subtopic in that subject area.
What this demo shows is:
– how to pull in a list of journals in a particular subject area based on user provided keywords into the Yahoo pipes environment;
– how to pull in the most recent table of contents from those journals into that environment;
– how to then filter those recent articles to only display articles on a particular subtopic.
The starting point for this recipe is jOPML, a service created by Scott Wilson that allows you to run a keyword search on the titles of journals whose tables of contents are made available as RSS on ticTOCs and generate an OPML feed containing the RSS feed URLs for those journal TOCs. (OPML is an XML formatted language that, among other things, can be used to transport bundles of RSS feed URLs around the web. In much the same way that RSS is one of the most effective ways of transporting sets of links to web pages around the web (as for example in the case of RSS feeds from social bookmarking sites such as delicious), so OPML is one of the best ways of moving collections of RSS links around.)
– the Fetch Feed block can pull in a wide variety of RSS flavoured forms (different versions of RSS, Atom etc); [Handy tip – a pipe that just wires a Fetch Feed block direct to the pipe output can be used to “normalise” different flavours of RSS/Atom in order to provide a single, standard feed format at the output of the pipe.]
– Fetch Data can be used to import XML and JSON into the pipes environment (with Fetch CSV pulling in CSV data files, from sources such as Google Spreadsheets);
– Fetch Page can be used to load HTML web pages into Yahoo Pipes, providing the means by which to develop simple screen scraping applications within the Pipes environment.
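To give a feel for what “normalising” feed flavours involves outside of Pipes, here is a rough Python sketch that reduces RSS 2.0 and Atom documents to the same list of title/link pairs. This is my own illustration rather than how Fetch Feed works internally; it ignores RSS 1.0 (RDF) feeds and every field other than title and link.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom elements live in this namespace


def normalise_feed(xml_text):
    """Return a list of {'title', 'link'} dicts from RSS 2.0 or Atom text."""
    root = ET.fromstring(xml_text)
    items = []
    # RSS 2.0: un-namespaced <item> elements with <title> and <link> children
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
        })
    # Atom: <entry> elements; the link is held in an href attribute
    for entry in root.iter(ATOM + "entry"):
        link = entry.find(ATOM + "link")
        items.append({
            "title": entry.findtext(ATOM + "title", default=""),
            "link": link.get("href", "") if link is not None else "",
        })
    return items
```

Whatever flavour goes in, the same simple item structure comes out, which is the effect the handy tip is after.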
What this means is that we can pull in the OPML file generated by jOPML into the Yahoo Pipes environment and have a play with it :-)
So let’s see how. To start with, we need to find a way of getting arbitrary OPML files out of jOPML. Running a search for science history on jOPML returns:
with the OPML available here: http://jopml.org/feeds.opml?q=science+history
Looking at this URI, you’ll hopefully see that it contains the search terms used to query the journals database on jOPML. In effect, the URI is an API to the jOPML service. By rewriting the URI, we can make different calls on the jOPML service, and return different OPML files for different topic areas.
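If you wanted to do the same URI rewriting in code, a minimal Python sketch might look like the following. The function name is made up for illustration; the only things taken from jOPML are the page location and the q variable seen in the URI above.

```python
from urllib.parse import urlencode

JOPML_FEED = "http://jopml.org/feeds.opml"


def jopml_url(keywords):
    """Build the jOPML OPML feed URI for a set of journal title keywords."""
    # urlencode turns spaces into '+' and escapes anything awkward
    return JOPML_FEED + "?" + urlencode({"q": keywords})


print(jopml_url("science history"))
# → http://jopml.org/feeds.opml?q=science+history
```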
AS-AN-ASIDE TAKE HOME POINT: many URIs effectively provide an API to a web service. If you ever see a search form, run some queries using it, and look at the URIs of the results page. If you can see your search terms in the URI, you are now in a position to construct your own queries to that service simply by using the URI, rather than having to go by the search form.
Here are a couple of services you can try this out with:
– Google: http://google.com;
– Twitter: http://search.twitter.com.
Remember, the goal is to:
1) run a search;
2) look at the URI of the results page and see if you can spot the search terms;
3) try to hack the URI to run a search using a new set of search terms.
Are there any other hackable items in the URI? For example, run several Twitter searches returning different numbers of search results, and look at the URI in each case. Can you see how to hack it to return the number of results items that you want? (Note that there is a hard limit set at the Twitter end that defines the maximum number of results that can be returned.)
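The spot-the-terms-then-hack-the-URI routine can also be sketched in a few lines of Python. The helper names here are hypothetical, and I'm using the jOPML URI as the example since that is the one we know the shape of; the same approach should work for any service that keeps its search terms in the query string.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qs, urlencode


def query_values(uri, key="q"):
    """Steps 1 and 2: pull the values of one query variable out of a URI."""
    return parse_qs(urlsplit(uri).query).get(key, [])


def with_param(uri, key, value):
    """Step 3: return the same URI with one query variable replaced."""
    parts = urlsplit(uri)
    params = parse_qs(parts.query)
    params[key] = [value]
    new_query = urlencode(params, doseq=True)
    return urlunsplit(parts._replace(query=new_query))


uri = "http://jopml.org/feeds.opml?q=science+history"
print(query_values(uri))                     # → ['science history']
print(with_param(uri, "q", "medical ethics"))
# → http://jopml.org/feeds.opml?q=medical+ethics
```

The same with_param call could be pointed at any other variable you spot in a URI, such as one that controls the number of results.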
It’s not just search terms that appear in URIs either. For example, the ISBN Playground will generate a wide variety of URIs that are all “keyed” using an arbitrary ISBN. (Actually, that’s not quite true; many of them require ISBN 10 format ISBNs. But there are ways around that, as I’ll show in a later post…) If I’m missing any URIs you know of that contain ISBNs, please let me know in a comment to this post ;-)
Anyway, that’s more than enough of that! Let’s go back to the 2D journal search recipe, and let the pipework begin…
The main idea behind Yahoo Pipes is to “wire” together different components in order to complete some sort of task. When you create a new Yahoo pipe you are presented with an empty canvas that dominates the screen on which to create your “pipe”, and a menu area on the left that contains different blocks that you can use to create your pipe.
Blocks are added to the canvas either by dragging them from the menu area and dropping them on the canvas, or by clicking the + symbol on the block you want in the menu area, which adds it to the canvas automatically.
Blocks are wired together by clicking on the circle on the bottom of a block and dragging and dropping the “wire” onto a circle on the top of the next block in your pipe.
The idea is that content flows through one block into the next, entering the block along its top edge, being processed by the block as appropriate, and then passing out through the bottom edge of the block.
Blocks that do not have an input circle on the top edge are used to pull content into the pipe from elsewhere. (These can be found in the Sources part of the menu panel.)
In contrast, the Pipe output block does not have any circles on its lower edge – the output from this block is exposed to the outside world on the pipe’s public home page. (The single pipe output block is added to the canvas automatically when you add an input block. Pipes can have multiple input blocks, but only one output block.)
So where do we start? The first thing to do is to construct the URI to the OPML feed that we can then use to pull in the OPML feed for a set of journals on a particular topic:
If you highlight a block by clicking on it, it will glow orange. You can then inspect the output from just this block by looking in the preview pane at the bottom of the screen:
The “Journal keywords (text)” block is actually a Text input block:
The URL Builder constructs a URL from the fixed elements of the URI (the page location http://jopml.org/feeds.opml and the query variable q) and the user inputted search terms. The user inputs are exposed as text entry boxes on the front page of the pipe as well as by arguments in the URI for the pipe (e.g. in the same way that the query terms appear in the jOPML URIs).
In order to import the contents of the jOPML file, we can use the Fetch Data block.
To see what we’ll be working with, here’s what an original OPML file looks like:
If we load this XML file into Pipes, we need to tell the Fetch Data block which parts of the OPML file it should use as separate items within the pipe. Looking at the OPML file, we ideally want each journal to be represented as a separate item within the pipe. We do this by specifying the path to the outline element in the OPML feed, noting that each journal listing is represented using a separate outline element.
Within the pipes environment, the OPML file is represented as follows:
Each outline element contains information regarding a single journal – its title, xmlUrl, and so on. The xmlUrl element contains the URI of the RSS feed for the contents of the current issue of the particular journal. You’ll see that the xmlUrl points to the RSS feed of the journal from the publisher’s site.
So for example, the RSS version of the TOCs for the journal The British Journal for the History of Science can be found at http://journals.cambridge.org/data/rss/feed_BJH_rss_2.0.xml.
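For comparison, here is roughly what “one item per outline element” looks like if you do the splitting yourself in Python. The sample OPML is a made-up fragment in the general shape of an OPML file (in raw OPML, title and xmlUrl are attributes of each outline), not a copy of real jOPML output, and the second journal is entirely hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the general shape of an OPML journal list
SAMPLE_OPML = """<opml version="1.1">
  <head><title>Sample journal list</title></head>
  <body>
    <outline type="rss"
             title="The British Journal for the History of Science"
             xmlUrl="http://journals.cambridge.org/data/rss/feed_BJH_rss_2.0.xml"/>
    <outline type="rss"
             title="Hypothetical Journal"
             xmlUrl="http://example.org/hypothetical.rss"/>
  </body>
</opml>"""


def journals(opml_text):
    """Return one dict per outline element that carries a feed URL."""
    root = ET.fromstring(opml_text)
    return [
        {"title": o.get("title", ""), "xmlUrl": o.get("xmlUrl")}
        for o in root.iter("outline")
        if o.get("xmlUrl")  # skip any outlines used purely for grouping
    ]


for journal in journals(SAMPLE_OPML):
    print(journal["title"], "->", journal["xmlUrl"])
```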
Now you could of course subscribe to all these journal table of contents feeds simply by importing the OPML file into an RSS reader such as Google Reader, but where would the fun be in that? After all, most of the time, I’m not actually that interested in most of the articles in any particular journal. For example, it would be far more efficient (?!) if I was only alerted to articles that were in my subject area. So let’s see how to do that…
The Loop block lets us work with each item in the selected journals feed. Essentially, it says “for each item in a feed, do something to or with that item”. (For each is a really powerful idea in computational thinking. It does pretty much exactly what it says on the tin: for each item in a list, do something with it. In the Yahoo Pipes environment, the Loop block essentially implements for each):
You’ll see that the loop block has a space for adding another block – the block whose functionality will be applied to each element in the incoming feed. As well as placing ‘standard’ pipes blocks taken from the blocks menu in a Loop element, you can also use pipes you have created yourself.
If we embed a Fetch Feed block in the Loop, then for each journal item identified in the imported OPML feed, we can locate its TOCs RSS feed URI (the xmlUrl element) and use it to fetch the contents of that feed.
Now you may notice that the Loop block can output the results of the Fetch Feed call in one of two ways; it can either annotate the original feed items, for example by assigning (that is, adding) the current list of contents for a journal to a subelement of each item in the pipe:
In more abstract terms, we might represent that as follows:
Or by emitting the items, which is to say that each item that comes into the Loop block is replaced by the set of items that were created within the Loop block:
Here’s how that looks diagrammatically:
Because I want to produce a feed that just contains links to articles that may be of interest to me, we’re going to use the “emit all results” option.
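The difference between the two Loop output modes may be easier to see in code. In this Python sketch, fetch_feed is a hypothetical stand-in for the Fetch Feed block that looks articles up in a small made-up table rather than going out to the web, and the field name toc is my own choice.

```python
# Made-up data standing in for two journals and their TOC feeds
FAKE_FEEDS = {
    "http://example.org/a.rss": [{"title": "Article A1"}, {"title": "Article A2"}],
    "http://example.org/b.rss": [{"title": "Article B1"}],
}
JOURNALS = [
    {"title": "Journal A", "xmlUrl": "http://example.org/a.rss"},
    {"title": "Journal B", "xmlUrl": "http://example.org/b.rss"},
]


def fetch_feed(url):
    """Hypothetical stand-in for Fetch Feed: return the articles for a URL."""
    return FAKE_FEEDS.get(url, [])


def loop_assign(journals, field="toc"):
    """'Assign' mode: keep each journal item, adding its articles as a subelement."""
    return [dict(j, **{field: fetch_feed(j["xmlUrl"])}) for j in journals]


def loop_emit(journals):
    """'Emit' mode: replace each journal item with the articles fetched for it."""
    return [article for j in journals for article in fetch_feed(j["xmlUrl"])]


print(len(loop_assign(JOURNALS)))  # → 2 (still one item per journal)
print(len(loop_emit(JOURNALS)))    # → 3 (one item per article)
```

Assigning keeps the list the same length and makes each item bigger; emitting flattens everything into a single list of articles, which is what we want for a feed of articles.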
So let’s just recap where we are. Here’s the pipe so far:
We start by taking some user keyword terms and constructing a URI that calls the jOPML service, returning an OPML file that contains the titles and TOC RSS URLs of journals related to those keywords. We then loop through that list of journals, replacing each journal item with a list of items corresponding to the current table of contents of each journal. These items are pulled from the table of contents RSS feed for each journal as obtained from the ticTOCs listings.
The next step is to filter the contents list so that we only get passed journal articles on a particular topic. We’ll do that using a crude keyword filter that only lets articles through whose contents contain a particular keyword or set of keywords.
Taking the filter block, we wire in another user input that allows the user to specify keywords that must appear in the title element of an article for it to be emitted from the pipe, and take the output from this filter to the output of the pipe.
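A crude keyword filter of this sort is only a few lines of Python. This sketch assumes case-insensitive substring matching on the title and lets you choose whether any or all of the keywords must appear; those choices are mine, made for illustration.

```python
def filter_by_title(items, keywords, match_all=False):
    """Keep only the items whose title contains the given keywords.

    keywords is a space separated string; matching is a case-insensitive
    substring test, which is an assumption made for this sketch.
    """
    wanted = [k.lower() for k in keywords.split()]
    if not wanted:
        return list(items)  # no keywords given, so let everything through
    test = all if match_all else any
    return [
        item for item in items
        if test(k in item.get("title", "").lower() for k in wanted)
    ]


articles = [
    {"title": "Darwin and the Beagle"},
    {"title": "Newton's optics"},
    {"title": "Darwin's optics"},
]
print(filter_by_title(articles, "darwin"))
print(filter_by_title(articles, "darwin optics", match_all=True))
```

Because it is a plain substring test, a keyword like “optic” will also let through “optics” and “optical”, which is part of what makes the filter crude.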
So there we have it: a 2D search that takes two sets of keywords, one set that pulls out likely suspect journals on a topic, and the second set that filters articles from those journals on a more detailed subject.
The output from the pipe is then available as an RSS feed in its own right, as a Google personal (iGoogle) widget, etc etc.
The whole pipe looks like this:
It works by generating a jOPML URI based on user provided keyword terms, importing the jOPML feed into the pipe, grabbing the RSS feed of the table of contents for each journal specified in the OPML feed and then filtering those contents listings using another set of keyword terms based on the title of each article.
In doing so, you have seen how to use the URL Builder block to construct the jOPML URI using user provided search terms entered using a Text Input block; the Fetch Data block to grab the jOPML XML feed; the Loop and Fetch Feed blocks to pull in the table of contents RSS feed from the journal publisher for each journal identified in the jOPML feed; and the Filter block to pass through only those articles that contain a second set of user specified keywords in the article title.
PS if I manage to blag being able to run a Library Mashup uncourse in the Autumn, this is about the level of detail of post I was planning to write. So – too much detail? Not enough? Just right? How’s it for leveling? Appropriate for a ‘not necessarily techie, but keen to learn’ audience?