Finding Common Terms around a Twitter Hashtag

@aendrew sent me a link to a StackExchange question he’s just raised, in a tweet asking: “Anyone know how to find what terms surround a Twitter trend/hashtag?”

I’ve dabbled in this area before, though not addressing this question exactly, using Yahoo Pipes to find what hashtags are being used around a particular search term (Searching for Twitter Hashtags and Finding Hashtag Communities) or by members of a particular list (What’s Happening Now: Hashtags on Twitter Lists; that post also links to a pipe that identifies names of people tweeting around a particular search term.).

So what would we need a pipe to do that finds terms surrounding a twitter hashtag?

Firstly, we need to search on the tag to pull back a list of tweets containing that tag. Then we need to split the tweets into atomic elements (i.e. separate words). At this point, it might be useful to count how many times each one occurs, and display the most popular. We might also need to generate a “stop list” containing common words we aren’t really interested in (for example, the or and).

So here’s a quick hack at a pipe that does just that (Popular words round a hashtag).

For a start, I’m going to construct a string tokeniser that just searches for 100 tweets containing a particular search term, and then splits each tweet up into separate words, where words are things that are separated by white space. The pipe output is just a list of all the words from all the tweets that the search returned:

Twitter string tokeniser

You might notice the pipe also allows us to choose which page of results we want…
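Outside of Pipes, the tokenising step amounts to very little code. Here’s a minimal Python sketch of the same idea (the sample tweets are made up):

```python
def tokenise(tweets):
    """Split each tweet on white space and return one flat word list,
    mirroring the pipe's string tokeniser."""
    words = []
    for tweet in tweets:
        words.extend(tweet.split())
    return words

sample = ["Testing the #pipes tokeniser", "the tokeniser splits on spaces"]
print(tokenise(sample))
```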

We can now use the helper pipe in another pipe. Firstly, let’s grab the words from a search that returns 200 tweets on the same search term. The helper pipe is called twice, once for the first page of results, once for the second page of results. The wordlists from each search query are then merged by the union block. The Rename block relabels the .content attribute as the .title attribute of each feed item.

Grab 200 tweets and check we have set the title element

The next thing we’re going to do is identify and count the unique words in the combined wordlist using the Unique block, and then sort the list according to the number of times each word occurs.

Preliminary parsing of a wordlist

The above pipe fragment also filters the wordlist so that only words containing alphabetic characters are allowed through, as well as words with four or more characters. (The regular expression .{4,} reads: allow any string of four or more ({4,}) characters of any type (.). An expression .{5,7} would say – allow words through with length 5 to 7 characters.)
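For comparison, here’s a sketch of the same two filters expressed as Python regular expressions (this isn’t part of the pipe itself, just an illustration):

```python
import re

def passes_filters(word):
    """Allow only words made of alphabetic characters, and only words of
    four or more characters (the .{4,} pattern from the pipe)."""
    return (bool(re.fullmatch(r"[A-Za-z]+", word))
            and bool(re.fullmatch(r".{4,}", word)))

words = ["the", "pipes", "2011", "hashtag", "RT"]
print([w for w in words if passes_filters(w)])
```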

I’ve also added a short routine that implements a stop list. The regular expression pattern (?i)\b(word1|word2|word3)\b says: ignoring case ((?i)), try to match any of the words word1, word2, word3. (\b denotes a word boundary.) Note that in the filter below, some of the words in my stop list are redundant (the ones with three or fewer characters; remember, we have already filtered the word list to show only words of length four or more characters).

Stop list

I also added a user input that allows additional stop terms to be added (they should be pipe (|) separated, with no spaces between them). You can find the pipe here.
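Pulling the stop-list and the count-and-sort steps together outside of Pipes, a Python equivalent might look like this sketch (the stop words shown are illustrative only):

```python
import re
from collections import Counter

# the stop words listed here are illustrative; extend the pattern to taste
STOP = re.compile(r"(?i)\b(this|that|with|from|have|just)\b")

def top_words(words, n=10):
    """Drop stop-listed words, then count the rest and rank by frequency."""
    kept = [w for w in words if not STOP.search(w)]
    return Counter(kept).most_common(n)

print(top_words(["pipes", "pipes", "that", "hashtag"]))
```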

Fragments… Obtaining Individual Photo Descriptions from flickr Sets

I think I’ve probably left it too late to think up some sort of hack for the UK Discovery developer competition, but here’s a fragment that might provide a starting point for someone else: how to use a Yahoo pipe to grab a list of photos in a particular flickr set (such as one of the sets posted by the UK National Archive to the flickr commons).

The recipe makes use of two calls to the flickr API: one to get a list of photos in a particular set; the second, called repeatedly, to grab the details for each photo in the set.

In pseudo-code, we would write the algorithm along the lines of:

get list of photos in a given flickr set
for each photo in the set:
  get the details for the photo

Here’s the pipe:

Example of calling flickr api to obtain descriptions of photos in a flickr set

The first step is to construct a call to the flickr API to pull down the photos in a given set. The API is called via a URI of the form:

The API returns a JSON object containing separate items identifying each photo in the set.

The rename block constructs a new attribute for each photo item (detailURI) containing the corresponding photo ID. The RegEx block applies a regular expression to each item’s detailURI attribute to transform it into a URI that calls the flickr API for details of a particular photo, by photo ID. The call this time is of the form:

Finally, the Loop block runs through each item in the original set, calls the flickr API using the detailURI to get the details for the corresponding photo, and replaces each item with the full details of each photo.
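Away from Pipes, the same two-call recipe can be sketched directly against the flickr REST API (the method names flickr.photosets.getPhotos and flickr.photos.getInfo are from the flickr API; the API key and set ID are placeholders you would supply yourself):

```python
import json
import urllib.request

API = "https://api.flickr.com/services/rest/"

def call_url(method, **params):
    """Build a flickr REST call URL asking for JSON output."""
    params.update(method=method, format="json", nojsoncallback="1")
    query = "&".join("%s=%s" % kv for kv in sorted(params.items()))
    return "%s?%s" % (API, query)

def set_photo_details(api_key, photoset_id):
    """Get the list of photos in the set, then the details for each photo."""
    with urllib.request.urlopen(call_url("flickr.photosets.getPhotos",
                                         api_key=api_key,
                                         photoset_id=photoset_id)) as resp:
        photos = json.load(resp)["photoset"]["photo"]
    details = []
    for photo in photos:
        with urllib.request.urlopen(call_url("flickr.photos.getInfo",
                                             api_key=api_key,
                                             photo_id=photo["id"])) as resp:
            details.append(json.load(resp)["photo"])
    return details
```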

flickr api - photo details

You can find the pipe here: grabbing photo details for photos in a flickr set

An obvious next step might be to enrich the photo descriptions with semantic tags using something like the Reuters OpenCalais service. On a quick demo, this didn’t seem to work in the pipes context (I wonder if there is Calais API throttling going on, or maybe a timeout?) but I’ve previously posted a recipe using Python that shows how to call the OpenCalais service in a Python context: Augmenting OU/BBC Co-Pro Programme Data With Semantic Tags.

Again in pseudo code, we might do something like:

get JSON feed out of Yahoo pipe
for each item:
  call the Calais API against the description element
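A hedged sketch of that loop in Python — the pipe JSON structure is as Pipes emits it, but the Calais endpoint and header names are assumptions based on the old OpenCalais REST API, so check the current documentation before relying on them:

```python
import json
import urllib.request

# NB: the endpoint and header names below are assumptions based on the old
# OpenCalais REST API, not something taken from the recipe itself.
CALAIS_URL = "http://api.opencalais.com/tag/rs/enrich"

def descriptions(pipe_json):
    """Pull the description element from each item of a pipe's JSON output."""
    return [item["description"] for item in pipe_json["value"]["items"]]

def calais_tags(text, api_key):
    """POST a description to the Calais enrich endpoint, get JSON tags back."""
    req = urllib.request.Request(
        CALAIS_URL, data=text.encode("utf-8"),
        headers={"x-calais-licenseID": api_key,
                 "Content-Type": "text/raw",
                 "Accept": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```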

We could then index the description, title, and semantic tags for each photo and use them to support search and linking between items in the set.

Other pivots potentially arise from identifying whether photos are members of other sets, and using the metadata associated with those other sets to support discovery, as well as information contained within comments associate with each photo.

Extracting Data From Misbehaving RSS/Atom Feeds

A quick recipe prompted by a query from Joss Winn about getting data out of an apparently broken Atom feed:

The feed previews in Google Reader okay – – and is also viewable in my browser, but neither Google Spreadsheets (via the =importFeed() formula) nor YQL (?!) appear to like it.

[Note: h/t to Joss for pointing this out to me: is a recipe for accessing Google Reader’s archive of a feed, and pulling out e.g. n=150 items (r=n is maybe an ordering argument?) Which is to say: here’s a way of accessing an archive of RSS feed items…:-)]

However, Yahoo Pipes does, so a simple proxy pipe normalises the feed and gives us one that is properly formatted:

Sample Yahoo feed proxy -

The normalised feed can now be accessed via:

We can also get access to a CSV output:

The CSV can be imported into a Google spreadsheet using the =importData() formula:

[Gotcha: if you have ?&_render in the URL (i.e. ?&), Spreadsheets can’t import the data…]

Once in the spreadsheet it’s easy enough to just pull out e.g. the description text from each feed item because it all appears in a single column.

Google spreadsheets can also query the feed and just pull in the description element. For example:

=ImportFeed(“”,”items summary”)

(Note that it seemed to time out on me when I tried to pull in the full set of 150 elements in Joss’ original feed, but it worked fine with 15.)

We can also use YQL developer console to just pull out the description elements:

select description from rss where url=’

YQL querying an rss feed

YQL then provides XML or JSON output as required.

A Bit of NewsJam MoJo – SocialGeo Twitter Map

At the Mozilla/Knight/Guardian/BBC NewsJam #mojo event on Saturday (review by Martin Belam; see also Martin’s related review of the news:rewired event the day before), I was part of a small team that looked at how we might start validating folk tweeting about a particular news event. Here’s a brief write up of our design…

Try it here: SocioGeo map

When exploring twitter based activity around an event, Guardian journalist Paul Lewis raised the question “how does a journalist know which sources are to be trusted?” (a source verification problem), identifying this as one area where tool innovation may be able to help the journalist assessing which twitter sources around an event may be worth following or contacting directly.

The SocioGeo map begins to address this concern, and represents an initial attempt at mapping the social and geographical spread of tweets around an event in near real time. In its first incarnation, SocioGeoMap is intended to support visual analysis of the social and spatial distribution of people tweeting about an event in order to identify the extent to which people tweeting about an event are co-located with it/and or each other (initially, based on a sampling of geocoded tweets, although this might extend to reconciliation of identities from Twitter into location based checkin services such as Foursquare, or derived location services such as uploaded geocoded photos to Flickr), and the extent to which they know each other (initially, this is limited to whether or not they are following each other on Twitter, but could be extended to other social networks).

In his presentation at the #mojo London event, Guardian interactive designer Alastair Dant suggested a fruitful approach for hacks/hackers communities might be to identify publication “archetypes” such as maps and timelines, as well as “standard content types” such as map+timeline combinations. To these, we might add the “social network” archetype, and geo-social maps (locating nodes in space and drawing social connections between them), socio-temporal maps (showing how social connections ebb and flow over time, or how messages are passed between actors) or geo-socio-temporal maps (where we plot how messages move across spatially and socially distributed nodes over time).

If the simple geo-social map depiction demonstrated above does turn out to be useful, informative or instructive, the next phase might be to start using mathematical analyses of the geographical concentration of people tweeting about an event, as well as social network analysis metrics, to start assigning certainty factors to individuals relating to the degree of confidence we might have that they were eyewitness to an event, embedded within it/central to it, or a secondary/amplifying source only, and so on. A wider social network analysis (e.g. of the social networks of people associated with an event) might also provide information related to the authority/trustedness/reputation of the source in other contexts. These certainty factors might then be used to rank tweets associated with an event, or to identify sources who might be worth contacting directly, or ignoring altogether. (That is, the analyses might be able to contribute to automatic filter configuration.)

SocioGeoMap is based on several observations:

  • that events may occur in a physical location, or virtual online space, or a combination of the two;
  • that people tweeting about an event may or may not be participating in it or eyewitnesses to it (if not, they may be amplifying it for direct or indirect reasons; an indirect reason might be where the retweeter is not really interested in the event, but is interested in amplifying the content of a tweet that also mentions the event); we might associate a certainty factor with the extent to which we believe a person was a participant in, or eyewitness to, an event, whether they were rebroadcasting the event as a “news service”, whether they were commenting on the event, or raising a question to event participants, and so on;
  • that people tweeting about an event may or may not know each other.

Taking the example of football match, we might imagine several different co-existing states:

  • a significant number of people co-located with the event (and eyewitnesses to it); small clusters of these people may be tightly interconnected and follow each other (for example, social groups who visit matches together), some clusters may be weakly associated with each other via a common node (for example, different follower groups of the same team following the same football players), and large numbers of people/clusters may be independent;
  • a very large number of people following the event through a video or audio stream but not co-located with it; it is likely that we will see large numbers of small subnetworks of people who know each other through eg work but who also share an interest in football;

In the case of a bomb going off in a busy public space, we might imagine:

  • a small number of people colocated with the event and largely independent of each other (not socially connected to each other)
  • a larger number of people who know eyewitnesses and retweet the event;
  • people in the vicinity of the event but unaware of it, except that they have been inconvenienced by it in some way;
  • people unconnected to the event who saw it via a news source or as a trending news topic and retweet it to feel as if they are doing their bit, to express shock, horror, pity, anger, etc

SocioGeoMap helps visualise the extent to which twitter activity around an event is either distributed or localised in both social/social network and geographical spaces.

In its current form, SocioGeoMap is built from a couple of pre-existing services:

  • a service that searches the recent twitter stream around a topic and identifies social connections between people who are part of that stream;
  • a service that searches the recent twitter stream around a particular location (using geo-coded tweets) and renders them on an embeddable map.

In its envisioned next generation form, SocioGeoMap will display people tweeting about a particular topic by location (i.e. on a map) and also draw connections between them to demonstrate the extent to which they are socially connected (on Twitter, at least).

SocioGeoMap as currently presented is based on particular, user-submitted search queries that may also have a geographical scope. An extension of SocioGeoMap might be to create SocioGeoMap alerting dashboards around particular event types, using techniques similar to those employed in many sentiment analysis tools, specifically the filtering of items through word lists containing terms that are meaningful in terms of sentiment. The twist in news terms is to identify meaningful terms that potentially relate to newsworthy exclamations (“Just heard a loud explosion”, “goal!!!!”, “feel an earthquake?” and so on), and rather than associating positive or negative sentiment with a brand or term, trying to discover tweets associated with sentiments of shock or concern in a particular geographical location.
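As a toy illustration of that kind of word-list filter (the terms here are invented for the example):

```python
import re

# illustrative term list only; a real dashboard would curate these carefully
ALERT_TERMS = re.compile(r"(?i)\b(explosion|earthquake|gunfire|goal)\b")

def newsworthy(tweets):
    """Keep only tweets matching the newsworthy-exclamation word list."""
    return [t for t in tweets if ALERT_TERMS.search(t)]

print(newsworthy(["Just heard a loud explosion", "lovely weather today"]))
```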

SocioGeoMap may also be used in association with other services that support the pre-qualification or pre-verification of individuals, or certainty measure estimates of their expertise or likelihood of being in a particular place at a particular time. So, for example, in the first case we might imagine doing some prequalification work around people likely to attend a planned event, such as a demonstration, based on their public declarations (“Off to #bigDemo tomorrow”), or identify their remote support for/interest in it (“gutted not to be going to #bigDemo tomorrow”). Another example might be looking for geolocated evidence that an individual frequents a particular space, for example through a geo-coded analysis of their personal twitter stream (and potentially also at one remove, such as through a geocoded analysis of their friends’ profiles and tweetstreams), and as a result deriving a certainty measure about the association of an individual with a particular location; that is, we could start to assign a certainty measure to the likelihood of their being an eyewitness to an event in a particular locale based on previous geo-excursions.

By: Tony Hirst (@psychemedia), Alex Gamela (@alexgamela), Misae Richwood (@minxymoggy)
Mozilla/Knight/Guardian/BBC News Jam, Kings Tower, London, May 28th, 2011 #mojo

Implementation notes:

The demo was built out of a couple of pre-existing tools/components: a geo-based Twitter search constructed using Yahoo Pipes (Discovering Co-location Communities – Twitter Maps of Tweets Near Wherever…); and a map of social network connections between folk recently using a particular search term or hashtag (Using Protovis to Visualise Connections Between People Tweeting a Particular Term). It is possible to grab a KML URL from the geotwitter pipe and feed it into a Google map that can be embedded in a page using an iframe. The social connections graph can also be embedded in an iframe. The SocioGeoMap page simply contains two iframes, one that loads the map and a second that loads the social network graph. The same data pulled from the Yahoo geo-search pipe feeds both visualisations.

In many cases, several tweets may have exactly the same geo-coordinates, which means they are overlaid on the map and difficult to see. To get around this, a certain amount of jitter is added to each latitude and longitude; because Yahoo Pipes doesn’t have a native random number generator, I use a tweet ID to generate a jitter offset using the following pipe:

This is called just before the output of the geotwitter search pipe:

Whilst this does mean that no points are plotted with their exact original co-ordinates, it does mean that we can separate out most of the markers corresponding to tweets with the same latitude and longitude and thus see them independently on the map at their approximate location.
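The jitter trick is easier to see outside of Pipes; here’s a Python sketch of one way to derive a small, repeatable offset from a tweet ID (the digit runs used and the scale factor are arbitrary choices):

```python
def jitter(tweet_id, scale=0.0005):
    """Derive a small, repeatable offset from the trailing digits of a tweet
    ID; Pipes has no random number block, so the ID stands in for noise.
    The scale factor (about +/-0.025 degrees here) is an arbitrary choice."""
    digits = int(str(tweet_id)[-2:] or "0")
    return (digits - 50) * scale

def jitter_point(lat, lon, tweet_id):
    """Nudge a marker, using different digit runs for latitude and longitude."""
    tid = str(tweet_id)
    return (lat + jitter(tid), lon + jitter(tid[:-2]))
```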

A next step in development might be to move away from using Yahoo Pipes (which incurs a caching overhead) and use a server-side service. A quickstart solution might be to generate a Python equivalent of the current pipework using Greg Gaughan’s pipe2py compiler, which generates Python code equivalent to a Yahoo pipe.

Fragments: Glueing Different Data Sources Together With Google Refine

I’m working on a new pattern using Google Refine as the hub for a data fusion experiment pulling together data from different sources. I’m not sure how it’ll play out in the end, but here are some fragments….

Grab Data into Google Refine as CSV from a URL (Proxied Google Spreadsheet Query via Yahoo Pipes)

Firstly, getting data into Google Refine… I had hoped to be able to pull a subset of data from a Google Spreadsheet into Google Refine by importing CSV data obtained from the spreadsheet via a query generated using my Google Spreadsheet/Guardian datastore explorer (see Using Google Spreadsheets as a Database with the Google Visualisation API Query Language for more on this) but it seems that Refine would rather pull the whole of the spreadsheet in (or at least, the whole of the first sheet (I think?!)).

Instead, I had to create a proxy to run the query via a Yahoo Pipe (Google Spreadsheet as a database proxy pipe), which runs the spreadsheet query, gets the data back as CSV, and then relays it forward as JSON:

Here’s the interface to the pipe – it requires the Google spreadsheet public key id, the sheet id, and the query… The data I’m using is a spreadsheet maintained by the Guardian datastore containing UK university fees data (spreadsheet).

You can get the JSON version of the data out directly, or a proxied version of the CSV, as CSV via the More options menu…

Using the Yahoo Pipes CSV output URL, I can now get a subset of data from a Google Spreadsheet into Google Refine…

Here’s the result – a subset of data as defined by the query:

We can now augment this data with data from another source using Google Refine’s ability to import/fetch data from a URL. In particular, I’m going to use the Yahoo Pipe described above to grab data from a different spreadsheet and pass it back to Google Refine as a JSON data feed. (Google spreadsheets will publish data as JSON, but the format is a bit clunky…)

To test out my query, I’m going to create a test query in my datastore explorer using the Guardian datastore HESA returns (2010) spreadsheet URL ( which also has a column containing HESA numbers. (Ultimately, I’m going to generate a URL that treats the Guardian datastore spreadsheet as a database that lets me get data back from the row with a particular HESA code column value. By using the HESA number column in Google Refine to provide the key, I can generate a URL for each institution that grabs its HESA data from the Datastore HESA spreadsheet.)

Hit “Preview Table Headings”, then scroll down to try out a query:

Having tested my query, I can now try the parameters out in the Yahoo pipe. (For example, my query is select D,E,H where D=21 and the key is tpxpwtyiYZwCMowl3gNaIKQ; this grabs data from columns D, E and H where the value of D (HESA Code) is 21). Grab the JSON output URL from the pipe, and use this as a template for the URL template in Google Refine. Here’s the JSON output URL I obtained:

Remember, the HESA code I experimented with was 21, so this is what we want to replace in the URL with the value from the HESA code column in Google Refine…

Here’s how we create the URLs built around/keyed by an appropriate HESA code…
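In code, the URL templating step might look something like the following sketch — the pipe ID and the key/q parameter names are placeholders standing in for the pipe’s actual user inputs, though the spreadsheet key and query are the ones quoted above:

```python
from urllib.parse import quote_plus

PIPE_ID = "PIPEID"  # placeholder: the real pipe ID from the pipe's URL
KEY = "tpxpwtyiYZwCMowl3gNaIKQ"  # the spreadsheet key quoted in the post

def query_url(hesa_code):
    """Template the pipe's JSON output URL around a HESA code; the key/q
    parameter names are assumed names for the pipe's user inputs."""
    q = quote_plus("select D,E,H where D=%s" % hesa_code)
    return ("http://pipes.yahoo.com/pipes/pipe.run?_id=%s&_render=json"
            "&key=%s&q=%s" % (PIPE_ID, KEY, q))

print(query_url(21))
```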

Google Refine does its thing and fetches the data…

Now we process the JSON response to generate some meaningful data columns (for more on how to do this, see Tech Tips: Making Sense of JSON Strings – Follow the Structure).

First say we want to create a new column based on the imported JSON data:

Then parse the JSON to extract the data field required in the new column.

For example, from the HESA data we might extract the Expenditure per student /10:

value.parseJson().value.items[0]["Expenditure per student / 10"]

or the Average Teaching Score (value.parseJson().value.items[0]["Average Teaching Score"]):

And here’s the result:

So to recap:

– we use a Yahoo Pipe to query a Google spreadsheet and get a subset of data from it;
– we take the CSV output from the pipe and use it to create a new Google Refine database;
– we note that the data table in Google Refine has a HESA code column; we also note that the Guardian datastore HESA spreadsheet has a HESA code column;
– we realise we can treat the HESA spreadsheet as a database, and further that we can create a query (prototyped in the datastore explorer) as a URL keyed by HESA code;
– we create a new column based on HESA codes from a generated URL that pulls JSON data from a Yahoo pipe that is querying a Google spreadsheet;
– we parse the JSON to give us a couple of new columns.
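Stripped of the Pipes and Refine machinery, the whole merge boils down to a keyed join, which we might sketch in Python as follows (the column names are assumptions based on the spreadsheets described above):

```python
def merge_on_key(refine_rows, hesa_rows, key="HESA code"):
    """Join two row lists on a shared key column; the column name here is an
    assumption, so use whatever header your sheets actually carry."""
    lookup = {row[key]: row for row in hesa_rows}
    merged = []
    for row in refine_rows:
        combined = dict(row)
        for k, v in lookup.get(row[key], {}).items():
            if k != key:
                combined[k] = v
        merged.append(combined)
    return merged
```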

And there we have it – a clunky, but workable, route for merging data from two different Google spreadsheets using Google Refine.

Linked Data Without the SPARQL – OU/BBC Programmes on iPlayer

Over the last few weeks, I’ve been trying to look at the Linked Data project from different points of view. Part of the reason for this is to try to find one or more practical ways in that minimise the need to engage too much with the scary-looking syntax. (Whether it really is scary or not, I think the fact that it looks scary makes it hard for novices to see past.)

Here’s my latest attempt, which uses Yahoo Pipes (sigh, yes, I know…) and BBC programme pages to engage with the BBC programme Linked Data: iPlayer clips and videos from OU/BBC co-pros

In particular, a couple of hacks that demonstrate how to:

– find all the clips associated with a particular episode of a BBC programme;
– find all the programmes associated with a particular series;
– find all the OU/BBC co-produced programmes that are currently available on iPlayer.

Rather than (or maybe as well as?) dumping all the programme data into a single Linked Data triple store, the data is exposed via programme pages on the BBC website. As well as an HTML version of each programme page (that is, pages for each series, each episode in a series, and each clip from a programme), the BBC also publishes RDF and XML views over the data represented in each page. This machine readable data is all linked, so for example, a series page includes well defined links to the programme pages for each episode included in that series.

The RDF and XML views over the data (just add .rdf or .xml respectively as a suffix on the end of a programme page URL) are slightly different in content (I think), with the added difference that the XML view is naturally in a hierarchical/tree like structure, whereas the RDF would rather define a more loosely structured graph. [A JSON representation of the data is also available – just add .json]

So for example, to get the XML version of the series page add the suffix .xml to give

In the following demos, I’m going to make use of the XML rather than RDF expression of the data, partly to demonstrate that the whole linked data thing can work without RDF as well as without SPARQL…

There are also a couple of other URL mappings that can be useful, as described on the BBC Programmes Developers’ site:

– episodes available on iPlayer:
– episodes upcoming

Again, the .xml suffix can be used to get the xml version of the page.
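These URL patterns are trivial to construct in code; for example (the PID is a placeholder, and the patterns are the ones described above and on the BBC Programmes Developers’ site):

```python
BASE = "http://www.bbc.co.uk/programmes"

def programme_data_url(pid, fmt="json"):
    """Machine-readable view of a programme page: add .rdf, .xml or .json."""
    return "%s/%s.%s" % (BASE, pid, fmt)

def episodes_on_iplayer_url(pid, fmt="xml"):
    """Episodes of a series currently available on iPlayer."""
    return "%s/%s/episodes/player.%s" % (BASE, pid, fmt)

print(programme_data_url("SERIESPID", "xml"))
print(episodes_on_iplayer_url("SERIESPID"))
```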

So – let’s start by looking at the details of a series, such as @liamgh’s favourite – Coast – and pulling out the episodes that are currently available on iPlayer:

Coast pipe

We can use a Yahoo Pipes Fetch Data block to get the XML from the Coast episodes/player page:

BBC iplayer series episodes on iplayer

The resulting output is a feed of episodes of Coast currently available. The link can be easily rewritten from a programme page form (e.g. so that it points to the iPlayer page for the episode (e.g. If the programme is not available on iPlayer, I think the iPlayer link redirects to the programme page?

This extended pipe will accept a BBC series code, look for current episodes on iPlayer, and then link to the appropriate iPlayer page. Subscribing to the RSS output of the pipe should work in the Boxee RSS app. You should also be able to compile a standalone Python runnable version of the Pipe using Pipe2Py.

Now let’s look at some linked data action..(?!) Err… sort of…

Here’s the front half of a pipe that grabs the RDF version of a series page and extracts the identifiers of available clips from the series:

Clips from BBC series

By extracting the programme identifier from the list of programme clips, we can generate the URL of the programme page for that programme (as well as the URI for the XML version of the page); or we might call on another pipe that grabs “processed” data from the clip programme page:

Using BBC programme IDs for Linked Data fetches

Here’s the structure of the subpipe – it pulls together some details from the programme episode page:

Programme episode details

To recap – given a programme page identifier (in this case for a series), we can get a list of clips associated with the series; for each clip, we can then pull in the data version of the clip’s programme page.

We can also use this compound pipe within another pipe that contains series programme codes. For example, I have a list of OU/BBC co-produced series bookmarked on delicious (OU/BBC co-pros). If I pull this list into a pipe via a delicious RSS feed, and extract the series/programme ID from each URL, we can then go and find the clips…

Linking BBC programme stuff

Which is to say: grab a feed from delicious, extract programme/series IDs, look up clips for each series from the programme page for the series, then for each clip, look up clip details from the clip’s programme page.

And if the dependence on Yahoo Pipes is too much for you, there’s always pipe2py, which can compile it to a Python equivalent.

PS hmm, as well as the pipe2py approach, maybe I should set up a scraperwiki page to do a daily scrape?

PPS see also Visualising OU Academic Participation with the BBC’s “In Our Time”, which maybe provides an opportunity for a playback channel component relating to broadcast material featuring OU academics?

5 Minute Hack – QR Codes from an RSS Feed

Skimming through my feeds a few minutes ago, I saw a post by Karen Coombs on the OCLC Developer blog about QR Code hacks to “mass generate QR Codes for all the web addresses of the applications in the [OCLC] Application Gallery”. The technique uses a bit of PHP to parse a feed from the gallery and create the QRcode images using the Google charts API.

If the thought of writing code doesn’t appeal to you, here’s a Yahoo pipe for achieving a similar effect – Yahoo Pipe to generate QRcodes from RSS feed item links:

QRcode images for RSS feed item links
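The same trick — one Google Charts QR code image per feed item link — can be sketched in a few lines of Python (the chart API parameters are the standard cht=qr ones; the sample feed below is a stand-in):

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote

def qr_image_url(link, size="150x150"):
    """Google Charts QR code image URL for a single link."""
    return ("https://chart.googleapis.com/chart?cht=qr&chs=%s&chl=%s"
            % (size, quote(link, safe="")))

def qr_urls_from_rss(rss_text):
    """One QR image URL per item link in an RSS feed."""
    root = ET.fromstring(rss_text)
    return [qr_image_url(link.text) for link in root.findall(".//item/link")]

feed = ("<rss><channel>"
        "<item><link>http://example.com/a</link></item>"
        "</channel></rss>")
print(qr_urls_from_rss(feed))
```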

[UPDATE: if you need a code equivalent of this pipe, you can always generate it using Greg Gaughan’s Pipe2Py compiler, which generates a Python programming code equivalent of a Yahoo Pipe, given the Pipe ID;-)]

[ANOTHER UPDATE: via the knows-way-more-than-I-do-about-Yahoo-pipes @hapdaniel: given the feed output URL for a pipe, for example, if we leave the _render=rss argument out, the pipe will render an unordered HTML list containing the feed items, rather than RSS. e.g. gives an unordered HTML list of the output of the above pipe…]

Using the Yahoo pipes image gallery badge, or maybe the Google AJAX Feed API, you can then quickly and easily generate an embed code to display the feed in a gallery widget in your own webpage.

PS After a moment or two’s reflection, I realised there’s an even lazier approach to generating the codes: @adrianshort’s Barcode Posters service, which also generates a list of QRcodes from an RSS feed:


Embedding the Output of a Yahoo Pipe in Your Own Webpage

I’ve demonstrated previously a way of using JQuery to pull the JSON (Javascript Object Notation) output of a Yahoo pipe – a Javascript version of an RSS feed, essentially – into a web page (Previewing the Contents of a JSON Feed in Yahoo Pipes), but if the idea of tinkering with JQuery is one step towards coding hell too far, what are the alternatives?

One approach is to grab the RSS output of a pipe and use it as the RSS URL in Feed2JS, which will let you customise a preview display of the output of the pipe and give you a handy Javascript embed code for the display that you can embed in your own page.

I don’t know why, but I keep forgetting the Yahoo pipes itself offers a couple of widgets – referred to as “badges” – for displaying the output of a pipe in your own web page: Yahoo Pipes Badges how to.

As with the pipe output previews on the “homepage” of a Yahoo Pipe, three sorts of display are possible – a list based display, an image display (which displays a slide show of images identified as such in the feed) and a map badge, which renders markers on an interactive Yahoo map. For the latter, you can also take the KML output of a feed, paste it into the Google Maps search box, and grab an iframe embed code.

Unfortunately, embedding the Javascript snippets used to display the Yahoo Pipes badges in hosted blogs isn’t allowed – so to display pipe fed content in such a blog you need to use the RSS output URL from a pipe in a WordPress RSS sidebar widget.

Yahoo Pipes Code Generator (Python): Pipe2Py

Wouldn’t it be nice if you could use Yahoo Pipes as a visual editor for generating your own feed powered applications running on your own server? Now you can…

One of the concerns occasionally raised around Yahoo Pipes (other than the stability and responsiveness issues) relates to the dependence on the Yahoo Pipes platform that results from creating a pipe. Where a pipe is used to construct an information feed that may get published on an “official” web page, users need to feel that content will always be being fed through the pipe, not just when Pipes feels like it. (Actually, I think the Pipes backend is reasonably stable, it’s just the front end editor/GUI that has its moments…)

Earlier this year, I started to have a ponder around the idea of a Yahoo Pipes Documentation Project (the code appears to have rotted, unfortunately; I think I need to put a proper JSON parser in place :-( ), which would at least display a textual description of a pipe based on the JSON representation of it that you can access via the Pipes environment. Around the same time, I floated an idea for a code generator that would take the JSON description of a pipe and generate Python or PHP code capable of achieving a similar function to the pipe.

Greg Gaughan picked up the challenge and came up with a code generator, written in Python, for doing just that. (I didn’t blog it at the time because I wanted to help Greg extend the code to cover more modules, but I never delivered on my part of the bargain:-()

Anyway – the code is at and it works as follows. Install the universal feed parser (sudo easy_install feedparser) and simplejson (sudo easy_install simplejson), then download Greg’s code and declare the path to it, maybe something like:
export PYTHONPATH=$PYTHONPATH:/path/to/pipe2py.

Given the ID for a pipe on Yahoo pipes, generate a Python compiled version of it:
python -p PIPEID

This generates a file containing a function pipe_PIPEID() which returns a JSON object equivalent of the output of the corresponding Yahoo pipe, the major difference being that it’s the locally compiled pipe code that’s running, not the Yahoo pipe…

So for example, for the following simple pipe, which just grabs the blog feed and passes it straight through:

Simple pipe for compilation

we generate a Python version of the pipe as follows:
python -p 404411a8d22104920f3fc1f428f33642

This generates the following code:

from pipe2py import Context
from pipe2py.modules import *

def pipe_404411a8d22104920f3fc1f428f33642(context, _INPUT, conf=None, **kwargs):
    if conf is None:
        conf = {}

    forever = pipeforever.pipe_forever(context, None, conf=None)

    sw_502 = pipefetch.pipe_fetch(context, forever, conf={u'URL': {u'type': u'url', u'value': u''}})
    _OUTPUT = pipeoutput.pipe_output(context, sw_502, conf={})
    return _OUTPUT

We can then run this code as part of our own program. For example, grab the feed items and print out the feed titles:

context = Context()
p = pipe_404411a8d22104920f3fc1f428f33642(context, None)
for i in p:
  print i['title']

running a compiled pipe on the desktop

Not all the Yahoo Pipes blocks are implemented (if you want to volunteer code, I’m sure Greg would be happy to accept it!;-), but for simple pipes, it works a dream…

So for example, here’s a couple of feed mergers and then a sort on the title…

Another pipe compilation demo

And a corresponding compilation, along with a small amount of code to display the titles of each post, and the author:

from pipe2py import Context
from pipe2py.modules import *

def pipe_2e4ef263902607f3eec61ed440002a3f(context, _INPUT, conf=None, **kwargs):
    if conf is None:
        conf = {}

    forever = pipeforever.pipe_forever(context, None, conf=None)

    sw_550 = pipefetch.pipe_fetch(context, forever, conf={u'URL': [{u'type': u'url', u'value': u''}, {u'type': u'url', u'value': u''}]})
    sw_572 = pipefetch.pipe_fetch(context, forever, conf={u'URL': {u'type': u'url', u'value': u''}})
    sw_580 = pipeunion.pipe_union(context, sw_550, conf={}, _OTHER = sw_572)
    sw_565 = pipesort.pipe_sort(context, sw_580, conf={u'KEY': [{u'field': {u'type': u'text', u'value': u'title'}, u'dir': {u'type': u'text', u'value': u'ASC'}}]})
    _OUTPUT = pipeoutput.pipe_output(context, sw_565, conf={})
    return _OUTPUT

context = Context()
p = pipe_2e4ef263902607f3eec61ed440002a3f(context, None)
for i in p:
  print i['title'], ' by ', i['author']

And the result?
MCMT013:pipes ajh59$ python
Build an app to search Delicious using your voice with the Android App Inventor by Liam Green-Hughes
Digging Deeper into the Structure of My Twitter Friends Network: Librarian Spotting by Tony Hirst
Everyday I write the book by mweller

So there we have it.. Thanks to Greg, the first pass at a Yahoo Pipes to Python compiler…

PS Note to self… I noticed that the ‘truncate’ module isn’t supported, so as it’s a relatively trivial function, maybe I should see if I can write a compiler block to implement it…
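As a starter for ten, here’s a minimal sketch of what such a Truncate module might look like, following the generator style of the code pipe2py emits above; the conf field name (`count`) and its `{u'value': …}` shape are my guesses at what the compiler would pass in, not pipe2py’s actual conventions:

```python
def pipe_truncate(context, _INPUT, conf=None, **kwargs):
    # Yield at most `count` items from the input feed, then stop.
    # The conf structure mimics the {u'type': ..., u'value': ...} shape
    # seen in the generated code above, but the field name is a guess.
    if conf is None:
        conf = {}
    count = int(conf.get('count', {}).get('value', 10))
    for i, item in enumerate(_INPUT):
        if i >= count:
            break
        yield item
```

Because it only pulls as many items as it yields, a module like this plays nicely with the lazy, generator-chained style of the compiled pipes.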

PPS Greg has also started exploring how to export a pipe so that it can be run on Google App Engine: Running Yahoo! Pipes on Google App Engine

Who’s Using Mendeley in Your Institution?

In a couple of days, I’ll be at RepoFringe 2010, where the emphasis this year will be on “OPEN: Open Data; Open Access; Open Learning; Open Knowledge; Open Content; etc…” I’m writing this post (a scheduled post, written over the weekend) in advance of putting my presentation together – so I’m not sure what it’ll be on yet! – but to get myself into the swing of things I started looking at what some of the repository bloggers have been thinking about lately, with a view to maybe doing a quick hack inspired by one or more of their posts…

…and it didn’t take long to find an itch to scratch… In a couple of recent posts looking at the extent to which personal document and metadata collections using apps like Mendeley might be seen as a figure:ground complement to a repository, (Comparing Social Sharing of Bibliographic Information with Institutional Repositories, More on Mendeley and Repositories), Les Carr started to explore “the extent of Mendeley’s penetration into a University. What is visible is the public profiles that Mendeley users have created. Although the Mendeley API doesn’t allow searching for users, I have been able to identify 53 public profiles from the University of Cambridge through Google (and a lot of manual verification!)” [my emphasis].

Hmmm… Sounds like that was a bit of a chore… can we finesse an API for that, I wonder?

Mendeley - users by institution pipe

To see how I put this Pipe together, let’s see what Google gives us (I’m limiting the search to because that’s where I want to find the profiles):

Googling Mendeley profiles

Note that there are several useful things we can spot simply from inspection of the Google search results:

– user profile information on Mendeley is located on URLs of the form, so we can refine the site: search limit to take that into account (i.e. by using the limit);
– the institution name, if appropriately declared, appears in the page title, which provides the headline of each search result in Google’s results listings; so we can use search limits of the form intitle:”cambridge university”, or the more general intitle:cambridge
– sometimes (not shown in the image above), our search term appears in the title, but it’s the wrong one… So for example, if we have researchers in “Cambridge Massachusetts”, we may want to exclude results with Massachusetts in the title by negating an intitle limit: -intitle:massachusetts

Putting those techniques together, and to test things out, we should be able to search for members of our institution using something like: intitle:”cambridge” -intitle:massachusetts
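For what it’s worth, that query construction can be sketched as a small helper; the `site` argument stands in for the Mendeley profile URL pattern (which I haven’t reproduced here), and the function just strings together the site:, intitle: and -intitle: limits described above:

```python
def profile_search_query(site, include_titles, exclude_titles=()):
    # Assemble the Google query described above: a site: limit plus
    # quoted intitle: limits for wanted terms and negated -intitle:
    # limits for unwanted ones. `site` is a placeholder parameter.
    parts = ['site:' + site]
    parts += ['intitle:"%s"' % t for t in include_titles]
    parts += ['-intitle:%s' % t for t in exclude_titles]
    return ' '.join(parts)
```

So `profile_search_query('example.com/profiles/', ['cambridge'], ['massachusetts'])` gives the sort of query string we were testing by hand above.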

What else can we learn just by looking at the search results?

– if somebody’s surname matches the institution name, that may be returned as a result (e.g. Darren Cambridge). If we inspect the title, we see it has a regular structure: Name – Institution. Having got the results from Google, if we strip the name out of the title to leave just the affiliation, and then filter the results again to check that the search term appears in the affiliation, we can remove these false positives. (I have used this “double dip” search-then-filter approach in other contexts. For example, Paragraph Level Search Results on WordPress Using and Yahoo Pipes.)
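Sketching that “double dip” filter in Python (the exact “Name – Institution” separator character is an assumption about how Mendeley structures the title):

```python
def affiliation_matches(title, term):
    # Split "Name - Institution" at the first separator and test the
    # search term against the institution part only, so that surname
    # false positives (e.g. Darren Cambridge) get filtered out.
    parts = title.split(' - ', 1)
    affiliation = parts[1] if len(parts) > 1 else ''
    return term.lower() in affiliation.lower()
```

The key point is that the filter runs over the stripped affiliation, not the whole title.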

We’re now in a position to build a Yahoo pipe to create some sort of API to provide a Mendeley status search. A good way of getting Google search results into the Pipes environment is to use Google’s AJAX search API. The Google AJAX search API returns either four or eight results at a time, along with an indication of how many other “pages” of results there are, as well as an index count that identifies the index count of the first result on a page. (So for example, on the first page, the index of the first result is 0. For pages with four results, the index of the first result on the second page is 4, 8 on the third, and so on.) The first results page is complete – we actually get the search results listed. But the API also provides a list of the other results pages available, and the index of the first result on each page. To call results from the later pages, given the index of the first result, we use the additional URL argument &start=index.
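To make the paging behaviour concrete, here’s a rough helper for building the AJAX API request URL; v=1.0 and the start argument are as described above, but the space escaping is deliberately crude (a real version would URL-encode the query properly):

```python
def ajax_search_url(query, start=0):
    # Build a web search call against the Google AJAX Search API;
    # &start gives the index of the first result on the wanted page
    # (0 for the first page, then steps of 4 or 8 depending on the
    # number of results per page). Spaces are crudely escaped here.
    url = ('http://ajax.googleapis.com/ajax/services/search/web'
           '?v=1.0&q=' + query.replace(' ', '+'))
    if start:
        url += '&start=' + str(start)
    return url
```

Calling it once with start=0 gets the first page; the later pages then come from the start indices the API hands back.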

The first part of the Pipe constructs the AJAX API URL. The user inputs are a bit of a fudge (and the result of a bit of trial and error!) to try to support as clean a results set as possible by virtue of how we phrase the search query…

Here’s where we construct the URL, and then fetch the data:

Mendeley user profiles search

Just by the by, we can use another (unsaved) pipe to act as a previewer for AJAX search results:

Google AJAX search results

If we want our pipe to display the results from all the pages, we need to grab the list of responseData.cursor.pages and then generate the “more results” page for each one. So, grab the list of page and first result index data:

Google AJAX search API - getting paged results data

and then create a URL for each of these, before grabbing the results from each results page:

Grabbing paged results from google ajax search api in Yahoo pipes

Note that we are using the same query string that we used in the original search. (Also note that we only seem to get at most 64 responses; maybe the page list for pages later than the first page provides indices for more results? That is, maybe each results page only lists indices for at most 8 pages of results?)
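Outside the Pipes environment, pulling out that paging data from a parsed first-page response might look something like this (field names follow the responseData.cursor.pages structure described above; the start values arrive as strings in the JSON):

```python
def page_start_indices(response):
    # Walk responseData.cursor.pages in a parsed first-page response
    # and pull out the first-result index of each results page, ready
    # to be fed back in as &start= arguments on follow-up requests.
    cursor = response.get('responseData', {}).get('cursor', {})
    return [int(p['start']) for p in cursor.get('pages', [])]
```

Each returned index then parameterises one “more results” request, which is exactly what the loop block in the pipe is doing.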

Having got the search results, we rename the results attributes to generate valid RSS elements (title, description, link) and do a spot of post filtering.

Mendeley profile search pipe - post filter

Remember the case where the search result appeared because the institution name was actually someone’s surname? The Regular Expression block strips out the Mendeley user’s name and allows us to filter the results on just the affiliation, to remove those false positives (the filter lets through results where the institution search term appears in the title’s affiliation part):

The resulting pipe allows us to search for Mendeley users by institution:

Mendeley - users by institution pipe

(Having built the pipe, I think that an even more robust approach might be to tokenise the search terms required in the title and then add them as separate intitle: limits. So for example intitle:cambridge intitle:university would find pages where both Cambridge University and University of Cambridge appear in the page title.)

So that’s the pipe…

In many ways, it implements some sort of “stalker pattern” based on profile information that is released via title elements on personal profile pages. I’ve demonstrated a similar approach previously in A Couple of Twitter Search Tricks, which shows (courtesy of an update I added after a tweet from @daveyp) how to do a similar sort of search to find folk twittering with a university allegiance. In fact, here’s a pipe to do something that approximates to just that – Twitter profile search (via Google and Yahoo Pipes):

Twitter profile search pipe

A quick scout round other social networks shows that this is a trick we can use widely:

  • intitle:smith (try using different country codes for the subdomain to search different countries)
  • intitle:smith
  • “milton keynes” intitle:”25- Female” (the original demo, hence “the stalker pattern” epithet!)

Unfortunately, it’s not obvious how to search for anything other than name on Slideshare, or Scribd (that is, there is no obviously easy way of searching for members of an institution on Slideshare). This in turn suggests to me that if you are developing a site with a social element, and you want people to be able to use things like Google search to finesse additional, structured search functionality over the site (as in the Mendeley user profile search pipe), you should design title elements with all due consideration…

PS in his original post, Les Carr went on: “Incredibly, only TWO of those 53 researchers have any existing deposits in Cambridge’s institutional repository.” So maybe the next step would be to build some pipework to run Mendeley discovered users against corresponding institutional repositories?;-)