Exporting Yahoo Pipe Definitions, Compiling Them to Python, and Running Them in Scraperwiki

[Update Jun 2015 – it seems Yahoo Pipes is finally shutting down – see my obituary for it here: Reflections on the Closure of Yahoo Pipes.]

So you’ve got a whole bunch of Yahoo Pipes running some critical information feeds, but you’re fearful that Yahoo Pipes is going to disappear: what are you going to do? Or maybe you want to use Yahoo Pipes to harvest and process a data feed once a day and pop the results into an incremental data store, but you don’t run your own database. This post describes how Pipe2Py, a Yahoo Pipes-to-Python code compiler, running inside the data harvesting tool Scraperwiki, may provide one way of solving your problem.

Over the years, I’ve created dozens and dozens of Yahoo Pipes, as well as advocating their use as a rapid prototyping environment for feed-based processing, particularly amongst the #mashedlibrary/digital librarianship community. There are several sorts of value associated with any given Yahoo Pipes design, including: the algorithmic value of demonstrating a particular way of sourcing, filtering, processing, mixing and transforming one or more data feeds; and the operational value associated with actually running the pipe and publishing, syndicating or otherwise making direct use of its output.

Whilst I have tried to document elements of some of the pipework I have developed (check the pipework category on this blog, for example), many of the blog posts I have written around Yahoo Pipes complement the pipes in a documentation sense, rather than providing a necessary and sufficient explanation from which a pipe could be specifically recreated. (That is, to make full sense of the posts, you often had to have access to the “source” of the pipe as well…)

To try to mitigate the loss of Yahoo Pipes as an essential complement to many OUseful.info posts, I have from time to time explored the idea of a Yahoo Pipes Documentation Project (countering the risk of algorithmic loss), as well as the ability to export and run equivalent or “compiled” versions of Yahoo Pipes on an arbitrary server (protecting against operational loss). The ability to generate routines with equivalent behaviour to any given Yahoo Pipe also made sense in the face of perceived concerns “from IT” about the stability of the Yahoo Pipes platform (from time to time, it has been very shaky!) as well as its long-term availability. Whilst my attitude was typically along the lines of “if you hack something together in Yahoo Pipes that does at least something of what you want, at least you can make use of it in the short term”, I was also mindful that once an application becomes the basis of a service, it may never be looked at again so long as the service appears to be working, and other things may come to depend or otherwise rely on it. As far as I am aware, the Pipe2Py project, developed by Greg Gaughan, has for some time been the best bet when it comes to generating standalone programmes that are functionally equivalent to a wide variety of Yahoo Pipes.

As Yahoo suffers another round of redundancies, I thought it about time to reconsider my own preservation strategy with respect to the possible loss of Yahoo Pipes…

Some time ago, I persuaded @frabcus to make the pipe2py library available on Scraperwiki, but to my shame I never did anything with it. So today, I thought I’d better address that. Building on the script I linked to from Just in Case – Saving Your Yahoo Pipes…, I put together a simple Scraperwiki script that grabs the JSON descriptions of my public/published pipes and pops them into a Scraperwiki database (Scraperwiki: pipe2py test):

import scraperwiki, urllib, json, simplejson, sys

def getPipesJSON(id, name):
    # Use YQL to pull back the JSON definition (the PIPE.working element)
    # of a published pipe, given its ID
    url = ("""http://query.yahooapis.com/v1/public/yql"""
               """?q=select%20PIPE.working%20from%20json%20"""
               """where%20url%3D%22http%3A%2F%2Fpipes.yahoo.com%2Fpipes%2Fpipe.info%3F_out%3Djson%26_id%3D"""
               + id + 
               """%22&format=json""")
    pjson = urllib.urlopen(url).readlines()
    pjson = "".join(pjson)
    pipe_def = json.loads(pjson)
    # Archive the raw JSON response in the datastore, keyed on the pipe ID
    scraperwiki.sqlite.save(unique_keys=['id'], table_name='pipes', data={'id':id, 'pjson':pjson, 'title':name})
    if not pipe_def['query']['results']:
        print "Pipe not found"
        sys.exit(1)
    pjson = pipe_def['query']['results']['json']['PIPE']['working']
    return pjson

#-------
def getPipesPage(uid, pageNum):
    # Grab one page of the list of published pipes for a given user
    print 'getting', uid, pageNum
    pipesFeed = 'http://pipes.yahoo.com/pipes/person.info?_out=json&display=pipes&guid=' + uid + '&page=' + str(pageNum)
    feed = simplejson.load(urllib.urlopen(pipesFeed))
    return feed

def userPipesExport(uid):
    # Page through a user's published pipes, archiving each one in turn
    page = 1
    scrapeit = True

    while scrapeit:
        feeds = getPipesPage(uid, page)
        print feeds
        if feeds['value']['hits'] == 0:
            # No more results, so stop
            scrapeit = False
        else:
            for pipe in feeds['value']['items']:
                getPipesJSON(pipe['id'], pipe['title'])
            page = page + 1

#Yahoo pipes user ID
uid='PQULC4LQ3N5R4UGNFCLD4BULUQ'

userPipesExport(uid)

To export your own public pipe definitions, clone the scraper, replace my Yahoo Pipes user id (uid) with your own, and run it…

Having preserved the JSON descriptions within a Scraperwiki database, the next step was to demonstrate the operationalisation of a preserved pipe. The example view at pipe2py – test view [code] demonstrates how to look up the JSONic description of a Yahoo Pipe, as preserved in a Scraperwiki database table, compile it, execute it, and print out the result of running the pipe.

import scraperwiki, json, sys

from pipe2py import compile, Context

pipeid='2de0e4517ed76082dcddf66f7b218057'

def getpjsonFromDB(id):
    # Look up the archived JSON definition of a pipe in the
    # pipe2py_test datastore created by the exporter
    scraperwiki.sqlite.attach( 'pipe2py_test' )
    q = '* FROM "pipes" WHERE "id"="'+id+'"'
    data = scraperwiki.sqlite.select(q)
    #print data
    pipe_def = json.loads(data[0]['pjson'])
    if not pipe_def['query']['results']:
        print "Pipe not found"
        sys.exit(1)
    pjson = pipe_def['query']['results']['json']['PIPE']['working']
    return pjson

pjson=getpjsonFromDB(pipeid)

# Compile the pipe definition, run it, and render each emitted item
p = compile.parse_and_build_pipe(Context(), pjson)
for i in p:
    #print 'as',i
    print '<a href="'+i['link']+'">'+i['title']+'</a><br/>',i['summary_detail']['value']+'<br/><br/>'

The examplePipeOutput() function in the pipes preservation Scraperwiki scraper (rather than the view) provides another example of how to compile and execute a pipe, this time by directly loading in its description from Yahoo Pipes, given its ID.

To preview the output of one of your own pipes by grabbing the pipe description from Yahoo Pipes, compiling it locally and then running the locally compiled version, here’s an example (pipe2py – pipe execution preview):

#Example of how to grab a pipe definition from Yahoo pipes, compile and execute it, and preview its (locally obtained) output

import scraperwiki, json, urllib, sys

from pipe2py import compile, Context

pipeid='2de0e4517ed76082dcddf66f7b218057'

def getPipesJSON(id):
    # Use YQL to pull back the JSON definition (the PIPE.working element)
    # of a published pipe, given its ID
    url = ("""http://query.yahooapis.com/v1/public/yql"""
               """?q=select%20PIPE.working%20from%20json%20"""
               """where%20url%3D%22http%3A%2F%2Fpipes.yahoo.com%2Fpipes%2Fpipe.info%3F_out%3Djson%26_id%3D"""
               + id + 
               """%22&format=json""")
    pjson = urllib.urlopen(url).readlines()
    pjson = "".join(pjson)
    pipe_def = json.loads(pjson)
    if not pipe_def['query']['results']:
        print "Pipe not found"
        sys.exit(1)
    pjson = pipe_def['query']['results']['json']['PIPE']['working']
    return pjson


pjson=getPipesJSON(pipeid)

# Compile the pipe definition, run it, and render each emitted item
p = compile.parse_and_build_pipe(Context(), pjson)
for i in p:
    #print 'as',i
    print '<a href="'+i['link']+'">'+i['title']+'</a><br/>',i['summary_detail']['value']+'<br/><br/>'

To try it with a pipe of your own (no actual scraper required…), clone the view and replace the pipe ID with a (published) pipe ID of your own…

(If you want to publish an RSS feed from a view, see for example the httpresponseheader cribs in Exporting and Displaying Scraperwiki Datasets Using the Google Visualisation API.)
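By way of illustration, here’s a minimal sketch of what such a view might look like, reusing the lookup-and-compile pattern from the view above; the scraperwiki.utils.httpresponseheader call sets the content type, and the feed metadata (title, link, description) is made up for the purposes of the example:

import scraperwiki, json, cgi

from pipe2py import compile, Context

pipeid='2de0e4517ed76082dcddf66f7b218057'

#Serve this view as XML rather than HTML
scraperwiki.utils.httpresponseheader("Content-Type", "text/xml")

#Look up the preserved pipe definition and compile it, as above
scraperwiki.sqlite.attach('pipe2py_test')
data = scraperwiki.sqlite.select('* FROM "pipes" WHERE "id"="'+pipeid+'"')
pjson = json.loads(data[0]['pjson'])['query']['results']['json']['PIPE']['working']
p = compile.parse_and_build_pipe(Context(), pjson)

#Render the pipe output as a minimal RSS 2.0 feed
print '<?xml version="1.0"?><rss version="2.0"><channel>'
print '<title>Preserved pipe output</title><link>http://pipes.yahoo.com/</link><description>Output of a compiled pipe</description>'
for i in p:
    print '<item><title>'+cgi.escape(i['title'])+'</title><link>'+i['link']+'</link><description>'+cgi.escape(i['summary_detail']['value'])+'</description></item>'
print '</channel></rss>'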

Note that this is all very much a work in progress, both at the code level and the recipe level, so if you have any ideas about how to take it forward, or spot any bugs in the compilation of any pipes you have preserved, please let me know via the comments or, in the case of pipe2py, by filing an issue on github (maybe even posting a bugfix?!;-) and talking nicely to Greg:-) (I fear that my Python skills aren’t up to patching pipe2py!) Also note that I’m not sure what the Scraperwiki policy is with respect to updating third party libraries, so if you do make any contributions to the pipe2py project, @frabcus may need a heads-up regarding updating the library on Scraperwiki ;-)

PS note that the pipe2py library may still be incomplete (i.e. not all of the Yahoo Pipes blocks may be implemented as yet). In addition, I suspect that there are some workarounds required in order to run pipes that contain other, embedded custom pipes. (The embedded pipes need compiling first.) I haven’t yet: a) tried, b) worked out how to handle these in the Scraperwiki context. (If you figure it out before I do, please post a howto in the comments;-)

Also note that at the current time the exporter will only export published pipes associated with a specific user ID. To get the full list of pipes for a user (i.e. including unpublished pipes), I think you need to be authenticated as that user? Any workarounds you can come up with for this would be much appreciated ;-)

PPS One of the things that Yahoo Pipes doesn’t offer is the ability to preserve the output of a pipe. By hosting the executable version of a pipe on Scraperwiki, it is easy enough to create a scheduled scraper that loads in the JSON definition of a pipe (for example, via a query by ID on a database table that stores pipe descriptions), compiles it into the currently running process, runs the pipe, and then pops the results into another Scraperwiki database table.
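For example, a minimal sketch of what such a scheduled scraper might look like (the pipe_output table name and choice of saved fields are my own assumptions; it reuses the pipes table populated by the exporter above):

import scraperwiki, json

from pipe2py import compile, Context

pipeid='2de0e4517ed76082dcddf66f7b218057'

#Load the preserved pipe definition from the pipes table, as before
scraperwiki.sqlite.attach('pipe2py_test')
data = scraperwiki.sqlite.select('* FROM "pipes" WHERE "id"="'+pipeid+'"')
pjson = json.loads(data[0]['pjson'])['query']['results']['json']['PIPE']['working']

#Compile the definition into the currently running process and execute it
p = compile.parse_and_build_pipe(Context(), pjson)

#Incrementally archive the output, keyed on item link to avoid duplicates
for i in p:
    scraperwiki.sqlite.save(unique_keys=['link'], table_name='pipe_output',
        data={'link':i['link'], 'title':i['title'], 'summary':i['summary_detail']['value']})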

Just in Case – Saving Your Yahoo Pipes…

Yahoo is laying off again, so just in case, if you’re a user of Yahoo Pipes, it may be worth exporting the “source code” of your pipes or any pipes that you make frequent use of in case the Yahoo Pipes service gets cut.

Why? Well, a little known fact about Yahoo Pipes is that you can get hold of a JSON representation of a pipe that describes how the pipe is constructed…
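(Here’s a minimal sketch of grabbing that representation directly for a published pipe ID; the shape of the response, with the definition in the PIPE.working element, is inferred from the YQL query used in the exporter above:)

#Fetch the raw JSON definition of a (published) pipe from the pipe.info endpoint
import urllib, json

pipeid='2de0e4517ed76082dcddf66f7b218057'
url = 'http://pipes.yahoo.com/pipes/pipe.info?_out=json&_id='+pipeid
pipe_def = json.loads(urllib.urlopen(url).read())

#The PIPE.working element holds the pipe's "source code"
print pipe_def['PIPE']['working']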

…and some time ago, Greg Gaughan started working on a script that allows you to “compile” these descriptions of your Yahoo Pipes into Python programming code that can be run as a standalone programme on your own server: Pipe2Py. (Greg also did a demo that allowed Pipes to be “migrated” to a version of Pipe2Py running on Google App Engine.)
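To give a flavour, here’s a minimal sketch of running a saved definition through pipe2py in-process (mypipe.json is a hypothetical file holding an exported PIPE.working definition; the parse_and_build_pipe call is the same one used in the Scraperwiki examples above):

#Sketch: compile and run a locally saved pipe definition with pipe2py
#mypipe.json is a hypothetical file containing the exported PIPE.working definition
import json
from pipe2py import compile, Context

pipe_def = json.load(open('mypipe.json'))
p = compile.parse_and_build_pipe(Context(), pipe_def)
for item in p:
    print item['title']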

From a quick skim over the Pipes service, it seems you can get hold of a list of published pipes for a user easily enough, which means we can get a quick dump of the “source code” of all the published pipes for a given user (and then maybe compile them to Python using pipe2py so we can essentially keep the functionality running…). Here’s a first pass at a bulk exporter: Yahoo Pipes Exporter (published pipes by user).

To get a full list of pipes by user, I think you need to be logged in as that user?

See also: Yahoo Pipes Code Generator (Python): Pipe2Py

PS do I need to start worrying about flickr too?!

Backup and Run Yahoo Pipes Pipework on Google App Engine

Wouldn’t it be handy if you could use Yahoo Pipes as a code-free rapid prototyping environment, then export the code and run it on your own server, or elsewhere in the cloud? Well now you can, using Greg Gaughan’s pipe2py and the Google App Engine “Pipes Engine” app.

As many readers of OUseful.info will know, I’m an advocate of using the visual editor/drag and drop feed-oriented programming application that is Yahoo Pipes. Some time ago, I asked the question What Happens if Yahoo Pipes Dies?, partly in response to concerns raised at many of the Pipes workshops I’ve delivered about the sustainability, as well as the reliability, of the Yahoo Pipes platform.

A major issue was that the “programmes” developed in the Pipes environment could only run in that environment. As I learned from pipes guru @hapdaniel, however, it is possible to export a JSON representation of a pipe and so at least grab some sort of copy of a pipes programme. This led to me doodling some ideas around a Yahoo Pipes Documentation Project, which would let you essentially export a functional specification of a pipe. (I think the code for this appears to have since rotted or otherwise broken?:-()

This in turn led naturally to Starting to Think About a Yahoo Pipes Code Generator, whereby we could take a description of a pipe and generate a code equivalent version from it.

Greg Gaughan took up the challenge with Pipe2Py (described here) to produce a pipes compiler capable of generating and running Python equivalents of a Yahoo pipe (not all Pipes blocks are implemented yet, but it works well for simple pipes).

And now Greg has gone a step further, by hosting pipe2py on Google App engine so you can make a working Python backup of a pipe in that environment, and run it: Running Yahoo! Pipes on Google App Engine.

As with pipe2py, it won’t work for every Yahoo pipe (yet!), but you should be okay with simpler pipes. (Support for more blocks is being added all the time, and implementations of currently supported blocks also get upgraded as and when any issues are found with them. If you have a problem, or a suggestion for a missing block, add a comment on Greg’s blog;-)

(Looking back over my old related posts, deploying to Google Apps also seems to be supported by Zoho Creator.)

Quite by chance, @paulgeraghty tweeted a link to an old post by @progrium on the topic of “POSS == Public Open Source Services: … or User Powered Self-sustaining Cloud-based Services of Open Source Software”:

How many useful bits of cool plumbing are made and abandoned on the web because people realize there’s no true business case for it? And by business case, I mean make sense to be able to turn a profit or at least enough to pay the people involved. Even as a lifestyle business, it still has to pay for at least one person … which is a lot! But forget abandoned … how much cool tech isn’t even attempted because there is an assumption that in order for it to survive and be worth the effort, there has to be a business? Somebody has to pay for hosting! Alternatively, what if people built cool stuff because it’s just cool? Or useful (but not useful enough to get people to pay — see Twitter)?

Well this is common in open source. A community driven by passion and wanting to build cool/useful stuff. A lot of great things have come from open source. But open source is just that … source. It’s not run. You have to run it. How do you get the equivalent of open source for services? This is a question I’ve been trying to figure out for years. But it’s all coming together now …

Enter POSS

POSS is an extension of open source. You start with some software that provides a service (we’ll just say web service … so it can be a web app or a web API, whatever — it runs “in the cloud”). The code is open source. Anybody can fix bugs or extend it. But there is also a single canonical instance of this source, running as a service in the cloud. Hence the final S … but it’s a public service. Made for public benefit. That’s it. Not profit. Just “to be useful.” Like most open source.

Hmmm….. ;-)