[Update, Jun 2015 – it seems Yahoo Pipes is finally shutting down – see my obituary for it here: Reflections on the Closure of Yahoo Pipes.]
So you’ve got a whole bunch of Yahoo Pipes running some critical information feeds, but you’re fearful that Yahoo Pipes is going to disappear: what are you going to do? Or maybe you want to use Yahoo Pipes to harvest and process a data feed once a day and pop the results into an incremental data store, but you don’t run your own database. This post describes how the Pipe2Py Yahoo Pipes-to-Python code compiler, running inside the data harvesting tool Scraperwiki, may provide one way of solving your problem.
Over the years, I’ve created dozens and dozens of Yahoo Pipes, as well as advocating their use as a rapid prototyping environment for feed based processing, particularly amongst the #mashedlibrary/digital librarianship community. A Yahoo Pipes design carries several sorts of value, including: the algorithmic value of demonstrating a particular way of sourcing, filtering, processing, mixing and transforming one or more data feeds; and the operational value associated with actually running the pipe and publishing, syndicating or otherwise making direct use of its output.
Whilst I have tried to document elements of some of the pipework I have developed (check the pipework category on this blog, for example), many of the blog posts I have written around Yahoo Pipes have complemented the pipes in a documentation sense, rather than providing a necessary and sufficient description from which a pipe could be exactly recreated. (That is, to make full sense of the posts, you often had to have access to the “source” of the pipe as well…)
To try to mitigate the loss of Yahoo Pipes as an essential complement to many OUseful.info posts, I have from time to time explored the idea of a Yahoo Pipes Documentation Project (countering the risk of algorithmic loss), as well as the ability to export and run equivalent or “compiled” versions of Yahoo Pipes on an arbitrary server (protecting against operational loss). The ability to generate routines with behaviour equivalent to any given Yahoo Pipe also made sense in the face of perceived concerns “from IT” about the stability of the Yahoo Pipes platform (from time to time, it has been very shaky!) as well as its long term availability. Whilst my attitude was typically along the lines of “if you hack something together in Yahoo Pipes that does at least something of what you want, at least you can make use of it in the short term”, I was also mindful that when applications become the basis of a service, they may never be looked at again so long as the service appears to be working, and other things may come to depend or otherwise rely on them. As far as I am aware, the Pipe2Py project, developed by Greg Gaughan, has for some time been the best bet when it comes to generating standalone programmes that are functionally equivalent to a wide variety of Yahoo Pipes.
As Yahoo again suffers from a round of redundancies, I thought it about time that I reconsidered my own preservation strategy with respect to the possible loss of Yahoo Pipes…
Some time ago, I persuaded @frabcus to make the pipe2py library available on Scraperwiki, but to my shame never did anything with it. So today, I thought I’d better address that. Building on the script I linked to from Just in Case – Saving Your Yahoo Pipes…, I put together a simple Scraperwiki script that grabs the JSON descriptions of my public/published pipes and pops them into a Scraperwiki database (Scraperwiki: pipe2py test):
import scraperwiki, urllib, json, simplejson, sys

def getPipesJSON(id, name):
    # Use YQL to grab the JSON definition of the pipe with the given ID
    url = ("""http://query.yahooapis.com/v1/public/yql"""
           """?q=select%20PIPE.working%20from%20json%20"""
           """where%20url%3D%22http%3A%2F%2Fpipes.yahoo.com%2Fpipes%2Fpipe.info%3F_out%3Djson%26_id%3D""" + id + """%22&format=json""")
    pjson = urllib.urlopen(url).readlines()
    pjson = "".join(pjson)
    pipe_def = json.loads(pjson)
    # Preserve the raw JSON description in a Scraperwiki database table
    scraperwiki.sqlite.save(unique_keys=['id'], table_name='pipes',
                            data={'id': id, 'pjson': pjson, 'title': name})
    if not pipe_def['query']['results']:
        print "Pipe not found"
        sys.exit(1)
    pjson = pipe_def['query']['results']['json']['PIPE']['working']
    return pjson

#-------

def getPipesPage(uid, pageNum):
    # Fetch one page of the (paginated) list of a user's published pipes
    print 'getting', uid, pageNum
    pipesFeed = 'http://pipes.yahoo.com/pipes/person.info?_out=json&display=pipes&guid=' + uid + '&page=' + str(pageNum)
    feed = simplejson.load(urllib.urlopen(pipesFeed))
    return feed

def userPipesExport(uid):
    # Page through the user's published pipes, saving the definition of each one
    page = 1
    scrapeit = True
    while scrapeit:
        feeds = getPipesPage(uid, page)
        print feeds
        if feeds['value']['hits'] == 0:
            scrapeit = False
        else:
            for pipe in feeds['value']['items']:
                id = pipe['id']
                tmp = getPipesJSON(id, pipe['title'])
            page = page + 1

#Yahoo pipes user ID
uid = 'PQULC4LQ3N5R4UGNFCLD4BULUQ'
userPipesExport(uid)
To export your own public pipe definitions, clone the scraper, replace my Yahoo Pipes user ID (uid) with your own, and run it…
Having preserved the JSON descriptions within a Scraperwiki database, the next step was to demonstrate the operationalisation of a preserved pipe. The example view at pipe2py – test view [code] demonstrates how to look up the JSONic description of a Yahoo Pipe, as preserved in a Scraperwiki database table, compile it, execute it, and print out the result of running the pipe.
import scraperwiki, json, sys
from pipe2py import compile, Context

pipeid = '2de0e4517ed76082dcddf66f7b218057'

def getpjsonFromDB(id):
    # Look up the preserved JSON definition of the pipe in the pipe2py_test scraper's database
    scraperwiki.sqlite.attach('pipe2py_test')
    q = '* FROM "pipes" WHERE "id"="' + id + '"'
    data = scraperwiki.sqlite.select(q)
    #print data
    pipe_def = json.loads(data[0]['pjson'])
    if not pipe_def['query']['results']:
        print "Pipe not found"
        sys.exit(1)
    pjson = pipe_def['query']['results']['json']['PIPE']['working']
    return pjson

pjson = getpjsonFromDB(pipeid)

# Compile the pipe definition into an executable form, run it,
# and print out each item of the result as a simple HTML fragment
p = compile.parse_and_build_pipe(Context(), pjson)
for i in p:
    #print 'as',i
    print '<a href="' + i['link'] + '">' + i['title'] + '</a><br/>', i['summary_detail']['value'] + '<br/><br/>'
The examplePipeOutput() function in the pipes preservation Scraperwiki scraper (rather than the view) provides another example of how to compile and execute a pipe, this time by directly loading in its description from Yahoo Pipes, given its ID.
To preview the output of one of your own pipes by grabbing the pipe description from Yahoo Pipes, compiling it locally and then running the locally compiled version, here’s an example (pipe2py – pipe execution preview):
#Example of how to grab a pipe definition from Yahoo Pipes, compile and execute it, and preview its (locally obtained) output
import scraperwiki, json, urllib, sys
from pipe2py import compile, Context

pipeid = '2de0e4517ed76082dcddf66f7b218057'

def getPipesJSON(id):
    # Use YQL to grab the JSON definition of the pipe with the given ID
    url = ("""http://query.yahooapis.com/v1/public/yql"""
           """?q=select%20PIPE.working%20from%20json%20"""
           """where%20url%3D%22http%3A%2F%2Fpipes.yahoo.com%2Fpipes%2Fpipe.info%3F_out%3Djson%26_id%3D""" + id + """%22&format=json""")
    pjson = urllib.urlopen(url).readlines()
    pjson = "".join(pjson)
    pipe_def = json.loads(pjson)
    if not pipe_def['query']['results']:
        print "Pipe not found"
        sys.exit(1)
    pjson = pipe_def['query']['results']['json']['PIPE']['working']
    return pjson

pjson = getPipesJSON(pipeid)

# Compile the pipe definition locally, run it, and preview the output as HTML
p = compile.parse_and_build_pipe(Context(), pjson)
for i in p:
    #print 'as',i
    print '<a href="' + i['link'] + '">' + i['title'] + '</a><br/>', i['summary_detail']['value'] + '<br/><br/>'
To try it with a pipe of your own (no actual scraper required…), clone the view and replace the pipe ID with a (published) pipe ID of your own…
(If you want to publish an RSS feed from a view, see for example the httpresponseheader cribs in Exporting and Displaying Scraperwiki Datasets Using the Google Visualisation API.)
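By way of a quick illustration, here’s a minimal sketch of how a view might serve the output of a locally compiled pipe as RSS rather than HTML (assuming p is the compiled pipe from the view code above; the channel title, link and description are purely illustrative placeholders):

# Minimal sketch: serve the output of a compiled pipe as RSS from a Scraperwiki view
# (assumes p is the compiled pipe iterator from the view code shown earlier)
import scraperwiki
from cgi import escape

# Tell Scraperwiki to serve this view as XML rather than the default HTML
scraperwiki.utils.httpresponseheader("Content-Type", "text/xml")

print '<?xml version="1.0" encoding="UTF-8"?>'
print '<rss version="2.0"><channel>'
print '<title>Preserved pipe output</title>'  # illustrative placeholder
print '<link>http://example.com/</link>'  # illustrative placeholder
print '<description>Output of a locally compiled Yahoo Pipe</description>'
for i in p:
    print '<item>'
    print '<title>' + escape(i['title']) + '</title>'
    print '<link>' + escape(i['link']) + '</link>'
    print '<description>' + escape(i['summary_detail']['value']) + '</description>'
    print '</item>'
print '</channel></rss>'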
Note that this is all very much a work in progress, both at the code level and the recipe level, so if you have any ideas about how to take it forward, or spot any bugs in the compilation of any pipes you have preserved, please let me know via the comments or, in the case of pipe2py, by filing an issue on github (maybe even posting a bugfix?!;-) and talking nicely to Greg:-) (I fear that my Python skills aren’t up to patching pipe2py!) Also note that I’m not sure what the Scraperwiki policy is with respect to updating third party libraries, so if you do make any contributions to the pipe2py project, @frabcus may need a heads-up regarding updating the library on Scraperwiki ;-)
PS note that the pipe2py library may still be incomplete (i.e. not all of the Yahoo Pipes blocks may be implemented as yet). In addition, I suspect that some workarounds are required in order to run pipes that contain other, embedded custom pipes. (The embedded pipes need compiling first.) I haven’t yet: a) tried, b) worked out how to handle these in the Scraperwiki context. (If you figure it out before I do, please post a howto in the comments;-)
Also note that at the current time the exporter will only export published pipes associated with a specific user ID. To get the full list of pipes for a user (i.e. including unpublished pipes), I think you need to be authenticated as that user? Any workarounds you can come up with for this would be much appreciated ;-)
PPS One of the things that Yahoo Pipes doesn’t offer is the ability to preserve the output of a pipe. By hosting the executable version of a pipe on Scraperwiki, it is easy enough to create a scheduled scraper that loads in the JSON definition of a pipe (for example, via a query by ID onto a database table containing pipe descriptions), compiles it in the currently running process, executes the pipe, and then pops the results into another Scraperwiki database table.
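For what it’s worth, here’s an untested sketch of the sort of thing I mean, reusing the getpjsonFromDB() pattern from the view above (the pipe_output table name and the choice of link as the unique key are my own arbitrary assumptions):

# Sketch of a scheduled scraper that runs a preserved pipe and archives its output
import scraperwiki, json, sys
from pipe2py import compile, Context

pipeid = '2de0e4517ed76082dcddf66f7b218057'

def getpjsonFromDB(id):
    # Pull the preserved pipe definition out of the pipes table by ID
    scraperwiki.sqlite.attach('pipe2py_test')
    data = scraperwiki.sqlite.select('* FROM "pipes" WHERE "id"="' + id + '"')
    pipe_def = json.loads(data[0]['pjson'])
    if not pipe_def['query']['results']:
        print "Pipe not found"
        sys.exit(1)
    return pipe_def['query']['results']['json']['PIPE']['working']

# Compile the pipe into the currently running process and execute it
p = compile.parse_and_build_pipe(Context(), getpjsonFromDB(pipeid))

# Pop each result item into another table, keyed on link so that scheduled
# reruns only add items we haven't already seen (an incremental data store)
for i in p:
    scraperwiki.sqlite.save(unique_keys=['link'], table_name='pipe_output',
                            data={'link': i['link'], 'title': i['title'],
                                  'summary': i['summary_detail']['value']})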