Appropriating Scraperwiki Views for Data Format Transformation – Google Spreadsheet CSV to GEXF
Unpicking the architecture of Scraperwiki, we see that it offers several hosting options:
- hosting for executable scraper scripts;
- a hosted database for storing scraped data;
- hosting for executable view pages that allow views of data to be published.
Although the intention of Views hosting is presumably to provide a view over data scraped using Scraperwiki itself, we can also use it simply as a hosted scripted environment for transforming data from one representation to another.
So for example, in the post "Using Google Spreadsheets to extract Google+ friend/follower information for analysis in NodeXL, Gephi", Martin Hawksey describes a way of scraping Google+ friends data into a Google spreadsheet, along with a recipe for getting the data out of Google Spreadsheets as a CSV formatted file and then into a form in which it can be loaded into NodeXL, a free social network analysis add-on package for Excel. In so doing, he also makes a request to the lazy web for how the data might be loaded into other graph analysis packages:
I’m not familiar with getting the data into tools like Gephi (other than via a NodeXL export) so if someone wants to post or link to something to do this for other tools leave a comment or get in touch. Instead here are some basic instructions for getting the data into NodeXL…
As I’ve been playing with Scraperwiki views lately, including using them to generate GEXF file representations of network data (see "Visualising Networks in Gephi via a Scraperwiki Exported GEXF File"), I thought I’d try to knock up a View that grabs data from Martin’s spreadsheet and puts it into GEXF form. You can find the result in the Google+ Net Spreadsheet Grapher (note that it sometimes seems to time out... oops).
For completeness, here’s the code [as updated by @mhawksey to include a parameter grab from the view URL… I was wondering how to do that?!;-)]:
```python
# GEXF from a Google spreadsheet, picking up on
# http://mashe.hawksey.info/2011/12/google-plus-network-info-to-nodexl
# Note: Python 2 syntax (print statement, urllib.urlopen, file.next())
import scraperwiki
import csv,urllib
import networkx as nx

#key='0AqGkLMU9sHmLdGNuQTdJWkJiQlQ5a21aclpBNmZQZnc'

#--via @mhawksey
# query string crib https://views.scraperwiki.com/run/python_querystring_cheat_sheet/?
#my defensive tweaks
import cgi, os
key=''  # default, so key is always defined even if no query string is passed
qstring=os.getenv("QUERY_STRING")
if qstring!=None:
    get = dict(cgi.parse_qsl(qstring))
    if 'key' in get:
        key=get['key']
#---

# Published-sheet CSV URLs: gid=1 holds the nodes, gid=0 holds the edges
stub='https://docs.google.com/spreadsheet/pub?key='+key+'&single=true&output=csv&gid='
nodesurl=stub+'1'
edgesurl=stub+'0'

nodesfile=urllib.urlopen(nodesurl)

DG=nx.DiGraph()

#utilities https://scraperwiki.com/scrapers/utility_library/
def vstr(s):
    if s:
        try:
            return unicode(s)
        except UnicodeDecodeError:
            return str(s)
    else:
        return u''

# remove any non ascii characters
def ascii(s):
    return "".join(i for i in s if ord(i)<128)
#===

#need to get rid of first row before the proper header row...
nodesfile.next()
nReader = csv.DictReader(nodesfile)
for nrow in nReader:
    #print nrow
    if nrow['ImageUrl']!=None:
        imgurl=nrow['ImageUrl']
    else:
        imgurl=''
    DG.add_node(nrow['Vertex'],label=ascii(nrow['Label']),name=ascii(nrow['Label']),imgurl=imgurl)

edgesfile=urllib.urlopen(edgesurl)
edgesfile.next()
eReader = csv.DictReader(edgesfile)
for erow in eReader:
    #print erow
    DG.add_edge(erow['Vertex 1'],erow['Vertex 2'])

# Serialise the graph as GEXF and return it as the view's XML response
import networkx.readwrite.gexf as gf
writer=gf.GEXFWriter(encoding='utf-8',prettyprint=True,version='1.1draft')
writer.add_graph(DG)

scraperwiki.utils.httpresponseheader("Content-Type", "text/xml")

from xml.etree.cElementTree import tostring
print tostring(writer.xml)
```
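For anyone wanting to try the same CSV to GEXF step away from Scraperwiki, here is a minimal Python 3 sketch of the idea. It is not the view code: it swaps networkx for a hand-built XML tree from the standard library, reads from two inline CSV strings rather than a published spreadsheet, and leaves out the image URL attribute. The sample data and the names `read_rows` and `csv_to_gexf` are made up for illustration. The column names and the extra row above the header follow the view code.

```python
import csv
import io
import xml.etree.ElementTree as ET

GEXF_NS = "http://www.gexf.net/1.2draft"

# Hypothetical stand-ins for the nodes and edges sheets; as in the view code,
# each has one extra row above the real header row.
NODES_CSV = "Vertices,,\nVertex,Label,ImageUrl\n1,Alice,\n2,Bob,\n"
EDGES_CSV = "Edges,\nVertex 1,Vertex 2\n1,2\n"


def read_rows(text):
    """Skip the first row, then read the rest as header-keyed dicts."""
    handle = io.StringIO(text)
    next(handle)  # drop the row above the proper header row
    return list(csv.DictReader(handle))


def csv_to_gexf(nodes_text, edges_text):
    """Return a GEXF document (as a string) for a directed graph."""
    root = ET.Element("gexf", {"xmlns": GEXF_NS, "version": "1.2"})
    graph = ET.SubElement(
        root, "graph", {"defaultedgetype": "directed", "mode": "static"}
    )
    nodes = ET.SubElement(graph, "nodes")
    for row in read_rows(nodes_text):
        ET.SubElement(nodes, "node", {"id": row["Vertex"], "label": row["Label"]})
    edges = ET.SubElement(graph, "edges")
    for index, row in enumerate(read_rows(edges_text)):
        ET.SubElement(
            edges,
            "edge",
            {"id": str(index), "source": row["Vertex 1"], "target": row["Vertex 2"]},
        )
    return ET.tostring(root, encoding="unicode")


gexf_text = csv_to_gexf(NODES_CSV, EDGES_CSV)
print(gexf_text)
```

Saving the printed output to a file with a .gexf extension should give something Gephi can open, though I have only checked that the XML parses, not every corner of the GEXF schema.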
So the take home is this. As well as providing a home for scrapers, is it appropriate to use Scraperwiki as a home for (examples of) data representation conversion scripts and (re)export formatters?
PS here’s another example of using a Scraperwiki View as a standalone item – Example Twitter hashtag user friendship network, which shows how to search Twitter for recent users of a hashtag, then generate a graph file representing the friend connections between the taggers.
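The graph-building step in that sort of view can be sketched roughly as follows. This is an illustration under assumptions, not the code from the linked view: the Twitter search and friend lookups are replaced by hypothetical hard-coded data, and `friendship_edges` is a name invented here. The idea is simply to keep only those friend links where both ends are users of the hashtag.

```python
def friendship_edges(taggers, friends_of):
    """Return directed (user, friend) pairs where both are hashtag users."""
    tagger_set = set(taggers)
    edges = []
    for user in taggers:
        for friend in friends_of.get(user, []):
            # Ignore friends who did not use the hashtag, and self-links
            if friend in tagger_set and friend != user:
                edges.append((user, friend))
    return edges


# Hypothetical data: who used the hashtag, and who each of them follows
taggers = ["ann", "bob", "cat"]
friends_of = {"ann": ["bob", "zed"], "bob": ["ann", "cat"], "cat": []}

edges = friendship_edges(taggers, friends_of)
print(edges)
```

The resulting edge list can then be written out in whatever graph file format the analysis tool expects.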