Appropriating Scraperwiki Views for Data Format Transformation – Google Spreadsheet CSV to GEXF

Unpicking the architecture of Scraperwiki, we see that it offers several hosting options:

– hosting for executable scraper scripts;
– a hosted database for storing scraped data;
– hosting for executable view pages that allow views of data to be published.

Although the intention of Views hosting is presumably to provide a view over data scraped using Scraperwiki itself, we can also use it simply as a hosted scripted environment for transforming data from one representation to another.

So for example, in the post Using Google Spreadsheets to extract Google+ friend/follower information for analysis in NodeXL, Gephi, Martin Hawksey describes a way of scraping Google+ friends data into a Google spreadsheet, along with a recipe for getting the data out of Google Spreadsheets as a CSV formatted file and then into a form in which it can be loaded into NodeXL, a free social network analysis add-on package for Excel. In so doing, he also makes a request to the lazy web for how the data might be loaded into other graph analysis packages:

I’m not familiar with getting the data into tools like Gephi (other than via a NodeXL export) so if someone wants to post or link to something to do this for other tools leave a comment or get in touch. Instead here are some basic instructions for getting the data into NodeXL…

As I’ve been playing with Scraperwiki views lately, including using them to generate GEXF file representations of network data (Visualising Networks in Gephi via a Scraperwiki Exported GEXF File), I thought I’d try to knock up a View that grabs data from Martin’s spreadsheet and puts it into GEXF form. You can find the result in the Google+ Net Spreadsheet Grapher (note that it sometimes seems to time out.. oops…).

For completeness, here’s the code [as updated by @mhawksey to include a paramter grab from the view URL… I was wondering how to do that?!;-)]:

#gexf from google spreadsheet, picking up on http://mashe.hawksey.info/2011/12/google-plus-network-info-to-nodexl

import scraperwiki
import csv,urllib
import networkx as nx

#key='0AqGkLMU9sHmLdGNuQTdJWkJiQlQ5a21aclpBNmZQZnc'

#--via @mhawksey
# query string crib https://views.scraperwiki.com/run/python_querystring_cheat_sheet/?
#my defensive tweaks
import cgi, os
qstring=os.getenv("QUERY_STRING")
if qstring!=None:
    get = dict(cgi.parse_qsl(qstring))
    if 'key' in get:
        key=get['key']
else:
    key=''
#---

stub='https://docs.google.com/spreadsheet/pub?key='+key+'&single=true&output=csv&gid='
nodesurl=stub+'1'
edgesurl=stub+'0'

nodesfile=urllib.urlopen(nodesurl)

DG=nx.DiGraph()

#utilities https://scraperwiki.com/scrapers/utility_library/
def vstr(s):
    if s:
        try:
            return unicode(s)
        except UnicodeDecodeError:
            return str(s)
    else:
        return u''
# remove any non ascii characters
def ascii(s): return "".join(i for i in s if ord(i)<128)
#===

#need to get rid of first row before the proper header row...
nodesfile.next()

nReader = csv.DictReader(nodesfile)
for nrow in nReader:
    #print nrow
    if nrow['ImageUrl']!=None:
        imgurl=nrow['ImageUrl']
    else: imgurl=''
    DG.add_node(nrow['Vertex'],label=ascii(nrow['Label']),name=ascii(nrow['Label']),imgurl=imgurl)

edgesfile=urllib.urlopen(edgesurl)
edgesfile.next()
eReader = csv.DictReader(edgesfile)
for erow in eReader:
    #print erow
    DG.add_edge(erow['Vertex 1'],erow['Vertex 2'])


import networkx.readwrite.gexf as gf

writer=gf.GEXFWriter(encoding='utf-8',prettyprint=True,version='1.1draft')
writer.add_graph(DG)

scraperwiki.utils.httpresponseheader("Content-Type", "text/xml")

from xml.etree.cElementTree import tostring
print tostring(writer.xml)

So the take home is this. As well as providing a home for scrapers, is it appropriate to use Scraperwiki as a home for (examples of) data representation conversion scripts and (re)export formatters?

PS here’s another example of using a Scraperwiki View as a standalone item – Example Twitter hashtag user friendship network, which shows how to search Twitter for recent users of a hashtag, then generate a graph file representing the friend connections between the taggers.

Author: Tony Hirst

I'm a Senior Lecturer at The Open University, with an interest in #opendata policy and practice, as well as general web tinkering...

One thought on “Appropriating Scraperwiki Views for Data Format Transformation – Google Spreadsheet CSV to GEXF”

  1. Right stop having fun with Scraperwiki ;)

    So that I don’t miss out too much here’s my Scraperwiki ‘hello world’ https://scraperwiki.com/views/googleplus_net_spreadsheet_grapher_1/

    This is a tweak on your existing script (not sure of protocol for forking/merging) What it does is play nicer with the Google+ Network Generator Spreadsheet (the one you were testing with was some post processed NodeXL output).

    I’ve also added some script to extract the spreadsheet key in the query string so you can call https://views.scraperwiki.com/run/googleplus_net_spreadsheet_grapher_1/?key=0AqGkLMU9sHmLdDd6NTNweFdKM2lnVW9LcVo4ak1CS3c and get the gexf

    I’m just waiting for you to wrap the gexf output in a html view with a graph render ;)

    Martin

Comments are closed.

%d bloggers like this: