OUseful.Info, the blog…

Trying to find useful things to do with emerging technologies in open education

Posts Tagged ‘datajourn’

Doodlings Around the Data Driven Journalism Round Table Event Hashtag Community

…got that?;-) Or in other words, this is a post looking at some visualisations of the #ddj hashtag community…

ddj - PDF export

A couple of days ago, I was fortunate enough to attend a Data Driven Journalism round table (sort of!) event organised by the European Journalism Centre. I’ll try to post some notes on it, err, sometime; but for now, here’s a quick collection of some of the various things I’ve tinkered with around hashtag communities, using #ddj as an example, as a “note to self” that I really should pull these together somehow, or at least automate some of the bits of workflow; I also need to move away from Twitter’s Basic Auth (which gets switched off this week, I think?) to oAuth*…

*At least Twitter is offering a single access token which “is ideal for applications migrating to OAuth with single-user use cases”. Rather than having to request key and secret values in an oAuth handshake, you can just grab them from the Twitter Application Control Panel. Which means I should be able to just plug them into a handful of Tweepy commands:
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(key, secret)
api = tweepy.API(auth)

So, first up is the hashtag community view showing how an individual is connected to the set of people using a particular hashtag (and which at the moment only works for as long as Twitter search turns up users around the hashtag…)

Hashtag community

Having got a Twapperkeeper API key, I guess I really should patch this to allow a user to explore their hashtag community based on a Twapperkeeper archive (like the #ddj archive)…

One thing the hashtag community page doesn’t do is explore any of the structure within the network… For that, I currently have a multi-step process:

1) get a list of people using a hashtag (recipe: Who’s Tweeting our hashtag?) using this Yahoo Pipe and output the data as CSV using a URL with the pattern (replace ddj with your desired hashtag):
http://pipes.yahoo.com/pipes/pipe.run?_id=1ec044d91656b23762d90b64b9212c2c&_render=csv&q=%23ddj

2) take the column of twitter homepage URLs of people tweeting the hashtag and replace http://twitter.com/ with nothing to give the list of Twitter usernames using the hashtag.
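(If you’d rather not do the cut’n’paste in a spreadsheet, steps 1 and 2 can be glued together with a few lines of Python. This is just a sketch: it assumes the pipe’s CSV output includes a column – called url here, which may not be its actual name – containing the Twitter homepage addresses, and it writes out the hashtaggers.txt file the next script reads.)

import csv, urllib2

PIPE_CSV = "http://pipes.yahoo.com/pipes/pipe.run?_id=1ec044d91656b23762d90b64b9212c2c&_render=csv&q=%23ddj"

# pull the CSV output of the pipe and strip http://twitter.com/ off the homepage URLs
reader = csv.DictReader(urllib2.urlopen(PIPE_CSV))
usernames = set()
for row in reader:
  homepage = row.get('url', '')  # assumption: the homepage URL column is called 'url'
  if homepage.startswith('http://twitter.com/'):
    usernames.add(homepage.replace('http://twitter.com/', ''))

# write the usernames out to the file used in the next step
f = open('hashtaggers.txt', 'w')
for name in usernames:
  f.write(name + '\n')
f.close()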

3) run a script that finds the twitter numerical IDs of users given their Twitter usernames listed in a simple text file:

import tweepy

# Basic Auth credentials - this needs switching to OAuth (see above)
auth = tweepy.BasicAuthHandler("????", "????")
api = tweepy.API(auth)

f = open('hashtaggers.txt')
f2 = open('hashtaggersIDs.txt', 'w')
f3 = open('hashtaggersIDnName.txt', 'w')

for uid in f:
  uid = uid.strip()  # drop the trailing newline so the lookup and the CSV rows are clean
  if not uid:
    continue
  user = api.get_user(uid)
  print uid
  f2.write(str(user.id) + '\n')
  f3.write(uid + ',' + str(user.id) + '\n')

f.close()
f2.close()
f3.close()

Note: this a) is Python; b) uses tweepy; c) needs changing to use oAuth.

4) run another script (gist here – note this code is far from optimal or even well behaved; it needs refactoring, and also tweaking so it plays nicely with the Twitter API) that gets lists of the friends and the followers of each hashtagger from their Twitter ID and writes these to CSV files in a variety of ways. In particular, for each list (friends and followers), it generates three files where the edges represent: i) links between an individual and other hashtaggers (“inner” edges within the community); ii) links between a hashtagger and non-hashtaggers (“outside” edges from the community); iii) links between a hashtagger and both hashtaggers and non-hashtaggers.
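(The actual code is in the gist linked above; purely for illustration, here’s a rough sketch of the sort of thing it does for the friends lists. It ignores cursoring and rate limiting, assumes the friends_ids call behaves as in the version of tweepy I was using, and the output filenames are made up.)

import tweepy

auth = tweepy.BasicAuthHandler("????", "????")  # again, needs moving to OAuth
api = tweepy.API(auth)

# the numerical IDs of the hashtaggers, as generated in step 3
hashtagger_ids = set(line.strip() for line in open('hashtaggersIDs.txt') if line.strip())

inner = open('ddjFriendsInner.csv', 'w')  # hashtagger -> hashtagger edges
outer = open('ddjFriendsOuter.csv', 'w')  # hashtagger -> non-hashtagger edges
both = open('ddjFriendsAll.csv', 'w')     # hashtagger -> everyone they follow

for uid in hashtagger_ids:
  # friends_ids returns the numerical IDs of the people this user is following
  for fid in api.friends_ids(user_id=uid):
    edge = uid + ',' + str(fid) + '\n'
    both.write(edge)
    if str(fid) in hashtagger_ids:
      inner.write(edge)
    else:
      outer.write(edge)

inner.close()
outer.close()
both.close()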

5) edit the friends/followers CSV files to put them into an appropriate format for viewing in tools such as Gephi or Graphviz. For Gephi, edges can be defined using comma separated pairs of nodes (e.g. userID, followerID) with a bit of syntactic sugar; we can also use the list of Twitter usernames/user IDs to provide nice labels for the community’s “inner” nodes:

nodedef> name VARCHAR,label VARCHAR
23309699,tedeschini
23751864,datastore
13691922,nicolaskb

17474531,themediaisdying
2224401,munkyfonkey

edgedef> user VARCHAR,friend VARCHAR
23309699,17250069
23309699,91311875

2224401,878051
2224401,972651

Having got a GDF formatted file, we can load it straight into Gephi:

ddj hashtag community

In 3d view we can get something like this:

ddj - hashtag community

Node size is proportional to the number of hashtag users following a hashtagger. Colour (heat) is proportional to the number of hashtaggers followed by a hashtagger. Arrows go from a hashtagger to the people they are following. So a large node size means lots of other hashtaggers follow that person. A hot/red colour shows that person is following lots of the other hashtaggers. Gephi allows you to run various statistics over the graph to allow you to analyse the network properties of the community, but I’m not going to cover that just now! (Search this blog using “gephi” for some examples in other contexts.)

Different layouts, colours, text sizes and so on can be used to modify this view in Gephi, which can also generate high quality PDF renderings of 2D versions of the graph:

ddj - PDF export

(If you want to produce your own visualisation in Gephi, I popped the gdf file for the network here.)

If we export the gexf representation of the graph, we can use Alexis Jacomy’s Flex gexfWalker component to provide an interactive explorer for the graph:

gexfwalker - full network

Clicking on a node allows you to explore who a particular hashtagger is following:

gexfwalker - node explorer

Remember, arrows go from a hashtagger to the person they are following. Note that the above visualisation allows you to see reciprocal links. The colourings are specified via the gexf file, which itself had its properties set in the Preview Settings dialogue in Gephi:

Gephi - preview settings

As well as looking at the internal structure of the hashtag community, we can look at all the friends and/or followers of the hashtaggers. The graph for this is rather large (70k nodes and 90k edges), so after a lazyweb request to @gephi I found I had to increase the memory allocation for the Gephi app (the app stalled on loading the graph when it had run out of memory…).

If we load in the graph of “outer friends” (that is, the people the hashtaggers follow who are not hashtaggers) and filter the graph to only show nodes with more than 5 or so incoming edges, we can see which Twitter users are followed by large numbers of the hashtaggers but who have not been using the hashtag themselves. Because the friends/followers lists return Twitter numerical IDs, we have to do a look up on Twitter to find out the actual Twitter usernames. This is something I need to automate, maybe using the Twitter lookup API call that lets authenticated users look up the details of up to 100 Twitter users at a time given their numerical IDs. (If anyone wants this data from my snapshot of 23/8/10, drop me a line….)
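(By way of a sketch, the automation would look something like this – it assumes a version of tweepy that exposes the users/lookup call as api.lookup_users, an authenticated api object as before, and made-up filenames for the list of IDs and the output:)

# read in the numerical IDs we want usernames for
ids = [line.strip() for line in open('outerFriendIDs.txt') if line.strip()]

f = open('outerFriendNames.csv', 'w')
# the lookup call accepts batches of up to 100 user IDs at a time
for i in range(0, len(ids), 100):
  batch = ids[i:i+100]
  for user in api.lookup_users(user_ids=batch):
    f.write(str(user.id) + ',' + user.screen_name + '\n')
f.close()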

Okay, that’s more than enough for now… As I’ve shared the gdf and gexf files for the #ddj internal hashtaggers network, if any individuals more graphically talented than I am would like to see what sort of views they can come up with, either using Gephi or any other tool that accepts those data formats, I’d love to see them:-)

PS It also strikes me that, having got the list of hashtaggers, I need to package this up with a set of tools that would let you:
– create a Twitter list around a set of hashtaggers (and maybe then use that to generate a custom search engine over the hashtaggers’ linked to homepages);
– find other hashtags being used by the hashtaggers (that is, hashtags they may be using in arbitrary tweets).

(See also Topic and Event based Twittering – Who’s in Your Community?)

Written by Tony Hirst

August 25, 2010 at 9:08 pm

Posted in Tinkering, Visualisation


So Where Do the Numbers in Government Reports Come From?

Last week, the COI (Central Office of Information) released a report on the “websites run by ministerial and non-ministerial government departments”, detailing visitor numbers, costs, satisfaction levels and so on, in accordance with COI standards on guidance on website reporting (Reporting on progress: Central Government websites 2009-10).

As well as the print/PDF summary report (Reporting on progress: Central Government websites 2009-10 (Summary) [PDF, 33 pages, 942KB]) , a dataset was also released as a CSV document (Reporting on progress: Central Government websites 2009-10 (Data) [CSV, 66KB]).

The summary report is full of summary tables on particular topics, for example:

TABLE 1: REPORTED TOTAL COSTS OF DEPARTMENT-RUN WEBSITES
COI web report 2009-10 table 1

TABLE 2: REPORTED WEBSITE COSTS BY AREA OF SPENDING
COI web report 2009-10 table 2

TABLE 3: USAGE OF DEPARTMENT-RUN WEBSITES
COI website report 2009-10 table 3

Whilst I firmly believe it is a Good Thing that the COI published the data alongside the report, there is still a disconnect between the two. The report is publishing fragments of the released dataset as information in the form of tables relating to particular reporting categories – reported website costs, or usage, for example – but there is no direct link back to the CSV data table.

Looking at the CSV data, we see a range of columns relating to costs, such as:

COI website report - costs column headings

and:

COI website report costs

There are also columns headed SEO/SIO, and HEO, for example, that may or may not relate to costs? (To see all the headings, see the CSV doc on Google spreadsheets).

But how does the released data relate to the summary reported data? It seems to me that there is a huge “hence” between the released CSV data and the summary report. Relating the two appears to be left as an exercise for the reader (or maybe for the data journalist looking to hold the report writers to account?).

The recently published New Public Sector Transparency Board and Public Data Transparency Principles, albeit in draft form, have little to say on this matter either. The principles appear to be focussed on the way in which the data is released, in a context-free way (where by “context” I mean any of the uses to which government may be putting the data).

For data to be useful as an exercise in transparency, it seems to me that when government releases reports, or when government, NGOs, lobbyists or the media make claims using summary figures based on, or derived from, government data, the transparency arises from an audit trail that allows us to see where those numbers came from.

So for example, around the COI website report, the Guardian reported that “[t]he report showed uktradeinvest.gov.uk cost £11.78 per visit, while businesslink.gov.uk cost £2.15.” (Up to 75% of government websites face closure). But how was that number arrived at?

The publication of data means that report writers should be able to link to views over original government data sets that show their working. The publication of data allows summary claims to be justified, and contributes to transparency by allowing others to see the means by which those claims were arrived at and the assumptions that went in to making the summary claim in the first place. (By summary claim, I mean things like “non-staff costs were X”, or the “cost per visit was Y”.)

[Just an aside on summary claims made by, or “discovered” by, the media. Transparency in terms of being able to justify the calculation from raw data is important because people often use the fact that a number was reported in the media as evidence that the number is in some sense meaningful and legitimately derived (“According to the Guardian/Times/Telegraph/FT”, etc etc). To a certain extent, data journalists need to behave like academic researchers in being able to justify their claims to others.]

In Using CSV Docs As a Database, I show how, by putting the CSV data into a Google spreadsheet, we can generate several different views over the data using the Google Query language. For example, here’s a summary of the satisfaction levels, and here’s one over some of the costs:

COI website report - costs
select A,B,EL,EN,EP,ER,ET

[For more background to using Google spreadsheets as a database, see: Using Google Spreadsheets as a Database with the Google Visualisation API Query Language (via an API) and Using Google Spreadsheets Like a Database – The QUERY Formula (within Google spreadsheets itself)]

We can even have a go at summing the costs:

COI summed website costs
select A,B,EL+EN+EP+ER+ET
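(Queries like these are run against the spreadsheet via Google’s visualisation API; the URL behind them looks something along these lines, where SPREADSHEETKEY stands in for the key of the COI data spreadsheet and the query is simply URL-encoded:)

http://spreadsheets.google.com/tq?tqx=out:csv&key=SPREADSHEETKEY&tq=select%20A%2CB%2CEL%2BEN%2BEP%2BER%2BET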

In short, it seems to me that releasing the data as data is a good start, but the promise for transparency lies in being able to share queries over data sets that make clear the origins of data-derived information that we are provided with, such as the total non-staff costs of website development, or the average cost per visit to the blah, blah website.

So what would I like to see? Well, for each of the tables in the COI website report, a link to a query over the co-released CSV dataset that generates the summary table “live” from the original dataset would be a start… ;-)

PS In the meantime, to the extent that journalists and the media hold government to account, is there maybe a need for data journalysts (journalist+analyst portmanteau) to recreate the queries used to generate summary tables in government reports to find out exactly how they were derived from released data sets? Finding queries over the COI dataset that generate the tables published in the summary report is left as an exercise for the reader… ;-) If you manage to generate queries in a bookmarkable form (e.g. using the COI website data explorer; see also this for more hints), please feel free to share the links in the comments below :-)

Written by Tony Hirst

June 28, 2010 at 9:22 am

Guardian Datastore MPs’ Expenses Spreadsheet as a Database

Continuing my exploration of what is and isn’t acceptable around the edges of doing stuff with other people’s data(?!), the Guardian datastore have just published a Google spreadsheet containing partial details of MPs’ expenses data over the period July-December 2009 (MPs’ expenses: every claim from July to December 2009):

thanks to the work of Guardian developer Daniel Vydra and his team, we’ve managed to scrape the entire lot out of the Commons website for you as a downloadable spreadsheet. You cannot get this anywhere else.

In sharing the data, the Guardian folks have opted to share the spreadsheet via a link that includes an authorisation token. Which means that if you try to view the spreadsheet just using the spreadsheet key, you won’t be allowed to see it (you also need to be logged in to a Google account to view the data, both as a spreadsheet, and in order to interrogate it via the visualisation API). Which is to say, the Guardian datastore folks are taking what steps they can to make the data public, whilst retaining some control over it (because they have invested resource in collecting the data in the form they’re re-presenting it, and reasonably want to make a return from it…)

But in sharing the link that includes the token on a public website, we can see the key – and hence use it to access the data in the spreadsheet, and do more with it… which may be seen as providing a value-add service over the data, or unreasonably freeloading off the back of the Guardian’s data scraping efforts…

So, just by pasting the spreadsheet key and authorisation token into the cut down Guardian datastore explorer script I used in Using CSV Docs As a Database, I was able to generate an explorer for the expenses data.

So for example, we can run a report to group expenses by category and MP:

MP expenses explorer

Or how about claims over 5000 pounds (also viewing the information as an HTML table, for example).

Remember, on the datastore explorer page, you can click on column headings to order the data according to that column.

Here’s another example – selecting A,sum(E) where E>0, grouping by A, ordering by sum(E) asc, and viewing the result as a column chart:

Datastore exploration

We can also (now!) limit the number of results returned, e.g. to show the 10 MPs with lowest claims to date (the datastore blog post explains why the data is incomplete and should be treated warily).
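(In the visualisation API query language, that report corresponds to something like the following, using the column letters from the expenses spreadsheet:)

select A,sum(E) where E>0 group by A order by sum(E) asc limit 10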

Limiting results in datastore explorer

Changing the asc order to desc in the above query gives a possibly more interesting result – the MPs who have the largest claims to date (presumably because they have got round to filing their claims!;-)

Datastore exploring

Okay – enough for now; the reason I’m posting this is in part to ask the question: is this an unfair use of the Guardian datastore data, does it detract from the work they put in that lets them claim “You cannot get this anywhere else”, and does it impact on the returns they might expect to gain?

Should they/could they try to assert some sort of database right over the collection/curation and re-presentation of the data that is otherwise publicly available that would (nominally!) prevent me from using this data? Does the publication of the data using the shared link with the authorisation token imply some sort of license with which that data is made available? E.g. by accepting the link by clicking on it, because it is a shared link rather than a public link, could the Datastore attach some sort of tacit click-wrap license conditions over the data that I accept when I click through the shared link? (Does the/can the sharing come with conditions attached?)

PS It seems there was a minor “issue” with the settings of the spreadsheet, a result of recent changes to the Google sharing setup. Spreadsheets should now be fully viewable… But as I mention in a comment below, I think there are still interesting questions to be considered around the extent to which publishers of “public” data can get a return on that data?

Written by Tony Hirst

June 25, 2010 at 12:51 pm

Programming, Not Coding: Infoskills for Journalists (and Librarians..?!;-)

A recent post on the journalism.co.uk site asks: How much computer science does a journalist really need?, commenting that whilst coding skills may undoubtedly be useful for journalists, knowing what can be achieved easily in a computational way may be more important, because there are techies around who can do the coding for you… (For another take on this, see Charles Arthur’s If I had one piece of advice to a journalist starting out now, it would be: learn to code, and this response to it: Learning to Think Like A Programmer.)

Picking up on a few thoughts that came to mind around a presentation I gave yesterday (Web Lego And Format Glue, aka Get Yer Mashup On), here’s a slightly different take on it, based on the idea that programming doesn’t necessarily mean writing arcane computer code.

Note that a lot of what follows I’d apply to librarians as well as journalists… (So for example, see Infoskills for the Future – If You Can’t Handle Information, Get Out of the Library for infoskills that I think librarians as information professionals should at least be aware of (and these probably apply to journalists too…); Data Handling in Action is also relevant – it describes some of the practical skills involved in taking a “dirty” data set and getting it into a form where it can be easily visualised…)

So here we go…. An idea I’ve started working on recently as an explanatory device is the notion of feed oriented programming. I appreciate that this probably already sounds scary geeky, but it’s a made up phrase and I’ll try to explain it. A feed is something like an RSS feed. (If you don’t know what an RSS feed is, this isn’t a remedial class, okay… go and find out… this old post should get you started: We Ignore RSS at OUr Peril.)

Typically, an RSS feed will contain a set of items, such as a set of blog posts, news stories, or bookmarks. Each item has the same structure in terms of how it is represented on a computer. Typically, the content of the feed will change over time – a blog feed represents the most recent posts on a blog, for example. That is, the publisher of the feed makes sure that the feed has current content in it – as a “programmer” you don’t really need to do anything to get the fresh content in the feed – you just need to look at the feed to see if there is new content in it – or let your feed reader show you that new content when it arrives. The feed is accessed via a web address/URL.

Some RSS feeds might not change over time. On WriteToReply, where we republish public documents, it’s possible to get hold of an RSS version of the document. The document RSS feed doesn’t change (because the content of the document doesn’t change), although the content of the comment feeds might change as people comment on the document.

A nice thing about RSS is that lots of things publish it, and lots of things can import it. Importing an RSS feed into an application such as Google Reader simply means pasting the web address of the feed into a “Subscribe to feed” box in the application. Although it can do other things too, like supporting search, Google Reader is primarily a display application. It takes in RSS feeds and presents them to the user in an easy to read way. Google Maps and Google Earth are other display applications – they display geographical information in an appropriate way, a way that we can readily make sense of.

So what do we learn from this? Information can be represented in a standard way, such as RSS, and displayed in a visual way by an application that accepts RSS as an input. By subscribing to an RSS feed, which we identify by a fixed/permanent web address, we can get new content into our reader without doing anything. Subscribing is just a matter of copying a web address from the publisher’s web site and pasting it into our reader application. Cut and paste. No coding required. The feed publisher is responsible for putting new content into the feed, and our reader application is responsible for pulling that new content out and displaying it to us.

One of the tools I use a lot is Yahoo Pipes. Yahoo Pipes can take in RSS feeds and do stuff with them; it can take in a list of blog posts as an RSS feed and filter them so that you only get posts out that do – or don’t – mention cats, for example. And the output is in the form of an RSS feed…

What this means is that if we have a Yahoo pipe that does something we want in computational terms to an RSS feed, all we have to do is give it the web address of the feed we want to process, and then grab the RSS output web address from the Pipe. Cut and paste the original feed web address into the Pipe’s input. Cut and paste the web address of the RSS output from the pipe into our feed reader. No coding required.

Another couple of tools I use are Google Spreadsheets (a spreadsheet application) and Many Eyes Wikified (an interactive visualisation application). If you publish a spreadsheet on Google docs, you can get a web address/URL that points to a CSV (comma separated values) version of the selected sheet. A CSV file is a simple text file where each spreadsheet row is represented as a row in the CSV structured text file; and the value of each cell along a row in the original spreadsheet is represented as the same value in the text file, separated from the previous value by a comma. But you don’t need to know that… All you do need to know is that you can think of it as a feed… With a web address… And in a particular format…

Going to the “Share” options in the spreadsheet, you can publish the sheet and generate a web address that points to a range of cells in the spreadsheet (eg: B1:D120) represented as a CSV file. If we now turn to Many Eyes Wikified, I can provide it with the web address of a CSV file and it will go and fetch the data for me. At the click of a button I can then generate an interactive visualisation of the data in the spreadsheet. Cut and paste the web address of the CSV version of the data in a spreadsheet that Google Docs will generate for me into Many Eyes Wikified, and I can then create an interactive visualisation using the spreadsheet at the click of a button. Cut and paste a URL/web address that is generated for me. No coding required.
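(The web address Google generates for the published CSV looks something like the following – I’m writing from memory, so treat the key, range and exact parameter names as indicative rather than gospel:)

http://spreadsheets.google.com/pub?key=YOURSPREADSHEETKEY&output=csv&range=B1:D120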

As to where the data in the spreadsheet came from? Well maybe it came from somewhere else on the web, via a URL? Like this, maybe?

So the model I’m working towards with feed oriented programming is the idea that you can get the web address of a feed which a publisher will publish current content or data to, and paste that address into an application that will render or display the content (e.g. Google Reader, Many Eyes Wikified), or process/transform that data on your behalf.

So for example, Google Spreadsheets can transform an HTML table to CSV for you (it also lets you do all the normal spreadsheet things, so you could generate one sheet from another sheet using whatever spreadsheet formulae you like, and publish the CSV representation of that second sheet). Or in Yahoo Pipes, you can process an RSS feed by filtering its contents so that you only see posts that mention cats.
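(The HTML table to CSV trick is just a spreadsheet formula away – something along the lines of the following, pasted into a cell, will pull a table from a web page into the sheet; the Wikipedia URL here is just an example:)

=ImportHtml("http://en.wikipedia.org/wiki/List_of_countries_by_population", "table", 1)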

Yahoo Pipes offers other sorts of transformation as well. For example, in my original Wikipedia scraping demo, I took the feed from a Google spreadsheet and passed it to Yahoo Pipes where I geocoded city names and let pipes generate a map friendly feed (known as a KML feed) for me. Copying the web address of the KML feed output from the pipe and pasting it into Google Maps means I can generate an embeddable Google map view of data originally pulled from Wikipedia:

Once you start to think of the world in this way:

- where the web contains data and information that is represented in various standard ways and made available via a unique and persistent web address,

- where web applications can accept data and content that is represented in a standard way given the web address of that data,

- where web applications can transform data represented at one web address in one particular way and republish it in another standard format at another web address,

- or where web applications can take data represented in a particular way from one web address and provide you with the tools to then visualise or display that data,

then the world is your toolbox. Have URL, will travel. All you need to know is which applications can import what format data, and how they can republish that data for you, whether in a different format, such as Google spreadsheets taking an HTML table from Wikipedia and publishing it as a CSV file, or as a visualisation/human friendly display (Many Eyes Wikified, Google Reader). And if you need to do “proper” programmer type things, then you might be able to do it using a spreadsheet formula or a Yahoo Pipe (no coding required…;-)

See also: The Journalist as Programmer: A Case Study of The New York Times Interactive News Technology Department [PDF]

Written by Tony Hirst

April 30, 2010 at 11:15 am

Posted in Anything you want


Does Funding Equal Happiness in Higher Education?

[The service used to create the visualisations mentioned in this post has been closed down. To search over recent (2013 intake) Guardian HE data, see this: Guardian Student Rankings]

With the announcement of the amount of funding awarded to UK Higher Education institutions from HEFCE, the government funding body, several people posted me links to the data wondering what I might do with it. I saw this as a good opportunity to do something I’ve been meaning to do for ages, specifically have another look at how to provide a view of a range of HE related datasets around particular institutions. So for example, if you ever wondered whether or not there is a relationship between the drop out rate from a university and a surveyed average teaching score, you should be able to look it up:

Since its launch, one of the more actively maintained areas of the Guardian datastore has been the education area. A quick skim over HE related data turns up at least the following:

In a follow on post, I’ll show how to pull this data together, but for now, let’s look at some of the possibilities of pulling data in from these separate sheets around an HEI. As a proof of concept, I grabbed the following data and popped it into Many Eyes Wikified:

(I need to add provenance info to the wiki, but didn’t in this instance because I don’t want the data to be referenced or trusted… I still need to check everything is working properly… so I guess I should have used dummy HEI names… hmm…)

The data is pulled from four separate sheets and aggregated around HEI name. The “Percentage no longer in HE” comes from the datastore Dropout spreadsheet, the “Total staff earning ..” etc column is from the Pay spreadsheet, the “Average Teaching” to “Student to Staff Ratio” columns come from the 2009-10 university tables spreadsheet, and the “Teaching funding” to “Funding change” columns from the most recent (2010-11) funding spreadsheet.

I’ve posted a couple of interactive visualisations on to Many Eyes Wikified so you can have a play with the numbers (but don’t trust them or quote them unless you fact check them first…;-)

The first is a Scatter Chart, which gives us three dimensions to play with – x, y, and dot size.

So for example, in the chart shown above, we see that lower teaching scores seem to correlate with higher drop out rates. In the chart below, we see how teaching scores relate to the expenditure per student and the student staff ratio (and how expenditure per student and student staff ratio relate to each other):

Is satisfaction rewarded with funding, or is funding intended to improve matters?

The other chart type I produced is a bar chart. These are less interesting, but heavily used…

I assume that university strategy and planning units worry over this sort of combined data all the time (but I’m not sure how they obtain it, combine it, represent it, report it, or use it? Maybe if an HE planner is reading they could post a comment or two to describe what data they use, how they use it and what they use it for…?;-) Anyway, it’s getting close to a stage now where the rest of us can play along too…;-)

Written by Tony Hirst

March 20, 2010 at 11:18 am

Posted in Data, Visualisation


Grabbing “Facts” from the Guardian Datastore with a Google Spreadsheets Formula

In Using Data From Linked Data Datastores the Easy Way (i.e. in a spreadsheet, via a formula) I picked up on an idea outlined in Mulling Over =datagovukLookup() in Google Spreadsheets to show how to use Google Apps script to create a formula that could pull in live data from a data.gov.uk datastore.

So just because, here’s how to do something similar with data from a Google spreadsheet in the Guardian datastore. Note that what I’m essentially proposing here is using the datastore as a database…

To ground the example, consider the HE satisfaction tables:

Lots of data about lots of courses on lots of sheets in a single spreadsheet. But how do you compare the satisfaction ratings across subjects for a couple of institutions? How about like this:

Creating Subject comparison tables from Guardian HE data

(We can just click and drag the formula across a range of cells as we would any other formula.)

That is, how about defining a simple spreadsheet function that lets us look up a particular data value for a particular subject and a particular institution? How about being able to write a formula like:
=gds_education_unitable("elecEng","Leeds","NSSTeachingPerCent")
and get the national student satisfaction survey teaching satisfaction result back from students studying Electrical/Electronic Engineering at Leeds University?

Google Apps script provides a mechanism for defining formulae that can do this, and more:

Guardian Datastore as a database

The script takes the arguments and generates a query to the spreadsheet using the spreadsheet’s visualisation API, as used in my Guardian Datastore Explorer. The results are pulled back as CSV, run through a CSV-to-JavaScript array conversion function (CSVToArray, defined elsewhere in the script project) and then returned to the calling spreadsheet. Here’s an example of the Apps script:

function gds_education_unitable(sheet,uni,typ){
  var key="phNtm3LmDZEM6HUHUnVkPaA";
  var gid='0';//'Overall Institutional Table';
  var category="C"; //(Average) Guardian teaching score
  switch (sheet){
    case "full":
      gid='0';//'Overall Institutional Table';
      break;
    case "chemEng":
      gid='16';//'15 Chem Eng';
      break;
    case "matEng":
      gid='17';//'16 Mat Eng';
      break;
    case "civilEng":
      gid='18';//'17 Civil Eng';
      break;
    case "elecEng":
      gid='19';//'18 Elec Eng';
      break;
    case "mechEng":
      gid='20';//'19 Mech Eng';
      break;
    default:
  }

  switch (typ){
    case "guardianScore":
      category='C';//(Average) Guardian teaching score
      break;
    case "NSSTeachingPerCent":
      category='D';//
      break;
    case "expenditurePerStudent":
      category='E';//
      break;
    case "studentStaffRatio":
      category='F';//
      break;
    default:
  }

  if (sheet!='full') category=String.fromCharCode(category.charCodeAt(0)+2);

  var url="http://spreadsheets.google.com/tq?tqx=out:csv&tq=select%20B%2C"+category+"%20where%20B%20matches%20%22"+uni+"%22&key="+key+"&gid="+gid;

  var x=UrlFetchApp.fetch(url);
  var ret=x.getContentText();
  ret = CSVToArray( ret, "," );
  return ret[1][1];
}

(The column numbering between the first sheet in the spreadsheet and the subject spreadsheets is inconsistent, which is why we need a little bit of corrective fluff (if (sheet!=’full’)) in the code…)

Of course, we can also include script that will generate calls to other spreadsheets, or as I have shown elsewhere to other data sources such as the data.gov.uk Linked Data datastore.

Something that occurred to me though is if and how Google might pull on such “data formula” definitions to feed apps such as Google Squared (related: =GoogleLookup: Creating a Google Fact Engine Directory and Is Google Squared Just a Neatly Packaged and Generalised =googlelookup Array?).

Written by Tony Hirst

February 19, 2010 at 2:12 pm

Using Data From Linked Data Datastores the Easy Way (i.e. in a spreadsheet, via a formula)

Disclaimer: before any Linked Data purists say I’m missing the point about what Linked Data is, does, or whatever, I don’t care, okay? I just don’t care. This is for me, and me is someone who can’t get to grips with writing SPARQL queries, can’t stand the sight of unreadable <rdf> <all:over the=”place”>, can’t even work out how to find what things are queryable in a Linked Data triple store, let alone write queries that link data from one data store with data from another data store (or maybe SPARQL can’t do that yet? Not that I care about that either, because I can, in Yahoo Pipes, or Google Spreadsheets, and in a way that’s meaningful to me…)

In Mulling Over =datagovukLookup() in Google Spreadsheets, I started wondering about whether or not it would be useful to be able to write formulae to look up “live facts” in various datastores from within a spreadsheet (you know, that Office app that is used pretty much universally in the workplace whenever there is tabular data to hand. That or Access, of course…)

Anyway, I’ve started tinkering with how it might work, so now I can do things like this:

The formulae in columns G, H and I are defined according to a Google Apps script, that takes a school ID and then returns something linked to it in the data.gov.uk education datastore, such as the name of the school, or its total capacity.

Formulae look like:

  • =datagovuk_education(A2,"name")
  • =datagovuk_education(A2,"opendate")
  • =datagovuk_education(A2,"totcapacity")

and are defined to return a single cell element. (I haven’t worked out how to return several cells worth of content from a Google Apps Script yet!)

At the moment, the script is a little messy, taking the form:

function datagovuk_education(id,typ) {
  var ret=""; var args=""
  switch (typ) {
    case 'totcapacity':
      args= _datagovuk_education_capacity_quri(id);
      break;
    ...
    default:
      //hack something here;
  }
  var x=UrlFetchApp.fetch('http://data-gov.tw.rpi.edu/ws/sparqlproxy.php',{method: 'post', payload: args});
  var ret=x.getContentText();
  var xmltest=Xml.parse(ret);
  ret=xmltest.sparql.results.result.binding.literal.getText();

  return ret;
}
function _datagovuk_education_capacity_quri(id){
  return "query=prefix+sch-ont%3A+%3Chttp%3A%2F%2Feducation.data.gov.uk%2Fdef%2Fschool%2F%3E%0D%0ASELECT+%3FschoolCapacity+WHERE+{%0D%0A%3Fschool+a+sch-ont%3ASchool%3B%0D%0Asch-ont%3AuniqueReferenceNumber+"+id+"%3B%0D%0Asch-ont%3AschoolCapacity+%3FschoolCapacity.%0D%0A}+ORDER+BY+DESC%28%3Fdate%29+LIMIT+1&output=xml&callback=&tqx=&service-uri=http%3A%2F%2Fservices.data.gov.uk%2Feducation%2Fsparql";
}

The datagovuk_education(id,typ) function takes the school ID and the requested property, uses the case statement to create an appropriate query string, and then fetches the data from the education datastore, returning the result in an XML format like this. The data is pulled from the datastore via Sparqlproxy, and the query strings are generated (at the moment) by adding the school ID number into a template I got by running the desired SPARQL query on Sparqlproxy and then grabbing the appropriate part of the URI. (It’s early days yet on this hack!;-)
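(For anyone who finds the URL-encoded form as unreadable as I do, the query string in _datagovuk_education_capacity_quri decodes to the following SPARQL, with the school’s ID number spliced in where ID appears, along with output=xml and service-uri=http://services.data.gov.uk/education/sparql parameters for Sparqlproxy:)

prefix sch-ont: <http://education.data.gov.uk/def/school/>
SELECT ?schoolCapacity WHERE {
  ?school a sch-ont:School;
    sch-ont:uniqueReferenceNumber ID;
    sch-ont:schoolCapacity ?schoolCapacity.
} ORDER BY DESC(?date) LIMIT 1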

By defining appropriate Apps script functions, I can also create formulae to call other datastores, run queries on Google spreadsheets (e.g. in the Guardian datastore) and so on. I assume similar sorts of functionality would be supported using VB Macros in Excel?

Anyway – this is my starter for ten on how to make live datastore data available to the masses. It’ll be interesting to see whether this approach (or one like it) is used in favour of getting temps to write SPARQL queries and RDF parsers… The obvious problem is that my approach can lead to an explosion in the number of formulae and parameters you need to learn; the upside is that I think these could be quite easily documented in a matrix/linked formulae chart. The approach also scales to pulling in data from CSV stores and other online spreadsheets, using spreadsheets as a database via the =QUERY() formula (e.g. Using Google Spreadsheets Like a Database – The QUERY Formula), and so on. There might also be a market for selling prepackaged or custom formulae as script bundles via a script store within a larger Google Apps App store
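(Re the =QUERY() formula mentioned above: it takes a range of cells and a visualisation API style query string – for example, something like =QUERY(A1:F100,"select A, sum(E) where E>0 group by A") – so many of the datastore explorer style queries can also be run from inside a spreadsheet itself.)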

PS I’m trying to collect example SPARQL queries that run over the various data.gov.uk end points because: a) I’m really struggling in getting my head round writing my own, not least because I struggle to make sense of the ontologies, if I can find them, and write valid queries off the back of them; even (in fact, especially) really simple training/example queries will do! b) coming up with queries that either have interesting/informative/useful results, or clearly demonstrate an important ‘teaching point’ about the construction of SPARQL queries, is something I haven’t yet got a feel for. If you’ve written any, and you’re willing to share, please post a gist to github and add a link in a comment here.

PPS utility bits, so I don’t lose track of them:
– education datastore ontology
– Apps script XML Element class

PPPS Here’s how to dump a 2D CSV table into a range of cells: Writing 2D Data Arrays to a Google Spreadsheet from Google Apps Script Making an HTTP POST Request for CSV Data

Written by Tony Hirst

February 17, 2010 at 11:52 pm
