Datagrabbing Commonly Formatted Sheets from a Google Spreadsheet – Guardian 2014 University Guide Data

So it seems like it’s that time of year when the Guardian publish their university rankings data (Datablog: University guide 2014), which means another opportunity to have a tinker and see what I’ve learned since last year…

(Last year’s hack was Filtering Guardian University Data Every Which Way You Can…, where I had a quick go at creating a simple interactive dashboard viewer and charter over the Guardian tables, as published via Google spreadsheets.)

The data is published via a Google spreadsheet using a standard sheet layout (apart from the first two sheets, with gid numbers 0 and 1 respectively). Sheets 2..47 are formatted as follows:

Guardian data table uni 2014

The following R function provides the basis for downloading the data from a single sheet, given the spreadsheet key and the sheet gid, and puts the data into a dataframe.

library(RCurl)
gsqAPI = function(key,query,gid=0){
  #Build the Google visualisation API query URL for the given spreadsheet key, query and sheet gid,
  #fetch the CSV response over https (skipping certificate verification), and read it into a dataframe
  tmp=getURL( paste( sep="",'https://spreadsheets.google.com/tq?', 'tqx=out:csv','&tq=', curlEscape(query), '&key=', key, '&gid=', gid), ssl.verifypeer = FALSE )
  return( read.csv( textConnection( tmp ),  stringsAsFactors=F ) )
}

The query is written using the SQL-like Google visualisation API query language.
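By way of illustration (the column letters here are made up rather than taken from the Guardian sheets), a query might look something like this:

select B,E where E > 80 order by E desc limit 10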

The following routine will download non-empty rows from a specific sheet, and rename the subject-specific rank column to a generically titled column (“Subject.Rank”). An additional column is added to the table recording which subject the table is associated with.

handler=function(key,i){
  #Grab the non-empty rows from sheet i of the spreadsheet
  tmp=gsqAPI(key,"select * where B!=''", i)
  #The first column is the subject specific rank column (something of the form <Subject>.Rank),
  #so extract the subject name from it...
  subject=sub(".Rank",'',colnames(tmp)[1])
  #...rename the column generically, and record the subject in a column of its own
  colnames(tmp)[1]="Subject.Rank"
  tmp$subject=subject
  tmp
}

We can now download the first subject specific sheet using the following call:

key='0Aq73qj3QslDedHFNdUsxOVZQZ1dmbXNrYlZGUWgwdHc'
gdata=handler(key,2)

This loads in the data as follows:

seeding guardian ranking data table

We can then add data to this dataframe from all the other sheets, to give us a single data frame containing ranking data for all the subjects listed.

for (i in 3:47){
  gdata=rbind(gdata,handler(key,i))
}

(This is probably not a very R idiomatic (iRonic?) way of doing things, but it seems to do the trick…)
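A slightly more idiomatic way of doing the same thing might be to build a list of dataframes with lapply and bind them in a single step with do.call. Something like this (a sketch that builds the full table from sheets 2 to 47 in one go, rather than growing it block by block) should be equivalent:

#Build a list of per-sheet dataframes and rbind them in a single step
gdata = do.call(rbind, lapply(2:47, function(i) handler(key, i)))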

The result is a single dataframe containing all the subject specific data from the Guardian spreadsheets.

If we check the data types of the columns in the full data set, we can see that most of the columns that should be treated as numeric types are instead being treated as characters.
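A quick way of checking is something like this (any of the usual suspects will do; this particular call is just a suggestion):

str(gdata)
#or: sapply(gdata, class)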

Data format - need to make numbers

We can fix this with a one-liner that forces the type on the columns we select (the fourth to the eleventh columns):

gdata[, 4:11] <- sapply(gdata[, 4:11], as.numeric)
#Cast the university names and subjects to levelled factors
gdata$Name.of.Institution=as.factor(gdata$Name.of.Institution)
gdata$subject=as.factor(gdata$subject)

One advantage of combining the data from the separate sheets into a monolithic data table is that we can see the ranking across all subject areas for a given university. For example:

oxfordbrookes.rankings = subset(gdata, Name.of.Institution=='Oxford Brookes', select=c('subject','Subject.Rank') )
#Let's also sort the results:
oxfordbrookes.rankings = oxfordbrookes.rankings[ order(oxfordbrookes.rankings['Subject.Rank']), ]

Cross-subject rankings

We can also start to quiz the data in a graphical way. For example, can we get a feel for how courses are distributed within a university according to teaching and satisfaction levels, whilst also paying heed to the value add score?

library(ggplot2)
oxfordbrookes.full = subset(gdata, Name.of.Institution=='Oxford Brookes' )
ggplot(oxfordbrookes.full) + geom_text(aes(x=X..Satisfied.with.Teaching, y=X..Satisfied.with.Assessment, label=subject,size=Value.added.score.10))

oxfordbrookes start asking

All told, it took maybe a couple of hours of faffing around trying to remember R syntax and cope with https hassles to get this far (including a chunk of time to write this post;-). But now we’re in a position to start hacking out quick queries and having a proper conversation with the data. The work can also be viewed as disposable tool building – it hasn’t required a project proposal, it hasn’t used lots of third party developer time etc etc.

As it is though, I’m not sure how useful getting this far is to anyone who isn’t willing to have a go at hacking the data for themselves…

Hmmm… maybe I should try to sketch out a quick Shiny app…..?

Filtering Guardian University Data Every Which Way You Can…

In a post from way back when – Does Funding Equal Happiness in Higher Education? – which still gets daily traffic, even though the IBM Many Eyes Wikified hack described in it no longer works, I aggregated and reused a selection of data sets collected by the Guardian datastore relating to HE.

Whilst the range of datasets used in that hack doesn’t seem to have been re-collected more recently, the Guardian DataStore does still publish an annual set of aggregated data (from Unistats?) for courses by subject area across UK HE (University guide 2013 – data tables).

The DataStore data is published using Google spreadsheets, which as regular readers will know also double up as a database. The Google Visualisation API that’s supported by Google Spreadsheets also makes it easy to pull data from the spreadsheets into an interactive dashboard view.

As an example, I’ve popped a quick demo up as a Scraperwiki View showing how to pull data from a selected sheet within the Guardian University data spreadsheet and filter it using a range of controls. I’ve also added a tabular view, and a handful of scatterplots, to show off the filtered data.

To play with the view, visit here: Guardian Student Rankings.

If you want to hack around with the view, it’s wikified here: wikified source code.

PS I’ve also pulled all the subject data tables into a single Scraperwiki database: Guardian HE Data 2013

Data Journalists Engaging in Co-Innovation…

You may or may not have noticed that the Boundary Commission released their take on proposed parliamentary constituency boundaries today.

They could have released the data – as data – in the form of shape files that can be rendered at the click of a button in things like Google Maps… but they didn’t… [The one thing the Boundary Commission quango forgot to produce: a map] (There are issues with publishing the actual shapefiles, of course. For one thing, the boundaries may yet change – and if the original shapefiles are left hanging around, people may start to draw on these now incorrect sources of data once the boundaries are fixed. But that’s a minor issue…)

Instead, you have to download a series of hefty PDFs, one per region, to get a flavour of the boundary changes. Drawing a direct comparison with the current boundaries is not possible.

The make-up of the actual constituencies appears to be based on their member wards, data which is provided in a series of spreadsheets, one per region, each containing several sheets describing the ward makeup of each new constituency for the counties in the corresponding region.

It didn’t take long for the data junkies to get on the case though. From my perspective, the first map I saw was on the Guardian Datastore, reusing work by University of Sheffield academic Alasdair Rae, apparently created using Google Fusion Tables (though I haven’t seen a recipe published anywhere? Or a link to the KML file that I saw Guardian Datablog editor Simon Rogers/@smfrogers tweet about?)

[I knew I should have grabbed a screen shot of the original map…:-(]

It appears that Conrad Quilty-Harper (@coneee) over at the Telegraph then got on the case, and came up with a comparative map drawing on Rae’s work as published on the Datablog, showing the current boundaries compared to the proposed changes, and which ties the maps together so the zoom level and focus are matched across the maps (MPs’ constituencies: boundary changes mapped):

Telegraph side by side map comparison

Interestingly, I was alerted to this map by Simon tweeting that he liked the Telegraph map so much, they’d reused the idea (and maybe even the code?) on the Guardian site. Here’s a snapshot of the conversation between these two data journalists over the course of the day (reverse chronological order):

Datajournalists in co-operative bootstrapping mode

Here’s the handshake…

Collaborative co-evolution

I absolutely love this… and what’s more, it happened over the course of four or five hours, with a couple of technology/knowledge transfers along the way, as well as evolution in the way both news agencies communicated the information compared to the way the Boundary Commission released it. (If I was evil, I’d try to FOI the Boundary Commission to see how much time, effort and expense went into their communication effort around the proposed changes, and would then try to guesstimate how much the Guardian and Telegraph teams put into it as a comparison…)

At the time of writing (15.30), the BBC have no data driven take on this story…

And out of interest, I also wondered whether Sheffield U had a take…

Sheffield U media site

Maybe not…

PS By the by, the DataDrivenJournalism.net website relaunched today. I’m honoured to be on the editorial board, along with @paulbradshaw @nicolaskb @mirkolorenz @smfrogers and @stiles, and looking forward to seeing how we can start to drive interest, engagement and skills development in, as well as analysis and (re)use of, and commentary on, public open data through the data journalism route…

PPS if you’re into data journalism, you may also be interested in GetTheData.org, a question and answer site in the model of Stack Overflow, with an emphasis on Q&A around how to find, access, and make use of open and public datasets.

Using Google Spreadsheets as a Database Source for R

I couldn’t contain myself (other more pressing things to do, but…), so I just took a quick time out and a coffee to put together a quick and dirty R function that will let me run queries over Google spreadsheet data sources and essentially treat them as database tables (e.g. Using Google Spreadsheets as a Database with the Google Visualisation API Query Language).

Here’s the original function I used:

library(RCurl)
gsqAPI = function(key,query,gid=0){ return( read.csv( paste( sep="",'http://spreadsheets.google.com/tq?', 'tqx=out:csv','&tq=', curlEscape(query), '&key=', key, '&gid=', gid) ) ) }

However, with a move to https, this function kept breaking. The one I currently use is:

library(RCurl)
gsqAPI = function(key,query,gid=0){ 
  tmp=getURL( paste( sep="",'https://spreadsheets.google.com/tq?', 'tqx=out:csv','&tq=', curlEscape(query), '&key=', key, '&gid=', gid), ssl.verifypeer = FALSE )
  return( read.csv( textConnection( tmp ) ) )
}

It requires the spreadsheet key value and a query; you can optionally provide a sheet number within the spreadsheet if the sheet you want to query is not the first one.

We can call the function as follows:

gsqAPI('tPfI0kerLllVLcQw7-P1FcQ','select * limit 3')

In that example, and by default, we run the query against the first sheet in the spreadsheet.

Alternatively, we can make a call like this, and run a query against sheet 3, for example:

tmpData=gsqAPI('0AmbQbL4Lrd61dDBfNEFqX1BGVDk0Mm1MNXFRUnBLNXc','select A,C where <= 10',3)
tmpData

My first R function

The real question is, of course, could it be useful.. (or even OUseful?!)?

Here’s another example: a way of querying the Guardian Datastore list of spreadsheets:

gsqAPI('0AonYZs4MzlZbdFdJWGRKYnhvWlB4S25OVmZhN0Y3WHc','select * where A contains "crime" and B contains "href" order by C desc limit 10')

What that call does is run a query against the Guardian Datastore spreadsheet that lists all the other Guardian Datastore spreadsheets, and pulls out references to spreadsheets relating to “crime”.

The returned data is a bit messy and requires parsing to be properly useful.. but I haven’t started looking at string manipulation in R yet…(So my question is: given a dataframe with a column containing things like <a href=”http://example.com/whatever”>Some Page</a>, how would I extract columns containing http://example.com/whatever or Some Page fields?)
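One way might be to use base R regular expressions. Something like this rough (unchecked) sketch, with a made-up example string, should pull out the two pieces; and because sub is vectorised, the same calls should work over a whole dataframe column:

#Rough sketch: pull the URL and the link text out of an HTML anchor string
link = '<a href="http://example.com/whatever">Some Page</a>'
linkUrl = sub('.*href="([^"]*)".*', '\\1', link)  #"http://example.com/whatever"
linkText = sub('.*>([^<]*)</a>.*', '\\1', link)   #"Some Page"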

[UPDATE: as well as indexing a sheet by sheet number, you can index it by sheet name, but you’ll probably need to tweak the function so that the URL ends with '&gid=', curlEscape(gid), so that things like spaces in the sheet name get handled properly. I’m not sure about this now… calling a sheet by name works when accessing the “normal” Google spreadsheets application, but I’m not sure it does for the chart query language call??? ]

[If you haven’t yet discovered R, it’s an environment that was developed for doing stats… I use the RStudio environment to play with it. The more I use it (and I’ve only just started exploring what it can do), the more I think it provides a very powerful environment for working with data in quite a tangible way, not least for reshaping it and visualising it, let alone doing stats with it. (In fact, don’t use the stats bit if you don’t want to; it provides more than enough data mechanic tools to be going on with;-)]

PS By the by, I’m syndicating my Rstats tagged posts through the R-Bloggers site. If you’re at all interested in seeing what’s possible with R, I recommend you subscribe to R-Bloggers, or at least have a quick skim through some of the posts on there…

PPS The RSpatialTips post Accessing Google Spreadsheets from R has a couple of really handy tips for tidying up data pulled in from Google Spreadsheets, assuming the spreadsheet data has been loaded into ssdata: a) tidy up column names using colnames(ssdata) <- c("my.Col.Name1","my.Col.Name2",...,"my.Col.NameN"); b) if a column returns numbers as non-numeric data (e.g. as a string "1,000") in cols 3 to 5, convert it to numeric using something like: for (i in 3:5) ssdata[,i] <- as.numeric(gsub(",","",ssdata[,i])). [The last column can be identified as ncol(ssdata). You can do a more aggressive conversion to numbers (assuming no decimal points) using gsub("[^0-9]","",ssdata[,i]).]

PPPS via Revolutions blog, how to read the https file into R (unchecked):

require(RCurl)
myCsv = getURL(httpsCSVurl)
read.csv(textConnection(myCsv))

A First Quick Viz of UK University Fees

Regular readers will know how I do quite like to dabble with visual analysis, so here are a couple of doodles with some of the university fees data that is starting to appear.

The data set I’m using is a partial one, taken from the Guardian Datastore: Tuition fees 2012: what are the universities charging?. (If you know where there’s a full list of UK course fees data by HEI and course, please let me know in a comment below, or even better, via an answer to this Where’s the fees data? question on GetTheData.)

My first thought was to go for a proportional symbol map. (Does anyone know of a javascript library that can generate proportional symbol overlays on a Google Map or similar, even better if it can trivially pull in data from a Google spreadsheet via the Google visualisation API? I have an old hack (supermarket catchment areas), but there must be something nicer to use by now, surely? [UPDATE: ah – forgot this: Polymaps])

In the end, I took the easy way out, and opted for Geocommons. I downloaded the data from the Guardian datastore, and tidied it up a little in Google Refine, removing non-numerical entries (including ranges, such as 4,500-6,000) in the Fees column and replacing them with minimum fee values. Sorting the fees column as a numerical type with errors at the top made the columns that needed tweaking easy to find:

The Guardian data included an address column, which I thought Geocommons should be able to cope with. It didn’t seem to work out for me though (I’m sure I checked the UK territory, but only seemed to get US geocodings?), so in the end I used a trick posted to the OnlineJournalism blog to geocode the addresses (Getting full addresses for data from an FOI response (using APIs)); rather than use the value.parseJson().results[0].formatted_address construct, I generated a couple of columns from the JSON results column using value.parseJson().results[0].geometry.location.lng and value.parseJson().results[0].geometry.location.lat.

Uploading the data to Geocommons and clicking where prompted, it was quite easy to generate this map of the fees to date:

Anyone know if there’s a way of choosing the order of fields in the pop-up info box? And maybe even a way of selecting which ones to display? Or do I have to generate a custom dataset and then create a map over that?

What I had hoped to be able to do was use coloured proportional symbols to generate a two dimensional data plot, e.g. comparing fees with drop out rates, but Geocommons doesn’t seem to support that (yet?). It would also be nice to have an interactive map where the user could select which numerical value(s) are displayed, but again, I missed that option if it’s there…

The second thing I thought I’d try would be an interactive scatterplot on Many Eyes. Here’s one view that I thought might identify what sort of return on value you might get for your course fee…;-)

Click thru’ to have a play with the chart yourself;-)

PS I can’t not say this, really – you’ve let me down again, @datastore folks…. where’s a university ID column using some sort of standard identifier for each university? I know you have them, because they’re in the Rosetta sheet… although that is lacking a HESA INST-ID column, which might be handy in certain situations… ;-) [UPDATE – apparently, HESA codes are in the spreadsheet…. ;-0]

PPS Hmm… that Rosetta sheet got me thinking – what identifier scheme does the JISC MU API use?

PPPS If you’re looking for a degree, why not give the Course Detective search engine a go? It searches over as many of the UK university online prospectus web pages that we could find and offer up as a sacrifice to a Google Custom search engine ;-)

A Few More Thoughts on GetTheData.org

As we come up to a week in on GetTheData.org, there’s already an interesting collection of questions – and answers – starting to appear on the site, along with a fledgling community (thanks for chipping in, folks:-), so how can we maintain – and hopefully grow – interest in the site?

A couple of things strike me as the most likely things to make the site attractive to folk:

– the ability to find an appropriate – and useful – answer to your question without having to ask it, for example because someone has already asked the same, or a similar, question;
– timely responses to questions once asked (which leads to a sense of community, as well as utility).

I think it’s also worth bearing in mind the context that GetTheData sits in. Many of the questions result in answers that point to data resources that are listed in other directories. (The links may go to either the data home page or its directory page on a data directory site.)

Data Recommendations
One thing I think is worth exploring is the extent to which GetTheData can both receive and offer recommendations to other websites. Within a couple of days of releasing the site, Rufus had added a recommendation widget that could recommend datasets hosted on CKAN that seem to be related to a particular question.

GetTheData.org - related datasets on CKAN

What this means is that even before you get a reply, a recommendation might be made to you of a dataset that meets your requirements.

(As with many other Q&A sites, GetTheData also tries to suggest related questions to you when you enter your question, to prompt you to consider whether or not your question has already been asked – and answered.)

I think the recommendation context is something we might be able to explore further, both in terms of linking to recommendations of related data on other websites, but also in the sense of reverse links from GetTheData to those sites.

For example:

– would it be possible to have a recommendation widget on GetTheData that links to related datasets from the Guardian datastore, or National Statistics?
– are there other data directory sites that can take one or more search terms and return a list of related datasets?
– could a getTheData widget be located on CKAN data package pages to alert package owners/maintainers that a question possibly related to the dataset had been posted on GetTheData? This might encourage the data package maintainer to answer the question on the getTheData site with a link back to the CKAN data package page.

As well as recommendations, would it be useful for GetTheData to syndicate new questions asked on the site? For example, I wonder if the Guardian Datastore blog would be willing to add the new questions feed to the other datablogs they syndicate?;-) (Disclosure: data tagged posts from OUseful.info get syndicated in that way.)

Although I don’t have any good examples of this to hand from GetTheData, it strikes me that we might start to see questions that relate to obtaining data which is actually a view over a particular data set. This view might be best obtained via a particular query onto a particular data set, such as a specific SPARQL query on a Linked Data set, or a particular Google query language request to the visualisation API against a particular Google spreadsheet.

If we do start to see such queries, then it would be useful to aggregate these around the datastores they relate to, though I’m not sure how we could best do this at the moment other than by tagging?

News announcements
There are a wide variety of sites publishing data independently, and a fractured network of data directories and data catalogues. Would it make sense for GetTheData to aggregate news announcements relating to the release of new data sets, and somehow use these to provide additional recommendations around data sets?

Hackdays and Data Fridays
As suggested in Bootstrapping GetTheData.org for All Your Public Open Data Questions and Answers:

If you’re running a hackday, why not use GetTheData.org to post questions arising in the scoping of the hacks, tweet a link to the question to your event backchannel and give the remote participants a chance to contribute back, at the same time adding to the online legacy of your event.

Alternatively, how about “Data Fridays”, on the first Friday in the month, where folk agree to check GetTheData two or three times that day and engage in something of a distributed data related Question and Answer sprint, helping answer unanswered questions, and maybe pitching in a few new ones?

Aggregated Search
It would be easy enough to put together a Google custom search engine that searches over the domains of data aggregation sites, and possibly also offer filetype search limits?

So What Next?
Err, that’s it for now…;-) Unless you fancy seeing if there’s a question you can help out on right now at GetTheData.org

Where Linked Data Would Be Useful – Creating More MPs’ Maps from the Guardian Politics API

So given the news from the Commons today, I was wondering where the current crop of MPs came from in terms of birthplace, school, and university… Would a map based view turn up from across the UK, or something a little more clumpy…?

Searching for sources of biographical data, two sources came to mind – Wikipedia infoboxes (and hence DBpedia) and the Guardian Politics API. In this post, I’ll describe a minimal – and not very reliable – recipe for plotting a map of UK MPs’ alma mater based on data grabbed from the Guardian Politics API, identifying a couple of ways in which the data could be made so much more useful, and indicating why the Linked Data approach is a Good Thing…

Just because, I’ll give a Yahoo Pipes recipe…

The first step is a handler pipe for grabbing an MP’s details from the Guardian Politics API from their Guardian ID:

MP details via Guardian Politics API

The next step is to get a list of current MPs and annotate the list with MP details using the helper pipe:

Lookup MP details

Not every MP has an alma mater listed, so we filter out the ones where there is no university information. We then use the university data as the input to a rough and ready geocoder, which does its best to identify a location and then geocode it. The Yahoo pipes trick of putting geo-data into the y:location attribute means that the pipe will automagically generate KML and map based previews of the output of the pipe.

Lookup location by university

Finally, we tidy up the feed a little:

Tidy the pipe...

Here’s the output of the resulting pipe:

MPs by university - badly coded...

Clicking on the various markers, we see that there is a lot of miscoding going on. Also, some MPs have several universities listed, which may also contribute to the confusion. (A rough and ready way of handling that would be to split the university field on a semi-colon, and just use the first listed university in the location lookup.)
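Just to illustrate that splitting idea (in R rather than in the pipe itself, and with a made-up example string):

#Take the first university where several are listed, separated by semi-colons
unis = "Some University; Another University"
firstUni = trimws(strsplit(unis, ";")[[1]][1])
firstUni  #"Some University"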

So what would make things easier? The Guardian Politics API is getting the data out there, but can it be improved in any way in order to make it a little (or a lot?!) more useful in a machine automated context such as this?

I think so…

Here’s one possible approach: a few weeks ago, the JISC Monitoring Unit published a lookup service for UK HEIs using a variety of identifier schemes and a crude name based lookup, returning synonymous identifiers, canonical URLs and lat/long data: data.jiscmu.ac.uk. As identified in the announcement post, this information complements, rather more formally, some of that already collated in the Guardian’s Education Datastore Rosetta Stone spreadsheet…

(I’m not sure if Leigh Dodds looked at how the JISCMU data could be used as part of a Google Refine reconciliation API service? I seem to remember a brief flurry of tweets on a related topic at the time…;-)

So, what would be really useful would be for the Guardian Politics API to use a weak Linked Data approach and provide a list of HEI identifiers using a formal identification scheme such as UCAS or HESA codes, so that we know which institutions they are actually referring to (though this wouldn’t cope with overseas universities… Hmm… is there an international identifier scheme for universities?).

We could then hop over to the JISCMU service and pull down the lat/long information, before popping it on a map.

Looking deeper into the Guardian Politics API, we also see a field for listing the MPs’ schools… which in turn could be enhanced by including identifiers used in the data.gov.uk education datastore.

So – Linked Data: can you see how it works yet? And do you get the feeling that network effects could kick into place really quickly as data is enhanced with linking elements such as well defined identifiers using known identification schemes?

PS Chris Gutteridge has also picked up the challenge of this post, contributing a list of DBPedia URIs for current MPs to the cause. I wouldn’t be surprised if he turns up a whole load more data and actually cracks the problem way before I do!

Ah – seems like Chris has been on the case, and produced (with caveats: “Note that the data is patchy. It only shows MPs with a geocoded birthplace/university listed on dbpedia”) a map [updated] of 313 MPs’ birthplaces:

MPs birthplaces

as well as a map [updated] of 176 MPs’ universities (though I don’t have a valid link for this… yet…;-) Ah – here it is:

MPs alma mater map

UPDATE: here’s the recipe – Studying the MPs

PPS I really need to add Chris’ geo-tagged RDF to KML converter service (described here) to my toolkit…

If You’re Going to Republish Data, Try To Be Consistent…

A passing observation, not meant as a gripe, but a good example of how slight inconsistencies can make life hard in data land…

[UPDATE: I think the inconsistencies in the Guardian Datastore spreadsheets mentioned below have now all been addressed. Which I guess makes this an impactful post?;-)]

Early today/last night, the Guardian Datablog did a great job republishing government spending data in a series of spreadsheets organised by Government Department. I cobbled together a tool to run queries against each spreadsheet separately, and thought I’d have a look at doing one that could run the same query against all the sheets at the same time, and aggregate the results.

To simplify matters, it looks as if the Google Charts Visualisation Query Language that I use to interrogate the spreadsheets allows you to pass column names in the query (documentation). This is handy, because it gets round possible inconsistencies across the spreadsheets arising from the different ordering of columns. However, to run the same query across multiple spreadsheets, it does require that the columns are named in exactly the same way. But at the moment, they aren’t:

Guardian Datastore - spending data sheets

So for example, most of the spreadsheets have column names Department Family and Entity, except for the FCO and BIS, which use Dept family and Dept entity respectively. System Transaction Number is almost consistent, except at the FCO and BIS (System transaction no) and the Attorney General’s Office (Transaction number), with the Treasury sheet using Transaction Number; those four departments also use Date rather than Payment Date. FCO and BIS also differ from the norm in the way they capitalise Expense type (every other department uses Expense Type). Environment, Food and Rural Affairs is also out of kilter, using YEAR rather than Year.

(I’m not sure if there are other differences? If you spot one, please note it in a comment.)

Just by the by, here’s the script I used to grab the column names. It’s written in Python, so could form part of a stub for a Scraperwiki script that runs CSV-returning queries across the Guardian Datastore spreadsheets and aggregates the results.

import csv,urllib

spreadsheets=[("Treasury","tVryVDy3K3O6kfV7vt0SdSg","0"),("Transport","thgmwL0KV4fX4g5Kr4rdM-w","0"),("Revenue and Customs","0AonYZs4MzlZbdHJLWXJfS1diemlnN084YlNSU0RjNWc","0"),("Justice","tWJDJMgScG8KKpiwVShSjMw","0"),("International Development","0AonYZs4MzlZbdFdjaGVOVFAydm5sUUlTb09JUzFaNXc","0"),("Home Office","tnBJa5GzHGs6BdHL2K-5N9w","0"),("Health","tPdRIE1Dtovo5adgjt_D5rA","0"),("Government Equalities Office","tKUQuJiBekBQxlDcasQFFaQ","0"),("Foreign and Commonwealth Office","tk38yFwkcFwIiV1Xa5N7ZDw","0"),("Environment, Food and Rural Affairs","tNHaggN5kosvBHIU0pCs4fw","0"),("Energy and Climate Change","tuYwPXA9dma-Y87Pw8yQVjw","0"),("Education","tCieqjFZQ9LKQEZxEJX7IXA","0"),("Defence","tBA1bSiVf7y_no4upsfl8yw","0"),("Culture, Media and Sport","tKlkQ_ocEHaUa3RDgIee5sQ","0"),("Communities and Local Government","tGjVskINaH4X5crKIF_7KaA","0"),("Cabinet Office","toK-_5Qg_7QW8fuEJH5vdgw","0"),("Business, Innovation and Skills","tR6ec9fJYbVpLOlH3X4TjlQ","1"),("Attorney General's Office","0AonYZs4MzlZbdHJKWFMtaGlsaTdDMm5QaFNWWWd0QWc","0")]

def getURL(key,gid):
	# Build a CSV query URL that returns the header row plus one data row for the given sheet
	return "http://spreadsheets.google.com/tq?tqx=out:csv&tq=select%20*%20limit%201&key="+str(key)+"&gid="+str(gid)
	
for dept,key,gid in spreadsheets:
	url= getURL(key,gid)

	f = urllib.urlopen(url)
	data = csv.DictReader(f)
	data.next()
	print dept,data.fieldnames

Government Spending Data Explorer

So… the UK Gov started publishing spending data for at least those transactions over £25,000. Lots and lots of data. So what? My take on it was to find a quick and dirty way to cobble a query interface around the data, so here’s what I spent an hour or so doing in the early hours of last night, and a couple of hours this morning… tinkering with a Gov spending data spreadsheet explorer:

Guardian/gov datastore explorer

The app is a minor reworking of my Guardian datastore explorer, which put something of a query front end onto the Guardian Datastore’s Google spreadsheets. Once again, I’m exploiting the work of Simon Rogers and co. at the Guardian Datablog, reusing the departmental spreadsheets they posted last night. I bookmarked the spreadsheets to delicious (here) and use that feed to populate a spreadsheet selector:

Guardian datastore selector - gov spending data

When you select a spreadsheet, you can preview the column headings:

Datastore explorer - preview

Now you can write queries on that spreadsheet as if it was a database. So for example, here are Department for Education spends over a hundred million:

Education spend - over 100 million

The query is built up in part by selecting items from lists of options – though you can also enter values directly into the appropriate text boxes:

Datstrore explorer - build a query

You can bookmark and share queries in the datastore explorer (for example, Education spend over 100 million), and also get URLs that point directly to CSV and HTML versions of the data via Google Spreadsheets.
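By way of illustration, a CSV URL for a query of that sort follows the same pattern used elsewhere in these posts. Something along these lines should work (this uses the Education department spreadsheet key from the list in the previous section; the column letter for the amount is a guess rather than something checked against the actual sheet):

http://spreadsheets.google.com/tq?tqx=out:csv&tq=select%20*%20where%20E%20%3E%20100000000&key=tCieqjFZQ9LKQEZxEJX7IXA&gid=0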

Several other example queries are given at the bottom of the data explorer page.

For certain queries (e.g. two column ones with a label column and an amount column), you can generate charts – such as Education spends over 250 million:

Education spend - over 250 million

Here’s how we construct the query:

Education - query spend over 250 million

If you do use this app, and find some interesting queries, please bookmark them and tag them with wdmmg-gde10, or post a link in a comment below, along with a description of what the query is and why it’s interesting. I’ll try to add interesting examples to the app’s list of example queries.

Notes: the datastore explorer is an example of a single web page application, though it draws on several other external services – delicious for the list of spreadsheets, Google spreadsheets for the database and query engine, Google charts for the charts and styled tabular display. The code is really horrible (it evolved as a series of bug fixes on bug fixes;-), but if anyone would like to run with the idea, start coding afresh maybe, and perhaps make a production version of the app, I have a few ideas I could share;-)

Show Me the Data – But Not All of It (Just a Little Bit…)

Over the weekend, I managed to catch up with open data advocate Rufus Pollock for a bit of a chat on all manner of things. One of the things that came up in conversation related to a practical issue around the ability to preview data quickly and easily without having to download and open large data files that might turn out not to contain the data you were looking for.

When building data handling applications, it can also be useful to get your code working on a small number of sampled data rows, rather than a complete data file.

Anyway, here’s one possible solution if your data is in a Google Spreadsheet – a URL pattern that will provide you with an HTML preview of the first ten lines of a sheet:

http://spreadsheets.google.com/tq?tqx=out:html&tq=select%20*%20limit%2010&key=SPREADSHEET_KEY&gid=SHEET_NUMBER

What it does is look up a particular sheet in a public/published Google spreadsheet, select every column (select *) and then limit the display to the first 10 rows (limit 10 – just change the number to preview more or fewer rows). If there is only one sheet in the spreadsheet, or you want to display the first sheet, you can remove the &gid=SHEET_NUMBER part of the URL.

And if you’d rather have CSV data than an HTML preview, just change the out:html switch to out:csv
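The CSV version of the pattern above would thus be:

http://spreadsheets.google.com/tq?tqx=out:csv&tq=select%20*%20limit%2010&key=SPREADSHEET_KEY&gid=SHEET_NUMBER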

So for example, here’s a preview of the “GDP budgetiser” spreadsheet maintained by the WhereDoesMyMoneyGo folk (and described here):

Google spreadsheet preview

It seems to me that if data is being published in a Google doc, then in some situations it might also make sense to either link to, or display, a sample of the data so that folk can check that it meets their expectations before they download it. (I’ve noticed, for example, that even with CSV, many browsers insist on downloading the doc in response to MIME type or server streaming settings so that you then have to open it up in another application, rather than just letting you preview it in the browser. Which is to say, if you have to keep opening up docs elsewhere, it makes browsing hard and can be a huge time waster. It can also be particularly galling if the downloaded file contains data that you’re not interested in, particularly when it’s a large file you’ve downloaded.)

Just by the by, I had thought that Google did a spreadsheet previewer that could take a link to an online spreadsheet or CSV document and preview it in Google Spreadsheets, but I must have misremembered. The Google Docs Viewer only seems to preview “PDF documents, PowerPoint presentations, and TIFF files”. For reference, the URL pattern is of the form http://docs.google.com/viewer?url=ESCAPED_PDF_URL

However, the Zoho Excel Viewer does preview public online Excel docs, along with CSV and OpenOffice Calc docs, using the URL pattern: http://sheet.zoho.com/view.do?url=SPREADSHEET_URL. (Apparently, you can also import docs into your Zoho account using this construction: http://sheet.zoho.com/import.do?url=SPREADSHEET_URL). So for example, here’s a preview of the meetings that Cabinet Office ministers have had recently (via the new Number 10 Transparency website):

Previewing CSV in Zoho

Finally, one of the issues we’ve been having on WriteToReply is how to handle data nicely. The theme we’re using was conflicting (somehow) with Google spreadsheet embed codes, but when I came across this shortcode plugin earlier today, it struck me that it might help…? It simply takes a Google spreadsheet key (using the pattern [gdoc key=”ABCDEFG”]) and then inserts the necessary HTML embedding tags: WordPress plugin: inline Google Spreadsheet Viewer

However, it also struck me that a few tweaks to that plugin could probably also provide a preview view of the data, showing for example the first 10 lines or so of data in a data file. Providing such a preview view over a sample of data in a data file, maybe in a collapsed-by-default section of a page, might be something worth exploring in CKAN and data.gov.uk data catalogue pages?

PS it just occurred to me: Scraperwiki offers just this sort of data preview:

Scraperwiki data preview

Maybe the UI patterns are starting to form and will then start to fall into place…?;-)

UPDATE 28/11/10: I just noticed that the data.gov.uk site is now offering a link to preview xls docs at least on Zoho:

Spreadsheet preview on data.gov.uk

Useful…:-)