OUseful.Info, the blog…

Trying to find useful things to do with emerging technologies in open education

Posts Tagged ‘datajourn’

Programming, Not Coding: Infoskills for Journalists (and Librarians..?!;-)

A recent post on the journalism.co.uk site asks: How much computer science does a journalist really need?, commenting that whilst coding skills may undoubtedly be useful for journalists, knowing what can be achieved easily in a computational way may be more important, because there are techies around who can do the coding for you… (For another take on this, see Charles Arthur’s If I had one piece of advice to a journalist starting out now, it would be: learn to code, and this response to it: Learning to Think Like A Programmer.)

Picking up on a few thoughts that came to mind around a presentation I gave yesterday (Web Lego And Format Glue, aka Get Yer Mashup On), here’s a slightly different take on it, based on the idea that programming doesn’t necessarily mean writing arcane computer code.

Note that a lot of what follows I’d apply to librarians as well as journalists… (So for example, see Infoskills for the Future – If You Can’t Handle Information, Get Out of the Library for infoskills that I think librarians as information professionals should at least be aware of (and these probably apply to journalists too…); Data Handling in Action is also relevant – it describes some of the practical skills involved in taking a “dirty” data set and getting it into a form where it can be easily visualised…)

So here we go…. An idea I’ve started working on recently as an explanatory device is the notion of feed oriented programming. I appreciate that this probably already sounds scary geeky, but it’s a made-up phrase and I’ll try to explain it. A feed is something like an RSS feed. (If you don’t know what an RSS feed is, this isn’t a remedial class, okay… go and find out… this old post should get you started: We Ignore RSS at OUr Peril.)

Typically, an RSS feed will contain a set of items, such as a set of blog posts, news stories, or bookmarks. Each item has the same structure in terms of how it is represented on a computer. Typically, the content of the feed will change over time – a blog feed represents the most recent posts on a blog, for example. That is, the publisher of the feed makes sure that the feed has current content in it – as a “programmer” you don’t really need to do anything to get the fresh content in the feed – you just need to look at the feed to see if there is new content in it – or let your feed reader show you that new content when it arrives. The feed is accessed via a web address/URL.
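To make that concrete, an RSS feed is just a structured text file, something like this made-up, minimal example (a real feed carries a few more housekeeping elements):

<rss version="2.0">
  <channel>
    <title>Example blog</title>
    <link>http://example.com/</link>
    <description>A made-up blog feed</description>
    <item>
      <title>Most recent post</title>
      <link>http://example.com/most-recent-post</link>
      <pubDate>Mon, 26 Apr 2010 09:00:00 GMT</pubDate>
      <description>A short summary of the post…</description>
    </item>
    <!-- more item elements, one per post, most recent first -->
  </channel>
</rss>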

Some RSS feeds might not change over time. On WriteToReply, where we republish public documents, it’s possible to get hold of an RSS version of the document. The document RSS feed doesn’t change because the content of the document doesn’t change, although the content of the comment feeds might change as people comment on the document.

A nice thing about RSS is that lots of things publish it, and lots of things can import it. Importing an RSS feed into an application such as Google Reader simply means pasting the web address of the feed into a “Subscribe to feed” box in the application. Although it can do other things too, like supporting search, Google Reader is primarily a display application. It takes in RSS feeds and presents them to the user in an easy to read way. Google Maps and Google Earth are other display applications – they display geographical information in an appropriate way, a way that we can readily make sense of.

So what do we learn from this? Information can be represented in a standard way, such as RSS, and displayed in a visual way by an application that accepts RSS as an input. By subscribing to an RSS feed, which we identify by a fixed/permanent web address, we can get new content into our reader without doing anything. Subscribing is just a matter of copying a web address from the publisher’s web site and pasting it into our reader application. Cut and paste. No coding required. The feed publisher is responsible for putting new content into the feed, and our reader application is responsible for pulling that new content out and displaying it to us.

One of the tools I use a lot is Yahoo Pipes. Yahoo Pipes can take in RSS feeds and do stuff with them; it can take in a list of blog posts as an RSS feed and filter them so that you only get posts out that do – or don’t – mention cats, for example. And the output is in the form of an RSS feed…

What this means is that if we have a Yahoo pipe that does something we want in computational terms to an RSS feed, all we have to do is give it the web address of the feed we want to process, and then grab the RSS output web address from the Pipe. Cut and paste the original feed web address into the Pipe’s input. Cut and paste the web address of the RSS output from the pipe into our feed reader. No coding required.

Another couple of tools I use are Google Spreadsheets (a spreadsheet application) and Many Eyes Wikified (an interactive visualisation application). If you publish a spreadsheet on Google docs, you can get a web address/URL that points to a CSV (comma separated values) version of the selected sheet. A CSV file is a simple text file where each spreadsheet row is represented as a row in the CSV structured text file; and the value of each cell along a row in the original spreadsheet is represented as the same value in the text file, separated from the previous value by a comma. But you don’t need to know that… All you do need to know is that you can think of it as a feed… With a web address… And in a particular format…
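So, for example, a couple of rows from a three column spreadsheet might come out of the CSV version looking something like this (the institution names and numbers are made up):

Institution,Average teaching score,Dropout rate
Example University,85,4.2
Another University,72,7.9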

Going to the “Share” options in the spreadsheet, you can publish the sheet and generate a web address that points to a range of cells in the spreadsheet (eg: B1:D120) represented as a CSV file. If we now turn to Many Eyes Wikified, I can provide it with the web address of a CSV file and it will go and fetch the data for me. At the click of a button I can then generate an interactive visualisation of the data in the spreadsheet. Cut and paste the web address of the CSV version of the data that Google Docs generates for me into Many Eyes Wikified, and I can then create an interactive visualisation at the click of a button. Cut and paste a URL/web address that is generated for me. No coding required.
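The published CSV web addresses follow a pattern something like this (the key, sheet number and cell range here are just placeholders):

http://spreadsheets.google.com/pub?key=SPREADSHEETKEY&single=true&gid=0&range=B1:D120&output=csv

But again, you don’t need to know that… you just copy whatever address the publish dialogue gives you.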

As to where the data in the spreadsheet came from? Well maybe it came from somewhere else on the web, via a URL? Like this, maybe?

So the model I’m working towards with feed oriented programming is the idea that you can get the web address of a feed to which a publisher will publish current content or data, and paste that address into an application that will render or display the content (e.g. Google Reader, Many Eyes Wikified) or process/transform that data on your behalf.

So for example, Google Spreadsheets can transform an HTML table to CSV for you (Google spreadsheets also lets you do all the normal spreadsheet things, so you could generate one sheet from another sheet using whatever spreadsheet formulae you like, and publish the CSV representation of that second sheet). Or in Yahoo Pipes, you can process an RSS feed by filtering its contents so that you only see posts that mention cats.
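As a sketch of how the HTML-table-to-CSV trick works: Google Spreadsheets has an =ImportHtml() formula that will pull a table from a web page into a sheet, which you can then publish as CSV. The Wikipedia page and the table number here are just illustrative:

=ImportHtml("http://en.wikipedia.org/wiki/List_of_cities_in_the_United_Kingdom","table",1)

Paste something like that into cell A1 of an empty sheet and the table appears; no coding required.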

Yahoo Pipes offers other sorts of transformation as well. For example, in my original Wikipedia scraping demo, I took the feed from a Google spreadsheet and passed it to Yahoo Pipes where I geocoded city names and let pipes generate a map friendly feed (known as a KML feed) for me. Copying the web address of the KML feed output from the pipe and pasting it into Google Maps means I can generate an embeddable Google map view of data originally pulled from Wikipedia:

Once you start to think of the world in this way:

- where the web contains data and information that is represented in various standard ways and made available via a unique and persistent web address,

- where web applications can accept data and content that is represented in a standard way given the web address of that data,

- where web applications can transform data represented at one web address in one particular way and republish it in another standard format at another web address,

- or where web applications can take data represented in a particular way from one web address and provide you with the tools to then visualise or display that data,

then the world is your toolbox. Have URL, will travel. All you need to know is which applications can import what format data, and how they can republish that data for you, whether in a different format, such as Google spreadsheets taking an HTML table from Wikipedia and publishing it as a CSV file, or as a visualisation/human friendly display (Many Eyes Wikified, Google Reader). And if you need to do “proper” programmer type things, then you might be able to do it using a spreadsheet formula or a Yahoo Pipe (no coding required…;-)

See also: The Journalist as Programmer: A Case Study of The New York Times Interactive News Technology Department [PDF]

Written by Tony Hirst

April 30, 2010 at 11:15 am

Posted in Anything you want


Does Funding Equal Happiness in Higher Education?

[The service used to create the visualisations mentioned in this post has been closed down. To search over recent (2013 intake) Guardian HE data, see this: Guardian Student Rankings]

With the announcement of the amount of funding awarded to UK Higher Education institutions from HEFCE, the government funding body, several people posted me links to the data wondering what I might do with it. I saw this as a good opportunity to do something I’ve been meaning to do for ages, specifically have another look at how to provide a view of a range of HE related datasets around particular institutions. So for example, if you ever wondered whether or not there is a relationship between the drop out rate from a university and a surveyed average teaching score, you should be able to look it up:

Since its launch, one of the more actively maintained areas of the Guardian datastore has been the education area. A quick skim over HE related data turns up at least the following:

In a follow on post, I’ll show how to pull this data together, but for now, let’s look at some of the possibilities of pulling data in from these separate sheets around an HEI. As a proof of concept, I grabbed the following data and popped it into Many Eyes Wikified:

(I need to add provenance info to the wiki, but didn’t in this instance because I don’t want the data to be referenced or trusted… I still need to check everything is working properly… so I guess I should have used dummy HEI names… hmm…)

The data is pulled from four separate sheets and aggregated around HEI name. The “Percentage no longer in HE” comes from the datastore Dropout spreadsheet, the “Total staff earning ..” etc column is from the Pay spreadsheet, the “Average Teaching” to “Student to Staff Ratio” columns come from the 2009-10 university tables spreadsheet, and the “Teaching funding” to “Funding change” columns from the most recent (2010-11) funding spreadsheet.

I’ve posted a couple of interactive visualisations on to Many Eyes Wikified so you can have a play with the numbers (but don’t trust them or quote them unless you fact check them first…;-)

The first is a Scatter Chart, which gives us three dimensions to play with – x, y, and dot size.

So for example, in the chart shown above, we see that lower teaching scores seem to correlate with higher drop out rates. In the chart below, we see how teaching scores relate to the expenditure per student and the student staff ratio (and how expenditure per student and student staff ratio relate to each other):

Is satisfaction rewarded with funding, or is funding intended to improve matters?

The other chart type I produced is a bar chart. These are less interesting, but heavily used…

I assume that university strategy and planning units worry over this sort of combined data all the time (but I’m not sure how they: obtain it, combine it, represent it, report it, use it? Maybe if an HE planner is reading they could post a comment or two to describe what data they use, how they use it and what they use it for…?;-) Anyway, it’s getting close to a stage now where the rest of us can play along too…;-)

Written by Tony Hirst

March 20, 2010 at 11:18 am

Posted in Data, Visualisation


Grabbing “Facts” from the Guardian Datastore with a Google Spreadsheets Formula

In Using Data From Linked Data Datastores the Easy Way (i.e. in a spreadsheet, via a formula) I picked up on an idea outlined in Mulling Over =datagovukLookup() in Google Spreadsheets to show how to use Google Apps script to create a formula that could pull in live data from a data.gov.uk datastore.

So just because, here’s how to do something similar with data from a Google spreadsheet in the Guardian datastore. Note that what I’m essentially proposing here is using the datastore as a database…

To ground the example, consider the HE satisfaction tables:

Lots of data about lots of courses on lots of sheets in a single spreadsheet. But how do you compare the satisfaction ratings across subjects for a couple of institutions? How about like this:

Creating Subject comparison tables from Guardian HE data

(We can just click and drag the formula across a range of cells as we would any other formula.)

That is, how about defining a simple spreadsheet function that lets us look up a particular data value for a particular subject and a particular institution? How about being able to write a formula like:
=gds_education_unitable("elecEng","Leeds","NSSTeachingPerCent")
and get the national student satisfaction survey teaching satisfaction result back from students studying Electrical/Electronic Engineering at Leeds University?

Google Apps script provides a mechanism for defining formulae that can do this, and more:

Guardian Datastore as a database

The script takes the arguments and generates a query to the spreadsheet using the spreadsheet’s visualisation API, as used in my Guardian Datastore Explorer. The results are pulled back as CSV, run through a CSV-to-JavaScript array conversion function (CSVToArray) and then returned to the calling spreadsheet. Here’s an example of the Apps Script:

function gds_education_unitable(sheet,uni,typ){
  var key="phNtm3LmDZEM6HUHUnVkPaA";
  var gid='0';//'Overall Institutional Table';
  var category="C"; //(Average) Guardian teaching score
  switch (sheet){
    case "full":
      gid='0';//'Overall Institutional Table';
      break;
    case "chemEng":
      gid='16';//'15 Chem Eng';
      break;
    case "matEng":
      gid='17';//'16 Mat Eng';
      break;
    case "civilEng":
      gid='18';//'17 Civil Eng';
      break;
    case "elecEng":
      gid='19';//'18 Elec Eng';
      break;
    case "mechEng":
      gid='20';//'19 Mech Eng';
      break;
    default:
  }

  switch (typ){
    case "guardianScore":
      category='C';//(Average) Guardian teaching score
      break;
    case "NSSTeachingPerCent":
      category='D';//
      break;
    case "expenditurePerStudent":
      category='E';//
      break;
    case "studentStaffRatio":
      category='F';//
      break;
    default:
  }

  if (sheet!='full') category=String.fromCharCode(category.charCodeAt(0)+2);

  var url="http://spreadsheets.google.com/tq?tqx=out:csv&tq=select%20B%2C"+category+"%20where%20B%20matches%20%22"+uni+"%22&key="+key+"&gid="+gid;

  var x=UrlFetchApp.fetch(url);
  var ret=x.getContentText();
  ret = CSVToArray( ret, "," );
  return ret[1][1];
}

(The column numbering between the first sheet in the spreadsheet and the subject spreadsheets is inconsistent, which is why we need a little bit of corrective fluff (if (sheet!=’full’)) in the code…)
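The script also relies on a CSVToArray() helper that isn’t shown above. A minimal stand-in (a naive parser that just strips surrounding quotes from each value, and won’t cope with fields that themselves contain commas) might look something like this:

function CSVToArray(strData, strDelimiter) {
  // Split the CSV text into rows, then split each row on the delimiter,
  // stripping any surrounding double quotes from each cell value
  strDelimiter = strDelimiter || ",";
  var rows = strData.split(/\r?\n/);
  var result = [];
  for (var i = 0; i < rows.length; i++) {
    if (rows[i] == "") continue;
    var cells = rows[i].split(strDelimiter);
    for (var j = 0; j < cells.length; j++) {
      cells[j] = cells[j].replace(/^"|"$/g, "");
    }
    result.push(cells);
  }
  return result;
}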

Of course, we can also include script that will generate calls to other spreadsheets, or as I have shown elsewhere to other data sources such as the data.gov.uk Linked Data datastore.

Something that occurred to me though is if and how Google might pull on such “data formula” definitions to feed apps such as Google Squared (related: =GoogleLookup: Creating a Google Fact Engine Directory and Is Google Squared Just a Neatly Packaged and Generalised =googlelookup Array?).

Written by Tony Hirst

February 19, 2010 at 2:12 pm

Using Data From Linked Data Datastores the Easy Way (i.e. in a spreadsheet, via a formula)

Disclaimer: before any Linked Data purists say I’m missing the point about what Linked Data is, does, or whatever, I don’t care, okay? I just don’t care. This is for me, and me is someone who can’t get to grips with writing SPARQL queries, can’t stand the sight of unreadable <rdf> <all:over the=”place”>, can’t even work out how to find what things are queryable in a Linked Data triple store, let alone write queries that link data from one data store with data from another data store (or maybe SPARQL can’t do that yet? Not that I care about that either, because I can, in Yahoo Pipes, or Google Spreadsheets, and in a way that’s meaningful to me…)

In Mulling Over =datagovukLookup() in Google Spreadsheets, I started wondering about whether or not it would be useful to be able to write formulae to look up “live facts” in various datastores from within a spreadsheet (you know, that Office app that is used pretty much universally in the workplace whenever there is tabular data to hand. That or Access of course…)

Anyway, I’ve started tinkering with how it might work, so now I can do things like this:

The formulae in columns G, H and I are defined according to a Google Apps script, that takes a school ID and then returns something linked to it in the data.gov.uk education datastore, such as the name of the school, or its total capacity.

Formulae look like:

  • =datagovuk_education(A2,"name")
  • =datagovuk_education(A2,"opendate")
  • =datagovuk_education(A2,"totcapacity")

and are defined to return a single cell element. (I haven’t worked out how to return several cells worth of content from a Google Apps Script yet!)

At the moment, the script is a little messy, taking the form:

function datagovuk_education(id,typ) {
  var ret=""; var args=""
  switch (typ) {
    case 'totcapacity':
      args= _datagovuk_education_capacity_quri(id);
      break;
    ...
    default:
      //hack something here;
  }
  var x=UrlFetchApp.fetch('http://data-gov.tw.rpi.edu/ws/sparqlproxy.php',{method: 'post', payload: args});
  var ret=x.getContentText();
  var xmltest=Xml.parse(ret);
  ret=xmltest.sparql.results.result.binding.literal.getText();

  return ret;
}
function _datagovuk_education_capacity_quri(id){
  return "query=prefix+sch-ont%3A+%3Chttp%3A%2F%2Feducation.data.gov.uk%2Fdef%2Fschool%2F%3E%0D%0ASELECT+%3FschoolCapacity+WHERE+{%0D%0A%3Fschool+a+sch-ont%3ASchool%3B%0D%0Asch-ont%3AuniqueReferenceNumber+"+id+"%3B%0D%0Asch-ont%3AschoolCapacity+%3FschoolCapacity.%0D%0A}+ORDER+BY+DESC%28%3Fdate%29+LIMIT+1&output=xml&callback=&tqx=&service-uri=http%3A%2F%2Fservices.data.gov.uk%2Feducation%2Fsparql";
}

The datagovuk_education(id,typ) function takes the school ID and the requested property, uses the case statement to create an appropriate query string, and then fetches the data from the education datastore, returning the result in an XML format like this. The data is pulled from the datastore via Sparqlproxy, and the query string URIs generated (at the moment) by adding the school ID number into a query string generated by running the desired SPARQL query on Sparqlproxy and then grabbing the appropriate part of the URI. (It’s early days yet on this hack!;-)
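For reference, URL-decoding the query string built by _datagovuk_education_capacity_quri() gives a SPARQL query along these lines (SCHOOL_ID marks where the school ID gets spliced in):

prefix sch-ont: <http://education.data.gov.uk/def/school/>
SELECT ?schoolCapacity WHERE {
  ?school a sch-ont:School;
    sch-ont:uniqueReferenceNumber SCHOOL_ID;
    sch-ont:schoolCapacity ?schoolCapacity.
} ORDER BY DESC(?date) LIMIT 1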

By defining appropriate Apps script functions, I can also create formulae to call other datastores, run queries on Google spreadsheets (e.g. in the Guardian datastore) and so on. I assume similar sorts of functionality would be supported using VB Macros in Excel?

Anyway – this is my starter for ten on how to make live datastore data available to the masses. It’ll be interesting to see whether this approach (or one like it) is used in favour of getting temps to write SPARQL queries and RDF parsers… The obvious problem is that my approach can lead to an explosion in the number of formulae and parameters you need to learn; the upside is that I think these could be quite easily documented in a matrix/linked formulae chart. The approach also scales to pulling in data from CSV stores and other online spreadsheets, using spreadsheets as a database via the =QUERY() formula (e.g. Using Google Spreadsheets Like a Database – The QUERY Formula), and so on. There might also be a market for selling prepackaged or custom formulae as script bundles via a script store within a larger Google Apps App store

PS I’m trying to collect example SPARQL queries that run over the various data.gov.uk end points because: a) I’m really struggling in getting my head round writing my own, not least because I struggle to make sense of the ontologies, if I can find them, and write valid queries off the back of them; even (in fact, especially) really simple training/example queries will do! b) coming up with queries that either have interesting/informative/useful results, or clearly demonstrate an important ‘teaching point’ about the construction of SPARQL queries, is something I haven’t yet got a feel for. If you’ve written any, and you’re willing to share, please post a gist to github and add a link in a comment here.

PPS utility bits, so I don’t lose track of them:
- education datastore ontology
- Apps script XML Element class

PPPS Here’s how to dump a 2D CSV table into a range of cells: Writing 2D Data Arrays to a Google Spreadsheet from Google Apps Script Making an HTTP POST Request for CSV Data
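As a rough sketch of what that involves (the function name is made up, and the CSV parsing is the same naive approach as the CSVToArray stand-in above), something like this Apps Script should fetch a CSV file and write the resulting 2D array into a range of cells in one go:

function importCSVToSheet(csvUrl) {
  // Fetch the CSV text and split it into a 2D array
  // (naive parsing: assumes no quoted commas or embedded newlines)
  var csv = UrlFetchApp.fetch(csvUrl).getContentText();
  var lines = csv.split(/\r?\n/);
  var data = [];
  for (var i = 0; i < lines.length; i++) {
    if (lines[i] != "") data.push(lines[i].split(","));
  }
  // Write the whole array into the active sheet, starting at cell A1;
  // setValues() expects every row to have the same number of columns
  var sheet = SpreadsheetApp.getActiveSheet();
  sheet.getRange(1, 1, data.length, data[0].length).setValues(data);
}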

Written by Tony Hirst

February 17, 2010 at 11:52 pm

Using Google Spreadsheets Like a Database – The QUERY Formula

In this year’s student satisfaction tables, which universities have a good teaching score but low employment prospects? How would you find out? In this post, you’ll find out…

Whether or not it was one of my resolutions, one of the things I want to do more this year is try to make more use of stuff that’s already out there, and come up with recipes that hopefully demonstrate to others how to make use of those resources.

So today’s trick is prompted by a request from @paulbradshaw about “how to turn a spreadsheet into a form-searchable database for users” within a Google spreadsheet (compared to querying a google spreadsheet via a URI, as described in Using Google Spreadsheets as a Database with the Google Visualisation API Query Language).

I’m not going to get as far as the form bit, but here’s how to grab details from a Google spreadsheet, such as one of the spreadsheets posted to the Guardian Datastore, and query it as if it was a database in the context of one of your own Google spreadsheets.

This trick actually relies on the original Google spreadsheet being shared in “the right way”, which for the purposes of this post we’ll take to mean – it can be viewed using a URL of the form:

http://spreadsheets.google.com/ccc?key=SPREADSHEETKEY&hl=en

(The &hl=en on the end is superfluous – it doesn’t matter if it’s not there…) The Guardian Datastore folks sometimes qualify this link with a statement of the form Link (if you have a Google Docs account).

If the link is of the form:
http://spreadsheets.google.com/pub?key=SPREADSHEETKEY
just change pub to ccc

So for example, take the case of the 2010-2011 Higher Education tables (described here):

http://spreadsheets.google.com/ccc?key=reBYenfrJHIRd4voZfiSmuw

The first thing to do is to grab a copy of the data into our own spreadsheet. So go to Google Docs, create a new spreadsheet, and in cell A1 enter the formula:
=ImportRange("reBYenfrJHIRd4voZfiSmuw","Institutional Table!A1:K118")

When you hit return, the spreadsheet should be populated with data from the Guardian Datastore spreadsheet.

So let’s see how that formula is put together.
=ImportRange("reBYenfrJHIRd4voZfiSmuw","Institutional Table!A1:K118")

Firstly, we use the =ImportRange() formula, which has the form:
=ImportRange(SPREADSHEETKEY, SHEET!RANGE)

This says that we want to import a range of cells from a sheet in another spreadsheet/workbook that we have access to (such as one we own, one that is shared with us in an appropriate way, or a public one). The KEY is the key value from the URL of the spreadsheet we want to import data from. The SHEET is the name of the sheet the data is on:

The RANGE is the range of the cells we want to copy over from the external spreadsheet.

Enter the formula into a single cell in your spreadsheet and the whole range of cells identified in the specified sheet of the original spreadsheet will be imported to your spreadsheet.

Give the sheet a name (I called mine ‘Institutional Table 2010-2011’; the default would be ‘Sheet1’).

Now we’re going to treat that imported data as if it was in a database, using the =QUERY() formula.

Create a new sheet, call it “My Queries” or something similar and in cell A1 enter the formula:

=QUERY('Institutional Table 2010-2011'!A1:K118,"Select A")

What happens? Column A is pulled into the spreadsheet is what. So how does that work?

The =QUERY() formula, which has the basic form =QUERY(RANGE,DATAQUERY), allows us to run a special sort of query against the data specified in the RANGE. That is, you can think of the RANGE as specifying a database, and the DATAQUERY as a database query language query (sic) over that database.

So what sorts of DATAQUERY can we ask?

The simplest queries are not really queries at all, they just copy whole columns from the “database” range into our “query” spreadsheet.

So things like:

  • =QUERY('Institutional Table 2010-2011'!A1:K118,"Select C") to select column C;
  • =QUERY('Institutional Table 2010-2011'!A1:K118,"Select C,D,G,H") to select columns C, D, G and H;

So looking at a copy of the data in our spreadsheet, to import the columns relating to the Institution, Average Teaching Score, Expenditure per Student and Career Prospects, I’d select columns C, D, F and H:

like this:
=QUERY('Institutional Table 2010-2011'!A1:K118,"Select C,D, F,H")
to give this:

(Remember that the column labels in the query refer to the spreadsheet we are treating as a database, not the columns in the query results sheet shown above.)

All well and good. But suppose we only want to look at institutions with a poor teaching score (column D), less than 40? Can we do that too? Well, yes, we can, with a query of the form:

"Select C,D, F,H where D < 40"

(The spaces around the less than sign are important… if you don’t include them, the query may not work.)

Here’s the result:

(Remember, column D in the query is actually the second selected column, which is placed into column B in the figure shown above.)

Note that we can order the results according to other columns too. So for example, to order the results according to increasing expenditure (column F), we can write:

"Select C,D, F,H where D < 40 order by F asc"

(For decreasing order, use desc.)

Note that we can run more complex queries too. So for example, if we want to find institutions with a high average teaching score (column D) but low career prospects (column H) we might ask:

"Select C,D, F,H where D > 70 and H < 70"

And so on…

Over the next week or two, I’ll post a few more examples of how to write spreadsheet queries, as well as showing you a trick or two about how to build a simple form-like interface within the spreadsheet for constructing queries automatically; but for now, why not try having a quick play with the =QUERY() formula yourself?

Written by Tony Hirst

January 19, 2010 at 2:21 pm

Tinkering With Timetric – London Datastore Borough Population Data

Earlier this week, the Greater London Authority opened up a preview/prototype of the London Datastore (news reports).

Many of the data sets I quickly looked at are being made available as CSV and XML data feeds, which is very handy :-)

Anyway, in preparation for having some new recipes to drop into conversation at News:Rewired next week, I thought I’d have a quick play with visualising some of the timeseries data in Timetric to see what sorts of “issues” it might throw up.

So how does Timetric like to import data? There are three main ways – copy’n’paste, import a spreadsheet (CSV or XLS) from your desktop, or grab the data from a URL.

Obviously, the online route appeals to me:-)

Secondly, how does Timetric expect the data to be formatted? At the moment, quite rigidly, it seems:

To publish data in a format Timetric can understand, you should expose it in one of two formats — either CSV or Excel (.xls) format. Dates/times must be in the first column of the file, and values in the second.

For importing CSV data, the date/time should be in W3C datetime format (like 2009-09-14 17:34:00) or, optionally, Unix timestamps as floating point numbers.
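So a CSV file that Timetric will be happy with looks something like this (dates in the first column, values in the second; the numbers here are made up):

2009-09-14 17:34:00,123.0
2009-10-14 17:34:00,125.5
2009-11-14 17:34:00,127.25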

Hmmm, so at the moment, I can only import one time series at a time, unless I’m a geeky hacker type and know how to “write a programme” to upload multiple sets of data from a multi-column file via the API… But OUseful.info isn’t about that, right?!;-)

Let’s look at some of the London datastore data, anyway. How about this – “Historic Census Population” data.

Let’s preview the data in a Google spreadsheet – use the formula:
=ImportData("http://data.london.gov.uk/datafiles/demographics/census-historic-population-borough.csv")

Ok – so we have data for different London Boroughs, for every decade since 1801. But is the data in the format that Timetric wants?

- first no: the dates run across columns rather than down rows.

So we need to swap rows with columns somehow. We can do this in a Google spreadsheet with the TRANSPOSE formula. While we’re doing the transposition, we might as well drop the Area Code column and just use the area/borough names. In a new sheet, use the formula:
=TRANSPOSE('Original Data'!A1:W)
(Note, I’d renamed the sheet containing the imported data as Original Data; typically it would be Sheet1, by default.)

NB It seems I could have combined the import and transpose formulae:
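Presumably that combined formula would look something like this:

=TRANSPOSE(ImportData("http://data.london.gov.uk/datafiles/demographics/census-historic-population-borough.csv"))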

Now we hit the second no: the dates are in the wrong format.

Remember, for Timetric “the date/time should be in W3C datetime format (like 2009-09-14 17:34:00) or, optionally, Unix timestamps as floating point numbers.”

My fudge here was to copy all the data except the time data to a new sheet, and just add the time data by hand, using a default day/month/time of midnight on the first of January of the appropriate year. Note that this is not good practice – the data in this sheet is now not just a representation of the original data, it’s been messed around with, and the date field is not the original one, nor even derived from the original one (I don’t think Google spreadsheets has a regular expression search/replace formula that would allow me to do this?)

Anyway, that’s as may be ;-). To keep the correct number format (Google spreadsheets will try to force a different representation of the date), the format of the date cells needs to be set explicitly:

So now we have the data in rows, with the correct data format, the dates being added by hand. Remembering that Timetric can only import one time series at a time, let’s try with the first data set. We can grab the CSV for the first two columns as follows – from the Share Menu, “Publish as Web Page” option, choose the following settings:

(The ‘for timetric’ sheet is the sheet with the tidied date field.)

Here’s the CSV URI, that we can use to get the data in Timetric:

http://spreadsheets.google.com/pub?key=tufGbjv41z8fDauvzyfY7ZA&single=true&gid=2&range=A1%3AB22&output=csv

The upload took a good couple of minutes, with no reassuring user notifications (just the browser appearing to hang waiting for a new Timetric page to load), but eventually it got there…

(And yes, that drop in population is what the data says – though for all the other boroughs you get a curve shaped more as you’d expect;-)

To import other data sets, we need to insert a new Date column, along with date data (I copied it from the first Date column I’d created) and then grab the CSV URI for the appropriate columns:

Anyway, there we have it – a recipe (albeit a slightly messy one) for getting CSV data out of the London datastore, into a Google spreadsheet, transposing its rows and columns, and then generating date information formatted just how Timetric likes it, before grabbing a new CSV data feed out of the spreadsheet and using it to import data into Timetric.

Written by Tony Hirst

January 10, 2010 at 11:32 am

Visualising Where the Money Goes: Westminster Quangos, Part 2

One of the things I try to impress on folk whenever I do a talk about web stats is how showing charts of mean (that is, simple average) numbers of visitors or time on site is all but useless, because the actual distribution of values is not likely to be normal, and so the simple averages reported are, to all intents and purposes, not good for anything.

So when putting together a change treemap to summarise the changes in funding of quangos from government departments (Visualising Where the Money Goes: Westminster Quangos, Part 1), what I did originally was to visualise the change in funding for each quango within each department, rather than just displaying the following simpler, overall funding by government department treemap that I posted about previously:

The reason being? The overall change treemap might show increases or decreases in expenditure for a department as a whole, but it doesn’t reveal whether the funded quangos were all treated equally, or whether lots of small quangos received a cut whilst one big one received a huge rise in funding, for example:

So how did I create this treemap? The simplest way is just to create a query on the original spreadsheet that pulls in 4 columns – department, quango, and two expenditure columns (one for 2006 and one for 2007). A query built around this kernel, in fact:

select C,D,R,S order by C,R desc

(To see how such a query is put together, see Part 1 of this exercise.)

To generate a query that only displays quangos that had a non-zero expenditure in both 2006 and 2007, just add:

where (R>0 AND S>0) after the select statement and before order by.
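That is, the full query passed to the spreadsheet ends up looking something like this:

select C,D,R,S where (R>0 AND S>0) order by C,R desc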

Just as an aside, note that it’s possible to pull the output of this query into another spreadsheet. This allows you to add derived columns to the spreadsheet, for example, using relative formulae that act on quantities automatically imported into other columns in the spreadsheet. (One thing I intend to explore with the data store explorer is a 1-click way of creating a new spreadsheet that pulls in a query created using the explorer. See also: Using Google Spreadsheets and Viz API Queries to Roll Your Own Data Rich Version of Google Squared on Steroids (Almost…))
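As a sketch of that aside, pulling the CSV output of the query into another spreadsheet is just a matter of wrapping the visualisation API query URL in an =ImportData() formula (the query is URL-encoded, and the spreadsheet key here is a placeholder):

=ImportData("http://spreadsheets.google.com/tq?tqx=out:csv&tq=select%20C%2CD%2CR%2CS%20where%20(R%3E0%20AND%20S%3E0)%20order%20by%20C%2CR%20desc&key=SPREADSHEETKEY")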

The CSV feed can then be pulled into Many Eyes Wikified and visualised in a variety of ways that reveal both the overall expenditure of quangos funded from within each department, as well as the relative funding over 2006/2007 of each quango:

So for example, for the two biggest quangos by expenditure in Defence, one spent more over 2006/7, and the other spent less…

Using the same data page, we can create other visualisations within the Many Eyes Wikified environment that allow us to explore the data in a little more detail. So for example, here’s a bit more detail about the funding of quangos in the Department of Health. Parallel bands show that quangos spent equivalent amounts in 2006 and 2007, lines that diverge going from left to right show an increase in expenditure, and lines that converge going from left to right depict decreasing expenditure.

A scatter chart readily shows large changes in expenditure over the two years:

And some charts are just shiny ;-)

Compared with just trawling through the spreadsheet data, I think there is a lot to be said for using visualisations to identify out of the norm data points (e.g. using the scatterplot), or unpacking totals (as in the case of departmentally funded quango expenditure) in a quick and easy way as a starting point to exploring the data in a more systematic way, for example in order to ask journalistic questions (whatever they might be).

Written by Tony Hirst

July 14, 2009 at 1:40 pm

Posted in Data, Tinkering

