Getting Started With Local Council Spending Data

With more and more councils doing as they were told and opening up their spending data in the name of transparency, it’s maybe worth a quick review of how the data is currently being made available.

To start with, I’m going to consider the Isle of Wight Council’s data, which was opened up earlier this week. The first data release can be found (though not easily?!) as a pair of Excel spreadsheets, both of which are just over 1MB in size, at http://www.iwight.com/council/transparency/ (This URL reminds me that it might be time to review my post on “Top Level” URL Conventions in Local Council Open Data Websites!)

The data has also been released via Spikes Cavell at Spotlight on Spend: Isle of Wight.

The Spotlight on Spend site offers a hierarchical table based view of the data; value add comes from the ability to compare spend with national averages and that of other councils. Links are also provided to monthly datasets available as a CSV download.

Uploading these datasets to Google Fusion tables shows the following columns are included in the CSV files available from Spotlight on Spend (click through the image to see the data):

Note that the Expense Area column appears to be empty, and that the transaction dates appear to be “clumped”? Also note that each row, column and cell is commentable upon.

The Excel spreadsheets on the Isle of Wight Council website are a little more complete – here’s the data in Google Fusion tables again (click through the image to see the data):

(It would maybe be worth comparing these columns with those identified as Mandatory or Desirable in the Local Spending Data Guidance? A comparison with the format the esd use for their Linked Data cross-council local spending data demo might also be interesting?)

Note that because the Excel files on the Isle of Wight Council website were larger than the 1MB size limit on XLS spreadsheet uploads to Google Fusion Tables, I had to open the spreadsheets in Excel and then export them as CSV documents. (Google Fusion Tables accepts CSV uploads for files up to 100MB.) So if you’re writing an open data sabotage manual, this may be something worth bearing in mind (i.e. publish data in very large Excel spreadsheets)!

It’s also worth noting that if different councils use similar column headings and CSV file formats, and include a column stating the name of the council, it should be trivial to upload all their data to a common Google Fusion Table allowing comparisons to be made across councils, contractors with similar names to be identified across councils, and so on… (i.e. Google Fusion tables would probably let you do as much as Spotlight on Spend, though in a rather clunkier interface… but then again, I think there is a fusion table API…?;-)

Although the data hasn’t appeared there yet, I’m sure it won’t be long before it’s made available on OpenlyLocal:

However, the Isle of Wight’s hyperlocal news site, Ventnorblog, teamed up with a local developer to revise Adrian Short’s Armchair Auditor code and release the OnTheWight Armchair Auditor site:

So that’s a round up of where the data is, and how it’s presented. If I get a chance, the next step is to:
– compare the offerings with each other in more detail, e.g. the columns each view provides;
– compare the offerings with the guidance on release of council spending data;
– see what interesting Google Fusion table views we can come up with as “top level” reports on the Isle of Wight data;
– explore the extent to which Google Fusion Tables can be used to aggregate and compare data from across different councils.

PS related – Nodalities blog: Linked Spending Data – How and Why Bother Pt2

PPS for a list of local councils and the data they have released, see Guardian datastore: Local council spending over £500, OpenlyLocal Council Spending Dashboard

UK Journalists on Twitter

A post on the Guardian Datablog earlier today took a dataset collected by the Tweetminster folk and graphed the sorts of thing that journalists tweet about (Journalists on Twitter: how do Britain’s news organisations tweet?).

Tweetminster maintains separate lists of tweeting journalists for several different media groups, so it was easy to grab the names on each list, use the Twitter API to pull down the names of people followed by each person on the list, and then graph the friend connections between folk on the lists. The result shows that the hacks follow each other quite closely:

UK Media Twitter echochamber (via tweetminster lists)

Nodes are coloured by media group/Tweetminster list, and sized by PageRank, as calculated over the network using the Gephi PageRank statistic.

The force directed layout shows how folk within individual media groups tend to follow each other more intensely than they do people from other groups, but that said, inter-group following is still high. The major players across the media tweeps as a whole seem to be @arusbridger, @r4today, @skynews, @paulwaugh and @BBCLauraK.

I can generate an SVG version of the chart, and post a copy of the raw Gephi GDF data file, if anyone’s interested…

PS if you’re interested in trying out Gephi for yourself, you can download it from gephi.org. One of the easiest ways in is to explore your Facebook network.

PPS for details on how the above was put together, here are a couple of related approaches:
Trying to find useful things to do with emerging technologies in open education
Doodlings Around the Data Driven Journalism Round Table Event Hashtag Community

For a slightly different view over the UK political Twittersphere, see Sketching the Structure of the UK Political Media Twittersphere. And for the House and Senate in the US: Sketching Connections Between US House and Senate Tweeps

A First Quick Viz of UK University Fees

Regular readers will know how I do quite like to dabble with visual analysis, so here are a couple of doodles with some of the university fees data that is starting to appear.

The data set I’m using is a partial one, taken from the Guardian Datastore: Tuition fees 2012: what are the universities charging?. (If you know where there’s a full list of UK course fees data by HEI and course, please let me know in a comment below, or even better, via an answer to this Where’s the fees data? question on GetTheData.)

My first thought was to go for a proportional symbol map. (Does anyone know of a javascript library that can generate proportional symbol overlays on a Google Map or similar, even better if it can trivially pull in data from a Google spreadsheet via the Google Visualisation API? I have an old hack (supermarket catchment areas), but there must be something nicer to use by now, surely? [UPDATE: ah – forgot this: Polymaps])

In the end, I took the easy way out, and opted for Geocommons. I downloaded the data from the Guardian datastore, and tidied it up a little in Google Refine, removing non-numerical entries (including ranges, such as 4,500-6,000) in the Fees column and replacing them with minimum fee values. Sorting the fees column as a numerical type with errors at the top made the entries that needed tweaking easy to find:
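(If you wanted to automate that tidying step rather than edit cells by hand, a GREL expression along the following lines ought to pull out the lower end of a range – this is just a sketch, assuming the ranges are hyphen separated and use commas as thousands separators, rather than what I actually ran:)

value.split('-')[0].replace(',','').toNumber()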

The Guardian data included an address column, which I thought Geocommons should be able to cope with. It didn’t seem to work out for me though (I’m sure I checked the UK territory, but only seemed to get US geocodings?) so in the end I used a trick posted to the OnlineJournalism blog to geocode the addresses (Getting full addresses for data from an FOI response (using APIs); rather than use the value.parseJson().results[0].formatted_address construct, I generated a couple of columns from the JSON results column using value.parseJson().results[0].geometry.location.lng and value.parseJson().results[0].geometry.location.lat).
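(For reference, the general recipe is to use Refine’s “Add column by fetching URLs” option on the address column with a URL template something like the sketch below – the Google geocoding API endpoint and parameters here are my assumption of the sort of thing the OnlineJournalism post describes, not a copy of its exact expression – and then pull the lat/lng values out of the JSON response using the two constructs given above:)

"http://maps.googleapis.com/maps/api/geocode/json?sensor=false&address=" + escape(value, "url")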

Uploading the data to Geocommons and clicking where prompted, it was quite easy to generate this map of the fees to date:

Anyone know if there’s a way of choosing the order of fields in the pop-up info box? And maybe even a way of selecting which ones to display? Or do I have to generate a custom dataset and then create a map over that?

What I had hoped to be able to do was use coloured proportional symbols to generate a two dimensional data plot, e.g. comparing fees with drop out rates, but Geocommons doesn’t seem to support that (yet?). It would also be nice to have an interactive map where the user could select which numerical value(s) are displayed, but again, I missed that option if it’s there…

The second thing I thought I’d try would be an interactive scatterplot on Many Eyes. Here’s one view that I thought might identify what sort of return on value you might get for your course fee…;-)

Click thru’ to have a play with the chart yourself;-)

PS I can’t not say this, really – you’ve let me down again, @datastore folks…. where’s a university ID column using some sort of standard identifier for each university? I know you have them, because they’re in the Rosetta sheet… although that is lacking a HESA INST-ID column, which might be handy in certain situations… ;-) [UPDATE – apparently, HESA codes are in the spreadsheet…. ;-0]

PPS Hmm… that Rosetta sheet got me thinking – what identifier scheme does the JISC MU API use?

PPPS If you’re looking for a degree, why not give the Course Detective search engine a go? It searches over as many of the UK university online prospectus web pages as we could find and offer up as a sacrifice to a Google Custom search engine ;-)

Tech Tips: Making Sense of JSON Strings – Follow the Structure

Reading through the Online Journalism blog post on Getting full addresses for data from an FOI response (using APIs), the following phrase – relating to the composition of some Google Refine code to parse a JSON string from the Google geocoding API – jumped out at me: “This took a bit of trial and error…”

Why? Two reasons… Firstly, because it demonstrates a “have a go” attitude which you absolutely need to have if you’re going to appropriate technology and turn it to your own purposes. Secondly, because it maybe (or maybe not…) hints at a missed trick or two…

So what trick’s missing?

Here’s an example of the sort of thing you get back from the Google Geocoder:

{ "status": "OK", "results": [ { "types": [ "postal_code" ], "formatted_address": "Milton Keynes, Buckinghamshire MK7 6AA, UK", "address_components": [ { "long_name": "MK7 6AA", "short_name": "MK7 6AA", "types": [ "postal_code" ] }, { "long_name": "Milton Keynes", "short_name": "Milton Keynes", "types": [ "locality", "political" ] }, { "long_name": "Buckinghamshire", "short_name": "Buckinghamshire", "types": [ "administrative_area_level_2", "political" ] }, { "long_name": "Milton Keynes", "short_name": "Milton Keynes", "types": [ "administrative_area_level_2", "political" ] }, { "long_name": "United Kingdom", "short_name": "GB", "types": [ "country", "political" ] }, { "long_name": "MK7", "short_name": "MK7", "types": [ "postal_code_prefix", "postal_code" ] } ], "geometry": { "location": { "lat": 52.0249136, "lng": -0.7097474 }, "location_type": "APPROXIMATE", "viewport": { "southwest": { "lat": 52.0193722, "lng": -0.7161451 }, "northeast": { "lat": 52.0300728, "lng": -0.6977000 } }, "bounds": { "southwest": { "lat": 52.0193722, "lng": -0.7161451 }, "northeast": { "lat": 52.0300728, "lng": -0.6977000 } } } } ] }

The data represents a Javascript object (JSON = JavaScript Object Notation) and as such has a standard, hierarchical form.

Here’s another way of writing the same object code, only this time laid out in a way that reveals the structure of the object:

{
  "status": "OK",
  "results": [ {
    "types": [ "postal_code" ],
    "formatted_address": "Milton Keynes, Buckinghamshire MK7 6AA, UK",
    "address_components": [ {
      "long_name": "MK7 6AA",
      "short_name": "MK7 6AA",
      "types": [ "postal_code" ]
    }, {
      "long_name": "Milton Keynes",
      "short_name": "Milton Keynes",
      "types": [ "locality", "political" ]
    }, {
      "long_name": "Buckinghamshire",
      "short_name": "Buckinghamshire",
      "types": [ "administrative_area_level_2", "political" ]
    }, {
      "long_name": "Milton Keynes",
      "short_name": "Milton Keynes",
      "types": [ "administrative_area_level_2", "political" ]
    }, {
      "long_name": "United Kingdom",
      "short_name": "GB",
      "types": [ "country", "political" ]
    }, {
      "long_name": "MK7",
      "short_name": "MK7",
      "types": [ "postal_code_prefix", "postal_code" ]
    } ],
    "geometry": {
      "location": {
        "lat": 52.0249136,
        "lng": -0.7097474
      },
      "location_type": "APPROXIMATE",
      "viewport": {
        "southwest": {
          "lat": 52.0193722,
          "lng": -0.7161451
        },
        "northeast": {
          "lat": 52.0300728,
          "lng": -0.6977000
        }
      },
      "bounds": {
        "southwest": {
          "lat": 52.0193722,
          "lng": -0.7161451
        },
        "northeast": {
          "lat": 52.0300728,
          "lng": -0.6977000
        }
      }
    }
  } ]
}

Making Sense of the Notation

At its simplest, the structure has the form: {"attribute":"value"}

If we parse this object into a variable called jsonObject, we can access the value of the attribute as jsonObject.attribute or jsonObject["attribute"]. The first style of notation is known as dot notation.

We can add more attribute:value pairs into the object by separating them with commas: a={"attr":"val","attr2":"val2"} and address them (that is, refer to them) uniquely: a.attr, for example, or a["attr2"].

Try it out for yourself… Copy and paste the following into your browser address bar (where the URL goes) and hit return (i.e. “go to” that “location”):

javascript:a={"attr":"val","attr2":"val2"}; alert(a.attr);alert(a["attr2"])

(As an aside, what might you learn from this? Firstly, you can “run” javascript in the browser via the location bar. Secondly, the javascript command alert() pops up an alert box:-)

Note that the value of an attribute might be another object.

obj={ attrWithObjectValue: { "childObjAttr":"foo" } }

Another thing we can see in the Google geocoder JSON code is square brackets. These define an array (one might also think of it as an ordered list). Items in the list are addressed numerically. So for example, given:

arr=[ "item1", "item2", "item3" ]

we can locate “item1” as arr[0] and “item3” as arr[2]. (Note: the index count in the square brackets starts at 0.) Try it in the browser… (for example, javascript:list=["apples","bananas","pears"]; alert( list[1] );).

Arrays can contain objects too:

list=[ "item1", { "innerObjectAttr":"innerObjVal" } ]

Can you guess how to get to the innerObjVal? Try this in the browser location bar:

javascript: list=[ "item1", { "innerObjectAttr":"innerObjVal" } ]; alert( list[1].innerObjectAttr )

Making Life Easier

Hopefully, you’ll now have a sense that there’s structure in a JSON object, and that this structure is what we rely on if we want to cut down on the “trial and error” when parsing such things. To make life easier, we can also use “tree widgets” to display the hierarchical JSON object in a way that makes it far easier to see how to construct the dotted path that leads to the data value we want.

A tool I have appropriated for previewing JSON objects is Yahoo Pipes. Rather than necessarily using Pipes to build anything, I simply make use of it as a JSON viewer, loading JSON into the pipe from a URL via the Fetch Data block, and then previewing the result:

Another tool (and one I’ve just discovered) is an Air application called JSON-Pad. You can paste in JSON code, or pull it in from a URL, and then preview it again via a tree widget:

Clicking on one of the results in the tree widget provides a crib to the path…
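For example, pasting a cut-down version of the geocoder response shown above into the browser location bar (most of the attributes are omitted for brevity, and geo is just a name I’ve made up), we can walk the dotted path results[0].geometry.location.lat to pull out the latitude:

javascript: geo={ "results": [ { "geometry": { "location": { "lat": 52.0249136, "lng": -0.7097474 } } } ] }; alert( geo.results[0].geometry.location.lat )

(Which is exactly the construct used in the Google Refine geocoding trick mentioned in the university fees post: value.parseJson().results[0].geometry.location.lat.)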

Summary

Getting to grips with writing addresses into JSON objects helps if you have some idea of the structure of a JSON object. Tree viewers make the structure of an object explicit. By walking down the tree to the part of it you want, and “dotting” together* the nodes/attributes you select as you do so, you can quickly and easily construct the path you need.

* If the JSON attributes have spaces or non-alphanumeric characters in them, use the obj[“attr”] notation rather than the dotted obj.attr notation…

PS Via my feeds today, though something I had bookmarked already, this Data Converter tool may be helpful in going the other way… (Disclaimer: I haven’t tried using it…)

If you know of any other related tools, please feel free to post a link to them in the comments:-)

Fragments: Glueing Different Data Sources Together With Google Refine

I’m working on a new pattern using Google Refine as the hub for a data fusion experiment pulling together data from different sources. I’m not sure how it’ll play out in the end, but here are some fragments….

Grab Data into Google Refine as CSV from a URL (Proxied Google Spreadsheet Query via Yahoo Pipes)

Firstly, getting data into Google Refine… I had hoped to be able to pull a subset of data from a Google Spreadsheet into Google Refine by importing CSV data obtained from the spreadsheet via a query generated using my Google Spreadsheet/Guardian datastore explorer (see Using Google Spreadsheets as a Database with the Google Visualisation API Query Language for more on this) but it seems that Refine would rather pull the whole of the spreadsheet in (or at least, the whole of the first sheet (I think?!)).

Instead, I had to create a proxy to run the query via a Yahoo Pipe (Google Spreadsheet as a database proxy pipe), which runs the spreadsheet query, gets the data back as CSV, and then relays it forward as JSON:

Here’s the interface to the pipe – it requires the Google spreadsheet public key id, the sheet id, and the query… The data I’m using is a spreadsheet maintained by the Guardian datastore containing UK university fees data (spreadsheet).

You can get the JSON version of the data out directly, or a proxied version of the CSV via the More options menu…

Using the Yahoo Pipes CSV output URL, I can now get a subset of data from a Google Spreadsheet into Google Refine…

Here’s the result – a subset of data as defined by the query:

We can now augment this data with data from another source using Google Refine’s ability to import/fetch data from a URL. In particular, I’m going to use the Yahoo Pipe described above to grab data from a different spreadsheet and pass it back to Google Refine as a JSON data feed. (Google spreadsheets will publish data as JSON, but the format is a bit clunky…)

To test out my query, I’m going to create a test query in my datastore explorer using the Guardian datastore HESA returns (2010) spreadsheet URL (http://spreadsheets1.google.com/spreadsheet/ccc?hl&key=tpxpwtyiYZwCMowl3gNaIKQ#gid=0) which also has a column containing HESA numbers. (Ultimately, I’m going to generate a URL that treats the Guardian datastore spreadsheet as a database that lets me get data back from the row with a particular HESA code column value. By using the HESA number column in Google Refine to provide the key, I can generate a URL for each institution that grabs its HESA data from the Datastore HESA spreadsheet.)

Hit “Preview Table Headings”, then scroll down to try out a query:

Having tested my query, I can now try the parameters out in the Yahoo pipe. (For example, my query is select D,E,H where D=21 and the key is tpxpwtyiYZwCMowl3gNaIKQ; this grabs data from columns D, E and H where the value of D (HESA Code) is 21). Grab the JSON output URL from the pipe, and use this as a template for the URL template in Google Refine. Here’s the JSON output URL I obtained:

http://pipes.yahoo.com/pipes/pipe.run?_id=4562a5ec2631ce242ebd25a0756d6381
&_render=json&key=tpxpwtyiYZwCMowl3gNaIKQ
&q=select+D%2CE%2CH+where+D%3D21

Remember, the HESA code I experimented with was 21, so this is what we want to replace in the URL with the value from the HESA code column in Google Refine…

Here’s how we create the URLs built around/keyed by an appropriate HESA code…
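(For what it’s worth, the sort of expression I’d expect to use in Refine’s “Add column by fetching URLs” dialog, applied to the HESA code column, is simply the pipe URL above with the trailing 21 swapped for the cell value – a sketch along these lines:)

"http://pipes.yahoo.com/pipes/pipe.run?_id=4562a5ec2631ce242ebd25a0756d6381&_render=json&key=tpxpwtyiYZwCMowl3gNaIKQ&q=select+D%2CE%2CH+where+D%3D" + value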

Google Refine does its thing and fetches the data…

Now we process the JSON response to generate some meaningful data columns (for more on how to do this, see Tech Tips: Making Sense of JSON Strings – Follow the Structure).

First, say we want to create a new column based on the imported JSON data:

Then parse the JSON to extract the data field required in the new column.

For example, from the HESA data we might extract the Expenditure per student /10:

value.parseJson().value.items[0]["Expenditure per student / 10"]

or the Average Teaching Score (value.parseJson().value.items[0]["Average Teaching Score"]):

And here’s the result:

So to recap:

– we use a Yahoo Pipe to query a Google spreadsheet and get a subset of data from it;
– we take the CSV output from the pipe and use it to create a new Google Refine database;
– we note that the data table in Google Refine has a HESA code column; we also note that the Guardian datastore HESA spreadsheet has a HESA code column;
– we realise we can treat the HESA spreadsheet as a database, and further that we can create a query (prototyped in the datastore explorer) as a URL keyed by HESA code;
– we create a new column based on HESA codes from a generated URL that pulls JSON data from a Yahoo pipe that is querying a Google spreadsheet;
– we parse the JSON to give us a couple of new columns.

And there we have it – a clunky, but workable, route for merging data from two different Google spreadsheets using Google Refine.

Merging Datasets with Common Columns in Google Refine

It’s an often encountered situation, but one that can be a pain to address – merging data from two sources around a common column. Here’s a way of doing it in Google Refine…

Here are a couple of example datasets to import into separate Google Refine projects if you want to play along, both courtesy of the Guardian data blog (pulled through the Google Spreadsheets to Yahoo pipes proxy mentioned here):

University fees data (CSV via pipes proxy)

University HESA stats, 2010 (CSV via pipes proxy)

Load one data file into a Google Refine/OpenRefine project called Merge Test A, the other into a separate project called Merge Test B.

We can now merge data from the two projects by creating a new column from values in an existing column within one project that are used to index into a similar column in the other project. Looking at the two datasets, both HESA Code and institution/University look like candidates for merging the data on. Which should we go with? I’d go with the unique identifier (i.e. the HESA code in this case) every time…

First, create a new column:

Now do the merge, using the cell.cross GREL (Google Refine Expression Language) command. Trivially, and pinching wholesale from the documentation example, we might use the following command to bring in Average Teaching Score data from the second project into the first:

cell.cross("Merge Test B", "HESA code").cells["Average Teaching Score"].value[0]

Note that there is a null entry and an error entry. It’s possible to add a bit of logic to tidy things up a little:

if (value!='null',cell.cross("Merge Test B", "HESA code").cells["Average Teaching Score"].value[0],'')

Here’s the result:

Coping with not quite matching key columns

Another situation that often arises is that you have two columns that almost but don’t quite match. For example, this dataset has a different name representation than the above datasets (Merge Test C):

There are several text processing tools that we can use to try to help us match columns that differ in well-structured ways:

In the above case, where I am creating a new column based on the contents of the Institution column in Merge Test C, I’m using a couple of string processing tricks… The GREL expression may look complicated, but if you build it up in a stepwise fashion it makes more sense.

For example, the command replace(value,"this", "that") will replace occurrences of “this” in the string defined by value with “that”. If we replace “this” with an empty string ('' (two single quotes next to each other) or "" (two double quotes next to each other)), we delete it from value: replace(value,"this", "")

The result of this operation can be embedded in another replace statement: replace(replace(value,"this", "that"),"that","the other"). In this case, the first replace will replace occurrences of “this” with “that”; the result of this operation is passed to the second (outer) replace function, which replaces “that” with “the other”. Try building up the expression in realtime, and see what happens. First use:
toLowercase(value)
(what happens?); then:
replace(toLowercase(value),'the','')
and then:
replace(replace(toLowercase(value),'the',''),'of','')

The fingerprint() function then separates out the individual words that are left, orders them, and returns the result (more detail). Can you see how this might be used to transform a column that originally contains “The University of Aberdeen” to “aberdeen university”, which might be a key in another project dataset?
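To make that concrete, here’s a sketch of the sort of expression meant (the fuller version actually used on Merge Test C appears a little further down); applied to a cell containing “The University of Aberdeen”, it should return “aberdeen university”:

fingerprint(replace(replace(toLowercase(value),'the',''),'of',''))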

When trying to reconcile data across two different datasets, you may find you need to try to minimise the distance between almost common key columns by creating new columns in each dataset using the above sorts of technique.

Be careful not to create false positive matches though; and also be mindful that not everything will necessarily match up (you may get empty cells when using cell.cross; to mitigate this, filter rows using the crossed column to find ones where there was no match and see if you can correct them by hand). Even if you aren’t completely successful in crossing data from one project to another, you might manage to automate the crossing of most of the rows, minimising the amount of hand-crafted copying you have to do to tidy up the real odds and ends…

So for example, here’s what I ended up using to create a “Pure key” column in Merge Test C:
fingerprint(replace(replace(replace(toLowercase(value),'the',''),'of',''),'university',''))

And in Merge Test A I create a “Complementary Key” column from the University column using fingerprint(value)

From the Complementary Key column in Merge Test A we call out to Merge Test C: cell.cross("Merge Test C", "Pure key").cells["UCAS ID"].value[0]

Obviously, this approach is far from ideal (and there may be more “correct” and/or efficient ways of doing this!) and the process described above is admittedly rather clunky, but it does start to reveal some of what’s involved in trying to bring data across to one Google Refine project from another using columns that don’t quite match in the original dataset, although they do (nominally) refer to the same thing. It also provides a useful introductory exercise in some of the really quite powerful text processing commands available in Google Refine…

For other ways of combining data from two different data sets, see:
Merging Two Different Datasets Containing a Common Column With R and R-Studio
A Further Look at the Orange Data Playground – Filters and File Merging
Merging CSV data files with Google Fusion Tables

First Play With R and R-Studio – F1 Lap Time Box Plots

Last summer, at the European Centre for Journalism round table on data driven journalism, I remember saying something along the lines of “your eyes can often do the stats for you”, the implication being that our perceptual apparatus is good at pattern detection, and can often see things in the data that most of us would miss using the very limited range of statistical tools that we are either aware of, or are comfortable using.

I don’t know how good a statistician you need to be to distinguish between the datasets in Anscombe’s quartet from summary statistics alone, but the differences are obvious to the eye:

Anscombe's quartet /via Wikipedia

Another shamistician (h/t @daveyp) heuristic (or maybe it’s a crapistician rule of thumb?!) might go something along the lines of: “if you use the right visualisations, you don’t necessarily need to do any statistics yourself”. In this case, the implication is that if you choose a visualisation technique that embodies or implements a statistical process in some way, the maths is done for you, and you get to see what the statistical tool has uncovered.

Now I know that as someone working in education, I’m probably supposed to uphold the “should learn it properly” principle… But needing to know statistics in order to benefit from the use of statistical tools seems to me to be a massive barrier to entry in the use of this technology (statistics is a technology…) You just need to know how to use the technology appropriately, or at least, not use it “dangerously”…

So to this end (“democratising access to technology”), I thought it was about time I started to play with R, the statistical programming language (and rival to SPSS?) that appears to have a certain amount of traction at the moment given the number of books about to come out around it… R is a command line language, but the recently released R-Studio seems to offer an easier way in, so I thought I’d go with that…

Flicking through A First Course in Statistical Programming with R, a book I bought a few weeks ago in the hope that the osmotic reading effect would give me some idea as to what it’s possible to do with R, I found a command line example showing how to create a simple box plot (box and whiskers plot) that I could understand enough to feel confident I could change…

Having an F1 data set/CSV file to hand (laptimes and fuel adjusted laptimes) from the China 2011 grand prix, I thought I’d see how easy it was to just dive in… And it was 2 minutes easy… (If you want to play along, here’s the data file).

Here’s the command I used:
boxplot(Lap.Time ~ Driver, data=lapTimeFuel)
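(One step I’ve glossed over is actually getting the CSV into R in the first place; a minimal sketch, assuming you’ve saved the data file as lapTimeFuel.csv in your working directory:)

# Load the lap time data from the CSV file into a data frame
# (column names such as "Lap Time" become Lap.Time on import)
lapTimeFuel <- read.csv("lapTimeFuel.csv")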

Remembering a comment in a Making up the Numbers blogpost (Driver Consistency – Bahrain 2010) about the effect on laptime distributions from removing opening, in and out lap times, a quick Google turned up a way of quickly stripping out slow times. (This isn’t as clean as removing the actual opening, in and out lap times – it also removes mistake laps, for example, but I’m just exploring, right? Right?!;-)

lapTime2 <- subset(lapTimeFuel, Lap.Time < 110.1)

I could then plot the distribution in the reduced lapTime2 dataset by changing the original boxplot command to use (data=lapTime2). (Note that as with many interactive editors, using your keyboard’s up arrow displays previously entered commands in the current command line; so you can re-enter a previously entered command by hitting the up arrow a few times, then entering return. You can also edit the current command line, using the left and right arrow keys to move the cursor, and the delete key to delete text.)

Prior programming experience suggests this should also work…

boxplot(Lap.Time ~ Driver, data=subset(lapTimeFuel, Lap.Time < 110))

Something else I tried was to look at the distribution of fuel weight adjusted laptimes (where the time penalty from the weight of the fuel in the car is removed):

boxplot(Fuel.Adjusted.Laptime ~ Driver, data=lapTimeFuel)

Looking at the release notes for the latest version of R-Studio suggests that you can build interactive controls into your plots (a bit like Mathematica supports?). The example provided shows how to change the x-range on a plot:
manipulate(
plot(cars, xlim=c(0,x.max)),
x.max=slider(15,25))

Hmm… can we set the filter value dynamically I wonder?

manipulate(
boxplot(Lap.Time ~ Driver, data=subset(lapTimeFuel, Lap.Time < maxval)),
maxval=slider(100,140))

Seems like it…?:-) We can also combine interactive controls:

manipulate(
boxplot(Lap.Time ~ Driver, data=subset(lapTimeFuel, Lap.Time < maxval), outline=outline),
maxval=slider(100,140),
outline=checkbox(FALSE, "Show outliers"))

Okay – that’s enough for now… I reckon that with a handful of commands on a crib sheet, you can probably get quite a lot of chart plot visualisations done, as well as statistical visualisations, in the R-Studio environment; it also seems easy enough to build in interactive controls that let you play with the data in a visually interactive way…

The trick comes from choosing visual statistics approaches to analyse your data that don’t break any of the assumptions about the data that the particular statistical approach relies on in order for it to be applied in any sensible or meaningful way.

[This blog post is written, in part, as a way for me to try to come up with something to say at the OU Statistics Group’s one day conference on Visualisation and Presentation in Statistics. One idea I wanted to explore was: visualisations are powerful; visualisation techniques may incorporate statistical methods or let you “see” statistical patterns; most people know very little statistics; that shouldn’t stop them being able to use statistics as a technology; so what are we going to do about it? Feedback welcome… Err….?!]

Plotting Tabular (CSV) Data and Algebraic Expressions On the Same Graph Using Gnuplot

A couple of the reasons why I’ve been making so much use of Formula 1 data for visualisations lately are that: a) the data is authentic, representing someone’s legitimate data but presented in a way that is outside my control (so I have to wrangle with scraping, representation and modeling issues); b) if I get anything wrong whilst I’m playing, it doesn’t matter… (Whereas if I plotted some university ranking tables and oopsed to show that all the students at X were unhappy, its research record and teaching quality were lousy, and it was going to be charging 9k fees on courses with a 70% likelihood of closing most of them, when in fact it was doing really well, I might get into trouble…;-)

I’ve also been using the data as a foil for finding tools and applications that I can use to create data visualisations that other people might be interested in trying out too. There is a work-related rationale here too: in October, I hope to run a “MOOC”, (all you need to know…) on visual analysis as a live development/production exercise for a new short course that will hopefully be released next year, and part of that will involve the use of various third party tools and applications for the hands-on activities.

One of the issues I’ve recently faced is how to plot a chart that combines tabulated data imported from a CSV file with a line graph plotted from an equation. My ideal tool would be a graphical environment that lets me import data and plot it, and then overlay a plot generated from a formula or equation of my own. Being able to apply a function to the tabulated data (for example, removing a value y = sin(x) from the tabular data) would be ideal, but not essential.

In this post, I’ll describe one tool – Gnuplot – that meets at least the first requirement, and show how it can be used to plot some time series data from a CSV file overlaid with a decreasing linear function. (Which is to say, how to plot F1 laptime data against the fuel weight time penalty – the amount of time that the weight of the fuel in the car slows the car down by… For more on this, see F1 2011 Turkey Race – Fuel Corrected Laptimes.)

I’ve been using Gnuplot off and on for a couple of decades(?!), though I’ve really fallen out of practice with it over the last ten years… (not doing research!;-)

The easiest way of using the tool is to launch it in the directory where your data files are stored. So for example, if the data file resides in the directory \User\tony\f1\data, I would launch my terminal, enter cd \User\tony\f1\data or the equivalent to move to that directory, and then start gnuplot there using the command gnuplot:

gnuplot> set term x11
Terminal type set to 'x11'
gnuplot> set xrange [1:58]
gnuplot> set xlabel "Lap"
gnuplot> set ylabel "Fuel Weight Time Penalty"
gnuplot> set datafile separator ","

For some reason, my version of Gnuplot (on a Mac) wouldn’t display any graphs till I set the output to use x11… The set xrange [1:58] command sets the range of the x-axis (there are 58 laps in a race, hence those settings). The xlabel and ylabel settings are hopefully self-explanatory (they define the axis labels). The set datafile separator "," command prepares Gnuplot to load in a file formatted as tabular data, one row per line, with commas separating the columns. (I assume that if you pass in a quoted string containing a comma, such as “this, that”, it is detected as a single column/cell value, and not as two columns with cell values “this” and “that”? I forget…)

The data file I have is not as clean as it might be. (If you want to play along, the data file is here). It’s arranged as follows:

Driver,Lap,Lap Time,Fuel Adjusted Laptime,Fuel and fastest lap adjusted laptime
25,1,106.951,102.334,9.73
25,2,99.264,94.728,2.124
...
25,55,94.979,94.574,1.97
25,56,95.083,94.759,2.155
20,1,103.531,98.914,6.959
20,2,97.370,92.834,0.879
...

That is, there is one row per driver lap. Each driver’s data is on consecutive lines, in increasing lap number, so driver 25 is on lines 1 to 56, driver 20’s data starts on line 57, and so on…

To plot from a data file, we use the command plot 'turlapTimeFuel.csv' (that is, plot 'filename'). To pull data from a particular column – column 3, say – we add a subcommand of the form using 3 (the x value counts incrementally, so we get a plot of column 3 against increasing row number):
gnuplot> plot 'turlapTimeFuel.csv' using 3

To plot just a range of rows (e.g. rows 0 to 57; the header row is ignored) against row number, we can use a subcommand of the form every::first::last:
gnuplot> plot 'turlapTimeFuel.csv' every::0::57 using 3

To specify the x value (e.g. to plot column 3 as y against column 2 as x), we use a subcommand of the form using x:y:
gnuplot> plot 'turlapTimeFuel.csv' every::0::57 using 2:3

(The first column is column 1; I think the first row is row 0…)

So for example, we can plot Laptime against driver using plot 'turlapTimeFuel.csv' using 1:3, or Fuel Adjusted Laptime against driver using plot 'turlapTimeFuel.csv' using 1:4. The command plot 'turlapTimeFuel.csv' using 2:3 gives us a plot of each driver’s laptime against lap number.

But how do we plot the data for just a single driver? We saw how to plot against a consecutive range of row values (e.g. every::12::23 for rows 12 to 23), but plotting the laptime data for each driver this way is really painful (we have to know which range of row numbers the data we want to plot is on). Instead we can filter out the data according to driver number (the column 1 values):
gnuplot> plot 'turlapTimeFuel.csv' using ($1==4 ? $2:1/0):3 with lines

How do we read this? The command is actually of the form using x:y, but we do a bit of work to choose a valid value of x. ($1==4 ? $2:1/0):3 says “if the value of column 1 ($1) equals 4, then (?) select column 2, otherwise/else (the first “:”), forget it” (1/0 is one divided by zero, a number intensely disliked by gnuplot that says, in this context, do nothing more with this row…). If the value of column 1 does equal 4, then we create a valid statement of the form using 2:3; otherwise we ignore the row and the data in columns 2 and 3. The whole statement thus just plots the data for driver 4.
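(If you find that ternary construct hard on the eye, gnuplot also lets you define your own functions and use them inside the using expression; here’s a sketch – I haven’t battle-tested it – where lapof is just a name I’ve made up, returning the lap number for rows belonging to driver d and ignoring everything else:)

gnuplot> lapof(d) = (column(1)==d ? column(2) : 1/0)
gnuplot> plot 'turlapTimeFuel.csv' using (lapof(4)):3 with lines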

Rather than plot points, the with lines command will join consecutive points using a line, to produce a line chart:
plot 'turlapTimeFuel.csv' using ($1==1 ? $2:1/0):3 with lines

We can add an informative label using the title subcommand:
plot 'turlapTimeFuel.csv' using ($1==1 ? $2:1/0):3 with lines title "(unadjusted) VET Su | Su(11) Su(25) Hn(40) Hn(47)"

We can also plot two drivers’ times on the same chart using different lines:
plot 'turlapTimeFuel.csv' using ($1==1 ? $2:1/0):3 with lines title "(unadjusted) VET Su | Su(11) Su(25) Hn(40) Hn(47)",'turlapTimeFuel.csv' using ($1==2 ? $2:1/0):3 with lines title "WEB "

We can also plot functions. In the following case, I plot the time penalty applied to a car for each lap on the basis of how much more fuel it is carrying at the start of the race compared to the end:
gnuplot> plot 90+(58-x)*2.7*0.03

We can now overlay the drivers’ times and the fuel penalty on the same chart:
gnuplot> set yrange [85:120]
gnuplot> plot 'turlapTimeFuel.csv' using ($1==1 ? $2:1/0):3 with lines title "(unadjusted) VET Su | Su(11) Su(25) Hn(40) Hn(47)",'turlapTimeFuel.csv' using ($1==1 ? $2:1/0):4 with lines title "VET Su | Su(11) Su(25) Hn(40) Hn(47)",90+(58-x)*2.7*0.03

It’s also possible to do sums on the data. If you read $4 as “use the value in column 4 of the current row”, you can start to guess at creating things like the following, which in the first part plots cells in the laptime column 3 modified by the fuel penalty. (I also plot the pre-calculated fuel adjusted laptime data from column 5 as a comparison; the only difference in values is the offset…)
gnuplot> set yrange [-1:20]
gnuplot> plot 'turlapTimeFuel.csv' using ($1==1 ? $2:1/0):($3-(88+(58-$2)*2.7*0.03)) with lines,'turlapTimeFuel.csv' using ($1==1 ? $2:1/0):5 with lines


Doh! I should have changed the y-axis label…:-(

That is, plot 'turlapTimeFuel.csv' using ($1==1 ? $2:1/0):($3-(88+(58-$2)*2.7*0.03)) with lines says: for rows applying to driver 1 (rows where the value in column 1 ($1) equals (==) the driver number (1), i.e. $1==1), use the value in column 2 ($2) as the x-value, and for the y-value use the value in column 3 ($3) minus 88+(58-$2)*2.7*0.03. Note that the (58-$2) fragment subtracts the lap number (as contained in the column 2 ($2) cell) from the lap count to work out how many more laps’ worth of fuel the car is carrying in the current lap than at the end of the race.

So – that’s a quick tour of gnuplot, showing how it can be used to plot CSV data and an algebraic expression on the same graph, how to filter the data plotted from the CSV file using particular values in a specified column, and how to perform a mathematical operation on the data pulled in from the CSV file before plotting it (and without changing it in the original file).
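(For convenience, here are the main steps gathered together as a single script you could save as, say, fuelplot.gp and run with gnuplot fuelplot.gp – a lightly tidied version of the interactive session above, with the filename, driver number and simplified series titles as assumptions you’d change to suit:)

# fuelplot.gp: plot driver 1's laptimes, fuel adjusted laptimes,
# and the fuel weight time penalty on the same chart
set term x11
set datafile separator ","
set xlabel "Lap"
set ylabel "Fuel Weight Time Penalty"
set xrange [1:58]
set yrange [85:120]
plot 'turlapTimeFuel.csv' using ($1==1 ? $2:1/0):3 with lines title "laptime", \
     'turlapTimeFuel.csv' using ($1==1 ? $2:1/0):4 with lines title "fuel adjusted laptime", \
     90+(58-x)*2.7*0.03 title "fuel weight penalty"
pause -1 "Hit return to close"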

Just in passing, if you need an online formula plotter, Wolfram Alpha is rather handy… it can do calculus for you too…

PS In a forthcoming post, I’ll describe another tool for creating similar sorts of plot – GeoGebra. If you know of other free, cross-platform, ideally open source, hosted or desktop applications similar to Gnuplot or GeoGebra, please post a link in the comments:-)

PPS I quite like this not-so-frequently-asked questions cribsheet for Gnuplot

PPPS for another worked through example, see F1DataJunkie: F1 2011 Turkey Race – Race Lap History

Data Driven Journalism – Survey

The notion of data driven journalism appears to have some sort of traction at the moment, not least as a recognised use context of some very powerful data handling tools, as Simon “Guardian Datastore” Rogers’ appearance at Google I/O suggests:


(Simon’s slot starts about 34:30 in, but there’s a good tutorial intro to Fusion Tables from the start…)

As I start to doodle ideas for an open online course on something along the lines of “visually, data” to run October-December, data journalism is going to provide one of the major scenarios for working through ideas. So I guess it’s in my interest to promote this European Journalism Centre: Survey on Data Journalism to try to find out what might actually be useful to journalists…;-)

[T]he survey Data-Driven Journalism – Your opinion aims to gather the opinion of journalists on the emerging practice of data-driven journalism and their training needs in this new field. The survey should take no more than 10 minutes to complete. The results will be publicly released and one of the entries will win a EUR 100 Amazon gift voucher.

I think the EJC are looking to run a series of data-driven journalism training activities/workshops too, so it’s worth keeping an eye on the EJC site if #datajourn is your thing…

PS related: the first issue of Google’s “Think Quarterly” magazine was all about data: Think Data

PPS Data in journalism often gets conflated with data visualisation, but that’s only a part of it… Where the visualisation is the thing, then here’s a few things to think about…


Ben Fry interviewed at Where 2.0 2011

Whose Investor Relations Sites Do Thomson Reuters Host? A Form of URL Hacking…

A quick little infoskills demo using Google search…

I very rarely do more than skim the headlines of posts in my feed from Techcrunch, but today I actually opened up a post about Amazon launching another imprint (Amazon Expands To Mysteries And Thrillers With Fifth Publishing Imprint, Thomas & Mercer). As with a lot of TechCrunch posts, it’s pretty much just a rebranding of a Press Release, though to their credit Techcrunch linked to the “original” source: www.businesswire.com/news/home/20110518005494/en/Amazon-Launches-Publishing-Imprint-Thomas-Mercer

…insofar as a wire service copy of a press release is an original source… This got me wondering whether the press release had also appeared on a more direct Amazon press release page…?

Googling for press release amazon on google.co.uk turned up a way in to Amazon’s UK media relations site: www.amazon.co.uk/gp/press/pr/20080710 (try hacking around the URL to find an amazon.com equivalent…), as well as a page on the central US site: phx.corporate-ir.net/phoenix.zhtml?ID=1565581&c=176060&p=irol-newsArticle – a third party (?) hosted site with Amazon branding: phx.corporate-ir.net/phoenix.zhtml?p=irol-mediaHome&c=176060. (One of the oft-taught infoskills tips is not to necessarily trust a site where the domain in the URL doesn’t appear to fit…) Note: I also got redirected to the main page of the Amazon-looking-but-not-on-an-Amazon URL page through trying amazon.com/pr.

Seeing who hosts this site – that is, just trying phx.corporate-ir.net – we get a redirect to www.ccbn.com which immediately then hops to thomsonreuters.com/products_services/financial/financial_products/corporate_services/investor_relations/

So Thomson Reuters, maybe… (If we were doing a proper job, we’d be looking up internet domain registrations, but I’m a trusting sort… If you’re interested, the service you need is called whois – look up the domain and then check the registrant. For example, www.networksolutions.com/whois-search/corporate-ir.net;-) But who else do they run this service for?

If you Google inurl:phx.corporate-ir.net/phoenix.zhtml you can spot results from various sources. Looking through the URLs, many of the pages are on hosted investor relations sites. The p= argument specifies the page, and the c= argument looks like it might identify a company.

We can try blindly hacking company numbers to see what companies turn up, or we can be a little more structured, for example by finding the investor relations homepage value (p=irol-irhome looks a good bet) and Googling on inurl:http://phx.corporate-ir.net/phoenix.zhtml? inurl:p=irol-irhome site:phx.corporate-ir.net.

The results pages are then full of sites of investor relations hosted by Thomson Reuters for third parties.

(You might think that the use of site: is redundant given the first inurl: limit. I hit upon using it like that as a result of clicking on the “More results form this site” option from one of my initial search attempts… It seems to override the collapsing of multiple results on a single domain…)

If you want to tunnel down to press release/media relations sites, there seem to be a variety of landing pages (p=irol-news, p=irol-press), so instead we can fudge a bit and search for “press” in the page title: inurl:http://phx.corporate-ir.net/phoenix.zhtml? intitle:press site:phx.corporate-ir.net

If you live and work the web, being able to read URLs, and use that knowledge to help you search URLs, is a handy skill to have…

PS for another take on “website analysis” journalism, see @pubbbox: UKTI’s TechCityUK site: 100 WordPress pages, £53k