Archive for the ‘Data’ Category
Every four years, I get blown away by the dedication of people who have spent the previous four years focussed on their Olympic Challenge (I find it hard to focus for more than an hour or two on any one thing!)
Anyway, I was intrigued to see this post on Google Maps Mania yesterday – Olympic Heat Maps – that displayed the Olympics medal table in the form of a heat map, along with several variants (medal tallies normalised against population, or GDP, for example).
The maps were neat, but static – they’d been derived by cutting and pasting a snapshot of a medals table into a Google spreadsheet, and then creating a Heat Map widget using the data…
So I had a look round for a ‘live’ data source for the medals table, didn’t find anything obvious, so looked for a widget that might be pulling on a hidden data source somewhere… Whereupon I found a reference to a WordPress Olympic Medal Tally widget…
A quick peek at the code shows the widget pulling on a data feed from the 08:08:08 Olympics blog, so I ‘borrowed’ the feed and some of the widget code to produce a simple HTML table containing the ISO country codes that the Google Heat Map widget requires, linked to it from a Google Spreadsheet (Google Spreadsheets Lets You Import Online Data) and created a live Olympic medal table map (top 10).
If you want to use the heat map as an iGoogle widget, here it is: Olympic Medal Table Map Widget.
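The normalised variants mentioned above (medal tallies per head of population, say) come down to a simple division. Here is a minimal sketch of that arithmetic; the function name, the two-letter codes and the figures are all made-up placeholders, not real medal or population data.

```python
def medals_per_million(medals, population):
    """Return medals per million inhabitants for each country code.

    Countries with no (or zero) population figure are skipped rather than guessed.
    """
    return {
        code: medals[code] / (population[code] / 1_000_000)
        for code in medals
        if code in population and population[code] > 0
    }


# Placeholder figures for two made-up countries, for illustration only.
medals = {"AA": 20, "BB": 5}
population = {"AA": 100_000_000, "BB": 5_000_000}

normalised = medals_per_million(medals, population)

# Rank by the normalised value rather than the raw tally.
for code, value in sorted(normalised.items(), key=lambda kv: kv[1], reverse=True):
    print(code, round(value, 2))
```

The same function would work for GDP if the second dictionary held GDP figures instead, with the scaling constant adjusted to suit.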
Some time ago now, I wrote a post about progressive enhancement of HTML web pages, including some examples of how HTML data tables could be enhanced to provide graphical views of the data contained in them.
I’m not sure if anyone is actively maintaining progressive enhancement browser extensions (I haven’t checked) but here are a couple more possible enhancements released as part of the Google visualisation API, as described in Table Formatters make Visualization tables even nicer:
A couple of other options allow you to colour a table cell according to value (an implementation of the ‘format cell on value’ function you find in many spreadsheets), and a formatter that will “format numeric columns by defining the number of decimal digits, how negative values are displayed and more”, such as adding a prefix or suffix to each number.
Also in my feed reader recently was this post on Search Engines Extracting Table Data on the Web, which asks:
what if Google focused upon taking information from tables that contain meaningful data (as opposed to tables that might be used on a web page to control the formatting of part or all of a page)?
What if it took all those data filled tables, and created a separate database just for them, and tried to understand which of those tables might be related to each other? What if it then allowed for people to search through that data, or combine the data in those tables with other data that those people own, or that they found elsewhere on the Web?
and then links to a couple of recent papers on the topic.
Alternatively, could we map the data contained in a “regular” Google or Yahoo UI table to a Google spreadsheets like format – in which case, we might be able to use the Google Visualisation API Query Language? (I’m not sure if the Query Language can be applied directly to Google datatable objects?)
It’s not too hard then to imagine a browser extension that can be used to overlay a SQL-like query engine on top of pages containing Yahoo or Google datatables, essentially turning the page into a queryable database? Maybe even Ubiquity could be used to support this?
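As a rough illustration of the "queryable table" idea, here is a sketch that loads the rows of a table into an in-memory SQLite database and runs a SQL query over them. This uses SQLite's SQL rather than the Google Visualisation API Query Language, and the table name, column names and rows are placeholders I have invented for the example.

```python
import sqlite3


def table_to_sqlite(headers, rows, table_name="page_table"):
    """Load a header list and row tuples into a fresh in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    columns = ", ".join(f'"{h}"' for h in headers)
    placeholders = ", ".join("?" for _ in headers)
    conn.execute(f'CREATE TABLE "{table_name}" ({columns})')
    conn.executemany(f'INSERT INTO "{table_name}" VALUES ({placeholders})', rows)
    return conn


# Placeholder data standing in for cells scraped from an HTML table.
headers = ["country", "gold", "silver"]
rows = [("AA", 3, 1), ("BB", 0, 4), ("CC", 2, 2)]

conn = table_to_sqlite(headers, rows)
result = conn.execute(
    "SELECT country, gold FROM page_table WHERE gold > 1 ORDER BY gold DESC"
).fetchall()
print(result)
```

A browser extension would need to do the scraping step itself; this sketch only covers what happens once the cells are available as rows.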
Early last week, Google announced a Google Flu trends service that leverages the huge number of searches on Google to provide a near real-time indicator of ‘flu outbreaks in the US. Official reports from medical centres and doctors can lag actual outbreaks by up to a couple of weeks, but by correlating search trend data with real medical data, the Google folks were able to show that their data led the official reports.
John Naughton picked up on this service in his Networker Observer column this week, and responded to an email follow-up comment I sent him idly wondering what search terms might be indicators of recession in this post on Google as a predictor. “Jobseeker’s allowance” appears to be on the rise, unfortunately (as does “redundancy”).
For some time, I’ve been convinced that spotting clusters of related search terms, or meaningful correlations between clusters of search terms, is going to be the next big step towards, err, something(?!), and Google Flu trends is one of the first public appearances of this outside the search, search marketing and ad sales area.
Which is why, on the playful side, I tried to pitch something like Trendspotting to the Games With a Purpose (GWAP) folks (so far unreplied to!), the idea being that players would have to try to identify search terms whose trends were correlated in some “folk reasonable” way. Search terms like “flowers” and “valentine”, for example, which appear to be correlated according to the Google Trends service:
Just out of interest, can you guess what causes the second peak? Here’s one way of finding out – take a look at those search terms on the Google Insights for Search service (like Google Trends on steroids!):
Then narrow down the date over which we’re looking at the trend:
By inspection, it looks like the peak hits around May, so narrow the trend display to that period:
If you now scroll down the Google Insights for Search page, you can see what terms were “breaking out” (i.e. being searched for in volumes way out of the norm) over that period:
So it looks like a Mother’s Day holiday? If you want to check, the Mother’s Day breakout (and ranking in the top searches list) is even more evident if you narrow down the date range even further.
Just by the by, what else can we find out? That the “Mother’s Day” holiday at the start of May is not internationally recognised, maybe?
There are several other places that are starting to collect trend data – not just search trend data – from arbitrary sources, such as Microsoft Research’s DataDepot (which I briefly described in Chasing Data – Are You Datablogging Yet?) and Trendrr.
The Microsoft service allegedly allows you to tweet data in, and the Trendrr service has a RESTful API for getting data in.
Although I’ve not seen it working yet (?!), the DataDepot looks like it tries to find correlations between data sets:
Next stop convolution of data, maybe?
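To make "finding correlations between data sets" a little more concrete, here is a minimal sketch of a Pearson correlation between two trend series. I don't know what measure DataDepot or Google actually use, so treat the choice of Pearson as an assumption; the two weekly series are invented numbers, shaped so that one term has a second peak the other lacks.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of the same length (at least 2 points)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Invented search volumes, NOT real Google Trends data.
flowers = [10, 12, 40, 11, 10, 35, 12]
valentine = [5, 6, 50, 5, 4, 6, 5]

r = pearson(flowers, valentine)
print(f"correlation: {r:.2f}")
```

A value near 1 means the two series rise and fall together; a peak that appears in one series but not the other drags the value down.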
So whither the future? In an explanatory blog post on the flu trends service – How we help track flu trends – the Googlers let slip that “[t]his is just the first launch in what we hope will be several public service applications of Google Trends in the future.”
It’ll be interesting to see what exactly those are going to be?
PS I’m so glad I did electronics as an undergrad degree. Discrete maths and graph theory drove web 2.0 social networking theory algorithms, and signal processing – not RDF – will drive web 3.0…
What do you do when you scrape a data set, geocode it so you can plot it on a map, and find that the geocoding isn’t quite as good as you’d hoped?
I’d promised myself that I wasn’t going to keep on posting “yet another way of scraping data into Google spreadsheets then geocoding it with a Yahoo pipe” posts along the lines of Data Scraping Wikipedia with Google Spreadsheets, but a post on Google Maps mania – Water Quality Google Map – sent me off on a train of thought that has sort of paid dividends…
So first up, the post got me thinking about whether there are maps of Blue Flag beaches in the UK, and where I could find them. A link on the UK page of blueflag.org lists them: UK Blue Flag beaches, (but there is a key in the URL, so I’m not sure how persistent that URL is).
Pull it into a Google spreadsheet using:
Publish the CSV:
Geocode the beaches using a Yahoo pipe – rather than using the Pipe location API, I’m making a call to the Yahoo GeoPlanet/Where API – I’ll post about that another day…
Grab the KML from the pipe:
Now looking at the map, it looks like some of the markers may be mislocated – like the ones that appear in the middle of the country, hundreds of miles from the coast. So what it might be handy to do is use the scraped data as a buggy, downloaded data set that needs cleaning. (This means that we are not going to treat the data as “live” data any more.)
And here’s where the next step comes in… Google MyMaps lets you seed a map by importing a KML file:
The import can be from a desktop file, or a URL:
Import the KML from the Yahoo pipe, and we now have the data set in the Google MyMap.
So the data set in the map is now decoupled from the pipe, the spreadsheet and the original Blue Flag website. It exists as a geo data set within Google MyMaps. Which means that I can edit the markers, and relocate the ones that are in the wrong place:
And before the post-hegemonic tirade comes in (;-), here’s an attempt at capturing the source of the data on the Google MyMap.
So, to sum up – Google MyMaps can be used to import an approximately geo-coded data set and used to tidy it up and republish it.
PS don’t forget you can also use Google Maps (i.e. MyMaps) for geoblogging
A quickie post, this one, to complement a post from a long time ago where I plotted out – as a network – the links between people who served on the same committee on the Isle of Wight Council (Visualising CoAuthors in Open Repository Online Papers, Part 3, half way through the post).
In this case, I trawled the Isle of Wight Council committees to populate the rows of a spreadsheet with column headings “Committee Name” and “Councillor”.
Pasting the results into Many Eyes gives an IW Council membership dataset that can be easily visualised. So for example, here’s a glanceable treemap showing the membership of each committee:
The search tool adds yet another dimension to the visualisation, in this case allowing us to pick out the various committees the searched for named individual sits on.
It strikes me that if the search tool supported Boolean expressions, such as AND and OR (maybe with each term being realised by a different colour bounding box?), it would be possible to explore the variation – or similarity – in make-up of different committees? On the first tree map, this approach would make it obvious which committees the same groups of people were sitting on?
And why would we want to do this? To identify potential clashes of interest, maybe, or a lack of variation in the composition of different committees that might, ideally, be independent of each other?
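The comparison of committee make-up can also be sketched in code, working from the same two-column "Committee Name" / "Councillor" rows. The function names and the Jaccard overlap measure are my own choices here, and the committee and councillor names are placeholders.

```python
from collections import defaultdict
from itertools import combinations


def membership(rows):
    """Turn (committee, councillor) rows into {committee: set of councillors}."""
    committees = defaultdict(set)
    for committee, councillor in rows:
        committees[committee].add(councillor)
    return dict(committees)


def overlaps(committees):
    """For every pair of committees, return the shared members and a Jaccard score."""
    result = {}
    for a, b in combinations(sorted(committees), 2):
        shared = committees[a] & committees[b]
        union = committees[a] | committees[b]
        result[(a, b)] = (shared, len(shared) / len(union))
    return result


# Placeholder rows, for illustration only.
rows = [
    ("Planning", "Cllr A"),
    ("Planning", "Cllr B"),
    ("Audit", "Cllr B"),
    ("Audit", "Cllr C"),
    ("Licensing", "Cllr D"),
]

committees = membership(rows)
pairs = overlaps(committees)
for (a, b), (shared, score) in pairs.items():
    print(a, b, sorted(shared), round(score, 2))
```

A score of 1 would mean two committees have identical membership; a score of 0 means nobody sits on both.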
PS Hmm, I suppose you could use a similar visualisation to look at the distribution of named directors across FTSE 100 companies and their subsidiaries, suppliers and competitors, for example? ;-) Does anyone have simple lists of such information in a spreadsheet anywhere?;-)
Over dinner with friends a couple of nights ago, I was asked how I typically approach problem solving tasks. Thinking about it, it’s a bottom-up AND top-down approach where I attack both ends of the problem (the “what I’ve got now” end and the “ultimate vision”) at the same time, in the hope that the tiny steps taken from each end meet up somewhere in the middle…
So for example, in the dev8D Dragon’s Den I mentioned the desire to put together a thematic choropleth map depicting the funding that’s going into different UK Government office regions as a result of JISC or EPSRC project awards. Here’s how I’ve started to work out how to do that…
(What follows gets a little involved at times, so the main trick to look out for is how to create a single data table by mashing together data from three separate data tables.)
At one end, is the output surface. A quick scout around turned up no flash components or KML overlays I could use on Google maps or ThematicMapping (ffs why can’t National Statistics make some free warez available???) so I opted for the amMap interactive map instead.
To plot the map, I need to be able to sum the value of project grants over lead HEIs within particular GORs (got that?;-) So where’s the data?
All over the place, that’s where…
- EPSRC Support By Organisation shows the total amount of current project funding awarded to each HEI by EPSRC;
Hmm, no GOR, no geolocation data… Which means I need a mapping from HEI to GOR…
- …but the closest I can find is a listing of the postcodes of each HEI: HERO screenscraper, and even that’s a scrape of another service…
- and finally, here’s a mapping from postcode areas to GORs: postcode area lookup table.
There’s a warning though: please note “regions” were recorded for my own visual aid and are NOT an attempt to tie in with current UK Administrative Regions. Hmm – okay – add that one to the caveats/risk assessment. If the map turns out very wrong, that’s EPSRC’s problem, right, for not making the data available in a clean way?!;-)
Okay, so those are the data sources: one contains HEI names and project funding data, one contains HEI names, location data (well, postcodes) and homepage URIs, and one contains mappings from postcode towns to UK regions (which loosely relate, possibly, to GORs).
Now at this point I’ve already decided that I want to try to use Dabble DB to somehow conflate the data from these three separate sources (though I’m not totally sure how… it’s just something I seem to remember from somewhere and somewhen a long time ago that Dabble DB supports if there are common fields – and matching strings – across different data tables).
Getting the data into Dabble DB is a copy and paste operation, but I’m going to take an intermediate step, highlighting and copying the tables from the separate web pages and pasting them into a Google spreadsheet. Why? Because I already know that this works and it’ll also let me cast an eye over the data to make sure it looks about right.
Looking at the HEI names from EPSRC and the HERO screenscrape, they don’t really match though, which means that Dabble DB won’t be able to use HEI names to identify common rows in the HE location and EPSRC project tables. However, the HERO screenscrape page does have the HEI homepage URI, and a look beneath the “Go to Site” link on the EPSRC page shows that those links point to the HEI homepage…
…which means I should be able to link items in the EPSRC projects listing to items in the HEI location table by virtue of common homepage URIs.
and the URIs are exposed, so I can copy and paste the table and drop it into a spreadsheet, with the HERO data and postcode/region data in separate sheets.
A quick look over the URIs from both sources in the spreadsheets shows minor differences though – some URIs end with a “/” and others don’t (there are also a few broken scrapes that I tidy by hand); now if Dabble DB uses strict string matching to relate data in one table to data in another table (which I’d guess is likely) then missed matches will presumably occur?
So just to be safe, we need a data cleaning stage. To do this, I copy the data from the URI column in each spreadsheet, drop it into my TextWrangler text editor, and just clean up all the URIs so they end with a trailing / by searching for \.uk$ and replacing it with .uk/
Then I copy the URIs from the text editor and paste them back into the appropriate column in the appropriate spreadsheet.
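For anyone who would rather script the clean-up than round-trip through a text editor, the same search and replace can be sketched with a regular expression in code. The function name and the example URIs are placeholders; the pattern is the one described above.

```python
import re


def add_trailing_slash(uri):
    """Make URIs that end in .uk end in .uk/ so both sources match exactly."""
    # Strip stray whitespace first so "$" really anchors at the end of the URI.
    return re.sub(r"\.uk$", ".uk/", uri.strip())


# Placeholder URIs, one with and one without the trailing slash.
uris = ["http://www.example.ac.uk", "http://www.another-example.ac.uk/"]
cleaned = [add_trailing_slash(u) for u in uris]
print(cleaned)
```

URIs that already end in a slash, or that don't end in .uk at all, pass through unchanged, so the step is safe to run twice.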
Looking at the postcode/GOR table, I need to get one or two letter postal town identifiers from the HEI postcodes, so to do this I copy the postcode column from the spreadsheet, and paste it into my text editor. This time I do a regular expression powered search and replace using this regexp: ([A-Z]+).* and replacing with \1
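The postcode regular expression translates directly into code too. This is a minimal sketch with a made-up function name; the uppercase conversion is an extra assumption on my part, in case some scraped postcodes arrive in lower case.

```python
import re


def postcode_area(postcode):
    """Return the leading one or two letters of a UK postcode (the postcode area)."""
    # Same pattern as the editor search and replace: keep the leading letters,
    # discard everything after them.
    return re.sub(r"([A-Z]+).*", r"\1", postcode.strip().upper(), count=1)


# Example strings in the usual postcode shapes.
for postcode in ["MK7 6AA", "M13 9PL", "po30 1ud"]:
    print(postcode, "->", postcode_area(postcode))
```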
So now I have three spreadsheets on Google docs, which I can scan by eye to make sure they look okay, then easily copy and paste into separate tables (known as separate categories) in the same Dabble DB project, like this:
- the EPSRC data:
- HERO screenscrape data:
- and the postcode/region mapping data:
Now for the fun part; each of the above tables is a separate category, with separate column fields, in a Dabble DB project. It is possible to link a column with a similar column in another category, and consequently “pair” similar items in different tables. (So a column containing a particular URI, for example, in a row in one table/category can be related to a particular row in a particular column in another category/table, if the corresponding cell there contains the same URI. Dabble DB handles the actual pairings; you just have to link the columns.)
So playing blind, I linked the URI column in the EPSRC category with a new category, which I called Meta:
This created a new table/category – Meta – with a couple of columns: a “Name” column, containing the URIs, and a column that linked back to corresponding entries in the EPSRC project category.
And then I did the same linking for the URI column in the HEI Location table/category, which automatically added another column in the Meta table that linked across to rows in the corresponding HEI Location table:
In the Meta category view, I can now add additional columns that are derived from columns in the other, linked tables. So for example, I can add a derived column corresponding to the value of project grants that is pulled in from the linked EPSRC projects column:
So my Meta table/category now looks like this:
Which is pretty clever I think..? ;-)
But then it gets more so… Suppose I link the Postcode town column from the HEI location table with the Postcode/Regional mapping table:
If you’ve been keeping up, you might now expect the UK HEI to be linked to from the Postcode/Region table, which it is:
But the link is symmetrical… and if one category is linked to a second category that is in turn linked to a third category, the columns from the first category can be used as derived columns in the second and the third category…
…which means in the Meta category, I can pull in columns derived from the Postcode/Region category via the HEI location category, first by grabbing the postcode town column into Meta:
To give this:
Then pull in a further derived field from the postcode town column from the Postcode/Region category:
And so now we have a rather more complete Meta category view containing linked items from all three tables (one of which is actually linked indirectly via one of the others):
Clever, eh??? So now I know how to annotate data in one table using data from another table if the two tables each have a column that contains similar data :-)
Okay, so now I have a table that contains rows that contain both project funds and UK regions info – so now I’m in a position to calculate the total amount of funds flowing into each region and then plot them on the thematic map…
…but this post is already way too long, so that’ll have to be for another day…
(Plus I’m not totally sure how to do it yet… and Mission Impossible is just starting (this is a scheduled post…;-))
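For what it's worth, the chained lookup (funding to homepage URI, URI to postcode, postcode area to region) and the per-region total can be sketched outside Dabble DB as a few dictionary lookups. This is only one possible way of doing the aggregation, not how Dabble DB does it; the function names, URIs, postcodes, region names and amounts are all invented placeholders.

```python
import re
from collections import defaultdict


def postcode_area(postcode):
    """Leading letters of a postcode, e.g. 'AA1 1AA' -> 'AA'."""
    return re.sub(r"([A-Z]+).*", r"\1", postcode.strip().upper(), count=1)


def funding_by_region(funding, locations, regions):
    """Sum funding per region.

    funding:   {homepage URI: amount}
    locations: {homepage URI: postcode}
    regions:   {postcode area: region name}
    Returns (totals per region, list of URIs that could not be placed).
    """
    totals = defaultdict(float)
    unmatched = []
    for uri, amount in funding.items():
        postcode = locations.get(uri)
        region = regions.get(postcode_area(postcode)) if postcode else None
        if region is None:
            unmatched.append(uri)  # keep a note of rows that fall through the join
            continue
        totals[region] += amount
    return dict(totals), unmatched


# Placeholder tables, for illustration only.
funding = {
    "http://a.example.ac.uk/": 100.0,
    "http://b.example.ac.uk/": 50.0,
    "http://c.example.ac.uk/": 25.0,
}
locations = {
    "http://a.example.ac.uk/": "AA1 1AA",
    "http://b.example.ac.uk/": "AA2 2BB",
}
regions = {"AA": "Region 1"}

totals, unmatched = funding_by_region(funding, locations, regions)
print(totals, unmatched)
```

Returning the unmatched URIs alongside the totals makes it easy to see how much funding was dropped because of a missed match.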
Last week, I posted a quick demo of how to visualise data stored in a Google spreadsheet in Many Eyes Wikified (HEFCE Grant Funding, in Pictures).
The data I used was the latest batch of HEFCE teaching funding data, but Joss soon tweeted to say he’d got Research funding data up on Google spreadsheets, and could I do something with that? You can see the results here: Visualising UK HEI Research Funding data on many Eyes Wikified (Joss has also had a go: RAE: UK research funding results visualised).
Anyway, today the Guardian announced a new content API (more on that later – authorised developer keys are still like gold dust), as well as the Guardian data store (strapline: “facts you can use”) and the associated Data Store Blog.
Interestingly, the data is being stored on Google docs, in part because Google spreadsheets offer an API and a wide variety of export formats.
As regular OUseful.info readers will know, one of the export formats from Google spreadsheets is CSV – Comma Separated Values data – which just so happens to be liked by services such as Dabble DB and Many Eyes. I’ll try to come up with a demo of how to mash-up several different data sets in Dabble DB over the next few days, but as I’ve a spare half-hour now, I thought I’d post a quick demo of how to visualise some of the Guardian data store spreadsheet data in Many Eyes Wikified.
So to start, let’s look at the RAE2008 results data – University research department rankings (you can find the actual data here: http://spreadsheets.google.com/pub?key=phNtm3LmDZEM-RqeOVUPDJQ).
If you speak URL, you’ll know that you can get the CSV version of the data by adding &output=csv to the URL, like this: http://spreadsheets.google.com/pub?key=phNtm3LmDZEM-RqeOVUPDJQ&output=csv
Inspection of the CSV output suggests there’s some crap at the top we don’t want – i.e. not actual column headings – as well as at the end of the file:
(Note this “crap” is actually important metadata – it describes the data and its provenance – but it’s not the actual data we want to visualise).
Grabbing the actual data, without the metadata, can be achieved by grabbing a particular range of cells using the &range= URL argument. Inspection of the table suggests that meaningful data can be found in the columnar range of A to H; guesswork and a bit of binary search identifies the actual range of cell data as A2:H2365 – so we can export JUST the data, as CSV, using the URL http://spreadsheets.google.com/pub?key=phNtm3LmDZEM-RqeOVUPDJQ&output=csv&range=A2:H2365.
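If you find yourself building a lot of these export URLs, the pattern is easy to wrap in a small helper. This sketch only assembles the URL string from the key, output and range arguments described above; the function name is made up, and it makes no attempt to fetch the data or check that the URL resolves.

```python
from urllib.parse import urlencode

BASE_URL = "http://spreadsheets.google.com/pub"


def csv_export_url(key, cell_range=None):
    """Build the CSV export URL for a published spreadsheet key.

    cell_range is optional, e.g. "A2:H2365" to skip metadata rows.
    """
    params = {"key": key, "output": "csv"}
    if cell_range:
        params["range"] = cell_range
    # Keep ":" unescaped so the range reads the same as when typed by hand.
    return BASE_URL + "?" + urlencode(params, safe=":")


print(csv_export_url("phNtm3LmDZEM-RqeOVUPDJQ"))
print(csv_export_url("phNtm3LmDZEM-RqeOVUPDJQ", "A2:H2365"))
```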
If you create a new page on Many Eyes Wikified, this data can be imported into a wiki page there as follows:
We can now use this data page as the basis of a set of Many Eyes visualisations. Noting that the “relative URL address” of the data page is ousefulTestboard/GuardianUKRAERankings2008 (the full URL of the wikified data page is http://manyeyes.alphaworks.ibm.com/wikified/ousefulTestboard/GuardianUKRAERankings2008), create a new page and put a visualisation placeholder or two in it:
Saving that page – and clicking through on the visualisation placeholder links – means you can now create your visualisation (Many Eyes seems to try to guess what visualisation you want if you use an appropriate visualisation name?):
Select the settings you want for your visualisation, and hit save:
A visualisation page will be created automatically, and a smaller, embedded version of the visualisation will appear in the wiki page:
If you visit the visualisation page – for example this Treemap visualisation, you should find it is fully interactive – which means you can explore the data for yourself, as I’ll show in a later post…
In Using Many Eyes Wikified to Visualise Guardian Data Store Data on Google Docs I showed how to pull data from Google spreadsheets (uploaded there by the Guardian as part of their Data Store initiative) and visualise it using Many Eyes Wikified.
Unfortunately, one of the downsides of using Many Eyes Wikified is that you can’t filter the imported data or select subsets of rows from it (nor can you choose to just ignore particular columns in the visualisation editor – which would be a really handy thing to be able to do, and probably quite easy to implement?)
So for example, when looking at the RAE 2008 Data, it might be quite useful to be able to just visualise the data for a single institution, such as the Open University. But how can we do this?
One way would be to find a way of searching the data in the spreadsheet, and then only extracting the rows that contained the search term – such as “Open University” – in the institution name column. But I don’t know if that’s possible in Google Spreadsheets, (though it is possible in Zoho DB, which supports all manner of queries in SQL like dialects; but that’s for another day…;-).
An alternative way is to grab the whole of the spreadsheet and then just filter out all the rows that don’t contain the filter term in the desired column… which is an idea that came into my mind last night on my way home, and which appears to work quite well, as this Filtering UK RAE 2008 CSV Data from Google Docs pipe shows:
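Outside of Yahoo Pipes, the same "grab everything, then throw away the rows you don't want" step can be sketched in a few lines of code. This is not what the pipe runs internally, just the equivalent logic; the function names, the column position and the sample rows are placeholders.

```python
import csv
import io


def filter_rows(csv_text, column_index, term):
    """Keep only the rows whose given column contains the filter term."""
    reader = csv.reader(io.StringIO(csv_text))
    needle = term.lower()
    return [
        row
        for row in reader
        if len(row) > column_index and needle in row[column_index].lower()
    ]


def rows_to_csv(rows):
    """Serialise the filtered rows back to CSV text."""
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()


# Placeholder CSV standing in for the spreadsheet export.
sample = (
    "Open University,Unit A,10\n"
    "Other University,Unit A,20\n"
    "Open University,Unit B,30\n"
)

matching = filter_rows(sample, 0, "open university")
print(rows_to_csv(matching))
```

The match is a case-insensitive substring test, so a filter term of "Open" would also catch any other institution with "Open" in its name.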
If your pipe imported a CSV file, as this one does (in fact, if the items being pushed out of the pipe have a y:row attribute set giving a row/count number for each item) then you can get a CSV file out of the pipe too:
Which in this case means we can filter through just the RAE 2008 data for a particular institution, grab the CSV URL for that data, and pull it into Many Eyes wikified in the same way as described before:
Here it is:
And once we have a wiki data page for it, we can visualise it – I’ve popped some examples up here: OU RAE 2008 performance.
For example, how about a bubble diagram view showing our 4* performance across the various units of assessment:
(The number is the percentage of submissions in that unit of assessment achieving the specified grade.)
Or how about this one – an interactive bar chart showing the percentages of 3* and 4* submissions in each unit of assessment:
If you look back at the pipe interface, you’ll see you can use the same pipe to pull out the data for any of the HEIs:
The pipe output CSV URI also makes this clear:
In the next post on this topic, I’ll show you how to create a rather more interesting RSS feed from this sort of pipe, including some Google chart URIs created dynamically within the pipe…
But for now, it’s time to go foraging for dinner…
I really should make this the last post on the Guardian data store for a while, because I’ve got a stack of other things I really ought to be doing instead, but it struck me that the following demo might open up some people’s eyes as to what’s possible when you have several data sets that play nicely…
It shows you how to take data from two different spreadsheets and link it together to create a third data set that contains elements from the two original ones. (If you just want to cut to the end, here’s a (visualised) reason why it might not be such a happy idea to go to Southampton Solent if you want to study Architecture or Planning: Student Happiness on Planning and Architecture courses. Now ask yourself the question: how would I (err, that is, you) have produced that sort of chart?)
This whole idea is in and around the area of Linked Data, but doing it the hard way…
If you don’t know what Linked Data is, watch this:
So let’s have a look at the Guardian University Tables/ Satisfaction Data that has been uploaded to Google Spreadsheets:
You’ll notice there are lots of sheets in there, covering the different subject areas – like “Architecture”, for instance:
Importantly, the format of the names of the institutions is consistent across spreadsheets.
So I was wondering – if I was a student wanting to study either planning or architecture, how could I find out which institutions had a good satisfaction rating, or low student to staff ratio, across both those subjects? (Maybe I’m not sure which subject I want to do, so if I choose the wrong one and try to switch course, I know I’m not going to switch into a duff department.)
That is, I’d like to be able to see a single table containing the data from both the overall results table as well as the Architecture and Planning tables.
Now if the data was in a database, and if I spoke SQL, this would be quite easy to do (hint condition: look up sql JOIN). But I only have my browser to hand, so what to do…?
Dabble DB provides one answer… How? Here’s how…
Start off by creating a new application:
I’m going to seed it with a table containing the names of HEIs as listed in column B of the overall data table by importing just that column, as CSV data, from the Google spreadsheet:
Pull the data in:
So now we have the data:
Okay – let’s import a couple more tables, say the data for Planning and Architecture areas.
First, Planning – here’s the CSV:
Click on “More…” and you’ll be offered the chance to Add a New Category.
Take that opportunity:-)
You hopefully get the idea…
Exactly as before…
Now do the same for the Architecture data:
So now I have three tables – known as categories – in Dabble DB.
Let’s link them… that is, let’s make the data from one category available to another.
Firstly, I’ll link the Architecture data to the table that just lists the HEIs – click on the Name of Institution column to pop-up a menu and select Configure:
We’re going to Link the column to Entries in another one:
in particular, we’re going to tell Dabble DB that the Names of Institutions in the Architecture table are the same things as the institutions in the HEI category/table:
If you look at the HEIs category, you’ll see the Architecture column has been brought in:
We can now do the same for Planning (remember, pop up the Name of Institution menu and Configure it to Link Entries).
The next step is to pull in some data from the two linked categories. How about we compare the Teaching Satisfaction scores for these two subjects?
Click on the column header for one of the linked categories – say the planning one, select Add Derived Field and then the field you want to pull in:
The data gets pulled in…
(Oops – this is all a bit sparse; maybe I should have used a different field, such as Average Entry Tariff? Never mind, let’s push on…)
Add the corresponding derived field for the Architecture courses:
If you click on the “unsaved View” label, you can save this data table:
To tidy up the table, let’s hide the duplicated Name columns and resave:
To give something like this:
A nice feature of Dabble DB is that it makes it easy to export data from any given view:
So if we grab the CSV URI:
We can take it to, I dunno, Many Eyes Wikified?
Create a placeholder for a visualisation (the data page is ousefulTestboard/StudentHappinessPlanningArchitecture):
Or just type the text yourself:
Click through to create the viz:
We’ll have the scatter plot:
The empty cells in the data columns may cause Many Eyes Wikified to think the data is Text – it’s not, it’s Number:
Now customise the view… I could just have every spot the same, but Architecture is my first preference, so let’s just highlight the places where students are happiest doing that subject (click through to play with the visualisation):
UCL seems to do best:
So here’s a recap – we’ve essentially JOINed data from two separate spreadsheets from Google Spreadsheets to create a new data table in Dabble DB, then visualised it in Many Eyes.
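For comparison, here is a minimal sketch of the same kind of join done in code: two tables keyed on institution name, combined into one. It keeps every institution that appears in either table and leaves a gap where a score is missing, which roughly mirrors the sparse view above. The function name, institution names and scores are placeholders, not Guardian data.

```python
def join_on_institution(left, right, left_label, right_label):
    """Combine two {institution: score} tables into one row per institution.

    Institutions missing from one table get None for that column.
    """
    joined = {}
    for name in sorted(set(left) | set(right)):
        joined[name] = {
            left_label: left.get(name),
            right_label: right.get(name),
        }
    return joined


# Placeholder scores, for illustration only.
planning = {"Univ A": 80, "Univ B": 75}
architecture = {"Univ B": 90, "Univ C": 60}

combined = join_on_institution(planning, architecture, "Planning", "Architecture")
for name, scores in combined.items():
    print(name, scores)
```

Like the Dabble DB link, this only works because the institution names are written identically in both tables.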
Can you see now why privacy hacks don’t like the idea of linked data in government, or across companies?
So here’s your weekend homework – create a data set that identifies entry requirements across the various engineering subject areas, and try to find a way of visualising it ;-)
Last week, Alan tweeted a challenge of sorts about me doing something to the academics’ pay data referred to in the THES article Pay packets of excellence. The data (Vice Chancellors’ pay in UK HEIs, and academics’ pay across UK HEIs) was published via two separate PDF documents, and “compiled and audited by Grant Thornton on behalf of Times Higher Education”.
The THES provided some analysis and interpretation of the data, and the survey was picked up by other media (e.g. here’s the Guardian’s take: Vice-chancellors’ salaries on a par with prime minister; the Telegraph said: Anger as university bosses claim £200,000 salaries; the Times: Campus fury at vice-chancellors’ windfalls; the Press Association: University chiefs pocket wage rise; and so on).
So partly to give Martin something concrete to think about in the context of Should universities break copyright law? and Universities as copyright warriors, is my republishing of the data contained in the two PDF documents on the THES website alongside the Pay packets of excellence article as a spreadsheet on Google spreadsheets a fair thing to do? (I didn’t notice any explicit license terms?)
(The data is here: UK HE VCs’ and Academics’ pay.)
Now why would I want to publish the data? Well, as it stands, the data as published in the PDF documents is all very well, but… what can you do with it? How useful is it to the reader? And what did the THES intend by publishing the data in the PDFs?
That readers could check the claims made in the article is one possibility; that other media channels could draw their own conclusions from the results and then cite the THES is another (“link bait”;-). But is there any implication or not that readers could take the data as data and manipulate, visualise it, and so on? If there is, is there any implication or expectation that journalists et al. might take the data into a private spreadsheet, maybe, manipulate it, understand it, and then publish their interpretation? Might there be a reasonable expectation that someone would republish the data as data so that people without the skills to take the data out of the PDF and put it into a spreadsheet could benefit from it being represented in that way?
As well as publishing the data via a Google spreadsheet, I also republished via two Many Eyes Wikified data pages: UK HE Vice Chancellors’ Salaries: Many Eyes wikified data page and UK HE Academic Salaries: Many Eyes wikified data page. So was this a fair thing to do, in any reasonable sense of the word?
And then of course, I did a few visualisations: UK HE Vice Chancellors’ Salaries: Many Eyes wikified visualisations page and UK HE Academic Salaries: Many Eyes wikified visualisations page.
Making the data available means I can create visual interpretations of it. Is this a fair thing to do with the data or not? If the data was published with the intention that other people publish their interpretations of it, does a visual interpretation count? And if so, what’s a fair way of creating that “data as data”? By publishing the data used to generate the visualisation in the spreadsheet, people can check the actual data that is feeding the visualisation, and then check that it’s the same as the THES data.
Finally, each Many Eyes visualisation is itself interactive. That is, a user can change the dimensions plotted in the chart and try to interpret (or make sense of) the data themselves in a visual way.
So is that a fair thing to do with data? Using it to underwrite the behaviour of a set of visualisations that a user can interact with and change themselves?
So here’s where we’re at: the THES published the data in a “closed” format – a PDF document. One of the features of the PDF is that the presentation of the document is locked down – it should always look the same.
By republishing the data as data in a public Google document, I have made it possible for other people to change how that data is presented. They can also use that data as the basis of a visualisation. Is there any difference between an IT literate journalist putting the data into a private spreadsheet and then publishing a visualisation of that data, and someone republishing the data so that anyone can visualise it?
Now let’s consider the Many Eyes visualisations. Suppose it is a fair use of the data to somehow use it to create a visualisation, and then publish that visualisation as a static image. Presumably someone will have checked that the graphic is itself fair, and is not misrepresenting the data. That is, the data has not been misused or misapplied – it has been used in a responsible way and visualised appropriately.
But now suppose that Jo Public can start to play with the visualisation (because it is presented in an interactive way) and maybe configure the chart so that a nonsensical or misleading visualisation is produced, with the result that the person comes away from the data claiming something that is not true (for example, because they have not realised that the chart they have created does not show what they intended it to show, or what they think it shows). That person might now claim (incorrectly) that the THES data shows something that it does not – and they have a graphic to “prove” it.
This is where the educator thing comes into play – I would probably want to argue that by republishing the data both as data and via interactive visualisations, I am providing an opportunity for people to engage with and interpret the data that the THES published.
If the THES published the data because they wanted people to be able to write about their own analysis of the data, I have just made that easier to do. I have “amplified” the intent of the THES. However, if the THES only published the data to back up the claims they made in their article, then what I have done may not be fair?
So, what do you think?