Last week, Alan tweeted a challenge of sorts about me doing something to the academics’ pay data referred to in the THES article Pay packets of excellence. The data (Vice Chancellors’ pay in UK HEIs, and academics’ pay across UK HEIs) was published via two separate PDF documents, and “compiled and audited by Grant Thornton on behalf of Times Higher Education”.
The THES provided some analysis and interpretation of the data, and the survey was picked up by other media (e.g. here’s the Guardian’s take: Vice-chancellors’ salaries on a par with prime minister; the Telegraph said: Anger as university bosses claim £200,000 salaries; the Times: Campus fury at vice-chancellors’ windfalls; the Press Association: University chiefs pocket wage rise; and so on).
So partly to give Martin something concrete to think about in the context of Should universities break copyright law? and Universities as copyright warriors: the THES published the data in two PDF documents on its website alongside the Pay packets of excellence article. Is my republishing of that data as a spreadsheet on Google spreadsheets a fair thing to do? (I didn’t notice any explicit license terms.)
(The data is here: UK HE VCs’ and Academics’ pay.)
Now why would I want to publish the data? Well, as it stands, the data as published in the PDF documents is all very well, but… what can you do with it? How useful is it to the reader? And what did the THES intend by publishing the data in the PDFs?
That readers could check the claims made in the article is one possibility; that other media channels could draw their own conclusions from the results and then cite the THES is another (“link bait”;-). But is there any implication that readers could take the data as data and manipulate it, visualise it, and so on? If there is, is there any implication or expectation that journalists et al. might take the data into a private spreadsheet, maybe, manipulate it, understand it, and then publish their interpretation? Might there be a reasonable expectation that someone would republish the data as data so that people without the skills to take the data out of the PDF and put it into a spreadsheet could benefit from it being represented in that way?
As well as publishing the data via a Google spreadsheet, I also republished via two Many Eyes Wikified data pages: UK HE Vice Chancellors’ Salaries: Many Eyes wikified data page and UK HE Academic Salaries: Many Eyes wikified data page. So was this a fair thing to do, in any reasonable sense of the word?
And then of course, I did a few visualisations: UK HE Vice Chancellors’ Salaries: Many Eyes wikified visualisations page and UK HE Academic Salaries: Many Eyes wikified visualisations page.
So making the data available means I can create visual interpretations of it. Is this a fair thing to do with the data or not? If the data was published with the intention that other people publish their interpretations of it, does a visual interpretation count? And if so, what’s a fair way of creating that “data as data”? By publishing the data used to generate the visualisation in the spreadsheet, people can check the actual data that is feeding the visualisation, and then check that it’s the same as the THES data.
Finally, each Many Eyes visualisation is itself interactive. That is, a user can change the dimensions plotted in the chart and try to interpret (or make sense of) the data themselves in a visual way.
So is that a fair thing to do with data? Using it to underwrite the behaviour of a set of visualisations that a user can interact with and change themselves?
So here’s where we’re at: the THES published the data in a “closed” format – a PDF document. One of the features of the PDF is that the presentation of the document is locked down – it should always look the same.
Once the data is republished as data in a public Google document, other people can change how that data is presented. They can also use that data as the basis of a visualisation. Is there any difference between an IT literate journalist putting the data into a private spreadsheet and then publishing a visualisation of that data, and someone republishing the data so that anyone can visualise it?
Now let’s consider the Many Eyes visualisations. Suppose it is a fair use of the data to somehow use it to create a visualisation, and then publish that visualisation as a static image. Presumably someone will have checked that the graphic is itself fair, and is not misrepresenting the data. That is, the data has not been misused or misapplied – it has been used in a responsible way and visualised appropriately.
But now suppose that Jo Public can start to play with the visualisation (because it is presented in an interactive way) and maybe configure the chart so that a nonsensical or misleading visualisation is produced, with the result that the person comes away from the data claiming something that is not true (for example, because they have not understood that the chart they have created does not show what they intended it to show, or what they think it shows). That person might now claim (incorrectly) that the THES data shows something that it does not – and they have a graphic to “prove” it.
This is where the educator thing comes into play – I would probably want to argue that by republishing the data both as data and via interactive visualisations, I am providing an opportunity for people to engage with and interpret the data that the THES published.
If the THES published the data because they wanted people to be able to write about their own analysis of the data, I have just made that easier to do. I have “amplified” the intent of the THES. However, if the THES only published the data to back up the claims they made in their article, then what I have done may not be fair?
So, what do you think?
Ah, if only this was in the U.S., it would appear to be simple: The data is facts, and you can’t copyright facts, so fair use doesn’t even come into play.
Personally, I’d argue that it’s fair use anyway because you’re using the data in order to comment on it (which includes the visualizations), and that’s generally regarded (in the U.S.) as fair use. And I’d certainly argue that it’s fair use ethically and morally, whether in the U.S. or the UK.
I believe the same holds for the UK — you can’t copyright the underlying facts.
If you put facts into a database, you can copyright the database schema (as creative effort will have gone into the design of that) and, in the EU, there’s also Database Rights.
You can also copyright the presentation of the facts (again, as creative effort went into that). However, to my mind, THES seem to be “encouraging” reuse of the data by not locking the PDF document down to disable copying & pasting.
IANAL, and all that jazz! :-D
Dave
Okay – so it’s good to know that “facts are for free” (though I still believe that algorithms, business processes and gene sequences are all patentable…?)
But that said, if the THES paid for the collection of the data, then in what senses do they own what they paid for? And in what sense can they publish the data yet still assert rights over it (and what sorts of rights)?
And re: owning a database schema, can’t you own the data in the database too (i.e. the actual dataset)?
As to fairness, would it be fair to only take part of the dataset and put that into e.g. a spreadsheet, so it becomes available as data?