It’s one thing publishing data just to comply with a formal requirement to make it public, quite another if you’re publishing it because you want folk to doing something with it.
But if you decide you’re publishing data because you want folk to do something with it, what does that mean exactly?
[Not quite related but relevant: Pete Sefton (@ptsefton) on Did you say you “own” this data? You keep using that word. I do not think it means what you think it means.]
One answer might be that you want them to be able to refer to the data for their own purposes, simply by cutting an pasting a results datatable out of one of your spreadsheets so they can paste it into one of theirs and refer to it as “evidence”:
Another might be that you want folk to be able to draw on your data as part of their own decision making process. And so on. (For a couple of other use cases, see First Thoughts On A Content Strategy for Data.)
A desire that appears to have have gained some traction over the last couple of years is to publish data so that folk can produce visualisations based on it. This is generally seen as a Good Thing, although I’m not sure I know exactly why…? Perhaps it’s because visualisations are shiny objects and folk can sometimes be persuaded to share (links to) shiny objects across their social networks; this in turn may help raise wider awareness about the existence of your data, and potentially bring it to the attention of somebody who can actually makes some use of it, or extract some value from it, possibly in combination with one or more other datasets that you may or may not be aware of.
Something that I’ve become increasingly aware of over the last couple of years is that people respond to graphics and data visualisations in very different ways. The default assumption seems to be that a graphic should expose some truth in very obvious way without any external input. (I’m including things like axis labels, legends, and captions in the definition of a graphic.) That is, the graphic should be a self-contained, atomic object, meaningful in its own right. I think this view is borne out of the assumption that graphics are used to communicate something that is known by the author who used it to their audience. The graphic is chosen because it does “self-evidently” make some point that makes the author’s case. Let’s call these “presentation graphics”. Presentation graphics are shiny objects, designed to communicate something in particular, to a particular audience, in (if at all possible) a self-contained way.
Another way of using visualisations is as part of a visual analysis process. In this case, visual representations of the data are generated by the analyst as part of a conversation they are having with the data. One aim of this conversation (or maybe we should call it an interrogation?!) may be to get the data to reveal something about its structure, or meaningful patterns contained within it. Visual analysis is therefore less to do with the immediate requirement of producing meaningful presentation graphics, and more to do with getting the data to tell its story. Scripted speeches contain soundbites – presentation graphics. Conversations can ramble all over the place and are often so deeply situated in a particular context they are meaningless to onlookers – as visualisations produced during a visual analysis activity may be. (Alternatively, the visual analyst spends their time trying to learn how to ride a bike. Chris Hoy and Victoria Pendleton show how it’s done with the presentation graphics…)
It could be that I’m setting up something of a false dichotomy between extrema here, because sometimes a simple, “directly generated” chart may be effective as both a simple 1-step visual analysis view, and as a presentation graphic. But I’m trying to think through my fingers and type my way round to what I actually believe about all this stuff, and arguing to limits is one lazy way of doing this! The distinction is also not just mine… For example, Iliinsky and Steele’s Designing Data Visualizations identifies the following:
Explanatory visualization: Data visualizations that are used to transmit information or a point of view from the designer to the reader. Explanatory visualizations typically have a specific “story” or information that they are intended to transmit.
Exploratory visualization: Data visualizations that are used by the designer for self-informative purposes to discover patterns, trends, or sub-problems in a dataset. Exploratory visualizations typically don’t have an already-known story.
They also define a data visualizations as “[v]isualizations that are algorithmically generated and can be easily regenerated with different data, are usually data-rich, and are often aesthetically shallow.” Leaving aside the aesthetics, the notion that data visualisations can be “algorithmically generated” is important here.
A related insight I picked up from the New York Times’ Amanda Cox is the use of statistical charts of visual analysis as sketches that help us engage with data en route to understanding some of the stories it contains, stories that may then be told by whatever means are appropriate (which may or may not include graphical representations or visualisations).
So when it comes to publishing data in the hope that folk will do something visual with it, does that mean we want to provide them with the data that can be directly used to convey some known truth in an appealing way, or do we want to provide them with data in such a way that they can engage with it in a (visual) analytic way and the communicate their insight through a really beautiful presentation graphic? (Note that it may often be the case that something discovered through a visual analysis step may actually best be communicated through a simple set of ranked, tabulated data presented as text…) Maybe this explains why folk are publishing the data in the hope that it will be “visualised”? They are conflating visual analysis with presentation graphics, and hoping that novel visualisation (visual analysis) techniques will: 1) provide new insights (new sense made from) the data, that: 2) also work as shiny, shareable and insightful presentation graphics? Hmmm…
Publishing Data to Support Visual Engagement
So we have our data set, but how can we publish it in a way that supports generative visual engagement with it (generative in the sense that we want the user to have at least some role in creating their own visual representations of the data)?
The easiest route to engagement is to publish an interactive visualisation on top of your data set so that the only way folk can engage with the data is through the interactive visualisation interface. So for example, interactive visualisations published by the BBC or New York Times. These typically support the generation of novel views over the data by allowing the user to construct queries over the data through interactive form elements (drop down lists, radio buttons, sliders, checkboxes, etc.); these queries are then executed to filter or analyse the data and provide a view over it that can be visually displayed in a predetermined way. The publisher may also choose to provide alternative ways of visualising the data (for example, scatter plot or bar chart) based on preidentified ways of mapping from the data to various graphical dimensions within particular chart types. In the case of the interactive visualisation hosted on the publisher’s website, the user is thus typically isolated from the actual data.
An alternative approach is to publish the data in an environment that supports the creation of visualisations at what we might term the row and column level. This is where ideas relating to a content strategy for data start to come in to play. An example of this is IBM’s Many Eyes data visualisation tool. Long neglected by IBM, the Many Eyes website provides an environment for: 1) uploading tabular datasets; 2) generating preconfigured interactive visualisations on top of the datasets; 3) providing embeddable versions of visualisations; 4) supporting discussions around visualisations. Note that a login is required to upload data and generate new visualisations.
As an example of what’s possible, I uploaded a copy of the DCMS CASE data relating to Capital Investment to Many Eyes (DCMS CASE data – capital investment (modified)):
Once the data is uploaded, the user has the option of generating one or more interactive visualisations of the data from a wide range of visualisation types. For example, here’s a matrix chart view (click through to see the interactive version; note: Java required).
And here’s a bubble chart:
In an ideal world, getting the data into Many Eyes should just(?) have been a case of copy and pasting data from the original spreadsheet. (Note that this requires access to a application that can open the spreadsheet, either on the desktop or online.) In the case of the DCMS CASE data, this required opening the spreadsheet, finding the correct sheet, then identifying the cell range containing the data we want to visualise:
Things are never just that simple, of course… Whilst it is possible to define columns as “text” or “number” in Many Eyes, the date field was recognised by the Many Eyes visualisation tools as a “number” which may lead to some visualisations incorrectly aggregating data from the same region across several date ranges. In order to force the Many Eyes visualisations to recongnise the date column as “text”, I had to edit the original data file (before uploading it to Many Eyes) by prepending the date ranges with an alphabetic character (so for example I replaced instances of 2004/05 with Y2004/05).
Recommendation In terms of a “content strategy for data”, then, we need to identify possible target domains where we wish to syndicate or republish our data and then either: 1) publish the data to that domain, possibly under our own branding, or “official” account on that domain – this approach also allows the publisher to add provenance metadata, a link to the homepage for the data or its original source, and so on; or, 2) publish the data on our site in such a way that we know it will work on the target domain (which means testing it/trying it out…). If you expect users to upload your data to services like Many Eyes themselves, it would make sense to provide easily cut and pastable example text of the sort you might expect to see appear in the metadata fields of the data page on the target site and encourage users to make use of that text.
Recommendation A lot of the CASE data spreadsheets contain multiple cell ranges corresponding to different tables within a particular sheet. Many cut and paste tools support data that can be cut and pasted from appropriately highlighted cell ranges. However, other tools require data in comma separated (rather than tab separated) format which mean the user must copy and paste the data into another sheet and then save it as CSV. Although a very simple format, there is a lot to be said for publishing very simple CSV files containing your data. Provenance and explanatory data often gets separated from data contained in CSV files, but you can always add a manifest text file to the collection of CSV data files to explain the contents of each one.
Whilst services such as Many Eyes do their best in trying to identify numeric versus categorical data columns, unless the user is familiar with the sorts of data a particular visualisation type requires and how it should be presented, it can sometimes be hard to understand why Many Eyes has automatically identified particular values for use in drop down list boxes, and at times hard to interpret what is actually being displayed. (This is a good reason to limit the use of Many Eyes to a visual analysis role, and use it to discover things that look interesting/odd and then go off and dig in the data a little more to se if there really is anything interesting there…)
In some cases, it may be possible to reshape the data and get it in to a form that Many Eyes can work with. (Remember the Iliinsky and Steele definition of a data visualisation as something “algorithmically generated”? If the data is presented in the right way, then Many Eyes can do something with it. But if it’s presented in the wrong way, not joy…) As an example, if we look at the CASE Capital Investment data, we see it has columns for Region, Local Authority, Date, as well as columns relating to the different investment types. Presented this way, we can easily group data across different years within an LA or Region. Alternatively, we might have selected Region, Local Authority, and Asset type columns, with separate columns for each date range. This different combination of rows and columns may provides a different basis for the sorts of visualisations we can generate within Many Eyes and the different summary views we can present over the data.
Recommendation The shape in which the data it published may have an effect on the range of visualisations that can be directly generated from the data, without reshaping by the user. It may be appropriate to publish the data in a variety of shapes, or provide tools for reshaping data for use with particular target services. Tools such as the Stanford Data Wrangler are making it easier for people to clean and reshape messy data sets, but that is out of scope for this review. In addition, it is worth consider the data type or physical form in which data is published. For example, in columns relating to finanacial amounts, prepending each data element in a cell with a £ my break cut and paste visualisation tools such as Many Eyes, which will recognise the element as a character string. Some tools are capable of recognising datetime formats, so in some cases it may be appropriate to publish date/datetime in a standardised way. Many tools choke on punctuation characters from Windows character sets, and despite best efforts, rogue characters and undeclared or incorrect character encodings often find their way in to datasets which present them working correctly in third party applications. Some tools will automatically strip out leading and trailing whitespace characters, others will treat them as actual characters. Where string matching operations are applied (for example, grouping data elements) a word with a trailing space and a word without a trailing space may be treated as defining different groups. (Which is to say, try to strip leading and trailing whitespace in your data. Experts know to check for this, novices don’t).
If the expectation is that users will make use of a service such as Many Eyes, it may be worth providing an FAQ area that describes what shape the different visualisation expect the data to be in, with examples from your own data sets. Services such as Number Picture, which provide a framework for visualising data by means of visualisation templates that accept data in a specified shape and form, provided helpful user prompts that explain what the (algorithmic) visualisation expects in terms of the shape and form of input data:
Custom Filter Visualisations – Google Fusion Tables
Google Fusion Tables are like spreadsheets on steroids. They combine features of traditional spreadsheets with database like query support and access to popular chart types. Google Fusion Tables can be populated by importing data from Google Spreadsheets or uploading data from CSV files (example Fusion Table).
Google Fusion Tables can also be used to generate new tables based on the fusion of two (or, by chaining, more than two) tables that share a common column. So for example, given three CSV data files containing different data sets (for example, file A has LA codes, and Regions, file B has LA codes and arts spend by LA, and file C has LA codes and sports engagement data) we can merge the files on the common columns to give a “fused” data set (for example, a single table containing four columns: LA codes, Regions, arts spend, sports engagement data). Note that the data may need to be appropriately shaped before it can be fused in a meaningful way with other data sets.
As with many sites that support data upload/import, it’s typically down the to the user to add appropriate metadata to to the data file. This metadata is important for a variety of reasons: firstly, it provides context around a dataset; secondly, it may aid in discovery of the data set if the data and its metadata is publicly indexed; thirdly, it may support tracking, which can be useful if the original publisher needs to demonsstrate how widely a dataset has been (re)used.
If there are too many steps involved in getting the data from the download site into the target environment (for example, if it needs downloading, a cell range copying, saving into another data file, cleaning, then uploading) the distance from the original data source to the file that is uploaded may result in the user not adding much metadata at all. As before, if it is anticipated that a service such as Google Fusion Tables is a likely locus for (re)use of a dataset, the publisher should consider publishing the data directly through the service, with high quality metadata in place, or provide obvious cues and cribs to users about the metadata they might wish to add to their data uploads.
A nice feature of Google Fusion Tables is the way it provides support for dynamic and compound queries over a data set. So for example, we can filter rows:
Or generate summary/aggregate views:
A range of standard visualisation types are available:
Charts can be used to generate views over filtered data:
Or filtered and aggregated data:
Note that these charts may not be of publishable quality/useful as presentation graphics, but they may be useful as part of a visual analysis of the data. To this extent, the lack of detailed legends and titles/captions for the chart does not necessarily present a problem – the visual analyst should be aware of what the data they are viewing actually represents (and they can always check the filter and aggregate settings if they are unsure, as well as dropping in to the tabular data view to check actual numerical values if anything appears to be “odd”. However, the lack of explanatory labeling is likely to be an issue if the intention is to produce a presentation graphic, in which case the user will need to grab a copy of the image and maybe postprocess it elsewhere.
Note that Google Fusion Tables is capable of geo-coding certain sorts of location related data such as placenames or postcodes and rendering associated markers on a map. It is also possible to generate thematic maps based on arbitrary geographical shapefiles (eg Thematic Maps with Google Fusion Tables [PDF]).
Helping Data Flow – Treat It as a Database
Services such as Google Spreadsheets provide online spreadsheets that support traditional spreadsheet operations that include chart generation (using standard chart types familiar to spreadsheet users) and support for interactive graphical widgets (including more exotic chart types, such as tree maps), powered by spreadsheet data, that can be embedded in third party webpages. Simple aggregate reshaping of data is provided in the from of support for Pivot Tables. (Note however that Google Spreadsheet functionality is sometimes a little bug ridden…) Google spreadsheets also provides a powerful query API (the Google Visulisation API), that allows the spreadsheet to be treated as a database. For an example in another government domain, see Government Spending Data Explorer; see also Guardian Datastore MPs’ Expenses Spreadsheet as a Database ).
Publishing data in this way has the following benefits: 1) treating the data as a spreadsheet allows query based views to be generated over it; 2) this views can be visualised directly in the page (this includes dynamic visulisations, as for example described in Google Chart Tools – dynamic controls, and gallery); 3) queries can be used to generated CSV based views over the data that can be (re)used in third party applications.
Sometimes it makes sense to visualise data in a geographical way. One service that provides a quick way of generating choropleth/thematic maps from simple two or three column data keyed by UK administrative geography labels or identifiers is OpenHeatmap. Data can be uploaded from a simple CSV file or imported from a Google spreadsheet using predetermined column names (a column identifying geographical areas according to one of fixed number of geographies, a number value column for colouring the geographical area, and an optional date column for animation purposes (so a map can be viewed in an animated way over consecutive time periods):
Once generated, links to an online version of the map are available.
The code for OpenHeatmap is available as open source software so without too much effort it should be possible to modify the code in order to host a local instance of the software and tie it in a set of predetermined Google spreadsheets, local CSV files, or data views generated from queries over a predetermined datasource so that only the publisher’s data can be visualised using the particular instance of OpenHeatmap.
Other services for publishing and visualising geo-related data are available (eg Geocommons) and could play a role as a possible outlet in a content strategy for data with a strong geographical bias.
Power Tools – R
A further class of tools that can be used to generate visual representations or arbitrary datasets are the fully programmatic tools, such as the R statistical programming language. Developed for academic use, R is currently increasing in popularity on the coat tails of “big data” and the growing interest in analysis of activity data (“paradata”) that is produced as a side-effect of our digital activities. R is capable of importing data in a wide variety of formats from local files as well as via a URL from an online source. The R data model supports a range of powerful transformations that allow data to be shaped as required. Merging data that shares common columns (in whole or part) from separate sources is also supported.
In order to reduce overheads in getting data into a useful shape within the R environment, it may make sense to publish datafile “wrappers” that act as a simple API to data contained with one or published spreadsheets or datafiles. By providing an object type and, where appropriate, access methods for the the data, the data publisher can provide a solid framework on top of which third parties can build their own analysis and statistical charts. R is supported by a wide range of third party extension libraries for generating a wide range of statistical charts and graphics, including maps. (Of particular note are ggplot2 for generating graphics according to the Grammar of Graphics model, and googleVis, which provides a range of functions that support the rapid generation of Google Charts). Many of the charts can be generated from a single R command if the data is in the correct shape and format.
As well as running as a local, desktop application, R can also be run as a hosted webservice (for example, cloudstat.org; the RStudio cross-platform desktop application can also be accessed as a hosted online service, and could presumably be used to provide a robust, online hosted analysis environment tied in to a set of locked down data sources). It is also possible to use R to power online hosted statistical charting services; see for example .
Some cleaning of the data may be required before uploading to the ggplot service. For example, empty cells marked as such by a “-” should be replaced by empty cells; numeric values containing a “,”, may be misinterpreted as character strings (factor levels) rather than numbers (in which case the data needs cleaning by removing commas). Again, if it is known that a service such as ggplot2 is likely to be a target for data reuse, publishing the data in a format that is known to work “just by loading the data in” to R with default import settings will reduce friction/overheads and keep the barriers to reusing the data within that environment to a minimum.
Observation Most of the time, most people don’t get past default settings on any piece of software. If someone tries to load your data into an application, odds on they will use default, factory settings. If you know that users are likely to want to use your data in a particular package, make at least a version of your data available in a format that will load into that package under the default settings in a meaningful way.
Finally, a couple of wild card possibilities.
Firstly, Wolfram Alpha. Wolfram Alpha provides access to a “computational search engine” that accepts natural language queries about facts or data and attempts to provide reasoned responses to those queries, including graphics. Wolfram Alpha is based around a wide range of curated data sets, so an engagement strategy with them may, in some certain circumstances, be appropriate (for example, working with them in the publication of data sets and then directing users to Wolfram Alpha in return). Wolfram Alpha also offers a “Pro” service (Wolfram Alpha Pro) that allows users to visualise and query their own data.
Secondly, the Google Refine Reconciliation API. Google Refine is a cross-platform for cleaning datasets, with the ability to reconcile the content of data columns with canonical identifiers published elsewhere. For example, it is possible to reconcile the names of local authorities with canonical Linked Data identifiers via the Kasabi platform (UK Adminstrative Geography codes and identifiers).
By anchoring cell values to canonical identifiers, it becomes possible to aggregate data from different sources around those known, uniquely identified items in a definite and non-arbitrary way. By publishing: a) a reconciliation service (eg for LEP codes); and b) data that relates to identifiers returned by the reconciliation service (for example, sports data by LEP), the data publisher provides a focus for third parties who want to reconcile their own data against the published identifiers, as well as a source of data values that can be used to annotate records referencing those identifiers. (So for example, if you make it easy for me to get Local Authority codes based on local authority names from your reconciliation service, and also publish data linked to those identifiers (sports engagement data, say), if I reconcile my data against your codes, I will also be provided with the opportunity to annotate my data with your data (so I can annotate my local LEP spend data with your LER sports engagement data; [probably a bad example… need something more convincing?!])… Although uptake of the reconciliation API (and the associated possibility of providing annotation services) is still a minority interest, there are some signs of interest in it (for example, Using Google Refine and taxonomic databases (EOL, NCBI, uBio, WORMS) to clean messy data; note that data published on the Kasabi platform also exposes a Google Refine reconciliation service endpoint.) In my opinion, there are potentially significant benefits to be had by publishing reconciliation service endpoints with associated annotation services if a culture of use grows up around this protocol.
Not covered: as part of this review, I have not covered applications such as Microsoft Excel or Tableau Desktop (the latter being a Windows only data visualisation environment that is growing in popularity). Instead, I have tried to focus on applications that are freely available either via the web or on a cross-platform basis. There is also a new kid on the block – datawrapper.de – but it’s still early days for this tool…