Practical Visualisation Tools Presentation: #CASEprog

Last week I gave a presentation at the DCMS describing some hands-on tools for getting started with creating data powered visualisations (Visualisation Tools to Support Data Engagement), at the invitation of James Doeser from the Arts Council, in the context of the DCMS CASE (Culture and Sport Evidence) Programme, #CASEprog.

I’ve also posted a resource list as a delicious stack: CASEprog – Visualisation Tools (Resource List).

Whilst preparing the presentation, I had a dig through the DCLG sponsored Improving Visualisation for the Public Sector site, which provides pathways for identifying appropriate visualisation types based on data type, policy objectives/communication goals and anticipated audience level. It struck me that being able to pick an appropriate visualisation type is one thing, but being able to create it is another.

My presentation, for example, was based very much around tools that could provide a way in to actually creating visualisations, as well as shaping and representing data so that it can be plugged straight in to particular visualisation views.

So I’m wondering, is there maybe an opportunity here for a practical programme of work that builds on the DCLG Improving Visualisation toolkit by providing worked, and maybe templated, examples, with access to code and recipes wherever possible, for actually creating exemplar visualisation types from actual open/public data sets that can be found on the web?

Could this even be the basis for a set of School of Data practical exercises, I wonder, to actually create some of these examples?

More Thoughts on a Content Strategy for Data – Many Eyes and Google Fusion Tables

It’s one thing publishing data just to comply with a formal requirement to make it public, quite another if you’re publishing it because you want folk to do something with it.

But if you decide you’re publishing data because you want folk to do something with it, what does that mean exactly?

[Not quite related but relevant: Pete Sefton (@ptsefton) on Did you say you “own” this data? You keep using that word. I do not think it means what you think it means.]

One answer might be that you want them to be able to refer to the data for their own purposes, simply by cutting and pasting a results datatable out of one of your spreadsheets so they can paste it into one of theirs and refer to it as “evidence”:

Reference summary data

Another might be that you want folk to be able to draw on your data as part of their own decision making process. And so on. (For a couple of other use cases, see First Thoughts On A Content Strategy for Data.)

A desire that appears to have gained some traction over the last couple of years is to publish data so that folk can produce visualisations based on it. This is generally seen as a Good Thing, although I’m not sure I know exactly why…? Perhaps it’s because visualisations are shiny objects and folk can sometimes be persuaded to share (links to) shiny objects across their social networks; this in turn may help raise wider awareness about the existence of your data, and potentially bring it to the attention of somebody who can actually make some use of it, or extract some value from it, possibly in combination with one or more other datasets that you may or may not be aware of.

Something that I’ve become increasingly aware of over the last couple of years is that people respond to graphics and data visualisations in very different ways. The default assumption seems to be that a graphic should expose some truth in a very obvious way without any external input. (I’m including things like axis labels, legends, and captions in the definition of a graphic.) That is, the graphic should be a self-contained, atomic object, meaningful in its own right. I think this view is born out of the assumption that graphics are used by an author to communicate something they already know to their audience. The graphic is chosen because it does “self-evidently” make some point that makes the author’s case. Let’s call these “presentation graphics”. Presentation graphics are shiny objects, designed to communicate something in particular, to a particular audience, in (if at all possible) a self-contained way.

Another way of using visualisations is as part of a visual analysis process. In this case, visual representations of the data are generated by the analyst as part of a conversation they are having with the data. One aim of this conversation (or maybe we should call it an interrogation?!) may be to get the data to reveal something about its structure, or meaningful patterns contained within it. Visual analysis is therefore less to do with the immediate requirement of producing meaningful presentation graphics, and more to do with getting the data to tell its story. Scripted speeches contain soundbites – presentation graphics. Conversations can ramble all over the place and are often so deeply situated in a particular context they are meaningless to onlookers – as visualisations produced during a visual analysis activity may be. (Alternatively, the visual analyst spends their time trying to learn how to ride a bike. Chris Hoy and Victoria Pendleton show how it’s done with the presentation graphics…)

It could be that I’m setting up something of a false dichotomy between extrema here, because sometimes a simple, “directly generated” chart may be effective as both a simple 1-step visual analysis view, and as a presentation graphic. But I’m trying to think through my fingers and type my way round to what I actually believe about all this stuff, and arguing to limits is one lazy way of doing this! The distinction is also not just mine… For example, Iliinsky and Steele’s Designing Data Visualizations identifies the following:

Explanatory visualization: Data visualizations that are used to transmit information or a point of view from the designer to the reader. Explanatory visualizations typically have a specific “story” or information that they are intended to transmit.
Exploratory visualization: Data visualizations that are used by the designer for self-informative purposes to discover patterns, trends, or sub-problems in a dataset. Exploratory visualizations typically don’t have an already-known story.

They also define data visualizations as “[v]isualizations that are algorithmically generated and can be easily regenerated with different data, are usually data-rich, and are often aesthetically shallow.” Leaving aside the aesthetics, the notion that data visualisations can be “algorithmically generated” is important here.

A related insight I picked up from the New York Times’ Amanda Cox is the use of statistical charts in visual analysis as sketches that help us engage with data en route to understanding some of the stories it contains, stories that may then be told by whatever means are appropriate (which may or may not include graphical representations or visualisations).

So when it comes to publishing data in the hope that folk will do something visual with it, does that mean we want to provide them with the data that can be directly used to convey some known truth in an appealing way, or do we want to provide them with data in such a way that they can engage with it in a (visual) analytic way and then communicate their insight through a really beautiful presentation graphic? (Note that it may often be the case that something discovered through a visual analysis step may actually best be communicated through a simple set of ranked, tabulated data presented as text…) Maybe this explains why folk are publishing the data in the hope that it will be “visualised”? They are conflating visual analysis with presentation graphics, and hoping that novel visualisation (visual analysis) techniques will: 1) provide new insights into (new sense made from) the data, that: 2) also work as shiny, shareable and insightful presentation graphics? Hmmm…

Publishing Data to Support Visual Engagement

So we have our data set, but how can we publish it in a way that supports generative visual engagement with it (generative in the sense that we want the user to have at least some role in creating their own visual representations of the data)?

The easiest route to engagement is to publish an interactive visualisation on top of your data set so that the only way folk can engage with the data is through the interactive visualisation interface. Examples include the interactive visualisations published by the BBC or New York Times. These typically support the generation of novel views over the data by allowing the user to construct queries over the data through interactive form elements (drop down lists, radio buttons, sliders, checkboxes, etc.); these queries are then executed to filter or analyse the data and provide a view over it that can be visually displayed in a predetermined way. The publisher may also choose to provide alternative ways of visualising the data (for example, scatter plot or bar chart) based on preidentified ways of mapping from the data to various graphical dimensions within particular chart types. In the case of the interactive visualisation hosted on the publisher’s website, the user is thus typically isolated from the actual data.

An alternative approach is to publish the data in an environment that supports the creation of visualisations at what we might term the row and column level. This is where ideas relating to a content strategy for data start to come in to play. An example of this is IBM’s Many Eyes data visualisation tool. Long neglected by IBM, the Many Eyes website provides an environment for: 1) uploading tabular datasets; 2) generating preconfigured interactive visualisations on top of the datasets; 3) providing embeddable versions of visualisations; 4) supporting discussions around visualisations. Note that a login is required to upload data and generate new visualisations.

As an example of what’s possible, I uploaded a copy of the DCMS CASE data relating to Capital Investment to Many Eyes (DCMS CASE data – capital investment (modified)):

CASE Data on Many Eyes

Once the data is uploaded, the user has the option of generating one or more interactive visualisations of the data from a wide range of visualisation types. For example, here’s a matrix chart view (click through to see the interactive version; note: Java required).

Many Eyes example

And here’s a bubble chart:

Many Eyes Bubblechart

In an ideal world, getting the data into Many Eyes should just(?) have been a case of copying and pasting data from the original spreadsheet. (Note that this requires access to an application that can open the spreadsheet, either on the desktop or online.) In the case of the DCMS CASE data, this required opening the spreadsheet, finding the correct sheet, then identifying the cell range containing the data we want to visualise:

DCMS CASE data - raw

Things are never just that simple, of course… Whilst it is possible to define columns as “text” or “number” in Many Eyes, the date field was recognised by the Many Eyes visualisation tools as a “number”, which may lead to some visualisations incorrectly aggregating data from the same region across several date ranges. In order to force the Many Eyes visualisations to recognise the date column as “text”, I had to edit the original data file (before uploading it to Many Eyes) by prepending the date ranges with an alphabetic character (so for example I replaced instances of 2004/05 with Y2004/05).
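I made that edit by hand, but the same fix is easy to script. Here’s a minimal sketch, assuming the date labels always look like 2004/05; the `force_text` helper, the column position and the sample rows are all made up for illustration.

```python
import re

# Matches financial-year labels such as "2004/05" (an assumed format).
YEAR_RANGE = re.compile(r"^\d{4}/\d{2}$")

def force_text(value, prefix="Y"):
    """Prefix year-range labels so that import tools treat them as text."""
    value = value.strip()
    return prefix + value if YEAR_RANGE.match(value) else value

# Illustrative rows only: region, date range, amount.
rows = [["North East", "2004/05", "120"], ["North East", "2005/06", "135"]]
DATE_COL = 1  # position of the date column in this made-up table

fixed = [r[:DATE_COL] + [force_text(r[DATE_COL])] + r[DATE_COL + 1:] for r in rows]
```

Values that are already prefixed, or that don’t look like a year range, pass through unchanged, so the function can be run over a column more than once without doing any harm.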

Recommendation: In terms of a “content strategy for data”, then, we need to identify possible target domains where we wish to syndicate or republish our data and then either: 1) publish the data to that domain, possibly under our own branding, or “official” account on that domain – this approach also allows the publisher to add provenance metadata, a link to the homepage for the data or its original source, and so on; or, 2) publish the data on our site in such a way that we know it will work on the target domain (which means testing it/trying it out…). If you expect users to upload your data to services like Many Eyes themselves, it would make sense to provide easily cut and pastable example text of the sort you might expect to see appear in the metadata fields of the data page on the target site and encourage users to make use of that text.

Recommendation: A lot of the CASE data spreadsheets contain multiple cell ranges corresponding to different tables within a particular sheet. Many cut and paste tools support data that can be cut and pasted from appropriately highlighted cell ranges. However, other tools require data in comma separated (rather than tab separated) format, which means the user must copy and paste the data into another sheet and then save it as CSV. Although a very simple format, there is a lot to be said for publishing very simple CSV files containing your data. Provenance and explanatory data often get separated from data contained in CSV files, but you can always add a manifest text file to the collection of CSV data files to explain the contents of each one.
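As a rough sketch of what “one simple CSV file per table, plus a manifest” might look like in practice, the following writes a table out as comma separated text and builds a plain-text manifest record for it. The filename, field names and manifest layout are my own inventions, not an established convention.

```python
import csv
import io

def to_csv(header, rows):
    """Serialise one table as comma-separated text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()

def manifest_entry(filename, description, source):
    """One plain-text manifest record describing a CSV file."""
    return f"{filename}\n  description: {description}\n  source: {source}\n"

# Illustrative values only.
csv_text = to_csv(["Region", "Date", "Investment"],
                  [["North East", "Y2004/05", 120]])
manifest = manifest_entry("capital_investment.csv",
                          "Capital investment by region and year",
                          "DCMS CASE spreadsheet (sheet and cell range go here)")
```

Keeping the provenance in a separate text file means the CSV itself stays clean enough to paste straight into a tool that expects nothing but a header row and data rows.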

Whilst services such as Many Eyes do their best in trying to identify numeric versus categorical data columns, unless the user is familiar with the sorts of data a particular visualisation type requires and how it should be presented, it can sometimes be hard to understand why Many Eyes has automatically identified particular values for use in drop down list boxes, and at times hard to interpret what is actually being displayed. (This is a good reason to limit the use of Many Eyes to a visual analysis role, and use it to discover things that look interesting/odd and then go off and dig in the data a little more to see if there really is anything interesting there…)

In some cases, it may be possible to reshape the data and get it in to a form that Many Eyes can work with. (Remember the Iliinsky and Steele definition of a data visualisation as something “algorithmically generated”? If the data is presented in the right way, then Many Eyes can do something with it. But if it’s presented in the wrong way, no joy…) As an example, if we look at the CASE Capital Investment data, we see it has columns for Region, Local Authority, Date, as well as columns relating to the different investment types. Presented this way, we can easily group data across different years within an LA or Region. Alternatively, we might have selected Region, Local Authority, and Asset type columns, with separate columns for each date range. This different combination of rows and columns may provide a different basis for the sorts of visualisations we can generate within Many Eyes and the different summary views we can present over the data.
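To make the two shapes concrete, here’s a small sketch that pivots “long” records (one row per Region, Local Authority and date range) into a “wide” table with one column per date range. The `long_to_wide` function and the sample values are made up for illustration, and it handles a single value column at a time.

```python
def long_to_wide(records, key_fields, column_field, value_field):
    """Pivot long records so each distinct column_field value becomes a column."""
    wide = {}
    for rec in records:
        key = tuple(rec[f] for f in key_fields)
        # Start a new output row for each unseen combination of key fields.
        row = wide.setdefault(key, dict(zip(key_fields, key)))
        row[rec[column_field]] = rec[value_field]
    return list(wide.values())

# Illustrative values only.
long_rows = [
    {"Region": "North East", "LA": "Durham", "Date": "Y2004/05", "Arts": 10},
    {"Region": "North East", "LA": "Durham", "Date": "Y2005/06", "Arts": 12},
]
wide_rows = long_to_wide(long_rows, ["Region", "LA"], "Date", "Arts")
```

Here the two input rows collapse into one output row for Durham, with a Y2004/05 column and a Y2005/06 column.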

Recommendation: The shape in which the data is published may have an effect on the range of visualisations that can be directly generated from the data, without reshaping by the user. It may be appropriate to publish the data in a variety of shapes, or provide tools for reshaping data for use with particular target services. Tools such as the Stanford Data Wrangler are making it easier for people to clean and reshape messy data sets, but that is out of scope for this review. In addition, it is worth considering the data type or physical form in which data is published. For example, in columns relating to financial amounts, prepending each data element in a cell with a £ may break cut and paste visualisation tools such as Many Eyes, which will recognise the element as a character string. Some tools are capable of recognising datetime formats, so in some cases it may be appropriate to publish date/datetime in a standardised way. Many tools choke on punctuation characters from Windows character sets, and despite best efforts, rogue characters and undeclared or incorrect character encodings often find their way in to datasets and prevent them working correctly in third party applications. Some tools will automatically strip out leading and trailing whitespace characters, others will treat them as actual characters. Where string matching operations are applied (for example, grouping data elements) a word with a trailing space and a word without a trailing space may be treated as defining different groups. (Which is to say, try to strip leading and trailing whitespace in your data. Experts know to check for this, novices don’t.)
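A couple of those gotchas are cheap to guard against before publishing. This is a sketch under the assumption that amounts arrive as strings with an optional leading £ and thousands separators; the function names are mine.

```python
def clean_amount(cell):
    """Return a float from a cell such as ' £1,250 ', or None if it is empty."""
    text = cell.strip().lstrip("£").replace(",", "")
    return float(text) if text else None

def clean_label(cell):
    """Strip leading/trailing whitespace so that group labels match."""
    return cell.strip()

# Three spellings of the same label collapse to one group once cleaned.
labels = {clean_label(x) for x in ["London", "London ", " London"]}
```

Without the cleaning step, a tool that groups on raw strings would see three different “London” groups here.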

If the expectation is that users will make use of a service such as Many Eyes, it may be worth providing an FAQ area that describes what shape the different visualisations expect the data to be in, with examples from your own data sets. Services such as Number Picture, which provide a framework for visualising data by means of visualisation templates that accept data in a specified shape and form, provide helpful user prompts that explain what the (algorithmic) visualisation expects in terms of the shape and form of input data:

Number picture - describes the shape and form the data needs to be in

Custom Filter Visualisations – Google Fusion Tables

Google Fusion Tables are like spreadsheets on steroids. They combine features of traditional spreadsheets with database like query support and access to popular chart types. Google Fusion Tables can be populated by importing data from Google Spreadsheets or uploading data from CSV files (example Fusion Table).

Google Fusion Table - data import

Google Fusion Tables can also be used to generate new tables based on the fusion of two (or, by chaining, more than two) tables that share a common column. So for example, given three CSV data files containing different data sets (for example, file A has LA codes, and Regions, file B has LA codes and arts spend by LA, and file C has LA codes and sports engagement data) we can merge the files on the common columns to give a “fused” data set (for example, a single table containing four columns: LA codes, Regions, arts spend, sports engagement data). Note that the data may need to be appropriately shaped before it can be fused in a meaningful way with other data sets.
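The fusion step is essentially a join on the shared column. Here’s a minimal sketch of the three-file example, written as a plain function rather than anything Fusion Tables specific; the `fuse` helper, the LA codes and the values are all invented for illustration.

```python
def fuse(*tables, key="LA code"):
    """Inner-join tables (lists of dicts) on a shared key column."""
    indexed = [{row[key]: row for row in t} for t in tables]
    # Only keys present in every table survive the join.
    shared = set(indexed[0]).intersection(*indexed[1:])
    fused = []
    for k in sorted(shared):
        merged = {}
        for index in indexed:
            merged.update(index[k])
        fused.append(merged)
    return fused

# Illustrative data only.
file_a = [{"LA code": "LA1", "Region": "North East"},
          {"LA code": "LA2", "Region": "London"}]
file_b = [{"LA code": "LA1", "Arts spend": 100},
          {"LA code": "LA2", "Arts spend": 250}]
file_c = [{"LA code": "LA1", "Sports engagement": 0.41}]

fused = fuse(file_a, file_b, file_c)
```

Note that LA2 drops out of the fused table because file C has no row for it, which is one of the ways in which data that isn’t appropriately shaped (or complete) can quietly go missing in a merge.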

As with many sites that support data upload/import, it’s typically down to the user to add appropriate metadata to the data file. This metadata is important for a variety of reasons: firstly, it provides context around a dataset; secondly, it may aid in discovery of the data set if the data and its metadata are publicly indexed; thirdly, it may support tracking, which can be useful if the original publisher needs to demonstrate how widely a dataset has been (re)used.

Google spreadsheets provenance metadata

If there are too many steps involved in getting the data from the download site into the target environment (for example, if it needs downloading, a cell range copying, saving into another data file, cleaning, then uploading) the distance from the original data source to the file that is uploaded may result in the user not adding much metadata at all. As before, if it is anticipated that a service such as Google Fusion Tables is a likely locus for (re)use of a dataset, the publisher should consider publishing the data directly through the service, with high quality metadata in place, or provide obvious cues and cribs to users about the metadata they might wish to add to their data uploads.

A nice feature of Google Fusion Tables is the way it provides support for dynamic and compound queries over a data set. So for example, we can filter rows:

Google fusion table query filters

Or generate summary/aggregate views:

Generating aggregate views

A range of standard visualisation types are available:

Google Fusion tables visualisation options

Charts can be used to generate views over filtered data:

Google Fusion Tables Filters and charts

Or filtered and aggregated data:

Google Fusion Tables Filtered and aggregated views

Note that these charts may not be of publishable quality/useful as presentation graphics, but they may be useful as part of a visual analysis of the data. To this extent, the lack of detailed legends and titles/captions for the chart does not necessarily present a problem – the visual analyst should be aware of what the data they are viewing actually represents (and they can always check the filter and aggregate settings if they are unsure, as well as dropping in to the tabular data view to check actual numerical values if anything appears to be “odd”). However, the lack of explanatory labeling is likely to be an issue if the intention is to produce a presentation graphic, in which case the user will need to grab a copy of the image and maybe postprocess it elsewhere.
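For anyone who wants to check what a filtered and aggregated view is actually doing, the two operations boil down to something like the following sketch. It is plain Python rather than Fusion Tables, and the helper names and figures are made up.

```python
from collections import defaultdict

def filter_rows(rows, **conditions):
    """Keep rows whose fields equal every given condition."""
    return [r for r in rows if all(r.get(f) == v for f, v in conditions.items())]

def sum_by(rows, group_field, value_field):
    """Sum value_field within each distinct group_field value."""
    totals = defaultdict(float)
    for r in rows:
        totals[r[group_field]] += r[value_field]
    return dict(totals)

# Illustrative data only.
rows = [
    {"Region": "London", "Date": "Y2004/05", "Investment": 100.0},
    {"Region": "London", "Date": "Y2005/06", "Investment": 150.0},
    {"Region": "North East", "Date": "Y2004/05", "Investment": 40.0},
]
totals_2004 = sum_by(filter_rows(rows, Date="Y2004/05"), "Region", "Investment")
all_years = sum_by(rows, "Region", "Investment")
```

Comparing the filtered totals with the unfiltered ones is a quick sanity check when a chart looks “odd”.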

Note that Google Fusion Tables is capable of geo-coding certain sorts of location related data such as placenames or postcodes and rendering associated markers on a map. It is also possible to generate thematic maps based on arbitrary geographical shapefiles (eg Thematic Maps with Google Fusion Tables [PDF]).

Helping Data Flow – Treat It as a Database

Services such as Google Spreadsheets provide online spreadsheets that support traditional spreadsheet operations that include chart generation (using standard chart types familiar to spreadsheet users) and support for interactive graphical widgets (including more exotic chart types, such as tree maps), powered by spreadsheet data, that can be embedded in third party webpages. Simple aggregate reshaping of data is provided in the form of support for Pivot Tables. (Note however that Google Spreadsheet functionality is sometimes a little bug ridden…) Google Spreadsheets also provides a powerful query API (the Google Visualisation API) that allows the spreadsheet to be treated as a database. (For an example in another government domain, see Government Spending Data Explorer; see also Guardian Datastore MPs’ Expenses Spreadsheet as a Database.)

Publishing data in this way has the following benefits: 1) treating the data as a database allows query based views to be generated over it; 2) these views can be visualised directly in the page (this includes dynamic visualisations, as for example described in Google Chart Tools – dynamic controls, and gallery); 3) queries can be used to generate CSV based views over the data that can be (re)used in third party applications.
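By way of illustration of point 3, here’s a sketch that builds a query URL asking for a CSV view over a spreadsheet. The URL pattern (`/gviz/tq` with `tqx` and `tq` parameters) is my understanding of how the endpoint is currently exposed and may differ from what was available at the time of writing; the spreadsheet key is a placeholder and the column letters in the query are assumptions about the sheet layout.

```python
from urllib.parse import urlencode

def query_url(spreadsheet_key, query, out="csv"):
    """Build a URL that asks for a query-based view over a spreadsheet."""
    base = f"https://docs.google.com/spreadsheets/d/{spreadsheet_key}/gviz/tq"
    # tqx selects the output format; tq carries the query itself.
    return base + "?" + urlencode({"tqx": f"out:{out}", "tq": query})

# Placeholder key; columns A and C are assumed to hold region and amount.
url = query_url("SPREADSHEET_KEY", "select A, sum(C) group by A")
```

Nothing is fetched here; the point is simply that a query can be packaged up as a URL that a third party application can then load as if it were a CSV file.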

Geographical Data

Sometimes it makes sense to visualise data in a geographical way. One service that provides a quick way of generating choropleth/thematic maps from simple two or three column data keyed by UK administrative geography labels or identifiers is OpenHeatmap. Data can be uploaded from a simple CSV file or imported from a Google spreadsheet using predetermined column names: a column identifying geographical areas according to one of a fixed number of geographies, a number value column for colouring the geographical area, and an optional date column for animation purposes (so a map can be viewed in an animated way over consecutive time periods).


Once generated, links to an online version of the map are available.

The code for OpenHeatmap is available as open source software, so without too much effort it should be possible to modify the code in order to host a local instance of the software and tie it in to a set of predetermined Google spreadsheets, local CSV files, or data views generated from queries over a predetermined datasource so that only the publisher’s data can be visualised using the particular instance of OpenHeatmap.

Other services for publishing and visualising geo-related data are available (eg Geocommons) and could play a role as a possible outlet in a content strategy for data with a strong geographical bias.

Power Tools – R

A further class of tools that can be used to generate visual representations of arbitrary datasets are the fully programmatic tools, such as the R statistical programming language. Developed for academic use, R is currently increasing in popularity on the coat tails of “big data” and the growing interest in analysis of activity data (“paradata”) that is produced as a side-effect of our digital activities. R is capable of importing data in a wide variety of formats from local files as well as via a URL from an online source. The R data model supports a range of powerful transformations that allow data to be shaped as required. Merging data that shares common columns (in whole or part) from separate sources is also supported.

In order to reduce overheads in getting data into a useful shape within the R environment, it may make sense to publish datafile “wrappers” that act as a simple API to data contained within one or more published spreadsheets or datafiles. By providing an object type and, where appropriate, access methods for the data, the data publisher can provide a solid framework on top of which third parties can build their own analysis and statistical charts. R is supported by a wide range of third party extension libraries for generating a wide range of statistical charts and graphics, including maps. (Of particular note are ggplot2 for generating graphics according to the Grammar of Graphics model, and googleVis, which provides a range of functions that support the rapid generation of Google Charts.) Many of the charts can be generated from a single R command if the data is in the correct shape and format.
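The wrapper idea isn’t specific to R. As a language-neutral illustration, here it is sketched in Python rather than R: a small class that hides the parsing and type conversion and exposes a couple of access methods. The `CaseTable` class, its methods and the sample data are all hypothetical.

```python
import csv
import io

class CaseTable:
    """Hypothetical wrapper exposing one published table through a few methods."""

    def __init__(self, csv_text, numeric_fields=()):
        self.rows = list(csv.DictReader(io.StringIO(csv_text)))
        # Convert declared numeric columns once, so callers never have to.
        for row in self.rows:
            for field in numeric_fields:
                row[field] = float(row[field])

    def regions(self):
        """Sorted list of the distinct regions in the table."""
        return sorted({row["Region"] for row in self.rows})

    def for_region(self, region):
        """All rows for a single region."""
        return [row for row in self.rows if row["Region"] == region]

# Illustrative data only; a real wrapper would load the published file.
SAMPLE = "Region,Date,Investment\nLondon,Y2004/05,100\nNorth East,Y2004/05,40\n"
table = CaseTable(SAMPLE, numeric_fields=["Investment"])
```

The publisher gets to decide, once, which columns are numbers and how the data should be sliced, and everyone downstream starts from the same place.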

As well as running as a local, desktop application, R can also be run as a hosted webservice (the RStudio cross-platform desktop application can also be accessed as a hosted online service, and could presumably be used to provide a robust, online hosted analysis environment tied in to a set of locked down data sources). It is also possible to use R to power online hosted statistical charting services.

Uploading data to ggplot2

Some cleaning of the data may be required before uploading to the ggplot service. For example, empty cells marked as such by a “-” should be replaced by empty cells; numeric values containing a “,”, may be misinterpreted as character strings (factor levels) rather than numbers (in which case the data needs cleaning by removing commas). Again, if it is known that a service such as ggplot2 is likely to be a target for data reuse, publishing the data in a format that is known to work “just by loading the data in” to R with default import settings will reduce friction/overheads and keep the barriers to reusing the data within that environment to a minimum.
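A sketch of that cleaning step, assuming cells arrive as strings: a lone “-” becomes an empty cell, and commas are removed only where the result is a valid number, so that text labels containing commas are left alone. The `normalise_cell` name is mine.

```python
def normalise_cell(cell):
    """Blank out '-' placeholders and strip thousands separators from numbers."""
    text = cell.strip()
    if text == "-":
        return ""
    candidate = text.replace(",", "")
    try:
        float(candidate)
    except ValueError:
        # Not numeric even without commas, so leave the text as it was.
        return text
    return candidate

cleaned = [normalise_cell(c) for c in ["1,234", "-", "Durham, County", "56.7"]]
```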

Observation: Most of the time, most people don’t get past default settings on any piece of software. If someone tries to load your data into an application, odds on they will use default, factory settings. If you know that users are likely to want to use your data in a particular package, make at least a version of your data available in a format that will load into that package under the default settings in a meaningful way.

Finally, a couple of wild card possibilities.

Firstly, Wolfram Alpha. Wolfram Alpha provides access to a “computational search engine” that accepts natural language queries about facts or data and attempts to provide reasoned responses to those queries, including graphics. Wolfram Alpha is based around a wide range of curated data sets, so an engagement strategy with them may, in certain circumstances, be appropriate (for example, working with them in the publication of data sets and then directing users to Wolfram Alpha in return). Wolfram Alpha also offers a “Pro” service (Wolfram Alpha Pro) that allows users to visualise and query their own data.

Secondly, the Google Refine Reconciliation API. Google Refine is a cross-platform tool for cleaning datasets, with the ability to reconcile the content of data columns with canonical identifiers published elsewhere. For example, it is possible to reconcile the names of local authorities with canonical Linked Data identifiers via the Kasabi platform (UK Administrative Geography codes and identifiers).

Google refine reconciliation

By anchoring cell values to canonical identifiers, it becomes possible to aggregate data from different sources around those known, uniquely identified items in a definite and non-arbitrary way. By publishing: a) a reconciliation service (eg for LEP codes); and b) data that relates to identifiers returned by the reconciliation service (for example, sports data by LEP), the data publisher provides a focus for third parties who want to reconcile their own data against the published identifiers, as well as a source of data values that can be used to annotate records referencing those identifiers. (So for example, if you make it easy for me to get Local Authority codes based on local authority names from your reconciliation service, and also publish data linked to those identifiers (sports engagement data, say), then if I reconcile my data against your codes, I will also be provided with the opportunity to annotate my data with your data, so I can annotate my local LEP spend data with your LEP sports engagement data; [probably a bad example… need something more convincing?!]) Although uptake of the reconciliation API (and the associated possibility of providing annotation services) is still a minority interest, there are some signs of interest in it (for example, Using Google Refine and taxonomic databases (EOL, NCBI, uBio, WORMS) to clean messy data; note that data published on the Kasabi platform also exposes a Google Refine reconciliation service endpoint.) In my opinion, there are potentially significant benefits to be had by publishing reconciliation service endpoints with associated annotation services if a culture of use grows up around this protocol.
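To show the reconcile-then-annotate idea in miniature, here’s a toy sketch. It does not implement the Google Refine reconciliation protocol itself; it just matches names against a lookup table and then pulls in published values keyed by the matched identifier. The identifiers, names and values are all made up.

```python
# Made-up canonical identifiers, keyed by a normalised authority name.
CANONICAL = {"durham": "ID-001", "leeds": "ID-002"}
# Made-up published data, keyed by those identifiers.
PUBLISHED = {"ID-001": {"sports_engagement": 0.41},
             "ID-002": {"sports_engagement": 0.38}}

def normalise(name):
    """Crude matching key: lower-case, trimmed, without a trailing 'council'."""
    key = " ".join(name.lower().split())
    if key.endswith(" council"):
        key = key[: -len(" council")]
    return key

def reconcile(name):
    """Return the canonical identifier for a name, or None if unmatched."""
    return CANONICAL.get(normalise(name))

def annotate(record, name_field="authority"):
    """Attach the identifier, plus any published values linked to it."""
    ident = reconcile(record[name_field])
    return {**record, "id": ident, **PUBLISHED.get(ident, {})}

mine = annotate({"authority": " Leeds Council", "spend": 500})
```

A real reconciliation service would return ranked candidate matches rather than a single exact hit, but the payoff is the same: once my record carries your identifier, your data can be attached to it without any guesswork.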

Not covered: as part of this review, I have not covered applications such as Microsoft Excel or Tableau Desktop (the latter being a Windows only data visualisation environment that is growing in popularity). Instead, I have tried to focus on applications that are freely available either via the web or on a cross-platform basis. There is also a new kid on the block – – but it’s still early days for this tool…

Getting on the CASE – DCMS Culture and Sport Evidence Programme Data

As a warm-up exercise for a day’s consultancy with Fast Dawn Communications on a PR related project looking at engagement around the DCMS Culture and Sport Evidence (CASE) research project (#CASEprog), I’ve been pottering around the project web pages having a look at what’s there…

A variety of data/evidence sources are available:

  • The CASE database – over 8,500 research studies in an online, searchable database
  • DCMS Longitudinal Data Library – documents all longitudinal surveys containing questions relevant to the DCMS sectors (culture, media and sport)
  • CASE Local Profiles and Insights datasets – the essential guide for mapping your local culture and sport assets
  • CASE Local Asset Mapping toolkit – all the key culture and sport data, local and historical, in one place
  • Drivers, Impacts and Value research – breakthrough research and evidence on engagement in culture and sport

The CASE database provides a collection of publications presumably related to the whole sports and culture area. Raising awareness of this collection through HE services such as the Research Information Network may be appropriate. Although licensing is likely to preclude this, there may be opportunities for text analysis or citation analysis of the collection as a basis for improving discovery services within it.

The longitudinal data library identifies items registered with the ESDS Longitudinal website that contain questions ‘relevant to DCMS sectors’. (Does ESDS have anything to do with

The CASE local profiles and insights datasets are a collection of Excel spreadsheet files containing raw and summary data at typically local authority level, often also aggregated to Regional level. A couple of Excel spreadsheet based toolkits aggregate data from multiple separate sources to provide canned report views at local or Local Enterprise Partnership level. There are also links to related Sport England activities, including a market segmentation report. If other communities can be encouraged to pivot around the market segments identified in that report, it could provide a lead in to making more use of CASE data that also relate to those segments?

I had a quick play in R with some of the data grabbed from one of the spreadsheets relating to Tourism, and whilst I needed to do a little cleaning, it wasn’t too bad… To try out the data, I used a treemap. So for example, limiting the data to only records that had values for 2004 and 2008, and further limiting the display to only include records where the change over that period was within the range +/-50%, we can look at how things changed within a region over that period.

CASE visitor attractions by region - 2008 vs 2004 w/ max +-50% change over that period

(Area corresponds to the 2008 figures, colour is the percentage change from 2004 to 2008.)
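The filtering step described above can be sketched roughly as follows. This is a minimal Python illustration rather than the R code I actually used; the field names (`region`, `name`, `2004`, `2008`) and the sample records are made up for the example and are not taken from the CASE spreadsheets. Each surviving row carries a `size` (the 2008 figure, for the treemap area) and a `colour` (the percentage change, for the shading).

```python
def percent_change(old, new):
    """Percentage change from old to new, relative to old."""
    return 100.0 * (new - old) / old


def treemap_rows(records, start="2004", end="2008", max_change=50.0):
    """Keep records with values for both years and a change within +/- max_change."""
    rows = []
    for rec in records:
        old, new = rec.get(start), rec.get(end)
        if old is None or new is None or old == 0:
            continue  # no usable figure for one of the years
        change = percent_change(old, new)
        if abs(change) <= max_change:
            rows.append({"region": rec["region"], "name": rec["name"],
                         "size": new, "colour": change})
    return rows


# Made-up sample records, for illustration only (not CASE data)
sample = [
    {"region": "North", "name": "Attraction A", "2004": 1000, "2008": 1200},
    {"region": "North", "name": "Attraction B", "2004": 500, "2008": 1500},
    {"region": "South", "name": "Attraction C", "2004": 800, "2008": None},
    {"region": "South", "name": "Attraction D", "2004": 400, "2008": 300},
]

rows = treemap_rows(sample)
for row in rows:
    print(row)
```

The resulting rows could then be handed to whatever treemap routine is to hand, grouped by region.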

A tiny tweak to the code lets us look at the breakdown across region by attraction type:

CASE example treemap - too many conditions - illustrative only

Another tweak and we can look within a region:

Illustration only

Other widgets are available…


(The biggest problem is labeling. It may be possible to get round this when creating interactive visualisations, but for static/print images, labeling small segments is likely to be an issue…)

The CASE Local Asset Mapping toolkit provides information relating to asset mapping. This includes metadata field recommendations for asset inventories that amongst other things identify asset classes for arts, culture and sports based facilities. This may be of interest to open data modelers in the and communities.

The Drivers, Impacts and Value research seems to be based around a searchable collection of research on culture and sport engagement (with one of the clunkiest search interfaces I have seen for a long time…) and a model of engagement that requires permission (and training?) to use. There is potentially scope for work developing infographics based on scenarios run through this model, or interactive tools that support visual analysis using the model, although care would need to be taken to ensure that it cannot be used to generate results that fall outside the model’s bounds/limits/sensible operating regime.

First Thoughts On A Content Strategy for Data

Whilst I was looking round the CASE minisite, I also started trying to think about how datasets are currently publicised. It seems to me that the notion of “content strategy” is in the air at the moment, as folk start to realise that content shared is content that at least stands a chance of being ‘consumed’ (or at least, repeated…), in contrast to the typically-just-not-true publish-it-and-they-will-come attitude that still appears to drive a lot of web publishing. (Alternatively, a cynic might say that folk very well know no-one will come, but that where content may not have the most positive sentiment associated with it, simply pushing it onto a long forgotten and rarely visited part of a website allows you to claim that the communication duties have been discharged…)

This got me wondering what a “content strategy for data” might be…? Paraphrasing Carl Haggerty (#UKGC12 – beyond the bullet points), we might term this “getting data to people and not people to reports”, swapping “data” for “content” and “reports” for “websites” in his original phrasing (although I still get a little twitchy about who the data junkies who might actually want the raw data are..!;-) As well as data, we might also want to promote the availability of one or more means of analysis, such as providing

Tools such as the Local Culture and Heritage Profile Tool or the Local Sport Profiles Tool take the approach of publishing data wrapped in a standard report generating template, using a macro-laden Microsoft Excel spreadsheet to allow users to reproduce area based reports for the areas that interest them.

However, the queries one can ask/views one can generate are all canned, which is to say that the tools do little more than provide a single container for delivering fixed reports over predetermined views of the data. Rather than generating dozens of separate reports, one for each locale, and then having to help the user find the document relevant to them, the toolkit essentially provides a single spreadsheet that can be used to generate a report for a specific area on the fly. If a user then wants to make use of the actual area-related data themselves, they need to copy and paste it and take it somewhere else, and in so doing run the risk of breaking the provenance chain. Users are also prevented from running their own queries or analysis over the data as a whole.

If we want to deliver the opportunity to run analyses over data to potential data users, we need to make the data available in a way that allows it to be reused by third party tools. That means as well as publishing the data, the data needs to be in a form that can be readily used by a third party and may even be published with “API wrappers” that pull it into third party tools directly. (So for example, we might publish the data in whatever form, along with stub routines that load it into R dataframes, or Python/NumPy data structures etc.) Releasing data in a spreadsheet partially supports this ideal, because at least we get hold of the data in a regular and electronic form. However, to run our own analysis on the data, we may need to copy it and take it elsewhere. In contrast, if data is made available via a database*, we can more easily maintain an audit trail of what we did to it, starting from the query we used to get hold of it.

* We may, to a limited extent, be able to treat a spreadsheet as a database if we can directly address data via cell ranges within a specified sheet of a specific spreadsheet, and then perform whatever transformation and analysis operations we require on it. So for example, it’s possible to use Google Spreadsheets as a database in just this way via URL based addresses and the Google Visualisation API.
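As a rough sketch of what such a “stub routine” might look like, the Python below builds a query URL against a Google Spreadsheet and turns CSV text into a list of records, each tagged with the query URL it came from so that the audit trail starts with the query. The function names and the `_source` field are made up for this example, and the spreadsheet key is a placeholder. The URL pattern is the form I understand the Google Visualisation API query endpoint to take for spreadsheets; treat it as an assumption and check it against the current documentation. Nothing is fetched here: the sample CSV text simply stands in for what a query might return.

```python
import csv
import io
from urllib.parse import urlencode


def query_url(key, query, sheet=None):
    """Build a URL that asks a spreadsheet for the rows matching a query, as CSV."""
    params = {"tqx": "out:csv", "tq": query}
    if sheet is not None:
        params["sheet"] = sheet
    # Assumed endpoint pattern; check against the current API documentation
    return "https://docs.google.com/spreadsheets/d/%s/gviz/tq?%s" % (key, urlencode(params))


def load_records(csv_text, source):
    """Parse CSV text into a list of dicts, noting where each record came from."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [dict(row, _source=source) for row in reader]


# Placeholder key and made-up query, for illustration only
url = query_url("SPREADSHEET_KEY", "select A, B where C > 100", sheet="Sheet1")

# Stand-in for the CSV text that fetching the URL might return
sample_csv = "region,visits\nNorth,1200\nSouth,300\n"
records = load_records(sample_csv, source=url)
print(url)
print(records[0])
```

Keeping the URL with the records means anyone picking up the data later can see exactly which query produced it.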

The danger of people reusing your data via local copies is that if you update your data, the copy they are using will become outdated. In this respect, it probably makes sense to maintain a watching brief over data repositories where your data may appear, or place your data there yourself and then maintain it. If you also add records that point to your data in the various data catalogues that exist, you also need to make sure you maintain those records appropriately. (This might mean adding a new record whenever you update a dataset, and marking the old one as deprecated/replaced by…)

As an example of maintaining an audit trail having acquired a data set, data cleaning tools such as Google Refine and Stanford Data Wrangler use the notion of “applied operations” to log the operations carried out on an original, raw dataset in order to produce the cleaned version. Logging this sequence of steps plays an important part in the data quality management process. It also opens up the possibility of “data-intermediaries” who may take original datasets and then publish cleaned, standardised versions (along with the data cleaning/transformation history) which can be used directly by third parties. This gets away from the case of third parties cleaning the data independently of each other, and maybe as a result ending up working on different data sets. (Of course, we may want third parties to independently clean the data and then compare the results using techniques similar to the fault tolerant/safety engineering N-modular redundancy or N-version programming methodologies; N-version data cleansing, perhaps?!)
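To give a feel for the “applied operations” idea, here is a toy Python sketch of a cleaning log: every transformation is applied through a single method that also records what was done, so the cleaned rows and the history that produced them travel together. This is an illustration of the general approach only, not the format Google Refine or Data Wrangler actually use; the `CleaningLog` class and the sample rows are made up for the example.

```python
import json


class CleaningLog:
    """Apply column-wise cleaning steps while recording each one in order."""

    def __init__(self, rows):
        self.raw = [dict(r) for r in rows]    # untouched copy of the original data
        self.rows = [dict(r) for r in rows]   # working copy that gets cleaned
        self.history = []                     # ordered list of applied operations

    def apply(self, description, column, func):
        """Apply func to every value in column and log the step."""
        for row in self.rows:
            row[column] = func(row[column])
        self.history.append({"op": description, "column": column})
        return self

    def export_history(self):
        """Serialise the history so it can be published alongside the data."""
        return json.dumps(self.history, indent=2)


# Made-up raw rows, for illustration only
raw = [{"region": " north west ", "visits": "1,200"},
       {"region": "South East", "visits": "950"}]

log = CleaningLog(raw)
log.apply("trim whitespace", "region", str.strip)
log.apply("title case", "region", str.title)
log.apply("remove thousands separators and convert to int", "visits",
          lambda v: int(v.replace(",", "")))

print(log.rows)
print(log.export_history())
```

A data intermediary could publish the exported history next to the cleaned file, so third parties can see (and in principle replay) how the cleaned version was derived from the raw one.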

Another example of audit trail generation is provided by documentation toolkits such as Sweave or, which are capable of printing out raw statistical programming code within a report, along with the results of executing that self-same code. So for example, a Sweave document might include R code for processing a particular dataset, along with the outputs achieved by running that code (which might include the results of a particular statistical test, a statistical chart, etc etc).

In terms of delivering a content strategy, there may also be a requirement to raise awareness of the data in those communities who may have an interest in it. (This shifts the burden away from expecting potential users to somehow magically find your data to one in which it’s the publisher’s responsibility to find potentially interested audiences and share the data in an appropriate form there.) Such a move away from a broadcasting model to one based more on locally targeting pre-qualified audiences acknowledges the fact that most people won’t be at all interested in doing anything with your data anyway;-)

So where might be an appropriate place to find a willing audience? If you’re known to release data in the area, then your own site is an obvious candidate. If you think that your audience is in the habit of using open data sets, then making content available via public data catalogues may also be useful. (Catalogues often cross-promote linked data sets, so it may be that adding your data to a catalogue gets it mentioned when people are viewing other, related data sets.) Examples in the public sector include data catalogues such as, and Platforms such as Kasabi provide free hosting of openly licensed data along with tools that make it easy to access that data in a variety of ways. Media organisations such as the Guardian regularly publish (often tidied) versions of datasets in a “datastore” area. The Guardian republish public data using Google Spreadsheets and Google Fusion Tables, both of which provide URLs to machine readable, online versions of the data. Getting a mention on the Guardian datablog, along with a subset of your data in the Guardian datastore, may help to raise awareness around the dataset and may even lead to a news story breaking out of the datastore and getting into editorial pages. In terms of getting data to data journalists, is a hub for developing communities around journalistic investigations, often with a data feel to them. Tracking issue related terms on Scraperwiki can help identify people and/or projects who are having a hard time getting data as data in that area. If you see your own data being scraped to there, it maybe shows it’s not being published in as usable or useful a form as you thought…