As a warm-up exercise for a day’s consultancy with Fast Dawn Communications on a PR-related project looking at engagement around the DCMS Cultural and Sport Evidence (CASE) research project (#CASEprog), I’ve been pottering around the project web pages having a look at what’s there…
- The CASE database – over 8,500 research studies in an online, searchable database
- DCMS Longitudinal Data Library – documents all longitudinal surveys containing questions relevant to the DCMS sectors (culture, media and sport)
- CASE Local Profiles and Insights datasets – all the key culture and sport data, local and historical, in one place
- CASE Local Asset Mapping toolkit – the essential guide for mapping your local culture and sport assets
- Drivers, Impacts and Value research – breakthrough research and evidence on engagement in culture and sport
The CASE database provides a collection of publications presumably related to the whole sports and culture area. Raising awareness of this collection through HE services such as the Research Information Network may be appropriate. There may also be opportunities for text analysis or citation analysis of the collection as a basis for improving discovery services within it, although licensing is likely to preclude this.
The longitudinal data library identifies items registered with the ESDS Longitudinal website that contain questions ‘relevant to DCMS sectors’. (Does ESDS have anything to do with data-archive.ac.uk?)
The CASE local profiles and insights datasets are a collection of Excel spreadsheet files containing raw and summary data, typically at local authority level and often also aggregated to regional level. A couple of Excel spreadsheet based toolkits aggregate data from multiple separate sources to provide canned report views at local or Local Enterprise Partnership level. There are also links to related Sport England activities, including a market segmentation report. If other communities can be encouraged to pivot around the market segments identified in that report, it could provide a lead-in to making more use of the CASE data that also relate to those segments.
I had a quick play in R with some of the data grabbed from one of the spreadsheets relating to Tourism, and whilst I needed to do a little cleaning, it wasn’t too bad… To try out the data, I used a treemap. So for example, limiting the data to only records that had values for 2004 and 2008, and further limiting the display to only include records where the change over that period was within the range +/-50%, we can look at how things changed within a region over that period.
(Area corresponds to the 2008 figures, colour is the percentage change from 2004 to 2008.)
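The original fiddling was done in R, but the filtering step described above is easy enough to sketch in Python too. The record structure and field names below are invented for illustration; the real spreadsheet columns will differ.

```python
# Sketch of the filtering step: keep only records that have values for
# both 2004 and 2008, and where the percentage change over that period
# lies within +/-50%. Field names here are made up for illustration.

def filter_records(records, lower=-50, upper=50):
    """Return records with both years present and change within bounds,
    annotated with the computed percentage change."""
    kept = []
    for rec in records:
        v2004, v2008 = rec.get("2004"), rec.get("2008")
        if v2004 in (None, 0) or v2008 is None:
            continue
        change = 100.0 * (v2008 - v2004) / v2004
        if lower <= change <= upper:
            kept.append({**rec, "pct_change": change})
    return kept

sample = [
    {"attraction": "Museum A", "2004": 100, "2008": 120},  # +20%, kept
    {"attraction": "Castle B", "2004": 50, "2008": 120},   # +140%, dropped
    {"attraction": "Garden C", "2004": 80, "2008": None},  # incomplete, dropped
]

for rec in filter_records(sample):
    print(rec["attraction"], round(rec["pct_change"], 1))
```

In the treemap itself, the 2008 value would then set the area of each tile and `pct_change` the colour.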
A tiny tweak to the code lets us look at the breakdown across region by attraction type:
Another tweak and we can look within a region:
Other widgets are available…
(The biggest problem is labelling. It may be possible to get round this when creating interactive visualisations, but for static/print images, labelling small segments is likely to be an issue…)
The CASE Local Asset Mapping toolkit provides information relating to asset mapping. This includes metadata field recommendations for asset inventories that, amongst other things, identify asset classes for arts, culture and sports based facilities. This may be of interest to open data modellers in the linkeduniversities.org and data.ac.uk communities.
The Drivers, Impacts and Value research seems to be based around a searchable collection of research on culture and sport engagement (with one of the clunkiest search interfaces I have seen for a long time…) and a model of engagement that requires permission (and training?) to use. There is potentially scope for work developing infographics based on scenarios run through this model, or interactive tools that support visual analysis using the model, although care would need to be taken to ensure that it cannot be used to generate results that fall outside the model’s bounds/limits/sensible operating regime.
First Thoughts On A Content Strategy for Data
Whilst I was looking round the CASE minisite, I also started trying to think about how datasets are currently publicised. It seems to me that the notion of “content strategy” is currently in the air, as folk start to realise that content shared is content that at least stands a chance of being ‘consumed’ (or at least, repeated…), in contrast to the typically-just-not-true publish-it-and-they-will-come attitude that still appears to drive a lot of web publishing. (Alternatively, a cynic might say that folk know very well that no-one will come, but that where content may not have the most positive sentiment associated with it, simply pushing it onto a long forgotten and rarely visited part of a website allows you to claim that the communication duties have been discharged…)
This got me wondering what a “content strategy for data” might be…? Paraphrasing Carl Haggerty (#UKGC12 – beyond the bullet points), we might term this “getting ~~content~~ data to people and not people to ~~websites~~ reports” (although I still get a little twitchy about who the data junkies who might actually want the raw data are..! ;-) As well as the data itself, we might also want to promote the availability of one or more means of analysing it.
Tools such as the Local Culture and Heritage Profile Tool or the Local Sport Profiles Tool take the approach of publishing data wrapped in a standard report-generating template (a macro-laden Microsoft Excel spreadsheet) that allows users to reproduce area-based reports for the areas that interest them.
However, the queries one can ask/views one can generate are all canned, which is to say that the tools do little more than provide a single container for delivering fixed reports over predetermined views of the data. Rather than generating dozens of separate reports, one for each locale, and then having to help the user find the document relevant to them, the toolkit essentially provides a single spreadsheet that can be used to generate a report for a specific area on the fly. If a user then wants to make use of the actual area-related data themselves, they need to copy and paste it and take it somewhere else, and in so doing run the risk of breaking the provenance chain. Users are also prevented from running their own queries or analysis over the data as a whole.
If we want to deliver the opportunity to run analyses over data to potential data users, we need to make the data available in a way that allows it to be reused by third party tools. That means that as well as publishing the data, the data needs to be in a form that can be readily used by a third party, and may even be published with “API wrappers” that pull it into third party tools directly. (So for example, we might publish the data in whatever form, along with stub routines that load it into R dataframes, or Python/NumPy data structures, etc.) Releasing data in a spreadsheet partially supports this ideal, because at least we get hold of the data in a regular and electronic form. However, to run our own analysis on the data, we may need to copy it and take it elsewhere. In contrast, if data is made available via a database*, we can more easily maintain an audit trail of what we did to it, starting from the query we used to get hold of it.
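By way of illustration, here’s a minimal Python “stub routine” of the sort that might accompany a published CSV dataset. The URL is purely hypothetical; the point is that a publisher ships the few lines that get the data straight into a usable structure.

```python
import csv
import io
import urllib.request

# Hypothetical location of a published CSV version of a dataset.
DATA_URL = "http://example.org/case/tourism.csv"

def load_case_data(url=DATA_URL):
    """Fetch a CSV file from a URL and return it as a list of dicts,
    one per row, keyed by column header."""
    with urllib.request.urlopen(url) as resp:
        text = resp.read().decode("utf-8")
    return list(csv.DictReader(io.StringIO(text)))

# The same parsing logic works on any CSV text, e.g. a local sample:
sample_csv = "area,visits\nIsle of Wight,1000\nCornwall,2500\n"
rows = list(csv.DictReader(io.StringIO(sample_csv)))
print(rows[0]["area"], rows[0]["visits"])
```

A user who runs this gets the data in an analysable form straight away, with the provenance (the source URL) captured in the code itself.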
* We may, to a limited extent, be able to treat a spreadsheet as a database if we can directly address data via cell ranges within a specified sheet of a specific spreadsheet, and then perform whatever transformation and analysis operations we require on it. So for example, it’s possible to use Google Spreadsheets as a database in just this way via URL based addresses and the Google Visualisation API.
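As a sketch of the spreadsheet-as-database idea, the snippet below builds a Google Visualisation API query URL that runs a SQL-like query against a public Google Spreadsheet. The spreadsheet key is a placeholder, and the exact endpoint form has varied over time, so treat this as illustrative rather than definitive.

```python
from urllib.parse import urlencode

# Placeholder key; substitute the key of a real, public spreadsheet.
SHEET_KEY = "SPREADSHEET_KEY_GOES_HERE"

def gviz_query_url(key, query, out="csv"):
    """Build a URL that runs a Google Visualisation API query-language
    query (e.g. 'select A, B where B > 100') against a public Google
    Spreadsheet, returning results in the requested format."""
    params = {"tq": query, "tqx": "out:%s" % out}
    return ("https://docs.google.com/spreadsheets/d/%s/gviz/tq?%s"
            % (key, urlencode(params)))

url = gviz_query_url(SHEET_KEY, "select A, B where B > 100")
print(url)
```

Because the query travels in the URL, the “what did we ask for” part of the audit trail comes for free.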
The danger of people reusing your data via local copies is that if you update your data, the copy they are using will become outdated. In this respect, it probably makes sense to maintain a watching brief over data repositories where your data may appear, or place your data there yourself and then maintain it. If you also add records that point to your data in the various data catalogues that exist, you also need to make sure you maintain those records appropriately. (This might mean adding a new record whenever you update a dataset, and marking the old one as deprecated/replaced by…)
As an example of maintaining an audit trail having acquired a data set, data cleaning tools such as Google Refine and Stanford Data Wrangler use the notion of “applied operations” to log the operations carried out on an original, raw dataset in order to produce the cleaned version. Logging this sequence of steps plays an important part in the data quality management process. It also opens up the possibility of “data-intermediaries” who may take original datasets and then publish cleaned, standardised versions (along with the data cleaning/transformation history) which can be used directly by third parties. This gets away from the case of third parties cleaning the data independently of each other, and maybe as a result ending up working on different data sets. (Of course, we may want third parties to independently clean the data and then compare the results using techniques similar to the fault tolerant/safety engineering N-modular redundancy or N-version programming methodologies; N-version data cleansing, perhaps?!)
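A toy sketch of that “applied operations” idea: each cleaning step is applied to the data and recorded in an ordered log, so the transformation history can be published alongside the cleaned dataset. The class and operation names below are invented; real tools like Google Refine serialise their operation histories in much richer (replayable) formats.

```python
# Minimal illustration of logging "applied operations" during data
# cleaning, so a cleaned dataset can ship with its transformation history.

class CleaningSession:
    def __init__(self, data):
        self.data = data
        self.history = []  # ordered log of the operations applied

    def apply(self, name, fn):
        """Apply a named operation to every record and log it."""
        self.data = [fn(rec) for rec in self.data]
        self.history.append(name)
        return self

raw = [{"area": " isle of wight "}, {"area": "CORNWALL"}]
session = (CleaningSession(raw)
           .apply("strip whitespace", lambda r: {"area": r["area"].strip()})
           .apply("title case", lambda r: {"area": r["area"].title()}))

print(session.data)
print(session.history)
```

A third party receiving `session.data` plus `session.history` can see exactly how the published version differs from the raw original.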
Another example of audit trail generation is provided by documentation toolkits such as Sweave or dexy.it, which are capable of printing out raw statistical programming code within a report, along with the results of executing that self-same code. So for example, a Sweave document might include R code for processing a particular dataset, along with the outputs achieved by running that code (which might include the results of a particular statistical test, a statistical chart, etc etc).
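For reference, a minimal Sweave fragment might look something like the following, with an R code chunk whose output is woven into the surrounding LaTeX at compile time (the dataset and variable names are invented):

```latex
\section{Tourism figures}
The summary below is generated from the raw data when the report is built:
<<tourism-summary, echo=TRUE>>=
visits <- read.csv("tourism.csv")
summary(visits$count)
@
The mean number of visits was \Sexpr{round(mean(visits$count), 1)}.
```

Because the numbers in the report are computed from the data at build time, the report cannot silently drift out of step with the dataset it describes.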
In terms of delivering a content strategy, there may also be a requirement to raise awareness of the data in those communities who may have an interest in it. (This shifts the burden away from expecting potential users to somehow magically find your data to one in which it’s the publisher’s responsibility to find potentially interested audiences and share the data in an appropriate form there.) Such a move away from a broadcasting model to one based more on locally targeting pre-qualified audiences acknowledges the fact that most people won’t be at all interested in doing anything with your data anyway;-)
So where might an appropriate place to find a willing audience be? If you’re known to release data in the area, then your own site is an obvious candidate. If you think that your audience is in the habit of using open data sets, then making content available via public data catalogues may also be useful. (Catalogues often cross-promote linked data sets, so it may be that adding your data to a catalogue gets it mentioned when people are viewing other, related data sets.) Examples in the public sector include data catalogues such as data.gov.uk and thedatahub.org. Platforms such as Kasabi provide free hosting of openly licensed data along with tools that make it easy to access that data in a variety of ways.

Media organisations such as the Guardian regularly publish (often tidied) versions of datasets in a “datastore” area. The Guardian republish public data using Google Spreadsheets and Google Fusion Tables, both of which provide URLs to machine readable, online versions of the data. Getting a mention on the Guardian datablog, along with a subset of your data in the Guardian datastore, may help to raise awareness around the dataset, and may even lead to a news story breaking out of the datastore and getting into the editorial pages. In terms of getting data to data journalists, helpmeinvestigate.com is a hub for developing communities around journalistic investigations, often with a data feel to them. Tracking issue related terms on Scraperwiki can help identify people and/or projects who are having a hard time getting data as data in that area. And if you see your own data being scraped there, maybe it shows it’s not being published in as usable or useful a form as you thought…