
What is a Data Journalist?

Job ads come and go, so I thought I’d capture the main elements of this one from the BBC:

Data Journalist – Role Purpose and Aims

You will be required to humanize statistics; to make sense of potentially complicated data and present it in a user friendly format.

You will be asked to focus on a range of data-rich subjects relating to long-term projects or high impact daily news stories, in line with Global News editorial priorities. These could include the following: reports on development, global poverty, Afghanistan casualties, internet connectivity around the world, or global recession figures.

Key Knowledge and Experience

You will be a self-starter, brimming with story ideas who is comfortable with statistics and has the expertise to delve beneath the headline figures and explain the fuller picture.
You will have significant journalistic experience gained ideally from working in an international news environment.
The successful candidate should have experience (or at least awareness) of visualising data and visualisation tools.
You should be excited about developing the way that data is interpreted and presented on the web, from heavy number crunching, to dynamic mapping and interactive graphics. You must have demonstrated knowledge of statistics, statistical analysis, with a good understanding of the range and breadth of data sources in the UK and internationally, broad experience with data sources, data mining and have good visual and statistical skills.
You must have a Computer-assisted reporting background or similar, including a good knowledge of the relevant software (including Excel and mapping software).
Experience of producing and developing data driven web content at a senior level within time and budget constraints.
A thorough understanding of the BBC World Service’s aims and the part this initiative plays in meeting them.
Excellent communication and interpersonal skills with ability to present information concisely to a broad audience including journalists and commissioning editors. You should be able to demonstrate the ability to influence, negotiate with and persuade others.
Central to the role is an ability to analyse complicated information and present it to our readers in a way that is visually engaging and easy to understand, using a range of web-based technologies, for which you should have familiarity with database interfaces and web presentation layers, as well as database concepting, content entry and management.
You will be expected to have your own original ideas on how to best apply data driven journalism, either to complement stories when appropriate or to identify potential original stories while interpreting data, researching and investigating them, crunching the data yourself and working with designers and developers on creating content that will engage our audience, and provide them with useful, personalised information.
You will work in a multimedia way, when appropriate, liaising with online but also radio and TV and specialist output producers as required, from a range of language services. You will help lead the development of computer-assisted reporting skills in the wider news specials team.


To identify a range of significant statistics and data-driven stories that can be developed and result in finished graphics that can be used across BBC News websites.
To take a lead role in devising compelling ways of telling data-driven stories on the web, working with specials team designers, developers and journalists as required. Also liaising with radio and TV and specialist output producers across World Service as required, providing a joined-up multi-platform proposition for the audience.
To work with senior stakeholders and programme teams and be an internal expert who can interpret and concisely explain the significance of data to others, and related good practice.
Support the College of Journalism – to help devise training sessions in order to spread the knowledge and best practices of data driven journalism
To help inform the future development by FM&T of tools which enable data-driven stories to be told more quickly and effectively on the web.
To keep abreast of developments in data driven journalism, and pursue collaboration with other teams working on the same area, both within the BBC and also with external organisations.
Willingness to work across a range of online production skills in a flexible manner to BBC standards and values.
Using their own initiative, the successful candidate will be required to build relationships with major sources of content (e.g. BBC networks, programmes, external interest groups) and promote opportunities for cross-media production.


Editorial Judgement
Makes the right editorial and policy decisions based upon a clear understanding of the BBC’s distinctive news agenda, the requirements of news and current affairs coverage.

Planning & Organising
Is able to think ahead in order to establish an effective and appropriate course of action for self and others. Prioritises and plans activities taking into account all the relevant issues and factors such as deadlines and resources requirements.

Analytical Thinking
Able to simplify complex problems, process projects into component parts, explore and evaluate them systematically.

Creative Thinking
Is able to transform creative ideas/impulses into practical reality. Can look at existing situations and problems in novel ways and come up with creative solutions.

Can maintain personal effectiveness by managing own emotions in the face of pressure, set backs or when dealing with provocative situations. Can demonstrate an approach to work that is characterised by commitment, motivation and energy.

The ability to get one’s message understood clearly by adopting a range of styles, tools and techniques appropriate to the audience and the nature of the information.

Influencing and Persuading
Ability to present sound and well reasoned arguments to convince others. Can draw from a range of strategies to persuade people in a way that results in agreement or behaviour change.

Managing Relationships and Team Working
Able to build and maintain effective working relationships with a range of people. Highly effective team player.

Orange Visual Visualisation Tool

A few days ago, I came across a drag’n’drop, wire it together visualisation and data analysis tool called Orange.

Here’s a quick run through of some of the basics (at least, a run through of the first few things I tried to do with the tool…)

First off, we need some data. Orange likes TSV (tab separated values) rather than CSV, so I grabbed some TSV from one of the Guardian Datastore spreadsheets on Google Docs (use “Save as Text” to get the tab separated value format…)

TSV from google docs
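If the data you want to load only comes as CSV, a few lines of Python will turn it into TSV. This is a minimal sketch, assuming a plain CSV file with a single header row; the function name and the sample rows are made up for illustration and are not taken from the Guardian spreadsheet.

```python
import csv
import io

def csv_to_tsv(csv_text):
    """Convert CSV text to tab separated text, keeping quoted fields intact."""
    reader = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for row in reader:
        writer.writerow(row)
    return out.getvalue()

# Hypothetical sample rows (note the comma inside the quoted area name)
sample = 'area,value\n"Leeds, City of",12.5\n'
tsv = csv_to_tsv(sample)
print(tsv)
```

Using the csv module rather than a blunt find and replace on commas means that commas inside quoted fields survive the conversion.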

Orange is a canvas based visual programming environment, in which functional blocks are added to the canvas and certain parameters set within the block. Here’s how we get some data into Orange from a TSV file:

Orange viz tool - import data

The File icon is giving me a warning (no dependent variable) but I’m not sure why…? I’m sure Orange has managed to detect labels and quantities correctly from other files I’ve tried?

Anyway… we can inspect the data by looking at it in a data table widget – just wire one in:

Orange viz tool - data table

The table is sortable by column, and the Report button can be used to save a version of the table. Looking at the data table, we see it has identified columns with missing entries. We can clean these from our data set using the Preprocessing widget:

Orange - data cleaning
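For comparison, here’s roughly what that sort of cleaning step might look like outside Orange. It’s only a sketch: I’m assuming that “cleaning” means dropping any row with a missing entry, and that a missing entry shows up as an empty cell or a “?”; the function name and sample data are hypothetical.

```python
import csv
import io

def drop_incomplete_rows(tsv_text, missing=("", "?")):
    """Return the header row and only those data rows that have no missing cells."""
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    header, body = rows[0], rows[1:]
    width = len(header)
    complete = [
        row for row in body
        if len(row) == width and all(cell.strip() not in missing for cell in row)
    ]
    return header, complete

# Hypothetical data: bob has no value in the comments column
sample_tsv = "name\tcomments\nalice\t3\nbob\t\ncarol\t7\n"
header, complete = drop_incomplete_rows(sample_tsv)
print(header, complete)
```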

If we now wire the output of the Preprocessing widget into the Scatterplot widget, we can generate a variety of scatterplots:

Orange scatterplot

If you want to save a copy of the chart, it’s easy enough to do so. (I can’t get colour palettes to work on my Mac, so I’m stuck with greyscale displays. Also, the blob sizing doesn’t seem very responsive…)

Orange - save a scatterplot

The Report tool allows us to create a report from various bits of the dataflow, including adding information from several widgets to either separate report pages or the same report page.

Orange - report generator

Saving a Report saves all the report pages to a navigable set of HTML pages that resemble the Orange Report viewer.

Here are a couple of other things we can do with the data, this time using a data set that isn’t throwing the “dependent variable missing” error, in particular the distribution of comments in a small Friendfeed network…

So for example, here’s how the number of comments made by members of the network is distributed:

Orange - distribution of values

Alternatively, we may look at the distribution in a more “statistical” way:

Orange - simple distributions
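The same two views, a count of how often each value occurs and a handful of summary statistics, are easy enough to sketch in Python. The comment counts below are invented for illustration (they are not the Friendfeed data), and the particular statistics shown are my guess at a sensible summary rather than a copy of what the Orange widgets report.

```python
from collections import Counter
import statistics

# Hypothetical comment counts, one per member of the network
comments = [0, 0, 1, 1, 1, 2, 3, 5, 8, 13]

# How many members made each number of comments
histogram = Counter(comments)
for value in sorted(histogram):
    print(value, "#" * histogram[value])

# A more "statistical" summary of the same values
summary = {
    "min": min(comments),
    "median": statistics.median(comments),
    "mean": statistics.mean(comments),
    "max": max(comments),
}
print(summary)
```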

(Remember, we can generate these reports interactively, and then add them to a growing report.)

The survey plot gives us a macroscopic bird’s-eye view over the whole of the data set:

Orange - survey plot

Okay, that’s enough for starters – hopefully you get the idea: wire stuff together and generate visual reports… So why not go and download Orange now?!;-)

There is a whole range of clustering tools, too, which look like they could be interesting…

And I think the platform is extensible, which means there’s a way of adding your own widgets (written in Python, maybe..?)

A First – Not Very Successful – Look at Using Ordnance Survey OpenLayers…

What’s the easiest way of creating a thematic map, that shows regions coloured according to some sort of measure?

Yesterday, I saw a tweet go by from @datastore about Carbon emissions in every local authority in the UK, detailing those emissions for a list of local authorities (whatever they are… I’ll come on to that in a moment…)

Carbon emissions data table

The dataset seemed like a good opportunity to try out the Ordnance Survey’s OpenLayers API, which I’d noticed allows you to make use of OS boundary data and maps in order to create thematic maps for UK data:

OS thematic map demo

So – what’s involved? The first thing was to try and get codes for the authority areas. The ONS make various codes available (download here) and the OpenSpace website also makes available a list of boundary codes that it can render (download here), so I had a poke through the various code files and realised that the Guardian emissions data seemed to identify regions that were coded in different ways? So I stalled there and looked at another part of the jigsaw…

…specifically, OpenLayers. I tried the demo – Creating thematic boundaries – got it to work for the sample data, then tried to put in some other administrative codes to see if I could display boundaries for other area types… hmmm…. No joy:-) A bit of digging identified this bit of code:

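boundaryLayer = new OpenSpace.Layer.Boundary("Boundaries", {
    // load boundaries for the part of the map currently in view
    strategies: [new OpenSpace.Strategy.BBOX()],
    // type of boundary to draw; EUR is European Region
    area_code: ["EUR"],
    // styling rules, defined elsewhere in the demo
    styleMap: styleMap
});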

which appears to identify the type of area codes/boundary layer required, in this case “EUR”. So two questions came to mind:

1) does this mean we can’t plot layers that have mixed region types? For example, the emissions data seemed to list names from different authority/administrative area types?
2) what layer types are available?

A bit of digging on the OpenLayers site turned up something relevant on the Technical FAQ page:

OS OpenSpace boundary DESCRIPTION, (AREA_CODE) and feature count (number of boundary areas of this type)

County, (CTY) 27
County Electoral Division, (CED) 1739
District, (DIS) 201
District Ward, (DIW) 4585
European Region, (EUR) 11
Greater London Authority, (GLA) 1
Greater London Authority Assembly Constituency, (LAC) 14
London Borough, (LBO) 33
London Borough Ward, (LBW) 649
Metropolitan District, (MTD) 36
Metropolitan District Ward, (MTW) 815
Scottish Parliament Electoral Region, (SPE) 8
Scottish Parliament Constituency, (SPC) 73
Unitary Authority, (UTA) 110
Unitary Authority Electoral Division, (UTE) 1334
Unitary Authority Ward, (UTW) 1464
Welsh Assembly Electoral Region, (WAE) 5
Welsh Assembly Constituency, (WAC) 40
Westminster Constituency, (WMC) 632

so presumably all those code types can be used as area_code arguments in place of “EUR”?

Back to one of the other pieces of the jigsaw: the OpenLayers API is called using official area codes, but the emissions data just provides the names of areas. So somehow I need to map from the area names to an area code. This requires: a) some sort of lookup table to map from name to code; b) a way of doing that.
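As a rough sketch of what a) and b) amount to: build a dictionary from (tidied up) area name to code, then look each emissions row up in it and keep a list of the names that fail to match. All of the names, codes and values below are invented, and the function names are mine rather than anything from the OpenSpace or Guardian files.

```python
def normalise(name):
    """Lower-case a name and collapse whitespace so trivial differences still match."""
    return " ".join(name.lower().split())

def attach_codes(data_rows, code_rows):
    """data_rows: (area name, value) pairs; code_rows: (area name, area code) pairs."""
    lookup = {normalise(name): code for name, code in code_rows}
    matched, unmatched = [], []
    for name, value in data_rows:
        code = lookup.get(normalise(name))
        if code is None:
            unmatched.append(name)  # keep a note of what we could not reconcile
        else:
            matched.append((code, name, value))
    return matched, unmatched

# Hypothetical lookup table and data table
codes = [("Exampleshire", "X001"), ("New  Sampleton", "X002")]
emissions = [("exampleshire", 5.1), ("New Sampleton", 7.3), ("Old Sampleton", 2.0)]
matched, unmatched = attach_codes(emissions, codes)
print(matched)
print(unmatched)
```

Only exact matches (after tidying) are accepted, so anything doubtful ends up in the unmatched list for a human to look at rather than being silently given a best guess.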

Normally, I’d be tempted to use a Google Fusion table to try to join the emissions table with the list of boundary area names/codes supported by OpenSpace, but then I recalled a post by Paul Bradshaw on using the Google spreadsheets VLOOKUP formula (to create a thematic map, as it happens: Playing with heat-mapping UK data on OpenHeatMap), so thought I’d give that a go… no joy:-( For some reason, the VLOOKUP just kept giving rubbish. Maybe it was happy with really crappy best matches, even if I tried to force exact matches. It almost felt like the formula was working on a differently ordered column to the one it should have been; I have no idea. So I gave up trying to make sense of it (something to return to another day maybe; I was in the wrong mood for trying to make sense of it, and now I am just downright suspicious of the VLOOKUP function!)…

…and instead thought I’d give the openheatmap application Paul had mentioned a go…After a few false starts (I thought I’d be able to just throw a spreadsheet at it and then specify the data columns I wanted to bind to the visualisation, (c.f. Semantic reports), but it turns out you have to specify particular column names, value for the data value, and one of the specified locator labels) I managed to upload some of the data as uk_council data (quite a lot of it was thrown away) and get some sort of map out:

openheatmap demo

You’ll notice there are a few blank areas where council names couldn’t be identified.

So what do we learn? Firstly, the first time you try out a new recipe, it rarely, if ever, “just works”. When you know what you’re doing, and “all you have to do is…”, all is a little word. When you don’t know what you’re doing, all is a realm of infinite possibilities of things to try that may or may not work…

We also learn that I’m not really that much closer to getting my thematic map out… but I do have a clearer list of things I need to learn more about. Firstly, a few hello world examples using the various different OpenLayer layers. Secondly, a better understanding of the differences between the various authority types, and what sorts of mapping there might be between them. Thirdly, I need to find a more reliable way of reconciling data from two tables and in particular looking up area codes from area names (in two ways: code and area type from area name; code from area name and area type). VLOOKUP didn’t work for me this time, so I need to find out if that was my problem, or an “issue”.
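To make the two sorts of lookup concrete, here’s a minimal sketch using a pair of dictionaries. The area type labels reuse OpenSpace AREA_CODE values such as DIS and UTA, but the names and codes are invented, and I’m assuming that a name which appears under more than one area type should be treated as ambiguous rather than guessed at.

```python
from collections import defaultdict

# Hypothetical (name, area type, code) rows
rows = [
    ("Exampleton", "DIS", "D-01"),
    ("Exampleton", "UTA", "U-07"),
    ("Sampleford", "MTD", "M-03"),
]

by_name = defaultdict(list)   # name -> [(area type, code), ...]
by_name_and_type = {}         # (name, area type) -> code
for name, area_type, code in rows:
    by_name[name].append((area_type, code))
    by_name_and_type[(name, area_type)] = code

def code_and_type(name):
    """Area type and code from the name alone; None if unknown or ambiguous."""
    candidates = by_name.get(name, [])
    return candidates[0] if len(candidates) == 1 else None

def code_for(name, area_type):
    """Code from the name plus the area type; None if there is no such pairing."""
    return by_name_and_type.get((name, area_type))

print(code_and_type("Sampleford"))
print(code_and_type("Exampleton"))
print(code_for("Exampleton", "UTA"))
```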

Something else that comes to mind is this: the datablog asks: “Can you do something with this data? Please post your visualisations and mash-ups on our Flickr group”. IF the data had included authority codes, I would have been more likely to persist in trying to get them mapped using OpenLayers. But my lack of understanding about how to get from names to codes meant I stumbled at this hurdle. There was too much friction in going from area name to OpenLayer boundary code. (I have no idea, for example, whether the area names relate to one administrative class, or several).

Although I don’t think the following is the case, I do think it is possible to imagine a scenario where the Guardian do have a table that includes the administrative codes as well as names for this data, or an environment/application/tool for rapidly and reliably generating such a table, and that they know this makes the data more valuable because it means they can easily map it, but others can’t. The lack of codes means that work needs to be done in order to create a compelling map from the data that may attract web traffic. If it was that easy to create the map, a “competitor” might make the map and get the traffic for no real effort. The idea I’m fumbling around here is that there is a spectrum of stuff around a data set that makes it more or less easy to create visualisations. In the current example, we have area name, area code, map. Given an area code, it’s presumably (?) easy enough to map using e.g. OpenLayers because the codes are unambiguous. Given an area name, if we can reliably look up the area code, it’s presumably easy to generate the map from the name via the code. Now, if we want to give the appearance of publishing the data, but make it hard for people to use, we can make it hard for them to map from names to codes, either by messing around with the names, or using a mix of names that map on to area codes of different types. So we can taint the data to make it hard for folk to use easily whilst still being seen to publish the data.

Now I’m not saying the Guardian do this, but a couple of things follow: firstly, obfuscating or tainting data can help you prevent casual use of it by others whilst at the same time ostensibly “opening it up” (it can also help you track the data; e.g. mapping agencies that put false artefacts in their maps to help reveal plagiarism); secondly, if you are casual with the way you publish data, you can make it hard for people to make effective use of that data. For a long time, I used to hassle folk into publishing RSS feeds. Some of them did… or at least thought they did. For as soon as I tried to use their feeds, they turned out to be broken. No-one had ever tried to consume them. Same with data. If you publish your data, try to do something with it. So for example, the emissions data is illustrated with a Many Eyes visualisation of it; it works as data in at least that sense. From the place names, it would be easy enough to vaguely place a marker on a map showing a data value roughly in the area of each council. But for identifying exact administrative areas – the data is lacking.

It might seem as if I’m angling against the current advice to councils and government departments to just “get their data out there” even if it is a bit scrappy, but I’m not… What I am saying (I think) is that folk should just try to get their data out, but also:

– have a go at trying to use it for something themselves, or at least just demo a way of using it. This can have a payoff in at least three ways I can think of: a) it may help you spot a problem with the way you published the data that you can easily fix, or at least post a caveat about; b) it helps you develop your own data handling skills; c) you might find that you can encourage reuse of the data you have just published in your own institution…

– be open to folk coming to you with suggestions for ways in which you might be able to make the data more valuable/easier to use for them for little effort on your own part, and that in turn may help you publish future data releases in an ever more useful way.

Can you see where this is going? Towards Linked Data… ;-)

PS just by the by, a related post (that just happens to mention OUseful.info:-) on the Telegraph blogs about Open data ‘rights’ require responsibility from the Government led me to a quick chat with Telegraph data hack @coneee and the realisation that the Telegraph too are starting to explore the release of data via Google spreadsheets. So for example, a post on Councils spending millions on website redesigns as job cuts loom also links to the source data here: Data: Council spending on websites.

Doodlings Around the Data Driven Journalism Round Table Event Hashtag Community

…got that?;-) Or in other words, this is a post looking at some visualisations of the #ddj hashtag community…

ddj - PDF export

A couple of days ago, I was fortunate enough to attend a Data Driven Journalism round table (sort of!) event organised by the European Journalism Centre. I’ll try to post some notes on it, err, sometime; but for now, here’s a quick collection of some of the various things I’ve tinkered with around hashtag communities, using #ddj as an example, as a “note to self” that I really should pull these together somehow, or at least automate some of the bits of workflow; I also need to move away from Twitter’s Basic Auth (which gets switched off this week, I think?) to oAuth*…

*At least Twitter is offering a single access token which “is ideal for applications migrating to OAuth with single-user use cases”. Rather than having to request key and secret values in an oAuth handshake, you can just grab them from the Twitter Application Control Panel. Which means I should be able to just plug them into a handful of Tweepy commands:
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(key, secret)
api = tweepy.API(auth)

So, first up is the hashtag community view showing how an individual is connected to the set of people using a particular hashtag (and which at the moment only works for as long as Twitter search turns up users around the hashtag…)

Hashtag community

Having got a Twapperkeeper API key, I guess I really should patch this to allow a user to explore their hashtag community based on a Twapperkeeper archive (like the #ddj archive)…

One thing the hashtag community page doesn’t do is explore any of the structure within the network… For that, I currently have a multi-step process:

1) get a list of people using a hashtag (recipe: Who’s Tweeting our hashtag?) using this Yahoo Pipe and output the data as CSV using a URL with the pattern (replace ddj with your desired hashtag):

2) take the column of twitter homepage URLs of people tweeting the hashtag and replace http://twitter.com/ with nothing to give the list of Twitter usernames using the hashtag.

3) run a script that finds the twitter numerical IDs of users given their Twitter usernames listed in a simple text file:

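import tweepy

# Basic Auth credentials: Twitter username and password
auth = tweepy.BasicAuthHandler("????", "????")
api = tweepy.API(auth)

# hashtaggers.txt holds one Twitter username per line
f = open('hashtaggers.txt')

for uid in f:
  uid = uid.strip()
  # look up the user and print their numerical Twitter ID
  print(api.get_user(screen_name=uid).id)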


Note: this a) is Python; b) uses tweepy; c) needs changing to use oAuth.

4) run another script (gist here – note this code is far from optimal or even well behaved; it needs refactoring, and also tweaking so it plays nice with the Twitter API) that gets lists of the friends and the followers of each hashtagger, from their Twitter id and writes these to CSV files in a variety of ways. In particular, for each list (friends and followers), generate three files where the edges represent: i) links between an individual and other hashtaggers (“inner” edges within the community); ii) links between a hashtagger and non-hashtaggers (“outside” edges from the community); iii) links between a hashtagger and both hashtaggers and non-hashtaggers;

5) an edit of the friends/followers CSV files to put them into an appropriate format for viewing in tools such as Gephi or Graphviz. For Gephi, edges can be defined using comma separated pairs of nodes (e.g. userID, followerID) with a bit of syntactic sugar; we can also use the list of Twitter usernames/user IDs to provide nice labels for the community’s “inner” nodes:

nodedef> name VARCHAR,label VARCHAR


edgedef> user VARCHAR,friend VARCHAR


Having got a gdf formatted file, we can load it straight in to Gephi:

ddj hashtag community

In 3d view we can get something like this:

ddj - hashtag community

Node size is proportional to number of hashtag users following a hashtagger. Colour (heat) is proportional to number of hashtaggers followed by a hashtagger. Arrows go from a hashtagger to the people they are following. So a large node size means lots of other hashtaggers follow that person. A hot/red colour shows that person is following lots of the other hashtaggers. Gephi allows you to run various statistics over the graph to allow you to analyse the network properties of the community, but I’m not going to cover that just now! (Search this blog using “gephi” for some examples in other contexts.)
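The two measures behind the size and colour are just the in-degree and out-degree of each node within the community. Here’s a minimal sketch of how they could be counted from the “inner” edge list outside Gephi; the usernames are made up, and each edge is assumed to point from a follower to the person they follow, the same way as the arrows.

```python
from collections import Counter

def degree_counts(edges):
    """edges: (follower, followed) pairs, pointing the same way as the arrows."""
    out_degree = Counter(src for src, _ in edges)  # how many hashtaggers they follow (colour)
    in_degree = Counter(dst for _, dst in edges)   # how many hashtaggers follow them (size)
    return in_degree, out_degree

# Hypothetical inner edges
edges = [("alice", "bob"), ("carol", "bob"), ("bob", "alice")]
in_degree, out_degree = degree_counts(edges)
print(in_degree.most_common())
```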

Use of different layouts, colour/text size etc etc can be used to modify this view in Gephi, which can also generate high quality PDF renderings of 2D versions of the graph:

ddj - PDF export

(If you want to produce your own visualisation in Gephi, I popped the gdf file for the network here.)

If we export the gexf representation of the graph, we can use Alexis Jacomy’s Flex gexfWalker component to provide an interactive explorer for the graph:

gexfwalker - full network

Clicking on a node allows you to explore who a particular hashtagger is following:

gexfwalker - node explorer

Remember, arrows go from a hashtagger to the person they are following. Note that the above visualisation allows you to see reciprocal links. The colourings are specified via the gexf file, which itself had its properties set in the Preview Settings dialogue in Gephi:

Gephi - preview settings

As well as looking at the internal structure of the hashtag community, we can look at all the friends and/or followers of the hashtaggers. The graph for this is rather large (70k nodes and 90k edges), so after a lazyweb request to @gephi I found I had to increase the memory allocation for the Gephi app (the app stalled on loading the graph when it had run out of memory…).

If we load in the graph of “outer friends” (that is the people the hashtaggers follow who are not hashtaggers) and filter the graph to only show nodes who have more than 5 or so incoming edges we can see which Twitter users are followed by large numbers of the hashtaggers, but who have not been using the hashtag themselves. Because the friends/followers lists return Twitter numerical IDs, we have to do a look up on Twitter to find out the actual Twitter usernames. This is something I need to automate, maybe using the Twitter lookup API call that lets authenticated users look up the details of up to 100 Twitter users at a time given their numerical IDs. (If anyone wants this data from my snapshot of 23/8/10, drop me a line….)
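The filtering and batching parts of that are easy to sketch, even without touching the Twitter API. In the sketch below the function names and IDs are hypothetical; the threshold of 5 and the batch size of 100 come from the description above, and the actual lookup call is deliberately left out.

```python
from collections import Counter

def popular_outer_friends(outer_edges, min_followers=5):
    """IDs of non-hashtaggers followed by more than min_followers hashtaggers."""
    counts = Counter(friend for _, friend in outer_edges)
    return sorted(uid for uid, n in counts.items() if n > min_followers)

def batches(ids, size=100):
    """Yield successive chunks of at most `size` IDs, ready for a bulk lookup."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]

# Hypothetical outer edges: six hashtaggers follow user 900, one follows 901
outer_edges = [(str(u), "900") for u in range(6)] + [("1", "901")]
popular = popular_outer_friends(outer_edges)
print(popular)
print([len(chunk) for chunk in batches(list(range(250)))])
```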

Okay, that’s more than enough for now… As I’ve shared the gdf and gexf files for the #ddj internal hashtaggers network, if any more graphically talented than I individuals would like to see what sort of views they can come up with, either using Gephi or any other tool that accepts those data formats, I’d love to see them:-)

PS It also strikes me that having got the list of hashtaggers, I need to package up this with a set of tools that would let you:
– create a Twitter list around a set of hashtaggers (and maybe then use that to generate a custom search engine over the hashtaggers’ linked to homepages);
– find other hashtags being used by the hashtaggers (that is, hashtags they may be using in arbitrary tweets).

(See also Topic and Event based Twittering – Who’s in Your Community?)

So Where Do the Numbers in Government Reports Come From?

Last week, the COI (Central Office of Information) released a report on the “websites run by ministerial and non-ministerial government departments”, detailing visitor numbers, costs, satisfaction levels and so on, in accordance with COI standards on guidance on website reporting (Reporting on progress: Central Government websites 2009-10).

As well as the print/PDF summary report (Reporting on progress: Central Government websites 2009-10 (Summary) [PDF, 33 pages, 942KB]), a dataset was also released as a CSV document (Reporting on progress: Central Government websites 2009-10 (Data) [CSV, 66KB]).

The summary report is full of summary tables on particular topics, for example:

COI web report 2009-10 table 1

COI web report 2009-10 table 2

COI website report 2009-10 table 3

Whilst I firmly believe it is a Good Thing that the COI published the data alongside the report, there is still a disconnect between the two. The report is publishing fragments of the released dataset as information in the form of tables relating to particular reporting categories – reported website costs, or usage, for example – but there is no direct link back to the CSV data table.

Looking at the CSV data, we see a range of columns relating to costs, such as:

COI website report - costs column headings


COI website report costs

There are also columns headed SEO/SIO, and HEO, for example, that may or may not relate to costs? (To see all the headings, see the CSV doc on Google spreadsheets).

But how does the released data relate to the summary reported data? It seems to me that there is a huge “hence” between the released CSV data and the summary report. Relating the two appears to be left as an exercise for the reader (or maybe for the data journalist looking to hold the report writers to account?).

The recently published New Public Sector Transparency Board and Public Data Transparency Principles, albeit in draft form, has little to say on this matter either. The principles appear to be focussed on the way in which the data is released, in a context free way, (where by “context” I mean any of the uses to which government may be putting the data).

For data to be useful as an exercise in transparency, it seems to me that when government releases reports, or when government, NGOs, lobbyists or the media make claims using summary figures based on, or derived from, government data, the transparency arises from an audit trail that allows us to see where those numbers came from.

So for example, around the COI website report, the Guardian reported that “[t]he report showed uktradeinvest.gov.uk cost £11.78 per visit, while businesslink.gov.uk cost £2.15.” (Up to 75% of government websites face closure). But how was that number arrived at?

The publication of data means that report writers should be able to link to views over original government data sets that show their working. The publication of data allows summary claims to be justified, and contributes to transparency by allowing others to see the means by which those claims were arrived at and the assumptions that went in to making the summary claim in the first place. (By summary claim, I mean things like “non-staff costs were X”, or the “cost per visit was Y”.)
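To illustrate what “showing your working” might mean for a cost per visit figure, here’s a minimal sketch. The formula (total reported cost divided by number of visits) is my assumption about how such a number could be derived, not a description of how the COI or the Guardian actually calculated it, and the figures are invented rather than taken from the dataset.

```python
def cost_per_visit(total_cost, visits):
    """Assumed derivation: total reported cost divided by visits, to the nearest penny."""
    if visits <= 0:
        raise ValueError("visits must be positive")
    return round(total_cost / visits, 2)

# Hypothetical figures, not taken from the COI data
example = cost_per_visit(250000.0, 100000)
print(example)  # prints 2.5
```

Even something this small makes the assumptions visible: which costs were counted in the total, and which measure of visits was used as the divisor.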

[Just an aside on summary claims made by, or “discovered” by, the media. Transparency in terms of being able to justify the calculation from raw data is important because people often use the fact that a number was reported in the media as evidence that the number is in some sense meaningful and legitimately derived. (“According to the Guardian/Times/Telegraph/FT, etc etc etc”.) To a certain extent, data journalists need to behave like academic researchers in being able to justify their claims to others.]

In Using CSV Docs As a Database, I show how by putting the CSV data into a Google spreadsheet, we can generate several different views over the data using the Google Query language. For example, here’s a summary of the satisfaction levels, and here’s one over some of the costs:

COI website report - costs
select A,B,EL,EN,EP,ER,ET

[For more background to using Google spreadsheets as a database, see: Using Google Spreadsheets as a Database with the Google Visualisation API Query Language (via an API) and Using Google Spreadsheets Like a Database – The QUERY Formula (within Google spreadsheets itself)]

We can even have a go at summing the costs:

COI summed website costs
select A,B,EL+EN+EP+ER+ET
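The column letters in those queries are just spreadsheet column positions, so the same selection can be run over the raw CSV rows. The sketch below is one way of doing that locally; the function names and the sample row are hypothetical, and it treats blank cells as zero, which may not be how the spreadsheet query handles them.

```python
def column_index(letters):
    """Convert a spreadsheet column label (A, B, ..., Z, AA, ...) to a 0-based index."""
    index = 0
    for ch in letters.upper():
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index - 1

def select_sum(row, label_cols, sum_cols):
    """Mimic a query like `select A,B,C+D+E` for one row: keep some cells, sum others."""
    kept = [row[column_index(c)] for c in label_cols]
    total = sum(float(row[column_index(c)] or 0) for c in sum_cols)
    return kept + [total]

# Hypothetical five column row with a blank cost cell in column D
row = ["Dept", "site", "10", "", "2.5"]
print(select_sum(row, ["A", "B"], ["C", "D", "E"]))
print(column_index("EL"))
```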

In short, it seems to me that releasing the data as data is a good start, but the promise for transparency lies in being able to share queries over data sets that make clear the origins of data-derived information that we are provided with, such as the total non-staff costs of website development, or the average cost per visit to the blah, blah website.

So what would I like to see? Well, for each of the tables in the COI website report, a link to a query over the co-released CSV dataset that generates the summary table “live” from the original dataset would be a start… ;-)
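A shareable link of that sort is really just a query string bolted on to a spreadsheet URL. Here’s a minimal sketch of building one in Python. The URL pattern is an assumption based on the form Google Sheets query URLs currently take, not necessarily the one in use when this post was written, and SPREADSHEET_KEY is a placeholder rather than the key of the COI spreadsheet.

```python
from urllib.parse import urlencode

def query_url(spreadsheet_key, query, out="csv"):
    """Build a bookmarkable URL that runs `query` over a public spreadsheet."""
    # Assumed URL pattern for the query endpoint
    base = "https://docs.google.com/spreadsheets/d/%s/gviz/tq" % spreadsheet_key
    return base + "?" + urlencode({"tq": query, "tqx": "out:" + out})

url = query_url("SPREADSHEET_KEY", "select A,B,EL+EN+EP+ER+ET")
print(url)
```

The point is that the query travels with the link, so anyone following it sees both the result and the recipe that produced it.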

PS In the meantime, to the extent that journalists and the media hold government to account, is there maybe a need for data journalysts (journalist+analyst portmanteau) to recreate the queries used to generate summary tables in government reports to find out exactly how they were derived from released data sets? Finding queries over the COI dataset that generate the tables published in the summary report is left as an exercise for the reader… ;-) If you manage to generate queries in a bookmarkable form (e.g. using the COI website data explorer; see also this for more hints), please feel free to share the links in the comments below :-)

Guardian Datastore MPs’ Expenses Spreadsheet as a Database

Continuing my exploration of what is and isn’t acceptable around the edges of doing stuff with other people’s data(?!), the Guardian datastore have just published a Google spreadsheet containing partial details of MPs’ expenses data over the period July-December 2009 (MPs’ expenses: every claim from July to December 2009):

thanks to the work of Guardian developer Daniel Vydra and his team, we’ve managed to scrape the entire lot out of the Commons website for you as a downloadable spreadsheet. You cannot get this anywhere else.

In sharing the data, the Guardian folks have opted to share the spreadsheet via a link that includes an authorisation token. Which means that if you try to view the spreadsheet just using the spreadsheet key, you won’t be allowed to see it; (you also need to be logged in to a Google account to view the data, both as a spreadsheet, and in order to interrogate it via the visualisation API). Which is to say, the Guardian datastore folks are taking what steps they can to make the data public, whilst retaining some control over it (because they have invested resource in collecting the data in the form they’re re-presenting it, and reasonably want to make a return from it…)

But in sharing the link that includes the token on a public website, we can see the key – and hence use it to access the data in the spreadsheet, and do more with it… which may be seen as providing a value add service over the data, or unreasonably freeloading off the back of the Guardian’s data scraping efforts…

So, just pasting the spreadsheet key and authorisation token into the cut down Guardian datastore explorer script I used in Using CSV Docs As a Database generates an explorer for the expenses data.

So, for example, we can run a report to group expenses by category and MP:

MP expenses explorer

Or how about claims over 5000 pounds (also viewing the information as an HTML table, for example).

Remember, on the datastore explorer page, you can click on column headings to order the data according to that column.

Here’s another example – selecting A,sum(E) where E>0, grouping by A, ordering by sum(E) asc, and viewing the result as a column chart:

Datastore exploration

We can also (now!) limit the number of results returned, e.g. to show the 10 MPs with lowest claims to date (the datastore blog post explains why the data is incomplete and to be treated warily).

Limiting results in datastore explorer

Changing the asc order to desc in the above query gives possibly a more interesting result, the MPs who have the largest claims to date (presumably because they have got round to filing their claims!;-)

Datastore exploring
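For anyone who finds the query wording a bit opaque, here is a hedged sketch of what the grouped, ordered and limited report amounts to, written as plain Python over a handful of made-up claims. The names and amounts are invented for illustration; column A is assumed to hold the MP’s name and column E the claim amount.

```python
from collections import defaultdict


def total_by_name(claims, limit=10, descending=False):
    """Rough equivalent of: select A,sum(E) where E>0 group by A order by sum(E) limit N."""
    totals = defaultdict(float)
    for name, amount in claims:
        if amount > 0:  # the "where E>0" filter
            totals[name] += amount  # the "group by A" with sum(E)
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=descending)
    return ranked[:limit]


# Made-up claims, not taken from the Guardian spreadsheet.
claims = [
    ("MP One", 120.0),
    ("MP Two", 5400.0),
    ("MP One", 80.0),
    ("MP Three", 0.0),
    ("MP Three", 15.5),
    ("MP Two", -20.0),
]
lowest = total_by_name(claims, limit=2)
highest = total_by_name(claims, limit=1, descending=True)
print(lowest)
print(highest)
```

Switching `descending` plays the same role as changing asc to desc in the explorer.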

Okay – enough for now; the reason I’m posting this is in part to ask the question: is this an unfair use of the Guardian datastore data, does it detract from the work they put in that lets them claim “You cannot get this anywhere else”, and does it impact on the returns they might expect to gain?

Should they/could they try to assert some sort of database collection right over the collection/curation and re-presentation of the data that is otherwise publicly available that would (nominally!) prevent me from using this data? Does the publication of the data using the shared link with the authorisation token imply some sort of license with which that data is made available? E.g. by accepting the link by clicking on it, because it is a shared link rather than a public link, could the Datastore attach some sort of tacit click-wrap license conditions over the data that I accept when I accept the shared data by clicking through the shared link? (Does the/can the sharing come with conditions attached?)

PS It seems there was a minor “issue” with the settings of the spreadsheet, a result of recent changes to the Google sharing setup. Spreadsheets should now be fully viewable… But as I mention in a comment below, I think there are still interesting questions to be considered around the extent to which publishers of “public” data can get a return on that data?

Programming, Not Coding: Infoskills for Journalists (and Librarians..?!;-)

A recent post on the journalism.co.uk site asks: How much computer science does a journalist really need?, commenting that whilst coding skills may undoubtedly be useful for journalists, knowing what can be achieved easily in a computational way may be more important, because there are techies around who can do the coding for you… (For another take on this, see Charles Arthur’s If I had one piece of advice to a journalist starting out now, it would be: learn to code, and this response to it: Learning to Think Like A Programmer.)

Picking up on a few thoughts that came to mind around a presentation I gave yesterday (Web Lego And Format Glue, aka Get Yer Mashup On), here’s a slightly different take on it, based on the idea that programming doesn’t necessarily mean writing arcane computer code.

Note that a lot of what follows I’d apply to librarians as well as journalists… (So for example, see Infoskills for the Future – If You Can’t Handle Information, Get Out of the Library for infoskills that I think librarians as information professionals should at least be aware of (and these probably apply to journalists too…); Data Handling in Action is also relevant – it describes some of the practical skills involved in taking a “dirty” data set and getting it into a form where it can be easily visualised…)

So here we go…. An idea I’ve started working on recently as an explanatory device is the notion of feed oriented programming. I appreciate that this probably already sounds scary geeky, but it’s a made up phrase and I’ll try to explain it. A feed is something like an RSS feed. (If you don’t know what an RSS feed is, this isn’t a remedial class, okay… go and find out… this old post should get you started: We Ignore RSS at OUr Peril.)

Typically, an RSS feed will contain a set of items, such as a set of blog posts, news stories, or bookmarks. Each item has the same structure in terms of how it is represented on a computer. Typically, the content of the feed will change over time – a blog feed represents the most recent posts on a blog, for example. That is, the publisher of the feed makes sure that the feed has current content in it – as a “programmer” you don’t really need to do anything to get the fresh content in the feed – you just need to look at the feed to see if there is new content in it – or let your feed reader show you that new content when it arrives. The feed is accessed via a web address/URL.
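You don’t need to see this to use a feed, but for the curious, here is a minimal sketch of what “each item has the same structure” means. The sample feed is made up, and the code simply pulls the title and link out of every item using Python’s standard XML library.

```python
import xml.etree.ElementTree as ET

# A made-up, stripped down RSS 2.0 feed for illustration.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example blog</title>
    <link>http://example.com/</link>
    <item>
      <title>First post</title>
      <link>http://example.com/first</link>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.com/second</link>
    </item>
  </channel>
</rss>"""


def read_items(rss_text):
    """Return the title and link of every item in the feed."""
    root = ET.fromstring(rss_text)
    return [
        {
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
        }
        for item in root.iter("item")
    ]


items = read_items(SAMPLE_RSS)
for entry in items:
    print(entry["title"], entry["link"])
```

Because every item follows the same pattern, the same few lines work whether the feed holds two posts or two hundred.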

Some RSS feeds might not change over time. On WriteToReply, where we republish public documents, it’s possible to get hold of an RSS version of the document. The document RSS feed doesn’t change (because the content of the document doesn’t change), although the content of the comment feeds might change as people comment on the document.

A nice thing about RSS is that lots of things publish it, and lots of things can import it. Importing an RSS feed into an application such as Google Reader simply means pasting the web address of the feed into a “Subscribe to feed” box in the application. Although it can do other things too, like supporting search, Google Reader is primarily a display application. It takes in RSS feeds and presents them to the user in an easy to read way. Google Maps and Google Earth are other display applications – they display geographical information in an appropriate way, a way that we can readily make sense of.

So what do we learn from this? Information can be represented in a standard way, such as RSS, and displayed in a visual way by an application that accepts RSS as an input. By subscribing to an RSS feed, which we identify by a fixed/permanent web address, we can get new content into our reader without doing anything. Subscribing is just a matter of copying a web address from the publisher’s web site and pasting it into our reader application. Cut and paste. No coding required. The feed publisher is responsible for putting new content into the feed, and our reader application is responsible for pulling that new content out and displaying it to us.

One of the tools I use a lot is Yahoo Pipes. Yahoo Pipes can take in RSS feeds and do stuff with them; it can take in a list of blog posts as an RSS feed and filter them so that you only get posts out that do – or don’t – mention cats, for example. And the output is in the form of an RSS feed…

What this means is that if we have a Yahoo pipe that does something we want in computational terms to an RSS feed, all we have to do is give it the web address of the feed we want to process, and then grab the RSS output web address from the Pipe. Cut and paste the original feed web address into the Pipe’s input. Cut and paste the web address of the RSS output from the pipe into our feed reader. No coding required.
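To make the “feed in, feed out” idea concrete, here is a rough sketch of the sort of filtering step described above, written in Python rather than built in Pipes. It is not how Yahoo Pipes works internally; the function name and the sample feed are made up for illustration.

```python
import xml.etree.ElementTree as ET


def filter_feed(rss_text, keyword, keep_matches=True):
    """Return an RSS document containing only items that do (or don't) mention keyword."""
    root = ET.fromstring(rss_text)
    channel = root.find("channel")
    for item in list(channel.findall("item")):
        text = " ".join(
            item.findtext(tag, default="") for tag in ("title", "description")
        )
        mentions = keyword.lower() in text.lower()
        if mentions != keep_matches:
            channel.remove(item)
    return ET.tostring(root, encoding="unicode")


# A made-up feed for illustration.
SAMPLE_FEED = """<rss version="2.0"><channel><title>Example</title>
<item><title>Cats on the web</title><description>All about cats</description></item>
<item><title>Budget figures</title><description>Spending data</description></item>
<item><title>More news</title><description>A cat was involved</description></item>
</channel></rss>"""

cat_feed = filter_feed(SAMPLE_FEED, "cat")
no_cat_feed = filter_feed(SAMPLE_FEED, "cat", keep_matches=False)
print(cat_feed)
```

The important thing is that what comes out is still a feed, so it can be handed on to a reader, or to another processing step, just like the original.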

Another couple of tools I use are Google Spreadsheets (a spreadsheet application) and Many Eyes Wikified (an interactive visualisation application). If you publish a spreadsheet on Google docs, you can get a web address/URL that points to a CSV (comma separated values) version of the selected sheet. A CSV file is a simple text file where each spreadsheet row is represented as a row in the CSV structured text file; and the value of each cell along a row in the original spreadsheet is represented as the same value in the text file, separated from the previous value by a comma. But you don’t need to know that… All you do need to know is that you can think of it as a feed… With a web address… And in a particular format…
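If you do want to peek under the hood, this small sketch shows a few made-up spreadsheet rows written out as CSV text and read back again, using Python’s standard csv module. Note how a cell that itself contains a comma gets wrapped in quotes so the row still splits correctly.

```python
import csv
import io

# Made-up rows standing in for the cells of a spreadsheet.
rows = [
    ["Site", "Visits", "Cost"],
    ["site-one", "1200", "350.5"],
    ["site-two, beta", "40", "50"],
]

# Write the rows out as CSV text, one line per spreadsheet row.
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back gives the original rows again.
round_trip = list(csv.reader(io.StringIO(csv_text)))
```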

Going to the “Share” options in the spreadsheet, you can publish the sheet and generate a web address that points to a range of cells in the spreadsheet (eg: B1:D120) represented as a CSV file. If we now turn to Many Eyes Wikified, I can provide it with the web address of a CSV file and it will go and fetch the data for me. At the click of a button I can then generate an interactive visualisation of the data in the spreadsheet. Cut and paste the web address of the CSV version of the data in a spreadsheet that Google Docs will generate for me into Many Eyes Wikified, and I can then create an interactive visualisation using the spreadsheet at the click of a button. Cut and paste a URL/web address that is generated for me. No coding required.

As to where the data in the spreadsheet came from? Well maybe it came from somewhere else on the web, via a URL? Like this, maybe?

So the model I’m working towards with feed oriented programming is the idea that you can get the web address of a feed which a publisher will publish current content or data to, and paste that address in an application that will render, or display the content (e.g. Google Reader, Many Eyes Wikified) or process/transform that data on your behalf.

So for example, Google spreadsheets can transform an HTML table to CSV for you; (Google spreadsheets also lets you do all the normal spreadsheet things, so you could generate one sheet from another sheet using whatever spreadsheet formulae you like, and publish the CSV representation of that second sheet). Or in Yahoo Pipes, you can process an RSS feed by filtering its contents so that you only see posts that mention cats.

Yahoo Pipes offers other sorts of transformation as well. For example, in my original Wikipedia scraping demo, I took the feed from a Google spreadsheet and passed it to Yahoo Pipes where I geocoded city names and let pipes generate a map friendly feed (known as a KML feed) for me. Copying the web address of the KML feed output from the pipe and pasting it into Google Maps means I can generate an embeddable Google map view of data originally pulled from Wikipedia:
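For a sense of what a map friendly KML feed looks like, here is a minimal sketch that turns a short list of already geocoded places into a KML document. It is not the pipe itself: the geocoding step is assumed to have happened elsewhere, the function name is made up, and the coordinates are approximate values for illustration.

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"


def to_kml(places):
    """Build a KML document from (name, latitude, longitude) tuples."""
    ET.register_namespace("", KML_NS)
    kml = ET.Element("{%s}kml" % KML_NS)
    doc = ET.SubElement(kml, "{%s}Document" % KML_NS)
    for name, lat, lon in places:
        placemark = ET.SubElement(doc, "{%s}Placemark" % KML_NS)
        ET.SubElement(placemark, "{%s}name" % KML_NS).text = name
        point = ET.SubElement(placemark, "{%s}Point" % KML_NS)
        # KML expects longitude first, then latitude.
        coords = ET.SubElement(point, "{%s}coordinates" % KML_NS)
        coords.text = "%s,%s" % (lon, lat)
    return ET.tostring(kml, encoding="unicode")


# Approximate coordinates, for illustration only.
places = [("London", 51.5074, -0.1278), ("Paris", 48.8566, 2.3522)]
kml_text = to_kml(places)
print(kml_text)
```

Publishing text like this at a web address is what lets a mapping application pick it up from nothing more than a pasted URL.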

Once you start to think of the world in this way:

– where the web contains data and information that is represented in various standard ways and made available via a unique and persistent web address,

– where web applications can accept data and content that is represented in a standard way given the web address of that data,

– where web applications can transform data represented at one web address in one particular way and republish it in another standard format at another web address,

– or where web applications can take data represented in a particular way from one web address and provide you with the tools to then visualise or display that data,

then the world is your toolbox. Have URL, will travel. All you need to know is which applications can import what format data, and how they can republish that data for you, whether in a different format, such as Google spreadsheets taking an HTML table from Wikipedia and publishing it as a CSV file, or as a visualisation/human friendly display (Many Eyes Wikified, Google Reader). And if you need to do “proper” programmer type things, then you might be able to do it using a spreadsheet formula or a Yahoo Pipe (no coding required…;-)

See also: The Journalist as Programmer: A Case Study of The New York Times Interactive News Technology Department [PDF]