OUseful.Info, the blog…

Trying to find useful things to do with emerging technologies in open education

Posts Tagged ‘ddj

Questioning Election Data to See if It Has a Story to Tell

with 9 comments

I know, I know, the local elections are old news now, but elections come round again and again, which means building up a set of case examples of what we might be able to do – data wise – around elections in the future could be handy…

So here’s one example of a data-related question we might ask (where in this case by data I mean “information available in: a) electronic form, that b) can be represented in a structured way): are the candidates standing in different seats local to that ward/electoral division?. By “local”, I mean – can they vote in that ward by virtue of having a home address that lays within that ward?

Here’s what the original data for my own local council (the Isle of Wight council, a unitary authority) looked like – a multi-page PDF document collating the Notice of polls for each electoral division (archive copy):

IW council - notice of poll

Although it’s a PDF, the document is reasonably nicely structured for scraping (I’ll do a post on this over the next week or two) – you can find a Scraperwiki scraper here. I pull out three sorts of data – information about the polling stations (the table at the bottom of the page), information about the signatories (of which, more in a later post…;-), and information about the candidates, including the electoral division in which they were standing (the “ward” column) and a home address for them, as shown here:

scraperwiki candidates

So what might we be able to do with this information? Does the home address take us anywhere interesting? Maybe. If we can easily look up the electoral division the home addresses fall in, we have a handful of news story search opportunities: 1) to what extent are candidates – and election winners – “local”? 2) do any of the parties appear to favour standing in/out of ward candidates? 3) if candidates are standing out of their home ward, why? If we complement the data with information about the number of votes cast for each candidate, might we be able to find any patterns suggestive of a beneficial or detrimental effect living within, or outside of, the electoral division a candidate is standing in, and so on.

In this post, I’ll describe a way of having a conversation with the data using OpenRefine and Google Fusion Tables as a way of starting to explore some the stories we may be able to tell with, and around, the data. (Bruce Mcphereson/Excel Liberation blog has also posted an Excel version of the methods described in the post: Mashing up electoral data. Thanks, Bruce:-)

Let’s get the data into OpenRefine so we can start to work it. Scraperwiki provides a CSV output format for each scraper table, so we can get a URL for it that we can then use to pull the data into OpenRefine:

scraperwiki CSV export

In OpenRefine, we can Create a New Project and then import the data directly:

openrefine import from URL

The data is in comma separated CSV format, so let’s specify that:

import as csv comma separated

We can then name and create the project and we’re ready to start…

…but start what? If we want to find out if a candidate lives in ward or out of ward, we either need to know whether their address is in ward or out of ward, or we need to find out which ward their address is in and then see if it is the same as the one they are standing in.

Now it just so happens (:-) that MySociety run a service called MapIt that lets you submit a postcode and it tells you a whole host of things about what administrative areas that postcode is in, including (in this case) the unitary authority electoral division.

mapit postcode lookup

And what’s more, MapIt also makes the data available in a format that’s data ready for OpenRefine to be able to read at a web address (aka a URL) that we can construct from a postcode:

mapit json

Here’s an example of just such a web address: http://mapit.mysociety.org/postcode/PO36%200JT

Can you see the postcode in there? http://mapit.mysociety.org/postcode/PO36%200JT

The %20 is a character encoding for a space. In this case, we can also use a +.

So – to get information about the electoral division an address lays in, we need to get the postcode, construct a URL to pull down corresponding data from MapIt, and then figure out some way to get the electoral division name out of the data. But one step at a time, eh?!;-)

Hmmm…I wonder if postcode areas necessarily fall within electoral divisions? I can imagine (though it may be incorrect to do so!) a situation where a division boundary falls within a postcode area, so we need to be suspicious about the result, or at least bear in mind that an address falling near a division boundary may be wrongly classified. (I guess if we plot postcodes on a map, we could look to see how close to the boundary line they are, because we already know how to plot boundary lines.

To grab the postcode, a quick skim of the addresses suggests that they are written in a standard way – the postcode always seems to appear at the end of the string preceded by a comma. We can use this information to extract the postcode, by splitting the address at each comma into an ordered list of chunks, then picking the last item in the list. Because the postcode might be preceded by a space character, it’s often convenient for us to strip() any white space surrounding it.

What we want to do then is to create a new, derived column based on the address:

Add derived column

And we do this by creating a list of comma separated chunks from the address, picking the last one (by counting backwards from the end of the list), and then stripping off any whitespace/space characters that surround it:

grab a postcode

Here’s the result…

postcodes...

Having got the postcode, we can now generate a URL from it and then pull down the data from each URL:

col from URL

When constructing the web address, we need to remember to encode the postcode by escaping it so as not to break the URL:

get data from URL

The throttle value slows down the rate at which OpenRefine loads in data from the URLs. If we set it to 500 milliseconds, it will load one page every half a second.

When it’s loaded in all the data, we get a new column, filled with data from the MapIt service…

lots of data

We now need to parse this data (which is in a JSON format) to pull out the electoral division. There’s a bit of jiggery pokery required to do this, and I couldn’t work it out myself at first, but Stack Overflow came to the rescue:

that's handy...

We need to tweak that expression slightly by first grabbing the areas data from the full set of MapIt data. Here’s the expression I used:

filter(('[' + (value.parseJson()['areas'].replace( /"[0-9]+":/,""))[1,-1] + ']' ).parseJson(), v, v['type']=='UTE' )[0]['name']

to create a new column containing the electoral division:

parse out the electroal division

Now we can create another column, this time based on the new Electoral Division column, that compares the value against the corresponding original “ward” column value (i.e. the electoral division the candidate was standing in) and prints a message saying whether they were standing in ward or out:

inward or out

If we collapse down the spare columns, we get a clearer picture:

collapse...

Like this:

summary data

If we generate a text facet on the In/Out column, and increase the number of rows displayed, we can filter the results to show just the candidates who stood in their local electoral division (or conversely, those who stood outside it):

facet on inout

We can also start to get investigative, and ask some more questions of the data. For example, we could apply a text facet on the party/desc column to let us filter the results even more…

inout facet filter

Hmmm… were most of the Labour Party candidates standing outside their home division (and hence unable to vote for themselves?!)

Hmm.. labour out

There aren’t too many parties represented across the Island elections (a text facet on the desc/party description column should reveal them all), so it wouldn’t be too hard to treat the data as a source, get paper and pen in hand, and write down the in/out counts for each party describing the extent to which they fielded candidates who lived in the electoral divisions they were standing in (and as such, could vote for themselves!) versus those who lived “outside”. This data could reasonably be displayed using a staggered bar chart (the data collection and plotting are left as an exercise for the reader [See Bruce Mcphereson's Mashing up electoral data post for a stacked bar chart view.];-) Another possible questioning line is how do the different electoral divisions fare in terms of in-vs-out resident candidates. If we pull in affluence/poverty data, might it tell us anything about the likelihood of candidates living in area, or even tell us something about the likely socio-economic standing of the candidates?

One more thing we could try to do is to geocode the postcode of the address of the each candidate rather more exactly. A blog post by Ordnance Survey blogger John Goodwin (@gothwin) shows how we might do this (note: copying the code from John’s post won’t necessarily work; WordPress has a tendency to replace single quotes with all manner of exotic punctuation marks that f**k things up when you copy and paste them into froms for use in other contexts). When we “Add column by fetching URLs”, we should use something along the lines of the following:

'http://beta.data.ordnancesurvey.co.uk/datasets/code-point-open/apis/search?output=json&query=' + escape(value,'url')

os postcode lookup

The data, as imported from the Ordnance Survey, looks something like this:

o:sdata

As is the way of national services, the Ordnance Survey returns a data format that is all well and good but isn’t the one that mortals use. Many of my geo-recipes rely on latitude and longitude co-ordinates, but the call to the Ordnance Survey API returns Eastings and Northings.

Fortunately, Paul Bradshaw had come across this problem before (How to: Convert Easting/Northing into Lat/Long for an Interactive Map) and bludgeoned(?!;-) Stuart harrison/@pezholio, ex- of Lichfield Council, now of the Open Data Institute, to produce a pop-up service that returns lat/long co-ordinates in exchange for a Northing/Easting pair.

The service relies on URLs of the form http://www.uk-postcodes.com/eastingnorthing.php?easting=EASTING&northing=NORTHING, which we can construct from data returned from the Ordnance Survey API:

easting northing lat -long

Here’s what the returned lat/long data looks like:

lat-long json

We can then create a new column derived from this JSON data by parsing it as follows
parse latlong to lat

A similar trick can be used to generate a column containing just the longitude data.

We can then export a view over the data to a CSV file, or direct to Google Fusion tables.

postcode lat long export

With the data in Google Fusion Tables, we can let Fusion Tables know that the Postcode lat and Postcode long columns define a location:2222

Fusion table edit column

Specifically, we pick either the lat or the long column and use it to cast a two column latitude and longitude location type:

fusion table config cols to location type

We can inspect the location data using a more convenient “natural” view over it…

fusion table add map

By applying a filter, we can look to see where the candidates for a particular ward have declared their home address to be:

havenstreet candidates

(Note – it would be more useful to plot these markers over a boundary line defined region corresponding to the area covered by the corresponding electoral ward. I don’t think Fusion Table lets you do this directly (or if it does, I don’t know how to do it..!). This workaround – FusionTablesLayer Wizard – on merging outputs from Fusion Tables as separate layers on a Google Map is the closest I’ve found following a not very thorough search;-)

We can go back to the tabular view in Fusion Tables to run a filter to see who the candidates were in a particular electoral division, or we can go back to OpenRefine and run a filter (or a facet) on the ward column to see who the candidates were:

refine filter by division

Filtering on some of the other wards using local knowledge (i.e. using the filter to check/corroborate things I knew), I spotted a couple of missing markers. Going back to the OpenRefine view of the data, I ran a facetted view on the postcode to see if there were any “none-postcodes” there that would in turn break the Ordnance Survey postcode geocoding/lookup:

postcode missing...

Ah – oops… It seems we have a “data quality” issue, although albeit a minor one…

So, what do we learn from all this? One take away for me is that data is a source we can ask questions of. If we have a story or angle in mind, we can tune our questions to tease out corroborating facts (possbily! caveat emptor applies!) that might confirm, helpdevelop, or even cause us to rethink, the story we are working towards telling based on the support the data gives us.

Written by Tony Hirst

May 5, 2013 at 11:38 pm

Splitting a Large CSV File into Separate Smaller Files Based on Values Within a Specific Column

with 3 comments

One of the problems with working with data files containing tens of thousands (or more) rows is that they can become unwieldy, if not impossible, to use with “everyday” desktop tools. When I was Revisiting MPs’ Expenses, the expenses data I downloaded from IPSA (the Independent Parliamentary Standards Authority) came in one large CSV file per year containing expense items for all the sitting MPs.

In many cases, however, we might want to look at the expenses for a specific MP. So how can we easily split the large data file containing expense items for all the MPs into separate files containing expense items for each individual MP? Here’s one way using a handy little R script in RStudio

Load the full expenses data CSV file into RStudio (for example, calling the dataframe it is loaded into mpExpenses2012. Previewing it we see there is a column MP.s.Name identifying which MP each expense claim line item corresponds to:

mpexpenses2012R

We can easily pull out the unique values of the MP names using the levels command, and then for each name take a subset of the data containing just the items related to that MP and print it out to a new CSV file in a pre-existing directory:

mpExpenses2012 = read.csv("~/Downloads/DataDownload_2012.csv")
#mpExpenses2012 is the large dataframe containing data for each MP
#Get the list of unique MP names
for (name in levels(mpExpenses2012$MP.s.Name)){
  #Subset the data by MP
  tmp=subset(mpExpenses2012,MP.s.Name==name)
  #Create a new filename for each MP - the folder 'mpExpenses2012' should already exist
  fn=paste('mpExpenses2012/',gsub(' ','',name),sep='')
  #Save the CSV file containing separate expenses data for each MP
  write.csv(tmp,fn,row.names=FALSE)
}

Simples:-)

This technique can be used to split any CSV file into multiple CSV files based on the unique values contained within a particular, specified column.

Written by Tony Hirst

April 3, 2013 at 8:54 am

Posted in Rstats

Tagged with ,

Reshaping Horse Import/Export Data to Fit a Sankey Diagram

with 9 comments

As the food labeling and substituted horsemeat saga rolls on, I’ve been surprised at how little use has been made of “data” to put the structure of the food chain into some sort of context* (or maybe I’ve just missed those stories?). One place that can almost always be guaranteed to post a few related datasets is the Guardian Datastore, who use EU horse import/export data to produce interactive map of the European trade in horsemeat

*One for the to do list – a round up of “#ddj” stories around the episode.)

Guardian datablog - EU trade in horsemeat

(The article describes the source of the data as the Eurpoean Union Unistat statistics website, although I couldn’t find a way of recreating the Guardian spreadsheet from that source. When I asked Simon Rogers how he’d come by the data, he suggested putting questions into the Eurostat press office;-)

The data published by the Guardian datastore is a matrix showing the number of horse imports/exports between EU member countries (as well as major traders outside the EU) in 2012:

Guardian Datablog horsemeat importexport data

One way of viewing this data structure is as an edge weighted adjacency matrix that describes a graph (a network) in which the member countries are nodes and the cells in the matrix define edge weights between country nodes. The weighted edges are also directed, signifying the flow of animals from one country to another.

Thinking about trade as flow suggests a variety of different visualisation types that build on the metaphor of flow, such as a Sankey diagram. In a Sankey diagram, edges of different thicknesses connect different nodes, with the edge thickness dependent on the amount of “stuff” flowing through that connection. (The Guardan map above also uses edge thickness to identify trade volumes.) Here’s an example of a Sankey diagram I created around the horse export data:

Horse exports - EU - Sankey demo

(The layout is a little rough and ready – I was more interested in finding a recipe for creating the base graphic – sans design tweaks;-) – from the data as supplied.)

So how did I get to the diagram from the data?

As already mentioned, the data came supplied as an adjacency matrix. The Sankey diagram depicted above was generated by passing data in an appropriate form to the Sankey diagram plugin to Mike Bostock’s d3.js graphics library. The plugin requires data in a JSON data format that describes a graph. I happen to know that that the Python networkx library can generate an appropriate data object from a graph modeled using networkx, so I know that if I can generate a graph in networkx I can create a basic Sankey diagram “for free”.

So how can we create the graph from the data?

The networkx documentation describes a method – read_weighted_edgelist – for reading in a weighted adjacency matrix from a text file, and creating a network from it. If I used this to read the data in, I would get a directed network with edges going into and out of country nodes showing the number of imports and exports. However, I wanted to create a diagram in which the “import to” and “export from” nodes were distinct so that exports could be seen to flow across the diagram. The approach I took was to transform the two-dimensional adjacency matrix into a weighted edge list in which each row has three columns: exporting country, importing country, amount.

So how can we do that?

One way is to use R. Cutting and pasting the export data of interest from the spreadsheet and into a text file (adding in the missing first column header as I did so) gives a source data file that looks something like this:

horse export source data

In contrast, the edge list looks something like this:
reshaped horse data

So how do we get from one to the other?

Here’s the R script I used – it reads the file in, does a bit of fiddling to remove commas from the numbers and turn the result into integer based numbers, and then uses the melt function from the reshape library to generate the edge list, finally filtering out edges where there were no exports:

#R code

horseexportsEU <- read.delim("~/Downloads/horseexportsEU.txt")
require(reshape)
#Get a "long" edge list
x=melt(horseexportsEU,id='COUNTRY')
#Turn the numbers into numbers by removing the comma, then casting to an integer
x$value2=as.integer(as.character(gsub(",", "", x$value, fixed = TRUE) ))
#If we have an NA (null/empty) value, make it -1
x$value2[ is.na(x$value2) ] = -1
#Column names with countries that originally contained spaces convert spaces dots. Undo that. 
x$variable=gsub(".", " ", x$variable, fixed = TRUE)
#I want to export a subset of the data
xt=subset(x,value2>0,select=c('COUNTRY','variable','value2'))
#Generate a text file containing the edge list
write.table(xt, file="foo.csv", row.names=FALSE, col.names=FALSE, sep=",")

(Another way of getting a directed, weighted edge list from an adjacency table might be to import it into networkx from the weighted adjacency matrix and then export it as weighted edge list. R also has graph libraries available, such as igraph, that can do similar things. But then, I wouldn’t have go to show the “melt” method to reshaping data;-)

Having got the data, I now use a Python script to generate a network, and then export the required JSON representation for use by the d3js Sankey plugin:

#python code

import StringIO
import csv

#Bring in the edge list explicitly
#rawdata = '''"SLOVENIA","AUSTRIA",1200
#"AUSTRIA","BELGIUM",134600
#"BULGARIA","BELGIUM",181900
#"CYPRUS","BELGIUM",200600
#... etc
#"ITALY","UNITED KINGDOM",12800
#"POLAND","UNITED KINGDOM",129100'''

#We convert the rawdata string into a filestream
f = StringIO.StringIO(rawdata)
#..and then read it in as if it were a CSV file..
reader = csv.reader(f, delimiter=',')

def gNodeAdd(DG,nodelist,name):
    node=len(nodelist)
    DG.add_node(node,name=name)
    #DG.add_node(node,name=name)
    nodelist.append(name)
    return DG,nodelist

nodelist=[]

DG = nx.DiGraph()

#Here's where we build the graph
for item in reader:
    #Even though import and export countries have the same name, we create a unique version depending on
    # whether the country is the importer or the exporter.
    importTo=item[0]+'.'
    exportFrom=item[1]
    amount=item[2]
    if importTo not in nodelist:
        DG,nodelist=gNodeAdd(DG,nodelist,importTo)
    if exportFrom not in nodelist:
        DG,nodelist=gNodeAdd(DG,nodelist,exportFrom)
    DG.add_edge(nodelist.index(exportFrom),nodelist.index(importTo),value=amount)

json = json.dumps(json_graph.node_link_data(DG))
#The "json" serialisation can then be passed to a d3js containing web page...

Once the JSON object is generated, it can be handed over to d3.js. The whole script is available here: EU Horse imports Sankey Diagram.

What this recipe shows is how we can chain together several different tools and techniques (Google spreadsheets, R, Python, d3.js) to create a visualisation with too much effort (honestly!). Each step is actually quite simple, and with practice can be achieved quite quickly. The trick to producing the visualisation becomes one of decomposing the problem, trying to find a path from the format the data is in to start with, to a form in which it can be passed directly to a visualisation tool such as the d3js Sankey plugin.

PS In passing, as well as the data tables that can be searched on Eurostat, I also found the Eurostat Yearbook, which (for the most recent release at least), includes data tables relating to reported items:

Eurostat Yearbook

So it seems that the more I look, the more and more places seems to making data that appears in reports available as data

Written by Tony Hirst

February 18, 2013 at 10:31 am

Posted in Infoskills, Rstats

Tagged with , ,

Local News Templates – A Business Opportunity for Data Journalists?

leave a comment »

As well as serendipity, I believe in confluence

A headline in the Press Gazette declares that Trinity Mirror will be roll[ing] out five templates across 130-plus regional newspapers as emphasis moves to digital. Apparently, this follows a similar initiative by Johnston Press midway through last year: Johnston to roll out five templates for network of titles.

It seems that “key” to the Trinity Mirror initiative is the creation of a new “Shared Content Unit” based in Liverpool that will provide features content to Trinity’s papers across the UK [which] will produce material across the regional portfolio in print and online including travel, fashion, food, films, books and “other content areas that do not require a wholly local flavour”.

[Update - 25/3/13: Trinity Mirror to create digital data journalism unit to produce content for online and printed titles]

In my local rag last week, (the Isle of Wight County Press), a front page story on the Island’s gambling habit localised a national report by the Campaign for Fairer Gambling on Fixed Odds Betting Terminals. The report included a dataset (“To find the stats for your area download the spreadsheet here and click on the arrow in column E to search for your MP”) that I’m guessing (I haven’t checked…) provided some of the numerical facts in the story. (The Guardian Datastore also republished the data (£5bn gambled on Britain’s poorest high streets: see the data) with an additional column relating to “claimant count”, presumably the number of unemployment benefit claimants in each area (again, I haven’t checked…)) Localisation appeared in several senses:

IWCP gambling

So for example, the number of local betting shops and Fixed Odds betting terminals was identified, the mooted spend across those and the spend per head of population. Sensemaking of the figures was also applied by relating the spend to an equivalent number of NHS procedures or police jobs. (Things like the BBC Dimensions How Big Really provide one way of coming up with equivalent or corresponding quantities, at least in geographical area terms. (There is also a “How Many Really” visualisation for comparing populations.) Any other services out there like this? Maybe it’s possible to craft Wolfram Alpha queries to do this?)

Something else I spotted, via RBloggers, a post by Alex Singleton of the University of Liverpool: an Open Atlas around the 2011 Census for England and Wales, who has “been busy writing (and then running – around 4 days!) a set of R code that would map every Key Statistics variable for all local authority districts”. The result is a set of PDF docs for each Local Authority district mapping out each indicator. As well as publishing the separate PDFs, Alex has made the code available.

So what’s confluential about those?

The IWCP article localises the Fairer Gambling data in several ways:
- the extent of the “problem” in the local area, in terms of numbers of betting shops and terminals;
- a consideration of what the spend equates to on a per capita basis (the report might also have used a population of over 18s to work out the average “per adult islander”); note that there are also at least a couple of significant problems with calculating per capita averages in this example: first, the Island is a holiday destination, and the population swings over the summer months; secondly, do holidaymakers spend differently to residents on this machines?
- a corresponding quantity explanation that recasts the numbers into an equivalent spend on matters with relevant local interest.

The Census Atlas takes one recipe and uses it to create localised reports for each LA district. (I’m guessing with a quick tweak,separate reports could be generated for the different areas within a single Local Authority).

Trinity Mirror’s “Shared Content Unit” will produce content “that do[es] not require a wholly local flavour”, presumably syndicating it to its relevant outlets. But it’s not hard to also imagine a “Localisable Content” unit that develops applications that can help produced localised variants of “templated” stories produced centrally. This needn’t be quite as automated as the line taken by computational story generation outfits such as Narrative Science (for example, Can the Computers at Narrative Science Replace Paid Writers? or Can an Algorithm Write a Better News Story Than a Human Reporter?) but instead could produce a story outline or shell that can be localised.

A shorter term approach might be to centrally produce data driven applications that can be used to generate charts, for example, relevant to a locale in an appropriate style. So for example, using my current tool of choice for generating charts, R, we could generate something and then allow local press to grab data relevant to them and generate a chart in an appropriate style (for example, Style your R charts like the Economist, Tableau … or XKCD). This approach saves duplication of effort in getting the data, cleaning it, building basic analysis and chart tools around it, and so on, whilst allowing for local customisation in the data views presented. With the increasing number of workflows available around R, (for example, RPubs, knitr, github, and a new phase for the lab notebook, Create elegant, interactive presentations from R with Slidify, [Wordpress] Bloggin’ from R).

Using R frameworks such as Shiny, we can quickly build applications such as my example NHS Winter Sitrep data viewer (about) that explores how users may be able to generate chart reports at Trust or Strategic Health Authority level, and (if required) download data sets related to those areas alone for further analysis. The data is scraped and cleaned once, “centrally”, and common analyses and charts coded once, “centrally”, and can then be used to generate items at a local level.

The next step would be to create scripted story templates that allow journalists to pull in charts and data as required, and then add local colour – quotes from local representatives, corresponding quantities that are somehow meaningful. (I should try to build an example app from the Fairer Gaming data, maybe, and pick up on the Guardian idea of also adding in additional columns…again, something where the work can be done centrally, looking for meaningful datasets and combining it with the original data set.)

Business opportunities also arise outside media groups. For example, a similar service idea could be used to provide story templates – and pull-down local data – to hyperlocal blogs. Or a ‘data journalism wire service’ could develop applications either to aid in the creation of data supported stories on a particular topic. PR companies could do a similar thing (for example, appifying the Fairer Gambling data as I “appified” the NHS Winter sitrep data, maybe adding in data such as the actual location of fixed odds betting terminals. (On my to do list is packaging up the recently announced UCAS 2013 entries data.)).

The insight here is not to produce interactive data apps (aka “news applications”) for “readers” who have no idea how to use them or what read from them whatever stories they might tell; rather, the production of interactive applications for generating charts and data views that can be used by a “data” journalist. Rather than having a local journalist working with a local team of developers and designers to get a data flavoured story out, a central team produces a single application that local journalists can use to create a localised version of a particular story that has local meaning but at national scale.

Note that by concentrating specialisms in a central team, there may also be the opportunity to then start exploring the algorithmic annotation of local data records. It is worth noting that Narrative Science are already engaged in this sort activity too, as for example described in this ProPublica article on How To Edit 52,000 Stories at Once, a news application that includes “short narrative descriptions of almost all of the more than 52,000 schools in our database, generated algorithmically by Narrative Science”.

PS Hmm… I wonder… is there time to get a proposal together on this sort of idea for the Carnegie Trust Neighbourhood News Competition? Get in touch if you’re interested…

Written by Tony Hirst

February 7, 2013 at 1:02 pm

Posted in Infoskills, OBU, Thinkses

Tagged with ,

Quick Intro to Data Journalism, Sort Of…

leave a comment »

Slides without commentary from a presentation I gave to undergrads on data journalism at the University of Lincoln yesterday…

My plan is to write some words around this deck, (or maybe even to record (and perhaps transcribe), some sort of narration…) just not right now…

I’m happy to give variants of this presentation elsewhere, if you can cover costs…

Written by Tony Hirst

February 6, 2013 at 3:06 pm

Posted in Infoskills, Presentation

Tagged with ,

Recently (And Not So Recently) Advertised Data Jobs…

with one comment

I’m doing a couple of talks to undergrad and postgrad students next work – on data journalism at the University of Lincoln, and on open data at the University of Southampton – so I thought I’d do a quick round up of recently advertised data related jobs that I could reuse for an employability slide…

So, here are some of the things I’ve noticed recently:

  • The Technology Strategy board, funders of many a data related activity (including the data vouchers for SMEs) are advertising for a Lead Technologist – Data Economy (£45,000 to £55,000):

    The UK is increasingly reliant on its service economy, and on the ability to manage its physical economy effectively, and it exports these capabilities around the world. Both aspects of this are heavily dependent on the availability of appropriate information at the right place and time, which in turn depends on our ability to access and manipulate diverse sources of data within a commercial environment.

    The internet and mobile communications and the ready availability of computing power can allow the creation of a new, data-rich economy, but there are technical, human and business challenges still to be overcome. With its rich data resources, inventive capacity and supportive policy landscape, the UK is well placed to be the centre of this innovation.

    Working within the Digital team, to develop and implement strategies for TSB’s interventions in and around the relevant sectors.

    This role requires the knowledge and expertise to develop priorities for how the UK should address this opportunity, as well as the interpersonal skills to introduce the relevant communities of practice to appropriate technological solutions. It also requires a knowledge of how innovation works within businesses in this space, to allow the design and targeting of TSB’s activities to effectively facilitate change.

    Accessible tools include, but are not restricted to, networking and community building, grant-funding of projects at a wide range of scales, directing support services to businesses, work through centres such as the Open Data Institute and Connected Digital Economy Catapult, targeted procurement through projects such as LinkedGov, and inputs to policy. The role requires drawing upon this toolkit to design a coordinated programme of interventions that has impact in its own right and which also coordinates with other activities across TSB and the wider innovation landscape.

  • Via the ECJ, a relayed message from the NICAR-L mailing list about a couple of jobs going with The Times and Sunday Times:

    A couple of jobs that might be of interest to NICAR members here at the
    Times of London…

    The first is an investigative data journalist role, joining the new data journalism unit which will work across both The Times and The Sunday Times.

    The other is a editorial developer role: this will sit within the News Development Team and will focus on anything from working out how we tell stories in richer more immersive ways, to creating new ways of presenting Times and Sunday Times journalism to new audiences.

    Please get in touch if you are interested!

    Lucia
    Head of news development, The Times and Sunday Times
    @lucia_adams

Not a job ad as such, but an interesting recent innovation from the BirminghamMail:

We’ve launched a new initiative looking at the numbers behind our city and the stories in it.
‘Behind The Numbers’ is all about the explosion in ‘data’: information about our hospitals and schools, crime and the way it is policed, business and sport, arts and culture.
We’d like you to tell us what data you’d like us to publish and dig into. Email suggestions to ben.hurst@trinitymirror.com. Follow @bhamdatablog on Twitter for updates or to share ideas.

This was also new to me: FT Data, a stats/datablog from the FT? FullFact is another recent addition to my feed list, with a couple of interesting stories each day and plenty of process questions and methodological tricks that can be, erm, appropriated ;-) Via @JackieCarter, the Social Statistics blog looked interesting, but the partial RSS feed is a real turn off for me so I’ll probably drop it from my reader pretty quickly unless it turns up some *really* interesting posts.

Here are some examples of previously advertised jobs…

  • A job that was being advertised at the end of last year (now closed) by the Office of National Statistics (ONS) (current vacancies) was for the impressive sounding Head of Rich Content Development:

    The postholder is responsible for inspiring and leading development of innovative rich content outputs for the ONS website and other channels, which anticipate and meet user needs and expectations, including those of the Citizen User. The role holder has an important part to play in helping ONS to realise its vision “for official statistics to achieve greater impact on key decisions affecting the UK and to encourage broader use across the country”.

    Key Responsibilities:

    1. Inspires, builds, leads and develops a multi-disciplinary team of designers, developers, data analysts and communications experts to produce innovative new outputs for the ONS website and other channels.
    2. Keeps abreast of emerging trends and identifies new opportunities for the use of rich web content with ONS outputs.
    3. Identifies new opportunities, proposes new directions and developments and gains buy in and commitment to these from Senior Executives and colleagues in other ONS business areas.
    4. Works closely with business areas to identify, assess and commission new rich-content projects.
    5. Provides, vision, guidance and editorial approval for new projects based on a continual understanding of user needs and expectations.
    6. Develops and manages an ongoing portfolio of innovative content, maximising impact and value for money.
    7. Builds effective partnerships with media to increase outreach and engagement with ONS content.
    8. Establishes best practice in creation of rich content for the web and other channels, and works to improve practice and capability throughout ONS.

  • From December 2010, a short term contract at the BBC for a data journalist:

    The team is looking for a creative, tech-savvy data journalist (computer-assisted reporter) to join its website specials team to work with our online journalists, graphic designer and development teams.

    Role Purpose and Aims

    You will be required to humanize statistics; to make sense of potentially complicated data and present it in a user friendly format.

    You will be asked to focus on a range of data-rich subjects relating to long-term projects or high impact daily new stories, in line with Global News editorial priorities. These could include the following: reports on development, global poverty, Afghanistan casualties, internet connectivity around the world, or global recession figures.

    Key Knowledge and Experience

    You will be a self-starter, brimming with story ideas who is comfortable with statistics and has the expertise to delve beneath the headline figures and explain the fuller picture.
    You will have significant journalistic experience gained ideally from working in an international news environment.
    The successful candidate should have experience (or at least awareness) of visualising data and visualisation tools.
    You should be excited about developing the way that data is interpreted and presented on the web, from heavy number crunching, to dynamic mapping and interactive graphics. You must have demonstrated knowledge of statistics, statistical analysis, with a good understanding of the range and breadth of data sources in the UK and internationally, broad experience with data sources, data mining and have good visual and statistical skills.
    You must have a Computer-assisted reporting background or similar, including a good knowledge of the relevant software (including Excel and mapping software).
    Experience of producing and developing data driven web content a senior level within time and budget constraints.

    Central to the role is an ability to analyse complicated information and present it to our readers in a way that is visually engaging and easy to understand, using a range of web-based technologies, for which you should have familiarity with database interfaces and web presentation layers, as well as database concepting, content entry and management.
    You will be expected to have your own original ideas on how to best apply data driven journalism, either to complement stories when appropriate or to identify potential original stories while interpreting data, researching and investigating them, crunching the data yourself and working with designers and developers on creating content that will engage our audience, and provide them with useful, personalised information.

    FWIW, it’s probably worth remembering that the use of data is not necessarily a new thing.. for example, this post – The myth of the missing Data Scientist – does a good job debunking some of the myths around “data science”.

Written by Tony Hirst

February 1, 2013 at 1:02 pm

Posted in Jobs

Tagged with ,

Corporate Sprawl Sketch Trawls Using OpenCorporates

with 4 comments

A recent post on the OpenCorporates blog (Major Milestone: Over 50 million companies (& a sneak peak at the future)) provides a sneak preview of a tool they’re developing for visualising networks of companies based on “links of control or minority shareholdings”. I’m not sure what that actually means, but it all sounds very exciting;-)

Since Chris et al.(?) added the ability to view director information for companies directly, as well as search for directors via the new 0.2 version of the OpenCorporates API, I’ve been meaning to update my corporate sprawl hack (eg in context of Tesco, G4S and Thames Water) to make use of the director information directly. (Previously, I was trying to scrape it myself from company filings data that is also published via OpenCorporates.)

I finally got round to it over the weekend (ScraperWiki: OpenCorporates trawler), so here’s my opening recipe which tries to map the extent of a current corporate network based on common directorship:

  1. Given an OpenCorporates company ID, get the list of directors
  2. Try to find current directors (I’m using the heuristic of looking for ones with no end date on their appointment and add them to a directors set;
  3. For each director in the directors set, search for directors with the same name, At the moment, the directors search is really loose, so I do a filtering pass to further limit results to only directors with exactly the same name.
    [There are three things to note here: i) it would be useful to have an 'exact search' limit option on the directors search to limit responses to just directors name that exactly match the query string; ii) the directors search returns individual records for the appointment of a particular director in a particular company - at the moment, there is no notion of an actual person who may be the director of multiple companies (FRBR comes to mind here, eg in sense of a director as a work?!, as well as researcher ID schemes such as Orcid); iii) the director records contain a uid element that is currently set to null. Is this possibly for a director ID scheme so we can know that two people with the same name who are directors of different companies are actually the same person?]
    The filtered directors search returns a list of director appointments relating to people with exactly the same name as the current directors of the target company. Each record relates to an appointment to a particular company, which gives us a list of companies that are possibly related to the target company by virtue of co-directorship.
  4. Having got a list of possibly related companies, look up the details for each. If the company is an active company, I run a couple of tests to see if it is related to the target company. The heuristics I’ve started off with are:
    • does it share exactly the same registered address as the target company? If so, there’s a chance it’s related. [Note: being able to search companies by address could be quite useful, as a step on the functionality road to a full geo-search based on geocoding of addresses, maybe?!;-)]/li>

    • does the company share N or more current directors with directors in the directors set? (I’m starting off with N=2.) If so, there’s a chance it’s related.
  5. This is the end of the first pass, and it returns a set of active companies that are possibly related to a target company by virtue of: i) sharing at least N active directors; and/or ii) sharing at least one common director and the same address.
  6. I also set the trawl up to recurse: the above description is a depth=1 search. For depth 2, from the list of companies added to the sprawl, grab all their active directors and repeat. We can do this to any depth required, though it may make sense to increase N as more directors get added to the directors set. If we increase the search depth we can search ever deeper (I haven’t tried this much yet!).
  7. Note that I also added a couple of optimisation steps to try to counter directors that are just nominees – ones that have hundreds of pages of results in the directors lookup and end up taking the sprawl across the corporate connected giant component (related: FRBR superduping, xISBN lookups and Harry Potter…

As an example of what sort of thing this discovers, here’s a depth 2 search around Care UK, whose annual report I was reading last week… (I hadn’t realised quite how privatisation of care services had got…)

care uk sprawl

Here’s depth 3:

careuk depth 3

And here’s a glimpse at some of the companies identified:

careuk discoverd companies

One thing that occurred to me might be that this tool could be used to support corporate discovery during the curation process of “corporate groupings“:

opencorporates - corporate grouping

A few things to note about corporate groupings:

  1. it would be useful to be able to filter on all/active/inactive status?
  2. if you mistakenly add a company to a corporate grouping, how do you remove it?
  3. the feature that pulls in spending items from OpenlyLocal is really nice, but it requires better tools on the OpenlyLocal side for associating spending line elements with companies. This is particularly true for sprawls, where eg council spending items declare an amount spent with eg “Care UK” but you have no idea which legal entity that actually relates to?

And just in passing, what’s going on here?

refund?

Hmmm.. this post has itself turned into a bit of a sprawl, hasn’t it?! For completeness, here’s the code from the scraper:

#The aim of this scraper is to provide, in the first instance, a way of bootstrapping a search around either a company ID or a director ID
#The user should also define a tablename stub to identify the trawl.

#If one or more company IDs are specified:
#Get the company details
#??Add any names the company was previously known a list of 'previous' companies ?
#??do "morph chains" to show how company names change?
#Get the directors
#Search for directors of same name and then do an exact match filter pass
#Get the companies associated with those exact matches


#TO DO - Spot and handle rate limiting
#TO DO - populate db

targetCompanies=['gb/01668247'] #list of OpenCorporates Company IDs with leading country code
targetDirectors=[] #list of OpenCorporates Director IDs
targetStub='Care UK 2,2 test' #name of the db table stub
trawldepth=2
coverage='current' #all, current, previous **Relates to directors
status='active' #all, active, inactive **Relates to companies
DIRINTERSECT=2 #The minimum number of shared directors (current or past) to count as part of same grouping
#------

targetStub=targetStub.replace(' ','_')

import scraperwiki, simplejson,urllib,re

#Keep the API key [private - via http://blog.scraperwiki.com/2011/10/19/tweeting-the-drilling/
import os, cgi
try:
    qsenv = dict(cgi.parse_qsl(os.getenv("QUERY_STRING")))
    ockey=qsenv["OCKEY"]
    ykey=qsenv["YKEY"]
except:
    ockey=''

#----
APISTUB='http://api.opencorporates.com/v0.2'

def deslash(x): return x.strip('/')
def signed(url): return url+'?api_token='+ockey

def occStrip(ocURL):
    return deslash(ocURL.replace('http://opencorporates.com/companies',''))

def buildURL(items):
    url=APISTUB
    for i in items:
        url='/'.join([url,deslash(i)])
    return signed(url)

def getOCcompanyData(ocid):
    ocurl=buildURL(['companies',ocid])
    ocdata=simplejson.load(urllib.urlopen(ocurl))
    if 'results' in ocdata: return ocdata['results']
    else: return -1

def getOCofficerData(ocid):
    ocurl=buildURL(['officers',ocid])
    ocdata=simplejson.load(urllib.urlopen(ocurl))
    return ocdata['results']


def recorder(data):
    d=[]
    for record in data['companies']:
        dd=record.copy()
        d.append(dd)
        if len(d)>100:
            scraperwiki.sqlite.save(unique_keys=['ocid'], table_name='companies_'+targetStub, data=d)
            d=[]
    scraperwiki.sqlite.save(unique_keys=['jurisdiction_code','company_number'], table_name='companies_'+targetStub, data=d)
    data['companies']=[]
    d=[]
    for record in data['directors']:
            dd=record.copy()
            d.append(dd)
            if len(d)>100:
                scraperwiki.sqlite.save(unique_keys=['ocid'], table_name='directors_'+targetStub, data=d)
                d=[]
    scraperwiki.sqlite.save(unique_keys=['ocid'], table_name='directors_'+targetStub, data=d)
    data['directors']=[]
    return data
    
exclusions_d=['FIRST SCOTTISH SECRETARIES LIMITED','FIRST DIRECTORS LIMITED']
exclusions_r=['nominated director','nominated secretary']
def getOCofficerCompaniesSearch(name,page=1,cidnames=[]):
    durl=APISTUB+'/officers/search?q='+name+'&per_page=100&page='+str(page)
    ocdata=simplejson.load(urllib.urlopen(durl+'&api_token='+ockey))['results']
    optimise=0
    #?need a heuristic for results with large page count?
    #Maybe put things into secondary possibles to check against?
    #The logic of this is really hacky and pragmatic(?!;-) Need to rethink... 
    for officer in ocdata['officers']:
        if (officer['officer']['name'].strip() in exclusions_d) or officer['officer']['position'] in exclusions_r:
            optimise=1
            break
        elif name==officer['officer']['name']:
            #print 'Possible new company for',name,officer['officer']['company']['name']
            #would a nominated secretary be interesting to search on? eg FIRST SECRETARIES LIMITED
            cidnames.append( ( occStrip(officer['officer']['company']['opencorporates_url']), occStrip(officer['officer']['company']['name']) ) )
    if page < ocdata['total_pages'] and optimise==0:
        page=page+1
        cidnames=getOCofficerCompaniesSearch(name,page,cidnames)
    #http://api.opencorporates.com/v0.2/officers/search?q=john+smith
    return cidnames
#-----

def trawlPass(data=[],depth=1,coverage='current',status='active'):
    data['depth']=data['depth']+1
    done=1
    newTargets=[]
    for ocid in data['targetCompanies']:
        if ocid not in data['cids']:
            bigtmp=[]
            data['cids'].append(ocid)
            cd=getOCcompanyData(ocid)
            if cd!=-1:
                if status=='active' and (cd['company']['inactive']): cd=-1
                elif status=='inactive' and not (cd['company']['inactive']): cd=-1
            if cd!=-1:
                cd=cd['company']
                uid=occStrip(cd['opencorporates_url'])
                dids=cd['officers']
                tmp={'ocid':uid}
                for x in ['name','jurisdiction_code','company_number','incorporation_date','dissolution_date','registered_address_in_full']:
                    tmp[x]=cd[x]
                didset=[]
                for didr in dids:
                    did=didr['officer']
                    #TEST - TO DO  - is None the right thing here?
                    print did['name'],did['end_date']
                    if coverage=='all':
                        didset.append(did['name'])
                    elif coverage=='current' and did['end_date'] is None:
                        didset.append(did['name'])
                    elif coverage=='previous' and did['end_date']is not None:
                        didset.append(did['name'])
                #some additional logic for heuristically determining whether or not a company is in same grouping
                if data['depth']==1: inset=1
                else: inset=0
                print coverage,'dirset',didset
                if (len(list(set(didset) & set(data['dnames'])))) >= DIRINTERSECT : inset=1
                if cd['registered_address_in_full'] in data['addresses']: inset=1
                if (inset==1):
                    data['companies'].append(tmp.copy())
                    print 'Added',tmp
                    if cd['registered_address_in_full'] not in data['addresses']: data['addresses'].append(cd['registered_address_in_full'])
                    for didr in dids:
                        if didr['officer']['name'] in didset:
                            did=didr['officer']
                            print 'dir',did['name']
                            did['ocid']=did['opencorporates_url'].replace("http://opencorporates.com/officers/","")
                            did['cname']=cd['name']
                            data['directors'].append(did.copy())
                            if did['name'] not in data['dnames']:
                                data['dnames'].append(did['name'])
                                #get matchalikes
                                cidnames=getOCofficerCompaniesSearch(did['name'])
                                for (cid,cname) in cidnames:
                                    bigtmp.append({'cid':cid,'cname':cname,'dname':did['name']})
                                    if len(bigtmp)>20:
                                        scraperwiki.sqlite.save(unique_keys=['cid','dname'], table_name='possibles_'+targetStub, data=bigtmp)
                                        bigtmp=[]
                                    if cid not in data['targetCompanies'] and cid not in newTargets:
                                        #print 'Brand new company for dir',cid
                                        newTargets.append(cid)
                    #if len(data['companies'])>20 or len(data['directors'])>20:
                    data=recorder(data)
                scraperwiki.sqlite.save(unique_keys=['cid','dname'], table_name='possibles_'+targetStub, data=bigtmp)
                bigtmp=[]
    data=recorder(data)
    for ocid in newTargets:
        data['targetCompanies'].append(ocid)
        done=0
    for director in data['targetDirectors']:
        od=getOCofficerData(ocid)['officer']
        ocid=occStrip(od['company']['opencorporates_url'])
        if ocid not in data['targetCompanies']:
            data['targetCompanies'].append(ocid)
            done=0
    depth=depth-1
    if (done==0) and depth>0:
        return trawlPass(data,depth,coverage,status)
    else: return data

_targetCompanies=[]
for c in targetCompanies:
    _targetCompanies.append(deslash(c))

init={'depth':0,'targetCompanies':_targetCompanies,'targetDirectors':targetDirectors,'cids':[],'dnames':[],'addresses':[],'companies':[],'directors':[]}
data=trawlPass(init,trawldepth,coverage,status)
print data

When I get a chance, I’ll try to pop up a couple of viewers over the data that’s scraped.

Written by Tony Hirst

January 28, 2013 at 10:57 am

Posted in Anything you want

Tagged with ,

Press Releases and Convenient Report Publication Formats for Data Journalists

with one comment

One of the things that I’ve been pondering lately is how I increasingly read the news in a “View Source”* frame of mind, wanting to look behind news stories as reported to read the actual survey report, press release, or Hansard report they take their lead from (more from this in a future post…) – see for example Two can play at that game: When polls collide for a peek behind the polls that drove a couple of conflicting recent news stories. Once you start reading news stories in the context of the press releases that drove them, you can often start to see how little journalistic value add there is to a large proportions of particular sorts of news stories. When FutureLearn was announced, most of the early stories were just a restatement of the press release, for example.

[*View Source refers to the ability, in most desktop based web browsers, to view the HTML source code that is used to generate a rendered HTML web page. That is, you can look to see how a particular visual or design effect in web page was achieved by looking at the code that describes how it was done.]

I’m still a little hazy about what the distinguishing features of “data journalism” actually are (for example, Sketched Thoughts On Data Journalism Technologies and Practice), but for the sake of this post let’s just assume that doing something with an actual data file is necessary part of the process when producing a data driven journalistic story. Note that this might just be limited to re-presenting a supplied data set in a graphical form, or it might involve a rather more detailed analysis that requires, in part, the combination of several different original data sets.

So what might make for a useful “press release” or report publication as far as a data journalist goes? One example might be raw data drops published as part of a predefined public data publication scheme by a public body. But again, for the purposes of this post, I’m more interested in examples of data that is released in a form that is packaged in a that reduces the work the data journalist needs to do and yet still allows them to argue that what they’re doing is data journalism, as defined above (i.e. it involves doing something with a dataset…).

Here are three examples that I’ve seen “in the wild” lately, without doing any real sort of analysis or categorisation of the sorts of thing they contain, the way in which they publish the data, or the sorts of commentary they provide around it. That can come later, if anyone thinks there is mileage in trying to look at data releases in this way…

The press release for the UCAS End of Cycle report 2012 includes headline statistical figures, a link to a PDF report, a link to PNG files of the figures used in the report (so that they can be embedded in articles about the report, presumably) and a link to the datasets used to create the figures used in the report.

ucasEndCycleReportPR

Each figure has it’s own datafile in CSV format:

ucasCycleReportData

Each datafile also contains editorial metadata, such as chart title and figure number:

ucasendcyclereportdatafig

The released data thus allows the data journalist (or the art department of a large news organisation…) to publish their own stylised view of the charts (or embed their own biases in the way they display the data…) and do a very limited amount of analysis on that data. The approach is still slightly short of true reproducibility, or replicability, though – it might take a little bit of effort for us to replicate the figure as depicted from the raw dataset, for example in the setting of range limits for numerical axes. (For an old example of what a replicable report might look like, see How Might Data Journalists Show Their Working?. Note that tools and workflows have moved on since that post was written – I really need to do an update. If you’re interested in examples of what’s currently possible, search for knitr…)

In this sort of release, where data is available separately for each published figure, it may be possible for the data journalist to combine data from different chart-related datasets (if they are compatible) into a new dataset. For example, if two separate charts displayed the performance of the same organisations on two different measures, we might be able to generate a single dataset that lets us plot a “dodged” bar chart showing the performance of each of those organisations against the two measures on the same chart; where two charts compare the behaviour of the same organisations at two different times, we may be able to combine the data to produce a slopegraph. And so on…

The ONS – the Office of National Statistics – had a hard time in December 2012 from the House of Commons Public Administration Committee over its website as part of an inquiry on Communicating and publishing statistics (see also the session the day before). I know I struggle with the ONS website from time to time, but it’s maybe worth considering as a minimum viable product, and to start iterating…?

So for example, the ONS publishes lots of statistical bulletins using what appears to be a templated format. For example, if we look at the Labour Market Statistics, December 2012, we see a human readable summary of the headline items in the release along with links to specific data files containing the data associated with each chart and a download area for data associated with the release:

ONSstatisitcalbulletin

If we look at the Excel data file associated with a “difference over time” chart, we notice the the data used to derive the difference is also included:

onsexcel

In this case, we could generate a slope graph directly from the datafile associated with the chart, even though not all that information was displayed in the original chart.

(This might then be a good rule of thumb for testing the quality of “change” data supplied as part of a data containing press release – are the original figures that are differenced to create the difference values also released?)

If we follow the data in this release link, we find a set of links to a whole range of downloadable statistical data tables, as well as “Datasets associated with this release

onsDatawiththisrelease

It can all start getting a bit rathole, rabbit warren from here on in… For example, here are the datasets related with the statistical bulletin:

onsDatasets withRelease

Here’s a page for the Labour Market statistics dataset, and so on…

onsDataset

That said, the original statistical bulletin does provide specific data downloads that are closely tied to each chart contained within the bulletin.

The third example is the Chief Medical Officer’s 2012 annual report, a graphically rich report published in November 2012. (It’s really worth a look…) The announcement page mentions that “All of the underlying data used to create the images in this report will be made available at data.gov.uk.” (The link points to the top level of the data.gov.uk site). A second link invites you to Read the CMO’s report, leading to a page that breaks out the report in the form of links to chapter level PDFs. However, that page also describes how “When planning this report, the Chief Medical Officer decided to make available all of the data used to create images in the report, which in turn leads to a page that contains links to a set of Dropbox pages that allow you to download data on a chapter by chapter basis from the first volume of the report in an Excel format.

Whilst the filenames are cryptic, and the figures in the report not well identified, the data is available, which is a Good Thing. (The page also notes: “The files produced this year cannot be made available in csv format. This option will become available once the Chief Medical Officer’s report is refreshed.” I’m not sure if that means CSV versions of the data will be produced for this report, or will be produced for future versions of the report, in the sense of the CMO’s Annual Report for 2013, etc?)

Once again, though, there may still be work to be done recreating a particular chart from a particular dataset (not least because some of the charts are really quite beautiful!;-) Whilst it may seem a little churlish to complain about a lack of detail about how to generate a particular chart from a particular dataset, I would just mention that one reason the web developed its graphical richness so quickly was that by “Viewing Source” developers could pinch the good design ideas they saw on other websites and implement (and further develop) them simply by cutting and pasting code from one page into another.

What each of the three examples described shows is an opening up of the data immediately behind a chart (and in at least one example from the ONS, making available the data from which the data displayed in a difference chart was calculated), and good examples of a basic form of data transparency? The reader does not have to take a ruler to a chart to work out what value a particular point is (which can be particularly hard on log-log or log-lin scale charts!), they can look it up in the original data table used to generate the chart. Taking them as examples of support for a “View Source” style of behaviour, what other forms of “View Source” supporting behaviour should we be trying to encourage?

PS If we now assume that the PR world is well versed with the idea that there are data journalists (or chart producing graphics editors) out there and that they do produce data bearing press releases for them. How might the PR folk try to influence the stories the data journalists tell by virtue of the data they release to them, and the way in which they release it?

PPS by the by, I noticed today that there is a British Standard Guide to presentation of tables and graphs [ BS 7581:1992 ] (as well as several other documents providing guidance on different forms of “statistical interpretation”). But being a British Standard, you have to pay to see it… unless you have a subscription, of course; which is one of the perks you get as a member of an academic library with just such a subscription. H/T to “Invisible librarian” (in sense of Joining the Flow – Invisible Library Tech Support) Richard Nurse (@richardn2009) for prefetching me a link to the OU’s subscription on British Standards Online in rsponse to a tweet I made about it:-)

Written by Tony Hirst

January 14, 2013 at 12:02 pm

Posted in onlinejournalismblog

Tagged with ,

When Machine Readable Data Still Causes “Issues” – Wrangling Dates…

with 4 comments

With changes to the FOI Act brought about the Protection of Freedoms Act, FOI will allow requests to be made for data in a machine readable form. In this post, I’ll give asn example of a dataset that is, arguably, released in a machine readable way – as an Excel spreadsheet, but that still requires quite a bit of work to become useful as data; because presumably the intent behind the aforementioned amendement to the FOI is to make data releases useful and useable as data? As a secondary result, through trying to make the data useful as data, I realise I have no idea what some of the numbers that are reported in the context of a date range actually relate to… Which makes those data columns misleading at best, useless at worst…And as to the February data in a release allegedly relating to a weekly release from November…? Sigh…

[Note - I'm not meaning to be critical in the sense of "this data is too broken to be useful so don't publish it". My aim in documenting this is to show some of the difficulties involved with actually working with open data sets and at least flag up some of the things that might need addressing so that the process can be improved and more "accessible" open data releases published in the future. ]

So what, and where is, the data…? Via my Twitter feed over the weekend, I saw an exchange between @paulbradshaw and @carlplant relating to a scraper built around the NHS Winter pressures daily situation reports 2012 – 13. This seems like a handy dataset for anyone wanting to report on weekly trends, spot hospitals that appear to be under stress, and so on, so I had a look at the scraper, took issue with it ;-) and spawned my own…

The data look like it’ll be released in a set of weekly Excel spreadsheets, with a separate sheet for each data report.

All well and good… almost…

If we load the data into something like Scraperwiki, we find that some of the dates are actually represented as such; that is, rather than character strings (such as the literal “9-Nov-2012″), they are represented as date types (in this case, the number of days since a baseline starting date). A quick check on StackOverflow turned up the following recipe for handling just such a thing and returning a date element that Python (my language of choice on Scraperwiki) recognises as such:

#http://stackoverflow.com/a/1112664/454773
import datetime

def minimalist_xldate_as_datetime(xldate, datemode):
    # datemode: 0 for 1900-based, 1 for 1904-based
    return (
        datetime.datetime(1899, 12, 30)
        + datetime.timedelta(days=xldate + 1462 * datemode)
        )

The next thing we notice is that some of the date column headings actually specify: 1) date ranges, 2) in a variety of styles across the different sheets. For example:

  • 16 – 18/11/2012
  • 16 Nov 12 to 18-NOV-2012
  • 16 to 18-Nov-12

In addition, we see that some of the sheets split the data into what we might term further “subtables” as you should notice if you compare the following sheet with the previous one shown above:

Notwithstanding that the “shape” of the data table is far from ideal when it comes to aggregating data from several weeks in the same database (as I’ll describe in another post), we are faced with a problem here that if we want to look at the data by date range in a mechanical, programmable way, we need to cast these differently represented date formats in the same way, ideally as a date structure that Python or the Scraperwiki SQLlite database can recognise as such.

The approach I took was as follows (it could be interesting to try to replicate this approach using OpenRefine?). Firstly, I took the decision to map dates onto “fromDates” and “toDates”. ***BEWARE – I DON’T KNOW IF THIS IS CORRECT THING TO DO**** Where there is a single specified date in a column heading, the fromDate and toDate are set to one and the same value. In cases where the date value was specified as an Excel represented date (the typical case), the code snippet above casts it to a Pythonic date value then I can then print out as required (I opted to display dates in the YYYY-MM-DD format) using a construction along the lines of:

dateString=minimalist_xldate_as_datetime(cellValue,book.datemode).date().strftime("%Y-%m-%d")

In this case, cellValue is the value of a header cell that is represented as an Excel time element, book is the workbook, as parsed using the xlrd library:

import xlrd
xlbin = scraperwiki.scrape(spreadsheetURL)
book = xlrd.open_workbook(file_contents=xlbin)

and book.datemode is a library call that looks up how dates are being represented in the spreadsheet. If the conversion fails, we default to setting dateString to the original value:
dateString=cellvalue

The next step was to look at the date range cells, and cast any “literal” date strings into a recognised date format. (I’ve just realised I should have optimised the way this is called in the Scraperwiki code – I am doing so many unnecessary lookups at the moment!) In the following snippet, I look to see if we can split the date into a cell range functions,

import time
from time import mktime
from datetime import datetime

def dateNormalise(d):
    #This is a bit of a hack - each time we find new date formats for the cols, we'll need to extend this
    #The idea is to try to identify the date pattern used, and parse the string accordingly
    for trials in ["%d %b %y",'%d-%b-%y','%d-%b-%Y','%d/%m/%Y','%d/%m/%y']:
        try:
            dtf =datetime.datetime.fromtimestamp(mktime(time.strptime(d, trials)))
            break
        except: dtf=d
    if type(dtf) is datetime.datetime:
        dtf=dtf.strftime("%Y-%m-%d")
    return dtf

def patchDate(f,t):
    #Grab the month and year elements from the todate, and add in the from day of month number
    tt=t.split('-')
    fromdate='-'.join( [ str(tt[0]),str(tt[1]),str(f) ])
    return fromdate

def dateRangeParse(daterange):
    #In this first part, we simply try to identify from and to portions
    dd=daterange.split(' to ')
    if len(dd)<2:
        #That is, split on 'to' doesn't work
        dd2=daterange.split(' - ')
        if len(dd2)<2:
            #Doesn't split on '-' either; set from and todates to the string, just in case.
            fromdate=daterange
            todate=daterange
        else:
            fromdate=dd2[0]
            todate=dd2[1]
    else:
        fromdate=dd[0]
        todate=dd[1]
    #By inspection, the todate looks like it's always a complete date, so try to parse it as such 
    todate=dateNormalise(todate)
    #I think we'll require another fudge here, eg if date is given as '6 to 8 Nov 2012' we'll need to finesse '6' to '6 Nov 2012' so we can make a date from it
    fromdate=dateNormalise(fromdate)
    if len(fromdate)<3:
        fromdate=patchDate(fromdate,todate)
    return (fromdate,todate)

#USAGE:
(fromdate,todate)=dateRangeParse(dateString)

One thing this example shows, I think, is that even though the data is being published as a dataset, albeit in an Excel spreadsheet, we need to do some work to make it properly useable.

XKCD - ISO 8601

The sheets look as if they are an aggregate of data produced by different sections, or different people: that is, they use inconsistent ways of representing date ranges.

When it comes to using the date, we will need to take care in how we represent or report on figures collected over a date range (presumably a weekend? I haven’t checked), compared to daily totals. Indeed, as the PS below shows, I’m now starting to doubt what the number in the date range column represents? Is it: a) the sum total of values for days in that range; b) the average daily rate over that period; c) the value on the first or last date of that period? [This was written under assumption it was summed daily values over period, which PS below suggests is NOT the case, in one sheet at least?] One approach might be to generate “as-if daily” returns simply by dividing ranged date totals by the number of days in the range. A more “truthful” approach may be to plot summed counts over time (date on the x-axis, sume of values to date on the y-axis), with the increment for the date-ranged values that is being added in to the summed value taking the “toDate” date as its x/date value?

When I get a chance, I’ll do a couple more posts around this dataset:
- one looking at datashaping in general, along with an example of how I shaped the data in this particular case
- one looking at different queries we can run over the shaped data.

PS Another problem… on the NHS site, we see that there appear to be weekly spreadsheet releases and an aggregated release:

Because I didn’t check the stub of scraper code used to pull off the spreadsheet URLs from the NHS site, I accidentally scraped weekly and aggrgeated sheets. I’m using a unique key based on a hash that includes the toDate as part of the hashed value, in an attempt to keep dupes out of the data from just this sort of mistake, but looking at a query over the scraped data I spotted this:

If we look at the weekly sheet we see this:

That is, a column for November 15th, and then one for November 18th, but nothing to cover November 16 or 17?

Looking at a different sheet – Adult Critical Care – we get variation at the other end of the range:

If we look into the aggregated sheet, we get:

Which is to say – the weekly report displayed a single data as a column heading where the aggregated sheet gives a date range, although the same cell values are reported in this particular example. So now I realise I have no idea what the cell values in the date range columns represent? Is it: a) the sum total of values for days in that range; b) the average daily rate over that period; c) the value on the first or last date of that period?

And here’s another query:

February data??? I thought we were looking at November data?

Hmmm…

PPS If you’re looking for learning outcomes from this post, here are a few: three ways in which we need to wrangle sense out of dates:

  1. representing Excel dates or strings-that-look-like-dates as dates in some sort of datetime representation (which is most useful sort of representation, even if we end up casting dates into string form);
  2. parsing date ranges into pairs of date represented elements (from and to dates);
  3. where a dataset/spreadsheet contains heterogenous single date and date range columns, how do we interpret the numbers that appear in the date range column?
  4. shoving the data into a database and running queries on it can sometimes flag up possible errors or inconsistencies in the data set, that might be otherwise hard to spot (eg if you had to manually inspect lots of different sheets in lots of different spreadsheets…)

Hmmm….

PPPS Another week, another not-quite-right feature:

another date mixup

PPPPS An update on what the numbers actually mean,from an email exchange (does that make me more a journalist than a blogger?!;-) with the contact address contained within the spreadsheets: “On the columns, where we have a weekend, all items apart from beds figures are summed across the weekend (eg number of diverts in place over the weekend, number of cancelled ops). Beds figures (including beds closed to norovirus) are snapshots at the collection time (i.e 8am on the Monday morning).”

PPPPPS Another week, ans this time three new ways of writing the date range over the weekend: 14-16-Dec-12, 14-16-Dec 12, 14-16 Dec 12. Anyone would think they were trying to break my scraper;-)

Written by Tony Hirst

November 27, 2012 at 5:55 pm

Posted in Data, Infoskills

Tagged with , ,

#online12 Reflections – Can Open Public Data Be Disruptive to Information Vendors?

with 2 comments

Whilst preparing for my typically overloaded #online12 presentation, I thought I should make at least a passing attempt at contextualising it for the corporate attendees. The framing idea I opted for, but all too briefly reviewed, was whether open public data might be disruptive to the information industry, particularly purveyors of information services in vertical markets.

If you’ve ever read Clayton Christensen’s The Innovator’s Dilemma, you’ll be familiar with the idea behind disruptive innovations: incumbents allow start-ups with cheaper ways of tackling the less profitable, low-quality end of the market to take that part of the market; the start-ups improve their offerings, take market share, and the incumbent withdraws to the more profitable top-end. Learn more about this on OpenLearn: Sustaining and disruptive innovation or listen again to the BBC In Business episode on The Innovator’s Dilemma, from which the following clip is taken.


In the information industry, the following question then arises: will the availability of free, open public data be adopted at the low, or non-consuming end of the market, for example by micro- and small companies who haven’t necessarily be able to buy in to expensive information or data services, either on financial grounds or through lack of perceived benefits? Will the appearance of new aggregation services, often built around screenscrapers and/or public open data sources start to provide useful and useable alternatives at the low end of the market, in part because of their (current) lack of comprehensiveness or quality? And if such services are used, will they then start to improve in quality, comprehensiveness and service offerings, and in so doing start a ratcheting climb to quality that will threaten the incumbents?

Here are a couple of quick examples, based around some doodles I tried out today using data from OpenCorporates and OpenlyLocal. The original sketch (demo1() in the code here) was a simple scraper on Scraperwiki that accepted a person’s name, looked them up via a director search using the new 0.2 version of the OpenCorporates API, pulled back the companies they were associated with, and then looked up the other directors associated with those companies. For example, searching around Nigel Richard Shadbolt, we get this:

One of the problems with the data I got back is that there are duplicate entries for company officers; as Chris Taggart explained, “[data for] UK officers [comes] from two Companies House sources — data dump and API”. Another problem is that officers’ records don’t necessarily have start/end dates associated with them, so it may be the case that directors’ terms of office don’t actually overlap within a particular company. In my own scraper, I don’t check to see whether an officer is marked as “director”, “secretary”, etc, nor do I check to see whether the company is still a going concern or whether it has been dissolved. Some of these issues could be addressed right now, some may need working on. But in general, the data quality – and the way I work with it – should only improve from this quick’n'dirty minimum viable hack. As it is, I now have a tool that at a push will give me a quick snapshot of some of the possible director relationships surrounding a named individual.

The second sketch (demo2() in the code here) grabbed a list of elected council members for the Isle of Wight Council from another of Chris’ properties, OpenlyLocal, extracted the councillors names, and then looked up directorships held by people with exactly the same name using a two stage exact string match search. Here’s the result:

As with many data results, this is probably most meaningful to people who know the councillors – and companies – involved. The results may also surprise people who know the parties involved if they start to look-up the companies that aren’t immediately recognisable: surely X isn’t a director of Y? Here we have another problem – one of identity. The director look-up I use is based on an exact string match: the query to OpenCorporates returns directors with similar names, which I then filter to leave only directors with exactly the same name (I turn the strings to lower case so that case errors don’t cause a string mismatch). (I also filter companies returned to be solely ones with a gb jurisdiction.) In doing the lookup, we therefore have the possibility of false positive matches (X is returned as a director, but it’s not the X we mean, even though they have exactly the same name); and false negative lookups (eg where we look up a made up director John Alex Smith who is actually recorded in one or more filings as (the again made-up) John Alexander Smith.

That said, we do have a minimum viable research tool here that gives us a starting point for doing a very quick (though admittedly heavily caveated) search around companies that a councillor may be (or may have been – I’m not checking dates, remember) associated with.

We also have a tool around which we can start to develop a germ of an idea around conflict of interest detection.

The Isle of Wight Armchair Auditor, maintained by hyperlocal blog @onthewight (and based on an original idea by @adrianshort) hosts local spending information relating to payments made by the Isle of Wight Council. If we look at the payments made to a company, we see the spending is associated with a particular service area.

If you’re a graph thinker, as I am;-), the following might then suggest itself to you:

  1. From OpenlyLocal, we can get a list of councillors and the committees they are on;
  2. from OnTheWight’s Armchair Auditor, we can get a list of companies the council has spent money with;
  3. from OpenCorporates, we can get a list of the companies that councillors may be directors of;
  4. from OpenCorporates, we should be able to get identifiers for at least some of the companies that the council has spent money with;
  5. putting those together, we should be able to see whether or not a councillor may be a director of a company that the council is spending with and how much is being spent with them in which spending areas;
  6. we can possibly go further, if we can associate council committees with spending areas – are there councillors who are members of a committee that is responsible for a particular spending area who are also directors of companies that the council has spent money with in those spending areas? Now there’s nothing wrong with people who have expertise in a particular area sitting on a related committee (it’s probably a Good Thing). And it may be that they got their experience by working as a director for a company in that area. Which again, could be a Good Thing. But it begs a transparency question that a journalist might well be interested in asking. And in this case, with open data to hand, might technology be able to help out? For example, could we automatically generate a watch list to check whether or not councillors who are directors of companies that have received monies in particular spending areas (or more generally) have declared an interest, as would be appropriate? I think so…(caveated of course by the fact that there may be false positives and false negatives in the report…; but it would be a low effort starting point).

Once you get into this graph based thinking, you can take it mich further of course, for example looking to see whether councillors in one council are directors of companies that deal heavily with neighbouring councils… and so on.. (Paranoid? Me? Nah… Just trying to show how graphs work and how easy it can be to start joining dots once you start to get hold of the data…;-)

Anyway – this is all getting off the point and too conspiracy based…! So back to the point, which was along the lines of this: here we have the fumblings of a tool for mixing and matching data from two aggregators of public information, OpenlyLocal and OpenCorporates that might allow us to start running crude conflict of interest checks. (It’s easy enough to see how we can run the same idea using lists of MP names from the TheyWorkForYou API; or looking up directorships previously held by Ministers and the names of companies of lobbiests they meet (does WhosLobbying have an API of such things?). And so on…

Now I imagine there are commercial services around that do this sort of thing properly and comprehensively, and for a fee. But it only took me a couple of hours, for free, to get started, and having started, the paths to improvement become self-evident… and some of them can be achieved quite quickly (it just takes a little (?!) but of time…) So I wonder – could the information industry be at risk of disruption from open public data?

PS if you’re into conspiracies, Cambridge’s Centre for Research in the Arts, Social Sciences and Humanities (CRASSH) has a post-doc positions open with Professior John Naughton on The impact of global networking on the nature, dissemination and impact of conspiracy theories. The position is complemented by several parallel fellowships, including ones on Rational Choice and Democratic Conspiracies and Ideals of Transparency and Suspicion of Democracy.

Written by Tony Hirst

November 22, 2012 at 7:37 pm

Follow

Get every new post delivered to your Inbox.

Join 349 other followers