Data Handling in Action – Visualising UK General Election 2005 Results

Over the weekend, a bit of random googling with filetype:csv OR filetype:xls limits turned up an archive of what looked like a complete set of UK general election results from 2005 on the Electoral Commission website.

(By the by, I also came across the National Digital Archive of Datasets (NDAD), which looks like it could be interesting: “The National Digital Archive of Datasets (NDAD) preserves and provides online access to archived digital datasets and documents from UK central government departments. Our collection spans 40 years of recent history, with the earliest available dataset dating back to about 1963.”)

Anyway, the election results data was presented in a CSV file with a format that looked like:

Election results data

The first thing I wanted to try to do with it was to produce an election votes treemap (e.g. along similar lines to Work In Progress – Voting Treemaps). The target visualisation tool was Many Eyes, simply because it’s quick and easy to use; I think a little bit of setting up is still required to use the wrapper I’ve placed around the JIT treemaps…:-(

In order to use the Many Eyes treemap to its full effect, I needed each row to start with the constituency name. Looking at the above screengrab of the Electoral Commission data, we see that the constituency is only explicitly stated for the winning candidate. So what’s the best way of filling in the blanks?

In a couple of recent talks, I’ve made the claim that library folk should be in the business of teaching effective data handling skills, and this is a good case in point. So what would you do?

The naive response is to go into the spreadsheet and, by hand, click and drag each separate constituency name into the blanks. With over six hundred constituencies, that should take you how long…?!

Three rather more efficient ways came to my mind. Firstly, use a Python script to read in the CSV file, check to see if the constituency field is populated, and if it isn’t, populate it with the constituency name from the previous row:

import csv

filein='candidates_49576.csv'
fileout='election2005.csv'

fout=open(fileout,"wb+")
csv_writer = csv.writer(fout)

last='' # Most recently seen constituency name (in case the first data row is blank)
file_obj=open(filein,"rb")
for line in csv.reader(file_obj,
              delimiter=',',    # Your custom delimiter.
              skipinitialspace=True): # Strips whitespace after delimiter.
  if line: # Make sure there's at least one entry.
    if line[0]=='': # Constituency cell is blank...
      line[0]=last  # ...so fill it with the value from the previous row
    else:
      last=line[0]  # Remember the new constituency name
    csv_writer.writerow(line)
file_obj.close()
fout.close()

As ever, most of the above code isn’t mine, it’s stolen (in this case, from Stack Overflow).

The second approach I thought of was to use a Google Apps script… (this exercise is left to the reader… feel free to post a copy of the script, or a link to a spreadsheet containing it, in the comments;-)

The third approach was to use a spreadsheet formula and create a new, fully populated constituency column. Here’s one formula that I thought of:

Fill in the blanks... Excel formula

That is, if the cell to the left is empty, copy the contents of the cell above…
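
In case the screengrab doesn’t travel, the formula would be something along these lines – assuming the original constituency names are in column A, the new column is column B, and the formula is entered in cell B2:

=IF(A2="",B1,A2)

Then select and drag the cell down to populate the rest of the column.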

Now that I had a copy of the data in the form I needed, I could just copy and paste it from the spreadsheet into Many Eyes.

You can see a copy of a derived treemap here. The view below uses colour to show the constituencies where the winning candidate won more than half of the valid votes cast in that constituency:

Where constituencies gave a true majority mandate to a party, general election 2005

The next thing I wanted to do was to look at the actual number of votes a winning candidate beat the second placed candidate by, and also view this as a percentage of the total number of valid votes cast in that constituency. To do this, I created three new columns:

– one containing the total number of valid votes cast;
– one containing the number of votes the winning candidate beat the second placed candidate by;
– one containing the number of votes the winning candidate beat the second placed candidate by divided by the total number of valid votes cast, given as a percentage.

So how would you calculate those items?

Go on – think about it – how would you do it?

Here’s how I did it…

To count the total number of votes cast in a constituency, I used the formula:
=SUMIF(A:A,A2,D:D)
(Note that before yesterday, I’d never heard of this formula… I just went googling for help on something like excel sum similar items…;-)

SUMIF

Select and drag this cell to populate the column…

For the vote majority formula, I only wanted this to apply to the winning candidate, so I put a trap in to see whether the constituency from the previous row is the same as the current row. If it isn’t, then because I assume the first candidate is the winning candidate, I can find the difference between the current row’s votes and that of the second placed candidate, which is immediately below:

=IF(A1<>A2,D2-D3,0)

Once again, select and drag this cell to populate the column…

I guess I really should also trap just to check that the row below is in the same constituency in case someone (the Speaker?) stands unopposed (here treating an unopposed candidate’s full vote count as their majority):
=IF(A1<>A2,IF(A2=A3,D2-D3,D2),0)

The percentage value is found simply by dividing the absolute number of votes majority by the total number of valid votes cast and multiplying by 100.
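
For example, assuming the votes-majority figure is in column F and the total valid votes in column E (hypothetical column positions), the formula would be something like:

=100*F2/E2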

I uploaded the data to Many Eyes here.

So now we can look to see the number of votes majority an individual had, for example, compared to the turnout:

General election 2005

This chart brings two other things to mind. Firstly, it shows all the candidates – it would be more convenient in this case to just show the data for the winning candidates. Secondly, it suggests that we might also plot how many votes behind the winning candidate each of the other candidates was… but that’s for another day…

What I will do though is consider the first point, specifically how we might just look at the data for the winning candidates… One way of achieving this is to use Google Spreadsheets as a database using the Google spreadsheets API as described in First Steps Towards a Generic Google Spreadsheets Query Tool, or At Least, A Guardian Datastore Interactive Playground.

Uploading the data to a Google Spreadsheet, and publishing it means I can interrogate it via the API in my Datastore Explorer (howto):

Guardian Datastore explorer

The query also generates a link to a CSV version of the data. It’s not hard then to pop the data into Many Eyes, so now we can visualise the winning MPs’ majorities more easily – here we see the votes ahead of the second placed candidate vs. the number of valid votes cast:

Votes majority vs votes cast, 2005 General Election

We can also use the Datastore explorer to interrogate the database. For example, which MPs are defending a slim majority in terms of absolute number of votes?

Not many votes in it...

Okay, that’s more than enough for now – this was only supposed to be a half hour post… What I hope this post has done, though, is show how I mix and match tools and techniques as I wander around a data set, as well as happily trawling the net to look for solutions to problems that I’m sure someone else must have solved before (such as summing a number of separate items labelled the same way in an Excel spreadsheet).

PS Somewhere along the line, I think when looking at majorities, a couple of points had negative values – the visualisation had turned up missing rows, like this one:

Data in need of cleaning...

UK General Election 2010 – Interactive Maps and Swingometers

So it seems like the General Election has been a Good Thing for the news media’s interactive developer teams… Here’s a quick round up of some of the interactives I’ve found…

First up, the BBC’s interactive election seat calculator:

BBC election interactive

This lets you set the percentage vote polled by each party and it will try to predict the outcome…

The Guardian swingometer lets you play with swing from any two of the three big parties to the third:

Guardian swingometer

The Daily Telegraph swingometer lets you look at swing between any two parties…

Telegraph election map

The Economist also lets you explore pairwise swings:

Economist - election map

The Times doesn’t really let you do much at all… and I wonder – is Ladbrokes in there as product placement?!

Times election interactive

Sky doesn’t go in for modelling or prediction; it’s more of a constituency browser:

Sky Election Map

The Sun probably has Tiffany, 23…

From elsewhere, this swingometer from the Charts & numbers – UK Election 2010 blog lets you model swings between the various parties:

Swingometer

As to what swing is? It’s defined in this Parliamentary briefing doc [PDF].

Playing With Processing – arc() and General Election Data, 2005

Over the weekend, I thought I’d have a little bit more of a play with the data I got into shape in Data Handling in Action – Visualising UK General Election 2005 Results. One of the things I’ve been pondering in Processing is how to draw circles and arcs, so having discovered the arc() function, I decided to have a little play with it…

[UPDATE – there may be a minor oops or two in some of the following images… e.g. scaling (proportion of vote cast vs size of constituency may be mangled in one of the charts); and if none of lib/lab/con won the seat, votes may still show (in order of most-least from among the big three)… But having doodled, now I know the sorts of things I want to measure and plot for doing it properly next time!;-)]

Here we have the 2005 general election results for each constituency for Labour, the Lib Dems and the Conservative party. Each concentric circle is one constituency, in alphabetical order, going from inside to out. The colours are as you might expect; the length of an arc is the percentage of the overall vote in that constituency polled by the corresponding party. The six plots show all the possible orderings…

General Election 2005

It strikes me that I just assumed the default origin that the arc() function takes… Maybe I should set it more deliberately (but according to what criterion?)

I also experimented with a range of plots based on equal segments, with origin axes 120 degrees apart. Visualisation folk won’t like these because the whole circle does not add up to 100% of anything…

Here, each party has 120 degrees; the length of each arc is the percentage of valid votes * 120 degrees… (I think… pretty sure it’s not percentage of possible electorate!)

General election 2005

In the following case, the 120 degree angle about each party is 100% of the eligible voters; the arc length is: 120 degrees * number of votes / number of eligible voters.

faceofdemocracy2 general election 2005

If we just look at winners, the following allocates 120 degrees to the votes cast in the constituency; arcs show number of votes for winner / number of valid votes cast (err, maybe… unless it’s per size of constituency… hmmm):

General election 2005

The next chart shows the size of the majority, as a percentage of the whole circle:

General election 2005

Finally, here’s just the winning party as a whole ring:

General election 2005 - winning seats

UPDATE: and here’s a view showing the percentage votes in order (winner first, second placed candidate, then third, again just taking into account the three main parties… and ah, I’ve maybe realised an oops for constituencies where one of the three main parties doesn’t win… maybe their votes are showing too…?! Hmm……)

General Election 2005

If I had any more time to play on these, I’d look at ordering the circles in various different ways. But I don’t have any more time to play on this just now…

PS For more election related data, particularly in Linked Data form, see @kitwallace’s Election nominations in the Talis Data Incubator and An exercise in RDF – 2005 Election data.

PPS see also this summary/round-up of UK media interactive swingometers

What Happened Then? Using Approximated Twitter Follower Accession to Identify Political Events

Following a chat with @andypryke, I thought I’d try out a simple bit of feature detection around approximated follower acquisition charts (e.g. Estimated Follower Accession Charts for Twitter) to see if I could detect dates around which there were spikes in follower acquisition.

So for example, here’s the follower acquisition chart for Seema Malhotra:

seemaMalhotra

We see a spike in follower count about 440 days ago, with an increased daily follower acquisition rate thereafter. What happened 440 days or so ago? We can easily look this up on something like Wolfram Alpha (query on /440 days ago/) or directly in R:

as.Date(Sys.time())-440
[1] "2011-12-20"

So what happened in December 2011? A quick search on /”Seema Malhotra” December 2011/ turns up the news that she won a by-election on 16 December 2011. The spike in followers matches the by-election date well, and the increased rate in daily follower acquisition since then is presumably related to the fact that Seema Malhotra is now an MP.

So what’s the new line on the chart (the black, stepped line along the bottom)? It’s actually a 5 point moving average of the first difference in follower count over time (that is, sort of a smoothed version of a crude approximation to the gradient of the follower acquisition curve; the firstdiff curve is normalised by finding the difference in accumulated follower count between consecutive time samples divided by the number of days between samples. So it’s a sort of gradient rather than a first difference. If the samples were all a day apart, it would be a first difference…). I also filter the line to only show days on which there was a “significant jump” in follower count, arbitrarily set at a 5 sample moving average of more than 50 new followers per day. Note the scaling of the moving average values too – the numerical y-axis scale is 1:1 for the cumulative follower number, but 10x the moving average value. The numerical value labels that annotate the line chart give the number of days ago (relative to the date the chart was generated) that each peak occurred. For any chart critics out there – this is a “working chart”, rather than a polished presentation graphic;-)

#Process Twitter user data file
processUsersData=function(data){
  #Parse the follower account creation date
  data$tz=as.POSIXct(data$created_at)
  #Days since each follower's account was created
  data$days=as.integer(difftime(Sys.time(),data$tz,units='days'))
  #Reverse the row order (oldest account first)
  data=data[rev(rownames(data)),]
  #Accession number: position in the (time ordered) follower list
  data$acc=1:length(data$days)
  #Running minimum of account age - used to approximate the follow date
  data$recency=cummin(data$days)
  #Some possibly useful ratios and rates
  data$frfo=data$friends_count/data$followers_count
  data$stfo=data$statuses_count/data$followers_count
  data$foperday=data$followers_count/data$days
  data$stperday=data$statuses_count/data$days
  data$fost=data$followers_count/(1+data$statuses_count)
  
  data
}

#The TTR library includes various moving average functions
require(TTR)

differ_a=function(d){
  d=processUsersData(d)

  #Find the users who are used to approximate the accession date
  d2=subset(d,days==recency)
  #Dedupe these rows (need to check if I grab the first or the last...)
  d3=d2[!duplicated(d2$recency),]

  #First difference
  d3$accdiff=c(0,diff(d3$acc))
  d3$daysdiff=c(0,-diff(d3$days))
  d3$firstdiff=d3$accdiff/d3$daysdiff

  #First difference smoothed over 5 values - note we do dodgy things against time here - just look for signal!
  d3$SMA5=SMA(d3$firstdiff,5)

  #Second difference
  d3$fdd=c(0,diff(d3$firstdiff))
  d3$seconddiff=d3$fdd/d3$daysdiff
  
  d3
}

#An example plotter - sm is the user data
require(ggplot2)
g=ggplot(processUsersData(sm))
g=g+geom_point(aes(x=-days,y=acc),size=1) #The black acc/days dots
g=g+geom_point(aes(x=-recency,y=acc),col='red',size=1) #The red acc/days acquisition date estimate dots
g=g+geom_line(data=differ_a(sm),aes(x=-days,y=10*SMA5)) #The firstdiff moving average line
g=g+geom_text(data=subset(differ_a(sm),SMA5>50),aes(x=-days,y=10*SMA5,label=days),size=3) #Feature label
g=g+ggtitle("Seema Malhotra") #Chart title
g #Render the chart

Here’s Chris Pincher:

chrispincher

This account got hit about 79 days ago (December 15th 2012) – we need to ignore the width of the moving average curve and just focus on its leading edge, as a zoom into the chart demonstrates (with a bar chart depicting firstdiff replacing the firstdiff moving average line).

#Got a rogue datapoint in there somehow? Filter it out...
g=ggplot(subset(processUsersData(cpmp),days<5000))
g=g+geom_point(aes(x=-days,y=acc),size=1)
g=g+geom_point(aes(x=-recency,y=acc),col='red',size=1)
g=g+geom_bar(data=subset(differ_a(cpmp),days>50 & days<5000),aes(x=-days,y=firstdiff),stat='identity') #firstdiff as bars
g=g+ggtitle("Chris Pincher")+xlim(-200,-25)

chrispincherzoom

The spam followers that were signed up to the account look like they were created in batches several months prior to what I presume was an attack? Could this have been in response to his Speaking Out about the Collapse of Drive Assist on Thursday, December 13th, 2012, his Huffpo post on the 11th, or his vote against the Human Rights Act as reported on the 5th?

Who else has an odd follower acquisition chart? How about Aidan Burley?

AidanBurley

219 days ago – 28th July 2012…
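
As before, we can check the date in R:

as.Date(Sys.time())-219
[1] "2012-07-28"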

aidanBurleycrap

I guess that caused something of a Twitter storm, and a resulting growth in his follower count… Diane Abbott’s racist tweet row from December 2012 also grew her Twitter following… Top tip for follower acquisition, there;-)

Nadine Dorries’ outspoken comments in May 2012 around David Cameron’s party leadership, and then same sex marriage, were good for her Twitter follower count, which received another push when she joined I’m a Celebrity and was suspended from the Parliamentary Conservative party.

Showing your emotions in Parliament looks like a handy trick too… Here’s a spike around about October 20th, 2011…

alisonMcgovern

(There also looks to be a gradient change around 200 days ago maybe? The second diff calculations might pull this out?)

Chris Bryant’s speech on the phone hacking saga in July 2011 showed that publicly well-received parliamentary speeches can be good for numbers too; not surprisingly, the phone hacking scandal was also good for Tom Watson’s follower count around the end of July 2011. Election victories can be good too: Andy Sawford got a jump in followers when he was announced as a PPC (10th August 2012) and then when he won his seat (November 7th 2012); Ben Bradshaw’s numbers also jumped around the time of his May 2010 election victory, as did Lynne Featherstone’s, particularly with her appointment to a government position. Jesse Norman appeared to get a bump after the Prime Minister confronted him on July 11th 2012; Nick de Bois saw a leap in followers following the riots in his constituency in early August 2011, and the riots also seem to have bumped David Lammy’s and Diane Abbott’s numbers up.

A tragedy on September 17th looks like it may have pushed Peter Hain’s numbers, but he was in the news a reasonable amount around that time – maybe getting your name in the press for several days in a row is good for Twitter follower counts? Steve Rotherham also benefited from the recollection of another tragedy, the Hillsborough disaster, when, in October 2011, he called out the Sun’s ex-editor over its original coverage; he seems to have received another boost in followers when he led a debate on internet trolls in September 2012.

Personal misfortune didn’t do Michael Fabricant any harm – the colourful Twitter baiting around his speeding conviction in October 2012 caused his follower count to fly and achieve an elevated rate of daily growth it’s maintained ever since.

A Dispatches special on ticket touts got a bounce in followers for Sharon Hodgson, who was sponsoring a private member’s bill on ticket touts at the time; winning a social media award seemed to do Kevin Brennan a favour in terms of his daily follower acquisition rate, as this ramp increase around the start of December 2010 shows:

kevinBrennan

So there we have it; political life as seen through the lens of Twitter follower acquisition bursts:-)

But what now? I guess one thing to do would be to have a go at estimating the daily growth rates of the various twittering MPs, and see if they have any bearing on things like ministerial (or Shadow Ministerial) responsibility? Where rates seem to change (sustained kinks in the curve), it might be worth looking to see whether we can identify any signs of changes in tweeting behaviour – or maybe news stories that come to associate the MP with Twitter in some way?

A Simple OpenRefine Example – Tidying Cut’n’Paste Data from a Web Page

Here’s a quick walkthrough of how to use OpenRefine to prepare a simple data file. The original data can be found on a web page that looks like this (h/t The Gravesend Reporter):

polling station list

Take a minute or two to try to get your head round how this data is structured… What do you see? I see different groups of addresses, one per line, separated by blank lines and grouped by “section headings” (ward names perhaps?). The ward names (if that’s what they are) are uniquely identified by the colon that ends the line they’re on. None of the actual address lines contain a colon.

Here’s how I want the data to look after I’ve cleaned it:

data in a fusion table

Can you see what needs to be done? Somehow, we need to:

– remove the blank lines;
– generate a second column containing the name of the ward each address applies to;
– remove the colon from the ward name;
– remove the rows that contained the original ward names.

If we highlight the data in the web page, copy it and paste it into a text editor, it looks like this:

polling stations

We can also paste the data into a new OpenRefine Project:

paste data into OpenRefine

We can use OpenRefine’s import data tools to clean the blank lines out of the original pasted data:

OpenRefine parse line data

But how do we get rid of the section headings, and use them as second column entries so we can see which area each address applies to?

OpenRefine data in - more cleaning required

Let’s start by filtering the data to only show rows containing the headers, which we note that we could identify because those rows were the only rows to contain a colon character. Then we can create a second column that duplicates these values.

cleaning data part 1

Here’s how we create the new column, which we’ll call “Wards”; the cell contents are simply a duplicate of the original column.

open refine leave the data the same
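
(If you prefer typing expressions to menu options: the GREL expression that simply copies the original cell contents into the new column is value – OpenRefine’s default expression.)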

If we delete the filter that was selecting rows where the Column 1 value included a colon, we get the original data back along with a second column.

delete the filter

Starting at the top of the column, the “Fill Down” cell operation will fill empty cells with the value of the cell above.

fill down

If we now add the “colon filter” back to Column 1, to just show the area rows, we can highlight all those rows, then delete them. We’ll then be presented with the two column data set without the area rows.

reset filter, star rows, then remove them...

Let’s just tidy up the Wards column too, by getting rid of the colon. To do that, we can transform the cell…

we're going to tidy

…by replacing the colon with nothing (an empty string).

tidy the column
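
If you want to type the transform in directly, it can be expressed as a GREL one-liner:

value.replace(':','')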

Here’s the data – neat and tidy:-)

Neat and tidy...

To finish, let’s export the data.

prepare to export

How about sending it to a Google Fusion table? (You may be asked to authenticate or verify the request.)

upload to fusion table

And here it is:-)

data in a fusion table

So – that’s a quick example of some of the data cleaning tricks and operations that OpenRefine supports. There are many, many more, of course…;-)

Simple Map Making With Google Fusion Tables

A quicker than quick recipe to make a map from a list of addresses in a simple text file using Google Fusion tables…

Here’s some data (grabbed from The Gravesend Reporter via this recipe) in a simple two column CSV format; the first column contains address data. Here’s what it looks like when I import it into Google Fusion Tables:

data in a fusion table

Now let’s map it:-)

First of all we need to tell the application which column contains the data we want to geocode – that is, the address we want Fusion Tables to find the latitude and longitude co-ordinates for…

tweak the column

Then we say we want the column to be recognised as a Location column type:

change name make location

Computer says yes, highlighting the location type cells with a yellow background:

fusion table.. yellow...

As if by magic a Map tab appears (though possibly not if you are using Google Fusion Tables as part of a Google Apps account…) The geocoder also accepts hints, so we can make life easier for it by providing one;-)

map tab...

Once the points have been geocoded, they’re placed onto a map:

mapped

We can now publish the map in preparation for sharing it with the world…

publish map

We need to change the visibility of the map to something folk can see!

privacy and link

Public on the web, or just via a shared link – your choice:

make seeable

Here’s my map:-)

The data used to generate this map was originally grabbed from the Gravesend Reporter: Find your polling station ahead of the Kent County Council elections. A walkthrough of how the data was prepared can be found here: A Simple OpenRefine Example – Tidying Cut’n’Paste Data from a Web Page.

Boundary Files for Electoral Wards Covered by a Particular Geography

A week or so ago, I went looking for boundary lines for UK electoral wards, with half a mind towards trying to do something with them related to this week’s local council elections. One source I came across was the UK 2011 Census (2011 Census geography products for England and Wales), which publishes the appropriate data from across England and Wales in large single shapefiles. For the amateur cartographer (which I very much am!) wanting to work at a local level, this presents something of a challenge: not only do I have to find a way of downloading and opening the large dataset, I also need to find a way of extracting from it the data for my local area, which is what I actually want. (Getting the local data from the national dataset can be blocking, in other words.)

Another source of data is MySociety’s MapIt service, which provides data about various geographies, including electoral wards, covered by other geographical areas:

mapit areas covered

as well as boundary files for each geography in a variety of formats:

Mapit Geometry

Unfortunately, the MapIt API doesn’t (yet?!) support the ability to generate a single file that contains boundary data for all the electoral wards in a single council or local authority area. So here’s a quick hack, posted as a view on Scraperwiki, that generates a single KML file for the electoral wards contained by a council area (including the wider boundary of that council area) – KML Merge Test. (There are probably better ways of doing this?! I wonder whether I should try to make sense of the MapIt code to see if I can work out how to submit a proper patch…)

import scraperwiki,simplejson,urllib2
from lxml import etree
from copy import deepcopy


#--via @mhawksey
# query string crib https://views.scraperwiki.com/run/python_querystring_cheat_sheet/?
import cgi, os
qstring=os.getenv("QUERY_STRING")

key='65791' #Use the Isle of Wight as a default
typ='UTE'
#typs are:
#CTY (county council), CED (county ward), COI (Isles of Scilly), COP (Isles of Scilly parish), CPC (civil parish/community), CPW (civil parish/community ward), DIS (district council), DIW (district ward), EUR (Euro region), GLA (London Assembly), LAC (London Assembly constituency), LBO (London borough), LBW (London ward), LGD (NI council), LGE (NI electoral area), LGW (NI ward), MTD (Metropolitan district), MTW (Metropolitan ward), NIE (NI Assembly constituency), OLF (Lower Layer Super Output Area, Full), OLG (Lower Layer Super Output Area, Generalised), OMF (Middle Layer Super Output Area, Full), OMG (Middle Layer Super Output Area, Generalised), SPC (Scottish Parliament constituency), SPE (Scottish Parliament region), UTA (Unitary authority), UTE (Unitary authority electoral division), UTW (Unitary authority ward), WAC (Welsh Assembly constituency), WAE (Welsh Assembly region), WMC (UK Parliamentary constituency)

if qstring!=None:
    get = dict(cgi.parse_qsl(qstring))
    if 'key' in get: key=get['key']
    if 'typ' in get: typ=get['typ'] 
#---


#Get a stub KML file for the local council level
url='http://mapit.mysociety.org/area/'+str(key)+'.kml'
xmlraw = urllib2.urlopen(url).read()
xml=etree.fromstring(xmlraw)

#Get the list of electoral wards covered by that council area
wards=simplejson.load(urllib2.urlopen('http://mapit.mysociety.org/area/'+str(key)+'/covers?type='+typ))

#Get the KML for each ward, extract the Placemark data, and add it to our comprehensive KML tree
for ward in wards:
    url='http://mapit.mysociety.org/area/'+ward+'.kml'
    xmlraw = scraperwiki.scrape(url)
    xml2=etree.fromstring(xmlraw)
    p= xml2.xpath('//geo:Placemark',namespaces={'geo':'http://www.opengis.net/kml/2.2'})
    xml.append( deepcopy(p[0] ))

scraperwiki.utils.httpresponseheader("Content-Type", "text/xml")
print etree.tostring(xml)

The key value is the ID of the council area within which you want to find the electoral wards. So for example the Isle of Wight parliamentary constituency page – http://mapit.mysociety.org/area/65791.html – gives us the ID 65791, which we pass as an argument to the Scraperwiki view. The subdivision we want to grab data for is given by the typ parameter:

CTY (county council), CED (county ward), COI (Isles of Scilly), COP (Isles of Scilly parish), CPC (civil parish/community), CPW (civil parish/community ward), DIS (district council), DIW (district ward), EUR (Euro region), GLA (London Assembly), LAC (London Assembly constituency), LBO (London borough), LBW (London ward), LGD (NI council), LGE (NI electoral area), LGW (NI ward), MTD (Metropolitan district), MTW (Metropolitan ward), NIE (NI Assembly constituency), OLF (Lower Layer Super Output Area, Full), OLG (Lower Layer Super Output Area, Generalised), OMF (Middle Layer Super Output Area, Full), OMG (Middle Layer Super Output Area, Generalised), SPC (Scottish Parliament constituency), SPE (Scottish Parliament region), UTA (Unitary authority), UTE (Unitary authority electoral division), UTW (Unitary authority ward), WAC (Welsh Assembly constituency), WAE (Welsh Assembly region), WMC (UK Parliamentary constituency)

So for example, here’s a link to an aggregate KML file for Unitary authority electoral divisions (UTE) on the Isle of Wight – https://views.scraperwiki.com/run/kml_merge_test/?key=65791; and here’s one for Unitary authority wards (UTW) in Milton Keynes: https://views.scraperwiki.com/run/kml_merge_test/?key=2588&typ=UTW

If you save the resulting file as a .kml file (for example, as kml_merge_test_mk.kml) you can then load it into something like Google Fusion tables to view it:

Google fusion table map

Note that the MapIt API is rate limited (I think), so be gentle ;-)

UK Local Elections 2013 – Live Data, Live Results Maps…

I wonder whether this will be the last round of elections without a national live data feed from somewhere pushing out the results in a standardised form? So far, here are the local “live election data” initiatives I’ve spotted/had pointed out to me:

Lincolnshire

The Lincolnite – Lincolnshire Local Elections 2013, described here: The Lincolnite to cover elections live with interactive map. (Ex-?) University of Lincoln developer Alex Bilbie (@alexbilbie), who built the Lincolnshire map app, describes a little of the process behind it here Developing an interactive county council election map (part one).

Lincolnshire live election map

Warwickshire

2013 Elections: Warwickshire area, described here: 2013 Elections – In Real Time.

warwickshire live election map

The Warwickshire team are also making shapefiles/KML files (for the plotting of boundary line maps) and live results data (via Google Fusion Tables) too, as well as making data available about previously elected candidates: 2013 Elections – In Real Time

Here’s the map after the fact… I like the clear statement of seats by party in the bottom left corner too…

Warwickshire

Surrey

Surrey has the Surrey elections dashboard (via @BenUnsworth) that will switch to a live map as the results come in, but currently offers a search box that accepts a postcode and then tells you who your candidates are and where you can vote:

Surrey elections dashboard

Kent

This looks pretty, from Kent County Council:

Kent live election results

I was looking forward to seeing how this view played out once the results started to come in, but, erm, oops?!

oops - Kent

Managing the bursty load on an election results service server is probably something worth building into the planning…

Bristol

Bristol opted for a tabbed display showing a range of views over their council election results. A simple coloured symbol map shows the distribution of seats by ward across the parties:

Bristol election map

Bristol also provided a view over the turnout:

Bristol turnout

(Would it be useful to also be able to see this as percentage turnouts? Or to depict the proportional turnout on a map to see if any geographical reasons jump out as a source of possible differences?)

Cumbria

Cumbria County Council show how to make use of boundary files to create choropleth election maps relating the party affiliation of the candidate taking each particular seat to its electoral division area:

Cumbria election map

Cumbria also provided a view of seat allocations in districts; I don’t understand the scale they used on the x-axes though? It differs from district to district. For example, it looks to me as if more seats went to Conservatives in Eden than in Carlisle? Or is the scale related to the percentage of seats in the district? I’d class it as “infographic-standard”, i.e. meaningless as a visualisation;-)

Cumbria district summary - bad scales?

Norfolk

Norfolk’s election map looks quite, erm, “child-friendly” (chunky?! kids TV?) to me?

Norfolk election map

Norfolk also produced a graphic showing how seats might be distributed in the chamber:

Norfolk seats

I think one of the major issues with this sort of graphic is how you communicate the possible structurings of the chamber based on what sort of affiliations and groupings play out?

Wales Online

Wales online have a nice clean feel to their results map for Anglesey, but what’s going on with the legend? They don’t make it easy to get the branding into the screengrab either?!

Wales online report - Anglesey

National Reporting

The Telegraph produced a map showing results of the elections at national scale based on control of councils by party:

Telegraph

And ever helpful, the Guardian datablog made the data available (will they do data broken down at seat level too, I wonder?) Here’s the map they produced:

guardian election map

Hmmm… do I recognise that sort of layout? Ah, I know, it reminds me of this example of Data Journalists Engaging in Co-Innovation… around boundary changes.

Other…

For lists of current councillors, see OpenlyLocal, which has data available via a JSON API relating to current councillors and their affiliations. (It would be good if a frozen snapshot of this could be grabbed today, for comparison with the results following today’s election?)

This may also be of interest… UK Data Service British Local Election Database, 1889-2003 and Andrew Teale’s Local Elections Archive Project.

Data relating to general elections can be found on the Electoral Commission website: General Elections. TheyWorkForYou provide an API over current MPs by constituency, and MySociety also produce the MapIt service for accessing constituency and electoral division boundary line data files.

If you’re interested in doing something data related around the election, or would like to learn how to do something with the data generated by the election, check out this informal resource co-ordination document. If you’re interested in checking out your local council website to see whether they publish any #opendata that would help you generate your own live maps, dashboards or consoles, the School of Data post “wot I wrote” on Proving the Data – A Quick Guide to Mapping England and Wales Local Elections may provide you with a quick start guide to making use of some of it…

If you know of any other councils or local presses publishing election related data warez, maps, live data feeds, etc, please post a link and brief description in the comments, and I’ll try to keep this post up to date…

A Wrangling Example With OpenRefine: Making “Oven Ready Data”

As well as being a great tool for cleaning data, OpenRefine can also be used to good effect when you need to wrangle a dataset into another shape. Take this set of local election results published by the Isle of Wight local online news blog, onthewight.com:

onthewight results

There’s lots of information in there (rank of each candidate for each electoral division, votes cast per candidate, size of electorate for the division, and hence percentage turnout, and so on), and it’s very nearly available in a ready data format – that is, a data format that is ready for reuse… Something like this, for example:

Slightly tidier

Or how about something like this, that shows the size of the electorate for each ward:

turnout

So how can we get from the OnTheWight results into a ready data format?

Let’s start by copying all the data from OnTheWight (click into the results frame, select all (ctrl-A) and copy (ctrl-c); I’ve also posted a copy of the data I grabbed here*), then paste the data into a new OpenRefine project:

Paste data into OpenRefine

* there were a couple of data quality issues (now resolved in the sheet published by OnTheWight) which relate to the archived data file/data used in this walkthrough. Here are the change notes from @onTheWight:

_Corrected vote numbers
Totland - Winning votes wrong - missed zero off end - 420 not 42
Brading, St Helens & Bembridge - Mike Tarrant (UKIP) got 741 not 714

_Votes won by figures – filled in
Lots of the ‘Votes won by figures’ had the wrong number in them. It’s one of the few figures that needed a manual formula update and in the rush of results (you heard how fast they come), it just wasn’t possible.

‘Postal votes (inc)’ line inserted between ‘Total votes cast’ and ‘Papers spoilt’

Deleted an empty row from Ventnor West

The data format is “tab separated”, so we can import it as such. We might as well get rid of the blank lines at the same time.

import data as TSV no blanks

You also need to ensure that the Parse cell text into number/dates… option is selected.

OpenRefine

Here’s what we end up with:

Election data raw import

The data format I want has a column specifying the ward each candidate stood in. Let’s start by creating a new column that is a copy of the column that has the Electoral Division names in it:

Copy a column

Let’s define the new column as having exactly the same value as the original column:

Create new col as copy of old

Now we start puzzling based on what we want to achieve bearing in mind what we can do with OpenRefine. (Sometimes there are many ways of solving a problem, sometimes there is only one, sometimes there may not be any obvious route…)

The Electoral Division column contains the names of the Electoral Divisions on some rows, and numbers (highlighted green) on others. If we identify the rows containing numbers in that column, we can blank them out… The Numeric facet will let us do that:

Facet the numbers

Select just the rows containing a numeric value in the Electoral Division column, and then replace those values with blanks.

IW_results_txt_-_OpenRefine
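
(In case it’s not obvious from the screengrab how the blanking is done: with the numeric facet still applied, one route is Edit cells > Transform… on the Electoral Division column with an expression that returns an empty string – i.e. just '' – though any equivalent way of emptying the matching cells will do.)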

Then remove the numeric facet filter:

filter update

Here’s the result, much tidier:

Much tidier

Before we fill in the blanks with the Electoral Division names, let’s just note that there is at least one “messy” row in there corresponding to Winning Margin. We don’t really need that row – we can always calculate it – so let’s remove it. One way of doing this is to display just the rows containing the “Winning margin” string in column three, and then delete them. We can use the Text filter to highlight the rows:

Select OpenRefine filter

Simply state the value you want to filter on and blitz the matching rows…

Choose rows then blitz them

…then remove the filter:

then remove the filter

We can now fill down the blanks in the Electoral Division column:

Fill down on Electoral Division

Fill down starts at the top of the column then works its way down, filling in blank cells in that column with whatever was in the cell immediately above.

Filled down - now flag unwanted row

Looking at the data, I notice the first row is also “unwanted”. If we flag it, we can then facet/filter on that row from the All menu:

facet on flagged row

Then we can Remove all matching rows from the cell menu as we did above, then remove the facet.

Now we can turn to just getting the data relating to votes cast per candidate (we could also leave in the other returns). Let’s use a trick we’ve used before – facet by numeric:

Remove header rows

And then this time just retain the non-numeric rows.

Electoral ward properties

Hmmm… before we remove it, this data could be worth keeping too in its own right? Let’s rename the columns:

Rename column

Like so:

columns renamed

Now let’s just make those comma-mangled numbers into numbers, by transforming them:

transform the cells by removing commas

The transform we’re going to use is to replace the comma by nothing:

replace comma

Then convert the values to a number type.

then convert to number
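
(As an aside, the comma-stripping and the number conversion can be rolled into a single transform with a GREL expression along the lines of value.replace(',','').toNumber() – but doing it in two steps via the menus works just as well.)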

We can then do the same thing for the Number on Roll column:

reuse is good

We seem to have a rogue row in there too – a Labour candidate with a 0% poll. We can flag that row and delete it as we did above.

Final stages of electoral division data

There also seem to be a couple of other scrappy rows – the overall count and another rogue percentage bearing line, so again we can flag these, do an All facet on them, remove all rows and then remove the flag facet.

a little more tidying to do

Having done that, we can take the opportunity to export the data.

openrefine exporter

Using the custom tabular exporter, we can select the columns we wish to export.

Export column select

Then we can export the data to the desktop as a file in a variety of formats:

OpenRefine export download

Or we can upload it to a Google document store, such as Google Spreadsheets or Google Fusion Tables:

OpenRefine upload to Google

Here’s the data I uploaded.

If we go back to the results for candidates by ward, we can export that data too, although I’d be tempted to do a little bit more tidying, for example by removing the “Votes won by” rows, and maybe also the Total Votes Cast column. I’d probably also rename what is now the Candidates column to something more meaningful! (Can you work out how?!;-)

change filter settings

When we upload the data, we can tweak the column ordering first so that the data makes a little more sense at first glance:

reorder columns

Here’s what I uploaded to a Google spreadsheet:

Spreadsheet

[OpenRefine project file]

So – there you have it… another OpenRefine walkthrough. Part conversation with the data, part puzzle. As with most puzzles, once you start to learn the tricks, it becomes ever easier… Or you can start taking on ever more complex puzzles…

Although you may not realise it, most of the work related to generating raw graphics has now been done. Once the data has a reasonable shape to it, it becomes oven ready, data ready, and is relatively easy to work with.

Questioning Election Data to See if It Has a Story to Tell

I know, I know, the local elections are old news now, but elections come round again and again, which means building up a set of case examples of what we might be able to do – data wise – around elections in the future could be handy…

So here’s one example of a data-related question we might ask (where in this case by data I mean “information available in: a) electronic form, that b) can be represented in a structured way”): are the candidates standing in different seats local to that ward/electoral division? By “local”, I mean – can they vote in that ward by virtue of having a home address that lies within that ward?

Here’s what the original data for my own local council (the Isle of Wight council, a unitary authority) looked like – a multi-page PDF document collating the Notice of polls for each electoral division (archive copy):

IW council - notice of poll

Although it’s a PDF, the document is reasonably nicely structured for scraping (I’ll do a post on this over the next week or two) – you can find a Scraperwiki scraper here. I pull out three sorts of data – information about the polling stations (the table at the bottom of the page), information about the signatories (of which, more in a later post…;-), and information about the candidates, including the electoral division in which they were standing (the “ward” column) and a home address for them, as shown here:

scraperwiki candidates

So what might we be able to do with this information? Does the home address take us anywhere interesting? Maybe. If we can easily look up the electoral division the home addresses fall in, we have a handful of news story search opportunities: 1) to what extent are candidates – and election winners – “local”? 2) do any of the parties appear to favour standing in/out of ward candidates? 3) if candidates are standing out of their home ward, why? If we complement the data with information about the number of votes cast for each candidate, might we be able to find any patterns suggestive of a beneficial or detrimental effect of living within, or outside of, the electoral division a candidate is standing in, and so on.

In this post, I’ll describe a way of having a conversation with the data using OpenRefine and Google Fusion Tables as a way of starting to explore some of the stories we may be able to tell with, and around, the data. (Bruce McPherson/Excel Liberation blog has also posted an Excel version of the methods described in the post: Mashing up electoral data. Thanks, Bruce:-)

Let’s get the data into OpenRefine so we can start to work it. Scraperwiki provides a CSV output format for each scraper table, so we can get a URL for it that we can then use to pull the data into OpenRefine:

scraperwiki CSV export

In OpenRefine, we can Create a New Project and then import the data directly:

openrefine import from URL

The data is in comma separated CSV format, so let’s specify that:

import as csv comma separated

We can then name and create the project and we’re ready to start…

…but start what? If we want to find out if a candidate lives in ward or out of ward, we either need to know whether their address is in ward or out of ward, or we need to find out which ward their address is in and then see if it is the same as the one they are standing in.

Now it just so happens (:-) that MySociety run a service called MapIt that lets you submit a postcode and it tells you a whole host of things about what administrative areas that postcode is in, including (in this case) the unitary authority electoral division.

mapit postcode lookup

And what’s more, MapIt also makes the data available in a format that’s data ready for OpenRefine to be able to read at a web address (aka a URL) that we can construct from a postcode:

mapit json

Here’s an example of just such a web address: http://mapit.mysociety.org/postcode/PO36%200JT

Can you see the postcode in there? http://mapit.mysociety.org/postcode/PO36%200JT

The %20 is a character encoding for a space. In this case, we can also use a +.

So – to get information about the electoral division an address lays in, we need to get the postcode, construct a URL to pull down corresponding data from MapIt, and then figure out some way to get the electoral division name out of the data. But one step at a time, eh?!;-)

Hmmm… I wonder if postcode areas necessarily fall within electoral divisions? I can imagine (though it may be incorrect to do so!) a situation where a division boundary falls within a postcode area, so we need to be suspicious about the result, or at least bear in mind that an address falling near a division boundary may be wrongly classified. (I guess if we plot postcodes on a map, we could look to see how close to the boundary line they are, because we already know how to plot boundary lines.)

To grab the postcode, a quick skim of the addresses suggests that they are written in a standard way – the postcode always seems to appear at the end of the string preceded by a comma. We can use this information to extract the postcode, by splitting the address at each comma into an ordered list of chunks, then picking the last item in the list. Because the postcode might be preceded by a space character, it’s often convenient for us to strip() any white space surrounding it.

What we want to do then is to create a new, derived column based on the address:

Add derived column

And we do this by creating a list of comma separated chunks from the address, picking the last one (by counting backwards from the end of the list), and then stripping off any whitespace/space characters that surround it:

grab a postcode
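
In case the expression in the screengrab is hard to read, it was something along these lines (treat the exact form as indicative):

value.split(',')[-1].strip()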

Here’s the result…

postcodes...

Having got the postcode, we can now generate a URL from it and then pull down the data from each URL:

col from URL

When constructing the web address, we need to remember to encode the postcode by escaping it so as not to break the URL:

get data from URL
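
The URL-generating expression follows the same pattern as the Ordnance Survey one shown later in this post – something like:

'http://mapit.mysociety.org/postcode/' + escape(value,'url')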

The throttle value slows down the rate at which OpenRefine loads in data from the URLs. If we set it to 500 milliseconds, it will load one page every half a second.

When it’s loaded in all the data, we get a new column, filled with data from the MapIt service…

lots of data

We now need to parse this data (which is in a JSON format) to pull out the electoral division. There’s a bit of jiggery pokery required to do this, and I couldn’t work it out myself at first, but Stack Overflow came to the rescue:

that's handy...

We need to tweak that expression slightly by first grabbing the areas data from the full set of MapIt data. Here’s the expression I used:

filter(('[' + (value.parseJson()['areas'].replace( /"[0-9]+":/,""))[1,-1] + ']' ).parseJson(), v, v['type']=='UTE' )[0]['name']

to create a new column containing the electoral division:

parse out the electoral division

Now we can create another column, this time based on the new Electoral Division column, that compares the value against the corresponding original “ward” column value (i.e. the electoral division the candidate was standing in) and prints a message saying whether they were standing in ward or out:

inward or out
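
The expression behind that new column was along these lines – the ward column name is as it appears in my project, so treat it as indicative:

if(value==cells['ward'].value, 'in ward', 'out of ward')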

If we collapse down the spare columns, we get a clearer picture:

collapse...

Like this:

summary data

If we generate a text facet on the In/Out column, and increase the number of rows displayed, we can filter the results to show just the candidates who stood in their local electoral division (or conversely, those who stood outside it):

facet on inout

We can also start to get investigative, and ask some more questions of the data. For example, we could apply a text facet on the party/desc column to let us filter the results even more…

inout facet filter

Hmmm… were most of the Labour Party candidates standing outside their home division (and hence unable to vote for themselves?!)

Hmm.. labour out

There aren’t too many parties represented across the Island elections (a text facet on the desc/party description column should reveal them all), so it wouldn’t be too hard to treat the data as a source, get paper and pen in hand, and write down the in/out counts for each party describing the extent to which they fielded candidates who lived in the electoral divisions they were standing in (and as such, could vote for themselves!) versus those who lived “outside”. This data could reasonably be displayed using a stacked bar chart (the data collection and plotting are left as an exercise for the reader [See Bruce McPherson’s Mashing up electoral data post for a stacked bar chart view.];-) Another possible questioning line is how do the different electoral divisions fare in terms of in-vs-out resident candidates. If we pull in affluence/poverty data, might it tell us anything about the likelihood of candidates living in area, or even tell us something about the likely socio-economic standing of the candidates?

One more thing we could try to do is to geocode the postcode of the address of each candidate rather more exactly. A blog post by Ordnance Survey blogger John Goodwin (@gothwin) shows how we might do this (note: copying the code from John’s post won’t necessarily work; WordPress has a tendency to replace single quotes with all manner of exotic punctuation marks that f**k things up when you copy and paste them into forms for use in other contexts). When we “Add column by fetching URLs”, we should use something along the lines of the following:

'http://beta.data.ordnancesurvey.co.uk/datasets/code-point-open/apis/search?output=json&query=' + escape(value,'url')

os postcode lookup

The data, as imported from the Ordnance Survey, looks something like this:

o:sdata

As is the way of national services, the Ordnance Survey returns a data format that is all well and good but isn’t the one that mortals use. Many of my geo-recipes rely on latitude and longitude co-ordinates, but the call to the Ordnance Survey API returns Eastings and Northings.

Fortunately, Paul Bradshaw had come across this problem before (How to: Convert Easting/Northing into Lat/Long for an Interactive Map) and bludgeoned(?!;-) Stuart Harrison/@pezholio, ex of Lichfield Council, now of the Open Data Institute, to produce a pop-up service that returns lat/long co-ordinates in exchange for a Northing/Easting pair.

The service relies on URLs of the form http://www.uk-postcodes.com/eastingnorthing.php?easting=EASTING&northing=NORTHING, which we can construct from data returned from the Ordnance Survey API:

easting northing lat -long

Here’s what the returned lat/long data looks like:

lat-long json

We can then create a new column derived from this JSON data by parsing it as follows:

parse latlong to lat
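
I can’t read the key names back out of the screengrab, so treat this as indicative only: if the returned JSON takes the form {"lat": …, "lng": …}, then a GREL expression such as value.parseJson()['lat'] would pull out the latitude.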

A similar trick can be used to generate a column containing just the longitude data.

We can then export a view over the data to a CSV file, or direct to Google Fusion tables.

postcode lat long export

With the data in Google Fusion Tables, we can let Fusion Tables know that the Postcode lat and Postcode long columns define a location:

Fusion table edit column

Specifically, we pick either the lat or the long column and use it to cast a two column latitude and longitude location type:

fusion table config cols to location type

We can inspect the location data using a more convenient “natural” view over it…

fusion table add map

By applying a filter, we can look to see where the candidates for a particular ward have declared their home address to be:

havenstreet candidates

(Note – it would be more useful to plot these markers over a boundary line defined region corresponding to the area covered by the corresponding electoral ward. I don’t think Fusion Tables lets you do this directly (or if it does, I don’t know how to do it..!). This workaround – FusionTablesLayer Wizard – on merging outputs from Fusion Tables as separate layers on a Google Map is the closest I’ve found following a not very thorough search;-))

We can go back to the tabular view in Fusion Tables to run a filter to see who the candidates were in a particular electoral division, or we can go back to OpenRefine and run a filter (or a facet) on the ward column to see who the candidates were:

refine filter by division

Filtering on some of the other wards using local knowledge (i.e. using the filter to check/corroborate things I knew), I spotted a couple of missing markers. Going back to the OpenRefine view of the data, I ran a faceted view on the postcode to see if there were any “non-postcodes” there that would in turn break the Ordnance Survey postcode geocoding/lookup:

postcode missing...

Ah – oops… It seems we have a “data quality” issue, albeit a minor one…

So, what do we learn from all this? One takeaway for me is that data is a source we can ask questions of. If we have a story or angle in mind, we can tune our questions to tease out corroborating facts (possibly! caveat emptor applies!) that might confirm, help develop, or even cause us to rethink, the story we are working towards telling based on the support the data gives us.