Over the weekend, a bit of random googling with filetype:csv OR filetype:xls limits turned up an archive of what looked like a complete set of UK general election results from 2005 on the Electoral Commission website.
(By the by, I also came across the National Digital Archive of Datasets (NDAD), which looks like it could be interesting: “The National Digital Archive of Datasets (NDAD) preserves and provides online access to archived digital datasets and documents from UK central government departments. Our collection spans 40 years of recent history, with the earliest available dataset dating back to about 1963.”)
Anyway, the election results data was presented in a CSV file with a format that looked like:
The first thing I wanted to try to do with it was to produce an election votes treemap (e.g. along similar lines to Work In Progress – Voting Treemaps). The target visualisation tool was Many Eyes, simply because it’s quick and easy to use (I think a little bit of setting up is still required to use the wrapper I’ve placed around the JIT treemaps…:-(
In order to use the Many Eyes treemap to it’s full effect, I needed each row to start with the constituency name. Looking at the above screengrab of the Electoral Commission data, we see that the constituency is only explicitly stated for the winning candidate. So what’s the best way of filling in the blanks?
In a couple of recent talks, I’ve made the claim that library folk should be in the business of teaching on effective data handling skills, and this is a good case in point. So what would you do?
The naive response is to go into the spreadsheet and by hand click and drag each separate constituency name into the blanks. With over six hundred constituencies, that should take you how long..?!
Two rather more efficient ways came to my mind. Firstly, use a Python script to read in the CSV file, check to see if the constituency field is populated, and if it isn’t, populate it with the constituency name from the previous row:
import csv filein='candidates_49576.csv' fileout='election2005.csv' fout=open(fileout,"wb+") csv_file = csv.writer(fout) file_obj=open(filein,"rb") for line in csv.reader(file_obj, delimiter=',', # Your custom delimiter. skipinitialspace=True): # Strips whitespace after delimiter. if line: # Make sure there's at least one entry. if line[0]=='': line[0]=last else: last=line[0] csv_file.writerow( line ) fout.close()
As ever, most of the above code isn’t mine, it’s stolen (in this case, from Stack Overflow).
The second approach I thought of was to use a Google Apps script… (this exercise is left to the reader… feel free to post a copy of the script, or a link to a spreadsheet containing it, in the comments;-)
The third approach was to use a spreadsheet formula and create a new, fully populated constituency column. Here’s one formula that I thought of:
That is, if the cell to the left is empty, copy the contents of the cell above…
So now I had a copy of the data in the form I needed, I could just copy and paste it from a spreadsheet into Many Eyes
You can see a copy of a derived treemap here. The view below uses colour to show the constituencies where the winning candidate won more than half of the valid votes cast in that constituency:
The next thing I wanted to do was to look at the actual number of votes a winning candidate beat the second placed candidate by, and also view this as a percentage of the total number of valid votes cast in that constituency. To do this, I created three new columns:
– one containing the total number of valid votes cast;
– one containing the number of votes the winning candidate beat the second placed candidate by;
– one containing the number of votes the winning candidate beat the second placed candidate by divided by the total number of valid votes cast, given as a percentage.
So how would you calculate those items?
Go on – think about it – how would you do it?
Here’s how I did it…
To count the total number of votes cast in a constituency, I used the formula:
=SUMIF(A:A,A2,D:D)
(Note that before yesterday, I’d never heard of this formula… I just went googling for help on something like excel sum similar items…;-)
Select and drag this cell to populate the column…
For the vote majority formula, I only wanted this to apply to the winning candidate, so I put a trap in to see whether the constituency from the previous row is the same as the current row. If it isn’t, then because I assume the first candidate is the winning candidate, I can find the difference between the current row’s votes and that of the second placed candidate, which is immediately below:
=IF(A1<>A2,D2-D3,0)
Once again, select and drag this cell to populate the column…
I guess I really should also trap just to check that the row below is in the same constituency in case someone (the Speaker?) stands unopposed:
=IF(A1<>A2,IF(A2=A3,D2-D3),0)
The percentage value is found simply by dividing the absolute number of votes majority by the total number of valid votes cast and multiplying by 100.
I uploaded the data to Many Eyes here.
So now we can look to see the number of votes majority an individual had, for example, compared to the turn out:
This chart brings to mind two other things to me. Firstly it shows all the candidates – it would be more convenient in this case to just show the data for the winning candidates. Secondly, it suggests that we might also plot the number of votes behind the winning candidate each of the other candidates was… but that’s for another day…
What I will do though is consider the first point, specifically how we might just look at the data for the winning candidates… One way of achieving this is to use Google Spreadsheets as a database using the Google spreadsheets API as described in First Steps Towards a Generic Google Spreadsheets Query Tool, or At Least, A Guardian Datastore Interactive Playground.
Uploading the data to a Google Spreadsheet, and publishing it means I can interrogate it via the API in my Datastore Explorer (howto):
The query also generates a link to a CSV version of the data. It’s not hard then to pop the data into Many Eyes so now we can visualise the winning MP majorities more easily – here we see the votes ahead of the second placed candidate vs. the valid number of votes cast:
We can also use the Datastore explorer to interrogate the database. For example, which MPs are defending a slim majority in terms of absolute number of votes?
Okay, that’s more than enough for now – this was only supposed to be a half hour post… What I hope this post has done, though, is show how I mix and match tools and techniques as a I wander around a data set, as well as happily trawling the net to look for solutions to problems that I’m sure someone else must have solved before (such as summing a number of separate items labeled the same way in an Excel spreadsheet).
PS Somewhere along the line, I think when looking at majorities, a couple of points had negative values – the visualisation had turned up missing rows, like this one:
Wonderful, as always. The NDAD link is hugely useful – have added that to my list of data sources in the data journalism chapter I’ve been working on. How you describe your work process is really interesting too and very useful to show students. And yes, wouldn’t it be lovely if I could point them to the library and say they should ask a librarian when they have a data problem…
It would be good to know what the visualisations ‘meant’ to you at each stage – what did they tell you either about the data or about what else you needed to do to find something meaningful in it?
This web page is more informative than many others.
I am trying to build up my own set of statistics,
a page for each National Election
followed by a list of parties and votes cast in each constituency.
I thought it would be easy to find,
not so, any help would be gratefully received.
Kindest regards
Fred