Asking Questions of Data Contained in a Google Spreadsheet Using a Basic Structured Query Language

There is an old saying along the lines of “give a man a fish and you can feed him for a day; teach a man to fish and you’ll feed him for a lifetime”. The same is true when you learn a little bit about structure queries languages… In the post Asking Questions of Data – Some Simple One-Liners (or Asking Questions of Data – Garment Factories Data Expedition), you can see how the SQL query language could be used to ask questions of an election related dataset hosted on Scraperwiki [no longer there:-(] that had been compiled by scraping a “Notice of Poll” PDF document containing information about election candidates. In this post, we’ll see how a series of queries constructed along very similar lines can be applied to data contained within a Google spreadsheet using the Google Chart Tools Query Language.

To provide some sort of context, I’ll stick with the local election theme, although in this case the focus will be on election results data. If you want to follow along, the data can be found in this Google spreadsheet – Isle of Wight local election data results, May 2013 (the spreadsheet key is 0AirrQecc6H_vdEZOZ21sNHpibnhmaEYxbW96dkNxZGc).

IW Poll spreadsheet

The data was obtained from a dataset originally published by the OnTheWight hyperlocal blog that was shaped and cleaned using OpenRefine using a data wrangling recipe similar to the one described in A Wrangling Example With OpenRefine: Making “Oven Ready Data”.

To query the data, I’ve popped up a simple query form on Scraperwikia Github site: Google Spreadsheet Explorer

Google spreadsheet explorer

To use the explorer, you need to:

  1. provide a spreadsheet key value and optional sheet number (for example, 0AirrQecc6H_vdEZOZ21sNHpibnhmaEYxbW96dkNxZGc);
  2. preview the table headings;
  3. construct a query using the column letters;
  4. select the output format;
  5. run the query.

So what sort of questions might we want to ask of the data? Let’s build some up.

We might start by just looking at the raw results as they come out of the spreadsheet-as-database: SELECT A,D,E,F

SImple query

We might then want to look at each electoral division seeing the results in rank order: SELECT A,D,E,F WHERE E != 'NA' ORDER BY A,F DESC

Results in order

Let’s bring the spoiled vote count back in: SELECT A,D,E,F WHERE E != 'NA' OR D CONTAINS 'spoil' ORDER BY A,F DESC (we might equally have said OR D = 'Papers spoilt').

Papers spoilt included

How about doing some sums? How does the league table of postal ballot percentages look across each electoral division? SELECT A,100*F/B WHERE D CONTAINS 'Postal' ORDER BY 100*F/B DESC

Postal Turnout

Suppose we want to look at the turnout. The “NoONRoll” column B gives the number of people eligible to vote in each electoral division, which is a good start. Unfortunately, using the data in the spreadsheet we have, we can’t do this for all electoral divisions – the “votes cast” is not necessarily the number of people who voted because some electoral divisions (Brading, St Helens & Bembridge and Nettlestone & Seaview) returned two candidates (which meant people voting were each allowed to cast up to an including two votes; the number of people who voted was in the original OnTheWight dataset). If we bear this caveat in mind, we can run the number for the other electoral divisions though. The Total votes cast is actually the number of “good” votes cast – the turnout was actually the Total votes cast plus the Papers spoilt. Let’s start by calculating the “good vote turnout” for each ward, rank the electoral divisions by turnout (ORDER BY 100*F/B DESC), label the turnout column appropriately (LABEL 100*F/B 'Percentage') and format the results ( FORMAT 100*F/B '#,#0.0') using the query SELECT A, 100*F/B WHERE D CONTAINS 'Total' ORDER BY 100*F/B DESC LABEL 100*F/B 'Percentage' FORMAT 100*F/B '#,#0.0'

Good vote turnout

Remember, the first two results are “nonsense” because electors in those electoral divisions may have cast two votes.

How about the three electoral divisions with the lowest turn out? SELECT A, 100*F/B WHERE D CONTAINS 'Total' ORDER BY 100*F/B ASC LIMIT 3 LABEL 100*F/B 'Percentage' FORMAT 100*F/B '#,#0.0' (Note that the order of the arguments – such as where to put the LIMIT – is important; the wrong order can prevent the query from running…

Worst 3 by turnout

The actual turn out (again, with the caveat in mind!) is the total votes cast plus the spoilt papers. To calculate this percentage, we need to sum the total and spoilt contributions in each electoral division and divide by the size of the electoral roll. To do this, we need to SUM the corresponding quantities in each electoral division. Because multiple (two) rows are summed for each electoral division, we find the size of the electoral roll in each electoral division as SUM(B)/COUNT(B) – that is, we count it twice and divide by the number of times we counted it. The query (without tidying) starts off looking like this: SELECT A,SUM(F)*COUNT(B)/SUM(B) WHERE D CONTAINS 'Total' OR D CONTAINS 'spoil' GROUP BY A

Summing rows in a group

In terms of popularity, who were the top 5 candidates in terms of people receiving the largest number of votes? SELECT D,A, E, F WHERE E!='NA' ORDER BY F DESC LIMIT 5

Top 5 by votes cast

How about if we normalise these numbers by the number of people on the electoral roll in the corresponding areas – SELECT D,A, E, F/B WHERE E!='NA' ORDER BY F/B DESC LIMIT 5

TOp 5 as percentage on roll

Looking at the parties, how did the sum of their votes across all the electoral divisions compare? SELECT E,SUM(F) where E!='NA' GROUP BY E ORDER BY SUM(F) DESC

VOtes by party

How about if we bring in the number of candidates who stood for each party, and normalise by this to calculate the average “votes per candidate” by party? SELECT E,SUM(F),COUNT(F), SUM(F)/COUNT(F) where E!='NA' GROUP BY E ORDER BY SUM(F)/COUNT(F) DESC

Average votes per candidate

To summarise then, in this post, we have seen how we can use a structured query language to interrogate the data contained in a Google Spreadsheet, essentially treating the Google Spreadsheet as if it were a database. The query language can also be used to to perform a series of simple calculations over the data to produce a derived dataset. Unfortunately, the query language does not allow us to nest SELECT statements in the same way we can nest SQL SELECT statements, which limits some of the queries we can run.

One comment

  1. Cave Horse

    So you can do all this, but + (to only show pages containing a string) and “” (to only show pages containing an exact phrase) no longer work. Useful, that.