# Asking Questions of Data Contained in a Google Spreadsheet Using a Basic Structured Query Language

There is an old saying along the lines of “give a man a fish and you can feed him for a day; teach a man to fish and you’ll feed him for a lifetime”. The same is true when you learn a little bit about structured query languages… In the post Asking Questions of Data – Some Simple One-Liners (or Asking Questions of Data – Garment Factories Data Expedition), you can see how the SQL query language could be used to ask questions of an election related dataset hosted on Scraperwiki [no longer there:-(] that had been compiled by scraping a “Notice of Poll” PDF document containing information about election candidates. In this post, we’ll see how a series of queries constructed along very similar lines can be applied to data contained within a Google spreadsheet using the Google Chart Tools Query Language.

To provide some sort of context, I’ll stick with the local election theme, although in this case the focus will be on *election results* data. If you want to follow along, the data can be found in this Google spreadsheet – Isle of Wight local election data results, May 2013 (the spreadsheet key is `0AirrQecc6H_vdEZOZ21sNHpibnhmaEYxbW96dkNxZGc`).

The data was obtained from a dataset originally published by the OnTheWight hyperlocal blog, then shaped and cleaned in OpenRefine using a data wrangling recipe similar to the one described in A Wrangling Example With OpenRefine: Making “Oven Ready Data”.

To query the data, I’ve popped up a simple query form on ~~Scraperwiki~~ a GitHub site: Google Spreadsheet Explorer.

To use the explorer, you need to:

- provide a spreadsheet key value and optional sheet number (for example, `0AirrQecc6H_vdEZOZ21sNHpibnhmaEYxbW96dkNxZGc`);
- preview the table headings;
- construct a query using the column letters;
- select the output format;
- run the query.
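If you would rather script these steps than use the form, the same kind of request can be put together by hand. The following is a minimal sketch, not the explorer's own code: it assumes the spreadsheet is publicly shared and reachable through the Google Visualization `gviz/tq` endpoint, and the function name `build_query_url` and its `gid` and `out` parameters are names made up for this example. It only builds the URL; it does not check whether an old-style key such as the one above still resolves.

```python
from urllib.parse import urlencode

# Assumption: the sheet is public and served via the "gviz/tq" endpoint.
GVIZ_URL = "https://docs.google.com/spreadsheets/d/{key}/gviz/tq"


def build_query_url(key, query, gid=0, out="csv"):
    """Return a URL that runs `query` against one sheet of a spreadsheet.

    key   -- the spreadsheet key
    query -- a query written using column letters, e.g. "SELECT A,D,E,F"
    gid   -- identifier of the sheet within the spreadsheet (hypothetical default)
    out   -- requested output format, e.g. "csv", "html" or "json"
    """
    params = urlencode({"tq": query, "gid": gid, "tqx": "out:" + out})
    return GVIZ_URL.format(key=key) + "?" + params


url = build_query_url(
    "0AirrQecc6H_vdEZOZ21sNHpibnhmaEYxbW96dkNxZGc",
    "SELECT A,D,E,F",
)
print(url)
```

Fetching the resulting URL (in a browser, or with something like `urllib.request.urlopen`) should then return the query result in the requested format, provided the sheet is shared appropriately.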

So what sort of questions might we want to ask of the data? Let’s build some up.

We might start by just looking at the raw results as they come out of the spreadsheet-as-database: `SELECT A,D,E,F`

We might then want to look at each electoral division seeing the results in rank order: `SELECT A,D,E,F WHERE E != 'NA' ORDER BY A,F DESC`

Let’s bring the spoiled vote count back in: `SELECT A,D,E,F WHERE E != 'NA' OR D CONTAINS 'spoil' ORDER BY A,F DESC` (we might equally have said `OR D = 'Papers spoilt'`).

How about doing some sums? How does the league table of postal ballot percentages look across each electoral division? `SELECT A,100*F/B WHERE D CONTAINS 'Postal' ORDER BY 100*F/B DESC`

Suppose we want to look at the turnout. The “NoONRoll” column B gives the number of people eligible to vote in each electoral division, which is a good start. Unfortunately, using the data in the spreadsheet we have, we can’t do this for all electoral divisions – the “votes cast” is not necessarily the number of people who voted because some electoral divisions (Brading, St Helens & Bembridge and Nettlestone & Seaview) returned *two* candidates (which meant people voting were each allowed to cast up to and including two votes; the number of people who voted was in the original OnTheWight dataset). If we bear this *caveat* in mind, we can run the numbers for the other electoral divisions though.

The `Total votes cast` is actually the number of “good” votes cast – the turnout was actually the `Total votes cast` *plus* the `Papers spoilt`. Let’s start by calculating the “good vote turnout” for each electoral division, ranking the electoral divisions by turnout (`ORDER BY 100*F/B DESC`), labelling the turnout column appropriately (`LABEL 100*F/B 'Percentage'`) and formatting the results (`FORMAT 100*F/B '#,#0.0'`) using the query `SELECT A, 100*F/B WHERE D CONTAINS 'Total' ORDER BY 100*F/B DESC LABEL 100*F/B 'Percentage' FORMAT 100*F/B '#,#0.0'`

Remember, the first two results are “nonsense” because electors in those electoral divisions may have cast two votes.

How about the three electoral divisions with the lowest turnout? `SELECT A, 100*F/B WHERE D CONTAINS 'Total' ORDER BY 100*F/B ASC LIMIT 3 LABEL 100*F/B 'Percentage' FORMAT 100*F/B '#,#0.0'` (Note that the order of the clauses, such as where to put the `LIMIT`, is important; the wrong order can prevent the query from running…)
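One way to avoid tripping over clause order is to let a small helper do the ordering. The sketch below is only an illustration: `assemble_query` and `CLAUSE_ORDER` are names invented here, and the list reflects the clause order given in the Google Visualization API Query Language reference (select, where, group by, pivot, order by, limit, offset, label, format, options).

```python
# Clause order expected by the query language.
CLAUSE_ORDER = [
    "select", "where", "group by", "pivot", "order by",
    "limit", "offset", "label", "format", "options",
]


def assemble_query(**clauses):
    """Join the supplied clauses in the order the query language expects.

    Keyword names use underscores in place of spaces, e.g. order_by="A".
    """
    given = {name.replace("_", " "): str(value) for name, value in clauses.items()}
    unknown = set(given) - set(CLAUSE_ORDER)
    if unknown:
        raise ValueError("Unknown clause(s): " + ", ".join(sorted(unknown)))
    return " ".join(
        name.upper() + " " + given[name] for name in CLAUSE_ORDER if name in given
    )


# Clauses deliberately supplied out of order.
query = assemble_query(
    format="100*F/B '#,#0.0'",
    label="100*F/B 'Percentage'",
    limit=3,
    order_by="100*F/B ASC",
    where="D CONTAINS 'Total'",
    select="A, 100*F/B",
)
print(query)
```

Although the clauses are passed in a jumbled order, the string that comes out is the lowest-turnout query shown above.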

The actual turnout (again, with the caveat in mind!) is the total votes cast plus the spoilt papers. To calculate this as a proportion of the electorate, we need to `SUM` the total and spoilt counts in each electoral division and divide by the size of the electoral roll. Because two rows are summed for each electoral division, column B gets counted twice as well, so we find the size of the electoral roll in each electoral division as `SUM(B)/COUNT(B)` – that is, we count it twice and divide by the number of times we counted it. The query (without tidying, and without the factor of 100 needed to turn it into a percentage) starts off looking like this: `SELECT A,SUM(F)*COUNT(B)/SUM(B) WHERE D CONTAINS 'Total' OR D CONTAINS 'spoil' GROUP BY A`
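To see why `SUM(F)*COUNT(B)/SUM(B)` gives the right answer, it can help to walk through the grouping by hand. The following sketch mimics the query in plain Python over a handful of invented rows (the ward names and numbers are made up for illustration and are not taken from the spreadsheet); the extra factor of 100 is added here to give a percentage.

```python
from collections import defaultdict

# Invented rows for illustration only: (A division, B roll, D label, F count).
rows = [
    ("Ward X", 2000, "Total votes cast", 700),
    ("Ward X", 2000, "Papers spoilt", 10),
    ("Ward Y", 3000, "Total votes cast", 1200),
    ("Ward Y", 3000, "Papers spoilt", 30),
]

groups = defaultdict(list)
for division, roll, label, count in rows:
    # WHERE D CONTAINS 'Total' OR D CONTAINS 'spoil'
    if "Total" in label or "spoil" in label:
        groups[division].append((roll, count))  # GROUP BY A

turnout = {}
for division, members in groups.items():
    sum_f = sum(count for _, count in members)  # SUM(F): good votes + spoilt
    sum_b = sum(roll for roll, _ in members)    # SUM(B): roll counted once per row
    count_b = len(members)                      # COUNT(B): rows in the group
    # SUM(F)*COUNT(B)/SUM(B), scaled by 100 to give a percentage
    turnout[division] = 100 * sum_f * count_b / sum_b

print(turnout)
```

For the made-up Ward X, the roll of 2000 is summed twice to give 4000, and multiplying by the count of 2 cancels that out, so the result is 100 × 710 / 2000 = 35.5.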

In terms of popularity, who were the top 5 candidates by number of votes received? `SELECT D,A, E, F WHERE E!='NA' ORDER BY F DESC LIMIT 5`

How about if we normalise these numbers by the number of people on the electoral roll in the corresponding areas – `SELECT D,A, E, F/B WHERE E!='NA' ORDER BY F/B DESC LIMIT 5`

Looking at the parties, how did the sum of their votes across all the electoral divisions compare? `SELECT E,SUM(F) where E!='NA' GROUP BY E ORDER BY SUM(F) DESC`

How about if we bring in the number of candidates who stood for each party, and normalise by this to calculate the average “votes per candidate” by party? `SELECT E,SUM(F),COUNT(F), SUM(F)/COUNT(F) where E!='NA' GROUP BY E ORDER BY SUM(F)/COUNT(F) DESC`
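If the `GROUP BY` behaviour is unfamiliar, here is roughly what that last query is doing, sketched in plain Python. The party names and vote counts are invented purely for illustration and are not real results; the comments map each step back to the corresponding clause.

```python
from collections import defaultdict

# Invented rows for illustration only: (E party, F votes); not real results.
rows = [
    ("Party P", 500),
    ("Party P", 300),
    ("Party Q", 450),
    ("NA", 20),        # e.g. a spoilt papers row, which has no party
    ("Party Q", 150),
    ("Party R", 420),
]

votes = defaultdict(list)
for party, count in rows:
    if party != "NA":               # WHERE E != 'NA'
        votes[party].append(count)  # GROUP BY E

# SELECT E, SUM(F), COUNT(F), SUM(F)/COUNT(F) ORDER BY SUM(F)/COUNT(F) DESC
table = sorted(
    ((party, sum(v), len(v), sum(v) / len(v)) for party, v in votes.items()),
    key=lambda row: row[3],
    reverse=True,
)
for row in table:
    print(row)
```

With these made-up numbers, the party fielding a single candidate tops the "votes per candidate" table even though it has the smallest total, which is exactly the sort of effect the normalisation is meant to surface.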

To summarise then, in this post, we have seen how we can use a structured query language to interrogate the data contained in a Google Spreadsheet, essentially treating the Google Spreadsheet as if it were a database. The query language can also be used to perform a series of simple calculations over the data to produce a derived dataset. Unfortunately, the query language does not allow us to nest SELECT statements in the same way we can nest SQL SELECT statements, which limits some of the queries we can run.