In this year’s student satisfaction tables, which universities have a good teaching score but low employment prospects? How would you find out? In this post, you’ll find out…
Whether or not it was one of my resolutions, one of the things I want to do more this year is try to try to make more use of stuff that’s already out there, and come up with recipes that hopefully demonstrate to others how to make use of those resources.
So today’s trick is prompted by a request from @paulbradshaw about “how to turn a spreadsheet into a form-searchable database for users” within a Google spreadsheet (compared to querying a google spreadsheet via a URI, as described in Using Google Spreadsheets as a Database with the Google Visualisation API Query Language).
I’m not going to get as far as the form bit, but here’s how to grab details from a Google spreadsheet, such as one of the spreadsheets posted to the Guardian Datastore, and query it as if it was a database in the context of one of your own Google spreadsheets.
This trick actually relies on the original Google spreadsheet being shared in “the right way”, which for the purposes of this post we’ll take to mean – it can be viewed using a URL of the form:
http://spreadsheets.google.com/ccc?key=SPREADSHEETKEY&hl=en
(The &hl=en on the end is superfluous – it doesn’t matter if it’s not there…) The Guardian Datastore folks sometimes qualify this link with a statement of the form Link (if you have a Google Docs account).
If the link is of the form:
http://spreadsheets.google.com/pub?key=SPREADSHEETKEY
just change pub to ccc
So for example, take the case of the 2010-2011 Higher Education tables (described here):
http://spreadsheets.google.com/ccc?key=reBYenfrJHIRd4voZfiSmuw
The first thing to do is to grab a copy of the data into our own spreadsheet. So go to Google Docs, create a new spreadsheet, and in cell A1 enter the formula:
=ImportRange(“reBYenfrJHIRd4voZfiSmuw”,”Institutional Table!A1:K118″)
When you hit return, the spreadsheet should be populated with data from the Guardian Datastore spreadsheet.
So let’s see how that formula is put together.
=ImportRange(“reBYenfrJHIRd4voZfiSmuw”,”Institutional Table!A1:K118″)
Firstly, we use the =ImportRange() formula, which has the form:
=ImportRange(SPREADSHEETKEY, SHEET!RANGE)
This says that we want to import a range of cells from a sheet in another spreadsheet/workbook that we have access to (such as one we own, one that is shared with us in an appropriate way, or a public one). The KEY is the key value from the URL of the spreadsheet we want to import data from. The SHEET is the name of the sheet the data is on:
The RANGE is the range of the cells we want to copy over from the external spreadsheet.
Enter the formula into a single cell in your spreadsheet and the whole range of cells identified in the specified sheet of the original spreadsheet will be imported to your spreadsheet.
Give the sheet a name (I called mine ‘Institutional Table 2010-2011’; the default would be ‘Sheet1’).
Now we’re going to treat that imported data as if it was in a database, using the =QUERY() formula.
Create a new sheet, call it “My Queries” or something similar and in cell A1 enter the formula:
=QUERY(‘Institutional Table 2010-2011’!A1:K118,”Select A”)
What happens? Column A is pulled into the spreadsheet is what. So how does that work?
The =QUERY() formula, which has the basic form =QUERY(RANGE,DATAQUERY), allows us to run a special sort of query against the data specified in the RANGE. That is, you can think of =QUERY(RANGE,) as specifying a database; and DATAQUERY as a database query language query (sic) over that database.
So what sorts of DATAQUERY can we ask?
The simplest queries are not really queries at all, they just copy whole columns from the “database” range into our “query” spreadsheet.
So things like:
- =QUERY(‘Institutional Table 2010-2011’!A1:K118,“Select C”) to select column C;
- =QUERY(‘Institutional Table 2010-2011’!A1:K118,“Select C,D,G,H”) to select columns C, D, G and H;
So looking at copy of the data in our spreadsheet, import the columns relating to the Institution, Average Teaching Score, Expenditure per Student and Career Prospects, I’d select columns C, D, F and H:
like this:
=QUERY(‘Institutional Table 2010-2011’!A1:K118,“Select C,D, F,H”)
to give this:
(Remember that the column labels in the query refer to the spreadsheet we are treating as a database, not the columns in the query results sheet shown above.)
All well and good. But suppose we only want to look at institutions with a poor teaching score (column D), less than 40? Can we do that too? Well, yes, we can, with a query of the form:
“Select C,D, F,H where D < 40"
(The spaces around the less than sign are important… if you don’t include them, the query may not work.)
Here’s the result:
(Remember, column D in the query is actually the second selected column, which is placed into column B in the figure shown above.)
Note that we can order the results according to other columns to. So for example, to order the results according to increasing expenditure (column F), we can write:
“Select C,D, F,H where D < 40 order by F asc"
(For decreasing order, use desc.)
Note that we can run more complex queries too. So for example, if we want to find institutions with a high average teaching score (column D) but low career prospects (column H) we might ask:
“Select C,D, F,H where D > 70 and H < 70"
And so on…
Over the nect week or two, I’ll post a few more examples of how to write spreadsheet queries, as well as showing you a trick or two about how to build a simple form like interface within the spreadsheet for constructing queries automatically; but for now, why try having a quick play with the =QUERY() formula yourself?