The notion of data driven journalism appears to have some sort of traction at the moment, not least as a recognised use context of some very powerful data handling tools, as Simon “Guardian Datastore” Rogers appearance at Google I/O suggests:
(Simon’s slot starts about 34:30 in, but there’s a good tutorial intro to Fusion Tables from the start…)
As I start to doodle ideas for an open online course on something along the lines of “visually, data” to run October-December, data journalism is going to provide one of the major scenarios for working through ideas. So I guess it’s in my interest to promote this European Journalism Centre: Survey on Data Journalism to try to find out what might actually be useful to journalists…;-)
[T]he survey Data-Driven Journalism – Your opinion aims to gather the opinion of journalists on the emerging practice of data-driven journalism and their training needs in this new field. The survey should take no more than 10 minutes to complete. The results will be publicly released and one of the entries will win a EUR 100 Amazon gift voucher
I think the EJC are looking to run a series of data-driven journalism training activities/workshops too, so it’s worth keeping an eye on the EJC site if #datajourn is your thing…
PS related: the first issue of Google’s “Think Quarterly” magazine was all about data: Think Data
PPS Data in journalism often gets conflated with data visualisation, but that’s only a part of it… Where the visulisation is the thing, then here’s a few things to think about…
Ben Fry interviewed at Where 2.0 2011
A post on the Guardian Datablog earlier today took a dataset collected by the Tweetminster folk and graphed the sorts of thing that journalists tweet about ( Journalists on Twitter: how do Britain’s news organisations tweet?).
Tweetminster maintains separate lists of tweeting journalists for several different media groups, so it was easy to grab the names on each list, use the Twitter API to pull down the names of people followed by each person on the list, and then graph the friend connections between folk on the lists. The result shows that the hacks are follow each other quite closely:
Nodes are coloured by media group/Tweetminster list, and sized by PageRank, as calculated over the network using the Gephi PageRank statistic.
The force directed layout shows how folk within individual media groups tend to follow each other more intensely than they do people from other groups, but that said, inter-group following is still high. The major players across the media tweeps as a whole seem to be @arusbridger, @r4today, @skynews, @paulwaugh and @BBCLauraK.
I can generate an SVG version of the chart, and post a copy of the raw Gephi GDF data file, if anyone’s interested…
PPS for details on how the above was put together, here’s a related approach:
Trying to find useful things to do with emerging technologies in open education
Doodlings Around the Data Driven Journalism Round Table Event Hashtag Community.
For a slightly different view over the UK political Twittersphere, see Sketching the Structure of the UK Political Media Twittersphere. And for the House and Senate in the US: Sketching Connections Between US House and Senate Tweeps
It’s all very well using a Google spreadsheet as a database, but sometimes you just want to provide a simple form to let people run a particular query. Here’s a quick way of doing that within a Spreadsheet…
So for example: Can you help me crowd source a solution?. The problem is as follows:
Students will make five choices from a list of over 200 projects that have been anonymised… We will give each project a code, and have already entered all the details into an excel sheet so we can tie the project code to the supervisor.
We need a solution that will enable students to enter their project code and then have the title of the project displayed as a check to make sure they have entered the code correctly. The list of projects is just too long for a drop down list, even when split by department (around 50 in each).
Does anyone have any suggestions of tools that we can use for students to submit this type of information, so that we get it in a format that we can use, and they get confirmation of the project titles they have chosen? A simple google form isn’t going to hack it!
Here’s one way…
Create a “form” – the text entry cell can be highlighted by setting the background colour from the spreadsheet toolbar:
Construct a query. In this case, I need to select three results columns (H, I and J) from another sheet (‘Sheet1′, the one that acts as the database and contains the project codes) so the query will be of the form “select H,I,J where H contains “BIOCHEM”; the search term (“BIOCHEM”) is pulled in from the query form show above:
=concatenate(“select H,I,J where H contains ‘”,B2,”‘”)
(As a rule of thumb, if you want your query to select cells A, D, AC, the range set in the first part of the query that defines the database should span the first to the last column in the select range (Sheet1!A:AC, for example).)
By using the contains relation, this query will generate a set of results that are, in effect, a list of auto-complete suggestions as the result of a searching on a partially stated query term.
Assuming I have placed the query in cell A4, I can automatically get the results from the query as follows:
Note that it would be possible to hide the query generator (the contents of cell A4) in another sheet and just have the search box and the results displayed in the user interface sheet.
Another approach is to query the spreadsheet via its API.
So for example, if the original spreadsheet database was published as a public document, we could also grab the results as an HTML table via an API using a URI of the form:
Setting out:csv would return the results in comma separated variable format, so we could create a Yahoo pipes interface to query the form, for example:
What would be really useful would be if the Google/Yahoo widget options for the feed respected the form elements, rather than just generating a widget that displays the feed corresponding to the current Run of the pipe with the provided search terms.
Building such a widget is something I need to put on my to do list, I guess?! Sigh…
In this year’s student satisfaction tables, which universities have a good teaching score but low employment prospects? How would you find out? In this post, you’ll find out…
Whether or not it was one of my resolutions, one of the things I want to do more this year is try to try to make more use of stuff that’s already out there, and come up with recipes that hopefully demonstrate to others how to make use of those resources.
So today’s trick is prompted by a request from @paulbradshaw about “how to turn a spreadsheet into a form-searchable database for users” within a Google spreadsheet (compared to querying a google spreadsheet via a URI, as described in Using Google Spreadsheets as a Database with the Google Visualisation API Query Language).
I’m not going to get as far as the form bit, but here’s how to grab details from a Google spreadsheet, such as one of the spreadsheets posted to the Guardian Datastore, and query it as if it was a database in the context of one of your own Google spreadsheets.
This trick actually relies on the original Google spreadsheet being shared in “the right way”, which for the purposes of this post we’ll take to mean – it can be viewed using a URL of the form:
(The &hl=en on the end is superfluous – it doesn’t matter if it’s not there…) The Guardian Datastore folks sometimes qualify this link with a statement of the form Link (if you have a Google Docs account).
If the link is of the form:
just change pub to ccc
So for example, take the case of the 2010-2011 Higher Education tables (described here):
The first thing to do is to grab a copy of the data into our own spreadsheet. So go to Google Docs, create a new spreadsheet, and in cell A1 enter the formula:
When you hit return, the spreadsheet should be populated with data from the Guardian Datastore spreadsheet.
So let’s see how that formula is put together.
Firstly, we use the =ImportRange() formula, which has the form:
This says that we want to import a range of cells from a sheet in another spreadsheet/workbook that we have access to (such as one we own, one that is shared with us in an appropriate way, or a public one). The KEY is the key value from the URL of the spreadsheet we want to import data from. The SHEET is the name of the sheet the data is on:
The RANGE is the range of the cells we want to copy over from the external spreadsheet.
Enter the formula into a single cell in your spreadsheet and the whole range of cells identified in the specified sheet of the original spreadsheet will be imported to your spreadsheet.
Give the sheet a name (I called mine ‘Institutional Table 2010-2011′; the default would be ‘Sheet1′).
Now we’re going to treat that imported data as if it was in a database, using the =QUERY() formula.
Create a new sheet, call it “My Queries” or something similar and in cell A1 enter the formula:
=QUERY(‘Institutional Table 2010-2011’!A1:K118,”Select A”)
What happens? Column A is pulled into the spreadsheet is what. So how does that work?
The =QUERY() formula, which has the basic form =QUERY(RANGE,DATAQUERY), allows us to run a special sort of query against the data specified in the RANGE. That is, you can think of =QUERY(RANGE,) as specifying a database; and DATAQUERY as a database query language query (sic) over that database.
So what sorts of DATAQUERY can we ask?
The simplest queries are not really queries at all, they just copy whole columns from the “database” range into our “query” spreadsheet.
So things like:
- =QUERY(‘Institutional Table 2010-2011’!A1:K118,“Select C”) to select column C;
- =QUERY(‘Institutional Table 2010-2011’!A1:K118,“Select C,D,G,H”) to select columns C, D, G and H;
So looking at copy of the data in our spreadsheet, import the columns relating to the Institution, Average Teaching Score, Expenditure per Student and Career Prospects, I’d select columns C, D, F and H:
=QUERY(‘Institutional Table 2010-2011’!A1:K118,“Select C,D, F,H”)
to give this:
(Remember that the column labels in the query refer to the spreadsheet we are treating as a database, not the columns in the query results sheet shown above.)
All well and good. But suppose we only want to look at institutions with a poor teaching score (column D), less than 40? Can we do that too? Well, yes, we can, with a query of the form:
“Select C,D, F,H where D < 40"
(The spaces around the less than sign are important… if you don’t include them, the query may not work.)
Here’s the result:
(Remember, column D in the query is actually the second selected column, which is placed into column B in the figure shown above.)
Note that we can order the results according to other columns to. So for example, to order the results according to increasing expenditure (column F), we can write:
“Select C,D, F,H where D < 40 order by F asc"
(For decreasing order, use desc.)
Note that we can run more complex queries too. So for example, if we want to find institutions with a high average teaching score (column D) but low career prospects (column H) we might ask:
“Select C,D, F,H where D > 70 and H < 70"
And so on…
Over the nect week or two, I’ll post a few more examples of how to write spreadsheet queries, as well as showing you a trick or two about how to build a simple form like interface within the spreadsheet for constructing queries automatically; but for now, why try having a quick play with the =QUERY() formula yourself?
For once, I didn’t put links into a presentation, so here instead are the link resources for my News:Rewired presentation:
(If I get a chance over the next week or so, I may even try to make a slidecast out of the above…)
The link story for the presentation goes something like this:
- Visualising MPs’ Expenses Using Scatter Plots, Charts and Maps, reported here;
- Many Eyes – (news:rewired) MPExpensesData, Many Eyes – (news:rewired) MPExpensesData Visualisations
- MPs’ Expenss – scratchbuilt interactive Google map, MPs’ expenses (alternative map views) (not shown in presentation);
- MPs Expnses (Google spreadsheet/Guardian Datastore (via Guardian Datastore);
- MPs’ Expenses – Data Cleaning Yahoo Pipe, MP Expenses by name pipe, MP Expenses by postcode pipe;
- Demo Location Extractor/Geocoder Yahoo Pipe
- Prototype Guardian Datastore Explorer (how to);
- Linked Data Via Yahoo Pipes
- Filtering RA Data Yahoo Pipe (howto).
If there’s something “dataflow” related you’d like see explored here, please leave a request as a comment and I’ll see what I can do :-) I’ve also started a newsrw) category (view it here) which I’ll start posting relevant content to; (see also the datajourn tag).