Archive for the ‘Infoskills’ Category
There is an old saying along the lines of “give a man a fish and you can feed him for a day; teach a man to fish and you’ll feed him for a lifetime”. The same is true when you learn a little bit about structured query languages… In the post Asking Questions of Data – Some Simple One-Liners, we saw how the SQL query language could be used to ask questions of an election related dataset hosted on Scraperwiki that had been compiled by scraping a “Notice of Poll” PDF document containing information about election candidates. In this post, we’ll see how a series of queries constructed along very similar lines can be applied to data contained within a Google spreadsheet using the Google Chart Tools Query Language.
To provide some sort of context, I’ll stick with the local election theme, although in this case the focus will be on election results data. If you want to follow along, the data can be found in this Google spreadsheet – Isle of Wight local election data results, May 2013 (the spreadsheet key is 0AirrQecc6H_vdEZOZ21sNHpibnhmaEYxbW96dkNxZGc).
The data was obtained from a dataset originally published by the OnTheWight hyperlocal blog that was shaped and cleaned using OpenRefine using a data wrangling recipe similar to the one described in A Wrangling Example With OpenRefine: Making “Oven Ready Data”.
To query the data, I’ve popped up a simple query form on Scraperwiki: Google Spreadsheet Explorer
To use the explorer, you need to:
- provide a spreadsheet key value and optional sheet number (for example, 0AirrQecc6H_vdEZOZ21sNHpibnhmaEYxbW96dkNxZGc);
- preview the table headings;
- construct a query using the column letters;
- select the output format;
- run the query.
So what sort of questions might we want to ask of the data? Let’s build some up.
We might start by just looking at the raw results as they come out of the spreadsheet-as-database: SELECT A,D,E,F
We might then want to look at each electoral division seeing the results in rank order: SELECT A,D,E,F WHERE E != 'NA' ORDER BY A,F DESC
Let’s bring the spoiled vote count back in: SELECT A,D,E,F WHERE E != 'NA' OR D CONTAINS 'spoil' ORDER BY A,F DESC (we might equally have said OR D = 'Papers spoilt').
How about doing some sums? How does the league table of postal ballot percentages look across each electoral division? SELECT A,100*F/B WHERE D CONTAINS 'Postal' ORDER BY 100*F/B DESC
Suppose we want to look at the turnout. The “NoONRoll” column B gives the number of people eligible to vote in each electoral division, which is a good start. Unfortunately, using the data in the spreadsheet we have, we can’t do this for all electoral divisions – the “votes cast” is not necessarily the number of people who voted, because some electoral divisions (Brading, St Helens & Bembridge and Nettlestone & Seaview) returned two candidates, which meant each person voting was allowed to cast up to and including two votes (the number of people who voted was in the original OnTheWight dataset). If we bear this caveat in mind, we can run the numbers for the other electoral divisions, though. The Total votes cast is actually the number of “good” votes cast – the turnout was actually the Total votes cast plus the Papers spoilt. Let’s start by calculating the “good vote turnout” for each ward, rank the electoral divisions by turnout (ORDER BY 100*F/B DESC), label the turnout column appropriately (LABEL 100*F/B 'Percentage') and format the results (FORMAT 100*F/B '#,#0.0') using the query SELECT A, 100*F/B WHERE D CONTAINS 'Total' ORDER BY 100*F/B DESC LABEL 100*F/B 'Percentage' FORMAT 100*F/B '#,#0.0'
Remember, the first two results are “nonsense” because electors in those electoral divisions may have cast two votes.
How about the three electoral divisions with the lowest turnout? SELECT A, 100*F/B WHERE D CONTAINS 'Total' ORDER BY 100*F/B ASC LIMIT 3 LABEL 100*F/B 'Percentage' FORMAT 100*F/B '#,#0.0' (Note that the order of the arguments – such as where to put the LIMIT – is important; the wrong order can prevent the query from running…)
The actual turn out (again, with the caveat in mind!) is the total votes cast plus the spoilt papers. To calculate this percentage, we need to sum the total and spoilt contributions in each electoral division and divide by the size of the electoral roll. To do this, we need to SUM the corresponding quantities in each electoral division. Because multiple (two) rows are summed for each electoral division, we find the size of the electoral roll in each electoral division as SUM(B)/COUNT(B) – that is, we count it twice and divide by the number of times we counted it. The query (without tidying) starts off looking like this: SELECT A,SUM(F)*COUNT(B)/SUM(B) WHERE D CONTAINS 'Total' OR D CONTAINS 'spoil' GROUP BY A
In terms of popularity, which five candidates received the largest number of votes? SELECT D,A, E, F WHERE E!='NA' ORDER BY F DESC LIMIT 5
How about if we normalise these numbers by the number of people on the electoral roll in the corresponding areas – SELECT D,A, E, F/B WHERE E!='NA' ORDER BY F/B DESC LIMIT 5
Looking at the parties, how did the sum of their votes across all the electoral divisions compare? SELECT E,SUM(F) where E!='NA' GROUP BY E ORDER BY SUM(F) DESC
How about if we bring in the number of candidates who stood for each party, and normalise by this to calculate the average “votes per candidate” by party? SELECT E,SUM(F),COUNT(F), SUM(F)/COUNT(F) where E!='NA' GROUP BY E ORDER BY SUM(F)/COUNT(F) DESC
To summarise then, in this post we have seen how we can use a structured query language to interrogate the data contained in a Google Spreadsheet, essentially treating the Google Spreadsheet as if it were a database. The query language can also be used to perform a series of simple calculations over the data to produce a derived dataset. Unfortunately, the query language does not allow us to nest SELECT statements in the same way we can nest SQL SELECT statements, which limits some of the queries we can run.
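As an aside, you don’t need the explorer form to run these queries: the same query strings can be passed directly to the Chart Tools query interface over HTTP. Here’s a minimal sketch of the sort of request involved (the endpoint URL and the tqx=out:csv output switch reflect my understanding of the gviz API at the time of writing, which has changed form over the years, so treat this as illustrative rather than definitive):

#python code - running a Chart Tools query against a Google spreadsheet over HTTP
import urllib, urllib2

key = "0AirrQecc6H_vdEZOZ21sNHpibnhmaEYxbW96dkNxZGc"  #spreadsheet key, as above
query = "SELECT A,D,E,F WHERE E != 'NA' ORDER BY A,F DESC"

#Assumed form of the query endpoint; tqx=out:csv asks for the results back as CSV
url = "https://spreadsheets.google.com/tq?" + urllib.urlencode({
    "key": key,
    "gid": "0",
    "tq": query,
    "tqx": "out:csv",
})
print urllib2.urlopen(url).read()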
A couple of months ago, I came across an interesting slide deck reviewing some of the initiatives that Narrative Science have been involved with, including the generation of natural language interpretations of school education grade reports (I think: some natural language take on an individual’s academic scores, at least?). With MOOC fever in part focussing on the development of automated marking and feedback reports, this represents one example of how we might take numerical reports and dashboard displays and turn them into human readable text with some sort of narrative. (Narrative Science do a related thing for reports on schools themselves – How To Edit 52,000 Stories at Once.)
Whenever I come across a slide deck that I think may be in danger of being taken down (for example, because it’s buried down a downloads path on a corporate workshop promoter’s website and has CONFIDENTIAL written all over it) I try to grab a copy of it, but this presentation looked “safe” because it had been on Slideshare for some time.
Since I discovered the presentation, I’ve been recommending it to various folk, particularly slides 20-22(?), which refer to the educational example. Trying to find the slidedeck today, a web search failed to turn it up, so I had to go sniffing around to see if I had mentioned a link to the original presentation anywhere. Here’s what I found:
The Wayback machine had grabbed bits and pieces of text, but not the actual slides…
Not only did I not download the presentation, I don’t seem to have grabbed any screenshots of the slides I was particularly interested in… bah:-(
For what it’s worth, here’s the commentary:
Introduction to Narrative Science — Presentation Transcript
We Transform Data Into Stories and Insight… In Seconds
Automatically, Without Human Intervention and at a Significant Scale
To Help Companies: Create New Products, Improve Decision-Making, Optimize Customer Interactions
Customer Types Media and Data Business Publishing Companies Reporting
How Does It Work? The Data / The Facts / The Angles / The Structure / Stats / Tests / Calls / The Narrative / Language / Completed Text. Our technology platform, Quill™, is a powerful integration of artificial intelligence and data analytics that automatically transforms data into stories.
The following slides are examples of our work based upon a simple premise: structured data in, narrative out. These examples span several domains, including Sports Journalism, Financial Reporting, Real Estate, Business Intelligence, Education, and Marketing Services.
Sports Journalism: Big Ten Network – Data In
Sports Journalism: Big Ten Network – Narrative
Financial Journalism: Forbes – Data In
Financial Journalism: Forbes – Narrative
Short Sale Reporting: Data Explorers – JSON Input
Short Sale Reporting: Data Explorers – Overview North America Consumer Services Short Interest Update There has been a sharp decline in short interest in Marriott International (MAR) in the face of an 11% increase in the company’s stock price. Short holdings have declined nearly 14% over the past month to 4.9% of shares outstanding. In the last month, holdings of institutional investors who lend have remained relatively unchanged at just below 17% of the company’s shares. Investors have built up their short positions in Carnival (CCL) by 54.3% over the past month to 3.1% of shares outstanding. The share price has gained 8.3% over the past week to $31.93. Holdings of institutional investors who lend are also up slightly over the past month to just above 23% of the common shares in issue by the company. Institutional investors who make their shares available to borrow have reduced their holdings in Weight Watchers International (WTW) by more than 26% to just above 10% of total shares outstanding over the past month. Short sellers have also cut back their positions slightly to just under 6% of the market cap. The price of shares in the company has been on the rise for seven consecutive days and is now at $81.50.
Sector Reporting: Data Explorers – JSON Input
Sector Reporting: Data Explorers – Overview Thursday, October 6, 2011 12:00 PM: HEALTHCARE MIDDAY COMMENTARY: The Healthcare (XLV) sector underperformed the market in early trading on Thursday. Healthcare stocks trailed the market by 0.4%. So far, the Dow rose 0.2%, the NASDAQ saw growth of 0.8%, and the S&P500 was up 0.4%. Here are a few Healthcare stocks that bucked the sector’s downward trend. MRK (Merck & Co Inc.) erased early losses and rose 0.6% to $31.26. The company recently announced its chairman is stepping down. MRK stock traded in the range of $31.21 – $31.56. MRK’s volume was 86.1% lower than usual with 2.5 million shares trading hands. Today’s gains still leave the stock about 11.1% lower than its price three months ago. LUX (Luxottica Group) struggled in early trading but showed resilience later in the day. Shares rose 3.8% to $26.92. LUX traded in the range of $26.48 – $26.99. Luxottica Group’s early share volume was 34,155. Today’s gains still leave the stock 21.8% below its 52-week high of $34.43. The stock remains about 16.3% lower than its price three months ago. Shares of UHS (Universal Health Services Inc.) are trading at $32.89, up 81 cents (2.5%) from the previous close of $32.08. UHS traded in the range of $32.06 – $33.01…
Real Estate: Hanley Wood – Data In
Real Estate: Hanley Wood – Narrative
BI: Leading Fast Food Chain – Data In
BI: Leading Fast Food Chain – Store Level Report: January Promotion Falling Behind Region. The launch of the bagels and cream cheese promotion began this month. While your initial sales at the beginning of the promotion were on track with both your ad co-op and the region, your sales this week dropped from last week’s 142 units down to 128 units. Your morning guest count remained even across this period. Taking better advantage of this promotion should help to increase guest count and overall revenue by bringing in new customers. The new item with the greatest growth opportunity this week was the Coffee Cake Muffin. Increasing your sales by just one unit per thousand transactions to match sales in the region would add another $156 to your monthly profit. That amounts to about $1872 over the course of one year.
Education: Standardized Testing – Data In
Education: Standardized Testing – Study Recommendations
Marketing Services & Digital Media: Data In
Marketing Services & Digital Media: Narrative
PS Slideshare appears to have a new(?) feature – Saved Files – that keeps a copy of files you have downloaded. Or does it? If I save a file and someone deletes it, will the empty shell only remain in my “Saved Files” list?
Reading through another wonderful post on the FullFact blog last night (Full Fact sources index: where to find the information you need), I noticed that the linked-to resources from that post were being redirected via a Google URL:
A tweet confirmed that this wasn’t intentional, so what had happened? I gather the workflow used to generate the post was to write it in Google docs, and then copy and paste the rich/HTML text into a rich text editor in Drupal, although I couldn’t recreate this effect (and nor could FullFact). However, suitably suspicious, I started having a play, writing a simple test document in Google docs:
The Google doc automatically links the test URL I added to the document. (This is often referred to as “linkification” – if a piece of text is recognised as something that looks like a URL or web link, it gets rewritten as a clickable link. Typically, you might assume that the link you’ll now be clicking on is the link that was recognised. This may be a bad assumption to make…) If you hover over the URL as written in the document, you get a tooltip that suggests the link is to the same URL. However, if you hover over the tooltip listed URL, (or click on it) you can see from the indicator in the bottom left hand corner of the browser what the actual URL you’re clicking on is. Like this:
In this case, the link you’ll actually click on is a referral to the original link via a Google URL. This one, in fact:
What this means is that if I click on the link, Google tracks the fact that the link was clicked on. From the value of the usg variable (in this case, AFQjCNHgu25L-v9rkkMqZSX54E8kP_XR-A) it presumably also knows the source document containing the link and whatever follows from that.
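If you do end up with one of these tracked links, it’s easy enough to pull the original target back out of it. Here’s a minimal sketch (it assumes the redirect carries the target in a url or q parameter, which is what the examples above suggest):

#python code - recovering the original target from a google.com/url redirect
import urlparse

def untrack(link):
    #Parse the query string of the redirect URL
    qs = urlparse.parse_qs(urlparse.urlparse(link).query)
    #The target appears to be passed in either a 'url' or a 'q' parameter
    for key in ('url', 'q'):
        if key in qs:
            return qs[key][0]
    #Not a redirect we recognise - hand the link back untouched
    return link

print untrack("https://www.google.com/url?q=http://fullfact.org/&usg=AFQjCNHgu25L-v9rkkMqZSX54E8kP_XR-A")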
Hmmm… If I publish the document, the Google rewrite appears to be removed:
There are also several export options associated with the document:
So what links get exported?
Here’s the Word export:
That seems okay – no tracking. How about odt?
That looks okay too. RTF and HTML export also seem to publish the “clean” link.
What about PDF?
Hmm… so tracking is included here. So if you write a doc in Google docs that contains links that are autolinked, then you export that doc as PDF and share it with other folk, Google will know when folk click on that link from a copy of that PDF document and (presumably) the originally authored Google docs document (and all that that entails…)
How about if we email a doc as a PDF attachment to someone from within Google docs:
So that seems okay (untracked).
What’s the story then? FullFact claimed they cut and paste rich HTML from Google docs into a rich text editor and the Google redirection attack was inserted into the link. I couldn’t recreate that, and nor could the FullFact folk, so either there are some Google “experiments” going on, or the workflow was misremembered.
In my own experiments, I got a Google redirection from clicking links within my original document, and from the exported PDF, but not from any other formats?
So what do we learn? I guess this at least: be aware that when Google linkifies links for you, it may be redirecting clicks on those links through Google tracking servers. And that these tracked links may be propagated to exported and/or otherwise shared versions of the document.
PS see also Google/Feedburner Link Pollution or More Link Pollution – This Time from WordPress.com for more of the same, and Personal Declarations on Your Behalf – Why Visiting One Website Might Tell Another You Were There for a quick overview of what might happen when you actually land on a page…
Link rewriters are, of course, to be found in lots of other places too…
Twitter, for example, actually wraps all shared links in its t.co wrapper:
Delicious (which I’ve stopped using – I moved over to Pinboard) also uses its own proxy for clicks on stored bookmarks…
If you have any other examples, particularly of link rewriting/annotation/pollution where you wouldn’t expect it, please let me know via the comments…
I thought this was handy on the OER-DISCUSS mailing list:
Our copyright officer writes:
… US Copyright ‘Fair Use’ or S29 copying for non-commercial research and private study, which allows copying – but the key word here is ‘private’, i.e. the provisos are that you don’t make the work or copies available to anyone else.
Although there are UK Exceptions for education, they are very limited or obsolete.
S.32 (1) and (2A) do have the proviso “is not done by reprographic process” which basically means that any copying by any mechanical means is excluded, i.e. you may only copy by hand.
S36 educational provision in law for reprographic copying is
a) only applicable to passages in published works i.e. books journals etc and
b) negated because licences are now available – S.36 (3)
S.32 (2) permits only students studying courses in making Films or Film soundtracks to copy Film, broadcasts or sound recordings.
The only educational exception students can rely on is s.32(3) for Examination, although this also is potentially restrictive. For the exception to apply, the work must count towards their final grade/award, and any further dealing with the work after the examination process becomes infringement.
I’m not sure how they are using Voicethread, but if the presentations are part of their assessed coursework and only available to students, staff and examiners on the course, they may use any Copyright protected content, provided it’s all removed from availability after the assessment (not sure how this works with cloud applications though)
There is also exception s.30 for Criticism or Review, which is a general exception for all, and the copying is necessary for a genuine critique or review of it.
If the students can’t rely on the last 3 exceptions, using Copyright free or licenced material (e.g. Creative Commons), would be highly recommended.
Kate Vasili – Copyright Officer, Middlesex University, Sheppard Library
Co-Director Network Data Files in GEXF and JSON from OpenCorporates Data via Scraperwiki and networkx
I’ve been tinkering with OpenCorporates data again, tidying up the co-director scraper described in Corporate Sprawl Sketch Trawls Using OpenCorporates (the new scraper is here: Scraperwiki: opencorporates trawler) and thinking a little about working with the data as a graph/network.
What I’ve been focussing on for now are networks that show connections between directors and companies, something we might imagine as follows:
In this network, the blue circles represent companies and the red circles directors. A line connecting a director with a company says that the director is a director of that company. So for example, company C1 has just one director, specifically D1; and director D2 is director of companies C2 and C3, along with director D3.
It’s easy enough to build up a graph like this from a list of “company-director” pairs (or “relations”). These can be described using a simple two column data format, such as you might find in a simple database table or CSV (comma separated value) text file, where each row defines a separate connection:
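Purely for illustration – with made-up company and director names, not ones from the actual trawl – such a file might look something like this:

company,director
EXAMPLE HOLDINGS LIMITED,"SMITH, JOHN"
EXAMPLE CATERING LIMITED,"SMITH, JOHN"
EXAMPLE CATERING LIMITED,"JONES, ALICE"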
This is (sort of) how I’m representing data I’ve pulled from OpenCorporates, data that starts with a company seed, grabs the current directors of that target company, searches for other companies those people are directors of (using an undocumented OpenCorporates search feature – exact string matching on the director search (put the director’s name in double quotes…;-)), and then captures which of those companies share at least two directors with the original company.
In order to turn the data, which looks like this:
into a map that resembles something like this (this is actually a view over a different dataset):
we need to do a couple of things. Working backwards, these are:
- use some sort of tool to generate a pretty picture from the data;
- get the data out of the table into the tool using an appropriate exchange format.
Note that the positioning of the nodes in the visualisation may be handled in a couple of ways:
- either the visualisation tool uses a layout algorithm to work out the co-ordinates for each of the nodes; or
- the visualisation tool is passed a graph file that contains the co-ordinates saying where each node should be placed; the visualisation tool then simply lays out the graph using those provided co-ordinates.
The dataflow I’m working towards looks something like this:
networkx is a Python library (available on Scraperwiki) that makes it easy to build up representations of graphs simply by adding nodes and edges to a graph data structure. networkx can also publish data in a variety of handy exchange formats, including gexf (as used by Gephi and sigma.js), and a JSON graph representation (as used by d3.js and maybe sigma.js (example plugin?).
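By way of a minimal sketch of that, here’s roughly what building and exporting such a graph looks like (the company/director pairs are the toy ones from the figure above, not real data, and I’m writing the gexf to a local file rather than serving it from a view):

#python code - building a toy co-director graph with networkx and exporting it
import json
import networkx as nx
from networkx.readwrite import json_graph

#Made-up (company, director) pairs standing in for rows pulled from the directors_ table
pairs = [("C1", "D1"), ("C2", "D2"), ("C2", "D3"), ("C3", "D2"), ("C3", "D3")]

G = nx.Graph()
for company, director in pairs:
    G.add_node(company, type="company")
    G.add_node(director, type="director")
    G.add_edge(company, director)

#gexf export, e.g. for loading into Gephi
nx.write_gexf(G, "codirectors.gexf")

#JSON node-link export, e.g. for use with d3.js or sigma.js
print json.dumps(json_graph.node_link_data(G))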
As a quick demo, I’ve built a scraperwiki view (opencorporates trawler gexf) that pulls on a directors_ table from my opencorporates trawler and then publishes the information either as a gexf file (default) or as a JSON file using URLs of the form:
https://views.scraperwiki.com/run/opencorporates_trawler_gexf/?key=compassFood2_2 (gexf default)
This view can therefore be used to export data from my OpenCorporates trawler as a gexf file that can be imported directly into the Gephi desktop tool, or to provide a URL to some JSON data that can be visualised using a Javascript library within a web page (I started doodling the mechanics of one example here: sigmajs test; better examples of what’s possible can be seen at Exploring Data and on the Oxford Internet Institute – Visualisations blog). If anyone would like to explore building a nice GUI to my OpenCorporates trawl data, feel free:-)
We can also use networkx to publish data based on processing the network. The example graph above shows a network with two sorts of nodes, connected by edges: company nodes and director nodes. This is a special sort of graph in that companies are only ever connected to directors, and directors are only ever connected to companies. That is, the nodes fall into one of two sorts – company or director – and they only ever connect “across” node type lines. If you look at this sort of graph (sometimes referred to as a bipartite or bimodal graph) for a while, you might be able to spot how you can fiddle with it (technical term;-) to get a different view over the data, such as those directors connected to other directors by virtue of being directors of the same company:
or those companies that are connected by virtue of sharing common directors:
(Note that the lines/edges can be “weighted” to show the number of connections relating two companies or directors (that is, the number of companies that two directors are connected by, or the number of directors that two companies are connected by). We can then visually depict this weight using line/edge thickness.)
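Continuing the toy sketch from above (again, made-up data, purely illustrative), the networkx bipartite helpers will generate exactly these projected views, with edge weights counting shared directors or shared companies:

#python code - projecting the bipartite graph onto companies and onto directors
import networkx as nx
from networkx.algorithms import bipartite

#Rebuild the toy company-director graph from the earlier sketch
pairs = [("C1", "D1"), ("C2", "D2"), ("C2", "D3"), ("C3", "D2"), ("C3", "D3")]
G = nx.Graph()
G.add_edges_from(pairs)

companies = set(c for c, d in pairs)
directors = set(d for c, d in pairs)

#Companies linked by shared directors; edge 'weight' counts the directors in common
company_net = bipartite.weighted_projected_graph(G, companies)
#Directors linked by co-directorships; edge 'weight' counts the companies in common
director_net = bipartite.weighted_projected_graph(G, directors)

print company_net.edges(data=True)
#e.g. [('C2', 'C3', {'weight': 2})] - C2 and C3 share two directors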
The networkx library conveniently provides functions for generating such views over the data, which can also be accessed via my scraperwiki view:
- https://views.scraperwiki.com/run/opencorporates_trawler_gexf/?key=compassFood2_2&reduce=companies gives the view over companies connected by virtue of the fact that they share one or more common director;
- https://views.scraperwiki.com/run/opencorporates_trawler_gexf/?key=compassFood2_2&reduce=officers&output=json gives a view over how directors are connected based on being co-directors of one or more of the same companies.
As the view is parameterised via a URL, it can be used as a form of “glue logic” to bring data out of a directors table (which itself was populated by mining data from OpenCorporates in a particular way) and express it in a form that can be directly plugged in to a visualisation toolkit. Simples:-)
PS related: a templating system by Craig Russell for generating XML feeds from Google Spreadsheets – EasyOpenData.
I’ve been doodling… Following a query about the possible purchase of Twitter followers for various public figure accounts (I need to get my head round what the problem is with that exactly?!), I thought I’d have a quick look at some stats around follower groupings…
I started off with a data grab, pulling down the IDs of accounts on a particular Twitter list and then looking up the user details for each follower. This gives summary data such as the number of friends, followers and status updates; a timestamp for when the account was created; whether the account is private or not; the “location”, as well as a possibly more informative timezone field (you may tell fibs about the location setting but I suspect the casual user is more likely to set a timezone appropriate to their locale).
So what can we do with that data? Simple scatter plots, for one thing – here’s how friends vs. followers distribute for MPs on the Tweetminster UKMPs list:
We can also see how follower numbers are distributed across those MPs, for example, which looks reasonable and helps us get our eye in…:
We can also calculate ratios and then plot them – followers per day (the number of followers divided by the number of days since the account was registered, for example) vs the followers per status update (to try to get a feeling of how the number of followers relates to the number of tweets):
This particular view shows a few outliers, and allows us to spot a couple of accounts that have recently had a ‘change of use’.
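As a rough sketch of the sort of ratio calculation involved (the field names here are my own labels for the grabbed user details, not necessarily the ones in the actual data grab):

#python code - followers-per-day and followers-per-status ratios over made-up user records
users = [
    {"screen_name": "exampleMP1", "followers": 12000, "statuses": 3400, "days_on_twitter": 1500},
    {"screen_name": "exampleMP2", "followers": 800, "statuses": 150, "days_on_twitter": 400},
]

for u in users:
    u["followers_per_day"] = u["followers"] / float(u["days_on_twitter"])
    u["followers_per_status"] = u["followers"] / float(u["statuses"])

#Rank by followers per day, highest first
for u in sorted(users, key=lambda u: u["followers_per_day"], reverse=True):
    print u["screen_name"], round(u["followers_per_day"], 2), round(u["followers_per_status"], 2)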
As well as looking at the stats across the set of MPs, we can pull down the list of followers of a particular account (or sample thereof – I grabbed the lesser of all followers or 10,000 randomly sampled followers from a target account) and then look at the summary stats (number of followers, friends, date they joined Twitter, etc) over those followers.
So for example, things like this – a scatterplot of friends/follower counts similar to the one above:
…sort of. There’s something obviously odd about that graph, isn’t there? The “step up” at a friends count of 2000. This is because Twitter imposes, in most cases, a limit of 2000 friends on an account.
How about the followers per day for an account versus the number of days that account has been on Twitter, with outliers highlighted?
Alternatively, we can do counts by number of days the followers have been on Twitter:
The bump around 1500 days ago corresponds to Twitter getting suddenly popular around then, as this chart from Google Trends shows:
Sometimes, you get a distribution that is very, very wrong… Suppose we draw a histogram with bins along the x-axis that group followers by how many followers they have of their own (0-100 followers, 100-200 followers, 500-600 followers, and so on). For all the followers of a particular account, we pop each one into the corresponding bin given the number of their followers and, once they have all been allocated, count the number of people in each bin. We might normally expect to see something like this:
However, if an account is followed by lots of followers that have zero or very few followers of their own, we get a skewed distribution like this:
There’s obviously something not quite, erm, normal(?!) about this account (at least, at the time I grabbed the data, there was something not quite normal etc etc…).
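A minimal sketch of that binning step, assuming we already have a list giving the follower count of each follower of the target account (the numbers here are made up):

#python code - binning followers by their own follower counts
from collections import Counter

#Follower counts of each follower of the target account (made-up values)
follower_counts = [3, 12, 250, 890, 47, 0, 1, 5, 1320, 75]

bin_width = 100
bins = Counter((c // bin_width) * bin_width for c in follower_counts)

#Print the number of followers falling into each bin, e.g. "0-99: 7"
for lower in sorted(bins):
    print "%d-%d: %d" % (lower, lower + bin_width - 1, bins[lower])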
When we get stats from the followers of a set of folk, such as the members of a list, we can generate summary statistics over the sets of followers of each person on the list – for example, the median number of followers, or different ratios (eg mean of the friend/follower ratios for each follower). Lots of possible stats – but which ones does it make sense to look at?
Here’s one… a plot of the median followers per status ratio versus the median friend/follower ratio:
Spot the outlier ;-)
So that’s a quick review of some of the views we can get from data grabs of the user details from the followers of a particular account. A useful complement to the social positioning maps I’ve also been doing for some time:
It’s just a shame that my whitelisted Twitter API key is probably going to die in a few weeks:-(
[In the next post in this series I'll describe a plot that estimates when folk started following a particular account, and demonstrate how it can be used to identify notable "events" surrounding the person being followed...]
As the food labeling and substituted horsemeat saga rolls on, I’ve been surprised at how little use has been made of “data” to put the structure of the food chain into some sort of context* (or maybe I’ve just missed those stories?). One place that can almost always be guaranteed to post a few related datasets is the Guardian Datastore, who used EU horse import/export data to produce an interactive map of the European trade in horsemeat.
*One for the to-do list – a round up of “#ddj” stories around the episode.
(The article describes the source of the data as the European Union Unistat statistics website, although I couldn’t find a way of recreating the Guardian spreadsheet from that source. When I asked Simon Rogers how he’d come by the data, he suggested putting questions into the Eurostat press office;-)
The data published by the Guardian datastore is a matrix showing the number of horse imports/exports between EU member countries (as well as major traders outside the EU) in 2012:
One way of viewing this data structure is as an edge weighted adjacency matrix that describes a graph (a network) in which the member countries are nodes and the cells in the matrix define edge weights between country nodes. The weighted edges are also directed, signifying the flow of animals from one country to another.
Thinking about trade as flow suggests a variety of different visualisation types that build on the metaphor of flow, such as a Sankey diagram. In a Sankey diagram, edges of different thicknesses connect different nodes, with the edge thickness dependent on the amount of “stuff” flowing through that connection. (The Guardian map above also uses edge thickness to identify trade volumes.) Here’s an example of a Sankey diagram I created around the horse export data:
(The layout is a little rough and ready – I was more interested in finding a recipe for creating the base graphic – sans design tweaks;-) – from the data as supplied.)
So how did I get to the diagram from the data?
As already mentioned, the data came supplied as an adjacency matrix. The Sankey diagram depicted above was generated by passing data in an appropriate form to the Sankey diagram plugin to Mike Bostock’s d3.js graphics library. The plugin requires data in a JSON data format that describes a graph. I happen to know that the Python networkx library can generate an appropriate data object from a graph modeled using networkx, so I know that if I can generate a graph in networkx I can create a basic Sankey diagram “for free”.
So how can we create the graph from the data?
The networkx documentation describes a method – read_weighted_edgelist – for reading in a weighted adjacency matrix from a text file, and creating a network from it. If I used this to read the data in, I would get a directed network with edges going into and out of country nodes showing the number of imports and exports. However, I wanted to create a diagram in which the “import to” and “export from” nodes were distinct so that exports could be seen to flow across the diagram. The approach I took was to transform the two-dimensional adjacency matrix into a weighted edge list in which each row has three columns: exporting country, importing country, amount.
So how can we do that?
One way is to use R. Cutting and pasting the export data of interest from the spreadsheet into a text file (adding in the missing first column header as I did so) gives a source data file that looks something like this:
So how do we get from one to the other?
Here’s the R script I used – it reads the file in, does a bit of fiddling to remove commas from the numbers and turn the result into integer based numbers, and then uses the melt function from the reshape library to generate the edge list, finally filtering out edges where there were no exports:
#R code
horseexportsEU <- read.delim("~/Downloads/horseexportsEU.txt")
require(reshape)
#Get a "long" edge list
x=melt(horseexportsEU,id='COUNTRY')
#Turn the numbers into numbers by removing the comma, then casting to an integer
x$value2=as.integer(as.character(gsub(",", "", x$value, fixed = TRUE) ))
#If we have an NA (null/empty) value, make it -1
x$value2[ is.na(x$value2) ] = -1
#Column names for countries that originally contained spaces have had the spaces converted to dots. Undo that.
x$variable=gsub(".", " ", x$variable, fixed = TRUE)
#I want to export a subset of the data
xt=subset(x,value2>0,select=c('COUNTRY','variable','value2'))
#Generate a text file containing the edge list
write.table(xt, file="foo.csv", row.names=FALSE, col.names=FALSE, sep=",")
(Another way of getting a directed, weighted edge list from an adjacency table might be to import it into networkx from the weighted adjacency matrix and then export it as a weighted edge list. R also has graph libraries available, such as igraph, that can do similar things. But then, I wouldn’t have got to show the “melt” method for reshaping data;-)
Having got the data, I now use a Python script to generate a network, and then export the required JSON representation for use by the d3js Sankey plugin:
#python code
import csv
import json
import StringIO
import networkx as nx
from networkx.readwrite import json_graph

#Bring in the edge list explicitly
rawdata = '''"SLOVENIA","AUSTRIA",1200
"AUSTRIA","BELGIUM",134600
"BULGARIA","BELGIUM",181900
"CYPRUS","BELGIUM",200600
... etc
"ITALY","UNITED KINGDOM",12800
"POLAND","UNITED KINGDOM",129100'''

#We convert the rawdata string into a filestream
f = StringIO.StringIO(rawdata)
#..and then read it in as if it were a CSV file..
reader = csv.reader(f, delimiter=',')

def gNodeAdd(DG,nodelist,name):
    node=len(nodelist)
    DG.add_node(node,name=name)
    nodelist.append(name)
    return DG,nodelist

nodelist=[]
DG = nx.DiGraph()

#Here's where we build the graph
for item in reader:
    #Skip the "... etc" placeholder row (the full edge list goes in rawdata) and any blank lines
    if len(item) < 3: continue
    #Even though import and export countries have the same name, we create a unique version depending on
    # whether the country is the importer or the exporter.
    importTo=item[1]+'.'
    exportFrom=item[0]
    amount=item[2]
    if importTo not in nodelist:
        DG,nodelist=gNodeAdd(DG,nodelist,importTo)
    if exportFrom not in nodelist:
        DG,nodelist=gNodeAdd(DG,nodelist,exportFrom)
    DG.add_edge(nodelist.index(exportFrom),nodelist.index(importTo),value=amount)

json = json.dumps(json_graph.node_link_data(DG))
#The "json" serialisation can then be passed to a d3js containing web page...
Once the JSON object is generated, it can be handed over to d3.js. The whole script is available here: EU Horse imports Sankey Diagram.
What this recipe shows is how we can chain together several different tools and techniques (Google spreadsheets, R, Python, d3.js) to create a visualisation without too much effort (honestly!). Each step is actually quite simple, and with practice can be achieved quite quickly. The trick to producing the visualisation becomes one of decomposing the problem, trying to find a path from the format the data is in to start with, to a form in which it can be passed directly to a visualisation tool such as the d3js Sankey plugin.
PS In passing, as well as the data tables that can be searched on Eurostat, I also found the Eurostat Yearbook, which (for the most recent release at least), includes data tables relating to reported items:
So it seems that the more I look, the more places seem to be making data that appears in reports available as data…
As well as serendipity, I believe in confluence…
A headline in the Press Gazette declares that Trinity Mirror will be roll[ing] out five templates across 130-plus regional newspapers as emphasis moves to digital. Apparently, this follows a similar initiative by Johnston Press midway through last year: Johnston to roll out five templates for network of titles.
It seems that “key” to the Trinity Mirror initiative is the creation of a new “Shared Content Unit” based in Liverpool that will provide features content to Trinity’s papers across the UK [which] will produce material across the regional portfolio in print and online including travel, fashion, food, films, books and “other content areas that do not require a wholly local flavour”.
In my local rag last week (the Isle of Wight County Press), a front page story on the Island’s gambling habit localised a national report by the Campaign for Fairer Gambling on Fixed Odds Betting Terminals. The report included a dataset (“To find the stats for your area download the spreadsheet here and click on the arrow in column E to search for your MP”) that I’m guessing (I haven’t checked…) provided some of the numerical facts in the story. (The Guardian Datastore also republished the data (£5bn gambled on Britain’s poorest high streets: see the data) with an additional column relating to “claimant count”, presumably the number of unemployment benefit claimants in each area (again, I haven’t checked…).) Localisation appeared in several senses:
So for example, the number of local betting shops and Fixed Odds betting terminals was identified, the mooted spend across those and the spend per head of population. Sensemaking of the figures was also applied by relating the spend to an equivalent number of NHS procedures or police jobs. (Things like the BBC Dimensions How Big Really provide one way of coming up with equivalent or corresponding quantities, at least in geographical area terms. (There is also a “How Many Really” visualisation for comparing populations.) Any other services out there like this? Maybe it’s possible to craft Wolfram Alpha queries to do this?)
Something else I spotted, via RBloggers: a post by Alex Singleton of the University of Liverpool describing an Open Atlas around the 2011 Census for England and Wales. Alex has “been busy writing (and then running – around 4 days!) a set of R code that would map every Key Statistics variable for all local authority districts”. The result is a set of PDF docs for each Local Authority district mapping out each indicator. As well as publishing the separate PDFs, Alex has made the code available.
So what’s confluential about those?
The IWCP article localises the Fairer Gambling data in several ways:
- the extent of the “problem” in the local area, in terms of numbers of betting shops and terminals;
- a consideration of what the spend equates to on a per capita basis (the report might also have used a population of over 18s to work out the average “per adult islander”); note that there are also at least a couple of significant problems with calculating per capita averages in this example: first, the Island is a holiday destination, and the population swings over the summer months; secondly, do holidaymakers spend differently to residents on these machines?
- a corresponding quantity explanation that recasts the numbers into an equivalent spend on matters with relevant local interest.
The Census Atlas takes one recipe and uses it to create localised reports for each LA district. (I’m guessing that with a quick tweak, separate reports could be generated for the different areas within a single Local Authority).
Trinity Mirror’s “Shared Content Unit” will produce content “that do[es] not require a wholly local flavour”, presumably syndicating it to its relevant outlets. But it’s not hard to also imagine a “Localisable Content” unit that develops applications that can help produce localised variants of “templated” stories produced centrally. This needn’t be quite as automated as the line taken by computational story generation outfits such as Narrative Science (for example, Can the Computers at Narrative Science Replace Paid Writers? or Can an Algorithm Write a Better News Story Than a Human Reporter?) but could instead produce a story outline or shell that can be localised.
A shorter term approach might be to centrally produce data driven applications that can be used to generate charts, for example, relevant to a locale in an appropriate style. So for example, using my current tool of choice for generating charts, R, we could generate something and then allow local press to grab data relevant to them and generate a chart in an appropriate style (for example, Style your R charts like the Economist, Tableau … or XKCD). This approach saves duplication of effort in getting the data, cleaning it, building basic analysis and chart tools around it, and so on, whilst allowing for local customisation in the data views presented. It is also supported by the increasing number of workflows available around R (for example, RPubs, knitr, github, and a new phase for the lab notebook, Create elegant, interactive presentations from R with Slidify, or [Wordpress] Bloggin’ from R).
Using R frameworks such as Shiny, we can quickly build applications such as my example NHS Winter Sitrep data viewer (about) that explores how users may be able to generate chart reports at Trust or Strategic Health Authority level, and (if required) download data sets related to those areas alone for further analysis. The data is scraped and cleaned once, “centrally”, and common analyses and charts coded once, “centrally”, and can then be used to generate items at a local level.
The next step would be to create scripted story templates that allow journalists to pull in charts and data as required, and then add local colour – quotes from local representatives, corresponding quantities that are somehow meaningful. (I should try to build an example app from the Fairer Gambling data, maybe, and pick up on the Guardian idea of also adding in additional columns… again, something where the work can be done centrally, looking for meaningful datasets and combining them with the original data set.)
Business opportunities also arise outside media groups. For example, a similar service idea could be used to provide story templates – and pull-down local data – to hyperlocal blogs. Or a ‘data journalism wire service’ could develop applications to aid in the creation of data supported stories on a particular topic. PR companies could do a similar thing (for example, appifying the Fairer Gambling data as I “appified” the NHS Winter sitrep data, maybe adding in data such as the actual location of fixed odds betting terminals. (On my to do list is packaging up the recently announced UCAS 2013 entries data.)).
The insight here is not to produce interactive data apps (aka “news applications”) for “readers” who have no idea how to use them or what to read from them, whatever stories they might tell; rather, it is the production of interactive applications for generating charts and data views that can be used by a “data” journalist. Rather than having a local journalist working with a local team of developers and designers to get a data flavoured story out, a central team produces a single application that local journalists can use to create a localised version of a particular story that has local meaning but works at national scale.
Note that by concentrating specialisms in a central team, there may also be the opportunity to then start exploring the algorithmic annotation of local data records. It is worth noting that Narrative Science are already engaged in this sort of activity too, as for example described in this ProPublica article on How To Edit 52,000 Stories at Once, a news application that includes “short narrative descriptions of almost all of the more than 52,000 schools in our database, generated algorithmically by Narrative Science”.
PS Hmm… I wonder… is there time to get a proposal together on this sort of idea for the Carnegie Trust Neighbourhood News Competition? Get in touch if you’re interested…
Slides without commentary from a presentation I gave to undergrads on data journalism at the University of Lincoln yesterday…
My plan is to write some words around this deck, (or maybe even to record (and perhaps transcribe), some sort of narration…) just not right now…
I’m happy to give variants of this presentation elsewhere, if you can cover costs…
One of my most successful posts in terms of lifetime traffic numbers has been a recipe for scraping data from a Wikipedia page, pulling it into a Google spreadsheet, publishing it as CSV, pulling it into a Yahoo Pipe, geo-coding it, publishing it as a KML file, displaying the KML in Google maps and embedding the map in another page (which could in principle be the original Wikipedia page): Data Scraping Wikipedia with Google Spreadsheets.
A variant of this recipe in other words:
Running the hack now on a new source web page, we get the following data pulled into the spreadsheet:
And the following appearing in the pipe (I am trying to replace the first line with my own headers):
The CSV file appears to be misbehaving… Downloading the CSV data and looking at it in TextWrangler, a text editor, we start to see what’s wrong:
The text editor creates line numbers for things it sees as separate, well formed rows in the CSV data. We see that the header, which should be a single row, is actually spread over four rows. In addition, the London data is split over two rows. The line for Greater Manchester behaves correctly…: if you look at the line numbers, you can see line 7 overflows in the editor (the … in the line number count shows the CSV line (a separate data row) has overflowed the width of the editor and been wrapped round in the editor view).
If I tell the editor to stop “soft wrapping” each line of data in the CSV file, the editor displays each line of the CSV file on a single line in the editor:
So… where does this get us in fixing the pipe? Not too far. We can skip the first 5 lines of the file that we import into the pipe, and that gets around all the messed up line breaks at the top of the file, but we lose the row containing the data for London. In the short term, this is probably the pragmatic thing to do.
Next up, we might look to the Wikipedia page and see how the elements that appear to be breaking the CSV file might be fixed to unbreak them. Finally, we could go to the Google Spreadsheets forums and complain about the pile-of-crap broken CSV generation that the Googlers appear to have implemented…
PS MartinH suggests a workaround in the comments, wrapping a QUERY round the import and renaming the columns…
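For what it’s worth, that sort of workaround might take a form something like =QUERY(IMPORTHTML("http://en.wikipedia.org/wiki/SOME_PAGE","table",1), "select Col1, Col2 offset 5 label Col1 'Area', Col2 'Population'", 0) – the page URL, table index, column choices and labels here are placeholders rather than the actual ones used, so treat it as a sketch of the idea rather than a tested formula.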