Archive for the ‘Tinkering’ Category
As well as creating data stories, should the role of a data journalist be to critique data stories put out by governments, companies, and political parties?
Via a tweet yesterday I saw a link to a data powered map from the Lib Dems (A Million Jobs), which claimed to illustrate how, through a variety of schemes, they had contributed to the creation of a million private sector jobs across the UK. Markers presumably identify where the jobs were created, and a text description pop up provides information about the corresponding scheme or initiative.
If we view source on the page, we can see where the map – and maybe the data being used to power it, comes from…
Ah ha – it’s an embedded map from a Google Fusion Table…
We can view the table itself by grabbing the key – 1whG2X7lpAT5_nfAfuRPUc146f0RVOpETXOwB8sQ – and poppiing it into a standard URL (grabbed from viewing another Fusion Table within Fusion Tables itself) of the form:
The description data is curtailed, but we can see the full description on the card view:
Unfortunately, downloads of the data have been disabled, but with a tiny bit of thought we can easily come up with a tractable, if crude, way of getting the data… You may be able to work out how when you see what it looks like when I load it into OpenRefine.
This repeating pattern of rows is one that we might often encounter in data sets pulled from reports or things like PDF documents. To be able to usefully work with this data, it would be far easier if it was arranged by column, with the groups-of-three row records arranged instead as a single row spread across three columns.
Looking through the OpenRefine column tools menu, we find a transpose tool that looks as if it may help with that:
And as if by magic, we have recreated a workable table:-)
If we generate a text facet on the descriptions, we can look to see how many markers map onto the same description (presumably, the same scheme?
If we peer a bit more closely, we see that some of the numbers relating to job site locations as referred to in the description don’t seem to tally with the number of markers? So what do the markers represent, and how do they relate to the descriptions? And furthermore – what do the actual postcodes relate to? And where are the links to formal descriptions of the schemes referred to?
What this “example” of data journalistic practice by the Lib Dems shows is how it can generate a whole wealth of additional questions, both from a critical reading just of the data itself, (for example, trying to match mentions of job locations with the number of markers on the map or rows referring to that scheme in the table), as we all question that lead on from the data – where can we find more details about the local cycling and green travel scheme that was awarded £590,000, for example?
Using similar text processing techniques to those described in Analysing UK Lobbying Data Using OpenRefine, we can also start trying to pull out some more detail from the data. For example, by observation we notice that the phrase Summary: Lib Dems in Government have given a £ starts many of the descriptions:
Using a regular expression, we can pull out the amounts that are referred to in this way and create a new column containing these values:
tmp = re.sub(r'Summary: Lib Dems in Government have given a £([0-9,\.]*).*', r'\1', tmp)
if value==tmp: tmp=''
tmp = tmp.replace(',','')
Note that there may be other text conventions describing amounts awarded that we could also try to extract as part of thes column creation.
If we cast these values to a number:
we can then use a numeric facet to help us explore the amounts.
In this case, we notice that there weren’t that many distinct factors containing the text construction we parsed, so we may need to do a little more work there to see what else we can extract. For example:
- Summary: Lib Dems in Government have secured a £73,000 grant for …
- Summary: Lib Dems in Government have secured a share of a £23,000,000 grant for … – we might not want to pull this into a “full value” column if they only got a share of the grant?
- Summary: Lib Dems in Government have given local business AJ Woods Engineering Ltd a £850,000 grant …
- Summary: Lib Dems in Government have given £982,000 to …
Here’s an improved regular expression for parsing out some more of these amounts:
tmp=re.sub(r'Summary: Lib Dems in Government have given (a )?£([0-9,\.]*).*',r'\2',tmp)
tmp=re.sub(r'Summary: Lib Dems in Government have secured a ([0-9,\.]*).*',r'\1',tmp)
tmp=re.sub(r'Summary: Lib Dems in Government have given ([^a]).* a £([0-9,\.]*) grant.*',r'\2',tmp)
So now we can start to identify some of the bigger grants…
More to add? eg around:
– ...have secured a £150,000 grant...
– Summary: Lib Dems have given a £1,571,000 grant...
– Summary: Lib Dems in Government are giving £10,000,000 to... (though maybe this should go in an ‘are giving’ column, rather than ‘have given’, cf. “will give” also…?)
– Here’s another for a ‘possible spend’ column? Summary: Lib Dems in Government have allocated £300,000 to...
Note: once you start poking around at these descriptions, you find a wealth of things like: “Summary: Lib Dems in Government have allocated £300,000 to fund the M20 Junctions 6 to 7 improvement, Maidstone , helping to reduce journey times and create 10,400 new jobs. The project will also help build 8,400 new homes.” Leading to ask the question: how many of the “one million jobs” arise from improvements to road junctions…?
In order to address this question, we might to start have a go at pulling out the number of jobs that it is claimed various schemes will create, as this column generator starts to explore:
tmp = re.sub(r'.* creat(e|ing) ([0-9,\.]*) jobs.*', r'\2', tmp)
If we start to think analytically about the text, we start to see there may be other structures we can attack… For example:
- £23,000,000 grant for local business ADS Group. … – here we might be able to pull out what an amount was awarded for, or to whom it was given.
- £950,000 to local business/project A45 Northampton to Daventry Development Link – Interim Solution A45/A5 Weedon Crossroad Improvements to improve local infastructure, creating jobs and growth – here we not only have the recipient but also the reason for the grant
But that’s for another day…
If you want to play with the data yourself, you can find it here.
To What Extent Do Candidates Support Each Other Redux – A One-Liner, Thirty Second Route to the Info
In More Storyhunting Around Local Elections Data Using Gephi – To What Extent Do Candidates Support Each Other? I described a visual route to finding out which local council candidates had supported each other on their nomination papers. There is also a thirty second route to that data that I should probably have mentioned;-)
From the Scraperwiki database, we need to interrogate the API:
To do this, we’ll use a database query language – SQL.
What we need to ask the database is which of the assentors (members of the support column) are also candidates (members of the candinit column, and just return those rows. The SQL command is simply this:
select * from support where support in (select candinit from support)
Note that “support” refers to two things here – these are columns:
select * from support where support in (select candinit from support)
and these are the table the columns are being pulled from:
select * from support where support in (select candinit from support)
Here’s the result of Runing the query:
We can also get a direct link to a tabular view of the data (or generate a link to a CSV output etc from the format selector).
There are 15 rows in this result compared to the 15 edges/connecting lines discovered in the Gephi approach, so each method corroborates the other:
More Storyhunting Around Local Elections Data Using Gephi – To What Extent Do Candidates Support Each Other?
In Questioning Election Data to See if It Has a Story to Tell I started to explore various ways in which we could start to search for stories in a dataset finessed out of a set of poll notices announcing the recent Isle of Wight Council elections. In this post, I’ll do a little more questioning, especially around the assentors (proposers, seconders etc) who supported each candidate, looking to see whether there are any social structures in there resulting from candidates supporting each others’ applications. The essence of what we’re doing is some simple social network analysis around the candidate/assentor network. (For an alternative route to the result, see To What Extent Do Candidates Support Each Other Redux – A One-Liner, Thirty Second Route to the Info.)
This is what we’ll be working towards:
If you want to play along, you can get the data from my IW poll notices scrape on ScraperWiki, specifically the support table.
Checking the extent to which candidates supported each other is something we could do by hand, looking down each candidate’s list of assentors for names of other candidates, but it would be a laborious job. It’s far easier(?!;-) to automate it…
When we want to compare names using a computer programme or script, the simplest approach is to do an exact string match (a string is a list of characters). Two strings match if they are exactly the same, so for example: This string is the same as This string, but not this string (they differ in their first character – upper case T in the first example as compared with lower case t in the last. We’ll be using exact string matching to identify whether a candidate has the same name as any of the assentors, so on the scraper, I did a little fiddling around with the names, in particular generating a new column that recasts the name of the candidate into the same presentation form used to identify the assentors (Firstname I. Lastname).
We can download a CSV representation of the data from the scraper directly:
The first thing I want to explore is the extent to which candidates support other candidates to see if we can identify any political groupings. The tool I’m going to use to visualise the data is Gephi, an open-source cross-platform application (requires Java) that you can download for free from gephi.org.
To view the data in Gephi, it’s easiest if we rename a couple of columns so that Gephi can recognise relations between supporters and candidates; if we open the CSV download file in a text editor, we can rename the candinit as target and the column as Source to represent an arrow going from an assentor to a candidate, where the arrow reads something along the lines of “is a supporter of”.
Start Gephi, select Data Laboratory tab and then New Project from the File menu.
You should now see a toolbar that includes an “Import Spreadsheet option”:
Import the CSV file as such, identifying it as an Edges Table:
You should notice that the Source and Target columns have been identified as such and we have the choice to import the other column or not – let’s bring them in…
You should now see the data has been loaded in to Gephi…
If you click on the Overview tab button, you should see a mass of nodes/circles representing candidates and assentors with arrows going from assentors to candidates.
Let’s see how they connect – we can Run the Force Atlas 2 Layout algorithm for starters. I tweaked the Scaling value and ticked on Stronger Gravity to help shape the resulting layout:
If you look closely, you’ll be able to see that there are many separate groupings of connected circles – this represent candidates who are supported by folk who are not also candidates (sometimes a node sits on top of a line so it looks as if two noes are connected when in fact they aren’t…)
However, there are also other groupings in which one candidate may support another:
These connections may allow us to see grouping of candidates supporting each other along party lines.
One of the powerful things about Gephi is that it allows us to construct quite complex, nested filters that we can apply to the data based on the properties of the network the data describes so that we can focus on particular aspects of the network I’m going to filter the network so that it shows only those individuals who are supported by at least one person (in-degree 1 or more) and who support at least one person (out-degree one or more) – that is, folk who are candidates (in-degree 1 or more) who also supported (out degree 1 or more) another candidate. Let’s also turn labels on to see which candidates the filter identifies, and colour the edges along party lines. We can now see some information about the connectedness a little more clearly:
Hmmm.. how about if we extend out filter to see who’s connected to these nodes (this might include other candidates who do not themselves assent to another candidate), and also rezise the nodes/labels so we can better see the candidates’ names. The Neigbours Network filter takes the nodes we have and then also finds the nodes that are connected to them to depth 2 in this case (that is, it brings in nodes connected to the candidates who are also supporters (depth 1), and the nodes connected to those nodes (depth two). Which is to say, it will being in the candidates who are supported by candidates, and their supporters:
That’s a bit clearer, but there are still overlapping lines, so it may make sense to layout the network again:
We can also experiment with other colourings – if we go to the Statistics panel, we can run a Connected Components filter that tries to find nodes that are connected into distinct groups. We can then colour each of the separate groups uniquely:
Let’s reset the colours and go back to colourings along party lines:
If we go to the Preview view, we can generate a prettified view of the network:
In it, we can clearly see groupings along party lines (inside the blue boxes). There is something odd, though? There appears to be a connection between UKIP and Independent groupings? Let’s zoom in:
Going back to the Graph view and zooming in, we see that Paul G. taylor appears to be supporting two candidates of different parties… Hmm – I wonder: are there actually two Paul G. Taylors, I wonder, with different political preferences? (Note to self: check on Electoral Commission website what regulations there are about assenting. Can you only assent to one person, and then only within the ward in which you are registered to vote? For local elections, could you be registered to vote in more than one electoral division within the same council area?)
To check that there are no other names that support more than one candidate, we can create another, simple filter that just selects nodes with out-degree 2 or more – that is, who support 2 or more other nodes:
Just that one then…
Looking at the fuller chart, it’s still rather scruffy. We could tidy it by removing assentors who are not themselves candidates (that is, there are no arrows pointing in to them). The way Gephi filters work support chaining. If you look at the filters, you will see they are nested, much like a nested comment thread in a forum. Filters at the bottom of the tree act on the graph and pass the filtereed network to date up the tree to the next filter. This means we can pass the network as shown above into another filter layer that removes folk who are “just” assentors and not candidates.
Here’s the result:
And again we can go into Preview mode to generate a nice vectorised version of the graph:
This quite clearly shows several mutual support networks between Labour candidates (red edges), Conservative candidates (blue edges), independents (black edges) and a large grouping of UKIP candidates (purple edges).
So there we have it a quick tour of how to use Gephi to look at the co-support structure of group of local election candidates. Were the highlighted candidates to be successful in their election, it could signify possible factions or groupings within the council, particular amongst the independents? Along the way we saw how to make use of filters, and spotted something we need to check (whether the same person supported two candidates (if that isn’t allowed?) or whether they are two different people sharing the same name.
If this all seems like too much effort, remembers that there’s always the One-Liner, Thirty Second Route to the Info.
PS by the by, a recent FOI request on WhatDoTheyKnow suggests another possible line of enquiry around possible candidates – if they have been elected to the council before, how good was their attendance record? (I don’t think OpenlyLocal scrapes this information? Presumably it is available somewhere on the council website?)
A quicker than quick recipe to make a map from a list of addresses in a simple text file using Google Fusion tables…
Here’s some data (grabbed from The Gravesend Reporter via this recipe) in a simple two column CSV format; the first column contains address data. Here’s what it looks like when I import it into Google Fusion Tables:
Now let’s map it:-)
First of all we need to tell the application which column contains the data we want to geocode – that is, the addrerss we want Fusion Tables to find the latitude and longitude co-ordinates for…
Then we say we want the column to be recognised as a column type:
Computer says yes, highlighting the location type cells with a yellow background:
As if by magic a Map tab appears (though possibly not if you are using Google Fusion Tables as apart of a Google Apps account…) The geocoder also accepts hints, so we can make life easier for it by providing one;-)
Once the points have been geocoded, they’re placed onto a map:
We can now publish the map in preparation for sharing it with the world…
We need to change the visibility of the map to something folk can see!
Public on the web, or just via a shared link – your choice:
The data used to generate this map was originally grabbed from the Gravesend Reporter: Find your polling station ahead of the Kent County Council elections. A walkthrough of how the data was prepared can be found here: A Simple OpenRefine Example – Tidying Cut’n’Paste Data from a Web Page.
Take a minute or two to try to get your head round how this data is structured… What do you see? I see different groups of addresses, one per line, separated by blank lines and grouped by “section headings” (ward names perhaps?). The ward names (if that’s what they are) are uniquely identified by the colon that ends the line they’re on. None of the actual address lines contain a colon.
Here’s how I want the data to look after I’ve cleaned it:
Can you see what needs to be done? Somehow, we need to:
- remove the blank lines;
– generate a second column containing the name of the ward each address applies to;
– remove the colon from the ward name;
– remove the rows that contained the original ward names.
If we highlight the data in the web page, copy it and paste it into a text editor, it looks like this:
We can also paste the data into a new OpenRefine Project:
We can use OpenRefine’s import data tools to clean the blank lines out of the original pasted data:
But how do we get rid of the section headings, and use them as second column entries so we can see which area each address applies to?
Let’s start by filtering to data to only show rows containing the headers, which we note that we could identify because those rows were the only rows to contain a colon character. Then we can create a second column that duplicates these values.
Here’s how we create the new column, which we’ll call “Wards”; the cell contents are simply a duplicate of the original column.
If we delete the filter that was selecting rows where the Column 1 value included a colon, we get the original data back along with a second column.
Starting at the top of the column, the “Fill Down” cell operation will fill empty cells with the value of the cell above.
If we now add the “colon filter” back to Column 1, to just show the area rows, we can highlight all those rows, then delete them. We’ll then be presented with the two column data set without the area rows.
Let’s just tidy up the Wards column too, by getting rid of the colon. To do that, we can transform the cell…
…by replacing the colon with nothing (an empty string).
Here’s the data – neat and tidy:-)
To finish, let’s export the data.
How about sending it to a Google Fusion table (you may be asked to authenticate or verify the request).
And here it is:-)
So – that’s a quick example of some of the data cleaning tricks and operations that OpenRefine supports. There are many, many more, of course…;-)
A Few More Thoughts on the Forensic Analysis of Twitter Friend and Follower Timelines in a MOOCalytics Context
Immediately after posting Evaluating Event Impact Through Social Media Follower Histories, With Possible Relevance to cMOOC Learning Analytics, I took the dog out for a walk to ponder the practicalities of constructing follower (or friend) acquisition charts for accounts with only a low number of followers, or friends, as might be the case for folk taking a MOOC or who have attended a particular event. One aim I had in mind was to probe the extent to which a MOOC may help developing social ties between folk taking a MOOC, whether MOOC participants know each other prior taking the MOOC, or whether they come to develop social links after taking the MOOC. Another aim was simply to see whether we could identify from changes in velocity or makeup of follower acquisition curves whether particular events led either to growth in follower numbers or community development between followers.
To recap on the approach used for constructing follower acquisition charts (as described in Estimated Follower Accession Charts for Twitter, and which also works (in principle!) for plotting when Twitter users started following folk):
- you can’t start following someone on Twitter until you join Twitter;
- follower lists on Twitter are reverse chronological statements of the order in which folk started following the corresponding account;
- starting with the first follower of an account (the bottom end of the follower list), we can estimate when they started following the account from the most recent account creation date seen so far amongst people who started following before that user.
A methodological problem arises when we have a low number of followers, because we don’t necessarily have enough newly created (follower) accounts starting to follow a target account soon after the creation of the follower account to give us solid basis for estimating when folk started following the target account. (If someone creates a new account and then immediately uses it to follow a target account, we get a good sample in time relating to when that follower started following the target account…If you have lots of people following an account there’s more of a chance that some of them will be quick-after-creation to start following the target account.)
There may also be methodological problems with trying to run an analysis over a short period of time (too much noise/lack of temporal definition in the follower acquisition curve over a limited time range).
So with low follower numbers, where can we get our timestamps from?
In the context of a MOOC, let’s suppose that there is a central MOOC account with lots of followers, and those followers don’t have many friends or followers (certainly not enough for us to be able to generate smooth – and reliable – acquisition curves).
If the MOOC account has lots of followers, let’s suppose we can generate a reasonable follower acquisition curve from them.
This means that for each follower, fo_i, we can associate with them a time when they started following the MOOC account, fo_i_t. Let’s write that as fo(MOOC, fo_i)=fo_i_t, where fo(MOOC, fo_i) reads “the estimated time when MOOC is followed by fo_i”.
(I’m making this up as I’m going along…;)
If we look at the friends of fo_i (that is, the people they follow), we know that they started following the MOOC account at time fo_i_t. So let’s write that as fr(fo_i, MOOC)=fo_i_t, where fr(fo_i, MOOC) reads “the estimated time when fo_i friends MOOC”.
Since public friend/follower relationsships are symmetrical on Twitter (if A friends B, then B is at that instant followed by A), we can also write fr(fo_i, MOOC) = fo(MOOC, fo_i), which is to say that the time when fo_i friends MOOC is the same time as when MOOC is followed by fo_i.
Got that?!;-) (I’m still making this up as I’m going along…!)
We now have a sample in time for calibrating at least a single point in the friend acquisition chart for fo_i. If fo_i follows other “celebrity” accounts for which we can generate reasonably sound follower acquisition charts, we should be able to add other timestamp estimates into the friend acquisition timeline.
If fo_i follows three accounts A,B,C in that order, with fr(fo_i,A)=t1 and fr(fo_i,C)=t2, we know that fr(fo_i,B) lies somewhere between t1 and t2, where t1 < t2, let’s call that [t1,t2], reading it as [not earlier than t1, not later than t2]. Which is to say, fr(fo_i,B)=[t1,t2], or “fo_i makes friends with B not before t1 and not after t2″, or more simply “fo_i makes friends with B somewhen between t1 and t2″.
Let’s now look at fo_j, who has only a few followers, one of whom is fo_i. Suppose that fo_j is actually account B. We know that fo(fo_j,fo_i), and furthermore that fo(fo_j,fo_i)=fr(fo_i,fo_j). Since we know that fr(fo_i,B)=[t1,t2], and B=fo_j, we know that fr(fo_i,fo_j)=[t1,t2]. (Just swap the symbols in and out of the equations…) But what we now also have is a timestamp estimate into the followers list for fo_j, that is: fo(fo_j,fo_i)=[t1,t2].
If MOOC has lots of friends, as well as lots of followers, and MOOC has a policy of following back followers immediately, we can use it to generate timestamp probes into the friend timelines of its followers, via fo(MOOC,X)=fr(X,MOOC), and its friends, via fr(MOOC,Y)=fo(Y,MOOC). (We should be able to use other accounts with large friend or follower accounts and reasonably well defined acquisition curves to generate additional samples?)
We can possibly also start to play off the time intervals from friend and follower curves against each other to try and reduce the uncertainty within them (that is, the range of them).
For example, if we have fr(fo_i,B)=[t1,t2], and from fo(B,fo_i)=[t3,t4], if t3 > t1, we can tighten up fr(fo_i,B)=[t3,t2]. Similarly, if t2 < t4, we can tighten up fo(B,fo_i)=[t3,t2]. Which I think in general is:
if fr(A,B)=[t1,t2] and fo(B,A)=[t3,t4], we can tighten up to fr(A,B) = fo(B,A) = [ greater_of(t1,t3), lesser_of(t2,t4) ]
Erm, maybe? (I should probably read through that again to check the logic!) Things also get a little more complex when we only have time range estimates for most of the friends or followers, rather than good single point timestamp estimates for when they were friended or started to follow…;-) I’ll leave it as an exercise for the reader to figure hout how to write that down and solve it!;-)]
If this thought experiment does work out, then a several rules of thumb jump out if we want to maximise our chances of generating reasonably accurate friend and follower acquisition curves:
- set up your MOOC Twitter account close to the time you want to start using it so it’s creation date is as late as possible;
– encourage folk to follow the MOOC account, and follow back, to improve the chances of getting reasonable resolution in the follower acquisition curve for the MOOC account. These connections also provide time-estimated probes into follower acquisition curves of friends and friend acquisition curves of followers;
– consider creating new “fake” timestamp Twitter accounts than can immediately on creation follow and be friended by the MOOC account to place temporal markers into the acquisition curves;
– if followers follow other celebrity accounts (or are followed (back) by them), we should be able to generate timestamp samples by analysing the celebrity account acquisition curves.
I think I need to go and walk the dog again.
PS a couple more trivial fixed points: for a target account, the earliest time at which they were first followed or when they first friended another account is the creation date of the target account; the latest possible time they acquired their most recent friend or follower is the time at which the data was collected.
Regular readers will know I quite often make use of Scraperwiki for grabbing datasets and hosting views over scraped scraped data. A few days ago, I contributed a guest post to the Scraperwiki blog:
As well as being a great tool for scraping and aggregating content from third party sites, Scraperwiki can be used as a transformational “glue logic” tool: joining together applications that utilise otherwise incompatible data formats. Typically, we might think of using a scraper to pull data into one or more Scraperwiki database tables and then a view to develop an application style view over the data. Alternatively, we might just download the data so that we can analyse it elsewhere. There is another way of using Scraperwiki, though, and that is to give life to data as flowable web data.
Read the whole thing here: Glue Logic and Flowable Data.
PS I hope to write more about “flowable data”, feeds, and feed enrichment in a later post here on OUseful.info.