OUseful.Info, the blog…

Trying to find useful things to do with emerging technologies in open education

Archive for the ‘Rstats’ Category

Generating Sankey Diagrams from rCharts

A couple of weeks or so ago, I picked up an inlink from an OCLC blog post about Visualizing Network Flows: Library Inter-lending. The post made use of Sankey diagrams to represent borrowing flows, and by implication suggested that the creation of such diagrams is not as easy as it could be…

Around the same time, @timelyportfolio posted a recipe showing how to wrap custom charts so that they could be called from the amazing rCharts library, which wraps Javascript graphics libraries for use from R (more about this in a forthcoming post somewhere…). rCharts Extra – d3 Horizon Conversion provides a walkthrough demonstrating how to wrap a d3.js implemented horizon chart so that it can be generated from R with what amounts to little more than a single line of code. So I idly tweeted a thought wondering how easy it would be to run through the walkthrough and try wrapping a Sankey diagram in the same way (I didn’t have time to try it myself at that moment in time.)

Within a few days, @timelyportfolio had come up with the goods – Exploring Networks with Sankey and then a further follow on post: All My Roads Lead Back to Finance–PIMCO Sankey. The code itself can be found at https://github.com/timelyportfolio/rCharts_d3_sankey

Somehow, playtime has escaped me for the last couple of weeks, but I finally got round to trying the recipe out. The data I opted for is energy consumption data for the UK, published by DECC, detailing energy use in 2010.

As ever, we can’t just dive straight into the visualisation – we need to do some work first to get it into shape… The data came as a spreadsheet with the following table layout:

Excel - copy data

The Sankey diagram generator requires data in three columns – source, target and value – describing what to connect to what and with what thickness line. Looking at the data, I thought it might be interesting to try to represent as flows the amount of each type of energy used by each sector relative to end use, or something along those lines (I just need something authentic to see if I can get @timelyportfolio’s recipe to work;-) So it looks as if some shaping is in order…
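For reference, the sort of three-column structure we are aiming for looks like this (the values here are invented, purely to illustrate the shape):

```r
#A Sankey edge list is just source, target and value columns
#(the numbers here are made up, purely to illustrate the shape)
links = data.frame(
  source = c('Gas', 'Electricity', 'Gas'),
  target = c('Heat', 'Heat', 'Cooking'),
  value  = c(100, 40, 25)
)
links
```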

To tidy and reshape the data, I opted to use OpenRefine, copying and pasting the data into a new OpenRefine project:

Refine - paste DECC energy data

The data is tab separated and we can ignore empty lines:

Refine - paste import settings (DECC)

Here’s the data as loaded. You can see several problems with it: numbers that have commas in them; empty cells marked as blank or with a -; empty/unlabelled cells.

DECC data as imported

Let’s make a start by filling in the blank cells in the Sector column – Fill down:

DECC data fill down

We don’t need the overall totals because we want to look at piecewise relations (and if we do need the totals, we can recalculate them anyway):

DECC filter out overall total

To tidy up the numbers so they actually are numbers, we’re going to need to do some transformations:

DECC need to clean numeric cols

There are several things to do: remove commas, remove – signs, and cast things as numbers:

DECC clean numeric column

value.replace(',','') says replace commas with an empty string (ie nothing – delete the comma).

We can then pass the result of this transformation into a following step – replace the – signs: value.replace(',','').replace('-','')

Then turn the result into a number: value.replace(',','').replace('-','').toNumber()

If there’s an error, note that we choose to set the cell to a blank.
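As an aside (not part of the OpenRefine recipe), the same comma-and-dash cleaning could be scripted in R:

```r
#Strip commas, strip '-' placeholders, then cast to numeric
raw = c('1,234', '56', '-', '12,345,678')
cleaned = suppressWarnings(as.numeric(gsub('-', '', gsub(',', '', raw), fixed = TRUE)))
cleaned  #the '-' placeholder ends up as NA
```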

Having run this transformation on one column, we can select Transform on another column and just reuse the transformation (remembering to set the cell to blank if there is an error):

DECC number cleaner reuse

To simplify the dataset further, let’s get rid of the other totals data:

DECC remove data column

Now we need to reshape the data – ideally, rather than having columns for each energy type, we want to relate the energy type to each sector/end use pair. We’re going to have to transpose the data…
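As an aside, this transposition is the same operation as a wide-to-long reshape in R; a minimal sketch with invented values:

```r
#Wide format: one column per energy type (values invented)
wide = data.frame(Sector = 'Domestic', Enduse = 'Heat',
                  Gas = 100, Electricity = 40)
#Long format: one row per Sector/Enduse/EnergyType combination
long = reshape(wide, direction = 'long',
               varying = list(c('Gas', 'Electricity')),
               v.names = 'value',
               timevar = 'EnergyType',
               times = c('Gas', 'Electricity'))
long[, c('Sector', 'Enduse', 'EnergyType', 'value')]
```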

DECC start to reshape

So let’s do just that – wrap columns down into new rows:

DECC data transposition

We’re going to need to fill down again…

DECC need to fill down again

So now we have our dataset, which can be trivially exported as a CSV file:

DECC export as CSV

Data cleaning and shaping phase over, we’re now ready to generate the Sankey diagram…

As ever, I’m using RStudio as my R environment. Load in the data:

R import DECC data

To start, let’s do a little housekeeping:

#Here’s the baseline column naming for the dataset
#(the column order follows the reshaped CSV: sector, end use, energy type, value)
colnames(DECC.overall.energy) = c('Sector', 'Enduse', 'EnergyType', 'value')

#Inspired by @timelyportfolio - All My Roads Lead Back to Finance–PIMCO Sankey

#Now let's create a Sankey diagram - we need to install rCharts
#(at the time of writing, via devtools: install_github('rCharts', 'ramnathv'))
require(rCharts)

#Download and unzip @timelyportfolio's Sankey/rCharts package
#Take note of where you put it!

sankeyPlot <- rCharts$new()

#We need to tell R where the Sankey library is.
#I put it as a subdirectory to my current working directory (.)
sankeyPlot$setLib('./rCharts_d3_sankey-gh-pages/')

#We also need to point to an HTML template page
sankeyPlot$setTemplate(script = "./rCharts_d3_sankey-gh-pages/layouts/chart.html")

Having got everything set up, we can cast the data into the form the Sankey template expects – with source, target and value columns identified:

#The plotting routines require column names to be specified as:
##source, target, value
#to show what connects to what and by what thickness line

#If we want to plot from enduse to energytype we need this relabelling
#(keeping the column order set above: Sector, Enduse, EnergyType, value)
workingdata = DECC.overall.energy
colnames(workingdata) = c('Sector', 'source', 'target', 'value')

Following @timelyportfolio, we configure the chart and then open it to a browser window:

sankeyPlot$set(
  data = workingdata,
  nodeWidth = 15,
  nodePadding = 10,
  layout = 32,
  width = 750,
  height = 500,
  labelFormat = ".1%"
)

#Render the chart (it opens in a browser/viewer window)
sankeyPlot

Here’s the result:

Basic sankey DECC

Let’s make plotting a little easier by wrapping that routine into a function:

#To make things easier, let's abstract a little more...
sankeyPlot = function(df){
  sankeyPlot <- rCharts$new()
  #See note in PPS to this post about a simplification of this part....
  #We need to tell R where the Sankey library is.
  #I put it as a subdirectory to my current working directory (.)
  sankeyPlot$setLib('./rCharts_d3_sankey-gh-pages/')
  #We also need to point to an HTML template page
  sankeyPlot$setTemplate(script = "./rCharts_d3_sankey-gh-pages/layouts/chart.html")
  sankeyPlot$set(
    data = df,
    nodeWidth = 15,
    nodePadding = 10,
    layout = 32,
    width = 750,
    height = 500,
    labelFormat = ".1%"
  )
  sankeyPlot
}
Now let’s try plotting something a little more adventurous:

#If we want to add in a further layer, showing how each Sector contributes
#to the End-use energy usage, we need to additionally treat the Sector as
#a source and the sum of that sector's energy use by End Use as the value
#Recover the colnames so we can see what's going on
sectorEnergy = aggregate(value ~ Sector + Enduse, DECC.overall.energy, sum)
colnames(sectorEnergy) = c('source', 'target', 'value')

#We can now generate a single data file combining all source and target data
enduseEnergy = aggregate(value ~ Enduse + EnergyType, DECC.overall.energy, sum)
colnames(enduseEnergy) = c('source', 'target', 'value')
fullFlows = rbind(sectorEnergy, enduseEnergy)
sankeyPlot(fullFlows)

And the result?

Full Sankey DECC

Notice that the bindings are a little bit fractured – for example, the Heat block has several contributions from the Gas block. This also suggests that a Sankey diagram, at least as configured above, may not be the most appropriate way of representing the data in this case. Sankey diagrams are intended to represent flows, which means that there is a notion of some quantity flowing between elements, and further that that quantity is conserved as it passes through each element (the sum of flows into an element equals the sum of flows out of it).
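As an aside, the conservation property is easy to check on an edge list: for any intermediate node, the values flowing in should sum to the values flowing out. A toy example with invented numbers:

```r
#Toy edge list: Gas flows into Heat; Heat flows out to two sectors
links = data.frame(source = c('Gas', 'Heat', 'Heat'),
                   target = c('Heat', 'Domestic', 'Industry'),
                   value  = c(140, 100, 40))
#Sum the flows into and out of the intermediate 'Heat' node
inflow  = sum(links$value[links$target == 'Heat'])
outflow = sum(links$value[links$source == 'Heat'])
inflow == outflow  #conserved in this toy example
```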

A more natural story might be to show Energy type flowing to end use and then out to Sector, at least if we want to see how energy is tending to be used for what purpose, and then how end use is split by Sector. However, such a diagram would not tell us, for example, that Sector X mainly used energy source A for end use P, whereas Sector Y mainly used energy source B for the same end use P.

One approach we might take to tidying up the chart to make it more readable (for some definition of readable!), though at the risk of making it even more misleading, is to do a little bit more aggregation of the data, and then bind appropriate blocks together. Here are a few more examples of simple aggregations:

We can also explore other relationships and trivially generate corresponding Sankey diagram views over them:

#How much energy does each sector use, by end use?
enduseBySector = aggregate(value ~ Sector + Enduse, DECC.overall.energy, sum)
colnames(enduseBySector) = c('source', 'target', 'value')
sankeyPlot(enduseBySector)

#How much of each energy type is associated with each enduse?
energyByEnduse = aggregate(value ~ EnergyType + Enduse, DECC.overall.energy, sum)
colnames(energyByEnduse) = c('source', 'target', 'value')
sankeyPlot(energyByEnduse)

So there we have it – quick and easy Sankey diagrams from R using rCharts and a magic recipe from @timelyportfolio:-)

PS the following routine makes it easier to grab data into the appropriately named format

#This routine makes it easier to get the data for plotting as a Sankey diagram
#Select the source, target and value column names explicitly to generate a dataframe containing
#just those columns, appropriately named.
#(A sketch - the original function name and argument defaults may have differed)
sankeyData = function(df, colsource = 'source', coltarget = 'target', colvalue = 'value'){
  sdf = df[, c(colsource, coltarget, colvalue)]
  colnames(sdf) = c('source', 'target', 'value')
  sdf
}

#For example:
#workingdata = sankeyData(DECC.overall.energy, 'Enduse', 'EnergyType', 'value')

The code automatically selects the appropriate columns and renames them as required.

PPS it seems that a recent update(?) to the rCharts library by @ramnath_vaidya now makes things even easier and removes the need to download and locally host @timelyportfolio’s code:

#We can remove the local dependency and replace the following...
#sankeyPlot$setLib('./rCharts_d3_sankey-gh-pages/')
#sankeyPlot$setTemplate(script = "./rCharts_d3_sankey-gh-pages/layouts/chart.html")
##with this simplification, pointing at the online copy of the library
sankeyPlot$setLib('http://timelyportfolio.github.io/rCharts_d3_sankey')

Written by Tony Hirst

July 23, 2013 at 11:34 am

Posted in OpenRefine, Rstats


Generating Alerts From Guardian University Tables Data

One of the things I’ve been pondering with respect to the whole data journalism process is how journalists without a lot of statistical training can quickly get a feel for whether there may be interesting story leads in a dataset, or how they might be able to fashion “alerts” that bring attention to data elements that might be worth investigating further. In the case of the Guardian university tables, this might relate to identifying which universities appear to have courses that rank particularly well in a subject and within their own institution, or which subject areas appear to have teaching or assessment satisfaction issues in a particular institution. In this post, I have a quick play with an idea for generating visual alerts that might help us set up or identify hopefully interesting or fruitful questions to ask along these lines.

Statistics will probably have a role to play in generating most forms of alert, but as Jeff Leek has recently pointed out in the Simply Statistics blog, [t]he vast majority of statistical analysis is not performed by statisticians. Furthermore:

We also need to realize that the most impactful statistical methods will not be used by statisticians, which means we need more fool proofing, more time automating, and more time creating software. The potential payout is huge for realizing that the tide has turned and most people who analyze data aren’t statisticians.

By no stretch of the imagination would I class myself as a statistician. But I have started trying to develop some statistical intuitions by going back to several histories of statistics in order to see what sorts of problems the proto-statisticians were trying to develop mathematical techniques to solve (for example, I’m currently reading The History of Statistics: The Measurement of Uncertainty before 1900).

One of the problems I’ve encountered for myself relates to trying to find outliers in a quick and easy way. One way of detecting an outlier is to assume that the points have a normal distribution about the population mean value, and then look for points that lie several standard deviations away from the mean. If that all sounds way too complicated, the basic idea is this: which items in a group have values a long way from the average value in the group.

A statistic that captures this idea in a crude way is the z-score (or should that be z-statistic? z-value?), which for a particular point is the deviation from the mean divided by the standard deviation (I think). Which is to say, it’s proportional to how far away from ‘the average’ a point is, divided by the average of how far away all points are. (What it doesn’t tell you is whether that distance is meaningful or important in any sense, which I think statisticians refer to as “the power of the effect”. I think. They talk in code.)

Anyway, the magic of R makes it easy to calculate the z-score for a group of numbers using the scale() function. So I had a quick play with it to see if it might be useful in generating alerts around the Guardian University tables data.
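For example, here’s what scale() does to a simple vector of numbers:

```r
#A quick look at what scale() actually returns
x = c(2, 4, 4, 4, 5, 5, 7, 9)
z = scale(x)  #each value becomes (x - mean(x)) / sd(x)
#The rescaled values have mean (near) zero and standard deviation one
c(mean = mean(z), sd = sd(z))
```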

This post builds on two earlier posts and may not make much sense if you haven’t been following the story to date. The first post, which tells how to grab the data into R, can be found in Datagrabbing Commonly Formatted Sheets from a Google Spreadsheet – Guardian 2014 University Guide Data ; the second, how to start building a simple interactive viewer for the data using the R shiny library, can be found in Disposable Visual Data Explorers with Shiny – Guardian University Tables 2014.

The general question I had in mind was this: is there a way we can generate a simple view over the data to highlight outperformers and underperformers under each of the satisfaction scores? The sort of view I thought I wanted depended on the stance from which a question about the data might arise. For example, through building a single dataset from the subject level data, we find ourselves in a position where we can ask questions from the perspective of someone from a particular university who may be interested in how the various satisfaction scores for a particular subject compare to the scores for the other subjects offered by the institution. Alternatively, we might be interested in how the satisfaction scores for a particular subject area might compare across several institutions.

These seemed like things the z-score might be able to help with. For example, for each of the satisfaction scores/columns associated with a particular subject, we can generate a z-score that shows how far each institution is from the average in that satisfaction column for that subject.
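A sketch of that per-group scaling on synthetic data (base R only; the real code works over the Guardian data columns):

```r
#Synthetic data: one satisfaction-style score column, two subject groups
df = data.frame(subject = rep(c('Maths', 'Physics'), each = 3),
                score = c(50, 60, 70, 20, 40, 90))
#Compute z-scores within each subject group
zlist = lapply(split(df, df$subject), function(g){
  g$z = as.numeric(scale(g$score))
  g
})
#Bind the per-group results back into a single dataframe
zdf = do.call(rbind, zlist)
#Within each group, the z-scores are centred on zero
tapply(zdf$z, zdf$subject, mean)
```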

##gdata contains the Guardian university tables data for all subjects
#(a sketch - column indices 4:11 are the numerical satisfaction score columns)
#Get the list of subjects
ggs = unique(gdata$subject)
#Grab a working copy of the data for the first subject
gg = subset(gdata, subject == ggs[1])
#Find the z-scores for the various numerical satisfaction score columns in the data for the first subject
ggs1s = data.frame(scale(gg[, 4:11]))
#Add another column that labels this set of z-scores with the appropriate institution names
ggs1s$inst = gg$Name.of.Institution
#Add another column that specifies the subject the z-scores correspond to
ggs1s$ss = ggs[1]
#We now repeat the above process for each of the other subjects using a temporary working copy of the data
for (g in ggs[-1]){
  gg = subset(gdata, subject == g)
  tmp = data.frame(scale(gg[, 4:11]))
  tmp$inst = gg$Name.of.Institution
  tmp$ss = g
  #The only difference is we add this temporary data to the end of the working copy
  ggs1s = rbind(ggs1s, tmp)
}
#The resulting data frame, ggs1s, contains z-scores for each satisfaction data column across institutions within a subject.

This data essentially gives us a clue about how well one institution’s scores compare with the scores of the other institutions in each separate subject area. The lead-in here is to questions along the lines of “which universities have particularly good career prospects for students studying subject X”, or “which universities are particularly bad in terms of assessment satisfaction when it comes to subject Y?”.

In addition, for each of the satisfaction scores/columns associated with a particular institution, we can generate a z-score that shows how far each subject is from the average in that satisfaction column for that institution.

##The pattern of this code should look familiar...
##We're going to do what we did for subjects, only this time for institutions...
#Get the list of institutions
ggi = unique(gdata$Name.of.Institution)
#Grab a working copy of the data for the first institution
gg = subset(gdata, Name.of.Institution == ggi[1])
#Find the z-scores for the various numerical satisfaction score columns in the data for the first institution
ggs1i = data.frame(scale(gg[, 4:11]))
#Add another column that labels this set of z-scores with the appropriate institution name
ggs1i$inst = ggi[1]
#Add another column that specifies the subjects the z-scores correspond to
ggs1i$ss = gg$subject
#We now repeat the above process for each of the other institutions using a temporary working copy of the data
for (g in ggi[-1]){
  gg = subset(gdata, Name.of.Institution == g)
  tmp = data.frame(scale(gg[, 4:11]))
  tmp$inst = g
  tmp$ss = gg$subject
  #As before, the only difference is we add this temporary data to the end of the working copy
  ggs1i = rbind(ggs1i, tmp)
}
#The resulting data frame, ggs1i, contains z-scores for each satisfaction data column across subjects within an institution.

This second take on the data essentially gives us a clue about how well each subject area performs within an institution compared to the performance of the other subjects offered by that institution. This might help lead us in to questions of the form “which department within an institution appears to have an unusually high entrance tariff compared to other courses offered by the institution?” or “which courses have a particularly poor overall satisfaction, which might be dragging down our overall NSS result?” For university administrators taking a simplistic view of this data, it might let you tell a Head of Department that their chaps are letting the side down. Or it might be used by a Director of a Board of Studies in a particular subject area to boost a discretionary “bonus” application.

So here’s something I’ve started playing with to try to generate “alerts” based on outperformers and underperformers within an institution and/or within a particular subject area compared to other institutions: plots that show subject/institution combinations that have “large” z-scores as calculated above.

To plot the charts, we need to combine the z-score datasets and get the data into the right shape:

#Convert the z-score dataframes from "wide" format to "long" format
#(a sketch, using melt() from the reshape2 package)
require(reshape2)
ggs2s.m = melt(ggs1s, id = c('inst', 'ss'))
ggs2i.m = melt(ggs1i, id = c('inst', 'ss'))
#The variable column will identify which satisfaction score the row conforms to for a given subject/institution pair
#Merge the datasets by subject/institution/satisfaction score
ggs2 = merge(ggs2s.m, ggs2i.m, by = c('inst', 'ss', 'variable'))

It’s then a simple matter to plot the outliers for a given institution – let’s filter to show only items where at least one of the z-scores has magnitude greater than 2:

#ii contains the name of the institution of interest, eg ii='Oxford Brookes'
g=ggplot(subset(ggs2,inst==ii & !is.na(value.x) & !is.na(value.y) & (abs(value.x)>2 | abs(value.y)>2))) + geom_text(aes(x=value.x,y=value.y,label=ss),size=2)+facet_wrap(~variable)
g=g+labs(title=paste("Guardian University Tables 2014 Alerts:",ii), x='z relative to other institutions in same subject', y='z relative to other subjects in same institution')

Or for a given subject:

#si contains the name of the subject of interest
#(note we can't reuse the name ss here - that's already a column in ggs2)
g=ggplot(subset(ggs2,ss==si & !is.na(value.x) & !is.na(value.y) & (abs(value.x)>2 | abs(value.y)>2))) + geom_text(aes(x=value.x,y=value.y,label=inst),size=2)+facet_wrap(~variable)
g=g+labs(title=paste("Guardian University Tables 2014 Alerts:",si), x='z relative to other institutions in same subject', y='z relative to other subjects in same institution')

Let’s see how meaningful these charts are, and whether they provide us with any glanceable clues about where there may be particular successes or problems…

Here’s the institutional view:


A couple of things we might look for are items in the top right corner (subjects where the scores are well above average both nationally and within the institution) and the bottom left corner (well below average on both counts). Subjects close to the y-axis are well away from the average within the institution, but fairly typical on a national level for that subject area. Subjects close to the x-axis are fairly typical within the institution, but away from average when compared to other institutions.

So for example, dentistry appears to be having a hard time of it? Let’s look at the dentistry subject table on an updated version of the Shiny viewer (which now includes the alert charts), focussing on a couple of the satisfaction values that appear to be weak at a national scale:

guardian 2014 shiny dentistry

How about the subject alert?


Hmmm…what’s happening at UC Suffolk?

I’m still not sure how useful these views are, and the approach thus far still doesn’t give a glanceable alert over the whole data set in one go (we have to run views over institutions or subjects, for example). But it was relatively quick to do (this post took almost as long to write as the coding), and it maybe opens up ideas about what other questions we might ask of the data?

Written by Tony Hirst

June 23, 2013 at 12:24 pm

Posted in Rstats


Disposable Visual Data Explorers with Shiny – Guardian University Tables 2014

Have data – now what? Building your own interactive data explorer need not be a chore with the R shiny library… Here’s a quick walkthrough…

In Datagrabbing Commonly Formatted Sheets from a Google Spreadsheet – Guardian 2014 University Guide Data, I showed how to grab some data from several dozen commonly formatted sheets in a Google spreadsheet, and combine them to produce a single monolithic data set. The data relates to UK universities and provides several quality/satisfaction scores for each of the major subject areas they offer courses in.

We could upload this data to something like Many Eyes in order to generate visualisations over it, or we could create a visual data explorer app of our own. It needn’t take too long, either…

Here’s an example, the Simple Guardian University Rankings 2014 Explorer, that lets you select a university and then generate a scatterplot that shows how different quality/ranking scores vary for that university by subject area:

Crude data explorer - guardian 2014 uni stats

The explorer allows you to select a university and then generate a scatterplot based around selected quality scores. The label size is also set relative to a selected quality score.

The application is built up from three files. First, a generic file (global.R) that we use to load the source data (in this example I pull it from a file, though we could bring it in live from the Google spreadsheet).

#In this case, the data is loaded into the dataframe: gdata
#(the filename here is a placeholder - use whatever you saved the data grab as)
gdata = read.csv('gdata.csv', stringsAsFactors = F)
#Once it's loaded, we tidy it (I should have tidied the saved data really!)
gdata[, 4:11] <- sapply(gdata[, 4:11], as.numeric)

A “server” file that takes input from the user interface elements on the left of the app and generates the displayed chart:

# Define server logic
shinyServer(function(input, output) {
  #Simple test plot
  output$testPlot = renderPlot( {
    pdata=subset(gdata, Name.of.Institution==input$tbl)
    #g=ggplot(pdata) + geom_text(aes(x=X..Satisfied.with.Teaching,y=X..Satisfied.with.Assessment,label=subject,size=Value.added.score.10))
    g=ggplot(pdata) + geom_text(aes_string(x=input$tblx,y=input$tbly,size=input$tbls, label='subject'))
    g=g+labs(title=paste("Guardian University Tables 2014:",input$tbl))
    print(g)
  } )
})
Finally, a user interface definition file:

#Generate a list containing the names of the institutions
uList = unique(gdata$Name.of.Institution)
names(uList) = uList
#Generate a list containing the names of the quality/ranking score columns by column name
cList = colnames(gdata)[4:11]
names(cList) = cList

# Define UI for application that plots random distributions 
shinyUI(pageWithSidebar(
  # Application title
  headerPanel("Guardian 2014 University Tables Explorer"),
  sidebarPanel(
    #Which table do you want to view, based on the list of institution names?
    selectInput("tbl", "Institution:",uList),

    #Also let the user select the x, y and label size, based on quality/ranking columns
    selectInput("tblx", "x axis:",cList,selected = 'X..Satisfied.with.Teaching'),
    selectInput("tbly", "y axis:",cList,selected='X..Satisfied.with.Assessment'),
    selectInput("tbls", "Label size:",cList,selected = 'Value.added.score.10'),
    div("This demo provides a crude graphical view over data extracted from",
          "Guardian Datablog: University guide 2014 data tables") ),
  #The main panel is where the "results" charts are plotted
  mainPanel(plotOutput("testPlot"))
))
And that’s it… If we pop these files into a single gist – such as the one at https://gist.github.com/psychemedia/5824495, which also includes code for grabbing the data from the Google spreadsheet – we can run the application from the RStudio command line as follows:

require(shiny)
runGist('5824495')

(Hit “escape” to stop the script running.)

With a minor tweak, we can get a list of unique subjects, rather than institutions and allow the user to compare courses across institution by subject, rather than across subject areas within an institution.

We can then combine the two approaches into a single interface, Guardian 2014 university table explorer v2. It’s not ideal – we should really grey out whichever selector (institution or subject area) is inactive according to the radio button setting.

guardian uni 2014 explorer 2

The global.R file is the same, although we need to tweak the ui.R and server.R files.

To the UI file, we add a radio button selector and an additional menu (for subjects):


#Generate the institution list as before
uList = unique(gdata$Name.of.Institution)
names(uList) = uList

#Pull out the list of subjects
sList = unique(gdata$subject)
names(sList) = sList

#Generate the list of quality/ranking score columns
cList = colnames(gdata)[4:11]
names(cList) = cList

# Define UI for application that plots random distributions 
shinyUI(pageWithSidebar(
  # Application title
  headerPanel("Guardian 2014 University Tables Explorer v.2"),
  sidebarPanel(
    #Add in a radio button selector
    radioButtons("typ", "View by:",
                 list("Institution" = "inst",
                      "Subject" = "subj")),
    #Just a single selector here - which table do you want to view?
    selectInput("tbli", "Institution:",uList),
    #Add a selector for the subject list
    selectInput("tblb", "Subject:",sList),
    selectInput("tblx", "x axis:",cList,selected = 'X..Satisfied.with.Teaching'),
    selectInput("tbly", "y axis:",cList,selected='X..Satisfied.with.Assessment'),
    selectInput("tbls", "Label size:",cList,selected = 'Value.added.score.10'),
    div("This demo provides a crude graphical view over data extracted from",
          "Guardian Datablog: University guide 2014 data tables") ),
  #The main panel is where the "results" charts are plotted
  mainPanel(plotOutput("testPlot"))
))

To the server file, we add a level of indirection, setting local state variables based on the UI selectors, and then using the value of these variables within the chart generator code itself.


# Define server logic
shinyServer(function(input, output) {

  #We introduce a level of indirection, creating routines that set state within the scope of the server based on UI actions
  #If the radio button state changes, reset the data filter
  pdata <- reactive({
    switch(input$typ,
           inst = subset(gdata, Name.of.Institution == input$tbli),
           subj = subset(gdata, subject == input$tblb))
  })
  #Make sure we use the right sort of label (institution or subject) in the title
  ttl <- reactive({
    switch(input$typ, inst = input$tbli, subj = input$tblb)
  })
  #Make sure we display the right sort of label (by institution or by subject) in the chart
  lab <- reactive({
    switch(input$typ, inst = 'subject', subj = 'Name.of.Institution')
  })
  #Simple test plot
  output$testPlot = renderPlot({
    #g=ggplot(pdata) + geom_text(aes(x=X..Satisfied.with.Teaching,y=X..Satisfied.with.Assessment,label=subject,size=Value.added.score.10))
    g=ggplot(pdata()) + geom_text(aes_string(x=input$tblx,y=input$tbly,size=input$tbls, label=lab()))
    g=g+labs(title=paste("Guardian University Tables 2014:",ttl()))
    print(g)
  })
})
What I hoped to show here was how it’s possible to create a quick visual explorer interface over a dataset using the R shiny library. Many users will be familiar with using wizards to create charts in spreadsheet programmes, but may get stuck when it comes to figuring out how to generate large numbers of charts. As a quick and dirty tool, shiny provides a great environment for knocking up disposable interfaces that provide you with a playground for checking out a wide range of chart data settings from automatically populated list selectors.

With a few more tweaks, we could add in the option to download data by subject or institution, add range selectors to allow us to view only results where a score falls within a particular range, and so on. We can also define new charts and displays (including tabular data displays) to view the data, just slotting them in with very simple UI and server components as used in the original example.

PS this post isn’t necessarily intended to say that we should just be adding to the noise by publishing interactive data explorers that folk don’t know how to use to support “datajournalism” stories or research dissemination (the above apps are way too scruffy for that, and the charts potentially too confusing or cluttered for the uninitiated to make sense of). Rather, I suggest that journalists, researchers etc should feel as if they are in a position to knock up their own data exploration tools as part of the “homework” involved in prepping for a conversation with a data source. The tool building also becomes an extension of the conversation with the data. Complete/complex apps aren’t built in one go. As the example described here shows, it was built up in baby steps, starting with the data grab and initial chart in the previous post, moving on to a simple interactive chart at the start of this post, then starting to evolve into a more complex tool through the addition of further features.

If the app gets more complex, eg in response to me wanting to be able to ask more refined questions of the data, or take filtered data dumps from it, (for example, for use in other charting applications, such as datawrapper.de), this just represents an evolution of, or increase in depth of, the conversation I am having with the data and the notes I am taking of it.

Written by Tony Hirst

June 21, 2013 at 10:24 am

Posted in Rstats


Datagrabbing Commonly Formatted Sheets from a Google Spreadsheet – Guardian 2014 University Guide Data

So it seems like it’s that time of year when the Guardian publish their university rankings data (Datablog: University guide 2014), which means another opportunity to have a tinker and see what I’ve learned since last year…

(Last year’s hack was a Filtering Guardian University Data Every Which Way You Can…, where I had a quick go at creating a simple interactive dashboard viewer and charter over the Guardian tables, as published via Google spreadsheets.)

The data is published via a Google spreadsheet using a standard sheet layout (apart from the first two sheets, with gid numbers 0 and 1 respectively). Sheets 2..47 are formatted as follows:

Guardian data table uni 2014

The following R function provides the basis for downloading the data from a single sheet, given the spreadsheet key and the sheet gid, and puts the data into a dataframe.

require(RCurl)

gsqAPI = function(key,query,gid=0){
  tmp=getURL( paste( sep="",'https://spreadsheets.google.com/tq?', 'tqx=out:csv','&tq=', curlEscape(query), '&key=', key, '&gid=', gid), ssl.verifypeer = FALSE )
  return( read.csv( textConnection( tmp ),  stringsAsFactors=F ) )
}
The query is written using the SQL-like Google visualisation API query language.
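As an aside, the curlEscape() call in the function above percent-encodes the query so it can be passed in the tq= URL parameter; base R’s URLencode() does much the same:

```r
#The Google visualisation API query, as used in the gsqAPI function
query = "select * where B!=''"
#Percent-encode it for use as the tq= URL parameter
escaped = URLencode(query, reserved = TRUE)
escaped  #spaces become %20, etc
```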

The following routine will download non-empty rows from a specific sheet, and rename the subject specific rank column to a generically titled column (“Subject Rank”). An additional column is added to the table that denotes what subject the table is associated with.

#A sketch of the routine (the function name and the subject-name extraction are assumed)
grabSubjectSheet = function(key, i){
  tmp=gsqAPI(key,"select * where B!=''", i)
  #The first column is the subject-specific rank column
  tmp$subject = sub('.Rank', '', colnames(tmp)[1], fixed = TRUE)
  colnames(tmp)[1] = 'Subject.Rank'
  tmp
}

We can now download the first subject specific sheet using the following call:


This loads in the data as follows:

seeding guardian ranking data table

We can then add data to this dataframe from all the other sheets, to give us a single data frame containing ranking data for all the subjects listed.

for (i in 3:47){
  #Grab the data from sheet i, extract the subject name from the rank column,
  #rename that column generically, and append the rows to the combined dataframe
  tmp = gsqAPI(key, "select * where B!=''", i)
  tmp$subject = sub('.Rank', '', colnames(tmp)[1], fixed = TRUE)
  colnames(tmp)[1] = 'Subject.Rank'
  gdata = rbind(gdata, tmp)
}

(This is probably not a very R idiomatic (iRonic?) way of doing things, but it seems to do the trick…)

The result is a single dataframe containing all the subject specific data from the Guardian spreadsheets.

If we check the data types of the columns in the full data set, we can see that most of the columns that should be viewed as numeric types are instead being treated as characters.

Data format - need to make numbers

We can fix this with a one-liner that forces the type on the columns we select (the fourth to the eleventh column):

gdata[, 4:11] <- sapply(gdata[, 4:11], as.numeric)
#Cast the university names and subjects to levelled factors
gdata$Name.of.Institution = factor(gdata$Name.of.Institution)
gdata$subject = factor(gdata$subject)

One advantage of combining the data from the separate sheets into a monolithic data table is that we can see the ranking across all subject areas for a given university. For example:

oxfordbrookes.rankings = subset(gdata, Name.of.Institution=='Oxford Brookes', select=c('subject','Subject.Rank') )
#Let's also sort the results:
oxfordbrookes.rankings = oxfordbrookes.rankings[ order(oxfordbrookes.rankings['Subject.Rank']), ]

Cross-subject rankings

We can also start to quiz the data in a graphical way. For example, can we get a feel for how courses are distributed within a university according to teaching and satisfaction levels, whilst also paying heed to the value add score?

oxfordbrookes.full = subset(gdata, Name.of.Institution=='Oxford Brookes' )
ggplot(oxfordbrookes.full) + geom_text(aes(x=X..Satisfied.with.Teaching, y=X..Satisfied.with.Assessment, label=subject,size=Value.added.score.10))

oxfordbrookes start asking

All told, it took maybe a couple of hours of faffing around trying to remember R syntax and cope with https hassles to get this far (including a chunk of time to write this post;-). But now we’re in a position to start hacking out quick queries and having a proper conversation with the data. The work can also be viewed as disposable tool building – it hasn’t required a project proposal, it hasn’t used lots of third party developer time etc etc.

As it is, though, I’m not sure how useful getting this far is to anyone who isn’t willing to have a go at hacking the data for themselves…

Hmmm… maybe I should try to sketch out a quick Shiny app…..?

Written by Tony Hirst

June 20, 2013 at 3:35 pm

Posted in Rstats


Evaluating Event Impact Through Social Media Follower Histories, With Possible Relevance to cMOOC Learning Analytics

Last year I sat on a couple of panels organised by I’m a Scientist’s Shane McCracken at various science communication conferences. A couple of days ago, I noticed Shane had popped up a post asking Who are you Twitter?, a quick review of a social media mapping exercise carried out on the followers of the @imascientist Twitter account.

Using the technique described in Estimated Follower Accession Charts for Twitter, we can estimate a follower acquisition growth curve for the @imascientist Twitter account:


I’ve already noted how we may be able to use “spikes” in follower acquisition rates to identify news events that involved the owner of a particular Twitter account and caused a surge in follower numbers as a result (What Happened Then? Using Approximated Twitter Follower Accession to Identify Political Events).

Thinking back to the context of evaluating the impact of events that include social media as part of the overall campaign, it struck me that whilst running a particular event may not lead to a huge surge in follower numbers on the day of the event or in the immediate aftermath, the followers who do sign up over that period might have signed up as a result of the event. And now we have the first inklings of a post hoc analysis tool that lets us try to identify these people, and perhaps look to see if their profiles are different to profiles of followers who signed up at different times (maybe reflecting the audience interest profile of folk who attended a particular event, or reflecting sign ups from a particular geographical area?)

In other words, through generating the follower acquisition curve, can we use it to filter down to folk who started following around a particular time in order to then see whether there is a possibility that they started following as a result of a particular event, and if so can count as some sort of “conversion”? (I appreciate that there are a lot of caveats in there!;-)

A similar approach may also be relevant in the context of analysing link building around historical online community events, such as MOOCs… If we know somebody took a particular MOOC at a particular time, might we be able to construct their follower acquisition curve and then analyse it around the time of the MOOC, looking to see if the connections built over that period are different to the user’s other followers, and as such may represent links developed as a result of taking the MOOC? Analysing the timelines of the respective parties may further reveal conversational dynamics between those parties, and as such allow us to see whether a fruitful social learning relationship developed out of contact made in the MOOC?

Written by Tony Hirst

April 21, 2013 at 5:40 pm

Estimated Follower Accession Charts for Twitter

Just over a year or so ago, Mat Morrison/@mediaczar introduced me to a visualisation he’d been working on (How should Page Admins deal with Flame Wars?) that I started to refer to as an accession chart (Visualising Activity Around a Twitter Hashtag or Search Term Using R). The idea is that we provide each entrant into a conversation or group with an accession number: the first person has accession number 1, the second person accession number 2 and so on. The accession number is plotted in rank order on the vertical y-axis, with ranked/time ordered “events” along the horizontal x-axis: utterances in a conversation for example, or posts to a forum.

A couple of months ago, I wondered whether this approach might also be used to estimate when folk started following an individual on Twitter. My reasoning went something like this:

One of the things I think is true of the Twitter API call for the followers of an account is that it returns lists of followers in reverse accession order. So the person who followed an account most recently will be at the top of the list (the first to be returned) and the person who followed first will be at the end of the list. Unfortunately, we don’t know when followers joined, so it’s hard to spot bursty growth in the number of followers of an account. However, it struck me that we may be able to get a bound on this by looking at the dates at which followers joined Twitter, along with their ‘accession order’ as followers of an account. If we get the list of followers and reverse it, and assume that this gives an ordered list of followers (with the follower that started following the longest time ago first), we can then work through this list and keep track of the oldest ‘created_at’ date seen so far. This gives us an upper bound (most recent date) for when followers that far through the list started following. (You can’t start following until you join twitter…)

So for example, if followers A, B, C, D in that accession order (ie started following target in that order) have user account creation dates 31/12/09, 1/1/09, 15/6/12, 5/5/10 then:
- A started following no earlier than 31/12/09 (because that’s when they joined Twitter and it’s the most recent creation date we’ve seen so far)
- B started following no earlier than 31/12/09 (because they started following after A, whose join date remains the most recent creation date seen so far)
- C started following no earlier than 15/6/12 (because that’s when they joined Twitter and it’s the most recent creation date we’ve seen so far)
- D started following no earlier than 15/6/12 (because they started following after C, which gave us the most recent creation date seen so far)
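In code terms, the bound is just a running maximum over the followers' account creation dates (the plotting code later in the post works with "days ago" and cummin, which is equivalent). A minimal sketch of the toy example above:

```r
#Account creation dates for followers A, B, C, D, in accession order
joined = as.Date(c('2009-12-31', '2009-01-01', '2012-06-15', '2010-05-05'))
#The earliest possible follow date is the latest creation date seen so far
bound = as.Date(cummax(as.numeric(joined)), origin='1970-01-01')
format(bound)
#"2009-12-31" "2009-12-31" "2012-06-15" "2012-06-15"
```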

That’s probably confused you enough, so here’s a chart – accession number is along the bottom (i.e. the x-axis), joining date (in days ago) is on the y-axis:


NOTE: this diverges from the accession graph described above, where accession number goes on the y-axis and rank ordered event along the x-axis.

What the chart shows is an estimate (the red line) of how many days ago a follower with a particular accession number started to follow a particular Twitter account.

As described in Sketches Around Twitter Followers, we see a clear break at 1500 days ago when Twitter started to get popular. This approach also suggests a technique for creating “follower probes” that we can use to date a follower record: if you know which day a particular user followed a target account, you can use that follower to put a datestamp into the follower record (assuming the Twitter API returned followers in reverse accession order).

Here’s an example of the code I used based on Twitter follower data grabbed for @ChrisPincher (whose follower profile appeared to be out of sorts from the analysis sketched in Visualising Activity Around a Twitter Hashtag or Search Term Using R). I’ve corrected the x/y axis ordering so follower accession number is now the vertical, y-component.


library(ggplot2)

processUserData = function(data) {
    data$tz = as.POSIXct(data$created_at)
    data$days = as.integer(difftime(Sys.time(), data$tz, units = "days"))
    data = data[rev(rownames(data)), ]
    data$acc = 1:length(data$days)
    data$recency = cummin(data$days)
    data
}


mp_cp <- read.csv("~/code/MPs/ChrisPincher_fo_0__2013-02-16-01-29-28.csv", row.names = NULL)

ggplot(processUserData(mp_cp)) +  geom_point(aes(x = -days, y = acc), size = 0.4) + geom_point(aes(x = -recency, y = acc), col = "red", size = 1)+xlim(-2000,0)

Here’s @ChrisPincher’s chart:


The black dots reveal how many days ago a particular follower joined Twitter. The red line is the estimate of when a particular follower started following the account, estimated based on the most recently created account seen to date amongst the previously acceded followers.

We see steady growth in follower numbers to start with, and then the account appears to have been spam followed? (Can you spot when?!;-) The clumping of creation dates of accounts during the attack also suggests they were created programmatically.

[In the “next” post in this series, What Happened Then? Using Approximated Twitter Follower Accession to Identify Political Events, I’ll show how spikes in follower acquisition on a particular day can often be used to “detect” historical news events.]

PS after coming up with this recipe, I did a little bit of “scholarly research” and I learned that a similar approach for estimating Twitter follower acquisition times had already been described at least once, at the opening of this paper: We Know Who You Followed Last Summer: Inferring Social Link Creation Times In Twitter – “We estimate the edge creation time for any follower of a celebrity by positing that it is equal to the greatest lower bound that can be deduced from the edge orderings and follower creation times for that celebrity”.

Written by Tony Hirst

April 5, 2013 at 10:31 am

Posted in Rstats


Splitting a Large CSV File into Separate Smaller Files Based on Values Within a Specific Column

One of the problems with working with data files containing tens of thousands (or more) rows is that they can become unwieldy, if not impossible, to use with “everyday” desktop tools. When I was Revisiting MPs’ Expenses, the expenses data I downloaded from IPSA (the Independent Parliamentary Standards Authority) came in one large CSV file per year containing expense items for all the sitting MPs.

In many cases, however, we might want to look at the expenses for a specific MP. So how can we easily split the large data file containing expense items for all the MPs into separate files containing expense items for each individual MP? Here’s one way using a handy little R script in RStudio

Load the full expenses data CSV file into RStudio (for example, calling the dataframe it is loaded into mpExpenses2012). Previewing it, we see there is a column MP.s.Name identifying which MP each expense claim line item corresponds to:


We can easily pull out the unique values of the MP names using the levels command, and then for each name take a subset of the data containing just the items related to that MP and print it out to a new CSV file in a pre-existing directory:

mpExpenses2012 = read.csv("~/Downloads/DataDownload_2012.csv")
#mpExpenses2012 is the large dataframe containing data for each MP
#Get the list of unique MP names
for (name in levels(mpExpenses2012$MP.s.Name)){
  #Subset the data by MP
  tmp=subset(mpExpenses2012, MP.s.Name==name)
  #Create a new filename for each MP - the folder 'mpExpenses2012' should already exist
  fn=paste('mpExpenses2012/',gsub(' ','',name),sep='')
  #Save the CSV file containing separate expenses data for each MP
  write.csv(tmp, fn, row.names=FALSE)
}

This technique can be used to split any CSV file into multiple CSV files based on the unique values contained within a particular, specified column.

Written by Tony Hirst

April 3, 2013 at 8:54 am

Posted in Rstats


Revisiting MPs’ Expenses

I couldn’t but notice the chatter about Iain Duncan Smith claiming he’d have no problem “living on 53 pounds a week”, which made me wonder not only how many meal-catered events he attends each week (and how many of his scheduled meetings also have complimentary tea and biscuits – a bellwether of the extent of cuts in many institutions…;-), but also how he fares in the expenses stakes…

For the last couple of years, details about MPs’ expense claims have been published via the Independent Parliamentary Standards Authority (IPSA) website. The data seems to be most easily grabbed as CSV files containing all MPs’ claims for a parliamentary session (or tax year?) – eg 2012/13 or 2011/2012. As you might expect, that means the files are relatively large – 20MB (~100,000 rows) for 12/13, or just over 40MB (~190k rows) for 2011/12.

Files of this size are fine if you’re happy working with files of this size (?!), but can be a pain if you aren’t… So here are a couple of ways getting the data into a more manageable form, starting from raw data files that look something like this…

mp expenses raw

The file is made up of a series of rows, one per expense entry, with common columns. If we loaded this data into a spreadsheet application such as Excel or Google Spreadsheets, we’d see a single sheet containing however many tens of thousands of rows of data. Assuming that the application could cope with loading such a large amount of data of course… which it might not be able to do…which means we may need to make the data file a bit more manageable, somehow…

Let’s start by grabbing the data relating to Iain Duncan Smith’s expense claims. On a Linux box, or a Mac, this is easy enough from a Terminal command line. (On Windows, something like Cygwin should provide you with equivalents of some of the more useful Unix tools.) For example, the grep command lets us pull out just the rows that contain the phrase Iain Duncan Smith:

grep "Iain Duncan Smith" DataDownload_2012.csv > IDS_expenses_2012.csv

This reads along the lines of: search through the file “DataDownload_2012.csv” looking for rows that contain “Iain Duncan Smith”, then copy those rows and only those rows into the file “IDS_expenses_2012.csv”

For what it’s worth, I’ve posted IDS’ expenses data from 2011 and 2012 to a couple of Google Fusion Tables: 2011/12, 2012/13

Another way of wrangling the data into a state where we can start to play with it is to load it into RStudio, where we can start applying magical R incantations to it.

mpExpenses2012 = read.csv("~/Downloads/DataDownload_2012.csv")


We can then generate a subset of the data containing just IDS’ data:

ids2012=subset(mpExpenses2012, MP.s.Name=="Iain Duncan Smith")

We can also generate a combined data set of IDS’ expense claims from both 2011/12 and 2012/13, for example:

mpExpenses2011 = read.csv("~/Downloads/DataDownload_2011.csv")
ids2011=subset(mpExpenses2011, MP.s.Name=="Iain Duncan Smith")


However, all is not well…

On loading the 2011 data into R, 158320 observations are loaded in. The actual number of rows (including the header – so one more row than the number of “observations”) can be given by running a line count (from the terminal/command line) over the original file:

wc -l DataDownload_2011.csv
187447 DataDownload_2011.csv

That is, 187447 rows…

If we try to pull out a list of MPs’ names using the levels command:

levels(mpExpenses2011$MP.s.Name)
we find that as well as the expected MPs names, there’s some “bad” data:

etl messup

(What we expect to see in the name column is a list of MP names, one unique name per row.)

This is, of course, the way of the world. Folk who publish data rarely, if ever, provide a demonstration of how to actually open it cleanly in an application – typically because data publishers assume that once they have published the data, it’s bound to be all right and doesn’t need checking. This is, however, rarely true… (Although, for the 2012/13 data, there are 99071 loaded observations against 99072 rows (including 1 header row) in the download file, which does seem to be correct.)

What we should do now, of course, is go into an ETL (or at least, TL) debug mode on the 2011 data and try to figure out what’s going wrong with the simple import… or we just work with the data we have and try to work around the dodgy rows…

…or we limit ourselves to the 2012 data, which does seem okay…

So if we do that, what other sorts of investigation come to mind?

One thing that came to mind after skimming the data…

mp cost of travel

was a “rail travel fares according to MPs’ expenses” lookup table.

So for example, we might start by creating a subset of the data based on expenses categorised as “Travel” and then look to see what sorts of travel classification fall within that Category:
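The call itself didn’t survive in this copy; assuming the category column is called Category and the classification column Expense.Type (both names are guesses from the data preview), it would have been something like:

```r
#Column names 'Category' and 'Expense.Type' are assumptions
travel=subset(mpExpenses2012, Category=='Travel')
levels(factor(travel$Expense.Type))
```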


Here’s what we get:

[1] "Car Hire"                       "Car Hire Fuel"                  "Car Hire Fuel MP"              
 [4] "Car Hire Fuel MP Staff"         "Car Hire Insurance MP Staff"    "Car Hire MP"                   
 [7] "Car Hire MP Staff"              "Congest. Zone/Toll Seas Ticket" "Congestion Zone/Toll"          
[10] "Congestion Zone/Toll Dependant" "Congestion Zone/Toll MP"        "Congestion Zone/Toll MP Staff" 
[13] "Food & Drink"                   "Food & Drink @ Parliament"      "Food & Drink @ Parlmnt OFF Est"
[16] "Food & Drink MP"                "Food & Drink MP Staff"          "Hotel Late Sitting"            
[19] "Hotel Late Sitting > 1.00"      "Hotel London Area MP Staff"     "Hotel NOT London Area (Travel)"
[22] "Hotel NOT London Area MP Staff" "Hotel Outside UK"               "Own Bicycle MP"                
[25] "Own Car Dependant"              "Own Car MP"                     "Own Car MP Staff"              
[28] "Own Vehicle Bicycle"            "Own Vehicle Bicycle MP Staff"   "Own Vehicle Car"               
[31] "Own Vehicle Car Dependant"      "Own Vehicle Car MP Staff"       "Own Vehicle Mot Cycle MP Staff"
[34] "Parking"                        "Parking Dependant"              "Parking MP Staff"              
[37] "Parking Season Ticket"          "Public Tr AIR"                  "Public Tr AIR Dependant"       
[40] "Public Tr AIR MP Staff"         "Public Tr BUS"                  "Public Tr BUS MP Staff"        
[43] "Public Tr COACH"                "Public Tr COACH MP Staff"       "Public Tr FERRY"               
[46] "Public Tr FERRY MP Staff"       "Public Tr OTHER"                "Public Tr OTHER Dependant"     
[49] "Public Tr OTHER MP Staff"       "Public Tr RAIL - RTN"           "Public Tr RAIL - SGL"          
[52] "Public Tr RAIL Dependant - RTN" "Public Tr RAIL Dependant - SGL" "Public Tr RAIL Foreign"        
[55] "Public Tr RAIL MP Staff - RTN"  "Public Tr RAIL MP Staff - SGL"  "Public Tr RAIL Other"          
[58] "Public Tr RAIL Other Dependant" "Public Tr RAIL Other MP Staff"  "Public Tr RAIL Railcard"       
[61] "Public Tr RAIL Railcd MP Staff" "Public Tr RAIL Sleeper Suppl"   "Public Tr Season Ticket"       
[64] "Public Tr UND"                  "Public Tr UND Dependant"        "Public Tr UND MP Staff"        
[67] "Public Tr Underground MP"       "Taxi"                           "Taxi After Late Sitting"       
[70] "Taxi after Late Sitting 11pm"   "Taxi Dependant"                 "Taxi MP"                       
[73] "Taxi MP Staff"                  "Taxi Working Late After 9pm"    "Taxi Working Late Before 9pm"

We might further pull out just the rows relating to rail travel (almost 10,000 rows from the 2012/13 dataset):
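Again assuming a hypothetical Expense.Type column holds the travel classification, the rail subset might be pulled out with something like:

```r
#'Category' and 'Expense.Type' are assumed column names
travel=subset(mpExpenses2012, Category=='Travel')
railTravel=travel[grepl('RAIL', travel$Expense.Type), ]
nrow(railTravel)
```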


and then we might start looking to see who’s travelling First vs. who’s travelling Standard, as well as building up a database of rail fares between locations as claimed on expenses. But that’s for another day…

Written by Tony Hirst

April 2, 2013 at 11:20 pm

Publishing Stats for Analytic Reuse – FAOStat Website and R Package

How can stats and data publishers, from NGOs and (inter)national statistics agencies to scientific researchers, publish their data in a way that supports its analysis directly, as well as in combination with other datasets?

Here’s one approach I learned about from Michael Kao of the UN Food and Agriculture Organisation statistics division, FAOStat.

At first glimpse, FAOStat offers a rich website that supports data downloads, previews and simple analysis tools around a wide variety of international food related datasets:

FAOStat website

FAOstat - graphical tools

faostat - inline data preview

FAOStat - data analysis

One problem with having so many controls and fields available is that it can be hard to know where (or how) to get started – a bit like the problem of being presented with an empty SPARQL query box…

It would be quite handy to be able to set – and save with meaningful labels – preference sets for the countries you’re interested in, so you don’t have to keep scrolling through long country lists looking for the countries you want to generate reports for. (Support for “standard” groupings of countries might also be useful?) Being able to share URLs to predefined reports might also be handy – but all this would possibly make the site even more complex to use!

One easier way of working with FAOStat data, particularly if you access the FAO datasets regularly, might be to take a programmatic route using the FAOStat R package. Making datasets available in ways that bring that data directly into a desktop analysis environment, where they can be worked on without requiring cleaning or other forms of tidying up (which is often the case when data is made available via Excel spreadsheets or CSV files), is a trend I hope we see more of. (That is not to say that data shouldn’t also be published in “generic” document formats…). If you are using a reproducible research strategy, queries to original datasources provide implicit, self-describing metadata about the data source and the query used to return a particular dataset – metadata that is all too easy to lose, or otherwise detach from a dataset, when working with downloaded files.

I haven’t had chance to play with this package yet – it’s still in testing anyway, I think? – but it looks quite handy at a first glance (I need to do a proper review…). As well as providing a way of running data grab queries over the FAO FAOSTAT and World Bank WDI APIs, it seems to provide support for “linkage”. As the draft vignette suggests, “Merge is a typical data manipulation step in daily work yet a non-trivial exercise especially when working with different data sources. The built in mergeSYB function enables one to merge data from different sources as long as the country coding system is identified. … Data from any source with [a] classification [supported by the package] can be supplied to mergeSYB in order to obtain a single merged data. (sic)“. Supported formats currently include: United Nations M49 country standard [UN_CODE]; FAO country code scheme [FAOST_CODE]; FAO Global Administrative Unit Layers (GAUL) [ADM0_CODE]; ISO 3166-1 alpha-2 [ISO2_CODE]; ISO 3166-1 alpha-2 (World Bank) [ISO2_WB_CODE]; ISO 3166-1 alpha-3 [ISO3_CODE]; ISO 3166-1 alpha-3 (World Bank) [ISO3_WB_CODE].

By releasing an “official” R package to access the FAOStat API, it occurs to me that this makes it much easier to start building sector specific Shiny applications around particular datasets? I wonder whether the FAOstat folk have considered whether there is a possibility of developing a small Shiny app or custom client ecosystem around their data, even if it just takes the form of a curated set of gists that can be downloaded directly into RStudio, for example, using runGist?

I don’t know whether the Eurostat EC Statistics database has an associated R package too? (If so, it could be quite interesting trying to tie them together?!) I do note, however, that Eurostat data is available for download (though I haven’t read the terms/license conditions…).

I also note that a Linked Data/SPARQL way in to Eurostat data appears to be available? Eurostat Linked Data.

[Man flu, hence the brevity of the post... skulks back off to sick bed...]

PS By the by, I notice that the NHS are experimenting with making some data releases available via Google Public Data Explorer [scroll down...]

PPS See also this package – Smarter Poland – which provides an API to the Eurostat database.

Written by Tony Hirst

March 8, 2013 at 2:45 pm

Posted in Rstats


What Happened Then? Using Approximated Twitter Follower Accession to Identify Political Events

Following a chat with @andypryke, I thought I’d try out a simple bit of feature detection around approximated follower acquisition charts (e.g. Estimated Follower Accession Charts for Twitter) to see if I could detect dates around which there were spikes in follower acquisition.

So for example, here’s the follower acquisition chart for Seema Malhotra:


We see a spike in follower count about 440 days ago, with an increased daily follower acquisition rate thereafter. What happened 440 days or so ago? We can easily look this up on something like Wolfram Alpha (query on /440 days ago/) or directly in R:

Sys.Date()-440
[1] "2011-12-20"

So what happened in December 2011? A quick search on /”Seema Malhotra” December 2011/ turns up the news that she won a by-election on 16 December 2011. The spike in followers matches the by-election date well, and the increased rate in daily follower acquisition since then is presumably related to the fact that Seema Malhotra is now an MP.

So what’s the new line on the chart (the black, stepped line along the bottom)? It’s actually a 5 point moving average of the first difference in follower count over time – that is, a sort of smoothed version of a crude approximation to the gradient of the follower acquisition curve. The firstdiff curve is normalised by finding the difference in accumulated follower count between consecutive time samples, divided by the number of days between samples. So it’s a sort of gradient rather than a first difference; if the samples were all a day apart, it would be a first difference. I also filter the line to only show days on which there was a “significant jump” in follower count, arbitrarily set at a 5 sample moving average of more than 50 new followers per day. Note the scaling of the moving average values too – the numerical y-axis scale is 1:1 for the cumulative follower number, but 10x the moving average value. The numerical value labels that annotate the line chart correspond to the number of days ago (relative to the date the chart was generated) that the peak corresponds to. For any chart critics out there – this is a “working chart”, rather than a polished presentation graphic;-)

#Process Twitter user data file

#The TTR library includes various moving average functions


  #Find the users who are used to approximate the accession date
  #Dedupe these rows (need to check if I grab the first or the last...)

  #First difference

  #First difference smoothed over 5 values - note we do dodgy things against time here - just look for signal!

  #Second difference
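The bodies of these helper functions are missing from the archived listing. As a rough, hypothetical reconstruction of differ_a() – enough to drive the plotting code below, though almost certainly not identical to the original – consider:

```r
library(TTR)  #For the SMA moving average function

#Hypothetical reconstruction of differ_a() - not the original code
differ_a = function(data){
  #Rebuild the accession/recency columns, as in processUserData()
  data$days = as.integer(difftime(Sys.time(), as.POSIXct(data$created_at), units='days'))
  data = data[rev(rownames(data)), ]
  data$acc = 1:nrow(data)
  data$recency = cummin(data$days)
  #Keep one row per estimated accession date
  d = data[!duplicated(data$recency), ]
  #Crude gradient: new followers per day between consecutive estimated dates
  d$firstdiff = c(0, diff(d$acc) / (-diff(d$recency)))
  d$days = d$recency
  #Smooth over 5 samples
  d$SMA5 = SMA(d$firstdiff, n=5)
  d
}
```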

#An example plotter - sm is the user data
g= ggplot(processUsersData(sm))
g=g+geom_point(aes(x=-days,y=acc),size=1) #The black acc/days dots
g=g+geom_point(aes(x=-recency,y=acc),col='red',size=1) #The red acc/days  acquisition date estimate dots
g=g+geom_line(data=differ_a(sm),aes(x=-days,y=10*SMA5)) #The firstdiff moving average line
g=g+geom_text(data=subset(differ_a(sm),SMA5>50),aes(x=-days,y=10*SMA5,label=days),size=3) #Feature label
g=g+ggtitle("Seema Malhotra") #Chart title

Here’s Chris Pincher:


This account got hit about 79 days ago (December 15th 2012) – we need to ignore the width of the moving average curve and just focus on the leading edge, as a zoom into the chart, with a barchart depicting firstdiff replacing the first diff moving average line, demonstrates.

#Got a rogue datapoint in there somehow?
g=g+geom_bar(data=subset(differ_a(cpmp), days>50 & days<5000), aes(x=-days, y=firstdiff), stat='identity')
g=g+ggtitle("Chris Pincher")+xlim(-200,-25)


The spam followers that were signed up to the account look like they were created in batches several months prior to what I presume was an attack? Could this have been in response to his Speaking Out about the Collapse of Drive Assist on Thursday, December 13th, 2012, his Huffpo post on the 11th, or his vote against the Human Rights Act as reported on the 5th?

Who else has an odd follower acquisition chart? How about Aidan Burley?


219 days ago – 28th July 2012…


I guess that caused something of a Twitter storm, and a resulting growth in his follower count… Diane Abbott’s racist tweet row from December 2012 also grew her twitter following… Top tip for follower acquisition, there;-)

Nadine Dorries’ outspoken comments in May 2012 around David Cameron’s party leadership, and then same sex marriage, were good for her Twitter follower count, which received another push when she joined I’m a Celebrity and was suspended from the Parliamentary Conservative party.

Showing your emotions in Parliament looks like a handy trick too…Here’s a spike around about October 20th, 2011…


(There also looks to be a gradient change around 200 days ago maybe? The second diff calculations might pull this out?)

Chris Bryant’s speech on the phone hacking saga in July 2011 showed that publicly well-received parliamentary speeches can be good for numbers too; not surprisingly, the phone hacking scandal was also good for Tom Watson’s follower count around the end of July 2011. Election victories can be good too: Andy Sawford got a jump in followers when he was announced as a PPC (10th August 2012) and then when he won his seat (November 7th 2012); Ben Bradshaw’s numbers also jumped around the time of his May 2010 election victory, as did Lynne Featherstone’s, particularly with her appointment to a government position. Jesse Norman appeared to get a bump after the Prime Minister confronted him on July 11th 2012; Nick de Bois saw a leap in followers following the riots in his constituency in early August 2011, and the riots also seem to have bumped David Lammy’s and Diane Abbott’s numbers up.

A tragedy on September 17th looks like it may have pushed Peter Hain’s numbers, but he was in the news a reasonable amount around that time – maybe getting your name in the press for several days in a row is good for Twitter follower counts? Steve Rotherham also benefited from another recalled tragedy, the Hillsborough disaster, when, in October 2011, he called out the ex-Sun editor over its original coverage; he seems to have received another boost in followers when he led a debate on internet trolls in September 2012.

Personal misfortune didn’t do Michael Fabricant any harm – the colourful Twitter baiting around his speeding conviction in October 2012 caused his follower count to fly and achieve an elevated rate of daily growth it’s maintained ever since.

A Dispatches special on ticket touts got a bounce in followers for Sharon Hodgson, who was sponsoring a private member’s bill on ticket touts at the time; winning a social media award seemed to do Kevin Brennan a favour in terms of his daily follower acquisition rate, as this ramp increase around the start of December 2010 shows:


So there we have it; political life as seen through the lens of Twitter follower acquisition bursts:-)

But what now? I guess one thing to do would be to have a go at estimating the daily growth rates of the various twittering MPs, and see if they have any bearing on things like ministerial (or Shadow Ministerial) responsibilities? Where rates seem to change (sustained kinks in the curve), it might be worth looking to see whether we can identify any signs of changes in tweeting behaviour – or maybe news stories that come to associate the MP with Twitter in some way?

Written by Tony Hirst

March 4, 2013 at 9:42 pm

Posted in Anything you want, Rstats


