OUseful.Info, the blog…

Trying to find useful things to do with emerging technologies in open education

Archive for the ‘Rstats’ Category

Local Council Spending Data – Time Series Charts

In What Role, If Any, Does Spending Data Have to Play in Local Council Budget Consultations? I started wondering about the extent to which local spending transparency data might play a role in supporting consultation around new budgets.

As a first pass, I’ve popped up a quick application at http://glimmer.rstudio.com/psychemedia/iwspend2013_14/ (shiny code here). You can pass form items in via the URL (except to set the Directorate – oops!), and also search using regular expressions, but at the moment you still need to hit the Search button to actually run the search (a sketch of how this sort of URL parameter handling works in Shiny follows the examples below). NOTE – there’s a little bug: you need to hit the Search button to get it to show any data at all; also note that selecting All directorates with no filter terms can be a bit slow to display anything…

Examples:

- http://glimmer.rstudio.com/psychemedia/iwspend2013_14/?expensesType=(oil)|(gas)|(electricity)
- http://glimmer.rstudio.com/psychemedia/iwspend2013_14/?serviceArea=mainland
- http://glimmer.rstudio.com/psychemedia/iwspend2013_14/?supplierName=capita
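
For what it’s worth, the URL-driven search relies on nothing more exotic than Shiny’s ability to inspect the query string of the page the app was opened with. Here’s a minimal sketch (not the actual app code – the output name and parameter handling are purely illustrative) of the sort of approach involved:

##Minimal sketch of reading URL query parameters in a Shiny server (illustrative only)
library(shiny)

shinyServer(function(input, output, session) {
  output$queryEcho <- renderText({
    #Parse eg ?supplierName=capita from the URL used to open the app
    q <- parseQueryString(session$clientData$url_search)
    if (!is.null(q$supplierName)) paste("Filtering supplier names matching:", q$supplierName)
    else "No supplierName parameter supplied"
  })
})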

I’ve started exploring various views over the data, but these need thinking through properly (in particular with respect to identifying views that may actually be useful!)

iw spend music expenses type

Hmm… did the budget change directorate?!

IW spend - music service area

IW spend music suppliers

Some more views over the data – I also started experimenting with some tabular views in the suppliers tab…

IW spend music suppliers table 1

IW spend music suppliers table 2

This is all very “shiny” of course, but is it useful? From these early glimpses over the data, can you think of any ways that a look at the spending data might help support budget consultations? What views over the data, in particular, might support such an aim, and what sort of stories might we be able to tell around this sort of data?

Written by Tony Hirst

November 6, 2013 at 11:27 am

Posted in Policy, Rstats

Tagged with

Generating d3js Motion Charts from rCharts

Remember Gapminder, the animated motion chart popularised by Hans Rosling in his TED Talks and Joy of Stats TV programme? Well it’s back on TV this week in Don’t Panic – The Truth About Population, a compelling piece of OU/BBC co-produced stats theatre featuring Hans Rosling, and a Pepper’s Ghost illusion brought into the digital age courtesy of the Musion projection system:

Whilst considering what materials we could use to support the programme, we started looking for ways to make use of the Gapminder visualisation tool that makes several appearances in the show. Unfortunately, neither Gapminder (requires Java?) nor the Google motion chart equivalent of it (requires Flash?) appears to work with a certain popular brand of tablet that is widely used as a second screen device…

Looking around the web, I noticed that Mike Bostock had produced a version of the motion chart using d3.js: The Wealth & Health of Nations. Hmmm…

Playing with that rendering on a tablet, I had a few problems when trying to highlight individual countries – the interaction interfered with an invisible date slider control – but a quick shout out to my OU colleague Pete Mitton resulted in a tweaked version of the UI with the date control moved to the side. I also added a tweak to allow specified countries to be highlighted. You can find an example here (source).

Looking at how the data was pulled into the chart, it seems to be quite a convoluted form of JSON. After banging my head against a wall for a bit, a question on Stack Overflow about how to wrangle the data from something that looked like this:

Country Region  Year    V1  V2
AAAA    XXXX    2001    12  13
BBBB    YYYY    2001    14  15
AAAA    XXXX    2002    36  56
AAAA    XXXX    1999    45  67

to something that looked like this:

[
  {"Country": "AAAA",
   "Region":"XXXX",
    "V1": [ [1999,45], [2001,12] , [2002,36] ],
    "V2":[ [1999,67], [2001,13] , [2002,56] ]
  },
  {"Country": "BBBB",
   "Region":"YYYY",
   "V1":[ [2001,14] ],
   "V2":[ [2001,15] ]
  }
]

resulted in a handy function from Ramnath Vaidyanathan that fitted the bill.

One of the reasons that I wanted to use R for the data transformation step, rather than something like Python, was that I was keen to try to get a version of the motion charts working with the rCharts library. As it happens, Ramnath is the maintainer of rCharts, and with his encouragement I had a go at getting the motion chart to work with that library, heavily cribbing from @timelyportfolio’s rCharts Extra – d3 Horizon Conversion tutorial on getting things to work with rCharts along the way.

For what it’s worth, my version of the code is posted here: rCharts_motionchart.

I put together a couple of demos that seem to work, including the one shown below that pulls data from the World Bank indicators API and then chucks it onto a motion chart…

UPDATE: I’ve made things a bit easier compared to the original recipe included in this post… we can now generate a fertility/GDP/population motion chart for a range of specified countries using data pulled directly from the World Bank development indicators API with just the following two lines of R code:

test.data=getdata.WDI(c('GB','US','ES','BD'))
rChartMotionChart(test.data,'Country','Year','Fertility','GDP','Region','Population')

It’s not so hard to extend the code to pull in other datasets, either…
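
For example – and this is just a hedged sketch rather than part of the recipe itself – the WDI package can pull other World Bank indicators by indicator code (WDIsearch() can be used to look up codes), which can then be renamed/reshaped and passed to the chart in the same way as the fertility/GDP/population data below:

require(WDI)
#Look up candidate indicator codes by keyword
WDIsearch('life expectancy')[1:5,]
#Pull life expectancy at birth (SP.DYN.LE00.IN) for a few countries
le = WDI(indicator='SP.DYN.LE00.IN', country=c('GB','US','ES','BD'), start=1960, end=2010)
head(le)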

Anyway, here’s the rest of the original post… Remember, it’s easier now;-) [Code: https://github.com/psychemedia/rCharts_motionchart See example/demo1.R]

To start with, here are a couple of helper functions:

require('WDI')

#A handy helper function for getting country data - this doesn't appear in the WDI package?
#---- https://code.google.com/p/google-motion-charts-with-r/source/browse/trunk/demo/WorldBank.R?r=286
getWorldBankCountries <- function(){
  require(RJSONIO)
  wbCountries <- fromJSON("http://api.worldbank.org/country?per_page=300&format=json")
  wbCountries <- data.frame(t(sapply(wbCountries[[2]], unlist)))
  wbCountries$longitude <- as.numeric(wbCountries$longitude)
  wbCountries$latitude <- as.numeric(wbCountries$latitude)
  levels(wbCountries$region.value) <- gsub("\\(all income levels\\)", "", levels(wbCountries$region.value))
  return(wbCountries)
}


#----http://stackoverflow.com/a/19729235/454773
pluck_ = function (element){
  function(x) x[[element]]
}

#' Zip two vectors
zip_ <- function(..., names = F){
  x = list(...)
  y = lapply(seq_along(x[[1]]), function(i) lapply(x, pluck_(i)))
  if (names) names(y) = seq_along(y)
  return(y)
}

#' Sort a vector based on elements at a given position
sort_ <- function(v, i = 1){
  v[sort(sapply(v, '[[', i), index.return = T)$ix]
}

library(plyr)

This next bit still needs some refactoring, and a bit of work to get it into a general form:

#I chose to have a go at putting all the motion chart parameters into a list

params=list(
  start=1950,
  end=2010,
  x='Fertility',
  y='GDP',
  radius='Population',
  color='Region',
  key='Country',
  yscale='log',
  xscale='linear',
  rmin=0,
  xmin=0

  )

##This bit needs refactoring - grab some data; the year range is pulled from the motion chart config;
##It would probably make sense to pull countries and indicators etc into the params list too?
##That way, we can start to make this block a more general function?

tmp=getWorldBankCountries()[,c('iso2Code','region.value')]
names(tmp)=c('iso2Code','Region')

data <- WDI(indicator=c('SP.DYN.TFRT.IN','SP.POP.TOTL','NY.GDP.PCAP.CD'),start = params$start, end = params$end,country=c("BD",'GB'))
names(data)=c('iso2Code','Country','Year','Fertility','Population','GDP')

data=merge(data,tmp,by='iso2Code')

#Another bit of Ramnath's magic - http://stackoverflow.com/a/19729235/454773
dat2 <- dlply(data, .(Country, Region), function(d){
  list(
    Country = d$Country[1],
    Region = d$Region[1],
    Fertility = sort_(zip_(d$Year, d$Fertility)),
    GDP = sort_(zip_(d$Year, d$GDP)),
    Population=sort_(zip_(d$Year, d$Population))
  )
})

#cat(rjson::toJSON(setNames(dat2, NULL)))

To minimise the amount of motion chart configuration, can we start to set limits based on the data values?

#This really needs refactoring/simplifying/tidying/generalising
#I'm not sure how good the range finding heuristics I'm using are, either?!
paramsTidy=function(params){
  if (!('ymin' %in% names(params))) params$ymin= signif(min(0.9*data[[params$y]]),3)
  if (!('ymax' %in% names(params))) params$ymax= signif(max(1.1*data[[params$y]]),3)
  if (!('xmin' %in% names(params))) params$xmin= signif(min(0.9*data[[params$x]]),3)
  if (!('xmax' %in% names(params))) params$xmax= signif(max(1.1*data[[params$x]]),3)
  if (!('rmin' %in% names(params))) params$rmin= signif(min(0.9*data[[params$radius]]),3)
  if (!('rmax' %in% names(params))) params$rmax= signif(max(1.1*data[[params$radius]]),3)
  params
}

params=paramsTidy(params)

This is the function that generates the rChart:

require(rCharts)

#We can probably tidy the way that the parameters are mapped...
#I wasn't sure whether to try to maintain the separation between params and rChart$params?
rChart.generator=function(params, h=400,w=800){
  rChart <- rCharts$new()
  rChart$setLib('../motionchart')
  rChart$setTemplate(script = "../motionchart/layouts/motionchart_Demo.html")

  rChart$set(
  
   countryHighlights='',
   yearMin= params$start,
   yearMax=params$end,
  
   x=params$x,
   y=params$y,
   radius=params$radius,
   color=params$color,
   key=params$key,
  
   ymin=params$ymin,
   ymax=params$ymax,
   xmin=params$xmin,
   xmax=params$xmax,
   rmin=params$rmin,
   rmax=params$rmax,
  
   xlabel=params$x,
   ylabel=params$y,
  
   yscale=params$yscale,
   xscale=params$xscale,
   
   width=w,
   height=h
 )

 rChart$set( data= rjson::toJSON(setNames(dat2, NULL)) )

 rChart
}

rChart.generator(params,w=1000,h=600)

Aside from tidying – and documenting/commenting – the code, the next thing on my to do list is to see whether I can bundle this up in a Shiny app. I made a start sketching a possible UI, but I’ve run out of time to do much more for a day or two… (I was also thinking of country checkboxes for either pulling in just that country data, or highlighting those countries.)

items=c("Fertility","GDP","Population")
names(items)=items

shinyUI(pageWithSidebar(
  headerPanel("Motion Chart demo"),
  
  sidebarPanel(
    selectInput(inputId = 'x',
                label = "X",
                choices = items,
                selected = 'Fertility'),
    selectInput(inputId = 'y',
                label = "Y",
                choices = items,
                selected = 'GDP'),
    selectInput(inputId = 'r',
                label = "Radius",
                choices = items,
                selected = 'Population')
  ),
  mainPanel(
    #The next line throws an error (a library is expected? But I don't want to use one?)
    showOutput("motionChart",'')
  )
))

As ever, we’ve quite possibly run out of time on getting much up on the OpenLearn website by Thursday to support the programme as it airs, which is partly why I’m putting this code out now… If you manage to do anything with it that would allow folk to dynamically explore a range of development indicators over the next day or two (especially GDP, fertility, mortality, average income, income distributions (this would require different visualisations?)), we may be able to give it a plug from OpenLearn, and maybe via any tweetalong campaign that’s running as the programme airs…

If you do come up with anything, please let me know via the comments, or twitter (@psychemedia)…

Written by Tony Hirst

November 4, 2013 at 6:34 pm

Posted in OBU, Rstats

Tagged with

Negative Payments in Local Spending Data

In anticipation of a new R library from School of Data data diva @mihi_tr that will wrap the OpenSpending API and provide access to OpenSpending.org data directly from within R, I thought I’d start doodling around some ideas raised in Identifying Pieces in the Spending Data Jigsaw. In particular, common payment values, repayments/refunds and “balanced payments”, that is, multiple payments where the absolute value of a negative payment matches that of an above zero payment (so for example, -£269.72 and £269.72 would be balanced payments).

The data I’ve grabbed is Isle of Wight Council transparency (spending) data for the financial year 2012/2013. The data was pulled from the Isle of Wight Council website and cleaned using OpenRefine broadly according to the recipe described in Using OpenRefine to Clean Multiple Documents in the Same Way, with a couple of additions: a new SupplierNameClustered column, originally created from the SupplierName column, but then cleaned to remove initials (eg [SD]) and direct debit prefixes (DD- etc) and then clustered using a variety of clustering algorithms; and a new unique index column, created with values '_'+row.index. You can find a copy of the data here.

To start with, let’s see how we can identify “popular” transaction amounts:

#We going to use some rearrangement routines...
library(plyr)

#I'm loading in the data from a locally downloaded copy
iw <- read.csv("~/code/Rcode/eduDemos1/caseStudies/IWspending/IWccl2012_13_TransparencyTest4.csv")

#Display most common payment values
commonAmounts=function(df,bcol='Amount',num=5){
  head(arrange(data.frame(table(df[[bcol]])),-Freq),num)
}
ca=commonAmounts(iw)

ca
#     Var1 Freq
#1 1567.72 2177
#2 1930.32 1780
#3  1622.6 1347
#4 1998.08 1253
#5  1642.2 1227

This shows that the most common payment by value is for £1567.72. An obvious question to ask here is: does this correspond to some sort of “standard payment” or tariff? And if so, can we reconcile this against the extent of the delivery of a particular level of service, perhaps as disclosed elsewhere?

We can also generate a summary report to show what transactions correspond to this amount, or the most popular amounts.

#We can then get the rows corresponding to these common payments
commonPayments=function(df,bcol,commonAmounts){
  df[abs(df[[bcol]]) %in% commonAmounts,]
}
cp.df=commonPayments(iw,"Amount",ca$Var1)

More usefully, we might generate “pivot table” style summaries of how the popular payments break down with respect to expenses area, supplier, or some other combination of factors. So for example, we can learn that there were 1261 payments of £1567.72 booked to the EF Residential Care services area, and that SOMERSET CARE LTD [SB] received 231 payments of £1567.72. (Reporting by the clustered supplier name is often more useful…)

#R does pivot tables, sort of...
#We can also run summary statistics over those common payment rows.
zz1=aggregate(index ~ ServiceArea + Amount, data =cp.df, FUN="length")
head(arrange(zz1,-index))

#                       ServiceArea  Amount index
#1              EF Residential Care 1567.72  1261
#2             EMI Residential Care 1642.20   659
#3             EMI Residential Care 1930.32   645
#4              EF Residential Care 1930.32   561
#5              EF Residential Care 1622.60   530
#6 Elderly Frail Residential Income 1622.60   36

#Another way of achieving a similar result is using the plyr count() function:
head(arrange(count(cp.df, c('ServiceArea','ExpensesType','Amount')),-freq))
#That is:
zz1.1=count(cp.df, c('ServiceArea','ExpensesType','Amount'))

zz2=aggregate(index ~ ServiceArea +ExpensesType+ Amount, data =cp.df, FUN="length")
head(arrange(zz2,-index))
#                       ServiceArea                       ExpensesType  Amount
#1              EF Residential Care                Chgs from Ind Provs 1567.72
#2             EMI Residential Care                Chgs from Ind Provs 1930.32
#3             EMI Residential Care                Chgs from Ind Provs 1642.20
#4              EF Residential Care Charges from Independent Providers 1622.60
#...

zz3=aggregate(index ~ SupplierName+ Amount, data =cp.df, FUN="length")
head(arrange(zz3,-index))
#                SupplierName  Amount index
#1     REDACTED PERSONAL DATA 1567.72   274
#2     SOMERSET CARE LTD [SB] 1567.72   231
#3 ISLAND HEALTHCARE LTD [SB] 1930.32   214
#...

zz4=aggregate(index ~ SupplierNameClustered+ Amount, data =cp.df, FUN="length")
head(arrange(zz4,-index))

If I was a suspicious type, I suppose I might look for suppliers who received one of these popular amounts only once…
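
A quick way of doing that with the objects already defined above might be something like the following (a sketch rather than anything definitive):

#Count how many times each (clustered) supplier received each of the popular amounts...
onceOnly=count(cp.df, c('SupplierNameClustered','Amount'))
#...and keep just the supplier/amount pairs that occur exactly once
subset(onceOnly, freq==1)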

Let’s have a search for items declared as overpayments or refunds, perhaps as preliminary work in an investigation about why overpayments are being made…:

#Let's look for overpayments
overpayments=function(df,fcol,bcol='Amount',num=5){
  df=subset(df, grepl('((overpay)|(refund))', df[[fcol]],ignore.case=T))
  df[ order(df[,bcol]), ]
}
rf=overpayments(iw,'ExpensesType')

#View the largest "refund" transactions
head(subset(arrange(rf,Amount),select=c('Date','ServiceArea','ExpensesType','Amount')))
#        Date                           ServiceArea         ExpensesType    Amount
#1 2013-03-28 Elderly Mentally Ill Residential Care     Provider Refunds -25094.16
#2 2012-07-18                   MH Residential Care Refund of Overpaymts -24599.12
#3 2013-03-25 Elderly Mentally Ill Residential Care     Provider Refunds -23163.84

#We can also generate a variety of other reports on the "refund" transactions
head(arrange(aggregate(index ~ SupplierName+ ExpensesType+Amount, data =rf, FUN="length"),-index))
#                 SupplierName     ExpensesType   Amount index
#1 THE ORCHARD HOUSE CARE HOME Provider Refunds  -434.84     8
#2 THE ORCHARD HOUSE CARE HOME Provider Refunds  -729.91     3
#3         SCIO HEALTHCARE LTD Provider Refunds  -434.84     3
#...
##Which is to say: The Orchard House Care home made 8 refunds of £434.84 and 3 of £729.91

head(arrange(aggregate(index ~ SupplierName+ ExpensesType, data =rf, FUN="length"),-index))
#                 SupplierName     ExpensesType index
#1         SCIO HEALTHCARE LTD Provider Refunds    31
#2           SOMERSET CARE LTD Provider Refunds    31
#3 THE ORCHARD HOUSE CARE HOME Provider Refunds    22
#...

Another area of investigation might be “balanced” payments. Here’s one approach for finding those, based around first identifying negative payments. Let’s also refine the approach in this instance for looking for balanced payments involving the supplier involved in the largest number of negative payments.

##Find which suppliers were involved in most number of negative payments
#Identify negative payments
nn=subset(iw,Amount<=0)
#Count the number of each unique supplier
nnd=data.frame(table(nn$SupplierNameClustered))
#Display the suppliers with the largest count of negative payments
head(arrange(nnd,-Freq))

#                    Var1 Freq
#1 REDACTED PERSONAL DATA 2773
#2      SOUTHERN ELECTRIC  313
#3   BRITISH GAS BUSINESS  102
#4      SOMERSET CARE LTD   75
#...

Let’s look for balanced payments around SOUTHERN ELECTRIC…

#Specific supplier search
#Limit transactions to just transactions involving this supplier
se=subset(iw, SupplierNameClustered=='SOUTHERN ELECTRIC')

##We can also run partial matching searches...
#sw=subset(iw,grepl('SOUTHERN', iw$SupplierNameClustered))

#Now let's search for balanced payments
balanced.items=function(df,bcol){
  #Find the positive amounts
  positems=df[ df[[bcol]]>0, ]
  #Find the negative amounts
  negitems=df[ df[[bcol]]<=0, ]
  
  #Find the absolute unique negative amounts
  uniqabsnegitems=-unique(negitems[[bcol]])
  #Find the unique positive amounts
  uniqpositems=unique(positems[[bcol]])
  
  #Find matching positive and negative amounts
  balitems=intersect(uniqabsnegitems,uniqpositems)
  #Subset the data based on balanced positive and negative amounts
  #bals=subset(se,abs(Amount) %in% balitems)
  bals=df[abs(df[[bcol]]) %in% balitems,]
  #Group the data by sorting, largest absolute amounts first
  bals[ order(-abs(bals[,bcol])), ]
}

dse=balanced.items(se,'Amount')

head(subset(dse,select=c('Directorate','Date','ServiceArea','ExpensesType','Amount')))
#                              Directorate       Date                ServiceArea ExpensesType   Amount
#20423               Economy & Environment 2012-07-11        Sandown Depot (East  Electricity -2770.35
#20424               Economy & Environment 2012-07-11        Sandown Depot (East  Electricity  2770.35
#52004 Chief Executive, Schools & Learning 2013-01-30 Haylands - Playstreet Lane  Electricity  2511.20
#52008 Chief Executive, Schools & Learning 2013-01-31 Haylands - Playstreet Lane  Electricity -2511.20

We can also use graphical techniques to try to spot balanced payments.

#Crude example plot to highlight matched +/- amounts
se1a=subset(se,ServiceArea=='Ryde Town Hall' & Amount>0)
se1b=subset(se,ServiceArea=='Ryde Town Hall' & Amount<=0)

require(ggplot2)
g=ggplot() + geom_point(data=se1a, aes(x=Date,y=Amount),pch=1,size=1)
g=g+geom_point(data=se1b, aes(x=Date,y=-Amount),size=3,col='red',pch=2)
g=g+ggtitle('Ryde Town Hall - Energy Payments to Southern Electric (red=-Amount)')+xlab(NULL)+ylab('Amount (£)')
g=g+theme(axis.text.x = element_text(angle = 45, hjust = 1))
g

In this first image, we see payments over time – the red markers plot the negated (“minus”) amounts of the negative payments. Notice that in some cases balanced payments seem to appear on the same day. If the y-values of a red and a black marker are the same, they are balanced in value. The x-axis is time, so where there is a run of equally spaced marks along x with the same y-value, this may represent a regular scheduled payment (a crude numerical check on that idea is sketched below the chart).

ryde - energy
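
Here’s that crude numerical check on the “regular payment” idea (illustrative only – it just looks at the most frequent payment amount in the Ryde Town Hall subset and works out how many days apart successive payments of that amount were made):

#Work on a local copy of the dates so we don't disturb the plotting data above
rydeDates=as.Date(se1a$Date)
#Find the most frequent payment amount in this subset
commonAmt=as.numeric(names(sort(table(se1a$Amount),decreasing=TRUE))[1])
#Gaps (in days) between successive payments of that amount
gaps=diff(sort(rydeDates[se1a$Amount==commonAmt]))
#Roughly constant gaps (eg ~30 days) would hint at a regular scheduled payment
table(as.numeric(gaps))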

g=ggplot(se1a)+geom_point( aes(x=Date,y=Amount),pch=1,size=1)
g=g+geom_point(data=se1b, aes(x=Date,y=-Amount),size=3,col='red',pch=2)
g=g+facet_wrap(~ExpensesType)
g=g+ggtitle('Ryde Town Hall - Energy Payments to Southern Electric (red=-Amount)')
g=g+xlab(NULL)+ylab('Amount (£)')
g=g+theme(axis.text.x = element_text(angle = 45, hjust = 1))
g

We can also facet out the payments by expense type.

ryde-energy facet

Graphical techniques can often help us spot patterns in the data that might be hard to spot just by looking at rows and columns worth of data. In many cases, it is worthwhile developing visual analysis skills to try out quick analyses “by eye” before trying to implement them as more comprehensive analysis scripts.

Written by Tony Hirst

August 17, 2013 at 11:08 pm

Posted in Rstats, School_Of_Data

Tagged with

Generating Sankey Diagrams from rCharts

A couple of weeks or so ago, I picked up an inlink from an OCLC blog post about Visualizing Network Flows: Library Inter-lending. The post made use of Sankey diagrams to represent borrowing flows, and by implication suggested that the creation of such diagrams is not as easy as it could be…

Around the same time, @timelyportfolio posted a recipe showing how to wrap custom d3.js charts so that they can be called from rCharts, the R library for generating charts based on a range of Javascript graphics libraries (more about this in a forthcoming post somewhere…). rCharts Extra – d3 Horizon Conversion provides a walkthrough demonstrating how to wrap a d3.js implemented horizon chart so that it can be generated from R with what amounts to little more than a single line of code. So I idly tweeted a thought wondering how easy it would be to run through the walkthrough and try wrapping a Sankey diagram in the same way (I didn’t have time to try it myself at the time).

Within a few days, @timelyportfolio had come up with the goods – Exploring Networks with Sankey and then a further follow on post: All My Roads Lead Back to Finance–PIMCO Sankey. The code itself can be found at https://github.com/timelyportfolio/rCharts_d3_sankey

Somehow, playtime has escaped me for the last couple of weeks, but I finally got round to trying the recipe out. The data I opted for is energy consumption data for the UK, published by DECC, detailing energy use in 2010.

As ever, we can’t just dive straight into the visualisation – we need to do some work first to get it into shape… The data came as a spreadsheet with the following table layout:

Excel - copy data

The Sankey diagram generator requires data in three columns – source, target and value – describing what to connect to what and with what line thickness. Looking at the data, I thought it might be interesting to try to represent as flows the amount of each type of energy used by each sector relative to end use, or something along those lines (I just need something authentic to see if I can get @timelyportfolio’s recipe to work;-) So it looks as if some shaping is in order…
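
To make that concrete, here’s a minimal, made-up illustration of the three-column structure the plotter is expecting (the labels and numbers are purely illustrative):

#Illustrative only - the shape of the data the Sankey plotter expects
links=data.frame(source=c('Gas','Gas','Electricity'),
                 target=c('Heating','Cooking','Lighting'),
                 value=c(300,50,120))
links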

To tidy and reshape the data, I opted to use OpenRefine, copying and pasting the data into a new OpenRefine project:

Refine - paste DECC energy data

The data is tab separated and we can ignore empty lines:

Refine - paste import settings (DECC)

Here’s the data as loaded. You can see several problems with it: numbers that have commas in them; empty cells marked as blank or with a -; empty/unlabelled cells.

DECC data as imported

Let’s make a start by filling in the blank cells in the Sector column – Fill down:

DECC data fill down

We don’t need the overall totals because we want to look at piecewise relations (and if we do need the totals, we can recalculate them anyway):

DECC filter out overall total

To tidy up the numbers so they actually are numbers, we’re going to need to do some transformations:

DECC need to clean numeric cols

There are several things to do: remove commas, remove – signs, and cast things as numbers:

DECC clean numeric column

value.replace(',','') says replace commas with an empty string (ie nothing – delete the comma).

We can then pass the result of this transformation into a following step – replace the – signs: value.replace(',','').replace('-','')

Then turn the result into a number: value.replace(',','').replace('-','').toNumber()

If there’s an error, note that we select the option to set the cell to a blank.

Having run this transformation on one column, we can select Transform on another column and just reuse the transformation (remembering to set the cell to blank if there is an error):

DECC number cleaner reuse

To simplify the dataset further, let’s get rid of the other totals data:

DECC remove data column

Now we need to reshape the data – ideally, rather than having columns for each energy type, we want to relate the energy type to each sector/end use pair. We’re going to have to transpose the data…

DECC start to reshape

So let’s do just that – wrap columns down into new rows:

DECC data transposition

We’re going to need to fill down again…

DECC need to fill down again

So now we have our dataset, which can be trivially exported as a CSV file:

DECC export as CSV
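
(As an aside, the reshaping step could equally have been done in R rather than OpenRefine. Here’s a rough sketch, assuming the cleaned wide-format table had been read into a dataframe called decc.wide with columns Sector, Enduse and one column per energy type – the dataframe name is just an assumption for the sake of the example:)

#Rough sketch of an R-based alternative to the OpenRefine reshaping
library(reshape2)
decc.long=melt(decc.wide, id.vars=c('Sector','Enduse'),
               variable.name='EnergyType', value.name='value')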

Data cleaning and shaping phase over, we’re now ready to generate the Sankey diagram…

As ever, I’m using RStudio as my R environment. Load in the data:

R import DECC data

To start, let’s do a little housekeeping:

#Here's the baseline column naming for the dataset
colnames(DECC.overall.energy)=c('Sector','Enduse','EnergyType','value')

#Inspired by @timelyportfolio - All My Roads Lead Back to Finance–PIMCO Sankey
#http://timelyportfolio.blogspot.co.uk/2013/07/all-my-roads-lead-back-to-financepimco.html

#Now let's create a Sankey diagram - we need to install RCharts
##http://ramnathv.github.io/rCharts/
require(rCharts)

#Download and unzip @timelyportfolio's Sankey/rCharts package
#Take note of where you put it!
#https://github.com/timelyportfolio/rCharts_d3_sankey

sankeyPlot <- rCharts$new()

#We need to tell R where the Sankey library is.
#I put it as a subdirectory to my current working directory (.)
sankeyPlot$setLib('./rCharts_d3_sankey-gh-pages/')

#We also need to point to an HTML template page
sankeyPlot$setTemplate(script = "./rCharts_d3_sankey-gh-pages/layouts/chart.html")

Having got everything set up, we can cast the data into the form the Sankey template expects – with source, target and value columns identified:

#The plotting routines require column names to be specified as:
##source, target, value
#to show what connects to what and by what thickness line

#If we want to plot from enduse to energytype we need this relabelling
workingdata=DECC.overall.energy
colnames(workingdata)=c('Sector','source','target','value')

Following @timelyportfolio, we configure the chart and then open it to a browser window:

sankeyPlot$set(
  data = workingdata,
  nodeWidth = 15,
  nodePadding = 10,
  layout = 32,
  width = 750,
  height = 500,
  labelFormat = ".1%"
)

sankeyPlot

Here’s the result:

Basic sankey DECC

Let’s make plotting a little easier by wrapping that routine into a function:

#To make things easier, let's abstract a little more...
sankeyPlot=function(df){
  sankeyPlot <- rCharts$new()
  
  #--------
  #See note in PPS to this post about a simplification of this part....
  #We need to tell R where the Sankey library is.
  #I put it as a subdirectory to my current working directory (.)
  sankeyPlot$setLib('./rCharts_d3_sankey-gh-pages/')
  
  #We also need to point to an HTML template page
  sankeyPlot$setTemplate(script = "./rCharts_d3_sankey-gh-pages/layouts/chart.html")
  #---------
 
  sankeyPlot$set(
    data = df,
    nodeWidth = 15,
    nodePadding = 10,
    layout = 32,
    width = 750,
    height = 500,
    labelFormat = ".1%"
  )
  
  sankeyPlot
}

Now let’s try plotting something a little more adventurous:

#If we want to add in a further layer, showing how each Sector contributes
#to the End-use energy usage, we need to additionally treat the Sector as
#a source and the sum of that sector's energy use by End Use
#Recover the colnames so we can see what's going on
sectorEnergy=aggregate(value ~ Sector + Enduse, DECC.overall.energy, sum)
colnames(sectorEnergy)=c('source','target','value')

#We can now generate a single data file combing all source and target data
energyfull=subset(workingdata,select=c('source','target','value'))
energyfull=rbind(energyfull,sectorEnergy)

sankeyPlot(energyfull)

And the result?

Full Sankey DECC

Notice that the bindings are a little bit fractured – for example, the Heat block has several contributions from the Gas block. This also suggests that a Sankey diagram, at least as configured above, may not be the most appropriate way of representing the data in this case. Sankey diagrams are intended to represent flows, which means that there is a notion of some quantity flowing between elements, and further that that quantity is conserved as it passes through each element (sum of inputs equals sum of outputs).

A more natural story might be to show Energy type flowing to end use and then out to Sector, at least if we want to see how energy is tending to be used for what purpose, and then how end use is split by Sector. However, such a diagram would not tell us, for example, that Sector X was dominated in its use of energy source A for end use P, compared to Sector Y mainly using energy source B for the same end use P.

One approach we might take to tidying up the chart to make it more readable (for some definition of readable!), though at the risk of making it even more misleading, is to do a little bit more aggregation of the data, and then bind appropriate blocks together. Here are a few more examples of simple aggregations:

We can also explore other relationships and trivially generate corresponding Sankey diagram views over them:

#How much energy does each sector use for each end use
enduseBySector=aggregate(value ~ Sector + Enduse, DECC.overall.energy, sum)
colnames(enduseBySector)=c('source','target','value')
sankeyPlot(enduseBySector)

colnames(enduseBySector)=c('target','source','value')
sankeyPlot(enduseBySector)

#How much of each energy type is associated with each enduse
energyByEnduse=aggregate(value ~ EnergyType + Enduse, DECC.overall.energy, sum)
colnames(energyByEnduse)=c('source','target','value')

sankeyPlot(energyByEnduse)

So there we have it – quick and easy Sankey diagrams from R using rCharts and a magic recipe from @timelyportfolio:-)

PS the following routine makes it easier to grab data into the appropriately named format

#This routine makes it easier to get the data for plotting as a Sankey diagram
#Select the source, target and value column names explicitly to generate a dataframe containing
#just those columns, appropriately named.
sankeyData=function(df,colsource='source',coltarget='target',colvalue='value'){
  sankey.df=subset(df,select=c(colsource,coltarget,colvalue))
  colnames(sankey.df)=c('source','target','value')
  sankey.df
}

#For example:
data.sdf=sankeyData(DECC.overall.energy,'Sector','EnergyType','value')
data.sdf

The code automatically selects the appropriate columns and renames them as required.

PPS it seems that a recent update(?) to the rCharts library by @ramnath_vaidya now makes things even easier and removes the need to download and locally host @timelyportfolio’s code:

#We can remove the local dependency and replace the following...
#sankeyPlot$setLib('./rCharts_d3_sankey-gh-pages/')
#sankeyPlot$setTemplate(script = "./rCharts_d3_sankey-gh-pages/layouts/chart.html")
##with this simplification
sankeyPlot$setLib('http://timelyportfolio.github.io/rCharts_d3_sankey')

Written by Tony Hirst

July 23, 2013 at 11:34 am

Posted in OpenRefine, Rstats

Tagged with

Generating Alerts From Guardian University Tables Data

One of the things I’ve been pondering with respect to the whole data journalism process is how journalists without a lot of statistical training can quickly get a feel for whether there may be interesting story leads in a dataset, or how they might be able to fashion “alerts” that bring attention to data elements that might be worth investigating further. In the case of the Guardian university tables, this might relate to identifying which universities appear to have courses that rank particularly well in a subject and within their own institution, or which subject areas appear to have teaching or assessment satisfaction issues in a particular institution. In this post, I have a quick play with an idea for generating visual alerts that might help us set up or identify hopefully interesting or fruitful questions to ask along these lines.

Statistics will probably have a role to play in generating most forms of alert, but as Jeff Leek has recently pointed out in the Simply Statistics blog, [t]he vast majority of statistical analysis is not performed by statisticians. Furthermore:

We also need to realize that the most impactful statistical methods will not be used by statisticians, which means we need more fool proofing, more time automating, and more time creating software. The potential payout is huge for realizing that the tide has turned and most people who analyze data aren’t statisticians.

By no stretch of the imagination would I class myself as a statistician. But I have started trying to develop some statistical intuitions by going back to several histories of statistics in order to see what sorts of problems the proto-statisticians were trying to develop mathematical techniques to solve (for example, I’m currently reading The History of Statistics: The Measurement of Uncertainty before 1900).

One of the problems I’ve encountered for myself relates to trying to find outliers in a quick and easy way. One way of detecting an outlier is to assume that the points have a normal distribution about the population mean value, and then look for points that lie several standard deviations away from the mean. If that all sounds way too complicated, the basic idea is this: which items in a group have values a long way from the average value in the group?

A statistic that captures this idea in a crude way is the z-score (or should that be z-statistic? z-value?), which for a particular point is the deviation from the mean divided by the standard deviation (I think). Which is to say, it’s proportional to how far away from ‘the average’ a point is divided by the average of how far away all points are. (What it doesn’t tell you is whether that distance is meaningful or important in any sense, which I think statisticians refer to as “the power of the effect”. I think. They talk in code.)

Anyway, the magic of R makes it easy to calculate the z-score for a group of numbers using the scale() function. So I had a quick play with it to see if it might be useful in generating alerts around the Guardian University tables data.
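
By way of a toy illustration of what scale() is doing (this has nothing to do with the Guardian data itself – the numbers are made up):

#Toy example: z-scores via scale()
x=c(2,4,4,4,5,5,7,29)
scale(x)
#Equivalent to (x-mean(x))/sd(x); the outlying value shows up with a z-score over 2
(x-mean(x))/sd(x)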

This post builds on two earlier posts and may not make much sense if you haven’t been following the story to date. The first post, which tells how to grab the data into R, can be found in Datagrabbing Commonly Formatted Sheets from a Google Spreadsheet – Guardian 2014 University Guide Data; the second, how to start building a simple interactive viewer for the data using the R shiny library, can be found in Disposable Visual Data Explorers with Shiny – Guardian University Tables 2014.

The general question I had in mind was this: is there a way we can generate a simple view over the data to highlight outperformers and underperformers under each of the satisfaction scores? The sort of view I thought I wanted depended on the stance from which a question about the data might arise. For example, through building a single dataset from the subject level data, we find ourselves in a position where we can ask questions from the perspective of someone from a particular university who may be interested in how the various satisfaction scores for a particular subject compare to the scores for the other subjects offered by the institution. Alternatively, we might be interested in how the satisfaction scores for a particular subject area might compare across several institutions.

This seemed like things the z-score might be able to help with. For example, for each of the satisfaction scores/columns associated with a particular subject, we can generate the z-score that shows how far away from the average in that satisfaction column for that subject each institution is.

##gdata contains the Guardian university tables data for all subjects
#Get the list of subjects
ggs=levels(gdata$subject)
#Grab a working copy of the data
ggs1=gdata
#Find the z-scores for the various numerical satisfaction score columns in the data for the first subject
ggs1s=data.frame(scale(ggs1[ggs1$subject==ggs[1],-c(1,2,12)]))
#Add another column that labels this set of z-scores with the appropriate institution names
ggs1s$inst=ggs1[ggs1$subject==ggs[1],2]
#Add another column that specifies the subject the z-scores correspond to
ggs1s$ss=ggs[1]
#We now repeat the above process for each of the other subjects using a temporary working copy of the data
for (g in ggs[-1]){
  tmp=data.frame(scale(ggs1[ggs1$subject==g,-c(1,2,12)]))
  tmp$inst=ggs1[ggs1$subject==g,2]
  tmp$ss=g
  #The only difference is we add this temporary data to the end of the working copy
  ggs1s=rbind(ggs1s,tmp)
}
#The resulting data frame, ggs1s, contains z-scores for each satisfaction data column across institutions within a subject.

This data essentially gives us a clue about how well one institution’s scores compare with the scores of the other institutions in each separate subject area. The lead in here is to questions along the lines of “which universities have particularly good career prospects for students studying subject X”, or “which universities are particularly bad in terms of assessment satisfaction when it comes to subject Y?”.
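
By way of illustration, here’s the sort of query that supports – note that the Career.prospects column name is an assumption for the sake of the example (any of the z-scored columns could be substituted):

#Which institutions are furthest above the subject average on one score column?
#NB 'Career.prospects' is an assumed/illustrative column name - check names(ggs1s)
psychz=subset(ggs1s, ss=='Psychology')
head(psychz[order(-psychz$Career.prospects), c('inst','Career.prospects')])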

In addition, for each of the satisfaction scores/columns associated with a particular institution, we can generate the z-score that shows how far away from the average in that satisfaction column for that institution each subject is.

##The pattern of this code should look familiar...
##We're going to do what we did for subjects, only this time for institutions...
#Get the list of institutions
ggi=levels(gdata$Name.of.Institution)
#Grab a working copy of the data
ggi1=gdata
#Find the z-scores for the various numerical satisfaction score columns in the data for the first institution
ggi1s=data.frame(scale(ggi1[ggi1$Name.of.Institution==ggi[1],-c(1,2,12)]))
#Add another column that labels this set of z-scores with the appropriate institution names
ggi1s$inst=ggi[1]
#Add another column that specifies the subject the z-scores correspond to
ggi1s$ss=ggi1[ggi1$Name.of.Institution==ggi[1],12]
#We now repeat the above process for each of the other institutions using a temporary working copy of the data
for (g in ggi[-1]){
  tmp=data.frame(scale(ggi1[ggi1$Name.of.Institution==g,-c(1,2,12)]))
  tmp$ss=ggi1[ggi1$Name.of.Institution==g,12]
  tmp$inst=g
  #As before, the only difference is we add this temporary data to the end of the working copy
  ggi1s=rbind(ggi1s,tmp)
}
#The resulting data frame, ggi1s, contains z-scores for each satisfaction data column across subjects within an institution.

This second take on the data essentially gives us a clue about how well each subject area performs within an institution compared to the performance of the other subjects offered by that institution. This might help lead us into questions of the form “which department within an institution appears to have an unusually high entrance tariff compared to other courses offered by the institution?” or “which courses have a particularly poor overall satisfaction score, which might be dragging down our overall NSS result?” For university administrators taking a simplistic view of this data, it might let you tell a Head of Department that their chaps are letting the side down. Or it might be used by a Director of a Board of Studies in a particular subject area to boost a discretionary “bonus” application.

So here’s something I’ve started playing with to try to generate “alerts” based on outperformers and underperformers within an institution and/or within a particular subject area compared to other institutions: plots that show subject/institution combinations that have “large” z-scores as calculated above.

To plot the charts, we need to combine the z-score datasets and get the data into the right shape:

#melt() comes from the reshape2 package (or its predecessor, reshape)
library(reshape2)
#Convert the z-score dataframes from "wide" format to "long" format
#The variable column identifies which satisfaction score the row corresponds to for a given subject/institution pair
ggi1sm=melt(ggi1s,id=c('inst','ss'))
ggs1sm=melt(ggs1s,id=c('inst','ss'))
#Merge the datasets by subject/institution/satisfaction score
ggs2=merge(ggs1sm,ggi1sm,by=c('inst','ss','variable'))

It’s then a simple matter to plot the outliers for a given institution – let’s filter to show only items where at least one of the z-scores has magnitude greater than 2:

#The plots use ggplot2
library(ggplot2)

ii='Cardiff'
g=ggplot(subset(ggs2,inst==ii & !is.na(value.x) & !is.na(value.y) & (abs(value.x)>2 | abs(value.y)>2))) + geom_text(aes(x=value.x,y=value.y,label=ss),size=2)+facet_wrap(~variable)
g=g+labs(title=paste("Guardian University Tables 2014 Alerts:",ii), x='z relative to other institutions in same subject', y='z relative to other subjects in same institution')
g

Or for given subject:

#NB use a variable name that doesn't clash with the ss column, otherwise subset()'s ss==ss test matches every row
ssub='Psychology'
g=ggplot(subset(ggs2,ss==ssub & !is.na(value.x) & !is.na(value.y) & (abs(value.x)>2 | abs(value.y)>2))) + geom_text(aes(x=value.x,y=value.y,label=inst),size=2)+facet_wrap(~variable)
g=g+labs(title=paste("Guardian University Tables 2014 Alerts:",ssub), x='z relative to other institutions in same subject', y='z relative to other subjects in same institution')
g

Let’s see how meaningful these charts are, and whether they provide us with any glanceable clues about where there may be particular successes or problems…

Here’s the institutional view:

guardianalert2014Cardiff

A couple of things we might look for are items in the top right and bottom left corners – these are subjects where the scores are well above (top right), or well below (bottom left), average both nationally and within the institution. Subjects close to the y-axis are well away from the average within the institution, but fairly typical on a national level for that subject area. Subjects close to the x-axis are fairly typical within the institution, but away from average when compared to other institutions.

So for example, dentistry appears to be having a hard time of it? Let’s look at the dentistry subject table on an updated version of the Shiny viewer (which now includes the alert charts), focussing on a couple of the satisfaction values that appear to be weak at a national scale:

guardian 2014 shiny dentistry

How about the subject alert?

guardianalerts2014psychology

Hmmm…what’s happening at UC Suffolk?

I’m still not sure how useful these views are, and the approach thus far still doesn’t give a glanceable alert over the whole data set in one go (we have to run views over institutions or subjects, for example). But it was relatively quick to do (this post took almost as long to write as the coding), and it maybe opens up ideas about what other questions we might ask of the data?

Written by Tony Hirst

June 23, 2013 at 12:24 pm

Posted in Rstats

Tagged with

Disposable Visual Data Explorers with Shiny – Guardian University Tables 2014

Have data – now what? Building your own interactive data explorer need not be a chore with the R shiny library… Here’s a quick walkthrough…

In Datagrabbing Commonly Formatted Sheets from a Google Spreadsheet – Guardian 2014 University Guide Data, I showed how to grab some data from several dozen commonly formatted sheets in a Google spreadsheet, and combine them to produce a single monolithic data set. The data relates to UK universities and provides several quality/satisfaction scores for each of the major subject areas they offer courses in.

We could upload this data to something like Many Eyes in order to generate visualisations over it, or we could create a visual data explorer app of our own. It needn’t take too long, either…

Here’s an example, the Simple Guardian University Rankings 2014 Explorer, that lets you select a university and then generate a scatterplot that shows how different quality/ranking scores vary for that university by subject area:

Crude data explorer - guardian 2014 uni stats

The explorer allows you to select a university and then generate a scatterplot based around selected quality scores. The label size is also set relative to a selected quality score.

The application is built up from three files. A generic file that we use to load the source data in (in this example I pull it from a file, though we could bring it in live from the Google spreadsheet).

##global.R
load("guardian2014unidata.Rda")
#In this case, the data is loaded into the dataframe: gdata
#Once it's loaded, we tidy it (I should have tidied the saved data really!)
gdata[, 4:11] <- sapply(gdata[, 4:11], as.numeric)
gdata$Name.of.Institution=as.factor(gdata$Name.of.Institution)
gdata$subject=as.factor(gdata$subject)

A “server” file that takes input from the user interface elements on the left of the app and generates the displayed chart:

##server.R
library(shiny)
library(ggplot2)
 
# Define server logic
shinyServer(function(input, output) {
  
  #Simple test plot
  output$testPlot = renderPlot( {
    pdata=subset(gdata, Name.of.Institution==input$tbl)
    #g=ggplot(pdata) + geom_text(aes(x=X..Satisfied.with.Teaching,y=X..Satisfied.with.Assessment,label=subject,size=Value.added.score.10))
    g=ggplot(pdata) + geom_text(aes_string(x=input$tblx,y=input$tbly,size=input$tbls, label='subject'))
    g=g+labs(title=paste("Guardian University Tables 2014:",input$tbl))
    print(g)
  })
  
})

A user interface definition file.

##ui.R
library(shiny)
 
#Generate a list containing the names of the institutions
uList=levels(gdata$Name.of.Institution)
names(uList) = uList
 
#Generate a list containing the names of the quality/ranking score columns by column name
cList=colnames(gdata[c(1,3:11)])
names(cList) = cList
 
# Define UI for the application
shinyUI(pageWithSidebar(
  
  # Application title
  headerPanel("Guardian 2014 University Tables Explorer"),
  
  sidebarPanel(
    #Which table do you want to view, based on the list of institution names?
    selectInput("tbl", "Institution:",uList),

    #Also let the user select the x, y and label size, based on quality/ranking columns
    selectInput("tblx", "x axis:",cList,selected = 'X..Satisfied.with.Teaching'),
    selectInput("tbly", "y axis:",cList,selected='X..Satisfied.with.Assessment'),
    selectInput("tbls", "Label size:",cList,selected = 'Value.added.score.10'),
    
    div("This demo provides a crude graphical view over data extracted from",
        a(href='http://www.guardian.co.uk/news/datablog/2013/jun/04/university-guide-2014-table-rankings',
          "Guardian Datablog: University guide 2014 data tables") )
    
  ),
  
  #The main panel is where the "results" charts are plotted
  mainPanel(
    plotOutput("testPlot")#,
    #tableOutput("view")  
  )
))

And that’s it… If we pop these files into a single gist – such as the one at https://gist.github.com/psychemedia/5824495, which also includes code for grabbing the data from the Google spreadsheet – we can run application from the RStudio command line as follows:

library(shiny)
runGist('5824495')

(Hit “escape” to stop the script running.)

With a minor tweak, we can get a list of unique subjects, rather than institutions, and allow the user to compare courses across institutions by subject, rather than across subject areas within an institution.

We can then combine the two approaches into a single interface, Guardian 2014 university table explorer v2 – although it’s not ideal (we should really grey out the inactive selector – institution or subject area – according to which one hasn’t been selected via the radio button).

guardian uni 2014 explorer 2

The global.R file is the same, although we need to tweak the ui.R and server.R files.

To the UI file, we add a radio button selector and an additional menu (for subjects):

##ui.R
library(shiny)

uList=levels(gdata$Name.of.Institution)
names(uList) = uList

#Pull out the list of subjects
sList=levels(gdata$subject)
names(sList) = sList

cList=colnames(gdata[c(1,3:11)])
names(cList) = cList

# Define UI for the application
shinyUI(pageWithSidebar(
  
  # Application title
  headerPanel("Guardian 2014 University Tables Explorer v.2"),
  
  sidebarPanel(
    #Add in a radio button selector
    radioButtons("typ", "View by:",
                 list("Institution" = "inst",
                      "Subject" = "subj")),
    
    #Just a single selector here - which table do you want to view?
    selectInput("tbli", "Institution:",uList),
    #Add a selector for the subject list
    selectInput("tblb", "Subject:",sList),
    selectInput("tblx", "x axis:",cList,selected = 'X..Satisfied.with.Teaching'),
    selectInput("tbly", "y axis:",cList,selected='X..Satisfied.with.Assessment'),
    selectInput("tbls", "Label size:",cList,selected = 'Value.added.score.10'),
    
    div("This demo provides a crude graphical view over data extracted from",
        a(href='http://www.guardian.co.uk/news/datablog/2013/jun/04/university-guide-2014-table-rankings',
          "Guardian Datablog: University guide 2014 data tables") )
    
  ),
  
  #The main panel is where the "results" charts are plotted
  mainPanel(
    plotOutput("testPlot")#,
    #tableOutput("view")
  )
))

To the server file, we add a level of indirection, setting local state variables based on the UI selectors, and then using the value of these variables within the chart generator code itself.

##server.R
library(shiny)
library(ggplot2)

# Define server logic
shinyServer(function(input, output) {

  #We introduce a level of indirection, creating routines that set state within the scope of the server based on UI actions
  #If the radio button state changes, reset the data filter
  pdata <- reactive({
    switch(input$typ,
           inst=subset(gdata, Name.of.Institution==input$tbli),
           subj=subset(gdata, subject==input$tblb)
    )
  })
  
  #Make sure we use the right sort of label (institution or subject) in the title
  ttl <- reactive({
    switch(input$typ,
           inst=input$tbli,
           subj=input$tblb
    )
  })
 
  #Make sure we display the right sort of label (by institution or by subject) in the chart
  lab <- reactive({
    switch(input$typ,
           inst='subject',
           subj='Name.of.Institution'
           )
  })
  
  #Simple test plot
  output$testPlot = renderPlot({
    
    #g=ggplot(pdata) + geom_text(aes(x=X..Satisfied.with.Teaching,y=X..Satisfied.with.Assessment,label=subject,size=Value.added.score.10))
    g=ggplot(pdata()) + geom_text(aes_string(x=input$tblx,y=input$tbly,size=input$tbls, label=lab()  ))
    g=g+labs(title=paste("Guardian University Tables 2014:",ttl()))
    print(g)
  })
  
})

What I hoped to show here was how it’s possible to create a quick visual explorer interface over a dataset using the R shiny library. Many users will be familiar with using wizards to create charts in spreadsheet programmes, but may get stuck when it comes to figuring out how to generate large numbers of charts. As a quick and dirty tool, shiny provides a great environment for knocking up disposable interfaces that provide you with a playground for checking out a wide range of chart data settings from automatically populated list selectors.

With a few more tweaks, we could add in the option to download data by subject or institution, add range selectors to allow us to view only results where a score falls within a particular range, and so on. We can also define new charts and displays (including tabular data displays) to view the data, just slotting them in with very simple UI and server components as used in the original example.

PS this post isn’t necessarily intended to say that we should just be adding to the noise by publishing interactive data explorers that folk don’t know how to use to support “datajournalism” stories or research dissemination (the above apps are way too scruffy for that, and the charts potentially too confusing or cluttered for the uninitiated to make sense of). Rather, I suggest that journalists, researchers etc should feel as if they are in a position to knock up their own data exploration tools as part of the “homework” involved in prepping for a conversation with a data source. The tool building also becomes an extension of the conversation with the data. Complete/complex apps aren’t built in one go. As the example described here shows, it was built up in baby steps, starting with the data grab and an initial chart in the previous post, moving on to a simple interactive chart at the start of this post, then starting to evolve into a more complex tool through the addition of further features.

If the app gets more complex, eg in response to me wanting to be able to ask more refined questions of the data, or take filtered data dumps from it (for example, for use in other charting applications, such as datawrapper.de), this just represents an evolution of, or increase in depth of, the conversation I am having with the data and the notes I am taking around it.

Written by Tony Hirst

June 21, 2013 at 10:24 am

Posted in Rstats

Tagged with

Datagrabbing Commonly Formatted Sheets from a Google Spreadsheet – Guardian 2014 University Guide Data

So it seems like it’s that time of year when the Guardian publish their university rankings data (Datablog: University guide 2014), which means another opportunity to have a tinker and see what I’ve learned since last year…

(Last year’s hack was a Filtering Guardian University Data Every Which Way You Can…, where I had a quick go at creating a simple interactive dashboard viewer and charter over the Guardian tables, as published via Google spreadsheets.)

The data is published via a Google spreadsheet using a standard sheet layout (apart from the first two sheets, with gid numbers 0 and 1 respectively). Sheets 2..47 are formatted as follows:

Guardian data table uni 2014

The following R function provides the basis for downloading the data from a single sheet, given the spreadsheet key and the sheet gid, and puts the data into a dataframe.

library(RCurl)
gsqAPI = function(key,query,gid=0){
  tmp=getURL( paste( sep="",'https://spreadsheets.google.com/tq?', 'tqx=out:csv','&tq=', curlEscape(query), '&key=', key, '&gid=', gid), ssl.verifypeer = FALSE )
  return( read.csv( textConnection( tmp ),  stringsAsFactors=F ) )
}

The query is written using the SQL-like Google visualisation API query language.
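
For example (a hypothetical query, just to illustrate the syntax – the column letters refer to spreadsheet columns and the threshold is made up):

#Hypothetical example of the query syntax: pull two columns, filtered on a value
#(replace 'SPREADSHEET_KEY' with the actual spreadsheet key)
gsqAPI('SPREADSHEET_KEY', "select B,C where C > 80", 2)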

The following routine will download non-empty rows from a specific sheet, and rename the subject specific rank column to a generically titled column (“Subject Rank”). An additional column is added to the table that denotes what subject the table is associated with.

#Grab a subject sheet and tidy it up
handler=function(key,i){
  #Pull the non-empty rows from the sheet with gid=i
  tmp=gsqAPI(key,"select * where B!=''", i)
  #Extract the subject name from the header of the subject-specific rank column
  subject=sub(".Rank",'',colnames(tmp)[1])
  #Rename that column to a generic title
  colnames(tmp)[1]="Subject.Rank"
  #Record which subject the rows relate to
  tmp$subject=subject
  tmp
}

We can now download the first subject specific sheet using the following call:

key='0Aq73qj3QslDedHFNdUsxOVZQZ1dmbXNrYlZGUWgwdHc'
gdata=handler(key,2)

This loads in the data as follows:

seeding guardian ranking data table

We can then add data to this dataframe from all the other sheets, to give us a single dataframe containing ranking data for all the subjects listed.

#Append the data from the remaining subject sheets
for (i in 3:47){
  gdata=rbind(gdata,handler(key,i))
}

(This is probably not a very R idiomatic (iRonic?) way of doing things, but it seems to do the trick…)
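For what it’s worth, a slightly more idiomatic way of grabbing and combining all the subject sheets in one go might look something like this:

#A vectorised alternative to the seeding-then-looping approach above
gdata = do.call(rbind, lapply(2:47, function(i) handler(key, i)))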

The result is a single dataframe containing all the subject specific data from the Guardian spreadsheets.

If we check the data types of the columns in the full data set, we can see that most of the columns that should be treated as numeric are instead being treated as characters.

Data format - need to make numbers

We can fix this with a one-liner that forces the type on the columns we select (the fourth to the eleventh columns):

gdata[, 4:11] <- sapply(gdata[, 4:11], as.numeric)
#Cast the university names and subjects to levelled factors
gdata$Name.of.Institution=as.factor(gdata$Name.of.Institution)
gdata$subject=as.factor(gdata$subject)
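A quick check that the casts have taken:

#Check the column types after the conversion
sapply(gdata[, 4:11], class)
#And list the subject factor levels
levels(gdata$subject)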

One advantage of combining the data from the separate sheets into a monolithic data table is that we can see the ranking across all subject areas for a given university. For example:

oxfordbrookes.rankings = subset(gdata, Name.of.Institution=='Oxford Brookes', select=c('subject','Subject.Rank') )
#Let's also sort the results:
oxfordbrookes.rankings = oxfordbrookes.rankings[ order(oxfordbrookes.rankings['Subject.Rank']), ]
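Peeking at the top few rows of the sorted table then shows the university’s strongest subjects by rank:

#Show the ten best ranked subjects for the selected university
head(oxfordbrookes.rankings, 10)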

Cross-subject rankings

We can also start to quiz the data in a graphical way. For example, can we get a feel for how courses are distributed within a university according to teaching and satisfaction levels, whilst also paying heed to the value add score?

library(ggplot2)
oxfordbrookes.full = subset(gdata, Name.of.Institution=='Oxford Brookes' )
ggplot(oxfordbrookes.full) + geom_text(aes(x=X..Satisfied.with.Teaching, y=X..Satisfied.with.Assessment, label=subject,size=Value.added.score.10))

oxfordbrookes start asking

All told, it took maybe a couple of hours of faffing around trying to remember R syntax and cope with https hassles to get this far (including a chunk of time to write this post;-). But now we’re in a position to start hacking out quick queries and having a proper conversation with the data. The work can also be viewed as disposable tool building – it hasn’t required a project proposal, it hasn’t used lots of third party developer time etc etc.

As it is though, I’m not sure how useful getting this far is to anyone who isn’t willing to have a go at hacking the data for themselves…

Hmmm… maybe I should try to sketch out a quick Shiny app…..?

Written by Tony Hirst

June 20, 2013 at 3:35 pm

Posted in Rstats

Tagged with

Evaluating Event Impact Through Social Media Follower Histories, With Possible Relevance to cMOOC Learning Analytics

Last year I sat on a couple of panels organised by I’m a Scientist’s Shane McCracken at various science communication conferences. A couple of days ago, I noticed Shane had popped up a post asking Who are you Twitter?, a quick review of a social media mapping exercise carried out on the followers of the @imascientist Twitter account.

Using the technique described in Estimated Follower Accession Charts for Twitter, we can estimate a follower acquisition growth curve for the @imascientist Twitter account:

imascientist

I’ve already noted how we may be able to use “spikes” in follower acquisition rates to identify news events that involved the owner of a particular Twitter account and caused a surge in follower numbers as a result (What Happened Then? Using Approximated Twitter Follower Accession to Identify Political Events).

Thinking back to the context of evaluating the impact of events that include social media as part of the overall campaign, it struck me that whilst running a particular event may not lead to a huge surge in follower numbers on the day of the event or in the immediate aftermath, the followers who do sign up over that period might have signed up as a result of the event. And now we have the first inklings of a post hoc analysis tool that lets us try to identify these people, and perhaps look to see if their profiles are different from the profiles of followers who signed up at other times (maybe reflecting the audience interest profile of folk who attended a particular event, or reflecting sign-ups from a particular geographical area?).

In other words, through generating the follower acquisition curve, can we use it to filter down to folk who started following around a particular time, in order to then see whether there is a possibility that they started following as a result of a particular event, and if so whether they count as some sort of “conversion”? (I appreciate that there are a lot of caveats in there!;-)
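For what it’s worth, here’s a minimal sketch of what such a filter might look like, assuming a follower dataframe processed as in the Estimated Follower Accession Charts recipe (below on this page), with a recency column giving the estimated follow time in days ago; the helper function name and the window size are made up:

#A rough heuristic: keep followers whose estimated follow time falls near the event
followersAroundEvent = function(followers, eventDaysAgo, window = 7){
  subset(followers, abs(recency - eventDaysAgo) <= window)
}
#e.g. the cohort estimated to have started following around 100 days ago
#eventCohort = followersAroundEvent(processUserData(mp_cp), 100)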

A similar approach may also be relevant in the context of analysing link building around historical online community events, such as MOOCs… If we know somebody took a particular MOOC at a particular time, might we be able to construct their follower acquisition curve and then analyse it around the time of the MOOC, looking to see if the connections built over that period are different to the user’s other followers, and as such may represent links developed as a result of taking the MOOC? Analysing the timelines of the respective parties may further reveal conversational dynamics between those parties, and as such allow us to see whether a fruitful social learning relationship developed out of contact made in the MOOC?

Written by Tony Hirst

April 21, 2013 at 5:40 pm

Estimated Follower Accession Charts for Twitter

Just over a year or so ago, Mat Morrison/@mediaczar introduced me to a visualisation he’d been working on (How should Page Admins deal with Flame Wars?) that I started to refer to as an accession chart (Visualising Activity Around a Twitter Hashtag or Search Term Using R). The idea is that we provide each entrant into a conversation or group with an accession number: the first person has accession number 1, the second person accession number 2 and so on. The accession number is plotted in rank order on the vertical y-axis, with ranked/time ordered “events” along the horizontal x-axis: utterances in a conversation for example, or posts to a forum.

A couple of months ago, I wondered whether this approach might also be used to estimate when folk started following an individual on Twitter. My reasoning went something like this:

One of the things I think is true of the Twitter API call for the followers of an account is that it returns the list of followers in reverse accession order. So the person who followed an account most recently will be at the top of the list (the first to be returned) and the person who followed first will be at the end of the list. Unfortunately, we don’t know when each follower actually started following the account, so it’s hard to spot bursty growth in its follower numbers. However, it struck me that we may be able to get a bound on this by looking at the dates at which followers joined Twitter, along with their ‘accession order’ as followers of an account. If we get the list of followers and reverse it, and assume that this gives an ordered list of followers (with the follower that started following the longest time ago first), we can then work through this list and keep track of the most recent ‘created_at’ date seen so far. This running value gives us the earliest possible date (a lower bound) at which followers that far through the list could have started following. (You can’t start following until you join Twitter…)

So for example, if followers A, B, C, D in that accession order (ie started following target in that order) have user account creation dates 31/12/09, 1/1/09, 15/6/12, 5/5/10 then:
– A started following no earlier than 31/12/09 (because that’s when they joined Twitter and it’s the most recent creation date we’ve seen so far)
– B started following no earlier than 31/12/09 (because they started following after A)
– C started following no earlier than 15/6/12 (because that’s when they joined Twitter and it’s the most recent creation date we’ve seen so far)
– D started following no earlier than 15/6/12 (because they started following after C, which gave us the most recent creation date seen so far)
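In R, that running bound is just a cumulative maximum over the account creation dates. A minimal sketch using the made-up dates above:

#A, B, C, D account creation dates, in follower accession order
created = as.Date(c("2009-12-31", "2009-01-01", "2012-06-15", "2010-05-05"))
#Running "most recent creation date seen so far" = earliest possible follow date
earliestFollow = as.Date(cummax(as.numeric(created)), origin = "1970-01-01")
earliestFollow
# "2009-12-31" "2009-12-31" "2012-06-15" "2012-06-15"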

That’s probably confused you enough, so here’s a chart – accession number is along the bottom (i.e. the x-axis), joining date (in days ago) is on the y-axis:

recencyVacc

NOTE: this diverges from the accession graph described above, where accession number goes on the y-axis and rank-ordered events run along the x-axis.

What the chart shows is an estimate (the red line) of how many days ago a follower with a particular accession number started to follow a particular Twitter account.

As described in Sketches Around Twitter Followers, we see a clear break at 1500 days ago when Twitter started to get popular. This approach also suggests a technique for creating “follower probes” that we can use to date a follower record: if you know which day a particular user followed a target account, you can use that follower to put a datestamp into the follower record (assuming the Twitter API returns followers in reverse accession order).

Here’s an example of the code I used based on Twitter follower data grabbed for @ChrisPincher (whose follower profile appeared to be out of sorts from the analysis sketched in Visualising Activity Around a Twitter Hashtag or Search Term Using R). I’ve corrected the x/y axis ordering so follower accession number is now the vertical, y-component.

require(ggplot2)

processUserData = function(data) {
    #Parse each follower's account creation timestamp
    data$tz = as.POSIXct(data$created_at)
    #How many days ago was each follower's account created?
    data$days = as.integer(difftime(Sys.time(), data$tz, units = "days"))
    #Reverse the rows so the longest-standing follower comes first
    data = data[rev(rownames(data)), ]
    #Assign follower accession numbers in that order
    data$acc = 1:length(data$days)
    #Running minimum of "days ago" = most recent creation date seen so far,
    #i.e. the estimated earliest possible follow time for each follower
    data$recency = cummin(data$days)

    data
}

mp_cp <- read.csv("~/code/MPs/ChrisPincher_fo_0__2013-02-16-01-29-28.csv", row.names = NULL)

ggplot(processUserData(mp_cp)) +  geom_point(aes(x = -days, y = acc), size = 0.4) + geom_point(aes(x = -recency, y = acc), col = "red", size = 1)+xlim(-2000,0)

Here’s @ChrisPincher’s chart:

cp_demo

The black dots reveal how many days ago a particular follower joined Twitter. The red line is the estimate of when a particular follower started following the account, estimated based on the most recently created account seen to date amongst the previously acceded followers.

We see steady growth in follower numbers to start with, and then the account appears to have been spam-followed (can you spot when?!;-). The clumping of the creation dates of the accounts that followed during the attack also suggests they were created programmatically.

[In the “next” post in this series (What Happened Then? Using Approximated Twitter Follower Accession to Identify Political Events), I’ll show how spikes in follower acquisition on a particular day can often be used to “detect” historical news events.]

PS after coming up with this recipe, I did a little bit of “scholarly research” and I learned that a similar approach for estimating Twitter follower acquisition times had already been described at least once, at the opening of this paper: We Know Who You Followed Last Summer: Inferring Social Link Creation Times In Twitter – “We estimate the edge creation time for any follower of a celebrity by positing that it is equal to the greatest lower bound that can be deduced from the edge orderings and follower creation times for that celebrity”.

Written by Tony Hirst

April 5, 2013 at 10:31 am

Posted in Rstats

Tagged with

Splitting a Large CSV File into Separate Smaller Files Based on Values Within a Specific Column

One of the problems with working with data files containing tens of thousands (or more) rows is that they can become unwieldy, if not impossible, to use with “everyday” desktop tools. When I was Revisiting MPs’ Expenses, the expenses data I downloaded from IPSA (the Independent Parliamentary Standards Authority) came in one large CSV file per year containing expense items for all the sitting MPs.

In many cases, however, we might want to look at the expenses for a specific MP. So how can we easily split the large data file containing expense items for all the MPs into separate files containing expense items for each individual MP? Here’s one way, using a handy little R script in RStudio…

Load the full expenses data CSV file into RStudio (for example, calling the dataframe it is loaded into mpExpenses2012). Previewing it, we see there is a column MP.s.Name identifying which MP each expense claim line item corresponds to:

mpexpenses2012R

We can easily pull out the unique values of the MP names using the levels command, and then for each name take a subset of the data containing just the items related to that MP and print it out to a new CSV file in a pre-existing directory:

mpExpenses2012 = read.csv("~/Downloads/DataDownload_2012.csv")
#mpExpenses2012 is the large dataframe containing data for each MP
#Get the list of unique MP names
for (name in levels(mpExpenses2012$MP.s.Name)){
  #Subset the data by MP
  tmp=subset(mpExpenses2012,MP.s.Name==name)
  #Create a new filename for each MP - the folder 'mpExpenses2012' should already exist
  fn=paste('mpExpenses2012/',gsub(' ','',name),'.csv',sep='')
  #Save the CSV file containing separate expenses data for each MP
  write.csv(tmp,fn,row.names=FALSE)
}

Simples:-)

This technique can be used to split any CSV file into multiple CSV files based on the unique values contained within a particular, specified column.
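For what it’s worth, the same pattern wrapped up as a hypothetical general purpose helper might look something like this (the function name and arguments are made up, and it uses unique() rather than levels() so it also works if the column isn’t a factor):

#A generalised sketch of the split-by-column pattern
splitCSVByColumn = function(infile, column, outdir){
  df = read.csv(infile, stringsAsFactors = FALSE)
  #Create the output directory if it doesn't already exist
  dir.create(outdir, showWarnings = FALSE)
  for (val in unique(df[[column]])){
    tmp = df[df[[column]] == val, ]
    #Build a filename from the column value, stripping awkward characters
    fn = file.path(outdir, paste0(gsub('[^A-Za-z0-9]', '', val), '.csv'))
    write.csv(tmp, fn, row.names = FALSE)
  }
}
#e.g. splitCSVByColumn("~/Downloads/DataDownload_2012.csv", "MP.s.Name", "mpExpenses2012")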

Written by Tony Hirst

April 3, 2013 at 8:54 am

Posted in Rstats

Tagged with ,
