## More Dabblings With Local Sentencing Data

In Accessing and Visualising Sentencing Data for Local Courts I posted a couple of quick ways in to playing with Ministry of Justice sentencing data for the period July 2010-June 2011 at the local court level. At the end of the post, I wondered about how to wrangle the data in R so that I could look at percentage-wise comparisons between different factors (Age, gender) and offence type and mentioned that I’d posted a related question to to the Cross Validated/Stats Exchange site (Casting multidimensional data in R into a data frame).

Courtesy of Chase, I have an answer:-) So let’s see how it plays out…

To start, let’s just load the Isle of Wight court sentencing data into RStudio:

require(ggplot2)
require(reshape2)

Now we’re going to shape the data so that we can plot the percentage of each offence type by gender (limited to Male and Female options):

iw.m = melt(iw, id.vars = "sex", measure.vars = "Offence_type")
iw.sex = ddply(iw.m, "sex", function(x) as.data.frame(prop.table(table(x\$value))))
ggplot(subset(iw.sex,sex=='Female'|sex=='Male')) + geom_bar(aes(x=Var1,y=Freq)) + facet_wrap(~sex)+ opts(axis.text.x=theme_text(angle=-90)) + xlab('Offence Type')

Here’s the result:

We can also process the data over a couple of variables. So for example, we can look to see how female recorded sentences break down by offence type and age range, displaying the results as a percentage of how often each offence type on its own was recorded by age:

iw.m2 = melt(iw, id.vars = c("sex","Offence_type" ), measure.vars = "AGE")
iw.off=ddply(iw.m2, c("sex","Offence_type"), function(x) as.data.frame(prop.table(table(x\$value))))

ggplot(subset(iw.off,sex=='Female')) + geom_bar(aes(x=Var1,y=Freq)) + facet_wrap(~Offence_type) + opts(axis.text.x=theme_text(angle=-90)) + xlab('Age Range (Female)')

Note that this graphic may actually be a little misleading because percentage based reports donlt play well with small numbers…: whilst there are multiple Driving Offences recorded, there are only two Burglaries, so the statistical distribution of convicted female burglars is based over a population of size two… A count would be a better way of showing this

PS I was hoping to be able to just transmute the variables and generate a raft of other charts, but I seem to be getting an error, maybe because some rows are missing? So: anyone know where I’m supposed to post R library bug reports?

## Accessing and Visualising Sentencing Data for Local Courts

A recent provisional data release from the Ministry of Justice contains sentencing data from English(?) courts, at the offence level, for the period July 2010-June 2011: “Published for the first time every sentence handed down at each court in the country between July 2010 and June 2011, along with the age and ethnicity of each offender.” Criminal Justice Statistics in England and Wales [data]

In this post, I’ll describe a couple of ways of working with the data to produce some simple graphical summaries of the data using Google Fusion Tables and R…

…but first, a couple of observations:

– the web page subheading is “Quarterly update of statistics on criminal offences dealt with by the criminal justice system in England and Wales.”, but the sidebar includes the link to the 12 month set of sentencing data;
– the URL of the sentencing data is http://www.justice.gov.uk/downloads/publications/statistics-and-data/criminal-justice-stats/recordlevel.zip, which does not contain a time reference, although the data is time bound. What URL will be used if data for the period 7/11-6/12 is released in the same way next year?

The data is presented as a zipped CSV file, 5.4MB in the zipped form, and 134.1MB in the unzipped form.

The unzipped CSV file is too large to upload to a Google Spreadsheet or a Google Fusion Table, which are two of the tools I use for treating large CSV files as a database, so here are a couple of ways of getting in to the data using tools I have to hand…

### Unix Command Line Tools

I’m on a Mac, so like Linux users I have ready access to a Console and several common unix commandline tools that are ideally suited to wrangling text files (on Windows, I suspect you need to install something like Cygwin; a search for windows unix utilities should turn up other alternatives too).

In Playing With Large (ish) CSV Files, and Using Them as a Database from the Command Line: EDINA OpenURL Logs and Postcards from a Text Processing Excursion I give a couple of examples of how to get started with some of the Unix utilities, which we can crib from in this case. So for example, after unzipping the recordlevel.csv document I can look at the first 10 rows by opening a console window, changing directory to the directory the file is in, and running the following command:

Or I can pull out rows that contain a reference to the Isle of Wight using something like this command:

grep -i wight recordlevel.csv > recordsContainingWight.csv

(The -i reads: “ignoring case”; grep is a command that identifies rows contain the search term (wight in this case). The > recordsContainingWight.csv says “send the result to the file recordsContainingWight.csv” )

Having extracted rows that contain a reference to the Isle of Wight into a new file, I can upload this smaller file to a Google Spreadsheet, or as Google Fusion Table such as this one: Isle of Wight Sentencing Fusion table.

Once in the fusion table, we can start to explore the data. So for example, we can aggregate the data around different values in a given column and then visualise the result (aggregate and filter options are available from the View menu; visualisation types are available from the Visualize menu):

We can also introduce filters to allow use to explore subsets of the data. For example, here are the offences committed by females aged 35+:

Looking at data from a single court may be of passing local interest, but the real data journalism is more likely to be focussed around finding mismatches between sentencing behaviour across different courts. (Hmm, unless we can get data on who passed sentences at a local level, and look to see if there are differences there?) That said, at a local level we could try to look for outliers maybe? As far as making comparisons go, we do have Court and Force columns, so it would be possible to compare Force against force and within a Force area, Court with Court?

### R/RStudio

If you really want to start working the data, then R may be the way to go… I use RStudio to work with R, so it’s a simple matter to just import the whole of the reportlevel.csv dataset.

Once the data is loaded in, I can use a regular expression to pull out the subset of the data corresponding once again to sentencing on the Isle of Wight (i apply the regular expression to the contents of the court column:

iw=subset(recordlevel,grepl("wight",court,ignore.case=TRUE))

We can then start to produce simple statistical charts based on the data. For example, a bar plot of the sentencing numbers by age group:

age=table(iw\$AGE)
barplot(age, main="IW: Sentencing by Age", xlab="Age Range")

We can also start to look at combinations of factors. For example, how do offence types vary with age?

ageOffence=table(iw\$AGE, iw\$Offence_type)
barplot(ageOffence,beside=T,las=3,cex.names=0.5,main="Isle of Wight Sentences", xlab=NULL, legend = rownames(ageOffence))

If we remove the beside=T argument, we can produce a stacked bar chart:

barplot(ageOffence,las=3,cex.names=0.5,main="Isle of Wight Sentences", xlab=NULL, legend = rownames(ageOffence))

If we import the ggplot2 library, we have even more flexibility over the presentation of the graph, as well as what we can do with this sort of chart type. So for example, here’s a simple plot of the number of offences per offence type:

require(ggplot2)
#You may need to install ggplot2 as a library if it isn't already installed
ggplot(iw, aes(factor(Offence_type)))+ geom_bar() + opts(axis.text.x=theme_text(angle=-90))+xlab('Offence Type')

Alternatively, we can break down offence types by age:

ggplot(iw, aes(AGE))+ geom_bar() +facet_wrap(~Offence_type)

We can bring a bit of colour into a stacked plot that also displays the gender split on each offence:

ggplot(iw, aes(AGE,fill=sex))+geom_bar() +facet_wrap(~Offence_type)

One thing I’m not sure how to do is rip the data apart in a ggplot context so that we can display percentage breakdowns, so we could compare the percentage breakdown by offence type on sentences awarded to males vs. females, for example? If you do know how to do that, please post a comment below ;-)

PS HEre’s an easy way of getting started with ggplot… use the online hosted version at http://www.yeroon.net/ggplot2/ using this data set: wightCrimRecords.csv; download the file to your computer then upload it as shown below:

PPS I got a little way towards identifying percentage breakdowns using a crib from here. The following command:
iwp=tapply(iw\$Offence_type,iw\$sex,function(x){prop.table(table(x))})
generates a (multidimensional) array for the responseVar (Offence) about the groupVar (sex). I don’t know how to generate a single data frame from this, but we can create separate ones for each sex as follows:
iwpMale=data.frame(iwp['Male'])
iwpFemale=data.frame(iwp['Female'])

We can then plot these percentages using constructions of the form:
ggplot(iwp2)+geom_bar(aes(x=Male.x,y=Male.Freq))
What I haven’t worked out how to do is elegantly map from the multidimensional array to a single data.frame? If you know how, please add a comment below…(I also posted a question on Cross Validated, the stats bit of Stack Exchange…)

## Getting Started With Twitter Analysis in R

Earlier today, I saw a post vis the aggregating R-Bloggers service a post on Using Text Mining to Find Out What @RDataMining Tweets are About. The post provides a walktrhough of how to grab tweets into an R session using the twitteR library, and then do some text mining on it.

I’ve been meaning to have a look at pulling Twitter bits into R for some time, so I couldn’t but have a quick play…

Starting from @RDataMiner’s lead, here’s what I did… (Notes: I use R in an R-Studio context. If you follow through the example and a library appears to be missing, from the Packages tab search for the missing library and import it, then try to reload the library in the script. The # denotes a commented out line.)

```require(twitteR)
#The original example used the twitteR library to pull in a user stream
#rdmTweets <- userTimeline("psychemedia", n=100)
#Instead, I'm going to pull in a search around a hashtag.
# Note that the Twitter search API only goes back 1500 tweets (I think?)

#Create a dataframe based around the results
df <- do.call("rbind", lapply(rdmTweets, as.data.frame))
#Here are the columns
names(df)
#And some example content

So what can we do out of the can? One thing is look to see who was tweeting most in the sample we collected:

```counts=table(df\$screenName)
barplot(counts)

# Let's do something hacky:
# Limit the data set to show only folk who tweeted twice or more in the sample
cc=subset(counts,counts>1)
barplot(cc,las=2,cex.names =0.3)
```

Now let’s have a go at parsing some tweets, pulling out the names of folk who have been retweeted or who have had a tweet sent to them:

```#Whilst tinkering, I came across some errors that seemed
# to be caused by unusual character sets
#Here's a hacky defence that seemed to work...
df\$text=sapply(df\$text,function(row) iconv(row,to='UTF-8'))

#A helper function to remove @ symbols from user names...
trim <- function (x) sub('@','',x)

#A couple of tweet parsing functions that add columns to the dataframe
#We'll be needing this, I think?
library(stringr)
#Pull out who a message is to
df\$to=sapply(df\$text,function(tweet) str_extract(tweet,"^(@[[:alnum:]_]*)"))
df\$to=sapply(df\$to,function(name) trim(name))

#And here's a way of grabbing who's been RT'd
df\$rt=sapply(df\$text,function(tweet) trim(str_match(tweet,"^RT (@[[:alnum:]_]*)")[2]))
```

So for example, now we can plot a chart showing how often a particular person was RT’d in our sample. Let’s use ggplot2 this time…

```require(ggplot2)
ggplot()+geom_bar(aes(x=na.omit(df\$rt)))+opts(axis.text.x=theme_text(angle=-90,size=6))+xlab(NULL)```

Okay – enough for now… if you’re tempted to have a play yourself, please post any other avenues you explored with in a comment, or in your own post with a link in my comments;-)

## Data Referenced Journalism and the Media – Still a Long Way to Go Yet?

Reading our local weekly press this evening (the Isle of Wight County Press), I noticed a page 5 headline declaring “Alarm over death rates at St Mary’s”, St Mary’s being the local general hospital. It seems a Department of Health report on hospital mortality rates came out earlier this week, and the Island’s hospital, it seems, has not performed so well…

Seeing the headline – and reading the report – I couldn’t help but think of Ben Goldacre’s Bad Science column in the Observer last week (DIY statistical analysis: experience the thrill of touching real data ), which commented on the potential for misleading reporting around bowel cancer death rates; among other things, the column described a statistical graphic known as a funnel plot which could be used to support the interpretation of death rate statistics and communicate the extent to which a particular death rate, for a given head of population, was “significantly unlikely” in statistical terms given the distribution of death rates across different population sizes.

I also put together a couple of posts describing how the funnel plot could be generated from a data set using the statistical programming language R.

Given the interest there appears to be around data journalism at the moment (amongst the digerati at least), I thought there might be a reasonable chance of finding some data inspired commentary around the hospital mortality figures. So what sort of report was produced by the Guardian (Call for inquiries at 36 NHS hospital trusts with high death rates) or the Telegraph (36 hospital trusts have higher than expected death rates), both of which have pioneering data journalists working for them, come up with? Little more than the official press release: New hospital mortality indicator to improve measurement of patient safety.

The reports were both formulaic, picking on leading with the worst performing hospital (which admittedly was not mentioned in the press release) and including some bog standard quotes from the responsible Minister lifted straight out of the press release (and presumably written by someone working for the Ministry…) Neither the Guardian nor the Telegraph story contained a link to the original data, which was linked to from the press release as part of the Notes to editors rider.

If we do a general, recency filtered, search for hospital death rates on either Google web search:

we see a wealth of stories from various local press outlets. This was a story with national reach and local colour, and local data set against a national backdrop to back it up. Rather than drawing on the Ministerial press released quotes, a quick scan of the local news reports suggests that at least the local journalists made some effort compared to the nationals’ churnalism, and got quotes from local NHS spokespeople to comment on the local figures. Most of the local reports I checked did not give a link to the original report, or dig too deeply into the data. However, This is Tamworth, (which had a Tamworth Herald byline in the Google News results), did publish the URL to the full report in its article Shock report reveals hospital has highest death rate in country, although not actually as a link… Just by the by, I also noticed the headline was flagged with a “Trusted Source” badge:

Is that Tamworth Herald as the trusted source, or the Department of Health?!

Given that just a few days earlier, Ben Goldacre had provided an interesting way of looking at death rate data, it would have been nice to think that maybe it could have influenced someone out there to try something similar with the hospital mortality data. Indeed, if you check the original report, you can find a document describing How to interpret SHMI bandings and funnel plots (although, admittedly, not that clearly perhaps?). And along with the explanation, some example funnel plots.

However, the plots as provided are not that useful. They aren’t available as image files in a social or rich media press release format, nor are statistical analysis scripts that would allow the plots to be generated from the supplied data in too like R; that is to say, the executable working wasn’t shown…

So here’s what I’m thinking: firstly, we need data press officers as well as data journalists. Their job would be to put together the tools that support the data churnalist in taking the raw data and producing statistical charts and interpretation from it. Just like the ministerial quote can be reused by the journalist, so the data press pack can be used to hep the journalist get some graphs out there to help them illustrate the story. (The finishing of the graph would be up to the journalist, but the mechanics of the generation of the base plot would be provided as part of the data press pack.)

Secondly, there may be an opportunity for an enterprising individual to take the data sets and produced localised statistical graphics from the source data. In the absence of a data press officer, the enterprising individual could even fulfil this role. (To a certain extent, that’s what the Guardian Datastore does.)

(Okay, I know: the local press will have allocated only a certain amount of space to the story, and the editor would likely see any mention of stats or funnel plots as scaring folk off, but we have to start changing attitudes, expectations, willingness and ability to engage with this sort of stuff somehow. Most people have very little education in reading any charts other than pie charts, bar charts, and line charts, and even then are easily misled. We have start working on this, we have to start looking at ways of introducing more powerful plots and charts and helping people get a folk understanding of them. And funnel plots may be one of the things we should be starting to push?)

Now back to the hospital data. In How Might Data Journalists Show Their Working? Sweave, I posted a script that included the working for generating a funnel plot from an appropriate online CSV data source. Could this script be used to generate a funnel plot from the hospital data?

I had a quick play, and managed to get a scatterplot distribution that looks like the one on the funnel plot explanation guide by setting the number value to the SHMI Indicator data (csv) EXPECTED column and the p to the VALUE column. However, because the p value isn’t a probability in the range 0..1, the p.se calculation fails:
p.se <- sqrt((p*(1-p)) / (number))

Anyway, here’s the script for generating the straightforward scatter plot (I had to read the data in from a local file because there was some issue with the security certificate trying to read the data in from the online URL using the RCurl library and hospitaldata = data.frame( read.csv( textConnection( getURL( DATA_URL ) ) ) ):

number = hospitaldata\$EXPECTED
p = hospitaldata\$VALUE
df = data.frame(p, number, Area=hospitaldata\$PROVIDER.NAME)
ggplot(aes(x = number, y = p), data = df) + geom_point(shape = 1)

There’s presumably a simple fix to the original script that will take the range of the VALUE column into account and allow us to plot the funnel distribution lines appropriately? If anyone can suggest the fix, please let me know in a comment…;-)

## How Might Data Journalists Show Their Working? Sweave

If part of the role of data journalism is to make transparent the justification behind claims that are, or aren’t, backed up by data, there’s good reason to suppose that the journalists should be able to back up their own data-based claims with evidence about how they made use of the data. Posting links to raw data helps to a certain extent – at least third parties can then explore the data themselves and check the claims the press are making – but you could also argue that the journalists should also make their notes available regarding how they worked the data. (The same is true in public reports, where summary statistics and charts are included in a report, along with a link to the raw data, but no transparency in how the summary reports/charts were actually produced from the data.)

In Power Tools for Aspiring Data Journalists: R, I explored how we might use the R statistical programming language to replicate a chart that appeared in one of Ben Goldacre’s Bad Science columns. I included code snippets in the post, along with the figures they generated. But is there a way of getting even closer to the source, as it were, and produce documents that essentially generate their output from some sort of “source code”?

For example, take this view of my working relating to the production of the funnel chart described in Goldacre’s column:

You can find the actual “source code” for that document here: bowel cancer funnel plot working notes If you load it into something like RStudio, you can “run” the code and generate your own PDF from it.

The “source” of the document includes both text and R code. When the Sweave document is processed, the R code contained within the document is executed and the results also included in the document. The charts shown in the report are generated directly from the code included in the document, using data pulled in to the document form a source referenced within the document. If the source data is changed, or the R code is changed, what’s contained in the output document will change as well.

This sort of workflow will be familiar to many experimental scientists, but I wonder: is it something that data journalists have considered, at least as a way of keeping working notes about data related projects they are working on?

PS as well as Sweave, see dexy.it, which generalises the Sweave approach to allow you to create self-documenting software/code. Educators, also take note…;-)

## Power Tools for Aspiring Data Journalists: Funnel Plots in R

Picking up on Paul Bradshaw’s post A quick exercise for aspiring data journalists which hints at how you can use Google Spreadsheets to grab – and explore – a mortality dataset highlighted by Ben Goldacre in DIY statistical analysis: experience the thrill of touching real data, I thought I’d describe a quick way of analysing the data using R, a very powerful statistical programming environment that should probably be part of your toolkit if you ever want to get round to doing some serious stats, and have a go at reproducing the analysis using a bit of judicious websearching and some cut-and-paste action…

R is an open-source, cross-platform environment that allows you to do programming like things with stats, as well as producing a wide range of graphical statistics (stats visualisations) as if by magic. (Which is to say, it can be terrifying to try to get your head round… but once you’ve grasped a few key concepts, it becomes a really powerful tool… At least, that’s what I’m hoping as I struggle to learn how to use it myself!)

I’ve been using R-Studio to work with R, a) because it’s free and works cross-platform, b) it can be run as a service and accessed via the web (though I haven’t tried that yet; the hosted option still hasn’t appeared yet, either…), and c) it offers a structured environment for managing R projects.

So, to get started. Paul describes a dataset posted as an HTML table by Ben Goldacre that is used to generate the dots on this graph:

The lines come from a probabilistic model that helps us see the likely spread of death rates given a particular population size.

If we want to do stats on the data, then we could, as Paul suggests, pull the data into a spreadsheet and then work from there… Or, we could pull it directly into R, at which point all manner of voodoo stats capabilities become available to us.

As with the =importHTML formula in Google spreadsheets, R has a way of scraping data from an HTML table anywhere on the public web:

#First, we need to load in the XML library that contains the scraper function
library(XML)
#Scrape the table

The format is simple: readHTMLTable(url,which=TABLENUMBER) (TABLENUMBER is used to extract the N’th table in the page.) The header part labels the columns (the data pulled in from the HTML table itself contains all sorts of clutter).

We can inspect the data we’ve imported as follows:

#Look at the whole table
cancerdata
names(cancerdata)
#Look at the first 10 rows
#Look at the last 10 rows
tail(cancerdata)
#What sort of datatype is in the Number column?
class(cancerdata\$Number)

The last line – class(cancerdata\$Number) – identifies the data as type ‘factor’. In order to do stats and plot graphs, we need the Number, Rate and Population columns to contain actual numbers… (Factors organise data according to categories; when the table is loaded in, the data is loaded in as strings of characters; rather than seeing each number as a number, it’s identified as a category.)

#Convert the numerical columns to a numeric datatype
cancerdata\$Rate=as.numeric(levels(cancerdata\$Rate)[as.integer(cancerdata\$Rate)])
cancerdata\$Population=as.numeric(levels(cancerdata\$Population)[as.integer(cancerdata\$Population)])
cancerdata\$Number=as.numeric(levels(cancerdata\$Number)[as.integer(cancerdata\$Number)])

#Just check it worked…
class(cancerdata\$Number)

We can now plot the data:

#Plot the Number of deaths by the Population
plot(Number ~ Population,data=cancerdata)

If we want to, we can add a title:
#Add a title to the plot
plot(Number ~ Population,data=cancerdata, main='Bowel Cancer Occurrence by Population')

We can also tweak the axis labels:

plot(Number ~ Population,data=cancerdata, main='Bowel Cancer Occurrence by Population',ylab='Number of deaths')

The plot command is great for generating quick charts. If we want a bit more control over the charts we produce, the ggplot2 library is the way to go. (ggpplot2 isn’t part of the standard R bundle, so you’ll need to install the package yourself if you haven’t already installed it. In RStudio, find the Packages tab, click Install Packages, search for ggplot2 and then install it, along with its dependencies…):

require(ggplot2)
ggplot(cancerdata)+geom_point(aes(x=Population,y=Number))+opts(title='Bowel Cancer Data')+ylab('Number of Deaths')

Doing a bit of searching for the “funnel plot” chart type used to display the ata in Goldacre’s article, I came across a post on Cross Validated, the Stack Overflow/Statck Exchange site dedicated to statistics related Q&A: How to draw funnel plot using ggplot2 in R?

The meta-analysis answer seemed to produce the similar chart type, so I had a go at cribbing the code… This is a dangerous thing to do, and I can’t guarantee that the analysis is the same type of analysis as the one Goldacre refers to… but what I’m trying to do is show (quickly) that R provides a very powerful stats analysis environment and could probably do the sort of analysis you want in the hands of someone who knows how to drive it, and also knows what stats methods can be appropriately applied for any given data set…

Anyway – here’s something resembling the Goldacre plot, using the cribbed code which has confidence limits at the 95% and 99.9% levels. Note that I needed to do a couple of things:

1) work out what values to use where! I did this by looking at the ggplot code to see what was plotted. p was on the y-axis and should be used to present the death rate. The data provides this as a rate per 100,000, so we need to divide by 100, 000 to make it a rate in the range 0..1. The x-axis is the population.

#TH: funnel plot code from:
#TH: http://stats.stackexchange.com/questions/5195/how-to-draw-funnel-plot-using-ggplot2-in-r/5210#5210
#TH: Use our cancerdata
number=cancerdata\$Population
#TH: The rate is given as a 'per 100,000' value, so normalise it
p=cancerdata\$Rate/100000

p.se <- sqrt((p*(1-p)) / (number))
df <- data.frame(p, number, p.se)

## common effect (fixed effect model)
p.fem <- weighted.mean(p, 1/p.se^2)

## lower and upper limits for 95% and 99.9% CI, based on FEM estimator
#TH: I'm going to alter the spacing of the samples used to generate the curves
number.seq <- seq(1000, max(number), 1000)
number.ll95 <- p.fem - 1.96 * sqrt((p.fem*(1-p.fem)) / (number.seq))
number.ul95 <- p.fem + 1.96 * sqrt((p.fem*(1-p.fem)) / (number.seq))
number.ll999 <- p.fem - 3.29 * sqrt((p.fem*(1-p.fem)) / (number.seq))
number.ul999 <- p.fem + 3.29 * sqrt((p.fem*(1-p.fem)) / (number.seq))
dfCI <- data.frame(number.ll95, number.ul95, number.ll999, number.ul999, number.seq, p.fem)

## draw plot
#TH: note that we need to tweak the limits of the y-axis
fp <- ggplot(aes(x = number, y = p), data = df) +
geom_point(shape = 1) +
geom_line(aes(x = number.seq, y = number.ll95), data = dfCI) +
geom_line(aes(x = number.seq, y = number.ul95), data = dfCI) +
geom_line(aes(x = number.seq, y = number.ll999, linetype = 2), data = dfCI) +
geom_line(aes(x = number.seq, y = number.ul999, linetype = 2), data = dfCI) +
geom_hline(aes(yintercept = p.fem), data = dfCI) +
scale_y_continuous(limits = c(0,0.0004)) +
xlab("number") + ylab("p") + theme_bw()

fp

As I said above, it can be quite dangerous just pinching other folks’ stats code if you aren’t a statistician and don’t really know whether you have actually replicated someone else’s analysis or done something completely different… (this is a situation I often find myself in!); which is why I think we need to encourage folk who release statistical reports to not only release their data, but also show their working, including the code they used to generate any summary tables or charts that appear in those reports.

In addition, it’s worth noting that cribbing other folk’s code and analyses and applying it to your own data may lead to a nonsense result because some stats analyses only work if the data has the right sort of distribution…So be aware of that, always post your own working somewhere, and if someone then points out that it’s nonsense, you’ll hopefully be able to learn from it…

Given those caveats, what I hope to have done is raise awareness of what R can be used to do (including pulling data into a stats computing environment via an HTML table screenscrape) and also produced some sort of recipe we could take to a statistician to say: is this the sort of thing Ben Goldacre was talking about? And if not, why not?

[If I’ve made any huge – or even minor – blunders in the above, please let me know… There’s always a risk in cutting and pasting things that look like they produce the sort of thing you’re interested in, but may actually be doing something completely different!]

PS for how to generate reports that can (optionally) also self-document with actually source R code, see How might data journalists show their working? Sweave. The code used in, and comments added to, that post make further refinements to the funnel plot code.

In Using Google Spreadsheets as a Database Source for R, I described a simple Google function for pulling data into R from a Google Visualization/Chart tools API query language query applied to a Google spreadsheet, given the spreadsheet key and worksheet ID. But how do you get a list of sheets in spreadsheet, without opening up the spreadsheet and finding the sheet names or IDs directly? [Update: I’m not sure the query language API call lets you reference a sheet by name…]

To look up the sheets associated with a spreadsheet identified by its key value KEY, construct a URL of the form:

This should give you an XML output. To get the output as a JSON feed, append ?alt=json to the end of the URL.

Having constructed the URL for sheets listing for a spreadsheet with a given key identifier, we can pull in and parse either the XML version, or the JSON version, into R and identify all the different sheets contained within the spreadsheet document as a whole.

First, the JSON version. I use the RJSONIO library to handle the feed:

library(RJSONIO)
sskey='0AmbQbL4Lrd61dDBfNEFqX1BGVDk0Mm1MNXFRUnBLNXc'
sheets=c()
as.data.frame(sheets)

Using a variant of the function described in the previous post, we can look up the data contained in a sheet by the sheet ID (I’m not sure you can look it up by name….?) – I’m not convinced that the row number is a reliable indicator of sheet ID, especially if you’ve deleted or reordered sheets. It may be that you do actually need to go to the spreadsheet to look up the sheet number for the gid, which actually defeats a large part of the purpose behind this hack?:-(

library(RCurl)
gsqAPI = function( key, query,gid=0){ return( read.csv( paste( sep="", 'http://spreadsheets.google.com/tq?', 'tqx=out:csv', '&tq=', curlEscape(query), '&key=', key, '&gid=', curlEscape(gid) ) ) ) }
gsqAPI(sskey,"select * limit 10", 9)

The second approach is to pull on the XML version of the sheet data feed. (This PDF tutorial got me a certain way along the road: Extracting Data from XML, but then I got confused about what to do next (I still don’t have a good feel for identifying or wrangling with R data structures, though at least I now know how to use the class() function to find out what R things the type of any given item is;-) and had to call on the lazy web to work out how to do this in the end!)

library(XML)
ssd=xmlTreeParse( ssURL, useInternal=TRUE )
nodes=getNodeSet( ssd, "//x:entry", "x" )
titles=sapply( nodes, function(x) xmlSApply( x, xmlValue ) )
library(stringr)
data.frame( sheetName = titles['content',], sheetId = str_sub(titles['id',], -3, -1 ) )

In this example, we also pull out the sheet ID that is used by the Google spreadsheets API to access individual sheets, just in case. (Note that these IDs are not the same as the numeric gid values used in the chart API query language…)

PS Note: my version of R seemed to choke if I gave it https: headed URLs, but it was fine with http:

I couldn’t contain myself (other more pressing things to do, but…), so I just took a quick time out and a coffee to put together a quick and dirty R function that will let me run queries over Google spreadsheet data sources and essentially treat them as database tables (e.g. Using Google Spreadsheets as a Database with the Google Visualisation API Query Language).

Here’s the original function I used:

```library(RCurl)
```

However, with a move to https, this function kept breaking. The one I currently use is:

```library(RCurl)
gsqAPI = function(key,query,gid=0){
tmp=getURL( paste( sep="",'https://spreadsheets.google.com/tq?', 'tqx=out:csv','&tq=', curlEscape(query), '&key=', key, '&gid=', gid), ssl.verifypeer = FALSE )
return( read.csv( textConnection( tmp ) ) )
}
```

It requires the spreadsheet key value and a query; you can optionally provide a sheet number within the spreadsheet if the sheet you want to query is not the first one.

We can call the function as follows:

gsqAPI('tPfI0kerLllVLcQw7-P1FcQ','select * limit 3')

In that example, and by default, we run the query against the first sheet in the spreadsheet.

Alternatively, we can make a call like this, and run a query against sheet 3, for example:
tmpData=gsqAPI('0AmbQbL4Lrd61dDBfNEFqX1BGVDk0Mm1MNXFRUnBLNXc','select A,C where <= 10',3)
tmpData

The real question is, of course, could it be useful.. (or even OUseful?!)?

Here’s another example: a way of querying the Guardian Datastore list of spreadsheets:

gsqAPI('0AonYZs4MzlZbdFdJWGRKYnhvWlB4S25OVmZhN0Y3WHc','select * where A contains "crime" and B contains "href" order by C desc limit 10')

What that call does is run a query against the Guardian Datastore spreadsheet that lists all the other Guardian Datastore spreadsheets, and pulls out references to spreadsheets relating to “crime”.

The returned data is a bit messy and requires parsing to be properly useful.. but I haven’t started looking at string manipulation in R yet…(So my question is: given a dataframe with a column containing things like <a href=”http://example.com/whatever”>Some Page</a>, how would I extract columns containing http://example.com/whatever or Some Page fields?)

[UPDATE: as well as indexing a sheet by sheet number, you can index it by sheet name, but you’ll probably need to tweak the function to look end with '&gid=', curlEscape(gid) so that things like spaces in the sheet name get handled properly I’m not sure about this now.. calling sheet by name works when accessing the “normal” Google spreadsheets application, but I’m not sure it does for the chart query language call??? ]

[If you haven’t yet discovered R, it’s an environment that was developed for doing stats… I use the RStudio environment to play with it. The more I use it (and I’ve only just started exploring what it can do), the more I think it provides a very powerful environment for working with data in quite a tangible way, not least for reshaping it and visualising it, let alone doing stats with in. (In fact, don’t use the stats bit if you don’t want to; it provides more than enough data mechanic tools to be going on with;-)]

PS By the by, I’m syndicating my Rstats tagged posts through the R-Bloggers site. If you’re at all interested in seeing what’s possible with R, I recommend you subscribe to R-Bloggers, or at least have a quick skim through some of the posts on there…

PPS The RSpatialTips post Accessing Google Spreadsheets from R has a couple of really handy tips for tidying up data pulled in from Google Spreadsheets; assuming the spreadsheetdata has been loaded into ssdata: a) tidy up column names using colnames(ssdata) <- c("my.Col.Name1","my.Col.Name2",...,"my.Col.NameN"); b) If a column returns numbers as non-numeric data (eg as a string "1,000") in cols 3 to 5, convert it to a numeric using something like: for (i in 3:5) ssdata[,i] <- as.numeric(gsub(",","",ssdata[,i])) [The last column can be identifed as ncol(ssdata) You can do a more aggessive conversion to numbers (assuming no decimal points) using gsub("[^0-9]","",ssdata[,i])]

PPPS via Revolutions blog, how to read the https file into R (unchecked):

```require(RCurl)
myCsv = getURL(httpsCSVurl)

## The Visual Difference – R and Anscombe’s Quartet

I spent a chunk of today trying to get my thoughts in order for a keynote presentation at next week’s The Difference that Makes a Difference conference. The theme of my talk will be on how visualisations can be used to discover structure and pattern in data, and as in many or my other recent talks I found the idea of Anscombe’s quartet once again providing a quick way in to the idea that sometimes the visual dimension can reveal a story that simple numerical analysis appears to deny.

For those of you who haven’t come across Anscombe’s quartet yet, it’s a set of four simple 2 dimensional data sets (each 11 rows long) that have similar statistical properties, but different stories to tell…

Quite by chance, I also happened upon a short exercise based on using R to calculate some statistical properties of the quartet (More useless statistics), so I thought I’d try and flail around in my unprincipled hack-it-and-see approach to learning R to see if I could do something similar with rather simpler primitives than described in that blog post.

(If you’re new to R and want to play along, I recommend RStudio…)

Here’s the original data set – you can see it in R simply by typing anscombe:

```   x1 x2 x3 x4    y1   y2    y3    y4
1  10 10 10  8  8.04 9.14  7.46  6.58
2   8  8  8  8  6.95 8.14  6.77  5.76
3  13 13 13  8  7.58 8.74 12.74  7.71
4   9  9  9  8  8.81 8.77  7.11  8.84
5  11 11 11  8  8.33 9.26  7.81  8.47
6  14 14 14  8  9.96 8.10  8.84  7.04
7   6  6  6  8  7.24 6.13  6.08  5.25
8   4  4  4 19  4.26 3.10  5.39 12.50
9  12 12 12  8 10.84 9.13  8.15  5.56
10  7  7  7  8  4.82 7.26  6.42  7.91
11  5  5  5  8  5.68 4.74  5.73  6.89```

We can construct a simple data frame containing just the values of x1 and y1 with a construction of the form: data.frame(x=c(anscombe\$x1),y=c(anscombe\$y1)) (where we identify the columns explicitly by column name) or alternatively data.frame(x=c(anscombe[1]),y=c(anscombe[5])) (where we refer to them by column index number).

x y
1 10 8.04
2 8 6.95
3 13 7.58
4 9 8.81
5 11 8.33
6 14 9.96
7 6 7.24
8 4 4.26
9 12 10.84
10 7 4.82
11 5 5.68

A tidier way of writing this is as follows:
with(anscombe,data.frame(x1Val=c(x1),y1Val=c(y1)))

In order to call on, or refer to, the data frame, we assign it to a variable: g1data=with(anscombe,data.frame(xVal=c(x1),yVal=c(y1)))

We can then inspect the mean and sd values: mean(g1data\$xVal), or sd(g1data\$yVal)

> mean(g1data\$xVal)
[1] 9
> sd(g1data\$xVal)
[1] 3.316625
>

To plot the data, we can simply issue a plot command: plot(g1data)

It would be possible to create similar datasets for each of the separate groups of data, but R has all sorts of tricks for working with data (apparently…?!;-) There are probably much better ways of getting hold of the statistics in a more direct way, but here’s the approach I took. Firstly, we need to reshape the data a little. I cribbed the “useless stats” approach for most of this. The aim is to produce a data set with 44 rows, and 3 columns: x, y and a column that identifies the group of results (let’s call them myX, myY and myGroup for clarity). The myGroup values will range from 1 to 4, identifying each of the four datasets in turn (so the first 11 rows will be for x1, y1 and will have myGroup value 1; then the values for x2, y2 and myGroup equal to 2, and so on). That is, we want a dataset that starts:

```1 10 9.14 1
2 8 8.14 1```

and ends:

```43 8 7.91 4
44 8 6.89 4```

To begin with, we need some helper routines:

– how many rows are there in the data set? nrow(anscombe)
– how do we create a set of values to label the rows by group number (i.e. how do we generate a set of 11 1’s, then 11 2’s, 11 3’s and 11 4’s)? Here’s how: gl(4, nrow(anscombe)) [typing ?gl in R should bring up the appropriate help page;-) What we do is construct a list of 4 values, with each value repeating nrow(anscombe) times]
– to add in a myGroup column to a dataframe containing x1 and y1 columns, set with just values 1, we simply insert an additional column definition: data.frame(xVal=c(anscombe\$x1), yVal=c(anscombe\$y1), mygroup=gl(1,nrow(anscombe)))
– to generate a data frame containing three columns and the data for group 1, followed by group 2, we would use a construction of the form: data.frame(xVal=c(anscombe\$x1,anscombe\$x2), yVal=c(anscombe\$y1,anscombe\$y2), mygroup=gl(2,nrow(anscombe))). That is, populate the xVal column with rows from x1 in the anscombe dataset, then the rows from column x2; populate yVal with values from y1 then y2; and populate myGroup with 11 1’s followed by 11 2’s.
– a more compact way of constructing the data frame is to specify that we want to concatenate (c()) values from several columns from the same dataset: with(anscombe,data.frame(xVal=c(x1,x2,x3,x4), yVal=c(y1,y2,y3,y4), mygroup=gl(4,nrow(anscombe))))
– to be able to reference this dataset, we need to assign it to a variable: mydata=with(anscombe,data.frame(xVal=c(x1,x2,x3,x4), yVal=c(y1,y2,y3,y4), mygroup=gl(4,nrow(anscombe))))

This final command will give us a data frame with the 3 columns, as required, containing values from group 1, then group 2, then groups 3 and 4, all labelled appropriately.

To find the means for each column by group, we can use the aggregate command: aggregate(.~mygroup,data=mydata,mean)

(I think you read that as follows:”aggregate the current data (.) by each of the groups in (~) mygroup using the mydata dataset, reporting on the groupwise application of the function: mean)

To find the SD values: aggregate(.~mygroup,data=mydata,sd)

Cribbing an approach I discovered from the hosted version of the ggplot R graphics library, here’s a way of plotting the data for each of the four groups from within the single aggregate dataset. (If you are new to R, you will need to download and install the ggplot2 package; in RStudio, from the Packages menu, select Install Packages and enter ggplot2 to download and install the package. To load the package into your current R session, tick the box next to the installed package name or enter the command library("ggplot2").)

The single command to plot xy scatterplots for each of the four groups in the combined 3 column dataset is as follows:

ggplot(mydata,aes(x=xVal,y=yVal,group=mygroup))+geom_point()+facet_wrap(~mygroup)

And here’s the result (remember, the statistical properties were the same…)

To recap the R commands:

mydata=with(anscombe,data.frame(xVal=c(x1,x2,x3,x4), yVal=c(y1,y2,y3,y4), group=gl(4,nrow(anscombe))))
aggregate(.~mygroup,data=mydata,mean)
aggregate(.~mygroup,data=mydata,sd)
library(ggplot2)
ggplot(mydata,aes(x=xVal, y=yVal)) + geom_point() + facet_wrap(~mygroup)

PS this looks exciting from an educational and opendata perspective, though I haven’t had a chance to play with it: OpenCPU: a server where you can upload and run R functions. (The other hosted R solutions I was aware of – R-Node – doesn’t seem to be working any more? online R-Server [broken?]. For completeness, here’s the link to the hosted ggplot IDE referred to in the post. And finally – if you need to crucnh big files, CloudNumbers may be appropriate (disclaimer: I haven’t tried it))

PPS And here’s something else for the data junkies – an easy way of getting data into R from Datamarket.com: How to access 100M time series in R in under 60 seconds.

## Data Driven Story Discovery: Working Up a Multi-Layered Chart

How many different dimensions (or “columns” in a dataset where each row represents a different sample and each column a different measurement taken as part of that sample) can you plot on a chart?

Two are obvious: X and Y values, which are ideal for representing continuous numerical variables. If you’re plotting points, as in a scatterplot, the size and the colour of the point allow you to represent two further dimensions. Using different symbols to plot the points gives you another dimension. So we’re up to five, at least.

Whilst I was playing with ggplot over the weekend, I fell into this view over F1 timing data:

It shows the range of positions held by each car over the course of the race (cars are identified by car number on the x-axis, 1 to 25 (there is no 13), and range of positions on the y-axis. The colour, uninterestingly, relates to car number.

If you follow any of the F1 blogs, you’ll quite often see references to “driver of the day” discussions (e.g. Who Was Your Driver of the Day). Which got me wondering…could a variant of the above chart provide a summary of the race as a whole that would, at a glance, suggest which drivers had “interesting” races?

For example, if a driver took on a wide range of race positions during the race, might this mean they had an exciting time of it? If every car retained the same couple of positions throughout the race, was it a procession?

Having generated the base chart using the ggplot web interface, I grabbed the code used to generate it and took it into RStudio.

What was lacking in the original view was an explicit statement about the position of each car at the end of the race. But we can add this using a point, if we know the final race position*:

ggplot() + geom_step(aes(x=h\$car, y=h\$pos, group=h\$car)) + scale_x_discrete(limits = c('VET','WEB','HAM','BUT','ALO','MAS','SCH','ROS','HEI','PET','BAR','MAL','','SUT','RES','KOB','PER','BUE','ALG','KOV','TRU','RIC','LIU','GLO','AMB')) + xlab(NULL) + opts(axis.text.x=theme_text(angle=-90, hjust=0)) + geom_point(aes(x=k\$driverNum, y=k\$classification))

Let’s add it using a different coloured mark:

+geom_point(aes(x=k\$driverNum, y=k\$grid,col='red'))

(Note that in the above example, the points are layered in the order they appear in the command line, with lower layers first. So if a car finishes (black dot) in the position it started (red dot), we will only see the red dot. If we make the lower layer dot slightly larger, we would then get a target effect in this case.)

The chart as we now have it shows where a driver started, where they finished and how many different race positions they found themselves in. So for example, we see BUE rapidly improved on his grid position at the start of the race and made progress through the race, but SUT went backwards from the off, as did PER.

Hmm… lots of race action seems to have happened during the first lap, but we’re not getting much sense of it… Maybe we need to add in the position of the car at the end of the first lap too. Unfortunately, the data I was using did not contain the actual grid/starting position (it just contains the positions at the end of each lap), but we can pull this data in from elsewhere… Using another dot to represent this piece of data might get confusing, so let’s use a line (set the symbol type using pch=3):

+geom_point(aes(x=l\$car, y=l\$pos, pch=3))

Now we have a chart that shows the range of positions occupied by each car, their grid position, final race position and position at the end of the first lap Using different symbol types and colours we can distinguish between them. (For a presentation graphic, we also need a legend…) By using different sized symbols and layers, we could also display multiple dimensions at the same x, y co-ordinate.

library("ggplot2")
h=hun_2011proximity
//in racestats, convert DNF etc to NA
k=hun_2011racestatsx
l=subset(h,lap==1)
ggplot() + geom_step(aes(x=h\$car, y=h\$pos, group=h\$car)) + scale_x_discrete(limits =c('VET','WEB','HAM','BUT','ALO','MAS','SCH','ROS','HEI','PET','BAR','MAL','','SUT','RES','KOB','PER','BUE','ALG','KOV','TRU','RIC','LIU','GLO','AMB')) + xlab(NULL) + opts(title="F1 2011 Hungary", axis.text.x=theme_text(angle=-90, hjust=0)) + geom_point(aes(x=l\$car, y=l\$pos, pch=3, size=2)) + geom_point(aes(x=k\$driverNum, y=k\$classification,size=2), label='Final') + geom_point(aes(x=k\$driverNum, y=k\$grid, col='red')) + ylab("Position")

A title can be added to the post using:

I still haven’t worked out how to do a caption that will display:
– [black dot] “Final Race Position”
– [red dot] “Grid Position”
– [tick mark] “End of lap 1”

Any ideas?

The y-axis gridline on the half is also misleading. To add gridlines for each position, add in +scale_y_discrete(breaks=1:24) or to label them as well +scale_y_discrete(breaks=1:24,limits=1:24)

So what do we learn from this? Layers can be handy and allow us to construct charts with different overlaid data sets. The order of the layers can make things easier or harder to read. Different symbol types work differently well with each other when overlaid. The same symbol shape in different sizes and appropriate layer ordering allows you to overlay points and still see them data. Symbol styles give the chart a grammar (I used circles for the start and end of the race positions, for example, and a line for the first lap position).

Lots of little things, in other words. But lots of little things that can allow us to add more to a chart and still keep it legible (arguably! I don’t claim to be a graphic designer, I’m just trying to find ways of communicating different dimensions within a data set).

* it strikes me that actually we could plot the final race position under the final classification. (On occasion, the race stewards penalise drivers so the classified result is different to the positions the cars ended the race in. Maybe we should make that eventuality evident?)

PS as @sidepodcast pointed out, I really should remove non-existent driver 13. I guess this means converting x-axis to a categorical one?