One of the things I’ve been pondering with respect to the whole data journalism process is how journalists without a lot of statistical training can quickly get a feel for whether there may be interesting story leads in a dataset, or how they might be able to fashion “alerts” that bring attention to data elements that might be worth investigating further. In the case of the Guardian university tables, this might relate to identifying which universities appear to have courses that rank particularly well in a subject and within their own institution, or which subject areas appear to have teaching or assessment satisfaction issues in a particular institution. In this post, I have a quick play with an idea for generating visual alerts that might help us set up or identify hopefully interesting or fruitful questions to ask along these lines.

Statistics will probably have a role to play in generating most forms of alert, but as Jeff Leek has recently pointed out in the Simply Statistics blog, [t]he vast majority of statistical analysis is not performed by statisticians. Furthermore:

We also need to realize that the most impactful statistical methods will not be used by statisticians, which means we need more fool proofing, more time automating, and more time creating software. The potential payout is huge for realizing that the tide has turned and most people who analyze data aren’t statisticians.

By no stretch of the imagination would I class myself as a statistician. But I have started trying to develop some statistical intuitions by going back to several histories of statistics in order to see what sorts of problems the proto-statisticians were trying to develop mathematical techniques to solve (for example, I’m currently reading The History of Statistics: The Measurement of Uncertainty before 1900)).

One of the problems I’ve encountered for myself relates to trying to find outliers in a quick and easy way. One way of detecting an outlier is to assume that the points have a normal distribution about the population mean value, and then look for points that lay several standard deviations away from the mean. If that all sounds way to complicated, the basic idea is this: which items in a group have values a long way from the average value in the group.

A statistic that captures this idea in a crude way is the z-score (or should than be, z-statistic? z-value?), which for a particular point is is the deviation from the mean divided by the standard deviation (I think). Which is to say, it’s proportional to how far away from ‘the average’ a point is dividing by the average of how far away all points are. (What is doesn’t tell you is whether that distance is meaningful or important in any sense, which I think statisticians refer to as “the power of the effect”. I think. They talk in code.)

Anyway, the magic of R makes it easy to calculate the z-score for a group of numbers using the *scale()* function. So I had a quick play with it to see if it might be useful in generating alerts around the Guardian University tables data.

*This post build on two earlier posts and may not make much sense if you haven’t been following the story to date. The first post, which tells how to grab the data into R, can be found in Datagrabbing Commonly Formatted Sheets from a Google Spreadsheet – Guardian 2014 University Guide Data ; the second, how to start building a simple interactive viewer for the data using the R shiny library, can be found in Disposable Visual Data Explorers with Shiny – Guardian University Tables 2014.*

The general question I had in mind was this: *is there a way we can generate a simple view over the data to highlight outperformers and underperformers under each of the satisfaction scores?* The sort of view I thought I wanted depended on the stance from which a question about the data might arise. For example, through building a single dataset from the subject level data, we find ourselves in a position where we can ask questions from the perspective of someone from a particular university who may be interested in how the various satisfaction scores for a particular subject compare to the scores for the other subjects offered by the institution. Alternatively, we might be interested in how the satisfaction scores for a particular subject area might compare across several institutions.

This seemed like things the z-score might be able to help with. For example, for each of the satisfaction scores/columns associated with a particular subject, we can generate the z-score that shows how far away from the average in that satisfaction column for that subject each institution is.

##gdata contains the Guardian university tables data for all subjects #Get the list of subjects ggs=levels(gdata$subject) #Grab a working copy of the data ggs1=gdata #Find the z-scores for the various numerical satisfaction score columns in the data for the first subject ggs1s=data.frame(scale(ggs1[ggs1$subject==ggs[1],-c(1,2,12)])) #Add another column that labels this set of z-scores with the appropriate institution names ggs1s$inst=ggs1[ggs1$subject==ggs[1],2] #Add another column that specifies the subject the z-scores correspond to ggs1s$ss=ggs[1] #We now repeat the above process for each of the other subjects using a temporary working copy of the data for (g in ggs[-1]){ tmp=data.frame(scale(ggs1[ggs1$subject==g,-c(1,2,12)])) tmp$inst=ggs1[ggs1$subject==g,2] tmp$ss=g #The only difference is we add this temporary data to the end of the working copy ggs1s=rbind(ggs1s,tmp) } #The resulting data frame, ggs1s, contains z-scores for each satisfaction data column across institutions within a subject.

This data essentially gives us a clue about how well one institution’s scores compare with the scores of the other institutions in each separate subject area. The lead in here is to questions along the lines of “which universities have particularly good career prospects for students studying subject X”, or “which universities are particularly bad in terms of assessment satisfaction when it comes to subject Y?”.

In addition, for each of the satisfaction scores/columns associated with a particular institution, we can generate the z-score that shows how far away from the average in that satisfaction column for that institution each subject is.

##The pattern of this code should look familiar... ##We're going to do what we did for subjects, only this time for institutions... #Get the list of subjects ggi=levels(gdata$Name.of.Institution) #Grab a working copy of the data ggi1=gdata #Find the z-scores for the various numerical satisfaction score columns in the data for the first institution ggi1s=data.frame(scale(ggi1[ggi1$Name.of.Institution==ggi[1],-c(1,2,12)])) #Add another column that labels this set of z-scores with the appropriate institution names ggi1s$inst=ggi[1] #Add another column that specifies the subject the z-scores correspond to ggi1s$ss=ggi1[ggi1$Name.of.Institution==ggi[1],12] #We now repeat the above process for each of the other institutions using a temporary working copy of the data for (g in ggi[-1]){ tmp=data.frame(scale(ggi1[ggi1$Name.of.Institution==g,-c(1,2,12)])) tmp$ss=ggi1[ggi1$Name.of.Institution==g,12] tmp$inst=g #As before, the only difference is we add this temporary data to the end of the working copy ggi1s=rbind(ggi1s,tmp) } #The resulting data frame, ggs1i, contains z-scores for each satisfaction data column across subjects within a institution.

This second take on the data essentially gives us a clue about how well each subject area performs within an institution compared to the performance of the other subjects offered by that institution. This might help lead us in to questions of the form “which department within an institution appears to have an unusually high entrance tariff compared to other courses offered by the institution?” or “which courses have a particularly overall satisfaction, which might be breaking our overall NSS result?” For university administrators taking a simplistic view of this data, it might let you tell a Head of Department that their chaps are letting the side down. Or it might be used by a Director of a Board of Studies in a particular subject area to boost a discretionary “bonus” application.

So here’s something I’ve started playing with to try to generate “alerts” based on outperformers and underperformers within an institution and/or within a particular subject area compared to other institutions: plots that chart show subject/institution combinations that have “large” z-scores as calculated above.

To plot the charts, we need to combine the z-score datasets and get the data into the right shape:

#Convert the z-score dataframes from "wide" format to "long" format #The variable column will identify which satisfaction score the row conforms to for a given subject/institution pair ggi1sm=melt(ggi1s,id=c('inst','ss')) ggs1sm=melt(ggs1s,id=c('inst','ss')) #Merge the datasets by subject/institution/satisfaction score ggs2=merge(ggs1sm,ggi1sm,by=c('inst','ss','variable'))

It’s then a simple matter to plot the outliers for a given institution – let’s filter to show only items where at least one of the z-scores has magnitude greater than 2:

ii='Cardiff' g=ggplot(subset(ggs2,inst==ii & !is.na(value.x) & !is.na(value.y) & (abs(value.x)>2 | abs(value.y)>2))) + geom_text(aes(x=value.x,y=value.y,label=ss),size=2)+facet_wrap(~variable) g=g+labs(title=paste("Guardian University Tables 2014 Alerts:",ii), x='z relative to other institutions in same subject', y='z relative to other subjects in same institution') g

Or for given subject:

ss='Psychology' g=ggplot(subset(ggs2,ss==ss & !is.na(value.x) & !is.na(value.y) & (abs(value.x)>2 | abs(value.y)>2))) + geom_text(aes(x=value.x,y=value.y,label=inst),size=2)+facet_wrap(~variable) g=g+labs(title=paste("Guardian University Tables 2014 Alerts:",ss), x='z relative to other institutions in same subject', y='z relative to other subjects in same institution') g

Let’s see how meaningful these charts are, and whether they provide us with any glanceable clues about where there may be particular successes or problems…

Here’s the institutional view:

A couple of things we might look for are items in the top right and bottom left corners – these are subjects where the scores are well above average both nationally and within the institution. Subjects close to the y-axis are well away from the average within the institution, but fairly typical on a national level for that subject area. Subjects close to the x-axis are fairly typical within the institution, but away from average when compared to other institutions.

So for example, dentistry appears to be having a hard time of it? Let’s look at the dentistry subject table on an updated version of the Shiny viewer (which now included the alert charts), focussing on a couple of the satisfaction values that appear to be weak at a national scale:

How about the subject alert?

Hmmm…what’s happening at UC Suffolk?

I’m still not sure how useful these views are, and the approach thus far still doesn’t give a glanceable alert over the whole data set in one go (we have to run views over institutions or subjects, for example). But it was relatively quick to do (this post took almost as long to write as the coding), and it maybe opens up ideas about what other questions we might ask of the data?

Might be worth also looking to detect institutions (or subjects) that are outlying overall – maybe the R package mvoutlier would help with that?

Hi Kevin – yes, so far I haven’t looked at the overall summary data, or made comparisons across to it from subject areas (on the to do list!;-).

re: mvoutlier, thanks for the tip:-)