NHS Winter Situation Reports: Shiny Viewer v2

Having got my NHS Winter sitrep data scraper into shape (I think!), and dabbled with a quick Shiny demo using the R/Shiny library, I thought I’d tidy it up a little over the weekend and along the way learn a few new presentation tricks.

To quickly recap the data availability, the NHS publish a weekly spreadsheet (with daily reports for Monday to Friday – weekend data is rolled over to the Monday) as an Excel workbook. The workbook contains several sheets, corresponding to different data collections. A weekly scheduled scraper on Scraperwiki grabs each spreadsheet and pulls the data into a rolling database: NHS Sitreps scraper/aggregator. This provides us with a more convenient longitudinal dataset if we want to look at sitrep measures for a period longer than a single week.

So here’s where I’ve got to now – NHS sitrep demo:

NHS sitrep2

The panel on the left controls user actions. The PCT (should be relabelled as “Trust”) drop down list is populated based on the selection of a Strategic Health Authority. The Report types follow the separate sheets in the Winter sitrep spreadsheet (though some of them include several reported measures, which is handled in the graphical display). The Download button allows you to download, as CSV data, the data for the selected report. By default, it downloads data at the SHA level (that is, data for each Trust in the selected SHA), although a checkbox control allows you to limit the downloaded results to just data for the selected Trust:

NHS sitrep panel
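
As a rough sketch of how the Trust list might be derived from the selected SHA (this is not the app's actual code: the `trusts` lookup table, its contents and the `trustChoices` helper are all made up for illustration), a named vector of the kind Shiny's selectInput() accepts as its choices can be built in base R:

```r
# Made-up lookup table of Trusts and their SHAs, for illustration only
trusts <- data.frame(
  sha  = c("SHA One", "SHA One", "SHA Two"),
  Code = c("T01", "T02", "T03"),
  Name = c("Trust A", "Trust B", "Trust C"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return a named vector (label -> code) of the Trusts in one SHA.
# The names are what a drop down list would display, the values are what it would return.
trustChoices <- function(sha) {
  rows <- trusts[trusts$sha == sha, ]
  setNames(rows$Code, rows$Name)
}

trustChoices("SHA One")
```

Calling the helper again whenever the SHA selection changes would give the Trust list to show.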

Using just these controls, then, the user can select and download Winter sitrep data (to date), as a CSV file, for any selected Trust, or for all the Trusts in a given SHA.

So how does the Download work? Quite straightforwardly, as it turns out:

#This function marshals the data for download
downloadData <- reactive(function() {
  #ds is assumed to already hold the sitrep data for the selected SHA
  if (input$pctdownonly==TRUE) 
    ds=subset(ds,tid==input$rep & Code==input$tbl,select=c('Name','fromDateStr','toDateStr','tableName','facetB','value'))
  #Return the data frame so that the download handler can write it out
  ds
})

output$downloadData <- downloadHandler(
  #Add a little bit of logic to name the download file appropriately
  filename = function() { if (input$pctdownonly==FALSE) paste(input$sha,'_',input$rep, '.csv', sep='') else paste(input$tbl,'_',input$rep, '.csv', sep='') },
  content = function(file) { write.csv(downloadData(), file, row.names=FALSE) }
)

Graphical reports are split into two panels: at the top, views over the report data for each Trust in the selected SHA; at the bottom, more focussed views over the currently selected Trust.

Working through the charts, the SHA level stacked bar chart is intended to show summed metrics at the SHA level:

NHS sitrep - stacked bar

My thinking here was that it may be useful to look at bed availability across an SHA, for example. The learning I had to do for this view was in the layout of the legend:

#g is a ggplot object
g=g+theme( legend.position = 'bottom' )
g=g+scale_fill_discrete( guide = guide_legend(title = NULL,ncol=2) )

The facetted, multiplot view also uses independent y-axis scales for each plot (sometimes this makes sense, sometimes it doesn’t. Maybe I need to add some logic to control when to use this and when not to?)

#The 'scales' parameter allows independent y-axis limits for each facet plot 
g=g+facet_wrap( ~tableName+facetB, scales = "free_y" )

The line chart shows the same data in a more connected way:

NHS sitrep SHA line

To highlight the data trace for the currently selected Trust, I overplot that line with dots that show the value of each data point for that Trust. I’m not sure whether these should be coloured? Again, the y-axis scales are free.
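
A minimal sketch of that highlighting idea, assuming ggplot2 is installed (this is not the app's actual code; the data frame `d`, its values and the `selected` variable are made up for illustration):

```r
library(ggplot2)

# Made-up values for three Trusts over five days, for illustration only
d <- data.frame(
  fdate = rep(as.Date("2012-12-03") + 0:4, times = 3),
  Name  = rep(c("Trust A", "Trust B", "Trust C"), each = 5),
  value = c(5, 6, 4, 7, 6,  3, 3, 5, 4, 2,  8, 7, 9, 6, 7)
)
# Stands in for the Trust picked in the control panel
selected <- "Trust B"

# One line per Trust...
g <- ggplot(d, aes(x = fdate, y = value)) +
  geom_line(aes(group = Name, colour = Name))
# ...then overplot dots for just the selected Trust
g <- g + geom_point(data = subset(d, Name == selected))
```

Because the second layer is given only the rows for the selected Trust, the dots sit on that Trust's line and nowhere else.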

The SHA Boxplot shows the distribution of values for each Trust in the SHA. I overplot the box for the selected Trust using a different colour.

NHS sitrep SHA boxplot

(I guess a “semantic depth of field”/blur approach might also be used to focus attention on the plot for the currently selected Trust?)
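
The overplotting of the selected Trust's box might look something like the following sketch, assuming ggplot2 is installed (the data frame `d`, the `selected` variable and the choice of red are made up for illustration, not taken from the app):

```r
library(ggplot2)

# Made-up values for three Trusts, five reports each, for illustration only
d <- data.frame(
  Name  = rep(c("Trust A", "Trust B", "Trust C"), each = 5),
  value = c(5, 6, 4, 7, 6,  3, 3, 5, 4, 2,  8, 7, 9, 6, 7)
)
# Stands in for the Trust picked in the control panel
selected <- "Trust B"

# A box for every Trust in the SHA...
g <- ggplot(d, aes(x = Name, y = value)) + geom_boxplot()
# ...then redraw just the selected Trust's box in a different colour
g <- g + geom_boxplot(data = subset(d, Name == selected), colour = "red")
```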

My original attempt at this graphic was distorted by very long text labels that were also misaligned. To get round this, I generated a new label attribute that included line breaks:

#Wordwrapper via:
#Limit the length of each line to 15 chars
limiter=function(x) gsub('(.{1,15})(\\s|$)', '\\1\n', x)
#We can then print axis tick labels using d$sName
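
To see what the word wrapper does, here is a small base R check on a made-up label (the label text is purely illustrative):

```r
# Wrap at whitespace so that no line is longer than 15 characters
limiter <- function(x) gsub('(.{1,15})(\\s|$)', '\\1\n', x)

# A made-up long label, for illustration only
label <- "An Example Long Named NHS Foundation Trust"
wrapped <- limiter(label)

# Print the wrapped label, one chunk per line
cat(wrapped)

# Split on the inserted line breaks to inspect the individual lines
wrappedLines <- strsplit(wrapped, "\n")[[1]]
max(nchar(wrappedLines))
```

Since gsub() is vectorised, the same function can be applied to a whole column of labels in one go.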

We can offset the positioning of the label when it is printed:

#Tweak the positioning using vjust, rotate it and also modify label size
g=g+theme( axis.text.x=element_text(angle=-90,vjust = 0.5,size=5) )

The Trust Barchart and Linechart are quite straightforward. The Trust Daily Boxplot is a little more involved. The intention of the Daily plot is to try to identify whether or not there are distributional differences according to the day of the week. (Note that some of the data reports relate to summed values over the weekend, so these charts are likely to show comparatively high values for the Monday figure, which rolls up the weekend reports!)

NHS sitrep daily boxplot

I ‘borrowed’ a script for identifying days of the week… (I need to tweak the way these are ordered – the original author had a very particular application in mind.)

# the month too 
tmp$month<-as.numeric(as.POSIXlt(tmp$fdate)$mon+1)
# but turn months into ordered factors to control the appearance/ordering in the presentation
tmp$monthf<-factor(tmp$month,levels=as.character(1:12), labels=c("Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"),ordered=TRUE)
# the day of week is again easily found
tmp$weekday = as.POSIXlt(tmp$fdate)$wday
# again turn into factors to control appearance/abbreviation and ordering
# I use the reverse function rev here to order the week top down in the graph
# you can cut it out to reverse week order
tmp$weekdayf<-factor(tmp$weekday,levels=rev(0:6),labels=rev(c("Sun","Mon","Tue","Wed","Thu","Fri","Sat")),ordered=TRUE)
# the monthweek part is a bit trickier 
# first a factor which cuts the data into month chunks
# then find the "week of year" for each day
tmp$week <- as.numeric(format(tmp$fdate,"%W"))
# and now for each monthblock we normalize the week to start at 1 

The weekdayf value could then be used as the basis for plotting the results by day of week.
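
One possible tweak to the ordering, sketched in base R on made-up data (the dates and values are invented for illustration), is to build the factor with Monday as the first level so that plots read left to right through the week:

```r
# Made-up report dates (Mon 3 Dec 2012 to Fri 7 Dec 2012), for illustration only
tmp <- data.frame(fdate = as.Date("2012-12-03") + 0:4, val = c(12, 7, 9, 8, 10))

# POSIXlt numbers the days of the week from 0 (Sunday) to 6 (Saturday)
tmp$weekday <- as.POSIXlt(tmp$fdate)$wday

# Look up a label for each day, then order the factor Monday first
dayLabels <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")
tmp$weekdayf <- factor(dayLabels[tmp$weekday + 1],
                       levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"),
                       ordered = TRUE)
```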

To add a little more information to the chart, I overplot the boxplot with the actual data points, adding a small amount of jitter to the x-component (the y-value is true).

g=g+geom_point(aes(x=weekdayf,y=val),position = position_jitter(w = 0.3, h = 0))

I guess it would be more meaningful if the data points were actually ordered by week/year. (Indeed, what I originally intended to do was a seasonal subseries style plot at the day level, to see whether there were any trends within a day of week over time, as well as pull out differences at the level of day of week.)
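
A very rough sketch of what such a day-level subseries view might look like, assuming ggplot2 is installed (this is one possible reading of the idea, not something the app does; the three weeks of data are made up for illustration), is to give each day of the week its own panel and plot that day's values in date order:

```r
library(ggplot2)

# Three made-up weeks of weekday reports, for illustration only
tmp <- data.frame(
  fdate = as.Date("2012-12-03") + c(0:4, 7:11, 14:18),
  val   = c(12, 7, 9, 8, 10,  14, 6, 8, 9, 11,  15, 8, 7, 10, 9)
)

# Label each date with its day of the week, Monday first
dayLabels <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")
tmp$weekdayf <- factor(dayLabels[as.POSIXlt(tmp$fdate)$wday + 1],
                       levels = c("Mon", "Tue", "Wed", "Thu", "Fri"))

# One panel per day of the week, with that day's values traced over time
g <- ggplot(tmp, aes(x = fdate, y = val)) +
  geom_line() + geom_point() +
  facet_wrap(~weekdayf, nrow = 1)
```

Reading along a panel shows any trend within a given day of the week, while comparing panels shows differences between days.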

Finally, the Trust datatable shows the actual data values for the selected report and Trust:

NHS sitrep Trust datatable

(Remember, this data, or data for this report for each Trust in the selected SHA, can also be downloaded directly as a CSV file.)

The thing I had to learn here was how to disable the printing of the dataframe row names in the Shiny context:

output$view = reactiveTable(function() {
    #...get the data and return it for printing
    }, include.rownames=FALSE)

As a learning exercise, this app got me thinking about solving several presentational problems, as well as trying to consider what reports might be informative or pattern revealing (for example, the Daily boxplots).
The biggest problem, of course, is coming up with views that are meaningful and useful to end-users, the sorts of questions they may want to ask of the data, and the sorts of things they may want to pull from it. I have no idea who the users, if any, of the Winter sitrep data as published on the NHS website might be, or how they make use of the data, either in mechanistic terms – what do they actually do with the spreadsheets – or at the informational level – what stories they look for in the data/pull out of it, and what they then use that information for.

This tension is manifest around a lot of public data releases, I think – hacks’n’hackers look for shiny(?!) things they can do with the data, though often out of any sort of context other than demonstrating technical prowess or quick technical hacks. Users of the data may possibly struggle with doing anything other than opening the spreadsheet in Excel and then copying and pasting it into other spreadsheets, although they might know exactly what they want to get out of the data as presented to them. Some users may be frustrated at a technical level in the sense of knowing what they’d like to be able to get from the data (for example, getting monthly timeseries from weekly timeseries spreadsheets) but may not be able to do it easily for lack of technical skills. Some users may not know what can be readily achieved with the way data is organised, aggregated and mixed with other datasets, and what this data manipulation then affords in its potential for revealing stories, trends, structures and patterns in the data, and here we have a problem with even knowing what value might become unlockable (“Oh, I didn’t know you could do that with it…”). This is one reason why hackdays – such as the NHS Hackday and various govcamps – can be so very useful (I’m also reminded of MashLib/Mashed Library events where library folk and techie shambrarians come together to learn from each other). What I think I’d like to see more of, though, is people with real/authentic questions that might be asked of data, or real answers they’d like to be able to find from data, starting to air them as puzzles that the data junkies, technicians, wranglers and mechanics amongst us can start to play with from a technical side.

PS this could be handy… downloading PDF docs from Shiny.

PPS Radio 4’s Today programme today had a package on the NHS release of surgeon success data. In an early interview with someone from the NHS, the interviewee made the point that the release of the data was there for quality/improvement purposes and to help identify opportunities for supporting best practice (eg along the lines of previous releases of heart surgery performance data). The 8am, after 8 interview, and 8.30am news bulletins all pushed the faux and misleading line of how this data could be used for “parent choice” (I complained bitterly – twice – via Twitter;-), though the raising standards line was taken in the 9am bulletin. There’s real confusion, I think, about how all this data stuff might, could and should be used (I’m thoroughly confused by it myself), and I’m not sure the press are always very helpful in communicating it…


  1. Andrew Marritt

    You make some really great points about users (those who could most benefit from the insight data could provide) and the data.

    From my experience to get real value for users you almost always need to reframe the question away from the data to their problem. For many people just mentioning data makes them think about what data they think they have and what it could tell them. The conversation quickly becomes data-led rather than user-led and vast numbers of doors become closed.

    We use a process heavily based on design-thinking to develop data-driven experiences and often use similar ethnographic techniques to understand the user-need. The process, heavily simplified, is as follows:

    What is the user’s problem / decision which they need to make?
    How could information support that decision?
    Where is that information? Do we need to add additional measurement / data collection?
    How do we analyse / transform data to bring it into a form which supports the users’ decision?
    How do we design ways of communicating the information in an easy-to-use manner? This includes where should it be? How do they access it? How often? When? It might be the user ‘pulls’ it by reaching for the display, it might be the information should be pushed towards them because they need to decide.
    Prototype / test (we run usability testing on our reports) / refine.

    The decision to omit the neutral response on the net stacked distribution displays of Likert data that you’ve written about before was based on such a process. We realised that users didn’t make decisions based on the neutral responses so omitted them as they added no value to the task. It wasn’t about accurate portrayal of information but instead answering typical questions.

    Starting from a data perspective is like having a solution and looking for a problem, which I guess has similarities to the problem that universities face commercialising academic research.

    • Tony Hirst

      Hi Andrew, Thanks for the comment… “Starting from a data perspective is like having a solution and looking for a problem” – yep, that’s sort of where I am tinkering with public open data. So I’ve taken it as an opportunity to work out some techie recipes that might be useful if I ever find a problem. One of my failed resolutions at the start of this year was to try to get more involved with folk who might have authentic problems that could contextualise some of the stuff I play with and help me make it meaningful (and end-user useful) for others, as well as helping me frame things more relevantly for folk who might run with some of the ideas as implementers. I will try to do better on this front next year!

      As to the Likert scale/not showing neutral responses, I was reminded of that when I discovered the YouGov poll data resource and saw how they handled questions/responses… http://www.open.edu/openlearn/science-maths-technology/mathematics-and-statistics/statistics/two-can-play-game-when-polls-collide
