A Google Spreadsheets View Over DWP Tabulation Tool 1-Click Time Series Data

Whilst preparing for an open data training session for New Economy in Manchester earlier this week, I was introduced to the DWP tabulation tool that provides a quick way of analysing various benefits and allowances related datasets, including bereavement benefits, incapacity benefit and employment and support allowance.

The tool supports the construction of various data views as well as providing 1-click link views over “canned” datasets for each category of data.

[Screenshot: DWP tabulation tool]

The data is made available in the form of an HTML data table via a static URL (example):

Bereavement_Benefits__Bereavement_Benefit_and_Widows_Benefit_combined__--_On_Flows__thousands____Time_Series_by_Gender_of_claimant

To simplify working with the data, we can import the data table directly into a Google spreadsheet using the importHTML() formula, which allows you to specify a URL and then import a particular HTML data table from that page. In the following example, the first table from a results page – the one containing the description of the table – is imported into cell A1, and the actual data table (table 2) is imported via an importHTML() formula specified in cell A2.

[Screenshot: imported DWP tables in a Google spreadsheet]

Note that the first data row does not appear to import cleanly – inspection of the original HTML table shows why: what is presumably a split cell declares the name of the time series index column along with the first time index value.

To simplify the import of these data tables into a Google Spreadsheet, we can make use of a small script to add an additional custom menu into Google spreadsheets that will import a particular dataset.

[Screenshot: Google Sheets script editor]

The following script shows one way of starting to construct such a set of menus:

function onOpen() {
  var ui = SpreadsheetApp.getUi();
  // Or DocumentApp or FormApp.
  ui.createMenu('DWP Tabs')
      .addSubMenu(ui.createMenu('Bereavement Benefits')
          .addItem('1-click BW/BB timeseries', 'mi_bb_b')
          .addItem('1-click Region timeseries', 'mi_bb_r')
          .addItem('1-click Gender timeseries', 'mi_bb_g')
          .addItem('1-click Age timeseries', 'mi_bb_a')
      )
      .addSubMenu(ui.createMenu('Incapacity Benefit/Disablement')
          .addItem('1-click Region timeseries', 'mi_ic_r')
      )
      .addToUi();
}

function menuActionImportTable(url){
  var ss = SpreadsheetApp.getActiveSpreadsheet();
  var sheet = ss.getSheets()[0];

  var cell = sheet.getRange("A1");
  cell.setFormula('=importhtml("'+url+'","table",1)');
  cell = sheet.getRange("A2");
  cell.setFormula('=importhtml("'+url+'","table",2)');
}

//--Incapacity Benefit/Disablement
function mi_ic_r() {
  var url='http://tabulation-tool.dwp.gov.uk/flows/flows_on/ibsda/cdquarter/ccgor/a_carate_r_cdquarter_c_ccgor.html';
  menuActionImportTable(url)
}


//-- Bereavement Benefits
function mi_bb_r() {
  var url='http://tabulation-tool.dwp.gov.uk/flows/flows_on/bb/cdquarter/ccgor/a_carate_r_cdquarter_c_ccgor.html';
  menuActionImportTable(url)
}

function mi_bb_g() {
  var url='http://tabulation-tool.dwp.gov.uk/flows/flows_on/bb/cdquarter/ccsex/a_carate_r_cdquarter_c_ccsex.html';
  menuActionImportTable(url)
}

function mi_bb_a() {
  var url='http://tabulation-tool.dwp.gov.uk/flows/flows_on/bb/cdquarter/cnage/a_carate_r_cdquarter_c_cnage.html';
  menuActionImportTable(url)
}

function mi_bb_b() {
  var url='http://tabulation-tool.dwp.gov.uk/flows/flows_on/bb/cdquarter/ccbbtype/a_carate_r_cdquarter_c_ccbbtype.html';
  menuActionImportTable(url)
}

Copying the above script into the script editor associated with a spreadsheet, and then reloading the spreadsheet (permissions may need to be granted to the script the first time it is run), provides a custom menu that allows the direct import of a particular dataset:

[Screenshot: custom "DWP Tabs" menu in the spreadsheet]

Duplicating the spreadsheet carries the script along with it (I think), and the copy can presumably also be shared… (It's been some time since I played with Apps Script – I'm not sure how permissioning works or how easy it is to convert scripts to add-ons, though I note from the documentation that top-level app menus aren't supported by add-ons.)

Authoring Dynamic Documents in IPython / Jupyter Notebooks?

One of the reasons I started writing the Wrangling F1 Data With R book was to see how it felt writing combined text, code and code output materials in the RStudio/RMarkdown context. For those of you that haven't tried it, RMarkdown lets you insert executable code elements inside a markdown document, either as code blocks or inline. The knitr library can then execute the code and display the code output (which includes tables and charts), and pandoc transforms the output to a desired output document format (such as HTML or PDF, for example). And all this at the click of a single button.

In IPython (now Jupyter) notebooks, I think we can start to achieve a similar effect using a combination of extensions. For example:

  • python-markdown allows you to embed (and execute) python code inline within a markdown cell by enclosing it in double braces (for example, I could say "{{ print('hello world') }}");
  • hide_input_all is an extension that will hide code cells in a document and just display their executed output; it would be easy enough to tweak this extension to allow a user to select which cells to show and hide, capturing that cell information as cell metadata;
  • Readonly allows you to "lock" a cell so that it cannot be edited; using a notebook server that implements this extension means you can start to protect against accidental changes being made to a cell within a particular workflow; in a journalistic context, assigning a quote to a python variable, locking that code cell, and then referencing that quote/variable via python-markdown might be one way of working, for example.
  • Printview-button will call nbconvert to generate an HTML version of the current notebook – however, I suspect this does not respect the extension-based customisations that operate on cell metadata. To do that, I guess we need to generate our output via an nbconvert custom template? (The Download As... notebook option doesn't seem to save the current HTML view of a notebook either?)

So – my reading is: tools are there to support the editing side (inline code, marking cells to be hidden etc) of dynamic document generation, but not necessarily the rendering to hard copy side, which needs to be done via nbconvert extensions?
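By way of a sketch of what the rendering side might look like, the following fragment uses the standalone nbformat/nbconvert APIs to suppress the input of any cell whose metadata carries a hide_input flag. It's illustrative only – the flag name, the filenames and the idea of driving everything from cell metadata are my assumptions rather than anything the extensions above actually do (and under IPython 3 the equivalent classes live in IPython.nbformat and IPython.nbconvert):

import nbformat
from nbconvert import HTMLExporter
from nbconvert.preprocessors import Preprocessor

class HideFlaggedInputs(Preprocessor):
    """Blank the source of any cell flagged with hide_input in its metadata."""
    def preprocess_cell(self, cell, resources, index):
        if cell.get('metadata', {}).get('hide_input'):
            cell['source'] = ''  # crude: drop the input, keep any outputs
        return cell, resources

nb = nbformat.read('example.ipynb', as_version=4)  # hypothetical notebook file
exporter = HTMLExporter()
exporter.register_preprocessor(HideFlaggedInputs, enabled=True)
(body, resources) = exporter.from_notebook_node(nb)
with open('example.html', 'w') as f:
    f.write(body)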

Related: Seven Ways of Running IPython Notebooks

Keeping Track of an Evolving “Top N” Cutoff Threshold Value

In a previous post (Charts are for Reading), I noted how it was difficult to keep track of which times in an F1 qualifying session had made the cutoff time as a qualifying session evolved. The problem can be stated as follows: in the first session, with 20 drivers competing, the 15 drivers with the best ranked laptime will make it into the next session. Each driver can complete zero or more timed laps, with drivers completing laps in any order.

Finding the 15 drivers who will make the cutoff is therefore not simply a matter of ranking the best 15 laptimes at any point, because the same 5 drivers, say, may each record 3 fast laptimes, thus taking up the 15 slots that record the 15 fastest laptimes.

If we define a discrete time series with steps corresponding to each recorded laptime (from any driver), then at each time step we can find the best 15 drivers by finding each driver’s best laptime to date and ranking by those times. Conceptually, we need something like a lap chart which uses a ‘timed lap count’ rather than race lap index to keep track of the top 15 cars at any point.

[Image: example FIA lap chart]

At each index step, we can then find the laptime of the 15th ranked car to give the current session cutoff time.

In a dataframe that records laptimes in a session by driver code for each driver, along with a column that contains the current purple laptime, we can arrange the laptimes by cumulative session laptime (so the order of rows follows the order in which laptimes are recorded) and then iterate through those rows one at a time. At each step, we can summarise the best laptime recorded so far in the session for each driver.

library(plyr)

#cutoffvals is assumed to be defined elsewhere, mapping each session
#to its cutoff position, eg cutoffvals=c(Q1=15,Q2=10)
df=arrange(df,cuml)
dfc=data.frame()
for (r in 1:nrow(df)) {
  #Summarise the best laptime recorded so far in the session for each driver
  dfcc=ddply(df[1:r,],.(qsession,code),summarise,dbest=min(stime))
  #Keep track of which session we are in
  session=df[r,]$qsession
  #Rank the best laptimes for each driver to date in the current session
  #(Really should filter by session at the start of this loop?)
  dfcc=arrange(dfcc[dfcc['qsession']==session,],dbest)
  #The different sessions have different cutoffs: Q1, top 15; Q2, top 10
  n=cutoffvals[df[r,]$qsession]
  #If we have at least as many driver best times recorded as the cutoff number
  if (nrow(dfcc) >=n){
    #Grab a record of the current cut-off time
    #along with info about each recorded laptime
    dfc=rbind(dfc,data.frame(df[r,]['qsession'],df[r,]['code'],df[r,]['cuml'],dfcc[n,]['dbest']) )
  }
}
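For anyone not working in R, the same bookkeeping can be sketched in a few lines of Python – this is purely illustrative (made-up function and variable names), and it's the R fragment above that the chart below is actually based on:

def cutoff_evolution(laps, n=15):
    """Return the cutoff time after each recorded lap, or None while fewer
    than n drivers have set a time; laps is a list of (driver, laptime)
    tuples in the order the laps were recorded."""
    best = {}      # best laptime so far for each driver
    cutoffs = []
    for driver, laptime in laps:
        best[driver] = min(laptime, best.get(driver, laptime))
        ranked = sorted(best.values())
        cutoffs.append(ranked[n - 1] if len(ranked) >= n else None)
    return cutoffs

# eg cutoff_evolution([('RAI', 95.2), ('BOT', 94.8), ('ROS', 95.657)], n=15)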

We can then plot the evolution of the cut-off time as the sessions proceed. The chart in its current form is still a bit hard to parse, but it's a start…

[Chart: evolution of the qualifying cut-off time]

In the above sketch, the lines connect the current purple time and the current cut-off time in each session (aside from the horizontal line, which represents the cut-off time at the end of the session). This gives a false impression of the evolution of the cutoff time – really, the line should be a stepped line that traces the current cut-off time horizontally until it is improved, at which point it should step vertically down. (In actual fact, the line does track horizontally as laptimes are recorded that do not change the cutoff time, as indicated by the horizontal tracks in the Q1 panel as the grey times (laptimes slower than the driver's best time in the session so far) are completed.)

The driver labels are coloured according to: purple – current best in session time; green – driver best in session time to date (that wasn't also purple); red – driver's best time in session that was outside the final cut-off time. This colouring conflates two approaches to representing information – the purple/green colours represent online algorithmic processing (if we constructed the chart in real time from laptime data as laps were completed, that's how we'd colour the points), whereas the red colouring represents the results of offline algorithmic processing (the colour can only be calculated at the end of the session, when we know the final session cutoff time). I think these mixed semantics contribute to making the chart difficult to read…

In terms of what sort of stories we might be able to pull from the chart, we see that in Q2, Hulkenberg and Sainz were only fractions of a second apart, and Perez looked like he had squeezed into the final top 10 slot until Sainz pushed him out. To make it easier to see which times contribute to the top 10 times, we could use font weight (eg bold font) to highlight each driver's session best laptimes.

To make the chart easier to read, I think each improvement to the cutoff time should be represented by a faint horizontal line, with a slightly darker line tracing the evolution of the cutoff time as a stepped line. This would allow us to see which times were within the cutoff time at any point.

I also wonder whether it might be interesting to generate a table a bit like the race lap chart, using session timed lap count rather than race lap count, perhaps with additional colour fields to show which car recorded the time that increased the lap count index, and perhaps also where in the order that time fell if it didn’t change the order in the session so far. We could also generate online and offline differences between each laptime in the session and the current cutoff time (online algorithm) as well as the final overall session cutoff time (offline algorithm).

[As and when I get this chart sorted, it will appear in an update to the Wrangling F1 Data With R lean book.]

Alternative Routes to Academic Publishing?!

Flippantly…..

  1. Get in a ghost writer to write your publications for you; inspired by the old practice of taking first authorship on work done by your research assistant/postgrad etc etc…
  2. Use your influence – tell your research assistant/postgrad/unpublished colleague that you’ll add your name to their publication and your chums in the peer review pool will let it through;
  3. Comment on everything anyone sends you or tells you – where possible, make changes inline to any document that anyone sends you (rather than commenting in the margins) – and make it so difficult to disentangle what you've added that they're forced to give you an author credit. Alternatively, where possible, make structural changes to the organisation of a paper early on so that other authors think you've contributed more than you have… Reinforce these by commenting on "the paper you're writing with X" to everyone else so they think it actually is a joint paper;
  4. Give your work away because you're too lazy to write it up – start a mentoring or academic writing scheme, write half-baked, unfinished articles and get unpublished academics, or academics playing variants of the games above, to finish them off for you;
  5. Identify someone who has a lot of started but not quite finished papers and offer to help bring them to completion in exchange for an authorship credit.

Note that some of the options may be complementary and allow two people to exploit each other…

Lazy Regular Expressions – Splitting Out Collapsed Columns

Via a tweet, and then an email, to myself and fellow OpenRefine evangelist Owen Stephens (if you haven't already done so, check out Owen's wonderful OpenRefine tutorial), Dom Fripp got in touch with a data cleaning issue he was having to contend with: a reporting system that threw out a data report in which one of the columns contained a set of collapsed columns from another report. So, something rather like this:

TitleoffirstresearchprojectPeriod: 31/01/04 → 31/01/07Number of participants: 1Awarded date: 22 Aug 2003Budget Account Ref: AB1234Funding organisation: BBSRCTotal award: £123,456Principal Investigator: Goode, Johnny B.Project: Funded Project › Research project

The question was – could this be fixed using OpenRefine, with the compounded data elements split out from the single cell into separate columns of their own?

The fields that appeared in this combined column were variable (not all of them appeared in each row), but they always appeared in the same order. So, for example, a collapsed record might look like:

Funding organisation: BBSRCFunder project reference: AA/1234567/8Total award:

The full list of possible collapsed columns was: Title, School/Department, Period, Number of participants, Awarded Date, Budget Account Ref, Funding Organisation, Funder Project Reference, Total award, Reference code, Principal Investigator, Project

The pattern appeared to be Column Name: value, except for the Title, where there was no colon.

On occasion, a row would contain an exceptional item that did not conform to the pattern:

ROGUE CODE

One way to split out the columns is to use a regular expression. We can parse a column using the “Add column based on this column” action:

[Screenshot: the "Add column based on this column" dialog]

If all the columns always appeared in the same order, we could write something like the following GREL regular expression to match each column and its associated value:

value.match(/(Title.*)(Period.*)(Number of participants:.*)(Awarded date.*)(Budget Account Ref:.*)(Funding organisation.*)(Total award.*)(Principal Investigator:.*)(Project:.*)/)

[Screenshot: regex2]

To cope with optional elements that don’t appear in our sample (for example, (School\/Department.*)), we need to make each group optional by qualifying it with a ?.

value.match(/(Title.*)?(School\/Department.*)?(Period.*)?(Number of participants:.*)?(Awarded date.*)?(Budget Account Ref:.*)?(Funding organisation.*)?(Funder project reference.+?)?(Total award.*)?(Principal Investigator:.*)?(Project:.*)?/)

[Screenshot: regex2a]

However, as the above example shows, using the greedy .* operator means we match everything in the first group. So instead, we need to use a lazy quantifier to match items within a group: .+?

value.match(/(Title.+?)?(School\/Department.+?)?(Period.+?)?(Number of participants:.+?)?(Awarded date.+?)?(Budget Account Ref:.+?)?(Funding organisation.+?)?(Funder project reference.+?)?(Total award.+?)?(Principal Investigator:.+?)?(Project:.+?)?/)

[Screenshot: regex3]
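(As a quick aside, the greedy/lazy distinction is easy to play with outside OpenRefine. For example, in a Python 3 shell, re.fullmatch acts as a rough stand-in for the whole-value matching that GREL's match() performs; the sample string here is just a made-up fragment of the collapsed data:)

import re

s = "Period: 31/01/04Number of participants: 1Awarded date: 22 Aug 2003"

greedy = re.fullmatch(r"(Period.*)?(Number of participants:.*)?(Awarded date.*)?", s)
print(greedy.groups())
# ('Period: 31/01/04Number of participants: 1Awarded date: 22 Aug 2003', None, None)

lazy = re.fullmatch(r"(Period.+?)?(Number of participants:.+?)?(Awarded date.+?)?", s)
print(lazy.groups())
# ('Period: 31/01/04', 'Number of participants: 1', 'Awarded date: 22 Aug 2003')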

So far so good – but how do we cope with cells that do not start with one of our recognised patterns? This time we need an initial group that matches anything that is not the expected first pattern in our list:

value.match(/((?!(?:Title)).*)?(Title.+?)?(School\/Department.+?)?(Period.+?)?(Number of participants:.+?)?(Awarded date.+?)?(Budget Account Ref:.+?)?(Funding organisation.+?)?(Funder project reference.+?)?(Total award.+?)?(Principal Investigator:.+?)?(Project:.+?)?/)

[Screenshot: regex4]

Having matched the groups, how do we split the relevant items into new columns? One way is to introduce a column separator character sequence (such as ::) that we can split on:

forEach(value.match(/((?!(?:Title)).*?)?(Title.+?)?(School\/Department.+?)?(Period.+?)?(Number of participants:.+?)?(Awarded date.+?)?(Budget Account Ref:.+?)?(Funding organisation.+?)?(Funder project reference.+?)?(Total award.+?)?(Principal Investigator:.+?)?(Project:.+?)?/),v,if(v == null," ",v)).join('::')

[Screenshot: regex5a]

This generates rows of the form:

[Screenshot: regex6]

We can now split these cells into several columns:

[Screenshot: regex7]

We use the :: sequence as the separator:

[Screenshot: regex8]

Once split, the columns should be regularly arranged. "Rogue" items should appear in the first new column – any values appearing in that column might be used to help us identify any further tweaks required to our regular expression.

[Screenshot: regex9]

We now need to do a little more cleaning. For example, tidying up column names:

[Screenshot: regex10]

We then need to clean down each new column to remove the column heading.

[Screenshot: regex11]

As a general pattern, use the column name and an optional colon (NOTE: expression should be :? rather than :+):

[Screenshot: regex12]

To reuse this pattern of operations on future datasets, we can export a description of the transformations applied. Future datasets can then be loaded into OpenRefine, the operation history pasted in, and the same steps applied. (The following screenshot does not show the operations defined for renaming the new columns or cleaning them down.)

[Screenshot: regex13]

As ever, writing up this post took as long as working out the recipe…

PS Hmmm, I wonder… One way of generalising this further might be to try to match the columns in any order…? Not sure my regexp foo is up to that just at the moment. Any offers?!;-)
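For what it's worth, one possible line of attack on the any-order version, sketched in Python rather than GREL: find the positions of each known field label and slice the cell value up between them. The field labels are taken from the list above, the handling of the unlabelled Title is a guess, and it will trip up if a value itself happens to contain one of the labels – so very much a starting point rather than an answer…

import re

FIELDS = ["School/Department", "Period", "Number of participants", "Awarded date",
          "Budget Account Ref", "Funding organisation", "Funder project reference",
          "Total award", "Reference code", "Principal Investigator", "Project"]

# match any of the known labels, with an optional trailing colon
label_re = re.compile("|".join(re.escape(f) + ":?" for f in FIELDS))

def split_record(cell):
    """Split a collapsed cell into a {field: value} dict, whatever order the
    labelled fields appear in; anything before the first recognised label is
    treated as the Title."""
    hits = list(label_re.finditer(cell))
    record = {"Title": cell[:hits[0].start()] if hits else cell}
    for i, hit in enumerate(hits):
        end = hits[i + 1].start() if i + 1 < len(hits) else len(cell)
        record[hit.group().rstrip(":")] = cell[hit.end():end].strip()
    return record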

Charts are for Reading…

If charts are pictures, and every picture not only tells a story, but also saves a thousand words in doing so, how then are we to actually read them?

Take the following example, a quick #f1datajunkie sketch showing how the Bahrain 2015 qualifying session progressed. The chart is split into three, one for each part of qualifying (which we might refer to as fractional sessions), which already starts to set the scene for the story. The horizontal x-axis is the time in seconds into qualifying at which each laptime is recorded, indexed against the first laptime recorded in qualifying overall. The vertical y-axis records laptimes in seconds, limited to 107% of the fastest laptime recorded in a particular session. The green colour denotes a driver's fastest laptime recorded in each fractional session, purple the overall fastest laptime recorded so far in a fractional session (purple trumps green). So again, the chart is starting to paint a picture.

[Chart: Bahrain 2015 qualifying session laptimes]

An example of the sort of analysis that can be provided for a qualifying session can be found in a post by Justin Hynes, Lewis Hamilton seals his first Bahrain pole but Vettel poses the menace to Mercedes' hopes, that appeared on the James Allen on F1 blog. In this post, I'll try to match elements of that analysis with things we can directly see in the chart above…

[Hamilton] finish[ed] 0.411s clear of Ferrari’s Sebastian Vettel and more than half a second in front of his Mercedes team-mate Nico Rosberg

We don’t get the time gap exactly from the chart, but looking to the rightmost panel (Q3), finding the lowest vertical marks for HAM, VET and ROS, and imagining a horizontal line across to the y-axis, we get a feeling for the relative gaps.

Q1 got underway in slightly calmer conditions than blustery FP3 and Raikkonen was the first to take to the track, with Bottas joining the fray soon after. The Williams driver quickly took P1 but was then eclipsed by Rosberg, who set a time of 1:35.657 on the medium tyres.

Q1 is the leftmost panel, in which we see RAI setting the first representative laptime at least (within the 107% limit of the session best overall), followed by BOT and then ROS improving on the early purple times.

The Mercedes man was soon joined in the top five by soft-tyre runners Nico Hulkenberg and Felipe Nasr.

HUL and NAS appear around the 300 cuml (cumulative laptime) mark. We note that PER is there in the mix too, but is not mentioned explicitly in the report.

In the closing stages of the session those in the danger zone were Max Verstappen, Pastor Maldonado and Will Stevens and Roberto Merhi.

On the right hand side of the chart, we see laps at the end of the session from MAL and VES (and, way off the pace, STE). One problem with the chart as styled above (showing cumulative best times in the session) is that it makes it hard to see what a driver's best session time overall actually is. (We could address this by perhaps displaying a driver's session best time using a bold font.) The chart is also very cluttered around the cutoff time, which makes it hard to see clearly who got through and who didn't. And we don't really know where the danger zone is because we have no clear indication of what the best 15 drivers' times are – and hence, where the evolving cut-off time is…

Verstappen found the required pace and scraped into Q2 with a time of 1:35.611. Maldonado, however, failed to make it through, his best lap of 1:35.677 only being good enough for P16.

Verstappen’s leap to safety also pushed out Daniil Kvyat, with the Russian putting in a disappointing final lap that netted him P17 behind the Lotus driver. Hulkenberg was the last man through to Q2, the Force India driver’s 1:35.653 seeing him safely through with just two hundredths of a second in hand over Maldonado…

With an evolution of the cutoff time, and a zoom around the final cutoff time, we should be able to see what went on rather more clearly.

At the top of the order, Hamilton was quickest, finishing a tenth in front of Bottas. Rosberg was third, though he finished the session close on half a second down on his team-mate.

Felipe Massa was fourth for Williams, ahead of Raikkonen, Red Bull’s Daniel Ricciardo and Sebastian Vettel, who completed just three laps in the opening session. All drivers set their best times on the soft tyre.

This information can be quite clearly seen on the chart – aside from the tyre data which is not made available by the FIA.

The following description of Q2 provides quite a straightforward reading of the second panel of the chart.

In the second session, Rosberg initially set the pace but Hamilton quickly worked his way back to the top of the order, his first run netting a time of 1:32.669. Rosberg was also again eclipsed by Massa who set a time three tenths of a second quicker than Rosberg’s.

The last to set an opening time were the Ferraris of Raikkonen and Vettel, though both rapidly staked a claim on a Q3 berth with the Finn in P2 and the German in P4.

Most of the front runners opted to rely on their first run to see them through and in the closing stages those in the drop zone were Hulkenberg, Force India team-mate Sergio Perez, Nasr, Sauber team-mate Ericsson and McLaren’s Fernando Alonso.

However, the chart does not clearly show how ROS' early purple time was challenged by BOT, or how MAS' early pace time was challenged mid-way through the session by VET and RAI.

Hulkenberg was the man to make the big move, claiming ninth place in Q2 with a time of 1:34.613. Behind him Toro Rosso’s Carlos Sainz scraped through in P10, six hundredths of a second clear of 11th-placed Sergio Perez. The Mexican was followed by Nasr and Ericsson. Alonso claimed P14, while 15th place went to the unfortunate Verstappen, who early in the session had reported that he was down on power.

Again, this reading of the chart would be aided by an evolving cut-off time line.

Looking now to the third panel…

The first runs in Q3 saw Hamilton in charge again, with the champion setting a time of 1:33.552 on used softs to take P1 three tenths of a second ahead of Red Bull’s Ricciardo, who prior to Hamilton’s lap had claimed the fastest S3 time of the session using new soft tyres.

Rosberg, also on used softs, was third, four thousandths of a second down on the Australian’s time. Hulkenberg, with just one new set of softs at his disposal, opted to sit out the first run.

The chart clearly shows the early and late session runs, a pattern that is reflected in the analysis:

In the final runs, Vettel was the first of the likely front-row men across the line and with purple times in S1 and S2, the German set a provisional pole time of 1:32.982. It was a superb lap but Hamilton was already running faster, stealing the S1 purple time from the German.

Ahead of the champion on track, Rosberg had similarly taken the best S2 time but he could not find more pace and when he crossed the line he slotted into third, four hundredths [??] of a second behind Vettel.

So what does Justin Hynes' qualifying session commentary tell us about how we might be able to read the charted summary of the session? And how can we improve the chart to help draw out some of the stories? A couple of things jump out for me – firstly, the evolving purple and green times can be confusing, and are perhaps better replaced (for a summary reading of the session) by best-in-session purple/green times; secondly, the evolution of the cut-off times would help to work out where drivers were placed at different stages of qualifying and what they still had to do – or whether a best-time-so-far recorded by a driver earlier in the session was bumped by the cutoff evolution. Note that the purple time evolution is identified implicitly by the lower envelope of the laptimes in each session.

Creating Interactive Election Maps Using folium and IPython Notebooks

During the last couple of weeks of Cabinet Office Code Clubs, we’ve started to explore how we can use the python folium library to generate maps. Last week we looked at getting simple markers onto maps along with how to pull data down from a third party API (the Food Standards Agency hygiene ratings), and this week we demonstrated how to use shapefiles.
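For reference, the marker step from last week boils down to something like the following minimal folium sketch (the coordinates and popup text are made up, and the method names are those in recent folium releases, so older versions may differ slightly):

import folium

m = folium.Map(location=[53.4808, -2.2426], zoom_start=13)  # centred (roughly) on Manchester
folium.Marker([53.4794, -2.2453], popup='Example food outlet').add_to(m)
m.save('markers.html')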

As a base dataset, I used Chris Hanretty et al.'s election forecasts data as a foil for making use of Westminster parliamentary constituency shapefiles. The dataset gives a forecast of the likelihood of each party winning a particular seat, so within a party we can essentially generate a heat map of how likely that party is to win each seat. So, for example, here's a forecast map for the Labour party:

[Map: forecast likelihood of a Labour win, by constituency]
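For anyone trying to reproduce that sort of shading, recent folium releases reduce it to something like the following sketch – the boundary file, column names and GeoJSON property name are all placeholders (the notebook linked below shows the actual recipe used in the session):

import folium
import pandas as pd

forecasts = pd.read_csv('electionforecast.csv')      # hypothetical: one row per seat, one column per party

m = folium.Map(location=[54.5, -2.5], zoom_start=6)  # roughly centred on Great Britain
folium.Choropleth(
    geo_data='constituencies.geojson',                # boundaries derived from the constituency shapefiles
    data=forecasts,
    columns=['ConstituencyID', 'LAB'],                # seat identifier and Labour win likelihood
    key_on='feature.properties.ConstituencyID',       # matching property in the GeoJSON
    fill_color='YlOrRd',
).add_to(m)
m.save('labour_forecast.html')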

Although the election data table doesn't explicitly say which party has the highest likelihood of winning each seat, we can derive that from the data with a little bit of code: melt the original dataset into a form where each row represents a constituency and party combination (rather than a single row per constituency, with columns for each party's forecast), then group by constituency, sort by forecast value and pick the first (highest) value. (Ties will be ignored…)

[Screenshot: reshaping the election forecast data]
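In pandas terms, the reshaping amounts to something like this (the file and column names are guesses rather than the ones in the actual forecast dataset):

import pandas as pd

df = pd.read_csv('electionforecast.csv')  # hypothetical: one row per constituency, one forecast column per party

# melt to one row per (constituency, party) pair
long = pd.melt(df, id_vars=['Constituency'], var_name='Party', value_name='Forecast')

# for each constituency, keep the party with the highest forecast likelihood
likely = (long.sort_values('Forecast', ascending=False)
              .groupby('Constituency', as_index=False)
              .first())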

We can then generate a map based on the discrete categorical values of which party has the highest forecast likelihood of taking each seat.

[Map: party with the highest forecast likelihood in each constituency]

An IPython notebook showing how to generate the maps can be found here: how to use shapefiles.

One problem with this sort of mapping technique for the election forecast data is that the areas we see coloured are representative of geographical area, not population size. Indeed, the population of each constituency is roughly similar, so our impression that the country is significantly blue is skewed by the relative areas of the forecast blue seats compared to the forecast red ones, for example.

Ways round this are to use cartograms, or regularly sized hexagonal boundaries, as described on Benjamin Hennig's Views of the World website, from which the following image is republished (see also the University of Sheffield's (old) Social and Spatial Inequalities Research Group election mapping project website):

[Image: UK 2010 election maps compared]

(A hexagonal constituency KML file, coloured by 2010 results, and corresponding to constituencies defined for that election, can be found from this post.)