Wrangling Complex Spreadsheet Column Headers

[This isn’t an R post, per se, but I’m syndicating it via RBloggers because I’m interested – how do you work with hierarchical column indices in R? Do you try to reshape the data to something tidier on the way in? Can you autodetect elements to help with any reshaping?]

Not a little p****d off by the Conservative election pledge to extend the right-to-buy to housing association tenants (my response: so extend the right to private tenants too?), I thought I’d have a dig around to see what data might be available that could tell me something about the housing situation on the Isle of Wight, using a method that could also be used in other constituencies. That is, what datasets are provided at a national level, broken down to local level? (To start with, I wanted to see what I could learn without visiting the DCLG OpenDataCommunities site.)

One source of data seems to be the Local authority housing statistics data returns for 2013 to 2014, a multi-sheet spreadsheet reporting at a local authority level on:

– Dwelling Stock
– Local Authority Housing Disposals
– Allocations
– Lettings, Nominations and Mobility Schemes
– Vacants
– Condition of Dwelling Stock
– Stock Management
– Local authority Rents and Rent Arrears
– Affordable Housing Supply

[Figure: Local_Authority_Housing_Statistics_dataset_2013-14_xlsx]

Something I’ve been exploring lately is the idea of “external spreadsheet data source” wrappers for the pandas Python library that wrap frequently released spreadsheets with a simple (?!) interface that lets you pull the data from the spreadsheet into a pandas dataframe.

For example, I got started on the LA housing stats sheet as follows – first a look at the sheets, then a routine to grab sheet names out of the Menu sheet:

import pandas as pd
dfx=pd.ExcelFile('Local_Authority_Housing_Statistics_dataset_2013-14.xlsx')
dfx.sheet_names
#...
#Menu sheet parse to identify sheets A-I
import re
sd=re.compile(r'Section (\w) - (.*)$')
sheetDetails={}
for row in dfx.parse('Menu')[[1]].values:
    if str(row[0]).startswith('Section'):
        sheetDetails[sd.match(row[0]).group(1)]=sd.match(row[0]).group(2)
sheetDetails
#{u'A': u'Dwelling Stock',
# u'B': u'Local Authority Housing Disposals',
# u'C': u'Allocations',
# u'D': u'Lettings, Nominations and Mobility Schemes',
# u'E': u'Vacants',
# u'F': u'Condition of Dwelling Stock',
# u'G': u'Stock Management',
# u'H': u'Local authority Rents and Rent Arrears',
# u'I': u'Affordable Housing Supply'}

All the data sheets have similar columns on the left-hand side, which we can use as a crib to identify the single row of simple, coded column headings.

def dfgrabber(dfx,sheet):
    #First pass - identify row for headers
    df=dfx.parse(sheet,header=None)
    df=df.dropna(how='all')
    row = df[df.apply(lambda x: (x == "DCLG code").any(), axis=1)].index.tolist()[0] # index of the first row containing the "DCLG code" marker
    #Second pass - generate dataframe
    df=dfx.parse(sheet,header=row).dropna(how='all').dropna(how='all',axis=1)
    df=df[df['DCLG code'].notnull()].reset_index(drop=True)
    return df

#usage:
dfgrabber(dfx,'H')[:5]

That gives something like the following:

[Figure: Housing_Data]

This is completely usable if we know what the column codes refer to. What is handy is that we get a single row of column headings; the metadata that neatly describes what the codes mean, however, is not so tidily presented:

[Figure: excel_headers_complex]

Trying to generate a pandas hierarchical index from this data is a bit messy…

One approach I’ve explored is trying to create a lookup table from the coded column names back into the hierarchical column names.

For example, if we can detect the column multi-index rows, we can first fill down (propagating each label into the blank cells below it) and then fill across (for multi-column labels, the label sits in the leftmost cell of the span), so that every cell in the header grid carries the value that spans it.

#row is autodetected and contains the row for the simple header
row=7
#Get the header columns - and drop blank rows
xx=dfx.parse('A',header=None)[1:row].dropna(how='all')
xx

[Figure: excel_headerparse1]

#Fill down
xx.fillna(method='ffill', axis=0,inplace=True)
#Fill across
xx=xx.fillna(method='ffill', axis=1)
xx

[Figure: excel_headerparse2]

#append the coded header row
xx=xx.append(dfx.parse('A',header=None)[row:row+1])
xx

[Figure: excel_headerparse3]

#Now make use of pandas' ability to read in a multi-index CSV
xx.to_csv('multi_index.csv',header=False, index=False)
mxx=pd.read_csv('multi_index.csv',header=[0,1,2])
mxx

[Figure: excel_headerparse4]

Note that a pandas column multi-index can represent a label that spans several columns (the label is simply repeated across them), but not one that spans several “vertical” header levels.
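By way of illustration, here’s a minimal sketch (using made-up labels rather than the actual DCLG headings) of how that horizontal spanning shows up once the header rows have been turned into a column multi-index – the top-level label just repeats across the columns it spans:

import pandas as pd

#Made-up labels, standing in for the real hierarchical DCLG headings
cols = pd.MultiIndex.from_tuples([('Stock', 'Owned by LA', '1a'),
                                  ('Stock', 'Managed by ALMO', '1b'),
                                  ('Rents', 'Average weekly rent', '2a')])
demo = pd.DataFrame([[100, 200, 74.5]], columns=cols)

demo.columns.get_level_values(0).tolist()
#['Stock', 'Stock', 'Rents'] - the spanning label is simply repeated
#demo['Stock'] then pulls back both of the spanned columns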

Get rid of the columns that don’t feature in the multi-index:

for c in mxx.columns.get_level_values(0).tolist():
    if c.startswith('Unnamed'):
        mxx = mxx.drop(c, level=0, axis=1)
mxx

Now start to work on the lookup…

#Get a dict from the multi-index dataframe: column label tuple -> {row index -> value}
dd=mxx.to_dict()
dd

[Figure: excel_headerparse5]

We can then use this as a basis for generating a lookup table for the column codes.

keyx={}
for r in dd:
    #r is the tuple of hierarchical column labels; dd[r][0] is the coded heading
    keyx[dd[r][0].split(' ')[0]]=r
keyx

[Figure: excel_headerparse6]

We could also generate more elaborate dicts to provide ways of identifying particular codes.
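For example, here’s a minimal sketch (building on the keyx lookup above; the heading labels are whatever the sheet actually contains) of a dict that groups the codes by the top-level heading they sit under, so all the columns from one part of a sheet can be pulled out together:

from collections import defaultdict

#Group the column codes by the top level heading they sit under
#(keyx maps a code to the tuple of hierarchical labels, as generated above)
codesByHeading = defaultdict(list)
for code, labels in keyx.items():
    codesByHeading[labels[0]].append(code)

#A group of codes could then be used to slice a coded dataframe, along the lines of:
#dfgrabber(dfx,'A')[codesByHeading[someHeading]]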

Note that the key building required a little bit of tidying, arising from footnote numbers that appear in some of the coded column headings:

[Figure: excel header footnote]

This tidying should also be applied to the code column generation step above…
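Here’s a minimal sketch of the sort of tidying I mean; the exact regular expression depends on how the footnote markers actually appear in the sheet (I’m assuming a footnote number tacked on to the end of an otherwise letter-terminated code), so treat it as illustrative:

import re

def tidyCode(rawHeading):
    #Take the leading code from the heading, then strip any footnote digits
    #tacked on to the end of a letter-terminated code (e.g. "4b3" -> "4b")
    code = str(rawHeading).split(' ')[0]
    return re.sub(r'(?<=[a-z])\d+$', '', code)

#The same function can then be used both when building the keyx lookup...
#  keyx[tidyCode(dd[r][0])]=r
#...and to tidy the coded column names in the dataframe returned by dfgrabber()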

I’m thinking there really should be an easier way?

PS and then, of course, there are the additional gotchas… like UTF-8 pound signs that break ascii encodings…

[Figure: non-ascii]
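One pragmatic workaround (a sketch, assuming the offending characters are UTF-8 encoded in the original sheet) is to be explicit about the encoding when round-tripping the header rows through the CSV file:

#Be explicit about the encoding when writing and reading the multi-index header rows
xx.to_csv('multi_index.csv', header=False, index=False, encoding='utf-8')
mxx = pd.read_csv('multi_index.csv', header=[0,1,2], encoding='utf-8')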

PPS Handy… informal guidance on Releasing data or statistics in spreadsheets, which acts as a counterpoint to GSS’ Releasing statistics in spreadsheets: Good practice guidance.

A Couple of Handy ggplot Tricks – Using Environmental Variables and Saving Charts

A couple of handy tricks when working with ggplot that had escaped my radar until today.

First up, I had a problem in a function I was using to generate a ggplot2 chart, in which I wanted to accept a couple of optional arguments into the function and then make use of them in a ggplot aes() element.

Normally, ggplot will try to dereference a variable as a column in the current ggplot data context, rather than as a variable in its own right. So what can we do? Hack around with aes_string()? In actual fact, the following trick, identified via Stack Overflow, did exactly what I needed – make a particular environment context available in ggplot directly:

core_qualifying_rank_slopegraph= function(qualiResults,qm,
                                          spacer=0.25,cutoff=c(16,10)){  
  #http://stackoverflow.com/questions/10659133/local-variables-within-aes
  .e = environment()
  # if we pass this context into ggplot, we can access both spacer and cutoff
  g=ggplot(qualiResults,aes(x=session,y=laptime), environment = .e)
  g= g+geom_text(data=qm[qm['session']=='q1time',],
                 aes(x=1,y=qspos,label=driverName,
                     colour=(qspos>cutoff[1] )
                 ), size=3)
  g= g+geom_text(data=qm[qm['session']=='q2time',],
                 aes(x=2,y=qspos,label=driverName,
                     colour=(qspos>cutoff[2] )
                 ), size=3)
  ...
  g=g+geom_segment(data=qualiResults[!is.na(qualiResults['q2time']),],
                   x=1+spacer,xend=2-spacer,
                   aes(y=q1pos,yend=q2pos,group=driverName),
                   colour='slategrey')
  ...
  g
}

By the by, the complete version of the fragment above generates a chart like the following, heavily influenced by Tufte-style slopegraphs, which shows progression through the F1 qualifying session in China this weekend:

[Figure: f1_chn_2015_qualiprogression]

Note that I use a discrete rank for the y-axis rather than a continuous laptime scale, which would have been more in keeping with the original slopegraph idea. (The f1datajunkie.com post on F1 China 2015 – How They Qualified explores another chart type where continuous laptime scales are used, with a two column layout reminiscent of an F1 grid as a trick to try to minimise overlap of driver name labels, along with a 1-dimensional layout that shows all the qualifying session classification laptimes.)

The second useful trick I learned today was a recipe for saving chart objects. (With sales of the Wrangling F1 Data With R book (which provides the context for my learning these tricks) all but stalled, I need a new dream to live for that gives me hope of making enough from F1 related posts to cover the costs of seeing a race for real one weekend. So I’ve started wondering whether I could sell or license particular charts one day (if I can produce them quickly enough), either as PDFs or, perhaps, as actual chart objects, rather than having to give away all the code and wrangled data, for example…)

So in that respect, the ability to save ggplot chart objects and then share them in a way that others can use them (if they have a workflow that can accommodate R/grid grobs) could be quite attractive… and this particular trick seems to do the job…

g=ggplot(..etc..) ...
#Get the grob...
g_out = ggplotGrob(g)
#Save the grob
save(g_out,file='ggrobtest')

#Look - nothing up my sleeves
rm(g_out)
rm(g)
#> g
#Error: object 'g' not found
#> g_out
#Error: object 'g_out' not found

load("ggrobtest")

#The g_out grob is reinstated and can be plotted as follows:
library(grid)
grid.draw(g_out) 

Handy…:-)

Mixing Numbers and Symbols in Time Series Charts

One of the things I’ve been trying to explore with my #f1datajunkie projects is ways of representing information that work in a glanceable way whilst also repaying deeper reading. I’ve also been looking at various ways of using text labels rather than markers to provide additional information around particular data points.

For example, in a race battlemap, with lap number on the horizontal x-axis and gap time on the vertical y-axis, I use a text label to indicate which driver is ahead (or behind) a particular target driver.

[Figure: battlemaps-postionbattles-1]

In the revised version of this chart type shown in F1 Malaysia, 2015 – Rosberg’s View of the Race, an additional numerical label along the x-axis indicates the race position of the target driver at the end of each lap.

What these charts are intended to do is help the eye see particular structural shapes within the data – for example whether a particular driver is being attacked from behind in the example of a battlemap, or whether they are catching the car ahead (perhaps with intervening cars in the way – although more needs to be done on the chart with respect to this for examples where there are several intervening cars; currently, only a single intervening car immediately ahead on track is shown.)

Two closer readings of the chart are then possible. Firstly, by looking at the y-value we can see the actual time a car is ahead (and here the dashed guide line at +/- 1s helps indicate in a glanceable way the DRS activation line; I’m also pondering how to show an indication of pit loss time to indicate what effect a pit stop might have on the current situation). Secondly, we can read off the labels of the drivers involved in a battle to get a more detailed picture of the race situation.

The latest type of chart I’ve been looking at is the session utilisation map, which in its simplest form looks something like the following:

[Figure: simple_session_utilisation]

The charts show how each driver made use of a practice session or qualifying – drivers are listed on the vertical y-axis and the time into the session each lap was recorded at is identified along the horizontal x-axis.

This chart makes it easy to see how many stints, and of what length, were completed by each driver and at what point in the session. Other information might be inferred – for example, significant gaps in which no cars are recording times may indicate poor weather conditions or red flags. However, no information is provided about the times recorded for each lap.

We can, however, use colour to identify “purple” laps (fastest lap time recorded so far in the session) and “green” laps (a driver’s fastest laptime so far in the session that isn’t a purple time), as well as laps on which a driver pitted:

[Figure: augmented_session_utilisation]

But still, no meaningful lap times.

One thing to note about laptimes is that they come in various flavours, such as outlaps, when a driver starts the lap from the pitlane; inlaps, or laps on which a driver comes into the pits at the end of the lap; and flying laps when a driver is properly going for it. There are also those laps on which a driver may be trying out various new lines, slowing down to give themselves space for a flying lap, and so on.

Assuming that inlaps and outlaps are not the best indicators of pace, we can use a blend of symbols and text labels on the chart to identify inlaps and outlaps, as well as showing laptimes for “racing” laps, also using colour to highlight purple and green laps:

[Figure: session_utlisation_annotated]

The chart is produced using ggplot, with a layered approach in which the different chart elements are added in separate layers.

#The base chart with the dataset used to create the original chart
#In this case, the dataset included here is redundant
g = ggplot(f12015test)

#Layer showing in-laps (laps on which a driver pitted) and out-laps
#Use a subset of the dataset to place markers for outlaps and inlaps
g = g + geom_point(data=f12015test[f12015test['outlap'] | f12015test['pit'],],aes(x=cuml, y=name, color=factor(colourx)), pch=1)

#Further annotation to explicitly identify pit laps (in-laps)
g = g + geom_point(data=f12015test[f12015test['pit']==TRUE,],aes(x=cuml, y=name),pch='.')

#Layer showing full laps with rounded laptimes and green/purple lap highlights
#In this case, use the laptime value as a text label, rather than a symbol marker
g = g + geom_text(data=f12015test[!f12015test['outlap'] & !f12015test['pit'],],aes(x=cuml, y=name, label=round(stime,1), color=factor(colourx)), size=2, angle=45)

#Force the colour scale to be one we want
g = g + scale_colour_manual(values=c('darkgrey','darkgreen','purple'))

This version of the chart has the advantage of being glanceable when it comes to identifying session utilisation (number, duration and timing of stints) and when purple and green laptimes were recorded, as well as repaying closer reading when it comes to inspecting the actual laptimes recorded during each stint.

To reduce clutter on the chart, laptimes are rounded to one decimal place (tenths of a second) rather than using the full laptime, which is recorded down to thousandths of a second.

Session utilisation charts are described more fully in a recently released chapter of the Wrangling F1 Data With R Leanpub book. Buying a copy of the book gains you access to future updates of the book. A draft version of the chapter can be found here.

From Front Running Algorithms to Bot Fraud… Or How We’ve Lost Control of the Bits…

I’ve just finished reading Michael Lewis’ Flash Boys, a cracking read about algorithmic high frequency trading and how the code and communication systems that contribute to the way stock exchanges operate can be gamed by front-running bots. (For an earlier take, see also Scott Patterson’s Dark Pools; for more “official” takes, see things like the SEC’s regulatory ideas response to the flash crash of May 6, 2010, an SEC literature review on high frequency trading, or this Congressional Research Service report on High-Frequency Trading: Background, Concerns, and Regulatory Developments).

As the book describes, some of the strategies pursued by the HFT traders were made possible because of the way the code underlying the system was constructed. As Lessig pointed out way back in Code and Other Laws of Cyberspace, and revisited in Codev2:

There is regulation of behavior on the Internet and in cyberspace, but that regulation is imposed primarily through code. The differences in the regulations effected through code distinguish different parts of the Internet and cyberspace. In some places, life is fairly free; in other places, it is more controlled. And the difference between these spaces is simply a difference in the architectures of control — that is, a difference in code.

The regulation imposed on the interconnected markets by code was gameable. Indeed, it seems that it could be argued that it was even designed to be gameable…

Another area in which the bots are gaming code structures is digital advertising. A highly amusing situation is described in the following graphic, taken from The Bot Baseline: Fraud in Digital Advertising (via http://www.ana.net/content/show/id/botfraud):

[Figure: ANA-White_Ops_-_The_Bot_Baseline_-_Fraud_in_Digital_Advertising_pdf]

A phantom layer of “ad laundering” fake websites whose traffic comes largely from bots is used to generate ad-impression revenue. (Compare this with networks of bots on social media networks that connect to each other, send each other messages, and so on, to build up “authentic” profiles of themselves, at least in terms of traffic usage dynamics. Examples: MIT Technology Review on Fake Persuaders; or this preprint on The Rise of Social Bots.)

As the world becomes more connected and more and more markets become exercises simply in bit exchange between bots, I suspect we’ll be seeing more and more of these phantom layer/bot audience combinations on the one hand, and high-speed, market stealing, front running algorithms on the other.

PS Not quite related, but anyway: how you’re being auctioned in realtime whenever you visit a website that carries ads – The Curse of Our Time – Tracking, Tracking Everywhere.

PPS Interesting example of bots reading the business wires and trading on the back of them: The Wolf of Wall Tweet: A Web-reading bot made millions on the options market.

A Quick Look Around the Open Data Landscape

I haven’t done a round up of open data news for a bit, so here’s a quick skim through some of my current open browser tabs on the subject.

First up, the rather concerning news that DCLG [are] to withhold funding from district over transparency “failings”. As well as picking up on a failure to publish information about a termination of employment pay settlement, a letter to Rother District Council noted that it “appears from your website that your council has not published data in a number of important areas, for example, contracts over £5,000, land and assets, senior salaries, an organisation chart, trade union facility time, parking revenues, grants to the voluntary sector and the like”, in contravention of the Local Government Transparency Code 2014.

From running several open data training days for representatives of local councils and government departments on behalf of Open Knowledge recently, I know from looking at council websites that finding “transparency” data on official websites is not always that simple. (The “transparency” keyword sometimes(?!) works, “open data” frequently doesn’t…; often, using a web search engine with a site: search limit on the council website works better than the local search.) I’m still largely at a loss as to what can usefully be done with things like spending data, though I do have a couple of ideas about how we might feed some of that data into some small investigations…

However, I’m not convinced that punishing councils by withholding funding is the best approach to promoting open data publication. On the other hand, promoting effective data workflows that naturally support open data publishing (and hopefully, from that, “an increase in reuse within the organisation as data availability/awareness/accessibility improves”) and encouraging effective transparency through explaining how decisions were made in the context of available data (whilst at the same time making data available so that the data basis of those decisions can be checked) would both seem to be more useful approaches?

The funding being withheld from Rother Council seems to be the new burdens funding, which presumably helps cover the costs of publishing transparency data. Something that’s bugged me over the years (eg Some Sketchnotes on a Few of My Concerns About #opendata) is how privatisation of contracts is associated with several asymmetries in the public vs private provision of services. On the one hand, public bodies have transparency and freedom of information “burdens” placed on them, which means: 1) they take a financial hit, needing to cover the costs of meeting those burdens; 2) they are accountable, in that the public has access to certain sorts of information about their activities. Private contractors are not subject to the same terms, so not only can they competitively bid less than public bodies for service delivery, they also get to avoid the same public disclosure requirements about their activities and are potentially less accountable than their public body counterparts, who remain overall accountable as commissioners of the services but presumably have to cover the costs of that accountability, as well as the administrative overheads of managing the private contracts.

Now it seems that the ICO is call[ing] for greater transparency around government outsourcing, publishing a report and a roadmap on the subject that recommend a “transparency by design” approach in which councils should:

– Make arrangements to publish as much information as possible, including the contract and regular performance information, in open formats with a licence permitting re-use.

– When drawing up the contract, think about any types of information that the contractor will hold on their behalf eg information that a public authority would reasonably need to see to monitor performance. Describe this in an annex to the contract. This is itself potentially in scope of a FOIA request.

– Set out in the contract the responsibilities of both parties when dealing with FOIA requests. Look at standard contract terms (eg the Model Services Contract) for guidance.

Around about the same time, a Cabinet Office policy document on Transparency of suppliers and government to the public also appeared. Whilst on the one hand “Strategic Suppliers to Government will supply data on a contract basis that is then aggregated to the departmental level and aggregated again to the Government level” means it should be easier to see how much particular companies receive from government (assuming they don’t operate as BigCo Red Ltd, BigCo Blue Ltd, BigCo Purple Ltd, using a different company for each contract), on the other hand it presumably also means that data can be aggregated to the point of meaninglessness.

(Just by the by, I tried looking through various NHS Commissioning Group and NHS Trust spending datasets looking to see how much was going to Virgin Care and other private providers. Whilst I could see from news reports and corporate websites that those operators were providing services in particular areas, I couldn’t find any spend items associated with them. Presumably I was looking in the wrong place… but if so, it suggests that even if you do have a question about spend in a particular context with a particular provider, it doesn’t necessarily follow that even if you think you know how to drive this open transparency data stuff, you’ll get anywhere…)

When looking at affordability of contracts, and retaining private vs public contractors, it would seem only fair that any additional costs associated with the contracting body having to meet transparency requirements “on behalf of” the private body should be considered part of the cost of the contract. If private bodies complain that this gives an unfair advantage to public bodies competing for service provision, they can perhaps opt-in to FOI regulations and transparency codes and cover the costs of disclosure of information themselves to level the playing field that way?

Another by the by… in appointments land, Mike Bracken has been appointed the UK’s first Chief Data Officer (CDO), suggesting that we should talk about “data as a public asset”. In this regard, the National Information Infrastructure still appears to be a thing, for the moment at least. An implementation document was published in March that has some words in it (sic…!)…

As purdah approached, there was a sudden flurry of posts on the data.gov.uk blog. Four challenges for the future of Open Data identified the following as major issues:

  • Pushing Open Data where it is not fully embraced
  • Achieving genuine (Open) Data by default; (this actually seems to be more about encouraging open data workflows/open data (re)use – “a general move to adopt data practice into the way public services are run”)
  • Improving public confidence in Open Data
  • Improving (infra)structure around Open Data

The question is – how best to address them? I think that Open Knowledge has delivered all the open data training sessions it was due to deliver under the open data voucher scheme, which means my occasional encounters with folk tasked with open data delivery from councils and government departments may have come to an end via that route, which is a shame, because I felt we never really got a chance to start building on those introductory sessions…

To finish off the parliamentary session, the Cabinet Office also made a state of the nation style announcement identifying the Local authorities setting standards as Open Data Champions. A quick skim down the list seems to suggest that the champions are typically councils that have started their own datastore…

Data Mining: Separating Fact From Fiction – Presentation to NetIKX

A couple of weeks ago I gave a presentation to NetIKX, an information professionals’ knowledge exchange network.

Search consultant Karen Blakeman opened with a presentation showing how the search engines don’t always return what you’d expect, and I tweaked my presentation somewhat as Karen spoke to try to better complement it.

My slides are available on slideshare in a partially annotated form. I didn’t deliver the more technical data mining algorithm slides towards the end of the deck, and I haven’t yet annotated those slides either.

Iteratively Populating Templated Sentences With Inline R in knitr/Rmd

As part of the Wrangling F1 Data With R project, I want to be able to generate sentences iteratively from a templated base.

The following recipe works for sentences included in an external file:

[Figure: reusableCOdeinRmd]

What I’d really like to be able to do is put the Rmd template into a chunk something like this…:

```{rmd stintPara, eval=FALSE, echo=FALSE}
`r name` completed `r sum(abs(stints[stints['name']==name,]['l']))` laps over `r nrow(stints[stints['name']==name,])` stints, with a longest run of `r max(abs(stints[stints['name']==name,]['l']))` laps.
```

and then do something like:

```{r results='asis'}
stints['name']=factor(stints$name)
for (name in levels(stints$name)){
  cat(paste0(knit_child('stintPara',quiet=TRUE),'\n\n'))
}
```

Is anything like that possible?

PS via the knitr Google Group, h/t Jeff Newmiller for pointing me to the knit_child() text argument…

[Figure: rmd_template2]