Segmenting F1 Qualifying Session Laptimes

I’ve started scraping some FIA timing sheets again, including practice and qualifying session laptimes. One of the things I’d like to do is explore various ways of looking at the qualifying session laptimes, which means identifying which qualifying session each laptime falls into, using some sort of clustering algorithm… or other means…:


For looking at session utilisation charts I’ve been making use of accumulated time into session to help display the data, as the following session utilisation chart (including green and purple laptimes) shows:


The horizontal x-axis is time into session from a basetime of the first time-of-day timestamp recorded on the timing sheets for the session.

If we look at the distribution of qualifying session laptimes for the 2015 Malaysian Grand Prix, we get something like this:


We can see a big rain delay gap, and also a tighter gap between the first and second sessions.

If we try to run a k-means clustering algorithm on the data, using 3 means for the three sessions, we see that in this case it isn’t managing to cluster the laptimes into actual sessions:

# Attempt to identify qualifying session using k-means cluster analysis around 3 means
library(ggplot2)

clusters <- kmeans(f12015test['cuml'], 3)

# Add the cluster assignments back into the data frame
f12015test = data.frame(f12015test, clusters$cluster)

ggplot(f12015test) + geom_text(aes(x=cuml, y=stime, label=code,
    colour=factor(clusters.cluster)), angle=45, size=3)


In particular, some of the Q1 laptimes are being grouped with Q2 laptimes.

However, we know that there is at least a two minute gap between sessions (the regulations suggest seven minutes, though if that is the time between the lights going red and then green again, we might need to knock a couple of minutes off to account for drivers who start their last lap just before the lights go red on a session). So if we assume that the only two minute gaps between recorded laptimes during the whole of qualifying occur in the periods between the qualifying sessions, we can generate a flag on those gaps, and then generate session numbers by counting on those flags.

#Look for a two minute gap and flag the lap that follows it
f12015test['gapflag'] = (f12015test['gap'] >= 120)

#Count the flags to generate a session number for each lap
#(assumes the gap recorded for the first lap is less than 120s)
f12015test['qsession'] = 1 + cumsum(f12015test[['gapflag']])

ggplot(f12015test) + geom_text(aes(x=cuml, y=stime, label=code),
    angle=45, size=3) + facet_wrap(~qsession, scale="free")
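For anyone working in Python rather than R, the same gap-flagging trick can be sketched with plain lists (the laptime values here are made up; the 120 second threshold is as above):

```python
# Segment qualifying laptimes into sessions by looking for gaps of
# at least two minutes between successively recorded laps.
def session_numbers(cuml_times, gap=120):
    """Return a session number (1, 2, 3...) for each lap, incrementing
    whenever the gap to the previously recorded lap is >= `gap` seconds."""
    sessions = []
    session = 1
    prev = None
    for t in cuml_times:
        if prev is not None and t - prev >= gap:
            session += 1
        sessions.append(session)
        prev = t
    return sessions

# Hypothetical cumulative times into session, with two big gaps
laps = [100, 160, 220, 900, 960, 1800, 1860]
print(session_numbers(laps))  # → [1, 1, 1, 2, 2, 3, 3]
```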


(To tighten this up, we might also try to factor in the number of cars in the pits at any particular point in time…)

This chart clearly shows how the first qualifying session saw cars running evenly throughout the session, whereas in Q2 and Q3 they were far more bunched up (which perhaps creates more opportunities for cars to get in each other’s way on flying laps…)

One of the issues with this chart is that we don’t get to zoom in to actual flying laps. If all the flying lap times were about the same time, we could simply generate y-axis limits based on purple laptimes:


#Find the fastest (purple) laptime, then use values
#derived from it in ylim()...

However, where the laptimes differ significantly across sessions, as they do in this case due to a dramatic change in weather conditions, we probably need to filter the data for each session separately.
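One way to sketch that per-session limit setting (the laptimes are made up, and the margin above the purple lap is an arbitrary choice):

```python
# For each session, derive y-axis limits from the fastest (purple)
# laptime: from just under the purple time up to, say, 110% of it,
# so only plausible flying laps stay in view.
def session_ylim(laptimes, margin=1.10):
    """Return (ymin, ymax) plotting limits for one session's laptimes."""
    purple = min(laptimes)
    return (purple * 0.99, purple * margin)

q2 = [102.5, 101.8, 145.0, 101.2]  # includes a slow in/out lap
lo, hi = session_ylim(q2)
print(round(lo, 2), round(hi, 2))  # → 100.19 111.32
```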

Another crib we might use is to identify PIT laps and out-laps (laps immediately following a PIT event) and filter those out of the laptime traces.
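A sketch of that filter, assuming each lap record carries a boolean pit flag (the data structure here is invented for illustration):

```python
# Drop PIT laps and out-laps (the lap immediately after a pit stop)
# from a driver's laptime trace. Each lap is a (laptime, pitted) pair.
def flying_laps(laps):
    keep = []
    prev_pit = False
    for laptime, pitted in laps:
        # Keep a lap only if it neither ends in the pits nor follows
        # a pit stop (an out-lap)
        if not pitted and not prev_pit:
            keep.append(laptime)
        prev_pit = pitted
    return keep

laps = [(101.2, False), (130.5, True), (115.0, False), (100.8, False)]
print(flying_laps(laps))  # → [101.2, 100.8]
```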

Versions of these recipes will soon be added to the Wrangling F1 Data With R book. Once you buy into the book, you get all future updates to it for no additional cost, even if the minimum book price increases over time.

Data Journalism in Practice

For the last few years, I’ve been skulking round the edges of the whole “data journalism” thing, pondering it, dabbling with related tools, technologies and ideas, but never really trying to find out what the actual practice might be. After a couple of twitter chats and a phone conversation with Mark Woodward (Johnston Press), one of the participants at the BBC College of Journalism data training day held earlier this year, I spent a couple of days last week in the Harrogate Advertiser newsroom, pitching questions to investigations reporter and resident data journalist Ruby Kitchen, and listening in on the development of an investigations feature into food inspection ratings in and around the Harrogate area.

Here’s a quick debrief-to-self of some of the things that came to mind…

There’s not a lot of time available and there’s still “traditional” work to be done
One of Ruby’s takes on the story was to name low ranking locations, and contact each one that was going to be named to give them a right to response. Contacting a couple of dozen locations takes time and diplomacy (which even then seemed to provoke a variety of responses!), as does then writing those responses into the story in a fair and consistent way.

Even simple facts can take the lead in a small story
…for example, x% of schools attained the level 5 rating, something that can then also be contextualised and qualified by comparing it to other categories of establishment, or to national, regional or neighbouring locale averages. As a data junkie, I find it easy to count things by group, perhaps overlooking the journalistic take that many of these counts could be used as the basis of a quick filler story, or a space-filling, info-snack, glanceable breakout box in a larger story.
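Those group counts are trivial to generate; a sketch using entirely made-up ratings data:

```python
# Quick "filler fact" style counts: what percentage of each category
# of establishment got the top hygiene rating? (Invented sample data.)
from collections import Counter

ratings = [
    ("school", 5), ("school", 5), ("school", 4),
    ("takeaway", 2), ("takeaway", 5), ("restaurant", 5),
]

totals = Counter(cat for cat, _ in ratings)           # counts per category
top = Counter(cat for cat, r in ratings if r == 5)    # top ratings per category

for cat in sorted(totals):
    pct = 100 * top[cat] / totals[cat]
    print(f"{pct:.0f}% of {cat}s rated 5")
```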

Is the story tellable?
Looking at data, you can find all sorts of things that are perhaps interesting in their subtlety or detail, but if you can’t communicate a headline or what’s interesting in a few words, it maybe doesn’t fit… (Which is not to say that data reporting needs to be dumbed down or simplistic…) Related to this is the “so what?” question… I guess for news, if you wouldn’t share it in the pub or over dinner having read it – that is, if you wouldn’t remark on it – you’d have to ask: is it really that interesting? (Hmm… is “Liking” something the same as remarking on it? I get the feeling it’s less engaged…)

There’s a huge difference between the tinkering I do and production warez

I have all manner of pseudo-workflows that allow me to generate quick sketches in an exploratory data analysis sort of way, but things that work for the individual “researcher” are not the same as things that can work in a production environment. For example, I knocked up a quick interactive map using the folium library in an IPython notebook, but there are several problems with this:

  1. to regenerate the map requires someone having an IPython notebook environment set up and appropriate libraries installed
  2. there is a certain “distance” between producing a map as a single HTML file and getting the map actually published. For example, the HTML page pulls in all manner of third party files (javascript, css, image tiles, marker-icon/css-sprite image files) and so on, which means working out whether (and if so, where) to host these various resources on a local production server so as not to inappropriately draw them down from third party servers.
  3. there isn’t much time available… so you need to think about what to offer. For example:
    • the map I produced was a simple one – just markers and popups. At the time, I hadn’t worked out how to colour the markers or add new icons to them (and I still don’t have a route for putting numbers into the markers…), so the look is quite simple (and clunky)
    • there is no faceted navigation – so you can’t for example search for particular sorts of establishment or locations with a particular rating.

    Given more time, it would have been possible to consider richer, faceted navigation, for example, but for a one off, what’s reasonable? If a publisher starts to use more and more maps, one possible workflow may be to iterate on previous precedents. (To an extent, I tried to do this with things I’ve posted on the OU OpenLearn site over the years. For example, first step was to get a marker displaying map embedded, which required certain technical things being put in place the first time but could then be reused for future maps. Next up was a map with user submitted marker locations – this represented an extension of the original solution, but again resulted in a new precedent that could be reused and in turn extended or iterated on again.)

    This suggests an interesting development process in which ever richer components can perhaps be developed iteratively over an extended period of time or set of either related or independent stories, as the components are used in more and more stories. Where a media group has different independent publications, other ways of iterating are possible…

    The whole tech angle also suggests that a great stumbling block to folk getting (interactive) data elements up on a story page is not just the discovery of the data, the processing and cleaning of it, and the generation of the initial sketch to check it could be something that could add to the telling of a story (each of which may require a particular set of skills), but also the whole raft of production related issues that then result, which could require a whole raft of other technical skills (skills I know that I don’t really have, even given my quirky skillset…). And if the corporate IT folk take ownership of the publication element, there is then a cascade back upstream of constraints relating to how the data needs to be treated so it can fit in with the IT production system workflow.

I tend to use ggplot a lot in R for exploring datasets graphically, rather than producing presentation graphics to support the telling of a particular story. Add to that the fact that I’m still not totally up to speed on charting in the Python context, and the result is that I didn’t really (think to) explore how graphical, chart based representations might be used to support the story. One thing that charts can do – like photographs – is take up arbitrary amounts of space, which can be a Good Thing (if you need to fill the space) or a Bad Thing (if space is at a premium, or page (print or web) layout is a constraint, perhaps due to page templating allowances, for example).

Some things I didn’t consider but that now come to mind are:

  1. how are charts practically handed over? (As Excel charts? as image files?)
  2. does a sub-editor or web-developer then process the charts somehow?
  3. for print, are there limitations on use of colour, line thickness, font-size and style?

Print vs Web
I didn’t really consider this, but in terms of workflow and output, are different styles of presentation required for:

  • text
  • data tables
  • charts
  • maps

Many code based workflows now allow you to “style” outputs in the same way you can style web pages (eg the CSS Zen Garden sites are all visually distinct but have exactly the same content – just the style is changed; thinks: data zen garden… hmmm… and related: chart redesigns…). For example, in the Python environment, ggplot or Seaborn style charts can be styled visually using themes to generate charts that can be saved as image files, or converted to interactive web charts (using eg mpld3, which converts base matplotlib charts – which ggplot and seaborn generate – to d3.js interactive charts); alternatively, libraries such as pandas-highcharts (or, in the R context, rCharts) let you generate interactive charts using well-developed javascript chart libraries.

If you want data tables, there are various libraries or tools for styling them, but again the question of workflow, and the actual form in which items are handed over for print or web publication, needs to be considered.

Being right/being wrong
Every cell in a data table is a “fact”. If your code is wrong and one column, or row, or cell is wrong, that can cause trouble. When you’re tinkering in private, that doesn’t matter so much – every cell can be used as the basis for another question that can be used to test, probe or check that fact further. If you publish that cell, and it’s wrong, you’ve made a false claim… Academics are cautious and don’t usually like to commit to anything without qualifying it further (sic;-). I trust most data, metadata and my own stats skills little enough that I see stats’n’data as a source that needs corroborating, which means showing it to someone else with my conclusions and a question along the lines of “it seems to me that this data suggests that – would you agree?”. This perhaps contrasts with relaying a fact (eg a particular food hygiene score) and taking it as-is as a trusted fact, given it was published by a trusted authoritative source, obtained directly from that source, and not processed locally, but then asking the manager of that establishment for a comment about how that score came about or what action they have taken as a result of getting it.

I’m also thinking it’d be interesting to compare the similarities and differences between journalists and academics in terms of their relative fears of being wrong…!

Human Factors
One of things I kept pondering – and have been pondering for months – is the extent to which templated analyses can be used to create local “press release” story packs around national datasets that can be customised for local or regional use. That’s a far more substantial topic for another day, but it was put into relief last week by my reading of Nick Carr’s The Glass Cage which got me thinking about the consequences of “robot” written stories… (More about that in a forthcoming post.)

Lots of skills issues, lots of process and workflow issues, lots of story discovery, story creation, story telling and story checking issues, lots of production constraints, lots of time constraints. Fascinating. Got me really excited again about the challenges of, and opportunities for, putting data to work in a news context…:-)

Thanks to all at the Harrogate Advertiser, in particular Ruby Kitchen for putting up with my questions and distractions, and Mark Woodward for setting it all up.

Software Apps As Independent, Free Running, Self-Contained Services

The buzz phrase for elements of this (I think?) is microservices or microservice architecture (“a particular way of designing software applications as suites of independently deployable services”, [ref.]) but the idea of being able to run apps anywhere (yes, really, again…!;-) seems to have been revitalised by the recent excitement around, and rapid pace of development of, docker containers.

Essentially, docker containers are isolated/independent containers that can be run in a single virtual machine. Containers can also be linked together within a VM so that they can talk to each other and yet remain isolated from other containers in the same VM. Containers can also expose services to the outside world.

In my head, this is what I think various bits and pieces of it look like…


A couple of recent announcements from docker suggest to me at least one direction of travel that could be interesting for delivering distance education, and remote and face-to-face training:

  • docker compose (fig, as was) – “with Compose, you define your application’s components – their containers, their configuration, links, volumes, and so on – in a single file, then you can spin everything up with a single command that does everything that needs to be done to get your application running.”
  • docker machine – “a tool that makes it really easy to go from ‘zero to Docker’. Machine creates Docker Engines on your computer, on cloud providers, and/or in your data center, and then configures the Docker client to securely talk to them.” [Like boot2docker, but supports cloud as well?]
  • Kitematic UI – “Kitematic completely automates the Docker installation and setup process and provides an intuitive graphical user interface (GUI) for running Docker containers on the Mac.” [Windows version coming soon]
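By way of illustration, a hypothetical Compose file for a course “application” might wire up a couple of tool containers in one go (the image names here are made up; the container and volume layout is just a sketch):

```yaml
# Hypothetical docker-compose.yml: spin up a set of linked course
# containers with a single `docker-compose up`
refine:
  image: example/openrefine       # made-up image name
  ports:
    - "3333:3333"                 # expose OpenRefine to the host browser
notebook:
  image: example/ipython-notebook # made-up image name
  ports:
    - "8888:8888"
  links:
    - refine                      # notebook container can talk to refine
  volumes:
    - ./data:/home/data           # share course data files from the host
```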

I don’t think there is GUI support for configuration management provided out of docker directly, but presumably if they don’t buy up something like panamax they’ll be releasing their own version of something similar at some point soon?!

(With the data course currently in meltdown, I’m tempted to add a bit more to the confusion by suggesting we drop the monolithic VM approach and instead go for a containerised approach, which feels far more elegant to me… It seems to me that with a little bit of imagination, we could come up with a whole new way of supporting software delivery to students – eg an OU docker hub with an app container for each app we make available to students, container compositions for individual courses, a ‘starter kit’ DVD (like the old OLA CD-ROM) with a local docker hub to get folk up and running without big downloads, etc etc.) It’s unlikely to happen of course – innovation seems to be too risky nowadays, despite the rhetoric…:-(

As well as being able to run docker containers locally or in the cloud, I also wonder about ‘plug and play’ free running containers that run on a wifi enabled Raspberry Pi that you can grab off the shelf, switch on, and immediately connect to? So for example, a couple of weeks ago Wolfram and Raspberry announced the Wolfram Language and Mathematica on Raspberry Pi, for free [Wolfram’s Raspberry Pi pages]. There are also crib sheets for how to run docker on a Raspberry Pi (the downside of this being that you need ARM based images rather than x86 ones), which could be interesting?

So pushing the thought a bit further, for the mythical submariner student who isn’t allowed to install software onto their work computer, could we give them a Raspberry Pi running their OU course software as service they could remotely connect to?!

PS by the by, at the Cabinet Office Code Club I help run for Open Knowledge last week, we had an issue with folk not being able to run OpenRefine properly on their machines. Fortunately, I’d fired up a couple of OpenRefine containers on a cloud host so we could still run the session as planned…

Tech Company Inspired Driverless Cars Don’t Cause Accidents (Apparently), But Their Mobile Phones Do, Presumably..?

A few weeks ago, the UK Department for Transport (DfT) published a summary report and action plan entitled The Pathway to Driverless Cars that sets out the UK response to car manufacturers and tech companies pushing to develop, test and produce autonomous vehicles on public roads. This was followed by an announcement in the recent (March, 2015) budget where the Chancellor announced that “we are going to back our brilliant automotive industry by investing £100 million to stay ahead in the race to driverless technology” (Hansard), and a new code of practice around the testing of driverless vehicles to appear sometime this Spring (2015).

One of the claims widely made in favour of autonomous vehicles/driverless cars by the automotive lobby is that in testing they have a better safety record than human drivers. (Human error plays a role in many accidents.) Whilst the story that a human driver crashes Google’s self-driving car is regularly wheeled out to illustrate how Google’s cars are much safer than humans, we don’t tend to see so many stories about when the human test driver had to take control of the vehicles, whether to avoid an accident, or because road and/or traffic conditions were “inappropriate” for “driverless” operation.

Nor do we hear much about the technology firms making road transport safer by doing something to mitigate the role that their technology plays in causing accidents.

As Nick Carr writes in his history of automation, The Glass Cage:

It’s worth noting Silicon Valley’s concern with highway safety, though no doubt sincere, has been selective. The distractions caused by cellphones and smartphones have in recent years become a major factor in car crashes. An analysis by the National Safety Council [in the US] implicated phone use in one-fourth of all accidents on US roads in 2012. Yet Google and other top tech firms have made little or no effort to develop software to prevent people from calling, texting or using apps while driving — surely a modest undertaking compared with building a car that can drive itself.

To see what proportion of road traffic incidents involved distractions caused by mobile phones, I thought I’d check the STATS19 dataset. This openly published dataset records details of UK road accidents involving casualties reported to the police. The form used to capture the data includes information relating to up to six contributory factors, including “Driver using mobile phone”.


Unfortunately, the data collected on this part of the form is deemed to be RESTRICTED as opposed to UNCLASSIFIED (the latter classification applying to those elements released in the STATS19 dataset), which means we can’t do any stats around this from the raw data. (I think the reason the data is not released is that it may be used to help de-anonymise incident data by triangulating information contained in the dataset with information gleaned from local news reports about an incident, for example?)

The closest it seems we can get are the DfT’s annual reported road casualties reports (eg 2013) and an old DfT mobile phone usage survey.

The release of the 2013 annual report is supported by a set of statistical tables that break down accidents in all sorts of ways, including two tables (ras50012.xls and ras50016.xls) that summarise accident counts on a local authority basis and include mobile phones as a contributory factor in reported incidents. So for example, in England in 2013 there were 384 such incidents. (It is not clear how many of the 2,669 incidents that included a “distraction in vehicle” might also have related to distractions caused by mobile phones in particular… Nor is it clear what the severity or impact of incidents with mobile phones recorded as a contributory factor actually were…)

In terms of autonomous vehicle safety, and how the lobbying groups pitch their case, it would also be interesting to know how autonomous vehicles are likely to cope in the context of other contributory factors, such as vision affected by external factors (10,272 (11%) in England in 2013), pedestrian factors only (11,877 (12%)), vehicle defects (1,757, (2%)), or road environment contributed (12,436 (13%)). For cases where there was an “impairment or distraction” in general (12,162 (13%)), it would be interesting to know what would be likely to happen in an autonomous vehicle where the vehicle tried to hand control back to a supervising human driver… (Note that percentages across contributory factors do not sum to 100% – incidents may have had several contributory factors.)

As technology continues to offer ever more “solutions” to claimed problems, I’m really mindful that we need to start being more critical of it and the claims made in pushing particular solutions. In particular, three things concern me: 1) that if we look at the causes of problems that technology claims to fix, maybe technology is contributing to the problem (and the answer is not to apply more technology to treat problems caused by other technology); 2) that we don’t tend to look at the systemic consequences of applying a particular technology; 3) that we don’t tend to recognise how adopting a particular technology can lock us in to a particular set of (inflexible) technology mediated practices, nor how we change our behaviour to suit that technological solution.

On balance, I’m probably negative on the whole tech thing, even though I guess I work within it…

Geographical Rights Management, Mesh based Surveillance, Trickle-Down and Over-Reach

Every so often there’s a flurry of hype around the “internet of things”, but in many respects it’s already here – and has been for several decades. I remember as a kid being intrigued by some technical documents describing some telemetry system or other that remote water treatment plants used to transmit status information back to base. And I vaguely remember, from a Maplin magazine around the same time, an article or two about what equipment you needed to listen in on, and decode, the radio chatter of all manner of telemetry systems.

Perhaps the difference now is a matter of scale – it’s easier to connect to the network, comms are bidirectional (you can receive as well as transmit information), and with code you can effect change on receipt of a message. The tight linkage between software and hardware – bits controlling atoms – also means that we can start to treat more and more things as “plant” whose behaviour we can remotely monitor, and govern.

A good example of how physical, consumer devices can already be controlled – or at least, disabled – by a remote operator is described in a New York Times article that crossed my wires last week, Miss a Payment? Good Luck Moving That Car, which describes how “many subprime borrowers [… in the US] must have their car outfitted with a so-called starter interrupt device, which allows lenders to remotely disable the ignition. Using the GPS technology on the devices, the lenders can also track the cars’ location and movements.” As the loan payment due date looms, it seems that some devices also emit helpful beeps to remind you…. And if your car loan agreement stipulates you’ll only drive within a particular area, I imagine that you could find it’s been geofenced. (A geofence is a geographical boundary line that can be used to detect whether a GPS tracked device has passed into, or exited from, a particular region. When used to disable a device that leaves – or enters – a particular area, as for example drones flying into downtown Washington, we might consider it a form of “location based management” (or “geographical rights management (GRM)”?!), whereby someone who claims to control use of a device in a particular space can disable its activity in that location. (Think: DRM for location…))
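At its simplest, a geofence check is just a point-in-region test; a minimal sketch using a circular fence (centre point and radius are made up), with the haversine formula for distance:

```python
# Test whether a GPS position falls inside a circular geofence,
# defined by a centre point and a radius in kilometres.
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in km."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * asin(sqrt(a))  # Earth radius approx. 6371 km

def inside_geofence(pos, centre, radius_km):
    """Is a GPS position inside the fence around `centre`?"""
    return haversine_km(*pos, *centre) <= radius_km

centre = (53.99, -1.54)  # made-up fence centre, 5km radius
print(inside_geofence((53.99, -1.55), centre, 5))  # → True
print(inside_geofence((54.50, -1.54), centre, 5))  # → False
```

A real tracker would run this continuously against the last GPS fix, triggering a notification (or a starter interrupt) on a boundary crossing.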

One of the major providers of “starter interrupt devices” is a company called PassTime (product list). Their products include:

  • PassTime Plus, the core of their “automated collection technology”.
  • Trax: “PassTime TRAX is the entry level GPS tracking product”. Includes: Pin point GPS location service, Up to Six (6) simultaneous Geo-fences.
  • PassTime GPS: “provides asset protection at an economical price while utilizing the same hardware and software platform of PassTime’s Elite Pro line of products. GPS tracking and remote vehicle disable features offer customers tools for a swift recovery if needed.” Includes: Pin point GPS location service, Remote vehicle disable option, Tow-Detect Notification, Device Tamper Notification, Up to Six (6) simultaneous Geo-fences, 24-Hour Tracking, Automatic Location Heartbeat
  • Elite-Pro: “the ultimate combination of GPS functionality and Automated Collection Technology”. Includes the PassTime GPS features but also mentions “Wireless Command Delivery”.

PassTime seem to like the idea of geofences so much they have patents in related technologies: PassTime Awarded Patent for Geo-Fence and Tamper Notice (US Patent: 8018329). You can find other related patents by looking up other patents held by the inventors (for example…).

You’ll be glad to know that PassTime have UK partners… in the form of The Car Finance Company, who are apparently “the world’s largest user and first company in the UK to start fitting Payment Reminder Technology to your new car”. Largest user?! According to a recent [March 12, 2015] press release, an extension to their agreement “will bring 70,000 payment assurance and telematics devices to the United Kingdom”.

Here’s how The Car Finance Company spin it: “The Passtime system helps remind you when your repayments are due so you can ensure you stay on track with your loan and help repair and rebuild your credit. The device is only there to help you keep your repayments up to date, it doesn’t affect your car nor does it monitor the way you drive.” From the recent press release, “PassTime has been supplying Payment Assurance and GPS devices to The Car Finance Company since 2009” (my emphasis). I’m not sure if that means the PassTime GPS (with the starter interrupt) or the Trax device? If I was a journalist, rather than a blogger, I’d probably phone them to try to clarify that…

In passing, whilst searching for providers of automotive GPS trackers in the UK (and there are lots of them – search on something like GPS fleet management, for example…) I came across this rather intrusive piece of technology, The TRACKER Mesh Network, which “uses vehicles fitted with TRACKER Locate and TRACKER Plant to pick up reply codes from stolen vehicles with an activated TRACKER unit making them even easier to locate and recover”. Which is to say, this company has an ad hoc, mobile, distributed network of sensors spread across the UK road network that listen out for each other and opportunistically track each other. It’s all good, though:

“The TRACKER Mesh Network will enable the police to extend the network of ‘eyes and ears’ to identify and locate stolen vehicles more effectively using advanced technology and allow us to stay one step ahead of criminals who are becoming more and more adept at stealing cars. This is a real opportunity for the motoring public to help us clamp down on car thieves and raises public confidence in our ability to recover their possessions and bring the offenders to justice.”

(By the by, previous notes on ANPR – Automatic Number Plate Recognition. Also, note the EU eCall accident alerting system that automatically calls for help if you have a car accident [about, UK DfT eCall cost/benefit analysis].)

This conflation of commercial and police surveillance is… to be expected. But the data’s being collected, and it won’t go away. Snowden revelations revealed the scope of security service data collection activities, and chunks of that data won’t be going away either. The scale of the data collection is such that it’s highly unlikely that we’re all being actively tracked or that this data will ever meaningfully contribute to the detection of conspiracies, but it can and will be used post hoc to create paranoid data driven fantasies about who could have met whom, when, discussed what, and so on.

I guess where we can practically start to get concerned is in considering the ‘trickle down’ way in which access to this data will increasingly be opened up, and/or sold, to increasing numbers of agencies and organisations, both public and private. As Ed Snowden apparently commented in a session at SXSW (Snowden at SXSW: Be very concerned about the trickle down of NSA surveillance to local police), “[t]hey’ve got everything. The question becomes, Now they’re empowered. They can leak [this stuff]. It does happen at the local level. These capabilities are created. High tech. Super secret. But they inevitably bleed over to law enforcement. When they’re brand new they’re only used in the extremes. But as that transition happens, more and more people get access, they use it in newer and more and more expansive and more abusive ways.”

(Trickle down – or over-reach – applies to legislation too. For example, from a story widely reported in April, 2008: Half of councils use anti-terror laws to spy on ‘bin crimes’, although the legality of such practices was challenged: Councils warned over unlawful spying using anti-terror legislation and guidance brought in in November 2012 that required local authorities to obtain judicial approval prior to using covert techniques. (I realise I’m in danger here of conflating things not specifically related to over-reach on laws “intended” to be limited to anti-terrorism associated activities (whatever they are) with over-reach…) Other reviews: Lords Constitution Committee – Second Report – Surveillance: Citizens and the State (Jan 2009), Big Brother Watch on How RIPA has been used by local authorities and public bodies and Cataloguing the ways in which local authorities have abused their covert surveillance powers. I’m guessing a good all round starting point would be the reports of the Independent Reviewer of Terrorism Legislation.)

When it comes to processing large amounts of data, finding meaningful, rather than spurious, connections between things can be hard… (Correlation is not causation, right? As Spurious Correlations wittily points out…;-)

What is more manageable is dumping people onto lists and counting things… Or querying specifics. A major problem with the extended and extensive data collection activities going on at the moment is that access to the data, and the ability to make particular queries over it, will be extended. The problem is not that all your data is being collected now; the issue is that post hoc searches over it could be made by increasing numbers of people in the future. Like bad tempered council officers having a bad day, or loan company algorithms with dodgy parameters.

PS Schneier on connecting the dots: Why Mass Surveillance Can’t, Won’t, And Never Has Stopped A Terrorist

Quick Sketch – Election Maps

On my to do list over the next few weeks is to pull together a set of resources that could be useful in supporting data related activities around the UK General Election.

For starters, I’ve popped up an example of using the folium Python library for plotting simple choropleth maps using geojson based Westminster Parliamentary constituency boundaries: example election maps notebook.

What I haven’t yet figured out – and don’t know if it’s possible – is how to generate qualitative/categorical maps using predefined colour maps (so eg filling boundaries using colour to represent the party of the current MP etc). If you know how to do this, please let me know via the comments…;-)
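One possible approach worth trying: folium’s GeoJson layer accepts a `style_function` that is called with each geojson feature, so categorical colouring could be done by looking the party up in the feature’s properties. This is only a sketch — the `"party"` property name and the hex colours are assumptions for illustration, and the lookup table would need extending to cover all the parties:

```python
# Hypothetical party -> colour lookup (hex colours assumed for illustration)
PARTY_COLOURS = {
    "Conservative": "#0087DC",
    "Labour": "#DC241F",
    "Liberal Democrat": "#FDBB30",
}

def party_style(feature):
    """Return a folium-style dict, colouring by the feature's assumed
    'party' property; unknown/missing parties fall back to grey."""
    party = feature["properties"].get("party")
    return {
        "fillColor": PARTY_COLOURS.get(party, "#999999"),  # grey fallback
        "fillOpacity": 0.6,
        "color": "black",
        "weight": 1,
    }

# Usage sketch (assuming a folium map m and a constituencies geojson dict):
# folium.GeoJson(constituencies, style_function=party_style).add_to(m)
```

Whether this plays nicely with the choropleth machinery in the notebook I don’t know — but as a plain `style_function` it sidesteps the predefined colour map question altogether.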

Also in the notebook is a reference to an election odds scraper I’m running over each (or at least, most… let me know if you spot any missing ones…) of the Parliamentary constituencies. The names associated with the constituencies don’t correspond in an exact match sense to any standard vocabularies, so on the to do list is to work out a mapping from the election odds constituency names to standard constituency identifiers. I’m thinking this could represent a handy way of demonstrating my Westminster Constituency reconciliation service docker container….:-)
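As a first pass at that mapping, simple fuzzy string matching gets you a surprisingly long way before you need a proper reconciliation service. A minimal sketch using the standard library’s `difflib` — the constituency names here are made-up examples of the kind of near-miss variants that turn up, not actual scraped data:

```python
import difflib

# Hypothetical examples: odds-site spellings on the left, "standard"
# constituency names (e.g. from an official code list) on the right.
odds_names = ["Birmingham Edgbaston", "Ynys Mon", "Weston Super Mare"]
standard_names = ["Birmingham, Edgbaston", "Ynys Môn",
                  "Weston-super-Mare", "Westminster North"]

def best_match(name, candidates, cutoff=0.6):
    """Return the closest standard name, or None if nothing scores
    above the similarity cutoff."""
    matches = difflib.get_close_matches(name, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# Build the odds-name -> standard-name mapping
mapping = {name: best_match(name, standard_names) for name in odds_names}
```

Anything that comes back as `None` (or looks suspect) would then be a candidate for manual mapping, or for throwing at the reconciliation service proper.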

What’s the Point of an API?

Trying to clear my head of code on a dog walk after a couple of days tinkering with the nomis API, I started to ponder what an API is good for.

Chris Gutteridge and Alex Dutton’s open data excuses bingo card and Owen Boswarva’s Open Data Publishing Decision Tree both suggest that not having an API can be used as an excuse for not publishing a dataset as open data.

So what is an API good for?

I think one naive view is that this is what an API gets you…


It doesn’t of course, because folk actually want this…


Which is not necessarily that easy even with an API:


For a variety of reasons…


Even when the discovery part is done and you think you have a reasonable idea of how to call the API to get the data you want out of it, you’re still faced with the very real practical problem of how to actually get the data into the analysis environment in a form that can be worked on in that environment. Just because you publish standards based SDMX flavoured XML doesn’t mean anything to anybody if they haven’t got an “import from SDMX flavoured XML directly into some format I know how to work with” option.
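Which is to say, the user ends up writing glue code. A toy illustration of the chore — the XML fragment here is a made-up, heavily simplified “SDMX flavoured” stand-in (real SDMX documents are namespaced and considerably more involved), but the shape of the problem is the same:

```python
import xml.etree.ElementTree as ET

# Made-up, highly simplified "SDMX flavoured" fragment, for illustration only.
xml_doc = """
<Series>
  <Obs TIME="2013" OBS_VALUE="64.1"/>
  <Obs TIME="2014" OBS_VALUE="64.6"/>
</Series>
"""

# The user-side chore: turning publisher-friendly XML into rows a
# tool can actually work with (e.g. to feed into a pandas DataFrame).
root = ET.fromstring(xml_doc)
rows = [{"time": obs.get("TIME"), "value": float(obs.get("OBS_VALUE"))}
        for obs in root.findall("Obs")]
```

Multiply that by namespaces, code lists and dimension lookups and you can see why an “import from X” option beats a raw endpoint for most users.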


And even then, once the data is in, the problems aren’t over…


(I’m assuming the data is relatively clean and doesn’t need any significant amount of cleaning, normalising, standardising, type-casting, date parsing etc etc. Which is of course likely to be a nonsense assumption ;-)

So what is an API good for, and where does it actually exist?

I’m starting to think that for many users, if there isn’t some equivalent of an “import from X” option in the tool they are using or environment they’re working in, then the API-for-X is not really doing much of a useful job for them.

Also, if there isn’t a discovery tool they can use from the tool they’re using or environment they’re working in, then finding data from service X turns into another chore that takes them out of their analysis context and essentially means that the API-for-X is not really doing much of a useful job for them.

What I tried to do in doodling the Python / pandas Remote Data Access Wrapper for the Nomis API was to create, for myself, some tools that would help me discover various datasets on the nomis platform from my working environment – an IPython notebook – and then fetch any data I wanted from the platform into that environment in a form in which I could immediately start working with it – which is to say, typically as a pandas dataframe.
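To give a flavour of the access half of that, here’s a minimal sketch of the sort of thing such a wrapper does: build the URL for the CSV flavour of a nomis dataset and hand it straight to pandas. The URL pattern is based on my reading of the nomis API (the `api/v01/dataset/{id}.data.csv` route), and the selector names are whatever a given dataset happens to accept — treat the details as assumptions rather than gospel:

```python
from urllib.parse import urlencode

NOMIS_API = "https://www.nomisweb.co.uk/api/v01/dataset"

def nomis_csv_url(dataset_id, **selectors):
    """Build a nomis data URL for the CSV flavour of a dataset.

    Selector names (geography, date, measures, ...) are passed straight
    through as query parameters; which ones a dataset accepts has to be
    discovered from its metadata. Sorted for a stable, cacheable URL.
    """
    query = urlencode(sorted(selectors.items()))
    return "{}/{}.data.csv?{}".format(NOMIS_API, dataset_id, query)

# In the wrapper proper, the URL is then handed to pandas, e.g.:
#   df = pd.read_csv(nomis_csv_url("NM_1_1", geography="2092957697",
#                                  date="latest"))
url = nomis_csv_url("NM_1_1", geography="2092957697", date="latest")
```

The point being: once the fiddly URL construction is wrapped up, the user-facing “API” is just a function call that returns a dataframe.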

I haven’t started trying to use it properly yet – and won’t get a chance to for a week or so at least now – but that was the idea. That is, the wrapper should support the discovery and access parts of the conversation I want to have with the nomis data from within my chosen environment. That’s what I want from an API. Maybe?!;-)

And note further – this does not mean that things like a pandas Remote Data Access plugin or a CRAN package for R (such as the World Bank Development Indicators package, or any of the other data/API packages referenced from the rOpenSci packages list) should be seen as extensions of the API. At worst, they should be seen as projections of the API into user environments. At best, it is those packages that should be seen as the actual API.

APIs for users – not programmers. That’s what I want from an API.

See also: Opening Up Access to Data: Why APIs May Not Be Enough….

PS See also this response from @apievangelist: The API Journey.