OUseful.Info, the blog…

Trying to find useful things to do with emerging technologies in open education

Archive for the ‘Anything you want’ Category

Assessing Data Wrangling Skills – Conversations With Data

with one comment

By chance, a couple of days ago I stumbled across a spreadsheet summarising awarded PFI contracts as of 2013 (private finance initiative projects, 2013 summary data).

The spreadsheet has 800 or so rows, and a multitude of columns, some of which are essentially grouped (although multi-level, hierarchical headings are not explicitly used) – columns relating to (estimated) spend in different tax years, for example, or the equity partners involved in a particular project.

As part of the OU course we’re currently developing on “data”, we’re exploring an assessment framework based on students running small data investigations, documented using IPython notebooks, in which we will expect students to demonstrate a range of technical and analytical skills, as well as some understanding of data structures and shapes, data management technology and policy choices, and so on. In support of this, I’ve been trying to put together one or two notebooks a week over the course of a few hours to see what sorts of thing might be possible, tractable and appropriate for inclusion in such a data investigation report within the time constraints allowed.

To this end, the Quick Look at UK PFI Contracts Data notebook demonstrates some of the ways in which we might learn something about the data contained within the PFI summary data spreadsheet. The first thing to note is that it’s not really a data investigation: there is no specific question I set out to ask, and no specific theme I set out to explore. It’s more exploratory (that is, rambling!) than that. It has more of the form what I’ve started referring to as a conversation with data.

In a conversation with data, we develop an understanding of what the data may have to say, along with a set of tools (that is, recipes, or even functions) that allow us to talk to it more easily. These tools and recipes are reusable within the context of the data set or datasets selected for the conversation, and may be transferrable to other data conversations. For example, one recipe might be how to filter and summarise a dataset in a particular way, or generate a particular view or reshaping of it that makes it easy to ask a particular sort of question, or generate a particular sort of a chart (creating a chart is like asking a question where the surface answer is returned in a graphical form).

If we are going to use IPython notebook documented conversations with data as part of an assessment process, we need to tighten up a little more how we might expect to see them structured and what we might expect to see contained within them.

Quickly skimming through the PFI conversation, elements of the following skills are demonstrated:

- downloading a dataset from a remote location;
– loading it into an appropriate data representation;
– cleaning (and type casting) the data;
– spotting possibly erroneous data;
– subsetting/filtering the data by row and column, including the use of partial string matching;
– generating multiple views over the data;
– reshaping the data;
– using grouping as the basis for the generation of summary data;
– sorting the data;
– joining different data views;
– generating graphical charts from derived data using “native” matplotlib charts and a separate charting library (ggplot);
– generating text based interpretations of the data (“data2text” or “data textualisation”).

The notebook also demonstrates some elements of reflection about how the data might be used. For example, it might be used in association with other datasets to help people keep track of the performance of facilities funded under PFI contracts.

Note: the notebook I link to above does not include any database management system elements. As such, it represents elements that might be appropriate for inclusion in a report from the first third of our data course – which covers basic principles of data wrangling using pandas as the dominant toolkit. Conversations held later in the course should also demonstrate how to get data into an out of an appropriately chosen database management system, for example.

Written by Tony Hirst

September 24, 2014 at 11:00 am

Posted in Anything you want, Infoskills

Tagged with

Recreational Data – Could I Answer a Written Question My MP Asked?

leave a comment »

Via one of my feeds from TheyWorkForYou, I noticed this written answer to a question my MP, Andrew Turner, asked of the Secretary of State for International Development:


I’ve been playing around with development data lately, trying to sketch together some pieces for an OpenLearn course on data visualisation for development (hopefully!), so I thought this would be a good test of how quickly I could find the data and confirm the results.

Working backwards, GDP data (in various adjusted forms) is available from the World Bank API, which I’ve been accessing via the remote data interface calls in pandas (for example, Easy Access to World Bank and UN Development Data from IPython Notebooks).

So where do get the aid ranking from?

There are two ways of doing this – one to look for local UK sources (eg from DFID perhaps), the other to look for international sources of data. The advantage of the former is that these are presumably the sources that whoever answered the question went to. The advantage of the latter is that we should be able to generalise the question to query similar rankings for aid distributed by other countries.

“Official Development Assistance” seems to be a key phrase, with a quick websearch for that phrase and the term “data” turning up this Aid statistics – charts, tables and databases resource page, which in turn points to a whole raft of datatables as Excel files detailing statistics on resource flows to developing countries; the International Development Statistics (IDS) online databases page links to several more general online databases. (There’s also a beta data.oecd.org site.)

Forsaking the raw data files for a minute, the site claims that “the Query Wizard for International Development Statistics [QWIDS] is the easiest way to search our database as it automatically extracts the most appropriate dataset from OECD.Stat to match your search” – so let’s try that… QWIDS.


Nice and simple then…?!

A bit of tinkering (setting the donor, unticking recipients so only countries – rather than countries and groupings are included) gives what I think is the data for the aid disbursements from the UK to other countries, data I could export as a CSV file; but there are no tools onsite to help me look at the top 10.


Poking around, it looks like the data’s also there to allow us to look at disbursements (or perhaps just allocations) by donor country and sector into a particular country? Maybe?! This would then let us see how aid was being allocated from the UK to the top 10 recipients, broken down by sector, which might be more illuminating? I also wonder if there are any relationships between aid paid by donors into a particular sector, and imports into the recipient country from the donor country within the same sectors? For this, we need trade data breakdowns. (We can get total flows between countries (I think?!) but I’m not sure how to find the data broken down by sector?)

The stats.oecd.org site does let us sort, but I couldn’t find an easy or clean way to limit results to countries, and exclude groupings:


The order (of aid disbursements from the UK in 2012) has the same rank order as the response to my MP’s question.

For the GDP and GDP per capita data, we can go to the World Bank:


Note a couple of things – units tend to be given in US dollars rather than Sterling; there are all sorts of US dollars… (see for example Accounting for Inflation – Deflators, or “What Does ‘Prices in Real Terms’ Actually Mean?”).

Hmm… maybe it would have been easier to find the data on the DFID site instead…

PS Indeed it was – Statistics on International Development 2013 – Tables has a link to a dataset that contains the league table: “Table 4: Top Twenty Recipients UK Net Bilateral ODA 2010 – 2012″.

Written by Tony Hirst

September 15, 2014 at 1:35 pm

Posted in Anything you want

Local Data Journalism – Care Homes

leave a comment »

The front page of this week’s Isle of Wight County Press describes a tragic incident relating to a particular care home on the Island earlier this year:



(Unfortunately, the story doesn’t seem to appear on the County Press’ website? Must be part of a “divide the news into print and online, and never the twain shall meet” strategy?!)

As I recently started pottering around the CQC website and various datasets they publish, I thought I’d jot down a few notes about what I could find. The clues from the IWCP article were the name of the care home – Waxham House, High Park Road, Ryde – and the proprietor – Sanjay Ramdany.

Using the “CQC Care Directory – With Filters” from the CQC data and information page, I found a couple of homes registered to that provider.

1-120578256, 19/01/2011, Waxham House, 1 High Park Road, Ryde, Isle of Wight, PO33 1BP
1-120578313, 19/01/2011, Cornelia Heights 93 George Street, Ryde, Isle of Wight, PO33 2JE

1-101701588, Mr Sanjay Prakashsingh Ramdany & Mrs Sandhya Kumari Ramdany

Looking up “Waxham House” on the CQC website gives us a copy of the latest report outcome:


Looking at the breadcrumb navigation, it seems we can directly get a list of other homes operated by the same proprietors:

cqc provider

I wonder if we can search the site by proprietor name too?

cqc properieot search

Looks like it…

So how did their other home fare?




By the by, according to the Food Standards Agency, how’s the food?



And how much money is the local council paying these homes?

(Note – I haven’t updated the following datasets for a bit – I also note I need to add dates to the transaction tables. local spending explorer info; app.)

[Click through on the image to see the app - hit Search to remove the error message and load the data!]



Why the refunds?

A check on OpenCorporates for director names turned up nothing.

I’m not trying to offer any story here about the actual case reported by the County Press, more a partial story about how we can start to look for data around a story to see if there may be more to the story we can find from open data sources.

Written by Tony Hirst

September 14, 2014 at 9:55 am

Posted in Anything you want

Tagged with

So What Does an Armchair Auditor Do All Day?

leave a comment »

I’ve no idea… Because there aren’t any, apparently: Poor data quality hindering government open data programme. And as I try to make sense of that article, it seems there aren’t any because of UTF-8, I think? Erm…

For my own council, the local hyperlocal, OnTheWight, publish a version of Adrian Short’s Armchair Auditor app at armchairauditor.onthewight.com. OnTheWight have turned a few stories from this data, I think, so they obviously have a strategy for making use of the data.

My own quirky skillset, such as it is, meant that it wasn’t too hard for me to start working with the original council published data to build an app showing spend in different areas, by company etc – Local Council Spending Data – Time Series Charts – although the actual application appears to have rotted (pound signs are not liked by the new shiny library and I can’t remember how to log in to the glimmer site:-(

I also tried to make sense of the data by trying to match it up to council budget areas, but that wasn’t too successful: What Role, If Any, Does Spending Data Have to Play in Local Council Budget Consultations?

But I still don’t know what questions to ask, what scripts to run? Some time ago, Ben Worthy asked Where are the Armchair Auditors? but I’m more interested in the question: what would they actually do? and what sort of question or series of questions might they usefully ask, and why?

Just having access to data is not really that very interesting. It’s the questions you ask for it, and the sorts of stories you look for in it, that count. So what stories might Armchair Auditors go looking for, what odd things might they seek out, what questions might they ask of the data?

Written by Tony Hirst

August 29, 2014 at 9:46 am

Running “Native” Data Wrangling Applications in the Browser – IPython Notebooks (and R?) in Chrome

leave a comment »

Using browser based data analysis toolkits such as pandas in IPython notebooks, or R in RStudio, means you need to have access to python or R and the corresponding application server either on your own computer, or running on a remote server that you have access to.

When running occasional training sessions or workshops, this can cause several headaches: either a remote service needs to be set up that is capable of supporting the expected number of participants, security may need putting in place, accounts configured (or account management tools supported), network connections need guaranteeing so that participants can access the server, and so on; or participants need to install software on their own computers: ideally this would be done in advance of a training session, otherwise training time is spent installing, configuring and debugging software installs; some computers may have security policies that prevent users installing software, or require and IT person with admin privileges to install the software, and so on.

That’s why the coLaboratory Chrome extension looks like an interesting innovation – it runs an IPython notebook fork, with pandas and matplotlib as a Chrome Native Client application. I posted a quick walkthrough of the extension over on the School of Data blog: Working With Data in the Browser Using python – coLaboratory.

Via a Twitter exchange with @nativeclient, it seems that there’s also the possibility that R could run as a dependency free Chrome extension. Native Client seems to like things written in C/C++, which underpins R, although I think R also has some fortran dependencies. (One of the coLaboratory talks mentioned the to do list item of getting scipy (I think?) running in the coLaboratory extension, the major challenge there (or whatever the package was) being the fortran src; so there maybe be synergies in working the fortran components there?))

Within a couple of hours of the twitter exchange starting, Brad Nelson/@flagxor posted a first attempt at an R port to the Native Client. I don’t pretend to understand what’s involved in moving from this to an extension with some sort of useable UI, even if only a command line, but it represents an interesting possibility: of being able to run R in the browser (or at least, in Chrome). Package availability would be limited of course to packages compiled to run using PNaCl.

For training events, there is still the requirement that users install a Chrome browser on their computer and then install the extension into that. However, I think it is possible to run Chrome as a portable app – that is, from a flash drive such as a USB memory stick: Google Chrome Portable (Windows).

I’m not sure how fast it would be able to run, but it suggests there may be a way of carrying a portable, dependency free pandas environment around that you can run on a Windows computer from a USB key?! And maybe R too…?

Written by Tony Hirst

August 22, 2014 at 9:42 am

A Conversation With Data – Isle of Wight Car Parking Meter Transaction Logs

leave a comment »

Killer post title, eh?

Some time ago I put in an FOI request to the Isle of Wight Council for the transaction logs from a couple of ticket machines in the car park at Yarmouth. Since then, the Council made some unpopular decisions about car parking charges, got a recall and then in passing made the local BBC news (along with other councils) in respect of the extent of parking charge overpayments…

Here’s how hyperlocal news outlet OnTheWight reported the unfolding story…

I really missed a trick not getting involved in this process – because there is, or could be, a significant data element to it. And I had a sample of data that I could have doodled with, and then gone for the whole data set.

Anyway, I finally made a start on looking at the data I did have with a view to seeing what stories or insight we might be able to pull from it – the first sketch of my conversation with the data is here: A Conversation With Data – Car Parking Meter Data.

It’s not just the parking meter data that can be brought to bear in this case – there’s another set of relevant data too, and I also had a sample of that: traffic penalty charge notices (i.e. traffic warden ticket issuances…)

With a bit of luck, I’ll have a go at a quick conversation with that data over the next week or so… Then maybe put in a full set of FOI requests for data from all the Council operated ticket machines, and all the penalty notices issued, for a couple of financial years.

Several things I think might be interesting to look at Island-wide:

  • in much the same was as Tube platforms suffer from loading problems, where folk surge around one entrance or another, do car parks “fill up” in some sort of order, eg within a car park (one meter lags the other in terms of tickets issued) or one car park lags another overall;
  • do different car parks have a different balance of ticket types issued (are some used for long stay, others for short stay?) and does this change according to what day of the week it is?
  • how does the issuance of traffic penalty charge notices compare with the sorts of parking meter tickets issued?
  • from the timestamps of when traffic penalty charge notices tickets are issued, can we work out the rounds of different traffic warden patrols?

The last one might be a little bit cheeky – just like you aren’t supposed to share information about the mobile speed traps, perhaps you also aren’t supposed to share information that there’s a traffic warden doing the rounds…?!

Written by Tony Hirst

July 25, 2014 at 6:04 pm


Get every new post delivered to your Inbox.

Join 812 other followers