OUseful.Info, the blog…

Trying to find useful things to do with emerging technologies in open education

Assessing Data Wrangling Skills – Conversations With Data

By chance, a couple of days ago I stumbled across a spreadsheet summarising awarded PFI contracts as of 2013 (private finance initiative projects, 2013 summary data).

The spreadsheet has 800 or so rows, and a multitude of columns, some of which are essentially grouped (although multi-level, hierarchical headings are not explicitly used) – columns relating to (estimated) spend in different tax years, for example, or the equity partners involved in a particular project.

As part of the OU course we’re currently developing on “data”, we’re exploring an assessment framework based on students running small data investigations, documented using IPython notebooks. In these we will expect students to demonstrate a range of technical and analytical skills, as well as some understanding of data structures and shapes, data management technology and policy choices, and so on. In support of this, I’ve been trying to put together one or two notebooks a week, each over the course of a few hours, to see what sorts of thing might be possible, tractable and appropriate for inclusion in such a data investigation report within the time constraints allowed.

To this end, the Quick Look at UK PFI Contracts Data notebook demonstrates some of the ways in which we might learn something about the data contained within the PFI summary data spreadsheet. The first thing to note is that it’s not really a data investigation: there is no specific question I set out to ask, and no specific theme I set out to explore. It’s more exploratory (that is, rambling!) than that. It has more of the form of what I’ve started referring to as a conversation with data.

In a conversation with data, we develop an understanding of what the data may have to say, along with a set of tools (that is, recipes, or even functions) that allow us to talk to it more easily. These tools and recipes are reusable within the context of the data set or datasets selected for the conversation, and may be transferable to other data conversations. For example, one recipe might be how to filter and summarise a dataset in a particular way, or generate a particular view or reshaping of it that makes it easy to ask a particular sort of question, or generate a particular sort of chart (creating a chart is like asking a question where the surface answer is returned in a graphical form).
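
By way of illustration, here is a minimal sketch of what one such recipe might look like as a reusable pandas function. The function name, the column names and the rows are all made up for the example; they are not taken from the PFI spreadsheet or the notebook.

```python
import pandas as pd


def summarise_by(df, group_col, value_col, top_n=5):
    """Recipe: total value_col within each group_col, largest first."""
    totals = df.groupby(group_col)[value_col].sum()
    return totals.sort_values(ascending=False).head(top_n)


# Made-up rows standing in for a contracts table; column names are assumptions.
contracts = pd.DataFrame({
    "department": ["Health", "Education", "Health", "Transport"],
    "capital_value_m": [120.0, 45.5, 80.0, 300.0],
})

# The same recipe can be pointed at any grouping column and numeric column.
by_department = summarise_by(contracts, "department", "capital_value_m")
print(by_department)
```

Once a recipe like this exists, asking the same sort of question of a different column, or a different dataset with a similar shape, is a one line change.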

If we are going to use IPython notebook documented conversations with data as part of an assessment process, we need to tighten up a little more how we might expect to see them structured and what we might expect to see contained within them.

Quickly skimming through the PFI conversation, elements of the following skills are demonstrated:

– downloading a dataset from a remote location;
– loading it into an appropriate data representation;
– cleaning (and type casting) the data;
– spotting possibly erroneous data;
– subsetting/filtering the data by row and column, including the use of partial string matching;
– generating multiple views over the data;
– reshaping the data;
– using grouping as the basis for the generation of summary data;
– sorting the data;
– joining different data views;
– generating graphical charts from derived data using “native” matplotlib charts and a separate charting library (ggplot);
– generating text based interpretations of the data (“data2text” or “data textualisation”).
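
To give a flavour of several of the skills listed above in one place, here is a rough pandas sketch. It is not the code from the PFI notebook: the CSV content, project names and column names are invented, and a real conversation would start from the downloaded spreadsheet rather than an inline string.

```python
import io
import pandas as pd

# Made-up CSV standing in for a downloaded file; all names and values are invented.
raw = io.StringIO(
    "project,department,capital_value,spend_2012_13,spend_2013_14\n"
    "North Hospital,Health,\"1,200\",50,52\n"
    "City School,Education,45,4,4\n"
    "Ring Road,Transport,300,20,-21\n"
    "South Hospital,Health,80,6,6\n"
)

# Loading: read everything as text first so that the cleaning step is explicit.
df = pd.read_csv(raw, dtype=str)

# Cleaning and type casting: strip thousands separators, then convert to numbers.
for col in ["capital_value", "spend_2012_13", "spend_2013_14"]:
    df[col] = pd.to_numeric(df[col].str.replace(",", "", regex=False))

# Spotting possibly erroneous data: a negative spend figure looks suspicious.
suspect = df[(df[["spend_2012_13", "spend_2013_14"]] < 0).any(axis=1)]

# Filtering rows by partial string match, and selecting a subset of columns.
hospitals = df.loc[df["project"].str.contains("Hospital"), ["project", "capital_value"]]

# Reshaping: one row per project per tax year (wide to long).
long_spend = df.melt(
    id_vars=["project", "department"],
    value_vars=["spend_2012_13", "spend_2013_14"],
    var_name="tax_year",
    value_name="spend",
)

# Grouping and sorting: total capital value per department, largest first.
dept_totals = (
    df.groupby("department", as_index=False)["capital_value"].sum()
      .sort_values("capital_value", ascending=False)
)

# Joining views: attach each department's total to every project row.
joined = df.merge(dept_totals, on="department", suffixes=("", "_dept_total"))

# A crude "data2text" step: turn the top row of a summary into a sentence.
top = dept_totals.iloc[0]
sentence = f"{top['department']} has the largest total capital value ({top['capital_value']})."
print(sentence)
```

Charting is left out to keep the sketch short; the grouped and reshaped frames are the sort of derived data a chart would be drawn from.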

The notebook also demonstrates some elements of reflection about how the data might be used. For example, it might be used in association with other datasets to help people keep track of the performance of facilities funded under PFI contracts.

Note: the notebook I link to above does not include any database management system elements. As such, it represents elements that might be appropriate for inclusion in a report from the first third of our data course – which covers basic principles of data wrangling using pandas as the dominant toolkit. Conversations held later in the course should also demonstrate how to get data into and out of an appropriately chosen database management system, for example.

Written by Tony Hirst

September 24, 2014 at 11:00 am

Posted in Anything you want, Infoskills

Recreational Data – Could I Answer a Written Question My MP Asked?

Via one of my feeds from TheyWorkForYou, I noticed this written answer to a question my MP, Andrew Turner, asked of the Secretary of State for International Development:

[Screenshot: Overseas Aid, 8 Sep 2014, Hansard Written Answers, TheyWorkForYou]

I’ve been playing around with development data lately, trying to sketch together some pieces for an OpenLearn course on data visualisation for development (hopefully!), so I thought this would be a good test of how quickly I could find the data and confirm the results.

Working backwards, GDP data (in various adjusted forms) is available from the World Bank API, which I’ve been accessing via the remote data interface calls in pandas (for example, Easy Access to World Bank and UN Development Data from IPython Notebooks).
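
As a rough sketch of what sits underneath those calls, the following builds a query URL for the World Bank API directly and flattens a response into a pandas DataFrame, rather than using the pandas remote data functions themselves (which, as far as I know, have since moved into the separate pandas-datareader package). The helper names are my own, the indicator code is the one I understand to be GDP per capita in constant US dollars, and the sample payload is a placeholder in the assumed response shape, not real data.

```python
import pandas as pd


def worldbank_url(countries, indicator, start, end):
    """Build a World Bank API (v2) query URL for a single indicator."""
    codes = ";".join(countries)
    return (
        f"https://api.worldbank.org/v2/country/{codes}/indicator/{indicator}"
        f"?format=json&date={start}:{end}&per_page=1000"
    )


def records_to_frame(payload):
    """Flatten the assumed [metadata, records] JSON structure into a DataFrame."""
    _meta, records = payload
    rows = [
        {"country": r["country"]["value"], "year": int(r["date"]), "value": r["value"]}
        for r in records
    ]
    return pd.DataFrame(rows)


# Placeholder payload in the assumed shape; the values are invented, not real data.
sample = [
    {"page": 1, "pages": 1},
    [
        {"country": {"id": "AA", "value": "Country A"}, "date": "2012", "value": 100.0},
        {"country": {"id": "BB", "value": "Country B"}, "date": "2012", "value": 250.0},
    ],
]

url = worldbank_url(["GBR", "IND"], "NY.GDP.PCAP.KD", 2010, 2012)
gdp = records_to_frame(sample)
print(url)
print(gdp)
```

In practice the payload would come from fetching the URL and decoding the JSON; that step is left out here so the sketch runs without a network connection.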

So where do we get the aid ranking from?

There are two ways of doing this: one is to look for local UK sources (eg from DFID perhaps), the other is to look for international sources of data. The advantage of the former is that these are presumably the sources that whoever answered the question went to. The advantage of the latter is that we should be able to generalise the question to query similar rankings for aid distributed by other countries.

“Official Development Assistance” seems to be a key phrase, with a quick websearch for that phrase and the term “data” turning up this Aid statistics – charts, tables and databases resource page, which in turn points to a whole raft of datatables as Excel files detailing statistics on resource flows to developing countries; the International Development Statistics (IDS) online databases page links to several more general online databases. (There’s also a beta data.oecd.org site.)

Forsaking the raw data files for a minute, the site claims that “the Query Wizard for International Development Statistics [QWIDS] is the easiest way to search our database as it automatically extracts the most appropriate dataset from OECD.Stat to match your search” – so let’s try that… QWIDS.

[Screenshot: QWIDS, Query Wizard for International Development Statistics]

Nice and simple then…?!

A bit of tinkering (setting the donor, and unticking recipients so that only countries, rather than countries and groupings, are included) gives what I think is the data for the aid disbursements from the UK to other countries, data I could export as a CSV file; but there are no tools onsite to help me look at the top 10.
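
Getting a top 10 out of an exported CSV file is a short job in pandas. The sketch below assumes a simple export with one row per recipient per year; the column names, recipient labels and amounts are all placeholders, since I haven't reproduced the actual QWIDS export format here.

```python
import io
import pandas as pd

# Placeholder export; the column names and all figures are invented.
export = io.StringIO(
    "recipient,year,amount\n"
    "Country A,2012,120.5\n"
    "Country B,2012,310.0\n"
    "Region X Total,2012,900.0\n"
    "Country C,2012,75.2\n"
    "Country B,2011,280.0\n"
)
aid = pd.read_csv(export)

# Hypothetical list of grouping labels to drop so that only countries remain.
groupings = ["Region X Total"]

# Keep one year, exclude groupings, sort by amount and take the first ten rows.
top = (
    aid[(aid["year"] == 2012) & ~aid["recipient"].isin(groupings)]
    .sort_values("amount", ascending=False)
    .head(10)
    .reset_index(drop=True)
)
top["rank"] = top.index + 1
print(top)
```

The same exclusion list would also get round the problem of groupings being mixed in with countries when sorting.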

[Screenshot: QWIDS query]

Poking around, it looks like the data’s also there to allow us to look at disbursements (or perhaps just allocations) by donor country and sector into a particular country? Maybe?! This would then let us see how aid was being allocated from the UK to the top 10 recipients, broken down by sector, which might be more illuminating? I also wonder if there are any relationships between aid paid by donors into a particular sector, and imports into the recipient country from the donor country within the same sectors? For this, we need trade data breakdowns. (We can get total flows between countries (I think?!) but I’m not sure how to find the data broken down by sector?)

The stats.oecd.org site does let us sort, but I couldn’t find an easy or clean way to limit results to countries, and exclude groupings:

[Screenshot: OECD Statistics, aid]

The ordering (of aid disbursements from the UK in 2012) matches the rank order given in the response to my MP’s question.

For the GDP and GDP per capita data, we can go to the World Bank:

[Screenshot: World Bank data table, GDP per capita (constant 2005 US$)]

Note a couple of things – units tend to be given in US dollars rather than Sterling; there are all sorts of US dollars… (see for example Accounting for Inflation – Deflators, or “What Does ‘Prices in Real Terms’ Actually Mean?”).
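
As a reminder of what a “constant” or “real terms” figure involves, here is a minimal sketch of rebasing a cash amount using a deflator index. The index values and the amount are made up for the example, and the function name is my own.

```python
def to_real_terms(nominal, deflator_year, deflator_base):
    """Re-express a cash amount in the prices of the base year."""
    return nominal * deflator_base / deflator_year


# Made-up deflator index values (base year 2005 = 100).
deflators = {2005: 100.0, 2012: 125.0}

# 500 in 2012 cash terms, expressed in 2005 prices.
real_2012 = to_real_terms(500.0, deflators[2012], deflators[2005])
print(real_2012)  # 400.0
```

Comparing figures across sources only makes sense once they share a currency, a price base year and a deflator series.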

Hmm… maybe it would have been easier to find the data on the DFID site instead…

PS Indeed it was – Statistics on International Development 2013 – Tables has a link to a dataset that contains the league table: “Table 4: Top Twenty Recipients UK Net Bilateral ODA 2010 – 2012”.

Written by Tony Hirst

September 15, 2014 at 1:35 pm

Posted in Anything you want

Local Data Journalism – Care Homes

The front page of this week’s Isle of Wight County Press describes a tragic incident relating to a particular care home on the Island earlier this year:

[Photos: County Press front page story]

(Unfortunately, the story doesn’t seem to appear on the County Press’ website? Must be part of a “divide the news into print and online, and never the twain shall meet” strategy?!)

As I recently started pottering around the CQC website and various datasets they publish, I thought I’d jot down a few notes about what I could find. The clues from the IWCP article were the name of the care home – Waxham House, High Park Road, Ryde – and the proprietor – Sanjay Ramdany.

Using the “CQC Care Directory – With Filters” from the CQC data and information page, I found a couple of homes registered to that provider.

1-120578256, 19/01/2011, Waxham House, 1 High Park Road, Ryde, Isle of Wight, PO33 1BP
1-120578313, 19/01/2011, Cornelia Heights, 93 George Street, Ryde, Isle of Wight, PO33 2JE

1-101701588, Mr Sanjay Prakashsingh Ramdany & Mrs Sandhya Kumari Ramdany
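
With the directory loaded into pandas, that sort of lookup is a partial string match on the provider name. The sketch below rebuilds just the two rows quoted above rather than loading the real file; the column names are my own guesses rather than the headings used in the CQC directory, and the date column is left unlabelled because its meaning isn't stated here.

```python
import pandas as pd

# The two location rows quoted above; the column names are assumptions,
# not necessarily the headings used in the CQC directory file.
locations = pd.DataFrame(
    [
        ("1-120578256", "19/01/2011", "Waxham House",
         "1 High Park Road, Ryde, Isle of Wight", "PO33 1BP"),
        ("1-120578313", "19/01/2011", "Cornelia Heights",
         "93 George Street, Ryde, Isle of Wight", "PO33 2JE"),
    ],
    columns=["location_id", "date", "name", "address", "postcode"],
)
locations["provider_id"] = "1-101701588"
locations["provider_name"] = "Mr Sanjay Prakashsingh Ramdany & Mrs Sandhya Kumari Ramdany"

# Dates in the listing are day/month/year, so parse them explicitly.
locations["date"] = pd.to_datetime(locations["date"], format="%d/%m/%Y")


def homes_for(directory, name_fragment):
    """Return locations whose provider name contains the given fragment."""
    mask = directory["provider_name"].str.contains(name_fragment, case=False, regex=False)
    return directory.loc[mask, ["location_id", "name", "postcode"]]


matches = homes_for(locations, "ramdany")
print(matches)
```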
	

Looking up “Waxham House” on the CQC website gives us a copy of the latest report outcome:

[Screenshot: CQC report outcome for Waxham House]

Looking at the breadcrumb navigation, it seems we can directly get a list of other homes operated by the same proprietors:

[Screenshot: CQC provider page]

I wonder if we can search the site by proprietor name too?

[Screenshot: CQC proprietor search]

Looks like it…

So how did their other home fare?

[Screenshots: CQC report outcome for Cornelia Heights]

Hmmm…

By the by, according to the Food Standards Agency, how’s the food?

[Screenshots: Food Standards Agency results for Waxham House and Cornelia Heights]

And how much money is the local council paying these homes?

(Note – I haven’t updated the following datasets for a bit – I also note I need to add dates to the transaction tables. See: local spending explorer info; app.)

[Click through on the image to see the app - hit Search to remove the error message and load the data!]

[Screenshots: IW Council Spending Explorer results for Waxham House and Cornelia Heights]

Why the refunds?

A check on OpenCorporates for director names turned up nothing.

I’m not trying to offer any story here about the actual case reported by the County Press. This is more a partial account of how we can start to look for data around a story, to see whether open data sources can tell us more about it.

Written by Tony Hirst

September 14, 2014 at 9:55 am

Posted in Anything you want
