Category: Open Data

A Recipe for Automatically Going From Data to Text to Reveal.js Slides

Over the last few years, I’ve experimented on and off with various recipes for creating text reports from tabular data sets (spreadsheet plugins are also starting to appear with a similar aim in mind). There are several issues associated with this, including:

  • identifying what data or insight you want to report from your dataset;
  • (automatically deriving the insights);
  • constructing appropriate sentences from the data;
  • organising the sentences into some sort of narrative structure;
  • making the sentences read well together.

Another approach to humanising the reporting of tabular data is to generate templated webpages that review and report on the contents of a dataset; this has certain similarities to dashboard style reporting, mixing tables and charts, although some simple templated text may also be generated to populate the page.

In a business context, reporting often happens via Powerpoint presentations. Slides within the presentation deck may include content pulled from a templated spreadsheet, which itself may automatically generate tables and charts for such reuse from a new dataset. In this case, the recipe may look something like:

[Image: Excel data-to-slide pipeline diagram]

#render via: http://blockdiag.com/en/blockdiag/demo.html
blockdiag {
  X1 [label = "macro"];
  X2 [label = "macro"];

  Y1 [label = "Powerpoint slide"];
  Y2 [label = "Powerpoint slide"];

  data -> Excel -> Chart -> X1 -> Y1;
  Excel -> Table -> X2 -> Y2;
}

In the previous couple of posts, the observant amongst you may have noticed I’ve been exploring a couple of components for a recipe that can be used to generate reveal.js browser-based presentations around the 20% of items that account for 80% of the spend.

The dataset I’ve been tinkering with is a set of monthly transparency spending data from the Isle of Wight Council. Recent releases have the form:

[Image: sample rows from the Isle of Wight Council transparency spending data]

So as hinted at previously, it’s possible to use the following sort of process to automatically generate reveal.js slideshows from a Jupyter notebook with appropriately configured slide cells (actually, normal cells with an appropriate metadata element set) used as an intermediate representation.

[Image: data-to-text-to-slides pipeline diagram]

blockdiag {
  X1 [label = "text"];
  X2 [label = "Jupyter notebook\n(slide mode)"];
  X3 [label = "reveal.js\npresentation"];

  Y1 [label = "text"];
  Y2 [label = "text"];
  Y3 [label = "text"];

  data -> "pandas dataframe" -> X1 -> X2 -> X3;
  "pandas dataframe" -> Y1, Y2, Y3 -> X2 -> X3;

  Y2 [shape = "dots"];
}
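
The notebook-generation step itself needs little more than nbformat: build markdown cells, tag each one with the slideshow metadata element, and hand the result to nbconvert. Here’s a minimal sketch, with placeholder cell contents standing in for the generated text:

import nbformat
from nbformat.v4 import new_notebook, new_markdown_cell

def slide_cell(md_text, slide_type="slide"):
    # Tag a markdown cell so the reveal.js exporter treats it as a
    # slide, subslide or fragment.
    cell = new_markdown_cell(md_text)
    cell.metadata["slideshow"] = {"slide_type": slide_type}
    return cell

# Placeholder content - in practice these strings come from the
# dataframe-driven text generation.
cells = [
    slide_cell("# Spend by directorate"),
    slide_cell("- Adult Services: £7,746,875.55", "fragment"),
    slide_cell("## Adult Services breakdown", "subslide"),
]

nb = new_notebook(cells=cells)
nbformat.write(nb, "autoslides.ipynb")

# Then, from the command line:
#   jupyter nbconvert autoslides.ipynb --to slides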

There’s an example slideshow based on October 2016 data here. Note that some slides have “subslides”, that is, slides underneath them, so watch the arrow indicators bottom left to keep track of when they’re available. Note also that the scrolling is a bit hit and miss – ideally, a new slide would always be scrolled to the top, and for fragments inserted into a slide one at a time, the slide should scroll down to follow them.

The structure of the presentation is broadly as follows:

[Image: blockdiag sketch of the presentation structure]

For example, here’s a summary slide of the spend by directorate – note that we can embed charts easily enough. (The charts are styled using seaborn, so a range of alternative themes is trivially available.) The separate directorate items are brought in one at a time as fragments.

[Image: summary slide of spend by directorate]

The next slide reviews the capital versus revenue spend for a particular directorate, broken down by expenses type (corresponding slides are generated for all other directorates). (I also did a breakdown for each directorate by service area.)

The items listed are ordered by value, and taken together account for at least 80% of the spend in the corresponding area. Any further items contributing more than 5%(?) of the corresponding spend are also listed.
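
The selection logic behind that amounts to something like the following sketch, assuming the spend figures are available as a pandas Series (the column names in the usage comment are illustrative, not necessarily the ones in the transparency data):

import pandas as pd

def pareto_items(spend, threshold=0.8, also_over=0.05):
    # Return the items that together account for at least `threshold` of the
    # total, plus any further item individually contributing more than `also_over`.
    spend = spend.sort_values(ascending=False)
    share = spend / spend.sum()
    core = share.cumsum() <= threshold
    if not core.all():
        core.iloc[core.sum()] = True  # include the item that tips us over the threshold
    return spend[core | (share > also_over)]

# Illustrative usage:
# pareto_items(df.groupby("Expenses Type")["Amount"].sum())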

[Image: capital versus revenue slide for a directorate, broken down by expenses type]

Notice that subslides are available going down from this slide, rather than across the main slides in the deck. This 1.5D structure means we can put an element of flexible narrative design into the presentation, giving the reader an opportunity to explore the data, but in a constrained way.

In this case, I generated subslides for each expenses type contributing significantly to the capital and revenue pots, and then added a breakdown of the major suppliers for that spending area.

[Image: subslide showing major suppliers for a spending area]

This just represents a first pass at generating a 1.5D slide deck from a tabular dataset. A Pareto (80/20) heuristic is used to try to prioritise the information displayed so as to account for 80% of the spend in different areas, or other significant contributions.

Applying this principle repeatedly allows us to identify major spending areas, and then major suppliers within those spending areas.

The next step is to look at other ways of segmenting and structuring the data in order to produce reports that might actually be useful…

If you have any ideas, please let me know via the comments, or get in touch directly…

PS FWIW, it should be easy enough to run any other dataset that looks broadly like the example at the top through the same code with only a couple of minor tweaks…

A Unit of Comparison for Local Council Budget Consultations, Based on Transparency Spending Data – ASCdays?

A few years ago, via the BBC Radio 4 & World Service programme More or Less (incidentally co-produced by the OU), I came across the notion of micromorts (Contextualising the Chance of Something Happening – Micromorts), a one in a million chance of death that can be used as a unit of risk to compare various likelihoods of dying. Associated with this measure is David Spiegelhalter’s microlife, “30 minutes of your life expectancy”. The point behind the microlife measure is that it provides a way of comparing life threatening risks based on how much life is likely to be lost, on average, when exposed to such risks.

Life expectancy for a man aged 22 in the UK is currently about 79 years, which is an extra 57 years, or 20,800 days, or 500,000 hours, or 1 million half hours. So, a young man of 22 typically has 1,000,000 half-hours (57 years) ahead of him, the same as a 26 year-old woman. We define a microlife as the result of a chronic risk that reduces life, on average, by just one of the million half hours that they have left.

The idea of micromorts came to mind last night as I was reflecting on a public budget consultation held by the Isle of Wight Council yesterday (a day that also saw the Council’s Leader and Deputy Leader resign their positions). The Council needs to find £20 million over the next 3 years, starting with £7.5m in the next financial year; this can come through increased funding, or cuts. By far the biggest chunk of expenditure by the council, as with all councils, is on adult social care (ASC) [community care statistics /via @jonpoole].

As with every year for the past however many years, I’ve had a vague resolution to do something with local council spending data, and never got very far. The early dabblings I’ve had with the data this year (and intend to continue…) reinforce the notion that ASC is expensive. Here’s a quick summary of the spending data items for October, 2016:

The spend for each of the directorates was as follows:

  • Adult Services:
    • total spend: £7,746,875.55 (48.33% of total monthly spend)
    • capital: £395,900.06 (5.11% of directorate monthly spend)
    • revenue: £7,350,975.49 (94.89% of directorate monthly spend)
  • Chief Executive:
    • total spend: £501,021.32 (3.13% of total monthly spend)
    • capital: £492,507.54 (98.30% of directorate monthly spend)
    • revenue: £8,513.78 (1.70% of directorate monthly spend)
  • Childrens Services:
    • total spend: £2,044,524.26 (12.76% of total monthly spend)
    • capital: £243,675.08 (11.92% of directorate monthly spend)
    • revenue: £1,800,849.18 (88.08% of directorate monthly spend)
  • Place:
    • total spend: £4,924,117.40 (30.72% of total monthly spend)
    • capital: £974,024.13 (19.78% of directorate monthly spend)
    • revenue: £3,950,093.27 (80.22% of directorate monthly spend)
  • Public Health:
    • total spend: £434,654.13 (2.71% of total monthly spend)
    • revenue: £434,654.13 (100.00% of directorate monthly spend)
  • Regeneration:
    • total spend: £57.65 (0.00% of total monthly spend)
    • revenue: £57.65 (100.00% of directorate monthly spend)
  • Resources:
    • total spend: £377,172.20 (2.35% of total monthly spend)
    • capital: £20,367.87 (5.40% of directorate monthly spend)
    • revenue: £356,804.33 (94.60% of directorate monthly spend)

Cancelling out the Adult Services revenue spend for a single month would more or less match the £7.5 million required to make up next year’s funds. That’s unlikely to happen, but it does perhaps hint at a possible unit of comparison when trying to make budget decisions, or at least, support budget consultations.

From my naive perspective, adult social care needs to support a certain number of people, a number that evolves (probably?) in line with demographics. One of the ways people exit care is by dying, though the service is set up to minimise harm and help prolong life. Folk may also be transiently cared for (that is, they enter the care system and then leave it). By looking at the amount spent on adult social care, we can come up with an average cost (mean, median?) per person per day of adult social care – ASCdays. We can reduce the total cost by reducing the amount of time folk spend in the system, either by shortening transient stays or postponing entry into the system.
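
As a back-of-the-envelope illustration of the sort of unit I have in mind (the headcount below is entirely made up – it’s exactly the figure you’d need real data for):

# October 2016 Adult Services revenue spend, from the summary above.
asc_monthly_revenue = 7350975.49
days_in_month = 31
people_in_care = 2000            # HYPOTHETICAL headcount, for illustration only

asc_per_day = asc_monthly_revenue / days_in_month   # roughly £237k per day
ascday_unit = asc_per_day / people_in_care          # roughly £119 per person per day

# Any other spend item can then be costed in ASCday equivalents:
respite_week = 5000              # hypothetical cost of a week's respite care
print(respite_week / ascday_unit)                   # about 42 ASCdays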

So what I’ve started wondering is this: as one way of trying to make sense of transparency spending data, is there any use in casting it into equivalent units of ASCdays? If we use ASCday equivalent units, can we take a weak systems view and try to get a feel for whether a cut to a particular service (or improvement of another) can help us get a handle on the ASC expenditure – or whether it might cause problems down the line?

For example, suppose a week’s respite care costs the same as two weeks’ worth of ASCdays. If that week’s respite care keeps someone out of the adult care service for a month, we’re quids in. If cutting respite care saves 100 ASCdays of funding, but is likely to bring just one person into the care system 3 months early, we might start to doubt whether it will actually lead to any saving at all. (Longer-tail savings complicate matters, given councils need to balance a budget within a financial year. Spending money this year to save next year requires access to reserves – and confidence in your bet…)

For trying to make budget decisions, or helping engage citizens in budgetary consultations, costing things in ASCday equivalents, and then trying to come up with some probabilities about the likelihood that a particular cut or expense will result in a certain number of people entering or leaving ASC sooner or later, may help you get a feel for the consequences of a particular action.

As to whether prior probabilities exist around whether cutting this service, or supporting that one, is likely to impact on the adult care system – maybe data for that is out there too?

Convention Based URLs Support Automation

I spent a chunk of last week at Curriculum Development Hackathon for a Data Carpentry workshop on Reproducible Research using Jupyter Notebooks (I’d like to thank the organisers for the travel support). One of the planned curriculum areas looked at data project organisation, another on automation. Poking around on an NHS data publication webpage for a particular statistical work area just now suggests an example of how the two inter-relate… and how creating inconsistent URLs or filenames makes automatically downloading similar files a bit of a faff when it could be so easy…

To begin with, the A&E Attendances and Emergency Admissions statistical work area has URL:

https://www.england.nhs.uk/statistics/statistical-work-areas/ae-waiting-times-and-activity/

The crumb trail in the on-page navigation has the form:

Home -> Statistics -> Statistical Work Areas -> A&E Attendances and Emergency Admissions

which we might note jars somewhat with the slug ae-waiting-times-and-activity, and perhaps reflects some sort of historical legacy in how the data was treated previously…

Monthly data is collected on separate financial year related pages linked from that page:

https://www.england.nhs.uk/statistics/statistical-work-areas/ae-waiting-times-and-activity/statistical-work-areasae-waiting-times-and-activityae-attendances-and-emergency-admissions-2016-17/

https://www.england.nhs.uk/statistics/statistical-work-areas/ae-waiting-times-and-activity/statistical-work-areasae-waiting-times-and-activityae-attendances-and-emergency-admissions-2015-16-monthly-3/

The breadcrumb for these pages has the form:

Home -> Statistics -> Statistical Work Areas -> A&E Attendances and Emergency Admissions -> A&E Attendances and Emergency Admissions 20MM-NN

A few observations about those financial-year-related page URLs. Firstly, the path is rooted on the parent page (a Good Thing), but the slug appears to have been mangled together from a more reasonable parent path (statistical-work-areasae-waiting-times-and-activity looks as if it has been collapsed from statistical-work-areas/ae-waiting-times-and-activity).

The next part of the URL specifies the path to the A & E Attendances and Emergency Admissions page for a particular year, with an appropriate slug for the name – ae-attendances-and-emergency-admissions- but differently formed elements for the years: 2016-17 compared to 2015-16-monthly-3.

(Note that the 2015-16 monthly listing is incomplete and starts in June 2015.)

If we look at URLs for some of the monthly 2016-17 Excel data file downloads, we see inconsistency in the filenames:

https://www.england.nhs.uk/statistics/wp-content/uploads/sites/2/2016/06/November-2016-AE-by-provider-W0Hp0.xls
https://www.england.nhs.uk/statistics/wp-content/uploads/sites/2/2016/06/October-2016-AE-by-provider-Nxpai.xls
https://www.england.nhs.uk/statistics/wp-content/uploads/sites/2/2016/06/September-2016-AE-by-provider-BtD4b.xls

(Note that CSV data seems only to be available for the latest (November 2016) data set. I don’t know if this means that the CSV data link only appears for the current month, or data in the CSV format only started to be published in November 2016.)

For the previous year we get:

https://www.england.nhs.uk/statistics/wp-content/uploads/sites/2/2015/08/March-2016-AE-by-provider-9g0dQ-Revised-11082016.xls
https://www.england.nhs.uk/statistics/wp-content/uploads/sites/2/2015/08/February-2016-AE-by-provider-1gWNy-Revised-11082016.xls

and so on.

Inspection of these URLs suggests:

  1. the data is being uploaded to and published from a WordPress site (wp-content/uploads);
  2. the path to the data directory for the annual collection is minted according to the month in which the first dataset of the year is uploaded (data takes a month or two to be uploaded, so presumably the April 2016 data was posted in June 2016 (2016/06); the 2015 data started late, the first month (June 2015) presumably being uploaded in August of that year (2015/08));
  3. the month slug for the data file starts off fine, being of the form MONTH-YEAR-AE-by-provider-, but then breaks things by having some sort of code value that perhaps uniquely identifies the version of the file;
  4. the month slug may be further broken by the addition of a revision element (eg -Revised-11082016).

If the URLs all had a common pattern, it would be easy enough to automate their generation from a slug pattern and month/year combination, and then automatically download them. (I haven’t yet explored inside each spreadsheet to see what inconsistency errors/horrors make it non-trivial to try to combine the monthly data into a single historical data set…)
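
To show what that could look like if the filenames did follow a single pattern (they don’t – the pattern below drops the unpredictable code element, so these URLs won’t actually resolve):

# Illustrative only: what scripted URL generation would look like if the
# filenames followed one consistent pattern (the real ones don't).
BASE = "https://www.england.nhs.uk/statistics/wp-content/uploads/sites/2"
PATTERN = "{base}/{year}/{upload_month:02d}/{month}-{year}-AE-by-provider.xls"

for month in ["April", "May", "June"]:
    print(PATTERN.format(base=BASE, year=2016, upload_month=6, month=month))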

As it is, to automate the download of the files requires scraping the web pages for the links, or manually retrieving them. (At least the link text on the pages seems to be reasonably consistent!)
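
A rough sketch of the scraping approach, using requests and BeautifulSoup, and leaning on the reasonably consistent AE-by-provider filenames:

# Scrape the monthly spreadsheet links from a financial year page and
# download any we don't already have.
import os
import requests
from bs4 import BeautifulSoup

year_page = ("https://www.england.nhs.uk/statistics/statistical-work-areas/"
             "ae-waiting-times-and-activity/statistical-work-areasae-waiting-"
             "times-and-activityae-attendances-and-emergency-admissions-2016-17/")

soup = BeautifulSoup(requests.get(year_page).text, "html.parser")

links = [a["href"] for a in soup.find_all("a", href=True)
         if "AE-by-provider" in a["href"] and a["href"].endswith(".xls")]

for url in links:
    filename = url.split("/")[-1]
    if not os.path.exists(filename):
        with open(filename, "wb") as f:
            f.write(requests.get(url).content)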

Datadive Reproducibility – Time for a DataBox?

Whilst at the Global Witness “Beneficial Ownership” datadive a couple of weeks ago, one of the things I was pondering – how to make the weekend’s discoveries reproducible on the one hand, and useful as a set of still-working legacy tooling on the other – blended into another: how to provide an on-ramp for folk attending the event who were not familiar with the data or the way in which it was provided.

Event facilitators DataKind worked in advance with Global Witness to produce an orientation exercise based around a sample dataset. Several other prepped datasets were also made available via USB memory sticks distributed as required to the three different working groups.

The orientation exercise was framed as a series of questions applied to a core dataset, a denormalised flat CSV file of 250MB or so, containing just over a million rows, with headers. (I think Excel could cope with this – not sure if that was by design or happy accident.)

For data wranglers expert at working with raw datafiles and their own computers, this doesn’t present much of a problem. My gut reaction was to open the datafile into a pandas dataframe in a Jupyter notebook and twiddle with it there; but as pandas holds dataframes in memory, this may not be the best approach, particularly if you have multiple large dataframes open at the same time. As previously mentioned, I think the data also fit into Excel okay.

Another approach after previewing the data, even if just by looking at it on the command line with a head command, was to load the data into a database and look at it from there.

This immediately begs several questions of course – if I have a database set up on my machine and import the data into it without thinking about it, how can someone else recreate that? If I don’t have a database on my machine (so I need to install one and get it running) and/or I don’t then know how to get data into the database, I’m no better off. (It may well be that there are great analysts who know how to work with data stored in databases but don’t know how to do the data engineering stuff of getting the database up and running and populated with data in the first place.)

My preferred solution for this at the moment is to see whether Docker containers can help. And in this case, I think they can. I’d already had a couple of quick plays looking at getting the Companies House significant ownership data into various databases (Mongo, neo4j) and used a recipe that linked a database container with a Jupyter notebook server that I could write my analysis scripts in (linking RStudio rather than Jupyter notebooks is just as straightforward).

Using those patterns, it was easy enough to create a similar recipe to link a Postgres database container to a Jupyter notebook server. The next step – loading the data in. Now it just so happens that in the days before the datadive, I’d been putting together some revised notebooks for an OU course on data management and analysis that dealt with quick ways of loading data into a Postgres database, so I wondered whether those notes provided enough scaffolding to help me load the sample core data into a database: a) even if I was new to working with databases, and b) in a reproducible way. The short answer was “yes”. Putting the two steps together, the results can be found here: Getting started – Database Loader Notebook.
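
The core of the load amounts to something like the following sketch (the connection string, filename and table name are placeholders for whatever the linked containers and dataset actually use, and the psycopg2 driver needs to be installed for SQLAlchemy to talk to Postgres):

# Chunked CSV-to-Postgres load using pandas and SQLAlchemy; chunking keeps
# memory usage down for the ~250MB file.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://postgres:postgres@postgres:5432/datadive")

for chunk in pd.read_csv("core_dataset.csv", chunksize=100000):
    chunk.to_sql("core", engine, if_exists="append", index=False)

# Sanity check - query the data back into a dataframe.
print(pd.read_sql("SELECT COUNT(*) FROM core", engine))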

With the data in a reproducibly shareable and “live” queryable form, I put together a notebook that worked through the orientation exercises. Along the way, I found a new-to-me HTML5/d3js package for displaying small interactive network diagrams, visjs2jupyter. My attempt at the orientation exercises can be found here: Orientation Activities.

Whilst I am all in favour of expert data wranglers using their own recipes, tools and methods for working with the data – that’s part of the point of these expert datadives – I think there may also be mileage in providing a base install where the data is in some sort of immediately queryable form, such as in a minimal, even if not properly normalised, database. This means that datasets too large to be manipulated in memory or loaded into Excel can be worked with immediately. It also means that orientation materials can be produced that pose interesting questions that can be used to get a quick overview of the data, or tutorial materials produced that show how to work with off-the-shelf powertool combinations (Jupyter notebooks / Python / pandas / PostgreSQL, for example, or RStudio / R / PostgreSQL).

Providing a base setup to start from also acts as an invitation to extend that environment in a reproducible way over the course of the datadive. When working on your own computer with your own tooling, it can be way too easy to forget what packages (apt-get, pip and so on) you have pre-installed that will cause breaking changes to any outcome code you share with others who do not have the same environment. Creating a fresh environment for the datadive, and documenting what you add to it, can help with that, but testing in a linked-container, but otherwise isolated, context really helps you keep track of what you needed to add to make things work!

If you also keep track of what you needed to do to handle undeclared file encodings, weird separator characters, or password protected zip files from the provided files, it means that others should be able to work with the files in a reliable way…

(Just a note on that point for datadive organisers – metadata about file encodings, unusual zip formats, weird separator encodings etc is a useful thing to share, rather than have to painfully discover….)

Using tools like Docker is one way of improving the shareability of immediately queryable data, but is there an even quicker way? One thing on my to do list is to explore the idea of a “databox”, a Raspberry Pi image that, when booted, runs a database server and a Jupyter notebook (or RStudio) environment. The database can be pre-seeded with data for the datadive, so all that should be required is for an individual to plug the Raspberry Pi into their computer with an ethernet cable, and run from there. (This won’t work for really large datasets – the Raspberry Pi lacks grunt – but it’s enough to get you started.)

Note that these approaches scale out to other domains, such as data journalism projects (each project on its own Raspberry Pi SD card or docker-compose setup…)

Revisiting Diabetes Prescribing Data

Last year, I had a quick dabble with creating Data Driven Press Releases From HSCIC Data based around a diabetes prescribing data release. Noticing that the figures for 2015/16 had been released over the summer break, I revisited last year’s script to see if it works with this year’s data – which it does, save for a few tweaks to the import of the spreadsheets.

So this year’s report for the Isle of Wight is as follows, constructed automatically from the data by looking across several sheets in two spreadsheets:

Figures recently published by the HSCIC for the NHS Isle Of Wight CCG show that for the reporting period 2015/16, the total Net Ingredient Costs (NIC) for prescribed diabetes drugs was £2,579,592.41, representing 9.95% of overall Net Ingredient Costs. The NHS Isle Of Wight CCG prescribed 137,741 diabetes drugs items, representing 4.28% of all prescribed items. The average net ingredient cost (NIC) was £18.73 per item. This compares to 4.14% of items (10.31% of total NIC) in the Wessex (Q70) region and 4.61% of items (10.57% of total NIC) in England.

Of the total diabetes drugs prescribed, Insulins accounted for 21,884 items at a total NIC of £1,071,880.50 (£48.98 per item (on average), 0.68% of overall prescriptions, 4.13% of total NIC) and Antidiabetic Drugs accounted for 94,347 items at a total NIC of £890,424.84 (£9.44 per item (on average), 2.93% of overall prescriptions, 3.43% of total NIC). Diagnostic and monitoring devices accounted for 20,485 items at a total NIC of £605,971.30 (£29.58 per item (on average), 0.64% of overall prescriptions, 2.34% of total NIC).

For the NHS ISLE OF WIGHT CCG, the NIC in 2015/16 per patient on the QOF diabetes register in 2014/15 was £330.42. The QOF prevalence of diabetes, aged 17+, for the NHS ISLE OF WIGHT CCG in 2014/15 was 6.61%. This compares to a prevalence rate of 6.37% in Wessex and 5.83% across England.

Creating reports for other CCGs is simply a matter of changing the CCG code. On the to do list is to pull data from last year as well as this year into a simple database, and then write some more sentence templates that compare the year-on-year performance. (I also really need to have a think about a more sensible way of generating sentences!)
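
For reference, the sentence generation is currently little more than string templating along these lines (the field names are made up for the example; the real script pulls the values from the spreadsheets):

# Illustrative sentence template of the kind used to build the report above.
template = ("Figures recently published by the HSCIC for the {ccg} show that "
            "for the reporting period {period}, the total Net Ingredient Costs "
            "(NIC) for prescribed diabetes drugs was £{nic:,.2f}, representing "
            "{nic_pc:.2f}% of overall Net Ingredient Costs.")

print(template.format(ccg="NHS Isle Of Wight CCG", period="2015/16",
                      nic=2579592.41, nic_pc=9.95))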

After an interesting chat last night with Gary Warner from Island-based Pinnacle Health Partnership, a social enterprise providing backend services for community pharmacies, I thought I’d have a poke around what services are out there based on the NHS open prescribing data, as a shortcut to populating my own databases with the original data (each month’s dataset comes in at around 1GB).

OpenPrescribing.net seems really useful in this respect, at least as a quick way in to summaries of the data – the API allows you to pull down data by CCG, as well as GP practices within a CCG, and break out prescriptions by item using BNF codes. A handy look up service also helps find items by BNF section, such as Drugs used in diabetes (BNF 6.1).
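
By way of example, the sort of call involved looks something like this (the endpoint, parameter names and CCG code are as I’ve been using them – treat them as indicative and check the current API docs rather than trusting this sketch):

# Pull spending on BNF section 6.1 (Drugs used in diabetes) for practices
# within a CCG from the openprescribing.net API into a pandas dataframe.
import requests
import pandas as pd

API = "https://openprescribing.net/api/1.0"
params = {"code": "0601",    # BNF section 6.1, Drugs used in diabetes
          "org": "10L",      # NHS Isle of Wight CCG
          "format": "json"}

items = requests.get(API + "/spending_by_practice/", params=params).json()
df = pd.DataFrame(items)
print(df.head())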

I’ve posted a quick sketch notebook as a gist, also embedded below.  But here’s a quick glimpse at some of the first reports I had a look at generating. For example, we can look at the spend associated with particular BNF section codes broken down by GP practices in a particular CCG area:

[Image: spend for a BNF section broken down by GP practice]

and then group it by period (I made a crude guess at the financial year but I’m not sure what the dates in openprescribing actually relate to…) so the aggregates are indicative only. The aggregate value over the CCG for item counts seems to be broadly in line with the NHS Diabetes Prescribing report, which I took as weak confirmation that it’s sort of working!
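
The financial year grouping is just a crude mapping over the date column – something like the following, continuing from the dataframe pulled down in the previous sketch, and assuming the column names match the API output:

# Crude financial year aggregation: treat April-March as one financial year.
import pandas as pd

df["date"] = pd.to_datetime(df["date"])
df["fin_year"] = df["date"].apply(
    lambda d: "{}/{}".format(d.year, d.year + 1) if d.month >= 4
    else "{}/{}".format(d.year - 1, d.year))

print(df.groupby("fin_year")[["items", "actual_cost"]].sum())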

[Image: spend grouped by period/financial year]

We can also use the API to look up items by BNF section:

[Image: BNF section item lookup results]

If we loop round the items in a BNF subsection, we can generate reports about the prescribing of particular items within that subsection across practices, merging in the item names to make them easier to identify:

[Image: prescribing of individual items across practices]

The pandas charting tools aren’t brilliant – after the Yhat refresh of ggplot (for python), I think I need to revisit that library when tinkering in the python context – but we can do crude sketches quite easily.

[Image: crude pandas chart of the prescribing data]

Anyway, playtime over. It was interesting to give openprescribing.net a go, and given the chance I’ll try to play with it a bit more to explore some more quick reports to add to the diabetes notebook.

Bands Incorporated

A few weeks ago, as I was doodling with some Companies House director network mapping code and simple Companies House chatbot ideas, I tweeted an example of Iron Maiden’s company structure based on co-director relationships. Depending on how the original search is seeded, the maps may also include elements of band members’ own personal holdings/interests. The following map, for example, is seeded just from the Iron Maiden LLP company number:

[Image: Iron Maiden co-director network map]

If you know anything about the band, you’ll know Bruce Dickinson’s aircraft interests make complete sense…

That graph is actually a bipartite graph – nodes are either directors or companies. We can easily generate a projection of the graph that replaces directors that link companies by edges that represent “common director” links between companies:

[Image: Iron Maiden company network, projected onto common-director links]

(The edges are actually weighted, so the greater the edge weight, the more directors there are in common between the linked companies.)
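
For what it’s worth, the projection is a one-liner in networkx; here’s a minimal sketch with made-up node names, using an illustrative “kind” attribute to tell companies and directors apart:

# Project a director-company bipartite graph onto companies, with edge
# weights counting the directors the linked companies have in common.
import networkx as nx
from networkx.algorithms import bipartite

B = nx.Graph()
B.add_nodes_from(["Company A", "Company B"], kind="company")
B.add_nodes_from(["Director 1", "Director 2"], kind="director")
B.add_edges_from([("Director 1", "Company A"), ("Director 1", "Company B"),
                  ("Director 2", "Company A")])

companies = [n for n, d in B.nodes(data=True) if d["kind"] == "company"]
G = bipartite.weighted_projected_graph(B, companies)

print(G.edges(data=True))   # e.g. [('Company A', 'Company B', {'weight': 1})]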

In today’s Guardian, I notice they’re running a story about Radiohead’s company structure, with a parallel online piece, Radiohead’s corporate empire: inside the band’s dollars and cents, which shows how to get a story out of such a map, as well as how to re-present the original raw map to provide a bit more spatial semantic structure to it:

[Image: screenshot of the Guardian graphic, Radiohead’s corporate empire]

(The story also digs into the financial reports from some of the companies.)

By way of comparison, here’s my raw map of Radiohead’s current company structure, generated from Companies House data seeded on the company number for Radiohead Trademark:

[Image: Radiohead co-director network map]

It’s easy enough to grab the data for other bands. So how about someone like The Who? If we look in the immediate vicinity of The Who Group, we see core interests:

[Image: co-director network around The Who Group]

But if we look for linkage to the next level of co-director links, we start to see other corporate groups that hold at least one shared interest with the band members:

[Image: The Who network extended to second-level co-director links]

So what other bands incorporated in the UK might be worth mapping?

Want to Get Started With Open Data? Looking for an Introductory Programming Course?

Want to learn to code but never got round to it? The next presentation of OUr FutureLearn course Learn to Code for Data Analysis will teach you how to write your own programme code, a line at a time, to analyse real open data datasets. It starts on 6 June, 2016, runs for 4 weeks, and takes about 5 hours per week.

I’ve often thought that there are several obstacles to getting started with programming. Firstly, there’s the rationale or context: why bother/what could I possibly use programming for? Secondly, there are the practical difficulties: to write and execute programmes, you need to get a programming environment set up. Thirdly, there’s the so what: “okay, so I can programme now, but how do I use this in the real world?”

Many introductory programming courses reuse educational methods and motivational techniques or contexts developed to teach children (and often very young children) the basics of computer programming to set the scene: programming a “turtle” that can drive around the screen, for example, or garishly coloured visual programming environments that let you plug logical blocks together as if they were computational Lego. Great fun, and one way of demonstrating some of the programming principles common to all programming languages, but they don’t necessarily set you up for seeing how such techniques might be directly relevant to an IT problem or issue you face in your daily life. And it can be hard to see how you might use such environments or techniques at work to help you perform real tasks… (Because programmes can actually be good at that – automating the repetitive and working through large amounts of stuff on your behalf.) At the other extreme are professional programming environments, like geekily bloated versions of Microsoft Word or Excel, with confusing preference setups and menus and settings all over the place. And designed by hardcore programmers for hardcore programmers.

So the approach we’ve taken in the OU FutureLearn course Learn to Code for Data Analysis is slightly different to that.

The course uses a notebook style programming environment that blends text, programme code, and the outputs of running that code (such as charts and tables) in a single, editable web page accessed via your web browser.

[Image: Learn to Code notebook running in SageMathCloud]

To motivate your learning, we use real world, openly licensed data sets from organisations such as the World Bank and the United Nations – data you can download and access for yourself – that you can analyse and chart using your own programme code. A line at a time. Because each line does its own thing, each line is useful, and you can see what each line does to your dataset directly.

So that’s the rationale: learn to code so you can work with data (and that includes datasets much larger than you can load into Excel…)

The practicalities of setting up the notebook environment still have to be negotiated, of course. But we try to help you there too. If you want to download and install the programming environment on your computer, you can do, in the form of the freely available Anaconda Scientific Computing Python Distribution. Or you can access an online version of the notebook-based programming environment via SageMathCloud and do all your programming online, through your browser.

So that’s the practical issues hopefully sorted.

But what about the “so what”? Well, the language you’ll be learning is Python, a widely used programming language that makes it ridiculously easy to do powerful things.

[Image: Python cartoon, via https://xkcd.com/353/]

But not that easy, perhaps..?!

The environment you’ll be using – Jupyter notebooks – is also a “real world” technology, conceived as an open source platform for scientific computing but increasingly being used by journalists (data journalism, anyone?) and educators. It’s also attracted the attention of business, with companies such as IBM supporting the development of a range of interactive dashboard tools and backend service hooks that allow programmes written using the notebooks to be deployed as standalone online interactive dashboards.

The course won’t take you quite that far, but it will get you started, and safe in the knowledge that whatever you learn, as well as the environment you’re learning in, can be used directly to support your own data analysis activities at work, or at home as a civically minded open data armchair analyst.

So what are you waiting for? Sign up now and I’ll see you in the comments:-)