Category: Anything you want

(Superfluous?) Jupyter / pandas Notebook for Wrangling TEF Excel Spreadsheets

A couple of days ago, new Wonkhe employee Dave Kernohan (congrats!:-) got in touch asking if I’d be interested in helping wrangle the as then yet to be released TEF (Teaching Excellence Framework) data into a form they could work with. The suspicion was that if the data was be released in a piecemeal fashion – one Excel spreadsheet per institution – rather than as a single data set, it would be a bit of a faff to analyse…

I found an example of the sort of spreadsheet it looked might be used to publish the data in, and started hacking a notebook to parse the sheets and pull the data into a set of monolithic files.

As it was, the data was published in a variety of forms, rendering the notebook superfluous. But I’ll post it anyway as a curiosity here.

FWIW, the final, published spreadsheets differed slightly in layout and in the naming of the internal sheets, both of which broke the original notebook. The revised notebook is no less brittle – cell ranges are typically hard coded, rather than being auto detected.

The spreadsheets contain many compound tables – for example, in the following case we have full-time and part time student data in the same table. (I parse these out by hard coded cell ranges – I really should autodetect the row number of the row the headcount headers appear on and use those values, noting the number of rows in each subtable is the same…)

Also note the the use of single and multi-level columns headings in adjacent columns.

A single sheet may also contain multiple tables. For example, like this:

or like this:

As mentioned, the sheets also contained adminstrative data cruft, as revealed when opening the full sheet into a pandas dataframe:


Anyway – superfluous, as the data exists in a single long form CSV file anyway on the TEF data download site. But maybe useful as a crib for the future. Here’s a reminder of the link to the notebook.

PS another tool for wrangling spreadsheets that I really need to get my head round is databaker, which was originally developed for working with the spreadsheet monstrosities that ONS produce…

PPS the OU actively opted out of TEF, citing “justifiable reasons”… It will be interesting to see the NSS results this year… (and also see how NSS and TEF scores correlate).

So What Do the OU’s Lifelong Student Demographics Look Like?

In last week’s “state of the OU” presentation, mention was made of the OU’s rich history in innovation as something we could look back on, plus supporting “lifelong learning” as a (renewed?) commitment for the future.

Which got me wondering about what stories some of the OU’s historical data might have to tell us?

For example, for each of the last fifty years, I’d like to see some charts (for example, ggplot charts faceted by year) showing the distribution of:

  • students who joined the OU that year, by age;
  • the age distribution of *all* OU students that year;
  • the age distribution of OU students graduating that year;
  • the age distribution of the sum total of OU graduates over the life of the university, up to a particular point in time.

I’m probably completely wrong, but I’d guess the  shapes may have changed over the years. Depending how they’ve changed might help us tell ourselves some stories about what’s been going on over the last 50 years.

For example, imagine if the populations looked like this originally:

and looked more like this now?

Plotting all 50 years, in a 5 x 10 facet using the same x and y axes would tell at a glance how the university student body has evolved.

Animating the charts would tell another story.

Using a population pyramid (which typically contrasts gender by age range), we could look at the relative make up of the OU compared to other UK HEIs perhaps:

Again, doing this by year and animating may tell a useful story. Using absolute numbers rather than percentages would tell another. And again, these charts could be generated for each year’s intake, graduates, overall population, and accumulated graduates.

The charts might also help show us what lifelong learning has meant for OU populations over the last 50 years. I suspect that we’ve had two sorts of lifelong learning students over that period: students who mid-life signed up for their first degree (sometimes characterised as “second chancers who missed out the first time around”) and students who took one or two very specifically chosen modules to learn about a particular topic that might help with promotion or met a particular skills gap,

I’ve never really understood why the OU regime of the last 10 years have been hell bent on competing with other institutions for signing 18 year olds up to full (named) degrees. To support lifelong learning, don’t we need to provide upskilling in particular areas (one off modules, no entry requirements), lifelong support with access to all OU content for self study over the course of a career (or maybe a “lifelong” degree where you take a module every couple of years to fit in with career or “professional amateur” interests”,  or intense conversion courses to help with mid-career transitions?

Whatever – I’m guessing looking at some pictures and telling some stories off the back off them could provoke other ideas too… Not sure if the data is available in a handy form though?

Rolling Your Own Jupyter and RStudio Data Analysis Environment Around Apache Drill Using docker-compose

I had a bit of a play last night trying to hook a Jupyter notebook container up to an Apache Drill container using docker-compose. The idea was to have a shared data volume between the two of them, but I couldn’t for the life of me get that to work using the the docker-compose version 2 or 3 (services/volumes) syntax – for some reason, any of the Apache Drill containers I tried wouldn’t fire up properly.

So I eventually (3am…:-( went for a simpler approach, synching data through a local directory on host.

The result is something that looks like this:

The Apache Drill container, and an Apache Zookeeper container to keep it in check, I found via Dockerhub. I also reused an official RStudio container. The Jupyter container is one I rolled for TM351.

The Jupyter and RStudio containers can both talk to the Apache Drill container, and both analysis apps have access to their own data folder mounted in an application folder in the current directory on host.The data folders mount into separate directories in the Apache Drill container. Both applications can query into data files contained in either data directory as viewable from Apache Drill.

This is far from ideal, but it works. (The structure is as suggested so that RStudio and Jupyter scripts can both be used to download data into a data directory viewable from the Apache Drill container. Another approach would be to mount a separate ./data directory and provide some means for populating it with data files. Alternatively, if the files already exist on host,  mounting the host data directory onto a /data volume in the Apache Drill container would work too.

Here’s the docker-compose.yaml file I’ve ended up with:

  image: dialonce/drill
    - 8047:8047
    - zookeeper
    -  ./notebooks/data:/nbdata
    -  ./R/data:/rdata

  image: jplock/zookeeper

  container_name: notebook-apache-drill-test
  image: psychemedia/ou-tm351-jupyter-custom-pystack-test
    - 35200:8888
    - ./notebooks:/notebooks/
    - drill:drill

  container_name: rstudio-apache-drill-test
  image: rocker/tidyverse
    - PASSWORD=letmein
  #default user is: rstudio
    - ./R:/home/rstudio
    - 8787:8787
    - drill:drill

If you have docker installed and running, running docker-compose up -d in the folder containing the docker-compose.yaml file will launch three linked containers: Jupyter notebook on localhost port 35200, RStudio on port 8787, and Apache Drill on port 8047. If the ./notebooks, ./notebooks/data, ./R and ./R/data subfolders don’t exist they will be created.

We can use the clients to variously download data files and run Apache Drill queries against them. In Jupyter notebooks, I used the pydrill package to connect. Note the hostname used is the linked container name (in this case, drill).

If we download data to the ./notebooks/data folder which is mounted inside the Apache Drill container as /nbdata, we can query against it.

(Note – it probably would make sense to used a modified Apache Drill container configured to use CSV headers, as per Querying Large CSV Files With Apache Drill.)

We can also query against that same data file from the RStudio container. In this case I used the DrillR package (I had hoped to use the sergeant package (“drill sergeant”, I assume?! Sigh..;-) but it uses the RJDBC package which expects to find java installed, rather than DBI, and java isn’t installed in the rocker/tidyverse container I used.) UPDATE: sergeant now works without Java dependency... Thanks, Bob:-)

I’m not sure if DrillR is being actively developed, but it would be handy if it could return the data from the query as a dataframe.

So , getting up and running with Apache Drill and a data analysis environment is not that hard at all, if you have docker installed:-)

A Quick Look at Github Classroom and a Note on How Easy Github on the Web Is To Use…

With the OU VC announcing a future vision for the OU as a “university of the cloud” (quiet, at the back there…), I thought I’d have a look at Github Classroom, prompted by a post I was alerted to by @biztechpm on How to grade programming assignments on GitHub.

Getting started meant creating a Github organisation (I created innovationOUtside), logging in to Github Classroom (about), and requesting some private repositories (I asked for 25 and got unlimited) that could be used for setting Github mediated assignments as well as receiving student submissions back).

The Github Classroom model is based around Classrooms that contain Assignments, which can be specified as individual assignments or group assignments.

When students accept an invitation to an assignment (or at least, a private individual assignment, which is all I’ve gad a chance to look at so far), creates a repository viewable within the classroom on the group account. This repo is viewable by the student and the Classroom moderators.

If the assignment has been seeded with starter code, such as the statement of the assignment, or files associated with the assignment, will be used to see the student’s repository. (For convenience, I created a private repo on the group account to act as the repo for see files in a particular assignment.) If the seed files are updated, the student can update their own repository and then mark changes against that, but this needs to be done under git in a synched repo on the command line:-(

git remote add upstream

git fetch upstream
git rebase upstream/master

git can be a bit of a pain to work with on the command line and the desktop client, but it doesn’t have to be that hard. If the seed files arenlt updated once student repos are created, the student can operate completely via the Github website. For example, to add or update files contained within a particular directory in the repository, users can simply drag a file from their desktop and drop it onto the appropriate repo and directory listing webpage.

Additional files can then be uploaded and committed, along with corresponding commit messages:

Files  – and directories – can also be created and edited directly on the Github website. Simply Create a new file:

and then enter the file name required, along with a new subdirectory path if you want to place the file in a newly created subdirectory.

You can then edit the file (and if it’s markdown, preview a rendered version of it) in the browser.

Of course, working in the Github file editor means students can’t execute and test their code; but it’s easy enough to just download a zip file of the files contained in a repo, work on those, and then drag the completed files back into the online repo to upload it and check it in.


If a student checks in their updated script and gives it an obvious commit message, a moderator with privileges over the classroom the student’s repository is associated with can click onto that check in and (If the student has made multiple checkins, it would be useful if they rebased against the upstream master. I’m not sure if this can be done in the web client?)

The moderator can then see what changes made to the original script as their submitted work and comment on a line by line basis on the student’s script.

Unfortunately, if checked in documents are Jupyter notebooks, the simple github differencer isn’t as helpful as it could be:

Given that Github added a notebook previewer for notebook .ipynb files, it’s possible that they may add in support for the nbdime notebook differ. However, in the meantime, it would be possible to download student repos (possibly using a tool like Gitomator? and then check them using nbdime. It may also be possible to create a moderator/marker extension to help with this? The problem then is getting annotated notebooks back to the student. In which respect, I’m not sure if a moderator can check a file back in to a student’s private assignment repository, although I suppose a group assignment could perhaps be used in this respect with two member groups: the student and the moderator.

One final thought in terms of OU workflow, assignments are marked by students’ tutors, which makes for a management issue when allocating moderator permissions. One solution here may be to create a classroom for each moderator and with duplicate assignments seeded from the same original repo then allocated into each classroom.

I’ve Not Been Keeping Up With Robot Journalists (and Completely Missed Mentioning Document Automation Before Now…)

It’s been some time since I last had a look around at where people are at with robot journalism. The companies that still tend to come to mind (for me) in this area are still Automated Insights (US, Wordsmith), Narrative Science (US, Quill), Arria NLG (UK), AX Semantics (DE), Yseop (FR, Compose) and Tencent (CN, Dreamwriter) (was it really four years ago when I first started paying attention to this?!) but things have moved on since then, so I probably need to do another round up…

A recent addition I hadn’t noticed comes from the Washington Post, (owner: a certain Mr Jeff Bezos), and their Heliograf tool (via Wired: What News-Writing Bots Mean for the Future of Journalism); this currently seems to write under the byline powered by Heliograf and had its first major outing covering the 2016 Olympics (The Washington Post experiments with automated storytelling to help power 2016 Rio Olympics coverage); since then, it’s also moved into US election reporting (The Washington Post to use artificial intelligence to cover nearly 500 races on Election Day).

The model appears, in part, to be using Heliograf as a drafting tool, which is one of more interesting ways I always thought this sort of stuff would play out: “using Heliograf’s editing tool, Post editors can add reporting, analysis and color to stories alongside the bot-written text. Editors can also overwrite the bot text if needed” (my emphasis, via).

It seems that automated election reporting has also been used in Finland. Via @markkuhanninen, “[o]ur national broadcasting company Yle did the same for elections and some sports news (NHL ice hockey). All articles seem to be in Finnish.”

(Hmmm… thinks… maybe I should have tried doing something for UK General Election? Anyone know if anyone was using robot journalists to produce ward or constituency level reports for UK general election or local elections?)

Looking back to 2015, Swedish publisher Mittmedia (interview) look like they started off (as many do) with a simple weather reporting back in 2015, as well as using NLG to report internally on media stats. The Swedish national wire service TT Nyhetsbyrån also look to have started building their own robot reporter (TT building “reporter robot”) which perhaps made it’s debut last November? “The first robot reporter has left the factory on @ttnyhetsbyran and given out in reality. Hope it behaves” (translated by a bot…) (@matsrorbecker).

And in Norway, it looks like the Norwegian news agency NTB also started out with some automated sports reports last year, I think using (Norwegian News Agency is betting on automation for football coverage, and an example of the recipe: Building a Robot Journalist).

2016 also saw Bloomberg start to look at making more use of automation: Bloomberg EIC: Automation is ‘crucial to the future of journalism’.

Offhand, I haven’t found a specific mention of Thomson Reuters using automation for producing business reports (although I suspect they do), but I did notice that they appear to have been in the document automation game for years with their Contract Express application, a template solution that supports the automated creation of legal documents. A quick skim around suggests that document automation is a commodity level service in the legal industry, with lots of players offering a range of template based products.

Thinking in terms of complexity, I wonder if it’s useful imagining automated journalism in the context of something like: mail merge, document automation (contracts), automated reporting (weather, sports, financial, election, …)? Certainly, weather reporting and sports reporting appear to be common starting points, perhaps because they are “low risk” as folk get comfortable with producing and publishing automated copy.

I also wonder (again?) about how bylines are used, and have been evolving, to attribute the automated creation of news content. Is anyone maintaining a record or collection of such things, I wonder?

Tinkering With Apache Drill – JOINed Queries Across JSON and CSV files

Coming down from a festival high, I spent the day yesterday jamming with code and trying to get a feel for Apache Drill. As I’ve posted before, Apache Drill is really handy for querying large CSV files.

The test data I’ve been using is Evan Odell’s 3GB Hansard dataset, downloaded as a CSV file but recast in the parquet format to speed up queries (see the previous post for details). I had another look at the dataset yesterday, and popped together some simple SQL warm up exercises in notebook (here).

Something I found myself doing was flitting between running SQL queries over the data using Apache Drill to return a pandas dataframe, and wrangling the pandas dataframes directly, following the path of least resistance to doing the thing I wanted to do at each step. (Which is to say, if I couldn’t figure out the SQL, I’d try moving into pandas; and if the pandas route was too fiddly, rethinking the SQL query! That said, I also noticed I had got a bit rusty with SQL…) Another pattern of behaviour I found myself falling into was using Apache Drill to run summarising queries over the large original dataset, and then working with these smaller, summary datasets as in-memory pandas dataframes. This could be a handy strategy, I think.

As well as being able to query large flat CSV files, Apache Drill also allows you to run queries over JSON files, as well directories full of similarly structured JSON or CSV files. Lots of APIs export data as JSON, so being able to save the response of multiple calls on a similar topic as uniquely named (and the name doesn’t matter..) flat JSON files in the same folder, and then run a SQL query over all of them simply by pointing to the host directory, is really appealing.

I explored these features in a notebook exploring UK Parliament Written Answers data. In particular, the notebook shows:

  • requesting multiple pages on the same from the UK Parliament data API (not an Apache Drill feature!)
  • querying deep into a single large JSON file;
  • running JOIN queries over data contained in a JSON file and a CSV file;
  • running JOIN queries over data contained in a JSON file and data contained in multiple, similarly structured, data files in the same directory.

All I need to do now is come up with a simple “up and running” recipe for working Apache Drill. I’m thinking: Docker compose and some linked containers: Jupyter notebook, RStudio, Apache Drill, and maybe a shared data volume between them?

My Isle of Wight Festival 2017 Highlights

I didn’t keep track properly this year, but here’s a quick round up of some of my takeaway bands from this year’s Isle of Wight Festival…

Thursday night highlight was… the UK General Election. Sky News had a small TV screen showing early coverage (even though it was the “HD” feed, the screen resolution meant we couldn’t read the numbers…). The screen moved to a location near the all-night cafe around midnight, but once Sky had their fill of free content (“interviews with festival-goers”), they f**ked off. The nice folk in the all night cafe came to the rescue, though, agreeing to pop a BBC Radio 4 live feed on for us in place of the music they’d been playing to attract customers in, and even donating a couple of free pizzas to feed the election party that suddenly appeared on them… Their connection was a bit flaky though, but someone with a phone that worked stepped in until their data ran out… at which point we huddled round a Guardian live feed until 4 in the morning or so for a nice dawn chorus walk back to the tent…

So that’s one festival highlight… but what about the bands…?

Top pop rock band: Germein Sisters (homepage, Twitter: @GermeinSisters) – like an Australian First Aid Kit, I got to see them a couple of times, on Friday and Saturday.

Once again, I spent a lot of time popping in and out of the Jack Rocks / this feeling tent. The Sundowners (@TheSundownersUK)
also piled on the harmonies, and I’m guessing their psych vibe would be brilliant in a small venue with a full set:

I’ve no idea how you pronounce Gutxi Bibang (homepage, Twitter: @gutxi_bibang) but they make a great three-piece – low on banter but tight on the show…:

In contrast, Sheffield 5 piece Sheafs (Twitter: @SHEAFSBand) worked the audience hard:

or was that the Doors channeling Paves (Twitter: @pavesband)?!

I first saw Hands Off Gretel (homepage, Twitter: @HandsOffGretel) at last year’s Isle of Wight Festival (playlist), which ended up with seeing them play more than a few times over the last year. This year, the rest of the family popped their head into the tent to see what is it about them that keeps me “gallivanting around the country”. The fact that “we’re not hippies”, maybe?!

Clashfinder identified a clash for me putting the The Wholls (homepage, Twitter: @TheWholls) up against HoG for their first set, but I managed to catch the last bit of their second set and now need to track them down somewhere else so I can get to see a full set:

As to why I didn’t see the first bit of the set? I’d been watching some gritty Isle of Wight blues in the form of Confed Fred (Twitter: @confedfred).

I’m sure I recognised the band name The Strypes (homepage, Twitter: @The_Strypes), but can’t place where from. Maybe I need to start a gig diary again…

Skimming my Clashfinder schedule, I notice there aren’t many bands in this list from Thursday or Friday. Hmmm… was there really nothing notable, other than the Germein Sisters there?!

Ah – maybe – Isle of Wight rock band 30ten were fun in the Platform One tent, though again, I only caught the end of them while waiting for Curse of Lono (homepage, Twitter: @curseoflonoband)

For the proper folk rock experience though, Rusty Shackle (homepage, Twitter: @rustyshackleuk) really delivered the goods:

Which was also a nice complement to the bit of jig I have whenever I get the chance to see Dub Pistols (homepage, Twitter: @dubpistols), who were doing their thing once again on the Cirque De La Quirk stage…:

…after realising for the first time quite how good an all round, real deal rock band the Pretenders (homepage, Twitter: @thepretendershq) are:

(which isn’t to say Run DMC weren’t fun for the first part of the set I saw… But you know: clashes…!)

And Clashfinder lies – because Island reggae band The Ohmz (homepage, Twitter: @theohmz) had their face off on Friday night, not Saturday:

Which also reminds me – the Quay Arts’ Kashmir Cafe (@kashmir_iw) stage also hosted the ‘closest to the Ozrics’ moment for me in the form of Wille and the Bandits (homepage, Twitter: @wille_bandits), when they weren’t cracking out the blues:

The Kashmir Cafe also hosted a great way of rounding off the festival in the form of one of my repeated festival highlights, Tankus the Henge (homepage, Twitter: @tankusthehenge):

Right… better get that playlist and Twitter list together then…

Best T-shirt slogan: I’m so meta, even this acronym.

Most common political sentiment from the stage: Theresa May is a c**t.

PS As to whom didn’t play but I’d have like to have seen, Island band The Orders (Twitter: @the_ordersuk, and Fuzzwalker (homepage, Twitter: [@fuzzwalkerband])(, both of whom churn out guitar driven earworms like there’s no tomorrow, and both of whom would have been a blast in the Jack Rocks / This Feeling or the Kashmir tent.