Responsibilities and Required Skills in Recent “Data Journalist” Job ads…

A quick peek at a couple of recent job ads that describe what you might be expected to do…

From the BBC – Senior Data Journalist. In part, the role responsibilities include:

  • generating ideas for data-driven stories and for how they might be developed and visualized
  • exploring those ideas using statistical tools – and presenting them to wider stakeholders from a non-statistical background
  • reporting on and analysing data in a way that contributes to telling compelling stories on an array of news platforms
  • collaborating with reporters, editors, designers and developers to bring those stories to publication
  • exploring and summarizing data using relational database software
  • visualizing and finding patterns in spatial data using GIS software
  • using statistical tools to identify significant data trends
  • representing the data team and the Visual Journalism team at editorial meetings
  • leading editorially on data projects as required and overseeing the work of other data team colleagues
  • using skills and experience to advise on best approaches to data-led storytelling and the development and publication of data-led projects

Required technical skills include “a good understanding of statistics and statistical analysis; a strong grasp of how to clean, parse and query data; a good knowledge of some of the following: spreadsheet software, SQL, Python and R; demonstrable experience of visualising data, using either tools or scripts; experience of GIS or other mapping software; experience of gathering information from Freedom of Information requests”.

Desirable skills include “knowledge of basic scripting and HTML, as it might pertain to data visualization or data analysis and knowledge of several of the following; Carto, D3, QGIS, Tableau”.

And over at Trinity Mirror, there’s an open position for a data journalist, where role responsibilities include:

  • Having ideas for data-based stories and analysis, on a range of topics, which would be suitable for visualisation in regional newspapers across the group.
  • Working with a designer and the head of data journalism to come up with original and engaging ways of visualising this content.
  • Writing copy, as required, to accompany these visualisations.
  • Working on the data unit’s wider output, for print and web, as required by the head of data journalism.

As far as technical skills go, these “should include a broad knowledge of UK data sources, an ability to quickly and effectively interrogate data using spreadsheets, and an understanding of the pros and cons of different methods of visualising data”.  In addition, “[a]n ability to use scraping software, and some proficiency in using R, would be an advantage”.

Sharing Folders into VMs on Different Machines Using Dropbox, Google Drive, Microsoft OneDrive etc

Ever since I joined the OU, I’ve believed in trying to deliver distance education courses in an agile and responsive way, which is to say: making stuff up for students whilst the course is in presentation.

This is generally not done (by course/module teams at least) because the aim of most course/module teams is to prepare the course so thoroughly that it can “just” be presented to students.


I personally think we should try to improve the student experience of the course as it presents if we can by being responsive and reactive to student questions and issues.

So… TM351, the data management course that uses a VM, has started again, and issues / questions are already starting to hit the forums.

One of the questions – which I’d half noted in previous presentations but never really thought through (a symptom of my not iterating on or improving the course experience during, or between, those presentations) – related to sharing Jupyter notebooks across different machines using Google Drive (equally, Dropbox or Microsoft OneDrive).

The VirtualBox VM we use is fired up using the vagrant provisioner. A Vagrantfile defines various configuration settings – which ports are exposed by the VM, for example. By default, the contents of the folder in which vagrant is started up are shared into the VM. At the same time, vagrant creates a hidden .vagrant folder that contains state relating to the instance of that VM.

The set up on a single machine is something like this:

If a student wants to work across several machines, they need to share their working course files (Jupyter notebooks, and so on) but not the VM machine state. Which is to say, they need a set up more like the following:

For students working across several machines, it thus makes sense to have all project files in one folder and a separate .vagrant settings folder on each separate machine.

Checking the vagrant docs, it seems as if this is quite manageable using the synced folder configuration settings.

The default shares the current project folder (the one containing the Vagrantfile, from which vagrant is run), which I’m guessing is a setting something like:

config.vm.synced_folder "./", "/vagrant"

By explicitly setting this parameter, we can decide how we want the mapping to occur. For example:

config.vm.synced_folder "/PATH/ON/HOST", "/vagrant"

allows you to specify the folder you want to share into the VM. Note that the /PATH/ON/HOST folder needs to be created before trying to share it.

To put the new shared directory into effect, reload and reprovision the VM. For example:

vagrant reload --provision

Student notebooks located in the notebooks folder of that shared directory should now be available in the VM. Furthermore, if the shared folder is itself in a webshared folder (for example, a synced Dropbox, Google Drive or Microsoft OneDrive folder) it should be available wherever that folder is synched to.

For example, on a Mac (where ~ is an alias to my home directory), I can create a directory in my dropbox folder ~/Dropbox/TM351VMshare and then map this into the VM by adding the following line to the Vagrantfile:

config.vm.synced_folder "~/Dropbox/TM351VMshare", "/vagrant"

Note the possibility of slight confusion – the shared folder will not now be the folder from which vagrant is run (unless the folder you are running from is /PATH/ON/HOST ).

Furthermore, the only thing that needs to be in the folder from which vagrant is run is the Vagrantfile and the hidden .vagrant folder that vagrant creates.
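
Because the host folder has to exist before vagrant can share it, a small helper along the following lines could be run once on each machine. This is only a sketch: the function name prepare_shared_folder is made up for illustration, and the notebooks subfolder simply follows the layout described above.

```python
from pathlib import Path


def prepare_shared_folder(host_path, guest_path="/vagrant"):
    """Create the host folder (plus a notebooks subfolder) if it is missing
    and return the matching synced_folder line for the Vagrantfile."""
    folder = Path(host_path).expanduser()
    # vagrant expects the host folder to exist before it is shared
    (folder / "notebooks").mkdir(parents=True, exist_ok=True)
    return 'config.vm.synced_folder "{}", "{}"'.format(host_path, guest_path)
```

Calling prepare_shared_folder("~/Dropbox/TM351VMshare") would create the folder inside the synced Dropbox directory and hand back the line to paste into the Vagrantfile on that machine.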

Fingers crossed this recipe works…;-)

Sharing Pre-Built VMs Via Vagrant Cloud

In passing, I noticed yesterday that Vagrant Cloud (docs) can be used to host and distribute public Vagrant base boxes. So I exported a box file from my V-REP’n’Jupyter VM:

vagrant package

uploaded it to Vagrant Cloud – ouseful/ou-robotics-test – and tweaked my Vagrantfile to use that copy as the base box: = "ouseful/ou-robotics-test"

Now I’m thinking I should probably do the same for the TM351 VM, given the hassle it seems to take trying to get the .box file hosted for download on an OU URL…

Distributing Virtual Machines That Include a Virtual Desktop To Students – V-REP + Jupyter Notebooks

When we put together the virtual machine for TM351, the data management and analysis course, we built a headless virtual machine that did not contain a graphical desktop, but instead ran a set of services that could be accessed within the machine at a machine level, and via a browser based UI at the user level.

Some applications, however, don’t expose an HTML based graphical user interface over http; instead, they require access to a native windowing system.

One way round this is to run a system that can generate an HTML based UI within the VM and then expose that via a browser. For an example, see Accessing GUI Apps Via a Browser from a Container Using Guacamole.

Another approach is to expose an X11 window connection from the VM and connect to that on the host, displaying the windows natively on host as a result. See for example the Viewing Application UIs and Launching Applications from Shortcuts section of BYOA (Bring Your Own Application) – Running Containerised Applications on the Desktop.

The problem with the X11 approach is that it requires gubbins (technical term!) on the host to make it work. (I’d love to see a version of Kitematic extended not only to support docker-compose but also pre-packaged with something that could handle X11 connections…)

So another alternative is to create a virtual machine that does expose a desktop, and run the applications on that.

Here’s how I think the different approaches look:


As an example of the desktop VM idea, I’ve put together a build script for a virtual machine containing a Linux graphic desktop that runs the V-REP robot simulator. You can find it here: ou-robotics-vrep.

The script uses one Vagrant script to build the VM and another to launch it.

Along with the simulator, I packaged a Jupyter notebook server that can be used to create Python notebooks that can connect to the simulator and control the simulated robots running within it. These notebooks could be viewed via a browser running on the virtual machine desktop, but instead I expose the notebook server so notebooks can be viewed in a browser on host.

The architecture thus looks something like this:

I’d never used Vagrant to build a Linux desktop box before, so here are a few things I learned about and observed along the way:

  • installing ubuntu-desktop naively installs a whole range of applications as well. I wanted a minimal desktop that contained just the simulator application (though I also added in a terminal). For the minimal desktop, apt-get install -y ubuntu-desktop --no-install-recommends;
  • by default, Ubuntu requires a user to login (user: vagrant; password: vagrant). I wanted to have as simple an experience as possible so wanted to log the user in automatically. This could be achieved by adding the following to /etc/lightdm/lightdm.conf:
  • a screensaver kept kicking in and kicking back to the login screen. I got round this by creating a desktop settings script (/opt/
#dock location
gsettings set com.canonical.Unity.Launcher launcher-position Bottom

#screensaver disable
gsettings set org.gnome.desktop.screensaver lock-enabled false

and then pointing to that from a desktop_settings.desktop file in the /home/vagrant/.config/autostart/ directory (I set execute permissions on the script and the .desktop file):

[Desktop Entry]
Name=Apply Gnome Settings
  • because the point of the VM is largely to run the simulator, I thought I should autostart the simulator. This can be done with another .desktop file in the autostart directory:
[Desktop Entry]
Name=V-REP Simulator
  • the Jupyter notebook server is started as a service and reuses the installation I used for the TM351 VM;
  • I thought I should also add a desktop shortcut to run the simulator, though I couldn’t find an icon to link to. Create an executable run_vrep.desktop file and place it on the desktop:
[Desktop Entry]
Name=V-REP Simulator
Comment=Run V-REP Simulator

Here’s how it looks:

If you want to give it a try, comments on the build/install process would be much appreciated: ou-robotics-vrep.

I will also be posting a set of activities based on the RobotLab activities used in TM129, on the off chance that we start using V-REP on TM129. The activity notebooks will be posted in the repo and via the associated uncourse blog if you want to play along.

One issue I have noticed is that if I resize the VM window, V-REP crashes… I also can’t figure out how to open a V-REP scene file from script (issue) or how to connect using a VM hostname alias rather than IP address (issue).

Distributing Software to Distance Education Students – Checksums

One of the issues with distributing software to distance education students is ensuring that the software package they are trying to install hasn’t been corrupted in some way during transport. For example, one of the ways we ship software to students is via USB memory stick. But in one course last year, it seems that some of the sticks were a bit dodgy, and the files wouldn’t install from them.

Which is where checksums come in.

If a student is having issues installing a piece of software, we can check the checksum of the distributed installer package to see if it matches the checksum of a pristine package. If it doesn’t, we know the problem is a corrupted installer package (rather than a problem downstream of that, for example).

So what is a checksum? Essentially, it’s a single number derived from all the bits in the file you’re generating the checksum for. If any bit in the file is changed, the checksum should change too.

So here are a couple of ways of generating checksums…


Download the Windows fciv (Windows File Checksum Integrity Verifier) utility:

Run a command of the form:

This will produce checksums using different coding mechanisms:



On a Mac, you can find the checksum using the following command:


For example:


If we distribute a copy of the checksum for installer packages, along with the installer packages, assuming that the checksum is not corrupted, a student can check that the installer package is not corrupted by generating the checksum for their installer package and comparing it to the distributed one.

When it comes to support in the event of a problem, and a call to the help desk, then the first question from support can be: “Have you checked the checksum of the original package?” (Or we can prompt students to do this themselves as part of self-service support…)

Even better if we shipped a simple one-click file integrity checking utility that:

1) runs a checksum test on itself to check that it’s working;
2) runs a checksum test on the distributed package(s) to check that they are not corrupted.
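
As a rough sketch of what such a utility might look like in Python (nothing like this has actually been shipped, and all the function names are hypothetical): the self-test here checks the hashing code against a well-known SHA-256 value rather than checksumming the utility’s own file, which is a simplification of step 1.

```python
import hashlib
import sys


def file_checksum(path, algorithm="sha256", chunk_size=65536):
    """Hash a file in chunks so large installers need not fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def self_test():
    """Step 1: check the hashing code against the SHA-256 of empty input."""
    known = "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
    return hashlib.sha256(b"").hexdigest() == known


def verify(path, expected):
    """Step 2: compare a package's checksum with the distributed one."""
    return file_checksum(path) == expected.strip().lower()


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage: python check_package.py PACKAGE_FILE CHECKSUM
    if not self_test():
        sys.exit("Checksum tool failed its self-test")
    ok = verify(sys.argv[1], sys.argv[2])
    print("Package OK" if ok else "Package appears to be corrupted")
```

Wrapping something like this in a double-clickable launcher would be a separate packaging job for each platform.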

Building a JSON API Using Jupyter Notebooks in Under 5 Minutes

In the post Making a Simple Database to Act as a Lookup for the ONS Register of Geographic Codes, I idly pondered creating a simple API for looking up ONS geographic codes.

Popping out to the shop for a pint of milk, I recalled one of the many things on my to do list was to look at the Jupyter Kernel Gateway, as described in the IBM Emerging Technologies blogpost on Jupyter Notebooks as RESTful Microservices.

The kernel_gateway project looks to be active – install it using something like pip3 install jupyter_kernel_gateway – and set up a notebook, such as SimpleAPI.ipynb (the following code blocks represent separate code cells).

Import some helper packages and connect the geographic codes db I created in the previous post.

import json
import pandas as pd
import sqlite3
con = sqlite3.connect("onsgeocodes.sqlite")

Create a placeholder for the global REQUEST object that will be created when the API is invoked:

REQUEST = json.dumps({
'path' : {},
'args' : {}
})

Now let’s define the API.

For example, suppose I want to return some information about a particular code:

# GET /ons/:code
request = json.loads(REQUEST)
code = request['path'].get('code')

q='SELECT * FROM codelist WHERE "GEOGCD"="{code}"'.format(code=code)
print('{"codes":%s}' % pd.read_sql_query(q, con).to_json(orient='records'))

Or suppose I want to look up what current codes might exist that partially match a particular place name:

# GET /ons/current/:name
request = json.loads(REQUEST)
name = request['path'].get('name')

q='''
SELECT * FROM codelist JOIN metadata
WHERE "GEOGNM"="{name}" AND codeAbbrv=sheet AND codelist.STATUS="live"
'''.format(name=name)

print('{"codes":%s}' % pd.read_sql_query(q, con).to_json(orient='records'))
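
The cells above drop the request value straight into the SQL string. As a hedged alternative, here is a minimal sketch of the same code lookup using a bound parameter instead. It uses only the standard library, an in-memory database standing in for onsgeocodes.sqlite, and a single sample row; the lookup_code function is made up for illustration and is not part of the kernel gateway.

```python
import json
import sqlite3

# Hypothetical in-memory stand-in for onsgeocodes.sqlite with one sample row
con = sqlite3.connect(":memory:")
con.row_factory = sqlite3.Row
con.execute(
    "CREATE TABLE codelist (sheet TEXT, GEOGCD TEXT, GEOGNM TEXT, STATUS TEXT)"
)
con.execute(
    "INSERT INTO codelist VALUES ('W40_CMLAD', 'W40000004', 'Denbighshire', 'live')"
)


def lookup_code(request_json):
    """Mimic the body of the GET /ons/:code cell using a bound parameter."""
    request = json.loads(request_json)
    code = request["path"].get("code")
    # The ? placeholder leaves quoting of the supplied value to SQLite
    rows = con.execute(
        "SELECT * FROM codelist WHERE GEOGCD = ?", (code,)
    ).fetchall()
    return json.dumps({"codes": [dict(row) for row in rows]})


# Simulate the REQUEST string the gateway would supply for /ons/W40000004
REQUEST = json.dumps({"path": {"code": "W40000004"}, "args": {}})
print(lookup_code(REQUEST))
```

In a notebook cell the function wrapper would not be needed; the point is just that the code value never becomes part of the SQL text.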

On the command line in the same directory as the notebook (for example, SimpleAPI.ipynb), I can now run the command:

jupyter kernelgateway --KernelGatewayApp.api='kernel_gateway.notebook_http' --KernelGatewayApp.seed_uri='./SimpleAPI.ipynb'

to start the server.

And as if by magic, an API appears:
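
A minimal sketch of calling the two endpoints from Python might look like the following. It assumes the gateway is listening on 127.0.0.1 port 8888 (which I believe is its default, but check the startup log), and the helper names are made up for illustration.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumption: the gateway is listening on its default local address and port
BASE_URL = "http://127.0.0.1:8888"


def code_url(code, base=BASE_URL):
    """Build the URL for the GET /ons/:code endpoint."""
    return "{}/ons/{}".format(base, quote(code, safe=""))


def name_url(name, base=BASE_URL):
    """Build the URL for the GET /ons/current/:name endpoint."""
    return "{}/ons/current/{}".format(base, quote(name, safe=""))


def fetch(url):
    """Request a URL and decode the {"codes": [...]} JSON response."""
    with urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))["codes"]


print(code_url("W40000004"))
print(name_url("Isle of Wight"))
```

With the server running, fetch(code_url("W40000004")) should return the list of matching records.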

The original blogpost also describes a simple docker container for running the API. Which makes me think: this is so easy… And using something like LaunchBot, it’d be easy enough to have a simple manager for local API servers running on a personal desktop?

Here’s the complete notebook.

Making a Simple Database to Act as a Lookup for the ONS Register of Geographic Codes

Via a GSS (Government Statistical Service) blog post yesterday – Why Do We Need Another Register? – announcing the Statistical Geography Register, which contains “the codes for each type of statistical geography within the UK”, I came across mention of the ONS Register of Geographic Codes.

This register, maintained by the Office for National Statistics, is released as a zip file containing an Excel spreadsheet. Separate worksheets in the spreadsheet list codes for various geographies in England and Wales (but not Scotland; that data is available elsewhere).

To make a rather more reproducible component for accessing the data, I hacked together a simple notebook to pull the data out of the spreadsheet and pop it into a simple SQLite3 database as a set of separate tables, one per sheet.

One thing we need to do to reconcile items in the metadata sheet with the sheet names is join a couple of the columns together with an underscore:

xl['RGC']["codeAbbrv"] = xl['RGC']["Entity code"].map(str) + '_' + xl['RGC']["Entity abbreviation"].map(str)

The bulk of the script is quite simple (see the full notebook here):

import sqlite3
con = sqlite3.connect("onsgeocodes.sqlite")


bigcodes.to_sql(con=con, name='codelist', index=False, if_exists='replace')

sheets= list(xl.keys())
for sheet in sheets[2:]:
  xl[sheet].to_sql(con=con, name=sheet, index=False, if_exists='replace')
  #Reorder the columns
  xl[sheet][['sheet']+cols].to_sql(con=con, name='codelist', index=False, if_exists='append')

You may also notice that it creates a “big” table (codelist) that contains all the codes – which means we can look up the provenance of a particular code:

q='''
SELECT sheet, GEOGCD, GEOGNM, GEOGNMW, codelist.STATUS, "Entity name"
FROM codelist JOIN metadata WHERE "GEOGCD"="{code}" AND codeAbbrv=sheet
'''.format(code='W40000004')
pd.read_sql_query(q, con)

    sheet      GEOGCD     GEOGNM        GEOGNMW       STATUS  Entity name
0   W40_CMLAD  W40000004  Denbighshire  Sir Ddinbych  live    Census Merged Local Authority Districts

We can also look to see what (current) geographies might be associated with a particular name:

q='''
SELECT DISTINCT "Entity name", sheet FROM codelist JOIN metadata
WHERE "GEOGNM" LIKE "%{name}%" AND codeAbbrv=sheet AND codelist.STATUS="live"
'''.format(name='Isle of Wight')
pd.read_sql_query(q, con)

    Entity name                                     sheet
0   Super Output Areas, Lower Layer                 E01_LSOA
1   Super Output Areas, Middle Layer                E02_MSOA
2   Unitary Authorities                             E06_UA
3   Westminster Parliamentary Constituencies        E14_WPC
4   Community Safety Partnerships                   E22_CSP
5   Registration Districts                          E28_REGD
6   Registration Sub-district                       E29_REGSD
7   Travel to Work Areas                            E30_TTWA
8   Fire and Rescue Authorities                     E31_FRA
9   Built-up Areas                                  E34_BUA
10  Clinical Commissioning Groups                   E38_CCG
11  Census Merged Local Authority Districts         E41_CMLAD
12  Local Resilience Forums                         E48_LRF
13  Sustainability and Transformation Partnerships  E54_STP
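
For anyone who wants to play with the pattern without downloading the spreadsheet, the table-building and lookup steps above can be sketched with just the standard library. Everything in the sheets and metadata structures below is a cut-down, illustrative stand-in for the real register (which has more columns and many more rows), and geographies_for is a made-up helper name.

```python
import sqlite3

# Hypothetical miniature of the spreadsheet: sheet name -> rows of
# (GEOGCD, GEOGNM, STATUS)
sheets = {
    "W40_CMLAD": [("W40000004", "Denbighshire", "live")],
    "E06_UA": [("E06000046", "Isle of Wight", "live")],
}
# Hypothetical metadata rows: (codeAbbrv, "Entity name")
metadata = [
    ("W40_CMLAD", "Census Merged Local Authority Districts"),
    ("E06_UA", "Unitary Authorities"),
]

con = sqlite3.connect(":memory:")
con.execute('CREATE TABLE metadata (codeAbbrv TEXT, "Entity name" TEXT)')
con.executemany("INSERT INTO metadata VALUES (?, ?)", metadata)
con.execute(
    "CREATE TABLE codelist (sheet TEXT, GEOGCD TEXT, GEOGNM TEXT, STATUS TEXT)"
)

for sheet, rows in sheets.items():
    # One table per sheet...
    con.execute(
        'CREATE TABLE "{}" (GEOGCD TEXT, GEOGNM TEXT, STATUS TEXT)'.format(sheet)
    )
    con.executemany('INSERT INTO "{}" VALUES (?, ?, ?)'.format(sheet), rows)
    # ...plus a copy of each row in the combined table, tagged with its sheet
    con.executemany(
        "INSERT INTO codelist VALUES (?, ?, ?, ?)",
        [(sheet,) + row for row in rows],
    )


def geographies_for(name):
    """List the live geography types whose names contain the given text."""
    q = '''
        SELECT DISTINCT "Entity name", sheet FROM codelist JOIN metadata
        WHERE GEOGNM LIKE ? AND codeAbbrv = sheet AND codelist.STATUS = 'live'
    '''
    return con.execute(q, ("%" + name + "%",)).fetchall()


print(geographies_for("Isle of Wight"))
```

The combined codelist table is what makes the provenance lookup possible: without the sheet column there would be no way to join a code back to its entry in the metadata.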

What I’m wondering now is – can I crib from the way the ergast API is put together to create a simple API that takes a code, or a name, and returns geography register information related to it?

The same approach could also be applied to the registers I pull down from NHS Digital (here) – which makes me think I should generate a big codelist table for those codes too…

PS this in part reminds me of a conversation years ago with Richard et al from @cottagelabs who were mooting, at the time, a service that would take an arbitrary code and try to pattern match the coding scheme it was part of and then look it up in that coding scheme.

PPS hmm, also thinks: maybe names associated with coding schemes could be added to a simple document tagger.