August 2017 – OUseful.Info, the blog…

O’Reilly LaunchBot – Like a Personal Binder App for Running Jupyter Notebooks and Other Containerised Browser Accessed Apps

One of the issues I keep coming up against when trying to encourage folk to at least give Jupyter notebooks a try is that “it’s too X to install” (for negatively sentimented X…). One way round this is to use something like Binder, which allows you to create a Docker image from the contents of a public Github repository and run a container instance from that image on somebody else’s server, accessing the notebook running “in the cloud” via your own browser. A downside to this approach is that the public binder service can be a bit slow – and a bit flaky. And it probably doesn’t scale well if you’re running a class. (Institutions running their own elastic Binder service could work round this, of course…)

So here’s yet another possible way in for folk – O’Reilly Media’s LaunchBot (via O’Reilly Media CTO Andrew Odewahn’s draft article Computational Publishing with Jupyter) – though admittedly it also comes with the possibly (probably?!) “too hard” requirement of installing Docker as a pre-requisite. Which means it’s possibly doomed from the start in my institution…

Anyway, for anyone not in a Computing HE department, and who’s willing to install Docker first (it isn’t really hard at all unless your IT department has got their hands on your computer first: just download the appropriate Community Edition for your operating system (Mac, Windows, Linux) from here), the LaunchBot app provides a handy way to run Jupyter environments in a custom Docker container, “preloaded” with notebooks from a public Github repo.

By the by, if you’re on a Mac, you may be warned off installing LaunchBot.

Simply select the app in a Finder window, right click on it, and then select “Open” from the pop-up menu; agree that you do want to launch it, and it should run happily thereafter.

In a sense, LaunchBot is a bit like running a personal Binder instance: the LaunchBot app, which runs from the desktop but is accessed via browser, allows you to clone projects from public Github repos that contain a Dockerfile, and perhaps a set of notebooks. (You can use this url as an example – it’s a set of notebooks from the original OU/FutureLearn Learn to Code for Data Analysis course.) The project README is displayed as part of the project homepage, and an option is given to view, edit and save the project Dockerfile.

The Dockerfile can then be used to launch an instance of a corresponding container using your own (locally running) Docker instance:

(A status bar in the top margin tracks progress as image layers are downloaded and Dockerfile commands run.)

The base image specified in the Dockerfile will be downloaded, the Dockerfile run, and I’m guessing links to all services running on exposed ports are displayed:

In the free plan, you are limited to cloning from public repos, with no support for pushing changes back to a repo. In the $7 a month fee based plan, pulls from private repos and pushes to all repos are supported.

From a quick play, LaunchBot is perhaps easier to use than Kitematic. Whilst neither of them supports launching linked containers using docker-compose, LaunchBot’s ability to clone a set of notebooks (and other files) from a repo, as well as the Dockerfile, makes it more attractive for delivering content as well as the custom container runtime environment.

I had a quick look around to see if the source code for LaunchBot was around anywhere, but couldn’t find it offhand. The UI is likely to be a little bit scary for many people who don’t really get the arcana of git (which includes me! I prefer to use it via the web UI ;-) and could be simplified for users who are unlikely to want to push commits back. For example, students might only need to pull a repo once from a remote master. On the other hand, it might be useful to support some simplified mechanism for pulling down updates (or restoring trashed notebooks?), with conflicts on the client side managed in a very sympathetic and hand-holdy way (perhaps by backing up any files that would otherwise be overwritten from the remote master?!). (At the risk of feature creeping LaunchBot, more git-comfortable users might find some way of integrating the nbdime notebook diff-er useful?)

Being able to brand the LaunchBot UI would also be handy…

From a distributed authoring perspective, the git integration could be handy. As far as the TM351 module team experiments in coming up with our own distributed authoring processes go, the use of a private Github repo means we can’t use the LaunchBot approach, at least under the free plan. (From the odd scraps I’ve found around the OU’s new OpenCreate authoring system, it supposedly supports versioning in some way, so it’ll be interesting to see how that works…) The TM351 dockerised VM also makes use of multiple containers linked using docker-compose, so I guess to use it in a LaunchBot context I’d need to build a monolithic container that runs all the required services (Postgres and MongoDB, as well as Jupyter) in a single image (or maybe I could run docker-compose inside a Docker container?)

[See also Seven Ways of Running IPython / Jupyter Notebooks, which I guess probably needs another update…]

PS this could also be a handy tool for supporting authoring and maintenance workflows around notebooks: nbval, a “py.test plugin to validate Jupyter notebooks”. I wonder if it’s happy to test the raising of particular errors/warnings, as well as cleanly executed code fragments? Certainly, one thing we’re lacking in our Jupyter notebook production route at the moment is a sensible way of testing notebooks. At the risk of complicating things further, I wonder how that might fit into something like LaunchBot?

PPS Binder also lacks support for docker-compose, though I have asked about this and there may be a way in if someone figures out a spawner that invokes docker-compose.

Python “Natural Time Periods” Package

Getting on for a year ago, I posted a recipe for generating “Natural Language” Time Periods in Python. At the time, a couple of people asked whether it was packaged – and it wasn’t…

It’s still not on pip, but I have since made a package repo for it: python-natural-time-periods.

Install using: pip3 install --force-reinstall --no-deps --upgrade git+https://github.com/psychemedia/python-natural-time-periods.git

The following period functions are defined:

today(), yesterday(), tomorrow()
last_week(), this_week(), next_week(), later_this_week(), earlier_this_week()
last_month(), next_month(), this_month(), earlier_this_month(), later_this_month()
day_lastweek(), day_thisweek(), day_nextweek()

Here’s an example of how to use it:

import natural_time_periods as ntpd

ntpd.today()
>>> datetime.date(2017, 8, 9)

ntpd.last_week()
>>> (datetime.date(2017, 7, 31), datetime.date(2017, 8, 6))

ntpd.later_this_month()
>>> (datetime.date(2017, 8, 10), datetime.date(2017, 8, 31))

ntpd.day_lastweek(ntpd.MON)
>>> datetime.date(2017, 7, 31)

ntpd.day_lastweek(ntpd.MON, iso=True)
>>> '2017-07-31'
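
For anyone wondering what sits behind functions like these, here’s a minimal sketch of how a couple of them could be put together using nothing more than the standard datetime module. It’s an illustration rather than the package’s actual code: the optional today argument, and the assumption that MON is 0 (as in datetime.date.weekday()), are mine.

```python
import datetime

MON = 0  # assumption: weekdays numbered as in datetime.date.weekday()


def last_week(today=None):
    """Return (Monday, Sunday) of the week before the one containing today."""
    today = today or datetime.date.today()
    # Step back to the Monday of the current week...
    this_monday = today - datetime.timedelta(days=today.weekday())
    # ...then back a further seven days to the previous Monday
    start = this_monday - datetime.timedelta(days=7)
    return start, start + datetime.timedelta(days=6)


def day_lastweek(day=MON, today=None, iso=False):
    """Return a given weekday from last week, optionally as an ISO string."""
    start, _ = last_week(today)
    d = start + datetime.timedelta(days=day)
    return d.isoformat() if iso else d


# Use a fixed reference date so the results are repeatable
ref = datetime.date(2017, 8, 9)
print(last_week(ref))                    # (datetime.date(2017, 7, 31), datetime.date(2017, 8, 6))
print(day_lastweek(MON, ref, iso=True))  # 2017-07-31
```

Passing in a reference date, rather than always calling datetime.date.today(), makes this sort of function much easier to test.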

NHS Digital Organisation Data Service (ODS) Python / Pandas Data Loader

One of the nice things about the Python pandas data analysis package is that there is a small – but growing – amount of support for downloading data from third party sources directly as a pandas dataframe using the pandas-datareader.

So I thought I’d have a go at producing a package inspired by the pandas wrapper for the World Bank indicators API for downloading administrative data from the NHS Digital Organisation Data Service. You can find the package here: python-pandas-datareader-NHSDigital.

There are also examples of how to use the package here: python-pandas-datareader-NHSDigital demo notebook.

Replacing RobotLab…?

Somewhen around 2002-3, when we first ran the short course “Robotics and the Meaning of Life”, my colleague Jon Rosewell developed a drag and drop text based programming environment – RobotLab – that could be used to programme the behaviour of a Lego Mindstorms/RCX brick controlled robot, or a simulated robot in an inbuilt 2D simulator.

RobotLab

The environment was also used in various OU residential schools, although an update in 2016 saw us move from the old RCX bricks to Lego EV3 robots, and with it a move to the graphical Lego/Labview EV3 programming environment.

RobotLab is still used in the introductory robotics unit of a level 1 undergrad OU course, a unit that is itself an update of the original Robotics and the Meaning of Life short course. And although the software is largely frozen – the screenshot below shows it running under Wineskin on my Mac – it continues to do the job admirably:

– the environment is drag and drop, to minimise errors, but uses a text based language (inspired originally by Lego scripting code, which it generated to control the actual RCX powered robots);
– the simulated robot could be configured by the user, with noise being added to sensor inputs and motor outputs, if required, and custom background images could be loaded into the simulator;
– a remote control panel could be used to control the behaviour of the real – or simulated – robot to provide simple tele-operation of it. A custom remote application for the residential school allowed a real robot to be controlled via the remote app, with a delay in the transmission of the signal that could be used to model the signal flight time to a robot on the moon! The RobotLab remote app provided a display to show the current power level of each motor, as well as the values of any sensor readings;
– the RobotLab environment allowed instructions to be stepped through an instruction at a time, in order to support debugging;
– a data logging tool allowed real or simulated logged data to be “uploaded” and viewed as a time series line chart.
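
As a back-of-envelope sketch of the size of that moon delay: assuming an average Earth–moon distance of about 384,400 km and a signal travelling at the speed of light (these figures are assumptions for illustration, not values taken from the RobotLab remote app), the signal flight time works out as follows.

```python
SPEED_OF_LIGHT_KM_S = 299_792.458  # km/s
EARTH_MOON_DISTANCE_KM = 384_400   # average distance; assumed for illustration


def one_way_delay_s(distance_km):
    """Time in seconds for a signal to cover distance_km at the speed of light."""
    return distance_km / SPEED_OF_LIGHT_KM_S


one_way = one_way_delay_s(EARTH_MOON_DISTANCE_KM)
round_trip = 2 * one_way  # command out, sensor reading back

print(f"one way: {one_way:.2f} s, round trip: {round_trip:.2f} s")
# one way: 1.28 s, round trip: 2.56 s
```

So an operator would be waiting a couple of seconds or more between sending a command and seeing its effect, which is plenty to make tele-operation feel awkward.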

Time moves on, however, and we’re now starting to think about revising the robotics material. We had started looking at an updated, HTML5 version of RobotLab last year, but that effort seems to have stalled. So I’ve started looking for an alternative.

Robot Operating System (ROS)

Following on from a capital infrastructure bid a couple of years ago, we managed to pick up a few Baxter robots that are intended to be used in the “real, remote experiment” OpenSTEM Lab. (Baxter is also demoed at the level 1 engineering residential school.) Baxter runs under ROS, the Robot Operating System, and can be programmed using Python. Both 2D and 3D simulators are available for ROS, which means we could go down the ROS route as the programming environment for any revised level 1 course.

At this point, it’s also worth saying that the level 1 course is something of a Frankenstein course, including introductory units to Linux and networking, as well as robotics. If the course is rewritten, the current idea is to replace the Linux module with one on more general operating systems. This means that we could try to create a set of modules that all complement each other, yet stand alone as separate modules. For example, ROS works on a client server model, which allows us to foreshadow, or counterpoint, ideas arising in the other units.

The relative complexity of the ROS environment means that we can also use it to motivate the idea of using virtual machines for running scientific software with complex dependencies and rather involved installation processes. However, from a quick look at the ROS tools, they do look rather involved for a first intro course and I think would require quite a bit of wrapping to hide some of the complexity.

If you’re interested, here’s a quick run down of what needs adding to a base Ubuntu Linux 16.04 install:

	sudo apt-get update
	sudo apt-get install -y software-properties-common
	sudo apt-add-repository multiverse
	sudo sh -c 'echo "deb http://packages.ros.org/ros/ubuntu $(lsb_release -sc) main" > /etc/apt/sources.list.d/ros-latest.list'
	sudo apt-key adv --keyserver hkp://ha.pool.sks-keyservers.net:80 --recv-key 421C365BD9FF1F717815A3895523BAEEB01FA116
	#If that address doesn't work:
	sudo apt-key adv --keyserver hkp://pgp.mit.edu:80 --recv-key 421C365BD9FF1F717815A3895523BAEEB01FA116
	sudo apt-get update
	sudo apt-get install ros-kinetic-ros-base
	sudo rosdep init
	rosdep update
	echo "source /opt/ros/kinetic/setup.bash" >> ~/.bashrc
	source ~/.bashrc
	sudo apt-get install -y ros-$ROS_DISTRO-stdr-simulator
	sudo apt-get install -y ros-$ROS_DISTRO-teleop-twist-keyboard

	##To run simulator:
	#roslaunch stdr_launchers server_with_map_and_gui_plus_robot.launch

	##To run keyboard remote:
	#rosrun teleop_twist_keyboard teleop_twist_keyboard.py cmd_vel:=robot0/cmd_vel


	##From: https://hub.docker.com/r/davetcoleman/baxter_simulator/
	##Gazebo simulator
	#sudo sh -c 'echo "deb http://packages.osrfoundation.org/gazebo/ubuntu-stable `lsb_release -cs` main" > /etc/apt/sources.list.d/gazebo-stable.list'
	#wget http://packages.osrfoundation.org/gazebo.key -O - | sudo apt-key add -
	#sudo apt-get update
	#sudo apt-get install gazebo7 libgazebo7-dev

	#sudo apt-get -y install python-rosdep python-catkin-tools python-wstool
	#mkdir -p ~/ws_baxter/src
	#cd ~/ws_baxter/src
	#wstool init .
	#wstool merge https://raw.githubusercontent.com/vicariousinc/baxter_simulator/kinetic-gazebo7/baxter_simulator.rosinstall
	#wstool update
	#rosdep install -y --from-paths . --ignore-src --rosdistro kinetic --as-root=apt:false
	#cd ..
	#catkin config --extend /opt/ros/kinetic --cmake-args -DCMAKE_BUILD_TYPE=Release
	#catkin build

	##Need to modify bash.sh
	#source ~/ws_baxter/devel/setup.bash
	#export ROS_MASTER_URI=http://localhost:11311

	##start gazebo
	#roslaunch baxter_gazebo baxter_world.launch

	##Simple demo:
	#rosrun baxter_examples joint_velocity_wobbler.py
	##Simple keyboard-based control demo:
	#rosrun baxter_examples joint_position_keyboard.py

The simulator and the keyboard remote need launching from separate terminal processes.

Open RobertaLab

A rather more friendly environment we could work with is the blockly based RobertaLab. Scratch is already being used in one of the new first level courses to introduce basic programming, of a sort, so the blocks style environment is one that students will see elsewhere, albeit in a rather simplistic fashion. (I’d argued for using BlockPy instead, but the decision had already been taken…)

RobertaLab runs as an online hosted environment, but the code is openly licensed which means we should be able to run it institutionally, or allow students to run it themselves.

In keeping with complementing the operating systems unit, we could use a Docker containerised version of RobertaLab to allow students to run RobertaLab on their own machines:

A couple of other points to note about RobertaLab. Firstly, it has a simulator, with several predefined background environments. (I’m not sure if new ones can be easily uploaded, but if not the repo can be forked and new ones added from src.)

As with BlockPy, there’s the ability to look at Python script generated from the blocks view. In the EV3Dev selection, Python code compatible with the Ev3dev library is generated. But other programming environments are available too, in which case different code is generated. For example, the EV3lejos selection will generate Java code.

This ability to see the code behind the blocks is a nice stepping stone towards text based programming, although unlike BlockPy I don’t think you can go from code to blocks? The ability to generate code for different languages from the same blocks programme could also be used to make a powerful point, I think?

Whether or not we go with the robotics unit rewrite, I think I’ll try to clone the current RobotLab activities using the RobertaLab environment, and also add some extra value bits by commenting on the Python code that is generated.

By the by, here’s a clone of the Dockerfile used by exmatrikulator for building the RobertaLab container image:

	FROM ubuntu

	RUN apt update && apt install -y avrdude avr-libc binutils-avr default-jdk gcc-arm-none-eabi gcc-avr gdb-avr git libssl-dev libusb-0.1-4 maven nbc phantomjs python-pip srecord unzip wget && apt-get clean
	RUN pip install uflash

	RUN wget -q https://github.com/OpenRoberta/robertalab/archive/master.zip && \
	unzip master.zip && \
	rm master.zip


	WORKDIR /robertalab-master/OpenRobertaParent
	RUN mvn clean install


	WORKDIR /robertalab-master
	RUN /robertalab-master/ora.sh --createemptydb OpenRobertaServer/db-2.2.0/openroberta-db

	VOLUME /robertalab-master/OpenRobertaServer/db-2.2.0
	EXPOSE 1999
	CMD ["/robertalab-master/ora.sh", "--start-from-git"]

In the original robotics unit, one of the activities looked at how a simple neural network could be trained using data collected by the robot. I’m not sure whether Dale Lane’s Machine Learning for Kids application, which looks to be written using Scratch, could be used for something similar. Being Scratch based, it probably can’t be integrated with Open RobertaLab, although it would perhaps make a nice complement to both the use of OpenRobertaLab to control a simple simulated robot, and the use of Scratch in the other level 1 module as an environment for building simple games as a way of motivating the teaching of basic programming concepts.

Summary

No-one’s interested in Jupyter notebooks, but folk seem to think Scratch is absolutely bloody lovely for teaching programming to novice ADULTS, so I might as well go with them in adopting that style of interface…

That said, I should probably keep tabs on what Doug Blank is up to with his Conx dabblings, in case there is a sudden interest in using notebooks…

Playgrounds, Playgrounds, Everywhere…

A quick round up of things I would have made time to try to play with in the past, but that I can’t get motivated to explore, partly because there’s nowhere I can imagine using them, and partly because, as traffic to this blog dwindles, there’s no real reason to share. So they’re here just as timeline markers in the space of tech evolution, if nothing else.

Play With Docker Classroom

Play With Docker Classroom provides an online interactive playground for getting started playing with Docker. A range of tutorials guide you through the commands you need to run to explore the Docker ecosystem and get your own demo services up and running.

You’re provided with a complete docker environment, running in the cloud (so no need to install anything…) and it also looks as if you can go off-piste. For example, I set one of my own OpenRefine containers running:

…had a look at the URL used to preview the hosted web services launched as part of one of the official tutorials, and made the guess that I could change the port number in the subdomain to access my own OpenRefine container…

Handy…

Python Anywhere

Suspecting once again I’ve been as-if sent to Coventry by the rest of my department, one of the few people that still talks to me in the OU tipped me off to the online hosted Python Anywhere environment. The free plan offers you a small hosted Python environment to play with, accessed via a browser IDE, and that allows you to run a small web app, for example. There’s a linked MySQL database for preserving state, and the ability to schedule jobs – so handy for managing the odd scraper, perhaps (though I’m not sure what external URLs you can access?)

The relatively cheap (and up) paid for accounts also offer a Jupyter notebook environment – it’s interesting this isn’t part of the free plan, which makes me wonder if that environment works as an effective enticement to go for the upgrade?

The Python Anywhere environment also looks as if it’s geared up to educational use – students signing up to the service can nominate another account as their “teacher”, which allows the teacher to see the student files, as well as get a realtime shared view of their console.

(Methinks it’d be quite interesting to have a full feature by feature comparison of Python Anywhere and CoCalc (SageMathCloud, as was…))

AWS SAM Local

Amazon Web Services (AWS) Serverless Application Model (SAM) describes a way of hosting serverless applications using AWS Lambda functions. A recent blogpost describes a local Docker testing environment that allows you to develop and test SAM compliant applications on your local machine and then deploy them to AWS. I always find working with AWS a real pain to set up, so if being able to develop and test locally means I can try to get apps working first and then worry about negotiating the AWS permissions and UI minefield if I ever want to try to actually run the thing on AWS, I may be more likely to play a bit with AWS SAM apps…
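
For a flavour of what one of these functions looks like, here’s a minimal sketch of a Python Lambda handler of the sort that might sit behind an HTTP endpoint. The greeting and the fake event are made up for illustration, and the event and response shapes assume the API Gateway proxy style of integration; the SAM template that would wire it up isn’t shown.

```python
import json


def lambda_handler(event, context):
    """Return a small JSON greeting for an HTTP-style event."""
    # queryStringParameters may be missing or None when no query string is sent
    params = event.get("queryStringParameters") or {}
    name = params.get("name", "world")
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"message": f"hello {name}"}),
    }


if __name__ == "__main__":
    # Local smoke test with a hand-made event; no AWS account required
    fake_event = {"queryStringParameters": {"name": "OUseful"}}
    print(lambda_handler(fake_event, None))
```

Because the handler is just a plain function, it can be exercised like this on the desktop before worrying about permissions or deployment.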

Tinkering With Neural Networks

On my to do list is writing some updated teaching material on neural networks. As part of the “research” process, I keep meaning to give some of the more popular deep learning frameworks – PyTorch, Tensorflow and ConvNet, perhaps – a spin, to see if we can simplify them to an introductory teaching level.

A first thing to do is evaluate the various playgrounds and demos, such as the Tensorflow NeuralNetwork Playground:

Another, possibly more useful, approach might be to start with some of the “train a neural network in your browser” Javascript libraries and see how easy it is to put together simple web app demos using just those libraries.

For example, ConvNetJS:

Or deeplearn.js:

(I should probably also look at Dale Lane’s Machine Learning For Kids site, that uses Scratch to build simple ML powered apps.)

LearnR R/Shiny Tutorial Builder

The LearnR package is a new-ish package for developing interactive tutorials in RStudio. (Hmmm… I wonder if the framework lets you write tutorial code using language kernels other than R?)

Just as Jupyter notebooks provided a new sort of interactive environment that I don’t think we have even begun to explore properly yet as an interactive teaching and learning medium, I think this could take a fair bit of playing with to get a proper feel for how to do things most effectively. Which isn’t going to happen, of course.

Plotly Dash

“Reactive Web Apps for Python”, apparently… This perhaps isn’t so much of interest as an ed tech, but it’s something I keep meaning to play with as a possible Python/Jupyter alternative to R/Shiny.

Outro

Should play with these, can’t be a****d…

Jupyter / IPython Notebook Textbook Companions

I still remember the first time I was introduced to Jupyter (then IPython) notebooks – a demonstration at the back of a large lecture room by Alfred Essa of his “Rwandan Tragedy” notebook:

(I think this was at OpenEd12 in Vancouver (“Beyond Content”), back in the days when I used to blag entry to conferences, somehow or other, so this would date it to October 2012…).

I don’t recall offhand what my immediate reaction was (I’d like to think it was unbridled enthusiasm, but I’m not sure I completely grokked it…). A scan of my laptop (commissioned over summer 2013?) shows I have notebook files dating back to at least December 2013 (Jan 2014 on Github gists), at which point I seemed to really start playing…

In that period of time, I suspect things had moved on quite a bit since seeing the Rwanda demo, such as the embedding of output charts in the notebooks. (Of course, now you can embed, and generate embeddings, of pretty much anything, as well as being able to easily add in interactive widgets into a notebook to control embedded interactives using code defined in the same notebook…)

So it seems it took me some time to start exploring them. But when I did start to use them, there was no going back…

I also remember meeting Alistair Willis in the OU Berrill café, mooting plans for the course that was to become TM351 (the first module team meeting was October 2013, perhaps?), and at some point the idea of using IPython notebooks for the course came up, though I’m not sure I had any experience of using the notebooks – just that they seemed like something worth exploring…

Alistair quickly became a fellow early believer in the notebooks, and since then, the FutureLearn course Learn to Code for Data Analysis, led by Michel Wermelinger, has also used them.

But that, to my knowledge, and to my shame, is as far as it’s gone in the OU.

The new first year equivalent introductory computing courses use Scratch (I tried to argue for BlockPy, but the decision had been made to go with what had been used before….) and IDLE, and whilst there was some talk of Python coding in the new level 2 and/or level 3 Engineering courses, I’m not sure how that’s progressed.

I’ve no idea what OUr Science courses are up to – or Maths courses. Or courses that use statistics. Or interactive maps.

Anyway, over the last few years I’ve come to live in Jupyter notebooks – they’re great for trying stuff out, keeping records of play and experiments in a way that the interactive command line isn’t (even if you do save all your history files), and can be used for sharing complete, worked, and working recipes.

As my own timeline suggests, getting from being aware of the notebooks to actually buying into them takes – using them. Which is maybe one reason why adoption in the OU has been slow: fear of the new.

Which is a shame – because there’s a great ecosystem developing around the use of notebooks.

For example, yesterday, whilst searching for “tractive effort gearing ipynb”, trying to find notebook examples of tractive effort curves (a phrase I picked up from Racecar Engineering mag – race engineers cut their teeth on the maths of finding gear ratios by calculating such curves, apparently…) I came across this notebook file:

Not rendered, but that’s easy enough using nbviewer:

which gives this:

Hmmm… a set of worked examples from a textbook. What textbook?, I wondered, and went up the URL path:

Interesting… So how active a project is this?

Hmm… really interesting… The examples may be pared down, but that means they can also be worked up. (It looks like there’s a Github repo, which I guess you can fork and then make pull requests back to with worked examples for new books, or improved notebooks for current ones?) And they show how to go about solving a wide range of problems by scripting them. (This is one reason why I think computing folk don’t like notebooks. They aren’t really interested in folk using simple scripting to get simple things done. Which is also the reason why computing folk are the worst people to try to teach computing to the masses, who don’t know code can be used, a line at a time, to get things done, and who don’t see the point in being taught the stuff that the computing folk want to teach. Which is old school computing principles, rather than TECHNOLOGY THAT’S USEFUL TO FOLK.)

Whatever…. As a “digital first” organisation, I keep wondering why we’re not buying into Jupyter as a piece of freakin’ awesome edtech?! (By the by, a history of the IPython notebook project can be found here.)

If nothing else, I’d be really interested to see research from OUr digital innovation leaders into why there’s no interest in adopting Jupyter notebooks in the institution?

See also: I Just Don’t Understand Why…

Coping With Time Periods in Python Pandas – Weekly, Monthly, Quarterly

One of the more powerful, and perhaps underused (by me at least), features of the Python/pandas data analysis package is its ability to work with time series and represent periods of time as well as simple “pointwise” dates and datetimes.

Here’s a link to a first draft Jupyter notebook showing how to cast weekly, monthly and quarterly periods in pandas from NHS time series datasets: Recipes – Representing Time Periods.ipynb
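
By way of a taster, here’s a minimal sketch of the basic idea using a handful of made-up dates rather than the NHS datasets used in the notebook. It isn’t taken from the notebook itself; it just shows pandas casting pointwise dates to weekly, monthly and quarterly periods.

```python
import pandas as pd

# Made-up dates, purely for illustration
dates = pd.Series(pd.to_datetime(["2017-08-09", "2017-08-21", "2017-10-02"]))

# Cast each pointwise date to the period that contains it
weekly = dates.dt.to_period("W")     # weeks ending on a Sunday by default
monthly = dates.dt.to_period("M")
quarterly = dates.dt.to_period("Q")  # calendar quarters by default

# A single period knows where it starts and ends
week = pd.Period("2017-08-09", freq="W")
print(week.start_time.date(), week.end_time.date())  # 2017-08-07 2017-08-13

# Periods can also be used as grouping keys, e.g. to count rows per month
print(dates.groupby(monthly).size())
```

The default week and quarter anchors can be changed using frequency strings such as "W-SAT" or "Q-MAR", which is handy where reporting periods don’t follow the calendar year.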

When Identifiers Don’t Align – NHS Digital GP Practice Codes and CQC Location IDs

One of the nice things about NHS Digital datasets is that there is a consistent use of identifier codes across multiple datasets. For example, GP Practice Codes are used to index particular GP practices across multiple datasets listed on both the GP and GP practice related data and General Practice Data Hub directory pages.

Information about GPs is also recorded by the CQC, who publish quality ratings across a wide range of health and social care providers. One of the nice things about the CQC data is that it also contains information about corporate groupings (and Companies House company numbers) and “Brands” with which a particular location is associated, which means you can start to explore the make up of the larger commercial providers.

Unfortunately, the identifier scheme used by the CQC is not the same as the one used by NHS Digital. This wouldn’t provide much of a hurdle if a lookup table was available that mapped the codes for GP practices rated by the CQC against the NHS Digital codes, but such a lookup table doesn’t appear to exist – or at least, is not easily discoverable.

So if we do want to join the CQC and NHS Digital datasets, what are we to do?

One approach is to look for common cribs across both datasets to bring them into partial alignment, and then try to do some exact matching within nearly aligned sets. For example, both datasets include postcode data, so if we match on postcode, we can then try to find a higher level of agreement by trying to exactly match location names sharing the same postcode.
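
A minimal sketch of that two step approach in pandas might look something like the following. The column names, codes, IDs and postcodes are all placeholders I’ve made up for illustration, as is the KINGS MEDICAL CTR spelling; they aren’t taken from the real datasets.

```python
import pandas as pd

# Made-up rows: codes, IDs and postcodes are placeholders, not real records
nhs = pd.DataFrame({
    "practice_code": ["X00001", "X00002"],
    "name": ["LINTHORPE SURGERY", "KINGS MEDICAL CENTRE"],
    "postcode": ["ZZ1 1AA", "ZZ2 2BB"],
})
cqc = pd.DataFrame({
    "location_id": ["1-0000000001", "1-0000000002"],
    "name": ["LINTHORPE SURGERY", "KINGS MEDICAL CTR"],
    "postcode": ["ZZ1 1AA", "ZZ2 2BB"],
})

# Step 1: bring the datasets into partial alignment by joining on postcode
candidates = nhs.merge(cqc, on="postcode", suffixes=("_nhs", "_cqc"))

# Step 2: within each shared postcode, keep only exact name matches
same_name = candidates["name_nhs"] == candidates["name_cqc"]
matched = candidates[same_name]
unmatched = candidates[~same_name]

print(matched[["practice_code", "location_id"]])
print(unmatched[["name_nhs", "name_cqc"]])
```

The unmatched rows are the ones worth eyeballing, or passing on to a more forgiving matcher.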

This gets us so far, but exact string matching is likely to return a high degree of false negatives (i.e. unmatched items that should be matched). For example, it’s easy enough for us to assume that THE LINTHORPE SURGERY and LINTHORPE SURGERY are the same, but they aren’t exact matches. We could improve the likelihood of matching by removing common stopwords and stopwords sensitive to this domain – THE, for example, or “CENTRE” – but using partial or fuzzy matching techniques is likely to work better still, albeit with the risk of now introducing false positive matches (that is, strings that are identified as matching at a particular confidence level but that we would probably rule out as a match, for example HIRSEL MEDICAL CENTRE and KINGS MEDICAL CENTRE).
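
To make the stopword and fuzzy matching ideas a bit more concrete, here’s a minimal sketch using difflib from the Python standard library. The stopword list, the 0.85 threshold and the function names are all assumptions picked for illustration; a dedicated fuzzy matching package may well do a better job.

```python
from difflib import SequenceMatcher

STOPWORDS = {"THE", "CENTRE"}  # illustrative only; tune for the domain


def normalise(name):
    """Upper-case a name and drop stopwords before comparing."""
    words = name.upper().split()
    return " ".join(w for w in words if w not in STOPWORDS)


def similarity(a, b):
    """Similarity score between 0 and 1 for two strings."""
    return SequenceMatcher(None, a, b).ratio()


def best_match(name, candidates, threshold=0.85):
    """Return (candidate, score) for the closest name, or (None, score)
    if nothing scores at or above the (arbitrary) threshold."""
    if not candidates:
        return None, 0.0
    scored = [(similarity(normalise(name), normalise(c)), c) for c in candidates]
    score, candidate = max(scored)
    return (candidate, score) if score >= threshold else (None, score)


# Stopword removal turns a near miss into an exact match
print(best_match("THE LINTHORPE SURGERY", ["LINTHORPE SURGERY", "KINGS MEDICAL CENTRE"]))

# Raw fuzzy scores can be high for names that differ only in the distinctive word
print(similarity("HIRSEL MEDICAL CENTRE", "KINGS MEDICAL CENTRE"))
```

Restricting the candidates to names that share a postcode before scoring them should cut down on false positives of the second kind.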

Anyway, here’s a quick sketch of how we might start to go about reconciling the datasets – comments appreciated about how to improve it further either here or in the repo issues: CQC and NHS Code Reconciliation.ipynb
