Category: Tinkering

Python “Natural Time Periods” Package

Getting on for a year ago, I posted a recipe for generating “Natural Language” Time Periods in Python. At the time, a couple of people asked whether it was packaged – and it wasn’t…

It’s still not on pip, but I have since made a package repo for it: python-natural-time-periods.

Install using: pip3 install --force-reinstall --no-deps --upgrade git+https://github.com/psychemedia/python-natural-time-periods.git

The following period functions are defined:

  • today(), yesterday(), tomorrow()
  • last_week(), this_week(), next_week(), later_this_week(), earlier_this_week()
  • last_month(), next_month(), this_month(), earlier_this_month(), later_this_month()
  • day_lastweek(), day_thisweek(), day_nextweek()

Here’s an example of how to use it:

import natural_time_periods as ntpd

ntpd.today()
>>> datetime.date(2017, 8, 9)

ntpd.last_week()
>>> (datetime.date(2017, 7, 31), datetime.date(2017, 8, 6))

ntpd.later_this_month()
>>> (datetime.date(2017, 8, 10), datetime.date(2017, 8, 31))

ntpd.day_lastweek(ntpd.MON)
>>> datetime.date(2017, 7, 31)

ntpd.day_lastweek(ntpd.MON, iso=True)
>>> '2017-07-31'

NHS Digital Organisation Data Service (ODS) Python / Pandas Data Loader

One of the nice things about the Python pandas data analysis package is that there is a small – but growing – amount of support for downloading data from third party sources directly as a pandas dataframe using the pandas-datareader.

So I thought I’d have a go at producing a package inspired by the pandas wrapper for the World Bank indicators API for downloading administrative data from the NHS Digital Organisation Data Service. You can find the package here: python-pandas-datareader-NHSDigital.
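By way of comparison, the World Bank reader in pandas-datareader that inspired it works along the following lines – a minimal sketch, with the indicator code and date range purely illustrative:

#pandas-datareader World Bank wrapper - indicator and dates are illustrative only
from pandas_datareader import wb

#Download total population for the UK as a pandas dataframe
df = wb.download(indicator='SP.POP.TOTL', country='GB', start=2010, end=2015)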

There are also examples of how to use the package here: python-pandas-datareader-NHSDigital demo notebook.

When Identifiers Don’t Align – NHS Digital GP Practice Codes and CQC Location IDs

One of the nice things about NHS Digital datasets is that there is a consistent use of identifier codes across multiple datasets. For example, GP Practice Codes are used to index particular GP practices across multiple datasets listed on both the GP and GP practice related data and General Practice Data Hub directory pages.

Information about GPs is also recorded by the CQC, who publish quality ratings across a wide range of health and social care providers. One of the nice things about the CQC data is that it also contains information about corporate groupings (and Companies House company numbers) and “Brands” with which a particular location is associated, which means you can start to explore the make-up of the larger commercial providers.

Unfortunately, the identifier scheme used by the CQC is not the same as the one used by NHS Digital. This wouldn’t present much of a hurdle if a lookup table were available that mapped the codes for GP practices rated by the CQC against the NHS Digital codes, but such a lookup table doesn’t appear to exist – or at least, is not easily discoverable.

So if we do want to join the CQC and NHS Digital datasets, what are we to do?

One approach is to look for common cribs across both datasets to bring them into partial alignment, and then try to do exact matching within the nearly aligned sets. For example, both datasets include postcode data, so if we match on postcode, we can then try to find a higher level of agreement by exactly matching location names that share the same postcode.
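As a rough sketch of that idea – assuming both datasets have been loaded into pandas dataframes, and noting that the column names here are hypothetical rather than the ones actually used in either dataset:

import pandas as pd

#Hypothetical dataframes: cqc_df with columns ['location_name', 'postcode'],
#nhs_df with columns ['practice_name', 'postcode']

#Block on postcode...
candidates = pd.merge(cqc_df, nhs_df, on='postcode')

#...then keep the rows where the names also match exactly (ignoring case)
exact_matches = candidates[ candidates['location_name'].str.upper() == candidates['practice_name'].str.upper() ]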

This gets us so far, but exact string matching is likely to return a high number of false negatives (i.e. unmatched items that should be matched). For example, it’s easy enough for us to assume that THE LINTHORPE SURGERY and LINTHORPE SURGERY are the same, but they aren’t exact matches. We could improve the likelihood of matching by removing common stopwords, as well as stopwords specific to this domain – THE, for example, or “CENTRE” – but using partial or fuzzy matching techniques is likely to work better still, albeit with the risk of now introducing false positive matches (that is, strings that are identified as matching at a particular confidence level but that we would probably rule out as a match, for example HIRSEL MEDICAL CENTRE and KINGS MEDICAL CENTRE).
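By way of example, the similarity scorer in the Python standard library difflib package gives a feel for this sort of partial matching (it returns a score between 0 and 1 based on the proportion of characters in common; any threshold we then apply is a judgement call):

from difflib import SequenceMatcher

def similarity(a, b):
    #Return a score between 0 and 1 based on matching blocks of characters
    return SequenceMatcher(None, a.upper(), b.upper()).ratio()

similarity('THE LINTHORPE SURGERY', 'LINTHORPE SURGERY')
#A true match scores highly (roughly 0.9)...

similarity('HIRSEL MEDICAL CENTRE', 'KINGS MEDICAL CENTRE')
#...but unrelated names sharing common words can also score quite highly (around 0.8),
#which is where the false positives creep in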

Anyway, here’s a quick sketch of how we might start to go about reconciling the datasets – comments appreciated about how to improve it further either here or in the repo issues: CQC and NHS Code Reconciliation.ipynb

Fragment – Diagramming the Structure of a Python dict Using BlockDiag, & Some Quick Reflections on Computing Education

As a throwaway diagram in a piece of teaching material I wanted to visualise the structure of a Python dict. One tool I use for generating simple block diagrams is BlockDiag. This uses a simple notation for generating box and arrow diagrams:
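For reference, the sort of BlockDiag description I was aiming for looks something like the following (the node names are illustrative – they mirror the dict example below):

blockdiag {
   "a" -> "b";
   "a" -> "c";
   "c" -> "d";
   "c" -> "e";
}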

So how can we get the key structure from a nested Python dict in that form? There may well be a Python method somewhere for grabbing this information, but it was just a quick coffee break puzzle to write something that pulls out the structure and represents it as required:

def getChildren(d, t=None, root=None):
    #Walk a nested dict, building up a list of BlockDiag edge definitions
    if t is None: t = []
    if root is None:
        #Top level: recurse into each nested dict, using its key as the parent node
        for k in d:
            if isinstance(d[k], dict):
                t = getChildren(d[k], t, k)
    else:
        #Lower levels: add an edge from the parent (root) to each key,
        #recursing again wherever the value is itself a dict
        for k in d:
            t.append('"{root}" -> "{k}";'.format(root=root, k=k))
            if isinstance(d[k], dict):
                t = getChildren(d[k], t, k)
    return t

r={'a':{'b':1, 'c':{'d':2,'e':3}}}
print('{{ {graph} }}'.format(graph='\n'.join(getChildren(r)))) 

#------
{ "a" -> "b";
"a" -> "c";
"c" -> "d";
"c" -> "e"; }

There are probably better ways of doing it, but that’s not necessarily the point. Which is a point I realised chatting to a colleague earlier today: I’m not that interested in the teaching of formal computing approaches as a way of training enterprise developers. Nor am I interested in the teaching of computing through contrived toy examples. What I’m far more interested in is helping students do without us; and students, at that, who have end-user computing needs that they want to be able to satisfy in whatever domain they end up in.

Which is to say, not (in the first instance) enterprise level, production quality code. It’s code to get stuff done. End-user application development code. Personal, disposable/throwaway, ad hoc productivity tool development. Scruffy code that lets you use bits of string and gaffer tape and chunks of other people’s code to solve a particular problem.

But that’s not to say the code has to stay ropey… The first attempt at the above code lacked the guards that check whether a value is a dict, so it failed whenever a literal value was encountered. There may well be other things that are broken about it, but I can fix those as they crop up (because I know the sort of output I expect to see*, and if I don’t get it, I can try to fix it). I also had to go and look up how to include literal curly brackets in a Python format string (double them up) for the print statement. But that’s okay. That’s just syntax… Knowing that I should be able to print the literal brackets was the important thing… And that’s all part of what I think a lot of our curriculum lacks – enthusing folk, making them curious, getting them to have expectations about what is and isn’t and should be possible**, and then being able to act on that.

* informal test driven end user software application development…?!;-)
** with some personal ethics about what may be possible but shouldn’t be pursued and should be lobbied against…

Using Jupyter Notebooks For Assessment – Export as Word (.docx) Extension

One of the things we still haven’t properly worked out in our Data management and analysis (TM351) course is how best to handle Jupyter notebook-based assignments. The assignments are set using a notebook that describes the tasks, which the student then completes in the same notebook. We then need some mechanism for:

  • students to submit the assessment electronically;
  • markers mark assessments for their students: if the document contains a lot of OU text, it can be hard for the marker to locate the student text;
  • markers may provide on-script feedback; this means the marker needs to be able to edit the document and make changes/annotations;
  • markers return scripts to students;
  • students read feedback – so they need to be able to locate and distinguish the marker feedback within the document.

One Frankenstein process we tried was for students to save a Jupyter notebook file as a Markdown or HTML document and then convert it to a Microsoft Word document using pandoc.

This document could then be submitted and marked in a traditional way, with markers using comments and track changes to annotate the student script. Unfortunately, our original 32 bit VM meant we had to use an old version of pandoc, with the result that tabular data was not handled at all well in the conversion-to-Word process.

Updating to a 64 bit virtual machine means we can update pandoc, and the Word document conversion is now much smoother. However, the conversion process still requires students to export the notebook as HTML and then use pandoc to convert the HTML to the Microsoft Word .docx format. (The Jupyter nbconvert utility does not currently export to Word.)
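The manual route looks something like the following (filenames illustrative, and the pandoc options may need tweaking for a particular setup):

#Export the notebook as HTML...
jupyter nbconvert --to html assignment.ipynb

#...then convert the HTML to a Word document using pandoc
pandoc -s assignment.html -o assignment.docx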

So to make things a little easier, here’s my first attempt at a Download Jupyter Notebook as Word (.docx) extension to do just that. It makes use of the Jupyter notebook custom bundler extensions API, which allows you to add additional options to the notebook File -> Download menu. The code I used was also cribbed from the dashboards_bundlers package, which converts a notebook to a dashboard and then downloads it.

One thing it doesn’t handle at the moment is embedded content such as interactive maps. I’ve previously come up with a workaround for generating static images of interactive maps created using the folium package by using selenium to render the map and grab a screenshot of it; I’m not sure if that would work in our headless VM, though? (One to try, I guess?) There’s also a related thread in the folium repo issue tracker.
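The extension itself is essentially a small module exposing the bundler API. As a minimal sketch of the shape of such a module – the conversion step here uses pypandoc, and the details are assumptions rather than the actual wordexport code, which lives in the repo:

#wordexport/wordexport.py - sketch only, not the actual extension code
import os
import tempfile

import nbformat
from nbconvert import HTMLExporter
import pypandoc

def _jupyter_bundlerextension_paths():
    #Declare the bundler so it appears in the notebook File -> Download menu
    return [{'name': 'wordexport',
             'label': 'MS Word (.docx)',
             'module_name': 'wordexport.wordexport',
             'group': 'download'}]

def bundle(handler, model):
    #Convert the notebook content to HTML...
    nb = nbformat.from_dict(model['content'])
    html, _ = HTMLExporter().from_notebook_node(nb)

    #...then use pandoc (via pypandoc) to convert the HTML to .docx
    with tempfile.TemporaryDirectory() as tmpdir:
        docx_path = os.path.join(tmpdir, 'notebook.docx')
        pypandoc.convert_text(html, 'docx', format='html', outputfile=docx_path)

        #Return the .docx file to the browser as a download
        filename = os.path.splitext(os.path.basename(model['name']))[0] + '.docx'
        handler.set_header('Content-Disposition',
                           'attachment; filename="{}"'.format(filename))
        handler.set_header('Content-Type',
                           'application/vnd.openxmlformats-officedocument.wordprocessingml.document')
        with open(docx_path, 'rb') as f:
            handler.write(f.read())
    handler.finish()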

The above script is placed in a wordexport folder inside a package folder containing a simple setup.py script:

from setuptools import setup

setup(name='wordexport',
      version='0.0.1',
      description='Export Jupyter notebook as .docx file',
      author='Tony Hirst',
      author_email='tony.hirst@open.ac.uk',
      license='MIT',
      packages=['wordexport'],
      zip_safe=False)

The package can be installed and the extension enabled using a riff along the lines of the following command-line commands:

echo "...wordexport install..."
#Install the wordexport (.docx exporter) extension package
pip3 install --upgrade --force-reinstall ${THISDIR}/jupyter_custom_files/nbextensions/wordexport

#Enable the wordexport extension
jupyter bundlerextension enable --py wordexport.wordexport  --sys-prefix
echo "...wordexport done"

Restart the Jupyter server after enabling the extension, and the result should be a new MS Word (.docx) option in the notebook File -> Download menu option.

Querying Large CSV Files With Apache Drill

Via a post on the rud.is blog – Drilling Into CSVs — Teaser Trailer – I came across a handy looking Apache tool: Apache Drill. A Java powered service, Apache Drill allows you to query large CSV and JSON files (as well as a range of other data backends) using SQL, without any particular manipulation of the target data files. (The notes also suggest you can query directly over a set of CSV files (with the same columns?) in a directory, though I haven’t tried that yet…)

To give it a go, I downloaded Evan Odell’s Hansard dataset, which comes in as a CSV file at just over 3GB.

Installing Apache Drill, and running it from the command line – ./apache-drill-1.10.0/bin/drill-embedded – it was easy enough to start running queries from the off (Querying Plain Text Files):

SELECT * FROM dfs.`/Users/ajh59/Documents/parlidata/senti_post_v2.csv` LIMIT 3;

By default, queries from a CSV file ignore headers and treat all rows equally. Queries over particular columns can be executed by referring to numbered columns in the form COLUMNS[0], COLUMNS[1], etc. (Querying Plain Text Files). However, Bob Rudis’ blog hinted there was a way to configure the server to use the first row of a CSV file as a header row. In particular, the Data Source Plugin Configuration Basics docs page describes how the CSV data source configuration can be amended with the clauses "skipFirstLine": true, "extractHeader": true to allow CSV files to be queried with the header row intact.
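The relevant fragment of the dfs storage plugin configuration then ends up looking something like the following (based on the docs page; the other settings in the plugin definition are left as they were):

"formats": {
    "csv": {
      "type": "text",
      "extensions": ["csv"],
      "skipFirstLine": true,
      "extractHeader": true,
      "delimiter": ","
    }
}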

The configuration file for the live server can be amended via a web page published by the running Apache Drill service, by default on localhost port 8047:

Updating the configuration means we can start to run named column queries:

The config files are actually saved to a temporary location – /tmp/drill. If the (updated) files are copied to a persistent location – mv /tmp/drill /my/configpath – and the drill-override.conf file updated with the setting drill.exec: {sys.store.provider.local.path="/my/configpath"}, the (updated) configuration files will in future be loaded from that location, rather than temporary default config files being created for each new server session (docs: Storage Plugin Registration).
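Laid out as a file, the relevant drill-override.conf stanza looks something like this (using the example path from above):

drill.exec: {
  sys.store.provider.local.path = "/my/configpath"
}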

Bob Rudis’ post also suggested that more efficient queries could be run by converting the CSV data file to a parquet data format, and then querying over that:

CREATE TABLE dfs.tmp.`/senti_post_v2.parquet` AS SELECT * FROM dfs.`/Users/ajh59/Documents/parlidata/senti_post_v2.csv`;

This creates a new parquet data folder /tmp/senti_post_v2.parquet. This can then be queried as for the CSV file:

SELECT gender, count(*) FROM dfs.tmp.`/senti_post_v2.parquet` GROUP BY gender;

…but with a significant speed up, on some queries at least:

To quit, !exit.

And finally, to make the Apache Drill service easier to use from code, wrapper libraries are available for R – the sergeant R package – and Python – the pydrill package.
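For example, a quick query from Python using pydrill looks something like the following (following the pattern in the pydrill docs, and reusing the query from above):

from pydrill.client import PyDrill

drill = PyDrill(host='localhost', port=8047)

#Run a query against the running Drill service and pull the result into a pandas dataframe
result = drill.query("SELECT gender, count(*) AS n FROM dfs.tmp.`/senti_post_v2.parquet` GROUP BY gender")
df = result.to_dataframe()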

First Attempt at Running the TM351 VM as an AMI on Amazon Web Services

One of the things that’s been on my to do list for ages is trying to get a version of the TM351 virtual machine (VM) up and running on Amazon Web Services (AWS) as an Amazon Machine Instance (AMI). This would allow students who are having trouble running the VM on their own computer to access the services running in the cloud.

(Obviously, it would be preferable if we could offer such a service via OU operated servers, but I can’t do politics well enough, and don’t have the mentality to attend enough of the necessary say-the-same-thing-again-again meetings, to make that sort of thing happen.)

So… a first attempt is up on the eu-west-1 region in all its insecure glory: TM351 AMI v1. The security model is by obscurity as much as anything – there’s no model for setting separate passwords for separate students, for example, or checking back against an OU auth layer. And I suspect everything runs as root…

(One of the things we have noticed in (brief) testing is that the Getting Started instructions don’t work inside the OU, at least if you try to limit access to your (supposed) IP address. Reminds me of when we gave up trying to build the OU VM from machines on the OU network, because solving proxy and blocked port issues was an irrelevant problem to have to worry about when working from the outside…)

Open Refine doesn’t seem to want to run with the other services in the free tier micro (1GB) machine instance, but at 2GB everything seems okay. (I don’t know if possible race conditions in starting services mean that Open Refine could start and then block the Jupyter service’s request for resource. I need to do an Apollo 13 style startup sequence exploration to see if all services can run in 1GB, I guess!) One thing I’ve added to the to do list is to split things out into separate AMIs that will work on the 1GB free tier machines. I also want to check that I can provision the AMI from Vagrant, so students could then launch a local VM or an Amazon instance that way, just by changing the vagrant provider. (Shared folders/volumes might get a bit messed up in that case, though?)

If services can run one at a time in the 1GB machines, it’d be nice to provide a simple dashboard to start and stop the services to make that easier to manage. Something that looks a bit like this, for example, exposed via an authenticated web page:

This needn’t be too complex – I had in mind a simple Python web app that could run under nginx (which currently provides a simple authentication layer for Open Refine to sit behind) and then just runs simple systemctl start, stop and restart commands on the appropriate service.

#fragment...
import os
os.system('systemctl restart jupyter.service')
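A minimal sketch of the sort of thing I mean, using Flask (completely untested; the service names are assumptions, and any authentication is left to the nginx layer in front of it):

#Hypothetical minimal service control app - a sketch, not tested code
import subprocess
from flask import Flask, abort

app = Flask(__name__)

SERVICES = ['jupyter', 'openrefine']             #services we're prepared to manage
ACTIONS = ['start', 'stop', 'restart', 'status']

@app.route('/service/<name>/<action>')
def control(name, action):
    #Only allow known services and actions through to systemctl
    if name not in SERVICES or action not in ACTIONS:
        abort(404)
    result = subprocess.run(['systemctl', action, '{}.service'.format(name)],
                            capture_output=True, text=True)
    return '{} {}: {}'.format(action, name, result.stdout or 'done')

if __name__ == '__main__':
    app.run(port=8080)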

I’m not sure how the status display should be updated (based on whether a service is running or not), or on what sort of heartbeat it should refresh. There may be better ways, of course, in which case please let me know via the comments :-)

I did have a quick look round for examples, but the dashboards/monitoring tools that do exist, such as pydash, are far more elaborate than what I had in mind. (If you know of a simple example to do the above, or can knock one up for me, please let me know via the comments. And the simpler the better ;-)

If we are to start exploring the use of browser accessed applications running inside user-managed VMs, this sort of simple application could be really handy… (Another approach would be to use a VM running docker, and then have a container manager running, such as portainer.)