
Building a JSON API Using Jupyter Notebooks in Under 5 Minutes

In the post Making a Simple Database to Act as a Lookup for the ONS Register of Geographic Codes, I idly pondered creating a simple API for looking up ONS geographic codes.

Popping out to the shop for a pint of milk, I recalled one of the many things on my to do list was to look at the Jupyter Kernel Gateway, as described in the IBM Emerging Technologies blogpost on Jupyter Notebooks as RESTful Microservices.

The kernel_gateway project looks to be active – install it using something like pip3 install jupyter_kernel_gateway – and set up a notebook, such as SimpleAPI.ipynb (the following code blocks represent separate code cells).

Import some helper packages and connect to the geographic codes database I created in the previous post.

import json
import pandas as pd
import sqlite3
con = sqlite3.connect("onsgeocodes.sqlite")

Create a placeholder for the global REQUEST object that will be created when the API is invoked:

REQUEST = json.dumps({
'path' : {},
'args' : {}
})

Now let’s define the API.

For example, suppose I want to return some information about a particular code:

# GET /ons/:code
request = json.loads(REQUEST)
code = request['path'].get('code')

q='SELECT * FROM codelist WHERE "GEOGCD"="{code}"'.format(code=code)
print('{"codes":%s}' % pd.read_sql_query(q, con).to_json(orient='records'))

Or suppose I want to look up what current codes might exist that partially match a particular place name:

# GET /ons/current/:name
request = json.loads(REQUEST)
name = request['path'].get('name')

q='''
SELECT * FROM codelist JOIN metadata
WHERE "GEOGNM"="{name}" AND codeAbbrv=sheet AND codelist.STATUS="live"
'''.format(name=name)

print('{"codes":%s}' % pd.read_sql_query(q, con).to_json(orient='records'))

On the command line in the same directory as the notebook (for example, SimpleAPI.ipynb), I can now run the command:

jupyter kernelgateway --KernelGatewayApp.api='kernel_gateway.notebook_http' --KernelGatewayApp.seed_uri='./SimpleAPI.ipynb'

to start the server.

And as if by magic, an API appears:
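For example, a quick test of the GET /ons/:code route from Python (a minimal sketch: the port is the kernel gateway default, 8888, and the geography code is just an illustrative value, not one taken from the post):

import requests

# Query the GET /ons/:code route on the locally running gateway (default port 8888 assumed;
# the geography code here is just an illustrative placeholder)
r = requests.get('http://127.0.0.1:8888/ons/E06000001')
print(r.json())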

The original blogpost also describes a simple docker container for running the API. Which makes me think: this is so easy… And using something like LaunchBot, it’d be easy enough to have a simple manager for local API servers running on a personal desktop?

Here’s the complete notebook.

Scribbles – Data Rhetoric

Not in the sense of “rhetoric around the utility, or otherwise, of data, open data” etc, which refers to argumentation in favour of releasing open data, for example.

Nor strictly in the sense of how data can be used as part of a communication to support the persuasive aims of that communication.

But more in the sense of: how can data itself be used in a rhetorical way to persuade?

For example, taking the list of rhetorical devices from this glossary of rhetorical terms, what data rhetoric, or data visualisation rhetoric, examples might we give for each of them, or what corollaries might they have in a specifically “communicating with data” context?

Wider Context

Baesler, E. James, and Judee K. Burgoon. “The temporal effects of story and statistical evidence on belief change.” Communication Research 21.5 (1994): 582-602.

Cushion, Stephen, Justin Lewis, and Robert Callaghan. “Data Journalism, Impartiality and Statistical Claims: Towards more independent scrutiny in news reporting.” Journalism Practice (2016): 1-18.

Han, Bing, and Edward L. Fink. “How do statistical and narrative evidence affect persuasion?: The role of evidentiary features.” Argumentation and Advocacy 49.1 (2012): 39-58.

Pandey, Anshul Vikram, et al. “The persuasive power of data visualization.” IEEE Transactions on Visualization and Computer Graphics 20.12 (2014): 2211-2220.

Playgrounds, Playgrounds, Everywhere…

A quick round up of things I would have made time to try to play with in the past, but that I can’t get motivated to explore, partly because there’s nowhere I can imagine using them, and partly because, as traffic to this blog dwindles, there’s no real reason to share. So they’re here just as timeline markers in the space of tech evolution, if nothing else.

Play With Docker Classroom

Play With Docker Classroom provides an online interactive playground for getting started with Docker. A range of tutorials guides you through the commands you need to run to explore the Docker ecosystem and get your own demo services up and running.

You’re provided with a complete docker environment, running in the cloud (so no need to install anything…) and it also looks as if you can go off-piste. For example, I set one of my own OpenRefine containers running:

…had a look at the URL used to preview the hosted web services launched as part of one of the official tutorials, and made the guess that I could change the port number in the subdomain to access my own OpenRefine container…

Handy…

Python Anywhere

Suspecting once again I’ve been as-if sent to Coventry by the rest of my department, one of the few people that still talks to me in the OU tipped me off to the online hosted Python Anywhere environment. The free plan offers you a small hosted Python environment to play with, accessed via a browser IDE, that allows you to run a small web app, for example. There’s a linked MySQL database for preserving state, and the ability to schedule jobs – so handy for managing the odd scraper, perhaps (though I’m not sure what external URLs you can access?).

The relatively cheap (and up) paid-for accounts also offer a Jupyter notebook environment – it’s interesting that this isn’t part of the free plan, which makes me wonder whether that environment works as an effective enticement to go for the upgrade?

The Python Anywhere environment also looks as if it’s geared up for educational use – students signing up to the service can nominate another account as their “teacher”, which allows the teacher to see the student’s files, as well as get a realtime shared view of their console.

(Methinks it’d be quite interesting to have a full feature-by-feature comparison of Python Anywhere and CoCalc (SageMathCloud, as was…).)

AWS SAM Local

Amazon Web Services’ (AWS) Serverless Application Model (SAM) describes a way of hosting serverless applications using AWS Lambda functions. A recent blogpost describes a local Docker testing environment that allows you to develop and test SAM compliant applications on your local machine and then deploy them to AWS. I always find working with AWS a real pain to set up, so if being able to develop and test locally means I can try to get apps working first and then worry about negotiating the AWS permissions and UI minefield if I ever want to actually run the thing on AWS, I may be more likely to play a bit with AWS SAM apps…
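For reference, the thing a SAM template ultimately points at is just a plain Lambda handler function. A minimal sketch (the function name and response shape are standard Lambda conventions; nothing here is taken from the AWS blogpost):

import json

def lambda_handler(event, context):
    # 'event' carries the request details when the function sits behind an API Gateway endpoint
    return {
        'statusCode': 200,
        'body': json.dumps({'message': 'hello from a local SAM test'})
    }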

Tinkering With Neural Networks

On my to do list is writing some updated teaching material on neural networks. As part of the “research” process, I keep meaning to give some of the more popular deep learning frameworks – PyTorch, TensorFlow and ConvNetJS, perhaps – a spin, to see if we can simplify them to an introductory teaching level.

A first thing to do is evaluate the various playgrounds and demos, such as the TensorFlow Neural Network Playground:

Another, possibly more useful, approach might be to start with some of the “train a neural network in your browser” Javascript libraries and see how easy it is to put together simple web app demos using just those libraries.

For example, ConvNetJS:

Or deeplearn.js:

(I should probably also look at Dale Lane’s Machine Learning For Kids site, that uses Scratch to build simple ML powered apps.)

LearnR R/Shiny Tutorial Builder

LearnR is a new-ish package for developing interactive tutorials in RStudio. (Hmmm… I wonder if the framework lets you write tutorial code using language kernels other than R?)

Just as Jupyter notebooks provided a new sort of interactive environment that I don’t think we have even begun to explore properly yet as an interactive teaching and learning medium, I think this could take a fair bit of playing with to get a proper feel for how to do things most effectively. Which isn’t going to happen, of course.

Plotly Dash

“Reactive Web Apps for Python”, apparently… This perhaps isn’t so much of interest as an ed tech, but it’s something I keep meaning to play with as a possible Python/Jupyter alternative to R/Shiny.
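By way of a flavour (a minimal hello-world sketch, assuming a recent version of the dash package; nothing here is taken from the Plotly docs), a Dash app declares a layout and wires components together with Python callbacks:

from dash import Dash, dcc, html, Input, Output

app = Dash(__name__)
app.layout = html.Div([
    dcc.Input(id='name', value='world', type='text'),
    html.Div(id='greeting'),
])

# The callback re-runs whenever the input value changes (the "reactive" bit)
@app.callback(Output('greeting', 'children'), Input('name', 'value'))
def update_greeting(value):
    return 'Hello, {}!'.format(value)

if __name__ == '__main__':
    app.run(debug=True)  # older Dash releases used app.run_server()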

Outro

Should play with these, can’t be a****d…

Jupyter / IPython Notebook Textbook Companions

I still remember the first time I was introduced to Jupyter (then IPython) notebooks – a demonstration at the back of a large lecture room by Alfred Essa of his “Rwandan Tragedy” notebook:

(I think this was at OpenEd12 in Vancouver (“Beyond Content”), back in the days when I used to blag entry to conferences, somehow or other, so this would date it to October 2012…).

I don’t recall offhand what my immediate reaction was (I’d like to think it was unbridled enthusiasm, but I’m not sure I completely grokked it…). A scan of my laptop (commissioned over summer 2013?) shows I have notebook files dating back to at least December 2013 (Jan 2014 on Github gists), at which point I seemed to really start playing…

In that period of time, I suspect things had moved on quite a bit since seeing the Rwanda demo, such as the embedding of output charts in the notebooks. (Of course, now you can embed, and generate embeddings, of pretty much anything, as well as being able to easily add in interactive widgets into a notebook to control embedded interactives using code defined in the same notebook…)

So it seems it took me some time to start exploring them. But when I did start to use them, there was no going back…

I also remember meeting Alistair Willis in the OU Berrill café, mooting plans for the course that was to become TM351 (the first module team meeting was October 2013, perhaps?), and at some point the idea of using IPython notebooks for the course came up, though I’m not sure I had any experience of using the notebooks – just that they seemed like something worth exploring…

Alistair quickly became a fellow early believer in the notebooks, and since then, the FutureLearn course Learn to Code for Data Analysis, led by Michel Wermelinger, has also used them.

But that, to my knowledge, and to my shame, is as far as it’s gone in the OU.

The new first year equivalent introductory computing courses use Scratch (I tried to argue for BlockPy, but the decision had been made to go with what had been used before….) and IDLE, and whilst there was some talk of Python coding in the new level 2 and/or level 3 Engineering courses, I’m not sure how that’s progressed.

I’ve no idea what OUr Science courses are up to – or Maths courses. Or courses that use statistics. Or interactive maps.

Anyway, over the last few years I’ve come to live in Jupyter notebooks – they’re great for trying stuff out, keeping records of play and experiments in a way that the interactive command line isn’t (even if you do save all your history files), and can be used for sharing complete, worked, and working recipes.

As my own timeline suggests, being aware of the notebooks and actually buying into them, takes – using them. Which is maybe one reason why adoption in the OU has been slow: fear of the new.

Which is a shame – because there’s a great ecosystem developing around the use of notebooks.

For example, yesterday, whilst searching for “tractive effort gearing ipynb”, trying to find notebook examples of tractive effort curves (a phrase I picked up from Racecar Engineering mag – race engineers cut their teeth on the maths of finding gear ratios by calculating such curves, apparently…), I came across this notebook file:

Not rendered, but that’s easy enough using nbviewer:

which gives this:

Hmmm… a set of worked examples from a textbook. What textbook?, I wondered, and went up the URL path:

Interesting… So how active a project is this?

Hmm… really interesting… The examples may be pared down, but that means they can also be worked up. (It looks like there’s a Github repo, which I guess you can fork and then make pull requests back to, with worked examples for new books, or improved notebooks for current ones?) And they show how to go about solving a wide range of problems by scripting them. (This is one reason why I think computing folk don’t like notebooks. They aren’t really interested in folk using simple scripting to get simple things done. Which is also the reason why computing folk are the worst people to try to teach computing to the masses, who don’t know code can be used, a line at a time, to get things done, and who don’t see the point in being taught the stuff that the computing folk want to teach. Which is old school computing principles, rather than TECHNOLOGY THAT’S USEFUL TO FOLK.)

Whatever…. As a “digital first” organisation, I keep wondering why we’re not buying into Jupyter as a piece of freakin’ awesome edtech?! (By the by, a history of the IPython notebook project can be found here.)

If nothing else, I’d be really interested to see research from OUr digital innovation leaders into why there’s no interest in adopting Jupyter notebooks in the institution?

See also: I Just Don’t Understand Why…

Coping With Time Periods in Python Pandas – Weekly, Monthly, Quarterly

One of the more powerful, and perhaps underused (by me at least), features of the Python/pandas data analysis package is its ability to work with time series and represent periods of time as well as simple “pointwise” dates and datetimes.

Here’s a link to a first draft Jupyter notebook showing how to cast weekly, monthly and quarterly periods in pandas from NHS time series datasets: Recipes – Representing Time Periods.ipynb
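By way of a flavour of the sort of thing the notebook covers (a minimal sketch using made-up values rather than the NHS data itself):

import pandas as pd

# Point-in-time dates vs periods: a monthly period spans the whole month
m = pd.Period('2017-08', freq='M')
print(m.start_time, m.end_time)

# Quarterly and weekly periods work the same way
q = pd.Period('2017Q2', freq='Q')
w = pd.Period('2017-08-13', freq='W-SUN')  # the week (ending Sunday) containing that date

# Casting a column of datetimes to monthly periods
df = pd.DataFrame({'date': pd.to_datetime(['2017-06-01', '2017-07-15']), 'value': [10, 20]})
df['month'] = df['date'].dt.to_period('M')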

 

When Identifiers Don’t Align – NHS Digital GP Practice Codes and CQC Location IDs

One of the nice things about NHS Digital datasets is that there is a consistent use of identifier codes across multiple datasets. For example, GP Practice Codes are used to index particular GP practices across multiple datasets listed on both the GP and GP practice related data and General Practice Data Hub directory pages.

Information about GPs is also recorded by the CQC, who publish quality ratings across a wide range of health and social care providers. One of the nice things about the CQC data is that it also contains information about corporate groupings (and Companies House company numbers) and “Brands” with which a particular location is associated, which means you can start to explore the make up of the larger commercial providers.

Unfortunately, the identifier scheme used by the CQC is not the same as the one used by NHS Digital. This wouldn’t present much of a hurdle if a lookup table were available that mapped the codes for GP practices rated by the CQC against the NHS Digital codes, but such a lookup table doesn’t appear to exist – or at least, is not easily discoverable.

So if we do want to join the CQC and NHS Digital datasets, what are we to do?

One approach is to look for common cribs across both datasets to bring them into partial alignment, and then try to do some exact matching within the nearly aligned sets. For example, both datasets include postcode data, so if we match on postcode, we can then try to find a higher level of agreement by trying to exactly match location names sharing the same postcode.

This gets us so far, but exact string matching is likely to return a high degree of false negatives (i.e. unmatched items that should be matched). For example, it’s easy enough for us to assume that THE LINTHORPE SURGERY and LINTHORPE SURGERY are the same, but they aren’t exact matches. We could improve the likelihood of matching by removing common stopwords, and stopwords sensitive to this domain – THE, for example, or “CENTRE” – but using partial or fuzzy matching techniques is likely to work better still, albeit with the risk of now introducing false positive matches (that is, strings that are identified as matching at a particular confidence level but that we would probably rule out as a match, for example HIRSEL MEDICAL CENTRE and KINGS MEDICAL CENTRE).
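By way of illustration (a minimal sketch, not the code in the notebook linked below: the column names and the use of difflib from the standard library are assumptions), blocking on postcode and then fuzzy matching names within each block might look something like this:

import difflib
import pandas as pd

def fuzzy_match(cqc, nhs, threshold=0.8):
    """Assume two dataframes, each with 'postcode' and 'name' columns (hypothetical schema)."""
    matches = []
    # Block on postcode first, then fuzzy match names within each block
    for postcode, group in cqc.groupby('postcode'):
        candidates = nhs[nhs['postcode'] == postcode]
        for _, row in group.iterrows():
            for _, cand in candidates.iterrows():
                score = difflib.SequenceMatcher(None, row['name'].upper(), cand['name'].upper()).ratio()
                if score >= threshold:
                    matches.append((row['name'], cand['name'], score))
    return pd.DataFrame(matches, columns=['cqc_name', 'nhs_name', 'score'])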

Anyway, here’s a quick sketch of how we might start to go about reconciling the datasets – comments appreciated about how to improve it further either here or in the repo issues: CQC and NHS Code Reconciliation.ipynb

The Four S’s of Real Data – And a Need for Data Technicians, Not Data Scientists?

“Meh” to the 4 Vs of “big data”. For most people, most of the time, real data is:

  • small: a few rows and a few columns;
  • slow: comes out rarely, often according to a trailing schedule (once a week, once a month, some time after the reported period);
  • spreadsheeted: it just is…
  • smelly: indications in the data that something is wrong with the way it’s been collected, processed or analysed. (Cf. code smells, spreadsheet smells).

At the same time, all data projects, big or small, often require folk to do a whole chunk of work with the data before they can actually get round to using it. Much of the time spent on data projects is spent getting the data; cleaning it (is “J. Smith” the same as “J Smith”?); data-typing it (is “1” the number one or the character “1”? should “12/1/17” be saved as a date, and if so is it day or month first? is it the date, or the period corresponding to that day? etc.); putting it into a form you can work with (which may be a database, or a well formed spreadsheet); getting it into the right shape (that is, structured using rows and columns you can easily work with); and so on.
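To make that concrete (a minimal sketch with toy values, not tied to any particular dataset or to anything in the post), even a tiny table needs this sort of treatment before it’s usable:

import pandas as pd

df = pd.DataFrame({
    'name': ['J. Smith', 'J Smith'],
    'count': ['1', '2'],                   # numbers stored as strings
    'reported': ['12/1/17', '13/1/17'],    # dates, day first
})

df['name'] = df['name'].str.replace('.', '', regex=False).str.strip()  # "J. Smith" -> "J Smith"
df['count'] = pd.to_numeric(df['count'])                               # character "1" -> number 1
df['reported'] = pd.to_datetime(df['reported'], dayfirst=True)         # be explicit about day/month order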

If the value you think you want from, and what you pay your, data scientist for is the stats’n’insights’n’data mining stuff, then should they be spending most of their time doing the grunt work, much of which relies on craft knowledge and skills? How many data scientists do we actually need if they aren’t spending all their time poking around fixing the plumbing?

Don’t we need more data technicians or data tech eng‘s (technical engineers) who can do the labour intensive stuff well (using their craft knowledge) as well as making a bit of sense from it (getting a bit of “insight” out of it based on familiarity with it) using the real data every company has? I just don’t get this whole “data science” hype thing… More people need to fix a dripping tap than a leaking high pressure, superheated steam valve in an online nuclear power station. So why the hype about a huge skills gap in the latter when what every company needs is someone who can do the former?