Category: Tinkering

Authoring Multiple Docs from a Single IPython Notebook

It’s my not-OU today, and whilst I should really be sacrificing it to work on some content for a FutureLearn course, I thought instead I’d tinker with a workflow tool related to the production process we’re using.

The course will be presented as a set of HTML docs on FutureLearn, supported by a set of IPython notebooks that learners will download and execute themselves.

The handover resources will be something like:

– a set of IPython notebooks;
– a Word document for each week containing the content to appear online (this document will be used as the basis for multiple pages on the course website; the content is entered into the FutureLearn system by someone else as markdown, though I’m not sure what flavour?);
– for each video asset, a Word document containing the script;
– ?separate image files (the images will also be in the Word doc).

Separate webpages provide teaching that leads into a linked IPython notebook. (Learners will be running IPython via Anaconda on their own desktops – which means tablet/netbook users won’t be able to do the interactive activities as currently delivered; we looked at using Wakari, but didn’t go with it; offering our own hosted solution or tmpnb server was considered out of scope.)

The way I have authored my week is to create a single IPython document that proceeds in a linear fashion, with “FutureLearn webpage” content authored as markdown, as well as incorporating executed code cells, followed by “IPython notebook” activity content relating to the previous “webpage”. The “IPython notebook” sections are preceded by a markdown cell containing a START NOTEBOOK statement, and closed with a markdown cell containing an END NOTEBOOK statement.

I then run a simple script that:

  • generates one IPython notebook per “IPython notebook” section;
  • creates a monolithic notebook containing all of, and only, the “FutureLearn webpage” content;
  • generates a markdown version of that monolithic notebook;
  • uses pandoc to convert the monolithic markdown doc to a Microsoft Word/docx file.

[Image: fl_ipynb_workflow]

Note that it would be easy enough to render each “FutureLearn webpage” doc as markdown directly from the original notebook source, into its own file that could presumably be added directly to FutureLearn, but that was seen as being overly complex compared to the original “copy rendered markdown from notebook into Word and then somehow generate markdown to put into FutureLearn editor” route.

import io, sys
import IPython.nbformat as nb
import IPython.nbformat.v4.nbbase as nb4

#Are we in a notebook segment?
innb=False

#Quick and dirty count of notebooks
c=1

#The monolithic notebook contains all the content except the separate notebook content
monolith=nb4.new_notebook()

#Load the original doc in
mynb=nb.read('ORIGINAL.ipynb',nb.NO_CONVERT)

#For each cell in the original doc:
for i in mynb['cells']:
    if (i['cell_type']=='markdown'):
        #See if we can spot a standalone notebook content delimiter
        if ('START NOTEBOOK' in i['source']):
            #At the start of a block, create a new notebook
            innb=True
            test=nb4.new_notebook()
        elif ('END NOTEBOOK' in i['source']):
            #At the end of the block, save the code to a new standalone notebook file
            innb=False
            nb.write(test,'test{}.ipynb'.format(c))
            c=c+1
        elif (innb):
            test.cells.append(nb4.new_markdown_cell(i['source']))
        else:
            monolith.cells.append(nb4.new_markdown_cell(i['source']))
    elif (i['cell_type']=='code'):
        #For the code cells, preserve any output text
        cc=nb4.new_code_cell(i['source'])
        for o in i['outputs']:
            cc['outputs'].append(o)
        #Route the code cell as required...
        if (innb):
            test.cells.append(cc)
        else:
            monolith.cells.append(cc)

#Save the monolithic notebook
nb.write(monolith,'monolith.ipynb')

#Convert it to markdown
!ipython nbconvert --to markdown monolith.ipynb

##On a Mac, I got pandoc via:
#brew install pandoc

#Generate a Microsoft .docx file from the markdown
!pandoc -o monolith.docx -f markdown -t docx monolith.md
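
As an aside, rather than shelling out to ipython nbconvert, the markdown conversion could presumably also be done in-process using nbconvert’s MarkdownExporter – a minimal sketch, untested against this particular workflow:

from IPython.nbconvert.exporters import MarkdownExporter

#Render the monolithic notebook node to a markdown string in-process
body, resources = MarkdownExporter().from_notebook_node(monolith)
with io.open('monolith.md', 'w', encoding='utf-8') as f:
    f.write(body)

The same approach could presumably be used to write each “FutureLearn webpage” section to its own markdown file directly, as per the note above.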

What this means is that I can author a multiple chapter, multiple notebook minicourse within a single IPython notebook, then segment it into a set of standalone files across a variety of document types.

Of course, what I really should have been doing was working on the course material… but then again, it was supposed to be my not-OU today…;-)

Data Driven Press Releases From HSCIC Data – Diabetes Prescribing

By chance, I saw a tweet from the HSCIC yesterday announcing ‘Prescribing for Diabetes, England – 2005/06 to 2014/15’ http://bit.ly/1J3h0g8 #hscicstats.

The data comes via a couple of spreadsheets, broken down at the CCG level.

As an experiment, I thought I’d see how quickly I could come up with a story form and template for generating a “data driven press release” that localises the data, and presents it in a textual form, for a particular CCG.

It took a couple of hours, and at the moment my recipe is hard coded to the Isle of Wight, but it should be easily generalisable to other CCGs (the blocker at the moment is identifying regional codes from CCG codes – the spreadsheets in the release don’t provide that linkage, so another source for that data is required).

Anyway, here’s what I came up with:

[Image: Sketching a handcrafted data2text report for diabetes prescribing]

Figures recently published by the HSCIC show that for the reporting period Financial 2014/2015, the total Net Ingredient Costs (NIC) for prescribed diabetes drugs was £2,450,628.59, representing 9.90% of overall Net Ingredient Costs. The ISLE OF WIGHT CCG prescribed 136,169 diabetes drugs items, representing 4.17% of all prescribed items. The average net ingredient cost (NIC) was £18.00 per item. This compares to 4.02% of items (9.85% of total NIC) in the Wessex (Q70) region and 4.45% of items (9.98% of total NIC) in England.

Of the total diabetes drugs prescribed, Insulins accounted for 21,170 items at a total NIC of £1,013,676.82 (£47.88 per item (on average), 0.65% of overall prescriptions, 4.10% of total NIC) and Antidiabetic Drugs accounted for 93,660 items at a total NIC of £825,682.54 (£8.82 per item (on average), 2.87% of overall prescriptions, 3.34% of total NIC).

For the NHS ISLE OF WIGHT CCG, the NIC in 2014/15 per patient on the QOF diabetes register in 2013/14 was £321.53. The QOF prevalence of diabetes, aged 17+, for the NHS ISLE OF WIGHT CCG in 2013/14 was 6.43%. This compares to a prevalence rate of 6.20% in Wessex and 5.70% across England.

All the text generator requires me to do is pass in the name of the CCG and the area code, and it does the rest. You can find the notebook that contains the code here: diabetes prescribing textualiser.
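
The notebook itself isn’t reproduced here, but under the hood it’s essentially just templated text populated from the spreadsheet values – something like the following minimal sketch (the field names are mine, and the template wording and numbers are lifted from the example output above, purely for illustration):

#Minimal, illustrative data2text sketch - the field names are mine, not HSCIC's
report_template = (
    "Figures recently published by the HSCIC show that for the reporting period {period}, "
    "the total Net Ingredient Costs (NIC) for prescribed diabetes drugs was £{nic:,.2f}, "
    "representing {nic_pc:.2f}% of overall Net Ingredient Costs. "
    "The {ccg} prescribed {items:,} diabetes drugs items, "
    "representing {items_pc:.2f}% of all prescribed items."
)

row = {'period': 'Financial 2014/2015', 'ccg': 'ISLE OF WIGHT CCG',
       'nic': 2450628.59, 'nic_pc': 9.90, 'items': 136169, 'items_pc': 4.17}

print(report_template.format(**row))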

Fragments – Scraping Tabular Data from PDFs

Over the weekend, we went to Snetterton to watch the BTCC touring cars and go-for-it Ginetta Juniors. Timing sheets from the event are available on the TSL website, so I thought I’d have a play with the data…

Each series has its own results booklet, a multi-page PDF document containing a range of timing sheets. Here’s an example of part of one of them:

[Image: ginettaJnrSnetterton2015.pdf, page 28 of 44]

It’s easy enough to use tools like Tabula (at version 1.0 as of August, 2015) to extract the data from regular (ish) tables, but for more complex tables we’d need to do some additional cleaning.

For example, on a page like:

[Image: ginettaJnrSnetterton2015.pdf, page 35 of 44]

we get the data out simply by selecting the bits of the PDF we are interested in:

[Image: Select Tables – Tabula]

and preview (or export it):

[Image: Export Data – Tabula]

Note that this would still require a bit of work to regularise it further, perhaps using something like OpenRefine.

When I scrape PDFs, I tend to use pdftohtml (from the poppler package, I think?) and then parse the resulting XML:

import os
fn='ginettaJnrSnetterton2015'
cmd = 'pdftohtml -xml -nodrm -zoom 1.5 -enc UTF-8 -noframes %s "%s" "%s"' % ( '',fn+'.pdf', os.path.splitext(fn+'.xml')[0])
# Can't turn off output? Throw it away...
cmd = cmd + " >/dev/null 2>&1"
os.system(cmd)

import lxml.etree

xmldata = open(fn+'.xml','r').read()
root = lxml.etree.fromstring(xmldata)
pages = list(root)

We can then quickly preview the “raw” data we’re getting from the PDF:

def flatten(el):
    #Concatenate the text content of an element and all of its children
    result = [ (el.text or "") ]
    for sel in el:
        result.append(flatten(sel))
        result.append(sel.tail or "")
    return "".join(result)

def pageview(pages,page):
    #Print the left/top co-ordinates and flattened text of each element on a page
    for el in pages[page]:
        print( el.attrib['left'], el.attrib['top'],flatten(el))

[Image: TSL – BTCC Scraper]

The scraped data includes top and left co-ordinates for each text element. We can count how many data elements are found at each x (left) co-ordinate and use that to help build our scraper.
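
A quick way of getting those counts is something like the following sketch, using the pages list parsed above (the page index here is arbitrary):

from collections import Counter

#Count how many text elements start at each left (x) co-ordinate on a given page
def left_counts(pages, page):
    return Counter(int(el.attrib['left']) for el in pages[page] if 'left' in el.attrib)

print(left_counts(pages, 2).most_common(10))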

By eye, we can spot natural breaks in the counts…:

[Image: TSL – BTCC Scraper 2]

but can we also detect them automatically? The Jenks Natural Breaks algorithm [code] looks like it tries to do that…

[Image: TSL – BTCC Scraper 4]

The centres identified by the Jenks natural breaks algorithm could then be used as part of a default hierarchy to assign a particular data element to a particular column. Crudely, we might use something like the following:

[Image: TSL – BTCC Scraper 5]
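
The code in that screenshot isn’t reproduced here, but crudely the sort of thing I mean is along these lines (the breaks values are hypothetical – in practice they’d come from the natural breaks found above):

#Assign a scraped text element to a column based on the nearest column centre
breaks = [50, 120, 260, 410, 550]  #hypothetical column centres (left co-ordinates)

def assign_column(left, breaks):
    #Return the index of the column centre closest to the element's left co-ordinate
    return min(range(len(breaks)), key=lambda i: abs(left - breaks[i]))

#eg for each text element el on a page: assign_column(int(el.attrib['left']), breaks)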

Whilst it’s quite possible to hand-build scrapers that inspect each element scraped from the PDF document in turn, I notice that the Tabula extraction engine now has a command line interface, so it may be worth spending some time looking at that instead. (It would also be nice if the Tabula GUI could be used to export configuration info, so you could highlight areas of a PDF using the graphical tools and then generate the corresponding command line parameter values for reuse from the command line?)

PS another handy PDF table extractor is published by Scraperwiki: pdftables.com. Which is probably the way to go if you have the funds to pay for it…

PPS A handy summary from the Scraperwiki blog about the different sorts of table-containing documents you often come across as PDFs: The four kinds of data PDF (“large tables”, “pivotted tables”, “transactions”, “reports”).

PPPS This also looks relevant – an MSc thesis by Anssi Nurminen, from Tampere University of Technology, on Algorithmic Extraction of Data in Tables in PDF; also this report by Burcu Yildiz, Katharina Kaiser, and Silvia Miksch on pdf2table: A Method to Extract Table Information from PDF Files and an associated Masters thesis by Burcu Yildiz, Information Extraction – Utilizing Table Patterns.

Running a Shell Script Once Only in Vagrant

Via somewhere (I’ve lost track of the link), here’s a handy recipe for running a shell script once and once only from a Vagrantfile.

In the shell script (runonce.sh):

#!/bin/bash

if [ ! -f ~/runonce ]
then

  #ONCE RUN CODE HERE

  touch ~/runonce
fi

In the Vagrantfile:

  config.vm.provision :shell, :inline => <<-SH
    chmod ugo+x /vagrant/runonce.sh
    /vagrant/runonce.sh
  SH

Exporting and Distributing Docker Images and Data Container Contents

Although it was a beautiful day today, and I should really have spent it in the garden, or tinkering with F1 data, I lost the day to the screen and keyboard pondering various ways in which we might be able to use Kitematic to support course activities.

One thing I’ve had on pause for some time is the possibility of distributing docker images to students via a USB stick, and then loading them into Kitematic. To do this we need to get tarballs of the appropriate images so we can then distribute them.

docker save psychemedia/openrefine_ou:tm351d2test | gzip -c > test_openrefine_ou.tgz
docker save psychemedia/tm351_scipystacknserver:tm351d3test | gzip -c > test_ipynb.tgz
docker save psychemedia/dockerui_patch:tm351d2test | gzip -c > test_dockerui.tgz
docker save busybox:latest | gzip -c > test_busybox.tgz
docker save mongo:latest | gzip -c > test_mongo.tgz
docker save postgres:latest | gzip -c > test_postgres.tgz

On the to do list is getting these to work with the portable Kitematic branch (I’m not sure if that branch will continue, or whether the interest is too niche?!), but in the meantime, I could load them into the Kitematic VM from the Kitematic CLI using:

docker load < test_mongo.tgz

assuming the test_mongo.tgz file is in the current working directory.

Another thing I need to explore is how to set up the data volume containers on the students’ machines.

The current virtual machine build scripts aim to seed the databases from raw data, but to set up the student machines it would seem more sensible to either rebuild a database from a backup, or just load in a copy of the seeded data volume container. (All the while we have to be mindful of providing a route for the students to recreate the original, as distributed, setup, just in case things go wrong. At the same time, we also need to start thinking about backup strategies for the students so they can checkpoint their own work…)

The traditional backup and restore route for PostgreSQL seems to be something like the following:

#Use docker exec to run a postgres export
docker exec -t vagrant_devpostgres_1 pg_dumpall -Upostgres -c > dump_`date +%d-%m-%Y"_"%H_%M_%S`.sql
#If it's a large file, maybe worth zipping: pg_dump dbname | gzip > filename.gz

#The restore route would presumably be something like:
cat postgres_dump.sql | docker exec -i vagrant_devpostgres_1 psql -Upostgres
#For the compressed backup: cat postgres_dump.gz | gunzip | psql -Upostgres

For mongo, things seem to be a little bit more complicated. Something like:

docker exec -t vagrant_mongo_1 mongodump

#Complementary restore command is: mongorestore

would generate a dump in the container, but then we’d have to tar it and get it out? Something like these mongodump containers may be easier? (mongo seems to have issues with mounting data containers on the host, on a Mac at least?)

By the by, if you need to get into a container within a Vagrant launched VM (I use vagrant with vagrant-docker-compose), the following shows how:

#If you need to get into a container:
vagrant ssh
#Then in the VM:
  docker exec -it CONTAINERNAME bash

Another way of getting to the data is to export the contents of the seeded data volume containers from the build machine. For example:

#  Export data from a data volume container that is linked to a database server

#postgres
docker run --volumes-from vagrant_devpostgres_1 -v $(pwd):/backup busybox tar cvf /backup/postgresbackup.tar /var/lib/postgresql/data 

#I wonder if these should be run with --rm to dispose of the temporary container once run?

#mongo - BUT SEE CAVEAT BELOW
docker run --volumes-from vagrant_mongo_1 -v $(pwd):/backup busybox tar cvf /backup/mongobackup.tar /data/db

We can then take the tar file, distribute it to students, and use it to seed a data volume container.

Again, from the Kitematic command line, I can run something like the following to create a couple of data volume containers:

#Create a data volume container
docker create -v /var/lib/postgresql/data --name devpostgresdata busybox true
#Restore the contents
docker run --volumes-from devpostgresdata -v $(pwd):/backup ubuntu sh -c "tar xvf /backup/postgresbackup.tar"
#Note - the docker helpfiles don't show how to use sh -c - which appears to be required...
#Again, I wonder whether this should be run with --rm somewhere to minimise clutter?

Unfortunately, things don’t seem to run so smoothly with mongo?

#Unfortunately, when trying to run a mongo server against a data volume container
#the presence of a mongod.lock seems to break things
#We probably shouldn't do this, but if the database has settled down and completed
#  all its writes, it should be okay?!
docker run --volumes-from vagrant_mongo_1 -v $(pwd):/backup busybox tar cvf /backup/mongobackup.tar /data/db --exclude=*mongod.lock
#This generates a copy of the distributable file without the lock...

#Here's an example of the reconstitution from the distributable file for mongo
docker create -v /data/db --name devmongodata busybox true
docker run --volumes-from devmongodata -v $(pwd):/backup ubuntu sh -c "tar xvf /backup/mongobackup.tar"

(If I’m doing something wrong wrt getting the mongo data out of the container, please let me know… I wonder as well, given the cavalier way I treat the lock file, whether the mongo container should be started up in repair mode?!)

If I have a docker-compose.yml file in the working directory like the following:

mongo:
  image: mongo
  ports:
    - "27017:27017"
  volumes_from:
    - devmongodata

##We DO NOT need to declare the data volume here
#We have already created it
#Also, if we leave it in, a "docker-compose rm" command
#will destroy the data volume container...
#...which means we wouldn't persist the data in it
#devmongodata:
#    command: echo created
#    image: busybox
#    volumes: 
#        - /data/db

We can then run docker-compose up and it should fire up a mongo container and link it to the seeded data volume container, making the data contained in that data volume container available to us.

I’ve popped some test files here. Download and unzip, from the Kitematic CLI cd into the unzipped dir, create and populate the data containers as above, then run: docker-compose up

You should be presented with some application containers including OpenRefine and an OU customised IPython notebook server. You’ll need to mount the IPython notebooks folder onto the unzipped folder. The example notebook (if everything works!) should demonstrate calls to prepopulated mongo and postgres databases.

Hopefully!

Doodling With 3d Animated Charts in R

Doodling with some Gapminder data on child mortality and GDP per capita in PPP$, I wondered whether a 3d plot of the data over time would show different trajectories for different countries, perhaps revealing different development pathways.

Here are a couple of quick sketches, generated using R (this is the first time I’ve tried to play with 3d plots…)

library(xlsx)
#data downloaded from Gapminder
#dir()
#wb=loadWorkbook("indicator gapminder gdp_per_capita_ppp.xlsx")
#names(getSheets(wb))

#Set up dataframes
gdp=read.xlsx("indicator gapminder gdp_per_capita_ppp.xlsx", sheetName = "Data")
mort=read.xlsx("indicator gapminder under5mortality.xlsx", sheetName = "Data")

#Tidy up the data a bit
library(reshape2)
library(plyr) #rename() below comes from plyr, not reshape2

gdpm=melt(gdp,id.vars = 'GDP.per.capita',variable.name='year')
gdpm$year = as.integer(gsub('X', '', gdpm$year))
gdpm=rename(gdpm, c("GDP.per.capita"="country", "value"="GDP.per.capita"))

mortm=melt(mort,id.vars = 'Under.five.mortality',variable.name='year')
mortm$year = as.integer(gsub('X', '', mortm$year))
mortm=rename(mortm, c("Under.five.mortality"="country", "value"="Under.five.mortality"))

#The following gives us a long dataset by country and year with cols for GDP and mortality
gdpmort=merge(gdpm,mortm,by=c('country','year'))

#Filter out some datasets by country
x.us=gdpmort[gdpmort['country']=='United States',]
x.bg=gdpmort[gdpmort['country']=='Bangladesh',]
x.cn=gdpmort[gdpmort['country']=='China',]

Now let’s have a go at some charts. First, let’s try a static 3d line plot using the scatterplot3d package:

library(scatterplot3d)

s3d = scatterplot3d(x.cn$year,x.cn$Under.five.mortality,x.cn$GDP.per.capita, 
                     color = "red", angle = -50, type='l', zlab = "GDP.per.capita",
                     ylab = "Under.five.mortality", xlab = "year")
s3d$points3d(x.bg$year,x.bg$Under.five.mortality, x.bg$GDP.per.capita, 
             col = "purple", type = "l")
s3d$points3d(x.us$year,x.us$Under.five.mortality, x.us$GDP.per.capita, 
             col = "blue", type = "l")

Here’s what it looks like… (it’s worth fiddling with the angle setting to get different views):

[Image: 3dline]

A 3d bar chart provides a slightly different view:

s3d = scatterplot3d(x.cn$year,x.cn$Under.five.mortality,x.cn$GDP.per.capita, 
                     color = "red", angle = -50, type='h', zlab = "GDP.per.capita",
                     ylab = "Under.five.mortality", xlab = "year",pch = " ")
s3d$points3d(x.bg$year,x.bg$Under.five.mortality, x.bg$GDP.per.capita, 
             col = "purple", type = "h",pch = " ")
s3d$points3d(x.us$year,x.us$Under.five.mortality, x.us$GDP.per.capita, 
             col = "blue", type = "h",pch = " ")

[Image: 3dbar]

As well as static 3d plots, we can generate interactive ones using the rgl library.

Here’s the code to generate an interactive 3d plot that you can twist and turn with a mouse:

#Get the data from required countries - data cols are GDP and child mortality
x.several = gdpmort[gdpmort$country %in% c('United States','China','Bangladesh'),]

library(rgl)
plot3d(x.several$year, x.several$Under.five.mortality,  log10(x.several$GDP.per.capita),
       col=as.integer(x.several$country), size=3)

We can also set the 3d chart spinning….

play3d(spin3d(axis = c(0, 0, 1)))

We can also grab frames from the spinning animation and save them as individual png files. If you have Imagemagick installed, there’s a function that will generate the image files and weave them into an animated gif automatically.

It’s easy enough to install on a Mac if you have the Homebrew package manager installed. On the command line:

brew install imagemagick

Then we can generate a movie:

movie3d(spin3d(axis = c(0, 0, 1)), duration = 10,
        dir = getwd())

Here’s what it looks like:

[Image: movie (animated gif)]

Handy…:-)

A Quick Look at Planning Data on the Isle of Wight

One of the staples that I suspect many folk look to in the weekly Isle of Wight local press is the listing of recent planning notices.

The Isle of Wight Council website also provides a reasonably comprehensive online source of planning information. Notices are split across several listings: current (and archive) applications, recent decisions, and appeals.

It’s easy enough to knock up a scraper to grab the list of current applications, scrape each of the linked application pages in turn, and then generate a map showing the locations of the current planning applications.
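
The actual scraper isn’t reproduced here, but the general shape of the thing is something like this sketch (the listing URL, the CSS selectors and the choice of geopy/folium are illustrative rather than a description of the real code):

import requests
from bs4 import BeautifulSoup
from geopy.geocoders import Nominatim
import folium

LISTING_URL = 'http://example.com/planning/current'  #placeholder - not the real council URL

#Grab the listing page and pull out links to the individual application pages
listing = BeautifulSoup(requests.get(LISTING_URL).text, 'html.parser')
app_links = [a['href'] for a in listing.select('table a')]  #illustrative selector

#Scrape the address from each application page (again, an illustrative selector)
apps = []
for url in app_links:
    page = BeautifulSoup(requests.get(url).text, 'html.parser')
    apps.append({'url': url, 'address': page.select_one('td.address').get_text(strip=True)})

#Geocode each address and pop a marker onto a map centred roughly on the Isle of Wight
geocoder = Nominatim(user_agent='planning-map-sketch')
m = folium.Map(location=[50.68, -1.29], zoom_start=11)
for app in apps:
    loc = geocoder.geocode(app['address'])
    if loc:
        folium.Marker([loc.latitude, loc.longitude], popup=app['address']).add_to(m)
m.save('map.html')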

[Image: map_html]

Indeed, working with my local hyperlocal onthewight.com, here’s a sketch of exactly such an approach, published for the first time yesterday: Isle of Wight planning applications : Mapped (announcement).

I’m hoping to do a lot more with OnTheWight – and perhaps others…? – over the coming weeks and months, so it’d be great to hear any feedback you have either here, or on the OnTheWight site itself.

Where Next?

The sketch is a good start, but it’s exactly that. If we are going to extend the service, for example, by also providing a means of reviewing recently accepted (or rejected) applications, as well as applications currently under appeal, we perhaps need to think a little bit more clearly about how we store the data – and keep track of where it is in the planning process.

If we look at the page for a particular application, we see that there are essentially three tables:

[Image: Isle of Wight Council – Planning Application Details]

The listings pages also take slightly different forms. All of them have an address, and all of them have a planning application identification number (though in two forms, albeit intersecting); but they differ in terms of the semantics of the third and possibly fourth columns, although each ultimately resolves to a date or null value.

– current (and archive) listings:

[Image: current_planning]

– recent decisions:

[Image: planning_decisions]

– appeals:

[Image: planning_appeals]

At the moment, the OnTheWight sketchmap is generated from a scrape of the Isle of Wight Council current planning applications page (latitude and longitude are generated by geocoding the address). A more complete solution would be to start to build a database of all applications, though this requires a little bit of thought when it comes to setting up the database so it becomes possible to track the current state of a particular application.
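
As a first stab at that, the sort of thing I have in mind is a single table keyed on the application number, with dates recorded as the application is spotted in each of the listings and a simple status column derived from them – a minimal sqlite sketch (the column names are mine, not the council’s):

import sqlite3

conn = sqlite3.connect('planning.db')

#One row per application; the date columns get filled in as the application is
#spotted in the current, decisions and appeals listings, and status records
#where in the process we think it currently is
conn.execute('''
CREATE TABLE IF NOT EXISTS applications (
    app_number TEXT PRIMARY KEY,
    address    TEXT,
    postcode   TEXT,
    received   TEXT,  -- date first seen in the current applications listing
    decided    TEXT,  -- decision date, if it has appeared in the decisions listing
    appealed   TEXT,  -- appeal date, if it has appeared in the appeals listing
    status     TEXT   -- eg 'current', 'decided', 'appeal'
)
''')
conn.commit()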

It might also be useful to put together a simple flow chart that shows how the public information available around an application evolves as an application progresses and then build a data model that can readily reflect that. We could then start to annotate that chart with different output opportunities – for example, as far as the map goes, it’s easy enough to imagine several layers: a current applications layer, a (current) appeals layer, a recent decisions layer, an archived decisions layer.

A process diagram would also allow us to start spotting event opportunities around which we might be able to generate alerts. For example, generating feeds that allow you to identify changes in application activity within a particular unit postcode or postcode district (ONS: UK postcode structure) or ward could act as the basis of a simple alerting mechanism. It’s then easy enough to set up an IFTTT feed to email pipe, though longer term an “onsite” feed to email subscription service would allow for a more local service. (Is there a WordPress plugin that lets logged in users generate multiple email subscriptions to different feeds?)
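
By way of illustration, pulling the postcode district out of an application address is a one-liner with a regular expression, which is enough to drive a per-district filter or feed – a rough sketch (it assumes a full postcode appears somewhere in the address, and that apps is the list of scraped applications from earlier):

import re

#Match a full UK postcode and capture the outcode (postcode district), eg 'PO30' from 'PO30 1UD'
OUTCODE = re.compile(r'\b([A-Z]{1,2}[0-9][A-Z0-9]?)\s*[0-9][A-Z]{2}\b')

def postcode_district(address):
    m = OUTCODE.search(address.upper())
    return m.group(1) if m else None

#Filter applications down to a single postcode district, eg as the basis of an alert feed
def apps_in_district(apps, district):
    return [a for a in apps if postcode_district(a['address']) == district]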

In terms of other value-adds that arise from processing the data, I can think of a few… For example, keeping track of repeated applications to the same property, analysing which agents are popular in terms of applications (and perhaps generating a league table of success rates!), linkage to other location based services (for example, license applications or prices paid data) and so on.

Takes foot off spade and stops looking into the future, surveys weeds and half dug hole…;-)