Category: Tinkering

Charting Terrorism Related Arrest Flows Through The Criminal Justice System

One of my daily read feeds is a list of the day’s government statistical releases. Today, I spotted a release on the Operation of police powers under the Terrorism Act 2000, quarterly update to September 2015, which included an annex on Arrests and outcomes, year ending September 2015:

[Image: terror_cjs]

I tweeted a link to the doc, and Michael/@fantasticlife replied with a comment that it might look interesting as a Sankey diagram…

Hmm…

So here’s a quick sketch generated using SankeyMATIC:

[Image: sankeymatic_1000x800]

I took the liberty of adding an extra “InSystem” step into the chart to account for the feedback loop of the bailed arrests.

Here’s the data I used:

Arrested [192] InSystem
Arrested [115] Released without charge
Arrested [8] Alternative action
InSystem [124] Charged
InSystem [68] Released on bail
Charged [111] Terrorism Related
Charged [13] Non-terrorism related
Terrorism Related [36] Prosecuted.t
Terrorism Related [1] Not proceeded against
Terrorism Related [74] Awaiting prosecution
Non-terrorism related [6] Prosecuted.n
Non-terrorism related [2] Not proceeded against
Non-terrorism related [5] Awaiting prosecution
Prosecuted.t [33] Convicted (terrorism related)
Prosecuted.t [2] Convicted (non-terrorism related)
Prosecuted.t [1] Acquitted
Prosecuted.n [5] Convicted (non-terrorism related)
Prosecuted.n [1] Acquitted

Looking at the diagram, I find the placement of the labels quite confusing and I’m not really sure what relates to what. (The numbers, for example…) It would also be neater if we could capture flows still “in the system”, for example by stopping the Released on bail element at the same depth as the Charged elements, and also keeping the Awaiting prosecution element short of the right hand side. (Perhaps bail and awaiting elements could be added into a “limbo” field?)

So – nice idea, but even a quick look at the trivial sketch immediately identifies all sorts of other issues that you need to take into account to make the diagram informatively glanceable…

Hmmm…

Thinks.. SankeyMATIC is a d3.js app… it would be nice if I could drag the elements in the generator to make the diagram a bit clearer… maybe I can?
[Image: sankeymatic_1000x800 (1)]

Only that’s wrong too… because the InSystem label applies to the boundary to the left, and the Bail label to the right… So we need to tweak it a bit more…

[Image: sankeymatic_1200x800 (1)]

In fact, you may notice that the labels seem to be applied left and right justified according to different rules? Hmmm… Not so simple again…

How about if I take out the interstitial value I added?

[Image: sankeymatic_1200x800 (2)]

That’s perhaps a bit clearer? And it all goes some way to showing how constructing a graphic is generally an iterative process, scaffolding the evolution of the diagram as you go, as you learn to see it/read it from different perspectives and tweak it to try to clarify particular communicative messages? (Which in this case, for me, was to try to tease out how far through the process various flows had got, as well as clearly identify final outcomes…)

Other things we could do to try to improve the graphic are experiment a bit more with the colour schemes. But that’s left as an exercise for the reader…;-)

Some Jupyter Notebook / nbconvert Housekeeping Hints

A few snippets and bits and pieces regarding housekeeping around Jupyter notebooks.

Clearing Output Cells

Via Matthias Bussonnier, this handy command for rendering a version of a notebook with the output cells cleared:

jupyter nbconvert --to notebook --ClearOutputPreprocessor.enabled=True NOTEBOOK.ipynb

Adding --inplace will rewrite the notebook with cleared output cells.
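So, for example, the following should clear the output cells and overwrite the original notebook file:

jupyter nbconvert --to notebook --inplace --ClearOutputPreprocessor.enabled=True NOTEBOOK.ipynb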

Custom Templates

If you have a custom template or a custom config file in the current directory, you can invoke them using:

jupyter nbconvert --config=my_config.py NOTEBOOK.ipynb
jupyter nbconvert --template=my_template.tpl NOTEBOOK.ipynb

I found running with the --log-level=DEBUG flag was also handy…

Via MinRk, additional paths can be set using c.TemplateExporter.template_path.append('/path/to/templates') though I’m not really sure where that setting needs to be applied. (Whilst I love the Jupyter project, I really struggle to keep track of where things are supposed to be located and which bits are working/don’t work anymore :-( )

He also notes that [a]bsolute template paths will also work if you specify: c.TemplateExporter.template_path.append('/'), adding the comment that [a]bsolute paths for templates should probably work without modifying template_path, but they don’t right now.
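For what it’s worth, I think one place the setting can live is in the config file passed in via --config; a minimal sketch of my_config.py, with a placeholder template directory, might look something like this:

# my_config.py
c = get_config()

#Add an extra directory to search for nbconvert templates (placeholder path)
c.TemplateExporter.template_path.append('/path/to/templates')

#Per the comment above, appending '/' should also let absolute template paths resolve
c.TemplateExporter.template_path.append('/')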

It would be really handy if the ability to specify an absolute path in the command line setting did work out of the box…

Split a Long Notebook into Multiple Sub-Notebooks

The Jupyter notebook allows you to split a long cell into two cells using the cursor as a split point, but how about splitting a long notebook into multiple notebooks?

The following test script will split a notebook into sub-notebooks at an explicit split point – a markdown cell containing just the string SPLIT NOTEBOOK.

import IPython.nbformat as nb
import IPython.nbformat.v4.nbbase as nb4

#Partition a long notebook into sub-notebooks at specified split points
#Enter SPLIT NOTEBOOK on its own in a markdown cell to specify a split point
mynb=nb.read('TEST_LONG_NOTEBOOK.ipynb',nb.NO_CONVERT)

c=1
test=nb4.new_notebook()
for i in mynb['cells']:
    if (i['cell_type']=='markdown'):
        if ('SPLIT NOTEBOOK' in i['source']):
            #Write out the sub-notebook built so far and start a new one
            nb.write(test,'subNotebook{}.ipynb'.format(c))
            c=c+1
            test=nb4.new_notebook()
        else:
            test.cells.append(nb4.new_markdown_cell(i['source']))
    elif (i['cell_type']=='code'):
        #Copy code cells across along with their outputs
        cc=nb4.new_code_cell(i['source'])
        for o in i['outputs']:
            cc['outputs'].append(o)
        test.cells.append(cc)
#Write out the final sub-notebook
nb.write(test,'subNotebook{}.ipynb'.format(c))

I should probably tidy this up so that it reuses the original notebook name rather than the subNotebook stub. It might also be handy to turn this into a notebook extension that splits the current notebook into two relative to the current cursor location (e.g. all cells above the selected cell go in one sub-notebook, everything from the selected cell to the end of the notebook going into a second sub-notebook).

Another refinement might allow for the declaration of a common set of cells to prefix each sub-notebook. By default, this could be the set of cells prior to the first split point. (Which is to say, for N split points, there would be N, rather than N+1, sub-notebooks, with the cells above the first split point appearing in each sub-notebook. The first sub-notebook would thus contain cells starting with the first cell after the first split point, prefixed by the cells appearing before the first split point; and the last sub-notebook would contain the cells starting with the first cell after the last split point, prefixed once again by the cells appearing before the first split point.)

A second utility to merge, or concatenate, two or more notebooks when provided with their filenames might also be handy…
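As a first pass, and reusing the same imports as the splitter above, an untested sketch of such a concatenation utility might look something like this (the output filename is arbitrary):

import IPython.nbformat as nb
import IPython.nbformat.v4.nbbase as nb4

def merge_notebooks(filenames, outfile='mergedNotebook.ipynb'):
    #Concatenate the cells of several notebooks into a single new notebook
    merged=nb4.new_notebook()
    for fn in filenames:
        tmp=nb.read(fn, nb.NO_CONVERT)
        merged.cells.extend(tmp['cells'])
    nb.write(merged, outfile)

merge_notebooks(['subNotebook1.ipynb','subNotebook2.ipynb'])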

Anything Else?

So what other handy Jupyter notebook / nbconvert housekeeping hints and tricks am I missing?

Some Random Upstart Debugging Notes…

…just so I don’t lose them…

dmesg spews out messages the Linux kernel has been issuing… which should also appear in /var/log/syslog (h/t Rod Norfor)

/var/log/upstart/SERVICE.log has log messages from trying to start a service SERVICE.

/etc/init.d should contain what looks like a generic sort of file with the filename SERVICE; the actual config file that contains the command used to start the service goes in a file SERVICE.conf in /etc/init.
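For reference, a minimal SERVICE.conf sketch might look something like the following (the start/stop stanzas and the exec command are just placeholders; the details will depend on the actual service):

# /etc/init/SERVICE.conf - minimal upstart job sketch
description "SERVICE - example upstart job"

start on runlevel [2345]
stop on runlevel [016]

respawn

# Command used to start the service (placeholder)
exec /usr/local/bin/run-service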

To generate the files that will have a go at auto-running the service, run the command update-rc.d SERVICE defaults.

Start a service with service SERVICE start, stop it with service SERVICE stop, and restart (stop if started, then start) with service SERVICE restart. Find out what’s going on with it using service SERVICE status.

Maybe…

Confusing Chart? Seaborn Jointplot

I’ve just been doodling with some data and the seaborn graphic library and managed to confuse myself a couple of times when quickly glancing at some “jointplots” that add marginal histograms to a scatter plot.

The code is easy enough:

import seaborn as sns

#lsoaiw: dataframe of LSOA-level deprivation sub-domain ranks, loaded elsewhere
sns.jointplot(x='Indoors Sub-domain Rank (where 1 is most deprived)',
              y='Outdoors Sub-domain Rank (where 1 is most deprived)',
              data=lsoaiw)

which gives charts of the form:

[Image: IW_Deprivation_jointplot]

but how do you intuitively see the histograms to compare them?

[Image: IW_Deprivation_jointplotCompare]

My at-a-glance (tired, past midnight…) reaction keeps “seeing” the top comparison, representing a 90 degree counterclockwise rotation about the top left corner of the y-axis chart. But if you think about it for more than a glance, that obviously puts the large-y values at low-x, and low y-values at large-x.

The correct way is the bottom one; to make the comparison you need to flip the y-axis chart, folding its bottom right corner towards the top-left corner of the x-axis chart. This is a rotation and a reflection. Which is hard.

But if we do the flip to try to help (can we do this using seaborn???):

[Image: IW_Deprivation_jointplotCompare2]

my eye now feels as if it wants to help out, keeping all the near corners of the various charts (top right of the y-axis chart, bottom left of the x-axis one, and top left of the main panel) close together, and flipping the bottom left corner of the y-axis chart as if up to the top right corner of the x-axis chart. (i.e. in this configuration it wants to do the flip?)

The distribution of bars in the marginal charts may also be complicating matters, encouraging the eye to match large with large (which in this case is wrong…).
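(On the “can we do this using seaborn???” question, I haven’t tried it, but jointplot returns a JointGrid object, so I suspect the y-axis marginal can be mirrored directly; an untested sketch:)

import seaborn as sns

g = sns.jointplot(x='Indoors Sub-domain Rank (where 1 is most deprived)',
                  y='Outdoors Sub-domain Rank (where 1 is most deprived)',
                  data=lsoaiw)

#Mirror the y-axis marginal histogram left-right so its bars read in the
#same direction as the x-axis marginal histogram
g.ax_marg_y.invert_xaxis()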

Hmm….

Tinkering With Data Consuming WordPress Plugins

Now that I’ve got my own webspace on Reclaim Hosting and set up a handful of my own self-hosted WordPress blogs, I started having a tinker with some custom plugins. I’ve started off with something simple, a couple of shortcode plugins that I can use to embed a simple leaflet map into a post or a page.

[Image: datawire_–_test_blog___Just_seeing_what_the_bots_can_do…]

So far, I’ve tried a couple of different approaches…

The first one has been pared down to just a default-accepting shortcode:

[IWPlanningLeafletMap]

The shortcode calls a bit of code that makes a cached request (at most three times a day) to a morph.io scraper that returns a list of planning applications to the Isle of Wight Council that are currently open for consultation. (That said, I can pass in a few shortcode parameters relating to the size of the map (which is a fixed size, unfortunately – I haven’t worked out how to make it flexible…) and the default location and zoom.) The scraper itself runs once daily.

[Image: scraperplugin]
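Behind the scenes, the request the shortcode code makes is to the morph.io data API; if I remember the API correctly, it looks something like the following (shown as a Python sketch for convenience; the scraper path and API key are placeholders):

import requests

#Pull the scraped planning applications back from the morph.io data API
#(scraper path and API key are placeholders)
url = 'https://api.morph.io/USERNAME/SCRAPER_NAME/data.json'
params = {'key': 'MORPH_API_KEY', 'query': 'SELECT * FROM data'}
applications = requests.get(url, params=params).json()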

My feeling is that this sort of code would work best in the context of a WordPress page, acting as a destination that allows folk to just check on currently open applications.

The second plugin embeds a map that displays markers for recent house sales (as recorded by the Land Registry prices paid dataset). This dataset is published as a monthly set of data a month or two after the fact and is downloaded to my local desktop. A python script then reads in the data, creates a new WordPress post containing the shortcode with the data baked in, and uploads the post to WordPress (where it appears in the draft queue).


In this shortcode, the marker data is currently encoded using the PHP serialise format (via the python phpserialize dumps method) and embedded in the post as the value of a shortcode attribute.

[MultiMarkerLeafletMap zoom=11 lat=50.675 lon=-1.32 width=800 height=500 markers='a:112:{i:0;a:5:{s:3:"lat";d:50.699382843;s:4:"date";s:10:"2015-07-10";s:3:"lon";d:-1.29297620442;s:8:"location";s:22:"SAVOY COURT, TOWN...]
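A simplified, untested sketch of the sort of script that could generate such a post (it assumes the python-wordpress-xmlrpc package for the upload step, which may not be what my actual script uses; the example marker, URL and credentials are placeholders):

import phpserialize
from wordpress_xmlrpc import Client, WordPressPost
from wordpress_xmlrpc.methods.posts import NewPost

#A single example marker of the sort read in from the price paid data
markers = [{'lat': 50.699382843, 'lon': -1.29297620442,
            'date': '2015-07-10', 'location': 'SAVOY COURT, TOWN...'}]

#Encode the marker list using the PHP serialise format
serialised = phpserialize.dumps(markers).decode('utf-8')

shortcode = ("[MultiMarkerLeafletMap zoom=11 lat=50.675 lon=-1.32 "
             "width=800 height=500 markers='{}']".format(serialised))

#Upload the post to WordPress as a draft (URL and credentials are placeholders)
wp = Client('https://example.com/xmlrpc.php', 'USERNAME', 'PASSWORD')
post = WordPressPost()
post.title = 'Recent house sales'
post.content = shortcode
wp.call(NewPost(post))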

In this case, with the marker data baked into the shortcode, there’s a good argument for rendering the map within a timestamped post as a fixed map (at least, ‘fixed’ in the sense that the data is unchanging).

The PHP un/serialize route isn’t ideal because I think it raises security issues? I originally tried to pass the data as serialised JSON, but the data is in the form of a list and it seems the ] breaks things. I guess what I should really do is see if I can pass the data in as serialised JSON between the shortcode tags rather than pass it in as an attribute?

Another approach might be to define just a simple map embed shortcode, and then support additional shortcodes that add markers to the map?

My PHP coding is also a bit scrappy (and I keep forgetting the ;’s… :-( ). I think one thing I do need to do is pop the simple plugin code functions into a class to keep them safe, and also patch in a hack or two (as seems to be required?) so that the leaflet map libraries are only loaded into post and page headers for posts/pages that actually contain one of the map shortcodes.

I’m also thinking now I need to find a way to support boundary lines and shape colouring?

[Image: boundarymaplines]

Hmm…

First Steps in a Conversational Slackbot interface to CQC Inspection Data

A few months ago, I started to have a play with ratings data from the CQC – the Care Quality Commission. I don’t seem to have posted the scraper tools I ended up with anywhere, but I have started playing with them again over the last couple of weeks in the context of my slackbot explorations.

In particular, I’ve started working up a conversational workflow for keeping track of inspections in a particular local area. At the current time, the CQC website only seems to support alerts relating to new reports at the level of a particular location, although it is possible to get faceted search results relating to reports published over the course of the last week or last month. For the local journalist wanting to keep tabs on reports associated with local providers, this means setting up a data beat that includes checking the CQC website on a regular basis, firstly to see whether there are any new reports, and secondly to see what sort of reports they are.

And as a report from OnTheWight today shows (Beacon Health Centre rated ‘Good’ by CQC), this can include good news stories as well as bad news ones.

[Image: Beacon_Health_Centre_rated_‘Good’_by_CQC]

So what’s the thing I’ve been exploring in the slack context? A time saver, hopefully. In the first case, it provides a quick way of checking up on reports from the local area released over the previous week or previous month:

To begin with, we can ask for a summary report of recent inspections:

[Image: dtest___OUseful_Slack]

The report does a bit of counting – to provide some element of context – and provides a highlights statement regarding the overall result from each of the reports. (At the moment, I don’t sort the order in which reports are presented. There are opportunities here for prioritising which results to show. I also need to think about whether results should be provided as a single slackbot response, as is currently the case, or using a separate (and hence, “star-able”) response for each report.)

After briefly skimming over the recent results, we can tunnel down into a slightly more detailed summary of the report by passing in a location ID:

[Image: dtest___OUseful_Slack2]

As part of the report, this returns a link to the CQC website so we can inspect the full result at source. I’ve also got a scraper that pulls the full reports from the CQC site, but at the moment it’s not returning the result to slack (I think there’s a message size limit which I’m overflowing, so I need to find out what that limit is and split up the response to get round it). That said, slack is perhaps not the best place to return long reports? Maybe a better way would be to submit quoted report components into a draft WordPress blog post?
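(As a workaround sketch for the message size issue, assuming the limit is somewhere around 4,000 characters, the report text could be naively chunked and each chunk posted back as a separate response:)

def chunk_message(text, limit=4000):
    #Naively split a long report into chunks below an assumed Slack message limit
    chunks = []
    while text:
        chunks.append(text[:limit])
        text = text[limit:]
    return chunks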

We can also pull back administrative information regarding a location from its location ID.

[Image: dtest___OUseful_Slack3]

This report also includes information about other locations operated by the same provider. (I think I do have a report somewhere that summarises the report ratings over all the locations associated with a given provider, so we can look to see how well other establishments operated by the same provider are rated, but I haven’t wired that into the Slack bot yet.)

There are several other ways we can develop this conversation…

Company number and charity number information is available for some providers, which means it should be trivial to return company registration information and company directors information from Companies House or OpenCorporates, and perhaps even data from the Charities Commission.
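For example, a lookup against the OpenCorporates API might be as simple as something like the following untested sketch (I haven’t checked the current endpoint or response format, so treat the URL as an assumption; an API key may also be needed for anything more than casual use):

import requests

def opencorporates_company(company_number, jurisdiction='gb'):
    #Look up basic registration details for a company from OpenCorporates
    url = 'https://api.opencorporates.com/companies/{}/{}'.format(
        jurisdiction, company_number)
    return requests.get(url).json()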

Rather more scruffily, perhaps, we could use location name and postcode to try a search on the Food Standards Agency website to see if we can find food ratings data for establishments of interest.
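Again as an untested sketch, the FSA food hygiene ratings site has an API that I think supports searching by establishment name and address, so something along these lines might work (the endpoint, header and parameter names are assumptions):

import requests

def fsa_ratings(name, postcode):
    #Search the FSA food hygiene ratings API by establishment name and address/postcode
    url = 'http://api.ratings.food.gov.uk/Establishments'
    headers = {'x-api-version': '2'}
    params = {'name': name, 'address': postcode}
    return requests.get(url, headers=headers, params=params).json()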

There might also be opportunities for linking in items from local spending data to try to capture local authority spend with a particular location or provider. This would be simplified if council payments to CQC rated establishments or providers included the CQC location or provider ID, but that’s perhaps too much to ask for.

If nothing else, however, this demonstrates a casual conversational way in which a local journalist might be able to use slack as part of a local data beat to run regular, periodic checks over recent reports published by the CQC relating to local care establishments.

Slackbot Data Wire, Initial Sketch

Via a round-up post from Matt Jukes/@jukesie (Interesting elsewhere: Bring on the Bots), I was prompted to look again at Slack. OnTheWight’s Simon Perry originally tried to hook me in to Slack, but I didn’t need another place to go to check messages. Simon had also mentioned, in passing, how it would be nice to be able to get data alerts into Slack, but I’d not really followed it through, until the weekend, when I read again @jukesie’s comment that “what I love most about it [Slack] is the way it makes building simple, but useful (or at least funny), bots a breeze.”

After a couple of aborted attempts, I found a couple of python libraries to wrap the Slack API: pyslack and python-rtmbot (the latter also requires python-slackclient).

Using pyslack to send a message to Slack was pretty much a one-liner:

#Create API token at https://api.slack.com/web
token='xoxp-????????'

#!pip install pyslack
import slack
import slack.chat
slack.api_token = token
slack.chat.post_message('#general', 'Hello world', username='testbot')

[Image: general___OUseful_Slack]

I was quite keen to see how easy it would be to reuse one or more of my data2text sketches as the basis for an autoresponder that could accept a local data request from a Slack user and provide a localised data response using data from a national dataset.

I opted for a JSA (Jobseekers Allowance) textualiser (as used by OnTheWight and reported here: Isle of Wight innovates in a new area of Journalism and also in this journalism.co.uk piece: How On The Wight is experimenting with automation in news) that I seem to have bundled up into a small module, which would let me request JSA figures for a council based on a council identifier. My JSA textualiser module has a couple of demos hardwired into it (one for the Isle of Wight, one for the UK) so I could easily call on those.

To put together an autoresponder, I used the python-rtmbot, putting the botcode folder into a plugins folder in the python-rtmbot code directory.

The code for the bot is simple enough:

from nomis import *
import nomis_textualiser as nt
import pandas as pd

nomis=NOMIS_CONFIG()

import time

#python-rtmbot plugin conventions: jobs added to crontable run on a schedule,
#and [channel, message] pairs appended to outputs get posted back to Slack
crontable = []
outputs = []

def process_message(data):

	text = data["text"]
	if text.startswith("JSA report"):
		if 'IW' in text: outputs.append([data['channel'], nt.otw_rep1(nt.iwCode)])
		elif 'UK' in text: outputs.append([data['channel'], nt.otw_rep1(nt.ukCode)])
	if text.startswith("JSA rate"):
		if 'IW' in text: outputs.append([data['channel'], nt.rateGetter(nt.iwCode)])
		elif 'UK' in text: outputs.append([data['channel'], nt.rateGetter(nt.ukCode)])

[Image: general___OUseful_Slack2]

Rooting around, I also found a demo I’d put together for automatically looking up a council code from a Johnston Press newspaper title using a lookup table I’d put together at some point (I don’t remember how!).

Which meant that by using just a tiny dab of glue I could extend the bot further to include a lookup of JSA figures for a particular council based on the local rag JP covering that council. And the glue is this, added to the process_message() function definition:

	def getCodeForTitle(title):
		code=jj_titles[jj_titles['name']==title]['code_admin_district'].iloc[0]
		return code

	if text.startswith("JSA JP"):
		jj_titles=pd.read_csv("titles.csv")
		title=text.split('JSA JP')[1].strip()
		code=getCodeForTitle(title)

		outputs.append([data['channel'], nt.otw_rep1(code)])
		outputs.append([data['channel'], nt.rateGetter(code)])

[Image: general___OUseful_Slack3]

This is quite an attractive route, I think, for national news groups: anyone in the group can create a bot to generate press release style copy at a local level from a national dataset, and then make it available to reporters from other titles in the group – who can simply key in by news title.

But it could work equally well for a community network of hyperlocals, or councils – organisations that are locally based and individually do the same work over and over again on national datasets.

The general flow is something a bit like this:

[Image: Tony_Hirst_-_Cardiff_-_community_journalism_-_data_wire_pptx]

which has a couple of very obvious pain points:

[Image: Tony_Hirst_-2_Cardiff_-_community_journalism_-_data_wire_pptx]

Firstly, finding the local data from the national data, cleaning the data, etc etc. Secondly, making some sort of sense of the data, and then doing some proper journalistic work writing a story on the interesting bits, putting them into context and explaining them, rather than just relaying the figures.

What the automation route does is to remove some of the pain, and allow the journalist to work up the story from the facts, presented informatively.

[Image: Tony_Hirst_-3_Cardiff_-_community_journalism_-_data_wire_pptx]

This is a model I’m currently trying to work up with OnTheWight and one I’ll be talking about briefly at the What next for community journalism? event in Cardiff on Wednesday [slides].

PS Hmm.. this just in, The Future of the BBC 2015 [PDF] [announcement].

Local Accountability Reporting Service

Under this proposal, the BBC would allocate licence fee funding to invest in a service that reports on councils, courts and public services in towns and cities across the UK. The aim is to put in place a network of 100 public service reporters across the country.

Reporting would be available to the BBC but also, critically, to all reputable news organisations. In addition, while it would have to be impartial and would be run by the BBC, any news organisation — news agency, independent news provider, local paper as well as the BBC itself—could compete to win the contract to provide the reporting team for each area.

A shared data journalism centre

Recent years have seen an explosion in data journalism. New stories are being found daily in government data, corporate data, data obtained under the Freedom of Information Act and increasing volumes of aggregated personalised data. This data offers new means of sourcing stories and of holding public services, politicians and powerful organisations to account.

We propose to create a new hub for data journalism, which serves both the BBC and makes available data analysis for news organisations across the UK. It will look to partner a university in the UK, as the BBC seeks to build a world-class data journalism facility that informs local, national and global news coverage.

A News Bank to syndicate content

The BBC will make available its regional video and local audio pieces for immediate use on the internet services of local and regional news organisations across the UK.

Video can be time-consuming and resource-intensive to produce. The News Bank would make available all pieces of BBC video content produced by the BBC’s regional and local news teams to other media providers. Subject to rights and further discussion with the industry we would also look to share longer versions of content not broadcast, such as sports interviews and press conferences.

Content would be easily searchable by other news organisations, making relevant material available to be downloaded or delivered by the outlets themselves, or for them to simply embed within their own websites. Sharing of content would ensure licence fee payers get maximum value from their investment in local journalism, but it would also provide additional content to allow news organisations to strengthen their offer to audiences without additional costs. We would also continue to enhance linking out from BBC Online, building on the work of Local Live.

Hmm… Share content – or share “pre-content”. Use BBC expertise to open up the data to more palatable forms, forms that the BBC’s own journalists can work with, but also share those intermediate forms with the regionals, locals and hyperlocals?