Category: Tinkering

A Recipe for Automatically Going From Data to Text to Reveal.js Slides

Over the last few years, I’ve experimented on and off with various recipes for creating text reports from tabular data sets, (spreadsheet plugins are also starting to appear with a similar aim in mind). There are several issues associated with this, including:

  • identifying what data or insight you want to report from your dataset;
  • (automatically deriving the insights);
  • constructing appropriate sentences from the data;
  • organising the sentences into some sort of narrative structure;
  • making the sentences read well together.

Another approach to humanising the reporting of tabular data is to generate templated webpages that review and report on the contents of a dataset; this has certain similarities to dashboard style reporting, mixing tables and charts, although some simple templated text may also be generated to populate the page.

In a business context, reporting often happens via Powerpoint presentations. Slides within the presentation deck may include content pulled from a templated spreadsheet, which itself may automatically generate tables and charts for such reuse from a new dataset. In this case, the recipe may look something like:


#render via:

  Y1[label='Powerpoint slide']
  Y2[label='Powerpoint slide']

   data -> Excel -> Chart -> X1 -> Y1;
   Excel -> Table -> X2 -> Y2 ;

In the previous couple of posts, the observant amongst you may have noticed I’ve been exploring a couple of components for a recipe that can be used to generate reveal.js browser based presentations from the 20% that account for the 80%.

The dataset I’ve been tinkering with is a set of monthly transparency spending data from the Isle of Wight Council. Recent releases have the form:


So as hinted at previously, it’s possible to use the following sort of process to automatically generate reveal.js slideshows from a Jupyter notebook with appropriately configured slide cells (actually, normal cells with an appropriate metadata element set) used as an intermediate representation.


  X2[label="Jupyter notebook\n(slide mode)"]


  data -> "pandas dataframe" -> X1  -> X2 ->X3
  "pandas dataframe" -> Y1,Y2,Y3  -> X2 ->X3

  Y2 [shape = "dots"];

There’s an example slideshow based on October 2016 data here. Note that some slides have “subslides”, that is, slides underneath them, so watch the arrow indicators bottom left to keep track of when they’re available. Note also that the scrolling is a bit hit and miss – ideally, a new slide would always be scrolled to the top, and for fragments inserted into a slide one at a time the slide should scroll down to follow them).

The structure of the presentation is broadly as follows:


For example, here’s a summary slide of the spends by directorate – note that we can embed charts easily enough. (The charts are styled using seaborn, so a range of alternative themes are trivially available). The separate directorate items are brought in one at a time as fragments.


The next slide reviews the capital versus expenditure revenue spend for a particular directorate, broken down by expenses type (corresponding slides are generated for all other directorates). (I also did a breakdown for each directorate by service area.)

The items listed are ordered by value, and taken together account for at least 80% of the spend in the corresponding area. Any further items contributing more than 5%(?) of the corresponding spend are also listed.


Notice that subslides are available going down from this slide, rather than across the mains slides in the deck. This 1.5D structure means we can put an element of flexible narrative design into the presentation, giving the reader an opportunity to explore the data, but in a constrained way.

In this case, I generated subslides for each major contributing expenses type to the capital and revenue pots, and then added a breakdown of the major suppliers for that spending area.


This just represents a first pass at generating a 1.5D slide deck from a tabular dataset. A Pareto (80/20) heurstic is used to try to prioritise to the information displayed in order to account for 80% of spend in different areas, or other significant contributions.

Applying this principle repeatedly allows us to identify major spending areas, and then major suppliers within those spending areas.

The next step is to look at other ways of segmenting and structuring the data in order to produce reports that might actually be useful…

If you have any ideas, please let me know via the comments, or get in touch directly…

PS FWIW, it should be easy enough to run any other dataset that looks broadly like the example at the top through the same code with only a couple of minor tweaks…

Automatically Generating Two Dimensional Reveal.js Documents Using Jupyter Notebooks

One of the things I finally got round to exploring whilst at the Reproducible Research Using Jupyter Notebooks curriculum development hackathon was the ability to generate slideshows from Jupyter notebooks.

The underlying slideshow presentation framework is reveal.js. This uses a 1.5(?) dimensional slide geometry, so slides can transition left to right, or you can transition down to subslides off a single slide.

This got me wondering… could I use a notebook/script to generate a reveal.js slideshow that could provide a convenient way of navigating automatically generated slideshows made up from automatically generated data2text slides?

The 1.5/two dimensional component would mean that slides could be structured by topic horizontally, with subtopic vertically downwards within a topic.

A quick test suggests that this is absolutely doable…

import IPython.nbformat as nb
import IPython.nbformat.v4.nbbase as nb4

test.cells.append(nb4.new_markdown_cell('# Test slide1',metadata={"slideshow": {"slide_type": "slide"}}))
test.cells.append(nb4.new_markdown_cell('# Test slide2',metadata={"slideshow": {"slide_type": "slide"}}))
test.cells.append(nb4.new_markdown_cell('Slide2 extra content line 1\n\nSlide2 extra content line 2'))
test.cells.append(nb4.new_markdown_cell('# Test slide3',metadata={"slideshow": {"slide_type": "slide"}}))
test.cells.append(nb4.new_markdown_cell('Slide3 fragment 1',metadata={"slideshow": {"slide_type": "fragment"}}))
test.cells.append(nb4.new_markdown_cell('Slide3 fragment 2',metadata={"slideshow": {"slide_type": "fragment"}}))
test.cells.append(nb4.new_markdown_cell('# Slide4',metadata={"slideshow": {"slide_type": "slide"}}))
test.cells.append(nb4.new_markdown_cell('Slide4 extra content line 1\n\nSlide4 extra content line 2'))
test.cells.append(nb4.new_markdown_cell('# Slide4 subslide',metadata={"slideshow": {"slide_type": "subslide"}}))


#Generate and render slideshow
!jupyter nbconvert $nbf --to slides --post serve

Let the fun begin…:-)

PS here’s a second pass:

def addSlideComponent(notebook, content, styp=''):
    if styp in ['slide','fragment','subslide']: styp={"slideshow": {"slide_type":styp}}
    else: styp={}
    notebook.cells.append(nb4.new_markdown_cell(content, metadata=styp))

addSlideComponent(test,'# Test2 slide1','slide')
addSlideComponent(test,'# Test slide2','slide')
addSlideComponent(test,'Slide2 extra content line 1\n\nSlide2 extra content line 2')
addSlideComponent(test,'# Test slide3','slide')
addSlideComponent(test,'Slide3 fragment 1','fragment')
addSlideComponent(test,'Slide3 fragment 2','fragment')
addSlideComponent(test,'# Slide4','slide')
addSlideComponent(test,'Slide4 extra content line 1\n\nSlide2 extra content line 1')
addSlideComponent(test,'# Slide4 subslide','subslide')


Wrangling Time Periods (such as Financial Year Quarters) In Pandas

Looking at some NHS 111 and A&E data today, the reported data I was interested in was being reported for different sorts of period, specifically, months and quarters.

The pandas python library has quite a few tools for dealing with periods, so here are a couple of examples of tricks I put to use today.

Firstly, casting months to a month period. Monthly periods (in column df['Periodname']) were reported in the form “Dec-10”, “Jan-11”, etc, which is to say a three letter month followed by a two digit year representation. If we create a datetime on the first of the corresponding month, we can then cast that date to a month period for the corresponding month:

#Generate a pandas period for time series indexing purposes
#Create a date from each month (add the first date of the month) then set to period
#Note: if the datetime is an index, drop the .dt.
df['_period']=pd.to_datetime(('01-'+df['Periodname']), \

If the datetimes form the dataframe index, we could drop the .dt. element and cast the timeseries index directly to the monthly period. The %b term maps onto the locale specific month description.

The second example had financial years and dates in two columns – Year and Quarter. The months specified in the quarter description mix three letter representations and the fully written name of the month. (As with many NHS datasets, the text used to describe dates can be all over the place, even within a single spreadsheet or datafile!)

If we grab the first three letters of the month, we can identify the month as it’s short form. If we create a date stamp for the first day of the quarter (in the UK, corporate financial years run 1-April, to 31-March (it’s the personal tax year that reports from April 5th?) we can then cast to a quarterly reporting period whose year ends in a particular month, eg March (freq='Q-MAR').


Note that the YearQuarter uses the calendar year in which the financial year ends. We can check the start and end date of the quarters explicitly:


One issue when it comes to plotting the financial year quarterly periods is that the chart looks to me as if everything is a year out:


That is, I read the year as the calendar year, but it’s actually the calendar year in which a particular financial year ends.

So I’m wondering – should pandas really report and plot the financial year in which the fiscal year starts? Is there a switch for this? Or do I need to submit a bug report (and if so, what’s the bug?)

PS for a notebook worked example, see here: NHS 111 example.

Creating a Self-Updating WordPress Post Using WordPress Transients and Data from a Third Party API

In the post Pondering a Jupyter Notebooks to WordPress Publishing Pattern: MultiMarker Map Widget, I described a simple pattern I started exploring last year that used a custom WordPress shortcode plugin to render data added to one or more custom fields associated with a WordPress post; the post text (including shortcode) and custom fields data were themselves posted into WordPress using some Python code executed from a Jupyter notebook. The idea behind that pattern was to provide a way of automating the creation of custom posts largely from a supplied data set, rendered using a generic shortcode plugin.

Another pattern I explored last year used the WordPress Transients API to cache data pulled from a 3rd party API in the WordPress database, and allow that data to be used by a custom plugin to render the post.

Here’s some example code for a plugin that renders a map containing recent planning applications on the Isle of Wight: the data is grabbed via an API from a webscraper, which scrapes the data from the Isle of Wight council website.

The two key bits of the script are where I check to see if cached data exisits ( get_transient( 'iwcurrplanningitems' ); and if it doesn’t, grab a recent copy from the API and cache it for 8 hours (set_transient('iwcurrplanningitems', $markers, 60*60*8);).

Plugin Name: IWPlanningLeafletMap
Description: Shortcode to render an interactive map displaying clustered markers. Markers are pulled in via JSON from an external URL. Intended primarily to supported automated post creation. Inspired by folium python library and Google Maps v3 Shortcode multiple Markers WordPress plugin
Version: 1.0
Author: Tony Hirst

//Loaded in from multimarker shortcode
add_action( 'wp_enqueue_scripts', 'custom_scripts' );
add_action( 'wp_enqueue_scripts', 'custom_styles' );

// Add stuff to header
add_action('wp_head', 'IWPlanningLeafletMap_header');
add_action('wp_head', 'fix_css');

function fix_css() { 
	echo '<style type="text/css">#map {
      }</style>' . "\n";

function IWPlanningLeafletMap_header() {

function IWPlanningLeafletMap_call($attr) {
// Generate the map template

	// Default attributes - can be overwritten from shortcode
	$attr = shortcode_atts(array(	
									'lat'   => '50.675', 
									'lon'    => '-1.32',
									'id' => 'iwmap_1',
									'zoom' => '11',
									'width' => '800',
									'height' => '500',
									), $attr);

	$html = '<div class="folium-map" id="'.$attr['id'].'" style="width: '. $attr['width'] .'px; height: '. $attr['height'] .'px"></div>

   <script type="text/javascript">
      var base_tile = L.tileLayer("https://{s}{z}/{x}/{y}.png", {
          maxZoom: 18,
          minZoom: 1,
          attribution: "Map data (c) OpenStreetMap contributors -"

      var baseLayer = {
        "Base Layer": base_tile

      list of layers to be added
      var layer_list = {

      Bounding box.
      var southWest = L.latLng(-90, -180),
          northEast = L.latLng(90, 180),
          bounds = L.latLngBounds(southWest, northEast);

      Creates the map and adds the selected layers
      var map ="'.$attr['id'].'", {
                                       center:['.$attr['lat'].', '.$attr['lon'].'],
                                       zoom: '.$attr['zoom'].',
                                       maxBounds: bounds,
                                       layers: [base_tile]

      L.control.layers(baseLayer, layer_list).addTo(map);

      //cluster group
      var clusteredmarkers = L.markerClusterGroup();
      //section for adding clustered markers
	$markers =  get_transient( 'iwcurrplanningitems' );
	if ( false === $markers ) {
		$json = file_get_contents($url);
		$markers=json_decode($json, true);
		set_transient('iwcurrplanningitems', $markers, 60*60*8);
	for ($i = 0;$i < count($markers);$i ++){
		$arrkeys=['Agent or Applicant','Location','Proposal'];
		foreach($arrkeys as $arrkey){
			$markers[$i][$arrkey] = str_replace("\n", "<br/>", $markers[$i][$arrkey]);
			$markers[$i][$arrkey] = str_replace("\r", "<br/>", $markers[$i][$arrkey]);
		$html .='
			var marker_'.$i.'_icon = L.AwesomeMarkers.icon({ icon: "info-sign",markerColor: "blue",prefix: "glyphicon",extraClasses: "fa-rotate-0"});
      		var marker_'.$i.' = L.marker(['.$markers[$i]['lat'].','.$markers[$i]['lon'].'], {"icon":marker_'.$i.'_icon});
      marker_'.$i.'.bindPopup("<strong>Consultation start:</strong> '.$markers[$i]['Consultation Start Date'].'<br/><strong>Consultation end:</strong> '.$markers[$i]['Consultation End Date'].'<br/><strong>Location:</strong> '.$markers[$i]['Location'].'<br/><em> '.$markers[$i]['Parish'].' parish, '.$markers[$i]['Ward'].' ward.</em><br/><strong>Proposal:</strong> '.$markers[$i]['Proposal'].'<br/><strong>Agent or Applicant:</strong> '.$markers[$i]['Agent or Applicant'].'<br/><strong>Case Officer:</strong> '.$markers[$i]['Case Officer'].'<br/><em><a href=\''.$markers[$i]['stub'].'\'>View application</a></em>");
      marker_'.$i.'._popup.options.maxWidth = 300;
     		//add the clustered markers to the group anyway

	$html .= '</script>';
	return $html;

add_shortcode('IWPlanningLeafletMap', 'IWPlanningLeafletMap_call');

One thing I started to wonder over the Christmas break was whether this approach could provide a way of sharing “data2text” content. For example, having a plugin that creates a canned summary of jobseeker’s allowance figures from data cached from the ONS website? A downside of this is that I’d have to write the data2text script using PHP, which means I couldn’t directly build on related code I’ve written previously…

I also wonder if we could use custom fields to permanently store data for a particular post. For example, we might check whether or not a custom field exists for the post, and if it doesn’t we could create and populate it using data pulled from an API, (possibly keyed by plugin/shortcode parameters, or the post publication date), using a WordPress add_post_meta() function call?

Pondering a Jupyter Notebooks to WordPress Publishing Pattern: MultiMarker Map Widget

On my to do list for next year is to finally get round to doing something consistently with open data in an Isle of Wight context, probably on One of the things I particularly want to explore are customisable WordPress plugins that either source data from on online data source or that can be configured as part of an external publishing system.

For example, the following code, saved as MultiMarkerLeafletMap2.php and zipped up implements a WordPress plugin that can render an interactive leaflet map with clustered markers.


Plugin Name: MultiMarkerLeafletMap2
Description: Shortcode to render an interactive map displaying clustered markers. Markers to be added as JSON. Intended primarily to supported automated post creation. Inspired by folium python library and Google Maps v3 Shortcode multiple Markers WordPress plugin
Version: 1.0
Author: Tony Hirst

function MultiMarkerLeafletMap2_custom_styles() {

	wp_deregister_style( 'oi_css_map_leaflet' );
	wp_register_style( 'oi_css_map_leaflet', '//',false, '0.7.3' );
	wp_enqueue_style( 'oi_css_map_leaflet' );

	wp_deregister_style( 'oi_css_map_bootstrap' );
	wp_register_style( 'oi_css_map_bootstrap', '//', false, '3.2.0' );
	wp_enqueue_style( 'oi_css_map_bootstrap' );

	wp_enqueue_style( 'oi_css_map_bootstrap_theme');

	wp_enqueue_style( 'oi_css_map_fa');

	wp_enqueue_style( 'oi_css_map_lam');

	wp_enqueue_style( 'oi_css_map_lmcd');

	wp_enqueue_style( 'oi_css_map_lmc');

	wp_enqueue_style( 'oi_css_map_lar');


function MultiMarkerLeafletMap2_custom_scripts() {

	wp_deregister_script( 'oi_script_leaflet' );
	wp_register_script( 'oi_script_leaflet', '//',array('oi_script_jquery'),'0.7.3');
	wp_enqueue_script( 'oi_script_leaflet' );

	wp_deregister_script( 'oi_script_jquery' );
	wp_register_script( 'oi_script_jquery', '//', false, '1.11.3' );
	wp_enqueue_script( 'oi_script_jquery' );

	wp_deregister_script( 'oi_script_bootstrap' );
	wp_register_script( 'oi_script_bootstrap', '//',false,'3.2.0');
	wp_enqueue_script( 'oi_script_bootstrap' );

	wp_deregister_script( 'oi_script_lam' );
	wp_register_script( 'oi_script_lam', '//',array('oi_script_leaflet'),'2.0');
	wp_enqueue_script( 'oi_script_lam' );

	wp_deregister_script( 'oi_script_lmc' );
	wp_register_script( 'oi_script_lmc', '//',array('oi_script_leaflet'),'0.4.0');
	wp_enqueue_script( 'oi_script_lmc' );	

	wp_deregister_script( 'oi_script_lmcsrc' );
	wp_register_script( 'oi_script_lmcsrc', '//',array('oi_script_leaflet'),'0.4.0');
	wp_enqueue_script( 'oi_script_lmcsrc' );

add_action( 'wp_enqueue_scripts', 'MultiMarkerLeafletMap2_custom_scripts' );
add_action( 'wp_enqueue_scripts', 'MultiMarkerLeafletMap2_custom_styles' );

// Add items to header
add_action('wp_head', 'MultiMarkerLeafletMap2_header');
add_action('wp_head', 'MultiMarkerLeafletMap2_fix_css');

function MultiMarkerLeafletMap2_fix_css() {
	echo '<style type="text/css">#map {
      }</style>' . "\n";

function MultiMarkerLeafletMap2_header() {

function MultiMarkerLeafletMap2_call($attr) {
// Generate the map template

	// Default attributes - can be overwritten from shortcode
	$attr = shortcode_atts(array(
									'lat'   => '0',
									'lon'    => '0',
									'id' => 'oimap_1',
									'zoom' => '7',
									'width' => '600',
									'height' => '400',
									'type' => 'multimarker',
									), $attr);

	$html = '<div class="folium-map" id="'.$attr['id'].'" style="width: '. $attr['width'] .'px; height: '. $attr['height'] .'px"></div>
<script type="text/javascript">
      var base_tile = L.tileLayer("https://{s}{z}/{x}/{y}.png", {
          maxZoom: 18,
          minZoom: 1,
          attribution: "Map data (c) OpenStreetMap contributors -"

      var baseLayer = {
        "Base Layer": base_tile

      list of layers to be added
      var layer_list = {
      Bounding box.
      var southWest = L.latLng(-90, -180),
          northEast = L.latLng(90, 180),
          bounds = L.latLngBounds(southWest, northEast);

      Creates the map and adds the selected layers
      var map ="'.$attr['id'].'", {
                                       center:['.$attr['lat'].', '.$attr['lon'].'],
                                       zoom: '.$attr['zoom'].',
                                       maxBounds: bounds,
                                       layers: [base_tile]

      L.control.layers(baseLayer, layer_list).addTo(map);

      //cluster group
      var clusteredmarkers = L.markerClusterGroup();
      //section for adding clustered markers
	// Get our custom fields
	global $post;

	$premarkers=get_post_meta( $post->ID, 'markers', true );
	$markers = json_decode($premarkers,true);
	$legend = get_post_meta( $post->ID, 'maplegendtemplate', true );
	$legendkeys = get_post_meta( $post->ID, 'maplegendkeys', true );

	if (count($markers)>0){
		for ($i = 0;$i < count($markers);$i ++){
			foreach (explode(',', $legendkeys) as $k) {

			$html .='
			     var marker_'.$i.'_icon = L.AwesomeMarkers.icon({ icon: "info-sign",markerColor: "blue",prefix: "glyphicon",extraClasses: "fa-rotate-0"});
      var marker_'.$i.' = L.marker(['.$markers[$i]['lat'].','.$markers[$i]['lon'].'], {"icon":marker_'.$i.'_icon});
      marker_'.$i.'._popup.options.maxWidth = 300;


	$html .= '//add the clustered markers to the group anyway
	return $html;

<?php } add_shortcode('MultiMarkerLeafletMap2', 'MultiMarkerLeafletMap2_call'); ?>

Data is passed to the plugin embedded in a WordPress post via three custom fields associated with the post:

  • markers: a JSON list that contains information associated with each marker;
  • maplegendkeys: a comma separated list of key values that refers to keys in each marker object that are referenced when constructing the popup legend for each marker;
  • maplegendtemplate: a template that is used to construct each popup legend, of the form ‘Asset type: %typ% (%loc%)’, where the %VAR% elements identify key vales VAR associated with object attributes in the markers list.

In the set up I have, the post content – including the plugin code – is generated from a Python script running in a Jupyter notebook that can be posted using the following code fragment:

#!pip3 install python-wordpress-xmlrpc

from wordpress_xmlrpc import Client, WordPressPost
from wordpress_xmlrpc.compat import xmlrpc_client
from wordpress_xmlrpc.methods import media, posts
from wordpress_xmlrpc.methods.posts import NewPost

wpoi = Client(WORDPRESS_BLOG_URL+'/xmlrpc.php', 'robot1', WORDPRESS_API_KEY)

def wp_customPost(client,title='ping',content='pong, <em>pong<em>',custom={}):
    post = WordPressPost()
    post.title = title
    post.content = content
    post.custom_fields = []
    for c in custom:
            'key': c,
            'value': custom[c]
    response =
    return response

A list of objects is created from a pandas dataframe where each object contains the information associated with each marker – we limit the list to only include items for which we have latitude and longitude information:

def itemiser(row):
                     'lon': row['latlong'].split(',')[1],
                     'typ':row['Asset Type Description'],
                     'location':'{}, {}, {}'.format(row['Address 1'], row['Address 2'], row['Post Code']),
    return item

jd1=df[(df['latlong']!='') & (df['latlong'].notnull())].apply(itemiser,axis=1)

A post is then constructed that includes a reference to the plugin (as part of the text of the body of the post) and the data that is to be passed to the custom variables.

import json

#txt contains the content for the blog post
txt="[MultiMarkerLeafletMap2 zoom=11 lat=50.675 lon=-1.31 width=800 height=500]"
<div><em>Data produced by <a href="">Isle of WIght Council</a>.</div>

#jsondata contains the custom variable data that will be associated with the post
jsondata={'markers':json.dumps( jd1.tolist() ),
          'maplegendtemplate':'Asset type: %typ% (%loc%)<br/>Tenure: %tenure%<br/>%location%'}

wp_customPost(wpoi, "Properties on the Isle of Wight Council property register", txt, jsondata)

Student Workload Planning – Section Level Word Count Reports in MS Word Docs

One of the things the OU seems to have gone in for big time lately is “learning design”, with all sorts of planning tools and who knows what to try and help us estimate student workloads.

One piece of internal research I saw suggested that we “adopt a University-wide standard for study speed of 35 words per minute for difficult texts, 70 words per minute for normal texts and 120 words per minute for easy texts”. This is complemented by a recommended level 1 (first year equivalent) 60:40 split between module-directed (course text) work and student-directed (activities, exercises, self-assessment questions, forum activity etc) work. Another constraint is the available study time per week – for a 30 CAT point course (300 hours study), this is nominally set at 10 hours study per week. I seem to recall that retention charts show that retention rates go down as mean study time goes up anywhere close to this…

One of the things that seems to have been adopted is the assumption that the first year equivalent study material should all be rated at the 35 words per minute level. For 60% module led work, at 10 hours a week, this gives approximately 35 * 60 * 6 ~ 1200 words of reading per week. With novels coming in around 500 words a page, that’s 20 pages of reading or so.

This is okay for dense text but we tend to write quite around with strong narrative, using relatively straightforward prose, explaining things a step at a time, with plenty of examples. Dense sentences are rewritten and the word count goes up (but not the reading rate… Not sure I understand that?)

As part of the production process, materials go through multiple drafts and several stages of critical reading by third parties. Part of the critical reading process is to estimate (or check) workload. To assist this, materials are chunked and should be provided with word counts and estimated study times. The authoring process uses Microsoft Word.

As far as I can tell, there is an increasing drive to segment all the materials and chunk them all to be just so, one more step down the line rigidly templated materials. For a level 1 study week, the template seems to be five sections per week with four subsections each, each subsection about 500 words or so. (That is, 10 to 20 blog posts per study week…;-)

I’m not sure what, if any, productivity tools there are to automate the workload guesstimates, but over coffee this morning I though I’d have a go at writing a Visual Basic macro to do do some of the counting for me. I’m not really familiar with VB, in fact, I’m not sure I’ve ever written a macro before, but it seemed to fall together okay if the document was structured appropriately.

To whit, the structure I adopted was: a section to separate each section and subsection (which meant I could count words in each section); a heading as the first line after a section break (so the word count could be associated with the (sub)section heading). This evening, I also started doodling a convention for activities, where an activity would include a line on its own of the form – Estimated study time: NN minutes – which could then be used as a basis for an activity count and an activity study time count.

Running the macro generates a pop up report and also inserts the report at the cursor insertion point. The report for a section looks something like this:


A final summary report also gives the total number of words.

It should be easy enough to also insert wordcounts into the document at the start of each section, though I’m not sure (yet) how I could put a placeholder in at the start of each section that the macro could update with the current wordcount each time I run it? (Also how the full report could just be updated, rather than appended to the document, which could get really cluttered…) I guess I could also create a separate Word doc, or maybe populate an Excel spreadsheet, with the report data.

Another natural step would be to qualify each subsection with a conventional line declaring the estimated reading complexity level, detecting this, and using it with a WPM rate to estimate the study time of the reading material. Things are complicated somewhat by my version of Word (on a Mac) not supporting regular expressions, but then, in the spirit of trying to build tools at the same level of complexity as the level at which we’re teaching, regex are probably out of scope (too hard, I suspect…)

To my mind, exploring such productivity tools is the sort of thing we should naturally do; at least, it’s the sort of thing that felt natural in a technology department. Computing seems different; computing doesn’t seem to be about understanding the technical world around us and getting our hands dirty with it. It’s about… actually, I’m not sure what it’s about. The above sketch really was a displacement activity – I have no misconceptions at all that the above will generate any interest at all, not even as a simple daily learning exercise (I still try to learn, build or create something new every day to keep the boredom away…) In fact, the “musical differences” between my view of the world and pretty much everyone else’s is getting to the stage where I’m not sure it’s tenable any more. The holiday break can’t come quickly enough… Roll on HoG at the weekend…

Sub WordCount()

    Dim NumSec As Integer
    Dim S As Integer
    Dim Summary As String

    Dim SubsectionCnt As Integer
    Dim SubsectionWordCnt As Integer
    Dim SectionText As String

    Dim ActivityTime As Integer
    Dim OverallActivityTime As Integer
    Dim SectionActivities As Integer

    Dim ParaText As String

    Dim ActivityTimeStr As String

    ActivityTime = 0
    OverallActivityTime = 0
    SectionActivities = 0

    SubsectionCnt = 0
    SubsectionWordCnt = 0

    NumSec = ActiveDocument.Sections.Count
    Summary = "Word Count" & vbCrLf

    For S = 1 To NumSec
        SectionText = ActiveDocument.Sections(S).Range.Paragraphs(1).Range.Text

        For P = 1 To ActiveDocument.Sections(S).Range.Paragraphs.Count
            ParaText = ActiveDocument.Sections(S).Range.Paragraphs(P).Range.Text
            If InStr(ParaText, "Estimated study time:") Then
                ActivityTimeStr = ParaText
                ActivityTimeStr = Replace(ActivityTimeStr, "Estimated study time: ", "")
                ActivityTimeStr = Replace(ActivityTimeStr, " minutes", "")
                ActivityTime = ActivityTime + CInt(ActivityTimeStr)
                SectionActivities = SectionActivities + 1
            End If

        If InStr(SectionText, "Section") = 1 Then
            OverallActivityTime = OverallActivityTime + OverallActivityTime
            Summary = Summary & vbCrLf & "SECTION SUMMARY" & vbCrLf _
            & "Subsections: " & SubsectionCnt & vbCrLf _
            & "Section Wordcount: " & SubsectionWordCnt & vbCrLf _
            & "Section Activity Time: " & ActivityTime & vbCrLf _
            & "Section Activity Count: " & SectionActivities & vbCrLf & vbCrLf
            SubsectionCnt = 0
            SubsectionWordCnt = 0
            ActivityTime = 0
            SectionActivities = 0
        End If

        Summary = Summary & "[Document Section " & S & "] " _
        & SectionText _
        & "Word count: " _
        & ActiveDocument.Sections(S).Range.Words.Count _
        & vbCrLf

        SubsectionCnt = SubsectionCnt + 1
        SubsectionWordCnt = SubsectionWordCnt + ActiveDocument.Sections(S).Range.Words.Count

    Summary = Summary & vbCrLf & vbCrLf & "Overall document wordcount: " & _

    Summary = Summary & vbCrLf & "Activity Time: " & ActivityTime & " minutes"
    MsgBox Summary

    Selection.Paragraphs(1).Range.InsertAfter vbCr & Summary & vbCrLf
End Sub

PS I’ve no idea what idiomatic VB is supposed to look like; all the examples I saw seemed universally horrible… If you can give me any pointers to cleaning the above code up, feel free to add them in the comments…

PPS Thinks… I guess each section could also return a readability score? Does VB have a readability score function? VB code anywhere implementing readability scores?

Getting Started With Neo4j and Companies House OpenData

One of the things that’s been on my to do list for ages has been to start playing with the neo4j graph database. I finally got round to having a dabble last night, and made a start trying to figure out how to load some sample data in.

The data I looked at came in two flavours, both bulk data downloads from Companies House:, a JSON dataset containing beneficial ownership/significant control data, and a tabular, CSV dataset containing basic company information.

To simplify running neo4j, I created a simple docker-compose.yml file that would fire up a couple of linked containers – one running neo4j, the other running a Jupyter notebook that I could run queries from. (Actually, I think neo4j has its own web UI, but I’m more comfortable in writing Python scripts in the Jupyter environment.)

#visit 7474 and change the default password - eg to: neo4jch
  image: neo4j
    - "7474:7474"
    - "1337:1337"
    - /opt/data

  image: jupyter/scipy-notebook
    - "8890:8888"
    - neo4jch:neo4j
    - ./notebooks:/home/jovyan/work

To launch things, I tend to run Kitematic, launch a docker command line, cd to the directory containing the above YAML file, then run docker-compose up -d. Kitematic then provides links to the neo4j and Jupyter web page UIs. One thing to note is that neo4j seems to want it’s default password changing – go to the container’s page on port 7474 and reset the password – I changed mine to neo4jch. Once launched, the containers can be suspended with the command docker-compose stop and resumed with docker-compose start.

I’ve popped an example notebook up here, along with a couple of sample data files, that shows how to load both sorts of data (the hierarchical JSON data, and the flat CSV table, into neo4j, along with a couple of sample queries.

That said, I’m not sure how good the examples are – I still need to read the documentation! (For example, via @markhneedham, “MERGE is MATCH/CREATE so you can use the same query on new/existing companies” which should let me figure out how to properly create company information nodes and them link to them from beneficial owners.)

Here are some examples of my starting attempts at the data ingest. Firstly, for JSON data that looks like this:

  "company_number": "09145694",
  "data": {
    "address": {
      "address_line_1": "****",
      "locality": "****",
      "postal_code": "****",
      "premises": "****",
      "region": "****"
    "country_of_residence": "England",
    "date_of_birth": {
      "month": *,
      "year": *
    "etag": "****",
    "kind": "individual-person-with-significant-control",
    "links": {
      "self": "/company/09145694/persons-with-significant-control/individual/bIhuKnMFctSnjrDjUG8n3NgOrlU"
    "name": "***",
    "name_elements": {
      "forename": "***",
      "middle_name": "***",
      "surname": "***",
      "title": "***"
    "nationality": "***",
    "natures_of_control": [
    "notified_on": "2016-04-06"

The following bit of Cypher script seems to load the data in:

with'snapshot_beneficialsmall.txt', 'r', 'utf-8-sig') as f:
    for line in f:
        jdata = json.loads(line)
        query = """
WITH {jdata} AS jd
MERGE (beneficialowner:BeneficialOwner {name:}) ON CREATE
  SET beneficialowner.nationality =, beneficialowner.country_of_residence =
MERGE (company:Company {companynumber: jd.company_number})
MERGE (beneficialowner)-[:BENEFICIALOWNEROF]-&amp;gt;(company)
FOREACH (noc IN | MERGE (beneficialowner)-[:BENEFICIALOWNEROF {kind:noc}]-&amp;gt;(company))
""", jdata = jdata)

For the CSV data, I tried the following recipe:

import csv
#Ideally, we create a company:Company node with a company either here
#and then link to it from the beneficial ownership data?
with open('snapshotcompanydata.csv','r') as csvfile:
    #need to clean the column names by stripping whitespace
    reader = csv.DictReader(csvfile,skipinitialspace=True)
    for row in reader:
        WITH {row} AS row
        MERGE (company:Company {companynumber: row.CompanyNumber}) ON CREATE
  SET = row.CompanyName

        MERGE (address:Address {postcode : row["RegAddress.PostCode"]}) ON CREATE
        SET address.line1=row['RegAddress.AddressLine1'], address.line2=row['RegAddress.AddressLine2'],
        MERGE (company)-[:LOCATION]-&amp;gt;(address)

        MERGE (companyactivity:SICCode {siccode:row['SICCode.SicText_1']})
        MERGE (company)-[:ACTIVITY]-&amp;gt;(companyactivity)

Note the way that “dotted” column names are handled.

What these early experiments suggest is that I should probably spend a bit of time trying to model the data to work out what sort of graph structure makes sense. My gut reaction was to define node types identifying beneficial owners, companies and SIC codes. Differently attributed BENEFICIALOWNEROF edges identify what sort of control a beneficial owner has.


However, for generality, I think I should define a more general person node, who could also have DIRECTORROLE edges linking them to companies with attributes correpsponding to things like “director”, “company secretary”, “nominee direcotor” etc? (I don’t think director information is available as a download from Companies House, but it could be accreted/cached into my own database each time I look up director information via the Companies House API.)

A couple of other things that need addressing: constraints (so for example, we should only have one node per company number – the correlate of company numbers being a unique key in a relational datatable (via @markhneedham, s/thing like CREATE CONSTRAINT ON (c:Company) ASSERT c. companynumber is UNIQUE maybe…); and indexes – it would probably make sense to create an index on something company numbers, for example.

Next on the to do list, some example queries on the data as I currently have it modelled to see what sorts of question we can ask and what sorts of network we can extract (I may need to add in more than the sample of data – which means I may also need to look at optimising the way the data is imported?). This might also inform how I should be modelling the data!;-)

Related: Trawling the Companies House API to Generate Co-Director Networks.

See also: Getting Started With the Neo4j Graph Database – Linking Neo4j and Jupyter SciPy Docker Containers Using Docker Compose and Accessing a Neo4j Graph Database Server from RStudio and Jupyter R Notebooks Using Docker Containers.

PS also via @markhneedham, one to explore when eg annotating a pre-existing node with additional attributes from a new dataset, something along lines of MERGE (c:Company {…}) SET c.newProp1 = “boo”, c.newProp2 = “blah” etc…