Category: Tinkering

A Quick Look at Planning Data on the Isle of Wight

One of the staples that I suspect many folk turn to in the weekly local press here on the Isle of Wight is the listing of recent planning notices.

The Isle of Wight Council website also provides a reasonably comprehensive online source of planning information, with notices split across several listings: current applications (and an archive), recent decisions, and appeals.

It’s easy enough to knock up a scraper to grab the list of current applications, scrape each of the linked-to application pages in turn, and then generate a map showing the locations of the current planning applications.

[Image: map of current planning applications]

Indeed, working with my local hyperlocal onthewight.com, here’s a sketch of exactly such an approach, published for the first time yesterday: Isle of Wight planning applications : Mapped (announcement).
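
For anyone wanting to try something similar, here’s a minimal sketch of that sort of pipeline – note that the listing URL and the page selectors are made-up placeholders rather than the actual recipe used for the OnTheWight map:

import requests
from bs4 import BeautifulSoup
from geopy.geocoders import Nominatim
import folium

#Hypothetical listing URL and selectors - the real council pages will differ
LISTING_URL='https://www.iwight.com/planning/current-applications'

geolocator=Nominatim(user_agent='iw-planning-sketch')
m=folium.Map(location=[50.67,-1.3],zoom_start=11) #centred on the Island

soup=BeautifulSoup(requests.get(LISTING_URL).text,'html.parser')
for link in soup.select('a.application-link'): #hypothetical selector
    detail=BeautifulSoup(requests.get(link['href']).text,'html.parser')
    address=detail.select_one('.application-address').text.strip() #hypothetical
    location=geolocator.geocode(address)
    if location is not None:
        folium.Marker([location.latitude,location.longitude],popup=address).add_to(m)

m.save('current_applications.html')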

I’m hoping to do a lot more with OnTheWight – and perhaps others…? – over the coming weeks and months, so it’d be great to hear any feedback you have, either here or on the OnTheWight site itself.

Where Next?

The sketch is a good start, but it’s exactly that. If we are going to extend the service, for example, by also providing a means of reviewing recently accepted (or rejected) applications, as well as applications currently under appeal, we perhaps need to think a little bit more clearly about how we store the data – and keep track of where it is in the planning process.

If we look at the page for a particular application, we see that there are essentially three tables:

[Image: Isle of Wight Council planning application details page]

The listings pages also take slightly different forms. All of them have an address, and all of them have a planning application identification number (though in two intersecting forms); but they differ in the semantics of their third and possible fourth columns, although each ultimately resolves to a date or null value.

– current (and archive) listings:

[Image: current planning applications listing]

– recent decisions:

[Image: recent planning decisions listing]

– appeals:

[Image: planning appeals listing]

At the moment, the OnTheWight sketchmap is generated from a scrape of the Isle of Wight Council current planning applications page (latitude and longitude are generated by geocoding the address). A more complete solution would be to start to build a database of all applications, though this requires a little bit of thought when it comes to setting up the database so it becomes possible to track the current state of a particular application.

It might also be useful to put together a simple flow chart that shows how the public information available around an application evolves as the application progresses, and then build a data model that can readily reflect that. We could then start to annotate that chart with different output opportunities – for example, as far as the map goes, it’s easy enough to imagine several layers: a current applications layer, a (current) appeals layer, a recent decisions layer, and an archived decisions layer.
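
For what it’s worth, a minimal sketch of the sort of application-tracking database I have in mind – table and column names are purely illustrative – might look something like this:

import sqlite3

conn=sqlite3.connect('planning.db')
conn.executescript('''
-- One row per application, updated as it moves through the process
CREATE TABLE IF NOT EXISTS applications (
    app_ref TEXT PRIMARY KEY,  -- planning application reference number
    address TEXT,
    postcode TEXT,
    lat REAL, lng REAL,        -- geocoded from the address
    status TEXT,               -- e.g. current / decided / appeal / archived
    decision TEXT,             -- accepted / rejected / NULL while pending
    decision_date TEXT
);
-- One row per observed status change, as a basis for alert feeds
CREATE TABLE IF NOT EXISTS status_events (
    app_ref TEXT REFERENCES applications(app_ref),
    status TEXT,
    seen_on TEXT DEFAULT CURRENT_TIMESTAMP
);
''')
conn.commit()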

A process diagram would also allow us to start spotting event opportunities around which we might be able to generate alerts. For example, generating feeds that allow you to identify changes in application activity within a particular unit postcode or postcode district (ONS: UK postcode structure) or ward could act as the basis of a simple alerting mechanism. It’s then easy enough to set up an IFTTT feed-to-email pipe, though longer term an “onsite” feed-to-email subscription service would allow for a more local service. (Is there a WordPress plugin that lets logged-in users generate multiple email subscriptions to different feeds?)

In terms of other value-adds that arise from processing the data, I can think of a few… For example, keeping track of repeated applications to the same property, analysing which agents are popular in terms of applications (and perhaps generating a league table of success rates!), linkage to other location-based services (for example, licence applications or price paid data), and so on.

Takes foot off spade and stops looking into the future, surveys weeds and half dug hole…;-)

Things I Take for Granted #287 – Grabbing Stuff from Web Form Drop Down Lists

Over the years, I’ve collected lots of little hacks for tinkering with various data sets. Here’s an example…

A form on a web page with country names that map to code values:

[Image: ROARMAP advanced search form]

If we want to generate a two-column lookup table from the names on the list to the values that encode them, we can look at the HTML source, grab the list of option elements, then use a regular expression to extract the names and values and rewrite them as a two-column, tab-separated text file, with one item per line:

[Image: regular expression find-and-replace for extracting form options]

NOTE: the last character in the replace is \n (the newline character). I grabbed the screenshot when the cursor was blinking :-(
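
If you’d rather script the same trick, here’s a minimal Python sketch of the idea – the sample HTML is made up, and a real page may need a more forgiving pattern (or a proper HTML parser):

import re

#Made-up fragment of form source copied from a web page
html='''<select name="country">
<option value="AF">Afghanistan</option>
<option value="AL">Albania</option>
<option value="DZ">Algeria</option>
</select>'''

#Capture the value attribute and the option label text
pairs=re.findall(r'<option value="([^"]*)">([^<]*)</option>',html)

#Rewrite as two column, tab separated text, one item per line
with open('country_codes.tsv','w') as f:
    for value,name in pairs:
        f.write('{}\t{}\n'.format(name.strip(),value))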

A Google Spreadsheets View Over DWP Tabulation Tool 1-Click Time Series Data

Whilst preparing for an open data training session for New Economy in Manchester earlier this week, I was introduced to the DWP tabulation tool that provides a quick way of analysing various benefits and allowances related datasets, including bereavement benefits, incapacity benefit and employment and support allowance.

The tool supports the construction of various data views as well as providing 1-click link views over “canned” datasets for each category of data.

[Image: DWP tabulation tool]

The data is made available in the form of an HTML data table via a static URL (example):

[Image: bereavement benefits on-flows time series by gender of claimant]

To simplify working with the data, we can import the data table directly into a Google spreadsheet using the importHTML() formula, which allows you to specify a URL and then import a specified HTML data table from that page. In the following example, the first table from a results page – the one that contains the description of the table – is imported into cell A1, and the actual data table (table 2) is imported via an importHTML() formula specified in cell A2.

[Image: imported DWP data table in Google Sheets]

Note that the first data row does not appear to import cleanly – inspection of the original HTML table shows why: the presence of what is presumably a split cell that declares the name of the time series index column along with the first time index value.

To simplify the import of these data tables into a Google Spreadsheet, we can make use of a small script to add an additional custom menu into Google spreadsheets that will import a particular dataset.

[Image: adding a script via the Google Sheets script editor]

The following script shows one way of starting to construct such a set of menus:

function onOpen() {
  var ui = SpreadsheetApp.getUi();
  // Or DocumentApp or FormApp.
  ui.createMenu('DWP Tabs')
      .addSubMenu(ui.createMenu('Bereavement Benefits')
          .addItem('1-click BW/BB timeseries', 'mi_bb_b')
          .addItem('1-click Region timeseries', 'mi_bb_r')
          .addItem('1-click Gender timeseries', 'mi_bb_g')
          .addItem('1-click Age timeseries', 'mi_bb_a')
      )
      .addSubMenu(ui.createMenu('Incapacity Benefit/Disablement')
          .addItem('1-click Region timeseries', 'mi_ic_r')
      )
      .addToUi();
}

function menuActionImportTable(url){
  var ss = SpreadsheetApp.getActiveSpreadsheet();
  var sheet = ss.getSheets()[0];

  var cell = sheet.getRange("A1");
  cell.setFormula('=importhtml("'+url+'","table",1)');
  cell = sheet.getRange("A2");
  cell.setFormula('=importhtml("'+url+'","table",2)');
}

//--Incapacity Benefit/Disablement
function mi_ic_r() {
  var url='http://tabulation-tool.dwp.gov.uk/flows/flows_on/ibsda/cdquarter/ccgor/a_carate_r_cdquarter_c_ccgor.html';
  menuActionImportTable(url)
}


//-- Bereavement Benefits
function mi_bb_r() {
  var url='http://tabulation-tool.dwp.gov.uk/flows/flows_on/bb/cdquarter/ccgor/a_carate_r_cdquarter_c_ccgor.html';
  menuActionImportTable(url)
}

function mi_bb_g() {
  var url='http://tabulation-tool.dwp.gov.uk/flows/flows_on/bb/cdquarter/ccsex/a_carate_r_cdquarter_c_ccsex.html';
  menuActionImportTable(url)
}

function mi_bb_a() {
  var url='http://tabulation-tool.dwp.gov.uk/flows/flows_on/bb/cdquarter/cnage/a_carate_r_cdquarter_c_cnage.html';
  menuActionImportTable(url)
}

function mi_bb_b() {
  var url='http://tabulation-tool.dwp.gov.uk/flows/flows_on/bb/cdquarter/ccbbtype/a_carate_r_cdquarter_c_ccbbtype.html';
  menuActionImportTable(url)
}

Copying the above script into the script editor associated with a spreadsheet, and then reloading the spreadsheet (permissions may need to be granted to the script the first time it is run), provides a custom menu that allows the direct import of a particular dataset:

[Image: custom DWP Tabs menu in Google Sheets]

Duplicating the spreadsheet carries the script along with it (I think), and the spreadsheet can presumably also be shared… (It’s been some time since I played with Apps Script – I’m not sure how permissioning works or how easy it is to convert scripts to add-ons, though I note from the documentation that top-level app menus aren’t supported by add-ons.)
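
As an aside, the same canned URLs can also be pulled straight into pandas, outside of Google Sheets – here using the gender time series link from the menu script above (and assuming the tabulation tool URLs still resolve, and still serve the description and data as two tables):

import pandas as pd

#One of the canned 1-click URLs from the Apps Script menu above
url=('http://tabulation-tool.dwp.gov.uk/flows/flows_on/bb/'
     'cdquarter/ccsex/a_carate_r_cdquarter_c_ccsex.html')

#Assuming two tables on the page: the description, then the data table proper
tables=pd.read_html(url)
description,data=tables[0],tables[1]
print(data.head())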

Lazy Regular Expressions – Splitting Out Collapsed Columns

Via a tweet, and then an email, to myself and fellow OpenRefine evangelist Owen Stephens (if you haven’t already done so, check out Owen’s wonderful OpenRefine tutorial), Dom Fripp got in touch with a data cleaning issue he was having to contend with: a reporting system that threw out a data report in which one of the columns contained a set of collapsed columns from another report – something rather like this:

TitleoffirstresearchprojectPeriod: 31/01/04 → 31/01/07Number of participants: 1Awarded date: 22 Aug 2003Budget Account Ref: AB1234Funding organisation: BBSRCTotal award: £123,456Principal Investigator: Goode, Johnny B.Project: Funded Project › Research project

The question was – could this be fixed using OpenRefine, with the compounded data elements split out from the single cell into separate columns of their own?

The fields that appeared in this combined column were variable (not all of them appeared in each row), but they always appeared in the same order. So, for example, a collapsed record might look like:

Funding organisation: BBSRCFunder project reference: AA/1234567/8Total award:

The full list of possible collapsed columns was: Title, School/Department, Period, Number of participants, Awarded Date, Budget Account Ref, Funding Organisation, Funder Project Reference, Total award, Reference code, Principal Investigator, Project

The pattern appeared to be Column Name: value, except for the Title, where there was no colon.

On occasion, a row would contain an exceptional item that did not conform to the pattern:

[Image: example of a rogue item that doesn’t conform to the pattern]

One way to split out the columns is to use a regular expression. We can parse a column using the “Add column based on this column” action:

[Image: OpenRefine “Add column based on this column” dialogue]

If all the columns always appeared in the same order, we could write something like the following GREL regular expression to match each column and its associated value:

value.match(/(Title.*)(Period.*)(Number of participants:.*)(Awarded date.*)(Budget Account Ref:.*)(Funding organisation.*)(Total award.*)(Principal Investigator:.*)(Project:.*)/)

[Image: GREL match with one group per column]

To cope with optional elements that don’t appear in our sample (for example, (School\/Department.*)), we need to make each group optional by qualifying it with a ?.

value.match(/(Title.*)?(School\/Department.*)?(Period.*)?(Number of participants:.*)?(Awarded date.*)?(Budget Account Ref:.*)?(Funding organisation.*)?(Funder project reference.+?)?(Total award.*)?(Principal Investigator:.*)?(Project:.*)?/)

[Image: GREL match with optional greedy groups]

However, as the above example shows, using the greedy .* operator means the first group matches everything. So instead, we need to use lazy matching within each group: .+?

value.match(/(Title.+?)?(School\/Department.+?)?(Period.+?)?(Number of participants:.+?)?(Awarded date.+?)?(Budget Account Ref:.+?)?(Funding organisation.+?)?(Funder project reference.+?)?(Total award.+?)?(Principal Investigator:.+?)?(Project:.+?)?/)

[Image: GREL match with optional lazy groups]
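
The same greedy-versus-lazy behaviour can be demonstrated with Python’s re module (like GREL’s match(), the pattern here is anchored so it has to account for the whole string):

import re

s="Period: 31/01/04Number of participants: 1"

#Greedy: the first group swallows the whole string, the second gets nothing
print(re.match(r"(Period.*)?(Number of participants.*)?$",s).groups())
#('Period: 31/01/04Number of participants: 1', None)

#Lazy: each group gives up characters as soon as the next group can match
print(re.match(r"(Period.+?)?(Number of participants.+?)?$",s).groups())
#('Period: 31/01/04', 'Number of participants: 1')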

So far so good – but how do we cope with cells that do not start with one of our recognised patterns? This time we need a leading group that matches anything that is not the first expected pattern in our list:

value.match(/((?!(?:Title)).*)?(Title.+?)?(School\/Department.+?)?(Period.+?)?(Number of participants:.+?)?(Awarded date.+?)?(Budget Account Ref:.+?)?(Funding organisation.+?)?(Funder project reference.+?)?(Total award.+?)?(Principal Investigator:.+?)?(Project:.+?)?/)

[Image: GREL match with a leading not-Title catch-all group]

Having matched the groups, how do we split the relevant items out into new columns? One way is to introduce a column separator character sequence (such as ::) that we can split on:

forEach(value.match(/((?!(?:Title)).*?)?(Title.+?)?(School\/Department.+?)?(Period.+?)?(Number of participants:.+?)?(Awarded date.+?)?(Budget Account Ref:.+?)?(Funding organisation.+?)?(Funder project reference.+?)?(Total award.+?)?(Principal Investigator:.+?)?(Project:.+?)?/),v,if(v == null," ",v)).join('::')

[Image: joining the matched groups with a :: separator]

This generates rows of the form:

[Image: rows with ::-separated values]

We can now split these cells into several columns:

[Image: OpenRefine “Split into several columns” option]

We use the :: sequence as the separator:

[Image: splitting on the :: separator]

Once split, the columns should be regularly arranged. “Rogue” items should appear in the first new column – any values appearing there might be used to help us identify further tweaks required to our regular expression.

[Image: split columns, with rogue items in the first new column]

We now need to do a little more cleaning. For example, tidying up column names:

[Image: renaming the new columns]

And then cleaning down each new column to remove the column heading.

[Image: cleaning the column headings out of the cell values]

As a general pattern, use the column name and an optional colon (NOTE: the expression should be :? rather than the :+ shown in the screenshot):

[Image: replace using the column name and optional colon]

To reuse this pattern of operations on future datasets, we can export a description of the transformations applied. Future datasets can then be loaded in to OpenRefine, the operation history pasted in, and the same steps applied. (The following screenshot does not show the operation defined for renaming the new columns or cleaning down them.)

[Image: extracting the operation history for reuse]

As ever, writing up this post took as long as working out the recipe…

PS Hmmm, I wonder… One way of generalising this further might be to try to match the columns in any order…? Not sure my regexp-fu is up to that just at the moment. Any offers?!;-)
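
For what it’s worth, here’s one possible order-independent approach sketched in Python rather than GREL: scan for any known label and read each value up to the next label. (It assumes label strings never appear inside values, which may well not hold in practice.)

import re

FIELDS=["Title","School/Department","Period","Number of participants",
        "Awarded date","Budget Account Ref","Funding organisation",
        "Funder project reference","Total award","Reference code",
        "Principal Investigator","Project"]

#Alternation over all the labels; each value runs up to the next label (or the end)
labels="|".join(re.escape(f) for f in FIELDS)
pattern=re.compile(r"({0}):?\s*(.*?)(?=(?:{0})|$)".format(labels))

def split_collapsed(cell):
    return dict((m.group(1),m.group(2).strip()) for m in pattern.finditer(cell))

print(split_collapsed("Funding organisation: BBSRC"
                      "Funder project reference: AA/1234567/8"
                      "Total award: £123,456"))
#{'Funding organisation': 'BBSRC', 'Funder project reference': 'AA/1234567/8',
# 'Total award': '£123,456'}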

Printing Out Spreadsheet Cell Values by (Hierarchical) Column Using pandas

Building on from Wrangling Complex Spreadsheet Column Headers, I’ve been hacking the spreadsheet published here a bit more, so that I can print out each column value from each sheet of a given spreadsheet for a particular local authority (that is, for a particular key value in a particular column), to get an output of the form:

[Image: example output – printed cell values for one local authority]

(I guess I could add a print suppressor to only print statements where the value is not 0?)

The original notebook can be found here.

The major novelty over the previous post is the colmapbuilder() function, which generates a nested dict from a group of hierarchical column names; the dict terminates with either the column code or the cell value for that column and a given row selector (I need to tidy up the function args…).

import pandas as pd
dfx=pd.ExcelFile('Local_Authority_Housing_Statistics_dataset_2013-14.xlsx')

#Menu sheet parse to identify sheets A-I
import re
def getSheetDetails(dfx):
    sd=re.compile(r'Section (\w) - (.*)$')
    sheetDetails={}
    for row in dfx.parse('Menu')[[1]].values:
        if str(row[0]).startswith('Section'):
            sheetDetails[sd.match(row[0]).group(1)]=sd.match(row[0]).group(2)
    return sheetDetails

def dfgrabber(dfx,sheet):
    #First pass - identify row for headers
    df=dfx.parse(sheet,header=None)
    df=df.dropna(how='all')
    row = df[df.apply(lambda x: (x == "DCLG code").any(), axis=1)].index.tolist()[0]#.values[0] # will be an array
    #Second pass - generate dataframe
    df=dfx.parse(sheet,header=row).dropna(how='all').dropna(how='all',axis=1)
    df=df[df['DCLG code'].notnull()].reset_index(drop=True)
    df.columns=[c.split(' ')[0] for c in df.columns]
    return df,row


import collections
def coldecoder(dfx,sheet,row):
    zz=dfx.parse(sheet,header=None)
    stitle=zz[0][[0]][0]
    
    xx=zz[1:row].dropna(how='all')
    #Fill down
    xx.fillna(method='ffill', axis=0,inplace=True)
    #Fill across
    xx=xx.fillna(method='ffill', axis=1)
    #How many rows in the header?
    keydepth=len(xx)
    header=[i for i in range(0,keydepth)]

    xx=xx.append(zz[row:row+1])
    xx.to_csv('multi_index.csv',header=False,index=False,encoding='utf-8')
    mxx=pd.read_csv('multi_index.csv',header=header,encoding='utf-8')
    for c in mxx.columns.get_level_values(0).tolist():
        if c.startswith('Unnamed'):
            mxx = mxx.drop(c, level=0, axis=1)
    #We need to preserve the order of the header columns
    dd=mxx.to_dict(orient='split')
    ddz=zip(dd['columns'],dd['data'][0])
    keyx=collections.OrderedDict() #{}
    for r in ddz:
        if not pd.isnull(r[1]):
            #print r[1].split(' ')[0]
            keyx[r[1].split(' ')[0]]=r[0]
    return stitle,keyx,keydepth

#Based on http://stackoverflow.com/a/10756547/454773
def myprint(d,l=None):
  if l is None: l=''
  for k, v in d.iteritems():
    if isinstance(v, dict):
      print("{}{}".format(l,k))
      myprint(v,l=l+'-')
    else:
      print "{0} {1} : {2}".format(l,k.encode('utf-8'), v)

def colmapbuilder(dfx,sheet,code=None,retval=True):
    df,row=dfgrabber(dfx,sheet)
    sname,skey,kd=coldecoder(dfx,sheet,row)
    kq=collections.OrderedDict() #{}
    for k in skey:
        kq[k]=[]
        for j in skey[k]:
            if j not in kq[k]: kq[k].append(j)
    colmapper=collections.OrderedDict() #{}
    for kkq in kq:
        curr_level = colmapper
        depth=0
        for path in kq[kkq]:
            depth=depth+1
            if path not in curr_level:
                if depth<len(kq[kkq]):
                    curr_level[path] = collections.OrderedDict() #{}
                    curr_level = curr_level[path]
                else:
                    if retval and code is not None:
                        curr_level[path] = df[df['Current\nONS']==code][kkq].iloc[0]
                    else:
                        curr_level[path] = kkq
            else:
                curr_level = curr_level[path]
    return sname, colmapper

ll=dfx.sheet_names
ll.remove('Menu')
for lll in ll:
    sname,cmb=colmapbuilder(dfx,lll,'E06000046')
    print(sname+'\n')
    myprint(cmb)
    print('\n\n')
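
As a quick stab at the print suppressor wondered about at the top of this post, a variant of myprint() might simply guard the leaf print (branch headings would still print even when everything under them is suppressed):

def myprint_nonzero(d,l=None):
  #As myprint(), but skip leaf values equal to zero
  if l is None: l=''
  for k, v in d.iteritems():
    if isinstance(v, dict):
      print("{}{}".format(l,k))
      myprint_nonzero(v,l=l+'-')
    elif v!=0:
      print("{0} {1} : {2}".format(l,k.encode('utf-8'), v))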

I’m not sure how this helps, other than demonstrating how we might quickly generate a crude textualisation of the values in a single row of a spreadsheet that has a complex set of hierarchical column names?

The code is also likely to be brittle, so the main questions are:

– is the method reusable?
– can the code/approach be generalised or at least made a little bit more robust and capable of handling other spreadsheets with particular properties? (And then – what properties, and how might we be able to detect those properties?)

Wrangling Complex Spreadsheet Column Headers

[This isn’t an R post, per se, but I’m syndicating it via RBloggers because I’m interested – how do you work with hierarchical column indices in R? Do you try to reshape the data to something tidier on the way in? Can you autodetect elements to help with any reshaping?]

Not a little p****d off by the Conservative election pledge to extend the right-to-buy to housing association tenants (my response: so extend the right to private tenants too?), I thought I’d have a dig around to see what data might be available that could tell me something about the housing situation on the Isle of Wight, using a method that could also be used in other constituencies. That is, what datasets are provided at a national level, broken down to local level? (To start with, I wanted to see what I could learn without visiting the DCLG OpenDataCommunities site.)

One source of data seems to be the Local authority housing statistics data returns for 2013 to 2014, a multi-sheet spreadsheet reporting at a local authority level on:

– Dwelling Stock
– Local Authority Housing Disposals
– Allocations
– Lettings, Nominations and Mobility Schemes
– Vacants
– Condition of Dwelling Stock
– Stock Management
– Local authority Rents and Rent Arrears
– Affordable Housing Supply

[Image: Local Authority Housing Statistics dataset spreadsheet]

Something I’ve been exploring lately is “external spreadsheet data source” wrappers for the pandas Python library that wrap frequently released spreadsheets with a simple (?!) interface that lets you pull the data from the spreadsheet into a pandas dataframe.

For example, I got started on the LA housing stats sheet as follows – first a look at the sheets, then a routine to grab sheet names out of the Menu sheet:

import pandas as pd
dfx=pd.ExcelFile('Local_Authority_Housing_Statistics_dataset_2013-14.xlsx')
dfx.sheet_names
#...
#Menu sheet parse to identify sheets A-I
import re
sd=re.compile(r'Section (\w) - (.*)$')
sheetDetails={}
for row in dfx.parse('Menu')[[1]].values:
    if str(row[0]).startswith('Section'):
        sheetDetails[sd.match(row[0]).group(1)]=sd.match(row[0]).group(2)
sheetDetails
#{u'A': u'Dwelling Stock',
# u'B': u'Local Authority Housing Disposals',
# u'C': u'Allocations',
# u'D': u'Lettings, Nominations and Mobility Schemes',
# u'E': u'Vacants',
# u'F': u'Condition of Dwelling Stock',
# u'G': u'Stock Management',
# u'H': u'Local authority Rents and Rent Arrears',
# u'I': u'Affordable Housing Supply'}

All the data sheets have similar columns on the left-hand side, which we can use as a crib to identify the row containing the simple, coded column headers.

def dfgrabber(dfx,sheet):
    #First pass - identify row for headers
    df=dfx.parse(sheet,header=None)
    df=df.dropna(how='all')
    row = df[df.apply(lambda x: (x == "DCLG code").any(), axis=1)].index.tolist()[0]#.values[0] # will be an array
    #Second pass - generate dataframe
    df=dfx.parse(sheet,header=row).dropna(how='all').dropna(how='all',axis=1)
    df=df[df['DCLG code'].notnull()].reset_index(drop=True)
    return df

#usage:
dfgrabber(dfx,'H')[:5]

That gives something like the following:

[Image: dataframe generated from sheet H]

This is completely useable if we know what the column codes refer to. What is handy is that a single header row is available for the columns, although the metadata that neatly describes the codes is not so tidily presented:

[Image: complex hierarchical column headers in the spreadsheet]

Trying to generate a pandas hierarchical index from this data is a bit messy…

One approach I’ve explored is trying to create a lookup table from the coded column names back into the hierarchical column names.

For example, if we can detect the column multi-index rows, we can fill down (for labels spanning several rows, the label sits in the topmost cell) and then fill across (for labels spanning several columns, the label sits in the leftmost cell) to populate the spanned cells of the index grid with the value that spans them.

#row is autodetected and contains the row for the simple header
row=7
#Get the header columns - and drop blank rows
xx=dfx.parse('A',header=None)[1:row].dropna(how='all')
xx

[Image: raw header rows]

#Fill down
xx.fillna(method='ffill', axis=0,inplace=True)
#Fill across
xx=xx.fillna(method='ffill', axis=1)
xx

[Image: header rows after filling down and across]

#append the coded header row
xx=xx.append(dfx.parse('A',header=None)[row:row+1])
xx

[Image: header rows with the coded header row appended]

#Now make use of pandas' ability to read in a multi-index CSV
xx.to_csv('multi_index.csv',header=False, index=False)
mxx=pd.read_csv('multi_index.csv',header=[0,1,2])
mxx

[Image: dataframe with a hierarchical column multi-index]

Note that the pandas column multi-index can span several columns, but not “vertical” levels.

Get rid of the columns that don’t feature in the multi-index:

for c in mxx.columns.get_level_values(0).tolist():
    if c.startswith('Unnamed'):
        mxx = mxx.drop(c, level=0, axis=1)
mxx

Now start to work on the lookup…

#Get a dict from the multi-index
#(note: we need to assign the result, and orient='list' gives one list of
#values per coded column, which is what the lookup below expects)
dd=mxx.to_dict(orient='list')
dd

[Image: dict generated from the multi-index]

We can then use this as a basis for generating a lookup table for the column codes.

keyx={}
for r in dd:
    keyx[dd[r][0].split(' ')[0]]=r
keyx

[Image: lookup table from column codes to hierarchical names]

We could also generate more elaborate dicts to provide ways of identifying particular codes.

Note that the key building required a little bit of tidying, arising from footnote numbers that appear in some of the coded column headings:

[Image: footnote number in a coded column heading]

This tidying should also be applied to the code column generation step above…

I’m thinking there really should be an easier way?

PS And then, of course, there are the additional gotchas… like UTF-8 pound signs that break ASCII encodings… (The coldecoder() routine in the previous post sidesteps this by passing encoding='utf-8' to to_csv() and read_csv().)

[Image: encoding error caused by a non-ASCII pound sign]

PPS Handy… informal guidance on Releasing data or statistics in spreadsheets, which acts as a counterpoint to GSS’ Releasing statistics in spreadsheets: Good practice guidance.

Quick Sketch – Election Maps

On my to do list over the next few weeks is to pull together a set of resources that could be useful in supporting data related activities around the UK General Election.

For starters, I’ve popped up an example of using the folium Python library for plotting simple choropleth maps using geojson based Westminster Parliamentary constituency boundaries: example election maps notebook.

What I haven’t yet figured out – and don’t know if it’s possible – is how to generate qualitative/categorical maps using predefined colour maps (so eg filling boundaries using colour to represent the party of the current MP etc). If you know how to do this, please let me know via the comments…;-)
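
One possible route – sketched here untested, with made-up property and file names, and assuming a reasonably recent folium build – is to skip the choropleth helper and pass a style_function that looks the incumbent party up in a dict:

import folium

#Hypothetical lookups: constituency name -> incumbent party -> colour
party_colours={'Conservative':'#0087DC','Labour':'#DC241F',
               'Liberal Democrat':'#FDBB30'}
mp_party={'Isle of Wight':'Conservative'} #etc.

def style_function(feature):
    #'PCON13NM' is a guess at the constituency name key in the boundary file
    name=feature['properties'].get('PCON13NM','')
    colour=party_colours.get(mp_party.get(name),'#aaaaaa')
    return {'fillColor':colour,'fillOpacity':0.6,'color':'black','weight':1}

m=folium.Map(location=[54.5,-2.5],zoom_start=6)
folium.GeoJson('westminster_constituencies.geojson',
               style_function=style_function).add_to(m)
m.save('party_map.html')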

Also in the notebook is a reference to an election odds scraper I’m running over each (or at least, many… let me know if you spot any missing ones…) Parliamentary constituency. The names associated with the constituencies don’t correspond, in an exact match sense, to any standard vocabularies, so on the to do list is to work out a mapping from the election odds constituency names to standard constituency identifiers. I’m thinking this could represent a handy way of demonstrating my Westminster Constituency reconciliation service docker container….:-)