Category: Tinkering

Printing Out Spreadsheet Cell Values by (Hierarchical) Column Using pandas

Building on Wrangling Complex Spreadsheet Column Headers, I’ve been hacking the spreadsheet published here a bit more so that I can print out each column value from each sheet in a given spreadsheet for a particular local authority (that is, for a particular key value in a particular column), to get output of the form:

[Image: Housing_Data_printrow]

(I guess I could add a print suppressor so that only statements where the value is not 0 get printed? A possible tweak along those lines is sketched just after the code below.)

The original notebook can be found here.

The major novelty over the previous post is the colmapbuilder() function, which generates a nested dict from a group of hierarchical column names, terminating in either the column code or, for a given row selector, the cell value for that column (I need to tidy up the function args…).

import pandas as pd
dfx=pd.ExcelFile('Local_Authority_Housing_Statistics_dataset_2013-14.xlsx')

#Menu sheet parse to identify sheets A-I
import re
def getSheetDetails(dfx):
    sd=re.compile(r'Section (\w) - (.*)$')
    sheetDetails={}
    for row in dfx.parse('Menu')[[1]].values:
        if str(row[0]).startswith('Section'):
            sheetDetails[sd.match(row[0]).group(1)]=sd.match(row[0]).group(2)
    return sheetDetails

def dfgrabber(dfx,sheet):
    #First pass - identify row for headers
    df=dfx.parse(sheet,header=None)
    df=df.dropna(how='all')
    row = df[df.apply(lambda x: (x == "DCLG code").any(), axis=1)].index.tolist()[0] #index of the first row containing "DCLG code"
    #Second pass - generate dataframe
    df=dfx.parse(sheet,header=row).dropna(how='all').dropna(how='all',axis=1)
    df=df[df['DCLG code'].notnull()].reset_index(drop=True)
    df.columns=[c.split(' ')[0] for c in df.columns]
    return df,row


import collections
def coldecoder(dfx,sheet,row):
    zz=dfx.parse(sheet,header=None)
    stitle=zz[0][[0]][0] #sheet title, taken from the top left cell
    
    xx=zz[1:row].dropna(how='all')
    #Fill down
    xx.fillna(method='ffill', axis=0,inplace=True)
    #Fill across
    xx=xx.fillna(method='ffill', axis=1)
    #How many rows in the header?
    keydepth=len(xx)
    header=[i for i in range(0,keydepth)]

    xx=xx.append(zz[row:row+1])
    xx.to_csv('multi_index.csv',header=False,index=False,encoding='utf-8')
    mxx=pd.read_csv('multi_index.csv',header=header,encoding='utf-8')
    for c in mxx.columns.get_level_values(0).tolist():
        if c.startswith('Unnamed'):
            mxx = mxx.drop(c, level=0, axis=1)
    #We need to preserve the order of the header columns
    dd=mxx.to_dict(orient='split')
    ddz=zip(dd['columns'],dd['data'][0])
    keyx=collections.OrderedDict() #{}
    for r in ddz:
        if not pd.isnull(r[1]):
            #print r[1].split(' ')[0]
            keyx[r[1].split(' ')[0]]=r[0]
    return stitle,keyx,keydepth

#Based on http://stackoverflow.com/a/10756547/454773
def myprint(d,l=None):
  if l is None: l=''
  for k, v in d.iteritems():
    if isinstance(v, dict):
      print("{}{}".format(l,k))
      myprint(v,l=l+'-')
    else:
      print "{0} {1} : {2}".format(l,k.encode('utf-8'), v)

def colmapbuilder(dfx,sheet,code=None,retval=True):
    df,row=dfgrabber(dfx,sheet)
    sname,skey,kd=coldecoder(dfx,sheet,row)
    kq=collections.OrderedDict() #{}
    for k in skey:
        kq[k]=[]
        for j in skey[k]:
            if j not in kq[k]: kq[k].append(j)
    colmapper=collections.OrderedDict() #{}
    for kkq in kq:
        curr_level = colmapper
        depth=0
        for path in kq[kkq]:
            depth=depth+1
            if path not in curr_level:
                if depth<len(kq[kkq]):
                    curr_level[path] = collections.OrderedDict() #{}
                    curr_level = curr_level[path]
                else:
                    if retval and code is not None:
                        curr_level[path] = df[df['Current\nONS']==code][kkq].iloc[0]
                    else:
                        curr_level[path] = kkq
            else:
                curr_level = curr_level[path]
    return sname, colmapper

ll=dfx.sheet_names
ll.remove('Menu')
for lll in ll:
    sname,cmb=colmapbuilder(dfx,lll,'E06000046')
    print(sname+'\n')
    myprint(cmb)
    print('\n\n')
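
As an aside, the zero-suppression idea mentioned at the top might look something like the following (an untested sketch: it skips zero-valued leaves, but will still print the headings of groups whose values are all zero):

#Sketch of a zero-suppressing variant of myprint - skip leaf values equal to 0
def myprint_nonzero(d,l=None):
  if l is None: l=''
  for k, v in d.iteritems():
    if isinstance(v, dict):
      print("{}{}".format(l,k))
      myprint_nonzero(v,l=l+'-')
    else:
      if v != 0:
        print("{0} {1} : {2}".format(l, k.encode('utf-8'), v))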

I’m not sure how this helps, other than demonstrating how we might be able to quickly generate a crude textualisation of the values in a single row of a spreadsheet with a complex set of hierarchical column names?

The code is also likely to be brittle, so the main questions are:

– is the method reusable?
– can the code/approach be generalised or at least made a little bit more robust and capable of handling other spreadsheets with particular properties? (And then – what properties, and how might we be able to detect those properties?)

Wrangling Complex Spreadsheet Column Headers

[This isn’t an R post, per se, but I’m syndicating it via RBloggers because I’m interested – how do you work with hierarchical column indices in R? Do you try to reshape the data to something tidier on the way in? Can you autodetect elements to help with any reshaping?]

Not a little p****d off by the Conservative election pledge to extend the right-to-buy to housing association tenants (my response: so extend the right to private tenants too?), I thought I’d have a dig around to see what data might be available to see what I could learn about the housing situation on the Isle of Wight, using a method that could also be used in other constituencies. That is, what datasets are provided at a national level, broken down to local level? (To start with, I wanted to see what I could learn without visiting the DCLG OpenDataCommunities site.)

One source of data seems to be the Local authority housing statistics data returns for 2013 to 2014, a multi-sheet spreadsheet reporting at a local authority level on:

– Dwelling Stock
– Local Authority Housing Disposals
– Allocations
– Lettings, Nominations and Mobility Schemes
– Vacants
– Condition of Dwelling Stock
– Stock Management
– Local authority Rents and Rent Arrears
– Affordable Housing Supply

[Image: Local_Authority_Housing_Statistics_dataset_2013-14_xlsx]

Something I’ve been exploring lately are “external spreadsheet data source” wrappers for the pandas Python library that wrap frequently released spreadsheets with a simple (?!) interface that lets you pull the data from the spreadsheet into a pandas dataframe.

For example, I got started on the LA housing stats sheet as follows – first a look at the sheets, then a routine to grab sheet names out of the Menu sheet:

import pandas as pd
dfx=pd.ExcelFile('Local_Authority_Housing_Statistics_dataset_2013-14.xlsx')
dfx.sheet_names
#...
#Menu sheet parse to identify sheets A-I
import re
sd=re.compile(r'Section (\w) - (.*)$')
sheetDetails={}
for row in dfx.parse('Menu')[[1]].values:
    if str(row[0]).startswith('Section'):
        sheetDetails[sd.match(row[0]).group(1)]=sd.match(row[0]).group(2)
sheetDetails
#{u'A': u'Dwelling Stock',
# u'B': u'Local Authority Housing Disposals',
# u'C': u'Allocations',
# u'D': u'Lettings, Nominations and Mobility Schemes',
# u'E': u'Vacants',
# u'F': u'Condition of Dwelling Stock',
# u'G': u'Stock Management',
# u'H': u'Local authority Rents and Rent Arrears',
# u'I': u'Affordable Housing Supply'}

All the data sheets have similar columns on the left-hand side, which we can use as a crib to identify the simple, single row of coded column headers.

def dfgrabber(dfx,sheet):
    #First pass - identify row for headers
    df=dfx.parse(sheet,header=None)
    df=df.dropna(how='all')
    row = df[df.apply(lambda x: (x == "DCLG code").any(), axis=1)].index.tolist()[0] #index of the first row containing "DCLG code"
    #Second pass - generate dataframe
    df=dfx.parse(sheet,header=row).dropna(how='all').dropna(how='all',axis=1)
    df=df[df['DCLG code'].notnull()].reset_index(drop=True)
    return df

#usage:
dfgrabber(dfx,'H')[:5]

That gives something like the following:

[Image: Housing_Data]

This is completely useable if we know what the column codes refer to. What is handy is that a single header row is available for the columns, although the metadata that neatly describes the codes is not so tidily presented:

[Image: excel_headers_complex]

Trying to generate pandas hierarchical index from this data is a bit messy…

One approach I’ve explored is trying to create a lookup table from the coded column names back into the hierarchical column names.

For example, if we can detect the column multi-index rows, we can fill down (to populate vertically spanned cells) and then fill across (for multi-column labels, where the label sits in the leftmost cell), so that each spanned cell of the index grid is filled with the value that spans it.

#row was autodetected above (by dfgrabber) and indexes the simple coded header row
row=7
#Get the header columns - and drop blank rows
xx=dfx.parse('A',header=None)[1:row].dropna(how='all')
xx

[Image: excel_headerparse1]

#Fill down
xx.fillna(method='ffill', axis=0,inplace=True)
#Fill across
xx=xx.fillna(method='ffill', axis=1)
xx

[Image: excel_headerparse2]

#append the coded header row
xx=xx.append(dfx.parse('A',header=None)[row:row+1])
xx

[Image: excel_headerparse3]

#Now make use of pandas' ability to read in a multi-index CSV
xx.to_csv('multi_index.csv',header=False, index=False)
mxx=pd.read_csv('multi_index.csv',header=[0,1,2])
mxx

[Image: excel_headerparse4]

Note that the pandas column multi-index can span several columns, but not “vertical” levels.

Get rid of the columns that don’t feature in the multi-index:

for c in mxx.columns.get_level_values(0).tolist():
    if c.startswith('Unnamed'):
        mxx = mxx.drop(c, level=0, axis=1)
mxx

Now start to work on the lookup…

#Get a dict from the multi-index - one record per row, keyed by the multi-level column tuples
dd=mxx.to_dict(orient='record')
dd

[Image: excel_headerparse5]

We can then use this as a basis for generating a lookup table for the column codes.

keyx={}
#The coded headers sit in the first data record (dd[0]), keyed by the multi-index column tuples
for r in dd[0]:
    keyx[dd[0][r].split(' ')[0]]=r
keyx

[Image: excel_headerparse6]

We could also generate more elaborate dicts to provide ways of identifying particular codes.
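
For example (a rough sketch, not from the original post), the keyx lookup built above could be regrouped to collect the codes under each top-level heading:

#Sketch: group the column codes by their top-level heading, using the keyx lookup above
codesByTopLevel={}
for code in keyx:
    codesByTopLevel.setdefault(keyx[code][0], []).append(code)
codesByTopLevel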

Note that the key building required a little bit of tidying, arising from footnote numbers that appear in some of the coded column headings:

[Image: excel header footnote]

This tidying should also be applied to the code column generation step above…

I’m thinking there really should be an easier way?

PS and then, of course, there are the additional gotchas… like UTF-8 pound signs that break ascii encodings…

[Image: non-ascii]

PPS Handy… informal guidance on Releasing data or statistics in spreadsheets, which acts as a counterpoint to GSS’ Releasing statistics in spreadsheets: Good practice guidance.

Quick Sketch – Election Maps

On my to do list over the next few weeks is to pull together a set of resources that could be useful in supporting data related activities around the UK General Election.

For starters, I’ve popped up an example of using the folium Python library for plotting simple choropleth maps using geojson based Westminster Parliamentary constituency boundaries: example election maps notebook.
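
For reference, a minimal sketch of that sort of choropleth (written against the current folium Choropleth API rather than the one the notebook used, and with made-up file, column and geojson property names):

#Minimal folium choropleth sketch - the data file, boundary file and field names are illustrative
import folium
import pandas as pd

df = pd.read_csv('constituency_data.csv')  #hypothetical: columns 'PCON13CD' and 'value'

m = folium.Map(location=[53.0, -1.5], zoom_start=6)
folium.Choropleth(
    geo_data='westminster_constituencies.geojson',  #hypothetical constituency boundary file
    data=df,
    columns=['PCON13CD', 'value'],
    key_on='feature.properties.PCON13CD',  #hypothetical geojson property holding the constituency code
    fill_color='YlGn',
).add_to(m)
m.save('constituency_map.html')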

What I haven’t yet figured out – and don’t know if it’s possible – is how to generate qualitative/categorical maps using predefined colour maps (so eg filling boundaries using colour to represent the party of the current MP etc). If you know how to do this, please let me know via the comments…;-)

Also in the notebook is a reference to an election odds scraper I’m running over each (or at least, many…let me know if you spot any missing ones…) Parliamentary constituencies. The names associated with the constituencies don’t correspond in an exact match sense to any standard vocabularies, so on the to do list is to work out a mapping from the election odds constituency names to standard constituency identifiers. I’m thinking this could represent a handy way of demonstrating my Westminster Constituency reconciliation service docker container….:-)

What’s the Point of an API?

Trying to clear my head of code on a dog walk after a couple of days tinkering with the nomis API, I started to ponder what an API is good for.

Chris Gutteridge and Alex Dutton’s open data excuses bingo card and Owen Boswarva’s Open Data Publishing Decision Tree both suggest that not having an API can be used as an excuse for not publishing a dataset as open data.

So what is an API good for?

I think one naive view is that this is what an API gets you…

[Image: api1]

It doesn’t of course, because folk actually want this…

[Image: api2]

Which is not necessarily that easy even with an API:

[Image: api3]

For a variety of reasons…

[Image: api4]

Even when the discovery part is done and you think you have a reasonable idea of how to call the API to get the data you want out of it, you’re still faced with the very real practical problem of how to actually get the data in to the analysis environment in a form that is workable on in that environment. Just because you publish standards based SDMX flavoured XML doesn’t mean anything to anybody if they haven’t got an “import from SDMX flavoured XML directly into some format I know how to work with” option.

[Image: api5]

And even then, once the data is in, the problems aren’t over…

[Image: api6]

(I’m assuming the data is relatively clean and doesn’t need any significant amount of cleaning, normalising, standardising, type-casting, date parsing etc etc. Which is of course likely to be a nonsense assumption;-)

So what is an API good for, and where does it actually exist?

I’m starting to think that for many users, if there isn’t some equivalent of an “import from X” option in the tool they are using or environment they’re working in, then the API-for-X is not really doing much of a useful job for them.

Also, if there isn’t a discovery tool they can use from the tool they’re using or environment they’re working in, then finding data from service X turns into another chore that takes them out of their analysis context and essentially means that the API-for-X is not really doing much of a useful job for them.

What I tried to do in doodling the Python / pandas Remote Data Access Wrapper for the Nomis API was to create some tools for myself that would help me discover various datasets on the nomis platform from my working environment – an IPython notebook – and then fetch any data I wanted from the platform into that environment in a form in which I could immediately start working with it – which is to say, typically as a pandas dataframe.

I haven’t started trying to use it properly yet – and won’t get a chance to for a week or so at least now – but that was the idea. That is, the wrapper should support the discovery and access parts of the conversation I want to have with the nomis data from within my chosen environment. That’s what I want from an API. Maybe?!;-)

And note further – this does not mean that things like a pandas Remote Data Access plugin or a CRAN package for R (such as the World Bank Development Indicators package, or any of the other data/API packages referenced from the rOpenSci packages list) should be seen as extensions of the API. At worst, they should be seen as projections of the API into user environments. At best, it is those packages that should be seen as the actual API.
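
By way of illustration, this is the kind of “import from X” experience I mean – a couple of lines in the analysis environment that go straight from the World Bank API to a dataframe (sketched here with the pandas-datareader World Bank interface; an equivalent used to live in pandas.io.wb):

#Pull a World Bank indicator straight into a pandas dataframe
from pandas_datareader import wb

#Total population for the UK, 2010-2014
df = wb.download(indicator='SP.POP.TOTL', country=['GB'], start=2010, end=2014)
print(df.head())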

APIs for users – not programmers. That’s what I want from an API.

See also: Opening Up Access to Data: Why APIs May Not Be Enough….

PS See also this response from @apievangelist: The API Journey.

Sketching Out a Python / pandas Remote Data Access Wrapper for the Nomis API

One of the things I keep failing to spend time looking at and playing with is the generation of text based reports from tabular datasets (“data2txt”, “data to text”, “textualisation”, “natural language generation (NLG)” etc).

One of my earlier fumblings was to look at generating “press releases” around monthly Jobseeker’s Allowance figures. One of the reasons that stalled was the amount of time it took me to find my way around the nomis site and API, trying to piece together the URLs that would let me pull down the data in a meaningful way and help me properly understand what the data actually referred to.

So over the weekend, I started to put together a wrapper for the nomis API that would let me have a conversation with it, so that I could start to find out what sorts of datasets it knows about and how I can run queries into those datasets (that is, what dimensions are available for each dataset that we can query on), as well as pulling back actual datasets from it.

To make the data easier to work with once I have pulled it down, I put it into a pandas dataframe so that I can work with it in that context.

(With Open Knowledge, I’m running a series of Code Clubs in “Wrangling data with Python” for the Cabinet Office at the moment, based around pandas and IPython notebooks. If I can get the wrapper working reliably enough, it could be interesting to see what they make of it…)

The code can be found here (nomis_api_wrapper.py) and a notebook demonstrating its use can be found here: Demo Python Wrapper for nomis API.

This is a flavour of the sorts of thing I’ve been reaching for with it:

[Image: nomis_demo]

[Image: nomis_demo2]

To make life easier, you can pass in dimension parameter values using either the dimension parameter codes or their actual values; because the nomis API requires the codes, legitimate values are automatically converted. (Note to self – add further checks to discard illegitimate values, where detected…)

[Image: nomis_demo4]
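
(Purely as an illustration of the idea – this is not the wrapper’s actual code – the conversion amounts to keeping a {label: code} lookup for each dimension and swapping any human-readable values for the codes the API expects before the query URL is built:)

#Illustrative sketch only: swap human-readable dimension values for nomis codes,
#passing through anything that is already a code (the lookup table here is made up)
def normalise_dimension_values(lookup, values):
    #lookup: {label: code} mapping for a single dimension
    return [lookup.get(v, v) for v in values]

#e.g. normalise_dimension_values({'On benefit': '1'}, ['On benefit', '2'])
#gives ['1', '2']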

Any comments, feedback, issues if you try it etc, please let me know via the comments to this post (for now…!).

PS next up – revisit the ONS API following this first, aborted attempt.

Recreating a Node.js Installation – Package Versions

Rebuilding a fresh version of the TM351 VM from scratch yesterday, I got an error trying to install tty.js, a node.js app that provides a “terminal desktop in the browser”.

[Image: vagrant_tm351lts_ttyjs]

Looking into a copy of the VM where tty.js does work, I could discover the version of node I’d previously successfully used, as well as check all the installed package versions:

### Show nodejs version and packages
> node -v
v0.10.35

> npm list -g
/usr/local/lib
├─┬ npm@1.4.28
│ ├── abbrev@1.0.5
│ ├── ansi@0.3.0
│ ├── ansicolors@0.3.2
│ ├── ansistyles@0.1.3
│ ├── archy@0.0.2
│ ├── block-stream@0.0.7
...
│ └── which@1.0.5
└─┬ tty.js@0.2.13
  ├─┬ express@3.1.0
  │ ├── buffer-crc32@0.1.1
...

Using this information, I could then use nvm, a node.js version manager, installed via:

curl https://raw.githubusercontent.com/creationix/nvm/v0.23.3/install.sh | NVM_DIR=/usr/local/lib/ bash

to install, from a new shell, the version I knew worked:

nvm install 0.10.35
npm install tty.js

(I should probably add the tty.js version in there too? npm install tty.js@0.2.13, perhaps?)

The terminal can then be run as a daemon from:

/usr/local/lib/node_modules/tty.js/bin/tty.js --port 3000 --daemonize

What this got me wondering was: are there any utilities that let you capture a nodejs configuration, for example, and then recreate it on a new machine? That is, export the node version number and the versions of the installed packages, then create an installation script that will recreate that setup?

It would be handy if this approach could be extended further. For example, we can also look at the packages – and their version numbers – installed on the Linux box using:

### Show packages
dpkg -l

And we can get a list of Python packages – and their version numbers – using:

### Show Python packages
pip3 list

Surely there must be some simple tools/utilities out there that support this sort of thing? Or even just cheatsheets that show you what commands to run to export the packages and versions into a file in a format that allows you to use that file as part of an installation script in a new machine to help rebuild the original one?
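
For the Python packages at least, a minimal sketch of the export half might look like this – writing a pip-style requirements file that can then be replayed on the new machine with pip3 install -r requirements.txt:

#Dump the installed Python package versions to a requirements-style file
import pkg_resources

with open('requirements.txt', 'w') as f:
    for dist in sorted(pkg_resources.working_set, key=lambda d: d.project_name.lower()):
        f.write('{0}=={1}\n'.format(dist.project_name, dist.version))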

Getting Text Out Of Anything (docs, PDFs, Images) Using Apache Tika

So you’ve got a dozen or so crappy Word documents collected over the years in a variety of formats, from .doc to .docx, and perhaps even a PDF or two, listing the biographies of speakers at this or that event, or the members of this or that group (a set of company directors, for example). And your task is to identify the names of the people identified in those documents and the companies they have been associated with.

Or you’ve been presented with a set of scanned PDF documents, where the text is selectable, or worse, a set of png images of text documents. And you have a stack of them to search through to find a particular name. What do you do?

Apart from cry a little, inside?

If the documents were HTML web pages, you might consider writing a scraper, using the structure of the HTML document to help you identify different meaningful elements within a webpage, and as a result try to recreate the database that contained the data that was used to generate the web pages.

But in a stack of arbitrary documents, or scanned image files, there is no consistent template you can work with to help you write the one scraper that will handle all the documents.

So how about a weaker form of document parsing? Text extraction, for example. Rather than trying to recreate a database, how about we settle for just getting the text out (the sort of thing a search engine might extract from a set of documents so that it can index and search over them, for example).

Something like this Microsoft Office (word) doc for example:

[Image: bio word doc]

Or this scanned PDF (the highlighted text shows the text is copyable as such – so it is actually in the document as text):

[Image: scan_ocr]

Or this image I captured from a fragment of the scanned PDF – no text as such here…:

[Image: ED121193]

What are we to do?

Here’s where Apache Tika can help…

Apache Tika is like magic; give it a document and it’ll (try to) give you back the text it contains. Even if that document is an image. Tika is quite a hefty bit of code, but it’s something you can run quite easily yourself as a service, using the magic of Docker containers.

In this example, I’m running Apache Tika as a web service in the cloud for a few pennies an hour; and yes, you can do this yourself – instructions for how to run Apache Tika in the cloud or on your own computer are described at the end of the post…

In my case, I had Apache Tika running at the address http://quicktika-1.psychemedia.cont.tutum.io:8008/tika (that address is no longer working).

I was working in an IPython notebook running on a Linux machine (the recipe will also work on a Mac; on Windows, you may need to install curl).

There are two steps:

  1. PUT the file you want the text extracted from to the server; I use curl, with a command of the form curl -T path/to/myfile.png http://quicktika-1.psychemedia.cont.tutum.io:8008/rmeta > path/to/myfile_txt.json (a Python equivalent is sketched just after this list)
  2. Look at the result in the returned JSON file (path/to/myfile_txt.json)
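
As an alternative to curl, the same PUT can be made from inside the notebook using the requests library (a sketch, using the server address above and a made-up file path):

#PUT a file to the Tika server's /rmeta endpoint and pull out the extracted text
import requests, json

url = 'http://quicktika-1.psychemedia.cont.tutum.io:8008/rmeta'
with open('path/to/myfile.png', 'rb') as f:
    r = requests.put(url, data=f)

result = json.loads(r.text)  #a JSON list, one record per (embedded) document
print(result[0].get('X-TIKA:content', '')[:500])  #start of the extracted text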

Simple as that…simple as this:

Parse the word doc shown above…

You can see the start of the extracted text in the X-TIKA:content element at the bottom…

[Image: tika-extract1]

Parse the PDF doc shown above…

[Image: tika-extract2]

Parse the actual image of fragment of the PDF doc shown above…

[Image: tika-extract3]

See how Tika has gone into image parsing and optical character recognition mode automatically, and done its best to extract the text from the image file? :-)

Running Apache Tika in the Cloud

As described in Getting Started With Personal App Containers in the Cloud, the first thing you need to do is set up an account with a cloud service provider – I’m using Digital Ocean at the moment: it has simple billing and lets you launch cloud hosted virtual machines of a variety of sizes in a variety of territories, including the UK. Billing is per hour with a monthly cap with different rates for different machine specs. To get started, you need to register an account and make a small ($5 or so) downpayment using Paypal or a credit card. You don’t need to do anything else – we’ll be using that service via another one… [Affiliate Link: sign up to Digital Ocean and get $10 credit]

Having got your cloud provider account set up, create an account with Tutum and then link your Digital Ocean account to it.

Launch a node cluster as described at the start of Getting Started With Personal App Containers in the Cloud. The 2GB/2 core server is plenty.

Now launch a container – the one you want is logicalspark/docker-tikaserver:

[Image: tutum_tika]

To be able to access the service over the web, you need to make its ports public:

[Image: tutum_tika2]

I’m going to give it a custom port number, but you don’t have to, in which case a random one will be assigned:

[Image: tika_tutum3]

Having created and deployed the container, look up its address from the Endpoints tab. The address will be something like tcp://thing-1.yourid.cont.tutum.io:NNNN. You can check the service is there by going to thing-1.yourid.cont.tutum.io:NNNN/tika in your browser.

[Image: tika_titum4]

When you’re done, terminate the container and the node cluster so you don’t get billed any more than is necessary.

[Image: quicktika___Tutum5]

[Image: tika_tutum6]

Running Apache Tika on your own computer

  1. Install boot2docker
  2. Launch boot2docker
  3. In the boot2docker command line, enter: docker pull logicalspark/docker-tikaserver to grab the container image;
  4. To run the container: docker run -d -p 9998:9998 logicalspark/docker-tikaserver
  5. enter boot2docker ip to find the address boot2docker is publishing to (eg 192.168.59.103);
  6. Check the server is there – in your browser, go to eg: http://192.168.59.103:9998/tika

(Don’t be afraid of command lines: you probably already know how to download an app (step 1), definitely know how to launch an app (step 2), know how to type (steps 3 to 5), and know how to go to a web location (step 6; note: you do have to enter this URL in the browser location bar at the top of the screen – entering it into Google won’t help…;-).) All steps 3 to 5 do is get you to write the commands the computer is to follow, rather than automatically selecting them from a nicely named menu option. (What do you think a computer actually does when you select a menu option?!)

PS via @Pudo, see also: textract – python library for “extracting text out of any document”.