Splitting Strings in pandas Dataframe Columns

A quick note on splitting strings in columns of pandas dataframes.

If we have a column that contains strings that we want to split and from which we want to extract particuluar split elements, we can use the .str. accessor to call the split function on the string, and then the .str. accessor again to obtain a particular element in the split list.

df_str = pd.DataFrame( {'col':['http://example.com/path/filename.suffix']*3} )
df_str['path'] = df_str['col'].str.split('/').str[-1]
df_str['stub'] = df_str['path'].str.split('.').str[0]

Wrangling Time Periods (such as Financial Year Quarters) In Pandas

Looking at some NHS 111 and A&E data today, the reported data I was interested in was being reported for different sorts of period, specifically, months and quarters.

The pandas python library has quite a few tools for dealing with periods, so here are a couple of examples of tricks I put to use today.

Firstly, casting months to a month period. Monthly periods (in column df['Periodname']) were reported in the form “Dec-10”, “Jan-11”, etc, which is to say a three letter month followed by a two digit year representation. If we create a datetime on the first of the corresponding month, we can then cast that date to a month period for the corresponding month:

#Generate a pandas period for time series indexing purposes
#Create a date from each month (add the first date of the month) then set to period
#Note: if the datetime is an index, drop the .dt.
df['_period']=pd.to_datetime(('01-'+df['Periodname']), \

If the datetimes form the dataframe index, we could drop the .dt. element and cast the timeseries index directly to the monthly period. The %b term maps onto the locale specific month description.

The second example had financial years and dates in two columns – Year and Quarter. The months specified in the quarter description mix three letter representations and the fully written name of the month. (As with many NHS datasets, the text used to describe dates can be all over the place, even within a single spreadsheet or datafile!)

If we grab the first three letters of the month, we can identify the month as it’s short form. If we create a date stamp for the first day of the quarter (in the UK, corporate financial years run 1-April, to 31-March (it’s the personal tax year that reports from April 5th?) we can then cast to a quarterly reporting period whose year ends in a particular month, eg March (freq='Q-MAR').


Note that the YearQuarter uses the calendar year in which the financial year ends. We can check the start and end date of the quarters explicitly:


One issue when it comes to plotting the financial year quarterly periods is that the chart looks to me as if everything is a year out:


That is, I read the year as the calendar year, but it’s actually the calendar year in which a particular financial year ends.

So I’m wondering – should pandas really report and plot the financial year in which the fiscal year starts? Is there a switch for this? Or do I need to submit a bug report (and if so, what’s the bug?)

PS for a notebook worked example, see here: NHS 111 example.

Printing Out Spreadsheet Cell Values by (Hierarchical) Column Using pandas

Building on from Wrangling Complex Spreadsheet Column Headers, I’ve been hacking the spreadsheet published here a bit more so that I can print out each column value from each sheet in a given spreadsheet for a particular local authority (that is, particular key value in a particular column), to get an output of the form:


(I guess I could add a print suppressor to only print statements where the value is not 0?)

The original notebook can be found here.

The major novelty over the previous post is the colmapbuilder() function that generates a nested dict from a group of hierarchical column names that terminates with either the column code or the cell value for that column and a given row selector (I need to tidy up the function args…)

import pandas as pd

#Menu sheet parse to identify sheets A-I
import re
def getSheetDetails(dfx):
    sd=re.compile(r'Section (\w) - (.*)$')
    for row in dfx.parse('Menu')[[1]].values:
        if str(row[0]).startswith('Section'):
    return sheetDetails

def dfgrabber(dfx,sheet):
    #First pass - identify row for headers
    row = df[df.apply(lambda x: (x == "DCLG code").any(), axis=1)].index.tolist()[0]#.values[0] # will be an array
    #Second pass - generate dataframe
    df=df[df['DCLG code'].notnull()].reset_index(drop=True)
    df.columns=[c.split(' ')[0] for c in df.columns]
    return df,row

import collections
def coldecoder(dfx,sheet,row):
    #Fill down
    xx.fillna(method='ffill', axis=0,inplace=True)
    #Fill across
    xx=xx.fillna(method='ffill', axis=1)
    #How many rows in the header?
    header=[i for i in range(0,keydepth)]

    for c in mxx.columns.get_level_values(0).tolist():
        if c.startswith('Unnamed'):
            mxx = mxx.drop(c, level=0, axis=1)
    #We need to preserve the order of the header columns
    keyx=collections.OrderedDict() #{}
    for r in ddz:
        if not pd.isnull(r[1]):
            #print r[1].split(' ')[0]
            keyx[r[1].split(' ')[0]]=r[0]
    return stitle,keyx,keydepth

#Based on http://stackoverflow.com/a/10756547/454773
def myprint(d,l=None):
  if l is None: l=''
  for k, v in d.iteritems():
    if isinstance(v, dict):
      print "{0} {1} : {2}".format(l,k.encode('utf-8'), v)

def colmapbuilder(dfx,sheet,code=None,retval=True):
    kq=collections.OrderedDict() #{}
    for k in skey:
        for j in skey[k]:
            if j not in kq[k]: kq[k].append(j)
    colmapper=collections.OrderedDict() #{}
    for kkq in kq:
        curr_level = colmapper
        for path in kq[kkq]:
            if path not in curr_level:
                if depth<len(kq[kkq]):
                    curr_level[path] = collections.OrderedDict() #{}
                    curr_level = curr_level[path]
                    if retval and code is not None:
                        curr_level[path] = df[df['Current\nONS']==code][kkq].iloc[0]
                        curr_level[path] = kkq
                curr_level = curr_level[path]
    return sname, colmapper

for lll in ll:

I’m not sure how this helps, other than demonstrating how we might be able to quickly generate a crude textualisation of values in a single row in spreadsheet with a complex set of hierarchical column names?

The code is also likely to be brittle, so the main questions are:

– is the method reusable?
– can the code/approach be generalised or at least made a little bit more robust and capable of handling other spreadsheets with particular properties? (And then – what properties, and how might we be able to detect those properties?)

Wrangling Complex Spreadsheet Column Headers

[This isn’t an R post, per se, but I’m syndicating it via RBloggers because I’m interested – how do you work with hierarchical column indices in R? Do you try to reshape the data to something tidier on the way in? Can you autodetect elements to help with any reshaping?]

Not a little p****d off by the Conservative election pledge to extend the right-to-buy to housing association tenants (my response: so extend the right to private tenants too?) I thought I’d have a dig around to see what data might be available to see what I could learn about the housing situation on the Isle of Wight, using a method that could also be used in other constituencies. That is, what datasets are provided at a national level, broken down to local level. (To start with, I wanted to see what I could lean ex- of visiting the DCLG OpenDataCommuniteis site.

One source of data seems to be the Local authority housing statistics data returns for 2013 to 2014, a multi-sheet spreadsheet reporting at a local authority level on:

– Dwelling Stock
– Local Authority Housing Disposals
– Allocations
– Lettings, Nominations and Mobility Schemes
– Vacants
– Condition of Dwelling Stock
– Stock Management
– Local authority Rents and Rent Arrears
– Affordable Housing Supply


Something I’ve been exploring lately are “external spreadsheet data source” wrappers for the pandas Python library that wrap frequently released spreadsheets with a simple (?!) interface that lets you pull the data from the spreadsheet into a pandas dataframe.

For example, I got started on the LA housing stats sheet as follows – first a look at the sheets, then a routine to grab sheet names out of the Menu sheet:

import pandas as pd
#Menu sheet parse to identify sheets A-I
import re
sd=re.compile(r'Section (\w) - (.*)$')
for row in dfx.parse('Menu')[[1]].values:
    if str(row[0]).startswith('Section'):
#{u'A': u'Dwelling Stock',
# u'B': u'Local Authority Housing Disposals',
# u'C': u'Allocations',
# u'D': u'Lettings, Nominations and Mobility Schemes',
# u'E': u'Vacants',
# u'F': u'Condition of Dwelling Stock',
# u'G': u'Stock Management',
# u'H': u'Local authority Rents and Rent Arrears',
# u'I': u'Affordable Housing Supply'}

All the data sheets have similar columns on the left-hand side, which we can use as a crib to identify the simple, single row, coded header column.

def dfgrabber(dfx,sheet):
    #First pass - identify row for headers
    row = df[df.apply(lambda x: (x == "DCLG code").any(), axis=1)].index.tolist()[0]#.values[0] # will be an array
    #Second pass - generate dataframe
    df=df[df['DCLG code'].notnull()].reset_index(drop=True)
    return df


That gives something like the following:


This is completely useable if we know what the column codes refer to. What is handy is that a single row is available for columns, although metadata that neatly describes the codes is not so tidily presented:


Trying to generate pandas hierarchical index from this data is a bit messy…

One approach I’ve explored is trying to create a lookup table from the coded column names back into the hierarchical column names.

For example, if we can detect the column multi-index rows, we can fill down on the first row (for multicolumn labels, the label is in the leftmost cell), then fill down to fill the index grid spanned cells with the value that spans them.

#row is autodetected and contains the row for the simple header
#Get the header columns - and drop blank rows


#Fill down
xx.fillna(method='ffill', axis=0,inplace=True)
#Fill across
xx=xx.fillna(method='ffill', axis=1)


#append the coded header row


#Now make use of pandas' ability to read in a multi-index CSV
xx.to_csv('multi_index.csv',header=False, index=False)


Note that the pandas column multi-index can span several columns, but not “vertical” levels.

Get rid of the columns that don’t feature in the multi-index:

for c in mxx.columns.get_level_values(0).tolist():
    if c.startswith('Unnamed'):
        mxx = mxx.drop(c, level=0, axis=1)

Now start to work on the lookup…

#Get a dict from the multi-index


We can then use this as a basis for generating a lookup table for the column codes.

for r in dd:
    keyx[dd[r][0].split(' ')[0]]=r


We could also generate more elaborate dicts to provide ways of identifying particular codes.

Note that the key building required a little bit of tidying required arising from footnote numbers that appear in some of the coded column headings:

excel header footnote

This tidying should be also be applied to the code column generation step above…

I’m thinking there really should be an easier way?

PS and then, of course, there are the additional gotchas… like UTF-8 pound signs that’s break ascii encodings…


PPS Handy… informal guidance on Releasing data or statistics in spreadsheets, acts as a counterpoint to GSS’ Releasing statistics in spreadsheets: Good practice guidance.

Visualising Pandas DataFrames With IPythonBlocks – Proof of Concept

A few weeks ago I came across IPythonBlocks, a Python library developed to support the teaching of Python programming. The library provides an HTML grid that can be manipulated using simple programming constructs, presenting the outcome of the operations in a visually meaningful way.

As part of a new third level OU course we’re putting together on databases and data wrangling, I’ve been getting to grips with the python pandas library. This library provides a dataframe based framework for data analysis and data-styled programming that bears a significant resemblance to R’s notion of dataframes and vectorised computing. pandas also provides a range of dataframe based operations that resemble SQL style operations – joining tables, for example, and performing grouping style summary operations.

One of the things we’re quite keen to do as a course team is identify visually appealing ways of illustrating a variety of data manipulating operations; so I wondered whether we might be able to use ipythonblocks as a basis for visualising – and debugging – pandas dataframe operations.

I’ve posted a demo IPython notebook here: ipythonblocks/pandas proof of concept [nbviewer preview]. In it, I’ve started to sketch out some simple functions for visualising pandas dataframes using ipythonblocks blocks.

For example, the following minimal function finds the size and shape of a pandas dataframe and uses it to configure a simple block:

def pBlockGrid(df):
    return BlockGrid(x,y)

We can also colour individual blocks – the following example uses colour to reveal the different datatypes of columns within a dataframe:

ipythinblocks pandas type colour

A more elaborate function attempts to visualise the outcome of merging two data frames:

ipythonblocks pandas demo

The green colour identifies key columns, the red and blue cells data elements from the left and right joined dataframes respectively, and the black cells NA/NaN cells.

One thing I started wondering about that I have to admit quite excited me (?!;-) was whether it would be possible to extend the pandas dataframe itself with methods for producing ipythonblocks visual representations of the state of a dataframe, or the effect of dataframe based operations such as .concat() and .merge() on source dataframes.

If you have any comments on this approach, suggestions for additional or alternative ways of visualising dataframe transformations, or thoughts about how to extend pandas dataframes with ipythonblocks style visualisations of those datastructures and/or the operations that can be applied to them, please let me know via the comments:-)

PS some thoughts on a possible pandas interface:

  • DataFrame().blocks() to show the blocks
  • .cat(blocks=True) and .merge(blocks=True) to return (df, blocks)
  • DataFrame().blocks(blockProperties={}) and eg .merge(blocks=True, blockProperties={})
  • blockProperties: showNA=True|False, color_base=(), color_NA=(), color_left=(), color_right=(), color_gradient=[] (eg for a .cat() on many dataframes), colorView=structure|datatypes|missing (the colorView reveals the datatypes of the columns, the structure origins of cells returned from a .merge() or .cat(), or a view of missing data (reveal NA/NaN etc over a base color), colorTypes={} (to set the colors for different datatypes)