OUseful.Info, the blog…

Trying to find useful things to do with emerging technologies in open education

Pivot Tables, pandas and IPython Notebooks

leave a comment »

For the last few months, I’ve found a home in IPython Notebooks for dabbling with data. IPython notebooks provide a flexible authoring tool for combining text with executable code fragments, as well as the outputs from executing code, such as charts, data tables or automatically generated text reports. We can also embed additional HTML5 content into a notebook, either inline or as an iframe.

With a little judicious use of templates, we can easily take data from a source we are working with in the notebook, and then render a view of it using included components. This makes it easy to use hybrid approaches to working with data in the notebook context. (Note: the use of cell magics also let’s us operate on a data set using different languages in the same notebook – for example, Python and R.)

As an example of a hybrid approach to exploratory data analysis, how about the following?

The data manipulation library I’ve spent most of my time in to date in the notebooks is pandas. pandas is a really powerful tool for wrangling tabular data shapes, including reshaping them and running group reports on them. Among the operations pandas supports are pivot tables. But writing the code can be fiddly, and sometimes you just want an interactive hands on play with the data. IPython Notebooks do support widgets (though I haven’t played with them yet), so I guess I could try to write a simple UI for running different pivot table views over dataset in an interactive fashion.

But if I’m happy with reading the numbers the pivot table table reports as an end product, and don’t need access to the report as data, I can use a third party interactive pivot table widget such as Nicolas Kutchen’s Pivot Table component to work with the data in an interactive fashion.

Pivot_Table_demo

I’ve popped a quick demo of a clunky hacky way of feeding a pivot table widget from a pandas dataframe here: pivot table in IPython Notebook demo. A pandas dataframe is written as an HTML table and embedded in a templated page that generates the pivot table from the the HTML table. This page is saved as an HTML file and then loaded in as an IFrame. (We could add the HTML to an iframe using srcdoc, rather than saving it as a file and loading it back in, but I thought it might be handy to have access to a copy of the file. Also, I’m not sure if all browsers support srcdoc?)

(Note: we could also use the pivot table widget with a subset of a large dataset to generate dummy reports to find the ones we want, and test pandas code against the same subset of data against that output to check the code is generating the table we want, and then deploy the code against the full dataset.)

The pivot table has the data in memory as a hidden HTML table in the pivot table page, so performance may be limited for large datasets. On my machine, it was quite happy working with a month’s spending/transparency data from my local council, and provided a quick way of generating different summary views over the data. (The df dataframe is simply the result of loading in the spending data CSV file as a pandas dataframe – no other processing required. So the pivotTable() function could easily be modified to accept the location of a CSV file, such as a local address or a URL, load the file in automatically into a dataframe, and then render it as a pivot table.)

Pivot_Table_demo2

There’s also limited functionality for tunneling down into the data by filtering it within the chart (reather than having to generate a filtered view of the data that is then baked in as a table to the chart HTML page, for example):

Pivot_Table_demo3

I’ve been dabbling with various other embedded charts too which I’ll post when I get a chance.

Written by Tony Hirst

October 8, 2014 at 9:30 am

Posted in Tinkering

Tagged with

Data Reporting, not Data Journalism?

with 12 comments

As a technology optimist, I don’t tend to go in so much for writing up long critical pieces about tech (i.e. I don’t do the “proper” academic thing), instead preferring to spend time trying to work out how make use of it, either on its own or in tandem with other technologies. (My criticality is often limited to quickly ruling out things I don’t want to waste my time on because they lack interestingness, or in spending time how to appropriate one thing to do something it perhaps wasn’t intended to do originally.)

I also fall in to the camp of lacking in confidence about things other people think I know about, generally assuming there are probably “proper” ways of doing things for someone properly knowledgeable in a tradition, although I don’t know what they are. (Quite where this situates me on a scale of incompetence, I’m not sure… e.g. Why the Unskilled Are Unaware: Further Explorations of (Absent) Self-Insight Among the Incompetent h/t @Downes).

A couple of days ago, whilst doodling with a data-to-text notebook (the latest incarnation of which can be found here) I idly asked @fantasticlife whether it was an example of data journalism (we’ve been swapping links about bot-generated news reports for some time), placing it also in the context of The Changing Task Composition of the US Labor Market: An Update of Autor, Levy, and Murnane (2003), David H. Autor & Brendan Price,June 21, 2013, and in particular this chart about the decline in work related “routine cognitive tasks”:

economics_mit_edu_files_9758

His response? “No, that’s data reporting”.

So now I’m wondering: how do data reporting and data journalism differ? And to what extent is writing some code along the lines of:

def pc(amount,rounding=''):
    if rounding=='down': rounding=decimal.ROUND_DOWN
    elif rounding=='up': rounding=decimal.ROUND_UP
    else: rounding=decimal.ROUND_HALF_UP

    ramount=float(decimal.Decimal(100 * amount).quantize(decimal.Decimal('.1'), rounding=rounding))
    return '{0:.1f}%'.format(ramount)

def otwMoreLess(now,then):
    delta=now-then
    if delta>0:
        txt=random.choice(['more'])
    elif delta<0:
        txt=random.choice(['less'])
    return txt

def otwPCmoreLess(this,that):
    delta=this-that
    return '{delta} {diff}'.format(delta=pc(abs(delta)),diff=otwMoreLess(this,that))

otw3='''
That means {localrate} of the resident {localarea} population {poptype} are {claim} \
– {regiondiff} than the rest of the {region} ({regionrate}), \
and {ukdiff} than the whole of the UK ({ukrate}).
'''.format(localrate=pc(jsaLocal_rate),
           localarea=jsaLocal['GEOGRAPHY_NAME'].iloc[0],
           poptype=decase(get16_64Population(localcode)['CELL_NAME'].iloc[0].split('(')[1].split(' -')[0]),
           claim=decase(jsaLocal['MEASURES_NAME'].iloc[0]),
           regiondiff=otwPCmoreLess(jsaLocal_rate,jsaRegion_rate),
           region=jsaRegion['GEOGRAPHY_NAME'].iloc[0],
           regionrate=pc(JSA_rate(regionCode)),
           ukdiff=otwPCmoreLess(jsaLocal_rate,jsaUK_rate),
           ukrate=pc(jsaUK_rate))

print(otw3)

to produce text from nomis data of the form:

That means 1.9% of the resident Isle of Wight population aged 16-64 are persons claiming JSA – 0.5% more than the rest of the South East (1.4%), and 0.5% less than the whole of the UK (2.4%).

an output that is data reporting, an example of journalistic practice? For example, is locating the data source a journalistic act? Is writing a script to parse the data into words through the medium of code a journalistic act?

Does it become “more journalistic” if the text generating procedure comments on the magnitude of the changes? For example, using something like:

def _txt6(filler=','):
    
    def _magnitude(term):

        #A heuristic here
        if propDelta<0.05:
            mod=random.choice(['slight'])
        elif propDelta<0.10:
             mod=random.choice(['significant'])
        else:
             mod=random.choice(['considerable','large' ])
        term=' '.join([mod,term])
        return p.a(term)
            
    txt6=txt2[:-1]
    
    propDelta= abs(yeardelta)/mostRecent['Total']
    
    if yeardelta==0:
        txt6+=', exactly the same amount.'
    else:
        if yeardelta <0: direction=_magnitude(random.choice([ 'decrease','fall']))
        else: direction=_magnitude(random.choice(['increase','rise']))
        
        txt6+='{_filler} {0} of {1} since then.'.format(p.a(direction),
                                                         abs(yeardelta),
                                                         _filler= filler )
    return txt6

print(txt1)
print(_txt6())

to produce further qualified text (in terms of commenting on the magnitude of the amount) that can take a variety of slightly different forms?

- The most recent figure (August 2014) for persons claiming JSA for the Isle of Wight area is 1502.
– This compares with a figure of 2720 from a year ago (August 2013), a large fall of 1218 since then.
– This compares with a figure of 2720 from a year ago (August 2013), a considerable fall of 1218 since then

At what point does the interpretation we can bake into the text generator become “journalism”, if at all? Can the algorithm “do journalism”? Is the crafting of the algorithm “journalism”? Or it is just computer assisted reporting?

In the context of routine cognitive tasks – such as the task of reporting the latest JSA figures in this town or that city – is this sort of automation likely? Is it replacing one routine cognitive task with another, or is it replacing it with a “non-routine analytical task”? Might it a be a Good Thing, allowing the ‘reporting’ to be done automatically, at least in draft form, and freeing up journalists to do other tasks? Or a Bad Thing, replace a routine, semi-skilled task by automation? To what extent might the algorithms so embed more intelligence and criticism, for example, automatically flagging up figures that are “interesting” in some way? (Would that be algorithmic journalism? Or would it be more akin to an algorithmic stringer?)

PS Twitter discussion thread around this post from 29/9/14

Written by Tony Hirst

September 29, 2014 at 12:39 pm

Assessing Data Wrangling Skills – Conversations With Data

with one comment

By chance, a couple of days ago I stumbled across a spreadsheet summarising awarded PFI contracts as of 2013 (private finance initiative projects, 2013 summary data).

The spreadsheet has 800 or so rows, and a multitude of columns, some of which are essentially grouped (although multi-level, hierarchical headings are not explicitly used) – columns relating to (estimated) spend in different tax years, for example, or the equity partners involved in a particular project.

As part of the OU course we’re currently developing on “data”, we’re exploring an assessment framework based on students running small data investigations, documented using IPython notebooks, in which we will expect students to demonstrate a range of technical and analytical skills, as well as some understanding of data structures and shapes, data management technology and policy choices, and so on. In support of this, I’ve been trying to put together one or two notebooks a week over the course of a few hours to see what sorts of thing might be possible, tractable and appropriate for inclusion in such a data investigation report within the time constraints allowed.

To this end, the Quick Look at UK PFI Contracts Data notebook demonstrates some of the ways in which we might learn something about the data contained within the PFI summary data spreadsheet. The first thing to note is that it’s not really a data investigation: there is no specific question I set out to ask, and no specific theme I set out to explore. It’s more exploratory (that is, rambling!) than that. It has more of the form what I’ve started referring to as a conversation with data.

In a conversation with data, we develop an understanding of what the data may have to say, along with a set of tools (that is, recipes, or even functions) that allow us to talk to it more easily. These tools and recipes are reusable within the context of the data set or datasets selected for the conversation, and may be transferrable to other data conversations. For example, one recipe might be how to filter and summarise a dataset in a particular way, or generate a particular view or reshaping of it that makes it easy to ask a particular sort of question, or generate a particular sort of a chart (creating a chart is like asking a question where the surface answer is returned in a graphical form).

If we are going to use IPython notebook documented conversations with data as part of an assessment process, we need to tighten up a little more how we might expect to see them structured and what we might expect to see contained within them.

Quickly skimming through the PFI conversation, elements of the following skills are demonstrated:

- downloading a dataset from a remote location;
– loading it into an appropriate data representation;
– cleaning (and type casting) the data;
– spotting possibly erroneous data;
– subsetting/filtering the data by row and column, including the use of partial string matching;
– generating multiple views over the data;
– reshaping the data;
– using grouping as the basis for the generation of summary data;
– sorting the data;
– joining different data views;
– generating graphical charts from derived data using “native” matplotlib charts and a separate charting library (ggplot);
– generating text based interpretations of the data (“data2text” or “data textualisation”).

The notebook also demonstrates some elements of reflection about how the data might be used. For example, it might be used in association with other datasets to help people keep track of the performance of facilities funded under PFI contracts.

Note: the notebook I link to above does not include any database management system elements. As such, it represents elements that might be appropriate for inclusion in a report from the first third of our data course – which covers basic principles of data wrangling using pandas as the dominant toolkit. Conversations held later in the course should also demonstrate how to get data into an out of an appropriately chosen database management system, for example.

Written by Tony Hirst

September 24, 2014 at 11:00 am

Posted in Anything you want, Infoskills

Tagged with

Follow

Get every new post delivered to your Inbox.

Join 812 other followers