For the last few months, I’ve found a home in IPython Notebooks for dabbling with data. IPython notebooks provide a flexible authoring tool for combining text with executable code fragments, as well as the outputs from executing code, such as charts, data tables or automatically generated text reports. We can also embed additional HTML5 content into a notebook, either inline or as an iframe.
With a little judicious use of templates, we can easily take data from a source we are working with in the notebook, and then render a view of it using included components. This makes it easy to use hybrid approaches to working with data in the notebook context. (Note: the use of cell magics also let’s us operate on a data set using different languages in the same notebook – for example, Python and R.)
As an example of a hybrid approach to exploratory data analysis, how about the following?
The data manipulation library I’ve spent most of my time in to date in the notebooks is pandas. pandas is a really powerful tool for wrangling tabular data shapes, including reshaping them and running group reports on them. Among the operations pandas supports are pivot tables. But writing the code can be fiddly, and sometimes you just want an interactive hands on play with the data. IPython Notebooks do support widgets (though I haven’t played with them yet), so I guess I could try to write a simple UI for running different pivot table views over dataset in an interactive fashion.
But if I’m happy with reading the numbers the pivot table table reports as an end product, and don’t need access to the report as data, I can use a third party interactive pivot table widget such as Nicolas Kutchen’s Pivot Table component to work with the data in an interactive fashion.
I’ve popped a quick demo of a clunky hacky way of feeding a pivot table widget from a pandas dataframe here: pivot table in IPython Notebook demo. A pandas dataframe is written as an HTML table and embedded in a templated page that generates the pivot table from the the HTML table. This page is saved as an HTML file and then loaded in as an IFrame. (We could add the HTML to an iframe using srcdoc, rather than saving it as a file and loading it back in, but I thought it might be handy to have access to a copy of the file. Also, I’m not sure if all browsers support srcdoc?)
(Note: we could also use the pivot table widget with a subset of a large dataset to generate dummy reports to find the ones we want, and test pandas code against the same subset of data against that output to check the code is generating the table we want, and then deploy the code against the full dataset.)
The pivot table has the data in memory as a hidden HTML table in the pivot table page, so performance may be limited for large datasets. On my machine, it was quite happy working with a month’s spending/transparency data from my local council, and provided a quick way of generating different summary views over the data. (The df dataframe is simply the result of loading in the spending data CSV file as a pandas dataframe – no other processing required. So the pivotTable() function could easily be modified to accept the location of a CSV file, such as a local address or a URL, load the file in automatically into a dataframe, and then render it as a pivot table.)
There’s also limited functionality for tunneling down into the data by filtering it within the chart (reather than having to generate a filtered view of the data that is then baked in as a table to the chart HTML page, for example):
I’ve been dabbling with various other embedded charts too which I’ll post when I get a chance.