Exploring the Hierarchical Structure of DataFrames and CSV Data

Whilst tinkering with some possible ideas for updates to materials in the Data Management and Analysis course, I thought it might be useful as an aside at least to show how simple data tables might represent hierarchically structured data.

Take the following table for example:

A snapshor of UK administrative geography.

The column order could make things jump out at you a bit more, and we can also use the power of pandas multi-indexes to structure the data a little bit more:

pandas dataframe with hierarchical data unsorted multi-index

See it yet? Let’s sort the index terms:

pandas dataframe with hierarchical data sorted multi-index

To anyone who works with data regularly, a quick scan of the original data and you immediately know that the data can be arranged as a hierarchy or tree object. The table has countries, which have regions, which contain local authorities, which have wards: simples.

In your mind’s-eye, you sense that the data can be structured in just such a way and you make a mental note to that effect, as a structural property of the data that might be handy to raw on if you need to sort, group or filter the data, or partition it in other ways. Furthermore, by observing the coding schemes and levels of granularity associated with particular columns you immediately get a sense what other sorts of data you might be able to merge or link in to the data set, and how easy that would be to achieve (I’ll try to post more about linking data in another post).

But for a novice, the “consequences” of the way the table is organised at a structural level (read in a particular way, the data columns do define a hierarchy) is perhaps not obvious from the apparently unsorted and arbitrary order of the rows and the columns that give it its current visual appearance. I can look at the first table and from the columns, and pretty much immediately grok that I can probably treat the data as a tree, and all that follows from that. Pretty much just from a glance.

So how can we help students get a feel for the structures that may be evident in a raw data table that they meet, perhaps in an unsorted fashion. One way might be to visualise the tables and re-present them in ways where the structure becomes more visible. Something like this perhaps:

That’s one of the things I see, just top right behind my eyes (in NLP terms, “visually created”), if I try to “see” something like the original table in visual structural terms:

A snapshor of UK administrative geography.

But I donlt see it, it feels almost more like I breathe it. It’s just there. It jumps right out at you into the back of your head as a thing you know to be true of the data. And like a perceptually multistable figure-ground image, when you start to know the structures are there, you can bring them in and out of attentive focus:

One way we can try to make the structure evident in the table is through sorted index terms, as demonstrated above. Another way is to use a macroscope to view the whole dataset in a particular way.

Many folk will be familiar with the idea of a a micropsope, a tool that lets you look in very close-up detail at a tiny piece of a picture. A macroscope goes the other way: it lets you look at everything at a single glance.

So what sort of macroscope might you use? One particular macroscope I’m particulalry fond of is a treemap. This is one of the few flavours of circumscribed pie chart I like (the two dimensional mosaic plot being another), although it may take a little bit of getting your head round, like you can’t quite breathe it all to understand the full cosequences of what it’s trying to show you.

Tree map of England regions and local authorities

Here’s a snapshot of the code used to create the original interactive view of the treemap (created using a plotly treemap):

Code to create plotly interactive treemap from pandas dataframe

One of the things that surprised me whilst I was looking for a way of grabbing the tree structure out of the table as a Python dictionary and then exporting it as a JSON file was that there wasn’t an obvious way (or so it seemed to me) of exporting it from a suitably indexed pandas dataframe. There are several ways of orient the dictionary export from a dataframe, but I couldn’t see how to export a mutli-index with a tree structure as a tree based data structure. (If you have a code fragment that does that, please share it in the comments.)

One way of creating such an object would be to represent the table as a graph using the netwrokx Python package, with different column combinations used to define the edge-list (I’ll post more on how to spot graph/network structures in a dataset in a later post), but I also noticed the presence of a more specialised tree handling package, treelib, so I thought I’d give that a go…

To access the code, see this gist: https://gist.github.com/psychemedia/7cf7cf56f3178126df4e7a29d8621623

Being a bear of little brain, and docs and examples being in short supply, I started out with a very literal way of constructing the tree, taking one level of the tree at a time:

There’s obviously repetition in there, but not the sort of one-step-after-another repetition that lets you iterate over a particular operation in a “linear” way. Instead, the repetition is nested, looping ever deeper within itself (spiralling, you might say). Which is suggestive of an algorithmic approach that is hugely powerful and can end up being fiendishly complicated to get right if you’re angrily trying to get ti to work at 3am in the morning, even if the (correct) solution is beautifully simple and elegant: recursion.

Anyway, here’s my (probably less than elegant!) attempt at a recursive function, one capable of calling itself, to build the tree from the table:

Recursive function to create a tree from a hierachically structure table

It works by specifying combinations of columns that are used to define a unique identifier for each node in the tree as well as its label and then defining parent-child relationships between them.

Specifying the columns at each level of the tree

(The tree package seemed to choke with using labels as identifiers, and in a general case, the labels could not guaranteed to be unique in any case, so I actually make use of the various code columns for the identifiers and the name columns for the labels.)

Conveniently, the tree package can also export the data as a JSON data structure that we can then convert to a Pyhton dictionary. However, convenient as the data structure might be for certain forms of processing, it’s not a very clean data structure:

tree exported to JSON and converted to a Python dictionary

I know the child nodes are children becuase of their relationship to the parent node — I don’t need to be explicitly told they’re children. So how might we prune this datastructure? Again, coding the steps to process the data in “longhand” as a linear (if nested) sequence of operations gives us a clue:

Literally (linearly) pruning the children…

As before, we see repeating nested structures that we can crib from to help us create a recursive function that will call itself to descend the tree and prune out the explicit children nodes no matter how levels deep the tree goes:

Recursive function to prune explict “children” nodes out of the dictionary data structure

Here’s an example of the output for a small subset of the table:

Example of pruned JSON/Python dictionary

Once we have the data as a Python dictionary, we can export it as JSON, or view it in a navigable way using a tree widget. Again, I’m quite surprised that there aren’t a couple of well proven off-the-shelf tools for doing this. The example below is taken pretty much from a StackOverflow answer:

Example of pruned JSON, visualised via an interactive tree widget.

So, a quick of hierarchical data in a “flat” data table. The structure is there if you know how to see it. And there are various ways of rendering the data so you can start to “see” the structure in a very explicit way. That might not be the same as seeing the consequences, or “affordances” of having the data structured that way — you may still not grok what it means or makes possible for the data to be structured in that way, but it’s a step on the journey to becoming one with that beautiful medium that is “data”.

PS one of the unprojects I’ve wanted to do for a long time is a proper look at the aesthetics of data. I think this is an example of what I mean by that: the ability to see the structure of a dataset, or imagine (I’m not sure that visualise is right because its not necessarily visual: it’s more than that) the sorts of structural relations that hold between data elements and the consequences of those relations not just at the local level and but also across the dataset as a whole. Which is where macroscopes come in again.

Author: Tony Hirst

I'm a Senior Lecturer at The Open University, with an interest in #opendata policy and practice, as well as general web tinkering...