Quick Round-Up – Visualising Flows Using Network and Sankey Diagrams in Python and R

Got some data, relating to how students move from one module to another. Rows are student ID, module code, presentation date. The flow is not well-behaved. Students may take multiple modules in one presentation, and modules may be taken in any order (which means there are loops…).

My first take on the data was just to treat it as a graph and chart flows without paying attention to time other than to direct the edges (module A taken directky after module B; if multiple modules are taken by the same student in the same presentation, they both have the same precursor(s) and follow on module(s), if any) – the following (dummy) data shows the sort of thing we can get out using networkx and the pygraphviz output:

The data can also be filtered to show just the modules taken leading up to a particular module, or taken following a particular module.

The R diagram package has a couple of functions that can generate similar sorts of network diagram using its plotweb() function. For example, a simple flow graph:

Or something that looks a bit more like a finite state machine diagram:

(In passing, I also note the R diagram package can be used to draw electrical circuit diagrams/schematics.)

Another way of visualising this blocked data might be to use a simple two dimensional flow diagram, such as a transition plot from the R Gmisc package.

For example, the following could describe total flow from one module to another over a given period of time:

If there are a limited number of presentations (or modules) of interest, we could further break down each category to show the count of students taking a module in a particular presentation (or going directly on to / having directly come from a particular module; in this case, we may want an “other” group to act as a catch all for modules outside a set of modules we are interested in; getting the proportions right might also be a fudge).

Another way we might be able to look at the data “out of time” to show flow between modules is to use a Sankey diagram that allows for the possibility of feedback loops.

The Python sankeyview package (described in Hybrid Sankey diagrams: Visual analysis of multidimensional data for understanding resource use looks like it could be useful here, if I can work out how to do the set-up correctly!

Again, it may be appropriate to introduce a catch-all category to group modules into a generic Other bin where there is only a small flow to/from that module to reduce clutter in the diagram.

The sankeyview package is actually part of a family of packages that includes the d3-sankey-diagram and  the ipysankeywidget.

We can use the ipysankeywidget to render a simple graph data structure of the sort that can be generated by networkx.

One big problems with the view I took of the data is that it doesn’t respect time, or the numbers of students taking a particular presentation of a course. This detail could help tell a story about the evolving curriculum as new modules come on stream, for example, and perhaps change student behaviour about the module they take next from a particular module. So how could we capture it?

If we can linearise the flow by using module_presentation keyed nodes, rather than just module identified nodes, and limit the data to just show students progressing from one presentation to the next, we should be able to use something line a categorical parallel co-ordinates plot, such as an alluvial  diagram from the R alluvial package.

With time indexed modules, we can also explore richer Sankey style diagrams that require a one way flow (no loops).

So for example, here are a few more packages that might be able to help with that, as well as the aforementioned Python sankeyview and ipysankeywidget  packages.

First up, the R networkD3 package includes support for Sankey diagrams. Data can be sourced from igraph and then exported into the JSON format expected by the package:

If you prefer Google charts, the googleVis R package has a gvisSankey function (that I’ve used elsewhere).

The R riverplot package also supports Sankey diagrams – and the gallery includes a demo of how to recreate Minard’s visualisation of Napoleon’s 1812 march.

The R sankey package generates a simple Sankey diagram from a simple data table:

Back in the Python world, the pySankey package can generate a simple Sankey diagram from a pandas dataframe.

matplotlib also support for sankey diagrams as matplotlib.sankey() (see also this tutorial):

What I really need to do now is set up a Binder demo of each of them… but that will have to wait till another day…

If you know of any other R or Python packages / demos that might be useful for visualising flows, please let me know via the comments and I’ll add them to the post.

Via the comments…

H/T @Richard: process maps using the R bupar package:

Process_Maps

PS for creating Sankey diagrams in a browser, see SankeyMATIC.

Charting Terrorism Related Arrest Flows Through The Criminal Justice System

One of my daily read feeds is a list of the day’s government statistical releases. Today, I spotted a release on the Operation of police powers under the Terrorism Act 2000, quarterly update to September 2015, which included an annes on Arrests and outcomes, year ending September 2015:

terror_cjs

I tweeted a link to doc, and Michael/@fantasticlife replied with a comment it might look interesting as a Sankey diagram…

Hmm…

So here’s a quick sketch generated using SankeyMATIC:

sankeymatic_1000x800

I took the liberty of adding an extra “InSystem’ step into the chart to account for the feedback look of the bailed arrests.

Here’s the data I used:

Arrested [192] InSystem
Arrested [115] Released without charge
Arrested [8] Alternative action
InSystem [124] Charged
InSystem [68] Released on bail
Charged [111] Terrorism Related
Charged [13] Non-terrorism related
Terrorism Related [36] Prosecuted.t
Terrorism Related [1] Not proceeded against
Terrorism Related [74] Awaiting prosecution
Non-terrorism related [6] Prosecuted.n
Non-terrorism related [2] Not proceeded against
Non-terrorism related [5] Awaiting prosecution
Prosecuted.t [33] Convicted (terrorism related)
Prosecuted.t [2] Convicted (non-terrorism related)
Prosecuted.t [1] Acquitted
Prosecuted.n [5]  Convicted (non-terrorism related)
Prosecuted.n [1] Acquitted

Looking at the diagram, I find the placement of the labels quite confusing and I’m not really sure what relate to what. (The numbers, for example…) It would also be neater if we could capture flows still “in-the system”, for example by stopping the Released on bail element at the same depth as the Charged elements, and also keeping the Awaiting prosecution element short of the right hand side. (Perhaps bail and awaiting elements could be added into a “limbo” field?)

So – nice idea, but as soon as you look at it you see that a quick look at trivial sketch immediately identifies all sorts of other issues that you need to take into account to make the diagram informatively glanceable…

Hmmm…

Thinks.. SankeyMATIC is a d3.js app… it would be nice if I could drag the elements in the generator to may the diagram a bit clearer… maybe I can?
sankeymatic_1000x800 (1)

Only that’s wrong too… because the InSystem label applies to the boundary to the left, and the Bail label to the right… So we need to tweak it a bit more…

sankeymatic_1200x800 (1)

In fact, you may notice that the labels seem to be applied left and right justified according to different rules? Hmmm… Not so simple again…

How about if I take out the insterstitial value I added?

sankeymatic_1200x800 (2)

That’s perhaps a bit clearer? And all goes some way to showing how constructing a graphic is generally an iterative process, scaffolding the evolution of the diagram as you go, as you learn to see it/read it from different perspectives and tweak it to try to clarify particular communicative messages? (Which in this case, for me, was to try to tease out how far through the process various flows had got, as well as clearly identify final outcomes…)

Other things we could do to try to improve the graphic are experiment a bit more with the colour schemes. But that’s left as an exercise for the reader…;-)

Generating Sankey Diagrams from rCharts

A couple of weeks or so ago, I picked up an inlink from an OCLC blog post about Visualizing Network Flows: Library Inter-lending. The post made use of Sankey diagrams to represent borrowing flows, and by implication suggested that the creation of such diagrams is not as easy as it could be…

Around the same time, @tiemlyportfolio posted a recipe for showing how to wrap custom charts so that they could be called from the amazing Javascript graphics library wrapping rCharts (more about this in a forthcoming post somewhere…). rCharts Extra – d3 Horizon Conversion provides a walkthrough demonstrating how to wrap a d3.js implemented horizon chart so that it can be generated from R with what amounts to little more than a single line of code. So I idly tweeted a thought wondering how easy it would be to run through the walkthrough and try wrapping a Sankey diagram in the same way (I didn’t have time to try it myself at that moment in time.)

Within a few days, @timelyportfolio had come up with the goods – Exploring Networks with Sankey and then a further follow on post: All My Roads Lead Back to Finance–PIMCO Sankey. The code itself can be found at https://github.com/timelyportfolio/rCharts_d3_sankey

Somehow, playtime has escaped me for the last couple of weeks, but I finally got round to trying the recipe out. The data I opted for is energy consumption data for the UK, published by DECC, detailing energy use in 2010.

As ever, we can’t just dive straight into the visualiastion – we need to do some work first to get int into shape… The data came as a spreadsheet with the following table layout:

Excel - copy data

The Sankey diagram generator requires data in three columns – source, target and value – describing what to connect to what and with what thickness line. Looking at the data, I thought it might be interesting to try to represent as flows the amount of each type of energy used by each sector relative to end use, or something along those lines (I just need something authentic to see if I can get @timelyportfolio’s recipe to work;-) So it looks as if some shaping is in order…

To tidy and reshape the data, I opted to use OpenRefine, copying and pasting the data into a new OpenRefine project:

Refine - paste DECC energy data

The data is tab separated and we can ignore empty lines:

Refine - paste import settings (DECC)

Here’s the data as loaded. You can see several problems with it: numbers that have commas in them; empty cells marked as blank or with a -; empty/unlabelled cells.

DECC data as imported

Let’s make a start by filling in the blank cells in the Sector column – Fill down:

DECC data fill down

We don’t need the overall totals because we want to look at piecewise relations (and if we do need the totals, we can recalculate them anyway):

DECC filter out overall total

To tidy up the numbers so they actually are numbers, we’re going to need to do some transformations:

DECC need to clean numeric cols

There are several things to do: remove commas, remove – signs, and cast things as numbers:

DECC clean numeric column

value.replace(',','') says replace commas with an empty string (ie nothing – delete the comma).

We can then pass the result of this transformation into a following step – replace the – signs: value.replace(',','').replace('-','')

Then turn the result into a number: value.replace(',','').replace('-','').toNumber()

If there’s an error, not that we select to set the cell to a blank.

having run this transformation on one column, we can select Transform on another column and just reuse the transformation (remembering to set the cell to blank if there is an error):

DECC number cleaner reuse

To simplify the dataset further, let’s get rid of he other totals data:

DECC remove data column

Now we need to reshape the data – ideally, rather than having columns for each energy type, we want to relate the energy type to each sector/end use pair. We’re going to have to transpose the data…

DECC start to reshape

So let’s do just that – wrap columns down into new rows:

DECC data transposition

We’re going to need to fill down again…

DECC need to fill down again

So now we have our dataset, which can be trivially exported as a CSV file:

DECC export as CSV

Data cleaning and shaping phase over, we’re now ready to generate the Sankey diagram…

As ever, I’m using RStudio as my R environment. Load in the data:

R import DECC data

To start, let’s do a little housekeeping:

#Here’s the baseline column naming for the dataset
colnames(DECC.overall.energy)=c(‘Sector’,’Enduse’,’EnergyType’,’value’)

#Inspired by @timelyportfolio - All My Roads Lead Back to Finance–PIMCO Sankey
#http://timelyportfolio.blogspot.co.uk/2013/07/all-my-roads-lead-back-to-financepimco.html

#Now let's create a Sankey diagram - we need to install RCharts
##http://ramnathv.github.io/rCharts/
require(rCharts)

#Download and unzip @timelyportfolio's Sankey/rCharts package
#Take note of where you put it!
#https://github.com/timelyportfolio/rCharts_d3_sankey

sankeyPlot <- rCharts$new()

#We need to tell R where the Sankey library is.
#I put it as a subdirectory to my current working directory (.)
sankeyPlot$setLib('./rCharts_d3_sankey-gh-pages/')

#We also need to point to an HTML template page
sankeyPlot$setTemplate(script = "./rCharts_d3_sankey-gh-pages/layouts/chart.html")

having got everything set up, we can cast the data into the form the Sankey template expects – with source, target and value columns identified:

#The plotting routines require column names to be specified as:
##source, target, value
#to show what connects to what and by what thickness line

#If we want to plot from enduse to energytype we need this relabelling
workingdata=DECC.overall.energy
colnames(workingdata)=c('Sector','source','target','value')

Following @timelyportfolio, we configure the chart and then open it to a browser window:

sankeyPlot$set(
  data = workingdata,
  nodeWidth = 15,
  nodePadding = 10,
  layout = 32,
  width = 750,
  height = 500,
  labelFormat = ".1%"
)

sankeyPlot

Here’s the result:

Basic sankey DECC

Let’s make plotting a little easier by wrapping that routine into a function:

#To make things easier, let's abstract a little more...
sankeyPlot=function(df){
  sankeyPlot <- rCharts$new()
  
  #--------
  #See note in PPS to this post about a simplification of this part....
  #We need to tell R where the Sankey library is.
  #I put it as a subdirectory to my current working directory (.)
  sankeyPlot$setLib('./rCharts_d3_sankey-gh-pages/')
  
  #We also need to point to an HTML template page
  sankeyPlot$setTemplate(script = "./rCharts_d3_sankey-gh-pages/layouts/chart.html")
  #---------
 
  sankeyPlot$set(
    data = df,
    nodeWidth = 15,
    nodePadding = 10,
    layout = 32,
    width = 750,
    height = 500,
    labelFormat = ".1%"
  )
  
  sankeyPlot
}

Now let’s try plotting something a little more adventurous:

#If we want to add in a further layer, showing how each Sector contributes
#to the End-use energy usage, we need to additionally treat the Sector as
#a source and the sum of that sector's energy use by End Use
#Recover the colnames so we can see what's going on
sectorEnergy=aggregate(value ~ Sector + Enduse, DECC.overall.energy, sum)
colnames(sectorEnergy)=c('source','target','value')

#We can now generate a single data file combing all source and target data
energyfull=subset(workingdata,select=c('source','target','value'))
energyfull=rbind(energyfull,sectorEnergy)

sankeyPlot(energyfull)

And the result?

Full Sankey DECC

Notice that the bindings are a little bit fractured – for example, the Heat block has several contributions from the Gas block. This also suggests that a Sankey diagram, at least as configured above, may not be the most appropriate way of representing the data in this case. Sankey diagrams are intended to represent flows, which means that there is a notion of some quantity flowing between elements, and further that that quantity is conserved as it passes through each element (sum of inputs equals sum of outputs).

A more natural story might be to show Energy type flowing to end use and then out to Sector, at least if we want to see how energy is tending to be used for what purpose, and then how end use is split by Sector. However, such a diagram would not tell us, for example, that Sector X was dominated in its use of energy source A for end use P, compared to Sector Y mainly using energy source B for the same end use P.

One approach we might take to tidying up the chart to make it more readable (for some definition of readable!), though at the risk of making it even more misleading, is to do a little bit more aggregation of the data, and then bind appropriate blocks together. Here are a few more examples of simple aggregations:

We can also explore other relationships and trivially generate corresponding Sankey diagram views over them:

#How much of each energy type does each sector use
enduseBySector=aggregate(value ~ Sector + Enduse, DECC.overall.energy, sum)
colnames(enduseBySector)=c('source','target','value')
sankeyPlot(enduseBySector)

colnames(enduseBySector)=c('target','source','value')
sankeyPlot(enduseBySector)

#How much of each energy type is associated with each enduse
energyByEnduse=aggregate(value ~ EnergyType + Enduse, DECC.overall.energy, sum)
colnames(energyByEnduse)=c('source','target','value')

sankeyPlot(energyByEnduse)

So there we have it – quick and easy Sankey diagrams from R using rCharts and magic recipe from @timelyportfolio:-)

PS the following routine makes it easier to grab data into the appropriately named format

#This routine makes it easier to get the data for plotting as a Sankey diagram
#Select the source, target and value column names explicitly to generate a dataframe containing
#just those columns, appropriately named.
sankeyData=function(df,colsource='source',coltarget='target',colvalue='value'){
  sankey.df=subset(df,select=c(colsource,coltarget,colvalue))
  colnames(sankey.df)=c('source','target','value')
  sankey.df
}

#For example:
data.sdf=sankeyData(DECC.overall.energy,'Sector','EnergyType','value')
data.sdf

The code automatically selects the appropriate columns and renames them as required.

PPS it seems that a recent update(?) to the rCharts library by @ramnath_vaidya now makes things even easier and removes the need to download and locally host @timelyportfolio’s code:

#We can remove the local dependency and replace the following...
#sankeyPlot$setLib('./rCharts_d3_sankey-gh-pages/')
#sankeyPlot$setTemplate(script = "./rCharts_d3_sankey-gh-pages/layouts/chart.html")
##with this simplification
sankeyPlot$setLib('http://timelyportfolio.github.io/rCharts_d3_sankey')