Quick Round-Up – Visualising Flows Using Network and Sankey Diagrams in Python and R

Got some data, relating to how students move from one module to another. Rows are student ID, module code, presentation date. The flow is not well-behaved. Students may take multiple modules in one presentation, and modules may be taken in any order (which means there are loops…).

My first take on the data was just to treat it as a graph and chart flows without paying attention to time other than to direct the edges (module A taken directky after module B; if multiple modules are taken by the same student in the same presentation, they both have the same precursor(s) and follow on module(s), if any) – the following (dummy) data shows the sort of thing we can get out using networkx and the pygraphviz output:

The data can also be filtered to show just the modules taken leading up to a particular module, or taken following a particular module.

The R diagram package has a couple of functions that can generate similar sorts of network diagram using its plotweb() function. For example, a simple flow graph:

Or something that looks a bit more like a finite state machine diagram:

(In passing, I also note the R diagram package can be used to draw electrical circuit diagrams/schematics.)

Another way of visualising this blocked data might be to use a simple two dimensional flow diagram, such as a transition plot from the R Gmisc package.

For example, the following could describe total flow from one module to another over a given period of time:

If there are a limited number of presentations (or modules) of interest, we could further break down each category to show the count of students taking a module in a particular presentation (or going directly on to / having directly come from a particular module; in this case, we may want an “other” group to act as a catch all for modules outside a set of modules we are interested in; getting the proportions right might also be a fudge).

Another way we might be able to look at the data “out of time” to show flow between modules is to use a Sankey diagram that allows for the possibility of feedback loops.

The Python sankeyview package (described in Hybrid Sankey diagrams: Visual analysis of multidimensional data for understanding resource use looks like it could be useful here, if I can work out how to do the set-up correctly!

Again, it may be appropriate to introduce a catch-all category to group modules into a generic Other bin where there is only a small flow to/from that module to reduce clutter in the diagram.

The sankeyview package is actually part of a family of packages that includes the d3-sankey-diagram and  the ipysankeywidget.

We can use the ipysankeywidget to render a simple graph data structure of the sort that can be generated by networkx.

One big problems with the view I took of the data is that it doesn’t respect time, or the numbers of students taking a particular presentation of a course. This detail could help tell a story about the evolving curriculum as new modules come on stream, for example, and perhaps change student behaviour about the module they take next from a particular module. So how could we capture it?

If we can linearise the flow by using module_presentation keyed nodes, rather than just module identified nodes, and limit the data to just show students progressing from one presentation to the next, we should be able to use something line a categorical parallel co-ordinates plot, such as an alluvial  diagram from the R alluvial package.

With time indexed modules, we can also explore richer Sankey style diagrams that require a one way flow (no loops).

So for example, here are a few more packages that might be able to help with that, as well as the aforementioned Python sankeyview and ipysankeywidget  packages.

First up, the R networkD3 package includes support for Sankey diagrams. Data can be sourced from igraph and then exported into the JSON format expected by the package:

If you prefer Google charts, the googleVis R package has a gvisSankey function (that I’ve used elsewhere).

The R riverplot package also supports Sankey diagrams – and the gallery includes a demo of how to recreate Minard’s visualisation of Napoleon’s 1812 march.

The R sankey package generates a simple Sankey diagram from a simple data table:

Back in the Python world, the pySankey package can generate a simple Sankey diagram from a pandas dataframe.

matplotlib also support for sankey diagrams as matplotlib.sankey() (see also this tutorial):

What I really need to do now is set up a Binder demo of each of them… but that will have to wait till another day…

If you know of any other R or Python packages / demos that might be useful for visualising flows, please let me know via the comments and I’ll add them to the post.

Via the comments…

H/T @Richard: process maps using the R bupar package:

Process_Maps

PS for creating Sankey diagrams in a browser, see SankeyMATIC.

5 comments

  1. Pingback: Quick Round-Up – Visualising Flows Using Network and Sankey Diagrams in Python and R – Mubashir Qasim
  2. Pingback: Distilled News | Data Analytics & R
  3. Pingback: Network And Sankey Diagrams In Python And R – Curated SQL

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s