Python Package Use Across Course Notebooks

Some time ago, I started exploring various ways of analysing the structure of Jupyter notebooks as part of an informal “notebook quality” unproject (innovationOUtside/nb_quality_profile).

Over the last week or two, for want of anything else to do, I’ve been looking at that old repo again and made a start tinkering with some of the old issues, as well as creating some new ones.

One of the things I messed around with today was a simple plot showing how different packages are used across a set of course notebooks. (One of the quality reports lists the packages imported by each notebook, and can flag if any packages are missing from the Python environment in which the tool runs.)

The course lasts ~30 weeks, with a set of notebooks most weeks and the plot shows the notebooks, in order of study, along the x-axis, with the packages listed as they are first enountered on the y-axis.

This chart is essentially a macroscopic view of package usage throughout the course module (and as long term readers will know, I do like a good mascroscope:-).

In passing, I note that I could also add colour and/or shapes or size to identify whether a package is in the Python standard library or whether it is imported from a project installed from PyPi, or highlight whether the package is not available in the Pyhton environment the tool that generates the chart is run in.

A quick glance at it reveals several things:

  • pandas is used heavily throughout (you might guess this is a data related course!), as we can see from the long horizontal run throughout the course;
  • several other packages are used over consecutive notebooks (short, contiguous horizontal runs of dots), suggesting a package that has a particular relevance for the subject matter studied that week;
  • vertical runs show that several new packages are used for the first time in the same notebook, perhaps to acheive a particular task. If the same vertical run appears in other notebooks, perhaps a similar task is being performed in each of those notebooks;
  • there is a steady increase in the number of packages used over the course. If there is a large number of packages introduced for the first time in a single notebook (i.e. a vertical run of dots), this might suggest a difficult notebook for students to work through in terms of new packages and new functions to get their head round;
  • if a package is used only one notebook (which is a little hard to see — I need to explore gridlines in a way that doesn’t overly clutter the graphic), it might be worth visiting that notebook to see if we can simplify it and remove the singleton use of that package, or check the relevance of the topic it relates to to the course overall;
  • if a notebook imports no modules (or has no code cells), it might be worth checking to see whether it really should be a notebook;
  • probably more things…

I’m now wondering what sort of tabular or list style report listing might be useful to identify the notebooks each module appears in, at least for packages that only appear once or twice, or are widely separated in terms of when they are studied.

I also wonder if there are any tools out that I can use to identify package functions used in each notebook to see how they are distributed over the module. (This all rather makes me think of topic analysis!)

Author: Tony Hirst

I'm a Senior Lecturer at The Open University, with an interest in #opendata policy and practice, as well as general web tinkering...

%d bloggers like this: