Fragment – Metrics for Jupyter Notebook Based Educational Materials

Complementing the Jupyter notebook visualisations described in the previous post, I’ve also started dabbling with notebook metrics. These appear to be going down spectacularly badly with colleagues, but I’m going to carry on poking a stick at them nevertheless. (When I get a chance, I’ll also start applying them across various courses to content in OU-XML documents that drives our online and print course materials… I’d be interested to know if folk in IET already do this sort of thing, since they do love talking about things like reading rates and learning designs, and automation provides such an easy way of generating huge amounts of stats and data entrails to pore over…)

The original motivation was to try to come up with some simple metrics that could be applied over a set of course notebooks. This might include things like readability metrics (are introductory notebooks easier to read in terms of common readability scores than teaching notebooks, for example, under such measures?) and code complexity measures (do these give any insight into how hard a code cell might be to read and make sense of, for example). The measures might also help us get a feel for which notebooks might be overloaded in terms of estimated reading time, and potentially in need of some attention on that front in our next round of continual updates.

I also wanted to started building some tools to help me explore how the course notebooks we have developed to date are structured, and whether we might be able to see any particular patterns or practice developing in our use of them that a simple static analysis might reveal.

I might also have been influenced in starting down this route by a couple of papers I mentioned in a recent edition of the Tracking Jupyter newsletter (#TJ24 — Notebook Practice) that had reviewed the “quality” (and code quality) of notebooks linked from publications and on Github.

Estimating workload as a time measure is notoriously tricky and contentious, for all manner of reasons:

  • what do we mean by workload? “Reading time” is easy enough to measure but how does this differ from “engagement time” if we want students to “engage” or interat with our materials and not just skim over them?
  • different learners study at different rates; learners may also be pragmatic and efificient, using the the demands of continuous assssement material to focus their attention to certain areas of the course material;
  • reading time estimates, based on assumed word-per-minute (wpm) rates (in the OU, our rules of thumb are 35 wpm (~2000 words per hour) for challenging texts, 70 wpm for medium texts (4k wph), 120 wpm for easy texts (7k wph), assume that students read every word, and don’t skim; it’s likely that many students do skim read, though, flipping through pages of a print material to spot headings, images (photos, diagrams, etc) that grab attention, and exercises or self-assessment questions, so an estimate of “skim time” might also be useful. This is harder to do in online environments, particularly where the user interface requires a button click at the bottom of the page to move to the next page (if the button is not in the same place on the screen for each consecutive page, and there is no keyboard shortcut, you have to focus on moving the mouse to generate the next-page button click..), so for online, rather than print material, users, should we give them a single page view thay can skim over (OU VLE materials do have this feature at a unit (week of study) level, via the “print as single page” option);
  • activities and exercises often don’t have a simple mapping from word count to exercise completion time; a briefly stated activity may require 15 mins of student activity, or even hour. Activity text may state “you should expect to spend about X mins on this activity”, and structured activity texts may present expected activity time in a conventional way (identifiable metadata, essentially); when estimating the time of such activities, if we can identify the expected time, we might use this as a guide, possibly on top of the time estimated to actually read the activity text…
  • some text may be harder to read than other text, which we can model with reading time; but how do we know how hard to read a text is? Or do we just go with the most conservative read rate estimate? Several readability metrics exist, so these could be used to analyse different blocks of text and estimate reading rates relative to the calculated readability of each block in turn;
  • for technical materials, how do we caclculate reading rates asscociated with reading computer code, or mathemeatical or chemical equations? In the arts, how long does it take to look at a particular? In languages, how long to read a foreign text?
  • when working with code or equations, do we want the student to read the equation or code as text or engage with it more deeply, for example by executing the code, looking at the output, perhaps making a modification to the code and then executing it again to see how the output differs? For a mathematical equation, do we want students to run some numbers through the equation, or manipulate the equation?
  • code and equations are line based, so should we use line based, rather than word based, calculations to estimate reading — or engagement — time? For example, Xs per line?, with additional Xs per cell chunk / block for environments like Jupyter notebooks where a code chunk in a single cell often produces an single output per cell that we might expect the student to inspect?
  • as with using readability measures to tune reading rate parameters, we might be able to use code complexity measures to generate different code appreciation rates based on code complexity;
  • again, in Jupyter notebooks, we might distinguish between code in a markdown cell, that intended to be read but not executed, compared to code in a code cell which we do expect to be executed. The code itself may also have an execution time associated with it: for example, a single line of code to train a neural network model or run a complex statistical analysis, or even a simple analysis or query over a large dataset, may take several seconds, if not minutes, to run.

And yes, I know, there is probably a wealth of literature out there about this, and some of it has probably even been produced by the OU. If you can point me to things you think I should read, and/or that put me right about things that are obviously nonsense that I’ve claimed above, please post some links or references in the comments…:-)

At this point, we might say it’s pointless trying to capture any sort of metric based on a static analysis of course materials, compared to actually monitoring student study times. Instead, we might rely on our own rules of thumb as educators: if it takes me, as an “expert learner”, X minutes to work through the material, then it will take students 3X minutes (or perhaps, 4X minutes if I work through my own material, which I am familiar with, or 3X when I work through yours, which I am less familiar with); alternatively, based on experience, I may know that it typically takes me three weeks of production time to generate one week of study material, and use that as a basis for estimating the likely study time of a text based on how long I have spent trying to produce it. Different rules of thumb for estimating different things: how long does it take me to produce X hours of study material, how long does it take students to study Y amount of material.

Capturing actual study time is possible; for our Jupyter notebooks, we could instrument them with web analytics to capture webstats about how students engage with notebooks as if they were web pages, and we could also capture Jupyter telemetry for analysis. For online materials, we can capture web stats detailing how long students appeaer to spend on each page of study material before clicking through the the next, and so on.

So what have I been looking at? As well as the crude notebook visualisations, my reports are in the early stages, taking the following form at the current time:

In directory `Part 02 Notebooks` there were 6 notebooks.

– total markdown wordcount 5573.0 words across 160
– total code line count of 390 lines of code across 119 code cells
– 228 code lines, 137 comment lines and 25 blank lines

Estimated total reading time of 288 minutes.

The estimate is open to debate and not really I’ve spent much time thinking about yet (I was more interested in getting the notebook parsing and report generating machinery working): it’s currently a function of wpm reading rate applied to text and a “lines of code per minute” rate for code. But it’s not intended to be accurate, per se, and it’s definitely not intended to be precise; it’s just intended to provide an relative estimate of how long one notebook full of text may take to study compared to one that contains text and code; the idea is to calculate the numbers for all the notebooks across all the weeks of a course, then if we do manage to get a good idea of how long it takes a student to study one particular notebook, or one particular week, we can try to use structural similarities across other notebooks to get hopefully more accurate estimates out.

The estimate is also derived in code, and it’s easy enough to change the parameters (such as reading rates; lines of code engagement rates, etc) in the current algorithm, or the algorithm itself, to generate alternative estimates. (In fact, it might be interesting to generative several alternative forms and then compare them to see how threy feel, and if the ranked estimates and normalised estimates across different notebooks stay roughly the same, or whether they give different relative estimates.)

The report itself is generated from a template fed values from a pandas dataframe cast to a dict (that is, a Python dictionary). The templates take the form:

The bracketed items refer to columns in a feedstock dataframe and templated text blocks are generated a block at a time from individual rows of the dataframe passed to the template as a feedstock dict using a construction of the form:


Robot journalism ftw… meh…

The actual metrics collected are more comprehensive, including:

  • readability measures for markdown text (flesch_kincaid_grade_level, flesch_reading_ease, smog_index, gunning_fog_index, coleman_liau_index, automated_readability_index, lix, gulpease_index, wiener_sachtextformel), as well as simply structural measures (word count, sentence count, average words per sentence (mean, median) and sd, number of paragraphs, etc;
  • simpe code analysis (lines of code, comment lines, blank lines) and some experimental code complexity measure.

I’ve also started experimenting with tagging markdown with automatically extracted acronyms and “subject terms”, and exploring things like identifying the Python pakages imported into each notebook. Previous experiments include grabbing text headings out of notebooks, which may be useful when generating summary reports over sets of notebooks for review purposes.

Something I haven’t yet done is explore ways in which metrics evolve over time, for example as materials are polished and revised during a production or editorial process.

Reaction internally to my shared early doodlings so far have been pretty much universally negative, although varied: folk may happy with their own simple metrics (reading rates applied to word counts), or totally in denial about the utility of any form of static analysis depending on the intended study / material use model. As with many analytics, there are concerns that measures are okay if authors can use them as a tool to support their own work, but may not be appropriate for other people to make judgements from or about them. (This is worth bearing in mind when we talk about using metrics to monitor students, or computational tools to automatically grade them, but we then shy against applying similar techniques to our own outputs…)

You can find the code as it currently exists, created as a stream of consciousness notebook, in this gist. Comments and heckles welcome. As with any dataset, the data I’m producing is generated: a) because I can generate it; b) as just another opinion…

PS once I’ve gone through the notebook a few more times, building up different reports, generating different reading-time and engagement measures, coming up with a commandline interface to make it easier for folk to run against their own notebooks, etc, I think I’ll try to do the same for OU-XML materials… I already have OU-XML to markdown converters, so a running the notebook profiler over that material is easy enough, particularly if I use Jupytext to transform the md to notebooks. See also the PS to notebook visualisation post for related thoughts on this.

PPS The demo notebooks in this repository look like they could be interesting for eg code analysis. And this interactive DAG visualisation tool might also be interesting when it comes to viewing generated graphs.

PPPS This could be an interesting approach for building up a set of tools for checking student code: writing your own static code analysis code checks.

Author: Tony Hirst

I'm a Senior Lecturer at The Open University, with an interest in #opendata policy and practice, as well as general web tinkering...

%d bloggers like this: