Testing OUr Educational Jupyter Notebooks

For years and years I’ve been meaning to put automation in place for testing the Jupyter notebooks we release to students in the Data Management and Analysis module. I’ve had several false starts at addressing this over those years, but at the end of last week I thought I should try another iteration towards something that might be useful.

The maintenance and release cycle we currently have goes like this:

  • the Docker container is updated and notebook content errata are fixed; notebooks are saved, with code cells run, in a private GitHub repo. For each new presentation of the module, we create a branch from the current main branch notebooks. Errata are tracked via issues and are typically applied to main; they do not typically result in updated notebooks being released to students. Instead, an erratum announcement is posted to the VLE, with the correction each student needs to make manually to their own copy of the notebook;
  • manual running and inspection of notebooks (typically not by me; if it was me, I’d have properly automated this much sooner!;-) in the updated container. The checks identify whether cells run, whether there are warnings, etc; some cells are intended to fail execution, which can complicate a quick “Run all, then visually inspect” test. If the cell output changes, it may be hard to identify exactly what change has occurred, or whether the change is one that is likely to be identified in use by a student as “an error”. Sometimes, outputs differ in detail, but not kind. For example, a fetch-one Mongo query brings back an item, but which item may not be guaranteed; a %memit or %timeit test is unlikely to return exactly the same resource usage, although we might want to compare the magnitudes of the time or memory consumed.
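As a rough illustration of comparing magnitudes rather than exact values, here is a minimal Python sketch. The helper names (parse_timeit, same_magnitude) are hypothetical and are not part of nbval; the sketch assumes the output line starts with a number followed by one of the units that %timeit typically prints, and it does not handle %memit output or timings reported in minutes.

```python
import math
import re

# Multipliers to seconds for the units %timeit typically prints.
# "\u00b5" is the micro sign, which is used when the terminal supports it.
_UNITS = {"ns": 1e-9, "us": 1e-6, "\u00b5s": 1e-6, "ms": 1e-3, "s": 1.0}


def parse_timeit(text):
    """Return the leading time in seconds from a %timeit-style output line."""
    match = re.match(r"\s*([\d.]+)\s*(ns|us|\u00b5s|ms|s)\b", text)
    if match is None:
        raise ValueError(f"not a recognised timing: {text!r}")
    return float(match.group(1)) * _UNITS[match.group(2)]


def same_magnitude(expected, actual, max_orders=1.0):
    """True if two timings differ by no more than max_orders powers of ten."""
    ratio = parse_timeit(actual) / parse_timeit(expected)
    return abs(math.log10(ratio)) <= max_orders
```

With the default tolerance, a cell that took tens of microseconds in the gold standard run would still pass if it now takes a few times longer, but not if it takes milliseconds.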

The testing tool I have primarily looked at using is the nbval extension to the py.test framework. This extension takes a notebook with pre-run “gold standard” cell outputs available, re-runs the notebook, and compares the new outputs with the saved ones. It has limited support for tagging cells to allow outputs to be ignored, erroring cells to be identified and appropriately handled, or cell execution to be skipped altogether.

My own forked nbval package adds several “vague tests” to the test suite (some of my early tag extensions are described in Structural Testing of Jupyter Notebook Cell Outputs With nbval). For example, we can check that a cell output is a folium map, or an image with particular dimensions, or castable to a list of a certain length, or a dict with particular keys.
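To give a flavour of what a “vague test” might look like, here is a minimal Python sketch of two such checks working on the text of a cell output. This is not the code in my fork; the function names are hypothetical, and the sketch assumes the output is a plain Python literal that ast.literal_eval can parse.

```python
import ast


def is_list_of_length(output_text, length):
    """True if the output evaluates to something castable to a list of `length` items."""
    try:
        value = list(ast.literal_eval(output_text))
    except (ValueError, SyntaxError, TypeError):
        return False
    return len(value) == length


def is_dict_with_keys(output_text, keys):
    """True if the output evaluates to a dict containing every key in `keys`."""
    try:
        value = ast.literal_eval(output_text)
    except (ValueError, SyntaxError):
        return False
    return isinstance(value, dict) and set(keys) <= set(value)
```

Checks like these pass when the kind of output is right even if the detail differs, for example when a query returns a different item that has the same keys.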

Other things that are useful to flag are warnings that are being raised as a consequence of the computational environment being updated.
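One simple way of flagging warnings, sketched here in Python under the assumption that the warnings appear in a run notebook as stderr stream outputs whose text contains the word “Warning”, is to scan the saved notebook JSON directly. The function name is hypothetical.

```python
import json


def find_warning_cells(notebook_json):
    """Yield (cell_index, text) for stderr stream outputs that mention a warning."""
    notebook = json.loads(notebook_json)
    for index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue
        for output in cell.get("outputs", []):
            if output.get("output_type") == "stream" and output.get("name") == "stderr":
                text = output.get("text", "")
                # The notebook format allows `text` to be a string or a list of lines.
                if isinstance(text, list):
                    text = "".join(text)
                if "Warning" in text:
                    yield index, text
```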

To make testing easier, I’ve started working on a couple of sketch GitHub Actions in a private cloned repo of our official private module team repo.

In the repo, the notebooks are arranged in weekly directories with a conventional directory name (Part XX Notebooks). The following manually triggered action provides a way of testing just the notebooks in a single week’s directory:

When the action is run, the notebooks are run against the loaded environment pulled in as a Docker container (the container we want to test the materials against). Cell outputs are compared and an HTML report is generated using pytest-html; this report is uploaded as an action artefact and attached to the action run report.

name: nbval-test-week

# Manually triggered from the Actions tab
on:
  workflow_dispatch:
    inputs:
      week:
        type: choice
        description: Week to test
        options:
        - "01"
        - "02"
        - "03"
        - "04"
        - "05"
        - "07"
        - "08"
        - "09"
        - "10"
        - "11"
        - "12"
        - "14"
        - "15"
        - "16"
        - "20"
        - "21"
        - "22"
        - "23"
        - "25"
        - "26"
      timeit:
        description: 'Skip timeit'
        type: boolean
      memit:
        description: 'Skip memit'
        type: boolean

jobs:
  # The job id is a placeholder; any valid id will do
  nbval-test:
    runs-on: ubuntu-latest
    # Run the steps inside the container we want to test the materials against
    container:
      image: ouvocl/vce-tm351-monolith
    steps:
    - uses: actions/checkout@master
    - name: Install nbval (TH edition)
      run: |
        python3 -m pip install --upgrade https://github.com//ouseful-PR/nbval/archive/table-test.zip
        python3 -m pip install pytest-html
    - name: Restart postgres
      run: |
        sudo service postgresql restart
    - name: Start mongo
      run: |
        sudo mongod --fork --logpath /dev/stdout --dbpath ${MONGO_DB_PATH}
    - name: Test TM351 notebooks
      run: |
        memit=$INPUT_MEMIT
        timeit=$INPUT_TIMEIT
        nbval_flags=""
        if [ "$memit" = "true" ]; then
          # Flag name assumed by analogy with --nbval-skip-timeit
          nbval_flags="--nbval-skip-memit"
        fi
        if [ "$timeit" = "true" ]; then
          nbval_flags="$nbval_flags --nbval-skip-timeit"
        fi
        py.test --nbval $nbval_flags --html=report-week-${{ github.event.inputs.week }}.html --self-contained-html ./Part\ ${{ github.event.inputs.week }}*
      env:
        INPUT_MEMIT: ${{ github.event.inputs.memit }}
        INPUT_TIMEIT: ${{ github.event.inputs.timeit }}
    - name: Archive test results
      # Upload the report even if the tests fail
      if: always()
      uses: actions/upload-artifact@v3
      with:
        name: nbval-test-report
        path: ./report-week-${{ github.event.inputs.week }}.html

We can then download and review the HTML report to identify which cells failed in which notebook. (The Action log also displays any errors.)

Another action can be used to test the notebooks used across the whole course.

On the to-do list: declaring a set of possible Docker images that the user can choose from; an action to run all cells against a particular image to generate a set of automatically produced gold standard outputs; an action to compare the outputs from running the notebooks against one specified environment with the outputs generated by running them against a different specified environment. If we trust one particular environment for producing “correct” gold standard outputs, we can use that to generate the notebook outputs against which a second, development environment is tested.

NOTE: updates to notebooks may not be backwards compatible with previous environments; the aim is to drive the content of the notebooks forward so they run against the evolving “current best practice” environment, not so that they are necessarily backwards compatible with earlier environments. Ideally, a set of “correct” run notebooks from one presentation form the basis of the test for the next presentation; but even so, differences may arise that represent a “correct” output in the new environment. Hmmm, so maybe I need an nbval-passes tag that can be used to identify cells whose output can be ignored because the cell is known to produce a correct output in the new environment that doesn’t match the output from the previous environment and that can’t be handled by an outstanding vague test. Then when we create a new branch of the notebooks for a new presentation, those nbval-passes are stripped from the notebooks under the assumption they should, “going forward”, now pass correctly.
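Stripping such a tag when a new presentation branch is created could be quite simple. Here is a minimal Python sketch, assuming the tag lives in the usual cell metadata tags list; nbval-passes is only the proposed tag name from the note above, not an existing nbval tag, and the function name is hypothetical.

```python
import json


def strip_tag(notebook_json, tag="nbval-passes"):
    """Return (updated notebook JSON, number of cells the tag was removed from)."""
    notebook = json.loads(notebook_json)
    removed = 0
    for cell in notebook.get("cells", []):
        tags = cell.get("metadata", {}).get("tags", [])
        if tag in tags:
            # Keep every other tag on the cell untouched.
            cell["metadata"]["tags"] = [t for t in tags if t != tag]
            removed += 1
    return json.dumps(notebook, indent=1), removed
```

The count of removed tags gives a quick record of how many cells are expected to pass unaided in the new presentation.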

As I retroactively start tagging notebooks with a view to improving the meaningful test pass rate, several things come to mind:

  • the primary aim is to check that the notebooks provide appropriate results when run in a particular environment; a cell output does not necessarily need to exactly match a gold master output for it to be appropriate;
  • the pass rate of some cells could be improved by modifying the code; for example, displaying SQL queries or dataframes that have been sorted on a particular column or columns. In some cases this will not detract from the learning point being made in the cell, but in other cases it might;
  • adding new cell tags / tests can weaken or strengthen tests that are already available, although at the cost of introducing more tag types to manage; for example, the dataframe output test currently checks the dataframe size and column names match, BUT the columns do not necessarily need to be in the same order; this test could be strengthened by also checking column name order, or weakened by dropping the column name check altogether. We could also improve the strength by checking column types, for example;
  • some cells it’s perhaps just better to skip or ignore altogether; but in such cases, we should be able to report on which cells have been skipped or had their cell output ignored (so we can check whether a ‘failure’ could arise that might need to be addressed rather than ignored), or disable the “ignore” or “skip” behaviour to run a comprehensive test.
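The dataframe example in the list above can be sketched as a single check with a dial for strength. This is a minimal Python sketch, not the test in my fork: it assumes each dataframe has already been summarised as a dict with a "shape" and a list of "columns", and the function name and the columns parameter are hypothetical.

```python
def dataframe_matches(expected, actual, columns="unordered"):
    """Compare dataframe summaries: dicts with "shape" (rows, cols) and "columns" (names).

    columns="ignore"    weakest: only the size has to match
    columns="unordered" size and column names match, in any order
    columns="ordered"   strongest: column names must also be in the same order
    """
    if tuple(expected["shape"]) != tuple(actual["shape"]):
        return False
    if columns == "ignore":
        return True
    if columns == "unordered":
        return sorted(expected["columns"]) == sorted(actual["columns"])
    if columns == "ordered":
        return list(expected["columns"]) == list(actual["columns"])
    raise ValueError(f"unknown columns mode: {columns!r}")
```

A column type check could be added as a further, stronger mode in the same way.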

For the best test coverage, we would have 0 ignored output cells, 0 skipped cells, tests that are as strong as possible, no errors, no warnings, and no failures (where a failure is a failure of the matching test, either exact matching or one of my vague tests).
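To see how far a notebook is from that ideal, we could count the cells carrying tags that weaken the test. Here is a minimal Python sketch, assuming the standard nbval tag names nbval-skip and nbval-ignore-output; the function name is hypothetical.

```python
import json

# Tags that reduce coverage: the cell is not run, or its output is not compared.
WEAKENING_TAGS = ("nbval-skip", "nbval-ignore-output")


def count_weakening_tags(notebook_json, tags=WEAKENING_TAGS):
    """Return a dict mapping each tag to the number of cells that carry it."""
    notebook = json.loads(notebook_json)
    counts = {tag: 0 for tag in tags}
    for cell in notebook.get("cells", []):
        for tag in cell.get("metadata", {}).get("tags", []):
            if tag in counts:
                counts[tag] += 1
    return counts
```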

PS as well as tests, I am also looking at actions to support the distribution of notebooks; this includes things like checking for warnings, clearing output cells, making sure that cell toolbars are collapsed, making sure that activity answers are collapsed, etc etc. Checking toolbars and activity outputs are collapsed could be tests, or could be automatically run actions. Ideally, we should be able to automate the publication of a set of notebooks by:

  • running tests over the notebooks;
  • if all the tests pass, run the distribution actions;
  • create a distributable zip of ready-to-use notebook files etc.
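The steps above might be sketched in Python as follows. This is only an outline under stated assumptions: the test result is passed in as a flag rather than being computed, clearing outputs stands in for the full set of distribution actions, and the function names are hypothetical.

```python
import json
import zipfile
from pathlib import Path


def clear_outputs(notebook):
    """Distribution action: remove outputs and execution counts from code cells."""
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return notebook


def build_release(notebook_dir, zip_path, tests_passed):
    """Zip cleaned copies of the notebooks, but only if the tests passed."""
    if not tests_passed:
        raise RuntimeError("tests failed; not building a release")
    root = Path(notebook_dir)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(root.rglob("*.ipynb")):
            if ".ipynb_checkpoints" in path.parts:
                continue  # skip Jupyter autosave copies
            notebook = clear_outputs(json.loads(path.read_text(encoding="utf-8")))
            archive.writestr(path.relative_to(root).as_posix(), json.dumps(notebook, indent=1))
    return zip_path
```

The source notebooks are left untouched; only the copies written into the zip file are cleaned.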

Author: Tony Hirst

I'm a Senior Lecturer at The Open University, with an interest in #opendata policy and practice, as well as general web tinkering...
