Fragment: Structural Testing of Jupyter Notebook Cell Outputs With nbval

To automate a bit more of our educational notebook testing and release process, I’ve started looking again at nbval [repo]. This package takes a set of previously run notebooks and re-runs them, comparing the new cell outputs with the original cell outputs.

This allows for the automated testing of notebooks whenever our distributed code execution environment is updated, so we can check for code that has stopped working for whatever reason, as well as picking up new warnings, such as deprecation notices.

It strikes me that it would also be useful to generate a report for each notebook that captures the notebook execution time. Which makes me think, is there also a package that profiles notebook execution time on a per cell basis?
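
One rough way to get per-cell timings without a dedicated profiler is to read them back from the notebook file itself. The sketch below assumes the notebook was executed with timing recorded in each cell's `metadata.execution` block (as I understand it, nbclient and JupyterLab's record-timing setting can both write these timestamps); the function names are made up for illustration and nothing here is part of nbval.

```python
import json
from datetime import datetime


def _parse_timestamp(stamp):
    # Timestamps are ISO 8601 strings, typically ending in "Z" for UTC.
    return datetime.fromisoformat(stamp.replace("Z", "+00:00"))


def cell_timings(notebook):
    """Return (cell index, seconds) pairs for code cells that carry timing metadata."""
    timings = []
    for index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue
        # Assumed metadata layout: {"execution": {"iopub.execute_input": ..., "shell.execute_reply": ...}}
        execution = cell.get("metadata", {}).get("execution", {})
        start = execution.get("iopub.execute_input")
        end = execution.get("shell.execute_reply")
        if start and end:
            elapsed = _parse_timestamp(end) - _parse_timestamp(start)
            timings.append((index, elapsed.total_seconds()))
    return timings


def notebook_timings(path):
    """Load an .ipynb file (plain JSON) and report its per-cell timings."""
    with open(path, encoding="utf-8") as f:
        return cell_timings(json.load(f))
```

Cells without the timing metadata are simply skipped, so a notebook run without timing recorded gives an empty report rather than an error.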

The basis of comparison I’ve been looking at is the string match on each code cell output area and on each code cell stdout (print) area. In several of the notebooks I’m interested in checking in the first instance, we are raising what are essentially false positive errors in certain cases:

  • printed outputs that have a particular form (for example, a printed output at each iteration of a loop) but where the printed content may differ within a line;
  • database queries that return pandas dataframes with a fixed shape but variable content, or Python dictionaries with a particular key structure but variable values;
  • %%timeit queries that return different times each time the cell is run.

For the timing errors, nbval does support the use of regular expressions to rewrite cell output before comparing it. For example:

[regex1]
regex: CPU times: .*
replace: CPU times: CPUTIME

[regex2]
regex: Wall time: .*
replace: Wall time: WALLTIME

[regex3]
regex: .* per loop \(mean ± std. dev. of .* runs, .* loops each\)
replace: TIMEIT_REPORT
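
To show what rules like these do to an output string before comparison, here is a minimal sketch that reads the sections with the standard library's configparser and applies each substitution in turn. The names `load_rules` and `sanitize` are mine for illustration and are not nbval's API; nbval's own implementation may well differ.

```python
import configparser
import re

SANITIZE_CFG = r"""
[regex1]
regex: CPU times: .*
replace: CPU times: CPUTIME

[regex2]
regex: Wall time: .*
replace: Wall time: WALLTIME

[regex3]
regex: .* per loop \(mean ± std. dev. of .* runs, .* loops each\)
replace: TIMEIT_REPORT
"""


def load_rules(cfg_text):
    """Parse the rule sections into (compiled pattern, replacement) pairs."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    return [
        (re.compile(parser[section]["regex"]), parser[section]["replace"])
        for section in parser.sections()
    ]


def sanitize(text, rules):
    """Apply every substitution, in order, to a cell output string."""
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text
```

Both the reference and the test output would be passed through `sanitize` before the string comparison, so two runs that differ only in their reported times compare as equal.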

In a fork of the nbval repo, I’ve added these as a default sanitisation option, although it strikes me it might also be useful to capture timing reports and then raise an error if the times are significantly different (for example, an order of magnitude difference either way). This would then also start to give us some sort of quality of service test as well.
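
The order-of-magnitude idea could be sketched along the following lines. This is not something the fork does; it assumes the report starts with a mean written as a number followed by a simple unit (ns, µs/us, ms or s), and the helper names and the default factor of 10 are illustrative choices.

```python
import re

# Seconds per unit; the micro sign, Greek mu and plain "us" are all accepted.
UNIT_SECONDS = {
    "ns": 1e-9,
    "us": 1e-6,
    "\u00b5s": 1e-6,
    "\u03bcs": 1e-6,
    "ms": 1e-3,
    "s": 1.0,
}
# Try two-character units before the bare "s".
_UNITS = "|".join(sorted(UNIT_SECONDS, key=len, reverse=True))
TIME_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(" + _UNITS + r")\b")


def mean_seconds(report):
    """Return the first '<number> <unit>' value in a timing report, in seconds."""
    match = TIME_RE.search(report)
    if match is None:
        raise ValueError(f"no recognisable time in {report!r}")
    return float(match.group(1)) * UNIT_SECONDS[match.group(2)]


def within_order_of_magnitude(ref_report, test_report, factor=10.0):
    """True if the test time is within `factor` times the reference, either way."""
    ratio = mean_seconds(test_report) / mean_seconds(ref_report)
    return 1.0 / factor <= ratio <= factor
```

A check like this would need the raw timing line from both notebooks, so it would have to run before (or instead of) the sanitisation step that replaces the line with a placeholder.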

For the dataframes, we can grab the dataframe table output from the text/html cell output data element and parse it back into a dataframe using the pandas pd.read_html() function. We can then compare structural elements of the dataframe, such as its size (number of rows and columns) and the column headings. In my hacky code, this behaviour is triggered using an nbval-test-df cell tag:

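    def compare_dataframes(self, item, key="data", data_key="text/html"):
        """Test outputs for dataframe comparison."""
        # Assumes pandas is imported at module level as pd.
        df_test = False
        test_out = ()
        if "nbval-test-df" in self.tags and key in item and data_key in item[key]:
            # Parse the first HTML table in the output back into a dataframe
            df = pd.read_html(item[key][data_key])[0]
            df_test = True
            # Compare on structure only: (rows, columns) and column headings
            test_out = (df.shape, df.columns.tolist())
        return df_test, data_key, test_out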
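
To illustrate what "same structure, different content" means for an HTML table, here is a dependency-free sketch. Note that it swaps pd.read_html for the standard library's html.parser so that it runs anywhere, so it is not the code used in the fork, and it only handles a simple table with a single header row. `TableSignature` and `table_signature` are hypothetical names.

```python
from html.parser import HTMLParser


class TableSignature(HTMLParser):
    """Collect header labels from <thead> and count the rows in <tbody>."""

    def __init__(self):
        super().__init__()
        self.headers = []
        self.body_rows = 0
        self._in_head = False
        self._in_body = False
        self._in_header_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "thead":
            self._in_head = True
        elif tag == "tbody":
            self._in_body = True
        elif tag == "tr" and self._in_body:
            self.body_rows += 1
        elif tag == "th" and self._in_head:
            self._in_header_cell = True
            self.headers.append("")

    def handle_endtag(self, tag):
        if tag == "thead":
            self._in_head = False
        elif tag == "tbody":
            self._in_body = False
        elif tag == "th":
            self._in_header_cell = False

    def handle_data(self, data):
        if self._in_header_cell:
            self.headers[-1] += data.strip()


def table_signature(html):
    """Return ((rows, columns), headers), mirroring the (shape, columns) tuple above."""
    parser = TableSignature()
    parser.feed(html)
    parser.close()
    return ((parser.body_rows, len(parser.headers)), parser.headers)
```

Two tables with the same headings and row count give equal signatures whatever their cell values, which is the property the nbval-test-df comparison relies on.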

The error report separately reports on shape and column name mismatches:

    def format_output_compare_df(self, key, left, right):
        """Format a dataframe output comparison for printing"""
        cc = self.colors

        self.comparison_traceback.append(
            cc.OKBLUE
            + "dataframe mismatch from parsed '%s'" % key
            + cc.FAIL)

        size_match = left[0]==right[0]
        cols_match = left[1]==right[1]
        
        if size_match:
            self.comparison_traceback.append(cc.OKGREEN 
                + f"df size match: {size_match} [{left[0]}]" + cc.FAIL)
        else:
            self.comparison_traceback.append("df size mismatch")
            self.fallback_error_report(left[0], right[0])
        
        if cols_match:
            self.comparison_traceback.append(cc.OKGREEN
                + f"df cols match: {cols_match} [{left[1]}]"+ cc.FAIL)
        else:
            self.comparison_traceback.append("df cols mismatch")
            self.fallback_error_report(left[1], right[1])
        self.comparison_traceback.append(cc.ENDC)

In passing, I also extended the reporting for mismatched output fields to highlight what output was either missing or added:

        missing_output_fields = ref_keys - test_keys
        unexpected_output_fields = test_keys - ref_keys

        if missing_output_fields:
            self.comparison_traceback.append(
                cc.FAIL
                + "Missing output fields from running code: %s"
                % (missing_output_fields)
                + '\n'+'\n'.join([f"\t{k}: {reference_outs[k]}" for k in missing_output_fields])
                + cc.ENDC
            )
            return False
        elif unexpected_output_fields:
            self.comparison_traceback.append(
                cc.FAIL
                + "Unexpected output fields from running code: %s"
                % (unexpected_output_fields)
                + '\n'+'\n'.join([f"\t{k}: {testing_outs[k]}" for k in unexpected_output_fields])
                + cc.ENDC
            )
            return False

For printed output, we can grab the stdout cell output element and run a simple linecount test to check that the broad shape of the output is similar.

    def compare_print_lines(self, item, key="stdout"):
        """Test line count similarity in print output."""
        linecount_test = False
        test_out = None
        if "nbval-test-linecount" in self.tags and key in item:
            test_out = (len(item[key].split("\n")))
            linecount_test = True
        return linecount_test, test_out
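
As a standalone illustration of the same count (the `line_count` helper and the sample strings are made up for the example), two loop outputs with different values on each line compare as equal, while an output with a missing iteration does not:

```python
def line_count(text):
    """Count lines as in the excerpt above: split on newline characters."""
    return len(text.split("\n"))


ref_stdout = "iteration 0: 0.91\niteration 1: 0.52\niteration 2: 0.13\n"
test_stdout = "iteration 0: 0.88\niteration 1: 0.47\niteration 2: 0.09\n"
short_stdout = "iteration 0: 0.88\niteration 1: 0.47\n"

same_shape = line_count(ref_stdout) == line_count(test_stdout)
missing_line = line_count(ref_stdout) != line_count(short_stdout)
```

Because the split is on "\n", a trailing newline contributes one extra empty element to the count. That is harmless as long as both notebooks are counted the same way, but str.splitlines() would be an alternative if the raw number of printed lines matters.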

The report is currently just a simple “mismatch” error message:

            for ref_out, test_out in zip(ref_values, test_values):
                # Compare the individual values
                if ref_out != test_out:
                    if df_test:
                        self.format_output_compare_df(key, ref_out, test_out)
                    if linecount_test:
                        self.comparison_traceback.append(
                            cc.OKBLUE
                            + "linecount mismatch '%s'" % key
                            + cc.FAIL)
                    if not df_test and not linecount_test:
                        self.format_output_compare(key, ref_out, test_out)
                    return False

I also added support for some convenience tags: nb-variable-output and folium-map both suppress the comparison of outputs of cells in a behaviour that currently models the NBVAL_IGNORE_OUTPUT case, but with added semantics. (My thinking is this should make it easy to improve the test coverage of notebooks as I figure out how to sensibly test different things, rather than just “escaping” problematic false positive cells with the nbval-ignore-output tag.)

PS I just added a couple more tags: nbval-test-listlen allows you to test a list code cell output to check that it is the same length in test and reference notebooks, even as the list content differs; nbval-test-dictkeys allows you to test the (top level) sorted dictionary keys of dictionary output in test and reference notebooks, even as the actual dictionary values differ.
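
A minimal sketch of how such checks might work, assuming the cell's text/plain output is a Python literal that ast.literal_eval can parse. The helper names are hypothetical and this is not necessarily how the tags are implemented in the fork.

```python
import ast


def _literal(text_plain):
    """Parse a text/plain repr into a Python object, or None if it is not a literal."""
    try:
        return ast.literal_eval(text_plain)
    except (ValueError, SyntaxError):
        return None


def list_length(text_plain):
    """Length of a list output, or None if the output is not a list literal."""
    value = _literal(text_plain)
    return len(value) if isinstance(value, list) else None


def sorted_top_level_keys(text_plain):
    """Sorted top-level keys of a dict output, or None if it is not a dict literal."""
    value = _literal(text_plain)
    # Keys are compared as strings so that mixed key types still sort.
    return sorted(str(k) for k in value) if isinstance(value, dict) else None
```

The reference and test outputs then match if these summaries are equal, for example `list_length("[1, 2, 3]") == list_length("[7, 8, 9]")`, whatever the actual items or values are.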

Author: Tony Hirst

I'm a Senior Lecturer at The Open University, with an interest in #opendata policy and practice, as well as general web tinkering...
