Fragment: Structural Testing of Jupyter Notebook Cell Outputs With nbval

In an attempt to automate a bit more of our educational notebook testing and release process, I’ve started looking again at nbval [repo]. This package takes a set of previously run notebooks, re-runs them, and compares the new cell outputs with the original cell outputs.

This allows for the automated testing of notebooks whenever our distributed code execution environment is updated, checking for code that has stopped working for whatever reason, as well as picking up new warning messages, such as deprecation notices.

It strikes me that it would also be useful to generate a report for each notebook that captures its execution time. Which makes me wonder: is there also a package that profiles notebook execution time on a per-cell basis?
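
I haven’t found such a package yet, but the idea is simple enough to sketch with the standard library alone. The following toy version just exec()s plain Python cell sources in a shared namespace and times each one, so it ignores magics, rich display output and everything else a real kernel handles; the notebook here is a minimal nbformat-style dict:

```python
import time

def time_cells(nb):
    """Time the execution of each code cell in a notebook-style dict.

    A toy stand-in for real per-cell profiling: cell sources are exec()'d
    as plain Python in a shared namespace, so magics and rich outputs
    are not handled.
    """
    ns = {}  # shared namespace so later cells see earlier definitions
    timings = []
    for i, cell in enumerate(nb["cells"]):
        if cell["cell_type"] != "code":
            continue
        t0 = time.perf_counter()
        exec(cell["source"], ns)
        timings.append((i, time.perf_counter() - t0))
    return timings

# A minimal nbformat-style notebook for demonstration
nb = {"cells": [
    {"cell_type": "markdown", "source": "# Demo"},
    {"cell_type": "code", "source": "x = sum(range(100000))"},
    {"cell_type": "code", "source": "y = x * 2"},
]}

for idx, secs in time_cells(nb):
    print(f"cell {idx}: {secs:.4f}s")
```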

The basis of comparison I’ve been looking at is a string match on each code cell output area and on each code cell stdout (print) area. In several of the notebooks I’m interested in checking in the first instance, this raises what are essentially false positive errors in certain cases:

  • printed outputs that have a particular form (for example, a printed output at each iteration of a loop) but where the printed content may differ within a line;
  • database queries that return pandas dataframes with a fixed shape but variable content, or Python dictionaries with a particular key structure but variable values;
  • %%timeit queries that return different times each time the cell is run.

For the timing errors, nbval does support the use of regular expressions to rewrite cell output before comparing it. For example:

[regex1]
regex: CPU times: .*
replace: CPU times: CPUTIME

[regex2]
regex: Wall time: .*
replace: Wall time: WALLTIME

[regex3]
regex: .* per loop \(mean ± std. dev. of .* runs, .* loops each\)
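
Mechanically, sanitising amounts to applying re.sub over both the reference output and the freshly generated output before the string comparison is made; a minimal sketch of the idea (the rule list mirrors the config above):

```python
import re

# (pattern, replacement) pairs mirroring the sanitiser config above
SANITIZERS = [
    (r"CPU times: .*", "CPU times: CPUTIME"),
    (r"Wall time: .*", "Wall time: WALLTIME"),
]

def sanitize(text, rules=SANITIZERS):
    """Rewrite volatile fragments of a cell output so that only the
    stable parts take part in the string comparison."""
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text

ref = sanitize("CPU times: user 1.2 s, sys: 40 ms\nWall time: 1.3 s")
new = sanitize("CPU times: user 950 ms, sys: 55 ms\nWall time: 1.1 s")
print(ref == new)  # True: the timing lines no longer trigger a mismatch
```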

In a fork of the nbval repo, I’ve added these as a default sanitisation option, although it strikes me it might also be useful to capture timing reports and then raise an error if the times are significantly different (for example, an order of magnitude difference either way). This would then also start to give us some sort of quality-of-service test as well.
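
That order-of-magnitude check could look something like the following; the parsing of a Wall time report into seconds is my own rough sketch rather than anything nbval provides:

```python
import re

# Multipliers for the units a %time report typically uses
# (my own rough list; extend as needed)
UNIT_SECONDS = {"ns": 1e-9, "µs": 1e-6, "us": 1e-6, "ms": 1e-3, "s": 1.0, "min": 60.0}

def wall_time_seconds(text):
    """Parse a '%time'-style 'Wall time: 1.3 s' report into seconds."""
    m = re.search(r"Wall time: ([\d.]+)\s*(ns|µs|us|ms|s|min)", text)
    if not m:
        return None
    return float(m.group(1)) * UNIT_SECONDS[m.group(2)]

def times_comparable(ref_text, test_text, factor=10.0):
    """Flag a quality-of-service failure if the new run is more than
    `factor` times faster or slower than the reference run."""
    ref, test = wall_time_seconds(ref_text), wall_time_seconds(test_text)
    if ref is None or test is None:
        return True  # nothing to compare, so don't raise a false alarm
    return ref / factor <= test <= ref * factor

print(times_comparable("Wall time: 1.3 s", "Wall time: 900 ms"))  # True
print(times_comparable("Wall time: 1.3 s", "Wall time: 20 s"))    # False
```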

For the dataframes, we can grab the dataframe table output from the text/html cell output data element and parse it back into a dataframe using the pandas pd.read_html() function. We can then compare structural elements of the dataframe, such as its size (number of rows and columns) and the column headings. In my hacky code, this behaviour is triggered using an nbval-test-df cell tag:

    def compare_dataframes(self, item, key="data", data_key="text/html"):
        """Test outputs for dataframe comparison."""
        df_test = False
        test_out = ()
        if "nbval-test-df" in self.tags and key in item and data_key in item[key]:
            # Parse the rendered HTML table back into a dataframe, then keep
            # only its structural signature: shape and column names
            df = pd.read_html(item[key][data_key])[0]
            df_test = True
            test_out = (df.shape, df.columns.tolist())
        return df_test, data_key, test_out
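
pd.read_html does the parsing work in that hack; the structural test itself then just compares (shape, columns) tuples. To illustrate the principle without pandas, here is a stdlib-only sketch that pulls the same kind of signature out of a rendered table:

```python
from html.parser import HTMLParser

class TableShape(HTMLParser):
    """Extract (n_rows, n_cols) and header names from a simple HTML table,
    mimicking the structural signature used in the dataframe test."""
    def __init__(self):
        super().__init__()
        self.rows = 0
        self.cols = 0
        self.headers = []
        self._cells_in_row = 0
        self._in_th = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._cells_in_row = 0
        elif tag == "td":
            self._cells_in_row += 1
        elif tag == "th":
            self._in_th = True

    def handle_endtag(self, tag):
        if tag == "tr" and self._cells_in_row:
            # Only count rows that contain data cells, not the header row
            self.rows += 1
            self.cols = max(self.cols, self._cells_in_row)
        elif tag == "th":
            self._in_th = False

    def handle_data(self, data):
        if self._in_th and data.strip():
            self.headers.append(data.strip())

def table_signature(html):
    """Return ((rows, cols), headers) for an HTML table string."""
    p = TableShape()
    p.feed(html)
    return (p.rows, p.cols), p.headers

ref = "<table><tr><th>a</th><th>b</th></tr><tr><td>1</td><td>2</td></tr></table>"
test = "<table><tr><th>a</th><th>b</th></tr><tr><td>9</td><td>8</td></tr></table>"
print(table_signature(ref) == table_signature(test))  # True: structure matches
```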

The error report separately reports on shape and column name mismatches:

    def format_output_compare_df(self, key, left, right):
        """Format a dataframe output comparison for printing"""
        cc = self.colors

        self.comparison_traceback.append(
            cc.OKBLUE
            + "dataframe mismatch from parsed '%s'" % key
            + cc.FAIL)

        size_match = left[0] == right[0]
        cols_match = left[1] == right[1]
        if size_match:
            self.comparison_traceback.append(
                cc.OKBLUE
                + f"df size match: {size_match} [{left[0]}]" + cc.FAIL)
        else:
            self.comparison_traceback.append("df size mismatch")
            self.fallback_error_report(left[0], right[0])
        if cols_match:
            self.comparison_traceback.append(
                cc.OKBLUE
                + f"df cols match: {cols_match} [{left[1]}]" + cc.FAIL)
        else:
            self.comparison_traceback.append("df cols mismatch")
            self.fallback_error_report(left[1], right[1])

In passing, I also extended the reporting for mismatched output fields to highlight what output was either missing or added:

        missing_output_fields = ref_keys - test_keys
        unexpected_output_fields = test_keys - ref_keys

        if missing_output_fields:
            self.comparison_traceback.append(
                cc.FAIL
                + "Missing output fields from running code: %s"
                % (missing_output_fields)
                + '\n' + '\n'.join([f"\t{k}: {reference_outs[k]}" for k in missing_output_fields])
                + cc.ENDC)
            return False
        elif unexpected_output_fields:
            self.comparison_traceback.append(
                cc.FAIL
                + "Unexpected output fields from running code: %s"
                % (unexpected_output_fields)
                + '\n' + '\n'.join([f"\t{k}: {testing_outs[k]}" for k in unexpected_output_fields])
                + cc.ENDC)
            return False
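
The bookkeeping here is plain set arithmetic over the mime-type keys of each output’s data dict, for example:

```python
# Output fields are the mime-type keys of a cell output's data dict;
# set differences show what disappeared or newly appeared between runs.
ref_keys = {"text/plain", "text/html"}
test_keys = {"text/plain", "image/png"}

missing = ref_keys - test_keys       # in the reference, absent from the new run
unexpected = test_keys - ref_keys    # produced by the new run only

print(sorted(missing))      # ['text/html']
print(sorted(unexpected))   # ['image/png']
```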

For printed output, we can grab the stdout cell output element and run a simple line count test to check that the broad shape of the output is similar.

    def compare_print_lines(self, item, key="stdout"):
        """Test line count similarity in print output."""
        linecount_test = False
        test_out = None
        if "nbval-test-linecount" in self.tags and key in item:
            # The structural signature for stdout is just its line count
            test_out = len(item[key].split("\n"))
            linecount_test = True
        return linecount_test, test_out
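
For example, a loop that prints one line per iteration passes this check as long as the number of iterations is stable, even when the printed values drift between runs:

```python
def line_count(text):
    """The structural signature used for print output: number of lines."""
    return len(text.split("\n"))

# Hypothetical stdout from two runs of the same training loop
ref_stdout = "step 0: loss=0.91\nstep 1: loss=0.55\nstep 2: loss=0.31\n"
test_stdout = "step 0: loss=0.87\nstep 1: loss=0.60\nstep 2: loss=0.29\n"

print(line_count(ref_stdout) == line_count(test_stdout))  # True: same shape
print(ref_stdout == test_stdout)  # False: an exact string match would fail
```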

The report is currently just a simple “mismatch” error message:

            for ref_out, test_out in zip(ref_values, test_values):
                # Compare the individual values
                if ref_out != test_out:
                    if df_test:
                        self.format_output_compare_df(key, ref_out, test_out)
                    if linecount_test:
                        self.comparison_traceback.append(
                            cc.OKBLUE
                            + "linecount mismatch '%s'" % key
                            + cc.FAIL)
                    if not df_test and not linecount_test:
                        self.format_output_compare(key, ref_out, test_out)
                    return False

I also added support for some convenience tags: nb-variable-output and folium-map both suppress the comparison of cell outputs, in a behaviour that currently models the NBVAL_IGNORE_OUTPUT case, but with added semantics. (My thinking is that this should make it easier to improve the test coverage of notebooks as I figure out how to sensibly test different things, rather than just “escaping” problematic false positive cells with the nbval-ignore-output tag.)

PS I just added a couple more tags: nbval-test-listlen allows you to test a list code cell output to check that it is the same length in test and reference notebooks, even when the list content differs; nbval-test-dictkeys allows you to test the (top level) sorted dictionary keys of a dictionary output in test and reference notebooks, even when the actual dictionary values differ.
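
The comparisons behind those two tags reduce to something like the following sketch (my illustration of the idea, not the code in the fork):

```python
def listlen_signature(value):
    """nbval-test-listlen style check: compare only the list length."""
    return len(value)

def dictkeys_signature(value):
    """nbval-test-dictkeys style check: compare only the sorted
    top-level keys; nested values are ignored."""
    return sorted(value.keys())

# Hypothetical cell outputs from the reference and test runs
ref = {"name": "Sample A", "readings": [1, 2, 3]}
test = {"readings": [7, 8, 9], "name": "Sample B"}

print(dictkeys_signature(ref) == dictkeys_signature(test))  # True
print(listlen_signature(ref["readings"]) == listlen_signature(test["readings"]))  # True
```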

Author: Tony Hirst

I'm a Senior Lecturer at The Open University, with an interest in #opendata policy and practice, as well as general web tinkering...
