Testing OUr Educational Jupyter Notebooks

For years and years I’ve been meaning to put automation in place for testing the Jupyter notebooks we release to students in the Data Management and Analysis module. I’ve had several false starts at addressing this over those years, but at the end of last week I thought I should have another go at producing something that might be useful.

The maintenance and release cycle we currently have goes like this:

  • Docker container updated; notebook content errata fixed; notebooks are saved with code cells run in a private Github repo; for each new presentation of the module, we create a branch from the current main branch notebooks; errata are tracked via issues and are typically applied to main; they typically do not result in updated notebooks being released to students; instead, an erratum announcement is posted to the VLE, with the correction each student needs to make manually to their own copy of the notebook;
  • manual running and inspection of notebooks (typically not by me; if it was me, I’d have properly automated this much sooner!;-) in updated container. The checks identify whether cells run, whether there are warnings, etc; some cells are intended to fail execution, which can complicate a quick “Run all, then visually inspect” test. If the cell output changes, it may be hard to identify exactly what change has occurred / whether the change is one that is likely to be identified in use by a student as “an error”. Sometimes, outputs differ in detail, but not kind. For example, a fetch-one Mongo query brings back an item, but which item may not be guaranteed; a %memit or %timeit test is unlikely to return exactly the same resource usage, although we might want to compare the magnitudes of the time or memory consumed.
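As a rough sketch of the “compare the magnitudes” idea for %timeit style outputs, something like the following might do. It is not part of nbval or of the forked package; the helper names (to_seconds, same_magnitude) and the default tolerance of one power of ten are assumptions made up for illustration, and it only handles simple timing strings such as 49.2 s or 7.09 ms.

```python
import math
import re

# Multipliers for the units that appear in simple timing strings (e.g. "7.09 ms").
_UNITS = {"ns": 1e-9, "µs": 1e-6, "us": 1e-6, "ms": 1e-3, "s": 1.0, "min": 60.0}


def to_seconds(text):
    """Convert a simple timing string such as '49.2 s' or '7.09 ms' to seconds."""
    match = re.fullmatch(r"\s*([\d.]+)\s*(ns|µs|us|ms|s|min)\s*", text)
    if match is None:
        raise ValueError(f"Unrecognised timing: {text!r}")
    value, unit = match.groups()
    return float(value) * _UNITS[unit]


def same_magnitude(expected, actual, max_orders=1.0):
    """True if the two timings differ by no more than max_orders powers of ten."""
    ratio = to_seconds(actual) / to_seconds(expected)
    return abs(math.log10(ratio)) <= max_orders
```

With this, a gold standard timing of 9.75 s and a new timing of 12.1 s count as a match, whereas 7.09 ms against 49.2 s does not.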

The testing tool I have primarily looked at using is the nbval extension to the py.test framework. This extension takes a notebook with pre-run “gold standard” cell outputs available, re-runs the notebook, and compares the newly generated outputs against the saved ones. It has limited support for tagging cells to allow outputs to be ignored, erroring cells to be identified and appropriately handled, or cell execution to be skipped altogether.
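By way of illustration, cell tags live in each cell's metadata, so they can be added programmatically as well as through the notebook UI. The sketch below works on a notebook held as a plain dict; the add_tag helper and the two example cells are made up for the example, while nbval-ignore-output and nbval-raises-exception are tag names that nbval recognises.

```python
import json


def add_tag(cell, tag):
    """Add a tag to a notebook cell's metadata if it is not already present."""
    tags = cell.setdefault("metadata", {}).setdefault("tags", [])
    if tag not in tags:
        tags.append(tag)
    return cell


# A minimal nbformat 4 style notebook held as a plain dict (illustrative content).
notebook = {
    "nbformat": 4, "nbformat_minor": 5, "metadata": {},
    "cells": [
        {"cell_type": "code", "source": "%timeit sum(range(1000))",
         "metadata": {}, "outputs": [], "execution_count": None},
        {"cell_type": "code", "source": "1 / 0",
         "metadata": {}, "outputs": [], "execution_count": None},
    ],
}

add_tag(notebook["cells"][0], "nbval-ignore-output")     # run the cell, ignore its output
add_tag(notebook["cells"][1], "nbval-raises-exception")  # an error is the expected result

print(json.dumps([cell["metadata"] for cell in notebook["cells"]]))
```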

My own forked nbval package adds several “vague tests” to the test suite (some of my early tag extensions are described in Structural Testing of Jupyter Notebook Cell Outputs With nbval). For example, we can check that a cell output is a folium map, or an image with particular dimensions, or castable to a list of a certain length, or a dict with particular keys.
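To give a flavour of what a “vague test” might look like, here is a minimal sketch of two structural checks on a cell's text output. This is not the code in the forked package: the function names are hypothetical, and parsing the output with ast.literal_eval is just one assumed way of deciding whether something is “castable” to a list or dict.

```python
import ast


def parse_output(text):
    """Parse a cell's text output as a Python literal, or return None on failure."""
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError, TypeError):
        return None


def is_list_of_length(text, length):
    """True if the output parses to a list (or tuple) with exactly `length` items."""
    value = parse_output(text)
    return isinstance(value, (list, tuple)) and len(value) == length


def has_dict_keys(text, keys):
    """True if the output parses to a dict containing at least the given keys."""
    value = parse_output(text)
    return isinstance(value, dict) and set(keys) <= set(value)
```

A test of this kind passes whatever the actual values are, which is the point: a fetch-one query that returns a different item with the same shape still counts as a pass.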

Other things that are useful to flag are warnings that are being raised as a consequence of the computational environment being updated.
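One simple way of flagging warnings, sketched below, is to scan the saved outputs of a run notebook for stderr stream output that mentions a warning. The find_warnings name and the crude "Warning" substring match are assumptions for illustration; the notebook is assumed to be the parsed JSON of an .ipynb file.

```python
def find_warnings(notebook):
    """Return (cell_index, text) pairs for stderr output that mentions a warning."""
    hits = []
    for index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue
        for output in cell.get("outputs", []):
            if output.get("output_type") == "stream" and output.get("name") == "stderr":
                text = output.get("text", "")
                if isinstance(text, list):  # nbformat allows a list of lines
                    text = "".join(text)
                if "Warning" in text:
                    hits.append((index, text.strip()))
    return hits
```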

To make testing easier, I’ve started working on a couple of sketch Github actions in a private cloned repo of our official private module team repo.

In the repo, the notebooks are arranged in weekly directories with a conventional directory name (Part XX Notebooks). The following manually triggered action provides a way of testing just the notebooks in a single week’s directory:

When the action is run, the notebooks are run against the loaded environment pulled in as a Docker container (the container we want to test the materials against). Cell outputs are compared and an HTML report is generated using pytest-html; this report is uploaded as an action artefact and attached to the action run report.

name: nbval-test-week

on:
  workflow_dispatch:
    inputs:
      week:
        type: choice
        description: Week to test
        options:
        - "01"
        - "02"
        - "03"
        - "04"
        - "05"
        - "07"
        - "08"
        - "09"
        - "10"
        - "11"
        - "12"
        - "14"
        - "15"
        - "16"
        - "20"
        - "21"
        - "22"
        - "23"
        - "25"
        - "26"
      timeit:
        description: 'Skip timeit'
        type: boolean
      memit:
        description: 'Skip memit'
        type: boolean

jobs:
  # NOTE: the job id was lost from the original listing; "nbval-test" is a placeholder.
  nbval-test:
    runs-on: ubuntu-latest
    container:
      image: ouvocl/vce-tm351-monolith
    steps:
    - uses: actions/checkout@master
    - name: Install nbval (TH edition)
      run: |
        python3 -m pip install --upgrade https://github.com/ouseful-PR/nbval/archive/table-test.zip
        python3 -m pip install pytest-html
    - name: Restart postgres
      run: |
        sudo service postgresql restart
    - name: Start mongo
      run: |
        sudo mongod --fork --logpath /dev/stdout --dbpath ${MONGO_DB_PATH}
    - name: Test TM351 notebooks
      run: |
        memit="$INPUT_MEMIT"
        timeit="$INPUT_TIMEIT"
        nbval_flags=""
        if [ "$memit" = "true" ]; then
          # The body of this branch was lost from the original listing; presumably it
          # adds the forked nbval's skip-memit flag, mirroring the timeit branch below.
          :
        fi
        if [ "$timeit" = "true" ]; then
          nbval_flags="$nbval_flags --nbval-skip-timeit"
        fi
        py.test --nbval $nbval_flags --html=report-week-${{ github.event.inputs.week }}.html --self-contained-html ./Part\ ${{ github.event.inputs.week }}*
      env:
        INPUT_MEMIT: ${{ github.event.inputs.memit }}
        INPUT_TIMEIT: ${{ github.event.inputs.timeit }}
    - name: Archive test results
      if: always()
      uses: actions/upload-artifact@v3
      with:
        name: nbval-test-report
        path: ./report-week-${{ github.event.inputs.week }}.html

We can then download and review the HTML report to identify which cells failed in which notebook. (The Action log also displays any errors.)

Another action can be used to test the notebooks used across the whole course.

On the to do list is: declaring a set of possible Docker images that the user can choose from; an action to run all cells against a particular image to generate a set of automatically produced gold standard outputs; an action to compare outputs from running the notebooks against one specified environment compared to the outputs generated by running them against a different specified environment. If we trust one particular environment for producing “correct” gold standard outputs, we can use that to generate the notebook outputs against which a second, development environment is tested.
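The comparison between two environments could be sketched along the following lines, assuming we already have the same notebook run in each environment and loaded as parsed JSON. The function names are hypothetical, and only stream and text/plain outputs are compared; images and rich outputs would need their own handling.

```python
def text_outputs(notebook):
    """Collect the plain-text output of each code cell, keyed by cell position."""
    collected = {}
    for index, cell in enumerate(notebook["cells"]):
        if cell.get("cell_type") != "code":
            continue
        parts = []
        for output in cell.get("outputs", []):
            if output.get("output_type") == "stream":
                text = output.get("text", "")
            else:
                text = output.get("data", {}).get("text/plain", "")
            parts.append("".join(text) if isinstance(text, list) else text)
        collected[index] = "".join(parts)
    return collected


def differing_cells(gold, candidate):
    """Return the positions of code cells whose text output differs between two runs."""
    a, b = text_outputs(gold), text_outputs(candidate)
    return sorted(i for i in a.keys() | b.keys() if a.get(i) != b.get(i))
```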

NOTE: updates to notebooks may not be backwards compatible with previous environments; the aim is to drive the content of the notebooks forward so they run against the evolving “current best practice” environment, not so that they are necessarily backwards compatible with earlier environments. Ideally, a set of “correct” run notebooks from one presentation form the basis of the test for the next presentation; but even so, differences may arise that represent a “correct” output in the new environment. Hmmm, so maybe I need an nbval-passes tag that can be used to identify cells whose output can be ignored because the cell is known to produce a correct output in the new environment that doesn’t match the output from the previous environment and that can’t be handled by an outstanding vague test. Then when we create a new branch of the notebooks for a new presentation, those nbval-passes are stripped from the notebooks under the assumption they should, “going forward”, now pass correctly.
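Stripping such a tag when a new presentation branch is created is straightforward if the notebook is treated as JSON. A minimal sketch follows; nbval-passes is the hypothetical tag floated above, not something nbval currently knows about, and strip_tag is a made-up helper name.

```python
def strip_tag(notebook, tag="nbval-passes"):
    """Remove a tag from every cell in place; return how many cells were changed."""
    changed = 0
    for cell in notebook.get("cells", []):
        tags = cell.get("metadata", {}).get("tags", [])
        if tag in tags:
            cell["metadata"]["tags"] = [t for t in tags if t != tag]
            changed += 1
    return changed
```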

As I retroactively start tagging notebooks with a view to improving the meaningful test pass rate, several things come to mind:

  • the primary aim is to check that the notebooks provide appropriate results when run in a particular environment; a cell output does not necessarily need to exactly match a gold master output for it to be appropriate;
  • the pass rate of some cells could be improved by modifying the code; for example, displaying SQL queries or dataframes that have been sorted on a particular column or columns. In some cases this will not detract from the learning point being made in the cell, but in other cases it might;
  • adding new cell tags / tests can weaken or strengthen tests that are already available, although at the cost of introducing more tag types to manage; for example, the dataframe output test currently checks the dataframe size and column names match, BUT the columns do not necessarily need to be in the same order; this test could be strengthened by also checking column name order, or weakened by dropping the column name check altogether. We could also improve the strength by checking column types, for example;
  • some cells it’s perhaps just better to skip or ignore altogether; but in such cases, we should be able to report on which cells have been skipped or had their cell output ignored (so we can check whether a ‘failure’ could arise that might need to be addressed rather than ignored), or disable the “ignore” or “skip” behaviour to run a comprehensive test.
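The idea of weaker and stronger versions of the dataframe test can be made concrete with a small sketch. To keep it free of dependencies it compares plain summaries of a frame (shape, column names, dtypes) rather than pandas objects; the summarise and frames_match names and the four strength levels are assumptions for illustration, with "names" corresponding to the current behaviour described above.

```python
def summarise(columns, dtypes, n_rows):
    """Build a plain summary of a dataframe: column names, dtypes and shape."""
    return {"columns": list(columns), "dtypes": list(dtypes),
            "shape": (n_rows, len(columns))}


def frames_match(expected, actual, strength="names"):
    """Compare two frame summaries at increasing levels of strictness.

    'shape': same number of rows and columns
    'names': shape, plus the same column names in any order
    'order': names, plus the same column order
    'types': order, plus the same column dtypes
    """
    levels = ["shape", "names", "order", "types"]
    if strength not in levels:
        raise ValueError(f"Unknown strength: {strength!r}")
    rank = levels.index(strength)
    ok = expected["shape"] == actual["shape"]
    if rank >= 1:
        ok = ok and sorted(expected["columns"]) == sorted(actual["columns"])
    if rank >= 2:
        ok = ok and expected["columns"] == actual["columns"]
    if rank >= 3:
        ok = ok and expected["dtypes"] == actual["dtypes"]
    return ok
```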

For the best test coverage, we would have 0 ignored output cells, 0 skipped cells, tests that are as strong as possible, no errors, no warnings, and no failures (where a failure is a failure of the matching test, either exact matching or one of my vague tests).
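A first step towards reporting on that coverage is simply counting how many cells carry tags that weaken the test. A minimal sketch, assuming the standard nbval-skip and nbval-ignore-output tags and a notebook loaded as parsed JSON (the weakening_report name is made up):

```python
from collections import Counter

# Tags that reduce what is actually tested in a cell.
WEAKENING_TAGS = ("nbval-skip", "nbval-ignore-output")


def weakening_report(notebook):
    """Count the cells in a notebook that are skipped or have their output ignored."""
    counts = Counter()
    for cell in notebook.get("cells", []):
        for tag in cell.get("metadata", {}).get("tags", []):
            if tag in WEAKENING_TAGS:
                counts[tag] += 1
    return dict(counts)
```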

PS as well as tests, I am also looking at actions to support the distribution of notebooks; this includes things like checking for warnings, clearing output cells, making sure that cell toolbars are collapsed, making sure that activity answers are collapsed, etc etc. Checking toolbars and activity outputs are collapsed could be tests, or could be automatically run actions. Ideally, we should be able to automate the publication of a set of notebooks by:

  • running tests over the notebooks;
  • if all the tests pass, run the distribution actions;
  • create a distributable zip of ready-to-use notebook files etc.
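The steps above can be sketched in a few lines using only the standard library. This is an outline under stated assumptions rather than a working release tool: build_release and clear_outputs are hypothetical names, the tests_passed flag stands in for whatever the test run reports, and the only distribution action shown is clearing code cell outputs.

```python
import json
import zipfile
from pathlib import Path


def clear_outputs(notebook):
    """Remove outputs and execution counts from every code cell."""
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return notebook


def build_release(source_dir, zip_path, tests_passed):
    """Zip cleared copies of every notebook under source_dir, but only if tests passed."""
    if not tests_passed:
        return False
    source_dir = Path(source_dir)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(source_dir.rglob("*.ipynb")):
            notebook = clear_outputs(json.loads(path.read_text(encoding="utf-8")))
            archive.writestr(path.relative_to(source_dir).as_posix(),
                             json.dumps(notebook, indent=1))
    return True
```

The original notebooks are left untouched; only the copies written into the zip file are cleared.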

In-Browser WASM Powered Postgres and DuckDB Fragments On the To Do List…

A quick note that I need to demo some simple educational material that shows how we can use postgres-wasm to drop a postgres-wasm PostgreSQL terminal into an IFrame and run some activities:

We can also access a wasm powered db with proxied sockets, which means:

  • we can connect to the DB from something like pandas if we are running in an environment that supports socket connections (which pyodide/JupyterLite doesn’t);
  • we only need to run a simple proxy webservice alongside the http server that delivers the WASM bundle, rather than a full PostgreSQL server. Persistence is handled via browser storage, which means if the database is large, that may be the main hurdle…

If we were just doing SQL data wrangling, it would possibly make more sense to use something like DuckDB. In passing, I note an experimental package that supports DuckDB inside JupyterLite — iqmo-org/jupylite_duckdb — to complement the “full fat” duckdb Python package:

However, for playing with things like roles and permissions, or more of the basic DB management functions, having a serverless PostgreSQL database is really handy. One thing it can’t (currently?) do, though, is support multiple concurrent connections, which means no playing with transactions? Although – maybe the proxied version can?! One to try…

Using langchain To Run Queries Against GPT4All in the Context of a Single Documentary Knowledge Source

In the previous post, Running GPT4All On a Mac Using Python langchain in a Jupyter Notebook, I posted a simple walkthrough of getting GPT4All running locally on a mid-2015 16GB Macbook Pro using langchain. In this post, I’ll provide a simple recipe showing how we can run a query that is augmented with context retrieved from a single document-based knowledge source.

I’ve updated the previously shared notebook here to include the following…

Example Query Supported by a Document Based Knowledge Source

Example document query using the example from the langchain docs.

The idea is to run the query against a document source to retrieve some relevant context, and use that as part of the prompt context.
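Stripped of the langchain machinery, the pattern looks something like the toy sketch below. Note the swap: passages are ranked by simple word overlap with the query, not by embeddings as in the rest of this post, and the function names and prompt wording are made up for illustration.

```python
import re


def words(text):
    """Lower-case a text and split it into a set of words, ignoring punctuation."""
    return set(re.findall(r"\w+", text.lower()))


def overlap_score(query, passage):
    """Count the words the query and the passage have in common."""
    return len(words(query) & words(passage))


def build_prompt(query, passages):
    """Pick the passage that best overlaps the query and place it in the prompt."""
    best = max(passages, key=lambda passage: overlap_score(query, passage))
    return (
        "Use the following context to answer the question.\n\n"
        f"Context: {best}\n\n"
        f"Question: {query}\n"
    )
```

The model then sees the retrieved passage as part of its prompt, so it can answer from the document and not just from whatever it picked up in training.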


template = """

Question: {question}

"""

prompt = PromptTemplate(template=template, input_variables=["question"])
llm_chain = LLMChain(prompt=prompt, llm=llm)

A naive prompt gives an irrelevant answer:

query = "What did the president say about Ketanji Brown Jackson"
llm_chain.run(query)
CPU times: user 58.3 s, sys: 3.59 s, total: 1min 1s
Wall time: 9.75 s
'\nAnswer: The Pittsburgh Steelers'

Now let’s try with a source document.

#!wget https://raw.githubusercontent.com/hwchase17/langchainjs/main/examples/state_of_the_union.txt
from langchain.document_loaders import TextLoader

# Ideally....
loader = TextLoader('./state_of_the_union.txt')

However, creating the embeddings is quite slow so I’m going to use a fragment of the text:

#ish via chatgpt...
def search_context(src, phrase, buffer=100):
  """Return the words around the first exact match of phrase in the file src."""
  with open(src, 'r') as f:
    txt = f.read()
  words = txt.split()
  index = words.index(phrase)
  start_index = max(0, index - buffer)
  end_index = min(len(words), index + buffer+1)
  return ' '.join(words[start_index:end_index])

fragment = './fragment.txt'
with open(fragment, 'w') as fo:
    _txt = search_context('./state_of_the_union.txt', "Ketanji")
    fo.write(_txt)
!cat $fragment

Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence. A former top litigator in private practice. A former federal public defender. And from a family of public school educators and police officers. A consensus builder. Since she’s been nominated, she’s received a broad range of support—from the Fraternal Order of Police to former judges appointed by Democrats and Republicans. And if we are to advance liberty and justice, we need to secure the Border and fix the immigration system. We can do both. At our border, we’ve installed new technology like cutting-edge
loader = TextLoader('./fragment.txt')

Generate an index from the knowledge source text:

#%pip install chromadb
from langchain.indexes import VectorstoreIndexCreator
# Time: ~0.5s per token
# NOTE: "You must specify a persist_directory oncreation to persist the collection."
# TO DO: How do we load in an already generated and persisted index?
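# llama_embeddings is a LlamaCppEmbeddings instance (created as `llama` in the previous post)
index = VectorstoreIndexCreator(embedding=llama_embeddings,
                                vectorstore_kwargs={"persist_directory": "db"}
                                ).from_loaders([loader])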
Using embedded DuckDB with persistence: data will be stored in: db

CPU times: user 2 µs, sys: 2 µs, total: 4 µs
Wall time: 7.87 µs

# The following errors...
#index.query(query, llm=llm)
# With the full SOTU text, I got:
# Error: llama_tokenize: too many tokens;
# Also occasionally getting:
# ValueError: Requested tokens exceed context window of 512

# If we do get past that,
# NotEnoughElementsException

# For the latter, somehow need to set something like search_kwargs={"k": 1}

It seems the retriever is expecting, by default, 4 result documents. I can’t see how to pass in a lower limit (a single response document is acceptable in this case), so we need to roll our own chain…


# Roll our own....

from langchain.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA

documents = loader.load()

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
texts = text_splitter.split_documents(documents)

# Again, we should persist the db and figure out how to reuse it
docsearch = Chroma.from_documents(texts, llama_embeddings)
Using embedded DuckDB without persistence: data will be transient

CPU times: user 5min 59s, sys: 1.62 s, total: 6min 1s
Wall time: 49.2 s

# Just getting a single result document from the knowledge lookup is fine...
MIN_DOCS = 1

qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff",
                                 retriever=docsearch.as_retriever(search_kwargs={"k": MIN_DOCS}))
CPU times: user 861 µs, sys: 2.97 ms, total: 3.83 ms
Wall time: 7.09 ms
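To see why asking for more results than there are chunks can be a problem, here is a toy sketch of the chunk-then-retrieve step. It is not how langchain or Chroma implement it: fixed_chunks is a plain fixed-width splitter standing in for RecursiveCharacterTextSplitter, and top_k just ranks a list of precomputed similarity scores. The point is the min() that caps k at the number of chunks available.

```python
def fixed_chunks(text, chunk_size=500):
    """Split text into consecutive chunks of at most chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def top_k(scores, k=4):
    """Return the indices of the k best-scoring chunks, never more than exist."""
    k = min(k, len(scores))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]
```

A 1,200 character fragment gives just three chunks at this chunk size, so a retriever that insists on four results has nothing to return for the fourth; setting k to 1 sidesteps that.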

How about running our query now in the context of the knowledge source?



print(query)
qa.run(query)
What did the president say about Ketanji Brown Jackson
CPU times: user 7min 39s, sys: 2.59 s, total: 7min 42s
Wall time: 1min 6s

' The president honored Justice Stephen Breyer and acknowledged his service to this country before introducing Justice Ketanji Brown Jackson, who will be serving as the newest judge on the United States Court of Appeals for the District of Columbia Circuit.'

How about a more precise query?

query = "Identify three things the president said about Ketanji Brown Jackson"
qa.run(query)

CPU times: user 10min 20s, sys: 4.2 s, total: 10min 24s
Wall time: 1min 35s

' The president said that she was nominated by Barack Obama to become the first African American woman to sit on the United States Court of Appeals for the District of Columbia Circuit. He also mentioned that she was an Army veteran, a Constitutional scholar, and is retiring Justice of the United States Supreme Court.'

Hmm… are we in a conversation and picking up on previous outputs? In previous attempts I did appear to be getting quite relevant answers… Are we perhaps getting more than a couple of results docs and picking the less good one? Or is the model hit and miss on what it retrieves? Can we view the sample results docs from the knowledge lookup to help get a feel for what’s going on?

Let’s see if we can format the response…


query = """
Identify three things the president said about Ketanji Brown Jackson. Provide the answer in the form:

- ITEM 1
- ITEM 2
- ITEM 3
"""
qa.run(query)

CPU times: user 12min 31s, sys: 4.24 s, total: 12min 35s
Wall time: 1min 45s

"\n\nITEM 1: President Trump honored Justice Breyer for his service to this country, but did not specifically mention Ketanji Brown Jackson.\n\nITEM 2: The president did not identify any specific characteristics about Justice Breyer that would be useful in identifying her.\n\nITEM 3: The president did not make any reference to Justice Breyer's current or past judicial rulings or cases during his speech."

Running GPT4All On a Mac Using Python langchain in a Jupyter Notebook

Over the last three weeks or so I’ve been following the crazy rate of development around locally run large language models (LLMs), starting with llama.cpp, then alpaca and most recently (?!) gpt4all.

My laptop (a mid-2015 Macbook Pro, 16GB) was in the repair shop for over a week of that period, and it’s only really now that I’ve had even a quick chance to play, although I knew 10 days ago what sort of thing I wanted to try, and that has only really become off-the-shelf possible in the last couple of days.

The following script can be downloaded as a Jupyter notebook from this gist.

GPT4All Langchain Demo

Example of locally running GPT4All, a 4GB, llama.cpp based large language model (LLM) under langchain (https://github.com/hwchase17/langchain), in a Jupyter notebook running a Python 3.10 kernel.

Tested on a mid-2015 16GB Macbook Pro, concurrently running Docker (a single container running a separate Jupyter server) and Chrome (with approx. 40 open tabs).

Model preparation

  • download gpt4all model:
  • download llama.cpp 7B model
#%pip install pyllama
#!python3.10 -m llama.download --model_size 7B --folder llama/
  • transform gpt4all model:
#%pip install pyllamacpp
#!pyllamacpp-convert-gpt4all ./gpt4all-main/chat/gpt4all-lora-quantized.bin llama/tokenizer.model ./gpt4all-main/chat/gpt4all-lora-q-converted.bin
GPT4ALL_MODEL_PATH = "./gpt4all-main/chat/gpt4all-lora-q-converted.bin"

langchain Demo

Example of running a prompt using langchain.

#%pip uninstall -y langchain
#%pip install --upgrade git+https://github.com/hwchase17/langchain.git

from langchain.llms import LlamaCpp
from langchain import PromptTemplate, LLMChain
  • set up prompt template:
template = """

Question: {question}
Answer: Let's think step by step.

"""

prompt = PromptTemplate(template=template, input_variables=["question"])
  • load model:
llm = LlamaCpp(model_path=GPT4ALL_MODEL_PATH)

llama_model_load: loading model from './gpt4all-main/chat/gpt4all-lora-q-converted.bin' - please wait ...
llama_model_load: n_vocab = 32001
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 4096
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 11008
llama_model_load: n_parts = 1
llama_model_load: type    = 1
llama_model_load: ggml map size = 4017.70 MB
llama_model_load: ggml ctx size =  81.25 KB
llama_model_load: mem required  = 5809.78 MB (+ 2052.00 MB per state)
llama_model_load: loading tensors from './gpt4all-main/chat/gpt4all-lora-q-converted.bin'
llama_model_load: model size =  4017.27 MB / num tensors = 291
llama_init_from_file: kv self size  =  512.00 MB
CPU times: user 572 ms, sys: 711 ms, total: 1.28 s
Wall time: 1.42 s
  • create language chain using prompt template and loaded model:
llm_chain = LLMChain(prompt=prompt, llm=llm)
  • run prompt:
question = "What NFL team won the Super Bowl in the year Justin Bieber was born?"
llm_chain.run(question)
CPU times: user 5min 2s, sys: 4.17 s, total: 5min 6s
Wall time: 43.7 s
'1) The year Justin Bieber was born (2005):\n2) Justin Bieber was born on March 1, 1994:\n3) The Buffalo Bills won Super Bowl XXVIII over the Dallas Cowboys in 1994:\nTherefore, the NFL team that won the Super Bowl in the year Justin Bieber was born is the Buffalo Bills.'

Another example…

template2 = """

Question: {question}

"""

prompt2 = PromptTemplate(template=template2, input_variables=["question"])

# Use prompt2 here (the original passed the earlier "step by step" prompt by mistake)
llm_chain2 = LLMChain(prompt=prompt2, llm=llm)
question2 = "What is a relational database and what is ACID in that context?"
llm_chain2.run(question2)
CPU times: user 14min 37s, sys: 5.56 s, total: 14min 42s
Wall time: 2min 4s
"A relational database is a type of database management system (DBMS) that stores data in tables where each row represents one entity or object (e.g., customer, order, or product), and each column represents a property or attribute of the entity (e.g., first name, last name, email address, or shipping address).\n\nACID stands for Atomicity, Consistency, Isolation, Durability:\n\nAtomicity: The transaction's effects are either all applied or none at all; it cannot be partially applied. For example, if a customer payment is made but not authorized by the bank, then the entire transaction should fail and no changes should be committed to the database.\nConsistency: Once a transaction has been committed, its effects should be durable (i.e., not lost), and no two transactions can access data in an inconsistent state. For example, if one transaction is in progress while another transaction attempts to update the same data, both transactions should fail.\nIsolation: Each transaction should execute without interference from other concurrently executing transactions, thereby ensuring its properties are applied atomically and consistently. For example, two transactions cannot affect each other's data"

Generating Embeddings

We can use the llama.cpp model to generate embeddings.

#%pip uninstall -y llama-cpp-python
#%pip install --upgrade llama-cpp-python

from langchain.embeddings import LlamaCppEmbeddings
llama = LlamaCppEmbeddings(model_path=GPT4ALL_MODEL_PATH)
llama_model_load: loading model from './gpt4all-main/chat/gpt4all-lora-q-converted.bin' - please wait ...
llama_model_load: n_vocab = 32001
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 4096
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 11008
llama_model_load: n_parts = 1
llama_model_load: type    = 1
llama_model_load: ggml map size = 4017.70 MB
llama_model_load: ggml ctx size =  81.25 KB
llama_model_load: mem required  = 5809.78 MB (+ 2052.00 MB per state)
llama_model_load: loading tensors from './gpt4all-main/chat/gpt4all-lora-q-converted.bin'
llama_model_load: model size =  4017.27 MB / num tensors = 291
llama_init_from_file: kv self size  =  512.00 MB
text = "This is a test document."
query_result = llama.embed_query(text)
CPU times: user 12.9 s, sys: 1.57 s, total: 14.5 s
Wall time: 2.13 s
doc_result = llama.embed_documents([text])
CPU times: user 10.4 s, sys: 59.7 ms, total: 10.4 s
Wall time: 1.47 s
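Each embedding is just a list of floats (embed_documents returns a list of such lists, one per document), so the usual way to compare a query with a document is the cosine similarity of the two vectors. A minimal pure-Python sketch; the cosine_similarity helper is mine, not part of langchain:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

In the notebook, cosine_similarity(query_result, doc_result[0]) would compare the query embedding with the embedding of the single document; the value should come out close to 1 here, since the same text is embedded both times.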

Next up, I’ll try to create a simple db using the llama embeddings and then try to run a QandA prompt against a source document…

PS See also this example of running a query against GPT4All in langchain in the context of a single, small, document knowledge source.

Whither In-Browser Jupyter WASM? R is Here, Could Postgres Be Too?

The original MyBinder service for launching Jupyter notebooks from GitHub included an option to attach a PostgreSQL database that you could access from Jupyter notebooks:

With JupyterLite taking over many of the demo requests for the Try Jupyter site from MyBinder, reducing the need for anything other than a simple webserver on the Try Jupyter site, and Jupyterlab running purely in the browser under WASM, I wonder whether it would be possible to integrate an in-browser PostgreSQL server into the distribution using postgres-wasm (earlier review)?

I also note that Jupyter (originally coined from Julia, Python and R, with the suggestion, if not implication, that the environment would support those separate languages equally) is now a step closer to being legitimised again. A blog post from the Posit (RStudio, as was) camp, who very sensibly got George Stagg on board and supported his development of WebR, announced the official release last week of WebR, and with it an experimental JupyterLite R kernel. There’s a list of WebR/wasm compiled supported R packages here.

So now you can run R in a JupyterLite environment, or via the WebR console. See also Bob Rudis’ / @hrbrmstr’s getting started with WebR post.

Presumably, that means you’ll also be able to use the JupyterLite R kernel to provide in-browser code execution in a Thebe (ThebeLite?) backed Jupyter Book when that package gets a release (it keeps seeming to be so close…! Maybe for JupyterCon?)

Hmm… now I wonder… There was a xeus-sqlite Jupyterlite kernel, presumably derived from the xeus-sql kernel? So I wonder – could you get a xeus-sql kernel running in JupyterLite and calling postgres-wasm running in the same tab?

I also wonder: what if Jupyter Book could transclude content from a wasm flavoured SQLite or PostgreSQL database? Or ship a full-text, fuzzy, or even semantic search using a wasm powered database?

PS in passing, I also note various WASM backed dashboarding solutions:

Again, it’d be interesting to see one of those shipping with database hooks in place? Or DuckDB integration so you could easily make SQL requests over a whole host of sources. Or datasette-lite? Etc etc. (I’m not sure how the plumbing would work though???)

Fragment — Did You Really X That?

There is a website — https://thisxdoesnotexist.com/ — that collects links for various sites along the lines of This Person Does Not Exist, for example, or This Automobile Does Not Exist.

So I started wondering about a complement, “This is Not Me”, that could link to customisable AI tools that could take over you, for example, by generating things in the style of you.

  • VALL-E (research paper on a text-to-speech generator using your voice, from a short clip of you speaking);
  • FRAN (“Production-Ready Face Re-Aging for Visual Effects” from Disney);
  • Calligrapher AI (text-to-handwriting: parameterised, at the moment, but how long before you can train it with your handwriting?)

Things to look out for:

  • did you really go there? (add your photo to an obscure location; photoshop does this already, but 1-click for the rest of us…);
  • did you really do that? (photo of you apparently doing something: trampolining whilst wearing wellies and a Rocky Horror costume, for example; okay… maybe you did do that… Again, Photoshop, but for the rest of us…)
  • did you really write that? (something written in your style and with your flavour of typos.. I suspect there are lots of these already but I haven’t looked…)

Getting darker, then there’s the deepfake pr0n stuff, of course…

Various other bits related:

Working with Broken

OpenAI announce the release of an AI generated text identification tool that they admit is broken (“not fully reliable”, as they euphemistically describe it) and that you have to figure out how to use as best you can, given its unreliability.

Even though we spend 2-3 years producing new courses, the first presentation always has broken bits, broken bits that sometimes take years to be discovered, others that are discovered quickly but still take years before anyone gets round to fixing them. Sometimes, courses need a lot of work after the first presentation reveals major issues with them. Updates to materials are discouraged, in case they are themselves broken or break things in turn, which means materials start to rot in place (modules can remain in place for 5 years, even 10, with few changes).

My attitude has been that we can ship things that are a bit broken, if we fix them quickly, and/or actively engage with students to mitigate or even explore the issue. We just need to be open. Quality improved through iteration. Quality assured through rapid response (not least because the higher the quality at the start, the less work we make for ourselves by having to fix things).

Tinkering with ChatGPT, I started wondering about how we can co-opt ChatGPT as a teaching and learning tool, given its output may be broken. ChatGPT as an unreliable but well-intentioned tutor. Over the last week or two, I’ve been trying to produce a new “sample report” that models the sort of thing that we expect students to produce for their data analysis and management course end-of-course assessment (a process that made me realise how much technical brokenness we are probably letting slip through if the presented reports look plausible enough; the lesson of ChatGPT again comes to mind here, with the student submitting the report as being akin to an unreliable author who can slip all sorts of data management abuses and analysis mistakes through in a report, in a largely non-technical document that only displays the results and doesn’t show the working). In so doing, I wondered whether it might be more useful to create an “unreliable” sample report, but annotate it with comments, as if from a tutor, that acknowledged good points and picked up on bad ones.

My original thinking seven or eight years ago now was that the final assessment report for the data management and analysis course would be presented as a reproducible document. That never happened (I had little role in designing the assessment, and things were not so mature getting on for a decade ago now when the course was first mooted), but as tools like Jupyter Book and Quarto show, there are now tools in place that can produce good quality interactive HTML reports with hideable/revealable code, or easily produce two parallel MS Word or PDF document outputs – a “finished document” output (with code hidden), and a “fully worked” document with all the code showing. This would add a lot of work for the student though. Currently, the model we use is for students to work in a single project diary notebook (though some students occasionally use multiple notebooks) that contains all manner of horrors, and then paste things like charts and tables into the final report. The final report typically contains a quick discursive commentary where the students explain what they think they did. We reserve the right to review the code (the report is the thing that is assessed), but I suspect the notebook contents rarely get a detailed look from markers, even if they are looked at at all. For students to tease only the relevant code out of their notebook into a reproducible report would be a lot of extra work…

For the sample report, my gut feeling is that the originating notebook for the data handling and analysis should not be shared. We need to leave the students with some work to do, and for a technical course, that is the method. In this model, we give a sample unreproducible report, unreliable but commented upon, that hints at the sort of thing we expect to get back, but that hides the working. Currently, we assess the student’s report, which operates at the same level. But ideally, they’d give us a reproducible report back that gives us the chance to look at their working inline.

Anyway, that’s all an aside. The point of this post was an announcement I saw from OpenAI – New AI classifier for indicating AI-written text – where they claim to have “trained a classifier to distinguish between text written by a human and text written by AIs from a variety of providers”. Or not:

Our classifier is not fully reliable. [OpenAI’s emphasis] In our evaluations on a “challenge set” of English texts, our classifier correctly identifies 26% of AI-written text (true positives) as “likely AI-written,” while incorrectly labeling human-written text as AI-written 9% of the time (false positives).

OpenAI then make the following offer: [w]e’re making this classifier publicly available to get feedback on whether imperfect tools like this one are useful.
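To get a feel for what those two figures mean in practice, here is a minimal sketch of the arithmetic. The 26% and 9% rates are the ones OpenAI quote; the share of texts in the pile that really are AI-written (`prevalence`) is not something OpenAI report, so the values tried below are made-up assumptions, and the function name is purely illustrative.

```python
def share_of_flags_that_are_ai(prevalence, true_positive_rate=0.26, false_positive_rate=0.09):
    """Of the texts flagged "likely AI-written", what fraction really are AI-written?"""
    flagged_ai = prevalence * true_positive_rate
    flagged_human = (1 - prevalence) * false_positive_rate
    return flagged_ai / (flagged_ai + flagged_human)


# Hypothetical prevalence values, chosen only for illustration
for prevalence in (0.05, 0.2, 0.5):
    share = share_of_flags_that_are_ai(prevalence)
    print(f"{prevalence:.0%} of texts AI-written: {share:.0%} of flagged texts really are")
```

On these assumptions, the rarer AI-written text is in the pile, the more of the flags land on human authors; and whatever the prevalence, a 26% true positive rate means roughly three quarters of the AI-written texts are never flagged at all.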

So I’m wondering: is this the new way of doing things? Giving up on the myth that things work properly, and instead accepting that we have to work with tools that are known to be a bit broken? That we have to find ways of working with them that accommodate that? Accepting that everything we use is broken-when-shipped, that everything is unreliable, and that it is up to us to use our own craft, and come up with our own processes, in order to produce things that are up to the standard we expect, even given the unreliability of everything we have to work with? Quality assurance as an end user problem?

Chat Je Pétais

Over on the elearnspace blog, George – I’m assuming it’s George – makes the following observation in This Time is Different. Part 1.: “I’m writing a series on the threat higher education faces and why I think it’s future is one of substantive, dramatic, and systemic changes. However, it’s not AI. There are a range of factors related to the core of what educators do: discover, create, and share knowledge. AI is part of it – and in the future may be the entirety of it. Short term, however, there are other urgent factors.”

I’ve increasingly been thinking about the whole “discover, create, share” thing in a different context over the last year or so, in the context of traditional storytelling.

I now spend much of my free-play screen time trawling archive.org looking for 19th century folk tale and fairy tale collections, or searching dismally OCR’d 19th century newspapers in the British Newspaper Archive (I would contribute my text corrections back, but in the British Newspaper Archive at least, the editor sucks and it would take forever; it’s not hard to imagine a karaoke display that tries to highlight each word in turn to prompt you to read the text aloud, then uses whisper.ai to turn that into text, but as it is, you get arbitrarily chunked small collections of words, each with its own edit box, that take forever to update separately. The intention is obviously to improve the OCR training, not to allow readers who have transcribed the whole to paste some properly searchable text in and then let the machine have a go at text alignment.)

So for me, text related online discovery now largely relates to discovery within 19th century texts, creating is largely around trying to pull stories together, or sequences of stories that work together, and sharing is telling stories in folk clubs and storytelling gigs. As to sharing knowledge, the stories are, of course, all true…

I’ve also played with ChatGPT a little bit, and it’s a time waster for me. It’s a game as you try to refine the prompt to generate answers of substance, every claim of fact requires fact checking, and whilst the argumentation appears reasonable at a glance, it doesn’t always follow. The output is, on the surface, compelling and plausible, and is generated for you to read without you having to think too much about it. I realise now why Socratic dialogue as a mode of learning gets a hard press: the learner doesn’t really have to do much of the hard learning work, where you have to force your own brain circuits to generate sentences, and rewire those bits of your head that make you say things that don’t make sense, or spout ungrounded opinions, à la Chat je pétais.

In passing, via the tweets, All my classes suddenly became AI classes and the following policy for using chatGPT in an educational setting:

Elsewhere, I note that I should probably be following Jackie Gerstein via my feeds…

Dave Cormier also shares a beautifully rich observation to ponder upon – ChatGPT as “autotune for knowledge”. And Simon Willison shares a handy guide to improving fart prompts from the OpenAI Cookbook – Techniques to improve reliability – because any prompts you do use naively are just that.

I hate and resent digital technology more and more every day.

And through trying to sell tickets to oral culture events, I am starting to realise how many people are digitally excluded, don’t have ready internet access, can’t buy things online, and don’t discover things online. And I would rather spend my time in their world than this digital one. Giving up archive.org would be a shame, but I have no trouble finding books in second hand bookshops, even if they do cost a bit more.

From Packages to Transformers and Pipelines

When I write code, I typically co-opt functions and algorithms I’ve pinched from elsewhere.

There are Python packages out there that are likely to do pretty much whatever you want, at least as a first draft, so to my mind, it makes sense to use them, and file issues (and occasionally PRs) if they don’t quite do what I want or expect, or if I find they aren’t working as advertised or as intended. (I’m also one of those people who emails webmasters to tell them their website is broken…)

But I’m starting to realise that there’s now a whole other class of off-the-shelf computational building blocks available in the form of AI pipelines. (For example, I’ve been using the whisper speech2text model for some time via a Python package.)

For example, the Hugging Face huggingface/transformers package contains a whole raft of pre-trained AI models wrapped by simple, handy Python function calls.

For example, consider a “table question answering” task: how would I go about creating a simple application to help me ask natural language queries over a tabular data set? A peek at the Hugging Face table question answering task suggests using the table-question-answering transformer and the google/tapas-large-finetuned-wtq model:

from transformers import pipeline
import pandas as pd

# prepare table + question
data = {"Actors": ["Brad Pitt", "Leonardo Di Caprio", "George Clooney"], "Number of movies": ["87", "53", "69"]}
table = pd.DataFrame.from_dict(data)
question = "how many movies does Leonardo Di Caprio have?"

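# pipeline model
# Note: you must install torch-scatter first.
tqa = pipeline(task="table-question-answering", model="google/tapas-large-finetuned-wtq")

# result: pass the question defined above as the query,
# then print the first table cell the model selects as its answer
print(tqa(table=table, query=question)['cells'][0])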

The task description also points to at least one application demonstrating the approach — Table Question Answering (TAPAS) — although when I tried I could only get it to give one column in reply on queries I posed to it…

…which is to say the models / applications may not do quite what you want them to do. But as with a lot of found code, it can often help get you started, either with some code you can revise, or as an example of an approach that does not do what you want and that you should avoid.

Now I’m wondering: are there other transformers-like packages out there? And should I be looking at getting myself a m/c with a reasonable GPU so I can run this stuff locally… Or bite the bullet and start paying for AI APIs and on-demand GPU servers…

Search Assist With ChatGPT

Via my feeds, a tweet from @john_lam:

The tools for prototyping ideas are SO GOOD right now. This afternoon, I made a “citations needed” bot for automatically adding citations to the stuff that ChatGPT makes up


A corresponding gist is here.

Having spent a few minutes prior to that doing a “traditional” search using good old fashioned search terms and the Google Scholar search engine to try to find out how defendants in English trials of the early 19th century could challenge jurors (Brown, R. Blake. “Challenges for Cause, Stand-Asides, and Peremptory Challenges in the Nineteenth Century.” Osgoode Hall Law Journal 38.3 (2000): 453-494, http://digitalcommons.osgoode.yorku.ca/ohlj/vol38/iss3/3 looks relevant), I wondered whether ChatGPT, and John Lam’s search assist, might have been able to support the process:

Firstly, can ChatGPT help answer the question directly?

Secondly, can ChatGPT provide some search queries to help track down references?

The original rationale for the JSON based response was so that this could be used as part of an automated citation generator.

So this gives us a pattern of: write a prompt, get a response, request search queries relating to key points in response.
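As a minimal sketch of that pattern: `ask` below is assumed to be any callable that sends a prompt to a chat model and returns its text reply, and the follow-up prompt wording and the JSON shape (a list of `claim` / `query` pairs) are illustrative guesses rather than anything taken from John Lam’s gist. A canned stand-in replaces the real model call so the sketch runs offline.

```python
import json


def answer_with_search_queries(question, ask):
    """Ask a question, then ask for search queries that could help check the answer.

    `ask` is any callable that takes a prompt string and returns the model's reply text.
    """
    answer = ask(question)
    followup = (
        "Here is an answer to a question.\n\n"
        f"Question: {question}\n\nAnswer: {answer}\n\n"
        "For each key claim in the answer, suggest a web search query that would "
        "help find a supporting reference. Reply with a JSON list of objects, "
        'each with the keys "claim" and "query", and nothing else.'
    )
    reply = ask(followup)
    try:
        queries = json.loads(reply)
    except json.JSONDecodeError:
        # The model is under no obligation to return valid JSON
        queries = []
    return answer, queries


# Canned stand-in for a real model call (hypothetical replies)
def canned_ask(prompt):
    if "JSON" in prompt:
        return '[{"claim": "example claim", "query": "example search terms"}]'
    return "An example answer."


answer, queries = answer_with_search_queries("An example question?", canned_ask)
print(answer)
for item in queries:
    print(item["claim"], "->", item["query"])
```

Keeping the model call behind a plain callable means the same two-step loop could sit in front of whichever API or local model is to hand, and the parsed queries can then be passed to a search engine to look for citations.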

Suppose, however, that you have a set of documents on a topic and that you would like to be able to ask questions around them using something like ChatGPT. I note that Simon Willison has just posted a recipe on this topic — How to implement Q&A against your documentation with GPT3, embeddings and Datasette — that independently takes a similar approach to a recipe described in OpenAI’s cookbook: Question Answering using Embeddings.

The recipe begins with a semantic search of a set of papers. This is done by generating an embedding for each of the documents you want to search over using the OpenAI embeddings API, though we could roll our own that runs locally, albeit with a smaller model. (For example, here’s a recipe for a simple doc2vec powered semantic search.) To perform a semantic search, you find the embedding of the search query and then find the nearest embeddings generated from your source documents to provide the results. To speed up this part of the process in datasette, Simon created the datasette-faiss plugin to use FAISS.
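The ranking step can be sketched in a few lines. Note the big assumption: the `embed` function here is a toy bag-of-words count vector standing in for a real embedding model (the OpenAI embeddings API, doc2vec, or similar), so it only matches on shared words rather than meaning; the sample documents are made up. The "embed the query, then find the nearest document embeddings" logic has the same shape whatever produces the vectors.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors held as Counters."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_search(query, docs, top_n=2):
    """Rank docs by the similarity of their embedding to the query embedding."""
    query_vec = embed(query)
    # In practice the document embeddings would be computed once and stored
    scored = [(cosine_similarity(query_vec, embed(doc)), doc) for doc in docs]
    return sorted(scored, reverse=True)[:top_n]


# Made-up sample documents
docs = [
    "the fox and the crow",
    "how the crow stole the cheese",
    "a tale of three brothers",
]
for score, doc in semantic_search("crow cheese", docs):
    print(f"{score:.2f}  {doc}")
```

This brute-force version compares the query against every document, which is fine for a handful of texts; an index such as FAISS is what makes the nearest-neighbour lookup quick over a large collection.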

The content of the discovered documents is then used to seed a ChatGPT prompt with some “context”, and the question is applied to that context. So the recipe is something like: use a query to find some relevant documents, grab the content of those documents as context, then create a ChatGPT prompt of the form “given {context}, and this question: {question}”.
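A minimal sketch of that prompt-building step might look like the following. The instruction wording is an assumption rather than the text used in either recipe, and `max_chars` is a hypothetical, crude guard against over-long prompts (real limits are counted in tokens, not characters).

```python
def build_prompt(question, context_docs, max_chars=4000):
    """Seed a prompt with retrieved document text, then pose the question."""
    # Join the retrieved documents and crudely truncate the result
    context = "\n\n".join(context_docs)[:max_chars]
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Made-up context documents, as might be returned by a semantic search
print(build_prompt("Who stole the cheese?", ["The crow stole the cheese.", "The fox watched."]))
```

The resulting string is what gets sent to the language model; asking it to stick to the supplied context is an attempt to keep the answer grounded in the source documents, not a guarantee that it will be.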

It shouldn’t be too difficult to hack together a thing that runs this pattern against OU-XML materials. In other words:

  • generate simple text docs from OU-XML (I have scrappy recipes for this already);
  • build a semantic search engine around those docs (useful anyway, and I can reuse my doc2vec thing);
  • build a chatgpt query around a contextualised query, where the context is pulled from the semantic search results. (I wonder, has anyone built a chatgpt like thing around an opensource gpt2 model?)

PS another source of data / facts are data tables. There are various packages out there that claim to provide natural language query support for interrogating tabular data eg abhijithneilabraham/tableQA, and this review article, or the Hugging Face table-question-answering transformer, but I forget which I’ve played with. Maybe I should write a new RallyDataJunkie unbook that demonstrates those sorts of tools around tabulated rally results data?