Whither In-Browser Jupyter WASM? R is Here, Could Postgres Be Too?

The original MyBinder service for launching Jupyter notebooks from GitHub included an option to attach a PostgreSQL database that you could access from Jupyter notebooks.

With JupyterLite taking over many of the demo requests for the Try Jupyter site from MyBinder, reducing the need for anything other than a simple webserver to serve that site, and with JupyterLab running purely in the browser under WASM, I wonder whether it would be possible to integrate an in-browser PostgreSQL server into the distribution using postgres-wasm (earlier review)?

I also note that Jupyter (the name originally coined from Julia, Python and R, with the suggestion, if not implication, that the environment would support those separate languages equally) is now a step closer to being legitimised again by a blog post from the Posit (RStudio, as was) camp, who very sensibly got George Stagg on board and supported his development of WebR, announcing the official release of WebR last week and, with it, an experimental JupyterLite R kernel. There’s a list of WebR/WASM compiled supported R packages here.

So now you can run R in a JupyterLite environment, or via the WebR console. See also Bob Rudis’ / @hrbrmstr’s getting started with WebR post.

Presumably, that means you’ll also be able to use the JupyterLite R kernel to provide in-browser code execution in a Thebe (ThebeLite?) backed Jupyter Book when that package gets a release (it keeps seeming to be so close…! Maybe for JupyterCon?)

Hmm… now I wonder… There was a xeus-sqlite JupyterLite kernel, presumably derived from the xeus-sql kernel? So I wonder – could you get a xeus-sql kernel running in JupyterLite and calling postgres-wasm running in the same tab?

I also wonder: what if Jupyter Book could transclude content from a wasm flavoured SQLite or PostgreSQL database? Or ship a full-text, fuzzy, or even semantic search using a wasm powered database?

PS In passing, I also note various WASM-backed dashboarding solutions.

Again, it’d be interesting to see one of those shipping with database hooks in place? Or DuckDB integration so you could easily make SQL requests over a whole host of sources. Or datasette-lite? Etc etc. (I’m not sure how the plumbing would work though???)

Running GPT4All On a Mac Using Python langchain in a Jupyter Notebook

Over the last three weeks or so I’ve been following the crazy rate of development around locally run large language models (LLMs), starting with llama.cpp, then alpaca and most recently (?!) gpt4all.

My laptop (a mid-2015 Macbook Pro, 16GB) was in the repair shop for over a week of that period, and it’s only really now that I’ve had even a quick chance to play, although I knew 10 days ago what sort of thing I wanted to try, and that has only really become off-the-shelf possible in the last couple of days.

The following script can be downloaded as a Jupyter notebook from this gist.

GPT4All Langchain Demo

Example of locally running GPT4All, a 4GB, llama.cpp based large language model (LLM), under langchain (https://github.com/hwchase17/langchain), in a Jupyter notebook running a Python 3.10 kernel.

Tested on a mid-2015 16GB Macbook Pro, concurrently running Docker (a single container running a separate Jupyter server) and Chrome with approx. 40 open tabs.

Model preparation

  • download gpt4all model:
#https://the-eye.eu/public/AI/models/nomic-ai/gpt4all/gpt4all-lora-quantized.bin
  • download llama.cpp 7B model
#%pip install pyllama
#!python3.10 -m llama.download --model_size 7B --folder llama/
  • transform gpt4all model:
#%pip install pyllamacpp
#!pyllamacpp-convert-gpt4all ./gpt4all-main/chat/gpt4all-lora-quantized.bin llama/tokenizer.model ./gpt4all-main/chat/gpt4all-lora-q-converted.bin

GPT4ALL_MODEL_PATH = "./gpt4all-main/chat/gpt4all-lora-q-converted.bin"
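As a quick sanity check before trying to load the model (this is just a sketch, not part of the original recipe), we can confirm the converted file exists and is roughly the expected size (llama_model_load later reports a model size of about 4017 MB):

import os

# The converted model should exist and weigh in at roughly 4GB.
assert os.path.exists(GPT4ALL_MODEL_PATH), f"Model not found at {GPT4ALL_MODEL_PATH}"
print(f"{os.path.getsize(GPT4ALL_MODEL_PATH) / 1e6:.0f} MB")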

langchain Demo

Example of running a prompt using langchain.

#https://python.langchain.com/en/latest/ecosystem/llamacpp.html
#%pip uninstall -y langchain
#%pip install --upgrade git+https://github.com/hwchase17/langchain.git

from langchain.llms import LlamaCpp
from langchain import PromptTemplate, LLMChain
  • set up prompt template:
template = """

Question: {question}
Answer: Let's think step by step.

"""​
prompt = PromptTemplate(template=template, input_variables=["question"])
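As a trivial aside (not in the original notebook), PromptTemplate.format() can be used to preview the rendered prompt before it is wired into a chain:

# Preview the prompt with a placeholder question filled in.
print(prompt.format(question="What is 1 + 1?"))
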
  • load model:
%%time
llm = LlamaCpp(model_path=GPT4ALL_MODEL_PATH)

llama_model_load: loading model from './gpt4all-main/chat/gpt4all-lora-q-converted.bin' - please wait ...
llama_model_load: n_vocab = 32001
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 4096
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 11008
llama_model_load: n_parts = 1
llama_model_load: type    = 1
llama_model_load: ggml map size = 4017.70 MB
llama_model_load: ggml ctx size =  81.25 KB
llama_model_load: mem required  = 5809.78 MB (+ 2052.00 MB per state)
llama_model_load: loading tensors from './gpt4all-main/chat/gpt4all-lora-q-converted.bin'
llama_model_load: model size =  4017.27 MB / num tensors = 291
llama_init_from_file: kv self size  =  512.00 MB
CPU times: user 572 ms, sys: 711 ms, total: 1.28 s
Wall time: 1.42 s
  • create language chain using prompt template and loaded model:
llm_chain = LLMChain(prompt=prompt, llm=llm)
  • run prompt:
%%time
question = "What NFL team won the Super Bowl in the year Justin Bieber was born?"
llm_chain.run(question)
CPU times: user 5min 2s, sys: 4.17 s, total: 5min 6s
Wall time: 43.7 s
'1) The year Justin Bieber was born (2005):\n2) Justin Bieber was born on March 1, 1994:\n3) The Buffalo Bills won Super Bowl XXVIII over the Dallas Cowboys in 1994:\nTherefore, the NFL team that won the Super Bowl in the year Justin Bieber was born is the Buffalo Bills.'

Another example…

template2 = """

Question: {question}
Answer:  

"""​

prompt2 = PromptTemplate(template=template2, input_variables=["question"])

llm_chain2 = LLMChain(prompt=prompt2, llm=llm)
%%time
question2 = "What is a relational database and what is ACID in that context?"
llm_chain2.run(question2)
CPU times: user 14min 37s, sys: 5.56 s, total: 14min 42s
Wall time: 2min 4s
"A relational database is a type of database management system (DBMS) that stores data in tables where each row represents one entity or object (e.g., customer, order, or product), and each column represents a property or attribute of the entity (e.g., first name, last name, email address, or shipping address).\n\nACID stands for Atomicity, Consistency, Isolation, Durability:\n\nAtomicity: The transaction's effects are either all applied or none at all; it cannot be partially applied. For example, if a customer payment is made but not authorized by the bank, then the entire transaction should fail and no changes should be committed to the database.\nConsistency: Once a transaction has been committed, its effects should be durable (i.e., not lost), and no two transactions can access data in an inconsistent state. For example, if one transaction is in progress while another transaction attempts to update the same data, both transactions should fail.\nIsolation: Each transaction should execute without interference from other concurrently executing transactions, thereby ensuring its properties are applied atomically and consistently. For example, two transactions cannot affect each other's data"

Generating Embeddings

We can use the llama.cpp model to generate embeddings.

#https://abetlen.github.io/llama-cpp-python/
#%pip uninstall -y llama-cpp-python
#%pip install --upgrade llama-cpp-python

from langchain.embeddings import LlamaCppEmbeddings
llama = LlamaCppEmbeddings(model_path=GPT4ALL_MODEL_PATH)
llama_model_load: loading model from './gpt4all-main/chat/gpt4all-lora-q-converted.bin' - please wait ...
llama_model_load: n_vocab = 32001
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 4096
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 11008
llama_model_load: n_parts = 1
llama_model_load: type    = 1
llama_model_load: ggml map size = 4017.70 MB
llama_model_load: ggml ctx size =  81.25 KB
llama_model_load: mem required  = 5809.78 MB (+ 2052.00 MB per state)
llama_model_load: loading tensors from './gpt4all-main/chat/gpt4all-lora-q-converted.bin'
llama_model_load: model size =  4017.27 MB / num tensors = 291
llama_init_from_file: kv self size  =  512.00 MB
%%time
text = "This is a test document."​
query_result = llama.embed_query(text)
CPU times: user 12.9 s, sys: 1.57 s, total: 14.5 s
Wall time: 2.13 s
%%time
doc_result = llama.embed_documents([text])
CPU times: user 10.4 s, sys: 59.7 ms, total: 10.4 s
Wall time: 1.47 s
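As a rough sanity check (a sketch added here, not in the original notebook), the embedding dimension should match the n_embd value reported in the model load log (4096), and embed_documents() returns one vector per input document:

print(len(query_result))                    # expect 4096, i.e. n_embd
print(len(doc_result), len(doc_result[0]))  # 1 document, each a 4096-d vector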

Next up, I’ll try to create a simple db using the llama embeddings and then try to run a QandA prompt against a source document…

PS See also this example of running a query against GPT4All in langchain in the context of a single, small, document knowledge source.

Using langchain To Run Queries Against GPT4All in the Context of a Single Documentary Knowledge Source

In the previous post, Running GPT4All On a Mac Using Python langchain in a Jupyter Notebook, I posted a simple walkthrough of getting GPT4All running locally on a mid-2015 16GB Macbook Pro using langchain. In this post, I’ll provide a simple recipe showing how we can run a query that is augmented with context retrieved from a single document-based knowledge source.

I’ve updated the previously shared notebook here to include the following…

Example Query Supported by a Document Based Knowledge Source

Example document query using the example from the langchain docs.

The idea is to run the query against a document source to retrieve some relevant context, and use that as part of the prompt context.

#https://python.langchain.com/en/latest/use_cases/question_answering.html

template = """

Question: {question}

Answer: 

"""​
prompt = PromptTemplate(template=template, input_variables=["question"])
llm_chain = LLMChain(prompt=prompt, llm=llm)

A naive prompt gives an irrelevant answer (note that the cell below actually re-runs the earlier question variable, which is why the answer is a football team; either way, without any supporting context the model has no chance with the new query):

%%time
query = "What did the president say about Ketanji Brown Jackson"
llm_chain.run(question)
CPU times: user 58.3 s, sys: 3.59 s, total: 1min 1s
Wall time: 9.75 s
'\nAnswer: The Pittsburgh Steelers'

Now let’s try with a source document.

#!wget https://raw.githubusercontent.com/hwchase17/langchainjs/main/examples/state_of_the_union.txt
from langchain.document_loaders import TextLoader

# Ideally....
loader = TextLoader('./state_of_the_union.txt')

However, creating the embeddings is quite slow, so I’m going to use a fragment of the text:

# Grab a window of words either side of the first occurrence of a phrase
# (rough and ready, via chatgpt...)
def search_context(src, phrase, buffer=100):
    with open(src, 'r') as f:
        txt = f.read()
    words = txt.split()
    # NB: raises a ValueError if the phrase isn't a single whitespace-delimited token in the text
    index = words.index(phrase)
    start_index = max(0, index - buffer)
    end_index = min(len(words), index + buffer + 1)
    return ' '.join(words[start_index:end_index])

fragment = './fragment.txt'
with open(fragment, 'w') as fo:
    _txt = search_context('./state_of_the_union.txt', "Ketanji")
    fo.write(_txt)
!cat $fragment

Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence. A former top litigator in private practice. A former federal public defender. And from a family of public school educators and police officers. A consensus builder. Since she’s been nominated, she’s received a broad range of support—from the Fraternal Order of Police to former judges appointed by Democrats and Republicans. And if we are to advance liberty and justice, we need to secure the Border and fix the immigration system. We can do both. At our border, we’ve installed new technology like cutting-edge
loader = TextLoader('./fragment.txt')

Generate an index from the knowledge source text:

#%pip install chromadb
from langchain.indexes import VectorstoreIndexCreator
%time
# Time: ~0.5s per token
# NOTE: "You must specify a persist_directory on creation to persist the collection."
# TO DO: How do we load in an already generated and persisted index?
# llama_embeddings: the LlamaCppEmbeddings object loaded earlier
index = VectorstoreIndexCreator(embedding=llama_embeddings,
                                vectorstore_kwargs={"persist_directory": "db"}
                               ).from_loaders([loader])
Using embedded DuckDB with persistence: data will be stored in: db

CPU times: user 2 µs, sys: 2 µs, total: 4 µs
Wall time: 7.87 µs
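On the “how do we load an already persisted index?” question, the following is a minimal sketch of one possible answer, assuming the index above really was persisted to the db directory and the same llama_embeddings object is to hand (I haven’t verified this against the stored collection):

from langchain.vectorstores import Chroma

# Reopen the persisted collection rather than rebuilding it from the loader.
docsearch_reloaded = Chroma(persist_directory="db",
                            embedding_function=llama_embeddings)
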
%time
pass

# The following errors...
#index.query(query, llm=llm)
# With the full SOTU text, I got:
# Error: llama_tokenize: too many tokens;
# Also occasionally getting:
# ValueError: Requested tokens exceed context window of 512

# If we do get past that,
# NotEnoughElementsException

# For the latter, somehow need to set something like search_kwargs={"k": 1}

It seems the retriever is expecting, by default, 4 result documents. I can’t see how to pass in a lower limit (a single response document is acceptable in this case), so we need to roll our own chain…

%%time

# Roll our own....

#https://github.com/hwchase17/langchain/issues/2255
from langchain.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA

documents = loader.load()

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
texts = text_splitter.split_documents(documents)

# Again, we should persist the db and figure out how to reuse it
docsearch = Chroma.from_documents(texts, llama_embeddings)
Using embedded DuckDB without persistence: data will be transient

CPU times: user 5min 59s, sys: 1.62 s, total: 6min 1s
Wall time: 49.2 s
%%time

# Just getting a single result document from the knowledge lookup is fine...
MIN_DOCS = 1

qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff",
                                 retriever=docsearch.as_retriever(search_kwargs={"k": MIN_DOCS}))
CPU times: user 861 µs, sys: 2.97 ms, total: 3.83 ms
Wall time: 7.09 ms

How about running our query now in the context of the knowledge source?

%%time

print(query)

qa.run(query)
What did the president say about Ketanji Brown Jackson
CPU times: user 7min 39s, sys: 2.59 s, total: 7min 42s
Wall time: 1min 6s

' The president honored Justice Stephen Breyer and acknowledged his service to this country before introducing Justice Ketanji Brown Jackson, who will be serving as the newest judge on the United States Court of Appeals for the District of Columbia Circuit.'

How about a more precise query?

%%time
query = "Identify three things the president said about Ketanji Brown Jackson"

qa.run(query)
CPU times: user 10min 20s, sys: 4.2 s, total: 10min 24s
Wall time: 1min 35s

' The president said that she was nominated by Barack Obama to become the first African American woman to sit on the United States Court of Appeals for the District of Columbia Circuit. He also mentioned that she was an Army veteran, a Constitutional scholar, and is retiring Justice of the United States Supreme Court.'

Hmm… are we in a conversation and picking up on previous outputs? In previous attempts I did appear to be getting quite relevant answers… Are we perhaps getting more than a couple of result docs and picking the less good one? Or is the model hit and miss on what it retrieves? Can we view the sample result docs from the knowledge lookup to help get a feel for what’s going on?
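One way to get a feel for what is being retrieved (a sketch using standard langchain calls, not something I’ve wired into the notebook yet) is to query the vector store directly, or to ask the chain to return its source documents alongside the answer:

# Peek at what the retriever pulls back for the query...
docsearch.similarity_search(query, k=MIN_DOCS)

# ...or have the QA chain return its source documents alongside the answer.
qa_with_sources = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff",
                                              retriever=docsearch.as_retriever(search_kwargs={"k": MIN_DOCS}),
                                              return_source_documents=True)
result = qa_with_sources({"query": query})
print(result["result"])
print(result["source_documents"])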

Let’s see if we can format the response…

%%time

query = """
Identify three things the president said about Ketanji Brown Jackson. Provide the answer in the form: 

- ITEM 1
- ITEM 2
- ITEM 3
"""

qa.run(query)
CPU times: user 12min 31s, sys: 4.24 s, total: 12min 35s
Wall time: 1min 45s

"\n\nITEM 1: President Trump honored Justice Breyer for his service to this country, but did not specifically mention Ketanji Brown Jackson.\n\nITEM 2: The president did not identify any specific characteristics about Justice Breyer that would be useful in identifying her.\n\nITEM 3: The president did not make any reference to Justice Breyer's current or past judicial rulings or cases during his speech."

Testing Our Educational Jupyter Notebooks

For years and years I’ve been meaning to put automation in place for testing the Jupyter notebooks we release to students in the Data Management and Analysis module. I’ve had several false starts at addressing this over those years, but at the end of last week I thought I should try another iteration towards something that might be useful.

The maintenance and release cycle we currently have goes like this:

  • Docker container updated; notebook content errata fixed; notebooks are saved with code cells run in a private Github repo; for each new presentation of the module, we create a branch from the current main branch notebooks; errata are tracked via issues and are typically applied to main; they typically do not result in updated notebooks being released to students; instead, an erratum announcement is posted to the VLE, with the correction each student needs to make manually to their own copy of the notebook;
  • manual running and inspection of notebooks (typically not by me; if it were me, I’d have properly automated this much sooner!;-) in the updated container. The checks identify whether cells run, whether there are warnings, etc; some cells are intended to fail execution, which can complicate a quick “Run all, then visually inspect” test. If the cell output changes, it may be hard to identify exactly what change has occurred / whether the change is one that is likely to be identified in use by a student as “an error”. Sometimes, outputs differ in detail, but not kind. For example, a fetch-one Mongo query brings back an item, but which item may not be guaranteed; a %memit or %timeit test is unlikely to return exactly the same resource usage, although we might want to compare the magnitudes of the time or memory consumed.

The testing tool I have primarily looked at using is the nbval extension to the py.test framework. This extension takes a notebook with pre-run “gold standard” cell outputs available, re-runs the notebook, and compares the newly generated cell outputs against the gold standard ones. It has limited support for tagging cells to allow outputs to be ignored, erroring cells to be identified and appropriately handled, or cell execution to be skipped altogether.
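For reference, stock nbval supports a handful of cell comment markers (with equivalent cell tags) that cover the ignore / expected-error / skip cases mentioned above; for example:

# NBVAL_IGNORE_OUTPUT    : run the cell but don't compare its output
# NBVAL_SKIP             : don't run the cell at all
# NBVAL_RAISES_EXCEPTION : the cell is expected to error
# NBVAL_CHECK_OUTPUT     : force output checking for this cell (e.g. when running with --nbval-lax)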

My own forked nbval package adds several “vague tests” to the test suite (some of my early tag extensions are described in Structural Testing of Jupyter Notebook Cell Outputs With nbval). For example, we can check that a cell output is a folium map, or an image with particular dimensions, or castable to a list of a certain length, or a dict with particular keys.

Other things that are useful to flag are warnings that are being raised as a consequence of the computational environment being updated.

To make testing easier, I’ve started working on a couple of sketch GitHub Actions in a private clone of our official private module team repo.

In the repo, the notebooks are arranged in weekly directories with a conventional directory name (Part XX Notebooks). The following manually triggered action provides a way of testing just the notebooks in a single week’s directory:

When the action is run, the notebooks are run against the environment pulled in as a Docker container (the container we want to test the materials against). Cell outputs are compared and an HTML report is generated using pytest-html; this report is uploaded as an action artefact and attached to the action run report.

name: nbval-test-week
#.github/workflows/nbval.yml
on:
  workflow_dispatch:
    inputs:
      week:
        type: choice
        description: Week to test
        options: 
        - "01"
        - "02"
        - "03"
        - "04"
        - "05"
        - "07"
        - "08"
        - "09"
        - "10"
        - "11"
        - "12"
        - "14"
        - "15"
        - "16"
        - "20"
        - "21"
        - "22"
        - "23"
        - "25"
        - "26"
      timeit:
        description: 'Skip timeit'
        type: boolean
      memit:
        description: 'Skip memit'
        type: boolean    
jobs:
  nbval-demo:
    runs-on: ubuntu-latest
    container:
      image: ouvocl/vce-tm351-monolith
    steps:
    - uses: actions/checkout@master
    - name: Install nbval (TH edition)
      run: |
        python3 -m pip install --upgrade https://github.com//ouseful-PR/nbval/archive/table-test.zip
        python3 -m pip install pytest-html
    - name: Restart postgres
      run: |
        sudo service postgresql restart
    - name: Start mongo
      run: |
        sudo mongod --fork --logpath /dev/stdout --dbpath ${MONGO_DB_PATH}
    - name: Test TM351 notebooks
      run: |
        memit=$INPUT_MEMIT
        timeit=$INPUT_TIMEIT
        nbval_flags=""
        if [ "$memit" = "true" ]; then
          nbval_flags="--nbval-skip-memit"
        fi
        if [ "$timeit" = "true" ]; then
          nbval_flags="$nbval_flags --nbval-skip-timeit"
        fi
        py.test --nbval $nbval_flags --html=report-week-${{ github.event.inputs.week }}.html --self-contained-html ./Part\ ${{ github.event.inputs.week }}*
      env:
        INPUT_MEMIT: ${{ github.event.inputs.memit }}
        INPUT_TIMEIT: ${{ github.event.inputs.timeit }}
    - name: Archive test results
      if: always()
      uses: actions/upload-artifact@v3
      with:
        name: nbval-test-report
        path: ./report-week-${{ github.event.inputs.week }}.html

We can then download and review the HTML report to identify which cells failed in which notebook. (The Action log also displays any errors.)

Another action can be used to test the notebooks used across the whole course.

On the to do list: declaring a set of possible Docker images that the user can choose from; an action to run all cells against a particular image to generate a set of automatically produced gold standard outputs; an action to compare the outputs from running the notebooks against one specified environment with the outputs generated by running them against a different specified environment. If we trust one particular environment for producing “correct” gold standard outputs, we can use that to generate the notebook outputs against which a second, development environment is being tested.
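A minimal sketch of how the “generate gold standard outputs” step might work, using nbformat and nbclient to execute each notebook in place so that its saved outputs become the reference nbval later compares against (the week glob and timeout are placeholders, and this isn’t how the current actions do it):

from pathlib import Path
import nbformat
from nbclient import NotebookClient

# Execute each notebook in a week's directory and write the run outputs back in place.
for nb_path in Path(".").glob("Part 02*/*.ipynb"):
    nb = nbformat.read(str(nb_path), as_version=4)
    NotebookClient(nb, timeout=600).execute()
    nbformat.write(nb, str(nb_path))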

NOTE: updates to notebooks may not be backwards compatible with previous environments; the aim is to drive the content of the notebooks forward so they run against the evolving “current best practice” environment, not so that they are necessarily backwards compatible with earlier environments. Ideally, a set of “correct” run notebooks from one presentation form the basis of the test for the next presentation; but even so, differences may arise that represent a “correct” output in the new environment. Hmmm, so maybe I need an nbval-passes tag that can be used to identify cells whose output can be ignored because the cell is known to produce a correct output in the new environment that doesn’t match the output from the previous environment and that can’t be handled by an outstanding vague test. Then when we create a new branch of the notebooks for a new presentation, those nbval-passes are stripped from the notebooks under the assumption they should, “going forward”, now pass correctly.

As I retroactively start tagging notebooks with a view to improving the meaningful test pass rate, several things come to mind:

  • the primary aim is to check that the notebooks provide appropriate results when run in a particular environment; a cell output does not necessarily need to exactly match a gold master output for it to be appropriate;
  • the pass rate of some cells could be improved by modifying the code; for example, displaying SQL queries or dataframes that have been sorted on a particular column or columns. In some cases this will not detract from the learning point being made in the cell, but in other cases it might;
  • adding new cell tags / tests can weaken or strengthen tests that are already available, although at the cost of introducing more tag types to manage; for example, the dataframe output test currently checks the dataframe size and column names match, BUT the columns do not necessarily need to be in the same order; this test could be strengthened by also checking column name order, or weakened by dropping the column name check altogether. We could also improve the strength by checking column types, for example;
  • some cells it’s perhaps just better to skip or ignore altogether; but in such cases, we should be able to report on which cells have been skipped or had their cell output ignored (so we can check whether a ‘failure’ could arise that might need to be addressed rather than ignored), or disable the “ignore” or “skip” behaviour to run a comprehensive test.

For the best test coverage, we would have 0 ignored output cells, 0 skipped cells, tests that are as strong as possible, no errors, no warnings, and no failures (where a failure is a failure of the matching test, either exact matching or one of my vague tests).

PS as well as tests, I am also looking at actions to support the distribution of notebooks; this includes things like checking for warnings, clearing output cells, making sure that cell toolbars are collapsed, making sure that activity answers are collapsed, etc etc. Checking toolbars and activity outputs are collapsed could be tests, or could be automatically run actions. Ideally, we should be able to automate the publication of a set of notebooks by:

  • running tests over the notebooks;
  • if all the tests pass, run the distribution actions;
  • create a distributable zip of ready-to-use notebook files etc.
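By way of illustration, the output-clearing step mentioned above might look something like the following sketch using nbformat (the filename is a placeholder, and this isn’t the actual distribution action):

import nbformat

# Strip outputs and execution counts from code cells before distribution.
nb = nbformat.read("notebook.ipynb", as_version=4)
for cell in nb.cells:
    if cell.cell_type == "code":
        cell.outputs = []
        cell.execution_count = None
nbformat.write(nb, "notebook.ipynb")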

In-Browser WASM Powered Postgres and DuckDB Fragments On the To Do List…

A quick note that I need to demo some simple educational material that shows how we can use postgres-wasm to drop a PostgreSQL terminal into an IFrame and run some activities.

We can also access a wasm powered db with proxied sockets, which means:

  • we can connect to the DB from something like pandas if we are running in an environment that supports socket connections (which pyodide/JupyterLite doesn’t), as sketched below;
  • we only need to run a simple proxy webservice alongside the http server that delivers the WASM bundle, rather than a full PostgreSQL server. Persistence is handled via browser storage, which means if the database is large, that may be the main hurdle…
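As a sketch of what the pandas route might look like against the proxied database (the connection string, credentials and table name are entirely made up; they depend on how the socket proxy is configured):

import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection details for a socket proxy sitting in front of postgres-wasm.
engine = create_engine("postgresql://postgres:postgres@localhost:5432/postgres")
df = pd.read_sql("SELECT * FROM some_table LIMIT 10;", engine)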

If we were just doing SQL data wrangling, it would possibly make more sense to use something like DuckDB. In passing, I note an experimental package that supports DuckDB inside JupyterLite — iqmo-org/jupylite_duckdb — to complement the “full fat” duckdb Python package:
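For the “full fat” duckdb package at least, querying over a pandas dataframe is about as simple as it gets (a sketch; I haven’t checked how much of this jupylite_duckdb currently exposes):

import duckdb
import pandas as pd

df = pd.DataFrame({"name": ["a", "b", "c"], "value": [1, 2, 3]})

# DuckDB can query the local dataframe in place by referring to it by name.
duckdb.query("SELECT name, value FROM df WHERE value > 1").df()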

However, for playing with things like roles and permissions, or more of the basic DB management functions, having a serverless PostgreSQL database is really handy. One thing it can’t (currently?) do, though, is support multiple concurrent connections, which means no playing with transactions? Although – maybe the proxied version can?! One to try…