Running GPT4All On a Mac Using Python langchain in a Jupyter Notebook

Over the last three weeks or so I’ve been following the crazy rate of development around locally run large language models (LLMs), starting with llama.cpp, then alpaca and most recently (?!) gpt4all.

My laptop (a mid-2015 MacBook Pro, 16GB) was in the repair shop for over a week of that period, so it’s only now that I’ve had even a quick chance to play. I knew 10 days ago what sort of thing I wanted to try, but it has only become possible off the shelf in the last couple of days.

The following script can be downloaded as a Jupyter notebook from this gist.

GPT4All Langchain Demo

Example of locally running GPT4All, a 4GB, llama.cpp based large language model (LLM), under langchain (https://github.com/hwchase17/langchain), in a Jupyter notebook running a Python 3.10 kernel.

Tested on a mid-2015 16GB MacBook Pro, concurrently running Docker (a single container running a separate Jupyter server) and Chrome with approx. 40 open tabs.

Model preparation

  • download gpt4all model:
#https://the-eye.eu/public/AI/models/nomic-ai/gpt4all/gpt4all-lora-quantized.bin
  • download llama.cpp 7B model
#%pip install pyllama
#!python3.10 -m llama.download --model_size 7B --folder llama/
  • transform gpt4all model:
#%pip install pyllamacpp
#!pyllamacpp-convert-gpt4all ./gpt4all-main/chat/gpt4all-lora-quantized.bin llama/tokenizer.model ./gpt4all-main/chat/gpt4all-lora-q-converted.bin
GPT4ALL_MODEL_PATH = "./gpt4all-main/chat/gpt4all-lora-q-converted.bin"
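
Before loading the model it can be worth confirming that the converted file really is where the path says it is. The following is a minimal sketch of such a check, not part of the original notebook; the helper name describe_model_file is hypothetical and the path is simply the one used above.

```python
from pathlib import Path

# Same path as used in this notebook; adjust it to wherever your converted model lives.
GPT4ALL_MODEL_PATH = "./gpt4all-main/chat/gpt4all-lora-q-converted.bin"


def describe_model_file(path):
    """Return a short status string for a model file path (hypothetical helper)."""
    p = Path(path)
    if not p.is_file():
        return f"missing: {p}"
    # Report the size in MB so it can be compared with the loader's "model size" line.
    size_mb = p.stat().st_size / (1024 * 1024)
    return f"found: {p} ({size_mb:.2f} MB)"


print(describe_model_file(GPT4ALL_MODEL_PATH))
```

If the file is reported as missing, the conversion step above probably did not complete, or was run from a different working directory.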

langchain Demo

Example of running a prompt using langchain.

#https://python.langchain.com/en/latest/ecosystem/llamacpp.html
#%pip uninstall -y langchain
#%pip install --upgrade git+https://github.com/hwchase17/langchain.git

from langchain.llms import LlamaCpp
from langchain import PromptTemplate, LLMChain
  • set up prompt template:
template = """

Question: {question}
Answer: Let's think step by step.

"""
prompt = PromptTemplate(template=template, input_variables=["question"])
  • load model:
%%time
llm = LlamaCpp(model_path=GPT4ALL_MODEL_PATH)

llama_model_load: loading model from './gpt4all-main/chat/gpt4all-lora-q-converted.bin' - please wait ...
llama_model_load: n_vocab = 32001
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 4096
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 11008
llama_model_load: n_parts = 1
llama_model_load: type    = 1
llama_model_load: ggml map size = 4017.70 MB
llama_model_load: ggml ctx size =  81.25 KB
llama_model_load: mem required  = 5809.78 MB (+ 2052.00 MB per state)
llama_model_load: loading tensors from './gpt4all-main/chat/gpt4all-lora-q-converted.bin'
llama_model_load: model size =  4017.27 MB / num tensors = 291
llama_init_from_file: kv self size  =  512.00 MB
CPU times: user 572 ms, sys: 711 ms, total: 1.28 s
Wall time: 1.42 s
  • create language chain using prompt template and loaded model:
llm_chain = LLMChain(prompt=prompt, llm=llm)
  • run prompt:
%%time
question = "What NFL team won the Super Bowl in the year Justin Bieber was born?"
llm_chain.run(question)
CPU times: user 5min 2s, sys: 4.17 s, total: 5min 6s
Wall time: 43.7 s
'1) The year Justin Bieber was born (2005):\n2) Justin Bieber was born on March 1, 1994:\n3) The Buffalo Bills won Super Bowl XXVIII over the Dallas Cowboys in 1994:\nTherefore, the NFL team that won the Super Bowl in the year Justin Bieber was born is the Buffalo Bills.'
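
To see what text is actually handed to the model, the prompt template step can be approximated in plain Python. This is only a sketch: it assumes the template is filled by ordinary str.format style substitution of {question}, and the helper name fill_template is hypothetical rather than part of langchain.

```python
# The same template text as used for the first chain above.
template = """

Question: {question}
Answer: Let's think step by step.

"""


def fill_template(template, **values):
    """Substitute named placeholders such as {question} (hypothetical helper)."""
    return template.format(**values)


question = "What NFL team won the Super Bowl in the year Justin Bieber was born?"
filled = fill_template(template, question=question)
print(filled)
```

The model then continues the text from the trailing "Answer: Let's think step by step." line, which is why the reply reads as a numbered chain of reasoning.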

Another example…

template2 = """

Question: {question}
Answer:  

"""

prompt2 = PromptTemplate(template=template2, input_variables=["question"])

llm_chain2 = LLMChain(prompt=prompt2, llm=llm)
%%time
question2 = "What is a relational database and what is ACID in that context?"
llm_chain2.run(question2)
CPU times: user 14min 37s, sys: 5.56 s, total: 14min 42s
Wall time: 2min 4s
"A relational database is a type of database management system (DBMS) that stores data in tables where each row represents one entity or object (e.g., customer, order, or product), and each column represents a property or attribute of the entity (e.g., first name, last name, email address, or shipping address).\n\nACID stands for Atomicity, Consistency, Isolation, Durability:\n\nAtomicity: The transaction's effects are either all applied or none at all; it cannot be partially applied. For example, if a customer payment is made but not authorized by the bank, then the entire transaction should fail and no changes should be committed to the database.\nConsistency: Once a transaction has been committed, its effects should be durable (i.e., not lost), and no two transactions can access data in an inconsistent state. For example, if one transaction is in progress while another transaction attempts to update the same data, both transactions should fail.\nIsolation: Each transaction should execute without interference from other concurrently executing transactions, thereby ensuring its properties are applied atomically and consistently. For example, two transactions cannot affect each other's data"
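
The %%time reports for the two runs show CPU time well in excess of wall time. A quick piece of arithmetic on the figures above makes the ratio explicit; reading that ratio as a sign that the work is being spread across several threads is my interpretation, not something the notebook measures directly.

```python
def to_seconds(minutes=0, seconds=0.0):
    """Convert a minutes/seconds pair, as printed by %%time, into seconds."""
    return minutes * 60 + seconds


# (total CPU time, wall time) copied from the two %%time reports above.
runs = {
    "Super Bowl question": (to_seconds(5, 6), to_seconds(0, 43.7)),
    "ACID question": (to_seconds(14, 42), to_seconds(2, 4)),
}

# Ratio of CPU seconds consumed to elapsed seconds for each run.
ratios = {name: cpu / wall for name, (cpu, wall) in runs.items()}

for name, ratio in ratios.items():
    print(f"{name}: CPU/wall = {ratio:.1f}")
```

Both runs come out at a ratio of roughly 7.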

Generating Embeddings

We can use the llama.cpp model to generate embeddings.

#https://abetlen.github.io/llama-cpp-python/
#%pip uninstall -y llama-cpp-python
#%pip install --upgrade llama-cpp-python

from langchain.embeddings import LlamaCppEmbeddings
llama = LlamaCppEmbeddings(model_path=GPT4ALL_MODEL_PATH)
llama_model_load: loading model from './gpt4all-main/chat/gpt4all-lora-q-converted.bin' - please wait ...
llama_model_load: n_vocab = 32001
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 4096
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 11008
llama_model_load: n_parts = 1
llama_model_load: type    = 1
llama_model_load: ggml map size = 4017.70 MB
llama_model_load: ggml ctx size =  81.25 KB
llama_model_load: mem required  = 5809.78 MB (+ 2052.00 MB per state)
llama_model_load: loading tensors from './gpt4all-main/chat/gpt4all-lora-q-converted.bin'
llama_model_load: model size =  4017.27 MB / num tensors = 291
llama_init_from_file: kv self size  =  512.00 MB
%%time
text = "This is a test document."
query_result = llama.embed_query(text)
CPU times: user 12.9 s, sys: 1.57 s, total: 14.5 s
Wall time: 2.13 s
%%time
doc_result = llama.embed_documents([text])
CPU times: user 10.4 s, sys: 59.7 ms, total: 10.4 s
Wall time: 1.47 s
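
One common way to use embeddings like these is to compare them by cosine similarity, for example to find the stored document closest to a query. The sketch below is not from the original notebook: the three-element vectors are toy stand-ins for real embeddings, and it assumes each embedding is a plain list of floats.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors of floats."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # A zero vector has no direction; treat it as dissimilar to everything.
        return 0.0
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real embeddings (assumed, for illustration only).
query_vec = [1.0, 0.0, 1.0]
doc_vecs = {"doc_a": [1.0, 0.0, 1.0], "doc_b": [0.0, 1.0, 0.0]}

# Rank the documents from most to least similar to the query.
ranked = sorted(
    doc_vecs,
    key=lambda name: cosine_similarity(query_vec, doc_vecs[name]),
    reverse=True,
)
print(ranked)
```

With real embeddings, query_vec would be the query_result computed above and doc_vecs would hold entries taken from doc_result.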

Next up, I’ll try to create a simple db using the llama embeddings and then try to run a QandA prompt against a source document…

PS See also this example of running a query against GPT4All in langchain in the context of a single, small, document knowledge source.

Author: Tony Hirst

I'm a Senior Lecturer at The Open University, with an interest in #opendata policy and practice, as well as general web tinkering...

3 thoughts on “Running GPT4All On a Mac Using Python langchain in a Jupyter Notebook”

  1. Thanks for the tutorial, but I couldn’t obtain tokenizer via command
    python3 -m llama.download --model_size 7B --folder llama/

    got ModuleNotFoundError. My version of python is 3.8.10 could this be a problem?

    1. I solved this problem, ran
      pip install pyllama -U

      instead of just

      pip install pyllama

      and then
      python -m llama.download --model_size 7B

  2. #%pip install pyllamacpp
    #!pyllamacpp-convert-gpt4all ./gpt4all-main/chat/gpt4all-lora-quantized.bin

    llama/tokenizer.model ./gpt4all-main/chat/gpt4all-lora-q-converted.bin

    this does not compile in jupyter
