Over the last three weeks or so I’ve been following the crazy rate of development around locally run large language models (LLMs), starting with llama.cpp, then alpaca and most recently (?!) gpt4all.
My laptop (a mid-2015 MacBook Pro, 16GB) was in the repair shop for over a week of that period, and it’s only really now that I’ve had even a quick chance to play, although I knew 10 days ago what sort of thing I wanted to try, and that has only really become off-the-shelf possible in the last couple of days.
The following script can be downloaded as a Jupyter notebook from this gist.
GPT4All Langchain Demo
Example of locally running GPT4All, a 4GB, llama.cpp based large language model (LLM), under [langchain](https://github.com/hwchase17/langchain), in a Jupyter notebook running a Python 3.10 kernel.
Tested on a mid-2015 16GB MacBook Pro, concurrently running Docker (a single container running a separate Jupyter server) and Chrome with approx. 40 open tabs.
Model preparation
- download gpt4all model:

#https://the-eye.eu/public/AI/models/nomic-ai/gpt4all/gpt4all-lora-quantized.bin
- download llama.cpp 7B model:

#%pip install pyllama
#!python3.10 -m llama.download --model_size 7B --folder llama/
- transform gpt4all model:

#%pip install pyllamacpp
#!pyllamacpp-convert-gpt4all ./gpt4all-main/chat/gpt4all-lora-quantized.bin llama/tokenizer.model ./gpt4all-main/chat/gpt4all-lora-q-converted.bin
GPT4ALL_MODEL_PATH = "./gpt4all-main/chat/gpt4all-lora-q-converted.bin"
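Before going any further, it’s worth a quick check that the converted model file actually ended up where we expect it (a simple convenience check on the path set above):

from pathlib import Path

# Quick sanity check: does the converted gpt4all model file exist, and how big is it?
model_file = Path(GPT4ALL_MODEL_PATH)
if model_file.exists():
    print(f"Found {model_file} ({model_file.stat().st_size / 1e9:.2f} GB)")
else:
    print(f"Missing {model_file} - rerun the conversion step above")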
langchain Demo
Example of running a prompt using langchain.
#https://python.langchain.com/en/latest/ecosystem/llamacpp.html
#%pip uninstall -y langchain
#%pip install --upgrade git+https://github.com/hwchase17/langchain.git

from langchain.llms import LlamaCpp
from langchain import PromptTemplate, LLMChain
- set up prompt template:
template = """ Question: {question} Answer: Let's think step by step. """
prompt = PromptTemplate(template=template, input_variables=["question"])
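To see exactly what text will be sent to the model, we can render the template with a sample question (using the PromptTemplate format() method):

# Render the prompt template with an example question to preview the full prompt text
print(prompt.format(question="What NFL team won the Super Bowl in the year Justin Bieber was born?"))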
- load model:
%%time
llm = LlamaCpp(model_path=GPT4ALL_MODEL_PATH)

llama_model_load: loading model from './gpt4all-main/chat/gpt4all-lora-q-converted.bin' - please wait ...
llama_model_load: n_vocab = 32001
llama_model_load: n_ctx = 512
llama_model_load: n_embd = 4096
llama_model_load: n_mult = 256
llama_model_load: n_head = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot = 128
llama_model_load: f16 = 2
llama_model_load: n_ff = 11008
llama_model_load: n_parts = 1
llama_model_load: type = 1
llama_model_load: ggml map size = 4017.70 MB
llama_model_load: ggml ctx size = 81.25 KB
llama_model_load: mem required = 5809.78 MB (+ 2052.00 MB per state)
llama_model_load: loading tensors from './gpt4all-main/chat/gpt4all-lora-q-converted.bin'
llama_model_load: model size = 4017.27 MB / num tensors = 291
llama_init_from_file: kv self size = 512.00 MB
CPU times: user 572 ms, sys: 711 ms, total: 1.28 s Wall time: 1.42 s
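Generation with the defaults is fairly slow on this machine, so it may be worth experimenting with the wrapper’s constructor parameters; for example, something along the following lines (parameter names as I understand the langchain LlamaCpp wrapper, so worth double checking against the current docs):

#llm = LlamaCpp(
#    model_path=GPT4ALL_MODEL_PATH,
#    n_threads=4,       # number of CPU threads to use
#    n_ctx=512,         # context window size (tokens)
#    max_tokens=256,    # maximum number of tokens to generate
#    temperature=0.8,   # sampling temperature
#)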
- create language chain using prompt template and loaded model:
llm_chain = LLMChain(prompt=prompt, llm=llm)
- run prompt:
%%time
question = "What NFL team won the Super Bowl in the year Justin Bieber was born?"

llm_chain.run(question)
CPU times: user 5min 2s, sys: 4.17 s, total: 5min 6s Wall time: 43.7 s
'1) The year Justin Bieber was born (2005):\n2) Justin Bieber was born on March 1, 1994:\n3) The Buffalo Bills won Super Bowl XXVIII over the Dallas Cowboys in 1994:\nTherefore, the NFL team that won the Super Bowl in the year Justin Bieber was born is the Buffalo Bills.'
Another example…
template2 = """
Question: {question}

Answer: """

prompt2 = PromptTemplate(template=template2, input_variables=["question"])

llm_chain2 = LLMChain(prompt=prompt2, llm=llm)
%%time
question2 = "What is a relational database and what is ACID in that context?"

llm_chain2.run(question2)
CPU times: user 14min 37s, sys: 5.56 s, total: 14min 42s Wall time: 2min 4s
"A relational database is a type of database management system (DBMS) that stores data in tables where each row represents one entity or object (e.g., customer, order, or product), and each column represents a property or attribute of the entity (e.g., first name, last name, email address, or shipping address).\n\nACID stands for Atomicity, Consistency, Isolation, Durability:\n\nAtomicity: The transaction's effects are either all applied or none at all; it cannot be partially applied. For example, if a customer payment is made but not authorized by the bank, then the entire transaction should fail and no changes should be committed to the database.\nConsistency: Once a transaction has been committed, its effects should be durable (i.e., not lost), and no two transactions can access data in an inconsistent state. For example, if one transaction is in progress while another transaction attempts to update the same data, both transactions should fail.\nIsolation: Each transaction should execute without interference from other concurrently executing transactions, thereby ensuring its properties are applied atomically and consistently. For example, two transactions cannot affect each other's data"
Generating Embeddings
We can use the llama.cpp model to generate embeddings.
#https://abetlen.github.io/llama-cpp-python/
#%pip uninstall -y llama-cpp-python
#%pip install --upgrade llama-cpp-python

from langchain.embeddings import LlamaCppEmbeddings
llama = LlamaCppEmbeddings(model_path=GPT4ALL_MODEL_PATH)
llama_model_load: loading model from './gpt4all-main/chat/gpt4all-lora-q-converted.bin' - please wait ...
llama_model_load: n_vocab = 32001
llama_model_load: n_ctx = 512
llama_model_load: n_embd = 4096
llama_model_load: n_mult = 256
llama_model_load: n_head = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot = 128
llama_model_load: f16 = 2
llama_model_load: n_ff = 11008
llama_model_load: n_parts = 1
llama_model_load: type = 1
llama_model_load: ggml map size = 4017.70 MB
llama_model_load: ggml ctx size = 81.25 KB
llama_model_load: mem required = 5809.78 MB (+ 2052.00 MB per state)
llama_model_load: loading tensors from './gpt4all-main/chat/gpt4all-lora-q-converted.bin'
llama_model_load: model size = 4017.27 MB / num tensors = 291
llama_init_from_file: kv self size = 512.00 MB
%%time
text = "This is a test document."

query_result = llama.embed_query(text)
CPU times: user 12.9 s, sys: 1.57 s, total: 14.5 s Wall time: 2.13 s
%%time
doc_result = llama.embed_documents([text])
CPU times: user 10.4 s, sys: 59.7 ms, total: 10.4 s Wall time: 1.47 s
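The embeddings come back as plain Python lists of floats (embed_documents returns one list per document), so we can play with them directly; for example, a rough and ready cosine similarity between two embedded sentences (numpy assumed to be available):

import numpy as np

# Embed two short sentences and compare them with cosine similarity
e1 = np.array(llama.embed_query("The cat sat on the mat."))
e2 = np.array(llama.embed_query("A dog slept on the rug."))

print(np.dot(e1, e2) / (np.linalg.norm(e1) * np.linalg.norm(e2)))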
Next up, I’ll try to create a simple database using the llama embeddings and then try to run a Q&A prompt against a source document…
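The sort of thing I have in mind looks roughly like the following sketch (assuming langchain’s load_qa_chain; the “stuff” chain type simply pastes the source document text into the prompt, so only a small document will fit in the 512 token context):

from langchain.chains.question_answering import load_qa_chain
from langchain.docstore.document import Document

# A tiny, made-up source document to query against
docs = [Document(page_content="A relational database stores data in tables made up of rows and columns.")]

# The "stuff" chain stuffs the document text into the prompt alongside the question
qa_chain = load_qa_chain(llm, chain_type="stuff")
qa_chain.run(input_documents=docs, question="How does a relational database store data?")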
PS See also this example of running a query against GPT4All in langchain in the context of a single, small document knowledge source.
Thanks for the tutorial, but I couldn’t obtain the tokenizer via the command

python3 -m llama.download --model_size 7B --folder llama/

I got a ModuleNotFoundError. My version of Python is 3.8.10; could this be a problem?
I solved this problem: I ran

pip install pyllama -U

instead of just

pip install pyllama

and then

python -m llama.download --model_size 7B