Search Assist With ChatGPT

Via my feeds, a tweet from @john_lam:

The tools for prototyping ideas are SO GOOD right now. This afternoon, I made a “citations needed” bot for automatically adding citations to the stuff that ChatGPT makes up

A corresponding gist is here.

Having spent a few minutes prior to that doing a “traditional” search using good old fashioned search terms and the Google scholar search engine to try to find out how defendants in English trials of the early 19th century could challenge jurors (Brown, R. Blake. “Challenges for Cause, Stand-Asides, and Peremptory Challenges in the Nineteenth Century.” Osgoode Hall Law Journal 38.3 (2000) : 453-494, looks relevant), I wondered whether ChatGPT, and a John Lam’s search assist, might have been able to support the process:

Firstly, can ChatGPT help answer the question directly?

Secondly, can ChatGPT provide some search queries to help track down references?

The original rationale for the JSON based response was so that this could be used as part of an automated citation generator.

So this gives us a pattern of: write a prompt, get a response, request search queries relating to key points in response.

Suppose, however, that you have a set of documents on a topic and that you would like to be able to ask questions around them using something like ChatGPT. I note that Simon Willison has just posted a recipe on this topic — How to implement Q&A against your documentation with GPT3, embeddings and Datasette — that independently takes a similar approach to a recipe described in OpenAI’s cookbook: Question Answering using Embeddings.

The recipe begins with a semantic search of a set of papers. This is done by generating an embdding for the documents you want to search over using the OpenAI embeddings API, though we could roll our own that runs locally, albeit with a smaller model. (For example, here’s a recipe for a simple doc2vec powered semantic search.) To perform a semantic search, you find the embedding of the search query and then find near embeddings generated from your source documents to provide the results. To speed up this part of the process in datasette, Simon created the datasette-faiss plugin to use FAISS .

The content of the discovered documents are then used to seed a ChatGPT prompt with some “context”, and the question is applied to that context. So the recipe is something like: use a query to find some relavant documents, grab the content of those documents as context, then create a ChatGPT prompt of the form “given {context}, and this question: {question}”.

It shouldn’t be too difficult to hack together a think that runs this pattern against OU-XML materials. In other words:

  • generate simple text docs from OU-XML (I have scrappy recipes for this already);
  • build a semantic search engine around those docs (useful anyway, and I can reuse my doc2vec thing);
  • build a chatgpt query around a contextualised query, where the context is pulled from the semantic search results. (I wonder, has anyone built a chatgpt like thing around an opensource gpt2 model?)

PS another source of data / facts are data tables. There are various packages out there that claim to provide natural language query support for interrogating tabular data eg abhijithneilabraham/tableQA, and this review article, or the Higging Face table-question-answering transformer, but I forget which I’ve played with. Maybe I should write a new RallyDataJunkie unbook that demonstrates those sort of tool around tabulated rally results data?

Author: Tony Hirst

I'm a Senior Lecturer at The Open University, with an interest in #opendata policy and practice, as well as general web tinkering...

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.

%d bloggers like this: