I had a quick play yesterday tinkering with my storynotes side project, creating a couple more scrapers over traditional tale collections, specifically World of Tales and Laura Gibbs’ Tiny Tales books. The scrapers pull the stories into simple SQLite databases to support the discovery of particular stories from free text search terms. (I’ve also been exploring simple doc2vec based semantic search strategies over the data.)
See the technical recipes: World of Tales scraper; Tiny Tales scraper.
The databases I’ve been using aren’t very consistent in the table structures used to store the scraped data, and thus far I have tended to search the databases separately. But when I came across Simon Willison’s dogsheep-beta demo, which creates a free text meta-search database over several distinct databases using SQLite’s ATTACH method (example) for connecting to multiple databases, I thought I’d have a quick play to see if I could re-use that method to create my own meta-search database.
And it turns out, I can. Really easily…
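The trick hinges on SQLite’s ATTACH, which lets a single connection query tables across several database files at once. A minimal stdlib sketch of the idea, with the table and column names invented for illustration:

```python
import os
import sqlite3
import tempfile

# Build two toy story databases, loosely modelled on the scraped ones.
tmpdir = tempfile.mkdtemp()
lang_db = os.path.join(tmpdir, "lang_demo.db")
jacobs_db = os.path.join(tmpdir, "jacobs_demo.db")

con = sqlite3.connect(lang_db)
con.execute("CREATE TABLE books (title TEXT, text TEXT)")
con.execute("INSERT INTO books VALUES ('Cinderella', 'Once upon a time...')")
con.commit()
con.close()

con = sqlite3.connect(jacobs_db)
con.execute("CREATE TABLE stories (story_title TEXT, story_text TEXT)")
con.execute("INSERT INTO stories VALUES ('Tom Tit Tot', 'Once there was...')")
con.commit()
con.close()

# ATTACH lets one connection query across both database files,
# with the second database addressed via its alias.
con = sqlite3.connect(lang_db)
con.execute(f"ATTACH DATABASE '{jacobs_db}' AS jacobs")
rows = con.execute("""
    SELECT title FROM books
    UNION ALL
    SELECT story_title FROM jacobs.stories
""").fetchall()
print(sorted(rows))  # [('Cinderella',), ('Tom Tit Tot',)]
con.close()
```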
The dogsheep-beta demo essentially bundles two things:
- a tool for constructing a single, full text searchable database from the content of one or more database tables in one or more databases;
- a datasette extension for rendering a faceted full text search page over the combined database.
The index builder is based around a YAML config file that contains a series of entries, each describing a query onto a database that returns the searchable data in a standard form. The standardised columns are `type` (in the dogsheep world, this is the original data source, eg a tweet, or a github record), `key` (a unique key for the record), `title`, `timestamp`, `category`, `is_public`, and `search_1`, `search_2`, `search_3`.
I forked the repo to tweak these columns slightly, changing `timestamp` to `pubdate` and adding a new `book` column to replace `type` with the original book from which a story came; the database records I’m interested in searching over are individual stories or tales, but it can be useful to know the original source text. I probably also need to be able to support something like a link to the original text, but for now I’m interested in a minimum viable search tool.
Items in the config file have the form:
DATABASE_FILE.db:
    DB_LABEL:
        sql: |-
            select
                MY_ID as key,
                title,
                MY_TEXT as search_1
            from MY_TABLE
The `select` MUST be in lower case (a hack in the code searches for the first `select` in the provided query as part of a query rewrite). Also, the query MUST NOT end with a `;`, as the aforementioned query rewrite appends `LIMIT 0` to the original query en route to identifying the column headings.
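To see why that rewrite works, here’s a minimal sketch (not the dogsheep-beta code itself) of the column-discovery trick: appending `LIMIT 0` runs the query without returning any rows, but the cursor still reports the output column names.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tales (id INTEGER, title TEXT, text TEXT)")

# A config-style query: lowercase select, no trailing semicolon
sql = "select id as key, title, text as search_1 from tales"

# LIMIT 0 executes the query but fetches no rows;
# cursor.description still carries the output column names.
cur = con.execute(sql + " LIMIT 0")
cols = [d[0] for d in cur.description]
print(cols)  # ['key', 'title', 'search_1']
```

A trailing `;` would make the rewritten query `... from tales; LIMIT 0`, which is a syntax error, hence the constraint.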
Here’s the config file I used for creating my metasearch database:
lang_fairy_tale.db:
    lang:
        sql: |-
            select
                "lang::" || replace(title, " ", "") as key,
                title,
                book,
                text as search_1
            from books

jacobs_fairy_tale.db:
    jacobs:
        sql: |-
            select
                "jacobs::" || replace(story_title, " ", "") as key,
                story_title as title,
                book_title as book,
                story_text as search_1
            from stories

word_of_tales.db:
    world_of_tales:
        sql: |-
            select
                "wot::" || replace(title, " ", "") as key,
                title,
                book,
                text as search_1
            from tales

ashliman_demo.db:
    ashliman:
        sql: |-
            select
                "dash::" || replace(title, " ", "") as key,
                title,
                metadata as book,
                text as search_1
            from ashliman_stories

mtdf_demo.db:
    mtdf:
        sql: |-
            select
                "mtdf::" || replace(title, " ", "") as key,
                title,
                text as search_1
            from english_stories
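With the config file saved, building the combined database is a one-liner; this assumes my fork keeps the CLI shape of the original dogsheep-beta tool:

```shell
# Build the combined FTS database from the YAML config
# (command shape per the original dogsheep-beta README;
# the fork's entry point may differ)
dogsheep-beta index metasearch.db metasearch.yml

# Serve the result, with the dogsheep-beta plugin installed
datasette metasearch.db
```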
Using the datasette UI to query the FTS table, with a slightly tweaked SQL query, we can get something like the following:
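Under the hood, that query boils down to an FTS5 `MATCH` over the combined index. A stripped-down stdlib sketch of the pattern, with the table name and rows invented for illustration (this assumes your Python’s SQLite is built with FTS5, which is typical of modern builds):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# A minimal FTS5 table standing in for the combined search index
con.execute(
    "CREATE VIRTUAL TABLE search_index USING fts5(key, title, book, search_1)"
)
con.executemany(
    "INSERT INTO search_index VALUES (?, ?, ?, ?)",
    [
        ("lang::Cinderella", "Cinderella", "The Blue Fairy Book", "Once upon a time..."),
        ("wot::TomTitTot", "Tom Tit Tot", "English Fairy Tales", "Once there was..."),
    ],
)

# Match against all indexed columns, best matches first
rows = con.execute(
    "SELECT key, title FROM search_index WHERE search_index MATCH ? ORDER BY rank",
    ("cinderella",),
).fetchall()
print(rows)  # [('lang::Cinderella', 'Cinderella')]
```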
One of the issues with even a free text search strategy is that the search terms must appear in the searched text if they are to return a result (we can get round this slightly by using things like stemming to reduce a word to its stem). However, it’s not too hard to generate a simple semantic search over a corpus using doc2vec, as this old demo shows. Unfortunately, the `vtfunc` trick that demo relies on seems to have rotted in Python 3.10 [update: ensuring `Cython` is installed fixes the `vtfunc` install; for another Python wrapper for SQLite, see also `apsw`]; there may be an alternative way to access a `TableFunction` via the `peewee` package, but that seems tightly bound to a particular database object, and on the quickest of plays, I couldn’t get it to play nice with `sqlite_utils` or `pandas.read_sql()`.
What I’m thinking is, it would be really handy to have a template repo associated with `sqlite_utils` / `datasette` that provides tools to:

- create a metasearch database (`dogsheep-beta` pretty much does this, but could be generalised to be more flexible in defining/naming required columns, etc.);
- provide an easily customisable `index.html` template for `datasette` that gives you a simple free text search UI (related issue);
- provide a simple tool that will build a doc2vec table, register vector handlers (related issue) and a custom semantic search function; the `index.html` template should then give you the option of running a free text search or a semantic search;
- (it might also be handy to support other custom fuzzy search functions (example);)
- a simpler config where you select free text and/or semantic search, the index/doc2vec builders are applied appropriately, and the `index.html` template serves queries appropriately.
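As a taster for what a custom fuzzy search function might look like, here’s a sketch using a scalar function registered on a plain `sqlite3` connection with stdlib `difflib` (a scalar rather than a table-valued function, so simpler than the `TableFunction` approach discussed below, but it shows the general idea of adding custom search logic to a connection):

```python
import sqlite3
from difflib import SequenceMatcher

def fuzzy(a, b):
    """Return a 0..1 similarity score between two strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

con = sqlite3.connect(":memory:")
# Register the Python function as a two-argument SQL function
con.create_function("fuzzy", 2, fuzzy)

con.execute("CREATE TABLE tales (title TEXT)")
con.executemany(
    "INSERT INTO tales VALUES (?)",
    [("Cinderella",), ("Tom Tit Tot",), ("Rumpelstiltskin",)],
)

# Rank titles by similarity to a misspelled search term
rows = con.execute("""
    SELECT title, round(fuzzy(title, 'Rumplestiltskin'), 2) AS score
    FROM tales ORDER BY score DESC
""").fetchall()
print(rows[0][0])  # Rumpelstiltskin
```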
PS I posted this related item to the `sqlite_utils` discord server:
SQLite supports the creation of virtual table functions that allow you to define custom SQL functions that can return a table.
The coleifer/sqlite-vtfunc package provided a fairly straightforward way of defining custom table functions (eg I used it to create a custom fuzzy search function here). The recipe is to define a class `class My_Custom_Function(TableFunction)` and then register it on a connection (`My_Custom_Function.register(db.conn)`). This worked fine with `sqlite_utils` database connections and meant you could easily add custom functions to a running server.
However, that package is no longer maintained and seems to break on install under at least Python 3.10?
An equivalent(ish) function is provided by the `peewee` package (functions imported from `playhouse.sqlite_ext`), but it seems to be rather more tightly bound to a `SqliteDatabase` object, rather than simply being registered to a database connection.
Is there any support for registering such table returning functions in sqlite_utils?