I had a quick play yesterday tinkering with my storynotes side project, creating a couple more scrapers over traditional tale collections, specifically World of Tales and Laura Gibbs’ Tiny Tales books. The scrapers pull the stories into SQLite databases that support the discovery of particular stories via simple free text search. (I’ve also been exploring simple doc2vec based semantic search strategies over the data.)
The databases I’ve been using aren’t very consistent in their table structure, and thus far I have tended to search them separately. But when I came across Simon Willison’s dogsheep-beta demo, which creates a free text meta-search database over several distinct databases using SQLite’s ATTACH method (example) for connecting to multiple databases, I thought I’d have a quick play to see if I could re-use that method to create my own meta-search database.
And it turns out, I can. Really easily…
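The ATTACH mechanism is what makes this work: a single connection can mount several database files at once and query across them in one statement. A minimal sketch using stdlib sqlite3, with made-up file, table and column names rather than the project’s actual databases:

```python
import os
import sqlite3
import tempfile

# Create two separate toy database files, standing in for two scraped collections.
tmpdir = tempfile.mkdtemp()
db1 = os.path.join(tmpdir, "lang_fairy_tale.db")
db2 = os.path.join(tmpdir, "world_of_tales.db")

for path, table, rows in [
    (db1, "books", [("The Blue Fairy Book", "Cinderella")]),
    (db2, "tales", [("World of Tales", "The Fox and the Grapes")]),
]:
    conn = sqlite3.connect(path)
    conn.execute(f"CREATE TABLE {table} (book TEXT, title TEXT)")
    conn.executemany(f"INSERT INTO {table} VALUES (?, ?)", rows)
    conn.commit()
    conn.close()

# Open one database, ATTACH the other under an alias...
conn = sqlite3.connect(db1)
conn.execute("ATTACH DATABASE ? AS wot", (db2,))

# ...and query over both files in a single statement.
rows = conn.execute(
    "SELECT book, title FROM books UNION ALL SELECT book, title FROM wot.tales"
).fetchall()
```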
The dogsheep-beta demo essentially bundles two things:
- a tool for constructing a single, full text searchable database from the content of one or more database tables in one or more databases;
- a datasette extension for rendering a faceted full text search page over the combined database.
The index builder is based around a YAML config file that contains a series of entries, each describing a query onto a database that returns the searchable data in a standard form. The standardised columns are
- type (in the dogsheep world, this is the original data source, e.g. a tweet or a github record);
- key (a unique key for the record);
- title (a title for the record);
- timestamp (a timestamp associated with the record);
- search_1 (the searchable text).
I forked the repo to tweak these columns slightly, renaming timestamp as pubdate and adding a new book column to replace type, recording the original book from which a story came; the database records I’m interested in searching over are individual stories or tales, but it can be useful to know the original source text. I probably also need to be able to support something like a link to the original text, but for now I’m interested in a minimum viable search tool.
Items in the config file have the form:

```yaml
DATABASE_FILE.db:
    DB_LABEL:
        sql: |-
            select MY_ID as key, title, MY_TEXT as search_1
            from MY_TABLE
```
The select MUST be in lower case (a hack in the code searches for the first select in the provided query as part of a query rewrite). Also, the query MUST NOT end with a ;, because the aforementioned query rewrite appends LIMIT 0 to the original query en route to identifying the column headings.
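The LIMIT 0 rewrite works because a query that returns no rows still reports its output column names via the DB API cursor. A toy sketch of the idea (not dogsheep-beta’s actual code):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stories (id INTEGER, title TEXT, text TEXT)")

# A configured query: lower case select, no trailing ";".
query = "select id as key, title, text as search_1 from stories"

# Appending LIMIT 0 returns no rows, but cursor.description still
# carries the output column headings; a trailing ";" on the original
# query would turn this into an invalid two-statement string.
cursor = conn.execute(query + " LIMIT 0")
columns = [c[0] for c in cursor.description]
# columns == ['key', 'title', 'search_1']
```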
Here’s the config file I used for creating my metasearch database:
```yaml
lang_fairy_tale.db:
    lang:
        sql: |-
            select "lang::" || replace(title, " ", "") as key,
                title, book, text as search_1
            from books
jacobs_fairy_tale.db:
    jacobs:
        sql: |-
            select "jacobs::" || replace(story_title, " ", "") as key,
                story_title as title, book_title as book,
                story_text as search_1
            from stories
word_of_tales.db:
    world_of_tales:
        sql: |-
            select "wot::" || replace(title, " ", "") as key,
                title, book, text as search_1
            from tales
ashliman_demo.db:
    ashliman:
        sql: |-
            select "dash::" || replace(title, " ", "") as key,
                title as title, metadata as book, text as search_1
            from ashliman_stories
mtdf_demo.db:
    mtdf:
        sql: |-
            select "mtdf::" || replace(title, " ", "") as key,
                title as title, text as search_1
            from english_stories
```
Using the datasette UI to query the FTS table, with a slightly tweaked SQL query, we can get something like the following:
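As a toy illustration of that kind of query, here is an in-memory stand-in for the combined index, built as an FTS5 table with the standardised columns (the table name and sample rows are made up for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A stand-in for the combined metasearch index.
conn.execute(
    "CREATE VIRTUAL TABLE search_index USING fts5(key, title, book, search_1)"
)
conn.executemany(
    "INSERT INTO search_index VALUES (?, ?, ?, ?)",
    [
        ("lang::Cinderella", "Cinderella", "The Blue Fairy Book",
         "Once upon a time there was a glass slipper..."),
        ("wot::TheFoxAndTheGrapes", "The Fox and the Grapes", "World of Tales",
         "A hungry fox saw some fine bunches of grapes upon a vine..."),
    ],
)

# Free text search, with a highlighted snippet and relevance ordering.
results = conn.execute(
    """
    SELECT key, title, book, snippet(search_index, 3, '[', ']', '...', 8)
    FROM search_index
    WHERE search_index MATCH ?
    ORDER BY rank
    """,
    ("grapes",),
).fetchall()
```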
One of the issues with even a free text search strategy is that the search terms must appear in the searched text if they are to return a result (we can get round this slightly by using things like stemming to reduce a word to its stem). It’s not too hard, though, to generate a simple semantic search over a corpus using doc2vec, as this old demo shows. However, the vtfunc trick that demo relies on seems to have rotted in Python 3.10 [update: ensuring Cython is installed fixes the vtfunc install; for another Python wrapper for SQLite, see also apsw]; there may be an alternative way to access a TableFunction via the peewee package, but that seems tightly bound to a particular database object, and on the quickest of plays, I couldn’t get it to play nice with a sqlite_utils database connection.
What I’m thinking is, it would be really handy to have a template repo associated with datasette that provides tools to:
- create a metasearch database (dogsheep-beta pretty much does this, but could be generalised to be more flexible in defining/naming required columns, etc.);
- provide an easily customisable datasette that gives you a simple free text search UI (related issue);
- provide a simple tool that will build a doc2vec table, register vector handlers (related issue) and a custom semantic search function; the index.html template should then give you the option of running a free text search or a semantic search;
- (it might also be handy to support other custom fuzzy search functions (example));
- a simpler config where you select free text and/or semantic search, the index/doc2vec builders are applied appropriately, and the index.html template serves queries appropriately.
PS I posted this related item to the sqlite_utils Discord server:
SQLite supports the creation of virtual table functions that allow you to define custom SQL functions that can return a table.
The recipe is to define a class My_Custom_Function(TableFunction) and then register it on a connection (My_Custom_Function.register(db.conn)). This worked fine with sqlite_utils database connections and meant you could easily add custom functions to a running server.
However, that package is no longer maintained and seems to be breaking when installing under at least Python 3.10?
An equivalent(ish) function is provided by the peewee package (functions imported from playhouse.sqlite_ext), but it seems to be rather more tightly bound to a SqliteDatabase object than to a simple database connection.
Is there any support for registering such table returning functions in sqlite_utils?
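For contrast, plain scalar SQL functions need none of the virtual table machinery that vtfunc and peewee wrap: they register directly on a stdlib sqlite3 connection, which is the registration pattern the TableFunction recipe mirrors. A minimal sketch (the slugify function is purely illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A scalar user-defined function registers straight onto the connection;
# only table-valued functions need virtual table support.
def slugify(s):
    return s.replace(" ", "").lower()

conn.create_function("slugify", 1, slugify)

row = conn.execute("SELECT slugify('The Fox and the Grapes')").fetchone()
# row == ('thefoxandthegrapes',)
```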