Using dogsheep-beta to Create a Combined SQLite Free Text Metasearch Engine Over Several SQLite Databases

I had a quick play yesterday tinkering with my storynotes side project, creating a couple more scrapers over traditional tale collections, specifically World of Tales and Laura Gibb’s Tiny Tales books. The scrapers pull the stories into simple, free text searchable SQLite databases to support discovery of particular stories based on simple, free text search terms. (I’ve also been exploring simple doc2vec based semantic search strategies over the data.)

See the technical recipes: World of Tales scraper; Tiny Tales scraper.

The databases I’ve been using aren’t very consistent in the table structure I’m using to store the scraped data, and thus far I tended to search the databases separately. But when I came across Simon Willison’s dogsheep-beta demo, which creates a free text meta-search database over several distinct databases using SQLite’s ATTACH method (example) for connecting to multiple databases, I thought I’d have a quick play to see if I could re-use that method to create my own meta-search database.

And it turns out, I can. Really easily…

The dogsheep-beta demo essentially bundles two things:

  • a tool for constructing a single, full text searchable database from the content of one or more database tables in one or more databases;
  • a datasette extension for rendering a faceted full text search page over the combined database.

The index builder is based around a YAML config file that contains a series of entries, each describing a query onto a database that returns the searchable data in a standard form. The standardised columns are type (in the dogsheep world, this is the original data source, eg a tweet, or a github record); key (a unique key for the record); title; timestamp; category; is_public; search_1, search_2, search_3.

I forked the repo to tweak these columns slightly, changing timestamp to pubdate and adding a new book column to replace type with the original book from which a story came; the database records I’m interested in search over are individual stories or tales, but it can be useful to know the original source text. I probably also need to be able to support something like a link to the original text, but for now I’m interested in a minimum viable search tool.

Items in config file have the form:

        sql: |-
              MY_ID as key,
              MY_TEXT as search_1
            from MY_TABLE

The select MUST be in lower case (a hack in the code searches for the first select in the provided query, as part of a query rewrite. Also, the query MUST NOT end with a ;, as the aforementioned query rewrite appends LIMIT 0 to the original query en route to identifying the column headings.

Here’s the config file I used for creating my metasearch database:

        sql: |-
              "lang::" || replace(title, " ", "") as key,
              text as search_1
            from books

        sql: |-
              "jacobs::" || replace(story_title, " ", "") as key,
              story_title as title,
              book_title as book,
              story_text as search_1
            from stories

        sql: |-
              "wot::" || replace(title, " ", "") as key,
              text as search_1
            from tales

        sql: |-
              "dash::" || replace(title, " ", "") as key,
              title as title,
              metadata as book,
              text as search_1
            from ashliman_stories

        sql: |-
              "mtdf::" || replace(title, " ", "") as key,
              title as title,
              text as search_1
            from english_stories

Using the datasette UI to query the FTS table, with a slightly tweaked SQL query, we can get something like the following:

One of the issues with even a free text search strategy is that the search terms must appear in the searched text if they are to return a result (we can get round this slightly by using things like stemming to reduce a word to its stem). However, it’s not too hard to generate a simple semantic search over a corpus using doc2vec, as this old demo shows. However, the vtfunc trick that relies on seems to have rotted in Py10 [update: ensuring Cython is installed fixes the vtfunc install; for another Python wrapper for sqlite, see also apsw]; there may be an alternative way to access a TableFunction via the peewee package, but that seems tightly bound to a particular database object, and on the quickest of plays, I couldn’t get it to play nice with sqlite_utils or pandas.read_sql().

What I’m thinking is, it would be really handy to have a template repo associated with sqlite_utils / datasette that provides tools to:

  • create a metasearch database (dogsheep-beta pretty much does this, but could be generalised to be more flexible in defining/naming required columns, etc.);
  • provide an easily customisable index.html template for datasette that gives you a simple free text search UI (related issue);
  • provide a simple tool that will build a doc2vec table, register vector handlers (related issue) and a custom semantic search function; the index.html template should then give you the option of running a free text search or a semantic search;
  • (it might also be handy to support other custom fuzzy search functions (example);)
  • a simpler config where you select free text and or semantic search, and the index/doc2vec builders are applied appropriately, and the index.html template serves queries appropriately.

PS I posted this related item to sqlite_utils discord server:

SQLite supports the creation of virtual table functions that allow you to define custom SQL functions that can return a table.

The coleifer/sqlite-vtfunc package provided a fairly straightforward way of defining custom table functions (eg I used it to create a custom fuzzy search function here:

The recipe is to define a class class My_Custom_Function(TableFunction) and then register it on a connection (My_Custom_Function.register(db.conn)). This worked fine with sqlite_utils database connections and meant you could easily add custom functions to a running server.

However, that package is no longer maintained and seems to be breaking when installing in at least Py10?

An equivalent (ish) function is provided by the peewee package (functions imported from playhouse.sqlite_ext), but it seems to be rather more tightly bound to a SqliteDatabase object, rather than simply being registered to a database connection.

Is there any support for registering such table returning functions in sqlite_utils?

Author: Tony Hirst

I'm a Senior Lecturer at The Open University, with an interest in #opendata policy and practice, as well as general web tinkering...

%d bloggers like this: