Over the last few weeks, I’ve been tinkering with various recipes for pulling searchable text content out of the Internet Archive and popping it into a full text searchable database.
One of my first sketches uses 19th century editions of Notes & Queries. As well as the weekly “content” issues, N&Q also published two index volumes a year, each detailing the entries of the preceding volume.
Having started trying to compile sensible index entries for my sin-eater unbook (still a work in progress, particularly the index) using the Sphinx/Jupyter Book indexing features, I have a newfound respect for the compilers of indexes: there’s a real craft to it.
At first glance, you might think there is limited utility in having an index as well as full text search support, but there are at least two reasons why that’s not correct.
The first is navigational: the index provides a way of identifying search terms, as well as helping you understand the pattern of occurrences of a particular term.
The second is that full-text search over text extracted from a large number of scans using OCR really sucks. Even with good stemming etc. on full text search terms, and even with fuzzy search tools, getting a match on a search term can, at times, be tricky.
So to supplement my full text search over N&Q, I am topping it up with a search into the index that also tries to identify pages directly from related index entries. (The use of the index is also a handy cross-check that the free text search has turned up at least the results included in the originally compiled index.)
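As a rough illustration of that top-up idea, here is a minimal sketch in Python. It is not the actual recipe: the page texts, the index entries and the function names (`full_text_pages`, `index_pages`, `search`) are all made up for the example, a plain substring match stands in for a real full text search engine, and the standard library's `difflib.get_close_matches` stands in for whatever fuzzy matcher you might prefer.

```python
from difflib import get_close_matches

# Hypothetical OCR'd page text keyed by page number.
# Note the OCR error on page 25 ("sin-eatcr").
pages = {
    12: "A note on the custom of the sin-eater in Herefordshire.",
    25: "Further remarks on the sin-eatcr and funeral customs.",
    40: "Queries respecting parish registers.",
}

# Hypothetical entries from the compiled index: heading -> page numbers.
index_entries = {
    "Sin-eater": [12, 25],
    "Parish registers": [40],
}


def full_text_pages(term):
    """Naive stand-in for full text search: case-insensitive substring match."""
    term = term.lower()
    return {page for page, text in pages.items() if term in text.lower()}


def index_pages(term, cutoff=0.8):
    """Fuzzy-match the term against index headings and collect their pages."""
    headings = {heading.lower(): heading for heading in index_entries}
    matches = get_close_matches(term.lower(), list(headings), n=3, cutoff=cutoff)
    found = set()
    for match in matches:
        found.update(index_entries[headings[match]])
    return found


def search(term):
    """Combine both searches and report pages only the index turned up."""
    text_hits = full_text_pages(term)
    index_hits = index_pages(term)
    return {
        "pages": sorted(text_hits | index_hits),
        "missed_by_full_text": sorted(index_hits - text_hits),
    }


print(search("sin-eater"))
```

In this toy data, the text search misses page 25 because of the OCR error, but the index entry still points to it; the `missed_by_full_text` list is the cross-check, flagging pages the compiled index knows about that the text search failed to find.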
In passing, I also note the power of the internal cross-referencing scheme used across items appearing in N&Q…