Pencil Sketch – Building a Simple Search Engine Around a Particular Internet Archive Document Collection

Over the last few weeks, I’ve been making a fair bit of use of the Internet Archive, not least to look up items in copies of Notes & Queries from the 19th century. In part, this has rekindled my interest in things like indexes (for human readers) and custom search engines. So before I try to hack anything together, here are a few notes on various bits and bobs that come to mind about how I could start to play a bit with the Internet Archive content.

Inspection of various sample records shows a few things we can work with. In particular:

  • the pub_notes-and-queries tag appears to be used to assign related documents to an associated document collection;
  • the OCR search text is available as a text file;
  • an OCR page index document contains a list of four-tuples, one per scanned page; the first two elements in each tuple appear to give the start and end character offsets of that page within the OCR search text file (I have no idea what the third and fourth elements in the tuple relate to);
  • a page numbers text file that gives a list of records, one per page, with a page number estimate and confidence score.

To retrieve the data, we can use the Python internetarchive package. This provides various command line tools, including bulk downloaders, as well as a Python API.

For example, we can use the API to search for items within a particular collection:

#%pip install internetarchive

from internetarchive import search_items

item_ids = []
# Limit search to collection items
#collection:"pub_notes-and-queries"
# Limit search to a particular publication and  volume
#sim_pubid:1250 AND volume:6
# Limit search to a particular publication and year
#sim_pubid:1250 AND year:1867

for i in search_items('collection:"pub_notes-and-queries"').iter_as_items():
    # Iterate through retrieved item records, collecting the item ids
    item_ids.append(i.identifier)

The item records include information that is likely to be useful for helping us retrieve items and construct a search engine over the documents (though what form that search engine might take, I am still not sure).

{'identifier': 'sim_notes-and-queries_1867_12_index',
 'adaptive_ocr': 'true',
 'auditor': 'associate-jerald-capanay@archive.org',
 'betterpdf': 'true',
 'boxid': 'IA1641612',
 'canister': 'IA1641612-06',
 'collection': ['pub_notes-and-queries', 'sim_microfilm', 'periodicals'],
 'contrast_max': '250',
 'contrast_min': '147',
 'contributor': 'Internet Archive',
 'copies': '4',
 'date': '1867',
 'derive_version': '0.0.19',
 'description': 'Notes and Queries 1867: <a href="https://archive.org/search.php?query=sim_pubid%3A1250%20AND%20volume%3A12" rel="nofollow">Volume 12</a>, Issue Index.<br />Digitized from <a href="https://archive.org/details/sim_raw_scan_IA1641612-06/page/n1668" rel="nofollow">IA1641612-06</a>.<br />Previous issue: <a href="https://archive.org/details/sim_notes-and-queries_1867-06-29_11_287" rel="nofollow">sim_notes-and-queries_1867-06-29_11_287</a>.<br />Next issue: <a href="https://archive.org/details/sim_notes-and-queries_1867-07-06_12_288" rel="nofollow">sim_notes-and-queries_1867-07-06_12_288</a>.',
 'issn': '0029-3970',
 'issue': 'Index',
 'language': 'English',
 'mediatype': 'texts',
 'metadata_operator': 'associate-berolyn-gilbero@archive.org',
 'next_item': 'sim_notes-and-queries_1867-07-06_12_288',
 'ppi': '400',
 'previous_item': 'sim_notes-and-queries_1867-06-29_11_287',
 'pub_type': 'Scholarly Journals',
 'publisher': 'Oxford Publishing Limited(England)',
 'scanner': 'microfilm03.cebu.archive.org',
 'scanningcenter': 'cebu',
 'sim_pubid': '1250',
 'software_version': 'nextStar 4.5.0.20626',
 'source': ['IA1641612-06', 'microfilm'],
 'sponsor': 'Kahle/Austin Foundation',
 'subject': ['Classical Studies',
  'Library And Information Sciences',
  'Literary And Political Reviews',
  'Literature',
  'Publishing And Book Trade',
  'Scholarly Journals',
  'microfilm'],
 'title': 'Notes and Queries  1867: Vol 12 Index',
 'volume': '12',
 'uploader': 'arthur+microfilm02@archive.org',
 'publicdate': '2021-10-19 11:27:07',
 'addeddate': '2021-10-19 11:27:07',
 'identifier-access': 'http://archive.org/details/sim_notes-and-queries_1867_12_index',
 'identifier-ark': 'ark:/13960/t3gz6zr86',
 'imagecount': '29',
 'ocr': 'tesseract 5.0.0-beta-20210815',
 'ocr_parameters': '-l eng',
 'ocr_module_version': '0.0.13',
 'ocr_detected_script': 'Latin',
 'ocr_detected_script_conf': '0.9685',
 'ocr_detected_lang': 'en',
 'ocr_detected_lang_conf': '1.0000',
 'page_number_confidence': '100.00',
 'pdf_module_version': '0.0.15',
 'foldoutcount': '0'}

So for example, we might pull out the volume, issue, date, and title; we can check whether page numbers were identified; we have reference to the next and previous items.
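
As a minimal sketch of that idea, the following pulls a handful of fields out of an item metadata dict shaped like the record above. The function name summarise_record and the has_page_numbers flag are my own illustrative choices, and treating a missing or zero page_number_confidence as "no page numbers identified" is an assumption rather than documented behaviour.

```python
def summarise_record(metadata):
    """Pull out the fields we are likely to want from an item metadata dict."""
    keys = ["identifier", "title", "date", "volume", "issue",
            "previous_item", "next_item"]
    summary = {key: metadata.get(key) for key in keys}
    # page_number_confidence is a string in the sample record; here a missing
    # or zero value is assumed to mean no page numbers were identified
    confidence = metadata.get("page_number_confidence")
    summary["has_page_numbers"] = confidence is not None and float(confidence) > 0
    return summary


# A cut-down copy of the sample record shown above
sample = {
    "identifier": "sim_notes-and-queries_1867_12_index",
    "title": "Notes and Queries  1867: Vol 12 Index",
    "date": "1867",
    "volume": "12",
    "issue": "Index",
    "previous_item": "sim_notes-and-queries_1867-06-29_11_287",
    "next_item": "sim_notes-and-queries_1867-07-06_12_288",
    "page_number_confidence": "100.00",
}

print(summarise_record(sample))
```

Using .get() means that records lacking one of the fields still produce a summary, with None in place of the missing value.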

From inspection of other records, we also note that if an item is restricted (for example, in its preview), then an 'access-restricted-item': 'true' attribute is also set.
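
A simple check along those lines might look like the sketch below. The helper name is_restricted is hypothetical, and the test relies only on the 'access-restricted-item' attribute observed in the records; I have not checked whether restriction can be signalled in other ways.

```python
def is_restricted(metadata):
    """Return True if the metadata dict carries the restricted-item flag."""
    # The flag value is the string 'true' in the records inspected
    value = metadata.get("access-restricted-item", "")
    return str(value).lower() == "true"


records = [
    {"identifier": "example-open-item"},
    {"identifier": "example-restricted-item", "access-restricted-item": "true"},
]

# Keep only the items we can expect to read in full
open_ids = [r["identifier"] for r in records if not is_restricted(r)]
print(open_ids)
```

The identifiers in the records list are made up for the example.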

We can retrieve documents from the API by id and document type:

from internetarchive import download

# Files download into a directory with the same name as the item id;
# the downloaded file names end with, for example:
# _page_numbers.json
# _hocr_searchtext.txt.gz

doc_id = 'sim_notes-and-queries_1849-11-03_1_1'

download(doc_id, destdir='ia-downloads',
         formats=[ "OCR Page Index", "OCR Search Text", "Page Numbers JSON"])

To create a search index from the downloads, what are we to do?

By inspection of several OCR Search Text files, it appears as if the content is arranged as one paragraph per line of file, where paragraphs are perhaps more correctly blocks of text that appear visually separated from other blocks of text.
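
On that reading, loading the text blocks from a downloaded file is just a matter of reading the gzipped file line by line. This is a sketch only: the function name read_text_blocks is hypothetical, and the UTF-8 encoding is an assumption.

```python
import gzip


def read_text_blocks(path):
    """Return the non-empty lines (text blocks) of a gzipped OCR search text file."""
    # Assumes the file is UTF-8 encoded, one text block per line
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]
```

Note that dropping empty lines is fine for inspecting the blocks, but any character offsets into the file need to be applied to the unmodified text.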

For a full text search engine, for example one built on a SQLite FTS4 virtual table, we could just add each document as a separate record.
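
A minimal sketch of the document-level approach, using the sqlite3 module from the Python standard library, is given below. The table name docs_fts, the column names and the helper functions are all hypothetical, and the code assumes the local SQLite build has FTS4 enabled.

```python
import sqlite3


def create_doc_index(conn):
    """Create a full text search table with one row per document."""
    conn.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS docs_fts USING fts4(doc_id, content)"
    )


def add_document(conn, doc_id, text):
    """Add the full text of a single document to the index."""
    conn.execute(
        "INSERT INTO docs_fts (doc_id, content) VALUES (?, ?)", (doc_id, text)
    )


def search_docs(conn, query):
    """Return the ids of documents whose content matches the query."""
    rows = conn.execute(
        "SELECT doc_id FROM docs_fts WHERE content MATCH ?", (query,)
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
create_doc_index(conn)
add_document(conn, "doc-a", "A note on folklore and ballads")
add_document(conn, "doc-b", "A query about heraldry")
print(search_docs(conn, "folklore"))
```

The document ids and texts here are invented; in practice the text would come from each item's OCR search text file.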

However, it might also be interesting to have full text search over page level records. We could do this by splitting the content of the full text document according to the OCR page index file, numbering each page with its “scan page” index and with any extracted “actual” page number from the page numbers text file. (The lists in the OCR page index file and the page numbers text file should be the same length.)
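
The splitting step might be sketched as follows. This assumes the first two elements of each page index entry really are start and end character offsets into the search text (if they turn out to be byte offsets, the slicing would need to be done on bytes instead), and it takes the page numbers as a plain list that the caller has already extracted, since I have not pinned down the layout of the page numbers file. The function and key names are hypothetical.

```python
def split_pages(text, page_index, page_numbers=None):
    """Split the full text into page level records using the page index."""
    pages = []
    for scan_page, entry in enumerate(page_index):
        # Only the first two elements of each four-tuple are used
        start, end = entry[0], entry[1]
        page_number = None
        if page_numbers is not None and scan_page < len(page_numbers):
            page_number = page_numbers[scan_page]
        pages.append({
            "scan_page": scan_page,
            "page_number": page_number,
            "text": text[start:end],
        })
    return pages


# Toy data standing in for the downloaded files
text = "first page\nsecond page\n"
page_index = [[0, 11, 0, 0], [11, 23, 0, 0]]
pages = split_pages(text, page_index, page_numbers=["1", "2"])
print(pages)
```

The toy offsets and page numbers are invented purely to exercise the function.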

A search over the pages table would then be able to return page numbers.
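
As a rough sketch, a page level index could carry the scan page and page number alongside the text, so that a match reports where it was found. The table name pages_fts and the dict keys scan_page, page_number and text are hypothetical, and the code again assumes an SQLite build with FTS4 enabled.

```python
import sqlite3


def build_page_index(conn, doc_id, pages):
    """Index a list of page dicts (scan_page, page_number, text) for one document."""
    conn.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS pages_fts "
        "USING fts4(doc_id, scan_page, page_number, content)"
    )
    conn.executemany(
        "INSERT INTO pages_fts (doc_id, scan_page, page_number, content) "
        "VALUES (?, ?, ?, ?)",
        [(doc_id, str(p["scan_page"]), p["page_number"], p["text"]) for p in pages],
    )


def search_pages(conn, query):
    """Return (doc_id, scan_page, page_number) for each matching page."""
    rows = conn.execute(
        "SELECT doc_id, scan_page, page_number FROM pages_fts "
        "WHERE content MATCH ?",
        (query,),
    )
    return rows.fetchall()


conn = sqlite3.connect(":memory:")
build_page_index(conn, "doc-a", [
    {"scan_page": 0, "page_number": "1", "text": "A note on folklore"},
    {"scan_page": 1, "page_number": "2", "text": "A query about heraldry"},
])
print(search_pages(conn, "heraldry"))
```

The scan page is stored as text here to keep the row values uniform; converting it back to an integer is left to the caller.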

In some cases it may be that the searcher will want to view the actual original document scan, for example, in the case of dodgy OCR, or to check the emphasis or layout of the original text. So it probably makes sense to also grab the scanned documents, either as a PDF, or as a collection of JPG images, loading the file as binary data and saving into the database as a SQLite BLOB.
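
Storing a scan as a BLOB could be sketched like this. The table name scans, its columns and the two helper functions are hypothetical; the bytes would in practice come from reading a downloaded PDF or image file in binary mode.

```python
import sqlite3


def save_scan(conn, doc_id, file_name, data):
    """Store the raw bytes of a scan file against a document id."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS scans ("
        "doc_id TEXT, file_name TEXT, data BLOB, "
        "PRIMARY KEY (doc_id, file_name))"
    )
    conn.execute(
        "INSERT OR REPLACE INTO scans (doc_id, file_name, data) VALUES (?, ?, ?)",
        (doc_id, file_name, sqlite3.Binary(data)),
    )


def load_scan(conn, doc_id, file_name):
    """Return the stored bytes for a scan file, or None if it is not stored."""
    row = conn.execute(
        "SELECT data FROM scans WHERE doc_id = ? AND file_name = ?",
        (doc_id, file_name),
    ).fetchone()
    return bytes(row[0]) if row else None


conn = sqlite3.connect(":memory:")
save_scan(conn, "doc-a", "doc-a.pdf", b"%PDF-placeholder-bytes")
print(len(load_scan(conn, "doc-a", "doc-a.pdf")))
```

The placeholder bytes are not a real PDF; they just show the round trip.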

We can preview the files downloadable for a particular item by getting a list of the associated files and then reporting their file format:

from internetarchive import get_item

#Retrieve an item record by id
jx = get_item('sim_notes-and-queries_1867_12_index')

# The item .get_files() method returns an iterator over
# the files associated with the item
for jj in jx.get_files():
    #Report the file format
    print(jj.format)

"""
Item Tile
JPEG 2000
JPEG 2000
Text PDF
Archive BitTorrent
chOCR
DjVuTXT
Djvu XML
Metadata
JSON
hOCR
OCR Page Index
OCR Search Text
Item Image
Single Page Processed JP2 ZIP
Metadata
Metadata
Page Numbers JSON
JSON
Scandata
"""

The Single Page Processed JP2 ZIP contains separate JPEG 2000 images, one per scan page, so that is a good candidate. We could also grab the Text PDF, which is a searchable PDF document. If page level access were required, we could then grab the whole PDF and just extract and display the required page.
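
For the ZIP route, pulling a single page image out of the archive needs nothing beyond the standard library. This is a sketch under an untested assumption: that the image files inside the archive have names that sort into scan page order. The function name read_scan_page is hypothetical.

```python
import zipfile


def read_scan_page(zip_path, page):
    """Return (file name, image bytes) for the given scan page in a JP2 ZIP."""
    with zipfile.ZipFile(zip_path) as zf:
        # Assumes sorting the .jp2 file names gives scan page order
        names = sorted(n for n in zf.namelist() if n.lower().endswith(".jp2"))
        name = names[page]
        return name, zf.read(name)
```

The returned bytes are still JPEG 2000 data, so displaying them in a browser or notebook would probably need a conversion step using an imaging library.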

However, downloading all the JPG / PDF files feels a bit excessive. So it might be more interesting to try to build something that will only download and store PDFs / images on demand, as and when a user wants to preview an actual scanned page.
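
One way to sketch the on-demand idea is a small local cache that only calls out for a file the first time it is requested. The fetch argument here is a hypothetical callable that takes an item id and a file name and returns the file's bytes; it might wrap the internetarchive package, but that wiring is left out and untested.

```python
from pathlib import Path


def fetch_on_demand(doc_id, file_name, fetch, cache_dir="ia-downloads"):
    """Return the local path to a file, fetching it only if not already cached."""
    target = Path(cache_dir) / doc_id / file_name
    if not target.exists():
        # First request for this file: fetch the bytes and store them locally
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(fetch(doc_id, file_name))
    return target
```

Passing the fetch function in as an argument keeps the caching logic separate from the download mechanism, which also makes it easy to try out with a dummy function.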

PS in passing, I note this recent package – https://github.com/cbrennig/pdf_sqlite_fts – for OCRing, indexing and full-text searching PDF docs.

Author: Tony Hirst

I'm a Senior Lecturer at The Open University, with an interest in #opendata policy and practice, as well as general web tinkering...
