Sketches Around Transkribus – Handwritten Text Transcriptions in Jupyter Notebooks

Another strike day, and not reading as much as I’d hoped, I auto-distracted myself by having a play with the Transkribus Python API. (Transkribus, if you recall, is an app that supports the transcription of handwritten texts.)

The API lets you pull (and push, but I haven’t got that far yet) documents from and to the Transkribus webservice. One of the docs you can export from it (which is also available from the GUI client) is an XML doc that includes co-ordinates for segmented line regions within each page:

You can export the document from the GUI…

…but there’s a Python API way of doing it too…

So, I made a few sketches (in a notebook in this gist) that started to explore the API, including pulling the XML down, along with page images, parsing it, and using OpenCV to crop individual text lines out of the page image scan.
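For a flavour of the parse-and-crop step, here is a minimal sketch rather than the code from the gist. It assumes the export is PAGE-style XML in which each `TextLine` element carries a `Coords` element with a `points` attribute of space-separated `x,y` pairs, and it reduces each line polygon to a bounding box. The names `line_boxes`, `crop_line` and `SAMPLE`, and the sample document itself, are made up for illustration.

```python
import xml.etree.ElementTree as ET

# A tiny, made-up PAGE-style document, just to show the assumed shape.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="page1.jpg" imageWidth="1000" imageHeight="1400">
    <TextRegion id="r1">
      <TextLine id="r1l1">
        <Coords points="10,20 200,22 200,60 10,58"/>
        <TextEquiv><Unicode>Dear Sir</Unicode></TextEquiv>
      </TextLine>
      <TextLine id="r1l2">
        <Coords points="12,70 210,72 210,110 12,108"/>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""


def line_boxes(page_xml):
    """Return one dict per TextLine: its id, bounding box and any transcript."""
    root = ET.fromstring(page_xml)
    lines = []
    # "{*}" matches the tag in any namespace (Python 3.8+).
    for line in root.findall(".//{*}TextLine"):
        coords = line.find("{*}Coords")
        if coords is None or not coords.get("points"):
            continue
        points = [
            tuple(int(float(v)) for v in pair.split(","))
            for pair in coords.get("points").split()
        ]
        xs = [x for x, _ in points]
        ys = [y for _, y in points]
        # Only look at the line's own TextEquiv, not any nested Word elements.
        unicode_el = line.find("{*}TextEquiv/{*}Unicode")
        text = (unicode_el.text or "") if unicode_el is not None else ""
        lines.append({
            "id": line.get("id"),
            "box": (min(xs), min(ys), max(xs), max(ys)),
            "text": text,
        })
    return lines


def crop_line(image, box):
    """Crop a bounding box out of an image array indexed as [row, column]."""
    x0, y0, x1, y1 = box
    return image[y0:y1, x0:x1]


lines = line_boxes(SAMPLE)
```

With OpenCV installed, `image = cv2.imread("page1.jpg")` gives an array that `crop_line` can slice, and `cv2.imwrite` can save each crop to its own file. Cropping to the bounding box keeps things simple at the cost of picking up bits of neighbouring lines where the segmented polygon is slanted.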

I then popped a function together to create a simple markdown file containing each cropped line and any transcript already added to it:
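A function along those lines might look something like this. It is a sketch under assumptions, not the original: it supposes each line is a dict with an `id` and an optional `text`, and that the cropped line images have already been saved as `<id>.png` in a folder. The function name `lines_to_markdown` and the `lines/` folder are hypothetical.

```python
from pathlib import Path


def lines_to_markdown(lines, image_dir="lines"):
    """Build markdown with one image link per line, followed by its transcript."""
    parts = []
    for line in lines:
        # Assumed naming convention: each crop was saved as <image_dir>/<id>.png
        image_path = Path(image_dir) / f"{line['id']}.png"
        parts.append(f"![{line['id']}]({image_path.as_posix()})")
        parts.append("")
        # Lines without a transcript yet just get an empty slot to type into.
        parts.append(line.get("text", ""))
        parts.append("")
    return "\n".join(parts)


md = lines_to_markdown([
    {"id": "r1l1", "text": "Dear Sir"},
    {"id": "r1l2", "text": ""},
])
```

Using the line id as the image alt text means the id survives in the markdown, which is handy if the text ever needs matching back up with the XML.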

My thinking here is that I could use Jupytext to open the markdown document in a notebook interface and add further transcription text to a markdown doc / notebook containing separate text lines. There’s a Python API call for pushing stuff back to the server, so I’m hoping I should be able to come up with a simple script to transform the markdown, or perhaps even notebook ipynb/JSON derived using Jupytext from it, to the required XML format and push it back to the Transkribus server.
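The transform half of that might be sketched as follows. This is speculative: it assumes a markdown layout in which each line appears as an image link whose alt text is the line id (`![line_id](path)`), followed by the transcript text, and it assumes the server wants PAGE-style XML with the text in `TextEquiv/Unicode` under each `TextLine`. The function names are hypothetical, and the push back to the server is left out.

```python
import re
import xml.etree.ElementTree as ET

# Assumed layout: an image link on its own line, alt text = line id.
IMAGE_LINK = re.compile(r"^!\[(?P<id>[^\]]+)\]\([^)]*\)\s*$")


def markdown_to_transcripts(md):
    """Map each line id to the text that follows its image link."""
    transcripts = {}
    current = None
    for raw in md.splitlines():
        match = IMAGE_LINK.match(raw)
        if match:
            current = match.group("id")
            transcripts[current] = []
        elif current is not None and raw.strip():
            transcripts[current].append(raw.strip())
    return {line_id: " ".join(text) for line_id, text in transcripts.items()}


def apply_transcripts(page_xml, transcripts):
    """Write transcripts into the matching TextLine elements; return new XML."""
    root = ET.fromstring(page_xml)
    for line in root.findall(".//{*}TextLine"):
        text = transcripts.get(line.get("id"))
        if not text:
            continue  # leave untranscribed lines exactly as they were
        unicode_el = line.find("{*}TextEquiv/{*}Unicode")
        if unicode_el is None:
            # Create the missing elements in the same namespace as the line.
            ns = line.tag[: line.tag.index("}") + 1] if line.tag.startswith("{") else ""
            equiv = line.find("{*}TextEquiv")
            if equiv is None:
                equiv = ET.SubElement(line, ns + "TextEquiv")
            unicode_el = ET.SubElement(equiv, ns + "Unicode")
        unicode_el.text = text
    return ET.tostring(root, encoding="unicode")
```

ElementTree will serialise the namespace with a generated prefix (`ns0:`) unless it is registered first with `ET.register_namespace`, and whether the server accepts the result as-is is something that would need testing.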

(You can see where I’m going here, perhaps? A simple notebook UI as an alternative to the more complex Transkribus UI.)

The next step, though, is to see if I can get the Transkribus service to find the text lines on a new page in a document already uploaded to the service and then pull the corresponding XML down; then see if I can upload a document to the service. (I also need to have a go at creating a document collection.) Then I’ll be able to think a bit more about generating the XML I need to push a new, or updated, transcript back to the Transkribus service.


I should probably also try getting a config to run this in MyBinder, and working on a reproducible demo (the sketch uses a document I’ve uploaded and partially transcribed, and I’m not sure how to go about sharing it, if indeed I can?)

PS for an alternative take on using the Transkribus API from a Jupyter notebook, via @wragge, see: