I had intended to spend strike week giving my hands a rest, reading rather than keyboarding, but as it was I spent today code-sketching around the National Archives, as well as other things.
In trying to track down original Home Office papers relating to the Yorkshire Luddites, I’ve been poking around the National Archives (as described in passing here). Over the last couple of years, I’ve grown weary of search interfaces, even Advanced Search ones, preferring to try to grab the data into my own database(s) where I can more easily query and enrich it, as well as join it with other data sources.
I had assumed the National Archives search index was a bit richer than it is (I put down my lack of success in many of the searches I tried to unfamiliarity with it) but it seems pretty thin – an index catalogue that records the existence of document collections but not what’s in them to any great level of detail.
But on the assumption that there was rather more detail than I seem to have found, I did a few code sketches around it that demonstrate:
- mechanicalsoup to load a search page, set form selections, “click” a download button and capture the result;
- StringIO to load CSV data into a pandas dataframe;
- spacy to annotate a dataframe with named entities;
- exploding lists in a dataframe column to make a long dataframe therefrom;
- expanding a column of tuples in a dataframe across several columns;
- Wand (a Python API for ImageMagick) to render pages from a PDF as images in a Jupyter notebook (Chrome is borked again, not rendering PDFs via a notebook IFrame).
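As a rough illustration of the dataframe-related items in the list above (and not the code from the gist), here is a minimal sketch that loads CSV text via StringIO, explodes a column of lists into a long dataframe, and then spreads a column of tuples across separate columns. Everything in it is an assumption made for the example: the CSV text, the column names (`reference`, `description`, `entities`, `entity`, `label`) and the hand-written (text, label) pairs, which merely stand in for the sort of named-entity annotations a tool like spacy might produce.

```python
from io import StringIO

import pandas as pd

# Made-up CSV text standing in for a downloaded search result (hypothetical data).
csv_text = """reference,description
REF/1,Letters from Huddersfield and Leeds
REF/2,Report sent from York
"""

# StringIO wraps the string in a file-like object that read_csv can consume.
df = pd.read_csv(StringIO(csv_text))

# Hand-written (text, label) pairs, one list per row. These are illustrative
# stand-ins for named-entity annotations, not real spacy output.
df["entities"] = [
    [("Huddersfield", "GPE"), ("Leeds", "GPE")],
    [("York", "GPE")],
]

# explode() gives each list item its own row, repeating the other columns.
long_df = df.explode("entities").reset_index(drop=True)

# Build a two-column dataframe from the tuples, aligned on the same index...
expanded = pd.DataFrame(
    long_df["entities"].tolist(),
    columns=["entity", "label"],
    index=long_df.index,
)

# ...and swap it in for the original column of tuples.
long_df = long_df.drop(columns="entities").join(expanded)

print(long_df)
```

Resetting the index after the explode step keeps the row labels unique, which makes the later join unambiguous.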
Check the gist to see the code… (Bits of it should run in MyBinder too – just remember to select “Gist”! spacy isn’t installed at the moment: Gists seem to be a bit broken, the requirements.txt file is being mistreated, and I don’t want to risk breaking other bits as a side effect of trying to fix it. If Gists are other than temporarily borked, I will try to remember to add the code within this post explicitly.)