I had intended to spend strike week giving my hands a rest, reading rather than keyboarding, but as it was I spent today code-sketching around the National Archives, as well as other things.
In trying to track down original Home Office papers relating to the Yorkshire Luddites, I’ve been poking around the National Archives (as described in passing here). Over the last couple of years, I’ve grown weary of search interfaces, even Advanced Search ones, preferring to try to grab the data into my own database(s) where I can more easily query and enrich it, as well as join it with other data sources.
I had assumed the National Archives search index was a bit richer than it is (I put down my lack of success in many of the searches I tried to unfamiliarity with it), but it seems pretty thin – an index catalogue that records the existence of document collections but not what’s in them in any great detail.
But assuming there was rather more detail than I seem to have found, I did a few code sketches around it that demonstrate:
- using `mechanicalsoup` to load a search page, set form selections, “click” a download button and capture the result;
- using `StringIO` to load CSV data into a `pandas` dataframe;
- using `spacy` to annotate a dataframe with named entities;
- exploding lists in a dataframe column to make a long dataframe therefrom;
- expanding a column of tuples in a dataframe across several columns;
- using `Wand` (a Python API for imagemagick) to render pages from a PDF as images in a Jupyter notebook (Chrome is borked again, not rendering PDFs via a notebook IFrame).
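
By way of illustration, here is a minimal sketch of the pandas steps in that list: loading CSV text via `StringIO`, exploding a column of lists into a long dataframe, and spreading a column of tuples across several columns. It is not the code from the gist. The CSV text, the `REF 1`-style references, the column names and the hand-written `(text, label)` entity tuples are all made up for the example; the tuples simply stand in for whatever a named entity annotation step might return.

```python
from io import StringIO

import pandas as pd

# Hypothetical CSV text, standing in for a downloaded search result
csv_text = (
    "reference,description\n"
    'REF 1,"Letters from Huddersfield, 1812"\n'
    'REF 2,"Report on disturbances at Leeds"\n'
)

# Wrap the string in a file-like object so pandas can read it as CSV
df = pd.read_csv(StringIO(csv_text))

# Stand-in for an entity annotation step: one list of
# (text, label) tuples per row, written by hand here
df["entities"] = [
    [("Huddersfield", "GPE"), ("1812", "DATE")],
    [("Leeds", "GPE")],
]

# Explode the list column: one row per entity, other columns repeated
long_df = df.explode("entities").reset_index(drop=True)

# Expand the column of tuples across two new columns
parts = pd.DataFrame(
    long_df["entities"].tolist(),
    columns=["entity", "label"],
    index=long_df.index,
)
long_df = long_df.drop(columns="entities").join(parts)

print(long_df)
```

Resetting the index after the explode keeps the row labels unique, which makes the later join on the index unambiguous.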
Check the gist to see the code… (Bits of it should run in MyBinder too – just remember to select “Gist”! `spacy` isn’t installed at the moment: Gists seem to be a bit broken, the `requirements.txt` file is being mistreated, and I don’t want to risk breaking other bits as a side effect of trying to fix it. If Gists are other than temporarily borked, I will try to remember to add the code within this post explicitly.)