Scraping Texts

One of the things I learned early on about scraping web pages (often referred to as “screen scraping”) is that it often amounts to trying to recreate databases that have been re-presented as web pages using HTML templates. For example:

  • display a database table as an HTML table in a web page;
  • display each row of a database as a templated HTML page.

The aim of the scrape in these cases might be as simple as pulling the table from the page and representing it as a dataframe, or trying to reverse engineer the HTML template that converts data to HTML into something that can extract the data from the HTML back as a row in a corresponding data table.
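As a minimal sketch of the simpler case (pulling a table out of a page), the following uses only the Python standard library. The HTML fragment and the `TableParser` class are invented for illustration; a real page will usually need more care (nested tables, merged cells and so on).

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the cell text of every table row fed to the parser."""

    def __init__(self):
        super().__init__()
        self.rows = []      # completed rows
        self._row = None    # row currently being built
        self._cell = None   # text fragments of the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical page fragment, invented for this example
html = """
<table>
  <tr><th>Name</th><th>Party</th></tr>
  <tr><td>A. Member</td><td>Red</td></tr>
  <tr><td>B. Member</td><td>Blue</td></tr>
</table>
"""

parser = TableParser()
parser.feed(html)

# Treat the first row as the header and turn the rest into records
header, *body = parser.rows
records = [dict(zip(header, row)) for row in body]
print(records)
```

The list of records can be passed straight to `pandas.DataFrame(records)`; pandas also has a `read_html` function that does the whole job in one call if a suitable HTML parsing library is installed.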

In the latter case, the scrape may proceed in a couple of ways. For example:

  • by trying to identify structural HTML tag elements that contain recognisable data items, retrieving the HTML tag element, then extracting the data value;
  • parsing the recognisable literal text displayed on the web page and trying to extract data items based on that (i.e. ignore the HTML structural elements and go straight for the extracted text). For an example of this sort of parsing, see the r1chardj0n3s/parse Python package as applied to text pulled from a page using something like the kennethreitz/requests-html package.

When scraping from PDFs, it is often necessary to make use of positional information (the co-ordinates that identify where on the page a particular piece of text can be found) as well as literal text / pattern matching to try to identify different structured items on the page.

In more general cases, however, such as when trying to abstract meaningful information from arbitrary, natural language, texts, we need to up our game and start to analyse the texts as natural language texts.

At the basic level, we may be able to do this by recognising structural patterns within the text. For example:

Name: Joe Blogs
Address: 17, Withnail Street, Lesser Codswallop

We can then do some simple pattern matching to extract the identified elements.
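A minimal sketch of that sort of pattern matching, using a regular expression over the example text. It assumes each item sits on its own line in a "Label: value" form, which will not hold for every document.

```python
import re

text = """Name: Joe Blogs
Address: 17, Withnail Street, Lesser Codswallop"""

# One "Label: value" pair per line; the label is everything before the first colon
pattern = re.compile(r"^(?P<label>[A-Za-z ]+):\s*(?P<value>.+)$", re.MULTILINE)

record = {
    m.group("label").strip(): m.group("value").strip()
    for m in pattern.finditer(text)
}
print(record)
```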

Within the text, there may also be things that we might recognise as company names, dates, or addresses. Entity recognition refers to a natural language processing technique that attempts to extract words that describe “things”, that is, entities, as well as identifying what sorts of “thing”, or entity, they are.

One powerful Python natural language processing package, spacy, has an entity recognition capability that lets us identify entities within a text in a couple of ways. The spacy package includes models for a variety of languages that can identify things like people’s names (PERSON), company names (ORG), MONEY and DATE strings.

However, we can also extend spacy by developing our own models, or building on top of spacy’s pre-existing models.

In the first case, we can build an enumerated model that explicitly identifies terms we want to match against a particular entity type. For example, we might have a list of MP names that we want to use to tag a text to identify wherever an MP is mentioned.
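The idea of an enumerated model can be sketched in plain Python without spacy at all: compile the list of names into a single pattern and report where each one occurs. The names below are invented, and the `tag_mps` helper is hypothetical; spacy has its own matching tools that would do this against tokenised text instead.

```python
import re

# Hypothetical names, invented for illustration
mp_names = ["Ann Example", "Bob Sample"]

# Longest names first so that a longer name wins over a shorter one it contains
alternatives = "|".join(
    re.escape(name) for name in sorted(mp_names, key=len, reverse=True)
)
mp_pattern = re.compile(r"\b(?:" + alternatives + r")\b")


def tag_mps(text):
    """Return (start, end, label, matched text) for each listed name found."""
    return [(m.start(), m.end(), "MP", m.group(0)) for m in mp_pattern.finditer(text)]


tags = tag_mps("Ann Example asked a question, which Bob Sample answered.")
print(tags)
```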

In the second case, we may want to build a more general sort of model. Again, spacy can help us here. One way of matching text items is to look at the “shape” of tokens (words) in a text. For example, we might extract the shape of the word “Hello” as “Xxxxx” to identify upper and lower case alphabetic characters. We might use the “d” symbol to denote a numerical character. A common UK postcode form may then be identified from its shape, such as XXd dXX or Xd dXX.
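To make the idea of shape concrete, here is a small sketch written in plain Python rather than with spacy. The `shape` function is a simplified stand-in, so spacy's own shape attribute may differ in detail (for example in how it treats long runs of the same character class), and the two postcode shapes are just the ones mentioned above.

```python
import re


def shape(token):
    """Map each character to X (upper), x (lower) or d (digit); keep anything else."""
    out = []
    for ch in token:
        if ch.isdigit():
            out.append("d")
        elif ch.isalpha():
            out.append("X" if ch.isupper() else "x")
        else:
            out.append(ch)
    return "".join(out)


# The two postcode shapes mentioned above; real postcodes take other forms too
POSTCODE_SHAPES = {"XXd dXX", "Xd dXX"}


def find_postcodes(text):
    """Return pairs of adjacent tokens whose combined shape is a known postcode shape."""
    tokens = re.findall(r"\w+", text)
    hits = []
    for first, second in zip(tokens, tokens[1:]):
        if shape(first) + " " + shape(second) in POSTCODE_SHAPES:
            hits.append(first + " " + second)
    return hits


print(shape("Hello"))
print(find_postcodes("Write to us at MK7 6AA or pop in at M1 1AE."))
```

Matching on shape alone will also pick up things that merely look like postcodes, which is the sort of false match a crude model has to live with.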

Another way of matching elements is to look at “parts of speech” (POS) tags and the patterns they make. If you remember your grammar, POS tags label things like nouns, proper nouns and adjectives, or conjunctions and prepositions.

Looking at a sentence in terms of its POS tags provides another level of structure across which we might look for patterns.
The following shows how even a crude model can start to identify useful features in a text, albeit with some false matches:

For examples of scraping texts, see this notebook: psychemedia/parlihacks/notebooks/Text%20Scraping%20-%20Notes

PS related, in policy / ethical best practice terms: ONS web-scraping policy

Do Special Interest Groups Reveal Their Hand in Parliamentary Debate?

Mulling over committee treemaps – code which I really need to revisit and package up somewhere – I started wondering…

…can we use the idea of treemap displays as a way in to help think about how interest groups may – or may not – reveal themselves in Parliament?

For example, suppose we had a view over Parliamentary committees, or APPGs. Add another level of structure to the treemap display showing members of each committee with cells of equal area for each member. Now, in a debate, if any of the members of the committee speak, highlight the cell for that member.

With a view over all committees, if the members of a particular committee, or particular APPG, lit up, or didn’t light up, we might be able to start asking whether the representation from those members was statistically unlikely.

(We could do the same for divisions, displaying how each member voted and then seeing whether that followed party lines?)

From mulling over this visually inspired insight, we wouldn’t actually need to use treemaps, of course. We could just run a query over the data and do some counting, creating “lenses” that show how particular interest groups or affiliations (committees, APPGs, etc) are represented in a debate?
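A rough sketch of what that counting might look like, along with one crude way of asking whether the turnout from a group is "statistically unlikely". It assumes, unrealistically, that speakers in a debate are drawn at random from the whole House, and uses a hypergeometric tail probability on that basis. The membership data and the house size of 650 are assumptions made up for the example.

```python
from math import comb


def prob_at_least(k, group_size, speakers, house_size):
    """P(at least k of the group speak) if speakers were drawn at random from the House."""
    total = comb(house_size, speakers)
    upper = min(group_size, speakers)
    return sum(
        comb(group_size, i) * comb(house_size - group_size, speakers - i)
        for i in range(k, upper + 1)
    ) / total


# Hypothetical data, invented for illustration
committee = {"A", "B", "C", "D"}
debate_speakers = {"A", "B", "C", "X", "Y"}
house_size = 650  # assumed number of members who could have spoken

# The "lens": which members of the group spoke in the debate?
spoke = committee & debate_speakers
p = prob_at_least(len(spoke), len(committee), len(debate_speakers), house_size)
print(sorted(spoke), p)
```

A very small probability here only says the random-draw model is a poor fit, which is hardly surprising given that members tend to speak on topics they follow; it is a prompt for a closer look rather than evidence of anything in itself.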

Contextualised Search Result Displays

One of the prevalent topics covered in the early days of this blog concerned how to appropriate search tools and technologies and explore how they could be used as more general purpose technologies. Related to this were posts on making the most of document collections, or exploiting technologies complementary to search that returned results or content based on context.

For example:

There’s probably more…

(Like looking for shared text across documents to try to work out the provenance of a particular section of text as a document goes through multiple versions…)


So where are we at now…?

Mulling over recent updates to Parliamentary search, I started wondering about ranking and the linear display of results. I’ve always quite liked facet limits that will filter out a subset of results returned by the search term based on a particular attribute. For example, in an Amazon search, we’re probably all familiar with entering a general search term then using category filters / facets in the left hand sidebar to narrow results down to books, or subject categories within books.

Indeed, the faceted search basis of “search + filter” is one that inspired many of my own hacks.

As well as linear displays of ranked results (ranked how is always an issue), every so often multi-dimensional result displays appear. For example, things like Bento box displays (examples) were all the rage in university library catalogues several years ago, where multiple topical results panels display results from different facets or collections in different boxes distributed on a 2D grid. I’m not sure if they’re still “a thing” or whether preferences have gone back to a linear stream of results, perhaps with faceting to limit results within a topic? I guess one of the issues now is limited real estate on portrait orientation mobile phone displays compared to more expansive real estate you get in a landscape oriented large screen desktop display? (Hmmm, thinks… are Netvibes and PageFlakes still a thing?)

Anyway, I’ve not been thinking about search much for years, so in a spirit of playfulness, here’s a line of thinking I think could be fun to explore: contextualised search results, or more specifically, contextualised search result displays.

This phrase unpacks in several ways depending on where you think the emphasis on “contextualised” lies. (“contextualised lies”… “contextualies”…. hmmm…)

For example, if we interpret contextualised in the sense of context sensitive relative to the “natural arranging” of the results returned, we might trivially think of things like map based displays for displaying the results of a search where search items are geotagged. Complementary to this are displays where the results have some sort of time dependency. This is often displayed in the form of a date based ranking, but why not display results in a calendar interface or timeline (e.g. tracking Parliamentary bill process via a timeline)? Or where dates and locations are relevant to each resource, return the results via a calendar map display such as TimeMapper (more generally, see these different takes on storymaps). (I’ve always thought such displays should have two modes: a “show all” mode, and then a filtered mode, e.g. that shows just results for a particular time/geographical area limit.)

(One of the advantages of making search results available via a feed is that tinkerers can then easily wire the results into other sorts of display, particularly when feed items are also tagged, eg with facet information, dates, times, names of entities identified in the text, etc.)

A second sense in which we might think of contextualised search result displays is to identify the context of the user based on their interests. Given a huge set of linear search results, how might they then group, arrange or organise the results so that they can work with them more effectively?

Bento box displays offer a trivial start here for the visual display, for example by grouping differently faceted results in their own results panel. Looking at something like Parliamentary search, this might mean the user entering a search term and the results coming back in panels relating to different content types: results from research briefings in one panel, for example, from Hansard in another, written questions / answers in a third, and so on.

(Again, if results from the search engine are available as a tagged feed, it should be easy enough to roll your own display? Hmm… thinks.. what Javascript libraries would you use to display such a thing nowadays?)

It might also be possible to derive additional information from the results. For example, if each result is tagged with the members associated with it (the member sits on the committee, asked the question, or was the speaker in the Hansard result), then a simple ranked facet of members across all the resource types might identify who is interested in the topic (expert search / discovery also used to be a big thing, I seem to remember?).
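As a sketch of how little is needed once results arrive as a tagged feed, the following groups results into bento-style panels by content type and builds a ranked facet of associated members. The feed items and their field names (`title`, `type`, `members`) are assumptions invented for the example, not the output of any real Parliamentary search feed.

```python
from collections import Counter, defaultdict

# Hypothetical tagged feed items; the field names are assumptions
results = [
    {"title": "Briefing on buses", "type": "research briefing", "members": []},
    {"title": "Bus services debate", "type": "Hansard", "members": ["Member A", "Member B"]},
    {"title": "Question on bus funding", "type": "written question", "members": ["Member A"]},
    {"title": "Transport committee report", "type": "committee report", "members": ["Member A", "Member C"]},
]

# Bento-style panels: one list of result titles per content type
panels = defaultdict(list)
for item in results:
    panels[item["type"]].append(item["title"])

# Ranked facet: how many results each member is associated with
member_facet = Counter(m for item in results for m in item["members"])

for content_type, titles in panels.items():
    print(content_type, titles)
print(member_facet.most_common())
```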

In terms of trying to imagine differently contextualised displays, what sorts of user / user interest might there be? Off the top of my head, I can imagine:

  • someone searching for a topic “in general”: so just give them a list of stuff ranked however the search algo ranks it;
  • someone searching for a topic in general, organised by format or type (e.g. research briefing, written question/answer, parliamentary debate, committee report, etc), in which case a faceted display or bento box display might work;
  • someone searching for something in response to a news item, in which case they might want something ordered by time and maybe boosted by news mentions as a ranking factor (reminds me of trying to track media mentions of press releases and my press release / poll report CSE);
  • someone searching around the activities of an MP, in which case, you might want something like TheyWorkForYou member pages or perhaps a calendar or timeline view of their activity, or a narrative chart (e.g. with one line for a member, then other lines for the sorts of interaction they have with a topic – committee, question, debate – with each node linking to the associated document);
  • someone trying to track something in the context of the progress of a piece of legislation (or committee inquiry), in which case you may want a timeline, narrative chart or storyline style view; and maybe a custom search hub that searches over all documents relating to that piece of evolving legislation;
  • someone interested in people interested in a topic – expert search, in other words;
  • someone interested in the engagement of a person or organisation with Parliamentary processes, such as witness appearances at committee, submissions to written evidence, etc; it would also be handy if this turned up government relations, such as an association with a government group (it would be nice if that was a register, with each group having a formal register entry that included things like members…). Showing the different sorts of process, and the stage of the process at which the interaction or mention occurred could also be useful….

There are probably more…

Anyway, perhaps thinking about search could be fun again… So: does the new Parliamentary search make feeds available? And when are the Release TBC items listed on going to be available?!:-)

Making a Simple Database to Act as a Lookup for the ONS Register of Geographic Codes

Via a GSS (Government Statistical Service) blog post yesterday – Why Do We Need Another Register? – announcing the Statistical Geography Register, which contains “the codes for each type of statistical geography within the UK”, I came across mention of the ONS Register of Geographic Codes.

This register, maintained by the Office for National Statistics, is released as a zip file containing an Excel spreadsheet. Separate worksheets in the spreadsheet list codes for various geographies in England and Wales (but not Scotland; that data is available elsewhere).

To make a rather more reproducible component for accessing the data, I hacked together a simple notebook to pull the data out of the spreadsheet and pop it into a simple SQLite3 database as a set of separate tables, one per sheet.

One thing we need to do is reconcile items in the metadata sheet with the sheetnames, by joining a couple of the columns together with an underscore:

xl['RGC']["codeAbbrv"] = xl['RGC']["Entity code"].map(str) + '_' + xl['RGC']["Entity abbreviation"].map(str)

The bulk of the script is quite simple (see the full notebook here):

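import sqlite3
con = sqlite3.connect("onsgeocodes.sqlite")

# xl (dataframes keyed by sheet name), bigcodes and cols are created earlier in the notebook
bigcodes.to_sql(con=con, name='codelist', index=False, if_exists='replace')

sheets = list(xl.keys())
# Skip the first two sheets, then write each remaining sheet to its own table
for sheet in sheets[2:]:
    xl[sheet].to_sql(con=con, name=sheet, index=False, if_exists='replace')
    # Reorder the columns and append the rows to the big codelist table
    xl[sheet][['sheet']+cols].to_sql(con=con, name='codelist', index=False, if_exists='append')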

You may also notice that it creates a “big” table (codelist) that contains all the codes – which means we can look up the provenance of a particular code:

q='''
SELECT sheet, GEOGCD, GEOGNM, GEOGNMW, codelist.STATUS, "Entity name"
FROM codelist JOIN metadata WHERE "GEOGCD"="{code}" AND codeAbbrv=sheet
'''.format(code='W40000004')
pd.read_sql_query(q, con)

   sheet      GEOGCD     GEOGNM        GEOGNMW       STATUS  Entity name
0  W40_CMLAD  W40000004  Denbighshire  Sir Ddinbych  live    Census Merged Local Authority Districts

We can also look to see what (current) geographies might be associated with a particular name:

q='''
SELECT DISTINCT "Entity name", sheet FROM codelist JOIN metadata
WHERE "GEOGNM" LIKE "%{name}%" AND codeAbbrv=sheet AND codelist.STATUS="live"
'''.format(name='Isle of Wight')
pd.read_sql_query(q, con)

    Entity name                                       sheet
0   Super Output Areas, Lower Layer                   E01_LSOA
1   Super Output Areas, Middle Layer                  E02_MSOA
2   Unitary Authorities                               E06_UA
3   Westminster Parliamentary Constituencies          E14_WPC
4   Community Safety Partnerships                     E22_CSP
5   Registration Districts                            E28_REGD
6   Registration Sub-district                         E29_REGSD
7   Travel to Work Areas                              E30_TTWA
8   Fire and Rescue Authorities                       E31_FRA
9   Built-up Areas                                    E34_BUA
10  Clinical Commissioning Groups                     E38_CCG
11  Census Merged Local Authority Districts           E41_CMLAD
12  Local Resilience Forums                           E48_LRF
13  Sustainability and Transformation Partnerships    E54_STP

What I’m wondering now is – can I crib from the way the ergast API is put together to create a simple API that takes a code, or a name, and returns geography register information related to it?
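As a first step towards that sort of API, here is a sketch of a lookup function that takes either a code or a name. It runs against a tiny in-memory stand-in for the database, with a single row copied from the Denbighshire example, and the `lookup` function itself is hypothetical. It also passes values as query parameters rather than formatting them into the SQL string, which is the safer habit if the inputs ever come from a URL.

```python
import sqlite3

# Tiny in-memory stand-in for onsgeocodes.sqlite, using the column names from the queries above
con = sqlite3.connect(":memory:")
con.executescript('''
CREATE TABLE codelist (sheet TEXT, GEOGCD TEXT, GEOGNM TEXT, GEOGNMW TEXT, STATUS TEXT);
CREATE TABLE metadata (codeAbbrv TEXT, "Entity name" TEXT);
INSERT INTO codelist VALUES ('W40_CMLAD', 'W40000004', 'Denbighshire', 'Sir Ddinbych', 'live');
INSERT INTO metadata VALUES ('W40_CMLAD', 'Census Merged Local Authority Districts');
''')


def lookup(con, code=None, name=None):
    """Return register rows matching an exact code or a partial name."""
    q = '''SELECT sheet, GEOGCD, GEOGNM, GEOGNMW, codelist.STATUS AS STATUS, "Entity name"
           FROM codelist JOIN metadata ON codeAbbrv = sheet
           WHERE {condition}'''
    if code is not None:
        rows = con.execute(q.format(condition="GEOGCD = ?"), (code,))
    elif name is not None:
        rows = con.execute(q.format(condition="GEOGNM LIKE ?"), ("%" + name + "%",))
    else:
        raise ValueError("Provide a code or a name")
    columns = [d[0] for d in rows.description]
    return [dict(zip(columns, row)) for row in rows]


print(lookup(con, code="W40000004"))
print(lookup(con, name="Denbigh"))
```

Each result is a plain dictionary, so wrapping the function in a small web framework route that returns JSON would be a short next step.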

The same approach could also be applied to the registers I pull down from NHS Digital (here) – which makes me think I should generate a big codelist table for those codes too…

PS this in part reminds me of a conversation years ago with Richard et al from @cottagelabs who were mooting, at the time, a service that would take an arbitrary code and try to pattern match the coding scheme it was part of and then look it up in that coding scheme.

PPS hmm, also thinks: maybe names associated with coding schemes could be added to a simple document tagger.

HexJSON HTMLWidget for R, Part 2

In my previous post – HexJSON HTMLWidget for R, Part 1 – I described a first attempt at an HTMLwidget for displaying hexJSON maps using d3-hexJSON.

I had another play today and added a few extra features, including the ability to:

  • add a grid (as demonstrated in the original d3-hexJSON examples),
  • modify the default colour of data and grid hexes,
  • set the data hex colour via a col attribute defined on a hexJSON hex, and
  • set the data hex label via a label attribute defined on a hexJSON hex.

We can now also pass in the path to a hexJSON file, rather than just the hexJSON object:

Here’s the example hexJSON file:

And here’s an example of the default grid colour and a custom text colour:

I’ve also tried to separate out the code changes as separate commits for each feature update: code checkins. For example, here’s where I added the original colour handling.

I’ve also had a go at putting some docs in place, generated using roxygen2 called from inside the widget code folder with devtools::document(). (The widget itself gets rebuilt by running the command devtools::install().)

Next up – some routines to annotate a base hexJSON file with data to colour and label the hexes. (I’m also wondering if I should support the ability to specify arbitrary hexJSON hex attribute names for label text (label) and hex colour (col), or whether to keep those names as a fixed requirement?) See what I came up with here: HexJSON HTMLWidget for R, Part 3.

HexJSON HTMLWidget for R, Part 1

In advance of the recent UK general election, ODI Leeds published an interactive hexmap of constituencies to provide a navigation surface over various datasets relating to Westminster constituencies:

As well as the interactive front end, ODI Leeds published a simple JSON format for sharing the hex data – hexjson – which allows you to specify an identifier for each hex, some data values associated with it, and relative row (r) and column (q) co-ordinates:

It’s not hard to imagine the publication of default, or base, hexJSON documents that include standard identifier codes and appropriate co-ordinates, e.g. for Westminster constituencies, wards, local authorities, and so on being developed around such a standard.

So that’s one thing the standard affords – a common way of representing lots of different datasets.

Tooling can then be developed to inject particular data-values into an appropriate hexJSON file. For example, a hexJSON representation of UK HEIs could add a data attribute identifying whether an HEI received a Gold, Silver or Bronze TEF rating. That’s a second thing the availability of a standard supports.
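A sketch of what such tooling might look like: a small base hexJSON document and a function that injects a data attribute into matching hexes. The identifiers, names and ratings are all invented, the `inject` helper is hypothetical, and the document structure (a `layout` value plus `hexes` keyed by identifier, each with `q` and `r`) reflects my understanding of the format rather than a checked reading of the spec.

```python
import copy
import json

# A tiny base hexJSON document; the identifiers and names are invented
base_hexjson = {
    "layout": "odd-r",
    "hexes": {
        "HEI1": {"n": "First University", "q": 0, "r": 0},
        "HEI2": {"n": "Second University", "q": 1, "r": 0},
        "HEI3": {"n": "Third University", "q": 0, "r": 1},
    },
}


def inject(hexjson, values, attribute):
    """Return a copy of hexjson with values[id] stored under attribute on each matching hex."""
    out = copy.deepcopy(hexjson)
    for hex_id, value in values.items():
        if hex_id in out["hexes"]:
            out["hexes"][hex_id][attribute] = value
    return out


tef = {"HEI1": "Gold", "HEI2": "Silver", "HEI3": "Bronze"}  # made-up ratings
rated = inject(base_hexjson, tef, "tef")
print(json.dumps(rated, indent=2))
```

Working on a copy leaves the base document untouched, so the same base file can be reused for any number of datasets.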

By building a UI that reads data in from a hexJSON file, ODI Leeds have developed an application that can presumably render other people’s hexJSON files, again, another benefit of a standard representation.

But the availability of the standard also means other people can build other visualisation tools around the standard. Which is exactly what Oli Hawkins did with his d3-hexjson Javascript library, “a D3 module for generating [SVG] hexmaps from data in HexJSON format” as announced here. So that’s another thing the standard allows.

You can see an example here, created by Henry Lau:

You maybe start to get a feel for how this works… Data in a standard form, standard library that renders the data. For example, Giuseppe Sollazzo (aka @puntofisso), had a play looking at voter swing:

So… one of the things I was wondering was how easy it would be for folk in the House of Commons Library, for example, to make use of the d3-hexjson maps without having to do the Javascript or HTML thing.

Step in HTMLwidgets, a (standardised) format for publishing interactive HTML widgets from Rmarkdown (Rmd). The idea is that you should be able to say something like:

hexjsonwidget( hexjson )

and embed a rendering of a d3-hexjson map in HTML output from a knitted Rmd document.

(Creating the hexjson as a JSON file from a base (hexJSON) file with custom data values added to it is the next step, and the next thing on my to do list.)

So following the HTMLwidgets tutorial, and copying Henry Lau’s example (which maybe drew on Oli’s README?) I came up with a minimal take on a hexJSON HTMLwidget.


It’s little more than a wrapping of the demo template, and I’ve only tested it with a single example hexJSON file, but it does generate d3.js hexmaps:




It also needs documenting. And support for creating data-populated base hexJSON files. But it’s a start. And another thing the hexJSON has unintentionally provided support for.

But it does let you create HTML documents with embedded hexmaps if you have the hexJSON file handy:

By the by, it’s also worth noting that we can also publish an image snapshot of the SVG hexjson map in a knitr rendering of the document to a PDF or Microsoft Word output document format:

At first I thought this wasn’t possible, and via @timelyportfolio found a workaround to generate an image from the SVG:


, export_widget( )
), viewer=NULL) %>%
webshot( delay = 3 )

But then noticed that the PDF rendering was suddenly working – it seems that if you have the webshot and htmltools packages installed, then the PDF and Word rendering of the HTMLwidget SVG as an image works automagically. (I’m not sure I’ve seen that documented – the related HTMLwidget Github issue still looks to be open?)

See also: HexJSON HTMLWidget for R, Part 2, in which support for custom hex colour and labeling is added, and HexJSON HTMLWidget for R, Part 3, where I add in the ability to merge an R dataframe into a hexjson object, and create a hexjsonwidget directly from a dataframe.

Tinkering With Apache Drill – JOINed Queries Across JSON and CSV files

Coming down from a festival high, I spent the day yesterday jamming with code and trying to get a feel for Apache Drill. As I’ve posted before, Apache Drill is really handy for querying large CSV files.

The test data I’ve been using is Evan Odell’s 3GB Hansard dataset, downloaded as a CSV file but recast in the parquet format to speed up queries (see the previous post for details). I had another look at the dataset yesterday, and popped together some simple SQL warm up exercises in a notebook (here).

Something I found myself doing was flitting between running SQL queries over the data using Apache Drill to return a pandas dataframe, and wrangling the pandas dataframes directly, following the path of least resistance to doing the thing I wanted to do at each step. (Which is to say, if I couldn’t figure out the SQL, I’d try moving into pandas; and if the pandas route was too fiddly, rethinking the SQL query! That said, I also noticed I had got a bit rusty with SQL…) Another pattern of behaviour I found myself falling into was using Apache Drill to run summarising queries over the large original dataset, and then working with these smaller, summary datasets as in-memory pandas dataframes. This could be a handy strategy, I think.

As well as being able to query large flat CSV files, Apache Drill also allows you to run queries over JSON files, as well as directories full of similarly structured JSON or CSV files. Lots of APIs export data as JSON, so being able to save the response of multiple calls on a similar topic as uniquely named (and the name doesn’t matter..) flat JSON files in the same folder, and then run a SQL query over all of them simply by pointing to the host directory, is really appealing.
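The pattern can be sketched as follows: save each response to its own file in one folder, then hand Apache Drill a query that names the folder rather than a file. The response data and field names are invented, and the query text assumes Drill's default `dfs` storage plugin can see the directory; the sketch only builds the query string and does not connect to Drill.

```python
import json
import tempfile
from pathlib import Path

# Stand-ins for the responses from several API calls (invented data and field names)
responses = [
    {"uin": "1", "answering_body": "Department A"},
    {"uin": "2", "answering_body": "Department B"},
]

data_dir = Path(tempfile.mkdtemp()) / "answers"
data_dir.mkdir()

# Filenames only need to be unique; the query addresses the directory as a whole
for i, payload in enumerate(responses):
    (data_dir / f"page_{i}.json").write_text(json.dumps(payload))

# Query text to run in Apache Drill, assuming the default dfs storage plugin
query = (
    f"SELECT answering_body, COUNT(*) AS n "
    f"FROM dfs.`{data_dir}` GROUP BY answering_body"
)
print(query)
```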

I explored these features in a notebook exploring UK Parliament Written Answers data. In particular, the notebook shows:

  • requesting multiple pages on the same topic from the UK Parliament data API (not an Apache Drill feature!)
  • querying deep into a single large JSON file;
  • running JOIN queries over data contained in a JSON file and a CSV file;
  • running JOIN queries over data contained in a JSON file and data contained in multiple, similarly structured, data files in the same directory.

All I need to do now is come up with a simple “up and running” recipe for working with Apache Drill. I’m thinking: Docker compose and some linked containers: Jupyter notebook, RStudio, Apache Drill, and maybe a shared data volume between them?

Parliamentary Data Sources & API Wrappers

A stub post for collecting together:

  • code libraries that wrap APIs and datasets related to UK Parliament data;
  • downloadable datasets (raw and annotated);
  • API endpoints;
  • etc.

If you know of any others, please let me know via the comments…


Official Parliament APIs

Secondary Source APIs


  • mnis: a small Python library that makes it easy to download data on UK Members of Parliament from (about)
  • They Work For You API: no longer maintained?


  • hansard: R package to automatically fetch data from the UK Parliament API; but not Hansard!
  • twfy: R wrapper for TheyWorkForYou’s API (about)
  • twfyR: another R wrapper for TheyWorkForYou’s API

Data Sourcing Applications



  • Hansard Speeches and Sentiment: a public dataset of speeches in Hansard, with information on the speaking MP, their party, gender and age at the time of the speech. (Large files (~3GB) – example of how to query datafile using Apache Drill.)

Data Handling Utilities


Tracking down Data Files Associated With Parliamentary Business

One of the ways of finding data related files scattered around an organisation’s website is to run a web search using a search limit that specifies a data-y filetype, such as xlsx for an Excel spreadsheet (csv and xls are also good candidates). For example, on the Parliament website, we could run a query along the lines of filetype:xlsx and then opt to display the omitted results:
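A small sketch of building that sort of query for several filetypes at once. The `site:` limit and the `parliament.uk` domain are assumptions about how the search would be scoped to the Parliament website, and the `filetype_query` helper is hypothetical.

```python
from urllib.parse import urlencode


def filetype_query(site, filetypes):
    """Build a web search query limited to one site and a set of file types."""
    types = " OR ".join(f"filetype:{ft}" for ft in filetypes)
    return f"site:{site} ({types})"


# Assumed domain for the Parliament website
q = filetype_query("parliament.uk", ["xlsx", "xls", "csv"])
print(q)

# URL-encoded form, ready to append to a search engine's query URL
print("?" + urlencode({"q": q}))
```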

Taken together, these files form an ad hoc datastore (e.g. as per this demo on using FOI response on WhatDoTheyKnow as an “as if” open datastore).

Looking at the URLs, we see that data containing files are strewn about the online Parliamentary estate (that is, the website;-)…

Freedom of Information Related Datasets

Parliament seems to be quite open in the way it handles its FOI responses, publishing disclosure logs and releasing datafile attachments rooted on

Written Questions

Responses to Written Questions often come with datafile attachments.

These files are posted to the subdomain

Given the numeric key for a particular question, we can run a query on the Written Answers API to find details about the attachment:

Looking at the actual URL, something like, it looks as if some guesswork is required generating the URL from the data contained in the API response? (For example, how might original attachments be distinguished from other attachments (such as “revised” ones, maybe?).)

Written Statements

Written statements often come with one or more data file attachments.

The data files also appear on the subdomain although it looks like they’re on a different path to the answered question attachments ( compared to This subdomain doesn’t appear to have the data files indexed and searchable on Google? I don’t see a Written Statements API on either?

Deposited Papers

Deposited papers often include supporting documents, including spreadsheets.

Files are located under

At the current time there is no API search over deposited papers.

Committee Papers

A range of documents may be associated with Committees, including reports, responses to reports, and correspondence, as well as evidence submissions. These appear to mainly be PDF documents. Written evidence documents are rooted on and can be found from committee written evidence web (HTML) pages rooted on the same path (example).

A web search for inurl:committee (filetype:xls OR filetype:csv OR filetype:xlsx) doesn’t turn up any results.

Parliamentary Research Briefings

Research briefings are published by Commons and Lords Libraries, and may include additional documents.

Briefings may be published along with supporting documents, including spreadsheets:

The files are published under the following subdomain and path:

The file attachments URLs can be found via the Research Briefings API.

This response is a cut down result – the full resource description, including links to supplementary items, can be found by keying on the numeric identifier from the URI _about which the “naturally” identified resource (e.g. SN06643) is described.


Data files can be found variously around the Parliamentary website, including down the following paths:

  • (appear in Written Answers API results);
  • (appear in Research Briefings API results)

(I don’t think the API supports querying resources that specifically include attachments in general, or attachments of a particular filetype?)

What would be nice would be support for discovering some of these resources. A quick way in to this would be the ability to limit search query responses to webpages that link to a data file, on the grounds that the linking web page probably contains some of the keywords that you’re likely to be searching for data around?

Tinkering With Parliament Data APIs: Commons Written Questions And Parliamentary Written Answers

So…. inspired by @philbgorman, I had a quick play last night with Parliament Written Questions data, putting together a recipe (output) for plotting a Sankey diagram showing the flow of questions from Members of the House of Commons by Party to various Answering Bodies for a particular parliamentary session.

The response that comes back from the Written Questions API includes a question uin (unique identification number?). If you faff around with date settings on the Parliamentary Questions web page you can search for a question by this ID:

Here’s an example of the response from a bulk download of questions (by 2015/16 session) from the Commons Written Questions API, deep filtered by the uin:

If you tweak the _about URI, which I think refers to details about the question, you get the following sort of response, built around a numeric identifier (447753 in this case):

There’s no statement of the actual answer text in that response, although there is a reference to an answer resource, again keyed by the same numeric key:

The numeric key from the _about identifier is also used with both the Commons Written Questions API and the Parliamentary Questions Answered API.

For example, questions:

And answers:

The uin values can’t be used with either of these APIs, though?

PS I know, I know, the idea is that we just follow resource links (but they’re broken, right? the leading lda. is missing from the http identifiers), but sometimes it’s just as easy to take a unique fragment of the URI (like the numeric key) and then just drop it into the appropriate context when you want it. In this case, contexts are


IMHO, any way… ;-)

PPS for a full list of APIs, see