One of the things I learned early on about scraping web pages (often referred to as “screen scraping”) is that it often amounts to trying to recreate databases that have been re-presented as web pages using HTML templates. For example:
- display a database table as an HTML table in a web page;
- display each row of a database as a templated HTML page.
The aim of the scrape in these cases might be as simple as pulling the table from the page and representing it as a dataframe, or trying to reverse engineer the HTML template that converts data to HTML into something that can extract the data from the HTML back as a row in a corresponding data table.
In the latter case, the scrape may proceed in a couple of ways. For example:
- by trying to identify structural HTML tag elements that contain recognisable data items, retrieving the HTML tag element, then extracting the data value;
- parsing the recognisable literal text displayed on the web page and trying to extract data items based on that (i.e. ignore the HTML structural eelements and go straight for the extracted text). For an example of this sort of parsing, see the r1chardj0n3s/parse Python package as applied to text pulled from a page using something like the kennethreitz/requests-html package.
When scraping from PDFs, it is often necessary to make use of positional information (the co-ordinates that identify where on the page a particular piece of text can be found) as well as literal text / pattern matching to try to identify different structured items on the page.
In more general cases, however, such as when trying to abstract meaningful information from arbitrary, natural language, texts, we need to up our game and start to analyse the texts as natural language texts.
At the basic level, we may be able to do this by recognising structural patterns with the text. For example:
Name: Joe Blogs Address: 17, Withnail Street, Lesser Codswallop
We can then do some simple pattern matching to extract the identified elements.
Within the text, there may also be things that we might recognise as company names, dates, or addresses. Entity recognition refers to a natural language processing technique that attempts to extract words that describe “things”, that is, entities, as well as identifying what sorts of “thing”, or entity, they are.
One powerful Python natural language processing package,
spacy, has an entity recognition capability that lets us identify entities within a text in couple of ways. The
spacy package includes models for a variety of languages that can identify thinks like people’s names (
PEOPLE), company names (
However, we can also extend
spacy by developing our own models, or building on top of
spacy‘s pre-existing models.
In the first case, we can build an enumerated model that explicitly identifies terms we want to match against a particular entity type. For example, we might have a list of MP names that we want to use to tag a text to identify wherever an MP is mentioned.
In the second case, we may want to build a more general sort of model. Again,
spacy can help us here. One way of matching text items is to look at the “shape” of tokens (words) in a text. For example, we might extract the shape of the word “Hello” as “Xxxxx” to identify upper and lower case alphabetic characters. We might use the “d” symbol to denote a numerical character. A common UK postcode form may then be identified from its shape, such as XXd dXX or Xd dXX.
Another way of matching elements is to look at “parts of speech” (POS) tags and the patterns they make. If you remember your grammar, things like nouns, proper nouns or adjectives, or conjunctions and prepositions.
Looking at a sentence in terms of its POS tags provides another level of structure across which we might look for patterns.
The following shows how even a crude model can start to identify useful features in a text, albeit with some false matches:
For examples of scraping texts, see this notebook: psychemedia/parlihacks/notebooks/Text%20Scraping%20-%20Notes
PS related, in policy / ethical best practice terms: ONS web-scraping policy