Scraping Texts

One of the things I learned early on about scraping web pages (often referred to as “screen scraping”) is that it often amounts to trying to recreate databases that have been re-presented as web pages using HTML templates. For example:

  • display a database table as an HTML table in a web page;
  • display each row of a database as a templated HTML page.

The aim of the scrape in these cases might be as simple as pulling the table from the page and representing it as a dataframe, or it might involve reverse engineering the HTML template that converts data to HTML, producing something that can extract the data from the HTML back into a row of a corresponding data table.

In the latter case, the scrape may proceed in a couple of ways. For example:

  • by trying to identify structural HTML tag elements that contain recognisable data items, retrieving the HTML tag element, then extracting the data value;
  • by parsing the recognisable literal text displayed on the web page and trying to extract data items based on that (i.e. ignoring the HTML structural elements and going straight for the extracted text). For an example of this sort of parsing, see the r1chardj0n3s/parse Python package as applied to text pulled from a page using something like the kennethreitz/requests-html package.

When scraping from PDFs, it is often necessary to make use of positional information (the co-ordinates that identify where on the page a particular piece of text can be found) as well as literal text / pattern matching to try to identify different structured items on the page.
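
As a rough sketch of how positional information might be used, the following assumes that some PDF text extraction step has already produced a list of words with x/y co-ordinates. The word list, the co-ordinate convention (y increasing down the page) and the function names are all hypothetical, made up for illustration rather than taken from any particular PDF library.

```python
# Hypothetical extracted words: (x, y, text), with y increasing down the page.
words = [
    (120, 100, "Joe"), (50, 100, "Name:"), (150, 100, "Blogs"),
    (50, 120, "Ref:"), (120, 121, "A17"),
]


def group_into_lines(word_boxes, y_tolerance=3):
    """Group words with similar y positions into lines, ordered left to right."""
    lines = []
    for x, y, text in sorted(word_boxes, key=lambda w: (w[1], w[0])):
        if lines and abs(lines[-1]["y"] - y) <= y_tolerance:
            lines[-1]["words"].append((x, text))
        else:
            lines.append({"y": y, "words": [(x, text)]})
    return [[text for _, text in sorted(line["words"])] for line in lines]


def value_for_label(lines, label):
    """Return the text to the right of a label that starts a line, or None."""
    for line in lines:
        if line and line[0] == label:
            return " ".join(line[1:])
    return None


page_lines = group_into_lines(words)
print(value_for_label(page_lines, "Name:"))
```

The tolerance on the y co-ordinate is there because words on the same visual line often have slightly different baselines in real extractions.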

In more general cases, however, such as when trying to abstract meaningful information from arbitrary, natural language, texts, we need to up our game and start to analyse the texts as natural language texts.

At the basic level, we may be able to do this by recognising structural patterns within the text. For example:

Name: Joe Blogs
Address: 17, Withnail Street, Lesser Codswallop

We can then do some simple pattern matching to extract the identified elements.
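
A minimal sketch of that sort of pattern matching, using only Python's standard `re` module, might look like the following. It assumes every record line takes the form `Label: value`; the function name `extract_fields` is just an illustrative choice.

```python
import re

# Sample text using the "Label: value" layout shown above.
text = """Name: Joe Blogs
Address: 17, Withnail Street, Lesser Codswallop"""

# Each record line is a label, a colon, then the value up to the end of the line.
FIELD_PATTERN = re.compile(r"^(?P<label>[A-Za-z ]+):\s*(?P<value>.+)$", re.MULTILINE)


def extract_fields(raw):
    """Return a dict mapping each label to its value."""
    return {
        m.group("label").strip(): m.group("value").strip()
        for m in FIELD_PATTERN.finditer(raw)
    }


record = extract_fields(text)
print(record)
```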

Within the text, there may also be things that we might recognise as company names, dates, or addresses. Entity recognition refers to a natural language processing technique that attempts to extract words that describe “things”, that is, entities, as well as identifying what sorts of “thing”, or entity, they are.

One powerful Python natural language processing package, spacy, has an entity recognition capability that lets us identify entities within a text in a couple of ways. The spacy package includes models for a variety of languages that can identify things like people’s names (PERSON), company names (ORG), MONEY and DATE strings.

However, we can also extend spacy by developing our own models, or building on top of spacy’s pre-existing models.

In the first case, we can build an enumerated model that explicitly identifies terms we want to match against a particular entity type. For example, we might have a list of MP names that we want to use to tag a text to identify wherever an MP is mentioned.
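
To illustrate the idea of an enumerated list without depending on spacy itself, here is a plain Python sketch that tags every occurrence of a listed name in a text. The names are made-up placeholders and `tag_names` is a hypothetical helper; a real list would come from an authoritative source, and a spacy-based version would plug into its own matching machinery.

```python
import re

# Placeholder names for illustration only.
MP_NAMES = ["Jane Example", "John Placeholder"]


def tag_names(text, names, label="MP"):
    """Return (start, end, label, matched_text) for every listed name found."""
    # Try longer names first so they win over shorter names they contain.
    ordered = sorted(names, key=len, reverse=True)
    alternatives = "|".join(re.escape(name) for name in ordered)
    pattern = re.compile(r"\b(?:" + alternatives + r")\b")
    return [(m.start(), m.end(), label, m.group(0)) for m in pattern.finditer(text)]


sample = "Jane Example asked a question, and John Placeholder replied."
matches = tag_names(sample, MP_NAMES)
print(matches)
```

Exact string matching like this will miss variants such as titles, initials or misspellings, which is one reason for moving to the more general models described next.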

In the second case, we may want to build a more general sort of model. Again, spacy can help us here. One way of matching text items is to look at the “shape” of tokens (words) in a text. For example, we might extract the shape of the word “Hello” as “Xxxxx” to identify upper and lower case alphabetic characters. We might use the “d” symbol to denote a numerical character. A common UK postcode form may then be identified from its shape, such as XXd dXX or Xd dXX.
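
The shape idea can be sketched in a few lines of plain Python. This is a standalone approximation, not spacy's own implementation (spacy exposes a shape on each token, and its version may differ in detail, for example in how it treats long runs of the same character type). The postcode-like strings in the sample are invented.

```python
def word_shape(token):
    """Map uppercase letters to X, lowercase to x, digits to d; keep the rest."""
    out = []
    for ch in token:
        if ch.isdigit():
            out.append("d")
        elif ch.isalpha():
            out.append("X" if ch.isupper() else "x")
        else:
            out.append(ch)
    return "".join(out)


# The two postcode shapes mentioned above, as (outward, inward) token pairs.
POSTCODE_SHAPES = {("XXd", "dXX"), ("Xd", "dXX")}


def find_postcode_candidates(text):
    """Return adjacent token pairs whose shapes look like a UK postcode."""
    tokens = [t.strip(".,;:()") for t in text.split()]
    return [
        f"{first} {second}"
        for first, second in zip(tokens, tokens[1:])
        if (word_shape(first), word_shape(second)) in POSTCODE_SHAPES
    ]


print(word_shape("Hello"))
print(find_postcode_candidates("Write to AB1 2CD or Z9 8YX today."))
```

Only two of the valid UK postcode shapes are covered here, so this finds candidates rather than validating postcodes.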

Another way of matching elements is to look at “parts of speech” (POS) tags and the patterns they make. If you remember your grammar, these are things like nouns, proper nouns and adjectives, or conjunctions and prepositions.
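
As a hedged illustration of pattern matching over POS tags, the sketch below looks for runs of consecutive proper nouns, which is one crude way of spotting candidate names. The (word, tag) pairs are hand-assigned here to stand in for the output of a real tagger, and `runs_with_tag` is a hypothetical helper.

```python
# Hand-assigned (word, tag) pairs standing in for a POS tagger's output.
tagged = [
    ("The", "DET"), ("Right", "PROPN"), ("Honourable", "PROPN"),
    ("Member", "PROPN"), ("asked", "VERB"), ("about", "ADP"),
    ("rural", "ADJ"), ("broadband", "NOUN"), ("in", "ADP"),
    ("Lesser", "PROPN"), ("Codswallop", "PROPN"),
]


def runs_with_tag(tagged_tokens, tag, min_length=2):
    """Return runs of at least min_length consecutive tokens carrying tag."""
    runs, current = [], []
    for word, token_tag in tagged_tokens:
        if token_tag == tag:
            current.append(word)
        else:
            if len(current) >= min_length:
                runs.append(" ".join(current))
            current = []
    # Flush a run that reaches the end of the sentence.
    if len(current) >= min_length:
        runs.append(" ".join(current))
    return runs


candidates = runs_with_tag(tagged, "PROPN")
print(candidates)
```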

Looking at a sentence in terms of its POS tags provides another level of structure across which we might look for patterns.

The following shows how even a crude model can start to identify useful features in a text, albeit with some false matches:

For examples of scraping texts, see this notebook: psychemedia/parlihacks/notebooks/Text%20Scraping%20-%20Notes

PS related, in policy / ethical best practice terms: ONS web-scraping policy

HexJSON HTMLWidget for R, Part 3

In HexJSON HTMLWidget for R, Part 1 I described a basic HTMLwidget for rendering hexJSON maps using d3-hexJSON, and HexJSON HTMLWidget for R, Part 2 described updates for supporting colour.

Having booked off today for emergency family cover that turned out not to be required, I had another stab at the package, so it now supports the following additional features…

Firstly, I had a go at popping some “base” hexjson files into a location within the package from which I could load them (checkin). This was based on a crib from here, which suggests putting datafiles into an extdata folder in the package inst/ folder, from where devtools::build() makes them available in the root directory of the built package.

hexjsonbasefiles <- function(){
  list.files(system.file("extdata", package = "hexjsonwidget"))
}

With the files in place, we can use any base hexjson files included in the package as the basis for hexmaps.

I also added the ability to switch off labels although later in the day I simplified this process…

One thing that was close to the top of my list was the ability to merge the contents of a dataframe into a hexJSON object. In particular, for a row identified by a particular key value associated with a hex key value, I wanted to map columns onto hex attributes. The hexjson object is represented as a list, so this required a couple of things: firstly, getting the dataframe data into an appropriate list form; secondly, merging this into the hexjson list using the list.merge() function from the rlist package. Here’s the gist of the trick I ended up with: split() the dataframe by row name, then use lapply(.., as.list) to turn each row into a list named by its row name:

library(rlist)
# Turn each dataframe row into a list, named by its row name
ll <- lapply(split(customdata, rownames(customdata)), as.list)
jsondata$hexes <- list.merge(jsondata$hexes, ll)

A hexjsondatamerge(hexjson,df) function takes a hexjson file and merges the contents of the dataframe into the hexes:

The contents of a dataframe can also be merged in directly when creating a hexjsonwidget:

Having started to work with dataframes, it also seemed like it might be sensible to support the creation of a hexjson object directly from a dataframe. This uses a similar trick to the one used in creating the nested list for the merge function:

hexjsonfromdataframe <- function(df,layout="odd-r", keyid='id',
                                 q='q', r='r'){

  rownames(df) = df[[keyid]]
  df[[keyid]] = NULL
  colnames(df)[colnames(df) == q] = 'q'
  colnames(df)[colnames(df) == r] = 'r'

  list(layout=layout,
       hexes=lapply(split(df, rownames(df)), as.list))
}

As you might expect, we can then use the hexjson object to create a hexjsonwidget:

A hexjsonwidget can also be created directly from a dataframe:

If we wanted to save the hexjson to a file, we could do something like: write( toJSON( jjx ), "test_out.hexjson" ).

In creating the hexjson-from-dataframe, I also refactored some of the other bits of code to simplify the number of parameters I’d started putting into the hexjsonwidget() function, in effect overloading them so the same named parameter could be used in different supporting functions.

I think that’s pretty much it from the developments I had in mind for the package. Now all I need to do is put it into practice… (testing which will, no doubt, throw up issues!)

PS Not quite it… I just added some simple file handlers too: to save the hexjson to a file, use hexjsonwrite(df, filename) and to read a json/hexjson file into a hexjson object use hexjsonread(filename).

HexJSON HTMLWidget for R, Part 2

In my previous post – HexJSON HTMLWidget for R, Part 1 – I described a first attempt at an HTMLwidget for displaying hexJSON maps using d3-hexJSON.

I had another play today and added a few extra features, including the ability to:

  • add a grid (as demonstrated in the original d3-hexJSON examples),
  • modify the default colour of data and grid hexes,
  • set the data hex colour via a col attribute defined on a hexJSON hex, and
  • set the data hex label via a label attribute defined on a hexJSON hex.

We can now also pass in the path to a hexJSON file, rather than just the hexJSON object:

Here’s the example hexJSON file:

And here’s an example of the default grid colour and a custom text colour:

I’ve also tried to separate out the code changes as separate commits for each feature update: code checkins. For example, here’s where I added the original colour handling.

I’ve also had a go at putting some docs in place, generated using roxygen2 called from inside the widget code folder with devtools::document(). (The widget itself gets rebuilt by running the command devtools::install().)

Next up – some routines to annotate a base hexJSON file with data to colour and label the hexes. (I’m also wondering if I should support the ability to specify arbitrary hexJSON hex attribute names for label text (label) and hex colour (col), or whether to keep those names as a fixed requirement?) See what I came up with here: HexJSON HTMLWidget for R, Part 3.

Future Incoming… Reproducible Parliamentary Research Briefings?

One of the things I’ve noticed coming out of GDS, the Government Digital Service, over the last few weeks is that reproducible research and report generation seems to be an active area of interest. In conversations with folk around the House of Commons Library, it seems as if there may be an appetite for starting to explore a similar approach there. So what might be involved?

As the GDS post on Reproducible Analytical Pipelines suggests, a typical workflow for producing a report containing statistical results, charts and tables often looks something like the following:

(Actually, I’m not sure what different roles the statistical software and spreadsheet are supposed to achieve in that diagram?)

In this case, data is obtained from a data store, wrangled, analysed, tabulated and charted, annotated and commented upon, and then published. The workflow is serialised across several applications (which in itself is not necessarily a bad thing) but may be laborious to reproduce, for example if the data is updated and a new version of the briefing is required.

A few weeks ago, representatives from the DfE and DCMS, along with GDS, look to have held a seminar on how this sort of workflow has been implemented for statistics production in their departments. (I’m guessing some of the contributors were participants in the Data Science Accelerator programme…?)

A similar sort of workflow exists for producing library research briefings, along with a cut down variant of the workflow for answering Members’ questions (the output format may be an email rather than a PDF or Word document.) In the latter case, reproducing the workflow may be required if a member needs an updated response to a previously asked question, or another member asks a similar question to a previously asked question but about a different constituency.

In this case, automation of analyses or the production of document assets (tables, charts, etc) may support reuse of methods used to answer a question for one particular constituency across all constituencies, or methods used to answer a question at one particular time to re-answer the same question at a later time with updated information.
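
To make the reuse argument concrete, here is a small sketch in which the method for answering a question is written once, as a function of constituency and year, and then applied to every constituency in the data. The dataset, column names and figures are invented for illustration.

```python
import csv
import io

# Invented data: one row per constituency per year.
RAW = """constituency,year,claimants
Northtown,2016,120
Northtown,2017,150
Southvale,2016,80
Southvale,2017,60
"""


def load_rows(csv_text):
    """Parse the CSV text and convert numeric columns to integers."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    for row in rows:
        row["year"] = int(row["year"])
        row["claimants"] = int(row["claimants"])
    return rows


def yearly_change(rows, constituency, year):
    """Change in claimants between the previous year and year, for one area."""
    by_year = {
        r["year"]: r["claimants"] for r in rows if r["constituency"] == constituency
    }
    return by_year[year] - by_year[year - 1]


rows = load_rows(RAW)
constituencies = sorted({r["constituency"] for r in rows})
answers = {name: yearly_change(rows, name, 2017) for name in constituencies}
print(answers)
```

Re-answering the question later would then amount to loading updated data and calling the same function with a different year.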

A functionally equivalent process to the GDS workflow can be implemented using a reproducible research toolchain such as the one described using the R statistical programming language and knitr publishing package by Michael Sachs in his Tools for Reproducible Research: Introduction to knitr presentation:

In this case, the data ingest, analysis, tabulation and charting is done in the context of the annotation and commentary as part of a single source document – the Rmd report. Various automated publication routes then handle the rendering and publication of the final document.

In the following example, a prepackaged dataset is used as the basis for a simple scatterplot, created using one line of code.

One of the possible arguments that can be made against the automated production of reports containing graphics is that the graphics won’t conform to the house style or convention. However, graphics packages such as matplotlib in Python, or ggplot in R, allow you to both “write charts” and create chart objects to which styles can then be applied.

In the above example, the styling applied to the chart object can be updated by adding a simple predefined clause to the definition of the chart object. (Themes can be updated in Python using Seaborn styles or R using ggplot ggthemes. GDS have already produced a Government “govstyle” theme for R, so it should equally be possible to produce green-and-red themes for the House of Commons and House of Lords libraries respectively.)

(Related to this, I’ve also previously dabbled with quick experiments to automatically generate accessible text descriptions from scripted chart objects.)

An output document, in this case HTML (but it could be PDF, or a Microsoft Word document), can then be generated from the source document. If desired, the display of the code used to generate the chart objects can be suppressed in the output document. Generating an updated version of the chart just requires an update to the dataset and regenerating the output document, in the desired format, from the source document.

 

Reproducible code scripts can also be used to produce a particular chart type in a self-documenting way that may be used as a training example or as the basis for another diagram. For example, this Migrant Flow notebook documents the creation of a Sankey diagram as well as providing information about how to export it in different file formats.

As well as scripting statistical analyses and chart generation so that they can be reproduced, code can often be reused as part of an interactive application. In the R ecosystem, the shiny package supports the creation of customised interactive applications around a particular dataset.

For example, a dataset reporting different broadband statistics at LSOA level for a particular local authority can be used as the basis of a graphical reporting tool that displays a selected statistic using a choropleth map.

The creation of such an application can be used to demonstrate how reusable components can be developed one useful step at a time to provide a range of tools around a particular dataset. For example:

  • for a particular LA, generate a map for a particular statistic; this requires loading in a shapefile and a datafile that use the same identifier scheme to identify regions. Data associated with a particular region can then be used to colour the corresponding area of the map. This might be developed in response to a query from a particular person for a particular area, and used to generate a map or tabular data returned in an email, for example.
  • generalise the mapping function so that it can use data associated with a selected statistic within the datafile to produce maps/tables for other members based on their constituency code.
  • create an interactive application that uses column headings in the datafile corresponding to different statistical measures as the options in a drop down list selection UI component; the selected item from the drop down list can then be used to trigger the generation of the map for that statistic.

The above example relates to a tool that plots a variety of statistics contained within a single data file for a particular LA. (You can find the code here.) This code can itself act as a building block for further work. For example:

  • extend the code to generate maps for other LAs when specifying an appropriate LA code.
  • write a script to iterate through all LA codes and produce a report for each. (For a [related example](https://blog.ouseful.info/2017/02/23/reporting-in-a-repeatable-parameterised-transparent-way/), [these documents](https://psychemedia.github.io/parlihacks/iwgplsoadocs/) were created from a script that mapped the patient catchment area within a particular LA for a particular GP practice code, that was itself called multiple times from a second script that looked up the GP practice codes within a particular LA.)
  • use a different datafile (using similar region codes) to display different sorts of data.
  • add a button to the interactive application to generate and download a PDF or png version of the map.
  • add a selection list to the interactive application to allow the user to select a particular LA as well as a particular statistic.
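
The “iterate through all LA codes and produce a report for each” step might be sketched along the following lines. This is a simplified Python stand-in using the standard library's `string.Template`, not the Rmd/knitr scripts linked above; the LA codes, names and figures are placeholders.

```python
from string import Template

# A toy text template standing in for a parameterised report document.
REPORT = Template(
    "Broadband report for $la_name ($la_code)\n"
    "Mean download speed: $speed Mbit/s\n"
)

# Placeholder codes and figures for illustration only.
areas = {
    "LA001": {"la_name": "Example Borough", "speed": 31.5},
    "LA002": {"la_name": "Sample District", "speed": 24.0},
}


def build_reports(template, area_data):
    """Render one report per area code from the same template."""
    return {
        code: template.substitute(la_code=code, **values)
        for code, values in sorted(area_data.items())
    }


reports = build_reports(REPORT, areas)
for code, body in reports.items():
    print(body)
```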

Each additional step results in something more or differently useful, and provides another building block or code fragment that could be reused as a building block or “tweakable example” elsewhere.

In the Jupyter notebook ecosystem, ipywidgets provide a complement to the R/Shiny approach by allowing the use of interactive widgets inline in a notebook, or as part of an interactive dashboard.

By scripting – that is, automating – different parts of the enquiry answering process, we can start to develop a range of components that can be used to scale the answering of queries of the same form to other areas or periods of time without having to redo the same work each time.

Making code available also supports checking and transparency of method. For example, Chris Hanretty’s blog post asking Is the left over-represented within academia? is backed up by code that allows the reader to check his working, although the rerunnability of the script falls short by not explicitly specifying how to obtain and load in the source data. The media are also starting to make use of reproducible scripts to support some of their stories. For example, Buzzfeed News regularly publish scripts, such as this one on how Nobel Prizewinners Show Why Immigration Is So Important For American Science, as background support for their stories. (See also: Data Journalism Units on Github.) By publishing reproducible research scripts, third parties can not only check working and assumptions, but may also extend or otherwise build on the same research. They can also generate chart assets, for example, according to the first party analysis and then theme exactly that chart in their own house style.

Reusability and “scripting support” can also be promoted through the use and development of software packages that make accessing and analysing particular datasets easier. For example, Oli Hawkins’ Python MNIS API wrapper, or Evan Odell’s Hansard Speeches and Sentiment dataset or R hansard Parliamentary data API wrapper provide tools that make it easier to access or ingest data into Python or R environments. The Python [pandas-datareader](https://pandas-datareader.readthedocs.io/en/latest/) package offers support for accessing data from a growing number of sources including the World Bank, OECD and Eurostat, and exposing it as tabular pandas dataframes.

Identifying other often used data sources as candidates for “wrapping” in a similar way, such that data access can be automated in a repeatable way, is one way of not only improving local workflow but also contributing back to the wider data using community.

Accessing such datasources using scripts enhances an analysis by including the provenance of the data in such a way that a third party can also access it. (If the datasource does not support versioning but may include updates, keeping an archival copy of the data used in a particular analysis is also recommended…)
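
One lightweight way of recording provenance alongside an archival copy is to store a checksum of the exact bytes used, together with the source and retrieval date. The sketch below assumes the data has already been downloaded; the URL is a placeholder and `provenance_record` is a hypothetical helper.

```python
import hashlib
import json


def provenance_record(source_url, content, retrieved):
    """Describe an archived copy of a dataset so it can be verified later."""
    return {
        "source": source_url,
        "retrieved": retrieved,
        "bytes": len(content),
        "sha256": hashlib.sha256(content).hexdigest(),
    }


# Stand-in for the bytes of a downloaded data file.
content = b"constituency,year,claimants\nNorthtown,2017,150\n"
record = provenance_record("https://example.org/data.csv", content, "2017-03-01")
print(json.dumps(record, indent=2))
```

If the source is later updated, recomputing the checksum on a fresh download shows whether the analysis is still running against the same data.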

Reuse and Build On – IW Broadband Reports

A couple of weeks ago I posted a demo of how to automate the production of a templated report (catchment for GP practices by LSOA on the Isle of Wight) using Rmd and knitr (Reporting in a Repeatable, Parameterised, Transparent Way).

Today, I noticed another report, with data, from the House of Commons Library on Superfast Broadband Coverage in the UK. This reports at the ward level rather than the LSOA level the GP report was based on, so I wondered how easy it would be to reuse the GP/LSOA code for a broadband/ward map…

After fighting with the Excel data file (metadata rows before the header and at the end of the table, cruft rows between the header and data table proper) and the R library I was using to read the file (it turned the data into a tibble, with spacey column names I couldn’t get to work with ggplot, rather than a dataframe – I ended up saving to CSV then loading back in again…), not many changes were required to the code at all… What I really should have done was abstract the code into an R file (and maybe some importable Rmd chunks) and try to get the script down to as few lines of bespoke code to handle the new dataset as possible – maybe next time…

The code is here and example PDF here.

I also had a quick play at generating a shiny app from the code (again, cutting and pasting rather than abstracting into a separate file and importing… I guess at least now I have three files to look at when trying to abstract the code and to test against…!)

Shiny code here.

So what?

So this has got me thinking – what are the commonly produced “types” of report or report section, and what bits of common/reusable code would make it easy to generate new automation scripts, at least at a first pass, for a new dataset?