Scraping Texts

One of the things I learned early on about scraping web pages (often referred to as “screen scraping”) is that it often amounts to trying to recreate databases that have been re-presented as web pages using HTML templates. For example:

  • display a database table as an HTML table in a web page;
  • display each row of a database as a templated HTML page.

The aim of the scrape in these cases might be as simple as pulling the table from the page and representing it as a dataframe, or trying to reverse engineer the HTML template that converts data to HTML into something that can extract the data from the HTML back as a row in a corresponding data table.

In the latter case, the scrape may proceed in a couple of ways. For example:

  • by trying to identify structural HTML tag elements that contain recognisable data items, retrieving the HTML tag element, then extracting the data value;
  • by parsing the recognisable literal text displayed on the web page and trying to extract data items based on that (i.e. ignore the HTML structural elements and go straight for the extracted text). For an example of this sort of parsing, see the r1chardj0n3s/parse Python package as applied to text pulled from a page using something like the kennethreitz/requests-html package.

When scraping from PDFs, it is often necessary to make use of positional information (the co-ordinates that identify where on the page a particular piece of text can be found) as well as literal text / pattern matching to try to identify different structured items on the page.

In more general cases, however, such as when trying to abstract meaningful information from arbitrary, natural language, texts, we need to up our game and start to analyse the texts as natural language texts.

At the basic level, we may be able to do this by recognising structural patterns within the text. For example:

Name: Joe Blogs
Address: 17, Withnail Street, Lesser Codswallop

We can then do some simple pattern matching to extract the identified elements.
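As a minimal sketch of that sort of pattern matching, assuming each item sits on its own line as a single-word label followed by a colon, a regular expression can pull the labelled values into a dict. The `FIELD_PATTERN` and `extract_fields` names are illustrative, not taken from any package:

```python
import re

# Sample text in the "Label: value" layout shown above
text = """Name: Joe Blogs
Address: 17, Withnail Street, Lesser Codswallop"""

# Assumption: one "Label: value" pair per line, with a single-word label
FIELD_PATTERN = re.compile(r"^(?P<label>\w+):\s*(?P<value>.+)$", re.MULTILINE)

def extract_fields(raw_text):
    """Return a dict mapping each label to its stripped value."""
    return {m.group("label"): m.group("value").strip()
            for m in FIELD_PATTERN.finditer(raw_text)}

record = extract_fields(text)
print(record["Name"])     # Joe Blogs
print(record["Address"])  # 17, Withnail Street, Lesser Codswallop
```

Real texts are rarely this tidy, so a pattern like this usually needs loosening (multi-word labels, values that wrap over several lines) once you see what the source actually contains.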

Within the text, there may also be things that we might recognise as company names, dates, or addresses. Entity recognition refers to a natural language processing technique that attempts to extract words that describe “things”, that is, entities, as well as identifying what sorts of “thing”, or entity, they are.

One powerful Python natural language processing package, spacy, has an entity recognition capability that lets us identify entities within a text in a couple of ways. The spacy package includes models for a variety of languages that can identify things like people’s names (PERSON), company names (ORG), MONEY and DATE strings.

However, we can also extend spacy by developing our own models, or building on top of spacy’s pre-existing models.

In the first case, we can build an enumerated model that explicitly identifies terms we want to match against a particular entity type. For example, we might have a list of MP names that we want to use to tag a text to identify wherever an MP is mentioned.
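The enumerated idea can be sketched without spacy at all, just to show the principle: take a list of known names and report where each one occurs in a text. This is plain Python rather than spacy's own matching machinery, and the names in `MP_NAMES` are made-up placeholders:

```python
import re

# Hypothetical lookup list; a real one might be loaded from a file
MP_NAMES = ["Jane Example", "John Placeholder"]

def tag_names(text, names, label="MP"):
    """Return (start, end, label, matched_text) tuples for each listed name found."""
    if not names:
        return []
    # Try longer names first so they win over shorter names they contain
    ordered = sorted(names, key=len, reverse=True)
    alternatives = "|".join(re.escape(name) for name in ordered)
    pattern = re.compile(r"\b(?:" + alternatives + r")\b")
    return [(m.start(), m.end(), label, m.group(0)) for m in pattern.finditer(text)]

sample = "Questions were asked by Jane Example and later by John Placeholder."
matches = tag_names(sample, MP_NAMES)
for start, end, label, found in matches:
    print(label, found, start, end)
```

An exact-match list like this misses variants ("J. Example", "Ms Example"), which is one reason to move on to more general models.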

In the second case, we may want to build a more general sort of model. Again, spacy can help us here. One way of matching text items is to look at the “shape” of tokens (words) in a text. For example, we might extract the shape of the word “Hello” as “Xxxxx” to identify upper and lower case alphabetic characters. We might use the “d” symbol to denote a numerical character. A common UK postcode form may then be identified from its shape, such as XXd dXX or Xd dXX.
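To make the shape idea concrete, here is a rough sketch of a shape function written in plain Python. It is not spacy's implementation, just an illustration of the mapping described above; `word_shape`, `POSTCODE_SHAPES` and `looks_like_postcode` are names made up for the example, and it works on a whole string where a tokeniser would normally split at the space:

```python
def word_shape(token):
    """Map uppercase letters to X, lowercase to x, digits to d; keep anything else."""
    shape = []
    for ch in token:
        if ch.isdigit():
            shape.append("d")
        elif ch.isalpha():
            shape.append("X" if ch.isupper() else "x")
        else:
            shape.append(ch)
    return "".join(shape)

# The two postcode shapes mentioned above; real UK postcodes have more forms
POSTCODE_SHAPES = {"XXd dXX", "Xd dXX"}

def looks_like_postcode(candidate):
    """True if the candidate string has one of the listed postcode shapes."""
    return word_shape(candidate) in POSTCODE_SHAPES

print(word_shape("Hello"))             # Xxxxx
print(word_shape("MK7 6AA"))           # XXd dXX
print(looks_like_postcode("M1 1AE"))   # True
print(looks_like_postcode("Hello"))    # False
```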

Another way of matching elements is to look at “parts of speech” (POS) tags and the patterns they make. If you remember your grammar, these are things like nouns, proper nouns and adjectives, or conjunctions and prepositions.

Looking at a sentence in terms of its POS tags provides another level of structure across which we might look for patterns.
The following shows how even a crude model can start to identify useful features in a text, albeit with some false matches:

For examples of scraping texts, see this notebook: psychemedia/parlihacks/notebooks/Text%20Scraping%20-%20Notes

PS related, in policy / ethical best practice terms: ONS web-scraping policy

Some More Rally Result Chart Sketches

Some more sketches, developing / updating one of the charts I first played with last year (the stage chart), and tinkering with something new.

First the stage chart – I’ve started pondering a couple of things with this chart to try to get the information density up a bit.

At a first attempt at updating the chart, I’ve started to look at adding additional marginal layers. In the example above:

  • vertical dashed lines separate out the different legs. As soon as I get the data to hand, I think it could make sense to use something like a solid line to show service, maybe a double solid line to show *parc fermé*; I’m not sure about additionally separating the days? (They’re perhaps implied by *parc fermé*? I need to check that…)
  • I added stage names *above* the chart – this has the benefit of identifying stages that are repeated;
  • stage distances are added *below* the chart. I’ve also been wondering about adding the transit distances in *between* the stages;
  • driver labels – and positions – are shown to the left and the right.

As a second attempt, I started zooming in to just the stages associated with a particular leg. This encouraged me to start adding more detailed layers. These can be applied to the whole chart, but it may start to get a bit cluttered.

Here’s an example of a chart that shows three stages that make up a notional leg:

You’ll notice several additions to the chart:

  • the labels to the left identify the driver associated with each line. The number is currently the overall position of the driver at the end of the first stage in the leg, but I’m not sure if it should be the position at the end of the previous stage so it carries more information. The time is the gap to the overall leading driver at the end of the first stage;
  • the labels to the right show the overall positions and gap to overall leader at the end of the leg. The position label is in bold font if the driver position has improved over the leg (a switch lets you select whether this is a class rank improvement or an overall position improvement). Thinking about it, I could use italics for class improvement and bold for overall improvement to carry both pieces of information in the same label. The position is actually redundant (you can count…) so maybe it’d make more sense to give a position delta from the start of the leg (that is, the position at the end of the stage prior to the first stage shown in the current leg). The time delta is given in bold if it is better than at the start of the leg.
  • the red dots depict that the gap to the overall leader had *increased* for a driver by the end of the stage compared to the end of the previous stage. So a red dot means the car is further behind the leader at the end of the stage than they were at the end of the previous stage; this indicator could be rebased to show deltas between a target (“hero”) car and the other cars on the stage. The green dot shows that the time to the leader did not increase;
  • the grey labels at the top are a running count of the number of “wins in a row” a driver has had. There are options to choose other running counts (eg stage wins so far), and flags available for colouring things like “took lead”, “retained lead”, “lost lead”.
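Two of the layers in the list above are simple running calculations, so here is a rough sketch of how they might be computed. The function names and the data are made up for illustration and are not the code behind the charts:

```python
def wins_in_a_row(stage_winners):
    """For each stage, the winner's current run of consecutive stage wins."""
    streaks = []
    previous, run = None, 0
    for winner in stage_winners:
        run = run + 1 if winner == previous else 1
        streaks.append(run)
        previous = winner
    return streaks

def gap_markers(gaps_to_leader):
    """'red' where the gap to the leader grew over a stage, otherwise 'green'."""
    return ["red" if now > before else "green"
            for before, now in zip(gaps_to_leader, gaps_to_leader[1:])]

# Hypothetical stage winners, in stage order
winners = ["AAA", "AAA", "BBB", "AAA", "AAA", "AAA"]
print(wins_in_a_row(winners))  # [1, 2, 1, 1, 2, 3]

# Hypothetical gap to the overall leader (ms) for one driver at the end of each stage
gaps = [0, 2100, 2100, 5400, 4000]
print(gap_markers(gaps))       # ['red', 'green', 'red', 'green']
```

The first gap in the list only acts as the baseline, so there is one marker fewer than there are stages.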

As well as the stage chart, I started wondering about an “ultimate stage report” for each stage, showing the delta between each driver and the best time achieved in a sector (that is, the time spent between two splits).
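A minimal sketch of the “ultimate stage” calculation, assuming the sector times are already to hand as a list per driver. The driver codes and times are invented, and `ultimate_stage` / `deltas_to_ultimate` are illustrative names:

```python
# Hypothetical sector times in ms: driver -> [sector 1, sector 2, sector 3]
sector_times = {
    "AAA": [100000, 200000, 150000],
    "BBB": [102000, 198000, 151000],
    "CCC": [101000, 205000, 149000],
}

def ultimate_stage(times):
    """Best time set on each sector, regardless of who set it."""
    return [min(sector) for sector in zip(*times.values())]

def deltas_to_ultimate(times):
    """Per driver, the time lost on each sector relative to the best sector time."""
    best = ultimate_stage(times)
    return {driver: [t - b for t, b in zip(driver_times, best)]
            for driver, driver_times in times.items()}

best = ultimate_stage(sector_times)
deltas = deltas_to_ultimate(sector_times)
# Total time lost to the "ultimate" stage, one value per driver
total_lost = {driver: sum(d) for driver, d in deltas.items()}

print(best)        # [100000, 198000, 149000]
print(total_lost)  # {'AAA': 3000, 'BBB': 4000, 'CCC': 8000}
```

The per-sector deltas would feed the lollipops and the totals the grey bar.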

Here’s what I came up with at a first attempt. Time delta is on the bottom. The lower level grey bar indicates the time a driver lost relative to the “ultimate” stage. (The bar maxes out at the upper limit of the chart to indicate “more than” – I maybe need to indicate this visually eg with a dashed / broken line at the end of a maxed out bar.)

Within each driver area is a series of lollipop style charts. These indicate the gap between a driver and the best time achieved on the sector (first sector at the top of the group, last at the bottom). The driver label indicates the driver who achieved the best sector time. This chart could be rebased to show other gaps, but I need to think about that… The labels are coloured to indicate sector, and transparent to cope with some of the overlapping issues.

It’s also possible to plot this chart using a log scale:

This makes it easier to see the small gaps, as well as giving a far wider range on the delta. However, log scales are harder to read for folk not familiar with them. It might be handy to put in a vertical dashed line for each power of 10 time (so a dashed line at 1s and 10s; the limit is 100s). It might also make sense to add a label to the right of the total delta bar to show what the actual delta time is.

So… tinkering… I was hoping to start to pull all the chart types I’ve been playing with together in a Leanpub book, but Leanpub is not free to play anymore unless you have generated over $10k of royalties (which I haven’t…). I’ve started looking at gitbook, but that’s new to me so I need to spend some time getting a feel for how to use it and to come up with a workflow / toolchain around it.

More Tinkering With PyTracery

Having started to look at using pytracery for data2text transforms with pandas dataframes, I’ve been pondering how to do computational things with it, such as counts, conditional tests, and loops.

Here’s where I’m at – defining counters, a branch / conditional statement, and a conditional loop:

def inc(text, *params):
    return str(int(text)+1)

def dec(text, *params):
    return str(int(text)-1)

def branch(text, *params):
    # params: (value to test against, rule name if equal, rule name otherwise)
    if text==params[0].strip(): 
        ans = params[1].strip()
    else:
        ans = params[2].strip()
    return '[answer:#{}#]'.format(ans)

def whileNot0(text, *params):
    if int(text) > 0:
        return '[i:{i}]'.format(i=str(int(text)-1))
    return '[do:#{}#]'.format(params[0].strip())

pytracery_logic = {
    'inc': inc,
    'dec': dec,
    'branch': branch,
    'whileNot0': whileNot0
}

Count example:

rules = {'origin': 'Count demo - #countdemo#',
         'countdemo':'Count at [cnt:#initval#]#cnt#.\n#addone#, #addanother#, #takeone#',
         'addone': 'Add one gives [cnt:#cnt.inc#]#cnt#',
         'addanother': 'add another one gives [cnt:#cnt.inc#]#cnt#',
         'takeone':'take one gives [cnt:#cnt.dec#]#cnt.int_to_words#',
}

grammar = tracery.Grammar(rules)

> Count demo - Count at 1.
> Add one gives 2, add another one gives 3, take one gives two

Logical test:

rules={'origin':'Logical test:[branch:#condition#]#branch# Testing #value# gives: #answer#',
       'condition': '#value.branch(VeryTrue, answer1, answer2)#',
       'answer1':'This is true...',
       'answer2':'This is false...',
}

grammar = tracery.Grammar(rules)

> Logical test: Testing VeryTrue gives: This is true...

> Logical test: Testing notTrue gives: This is false...

Loop test:

rules = {'origin': 'Loop test: \n#loop#',
         'action':'- count is at #i#\n',
         'end':'All done'
}

grammar = tracery.Grammar(rules)

> Loop test: 
> - count is at 5
> - count is at 4
> - count is at 3
> - count is at 2
> - count is at 1
> - count is at 0
> All done

I’m a bit out of the way of declarative thinking (which pytracery relies on) atm, so there may be more native / idiomatic ways of doing this sort of stuff in pytracery.

I’m also a bit rusty on assembler / machine code programming, so there may be better ways of phrasing basic computational operations in pytracery using those sorts of operators.

Next step is to see whether I can start to use these reasoning steps in association with data pulled from a pandas dataframe.

Rally Stage Sector Charts, Overlaid On Stage Progress Charts

One of the last rally charts I sketched last year was a dodged bar chart showing “sector” times within a stage for each driver, either as absolute times or rebased relative to a particular target driver. This represented an alternative to the “driver subseries” line charts e.g. as shown here.

Re: the naming of driver subseries charts – this is intended to be reminiscent of seasonal subseries charts.

The original split time data on the WRC site looks like this:

Taking the raw sector (split) times, we can rebase the times relative to a particular driver. In this case, I have rebased relative to OGI so the sector times are as shown in the table above. The colour basis is the opposite to the basis used in the chart because I am trying to highlight to the target driver where they lost time, rather than where the others gained time. It may be that the chart makes more sense to professionals if I change the colour basis in the chart below, to use green to show that the driver made up that amount of time on the target driver.

The dodged bar charts are ordered with the first split time at the top of the set for each driver. The overall stage time is the lower bar in each driver group.

Here’s how it looks using the other colour basis:

Hmm… maybe that is better…

Note that the above charts also differ from the WRC split times results table in the ordering. The results table orders results in order of start, whereas the above charts rank relative to stage position.

To generate the “sector” times, we can find the difference between each split for a particular driver, and between the final (overall) stage time and the final split time. As before, we can then rebase these times relative to a specific target driver.
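As a rough sketch of that calculation, assuming the split times are cumulative times from the stage start in ms: the first “sector” is just the first split, and the last is the stage time minus the final split. The driver data is made up, and the sign convention in `rebase_sectors` (other driver minus target, so a negative value means the other driver was quicker on that sector) is my assumption for the example:

```python
def sector_times(splits, stage_time):
    """Convert cumulative split times plus the stage time into per-sector durations."""
    marks = [0] + list(splits) + [stage_time]
    return [later - earlier for earlier, later in zip(marks, marks[1:])]

def rebase_sectors(driver_sectors, target_sectors):
    """Sector times relative to the target: negative means quicker than the target."""
    return [d - t for d, t in zip(driver_sectors, target_sectors)]

# Hypothetical cumulative splits and overall stage times, in ms
target = sector_times([60000, 125000, 190000], 250000)
other = sector_times([61000, 124000, 191500], 249000)
rebased = rebase_sectors(other, target)

print(target)   # [60000, 65000, 65000, 60000]
print(other)    # [61000, 63000, 67500, 57500]
print(rebased)  # [1000, -2000, 2500, -2500]
```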

This chart shows how the target driver compared to each of the other drivers on each of the “sectors” between each split. So we see OGI dropped quite a lot of time relative to other drivers on the fourth “sector” of the stage, between splits 3 and 4. He made up time against MEE on the first and second parts of the stage.

As well as just showing the split times, we can find the total time delta between the target driver and each of the other drivers on the stage as the sum of the sector times. We can use a lower graphic layer to underplot this total beneath the dodged bars for each driver.
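The total is a one-liner once the rebased sector times are available. A small sketch with invented values (ms), where each list holds a driver's sector times relative to the target driver:

```python
# Hypothetical rebased sector times in ms: driver -> delta to target on each sector
rebased_sectors = {
    "BBB": [1000, -2000, 2500, -2500],
    "CCC": [500, 500, -250, 1250],
}

# The underplotted bar for each driver is the sum of that driver's sector deltas
total_delta = {driver: sum(deltas) for driver, deltas in rebased_sectors.items()}

print(total_delta)  # {'BBB': -1000, 'CCC': 2000}
```

Because the sectors tile the whole stage, this sum should equal the difference in overall stage times, which makes a handy sanity check on the data.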

The grey bars show the total time gained / lost by the target driver on the stage relative to the other drivers.

In a similar way, we can also overplot the dodged bars on top of a stage progress chart, recolouring slightly.

This increases the information density of the stage progress chart even further, and provides additional “delta” signals in keeping with the deltascope inspiration / basis for that chart type.

Again, this sort of fits with the warped hydraulic model: the dodged bars can be stacked to give the length of the lighter coloured bar underneath them.

(I’m not sure I’ve ordered the drivers correctly in the above chart – it was generated from two discordantly arranged datasets. The final chart will be generated from a single, consistent dataframe.)

PS it strikes me that the dodged bars need to be transparent to show a solid bar underneath that doesn’t extend as far as the dodged bars. This might need some careful colour / transparency balancing.

Sketching – WRC Stage Progress Chart

Picking up on some sketches I started doing around WRC (World Rally Championship) results data last year, and after rewriting pretty much everything following an update to WRC’s website that means results data can now be accessed via JSON data feeds, here are some notes on a new chart type, which I’m calling a Stage Progress Chart for now.

In contrast to one of my favourite chart forms, the macroscope, which is intended to visualise the totality of a dataset in a single graphic, this one is developed as a deltascope, where the intention is to visualise a large number of differences in the same chart.

In particular, this chart is intended to show:

  • the *time difference* between a given car and a set of other cars at the *start* of a stage;
  • the *time difference* between that same given car and the same set of other cars at the *end* of a stage;
  • any *change in overall ranking* between the start and end of the stage;
  • the *time gained* or the *time lost* by the given car relative to each of the other cars over the stage.

Take 1

The data is presented on the WRC website as a series of data tables. For example, for SS4 of WRC Rally Sweden 2018 we get the following table:

The table on the left shows the stage result, the table on the right the overall position at the end of the stage. DIFF PREV is the time difference to the car ahead, and DIFF 1st the difference to the first ranked driver in the table, howsoever ranked: the rank for the left hand table is the stage position, the rank on the right hand table is overall rally position at the end of the stage.

One thing the table does not show is the start order.

The results table for SS4 is shown below. SS4 is the stage that forms the focus for this walkthrough.

Data is obtained by scraping the WRC results JSON data feeds and popping it into a SQLite3 database. A query returns the data for a particular stage:

We can pivot the data to get the total accumulated times (in ms) at the end of each stage for the current (SS4) and previous (SS3) stage for each driver onto the same row:

Obtaining the times for a target driver allows us to rebase the times relative to that driver as a Python dict:

Rebasing is simply a matter of relativising times with respect to the target driver times:

The rebased accumulated rally times for each stage are relative to the accumulated rally time for the target driver at the end of the corresponding stage. The rebased delta gives the time that the target driver either made up, or lost against, each other car on the stage. The deltaxx value is “a thing” used to support the chart plotting. It’s part of the art… ;-)
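A minimal sketch of the rebasing step, using plain dicts rather than the dataframe behind the real chart. The driver codes and times are invented, `rebase` is an illustrative name, and the sign convention (positive means the other driver is behind the target) is an assumption for the example:

```python
# Hypothetical accumulated rally times in ms at the end of the previous and current stage
totals = {
    "AAA": {"prev": 600000, "curr": 900000},
    "BBB": {"prev": 598000, "curr": 913000},
    "CCC": {"prev": 615000, "curr": 901000},
}

def rebase(times, target):
    """Express each driver's accumulated times relative to the target driver."""
    base = times[target]
    rebased = {}
    for driver, t in times.items():
        prev = t["prev"] - base["prev"]  # gap to target going into the stage
        curr = t["curr"] - base["curr"]  # gap to target at the end of the stage
        # delta > 0: the target gained time on this driver over the stage
        rebased[driver] = {"prev": prev, "curr": curr, "delta": curr - prev}
    return rebased

rebased = rebase(totals, "AAA")
print(rebased["BBB"])  # {'prev': -2000, 'curr': 13000, 'delta': 15000}
print(rebased["CCC"])  # {'prev': 15000, 'curr': 1000, 'delta': -14000}
```

Here BBB went into the stage 2s ahead of the target and came out 13s behind, while CCC clawed back 14s of a 15s deficit.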

UPDATE: it strikes me that thing() is actually returning the value closest to zero if the signs are the same:

if samesign(overall,delta): return min([overall,delta], key=abs)
return delta
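Fleshing that fragment out into something runnable: the `samesign` helper is not shown above, so the definition here is a guess (both values strictly positive or both strictly negative), and the argument order of `thing` is assumed from the fragment:

```python
def samesign(a, b):
    """Assumed helper: True if a and b are both positive or both negative."""
    return a * b > 0

def thing(overall, delta):
    """Return the value closest to zero if the signs match, otherwise the delta."""
    if samesign(overall, delta):
        return min([overall, delta], key=abs)
    return delta

print(thing(1000, 15000))   # 1000
print(thing(-2000, 15000))  # 15000
print(thing(-5000, -3000))  # -3000
```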

So, that’s data wrangly bits… here’s what we can produce from it:

The chart is rebased according to a particular driver, in this case Ott Tänak. It’s overloaded and you need to learn to read it. It’ll be interesting to see how naturally it reads with some practice – if it doesn’t, then it’s not that useful; if it does, it may be handy for power users.

First up, what do you see?

The bars are ordered and have positive (left) and negative (right) values, with names down the side. The names are driver labels, in this case for WRC RC1 competitors. The ordering is in overall rally class rank order at the end of the stage identified in the title (SS4). So at the end of stage SS4, NEU was in the lead of the rally class, with MIK in second.

The chart times are rebased relative to TÄN – at the moment this is not explicitly identified, though it can be read from the chart – the bars for TÄN are all set to 0.

Rebasing means that the times are normalised (rebased) relative to a particular driver (the “target driver”).

The times that are used for the rebasing are:

  • the difference between the accumulated rally stage time of the target driver and each other driver at the end of the *previous* stage;
  • the difference between the accumulated rally stage time of the target driver and each other driver at the end of the specified stage (in this case, SS4).

From these differences, calculated relative to the target driver, we can calculate the time gained (or lost) on the stage by the target driver relative to each of the other drivers.

The total length of each continuous coloured bar indicates the time delta on the stage between the target driver and each other car (that is, the time gained / lost by the target driver relative to that car). Red means time was lost, green means time was gained.

So TÄN lost about 15s (red) to NEU on the stage, and gained about 50s (green) on AL. He also lost about 3s (pink) to MEE and about 15s to BRE (pink plus red).

The furthest extent of the light grey bar, or the solid (red / green) bar shows the overall difference to that driver from the target at the end of the stage. So NEU is 17s or so ahead and EVA is about 80s behind. Of the 80s or so EVA is behind, TÄN made about 50s of those (green) on the current stage. Of the 17s or so TÄN is behind NEU, NEU made up about 15s on the current stage, and went into the stage ahead (grey to the right) of TÄN by about 2s.

If a pastel (pink or light green) colour totally fills a bar to left or right, that is a “false time”. It doesn’t indicate the overall time at the end of the stage (the grey bar on the other side does that), but it does allow you to read off how much time was gained / lost.

The dashed bar indicates where a driver started the stage relative to the target driver. So SUN started the stage about 15s behind and made up about 14s (the pink bar to the right). The grey bar to the left shows SUN finished the stage about 1s behind TÄN. PAD also started about 15s behind and made up about 15s, leaving him just a fraction of a second behind (below) TÄN at the end of the stage. If you look very closely, you might just see a tiny sliver of grey to the left for PAD.

Where coloured bars straddle the zero line, that shows that the target driver has gone from being ahead of (or behind) a driver at the start of the stage to being behind (ahead of) them at the end of the stage. So MIK, LAP, OST, BRE and LAT all started the stage behind TÄN but finished overall ahead at the end. (If you look at the WRC overall result table on the right for SS3 (the previous stage) you’ll see TÄN was second overall. At the end of SS4, he was in seventh, as the stage progress chart shows.)

Here’s the chart for the same stage rebased relative to PAD.

Here we see that PAD made up time on TÄN and LAT, remaining a couple of seconds behind LAT (the light green that fills the bar to the left shows this is a “false” time) and not quite getting ahead of TÄN overall (TÄN is above PAD). SUN held steady relative to PAD, but PAD took good chunks of time away from MEE all the way down (green). PAD also lost (red) a second or two to each of NEU, LAP and OST, and lost most to MIK.

Take 2

The interpretation of the pink / light green bars is a bit confusing, so let’s simplify a little to show times that should be added using the solid red / green colour, and the pastel colours purely as false times.

We can also simplify the chart symbols further by removing the dashed line and just using a solid line for bars.

Now the solid colour clearly shows the times should be added together to give the overall delta on the stage between the rebase target car and the other cars.

One thing missing from the chart is the start order, which could be added as part of the y-axis label. The driver the chart is rebased relative to should also be identified clearly, even if just in the chart title.

As to understanding how the chart works, one way is to think of it as using a warped hydraulic model.

For example:

  • imagine that a grey bar to the right shows how far behind a particular car the target driver is at the start of the stage; being ahead, the bar is above that of the target car:
    • the car extends its lead over the target driver during the stage, so the bar plunger is pulled to the right and it fills with red (time lost) fluid at the base. The amount of red fluid is the additional time the target driver lost on that stage to that car;
    • the car reduces its overall lead over the target driver during the stage, which is to say the target driver gains time. The plunger is pushed to the left and fills with notional light green time, to the left, representing the time gained by the target driver; the actual time the target driver is behind the car is still shown by the solid grey bar to the right; the white gap at the end of the bar (similar in size to the notional light green extruded to the left) shows how much of the lead the car had at the start of the stage has been lost during the stage. Reading off the width of the white area is hard to do, which is why we map it to the notional light green time to the left;
    • the car loses all its lead and falls behind the target car, overall, by the end of the stage. In this case, the grey plunger to the right is pushed all the way down, past zero, which fills the whole bar with solid green; the bar is further pushed to the left by the amount of time the car is now behind the target car, overall, at the end of the stage. The total width of the solid green bar is the total amount of time the target car gained on the stage. The vertical positioning of the bar also falls below that of the target car to show it is now behind. That the bar is filled green to the right also indicates that the car was ahead of the target car at the end of the previous stage;
  • imagine that a grey bar to the left shows how far ahead of a particular car the target driver is at the start of the stage; being behind, the bar is below that of the target car:
    • the car loses further ground on the stage, so the bar is extended to the left and fills with green “time gained by target car” fluid at the base;
    • the car makes some time back on the target car on the stage, but not enough to get ahead overall; the bar is pushed to the right, leaving a white space to the left indicative of the time the target car lost. The amount by which the grey still extends to the left is the time the target driver is still ahead of the car at the end of the stage. A notional pink time (the same width as the white space created on the left) is shown to the right;
    • the car makes all the time it was behind at the start of the stage back, and then some; the bar moves above the target car on the vertical dimension. The plunger is pushed so far from the left it passes zero and fills the bar with solid red (time lost by target car). The amount the bar extends to the right is the time the car is ahead of the target car, overall, at the end of the stage. The total width of the completely solid red bar is the total time the target driver lost relative to that car on the stage. The fact that the bar is above the target driver and has a solid red bar that extends on the left as well as the right further shows that the car was behind the target at the start of the stage.

I’m still feeling my way with this one… as mentioned at the start, it’ll be interesting to see how natural it feels – or not – after some time spent trying to use it.

Creepy or Creeping? Or Both?

A few years ago, Ray Corrigan suggested to me that the “surveillance state” would most likely be manifest through the everyday mundane actions of petty bureaucrats and administrators, as well as the occasional bitter petty-Hitler having a bad day.

Ray has a systems background, and systems theories were one of my early passions (I started reading General Systems Theory books whilst still at school), so I suspect we both tend to see things from a variety of systems perspectives.

This means that we see the everyday drip, drip, drip of new technologies, whether electronic, digital or legal (because I take legal and contractual systems to be technologies too) not as independent initiatives but as potential components of a wider system that allows these apparently innocuous separate components to be combined or composed together to create something new.

(Inventing a new thing in and of itself is not that interesting. There are many more opportunities for combining things that already exist in new ways than there are for inventing new things “out of nowhere”.)

For example, today a news story has been doing the rounds repeating claims that the DWP are requesting video surveillance materials from leisure centres as part of fraud investigations [DWP: No denial of blanket harvesting of CCTV or Fake-Friending of Disabled People]. It is clear that the DWP do make use of video surveillance footage (just do a news search on disability DWP video) but not the extent to which they do or are legally allowed to request footage from third parties.

The story reminded me about some digging I did a few years ago around DWP using the opportunity of police roadside traffic checks for fraud trawling [All I Did Was Go to a Carol Service…]. It’s maybe worth revisiting that, and updating it with examples relating to requests for video surveillance footage.

What’s notable, perhaps, is the way in which laws are written that afford enforcers a legal basis for some activity that, when the law is originally written, is apparently innocuous, and only offensive to creeped out paranoid science fiction activists imagining what the application of the law might mean, whilst at the same time lighting up the eyes of techies who see the law as an enabler of a particular, technologically mediated action that could be “really cool”.

“Cool, we could use smartphone fingerprint readers for handheld fingerprint checks anywhere and it’d be really cool…” Etc.

Which is not totally science fiction. Because if you recall, UK police started using handheld fingerprint readers for identification purposes as part of “everyday” stop and search activities earlier this year [Police in West Yorkshire can now use mobile phones to check your fingerprints on the spot]. You can read a Home Office press release about the trial here: Police trial new Home Office mobile fingerprint technology.

Amidst the five minute flurry of reactionary “this is not acceptable behaviour” tweets, Andrew Keogh pointed out that [t]his isn’t new law, it has been on the Statute Book for some time, arising from an update to the Police and Criminal Evidence Act, 1984 (PACE) by the Serious Organised Crime and Police Act 2005. My naive reading of that amendment is that the phrases “A constable may take a person’s fingerprints without the appropriate consent” and “Fingerprints taken … may be checked against other fingerprints to which the person seeking to check has access” put in place the legal basis for procedures that are perhaps only now becoming possible through the widespread availability of handheld, networked devices.

One wonders how much other law is already in-place that a creative administrator or bureaucrat might pick up on and “reimagine” using some newly available technology.

At the same time that legislation might incorporate “sleeper” clauses that can be powerfully invoked or enabled through the availability of new technologies, the law also fails to keep up with the behaviour of the large web companies (that were once small web companies).

I was never convinced by Google’s creative interpretations of copyright law on the one hand, or its “we’re a platform not a media company” mantra (often combined with US constitutional “free speech” get out clauses). It will be interesting to see which one they deploy against the AgeID requirements for viewing online pornography in the UK which is to be regulated by the BBFC (background: DCMS Age Verification for pornographic material online impact assessment, 2016 as well as the current UK gov Internet Safety Strategy activity). Because Google and Bing are, of course, pr0n search engines with content preview availability.

On the topic of identity creep, another government initiative announced some time ago (Voter ID pilot to launch in local elections [Sept 2017]) had a more formal announcement earlier this week: Voter ID pilot schemes – May, 2018. For a reaction that summarises my own feelings, see eg James Ball, Electoral fraud is incredibly rare. The Tories’ ID trial is an unsavoury attempt at voter suppression. When you take the time out to go and vote, just remember to take your ID with you. And if you need to get some ID, make sure you do it well in advance of the election day on which you’ll need it.

As with many technologically led, “digital first” initiatives, an anti-pattern approach loved by certain public bodies as a way of making services inaccessible to those most likely to either require them or benefit from them, I can’t help getting the feeling that technology is increasingly seen by those who know how to use it as an instrument for abuse, rather than as an enabler. Which really sucks.