Live By Machine – CircleCI and Docker Hub AutoBuilds

Some time ago I put together a recipe for creating a simple data analysis workbench around the ergast F1 data using Chris Newell’s ergast API: Setting up a Containerised Desktop API server (MySQL + Apache / PHP 5) for the ergast Motor Racing Data API.

All the ingredients are in this repo.

The ergast Docker container is built using an automated Docker build whenever the Github repo is updated.

Following on from Simon Willison’s recipe for generating a commit log for San Francisco’s official list of trees, a scraper hosted on Github that does a daily scrape using CircleCI and then commits updates back to the repo, I’ve also set CircleCI to run against my repo using a daily cron job that copies the latest version of the ergast MySQL db file from the ergast website and then commits it to the repo.

To provide CircleCI with access to your Github account / organisations, go to your Github personal settings / Applications / OAuth applications.

The .circleci/config.yml file is pretty much a straight rip-off of Simon’s:

version: 2
jobs:
  fetch_and_commit:
    docker:
      - image: circleci/python:3.6.4
    steps:
      - checkout
      - run:
          command: |
            cp ergastdb/data/f1db.sql.gz ergastdb/data/f1db-old.sql.gz
            curl -o ergastdb/data/f1db.sql.gz "http://ergast.com/downloads/f1db.sql.gz"
            git add ergastdb/data/f1db.sql.gz
            git config --global user.email "ergastbot@example.com"
            git config --global user.name "ergastbot"
            git commit -m "Daily update..." && \
              git push -q https://${GITHUB_PERSONAL_TOKEN}@github.com/psychemedia/ergast-f1-api.git master \
              || true
workflows:
  version: 2
  build:
    jobs:
      - fetch_and_commit
  nightly:
    triggers:
      - schedule:
          cron: "0 0 * * *"
          filters:
            branches:
              only:
                - master
    jobs:
      - fetch_and_commit

The Github Personal Access Token was set up just with permissions to access my public repos:

I then used the value of the token for the GITHUB_PERSONAL_TOKEN environment variable in the appropriate CircleCI project:

What this means is that the Ergast API container should be regularly rebuilt, automatically, using a regularly updated copy of the ergast database.

Public CircleCI build logs can be found here:

https://circleci.com/api/v1.1/project/github/psychemedia/ergast-f1-api/

Add an optional build count (an integer) at the end of the URL for specific build details.
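As a minimal sketch of that URL pattern, here is a small helper that builds either the project URL or the URL for one specific build. The function name build_url is my own invention for illustration; only the base URL comes from the post.

```python
from typing import Optional

# Base URL as given above
BASE_URL = "https://circleci.com/api/v1.1/project/github/psychemedia/ergast-f1-api/"


def build_url(build_num: Optional[int] = None, base_url: str = BASE_URL) -> str:
    """Return the project URL, or the URL for one build if a build number is given."""
    base = base_url.rstrip("/") + "/"
    if build_num is None:
        return base
    if build_num < 1:
        raise ValueError("build numbers are positive integers")
    return f"{base}{build_num}"


print(build_url())
print(build_url(12))
```

The resulting URL can then be fetched with whatever HTTP client is to hand.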

Neo4J Graph Database Running in MyBinder

Earlier today, I spotted this rather handy Global Witness repo that includes data ingest and analysis around the UK Persons of Significant Control register using Neo4J.

In part it reminded me of my own early explorations around the PSC register, as well as previous attempts at setting up Jupyter notebook and RStudio environments linked to neo4J, such as Getting Started With the Neo4j Graph Database – Linking Neo4j and Jupyter SciPy Docker Containers Using Docker Compose and this one on Accessing a Neo4j Graph Database Server from RStudio and Jupyter R Notebooks Using Docker Containers.

I’ve also previously explored how to run Postgres server in a Binderised / MyBinder environment — Running a PostgreSQL Server in a MyBinder Container — so it seemed only natural to see if I could launch neo4J in a MyBinder environment too.

Setting things up to work with a Python client is easy enough — see this templated, Binderised repo — binder-neo4j — although at the moment I can’t seem to get the neo4j web UI to work with jupyter-server-proxy.

Now I’m wondering whether I should try to put together a Binderised repo that uses Sam Leon’s Global Witness repo scripts to ingest the PSC data into a Binderised repo to make running graph queries over the PSC data possible in a Mybinder environment…

View demo in MyBinder: https://mybinder.org/v2/gh/psychemedia/binder-neo4j/master?filepath=py%2Fneo4j-demo.py

Intercepting JSON HTTP Responses to Web Browser Page Requests Using MITMProxy

Coming back from a week or so away, the car let us down with a ruptured water hose, which sent my confidence / mental state tanking, albeit with the AA managing a quick fix with some new-to-me water activated tape along the lines of this. (It’s bad enough requiring a call out, but the stress is multiplied when you live on an island!)

My reboot strategy was to have a quick play with data from the weekend’s WRC rally, but when my datagrabber failed, it tanked my mood further and resulted in an 18 hour not-moving / not eating manic coding stretch that I’m still bleary eyed from.

The problem stemmed from a couple of things that interacted enough to confuse me. One was a pandas update that changed the behaviour of the json_normalize function I was using to unpack JSON values, and the other was the behaviour of the WRC server I pull data from (probably in breach of terms and conditions) which erratically kept giving a NULL/404 response to valid requests.

I’m not sure if the server behaviour was a defensive measure against scraping on the part of the publisher or if it’s an issue with the cache service I’m pulling from (certainly, hitting the same URL could give a valid data response, then nothing for the next few hits, then a response again). I tweaked my Python requests scraper code by adding some header info to spoof a browser user-agent, as well as tweaking the request period to make it play a bit nicer, but still the erratic 404s appeared at an ever greater rate.

Loading pages via a browser works okay, with the JSON requests being handled correctly, so I could just scrape the HTML tables that I think the JSON is used to populate (else: why load it?), although I have noted in the past that the JSON data structures have more data fields than are displayed in the WRC live timing HTML tables.

Then I started wondering about how to automate the collection of those requests using a browser automation route, with Selenium handling page selections and something grabbing the data, perhaps from the browser devtools har archive (right click on a recorded entry in the devtools network listing to save all of them to a har archive).

The har archive itself is a JSON file, so that’s quite easy to work with, but the Chrome export (I think) is everything, not just filtered requests as in the screenshot above. Firefox seems to let you filter network items and just export filtered ones to a har file, which you can then open as a json file, filtering on the url to identify the request(s) of interest.
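By way of a minimal sketch of that "open it as JSON and filter on the URL" step: a har file keeps its records under log / entries, each with a request url and a response content block (the body text may be base64 encoded, flagged by an encoding field). The function name and the tiny hand-made archive below are made up for illustration.

```python
import base64
import json


def har_responses(har: dict, url_fragment: str):
    """Yield (url, body_text) for HAR entries whose request URL contains url_fragment."""
    for entry in har.get("log", {}).get("entries", []):
        url = entry["request"]["url"]
        if url_fragment not in url:
            continue
        content = entry["response"].get("content", {})
        text = content.get("text", "")
        # HAR flags base64 encoded bodies with an "encoding" field
        if content.get("encoding") == "base64":
            text = base64.b64decode(text).decode("utf-8", errors="replace")
        yield url, text


# A tiny hand-made HAR fragment, for illustration only
example_har = {
    "log": {
        "entries": [
            {
                "request": {"url": "https://example.com/sasCacheApi?x=1"},
                "response": {"content": {"mimeType": "application/json", "text": '{"a": 1}'}},
            },
            {
                "request": {"url": "https://example.com/style.css"},
                "response": {"content": {"mimeType": "text/css", "text": "body {}"}},
            },
        ]
    }
}

for url, body in har_responses(example_har, "sasCache"):
    print(url, json.loads(body))
```

With a real export, the dict would come from json.load() on the saved .har file rather than being typed in by hand.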

In passing, I also note that the har-extractor tool makes previewing life easier in the way it unpacks requests from a har archive into discrete files in a (nested) directory structure.

I also notice just now that the Firefox developer tools also seem to have a websocket sniffer, which could be handy… [UPDATE: I think that may be an extension I installed…] FWIW, I also started trying to get my head around web sockets in a generic Python context when I was trying to come up with a simple MyBinder client (see here): A Minimal Python Client for MyBinder.

Whilst the Firefox route looked promising, I wasn’t sure how automatable it would be: whilst selenium-py would let me script lots of link clicking in the WRC site, I’m not sure it provides an API to browser dev tools?

I did find one tool I thought looked interesting, selenium-wire, but it seems to only capture headers, not payloads, of requests made from the selenium scripted browser:

#!pip3 install selenium-wire
from seleniumwire import webdriver 

# Create a new instance of the Firefox driver
driver = webdriver.Firefox()

# Go to a WRC live timing page
driver.get('https://www.wrc.com/en/wrc/livetiming/page/4175----.html')

# Access requests via the `requests` attribute
for request in driver.requests:
    if request.response:
        if 'sasCache' in request.path:
            print(
                request.path,
                request.response.status_code,
                request.response.headers,
                request.body,
                '\n'
            )

I can see how that might be handy for capturing the addresses of resources loaded by a page, but I want the actual gzipped JSON data that forms the response content for the resources I’m interested in…

UPDATE: seems like selenium-wire is totally up to the job:

Update: this also looks interesting, proxy.py.

Poking around further, it seems the best approach is to use a proxy that can grab traffic as required. There are lots of partial clues as to what to use out there, many of them referring to browsermob and the BrowserMob Proxy Python client, but no full recipes.

Another proxy that looked a bit easier to use, with a Python base and more powerful in the way you can script it, is mitmproxy (“man-in-the-middle proxy”).

Again, the docs and recipes seem to be a bit scattered, so here’s a complete recipe that worked for me…

Start off by installing the proxy and getting it running. You can do this on the command line with:

pip3 install mitmproxy

mitmdump -w test1
#This will dump intercepted requests into
#    the file: test1

#close with: ctrl-c

On my local install, but not MyBinder?, I can actually run this as a background job from a notebook code cell using cell block magic:

%%script bash --bg
mitmdump -w test1

To stop the background process, we could look up the process number from a code cell and then kill that process by process ID (kill PROCESSID):

#Process numbers 
! ps -e | grep 'mitmdump' | awk '{print $1 " " $4}'

or let the (Linux) machine do it…

#Or to kill eg
!kill $(ps -e | grep 'mitmdump' | awk '{print $1}' )
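Another option, sketched here rather than something I’ve run against the proxy in anger, is to start the process from Python with subprocess and hang on to the handle, so there is no need to go hunting for the process ID later. The helper names are made up, and a stand-in sleep command is used so the example runs without mitmproxy installed.

```python
import subprocess
import sys


def start_background(cmd):
    """Start cmd (a list of arguments) without blocking and return the process handle."""
    return subprocess.Popen(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)


def stop_background(proc, timeout=5):
    """Ask the process to terminate, escalating to kill if it does not exit in time."""
    proc.terminate()
    try:
        proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()
        proc.wait()
    return proc.returncode


# Stand-in long-running command; for the proxy itself the command
# would be ["mitmdump", "-w", "test1"]
proc = start_background([sys.executable, "-c", "import time; time.sleep(60)"])
print("Started process", proc.pid)
stop_background(proc)
print("Still running?", proc.poll() is None)
```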

In a notebook, we can then launch a (optionally, headless) selenium controlled browser:

from selenium import webdriver

PROXY = "localhost:8080" # IP:PORT or HOST:PORT

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--proxy-server=%s' % PROXY)
chrome_options.add_argument("--headless") 

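# Recent Selenium releases take the options object via `options=`
# (the older `chrome_options=` keyword has been removed)
chrome = webdriver.Chrome(options=chrome_options)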
chrome.get("https://www.wrc.com/en/wrc/livetiming/page/4175----.html")

chrome.close()

We can view the result using the mitmweb browser app (from the command line: mitmweb). (Note that I think this also runs the proxy…? Again, ctrl-c to kill it.)

We can load the file we collected within the web app and then filter the requests to ones of interest:

Selecting a request shows us the contents of the request response, which is to say: the JSON data I’m after…

So this is starting to look promising…

…even more so when we realise we can filter a collected set of resources using the mitmdump command, for example with a construction of the form:

mitmdump -nr test1 -w test4 "~u .*sasCacheApi.*"

which will filter the archived collection in test1 using any desired filters (eg. "~u .*sasCacheApi.*") to create the filtered set in test4.

We can also add filters when running mitmdump to collect requests. For example:

mitmdump -w test5 "~u .*sasCacheApi.*"

will only capture and dump intercepted requests from locations with the desired address pattern into the file test5.

I’ve now got a pattern that could be used to scrape lots of JSON files:

  • set up the mitmproxy with appropriate filters to collect just the files I want,
  • script selenium to load the desired web page and click through various bits of it to make sure all the resources I want are loaded *(not addressed here; I need to do a post on scraping with Selenium-py; for now, here’s an example of using it to [do some repetitive work](https://blog.ouseful.info/2019/01/21/bulk-notebook-uploads-to-nbgallery-using-selenium/)…)*,
  • and then… then what? Parse the resource collection, that’s what…

Here’s an initial fragment for how to do that.

First, we can preview the headers for intercepted resources:

from mitmproxy import io
from mitmproxy.net.http.http1.assemble import assemble_request

def response(flow):
    print(assemble_request(flow.request).decode('utf-8'))

with open('test4', "rb") as logfile:
    freader = io.FlowReader(logfile)
    for f in freader.stream():
        response(f)

We can inspect what has been captured by getting the state of a flow object:

f.get_state()

This actually returns a python dict, the keys for which we can easily preview: f.get_state().keys()

Of particular interest is the response:

f.get_state()['response']

We note that the content is compressed / gzipped, so we can uncompress that…

import gzip
text = gzip.decompress(f.get_state()['response']['content'])
text

All that remains now is a tweak to the iteration through the response previewer (the response(flow) function defined above) to unzip and save each of them to a file. For example, something like:

import gzip
def response2(flow):
    fn = flow.get_state()['request']['path'].decode()
    fn = fn.split('=')[1].replace('%2F','_').replace('%3F','_').replace('%3D','_')
    print('Saving file: {}'.format(fn))
    with open('{}.json'.format(fn),'wb') as outfile:
        outfile.write( gzip.decompress(flow.get_state()['response']['content']) )

with open('test4', "rb") as logfile:
    freader = io.FlowReader(logfile)
    for f in freader.stream():
        response2(f)
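The filename munging and the unzip step above are a bit brittle (they assume exactly one = in the path and that every response is gzipped), so here is a slightly more defensive sketch using only the standard library. The helper names and the example path are hypothetical; the path is just shaped like the ones handled above.

```python
import gzip
import re
from urllib.parse import parse_qsl, urlsplit


def safe_filename(path: str) -> str:
    """Build a filesystem-friendly name from the first query value in a request path."""
    parts = urlsplit(path)
    values = [v for _, v in parse_qsl(parts.query)]  # parse_qsl also percent-decodes
    raw = values[0] if values else parts.path
    return re.sub(r"[^A-Za-z0-9._-]+", "_", raw).strip("_")


def maybe_gunzip(content: bytes) -> bytes:
    """Decompress only if the bytes start with the gzip magic number."""
    return gzip.decompress(content) if content[:2] == b"\x1f\x8b" else content


# Hypothetical path, for illustration only
print(safe_filename("/sasCacheApi?route=rallies%2F100%2Fstages%3Fsort%3Dasc"))
print(maybe_gunzip(gzip.compress(b'{"ok": true}')))
```

These could be dropped into response2() in place of the split / replace chain and the bare gzip.decompress() call.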

I’m still not feeling right happy / in control, though… F****g Boris…

PS as to why scrape the data? For generating things like these Stage Charts for WRC Rally Sweden.

Fragment: Keeping an Eye on What’s Trackable, Where, and When — Tools for Data Protection Officers as well as the Rest of Us?

Way back when, in the early days of FOI and then “open data”, I naively believed that open data and FOI contact points in organisations would act as advocates for those of us outside the organisation trying to get access to information from inside the organisation. The reality seems to be that as appointees and employees of the organisation, those individuals instead become gatekeepers and often seem to act to find ways of defending the organisation against such requests rather than trying to open the organisation up to them.

When it comes to those appointed to oversee data protection and data privacy issues, I would like to think that whoever is appointed such a role sees it as the role of an advocate for those who work for or come into contact with the organisation, as well as providing an opportunity to aggressively defend the rights of those outside the organisation against the unnecessary and disproportionate collection, processing and sharing of data about them by the organisation. That said, I suspect in many cases the role is more about trying to make sure the company doesn’t get sued under GDPR.

Whilst it would also be nice to think that the data protection person is a geek w/ skillz who can hack their way around an organisation’s systems and websites, poking around to find things that shouldn’t be there and demonstrating how other things can be potentially misused, I suspect they aren’t.

So do we need tools for such officers to keep tabs on their organisation, or perhaps tools to help privacy advocates provide oversight of them?

Poking around traffic generated as I visited the OU VLE a week or two ago, I saw a couple of requests I thought were unnecessary and raised an internal query about them. But it also got me thinking…

The requests appear to be made from tags loaded into the web page using the Google Tag Manager. The Google Tag Manager code appears to be delivered via a gtm.js script with the structure:

{
  "resource": {
    "version": "XXX",
    "macros": [ {} ],
    "tags": [ {} ],
    "predicates": [{}],
    "rules": [ {} ]
  },
  "runtime": [ [], [] ]
}

followed by a chunk of Javascript code.

The gtm.js file includes rules of the form [["if",1,31],["unless",34,35],["add",51]] that appear to index into the predicates list in the conditional part (logically or’d tests?) and then add a particular tag, which may reference a macro, when the condition is met.

Predicates take the form:

    {
      "function":"_re",
      "arg0":["macro",0],
      "arg1":"^http(s)?:\\\/\\\/(www\\.)?open.ac.uk\\\/?(index.html)?($|\\?)",
      "ignore_case":true
    }
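As a rough sketch of how such a predicate might be read, the following parses the JSON above and applies the pattern to a few URLs. Two assumptions are mine rather than anything documented here: that macro 0 resolves to the page URL, and that the Javascript regular expression behaves the same way under Python’s re module (it does for this pattern, but that won’t hold in general).

```python
import json
import re

# The predicate shown above, kept as raw JSON text so the escaping is untouched
PREDICATE_JSON = r'''
{
  "function": "_re",
  "arg0": ["macro", 0],
  "arg1": "^http(s)?:\\\/\\\/(www\\.)?open.ac.uk\\\/?(index.html)?($|\\?)",
  "ignore_case": true
}
'''


def re_predicate_matches(predicate: dict, value: str) -> bool:
    """Evaluate a "_re" predicate against a value (assumed here to be the page URL)."""
    if predicate.get("function") != "_re":
        raise ValueError("only _re predicates are handled in this sketch")
    flags = re.IGNORECASE if predicate.get("ignore_case") else 0
    return re.search(predicate["arg1"], value, flags) is not None


predicate = json.loads(PREDICATE_JSON)
for url in ["https://www.open.ac.uk/", "http://open.ac.uk", "https://www.open.ac.uk/courses"]:
    print(url, re_predicate_matches(predicate, url))
```

So this particular test looks like it is picking out the site homepage, with or without a query string, and nothing deeper in the site.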

Tags can take a variety of forms, including:

      {
      "function":"__html",
      "once_per_event":true,
      "vtp_html":"\n\u003Cscript type=\"text\/gtmscript\"\u003E!function(b,e,f,g,a,c,d){b.fbq||(a=b.fbq=function(){a.callMethod?a.callMethod.apply(a,arguments):a.queue.push(arguments)},b._fbq||(b._fbq=a),a.push=a,a.loaded=!0,a.version=\"2.0\",a.queue=[],c=e.createElement(f),c.async=!0,c.src=g,d=e.getElementsByTagName(f)[0],d.parentNode.insertBefore(c,d))}(window,document,\"script\",\"https:\/\/connect.facebook.net\/en_US\/fbevents.js\");fbq(\"init\",\"870490019710405\");fbq(\"track\",\"PageView\");\u003C\/script\u003E\n\u003Cnoscript\u003E\n\u003Cimg height=\"1\" width=\"1\" src=\"https:\/\/www.facebook.com\/tr?id=870490019710405\u0026amp;ev=PageView\n\u0026amp;noscript=1\"\u003E\n\u003C\/noscript\u003E\n\n\n",
      "vtp_supportDocumentWrite":false,
      "vtp_enableIframeMode":false,
      "vtp_enableEditJsMacroBehavior":false,
      "tag_id":51
    }

And macros take the form:

{
      "function":"__gas",
      "vtp_cookieDomain":"auto",
      "vtp_doubleClick":false,
      "vtp_setTrackerName":false,
      "vtp_useDebugVersion":false,
      "vtp_useHashAutoLink":false,
      "vtp_decorateFormsAutoLink":false,
      "vtp_enableLinkId":false,
      "vtp_enableEcommerce":false,
      "vtp_trackingId":"UA-4391747-17",
      "vtp_enableRecaptchaOption":false,
      "vtp_enableUaRlsa":false,
      "vtp_enableUseInternalVersion":false
    }

So what I’m wondering is: is there an offline, static analyser for gtm.js scripts that would allow someone to point to a website from which a gtm.js script can be downloaded, and that then generates human readable reports that:

  • identify in general which trackers are loaded by which rules on which events with what arguments; and
  • identify which trackers are loaded by which rules on which events with what arguments for a specific URL.
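To give a flavour of the first sort of report, here is a very rough sketch that walks the rules list and resolves the indexes into something readable. It leans on my guesses about the format: that "if" / "unless" entries index into the predicates list, and (an assumption) that "add" entries are positions in the tags list. The miniature container is made up, and it says nothing about how multiple tests in a rule are combined.

```python
def describe_rules(resource: dict) -> list:
    """Return one readable line per rule, resolving indexes into predicates and tags."""
    predicates = resource.get("predicates", [])
    tags = resource.get("tags", [])

    def show_predicate(i):
        p = predicates[i]
        return f'{p.get("function")}({p.get("arg1")!r})'

    lines = []
    for n, rule in enumerate(resource.get("rules", [])):
        parts = []
        for kind, *indexes in rule:
            if kind in ("if", "unless"):
                parts.append(f"{kind} " + ", ".join(show_predicate(i) for i in indexes))
            elif kind == "add":
                # Assumption: "add" indexes are positions in the tags list
                parts.append("add " + ", ".join(str(tags[i].get("function")) for i in indexes))
            else:
                # Unknown clause type: report it as-is rather than guess
                parts.append(f"{kind} {indexes}")
        lines.append(f"rule {n}: " + "; ".join(parts))
    return lines


# Made-up miniature container, for illustration only
resource = {
    "predicates": [
        {"function": "_re", "arg0": ["macro", 0], "arg1": "open.ac.uk"},
        {"function": "_cn", "arg0": ["macro", 0], "arg1": "/staff/"},
    ],
    "tags": [{"function": "__html", "tag_id": 51}],
    "rules": [[["if", 0], ["unless", 1], ["add", 0]]],
}

for line in describe_rules(resource):
    print(line)
```

A report for a specific URL would then need each predicate to be evaluated against that URL, which in turn means knowing what each macro resolves to.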

This would then allow a university data protection officer, for example, or a student, to provide a URL, such as a URL into the VLE, and get a simple, statically generated report back that shows what trackers are loaded when visiting that environment.

Which is simpler than running Ghostery or opening developer tools in a wide open by default browser like Chrome, rather than the rather more privacy defending Firefox, for example, and searching the network logs for incriminating evidence.

Google Tag Manager has been around for some time, and I’m assuming that organisational web folk have read each line of code in the gtm.js they load into users’ browsers to make sure that it’s not doing anything untoward. (That everyone else uses it is no excuse, unless perhaps it meets some sort of international software quality standard that means folk can just embed it without looking at it.)

So I’m wondering:

  • is there a line by line annotated version of the code at the bottom of the gtm.js script anywhere?
  • are there line by line examples out there of a simple gtm.js script and how to read it / analyse it (so eg walking through: this rule says this, which adds that tag, which is then parsed this way?)
  • are there static gtm.js analysers out there that generate the static reports suggested above and that allow folk to analyse arbitrary gtm.js scripts that are loaded into their browser in many of the sites they visit?

So for example, here’s a blog post that describes, line by line, how the Google tag manager container snippet that webmasters embed in their webpages runs so as to load in the gtm.js script: The Container Snippet: GTM Secrets Revealed. What I want is something similar for the gtm.js script…

PS it seems that GTM Spy [h/t Simo Ahava] helps browse the tags loaded in from a Google Tag Manager container. For example, here’s a look at code associated with a tag loaded via GTM by a particular org:

and here’s the trigger condition associated with it (I notice that after a single, naive pass of my image blurrer, I can still read the contained text…):

This is a good start, but the navigation falls short of being usable by a ‘never-reads-the-manual’ such as myself. For example, looking at one of the triggers on another tab in the GTM Spy view:

I see the tag ID identified but no obvious way of finding out what tag that relates to, or what parameters / data might be returned via that tag?

Fragment – Quantum Coding in Python

Noting that we now may be in an age of quantum supremacy (original docs possibly available via here), here’s yet more stuff for my “to learn about” list, in the form of quantum programming simulators in Python from the big guns:

There’s also:

  • QuTiP — Quantum Toolkit in Python.

Splitting Strings in pandas Dataframe Columns

A quick note on splitting strings in columns of pandas dataframes.

If we have a column that contains strings that we want to split and from which we want to extract particular split elements, we can use the .str. accessor to call the split function on the string, and then the .str. accessor again to obtain a particular element in the split list.

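import pandas as pd

df_str = pd.DataFrame( {'col':['http://example.com/path/filename.suffix']*3} )
# Last element after splitting on '/': 'filename.suffix'
df_str['path'] = df_str['col'].str.split('/').str[-1]
# First element after splitting on '.': 'filename'
df_str['stub'] = df_str['path'].str.split('.').str[0]
df_str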

SQL Murder Mystery, Notebook Style

In passing, I noticed that Simon Willison had posted a datasette mediated version of the Knight Lab SQL Murder Mystery.

The mystery ships as a SQLite database and a clue…

To my mind, a Jupyter notebook provides an ideal medium for playing this sort of game. In between writing queries onto the database, and displaying the responses inline within the notebook, detectives can also use markdown cells to write notes, pull out salient points, formulate hypotheses and write natural language questions that can then be cast into SQLese.
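As a minimal sketch of the sort of first move a detective might make in a code cell, the following lists the tables in a SQLite database and runs a query against one of them, using Python’s built-in sqlite3 module. The in-memory database and its witness_note table are made-up stand-ins so that the example runs on its own; they are not the tables from the actual mystery.

```python
import sqlite3


def list_tables(conn: sqlite3.Connection) -> list:
    """Return the names of the tables in a SQLite database."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


# In-memory stand-in with a made-up table so that the sketch runs on its own
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE witness_note (id INTEGER PRIMARY KEY, note TEXT)")
conn.execute("INSERT INTO witness_note (note) VALUES ('Saw someone leave at midnight')")

print(list_tables(conn))
for row in conn.execute("SELECT id, note FROM witness_note"):
    print(row)
```

Pointing sqlite3.connect() at the downloaded database file instead would let the same list_tables() call show what there is to investigate.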

So as an aside for TM351 students that they really don’t need, I put together a notebook that sets the scene for the murder mystery, along the way showing how to create PostgreSQL databases and users, set database permissions on a per user basis, and import the original SQLite database into Postgres.

You can find the notebook here along with a link for how to download and install the TM351 VM yourself if you fancy giving it a spin…