Intercepting JSON HTTP Responses to Web Browser Page Requests Using MITMProxy

Coming back from a week or so away, the car let us down with a ruptured water hose, which sent my confidence / mental state tanking, albeit with the AA managing a quick fix with some new-to-me water activated tape along the lines of this. (It’s bad enough requiring a call out, but the stress is multiplied when you live on an island!)

My reboot strategy was to have a quick play with data from the weekend’s WRC rally, but when my datagrabber failed, it tanked my mood further and resulted in an 18-hour not-moving / not-eating manic coding stretch that I’m still bleary-eyed from.

The problem stemmed from a couple of things that interacted enough to confuse me. One was a pandas update that changed the behaviour of the json_normalize function I was using to unpack JSON values, and the other was the behaviour of the WRC server I pull data from (probably in breach of terms and conditions) which erratically kept giving a NULL/404 response to valid requests.

I’m not sure if the server behaviour was a defensive measure against scraping on the part of the publisher or if it’s an issue with the cache service I’m pulling from (certainly, hitting the same URL could give a valid data response, then nothing for the next few hits, then a response again). I tweaked my Python requests scraper code by adding some header info to spoof a browser user-agent, as well as slowing the request rate to make it play a bit nicer, but still the erratic 404s appeared at an ever greater rate.
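For what it’s worth, the sort of tweak I mean looks something like this (a sketch rather than my actual code; the user-agent string is illustrative and the function name is just mine):

import time
import requests

# Spoof a browser-like user-agent and pace the requests a bit
headers = {
    'User-Agent': ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0 Safari/537.36')
}

def polite_get(url, delay=2):
    """Politely fetch a URL, returning the JSON payload, or None on a failed request."""
    time.sleep(delay)
    r = requests.get(url, headers=headers)
    # The server sometimes 404s perfectly valid requests, so don't assume we got JSON back
    if r.status_code != 200:
        print('Failed request ({}): {}'.format(r.status_code, url))
        return None
    return r.json()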

Loading pages via a browser works okay, with the JSON requests being handled correctly, so I could just scrape the HTML tables that I think the JSON is used to populate (else: why load it?), although I have noted in the past that the JSON data structures have more data fields than are displayed in the WRC live timing HTML tables.
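(If I did go down the HTML table route, pandas would do most of the heavy lifting; a quick sketch, jumping slightly ahead to the browser automation idea introduced below, since the tables only appear once the page has been rendered in a browser:)

import pandas as pd
from selenium import webdriver

# Load the live timing page in a browser so the tables actually get rendered
browser = webdriver.Firefox()
browser.get('https://www.wrc.com/en/wrc/livetiming/page/4175----.html')

# pandas lifts any <table> elements it finds in the page source into a list of dataframes
tables = pd.read_html(browser.page_source)
browser.close()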

Then I started to wonder about how to automate the collection of those requests using a browser automation route, with Selenium handling page selections and something grabbing the data, perhaps from the browser devtools har archive (right click on a recorded entry in the devtools network listing to save all of them to a har archive).

The har archive itself is a JSON file, so that’s quite easy to work with, but the Chrome export (I think) is everything, not just filtered requests as in the screenshot above. Firefox seems to let you filter network items and just export filtered ones to a har file, which you can then open as a json file, filtering on the url to identify the request(s) of interest.
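Since the har archive is just JSON, filtering it yourself from Python is also straightforward; a minimal sketch, with the filename and URL pattern as examples:

import json

# Load a har archive exported from the browser devtools
with open('wrc.har') as f:
    har = json.load(f)

# Pull out the entries whose request URL matches a pattern of interest
entries = [e for e in har['log']['entries'] if 'sasCache' in e['request']['url']]

for e in entries:
    # If the response body was saved, it appears under content.text
    print(e['request']['url'], e['response']['content'].get('text', '')[:100])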

In passing, I also note that the har-extractor tool makes previewing life easier in the way it unpacks requests from a har archive into discrete files in a (nested) directory structure.

I also notice just now that the Firefox developer tools also seem to have a websocket sniffer, which could be handy… [UPDATE: I think that may be an extension I installed…]  FWIW, I also started trying to get my head around web sockets in a generic Python context when I was trying to come up with a simple MyBinder client (see here): A Minimal Python Client for MyBinder.

Whilst the Firefox route looked promising, I wasn’t sure how automatable it would be: whilst selenium-py would let me script lots of link clicking in the WRC site, I’m not sure it provides an API to browser dev tools?

I did find one tool I thought looked interesting, selenium-wire, but it seems to only capture headers, not payloads, of requests made from the selenium scripted browser:

#!pip3 install selenium-wire
from seleniumwire import webdriver 

# Create a new instance of the Firefox driver
driver = webdriver.Firefox()

# Go to a WRC live timing page
driver.get('https://www.wrc.com/en/wrc/livetiming/page/4175----.html')

# Access requests via the `requests` attribute
for request in driver.requests:
    if request.response:
        if 'sasCache' in request.path:
            print(
                request.path,
                request.response.status_code,
                request.response.headers,
                request.body,
                '\n'
            )

I can see how that might be handy for capturing the addresses of resources loaded by a page, but I want the actual gzipped JSON data that forms the response content for the resources I’m interested in…

UPDATE: seems like selenium-wire is totally up to the job.
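For example, something along the following lines should grab the response bodies as well as the headers (a sketch based on my reading of the selenium-wire docs; the compression handling may differ across versions):

import gzip
from seleniumwire import webdriver

driver = webdriver.Firefox()
driver.get('https://www.wrc.com/en/wrc/livetiming/page/4175----.html')

for request in driver.requests:
    if request.response and 'sasCache' in request.path:
        body = request.response.body
        # The JSON responses may come back gzip compressed
        if request.response.headers.get('Content-Encoding') == 'gzip':
            body = gzip.decompress(body)
        print(request.path, body[:100], '\n')

driver.close()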

Update: this also looks interesting, proxy.py.

Poking around further, it seems the best approach is to use a proxy that can grab traffic as required. There are lots of partial clues as to what to use out there, many of them referring to browsermob and the BrowserMob Proxy Python client, but no full recipes.

Another proxy that looked a bit easier to use, with a Python base and more powerful in the way you can script it, is mitmproxy (“man-in-the-middle proxy”).

Again, the docs and recipes seem to be a bit scattered, so here’s a complete recipe that worked for me…

Start off by installing the proxy and getting it running. You can do this on the command line with:

pip3 install mitmproxy

mitmdump -w test1
#This will dump intercepted requests into
#    the file: test1

#close with: ctrl-c

On my local install (but not MyBinder?), I can actually run this as a background job from a notebook code cell using cell block magic:

%%script bash --bg
mitmdump -w test1

To stop the background process, we could look up the process number from a code cell and then kill that process by process ID (kill PROCESSID):

#Process numbers 
! ps -e | grep 'mitmdump' | awk '{print $1 " " $4}'

or let the (Linux) machine do it…

#Or to kill eg
!kill $(ps -e | grep 'mitmdump' | awk '{print $1}' )

In a notebook, we can then launch an (optionally headless) selenium-controlled browser:

from selenium import webdriver

PROXY = "localhost:8080" # IP:PORT or HOST:PORT

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--proxy-server=%s' % PROXY)
chrome_options.add_argument("--headless") 

chrome = webdriver.Chrome(chrome_options=chrome_options)
chrome.get("https://www.wrc.com/en/wrc/livetiming/page/4175----.html")

chrome.close()

We can view the result using the mitmweb browser app (from the command line: mitmweb). (Note that I think this also runs the proxy…? Again, ctrl-c to kill it.)

We can load the file we collected within the web app and then filter the requests to ones of interest:

Selecting a request shows us the contents of the request response, which is to say: the JSON data I’m after…

So this is starting to look promising…

…even more so when we realise we can filter a collected set of resources using the mitmdump command, for example with a construction of the form:

mitmdump -nr test1 -w test4 "~u .*sasCacheApi.*"

which will filter the archived collection in test1 using any desired filters (eg. "~u .*sasCacheApi.*") to create the filtered set in test4.

We can also add filters when running mitmdump to collect requests. For example:

mitmdump -w test5 "~u .*sasCacheApi.*"

will only capture and dump intercepted requests from locations with the desired address pattern into the file test5.

I’ve now got a pattern that could be used to scrape lots of JSON files:

  • set up the mitmproxy with appropriate filters to collect just the files I want,
  • script selenium to load the desired web page and click through various bits of it to make sure all the resources I want are loaded (not addressed here; I need to do a post on scraping with Selenium-py; for now, here’s an example of using it to do some repetitive work: https://blog.ouseful.info/2019/01/21/bulk-notebook-uploads-to-nbgallery-using-selenium/ …),
  • and then… then what? Parse the resource collection, that’s what…

Here’s an initial fragment for how to do that.

First, we can preview the headers for intercepted resources:

from mitmproxy import io
from mitmproxy.net.http.http1.assemble import assemble_request

def response(flow):
    print(assemble_request(flow.request).decode('utf-8'))

with open('test4', "rb") as logfile:
    freader = io.FlowReader(logfile)
    for f in freader.stream():
        response(f)

We can inspect what has been captured by getting the state of a flow object:

f.get_state()

This actually returns a Python dict, the keys for which we can easily preview: f.get_state().keys()

Of particular interest is the response:

f.get_state()['response']

We note that the content is compressed / gzipped, so we can uncompress that…

import gzip
text = gzip.decompress(f.get_state()['response']['content'])
text
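…and then parse the decompressed content into a Python dict:

import json

# Parse the unzipped response content into a Python dict
data = json.loads(text)
# ...which can then be unpacked further, eg with pandas' json_normalize,
# depending on the structure of the particular response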

All that remains now is to tweak the response previewer (the response(flow) function defined above) so that, as we iterate through the flows, we unzip each response and save it to a file. For example, something like:

import gzip
def response2(flow):
    fn = flow.get_state()['request']['path'].decode()
    fn = fn.split('=')[1].replace('%2F','_').replace('%3F','_').replace('%3D','_')
    print('Saving file: {}'.format(fn))
    with open('{}.json'.format(fn),'wb') as outfile:
        outfile.write( gzip.decompress(flow.get_state()['response']['content']) )

with open('test4', "rb") as logfile:
    freader = io.FlowReader(logfile)
    for f in freader.stream():
        response2(f)

I’m still not feeling right happy / in control, though… F****g Boris…

PS as to why scrape the data? For generating things like these Stage Charts for WRC Rally Sweden.

Grabbing Javascript Objects Out of Web Pages And Into Python

Engaging in some rally data junkie play yesterday, I started wondering whether I could grab route data out of the rather wonderful rally-maps.com website, a brilliant resource for accessing rally stage maps for a wide range of events.

The site displays maps using Leaflet, so the data must be in there somewhere as a geojson object, right?! ;-)

My first thought was to check the browser developer tools network tab to see if I could spot any geojson data being loaded into the page so that I could just access it directly… But no joy…

Hmmm… a quick View Source, and it seems the geojson data is baked explicitly into the HTML page as a data object referenced by the leaflet map.

So how to get it out again?

My first thought was to just scrape the HTML and then try to find a way to scrape the Javascript defining the object out of the page. But that’s a real faff. My second thought was to wonder whether I could somehow parse the Javascript, in Python, and then reference the data directly as a Javascript object. But I don’t think you can.

At this point I started wondering about accessing the data as a native JSON object somehow.

One thing I’d never got round to figuring out is an easy way of inspecting Javascript objects inside a web page. It turns out that it’s really easy: in the Firefox developer tools console, if you enter window to display the window object, you can then browse through all the objects loaded into the page…

Poking around, and by cross referencing the HTML source, I located the Javascript object I wanted that contains the geojson data. For sake of example, let’s say it was in the Javascript object map.data. Knowing the path to the data, can I then grab it into a Python script?

One of the tricks I’ve started using increasingly for scraping data is to use browser automation via Selenium and the Python selenium package. Trivially, this allows me to open a page in a web browser, optionally click on things, fill in forms, and so on, and then either grab HTML elements from the browser, or use selenium-wire to capture all the traffic loaded into the page (this traffic might include a whole set of JSON files, for example, that I can then reference at my leisure).

So can I use this route to access the Javascript data object?

It seems so: simply call the selenium webdriver object with .execute_script('return map.data') and the Javascript object should be returned as text.
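In code terms, the naive version of that call looks something like this (with browser the selenium webdriver set up as in the full recipe below, and map.data the object path found from the console):

import json

# Naive approach: stringify the object in the browser and parse it back in Python
js_data = json.loads(browser.execute_script('return JSON.stringify(map.data);'))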

Only it wasn’t… A circular reference in the object definition meant the call failed. A bit more web searching, and I found a bit of Javascript for stringifying cyclic objects without getting into an infinite recursion. Loading this code into the browser, via selenium, I was then able to access the Javascript/JSON data object.

The recipe is essentially as follows: load in a web page from a Python script into a headless web-browser using selenium; find an off-the-shelf Javascript script to handle circular references in a Javascript object; shove the Javascript script into a Python text string, along with a return call that uses the script to JSON-stringify the desired object; return the JSON string representing the object to Python; parse the JSON string into a Python dict object. Simples:-)

Here are the code essentials…

from selenium import webdriver
from selenium.webdriver.firefox.options import Options
import json

options = Options()
options.headless = True

browser = webdriver.Firefox(options = options)

# url is the address of the page containing the Leaflet map we want to grab the data from
browser.get(url)

#https://apimirror.com/javascript/errors/cyclic_object_value
jss = '''const getCircularReplacer = () => {
  const seen = new WeakSet();
  return (key, value) => {
    if (typeof value === "object" && value !== null) {
      if (seen.has(value)) {
        return;
      }
      seen.add(value);
    }
    return value;
  };
};

//https://stackoverflow.com/a/10455320/454773
return JSON.stringify(map.data, getCircularReplacer());
'''

js_data = json.loads(browser.execute_script(jss))
browser.close()

#JSON data object is now available as a dict:
js_data
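Carrying on from that, the grabbed object can be saved out for later use (a sketch; whether js_data is itself valid geojson, or wraps the geojson somewhere inside it, depends on how the page defines the object; the filename is just an example):

# Save the grabbed object for reuse, eg for loading into geopandas or folium later
with open('stage_route.json', 'w') as f:
    json.dump(js_data, f)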

Another one for the toolbox:-)

Plus I can now access lots of rally stage maps data for more rally data junkie fun :-)

PS I also realised that this recipe provides a way of running any old Javascript from Python and getting the result of any computation stored in a js object back into the Python environment.

PPS it also strikes me that ipywidgets potentially offer a route to arbitrary JS execution from a Python environment, as well as real-time state synching back to that Python environment? In this case, the browser executing the Javascript code will be the one used to actually run the Jupyter notebook calling the ipywidgets. (Hmm… I think there’s a push on to support ipywidgets in VSCode? What do they use for the Javascript runtime?)