Grabbing Javascript Objects Out of Web Pages And Into Python

Engaging in some rally data junkie play yesterday, I started wondering about whether I could grab route data out of the rather wonderful rally-maps.com website, a brilliant resource for accessing rally stage maps for a wide range of events.

The site displays its maps using Leaflet, so the data must be in there somewhere as a geojson object, right?! ;-)

My first thought was to check the browser developer tools network tab to see if I could spot any geojson data being loaded into the page so that I could just access it directly… But no joy…

Hmmm… a quick View Source, and it seems the geojson data is baked explicitly into the HTML page as a data object referenced by the leaflet map.

So how to get it out again?

My first thought was to just scrape the HTML and then try to find a way to scrape the Javascript defining the object out of the page. But that’s a real faff. My second thought was to wonder whether I could somehow parse the Javascript, in Python, and then reference the data directly as a Javascript object. But I don’t think you can.

At this point I started wondering about accessing the data as a native JSON object somehow.

One thing I’ve kept not figuring out how to do is find an easy way of inspecting Javascript objects inside a web page. It turns out that it’s really easy: in the Firefox developer tools console, if you enter window to display the window object, you can then browse through all the objects loaded into the page…

Poking around, and by cross referencing the HTML source, I located the Javascript object that contains the geojson data I wanted. For the sake of example, let’s say it was in the Javascript object map.data. Knowing the path to the data, can I then grab it into a Python script?

One of the tricks I’ve started using increasingly for scraping data is to use browser automation via Selenium and the Python selenium package. Trivially, this allows me to open a page in a web browser, optionally click on things, fill in forms, and so on, and then either grab HTML elements from the browser, or use selenium-wire to capture all the traffic loaded into the page (this traffic might include a whole set of JSON files, for example, that I can then reference at my leisure).

So can I use this route to access the Javascript data object?

It seems so: simply call .execute_script('return map.data') on the selenium webdriver object and the Javascript object should be handed back to Python, with selenium converting it to the corresponding Python types along the way.

Only it wasn’t… A circular reference in the object definition meant the call failed. A bit more web searching, and I found a bit of javascript for parsing cyclic objects without getting into an infinite recursion. Loading this code into the browser, via selenium, I was then able to access the Javascript/JSON data object.
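
The same failure mode is easy to reproduce in Python, which may help in seeing what the replacer does. What follows is only a rough Python analogue of the Javascript fix, not the code that runs in the browser: the drop_cycles helper and the sample stage dict are made up for illustration.

```python
import json

# Made-up object with a circular reference: the nested dict points back at its parent.
stage = {"name": "SS1", "points": [[0.0, 51.5], [0.1, 51.6]]}
stage["layer"] = {"parent": stage}

try:
    json.dumps(stage)
    failed = False
except ValueError as err:
    # The standard library refuses to serialise cyclic structures.
    failed = True
    print("Serialisation failed:", err)

_SKIP = object()  # marker for "leave this value out"


def drop_cycles(obj, seen=None):
    """Copy obj, leaving out any dict or list that has already been visited."""
    seen = set() if seen is None else seen
    if isinstance(obj, (dict, list)):
        if id(obj) in seen:
            return _SKIP
        seen.add(id(obj))
    if isinstance(obj, dict):
        out = {}
        for key, value in obj.items():
            cleaned = drop_cycles(value, seen)
            if cleaned is not _SKIP:
                out[key] = cleaned
        return out
    if isinstance(obj, list):
        out = []
        for value in obj:
            cleaned = drop_cycles(value, seen)
            # Inside a list the slot is kept and filled with None instead.
            out.append(None if cleaned is _SKIP else cleaned)
        return out
    return obj


clean = drop_cycles(stage)
print(json.dumps(clean))
```

Note that, as with the WeakSet approach, anything met a second time is dropped, so an object that is merely shared between two places (rather than truly cyclic) also only appears once in the output.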

The recipe is essentially as follows: load in a web page from a Python script into a headless web-browser using selenium; find an off-the-shelf Javascript script to handle circular references in a Javascript object; shove the Javascript script into a Python text string, along with a return call that uses the script to JSON-stringify the desired object; return the JSON string representing the object to Python; parse the JSON string into a Python dict object. Simples:-)

Here are the code essentials…

from selenium import webdriver
from selenium.webdriver.firefox.options import Options
import json

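options = Options()
# options.headless was removed in recent selenium releases; pass the flag instead
options.add_argument("-headless")

browser = webdriver.Firefox(options=options)

# url: address of the page containing the map (not given here)
browser.get(url)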

#https://apimirror.com/javascript/errors/cyclic_object_value
jss = '''const getCircularReplacer = () => {
  const seen = new WeakSet();
  return (key, value) => {
    if (typeof value === "object" && value !== null) {
      if (seen.has(value)) {
        return;
      }
      seen.add(value);
    }
    return value;
  };
};

//https://stackoverflow.com/a/10455320/454773
return JSON.stringify(map.data, getCircularReplacer());
'''

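# execute_script returns the JSON string, which json.loads turns into Python data
js_data = json.loads(browser.execute_script(jss))
# quit() also shuts down the driver process, not just the browser window
browser.quit()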

#JSON data object is now available as a dict:
js_data
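
What comes back depends entirely on how the page builds its map. As a sketch only, assuming the recovered dict is a GeoJSON FeatureCollection whose routes are LineString features (the sample data, the name property and the linestring_routes helper below are invented for illustration, not taken from rally-maps.com), the route coordinates might be pulled out like this:

```python
# Invented sample standing in for js_data; a real page may nest things differently.
sample = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "properties": {"name": "SS1"},
            "geometry": {
                "type": "LineString",
                "coordinates": [[-3.10, 52.40], [-3.11, 52.41], [-3.12, 52.43]],
            },
        },
        {
            "type": "Feature",
            "properties": {"name": "Service"},
            "geometry": {"type": "Point", "coordinates": [-3.00, 52.50]},
        },
    ],
}


def linestring_routes(geojson):
    """Map each LineString feature's name to its list of (lon, lat) tuples."""
    routes = {}
    for feature in geojson.get("features", []):
        geometry = feature.get("geometry") or {}
        if geometry.get("type") != "LineString":
            continue  # skip points, polygons and anything without a geometry
        name = (feature.get("properties") or {}).get("name", "unnamed")
        # GeoJSON stores positions as [longitude, latitude, (optional elevation)].
        routes[name] = [tuple(point[:2]) for point in geometry["coordinates"]]
    return routes


routes = linestring_routes(sample)
print(routes)
```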

Another one for the toolbox:-)

Plus I can now access lots of rally stage maps data for more rally data junkie fun :-)

PS I also realised that this recipe provides a way of running any old Javascript from Python and getting the result of any computation stored in a js object back into the Python environment.
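
One way that idea could be packaged up, as a sketch rather than a finished tool, is a small helper that wraps any Javascript expression in a cycle-safe JSON.stringify call and parses the result. The js_eval name and the FakeBrowser stand-in are made up for this example; the stand-in just lets the sketch run without a real browser. With selenium, the webdriver object would be passed in instead, since all the helper needs is an execute_script method.

```python
import json

# The cycle-safe replacer idea, wrapped around an arbitrary expression.
JS_TEMPLATE = """
const seen = new WeakSet();
const replacer = (key, value) => {
  if (typeof value === "object" && value !== null) {
    if (seen.has(value)) { return; }
    seen.add(value);
  }
  return value;
};
return JSON.stringify(%s, replacer);
"""


def js_eval(browser, expression):
    """Evaluate a Javascript expression in the page and return it as Python data."""
    result = browser.execute_script(JS_TEMPLATE % expression)
    # JSON.stringify yields undefined (None on the Python side) for e.g. functions.
    return None if result is None else json.loads(result)


class FakeBrowser:
    """Stand-in for a selenium webdriver so the sketch runs without a browser."""

    def __init__(self):
        self.last_script = None

    def execute_script(self, script):
        self.last_script = script
        return '{"answer": 42}'  # canned reply, pretending to be the page


browser = FakeBrowser()
data = js_eval(browser, "map.data")
print(data)
```

Because the expression is pasted straight into the script, this is only suitable for expressions you have written yourself, not for untrusted input.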

PPS it also strikes me that ipywidgets potentially offer a route to arbitrary JS execution from a Python environment, as well as real-time state synching back to that Python environment? In this case, the browser executing the Javascript code will be the one used to actually run the Jupyter notebook calling the ipywidgets. (Hmm… I think there’s a push on to support ipywidgets in VSCode? What do they use for the Javascript runtime?)

Author: Tony Hirst

I'm a Senior Lecturer at The Open University, with an interest in #opendata policy and practice, as well as general web tinkering...
