Grabbing Javascript Objects Out of Web Pages And Into Python

Engaging in some rally data junkie play yesterday, I started wondering about whether I could grab route data out of the rather wonderful rally-maps.com website, a brilliant resource for accessing rally stage maps for a wide range of events.

The site displays its maps using Leaflet, so the data must be in there somewhere as a geojson object, right?! ;-)

My first thought was to check the browser developer tools network tab to see if I could spot any geojson data being loaded into the page so that I could just access it directly… But no joy…

Hmmm… a quick View Source, and it seems the geojson data is baked explicitly into the HTML page as a data object referenced by the leaflet map.

So how to get it out again?

My first thought was to just scrape the HTML and then try to find a way to scrape the Javascript defining the object out of the page. But that’s a real faff. My second thought was to wonder whether I could somehow parse the Javascript, in Python, and then reference the data directly as a Javascript object. But I don’t think you can.

At this point I started wondering about accessing the data as a native JSON object somehow.

One thing I’d never got round to figuring out was an easy way of inspecting Javascript objects inside a web page. It turns out that it’s really easy: in the Firefox developer tools console, if you enter window to display the window object, you can then browse through all the objects loaded into the page…

Poking around, and by cross referencing the HTML source, I located the Javascript object I wanted that contains the geojson data. For sake of example, let’s say it was in the Javascript object map.data. Knowing the path to the data, can I then grab it into a Python script?

One of the tricks I’ve started using increasingly for scraping data is to use browser automation via Selenium and the Python selenium package. Trivially, this allows me to open a page in a web browser, optionally click on things, fill in forms, and so on, and then either grab HTML elements from the browser, or use selenium-wire to capture all the traffic loaded into the page (this traffic might include a whole set of JSON files, for example, that I can then reference at my leisure).
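
By way of illustration, here’s a minimal selenium-wire sketch (not from the original workflow; the page URL is a placeholder) showing how the JSON resources a page pulls in might be listed:

#Minimal sketch: use selenium-wire to eavesdrop on the resources a page loads
from seleniumwire import webdriver  # pip install selenium-wire

driver = webdriver.Firefox()
driver.get('https://example.com/some-map-page')  #placeholder URL

#Every request made by the page is recorded on driver.requests
for request in driver.requests:
    if request.response and 'json' in str(request.response.headers.get('Content-Type', '')):
        print(request.url)
        #The raw payload is available as request.response.body (bytes)

driver.quit()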

So can I use this route to access the Javascript data object?

It seems so: simply call the selenium webdriver object with .execute_script('return map.data') and the Javascript object should be returned as text.

Only it wasn’t… A circular reference in the object definition meant the call failed. A bit more web searching, and I found a bit of Javascript for stringifying cyclic objects without getting into an infinite recursion. Loading this code into the browser via selenium, I was then able to access the Javascript/JSON data object.

The recipe is essentially as follows: load in a web page from a Python script into a headless web-browser using selenium; find an off-the-shelf Javascript script to handle circular references in a Javascript object; shove the Javascript script into a Python text string, along with a return call that uses the script to JSON-stringify the desired object; return the JSON string representing the object to Python; parse the JSON string into a Python dict object. Simples:-)

Here are the code essentials…

from selenium import webdriver
from selenium.webdriver.firefox.options import Options
import json

options = Options()
options.headless = True

browser = webdriver.Firefox(options = options)

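#url contains the address of the page we want to grab the Javascript object from (not defined in this snippet)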
browser.get(url)

#https://apimirror.com/javascript/errors/cyclic_object_value
jss = '''const getCircularReplacer = () => {
  const seen = new WeakSet();
  return (key, value) => {
    if (typeof value === "object" && value !== null) {
      if (seen.has(value)) {
        return;
      }
      seen.add(value);
    }
    return value;
  };
};

//https://stackoverflow.com/a/10455320/454773
return JSON.stringify(map.data, getCircularReplacer());
'''

js_data = json.loads(browser.execute_script(jss))
browser.close()

#JSON data object is now available as a dict:
js_data

Another one for the toolbox:-)

Plus I can now access lots of rally stage maps data for more rally data junkie fun :-)

PS I also realised that this recipe provides a way of running any old Javascript from Python and getting the result of any computation stored in a js object back into the Python environment.
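
For example (a trivial sketch, not from the original post, reusing a selenium browser object opened as above), any JSON-serialisable value computed in the page can be returned directly:

#Selenium converts simple Javascript return values (numbers, strings, lists, objects)
#into the corresponding Python types
result = browser.execute_script('''
    return {
        title: document.title,
        linkCount: document.querySelectorAll("a").length
    };
''')
print(result['title'], result['linkCount'])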

PPS it also strikes me that ipywidgets potentially offer a route to arbitrary JS execution from a Python environment, as well as real-time state synching back to that Python environment? In this case, the browser executing the Javascript code will be the one used to actually run the Jupyter notebook calling the ipywidgets. (Hmm… I think there’s a push on to support ipywidgets in VSCode? What do they use for the Javascript runtime?)

Fragment – Accessibility Side Effects? Free Training Data for Automated Markers

Another of those “woke up and suddenly started thinking about this” sort of things…

Yesterday, I was in on a call discussing potential projects around an “intelligent” automated short answer question marking system that could be plugged in to a Jupyter notebook environment (related approach here).

Somewhen towards the end of last year, I did a quick sketch of a simple marker support tool that does quick pairwise similarity comparisons between sentences in a submitted answer and a specimen answer. (The tool can report on all pairwise comparisons, or just the best matching sentence in the specimen compared to the submitted text.)
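
By way of a crude illustration, here’s a minimal sketch of that sort of pairwise comparison using TF-IDF cosine similarity (this is just indicative; it’s not the actual tool, which may use a different sentence similarity measure):

#Sketch: pairwise sentence similarity between a submitted answer and a specimen answer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def sentences(txt):
    """Crude sentence splitter; something like nltk's sent_tokenize would be better."""
    return [s.strip() for s in txt.split('.') if s.strip()]

def best_matches(submitted, specimen):
    """For each submitted sentence, find the best matching specimen sentence."""
    sub, spec = sentences(submitted), sentences(specimen)
    vec = TfidfVectorizer().fit(sub + spec)
    sims = cosine_similarity(vec.transform(sub), vec.transform(spec))
    return [(s, spec[row.argmax()], row.max()) for s, row in zip(sub, sims)]

submitted = "The slope is positive. Sales rise steadily over the period."
specimen = "The chart shows an upward trend. Sales increase year on year."

for sub_s, spec_s, score in best_matches(submitted, specimen):
    print(f"{score:.2f} | {sub_s} -> {spec_s}")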

One issue with this is the need to generate the specimen text. This is also true for developing or training intelligent automated markers that try to match submitted texts against specimen texts.

As part of the development, or training, process, automated marking tools may also require a large number of sample texts to train the system on. (My simple marker support similarity tool doesn’t: it’s there purely to help a marker cross-reference sentences in a text with sentences in the sample text.)

So… this is what I woke up wondering: if we set a question in a data analysis context, asking students to interpret a chart or comment on a set of reported model parameters, can we automatically generate the specimen text from the data, which is to say, from the underlying chart object or model parameters?

I’ve touched on this before, in a slightly different context, specifically creating text descriptions of charts as an accessibility support measure for visually impaired readers (First Thoughts on Automatically Generating Accessible Text Descriptions of ggplot Charts in R), as well as more generally (for example, Data Textualisation – Making Human Readable Sense of Data).

So I wonder, if we have an automated system to mark free text short answers that ask students to comment on some sort of data analysis, could our automated marking system:

  • take the chart or model parameters the student generated and from that generate a simple “insightful” text report that could be used as the specimen answer to mark the student’s own free text answer (i.e. does the student report back similar words to the words our insights generator “sees” and reports back in text form);
  • compare the chart or model parameters generated by the student with our own “correct” analysis / charts / model parameters.

In the first case, we are checking the extent to which the student has correctly interpreted their own chart / model (or one we have provided them with or generated for them) as a free text comparison. In the second case, we are checking whether the student’s model / chart is correct compared to our specimen model / chart etc. based on a comparison of model / chart parameters.
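
For the second case, a minimal sketch of the sort of parameter comparison I mean (the parameter names, values and tolerance here are all made up for the example) might be as simple as:

#Sketch: compare a student's fitted model parameters against specimen values
import numpy as np

specimen = {'slope': 2.5, 'intercept': -1.0, 'r_squared': 0.87}  #hypothetical specimen values
student = {'slope': 2.48, 'intercept': -1.05, 'r_squared': 0.86}  #hypothetical student values

def parameters_match(student, specimen, rtol=0.05):
    """Check each named parameter agrees to within a relative tolerance."""
    return {k: bool(np.isclose(student.get(k, np.nan), v, rtol=rtol))
            for k, v in specimen.items()}

print(parameters_match(student, specimen))
#{'slope': True, 'intercept': True, 'r_squared': True}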

Something else. If we have a system for generating text from data (which could be datatables, could be chart or model parameters etc), we might also be able to generate lots of different texts on the theme, based on the same data. (I recently started exploring data2text again using Simple Rule Based Approach in Python for Generating Explanatory Texts from pandas Dataframes via the durable_rules package. One of the approaches I’m looking at is to include randomising effects to generate multiple different possible text fragments from the same rule; still early days on that.) If our automated marker then needed to be trained on 500 sample submitted texts, we could then automatically generate those (perhaps omitting some bits, perhaps adding interpretations that are correct but of misread parameters (so right-but-wrong or wrong-but-right answers), perhaps adding irrelevant sentences, perhaps adding typos, and so on).

In passing, I was convinced I had posted previously on the topic of “robot writers” generating texts from data not for human consumption but instead for search engines, the idea being that a search engine could index the text and use that to support discovery of a dataset. It seems I had, but hadn’t. In my draft queue (from summer 2015), I note the presence of two still-in-unfinished-draft posts from the Notes on Robot Churnalism series, left languishing because I got zero response back from the first two posts in the series (even though I thought they were pretty good…;-)

Here’s the relevant quote:

One answer I like to this sort of question is that if the search engines are reading the words, then the machine generation of textual statements, interpretations and analyses may well be helping to make certain data points discoverable by turning the data into sentences that then become web searchable? (I think I was first introduced to that idea by this video of a talk from 2012 by Larry Adams of Narrative Science: Using Open Data to Generate Personalized Stories.) If you don’t know how to write a query over a dataset to find a particular fact, but someone has generated a list of facts or insights from the dataset as textual sentences, then you may be able to discover that fact from a straightforward keyword-based query. Just generating hundreds of millions of sentences from data so that they can be indexed just in case someone asks some sort of question about that fact might appear wasteful, at least until someone comes up with a good way of indexing spreadsheets or tabular datasets so that you can make search-engine query like requests of them (which I guess is what things like Wolfram Alpha are trying to do? For example, what is the third largest city in the UK?)

On the other hand, we might perhaps need to be sensitive to the idea that the generated content might place a burden on effective discovery. For example, in Sims, Lee, and Roberta E. Munoz. “The Long Tail of Legal Information – Legal Reference Service in the Age of the ‘Content Farm’.” Law Library Journal Vol. 104:3 p411-425 (2012-29) [PDF], …???

I wish I’d finished those posts now (Notes on Robot Churnalism, Part III – Robot Gatekeepers and Notes on Robot Churnalism, Part IV – Who Cares?), not least to remind myself of what related thoughts I was having at the time… There are hundreds of words drafted in each case, but a lot of the notes are still of the “no room in the margin” or “no time to write this idea out fully” kind…

Sustainability of the ipynb/nbformat Document Format

[Reposted from Jupyter discourse site just so I have my own copy…]

By chance, I just came across the Library of Congress Sustainability of Digital Formats site, which has a schema for cataloguing digital document formats as well as a set of criteria against which the sustainability of digital document formats can be tracked.

Sustainability factors include: disclosure, adoption, transparency, self-documentation, external dependencies, impact of patents, and technical protection mechanisms.

There are also fields associated with Quality and functionality factors which for text documents include: normal rendering, integrity of document structure, integrity of layout and display, support for mathematics/formulae etc., functionality beyond normal rendering.

I note that .ipynb is not currently on the list of mentioned formats. Records for geojson and Rdata provide a steer for the sorts of thing that an ipynb record might initially contain. (I also note that Python / Jupyter kernels don’t have a standardised serialisation format akin to R’s .rdata workspace serialisation (dill goes some way towards this, maybe also data-vault). I also appreciate this is complicated by the wide variety of custom objects created by Python packages, but just as IPython supports rich display integration through __repr__ methods (see also the notes at the end of the IPython.display.display docs for a description of what methods are supported), it might also be timely to start thinking about __serialise__ methods (they may already exist; there is so much I don’t know about Python! I do know that things don’t always work though; eg Python’s json package in my Python environment breaks when trying to serialise numpy.int64 objects…).)
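
(As a quick illustration of that last point, along with one possible workaround:)

import json
import numpy as np

#The stdlib json encoder rejects numpy integer types...
try:
    json.dumps({'count': np.int64(42)})
except TypeError as e:
    print(e)  #e.g. "Object of type int64 is not JSON serializable"

#...but passing a default= handler that casts otherwise unserialisable values works around it
print(json.dumps({'count': np.int64(42)}, default=int))  #{"count": 42}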

There is now a significant number of notebooks on eg Github, as well as signs that notebooks are starting to be used as a publishing format (or at least, as a feedstock for publication, whether rendered using nbconvert or more elaborate tools such as Jupyter-book, nbsphinx, ipypublish, or howsoever).

I wonder if it would be timely to review the ipynb document format in terms of its sustainability and whether getting it included on the LoC list (or other appropriate forum) would be an appropriate thing to do for several reasons, including:

  • signal the existence of the document format to the Library / sustainability community in terms they are familiar with and may be able to help with;
  • help identify how nbformat should not develop in future, in ways that might harm its sustainability as a format;
  • help identify things that might help improve its sustainability;
  • help inform workflows and behaviours regarding how eg cell metadata / tags feed into sustainability.

If .ipynb is to remain the core data-structure for representing Jupyter executable documents and their outputs, and as other third party applications (such as VSCode, or Google Colab) start to support the format, and if it doesn’t already exist, I also wonder whether a simple RFC style document (cf. the GeoJSON RFC) would be appropriate alongside the slightly less formal nbformat documentation as a formal statement of the document standard?

Interoperability is driven by convention as well as standard, and if we are going to see external services developing around Jupyter from individuals or organisations not previously associated with the Jupyter community, but offering interoperability with it, there needs to be a clear basis for what the standards are. This includes not just the base ipynb format, but also messaging and state protocols.

The nbformat format description docs pages seem to act as the normative reference work for the .ipynb standard, and I assume the Jupyter client – messaging docs are the normative reference for the client-server messaging? For ipywidgets, the widget messaging protocol and widget model state docs in the ipywidgets repo appear to provide the normative reference.

See also: this thing on Managing computational notebooks from a preservation perspective and the FAIR Principles (Findable – Accessible – Interoperable – Re-usable).

And on the question of being able to rebuild eg Dockerised environments, here and here.

Rapid ipywidgets Prototyping Using Third Party Javascript Packages in Jupyter Notebooks With jp_proxy_widget

Just before the break, I came across a rather entrancing visualisation of Jean Michel Jarre’s Oxygene album in the form of an animated spectrogram.

Time is along the horizontal x-axis, and frequency along the vertical y-axis. The bright colours show the presence, and volume, of each frequency as the track plays out.

Such visualisations can help you hear-by-seeing the structure of the sound as the music plays. So I wondered… could I get something like that working in a Jupyter notebook….?

And it seems I can, using the rather handy jp_proxy_widget, which provides a way of easily loading jQueryUI components as well as require.js modules to load and run Javascript widgets.

Via this StackOverflow answer, which shows how to embed a simple audio visualisation into a Jupyter notebook using the Wavesurfer.js package, I note that Wavesurfer.js also supports spectrograms. The example page docs are a bit ropey, but a look at the source code and the plugin docs revealed what I needed to know…

#%pip install --upgrade ipywidgets
#!jupyter nbextension enable --py widgetsnbextension

#%pip install jp_proxy_widget

import jp_proxy_widget

widget = jp_proxy_widget.JSProxyWidget()

js = "https://unpkg.com/wavesurfer.js"
js2="https://unpkg.com/wavesurfer.js/dist/plugin/wavesurfer.spectrogram.min.js"
url = "https://ia902606.us.archive.org/35/items/shortpoetry_047_librivox/song_cjrg_teasdale_64kb.mp3"

widget.load_js_files([js, js2])

widget.js_init("""
element.empty();

element.wavesurfer = WaveSurfer.create({
    container: element[0],
    waveColor: 'violet',
        progressColor: 'purple',
        loaderColor: 'purple',
        cursorColor: 'navy',
        minPxPerSec: 100,
        scrollParent: true,
        plugins: [
        WaveSurfer.spectrogram.create({
            wavesurfer: element.wavesurfer,
            container: element[0],
            fftSamples:512,
            labels: true
        })
    ]
});

element.wavesurfer.load(url);

element.wavesurfer.on('ready', function () {
    element.wavesurfer.play();
});
""", url=url)

widget

#It would probably make sense to wire up these commands to ipywidgets buttons...
#widget.element.wavesurfer.pause()
#widget.element.wavesurfer.play(0)
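
For example, a minimal sketch of that wiring (not in the original gist; it assumes the widget above has already been displayed) might look something like:

#Sketch: wire the proxied Javascript calls up to ipywidgets buttons
import ipywidgets as ipw

play_btn = ipw.Button(description='Play')
pause_btn = ipw.Button(description='Pause')

play_btn.on_click(lambda b: widget.element.wavesurfer.play(0))
pause_btn.on_click(lambda b: widget.element.wavesurfer.pause())

ipw.HBox([play_btn, pause_btn])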

The code is also saved as a gist here and can be run on MyBinder (the dependencies should be automatically installed):

Here’s what it looks like (It may take a moment or two to load when you run the code cell…)

It doesn’t seem to work in JupyterLab though…

It looks like the full ipywidgets machinery is supported, so we can issue start and stop commands from the Python notebook environment that control the widget Javascript.

So now I’m wondering what other Javascript apps are out there that might be interesting in a Jupyter notebook context, and how easy it’d be to get them running…?

It might also be interesting to try to construct an audio file within the notebook and then visualise it using the widget.

Dockerising / Binderising the TM351 Virtual Machine

Just before the Christmas break, I had a go recasting the TM351 VM as a Docker container built from a Github repository using MyBinder (which is to say: I had a go at binderising the VM…). Long time readers will know that this virtual machine has been used to deliver a computing environment to students on the OU TM351 Data Management and Analysis course since 2016. The VM itself is built using Virtualbox, provisioned using vagrant, and then distributed originally via a mailed out USB stick or alternatively (which is to say, unofficially; though my preferred route) as a download from VagrantCloud.

The original motivation for using Vagrant was a hope that we’d be able to use a range of provisioners to construct VM images for a range of virtualisation platforms, but that’s never happened. We still ship a Virtualbox image that causes problems for a small number of Windows users each year, rather than a native HyperV image, because: a) I refuse to buy a Windows machine so I can build the HyperV image myself; b) no-one else sees benefit from offering multiple images (perhaps because they don’t provide the tech support…).

For all our supposed industrial scale at delivering technology backed “solutions”, the VM is built, maintained and supported on a cottage industry basis from within the course team.

For a scalable solution that would work:

a) within a module presentation;
b) across module presentations;
c) across modules

I think we should be looking at some sort of multi-user hosted service, with personal accounts and persistent user directories. There are various ways I can imagine delivering this, each of which creates its own issues as well as solving particular problems.

As a quick example, here are two possible extremes:

1) one JupyterHub server to rule them all: every presentation, every module, one server. JupyterHub can be configured to use the DockerSpawner to present different kernel container options to the user (although I’m not sure if this can be personalised on a per user basis? If not, that feature would make for a useful contribution back…), so a student could be presented with a list of containers for each of their modules.

2) one JupyterHub server per module per presentation: this requires more admin and means servers everywhere, but it separates concerns…

The experimental work on a “persistent Binderhub deployment” also looks interesting, offering the possibility of launching arbitrary environments (as per Binderhub) against personally mounted file area (as per JupyterHub).

Providing a “takeaway” service is also one of my red lines: a student should be free to take away any computing environment we provide them with. One in-testing hosted version of the TM351 VM comes, I believe, with centralised Postgres and MongoDB servers that students have accounts on and must log in to. Providing a multi-user service, rather than a self-contained personal server, raises certain issues regarding support, but also denies the student the ability to take away the database service and use it for their own academic, personal or even work purposes. A fundamentally wrong approach, in my opinion. It’s just not open.

So… binderising the VM…

When Docker containers first made their appearance, best practice seemed to be to have one service per container, and then wire containers together using docker-compose to provide a more elaborate environment. I have experimented in the past with decoupling the TM351 services into separate containers and then launching them using docker-compose, but it’s never really gone anywhere…

In the communities of practice that I come across, more emphasis now seems to be on putting everything into a single container. Binderhub is also limited to launching a single container (I don’t think there is a Jupyter docker-compose provisioner yet?) so that pretty much seals it… All in one…

A proof-of-concept Binderised version of the TM351 VM can be found here: innovationOUtside/tm351vm-binder.

It currently includes:

  • an OU branded Jupyter notebook server running jupyter-server-proxy;
  • the TM351 Python environment;
  • an OpenRefine server proxied using jupyter-server-proxy;
  • a Postgres server seeded (I think? Did I do that yet?!) with the TM351 test db (if I haven’t set it up as per the VM, the code is there that shows how to do it…);
  • a MongoDB server serving the small accidents dataset that appears in the TM351 VM.

What is not included:

  • the sharded Mongo DB activity; (the activity as presented at the moment is largely pointless, IMHO; we could demonstrate the sharding behaviour with small datasets, and if we did want to provide queries over the large dataset, that might make sense as something we host centrally and let students log in to query. Which would also give us another teaching point.)

The Binder configuration is provided in the binder/ directory. An Anaconda binder/environment.yml file is used to install packages that are complicated to build or install otherwise, such as Postgres.

The binder/postBuild file is run as a shell script responsible for:

  • configuring the Postgres server and seeding its test database;
  • installing and seeding the MongoDB database;
  • installing OpenRefine;
  • installing Python packages from binder/requirements.txt (the requirements.txt is not otherwise automatically handled by Binderhub — it is trumped by the environment.yml file);
  • enabling required Jupyter extensions.

If any files handled via postBuild need to be persisted, they can be written into $CONDA_DIR.

(As a reference, I have also created some simple standalone template repos showing how to configure Postgres and MongoDB in Binderhub/repo2docker environments. There’s also a neo4j demo template too.)

The binder/start file is responsible for:

  • defining environment variables and paths required at runtime;
  • starting the PostgreSQL and MongoDB database services.

(OpenRefine is started by the user from the notebook server homepage or JupyterLab. There’s a standalone OpenRefine demo repo too…)

Launching the repo using MyBinder will build the TM351 environment (if a Binder image does not already exist) and start the required services. The repo can also be used to build an environment locally using repo2docker.

As well as building a docker image within the Binderhub context, the repo is also automated with a Github Action that is used to build release commits using repo2docker and then push the resulting container to Docker Hub. The action can be found in the .github/workflows directory. The container can be found as ousefuldemos/tm351-binderised:latest. When running a container derived from this image, the Jupyter notebook server runs on the default port 8888 inside the container, with the OpenRefine application proxied through it; the database services should autostart. The notebook server is started with a token required, so you need to spot the token in the start up logs of the container – which means you shouldn’t run it with the -d flag. A variant of the following command should work (I’m not sure how you reliably specify the correct $PWD (present working directory) mount directory from a Windows command prompt):

docker run --name tm351test --rm -p 8895:8888 -v $PWD/notebooks:/notebooks -v $PWD/openrefine_projects:/openrefine_projects ousefuldemos/tm351-binderised:latest

Probably easier is to use the Kitematic inspired containds “personal Binderhub” app, which can capture and handle the token automatically and let you click straight through into the running notebook server. Either use containds to build the image locally by providing the repo URL, or select a new image and search for tm351: the ousefuldemos/tm351-binderised image is the one you want. When prompted, select the “standard” launch route, NOT the ‘Try to start Jupyter notebook’ route.

Although I’ve yet to try it (I ran out of time before the break), I’m hopeful that the prebuilt container should work okay with JupyterHub. If it does, this means the innovationOUtside/tm351vm-binder repo can serve as a template for building images that can be used to deploy authenticated OU computing environments via an OU authenticated and OU hosted JupyterHub server (one can but remain hopeful!).

If you try out the environment, either using MyBinder, via repo2docker, or from the pre-built Docker image, please let me know either here, via the repo issues, or howsoever: a) whether it worked; b) whether it didn’t; c) whether there were any (other) issues. Any and all feedback would be much appreciated…

Simple Rule Based Approach in Python for Generating Explanatory Texts from pandas Dataframes

Many years ago, I used to use rule based systems all the time, first as a postdoc, working with the Soar rule based system to generate “cognitively plausible agents”, then in support of the OU course T396 Artificial Intelligence for Technology.

Over the last couple of years, I’ve kept thinking that a rule based approach might make sense for generating simple textual commentaries from datasets. I had a couple of aborted attempts around this last year using pytracery (eg here and here) but the pytracery approach was a bit too clunky.

One of the tricks I did learn at the time was that things could be simplified by generating data truth tables that encode the presence of particular features in “enrichment” tables that could be used to trigger particular rules.

These tables would essentially encode features that could be usefully processed in simple commentary rules. For example, in rally reporting, something like “X took stage Y, his third stage win in a row, increasing his overall lead by P seconds to QmRs” could be constructed from an appropriately defined feature table row.

I’m also reminded that I started to explore using symbolic encodings to try to encode simple feature runs as strings and then use regular expressions to identify richer features within them (for example, Detecting Features in Data Using Symbolic Coding and Regular Expression Pattern Matching).
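
Something like the following toy sketch captures the idea (the symbol scheme here is made up for the example):

import re

#Toy sketch: encode a crew's stage positions as a symbol string
#('W' = stage win, 'P' = podium, '-' = anything else)...
def encode(positions):
    return ''.join('W' if p == 1 else 'P' if p <= 3 else '-' for p in positions)

results = encode([2, 1, 1, 1, 5])  #'PWWW-'

#...then use a regular expression to spot richer features,
#such as a run of three or more consecutive stage wins
streak = re.search(r'W{3,}', results)
if streak:
    print(f'{len(streak.group())} stage wins in a row from stage {streak.start() + 1}')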

Anyway, a brief exchange today about possible PhD projects for faculty-funded PhD studentships, starting in October (the project will be added here at some point…) got me thinking again about this… So as the Dakar rally is currently running, and as I’ve been scraping the results, I wondered how easy it would be to pull an off-the-shelf python rules engine, erm, off the shelf, and create a few quick rally commentary rules…

And the answer is, surprisingly easy…

Here’s a five minute example of some sentences generated from a couple of simple rules using the durable_rules rules engine.

The original data is a set of ranking table rows scraped from the Dakar results site, with columns including the position (Pos), crew (Crew), vehicle brand (Brand), overall time (Time_raw) and the gap to the leader in seconds (GapInS).

The generated sentences look like this: JI. CORNEJO FLORIMO MONSTER ENERGY HONDA TEAM 2020 were in fifth position, 11 minutes and 19 seconds behind the first placed HONDA.

Let’s see how the various pieces fit together…

For a start, here’s what the rules look like:

from durable.lang import *
import inflect

#The inflect engine is used to convert numbers into natural language ordinals ("fifth" etc)
p = inflect.engine()

txts = []

with ruleset('test1'):
    
    #Display something about the crew in first place
    @when_all(m.Pos == 1)
    def whos_in_first(c):
        """Generate a sentence to report on the first placed vehicle."""
        #We can add additional state, accessible from other rules
        #In this case, record the Crew and Brand for the first placed crew
        c.s.first_crew = c.m.Crew
        c.s.first_brand = c.m.Brand
        
        #Python f-strings make it easy to generate text sentences that include data elements
        txts.append(f'{c.m.Crew} were in first in their {c.m.Brand} with a time of {c.m.Time_raw}.')
    
    #This just checks whether we get multiple rule fires...
    @when_all(m.Pos == 1)
    def whos_in_first2(c):
        txts.append('we got another first...')
        
    #We can be a bit more creative in the other results
    @when_all(m.Pos>1)
    def whos_where(c):
        """Generate a sentence to describe the position of each other placed vehicle."""
        
        #Use the inflect package to natural language textify position numbers...
        nth = p.number_to_words(p.ordinal(c.m.Pos))
        
        #Use various probabilistic text generators to make a comment for each other result
        first_opts = [c.s.first_crew, 'the stage winner']
        if c.m.Brand==c.s.first_brand:
            first_opts.append(f'the first placed {c.m.Brand}')
        t = pickone_equally([f'with a time of {c.m.Time_raw}',
                             f'{sometimes(f"{display_time(c.m.GapInS)} behind {pickone_equally(first_opts)}")}'],
                           prefix=', ')
        
        #And add even more variation possibilities into the returned generated sentence
        txts.append(f'{c.m.Crew} were in {nth}{sometimes(" position")}{sometimes(f" representing {c.m.Brand}")}{t}.')
    

Each rule in the ruleset is decorated with a conditional test applied to the elements of a dict passed in to the ruleset. Rules can also set additional state, which can be tested by, and accessed from within, other rules.

Rather than printing out statements in each rule, which was the approach taken in the original durable_rules demos, I instead opted to append generated text elements to an ordered list (txts), that I could then join and render as a single text string at the end.

(We could also return a tuple from a rule, eg (POS, TXT) that would allow us to re-order statements when generating the final text rendering.)

The data itself was grabbed from my Dakar scrape database into a pandas dataframe using a simple SQL query:

q = f"SELECT * FROM ranking WHERE VehicleType='{VTYPE}' AND Type='general' AND Stage={STAGE} AND Pos<=10"
#conn is a connection to the scraped Dakar results database
tmpq = pd.read_sql(q, conn)

The display_time() function called from the ruleset converts a gap in seconds into a natural language phrase (for example, “11 minutes and 19 seconds”):

intervals = (('weeks', 604800),
             ('days', 86400),
             ('hours', 3600),
             ('minutes', 60),
             ('seconds', 1))

def display_time(t, granularity=2, units='seconds', intify=True):
    """Convert a time in seconds into a natural language string."""

    def nl_join(l, andword='and'):
        """Join a list of strings in a natural language way."""
        if len(l)>2:
            return f'{", ".join(l[:-1])} {andword} {str(l[-1])}'
        elif len(l)==2:
            return f' {andword} '.join(l)
        return l[0]

    result = []

    if intify:
        t=int(t)

    #Need better handling for arbitrary time strings
    #Perhaps parse into a timedelta object
    # and then generate NL string from that?
    if units=='seconds':
        for name, count in intervals:
            value = t // count
            if value:
                t -= value * count
                if value == 1:
                    name = name.rstrip('s')
                result.append("{} {}".format(value, name))

        return nl_join(result[:granularity])

To add variety to the rule generated text, I played around with some simple randomisation features when generating commentary sentences. I suspect there’s a way of doing things properly “occasionally” via the rules engine, but that could require some clearer thinking (and reading the docs…) so it was easier to create some simple randomising functions that I could call on within a rule to create statements “occasionally” as part of the rule code.

So for example, the following functions help with that, returning strings probabilistically.

import random

def sometimes(t, p=0.5):
    """Return the string passed to the function with probability p, else return an empty string."""
    if random.random() <= p:
        return t
    return ''

def occasionally(t):
    """Occasionally return a string passed to the function."""
    return sometimes(t, p=0.2)

def rarely(t):
    """Rarely return a string passed to the function."""
    return sometimes(t, p=0.05)

def pickone_equally(l, prefix='', suffix=''):
    """Return an item from a list,
       selected at random with equal probability."""
    t = random.choice(l)
    if t:
        return f'{prefix}{t}{suffix}'
    return suffix

def pickfirst_prob(l, p=0.5):
    """Select the first item in a list with the specified probability,
       else select an item, with equal probability, from the rest of the list."""
    if len(l)>1 and random.random() >= p:
        return random.choice(l[1:])
    return l[0]

The rules handler doesn’t seem to like the numpy typed numerical objects that the pandas dataframe provides [UPDATE: it turns out this is a Python json library issue, rather than a problem with the rules handler itself…], but if we cast the dataframe values to JSON and then back to a Python dict, everything seems to work fine.

import json
#This handles numpy types that ruleset json serialiser doesn't like
tmp = json.loads(tmpq.iloc[0].to_json())

One nice thing about the rules engine is that you can pass statements to be processed by the rules in a couple of ways: as events or as facts.

If we post a statement as an event, then only a single rule can be fired from it. For example:

post('test1',tmp)
print(''.join(txts))

generates a sentence along the lines of R. BRABEC MONSTER ENERGY HONDA TEAM 2020 were in first in their HONDA with a time of 10:39:04.

We can create a function that can be applied to each row of a pandas dataframe that will run the contents of the row, expressed as a dict, through the ruleset:

def rulesbyrow(row, ruleset):
    row = json.loads(json.dumps(row.to_dict()))
    post(ruleset,row)

Capture the text results generated from the ruleset into a list, and then display the results.

txts=[]
tmpq.apply(rulesbyrow, ruleset='test1', axis=1)

print('\n\n'.join(txts))

The sentences generated each time (apart from the sentence generated for the first position crew) contain randomly introduced elements even though the rules are applied deterministically.

R. BRABEC MONSTER ENERGY HONDA TEAM 2020 were in first in their HONDA with a time of 10:39:04.

K. BENAVIDES MONSTER ENERGY HONDA TEAM 2020 were in second representing HONDA.

M. WALKNER RED BULL KTM FACTORY TEAM were in third.

J. BARREDA BORT MONSTER ENERGY HONDA TEAM 2020 were in fourth, with a time of 10:50:06.

JI. CORNEJO FLORIMO MONSTER ENERGY HONDA TEAM 2020 were in fifth, 11 minutes and 19 seconds behind the stage winner.

We can evaluate a whole set of events passed as list of events using the post_batch(RULESET,EVENTS) function. It’s easy enough to convert a pandas dataframe into a list of palatable dicts…

def df_json(df):
    """Convert rows in a pandas dataframe to a JSON string.
       Cast the JSON string back to a list of dicts 
       that are palatable to the rules engine. 
    """
    return json.loads(df.to_json(orient='records'))

Unfortunately, the post_batch() route doesn’t look like it necessarily commits the rows to the ruleset in the provided row order? (Has the dict lost its ordering?)

txts=[]

post_batch('test1', df_json(tmpq))
print('\n\n'.join(txts))

R. BRABEC MONSTER ENERGY HONDA TEAM 2020 were in first in their HONDA with a time of 10:39:04.

X. DE SOULTRAIT MONSTER ENERGY YAMAHA RALLY TEAM were in tenth position, with a time of 10:58:59.

S. SUNDERLAND RED BULL KTM FACTORY TEAM were in ninth, with a time of 10:56:14.

P. QUINTANILLA ROCKSTAR ENERGY HUSQVARNA FACTORY RACING were in eighth position representing HUSQVARNA, 15 minutes and 40 seconds behind R. BRABEC MONSTER ENERGY HONDA TEAM 2020.

We can also assert the rows as facts rather than running them through the ruleset as events. Asserting a fact adds it as a persistent fact to the rule engine, which means that it can be used to trigger multiple rules, as the following example demonstrates (check the ruleset definition to see the two rules that match on the first position condition).

Once again, we can create a simple function that can be applied to each row in the pandas dataframe / table:

def factsbyrow(row, ruleset):
    row = json.loads(json.dumps(row.to_dict()))
    assert_fact(ruleset,row)

In this case, when we assert the fact, rather than post a once-and-once-only resolved event, the fact is retained even if it matches a rule, so it gets a chance to match other rules too…

txts=[]
tmpq.apply(factsbyrow, ruleset='test1', axis=1);
print('\n\n'.join(txts))

R. BRABEC MONSTER ENERGY HONDA TEAM 2020 were in first in their HONDA with a time of 10:39:04.

we got another first…

K. BENAVIDES MONSTER ENERGY HONDA TEAM 2020 were in second, with a time of 10:43:47.

M. WALKNER RED BULL KTM FACTORY TEAM were in third representing KTM.

J. BARREDA BORT MONSTER ENERGY HONDA TEAM 2020 were in fourth representing HONDA, with a time of 10:50:06.

JI. CORNEJO FLORIMO MONSTER ENERGY HONDA TEAM 2020 were in fifth position, 11 minutes and 19 seconds behind the first placed HONDA.

The rules engine is much richer in what it can handle than I’ve shown above (the reference docs provide more examples, including how you can invoke state machine and flowchart behaviours, for example in a business rules / business logic application) but even used in my simplistic way, it still offers quite a lot of promise for generating simple commentaries, particularly if I also make use of enrichment tables and symbolic strings (the rules engine supports pattern matching operations in the conditions).

In passing, I also note a couple of minor niggles. Firstly, you can’t seem to clear the ruleset, which means in a Jupyter notebook environment you get an error if you try to update a ruleset and run that code cell again. Secondly, if you reassert the same facts into a ruleset context, an error is raised that also borks running the ruleset again. (That latter one might make sense depending on the implementation, although the error is handled badly? I can’t think through the consequences… The behaviour I think I’d expect from reasserting a fact is for that fact to be removed and then reapplied… UPDATE: retract_fact() lets you retract a fact.)

FWIW, the code is saved as a gist here, although with the db it’s not much use directly…

Installing Applications via postBuild in MyBinder and repo2docker

A note on downloading and installing things into a Binderised repo, or a container built using repo2docker.

If you save files into $HOME as part of the container build process, then when you try to use the image outside of MyBinder you may find that if storage volumes or local directories are mounted onto $HOME, your saved files are clobbered.

The MyBinder / repo2docker build is pretty limiting in terms of the permissions the default jovyan user has over the file system. $HOME is one place you can write to, but if you need somewhere outside that path, then $CONDA_DIR (which defaults to /srv/conda) is handy…

For example, I just tweaked my neo4j binder repo to install a downloaded neo4j server into that path.