Simple Rule Based Approach in Python for Generating Explanatory Texts from pandas Dataframes

Many years ago, I used to use rule based systems all the time, first as a postdoc, working with the Soar rule based system to generate “cognitively plausible agents”, then in support of the OU course T396 Artificial Intelligence for Technology.

Over the last couple of years, I’ve kept thinking that a rule based approach might make sense for generating simple textual commentaries from datasets. I had a couple of aborted attempts around this last year using pytracery (eg here and here) but the pytracery approach was a bit too clunky.

One of the tricks I did learn at the time was that things could be simplified by generating “enrichment” tables: data truth tables that encode the presence of particular features and that can then be used to trigger particular rules.

These tables would essentially encode features that could be usefully processed in simple commentary rules. For example, in rally reporting, something like “X took stage Y, his third stage win in a row, increasing his overall lead by P seconds to QmRs” could be constructed from an appropriately defined feature table row.
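For example, a minimal sketch of the sort of feature row I have in mind might look something like the following (the column names are invented purely for illustration):

import pandas as pd

#Hypothetical feature ("enrichment") table: one row per crew per stage,
#with boolean / numeric columns encoding features a commentary rule can test
features = pd.DataFrame([{'Crew': 'X', 'Stage': 5, 'stage_winner': True,
                          'win_streak': 3, 'lead_delta_s': 37, 'overall_lead_s': 512}])

#A commentary rule might then fire on, say, stage_winner == True and win_streak >= 3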

I’m also reminded that I started to explore using symbolic encodings to try to encode simple feature runs as strings and then use regular expressions to identify richer features within them (for example, Detecting Features in Data Using Symbolic Coding and Regular Expression Pattern Matching).
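By way of a toy illustration (the encoding scheme here is made up for the purposes of the example), a competitor’s stage results might be coded one character per stage and richer features spotted with a regular expression:

import re

#Toy symbolic encoding: one character per stage ('W' win, 'P' podium, 'O' other)
results = 'POWWW'

#A regular expression can then pick out a richer feature, such as a closing run of stage wins
streak = re.search(r'W{2,}$', results)
if streak:
    print(f'{len(streak.group())} stage wins in a row')
#3 stage wins in a row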

Anyway, a brief exchange today about possible PhD projects for faculty-funded PhD studentships, starting in October (the project will be added here at some point…) got me thinking again about this… So as the Dakar rally is currently running, and as I’ve been scraping the results, I wondered how easy it would be to pull an off-the-shelf Python rules engine, erm, off the shelf, and create a few quick rally commentary rules…

And the answer is, surprisingly easy…

Here’s a five minute example of some sentences generated from a couple of simple rules using the durable_rules rules engine.

The original data looks like this:

and the generated sentences look like this:

JI. CORNEJO FLORIMO MONSTER ENERGY HONDA TEAM 2020 were in fifth position, 11 minutes and 19 seconds behind the first placed HONDA.

Let’s see how the various pieces fit together…

For a start, here’s what the rules look like:

from durable.lang import *

#The inflect package is used to generate natural language ordinals ("fifth", etc) from numbers
import inflect
p = inflect.engine()

txts = []

with ruleset('test1'):
    
    #Display something about the crew in first place
    @when_all(m.Pos == 1)
    def whos_in_first(c):
        """Generate a sentence to report on the first placed vehicle."""
        #We can add additional state, accessible from other rules
        #In this case, record the Crew and Brand for the first placed crew
        c.s.first_crew = c.m.Crew
        c.s.first_brand = c.m.Brand
        
        #Python f-strings make it easy to generate text sentences that include data elements
        txts.append(f'{c.m.Crew} were in first in their {c.m.Brand} with a time of {c.m.Time_raw}.')
    
    #This just checks whether we get multiple rule fires...
    @when_all(m.Pos == 1)
    def whos_in_first2(c):
        txts.append('we got another first...')
        
    #We can be a bit more creative in the other results
    @when_all(m.Pos>1)
    def whos_where(c):
        """Generate a sentence to describe the position of each other placed vehicle."""
        
        #Use the inflect package to natural language textify position numbers...
        nth = p.number_to_words(p.ordinal(c.m.Pos))
        
        #Use various probabilistic text generators to make a comment for each other result
        first_opts = [c.s.first_crew, 'the stage winner']
        if c.m.Brand==c.s.first_brand:
            first_opts.append(f'the first placed {c.m.Brand}')
        t = pickone_equally([f'with a time of {c.m.Time_raw}',
                             f'{sometimes(f"{display_time(c.m.GapInS)} behind {pickone_equally(first_opts)}")}'],
                           prefix=', ')
        
        #And add even more variation possibilities into the returned generated sentence
        txts.append(f'{c.m.Crew} were in {nth}{sometimes(" position")}{sometimes(f" representing {c.m.Brand}")}{t}.')
    

Each rule in the ruleset is decorated with a conditional test applied to the elements of a dict passed in to the ruleset. Rules can also set additional state, which can then be tested by, and accessed from within, other rules.

Rather than printing out statements in each rule, which was the approach taken in the original durable_rules demos, I instead opted to append generated text elements to an ordered list (txts) that I could then join and render as a single text string at the end.

(We could also return a tuple from a rule, eg (POS, TXT) that would allow us to re-order statements when generating the final text rendering.)
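A minimal sketch of that approach, assuming each rule appends a (position, sentence) tuple rather than a bare string, might look like this:

#Sketch: inside a rule, append a (position, sentence) tuple...
#  txts.append((c.m.Pos, f'{c.m.Crew} were in {nth}...'))

#...then sort on position before rendering the final commentary
ordered = [txt for _, txt in sorted(txts)]
print('\n\n'.join(ordered))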

The data itself was grabbed from my Dakar scrape database into a pandas dataframe using a simple SQL query:

q=f"SELECT * FROM ranking WHERE VehicleType='{VTYPE}' AND Type='general' AND Stage={STAGE} AND Pos2:
            return ', '.join(f'{l[:-1]} {andword} {str(l[-1])}')
        elif len(l)==2:
            return f' {andword} '.join(l)
        return l[0]
    
    result = []

    if intify:
        t=int(t)

    #Need a better way to handle arbitrary time strings
    #Perhaps parse into a timedelta object
    # and then generate NL string from that?
    if units=='seconds':
        for name, count in intervals:
            value = t // count
            if value:
                t -= value * count
                if value == 1:
                    name = name.rstrip('s')
                result.append("{} {}".format(value, name))

        return nl_join(result[:granularity])
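For example, the 11 minute 19 second gap reported for the fifth placed crew corresponds to a gap of 679 seconds:

print(display_time(679))
#11 minutes and 19 seconds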

To add variety to the rule generated text, I played around with some simple randomisation features when generating commentary sentences. I suspect there’s a way of doing things “occasionally” properly via the rules engine, but that would require some clearer thinking (and reading the docs…), so it was easier to create some simple randomising functions that I could call on within a rule to create statements “occasionally” as part of the rule code.

So for example, the following functions help with that, returning strings probabilistically.

import random

def sometimes(t, p=0.5):
    """Sometimes return a string passed to the function."""
    if random.random() < p:
        return t
    return ''

def occasionally(t):
    """Sometimes return a string passed to the function."""
    return sometimes(t, p=0.2)

def rarely(t):
    """Rarely return a string passed to the function."""
    return sometimes(t, p=0.05)

def pickone_equally(l, prefix='', suffix=''):
    """Return an item from a list,
       selected at random with equal probability."""
    t = random.choice(l)
    if t:
        return f'{prefix}{t}{suffix}'
    return suffix

def pickfirst_prob(l, p=0.5):
    """Select the first item in a list with the specified probability,
       else select an item, with equal probability, from the rest of the list."""
    if len(l)>1 and random.random() >= p:
        return random.choice(l[1:])
    return l[0]
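To give a flavour of how these combine, here’s a trivial made-up fragment in the style of the whos_where rule (the output varies from run to run):

#Made-up example values
crew, nth, brand = 'M. WALKNER RED BULL KTM FACTORY TEAM', 'third', 'KTM'

print(f'{crew} were in {nth}{sometimes(" position")}{sometimes(f" representing {brand}")}.')
#e.g. M. WALKNER RED BULL KTM FACTORY TEAM were in third representing KTM.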

The rules handler doesn’t seem to like the numpy typed numerical objects that the pandas dataframe provides [UPDATE: it turns out this is a Python json library issue rather than the rules engine disliking np.int64s…], but if we cast the dataframe values to JSON and then back to a Python dict, everything seems to work fine.

import json
#This handles numpy types that ruleset json serialiser doesn't like
tmp = json.loads(tmpq.iloc[0].to_json())
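For example, the standard library json encoder chokes on numpy integer types, which is what motivates the to_json() round trip above:

import numpy as np

#The standard json encoder can't serialise numpy scalar types...
try:
    json.dumps({'Pos': np.int64(1)})
except TypeError as e:
    print(e)
#Object of type int64 is not JSON serializable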

One nice thing about the rules engine is that you can pass statements to be processed by the rules in a couple of ways: as events or as facts.

If we post a statement as an event, then only a single rule can be fired from it. For example:

post('test1',tmp)
print(''.join(txts))

generates a sentence along the lines of R. BRABEC MONSTER ENERGY HONDA TEAM 2020 were in first in their HONDA with a time of 10:39:04.

We can create a function that can be applied to each row of a pandas dataframe that will run the contents of the row, expressed as a dict, through the ruleset:

def rulesbyrow(row, ruleset):
    row = json.loads(json.dumps(row.to_dict()))
    post(ruleset,row)

Capture the text results generated from the ruleset into a list, and then display the results.

txts=[]
tmpq.apply(rulesbyrow, ruleset='test1', axis=1)

print('\n\n'.join(txts))

The sentences generated each time (apart from the sentence generated for the first position crew) contain randomly introduced elements even though the rules are applied deterministically.

R. BRABEC MONSTER ENERGY HONDA TEAM 2020 were in first in their HONDA with a time of 10:39:04.

K. BENAVIDES MONSTER ENERGY HONDA TEAM 2020 were in second representing HONDA.

M. WALKNER RED BULL KTM FACTORY TEAM were in third.

J. BARREDA BORT MONSTER ENERGY HONDA TEAM 2020 were in fourth, with a time of 10:50:06.

JI. CORNEJO FLORIMO MONSTER ENERGY HONDA TEAM 2020 were in fifth, 11 minutes and 19 seconds behind the stage winner.

We can evaluate a whole set of events passed as a list of events using the post_batch(RULESET, EVENTS) function. It’s easy enough to convert a pandas dataframe into a list of palatable dicts…

def df_json(df):
    """Convert rows in a pandas dataframe to a JSON string.
       Cast the JSON string back to a list of dicts 
       that are palatable to the rules engine. 
    """
    return json.loads(df.to_json(orient='records'))

Unfortunately, the post_batch() route doesn’t seem to commit the rows to the ruleset in the provided row order? (Has the list of dicts lost its ordering along the way?)

txts=[]

post_batch('test1', df_json(tmpq))
print('\n\n'.join(txts))

R. BRABEC MONSTER ENERGY HONDA TEAM 2020 were in first in their HONDA with a time of 10:39:04.

X. DE SOULTRAIT MONSTER ENERGY YAMAHA RALLY TEAM were in tenth position, with a time of 10:58:59.

S. SUNDERLAND RED BULL KTM FACTORY TEAM were in ninth, with a time of 10:56:14.

P. QUINTANILLA ROCKSTAR ENERGY HUSQVARNA FACTORY RACING were in eighth position representing HUSQVARNA, 15 minutes and 40 seconds behind R. BRABEC MONSTER ENERGY HONDA TEAM 2020.

We can also assert the rows as facts rather than running them through the ruleset as events. Asserting a fact adds it as a persistent fact to the rule engine, which means that it can be used to trigger multiple rules, as the following example demonstrates (check the ruleset definition to see the two rules that match on the first position condition).

Once again, we can create a simple function that can be applied to each row in the pandas dataframe / table:

def factsbyrow(row, ruleset):
    row = json.loads(json.dumps(row.to_dict()))
    assert_fact(ruleset,row)

In this case, when we assert the fact, rather than post a once-and-once-only resolved event, the fact is retained even if it matches a rule, so it gets a chance to match other rules too…

txts=[]
tmpq.apply(factsbyrow, ruleset='test1', axis=1);
print('\n\n'.join(txts))

R. BRABEC MONSTER ENERGY HONDA TEAM 2020 were in first in their HONDA with a time of 10:39:04.

we got another first…

K. BENAVIDES MONSTER ENERGY HONDA TEAM 2020 were in second, with a time of 10:43:47.

M. WALKNER RED BULL KTM FACTORY TEAM were in third representing KTM.

J. BARREDA BORT MONSTER ENERGY HONDA TEAM 2020 were in fourth representing HONDA, with a time of 10:50:06.

JI. CORNEJO FLORIMO MONSTER ENERGY HONDA TEAM 2020 were in fifth position, 11 minutes and 19 seconds behind the first placed HONDA.

The rules engine is much richer in what it can handle than I’ve shown above (the reference docs provide more examples, including how you can invoke state machine and flowchart behaviours, for example in a business rules / business logic application), but even used in my simplistic way, it still offers quite a lot of promise for generating simple commentaries, particularly if I also make use of enrichment tables and symbolic strings (the rules engine supports pattern matching operations in the conditions).
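For example, here’s a hypothetical sketch of a rule firing on a symbolically coded results string, assuming an enrichment column (resultsCode, say) has been added to each row:

with ruleset('test2'):
    
    #Hypothetical: fire on a symbolically coded run of three or more stage wins
    @when_all(m.resultsCode.matches('[A-Z]*WWW'))
    def win_streak(c):
        txts.append(f'{c.m.Crew} took another stage win, extending their winning run.')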

In passing, I also note a couple of minor niggles. Firstly, you can’t seem to clear the ruleset, which means in a Jupyter notebook environment you get an error if you try to update a ruleset and run that code cell again. Secondly, if you reassert the same facts into a ruleset context, an error is raised that also borks running the ruleset again. (That latter one might make sense depending on the implementation, although the error is handled badly? I can’t think through the consequences… The behaviour I think I’d expect from reasserting a fact is for that fact to be retracted and then reasserted… UPDATE: retract_fact() lets you retract a fact.)
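For the record, a minimal sketch of that retract-then-reassert pattern:

#Retract a previously asserted fact, then assert it again
fact = json.loads(tmpq.iloc[0].to_json())
retract_fact('test1', fact)
assert_fact('test1', fact)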

FWIW, the code is saved as a gist here, although without the db it’s not much use directly…
