Fragment – Accessibility Side Effects? Free Training Data for Automated Markers

Another of those “woke up and suddenly started thinking about this” sort of things…

Yesterday, I was in on a call discussing potential projects around an “intelligent” automated short answer question marking system that could be plugged into a Jupyter notebook environment (related approach here).

Somewhen towards the end of last year, I did a quick sketch of a simple marker support tool that does quick pairwise similarity comparisons between sentences in a submitted answer and a specimen answer. (The tool can report on all pairwise comparisons, or just the best matching sentence in the specimen compared to the submitted text.)
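By way of illustration, here’s a minimal sketch of that sort of pairwise comparison, using TF-IDF vectors and cosine similarity (the function names and example texts here are invented for illustration; my actual tool isn’t necessarily implemented this way):

# Minimal sketch: best-match sentence comparison between a submitted
# answer and a specimen answer (illustrative only).
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def sentences(text):
    """Naive sentence splitter - good enough for a sketch."""
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

def best_matches(submitted, specimen):
    """For each submitted sentence, find the best matching specimen sentence."""
    sub_sents, spec_sents = sentences(submitted), sentences(specimen)
    vec = TfidfVectorizer().fit(sub_sents + spec_sents)
    sims = cosine_similarity(vec.transform(sub_sents), vec.transform(spec_sents))
    return [(s, spec_sents[row.argmax()], row.max())
            for s, row in zip(sub_sents, sims)]

submitted = "The trend is upwards. Sales peaked in June."
specimen = "Sales rose steadily over the period. The peak was in June."
for sub, best, score in best_matches(submitted, specimen):
    print(f"{score:.2f} | {sub} -> {best}")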

One issue with this is the need to generate the specimen text. This is also true for developing or training intelligent automated markers that try to match submitted texts against specimen texts.

As part of the development, or training, process, automated marking tools may also require a large number of sample texts to train the system on. (My simple marker support similarity tool doesn’t: it’s there purely to help a marker cross-reference sentences in a text with sentences in the sample text.)

So… this is what I woke up wondering: if we set a question in a data analysis context, asking students to interpret a chart or comment on a set of reported model parameters, can we automatically generate the text from the data, which is to say, from the underlying chart object or model parameters?

I’ve touched on this before, in a slightly different context, specifically creating text descriptions of charts as an accessibility support measure for visually impaired readers (First Thoughts on Automatically Generating Accessible Text Descriptions of ggplot Charts in R), as well as more generally (for example, Data Textualisation – Making Human Readable Sense of Data).

So I wonder, if we have an automated system to mark free text short answers that ask students to comment on some sort of data analysis, could our automated marking system:

  • take the chart or model parameters the student generated and from that generate a simple “insightful” text report that could be used as the specimen answer to mark the student’s own free text answer (i.e. does the student report back similar words to those our insights generator “sees” and reports back in text form);
  • compare the chart or model parameters generated by the student with our own “correct” analysis / charts / model parameters.

In the first case, we are checking the extent to which the student has correctly interpreted their own chart / model (or one we have provided them with or generated for them) as a free text comparison. In the second case, we are checking whether the student’s model / chart is correct compared to our specimen model / chart etc., based on a comparison of model / chart parameters.
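For the second case, the comparison could be as simple as checking parameter values against the specimen within some tolerance. A minimal sketch (parameter names, values and the tolerance are all invented for illustration):

# Minimal sketch: compare a student's fitted model parameters against
# specimen parameters within a relative tolerance (values invented).
import numpy as np

def parameters_match(student, specimen, rtol=0.05):
    """Return, for each specimen parameter, whether the student's value
    agrees to within the given relative tolerance."""
    return {name: bool(np.isclose(student[name], value, rtol=rtol))
            for name, value in specimen.items()}

specimen = {"slope": 2.0, "intercept": 0.5}
student = {"slope": 1.97, "intercept": 0.58}
print(parameters_match(student, specimen))
# {'slope': True, 'intercept': False}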

Something else. If we have a system for generating text from data (which could be data tables, chart or model parameters, etc.), we might also be able to generate lots of different texts on the same theme, based on the same data. (I recently started exploring data2text again in Simple Rule Based Approach in Python for Generating Explanatory Texts from pandas Dataframes, via the durable_rules package. One of the approaches I’m looking at is to include randomising effects to generate multiple different possible text fragments from the same rule; still early days on that.) If our automated marker then needed to be trained on 500 sample submitted texts, we could automatically generate those (perhaps omitting some bits; perhaps adding correct interpretations, but of misread parameters, so right-but-wrong or wrong-but-right answers; perhaps adding irrelevant sentences; perhaps adding typos; and so on).
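As a crude sketch of what I mean by randomising effects (the templates and data are invented for illustration; the durable_rules approach in the linked post is rather richer than this):

# Minimal sketch: generate varied text fragments from the same data row
# by picking randomly among equivalent sentence templates.
import random

import pandas as pd

TEMPLATES = [
    "{name} rose by {change}% over the period.",
    "Over the period, {name} increased by {change}%.",
    "An increase of {change}% was recorded for {name}.",
]

def textualise(row):
    """Render one data row as a sentence, using a randomly chosen template."""
    return random.choice(TEMPLATES).format(**row.to_dict())

df = pd.DataFrame([{"name": "sales", "change": 12},
                   {"name": "costs", "change": 3}])
for _, row in df.iterrows():
    print(textualise(row))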

In passing, I was convinced I had posted previously on the topic of “robot writers” generating texts from data not for human consumption but instead for search engines, the idea being that a search engine could index the text and use that to support discovery of a dataset. It seems I had written about it, but hadn’t published it. In my draft queue (from summer 2015), I note the presence of two still-in-unfinished-draft posts from the Notes on Robot Churnalism series, left languishing because I got zero response to the first two posts in the series (even though I thought they were pretty good… ;-)

Here’s the relevant quote:

One answer I like to this sort of question is that if the search engines are reading the words, then the machine generation of textual statements, interpretations and analyses may well be helping to make certain data points discoverable by turning the data into sentences that then become web searchable. (I think I was first introduced to that idea by this video of a talk from 2012 by Larry Adams of Narrative Science: Using Open Data to Generate Personalized Stories.) If you don’t know how to write a query over a dataset to find a particular fact, but someone has generated a list of facts or insights from the dataset as textual sentences, then you may be able to discover that fact from a straightforward keyword-based query. Just generating hundreds of millions of sentences from data so that they can be indexed, just in case someone asks some sort of question about that fact, might appear wasteful, at least until someone comes up with a good way of indexing spreadsheets or tabular datasets so that you can make search-engine-like query requests of them; which I guess is what things like Wolfram Alpha are trying to do? For example, what is the third largest city in the UK?

On the other hand, we might perhaps need to be sensitive to the idea that generated content might place a burden on effective discovery. For example, in Sims, Lee, and Roberta E. Munoz, “The Long Tail of Legal Information – Legal Reference Service in the Age of the ‘Content Farm’”, Law Library Journal Vol. 104:3, pp. 411–425 (2012-29) [PDF], …???

I wish I’d finished those posts now (Notes on Robot Churnalism, Part III – Robot Gatekeepers and Notes on Robot Churnalism, Part IV – Who Cares?), not least to remind myself of what related thoughts I was having at the time… There are hundreds of words drafted in each case, but a lot of the notes are still of the “no room in the margin” or “no time to write this idea out fully” kind.

Grabbing Javascript Objects Out of Web Pages And Into Python

Engaging in some rally data junkie play yesterday, I started wondering about whether I could grab route data out of the rather wonderful rally-maps.com website, a brilliant resource for accessing rally stage maps for a wide range of events.

The site displays maps using Leaflet, so the data must be in there somewhere as a geojson object, right?! ;-)

My first thought was to check the browser developer tools network tab to see if I could spot any geojson data being loaded into the page so that I could just access it directly… But no joy…

Hmmm… a quick View Source, and it seems the geojson data is baked explicitly into the HTML page as a data object referenced by the leaflet map.

So how to get it out again?

My first thought was to just scrape the HTML and then try to find a way to scrape the Javascript defining the object out of the page. But that’s a real faff. My second thought was to wonder whether I could somehow parse the Javascript, in Python, and then reference the data directly as a Javascript object. But I don’t think you can.

At this point I started wondering about accessing the data as a native JSON object somehow.

One thing I’ve never quite figured out how to do is find an easy way of inspecting Javascript objects inside a web page. It turns out that it’s really easy: in the Firefox developer tools console, if you enter window to display the window object, you can then browse through all the objects loaded into the page…

Poking around, and by cross referencing the HTML source, I located the Javascript object I wanted that contains the geojson data. For sake of example, let’s say it was in the Javascript object map.data. Knowing the path to the data, can I then grab it into a Python script?

One of the tricks I’ve started using increasingly for scraping data is browser automation via Selenium and the Python selenium package. Trivially, this allows me to open a page in a web browser, optionally click on things, fill in forms, and so on, and then either grab HTML elements from the browser, or use selenium-wire to capture all the traffic loaded into the page (this traffic might include a whole set of JSON files, for example, that I can then reference at my leisure).
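(As an aside, the selenium-wire route looks something like the following; a minimal sketch, with the example URL and the content-type filter my own:)

# Minimal sketch: capture page traffic with selenium-wire and pick out
# the JSON responses (pip install selenium-wire).
from seleniumwire import webdriver  # drop-in replacement webdriver

driver = webdriver.Firefox()
driver.get('https://example.com/page-with-data')  # placeholder URL

# Inspect the requests the page made, looking for JSON payloads
for request in driver.requests:
    if request.response and 'json' in (request.response.headers.get('Content-Type') or ''):
        print(request.url)

driver.quit()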

So can I use this route to access the Javascript data object?

It seems so: simply call the selenium webdriver object with .execute_script('return map.data') and the Javascript object should be returned as text.

Only it wasn’t… A circular reference in the object definition meant the call failed. A bit more web searching, and I found a bit of javascript for parsing cyclic objects without getting into an infinite recursion. Loading this code into the browser, via selenium, I was then able to access the Javascript/JSON data object.

The recipe is essentially as follows: load in a web page from a Python script into a headless web-browser using selenium; find an off-the-shelf Javascript script to handle circular references in a Javascript object; shove the Javascript script into a Python text string, along with a return call that uses the script to JSON-stringify the desired object; return the JSON string representing the object to Python; parse the JSON string into a Python dict object. Simples:-)

Here are the code essentials…

from selenium import webdriver
from selenium.webdriver.firefox.options import Options
import json

options = Options()
options.headless = True

browser = webdriver.Firefox(options=options)

# URL of the page containing the Leaflet map (placeholder; use a real page URL)
url = 'https://www.rally-maps.com/...'
browser.get(url)

#https://apimirror.com/javascript/errors/cyclic_object_value
jss = '''const getCircularReplacer = () => {
  const seen = new WeakSet();
  return (key, value) => {
    if (typeof value === "object" && value !== null) {
      if (seen.has(value)) {
        return;
      }
      seen.add(value);
    }
    return value;
  };
};

//https://stackoverflow.com/a/10455320/454773
return JSON.stringify(map.data, getCircularReplacer());
'''

js_data = json.loads(browser.execute_script(jss))
browser.close()

#JSON data object is now available as a dict:
js_data

Another one for the toolbox:-)

Plus I can now access lots of rally stage maps data for more rally data junkie fun :-)

PS I also realised that this recipe provides a way of running any old Javascript from Python and getting the result of any computation stored in a Javascript object back into the Python environment.
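For example, trivially (assuming a still-open webdriver session as created above):

# Evaluate arbitrary Javascript in the browser and get the result in Python
result = browser.execute_script('return 1 + 2')           # -> 3
title = browser.execute_script('return document.title')   # page title string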

PPS it also strikes me that ipywidgets potentially offer a route to arbitrary Javascript execution from a Python environment, as well as real-time state synching back to that Python environment. In this case, the browser executing the Javascript code will be the one used to actually run the Jupyter notebook calling the ipywidgets. (Hmm… I think there’s a push on to support ipywidgets in VSCode? What do they use for the Javascript runtime?)
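As a very rough sketch of that state-synching idea, though using IPython’s Javascript display machinery in the classic Jupyter notebook rather than ipywidgets proper (IPython.notebook.kernel is only available in the classic notebook UI, and js_result is a name of my own choosing), something like the following should work:

# Run some Javascript in the notebook page and push the result back into
# the Python kernel by executing a Python assignment (classic notebook only)
from IPython.display import Javascript, display

display(Javascript("""
const result = {answer: 6 * 7};
IPython.notebook.kernel.execute('js_result = ' + JSON.stringify(result));
"""))

# After the Javascript has run, js_result is available in the Python session
# (here as a dict, since this JSON happens to also be a valid Python literal)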

On the Economics of Live Music

Just back from a very enjoyable evening — again — at our local independent music venue, Strings Bar and Venue (who don’t do themselves any favours by having a Facebook page that often ranks above, and performs better than, their own webpage… [just sayin’…]…)

…and pondering the economics…

(I used to promote bands as an indie promoter, way back when… Buy me a beer and I’ll maybe tell you some stories…)

Thinks: suppose you have a 200 capacity venue (a nice size for an indie venue…).

Suppose also, as a band, you have two million (2M) streamed plays. Good, right? Meh… Say there are 100 towns in the UK with populations over 50k. Assuming a uniform age distribution, that’s 5k people per decade of age per town, giving 20k people aged 15–55 in each town (the distribution won’t be uniform, but this is back of the envelope maths and I’ve had a pint or two…). For 100 towns, that’s 20k × 100 = 2M. One play per person.

Go back to that 200 capacity venue. For all the streams, how many people in the town are going to turn out to see the band play? 200 people out of 20k means 1 in 100. 1%. As if…

Say you have a turnout of 100 at a tenner per ticket (£1k on the door). The venue has two bar staff for the 3 hour event, with a cleaner after (say £100), plus someone on the door and a licence-stipulated security presence (say another £100). Then there’s the sound engineer, who maybe does 2 or 3 gigs a week if they’re lucky, at say £50 a time (so they need another job, right?). The venue also takes a tenner per person over the bar (optimistically…), but a 50% markup means their actual take is less than £5 per person (so <£500).

The venue is open maybe 4 days a week, which caps the take.

The band have travelled (£100-200 travelling expenses) and there's 1-5 of them. Maybe a driver too. They may have to take one or two days off work, depending on how far they've travelled, to be able to do the gig.

On the occasions I do private contracts, I can charge maybe £500 per day. You think each band member is on that rate?

For a band, a 1 hr gig is the day's work. 3–5 of them in the band. Plus a driver / merch stand person. (You'd like to think they have a roadie, but for a lot of small bands I see, there is no roadie, and there may not even be anyone driving / on the merch who isn't a parent of one of the band members.) A lot of the touring bands I see are way more professional than I am.

Do the math.

If you have a local venue: support it; buy the tickets; shove money over the bar; buy the band merch.

FFS.

Onboarding Into A New Github Repo – Initial Commit Actioned PRs

One of the blockers to folk using Github, I think, is the perception that it’s hard and requires lots of command line foo. That is, and isn’t, true: you can get started very easily just by adding and editing files via the web (eg Easy Web Publishing With Github).

A new template repo, fastai/fastpages, from the fastai folk uses Github Actions to manage an automated publishing workflow that generates a blogsite, using Github pages, from notebooks and markdown files.

Actually, maybe fastai/fast_template is a better repo to use? It could still do with some more babysteps explanation about what the _config.yml file is and does, for example.

The automation requires a secret to be set as part of the repo settings to allow Github Actions to push generated files to the branch that’s actually used to publish the resulting website.

Trying to write instructions that a novice can follow to set up the repo can be a bit fiddly, but the fastai folk have an interesting take on that: the template also includes a “first run” / “initial commit” action that makes a PR into your cloned repository telling you how to proceed, and providing direct links to the pages on which you need to edit your own settings.

A few moments after your cloned repo is loaded, refresh it, and you should see at least one pull request (PR).

My screenshot actually shows two automated PRs have been made: as well as the fastai on-boarding PR, a Github bot has spotted a vulnerability in a dependency and made its own PR to fix it.

The fastai PR provides instructions, and direct links to pages you need to access, to set up the repo:

There’s still the issue of directing the novice user to the PR page (the repo home page, as created, will show 0 PRs initially: it doesn’t refresh itself to show the updated PR count automatically, I don’t think?) and then how to merge the PR. (Downes also commented on the instructions and tried to make them even more baby steps followable here.)

But the use of the initial commit triggered PR to stage the presentation of instructions is an interesting onboarding mechanic, I think, and one I might add to some of my own template repos.

Simple Text to Speech With Skulpt

On my to do list for many years has been getting my head round how Skulpt works. In case you haven’t come across it before, Skulpt is a small, client-side Javascript package that implements elements of Python in the browser. (Originally it only supported Python 2.7 syntax, but the master branch has now moved to supporting Python 3.7 syntax.)

My proximal reason for playing with it is that it is used in Ev3devSim, a simple browser-based robot simulator that I’m pretty sure we’re going to use in a course update. I’m itching to get back to tinkering with it, but as we’re on strike, I’m not going to until the strike action is over. This will make it hard to get it into the state I want it in, and to develop the activities I’d like to create around it, by the time they’re required to meet handover deadlines, but the strike action seems designed to cause stress and disruption to the strikers rather than the organisation we work for. Such is life. The collective decided and we’re out.

Although this post is related to that, this blog has been languishing somewhat lately, so before I forget the small progress I made tinkering with Skulpt, here’s a quick write-up.

The challenge I set myself was to create a simple text-to-speech function callable from an in-browser Skulpt Python environment that would convert a provided text string to speech using the SpeechSynthesis part of modern web browsers’ Web Speech API (this also defines a speech recognition API, which could offer some interesting accessibility directed possibilities…).

Calling the API from Javascript is straightforward enough: the speechSynthesis.speak(new SpeechSynthesisUtterance(txt)) Javascript call, run in a browser, will speak the provided text string aloud. To ensure that the text object passed in is a string and not some other object type, we convert the Skulpt object to a Javascript string by calling it as txt.$jsstr().

If we wrap this API call in a Javascript function, we can make it available by saving into a module file (such as src/lib/playsound.js) with some boilerplate that identifies the file as a loadable module and defines the function(s) available within it.

// src/lib/playsound.js
//  
// Define a Javascript function to do 
// whatever we want the Skulpt Python function to do...
function say(obj) {
    speechSynthesis.speak(new SpeechSynthesisUtterance(obj.$jsstr()));
}

// Define the file as a Skulpt Python module
var $builtinmodule = function(name)
{
    var mod = {};

    // Add the say function to the module
    mod.say = new Sk.builtin.func(function(obj){
        say(obj);
        return new Sk.builtin.none;
    })

    return mod;
}

With node.js installed, run npm run dist in the top level repo directory. A series of tests will be executed and copies of skulpt.min.js and skulpt-stdlib.js built into the Skulpt dist folder. These files can then be loaded into a web page and used to power in-browser Python code execution.

The module can then be loaded into a Skulpt Python context and the text to speech function called as follows:

import playsound
playsound.say('hello world')

In passing, I note that the forked version of Skulpt used in BlockPy supports other handy packages not in the base Skulpt distribution, including matplotlib (src/lib/matplotlib). I’m not sure how tightly this is bound into the BlockPy UI, or how straightforward it would be to make use of it in an Ev3Dev context.

I also note that the sqlite3 module seems to be unimplemented. Would it make sense to try to wrap kripken/sql.js in a Skulpt src/lib module, I wonder?

CountryWatch – Rural Surveillance

Eighteen months or so ago, looking for a bite to eat in advance of going to catch a flight, we stumbled across a village somewhere, with a village green… and a surveillance camera.

Over the weekend, in rural Devon, down a steep, single track road, leading to an out of the way beach, and a coastal walk: a car park.

A pay on exit car park.

A pay on exit car park with ANPR.

A pay on exit car park with ANPR and car registration autocomplete as part of the payment machine UI.

I hate modern technology.

PS Meanwhile, in Downderry, Cornwall, some villagers at least seem to be of the mind that “the arrival of the [ANPR] cameras [in the local pub car park] was ‘an offence in a village setting'”.


Clock Watching

Last week, as something of an impulse purchase, I bought a 19th century skeleton clock from a clock shop I wandered past, by chance, in Teignmouth (“Tinmuth”, I think?) — Time Flies:

The clock’s back home now, and I’m slowly starting to learn about it (so if I talk nonsense in this post, please feel free to pick me up on it via the comments!):

As a first time clock owner, it’s fascinating trying to set it up. The period is tweaked via the pendulum — lengthen the pendulum and you slow down time (i.e. fix a fast running clock). It seems to be running a bit slow at the moment, so I need to raise the pendulum slightly, but I figure I’ll wait another 18 hours or so, to give it another full day’s run, and see what the daily error is. (I suspect it’s still getting used to ambient temperatures etc., and settling in after its trip home.) There is some (deliberate? consequence of age?) freedom in how the wheels align, and one of those definitely seemed out, so I pushed it back, only to have the clock stop after 20 mins or so as various bits of my tinkering seemed to have compounded the wrong way: the energy supply must be sensitively tuned relative to the amount of friction that can be introduced into the system.
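In case it helps future me, the direction of that fix follows from the standard small-angle pendulum result:

T = 2\pi\sqrt{\frac{L}{g}}

so increasing the effective pendulum length L increases the period T (a slower clock), and raising the bob shortens L and speeds it up.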

Slightly more off-putting was a clunk on the rise to the hour, increasing in frequency as the hour approached and then slowing after it. There’s a single strike (I guess that’s an example of a complication, unless complications only refer to watches???), so I wondered if it could be something to do with the eccentricity of that; but it had more the sound of something slipping or giving way, which I fancied might have something to do with the fusee powertrain:

Having emailed a quick audio grab to the clock shop:

a response quickly came back that, firstly, it was very off-beat (which I’d been introduced to in the shop as one of the things that could go “wrong” with it), and, of less concern, that the clunk was likely a thing, perhaps with the fusee mechanism, that would probably start to settle down as the clock found its way and tempered in:

Taking a look at the audio clip in Audacity, it’s easy to see that the tick and the tock were not evenly spaced:
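Audacity does the job by eye, but the same check could be scripted. A quick sketch using librosa’s onset detector on the audio grab (“clock.wav” is a placeholder filename):

# Detect tick/tock onsets in the recording and compare alternate intervals;
# for an in-beat clock the two averages should be (roughly) equal
import librosa
import numpy as np

y, sr = librosa.load("clock.wav")
onsets = librosa.onset.onset_detect(y=y, sr=sr, units="time")
intervals = np.diff(onsets)
print(intervals[::2].mean(), intervals[1::2].mean())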

The fix, as I’d been shown in the shop, and clarified via the “Andrew Clayton, Clock Repairs” website, from which the below image was taken, was to “bend the crutch”:

My warped logic for which way to bend the crutch (the bit at the back) was to bend it towards the tock side, figuring that the clock needs to spend less time getting back to the centre point from that side. So right hand high and push low with the left, counter to the above example.

Things are a bit better now (though a little bit more adjustment is still required), and the clunking seems to have settled a bit too, although it seems to have just come back now the temperature in the house is changing as night falls and the heating does whatever the heating does:

One thing I did notice, having got the beat (I thought) sorted, was that it really needs setting up in situ. I’d got a pretty good beat going with the clock sat on a rug, but when I moved it back it went off again: presumably the level was slightly off in one location relative to the other. A small two-axis spirit level is now on my “must-get-one-of-those” list.

Quite a fascinating machine, really, and something to learn the ways of over the days, weeks, months and years. It’s an eight-day wind and needs a service at least every 20 years, apparently…

In passing, and in trying to start looking for sources of vocabulary (pretty much all learning is based, in part at least, on getting the vocabulary down and relating that to what you can see, and hear…), I came across various mentions of Parliament clocks, named after the short-lived Duties on Clocks and Watches Act, 1797, and the idea of a marriage, that is, a clock in a non-original case (from a “marriage of unrelated parts”).

There’s a lot there that might be interesting to explore for a story or two, methinks…

And it’s far more interesting than digital tech…