Fragment: A Personal Take on Anaconda

Anaconda is a cross-platform (Windows / Mac OS X / Linux) Python distribution developed commercially by Continuum Analytics, Inc. and built around the conda package management system. It is available as a free open source individual edition (docs), as well as various commercial editions.

The distribution includes:

  • a comprehensive scientific computing stack (including things like Pandas, IPython, Jupyter tools, NumPy, SciPy, Matplotlib etc) by default;
  • a desktop development application (the Spyder editor);
  • a cross-platform package manager that is capable of bundling operating system packages as well as language specific packages (Python, R etc);
  • support for separate conda environments/user namespaces.

The Anaconda distribution has several drawbacks:

  • the download size and footprint on disk is large;
  • there are side-effects:
    • Anaconda may rewrite paths that clobber already installed software;
    • there may be unwanted elements in the distribution (eg the Spyder application, unrequired packages etc);
  • creating and testing new Anaconda packages, often customised for each operating system, can be complex;
  • package releases may lag official releases, and not all release versions may be available;
  • dependency reconciliation / conflicts can often cause issues when installing packages;
  • finally, the installation process is not always smooth…

Using custom distributions via Miniconda can reduce the overhead of unused / unwanted packages and components but this adds to maintenance and testing overheads.

In terms of flexibility, conda environments can be configured for various languages such as Python and R. This means that different environments could be configured for different modules. However, the overhead on any given module of teaching students how to use environments often means that students work in a single default environment, which can cause issues if students are required to install Anaconda for different modules. The solution is to teach students how to use conda environments, but this adds weight to an individual module, does not necessarily appear to add value to a specific module, and may cause issues of its own (such as students being confused about what environment they are working in, how to move between environments, and so on).

There are good arguments for teaching students in general about namespaces and environments in computational settings at level 1 (Python environments, Jupyter notebook kernels, conda environments), and for developing skills in using them, but this is more of a qualification-level / cross-module / generalist skill and benefit, so no module is willing to use its teaching time budget to develop a skill that is not immediately, or apparently directly, relevant.

As a scientific computing stack, Anaconda is arguably limited in terms of the sorts of environment or application it can distribute. Anaconda can be used to deploy environments for developing software or running software. Deploying and running an application may require installing particular packages (perhaps in a particular conda environment to keep them isolated from other applications) and then running an application server.

When it comes to providing consistency of environment between hosted solutions provided by the institution, or by third parties, the need to run Anaconda on some host operating system means that we cannot guarantee that students are working in exactly the same environment, with the same user interface or performance, across platforms or across local and hosted deployments.

For deploying simple applications, some languages may have support for creating self-contained executable packages capable of running on various platforms (for example, in the case of Python, ) but as these are not generalisable, they are of limited interest.

A more general approach is to deploy software environments and applications using a virtual machine of some sort, either a “full” virtual machine such as those provided by VirtualBox or VMWare, or using containerised software such as Docker containers. The use of virtualised software is more general — in producing the deployment we have full control over the contents of the environment, such as the operating system and a conda environment installed into it — and the student will run exactly the same software wherever it is deployed (locally, institutionally hosted, third party hosted).

One disadvantage of the virtualised approach is that it makes it harder to deploy desktop applications: the virtualised environment does not have access to the host desktop. Running desktop applications inside a virtual machine and rendering them on the host desktop is possible, but may require complex configuration and the installation of additional window management components or some sort of bridge. That said, virtual machine applications such as VirtualBox do provide a visual desktop environment if required, and VMs and containers alike can also publish desktops to remote desktop applications such as the cross-platform Microsoft Remote Desktop (RDP) client, or via a browser using tools such as noVNC or XPRA. However, handling video or audio may be problematic with such clients.

Help – I deleted my notebook/notebook cells

A trick that I need to try but remark on here, because it might help some students who accidentally delete things and only realise too late…

Via a tweet, I learned that the IPython %history magic, which by default displays the IPython history (essentially, your run code cell history) from the current session, can also search back into previous sessions.

From the docs, you can also go back in time to previous sessions:

  • line 4, current session: 4
  • lines 4-6, current session: 4-6
  • lines 1-5, session 243: 243/1-5
  • line 7, session 2 before current: ~2/7
  • From the first line of 8 sessions ago, to the fifth line of 6 sessions ago: ~8/1-~6/5

The -g flag will search through all session histories. This is perhaps best used in association with the -f flag, which saves the output to a file. The -u flag will only save unique entries.

So to capture something you lost from the mists of time, it will quite likely be found in %history -g -f phew.txt.
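Under the hood, I believe the session histories are kept in a sqlite database, which suggests you could also search them directly. Here’s a sketch of what %history -g is doing, run against a toy in-memory database using what I understand to be the layout of IPython’s history table (session, line, source, source_raw); the rows and the search_history helper are made up for the demo:

```python
import sqlite3

# A toy database mimicking IPython's history store
# (normally something like ~/.ipython/profile_default/history.sqlite).
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE history (session integer, line integer, source text, source_raw text)"
)
rows = [
    (1, 1, "import pandas as pd", "import pandas as pd"),
    (1, 2, "df = pd.read_csv('data.csv')", "df = pd.read_csv('data.csv')"),
    (2, 1, "df.groupby('region').sum()", "df.groupby('region').sum()"),
]
db.executemany("INSERT INTO history VALUES (?, ?, ?, ?)", rows)

def search_history(db, pattern):
    """Return (session, line, source) tuples whose source matches pattern,
    across all sessions -- roughly what `%history -g PATTERN` reports."""
    q = "SELECT session, line, source FROM history WHERE source LIKE ? ORDER BY session, line"
    return db.execute(q, (f"%{pattern}%",)).fetchall()

print(search_history(db, "groupby"))
```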

PS Thoughts:

  • how big is my history cache? And how is it stored? In a sqlite db would be handy…
  • where is the history cache? If you’re working in a dockerised environment and using a fresh container each work session, then to go deep and search previous work sessions would mean a mount from persistent storage needs to cover wherever the IPython history is kept.
  • could we fashion a simple emergency toolbelt tool to recreate, in part at least, a notebook from an IPython history for a given session?! Maybe just opening a saved session history file via jupytext would be a start? The history includes comments (it captures the complete contents of each run cell), which could then be extracted as md cells, for example. But then, markdown cells presumably aren’t captured in history so would be lost to future history grabs. Which perhaps puts cell comments (in a notebook context) into a different light: code cell comments are recoverable from IPython history but md comments aren’t…
  • Security implications…. like: when you explicitly assign your password to a variable in a code cell, and then immediately delete the cell once it’s run “to be safe”… Oops…
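On the emergency toolbelt point, here’s a minimal sketch of such a tool: given a list of recovered cell sources (for example, lines pulled out of %history -g -f phew.txt and split back into chunks), we can hand-build a skeleton nbformat 4 notebook document; the notebook_from_history helper name is made up:

```python
import json

def notebook_from_history(sources, path=None):
    """Rebuild a minimal nbformat-4 notebook from a list of recovered
    code cell sources; optionally save it as an .ipynb JSON file."""
    nb = {
        "cells": [
            {
                "cell_type": "code",
                "execution_count": None,
                "metadata": {},
                "outputs": [],
                "source": src,
            }
            for src in sources
        ],
        "metadata": {},
        "nbformat": 4,
        "nbformat_minor": 5,
    }
    if path:
        with open(path, "w") as f:
            json.dump(nb, f, indent=1)
    return nb

nb = notebook_from_history(
    ["import pandas as pd", "# comments survive in history\nprint('hello')"]
)
print(len(nb["cells"]))  # 2
```

The code cell comments come back for free, as noted above; any original markdown cells are gone for good.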

Authoring Reusable Code in Python Files from Jupyter Notebooks Using %%writefile and %run magics or autoreloaded file modules

One of the charges often made against Jupyter notebooks is that you can often end up locking code up in a notebook in a less than portable way: the code runs fine in the notebook, but how do you use it elsewhere?

If you install the Python Jupytext package, you can actually author python files (files with a .py suffix) in a notebook environment, with code cells saved into the Python file as code and markdown saved into the file as comments.

But I realised there’s another IPython supported trick that preserves the notebook authoring and code use practice whilst at the same time making code available in a file that you can load into other notebooks. The trick relies on the pattern:

  • save the code in a code cell to a file using the %%writefile block cell magic; note that this saves, but does not run, the code in the code cell; by default, %%writefile will overwrite a pre-existing file of the same name;
  • in the next cell, use the %run line magic to run the contents of the file; this has the same effect as running the unmagicked code cell.

By default, the file is executed in a fresh, empty namespace: it does not see names from the interactive session, such as previously imported modules or interactively set values. After execution, the IPython interactive namespace is updated with all variables defined by the program. Using the -i flag will instead load and run the script in the same interactive/IPython namespace as the unmagicked code cells.

This pattern allows you to save the code in one cell and then run it in the next. The contents of the saved file can also be run in another notebook using the %run / %run -i line magic.
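Outside of IPython, the standard library runpy module gives a feel for the %run semantics just described: the file is executed in a fresh namespace, and we get the resulting names back afterwards. The file and variable names here are made up for the demo:

```python
import os
import runpy
import tempfile

# Write a "cell" to a file (what %%writefile does)...
code = "answer = 6 * 7\ndef double(x):\n    return 2 * x\n"
path = os.path.join(tempfile.mkdtemp(), "cell_code.py")
with open(path, "w") as f:
    f.write(code)

# ...then execute it (roughly what %run does): the file runs in a
# fresh namespace, and the resulting globals are handed back.
ns = runpy.run_path(path)
print(ns["answer"], ns["double"](5))  # 42 10
```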

If you want to add the content of additional code cells to a pre-existing file, you can use the -a (append) flag with the block magic (for example, %%writefile -a FILENAME). Note that if you repeatedly run such a cell, with slightly updated code each time, for example, the new code will get appended to the file each time, so you may end up with multiple versions of the code saved to the same file.

Another way of accessing the contents of the saved file is just to treat it as a module. For example, from FILENAME import * or from FILENAME import MYFUNCTION. Note that trying to reimport a package, or a function from a package, will not generally work if that item has already been imported into a Python session, even if the underlying code has changed.

However, another magic can be used to force the reload of imported items whenever a code cell is run: autoreload. Load in the magic with %load_ext autoreload and then enable automatic reloads with %autoreload 2. If you now use the pattern of writing the contents of a code cell to a file, and then you run a cell to import the required function, each time you call the function, the latest version of the module, and any functions defined within it, will be read in from the file via the autoreload.
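For anyone curious about what autoreload is doing on their behalf, here’s a hand-cranked equivalent using the standard library importlib.reload: write a module, import it, edit the file, reload, and the new definition takes effect. The mymod module name is made up for the demo:

```python
import importlib
import os
import sys
import tempfile

# Write a first version of the "cell" out to a module file...
moddir = tempfile.mkdtemp()
sys.path.insert(0, moddir)
modpath = os.path.join(moddir, "mymod.py")
with open(modpath, "w") as f:
    f.write("def greet():\n    return 'hello'\n")
importlib.invalidate_caches()  # make sure the new file is seen by import

import mymod
first = mymod.greet()

# ...now "edit the cell and re-run %%writefile"...
with open(modpath, "w") as f:
    f.write("def greet():\n    return 'goodbye'\n")

# ...and reload: this is the step %autoreload 2 automates for you.
importlib.reload(mymod)
second = mymod.greet()
print(first, second)  # hello goodbye
```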

Coding Expectations, No. 23

Another tiny example of the sort of thing I think of in terms of why I think that “everyone should learn to code”.

Even though it’s only a couple of lines, it unpacks a long way. And there are some big ideas in there. But writing that sort of code, which is complete and (I would argue) useful (I created it to scratch my own itch), is the sort of thing I’d enjoy hearing about if it was something one of our students had just naturally thought to do (via).


See also: when did you last write any code? (My answer: a couple of hours ago…)

Exploring the Hierarchical Structure of DataFrames and CSV Data

Whilst tinkering with some possible ideas for updates to materials in the Data Management and Analysis course, I thought it might be useful, as an aside at least, to show how simple data tables might represent hierarchically structured data.

Take the following table for example:

A snapshot of UK administrative geography.

The column order could make things jump out at you a bit more, and we can also use the power of pandas multi-indexes to structure the data a little bit more:

pandas dataframe with hierarchical data unsorted multi-index

See it yet? Let’s sort the index terms:

pandas dataframe with hierarchical data sorted multi-index

To anyone who works with data regularly, a quick scan of the original data and you immediately know that the data can be arranged as a hierarchy or tree object. The table has countries, which have regions, which contain local authorities, which have wards: simples.
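By way of a minimal sketch, here’s the multi-index trick applied to a toy version of the table; the rows here are illustrative stand-ins, not a complete geography:

```python
import pandas as pd

# A toy version of the administrative geography table, rows deliberately shuffled.
df = pd.DataFrame([
    ("England", "South East", "Oxford", "Carfax"),
    ("England", "North West", "Liverpool", "Riverside"),
    ("England", "South East", "Oxford", "Headington"),
    ("England", "North West", "Liverpool", "Anfield"),
], columns=["country", "region", "local_authority", "ward"])

# Impose the hierarchy as a multi-index, then sort it so that the
# country > region > local authority > ward tree jumps out.
indexed = df.set_index(["country", "region", "local_authority"]).sort_index()
print(indexed)
```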

In your mind’s eye, you sense that the data can be structured in just such a way and you make a mental note to that effect, as a structural property of the data that might be handy to draw on if you need to sort, group or filter the data, or partition it in other ways. Furthermore, by observing the coding schemes and levels of granularity associated with particular columns, you immediately get a sense of what other sorts of data you might be able to merge or link in to the data set, and how easy that would be to achieve (I’ll try to post more about linking data in another post).

But for a novice, the “consequences” of the way the table is organised at a structural level (read in a particular way, the data columns do define a hierarchy) are perhaps not obvious from the apparently unsorted and arbitrary order of the rows and the columns that give it its current visual appearance. I can look at the first table and, from the columns, pretty much immediately grok that I can probably treat the data as a tree, and all that follows from that. Pretty much just from a glance.

So how can we help students get a feel for the structures that may be evident in a raw data table that they meet, perhaps in an unsorted fashion? One way might be to visualise the tables and re-present them in ways where the structure becomes more visible. Something like this perhaps:

That’s one of the things I see, just top right behind my eyes (in NLP terms, “visually created”), if I try to “see” something like the original table in visual structural terms:

A snapshot of UK administrative geography.

But I don’t see it; it feels almost more like I breathe it. It’s just there. It jumps right out at you into the back of your head as a thing you know to be true of the data. And like a perceptually multistable figure-ground image, when you start to know the structures are there, you can bring them in and out of attentive focus:

One way we can try to make the structure evident in the table is through sorted index terms, as demonstrated above. Another way is to use a macroscope to view the whole dataset in a particular way.

Many folk will be familiar with the idea of a microscope, a tool that lets you look in very close-up detail at a tiny piece of a picture. A macroscope goes the other way: it lets you look at everything at a single glance.

So what sort of macroscope might you use? One particular macroscope I’m particularly fond of is a treemap. This is one of the few flavours of circumscribed pie chart I like (the two dimensional mosaic plot being another), although it may take a little bit of getting your head round, like you can’t quite breathe it all in to understand the full consequences of what it’s trying to show you.

Tree map of England regions and local authorities

Here’s a snapshot of the code used to create the original interactive view of the treemap (created using a plotly treemap):

Code to create plotly interactive treemap from pandas dataframe

One of the things that surprised me whilst I was looking for a way of grabbing the tree structure out of the table as a Python dictionary, and then exporting it as a JSON file, was that there wasn’t an obvious way (or so it seemed to me) of exporting it from a suitably indexed pandas dataframe. There are several ways of orienting the dictionary export from a dataframe, but I couldn’t see how to export a multi-index with a tree structure as a tree based data structure. (If you have a code fragment that does that, please share it in the comments.)
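In the meantime, here’s one way of rolling your own export, recursively grouping on each level of the hierarchy in turn; the df_to_tree function name and the toy rows are my own:

```python
import pandas as pd

def df_to_tree(df, levels, leaf):
    """Recursively convert a flat table into a nested dict keyed by the
    given hierarchy of columns, with lists of `leaf` values at the tips."""
    level, *rest = levels
    if not rest:
        # Innermost level: map each key to its sorted leaf values.
        return {k: sorted(g[leaf]) for k, g in df.groupby(level)}
    # Otherwise, group on this level and recurse into each subtable.
    return {k: df_to_tree(g, rest, leaf) for k, g in df.groupby(level)}

df = pd.DataFrame([
    ("England", "South East", "Oxford", "Carfax"),
    ("England", "South East", "Oxford", "Headington"),
    ("England", "North West", "Liverpool", "Anfield"),
], columns=["country", "region", "local_authority", "ward"])

tree = df_to_tree(df, ["country", "region", "local_authority"], "ward")
print(tree)
# {'England': {'North West': {'Liverpool': ['Anfield']},
#              'South East': {'Oxford': ['Carfax', 'Headington']}}}
```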

One way of creating such an object would be to represent the table as a graph using the networkx Python package, with different column combinations used to define the edge-list (I’ll post more on how to spot graph/network structures in a dataset in a later post), but I also noticed the presence of a more specialised tree handling package, treelib, so I thought I’d give that a go…

To access the code, see this gist:

Being a bear of little brain, and docs and examples being in short supply, I started out with a very literal way of constructing the tree, taking one level of the tree at a time:

There’s obviously repetition in there, but not the sort of one-step-after-another repetition that lets you iterate over a particular operation in a “linear” way. Instead, the repetition is nested, looping ever deeper within itself (spiralling, you might say). Which is suggestive of an algorithmic approach that is hugely powerful and can end up being fiendishly complicated to get right if you’re angrily trying to get it to work at 3am, even if the (correct) solution is beautifully simple and elegant: recursion.

Anyway, here’s my (probably less than elegant!) attempt at a recursive function, one capable of calling itself, to build the tree from the table:

Recursive function to create a tree from a hierarchically structured table

It works by specifying combinations of columns that are used to define a unique identifier for each node in the tree as well as its label and then defining parent-child relationships between them.

Specifying the columns at each level of the tree

(The tree package seemed to choke when using labels as identifiers, and in the general case the labels could not be guaranteed to be unique anyway, so I actually make use of the various code columns for the identifiers and the name columns for the labels.)
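Stripped of the treelib specifics, the shape of the recursive build looks something like the following pure Python sketch, with (code, name) column pairs supplying the identifier and label at each level. The column names and the build_tree helper are hypothetical, and the GSS-style codes are just illustrative:

```python
def build_tree(rows, levels, parent="root", nodes=None):
    """Recursively register nodes from a flat table.
    rows: list of dicts; levels: list of (code_col, name_col) pairs,
    outermost level first. Each node gets a unique identifier (its code),
    a label (its name) and a pointer to its parent."""
    if nodes is None:
        nodes = {"root": {"label": "UK", "parent": None}}
    if not levels:
        return nodes
    (code_col, name_col), *rest = levels
    # Group the rows by this level's code column...
    groups = {}
    for row in rows:
        groups.setdefault(row[code_col], []).append(row)
    # ...register a node per group, then recurse into the next level down.
    for code, group in groups.items():
        if code not in nodes:
            nodes[code] = {"label": group[0][name_col], "parent": parent}
        build_tree(group, rest, parent=code, nodes=nodes)
    return nodes

rows = [
    {"ctry_code": "E92000001", "ctry_name": "England",
     "region_code": "E12000008", "region_name": "South East"},
    {"ctry_code": "E92000001", "ctry_name": "England",
     "region_code": "E12000002", "region_name": "North West"},
]
nodes = build_tree(rows, [("ctry_code", "ctry_name"), ("region_code", "region_name")])
print(nodes["E12000008"])  # {'label': 'South East', 'parent': 'E92000001'}
```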

Conveniently, the tree package can also export the data as a JSON data structure that we can then convert to a Python dictionary. However, convenient as the data structure might be for certain forms of processing, it’s not a very clean data structure:

tree exported to JSON and converted to a Python dictionary

I know the child nodes are children because of their relationship to the parent node — I don’t need to be explicitly told they’re children. So how might we prune this data structure? Again, coding the steps to process the data in “longhand” as a linear (if nested) sequence of operations gives us a clue:

Literally (linearly) pruning the children…

As before, we see repeating nested structures that we can crib from to help us create a recursive function that will call itself to descend the tree and prune out the explicit children nodes no matter how many levels deep the tree goes:

Recursive function to prune explicit “children” nodes out of the dictionary data structure

Here’s an example of the output for a small subset of the table:

Example of pruned JSON/Python dictionary
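In outline, the pruning function looks something like this; the prune helper is my own name for it, and the toy input mimics the nested {tag: {"children": [...]}} style of structure shown above, with leaves as bare strings:

```python
def prune(node):
    """Recursively rewrite {"tag": {"children": [...]}} structures as
    plain nested dicts, dropping the explicit "children" keys."""
    if isinstance(node, str):
        # A leaf node is just its tag.
        return {node: {}}
    tag, payload = next(iter(node.items()))
    pruned = {}
    for child in payload.get("children", []):
        pruned.update(prune(child))
    return {tag: pruned}

raw = {"England": {"children": [
    {"South East": {"children": ["Oxford", "Reading"]}},
    {"North West": {"children": ["Liverpool"]}},
]}}
print(prune(raw))
# {'England': {'South East': {'Oxford': {}, 'Reading': {}},
#              'North West': {'Liverpool': {}}}}
```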

Once we have the data as a Python dictionary, we can export it as JSON, or view it in a navigable way using a tree widget. Again, I’m quite surprised that there aren’t a couple of well proven off-the-shelf tools for doing this. The example below is taken pretty much from a StackOverflow answer:

Example of pruned JSON, visualised via an interactive tree widget.

So, a quick look at hierarchical data in a “flat” data table. The structure is there if you know how to see it. And there are various ways of rendering the data so you can start to “see” the structure in a very explicit way. That might not be the same as seeing the consequences, or “affordances”, of having the data structured that way — you may still not grok what it means or makes possible for the data to be structured in that way — but it’s a step on the journey to becoming one with that beautiful medium that is “data”.

PS one of the unprojects I’ve wanted to do for a long time is a proper look at the aesthetics of data. I think this is an example of what I mean by that: the ability to see the structure of a dataset, or imagine (I’m not sure that visualise is right, because it’s not necessarily visual: it’s more than that) the sorts of structural relations that hold between data elements, and the consequences of those relations, not just at the local level but also across the dataset as a whole. Which is where macroscopes come in again.

VS Code as an Integrated, Extensible Authoring Environment for Rich Media Asset Creation

One of the things I’m a bit conflicted about in the context of asset creation tools is the extent to which they should be scripted. The examples I started pulling together in my Subject Matter Notebooks demos (an effort which has stalled again… :-() are essentially scripted: the asset generating examples are generated from script. But there are other ways of creating assets within a Jupyter environment, using interactive editors such as the drawio editor:

One of the concerns I have with tools like this in the JupyterLab context is that you can start to step away from the narrated description of how an asset was created. You can see this even more clearly in things like the plotly chart editor:


Yes, you can use this tool to create a diagram or source file from which you can generate a png or interactive chart asset, for example. And yes, you can edit that source file via the UI. But the linear, literate, narrated construction of the asset is lost.

But that’s a debate for another post… This post is about VS Code.

Just like a Jupyter user interface environment can be used as a rich (generative) interactive asset authoring and display environment (for example, OpenJALE, my Open Jupyter Authoring and Learning Environment demo), so too can VS Code. And with a much larger developer community, VS Code currently provides support for a wider range of tools than the Jupyter ecosystem, and competition between them that drives further improvement.

So here’s a quick review of just some of the tools that VS Code already integrates via extensions that support the authoring of rich media (I’ll review various extensions that support teaching computing related subjects in another post).

As we’ve mentioned already, in the context of its availability as a JupyterLab extension, let’s start there: the hediet.vscode-drawio extension provides .drawio file sensitivity in the form of an embedded drawio editor, provided in much the same way as the JupyterLab extension. I don’t know if the JupyterLab extension works in a collaborative mode, but the VS Code extension claims to, via VS Code Liveshare. Another handy feature is the ability to edit the diagram and its XML source code side by side: edit the diagram and the XML updates; edit the XML and the diagram updates. (Does this approach also work in JupyterLab? I guess it should if the diagram is reactive to changes in the source file?)

When it comes to editing markdown files, a markdown side-by-side previewer is offered natively, with support for things like embedded LaTeX maths expressions provided by simple extensions such as goessner.mdmath, or with SMILES chemical molecule rendering support too, mathpix.vscode-mathpix-markdown, or as part of more comprehensive extensions such as shd101wyy.markdown-preview-enhanced. The latter (docs) adds support for maths equations, PDF generation, scripted diagram generation (including flowchart.js, Mermaid, PlantUML, Graphviz, Vega/Vega-Lite, Ditaa (ASCII art)). The extension can also execute code and render code outputs into the document (cf. Rmd in RStudio).

Other extensions that provide additional support for markdown editing include spellchecking (for example, swyphcosmo.spellchecker), linting (for example, DavidAnson.vscode-markdownlint), pasting images into a markdown document from the clipboard (for example, telesoho.vscode-markdown-paste-image) and standalone diagram generating extensions (for example, gebv.pikchr (pikchr syntax editing and previews; render server required) or qiqiworld.vscode-markdown-tinymind (mindmaps); SimonSiefke.svg-preview for SVG previews; vstirbu.vscode-mermaid-preview for mermaid.js diagrams; several previewers for the Graphviz/dot language, including tintinweb.graphviz-interactive-preview; the markdown-preview-enhanced extension seems to cover several of these). I haven’t found a tikz renderer/previewer extension yet, although there are lots of LaTeX related extensions, including the James-Yu.latex-workshop extension, which I notice supports some tikz snippets, so maybe it does do tikz diagram rendering more generally too.

Support is also available (currently in the VS Code insiders preview edition) for Jupytext (donjayamanne.vscode-jupytext), which will provide VS Code notebook style editing of markdown documents.

Musical scores / music typesetting in VS Code is also available using extensions such as lhl2617.VSLilyPond (a LilyPond wrapper), as well as extensions supporting ABC notation.


I haven’t spotted any extensions yet to support the creation and editing of electronic circuit diagram schematics.

3D model previews are available using extensions such as slevesque.vscode-3dviewer or michead.vscode-mesh-viewer. (In terms of image viewers, I haven’t found a FITS image viewer extension yet, or extensions that provide lookup and previewing of lidar or satellite imagery. Maybe the astronomers et al. don’t do VS Code?)

Developer RandomFractalsInc publishes an interesting set of extensions that support interactive exploration of data and maps: the RandomFractalsInc.vscode-data-preview extension provides exploratory charting, the RandomFractalsInc.vscode-vega-viewer and RandomFractalsInc.vscode-chartjs extensions provide previews of Vega and Chart.js charts respectively, and the RandomFractalsInc.geo-data-viewer provides support for rendering a range of geo data formats including GeoJSON, TopoJSON, KML, GPX, and shapefiles. Other map preview extensions include jumpinjackie.vscode-map-preview. (I haven’t spotted a 3D map/elevation raster previewer yet.)

Scoping Out JupyterLab Extensions

Over the years, we’ve added quite a few classic notebook extensions to the mix in the environment we distribute to students in our data management and analysis course.

One of the things that has put me off moving to JupyterLab has been the increased complexity in developing extensions, both in terms of the typescript source and the need to familiarise myself with the complex and bewildering JupyterLab framework API.

But if we are to move to JupyterLab components, or if I want to try to make more use of JupyterLite, we are going to need to find some JupyterLab extension equivalents to the extensions we currently use in classic notebooks; or find someone to create them. And on current form, given the chances of the latter are near zero, that means I need to update the extensions myself. So this post is a quick review of already available JupyterLab extensions that might serve as equivalents of the extensions we currently use, or that I could use as cribs for my own extensions.

“Official” cookie cutter and example extension repos are available: jupyterlab/extension-cookiecutter-ts, jupyterlab/extension-examples. The examples repos are filled with dozens and dozens of developer cruft files, and it’s not clear which are actually required; of those that are required, whether they are/can be/should be automatically generated (and how); or whether non-essential files can break things if they are left lying around and not correct. This is confusing, and a massive blocker to me trying to get started from a position of wanting to have as little to do as possible with formal developer IDEs, build crap, test crap, etc etc. Productivity tool support for “proper devs” in “proper IDEs” is all very well for “proper devs”; but for the rest of us, it’s a hostile “you’re not welcome” signal (think: hostile architecture) and goes very much against a minimum viable example principle where only the bare essentials are included in the repo…

I’m not sure what extension, if any, is powering the cell tags, which appear to be accessible only via the settings gear in the right hand margin tab list:

If you click on a tag, it seems to disappear, which is a usability nightmare, in my opinion…

Collapsible_Headings: JupyterLab equivalent of classic notebook nbextensions/collapsible_headings extension; note that the value assigned to heading_collapsed metadata is *not* the same in the JupyterLab and classic extensions. In the classic extension, the assignment is to the boolean true; in the JupyterLab extension, the assignment is to the string "true". See this related issue requesting metadata parity across extensions.
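Until there is metadata parity, one workaround is to normalise the metadata when moving notebooks between environments. Here’s a sketch that coerces everything to the classic extension’s boolean form, run over the raw notebook JSON; the normalise_collapsed helper name is made up:

```python
def normalise_collapsed(nb):
    """Coerce heading_collapsed metadata to a boolean, whether it was
    written as the boolean true (classic) or the string "true" (JupyterLab)."""
    for cell in nb.get("cells", []):
        meta = cell.get("metadata", {})
        if "heading_collapsed" in meta:
            meta["heading_collapsed"] = str(meta["heading_collapsed"]).lower() == "true"
    return nb

# A toy notebook dict with both variants of the metadata value.
nb = {"cells": [{"metadata": {"heading_collapsed": "true"}},
                {"metadata": {"heading_collapsed": True}},
                {"metadata": {}}]}
print(normalise_collapsed(nb)["cells"][0]["metadata"])  # {'heading_collapsed': True}
```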

jupyterlab-skip-traceback: JupyterLab version of classic notebook nbextensions/skip-traceback extension for collapsing error messages beneath an error message name header; dozens of files in the repo, no idea which are necessary and which are just stuff;

jupyterlab-system-monitor: “display system information (memory and cpu usage)”, a bit like the classic notebook jupyter-resource-usage (nbresuse as was?) extension;

jupyterlab-execute-time: display cell execution times, similar to classic notebook nbextensions/execute_time extension;

spellchecker: “highlight misspelled words in markdown cells within notebooks and in the text files”, cf. classic notebook nbextensions/spellchecker extension;

jupyterlab_code_formatter (docs): format one or more code cells; cf. classic notebook nbextensions/code_prettify extension;

jupyterlab-cell-flash: “show a flash effect when a cell is executed”; this could be a useful crib for replicating some of the cell run status indicators that are provided for the classic notebook UI by nb_cell_execution_status (pending, running, completed cell activity).

clear-cell-outputs: what looks like a simple extension to clear all cell outputs from a toolbar; of the dozens of files in the repo, I’m not sure what the smallest subset you actually need to get this to build/install actually is. The demo shows outputs in an empty notebook being cleared, so I have no idea if it actually does anything. This might be useful as a crib for: adding a toolbar button; iterating through code cells; clearing cell output;

jupyter-scribe: “transforms Markdown cells into rich-text-editing cells, powered by ProseMirror”; cf. classic notebook jupyter-wysiwyg or livemdpreview extensions.

jlab-hide-code: very old (does it still work, even?) extension to provide two toolbar buttons to hide/unhide all code cells; cf. classic notebook nbextensions/hide_input_all extension;

jupyterlab-hide-code: provides a JupyterLab toolbar button “to run the code cells and then to hide the code cells”;

jupyterlab-show-cell-tags: show cell tags within notebook UI;

jupyterlab-codecellbtn: add a run button to the footer of each code cell;

jlab-enhanced-cell-toolbar: enhance cell toolbar with a toolbar for selected cells allowing cell type selection tool, code cell run button, tag display and tag editor tools;

jupyterlab-custom-css (about; not updated to JupyterLab 3.0?): add custom css rules in the settings in the Advanced Settings Editor;

visual-tags: “more easily choose which cells get executed”, apparently, but the README is just boilerplate cruft and I can’t see what this does (if anything), how to do it, etc etc.

jupyterlab_templates: create a new notebook from a notebook template.

There’s nothing I’ve found yet that provides anything like the nb_extension_empinken or nb_extension_tagstyler extensions for styling cells based on tags (if you know of an example, please suggest it via the comments).

There have been a couple of aborted(?)/stalled attempts to add DOM attributes based on tags (eg here and as originally referenced here) but this enabling feature never seems to get enough traction to make it as a merged PR. There is also a community contributed celltag2dom, but it also looks like it may be unmaintained/stale (IIRC, I couldn’t make sense of the code, and as with many extensions presumably built from examples, there’s a stack of files and I have no idea which are necessary, what any of them do, or if any of the essentially superfluous ones can break things if they are left lying around but not right, etc etc.).

Appropriate/ing Knowledge and Belief Tools?

A few weeks ago, I got my invite to the OpenAI GPT-3 beta (Terms of Engagement With the OpenAI API). You know the one: give it a start phrase and it will then start to generate follow on text for you. The phrase can be anything:

It was a cold, grey, morning. I picked up the chain saw and headed off into the town.

Wasn’t long before I was joined by a couple of men I knew from the pub. We’d worked together for a few years before my accident, and they were just pleased to see me back at work.

It was enjoyable, once I’d got used to the hard physical work. I knew I was in good shape, and the cold air made me feel even better.

My arms and legs were a little stiff, but nothing that wouldn’t soon disappear. I’d stopped taking painkillers a couple of months after getting the cast off, and hadn’t had any problems since.

It had been almost three months since I’d been at work. It was good to be back.

We were working on the new school, which had been burnt down a few months before. It was almost complete, and all that was left was to do the roofing and masonry work.

The owner of the building company was a nice man, with a sense of humour. He placed a ladder against the outside wall for us to stand on while we worked on the roof. I knew it was safe as long as he was holding it.

We had almost finished, when we heard shouting from the ground. We turned around to see the boss waving his arms, and we realised that the ladder had slipped from under one of the men. He was dangling from the top rung

OpenAI machine generated text

Or from a slightly different start:

It was a cold, grey, morning. I mopped up the blood, picked up the chain saw, and headed off into the town.

My face and head were covered in blood, so I had to drive with the window down. The blood was beginning to dry and flake. I didn’t want anyone to see me, so I didn’t stop.

I pulled into the parking lot of a 7-Eleven. I went inside, pulled out a newspaper, and sat down at a table by the window.

“Joe,” the guy behind the counter called out.

“Damn…that’s me,” I thought. I looked up, and was relieved it was just him. I gave him a nod.

“Hey, Joe, you okay?”

There was no point in lying to him.

“No. Not really,” I said.

“Uh, what happened?”

I’d been here before. If I had a different face, that would have been the end of it. But I had blood on my face.

“You need to go to the hospital?”

He was getting nervous.

I shook my head and said, “No. It’s nothing. Just a fight with my wife.”

He smiled, relieved. He went back behind the counter. I continued reading the paper. A few minutes later, he came back over and stood by my table

OpenAI machine generated text

(Hmmm… product placement opportunities to help pay for these services and keep them “free”?!)

Now, the above may or may not be “plagiarised” text, lifted directly from a single source or recombined from multiple trash fiction sources. The second piece is perhaps interesting because it makes use of dialogue. I can’t help but wonder what the surrealists would have made of this, how they would have seen it as a mechanical riff on both the exquisite corpse and automatic writing: an automated “fetch the next word” algorithm built on top of a machine-based hallucination engine wrapping the GPT-3 model, itself constructed from an exquisite and almost unimaginable text corpus. (Yes, it was me that wrote that; not the machine.)
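As a toy illustration of that “fetch the next word” idea (and nothing like GPT-3’s actual transformer machinery, of course), a simple bigram chain trained on a scrap of text freewheels in the same autoregressive spirit:

```python
import random
from collections import defaultdict

def train_bigrams(text):
    # Map each word to the list of words observed following it.
    words = text.split()
    follows = defaultdict(list)
    for a, b in zip(words, words[1:]):
        follows[a].append(b)
    return follows

def generate(follows, start, n=10, seed=42):
    # The autoregressive loop: repeatedly fetch a plausible next word
    # conditioned only on the current word.
    rng = random.Random(seed)
    out = [start]
    for _ in range(n):
        candidates = follows.get(out[-1])
        if not candidates:
            break
        out.append(rng.choice(candidates))
    return " ".join(out)

model = train_bigrams("it was a cold grey morning it was good to be back")
print(generate(model, "it"))
```

The point being that the generation loop is trivial; all the interesting (and problematic) behaviour lives in the model it samples from.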

So the attribution of the previous text may or may not be suspect. The text may or may not be original. We have no way of knowing.

But consider the following example, a declarative rephrasing of a simple recall question — what are Asimov’s Three laws of Robotics?— that we have asked year on year in an introductory module:

Asimov’s Three Laws of Robotics are:

The first law: A robot may not injure a human being or, through inaction, allow a human being to come to harm. The second law: A robot must obey the orders given it by human beings except where such orders would conflict with the First Law. The third law: A robot must protect its own existence as long as such protection does not conflict with the First or Second Laws.

The laws are implied in many of Asimov’s Robot books, but they are first stated explicitly in The Robots of Dawn. Other issues are also raised in the series. In later books, Asimov added a “zeroth law” of robotics: “A robot may not harm humanity, or, by inaction, allow humanity to come to harm”; this was introduced in Robots and Empire and again in The Caves of Steel.

The first three laws are similar to “The Three Laws of Eddorian Robotics” appearing in Eddore from the Lensman series by E. E. Smith. However, Smith’s laws are not a fundamental part of his future history, but are instead a throwaway item. Smith stated that his laws were merely a literary device to be used in his stories, and that they do not reflect any deep philosophical idea.

(More product placement?!)

In this case, it seems as if we get a factful response. Compare, for example, the result of typing the interrogative form of the statement into the Google web search engine:

In this case, we get a similar response, but with some form of attribution.

So, OpenAI can perhaps be used as a knowledge engine. But how trustworthy is it?

In formal logics of knowledge and belief, a distinction is often made between a belief statement, which the system holds and uses to reason about the world, and a knowledge statement, which the system also holds and uses to reason about the world, but which is additionally true in and of the world.

In some respects, we might see Google as attempting to perform as a knowledge engine, returning facts that are true, and OpenAI as a belief engine, freewheeling a response based on what it’s just heard and what it’s heard others say before.
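A toy rendering of that distinction (the propositions are my own illustrations; the generated text’s claim that the laws were first stated explicitly in The Robots of Dawn is in fact wrong — they first appear explicitly in the 1942 short story “Runaround”):

```python
# The "world" records which propositions are actually true; a system's
# beliefs only count as knowledge when they are also true of the world.
world = {
    "asimov wrote three laws of robotics": True,
    # The generated text above asserts this one, but it is false:
    # the laws were first stated explicitly in "Runaround" (1942).
    "laws first stated explicitly in the robots of dawn": False,
}

beliefs = set(world)  # the belief engine happily holds both statements

knowledge = {p for p in beliefs if world[p]}  # only true beliefs survive
print(knowledge)
```

On this reading, GPT-3’s fluent output is all beliefs, with no mechanism for checking them against a world.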

Let’s consider another example, the start of a declarative statement that we might be offered in response to an assessment question asked in a simple databases module, for example:

The main differences between relational databases and document databases are

that document databases are:

Not relational. This means that relationships between data entities are not enforced in a strict sense. Document databases are designed to store and retrieve individual documents, not rows in a table.

Non-normalized. Document databases do not store data in multiple tables that are related by key fields. Instead, all the data for a given entity is stored in a single document.

In addition, there is no enforced integrity or consistency between entities in a document database.

Not ACID-compliant. ACID stands for Atomicity, Consistency, Isolation, Durability. These are the four basic properties of a transaction and are used to guarantee data integrity in a database.

Because document databases do not enforce relationships between entities, they do not meet the ACID test and thus are not transactional.

However, document databases address the same three basic CRUD (Create, Read, Update, Delete) scenarios as other non-relational databases.

They can also be queried and filtered using SQL queries and can scale to hundreds of nodes with a single database cluster.

Also, they can be hosted in a variety of cloud environments, including Amazon Web Services (AWS).

For more on why I refer to document databases as NoSQL databases, see this page on Wikipedia.

OpenAI machine generated text from initial prompt

(And even more product placement?! Is this gonna be a whole new SEO style battleground?!)

So… we get an answer, of a sort. (Computing academics are welcome to score the above answer out of 5 and submit marks as comments to this post…). One thing I note in particular is the reference (in text) to Wikipedia, but no link or page reference. At that point in the response, I had limited the engine to producing a certain number of words, so who knows whether a link or page reference would have been generated as the next item. (I should have let it play on, I guess…)

One might also wonder what other Coleridgian interruptions our automatic writing engine might experience…?

It’s not just text that the models will generate. A recent announcement from Github (owned by Microsoft) and OpenAI introduces Github Copilot, “a new AI pair programmer that helps you write better code” which claims to be able to let you “write a comment describing the logic you want, and let GitHub Copilot assemble the code for you”, “let GitHub Copilot suggest tests that match your implementation code”, and let “GitHub Copilot show you a list of solutions [so you can] evaluate a few different approaches”.

In passing, I note an interesting UI feature in highlighting the latter example, a nudge, literally: a not-button is nudged, enticing you to click it, and if you do, you’re presented with another example:

The code as it currently stands is based on a model trained on submissions to Github. My immediate thought was: is it possible to licence code in a way that forbids its inclusion in machine learning/AI training sets (or will it be a condition of use of Github that public code repos, at least, must hand over the right for the code, and diffs, and commit comments to be used for machine training)? Another observation I saw several folk make on the Twitterz was whether we’ll start to see folk deliberately putting bad code or exploit code into Github in an attempt to try to pollute the model. As a quality check, I wondered what would happen if every Stack Overflow question were provided with a machine generated answer based on OpenAI generated text and Copilot generated code, with upvotes and downvotes then used as an error/training signal. Then @ultrazool/Jo pointed out that a training signal can already be generated from suggested code that later appears in a git commit, presumably as a vote of confidence. We are so f****d.

It’s also interesting to ponder how this fits into higher education. In the maths and sciences, there are a wide range of tools that support productivity and correctness. If you want a solution to, or the steps in a proof of, a mathematical or engineering equation, Wolfram Alpha will do it for you. Now, it seems, if you want an answer to a simple code question, Copilot will offer a range of solutions for your delectation and delight.

At this point, it’s maybe worth noting that code reuse is an essential part of coding practice, reusing code fragments you have found useful (and perhaps then adding them to the language in the form of code packages on PyPi), as for example described in this 2020 arXiv preprint on Code Duplication and Reuse in Jupyter Notebooks.

So when it comes to assessment, what are we to do: should we create assessments that allow learners to use knowledge and productivity tools, or should we be constraining them to do their own work, ex- of using the mechanical (Wolfram Alpha?) or statistical-mechanical (OpenAI) support tools of the contemporary knowledge worker?

PLAGIARISM WARNING – the use of assessment help services and websites

The work that you submit for any assessment/exam on any module should be your own. Submitting work produced by or with another person, or a web service or an automated system, as if it is your own is cheating. It is strictly forbidden by the University.

You should not:

– provide any assessment question to a website, online service, social media platform or any individual or organisation, as this is an infringement of copyright.

– request answers or solutions to an assessment question on any website, via an online service or social media platform, or from any individual or organisation.

– use an automated system (other than one prescribed by the module) to obtain answers or solutions to an assessment question and submit the output as your own work.

– discuss exam questions with any other person, including your tutor.

The University actively monitors websites, online services and social media platforms for answers and solutions to assessment questions, and for assessment questions posted by students.

A student who is found to have posted a question or answer to a website, online service or social media platform and/or to have used any resulting, or otherwise obtained, output as if it is their own work has committed a disciplinary offence under Section SD 1.2 of our Code of Practice for Student Discipline. This means the academic reputation and integrity of the University has been undermined.

And when it comes to the tools, how should we view things like OpenAI and Copilot? Should we regard them as belief engines, rather than knowledge engines, and if so how should we then interact with them? Should we be starting to familiarise ourselves with the techniques described in Automatic Detection of Machine Generated Text: A Critical Survey, or is that being unnecessarily prejudiced against the machine?

In skimming the OpenAI docs [Answer questions guide], one of the ways of using OpenAI is as “a dedicated question-answering endpoint useful for applications that require high accuracy text generations based on sources of truth like company documentation and knowledge bases”. The “knowledge” is provided as “additional context” uploaded via additional documents that can be used to top up the model. The following code fragment jumped out at me though:

{"text": "puppy A is happy", "metadata": "emotional state of puppy A"}
{"text": "puppy B is sad", "metadata": "emotional state of puppy B"}

The data is not added as a structured data object, such as `{subject: A, type: puppy, state: happy}`; it is added as a text sentence.

Anyone who has looked at problem solving strategies as a general approach in any domain is probably familiar with the idea that the way you represent a problem can make it easier (or harder) to solve. In many computing (and data) tasks, solutions are often easier if you represent the problem in a very particular, structured way. The semantics are essentially mapped to syntax, so if you get the syntax right, the semantics follow. But here we have an example of taking structured data and mapping it into natural language, where it is presumably added to the model, but with added weight to be applied in recall?
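The structured-to-sentence flattening is easy enough to sketch, using the puppy example from the docs (the field names and sentence template here are my own guesses at what such a mapping might look like):

```python
def record_to_text(record):
    # Flatten a structured record into the kind of natural language
    # sentence the question answering endpoint is fed, plus metadata.
    return {
        "text": f"{record['type']} {record['subject']} is {record['state']}",
        "metadata": f"emotional state of {record['type']} {record['subject']}",
    }

records = [
    {"subject": "A", "type": "puppy", "state": "happy"},
    {"subject": "B", "type": "puppy", "state": "sad"},
]
docs = [record_to_text(r) for r in records]
```

Which is to say: the database row becomes a sentence, and the query becomes free text search over sentences.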

This puts me in mind of a couple of other things:

  • a presentation by someone from Narrative Science or Automated Insights (I forget which) many years ago, commenting on how one use for data-to-text engines was to generate text sentences from every row of data in a database so that it could then be searched using a normal text search engine, rather than having to write a database query;
  • the use of image based representations in a lot of machine learning applications. For example, if you want to analyse an audio waveform, whose raw natural representation is a set of time ordered amplitude values, one way of presenting it to a machine learning system is to re-present it as a spectrogram, a two dimensional image with time along the x-axis and a depiction of the power of each frequency component along the y-axis.
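A minimal spectrogram along those lines, slicing the signal into overlapping frames and taking the FFT magnitude of each windowed frame, needs only a few lines of numpy:

```python
import numpy as np

def spectrogram(signal, frame_size=256, hop=128):
    # Slice the signal into overlapping frames, apply a Hann window,
    # then take the FFT magnitude of each frame: rows are time steps,
    # columns are frequency bins (bin width = sample_rate / frame_size).
    frames = [signal[i:i + frame_size]
              for i in range(0, len(signal) - frame_size + 1, hop)]
    return np.abs(np.fft.rfft(np.array(frames) * np.hanning(frame_size), axis=1))

# One second of a 440 Hz tone sampled at 8 kHz...
t = np.arange(8000) / 8000
spec = spectrogram(np.sin(2 * np.pi * 440 * t))
# ...shows its energy concentrated in the bins nearest 440 Hz,
# i.e. around bin 440 / (8000 / 256) ≈ 14.
```

The resulting 2D array can then be treated as an image and thrown at any off-the-shelf image classifier.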

It seems as if everyday we are moving away from mechanical algorithms to statistical-mechanical algorithms — AI systems are stereotype engines, with prejudiced beliefs based on the biased data they are trained on — embedded in rigid mechanical processes (the computer says: “no”). So. Completely. F****d.

Tinkering With Selenium IDE: Downloading Multiple Files from One Page, Keyed by Another

It being the third marking time of year again, I get to enjoy the delights of having to use various institutional systems to access marks and student scripts. One system stores the marks and allocates me my third marking tasks (a table of student IDs with links to their marks, the marks themselves, and a form to submit my marks). Another looks after the scripts.

To access the scripts, I need to go to another system and enter the course code and student ID, one at a time (hmm, what happens if I try a list of IDs with various separators; could that give me multiple files?), to open a pop-up window from which I can click to collect a zipped file containing the student’s submitted work. The file is downloaded as a zip file with a filename of the form ; which is to say, a filename based on the datetime.

To update a set of marks, I need to get a verification code from the pop up raised after entering the student ID on the second system into a form on the page associated with a particular student’s marks on the first system. Presumably, the thinking about workflow went something like: third marker looks at marks on first system, copies ID, gets script and code from second system, marks script, enters code from second system in first system, updates mark. For however many scripts you need to mark. One at a time. Rather than: download every script one at a time, do marking howsoever, then have to juggle both systems trying to figure out the confirmation code for a particular student to update the marks from a list you’ve scribbled onto a piece of paper against their ID (is that a 2 or a 7?). Or whatever.

Needless to say, several years ago I hacked a mechanicalsoup Python script to look up my assigned marking on the first system, along with the first and second marks, download all the scripts and confirmation codes from the second system, unzip the student script downloads and bundle everything into a directory tree. I also hacked some marking support tools that would display how the markers compared on each of the five marking criteria they scored scripts against and allow me to record my marks. I held off from automating the upload of marks back to the system and kept that as a manual step because I don’t want to get into the habit of hacking code to write to university systems just in case I mess something up… I did try to present my workflow and tools to exams and various others by sharing a Powerpoint review of it, but as I recall never got any reply.

Anyway… mechanicalsoup. A handy package combining mechanize and beautifulsoup: the first part mocks a browser and allows you to automate it, and the second part provides the scraping utilities. But mechanize doesn’t do Javascript. Which was fine, because the marks and scripts systems are old, old HTML and easily scraped: pretty vanilla tables and web forms. And the old OU auth was pretty simple to automate your way through too.

But the new OU authenticator uses Javascript goodness(?) as part of its handshake so my timesaver third marking scraping tools are borked because I can’t get through the auth.

So: time to play with Selenium, which is a complete browser automation tool that automates an off-the-shelf browser (Chrome, or Firefox, or Safari etc) rather than mocking one up (as per mechanicalsoup). Intended as a tool for automated testing of websites, you can also use it as a general purpose automation tool, or to provide browser automation for screenscraping. I’ve tinkered with Selenium before, scripting it from Python to automate repetitive tasks (eg Bulk Jupyter Notebook Uploads to nbgallery Using Selenium) but there’s also a browser extension / Selenium IDE that lets you record steps as you work through a series of actions in a live website, as well as scripting in your own additional steps.

So: how hard can it be, I thought, to record a quick script to automate the lookup of student IDs and then step through each one? Surprisingly faffy, as it turns out. The first issue was simply how to iterate through the rows of the table containing each individual student reference to pick up the student ID.

The method I ended up with was to get a count of rows in the table, then iterate through each row, picking up the student ID as link text (of the form STUDENT_ID STUDENT NAME), duly cleaned by splitting on the first space and grabbing the first element, and then manually building a string of delimited IDs STUDENT_ID1::STUDENT_ID2::... . (I couldn’t seem to add IDs to an ID array, but I was maybe doing something wrong… And trying to find any sensible docs on getting stuff done using the current IDE seems to be a largely pointless task.)
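In a Python-scripted version, the cleaning step at least is trivial (the link text format is as described above; the function name and sample data are mine, and in a real Selenium-from-Python script the link texts would come from something like `driver.find_elements()` rather than a hardwired list):

```python
def extract_student_id(link_text):
    # Link text has the form "STUDENT_ID STUDENT NAME": split on the
    # first space and keep the first element.
    return link_text.split(" ", 1)[0]

# Faked link texts standing in for the scraped table rows.
links = ["X1234567 A. Student", "X7654321 B. Learner"]

# Build the :: delimited ID string the IDE script constructed by hand.
ids_string = "::".join(extract_student_id(t) for t in links)
print(ids_string)  # X1234567::X7654321
```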

So, I now have a list of IDs, which means I can (automatically) click through the script download system and grab the scripts one at a time. Remember, this involves adding a course code and a student identifier, clicking a button to get a pop up, clicking a button to zip and download the student files, then closing the pop up.

Here’s the first part – entering the course code and student ID:

In the step that opens the new window, we need to flag that a new window has been opened and and generate a reference to it:

In the pop-up, we can then click the collect button, wait a moment for the download to start, then close the pop-up and return to the window where we enter the course code and student ID:
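Scripted from Python, that pop-up dance might look something like the following sketch, assuming an already authenticated Selenium WebDriver instance `driver` created elsewhere (e.g. via `webdriver.Firefox()`); the element ids are hypothetical placeholders for whatever the real pages use:

```python
import time

def collect_script(driver, course, student_id):
    # driver is an authenticated Selenium WebDriver; the element ids
    # below are made up and would need replacing with the real ones.
    main = driver.current_window_handle
    driver.find_element("id", "course").send_keys(course)
    driver.find_element("id", "studentid").send_keys(student_id)
    driver.find_element("id", "lookup").click()
    # Switch to the newly opened pop-up window...
    popup = next(h for h in driver.window_handles if h != main)
    driver.switch_to.window(popup)
    # ...click to collect the zipped scripts, give the download a
    # moment to start, then close the pop-up and switch back.
    driver.find_element("id", "collect").click()
    time.sleep(1)
    driver.close()
    driver.switch_to.window(main)
```

Looping that over the scraped ID list would replicate what the IDE script does, but with the scraped state available to Python for the later steps.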

If I now run the script on a browser where I’m already logged in (so the browser already has auth cookies set), I can just sit back and watch it grab the student IDs from my work allocation table on the first system to generate a list of IDs I need scripts for, and then download each one from the second system.

So I have the scripts, but as a set of uselessly named zip files (some of them duplicates); and I don’t have the first and second marks scraped from the first system, or the confirmation codes from the second system. To perform those steps, I probably do need a Python script automating the Selenium actions. The Selenium IDE is fine (ish) for filling in forms with simple scraped state and then clicking buttons that act on those values, but for scraping it’s not really appropriate.

Whilst the Selenium IDE doesn’t export Python code, it does produce an export JSON file that itemises the steps in scripts created in the IDE. This could be used to help bootstrap the production of Python code. The Selenium IDE recorder provides a way of recording simple pointy-clicky sequences of actions which could be really useful to help get those scripts going. But ideally, I need a thing that can replay the JSON exported scripts from Python; then I could have the best of both worlds.
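The exported .side file is just JSON, so pulling out the command/target/value triples as a first step towards generating Python is straightforward enough; the structure below reflects my reading of the .side format (each test bundles a list of command steps), so treat it as an assumption rather than gospel:

```python
import json

def side_commands(side):
    # Flatten every test in a .side export into a list of
    # (command, target, value) step triples.
    return [(c["command"], c["target"], c.get("value", ""))
            for test in side.get("tests", [])
            for c in test.get("commands", [])]

example = json.loads("""{
  "tests": [{"name": "grab script", "commands": [
    {"command": "type", "target": "id=studentid", "value": "X1234567"},
    {"command": "click", "target": "id=lookup", "value": ""}
  ]}]
}""")
print(side_commands(example))
```

Mapping each command name onto the corresponding WebDriver call is then “just” a dispatch table, albeit one with a long tail of locator strategies to handle.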

Finally, in terms of design pattern, this recipe doesn’t include any steps that interact directly with logging in: the automated browser uses cookies that have been set previously elsewhere. Rather, the script automates actions over a previously logged in browser. Which means this sort of script is something that could be easily shared within an organisation? So I wonder, are there orgs in which the core systems don’t play well but skunkworks and informal channels share automation scripts that do integrate them, ish?!

(Hmm… would the Python scripted version load a browser with auth cookies set, or does it load into a private browser in which authentication would be required?)

Bah… I really should be marking, not tinkering…

PS It looks like you can export to a particular language script:

…but when I try it I get an error message regarding an Unknown locator:

Fragment: Remarkable Error Generation

It’s that time of year again when the vagaries of marking assert themselves and we get to third mark end of course assessment scripts where the first two markers award significantly different marks or their marks straddle the pass/fail boundary.

There is a fiddle function available to us where we can “standardise” marks, shifting the mean of particular markers (and maybe the sd?) whose stats suggest they may be particularly harsh or lenient. Ostensibly, standardisation just fixes the distribution; ideally, this is probably something a trained statistician should do; pragmatically, it’s often an attempt to reduce the amount of third marking; intriguingly, we could probably hack some code to “optimise” standardisation to bring everyone into line and reduce third marking to zero; but we tend to avoid that process, leaving the raw marks in all their glory.
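For what it’s worth, a naive sketch of what such a standardisation fiddle might look like (the target values are illustrative; a real scheme would presumably be rather more considered than a linear shift-and-scale):

```python
from statistics import mean, stdev

def standardise(marks, target_mean=60.0, target_sd=10.0):
    # Linearly shift and scale one marker's marks so their distribution
    # has the target mean and standard deviation.
    m, s = mean(marks), stdev(marks)
    return [target_mean + (x - m) * target_sd / s for x in marks]

# A hypothetical harsh marker, dragged up towards the target distribution.
harsh_marker = [35, 42, 48, 51, 59]
adjusted = standardise(harsh_marker)
```

Note that this fixes the distribution without saying anything at all about whether any individual mark was fair, which is rather the point of the grumble above.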

I’ve never really understood why we don’t do a post mortem after the final award board and compare the marks awarded by markers in their MK1 (first marker) and MK2 (second marker) roles against the mark finally awarded to a script. This would generate some sort of error signal that module teams, staff tutors and markers could use to see how effective any given marker is at “predicting” the final grade awarded to a script. But we don’t do that. Analytics are purely for applying to learners because it’s their fault. (I often wonder if the learning analytics folk look at marker identity as one of the predictors for a student’s retention and grade; and if there is an effect; or maybe some things are best left under the stone…)
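The sort of error signal I have in mind is easy enough to sketch: the mean signed difference between a marker’s awarded mark and the final agreed mark, negative for harsh, positive for lenient (the records below are entirely made up):

```python
from statistics import mean

def marker_bias(records):
    # records: (marker, mark_awarded, final_agreed_mark) triples.
    # Returns each marker's mean signed error against the final mark.
    diffs = {}
    for marker, awarded, final in records:
        diffs.setdefault(marker, []).append(awarded - final)
    return {marker: mean(d) for marker, d in diffs.items()}

records = [
    ("MK1", 55, 62), ("MK1", 70, 68),
    ("MK2", 80, 71), ("MK2", 65, 60),
]
print(marker_bias(records))  # MK1 slightly harsh, MK2 lenient
```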

Anyway… third marking time.

In a sense, it’s a really useful activity because we get to see a full range of student scripts and get a feel for what they’ve got out of the course.

But the fact that we often get a large disparity between marks does, as ever, raise questions about the reliability of the marks awarded to scripts we don’t third mark (for example, if two harsh or lenient markers mark the same scripts). I’m sure there are ways the numbers could be churned to give some useful and simple insights into individual marker behaviour, rather than the not overly helpful views we’re given over marker distributions. And I wonder, if we just trained a simple text classifier on raw scripts against the final awarded mark, how much it would vary compared to human markers. And maybe one that classifies based on screenshots of the report (a 20 second skim of how a report looks often gives me a sense of which grade boundary it is likely as not to fall in…)

But ours not to reason why…