Fragment – Carillion-ish

A quick sketch of some companies that are linked by common directors based on a list of directors seeded from Carillion PLC.

The data was obtained from the Companies House API using the Python chwrapper package and some old code of my own that I’ll share once I get a chance to strip the extraneous cruft out of the notebook it’s in.

The essence of the approach / recipe is an old one that I used to use with OpenCorporates data, as described here: Mapping corporate networks with Opencorporates.
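Roughly speaking, the recipe is: pull the officers of a seed company, pull each officer’s other appointments, and join companies to directors in a graph. Here’s a minimal sketch of that loop, using requests against the Companies House REST API directly (rather than chwrapper) and networkx; the API key and seed company number are placeholders, and the response field names are as I recall them from the API docs, so treat it as indicative rather than definitive:

import requests
import networkx as nx

API = "https://api.companieshouse.gov.uk"
API_KEY = "YOUR_COMPANIES_HOUSE_API_KEY"  # placeholder - register for your own key
SEED = "SEED_COMPANY_NUMBER"  # placeholder - company number to seed the network from

def get_json(path, **params):
    """Call the Companies House API (HTTP basic auth, key as username)."""
    r = requests.get(API + path, auth=(API_KEY, ""), params=params)
    r.raise_for_status()
    return r.json()

# Build a bipartite director <-> company graph, one hop out from the seed company
G = nx.Graph()
for officer in get_json("/company/{}/officers".format(SEED)).get("items", []):
    director = officer["name"]
    G.add_node(director, kind="director")
    appointments_link = officer.get("links", {}).get("officer", {}).get("appointments")
    if not appointments_link:
        continue
    for appt in get_json(appointments_link).get("items", []):
        company = appt.get("appointed_to", {}).get("company_name")
        if company:
            G.add_node(company, kind="company")
            G.add_edge(director, company)

print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")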

Note the sketch doesn’t make claims about anything much. The edges just show that companies are linked by the same directors.

A better approach may be to generate a network based on control / ownership registration data but I didn’t have any code to hand to do that (it’s on my to do list for my next company data playtime!).

One way of starting to use this sort of structure is to match companies that appear in the network with payments data to see the actual extent of public body contracting with Carillion group companies. For other articles on Carillion contracts, see eg Here’s the data that reveals the scale of Carillion’s big-money government deals.

Potential Issues With Institutionally Mediated Reproducible Research Environments

One of the advantages, for me, of the Jupyter Binderhub environment is that it provides me with a large amount of freedom to create my own computational environment in the context of a potentially managed institutional service.

At the moment, I’m lobbying for an OU hosted version of Binderhub, probably hosted via Azure Kubernetes, for internal use in the first instance. (It would be nice if we could also be part of an open and federated MyBinder provisioning service, but I’m not in control of any budgets.) But in the meantime, I’m using the open MyBinder service (and very appreciative of it, too).

To test the binder builds locally, I use repo2docker, which is also used as part of the Binderhub build process.

What this all means is that I should be able to write – and test – notebooks locally, and know that I’ll be able to run them “institutionally” (eg on Binderhub).

However, one thing I noticed today was that notebooks in a Binder container that had been running okay – and that still builds and runs okay locally – have broken when run through Binderhub.

I think the error is a permissions error around creating temporary directories or writing temporary image files, arising either in the xelatex command-line command used to generate a PDF from the LaTeX script, or in the ImageMagick convert command used to produce an image from that PDF; both are used as part of some IPython magic that renders LaTeX tikz diagram-generating scripts. It certainly affects a couple of my magics. (It might be an issue with the way the magics are defined, too. But whatever the case, it works for me locally but not “institutionally”.)

  • Broken notebook: https://mybinder.org/v2/gh/psychemedia/showntell/maths?filepath=Mechanics.ipynb
  • Magic code: https://github.com/psychemedia/showntell/tree/maths/magics/tikz_magic
  • The error is something to do with the ImageMagick convert command not converting the .pdf to an image; at least one of the issues seems to be that ghostscript is lost somewhere?
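For reference, the pipeline inside the magic boils down to something like the following – a simplified sketch rather than the actual magic code, with illustrative paths and flags:

import os
import subprocess
import tempfile

def tikz_to_png(tikz_doc, out_png="diagram.png"):
    """Render a complete LaTeX/tikz document to a PNG via xelatex and ImageMagick convert."""
    # If the temporary directory isn't writable in the container, or convert
    # can't find ghostscript to rasterise the PDF, this is roughly where
    # things fall over.
    workdir = tempfile.mkdtemp()
    texfile = os.path.join(workdir, "diagram.tex")
    with open(texfile, "w") as f:
        f.write(tikz_doc)
    subprocess.run(["xelatex", "-interaction=nonstopmode",
                    "-output-directory", workdir, texfile], check=True)
    pdffile = os.path.join(workdir, "diagram.pdf")
    subprocess.run(["convert", "-density", "300", pdffile, out_png], check=True)
    return out_png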

So here’s the issue. Whilst the notebooks were running fine in a container generated from an image that was itself presumably created before a Binderhub update, rebuilding the image (potentially without making any changes to the source Github repository) can cause notebooks that were running fine to break.

Which is to say, the environment a repository defines may have a dependency on some of the packages or settings installed by the repo2docker build process itself. (I don’t know if we can fully isolate out these dependencies by using a Dockerfile to define the environment, rather than apt.txt and requirements.txt?)

This raises a couple of questions for me about dependencies:

  • what sort of dependency issues might there be in components or settings introduced by the repo2docker process, and how might we mitigate against these?
  • are there other aspects of the Binderhub process that can produce breaking changes that impact on notebooks running in a repository that specifies a computational environment run via Binderhub?

Institutionally, it also means that updates to an institutionally supported Binderhub service could break the downstream environments (that is, the ones run via that Binderhub) built on top of it.

This is a really good time for this to happen to me, I think, because it gives me more things to think about when considering the case for providing a Binderhub service institutionally.

On the other hand, it means I can’t update any of the other repos that use the tikz or asymptote magic until I find the fix because otherwise they will break too…

Should users of the institutional service, for example, be invited to define test areas in their Binder repositories (for example, using nbval) that the institution can use as test cases when making updates to the institutional service? If errors are detected when the institutional service provider runs its users’ tests, the provider could then explore whether the issue can be addressed by their update strategy, or alert the Binderhub user that there may be breaking changes, along with how to explore what they are or mitigate against them. (That is, perhaps it falls to the institutional provider to centrally explore the likely common repercussions of a particular update and identify fixes to address them?)
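To make that concrete, here’s a hedged sketch of what the provider-side check might look like, using the nbval plugin for pytest to re-execute a designated set of test notebooks inside the rebuilt environment and compare outputs against those saved in the notebooks (the notebooks/tests/ path is just an assumed convention):

import pytest

# --nbval re-runs the notebooks and checks cell outputs against the saved ones;
# --nbval-lax only checks cells explicitly marked for comparison.
exit_code = pytest.main(["--nbval-lax", "notebooks/tests/"])
if exit_code != 0:
    print("Possible breaking change - flag it to the repo owner before rolling out the update")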

For example, there might be dependencies on particular package version numbers. In this case, the user might then either want to update their own code, or add in a build requirement that pins the package back to the desired version. (Institutional providers might have something to say about that if the upgrade was for valid security reasons, though running things in isolation in containers should reduce that risk?) Lists of affected packages could also be circulated to other users using the same packages, along with mitigation strategies for coping with updates to the institutionally provided service.

There are also updating issues associated with a workflow strategy I am exploring around Binderhub, which relates to using “base containers” to seed Binderhub builds (Note On My Emerging Workflow for Working With Binderhub). For example, if a build uses a “latest” tagged base image, any updates to that base image may break things built on top of it. In this case, the update risk to the base container is mitigated by building from a specifically tagged version of the container. However, if an update to the Binderhub environment can break notebooks running on top of a particular tagged base container, the fix for the notebooks may reside in making a fix to the environment in the base container (a fix which specifically acts to enforce a package version, for example). This suggests that the base container might need doubly tagging: one tag paying heed to the downstream end users (“buildForExptXYZ”), the other capturing the upstream Binderhub environment (“BinderhubBuildABC”).

I’m also wondering now about where responsibility arises for maintaining the integrity of the user computing environment (that is, the local computational environment within which code in notebooks should continue to operate once the user has defined their environment). Which is to say, if there are changes to the wider environment that somehow break that local user environment, who should help fix it? If the changes are likely to impact widely, it makes sense to try to fix it once and then share the change, rather than expecting every user suffering from the break to have to find the fix independently?

Also, I’m wondering about classes of error that might arise: for example, ones that can be fixed purely by changing the environment definition (baking package versions into config files, for example, which is probably best practice anyway), and ones that require changes to code in notebooks?

PS Hmm.. noting… are whitelists and blacklists also specifiable in Binderhub config? eg https://github.com/jupyterhub/mybinder.org-deploy/pull/239/files

binderhub:
  extraConfig:
    bans: |
      c.GitHubRepoProvider.banned_specs = [
        '^GITHUBUSER/REPO.*'
      ]

Fragments – Multi-Tasking In Jupyter Notebooks and Postgres in Binderhub

A couple of things that I thought might be handy:

  • multi-tasking in Jupyter notebook cells: nbmultitask; code run in a notebook cell normally blocks that cell until it completes; this extension – and the associated examples.ipynb notebook – show various ways of running non-blocking threads from notebook cells (a minimal sketch of the underlying pattern follows this list).
  • a Binderhub-compliant Dockerfile for running Postgres alongside a Jupyter notebook; I posted a fragment here that demos a connection if you run the repo via Binderhub. It requires the user to start and stop the postgres server, and it also escalates the jovyan user privileges via sudo, which could be handy for TM351 purposes. Now pretty much all I need to see a demo of is running OpenRefine too. I think the RStudio Jupyter proxy route is being generalised to make that possible… and, via @choldgraf, it seems as if there is an issue relating to this: https://github.com/jupyterhub/binder/issues/40.
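As I understand it, the basic non-blocking pattern that packages like nbmultitask wrap up more conveniently is just standard library threading. A minimal sketch (the long_running_task function is purely illustrative):

import threading
import time

def long_running_task():
    # Stand-in for a slow job, e.g. a big download or a long computation
    time.sleep(30)
    print("Task finished")

# Start the job in a background thread; the cell returns immediately,
# so other cells can be run while the job works away.
t = threading.Thread(target=long_running_task)
t.start()

# Later, in another cell, check whether it has finished yet:
t.is_alive()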

TO DO – Automated Production of a Screencast Showing Evolution of a Web Page?

Noting @edsu’s Web histories approach describing Richard Rogers’  Doing Web history with the Internet Archive: screencast documentaries, I wonder how hard it would be to write a simple script that automates the collection of screenshots from a web page timeline in the Wayback Machine and stitches them together in a movie?

It’s not hard to find code fragments describing how to turn a series of image files into a video (for example, Create a video in Python from images or Combine images into a video with Python 3 and OpenCv 3) and you can grab a screenshot from a web page using various web testing frameworks (eg Capturing screenshots of website with Python or Grabbing Screenshots of folium Produced Choropleth Leaflet Maps from Python Code Using Selenium).
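The bones of it might look something like the following untested sketch, which pulls snapshot timestamps for a page from the Wayback Machine CDX API, screenshots each capture with selenium (assuming Firefox / geckodriver is available), and stitches the frames together with OpenCV; the target URL, frame rate and window size are all just placeholder assumptions:

import requests
import cv2
from selenium import webdriver

TARGET = "example.com"  # placeholder page to replay

# 1. Get a list of snapshot timestamps for the page from the CDX API
cdx = requests.get("http://web.archive.org/cdx/search/cdx",
                   params={"url": TARGET, "output": "json",
                           "filter": "statuscode:200", "collapse": "timestamp:8"}).json()
timestamps = [row[1] for row in cdx[1:]]  # first row is the column headers

# 2. Screenshot each archived capture
browser = webdriver.Firefox()
browser.set_window_size(1024, 768)
frames = []
for ts in timestamps:
    browser.get("http://web.archive.org/web/{}/{}".format(ts, TARGET))
    fname = "frame_{}.png".format(ts)
    browser.save_screenshot(fname)
    frames.append(fname)
browser.quit()

# 3. Stitch the screenshots into a movie (2 frames per second)
first = cv2.imread(frames[0])
h, w, _ = first.shape
out = cv2.VideoWriter("timeline.mp4", cv2.VideoWriter_fourcc(*"mp4v"), 2, (w, h))
for fname in frames:
    out.write(cv2.resize(cv2.imread(fname), (w, h)))
out.release()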

So, a half hour hack? But I don’t have half an hour right now:-(

PS Possibly handy related tool for downloading stuff in bulk from Wayback Machine, this Wayback Machine Downloader script.

PPS Hmm… should be easy enough to take a similar approach to creating “Wikipedia Journeys” – grab a random Wikipedia page, snapshot it, follow a random link from within the page subject matter, screenshot that, etc. To simplify matters there, there’s the Python wikipedia package.
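A sketch of the random walk part, using the wikipedia package (the walk length and the way a link is chosen are arbitrary assumptions; the screenshotting step would be the same selenium approach as above):

import random
import wikipedia

def wikipedia_journey(steps=5):
    """Do a random walk over Wikipedia pages, returning the list of page URLs visited."""
    title = wikipedia.random()
    urls = []
    for _ in range(steps):
        try:
            page = wikipedia.page(title)
        except wikipedia.exceptions.DisambiguationError as e:
            page = wikipedia.page(random.choice(e.options))
        urls.append(page.url)  # screenshot each URL here, as per the Wayback sketch above
        title = random.choice(page.links)
    return urls

print(wikipedia_journey())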

Using ThebeLab to Run Python Code Embedded in HTML Via A Jupyter Kernel

If you want to run Python code embedded in an HTML file in your browser, one way of doing it is to use something like Brython, Skulpt or PyPy.js, which convert your Python code into JavaScript and then run that JavaScript in the browser.

One of the problems with this approach is the limited availability of Python modules ported into Javascript for use by those packages.

Another route to using Python in the browser is to connect to a remote Python environment. This is the approach taken by online code editors such as PythonAnywhere or repl.it.

A third way to access a Python environment via a web browser is using Jupyter notebooks, but this limits you to using the Jupyter notebook environment, or a display rendered using a Jupyter extension, such as RISE slideshows or appmode.

Several years ago, O’Reilly published a demonstration Jupyter plugin called Thebe that allowed you to write code in an HTML page and then run it against a Jupyter kernel.

I think that example rotted some time ago, but there’s a new candidate in the field in the form of @minrk’s ThebeLab [repo].

Here’s an example of a live (demo) web page, embedding Python code in the HTML that can be executed against a Jupyter kernel:

The way the code is included in the page is similar to the way it was embedded in the original Thebe demo, via a suitably annotated <pre> tag:

One other thing that’s particularly neat is the way the page invokes the required Jupyter kernel – via a Binderhub container:

(You may note you also need to pull in a thebelab Javascript package to make the whole thing work…)

What this means is that you can embed arbitrary code, for an arbitrary language (or at least, as arbitrary as the language kernels supported by Jupyter), running against an arbitrary environment (as specified by the Binder image definition).

The code cells are also editable, which means you can edit the code and run your own code in them. Obviously:

  1. this is great for educators and learners alike because it means you can write – and run – interactive code exercises inline in your online course materials;
  2. rubbish for IT because they’ll be scared about the security implications. (The fact that stuff runs in containers should mitigate some of the “can our network get hacked as a result” concerns, but leaves open the opportunity that someone could use the kernel as a place from which to mount an attack somewhere else. One way many notebook deployments get round this is to block or whitelist the external sites to which requests can be made from inside the kernel. Which can be a pain if you need to access third party sites, eg to download data. But it is maybe less of an issue when running a more constrained activity inline within course materials against a custom kernel environment?)

It’d be great to be able to run something like this as a demonstrator activity in TM112… I just need to put a demo together now… (Which shouldn’t be too hard: the current plan is to use notebooks for the demos, running them from a Binderhub launched environment…)

PS I just did the quickest of quick proofs of concept for myself, taking the demo thebelab html file and adding my own bits of ipython-folium-magic demo code, and hey presto… https://psychemedia.github.io/ipython_magic_folium/ In the demo, try editing the code cell to geolocate your own address, rather than the address of the OU, and re-run that code cell. Or look for other things to try out with the magic as described here.

PPS So now I’m wondering about a ThebeLab HTML output formatter for nbconvert that runs the code in code input cells that are hidden using the Hide Cell notebook extension, in order to render the output from those cells, and writes the code from the unhidden code input cells into <pre> tags for use with ThebeLab?

Fragment – Virtues of a Programmer, With a Note On Web References and Broken URLs

Ish-via @opencorporates, I came across the “Virtues of a Programmer”, referenced from a Wikipedia page, in a Nieman Lab post by Brian Boyer on Hacker Journalism 101, and stated as follows:

  • Laziness: I will do anything to work less.
  • Impatience: The waiting, it makes me crazy.
  • Hubris: I can make this computer do anything.

I can buy into those… Whilst also knowing (from experience) that any of the above can lead to a lot of, erm, learning.

For example, whilst you might think that something is definitely worth automating:

the practical reality may turn out rather differently:

The reference has (currently) disappeared from the Wikipedia page, but we can find it in the Wikipedia page history:

[Screenshot: Larry Wall – Wikipedia (old revision)]

The date of the NiemanLab article was 

[Screenshot: Larry Wall: Revision history – Wikipedia]

So here’s one example of a linked reference to a web resource that we know is subject to change and that has a mechanism for linking to a particular instance of the page.

Academic citation guides tend to suggest that URLs are referenced along with the date that the reference was (last?) accessed by the person citing the reference, but I’m not sure that guidance is given that relates to securing the retrievability of that resource, as it was accessed, at a later date. (I used to bait librarians a lot for not getting digital in general and the web in particular. I think they still don’t…;-)

This is an issue that also hits us with course materials, when links are made to third party references by URI, rather than more indirectly via a DOI.

I’m not sure to what extent the VLE has tools for detecting link rot (certainly, they used to; now it’s more likely that we get broken link reports from students failing to access a particular resource…) or mitigating against broken links.

One of the things I’ve noticed from Wikipedia is that it has a couple of bots for helping maintain link integrity: InternetArchiveBot and Wayback Medic.

Bots help preserve link availability in several ways (a minimal sketch of both steps follows the list):

  • if a link is part of a page, that link can be submitted to an archiving site such as the Wayback machine (or if it’s a UK resource, the UK National Web Archive);
  • if a link is spotted to be broken (header / error code 404), it can be redirected to the archived link.
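By way of that sketch (not the bots’ actual code), both steps can be carried out against public Internet Archive endpoints: the Save Page Now endpoint to request archiving of a live page, and the availability API to find the closest archived copy of one that has broken:

import requests

def archive_url(url):
    """Ask the Wayback Machine's Save Page Now endpoint to archive a live page."""
    return requests.get("https://web.archive.org/save/" + url)

def archived_copy(url):
    """Look up the closest archived snapshot of a URL via the availability API."""
    r = requests.get("https://archive.org/wayback/available", params={"url": url}).json()
    return r.get("archived_snapshots", {}).get("closest", {}).get("url")  # None if nothing archived

def resolve_link(url):
    """Return the original URL if it still works, else fall back to an archived copy."""
    try:
        if requests.head(url, allow_redirects=True, timeout=10).status_code == 404:
            return archived_copy(url)
    except requests.RequestException:
        return archived_copy(url)
    return url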

One of the things I think we could do in the OU is add an attribute to the OU-XML template that points to an “archive-URL”, and tie this in with a service that automatically makes sure that linked pages are archived somewhere.

If a course link rots in presentation, students could be redirected to the archived link, perhaps via a splash screen (“The original resource appears to have disappeared – using the archived link”) as well as informing the course team that the original link is down.

Having access to the original copy can be really helpful when it comes to trying to find out:

  • whether a simple update to the original URL is required (for example, the page still exists in its original form, just at a new location, perhaps because of a site redesign); or,
  • whether a replacement resource needs to be found, in which case, being able to see the content of the original resource can help identify what sort of replacement resource is required.

Does that count as “digital first”, I wonder???

Programming in Jupyter Notebooks, via the Heavy Metal Umlaut

Way back when, I used to take delight in following the creative tech output of Jon Udell, then at InfoWorld. One of the things I fondly remember is his Heavy Metal Umlaut screencast:

You can read about how he put it together via the archived link available from Heavy metal umlaut: the making of the movie.

At some point, I seem to remember a tool became available for replaying the history of a Wikipedia page that let you replay its edits in time (perhaps a Jon Udell production?) Or maybe that’s a false memory?

A bit later, the Memento project started providing tools to allow you to revisit the history of the web using archived pages from the Wayback Machine. You can find the latest incarnation here: Memento Demos.

(Around the time it first appeared, I think Chris Gutteridge built something related? As Time Goes By, It Makes a World of Diff?)

Anyway – the Heavy Metal Umlaut video came to mind this morning as I was pondering different ways of using Jupyter notebooks to write programmes.

Some of my notebooks have things of this form in them, with “finished” functions appearing in the code cells:
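(The original post illustrated this with a screenshot; purely as a hypothetical stand-in, a “finished” cell might just contain something like:)

# A tidied-up, "finished" function, worked out elsewhere and presented whole
def normalise(names):
    """Lowercase a list of name strings and strip stray whitespace."""
    return [n.strip().lower() for n in names]

normalise(["Acme Ltd", "  ACME HOLDINGS LTD "])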

Other notebooks trace the history of the development of a function, from base elements, taking an extreme REPL approach to test each line of code, a line at a time, as I try to work out how to do something. Something a bit more like this:
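(Again, the original showed this as a screenshot; as a hypothetical stand-in, the same function might emerge a line – and a cell – at a time, checking the output at each step:)

# Cell 1 - eyeball a raw value first
names = ["Acme Ltd", "acme trading ltd", "  ACME HOLDINGS LTD "]
names[0]

# Cell 2 - try a cleaning step on a single awkward value
names[2].strip().lower()

# Cell 3 - apply it across the whole list and check the result
[n.strip().lower() for n in names]

# Cell 4 - once each step checks out, wrap the working up as a function
def normalise(names):
    return [n.strip().lower() for n in names]

normalise(names)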

This is a “very learning diary” approach, and one  way of using a notebook that keeps the history of all the steps – and possibly trial and error within a single line of code or across several lines of code  – as you work out what you want to do. The output of each state change is checked to make sure that the state is evolving as you expect it to.

I think this approach can be very powerful when you’re learning because you can check back on previous steps.

Another approach to using the notebooks is to work within a cell and build up a function there. Here’s an animated view of that approach:

This approach loses the history – loses your working – but gets to the same place, largely through the same process.

That said, in the notebook environment used in CoCalc, there is an option to replay a notebook’s history in much the same way as Memento lets you replay the history of a web page.

In practice, I tend to use both approaches: keeping a history of some of the working, whilst REPLing in particular cells to get things working.

I also dip out into other cells to try things out or check things, incorporate the result into the main cell, and then delete the scratchpad / working-out cell.

I keep coming back to the idea that Jupyter notebooks are a really powerful environment for learning in, and think  there’s still a lot we can do to explore the different ways we might be able to use them to support teaching as well as learning…:-)

PS via Simon Willison, who also recalled a way of replaying Wikipedia pages, this old Greasemonkey script.

PPS Sort of related, and also linking this post with [A] Note On Web References and Broken URLs, an inkdroid post by Ed Summers on Web Histories that reviews a method by Prof Richard Rogers for Doing Web history with the Internet Archive: screencast documentaries.