Category: Anything you want

Fragments – Multi-Tasking In Jupyter Notebooks and Postgres in Binderhub

A couple of things that I thought might be handy:

  • multi-tasking in Jupyter notebook cells: nbmultitask; running code in a notebook cell normally blocks that cell; this extension – and the associated examples.ipynb notebook – show various ways of running non-blocking threads from notebook cells (see the sketch after this list).
  • a Binderhub compliant Dockerfile for running Postgres alongside a Jupyter notebook; I posted a fragment here that demos a connection if you run the repo via Binderhub. It requires the user to start and stop the Postgres server, and it also escalates the jovyan user privileges via sudo, which could be handy for TM351 purposes. Now pretty much all I need to see a demo of is running OpenRefine too. I think the RStudio Jupyter proxy route is being generalised to make that possible… and, via @choldgraf, it seems as if there is an issue relating to this: https://github.com/jupyterhub/binder/issues/40.
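
By way of illustration of the general idea – this is just plain Python threading, not the nbmultitask API itself, which wraps this sort of pattern up more conveniently – here’s a minimal sketch of kicking off a long running task from a notebook cell without blocking it:

```python
# Minimal sketch of a non-blocking task in a notebook cell.
# This is plain threading, not the nbmultitask API itself - that extension
# wraps this sort of pattern up with progress reporting and thread management.
import threading
import time

results = []

def long_running_task(n_steps=10, pause=1):
    """Pretend to do some slow work, appending results as we go."""
    for i in range(n_steps):
        time.sleep(pause)          # stand-in for real work
        results.append(i)

# Start the task in a background thread: the cell returns immediately,
# so you can carry on running other cells while it works.
worker = threading.Thread(target=long_running_task, daemon=True)
worker.start()

# In a later cell, poll for progress:
print(worker.is_alive(), results)
```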

TO DO – Automated Production of a Screencast Showing Evolution of a Web Page?

Noting @edsu’s Web Histories post describing Richard Rogers’ Doing Web history with the Internet Archive: screencast documentaries, I wonder how hard it would be to write a simple script that automates the collection of screenshots from a web page’s timeline in the Wayback Machine and stitches them together into a movie?

It’s not hard to find code fragments describing how to turn a series of image files into a video (for example, Create a video in Python from images or Combine images into a video with Python 3 and OpenCv 3) and you can grab a screenshot from a web page using various web testing frameworks (eg Capturing screenshots of website with Python or Grabbing Screenshots of folium Produced Choropleth Leaflet Maps from Python Code Using Selenium).
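
As a rough sketch of how the pieces might fit together – using the Wayback Machine CDX API to list snapshots, Selenium to grab the screenshots, and OpenCV to stitch them into a movie (the browser driver, frame rate and monthly snapshot sampling are all just placeholder choices):

```python
# Rough sketch: grab Wayback Machine snapshots of a page with Selenium
# and stitch the screenshots into a video with OpenCV.
# Assumes requests, selenium (with a chromedriver on the path) and
# opencv-python are installed.
import requests
import cv2
from selenium import webdriver

TARGET = "example.com"

# The Wayback CDX API lists snapshot timestamps for a URL;
# collapse=timestamp:6 thins the list to roughly one snapshot per month.
cdx = requests.get(
    "http://web.archive.org/cdx/search/cdx",
    params={"url": TARGET, "output": "json", "fl": "timestamp",
            "filter": "statuscode:200", "collapse": "timestamp:6"},
).json()
timestamps = [row[0] for row in cdx[1:]]  # first row is the header

# Screenshot each archived snapshot.
browser = webdriver.Chrome()
browser.set_window_size(1024, 768)
frames = []
for ts in timestamps:
    browser.get("http://web.archive.org/web/{}/{}".format(ts, TARGET))
    fname = "snapshot_{}.png".format(ts)
    browser.save_screenshot(fname)
    frames.append(fname)
browser.quit()

# Stitch the screenshots into a movie, one snapshot per second.
first = cv2.imread(frames[0])
height, width, _ = first.shape
out = cv2.VideoWriter("timeline.mp4", cv2.VideoWriter_fourcc(*"mp4v"),
                      1, (width, height))
for fname in frames:
    frame = cv2.imread(fname)
    out.write(cv2.resize(frame, (width, height)))
out.release()
```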

So, a half hour hack? But I don’t have half an hour right now:-(

PS A possibly handy related tool for downloading stuff in bulk from the Wayback Machine: this Wayback Machine Downloader script.

PPS Hmm… it should be easy enough to take a similar approach to creating “Wikipedia Journeys” – grab a random Wikipedia page, snapshot it, follow a random link from within the page, screenshot that, and so on. To simplify matters there, there’s the Python wikipedia package.
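
A minimal sketch of that random walk using the wikipedia package might look something like the following (the screenshotting step is left as an exercise, eg using Selenium as above):

```python
# Sketch of a random "Wikipedia Journey": start at a random page and
# keep following a randomly chosen link from the current page.
# Assumes the wikipedia package (pip install wikipedia) is available.
import random
import wikipedia

title = wikipedia.random()
for step in range(5):
    try:
        page = wikipedia.page(title)
    except wikipedia.DisambiguationError as e:
        # If we hit a disambiguation page, just pick one of its options.
        page = wikipedia.page(random.choice(e.options))
    print(step, page.title, page.url)
    # ...screenshot page.url here, eg with Selenium as above...
    title = random.choice(page.links)
```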

Using ThebeLab to Run Python Code Embedded in HTML Via A Jupyter Kernel

If you want to run Python code embedded in an HTML file in your browser, one way of doing it is to use something like Brython, Skulpt or PyPy.js, which convert your Python code into JavaScript – or run a JavaScript implementation of the Python interpreter – in the browser.

One of the problems with this approach is the limited availability of Python modules ported to JavaScript for use by those packages.

Another route to using Python in the browser is to connect to a remote Python environment. This is the approach taken by online code editors such as PythonAnywhere or repl.it.

A third way to access a Python environment via a web browser is to use Jupyter notebooks, but this limits you to the Jupyter notebook environment, or a display rendered using a Jupyter extension such as RISE slideshows or appmode.

Several years ago, O’Reilly published a demonstration Jupyter plugin called Thebe that allowed you to write code in an HTML page and then run it against a Jupyter kernel.

I think that example rotted some time ago, but there’s a new candidate in the field in the form of @minrk’s ThebeLab [repo].

Here’s an example of a live (demo) web page, embedding Python code in the HTML that can be executed against a Jupyter kernel:

The way the code is included in the page is similar to the way it was embedded in the original Thebe demo, via a suitably annotated <pre> tag:

One other thing that’s particularly neat is the way the page invokes the required Jupyter kernel – via a Binderhub container:

(You may note you also need to pull in a thebelab JavaScript package to make the whole thing work…)

What this means is that you can embed arbitrary code, for an arbitrary language (or at least, as arbitrary as the language kernels supported by Jupyter), running against an arbitrary environment (as specified by the Binder image definition).

The code cells are also editable, which means you can modify them and run your own code. Obviously:

  1. this is great for educators and learners alike because it means you can write – and run – interactive code exercises inline in your online course materials;
  2. rubbish for IT because they’ll be scared about the security implications. (The fact that stuff runs in containers should mitigate some of the “can our network get hacked as a result” concerns, but it leaves open the opportunity that someone could use the kernel as a place from which to mount an attack somewhere else. One way many notebook servers get round this is to block or whitelist the external sites to which requests can be made from inside the kernel. Which can be a pain if you need to access third party sites, eg to download data. But it is maybe less of an issue when running a more constrained activity inline within course materials against a custom kernel environment?)

It’d be great to be able to run something like this as a demonstrator activity in TM112… I just need to put a demo together now… (Which shouldn’t be too hard: the current plan is to use notebooks for the demos, running them from a Binderhub launched environment…)

PS I just did the quickest of quick proofs of concept for myself, taking the demo thebelab html file and adding my own bits of ipython-folium-magic demo code, and hey presto… https://psychemedia.github.io/ipython_magic_folium/ In the demo, try editing the code cell to geolocate your own address, rather than the address of the OU, and re-run that code cell. Or look for other things to try out with the magic as described here.

PPS So now I’m wondering about a ThebeLab HTML output formatter for nbconvert: one that runs the code in code input cells hidden using the Hide Cell notebook extension, rendering just the output from those cells, and writes the code from the unhidden code input cells into <pre> tags for use with ThebeLab?
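
Something along these lines might do as a first pass, using nbformat and nbconvert’s ExecutePreprocessor – note that the hide_input metadata flag, and the data-executable attribute ThebeLab looks for, are assumptions on my part here rather than checked details:

```python
# Sketch of a crude "ThebeLab exporter": run a notebook, emit the outputs
# of hidden code cells as static HTML, and write visible code cells into
# ThebeLab-style <pre> tags. Assumes hidden cells are flagged with a
# hide_input metadata value (an assumption about the Hide Cell extension).
from html import escape

import nbformat
from nbconvert.preprocessors import ExecutePreprocessor

nb = nbformat.read("demo.ipynb", as_version=4)
ExecutePreprocessor(timeout=600).preprocess(nb, {"metadata": {"path": "."}})

chunks = []
for cell in nb.cells:
    if cell.cell_type != "code":
        continue
    if cell.metadata.get("hide_input"):
        # Hidden cell: keep just its text outputs as static content.
        for output in cell.outputs:
            text = output.get("data", {}).get("text/plain", output.get("text", ""))
            chunks.append("<pre>{}</pre>".format(escape(text)))
    else:
        # Visible cell: hand the source over to ThebeLab to make it runnable.
        chunks.append('<pre data-executable="true">{}</pre>'.format(escape(cell.source)))

with open("thebe_fragment.html", "w") as f:
    f.write("\n".join(chunks))
```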

Fragment – Virtues of a Programmer, With a Note On Web References and Broken URLs

Ish-via @opencorporates, I came across the “Virtues of a Programmer”, referenced from a Wikipedia page, in a Nieman Lab post by Brian Boyer on Hacker Journalism 101, and stated as follows:

  • Laziness: I will do anything to work less.
  • Impatience: The waiting, it makes me crazy.
  • Hubris: I can make this computer do anything.

I can buy into those… Whilst also knowing (from experience) that any of the above can lead to a lot of, erm, learning.

For example, whilst you might think that something is definitely worth automating:

the practical reality may turn out rather differently:

The reference has (currently) disappeared from the Wikipedia page, but we can find it in the Wikipedia page history:

[Screenshot: Larry Wall – Wikipedia (old page version)]

The date of the NiemanLab article can then be checked against the page’s revision history:

[Screenshot: Larry Wall – Wikipedia revision history]

So here’s one example of a linked reference to a web resource that we know is subject to change and that has a mechanism for linking to a particular instance of the page.

Academic citation guides tend to suggest that URLs are referenced along with the date that the reference was (last?) accessed by the person citing it, but I’m not sure that any guidance is given about securing the retrievability of that resource, as it was accessed, at a later date. (I used to bait librarians a lot for not getting digital in general and the web in particular. I think they still don’t…;-)

This is an issue that also hits us with course materials, when links are made to third party references by URI, rather than more indirectly via a DOI.

I’m not sure to what extent the VLE has tools for detecting link rot (certainly, they used to; now it’s more likely that we get broken link reports from students failing to access a particular resource…) or mitigating against broken links.

One of the things I’ve noticed from Wikipedia is that it has a couple of bots for helping maintain link integrity: InternetArchiveBot and Wayback Medic.

Bots help preserve link availability in a couple of ways:

  • if a link is part of a page, that link can be submitted to an archiving site such as the Wayback Machine (or if it’s a UK resource, the UK National Web Archive);
  • if a link is spotted to be broken (an HTTP 404 error code, say), it can be redirected to the archived link – as sketched below.
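
The core of those checks is simple enough to sketch; something along the following lines, say, using the Wayback Machine’s availability API and “save page” endpoint (the URL being checked and the cited date are just placeholders):

```python
# Sketch of a simple link checker: archive a cited URL, and if it later
# breaks, look up the closest archived copy via the Wayback Machine's
# availability API.
import requests

def archive_url(url):
    """Ask the Wayback Machine to take a snapshot of a page."""
    return requests.get("https://web.archive.org/save/" + url).url

def check_link(url, cited_date="20171201"):
    """Return the original URL if it still resolves, else an archived copy."""
    try:
        ok = requests.head(url, allow_redirects=True, timeout=10).status_code < 400
    except requests.RequestException:
        ok = False
    if ok:
        return url
    resp = requests.get("http://archive.org/wayback/available",
                        params={"url": url, "timestamp": cited_date}).json()
    closest = resp.get("archived_snapshots", {}).get("closest", {})
    return closest.get("url", url)  # fall back to the original if no archive found

print(check_link("http://example.com/some/old/page"))
```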

One of the things I think we could do in the OU is add an attribute to the OU-XML template that points to an “archive-URL”, and tie this in with a service that automatically makes sure that linked pages are archived somewhere.

If a course link rots in presentation, students could be redirected to the archived link, perhaps via a splash screen (“The original resource appears to have disappeared – using the archived link”) as well as informing the course team that the original link is down.

Having access to the original copy can be really helpful when it comes to trying to find out:

  • whether a simple update to the original URL is required (for example, the page still exists in its original form, just at a new location, perhaps because of a site redesign); or,
  • whether a replacement resource needs to be found, in which case, being able to see the content of the original resource can help identify what sort of replacement resource is required.

Does that count as “digital first”, I wonder???

Programming in Jupyter Notebooks, via the Heavy Metal Umlaut

Way back when, I used to take delight in following the creative tech output of Jon Udell, then at InfoWorld. One of the things I fondly remember is his Heavy Metal Umlaut screencast:

You can read about how he put it together via the archived link available from Heavy metal umlaut: the making of the movie.

At some point, I seem to remember a tool became available that let you replay the edit history of a Wikipedia page over time (perhaps a Jon Udell production?) Or maybe that’s a false memory?

A bit later, the Memento project started providing tools to allow you to revisit the history of the web using archived pages from the Wayback Machine. You can find the latest incarnation here: Memento Demos.

(Around the time it first appeared, I think Chris Gutteridge built something related? As Time Goes By, It Makes a World of Diff?)

Anyway – the Heavy Metal Umlaut video came to mind this morning as I was pondering different ways of using Jupyter notebooks to write programs.

Some of my notebooks have things of this form in them, with “finished” functions appearing in the code cells:
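
Something like this, say (a made-up example rather than one lifted from an actual notebook):

```python
# A "finished" function, presented in a single notebook cell.
from collections import Counter

def top_words(text, n=5):
    """Return the n most common words in a text."""
    words = text.lower().split()
    return Counter(words).most_common(n)

top_words("the cat sat on the mat the end", 3)
```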

Other notebooks trace the history of the development of a function, from base elements, taking an extreme REPL approach to test each line of code, a line at a time, as I try to work out how to do something. Something a bit more like this:
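
Again, a made-up illustration – a sequence of cells along these lines, checking the state after each step:

```python
# Cell 1: what does the raw text look like?
text = "the cat sat on the mat the end"
text

# Cell 2: split it into words...
words = text.lower().split()
words

# Cell 3: ...count them...
from collections import Counter
counts = Counter(words)
counts

# Cell 4: ...and pick out the most common ones.
counts.most_common(3)
```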

This is very much a “learning diary” approach, and one way of using a notebook that keeps the history of all the steps – and possibly the trial and error within a single line of code, or across several lines of code – as you work out what you want to do. The output of each state change is checked to make sure that the state is evolving as you expect it to.

I think this approach can be very powerful when you’re learning because you can check back on previous steps.

Another approach to using the notebooks is to work within a cell and build up a function there. Here’s an animated view of that approach:

This approach loses the history – loses your working – but gets to the same place, largely through the same process.

That said, in the notebook environment used in CoCalc, there is an option to replay a notebook’s history in much the same way as Memento lets you replay the history of a web page.

In practice, I tend to use both approaches: keeping a history of some of the working, whilst REPLing in particular cells to get things working.

I also dip out into other cells to try things out or check things before incorporating them into a cell, and then delete the scratchpad / working-out cell.

I keep coming back to the idea that Jupyter notebooks are a really powerful environment for learning in, and think there’s still a lot we can do to explore the different ways we might be able to use them to support teaching as well as learning…:-)

PS via Simon Willison, who also recalled a way of replaying Wikipedia pages, this old Greasemonkey script.

PPS Sort of related, and also linking this post with [A] Note On Web References and Broken URLs, is an inkdroid post by Ed Summers on Web Histories that reviews a method by Prof Richard Rogers for Doing Web history with the Internet Archive: screencast documentaries.

Fragment – Wizards From Jupyter RISE Slideshows

Note to self, as much as anything…

At the moment I’m tinkering with a couple of OU hacks that require:

  • a login prompt to log in to OU auth
  • a prompt that requests what service you require
  • a screen that shows a dialogue relating to the desired service, as well as a response from that service.

I’m building these up in Jupyter notebooks, and it struck me that I could create a simple, multi-step wizard to mediate the interaction using the Jupyter RISE slideshow extension.

For example, the first slide is the login, the second slide the service prompt, the third screen the service dialogue, maybe with child slides relating to that?

(Hmm, thinks – would be interesting if RISE supported even more non-linear actions over and above its 1.5D nature? For example, branched logic, choosing which of N slides to go to next?)

Anyway, just noting that as an idea: RISE live notebook slideshows as multi-step wizards.

If Only I’d Been More Focussed… National-Local Data Robot Media Wire

And so it came to pass that Urbs Media started putting out their Arria NLG generated local data stories, customised from national data sets, on the PA news wire, as reported by the Press Gazette – First robot-written stories from Press Association make it into print in ‘world-first’ for journalism industry – and Hold the Front Page: Regional publishers trial new PA robot reporting project.

Ever keen to try new approaches out, my local hyperlocal, OnTheWight, have already run a couple of the stories. Here’s an example: Few disadvantaged Isle of Wight children go to university, figures show.

Long term readers might remember that this approach is one that OnTheWight have explored before, of course, as described in OnTheWight: Back at the forefront of next wave of automated article creation.

Back in 2015, I teamed up with them to explore some ideas around “robot journalism”, reusing some of my tinkerings to automate the production of a monthly data story OnTheWight run around local jobless statistics. You can see a brief review from the time here and an example story from June 2015 here. The code was actually developed a bit further to include some automatically generated maps (example) but the experiment had petered out by then (“musical differences”, as I recall it!;-) (I think we’re talking again now… ;-) I’d half imagined actually making a go of some sort of offering around this, but hey ho… I still have some related domains I bought on spec at the time…

At the time, we’d been discussing what to do next. The “Big Idea”, as I saw it, was that doing the work of churning through a national dataset with local level data once (for OnTheWight) meant that the work was already done for everywhere else.

[Diagram: robot_intermediatePR]

To this end, I imagined a “datawire” – you can track the evolution of that phrase through OUseful.info posts here – that could be used to distribute localised press releases automatically generated from national datasets. One of the important things for OnTheWight was getting data reports out quickly once a data set had been released. (I seem to remember we raced each other – the manual route versus the robot one.) My tools weren’t fully automated – I had to keep hitting reload to fetch the data rather than having a cron job start pinging the Nomis website around the time of the official release, but that was as much because I didn’t run any servers as anything. One thing we did do was automatically push the robot generated story into the OnTheWight WordPress blog draft queue, from where it could be checked and published by a human editor. The images were handled circuitously (I don’t think I had a key to push image assets to the OnTheWight image server?)
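
One way of doing that sort of push into a draft queue – not necessarily the route we used at the time – is via the python-wordpress-xmlrpc package; a minimal sketch, with placeholder endpoint and credentials:

```python
# Sketch: push an automatically generated story into a WordPress draft
# queue using the python-wordpress-xmlrpc package. The endpoint URL,
# username and password are placeholders.
from wordpress_xmlrpc import Client, WordPressPost
from wordpress_xmlrpc.methods.posts import NewPost

client = Client("https://example.com/xmlrpc.php", "botuser", "botpassword")

post = WordPressPost()
post.title = "Isle of Wight jobless figures, June 2015"
post.content = "Automatically generated story text goes here..."
post.post_status = "draft"   # leave it in the draft queue for a human editor

post_id = client.call(NewPost(post))
print("Draft created with id", post_id)
```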

The data wire idea was actually sketched out a couple of years ago at a community journalism conference (Time for a Local Data Wire?), and that was perhaps where our musical differences about the way forward started to surface? :-(

One thing you may note is the focus on producing press releases, with the intention that a journalist could build a story around the data product, rather than the data product standing in wholesale for a story.

I’m not sure this differs much from the model being pursued by Urbs Media, the organisation that’s creating the PA data stories, and that is funded in part at least by a Google Digital News Initiative (DNI) grant: PA awarded €706,000 grant from Google to fund a local news automation service in collaboration with Urbs Media.

FWIW, give me three quarters of a million squids, or Euros, and that’d do me as a private income for the rest of my working life; which means I’d be guilt free enough to play all the time…!

One of the things that I think the Urbs stories are doing is including quotes on the national statistical context taken from the original data release. For example:

Which reminds me – I started to look at the ONS JSON API when it appeared (example links), but don’t think I got much further than an initial play... One to revisit, to see if it can be used as a source from which automated quote extraction is possible…

Our original jobs stats stories didn’t really get to evolve as far as being the inspiration for contextualising reporting – they were more or less a literal restatement of the “data generated press release”. I seem to recall that this notion of data-to-text-to-published-copy started to concern me, and I began to explore it in a series of posts on “robot churnalism” (for example, Notes on Robot Churnalism, Part I – Robot Writers and Notes on Robot Churnalism, Part II – Robots in the Journalism Workplace).

(I don’t know how many of the stories returned in that search were from PA stories. I think that regional news group operators such as Johnston Press and Archant also run national units producing story templates that can be syndicated, so some templated stories may come from there.)

I think there are a couple more posts in that series still in my draft queue somewhere which I may need to finish off… Perhaps we’ll see how the new stories start to play out to see whether we start to see the copy being reprinted as is or being used to inspire more contextualised local reporting around the data.

I also recall presenting on the topic of “Robot Writers” at ILI in 2016 (I wasn’t invited back this year :-( ).

So what sort of tech is involved in producing the PA data wire stories? From the preview video on the Urbs Media website, the technology behind the Radar project – Reporters and Data and Robots – looks to be the Articulator Lite application developed by Arria NLG. If you haven’t been keeping up, Arria NLG is the UK equivalent of companies like Narrative Science and Automated Insights in the US, which I’ve posted about on and off for the last few years (for example, Notes on Narrative Science and Automated Insights).

Anyway, it’ll be interesting to see how the PA / Urbs Media thing plays out. I don’t know if they’re automating the charts’n’maps production thing yet, but if they do then I hope they generate easily skinnable graphic objects that can be themed using things like ggthemes or matplotlib styles.

There’s a stack of practical issues and ethical issues associated with this sort of thing, and it’ll be interesting to see if any concerns start to be aired, or oopses appear. The reporting around the Births by parents’ characteristics in England and Wales: 2016 could easily be seen as judgemental, for example.

PS I wonder if they run a Slack channel data wire (Slackbot Data Wire, Initial Sketch)? Maybe there’s still a gap in the market for one of my ideas?! ;-)