Programming in Jupyter Notebooks, via the Heavy Metal Umlaut

Way back when, I used to take delight in following the creative tech output of Jon Udell, then at InfoWorld. One of the things I fondly remember is his Heavy Metal Umlaut screencast:

You can read about how he put it together via the archived link available from Heavy metal umlaut: the making of the movie.

At some point, I seem to remember a tool becoming available that let you replay the edit history of a Wikipedia page over time (perhaps a Jon Udell production?). Or maybe that’s a false memory?

A bit later, the Memento project started providing tools to allow you to revisit the history of the web using archived pages from the Wayback Machine. You can find the latest incarnation here: Memento Demos.

(Around the time it first appeared, I think Chris Gutteridge built something related? As Time Goes By, It Makes a World of Diff?)

Anyway – the Heavy Metal Umlaut video came to mind this morning as I was pondering different ways of using Jupyter notebooks to write programs.

Some of my notebooks have things of this form in them, with “finished” functions appearing in the code cells:

Other notebooks trace the history of the development of a function, from base elements, taking an extreme REPL approach to test each line of code, a line at a time, as I try to work out how to do something. Something a bit more like this:

This is very much a “learning diary” approach, and one way of using a notebook that keeps the history of all the steps – and possibly the trial and error within a single line of code, or across several lines of code – as you work out what you want to do. The output of each state change is checked to make sure that the state is evolving as you expect it to.

I think this approach can be very powerful when you’re learning because you can check back on previous steps.

Another approach to using the notebooks is to work within a cell and build up a function there. Here’s an animated view of that approach:

This approach loses the history – loses your working – but gets to the same place, largely through the same process.

That said, in the notebook environment used in CoCalc, there is an option to replay a notebook’s history in much the same way as Memento lets you replay the history of a web page.

In practice, I tend to use both approaches: keeping a history of some of the working, whilst REPLing in particular cells to get things working.

I also dip out into other cells to try things out or check things before incorporating them into a cell, and then delete the scratchpad / working-out cell.

I keep coming back to the idea that Jupyter notebooks are a really powerful environment for learning in, and think there’s still a lot we can do to explore the different ways we might be able to use them to support teaching as well as learning… :-)

PS via Simon Willison, who also recalled a way of replaying Wikipedia pages, this old Greasemonkey script.

PPS Sort of related, and also linking this post with [A] Note On Web References and Broken URLs, an inkdroid post by Ed Summers on Web Histories that reviews a method by Prof Richard Rogers for Doing Web history with the Internet Archive: screencast documentaries.

Fragment – Wizards From Jupyter RISE Slideshows

Note to self, as much as anything…

At the moment I’m tinkering with a couple of OU hacks that require:

  • a login prompt to log in to OU auth
  • a prompt that requests what service you require
  • a screen that shows a dialogue relating to the desired service, as well as a response from that service.

I’m building these up in Jupyter notebooks, and it struck me that I could create a simple, multi-step wizard to mediate the interaction using the Jupyter RISE slideshow extension.

For example, the first slide is the login, the second slide the service prompt, the third screen the service dialogue, maybe with child slides relating to that?
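As a minimal sketch of how that might be wired up – the notebook name and the flat three-step structure are assumptions for illustration – RISE reads its slide structure from cell metadata, which can be set programmatically with nbformat:

import nbformat

# load a notebook and tag its cells with RISE slide types, one "slide" per wizard step
nb = nbformat.read("wizard.ipynb", as_version=4)

for cell, step in zip(nb.cells, ["slide", "slide", "slide"]):
    cell.metadata["slideshow"] = {"slide_type": step}

nbformat.write(nb, "wizard.ipynb")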

(Hmm, thinks – it would be interesting if RISE supported even more non-linear actions over and above its 1.5D nature? For example, branched logic, choosing which of N slides to go to next?)

Anyway, just noting that as an idea: RISE live notebook slideshows as multi-step wizards.

If Only I’d Been More Focussed… National-Local Data Robot Media Wire

And so it came to pass that Urbs Media started putting out their Arria NLG generated local data stories, customised from national data sets, on the PA news wire, as reported by the Press Gazette (First robot-written stories from Press Association make it into print in ‘world-first’ for journalism industry) and Hold the Front Page (Regional publishers trial new PA robot reporting project).

Ever keen to try new approaches out, my local hyperlocal, OnTheWight, have already run a couple of the stories. Here’s an example: Few disadvantaged Isle of Wight children go to university, figures show.

Long term readers might remember that this approach is one that OnTheWight have explored before, of course, as described in OnTheWight: Back at the forefront of next wave of automated article creation.

Back in 2015, I teamed up with them to explore some ideas around “robot journalism”, reusing some of my tinkerings to automate the production of a monthly data story OnTheWight runs around local jobless statistics. You can see a brief review from the time here and an example story from June 2015 here. The code was actually developed a bit further to include some automatically generated maps (example), but the experiment had petered out by then (“musical differences”, as I recall it!;-) (I think we’re talking again now… ;-) I’d half imagined actually making a go of some sort of offering around this, but hey ho… I still have some related domains I bought on spec at the time…

At the time, we’d been discussing what to do next. The “Big Idea”, as I saw it, was that doing the work of churning through a national dataset with data at the local level once (for OnTheWight) meant that the work was already done for everywhere.


To this end, I imagined a “datawire” – you can track the evolution of that phrase through OUseful.info posts here – that could be used to distribute localised press releases automatically generated from national datasets. One of the important things for OnTheWight was getting data reports out quickly once a data set had been released. (I seem to remember we raced each other – the manual route versus the robot one.) My tools weren’t fully automated – I had to keep hitting reload to fetch the data rather than having a cron job start pinging the Nomis website around the time of the official release, but that was as much because I didn’t run any servers as anything. One thing we did do was automatically push the robot generated story into the OnTheWight WordPress blog draft queue, from where it could be checked and published by a human editor. The images were handled circuitously (I don’t think I had a key to push image assets to the OnTheWight image server?)
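For what it’s worth, here’s a rough sketch of the pattern, using today’s WordPress REST API rather than whatever we actually wired together at the time; the Nomis query parameters, the story template and the credentials are all placeholders / assumptions:

import requests

# grab the latest claimant count data from the Nomis API
# (NM_1_1 is the Jobseeker's Allowance dataset; the geography code is a placeholder)
csv_text = requests.get(
    "https://www.nomisweb.co.uk/api/v01/dataset/NM_1_1.data.csv",
    params={"geography": "E06000046"},
).text

def generate_story(csv_text):
    # hypothetical, trivially simple templating; the real thing did rather more
    rows = csv_text.splitlines()
    return "The latest release contains {} data rows.".format(len(rows) - 1)

# push the robot generated story into the WordPress draft queue for a human editor
requests.post(
    "https://example.com/wp-json/wp/v2/posts",
    auth=("bot", "app-password"),
    json={"title": "Latest jobless figures",
          "content": generate_story(csv_text),
          "status": "draft"},
)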

The data wire idea was actually sketched out a couple of years ago at a community journalism conference (Time for a Local Data Wire?), and that was perhaps where our musical differences about the way forward started to surface? :-(

One thing you may note is the focus on producing press releases, with the intention that a journalist could build a story around the data product, rather than the data product standing in wholesale for a story.

I’m not sure this differs much from the model being pursued by Urbs Media, the organisation that’s creating the PA data stories, and that is funded in part at least by a Google Digital News Initiative (DNI) grant: PA awarded €706,000 grant from Google to fund a local news automation service in collaboration with Urbs Media.

FWIW, give me three quarters of a million squids, or Euros, and that’d do me as a private income for the rest of my working life; which means I’d be guilt-free enough to play all the time…!

One of the things that I think the Urbs stories are doing is including quotes on the national statistical context taken from the original data release. For example:

Which reminds me – I started to look at the ONS JSON API when it appeared (example links), but don’t think I got much further than an initial play... One to revisit, to see if it can be used as a source from which automated quote extraction is possible…
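By way of a note to self, a first probe might be as simple as the following – the endpoint is the ONS beta API as I recall it, so treat the URL and the response shape as assumptions to be checked:

import requests

# list what's available from the ONS datasets API and eyeball the titles
r = requests.get("https://api.beta.ons.gov.uk/v1/datasets")
for item in r.json().get("items", []):
    print(item.get("title"))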

Our original job stats stories never really got as far as evolving into inspiration for contextualising reporting – they were more or less a literal restatement of the “data generated press release”. I seem to recall that this notion of data-to-text-to-published-copy started to concern me, and I began to explore it in a series of posts on “robot churnalism” (for example, Notes on Robot Churnalism, Part I – Robot Writers and Notes on Robot Churnalism, Part II – Robots in the Journalism Workplace).

(I don’t know how many of the stories returned in that search were from PA stories. I think that regional news group operators such as Johnston Press and Archant also run national units producing story templates that can be syndicated, so some templated stories may come from there.)

I think there are a couple more posts in that series still in my draft queue somewhere which I may need to finish off… Perhaps we’ll see how the new stories start to play out to see whether we start to see the copy being reprinted as is or being used to inspire more contextualised local reporting around the data.

I also recall presenting on the topic of “Robot Writers” at ILI in 2016 (I wasn’t invited back this year :-( ).

So what sort of tech is involved in producing the PA data wire stories? From the preview video on the Urbs Media website, the technology behind the Radar project – Reporters and Data and Robots – looks to be the Articulator Lite application developed by Arria NLG. If you haven’t been keeping up, Arria NLG is the UK equivalent of companies like Narrative Science and Automated Insights in the US, which I’ve posted about on and off for the last few years (for example, Notes on Narrative Science and Automated Insights).

Anyway, it’ll be interesting to see how the PA / Urbs Media thing plays out. I don’t know if they’re automating the charts’n’maps production thing yet, but if they do then I hope they generate easily skinnable graphic objects that can be themed using things like ggthemes or matplotlib styles.
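On the matplotlib side, for example, a house theme can be as simple as applying a named style sheet before a chart is drawn – a minimal sketch, with made-up figures:

import matplotlib.pyplot as plt

# apply a named style; a custom .mplstyle file could define house colours and fonts instead
plt.style.use("ggplot")

fig, ax = plt.subplots()
ax.bar(["2015", "2016", "2017"], [120, 150, 135])
ax.set_title("Example themed chart")
fig.savefig("themed_chart.png")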

There’s a stack of practical issues and ethical issues associated with this sort of thing, and it’ll be interesting to see if any concerns start to be aired, or oopses appear. The reporting around the Births by parents’ characteristics in England and Wales: 2016 could easily be seen as judgemental, for example.

PS I wonder if they run a Slack channel data wire? (Slackbot Data Wire, Initial Sketch.) Maybe there’s still a gap in the market for one of my ideas?! ;-)

Open ALMS….

A year or two after graduating, and having failed in a bid to become an independent music promoter with a couple of others (we booked Hawkwind rather than Nirvana; oops…), I was playing chess with a stoner of Buddha-like nature who showed me how the pieces moved. (That is, I knew how the pieces moved square-wise; but this was more in the sense of “doh… so if you do that……., then that, and that, and then this, then that and, …okay. Another game?”)

Sometimes, I get that feeling from OLDaily (the being taught how the pieces move thing; I can’t really comment on the adjectival bits). As a case in point, Stephen recently linked to an old piece of work from David Wiley, amongst others: The Four R’s of Openness and ALMS Analysis: Frameworks for Open Educational Resources.

The paper opens with a review of “The Four R’s of Openness” – the rights to reuse, redistribute, revise and remix – and then goes on to consider Wiley’s ALMS framework, which represents Stephen’s move:

One of the primary benefits of an OER is that it can be adapted, or localized, to the needs of specific situations.

Even if a work has been licensed so that users are free to reuse, redistribute, revise and remix it, several technical factors affect openness, particularly in terms of revising and remixing. If producers of OER give people permission to use their resources, they should also consider giving them the technical keys to unlock the OERs so that they can adapt the OER to their own needs. … ALMS is an acronym that stands for: Access to editing tools. Level of expertise required to revise or remix. Meaningfully editable. Source-file access.

Access to editing tools. When people try to revise OER, one of the first questions they will need to ask is ―What software do I use to edit this resource? …

Level of expertise required to revise or remix. … Even if end users have access to editing tools, if they need 100 hours of training to use the tool effectively, revising OERs that rely on those tools will likely be beyond their reach. …

Meaningfully editable. Perhaps the classic example of OER that are not meaningfully editable are scanned PDF documents. If a person takes handwritten notes, scans images of those notes and puts them into PDF format, the contents of the resulting file cannot be meaningfully edited. The only way to revise or remix this work is to type out the words in the PDF into a word processor document and make revisions there. [TH: so you might argue that diagrams are typically not meaningfully editable.]

Source-file access. … A source file is the file that a programmer or developer edits and works with in order to produce a final product. …

Open educational resources will be easy to revise or remix technically if they are meaningfully editable (like a web page), access to the source file is provided (like an HTML file), can be edited by a wide range of free or affordable software programs (like an RTF file), and can be edited with software that is easy to use and is used by many people.

Open educational resources will be difficult to revise or remix technically if they are not meaningfully editable (like scanned handwriting), are not self-sourced (like a Flash file), can only be edited by one, single platform, expensive software program (like a Microsoft OneNote file), and can only be edited with software that requires extensive training and is used by relatively few people.

[T]echnical aspects of OER will affect how “open” they really are. Creators of OER who wish to promote revising and remixing should ensure that OER are designed in such a way that users will have access to editing tools, that the tools needed to will not require a prohibitive level of expertise, and that the OER are meaningfully editable and self-sourced.

So… I’ve tried to explain some of my recent thinking around this topic in Maybe Programming Isn’t What You Think It Is? Creating Repurposable & Modifiable OERs and OERs in Practice: Re-use With Modification, but that didn’t get anywhere. Inspired by the ALMS piece, I’ll try again…

…by talking about Binder and Binderhub, again… Binderhub is a technology for launching on demand a Jupyter notebook server that serves notebooks against a particular programming environment. The notebook server essentially provides an interactive user interface via a browser in the form of Jupyter notebooks. The specification for the computing environment accessed from the notebooks, as well as the notebooks themselves, can be published in a public Github repository. Binderhub copies the files in the repository, and then uses them to build the computing environment.

Think of the computing environment like your desktop computer. To open particular documents, you need particular applications installed. The installed applications are part of your environment. If I give you a weird document type, you might have to install an application in order to open that document. The document might open easily for me because I have customised my computer with a particular environment (which includes all the applications I have installed on it to open and edit weird document types). But it might be difficult for you to install whatever it is you need to install to open the document and work with it. How much easier if you could just use my computer to do it. Or my environment.

Some of the files in a public Github repository, then, can be used to define an environment (they are the “source code” for it). Binder can use them to build for you a copy of the environment that I work in, particularly if the environment I work in is one built for me by Binder from the same repository. If the files work for me, they should work for you.
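As a minimal sketch – the file names here are illustrative, not the contents of the repos linked below – a Binder-ready repository can be as simple as:

demo.ipynb          # the "content" - a notebook or two
requirements.txt    # the environment "source code", e.g. a single line reading: folium

which Binderhub then builds and serves from a launch URL of the form https://mybinder.org/v2/gh/USERNAME/REPOSITORY/master.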

As well as being a technology, Binderhub is, currently, a free, open service. It can be used to launch interactive notebooks running in an environment defined via files in a public Github repository. Try one: Binder. From the Cell menu, select Run All.

The demos in that example relate to maths. If you prefer to see music related demos, use this Binder instead: Binder

The definition for the environment, and the notebooks (that is, the “content”) are here: psychemedia/showntell/maths and here:
psychemedia/showntell/music.

All the rich assets in a notebook – images, sounds, interactives – can be generated from “source code” contained in the notebook (select the Toggle selected cell input display button on the toolbar to reveal / hide the code). In many cases the source code can be quite concise because it calls on other, powerful applications preinstalled in the environment – the environment we defined in the repository.

Actually, that’s not strictly true. We may also pull on third party resources, such as embedded Google or Bing maps. But we could, if we wanted to, add a service to deliver the map tiles into the local environment, by defining the appropriate services in the repository and building them into the environment “from source”. We could, if we wanted to, build a self-contained environment that contains everything it needs in order to render everything that appears in the notebook.

What Binder and Binderhub do, then, is remove the technical barrier to entry when it comes to setting up a computing environment that is pre-configured to run a particular set of notebooks. Binder will build the environment for you, from source code. If the author of the notebooks creates and tests the notebooks to their satisfaction in the same environment, they should work…

So now we get to the notebooks. What the notebooks do is provide an environment that can be used to author, which is to say edit, which is to say revise and remix, rich content (where “rich” might be taken to mean “including media resources other than static text”). They also provide an environment for viewing and interacting with those materials. Again, for the viewer / user, there is no need to install anything. Binder will provide, via your browser.

At the moment, most of the focus on the development, and use, of Jupyter notebooks is for scientific research, but there is growing use in education and (data) journalism too.

I think there is a huge opportunity for using notebooks as a general purpose environment for creating materials that can make use of compute stuff, and from there start to work as a gateway drug for learning how to make more effective use of computers by writing easily learned magic incantations, a line at a time.

As an example, see if you can figure out how to embed a map centred on an address familiar to you using the following Binder: Binder

To read more about that notebook, see Embedding folium Maps In Jupyter Notebooks Using IPython Magic and Extending the folium Magic…. For other examples, such as how to embed images showing Scratch like programs, see Scratch Materials – Using Blockly Style Resources in Jupyter Notebooks. For an example of how to use a Jupyter notebook to create, and display, an interactive slideshow, see OERs in Practice: Repurposing Jupyter Notebooks as Presentations.

For more references to the ALMS work, see: Measuring Technical Difficulty in Reusing Open Educational Resources with the ALMS Analysis Framework, Seth M. Gurell, PhD Thesis.

Practical DigiSchol – Refinding Lost Recent Web History

I don’t know to what extent our current courses equip folk for the future (or even the present), but here’s a typical example of something I needed to do today – figure out a solution to a problem (actually, more of a puzzle) – using the resources I had to hand…

Here’s the puzzle:

I couldn’t remember offhand what this referred to other than “body shape” and “amazon” (I’ve had a cold recently and it seems to have affected my mental indexing a bit…!), and a quick skim of my likely pinboard tags didn’t turn anything up…

Hmmm… I recall looking at a related page, so I should be able to search my browser history. An easy way to do this is using Simon Willison’s datasette app to search the SQLite database that contains my Chrome browser history:

pip3 install datasette
# NB if Chrome is running, it may have the database locked; work on a copy of the file if so
datasette ~/Library/Application\ Support/Google/Chrome/Default/History

This fires up a simple server with a browser based UI:

Body Labs – that was it…

Sigh… So what was there?

You can browse the archived site on the Wayback Machine.

On reflection (as I said: head cold, not focussing/concentrating properly), I could have web searched:

No closer to the video but it sets the context…

That said, searching for bodylabs video brings up some likely candidates for alternatives, such as this one (pixelfondue: Body + Measure + Scan + AI = Bodylabs delicacy):

Anyway, back to the question: what did I need to know in order to be able to do that? And where do folk learn that sort of thing, whether or not they are formally taught it? Indeed, if there is a related thing to be formally taught, what is it, at what level, and in what context?

PS I had actually also kept a note of the company and the video in a scribble pad draft post on the Digital Worlds blog where I collect irreality related stuff:


Extending the folium Magic…

A couple of days ago I posted about some IPython magic for embedding interactive maps in Jupyter notebooks.

I had a bit more of a play yesterday, and then into the night, and started to add support in for other things. For example, you can now build up a map using several pieces of magic, for example by assigning a magic generated map to a variable and then passing that map in as the basemap to another piece of magic.

If the output of the previously run code cell is a folium map, the magic will use that map as the starting map by default – there is a switch (-b None), as well as some alternative magic (%folium_new_map), that forces a fresh map if you want to start from scratch.
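In plain folium terms – stripped of the magic – that corresponds to something like the following minimal sketch (the coordinates are assumed, roughly the Isle of Wight):

import folium

# build a base map, then keep adding to the same map object rather than starting afresh
m = folium.Map(location=[50.69, -1.29], zoom_start=11)
folium.Marker([50.69, -1.29], popup="An example marker").add_to(m)

m  # as the last expression in a notebook cell, this renders the map inline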

The magic now also has some support for trying to guess where to centre a map, as well as logic that tries to guess which columns you might want to use when generating a choropleth map from some data and a geojson file. This is best exemplified by some helper magic:

The same routines are used to try to guess column names if you omit them when trying to plot a choropleth. For example, here we only pass in a reference to the data file, the geojson file, and the numeric data column to colour the map. The columns used to plot the boundaries are guessed at.

This is likely to be very ropey – I’ve only tested it with a single pair of data/geojson files – but it might work for limited general cases…
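For what it’s worth, one naive way such a guess can be made – and I’m not claiming this is exactly what the magic does – is to find the dataframe column whose values overlap most with some property across the geojson features:

import json
import pandas as pd

def guess_join_columns(df, geojson_path):
    # score every (dataframe column, geojson property) pair by value overlap
    # and return the best scoring pair as the candidate join columns
    with open(geojson_path) as f:
        features = json.load(f)["features"]
    best_score, best_col, best_prop = 0, None, None
    for prop in features[0]["properties"]:
        prop_values = {str(feat["properties"].get(prop)) for feat in features}
        for col in df.columns:
            score = len(prop_values & set(df[col].astype(str)))
            if score > best_score:
                best_score, best_col, best_prop = score, col, prop
    return best_col, best_prop

# e.g. guess_join_columns(pd.read_csv("data.csv"), "boundaries.geojson")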

You can play with it here: Binder

Code is…

…something that everyone is supposed to learn, apparently, though I’m not sure why; and you don’t have to if you’re an adult learner, unlike someone in school?

So here are a couple of reasons why I think knowledge of how to use code is useful.

Code is a means for telling computers, rather than computer users, what to do

Writing instructional material for students that requires them to use a computer application is a real pain. You have to write laborious From the file menu, select…, in the popup wizard, check the… instructions, or use screenshots (which have to look exactly the same as what the learner is likely to see, or they can get worried) or screencasts (which have to look exactly the same as what the learner is likely to see, or they can get worried).

The problem is, of course, you don’t really want the learner to do any of that. You want them to be able to apply a series of transformations to some sort of resource, such as a file.

On the other hand, you can write a set of instructions for what you want the computer to do to the resource, and give that to the learner. The learner can use that medium for getting the computer to transform the resource rather more literally…
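For example, rather than a page of menu-driving instructions, the material can just hand the learner a few lines that say what should happen to the file. A minimal sketch (the filenames are made up for the sake of the example):

import pandas as pd

df = pd.read_csv("data.csv")                 # open the file
df = df.dropna()                             # throw away incomplete rows
df.to_csv("cleaned_data.csv", index=False)   # save the transformed resource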

Code is a tool for building tools

Code lets you build your own tools, or build your own tools on top of your own tools.

One huge class of tools you can build are tools that automate something else, either by doing a repetitive task multiple times over on your behalf, or by abstracting a lengthy set of instructions into a single instruction with some “settings” passed as parameters. And yes, you do know what parameters are: in the oven for thirty minutes at gas mark 5, which is to say, in “pseudo-code”, oven(gas_mark=5, time=30).
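As a toy sketch of both flavours at once – a lengthy set of steps abstracted into a single parameterised instruction, then repeated on your behalf (the folder name and file extensions are made up):

import pathlib

def convert_all(folder, old_ext=".txt", new_ext=".md"):
    # one instruction, some settings passed as parameters, many repetitions done for you
    for path in pathlib.Path(folder).glob("*" + old_ext):
        path.rename(path.with_suffix(new_ext))

convert_all("notes", old_ext=".txt", new_ext=".md")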

Here’s one tool I built recently – some magic that makes it easy to embed maps in Jupyter notebooks.

Code is a tool for extending other tools

If you have access to the code that describes a tool, you can extend it, tune it, or customise it with your own code.

For example, here’s that previous magic extended.

Code is something that can be reused in, or by, other bits of code

If you’ve written some code to perform a particular task in order to achieve something in one tool, you don’t need to write that code again if you want to perform the same task in another tool. You can just reuse the code.

Lots of code is “algorithmic boilerplate”, in the sense that it implements a recipe for doing something. Oftentimes, you may want to take the “algorithmic boilerplate” that someone else has got working in their application / tool and reuse it, perhaps with slight modification (swapping dataframes for datafiles, for example).
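A toy example: once a scrap of boilerplate works in one place, any other bit of code can call it, whether the numbers come from a datafile, a dataframe or an API (the function and the figures are made up):

def trend_sentence(old, new, thing="claimant count"):
    # algorithmic boilerplate: turn two numbers into a line of readable copy
    direction = "fell" if new < old else "rose"
    return "The {} {} from {} to {}.".format(thing, direction, old, new)

# reused here with toy figures; another tool could import and call the same routine
print(trend_sentence(1500, 1350))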

Disclaimer

In my student days, many nights were spent between the computer room and the Student Union playing bridge. In the bridge games, the bidding was ad hoc because no-one could ever remember any of the bidding systems you’re “supposed to use” properly… Obviously, it wasn’t proper bridge, but it passed the time amicably…

So unlike folk who (think code should be taught properly and don’t think to use code themselves for daily tasks), I have a more pragmatic approach and ((think my code only has to work well enough to make it worthwhile using) or (treat it as an applied recreational activity, like doing a crossword)).