Computers

If you look back not that far in history, the word “computer” was a term applied to a person working in a particular role. According to Webster’s 1828 American Dictionary of the English Language, a computer was defined as “[o]ne who computes or reckons; one who estimates or considers the force and effect of causes, with a view to form a correct estimate of the effects”.

Going back a bit further, to Samuel Johnson’s magnum opus, we see a “computer” is defined more concisely as a “reckoner” or “accountant”.

In a disambiguation page, Wikipedia identifies Computer_(job_description), quoting Turing’s Computing Machinery and Intelligence paper in Mind (Volume LIX, Issue 236, October 1950, Pages 433–460):

The human computer is supposed to be following fixed rules; he has no authority to deviate from them in any detail. We may suppose that these rules are supplied in a book, which is altered whenever he is put on to a new job.

Skimming through a paper that appeared in my feeds today — CHARTDIALOGS: Plotting from Natural Language Instructions [ACL 2020; code repo] — the following jumped out at me:

In order to further inspect the quality and difficulty of our dataset, we sampled a subset of 444 partial dialogs. Each partial dialog consists of the first several turns of a dialog, and ends with a Describer utterance. The corresponding Operator response is omitted. Thus, the human has to predict what the Operator (the plotting agent) will plot, given this partial dialog. We created a new MTurk task, where we presented each partial dialog to 3 workers and collected their responses.

Humans. As computers. Again.

Originally, the computer was a person doing a mechanical task.

Now, a computer is a digital device.

Now a computer aspires to be AI, artificial (human) intelligence.

Now AI is, in many cases, behind the Wizard of Oz curtain, inside von Kempelen’s “The Turk” automaton (not…), a human.

Human Inside.

A couple of other things that jumped out at me, relating to instrumentation and comparison between machines:

The cases in which the majority of the workers (3/3 or 2/3) exactly match the original Operator, corresponding to the first two rows, happen 72.6% of the time. The cases when at least 3 out of all 4 humans (including the original Operator) agree, corresponding to row 1, 2 and 5, happen 80.6% of the time. This setting is also worth considering because the original Operator is another MTurk worker, who can also make mistakes. Both of these numbers show that a large fraction of the utterances in our dataset are intelligible implying an overall good quality dataset. Fleiss’ Kappa among all 4 humans is 0.849; Cohen’s Kappa between the original Operator and the majority among 3 new workers is 0.889. These numbers indicate a strong agreement as well.

Just as you might compare the performance of different implementations of an algorithm in code, we can also compare the performance of their instantiations in digital or human computers.
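(As an aside, the sort of agreement statistics quoted above are easy enough to reproduce for your own annotation exercises. The following is just a sketch with made-up labels, using sklearn and statsmodels rather than whatever the paper’s authors actually used:)

```python
# Sketch: inter-annotator agreement over hypothetical categorical labels
# (one row per item; columns are the original Operator plus three new workers).
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa
from statistics import mode

ratings = [
    ["bar", "bar", "bar", "bar"],
    ["line", "line", "scatter", "line"],
    ["pie", "bar", "bar", "bar"],
    ["line", "line", "line", "line"],
]

# Fleiss' kappa across all four raters.
table, _categories = aggregate_raters(ratings)
print("Fleiss' kappa:", fleiss_kappa(table, method="fleiss"))

# Cohen's kappa between the original Operator (column 0) and the
# majority vote of the three new workers (columns 1-3).
operator = [row[0] for row in ratings]
majority = [mode(row[1:]) for row in ratings]
print("Cohen's kappa:", cohen_kappa_score(operator, majority))
```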

At the moment, for “intelligence” tasks (and it’s maybe worth noting that Mechanical Turk has work packages defined as HITs, “Human Intelligence Tasks”), humans are regarded as providing the benchmark gold standard, imperfect as it is.

7.5 Models vs. Gold Human Performance (P3)

The gold human performance was obtained by having one of the authors perform the same task as described in the previous subsection, on a subset…

Dehumanising?

See also: Robot Workers?

Fragment — Figure-Ground: Opposites Jupyter and Excel

Via a Twitter trawl sourcing potential items for Tracking Jupyter, I came across several folk picking up on a recent Growing the Internet Economy podcast interview — Invest Like the Best, EP.178 — with John Collison, co-founder of the digital payments company Stripe, and on a couple of comments in particular.

Firstly, on Excel in the context of “no code” environments:

“[I]f you look at Excel, no one calls as a no-code tool, but Excel, I think is one of the most underappreciated programming environments in the world. And the number of Excel programmers versus people using how we think of as more traditional languages is really something to behold.”

One of the issues I have with using things like Scratch to teach adults to code is that it does not provide an environment that resonates with the idea of using code to do useful work. Programming as it is taught, as an academic discipline in computing departments on the one hand, and as a software engineering, large-project-codebase activity on the other, has zero relevance to the way I use code every day: as a tool for building tools, exploring stateful things in a state-transformative way, and getting things done through automation.

I would far rather we taught folk “a line of code at a time” principles using something like Excel. (There’s an added advantage to this: you also teach students in a natural way about concepts relating to vector-based / columnar computation, as well as reactivity, neither of which are typically taught in introductory, maybe even advanced, academic programming classes. Certainly, after several years of teaching the pandas domain-specific language in a data management and analysis course, we have only recently really articulated to ourselves how we need to develop the idea of vectorised computation more explicitly.)
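By way of a trivial example of the sort of vectorised, column-wise computation I mean, here’s a minimal pandas sketch (made-up data), where each derived column is defined by a single expression applied to whole columns, much as an Excel formula is filled down a column:

```python
import pandas as pd

df = pd.DataFrame({"quantity": [3, 5, 2], "unit_price": [1.50, 2.00, 4.25]})

# One "formula" per derived column, applied to every row at once.
df["cost"] = df["quantity"] * df["unit_price"]
df["cost_with_vat"] = df["cost"] * 1.2

print(df)
```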

Secondly, on Excel as an environment:

“[L]ots of features of Excel … make it a really nice programming environment and really nice to learn in, where the fact that it’s continuously executed means that unlike you running your code and it doesn’t work, and you’ve got some error, that’s hard to comprehend. Instead, you have a code that just continuously executed in the form of the sheets you see in front of you. And similarly the fact that its individual cells, and you kind of lay out the data spatial… Or the program spatially, where the code and the data is interspersed together, and no one part of it can get too big and diffuse.”

The continuous execution is typically a responsive evaluation of all cells based on an input change to one of them. In this sense, Excel has many similarities with the piecewise “REPL” (read-evaluate-print-loop) execution model used by Jupyter (notebook) kernels, where the code in an input cell is evaluated when the cell is run, often rendering an output data state such as a chart or a table.

One of the replies to one of the shares, from @andrewparker, makes this explicit: “[w]hen writing code, the functions are always visible and the variables’ contents are hidden. Excel is programming where the opposite is true.”

In the spreadsheet, explicit input data is presented to hidden code (that is, formulas) and the result of code execution is then rendered in the form of transformed data. In many “working” spreadsheets, partial steps (“a line of code at a time”) are calculated across parallel columns, with the spreadsheet giving a macroscopic view over partial transformations of the data a step at a time, before returning the final calculations in the final column, or in the form of an interpreted display such as a graphical chart.

One of the oft-quoted criticisms against Jupyter notebooks is that “the state is hidden” (although if you treat Jupyter notebooks as a linear narrative and read and execute them as such, this claim is just so much nonsense…), but it suggests viewing notebooks in a complementary way: rather than the parallel columnar cells of the Excel case, where a function at the top of a column may be applied to data values from previous columns, you have a top-down linear exposition of the calculation, where one code cell at a time is used to transform the state generated by the previous cell. (One of the ways I construct notebooks is to take an input dataset and display it at the top of the notebook, apply a line of code to transform it and display the result of that transformation, apply another line of code in another cell and view the result of that, and so on.) You can then see not only the state of the data after each transformative step, but also the formula (the line of code) that generated it from data rendered at an earlier step.
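Here’s a minimal sketch of that style of working, with comments marking the notional cell boundaries (made-up data; in the notebook, the bare expression at the end of each cell renders the current state):

```python
import pandas as pd

# Cell 1: load and display the input dataset at the top of the notebook.
df = pd.DataFrame({"town": ["Newport", "Cowes", "Ryde", "Newport"],
                   "sales": [10, 25, 7, 12]})
df

# Cell 2: apply one transformation and display the result.
grouped = df.groupby("town")["sales"].sum()
grouped

# Cell 3: another single step, again rendering the new state.
grouped.sort_values(ascending=False)
```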

Again picking up on the criticism of notebooks that at any given time you may read a notebook as a cacophony of incoherent, partially executed state: a situation that may occur if you run a notebook to completion, then maybe change the input data at the top and run it half way, and then change the input data at the top again and run just the first few cells, leaving a notebook with rendered data everywhere, resulting from executions over different sets of data. This corresponds to a spreadsheet worksheet where you perhaps have to click on each column in turn and hit return before its cells are updated, and where cell updates are only responsive to the action that triggers an update on selected cells. But if you get into the habit of only executing notebook cells using a restart-kernel-then-run-all execution model (which an extension could enforce), then this nonsense does not occur, and all the cells are updated, in linear order.
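If you want to guarantee that a saved notebook only ever reflects a clean top-to-bottom run, one option, outside of a UI extension, is to execute it programmatically against a fresh kernel; a minimal sketch using nbformat and nbclient (the notebook filenames are made up):

```python
import nbformat
from nbclient import NotebookClient

# Load the notebook, run every cell in order against a fresh kernel,
# and save the fully, linearly executed result.
nb = nbformat.read("analysis.ipynb", as_version=4)
client = NotebookClient(nb, timeout=600, kernel_name="python3")
client.execute()
nbformat.write(nb, "analysis-run.ipynb")
```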

And again, here there is a point of contrast: in the spreadsheet setting, any column or selection of cells may be created by the application of a formula to any other collection of cells in the workbook. In a Jupyter notebook, if you use the restart-kernel-then-run-all execution model, the rendering of data as the output of each code cell is a linear sequence. (There are other notebook extensions that let you define dependent cells, which could transform the execution order into a non-linear one, but why would you do that..?)

Things can still get messy, though. For example, from another, less recent (March, 2020) tweet I found in the wild: “I just realized that making plots in Excel has the same ‘bad code’ smell for me that doing research in a Jupyter notebook does: you’re mixing analysis, data, and results in a way that doesn’t let you easily reuse bits for another later analysis” (@rharang). But then, reuse is another issue altogether.

Thinks: one of the things I think I need to think about is how the spatial layout of a spreadsheet might map onto the spatial layout of a notebook. It might be interesting to find some “realistic” spreadsheets containing plausible business-related calculations and give them a notebook treatment…

Anyway, here’s a fuller excerpted transcript from the podcast:

Patrick: I’m curious how you think about the transition to what’s now being called the no-code movement. The first part of the question is, how under supplied is the world in terms of just talented software developers? But may that potentially not be as big a problem if we do get no-code tools that would allow someone like me that has dabbled in but is certainly not terribly technical on software, more so in data science to build things for myself and not need engineers. What do you think that glide path looks like over the next say 10 years?

John Collison: The answer to how short staffed we are on engineers is still clearly loads, …, [W]e’re still really short of software engineers.

And no-code, I don’t think no-code is fully a panacea, because I think the set of at even when you’re doing no-code, you’re still reasoning about the relations between different objects and data flows and things like that. And so I think when you’re doing, when you’re building an app with Zapier or something like that, you’re still doing a form of engineering, you’re just not necessarily writing codes. And so hopefully that’s something that can give leverage to people without necessarily needing to have to spend quite as much time in it. And this is not new by the way, if you look at Excel, no one calls as a no-code tool, but Excel, I think is one of the most underappreciated programming environments in the world. And the number of Excel programmers versus people using how we think of as more traditional languages is really something to behold.

And I actually think lots of features of Excel that make it a really nice programming environment and really nice to learn in, where the fact that it’s continuously executed means that unlike you running your code and it doesn’t work, and you’ve got some error, that’s hard to comprehend. Instead, you have a code that just continuously executed in the form of the sheets you see in front of you. And similarly the fact that its individual cells, and you kind of lay out the data spatial… Or the program spatially, where the code and the data is interspersed together, and no one part of it can get too big and diffuse. Anyway, I think there are all of these ways in which, anyone who’s developing a no-code or new software paradigm should look at Excel because so many people have managed to essentially learn how to do some light programming from looking at other people’s models and other people’s workbooks and kind of emulating what they see.

… I don’t think no-code will obviate the need for software programmers, I would hope that it can make many more people able to participate in software creation and kind of smooth the on ramp, which is right now, there’s like a really sharp, vertical part of that one.

Some of this sentiment resonates with one of my motivations for “why code?”: it gives people a way of looking at problems that helps them understand the extent to which they may be computable, or may be decomposed, as well as giving them a tool that allows them to automate particular tasks, or build other tools that help them get stuff done.

See also:

And one to watch again:

Family Faux Festivals ish-via Clashfinder

However many weeks we are into lockdown by now, we’ve been dabbling in various distributed family entertainments, from quizzes to online escape rooms. We’ve also already missed two festivals — Bearded Theory and the Isle of Wight Festival — with more to not come: Rhythm Tree, Festival at the Edge and Beautiful Days.

When we do go to festivals, I tend to prep by checking out the relevant Clashfinder site, listening to a couple of tracks from every band listed, figuring out which bands I intend to see and printing off enough copies of the Clashfinder listing to have spares.

With no festivals upcoming, I floated the idea we programme our own faux festival on the Clashfinder site, with each person getting two stages to programme as desired: a mid-size one and a smaller one.

Programming on the Clashfinder site means adding an act to a stage at a particular time and for a particular duration; you can optionally add various bits of metadata, such as the band’s name, homepage, or a Youtube video:

In the setup page for the particular Clashfinder site, you can enable automatic tagging: the system will try to identify the act and automatically add MusicBrainz metadata and generate relative links from it. Alternatively, you can disable this feature and the links you provide will be used as the link destinations:

On the public page for the festival, hovering over an act pops up a dialogue that lets you click through on any added links, such as any Youtube link you may have added:

As well as the graphical editor, there is also a text editing option, which gives you more of a data-centric view:

You can also export the data as CSV, Excel, JSON, XML etc. There’s also an Excel import facility.

So…

Data…

One of the things I pondered was whether I could knock up a thing that would play out the festival in real time, or “as if” real time, where you pretend it’s a particular day of the festival and play out the videos in real time as if it were that day.

Here’s my first attempt:

It’s a single web page app that uses the data (manually copied over at the moment) from the Clashfinder site and lets you view the festival in real time or as-if real time.

The broadcast model is faked: the client web page checks the time and, if an act is on at that time, the video plays. If there’s no act scheduled at any particular time, you get a listing for that stage for that day, with a line through the acts you’ve missed.
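The actual player is a single web page, but the look-up logic is simple enough to sketch in Python; the schedule structure here is made up (it could equally be derived from the Clashfinder CSV/JSON export):

```python
from datetime import datetime

schedule = [
    # (stage, start, end, act, youtube_url) -- all values hypothetical
    ("Main", "2020-06-13 20:00", "2020-06-13 21:00", "Band A", "https://youtu.be/xxxx"),
    ("Main", "2020-06-13 21:30", "2020-06-13 22:45", "Band B", "https://youtu.be/yyyy"),
]

def now_playing(stage, when=None):
    """Return the act scheduled on a stage at a given time, if any."""
    when = when or datetime.now()
    for s, start, end, act, url in schedule:
        if s == stage and datetime.fromisoformat(start) <= when < datetime.fromisoformat(end):
            return act, url
    return None  # nothing scheduled: fall back to showing the day's listing

print(now_playing("Main", datetime.fromisoformat("2020-06-13 20:30")))
```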

Ideally, you want to schedule videos that are not part of a playlist. If a video is in a playlist, then when the video finishes, the next video seems to autoplay, which is a real pain if your scheduled slot extends more than a few seconds past the end time of the video…

(Hmm… I wonder, could you set an end time past the end of the video to see if that pauses autoplay of the next item in the playlist? Or maybe pass in a playlist with a dummy video, perhaps relating to your faux festival, to play in the immediate aftermath of an act video whilst still in their scheduled slot time?)
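For what it’s worth, the YouTube embed player does accept start and end parameters (in seconds), so a trivial helper for constructing slot-limited embed URLs might look like the following; whether setting end also suppresses playlist autoplay is exactly the untested question above:

```python
def embed_url(video_id, start=None, end=None, autoplay=True):
    """Build a YouTube embed URL for a scheduled slot (video id is a placeholder)."""
    params = {"autoplay": int(autoplay)}
    if start is not None:
        params["start"] = start
    if end is not None:
        params["end"] = end
    query = "&".join(f"{k}={v}" for k, v in params.items())
    return f"https://www.youtube.com/embed/{video_id}?{query}"

print(embed_url("VIDEO_ID", start=0, end=3600))
```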

On the to do list is a simple templated Github repo that lets you submit a Clashfinder URL as an issue and will then build and publish your site for you (eg using something akin to this proof-of-concept approach) using Github Pages.

This approach would work equally for scheduling faux conferences, schools programming, etc. The content play out is synchronised and locally pulled, rather than broadcast. If you want to get social, use whatever social networking channel you prefer.

Essentially, it’s using Clashfinder to schedule the play-out of stage-based Youtube playlists.

Note that if there’s a bunch of you scheduling things on the same Clashfinder event, there are no locks, so you need to refresh and update regularly; otherwise you could find that the stale page you’ve had open in edit mode for the last three days, to which you then make a single typo change, has wiped out the hundreds of commits the rest of your gang has made over the previous three days.

There’s lots of fettling I still want to do to the template page, but even in its current bare-bones state, it sort of works…

First Foray into the Reclaim Cloud (Beta) – Running a Personal Jupyter Notebook Server

For years and years I’ve been hassling my evil twin brother (it’s a long story) Jim Groom about getting Docker hosting up and running as part of Reclaim, so when an invite to the Reclaim Cloud beta arrived today (thanks, Jim:-), I had a quick play (with more to come in following days and weeks, hopefully… or at least until he switches my credit off;-)

For an early example of how to get JupyterHub up and running on Reclaim Cloud, see https://github.com/ousefulReclaimed/jupyterhub-docker/ . Best practice around this currently (July ’21) seems to be Tim Sherratt’s (@wragge) GLAM Workbench on Reclaim Cloud recipes.

The environment is provided by Jelastic (I’m not sure how the business model will work, eg in terms of what’s being licensed and what’s being resold…?).

Whilst there are probably docs, the test of a good environment is how far you can get by just clicking buttons, so here’s a quick recap of my first foray…

Let’s be having a new environment then..

Docker looks like a good choice:

Seems like we can search for public DockerHub containers (and maybe also private ones if we provide credentials?).

I’ll use one of my own containers, that is built on top of an official Jupyter stack container:

Select one, click next, and a block is highlighted to show we’ve configured it…

When you click apply, you see loads of stuff available…

I’m going to cheat now… the first time round I forgot a step, and that step was setting a token to get into the Jupyter notebook.

If you look at my repo docs for the container I selected, you see that I recommend setting the Jupyter login token via an environment variable…

In the confusing screen, there’s a {...} Variables option that I guessed might help with that:

Just in passing, if your network connection breaks in a session, you get a warning and it tries to reconnect after a short period:

Apply the env var and hit the create button on the bewildering page:

And after a couple of minutes, it looks like we have a container running on a public IP address:

Which doesn’t work:

And it doesn’t work because the notebook isn’t listening on port 80; it autostarts on port 8888. So we need to look for a port map:

A bit of guessing now – we probably want an http port, which nominally maps, or at least defaults, to port 80? And then map that to the port the notebook server is listening on?

Add that and things now look like this as far as the endpoints go:

Try the public URL again, on the insecure http address:

Does Jim Rock?

Yes he does, and we’re in…

So what else is there? Does it work over https?

Hmmm… Let’s go poking around again and see if we can change the setup:

So, in the architecture diagram on the left, if we click the top Balancing block, we can get a load balancer and reverse proxy, which are the sorts of thing that can often handle certificates for us:

I’ll go for Nginx, cos I’ve heard of that…

It’s like a board game, isn’t it, where you get to put tokens on your personal board as you build your engine?! :-)

It takes a couple of mins to fire up the load balancer container (which is surely what it is?):

If we now have a look in the marketplace (I have to admit, I’d had skimmed through this at the start, and noticed there was something handy there…) we can see a Let’s Encrypt free SSL certificate:

Let’s have one of those then…

I’ll let you into another revisionist secret… I’d tried to install the SSL cert without the load balancer, but it refused to apply it to my container… and it really looked like it wanted to apply to something else. Which is what made me think of the Nginx server…

Again we need to wait for it to be applied:

When it is, I don’t spot anything obvious to show the Let’s Encrypt cert is there, but I did get a confirmation (not shown in screenshots).

So can we log in via https?

Bah.. that’s a sort of yes, isn’t it? The cert’s there:

but there’s http traffic passing through, presumably?

I guess I maybe need another endpoint? https onto port 8888?

I didn’t try at the time — that’s for next time — because what I actually did was to save Jim’s pennies…

And confirm…

So… no more than half an hour from a zero start (I was actually tinkering whilst on a call, so only half paying attention too…).

As for the container I used, that was built and pushed to DockerHub by other tools.

The container was originally defined in a Github repo to run on MyBinder using not a Dockerfile, but requirements.txt and apt.txt text files in a binder/ directory.

The Dockerhub image was built using a Github Action:

And for that to be able to push from Github to DockerHub, I had to share my DockerHub username and password as a secret with the Github repo:

But with that done, when I make a release of the repo, having tested it on MyBinder, an image is automatically built and pushed to Dockerhub. And when it’s there, I can pull it into Reclaim Cloud and run it as my own personal service.
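(For reference, the rough local equivalent of what I ended up with — pulling the image, passing the notebook token in via an environment variable, and mapping public port 80 onto the notebook server’s port 8888 — might look something like this using the Docker Python SDK; the image name and the JUPYTER_TOKEN variable are illustrative, so check the repo docs for what the container actually expects:)

```python
import docker

client = docker.from_env()
container = client.containers.run(
    "example/ouseful-notebook:latest",        # hypothetical DockerHub image name
    detach=True,
    environment={"JUPYTER_TOKEN": "letmein"},  # assumed token variable; check the repo docs
    ports={"8888/tcp": 80},                    # host port 80 -> container port 8888
)
print(container.status)
```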

Thanks, Jim..

PS It’s too late to play more today now, and this blog post has taken twice as long to write as it took me to get a Jupyter notebook server up and running from scratch, but things on my to do list next are:

1) see if I can get the https access working;

2) crib from this recipe and this repo to see if I can get a multi-user JupyterHub with a DockerSpawner up and running from a simple Docker Compose script. (I can probably drop the Traefik proxy and Let’s Encrypt steps and just focus on the JupyterHub config; the Nginx reverse proxy can then fill the gap, presumably…)