Take, Take, Take…

Alan recently posted (Alan recently posts a lot…:-) about a rabbit hole he fell down when casually eyeing something in his web stats (Search, Serendipity, Semantically Silent).

Here’s what I see from my hosted WordPress blog stats:

  • traffic from Google but none of the value about what folk were searching for, although Google knows full well. WordPress also have access to that data from the Google Analytics tracking code they force into my hosted WordPress blog, but I don’t get to see it unless I pay for an upgrade…
  • traffic from pages on educational sites that I can’t see because they require authentication; I don’t even know what the course was on… So how can I add further value back to support that traffic?
  • occasional links from third party sites back in the day when people blogged and included links…

See also: Digital Dementia – Are Google Search and the Web Getting Alzheimer’s? etc…

Fragment – Jupyter For Edu

With more and more core components, as well as user contibutions, being added to the Jupyter framework, I’m starting to lose track of what’s possible. One of the things I might be useful for the OU, and Institute of Coding, context is to explore various architectural patterns that can be constructed in a Jupyter mediated environment that are particular useful for education.

In advance of getting a Github repo / wiki together to start that, here are a few fragments my my feeds, several of which have appeared in just the last couple of days:

Jupyter Enterprise Gateway Now a Top Level Jupyter Project

Via the Jupyter blog, I see the Jupyter Enterprise Gateway is now a top-level Jupyter project.

The Jupyter Enterprise Gateway “enables Jupyter Notebook to launch remote kernels in a distributed cluster“, which provides a handy separation between a notebook server (or Jupyterhub multi-user notebook server) and the kernel that a notebook runs against. For example, Jupyter Enterprise Gateway can be used to create kernels in a scaleable way using Kubernetes, or (I’m guessing…?) to do things like launch remote kernels running on a GPU cluster. From the docs it looks like Jupyter Enterprise Gateway  should work in a Jupyterhub context, although I can’t offhand find a simple howto / recipe for how to do that. (Presumably, Jupyterhub creates and launches user specific notebook server containers, and these then create and connect to arbitrary kernel running back-ends via the Jupyter Enterprise Gateway? Here’s a related issue I found.)

Running Notebook Cells One at a Time in a Terminal

The ever productive Doug Blank has a recipe for stepping through notebook cells in a terminal [code: nbplayer]. The player launches an IPython terminal that displays the first cell in the notebook and lets you step through them (executing or skipping the cell) one at a time. You can also run your own commands in between stepping through the notebook cells.

I can imagine using this to create a fixed set of steps for an activity that I want a student to work through, whilst giving them “free time” to explore the state of current execution environment, for example, or try out particular “given” functions with different parameters. This approach also provides a workaround for using notebook authored exercises in the terminal environment, which I know some colleagues favour over the notebook environment.

On my to do list is recast some of the activities from the new TM112 course to see how they feel using this execution model, and then compare that to the original activity and the activity run using the same notebook in a notebook environment.

Adding Multiple Student Users to a Jupyterhub Environment

Also via Doug Blank, a recipe for adding multiple users to a Jupyterhub environment using a form that allows you to simply add a list of user names: a more flexible way of adding accounts to Jupyterhub. User account details and random passwords are created automatically and then emailed to students.

To allow users to change passwords, e.g. on first run, I think the NotebookApp.allow_password_change=True notebook server parameter (Jupyter notebook – Config file and command line options) allows that?

The repo also shows a way of bundling nbviewer to allow users to “publish” HTML versions of their notebooks.

Doug also points to yuvipanda/jupyterhub-firstuseauthenticator, a first use authenticator for Jupyterhub that allows new users to create an account and then set a password on it. This could be really handy for workshops, where you want to allow uses to self-serve an environment that persists over a couple of workshop sessions, for example. (One thing we still need to do in the OU is get a Jupyterhub server up and running with persistent user storage; for TM112, we ran a temporary notebook server, which meant students couldn’t save and return to notebooks on the server – they’d have to download notebooks and then re-upload them into a new session if they wanted to return to working on a notebook they had modified. That said, the activity was designed as a “displosable” activity…)

Zip All Notebooks

This handy extension — nbzipprovides a button to zip and download a Jupyter notebook server folder.  If you’re working on a temporary notebook server, this provides and easy way of grabbing all the notebooks in one go. What might be even nicer would be to select a sub-folder, or selected set of files, using checkbox selectors? I’m not sure if there’s a complementary tool that will let you upload a zipped archive and unpack it in one go?

Fragment – Running Multiple Services, such as Jupyter Notebooks and a Postgres Database, in a Single Docker Container

Over the last couple of days, I’ve been fettling the build scripts for the TM351 VM, which typically uses vagrant to build a VirtualBox VM from a set of shell scripts, so they can be used to build a single Docker container that runs all the TM351 services, specifically Jupyter notebooks, OpenRefine, PostgreSQL and MongoDB.

Docker containers are typically constructed to a run a single service, with compositions of containers wired together using Docker Compose to create applications that deliver, or rely on, more than one running service. For example, in a previous post (Setting up a Containerised Desktop API server (MySQL + Apache / PHP 5) for the ergast Motor Racing Data API) I showed how to set up a couple of containers to work together, one running a MySQL database server, the other an http service that provided an API to the database.

So how to run multiple services in the same container? Docs on the Docker website suggest using supervisord to run multiple services in a single container, so here’s a fragment on how I’ve done that from my TM351 build.

To begin with, I’ve built the container up as a tiered set of containers, in a similar way to the way the stack of opinionated Jupyter notebook Docker containers are constructed:

#Define a stub to identify the images in this image stack

# minimal
## Define a minimal container, eg a basic Linux container
## using whatever flavour of Linux we prefer
docker build --rm -t ${IMAGESTUB}-minimal-test ./minimal

# base
## The base container installs core packages
## The intention is to define a common build environment
## populated with packages likely to be common to many courses
docker build --rm --build-arg BASE=${IMAGESTUB}-minimal-test -t ${IMAGESTUB}-base-test ./base


One of the things I’ve done to try to generalise the build steps is allow the name a base container to be used to bootstrap a new one by passing the name of the base image in via an optional variable (in the above case, --build-arg BASE=${IMAGESTUB}-minimal-test). Each Dockerfile in a build step directory uses the following construction to work out which image to use as the FROM basis:

#Set ARG values using --build-arg =
#Each ARG value can also have a default value
ARG BASE=psychemedia/ou-tm351-base-test

Using the same approach, I have used separate build tiers for the following components:

  • jupyter base: minimal Jupyter notebook install;
  • jupyter custom: add some customisation onto a pre-existing Jupyter notebook install;
  • openrefine: add the OpenRefine application; (note, we could just use BASE=ubuntu to create this a simple, standalone OpenRefine container);
  • postgres: create a seeded PostgreSQL database; note, this could be split into two: a base postgres tier and then a customisation that adds users, creates and seed databases etc;
  • mongodb: add in a seeded mongo database; again, the seeding could be added as an extra tier on a minimal database tier;
  • topup: a tier to add in anything I’ve missed without having to go back to rebuild from an earlier step…

The intention behind splitting out these tiers is that we might want to have a battle hardened OU postgres tier, for example, that could be shared between different courses. Alternatively, we might want to have tiers offering customisations for specific presentations of a course, whilst reusing several other fixed tiers intended to last out the life of the course.

By the by, it can be quite handy to poke inside an image once you’ve created it to check that everything is in the right place:

#Explore inside animage by entering it with a shell command
docker run -it --entrypoint=/bin/bash psychemedia/ou-tm351-jupyter-base-test -i

Once the services are in place, I add a final layer to the container that ensures supervisord is available and set up with an appropriate supervisord.conf configuration file:

#Final tier Dockerfile
ARG BASE=psychemedia/testpieces

USER root
RUN apt-get update && apt-get install -y supervisor

RUN mkdir -p /openrefine_projects  && chown oustudent:100 /openrefine_projects
VOLUME /openrefine_projects

RUN mkdir -p /notebooks  && chown oustudent:100 /notebooks
VOLUME /notebooks

RUN mkdir -p /var/log/supervisor
COPY monolithic_container_supervisord.conf /etc/supervisor/conf.d/supervisord.conf


CMD ["/usr/bin/supervisord"]

The supervisord.conf file is defined as follows:

##We can check running processes under supervisord with: supervisorctl

#The HOME envt needs setting to the correct USER
#otherwise jupyter throws: [Errno 13] Permission denied: '/root/.local'

#Note the auth is a bit ropey on this atm!
command=/usr/local/bin/jupyter notebook --port=8888 --ip= --y --log-level=WARN --no-browser --allow-root --NotebookApp.password= --NotebookApp.token=
#The directory we want to start in
#(replaces jupyter notebook parameter: --notebook-dir=/notebooks)

command=/usr/lib/postgresql/9.5/bin/postgres -D /var/lib/postgresql/9.5/main -c config_file=/etc/postgresql/9.5/main/postgresql.conf

command=/usr/bin/mongod --dbpath=/var/lib/mongodb --port=27351

command=/opt/openrefine-3.0-beta/refine -p 3334 -i -d /vagrant/openrefine_projects

One thing I need to do better is to find a way to stage the construction of the supervisord.conf file, bearing in mind that multiple tiers may relate to the same servicel for example, I have a jupyter-base tier to create a minimal Jupyter notebook server and then a jupyter-base-custom tier that adds in specific customisations, such as branding and course related notebook extensions.

When the final container is built, the supervisord command is run and the multiple services started.

One other thing to note: we’re hoping to run TM351 environments on an internal OpenStack cluster. The current cluster only allows students to expose a single port, and port 80 at that, from the VM (IP addresses are in scant supply, and network security lockdowns are in place all over the place). The current VM exposes at least two http services: Jupyter notebooks and OpenRefine, so we need a proxy in place if we are to expose them both via a single port. Helpfully, the nbserverproxy Jupyter extension (as described in Exposing Multiple Services Via a Single http Port Using Jupyter nbserverproxy), allows us to do just that. One thing to note, though – I had to enable it via the same user that launches the notebook server in the suoervisord.conf settings:

##Dockerfile fragment

RUN $PIP install nbserverproxy

USER oustudent
RUN jupyter serverextension enable --py nbserverproxy
USER root

To run the VM, I can call something like:

docker run -p 8899:8888 -d psychemedia/tm351dockermonotest

and then to access the additional services, I can browse to e.g. localhost:8899/proxy/3334/ to see the OpenRefine application.

PS in case you’re wondering why I syndicated this through RBloggers too, the same recipe will work if you’re using Jupyter notebooks with an R kernel, rather than the default IPython one.

More Cookie Notices…

One of the cookie acceptance notices I’ve started noticing on first visits to several sites lately comes from TrustArc:


This categorises cookies into three classes — required, functional and advertising — and lets you make decisions on whether to accept those cookies at the category level or the individual provider level if look at the Detailed Settings:


However, opting out at the category level doesn’t necessarily mean you have opted out of all the cookies provided in that category:


So whilst I like things like the TrustArc display in principle, it would be nicer if it also had a percentage bar display, for example, showing the percentage of cookie providers that were successfully opted out from in that category?

I’m not quite sure how cookie opt-outs work. I can see how it’d work on the user side – e.g. follow a link provided by someone like IgnitionOne – to a page that sets opt-out cookies in your browser, but how about a publisher setting the opt-out from a third party on your behalf? This description — OpenX – how the cookie-based opt-out mechanism works — suggests that the publisher calls a link and then “OpenX sets the user opt-out cookie and performs a test to see that the user accepted the cookie“, but how do the bits of data flow and what state or flags are set where? Another one to add to the “I don’t understand how this actually works; try to find out at some point” list…

PS From the Forbes website, a slider to control cookie levels:


Data Ethics – Guidance and Code – And a Fragment on Machine Learning Models

By chance, I notice that the Department for Digital, Culture, Media & Sport (DDCMS) have published a guidance on a Data Ethics Framework, with an associated workbook, that is intended to guide the design of appropriate data use in government and the wider public sector.

The workbook is based around providing answers to questions associated with seven principles:

  1. A clear user need and public benefit
  2. Awareness of relevant legislation and codes of practice
  3. Use of data that is proportionate to the user need
  4. Understanding of the limitations of the data
  5. Using robust practices and working within your skillset
  6. Making work transparent and being accountable
  7. Embedding data use responsibly

[See also this Ethics Framework from the Digital Catapult’s Machine Intelligence Garage Ethics Committee which identifies: 1. Be clear about the benefits of your product or service; 2. Know and manage your risks; 3. Use data responsibly; 4. Be worthy of trust; 5. Promote diversity, equality and inclusion; 6. Be open and understandable in communications; 7. Consider your business model. The framework expands on each point in a similar manner to the approach shown below.]

It’s likely that different sector codes will start to appear, such as this one from the Department of Health & Social Care (DHSC): Initial code of conduct for data-driven health and care technology. (See also the related Data security and protection for health and care organisations guidance document.) In this case, the code incorporates ten principles:

1 Define the user
Understand and show who specifically your product is for, what problem you are trying to solve for them, what benefits they can expect and, if relevant, how AI can offer a more efficient or quality-based solution. Ensure you have done adequate background research into the nature of their needs, including any co-morbidities and socio-cultural influences that might affect uptake, adoption and ongoing use.

2 Define the value proposition
Show why the innovation or technology has been developed or is in development, with a clear business case highlighting outputs, outcomes, benefits and performance indicators, and how exactly the product will result in better provision and/or outcomes for people and the health and care system.

3 Be fair, transparent and accountable about what data you are using
Show you have utilised privacy-by-design principles with data-sharing agreements, data flow maps and data protection impact assessments. Ensure all aspects of GDPR have been considered (legal basis for processing, information asset ownership, system level security policy, duty of transparency notice, unified register of assets completion and data privacy impact assessments).

4 Use data that is proportionate to the identified user need (data minimisation principle of GDPR)
Show that you have used the minimum personal data necessary to achieve the desired outcomes of the user need identified in 1.

5 Make use of open standards
Utilise and build into your product or innovation, current data and interoperability standards to ensure you can communicate easily with existing national systems. Programmatically build data quality evaluation into AI development so that harm does not occur if poor data quality creeps in.

6 Be transparent to the limitations of the data used and algorithms deployed
Show you understand the quality of the data and have considered its limitations when assessing if it is appropriate to use for the defined user need. When building an algorithm be clear on its strengths and limitations, and show in a transparent manner if it is your training or deployment algorithms that you have published.

7 Make security integral to the design
Keep systems safe by integrating appropriate levels of security and safeguarding data.

8 Define the commercial strategy
Purchasing strategies should show consideration of commercial and technology aspects and contractual limitations. You should only enter into commercial terms in which the benefits of the partnerships between technology companies and health and care providers are shared fairly.

9 Show evidence of effectiveness for the intended use
You should provide evidence of how effective your product or innovation is for its intended use. If you are unable to show evidence, you should draw a plan that addresses the minimum required level of evidence given the functions performed by your technology.

10 Show what type of algorithm you are building, the evidence base for choosing that algorithm, how you plan to monitor its performance on an ongoing basis and how you are validating performance of the algorithm.

One of the discussions we often have when putting new courses together is how to incorporate ethics related  issues in a way that makes sense (which is to say, in a way that can be assessed…) One way might to apply things like the workbook or the code of conduct to a simple case study. Creating appropriate case studies can be a challenge, but via an O’Reilly post, I note that in a joint project between the Center for Information Technology Policy and the Center for Human Values, both at Princeton, has recently produced a set of fictional case studies that are designed to elucidate and prompt discussion about issues in the intersection of AI and Ethics that cover a range of issues: an automated healthcare app (foundations of legitimacy, paternalism, transparency, censorship, inequality);  dynamic sound identification (rights, representational harms, neutrality, downstream responsibility); optimizing schools issues: (privacy, autonomy, consequentialism, rhetoric); law enforcement chatbots (automation, research ethics, sovereignty).

I also note that a recent DDCMS Consultation on the Centre for Data Ethics and Innovation has just closed… One of the topics of concern that jumped out at me related to IPR:

Intellectual Property and ownership Intellectual property rights protect – and therefore reward – innovation and creativity. It is important that our intellectual property regime keeps up with the evolving ways in which data use generates new innovations. This means assigning ownership along the value chain, from datasets, training data, source code, or other aspects of the data use processes. It also includes clarity around ownership, where AI generates innovations without human input. Finally, there are potentially fundamental questions around the ownership or control of personal data, that could heavily shape the way data-driven markets operate.

One of the things I think we are likely to see more of is a marketplace in machine learning models, either sold (or rented out?) as ‘fixed’ or ‘further trainable’, building on the the shared model platforms that are starting to appear; (a major risk here, of course, is that models with built in biases – or vulnerabilities – might be exploited if bad actors know what models you’re running…). For example:

  • Seedbank [announcement], “a [Google operated] place to discover interactive machine learning examples which you can run from your browser, no set-up required“;
  • TensorFlow Hub [announcement], “a [Google operated] platform to publish, discover, and reuse parts of machine learning modules in TensorFlow“;
  • see also this guidance on sharing ML models on Google Cloud
  • Comet.ml [announcement], “the first infrastructure- and workflow-agnostic machine learning platform.“. No, me neither…

I couldn’t offhand spot a marketplace for Amazon Sagemaker models, but I did notice some instructions for how to import Your Amazon SageMaker trained model into Amazon DeepLens, so if model portability is a thing, the next thing Amazon will likely to is to find a way to take a cut from people selling that thing.

I wonder, too, if the export format has anything to do with ONNX, “an open format to represent deep learning models?


(No sign of Google there?)

How the IPR around these models will be regulated can also get a bit meta. If data is licensed to one party so they can train a model, should the license also cover who might make use of any models trained on that data, or how any derived models might be used?

And what counts as “fair use” data when training models anyway? For example, Google recently announced Conceptual Captions. “A New Dataset and Challenge for Image Captioning“. The dataset:

consist[s] of ~3.3 million image/caption pairs that are created by automatically extracting and filtering image caption annotations from billions of web pages.

So how were those images and text caption training data gathered / selected? And what license conditions were associated with those images? Or when compiling the data set, did Google do what Google always does, which is conveniently ignore copyright because it’s only indexing and publish search results, not actually re-publishing material (allegedly…).

Does that sort of consideration fall into the remit of the current Lords Communications Committee inquiry into The Internet: to regulate or not to regulate?, I wonder?

A recent Information Law and Policy Centre post on Fixing Copyright Reform: How to Address Online Infringement and Bridge the Value Gap, starts as follows:

In September 2016, the European Commission published its proposal for a new Directive on Copyright in the Digital Single Market, including its controversial draft Article 13. The main driver behind this provision is what has become known as the ‘value gap’, i.e. the alleged mismatch between the value that online sharing platforms extract from creative content and the revenue returned to the copyright-holders.

This made me wonder, is there a “mismatch” somewhere between:

a) the data that people share about themselves, or that is collected about them, and the value extracted from it;
b) the content qua data that web search engine operators hoover up with their search engine crawlers and then use as a corpus for training models that are used to provide commercial services, or sold / shared on?

There is also a debate to be had about other ways in which the large web cos seem to feel they can just grab whatever data they want, as hinted at in this report on Google data collection research.

The Growing Popularity of Jupyter Notebooks…

It’s now five years since we first started exploring the use of Jupyter notebooks — then known as IPython notebooks — in the OU for the course that became (and still is) TM351 Data management and analysis.

At the time, the notebooks were evolving fast (they still are…) but knowing the length of time it takes to produce a course, and the compelling nature of the notebooks, and the traction they already seemed to have, it felt like a reasonably safe technology to go with.

Since then, the Jupyter project has started to rack up some impressive statistics. Using Google BigQuery on public datasets, we can find how many times the Jupyter Python package is installed from PyPi, the centralised repository for Python package distribution, each month by querying the the-psf:pypi.downloads dataset (if you don’t like the code in this post, how else do you expect me to demonstrably source the supporting evidence…?!):

  STRFTIME_UTC_USEC(timestamp, "%Y-%m") AS yyyymm,
  COUNT(*) as download_count
    DATE_ADD(CURRENT_TIMESTAMP(), -1, "year"),

WHERE file.project="jupyter"
GROUP BY yyyymm

(Our other long term bet for TM351 was on the pandas Python package, and that has also gone from strength to strength…)

Google BigQuery datasets also include a Stack Overflow dataset (Stack Overflow is a go to technical question-and-answer site for developers), so something like the following crappy query will count the jupyter-notebook tagged questions appearing each year:

SELECT tags, COUNT(*) c,  year
 SELECT SPLIT(tags, '|') tags, YEAR(creation_date) year
  FROM [bigquery-public-data:stackoverflow.posts_questions] a
  WHERE YEAR(creation_date) >= 2014 AND tags LIKE '%jupyter-notebook%'
WHERE tags='jupyter-notebook'
GROUP BY year, tags

I had thought we might be able to use BigQuery to query the number of notebooks on Github (a site widely used by developers for managing code repositories and, increasingly, and educators for managing course notes/resources), but it seems that the Github public data tables [(bigquery-public-data:github_repos)] only represent a sample of 10% or so of public projects?

FWIW, here are a couple of sample queries on that dataset anyway. First, a count of projects identified as Jupyter Notebook projects:

FROM [bigquery-public-data:github_repos.languages]
WHERE language.name = "Jupyter Notebook"

And secondly, a count of .ipynb notebooks:

FROM [bigquery-public-data:github_repos.files]
#filter on files with a .ipynb suffix
WHERE RIGHT(path,6) = ".ipynb"

We can also look to see what the most popular Python packages imported into the notebooks are using a recipe I found here

numpy, matplotlib and pandas, then, perhaps not surprisingly… Here’s the (cribbed) code for that query:


#The query finds line breaks and then tries to parse import statements
  CASE WHEN package LIKE '%\\n",' THEN LEFT(package, LENGTH(package) - 4)
    ELSE package END as package, n
  SELECT REGEXP_EXTRACT( line, r'(?:\S+\s+)(\S+)' ) as package, count(*) as n
  FROM (
    SELECT SPLIT(content, '\n \"') as line
    FROM (SELECT * FROM [bigquery-public-data:github_repos.contents]
      WHERE id IN (
        SELECT id FROM [bigquery-public-data:github_repos.files]
         WHERE RIGHT(path, 6) = '.ipynb'))
       HAVING LEFT(line, 7) = 'import ' OR LEFT(line, 5) = 'from ')
  GROUP BY package

(I guess a variant of the above could be used to find out what magics are most commonly loaded into notebooks?)

That said, Github also seem to expose a count of files directly, as described in parente/nbestimate, from where the following estimates of growth in the the numbers of Jupyter notebook / .ipynb files on Github are taken:

The number returned by making the following request is approximate – maybe we get the exact count if we make an authenticated request?


On Docker hub, where various flavours of official notebook container are hosted, there’ve been over a million pulls of the datascience notebook:


Searching Docker hub for jupyter notebook also identifies over 5000 notebook related Docker container images at the current count, of which over 1500 are automatically built from Github repos.

Mentions of Jupyter notebooks are also starting to appear on ITJobswatch, a site that tracks IT and programming job ads in the UK:

By the by, if  you fancy exploring other sorts of queries, this simple construction will let you run a Google web search to look for .ipynb files that mention the word course or module on ac.uk websites…

filetype:ipynb site:ac.uk (course OR module)

Anyway, in the same way that it can take many years to get anything flying regularly on NASA space flights, OU courses take a couple of years to put together, and innovations often have to be proved to work for a couple of course presentations in another course before a second course will take them on; so it’s only now that we’re starting to see an uptick of interest in using Jupyter notebooks in other courses… but that interest is growing, and it’s looking like there’s a flurry of courses about to start production that are likely to be using notebooks.

We may have fallen a little bit behind places like UC Berkeley, where 50% of undergrads now take a core, Jupyter notebook powered, datascience course, with subject specifc bolt-on mini-modules, and notebooks now provide part of the university core infrastructure:

…but even so, there could still be exciting times ahead…:-)

Even more exciting if we could get the library to start championing it as a general resource…


Robot Workers?

A lazy post that does nothing much more than rehash and link bullet points from the buried lede that is someone else’s post…

It seems like folk over at the Bank of England have been in the news again about robots taking over human jobs (Bank of England chief economist [Andy Haldane] warns on AI jobs threat); this follows on from a talk earlier this year by Mark Carney at the Public Policy Forum in Toronto [slides] and is similar in kind to other speeches coming out of the Bank of England over the last few years (Are Robots Threatening Jobs or Are We Taking Them Ourselves Through Self-Service Automation?).

The interview(?) was presumably in response to a YouGov survey on Workers and Technology: Our Hopes and Fears associated with the launch of a Fabian Society and Community Commission on Workers and Technology.

(See also a more recent YouGov survey on “friends with robots” which asked “In general, how comfortable or uncomfortable do you think you would be working with a colleague or manager that was a robot?” and “Please imagine you had received poor service in a restaurant or shop from a robot waiter/ shop assistant that is able to detect tone and emotion in a human’s voice… Do you think you would be more or less likely to be rude to the robot, than you would to a human waiter/ shop assistant, or would there be no difference? (By ‘rude’, we mean raising your voice, being unsympathetic and being generally impolite… )“.)

One of the job categories that is being enabled by automation is human trainers that help generate the marked up data that feeds the machines. A recent post on The Lever, “Google Developers Launchpad’s new resource for sharing applied-Machine Learning (ML) content to help startups innovate and thrive” [announcement] asks Where Does Data Come From?. The TLDR answer?

  • Public data
  • Data from an existing product
  • Human-in-the-loop (e.g. a human driver inside an “autonomous” vehicle)
  • Brute force (e.g. slurping all the data you can find; hello Google/Facebook etc etc)
  • Buying the data (which means someone is also selling the data, right?)

A key part of many machine learning approaches is to use labelled datasets that the machine learns from. This means taking a picture of a face, for example, that a human has annotated with areas labelled “eyes”, “nose”, “mouth”, and then training the ‘pootah to try to identify particular features in the photographs that allow the machine to identify those labels with those features, and hopefully the corresponding elements in a previously unseen photo.

Here’s a TLDR summary of part of the Lever post, concerning where these annotations come from:

  • External annotation service providers
  • Internal annotation team
  • Getting users to generate the labels (so the users do folk in external annotation service providers out of a job…)

The post also identifies several companies that provide external annotation services… Check them out if you want to get a glimpse of a future of work that involves feeding a machine…

  • Mechanical Turk: Amazon’s marketplace for getting people to do piecemeal bits of work for pennies that other people often sell as “automated” services, which you might have though meant “computerised”. Which it is, in that a computer assigns the work to essential anonymous, zero hours contract workers. Where it gets really amusing is when folk create bots to do the “human work” that other folk are paying Mechanical Turk for
  • Figure Eight: a “Human-in-the-Loop Machine Learning platform transforms unstructured text, image, audio, and video data into customized high quality training data”… Sounds fun, doesn’t it? (The correct answer is probably “no”);
  • Mighty AI: “a secure cloud-based annotation software suite that empowers you to create the annotations you need for developing and validating autonomous perception systems”, apparently.. You get a sense of how it’s supposed to work from the blurb:
    • “Mighty Community”, a worldwide community that provides our customers with timely, high-quality annotations, offloading the burden to find, train, and manage a pool of annotators to generate ground truth data.
    • Expert global community allows for annotations 24 hours/day
    • Training on Mighty Tools eliminates annotator on-boarding time
    • Available at a moment’s notice to instantly scale customer annotation programs
    • Community members covered by confidentiality agreement
    • Automated annotation management process with Mighty Studio
    • Close integration with Mighty Quality eliminates the need to find and correct annotation errors
  • Playment: “With 300,000+ skilled workers ready to work round-the-clock on Playment’s mobile and web app, we can generate millions of labels in a matter of hours. As the workers do more work, they get better and Playment is able to accomplish much more work in lesser time.” (And then when they’ve done the work, the machine does the “same” work with reduced marginal cost… Hmm, thinks, how do the human worker costs (pennies per task) compare with the server costs for large ML services?)

Happy days to come, eh…?

Legislating For Mandatory Software Updates

From the Automated and Electric Vehicles Act 2018, I notice the following:

4 Accident resulting from unauthorised software alterations or failure to update software

(1) An insurance policy in respect of an automated vehicle may exclude or limit the insurer’s liability under section 2(1) for damage suffered by an insured person arising from an accident occurring as a direct result of—
(a) software alterations made by the insured person, or with the insured person’s knowledge, that are prohibited under the policy, or
(b) a failure to install safety-critical software updates that the insured person knows, or ought reasonably to know, are safety-critical.

(2) But as regards liability for damage suffered by an insured person who is not the holder of the policy, subsection (1)(a) applies only in relation to software alterations which, at the time of the accident, the person knows are prohibited under the policy.

(3) Subsection (4) applies where an amount is paid by an insurer under section 2(1) in respect of damage suffered, as a result of an accident, by someone who is not insured under the policy in question.

(4) If the accident occurred as a direct result of—
(a) software alterations made by an insured person, or with an insured person’s knowledge, that were prohibited under the policy, or
(b) a failure to install safety-critical software updates that an insured person knew, or ought reasonably to have known, were safety-critical,the amount paid by the insurer is recoverable from that person to the extent provided for by the policy.

(5) But as regards recovery from an insured person who is not the holder of the policy, subsection (4)(a) applies only in relation to software alterations which, at the time of the accident, the person knew were prohibited under the policy.

(6) For the purposes of this section—
(a) “software alterations” and “software updates”, in relation to an automated vehicle, mean (respectively) alterations and updates to the vehicle’s software;
(b) software updates are “safety-critical” if it would be unsafe to use the vehicle in question without the updates being installed.

It looks like this is the first time that “software update” appears in legislation..

Is this the start of things to come?

And what tools exist on the legislation.gov.uk and parliament.uk websites to make it easier to keep track of such phrases in passed, enacted and draft legislation?

PS Michael posted some useful thoughts lately on why the Parliament website isn’t for everyone. It also reminded of an important rule when writing bids for outreach activities: “if you claim the project event is for everybody, that means it’s targeted at nobody and won’t be funded”. After all, who should go, and why?

Algorithms That Work, A Bit… On Teaching With Things That Don’t Quite Work

Grant Potter picked up on a line in yesterday’s post on Building Your Own Learning Tools:

including such tools as helpers in an arts history course would help introduce students to the wider question of how well such algorithms work in general, and the extent to which tools or applications that use them can be trusted

The example in point related to the use of k-means algorithms to detect common colour values in images of paintings in order to produce a “dominant colour” palette for the image. The random seed nature of the algorithm means that if you run the same algorithm on the same image multiple times, you may get a different palette each time. Which makes a point about algorithms, if nothing else, and encourages you to treat them just like any other (un)reliable witness. In short, the tool demoed was a bit flakey, but no less useful (or instructive) for that…

Grant’s pick-up got me thinking that about another advantage of bringing interactive digital — algorithmic — tools into the wider curriculum using things like responsive Jupyter notebooks: by trying to apply algorithms to the analysis of things people care about, like images of paintings in art history, like texts in literature, the classics, or history, like maps in geography and politics, and so on, and showing how they can be a bit rubbish, we help students see the limits of the power of similar algorithms that are applied elsewhere, and help them develop healthy — and a slightly more informed — scepticism about algorithms through employing them in a context they (academically) care about.

Using machine “intelligence” for analysis or critique in arts or humanities also shows how many perspectived those subject matters can be, and how biases can creep in. (We can also use algorithmic exploration to dig one level deeper in the sciences. In this respect, when anyone quotes an average at me I think of Anscombe’s Quartet, or for a population trend, Simspon’s Paradox.)

Introducing computational tools as part of the wider curriculum, even in passing, also helps reveal what’s possible with a simple (perhaps naive) application of technology, and the possible benefits, limits and risks associated with such application.

(Academic algorithm researchers often tout things like a huge increase in performance on some test problem from 71.2% success to 71.3% for example, albeit at the expense of using a sizeable percentage of the world’s computers to run the computation. Which means 3 times in 10 it’s still wrong and the answer still needs checking all the time to catch those errors. If I can do the same calculation to 60% success on my Casio calculator watch, I still get a benefit 60% of the time, and the risk of failure isn’t that much different. You pays your money and you takes your choice.)

This also relates to “everyone should learn to code” dogma too. Putting barebones computational tools inline into course materials show’s how you can often achieve some sort of result with just a line or two of code, but more importantly, it shows you what sorts of thing can be done with a line or two of code, and what sorts of thing can’t be done that easily…

In putting together educational materials, we often focus on teaching people the right way to do something. But perhaps when it comes to code, rather than “everyone should learn to code”, maybe we need “everyone should experience the limitation of simple algorithms”? The solutionist good reason for using computational in methods the curriculum would be to show how useful computers can be (for some things) for automating, measuring, sorting and filtering. The resistance reason is that we can start to show people what the limits of the algorithms are, and how the computer answer isn’t always (isn’t often?) the right one.

My Festival Round-Up – Beautiful Days, 2018

Belatedly, a round-up of my top bands from Beautiful Days, our best festival (once again…) of the year…

Thursday is get in day, and no formal music events, with the stages kicking off on Friday. I left the obligatory Levellers acoustic session to catch the full Rews set opening up the main stage, and a cracking start to the festival proper:

They’ve got some great riffs and infectious lyrics, but I think the festival set list order could do with a tweak… A lot of the songs have an early hook, presumably to catch the interest of the Spotify / Youtube generation before they click away, but in a festival set when you’ve caught the interest of the audience, there’s an opportunity to give a tune bit more space and play with the instrumental breaks a bit more. The audience was asked for an opinion on a new song and I gave it a thumbs up, in general, but in need of a bit of reworking: let it build a bit more… Dropping one of the songs from the set list to give more space to songs that were rocking would have made this an even better set for me, but it was still a blast… They’re touring later in the year, so if you get a chance to see ’em, take it whilst the tickets are still affordable…

A quick saunter over to the Big Top to see the Army’s Justin Sullivan on a solo set, then back to the main stage for My Baby, the stand out act of the weekend for me… I’m not generally in to the dance thing, but they were just incredible, building songs beautifully and just FRICKIN’ AWESOME…

If you ever get a chance to see them in a festival setting, don’t miss it… Rews could learn a lot from them…

I’d also happily spend another lazy afternoon sipping Long Island Iced Tea and foot-tapping along with Kitty, Daisy and Lewis

For myself and many of the old gang I met up with at the festival, Feeder are the band whose songs you may remember well – and they have a lot of them – but for which you could never remember who played them. Solid, slick, but forgettably memorable…

Then it was Hives… meh.. irritating, irritating, irritating; but so irritating that a lot sat through it to see just how much more irritating they could be… I wandered off to see Suzanne Vega, who had a couple of session musicians alongside her stalwart guitarist, but the set was the same as the two piece offering and perhaps more suited to a theatrical setting than a festival alt-stage. TBH, for that hour or two, a little bit of me would rather have been back on the island watching 77/78

A slow start to Saturday – too much iced tea the day before – but whilst Kitty McFarlane’s story backed folk songs were quite beautiful (I’d love to see her at the Quay Arts) it wasn’t quite rockin’ enough to kick start my day, so I regret not catching more of Emily Capell who was boogie woogie -licious -lightful:

I’m not sure whether she has a thing going with any of the Spitfires, who were on after, but it’s the first time I’ve seen an early after support act: a) brag about how they’re the tightest band at the festival, and b) suggest everyone go to the clean toilets a walk away by the acoustic tent for a poo when then next main stage band are on…

Undeterred, I still gave the Spitfires a chance – should’ve brought me docks… It’d be good to see them at Strings to give the Orders a bit of a lesson in how to be’ave…

I’m still not sure how I get on with 3 Daft Monkeys… perhaps depends on the mood I’m in when I catch ’em… would be nice to see them again at Rhythm Tree.

Elephant Sessions, on the other hand, I have no doubts about… The video below doesn’t do the power of them justice… played loud in a big tent: brull-yunt. It’d be a bit of a trek to get them down to the Island but I’d give them a floor to sleep on any day…

(I also thought this was interesting – their tech spec / stage plan.)

Saturday headliners were the Manics… whilst I admire their performance, persistence and continued inventiveness, I’ve never really got on with them. I should perhaps have made it down to the front (Beautiful Days is a beautifully sized festival – you can get as close to the stage as you fancy and have a bit of a jig, though it can get a little boisterous from time to time…).

Sunday opened with homage to The Bar Steward Sons of Val Doonican, the second band I now rate from Barnsley (Hands Off Gretel being the other…). Lyrical reworkings of well known tunes, Paint ‘Em Back had me in stitches, and the audience management was incredible – an inflatable boat trip surf completely round the Big Top, and the whole tent, and a packed tent at that – down low for a jump; a perfick Sunday awakening…

The Bar Stewards took inspiration from another “comedy” act who were playing in the pub, Hobo Jones and the Junkyard Dogs.

We gave them a bit of a listen over a(nother) cocktail whilst sat on the hill, before heading back for Dub Pistols:

I skipped out with the last 2 or 3 three songs to go (not least because I needed a rest!) to check out Skinny Lister, but that was a mistake… should’a stayed at the Pistols for the full set… Oysterband a bit later didn’t really work for me either…

So that was pretty much it, then, (I missed Gogol Bordello in favour of hitting the tent to replenish the cocktail readymix bottles) until days end with the Levellers full set. I’d watched them spotting the lights the night before – although the lighting guy hadn’t been in an overly chatty mood (“concentrating….”)

Post festival shopping list:

  • Rews, My Baby, Elephant Sessions, Emily Capell, maybe the Spitfires.

I also saw a lot of Ferocious Dog t-shirts, so maybe something from them too…