Helluva Job – Still Not Running Libvirt vagrant plugin on a Mac…

Note to self after struggling for ages trying to install the vagrant-libvirt plugin on a Mac…

Revert to vagrant 2.0.0, then follow the recipe given by @ccosby.

 

Then more ratholes…

Try to start the libvirtd daemon with /usr/local/sbin/libvirtd, which I had hoped would write a socket connection file somewhere (perhaps /var/run/libvirt/libvirt-sock?) that I could use as the basis of a connection, but that didn’t seem to work :-(

Help (/usr/local/sbin/libvirtd --help) suggests:

Configuration file (unless overridden by -f):

$XDG_CONFIG_HOME/libvirt/libvirtd.conf

Sockets:

$XDG_RUNTIME_DIR/libvirt/libvirt-sock

TLS:

CA certificate:     $HOME/.pki/libvirt/cacert.pem

Server certificate: $HOME/.pki/libvirt/servercert.pem

Server private key: $HOME/.pki/libvirt/serverkey.pem

PID file:

$XDG_RUNTIME_DIR/libvirt/libvirtd.pid

but $XDG_RUNTIME_DIR doesn’t appear to be set and I can’t see anything in the local dir… Setting it to /var/run/ doesn’t seem to help either. So I’m guessing I need a more complete way of starting libvirtd, such as passing in a process definition / config file?
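For what it’s worth, the sort of thing I have in mind, setting the environment variables explicitly before launching the daemon and pointing it at a config file via -f, might look something like the following (an untested sketch, with guessed paths, wrapped in Python just so the environment is easy to fiddle with):

#libvirtd_launch.py - untested sketch; paths are guesses
import os, subprocess

env = dict(os.environ)
env["XDG_RUNTIME_DIR"] = os.path.expanduser("~/.cache/libvirt-run")
env["XDG_CONFIG_HOME"] = os.path.expanduser("~/.config")

#Make sure the runtime dir exists so libvirtd has somewhere to drop its socket
os.makedirs(os.path.join(env["XDG_RUNTIME_DIR"], "libvirt"), exist_ok=True)

#-f points at a config file, as per libvirtd --help
subprocess.run(["/usr/local/sbin/libvirtd",
                "-f", os.path.join(env["XDG_CONFIG_HOME"], "libvirt", "libvirtd.conf")],
               env=env)

Whether that actually produces a usable socket is exactly the open question…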

Take, Take, Take…

Alan recently posted (Alan recently posts a lot…:-) about a rabbit hole he fell down when casually eyeing something in his web stats (Search, Serendipity, Semantically Silent).

Here’s what I see from my hosted WordPress blog stats:

  • traffic from Google, but none of the detail about what folk were searching for, although Google knows full well. WordPress also have access to that data from the Google Analytics tracking code they force into my hosted WordPress blog, but I don’t get to see it unless I pay for an upgrade…
  • traffic from pages on educational sites that I can’t see because they require authentication; I don’t even know what the course was on… So how can I add further value back to support that traffic?
  • occasional links from third party sites back in the day when people blogged and included links…

See also: Digital Dementia – Are Google Search and the Web Getting Alzheimer’s? etc…

Fragment – Jupyter For Edu

With more and more core components, as well as user contributions, being added to the Jupyter framework, I’m starting to lose track of what’s possible. One of the things that might be useful in the OU, and Institute of Coding, context is to explore the various architectural patterns that can be constructed in a Jupyter mediated environment that are particularly useful for education.

In advance of getting a Github repo / wiki together to start that, here are a few fragments from my feeds, several of which have appeared in just the last couple of days:

Jupyter Enterprise Gateway Now a Top Level Jupyter Project

Via the Jupyter blog, I see the Jupyter Enterprise Gateway is now a top-level Jupyter project.

The Jupyter Enterprise Gateway “enables Jupyter Notebook to launch remote kernels in a distributed cluster”, which provides a handy separation between a notebook server (or Jupyterhub multi-user notebook server) and the kernel that a notebook runs against. For example, Jupyter Enterprise Gateway can be used to create kernels in a scalable way using Kubernetes, or (I’m guessing…?) to do things like launch remote kernels running on a GPU cluster. From the docs it looks like Jupyter Enterprise Gateway should work in a Jupyterhub context, although I can’t offhand find a simple howto / recipe for how to do that. (Presumably, Jupyterhub creates and launches user-specific notebook server containers, and these then create and connect to arbitrary kernel-running back-ends via the Jupyter Enterprise Gateway? Here’s a related issue I found.)

Running Notebook Cells One at a Time in a Terminal

The ever productive Doug Blank has a recipe for stepping through notebook cells in a terminal [code: nbplayer]. The player launches an IPython terminal that displays the first cell in the notebook and lets you step through them (executing or skipping the cell) one at a time. You can also run your own commands in between stepping through the notebook cells.

I can imagine using this to create a fixed set of steps for an activity that I want a student to work through, whilst giving them “free time” to explore the state of the current execution environment, for example, or try out particular “given” functions with different parameters. This approach also provides a workaround for using notebook-authored exercises in the terminal environment, which I know some colleagues favour over the notebook environment.

On my to do list is to recast some of the activities from the new TM112 course to see how they feel using this execution model, and then compare that with the original activity, and with the same notebook run in a notebook environment.

Adding Multiple Student Users to a Jupyterhub Environment

Also via Doug Blank, a recipe for adding multiple users to a Jupyterhub environment using a form that allows you to simply add a list of user names: a more flexible way of adding accounts to Jupyterhub. User account details and random passwords are created automatically and then emailed to students.

To let users change their password, e.g. on first run, I think the NotebookApp.allow_password_change=True notebook server parameter (Jupyter notebook – Config file and command line options) should do the trick?
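If so, a minimal sketch of the setting, either in the notebook server’s jupyter_notebook_config.py or passed in on the command line, would be something like this (not something I’ve tested in that setup yet):

#jupyter_notebook_config.py fragment (sketch)
#Let logged in users change their password from the notebook web UI
c.NotebookApp.allow_password_change = True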

The repo also shows a way of bundling nbviewer to allow users to “publish” HTML versions of their notebooks.

Doug also points to yuvipanda/jupyterhub-firstuseauthenticator, a first use authenticator for Jupyterhub that allows new users to create an account and then set a password on it. This could be really handy for workshops, where you want to allow users to self-serve an environment that persists over a couple of workshop sessions, for example. (One thing we still need to do in the OU is get a Jupyterhub server up and running with persistent user storage; for TM112, we ran a temporary notebook server, which meant students couldn’t save and return to notebooks on the server – they’d have to download notebooks and then re-upload them into a new session if they wanted to return to working on a notebook they had modified. That said, the activity was designed as a “disposable” activity…)

Zip All Notebooks

This handy extension, nbzip, provides a button to zip and download a Jupyter notebook server folder. If you’re working on a temporary notebook server, this provides an easy way of grabbing all the notebooks in one go. What might be even nicer would be to select a sub-folder, or a selected set of files, using checkbox selectors? I’m not sure if there’s a complementary tool that will let you upload a zipped archive and unpack it in one go?
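In the meantime, a workaround for the upload direction is to unpack an uploaded archive from a notebook code cell, something like the following sketch (which assumes the zip file, here called notebooks.zip, has already been uploaded into the notebook server’s working directory):

#Unpack an uploaded zip archive from inside a notebook (sketch)
import zipfile

with zipfile.ZipFile("notebooks.zip") as zf:
    #Extract everything into a subfolder alongside the archive
    zf.extractall("restored_notebooks")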

Fragment – Running Multiple Services, such as Jupyter Notebooks and a Postgres Database, in a Single Docker Container

Over the last couple of days, I’ve been fettling the build scripts for the TM351 VM, which typically uses vagrant to build a VirtualBox VM from a set of shell scripts, so that they can also be used to build a single Docker container that runs all the TM351 services, specifically Jupyter notebooks, OpenRefine, PostgreSQL and MongoDB.

Docker containers are typically constructed to run a single service, with compositions of containers wired together using Docker Compose to create applications that deliver, or rely on, more than one running service. For example, in a previous post (Setting up a Containerised Desktop API server (MySQL + Apache / PHP 5) for the ergast Motor Racing Data API) I showed how to set up a couple of containers to work together, one running a MySQL database server, the other an http service that provided an API to the database.

So how to run multiple services in the same container? Docs on the Docker website suggest using supervisord to run multiple services in a single container, so here’s a fragment on how I’ve done that from my TM351 build.

To begin with, I’ve built the container up as a tiered set of containers, in a similar way to how the stack of opinionated Jupyter notebook Docker containers is constructed:

#Define a stub to identify the images in this image stack
IMAGESTUB=psychemedia/tm361testm

# minimal
## Define a minimal container, eg a basic Linux container
## using whatever flavour of Linux we prefer
docker build --rm -t ${IMAGESTUB}-minimal-test ./minimal

# base
## The base container installs core packages
## The intention is to define a common build environment
## populated with packages likely to be common to many courses
docker build --rm --build-arg BASE=${IMAGESTUB}-minimal-test -t ${IMAGESTUB}-base-test ./base

#...

One of the things I’ve done to try to generalise the build steps is to allow the name of a base image to be used to bootstrap a new one, by passing the name of the base image in via an optional variable (in the above case, --build-arg BASE=${IMAGESTUB}-minimal-test). Each Dockerfile in a build step directory uses the following construction to work out which image to use as the FROM basis:

#Set ARG values using --build-arg ARG_NAME=VALUE
#Each ARG value can also have a default value
ARG BASE=psychemedia/ou-tm351-base-test
FROM ${BASE}

Using the same approach, I have used separate build tiers for the following components:

  • jupyter base: minimal Jupyter notebook install;
  • jupyter custom: add some customisation onto a pre-existing Jupyter notebook install;
  • openrefine: add the OpenRefine application; (note, we could just use BASE=ubuntu to create this as a simple, standalone OpenRefine container);
  • postgres: create a seeded PostgreSQL database; note, this could be split into two: a base postgres tier and then a customisation that adds users, creates and seeds databases, etc.;
  • mongodb: add in a seeded mongo database; again, the seeding could be added as an extra tier on a minimal database tier;
  • topup: a tier to add in anything I’ve missed without having to go back to rebuild from an earlier step…

The intention behind splitting out these tiers is that we might want to have a battle hardened OU postgres tier, for example, that could be shared between different courses. Alternatively, we might want to have tiers offering customisations for specific presentations of a course, whilst reusing several other fixed tiers intended to last out the life of the course.

By the by, it can be quite handy to poke inside an image once you’ve created it to check that everything is in the right place:

#Explore inside an image by entering it with a shell command
docker run -it --entrypoint=/bin/bash psychemedia/ou-tm351-jupyter-base-test -i

Once the services are in place, I add a final layer to the container that ensures supervisord is available and set up with an appropriate supervisord.conf configuration file:

##Dockerfile
#Final tier Dockerfile
ARG BASE=psychemedia/testpieces
FROM ${BASE}

USER root
RUN apt-get update && apt-get install -y supervisor

RUN mkdir -p /openrefine_projects  && chown oustudent:100 /openrefine_projects
VOLUME /openrefine_projects

RUN mkdir -p /notebooks  && chown oustudent:100 /notebooks
VOLUME /notebooks

RUN mkdir -p /var/log/supervisor
COPY monolithic_container_supervisord.conf /etc/supervisor/conf.d/supervisord.conf

EXPOSE 3334
EXPOSE 8888

CMD ["/usr/bin/supervisord"]

The supervisord.conf file is defined as follows:

##supervisord.conf
##We can check running processes under supervisord with: supervisorctl

[supervisord]
nodaemon=true
logfile=/dev/stdout
loglevel=trace
logfile_maxbytes=0
#The HOME envt needs setting to the correct USER
#otherwise jupyter throws: [Errno 13] Permission denied: '/root/.local'
#https://github.com/jupyter/notebook/issues/1719
environment=HOME=/home/oustudent

[program:jupyternotebook]
#Note the auth is a bit ropey on this atm!
command=/usr/local/bin/jupyter notebook --port=8888 --ip=0.0.0.0 --y --log-level=WARN --no-browser --allow-root --NotebookApp.password= --NotebookApp.token=
#The directory we want to start in
#(replaces jupyter notebook parameter: --notebook-dir=/notebooks)
directory=/notebooks
autostart=true
autorestart=true
startsecs=5
user=oustudent
stdout_logfile=NONE
stderr_logfile=NONE

[program:postgresql]
command=/usr/lib/postgresql/9.5/bin/postgres -D /var/lib/postgresql/9.5/main -c config_file=/etc/postgresql/9.5/main/postgresql.conf
user=postgres
autostart=true
autorestart=true
startsecs=5

[program:mongodb]
command=/usr/bin/mongod --dbpath=/var/lib/mongodb --port=27351
user=mongodb
autostart=true
autorestart=true
startsecs=5

[program:openrefine]
command=/opt/openrefine-3.0-beta/refine -p 3334 -i 0.0.0.0 -d /vagrant/openrefine_projects
user=oustudent
autostart=true
autorestart=true
startsecs=5
stdout_logfile=NONE
stderr_logfile=NONE

One thing I need to do better is to find a way to stage the construction of the supervisord.conf file, bearing in mind that multiple tiers may relate to the same service; for example, I have a jupyter-base tier to create a minimal Jupyter notebook server and then a jupyter-base-custom tier that adds in specific customisations, such as branding and course related notebook extensions.
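One possible approach, and it’s just a sketch of an idea rather than anything I’ve wired into the build yet, would be for each tier to contribute its own program fragment file and then have a trivial helper assemble the final supervisord.conf as a last step, along these lines (the supervisord.d directory and fragment names are made up for the example):

#assemble_supervisord_conf.py - sketch of a possible conf staging helper
#Each tier drops a fragment into supervisord.d/, e.g. 20-jupyternotebook.conf
from pathlib import Path

HEADER = """[supervisord]
nodaemon=true
logfile=/dev/stdout
logfile_maxbytes=0
environment=HOME=/home/oustudent
"""

fragments = sorted(Path("supervisord.d").glob("*.conf"))
conf = HEADER + "\n" + "\n".join(f.read_text() for f in fragments)
Path("monolithic_container_supervisord.conf").write_text(conf)
print("Wrote supervisord.conf from {} fragments".format(len(fragments)))

(Alternatively, supervisord’s own [include] mechanism could pick up per-tier files copied into /etc/supervisor/conf.d/ by each tier’s Dockerfile.)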

When the final container is built, the supervisord command is run and the multiple services started.

One other thing to note: we’re hoping to run TM351 environments on an internal OpenStack cluster. The current cluster only allows students to expose a single port, and port 80 at that, from the VM (IP addresses are in scant supply, and network security lockdowns are in place all over the place). The current VM exposes at least two http services: Jupyter notebooks and OpenRefine, so we need a proxy in place if we are to expose them both via a single port. Helpfully, the nbserverproxy Jupyter extension (as described in Exposing Multiple Services Via a Single http Port Using Jupyter nbserverproxy) allows us to do just that. One thing to note, though – I had to enable it via the same user that launches the notebook server in the supervisord.conf settings:

##Dockerfile fragment

RUN $PIP install nbserverproxy

USER oustudent
RUN jupyter serverextension enable --py nbserverproxy
USER root

To run the container, I can call something like:

docker run -p 8899:8888 -d psychemedia/tm351dockermonotest

and then to access the additional services, I can browse to e.g. localhost:8899/proxy/3334/ to see the OpenRefine application.

PS in case you’re wondering why I syndicated this through RBloggers too, the same recipe will work if you’re using Jupyter notebooks with an R kernel, rather than the default IPython one.

More Cookie Notices…

One of the cookie acceptance notices I’ve started noticing on first visits to several sites lately comes from TrustArc:

[Image: TrustArc – Technology Powered Privacy Compliance Solutions]

This categorises cookies into three classes — required, functional and advertising — and lets you make decisions on whether to accept those cookies at the category level, or at the individual provider level if you look at the Detailed Settings:

[Image: TrustArc – detailed settings]

However, opting out at the category level doesn’t necessarily mean you have opted out of all the cookies provided in that category:

[Image: cookie opt-out status]

So whilst I like things like the TrustArc display in principle, it would be nicer if it also had a percentage bar display, for example, showing the percentage of cookie providers that were successfully opted out from in that category?

I’m not quite sure how cookie opt-outs work. I can see how it’d work on the user side – e.g. follow a link provided by someone like IgnitionOne – to a page that sets opt-out cookies in your browser, but how about a publisher setting the opt-out from a third party on your behalf? This description — OpenX – how the cookie-based opt-out mechanism works — suggests that the publisher calls a link and then “OpenX sets the user opt-out cookie and performs a test to see that the user accepted the cookie“, but how do the bits of data flow and what state or flags are set where? Another one to add to the “I don’t understand how this actually works; try to find out at some point” list…

Data Ethics – Guidance and Code – And a Fragment on Machine Learning Models

By chance, I notice that the Department for Digital, Culture, Media & Sport (DDCMS) has published guidance on a Data Ethics Framework, with an associated workbook, that is intended to guide the design of appropriate data use in government and the wider public sector.

The workbook is based around providing answers to questions associated with seven principles:

  1. A clear user need and public benefit
  2. Awareness of relevant legislation and codes of practice
  3. Use of data that is proportionate to the user need
  4. Understanding of the limitations of the data
  5. Using robust practices and working within your skillset
  6. Making work transparent and being accountable
  7. Embedding data use responsibly

It’s likely that different sector codes will start to appear, such as this one from the Department of Health & Social Care (DHSC): Initial code of conduct for data-driven health and care technology. In this case, the code incorporates ten principles:

1 Define the user
Understand and show who specifically your product is for, what problem you are trying to solve for them, what benefits they can expect and, if relevant, how AI can offer a more efficient or quality-based solution. Ensure you have done adequate background research into the nature of their needs, including any co-morbidities and socio-cultural influences that might affect uptake, adoption and ongoing use.

2 Define the value proposition
Show why the innovation or technology has been developed or is in development, with a clear business case highlighting outputs, outcomes, benefits and performance indicators, and how exactly the product will result in better provision and/or outcomes for people and the health and care system.

3 Be fair, transparent and accountable about what data you are using
Show you have utilised privacy-by-design principles with data-sharing agreements, data flow maps and data protection impact assessments. Ensure all aspects of GDPR have been considered (legal basis for processing, information asset ownership, system level security policy, duty of transparency notice, unified register of assets completion and data privacy impact assessments).

4 Use data that is proportionate to the identified user need (data minimisation principle of GDPR)
Show that you have used the minimum personal data necessary to achieve the desired outcomes of the user need identified in 1.

5 Make use of open standards
Utilise and build into your product or innovation, current data and interoperability standards to ensure you can communicate easily with existing national systems. Programmatically build data quality evaluation into AI development so that harm does not occur if poor data quality creeps in.

6 Be transparent to the limitations of the data used and algorithms deployed
Show you understand the quality of the data and have considered its limitations when assessing if it is appropriate to use for the defined user need. When building an algorithm be clear on its strengths and limitations, and show in a transparent manner if it is your training or deployment algorithms that you have published.

7 Make security integral to the design
Keep systems safe by integrating appropriate levels of security and safeguarding data.

8 Define the commercial strategy
Purchasing strategies should show consideration of commercial and technology aspects and contractual limitations. You should only enter into commercial terms in which the benefits of the partnerships between technology companies and health and care providers are shared fairly.

9 Show evidence of effectiveness for the intended use
You should provide evidence of how effective your product or innovation is for its intended use. If you are unable to show evidence, you should draw a plan that addresses the minimum required level of evidence given the functions performed by your technology.

10 Show what type of algorithm you are building, the evidence base for choosing that algorithm, how you plan to monitor its performance on an ongoing basis and how you are validating performance of the algorithm.

One of the discussions we often have when putting new courses together is how to incorporate ethics-related issues in a way that makes sense (which is to say, in a way that can be assessed…). One way might be to apply things like the workbook or the code of conduct to a simple case study. Creating appropriate case studies can be a challenge, but via an O’Reilly post, I note that a joint project between the Center for Information Technology Policy and the Center for Human Values, both at Princeton, has recently produced a set of fictional case studies designed to elucidate and prompt discussion about issues at the intersection of AI and ethics, covering a range of concerns: an automated healthcare app (foundations of legitimacy, paternalism, transparency, censorship, inequality); dynamic sound identification (rights, representational harms, neutrality, downstream responsibility); optimizing schools (privacy, autonomy, consequentialism, rhetoric); law enforcement chatbots (automation, research ethics, sovereignty).

I also note that a recent DDCMS Consultation on the Centre for Data Ethics and Innovation has just closed… One of the topics of concern that jumped out at me related to IPR:

Intellectual Property and ownership Intellectual property rights protect – and therefore reward – innovation and creativity. It is important that our intellectual property regime keeps up with the evolving ways in which data use generates new innovations. This means assigning ownership along the value chain, from datasets, training data, source code, or other aspects of the data use processes. It also includes clarity around ownership, where AI generates innovations without human input. Finally, there are potentially fundamental questions around the ownership or control of personal data, that could heavily shape the way data-driven markets operate.

One of the things I think we are likely to see more of is a marketplace in machine learning models, either sold (or rented out?) as ‘fixed’ or ‘further trainable’, building on the shared model platforms that are starting to appear. (A major risk here, of course, is that models with built-in biases – or vulnerabilities – might be exploited if bad actors know what models you’re running…) For example:

  • Seedbank [announcement], “a [Google operated] place to discover interactive machine learning examples which you can run from your browser, no set-up required“;
  • TensorFlow Hub [announcement], “a [Google operated] platform to publish, discover, and reuse parts of machine learning modules in TensorFlow“;
  • see also this guidance on sharing ML models on Google Cloud
  • Comet.ml [announcement], “the first infrastructure- and workflow-agnostic machine learning platform.“. No, me neither…

I couldn’t offhand spot a marketplace for Amazon SageMaker models, but I did notice some instructions for how to import your Amazon SageMaker trained model into Amazon DeepLens, so if model portability is a thing, the next thing Amazon will likely do is find a way to take a cut from people selling that thing.

I wonder, too, if the export format has anything to do with ONNX, “an open format to represent deep learning models”?

[Image: ONNX]

(No sign of Google there?)

How the IPR around these models will be regulated can also get a bit meta. If data is licensed to one party so they can train a model, should the license also cover who might make use of any models trained on that data, or how any derived models might be used?

And what counts as “fair use” data when training models anyway? For example, Google recently announced Conceptual Captions, “A New Dataset and Challenge for Image Captioning”. The dataset:

consist[s] of ~3.3 million image/caption pairs that are created by automatically extracting and filtering image caption annotations from billions of web pages.

So how were those images and text caption training data gathered / selected? And what license conditions were associated with those images? Or, when compiling the dataset, did Google do what Google always does, which is conveniently ignore copyright because it’s only indexing and publishing search results, not actually re-publishing material (allegedly…)?

Does that sort of consideration fall into the remit of the current Lords Communications Committee inquiry into The Internet: to regulate or not to regulate?, I wonder?

A recent Information Law and Policy Centre post on Fixing Copyright Reform: How to Address Online Infringement and Bridge the Value Gap, starts as follows:

In September 2016, the European Commission published its proposal for a new Directive on Copyright in the Digital Single Market, including its controversial draft Article 13. The main driver behind this provision is what has become known as the ‘value gap’, i.e. the alleged mismatch between the value that online sharing platforms extract from creative content and the revenue returned to the copyright-holders.

This made me wonder, is there a “mismatch” somewhere between:

a) the data that people share about themselves, or that is collected about them, and the value extracted from it;
b) the content qua data that web search engine operators hoover up with their search engine crawlers and then use as a corpus for training models that are used to provide commercial services, or sold / shared on?

There is also a debate to be had about other ways in which the large web cos seem to feel they can just grab whatever data they want, as hinted at in this report on Google data collection research.

The Growing Popularity of Jupyter Notebooks…

It’s now five years since we first started exploring the use of Jupyter notebooks — then known as IPython notebooks — in the OU for the course that became (and still is) TM351 Data management and analysis.

At the time, the notebooks were evolving fast (they still are…) but knowing the length of time it takes to produce a course, and the compelling nature of the notebooks, and the traction they already seemed to have, it felt like a reasonably safe technology to go with.

Since then, the Jupyter project has started to rack up some impressive statistics. Using Google BigQuery on public datasets, we can find how many times the Jupyter Python package is installed from PyPi, the centralised repository for Python package distribution, each month by querying the the-psf:pypi.downloads dataset (if you don’t like the code in this post, how else do you expect me to demonstrably source the supporting evidence…?!):

#https://langui.sh/2016/12/09/data-driven-decisions/
SELECT
  STRFTIME_UTC_USEC(timestamp, "%Y-%m") AS yyyymm,
  COUNT(*) as download_count
FROM
  TABLE_DATE_RANGE(
    [the-psf:pypi.downloads],
    DATE_ADD(CURRENT_TIMESTAMP(), -1, "year"),
    CURRENT_TIMESTAMP()
  )

WHERE file.project="jupyter"
GROUP BY yyyymm
ORDER BY yyyymm DESC
LIMIT 12

(Our other long term bet for TM351 was on the pandas Python package, and that has also gone from strength to strength…)

Google BigQuery datasets also include a Stack Overflow dataset (Stack Overflow is a go to technical question-and-answer site for developers), so something like the following crappy query will count the jupyter-notebook tagged questions appearing each year:

SELECT tags, COUNT(*) c,  year
FROM (
 SELECT SPLIT(tags, '|') tags, YEAR(creation_date) year
  FROM [bigquery-public-data:stackoverflow.posts_questions] a
  WHERE YEAR(creation_date) >= 2014 AND tags LIKE '%jupyter-notebook%'
)
WHERE tags='jupyter-notebook'
GROUP BY year, tags
ORDER BY year DESC

I had thought we might be able to use BigQuery to query the number of notebooks on Github (a site widely used by developers for managing code repositories and, increasingly, by educators for managing course notes/resources), but it seems that the Github public data tables (bigquery-public-data:github_repos) only represent a sample of 10% or so of public projects?

FWIW, here are a couple of sample queries on that dataset anyway. First, a count of projects identified as Jupyter Notebook projects:

SELECT COUNT(*)
FROM [bigquery-public-data:github_repos.languages]
WHERE language.name = "Jupyter Notebook"

And secondly, a count of .ipynb notebooks:

SELECT COUNT(*)
FROM [bigquery-public-data:github_repos.files]
#filter on files with a .ipynb suffix
WHERE RIGHT(path,6) = ".ipynb"

We can also look to see what the most popular Python packages imported into the notebooks are, using a recipe I found here…

numpy, matplotlib and pandas, then, perhaps not surprisingly… Here’s the (cribbed) code for that query:

#https://dev.to/walker/using-googles-bigquery-to-better-understand-the-python-ecosystem

#The query finds line breaks and then tries to parse import statements
SELECT
  CASE WHEN package LIKE '%\\n",' THEN LEFT(package, LENGTH(package) - 4)
    ELSE package END as package, n
FROM (
  SELECT REGEXP_EXTRACT( line, r'(?:\S+\s+)(\S+)' ) as package, count(*) as n
  FROM (
    SELECT SPLIT(content, '\n \"') as line
    FROM (SELECT * FROM [bigquery-public-data:github_repos.contents]
      WHERE id IN (
        SELECT id FROM [bigquery-public-data:github_repos.files]
         WHERE RIGHT(path, 6) = '.ipynb'))
       HAVING LEFT(line, 7) = 'import ' OR LEFT(line, 5) = 'from ')
  GROUP BY package
  ORDER BY n DESC)
LIMIT 10;

(I guess a variant of the above could be used to find out what magics are most commonly loaded into notebooks?)

That said, Github also seems to expose a count of files directly, as described in parente/nbestimate, from where the following estimates of growth in the number of Jupyter notebook / .ipynb files on Github are taken:

The number returned by making the following request is approximate – maybe we get the exact count if we make an authenticated request?

https://github.com/search/count?q=extension%3Aipynb+nbformat_minor&ref=searchresults&type=Code
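For example, here’s a quick sketch using the Github v3 code search API, which does require an access token (whether the count it returns is any more “exact” than the web search one, I’m not sure):

#Count .ipynb files via the Github code search API (sketch)
#Assumes a personal access token in the GITHUB_TOKEN environment variable
import os
import requests

r = requests.get(
    "https://api.github.com/search/code",
    params={"q": "extension:ipynb nbformat_minor"},
    headers={"Authorization": "token " + os.environ["GITHUB_TOKEN"]},
)
print(r.json().get("total_count"))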

On Docker hub, where various flavours of official notebook container are hosted, there’ve been over a million pulls of the datascience notebook:

[Image: jupyter – Docker Hub]

Searching Docker hub for jupyter notebook also identifies over 5000 notebook related Docker container images at the current count, of which over 1500 are automatically built from Github repos.

Mentions of Jupyter notebooks are also starting to appear on ITJobswatch, a site that tracks IT and programming job ads in the UK.

By the by, if you fancy exploring other sorts of queries, this simple construction will let you run a Google web search to look for .ipynb files that mention the word course or module on ac.uk websites…

filetype:ipynb site:ac.uk (course OR module)

Anyway, in the same way that it can take many years to get anything flying regularly on NASA space flights, OU courses take a couple of years to put together, and innovations often have to prove themselves over a couple of presentations of one course before a second course will take them on; so it’s only now that we’re starting to see an uptick of interest in using Jupyter notebooks in other courses… but that interest is growing, and it looks like there’s a flurry of courses about to start production that are likely to be using notebooks.

We may have fallen a little bit behind places like UC Berkeley, where 50% of undergrads now take a core, Jupyter notebook powered, data science course, with subject-specific bolt-on mini-modules, and notebooks now provide part of the university core infrastructure…

…but even so, there could still be exciting times ahead…:-)

Even more exciting if we could get the library to start championing it as a general resource…

[Image: on-campus Jupyterhub]

Robot Workers?

A lazy post that does nothing much more than rehash and link bullet points from the buried lede that is someone else’s post…

It seems like folk over at the Bank of England have been in the news again about robots taking over human jobs (Bank of England chief economist [Andy Haldane] warns on AI jobs threat); this follows on from a talk earlier this year by Mark Carney at the Public Policy Forum in Toronto [slides] and is similar in kind to other speeches coming out of the Bank of England over the last few years (Are Robots Threatening Jobs or Are We Taking Them Ourselves Through Self-Service Automation?).

The interview(?) was presumably in response to a YouGov survey on Workers and Technology: Our Hopes and Fears associated with the launch of a Fabian Society and Community Commission on Workers and Technology.

(See also a more recent YouGov survey on “friends with robots” which asked “In general, how comfortable or uncomfortable do you think you would be working with a colleague or manager that was a robot?” and “Please imagine you had received poor service in a restaurant or shop from a robot waiter/ shop assistant that is able to detect tone and emotion in a human’s voice… Do you think you would be more or less likely to be rude to the robot, than you would to a human waiter/ shop assistant, or would there be no difference? (By ‘rude’, we mean raising your voice, being unsympathetic and being generally impolite… )“.)

One of the job categories that is being enabled by automation is human trainers that help generate the marked up data that feeds the machines. A recent post on The Lever, “Google Developers Launchpad’s new resource for sharing applied-Machine Learning (ML) content to help startups innovate and thrive” [announcement] asks Where Does Data Come From?. The TLDR answer?

  • Public data
  • Data from an existing product
  • Human-in-the-loop (e.g. a human driver inside an “autonomous” vehicle)
  • Brute force (e.g. slurping all the data you can find; hello Google/Facebook etc etc)
  • Buying the data (which means someone is also selling the data, right?)

A key part of many machine learning approaches is to use labelled datasets that the machine learns from. This means taking a picture of a face, for example, that a human has annotated with areas labelled “eyes”, “nose”, “mouth”, and then training the ‘pootah to try to identify particular features in the photographs that allow the machine to identify those labels with those features, and hopefully the corresponding elements in a previously unseen photo.

Here’s a TLDR summary of part of the Lever post, concerning where these annotations come from:

  • External annotation service providers
  • Internal annotation team
  • Getting users to generate the labels (so the users do folk in external annotation service providers out of a job…)

The post also identifies several companies that provide external annotation services… Check them out if you want to get a glimpse of a future of work that involves feeding a machine…

  • Mechanical Turk: Amazon’s marketplace for getting people to do piecemeal bits of work for pennies that other people often sell as “automated” services, which you might have thought meant “computerised”. Which it is, in that a computer assigns the work to essentially anonymous, zero-hours contract workers. Where it gets really amusing is when folk create bots to do the “human work” that other folk are paying Mechanical Turk for…
  • Figure Eight: a “Human-in-the-Loop Machine Learning platform transforms unstructured text, image, audio, and video data into customized high quality training data”… Sounds fun, doesn’t it? (The correct answer is probably “no”);
  • Mighty AI: “a secure cloud-based annotation software suite that empowers you to create the annotations you need for developing and validating autonomous perception systems”, apparently.. You get a sense of how it’s supposed to work from the blurb:
    • “Mighty Community”, a worldwide community that provides our customers with timely, high-quality annotations, offloading the burden to find, train, and manage a pool of annotators to generate ground truth data.
    • Expert global community allows for annotations 24 hours/day
    • Training on Mighty Tools eliminates annotator on-boarding time
    • Available at a moment’s notice to instantly scale customer annotation programs
    • Community members covered by confidentiality agreement
    • Automated annotation management process with Mighty Studio
    • Close integration with Mighty Quality eliminates the need to find and correct annotation errors
  • Playment: “With 300,000+ skilled workers ready to work round-the-clock on Playment’s mobile and web app, we can generate millions of labels in a matter of hours. As the workers do more work, they get better and Playment is able to accomplish much more work in lesser time.” (And then when they’ve done the work, the machine does the “same” work with reduced marginal cost… Hmm, thinks, how do the human worker costs (pennies per task) compare with the server costs for large ML services?)

Happy days to come, eh…?

Legislating For Mandatory Software Updates

From the Automated and Electric Vehicles Act 2018, I notice the following:

4 Accident resulting from unauthorised software alterations or failure to update software

(1) An insurance policy in respect of an automated vehicle may exclude or limit the insurer’s liability under section 2(1) for damage suffered by an insured person arising from an accident occurring as a direct result of—
(a) software alterations made by the insured person, or with the insured person’s knowledge, that are prohibited under the policy, or
(b) a failure to install safety-critical software updates that the insured person knows, or ought reasonably to know, are safety-critical.

(2) But as regards liability for damage suffered by an insured person who is not the holder of the policy, subsection (1)(a) applies only in relation to software alterations which, at the time of the accident, the person knows are prohibited under the policy.

(3) Subsection (4) applies where an amount is paid by an insurer under section 2(1) in respect of damage suffered, as a result of an accident, by someone who is not insured under the policy in question.

(4) If the accident occurred as a direct result of—
(a) software alterations made by an insured person, or with an insured person’s knowledge, that were prohibited under the policy, or
(b) a failure to install safety-critical software updates that an insured person knew, or ought reasonably to have known, were safety-critical,
the amount paid by the insurer is recoverable from that person to the extent provided for by the policy.

(5) But as regards recovery from an insured person who is not the holder of the policy, subsection (4)(a) applies only in relation to software alterations which, at the time of the accident, the person knew were prohibited under the policy.

(6) For the purposes of this section—
(a) “software alterations” and “software updates”, in relation to an automated vehicle, mean (respectively) alterations and updates to the vehicle’s software;
(b) software updates are “safety-critical” if it would be unsafe to use the vehicle in question without the updates being installed.

It looks like this is the first time that “software update” appears in legislation…

Is this the start of things to come?

And what tools exist on the legislation.gov.uk and parliament.uk websites to make it easier to keep track of such phrases in passed, enacted and draft legislation?

PS Michael posted some useful thoughts lately on why the Parliament website isn’t for everyone. It also reminded me of an important rule when writing bids for outreach activities: “if you claim the project event is for everybody, that means it’s targeted at nobody and won’t be funded”. After all, who should go, and why?

Algorithms That Work, A Bit… On Teaching With Things That Don’t Quite Work

Grant Potter picked up on a line in yesterday’s post on Building Your Own Learning Tools:

including such tools as helpers in an arts history course would help introduce students to the wider question of how well such algorithms work in general, and the extent to which tools or applications that use them can be trusted

The example in point related to the use of k-means algorithms to detect common colour values in images of paintings in order to produce a “dominant colour” palette for the image. The random seed nature of the algorithm means that if you run the same algorithm on the same image multiple times, you may get a different palette each time. Which makes a point about algorithms, if nothing else, and encourages you to treat them just like any other (un)reliable witness. In short, the tool demoed was a bit flakey, but no less useful (or instructive) for that…
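For reference, the sort of thing the demoed tool was doing can be sketched in a few lines of Python using scikit-learn (a rough sketch, not the actual tool code; note that no random_state is fixed here, which is exactly why repeated runs can return slightly different palettes):

#Rough sketch: extract a "dominant colour" palette from an image with k-means
import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

#Load the image and flatten it into a list of RGB pixel values
#(shrinking it first for speed; aspect ratio doesn't matter for a colour census)
img = Image.open("painting.jpg").convert("RGB").resize((200, 200))
pixels = np.array(img).reshape(-1, 3)

#Cluster the pixel values; no random_state is set, so repeated runs
#may converge on (slightly) different cluster centres
km = KMeans(n_clusters=5).fit(pixels)

#The cluster centres give the palette of "dominant" colours
palette = km.cluster_centers_.astype(int)
print(palette)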

Grant’s pick-up got me thinking about another advantage of bringing interactive digital — algorithmic — tools into the wider curriculum using things like responsive Jupyter notebooks: by trying to apply algorithms to the analysis of things people care about, like images of paintings in art history, like texts in literature, the classics, or history, like maps in geography and politics, and so on, and showing how they can be a bit rubbish, we help students see the limits of the power of similar algorithms that are applied elsewhere, and help them develop a healthy — and slightly more informed — scepticism about algorithms through employing them in a context they (academically) care about.

Using machine “intelligence” for analysis or critique in the arts or humanities also shows how much those subjects are matters of perspective, and how biases can creep in. (We can also use algorithmic exploration to dig one level deeper in the sciences. In this respect, when anyone quotes an average at me I think of Anscombe’s Quartet, or for a population trend, Simpson’s Paradox.)

Introducing computational tools as part of the wider curriculum, even in passing, also helps reveal what’s possible with a simple (perhaps naive) application of technology, and the possible benefits, limits and risks associated with such application.

(Academic algorithm researchers often tout things like a huge increase in performance on some test problem from 71.2% success to 71.3% for example, albeit at the expense of using a sizeable percentage of the world’s computers to run the computation. Which means 3 times in 10 it’s still wrong and the answer still needs checking all the time to catch those errors. If I can do the same calculation to 60% success on my Casio calculator watch, I still get a benefit 60% of the time, and the risk of failure isn’t that much different. You pays your money and you takes your choice.)

This also relates to the “everyone should learn to code” dogma too. Putting barebones computational tools inline into course materials shows how you can often achieve some sort of result with just a line or two of code, but more importantly, it shows you what sorts of thing can be done with a line or two of code, and what sorts of thing can’t be done that easily…

In putting together educational materials, we often focus on teaching people the right way to do something. But perhaps when it comes to code, rather than “everyone should learn to code”, maybe we need “everyone should experience the limitations of simple algorithms”? The solutionist good reason for using computational methods in the curriculum would be to show how useful computers can be (for some things) for automating, measuring, sorting and filtering. The resistance reason is that we can start to show people what the limits of the algorithms are, and how the computer answer isn’t always (isn’t often?) the right one.