Tagged: jupyter

OERs in Practice: Repurposing Jupyter Notebooks as Presentations

Over coffee following a maps SIG meeting last week, fellow Jupyter notebooks enthusiast Phil Wheeler wondered about the extent to which tutors / Associate Lecturers might be able to take course materials delivered as notebooks and deliver them in tutorials as slideshows.

The RISE notebook extension allows notebooks to be presented using the reveal.js HTML presentation framework. The slides essentially provide an alternative client to the Jupyter notebook and can be autolaunched using Binder. (I’m not sure if the autostart slideshow can also be configured to Run All cells before it starts?)

To see how this might work, I marked up one of the notebooks in my showntell/maths demo setting some of the cells to appear as slides in a presentation based on the notebook.

I also used the Hide Input Jupyter extension to hide code cell inputs so that the code used to generate an output image or interactive could be hidden from the actual presentation.

To get into the slideshow editor mode, from the notebook View menu select Cell Toolbar and then Slideshow. Reset the notebook display using View > Cell Toolbar > None.
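As a rough sketch of what the Slideshow cell toolbar is doing behind the scenes: as far as I understand it, the chosen slide type is saved in each cell's metadata under slideshow.slide_type, and I'm assuming the Hide Input extension sets a hide_input flag in the same place. The helper below (mark_up_slides is a made-up name, and the tiny notebook is a stand-in for a real .ipynb file loaded with json.load) applies the same markup in bulk:

```python
import json

def mark_up_slides(nb, slide_types, hide_code=True):
    """Set slideshow metadata on the cells of a notebook dict (nbformat 4 layout).

    slide_types maps a cell index to a slide type such as "slide",
    "subslide", "fragment", "skip" or "notes".
    """
    for idx, cell in enumerate(nb["cells"]):
        metadata = cell.setdefault("metadata", {})
        if idx in slide_types:
            metadata["slideshow"] = {"slide_type": slide_types[idx]}
        if hide_code and cell["cell_type"] == "code":
            # Assumption: the Hide Input extension reads this flag
            metadata["hide_input"] = True
    return nb

# Stand-in for a notebook loaded from disk with json.load()
demo_nb = {
    "nbformat": 4, "nbformat_minor": 2, "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title slide"]},
        {"cell_type": "code", "metadata": {}, "source": ["print('hi')"],
         "outputs": [], "execution_count": None},
    ],
}

marked = mark_up_slides(demo_nb, {0: "slide", 1: "fragment"})
print(json.dumps(marked["cells"][1]["metadata"]))
```

Writing the result back out with json.dump should give a notebook that opens with the slide markup already in place, which might be handy for converting a batch of course notebooks in one go.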

To run the presentation with code cell outputs pre-rendered, you first need to run all the cells. From the notebook Cell menu select Run All to execute all the cells. You can now enter the slideshow using the Enter/Exit RISE Slideshow toolbar button (it looks like a bar chart). Exit the presentation using the cross in the top left of the slideshow display.

[Live demo on Binderhub]

PS building on the idea of mapping notebook cells into a reveal.js tagged display, I wonder if we could do something similar using a scrollytelling framework such as scrollama, scrollstory or idyll?

Note On My Emerging Workflow for Working With Binderhub

Yesterday saw the public reboot of Binder / MyBinder (which I first wrote about a couple of years ago here), as reported in The Jupyter project blog post Binder 2.0, a Tech Guide and this practical guide: Introducing Binder 2.0 — share your interactive research environment.

For anyone not familiar with Binder / MyBinder, it’s a service that will launch a fully running Jupyter notebook server and computing environment based on the contents of a Github repository (config files as well as notebooks). What this means is that if you put your Jupyter notebooks into a Github repository, along with one or two simple files that list any Linux or Python packages you need to install in order to run the code in the notebooks (or R packages and perhaps Rmd files if you also install an R kernel/RStudio), you can get browser access to that running environment at just the click of a link. And the generosity of whoever is paying for the servers the notebook server runs on.

The system has been rebuilt to use Jupyterhub, with a renaming as far as the codebase goes to Binderhub. There are also several utility tools associated with the project, including the really handy repo2docker that builds a Docker image from the contents of a local folder or Github repository.

One of the things that particularly interested me in the announcement blog posts was the following aspirational remark:

We would love to see others deploy their own BinderHub servers, either for their own communities, or as part of a federated public service of BinderHubs.

I’d love to see the OU get behind this, either directly or under the banner of OpenLearn, as part of an effort to help make Jupyter powered interactive open educational materials available without the need to install any software.

(I tried to pitch it to FutureLearn to help support the OU/FutureLearn Learn to Code for Data Analysis MOOC when we were writing that course, but they weren’t interested…)

One disadvantage is that Binderhub is a stateless service, which means you need to download any notebooks you’re working on and then upload them again yourself if you stop an interactive session: the environment you were working in is personal to you, but it’s also destroyed whenever you close the session (or after a particular amount of time?). So other solutions are required for persisting state (i.e. having a personal file storage area). Jupyterhub is one way to do that (and one of the things we’re starting to explore in the OU at the moment).

Through playing with Binderhub over the last couple of weeks as part of an attempt to put together some demos for how to use Jupyter notebooks to support the creation of educational content that contains rich content (images, interactives) from specifications contained within the notebook document (think: writing diagrams) I’ve come to the following workflow:

  • create a Github repository to host different builds (example). In my case, these are for different topic areas; but they could be different research projects, courses, data journalism investigations, etc.
  • put each build in a branch (example);
  • work up the build instructions for the environment either using Github/Binder or locally; I was having to use Github/Binder because I was working on a slow network connection that made building my evolving image difficult. But it meant that every time I made a change to the build, it used up Binder resources to do so.
  • if the build is a big one, it can take time to complete. I think that Binder will rebuild the Docker image each time you update the repo, so even if you only update notebook files, then *I think* that the package installation steps are also run even if those files *haven’t* changed? To simplify this process, we can instead create a Docker image from our build files and push that to Dockerhub (example).
  • We can then create a new build process for our working repository that pulls the pre-built image (containing all the required packages) and adds in the working notebooks (example).
  • We can also share a minimum viable repository that can be forked to allow other people to use the same environment (example).

One advantage of this route is that it separates “sys admin” concerns – building and installing the required packages – from “working” concerns relating to developing the contents of the notebooks. (I think the working repository that uses the Dockerfile build can also draw on the postbuild file to add in any additional or missing packages, which can then be added to the container build as part of a maintenance step.)

PS picking up on a recent related Downes presentation – Applications, Algorithms and Data: Open Educational Resources and the Next Generation of Virtual Learning – and a response from @jimgroom that I really need to comment back on – Containing the Future of OER – this phrase comes to mind: “syndicated runtime” eg if you syndicate the HTML version of a notebook via an RSS feed with a link back to the Binder runnable version of it…

Some Recent Noticings From the Jupyter Ecosystem

Over the last couple of weeks, I’ve got back into the speaking thing, firstly at an OU TEL show’n’tell event, then at a Parliamentary Digital Service show’n’tell.

In each case, the presentation was based around some of the things you can do with notebooks, one of which was using the RISE extension to run a notebook as an interactive slideshow: cells map on to slides or slide elements, and code cells can be executed live within the presentation, with any generated cell outputs being displayed in the slide.

RISE has just been updated to include an autostart mode that can be demo’ed if you run the RISE example on Binderhub.
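As far as I can tell, the autostart behaviour is switched on through notebook-level metadata rather than anything in the cells. A minimal sketch, assuming the setting lives under a rise key with an autolaunch flag (I believe older RISE releases read the same setting from a livereveal key, so the key name is a parameter here; enable_rise_autolaunch is my own made-up helper name):

```python
def enable_rise_autolaunch(nb, key="rise"):
    """Flag a notebook dict so RISE starts the slideshow when it opens.

    Assumption: RISE looks for {"autolaunch": true} under the given
    notebook-level metadata key.
    """
    nb.setdefault("metadata", {}).setdefault(key, {})["autolaunch"] = True
    return nb

# Stand-in for a notebook loaded with json.load()
nb = {"nbformat": 4, "nbformat_minor": 2, "metadata": {}, "cells": []}
enable_rise_autolaunch(nb)
print(nb["metadata"])
```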

Which brings me to Binderhub. Originally known as MyBinder, Binderhub takes the MyBinder idea of building a Docker image based on the build specification and content files contained in a public Github repository, and launching a Docker container from that image. Binderhub has recently moved into the Jupyter ecosystem, with the result that there are several handy spin-off command line components; for example, jupyter-repo2docker lets you build, and optionally push and/or launch, a local image from a Github repository or a local repository.

To follow on from my OU show’n’tell, I started putting together a set of branches on a single repository (psychemedia/showntell) that will eventually(?!) contain working demos of how to use Jupyter notebooks as part of a “generative document” workflow in particular topic areas. For example, for authoring texts containing rich media assets in a maths subject area, or music. (The environment I used for the show’n’tell was my own build (checks to make sure I turned that cloud machine off so I’m not still paying for it!), and I haven’t got working Binderhub environments for all the subject demos yet. If anyone would like to contribute to setting up the builds, or adding to subject specific demos, please get in touch…)

I also prepped for the PDS event by putting together a Binderhub build file in my psychemedia/parlihacks repo so (most of) the demo code would work on Binderhub. I think the only thing that doesn’t work at the moment is the Shiny app demo? This includes an RStudio environment, launched from the Jupyter notebooks New menu. (For an example, see the binder-examples/dockerfile-rstudio demo.)

So – long and short of that – you can create multiple demo environments in a single Github repo using a different branch for each demo, and then launch them separately using Binderhub.
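To make the branch-per-demo idea a bit more concrete, here’s a minimal sketch of how the launch links could be generated. It assumes the public mybinder.org URL pattern of /v2/gh/user/repo/ref with an optional filepath parameter; the binder_url helper, the music branch name and the notebook path are hypothetical.

```python
from urllib.parse import quote

def binder_url(user, repo, ref="master", filepath=None,
               host="https://mybinder.org"):
    """Build a Binder launch link for a particular branch, tag or commit."""
    url = f"{host}/v2/gh/{user}/{repo}/{quote(ref, safe='')}"
    if filepath:
        # Open a particular notebook rather than the file browser
        url += "?filepath=" + quote(filepath)
    return url

# One launch link per demo branch ("music" is an assumed branch name)
for branch in ["maths", "music"]:
    print(binder_url("psychemedia", "showntell", ref=branch))
```

Pointing host at a different BinderHub deployment should be all that’s needed to reuse the same links on a self-hosted service.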

What else…?

Oh yes, a new extension gives you a Shiny like workflow for creating simple apps from a Jupyter notebook: appmode. This seems to complement the Jupyter dashboards approach, by providing an “app view” of a notebook that displays the content of markdown cells and code cell outputs, but hides the code cell contents. So if you’ve been looking for a Jupyter notebook equivalent to R/shiny app development, this may get you some of the way there… (One of the nice things about the app view is that you can easily “View Source” – and modify that source…)

Possibly related to the appmode way of doing things, one thing I showed in the PDS show’n’tell was how notebooks can be used to define simple API services using the jupyter/kernel_gateway (example). These seem to run okay – locally at least – inside Binderhub, although I didn’t try calling a Jupyter API service from outside the container. (Maybe they can be made publicly available via the jupyterhub/nbserverproxy?) Why’s this relevant to appmode? My thinking is that architecturally you could separate out concerns, having one or more notebooks running an API that is consumed from the appmode notebook.
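For anyone who hasn’t seen the kernel gateway trick, here’s a minimal sketch of the sort of cell involved, assuming the gateway’s notebook-http mode: a cell whose first line is a comment naming an HTTP verb and path becomes the handler for that route, and the incoming request is made available as a JSON string called REQUEST. The /hello/:name route and the stand-in request are made up for illustration.

```python
import json

# When the notebook is served by the kernel gateway, REQUEST is injected
# as a JSON string. This stand-in just lets the code run in an ordinary
# notebook session too.
if "REQUEST" not in globals():
    REQUEST = json.dumps({"path": {"name": "world"}, "args": {}, "body": ""})

# In the notebook, everything from here down would sit in its own cell,
# with the annotation comment as the first line of that cell.
# GET /hello/:name
req = json.loads(REQUEST)
response = {"message": "hello " + req["path"]["name"]}
print(json.dumps(response))  # whatever the cell prints becomes the response body
```

An appmode notebook could then, in principle, call a route like this over HTTP rather than carrying the code around itself.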

Another recent announcement came from Google in the form of Colaboratory, a “research project created to help disseminate machine learning education and research”. The environment is “a Jupyter notebook environment that requires no setup to use”, although it does require registration to run notebook cells, and there appears to be a waiting list. The most interesting thing, perhaps, is the ability to collaboratively work on notebooks shared with other people across Google Drive. I think this is separate from the jupyterlab-google-drive initiative, which is looking to offer a similar sort of shared working, again through Google Drive?

By the by, it’s probably also worth noting that other big providers make notebooks available, such as Microsoft (notebooks.azure.com) and IBM (eg datascientistworkbench.com, cognitiveclass.ai; digging around, cognitiveclass.ai seems to be a rebranding of bigdatauniversity.com).

There are other hosted notebook servers relevant to education too: CoCalc (previously SageMathCloud) offers a free way in, as does gryd.us if you have a .edu email address. pythonanywhere.com/ offers notebooks to anyone on a paid plan.

It also seems like there are services starting to appear that offer free notebooks as well as compute power for research/scientific computing on a model similar to CoCalc (free tier in, then buy credits for additional services). For example, Kogence.

For sharing notebooks, I also just spotted Anaconda Cloud, which looks like it could be an interesting place to browse every so often…

Interesting times…

First Attempt At Using IPywidgets in Jupyter Notebooks to Display V-REP Robot Simulator Telemetry

Having got a thing together that lets me use some magic to load a V-REP robot simulator scene, connect to it and control a robot contained inside it, I also started to wonder about how we could build instrumentation on the Jupyter notebook client side.

The V-REP simulator itself has graph objects that can record and display logged data within the simulator:

But we can also capture data from the simulator as part of the Python control loop running via a notebook.

(I’m not sure if streaming data from the simulator is possible, or how to go about either setting that up in the simulator connection or rendering it in the notebook?)

So here’s my quick starter for 10 getting a simple data display running in a notebook using IPython widgets.

So here’s a simple text display to give a real time (ish) view of a couple of sensor values:

As the robot runs, the widget values update in real-time-ish.

I couldn’t figure out offhand how to generate a live-updating chart, and couldn’t quickly see how to return data from inside the magic cell as part of the magic function. (In fact, I’m not convinced I understand at all the scoping in there!)

But it seems as if we set a global variable inside the magic cell, we can get data out and plot it when the simulation is stopped:
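The pattern is roughly the following. This is a minimal sketch in plain Python rather than the actual magic cell: read_sensors is a hypothetical stand-in for whatever reads values back from the simulator, and the log is a module-level list that the control loop writes into, so it is still available once the loop has finished.

```python
import random

telemetry_log = []  # lives outside the control loop, so it survives the run

def read_sensors(step):
    """Hypothetical stand-in for reading sensor values from the simulator."""
    return {"step": step, "left": random.random(), "right": random.random()}

def control_loop(n_steps=5):
    """Run the (pretend) robot program, logging a reading at each step."""
    telemetry_log.clear()  # start each run with an empty log
    for step in range(n_steps):
        telemetry_log.append(read_sensors(step))

control_loop()

# Once the run has stopped, the logged data can be pulled out for plotting
left_trace = [row["left"] for row in telemetry_log]
print(len(left_trace), "samples logged")
```

The left_trace list can then be handed to whatever charting call you prefer once the simulation has stopped; it doesn’t get us any closer to a live-updating chart, though.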

Example notebook here.

If anyone can show me how to create and update a live chart, that would be fantastic:-)

Rolling Your Own Jupyter and RStudio Data Analysis Environment Around Apache Drill Using docker-compose

I had a bit of a play last night trying to hook a Jupyter notebook container up to an Apache Drill container using docker-compose. The idea was to have a shared data volume between the two of them, but I couldn’t for the life of me get that to work using the docker-compose version 2 or 3 (services/volumes) syntax – for some reason, any of the Apache Drill containers I tried wouldn’t fire up properly.

So I eventually (3am…:-( went for a simpler approach, synching data through a local directory on host.

The result is something that looks like this:

The Apache Drill container, and an Apache Zookeeper container to keep it in check, I found via Dockerhub. I also reused an official RStudio container. The Jupyter container is one I rolled for TM351.

The Jupyter and RStudio containers can both talk to the Apache Drill container, and both analysis apps have access to their own data folder mounted in an application folder in the current directory on host. The data folders mount into separate directories in the Apache Drill container. Both applications can query into data files contained in either data directory as viewable from Apache Drill.

This is far from ideal, but it works. (The structure is as suggested so that RStudio and Jupyter scripts can both be used to download data into a data directory viewable from the Apache Drill container. Another approach would be to mount a separate ./data directory and provide some means for populating it with data files. Alternatively, if the files already exist on host, mounting the host data directory onto a /data volume in the Apache Drill container would work too.)

Here’s the docker-compose.yaml file I’ve ended up with:

drill:
  image: dialonce/drill
  ports:
    - 8047:8047
  links:
    - zookeeper
  volumes:
    - ./notebooks/data:/nbdata
    - ./R/data:/rdata

zookeeper:
  image: jplock/zookeeper

notebook:
  container_name: notebook-apache-drill-test
  image: psychemedia/ou-tm351-jupyter-custom-pystack-test
  ports:
    - 35200:8888
  volumes:
    - ./notebooks:/notebooks/
  links:
    - drill:drill

rstudio:
  container_name: rstudio-apache-drill-test
  image: rocker/tidyverse
  environment:
    - PASSWORD=letmein
  #default user is: rstudio
  volumes:
    - ./R:/home/rstudio
  ports:
    - 8787:8787
  links:
    - drill:drill

If you have docker installed and running, running docker-compose up -d in the folder containing the docker-compose.yaml file will launch three linked containers: Jupyter notebook on localhost port 35200, RStudio on port 8787, and Apache Drill on port 8047. If the ./notebooks, ./notebooks/data, ./R and ./R/data subfolders don’t exist they will be created.

We can use the clients to variously download data files and run Apache Drill queries against them. In Jupyter notebooks, I used the pydrill package to connect. Note the hostname used is the linked container name (in this case, drill).

If we download data to the ./notebooks/data folder which is mounted inside the Apache Drill container as /nbdata, we can query against it.
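For a dependency-free feel for what such a query involves, here’s a minimal sketch that talks to Drill’s REST endpoint directly rather than going through pydrill (which, as I understand it, wraps the same REST API). It assumes the query.json endpoint on port 8047 and a hypothetical example.csv file saved into ./notebooks/data on host; the request is built but not sent, so the code runs even without the containers up.

```python
import json
from urllib.request import Request

def drill_query_request(sql, host="drill", port=8047):
    """Build (but don't send) a POST request for Drill's REST query endpoint."""
    payload = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return Request(
        f"http://{host}:{port}/query.json",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Without CSV headers configured, Drill exposes each row as a `columns` array
sql = "SELECT columns[0] AS first_col FROM dfs.`/nbdata/example.csv` LIMIT 5"
drill_req = drill_query_request(sql)
print(drill_req.full_url)

# With the containers running, something like this should return the rows:
# from urllib.request import urlopen
# rows = json.load(urlopen(drill_req))["rows"]
```

Note that the path in the query is the path as seen from inside the Apache Drill container (/nbdata), not the path on host or in the notebook container.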

(Note – it probably would make sense to use a modified Apache Drill container configured to use CSV headers, as per Querying Large CSV Files With Apache Drill.)

We can also query against that same data file from the RStudio container. In this case I used the DrillR package (I had hoped to use the sergeant package (“drill sergeant”, I assume?! Sigh..;-) but it uses the RJDBC package which expects to find java installed, rather than DBI, and java isn’t installed in the rocker/tidyverse container I used.) UPDATE: sergeant now works without Java dependency... Thanks, Bob:-)

I’m not sure if DrillR is being actively developed, but it would be handy if it could return the data from the query as a dataframe.

So, getting up and running with Apache Drill and a data analysis environment is not that hard at all, if you have docker installed:-)

Tinkering With Apache Drill – JOINed Queries Across JSON and CSV files

Coming down from a festival high, I spent the day yesterday jamming with code and trying to get a feel for Apache Drill. As I’ve posted before, Apache Drill is really handy for querying large CSV files.

The test data I’ve been using is Evan Odell’s 3GB Hansard dataset, downloaded as a CSV file but recast in the parquet format to speed up queries (see the previous post for details). I had another look at the dataset yesterday, and popped together some simple SQL warm up exercises in notebook (here).

Something I found myself doing was flitting between running SQL queries over the data using Apache Drill to return a pandas dataframe, and wrangling the pandas dataframes directly, following the path of least resistance to doing the thing I wanted to do at each step. (Which is to say, if I couldn’t figure out the SQL, I’d try moving into pandas; and if the pandas route was too fiddly, rethinking the SQL query! That said, I also noticed I had got a bit rusty with SQL…) Another pattern of behaviour I found myself falling into was using Apache Drill to run summarising queries over the large original dataset, and then working with these smaller, summary datasets as in-memory pandas dataframes. This could be a handy strategy, I think.

As well as being able to query large flat CSV files, Apache Drill also allows you to run queries over JSON files, as well as directories full of similarly structured JSON or CSV files. Lots of APIs export data as JSON, so being able to save the response of multiple calls on a similar topic as uniquely named (and the name doesn’t matter..) flat JSON files in the same folder, and then run a SQL query over all of them simply by pointing to the host directory, is really appealing.
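The “save each response as its own file” step is simple enough to sketch. This is a minimal version under my own assumptions: save_pages and fake_fetch are made-up names, and fake_fetch stands in for a real API call that returns one page of results as a list of records.

```python
import json
import tempfile
from pathlib import Path

def save_pages(fetch_page, n_pages, out_dir):
    """Save each page of API results as its own JSON file in one folder."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for page in range(n_pages):
        # The file name only needs to be unique within the folder
        path = out / f"page_{page:04d}.json"
        path.write_text(json.dumps(fetch_page(page)))
        paths.append(path)
    return paths

def fake_fetch(page):
    """Hypothetical stand-in for a paged API call."""
    return [{"page": page, "item": i} for i in range(3)]

demo_dir = Path(tempfile.mkdtemp()) / "answers"
saved = save_pages(fake_fetch, 2, demo_dir)
print([p.name for p in saved])
```

If the folder is visible to Apache Drill, a query over the directory path, rather than over a single file, should then pick up every page in one go.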

I explored these features in a notebook exploring UK Parliament Written Answers data. In particular, the notebook shows:

  • requesting multiple pages on the same topic from the UK Parliament data API (not an Apache Drill feature!)
  • querying deep into a single large JSON file;
  • running JOIN queries over data contained in a JSON file and a CSV file;
  • running JOIN queries over data contained in a JSON file and data contained in multiple, similarly structured, data files in the same directory.
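To give a flavour of the JOIN queries in that list, here’s a minimal sketch of a helper that writes one out. The file names, the memberId field and the join_query helper are all hypothetical rather than taken from the notebook; it assumes a CSV file without headers configured, so its fields are addressed through Drill’s columns array.

```python
def join_query(json_path, csv_path, json_key, csv_key_col, csv_value_col, limit=10):
    """Build a Drill SQL query joining a JSON source to a headerless CSV source.

    Either path can point at a single file or at a directory of
    similarly structured files.
    """
    return (
        f"SELECT j.`{json_key}` AS join_key, "
        f"c.columns[{csv_value_col}] AS csv_value "
        f"FROM dfs.`{json_path}` AS j "
        f"JOIN dfs.`{csv_path}` AS c "
        f"ON j.`{json_key}` = c.columns[{csv_key_col}] "
        f"LIMIT {limit}"
    )

# JSON file joined to a CSV file
print(join_query("/nbdata/answers.json", "/nbdata/members.csv", "memberId", 0, 1))

# JSON file joined to a whole directory of CSV files
print(join_query("/nbdata/answers.json", "/nbdata/members", "memberId", 0, 1))
```

The only difference between the two calls is whether the second path names a file or a folder, which is what makes the directory trick so convenient.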

All I need to do now is come up with a simple “up and running” recipe for working with Apache Drill. I’m thinking: Docker compose and some linked containers: Jupyter notebook, RStudio, Apache Drill, and maybe a shared data volume between them?