Category: Rstats

Note On My Emerging Workflow for Working With Binderhub

Yesterday saw the public reboot of Binder / MyBinder (which I first wrote about a couple of years ago here), as reported in the Jupyter project blog post Binder 2.0, a Tech Guide, and this practical guide: Introducing Binder 2.0 — share your interactive research environment.

For anyone not familiar with Binder / MyBinder, it’s a service that will launch a fully running Jupyter notebook server and computing environment based on the contents of a Github repository (config files as well as notebooks). What this means is that if you put your Jupyter notebooks into a Github repository, along with one or two simple files that list any Linux or Python packages you need to install in order to run the code in the notebooks (or R packages, and perhaps Rmd files, if you also install an R kernel/RStudio), you can get browser access to that running environment at just the click of a link – and the generosity of whoever is paying for the servers the notebook server runs on.

The system has been rebuilt to use Jupyterhub, with the codebase renamed Binderhub. There are also several utility tools associated with the project, including the really handy repo2docker, which builds a Docker image from the contents of a local folder or Github repository.

One of the things that particularly interested me in the announcement blog posts was the following aspirational remark:

We would love to see others deploy their own BinderHub servers, either for their own communities, or as part of a federated public service of BinderHubs.

I’d love to see the OU get behind this, either directly or under the banner of OpenLearn, as part of an effort to help make Jupyter powered interactive open educational materials available without the need to install any software.

(I tried to pitch it to FutureLearn to help support the OU/FutureLearn Learn to Code for Data Analysis MOOC when we were writing that course, but they weren’t interested…)

One disadvantage is that Binderhub is a stateless service, which means you need to download any notebooks you’re working on and then upload them again yourself if you stop an interactive session: the environment you were working in is personal to you, but it’s also destroyed whenever you close the session (or after a particular amount of time?). So other solutions are required for persisting state (i.e. having a personal file storage area). Jupyterhub is one way to do that (and one of the things we’re starting to explore in the OU at the moment).

Through playing with Binderhub over the last couple of weeks, as part of an attempt to put together some demos showing how to use Jupyter notebooks to support the creation of educational content containing rich content (images, interactives) generated from specifications contained within the notebook document (think: writing diagrams), I’ve come to the following workflow:

  • create a Github repository to host different builds (example). In my case, these are for different topic areas; but they could be different research projects, courses, data journalism investigations, etc.
  • put each build in a branch (example);
  • work up the build instructions for the environment either using Github/Binder or locally; I was having to use Github/Binder because I was working on a slow network connection that made building my evolving image difficult. But it meant that every time I made a change to the build, it used up Binder resources to do so.
  • if the build is a big one, it can take time to complete. I think that Binder rebuilds the Docker image each time you update the repo, so even if you only update notebook files, *I think* the package installation steps are also run, even though those files *haven’t* changed? To simplify this process, we can instead create a Docker image from our build files and push that to Dockerhub (example).
  • We can then create a new build process for our working repository that pulls the pre-built image (containing all the required packages) and adds in the working notebooks (example).
  • We can also share a minimum viable repository that can be forked to allow other people to use the same environment (example).

One advantage of this route is that it separates “sys admin” concerns – building and installing the required packages – from “working” concerns relating to developing the contents of the notebooks. (I think the working repository that uses the Dockerfile build can also draw on the postbuild file to add in any additional or missing packages, which can then be added to the container build as part of a maintenance step.)

PS picking up on a recent related Downes presentation – Applications, Algorithms and Data: Open Educational Resources and the Next Generation of Virtual Learning – and a response from @jimgroom that I really need to comment back on – Containing the Future of OER – this phrase comes to mind: “syndicated runtime”, e.g. if you syndicate the HTML version of a notebook via an RSS feed with a link back to the Binder runnable version of it…

Quick Round-Up – Visualising Flows Using Network and Sankey Diagrams in Python and R

Got some data, relating to how students move from one module to another. Rows are student ID, module code, presentation date. The flow is not well-behaved. Students may take multiple modules in one presentation, and modules may be taken in any order (which means there are loops…).

My first take on the data was just to treat it as a graph and chart flows without paying attention to time, other than to direct the edges (module A taken directly after module B; if multiple modules are taken by the same student in the same presentation, they both have the same precursor(s) and follow-on module(s), if any) – the following (dummy) data shows the sort of thing we can get out using networkx and the pygraphviz output:

The data can also be filtered to show just the modules taken leading up to a particular module, or taken following a particular module.

The R diagram package has a couple of functions that can generate similar sorts of network diagram, such as its plotweb() function. For example, a simple flow graph:
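Something along the following lines (a minimal, made-up sketch rather than the original demo; the module names and flow counts are invented) gives a flavour of the call involved:

library(diagram)

# A square flow matrix: rows/columns are modules, values are student counts
modules <- c("M1", "M2", "M3", "M4")
flowmat <- matrix(c( 0, 20,  5,  0,
                     0,  0, 15, 10,
                     0,  0,  0,  8,
                     0,  0,  0,  0),
                  nrow = 4, byrow = TRUE,
                  dimnames = list(modules, modules))

# Plot the flows as a web diagram
plotweb(flowmat, main = "Module flows")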

Or something that looks a bit more like a finite state machine diagram:
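A state machine style layout can be produced with diagram::plotmat(); here’s a hedged sketch with dummy labels (this may not be how the original figure was produced; arrows are drawn from the column box to the row box):

library(diagram)

# Arrow labels: M[i, j] draws a labelled arrow from box j to box i
M <- matrix(nrow = 3, ncol = 3, byrow = TRUE, data = 0)
M[2, 1] <- "20"   # M1 -> M2
M[3, 2] <- "15"   # M2 -> M3
M[1, 3] <- "5"    # M3 -> M1, a loop back

plotmat(M, pos = c(1, 2), name = c("M1", "M2", "M3"), curve = 0.2)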

(In passing, I also note the R diagram package can be used to draw electrical circuit diagrams/schematics.)

Another way of visualising this blocked data might be to use a simple two-dimensional flow diagram, such as a transition plot from the R Gmisc package.

For example, the following could describe total flow from one module to another over a given period of time:
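A minimal sketch of the sort of call involved, assuming Gmisc::transitionPlot() accepts a transition count matrix (the counts here are invented, with “from” modules as rows and “to” modules as columns):

library(Gmisc)

trn <- matrix(c(50, 10,  5,
                 8, 40, 12,
                 2,  6, 30),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("M1", "M2", "M3"),   # from
                              c("M1", "M2", "M3")))  # to

# Draw a two-column transition plot from the transition matrix
transitionPlot(trn)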

If there are a limited number of presentations (or modules) of interest, we could further break down each category to show the count of students taking a module in a particular presentation (or going directly on to / having directly come from a particular module; in this case, we may want an “other” group to act as a catch all for modules outside a set of modules we are interested in; getting the proportions right might also be a fudge).

Another way we might be able to look at the data “out of time” to show flow between modules is to use a Sankey diagram that allows for the possibility of feedback loops.

The Python sankeyview package (described in Hybrid Sankey diagrams: Visual analysis of multidimensional data for understanding resource use) looks like it could be useful here, if I can work out how to do the set-up correctly!

Again, it may be appropriate to introduce a catch-all category to group modules into a generic Other bin where there is only a small flow to/from that module to reduce clutter in the diagram.

The sankeyview package is actually part of a family of packages that includes the d3-sankey-diagram and the ipysankeywidget.

We can use the ipysankeywidget to render a simple graph data structure of the sort that can be generated by networkx.

One big problem with the view I took of the data is that it doesn’t respect time, or the numbers of students taking a particular presentation of a course. This detail could help tell a story about the evolving curriculum as new modules come on stream, for example, and perhaps change student behaviour in terms of which module they take next after a particular module. So how could we capture it?

If we can linearise the flow by using module_presentation keyed nodes, rather than just module identified nodes, and limit the data to just show students progressing from one presentation to the next, we should be able to use something like a categorical parallel co-ordinates plot, such as an alluvial diagram from the R alluvial package.
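For example, a hedged sketch with invented presentation-keyed flows:

library(alluvial)

flows <- data.frame(
  first  = c("M1_2015J", "M1_2015J", "M2_2015J"),
  second = c("M2_2016B", "M3_2016B", "M3_2016B"),
  freq   = c(120, 45, 60)
)

# One axis per time-indexed step; ribbon widths are the student counts
alluvial(flows[, c("first", "second")], freq = flows$freq)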

With time indexed modules, we can also explore richer Sankey style diagrams that require a one way flow (no loops).

So, for example, here are a few more packages that might be able to help with that, in addition to the aforementioned Python sankeyview and ipysankeywidget packages.

First up, the R networkD3 package includes support for Sankey diagrams. Data can be sourced from igraph and then exported into the JSON format expected by the package:
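A hedged sketch of that route with dummy data, assuming the igraph_to_networkD3() helper in networkD3 and copying the edge weights over as link values:

library(igraph)
library(networkD3)

g <- graph_from_data_frame(data.frame(
  from   = c("M1", "M1", "M2"),
  to     = c("M2", "M3", "M3"),
  weight = c(120, 45, 60)
))

# Convert the igraph object into the links/nodes form networkD3 expects
d3 <- igraph_to_networkD3(g, group = rep(1, vcount(g)))
d3$links$value <- E(g)$weight

sankeyNetwork(Links = d3$links, Nodes = d3$nodes,
              Source = "source", Target = "target",
              Value = "value", NodeID = "name")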

If you prefer Google charts, the googleVis R package has a gvisSankey function (that I’ve used elsewhere).
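The gvisSankey() call is along these lines (a minimal sketch with made-up data; plot() opens the chart in a browser):

library(googleVis)

sk <- gvisSankey(data.frame(from   = c("M1", "M1", "M2"),
                            to     = c("M2", "M3", "M3"),
                            weight = c(120, 45, 60)),
                 from = "from", to = "to", weight = "weight")
plot(sk)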

The R riverplot package also supports Sankey diagrams – and the gallery includes a demo of how to recreate Minard’s visualisation of Napoleon’s 1812 march.
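A minimal riverplot sketch, assuming makeRiver() accepts a node data frame with ID and x (position) columns and an edge list with N1/N2/Value columns; the data are invented:

library(riverplot)

nodes <- data.frame(ID = c("M1", "M2", "M3"),
                    x  = c(1, 2, 2),
                    stringsAsFactors = FALSE)
edges <- data.frame(N1 = c("M1", "M1"),
                    N2 = c("M2", "M3"),
                    Value = c(120, 45),
                    stringsAsFactors = FALSE)

# Build the riverplot object and plot it
r <- makeRiver(nodes, edges)
plot(r)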

The R sankey package generates a simple Sankey diagram from a simple data table:
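Something like the following, I think – a hedged sketch assuming the make_sankey()/sankey() pair from the CRAN sankey package and a simple two-column edge list (further columns, e.g. for weights or colours, may also be supported):

library(sankey)

edges <- data.frame(from = c("M1", "M1", "M2"),
                    to   = c("M2", "M3", "M3"),
                    stringsAsFactors = FALSE)

# Build the sankey object from the edge list, then draw it
s <- make_sankey(edges = edges)
sankey(s)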

Back in the Python world, the pySankey package can generate a simple Sankey diagram from a pandas dataframe.

matplotlib also has support for Sankey diagrams via matplotlib.sankey (see also this tutorial):

What I really need to do now is set up a Binder demo of each of them… but that will have to wait till another day…

If you know of any other R or Python packages / demos that might be useful for visualising flows, please let me know via the comments and I’ll add them to the post.

Via the comments…

H/T @Richard: process maps using the R bupaR package:
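A hedged sketch, using the demo patients event log from eventdataR (process_map() itself lives in the processmapR package within the bupaR family):

library(bupaR)
library(processmapR)
library(eventdataR)

# Render a frequency-annotated process map from the example event log
process_map(patients)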


PS for creating Sankey diagrams in a browser, see SankeyMATIC.

Programming, meh… Let’s Teach How to Write Computational Essays Instead

From Stephen Wolfram, a nice phrase to describe the sorts of thing you can create using tools like Jupyter notebooks, Rmd and Mathematica notebooks: computational essays, which complements the “computational narrative” phrase that is also used to describe such documents.

Wolfram’s recent blog post What Is a Computational Essay?, part essay, part computational essay, is primarily a pitch for using Mathematica notebooks and the Wolfram Language. (The Wolfram Language provides computational support plus access to a “fact engine” database that can be used to pull factual information into the coding environment.)

But it also describes nicely some of the generic features of other “generative document” media (Jupyter notebooks, Rmd/knitr) and how to start using them.

There are basically three kinds of things [in a computational essay]. First, ordinary text (here in English). Second, computer input. And third, computer output. And the crucial point is that these three kinds of things all work together to express what’s being communicated.

In Mathematica, the view is something like this:


In Jupyter notebooks:

In its raw form, an RStudio Rmd document source looks something like this:
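For example, a minimal (invented) Rmd source mixing markdown text with an executable chunk might look like:

---
title: "A minimal computational essay"
output: html_document
---

Some ordinary explanatory text, written in markdown.

```{r}
# Computer input: a short, self-contained chunk
summary(cars)
```

More text, discussing the computer output generated above.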

A computational essay is in effect an intellectual story told through a collaboration between a human author and a computer. …

The ordinary text gives context and motivation. The computer input gives a precise specification of what’s being talked about. And then the computer output delivers facts and results, often in graphical form. It’s a powerful form of exposition that combines computational thinking on the part of the human author with computational knowledge and computational processing from the computer.

When we originally drafted the OU/FutureLearn course Learn to Code for Data Analysis (also available on OpenLearn), we wrote the explanatory text – delivered as HTML but including static code fragments and code outputs – as a notebook, and then “ran” the notebook to generate static HTML (or markdown) that provided the static course content. These notebooks were complemented by actual notebooks that students could work with interactively themselves.

(Actually, we prototyped authoring both the static text, and the elements to be used in the student notebooks, in a single document, from which the static HTML and “live” notebook documents could be generated: Authoring Multiple Docs from a Single IPython Notebook.)

Whilst the notion of the computational essay as a form is really powerful, I think the added distinction between generative and generated documents is also useful. For example, a raw Rmd document or Jupyter notebook is a generative document that can be used to create a document containing text, code, and the output generated from executing the code. A generated document is an HTML, Word, or PDF export from an executed generative document.

Note that the generating code can be omitted from the generated output document, leaving just the text and code generated outputs. Code cells can also be collapsed so the code itself is hidden from view but still available for inspection at any time:

Notebooks also allow “reverse closing” of cells—allowing an output cell to be immediately visible, even though the input cell that generated it is initially closed. This kind of hiding of code should generally be avoided in the body of a computational essay, but it’s sometimes useful at the beginning or end of an essay, either to give an indication of what’s coming, or to include something more advanced where you don’t want to go through in detail how it’s made.

Even if notebooks are not used interactively, they can be used to create correct static texts where outputs that are supposed to relate to some fragment of code in the main text actually do so because they are created by the code, rather than being cut and pasted from some other environment.

However, making the generative – as well as generated – documents available means readers can learn by doing, as well as reading:

One feature of the Wolfram Language is that—like with human languages—it’s typically easier to read than to write. And that means that a good way for people to learn what they need to be able to write computational essays is for them first to read a bunch of essays. Perhaps then they can start to modify those essays. Or they can start creating “notes essays”, based on code generated in livecoding or other classroom sessions.

In terms of our own learnings to date about how to use notebooks most effectively as part of a teaching communication (i.e. as learning materials), Wolfram seems to have come to many similar conclusions. For example, try to limit the amount of code in any particular code cell:

In a typical computational essay, each piece of input will usually be quite short (often not more than a line or two). But the point is that such input can communicate a high-level computational thought, in a form that can readily be understood both by the computer and by a human reading the essay.

...

So what can go wrong? Well, like English prose, [computational essays] can be unnecessarily complicated, and hard to understand. In a good computational essay, both the ordinary text, and the code, should be as simple and clean as possible. I try to enforce this for myself by saying that each piece of input should be at most one or perhaps two lines long—and that the caption for the input should always be just one line long. If I’m trying to do something where the core of it (perhaps excluding things like display options) takes more than a line of code, then I break it up, explaining each line separately.

It can also be useful to "preview" the output of a particular operation that populates a variable for use in the following expression to help the reader understand what sort of thing that expression is evaluating:

Another important principle as far as I’m concerned is: be explicit. Don’t have some variable that, say, implicitly stores a list of words. Actually show at least part of the list, so people can explicitly see what it’s like.

In many respects, the computational narrative format forces you to construct an argument in a particular way: if a piece of code operates on a particular thing, you need to access, or create, the thing before you can operate on it.

[A]nother thing that helps is that the nature of a computational essay is that it must have a “computational narrative”—a sequence of pieces of code that the computer can execute to do what’s being discussed in the essay. And while one might be able to write an ordinary essay that doesn’t make much sense but still sounds good, one can’t ultimately do something like that in a computational essay. Because in the end the code is the code, and actually has to run and do things.

One of the arguments I've been trying to develop in an attempt to persuade some of my colleagues to consider the use of notebooks to support teaching is the notebook nature of them. Several years ago, one of the en vogue ideas being pushed in our learning design discussions was to try to find ways of supporting and encouraging the use of "learning diaries", where students could reflect on their learning, recording not only things they'd learned but also ways they'd come to learn them. Slightly later, portfolio style assessment became "a thing" to consider.

Wolfram notes something similar from way back when...

The idea of students producing computational essays is something new for modern times, made possible by a whole stack of current technology. But there’s a curious resonance with something from the distant past. You see, if you’d learned a subject like math in the US a couple of hundred years ago, a big thing you’d have done is to create a so-called ciphering book—in which over the course of several years you carefully wrote out the solutions to a range of problems, mixing explanations with calculations. And the idea then was that you kept your ciphering book for the rest of your life, referring to it whenever you needed to solve problems like the ones it included.

Well, now, with computational essays you can do very much the same thing. The problems you can address are vastly more sophisticated and wide-ranging than you could reach with hand calculation. But like with ciphering books, you can write computational essays so they’ll be useful to you in the future—though now you won’t have to imitate calculations by hand; instead you’ll just edit your computational essay notebook and immediately rerun the Wolfram Language inputs in it.

One of the advantages that notebooks have over some other environments in which students learn to code is that the structure of the notebook can encourage you to develop a solution to a problem whilst retaining your earlier working.

The earlier working is where you can engage in the minutiae of trying to figure out how to apply particular programming concepts, creating small, playful, test examples of the sort of thing you need to use in the task you have actually been set. (I think of this as a "trial driven" software approach rather than a "test driven" one; in a trial, you play with a bit of code in the margins to check that it does the sort of thing you want, or expect, it to do before using it in the main flow of a coding task.)

One of the advantages for students using notebooks is that they can doodle with code fragments to try things out, and keep a record of the history of their own learning, as well as producing working bits of code that might be used for formative or summative assessment, for example.

Another advantage of creating notebooks, which may include recorded fragments of dead ends encountered when trying to solve a particular problem, is that you can refer back to them, and reuse what you learned, or discovered how to do, in them.

And this is one of the great general features of computational essays. When students write them, they’re in effect creating a custom library of computational tools for themselves—that they’ll be in a position to immediately use at any time in the future. It’s far too common for students to write notes in a class, then never refer to them again. Yes, they might run across some situation where the notes would be helpful. But it’s often hard to motivate going back and reading the notes—not least because that’s only the beginning; there’s still the matter of implementing whatever’s in the notes.

Looking at many of the notebooks students have created from scratch to support assessment activities in TM351, it's evident that many of them are not using them other than as an interactive code editor with history. The documents contain code cells and outputs, with little if any commentary (what comments there are are often just simple inline code comments in a code cell). They are barely computational narratives, let alone computational essays; they're more of a computational scratchpad containing small code fragments, without context.

This possibly reflects the prior history in terms of code education that students have received, working "out of context" in an interactive Python command line editor, or a traditional IDE, where the idea is to produce standalone files containing complete programmes or applications. Not pieces of code, written a line at a time, in a narrative form, with example output to show the development of a computational argument.

(One argument I've heard made against notebooks is that they aren't appropriate as an environment for writing "real programmes" or "applications". But that's not strictly true: Jupyter notebooks can be used to define and run microservices/APIs as well as GUI driven applications.)

However, if you start to see computational narratives as a form of narrative documentation that can be used to support a form of literate programming, then once again the notebook format can come into its own, and draw on styling more common in a text document editor than a programming environment.

(By default, Jupyter notebooks expect you to write text content in markdown or markdown+HTML, but WYSIWYG editors can be added as an extension.)

Use the structured nature of notebooks. Break up computational essays with section headings, again helping to make them easy to skim. I follow the style of having a “caption line” before each input. Don’t worry if this somewhat repeats what a paragraph of text has said; consider the caption something that someone who’s just “looking at the pictures” might read to understand what a picture is of, before they actually dive into the full textual narrative.

As well as allowing you to create documents in which the content is generated interactively - code cells can be changed and re-run, for example - it is also possible to embed interactive components in both generative and generated documents.

On the one hand, it's quite possible to generate and embed an interactive map or interactive chart that supports popups or zooming in a generated HTML output document.

On the other, Mathematica and Jupyter both support the dynamic creation of interactive widget controls in generative documents that give you control over code elements in the document, such as sliders to change numerical parameters or list boxes to select categorical text items. (In the R world, there is support for embedded shiny apps in Rmd documents.)
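In the Rmd case, a minimal (hedged) sketch of an interactive document: add runtime: shiny to the YAML header and drop widget and render calls into a chunk:

---
title: "Interactive example"
output: html_document
runtime: shiny
---

```{r, echo=FALSE}
# A slider widget controlling a parameter used by the rendered output
sliderInput("n", "Sample size:", min = 10, max = 500, value = 100)

renderPlot({
  hist(rnorm(input$n), main = paste("n =", input$n))
})
```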

These can be useful when creating narratives that encourage exploration (for example, in the sense of explorable explanations), though I seem to recall Michael Blastland expressing concern several years ago about how ineffective interactives could be in data journalism stories.

The technology of Wolfram Notebooks makes it straightforward to put in interactive elements, like Manipulate, [interact/interactive in Jupyter notebooks] into computational essays. And sometimes this is very helpful, and perhaps even essential. But interactive elements shouldn’t be overused. Because whenever there’s an element that requires interaction, this reduces the ability to skim the essay.

I've also thought previously that interactive functions are a useful way of motivating the use of functions in general when teaching introductory programming. For example, An Alternative Way of Motivating the Use of Functions?.

One of the issues in trying to set up student notebooks is how to handle boilerplate code that is required before the student can create, or run, the code you actually want them to explore. In TM351, we preload notebooks with various packages and bits of magic; in my own tinkerings, I'm starting to try to package stuff up so that it can be imported into a notebook in a single line.

Sometimes there’s a fair amount of data—or code—that’s needed to set up a particular computational essay. The cloud is very useful for handling this. Just deploy the data (or code) to the Wolfram Cloud, and set appropriate permissions so it can automatically be read whenever the code in your essay is executed.

As far as opportunities for making increasing use of notebooks as a kind of technology go, I came to a similar conclusion some time ago as Stephen Wolfram does when he writes:

[I]t’s only very recently that I’ve realized just how central computational essays can be to both the way people learn, and the way they communicate facts and ideas. Professionals of the future will routinely deliver results and reports as computational essays. Educators will routinely explain concepts using computational essays. Students will routinely produce computational essays as homework for their classes.

Regarding his final conclusion, I'm a little bit more circumspect:

The modern world of the web has brought us a few new formats for communication—like blogs, and social media, and things like Wikipedia. But all of these still follow the basic concept of text + pictures that’s existed since the beginning of the age of literacy. With computational essays we finally have something new.

In many respects, HTML+Javascript pages have been capable of delivering, and have actually been delivering, computationally generated documents for some time. Whether computational notebooks offer some sort of step change away from that, or actually represent a return to the original read/write imaginings of the web, with portable and computed facts accessed using Linked Data, remains to be seen.

Fragment – DIT4C – Docker Base Containers for Edu Remote Computing Labs

What’s an effective way of helping a student run a desktop application when their own computer won’t run the application locally, for whatever reason? Virtualised software, running remotely, provides one solution. So here’s an example of a project that looks at doing just that: DIT4C (“Data Intensive Tools for the Cloud”), a platform for hosting data analysis tools “in the cloud” using containers [repo].

Prepackaged, standalone containers are defined for a range of applications, including RStudio, Jupyter notebooks, Jupyter+R and OpenRefine.

Standalone Containers With Branded Landing Page

The application containers are built on top of a base container that includes an nginx webserver/proxy, a GoTTY shell and a file uploader. The individual containers then have a “homepage” that links to the particular application:

So what do we have at this point?

  • a branded landing page;
  • a browser accessed shell;
  • a browser accessed file uploader.

These services are all running within a single container. I don’t know if there’s a way of linking multiple containers using docker-compose? This would require finding some way of announcing the services provided by each container to a central nginx server, which could then link to each from a single homepage. But this would mean separate terminals and file loaders into each one (though maybe the shared files could be handled as a single mounted volume shared across all the linked containers?).

Once again, I’m coming round to the idea that using a single container to run multiple services, rather than several linked containers each running a single service, is simpler, even if it does go against the (ideal?) model of using containers as part of a small pieces, loosely joined architecture? I think I need to post a simple recipe (or recipes) somewhere that shows different ways of running multiple services within a single container. The docker docs – Run multiple services in a container – provide a crib for this at the moment.

X11 Applications

Skimming the docs, I notice reference to a base X11 desktop container. Interesting… I have a PhD student looking for an easy way to host a Qt widget running application in the cloud for evaluation purposes. To this end I’ve just started looking around for X11/noVNC web client containers that would allow us to package the app in a simple container then access it from something like Digital Ocean (given there’s no internal OU docker container hosting service that I’m allowed to access (or am aware of… Maybe on the Faculty cluster?)).

So things like this show the way – a container that offers a link to a containerised “desktop” application, in this case QGIS (dit4c/dockerfile-dit4c-container-qgis); (does the background colour mean anything, I wonder? How could we make use of background colour in OU containers?):


Following the X11 Session link, we get to a desktop:

There’s an icon in the toolbar to the application we want – QGIS:

What I’m thinking now is this could be handy for running the V-REP robot simulator, and maybe Gephi…

It also makes me think that things could be simplified a little further by offering a link to QGIS, rather than X11 Application, and opening the application in full screen mode (on the virtualised desktop) on start-up. (See Distributing Virtual Machines That Include a Virtual Desktop To Students – V-REP + Jupyter Notebooks for some thoughts on how to use VMs to distribute a single desktop application, pre-launched on startup, to try to simplify the student experience.)

It also makes me even more concerned about the apparent lack of interest in, and even awareness of, the possibilities of virtualised software offerings within the OU. For example, at a recent SIG group on (interactive) maps/mapping, brief mention was made of using QGIS, and problems arising therefrom (though I forget the context of the problems). Here we have a solution – out there for all to see and anyone to find – that demonstrates the use of QGIS in a prebuilt container. But who, internally, would think to mention that? I don’t think any of the Tech Enhanced Learning folk I’ve spoken to would even consider it, if they are even aware of it as an option?

(Of course, in testing, it might be rubbish… how much bandwidth is required for a responsive experience when creating detailed maps? See also one of my earlier related experiments: Accessing GUI Apps Via a Browser from a Container Using Guacamole, which demonstrated remotely accessing the Audacity audio editor using a cloud hosted container.)

The Platform Offering

Skimming through the repos, I (mistakenly, as it happens) thought I saw a reference to resbaz (ResBaz Cloud – Containerised Research Apps as a Service). I was mistaken in thinking I had seen a reference in the code I skimmed through, but not, it seems, in the fact that there is a relationship:

And so it seems that, perhaps more interestingly than the standalone containers, DIT4C is a platform offering (architecture docs), providing authenticated access to users, file persistence (presumably?) and the ability to launch prebuilt docker images as required.

That said, looking at the Github repository commits for the project, there appears to have been little activity since March 2017 and the gitter channel appears to have gone silent at the end of 2016. In addition, the docs for getting an instance of the platform up and running are a little bit too sparse for me to follow easily… [UPDATE: it seems as if the funding did run out/get pulled:-(]

So maybe as a project, DIT4C is perhaps now “of historical interest” only, rather than being a live project we might have been able to jump on the back of to get an OU hosted remote computing lab up and running? :-( That said, the ResBaz (Research Bazaar) initiative, “worldwide festival promoting the digital literacy emerging at the center of modern research”, still seems to be around…

Oh How I Have Failed Thee, Jupyter Notebooks…

Although I first came across Jupyter – then IPython – notebooks in October 2012 (I think…), it took me another six months or so before I started playing regularly with them and pitched them for the then nascent TM351 course (geeknotes/history). We decided to explore the notebooks when the course/module team first met around about October 2013. Four years ago. Notebooks were also adopted for the Learn to Code for Data Analysis FutureLearn course (H/T to Michel Wermelinger for driving that) and only get the briefest of look-ins in the new level 1 course TM112 (even after I showed we could probably get turtle running in them…).

But to my shame I haven’t lobbied more on campus, and haven’t done the rounds giving talks and workshops and putting together meaningful demos.

Which is possibly the sort of activity that this newly advertised, and hugely attractive, role at the University of Edinburgh (h/t @PhilBarker) is designed to support – eLearning Officer Computational Notebooks.

Do you have a sound knowledge of technology and an enthusiasm for evaluating new approaches in education? We are looking for a learning technologist with a passion for communication and relationship management to lead a pilot of Jupyter notebooks for learning and teaching at the University of Edinburgh.

Jupyter notebooks are open-source web applications that enable learners to create, share and reuse computational narratives. Based within the central Information Services you will work closely with academic colleagues in Science and Engineering. You will analyse user requirements, advise on and support the use of Jupyter and evaluate the success of the pilot.

After clicking on the Apply button, we get to some more detail. Part of the purpose of the job is to “scope, assess demand and support requirements for a computational notebook (Jupyter Notebook) service”, something we’re trying to push through in a very limited form in the OU in order to support the TM112 notebook activity.

Here’s how the responsibilities unpick:

  1. To help academic and support staff make best use of learning technology services (in this case Jupyter Computational Notebook Service) and where required supporting and managing service change. Documenting use cases and sharing good practice and innovative solutions to improve the user experience. (Approx % of time 40%)
  2. To work with the user community and project partners in academic departments in order to continually improve the services and range of tools on offer. To maintain an up-to-date knowledge of the broader e-learning landscape in order to influence strategic direction and to develop innovative and appropriate use of learning technologies. (Approx % of time 30%)
  3. To participate and lead user and partner engagement events, in order to promote collaboration, knowledge sharing and greater awareness of services. To organise testing, training and workshops to support users. To represent the University and its interests both internally and externally. (Approx % of time 20%)
  4. Contribute to process improvement within both ISG and the wider University. Liaise and negotiate within members of University committees, user forums and working groups to formulate policy in accordance with the university strategic aims for learning and teaching. (Approx % of time 10%)

(On process improvement, I think Jupyter notebooks can provide a useful authoring environment (along with things like “written diagrams“) for “reproducible”  (which is to say, maintainable) course materials in the OU context. An approach I have had a total lack of success in promoting.)

I couldn’t help but try out a quick search for other notebooks related job ads, and turned up a handful of research posts, including one for a Bioinformatics Training Developer at the University of Cambridge – Cancer Research UK Cambridge Institute. The job duties and requirements provide an interesting complement to the skills required of a data journalist:

The training courses and summer schools already established are very popular and have gained a strong reputation. In this role, you will further develop the existing courses to reflect new advances. You will also create and deliver new training courses and materials in scientific data analysis and visualization, … . You will be responsible for assessing the training needs of research scientists and shaping a programme to meet those needs. This is an excellent opportunity to develop and apply new training approaches, making use of technologies such as R/Python Notebooks, Shiny web applications and Docker.

The successful candidate will have a degree in a scientific or computational discipline and preferably a postgraduate degree (MSc or PhD) and/or significant experience in Bioinformatics or Computational Biology, including the analysis of omics datasets using R. The role requires a high level of interpersonal and organizational skills and previous experience in preparation and delivery of training courses is essential. Strong practical skills in R and/or Python are highly desirable, including the use of version control systems, e.g. GitHub. …

[My emphasis.]

It’s maybe also worth mentioning here the current consultation around the draft Data Scientist Integrated Degree Apprenticeship (level 6) standard. Please comment if you can…

PS I popped together a feed for a search for “notebooks” on jobs.ac.uk using fetchrss.com to try to keep track of future upcoming academic job ads making mention of notebooks.

ergastR – R Wrapper for ergast F1 Results Data API

By the by, I’ve posted a first attempt at an R package – ergastR to wrap the ergast developer API, which is where I get chunks of data from for my f1datajunkie tinkerings.

You can find it on Github: psychemedia/ergastR.

The function names are the ones used in the Wrangling F1 Data With R book.

The R package needs a bit of tidying up and also needs work on the following: caching, so that we don’t keep hitting the ergast API unnecessarily; paged results handling (I fudge this a bit at the moment by explicitly setting a large results limit); and dual handling of ergast API versus downloaded ergast database requests (if a database connection string is passed, use that rather than make a call to the ergast API). But it’s a start… Feel free to raise issues via the repo:-)
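On the caching point, one lightweight option might be to memoise the API-calling functions; a minimal sketch under the assumption of a simple JSON-fetching helper (the helper name and URL pattern here are illustrative, not ergastR internals):

library(memoise)
library(jsonlite)

# Illustrative helper: fetch a JSON result set from the ergast API
get_ergast_json <- function(path) {
  fromJSON(paste0("http://ergast.com/api/f1/", path, ".json?limit=1000"))
}

# Memoised wrapper: repeat calls with the same path are served from an in-memory cache
get_ergast_json_cached <- memoise(get_ergast_json)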

In related news, Will Vaughan tipped me off to a Python package he’s started putting together to wrap the ergast API: ergast-python. He’s also making a start on some Wrangling F1 Data Jupyter notebooks that make use of the Python wrapper: wranglingf1data.

HexJSON HTMLWidget for R, Part 3

In HexJSON HTMLWidget for R, Part 1 I described a basic HTMLwidget for rendering hexJSON maps using d3-hexJSON, and HexJSON HTMLWidget for R, Part 2 described updates for supporting colour.

Having booked off today for emergency family cover that turned out not to be required, I had another stab at the package, so it now supports the following additional features…

Firstly, I had a go at popping some “base” hexjson files into a location within the package from which I could load them (checkin). This was based on a crib from here, which suggests putting data files into an extdata folder in the package inst/ folder, from where devtools::build() makes them available in the built package root directory.

hexjsonbasefiles <- function(){
  # List the "base" hexjson files shipped in the package's extdata folder
  list.files(system.file("extdata", package = "hexjsonwidget"))
}

With the files in place, we can use any base hexjson files included in the package as the basis for hexmaps.

I also added the ability to switch off labels, although later in the day I simplified this process…

One thing that was close to the top of my list was the ability to merge the contents of a dataframe into a hexJSON object. In particular, for a dataframe row identified by a key value that matches a hex key, I wanted to map the row’s columns onto hex attributes. The hexjson object is represented as a list, so this required a couple of things: firstly, getting the dataframe data into an appropriate list form; secondly, merging this into the hexjson list using the list.merge() function from the rlist package. Here’s the gist of the trick I ended up with, which was to construct a list split() from each row in a dataframe, with the rowname as the list name, using lapply(.., as.list):

# Convert each dataframe row into a list, named by its rowname (the hex key)
ll=lapply(split(customdata, rownames(customdata)), as.list)
# Merge the row data into the corresponding hexjson hex entries
jsondata$hexes = rlist::list.merge(jsondata$hexes, ll)

A hexjsondatamerge(hexjson,df) function takes a hexjson file and merges the contents of the dataframe into the hexes:

The contents of a dataframe can also be merged in directly when creating a hexjsonwidget:

Having started to work with dataframes, it also seemed like it might be sensible to support the creation of a hexjson object directly from a dataframe. This uses a similar trick to the one used in creating the nested list for the merge function:

hexjsonfromdataframe <- function(df,layout="odd-r", keyid='id',
                                 q='q', r='r'){

  # Use the key column values as rownames, which become the hex keys
  rownames(df) = df[[keyid]]
  df[[keyid]] = NULL
  # Standardise the co-ordinate column names to the q/r names hexjson expects
  colnames(df)[colnames(df) == q] = 'q'
  colnames(df)[colnames(df) == r] = 'r'

  # Return the hexjson structure: a layout plus a named list of hexes
  list(layout=layout,
       hexes=lapply(split(df, rownames(df)), as.list))
}
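A hedged usage sketch (the dataframe and its columns are invented; it relies on the default keyid/q/r column names from the function above):

df <- data.frame(id = c("a", "b", "c"),
                 q  = c(0, 1, 2),
                 r  = c(0, 0, 1),
                 label = c("Area A", "Area B", "Area C"),
                 stringsAsFactors = FALSE)

jsondata <- hexjsonfromdataframe(df)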


As you might expect, we can then use the hexjson object to create a hexjsonwidget:

A hexjsonwidget can also be created directly from a dataframe:

If we wanted to save the hexjson to a file, we could do something like: write( toJSON( jjx ), "test_out.hexjson" ).

In creating the hexjson-from-dataframe, I also refactored some of the other bits of code to simplify the number of parameters I’d started putting into the hexjsonwidget() function, in effect overloading them so the same named parameter could be used in different supporting functions.

I think that’s pretty much it from the developments I had in mind for the package. Now all I need to do is put it into practice… testing for which will, no doubt, throw up issues!

PS Not quite it… I just added some simple file handlers too: to save the hexjson to a file, use hexjsonwrite(df, filename) and to read a json/hexjson file into a hexjson object use hexjsonread(filename).