Rolling Your Own IT – Automating Multiple File Downloads

Yesterday, I caught up with a video briefing on Transforming IT from the OU’s Director of IT, recorded earlier this year (OU internal link, which, being on Sharepoint, needs Microsoft authentication, rather than OU single sign on?).

The video, in part, describes the 20 year history of some of the OU’s teaching related software services, which tended to be introduced piecemeal and which are not necessarily as integrated as they could be…

In the post Decision Support for Third Marking Significant Difference Double Marked Assessments, I mentioned part of the OU process for managing third marking.

Guidance provided for collecting scripts for third marking is something like this:

The original markers’ scores and feedback will be visible in OSCAR.

Electronically submitted scripts can be accessed in the eTMA system via this link: …

Please note the scripts can only be accessed via the EAB/admin tab in the eTMA system ensuring you add the relevant module code and student PI.

[My emphasis.]

Hmmm… OSCAR is accessed via a browser, and supports “app internal” links that display the overall work allocation, a table listing the students, along with their PIs, and links to various data views including the first and second marks table referred to in the post mentioned above.

The front end to the eTMA system is a web form that requests a course code and student PI, which then launches another web page listing the student’s submitted files, a confirmation code that needs to be entered in OSCAR to let you add third marks, and a web form that requires you to select a file download type from a drop down list with a single option and a button to download the zipped student files.

So that’s two web things…

To download multiple student files requires a process something like this:

So why not just have something on the OSCAR work allocation page that lets you select – or select all – the students and download all the files, or get all the confirmation codes?

Thinks… I could do that, sort of, over coffee… (I’ve tried to obfuscate details while leaving the general bits of code that could be reused elsewhere in place…)

First up, we need to login and get authenticated:

!pip3 install MechanicalSoup

import mechanicalsoup
import pandas as pd


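def getSession():
 browser = mechanicalsoup.StatefulBrowser()
 #Open the login page first (LOGIN_URL is a placeholder for the obfuscated address)
 browser.open(LOGIN_URL)
 browser.select_form(FORM_ID) #in form: #loginForm
 #USERNAME and PASSWORD are placeholders for the obfuscated credentials
 browser["username"] = USERNAME
 browser["password"] = PASSWORD
 resp = browser.submit_selected()
 return browser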


Now we need a list of PIs. We could scrape these from OSCAR, but that’s a couple of steps and easier just to copy and paste the table from the web page for now:

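#Get student PIs - copy and paste table from OSCAR for now
#(the pasted table text is elided here; txt is a placeholder for it)
txt = '''...'''

#Put that data into a pandas dataframe then pull out the PIs
from io import StringIO

#Assumption: the pasted text is tab separated with no header row
df = pd.read_csv(StringIO(txt), sep='\t', header=None)
pids=[i[0] for i in df[1].str.split()]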
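
As a rough illustration of that step, here is a minimal sketch using only the standard library. It assumes the pasted table is tab separated with no header row, and that the second column holds the PI followed by the student name; the sample rows, the `PASTED` variable and the `extract_pids` helper are all made up for the example rather than taken from OSCAR.

```python
import csv
from io import StringIO

# Hypothetical pasted table (made-up values): tab separated, no header row,
# second column holds the PI followed by the student name.
PASTED = "1\tA1234567 Ann Example\tTMA01\n2\tB7654321 Bob Sample\tTMA01\n"

def extract_pids(pasted_text, column=1):
    """Return the first whitespace-delimited token from one column of each row."""
    pids = []
    for row in csv.reader(StringIO(pasted_text), delimiter="\t"):
        # Skip blank lines and rows that are too short to have the column
        if len(row) > column and row[column].strip():
            pids.append(row[column].split()[0])
    return pids

pids = extract_pids(PASTED)
print(pids)
```

If the real table is laid out differently, the `column` argument (or the delimiter) would need changing to match.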

We now have a list of student PIs, which we can iterate through to download the relevant files:

#Download the zip file for each student
import zipfile, io, random

def downloader(pid, outdir='etmafiles'):
  print('Downloading assessment for {}'.format(pid))
  !mkdir -p {outdir}
  #Download the file... (request details obfuscated: DOWNLOAD_URL and payload are placeholders)
  r = browser.post(DOWNLOAD_URL, data=payload)

  #...and treat it as a zipfile
  z = zipfile.ZipFile(io.BytesIO(r.content))
  #Save a bit more time for the user by unzipping it too...
  z.extractall(outdir)

#Here's the iterator...
browser = getSession()
for pid in pids:
  try:
    downloader(pid)
  except Exception:
    print('Failed for {}'.format(pid))
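
The "treat the response as a zipfile" step can be tried out without going anywhere near the eTMA system. The sketch below is only an illustration: it builds a fake archive in memory to stand in for the downloaded response content, and the `unpack_zip_bytes` and `make_fake_zip` helper names, the file name and the PI are all hypothetical.

```python
import io
import os
import tempfile
import zipfile

def unpack_zip_bytes(content, outdir):
    """Unzip an in-memory zip archive (bytes) into outdir and return the member names."""
    os.makedirs(outdir, exist_ok=True)
    with zipfile.ZipFile(io.BytesIO(content)) as z:
        z.extractall(outdir)
        return z.namelist()

def make_fake_zip(files):
    """Build a zip archive in memory from a {name: text} dict (test data only)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as z:
        for name, text in files.items():
            z.writestr(name, text)
    return buf.getvalue()

# Stand-in for the bytes of a downloaded response (made-up content)
fake_response_content = make_fake_zip({"report.txt": "hello"})
target = os.path.join(tempfile.mkdtemp(), "etmafiles", "A1234567")
names = unpack_zip_bytes(fake_response_content, target)
print(names)
```

One thing to watch for: if the server sends back an HTML error page rather than a zip file, `zipfile.ZipFile` raises `zipfile.BadZipFile`, which is why a try/except around each download is handy.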

We can also grab the “student page” from the eTMA system and scrape it for the confirmation code. (On to do list, try to post the confirmation code back to OSCAR to authorise the upload of third marks, as well as auto-posting a list of marks and comments back.)

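#Scraper for confirmation codes
def getConfirmationCode(pid):
  print('Getting confirmation code for {}'.format(pid))

  #scrapy bit (SCRAPE and elements are placeholders for the obfuscated scraping step)
  confirmation_code, pid=SCRAPE(elements)
  #Return in the same order as the dataframe columns: PI, then code
  return [pid, confirmation_code]


codes = pd.DataFrame(columns=['PI','Code'])
for pid in pids:
  try:
    tmp = getConfirmationCode(pid)
    # Add data to dataframe...
    codes = pd.concat([codes, pd.DataFrame([tmp], columns=['PI','Code'])])
  except Exception:
    print('Failed for {}'.format(pid))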
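
For what the "scrapy bit" might look like, here is a minimal sketch using the standard library HTML parser. The page structure is an assumption: the `confirmationCode` element id, the sample page and the `TextById` and `get_confirmation_code` names are invented for the example and do not describe the real eTMA page.

```python
from html.parser import HTMLParser

class TextById(HTMLParser):
    """Collect the text inside the first element carrying a given id attribute.

    Limitation: unclosed void tags (e.g. <br>) nested inside the target
    element would confuse the simple depth counter used here.
    """
    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0       # greater than 0 while inside the target element
        self.done = False    # set once the target element has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)

def get_confirmation_code(html, element_id="confirmationCode"):
    """Return the stripped text of the element with the given id, or None."""
    parser = TextById(element_id)
    parser.feed(html)
    return "".join(parser.chunks).strip() or None

# Made-up page fragment for illustration only
SAMPLE_PAGE = '<html><body><p>Code: <span id="confirmationCode"> ABC123 </span></p></body></html>'
print(get_confirmation_code(SAMPLE_PAGE))
```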


So… yes, the systems don’t join up in the usual workflow, but it’s easy enough to hack together some glue as an end-user developed application: given that the systems are based on quite old-style HTML thinking, they are simple enough to scrape and treat as a de facto roll-your-own API.

Checking the time, it has taken me pretty much as long to put the above code together as it has taken to write this post and generate the block diagram shown above.

With another hour, I could probably learn enough about the new plotly Dash package (like R/shiny for python?) to create a simple browser-based app UI for it.

Of course, this isn’t enterprise grade for a digital organisation, where everything is form/button/link/click easy, but it’s fine for a scruffy digital org where you appropriate what you need and string’n’glue’n’gaffer tape let you get stuff done (and also prototype, quickly and cheaply, things that may be useful, without spending weeks and months arguing over specs and font styles).

Indeed, it’s the digital equivalent of the workarounds all organisations have, where you know someone or something who can hack a process, or a form, or get you that piece of information you need, using some quirky bit of arcane knowledge, or hidden backchannel, that comes from familiarity with how the system actually works, rather than how people are told it is supposed to work. (I suspect this is not what folk mean when they are talking about a digital first organisation, though?!;-)

And if it’s not useful? Well it didn’t take that much time to try it to see if it would be…

Keep on Tuttling…;-)

PS the blockdiagram above was generated using an online service, blockdiag. Here’s the code (need to check: could I assign labels to a variable and use those to cut down repetition?):

blockdiag {
  A [label="Work Allocation"];
  B [label="eTMA System"];
  C [label="Student Record"];
  D [label="Download"];
  DD [label="Confirmation Code"];
  E [label="Student Record"];
  F [label="Download"];
  FF [label="Confirmation Code"];
  G [shape="dots"];
  H [label="Student Record"];
  I [label="Download"];
  II [label="Confirmation Code"];

  OSCAR -> A -> B;

  B -> C -> D;
  C -> DD;

  B -> E -> F;
  E -> FF;
  B -> G;

  B -> H -> I;
  H -> II;
}

Is that being digital? Is that being cloud? Is that being agile (e.g. in terms of supporting maintenance of the figure?)?

(Superfluous?) Jupyter / pandas Notebook for Wrangling TEF Excel Spreadsheets

A couple of days ago, new Wonkhe employee Dave Kernohan (congrats!:-) got in touch asking if I’d be interested in helping wrangle the as then yet to be released TEF (Teaching Excellence Framework) data into a form they could work with. The suspicion was that if the data was to be released in a piecemeal fashion – one Excel spreadsheet per institution – rather than as a single data set, it would be a bit of a faff to analyse…

I found an example of the sort of spreadsheet it looked like might be used to publish the data, and started hacking a notebook to parse the sheets and pull the data into a set of monolithic files.

As it was, the data was published in a variety of forms, rendering the notebook superfluous. But I’ll post it anyway as a curiosity here.

FWIW, the final, published spreadsheets differed slightly in layout and in the naming of the internal sheets, both of which broke the original notebook. The revised notebook is no less brittle – cell ranges are typically hard coded, rather than being auto detected.

The spreadsheets contain many compound tables – for example, in the following case we have full-time and part time student data in the same table. (I parse these out by hard coded cell ranges – I really should autodetect the row number of the row the headcount headers appear on and use those values, noting the number of rows in each subtable is the same…)
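
By way of a sketch of what that autodetection might look like, the following works on a sheet represented as a plain list of rows. The sample sheet, the "headcount" marker text and the `find_marker_rows` and `split_subtables` helper names are all assumptions made for the example; the real TEF sheets are laid out differently and would need their own marker and row counts.

```python
def find_marker_rows(rows, marker):
    """Return the indices of rows in which any text cell contains the marker (case-insensitive)."""
    marker = marker.lower()
    return [i for i, row in enumerate(rows)
            if any(isinstance(cell, str) and marker in cell.lower() for cell in row)]

def split_subtables(rows, marker, n_rows):
    """Slice out the n_rows data rows that follow each marker (header) row."""
    return [rows[i + 1:i + 1 + n_rows] for i in find_marker_rows(rows, marker)]

# Made-up sheet: two subtables of equal length stacked in one table
SHEET = [
    ["Provider: Example University", None, None],
    ["Full-time", "Headcount", "Indicator"],
    ["Metric A", 100, 0.9],
    ["Metric B", 120, 0.8],
    [None, None, None],
    ["Part-time", "Headcount", "Indicator"],
    ["Metric A", 30, 0.7],
    ["Metric B", 25, 0.6],
]

header_rows = find_marker_rows(SHEET, "headcount")
full_time, part_time = split_subtables(SHEET, "headcount", n_rows=2)
print(header_rows, full_time, part_time)
```

The same idea should carry over to a pandas dataframe read with no header, by searching a column for the marker text and slicing on the row indices found.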

Also note the use of single and multi-level column headings in adjacent columns.

A single sheet may also contain multiple tables. For example, like this:

or like this:

As mentioned, the sheets also contained administrative data cruft, as revealed when opening the full sheet into a pandas dataframe:


Anyway – superfluous, as the data exists in a single long form CSV file anyway on the TEF data download site. But maybe useful as a crib for the future. Here’s a reminder of the link to the notebook.

PS another tool for wrangling spreadsheets that I really need to get my head round is databaker, which was originally developed for working with the spreadsheet monstrosities that ONS produce…

PPS the OU actively opted out of TEF, citing “justifiable reasons”… It will be interesting to see the NSS results this year… (and also see how NSS and TEF scores correlate).

Decision Support for Third Marking Significant Difference Double Marked Assessments

In the OU, project report assessed courses tend to see project reports double marked – once by the student’s own tutor, and once by another marker from the same marking pool. Despite our best efforts at producing marking guides and running a marker co-ordination event with a few sample scripts, marks often differ. University regulations suggest that marks that differ by more than 15% or that straddle grade boundaries should be third marked, although these rules can be tweaked a bit.

With a shed load of third marking about to turn up tomorrow, and that needs to be turned round for next Tuesday, I thought I’d have a quick look at what information provided by the first two markers is available to support the third marking effort.

For the course I have to third mark, the mark recording tool we use – OSCAR (Online Score Capture for Assessment Records) – makes available the marks for each marker in a table that also identifies the marking categories. The mark scheme we have in this particular case has five unequally weighted categories, with 8 marks available in each. (Note that this means there are 64 marks in all, and a delta of 10 marks roughly equates to a 15% difference and an automatic sigdiff/third marking flag. If I am doing my sums right and have grade boundaries about right, it also means two markers may give the same grade/classification (pass1, pass2, etc) but still raise a sigdiff.)
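
The sums can be checked with a few lines of code. This is only a sketch: the category names and weights below are hypothetical (any set of five weights summing to 8 gives 64 marks in all from 8 raw marks per category), as are the marker scores and the helper names.

```python
# Hypothetical weights: five unequally weighted categories whose weights sum
# to 8, so that 8 raw marks per category gives 64 weighted marks in all.
WEIGHTS = {"A": 1, "B": 1, "C": 2, "D": 2, "E": 2}
MAX_RAW = 8          # raw marks available in each category
SIGDIFF_PCT = 15.0   # marks differing by more than this are third marked

def weighted_total(raw_marks, weights=WEIGHTS):
    """Sum of raw marks multiplied by their category weights."""
    return sum(raw_marks[cat] * w for cat, w in weights.items())

def max_total(weights=WEIGHTS, max_raw=MAX_RAW):
    """Maximum weighted mark available."""
    return max_raw * sum(weights.values())

def pct_difference(m1, m2, weights=WEIGHTS):
    """Absolute difference between two markers as a percentage of the maximum."""
    gap = abs(weighted_total(m1, weights) - weighted_total(m2, weights))
    return 100.0 * gap / max_total(weights)

def is_sigdiff(m1, m2, threshold=SIGDIFF_PCT):
    """True if the two markers differ by more than the threshold percentage."""
    return pct_difference(m1, m2) > threshold

# Made-up scores for illustration
marker1 = {"A": 6, "B": 6, "C": 6, "D": 6, "E": 6}
marker2 = {"A": 6, "B": 6, "C": 5, "D": 4, "E": 4}
print(weighted_total(marker1), weighted_total(marker2), pct_difference(marker1, marker2))
```

With these made-up scores the markers are 10 weighted marks apart, which is 10/64 or just over 15%, so the pair would be flagged.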

To try to make it easier to see where significant differences were arising between two markers, I prototyped a simple spreadsheet based display that calculated the weighted marks and charted them in a couple of ways:

  • a dodged bar chart, by category, so that we could see which categories the markers differed in;
  • a stacked bar chart that shows the total score awarded by each marker.

The stacked bar chart is also overloaded with another bar that loosely identifies the grade boundaries. (The colours in this bar do not relate to the legend.) Ideally, I’d have used grade boundaries as vertical gridlines in the chart to make it clear which grade band a final mark fell into, but I’m not familiar with Excel charting and couldn’t see how to do that offhand. (Also, I guessed at where the grade boundaries are, so don’t read anything too much into the ones presented.)

(I also came across a gotcha in my version of Excel on a Mac… the charts don’t update when I paste the new data in. Instead I have to cut them and paste them back into the sheet, at which point they do update. WTF?!)

A couple of other things that should be quick to add to the prototype:

  • a statement of the grade awarded by each marker (pass 1, pass 2, fail), perhaps also qualified (strong pass 2 (at the top of the band), bare pass 3 (at the bottom of the band), solid pass 4 (in the middle of the band)), for example;
  • a statement of the average mark and the grade that would result. (One of the heuristics for awarding marks from markers that differ by a small amount is to use the average.)

I should probably also add a slot for third marker marks to be displayed…

More elaborate would be some rules to generate a brief text report that identifies which topics the markers differ significantly on, for example, by how many awarded marks, and what this translates to in terms of weighted marks (or even percentage marks).
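
A first pass at such a rule might look something like the following sketch. The category names, weights, scores and the two-mark threshold are all invented for the example, as is the `category_report` function; the wording of the generated sentences is just a placeholder.

```python
def category_report(m1, m2, weights, min_raw_gap=2):
    """Return one sentence per category where the raw marks differ by at least min_raw_gap."""
    lines = []
    for cat, w in weights.items():
        gap = m1[cat] - m2[cat]
        if abs(gap) >= min_raw_gap:
            higher = "Marker 1" if gap > 0 else "Marker 2"
            lines.append("{}: {} is {} raw mark(s) higher ({} weighted).".format(
                cat, higher, abs(gap), abs(gap) * w))
    return lines or ["No category differs by {} or more raw marks.".format(min_raw_gap)]

# Hypothetical categories, weights and scores
weights = {"Research": 1, "Analysis": 2, "Presentation": 1}
first = {"Research": 6, "Analysis": 7, "Presentation": 5}
second = {"Research": 6, "Analysis": 4, "Presentation": 6}

report = category_report(first, second, weights)
print("\n".join(report))
```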

One reason for doing this is to try to make life easier – a report may not need completely remarking if the markers just differ in one particular respect, for example (which may even be the result of an error when entering the original marks). Such a tool may also be useful at an award board for getting a quick, visually informative view of how markers awarded marks to a particular script.

But this sort of tool may also help us start to understand better why and how markers are marking differently, and what sorts of change we might need to make to the marking scheme or marking guidance. (Seeing the differences in a particular category in a visual way often leaves you with a different feeling to seeing a table of the numerical marks.)

It also provides an environment for tinkering with some automated feedback generators, powered by the marks.

Of course, I’d rather be developing these views in Jupyter notebooks/dashboards, or R, and if we had easy access to the data it wouldn’t be hard to roll a simple app together. But as a digital first, cloud organisation, we get to view each set of double marks, one HTML page at a time.

PS I don’t think a scraper would be too hard to write to pull down the marker returns for each student on a course, which are handily all linked to from a single page, and pop them into a single dataframe…. Hypothetically, here’s how we might be able to get in, for example, using the python MechanicalSoup package, which works with python3 (mechanize requires python2)…

!pip3 install MechanicalSoup
import mechanicalsoup

def getSession():
    browser = mechanicalsoup.StatefulBrowser()
    #Open the login page first (LOGIN_URL is a placeholder)
    browser.open(LOGIN_URL)
    browser.select_form(FORM_ID) #in form: #loginForm
    browser["username"] = USERNAME
    browser["password"] = PASSWORD
    resp = browser.submit_selected()
    return browser


Of course, this sort of thing probably goes against the Computing Code of Conduct… I’m also not sure if IT folk are paranoid enough to look for rapid bursts of machine generated requests and lock an account out if they spot it? But that’s not too hard to work around – just put a random delay in between page requests when running the scrape (which is a nice behaviour anyway).
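
The random delay is easy enough to sketch. In the example below, the `polite_fetch` name, the delay bounds and the idea of passing the fetching function in as an argument are my own choices for illustration; the demo uses a dummy fetcher and records the pauses rather than actually sleeping.

```python
import random
import time

def polite_fetch(urls, fetch, min_delay=1.0, max_delay=5.0, sleep=time.sleep):
    """Call fetch(url) for each url, pausing a random interval between requests."""
    results = {}
    for i, url in enumerate(urls):
        if i:  # no need to wait before the first request
            sleep(random.uniform(min_delay, max_delay))
        results[url] = fetch(url)
    return results

# Demo with a dummy fetcher; pauses are recorded instead of slept
pauses = []
pages = polite_fetch(["page1", "page2", "page3"], fetch=str.upper, sleep=pauses.append)
print(pages, len(pauses))
```

In a real scrape, `fetch` might be something like a function that calls `browser.get(url)` on the authenticated session, with `sleep` left at its default.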

So What Do the OU’s Lifelong Student Demographics Look Like?

In last week’s “state of the OU” presentation, mention was made of the OU’s rich history in innovation as something we could look back on, plus supporting “lifelong learning” as a (renewed?) commitment for the future.

Which got me wondering about what stories some of the OU’s historical data might have to tell us?

For example, for each of the last fifty years, I’d like to see some charts (for example, ggplot charts faceted by year) showing the distribution of:

  • students who joined the OU that year, by age;
  • the age distribution of *all* OU students that year;
  • the age distribution of OU students graduating that year;
  • the age distribution of the sum total of OU graduates over the life of the university, up to a particular point in time.

I’m probably completely wrong, but I’d guess the shapes may have changed over the years. How they’ve changed might help us tell ourselves some stories about what’s been going on over the last 50 years.

For example, imagine if the populations looked like this originally:

and looked more like this now?

Plotting all 50 years, in a 5 x 10 facet using the same x and y axes would tell at a glance how the university student body has evolved.

Animating the charts would tell another story.

Using a population pyramid (which typically contrasts gender by age range), we could look at the relative make up of the OU compared to other UK HEIs perhaps:

Again, doing this by year and animating may tell a useful story. Using absolute numbers rather than percentages would tell another. And again, these charts could be generated for each year’s intake, graduates, overall population, and accumulated graduates.

The charts might also help show us what lifelong learning has meant for OU populations over the last 50 years. I suspect that we’ve had two sorts of lifelong learning students over that period: students who mid-life signed up for their first degree (sometimes characterised as “second chancers who missed out the first time around”) and students who took one or two very specifically chosen modules to learn about a particular topic that might help with promotion or met a particular skills gap.

I’ve never really understood why the OU regime of the last 10 years has been hell bent on competing with other institutions for signing 18 year olds up to full (named) degrees. To support lifelong learning, don’t we need to provide upskilling in particular areas (one off modules, no entry requirements), lifelong support with access to all OU content for self study over the course of a career (or maybe a “lifelong” degree where you take a module every couple of years to fit in with career or “professional amateur” interests), or intense conversion courses to help with mid-career transitions?

Whatever – I’m guessing looking at some pictures and telling some stories off the back of them could provoke other ideas too… Not sure if the data is available in a handy form though?

Innovation Starts At Home…?

Mention was made a couple of times last week in the VC’s presentation to the OU about the need to be more responsive in our curriculum design and course production. At the moment it can take a team of up to a dozen academics over two years to put an introductory course together, that is then intended to last, without significant change, other than in the preparation of assessment material, for five years or more.

The new “agile” production process is currently being trialled by a new authoring tool, OpenCreate, that is currently available to a few select course teams as a partially complete “beta”. I think it is “cloud” based. And maybe also promoting the new “digital” first strategy. (I wonder how many letters in the KPMG ABC bingo card consulting product the OU paid for, and how much per letter? Note: A may also stand for “analytics”.)

I asked if I could have a play with the OpenCreate tool, such as it is, last week, but was told it was still in early testing (so a good time to be able to comment, then?) and so, “no”. (So instead, I went back to one of the issues I’d raised a few days ago on somebody else’s project on Github to continue helping with the testing of a feature suggestion. (A few days ago; the suggestion has already been implemented and the issue is now closed as completed, making my life easier and hopefully improving the package too.) Individuals know how to do agile. Organisations don’t. ;-))

So why would I want to play with OpenCreate now, while it’s still flaky? Partly because I suspect the team are working on a UI and have settled elements of the backend. For all the f**kwitted nonsense the consultants may have been spouting about agile, beta, cloud, digital solutions, any improvements are going to come from the way the users use the tools. And maybe workarounds they find. And by looking at how the thing works, I may be able to explore other bits of the UI design space, and maybe even bits of the output space…

Years ago, the OU moved to an XML authoring route, defining an XML schema (OU-XML) that could be used to repurpose content for multiple output formats (HTML, epub, docx, Word). By the by, these are all standardised document formats, which means other people also build tooling around them. The OU-XML document was an internal standard. Which meant only the OU developed tools for it. Or people we paid. I’m not sure if, or how much, Microsoft were paid to produce the OU’s custom authoring extensions for Word that would output OU-XML, for example… Another authoring route was an XML editor (currently, oXygen, I believe). OU-XML also underpinned OpenLearn content.

That said, OU-XML was a standard, so it was in principle possible for people who had knowledge of it to author tools around it. I played with a few myself, though they never generated much interest internally.

  • generating mind maps from OU/OpenLearn structured authoring XML documents: these provided the overview of a whole course and could also be used as a navigation surface (revisited here and here); I made these sorts of mindmaps available as an additional asset in the T151 short course, but they were never officially recognised;
  • I then started treating a whole set of OU-XML documents *as a database* which meant we could generate *ad hoc* courses on a particular topic by searching for keywords across OpenLearn courses and then returning a mindmap constructed around components in different courses, again displaying the result as a mindmap (Generating OpenLearn Navigation Mindmaps Automagically). Note this was all very crude and represented playtime. I’d have pushed it further if anyone internally had shown any interest in exploring this more widely.
  • I also started looking at ways of liberating assets and content, which meant we could perform OpenLearn Searches over Learning Outcomes and Glossary Items. That is, take all the learning outcomes from OpenLearn docs and search into that to find units with learning outcomes on that topic. Or provide a “metaglossary” generated (for free) from glossary terms introduced in all OpenLearn materials. Note that I *really* wanted to do this as a cross-OU course content demo, but as the OU has become more digital, access to content has become less open. (You used to be able to look at complete course, OU print materials in academic libraries. Now you need a password to access the locked down digital content; I suspect access expires to students after a period of time too; and it also means students can’t sell on their old course materials.);
  • viewing OU-XML documents as a structured database meant we could also asset strip OpenLearn for images, providing a search tool to lookup images related to a particular topic. (Internally, we are encouraged to reuse previously created assets, but the discovery problem about helping authors discover what previously created assets are available has never really been addressed; I’m not sure the OU Digital Archive is really geared up for this, either?)
  • we could also extract links from courses and use them as a course powered custom search engine. This wasn’t very successful at the course level (not enough links) but might have been interesting across multiple courses;
  • a first proof of concept pass at a tool to export OU-XML documents from Google docs, so you could author documents using Google docs and then upload the result into the OU publishing system.
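
To give a flavour of the "documents as a database" idea, here is a minimal sketch of a keyword search over learning outcomes pulled from a set of XML documents. The element names (`Unit`, `Title`, `LearningOutcome`) and the sample documents are made up for illustration and are not claimed to match the real OU-XML schema; `search_outcomes` is likewise a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Made-up documents: element names are illustrative only, not real OU-XML
DOCS = {
    "unit1": "<Unit><Title>Intro to Robotics</Title><LearningOutcomes>"
             "<LearningOutcome>describe how sensors work</LearningOutcome>"
             "<LearningOutcome>explain feedback control</LearningOutcome>"
             "</LearningOutcomes></Unit>",
    "unit2": "<Unit><Title>Data Handling</Title><LearningOutcomes>"
             "<LearningOutcome>clean data collected from sensors</LearningOutcome>"
             "</LearningOutcomes></Unit>",
}

def search_outcomes(docs, keyword, tag="LearningOutcome"):
    """Return (doc id, title, outcome text) for every outcome mentioning the keyword."""
    keyword = keyword.lower()
    hits = []
    for doc_id, xml_text in docs.items():
        root = ET.fromstring(xml_text)
        title = root.findtext("Title", default=doc_id)
        for el in root.iter(tag):
            text = "".join(el.itertext()).strip()
            if keyword in text.lower():
                hits.append((doc_id, title, text))
    return hits

hits = search_outcomes(DOCS, "sensors")
for hit in hits:
    print(hit)
```

Swapping the `tag` argument for a glossary or image element would give the metaglossary and image lookup variations on the same theme.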

Something that has also been on my to do list for a long time are templates to convert Rmd (Rmarkdown) and Jupyter notebook ipynb documents to OU-XML.

So… if I could get to see the current beta OpenCreate tool, I might be able to see what document format authors were being encouraged to author into. I know folk often get the “woahh, too complicated…” feeling when reading blog posts*, but at the end of the day whatever magic dreams folk have for using tech, it boils down to a few poor sods having to figure out how to do that using three things: code, document formats (which we might also view as data representations more generally) and transport mechanisms (things like http; and maybe we could also class things like database connections here). Transport moves stuff between stuff. Representations represent the stuff you want to move. Code lets you do stuff with the represented stuff, and also move it between other things that do black box transformations to it (for example, transforming it from one representation to another).

That’s it. (My computing colleagues might disagree. But they don’t know how to think about systems properly ;-)

If OpenCreate is a browser based authoring tool, the content stuff created by authors will be structured somehow, and possibly previewed somehow. There’ll also be a mechanism for posting the authored stuff into the OU backend.

If I know what (document) format the content is authored in, I can use that as a standard and develop my own demonstration authoring tools and routes around that on the input side. For example, a converter that converts Jupyter notebook, or Rmd, or Google docs authored content into that format.
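
As a sketch of what such a converter could look like on the Jupyter side: a notebook file is just JSON with a list of cells, so mapping it onto some other structured format is mostly a case of walking the cells. The target format here is entirely hypothetical (the `Document`, `Text` and `Code` element names are invented, and are neither OU-XML nor whatever OpenCreate uses), as is the `notebook_to_xml` function.

```python
import json
import xml.etree.ElementTree as ET

def notebook_to_xml(nb_json):
    """Convert notebook JSON text into a simple XML string in a made-up target format."""
    nb = json.loads(nb_json)
    root = ET.Element("Document")
    for cell in nb.get("cells", []):
        source = cell.get("source", "")
        # Cell source may be stored as a list of lines or as a single string
        text = "".join(source) if isinstance(source, list) else source
        # Code cells become Code elements; everything else (markdown, raw) becomes Text
        tag = "Code" if cell.get("cell_type") == "code" else "Text"
        ET.SubElement(root, tag).text = text
    return ET.tostring(root, encoding="unicode")

# Minimal made-up notebook for illustration
NOTEBOOK = json.dumps({
    "nbformat": 4, "nbformat_minor": 2, "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title\n", "Some text"]},
        {"cell_type": "code", "metadata": {}, "execution_count": None,
         "outputs": [], "source": ["print(1 < 2)"]},
    ],
})

xml_out = notebook_to_xml(NOTEBOOK)
print(xml_out)
```

A real converter would also need to do something sensible with markdown structure and cell outputs, which this sketch ignores.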

If there is structure in the format (as there was in OU-XML), I can use that as a basis for exploring what might be done if we can treat the whole collection of OU authored course materials as a database and exploring what sorts of secondary products, or alternative ways of using that content, might be possible.

If the formats aren’t sorted yet, maybe my play would help identify minor tweaks that could make content more, or less, useful. (Of course, this might be a distraction.)

I might also be able to comment on the UI…

But is this likely to happen? Is it f**k, because the OU is an enterprise that’s sold corporate, enterprise IT thinking from muppets who only know “agile” (or is that “analytics”?), “beta”, “cloud” and “digital” as bingo terms that people pay handsomely for. And we don’t do any of them because nobody knows what they mean…

* So for example, in Pondering What “Digital First” and “University of the Cloud” Mean…, I mention things like “virtual machines” and “Docker” and servers and services. If you think that’s too technical, you know what you can do with your cloud briefings…

The OU was innovative because folk understood technologies of all sorts and made creative use of them. Many of our courses included emerging technologies that were examples of the technologies being taught in the courses. We ate the dogfood we were telling students about. Now we’ve put the dog down and just show students cat pictures given to us by consultants.

Pondering What “Digital First” and “University of the Cloud” Mean…

In a briefing to OU staff from senior management earlier this week, VC Peter Horrocks channelled KPMG consultants with talk of the OU becoming “digital first”, and reimagining itself as a “University of the Cloud”, updating the original idea of it being a “University of the Air” [Open University jobs at risk in £100m ‘root and branch’ overhaul].

I have no clear idea what “digital” means, nor what it would mean to be a “university of the cloud” (if things are cloudy, does that mean we can’t do blue skies thinking; or that there is a silver lining somewhere?!;-). But here are a few things I’d like to explore that are based on trends that I think have been emerging for the last few years (and which I can date from historical blog posts, both here and in the original ouseful archive, which dates back to 2005..)

From Applications to Apps

In recent years, we’ve seen a move away from installed software applications that are self contained and run, offline, on a host computer, and towards installed apps that often have tight integration to online services. Apps may run, in part, offline, but they prefer it when there’s a network connection. Online apps exist solely elsewhere and are accessed via a browser.

Where the code lives, and where data files are stored, has implications for the user. If you’re using an online app, you need a reliable network connection. If all you use are online apps, a tablet or a Chromebook are fine. This in turn impacts on providers of services that make use of software (such as the OU, for example). I’ve been wittering on for years that if all students have is a Chromebook, then we’re excluding them if our courses require them to have a computer onto which you can install a “traditional” software application. This tends to fall on deaf ears – two new level 1 courses, both currently in production and that don’t launch until later this year and next year, and that would typically be expected to have a life of several years, both make use of desktop software installs. I suspect this is not university of the cloud style thinking.

So Browser First…

The view I have had for several years is that all software services we expect students to be able to access should be accessed via a browser. This frees us up to deliver applications onto the desktop that expose themselves to students via the browser, or deliver the services from a remote online host. This could be an OU delivered service (for example, via the OpenSTEM Lab), a third party delivered service (such as Azure Notebooks), or a service managed on a remote host by the student themselves (for example, the TM351 Amazon/AWS AMI we are testing at the moment).

For TM351, we took an early decision to use just such browser accessed tools for the course, in particular Jupyter notebooks and Open Refine (along with some other “headless” database services). For convenience, these were packaged inside a single virtual machine that could be installed on a “traditional” computer (Windows or Mac). Running the virtual machine exposed the services via the browser. A shared directory meant student files were kept on the host computer but could be accessed by the services running inside the VM. Although we did not explicitly provide support for students who only had access to a tablet or Chromebook that could not run the VM, a proof of concept solution using linked Docker containers that could be run on a cloud host was available as an emergency fall back.

In updating the TM351 VM for the next presentation of the course, we are also exploring making the VM available at least as an AWS (Amazon Web Services) machine instance (AMI), which would allow a student to run the VM, at their own cost, on a remote Amazon server and access the course software via their browser.

The applications that live inside the TM351 virtual machine have also been broken out into separate Docker containers. These can be combined and launched in a multitude of ways. For example, OpenRefine running on its own, Jupyter notebooks running on their own, Jupyter notebooks + PostgreSQL running in a linked fashion, Jupyter notebooks + MongoDB running in a linked fashion. The use of Docker means that the services can also be run locally on an offline student computer running Docker, or they can be run on a remote server (that is, in the cloud) and accessed via a browser. This approach means we can continue to provide software to students that runs on their own computer, but we can also provide exactly that same software from a remote host that lives “in the cloud”.

A couple of other examples of using VMs do exist in a couple of computing courses, but from what I can tell there is little interest in trying to push our thinking about how virtualised computing can be used to support either computing courses or other courses with computing needs. The part of the OU that provides “digital” support to the OU has little, if any, experience in providing cloud services, and to date there seems to have been little, if any, capacity in trying to explore this area. Digital. Cloud. Hmmm…

The use of containerised services can also extend outside the computing curriculum to other subject areas. I’ve tried to float the idea of a “digital applications shelf” in the Library that publishes standalone, virtualised (containerised) services, along with scripts for combining them (example, or as in the case of linked TM351 applications) but never really got very far with it. It only takes a little bit of imagination to see how this might work (a Dockerhub image shelf, a repository of Docker compose scripts that can wire containers created from images pulled off the shelf together), but that imagination, again, seems to be lacking. (I could be spectacularly and completely wrong, of course!;-)

What I don’t think we should be doing is making remote desktops available to students that run installed software applications (I don’t think we should give them access to a remote Windows desktop, for example, running the Windows installed software we might traditionally have developed). We should be using software that runs as a service and is delivered directly through the browser. Service based, personal computing in the cloud.

As well as providing software that students can run themselves, there’s also the question of students being provided directly with OU hosted (or at least, OU badged) online applications. We started to have some early discussions internally about a “computational wing” for the OpenSTEM Lab, but that appears to have stalled. My personal roadmap for that would be to start off by making use of a couple of open source programming environments that can be accessed via a browser and that can be run at scale (Jupyter notebooks, RStudio Server, Shiny Server). This would give us operational experience in installing and managing this sort of service – or we could pay someone else to do it… Between them, these three environments support a wide variety of computational activities. Shiny Server, and Jupyter Dashboards, also provide a means for rapidly developing and publishing small interactive applications. Shiny server has already been used in at least one course to deliver some simple interactive applications created by a self-admitted not-very-technical academic.

Exploring these technologies can also support self-service research computing, although I got the feeling that non-teaching related research may be losing support… (That said, I don’t see many/any folk researching “cloud/digital infrastructure and workflows”, or emerging trends in personal computing tech and end user application wrangling, for teaching purposes or otherwise…)

Digital Production Methods

I know I’m biased, but I think the OU is way behind the curve in document creation and production methods. Over the last two or three years, reproducible research methods have spawned a range of innovations in supporting the creation and publication of interactive and “generated” content (examples). This speeds up production and cuts down on maintenance. Documents carry the “source code” for the creation of media assets contained within them, and produce assets that reflect the state of other parts of the document around them. This avoids problems of drift between things like code fragments and the outputs they produce, as well as syntax errors introduced as part of the editing process.
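As a toy illustration of the idea (a sketch only, and not how knitr or Jupyter actually work internally), the following Python runs each code fragment in a document and embeds whatever it prints, so the displayed output is always regenerated from the displayed code. The function names run_fragment and render, and the example marks, are made up for illustration.

```python
import contextlib
import io


def run_fragment(code):
    """Execute a code fragment and return whatever it printed."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, {})  # each fragment runs in a fresh namespace
    return buffer.getvalue().rstrip("\n")


def render(parts):
    """Render ("text", str) and ("code", str) parts as plain text.

    Every code part is followed by the output it actually produces,
    so code and output cannot drift apart during editing.
    """
    lines = []
    for kind, body in parts:
        if kind == "code":
            lines.extend("    " + line for line in body.splitlines())
            lines.append("    # output: " + run_fragment(body))
        else:
            lines.append(body)
    return "\n".join(lines)


# Hypothetical document source: prose plus the code that generates a result
doc = [
    ("text", "The mean of the three marks is:"),
    ("code", "print(sum([52, 61, 70]) / 3)"),
]
print(render(doc))
```

Change the marks in the code fragment and the output line changes with it the next time the document is rendered; there is no separately maintained copy of the result to forget about.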

The depiction of media assets as computational objects also means they can be restyled by applying a different stylistic theme to the asset without changing the actual content (example1, example2).

The ability to generate both static and interactive outputs from the same media asset object is also very powerful. For example, the mpld3 python package can take a matplotlib chart object that would naturally be rendered using an image format and generate an interactive HTML chart from it – no extra work required.
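A minimal sketch of that pattern, assuming matplotlib and mpld3 are both installed (pip install matplotlib mpld3). The helper names squares, build_chart and publish, and the output filenames, are hypothetical; mpld3.fig_to_html is the call that turns the figure into HTML.

```python
def squares(n):
    """Return illustrative data: x values 0..n-1 and their squares."""
    xs = list(range(n))
    return xs, [x * x for x in xs]


def build_chart(xs, ys):
    """Create an ordinary matplotlib figure (the chart object)."""
    import matplotlib
    matplotlib.use("Agg")  # headless backend, so no display is needed
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.plot(xs, ys, marker="o")
    ax.set_xlabel("x")
    ax.set_ylabel("x squared")
    return fig


def publish(fig, basename):
    """Write a static PNG and an interactive HTML page from one figure."""
    import mpld3  # third-party package

    fig.savefig(basename + ".png")  # static image output
    with open(basename + ".html", "w") as f:
        f.write(mpld3.fig_to_html(fig))  # interactive HTML output
```

Calling publish(build_chart(*squares(10)), "chart") should leave a chart.png and a chart.html side by side, both generated from the same figure object.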

Helper libraries (with customisable templates) also mean that it can be quite simple to generate complex, templated interactive code. I may not know how to write the code to publish an interactive Google map, but I don’t have to when there’s a myriad of packages out there that will create the code for me, and put a marker on the map in the appropriate place if I give it a location.
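One example of this sort of helper is the folium package, sketched below on the assumption that it is installed (pip install folium). Note that folium generates Leaflet maps rather than Google maps, so this is a stand-in for the general idea rather than the specific case mentioned above. The function names, the label and the coordinates (roughly Milton Keynes) are illustrative only.

```python
def check_location(lat, lon):
    """Return [lat, lon] if the coordinates are plausible, else raise."""
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        raise ValueError("latitude/longitude out of range")
    return [lat, lon]


def marker_map(lat, lon, label, filename="map.html"):
    """Write an interactive HTML map with a single labelled marker."""
    import folium  # third-party package; generates the HTML/JavaScript

    location = check_location(lat, lon)
    m = folium.Map(location=location, zoom_start=13)
    folium.Marker(location, popup=label).add_to(m)
    m.save(filename)  # self-contained HTML page
    return filename
```

Something like marker_map(52.02, -0.71, "Campus") is then all that is needed to get a pannable, zoomable map page; none of the JavaScript has to be written by hand.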

Publishing workflows, such as the ones based around the R knitr package or Jupyter notebook nbconvert tool also mean that source documents represented using a simple text format (markdown) can be rendered in a variety of styles and document formats (HTML pages, PDF, .docx, HTML slideshows).
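As a rough sketch of the Jupyter side of this, the snippet below builds one jupyter nbconvert command per output format from a single source notebook. It assumes the jupyter command line tool is installed; the PDF target additionally needs a LaTeX installation, and .docx is not a built-in nbconvert target (that route typically goes via pandoc). The notebook name and helper names are hypothetical.

```python
import subprocess

# nbconvert output formats used in this sketch
FORMATS = ["html", "pdf", "slides", "markdown"]


def nbconvert_command(notebook, fmt):
    """Build the command line that converts one notebook to one format."""
    return ["jupyter", "nbconvert", "--to", fmt, notebook]


def convert_all(notebook, formats=FORMATS):
    """Render the same source notebook into every requested format."""
    for fmt in formats:
        subprocess.run(nbconvert_command(notebook, fmt), check=True)
```

Running convert_all("report.ipynb") would then produce an HTML page, a PDF, an HTML slideshow and a markdown file from the one source document.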

At one point I started to explore Jupyter based workflows, but FutureLearn head in the sand + OU production process fascism put a rapid halt to that. Can it really also be nearly 3 years since I used knitr to first publish my Wrangling F1 Data With R book to Leanpub?! (That was motivated originally purely as a way of exploring how to go from RMarkdown to a print publication in an automated way.)

I’m not sure at all what the new OU OpenCreate tool looks like, or supports, or how the workflow flows (I asked for a beta invite… still no reply…) but I wonder if any of the team even looked at things like the knitr workflow and if so, what they thought of it and how the OpenCreate workflow compares? And to what extent asset creation and whole-content maintenance plays an integrated part in it? I also wonder how “digital” it is…

Related, in the sense of “digital first” – via Cameron Neylon: As a researcher…I’m a bit bloody fed up with Data Management.

PS The cuts that aren’t cuts because some of the money will be spent elsewhere may also mean I might need a new job soon. FWIW, I tend to work from home, swear a lot on Twitter about whatever I happen to be thinking about at the time, and am not a team player. I am convinced my imposter syndrome is actually the Dunning-Kruger effect in play. Skillset: quirky.

Rolling Your Own Jupyter and RStudio Data Analysis Environment Around Apache Drill Using docker-compose

I had a bit of a play last night trying to hook a Jupyter notebook container up to an Apache Drill container using docker-compose. The idea was to have a shared data volume between the two of them, but I couldn’t for the life of me get that to work using the docker-compose version 2 or 3 (services/volumes) syntax – for some reason, any of the Apache Drill containers I tried wouldn’t fire up properly.

So I eventually (3am…:-( went for a simpler approach, synching data through a local directory on host.

The result is something that looks like this:

The Apache Drill container, and an Apache Zookeeper container to keep it in check, I found via Dockerhub. I also reused an official RStudio container. The Jupyter container is one I rolled for TM351.

The Jupyter and RStudio containers can both talk to the Apache Drill container, and both analysis apps have access to their own data folder mounted in an application folder in the current directory on host. The data folders mount into separate directories in the Apache Drill container. Both applications can query into data files contained in either data directory as viewable from Apache Drill.

This is far from ideal, but it works. (The structure is as suggested so that RStudio and Jupyter scripts can both be used to download data into a data directory viewable from the Apache Drill container. Another approach would be to mount a separate ./data directory and provide some means for populating it with data files. Alternatively, if the files already exist on host, mounting the host data directory onto a /data volume in the Apache Drill container would work too.)

Here’s the docker-compose.yaml file I’ve ended up with:

drill:
  image: dialonce/drill
  ports:
    - 8047:8047
  links:
    - zookeeper
  volumes:
    - ./notebooks/data:/nbdata
    - ./R/data:/rdata

zookeeper:
  image: jplock/zookeeper

notebook:
  container_name: notebook-apache-drill-test
  image: psychemedia/ou-tm351-jupyter-custom-pystack-test
  ports:
    - 35200:8888
  volumes:
    - ./notebooks:/notebooks/
  links:
    - drill:drill

rstudio:
  container_name: rstudio-apache-drill-test
  image: rocker/tidyverse
  environment:
    - PASSWORD=letmein
  #default user is: rstudio
  volumes:
    - ./R:/home/rstudio
  ports:
    - 8787:8787
  links:
    - drill:drill

If you have docker installed and running, running docker-compose up -d in the folder containing the docker-compose.yaml file will launch three linked containers: Jupyter notebook on localhost port 35200, RStudio on port 8787, and Apache Drill on port 8047. If the ./notebooks, ./notebooks/data, ./R and ./R/data subfolders don’t exist they will be created.
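A quick way to check that the services have come up is to test whether anything is listening on each of those ports. This is a minimal sketch using only the Python standard library; the port numbers are the ones given above, and the helper names port_open and report are hypothetical.

```python
import socket

# Host ports published by the containers, as described above
SERVICES = {"Jupyter notebook": 35200, "RStudio": 8787, "Apache Drill": 8047}


def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def report(host="localhost"):
    """Map each service name to whether its port is accepting connections."""
    return {name: port_open(host, port) for name, port in SERVICES.items()}
```

Calling report() a few seconds after docker-compose up -d gives a dictionary of service names and True/False flags. An open port only shows that something is listening, not that the application behind it has finished starting up.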

We can use the clients to variously download data files and run Apache Drill queries against them. In Jupyter notebooks, I used the pydrill package to connect. Note the hostname used is the linked container name (in this case, drill).

If we download data to the ./notebooks/data folder which is mounted inside the Apache Drill container as /nbdata, we can query against it.
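By way of a sketch, a pydrill query from inside the notebook container might look something like the following. It assumes the pydrill and pandas packages are installed in the notebook container, that Drill's default dfs storage plugin is in use, and that a file called example.csv (a made-up name) has been saved into ./notebooks/data. The helper function names are hypothetical. Without CSV header extraction, Drill exposes each row as a columns array, which is why the query refers to columns[0].

```python
def drill_path(mount, filename, workspace="dfs"):
    """Build a backtick-quoted Drill file reference, e.g. dfs.`/nbdata/x.csv`."""
    return "{}.`{}/{}`".format(workspace, mount.rstrip("/"), filename)


def preview_query(mount, filename, limit=5):
    """SQL that previews the first column of a headerless CSV file."""
    return "SELECT columns[0] AS first_col FROM {} LIMIT {}".format(
        drill_path(mount, filename), int(limit)
    )


def run_query(sql, host="drill", port=8047):
    """Run a query via pydrill and return the result as a pandas DataFrame."""
    from pydrill.client import PyDrill  # third-party package

    # The hostname is the linked container name, not localhost
    drill = PyDrill(host=host, port=port)
    if not drill.is_active():
        raise RuntimeError("Cannot reach Apache Drill at {}:{}".format(host, port))
    return drill.query(sql).to_dataframe()
```

So run_query(preview_query("/nbdata", "example.csv")) run from a notebook should return the first few values from the first column of the file.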

(Note – it probably would make sense to use a modified Apache Drill container configured to use CSV headers, as per Querying Large CSV Files With Apache Drill.)

We can also query against that same data file from the RStudio container. In this case I used the DrillR package (I had hoped to use the sergeant package (“drill sergeant”, I assume?! Sigh..;-) but it uses the RJDBC package which expects to find java installed, rather than DBI, and java isn’t installed in the rocker/tidyverse container I used.) UPDATE: sergeant now works without Java dependency... Thanks, Bob:-)

I’m not sure if DrillR is being actively developed, but it would be handy if it could return the data from the query as a dataframe.

So, getting up and running with Apache Drill and a data analysis environment is not that hard at all, if you have docker installed:-)