Jupyter Notebooks for Scheduling HESA Data Returns? If It’s Good Enough for Netflix…

I’ve drunk so much of the Jupyter Kool-Aid that I now wonder of most computer-related activities: “could notebooks help with that?”.

Adopting a new technology like notebooks is not necessarily an easy thing to do, because it can transform workflows and allows, perhaps even encourages, people to do things differently. Adopting a technology that, to get the most from it, requires change means that right there you have a blocker to adoption: change has a high activation energy.

If I was going to draw a diagram of that (in a reproducible way), I’d start by poking around the LaTeX endiagram package to see if I could tweak the labels to something I wanted (or disable the labels and add my own as an overplot).

As an end user development environment, notebooks give you web browser access to fully-blown programming languages, an environment to run them in, all manner of interactive widgets, and, through other Jupyter warez, various means of deploying them as interactive web apps or web APIs.

Enter stage left this OU job ad for a Developer (HESA Data Futures):

HESA Data Futures is a new project being set up to deliver systems to support the move from annual retrospective returns of student data to the Higher Education Statistics Agency (HESA) to in-year continuous data submission from 2019/20. We are looking for an enthusiastic Developer to work as part of a small team to redevelop our current systems.

You will be responsible for designing, building, testing and implementing software components for the University’s HESA Data Futures system.  The University is participating in the HESA Data Futures pilots and you will be actively engaging with HESA and other institutions participating in the pilot.

You will have experience of developing and maintaining systems to a high standard using languages such as SAS, Python, SQL, XML.  The project is at an early stage and you will be encouraged to use your experience to influence the choice of software platform, programming language, database and tools.

Hmmm… The project is at an early stage and you will be encouraged to use your experience to influence the choice of software platform, programming language, database and tools.

Thinks… Netflix use notebooks across the organisation for working with data:

When thinking about the future of analytics tooling, we initially asked ourselves a few basic questions:

  • What interface will a data scientist use to communicate the results of a statistical analysis to the business?
  • How will a data engineer write code that a reliability engineer can help ensure runs every hour?
  • How will a machine learning engineer encapsulate a model iteration their colleagues can reuse?

We also wondered: is there a single tool that can support all of these scenarios?

Netflix Tech Blog: Scheduling Notebooks at Netflix.

The answer…? Yep, you guessed it….

In particular, to support reporting tasks, Netflix contributed to the papermill project. As the docs tell it, Papermill lets you:

  • parameterize notebooks
  • execute and collect metrics across the notebooks
  • summarize collections of notebooks

This opens up new opportunities for how notebooks can be used. For example:

  • Perhaps you have a financial report that you wish to run with different values on the first or last day of a month or at the beginning or end of the year; using parameters makes this task easier.
  • Do you want to run a notebook and depending on its results, choose a particular notebook to run next? You can now programmatically execute a workflow without having to copy and paste from notebook to notebook manually.
  • Do you have plots and visualizations spread across 10 or more notebooks? Now you can programmatically choose which plots to display in a summary collection notebook to share with others.

The list of papermill committers includes at least one person with a Netflix affiliation, and I only checked two…

See also the Scheduling Notebooks post for more detail of how papermill is used at Netflix.

Perhaps not surprisingly, the way into the Jupyter ecosystem at Netflix was via data analysts using notebooks, but they now provide a unifying interface across all areas of the organisation (Beyond Interactive: Notebook Innovation at Netflix):

Data Scientist: run an experiment with different coefficients and summarize the results

Data Engineer: execute a collection of data quality audits as part of the deployment process

Data Analyst: share prepared queries and visualizations to enable a stakeholder to explore more deeply than Tableau allows

Software Engineer: email the results of a troubleshooting script each time there’s a failure

For a more detailed dive, listen to episode 54 of the Data Engineering podcast: Using Notebooks As The Unifying Layer For Data Roles At Netflix with Matthew Seal.

As for the coding requirements? 1) “everyone should learn to code” ;-); 2) a line at a time is often all you need (this was the basis of our free online Learn to Code for Data Analysis short course). Try it and see…
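By way of illustration, here’s the sort of line-at-a-time pandas step that style of course builds on (the course codes and scores here are made-up data):

```python
# "A line at a time": each step is a single expression with an inspectable result.
import pandas as pd

scores = pd.DataFrame({"course": ["TM351", "TM351", "TM112"],
                       "score": [72, 65, 81]})
by_course = scores.groupby("course")["score"].mean()  # one line, one answer
```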

Here’s a 5 minute keynote from this year’s Jupytercon which also describes how notebooks are currently used at Netflix, although you may want to skip the first two minutes, which is largely a Netflix ad (Beyond Interactive: Scaling Impact with Notebooks at Netflix – Michelle Ufford (Netflix)):

There’s another, more techie talk from a year ago here: Jupyter at Netflix – Kyle Kelley (Netflix).

So this makes me wonder: would Jupyter notebooks be a useful candidate for processing and submitting HESA returns on the one hand, and for analysing, interpreting and storytelling around the data internally on the other?

In passing, I note that I attended part of a briefing from the IET(?) on some course data where Excel, Tableau and assorted charts were the basis of the presentation and handouts. Within the course, we use notebooks to analyse our course survey (SEaM analyser), and I’ve used notebooks to prototype the automation of various bits of assessment script downloading (Rolling Your Own IT – Automating Multiple File Downloads) and to support third marking (originally in Excel, because I wanted a tool that could be shared with and used by Excel users for an advocacy attempt that went nowhere, but now in Jupyter notebooks (unblogged, as yet… oops…)).

Another tool downloads all the marks/tutor comments data for courses I have marking access to into a SQLite database, though that probably breaks all manner of IT policies (I do hash all the personal IDs with a random salt, and I do delete the data as soon as we’re done with it!).

I’ve also dabbled in the past with a notebook based reporting script for FutureLearn logs (FutureLearn Data Doodles Notebook and a Reflection on unLearning Analytics) and explored a way we could author a FutureLearn course from Jupyter notebooks by splitting monolithic notebooks into page length fragments, grabbing the markdown and POSTing it to FutureLearn, which as I understand it accepts markdown (Authoring Multiple Docs from a Single IPython Notebook). Only, in the OU, authors don’t have permission to use the FutureLearn authoring tool, so I couldn’t proof-of-concept that final submission bit…
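For what it’s worth, the salted hashing of personal IDs mentioned above might be sketched like this (not my actual code; the choice of SHA-256 and the structure are assumptions for illustration):

```python
# Pseudonymise identifiers with a random salt that is never stored with the data.
import hashlib
import secrets

salt = secrets.token_hex(16)  # fresh random salt per processing session

def pseudonymise(personal_id: str, salt: str) -> str:
    """Replace an identifier with a salted SHA-256 digest."""
    return hashlib.sha256((salt + personal_id).encode("utf-8")).hexdigest()
```

The same ID hashes to the same token within a session (so joins across tables still work), but without the salt the original IDs can’t be looked up by brute-forcing known identifiers.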

I am so fed up…

PS by the by, I note I first posted an idle wondering about using notebooks to support devops back in 2015, back before they were known as Jupyter notebooks…: Literate DevOps? Could We Use IPython Notebooks To Build Custom Virtual Machines?.

PPS an example of notebooks being used in GitLab devops:

Author: Tony Hirst

I'm a lecturer at The Open University, with an interest in #opendata policy and practice, as well as general web tinkering...
