Draft: Glossary of Jupyter Production Workflow Terms

Jupyter: an open source community project focused on the development of the Jupyter ecosystem (tools and architectures for the deployment of arbitrary executable code environments and reproducible "computational essay" documents). The name was coined from the original three programming languages supported by the IPython notebook architecture, which was subsumed into the Jupyter project as Jupyter Notebooks: Julia, Python and R.

Jupyter Notebooks: variously: a browser based interactive Jupyter notebook; a textual document format (.ipynb); and (less frequently) the single user Jupyter notebook server. In the first, most commonly used, sense, the Jupyter notebook is a browser based application within which users can edit, render and save markdown (text rendered as HTML), edit code in a wide variety of languages (including but not limited to Python, Javascript, R, Java, C++, SQL), execute the code on a code server and then return and display the response/code outputs in the interactive notebook. The Jupyter notebook document format is a text (JSON) document format that can embed the markdown text, code and code outputs. The cell based structure of the notebook format supports the use of metadata "tags" to annotate cells which can then be used to provide extension supported styling of individual cells (for example, colouring "activity" tagged cells with a blue background to distinguish them from the rest of the content) or modify cell behaviour in other ways.
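
As a minimal sketch of the document format described above, the following Python builds a tiny notebook as a plain dictionary, serialises it to JSON, and then picks out cells by tag. The cell contents, the "activity" tag and the cells_with_tag helper are illustrative; the field names follow the nbformat version 4 layout as I understand it.

```python
import json

# A minimal notebook in the nbformat 4 JSON structure (illustrative content).
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {"tags": ["activity"]},
            "source": ["## Activity 1\n", "Run the code cell below."],
        },
        {
            "cell_type": "code",
            "metadata": {"tags": []},
            "execution_count": None,
            "outputs": [],
            "source": ["print('hello')"],
        },
    ],
}


def cells_with_tag(nb, tag):
    """Return the cells whose metadata tags include the given tag."""
    return [c for c in nb["cells"] if tag in c.get("metadata", {}).get("tags", [])]


# The .ipynb file on disk is just this structure serialised as JSON text.
ipynb_text = json.dumps(notebook, indent=1)
reloaded = json.loads(ipynb_text)
activity_cells = cells_with_tag(reloaded, "activity")
print(len(activity_cells), activity_cells[0]["cell_type"])
```

An extension that styles "activity" cells would, in effect, be doing the same sort of lookup on the cell metadata before deciding how to render each cell.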

JupyterHub: JupyterHub is a multi-user server providing authentication, access to persistent user storage, and a multi-user experience. Logged in users can be presented with a range of available environments associated with their user account. The JupyterHub server is responsible for launching individual notebook servers on demand and providing tools for users to manage their environment as well as tools for administrators to manage all users registered on the hub. JupyterHub can launch environments using remote cloud-hosted servers in an elastic (on-demand and responsive) way.

Jupyter server: a Jupyter server or Jupyter notebook server is a server that connects a Jupyter served computational environment to a Jupyter client (for example, the Jupyter notebook or JupyterLab user interface or the VS Code IDE).

Jupyter kernel: a Jupyter kernel is a code execution environment managed by Jupyter protocols that can execute code requests from a Jupyter notebook environment or IDE and return a code output to the notebook. Jupyter kernels are available for a wide variety of programming languages.

Integrated Development Environment / IDE: a software application providing code editing and debugging tools. IDEs such as Microsoft’s VS Code also provide support for editing and previewing markdown content, as well as generated content (see for example VS Code as an Integrated, Extensible Authoring Environment for Rich Media Asset Creation), and showing differences between file versions (see for example Sensible Diff-ing of Jupyter Notebook ipynb Documents Using VS Code).

BinderHub: BinderHub is an on-demand server capable of building and launching temporary / ephemeral environments constructed from configuration files and content contained in an online repository (eg Github or a DOI accessed repository). By default, BinderHub will build a Jupyter notebook environment with preinstalled packages defined as requirements in a specified Github repository and populated with notebooks contained in the repository.

MyBinder: MyBinder is a freely available community service that launches temporary/ephemeral interactive environments from public repositories using donated cloud server resources.

ipywidgets: the ipywidgets Python package provides a set of interactive HTML widgets that can synchronise settings across interactive Javascript applications that are rendered in a web browser with the state of Python programmes running inside a Jupyter computational environment. ipywidgets also provides a toolkit for easily generating end user application interfaces / widgets inside a Jupyter notebook that can interact with Python programme code also defined in the same notebook.

Core package: for the purposes of this document, a core package is one that is managed under the official jupyter namespace under the Jupyter project governance process.

Contributed package: for the purposes of this document, a contributed package is one that is maintained outside of the official Jupyter project namespace and governance process by independent contributors but complements or extends the core Jupyter packages. Many "official" (which is to say core) packages started life as contributed packages.

Jupytext: Jupytext is a contributed package that supports the conversion of Jupyter notebook .ipynb files to/from other text representations (structured markdown files, Python or Rmd (R markdown) code files). A server extension allows markdown and code documents opened from within a Jupyter environment to be edited within the Jupyter environment. Jupytext also synchronises multiple formats of the same notebook, such as an .ipynb notebook document with populated code output cells and a simple markdown document that represents just the markdown and code input cells.
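
To illustrate the idea of a text representation that keeps just the markdown and code input cells, here is a rough sketch in plain Python. It is not Jupytext's code or its actual output format; the notebook_to_markdown function and the sample notebook are made up for the example.

```python
FENCE = "`" * 3  # the three-backtick markdown code fence


def notebook_to_markdown(nb):
    """Render markdown and code *inputs* only; code outputs are dropped."""
    chunks = []
    for cell in nb["cells"]:
        source = cell["source"]
        if isinstance(source, list):  # source may be a string or a list of lines
            source = "".join(source)
        if cell["cell_type"] == "markdown":
            chunks.append(source)
        elif cell["cell_type"] == "code":
            # Fence code inputs so they stay distinguishable from prose.
            chunks.append(FENCE + "python\n" + source + "\n" + FENCE)
    return "\n\n".join(chunks) + "\n"


# Illustrative notebook with one populated output that should not survive.
nb = {
    "cells": [
        {"cell_type": "markdown", "source": ["# Title"]},
        {
            "cell_type": "code",
            "source": "1 + 1",
            "outputs": [{"output_type": "execute_result", "data": {"text/plain": "2"}}],
        },
    ]
}
text = notebook_to_markdown(nb)
print(text)
```

Because the outputs never reach the text file, differences between versions of the text file reflect only changes to the authored content.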

JupyterLite: JupyterLite is a contributed package that removes the need for a separately hosted Jupyter server. Instead, a simple web server can deploy a JupyterLite distribution which provides a JupyterLab or RetroLab user environment that can execute code against a computational environment that runs purely in the web page/web browser using a WASM compiled Jupyter kernel. With JupyterLite, the user can run a Jupyter environment without the need to install any software other than a web browser and without the need to have a web connection once the environment is loaded in the browser.

Github: Github is an online collaborative development environment owned and operated by Microsoft. Online code repositories provide version controlled file archives that can be accessed individually or by multiple team members. As well as providing a git managed repository with all that involves (the ability to inspect different versions of checked in files, the ability to manage various code branches, management tools for accepting pull requests), Github also provides a wide range of project management and coordination tools: project boards, issue management, discussion forums, code commit comments, wikis, automation.

git: git is a version control system for tracking changes over separate file "commits" (i.e. saved versions of a file). Originally designed as a command line tool, several graphical UI applications (for example, Github Desktop and Sourcetree) or IDEs (for example, VS Code with extensions) make it easier to manage git environments locally as well as synchronising local code repositories with online code repositories. Many IDEs also integrate git support natively (VS Code, RStudio) as well as providing extended support through additional extensions (for example, the VS Code GitLens extension). Notably, the VS Code environment provides a rich differencing display for Jupyter notebooks.

ThebeLab: ThebeLab is a contributed package that provides a set of Javascript functions that support remote code execution from an HTML web page. Using ThebeLab, code contained in HTML code cells can be edited and executed against a remote Jupyter kernel that is either hosted by a Jupyter notebook server or launched responsively via MyBinder or another BinderHub server.

Jupyter Book: Jupyter Book is a contributed technique for generating an interactive HTML style textbook from a set of markdown documents or Jupyter notebooks using the Sphinx document processing toolchain. Documents can also be rendered into other formats such as e-book formats or PDF. Notebooks can be executed to include code outputs or rendered without code execution. Notebook cell tags can be used to hide (or remove) unwanted code cell inputs or outputs as well as styling particular cells. Inline interactive code execution is also possible using ThebeLab, although in-browser code execution using JupyterLite is not supported. Interactive notebooks can also be launched from Jupyter Books using MyBinder or opened directly in a linked Jupyter notebook server environment. Jupyter Book builds on several community contributed tools managed as part of the Executable Books project for rendering rich and comprehensively styled content from source markdown and notebook documents. Jupyter Book represents the closest thing to an official rich publication route from notebook content.

Sphinx: Sphinx is a publishing toolchain originally created to support the generation of Python code documentation. Sphinx can render documents in a wide variety of formats including HTML, ebooks, LaTeX and PDF. A wide range of plugins and extensions exist to support formatting and structuring of documentation, including the generation of tables of contents, managing references, handling code syntax highlighting and providing code copying tools.

nbsphinx: nbsphinx is a contributed Sphinx extension for parsing and executing Jupyter notebook .ipynb files. nbsphinx thus represents a simple publishing extension to Sphinx for rendering Jupyter notebooks, compared to Jupyter Book which provides a complete framework for publishing rich interactive content as part of a Jupyter workflow.

Docker: Docker is a container based virtualisation technology (lighter weight than a full virtual machine) used to deploy virtualised environments on a user’s own computer or via a remote server. A JupyterHub server can be used to manage the deployment of Docker environments running individual Jupyter user environments on remote, scalable servers.

Docker image / Docker container image: a Docker virtualised environment is downloaded as an image file. An actual instance of a Docker virtualised environment (a container) is generated from a Docker image. Public Docker images are hosted in a Docker registry such as DockerHub from where they can be downloaded by a Docker client.

Docker container: a Docker container is an instantiated version of a Docker image. A Docker container can be used to deploy a Jupyter notebook server and the Jupyter environments exposed by the server. Just like a "real" computer, Docker containers can also be hibernated / resumed or restarted. A pristine version of the environment can be created by destroying a container and then creating a brand new one from the original Docker container image.

DockerHub: DockerHub is a hosted Docker image registry that hosts public Docker images that can be downloaded and used by Docker applications running locally or on a cloud server. Github also publish a Docker container registry. In addition, organisations and individuals can self-host a registry. Private image registries are also possible that only allow authenticated users or clients to search for and download particular images.

Python: Python is a general purpose programming language that is widely used in OU modules. A Python environment can be distributed via the Anaconda scientific Python distribution or inside a Docker container.

Anaconda: Anaconda is a scientific Python distribution that bundles the basic Python environment with a wide range of preinstalled scientific Python packages. In many instances, the Anaconda distribution will include all the packages required in order to perform a set of required scientific computing tasks. Anaconda can be installed directly onto the user’s desktop or used inside a Docker container to provide a Python environment inside such a virtualised environment. The appropriateness of using Anaconda as a distribution environment in a distance education context is contested.

IPython: IPython (interactive Python) provides an interactive "REPL" (read, evaluate, print, loop) environment for supporting interactive execution and code output display. In a Python based Jupyter environment, it is actually IPython that supports the interactive code execution.

R: R is a programming language designed to support statistical analysis and the creation of high quality, data driven scientific charts and graphs. R is used in several OU modules.

Javascript: Javascript is a widely used general purpose programming language. Javascript is also available inside a web browser. Standalone interactive web pages or web applications are typically built from Javascript code that runs inside the web page/web browser. Such applications can often continue to work even in the absence of a network connection.

WASM: WASM (or WebAssembly) is a virtualised programming environment that can run inside a web browser. The JupyterLite package uses WASM to provide an in-browser computational environment for Jupyter environments that allows notebooks to execute Python code cells purely within the browser.

Markdown: Markdown is a simple text markup language that allows you to use simple conventions to indicate style (for example, wrapping a word in asterisks to indicate emphasis, or using a dash at the start of a line to indicate a list item or bullet point). Markdown is typically converted to HTML and then rendered in a browser as a styled document. Many Markdown editors, including Jupyter notebooks and IDEs such as VS Code, provide live, styled previews of raw markdown content within the application.
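
To make the conversion step concrete, here is a toy converter that handles only the two conventions mentioned above (asterisks for emphasis, a leading dash for list items). It is a sketch for illustration, not how any real Markdown processor is implemented; the tiny_markdown_to_html name is my own.

```python
import html
import re


def tiny_markdown_to_html(text):
    """Convert a tiny subset of Markdown (emphasis, dash lists) to HTML."""
    out, in_list = [], False
    for line in text.splitlines():
        # Escape HTML special characters, then turn *word* into <em>word</em>.
        line = re.sub(r"\*(.+?)\*", r"<em>\1</em>", html.escape(line))
        if line.startswith("- "):
            if not in_list:  # open the list on its first item
                out.append("<ul>")
                in_list = True
            out.append("<li>" + line[2:] + "</li>")
        else:
            if in_list:  # a non-list line closes any open list
                out.append("</ul>")
                in_list = False
            if line.strip():
                out.append("<p>" + line + "</p>")
    if in_list:
        out.append("</ul>")
    return "\n".join(out)


print(tiny_markdown_to_html("Some *emphasised* text\n- one\n- two"))
```

A live preview pane is essentially running a (far more complete) version of this conversion each time the source text changes.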

HTML: HTML (hypertext markup language) is a markup language, similar in style to XML, used to mark-up text documents with simple structure and style. Web browsers typically render HTML documents as styled web pages. The actual styling (colour selection, font selection) is typically managed using CSS (cascading style sheets), which can change the look and feel of the page without having to change the underlying HTML. (When a theme is changed on a web page, for example, dark mode, a different set of CSS settings is used to render the page whilst the HTML remains unchanged).

CSS: CSS (cascading style sheets) control the particular visual styles used to render HTML content. Changing the CSS changes the visual rendering of a particular HTML webpage without having to change the underlying structural HTML.

nbgrader: nbgrader is a core Jupyter package providing a range of tools for managing the creation, release, collection and automated and manual marking of Jupyter notebooks.

Version Control: version control is a technique for tracking changes in one or more documents over time. Changes to individual documents may be uniquely tracked with different document versions (for example, imagine looking at "tracked changes" between two versions of the same document), and collections of versioned documents can themselves be versioned and tracked (for example, a set of documents that make up the documents released to students in a particular presentation of a particular module). In a distributed version control system such as git, mechanisms exist that allow multiple authors or editors to work on their own copies of the same documents at the same time, and then alert each other to the changes they have made to the documents and allow them to merge in changes made by other authors/editors. If two people have changed the same piece of content in different ways at the same time, a so-called merge conflict will be generated that identifies the clash and allows a decision to be made as to which change is accepted.

Merge conflict: a merge conflict arises in a collaborative, distributed version control system when conflicting changes are made to the same part of a particular file by different people, or when one person works on or makes changes to a file that another has independently deleted. Resolving the merge conflict means deciding which set of updates you actually want to accept into the modified document.
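
The following sketch shows the basic logic by which a merge can tell an ordinary change from a conflicting one, comparing each author's version with the common starting version line by line. It is heavily simplified (it assumes all three versions have the same number of lines, which git does not), and the three_way_merge function and sample text are made up for the example; the conflict markers are written in the style git uses.

```python
def three_way_merge(base, ours, theirs):
    """Merge line by line; return (merged_lines, conflict_line_numbers).

    Simplifying assumption: all three versions have the same number of lines.
    """
    if not (len(base) == len(ours) == len(theirs)):
        raise ValueError("this sketch only handles equal-length versions")
    merged, conflicts = [], []
    for i, (b, o, t) in enumerate(zip(base, ours, theirs), start=1):
        if o == t:      # both agree (unchanged, or changed the same way)
            merged.append(o)
        elif o == b:    # only "theirs" changed this line
            merged.append(t)
        elif t == b:    # only "ours" changed this line
            merged.append(o)
        else:           # both changed it differently: a merge conflict
            merged.append(f"<<<<<<< ours\n{o}\n=======\n{t}\n>>>>>>> theirs")
            conflicts.append(i)
    return merged, conflicts


base = ["Title", "Intro text", "Summary"]
ours = ["Title", "Intro text, revised", "Summary"]
theirs = ["New title", "Introductory text", "Summary"]
merged, conflicts = three_way_merge(base, ours, theirs)
print(conflicts)
```

Here the title change merges cleanly because only one author touched it, whereas the second line was edited differently by both authors and so needs a human decision.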

Github Issue: a Github Issue is a single issue comment thread used to discuss a particular issue such as a specific bug, error or feature request. Issues can be tagged with particular users and/or topics. Github Issues are associated with a particular code repository. "Open issues" are ones that are still to be addressed; once resolved, they are then "closed" providing an archived history of matters arising and how they were addressed. When files are committed to the repository, the commit message may be used to associate the commit (i.e. the changes made to particular files) with a particular issue, and even automatically close the issue if the commit resolves that issue.
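
As a sketch of how a commit message can be tied to an issue, the following picks out issue numbers that follow one of the closing keywords Github recognises (close, fix and resolve, and their variants). The regular expression is my own approximation of that convention, not Github's implementation, and the sample message is made up.

```python
import re

# Closing keywords: close/closes/closed, fix/fixes/fixed, resolve/resolves/resolved.
CLOSING = r"(?:close[sd]?|fix(?:e[sd])?|resolve[sd]?)"
PATTERN = re.compile(rf"\b{CLOSING}\s+#(\d+)", re.IGNORECASE)


def issues_closed_by(message):
    """Return the issue numbers that a commit message asks to be closed."""
    return [int(n) for n in PATTERN.findall(message)]


message = "Fix typo in week 2 notebook, closes #14 and fixes #15"
print(issues_closed_by(message))
```

A plain mention such as "see #3" would still link the commit to the issue in the Github interface, but without a closing keyword it would not close it.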

Github Discussion: a Github Discussion is a threaded forum associated with a particular repository that allows for more open ended discussions than might be appropriate in an issue.

Github/git commit: a git or Github commit represents a check-in of a particular set of changes to one or more documents. Each commit has a unique reference value allowing you to review just the changes made as part of that commit compared to either the previous version of those documents, or another version of those documents. Making commits at a low level of granularity means that very particular changes can be tracked and if necessary rolled back. A commit message allows a brief summary of the changes made in the commit to be associated with it; this is useful for review purposes and in distributed multi-user settings to communicate what changes have been made (a longer description message may also be attached to each commit). Identifying an appropriate level of granularity for commits is one of the challenges in establishing a good workflow, not least because of the overhead associated with adding a commit message to each commit.

Github/git pull request (PR): a git or Github Pull request (PR) represents a request that a set of committed changes be accepted from one branch into another branch of a git repository. Automated checks and tests can be run whenever a PR is made; if they do not pass, the person making the PR is alerted to the fact and invited to address the issue. Merging commits from a PR may be blocked until all tests pass. PRs may also be blocked until the PR has received a review by one or more named individuals.

Automation: automation is the use of automatically or manually triggered events or manually issued commands for running scripted tasks. Automation can be used to run a spell-checker over a set of files whenever they are updated, automatically check and style code syntax, or automatically execute and test code. Automation can also be used to automatically update the building of Docker images or render and publish interactive textbooks. Automation could be used to automate the production of material distributions and releases and then publish them to a desired location (such as the location pointed to by a VLE download link).

Autonomation: autonomation (not commonly used in computing context) is a term taken from lean manufacturing that refers to "automation with a human touch". In the case of a Jupyter production system, this might include the running of automated tests (such as spell checkers) that prevent documents being committed to a repository if they contain a spelling mistake. The main idea is that errors should not propagate but be fixed immediately at source. The automation identifies the issue and prevents it being propagated forward, a human fixes the issue then reruns the automated tests. If they pass, the work is then automatically passed forwards.

Github Action: a Github Action forms part of an automation framework for Github. Github Actions can be triggered to run checks and tests in response to particular events such as code commits, PRs or releases, as well as to manual triggers. Github Actions can also be used to render source documents to create distributions as well as publishing distributions to particular locations (for example, creating a Docker image and pushing it to DockerHub, generating a Jupyter Book interactive textbook and publishing it via Github Pages, etc.). A wide range of off-the-shelf Github Actions are available.

git commit hook: a git commit hook is a trigger for automation scripts that are run whenever a git commit is issued. The script runs against the committed files and may augment the commit (for example, automatically checking and correcting code style / layout and automatically adding style corrections as part of the commit process, or using Jupytext to automatically create a paired markdown document for a committed .ipynb notebook document, or vice versa).
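
A commit hook can also simply block a commit: git abandons the commit if a pre-commit hook script exits with a non-zero status. The sketch below shows the shape of such a check, assuming the paths of the files being committed are passed in as arguments (as the pre-commit framework does). The list of blocked misspellings and the function names are hypothetical; a real hook would call a proper spell-checker.

```python
import re
import sys

# Hypothetical list of misspellings to block; a real hook would use a spell-checker.
KNOWN_MISSPELLINGS = {"teh", "recieve", "seperate"}


def find_misspellings(text):
    """Return the sorted list of blocked words that appear in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(words & KNOWN_MISSPELLINGS)


def check_files(paths):
    """Return 0 if every file is clean, 1 otherwise (non-zero blocks the commit)."""
    status = 0
    for path in paths:
        with open(path, encoding="utf-8") as f:
            bad = find_misspellings(f.read())
        if bad:
            print(f"{path}: possible misspellings: {', '.join(bad)}", file=sys.stderr)
            status = 1
    return status


# Used as a hook script, the final line would be:
#     sys.exit(check_files(sys.argv[1:]))
print(find_misspellings("Teh students recieve a notebook"))
```

The author then fixes the reported words and commits again, at which point the check passes and the commit goes through.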

pre-commit: pre-commit is a general purpose contributed framework for creating git pre-commit hook scripts. A wide range of off-the-shelf pre-commit actions are defined for performing particular tasks.

Rendering: rendering a file refers to the generation of a styled "output" version of a document from a source format. For example, a markdown document may be rendered as a styled HTML document.

Generative document: a generative document is a document that includes executable source code. The source code provides a complete set of instructions for generating media assets as the source document is rendered into a distribution document.

Generative rendering: a generative document is rendered as a styled document containing media assets that are created by executing some form of source code within the source document as part of the rendering process.

Generated asset: a generated asset is a media asset that has been generated from a source code representation as part of the rendering process. Updates to the media asset (for example, text labels or positioning in a diagram) are made by making changes to the source code and then re-rendering it, not by editing the asset directly.

Distribution: a distribution represents a complete set of version controlled files that could be distributed to an end user. In a content creation context, a distribution might take the form of a complete set of notebooks, a complete set of HTML files, or a set of rendered PDF documents. A distribution might be used as a formal handover in a regimented linear workflow process or as the basis of a set of files released to students. A uniquely identifying hash value can be used to identify each distribution and track exactly which version of each individual file is included in a particular distribution.
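
One way such an identifying hash might be produced is sketched below: hash each file's content, then hash the sorted list of (hash, path) pairs. This is an illustration using SHA-256 from the Python standard library, not the scheme git itself uses; the function names, file paths and file contents are made up.

```python
import hashlib


def file_digest(content: bytes) -> str:
    """Return the SHA-256 digest of one file's content."""
    return hashlib.sha256(content).hexdigest()


def distribution_manifest(files):
    """Map each file path to the digest of its content."""
    return {path: file_digest(data) for path, data in sorted(files.items())}


def distribution_id(manifest):
    """Hash the sorted manifest so that any changed file changes the identifier."""
    h = hashlib.sha256()
    for path, digest in sorted(manifest.items()):
        h.update(f"{digest}  {path}\n".encode("utf-8"))
    return h.hexdigest()


# Illustrative in-memory "files"; a real script would read them from disk.
files = {"week1/intro.ipynb": b"{...}", "week1/data.csv": b"a,b\n1,2\n"}
manifest = distribution_manifest(files)
print(distribution_id(manifest))
```

The manifest records exactly which version of each file went into the distribution, and the single identifier changes if any one of those files changes.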

Release: a release is a version controlled distribution that can be distributed to end users such as a particular cohort of students on a particular module. A release will be given an explicit version number that should include the module code, year and month of presentation as well as lesser version (edition) numbers that help track releases integrating minor updates etc.
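
As a sketch of what such a version number might look like in practice, the following small class combines a module code, presentation year and month, and an edition number into a single label. The label layout and the module code "XY123" are invented for the example, not an agreed convention.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class ReleaseVersion:
    """A release label built from module code, presentation date and edition."""

    module_code: str
    year: int
    month: int
    edition: int = 1

    def __str__(self):
        # Hypothetical layout: MODULE-YYYYMM-eN
        return f"{self.module_code}-{self.year:04d}{self.month:02d}-e{self.edition}"


first = ReleaseVersion("XY123", 2024, 10)
update = ReleaseVersion("XY123", 2024, 10, edition=2)
print(first, update, first < update)
```

Keeping the parts as separate fields means releases for the same presentation sort naturally by edition, so the latest minor update is easy to identify.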

Source / Source files: the source files are the set of files from which a distribution is rendered. The source files might include structural metadata, comment code that is stripped from the source document and does not appear in the rendered document, and even code that is executed to produce generated assets that form part of the distribution, even if the source code does not.

Author: Tony Hirst

I'm a Senior Lecturer at The Open University, with an interest in #opendata policy and practice, as well as general web tinkering...
