[Reposted from Jupyter discourse site just so I have my own copy…]
By chance, I just came across the Library of Congress Sustainability of Digital Formats site, which has a schema for cataloguing digital document formats as well as a set of criteria against which the sustainability of digital document formats can be tracked.
Sustainability factors include:
- Disclosure: specifications, schemata;
- Adoption: extent of use;
- Transparency: eg human readability, text format;
- Self-documentation: extent to which format is self-documenting;
- External dependencies: eg hardware, o/s;
- Impact of patents: patent encumbered; (“…and licensing” would perhaps be a more useful generalisation of this field?)
- Technical protection mechanisms: eg encryption.
There are also fields associated with Quality and functionality factors which for text documents include: normal rendering, integrity of document structure, integrity of layout and display, support for mathematics/formulae etc., functionality beyond normal rendering.
I note that .ipynb is not currently on the list of mentioned formats. Records for Rdata provide a steer for the sorts of thing that an ipynb record might initially contain. (I also note that Python / Jupyter kernels don’t have a standardised serialisation format akin to R’s .rdata workspace serialisation (dill goes some way towards this, maybe also data-vault). I also appreciate this is complicated by the wide variety of custom objects created by Python packages, but just as IPython supports rich display integration through __repr__ methods (see also the notes at the end of the IPython.display.display docs for a description of what methods are supported), it might also be timely to start thinking about __serialise__ methods (they may already exist; there is so much I don’t know about Python! I do know that things don’t always work though; eg Python’s json package in my py envt breaks when trying to serialise
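To make the __serialise__ idea a little more concrete, here is a minimal sketch, assuming a purely hypothetical convention in which an object exposes a __serialise__ method that returns JSON-friendly data. Neither Python nor IPython defines such a method as far as I know; the Reading class and the serialise_default helper are invented for illustration. The only real API relied on is the default hook of json.dumps in the standard library, which is called for any object the encoder does not know how to handle.

```python
import json


class Reading:
    """Hypothetical custom object that the json module cannot encode by itself."""

    def __init__(self, sensor, value):
        self.sensor = sensor
        self.value = value

    def __serialise__(self):
        # Hypothetical hook (not a real Python protocol): return plain,
        # JSON-friendly data describing this object.
        return {"__type__": "Reading", "sensor": self.sensor, "value": self.value}


def serialise_default(obj):
    """Fallback handed to json.dumps(default=...) for otherwise unknown objects."""
    hook = getattr(type(obj), "__serialise__", None)
    if hook is None:
        # Mirror the json module's own behaviour for unsupported types.
        raise TypeError(f"{type(obj).__name__} has no __serialise__ method")
    return hook(obj)


def dumps(obj):
    """Serialise obj to a JSON string, consulting __serialise__ where needed."""
    return json.dumps(obj, default=serialise_default, sort_keys=True)


print(dumps({"readings": [Reading("t1", 21.5)]}))
```

Objects without the hook still raise a TypeError, so the sketch only covers types that opt in; it does nothing for third-party objects that were never written with the convention in mind.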
There is now a significant number of notebooks on eg GitHub, as well as signs that notebooks are starting to be used as a publishing format (or at least, as a feedstock for publication, whether rendered using nbconvert or more elaborate tools such as ipypublish, or howsoever).
I wonder if it would be timely to review the ipynb document format in terms of its sustainability and whether getting it included on the LoC list (or other appropriate forum) would be an appropriate thing to do for several reasons, including:
- signal the existence of the document format to the Library / sustainability community in terms they are familiar with and may be able to help with;
- help identify ways in which nbformat should not develop in future because they might affect its sustainability as a format;
- help identify things that might help improve its sustainability;
- help inform workflows and behaviours regarding how eg cell metadata / tags feed into sustainability.
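As a small illustration of the last point: in the nbformat 4 schema, cell tags live in each cell's metadata as a list of strings under the tags key. Below is a minimal sketch of auditing which tags a notebook uses, treating the .ipynb file as plain JSON and using only the standard library. The function name tag_counts and the example notebook are made up for illustration.

```python
import json
from collections import Counter


def tag_counts(notebook):
    """Count how often each cell tag appears in a notebook given as a dict."""
    counts = Counter()
    for cell in notebook.get("cells", []):
        # Tags are optional: missing metadata or a missing tags list means no tags.
        counts.update(cell.get("metadata", {}).get("tags", []))
    return counts


# Invented example notebook, reduced to the fields this sketch needs.
example = json.loads("""
{
  "nbformat": 4, "nbformat_minor": 4, "metadata": {},
  "cells": [
    {"cell_type": "markdown", "metadata": {"tags": ["intro"]}, "source": "# Title"},
    {"cell_type": "code", "metadata": {"tags": ["intro", "hide-input"]},
     "source": "print(1)", "outputs": [], "execution_count": null},
    {"cell_type": "code", "metadata": {}, "source": "", "outputs": [], "execution_count": null}
  ]
}
""")

print(tag_counts(example))
```

A real notebook could be loaded the same way with json.load on the open file; a survey of this kind would show which tag vocabularies are actually in use before anyone tries to say which of them matter for preservation.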
If .ipynb is to remain the core data-structure for representing Jupyter executable documents and their outputs, and as other third-party applications (such as VSCode, or Google Colab) start to support the format, and if it doesn’t already exist, I also wonder whether a simple RFC style document (cf. the GeoJSON RFC) would be appropriate alongside the slightly less formal nbformat documentation as a formal statement of the document standard?
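For a sense of how small the core of the format is: the nbformat 4 documentation describes a notebook as a JSON object with four top-level keys (metadata, nbformat, nbformat_minor, cells), and each cell as an object with cell_type, metadata and source, plus outputs and execution_count for code cells. What follows is a rough sketch of checking just those keys with the standard library. It is not a substitute for validating against the published JSON schema, it ignores the cell id that minor version 5 additionally requires, and the function name check_notebook is invented here.

```python
REQUIRED_TOP_LEVEL = {"metadata", "nbformat", "nbformat_minor", "cells"}
REQUIRED_CELL = {"cell_type", "metadata", "source"}
REQUIRED_CODE_CELL = REQUIRED_CELL | {"outputs", "execution_count"}


def check_notebook(nb):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    missing = REQUIRED_TOP_LEVEL - set(nb)
    if missing:
        problems.append(f"missing top-level keys: {sorted(missing)}")
    for i, cell in enumerate(nb.get("cells", [])):
        # Code cells carry two extra required keys compared with markdown/raw cells.
        is_code = cell.get("cell_type") == "code"
        required = REQUIRED_CODE_CELL if is_code else REQUIRED_CELL
        missing = required - set(cell)
        if missing:
            problems.append(f"cell {i}: missing keys {sorted(missing)}")
    return problems


# Invented example: a code cell that lacks its outputs and execution_count.
incomplete = {
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 4,
    "cells": [{"cell_type": "code", "metadata": {}, "source": "1 + 1"}],
}
print(check_notebook(incomplete))
```

An RFC style document would pin down exactly this sort of requirement in prose, so that an independent implementation would not need to infer it from the reference library's behaviour.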
Interoperability is driven by convention as well as standard, and if we are going to see external services developing around Jupyter from individuals or organisations not previously associated with the Jupyter community, but offering interoperability with it, there needs to be a clear basis for what the standards are. This includes not just the base ipynb format, but also messaging and state protocols.
The nbformat format description docs pages seem to act as the normative reference work for the .ipynb standard, and I assume the Jupyter client – messaging docs are the normative reference for the client-server messaging? For ipywidgets, the widget messaging protocol and widget model state docs in the ipywidgets repo appear to provide the normative reference.
See also: this thing on Managing computational notebooks from a preservation perspective and the FAIR Principles (Findable – Accessible – Interoperable – Re-usable).
And on the question of being able to rebuild eg Dockerised environments, here and here.