The Growing Popularity of Jupyter Notebooks…

It’s now five years since we first started exploring the use of Jupyter notebooks — then known as IPython notebooks — in the OU for the course that became (and still is) TM351 Data management and analysis.

At the time, the notebooks were evolving fast (they still are…), but knowing how long it takes to produce a course, and given the compelling nature of the notebooks and the traction they already seemed to have, it felt like a reasonably safe technology to go with.

Since then, the Jupyter project has started to rack up some impressive statistics. Using Google BigQuery on public datasets, we can find how many times the jupyter Python package is installed each month from PyPI, the centralised repository for Python package distribution, by querying the the-psf:pypi.downloads dataset (if you don’t like the code in this post, how else do you expect me to demonstrably source the supporting evidence…?!):

#https://langui.sh/2016/12/09/data-driven-decisions/
#Count monthly downloads of the jupyter package from PyPI over the last year
SELECT
  STRFTIME_UTC_USEC(timestamp, "%Y-%m") AS yyyymm,
  COUNT(*) AS download_count
FROM
  TABLE_DATE_RANGE(
    [the-psf:pypi.downloads],
    DATE_ADD(CURRENT_TIMESTAMP(), -1, "year"),
    CURRENT_TIMESTAMP()
  )
WHERE file.project = "jupyter"
GROUP BY yyyymm
ORDER BY yyyymm DESC
LIMIT 12

(Our other long-term bet for TM351 was on the pandas Python package, and that has also gone from strength to strength…)

Google BigQuery datasets also include a Stack Overflow dataset (Stack Overflow is a go-to technical question-and-answer site for developers), so something like the following crappy query will count the questions tagged jupyter-notebook appearing each year:

#Count Stack Overflow questions tagged jupyter-notebook, by year
SELECT tags, COUNT(*) c, year
FROM (
  SELECT SPLIT(tags, '|') tags, YEAR(creation_date) year
  FROM [bigquery-public-data:stackoverflow.posts_questions] a
  WHERE YEAR(creation_date) >= 2014 AND tags LIKE '%jupyter-notebook%'
)
WHERE tags='jupyter-notebook'
GROUP BY year, tags
ORDER BY year DESC

I had thought we might be able to use BigQuery to query the number of notebooks on Github (a site widely used by developers for managing code repositories and, increasingly, by educators for managing course notes/resources), but it seems that the Github public data tables (bigquery-public-data:github_repos) only represent a sample of 10% or so of public projects?

FWIW, here are a couple of sample queries on that dataset anyway. First, a count of projects identified as Jupyter Notebook projects:

SELECT COUNT(*)
FROM [bigquery-public-data:github_repos.languages]
WHERE language.name = "Jupyter Notebook"

And secondly, a count of .ipynb notebooks:

SELECT COUNT(*)
FROM [bigquery-public-data:github_repos.files]
#filter on files with a .ipynb suffix
WHERE RIGHT(path,6) = ".ipynb"

We can also look to see what the most popular Python packages imported into the notebooks are, using a recipe I found here:

numpy, matplotlib and pandas, then, perhaps not surprisingly… Here’s the (cribbed) code for that query:

#https://dev.to/walker/using-googles-bigquery-to-better-understand-the-python-ecosystem

#The query finds line breaks and then tries to parse import statements
SELECT
  CASE WHEN package LIKE '%\\n",' THEN LEFT(package, LENGTH(package) - 4)
    ELSE package END as package, n
FROM (
  SELECT REGEXP_EXTRACT( line, r'(?:\S+\s+)(\S+)' ) as package, count(*) as n
  FROM (
    SELECT SPLIT(content, '\n \"') as line
    FROM (SELECT * FROM [bigquery-public-data:github_repos.contents]
      WHERE id IN (
        SELECT id FROM [bigquery-public-data:github_repos.files]
         WHERE RIGHT(path, 6) = '.ipynb'))
       HAVING LEFT(line, 7) = 'import ' OR LEFT(line, 5) = 'from ')
  GROUP BY package
  ORDER BY n DESC)
LIMIT 10;

(I guess a variant of the above could be used to find out what magics are most commonly loaded into notebooks?)
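
For what it’s worth, here’s a rough, untested sketch of what that variant might look like, run from a notebook via pandas’ read_gbq (the project_id is a placeholder for a Google Cloud project of your own to bill the query against, and it reuses the same crude line-splitting trick as the cribbed query above, so treat any results with suspicion):

#A speculative sketch, not a tested query: count lines in .ipynb files on
#Github that look like IPython magics (lines starting with % or %%)
#Assumes the pandas-gbq dependency and a billable Google Cloud project
import pandas as pd

MAGICS_QUERY = """
SELECT REGEXP_EXTRACT(line, r'^(%%?[a-zA-Z_]+)') AS magic, COUNT(*) AS n
FROM (
  SELECT SPLIT(content, '\\n \\"') AS line
  FROM (SELECT * FROM [bigquery-public-data:github_repos.contents]
    WHERE id IN (
      SELECT id FROM [bigquery-public-data:github_repos.files]
       WHERE RIGHT(path, 6) = '.ipynb'))
   HAVING LEFT(line, 1) = '%')
GROUP BY magic
ORDER BY n DESC
LIMIT 10
"""

#'your-project-id' is a placeholder; legacy SQL dialect to match the queries above
df = pd.read_gbq(MAGICS_QUERY, project_id='your-project-id', dialect='legacy')
df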

That said, Github also seems to expose a count of files directly, as described in parente/nbestimate, from which the following estimates of growth in the number of Jupyter notebook / .ipynb files on Github are taken:

The number returned by making the following request is approximate – maybe we get the exact count if we make an authenticated request?

https://github.com/search/count?q=extension%3Aipynb+nbformat_minor&ref=searchresults&type=Code
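
As a rough sketch of how you might pull that number back into a notebook, something like the following should work (with the caveat that this is an undocumented endpoint, the response format is a guess on my part, and Github may rate-limit or change it at any time):

#Rough sketch: fetch the approximate .ipynb count from Github's undocumented
#search count endpoint; I'm assuming the response is a small text/HTML fragment
#containing the count, so we just pull out the first number-like string in it
import re
import requests

COUNT_URL = ('https://github.com/search/count'
             '?q=extension%3Aipynb+nbformat_minor&ref=searchresults&type=Code')

resp = requests.get(COUNT_URL)
resp.raise_for_status()

match = re.search(r'\d[\d,]*', resp.text)
if match:
    print('Approximate .ipynb files on Github:', match.group(0))
else:
    print('No obvious count in the response; it starts:', resp.text[:200])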

On Docker Hub, where various flavours of official notebook container are hosted, there’ve been over a million pulls of the datascience-notebook image:

[Screenshot: jupyter images on Docker Hub]

Searching Docker Hub for jupyter notebook also identifies over 5000 notebook-related Docker container images at the current count, of which over 1500 are automatically built from Github repos.
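
If you want to poke at those numbers yourself, Docker Hub appears to expose repository metadata, including pull counts, as JSON via its v2 API; the endpoint path and field names in the following sketch are my best guess rather than anything I’ve seen documented, so check them before relying on the results:

#Sketch: ask Docker Hub's v2 API for metadata about the datascience-notebook image
#The endpoint path and the 'pull_count' / 'star_count' field names are assumptions
#about the current API rather than anything officially documented
import requests

resp = requests.get('https://hub.docker.com/v2/repositories/jupyter/datascience-notebook/')
resp.raise_for_status()
repo = resp.json()

print('Pulls:', repo.get('pull_count'))
print('Stars:', repo.get('star_count'))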

Mentions of Jupyter notebooks are also starting to appear on ITJobswatch, a site that tracks IT and programming job ads in the UK.

By the by, if you fancy exploring other sorts of queries, this simple construction will let you run a Google web search to look for .ipynb files that mention the word course or module on ac.uk websites…

filetype:ipynb site:ac.uk (course OR module)

Anyway, in the same way that it can take many years to get anything flying on a NASA space mission, OU courses take a couple of years to put together, and innovations often have to prove themselves over a couple of presentations of one course before a second course will take them on; so it’s only now that we’re starting to see an uptick of interest in using Jupyter notebooks in other courses… but that interest is growing, and it looks like there’s a flurry of courses about to start production that are likely to be using notebooks.

We may have fallen a little bit behind places like UC Berkeley, where 50% of undergrads now take a core, Jupyter notebook powered, data science course, with subject-specific bolt-on mini-modules, and notebooks now provide part of the university’s core infrastructure:

…but even so, there could still be exciting times ahead…:-)

Even more exciting if we could get the library to start championing it as a general resource…

