Bulk Jupyter Notebook Uploads to nbgallery Using Selenium

I’ve recently started looking at nbgallery [repo], “an enterprise Jupyter Notebook sharing and collaboration platform” written in Ruby. The gallery provides a range of tools, including:

  • a Solr-powered notebook search engine;
  • a notebook “health check” (I haven’t tried this yet);
  • integration with Jupyter notebooks, so you can run notebooks (I haven’t tried this yet).

One thing that seems to be lacking is the ability to bulk upload files (for example, contained in a zip file). I haven’t spotted an API either, or a Python wrapper to provide a de facto API. This makes a proper test over lots of notebooks tricky…

UPDATE: it looks like a Python API for nbgallery is on the way… nbgallery/nbgallery-api-python

The notebook upload is a two-step process.

The first step requires selection of a notebook, and a required acknowledgement of rights:

The second provides an opportunity to submit a required title and a non-null description, along with a (repeated) rights acknowledgement:

The upload process utilises a multi-part form.

To upload a notebook, a user needs to be logged in.

Creating a new user requires an email confirmation step, which means you need to set up email server details in the docker-compose.yml file. I used my OU ones:

EMAIL_USERNAME: $OU_USERNAME
EMAIL_PASSWORD: $OU_PWD
EMAIL_DOMAIN: open.ac.uk
EMAIL_ADDRESS: ${OUCU}@open.ac.uk
EMAIL_DEFAULT_URL_OPTIONS_HOST: localhost:3000
EMAIL_SERVER: smtp.office365.com
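Since the settings above rely on shell variables being substituted into docker-compose.yml, a quick pre-flight check can save a failed confirmation email. The following is only a sketch: the helper name `missing_env_vars` is made up for illustration, and the variable names are simply the ones referenced in the fragment above.

```python
import os

# Variable names taken from the docker-compose.yml fragment above
REQUIRED_VARS = ("OU_USERNAME", "OU_PWD", "OUCU")


def missing_env_vars(names=REQUIRED_VARS, environ=None):
    """Return the names that are unset or empty in the environment.

    Hypothetical helper, not part of nbgallery or docker-compose.
    """
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Set these before starting the containers: " + ", ".join(missing))
```

Run it in the same shell that will launch the containers, so that it sees the same environment.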

My usual approach for automating this sort of thing would be to have a go with MechanicalSoup or mechanize, but on a quick first attempt with each of them, I couldn’t get the scraper to work.

Instead, I took the opportunity to have a play with Selenium With Python, a Python wrapper for the Selenium web testing framework. This provides a set of Python functions for launching a web browser (Chrome, Safari, Firefox, etc.) and then automating clicks and keystrokes on the pages viewed within that browser.

The full script I used can be found here.

The initialisation looks like this:

from selenium import webdriver

#The Selenium package includes several utilities
# for waiting until things are ready
#https://selenium-python.readthedocs.io/waits.html
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()

#Allow the driver to poll the DOM for up to 10s when
# trying to find an element
driver.implicitly_wait(10)

#We might also want to explicitly define wait conditions
# on a particular element
wait = WebDriverWait(driver, 10)

driver.get("http://localhost:3000/")

The login function looks something like this:

def nbgallery_login(driver, wait, user, pwd):
    ''' Login to nbgallery.
        Return once the login dialogue has disappeared.
    '''

    driver.find_element(By.ID, "gearDropdown").click()

    #Fill in the email address
    element = driver.find_element(By.ID, "user_email")
    element.click()

    element.clear()
    element.send_keys(user)

    #Fill in the password
    element = driver.find_element(By.ID, "user_password")
    element.clear()
    element.send_keys(pwd)
    element.click()

    #Submit the login form
    driver.find_element(By.XPATH, "//input[@value='Login']").click()

The first form script looks like this:

    #path is full path to file
    if not path.endswith('.ipynb'):
        print('Not a notebook (.ipynb) file? [{}]'.format(path))
        return

    #Part 1

    element = wait.until(EC.element_to_be_clickable((By.ID, 'uploadModalButton')))
    element.click()

    #Select the file, tick the rights acknowledgement, then submit
    driver.find_element(By.ID, "uploadFile").send_keys(path)
    driver.find_element(By.XPATH, '//*[@id="uploadFileForm"]/div[3]/div/div/label/input').click()
    driver.find_element(By.ID, "uploadFileSubmit").click()

And the script to handle the second part of the form looks like this:

    #Part 2
    element = driver.find_element(By.ID, "stageTitle")
    element.click()

    #Is there notebook metadata we can search for title?
    if not title:
        title = path.split('/')[-1].replace('.ipynb','')
    element.clear()
    element.send_keys(title)

    element = driver.find_element(By.ID, "stageDescription")
    element.click()

    #Is there notebook metadata we can search for description?
    #Any other notebook metadata we could make use of here?
    element.clear()
    #Description needs to be not null
    desc = 'No description.' if not desc else desc
    element.send_keys(desc)

    element = driver.find_element(By.ID, "stageTags-tokenfield")
    element.click()
    #time.sleep(1)

    #Handle various tagging styles
    #Is there notebook metadata we can search for tags?
    tags = '' if not tags else tags
    if isinstance(tags, list):
        tags = ','.join(tags)
    tags = tags if tags.endswith(',') else tags+','

    element.clear()
    element.send_keys(tags) #need the final comma to set it?

    if private:
        driver.find_element(By.ID, "stagePrivate").click()

    driver.find_element(By.XPATH, '//*[@id="stageForm"]/div[9]/div/div/label/input').click()

    #https://blog.codeship.com/get-selenium-to-wait-for-page-load/
    #Grab the current page before submitting so we can tell when it is replaced
    old_page = driver.find_element(By.TAG_NAME, 'html')
    driver.find_element(By.ID, "stageSubmit").click()

    #Wait for new page to load
    wait.until(EC.staleness_of(old_page))
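The comments in the script ask whether a title, description or tags could be pulled from the notebook itself. One way that might work is sketched below, reading the .ipynb file as plain JSON. The function name `guess_notebook_details` and the notebook-level metadata keys `title`, `description` and `tags` are assumptions for illustration; a notebook that lacks them falls through to its first level-one markdown heading, and then to the file name.

```python
import json
from pathlib import Path


def guess_notebook_details(path):
    """Guess a title, description and tags for a notebook file.

    Hypothetical helper: the metadata keys looked up here ('title',
    'description', 'tags') are assumptions about what a notebook may carry.
    """
    nb = json.loads(Path(path).read_text(encoding="utf-8"))
    meta = nb.get("metadata", {})

    title = meta.get("title")
    if not title:
        # Fall back to the first level-one markdown heading, if any
        for cell in nb.get("cells", []):
            if cell.get("cell_type") != "markdown":
                continue
            source = cell.get("source", "")
            # Cell source may be a single string or a list of lines
            text = source if isinstance(source, str) else "".join(source)
            for line in text.splitlines():
                if line.startswith("# "):
                    title = line[2:].strip()
                    break
            if title:
                break
    if not title:
        # Last resort: the file name without its extension
        title = Path(path).stem

    # The upload form rejects an empty description
    desc = meta.get("description") or "No description."
    tags = meta.get("tags") or []
    return title, desc, tags
```

The returned values could then be passed in as the `title`, `desc` and `tags` used in the second part of the form.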

Here’s how it plays out:

There’s still stuff that could be added — error trapping for duplicate notebooks, for example — but I think this is enough to let me upload a complete set of course notebooks and see how useful nbgallery is as a way of presenting notebooks.
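To go from a single upload to a whole set of course notebooks, the file paths need collecting first. A minimal sketch follows; the helper name `find_notebooks` is made up for illustration. It returns absolute paths, since the upload step expects the full path to each file, and it skips the autosave copies Jupyter keeps in `.ipynb_checkpoints` directories.

```python
from pathlib import Path


def find_notebooks(root):
    """Return sorted absolute paths of .ipynb files under root.

    Hypothetical helper; skips files inside .ipynb_checkpoints directories.
    """
    paths = []
    for p in Path(root).rglob("*.ipynb"):
        if ".ipynb_checkpoints" in p.parts:
            continue
        if p.is_file():
            paths.append(str(p.resolve()))
    return sorted(paths)


# Hypothetical usage, where upload_notebook stands in for the two-part
# upload routine shown earlier (that name is not from the original script):
# for path in find_notebooks("notebooks/"):
#     upload_notebook(driver, wait, path)
```

Because duplicate notebooks are not yet trapped, it is probably worth running a loop like this against a fresh gallery first.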

If it is, and I get the Jupyter notebook server integration working, then I wonder: would it be useable as a notebook navigator in the TM351 VM? It’d probably need really tight integration with the notebook server so that when notebooks are saved they are also committed to the gallery?

Author: Tony Hirst

I'm a Senior Lecturer at The Open University, with an interest in #opendata policy and practice, as well as general web tinkering...
