Grabbing Screengrab Images Using Selenium (Also Works in MyBinder)

It being roundy-roundy motorsport season again, here’s a recipe for grabbing a screenshot of a live timing screen and then emailing it to one or more people.

One of the reasons for using Selenium is that timing screen pages are often updated using data received over a live websocket. There’s no real need to keep shipping all the timing data, just the bits that change. Using a browser to grab the screenshot rather than having to try to figure out the data can make life easier if you don’t want to manipulate the data at all. For some background about such Javascript rendered pages, see here. Of course, we could also use selenium to give us access to the “innerHTML” as rendered by the Javascript. I’ll maybe have a look at that in a future post…

The recipe uses a headless version Firefox automated using selenium and can run in a MyBinder container.

A Dockerfile that can load in the necessary bits looks like this:

#Use a base Jupyter notebook container
FROM jupyter/base-notebook

#We need to install some Linux packages
USER root

#Using Selenium to automate a firefox or chrome browser needs geckodriver in place
ARG GECKO_VAR=v0.23.0
RUN wget https://github.com/mozilla/geckodriver/releases/download/$GECKO_VAR/geckodriver-$GECKO_VAR-linux64.tar.gz
RUN tar -x geckodriver -zf geckodriver-$GECKO_VAR-linux64.tar.gz -O > /usr/bin/geckodriver
RUN chmod +x /usr/bin/geckodriver
RUN rm geckodriver-$GECKO_VAR-linux64.tar.gz

#Install packages required to allow us to use eg firefox in a headless way
#https://www.kaggle.com/dierickx3/kaggle-web-scraping-via-headless-firefox-selenium
RUN apt-get update \
    && apt-get install -y libgtk-3-0 libdbus-glib-1-2 xvfb \
    && apt-get install -y firefox \
    && apt-get clean
ENV DISPLAY=":99"

#Copy repo files over
COPY ./notebooks ${HOME}/work
#And make sure they are owned by the notebook user...
RUN chown -R ${NB_USER} ${HOME}

#Reset the container user back to the notebook user
USER $NB_UID

#Install Selenium python package
RUN pip install --no-cache selenium

With everything installed, we can create a headless Firefox browser as follows:

from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
options.headless = True

driver = webdriver.Firefox(options=options)

In the page I am interested in, a spnny thing is displayed in an HTML tag with id loading whilst a web socket connection is set up and the timing data is loaded for the first time.

undesiredId = 'loading'

I also create a dummy filename into which to save the screenshot:

outfile = 'screenshot.png'

Grabbing a screenshot of a page at a particular URL can be achieved using the following sort of approach, which waits for the spinny thing tag to disappear before grabbing the screenshot.

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

#Set a default screensize...
#There are other ways of setting up the browser so that we can grab the full browser view,
#even long pages that would typically require scrolling to see completely
#For example: https://blog.ouseful.info/2019/01/16/converting-pandas-generated-html-data-tables-to-png-images/
driver.set_window_size(800, 400)

#Load a webpage at a specified URL
driver.get( URL )

#Handy bits...
#EC.visibility_of_element_located
#EC.presence_of_element_located
#EC.invisibility_of_element_located

#Let's wait for the spinny thing to disappear...
element = WebDriverWait(driver, 10).until( EC.invisibility_of_element_located((By.ID, undesiredId)))

#Save the page
driver.save_screenshot( outfile )
print('Screenshot saved to {}'.format(outfile))

If I need to select a particular tab in a tabbed view, I can also do that. In the screen I am interested, the different tabs have different HTML tag id values:

tabId = "Classification"

element = browser.find_element_by_id(tabId)
element.click()
element = WebDriverWait(browser, 10).until( EC.visibility_of_element_located((By.ID, tabId)))

I can then grab the screenshot…

Having saved an image, I can then email it.

If you have a Gmail account, sending an email is quite straightforward because we can use the Gmail SMTP server:

import smtplib, ssl, getpass

port = 465  # For SSL

sender_email = input("Type your GMail address and press enter: ")
sender_password =  getpass.getpass(prompt='Password: ')

# Create a secure SSL context
context = ssl.create_default_context()

receiver_email = "user@example.com"  # Enter receiver address
message = """\
Subject: Timing Screen

Here's some email; but what about the attachment?"""

with smtplib.SMTP_SSL("smtp.gmail.com", port, context=context) as server:
    server.login(sender_email, sender_password)
    server.sendmail(sender_email, receiver_email, message)

A recipe I found here describes how to add an image file as an attachment to an email; and one I found here describes how to use images embedded in an HTML email.

A copy of a notebook associated with this recipe, along with the example email’n’images code, as applied to TSL live timing, can be found here.

What I need to do next is a bit more selenium scraping to pull out metadata from the timing screen itself, such as the race it applies to. It would also make sense to grab all the screen tabs on a particular timing screen.

The next step would be to set something running so that the script could watch the timing screen and then email out the final results screen for each race. I’m not sure if the selenium browser is continually updated by the socket connection that drives the timing screen page, and whether it can watch for the race status to change to FINISHED, and then away from it to reset the script so it is ready to email out the next final classification.

But I can’t try that or test it right now. The timing screens have shut down for the day, and I’ve also spent the whole of this beautiful day in front of a screen rather than in the garden. Bah…:-(

PS By the by, we could also load the geckodriver in directly from a Python script:

#By the by, there is also a Python package for installing geckodriver
! pip install --no-cache webdriverdownloader
#This can be used as follows within a notebook
from webdriverdownloader import GeckoDriverDownloader
gdd = GeckoDriverDownloader()
geckodriver, geckobin = gdd.download_and_install("v0.23.0")

#If required, the path to the drive can be set explicitly:
from selenium import webdriver
browser = webdriver.Firefox(executable_path=geckobin, options=options)

Author: Tony Hirst

I'm a Senior Lecturer at The Open University, with an interest in #opendata policy and practice, as well as general web tinkering...

%d bloggers like this: