Ed Tech Not Keeping Up With The Rest of the World

One of the problems we face with running distance education courses is how to post corrections to online course materials… You might think this is just a case of issuing a fix, right? But it’s often not that easy.

In the days of print, as errata were identified they were added to an errata list that students could print off and were encouraged to use to manually update their own copy of the materials.

In the web world, we can either fix the materials and not tell anyone, fix the materials and post a list of the fixes we’ve made, or just post a list of fixes and then expect students to somehow realise that the weirdly worded sentence that doesn’t make sense at a particular URL is actually a typo, and that they should check the errata list to see whether a correction has been posted.

You might think that just fixing the materials is the best route, but students read materials at different rates, and reference forwards and backwards as they study. If we change materials whilst a course is in presentation, we run the risk of confusing students who remember reading something that has since disappeared from the materials.

Personally, I think we should feel free to change materials whilst the course is in presentation, but also put a faint coloured background on any material that has changed since the course went live, which pops up the history of changes, along with the original copy, associated with that update. We should highlight diffs, in other words, and make it easy to compare the current version with any previous version. And allow searching over the original, changed text, highlighting that it has been updated.

(Yes, I know… this gets difficult if changes span multiple pages. But for simple changes within a page, IT’S NOT HARD.)

Something else we have to cope with is broken links. I’ve argued before that we should be creating archived copies of links to materials that we refer to from course content (see Name (Date) Title, Available at: URL (Accessed: DATE): So What? and Fragment – Virtues of a Programmer, With a Note On Web References and Broken URLs, for example), but each year we address issues in the forums relating to the same URL being broken in the same way it was the year before, because an update never made it into master or the target page has moved again. Automating link checking and putting archived link replacements in place is a simple thing to do, and something we keep not doing. Year on year.

We use forums as one channel for students to post issues they’ve found with particular content. If we fix an issue inline in our content, and don’t highlight the fix/diff, then the student’s post is just confusing if someone clicks through on it, looking for one thing and finding another.

So how can we keep forum posted links in sync with content? Checking the Jupyter Discourse site just now, where I’d posted a link as part of a response to another post, I noted the bots had quickly got to work:

The link I posted — to a particular file referenced on the current master branch of a particular Git repository — was automatically updated to a fully referenced, persistent link to the file. If the master branch is changed and the file I’m referencing is deleted from it, the link still works. There is a downside though, in that if the file is updated, my link will still be to the original version (whereas my intent might be to link to the most current version).

Whatever.

The point I’m trying to flag is that outside of education, folk are thinking about how bots can help keep web sites running and putting them to work to that effect. They see websites as their business, and make it their business to ensure those websites perform, and keep performing, as well as they can.

As I’ve posted before (Fragmentary Thoughts on Data (and “Analytics”) in Online Distance Education), I think we tend to avoid the fact that as a distance education organisation, we are in large part JUST a web publisher and should focus on doing that bit right rather than telling our students they’re wrong…

Library DigiHub – Providing Access to Institutionally Hosted Virtual Computational Services and Applications

Over the years, I’ve tried to imagine various ways in which the Library might engage with what I’d call digital skills, rather than what they claim as such, so I was completely taken aback today to learn about the new DigiHub. (I guess this is named in part in homage to the DigiLab. Although now just a meeting room, the original vision for the DigiLab, at least as I remember it, was that it was a space where you could try out new and emerging technologies and devices; it included several games consoles, for example, as well as some of our tabletop programmable Lego robots.)

From what I can tell, the DigiHub service will comprise three main parts:

  • a digital application shelf (DASh), that will allow patrons to run Docker containerised applications on a Library server or on their own computer. The applications will include both course related applications and “generic” applications such as RStudio or OpenRefine; for unauthenticated users, this will launch temporary servers; authenticated users will also be able to save and load files from their connected institutional OneDrive account; the DASh will also include a historical items area with runnable versions of software from courses past. (I wonder if this is based on the Emulation as a Service that I gave a quick review of here three years or so ago now? I also notice that several other institutions already host related things, such as CMU’s Olive Executable Archive. See also the Software Preservation Network?)
  • a digital integration workbench service (DiWb (pronounced “dewb”?!)) that will use JupyterHub to provide access to integrated services from a single login. From what I can tell, these will be prebuilt environments, such as that exposed by the TM351 VM, that interconnect multiple services (eg a Jupyter notebook server and a PostgreSQL database server) within the same environment;
  • a “digital replay” service (DRS), linked to items in the ORO research repository, that will allow patrons to “replay” notebooks or R scripts associated with repository items. Again, this looks like a ‘small pieces loosely joined’ approach that has been put together from already existing pieces: a Github repository (or maybe a locally hosted Gitlab repository?) is associated with publications that include reproducible research code items and project data files, along with a declaration of the software packages, and version numbers, required to run the software. Binderhub is then used to launch and run a temporary environment where code and analyses associated with a particular paper can be run and further explored.

This is all really exciting, and beyond my wildest dreams of what I thought the Library would be able to achieve within such a short time of getting into the idea of providing digital computational services. (I can’t wait to see how they develop subject specific areas, for example, hooking into the OpenSTEM Labs, perhaps creating a Digital Humanities Lab, and maybe even helping my department see why a Digital Computing Lab would be a Good Thing…) In the meantime, I guess I’ll be able to play with DASh and DiWb. I also wonder how many papers currently in ORO are in a position where they can be augmented with replayable research scripts, or whether this was a pre-emptive step from the Library, seeing how researchers elsewhere are starting to support, and promote, their publications with associated research assets?

So does this represent the latest step in the evolution of technology use at the OU? (Related: Martin Weller on technology in From the University of the Air to the university everywhere.)

According to project manager Joanne Kerr, DigiHub services will start to appear later this month; the first will be a temporary notebook server, followed by the soft launch of the Digital Application Shelf with items relating to courses that already run containerised, or easily containerised, software applications, and a demonstration of some notable applications from the OU’s past on the Digital Software Archive Shelf. The repository runner will be out later in the year as part of a Digital Hub RIS (research infrastructure and support) follow-on project.

Machine Learning: What Story Do We Want to Tell?

A new course is in planning on machine learning, intended to be a required part of a data science qualification, driven by the Maths department. Interestingly, our data management and analysis course will not be a required course on that qualification (practical issues associated with working with data are presumably not relevant).

I also suspect that several approaches we could take to teaching machine learning related topics will be ruled out of scope because: maths and equations.

There are ways I think we can handle talking about equations. For example, Diagrams as Graphs, and an Aside on Reading Equations.

One of the distinguishing features of many OU courses is their strong narrative. In part this reflects the way in which material is presented to students: as the written word (occasionally embellished with video and audio explanations). OU materials also draw heavily on custom produced media assets: diagrams and animations, for example. We used to be a major producer of educational software applications, but as the development of web browsers has supported more interactivity, I get the feeling our courses have gone the other way. This is perhaps because of the overheads associated with developing software applications and then having to maintain them, cross-platform, for five years or more of a course’s life.

I also think OU material design, and the uniqueness of our approach, is under threat from things like notebook style interfaces being adopted and used by other providers. But that’s a subject for another post…

So in a new course on machine learning, in a world where there has been a recent flurry of text book publications (often around particular programming libraries that help you do machine learning) and an ever increasing supply of repositories on Github containing notebook based tutorials and course notes, what narrative twist should we apply to the story of machine learning to support the teaching of its underlying principles?

I typically favour a technology approach, where we use the technology to explore the technology, try to situate it in a practical context where we consider social, political, legal and ethical issues of using the technology within society, and try to make it relevant by exploring use cases and workflows. The academic component justifies the practical claims with robust mathematical models that provide explanations of how the stuff actually works and what it is actually doing. I also like historical timelines which show the evolution of ideas: sometimes ideas carry baggage with them from decisions that were made early on in the idea’s evolution that maybe wouldn’t be taken the same way today. (Sound familiar?) With a nod to advice given to PhD students going into the viva to defend their thesis, each course (thesis) should implicitly be able to answer the plaintive cry of the struggling student (or thesis examiner): “why should I care?”

So: machine learning. What story do we want to tell?

Having fallen asleep before going to bed last night, then groggily waking up from a rather intense dream in the early hours, this question had popped into my head by the time I made it into bed, took root, and got me back out of bed, wide awake, to scribble some notes. The rest of this post is based on those notes and is not necessarily very coherent…

Lines, planes and spaces

Take a look at the following chart, taken from the House of Commons Library post Budget 2018: The background in 9 charts:

It shows a couple of lines detailing the historical evolution of recorded figures for borrowing/deficit, as well as predictions of where the line will go.

Prediction is one of the things we use machine learning for. But what does the machine learning do?

In the above case, we can look at the chart, say each line looks roughly like a straight line, and then create a model in the form of a simple (linear) equation to describe each line:

y = mx + c

We can then plug in a year (as the independent x value) and get the dependent y value out as the modelled borrowing or surplus figure.

The following chart from the UK Met Office of climate temperatures for Southern England over several years shows periodicity, or seasonality of temperatures over months of the year.

If we were to plot temperature over several years, we’d have something like a sine wave, which again we can model as a simple equation:

y = a * sin( (b * x) + c).

Lines with more complex periodicities can be modelled by more complex equations as described by the composition of several sine waves, identified using Fourier analysis (Fourier, Laplace and z-transforms are just magical…).

But we don’t need to use machine learning to identify the equations of those lines, we can simply analyse them using techniques developed hundreds of years ago, made ever more accessible via a single line of code (example).

Some equations that define a line actually feed on themselves. If you think back to primary school maths lessons, you may remember being set questions of the form: what is the next number in this series?

In this case, the equation iterates. For example, the famous Fibonacci sequence — 0, 1, 1, 2, 3, 5, 8, … — is defined as:

F_0=0, F_1=1, F_{n+1}=F_n + F_{n-1}

The equation eats itself.
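
In code terms, a couple of lines of Python make that self-feeding nature concrete (just a quick sketch, not course code):

#Iteratively generate the first n terms of the Fibonacci sequence
def fib(n):
    terms = [0, 1]
    while len(terms) < n:
        #Each new term feeds on the two previous ones
        terms.append(terms[-1] + terms[-2])
    return terms[:n]

print(fib(8))  #[0, 1, 1, 2, 3, 5, 8, 13]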

Although we don’t need machine learning to help us match these equations to a particular data set, machines can help us fit them. For example, if we have guessed the right form of equation to fit a set of data points (for example, y = mx + c) a machine can quickly help identify the values of m and c, or perform a Fourier analysis for us.
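
Here’s a minimal sketch of that sort of fit, using numpy and some made-up data points that lie roughly on a straight line (the numbers are purely illustrative):

import numpy as np

#Some made-up data points that lie roughly on a straight line
x = np.array([2013, 2014, 2015, 2016, 2017, 2018])
y = np.array([120.0, 109.5, 101.0, 92.5, 83.0, 74.5])

#A least squares fit of y = mx + c to the data: one line of code
m, c = np.polyfit(x, y, deg=1)
print(m, c)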

If we aren’t sure what equation might fit a set of data, we could use a machine to try out lots of different sorts of equation against a dataset, calculate the error on each (that is, how well it predicts data values when it is “best fit”) and then pick the one with the lowest error as the best fit equation. We could argue this is a weak form of learning about which equation type, from a preselected set, best fits the data.
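
As a sketch of what that might look like, again with made-up data (the candidate equation forms and the error measure are arbitrary choices for illustration):

import numpy as np
from scipy.optimize import curve_fit

#Some made-up, noisy data with a roughly linear trend
x = np.linspace(0, 10, 50)
y = 2.5 * x + 1.0 + np.random.normal(0, 1, x.size)

#A preselected set of candidate equation forms
candidates = {
    'linear': lambda x, m, c: m * x + c,
    'quadratic': lambda x, a, b, c: a * x**2 + b * x + c,
    'sinusoid': lambda x, a, b, c: a * np.sin(b * x + c),
}

#Fit each candidate, score it by its sum of squared errors, and keep the best
errors = {}
for name, f in candidates.items():
    try:
        params, _ = curve_fit(f, x, y, maxfev=5000)
        errors[name] = np.sum((y - f(x, *params))**2)
    except RuntimeError:
        #The optimiser may fail to converge for a poorly matched equation form
        pass

print(min(errors, key=errors.get))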

The aim, remember, is to come up with the equation of a line (or, where we have more than two dimensions, a plane or surface) so that we can predict a dependent variable from one or more independent variables.

That may sound confusing, but you already have a folk mathematical sense of how this works: X is male, morbidly obese and 5 feet 6 inches tall. About how much does he weigh?

To estimate X’s weight, you might treat it as a prediction exercise and model that question in the form of an equation that predicts a ‘normal’ weight for that gender and height (two independent variables) then scales it in some way according to the label morbidly obese. If you’re a medic, you may define the morbidly obese term formally. If you aren’t, you may treat it as an arbitrary label that you associate, perhaps in a personally biased way, with a particular extent of overweightness.

There is another way you might approach the same question, and that is as something more akin to a classification task. For example, you know lots of people; you know, ish, their weights; you imagine which group of people X is most like and then estimate his weight based on the weight of people you know in that group.

Equations still have a role to play here. If you imagine the world of people as “tall and slight”, “tall and large”, “short and slight”, “short and large” and “average”, you may imagine the classification space as being constructed something like this:

Distinctions are defined over each of the axes and then combined to make the final classification. You might then associate further (arbitrary) labels with the different classifications: “short and slight” might be relabelled “small”, for example. Other categories might be relabelled “large”, “wide” and “lanky” (can you work out which they would apply to?). This relabelling is subjective and for our “benefit”, not the machine’s, to make the categories memorable. Memorable to our biases, stereotypes and preconceptions. It can also influence our thinking when we come to refer to the groups by these labels at a later stage…

So here then is another thing that we might look to machine learning for: given a set of data, can we classify items that are somehow similar along different axes and label them as separate, identifiable groupings? Firstly, for identification purposes (which group does this item belong to?). Secondly, in combination with predictive models defined within those groupings, to allow us to make predictions about things we have classified in one way, and perhaps different predictions about things classified another way.

Again, we need to come up with equations that allow us to make distinctions (above the line or below the line, to the left of the line or to the right, within the box or outside the box) so we can run the data against those equations and make a categorisation.

Again, we don’t necessarily need “machine learning” to help us identify these equations, if we assume a particular classification model. For example, the k-means technique allows you to say you want to find k different groupings within a set of data and it will do that for you. For k=5, it will fit a set of equations that will group the data into five categories. But again, we might want to throw a set of possible models at the data and then pick the one that best works, which a machine can automate for us. A weak sort of learning, at best.
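
By way of illustration, here’s a minimal sketch of the k-means approach using scikit-learn and some made-up (height, weight) style data; the numbers, and the choice of k, are purely illustrative:

import numpy as np
from sklearn.cluster import KMeans

#Some made-up (height in cm, weight in kg) data points
X = np.array([[185, 70], [183, 95], [160, 50], [158, 82], [170, 68],
              [188, 72], [180, 98], [162, 52], [156, 85], [172, 70]])

#Ask for k=5 groupings; the machine fits the group boundaries for us
model = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)

print(model.labels_)           #Which group each item was assigned to
print(model.cluster_centers_)  #The centre of each group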

So what is machine learning?

It’s what you do when you don’t know what model to use. It’s what you do when you throw a set of data at a computer and say: I have no idea. You make sense of it. You find what equations to use and how to combine them. And it doesn’t matter if we can’t understand or make sense of any of them. As far as I’m concerned, I’m model free and theory free. Go at it.

Of course, the machine may learn rules that are understandable, and that weren’t what you expected:

When feeding a machine, you need to be wary of what you feed it.

In telling the story of machine learning, then, do we need to do any of that ‘precursor’ stuff, of lines and planes and what we can do just anyway using ‘traditional’ mathematical techniques, perhaps with added automation so we can try lots of pre-defined models we can fit quickly? Or should we just get straight on to the machine learning bits? There is only so much time, breadth and depth available to us when delivering the course, after all.

I think we do…

Throughout the course, those foundational ideas would provide a ground truth that a student can use to anchor themselves back to the questions of: what am I (which is to say, the machine) actually trying to do? and why are we using machine learning? Don’t we have a theory about the data that we could use to build a model from instead?

So what about spaces? What are they, and do we need to know about them?

When I put together the OU short course on game design and development (I knew nothing about that, either) I came across the idea of lenses, different ways of looking at a problem that bring different aspects of it into focus. Here’s a set of six lenses commonly used in news reporting: who?, what?, why?, when?, where?, how?.

When we pull together a dataset around which we want to make a set of predictions, or classifications, or classification based predictions, we define a space of inquiry within which we want to ask particular questions. In the classification example above, the space I came up with was a lazy one: something distinguishing and recognisable about people that I can identify different groupings within. Within that space I then identified a set of metrics (measurable things) that I could determine within it (height and weight). Already things are a bit arbitrary and subject to bias. Who chose the space? Why? What do they want to make those classifications for? Who chose the metrics? What population provided the data? How was the data collected? When was it collected? Where? Could any of that make a difference? (Like only using a particular machine for suspected hip fractures, and marking records urgent if there is a hip fracture evident in an x-ray, for example…) Is there, perhaps, a theory or an intuitive model we can apply to a dataset to perform a particular task that doesn’t mean we need to take the last resort of machine learning (“if machine X and marked urgent, then hip fracture”).

We might also create a space out of nowhere around the data we have. A space we can go fishing in but that we don’t really understand.

So another bedrock I think we need in a course on machine learning, another storyline we can call on, is a set of questions, a set of lenses, that we can use to identify what space our data lies in, as well as critically interrogating the metrics we impose upon it and from which we (via our machines) develop our machine learned models.

Readers who know anything about machine learning, which I don’t, really, will notice I never even got as far as talking about things like supervised learning, unsupervised learning, reinforcement learning, etc, let alone the different approaches or implementations we might take towards them. That’s the next part of the story we need to think about… This first part was more about getting a foundation stone in place that students can refer back to: “what is this thing actually trying to do?”, rather than “how is it going about it?”

PS By the by, the “lines and spaces” refrain puts me in mind of Wassily Kandinsky’s Point and Line to Plane. As it’s over 30 years since I last read it, and I still carry the title with me, I should probably read it again.

We Need to Talk About Geo…

Over the last couple of weeks, I’ve spent a bit of time playing with geodata. Maps are really powerful visualisation techniques, but how do you go about creating them?

One way is to use a bespoke GIS (Geographic Information System) application: tools such as the open source, cross-platform desktop application QGIS, that lets you “create, edit, visualise, analyse and publish geospatial information”.

Another is to take the “small pieces, loosely joined” approach of grabbing the functionality you need from different programming packages and wiring them together.

This “wiring together” takes two forms: in the first case, using standardised file formats we can open, save and transfer data between applications; in the second, different programming languages often have programming libraries that become de facto ways of working with particular sorts of data and that are then used within yet other packages. In Python, pandas is widely used for manipulating tabular data, and shapely is widely used for representing geospatial data (point locations, lines, or closed shapes). These are combined in geopandas, and we then see tools and solutions built upon that format as the ecosystem builds out further. In the R world, the Tidyverse provides a range of packages designed to work together, and again, an ecosystem of interoperable tools and workflows results.

Having robust building blocks allows higher level tools to be built on top of them, designed to perform specific functions. Working through some simple self-directed (and self-created) problems (for which read: things I wanted to try to do, or build, or wondered how to do), it strikes me once again that quite ambitious sounding tasks can be completed quite straightforwardly if you can imagine a way of decomposing a problem into separate, discrete parts, look for ways of solving those parts, and then join the pieces back together again.

For example, here’s a map of the UK showing Westminster constituencies coloured by the party of the MP as voted for at the last general election:

How would we go about creating such a map?

The answer is quite straightforward if we make use of a geodataset that combines shape information (the boundary lines that make up each constituency, suitably represented) with information about the election result. Data such as that made available by Alasdair Rae, for example.

First things first, we need to obtain the data:

#Define the URL that points to the data file
electiondata_url = 'http://ajrae.staff.shef.ac.uk/wpc/geojson/uk_wpc_2018_with_data.geojson'

#Import the geopandas package for working with tabular and spatial data combined
import geopandas

#Enable inline plotting in Jupyter notebooks
#(Some notebook installations automatically enable this)
%matplotlib inline

#Load the data from the URL
gdf = geopandas.read_file(electiondata_url)

#Optionally preview the first few rows of the data
gdf.head()

That wasn’t too hard to understand, or demonstrate to students, was it?

  • make sure the environment is set up correctly for plotting things
  • import a package that helps you work with a particular sort of data
  • specify the location of a data file
  • automatically download the data into a form you can work with
  • preview the data.

So what’s next?

To generate a choropleth map that shows the majority in a particular constituency, we just need to check the dataframe for the column name that contains the majority values, and then plot the map:

gdf.plot(column='majority')

To control the size of the rendered map, I need to do a little bit more work (it would be much better if the geopandas package let me do this as part of the .plot() method):

#Set the default plot size
from matplotlib import pyplot as plt
fig, ax = plt.subplots(1, figsize=(12, 12))

ax = gdf.plot(column='majority', ax=ax)

#Switch off the bounding box drawn round the map so it looks a bit tidier
ax.axis('off');

To plot the map coloured by party, I just need to change the column used as the basis for colouring the map.

fig, ax = plt.subplots(1, figsize=(12, 12))

ax = gdf.plot(column='Party' , ax=ax)
ax.axis('off');

You should be able to see how the code is more or less exactly the same as the previous bit of code except that I don’t need to import the pyplot package (it’s already loaded) and all I need to change is the column name.

The colours are wrong though — they’re set by default rather than relating to colours we might naturally associate with the parties.

So this is the next problem solving step — how do I associate a colour with a party name?

At the moment this is a bit fiddly (again, geopandas could make this easier), but once I have a recipe I should be able to reuse it to colour other columns using other column-value-to-colour mappings.

from matplotlib.colors import ListedColormap

#Set up color maps by party
partycolors = {'Conservative':'blue',
               'Labour':'red',
               'Independent':'black',
               'Liberal Democrat':'orange',
               'Labour/Co-operative':'red',
               'Green':'green' ,
               'Speaker':'black',
               'DUP':'pink',
               'Sinn Féin':'darkgreen',
               'Scottish National Party':'yellow',
               'Plaid Cymru':'brown'}

#The dataframe seems to assign items to categories based on the selected column sort order
#We can define a color map with a similar sorting
colors = [partycolors[k] for k in sorted(partycolors.keys())]

fig, ax = plt.subplots(1, figsize=(12, 12))

ax = gdf.plot(column='Party', cmap = ListedColormap(colors), ax=ax)
ax.axis('off');

In this case, I load in another helpful package, define a set of party-name-to-colour mappings, use that to generate a list of colour names in the correct order, and then build and use a cmap object within the plot function.

If I wanted to do a similar thing based on another column, all I would have to do is change the partycolors = {} definition and the column name in the plot command: the rest of the code would be reusable.

When you have a piece of code that works, you can wrap it in a function and reuse it, or share it with other people. For example, here’s how I use a function I created for displaying a choropleth map of a particular deprivation index measure for a local authority district and its neighbours (I’ll give the function code later on in the post):

plotNeighbours(gdf,
               'Portsmouth',
               'Education, Skills and Training - Rank of average rank')

Using pandas and geopandas we can easily add data from one source, for example, from an Excel spreadsheet file, to a geopandas dataset. For example, let’s download some local authority boundary files from the ONS and some deprivation data:

import geopandas

#From the downloads area of the page, grab the link for the shapefile download
url='https://opendata.arcgis.com/datasets/7ff28788e1e640de8150fb8f35703f6e_2.zip?outSR=%7B%22wkid%22%3A27700%2C%22latestWkid%22%3A27700%7D'
gdf = geopandas.read_file(url)

#Import pandas package
import pandas as pd

#https://www.gov.uk/government/statistics/english-indices-of-deprivation-2015
#File 10: local authority district summaries
data_url = 'https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/464464/File_10_ID2015_Local_Authority_District_Summaries.xlsx'

#Download and read in the deprivation data Excel file
df = pd.read_excel(data_url, sheet_name=None)

#Preview the name of the sheets in the data loaded from the Excel file
df.keys()

We can merge the two data files based on a common column, the local authority district codes:

#Merge in data
gdf = pd.merge(gdf, df['Education'],
               how='inner',  #The type of join (what happens if data is in one dataset and not the other)
               left_on='lad16cd', #Column we're merging on in left dataframe
               right_on='Local Authority District code (2013)'#Column we're merging on in right dataframe
              )

And plot a choropleth map of one of the deprivation indicators:

ax = gdf.plot(column='Education, Skills and Training - Average rank')
ax.axis('off');

Just by the by, plotting interactive Google style maps is just as easy as plotting static ones. The folium package helps with that, for example:

import folium

m =  folium.Map(max_zoom=9, location=[54.5, -0.8])
folium.Choropleth(gdf.head(), key_on='feature.properties.lad16cd',
                  data=df['Education'],
                  columns=['Local Authority District code (2013)',
                           'Education, Skills and Training - Rank of average rank'],
            fill_color='YlOrBr').add_to(m)
m

I also created some magic some time ago to try to make folium maps even easier to create: ipython_magic_folium.

To plot a choropleth of a specified local authority and its neighbours, here’s the code behind the function I showed previously:

#Via https://gis.stackexchange.com/a/300262/119781

def plotNeighbours(gdf, region='Milton Keynes',
                   indicator='Education, Skills and Training - Rank of average rank',
                   cmap='OrRd'):
    ''' Plot choropleth for an indicator relative to a specified region and its neighbours. '''

    targetBoundary = gdf[gdf['lad16nm']==region]['geometry'].values[0]
    neighbours = gdf.apply(lambda row: row['geometry'].touches(targetBoundary) or row['geometry']==targetBoundary ,
                           axis=1)

    #Show the data for the selected area and its neighbours
    display(gdf[neighbours][['lad16nm',indicator]].set_index('lad16nm'))

    #Generate choropleth
    ax = gdf[neighbours].plot(column=indicator, cmap=cmap)
    ax.axis('off');

One thing this bit of code does is look for boundaries that touch on the specified boundary. By representing the boundaries as geographical objects, we can use geopandas to manipulate them in a spatially meaningful way.

If you want to try a notebook containing some of these demos, you can launch one on MyBinder here.

So what other ways can we manipulate geographical objects? In the notebook Police API Demo.ipynb I show how we can use the osmnx package to find a walking route between two pubs, convert that route (which is a geographical line object) to a buffered area around the route (for example defining an area that lies within 100m of the route) and then make a call to the Police API to look up crimes in that area in a specified period.
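
The buffering step itself only takes a few lines. Here’s a rough sketch of the idea using geopandas and shapely (the route co-ordinates are made up, and reprojecting to the British National Grid is just one way of buffering in metres; in the notebook the route itself comes from osmnx):

import geopandas as gpd
from shapely.geometry import LineString

#A walking route as a sequence of (lon, lat) points (made-up co-ordinates)
route = LineString([(-1.091, 50.798), (-1.089, 50.801), (-1.086, 50.803)])

#Buffering in metres needs a projected CRS, so reproject to the British National Grid,
#buffer by 100m, then reproject back to lat/lon
buffered = (gpd.GeoSeries([route], crs='EPSG:4326')
              .to_crs(epsg=27700)
              .buffer(100)
              .to_crs(epsg=4326)[0])

#The boundary points of the resulting polygon can then be passed to the Police API
print(list(buffered.exterior.coords)[:3])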

The same notebook also shows how to create a Voronoi diagram based on a series of points that lie within a specified region; specifically, the points were registered crime location points within a particular neighbourhood area, and the Voronoi diagram then automatically creates boundaried areas around those points so they can be coloured as in a choropleth map.
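
The Voronoi step is similarly compact. A rough sketch using scipy and shapely, with made-up point locations (the resulting finite regions then just need clipping to the neighbourhood boundary):

import numpy as np
from scipy.spatial import Voronoi
from shapely.geometry import Polygon

#Some made-up crime location points (lon, lat)
points = np.array([[-1.10, 50.78], [-1.04, 50.78], [-1.04, 50.83],
                   [-1.10, 50.83], [-1.07, 50.80], [-1.08, 50.81]])

vor = Voronoi(points)

#Turn the finite Voronoi regions into shapely polygons that can be clipped
#to the neighbourhood boundary and coloured as in a choropleth map
polys = [Polygon(vor.vertices[region])
         for region in vor.regions if region and -1 not in region]
print(len(polys))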

The ‘crimes within an area along a route’ query and the Voronoi mapping are both incredibly powerful ideas, and incredibly powerful techniques, that can be achieved with only a few lines of code. And once the code recipe has been discovered once, it can often be turned into a function and called with a single line of code.

One of the issues with things like geopandas is that the dataframe resides in computer memory. Shapefiles can be quite large, so this may have an adverse effect on your computer. But tools such as SpatiaLite allow you to commit large geodata files to a simple file based SQLite database (no installation or running server required) and do geo operations on it directly, such as looking for points within a particular boundaried area.

At the moment, SpatiaLite docs leave something to be desired, and finding handy recipes to reuse or work from can be challenging, but there are some out there. And I’ve also started to come up with my own demos. For example, check out this notebook of LSOA Sketches.ipynb that includes examples of how to look up an LSOA code from latitude and longitude co-ordinates. The notebook also shows how to download a database of postcodes into the same database as the shapefiles and then use postcode centroid locations to find which LSOA boundary contains the (centroid) location of a specified postcode.
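
By way of example, here’s the sort of point-in-polygon lookup that can be pushed down into the database. This is only a sketch: it assumes the mod_spatialite extension is installed on the system, and the lsoa_boundaries table and its lsoa_code and geom columns are hypothetical names standing in for whatever the imported shapefile table actually looks like:

import sqlite3

#Open the file based database and load the SpatiaLite extension
conn = sqlite3.connect('lsoa.db')
conn.enable_load_extension(True)
conn.load_extension('mod_spatialite')

lat, lon = 50.80, -1.09  #An example point location

#Ask the database which LSOA boundary contains the point
q = '''
SELECT lsoa_code FROM lsoa_boundaries
WHERE ST_Contains(geom, MakePoint(?, ?, 4326));
'''
print(conn.execute(q, (lon, lat)).fetchall())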

Grabbing Screengrab Images Using Selenium (Also Works in MyBinder)

It being roundy-roundy motorsport season again, here’s a recipe for grabbing a screenshot of a live timing screen and then emailing it to one or more people.

One of the reasons for using Selenium is that timing screen pages are often updated using data received over a live websocket. There’s no real need to keep shipping all the timing data, just the bits that change. Using a browser to grab the screenshot rather than having to try to figure out the data can make life easier if you don’t want to manipulate the data at all. For some background about such Javascript rendered pages, see here. Of course, we could also use selenium to give us access to the “innerHTML” as rendered by the Javascript. I’ll maybe have a look at that in a future post…

The recipe uses a headless version of Firefox automated using Selenium and can run in a MyBinder container.

A Dockerfile that can load in the necessary bits looks like this:

#Use a base Jupyter notebook container
FROM jupyter/base-notebook

#We need to install some Linux packages
USER root

#Using Selenium to automate a firefox or chrome browser needs geckodriver in place
ARG GECKO_VAR=v0.23.0
RUN wget https://github.com/mozilla/geckodriver/releases/download/$GECKO_VAR/geckodriver-$GECKO_VAR-linux64.tar.gz
RUN tar -x geckodriver -zf geckodriver-$GECKO_VAR-linux64.tar.gz -O > /usr/bin/geckodriver
RUN chmod +x /usr/bin/geckodriver
RUN rm geckodriver-$GECKO_VAR-linux64.tar.gz

#Install packages required to allow us to use eg firefox in a headless way
#https://www.kaggle.com/dierickx3/kaggle-web-scraping-via-headless-firefox-selenium
RUN apt-get update \
    && apt-get install -y libgtk-3-0 libdbus-glib-1-2 xvfb \
    && apt-get install -y firefox \
    && apt-get clean
ENV DISPLAY=":99"

#Copy repo files over
COPY ./notebooks ${HOME}/work
#And make sure they are owned by the notebook user...
RUN chown -R ${NB_USER} ${HOME}

#Reset the container user back to the notebook user
USER $NB_UID

#Install Selenium python package
RUN pip install --no-cache selenium

With everything installed, we can create a headless Firefox browser as follows:

from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
options.headless = True

driver = webdriver.Firefox(options=options)

In the page I am interested in, a spinny thing is displayed in an HTML tag with id loading whilst a web socket connection is set up and the timing data is loaded for the first time.

undesiredId = 'loading'

I also create a dummy filename into which to save the screenshot:

outfile = 'screenshot.png'

Grabbing a screenshot of a page at a particular URL can be achieved using the following sort of approach, which waits for the spinny thing tag to disappear before grabbing the screenshot.

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

#Set a default screensize...
#There are other ways of setting up the browser so that we can grab the full browser view,
#even long pages that would typically require scrolling to see completely
#For example: https://blog.ouseful.info/2019/01/16/converting-pandas-generated-html-data-tables-to-png-images/
driver.set_window_size(800, 400)

#Load a webpage at a specified URL
driver.get( URL )

#Handy bits...
#EC.visibility_of_element_located
#EC.presence_of_element_located
#EC.invisibility_of_element_located

#Let's wait for the spinny thing to disappear...
element = WebDriverWait(driver, 10).until( EC.invisibility_of_element_located((By.ID, undesiredId)))

#Save the page
driver.save_screenshot( outfile )
print('Screenshot saved to {}'.format(outfile))

If I need to select a particular tab in a tabbed view, I can also do that. In the screen I am interested in, the different tabs have different HTML tag id values:

tabId = "Classification"

element = driver.find_element_by_id(tabId)
element.click()
element = WebDriverWait(driver, 10).until( EC.visibility_of_element_located((By.ID, tabId)))

I can then grab the screenshot…

Having saved an image, I can then email it.

If you have a Gmail account, sending an email is quite straightforward because we can use the Gmail SMTP server:

import smtplib, ssl, getpass

port = 465  # For SSL

sender_email = input("Type your GMail address and press enter: ")
sender_password =  getpass.getpass(prompt='Password: ')

# Create a secure SSL context
context = ssl.create_default_context()

receiver_email = "user@example.com"  # Enter receiver address
message = """\
Subject: Timing Screen

Here's some email; but what about the attachment?"""

with smtplib.SMTP_SSL("smtp.gmail.com", port, context=context) as server:
    server.login(sender_email, sender_password)
    server.sendmail(sender_email, receiver_email, message)

A recipe I found here describes how to add an image file as an attachment to an email; and one I found here describes how to use images embedded in an HTML email.
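
For what it’s worth, the attachment part can be handled with just the Python standard library. A minimal sketch, reusing the sender_email, sender_password, receiver_email and outfile values from above:

import smtplib, ssl
from email.message import EmailMessage

msg = EmailMessage()
msg['Subject'] = 'Timing Screen'
msg['From'] = sender_email
msg['To'] = receiver_email
msg.set_content("Final classification attached.")

#Attach the saved screenshot as an image attachment
with open(outfile, 'rb') as f:
    msg.add_attachment(f.read(), maintype='image', subtype='png', filename=outfile)

with smtplib.SMTP_SSL("smtp.gmail.com", 465, context=ssl.create_default_context()) as server:
    server.login(sender_email, sender_password)
    server.send_message(msg)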

A copy of a notebook associated with this recipe, along with the example email’n’images code, as applied to TSL live timing, can be found here.

What I need to do next is a bit more selenium scraping to pull out metadata from the timing screen itself, such as the race it applies to. It would also make sense to grab all the screen tabs on a particular timing screen.

The next step would be to set something running so that the script could watch the timing screen and then email out the final results screen for each race. I’m not sure if the selenium browser is continually updated by the socket connection that drives the timing screen page, and whether it can watch for the race status to change to FINISHED, and then away from it to reset the script so it is ready to email out the next final classification.
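
If the page does keep updating, a simple polling loop might be a starting point. In the following sketch the status element id is a guess (I haven’t checked what the timing screen actually calls it):

import time

#Hypothetical id of the element showing the race status
statusId = 'status'

previous = None
while True:
    status = driver.find_element_by_id(statusId).text
    if status == 'FINISHED' and previous != 'FINISHED':
        #The race has just finished: grab and email the final classification screen
        driver.save_screenshot(outfile)
    previous = status
    time.sleep(30)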

But I can’t try that or test it right now. The timing screens have shut down for the day, and I’ve also spent the whole of this beautiful day in front of a screen rather than in the garden. Bah…:-(

PS By the by, we could also load the geckodriver in directly from a Python script:

#By the by, there is also a Python package for installing geckodriver
! pip install --no-cache webdriverdownloader
#This can be used as follows within a notebook
from webdriverdownloader import GeckoDriverDownloader
gdd = GeckoDriverDownloader()
geckodriver, geckobin = gdd.download_and_install("v0.23.0")

#If required, the path to the driver can be set explicitly:
from selenium import webdriver
browser = webdriver.Firefox(executable_path=geckobin, options=options)

Customisation vs. Personalisation in Course Offerings

According to the Cambridge English Dictionary, customisation and personalisation are defined as follows:

  • customize: verb [T] (UK usually customise) — to make or change something according to the buyer’s or user’s needs
  • personalize: verb [T] (UK usually personalise) — to make something suitable for the needs of a particular person. If you personalize an object, you change it or add to it so that it is obvious that it belongs to or comes from you.

In this post, I’m going to take a more extreme position, contrasting them as:

  • customisation: the changes a vendor or service provider makes;
  • personalisation: the changes a user makes.

Note that a user may play a role in customisation. For example, when buying a car, or computer, a buyer might customise it during purchase using a configurator that lets them select various options: the customisation is done by the vendor, albeit under the control of the buyer. They may then personalise it when they receive it by putting stickers all over it.

One of the things I’ve been pondering in the context of how we deliver software to students is the extent to which we offer them customised and personalisable environments.

In the second half of the post Fragment — Some Rambling Thoughts on Computing Environments in Education I decompose computing environments into three components (PLC): a physical component (servers); a logical component (computing environment: operating system, installed packages, etc); and a cultural component (personal preference text editors, workflows, etc.).

When we provide students with a virtual machine, we provide them with a customised environment at the course (module) level. Each student gets the same logical virtual machine.

The behaviour of the machine in a logical sense will be the same for each student. But students have different computers, with different resource profiles (different processor speeds, or memory, for example).

So their experience of running the logical machine on their personal computer will be a personalised one.

As personalisation (under my sense of the term) it is outside our control.

If we offer students access to the logical machine running on our servers, we customise the physical layer in terms of compute resource, but students will still have a personalised experience based on the speed and latency of their network connection.

At this point, I suggest that we can control what students receive in terms of the logical component (customisation), and we can suggest minimum resource requirements to try to ensure a minimum acceptable experience, if not parity of experience, in terms of the physical component.

But there is then a tension about the extent to which we tell students how they can personalise their physical component. If we are shipping a VM, should we tell students with powerful computers how to increase the memory size, or number of cores, used by the virtual machine? The change would be a personalisation implemented at the logical layer (changing default settings) that exploits personalisation at the lower physical layer. Or would that be unfair to students with low specced machines who cannot make the changes at the logical layer that other students can?

If it takes the student with the lowest specced machine an hour to run a particularly expensive computation, should every student have to take an hour? Or should we tell students who are in a position to run the computation on their higher specced machine how to change the logical layer to let them run the activity in 5 minutes?

At the cultural layer, I would contend that we should be encouraging students to explore personalisation.

If we are running a course that involves an element of programming using a particular set of programming libraries, we can provide students with a logical environment containing all the required libraries. But should we also control the programming editor environment, particularly if the student is a seasoned developer, perhaps in another language, with a pre-existing workflow and a highly tuned, personalised editing environment?

In our TM351 Data Management and Analysis course, we deliver material to students using a virtual machine and Jupyter notebooks. To complete the course assessment, we require students to develop some code and present a data investigation in a notebook. For seasoned developers, the notebook environment is not necessarily the best environment in which to develop code, so when one student who lived in a Microsoft VS Code editor at work wanted to develop code in that personalised environment using our customised logical environment, that seemed fine to me.

Reflecting on this, it seems to me that at the cultural level we can make recommendations about what tools to use and manage our delivery of the experience in terms of: this is how to do this academic thing we are teaching in this cultural environment (a particular editor, for example) but if you want to personalise the cultural environment, that is fine (and perhaps more than that: it is right and proper…).

To riff on TM351 again, the Jupyter notebook environment we provide is customised (preconfigured) at the logical level with preinstalled extensions and customised at the cultural layer with certain of the extensions pre-enabled (a spell checker is enabled, for example, and a WYSIWYG markdown editor). But students are also free to personalise the notebook environment at the cultural level by enabling their own selection of preinstalled notebook extensions. They can also go further, and personalise the logical component by installing additional extensions that can facilitate personalisation at the cultural level, with the caveat that we only guarantee that things work using the logical component we provided students with.

PS I originally started pondering customisation vs personalisation as a rant against “personalisation” in education. I’d argue that it is actually “customisation” and an example of the institution imposing different customised offerings at the individual student level.

Fragment — Some Rambling Thoughts on Computing Environments in Education

One of the challenges that faces the distance educator, indeed, any educator, in delivering computing related activities is how to provide students with an environment in which they can complete practical computing related teaching and learning activities.

Simply getting the student to a place where they can work on, and run, the code you want them to use is far from trivial.

In a recent post on Creating gentle introductions to coding for journalists… (which for history of ideas folk, and my own narrative development timeline, appeared sometime after most of this post was drafted, but contextualises it nicely), journalism educator Andy Dickinson (@digidickinson) describes how in teaching MA students a little bit of Python he wanted to:

– Avoid where possible, the debates – Should journalists learn to code? Anyone?
– Avoid where possible too much jargon – Is this actually coding or programming or just html
– Avoid the issue of installing development environments – “We’ll do an easy intro but, first lets install R/python/homebrew/jupyter/anaconda…etc.etc.”
– Not put people off – fingers crossed

The post describes how he tried to show a natural equivalence between, and progression from, Excel formulas to Python code (see my post from yesterday on Diagrams as Graphs, and an Aside on Reading Equations which was in part inspired by that sentiment).

But that’s not what I want to draw on here.

What I do want to draw on is this:

The equation `Tech + Journalists=` is one you don’t need any coding experience to solve. The answer is stress.

Experience has taught me that as soon as you add tech to the mix, you can guarantee that one person will have a screen that looks different or an app that doesn’t work. Things get more complicated when you want people to play and experiment beyond the classroom. Apps that don’t install; or draconian security permissions are only the start. Some of this stuff is quite hardcore for a user who’s never used notepad before let alone fired up the command prompt. All of this can be the hurdle that most people fall at. It can sap your motivation.

Andy declares a preference for Anaconda, but I think that is… I prefer alternatives. Like Docker. This is my latest attempt at explaining why: This is What I Keep Trying to Say….

Docker is also like a friendly way in to the idea of infinite interns.

I first came across this idea — of infinite interns — from @datamineruk (aka Nicola Hughes), developed, I think, in association with Daithí Ó Crualaoich (@twtrdaithi, and by the looks of his Twitter stream, fellow Malta fan:-)

As an idea, I can’t think of anything that has had a deeper or more profound effect on my thinking as regards virtual computing than infinite interns.

Here’s how the concept was originally described, in a blog post that I think is now only viewable via the Internet Archive Wayback Machine — DataMinerUK: What I Do And How:

I specialise in the backend of data journalism: investigations. I work to be the primary source of a story, having found it in data. As such my skills lean less towards design and JavaScript and more towards scraping, databases and statistics.

I work in a virtual world. Literally. The only software I have installed on my machine are VirtualBox and Vagrant. I create a virtual machine inside my machine. I have blueprints for many virtual machines. Each machine has a different function i.e. a different piece of software installed. So to perform a function such as fetching the data or cleaning it or analysing it, I have a brand new environment which can be recreated on any computer.

I call these environments “Infinite Interns“. In order to help journalists see the possibilities of what I do, I tell then to think about what they could accomplish if they had an infinite amount of interns. Because that’s what code is. Here are a couple of slides about my Infinite Interns system:

And here are the slides, used without permission…

Let’s go back to Andy…

There are always going to be snags and, by the time we get to importing libs like pandas [a Python package for working with tabular data], things are going to get complicated – it’s unavoidable. But if the students come away knowing that code isn’t tricky at least in principle, that at a low level the basic structures and ideas are pretty simple and there’s plenty of support out there. Well, that’ll be a win. Fingers crossed.

What you really need is an infinite intern

Which is to say, what you really need is an easy way to tell students how to set up their computing environment.

Which is to say, you really need an easy way for students to tell their computers what sort of environment they’d like to work in.

Want a minimal Jupyter notebook?

docker run --rm -p 8877:8888 -e JUPYTER_TOKEN=letmein jupyter/minimal-notebook

and look to http://localhost:8877 then login with token letmein.

Need a scipy stack in there? Use a different intern…

docker run --rm -p 8877:8888 -e JUPYTER_TOKEN=letmein jupyter/scipy-notebook

And so on…

And if you can’t install Docker on your machine, you can still run (notebook running) containers in the cloud: for example, Running a Minimal OU Customised Personal Jupyter Notebook Server on Digital Ocean.

There’s also tooling to build containers from build specs in Github repos, such as repo2docker. This tool can automatically add in a notebook server for you. That same application is used to build containers that run on the cloud from a Github repo, at a single click: MyBinder (docs).

What this shows, though, is that installing software actually masks a series of issues.

If a student, or a data journalist, is on a low spec computer, or a computer that doesn’t let you install desktop software applications, or a computer that has a different operating system than the one required by the application you want to run, what are you to do?

What is the problem we are actually trying to solve?

I see the computing environment as made up of three components (PLC):

  • a physical component;
  • a logical component;
  • a cultural component.

The Physical Component

The physical component, (physical environment, or physical layer) corresponds to the physical (hardware) resource(s) required to run an activity. This might be a student’s own computer or it might be a remote server. It might include the requirement for a network connection with minimum bandwidth or latency properties. The physical resource maps onto the “compute, storage and network” requirements that must be satisfied in order to complete any given activity.

In some respects, we might be able to abstract completely away from the physical. If I am happy running a “disposable” application where I don’t need to save any files for use later, I can fire up a server, run some code, kill the server.

But if I want to save the files for use an arbitrary amount of time later, I need some persistent physical storage somewhere where I can put those files, and from where I can retrieve them when I need them. Persistence of files is one of the big issues we face when trying to think of how best to support our distance education students. Storage can be problematic.

How individuals connect to resources is another issue. This is the network component. If a student has a low powered computer (poor compute resource) we may need to offer them access to a more powerful remote service. But that requires a network connection. Depending on where files are stored, there are two network considerations we need to make: how does a student access files in order to edit them, and how do files get to the compute resource so they can be processed?

The Logical Component

The logical component (logical layer; logical environment) might also be referred to as the computational environment. This includes operating system dependencies (for example, the requirement for a particular operating system), application or operating system dependencies (for example, we might require a particular application such as Scratch to be available, or a particular operating system package dependency that is required by a programming language package), programming language dependencies (for example, in a Python environment we might require a particular version of pandas to be installed, or a particular version of Java).

The Cultural Component

The cultural component (cultural layer; cultural environment) incorporates elements of the user environment and workflow. At one extreme, the adoption of a particular programming editor is an example of a cultural component (the choice of editor may actually be irrelevant as far as the teaching goes, except insofar as a student needs access to a code editor, not any particular code editor). The workflow element is more complex, covering workflows in both abstract terms (eg using a test driven approach, or using a code differencing and check-in management process) as well as practical terms (for example, using git and Github, or a particular testing framework).

For example, you could imagine a software design project activity in a computing course that instructs students to use a test driven approach and code versioning, but not specify the test framework, version control environment, or even programming language / computational environment.

This cultural element is one that we often ignore when it comes to HE, expecting students to just “pick up” tools and workflows, and one whose deficit makes graduates less than useful when it comes to actually doing some work when they do graduate. It’s also one that is hard to change in an organisation, and one that is hard to change at a personal level.

If you’ve tried getting a new technology into a course created by a course team, and / or into your organisation, you’ll know that one of the biggest blockers is the current culture. Adopting a new technology is really hard because if it really is new, it will lead to, may even require, new workflows — new cultures — for many, indeed any, of the benefits to reveal themselves.

Platform Independent Software Distribution – Physical Layer Agnosticism

Reflecting on the various ways in which we provide computing environments for distance education students on computing courses, one of my motivations is to package computational support for our educational materials in a way that is agnostic to the physical component. Ideally, we should be able to define a single logical environment that can be used across a wide range of physical environments.

Virtualisation has a role to play here: if we package software in virtualised environments, we have great flexibility when it comes to where the virtual machine physically runs. It could be on the student’s own computer, it could be on an OU server, it could be on a “bring your own server” basis.

Virtualisation essentially allows us to abstract away from much of the physical layer considerations because we can always look to provide alternative physical environments on which to run the same logical environment.

However, in looking for alternatives, we need to be mindful that the (compute, storage, network) triple provides a set of multi-objective constraints that need to be satisfied, and that may lead to certain trade-offs between them being required.

This is particularly true when we think of extrema, such as large data files (large amount of storage and/or large amount of bandwidth/network connectivity) and/or processes that require large amounts of computation (these may be associated with large amounts of data, or they may not; an example of the latter might be running a set of complex equations over multiple iterations, for example).

My preference is also that we should be distributing software environments and services that also allow students to explore, and even bring to bear, their own cultural components (for example, their favourite editor). I’ll have more to say about that in a future post…

Related: Fragment – Programming Privilege. See also This is What I Keep Trying to Say… where a few very short lines of configuration code let me combine / assemble pre-existing packages in new and powerful ways, without really having to understand anything about how the pieces themselves actually work.