Category: Tinkering

Fragment – Diagramming the Structure of a Python dict Using BlockDiag, & Some Quick Reflections on Computing Education

As a throwaway diagram in a piece of teaching material I wanted to visualise the structure of a Python dict. One tool I use for generating simple block diagrams is BlockDiag, which uses a simple box-and-arrow notation.
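
A from-memory sketch of that notation (so treat the exact syntax as approximate) looks something like this:

blockdiag {
  "a" -> "b";
  "a" -> "c";
  "c" -> "d";
}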

So how can we get the key structure from a nested Python dict in that form? There may well be a Python method somewhere for grabbing this information, but it was just a quick coffee break puzzle to write a thing to grab that data and represent it as required:

def getChildren(d,t=None,root=None):
    ''' Walk a nested dict, collecting parent -> child edges in blockdiag notation. '''
    if t is None: t=[]
    if root is None:
        # Top level: nothing to link from yet, just recurse into any dict values
        for k in d:
            if isinstance(d[k],dict):
                t=getChildren(d[k],t,k)
    else:
        for k in d:
            # Link the parent key to this key...
            t.append('"{root}" -> "{k}";'.format(root=root,k=k))
            # ...and recurse into nested dicts
            if isinstance(d[k],dict):
                t=getChildren(d[k],t,k)
    return t

r={'a':{'b':1, 'c':{'d':2,'e':3}}}
print('{{ {graph} }}'.format(graph='\n'.join(getChildren(r)))) 

#------
{ "a" -> "b";
"a" -> "c";
"c" -> "d";
"c" -> "e"; }

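To render the result, one approach (assuming the blockdiag command-line tool is installed; the filename is just illustrative) is to write the generated description to a file and point blockdiag at it:

# Write the generated diagram description to a file...
with open('dict_structure.diag', 'w') as f:
    f.write('{{ {graph} }}'.format(graph='\n'.join(getChildren(r))))
# ...then render it from the command line with: blockdiag dict_structure.diag
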
There are probably better ways of doing it, but that’s not necessarily the point. Which is a point I realised chatting to a colleague earlier today: I’m not that interested in the teaching of formal computing approaches as a way of training enterprise developers. Nor am I interested in the teaching of computing through contrived toy examples. What I’m far more interested in is helping students do without us; and students, at that, who have end-user computing needs that they want to be able to satisfy in whatever domain they end up in.

Which is to say, not (in the first instance) enterprise level, production quality code. It’s code to get stuff done. End-user application development code. Personal, disposable/throwaway, ad hoc productivity tool development. Scruffy code that lets you use bits of string and gaffer tape and chunks of other people’s code to solve a particular problem.

But that’s not to say the code has to stay ropey… The first attempt at the above code lacked the guards that check whether a value is a dict, so it failed whenever a literal value was encountered. There may well be other things that are broken about it, but I can fix those as they crop up (because I know the sort of output I expect to see*, and if I don’t get it, I can try to fix it). I also had to go and look up how to include literal curly brackets in a Python formatted string (double them up) for the print statement. But that’s okay. That’s just syntax… Knowing that I should be able to print the literal brackets was the important thing… And that’s all part of what I think a lot of our curriculum lacks – enthusing folk, making them curious, getting them to have expectations about what is and isn’t and should be possible**, and then being able to act on that.

* informal test driven end user software application development…?!;-)
** with some personal ethics about what may be possible but shouldn’t be pursued and should be lobbied against…

Using Jupyter Notebooks For Assessment – Export as Word (.docx) Extension

One of the things we still haven’t properly worked out in our Data management and analysis (TM351) course is how best to handle Jupyter notebook based assignments. The assignments are set using a notebook that describes the tasks to be completed, and the student completes their work in that same notebook. We then need some mechanism for:

  • students to submit the assessment electronically;
  • markers to mark each assessment: if the document contains a lot of OU text, it can be hard for the marker to locate the student text;
  • markers to provide on-script feedback, which means they need to be able to edit the document and make changes/annotations;
  • markers to return scripts to students;
  • students to read the feedback – so they need to be able to locate and distinguish the marker feedback within the document.

One Frankenstein process we tried was for students to save a Jupyter notebook file as a Markdown or HTML document and then convert it to a Microsoft Word document using pandoc.

This document could then be submitted and marked in a traditional way, with markers using comments and track changes to annotate the student script. Unfortunately, our original 32 bit VM meant we had to use an old version of pandoc, with the result that tabular data was not handled at all well in the conversion-to-Word process.

Updating to a 64 bit virtual machine means we can update pandoc, and the Word document conversion is now much smoother. However, the conversion process still requires students to export the notebook as HTML and then use pandoc to convert the HTML to the Microsoft Word .docx format. (The Jupyter nbconvert utility does not currently export to Word.)

So to make things a little easier, here’s my first attempt at a Download Jupyter Notebook as Word (.docx) extension to do just that. It makes use of the Jupyter notebook custom bundler extensions API, which allows you to add additional options to the notebook File -> Download menu. The code I used was also cribbed from the dashboards_bundlers package, which converts a notebook to a dashboard and then downloads it.

One thing it doesn’t handle at the moment is embedded interactive maps. I’ve previously come up with a workaround for generating static images of interactive maps created using the folium package, using selenium to render the map and grab a screenshot of it; I’m not sure if that would work in our headless VM, though? (One to try, I guess?) There’s also a related thread in the folium repo issue tracker.
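
The extension itself is a single module; the gist of it – a simplified sketch here, assuming nbconvert and pandoc are both on the path, with the module path matching the enable command used below and everything else illustrative – is something like the following wordexport/wordexport.py:

# wordexport/wordexport.py - sketch of a "download as .docx" bundler
import os
import shutil
import subprocess
import tempfile

#THIS IS A REQUIRED FUNCTION
def _jupyter_bundlerextension_paths():
    '''API for notebook bundler installation on notebook 5.0+'''
    return [{
                'name': 'wordexport_download',
                'label': 'MS Word (.docx)',
                'module_name': 'wordexport.wordexport',
                'group': 'download'
            }]

#THIS IS A REQUIRED FUNCTION
def bundle(handler, model):
    '''Convert the notebook to HTML with nbconvert, then to .docx with pandoc'''
    abs_nb_path = os.path.join(
        handler.settings['contents_manager'].root_dir, model['path'])
    notebook_name = os.path.splitext(os.path.basename(abs_nb_path))[0]

    tmp_dir = tempfile.mkdtemp()
    html_path = os.path.join(tmp_dir, notebook_name + '.html')
    docx_path = os.path.join(tmp_dir, notebook_name + '.docx')

    # Notebook -> HTML -> .docx
    subprocess.call(['jupyter', 'nbconvert', '--to', 'html',
                     abs_nb_path, '--output-dir', tmp_dir])
    subprocess.call(['pandoc', html_path, '-o', docx_path])

    handler.set_header('Content-Disposition',
                       'attachment; filename="%s.docx"' % notebook_name)
    handler.set_header('Content-Type', 'application/vnd.openxmlformats-officedocument.wordprocessingml.document')

    with open(docx_path, 'rb') as bundle_file:
        handler.write(bundle_file.read())
    handler.finish()

    # We read and send synchronously, so we can clean up safely after finish
    shutil.rmtree(tmp_dir, True)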

The above script is placed in a wordexport folder inside a package folder containing a simple setup.py script:

from setuptools import setup

setup(name='wordexport',
      version='0.0.1',
      description='Export Jupyter notebook as .docx file',
      author='Tony Hirst',
      author_email='tony.hirst@open.ac.uk',
      license='MIT',
      packages=['wordexport'],
      zip_safe=False)

The package can be installed and the extension enabled using a riff along the lines of the following command-line commands:

echo "...wordexport install..."
#Install the wordexport (.docx exporter) extension package
pip3 install --upgrade --force-reinstall ${THISDIR}/jupyter_custom_files/nbextensions/wordexport

#Enable the wordexport extension
jupyter bundlerextension enable --py wordexport.wordexport  --sys-prefix
echo "...wordexport done"

Restart the Jupyter server after enabling the extension, and the result should be a new MS Word (.docx) option in the notebook File -> Download menu option.

Querying Large CSV Files With Apache Drill

Via a post on the rud.is blog – Drilling Into CSVs — Teaser Trailer – I came across a handy looking Apache tool: Apache Drill. A Java powered service, Apache Drill allows you to query large CSV and JSON files (as well as a range of other data backends) using SQL, without any particular manipulation of the target data files. (The notes also suggest you can query directly over a set of CSV files (with the same columns?) in a directory, though I haven’t tried that yet…)

To give it a go, I downloaded Evan Odell’s Hansard dataset, which comes in as a CSV file at just over 3GB.

Having installed Apache Drill and run it from the command line – ./apache-drill-1.10.0/bin/drill-embedded – it was easy enough to start running queries from the off (Querying Plain Text Files):

SELECT * FROM dfs.`/Users/ajh59/Documents/parlidata/senti_post_v2.csv` LIMIT 3;

By default, queries from a CSV file ignore headers and treat all rows equally. Queries over particular columns can be executed by referring to numbered columns in the form COLUMNS[0], COLUMNS[1], etc. (Querying Plain Text Files). However, Bob Rudis’ blog hinted there was a way to configure the server to use the first row of a CSV file as a header row. In particular, the Data Source Plugin Configuration Basics docs page describes how the CSV data source configuration can be amended with the clauses "skipFirstLine": true, "extractHeader": true to allow CSV files to be queried with the header row intact.

The configuration file for the live server can be amended via a web page published by the running Apache Drill service, by default on localhost port 8047:
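
For reference, the relevant bit of the dfs storage plugin configuration ends up looking something like this (an approximate excerpt from memory – check the Drill docs for the full plugin definition):

"formats": {
    "csv": {
      "type": "text",
      "extensions": ["csv"],
      "delimiter": ",",
      "skipFirstLine": true,
      "extractHeader": true
    }
  }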

Updating the configuration means we can start to run named column queries:
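
For example, using one of the column names from the Hansard dataset (gender):

SELECT gender, count(*) FROM dfs.`/Users/ajh59/Documents/parlidata/senti_post_v2.csv` GROUP BY gender;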

The config files are actually saved to a temporary location – /tmp/drill. If the (updated) files are copied to a persistent location – mv /tmp/drill /my/configpath – and the drill-override.conf file is updated with the setting drill.exec: {sys.store.provider.local.path="/my/configpath"}, the (updated) configuration files will in future be loaded from that location, rather than temporary default config files being created for each new server session (docs: Storage Plugin Registration).

Bob Rudis’ post also suggested that more efficient queries could be run by converting the CSV data file to a parquet data format, and then querying over that:

CREATE TABLE dfs.tmp.`/senti_post_v2.parquet` AS SELECT * FROM dfs.`/Users/ajh59/Documents/parlidata/senti_post_v2.csv`;

This creates a new parquet data folder /tmp/senti_post_v2.parquet. This can then be queried as for the CSV file:

SELECT gender, count(*) FROM dfs.tmp.`/senti_post_v2.parquet` GROUP BY gender;

…but with a significant speed up, on some queries at least:

To quit, !exit.

And finally, to make using the Apache Drill service easier to use from code, wrapper libraries are available for R – sergeant R package – and Python – pydrill package.
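
From Python, a minimal pydrill session looks something along the lines of the following (from memory, so check the pydrill docs for the current API):

from pydrill.client import PyDrill

# Connect to the embedded Drill service (REST API on port 8047 by default)
drill = PyDrill(host='localhost', port=8047)
if not drill.is_active():
    raise RuntimeError('Is the Drill service running?')

result = drill.query("SELECT gender, count(*) FROM dfs.tmp.`/senti_post_v2.parquet` GROUP BY gender")
for row in result:
    print(row)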

First Attempt at Running the TM351 VM as an AMI on Amazon Web Services

One of the things that’s been on my to do list for ages is trying to get a version of the TM351 virtual machine (VM) up and running on Amazon Web Services (AWS) as an Amazon Machine Instance (AMI). This would allow students who are having trouble running the VM on their own computer to access the services running in the cloud.

(Obviously, it would be preferable if we could offer such a service via OU operated servers, but I can’t do politics well enough, and don’t have the mentality to attend enough of the necessary say-the-same-thing-again-again meetings, to make that sort of thing happen.)

So… a first attempt is up on the eu-west-1 region in all its insecure glory: TM351 AMI v1. The security model is by obscurity as much as anything – there’s no model for setting separate passwords for separate students, for example, or checking back against an OU auth layer. And I suspect everything runs as root…

(One of the things we have noticed in (brief) testing is that the Getting Started instructions don’t work inside the OU, at least if you try to limit access to your (supposed) IP address. Reminds me of when we gave up trying to build the OU VM from machines on the OU network, because solving proxy and blocked port issues was an irrelevant problem to have to worry about when working from the outside…)

Open Refine doesn’t seem to want to run with the other services in the free tier micro (1GB) machine instance, but at 2GB everything seems okay. (I don’t know if possible race conditions in starting services mean that Open Refine could start and then block the Jupyter service’s request for resources. I need to do an Apollo 13 style startup sequence exploration to see if all the services can run in 1GB, I guess!) One thing I’ve added to the to do list is to split things out into separate AMIs that will work on the 1GB free tier machines. I also want to check that I can provision the AMI from Vagrant, so students could then launch a local VM or an Amazon instance that way, just by changing the vagrant provider. (Shared folders/volumes might get a bit messed up in that case, though?)

If services can run one at a time in the 1GB machines, it’d be nice to provide a simple dashboard to start and stop the services to make that easier to manage. Something that looks a bit like this, for example, exposed via an authenticated web page:

This needn’t be too complex – I had in mind a simple Python web app that could run under nginx (which currently provides a simple authentication layer for Open Refine to sit behind) and then just runs simple systemctl start, stop and restart commands on the appropriate service.

#fragment...
import os
os.system('systemctl restart jupyter.service')
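
Fleshing that fragment out a little, a minimal sketch of the sort of thing I have in mind (assuming something like Flask for the web app; the service unit names and routes are purely illustrative) might be:

# sketch only - unit names and routes are illustrative
from flask import Flask, redirect
import subprocess

app = Flask(__name__)

SERVICES = ['jupyter.service', 'openrefine.service']

def service_state(name):
    '''Report whether a systemd service is currently active.'''
    p = subprocess.run(['systemctl', 'is-active', name],
                       stdout=subprocess.PIPE, universal_newlines=True)
    return p.stdout.strip()

@app.route('/')
def index():
    '''Simple status page listing each service and its state.'''
    return '<br/>'.join('{}: {}'.format(s, service_state(s)) for s in SERVICES)

@app.route('/<name>/<action>')
def control(name, action):
    '''Start / stop / restart a known service, then bounce back to the status page.'''
    if name in SERVICES and action in ('start', 'stop', 'restart'):
        subprocess.call(['systemctl', action, name])
    return redirect('/')

if __name__ == '__main__':
    app.run(port=8080)

nginx could then proxy this app behind the same simple authentication layer that Open Refine currently sits behind.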

I’m not sure how the status display should be updated (based on whether a service is running or not), or on what heartbeat it should refresh. There may be better ways, of course, in which case please let me know via the comments :-)

I did have a quick look round for examples, but the dashboards/monitoring tools that do exist, such as pydash, are far more elaborate than what I had in mind. (If you know of a simple example to do the above, or can knock one up for me, please let me know via the comments. And the simpler the better ;-)

If we are to start exploring the use of browser accessed applications running inside user-managed VMs, this sort of simple application could be really handy… (Another approach would be to use a VM running docker, and then have a container manager running, such as portainer.)

Creating a Jupyter Bundler Extension to Download Zipped Notebook and HTML Files

In the first version of the TM351 VM, we had a simple toolbar extension that would download a zipped ipynb file, along with an HTML version of the notebook, so it could be uploaded and previewed in the OU Open Design Studio. (Yes, I know, it would have been much better to have an nbviewer handler as an ODS plugin, but then we don’t do that sort of tech innovation, apparently…)

Looking at updating the extension today for the latest version of Jupyter notebooks, I noticed the availability of custom bundler extensions that allow you to add additional tools to support notebook downloads and deployment (I’m not sure what deployment relates to?). Adding a new download bundler adds a corresponding option to the notebook File -> Download menu:

The extension is created as a python package:

# odszip/setup.py
from setuptools import setup

setup(name='odszip',
      version='0.0.1',
      description='Save Jupyter notebook and HTML in zip file with .nbk suffix',
      author='',
      author_email='',
      license='MIT',
      packages=['odszip'],
      zip_safe=False)
#odszip/odszip/download.py

# Copyright (c) The Open University, 2017
# Copyright (c) Jupyter Development Team.

# Distributed under the terms of the Modified BSD License.
# Based on: https://github.com/jupyter-incubator/dashboards_bundlers/

import os
import shutil
import tempfile

#THIS IS A REQUIRED FUNCTION
def _jupyter_bundlerextension_paths():
    '''API for notebook bundler installation on notebook 5.0+'''
    return [{
                'name': 'odszip_download',
                'label': 'ODSzip (.nbk)',
                'module_name': 'odszip.download',
                'group': 'download'
            }]


def make_download_bundle(abs_nb_path, staging_dir, tools):
	'''
	Assembles the notebook and resources it needs, returning the path to a
	zip file bundling the notebook and its requirements if there are any,
	the notebook's path otherwise.
	:param abs_nb_path: The path to the notebook
	:param staging_dir: Temporary work directory, created and removed by the caller
	'''
    
	# Clean up bundle dir if it exists
	shutil.rmtree(staging_dir, True)
	os.makedirs(staging_dir)
	
	# Get name of notebook from filename
	notebook_basename = os.path.basename(abs_nb_path)
	notebook_name = os.path.splitext(notebook_basename)[0]
	
	# Add the notebook
	shutil.copy2(abs_nb_path, os.path.join(staging_dir, notebook_basename))
	
	# Include HTML version of file
	cmd='jupyter nbconvert --to html "{abs_nb_path}" --output-dir "{staging_dir}"'.format(abs_nb_path=abs_nb_path,staging_dir=staging_dir)
	os.system(cmd)

	zip_file = shutil.make_archive(staging_dir, format='zip', root_dir=staging_dir, base_dir='.')
	return zip_file

#THIS IS A REQUIRED FUNCTION       
def bundle(handler, model):
	'''
	Downloads a notebook as an HTML file and zips it with the notebook
	'''
	
	# Based on https://github.com/jupyter-incubator/dashboards_bundlers
	
	abs_nb_path = os.path.join(
		handler.settings['contents_manager'].root_dir,
		model['path']
	)
		
	notebook_basename = os.path.basename(abs_nb_path)
	notebook_name = os.path.splitext(notebook_basename)[0]
	
	tmp_dir = tempfile.mkdtemp()

	output_dir = os.path.join(tmp_dir, notebook_name)
	bundle_path = make_download_bundle(abs_nb_path, output_dir, handler.tools)
		
	handler.set_header('Content-Disposition', 'attachment; filename="%s"' % (notebook_name + '.nbk'))
	
	handler.set_header('Content-Type', 'application/zip')
	
	with open(bundle_path, 'rb') as bundle_file:
		handler.write(bundle_file.read())

	handler.finish()


	# We read and send synchronously, so we can clean up safely after finish
	shutil.rmtree(tmp_dir, True)

We can then create the Python package and install the extension, remembering to restart the Jupyter server for the extension to take effect.

#Install the ODSzip extension package
pip3 install --upgrade --force-reinstall ./odszip

#Enable the ODSzip extension
jupyter bundlerextension enable --py odszip.download  --sys-prefix

Getting Web Services Up and Running on Microsoft Azure Using Vagrant and the Azure CLI

As well as Getting Web Services Up and Running on Amazon Web Services (AWS) Using Vagrant and the AWS CLI, we can also use Vagrant to provision machines on other web hosts, such as the Microsoft Azure cloud platform. In this post, I’ll describe a command line based recipe for doing just that.

To start with, you’ll need to get a Microsoft Azure account.

When you’ve done that, install the Azure command line interface (CLI). On a Mac:

curl -L https://aka.ms/InstallAzureCli | bash

For me, this installed to ~/bin/az.

With the client installed, login:

~/bin/az login

This requires a token based handshake with a Microsoft authentication website.

List the range of machine images available (if you haven’t set the path to az, use the full ~/bin/az):

az vm image list

There was only one that looked suitable to me for my purposes: Canonical:UbuntuServer:16.04-LTS:latest.

To run the provisioner, we need a Subscription ID; this will be used to set the vagrant .subscription_id parameter. These are listed on the Azure Portal.

We also need to create an Active Directory Service Principal:

az ad sp create-for-rbac

This information will be used to configure the Vagrantfile: the appId sets the vagrant .client_id, the password the .client_secret, and the tenant the .tenant_id.

You can also inspect the application in the App Registrations area of the Azure Portal.

Now let’s set up Vagrant. We’ll use the vagrant-azure plugin:

vagrant plugin install vagrant-azure --plugin-version '2.0.0.pre6'

We need to add a dummy box:

vagrant box add azure https://github.com/azure/vagrant-azure/raw/v2.0/dummy.box

Now let’s set up the Vagrantfile:

config.vm.provider :azure do |azure, override|
    #The path to your ssh keys
    override.ssh.private_key_path = '~/.ssh/id_rsa'

    #The default box we added
    override.vm.box = 'azure'
    
    #Set a territory
    azure.location="uksouth"

    #Provide your own group and VM name
    azure.resource_group_name="tm351azuretest"
    azure.vm_name="tm351azurevmtest"

    # Set an appropriate image (the UbuntuServer is actually the current default value)
    azure.vm_image_urn="Canonical:UbuntuServer:16.04-LTS:latest"

    #Use a valid subscription ID
    #https://portal.azure.com/#blade/HubsExtension/MyAccessBlade/resourceId/
    azure.subscription_id = ENV['AZURE_SUBSCRIPTION_ID']

    # Using details from the Active Directory Service Principal setup
    azure.tenant_id = ENV['AZURE_TENANT_ID']
    azure.client_id = ENV['AZURE_CLIENT_ID']
    azure.client_secret = ENV['AZURE_CLIENT_SECRET']

end
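
The subscription, tenant, client id and client secret are picked up from environment variables in the configuration above, so export them first (placeholder values shown; use the appId, password and tenant returned by the service principal setup, along with your subscription ID):

export AZURE_SUBSCRIPTION_ID=YOUR_SUBSCRIPTION_ID
export AZURE_TENANT_ID=YOUR_TENANT_ID
export AZURE_CLIENT_ID=YOUR_APP_ID
export AZURE_CLIENT_SECRET=YOUR_PASSWORD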

With the Vagrantfile parameters in place, we should then be able to call the Azure provider using the command:

vagrant up --provider=azure

But we’re still not quite done… If you’re running services on the VM, populated from elsewhere in the Vagrantfile, you’ll need to add some security rules to make the ports accessible. I’m running services on ports 80, 35180 and 35181, for example:

az vm open-port -g tm351azuretest -n tm351azuretest --port 80 --priority 130
az vm open-port -g tm351azuretest -n tm351azuretest --port 35180 --priority 140
az vm open-port -g tm351azuretest -n tm351azuretest --port 35181 --priority 150

Now we can lookup the IP address of the server:

az vm list-ip-addresses

and see if our applications are there :-)

Getting Web Services Up and Running on Amazon Web Services (AWS) Using Vagrant and the AWS CLI

From past experience of trying to get things up and running with AWS (Amazon Web Services), it can be a bit of a faff trying to work out what to set where the first time. So here’s an example of how to get a browser based application up and running on EC2 using vagrant from the command line.

(If you want to work through sorting the settings out via the AWS online management console, try Oliver Veits’ tutorial AWS Automation based on Vagrant — Part 2: Installation and Usage of the Vagrant AWS Plugin; you might also need to refer to Part 1: Getting started with AWS.)

This post in part assumes you know how to provision your own virtual machine locally using Vagrant. Here are the steps you need to take to be able to run an AWS provisioner (on a Mac or Linux machine… not sure about Windows?).

First up – sign up for AWS (get credit via the Github Education Pack)…

Pick up some credentials via the AWS root account Security Credentials page (Access Keys – that is, an Access Key ID and Secret Access Key):

Ensure that the key is active (Make Active).

There’s quite a bit of set up to do to configure the provisioner script. This can be done on the command line using the Amazon Command Line Interface (AWS CLI):

pip install --upgrade --user awscli

Now you need to configure the AWS CLI:

aws configure

Use the security credentials you picked up to configure the client*.

When we launch the AWS machine, vagrant needs to be able to access it via ssh using the public IP address automatically assigned to the machine. In deployment too, if we’re building specific services we want to be able to access over the web, we need to open up access to the ports those services are listening on.

By default, the machine will be locked down, so we need to open up specific ports by setting security rules. These are assigned on the basis of a security group, so let’s create one of those (mine is named after the course VM I’m building…):

aws ec2 create-security-group --group-name tm351cloud --description "Security group for tm351 services"

We’re going to use this group in the .security_groups parameter in the Vagrantfile.

Now we need to create the security group rules. In my case, I want to open up ssh (port 22) to allow incoming traffic from my IP address, and ports 80, 35180 and 35181 to allow http traffic from anywhere. (The /32 suffix on the ssh rule limits it to the single specified address; the 0.0.0.0/0 CIDR block in the other rules allows traffic from any address.)

MYIP=$(curl http://checkip.amazonaws.com/)
aws ec2 authorize-security-group-ingress --group-name tm351cloud --protocol tcp --port 22 --cidr ${MYIP}/32
aws ec2 authorize-security-group-ingress --group-name tm351cloud --protocol tcp --port 80 --cidr 0.0.0.0/0
aws ec2 authorize-security-group-ingress --group-name tm351cloud --protocol tcp --port 35180 --cidr 0.0.0.0/0
aws ec2 authorize-security-group-ingress --group-name tm351cloud --protocol tcp --port 35181 --cidr 0.0.0.0/0

# Check the policies
aws ec2 describe-security-groups --group-names tm351cloud

Having opened up at least the ssh port 22, we need to set up some SSH keys with a particular name (vagrantaws) that we will use with the vagrant .keypair_name parameter, and save them to a local file (vagrantaws.pem) with the appropriate permissions.

aws ec2 create-key-pair --key-name vagrantaws --query 'KeyMaterial' --output text > vagrantaws.pem
chmod 400 vagrantaws.pem

The vagrant provisioner also requires specific access tokens (.access_key_id, .secret_access_key, .session_token) to access EC2. Create these tokens, entering your own duration (in seconds):

aws sts get-session-token --duration-seconds 129600
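
The response includes the three values the Vagrantfile needs – AccessKeyId (.access_key_id), SecretAccessKey (.secret_access_key) and SessionToken (.session_token) – in a structure along these lines (values elided):

{
    "Credentials": {
        "AccessKeyId": "...",
        "SecretAccessKey": "...",
        "SessionToken": "...",
        "Expiration": "..."
    }
}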

Now we can start to look at the Vagrant set up. Install the vagrant AWS provisioner:

vagrant plugin install vagrant-aws

After setting up the Vagrantfile, you will be able to provision your machine on AWS using:

vagrant up --provider=aws

Add a dummy box:

vagrant box add awsdummy https://github.com/mitchellh/vagrant-aws/raw/master/dummy.box

Now let’s look at the Vagrantfile:

#Set up the provider block
config.vm.provider :aws do |aws, override|

    #Use the ec2 security group set previously
    aws.security_groups = ["tm351cloud"]

    #Whatever name we want
    override.vm.hostname = "tm351aws"

    #The name of the dummy box we added
    override.vm.box = "awsdummy"
    
    #Set up machine access using our keypair name and ssh key path
    override.ssh.username = "ubuntu"
    aws.keypair_name="vagrantaws"
    override.ssh.private_key_path = "vagrantaws.pem"

    #Use the values generated by the session token generator
    aws.access_key_id = "YOUR_KEY_ID"
    aws.secret_access_key = "YOUR_SECRET_ACCESS_KEY"
    aws.session_token = "YOUR_SESSION_TOKEN"

    #Specify a region and valid ami for that region, along with the desired instance size
    aws.region = "eu-west-1"
    aws.ami = "ami-971238f1" 
    aws.instance_type="t2.small"

  end

Running vagrant up --provider=aws should run the Vagrant provisioner with the AWS provider. Running vagrant destroy will tear down the machine (so you don’t keep paying for it… I think the users, security groups and keypairs are free?)

To check on the IP address of your instance, run:

aws ec2 describe-instances

or check on the AWS EC2 console. You can also check from there that the machine has been torn down correctly when you have finished with it.

(I need to check what happens if you vagrant suspend and then vagrant resume. Presumably, the state is preserved, but you are billed for storage, if not running time?)


*Alternatively, we could create a specific user with more limited credentials.

Create a user we can use to help set up the credentials to use with the vagrant provisioner:

aws iam create-user --user-name vagrant

Now we need to give that user permissions to build our EC2 instance, by attaching an appropriate security policy (AmazonEC2FullAccess). In other words, the Vagrantfile will make use of the AWS vagrant user to provision the machine, so we need to give that AWS user the appropriate permissions on AWS:

aws iam attach-user-policy --user-name vagrant --policy-arn arn:aws:iam::aws:policy/AmazonEC2FullAccess

Generate some keys:

aws iam create-access-key --user-name vagrant

Then run aws configure with the new keys.