Python “Natural Time Periods” Package

Getting on for a year ago, I posted a recipe for generating “Natural Language” Time Periods in Python. At the time, a couple of people asked whether it was packaged – and it wasn’t…

It’s still not on pip, but I have since made a package repo for it: python-natural-time-periods.

Install using: pip3 install --force-reinstall --no-deps --upgrade git+https://github.com/psychemedia/python-natural-time-periods.git

The following period functions are defined:

  • today(), yesterday(), tomorrow()
  • last_week(), this_week(), next_week(), later_this_week(), earlier_this_week()
  • last_month(), next_month(), this_month(), earlier_this_month(), later_this_month()
  • day_lastweek(), day_thisweek(), day_nextweek()

Here’s an example of how to use it:

import natural_time_periods as ntpd

ntpd.today()
>>> datetime.date(2017, 8, 9)

ntpd.last_week()
>>> (datetime.date(2017, 7, 31), datetime.date(2017, 8, 6))

ntpd.later_this_month()
>>> (datetime.date(2017, 8, 10), datetime.date(2017, 8, 31))

ntpd.day_lastweek(ntpd.MON)
>>> datetime.date(2017, 7, 31)

ntpd.day_lastweek(ntpd.MON, iso=True)
>>> '2017-07-31'

NHS Digital Organisation Data Service (ODS) Python / Pandas Data Loader

One of the nice things about the Python pandas data analysis package is that there is a small – but growing – amount of support for downloading data from third party sources directly as a pandas dataframe using the pandas-datareader.

So I thought I’d have a go at producing a package inspired by the pandas wrapper for the World Bank indicators API for downloading administrative data from the NHS Digital Organisation Data Service. You can find the package here: python-pandas-datareader-NHSDigital.
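
As a point of reference, the World Bank wrapper that inspired it is called along the following lines. Note that this shows the existing pandas-datareader World Bank API rather than the NHS Digital package’s own functions, and the indicator code is just an example:

from pandas_datareader import wb

# Download a World Bank indicator (total population) as a pandas dataframe
df = wb.download(indicator='SP.POP.TOTL', country=['GB'], start=2014, end=2016)
df.head()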

There are also examples of how to use the package here: python-pandas-datareader-NHSDigital demo notebook.

When Identifiers Don’t Align – NHS Digital GP Practice Codes and CQC Location IDs

One of the nice things about NHS Digital datasets is that there is a consistent use of identifier codes across multiple datasets. For example, GP Practice Codes are used to index particular GP practices across multiple datasets listed on both the GP and GP practice related data and General Practice Data Hub directory pages.

Information about GPs is also recorded by the CQC, who publish quality ratings across a wide range of health and social care providers. One of the nice things about the CQC data is that it also contains information about corporate groupings (and Companies House company numbers) and “Brands” with which a particular location is associated, which means you can start to explore the make up of the larger commercial providers.

Unfortunately, the identifier scheme used by the CQC is not the same as the one used by NHS Digital. This wouldn’t present much of a hurdle if a lookup table were available that mapped the codes for GP practices rated by the CQC against the NHS Digital codes, but such a lookup table doesn’t appear to exist – or at least, is not easily discoverable.

So if we do want to join the CQC and NHS Digital datasets, what are we to do?

One approach is to look for common cribs across both datasets to bring them into partial alignment, and then do exact matching within the nearly aligned sets. For example, both datasets include postcode data, so if we match on postcode, we can then try to find a higher level of agreement by exactly matching location names sharing the same postcode.

This gets us so far, but exact string matching is likely to return a high number of false negatives (i.e. unmatched items that should be matched). For example, it’s easy enough for us to assume that THE LINTHORPE SURGERY and LINTHORPE SURGERY are the same, but they aren’t exact matches. We could improve the likelihood of matching by removing common stopwords, as well as stopwords specific to this domain – THE, for example, or “CENTRE” – but partial or fuzzy matching techniques are likely to work better still, albeit with the risk of introducing false positive matches (that is, strings that are identified as matching at a particular confidence level but that we would probably rule out as a match, for example HIRSEL MEDICAL CENTRE and KINGS MEDICAL CENTRE).
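
By way of illustration, here’s a minimal sketch of the postcode-blocking-plus-fuzzy-matching idea, assuming two dataframes with hypothetical postcode and name columns; difflib is used here for simplicity, though a dedicated fuzzy matching library would do a similar job:

import difflib
import pandas as pd

def similarity(a, b):
    '''Crude string similarity score between 0 and 1.'''
    return difflib.SequenceMatcher(None, a.upper(), b.upper()).ratio()

def match_on_postcode(nhs_df, cqc_df, threshold=0.8):
    '''Block on postcode, then fuzzy match location names within each block.'''
    candidates = pd.merge(nhs_df, cqc_df, on='postcode', suffixes=('_nhs', '_cqc'))
    candidates['score'] = candidates.apply(
        lambda row: similarity(row['name_nhs'], row['name_cqc']), axis=1)
    return candidates[candidates['score'] >= threshold]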

Anyway, here’s a quick sketch of how we might start to go about reconciling the datasets – comments appreciated about how to improve it further either here or in the repo issues: CQC and NHS Code Reconciliation.ipynb



Fragment – Diagramming the Structure of a Python dict Using BlockDiag, & Some Quick Reflections on Computing Education

As a throwaway diagram in a piece of teaching material I wanted to visualise the structure of a Python dict. One tool I use for generating simple block diagrams is BlockDiag. This uses a simple notation for generating box and arrow diagrams:
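
For example, a diagram description along the following lines generates a simple box and arrow chart linking A to B and C, and C to D:

blockdiag {
  A -> B;
  A -> C;
  C -> D;
}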

So how can we get the key structure from a nested Python dict in that form? There may well be a Python method somewhere for grabbing this information, but it was just a quick coffee break puzzle to write a thing to grab that data and represent it as required:

def getChildren(d,t=None,root=None):
    if t is None: t=[]
    if root is None:
        for k in d:
            if isinstance(d[k],dict):
                t=getChildren(d[k],t,k)
    else:
        for k in d:
            t.append('"{root}" -> "{k}";'.format(root=root,k=k))
            if isinstance(d[k],dict):
                t=getChildren(d[k],t,k)
    return t

r={'a':{'b':1, 'c':{'d':2,'e':3}}}
print('{{ {graph} }}'.format(graph='\n'.join(getChildren(r)))) 

#------
{ "a" -> "b";
"a" -> "c";
"c" -> "d";
"c" -> "e"; }

There are probably better ways of doing it*, but that’s not necessarily the point. Which is a point I realised chatting to a colleague earlier today: I’m not that interested in the teaching of formal computing approaches as a way of training enterprise developers. Nor am I interested in the teaching of computing through contrived toy examples. What I’m far more interested in is helping students do without us; and students, at that, who have end-user computing needs that they want to be able to satisfy in whatever domain they end up in.

So, for example, here’s a version that also dedupes the edges and can accumulate them across several dicts:
def getChildren(d,t=None,root=None):
    if t is None: t=[]
    if root is None:
        for k in d:
            if isinstance(d[k],dict):
                t=getChildren(d[k],t,k)
    else:
        for k in d:
            s='"{root}" -> "{k}";'.format(root=root,k=k)
            if s not in t: t.append(s)
            if isinstance(d[k],dict):
                t=getChildren(d[k],t,k)
    return t
 
r={'a':{'b':1, 'c':{'d':2,'e':3}}}
l=[r,{'a':{'b':1, 'c':{'e':3}}}]
o=[]
for z in l:
    o=getChildren(z,o)
o

#['"a" -> "b";', '"a" -> "c";', '"c" -> "d";', '"c" -> "e";']

Which is to say, not (in the first instance) enterprise level, production quality code. It’s code to get stuff done. End-user application development code. Personal, disposable/throwaway, ad hoc productivity tool development. Scruffy code that lets you use bits of string and gaffer tape and chunks of other people’s code to solve a particular problem.

But that’s not to say the code has to stay ropey… In testing the first attempt at the above code, it lacked the guards that check whether a value is a dict, so it failed whenever a literal value was encountered. There may well be other things that are broken about it, but I can fix those as they crop up (because I know the sort of output I expect to see*, and if I don’t get it, I can try to fix it). I also had to go and look up how to include literal curly brackets in a Python format string (double them up) for the print statement. But that’s okay. That’s just syntax… Knowing that I should be able to print the literal brackets was the important thing… And that’s all part of what I think a lot of our curriculum lacks – enthusing folk, making them curious, getting them to have expectations about what is and isn’t and should be possible**, and then being able to act on that.

* informal test driven end user software application development…?!;-)
** with some personal ethics about what may be possible but shouldn’t be pursued and should be lobbied against…

Using Jupyter Notebooks For Assessment – Export as Word (.docx) Extension

One of the things we still haven’t properly worked out in our Data management and analysis (TM351) course is how best to handle Jupyter notebook based assignments. The assignments are set using a notebook that describes the tasks, which the student then completes in the same document. We then need some mechanism for:

  • students to submit the assessment electronically;
  • markers mark assessments for their students: if the document contains a lot of OU text, it can be hard for the marker to locate the student text;
  • markers may provide on-script feedback; this means the marker needs to be able to edit the document and make changes/annotations.
  • markers return scripts to students;
  • students read feedback – so they need to be able to locate and distinguish the marker feedback within the document.

One Frankenstein process we tried was for students to save a Jupyter notebook file as a Markdown or HTML document and then convert it to a Microsoft Word document using pandoc.

This document could then be submitted and marked in a traditional way, with markers using comments and track changes to annotate the student script. Unfortunately, our original 32 bit VM meant we had to use an old version of pandoc, with the result that tabular data was not handled at all well in the conversion-to-Word process.

Updating to a 64 bit virtual machine means we can update pandoc, and the Word document conversion is now much smoother. However, the conversion process still requires students to export the notebook as HTML and then use pandoc to convert the HTML to the Microsoft Word .docx format. (The Jupyter nbconvert utility does not currently export to Word.)

So to make things a little easier, here’s my first attempt at a Download Jupyter Notebook as Word (.docx) extension to do just that. It makes use of the Jupyter notebook custom bundler extensions API, which allows you to add additional options to the notebook File -> Download menu. The code I used was also cribbed from the dashboards_bundlers extension, which converts a notebook to a dashboard and then downloads it.


# Copyright (c) The Open University, 2017
# Copyright (c) Jupyter Development Team.
# Distributed under the terms of the Modified BSD License.

# Custom bundler extensions API:
# http://jupyter-notebook.readthedocs.io/en/latest/extending/bundler_extensions.html
# Requires: _jupyter_bundlerextension_paths(), bundle()
# Based on: https://github.com/jupyter-incubator/dashboards_bundlers/

import os
import subprocess
import shutil
import tempfile


def _jupyter_bundlerextension_paths():
    '''API for notebook bundler installation on notebook 5.0+'''
    return [{
        'name': 'wordexport_wordexport',
        'label': 'MS Word (.docx)',
        'module_name': 'wordexport.wordexport',
        'group': 'download'
    }]


def bundle(handler, model):
    '''
    Downloads a notebook as a Microsoft Word .docx document after converting to HTML
    '''
    # Based on https://github.com/jupyter-incubator/dashboards_bundlers
    abs_nb_path = os.path.join(
        handler.settings['contents_manager'].root_dir,
        model['path']
    )
    notebook_basename = os.path.basename(abs_nb_path)
    notebook_name = os.path.splitext(notebook_basename)[0]
    tmp_dir = tempfile.mkdtemp()

    # Generate an HTML version of the notebook in the temporary directory
    cmd = 'jupyter nbconvert --to html "{abs_nb_path}" --output-dir "{tmp_dir}"'.format(
        abs_nb_path=abs_nb_path, tmp_dir=tmp_dir)
    subprocess.check_call(cmd, shell=True)

    staged = os.path.join(tmp_dir, notebook_name)

    # Convert the HTML document to MS Word .docx using pandoc
    cmd = 'pandoc -s "{staged}.html" -o "{staged}.docx"'.format(staged=staged)
    subprocess.check_call(cmd, shell=True)

    handler.set_header('Content-Disposition',
                       'attachment; filename="%s"' % (notebook_name + '.docx'))
    handler.set_header('Content-Type',
                       'application/vnd.openxmlformats-officedocument.wordprocessingml.document')

    with open("{staged}.docx".format(staged=staged), 'rb') as bundle_file:
        handler.write(bundle_file.read())
    handler.finish()

    # We read and send synchronously, so we can clean up safely after finish
    shutil.rmtree(tmp_dir, True)


[There’s now a Github repo: innovationOUtside/nb_extension_wordexport]

One thing it doesn’t handle at the moment is things like embedded interactive maps. I’ve previously come up with a workaround for generating static images of interactive maps created using the folium package, using selenium to render the map and grab a screenshot of it; I’m not sure if that would work in our headless VM, though? (One to try, I guess?) There’s also a related thread in the folium repo issue tracker.
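
The workaround was something along these lines – a rough sketch, in which the co-ordinates and file paths are just examples, and a headless browser driver would be needed on a server with no display:

import folium
from selenium import webdriver

# Build a simple map and save it as a standalone HTML page
m = folium.Map(location=[52.04, -0.76], zoom_start=12)
m.save('/tmp/map.html')

# Render the page in a browser and grab a screenshot of it
browser = webdriver.Firefox()  # a headless driver would be needed in the VM
browser.get('file:///tmp/map.html')
browser.save_screenshot('/tmp/map.png')
browser.quit()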

The wordexport.py script above is placed in a wordexport folder inside a package folder containing a simple setup.py script:

from setuptools import setup

setup(name='wordexport',
      version='0.0.1',
      description='Export Jupyter notebook as .docx file',
      author='Tony Hirst',
      author_email='tony.hirst@open.ac.uk',
      license='MIT',
      packages=['wordexport'],
      zip_safe=False)

The package can be installed and the extension enabled using a riff along the lines of the following command-line commands:

echo "...wordexport install..."
#Install the wordexport (.docx exporter) extension package
pip3 install --upgrade --force-reinstall ${THISDIR}/jupyter_custom_files/nbextensions/wordexport

#Enable the wordexport extension
jupyter bundlerextension enable --py wordexport.wordexport  --sys-prefix
echo "...wordexport done"

Restart the Jupyter server after enabling the extension, and the result should be a new MS Word (.docx) option in the notebook File -> Download menu option.

Querying Large CSV Files With Apache Drill

Via a post on the rud.is blog – Drilling Into CSVs — Teaser Trailer – I came across a handy looking tool: Apache Drill. A Java powered service, Apache Drill allows you to query large CSV and JSON files (as well as a range of other data backends) using SQL, without any particular manipulation of the target data files. (The notes also suggest you can query directly over a set of CSV files (with the same columns?) in a directory, though I haven’t tried that yet…)

To give it a go, I downloaded Evan Odell’s Hansard dataset, which comes in as a CSV file at just over 3GB.

Installing Apache Drill, and running it from the command line – ./apache-drill-1.10.0/bin/drill-embedded – it was easy enough to start running queries from the off (Querying Plain Text Files):

SELECT * FROM dfs.`/Users/ajh59/Documents/parlidata/senti_post_v2.csv` LIMIT 3;

By default, queries from a CSV file ignore headers and treat all rows equally. Queries over particular columns can be executed by referring to numbered columns in the form COLUMNS[0], COLUMNS[1], etc. (Querying Plain Text Files). However, Bob Rudis’ blog hinted there was a way to configure the server to use the first row of a CSV file as a header row. In particular, the Data Source Plugin Configuration Basics docs page describes how the CSV data source configuration can be amended with the clauses "skipFirstLine": true, "extractHeader": true to allow CSV files to be queried with the header row intact.
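
The relevant fragment of the dfs storage plugin configuration then ends up looking something like the following (adapted from the docs page mentioned above, so treat the surrounding structure as indicative rather than definitive):

"csv": {
  "type": "text",
  "extensions": ["csv"],
  "delimiter": ",",
  "skipFirstLine": true,
  "extractHeader": true
}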

The configuration file for the live server can be amended via a web page published by the running Apache Drill service, by default on localhost port 8047:

Updating the configuration means we can start to run named column queries:
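
For example, the gender column (which I also use below) can now be referred to by name directly against the CSV file:

SELECT gender, count(*) FROM dfs.`/Users/ajh59/Documents/parlidata/senti_post_v2.csv` GROUP BY gender;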

The config files are actually saved to a temporary location – /tmp/drill. If the (updated) files are copied to a persistent location – mv /tmp/drill /my/configpath – and the drill-override.conf file updated with the setting drill.exec: {sys.store.provider.local.path="/my/configpath"}, the (updated) configuration files will in future be loaded from that location, rather than temporary default config files being created for each new server session (docs: Storage Plugin Registration).

Bob Rudis’ post also suggested that more efficient queries could be run by converting the CSV data file to a parquet data format, and then querying over that:

CREATE TABLE dfs.tmp.`/senti_post_v2.parquet` AS SELECT * FROM dfs.`/Users/ajh59/Documents/parlidata/senti_post_v2.csv`;

This creates a new parquet data folder /tmp/senti_post_v2.parquet. This can then be queried as for the CSV file:

SELECT gender, count(*) FROM dfs.tmp.`/senti_post_v2.parquet` GROUP BY gender;

…but with a significant speed up, on some queries at least:

To quit, !exit.

And finally, to make the Apache Drill service easier to use from code, wrapper libraries are available for R – the sergeant R package – and Python – the pydrill package.

First Attempt at Running the TM351 VM as an AMI on Amazon Web Services

One of the things that’s been on my to do list for ages is trying to get a version of the TM351 virtual machine (VM) up and running on Amazon Web Services (AWS) as an Amazon Machine Instance (AMI). This would allow students who are having trouble running the VM on their own computer to access the services running in the cloud.

(Obviously, it would be preferable if we could offer such a service via OU operated servers, but I can’t do politics well enough, and don’t have the mentality to attend enough of the necessary say-the-same-thing-again-again meetings, to make that sort of thing happen.)

So… a first attempt is up on the eu-west-1 region in all its insecure glory: TM351 AMI v1. The security model is by obscurity as much as anything – there’s no model for setting separate passwords for separate students, for example, or checking back against an OU auth layer. And I suspect everything runs as root…

(One of the things we have noticed in (brief) testing is that the Getting Started instructions don’t work inside the OU, at least if you try to limit access to your (supposed) IP address. Reminds me of when we gave up trying to build the OU VM from machines on the OU network, because solving proxy and blocked port issues was an irrelevant problem to have to worry about when working from outside the network…)

Open Refine doesn’t seem to want to run alongside the other services in the free tier micro (1GB) machine instance, but at 2GB everything seems okay. (I don’t know if possible race conditions in starting services mean that Open Refine could start and then block the Jupyter service’s request for resources. I need to do an Apollo 13 style startup sequence exploration to see if all the services can run in 1GB, I guess!) One thing I’ve added to the to do list is to split things out into separate AMIs that will work on the 1GB free tier machines. I also want to check that I can provision the AMI from Vagrant, so students could then launch a local VM or an Amazon instance that way, just by changing the vagrant provider. (Shared folders/volumes might get a bit messed up in that case, though?)

If services can run one at a time in the 1GB machines, it’d be nice to provide a simple dashboard to start and stop the services to make that easier to manage. Something that looks a bit like this, for example, exposed via an authenticated web page:

This needn’t be too complex – I had in mind a simple Python web app that could run under nginx (which currently provides a simple authentication layer for Open Refine to sit behind) and then just runs simple systemctl start, stop and restart commands on the appropriate service.

#fragment...
import os
os.system('systemctl restart jupyter.service')
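
Fleshing that out a little, a minimal sketch of the sort of thing I have in mind might look like the following – assuming a small Flask app sitting behind the nginx authentication layer, with illustrative service names:

from flask import Flask, abort
import subprocess

app = Flask(__name__)

# Whitelist of managed services and actions (service names are illustrative)
SERVICES = ['jupyter', 'refine', 'postgres', 'mongodb']
ACTIONS = ['start', 'stop', 'restart']

@app.route('/service/<name>/<action>')
def service(name, action):
    # Only pass whitelisted service/action combinations through to systemctl
    if name not in SERVICES or action not in ACTIONS:
        abort(404)
    # Assumes the app runs with sufficient privileges to call systemctl
    subprocess.check_call(['systemctl', action, '{}.service'.format(name)])
    return '{} {}: ok'.format(action, name)

if __name__ == '__main__':
    app.run(port=8000)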

I’m not sure how the status display should be updated (based on whether a service is actually running or not), or on what heartbeat it should refresh. There may be better ways, of course, in which case please let me know via the comments:-)

I did have a quick look round for examples, but the dashboards/monitoring tools that do exist, such as pydash, are far more elaborate than what I had in mind. (If you know of a simple example to do the above, or can knock one up for me, please let me know via the comments. And the simpler the better ;-)

If we are to start exploring the use of browser accessed applications running inside user-managed VMs, this sort of simple application could be really handy… (Another approach would be to use a VM running docker, and then have a container manager running, such as portainer.)

Creating a Jupyter Bundler Extension to Download Zipped Notebook and HTML Files

In the first version of the TM351 VM, we had a simple toolbar extension that would download a zipped ipynb file, along with an HTML version of the notebook, so it could be uploaded and previewed in the OU Open Design Studio. (Yes, I know, it would have been much better to have an nbviewer handler as an ODS plugin, but we don’t do that sort of tech innovation, apparently…)

Looking at updating the extension today for the latest version of Jupyter notebooks, I noticed the availability of custom bundler extensions that allow you to add additional tools to support notebook downloads and deployment (I’m not sure what deployment relates to?). This means a new download option can be added to the notebook File -> Download menu:

The extension is created as a python package:

# odszip/setup.py
from setuptools import setup

setup(name='odszip',
      version='0.0.1',
      description='Save Jupyter notebook and HTML in zip file with .nbk suffix',
      author='',
      author_email='',
      license='MIT',
      packages=['odszip'],
      zip_safe=False)
#odszip/odszip/download.py

# Copyright (c) The Open University, 2017
# Copyright (c) Jupyter Development Team.

# Distributed under the terms of the Modified BSD License.
# Based on: https://github.com/jupyter-incubator/dashboards_bundlers/

import os
import shutil
import tempfile

#THIS IS A REQUIRED FUNCTION
def _jupyter_bundlerextension_paths():
    '''API for notebook bundler installation on notebook 5.0+'''
    return [{
                'name': 'odszip_download',
                'label': 'ODSzip (.nbk)',
                'module_name': 'odszip.download',
                'group': 'download'
            }]


def make_download_bundle(abs_nb_path, staging_dir, tools):
	'''
	Assembles the notebook and resources it needs, returning the path to a
	zip file bundling the notebook and its requirements if there are any,
	the notebook's path otherwise.
	:param abs_nb_path: The path to the notebook
	:param staging_dir: Temporary work directory, created and removed by the caller
	'''
    
	# Clean up bundle dir if it exists
	shutil.rmtree(staging_dir, True)
	os.makedirs(staging_dir)
	
	# Get name of notebook from filename
	notebook_basename = os.path.basename(abs_nb_path)
	notebook_name = os.path.splitext(notebook_basename)[0]
	
	# Add the notebook
	shutil.copy2(abs_nb_path, os.path.join(staging_dir, notebook_basename))
	
	# Include HTML version of file
	cmd='jupyter nbconvert --to html "{abs_nb_path}" --output-dir "{staging_dir}"'.format(abs_nb_path=abs_nb_path,staging_dir=staging_dir)
	os.system(cmd)

	zip_file = shutil.make_archive(staging_dir, format='zip', root_dir=staging_dir, base_dir='.')
	return zip_file

#THIS IS A REQUIRED FUNCTION       
def bundle(handler, model):
	'''
	Downloads a notebook as an HTML file and zips it with the notebook
	'''
	
	# Based on https://github.com/jupyter-incubator/dashboards_bundlers
	
	abs_nb_path = os.path.join(
		handler.settings['contents_manager'].root_dir,
		model['path']
	)
		
	notebook_basename = os.path.basename(abs_nb_path)
	notebook_name = os.path.splitext(notebook_basename)[0]
	
	tmp_dir = tempfile.mkdtemp()

	output_dir = os.path.join(tmp_dir, notebook_name)
	bundle_path = make_download_bundle(abs_nb_path, output_dir, handler.tools)
		
	handler.set_header('Content-Disposition', 'attachment; filename="%s"' % (notebook_name + '.nbk'))
	
	handler.set_header('Content-Type', 'application/zip')
	
	with open(bundle_path, 'rb') as bundle_file:
		handler.write(bundle_file.read())

	handler.finish()


	# We read and send synchronously, so we can clean up safely after finish
	shutil.rmtree(tmp_dir, True)

We can then create the python package and install the extension, remembering to restart the Jupyter server for the extension to take effect.

#Install the ODSzip extension package
pip3 install --upgrade --force-reinstall ./odszip

#Enable the ODSzip extension
jupyter bundlerextension enable --py odszip.download  --sys-prefix

Getting Web Services Up and Running on MicroSoft Azure Using Vagrant and the Azure CLI

As well as Getting Web Services Up and Running on Amazon Web Services (AWS) Using Vagrant and the AWS CLI, we can also use Vagrant to provision machines on other web hosts, such as the Microsoft Azure cloud platform. In this post, I’ll describe a command line based recipe for doing just that.

To start with, you’ll need to get a Microsoft Azure account.

When you’ve done that, install the Azure command line interface (CLI). On a Mac:

curl -L https://aka.ms/InstallAzureCli | bash

For me, this installed to ~/bin/az.

With the client installed, login:

~/bin/az login

This requires a token based handshake with a Microsoft authentication website.

List the range of machine images available (if you haven’t set the path to az, use the full ~/bin/az):

az vm image list

There was only one that looked suitable to me for my purposes: Canonical:UbuntuServer:16.04-LTS:latest.

To run the provisioner, we need a Subscription ID; this will be used to set the vagrant .subscription_id parameter. These are listed on the Azure Portal.

We also need to create an Active Directory Service Principal:

az ad sp create-for-rbac

This information will be used to configure the Vagrantfile: the appId sets the vagrant .client_id, the password the .client_secret, and the tenant the .tenant_id.
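
The command returns a small JSON document containing (among other things) the values we need – something like the following, with the values redacted:

{
  "appId": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
  "password": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
  "tenant": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
}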

You can also inspect the application in the App Registrations area of the Azure Portal.

Now let’s set up Vagrant. We’ll use the vagrant-azure plugin:

vagrant plugin install vagrant-azure --plugin-version '2.0.0.pre6'

We need to add a dummy box:

vagrant box add azure https://github.com/azure/vagrant-azure/raw/v2.0/dummy.box

Now let’s set up the Vagrantfile:

config.vm.provider :azure do |azure, override|
    #The path to your ssh keys
    override.ssh.private_key_path = '~/.ssh/id_rsa'

    #The default box we added
    override.vm.box = 'azure'
    
    #Set a territory
    azure.location="uksouth"

    #Provide your own group and VM name
    azure.resource_group_name="tm351azuretest"
    azure.vm_name="tm351azurevmtest"

    # Set an appropriate image (the UbuntuServer is actually the current default value)
    azure.vm_image_urn="Canonical:UbuntuServer:16.04-LTS:latest"

    #Use a valid subscription ID
    #https://portal.azure.com/#blade/HubsExtension/MyAccessBlade/resourceId/
    azure.subscription_id = ENV['AZURE_SUBSCRIPTION_ID']

    # Using details from the Active Directory Service Principal setup
    azure.tenant_id = ENV['AZURE_TENANT_ID']
    azure.client_id = ENV['AZURE_CLIENT_ID']
    azure.client_secret = ENV['AZURE_CLIENT_SECRET']

end
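
The Vagrantfile picks the credentials up from environment variables, so export them in the shell before running the provisioner, using the values obtained in the previous steps:

export AZURE_SUBSCRIPTION_ID="YOUR_SUBSCRIPTION_ID"
export AZURE_TENANT_ID="YOUR_TENANT_ID"        # tenant from create-for-rbac
export AZURE_CLIENT_ID="YOUR_APP_ID"           # appId from create-for-rbac
export AZURE_CLIENT_SECRET="YOUR_PASSWORD"     # password from create-for-rbac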

With the Vagrantfile parameters in place, we should then be able to call the Azure provider using the command:

vagrant up --provider=azure

But we’re still not quite done… If you’re running services on the VM, populated from elsewhere in the Vagrantfile, you’ll need to add some security rules to make the ports accessible. I’m running services on ports 80, 35180 and 35181, for example:

az network nsg rule create -g tm351azuretest --nsg-name tm351azurevmtest-vagrantNSG --priority 130 --direction Inbound --destination-port-ranges 8888 --access Allow --name port_8888
az network nsg rule create -g tm351azuretest --nsg-name tm351azurevmtest-vagrantNSG --priority 131 --direction Inbound --destination-port-ranges 35181 --access Allow --name port_35181

Now we can lookup the IP address of the server:

az vm list-ip-addresses

and see if our applications are there :-)

Getting Web Services Up and Running on Amazon Web Services (AWS) Using Vagrant and the AWS CLI

From past experience of trying to get things up and running with AWS (Amazon Web Services), it can be a bit of a faff trying to work out what to set where the first time. So here’s an example of how to get a browser based application up and running on EC2 using vagrant from the command line.

(If you want to work through sorting the settings out via the AWS online management console, try Oliver Veits’ tutorial AWS Automation based on Vagrant — Part 2: Installation and Usage of the Vagrant AWS Plugin; you might also need to refer to Part 1: Getting started with AWS.)

This post in part assumes you know how to provision your own virtual machine locally using Vagrant. Here are the steps you need to take to be able to run an AWS provisioner (on a Mac or Linux machine… not sure about Windows?).

First up – sign up for AWS (get credit via the Github Education Pack)…

Pick up some credentials via AWS root Security Credentials (Access Keys (Access Key ID and Secret Access Key)):

Ensure that the key is active (Make Active).

There’s quite a bit of set up to do to configure the provisioner script. This can be done on the command line using the Amazon Command Line Interface (AWS CLI):

pip install --upgrade --user awscli

Now you need to configure the AWS CLI:

aws configure

Use the security credentials you picked up to configure the client*.
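
The configure step prompts for the access keys, along with a default region and output format – something along these lines:

AWS Access Key ID [None]: YOUR_ACCESS_KEY_ID
AWS Secret Access Key [None]: YOUR_SECRET_ACCESS_KEY
Default region name [None]: eu-west-1
Default output format [None]: json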

When we launch the AWS machine, vagrant needs to be able to access it via ssh using the public IP address automatically assigned to the machine. In deployment too, if we’re building specific services we want to be able to access over the web, we need to open up access to the ports those services are listening on.

By default, the machine will be locked down, so we need to open up specific ports by setting security rules. These are assigned on the basis of a security group, so let’s create one of those (mine is named after the course VM I’m building…):

aws ec2 create-security-group --group-name tm351cloud --description "Security group for tm351 services"

We’re going to use this group in the .security_groups parameter in the Vagrantfile.

Now we need to create the security group rules. In my case, I want to open up ssh (port 22) to allow incoming traffic from my IP address, and ports 80, 35180 and 35181 to allow http traffic from anywhere. (The /32 suffix limits a rule to a single IP address; the /0 suffix opens it up to any address.)

MYIP=$(curl http://checkip.amazonaws.com/)
aws ec2 authorize-security-group-ingress --group-name tm351cloud --protocol tcp --port 22 --cidr ${MYIP}/32
aws ec2 authorize-security-group-ingress --group-name tm351cloud --protocol tcp --port 80 --cidr 0.0.0.0/0
aws ec2 authorize-security-group-ingress --group-name tm351cloud --protocol tcp --port 35180 --cidr 0.0.0.0/0
aws ec2 authorize-security-group-ingress --group-name tm351cloud --protocol tcp --port 35181 --cidr 0.0.0.0/0

# Check the policies
aws ec2 describe-security-groups --group-names tm351cloud

Having opened up at least the ssh port 22, we need to set up some SSH keys with a particular name (vagrantaws) that we will use with the vagrant .keypair_name parameter, and save them to a local file (vagrantaws.pem) with the appropriate permissions.

aws ec2 create-key-pair --key-name vagrantaws --query 'KeyMaterial' --output text > vagrantaws.pem
chmod 400 vagrantaws.pem

The vagrant provisioner also requires specific access tokens (.access_key_id, .secret_access_key, .session_token) to access EC2. Create these tokens, entering your own duration (in seconds):

aws sts get-session-token --duration-seconds 129600
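
The response contains the temporary credentials in a JSON document along the following lines (values elided), which map onto the corresponding Vagrantfile parameters:

{
  "Credentials": {
    "AccessKeyId": "YOUR_KEY_ID",
    "SecretAccessKey": "YOUR_SECRET_ACCESS_KEY",
    "SessionToken": "YOUR_SESSION_TOKEN",
    "Expiration": "2017-08-10T12:00:00Z"
  }
}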

Now we can start to look at the Vagrant set up. Install the vagrant AWS provisioner:

vagrant plugin install vagrant-aws

After setting up the Vagrantfile, you will be able to provision your machine on AWS using:

vagrant up --provider=aws

Add a dummy box:

vagrant box add awsdummy https://github.com/mitchellh/vagrant-aws/raw/master/dummy.box

Now let’s look at the Vagrantfile:

#Set up the provider block
config.vm.provider :aws do |aws, override|

    #Use the ec2 security group set previously
    aws.security_groups = ["tm351cloud"]

    #Whatever name we want
    override.vm.hostname = "tm351aws"

    #The name of the dummy box we added
    override.vm.box = "awsdummy"
    
    #Set up machine access using our keypair name and ssh key path
    override.ssh.username = "ubuntu"
    aws.keypair_name="vagrantaws"
    override.ssh.private_key_path = "vagrantaws.pem"

    #Use the values generated by the session token generator
    aws.access_key_id = "YOUR_KEY_ID"
    aws.secret_access_key = "YOUR_SECRET_ACCESS_KEY"
    aws.session_token = "YOUR_SESSION_TOKEN"

    #Specify a region and valid ami for that region, along with the desired instance size
    aws.region = "eu-west-1"
    aws.ami = "ami-971238f1" 
    aws.instance_type="t2.small"

  end

Running vagrant up --provider=aws should run the Vagrant provisioner with the AWS provider. Running vagrant destroy will tear down the machine (so you don’t keep paying for it… I think the users, security groups and keypairs are free?)

To check on the IP address of your instance, run:

aws ec2 describe-instances

or check on the AWS EC2 console. You can also check the machine is ripped down correctly when you have finished with it from there.

(I need to check what happens if you vagrant suspend and then vagrant resume. Presumably, the state is preserved, but you are billed for storage, if not running time?)


*Alternatively, we could create a specific user with more limited credentials.

Create a user we can use to help set up the credentials to use with the vagrant provisioner:

aws iam create-user --user-name vagrant

Now we need to give that user permissions to build our EC2 instance, by attaching an appropriate security policy (AmazonEC2FullAccess). In other words, the Vagrantfile will make use of the AWS vagrant user to provision the machine, so we need to give that AWS user the appropriate permissions on AWS:

aws iam attach-user-policy --user-name vagrant --policy-arn arn:aws:iam::aws:policy/AmazonEC2FullAccess

Generate some keys:

aws iam create-access-key --user-name vagrant

Then run aws configure with the new keys.