
Data Mining: Separating Fact From Fiction – Presentation to NetIKX

A couple of weeks ago I gave a presentation to NetIKX, an information professionals’ knowledge exchange network.

Search consultant Karen Blakeman opened with a presentation showing how the search engines don’t always return what you’d expect, and I tweaked my presentation somewhat as Karen spoke to try to better complement it.

My slides are available on slideshare in a partially annotated form. I didn’t deliver the more technical data mining algorithm slides towards the end of the deck, and I haven’t yet annotated those slides either.

Tools in Tandem – SQL and ggplot. But is it Really R?

Increasingly I find that I have fallen into using not-really-R whilst playing around with Formula One stats data. Instead, I seem to be using a hybrid of SQL to get data out of a small SQLite3 database and into an R dataframe, and then ggplot2 to visualise it.

So for example, I’ve recently been dabbling with laptime data from the ergast database, using it as the basis for counts of how many laps have been led by a particular driver. The recipe typically goes something like this – set up a database connection, and run a query:

#Load the DBI database interface package (the RSQLite package also needs to be installed)
library(DBI)

#Set up a connection to a local copy of the ergast database
ergastdb = dbConnect(RSQLite::SQLite(), './ergastdb13.sqlite')

#Run a query
q='SELECT code, grid, year, COUNT(l.lap) AS Laps 
    FROM (SELECT grid, raceId, driverId from results) rg,
        lapTimes l, races r, drivers d 
    WHERE rg.raceId=l.raceId AND d.driverId=l.driverId
          AND rg.driverId=l.driverId AND l.position=1 AND r.raceId=l.raceId 
    GROUP BY grid, driverRef, year 
    ORDER BY year'


In this case, the data is a table that shows, for each year, a count of laps led by each driver from a given grid position in the corresponding races (null values are not reported). The data grabbed from the database is passed into a dataframe in a relatively tidy format, from which we can easily generate a visualisation.
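For completeness – this step isn't shown in the listing above – a standard DBI call runs the query and pulls the result into the dataframe used in the plotting code below:

#Run the query and pull the results back into an R dataframe
driverlapsledfromgridposition = dbGetQuery(ergastdb, q)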


The chart I have opted for is a text plot faceted by year:


The count of lead laps for a given driver by grid position is given as a text label, sized by count, and rotated to minimise overlap. The horizontal axis is actually a logarithmic scale, which “stretches out” the positions at the front of the grid (grid positions 1 and 2) compared to positions lower down the grid – where counts are likely to be lower anyway. To try to recapture some sense of where grid positions lie along the horizontal axis, a dashed vertical line at grid position 2.5 marks out the front row. The x-axis is further expanded to mitigate against labels being obscured or overflowing off the left hand side of the plotting area. A clean black and white theme finishes off the chart.

library(ggplot2)

g = ggplot(driverlapsledfromgridposition)
#Dashed vertical line marking out the front row of the grid
g = g + geom_vline(xintercept = 2.5, colour='lightgrey', linetype='dashed')
#Lap counts as text labels, sized by (log) count and rotated to reduce overlap
g = g + geom_text(aes(x=grid, y=code, label=Laps, size=log(Laps), angle=45))
g = g + facet_wrap(~year) + xlab(NULL) + ylab(NULL) + guides(size=FALSE)
#Log scale x-axis, slightly expanded, plus a clean black and white theme
g + scale_x_log10(expand=c(0,0.3)) + theme_bw()

There are still a few problems with this graphic, however. The labels on the y-axis are in alphabetical order, for example, and the chart would perhaps be more informative if they were ordered to reflect the championship rankings.
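As a quick sketch (this isn't in the original recipe), we could at least reorder the driver codes using something already in the dataframe – the total number of laps led – rather than leaving them in alphabetical order; ordering by the actual championship standings would require pulling in the standings data as well:

#Sketch: reorder the driver code factor by total laps led across all years
driverlapsledfromgridposition$code = reorder(driverlapsledfromgridposition$code,
                                             driverlapsledfromgridposition$Laps,
                                             FUN=sum)
#Rebuilding the plot after this places the most prolific lap leaders together on the y-axis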

However, to return to the main theme of this post, whilst the R language and RStudio environment are being used as the medium within which this activity takes place, the data wrangling and analysis (in the sense of counting) is being performed by the SQL query, and the visual representation and analysis (in the sense of faceting, for example, and generating visual cues based on data properties) is being performed by routines supplied as part of the ggplot2 library.

So if asked whether this is an example of using R for data analysis and visualisation, what would your response be? What does it take for something to be peculiarly or particularly an R based analysis?

For more details, see the “Laps Completed and Laps Led” draft chapter and the Wrangling F1 Data With R book.

Notes on Data Quality…

You know how it goes – you start trying to track down a forward “forthcoming” reference and you end up wending your way through all manner of things until you get back to where you started none the wiser… So here’s a snapshot of several docs I found whilst trying to source the original forward reference for the following table, found in Improving information to support decision making: standards for better quality data (November 2007, first published October 2007), with the crib that several references to it mentioned the Audit Commission…


The first thing I came across was The Use of Information in Decision Making – Literature Review for the Audit Commission (2008), prepared by Dr Mike Kennerley and Dr Steve Mason, Centre for Business Performance Cranfield School of Management, but that wasn’t it… This document does mention a set of activities associated with the data-to-decision process: Which Data, Data Collection, Data Analysis, Data Interpretation, Communication, Decision making/planning.

The data and information definitions from the table do appear in a footnote – without reference – in Nothing but the truth? A discussion paper from the Audit Commission in Nov 2009, but that’s even later… The document does, however, identify several characteristics (cited from an earlier 2007 report (Improving Information, mentioned below…), and endorsed at the time by Audit Scotland, Northern Ireland Audit Office, Wales Audit Office and CIPFA, with the strong support of the National Audit Office), that contribute to a notion of “good quality” data:

Good quality data is accurate, valid, reliable, timely, relevant and complete. Based on existing guidance and good practice, these are the dimensions reflected in the voluntary data standards produced by the Audit Commission and the other UK audit agencies
* Accuracy – data should be sufficiently accurate for the intended purposes.
* Validity – data should be recorded and used in compliance with relevant requirements.
* Reliability – data should reflect stable and consistent data collection processes across collection points and over time.
* Timeliness – data should be captured as quickly as possible after the event or activity and must be available for the intended use within a reasonable time period.
* Relevance – data captured should be relevant to the purposes for which it is used.
* Completeness – data requirements should be clearly specified based on the information needs of the body and data collection processes matched to these requirements.

The document also has some pretty pictures, such as this one of the data chain:


In the context of the data/information/knowledge definitions, the Audit Commission discussion document also references the 2008 HMG strategy document Information matters: building government’s capability in managing knowledge and information, which includes the table in full; a citation link is provided but 404s, though a source is given as the November 2008 version of Improving information, the one we originally started with. So the original reference forward-refers the table to an unspecified report, whilst later reports in the area refer back to that “original” without making a claim to the actual table itself?

Just in passing, whilst searching for the Improving information report, I actually found another version of it… Improving information to support decision making: standards for better quality data, Audit Commission, first published March 2007.


The table and the definitions as cited in Information Matters do not seem to appear in this earlier version of the document?

PS Other tables do appear in both versions of the report. For example, both the March 2007 and November 2007 versions of the doc contain this table (here, taken from the 2008 doc) of stakeholders:


Anyway, aside from all that, several more documents for my reading list pile…

PS see also Audit Commission – “In the Know” from February 2008.

Quick Note – Apache Tika (Document Text Extraction Service)

I came across Apache Tika a few weeks ago – a service that will tell you what type pretty much any document is, based on its metadata, and will have a good go at extracting text from it.

With a prompt and a 101 from @IgorBrigadir, it was pretty easy to get started with – sort of…

First up, I needed to get the Apache Tika server running. As there’s a containerised version available on dockerhub (logicalspark/docker-tikaserver), it was simple enough for me to fire up a server in a click using tutum (as described in this post on how to run OpenRefine in the cloud in just a couple of clicks and for a few pennies an hour; pretty much all you need to do is fire up a server, start a container based on logicalspark/docker-tikaserver, and tick to make the port public…)

As @IgorBrigadir described, I could ping the server on curl -X GET and send a document to it using curl -T foo.doc

His suggested recipe for using the python requests library borked for me – I couldn’t get python to open the file to get the data bits to send to the server (file encoding issues; one reason for using Tika is that it’ll try to accept pretty much anything you throw at it…).

I had a look at pycurl:

!apt-get install -y libcurl4-openssl-dev
!pip3 install pycurl

but couldn’t make head or tail of how to use it: the pycurl equivalent of curl -T foo.doc can’t be that hard to write, can it? (Translations appreciated via the comments…;-)

Instead I took the approach of dumping the result of a curl request on the command line into a file:

!curl -T Text/foo.doc > tikatest.json

and then grabbing the response out of that:

import json
tikadata = json.load(open('tikatest.json'))

Not elegant, and far from ideal, but a stop gap for now.

Part of the response from the Tika server is the text extracted from the document, which can then provide the basis for some style-free text analysis…

I haven’t tried with any document types other than crappy old MS Word .doc formats, but this looks like it could be a really easy tool to use.

And with the containerised version available, and tutum and Digital Ocean to hand, it’s easy enough to fire up a version in the cloud, let alone my desktop, whenever I need it:-)

Getting Started With Personal App Containers in the Cloud

…aka “how to run OpenRefine in the cloud in just a couple of clicks and for a few pennies an hour”…

I managed to get my first container up and running in the cloud today (yeah!:-), using tutum to launch a container I’d defined on Dockerhub and run it on a linked DigitalOcean server (or as they call them, “droplet”).

This sort of thing is probably a “so what?” to many devs, or even folk who do the self-hosting thing, where for example you can launch your own web applications using CPanel, setting up your own WordPress site, perhaps, or an online database.

The difference for me is that the instance of OpenRefine I got up and running in the cloud via a web browser was the result of composing several different, loosely coupled services together:

  • I’d already published a container on dockerhub that launches the latest release version of OpenRefine: psychemedia/docker-openrefine. This lets me run OpenRefine in a boot2docker virtual machine running on my own desktop and access it through a browser on the same computer.
  • Digital Ocean is a cloud hosting service with simple billing (I looked at Amazon AWS but it was just too complicated) that lets you launch cloud hosted virtual machines of a variety of sizes and in a variety of territories (including the UK). Billing is per hour with a monthly cap with different rates for different machine specs. To get started, you need to register an account and make a small ($5 or so) downpayment using Paypal or a credit card. So that’s all I did there – created an account and made a small payment. [Affiliate Link: sign up to Digital Ocean and get $10 credit]
  • tutum is an intermediary service that makes it easy to launch servers and the containers running inside them. By linking a DigitalOcean account to tutum, I can launch containers on DigitalOcean in a relatively straightforward way…

Launching OpenRefine via tutum

I’m going to start by launching a 2GB machine, which comes in at 3 cents an hour, capped at $20 a month.



Now we need to get a container – which I’m thinking of as if it was a personal app, or personal app server:


I’m going to make use of a public container image – here’s one I prepared earlier…


We need to do a tiny bit of configuration. Specifically, all I need to do is ensure that I make the port public so I can connect to it; by default, it will be assigned to a random port in a particular range on the publicly viewable service. I can also set the service name, but for now I’ll leave the default.


If I create and deploy the container, the image will be pulled from dockerhub and a container launched based on it that I should be able to access via a public URL:


If the URL you copy or click on starts with tcp:// change it to http://

The first time I pull the container into a specific machine it takes a little time to set up as the container files are imported into the machine. If I create another container using the same image (another OpenRefine instance, for example), it should start really quickly because all the required files have already been loaded into the node.


Unfortunately, when I go through to the corresponding URL, there’s nothing there. Looking at the logs, I think maybe there wasn’t enough memory to launch a second OpenRefine container… (I could test this by launching a second droplet/server with more memory, and then deploying a couple of containers to that one.)


The billing is calculated on DigitalOcean at an hourly rate, based on the number and size of servers running. To stop racking up charges, you can terminate the server/droplet (so you also lose the containers).


Note that in the case of OpenRefine, we could allow several users to access the same OpenRefine container (the same URL) and just run different projects within it.

So What?

Although this is probably not the way that dev ops folk think of containers, I’m seeing them as a great way of packaging service based applications that I might want to run at a personal level, or perhaps in a teaching/training context, maybe on a self-service basis, maybe on a teacher self-service basis (fire up one application server that everyone in a cohort can log on to, or one container/application server for each of them). I noticed that I could automatically launch as many containers as I wanted – a 64GB, 20 core machine costs about $1 per hour on Digital Ocean, so for an all day School of Data training session with 15-20 participants, for example, that would be about $10, with everyone in such a class of 20 having their own OpenRefine container/server, all started with the same single click. Alternatively, we could fire up separate droplet servers, one per participant, each running its own set of containers – though that might be harder to initialise (i.e. more than one or two clicks?!). Or maybe not?

One thing I haven’t explored yet is mounting data containers/volumes to link to application containers. This makes sense in a data teaching context because it cuts down on bandwidth. If folk are going to work on the same 1GB file, it makes sense to just load it in to the virtual machine once, then let all the containers synch from that local copy, rather than each container having to download its own copy of the file.

The advantage of the approach described in the walkthrough above over “pre-configured” self-hosting solutions is the extensibility of the range of applications available to me. If I can find – or create – a Dockerfile that will configure a container to run a particular application, I can test it on my local machine (using boot2docker, for example) and then deploy a public version in the cloud, at an affordable rate, in just a couple of steps.

Whilst templated configurations using things like fig or panamax – which would support the 1-click launch of multiple linked container configurations – aren’t supported by tutum yet, I believe they are in the timeline… So I look forward to trying out a 1-click cloud version of Using Docker to Build Linked Container Course VMs when that comes onstream:-)

In an institutional setting, I can easily imagine a local docker registry that hosts images for apps that are “approved” within the institution, or perhaps tagged as relevant to particular courses. I don’t know if it’s similarly possible to run your own panamax configuration registry, as opposed to pushing a public panamax template for example, but I could imagine that being useful institutionally too? For example, I could put container images on a dockerhub style OU teaching hub or OU research hub, and container or toolchain configurations that pull from those on a panamax style course template register, or research team/project register? To front this, something like tutum, though with an even easier interface to allow me to fire up machines and tear them down?

Just by the by, I think part of the capital funding the OU got recently from HEFCE was slated for a teaching related institutional “cloud”, so if that’s the case, it would be great to have a play around trying to set up a simple self-service personal app runner thing?;-) That said, I think the pitch in that bid probably had the forthcoming TM352 Web, Mobile and Cloud course in mind (2016? 2017??), though from what I can tell I’m about as persona non grata as possible with respect to even being allowed to talk to anyone about that course!;-)

Whose Browser (or Phone, or Drone?!) Is It Anyway?

I’m not sure how many Chrome users follow any of the Google blogs that occasionally describe forthcoming updates to Google warez, but if you don’t you perhaps don’t realise quite how frequently things change. My browser, for example, is at something like version 40, even though I never consciously update it.

One thing I only noticed recently is that a tab has appeared in the top right hand corner of the browser showing that I’m logged in (to the browser) with a particular Google account. There doesn’t actually appear to be an option to log out – I can switch user or go incognito – and I’m not sure I even remember consciously logging in to it (actually, maybe there’s a hazy memory, from when I wanted to install a particular extension), and I have no idea what it actually means for me to be logged in?

Via the Google Apps Update blog, I learned today that being logged in to the browser will soon support seamless synching of my Google docs into my Chrome browser environment (Offline access to Google Docs editors auto-enabled when signing into Chrome browser on the web). Following a pattern popularised by Apple, Google are innovating on our behalf and automatically opting us in to behaviours it thinks make sense for us. So just bear that in mind when you write a ranty resignation letter in Google docs and wonder why it’s synched to your work computer on your office desk:

Note that Google Apps users should not sign into a Chrome browser on public/non-work computers with their Google Apps accounts to avoid unintended file syncing.

If you actually have several Google apps accounts (for example, I have a personal one, and a couple of organisational ones: an OU one, an OKF one), I assume that the only docs that are synched are the ones on an account that matches the account I have signed in to in the browser. That said, synch permissions may be managed centrally for organisational apps accounts:

Google Apps admins can still centrally enable or disable offline access for their domain in the Admin console… Existing settings for domain-level offline access will not be altered by this launch.

I can’t help but admit that even though I won’t have consciously opted in to this feature – just as I don’t really remember logging in to Chrome on my desktop (how do I log out???), and I presumably agreed to something when I installed Chrome to let it keep updating itself without prompting me – I will undoubtedly find it useful one day: on a train, perhaps, when trying to update a document I’d forgotten to synch. It will be so convenient that I will find it unremarkable, not noticing I can now do something I couldn’t do as easily before. Or I might notice, with a “darn, I wish I’d…”, then an “oh, cool [kewel…], I can…”.

“‘Oceania has always been at war with Eastasia.'” [George Orwell, 1984]

It’s just like when, after being sure I’d disabled – or explicitly not opted in to – any sort of geo-locating or geo-tracking behaviour on my Android phone, I found I must have left a door open somewhere (or been automatically opted in to something I hadn’t appreciated when agreeing to a particular update, or, by proxy, agreed to allow something to update itself automatically and without prompting, with implied or explicit permission to automatically opt me in to new features…), and then found I could locate my misplaced phone using the Android Device Manager (Where’s My Phone?).

This idea of allowing applications to update themselves in the background and without prompting is something we have become familiar with in many web apps, and in desktop apps such as Google Chrome, though many apps do still require the user to either accept the update or take an even more positive action to install an update when notified that one is available. (It seems that ever fewer apps require you to specifically search for updates…)

In the software world, we have gone from a world where the things we buy were immutable, to one where we could search for and install updates (e.g. to operating systems or software applications), then accept updates when alerted to the fact, to automatically (and invisibly) accepting updates.

In turn, many physical devices have gone from being purely mechanical affairs, to electro-mechanical ones, to logical-electro-mechanical devices (ones that include logic elements hardwired into silicon, for example), to ones containing factory programmable hardware devices (PROMs, programmable Read Only Memories), to devices that run programmable, and then reprogrammable, firmware (that is to say, software).

If you have a games console, a Roku or MyTV box, or a Smart TV, you’ve probably already been prompted to get a (free) online update. I don’t know, but I could imagine, new top end cars having engine management system updates at regular service events.

However, one thing perhaps we don’t fully appreciate is that these updates can also be used to limit functionality that our devices previously had. If the updates are done seamlessly (without permission, in the background), this may come as something of a surprise. [Cf. the complementary issue of vendors having access to “their” content on “your” machine, as described here by the Guardian: Amazon wipes customer’s Kindle and deletes account with no explanation]

A good example of loss of functionality arising from an (enforced, though self-applied) firmware update was reported recently in the context of hobbyist drones:

On Wednesday, SZ DJI Technology, the Chinese company responsible for the popular DJI Phantom drones that online retailers sell for less than $500, announced that it had prepared a downloadable firmware update for next week that will prevent drones from taking off in restricted zones and prevent flight into those zones.

Michael Perry, a spokesman for DJI, told the Guardian that GPS locating made such an update possible: “We have been restricting flight near airports for almost a year.”

“The compass can tell when it is near a no-fly zone,” Perry said. “If, for some reason, a pilot is able to fly into a restricted zone and then the GPS senses it’s in a no-fly zone, the system will automatically land itself.”

DJI’s new Phantom drones will ship with the update installed, and owners of older devices will have to download it in order to receive future updates.

Makers of White House drone offer fix to bar Phantom menace from no-fly zones

What correlates might be applied to increasingly intelligent cars, I wonder?! Or at the other extreme, phones..?

PS How to log out of Chrome: you need to administer this yourself… From the Chrome Preferences Settings (sic), Disconnect your Google account.

Note that you have to take additional action to make sure that you remove all those synched presentations you’d prepared for job interviews at other companies from the actual computer…

Take care out there…!;-)

Split View Web Page Bookmarklet

What feels like forever ago, I described a method – Split Screen Screenshots – that allows you to put a copy of the same web page into two frames, so that you can grab a screenshot that includes the page header in the top frame, for example, as well as something from waaaaaay down the page in the lower frame.

I didn’t realise that the recipe I described required help from an external service until I tried to reuse the method to grab the following screenshot…


Anyway, cribbing from this Chrome Dual View standalone bookmark, here’s an updated version of my split screen bookmarklet:

javascript:url=location.href;f='<frameset rows=\'30%,*\'>\n<frame src=\''+url+'\'/><frame src=\''+url+'\'/></frameset>';with(document){write(f);void(close())}

To make use of it, drag the following link onto your bookmark toolbar, or save the link as a bookmark: Hmm… it seems the craptastic platform doesn’t let me post bookmarklet links? So you’ll have to: bookmark this page, edit the bookmark, and paste the above code snippet in as the URL. Then, when you want to split the view of a webpage, just click the bookmarklet.