I’ve been doodling a chart in R/ggplot, using geom_text() to generate a labelled scatterplot.
The chart actually builds up several layers using different datasets, so it’s not obvious how to set the ranges cleanly: I know the lower bound I want for the y-axis (y=0), but I want to let the upper bound float.
There’s also an issue with the labels overflowing the edges left and right.
So here are a couple of lines to make everything better (chart is in g):
#Find the current ymax value for upper bound
#(via http://stackoverflow.com/questions/7705345/how-can-i-extract-plot-axes-ranges-for-a-ggplot2-object#comment24184444_8167461 )
#Note: the bracketed indexes were mangled in formatting - [[1]] and [2] are assumed
gy=ggplot_build(g)$panel$ranges[[1]]$y.range[2]
g=g+ylim(0,gy)

#Handle the overflow by expanding the x-axis
g=g+scale_x_continuous(expand=c(0.1,0))
One of the advantages of having a relatively long lived blog is that it gives me the ability to look back at the things that were exciting to me several years ago. For example, it was five years ago, more or less to the day, when I first saw video ads on the underground; and seven and a half years since I remarked on the possible relevance of virtual machines (VMs) to OU teaching: Personal Computing Guidance for Distance Education Students. (At the time, I was more excited by portable applications that could be run from USB sticks, the motivating idea being that OU students might want to access course software or applications from arbitrary machines that they didn’t necessarily have enough permissions on to be able to download and install required applications.)
Since then, a couple of OU courses have dabbled with virtual machines – the Linux course that’s now part of TM129 – Technologies in Practice makes use of a Linux virtual machine running in VirtualBox, and the digital forensics postgrad course (M812) makes use of a couple of VMs – a Windows box that needs analysing, and a Linux VM that contains the analysis tools.
We’re also looking at using a virtual machine for a new level three/third year equivalent course due out in October 2015 (sic…) on data stuff (short title!;-). I haven’t really been paying as much attention as I probably should have to VMs, but a little bit of playing at the end of last week and over the weekend made me realise the error of my ways…
So what are virtual machines (VMs)? You’re possibly familiar with the phrase “(computer) operating system”, and almost definitely will have heard of Windows and OS X. These are the bits of computer software that provide a desktop on top of your computer hardware, and run the services that allow your applications to talk to the hardware and out into the wider world. Virtual machines are boxes that allow you to run another operating system, as well as applications on top of it, on your own desktop. So a Windows machine can run a box that contains a fully working Linux computer; or if you’re like me and use a Mac, you’ll have a virtual machine that runs a copy of Windows so you can run Internet Explorer on it in order to access the OU’s expense claims system!
Now when it comes to shipping course software, we’re often faced with the problem of getting software to work on whatever operating system our students are using. In a traditional university, with computer labs, the computers in the public areas will all contain the same software, installed from a common source. (OU IT are trying to enforce a similar policy on staff machines at the moment. Referred to in reverential terms as “desktop optimisation”, the idea is that machines will only run the software that IT says can run on them. Which would rule out the possibility of me running pretty much any of the applications I use on a day-to-day basis. Although I think Macs are outside the optimisation fold for the moment…?)
Ideally, then, we might want students to all run the same operating system, so that we can test software on that system and write one set of instructions for how to use it. But students bring their own devices. And when it comes to installing the software tools we’d like computing students, for example, to install, there can be all sorts of problems getting the software to install properly.
So another option is to provide students with a machine that we control, that doesn’t upset their own settings, and that won’t kill their computer if something goes horribly wrong. (We can’t, for example, require students to run the OU’s optimised desktop, not least because we’d have to pay license fees for the use of Windows, but also because students would rightly get upset if we prevented them from downloading and installing Angry Birds on their own computer!) Virtual machines provide a way of doing this.
As a case in point, the new data course will probably make use of iPython Notebook, among other things. iPython Notebook is a browser-accessed application that allows you to develop and execute Python code via an interactive, browser-based user interface, which I find quite attractive from a pedagogical point of view. (This post may get read by OU folk in an OU teaching context, so I am obliged to use the p-word.)
Installing the Python libraries the course will draw on, as well as a variety of databases (PostgreSQL and MongoDB are the ones we’re thinking of using…), could be a major headache for our students, particularly if they aren’t well versed in sysadmin and library installation. But if we define a virtual machine that has the required libraries and applications preinstalled and preconfigured, we can literally contain the grief – if students run an application such as VirtualBox (which they would have to install themselves), we can provide a preconfigured machine (known as a guest) that they can run within their own desktop (part of the host machine), and that will make available services they can access via their normal desktop browser.
So for example, we can build a virtual machine that contains iPython and all the required libraries, that can be defined to automatically run iPython Notebook when it boots, and that can make that notebook available via the host browser. And more than that, we can also configure the Notebook server running on the local guest VM so that it saves notebook files to a directory that is shared between the guest and the host. If a student then switches off, or even deletes, the guest machine, they don’t lose their work…
VMs have been used elsewhere for course delivery too, so we may also be able to learn more about the practicalities of VMs in a course context from those cases. For example, Running a next-gen sequence analysis course using Amazon Web Services describes how virtual machines running on Amazon cloud services (rather than in boxes running within a VirtualBox container on the user’s desktop) were used for a data analysis course that made use of very large datasets. (This demonstrates another benefit of virtualisation: we can configure a VM so that it can be run in containerised form on a student’s own computer, or run on a machine hosted on the net somewhere and then accessed from the student’s own machine.)
Something I found really exciting was the set of VMs defined by @DataMinerUk and @twtrdaithi for use in data journalism applications – Infinite Interns, a range of virtual machines defined using Vagrant (which is super fun to play with!:-) that contain a range of tools useful for data projects.
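By way of a taster, here’s a minimal, hypothetical Vagrantfile sketch along the lines of the course box described above – everything in it (box name, port, paths) is invented, and the notebook startup is simplified:

#Hypothetical sketch - box name, port and paths are all invented
Vagrant.configure("2") do |config|
  #A prebuilt guest image with the course libraries and databases preinstalled
  config.vm.box = "ou-datacourse"
  #Expose the notebook server running on the guest to the host browser
  config.vm.network "forwarded_port", guest: 8888, host: 8888
  #Share a directory between host and guest so work survives deletion of the VM
  config.vm.synced_folder "notebooks/", "/home/vagrant/notebooks"
  #Start IPython Notebook at provision time (a real box would use a proper init job)
  config.vm.provision "shell",
    inline: "nohup ipython notebook --ip=0.0.0.0 --notebook-dir=/home/vagrant/notebooks > /tmp/nb.log 2>&1 &"
end

Running vagrant up should then leave the notebook accessible at http://localhost:8888 in the host browser, with notebook files saved into the shared notebooks/ directory on the host.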
I also wonder about the extent to which the various MOOCs have made use of VMs… And whether there is an argument to be had in favour of “course boxes” in general…?
A quick recipe for extracting images embedded in PDFs (and in particular, extracting photos contained with PDFs…).
For example, Shell Nigeria has a site that lists oil spills along with associated links to PDF docs that contain photos corresponding to the oil spill:
Running an import.io scraper over the site can give a list of all the oil spills along with links to the corresponding PDFs. We can trawl through these links, downloading the PDFs and extracting the images from them.
import os,re
import urllib2
#New OU course will start using pandas, so I need to start getting familiar with it.
#In this case it's overkill, because all I'm using it for is to load in a CSV file...
import pandas as pd

#url='http://s01.static-shell.com/content/dam/shell-new/local/country/nga/downloads/pdf/oil-spills/967426_BenisedeWell11_flowline_at_Amabulou_Photos.pdf'

#Load in the data scraped from Shell
df= pd.read_csv('shell_30_11_13_ng.csv')

errors=[]
#For each line item, grab the PDF URL
#(the original column index was mangled in formatting - the last column is assumed to hold the PDF link)
for url in df[df.columns[-1]]:
    try:
        print 'trying',url
        u = urllib2.urlopen(url)
        fn=url.split('/')[-1]
        #Grab a local copy of the downloaded picture containing PDF
        localFile = open(fn, 'w')
        localFile.write(u.read())
        localFile.close()
    except:
        print 'error with',url
        errors.append(url)
        continue
    #If we look at the filenames/urls, the filenames tend to start with the JIV id
    #...so we can try to extract this and use it as a key
    id=re.split(r'[_-]',fn)[0]
    #I'm going to move the PDFs and the associated images stripped from them into separate folders
    fo='data/'+id
    os.system(' '.join(['mkdir',fo]))
    idp='/'.join([fo,id])
    #Try to cope with crappy filenames containing punctuation chars
    fn= re.sub(r'([()&])', r'\\\1', fn)
    #THIS IS THE LINE THAT PULLS OUT THE IMAGES
    #Available via poppler-utils
    #See: http://ubuntugenius.wordpress.com/2012/02/04/how-to-extract-images-from-pdf-documents-in-ubuntulinux/
    #Note: the '; mv' etc etc bit copies the PDF file into the new JIV report directory
    cmd=' '.join(['pdfimages -j',fn, idp, '; mv',fn,fo ])
    os.system(cmd)

#Still a couple of errors on filenames
#just as quick to catch by hand/inspection of files that don't get moved properly
print 'Errors',errors
Images in the /data directory at: https://github.com/psychemedia/ScoDa_oil/tree/master/shell-ng
The important line of code in the above is:
pdfimages -j FILENAME OUTPUT_STUB
FILENAME is the PDF you want to extract the images from; OUTPUT_STUB sets the main part of the name of the image files. pdfimages is actually a command line tool, which is why we need to run it from the Python script using the os.system call. (I’m running on a Mac – I have no idea how this might work on a Windows machine!)
pdfimages can be downloaded as part of poppler (I think?!) – it’s in the poppler-utils package on Ubuntu, and Homebrew’s poppler formula includes it on a Mac.
See also this Stack Exchange question/answer: Extracting images from a PDF
PS to put this data to work a little, I wondered about using the data to generate a WordPress blog with one post per spill.
http://python-wordpress-xmlrpc.readthedocs.org/en/latest/examples/media.html provides a Python API. First thoughts were:
- generate a post containing images and body text made up from data in the associated line of the CSV file. A typical record looks like this:
Date Reported: 02-Jan-13
Incident Site: 10″ Diebu Creek – Nun River Pipeline at Onyoma
JIV Date: 05-Jan-13
Terrain: Swamp
Cause: Sabotage/Theft
Estimated Spill Volume (bbl): 65
Clean-up Status: Recovery of spilled volume commenced on 6th January 2013 and was completed on 22nd January 2013. Cleanup of residual impacted area was completed on 9th May 2013.
Comments: Site Certification was completed on 28th June 2013.
Photo: http://s06.static-shell.com/content/dam/shell-new/local/country/nga/downloads/pdf/oil-spills/911964_10in_DiebuCreek-NunRiver_pipeline_at_Onyoma_Photos.pdf
So we can pull this out for the body post. We can also parse the image PDF filename to get the JIV ID. We don’t have lat/long (nor northing/easting) though, so no maps unless we try a crude geocoding of the incident site column (column 2).
A lot of the incidents appear to start with a pipe diameter, so we can maybe pull this out too (eg 10″ in the example above).
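A minimal sketch of that (assuming the leading character really is a double-prime in the data, which would need checking):

# -*- coding: utf-8 -*-
import re

#Incident sites often lead with a pipe diameter, eg 10″ or 8″
def pipe_diameter(site):
    #Match a leading number followed by a double-prime or a straight double quote
    m = re.match(ur'^(\d+)\s*[″"]', site)
    return int(m.group(1)) if m else None

print pipe_diameter(u'10″ Diebu Creek – Nun River Pipeline at Onyoma')  #prints 10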
We can use things like the cause, terrain, est. spill volume (as a range?), and maybe also an identified pipe diameter, to create tags or categories for the post. This allows us to generate views over particular posts (eg all posts relating to theft/sabotage).
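Pulling those ideas together, a minimal posting sketch using the python-wordpress-xmlrpc API mentioned above might look something like this (untested – the blog URL, credentials and column names are placeholders guessed from the record shown earlier):

from wordpress_xmlrpc import Client, WordPressPost
from wordpress_xmlrpc.methods.posts import NewPost
import pandas as pd

#Placeholder blog location and credentials
wp = Client('http://example.wordpress.com/xmlrpc.php', 'username', 'password')

df = pd.read_csv('shell_30_11_13_ng.csv')
#Column names here are guesses based on the record shown above
for ix, row in df.iterrows():
    post = WordPressPost()
    post.title = row['Incident Site']
    post.content = 'Date reported: ' + str(row['Date Reported']) + '<br/>' + str(row['Clean-up Status'])
    #Use cause and terrain as tags so we can generate views over particular posts
    post.terms_names = {'post_tag': [row['Cause'], row['Terrain']], 'category': ['Oil spill reports']}
    post.post_status = 'publish'
    post.id = wp.call(NewPost(post))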
There are several dates contained in the data and we may be able to do something with these – eg to date the post, or maybe as the basis for a timeline view over all the data. We might also be able to start collecting stats on eg the difference between the date reported (col 1) and the JIV date (col 3), or, where we can scrape it, look for structure in the clean-up status field. For example:
Recovery of spilled volume commenced on 6th January 2013 and was completed on 22nd January 2013. Cleanup of residual impacted area was completed on 9th May 2013.
If those phrases are common/templated refrains, we can parse the corresponding dates out?
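If so, a crude sketch of the kind of pattern matching that might work (assuming the phrasing really is templated, which would need checking across the full dataset):

import re
from dateutil import parser

status = "Recovery of spilled volume commenced on 6th January 2013 and was completed on 22nd January 2013. Cleanup of residual impacted area was completed on 9th May 2013."

#Pull out "<verb> on <ordinal day> <month name> <year>" phrases
for verb, datestr in re.findall(r'(commenced|completed) on (\d{1,2}(?:st|nd|rd|th) \w+ \d{4})', status):
    #Strip the ordinal suffix so the date string parses cleanly
    print verb, parser.parse(re.sub(r'(?<=\d)(st|nd|rd|th)', '', datestr)).date()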
I should probably also try to pull out the caption text from the image PDF [DONE in code on github] and associate it with a given image? This would be useful for any generated blog post too?
Via @wilm, I notice that it’s time again for someone (this time at the Wall Street Journal) to have written about the scariness that is your Google personal web history (the sort of thing you probably have to opt out of when you sign up for a new Google account, if other recent opt-in-by-default settings are anything to go by…)
It may not sound like much, but if you do have a Google account, and your web history collection is not disabled, you may find your emotional response to seeing months or years of your web/search history archived in one place surprising… Your Google web history.
Not mentioned in the WSJ article were some of the games that the Chrome browser gets up to. @tim_hunt tipped me off to a nice (if technically detailed, in places) review by Ilya Grigorik of some of the design features of the Chrome browser, and some of the tools built into it: High Performance Networking in Chrome. I’ve got various pre-fetching tools switched off in my version of Chrome (tools that allow Chrome to pre-emptively look up web addresses and even download pages pre-emptively*), so those tools didn’t work for me… but it was interesting to look at chrome://predictors/ to see which of the keystrokes I type are good predictors of the web pages I visit…
* By the by, I started to wonder whether webstats get messed up to any significant extent by Chrome pre-emptively prefetching pages that folk never actually look at…?
In further relation to the tracking of the traffic we generate from our browsing habits, as we access more and more web/internet services through satellite TV boxes, smart TVs, and catch-up TV boxes such as Roku or NowTV, have you ever wondered about how that activity is tracked? LG Smart TVs logging USB filenames and viewing info to LG servers describes how LG TVs appear to log not only the things you view, but also the personal media you might play, and can in principle phone that information home (because the home for your data is a database run by whatever service you happen to be using – your data is midata is their data).
there is an option in the system settings called “Collection of watching info:” which is set ON by default. This setting requires the user to scroll down to see it and, unlike most other settings, contains no “balloon help” to describe what it does.
At this point, I decided to do some traffic analysis to see what was being sent. It turns out that viewing information appears to be being sent regardless of whether this option is set to On or Off.
you can clearly see that a unique device ID is transmitted, along with the Channel name … and a unique device ID.
This information appears to be sent back unencrypted and in the clear to LG every time you change channel, even if you have gone to the trouble of changing the setting above to switch collection of viewing information off.
It was at this point, I made an even more disturbing find within the packet data dumps. I noticed filenames were being posted to LG’s servers and that these filenames were ones stored on my external USB hard drive.
Hmmm… maybe it’s time I switched out my BT homehub for a proper hardware firewalled router with a good set of logging tools…?
PS FWIW, I can’t really get my head round how evil on the one hand, or how damp a squib on the other, the whole midata thing is turning out to be in the short term, and what sorts of involvement – and data – the partners have with the project. I did notice that a midata innovation lab report has just become available, though to you and me it’ll cost 1500 squiddly diddlies so I haven’t read it: The midata Innovation Opportunity. Note to self: has anyone got any good stories to tell about TSB supporting innovation in micro-businesses…?
PPS And finally, something else from the Ilya Grigorik article:
The HTTP Archive project tracks how the web is built, and it can help us answer this question. Instead of crawling the web for the content, it periodically crawls the most popular sites to record and aggregate analytics on the number of used resources, content types, headers, and other metadata for each individual destination. The stats, as of January 2013, may surprise you. An average page, amongst the top 300,000 destinations on the web is:
- 1280 KB in size
- composed of 88 resources
- connects to 15+ distinct hosts
Is it any wonder that pages take so long to load on a mobile phone over a 3G network, and that you can soon eat up your monthly bandwidth allowance?
A picture may be worth a thousand words, but whilst many of us may get a pre-attentive gut reaction reading from a data set visualised using a chart type we’re familiar with, how many of us actually take the time to read a chart thoroughly and maybe verbalise, even if only to ourselves, what the marks on the chart mean, and how they relate to each other? (See How fertility rates affect population for an example of how to read a particular sort of chart.)
An idea that I’m finding increasingly attractive is the notion of text visualisation (or text visualization for the US-English imperialistic searchbots). That is, the generation of mechanical text from data tables so we can read words that describe the numbers – and how they relate – rather than looking at pictures of them or trying to make sense of the table itself.
Here’s a quick example of the sort of thing I mean – the generation of this piece of text:
The total number of people claiming Job Seeker’s Allowance (JSA) on the Isle of Wight in October was 2781, up 94 from 2687 in September, 2013, and down 377 from 3158 in October, 2012.
from a data table that can be sliced like this:
In the same way that we make narrative decisions when it comes to choosing what to put into a data visualisation, as well as how to read it (and how the various elements displayed in it relate to each other), so we make choices about the textual, or narrative, mapping from the data set to the text version (that is, the data textualisation) of it. When we present a chart or data table to a reader, we can try to influence their reading of it in a variety of ways: by choosing the sort order of bars on a bar chart, or of rows in a table, for example; or by highlighting one or more elements in a chart or table through the use of colour, font, transparency, and so on.
The actual reading of the chart or table is still largely under the control of the reader, however, and may be thought of as non-linear in the sense that the author of the chart or table can’t really control the order in which the various attributes of the table or chart, or relationships between the various elements, are encountered by the reader. In a linear text, however, the author retains a far more significant degree of control over the exposition, and the way it is presented to the reader.
There is thus a considerable amount of editorial judgement put into the mapping from a data table to text interpretations of the data contained within a particular row, or down a column, or from some combination thereof. The selection of the data points and how the relationships between them are expressed in the sentences formed around them directs attention in terms of how to read the data in a very literal way.
There may also be a certain amount of algorithmic analysis used along the way as sentences are constructed from looking at the relationships between different data elements: “up 94” is a representation (both in the sense of rep-resentation and re-presentation) of a month-on-month change of +94; “down 377” is generated mechanically from a year-on-year comparison.
Every cell in a table may be a fact that can be reported, but there are many more stories to be told by comparing how different data elements in a table stand in relation to each other.
The area of geekery related to this style of computing is known as NLG – natural language generation – but I’ve not found any useful code libraries (in R or Python, preferably…) for messing around with it. (The JSA example above was generated using R as a proof of concept around generating monthly press releases from ONS/nomis jobs figures.)
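For what it’s worth, the mechanical part is simple enough to sketch – here’s roughly the shape of the JSA example, redone in Python (the R version did much the same; the figures are the ones from the sentence above):

def change_phrase(current, previous):
    #Express a difference as eg 'up 94 from 2687' or 'down 377 from 3158'
    diff = current - previous
    if diff == 0:
        return 'unchanged from {0}'.format(previous)
    return '{0} {1} from {2}'.format('up' if diff > 0 else 'down', abs(diff), previous)

#Figures from the example above: this month, last month, same month last year
current, last_month, last_year = 2781, 2687, 3158
print ("The total number of people claiming Job Seeker's Allowance (JSA) "
       "on the Isle of Wight in October was {0}, {1} in September, 2013, "
       "and {2} in October, 2012.").format(current,
                                           change_phrase(current, last_month),
                                           change_phrase(current, last_year))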
PS why “data textualisation”, when we can consider even graphical devices as “texts” to be read? I considered “data characterisation” in the sense of turning data into characters, but characterisation is a more general term. Data narration was another possibility, but those crazy Americans patenting everything that moves might think I was “stealing” ideas from Narrative Science. Narrative Science (as well as Data2Text and Automated Insights etc. (who else should I mention?)) are certainly interesting, but I have no idea how any of them do what they do. And in terms of narrating data stories – I think that’s a higher level process than the mechanical textualisation I want to start with. Which is not to say I don’t also have a few ideas about how to weave a bit of analysis into the textualisation process too…
Picking up on Political Representation on BBC Political Q&A Programmes – Stub, the quickest of hacks…
In OpenRefine, create a new project by importing data from a couple of URLs – data from the BBC detailing episode IDs for Any Questions and Question Time:
Import the data as XML, highlighting a single programme code row as the import element.
The data we get looks like this – /programmes/b007ck8s#programme – so we can add a column by fetching URLs based on 'http://www.bbc.co.uk'+value.split('#')[0]+'.json' to get JSON data back for each row.
Parse the JSON that comes back using something like value.parseJson()['programme']['medium_synopsis'] to create a new column containing the medium synopsis information.
The medium synopsis elements typically look like Topical debate from Colchester, with David Dimbleby. On the panel are Peter Hain, Sir Menzies Campbell, Francis Maude, singer Beverley Knight and journalist Cristina Odone. Which is to say they often contain the names of the panellists.
We can try to extract the names contained within each synopsis using the Zemanta API (key required) accessed via the Named-Entity Recognition extension for Google Refine / OpenRefine.
These seem to come back in reconciliation API form with the name set to a name and the id to a URL. We can get a concatenated list of the URLs that are returned by creating a column around something like this: forEach(cell.recon.candidates,v,v.id).sort().join('||') but I’m not sure that’s useful.
We can create a column based just around the matched name using cell.recon.match.name.
Let’s use the row view and fill down on programme IDs, then have a look at a duplicate facet and view only rows that are duplicated (that is, where an extracted named entity appears more than once). We can also use a text facet to see which names appear in multiple episodes of Question Time and/or Any Questions.
Selecting a single name allows us to see the programmes that person appeared on. If we pull out the time of first broadcast (value.parseJson()['programme']['first_broadcast_date']) and Edit Cells-Common Transforms-To date, we can also use a date facet to select out programmes first broadcast within a particular date range.
We can also run a text filter to limit records to episodes including a particular person and then use the Date facet to highlight the episodes in which they appeared on the timeline:
What this suggests is that we can use OpenRefine as some sort of ‘application shell’ for creating information tools around a particular dataset without actually having to build UI components ourselves?
If we custom export a table using programme IDs and matched names, and then rename the columns Source and Target, we can visualise them in something like Gephi (you can use the recipe described in the second part of this walkthrough: An Introduction to Mapping Company Networks Using Gephi and OpenCorporates, via OpenRefine).
The directed graph we load into Gephi connects entities (participant names, location names) with programme IDs. There is a handy tool – Multimode Networks Projection – that can collapse the graph so that entities are connected to other entities they shared a programme ID with.
(If you forget to remove the programme nodes, a degree range filter to select only nodes with degree greater than 2 tidies the graph up.)
If we run PageRank on the graph (now as an undirected graph), layout using ForceAtlas2 and size nodes according to PageRank, we can look into the heart of the UK political establishment as evidenced by appearances on Question Time and Any Questions.
The next step would probably be to try to pull info about each recognised entity from dbPedia (forEach(cell.recon.candidates,v,v.id).sort() seems to pull out dbpedia URIs) but grabbing data from dbPedia seems to be borked in my version of OpenRefine atm:-(
Anyway – a quick hack that took longer to write up than it did to do…
OpenRefine project file here.
It’s too nice a day to be inside hacking around with Parliament data as a remote participant in today’s Parliamentary hack weekend (resource list), but if it had been a wet weekend I might have toyed with one of the following:
- revisiting this cleaning script for Analysing UK Lobbying Data Using OpenRefine (actually, a look at who finds/offers support for All Party Groups. The idea was to get a dataset of people who provide secretariat and funds to APPGs, as well as who works for them, and then do something with that dataset…)
- tinkering with data from Question Time and Any Questions…
On that last one:
This gives us generatable URLs for programmes by month, with URLs of the form http://www.bbc.co.uk/programmes/b006t1q9/broadcasts/2013/01 – but how do we get a JSON version of that?! Adding .json on the end doesn’t work?!:-( UPDATE – this could be a start, via @nevali – use the pattern /programmes/PID.rdf, such as http://www.bbc.co.uk/programmes/b006qgvj.rdf
We can get bits of information (albeit in semi-structured form) about panellists from programme URL hacks like this: http://www.bbc.co.uk/programmes/b007m3c1.json
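For example, a quick sketch of pulling out a couple of the fields used in the OpenRefine recipe above:

import urllib2, json

#Grab the JSON version of an episode page
url = 'http://www.bbc.co.uk/programmes/b007m3c1.json'
programme = json.load(urllib2.urlopen(url))['programme']

#Key paths as used in the OpenRefine recipe above
print programme['medium_synopsis']
print programme['first_broadcast_date']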
Note that some older programmes don’t list all the panellists in the data? So a visit to Wikipedia – http://en.wikipedia.org/wiki/List_of_Question_Time_episodes#2007 – may be in order for Question Time (there isn’t a similar page for Any Questions?)
Given panellists (the BBC could be more helpful here in the way it structures its data…), see if we can identify parliamentarians (MP suffix? Lord/Lady title?) and look them up using the new-to-me, not-yet-played-with-it UK Parliament – Members’ Names Data Platform API. Not sure if reconciliation works on parliamentarian lookup (indeed, not sure if there is a reconciliation service anywhere for looking up MPs, members of the House of Lords, etc?)
From Members’ Names API, we can get things like gender, constituency, whether or not they were holding a (shadow) cabinet post, maybe whether they were on a particular committee at the time etc. From programme pages, we may be able to get the location of the programme recording. So this opens up possibility of mapping geo-coverage of Question Time/Any Questions, both in terms of where the programmes visit as well as which constituencies are represented on them.
If we were feeling playful, we could also have a stab at which APPGs have representation on those programmes!
It also suggests a simpler hack – of just providing a briefing around the representatives appearing on a particular episode in terms of their current (or at the time) parliamentary status (committees, cabinet positions, APPGs etc etc)?