OpenRefine Docker Containers

I had a go at building a couple of docker containers for OpenRefine, one from the latest release (psychemedia/docker-openrefine) and one from the latest code on github (psychemedia/openrefine).

In order to create the virtual machine, you should:

  • install boot2docker
  • run boot2docker
  • Either: to run with a project directory solely within the container, in the boot2docker terminal, enter the command docker run --name openrefine -d -p 3334:3333 psychemedia/openrefine
  • Or: to run with a project directory mounted from a shared folder on the host, in the boot2docker terminal, enter the command docker run -d -p 3334:3333 -v /path/to/yourSharedDirectory:/mnt/refine --name openrefine psychemedia/openrefine
  • Or: to run with a project directory in a linked data volume, in the boot2docker terminal, enter the command docker run -d -p 3334:3333 -v /mnt/refine --name openrefine psychemedia/openrefine

(To use the latest release rather than a recent build use psychemedia/docker-openrefine rather than psychemedia/openrefine.)

The port number you will be able to find OpenRefine on is given by the first number set in the flag -p NNNN:3333. To access OpenRefine via port 3334, use -p 3334:3333 etc.

OpenRefine will then be available via your browser at the URL http://IPADDRESS:NNNN. The required value of IPADDRESS can be found using the command boot2docker ip

The returned IP address (eg 192.168.59.103) is the IP address you can find OpenRefine on, for example: http://192.168.59.103:3334.

PS To build an image from a Dockerfile in the current directory, use something like:

docker build -t myname/mycontainer .

Recreational Data: Data Golf

I’m still hopeful of working up the idea of recreational data as a popular pastime activity with a regular column somewhere and a stocking filler book each Christmas (?!;-), but haven’t had much time to commit to working up some great examples lately:-(

However, here’s a neat idea – data golf – as described in a post by Bogumił Kamiński (RGolf) that I found via RBloggers:

There are many code golf sites, even some support R. However, most of them are algorithm oriented. A true RGolf competition should involve transforming a source data frame to some target format data frame.

So the challenge today will be to write a shortest code in R that performs a required data transformation

An example is then given of a data reshaping/transformation problem based on a real data task (wrangling survey data, converting it from a long to a wide format in the smallest amount of R code).

Of course, R need not be the only language that can be used to play this game. For the course I’m currently writing, I think I’ll pitch data golf as a Python/pandas activity in the section on data shaping. OpenRefine also supports a certain number of reshaping transformations, so that’s another possible data golf course(?). As are spreadsheets. And so on…
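By way of a taster, here's a minimal pandas sketch of the sort of long-to-wide reshaping such a challenge might involve (the column names and values are made up purely for the example):

import pandas as pd

#Toy long format "survey" data - made up purely for the example
long_df = pd.DataFrame({
    "respondent": [1, 1, 2, 2],
    "question": ["q1", "q2", "q1", "q2"],
    "answer": [3, 5, 4, 2],
})

#Long to wide: one row per respondent, one column per question
wide_df = long_df.pivot(index="respondent", columns="question", values="answer")
print(wide_df)

The golf, of course, is in doing the same thing in as few characters as possible.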

Hmmm… thinks… pivot table golf?

Also related: string parsing/transformation or partial string extraction using regular expressions; for example, Regex Tuesday, or how about Regex Crossword.

Narrative Charts Tell the Tale…

A couple of days ago, I got a message from @fantasticlfe asking if I’d done any tinkerings around what turned out to be “narrative charts”. I kept misapprehending what he was after (something to do with continuity?!;-), so here’s a summary of various graphical devices for looking at narrative texts that we passed back and forth, along with some we didn’t..

A Sankey diagram typically uses variable thickness lines to show flow between different elements in a system. (For this reason it’s often used to show energy flows through a system, though it can also be used to good effect to show money flows.) The chart Michael linked to comes from xkcd:

xkcd narrative chart

In this chart, we have time along the horizontal x-axis. The y-axis is ambiguous (some sort of nominal ordering?) and the line thickness appears to represent army size.

To a certain extent, this diagram is reminiscent of Minard’s famous chart…

(See also What Makes a Minard? for some contemporary Minard diagrams. Is code available, I wonder?)

However, in the case of Minard’s chart (which I personally don’t like at all!), the x-y co-ordinates represent map co-ordinates – the thick lines aren’t thick lines in a line chart (which a glanced “up and to the right” view might make you assume), they’re flow lines across a map.

I got distracted for a while by the Sankey aspect, and dug around my own bits of code. For example, Generating Sankey Diagrams from rCharts, an rCharts wrapper for the d3.js Sankey diagram. Michael was particularly interested in being able to group lines vertically (though I wasn’t sure what the y-axis would actually correspond to: some loose function of “location”, maybe as a categorical variable? Time was definitely to be on the horizontal x-axis); a posting on Stack Overflow (d3 sankey charts – manually position node along x axis) seemed likely to be able to help with that.

I then started going off on one…

Would a variant of nltk style lexical dispersion plots help, using characters rather than word categories? That would show when a character was in scene, but not much else?

lexical dispersion
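(For the record, here's a rough sketch of how such a plot might be generated with nltk, assuming a plain text copy of the script or novel in a file called story.txt and a hand-picked list of character names:)

import nltk
from nltk.tokenize import word_tokenize
from nltk.text import Text

#nltk.download('punkt') may be needed the first time round
raw = open('story.txt').read()
text = Text(word_tokenize(raw))

#Plot the offsets at which each character name appears in the text
text.dispersion_plot(['Alice', 'Bob', 'Carol'])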

How about sentence drawing, in which we show “turns” taken by different speakers?

sentence drawing

This shows something, but again, not relevant…

Nor are Kurt Vonnegut’s shapes-of-stories diagrams that plot some sort of emotional state on y and time on x:

Hmmm… Michael wanted to be able to look at scenes on x and presumably some function of location on y. Hmm… why? And how might we actually order those axes? Scenes occur in order in a film or play, but scene is a ranked, ordinal value. That said, scenes also have duration in terms of screentime, which may or may not be the same as the “interval” that the scene portrays in terms of the world it represents (this must have a name? eg a 20 second screen time scene shows a plane flying and this represents x hours in the story). The scene may also have a ‘calendar time’ associated with it in the story – so where you have a flashback scene this corresponds to a previous calendar time in the represented world. Did Michael want any of these dimensions capturing?

Related to shapes of stories, here’s how someone analysed several thousand plots: Examining the arc of 100,000 stories: a tidy analysis.

And then there’s location… how should these be represented? Locations are a distance apart and, perhaps more importantly from a continuity point of view, a travel time apart; as well as maybe a timezone difference apart. Did that need capturing in any way? Ordering axes for this could be quite hard if we wanted close things in space (distance? travel time?) to be close together on a single axis (A is 10 minutes from B and C, B is ten minutes from C: how do you show that intransitive relation on a single dimension?) [Maybe relevant? Storygraph: Extracting patterns from spatio-temporal data, A Shrestha et al., Advances in Visual Computing.] Hmm… If we can capture distance between locations, and some sensible notion of time relating to scenes, could we maybe use line thickness to show that a person has lots of time to move between one (time, location) and another, as compared to scene time? Do filmwriters have tools to support this? Do the police…?! Is the Mythology Engine relevant?

How about thinking about it as a graph? I’ve used Gephi before as a foil for getting me to think about ordered series as connected events in a graph – for example, Visualising F1 Timing Sheet Data. If we encode scene number as the x-coordinate and location number as the y-coordinate, with each graph line being the connected series of scenes a particular individual is in, then we can simply use a line chart to connect “individual lines” to different scene and location numbers. We’d also have a couple of extra dimensions to play with – node size and node colour, at each location. We’d also have the opportunity to play with edge (that is, line) colour and edge thickness?
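As a rough sketch of the sort of thing I have in mind, here's a crude matplotlib fragment, using a made-up table of (scene number, location number) appearances per character rather than any real narrative data:

import matplotlib.pyplot as plt

#Made-up data: for each character, the (scene number, location number) pairs they appear in
appearances = {
    'Alice': [(1, 1), (2, 3), (4, 2), (5, 2)],
    'Bob': [(1, 1), (3, 4), (4, 2), (6, 1)],
}

for character, scenes in appearances.items():
    plt.plot([s for s, loc in scenes], [loc for s, loc in scenes], marker='o', label=character)

plt.xlabel('Scene number')
plt.ylabel('Location (arbitrary numeric code)')
plt.legend()
plt.show()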

Maybe I need to try to do some demos? But no time for that right now…

How about trying to find some? Here are some discovered via @jamesjefferies:

Here’s a view of connected (by travel between) locations in Game of Thrones:

game of thrones connected places

There’s also an animation of events in Game of Thrones, but I can’t quite figure out how to read it?!

game of thrones events

Let’s go back to the sort of thing Michael was after – narrative charts..

@imhelenj found a related if cluttered interactive describing the evolution of web tech:

web history narrative chart

Then Michael shared a link to Comic Book Narrative Charts, a project for automatically generating xkcd style narrative charts:

xkcd narrative chart d3js

Hovering over these charts, I noticed they were interactive d3.js charts. A quick View Source and the code for generating the chart dynamically from a characters file and a narrative file appeared to be there. Which I think is what Michael wanted all along…!

(By the by, the post also describes how the developers started thinking about fixing the vertical y-coordinate values. Here’s another example of someone thinking aloud around producing a narrative chart for the Holy Week story.)

Ho hum, an interesting set of detours nonetheless – and it got me thinking about the time-space complexity of a scene based tale that could keep me confused for weeks! :-)

PS this is quite interesting – visualising a process, via Tactical Tech Drawing By Numbers project:

visualise process

PPS some more bits: @r4isstatic points to Some visualisations of stories and narratives, another summary post similar to this one. Also via Paul Rissen, and picking up on whether the police have any interesting actor/event/time/location diagramming techniques, Vispol – An Interactive Scenario Visualization.

Elsewhere, I find Storyline Visualizations, which includes a paper (Design Considerations for Optimizing Storyline Visualizations, Y Tanahashi and K-L Ma, IEEE Trans on Visualisation and Computer Graphics, 18(12) 2012, pp2679-2688) and some python code.

PPPS Some more… A collection by Stewart McKie of techniques for visualising screenplays: Screenplay Visualization: Concepts and Practice. The posts I wrote on the Digital Worlds game design uncourse blog about narrative structure. Sort of via Scott Wilson, some crime analysis software from xanalys.com (Link Explorer – White Paper) which includes descriptions of an event chart, a transaction chart and an activity timeline:

xanalys event chart

xanalys transaction chart

xanalys activity timeline

Via the comments, this rather lovely animated discourse map:

trinker.github animated discourse map

Experimenting With R – Point to Point Mapping With Great Circles

I’ve started doodling again… This time, around maps, looking for recipes that make life easier plotting lines to connect points on maps. The most attractive maps seem to use great circles to connect one point with another, these providing the shortest path between two points when you consider the Earth as a sphere.

Here’s one quick experiment (based on the Flowing Data blog post How to map connections with great circles), for an R/Shiny app that allows you to upload a CSV file containing a couple of location columns (at least) and an optional “amount” column, and it’ll then draw lines between the points on each row.

greatcircle map demo

The app requires us to solve several problems, including:

  • how to geocode the locations
  • how to plot the lines as great circles
  • how to upload the CSV file
  • how to select the from and to columns from the CSV file
  • how to optionally select a valid numerical column for setting line thickness

Let’s start with the geocoder. For convenience, I’m going to use the Google geocoder via the geocode() function from the ggmap library.

#Locations are in two columns, *fr* and *to* in the *dummy* dataframe
#If locations are duplicated in from/to columns, dedupe so we don't geocode same location more than once
locs=data.frame(place=unique(c(as.vector(dummy[[fr]]),as.vector(dummy[[to]]))),stringsAsFactors=F)
#Run the geocoder against each location, then transpose and bind the results into a dataframe
cbind(locs, t(sapply(locs$place,geocode, USE.NAMES=F))) 

The locs data frame contains a single place column of unique locations:

                    place
1              London, UK
2            Cambridge,UK
3            Paris,France
4       Sydney, Australia
5           Paris, France
6             New York,US
7 Cape Town, South Africa

The sapply(locs$place,geocode, USE.NAMES=F) function returns data that looks like:

    [,1]       [,2]     [,3]     [,4]      [,5]     [,6]      [,7]     
lon -0.1254872 0.121817 2.352222 151.207   2.352222 -74.00597 18.42406 
lat 51.50852   52.20534 48.85661 -33.86749 48.85661 40.71435  -33.92487

The transpose (t()) gives us:

     lon        lat      
[1,] -0.1254872 51.50852 
[2,] 0.121817   52.20534 
[3,] 2.352222   48.85661 
[4,] 151.207    -33.86749
[5,] 2.352222   48.85661 
[6,] -74.00597  40.71435 
[7,] 18.42406   -33.92487

The cbind() binds each location with its lat and lon value:

                    place        lon       lat
1              London, UK -0.1254872  51.50852
2            Cambridge,UK   0.121817  52.20534
3            Paris,France   2.352222  48.85661
4       Sydney, Australia    151.207 -33.86749
5           Paris, France   2.352222  48.85661
6             New York,US  -74.00597  40.71435
7 Cape Town, South Africa   18.42406 -33.92487

Code that provides a minimal example for uploading the data from a CSV file on the desktop to the Shiny app, then creating dynamic drop lists containing column names, can be found here: Simple file geocoder (R/shiny app).

The following snippet may be generally useful for getting a list of column names from a data frame that correspond to numerical columns:

#Get a list of column names for numerical columns in data frame df
nums <- sapply(df, is.numeric)
names(nums[nums])

The code for the full application can be found as a runnable gist in RStudio from here: R/Shiny app – great circle mapping. [In RStudio, install.packages("shiny"); library(shiny); runGist(9690079). The gist contains a dummy data file if you want to download it to try it out…]

Here’s the code explicitly…

The global.R file loads the necessary packages, installing them if they are missing:

#global.R

##This should detect and install missing packages before loading them - hopefully!
list.of.packages <- c("shiny", "ggmap","maps","geosphere")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)
lapply(list.of.packages,function(x){library(x,character.only=TRUE)}) 

The ui.R file builds the Shiny app’s user interface. The drop down column selector lists are populated dynamically with the names of the columns in the data file once it is uploaded. An optional Amount column can be selected – the corresponding list only displays the names of numerical columns. (The lists of location columns to be geocoded should really be limited to non-numerical columns.) The action button prevents the geocoding routines firing until the user is ready – select the columns appropriately before geocoding (error messages are not handled very nicely;-)

#ui.R
shinyUI(pageWithSidebar(
  headerPanel("Great Circle Map demo"),
  
  sidebarPanel(
    #Provide a dialogue to upload a file
    fileInput('datafile', 'Choose CSV file',
              accept=c('text/csv', 'text/comma-separated-values,text/plain')),
    #Define some dynamic UI elements - these will be lists containing file column names
    uiOutput("fromCol"),
    uiOutput("toCol"),
    #Do we want to make use of an amount column to tweak line properties?
    uiOutput("amountflag"),
    #If we do, we need more options...
    conditionalPanel(
      condition="input.amountflag==true",
      uiOutput("amountCol")
    ),
    conditionalPanel(
      condition="input.amountflag==true",
      uiOutput("lineSelector")
    ),
    #We don't want the geocoder firing until we're ready...
    actionButton("getgeo", "Get geodata")
    
  ),
  mainPanel(
    tableOutput("filetable"),
    tableOutput("geotable"),
    plotOutput("geoplot")
  )
))

The server.R file contains the server logic for the app. One thing to note is the way we isolate some of the variables in the geocoder reactive function. (Reactive functions fire when one of the external variables they contain changes. To prevent the function firing when a variable it contains changes, we need to isolate it. See the docs for more; for example, Shiny Lesson 7: Reactive outputs or Isolation: avoiding dependency.)

#server.R

shinyServer(function(input, output) {

  #Handle the file upload
  filedata <- reactive({
    infile <- input$datafile
    if (is.null(infile)) {
      # User has not uploaded a file yet
      return(NULL)
    }
    read.csv(infile$datapath)
  })

  #Populate the list boxes in the UI with column names from the uploaded file  
  output$toCol <- renderUI({
    df <-filedata()
    if (is.null(df)) return(NULL)
    
    items=names(df)
    names(items)=items
    selectInput("to", "To:",items)
  })
  
  output$fromCol <- renderUI({
    df <-filedata()
    if (is.null(df)) return(NULL)
    
    items=names(df)
    names(items)=items
    selectInput("from", "From:",items)
  })
  
  #If we want to make use of an amount column, we need to be able to say so...
  output$amountflag <- renderUI({
    df <-filedata()
    if (is.null(df)) return(NULL)
    
    checkboxInput("amountflag", "Use values?", FALSE)
  })

  output$amountCol <- renderUI({
    df <-filedata()
    if (is.null(df)) return(NULL)
    #Let's only show numeric columns
    nums <- sapply(df, is.numeric)
    items=names(nums[nums])
    names(items)=items
    selectInput("amount", "Amount:",items)
  })
  
  #Allow different line styles to be selected
  output$lineSelector <- renderUI({
    radioButtons("lineselector", "Line type:",
                 c("Uniform" = "uniform",
                   "Thickness proportional" = "thickprop",
                   "Colour proportional" = "colprop"))
  })
  
  #Display the data table - handy for debugging; if the file is large, need to limit the data displayed [TO DO]
  output$filetable <- renderTable({
    filedata()
  })
  
  #The geocoding bit... Isolate variables so we don't keep firing this...
  geodata <- reactive({
    if (input$getgeo == 0) return(NULL)
    df=filedata()
    if (is.null(df)) return(NULL)
    
    isolate({
      dummy=filedata()
      fr=input$from
      to=input$to
      locs=data.frame(place=unique(c(as.vector(dummy[[fr]]),as.vector(dummy[[to]]))),stringsAsFactors=F)      
      cbind(locs, t(sapply(locs$place,geocode, USE.NAMES=F))) 
    })
  })

  #Weave the geocoded data into the data frame we made from the CSV file
  geodata2 <- reactive({
    if (input$getgeo == 0) return(NULL)
    df=filedata()
    if (input$amountflag != 0) {
      maxval=max(df[input$amount],na.rm=T)
      minval=min(df[input$amount],na.rm=T)
      df$b8g43bds=10*df[input$amount]/maxval
    }
    gf=geodata()
    df=merge(df,gf,by.x=input$from,by.y='place')
    merge(df,gf,by.x=input$to,by.y='place')
  })
  
  #Preview the geocoded data
  output$geotable <- renderTable({
    if (input$getgeo == 0) return(NULL)
    geodata2()
  })
  
  #Plot the data on a map...
  output$geoplot<- renderPlot({
    if (input$getgeo == 0) return(map("world"))
    #Method pinched from: http://flowingdata.com/2011/05/11/how-to-map-connections-with-great-circles/
    map("world")
    df=geodata2()
    
    pal <- colorRampPalette(c("blue", "red"))
    colors <- pal(100)
    
    for (j in 1:nrow(df)){
      inter <- gcIntermediate(c(df[j,]$lon.x[[1]], df[j,]$lat.x[[1]]), c(df[j,]$lon.y[[1]], df[j,]$lat.y[[1]]), n=100, addStartEnd=TRUE)

      #We could possibly do more styling based on user preferences?
      if (input$amountflag == 0) lines(inter, col="red", lwd=0.8)
      else {
        if (input$lineselector == 'colprop') {
          colindex <- round( (df[j,]$b8g43bds[[1]]/10) * length(colors) )
          lines(inter, col=colors[colindex], lwd=0.8)
        } else if (input$lineselector == 'thickprop') {
          lines(inter, col="red", lwd=df[j,]$b8g43bds[[1]])
        } else lines(inter, col="red", lwd=0.8)
      } 
    } 
  })

})

So that’s the start of it… this app could be further developed in several ways, for example allowing the user to filter or colour displayed lines according to factor values in a further column (commodity type, for example), or produce a lattice of maps based on facet values in a column.

I also need to figure out how to save maps, and maybe produce zoomable ones. If the geocoded points all lay within a bounding box limited to a particular geographical area, scaling the map view to show just that area might be useful.

Other techniques might include using proportional symbols (circles) at line landing points to show the sum of values incoming to that point, or the sum of values outgoing, or the difference between the two (maybe use one colour for incoming and another for outgoing, then size by the absolute difference?).

Recreational Data

Part of my weekend ritual is to buy the weekend papers and have a go at the recreational maths problems that are Sudoku and Killer. I also look for news stories with a data angle that might prompt a bit of recreational data activity…

In a paper that may or may not have been presented at the First European Congress of Mathematics in Paris, July, 1992, Prof. David Singmaster reflected on “The Unreasonable Utility of Recreational Mathematics”.

unreasonableUtility

To begin with, it is worth considering what is meant by recreational mathematics.

First, recreational mathematics is mathematics that is fun and popular – that is, the problems should be understandable to the interested layman, though the solutions may be harder. (However, if the solution is too hard, this may shift the topic from recreational toward the serious – e.g. Fermat’s Last Theorem, the Four Colour Theorem or the Mandelbrot Set.)

Secondly, recreational mathematics is mathematics that is fun and used as either as a diversion from serious mathematics or as a way of making serious mathematics understandable or palatable. These are the pedagogic uses of recreational mathematics. They are already present in the oldest known mathematics and continue to the present day.

These two aspects of recreational mathematics – the popular and the pedagogic – overlap considerably and there is no clear boundary between them and “serious” mathematics.

How is recreational mathematics useful?

Firstly, recreational problems are often the basis of serious mathematics. The most obvious fields are probability and graph theory where popular problems have been a major (or the dominant) stimulus to the creation and evolution of the subject. …

Secondly, recreational mathematics has frequently turned up ideas of genuine but non-obvious utility. …

Anyone who has tried to do anything with “real world” data knows how much of a puzzle it can represent: from finding the data, to getting hold of it, to getting it into a state and a shape where you can actually work with it, to analysing it, charting it, looking for pattern and structure within it, having a conversation with it, getting it to tell you one of the many stories it may represent, there are tricks to be learned and problems to be solved. And they’re fun.

An obvious definition [of recreational mathematics] is that it is mathematics that is fun, but almost any mathematician will say that he enjoys his work, even if he is studying eigenvalues of elliptic differential operators, so this definition would encompass almost all mathematics and hence is too general. There are two, somewhat overlapping, definitions that cover most of what is meant by recreational mathematics.

…the two definitions described above.

So how might we define “recreational data”? For me, recreational data activities are, in whole or in part, data investigations, involving one or more steps of the data lifecycle (discovery, acquisition, cleaning, analysis, visualisation, storytelling). They are the activities I engage in when I look for, or behind, the numbers that appear in a news story. They’re the stories I read on FullFact, or listen to on the OU/BBC co-pro More or Less; they’re at the heart of the beautiful little book that is The Tiger That Isn’t; recreational data is what I do in the “Diary of a Data Sleuth” posts on OpenLearn.

Recreational data is about the joy of trying to find stories in data.

Recreational data is, or can be, the data journalism you do for yourself or the sense you make of the stats in the sports pages.

Recreational data is a safe place to practice – I tinker with Twitter and formulate charts around Formula One. But remember this: “recreational problems are often the basis of serious [practice]”. The “work” I did around Visualising Twitter User Timeline Activity in R? I can (and do) reuse that code as the basis of other timeline analyses. The puzzle of plotting connected concepts on Wikipedia I described in Visualising Related Entries in Wikipedia Using Gephi? It’s a pattern I can keep on playing with.

If you think you might like to do some doodle of your own with some data, why not check out the School Of Data. Or watch out on OpenLearn for some follow up stories from the OU/BBC co-pro of Hans Rosling’s award winning Don’t Panic

Quoting Tukey on Visual Storytelling with Data

Time was when I used to be a reasonably competent scholar, digging into the literature chasing down what folk actually said, and chasing forward to see whether claims had been refuted. Then I fell out of love with the academic literature – too many papers that said nothing, too many papers that contained errors, too many papers…

…but as we start production on a new OU course on “data”, I’m having to get back in to the literature so I can defuse any claims that what I want to say is wrong by showing that someone else has said it before (which, if what is said has been peer reviewed, makes it right…).

One thing I’d forgotten about chasing thought lines through the literature was that at times it can be quite fun… and that it can quite often turn up memorable, or at least quotable, lines.

For example, last night I had a quick skim through some papers by folk hero in the statistics world, John Tukey, and turned up the following:

The habit of building one technique on another — of assembling procedures like something made of erector-set parts — can be especially useful in dealing with data. So too is looking at the same thing in many ways or many things in the same way; an ability to generalize in profitable ways and a liking for a massive search for order. Mathematicians understand how subtle assumptions can make great differences and are used to trying to trace the paths by which this occurs. The mathematician’s great disadvantage in approaching data is his—or her—attitude toward the words “hypothesis” and “hypotheses”.

When you come to deal with real data, formalized models for its behavior are not hypotheses in the mathematician’s sense… . Instead these formalized models are reference situations—base points, if you like — things against which you compare the data you actually have to see how it differs. There are many challenges to all the skills of mathematicians — except implicit trust in hypotheses — in doing just this.

Since no model is to be believed in, no optimization for a single model can offer more than distant guidance. What is needed, and is never more than approximately at hand, is guidance about what to do in a sequence of ever more realistic situations. The analyst of data is lucky if he has some insight into a few terms of this sequence, particularly those not yet mathematized.

Picturing of Data

Picturing of data is the extreme case. Why do we use pictures? Most crucially to see behavior we had not explicitly anticipated as possible – for what pictures are best at is revealing the unanticipated; crucially, often as a way of making it easier to perceive and understand things that would otherwise be painfully complex. These are the important uses of pictures.

We can, and too often do, use picturing unimportantly, often wastefully, as a way of supporting the feeble in heart in their belief that something we have just found is really true. For this last purpose, when and if important, we usually need to look at a summary.

Sometimes we can summarize the data neatly with a few numbers, as when we report:
– a fitted line—two numbers,
– an estimated spread of the residuals around the “best” line—one more number,
– a confidence interval for the slope of the “best” line—two final numbers.

When we can summarize matters this simply in numbers, we hardly need a picture and often lose by going to it. When the simplest useful summary involves many more numbers, a picture can be very helpful. To meet our major commitment of asking what lies beyond, in the example asking “What is happening beyond what the line describes!”, a picture can be essential.

The main tasks of pictures are then:
– to reveal the unexpected,
– to make the complex easier to perceive.

Either may be effective for that which is important above all: suggesting the next step in analysis, or offering the next insight. In doing either of these there is much room for mathematics and novelty.

How do we decide what is a “picture” and what is not? The more we feel that we can “taste, touch, and handle” the more we are dealing with a picture. Whether it looks like a graph, or is a list of a few numbers is not important. Tangibility is important—what we strive for most.

[Tukey, John W. “Mathematics and the picturing of data.” In Proceedings of the international congress of mathematicians, vol. 2, pp. 523-531. 1975.]

Wonderful!

Or how about these quotes from Tukey, John W. “Data-based graphics: visual display in the decades to come.” Statistical Science 5, no. 3 (1990): 327-339?

Firstly, on exploratory versus explanatory graphics:

I intend to treat making visual displays as something done by many people who want to communicate – often, on the one hand, to communicate identified phenomena to others, and often, on the other, to communicate unidentified phenomena to themselves. This broad clientele needs a “consumer product,” not an art course. To focus on a broad array of users is not to deny the existence of artists of visual communication, only to recognize how few they are and how small a share in the total volume of communication they can contribute. For such artists, very many statements that follow deserve escape clauses or caveats.

More thoughts on exploration and explanation (transfer of recognition), as well as a distinction between exploration and prospecting:

We all need to be clear that visual display can be very effective in serving two quite different functions, but only if used in correspondingly different ways. On the one hand, it can be essential in helping – or, even, in permitting – us to search in some data for phenomena, just as a prospector searches for gold or uranium.

Our task differs from the usual prospector’s task, in that we are concerned both with phenomena that do occur and with those that might occur but do not. On the other hand, visual display can be very helpful in transferring (to reader, viewer or listener) a recognition of the appearances that indicate the phenomena that deserve report. Indeed, when sufficient precomputation drives appropriately specialized displays, visual display can even also convey the statistical significances or non-significances of these appearances.

There is no reason why a good strategy for prospecting will also be a good strategy for transfer. We can expect some aspects of prospecting strategy (and most techniques) to carry over, but other aspects may not. (We say prospecting because we are optimistic that we may in due course have good lists of possible phenomena. If we do not know what we seek, we are “exploring” not “prospecting.”)

One major difference is in prospecting’s freedom to use multiple pictures. If it takes 5 or 10 kinds of pictures to adequately explore one narrow aspect of the data, the only question is: Will 5 or 10 pictures be needed, or can we condense this to 3 or 4 pictures, without appreciable loss? If “yes” we condense; if “no” we stick to the 5 or 10.

If it takes 500 to 1000 (quite different) pictures, however, our choice can only be between finding a relatively general way to do relatively well with many fewer pictures and asking the computer to sort out some number, perhaps 10 or 20 pictures of “greatest interest.”

In doing transfer, once we have one set of pictures to do what is needed, economy of paper (or plastic) and time (to mention or to read) push even harder toward “no more pictures than needed.” But, even here, we must be very careful not to insist, as a necessity not a desideratum, that a single picture can do it all. If it takes two pictures to convey the message effectively, we must use two.

For prospecting, we will want a bundle of pictures, probably of quite different kinds, so chosen that someone will reveal the presence of any one of possible phenomena, of potentially interesting behaviors, which will often have to be inadequately diverse. Developing, and improving, bundles of pictures for selected combinations of kinds of aspects and kinds of situations will be a continuing task for a long time.

For transfer, we will need a few good styles to transfer each phenomenon of possible importance, so that, when more than one phenomenon deserves transfer, we can choose compatible styles and try to transfer two, or even all, of these phenomena in a single picture. (We can look for the opportunity to do this, but, when we cannot find it, we will use two, or more, pictures as necessary.)

For prospecting, we look at long lists of what might occur – and expect to use many pictures. For transfer, we select short lists of what must be made plain – and use as few pictures as will serve us well.

Then on the power of visual representations, whilst recognising they may not always be up to the job…:

The greatest possibilities of visual display lie in vividness and inescapability of the intended message. A visual display can stop your mental flow in its tracks and make you think. A visual display can force you to notice what you never expected to see. (“Why, that scatter diagram has a hole in the middle!”) On the other hand, if one has to work almost as hard to drag something out of a visual display as one would to drag it out of a table of numbers, the visual display is a poor second to the table, which can easily provide so much more precision. (Here, as elsewhere, artists may deserve an escape clause.)

On design:

Another important aspect of impact is immediacy. One should see the intended at once; one should not even have to wait for it to gradually appear. If a visual display lacks immediacy in thrusting before us one of the phenomena for whose presentation it had been assigned responsibility, we ought to ask why and use the answer to modify the display so its impact will be more immediate.

(For a great example of how to progressively refine a graphic to support the making of a particular point, see this Storytelling With Data post on multifaceted data and story.)

Tukey, who we must recall was writing at a time when powerful statistical graphics tools such as ggplot were yet to be implemented, also suggests that lessons are to be learned from graphic design for the production of effective statistical charts:

The art of statistical graphics was for a long time a pen-and-pencil cottage industry, with the top professionals skilled with the drafting or mapping pen. In the meantime, graphic designers, especially for books, have had access to different sorts of techniques (the techniques of graphic communication), such as grays of different screen weights, against which, for instance, both white and black lines (and curves) are effective. They also have a set of principles shared in part with the Fine Arts (some written down by Leonardo da Vinci). I do not understand all this well enough to try to tell you about “visual centers” and how attention usually moves when one looks at a picture, but I do know enough to find this area important – and to tell you that more of us need to learn a lot more about it.

Data – what is it good for?

Almost everything we do with data involves comparison – most often between two or more values derived from the data, sometimes between one value derived from the data and some mental reference or standard. The dedication of Richard Hamming’s book on numerical analysis reads “The purpose of computation is insight, not numbers.” We need a book on visual display that at least implies “The purpose of display is comparison (recognition of phenomena), not numbers.”

Tukey also encourages us to think about what data represents, and how it is represented:

Much of what we want to know about the world is naturally expressed as phenomena, as potentially interesting things that can be described in non numerical words. That an economic growth rate has been declining steadily throughout President X’s administration, for example, is a phenomenon, while the fact that the GNP has a given value is a number. With exceptions like “I owe him 27 dollars!” numbers are, when we look deeply enough, mainly of interest because they can be assembled, often only through analysis, to describe phenomena. To me phenomena are the main actors, numbers are the supporting cast. Clearly we most need help with the main actors.

If you really want numbers, presumably for later assembly into a phenomenon, a table is likely to serve you best. The graphic map of Napoleon’s incursion into Russia that so stirs Tufte’s imagination and admiration does quite well in showing the relevant phenomena, in giving the answers to “About where?”, “About when?” and “With roughly what fraction of the original army left?” It serves certain phenomena well. But if we want numbers, we can do better either by reading the digits that may be attached to the graphic – a simple but often effective form of table – or by going to a conventional table.

The questions that visual display (in some graphic mode) answers best are phenomenological (in the sense of the first sentence of this section). For instance:

* Is the value small, medium or large?
* Is the difference, or change, up, down or neutral?
* Is the difference, or change, small, medium or large?
* Do the successive changes grow, shrink or stay roughly constant?
* What about change in ratio terms, perhaps thought of as percent of previous?
* Does the vertical scatter change, as we move from left to right?
* Is the scatter pattern doughnut-shaped?

One way that we will enhance the usefulness of visual display is to find new phenomena of potential interest and then learn how to make displays that will be likely to reveal them, when they are present.

The absence of a positive phenomenon is itself a phenomenon! Such absences as:

* the values are all about the same
* there does not seem to be any definite curvature
* the vertical scatter does not seem to change, as we go from left to right!

are certainly potentially interesting. (We can all find instances where they are interesting.) Thus they are, honestly, phenomena in themselves. We need to be able to view apparent absence of specific phenomena effectively as well as noticing them when they are present! This is one of the reasons why fitting scatter plots with summarizing devices like middle traces (Tukey, 1977a, page 279 ff.) can be important.

Phenomena are also picked up later in the paper:

A graph or chart should not be just another form of table, in which we can look up the facts. If it is to do its part effectively, its focus – or so I believe – will have to be one or more phenomena.

Indeed, the requirement that we can directly read values from a chart seems to be something Tukey takes issue with:

As one who denies “reading off numbers” as the prime purpose of visual display, I can only denounce evaluating displays in terms of how well (given careful study) people read numbers off. If such an approach were to guide us well, it would have to be a very unusual accident.

He even goes so far as to suggest that we might consider being flexible in the way we geometrically map from measurement scales to points on a canvas, moving away from proportionality if that helps us see a phenomenon better:

The purpose of display is to make messages about phenomena clear. There is no place for a doctrinaire approach to “truth in geometry.” We must be honest and say what we did, but this need not mean plotting raw data.

The point is that we have a choice, not only in (x, y)-plots, but more generally. Planned disproportionality needs to be a widely-available option, one that requires the partnership of computation and display.

Tukey is also willing to rethink how we use familiar charts:

Take simple bar charts as an example, which I would define much more generally than many classical authors. Why have they survived? Not because they are geometrically true, and not because they lead to good numerical estimates by the viewer! In my thoughts, their virtue lies in the fact that we can all compare two bars, perhaps only roughly, in two quite different ways, both “About how much difference?” and “About what ratio?” The latter, of course, is often translated into “About how much percent change?” (Going on to three or more successive bars, we can see globally whether the changes in amount are nearly the same, but asking the same question about ratios – rather than differences-requires either tedious assessment of ratios between adjacent bars for one adjacent pair after another…

So – what other Tukey papers should I read?

Creating Data from Text – Regular Expressions in OpenRefine

Although data can take many forms, when generating visualisations, running statistical analyses, or simply querying the data so we can have a conversation with it, life is often made much easier by representing the data in a simple tabular form. A typical format would have one row per item and particular columns containing information or values about one specific attribute of the data item. Where column values are text based, rather than numerical items or dates, it can also help if text strings are ‘normalised’, coming from a fixed, controlled vocabulary (such as items selected from a drop down list) or fixed pattern (for example, a UK postcode in its ‘standard’ form with a space separating the two parts of the postcode).
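(As a trivial illustration of that sort of normalisation, here's a crude Python fragment that forces a UK postcode into the "outward space inward" pattern; the rule used is a simplification rather than a proper postcode validator:)

import re

def normalise_postcode(raw):
    #Uppercase, strip all whitespace, then reinsert a single space before the final three characters
    s = re.sub(r'\s+', '', raw.upper())
    return s[:-3] + ' ' + s[-3:]

print(normalise_postcode('mk7  6aa'))  #MK7 6AA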

Tables are also quick to spot as data, of course, even if they appear in a web page or PDF document, where we may have to do a little work to get the data as displayed into a table we can actually work with in a spreadsheet or analysis package.

More often than not, however, we come across situations where a data set is effectively encoded into a more rambling piece of text. One of the testbeds I used to use a lot for practising my data skills was Formula One motor sport, and though I’ve largely had a year away from that during 2013, it’s something I hope to return to in 2014. So here’s an example from F1 of recreational data activity that provided a bit of entertainment for me earlier this week. It comes from the VivaF1 blog in the form of a collation of sentences, by Grand Prix, about the penalties issued over the course of each race weekend. (The original data is published via PDF based press releases on the FIA website.)

Viva F1 - penalties - messy data

The VivaF1 site also publishes a visualisation summarising penalty outcomes incurred by each driver:

VIVA F1 - DISPLAY OF PENALTIES

The recreational data puzzle I set myself was this: how can we get the data contained in the descriptive sentences about the penalties into a data table that could be used to ask questions about different rule infractions, and the penalty outcomes applied, and allow for the ready generation of visualisations around that data?

The tool I opted to use was OpenRefine; and the predominant technique for getting the data out of the sentences and in to data columns? Regular expressions. (If regular expressions are new to you, think: search and replace on steroids. There’s a great tutorial that introduces the basics here: Everday text patterns.)

What follows is a worked example that shows how to get the “data” from the VivaF1 site into a form that looks more like this:

data types

Not every row is as tidy as it could be, but there is often a trade off in tidying data between trying to automate every step, and automating steps that clean the majority of the data, leaving some rows to tidy by hand…

So where to start? The first step is getting the “data” into OpenRefine. To do this we can just select the text on the VivaF1 penalties-by-race page, copy it and paste it into the Clipboard import area of a new project in OpenRefine:

Paste in the data

We can then import the data as line items, ignoring blank lines:

import as line based files

The first step I’m going to take tidying up the data is to generate a column that contains the race name:

Pull out race name

The expression if(value.contains('Prix'),value,'') finds the rows that have the title of the race (they all include “Grand Prix” in their name) and creates a new column containing the matches. (The expression reads as follows: if the original cell value contains ‘Prix’, copy the cell value into the corresponding cell in the new column, else copy across an empty string, that is, ''.) We can then Fill Down on the race column to associate each row with a particular race.

We can also create another new column containing the description of each penalty notice with a quick tweak of the original expression: if(value.contains('Prix'),'',value). (If we match “Prix”, copy an empty string, else copy the penalty notice.)

Copy the penalty notice
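(Purely for comparison, and not part of the OpenRefine recipe itself, the same race-name extraction and fill down could be sketched in pandas, assuming the pasted lines have been read into a single column called line:)

import pandas as pd

#Assume each pasted line is a row in a single column dataframe
df = pd.DataFrame({'line': [
    'Australian Grand Prix',
    '- Driver A (Team X) fined for speeding in the pit lane',
    '- Driver B (Team Y) drive-through penalty',
]})

#Rows containing "Prix" are race headers; copy them into a Race column and fill down
df['Race'] = df['line'].where(df['line'].str.contains('Prix')).ffill()
#Everything else is a penalty notice
df['Penalty Notice'] = df['line'].where(~df['line'].str.contains('Prix'))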

One of the things that we notice is there are some notices that “Overflow” on to multiple lines:

Missed lines...

We can filter using a regular expression that finds Penalty Notice lines that start (^) with a character that does not match a – ([^-]):

Find rows that don't start with a -

Looking at the row numbers, we see several of the rows are consecutive – we can edit these cells to move all the text into a single cell:

OpenRefine edit

Cut and paste as required…

edit a cell

Looking down the row number column (left hand most column) we see that rows 19, 57 and 174 are now the overflow lines. Remove the filter and, in the whole listing, scroll to the appropriate part of the data table, cut the data out of the overflow cell and paste it into the line above.

Cut and append

By chance, I also notice that using “Prix” to grab just race names was overly optimistic!

overly aggressive

Here’s how we could have checked – used the facet as text option on the race column…

Facet on race name

Phew – that was the only rogue! In line 56, cut the rogue text from the Race column and paste it into the penalty notice column. Then also paste in the additional content from the overflow lines.

Tidy up the big overflow.

Remove any filters and fill down again on the Race column to fix the earlier error…

The Penalty Notice column should now contain blank lines corresponding to rows that originally described the Grand Prix and overflow rows – facet the Penalty Notice column by text and highlight the blank rows so we can then delete them…

prune blank rows

So where are we now? We have a data file with one row per penalty and columns corresponding to the Grand Prix and the penalty notice. We can now start work on pulling data out of the penalty notice sentences.

If you inspect the sentences, you will see they start with a dash, then have the driver name and the team in brackets. Let’s use a regular expression to grab that data:

value.match(/- ([^\(]*)\s\(([^\)]*)\).*/).join('::')

Driver name grab

Here’s the actual regular expression: - ([^\(]*)\s\(([^\)]*)\).* It reads as follows: match a – followed by a space, then grab any characters that aren’t an open bracket ([^\(]*) and that precede a space \s followed by an open bracket \(: all together, ([^\(]*)\s\(. That gives the driver name as the first matched pattern. Then grab the team – this is whatever appears before the first close bracket: ([^\)]*)\). Finally, match all remaining characters to the end of the string: .*

The two matches are then joined using ::

driver name captured
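(If it helps to see the same pattern outside of OpenRefine, here's a rough Python equivalent using the re module on a made-up penalty notice:)

import re

notice = '- Driver A (Team X) fined for speeding in the pit lane during FP2'

#Driver is everything before the open bracket; team is whatever sits inside the brackets
m = re.match(r'- ([^\(]*)\s\(([^\)]*)\).*', notice)
if m:
    print('::'.join([m.group(1), m.group(2)]))  #Driver A::Team X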

We can then split these values to give driver and team columns:

driver/team split

Learning from our previous error, we can use the text facet tool on the driver and team columns just to check the values are in order – it seems like there is one oops in the driver column, so we should probably edit that cell and remove the contents.

facet on name

We can also check the blank lines to see what’s happening there – in this case no driver is mentioned but a team is, but that hasn’t been grabbed into the team column, so we can edit it here:

tweak team

We can also check the text facet view of the team column to make sure there are no gotchas, and pick up/correct any that did slip through.

So now we have a driver column and a team column too (it’s probably worth changing the column names to match…)

team and driver

Let’s look at the data again – what else can we pull out? How about the value of any fine? We notice that fine amounts seem to appear at the end of the sentence and be preceded by the word fined, so we can grab data on that basis, then replace the euro symbol, strip out any commas, and cast the result to a number type: value.match(/.* fined (.*)/)[0].replace(/[€,]/,'').toNumber()

pull out fines

We can check for other fines by filtering the Penalty Notice column on the word fine (or the Euro symbol), applying a number facet to the Fine column and looking for blank rows in that column.

tidy up the fines

Add in fine information by hand as required:

edit number

So now we have a column that has the value of fines – which means if we export this data we could do plots that show fines per race, or fines per driver, or fines per team, or calculate the average size of fines, or the total number of fines, for example.

What other data columns might we pull out? How about the session? Let’s look for phrases that identify free practice sessions or qualifying:

Session extraction

Here’s the regular expression: value.match(/.*((FP[\d]+)|(Q[\d]+)|(qualifying)).*/i)[0].toUppercase() Note how we use the pipe symbol | to say ‘look for one pattern OR another’. We can cast everything to uppercase just to help normalise the values that appear. And once again, we can use the Text Facet to check that things worked as we expected:

facet session

So that’s a column with the session the infringement occurred in (I think! We’d need to read all the descriptions to make absolutely sure!)

What else? There’s another set of numbers that appears in some of the notices – speeds. Let’s grab those into a new column – look for a space, followed by numbers or decimal points, and then a space and km/h, grabbing the numbers of interest and casting them to a number type:

value.match(/.*\s([\d\.]+) km\/h.*/)[0].toNumber()

Speeding
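(Again, purely as a cross-check outside OpenRefine, a rough Python sketch of the fine, session and speed grabs against a made-up notice might look something like this:)

import re

notice = '- Driver A (Team X) fined €1,000 for exceeding the pit lane speed limit (61.2 km/h) during FP2'

#Fine: the number following "fined", with euro sign and commas stripped
fine = re.search(r'fined\s+€?([\d,]+)', notice)
#Session: FPn, Qn or the word "qualifying", normalised to upper case
session = re.search(r'(FP\d+|Q\d+|qualifying)', notice, re.I)
#Speed: digits/decimal point followed by km/h
speed = re.search(r'([\d\.]+)\s*km/h', notice)

print(float(fine.group(1).replace(',', '')) if fine else None,
      session.group(1).upper() if session else None,
      float(speed.group(1)) if speed else None)
#1000.0 FP2 61.2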

So now we have a speed column. Which means we could start to look at speed vs fine scatterplots, perhaps, to see if there is a relationship. (Note, different pit lanes may have different speed limits.)

What else? It may be worth trying to identify the outcome of each infringement investigation?

value.match(/.*((fine)|(no further action)|([^\d]\d+.place grid.*)|(reprimand)|(drive.thr.*)|(drop of.*)|\s([^\s]+.second stop and go.*)|(start .*from .*)).*/i)[0].toUppercase()

Outcome grab

Here’s where we’re at now:

useful data so far

If we do a text facet on the outcome column, we see there are several opportunities for clustering the data:

Facet on outcome and cluster

We can try other cluster types too:

other clusters

If we look at the metaphone (soundalike) clusters:

other opportunities for clustering

we notice a couple of other things – an opportunity to normalise 5 PLACE GRID DROP as DROP OF 5 GRID PLACES for example:

value.replace('-',' ').replace(/(\d+) PLACE GRID DROP/,'DROP OF $1 GRID POSITIONS')

Or we might further standardise the outcome of that by fixing on GRID POSITIONS rather than GRID PLACES:

value.replace('-',' ').replace(/(\d+) PLACE GRID DROP/,'DROP OF $1 GRID POSITIONS').replace('GRID PLACES','GRID POSITIONS')

And we might further normalise on numbers rather than number words:

value.replace('-',' ').replace(/(\d+) PLACE GRID DROP/,'DROP OF $1 GRID POSITIONS').replace('GRID PLACES','GRID POSITIONS').replace('TWO','2').replace('THREE','3').replace('FIVE','5').replace('TEN','10')

Clustering again:

further cleaning - in this case

it might make sense to tidy out the (IN THIS CASE… statements:

value.replace(/ \(IN THIS.*/,'')

Depending on the questions we want to ask, it may be worth splitting out whether penalties like GRID DROPS are applied at this event or the next event, as well as pulling out generic penalty types (drive through, stop and go, grid drop, etc.).

example of outcome penalty types

Finally, let’s consider what sort of infringement has occurred:

infraction type

If we create a new column from the Infraction column, we can then cluster items into the core infraction type:

cluster core infraction

After a bit of tidying, we can start to narrow down on a key set of facet levels:

example core infringement

Viewing down the list further there may be additional core infringements we might be able to pull out.

So here’s where we are now:

data types

And here’s where we came from:

Viva F1 - penalties - messy data

Having got the data into a data table, we can now start to ask questions of it (whilst possibly using those conversations to return to the data and tidy it more as we work with it). But that will have to be the subject of another post…

Extracting Images from PDFs

A quick recipe for extracting images embedded in PDFs (and in particular, extracting photos contained with PDFs…).

For example, Shell Nigeria has a site that lists oil spills along with associated links to PDF docs that contain photos corresponding to the oil spill:

shell ng oil spill

Running an import.io scraper over the site can give a list of all the oil spills along with links to the corresponding PDFs. We can trawl through these links, downloading the PDFs and extracting the images from them.

import os,re
import urllib2

#New OU course will start using pandas, so I need to start getting familiar with it.
#In this case it's overkill, because all I'm using it for is to load in a CSV file...
import pandas as pd

#url='http://s01.static-shell.com/content/dam/shell-new/local/country/nga/downloads/pdf/oil-spills/967426_BenisedeWell11_flowline_at_Amabulou_Photos.pdf'

#Load in the data scraped from Shell
df= pd.read_csv('shell_30_11_13_ng.csv')

errors=[]

#For each line item:
for url in df[df.columns[15]]:
	try:
		print 'trying',url
		u = urllib2.urlopen(url)
		fn=url.split('/')[-1]

		#Grab a local copy of the downloaded picture containing PDF
		localFile = open(fn, 'wb')
		localFile.write(u.read())
		localFile.close()
	except:
		print 'error with',url
		errors.append(url)
		continue
	
	#If we look at the filenames/urls, the filenames tend to start with the JIV id
	#...so we can try to extract this and use it as a key
	id=re.split(r'[_-]',fn)[0]

	#I'm going to move the PDFs and the associated images stripped from them in separate folders
	fo='data/'+id
	os.system(' '.join(['mkdir',fo]))
	idp='/'.join([fo,id])
	
	#Try to cope with crappy filenames containing punctuation chars
	fn= re.sub(r'([()&])', r'\\\1', fn)

	#THIS IS THE LINE THAT PULLS OUT THE IMAGES
	#Available via poppler-utils
	#See: http://ubuntugenius.wordpress.com/2012/02/04/how-to-extract-images-from-pdf-documents-in-ubuntulinux/
	#Note: the '; mv' etc etc bit copies the PDF file into the new JIV report directory
	cmd=' '.join(['pdfimages -j',fn, idp, '; mv',fn,fo  ])
	os.system(cmd)
	#Still a couple of errors on filenames
	#just as quick to catch by hand/inspection of files that don't get moved properly
	
print 'Errors',errors

Images in the /data directory at: https://github.com/psychemedia/ScoDa_oil/tree/master/shell-ng

The important line of code in the above is:

pdfimages -j FILENAME OUTPUT_STUB

FILENAME is the PDF you want to extract the images from, and OUTPUT_STUB sets the main part of the name of the image files. pdfimages is actually a command line tool, which is why we need to run it from the Python script using the os.system call. (I’m running on a Mac – I have no idea how this might work on a Windows machine!)

pdfimages can be downloaded as part of poppler (I think?!)
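As an aside, if the os.system call proves fragile (it’s easy to get tripped up by shell quoting of awkward filenames), a slightly more defensive sketch might call pdfimages via subprocess instead, passing the arguments as a list so no shell is involved – the example filename in the comment is just the one mentioned above:

import subprocess

def extract_images(pdf_path, output_stub):
    #pdfimages -j writes out the images embedded in pdf_path as files named
    #output_stub-000.jpg, output_stub-001.jpg, ... (embedded JPEGs are kept as JPEGs)
    #Passing the arguments as a list sidesteps shell quoting issues with odd filenames
    try:
        subprocess.check_call(['pdfimages', '-j', pdf_path, output_stub])
    except (subprocess.CalledProcessError, OSError) as e:
        print('pdfimages failed for ' + pdf_path + ': ' + str(e))

#e.g. extract_images('967426_BenisedeWell11_flowline_at_Amabulou_Photos.pdf', 'data/967426/967426')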

See also this Stack Exchange question/answer: Extracting images from a PDF

PS to put this data to work a little, I wondered about using the data to generate a WordPress blog with one post per spill.

The python-wordpress-xmlrpc library (docs: http://python-wordpress-xmlrpc.readthedocs.org/en/latest/examples/media.html) provides a Python API for posting to WordPress. First thoughts were:

– generate post containing images and body text made up from data in the associated line from the CSV file.

Example data:

Date Reported: 02-Jan-13
Incident Site: 10″ Diebu Creek – Nun River Pipeline at Onyoma
JIV Date: 05-Jan-13
Terrain: Swamp
Cause: Sabotage/Theft
Estimated Spill Volume (bbl): 65
Clean-up Status / Comments: Recovery of spilled volume commenced on 6th January 2013 and was completed on 22nd January 2013. Cleanup of residual impacted area was completed on 9th May 2013. Site Certification was completed on 28th June 2013.
Photo: http://s06.static-shell.com/content/dam/shell-new/local/country/nga/downloads/pdf/oil-spills/911964_10in_DiebuCreek-NunRiver_pipeline_at_Onyoma_Photos.pdf

So we can pull this out for the body post. We can also parse the image PDF filename to get the JIV ID. We don’t have lat/long (nor northing/easting) though, so no maps unless we try a crude geocoding of the incident site column (column 2).

A lot of the incidents appear to start with a pipe diameter, so we can maybe pull this out too (eg the 10″ in the example above).
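As a sketch, a simple regular expression might do the job of pulling out a leading diameter where one is present (matching either a straight double quote or the ″ character used in the data):

import re

#Match a leading diameter such as 10" or 10″ at the start of an incident site string
DIAMETER_PATTERN = re.compile(u'^\\s*(\\d+)\\s*["\u2033]')

def pipe_diameter(incident_site):
    m = DIAMETER_PATTERN.match(incident_site)
    return m.group(1) if m else None

print(pipe_diameter(u'10\u2033 Diebu Creek - Nun River Pipeline at Onyoma'))
#-> 10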

We can use things like the cause, terrain, est. spill volume (as a range?), and maybe also an identified pipe diameter, to create tags or categories for the post. This allows us to generate views over particular posts (eg all posts relating to theft/sabotage).

There are several dates contained in the data and we may be able to do something with these – eg to date the post, or maybe as the basis for a timeline view over all the data. We might also be able to start collecting stats on eg the difference between the date reported (col 1) and the JIV date (col 3), or, where we can scrape it, look for structure in the clean-up status field. For example:

Recovery of spilled volume commenced on 6th January 2013 and was completed on 22nd January 2013. Cleanup of residual impacted area was completed on 9th May 2013.

If those phrases are common/templated refrains, we can parse the corresponding dates out?
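Here’s a rough sketch of the sort of parsing I have in mind – the phrase patterns are guessed from the single example above, so they’d need checking against more of the data:

import re
from datetime import datetime

def cleanup_dates(status_text):
    #Pull out dates from templated phrases like '... commenced on 6th January 2013'
    #and '... was completed on 22nd January 2013'
    found = {}
    patterns = [('recovery_started', r'commenced on (\d+)(?:st|nd|rd|th) (\w+ \d{4})'),
                ('recovery_completed', r'was completed on (\d+)(?:st|nd|rd|th) (\w+ \d{4})')]
    for label, pattern in patterns:
        m = re.search(pattern, status_text)
        if m:
            #Drop the ordinal suffix, then parse eg '22 January 2013'
            found[label] = datetime.strptime(m.group(1) + ' ' + m.group(2), '%d %B %Y').date()
    return found

txt = 'Recovery of spilled volume commenced on 6th January 2013 and was completed on 22nd January 2013.'
print(cleanup_dates(txt))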

I should probably also try to pull out the caption text from the image PDF [DONE in code on github] and associate it with a given image? This would be useful for any generated blog post too?
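For what it’s worth, here’s the sort of shape I imagine the posting step taking with python-wordpress-xmlrpc – an untested sketch, with a placeholder blog URL and credentials, and with the row fields assumed to be keyed by the CSV column names shown above:

from wordpress_xmlrpc import Client, WordPressPost
from wordpress_xmlrpc.methods.posts import NewPost

#Placeholder connection details
client = Client('http://example.wordpress.site/xmlrpc.php', 'USERNAME', 'PASSWORD')

def spill_post(row, image_urls):
    #Build one post per spill from a row of the scraped CSV (dict-like access assumed)
    post = WordPressPost()
    post.title = row['Incident Site'] + ' (' + row['Date Reported'] + ')'
    body = '<p>Cause: ' + row['Cause'] + '<br/>Terrain: ' + row['Terrain'] + '</p>'
    body += '<p>Estimated spill volume (bbl): ' + str(row['Estimated Spill Volume (bbl)']) + '</p>'
    body += ''.join('<img src="' + u + '"/>' for u in image_urls)
    post.content = body
    #Use cause and terrain as tags so we can pull out views over eg all theft/sabotage posts
    post.terms_names = {'post_tag': [row['Cause'], row['Terrain']], 'category': ['Oil spills']}
    post.post_status = 'publish'
    return client.call(NewPost(post))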

Negative Payments in Local Spending Data

In anticipation of a new R library from School of Data data diva @mihi_tr that will wrap the OpenSpending API and provide access to OpenSpending.org data directly from within R, I thought I’d start doodling around some ideas raised in Identifying Pieces in the Spending Data Jigsaw. In particular, common payment values, repayments/refunds and “balanced payments”, that is, multiple payments where the absolute value of a negative payment matches that of an above zero payment (so for example, -£269.72 and £269.72 would be balanced payments).

The data I’ve grabbed is Isle of Wight Council transparency (spending) data for the financial year 2012/2013. The data was pulled from the Isle of Wight Council website and cleaned using OpenRefine broadly according to the recipe described in Using OpenRefine to Clean Multiple Documents in the Same Way with a couple of additions: a new SupplierNameClustered column, originally created from the SupplierName column, but then cleaned to remove initials (eg [SD]) and direct debit prefixes (DD- etc), and then clustered using a variety of clustering algorithms; and a new unique index column, created with values '_'+row.index. You can find a copy of the data here.

To start with, let’s see how we can identify “popular” transaction amounts:

#We're going to use some rearrangement routines...
library(plyr)

#I'm loading in the data from a locally downloaded copy
iw <- read.csv("~/code/Rcode/eduDemos1/caseStudies/IWspending/IWccl2012_13_TransparencyTest4.csv")

#Display most common payment values
commonAmounts=function(df,bcol='Amount',num=5){
  head(arrange(data.frame(table(df[[bcol]])),-Freq),num)
}
ca=commonAmounts(iw)

ca
#     Var1 Freq
#1 1567.72 2177
#2 1930.32 1780
#3  1622.6 1347
#4 1998.08 1253
#5  1642.2 1227

This shows that the most common payment by value is for £1567.72. An obvious question to ask here is: does this correspond to some sort of “standard payment” or tariff? And if so, can we reconcile this against the extent of the delivery of a particular level of service, perhaps as disclosed elsewhere?

We can also generate a summary report to show what transactions correspond to this amount, or the most popular amounts.

#We can then get the rows corresponding to these common payments
commonPayments=function(df,bcol,commonAmounts){
  df[abs(df[[bcol]]) %in% commonAmounts,]
}
cp.df=commonPayments(iw,"Amount",ca$Var1)

More usefully, we might generate “pivot table” style summaries of how the popular payments break down with respect to expenses area, supplier, or some other combination of factors. So for example, we can learn that there were 1261 payments of £1567.72 booked to the EF Residential Care services area, and SOMERSET CARE LTD [SB] received 231 payments of £1567.72. (Reporting by the clustered supplier name is often more useful…)

#R does pivot tables, sort of...
#We can also run summary statistics over those common payment rows.
zz1=aggregate(index ~ ServiceArea + Amount, data =cp.df, FUN="length")
head(arrange(zz1,-index))

#                       ServiceArea  Amount index
#1              EF Residential Care 1567.72  1261
#2             EMI Residential Care 1642.20   659
#3             EMI Residential Care 1930.32   645
#4              EF Residential Care 1930.32   561
#5              EF Residential Care 1622.60   530
#6 Elderly Frail Residential Income 1622.60   36

#Another way of achieving a similar result is using the plyr count() function:
head(arrange(count(cp.df, c('ServiceArea','ExpensesType','Amount')),-freq))
#That is:
zz1.1=count(cp.df, c('ServiceArea','ExpensesType','Amount'))

zz2=aggregate(index ~ ServiceArea +ExpensesType+ Amount, data =cp.df, FUN="length")
head(arrange(zz2,-index))
#                       ServiceArea                       ExpensesType  Amount
#1              EF Residential Care                Chgs from Ind Provs 1567.72
#2             EMI Residential Care                Chgs from Ind Provs 1930.32
#3             EMI Residential Care                Chgs from Ind Provs 1642.20
#4              EF Residential Care Charges from Independent Providers 1622.60
#...

zz3=aggregate(index ~ SupplierName+ Amount, data =cp.df, FUN="length")
head(arrange(zz3,-index))
#                SupplierName  Amount index
#1     REDACTED PERSONAL DATA 1567.72   274
#2     SOMERSET CARE LTD [SB] 1567.72   231
#3 ISLAND HEALTHCARE LTD [SB] 1930.32   214
#...

zz4=aggregate(index ~ SupplierNameClustered+ Amount, data =cp.df, FUN="length")
head(arrange(zz4,-index))

If I was a suspicious type, I suppose I might look for suppliers who received one of these popular amounts only once…

Let’s have a search for items declared as overpayments or refunds, perhaps as preliminary work in an investigation about why overpayments are being made…:

#Let's look for overpayments
overpayments=function(df,fcol,bcol='Amount',num=5){
  df=subset(df, grepl('((overpay)|(refund))', df[[fcol]],ignore.case=T))
  df[ order(df[,bcol]), ]
}
rf=overpayments(iw,'ExpensesType')

#View the largest "refund" transactions
head(subset(arrange(rf,Amount),select=c('Date','ServiceArea','ExpensesType','Amount')))
#        Date                           ServiceArea         ExpensesType    Amount
#1 2013-03-28 Elderly Mentally Ill Residential Care     Provider Refunds -25094.16
#2 2012-07-18                   MH Residential Care Refund of Overpaymts -24599.12
#3 2013-03-25 Elderly Mentally Ill Residential Care     Provider Refunds -23163.84

#We can also generate a variety of other reports on the "refund" transactions
head(arrange(aggregate(index ~ SupplierName+ ExpensesType+Amount, data =rf, FUN="length"),-index))
#                 SupplierName     ExpensesType   Amount index
#1 THE ORCHARD HOUSE CARE HOME Provider Refunds  -434.84     8
#2 THE ORCHARD HOUSE CARE HOME Provider Refunds  -729.91     3
#3         SCIO HEALTHCARE LTD Provider Refunds  -434.84     3
#...
##Which is to say: The Orchard House Care home made 8 refunds of £434.84 and 3 of £729.91

head(arrange(aggregate(index ~ SupplierName+ ExpensesType, data =rf, FUN="length"),-index))
#                 SupplierName     ExpensesType index
#1         SCIO HEALTHCARE LTD Provider Refunds    31
#2           SOMERSET CARE LTD Provider Refunds    31
#3 THE ORCHARD HOUSE CARE HOME Provider Refunds    22
#...

Another area of investigation might be “balanced” payments. Here’s one approach for finding those, based around first identifying negative payments. Let’s also refine the approach in this instance for looking for balanced payments involving the supplier involved in the largest number of negative payments.

##Find which suppliers were involved in most number of negative payments
#Identify negative payments
nn=subset(iw,Amount<=0)
#Count the number of each unique supplier
nnd=data.frame(table(nn$SupplierNameClustered))
#Display the suppliers with the largest count of negative payments
head(arrange(nnd,-Freq))

#                    Var1 Freq
#1 REDACTED PERSONAL DATA 2773
#2      SOUTHERN ELECTRIC  313
#3   BRITISH GAS BUSINESS  102
#4      SOMERSET CARE LTD   75
#...

Let’s look for balanced payments around SOUTHERN ELECTRIC…

#Specific supplier search
#Limit transactions to just transactions involving this supplier
se=subset(iw, SupplierNameClustered=='SOUTHERN ELECTRIC')

##We can also run partial matching searches...
#sw=subset(iw,grepl('SOUTHERN', iw$SupplierNameClustered))

#Now let's search for balanced payments
balanced.items=function(df,bcol){
  #Find the positive amounts
  positems=df[ df[[bcol]]>0, ]
  #Find the negative amounts
  negitems=df[ df[[bcol]]<=0, ]
  
  #Find the absolute unique negative amounts
  uniqabsnegitems=-unique(negitems[[bcol]])
  #Find the unique positive amounts
  uniqpositems=unique(positems[[bcol]])
  
  #Find matching positive and negative amounts
  balitems=intersect(uniqabsnegitems,uniqpositems)
  #Subset the data based on balanced positive and negative amounts
  #bals=subset(se,abs(Amount) %in% balitems)
  bals=df[abs(df[[bcol]]) %in% balitems,]
  #Group the data by sorting, largest absolute amounts first
  bals[ order(-abs(bals[,bcol])), ]
}

dse=balanced.items(se,'Amount')

head(subset(dse,select=c('Directorate','Date','ServiceArea','ExpensesType','Amount')))
#                              Directorate       Date                ServiceArea ExpensesType   Amount
#20423               Economy & Environment 2012-07-11        Sandown Depot (East  Electricity -2770.35
#20424               Economy & Environment 2012-07-11        Sandown Depot (East  Electricity  2770.35
#52004 Chief Executive, Schools & Learning 2013-01-30 Haylands - Playstreet Lane  Electricity  2511.20
#52008 Chief Executive, Schools & Learning 2013-01-31 Haylands - Playstreet Lane  Electricity -2511.20
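(As an aside, since the course I’m currently writing will be using pandas, here’s a rough pandas translation of the balanced.items logic for comparison – a sketch only, assuming the same cleaned CSV and column names as above:)

import pandas as pd

def balanced_items(df, bcol='Amount'):
    #Unique positive amounts and the absolute values of the negative amounts...
    pos = set(df.loc[df[bcol] > 0, bcol].unique())
    neg = set((-df.loc[df[bcol] <= 0, bcol]).unique())
    #...keep the rows whose absolute amount appears on both sides
    bals = df[df[bcol].abs().isin(pos & neg)]
    #Largest absolute amounts first
    return bals.reindex(bals[bcol].abs().sort_values(ascending=False).index)

iw = pd.read_csv('IWccl2012_13_TransparencyTest4.csv')
se = iw[iw['SupplierNameClustered'] == 'SOUTHERN ELECTRIC']
print(balanced_items(se).head())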

We can also use graphical techniques to try to spot balanced payments.

#Crude example plot to highlight matched +/- amounts
se1a=subset(se,ServiceArea=='Ryde Town Hall' & Amount>0)
se1b=subset(se,ServiceArea=='Ryde Town Hall' & Amount<=0)

require(ggplot2)
g=ggplot() + geom_point(data=se1a, aes(x=Date,y=Amount),pch=1,size=1)
g=g+geom_point(data=se1b, aes(x=Date,y=-Amount),size=3,col='red',pch=2)
g=g+ggtitle('Ryde Town Hall - Energy Payments to Southern Electric (red=-Amount)')+xlab(NULL)+ylab('Amount (£)')
g=g+theme(axis.text.x = element_text(angle = 45, hjust = 1))
g

In this first image, we see payments over time – the red markers are “minus” amounts on the negative payments. Notice that in some cases balanced payments seem to appear on the same day. If the y-value of a red and a black marker are the same, they are balanced in value. The x-axis is time. Where there is a period of equally spaced marks over x with the same y-value, this may represent a regular scheduled payment.

ryde - energy

g=ggplot(se1a)+geom_point( aes(x=Date,y=Amount),pch=1,size=1)
g=g+geom_point(data=se1b, aes(x=Date,y=-Amount),size=3,col='red',pch=2)
g=g+facet_wrap(~ExpensesType)
g=g+ggtitle('Ryde Town Hall - Energy Payments to Southern Electric (red=-Amount)')
g=g+xlab(NULL)+ylab('Amount (£)')
g=g+theme(axis.text.x = element_text(angle = 45, hjust = 1))
g

We can also facet out the payments by expense type.

ryde-energy facet

Graphical techniques can often help us spot patterns in the data that might be hard to spot just by looking at rows and columns worth of data. In many cases, it is worthwhile developing visual analysis skills to try out quick analyses “by eye” before trying to implement them as more comprehensive analysis scripts.

Using OpenRefine to Clean Multiple Documents in the Same Way

When working with data that is published on a monthly basis according to the same template, it is often the case that we need to apply the same data cleaning rules to the data each time a new file is released. This recipe shows how to use OpenRefine to create a reusable script for cleaning data files that get published month on month according to the same schedule.

To make things a little more concrete, consider this example. Under UK transparency regulations, local councils publish spending data for amounts over £500 on a monthly basis:

IW council transparency data

To get to the actual download link for the CSV data files requires another click… (example CSV file, if you want to play along…;-)

IW council transparency data download link

If we want to get this data into a site such as OpenSpending.org, we need to do a little bit of tidying of the data to get it into the format that OpenSpending expects (see for example The OpenSpending data format). As each new file is released, we need to clean it as we have cleaned the files before. If you are a confident programmer, you could write a script to handle this process. But what can the rest of us do to try to automate this process and make life a little easier?

One way is to use OpenRefine’s ability to “replay” cleaning scripts that you have generated before. Here’s an example of how to do just that…

Let’s start by loading the data in from a CSV file on the Isle of Wight local council website – we need to copy the download link URL ourselves:

OpenRefine - import from CSV

The recent spending files are in a common format (that is, they have been published according to the same template), which is something we’ll be relying on, so we could load multiple files in at once into one big data file, but in this case I’m going to take each file separately.

OpenRefine makes a good guess at the file format. One minor tweak we might make is to ignore any blank lines (OpenSpending doesn’t like blank lines!).

openrefine import csv settings

Here’s what the data looks like once we import it:

preview the data in openrefine

OpenSpending expects the data to be presented in a particular way, which is why we need to clean the data a little before we can upload it. For example, OpenSpending likes column names that are free of spaces and punctuation; it requires an amount column that just contains numbers (so no commas in the number to make it more readable!); it requires dates in the format 2013-07-15 (that is, the yyyy-mm-dd format) (and I think it needs this column to be called time?).
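(If you are happy writing code, the same tidying steps can of course be scripted – for example, here’s a rough pandas sketch of the transformations just described. The input filename and the 'Amount' and 'Date' column names are assumptions about the council file, so check them against the real headers:)

import pandas as pd

#Load a locally downloaded copy of one month's transparency CSV (filename is a placeholder)
df = pd.read_csv('iw_transparency_month.csv', skip_blank_lines=True)

#Tidy the column names: strip surrounding whitespace and remove spaces
df.columns = [c.strip().replace(' ', '') for c in df.columns]

#Amount column: strip the thousands commas and coerce to a number
df['Amount'] = pd.to_numeric(df['Amount'].astype(str).str.replace(',', ''))

#Dates arrive as 17.04.2013; OpenSpending wants yyyy-mm-dd in a column called 'time'
df['time'] = pd.to_datetime(df['Date'], format='%d.%m.%Y').dt.strftime('%Y-%m-%d')

df.to_csv('iw_transparency_month_clean.csv', index=False)

Anyway – back to doing it without writing any code…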

Here’s how we can rename the columns:

openrefine - rename column

Rename each column in turn, as required – for example, remove any punctuation, use camelCase (removing spaces and using capital letters to mark word boundaries), or replace spaces with underscores (_).

Let’s look at the amount column – if we select the numeric facet we can see there are lots of things not identified as numbers:

open refine - check amount as numeric

We can preview what the non-numeric values are so we can set about tidying them up…

Openrefine look for non-numerics

So commas appear to be the major issue – let’s deal with them by transforming the cells in that column to strip out any commas they contain.

openrefine transform

We can do this by replacing a comma whenever we see one with nothing (that is, an empty character string) – value.replace(',','')

OpenRefine  - remove comma

Let’s see what effect that has on the numeric facet view:

Open Refine amount after cleaning

Well that seems to have worked – no non-numerics any more… (We might also use the sliders in the numeric facet to check the outlying values to see if they are plausible, or look like they may be errors.)

As far as the dates go, we have dates in the form 17.04.2013 rather than 2013-04-17 so let’s transform them into the required format. However, because there is more scope for things going wrong with this transformation, let’s put the transformed data into a new column:

open refine  - add column

Here’s how we define the contents of that column:

open refine date transform

That is: value.toDate('dd.MM.yyyy').toString('yyyy-MM-dd') (note that in these date format strings MM means month – lowercase mm would be read as minutes).

The first part – value.toDate('dd.MM.yyyy') – interprets the string as a date presented in the given format, and the second part transforms that date into the required date format: .toString('yyyy-MM-dd')

We can quickly preview that this step has worked by transforming the column to a date type:

open refine transform to date

and then preview it using a timeline facet to check that all seems in order.

OpenRefine timeline facet

(If there were non-date or error elements, we could select them and look at the original date and transformed date to see where the problem may lie.)

We don’t want the date formatting for our OpenSpending data file, so we can undo the step that transformed the data into the timeline viewable date format:

openrefine history

So now we have our cleaned data file. How can we apply the same steps to another month? If you look at the history tab, you will see it offers an “Extract” option – this provides a history of the change operations we applied to the dataset.

If we copy this history to a text file somewhere, we can then make use of it again and again.

open refine extract

To see how, open another OpenRefine project and import the data for another month (such as this one). When you have created the project, click on the Undo/Redo history tab and select Apply:

open refine - apply previous transformations

Paste in the transformation script we grabbed from the previous project:

open refine - extract reuse

Here’s the script I used – https://gist.github.com/psychemedia/6087946

When we apply the script, the data is cleaned using the same operations as previously:

open refine cleaned as if by magic

That is, as long as the new data file is in the same format as the previous one, and only requires the same cleaning operations, we don’t really have much to do – we can just reuse the script we generated the first time we cleaned a file of this type. And each time a new file is published, we have a quick recipe for cleaning it up so we can get it into OpenSpending:-)

Finally, we can export the data for use as required…

open refine export data

Magic:-)