## Experimenting With R – Point to Point Mapping With Great Circles

I’ve started doodling again… This time, around maps, looking for recipes that make life easier when plotting lines to connect points on maps. The most attractive maps seem to use great circles to connect one point with another; a great circle arc is the shortest path between two points when you treat the Earth as a sphere.
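To make the "shortest path on a sphere" idea a bit more concrete, here's a minimal base R sketch that interpolates points along a great circle between two lon/lat pairs. The function names `to_xyz()` and `great_circle_points()` are made up for illustration, the Earth is assumed to be a perfect sphere, and exactly antipodal points are not handled; it's a sketch of the idea rather than a replacement for a proper geodesy library.

```r
# Convert lon/lat in degrees to a unit vector on the sphere
to_xyz <- function(lon, lat) {
  lon <- lon * pi / 180
  lat <- lat * pi / 180
  c(cos(lat) * cos(lon), cos(lat) * sin(lon), sin(lat))
}

# Return n points (lon, lat in degrees) along the great circle from p1 to p2,
# where p1 and p2 are c(lon, lat) pairs; both endpoints are included
great_circle_points <- function(p1, p2, n = 50) {
  a <- to_xyz(p1[1], p1[2])
  b <- to_xyz(p2[1], p2[2])
  # Central angle between the two points (clamped to guard against rounding)
  omega <- acos(max(-1, min(1, sum(a * b))))
  if (omega == 0) {
    # Identical points: just repeat the start point
    return(matrix(rep(p1, n), ncol = 2, byrow = TRUE,
                  dimnames = list(NULL, c("lon", "lat"))))
  }
  f <- seq(0, 1, length.out = n)
  # Spherical linear interpolation between the two unit vectors
  pts <- sapply(f, function(t) {
    (sin((1 - t) * omega) * a + sin(t * omega) * b) / sin(omega)
  })
  cbind(lon = atan2(pts[2, ], pts[1, ]) * 180 / pi,
        lat = asin(pmax(-1, pmin(1, pts[3, ]))) * 180 / pi)
}

london  <- c(-0.1255, 51.5085)
newyork <- c(-74.0060, 40.7144)
path <- great_circle_points(london, newyork, n = 5)
print(path)
```

The middle point of the path sits further north than either city, which is the familiar bulge you see on flight route maps.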

Here’s one quick experiment (based on the Flowing Data blog post How to map connections with great circles), for an R/Shiny app that allows you to upload a CSV file containing a couple of location columns (at least) and an optional “amount” column, and it’ll then draw lines between the points on each row.

The app requires us to solve several problems, including:

• how to geocode the locations
• how to plot the lines as great circles
• how to upload the CSV file
• how to select the from and to columns from the CSV file
• how to optionally select a valid numerical column for setting line thickness

Let’s start with the geocoder. For convenience, I’m going to use the Google geocoder via the geocode() function from the ggmap library.

```
library(ggmap) #provides geocode()

#The names of the two location columns in the *dummy* dataframe are held in *fr* and *to*
#If locations are duplicated in from/to columns, dedupe so we don't geocode same location more than once
locs=data.frame(place=unique(c(as.vector(dummy[[fr]]),as.vector(dummy[[to]]))),stringsAsFactors=F)
#Run the geocoder against each location, then transpose and bind the results into a dataframe
cbind(locs, t(sapply(locs$place,geocode, USE.NAMES=F)))
```

The locs data frame holds a single column of locations:

```
                    place
1              London, UK
2            Cambridge,UK
3            Paris,France
4       Sydney, Australia
5           Paris, France
6             New York,US
7 Cape Town, South Africa
```

The sapply(locs$place,geocode, USE.NAMES=F) call returns data that looks like:

```
    [,1]       [,2]     [,3]     [,4]      [,5]     [,6]      [,7]
lon -0.1254872 0.121817 2.352222 151.207   2.352222 -74.00597 18.42406
lat 51.50852   52.20534 48.85661 -33.86749 48.85661 40.71435  -33.92487
```

The transpose, t(), gives us:

```
     lon        lat
[1,] -0.1254872 51.50852
[2,] 0.121817   52.20534
[3,] 2.352222   48.85661
[4,] 151.207    -33.86749
[5,] 2.352222   48.85661
[6,] -74.00597  40.71435
[7,] 18.42406   -33.92487
```

The cbind() binds each location with its lat and lon value:

```
                    place        lon       lat
1              London, UK -0.1254872  51.50852
2            Cambridge,UK   0.121817  52.20534
3            Paris,France   2.352222  48.85661
4       Sydney, Australia    151.207 -33.86749
5           Paris, France   2.352222  48.85661
6             New York,US  -74.00597  40.71435
7 Cape Town, South Africa   18.42406 -33.92487
```
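The dedupe, geocode, transpose and bind steps can be tried offline with a stand-in geocoder. In this sketch `fake_geocode()` is a hypothetical function that looks places up in a small fixed table rather than calling a web service, and it returns a plain named numeric vector, which is an assumption; the real geocode() returns a richer object, so the resulting column types may differ.

```r
# Toy from/to data; note "Paris, France" appears in both columns
dummy <- data.frame(
  from = c("London, UK", "Paris, France"),
  to   = c("Paris, France", "Sydney, Australia"),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for ggmap::geocode(): looks up a fixed table
# instead of calling a web service
fake_geocode <- function(place) {
  lookup <- list(
    "London, UK"        = c(lon = -0.1254872, lat = 51.50852),
    "Paris, France"     = c(lon = 2.352222,   lat = 48.85661),
    "Sydney, Australia" = c(lon = 151.207,    lat = -33.86749)
  )
  lookup[[place]]
}

# Dedupe the locations so each one is only looked up once
locs <- data.frame(place = unique(c(dummy$from, dummy$to)),
                   stringsAsFactors = FALSE)
# sapply() gives a 2 x n matrix (rows lon, lat); t() flips it to n x 2
coords <- t(sapply(locs$place, fake_geocode, USE.NAMES = FALSE))
# Bind each place name to its coordinates
geo <- cbind(locs, coords)
print(geo)
```

Note that the dedupe is a plain string match, so variant spellings of the same place would still be geocoded separately.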

Code that provides a minimal example for uploading the data from a CSV file on the desktop to the Shiny app, then creating dynamic drop-down lists containing column names, can be found here: Simple file geocoder (R/shiny app).

The following snippet may be generally useful for getting a list of column names from a data frame that correspond to numerical columns:

```
#Get a list of column names for numerical columns in data frame df
nums <- sapply(df, is.numeric)
names(nums[nums])
```
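Here's that snippet applied to a small made-up data frame (the column names are invented for the example). Negating the same logical vector gives the non-numerical columns, which would be the natural candidates for location columns.

```r
# Made-up example data: two text columns and two numeric columns
df <- data.frame(
  from   = c("London, UK", "Paris, France"),
  to     = c("New York, US", "Sydney, Australia"),
  amount = c(10, 25),
  trips  = c(3L, 7L),
  stringsAsFactors = FALSE
)

# Logical vector, one entry per column, TRUE where the column is numeric
nums <- sapply(df, is.numeric)
numeric_cols <- names(nums[nums])   # candidates for the Amount selector
text_cols    <- names(nums[!nums])  # candidates for the From/To selectors
print(numeric_cols)
print(text_cols)
```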

The code for the full application can be run directly from a gist in RStudio: R/Shiny app – great circle mapping. [In RStudio, install.packages("shiny"); library(shiny); runGist(9690079). The gist contains a dummy data file if you want to download it to try it out…]

Here’s the code explicitly…

The global.R file loads the necessary packages, installing them if they are missing:

```
#global.R

##This should detect and install missing packages before loading them - hopefully!
list.of.packages <- c("shiny", "ggmap","maps","geosphere")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)
lapply(list.of.packages,function(x){library(x,character.only=TRUE)})
```
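As a possible variation on the detection step, here's a small sketch that checks for each package individually using system.file() rather than scanning the whole installed.packages() table. The helper name `missing_packages()` is made up for this example, and the sketch only reports what is missing; it deliberately doesn't install anything.

```r
# Return the subset of pkgs that does not appear to be installed.
# system.file(package = p) returns "" when package p cannot be found.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, function(p) nzchar(system.file(package = p)),
                      logical(1))
  pkgs[!installed]
}

# Base packages should always be found; the last name is deliberately bogus
found_all <- missing_packages(c("stats", "utils"))
not_found <- missing_packages(c("stats", "notARealPackage12345"))
print(found_all)
print(not_found)
```

The result could then be passed to install.packages() in the same way as new.packages in global.R.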

The ui.R file builds the Shiny app’s user interface. The drop down column selector lists are populated dynamically with the names of the columns in the data file once it is uploaded. An optional Amount column can be selected – the corresponding list only displays the names of numerical columns. (The lists of location columns to be geocoded should really be limited to non-numerical columns.) The action button prevents the geocoding routines firing until the user is ready – select the columns appropriately before geocoding (error messages are not handled very nicely;-)

```
#ui.R
shinyUI(pageWithSidebar(
  #pageWithSidebar() expects a header panel first; the title text here is a placeholder
  headerPanel("Great circle mapping"),

  sidebarPanel(
    #Provide a dialogue to upload a file
    fileInput('datafile', 'Choose CSV file',
              accept=c('text/csv', 'text/comma-separated-values,text/plain')),
    #Define some dynamic UI elements - these will be lists containing file column names
    uiOutput("fromCol"),
    uiOutput("toCol"),
    #Do we want to make use of an amount column to tweak line properties?
    uiOutput("amountflag"),
    #If we do, we need more options...
    conditionalPanel(
      condition="input.amountflag==true",
      uiOutput("amountCol")
    ),
    conditionalPanel(
      condition="input.amountflag==true",
      uiOutput("lineSelector")
    ),
    #We don't want the geocoder firing until we're ready...
    actionButton("getgeo", "Get geodata")

  ),
  mainPanel(
    tableOutput("filetable"),
    tableOutput("geotable"),
    plotOutput("geoplot")
  )
))
```

The server.R file contains the server logic for the app. One thing to note is the way we isolate some of the variables in the geocoder reactive function. (Reactive functions fire when one of the reactive values they refer to changes. To prevent the function firing when a particular value changes, we need to isolate it. See the docs for more; for example, Shiny Lesson 7: Reactive outputs or Isolation: avoiding dependency.)

```
#server.R

shinyServer(function(input, output) {

  #Read the uploaded CSV file into a data frame
  filedata <- reactive({
    infile <- input$datafile
    if (is.null(infile)) {
      # User has not uploaded a file yet
      return(NULL)
    }
    read.csv(infile$datapath)
  })

  #Populate the list boxes in the UI with column names from the uploaded file
  output$toCol <- renderUI({
    df <- filedata()
    if (is.null(df)) return(NULL)

    items=names(df)
    names(items)=items
    selectInput("to", "To:",items)
  })

  output$fromCol <- renderUI({
    df <- filedata()
    if (is.null(df)) return(NULL)

    items=names(df)
    names(items)=items
    selectInput("from", "From:",items)
  })

  #If we want to make use of an amount column, we need to be able to say so...
  output$amountflag <- renderUI({
    df <- filedata()
    if (is.null(df)) return(NULL)

    checkboxInput("amountflag", "Use values?", FALSE)
  })

  output$amountCol <- renderUI({
    df <- filedata()
    if (is.null(df)) return(NULL)
    #Let's only show numeric columns
    nums <- sapply(df, is.numeric)
    items=names(nums[nums])
    names(items)=items
    selectInput("amount", "Amount:",items)
  })

  #Allow different line styles to be selected
  output$lineSelector <- renderUI({
    radioButtons("lineselector", "Line type:",
                 c("Uniform" = "uniform",
                   "Thickness proportional" = "thickprop",
                   "Colour proportional" = "colprop"))
  })

  #Display the data table - handy for debugging; if the file is large, need to limit the data displayed [TO DO]
  output$filetable <- renderTable({
    filedata()
  })

  #The geocoding bit... Isolate variables so we don't keep firing this...
  geodata <- reactive({
    if (input$getgeo == 0) return(NULL)
    df=filedata()
    if (is.null(df)) return(NULL)

    isolate({
      dummy=filedata()
      fr=input$from
      to=input$to
      locs=data.frame(place=unique(c(as.vector(dummy[[fr]]),as.vector(dummy[[to]]))),stringsAsFactors=F)
      cbind(locs, t(sapply(locs$place,geocode, USE.NAMES=F)))
    })
  })

  #Weave the geocoded data into the data frame we made from the CSV file
  geodata2 <- reactive({
    if (input$getgeo == 0) return(NULL)
    df=filedata()
    if (input$amountflag != 0) {
      maxval=max(df[[input$amount]],na.rm=T)
      minval=min(df[[input$amount]],na.rm=T)
      #Scale the amount to a 0-10 range; the odd column name is to avoid clashing with a real column
      df$b8g43bds=10*df[[input$amount]]/maxval
    }
    gf=geodata()
    #Merge once for the from location (lon.x, lat.x) and once for the to location (lon.y, lat.y)
    df=merge(df,gf,by.x=input$from,by.y='place')
    merge(df,gf,by.x=input$to,by.y='place')
  })

  #Preview the geocoded data
  output$geotable <- renderTable({
    if (input$getgeo == 0) return(NULL)
    geodata2()
  })

  #Plot the data on a map...
  output$geoplot <- renderPlot({
    if (input$getgeo == 0) return(map("world"))
    #Method pinched from: http://flowingdata.com/2011/05/11/how-to-map-connections-with-great-circles/
    map("world")
    df=geodata2()

    pal <- colorRampPalette(c("blue", "red"))
    colors <- pal(100)

    for (j in seq_len(nrow(df))){
      inter <- gcIntermediate(c(df[j,]$lon.x[[1]], df[j,]$lat.x[[1]]), c(df[j,]$lon.y[[1]], df[j,]$lat.y[[1]]), n=100, addStartEnd=TRUE)

      #We could possibly do more styling based on user preferences?
      if (input$amountflag == 0) {
        lines(inter, col="red", lwd=0.8)
      } else {
        if (input$lineselector == 'colprop') {
          #Guard against an index of 0 for very small values
          colindex <- max(1, round( (df[j,]$b8g43bds[[1]]/10) * length(colors) ))
          lines(inter, col=colors[colindex], lwd=0.8)
        } else if (input$lineselector == 'thickprop') {
          lines(inter, col="red", lwd=df[j,]$b8g43bds[[1]])
        } else lines(inter, col="red", lwd=0.8)
      }
    }
  })

})
```
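The double merge in geodata2 is perhaps the least obvious step, so here's a standalone sketch of it using made-up routes and a hand-built lookup table in place of the geocoder output. The column names `from`, `to`, `amount` and `scaled` are assumptions for the example; in the app the names come from whatever the user selects.

```r
# Routes as they might appear in an uploaded CSV file (made-up values)
routes <- data.frame(
  from   = c("London, UK", "Paris, France"),
  to     = c("New York, US", "London, UK"),
  amount = c(20, 5),
  stringsAsFactors = FALSE
)
# Lookup table of coordinates, one row per unique place
places <- data.frame(
  place = c("London, UK", "Paris, France", "New York, US"),
  lon   = c(-0.1254872, 2.352222, -74.00597),
  lat   = c(51.50852, 48.85661, 40.71435),
  stringsAsFactors = FALSE
)

# Scale the amount to a 0-10 range for use as a line width
routes$scaled <- 10 * routes$amount / max(routes$amount, na.rm = TRUE)

# First merge attaches the "from" coordinates, second the "to" coordinates;
# the clashing lon/lat names get .x and .y suffixes automatically
step1  <- merge(routes, places, by.x = "from", by.y = "place")
merged <- merge(step1, places, by.x = "to", by.y = "place")
print(merged)
```

By default merge() keeps only matching rows, so any route whose place is missing from the lookup table silently drops out of the result.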

So that’s the start of it… this app could be further developed in several ways, for example allowing the user to filter or colour displayed lines according to factor values in a further column (commodity type, for example), or produce a lattice of maps based on facet values in a column.

I also need to figure out how to save maps, and maybe produce zoomable ones. If the geocoded points all lie within a bounding box limited to a particular geographical area, scaling the map view to show just that area might be useful.
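As a first stab at the bounding box idea, here's a base R sketch that works out padded longitude and latitude limits from a set of points. The function name `bounding_box()` and the padding rule (10% of the span, but at least one degree) are arbitrary choices for illustration, and routes that cross the 180 degree meridian are not handled.

```r
# Compute padded plot limits from a set of geocoded points
bounding_box <- function(lon, lat, pad = 0.1) {
  xr <- range(lon, na.rm = TRUE)
  yr <- range(lat, na.rm = TRUE)
  # Pad each side by a fraction of the span, with a minimum of 1 degree
  dx <- max(diff(xr) * pad, 1)
  dy <- max(diff(yr) * pad, 1)
  # Clip to valid longitude and latitude ranges
  list(xlim = c(max(xr[1] - dx, -180), min(xr[2] + dx, 180)),
       ylim = c(max(yr[1] - dy, -90),  min(yr[2] + dy, 90)))
}

# London, Cambridge and Paris from the geocoded table earlier
bb <- bounding_box(lon = c(-0.1254872, 0.121817, 2.352222),
                   lat = c(51.50852, 52.20534, 48.85661))
print(bb)
# With the maps package loaded, something like this should zoom the base map:
# map("world", xlim = bb$xlim, ylim = bb$ylim)
```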

Other techniques might include using proportional symbols (circles) at line landing points to show the sum of values incoming to that point, or the sum of values outgoing, or the difference between the two (maybe use green where incoming exceeds outgoing and another colour where it doesn’t, then size by the absolute difference?).
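The numbers behind such proportional symbols are easy enough to sketch in base R. The flows below are invented, the use of red for net outgoing is an assumption, and taking the square root of the absolute difference is just one way of making symbol area, rather than radius, track the value.

```r
# Made-up flows between three places
flows <- data.frame(
  from   = c("London", "London", "Paris", "Sydney"),
  to     = c("Paris", "Sydney", "London", "London"),
  amount = c(10, 5, 3, 8),
  stringsAsFactors = FALSE
)

places <- sort(unique(c(flows$from, flows$to)))
# Sum amounts by destination (incoming) and by origin (outgoing)
incoming <- tapply(flows$amount, factor(flows$to, levels = places), sum)
outgoing <- tapply(flows$amount, factor(flows$from, levels = places), sum)
# Places with no flow in one direction come back as NA; treat those as zero
incoming[is.na(incoming)] <- 0
outgoing[is.na(outgoing)] <- 0

totals <- data.frame(place = places,
                     incoming = as.vector(incoming),
                     outgoing = as.vector(outgoing),
                     stringsAsFactors = FALSE)
totals$net <- totals$incoming - totals$outgoing
# Symbol size from the absolute difference, colour from its sign
totals$size   <- sqrt(abs(totals$net))
totals$colour <- ifelse(totals$net >= 0, "green", "red")
print(totals)
```

The size and colour columns could then be passed as cex and col to points() once the places have been geocoded.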

## Narrative Charts Tell the Tale…

A couple of days ago, I got a message from @fantasticlfe asking if I’d done any tinkerings around what turned out to be “narrative charts”. I kept misapprehending what he was after (something to do with continuity?!;-), so here’s a summary of various graphical devices for looking at narrative texts that we passed back and forth, along with some we didn’t…

A Sankey diagram typically uses variable thickness lines to show flow between different elements in a system. (For this reason it’s often used to show energy flows through a system, though it can also be used to good effect to show money flows.) The chart Michael linked to comes from xkcd:

In this chart, we have time along the horizontal x-axis. The y-axis is ambiguous (some sort of nominal ordering?) and the line thickness appears to represent army size.

To a certain extent, this diagram is reminiscent of Minard’s famous chart…

(See also What Makes a Minard? for some contemporary Minard diagrams. Is code available, I wonder?)

However, in the case of Minard’s chart (which I personally don’t like at all!), the x and y co-ordinates represent map co-ordinates – the thick lines aren’t thick lines in a line chart (which a glanced “up and to the right” view might make you assume), they’re flow lines across a map.

I got distracted for a while by the Sankey aspect, and dug around my own bits of code. For example, Generating Sankey Diagrams from rCharts, an rCharts wrapper for the d3.js Sankey diagram. Michael was particularly interested in being able to group lines vertically (though I wasn’t sure what the y-axis would actually correspond to: some loose function of “location”, maybe as a categorical variable? Time was definitely to be on the horizontal x-axis); a posting on Stack Overflow (d3 sankey charts – manually position node along x axis) seemed likely to be able to help with that.

I then started going off on one…

Would a variant of nltk style lexical dispersion plots help, using characters rather than word categories? That would show when a character was in scene, but not much else?

How about sentence drawing, in which we show “turns” taken by different speakers?

This shows something, but again, not relevant…

Nor are Kurt Vonnegut’s shapes-of-stories diagrams that plot some sort of emotional state on y and time on x:

Hmmm… Michael wanted to be able to look at scenes on x and presumably some function of location on y. Hmm… why? And how might we actually order those axes? Scenes occur in order in a film or play, but scene is a ranked, ordinal value. That said, scenes also have duration in terms of screentime, which may or may not be the same as the “interval” that the scene portrays in terms of the world it represents (this must have a name? eg a 20 second screen time scene shows a plane flying and this represents x hours in the story). The scene may also have a ‘calendar time’ associated with it in the story – so where you have a flashback scene this corresponds to a previous calendar time in the represented world. Did Michael want any of these dimensions capturing?

Related to shapes of stories, here’s how someone analysed a large collection of plots: Examining the arc of 100,000 stories: a tidy analysis.

And then there’s location… how should these be represented? Locations are a distance apart and, perhaps more importantly from a continuity point of view, a travel time apart, as well as maybe a timezone difference apart. Did that need capturing in any way? Ordering axes for this could be quite hard if we wanted close things in space (distance? travel time?) to be close together on a single axis (A is ten minutes from B and C, B is ten minutes from C: how do you show that intransitive relation on a single dimension?). [Maybe relevant? Storygraph: Extracting patterns from spatio-temporal data, A Shrestha et al., Advances in Visual Computing.] Hmm… If we can capture distance between locations, and some sensible notion of time relating to scenes, could we maybe use line thickness to show that a person has lots of time to move between one (time, location) and another, as compared to scenetime? Do filmwriters have tools to support this? Do the police…?! Is the Mythology Engine relevant?

How about thinking about it as a graph? I’ve used Gephi before as a foil for getting me to think about ordered series as connected events in a graph – for example, Visualising F1 Timing Sheet Data. If we encode scene number as the x-coordinate and location number as the y-coordinate, with each graph line being the connected series of scenes a particular individual is in, then we can simply use a line chart to connect “individual lines” to different scene and location numbers. We’d also have a couple of extra dimensions to play with – node size and node colour, at each location. We’d also have the opportunity to play with edge (that is, line) colour and edge thickness?
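As a rough sketch of that encoding, here's some base R that puts scene number on x and a numbered location on y, then draws one line per character. The characters, scenes and locations are all invented, and the alphabetical ordering of locations on the y-axis is arbitrary, which is exactly the open question about how that axis should be ordered.

```r
# One row per (character, scene) appearance, with the location of that scene
appearances <- data.frame(
  character = c("A", "A", "A", "B", "B", "C", "C", "C"),
  scene     = c(1, 2, 4, 1, 3, 2, 3, 4),
  location  = c("Castle", "Forest", "Castle", "Castle", "Village",
                "Forest", "Village", "Castle"),
  stringsAsFactors = FALSE
)
# Encode location as a number for the y-axis (alphabetical order, arbitrary)
loc_levels <- sort(unique(appearances$location))
appearances$y <- match(appearances$location, loc_levels)

# Split into one data frame per character, ordered by scene
tracks <- lapply(split(appearances, appearances$character),
                 function(d) d[order(d$scene), ])

# Draw an empty frame, then one connected line per character
plot(NULL, xlim = range(appearances$scene), ylim = c(1, length(loc_levels)),
     xlab = "Scene", ylab = "", yaxt = "n")
axis(2, at = seq_along(loc_levels), labels = loc_levels, las = 1)
for (i in seq_along(tracks)) {
  lines(tracks[[i]]$scene, tracks[[i]]$y, type = "b", col = i, pch = 19)
}
legend("topright", legend = names(tracks), col = seq_along(tracks),
       lty = 1, pch = 19)
```

Characters who share a scene and location overplot each other here, which is one reason proper narrative charts nudge the lines apart vertically.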

Maybe I need to try to do some demos? But no time for that right now…

How about trying to find some? Here are some discovered via @jamesjefferies:

Here’s a view of connected (by travel between) locations in Game of Thrones:

There’s also an animation of events in Game of Thrones, but I can’t quite figure out how to read it?!

Let’s go back to the sort of thing Michael was after – narrative charts…

@imhelenj found a related if cluttered interactive describing the evolution of web tech:

Then Michael shared a link to Comic Book Narrative Charts, a project for automatically generating xkcd style narrative charts:

Hovering over these charts, I noticed they were interactive d3.js charts. A quick View Source and the code for generating the chart dynamically from a characters file and a narrative file appeared to be there. Which I think is what Michael wanted all along…!

(By the by, the post also describes how the developers started thinking about fixing the vertical y-coordinate values. Here’s another example of someone thinking aloud around producing a narrative chart for the Holy Week story.)

Ho hum, an interesting set of detours nonetheless – and it got me thinking about the time-space complexity of a scene-based tale that could keep me confused for weeks! :-)

PS this is quite interesting – visualising a process, via Tactical Tech Drawing By Numbers project:

PPS some more bits: @r4isstatic points to Some visualisations of stories and narratives, another summary post similar to this one. Also via Paul Rissen, and picking up on whether the police have any interesting actor/event/time/location diagramming techniques, Vispol – An Interactive Scenario Visualization.

Elsewhere, I find Storyline Visualizations, which includes a paper (Design Considerations for Optimizing Storyline Visualizations, Y Tanahashi and K-L Ma, IEEE Trans on Visualization and Computer Graphics, 18(12) 2012, pp2679-2688) and some Python code.

PPPS Some more… A collection by Stewart McKie of techniques for visualising screenplays: Screenplay Visualization: Concepts and Practice. The posts I wrote on the Digital Worlds game design uncourse blog about narrative structure. Sort of via Scott Wilson, some crime analysis software from xanalys.com (Link Explorer – White Paper) which includes descriptions of an event chart, a transaction chart and an activity timeline:

Via the comments, this rather lovely animated discourse map:

## Recreational Data: Data Golf

I’m still hopeful of working up the idea of recreational data as a popular pastime activity with a regular column somewhere and a stocking filler book each Christmas (?!;-), but haven’t had much time to commit to working up some great examples lately:-(

However, here’s a neat idea – data golf – as described in a post by Bogumił Kamiński (RGolf) that I found via RBloggers:

There are many code golf sites, even some support R. However, most of them are algorithm oriented. A true RGolf competition should involve transforming a source data frame to some target format data frame.

So the challenge today will be to write a shortest code in R that performs a required data transformation

An example is then given of a data reshaping/transformation problem based on a real data task (wrangling survey data, converting it from a long to a wide format) in the smallest amount of R.
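To give a flavour of the sort of long to wide transformation involved, here's a minimal base R sketch using reshape(). The survey data is invented for the example and is not the puzzle from the RGolf post.

```r
# Invented long format survey data: one row per respondent per question
long <- data.frame(
  id       = c(1, 1, 2, 2, 3, 3),
  question = c("q1", "q2", "q1", "q2", "q1", "q2"),
  answer   = c(5, 3, 4, 4, 2, 1),
  stringsAsFactors = FALSE
)

# One row per respondent, one answer column per question
wide <- reshape(long, idvar = "id", timevar = "question", direction = "wide")
print(wide)
```

A data golfer would presumably then try to get the same table in fewer characters.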

Of course, R need not be the only language that can be used to play this game. For the course I’m currently writing, I think I’ll pitch data golf as a Python/pandas activity in the section on data shaping. OpenRefine also supports a certain number of reshaping transformations, so that’s another possible data golf course(?). As are spreadsheets. And so on…

Hmmm… thinks… pivot table golf?

Also related: string parsing/transformation or partial string extraction using regular expressions; for example, Regex Tuesday, or how about Regex Crossword.
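For the regular expression flavour of the game, here's a small base R sketch of partial string extraction. The strings and patterns are made up for illustration.

```r
x <- c("Order 1234 shipped 2014-06-01",
       "No date here",
       "Order 77 shipped 2014-12-25")

# regexpr() locates the first match in each string;
# regmatches() then pulls out the matched text (non-matches are dropped)
m <- regexpr("[0-9]{4}-[0-9]{2}-[0-9]{2}", x)
dates <- regmatches(x, m)

# sub() with a capture group keeps just the order number;
# strings that don't match the pattern are returned unchanged
order_ids <- sub("^Order ([0-9]+) .*$", "\\1", x)
print(dates)
print(order_ids)
```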

## OpenRefine Docker Containers

I had a go at building a couple of docker containers for OpenRefine, one from the latest release and one from the latest code on github:

In order to create the virtual machine, you should:

• install boot2docker
• run boot2docker
• Either: to run with a project directory solely within the container, in the boot2docker terminal, enter the command docker run --name openrefine -d -p 3334:3333 psychemedia/openrefine
• Or: to run with a project directory mounted from a shared folder on the host, in the boot2docker terminal, enter the command docker run -d -p 3334:3333 -v /path/to/yourSharedDirectory:/mnt/refine --name openrefine psychemedia/openrefine
• Or: to run with a project directory in a linked data volume, in the boot2docker terminal, enter the command docker run -d -p 3334:3333 -v /mnt/refine --name openrefine psychemedia/openrefine

(To use the latest release rather than a recent build use psychemedia/docker-openrefine rather than psychemedia/openrefine.)

The port number you will be able to find OpenRefine on is given by the first number set in the flag -p NNNN:3333. To access OpenRefine via port 3334, use -p 3334:3333 etc.

OpenRefine will then be available via your browser at the URL http://IPADDRESS:NNNN. The required value of IPADDRESS can be found using the command boot2docker ip

The returned IP address (eg 192.168.59.103) is the IP address you can find OpenRefine on, for example: http://192.168.59.103:3334.

PS for a Dockerfile in the current directory, to build an image from the Dockerfile, use something like:

docker build -t myname/mycontainer .