Figure Aesthetics or Overlays?

Tinkering with a new chart type over the weekend, I spotted something rather odd in my F1 track history charts – what look to be outliers in the form of cars that hadn’t been lapped on that lap appearing behind the lap leader of the next lap, on track.

If you count the number of cars on that leadlap, it’s also greater than the number of cars in the race on that lap.

How could that be? Cars being unlapped, perhaps, and so “appearing twice” on a particular leadlap – that is, recording two laptimes between consecutive passes of the start/finish line by the race leader?

My fix for this was to add an “unlap” attribute that detects whether a car has recorded more than one laptime within a given leadlap:

#Count how many laptimes each car records within a given leadlap
library(plyr)
lapTimes=ddply(lapTimes,.(leadlap,code),transform,unlap= seq_along(leadlap))

This groups by leadlap and car, and counts 1 for each occurrence. So if the unlap count is greater than 1, a car has completed more than one lap within a given leadlap.

My first thought was to add this as an overprint on the original chart:

#Overprint unlaps
g = g + geom_point(data = lapTimes[lapTimes['unlap']>1,],
                   aes(x = trackdiff, y = leadlap, col=(leadlap-lap)), pch = 0)

This renders as follows:

Whilst it works, as an approach it is inelegant, and had me up in the night pondering the use of overlays rather than aesthetics.

Because we can also view the fact that the car was on its second pass of the start/finish line for a given lead lap as a key property of the car and depict that directly via an aesthetic mapping of that property onto the symbol type:

  g = g + geom_point(aes( x = trackdiff, y = leadlap,
                          col = (lap == leadlap),
                          pch= (unlap==1) ))+scale_shape_identity()

This renders just a single mark on the chart, depicting the diff to the leader *as well as* the unlapping characteristic, rather than the two marks used previously – one for the diff, and a second, overprinted mark to depict the unlapping nature of that mark.

So now I’m wondering – when would it make sense to use multiple marks by overprinting?

Here’s one example where I think it does make sense: where I pass an argument into the chart plotter to highlight a particular driver by infilling a marker with a symbol to identify that driver.

#Drivers of interest passed in using construction: code=list(c("STR","+"),c("RAI","*"))
if (!is.na(code)){
  for (t in code) {
    g = g + geom_point(data = lapTimes[lapTimes['code'] == t[1], ],
                       aes(x = trackdiff, y = leadlap),
                       pch = t[2])
  }
}

In this case, the + symbol is not a property of the car, it is an additional information attribute that I want to add to that car, but not the other cars. That is, it is a property of my interest, not a property of the car itself.

Race Track Concordance Charts

Since getting started with generating templated R reports a few weeks ago, I’ve started spending the odd few minutes every race weekend looking at ways of automating the generation of F1 qualifying and race reports.

In yesterday’s race, some of the commentary focussed on whether MAS had given BOT an “assist” in blocking VET, which got me thinking about better ways of visualising whether drivers are stuck in traffic or not.

The track position chart makes a start at this, but it can be hard to focus on a particular driver (identified using a particular character to infill the circle marker for that driver). The race leader’s track position ahead is identified from the lap offset race leader marker at the right hand side of the chart.

One way to help keep track of things from the perspective of a particular driver, rather than the race leader, is to rebase the origin of the x-axis relative to that driver.

In my track chart code, I use a dataframe that has a trackdiff column that gives a time offset on track to race leader for each lead lap.

track_encoder=function(lapTimes){
  #Find the accumulated race time at the start of each leader's lap
  lapTimes = ddply(lapTimes, .(leadlap), transform, lstart = min(acctime))

  #Find the on-track gap to leader
  lapTimes['trackdiff'] = lapTimes['acctime'] - lapTimes['lstart']
  lapTimes
}

Rebasing for a particular driver simply means resetting the origin with respect to that time, using the trackdiff time for one driver as an offset for the others, to create a new trackdiff2 for use on the x-axis.

#I'm sure there must be a more idiomatic way of doing this?
rebase=lapTimes[lapTimes['code']==code,c('leadlap','trackdiff')]
rebase=rename(rebase,c('trackdiff'='trackrebase'))
lapTimes=merge(lapTimes,rebase,by='leadlap')
lapTimes['trackdiff2']=lapTimes['trackdiff']-lapTimes['trackrebase']

Here’s how it looks for MAS:

But not so useful for BOT, who led much of the race:

This got me thinking about text concordances. In the NLTK text analysis package, the text concordance function allows you to display a search term centred in the context in which it is found:

[Figure: NLTK concordance view]

The concordance view finds the location of each token and then displays the search term surrounded by tokens in neighbouring locations, within a particular window size.

I spent a chunk of time wondering how to do this sensibly in R, struggling to identify what it was I actually wanted to do: for a particular driver, find the neighbouring cars in terms of accumulated laptime on each lap. After failing to see the light for more than an hour or so, I thought of it in terms of an SQL query, and the answer fell straight out – for the specified driver on a particular leadlap, get their accumulated laptime and the rows with accumulated laptimes in a window around it.

#code (the specified driver) and limits (the window, in seconds, either side) are set elsewhere
library(sqldf)
inscope=sqldf(paste0('SELECT l1.code as code,l1.acctime-l2.acctime as acctimedelta,
l2.lap-l1.lap as lapdelta, l2.lap as focuslap
FROM lapTimes as l1 join lapTimes as l2
WHERE l1.acctime < (l2.acctime + ', abs(limits[2]), ') AND l1.acctime > (l2.acctime - ', abs(limits[1]),')
AND l2.code="',code,'";'))

Plotting against the accumulated laptime delta on the x-axis gives a chart like this:

If we add in horizontal rules to show laps where the specified driver pitted and vertical bars to show pit windows, we get a much richer picture of the race from the point of view of the driver.
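
By way of illustration, here’s a minimal sketch of how those annotation layers might be added to the ggplot object (g) behind the concordance chart; the pitLaps and pitWindow values are hypothetical placeholders rather than anything derived in this post:

#Sketch: annotate the concordance chart with pit information
pitLaps = c(11, 30)     #hypothetical: laps on which the specified driver pitted
pitWindow = c(18, 25)   #hypothetical: expected pit loss window, in seconds, relative to the driver
g = g + geom_hline(yintercept = pitLaps, linetype = 'dotted', colour = 'grey')
g = g + geom_vline(xintercept = pitWindow, colour = 'lightgrey')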

Here’s how it looks from the perspective of BOT, who led most of the race:

Different symbols inside the markers can be used to track different drivers (in the above charts, BOT and VET are highlighted). The colours identify whether or not a car is on the same lap as the specified driver: shades of blue through green (as per the “blue flag”) for cars on laps ahead, and orange through red for cars an increasing number of laps behind (i.e. backmarkers from the perspective of the specified driver). If a marker is light blue, that car is on the same lap and you’re racing…
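
A minimal sketch of that colour mapping, working from the inscope dataframe generated by the SQL query above (the palette here is an approximation rather than the exact one used in the charts):

#Sketch: colour cars by how many laps ahead of or behind the specified driver they are
#(in inscope, lapdelta = l2.lap - l1.lap, so cars on laps ahead of the focus driver have negative values)
g = ggplot(inscope)
g = g + geom_point(aes(x = acctimedelta, y = focuslap, colour = -lapdelta))
g = g + scale_colour_gradient2(low = 'red', mid = 'lightblue', high = 'blue', midpoint = 0)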

All in all, I’m pretty chuffed (for now!) with how that chart came together.

And a new recipe to add to the Wrangling F1 Data With R book, I guess..

PS in response to [misunderstanding…] a comment from @sidepodcast, we also have control over the concordance window size, and the plotsize:

[Figure: concordance chart with a larger window and plot size]

Generating hi-res versions in other file formats is also possible.

Just got to wrap it all up in a templated report now…

PPS On the track position charts, I just noticed that where cars are lapped, they fall off the radar… so I’ve added them in behind the leader to keep the car count correct for each leadlap…

[Figure: track position chart with lapped cars shown behind the leader]


PS See also: A New Chart Type – Race Concordance Charts, which also includes examples of “line chart” renderings of the concordance charts so you can explicitly see the progress of each individually highlighted driver on track.

Experimenting With Sankey Diagrams in R and Python

A couple of days ago, I spotted a post by Oli Hawkins on Visualising migration between the countries of the UK which linked to a Sankey diagram demo of Internal migration flows in the UK.

One of the things that interests me about the Jupyter and RStudio centred reproducible research ecosystems is their support for libraries that generate interactive HTML/javascript outputs (charts, maps, etc) from a computational data analysis context such as R, or python/pandas, so it was only natural (?!) that I thought I should see how easy it would be to generate something similar from a code context.

In an R context, there are several libraries available that support the generation of Sankey diagrams, including googleVis (which wraps Google Chart tools), and a couple of packages that wrap d3.js – an original rCharts Sankey diagram demo by @timelyportfolio, and a more recent HTMLWidgets demo (sankeyD3).

Here’s an example of the evolution of my Sankey diagram in R using googleVis – the Rmd code is here and a version of the knitted HTML output is here.

The original data comprised a matrix relating population flows between English regions, Wales, Scotland and Northern Ireland. The simplest rendering of the data using the googleVis Sankey diagram generator produces an output that uses default colours to label the nodes.

Using the country code indicator at the start of each region/country identifier, we can generate a mapping from country to a country colour that can then be used to identify the country associated with each node.

One of the settings for the diagram allows the source (or target) node colour to determine the edge colour. We can also play with the values we use as node labels:
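
As a minimal sketch of those two settings using googleVis, assuming an illustrative migration dataframe with from, to and weight columns in which each node identifier starts with a single country-code letter (E, N, S or W) – the column names and palette here are my assumptions rather than the original code:

#Sketch: colour nodes by the country indicated by the leading letter of each identifier,
#and let the source node colour set the edge colour
library(googleVis)

countryColour = c(E = "#1f77b4", N = "#ff7f0e", S = "#9467bd", W = "#2ca02c")  #illustrative palette

#Node colours are applied in the order nodes first appear in the data (source, then target, row by row)
nodes = unique(as.vector(rbind(migration$from, migration$to)))
nodeColours = unname(countryColour[substr(nodes, 1, 1)])

sankeyOpts = paste0("{node: {colors: ['", paste(nodeColours, collapse = "','"), "']},",
                    " link: {colorMode: 'source'}}")

sk = gvisSankey(migration, from = "from", to = "to", weight = "weight",
                options = list(sankey = sankeyOpts))
plot(sk)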

If we exclude edges relating to flow between regions of the same country, we get a diagram that is more reminiscent of Oli’s original (country level) demo. Note also that the charts that are generated are interactive – in this case, we see a popup that describes the flow along one particular edge.

If we associate a country with each region, we can group the data and sum the flow values to produce country level flows. Charting this produces a chart similar to the original inspiration.
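
Continuing the illustrative migration dataframe from the sketch above, the aggregation step might look something like:

#Sketch: roll the region level flows up to country level flows
migration$fromCountry = substr(migration$from, 1, 1)
migration$toCountry = substr(migration$to, 1, 1)

countryFlows = aggregate(weight ~ fromCountry + toCountry, data = migration, sum)

#Optionally drop within-country flows before charting
countryFlows = countryFlows[countryFlows$fromCountry != countryFlows$toCountry, ]

sk2 = gvisSankey(countryFlows, from = "fromCountry", to = "toCountry", weight = "weight")
plot(sk2)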

As well as providing the code for generating each of the above Sankey diagrams, the Rmd file linked above also includes demonstrations for generating basic Sankey diagrams for the original dataset using the rCharts and htmlwidgets R libraries.

In order to provide a point of comparison, I also generated a python/pandas workflow using Jupyter notebooks and the ipysankey widget. (In fact, I generated the full workflow through the different chart versions first in pandas – I find it an easier language to think in than R! – and then used that workflow as a crib for the R version…)

The original notebook is here and an example of the HTML version of it here. Note that I tried to save a rasterisation of the widgets but they don’t seem to have turned out that well…

The original (default) diagram looks like this:

and the final version, after a bit of data wrangling, looks like this:

Once again, all the code is provided in the notebook.

One of the nice things about all these packages is that they produce outputs that can be reused/embedded elsewhere, or that can be used as a first automatically produced draft of code that can be tweaked by hand. I’ll have more to say about that in a future post…

How Reproducible Data Analysis Scripts Can Help You Route Around Data Sharing Blockers

For aaaagggggggeeeeeeesssssss now, I’ve been wittering on about how just publishing “open data” is okay insofar as it goes, but it’s often not that helpful, or at least, not as useful as it could be. Yes, it’s a Good Thing when a dataset is published in support of a report; but have you ever tried reproducing the charts, tables, or summary figures mentioned in the report from the data supplied along with it?

If a report is generated “from source” using something like Rmd (RMarkdown), which can blend text with analysis code and a means to import the data used in the analysis, as well as the automatically generated outputs (such as charts, tables, or summary figures) obtained by executing the code over the loaded-in data, third parties can see exactly how the data was turned into reported facts. And if you need to run the analysis again with a more recent dataset, you can do. (See here for an example.)

But publishing details about how to do the lengthy first mile of any piece of data analysis – finding the data, loading it in, and then cleaning and shaping it enough so that you can actually start to use it – has additional benefits too.

In the above linked example, the Rmd script links to a local copy of a dataset I’d downloaded onto my local computer. But if I’d written a properly reusable, reproducible script, I should have done at least one of the following two things:

  • either added a local copy of the data to the repository and checked that the script correctly linked relatively to it;
  • and/or provided the original download link for the datafile (and the HTML web page on which the link could be found) and loaded the data in from that URL.
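
As a minimal sketch of the second approach (the URL and file path here are placeholders, not the actual dataset):

#Sketch: load the data from the original download URL, falling back to a local copy if one exists
dataUrl = "https://example.org/path/to/datafile.csv"   #placeholder URL
localCopy = "data/datafile.csv"                        #placeholder local path

if (file.exists(localCopy)) {
  df = read.csv(localCopy, stringsAsFactors = FALSE)
} else {
  df = read.csv(dataUrl, stringsAsFactors = FALSE)
}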

Where the license of a dataset allows sharing, the first option is always a possibility. But where the license does not allow sharing on, the second approach provides a de facto way of sharing the data without actually sharing it directly yourself. I may not be giving you a copy of the data, but I am giving you some of the means by which you can obtain a copy of the data for yourself.

As well as getting round licensing requirements that limit sharing of a dataset but allow downloading of it for personal use, this approach can also be handy in other situations.

For example, where a dataset is available from a particular URL but authentication is required to access it (this often needs a few more tweaks when trying to write the reusable downloader! A stop-gap is to provide the URL in the reproducible report document and explicitly instruct the reader to download the dataset locally using their own credentials, then load it in from the local copy).

Or as Paul Bivand pointed out via Twitter, in situations “where data is secure like pupil database, so replication needs independent ethical clearance”. In a similar vein, we might add where data is commercial, and replication may be forbidden, or where additional costs may be incurred. And where the data includes personally identifiable information, such as data published under a DPA exemption as part of a public register, it may be easier all round not to publish your own copy or copies of data from such a register.

Sharing recipes also means you can share pathways to the inclusion of derived datasets, such as named entity tags extracted from a text using free, but non-shareable, (or at least, attributable) license key restricted services, such as the named entity extraction services operated by Thomson Reuters OpenCalais, Microsoft Cognitive Services, IBM Alchemy or Associated Press. That is, rather than tagging your dataset and then sharing and analysing the tagged data, publish a recipe that will allow a third party to tag the original dataset themselves and then analyse it.

Reporting in a Repeatable, Parameterised, Transparent Way

Earlier this week, I spent a day chatting to folk from the House of Commons Library as a part of a temporary day-a-week-or-so bit of work I’m doing with the Parliamentary Digital Service.

During one of the conversations on matters loosely geodata-related with Carl Baker, Carl mentioned an NHS Digital data set describing the number of people on a GP Practice list who live within a particular LSOA (Lower Super Output Area). There are possible GP practice closures on the Island at the moment, so I thought this might be an interesting dataset to play with in that respect.

Another thing Carl is involved with is producing a regularly updated briefing on Accident and Emergency Statistics. Excel and QGIS templates do much of the work in producing the updated documents, so much of the data wrangling side of the report generation is automated using those tools. Supporting regular updating of briefings, as well as answering specific, ad hoc questions from MPs, producing debate briefings and other current topic briefings, seems to be an important Library activity.

As I’ve been looking for opportunities to compare different automation routes using things like Jupyter notebooks and RMarkdown, I thought I’d have a play with the GP list/LSOA data, showing how we might be able to use each of those two routes to generate maps showing the geographical distribution, across LSOAs at least, for GP practices on the Isle of Wight. This demonstrates several things, including: data ingest; filtering according to practice codes accessed from another dataset; importing a geoJSON shapefile; generating a choropleth map using the shapefile matched to the GP list LSOA codes.

The first thing I tried was using a python/pandas Jupyter notebook to create a choropleth map for a particular practice using the folium library. This didn’t take long to do at all – I’ve previously built an NHS admin database that lets me find practice codes associated with a particular CCG, such as the Isle of Wight CCG, as well as a notebook that generates a choropleth over LSOA boundaries, so it was simply a case of copying and pasting old bits of code and adding in the new dataset. You can see a rendered example of the notebook here (download).

One thing you might notice from the rendered notebook is that I actually “widgetised” it, allowing users of the live notebook to select a particular practice and render the associated map.

Whilst I find the Jupyter notebooks to provide a really friendly and accommodating environment for pulling together a recipe such as this, the report generation workflows are arguably still somewhat behind the workflows supported by RStudio and in particular the knitr tools.

So what does an RStudio workflow have to offer? Using Rmarkdown (Rmd) we can combine text, code and code outputs in much the same way as we can in a Jupyter notebook, but with slightly more control over the presentation of the output.

[Screenshot: the Rmd demo in RStudio]

For example, from a single Rmd file we can knit an output HTML file that incorporates an interactive leaflet map, or a static PDF document.
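
By way of illustration, here’s a minimal sketch of the sort of leaflet choropleth chunk that might sit in such an Rmd file; the file, object and column names (iow_lsoa.geojson, gp_lsoa, patients) are assumptions rather than the names used in the actual demo:

#Sketch: interactive choropleth of GP practice list counts over LSOA boundaries
library(rgdal)
library(leaflet)

lsoas = readOGR("iow_lsoa.geojson", "OGRGeoJSON")
#Assume gp_lsoa holds the LSOA level list counts for the selected practice
lsoas = merge(lsoas, gp_lsoa, by.x = "LSOA11CD", by.y = "lsoa_code")

pal = colorNumeric("Blues", domain = lsoas$patients)

leaflet(lsoas) %>%
  addTiles() %>%
  addPolygons(fillColor = ~pal(patients), fillOpacity = 0.7,
              weight = 1, color = "#444444")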

It’s also possible to use a parameterised report generation workflow to generate separate reports for each practice. For example, applying this parameterised report generation script to a generic base template report will generate a set of PDF reports on a per practice basis for each practice on the Isle of Wight.
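
The parameterised workflow boils down to something like the following sketch, in which the template Rmd declares a practice parameter in its YAML params block and a driver script loops over the practices; practice_report.Rmd and the example practice codes are illustrative names rather than the actual script:

#Sketch: render one PDF per practice from a single parameterised Rmd template
library(rmarkdown)

practiceCodes = c("J84004", "J84005")   #illustrative practice codes

for (p in practiceCodes) {
  render("practice_report.Rmd",
         output_format = "pdf_document",
         output_file = paste0("report_", p, ".pdf"),
         params = list(practice = p))
}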

The bookdown package, which I haven’t played with yet, also looks promising for its ability to generate a single output document from a set of source documents. (I have a question in about the extent to which bookdown supports partially parameterised compound document creation).

Having started thinking about comparisons between Excel, Jupyter and RStudio workflows, possible next steps are:

  • to look for sensible ways of comparing the workflow associated with each,
  • the ramp-up skills required, and blockers (including cultural blockers (also administrative / organisational blockers, h/t @dasbarrett)) associated with getting started with new tools such as Jupyter or RStudio, and
  • the various ways in which each tool/workflow supports: transparency; maintainability; extendibility; correctness; reuse; integration with other tools; ease and speed of use.

It would also be interesting to explore how much time and effort would actually be involved in trying to port a legacy Excel report generating template to Rmd or ipynb, what sorts of issues would be likely to arise, and what benefits Excel offers compared to Jupyter and RStudio workflows.

A Recipe for Automatically Going From Data to Text to Reveal.js Slides

Over the last few years, I’ve experimented on and off with various recipes for creating text reports from tabular data sets (spreadsheet plugins are also starting to appear with a similar aim in mind). There are several issues associated with this, including:

  • identifying what data or insight you want to report from your dataset;
  • (automatically deriving the insights);
  • constructing appropriate sentences from the data;
  • organising the sentences into some sort of narrative structure;
  • making the sentences read well together.

Another approach to humanising the reporting of tabular data is to generate templated webpages that review and report on the contents of a dataset; this has certain similarities to dashboard style reporting, mixing tables and charts, although some simple templated text may also be generated to populate the page.

In a business context, reporting often happens via Powerpoint presentations. Slides within the presentation deck may include content pulled from a templated spreadsheet, which itself may automatically generate tables and charts for such reuse from a new dataset. In this case, the recipe may look something like:

[Figure: Excel data to Powerpoint slide workflow (blockdiag)]

#render via: http://blockdiag.com/en/blockdiag/demo.html
{
  X1[label='macro']
  X2[label='macro']

  Y1[label='Powerpoint slide']
  Y2[label='Powerpoint slide']

   data -> Excel -> Chart -> X1 -> Y1;
   Excel -> Table -> X2 -> Y2 ;
}

In the previous couple of posts, the observant amongst you may have noticed I’ve been exploring a couple of components for a recipe that can be used to generate reveal.js browser based presentations from the 20% that account for the 80%.

The dataset I’ve been tinkering with is a set of monthly transparency spending data from the Isle of Wight Council. Recent releases have the form:

[Figure: sample of the Isle of Wight transparency spending data]

So as hinted at previously, it’s possible to use the following sort of process to automatically generate reveal.js slideshows from a Jupyter notebook with appropriately configured slide cells (actually, normal cells with an appropriate metadata element set) used as an intermediate representation.

[Figure: data to Jupyter notebook (slide mode) to reveal.js workflow (blockdiag)]

{
  X1[label="text"]
  X2[label="Jupyter notebook\n(slide mode)"]
  X3[label="reveal.js\npresentation"]

  Y1[label="text"]
  Y2[label="text"]
  Y3[label="text"]

  data -> "pandas dataframe" -> X1  -> X2 ->X3
  "pandas dataframe" -> Y1,Y2,Y3  -> X2 ->X3

  Y2 [shape = "dots"];
}

There’s an example slideshow based on October 2016 data here. Note that some slides have “subslides”, that is, slides underneath them, so watch the arrow indicators bottom left to keep track of when they’re available. Note also that the scrolling is a bit hit and miss – ideally, a new slide would always be scrolled to the top, and for fragments inserted into a slide one at a time the slide should scroll down to follow them.

The structure of the presentation is broadly as follows:

[Figure: presentation structure (blockdiag)]

For example, here’s a summary slide of the spends by directorate – note that we can embed charts easily enough. (The charts are styled using seaborn, so a range of alternative themes are trivially available). The separate directorate items are brought in one at a time as fragments.

[Slide: summary of spend by directorate]

The next slide reviews the capital versus revenue spend for a particular directorate, broken down by expenses type (corresponding slides are generated for all other directorates). (I also did a breakdown for each directorate by service area.)

The items listed are ordered by value, and taken together account for at least 80% of the spend in the corresponding area. Any further items contributing more than 5%(?) of the corresponding spend are also listed.

[Slide: capital versus revenue spend for a directorate, by expenses type]

Notice that subslides are available going down from this slide, rather than across the main slides in the deck. This 1.5D structure means we can put an element of flexible narrative design into the presentation, giving the reader an opportunity to explore the data, but in a constrained way.

In this case, I generated subslides for each major contributing expenses type to the capital and revenue pots, and then added a breakdown of the major suppliers for that spending area.

[Slide: major suppliers for a spending area]

This just represents a first pass at generating a 1.5D slide deck from a tabular dataset. A Pareto (80/20) heuristic is used to try to prioritise the information displayed so as to account for 80% of spend in different areas, or other significant contributions.

Applying this principle repeatedly allows us to identify major spending areas, and then major suppliers within those spending areas.
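
The selection heuristic itself is straightforward enough; here’s a sketch of the idea in R (the actual pipeline is pandas based, and the function and column names here are placeholders):

#Sketch: keep the largest items accounting for at least 80% of the total spend,
#plus any remaining item individually worth more than 5% of the total
paretoItems = function(df, valcol = "Amount", threshold = 0.8, alsoOver = 0.05) {
  df = df[order(-df[[valcol]]), ]
  share = df[[valcol]] / sum(df[[valcol]])
  cumshare = cumsum(share)
  #Keep items up to and including the one that takes the running total past the threshold,
  #plus any later item whose individual share exceeds alsoOver
  keep = (cumsum(cumshare >= threshold) <= 1) | (share > alsoOver)
  df[keep, ]
}

Applied at each level in turn (directorate, then expenses type, then supplier), this gives the shortlist of items reported on each slide.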

The next step is to look at other ways of segmenting and structuring the data in order to produce reports that might actually be useful…

If you have any ideas, please let me know via the comments, or get in touch directly…

PS FWIW, it should be easy enough to run any other dataset that looks broadly like the example at the top through the same code with only a couple of minor tweaks…

Running the Numbers – How Can Hamilton Still Take the 2016 F1 Drivers’ Championship?

Way back in 2012, I posted a simple R script for trying to work out the finishing combinations in the last two races of that year’s F1 season for Fernando Alonso and Sebastian Vettel to explore the circumstances under which Alonso could take the championship (Paths to the F1 2012 Championship Based on How They Might Finish in the US Grand Prix); I also put together a simple shiny version of the script to make it a bit more app like (Interactive Scenarios With Shiny – The Race to the F1 2012 Drivers’ Championship), which I also updated for the 2014 season (F1 Championship Race, 2014 – Winning Combinations…).

And now we come to 2016, and once again, with two races to go, there are two drivers in with a chance of winning overall… But what race finishing combinations could see Hamilton make a last stand and reclaim his title? The F1 Drivers’ Championship Scenarios, 2016 shiny app will show you…

[Figure: F1 Drivers’ Championship Scenarios, 2016 shiny app]

You can find the code in a gist here:

library(shiny)
library(ggplot2)
library(reshape)

# Define server logic required to generate the championship outcome grid
shinyServer(function(input, output) {

  #Points awarded for positions 1-10; pos 11 represents a finish outside the points
  points = data.frame(pos = 1:11, val = c(25, 18, 15, 12, 10, 8, 6, 4, 2, 1, 0))

  #Championship points before the Brazilian Grand Prix: a = HAM (330), v = ROS (349)
  a = 330
  v = 349

  pospoints = function(a, v, pdiff, points) {
    pp = matrix(ncol = nrow(points), nrow = nrow(points))
    for (i in 1:nrow(points)) {
      for (j in 1:nrow(points))
        pp[[i, j]] = v - a + pdiff[[i, j]]
    }
    pp
  }

  #Points difference for each pair of finishing positions
  pdiff = matrix(ncol = nrow(points), nrow = nrow(points))
  for (i in 1:nrow(points)) {
    for (j in 1:nrow(points))
      pdiff[[i, j]] = points[[i, 2]] - points[[j, 2]]
  }

  ppx = pospoints(a, v, pdiff, points)

  #Who wins the championship for each combination of finishing positions in the final race
  winmdiff = function(vadiff, pdiff, points) {
    win = matrix(ncol = nrow(points), nrow = nrow(points))
    for (i in 1:nrow(points)) {
      for (j in 1:nrow(points))
        if (i == j) win[[i, j]] = ''
        else if ((vadiff + pdiff[[i, j]]) >= 0) win[[i, j]] = 'ROS'
        else win[[i, j]] = 'HAM'
    }
    win
  }

  #Generate the outcome plot; automatically re-executed when the slider inputs change
  output$distPlot <- renderPlot({
    wmd = winmdiff(ppx[[input$ros, input$ham]], pdiff, points)
    wmdm = melt(wmd)
    g = ggplot(wmdm) + geom_text(aes(X1, X2, label = value, col = value))
    g = g + xlab('ROS position in Abu Dhabi') + ylab('HAM position in Abu Dhabi')
    g = g + labs(title = "Championship outcomes in Abu Dhabi")
    g = g + theme(legend.position = "none")
    g = g + scale_x_continuous(breaks = seq(1, 11, 1)) + scale_y_continuous(breaks = seq(1, 11, 1))
    g = g + coord_flip()
    print(g)
  })
})
(server.R)
library(shiny)

shinyUI(pageWithSidebar(

  # Application title
  headerPanel("F1 Driver Championship Scenarios, 2016"),

  # Sidebar with sliders for the drivers' finishing positions in the Brazilian Grand Prix
  sidebarPanel(
    sliderInput("ham",
                "HAM race pos in Brazilian Grand Prix:",
                min = 1,
                max = 11,
                value = 1),
    sliderInput("ros",
                "ROS race pos in Brazilian Grand Prix:",
                min = 1,
                max = 11,
                value = 2),
    hr(),
    em("See also:"), br(),
    a(href = "http://f1datajunkie.com", "f1datajunkie.com"),
    br(),
    a(href = "https://leanpub.com/wranglingf1datawithr", "Wrangling F1 Data With R")
  ),

  # Show the championship outcome grid and some notes on how to read it
  mainPanel(
    plotOutput("distPlot"),
    h3("How to use the Championship predictor"),
    p("With two more races to go in the 2016 F1 season, what do Hamilton and Rosberg each need to do to win the Drivers' Championship?"),
    p("Using the sliders, select various finishing positions for the drivers in the next race, the Brazilian Grand Prix."),
    p("The output will then update to show who will be champion for all possible points-scoring finishing positions at the last race in Abu Dhabi. For a particular finishing combination, the champion will be the named driver."),
    h4("How to Read the Chart"),
    p("If you expect Hamilton to win in Brazil, and Rosberg to come second, set the sliders accordingly. The display changes to show that if HAM finishes first in Abu Dhabi, ROS will take the championship if he finishes second or third. If HAM finishes second in Abu Dhabi, HAM will win overall if ROS places 8th or lower. If ROS places 9th in Abu Dhabi, HAM wins overall if he makes it onto the race podium. If ROS finishes on the race podium in the final race, he takes the Drivers' Championship wherever HAM finishes."),
    p("If neither of the drivers takes points in Brazil, HAM can still win overall if he wins in Abu Dhabi and ROS comes 8th or lower."),
    hr(),
    em("For more F1 stats'n'data wrangling, see "), a(href = "http://f1datajunkie.com", "f1datajunkie.com"), em(" or the Leanpub book "),
    a(href = "https://leanpub.com/wranglingf1datawithr", "Wrangling F1 Data With R"), em(".")
  )
))
(ui.R)

DH Box – Digital Humanities Virtual Workbench

As well as offering digital application shelves, should libraries offer, or act as institutional sponsors of, digital workbenches?

I’ve previously blogged about things like SageMathCloud, an application based learning environment, and the IBM Data Scientist Workbench, and today came across another example: DHBox, CUNY’s digital humanities lab in the cloud (wiki), which looks like it may have been part of a Masters project?

[Screenshot: DH Box home page]

If you select the demo option, a lab context is spawned for you, and provides access to a range of tools: staples, such as RStudio and Jupyter notebooks, a Linux terminal, and several website creation tools: Brackets, Omeka and WordPress (though the latter two didn’t work for me).

[Screenshot: the DH Box lab environment and toolbar]

(The toolbar menu reminded me of Stringle / DockLE ;-)

There’s also a file browser, which provides a common space for organising – and uploading – your own files. Files created in one application are saved to the shared file area and available for use on other applications.

[Screenshot: the DH Box file browser]

The applications are behind a (demo) password authentication scheme, which makes me wonder if persistent accounts are in the project timeline?

[Screenshot: DH Box application login]

Once inside the application, you have full control over it. If you need additional packages in RStudio, for example, then just install them:

[Screenshot: installing additional packages in RStudio]

They work, too!

[Screenshot: the newly installed packages in use]

On the Jupyter notebook front, you get access to Python3 and R kernels:

[Screenshot: Python3 and R kernels in the DH Box Jupyter notebook]


In passing, I notice that RStudio’s RMarkdown now demonstrates some notebook like activity, demonstrating the convergence between document formats such as Rmd (and ipymd) and notebook style UIs [video].

Code for running your own DHBox installation is available on Github (DH-Box/dhbox), though I haven’t had a chance to give it a try yet. One thing it’d be nice to see is a simple tutorial showing how to add in another tool of your own (OpenRefine, for example?) If I get a chance to play with this – and can get it running – I’ll try to see if I can figure out such an example.

It also reminded me that I need to play with my own install of tmpnb, not least because of the claim that “tmpnb can run any Docker container”. Which means I should be able to set up my own tmpRStudio, or tmpOpenRefine environment?

If visionary C. Titus Brown gets his way with a pitched-for MyBinder hackathon, that might extend that project’s support for additional data science applications such as RStudio, as well as generalising the infrastructure on which myBinder can run. Such as Reclaimed personal hosting environments, perhaps?!;-)

That such combinations are now popping up all over the web makes me think that they’ll be a commodity service anytime soon. I’d be happy to argue this sort of thing could be used to support a “technology enhanced learning environment”, as well as extending naturally into “technology enhanced research environments”, but from what I can tell, TEL means learning analytics and not practical digital tools used to develop digital skills? (You could probably track the hell out of people using such environments if you wanted to, though I still don’t see what benefits are supposed to accrue from such activity?)

It also means I need to start looking out for a new emerging trend to follow, not least because data2text is already being commoditised at the toy/play entry level. And it won’t be VR. (Pound to a penny the Second Life hipster, hypster, shysters will be chasing that. Any VR campuses out there yet?!) I’d like to think we might see inroads being made into AR, but suspect that too will always be niche, outside certain industry and marketing applications. So… hmmm… Allotments… that’s where the action’ll be… and not in a tech sense…

Using Docker as a Personal Productivity Tool – Running Command Line Apps Bundled in Docker Containers

With its focus on enterprise use, it’s probably with good reason that the Docker folk aren’t that interested in exploring the role that Docker may have to play as a technology that supports the execution of desktop applications, or at least, applications for desktop users. (The lack of significant love for Kitematic seems to be representative of that.)

But I think that’s a shame; because for educational and scientific/research applications, docker can be quite handy as a way of packaging software that ultimately presents itself using a browser based user interface delivered over http, as I’ve demonstrated previously in the context of Jupyter notebooks, OpenRefine, RStudio, R Shiny apps, linked applications and so on.

I’ve also shown how we can use Docker containers to package applications that offer machine services via an http endpoint, such as Apache Tika.

I think this latter use case shows how we can start to imagine things like a “digital humanities application shelf” in a digital library (fragmentary thoughts on this), that allows users to take either an image of the application off the shelf (where an image is a thing that lets you fire up a pristine instance of the application), or a running instance of the application off the shelf. (Furthermore, the application can be run locally, on your own desktop computer, or in the cloud, for example, using something like a mybinder-like service). The user can then use the application directly (if it has a browser based UI), or call on it from elsewhere (eg in the case of Apache Tika). Once they’re done, they can keep a copy of whatever files they were working with and destroy their running version of the application. If they need the application again, they can just pull a new copy (of the latest version of the app, or the version they used previously) and fire up a new instance of it.

Another way of using Docker came to mind over the weekend when I saw a video demonstrating the use of the contentmine scientific literature analysis toolset. The contentmine installation instructions are a bit of a fiddle for the uninitiated, so I thought I’d try to pop them into a container. That was easy enough (for a certain definition of easy – it was a faff getting node to work and npm to be found, the Java requirements took a couple of goes, and I’m sure the image is way bigger than it really needs to be…), as the Dockerfile below/in the gist shows.

But the question then was how to access the tools? The tools themselves are commandline apps, so the first thing we want to do is to be able to call into the container to run the command. A handy post by Mike English entitled Distributing Command Line Tools with Docker shows how to do this, so that’s all good then…

The next step is to consider how to retain copies of the files created by the command line apps, or pass files to the apps for processing. If we have a target host directory and mount it into the container as a shared volume, we can keep the files on our desktop or allow the container to create files into the host directory. Then they’ll be accessible to us all the time, even if we destroy the container.

The gist that should be embedded below shows the Dockerfile and a simple batch file that passes the Contentmine tool commands into the container, which then executes them. The batch file idea could be further extended to produce a set of command shortcuts that essentially alias the Contentmine commands (eg a ./getpapers command rather than a ./contentmine getpapers command), or that combine the various steps associated with a particular pipeline or workflow – getpapers/norma/cmine, for example – into a single command.

UPDATE: the CenturyLinkLabs DRAY docker pipeline looks interesting in this respect for sequencing a set of docker containers and passing the output of one as the input to the next.

If there are other folk out there looking at using Docker specifically for self-managed “pull your own container” individual desktop/user applications, rather than as a devops solution for deploying services at scale, I’d love to chat…:-)

PS for several other examples of using Docker for desktop apps, including accessing GUI based apps using X Windows / X11, see Jessie Frazelle’s post Docker Containers on the Desktop.

PPS See also More Docker Doodlings – Accessing GUI Apps Via a Browser from a Container Using Guacamole for an attempt at exposing a GUI based app, such as Audacity, running in a container via a browser. Note that I couldn’t get a shared folder or the audio to work, although the GUI bit did…

PPPS I wondered how easy it would be to run command-line containers from within Jupyter notebook itself running in inside a container, but got stuck. Related question on Stack Overflow here.

The rest of the way this post is published is something of an experiment – everything below the line is pulled in from a gist using the WordPress – embedding gists shortcode…


Contentmine Docker CLI App

An attempt to make it easier to use Contentmine tools, by simplifying the install...

Based on some cribs from Distributing Command Line Tools with Docker

  • create a contentmine directory: eg mkdir -p contentmine
  • download the Dockerfile into it
  • from a Docker CLI, cd in to the directory; create an image using eg docker build -t psychemedia/contentmine .
  • download the contentmine script and make it executable: chmod u+x contentmine
  • create a folder on host in the contentmine folder to act as a shared folder with the contentmine container: eg mkdir -p cm
  • run commands
    • ./contentmine getpapers -q aardvark -o /contentmine/aardvark -x
    • ./contentmine norma --project /contentmine/aardvark -i fulltext.xml -o scholarly.html --transform nlm2html
    • ./contentmine cmine /contentmine/aardvark

Contentmine Home: Contentmine

(README.md)
#!/bin/bash
## contentmine - a wrapper script for running contentmine packages
#/via Distributing Command Line Tools with Docker https://spin.atomicobject.com/2015/11/30/command-line-tools-docker/
#Make this file executable: chmod u+x contentmine
docker run --rm --volume "${PWD}/cm":/contentmine --tty --interactive psychemedia/contentmine "$@"
(contentmine)
FROM node:4.3.2
##Based on
MAINTAINER Tony Hirst
RUN apt-get clean -y && apt-get -y update && apt-get -y upgrade && \
apt-get -y update && apt-get install -y wget ant unzip openjdk-7-jdk && \
apt-get clean -y
RUN wget --no-check-certificate https://github.com/ContentMine/norma/releases/download/v0.2.26/norma_0.1.SNAPSHOT_all.deb
RUN wget --no-check-certificate https://github.com/ContentMine/ami/releases/download/v0.2.24/ami2_0.1.SNAPSHOT_all.deb
RUN dpkg -i norma_0.1.SNAPSHOT_all.deb
RUN dpkg -i ami2_0.1.SNAPSHOT_all.deb
RUN npm install --global getpapers
RUN mkdir /contentmine
VOLUME /contentmine
RUN cd /contentmine
#The intuition behind this is: 'can we avoid setup hassles by distributing contentmine commandline app via a container?'
#docker build -t psychemedia/contentmine .
#Then maybe something like:
#mkdir -p cm
#docker run --volume "${PWD}/cm":/contentmine --interactive psychemedia/contentmine getpapers -q aardvark -o /contentmine/aardvark -x
#docker run --volume "${PWD}/cm":/contentmine --interactive psychemedia/contentmine norma --project /contentmine/aardvark -i fulltext.xml -o scholarly.html --transform nlm2html
#docker run --volume "${PWD}/cm":/contentmine --interactive psychemedia/contentmine cmine /contentmine/aardvark
#
##Following https://spin.atomicobject.com/2015/11/30/command-line-tools-docker/ how about:
#Create the following script as contentmine; chmod u+x contentmine
#--
##!/bin/bash
## contentmine - a wrapper script for running contentmine packages
#docker run --volume "${PWD}/cm":/contentmine --tty --interactive psychemedia/contentmine "$@"
#--
#Then run eg: ./contentmine getpapers -q aardvark -o /contentmine/aardvark2 -x
(Dockerfile)

First Thoughts on Detecting Motorsport Safety Car Periods from Laptimes

Prompted by Markku Hänninen, I thought I’d have a quick look at estimating motorsport safety car laps from a set of laptime data. For the uninitiated, if there is a dangerous hazard on track, the race-cars are kept out while the hazard is cleared, but led around by a safety car that limits the pace. No overtaking is allowed for race position, but under certain regulations, lapped cars may unlap themselves. Cars may also pit under the safety car.

Timing sheets typically don’t identify safety car periods, so the question arises: how can we detect them?

One condition that is likely to hold is that the average pace of the laps under safety car conditions will be considerably slower than under racing conditions. A quick way of estimating the race pace is to find the fastest laptime across the whole of the race (or, in an online algorithm, the fastest laptime to date).

With a lapTimes dataframe containing columns lap, rawtime (that is, raw laptime in seconds) and position (that is, the race position of the driver recording a particular laptime on a particular lap), we can easily find the fastest lap:

minl=min(lapTimes['rawtime'])
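
For the online version, we can track the running best laptime as the race progresses; a minimal sketch, again using ddply over the lapTimes dataframe:

#Sketch: fastest laptime recorded so far at the end of each lap
lapbest = ddply(lapTimes, .(lap), summarise, lap_best = min(rawtime))
lapbest = lapbest[order(lapbest$lap), ]
lapbest$best_to_date = cummin(lapbest$lap_best)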

We can find the mean laptime per lap using ddply() to group around each lap:

ddply(lapTimes[c('lap', 'rawtime', 'position')], .(lap), summarise,
      mean_laptime=mean(rawtime) )

We can also generate a variety of other measures. For example, within the grouped ddply operation, if we divide the mean laptime per lap (mean(rawtime)) by the fastest overall laptime we get a normalised mean laptime based on the fastest lap in the race.

ddply(lapTimes[c('lap','rawtime')], .(lap), summarise,
      norm_laptime=mean(rawtime)/minl )

We might also normalise the leader’s laptime for each lap, on the basis that the leader will be the car most likely to be driving at the safety car’s pace. (The summarising is essentially redundant here because we only have one row per group.)

ddply(lapTimes[lapTimes['position']==1, c('lap','rawtime')], .(lap), summarise,
      norm_leaders_laptime=mean(rawtime)/minl )

Using the normalised times, we can identify slow laps. For example, slow laps based on mean laptime. In this case, I am using a heuristic that says the laptime is a slow laptime if the normalised time is more than 1.3 times that of the fastest lap:

ddply(lapTimes[c('lap','rawtime')], .(lap), summarise,
      slow_lap_meanBasis= (mean(rawtime)/minl) > 1.3 )

If we assume that the first lap does not start under the safety car, we can then make a crude guess that a slow lap not on the first lap is a safety car lap.
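
Pulling those pieces together, a crude safety car flag might look something like this sketch:

#Sketch: flag slow laps, other than the first lap, as possible safety car laps
lap_summary = ddply(lapTimes[c('lap','rawtime')], .(lap), summarise,
                    slow_lap_meanBasis = (mean(rawtime)/minl) > 1.3 )
lap_summary$safety_car_guess = lap_summary$slow_lap_meanBasis & (lap_summary$lap > 1)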

However, this does not take into account things like sudden downpours or other changes to the weather or track conditions. In such a case it may be likely that the majority of the field pits, so we might want to have a detector that flags whether a certain number of cars have pitted on a lap, possibly normalised against the current size of the field.

laps=lapsData.df(2016,4)
pits=pitsData.df(2016,4)[c('lap','driverId')]
pits['pitstop']=T
lapTimes=merge(laps, pits, by=c('lap','driverId'), all.x=T)
lapTimes['pitstop']=!is.na(lapTimes['pitstop'])

#Count of stops per lap
ddply( lapTimes, .(lap), summarise, ps=sum(pitstop==TRUE) )

#Proportion of cars stopping per lap
ddply( lapTimes, .(lap), summarise, ps=sum(pitstop==TRUE)/length(pitstop) )

That said, under safety car conditions, many cars do also take the opportunity to pit. However, under sudden changes of weather condition, we might expect nearly all the cars to come in, even if it means doubling up. (So another detector for weather might be two cars in the same team, close to each other in terms of gap, pitting on the same lap, with the result that one will be queued behind the other.)
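
As a sketch of that team based detector, assuming the lapTimes dataframe also carries a constructorId (team) identifier for each driver (which would need to be merged in from the results data; it isn’t loaded in the snippet above):

#Sketch: laps on which any team has both its cars pitting (a possible weather change indicator)
team_stops = ddply(lapTimes, .(lap, constructorId), summarise, teamStops = sum(pitstop == TRUE))
double_stack_laps = unique(team_stops[team_stops$teamStops >= 2, 'lap'])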

As and when I get a chance, I’ll try to add some sort of ‘safety car’ estimator to the Wrangling F1 Data With R book.