Working Visually with the ggplot2 Web Interface…

We’re all familiar with line charts and pie charts, but how might we create new ways of visualising data? One way is to play…

Over the weekend, I was doodling with an online, GUI driven hosted version of the ggplot graphing library for the statistical analysis language R: Jeroen C.L. Ooms (2010). A web interface for the R package ggplot2

This service lets you upload a file and then, using a handful of menu options, create a wide variety of chart types. It also generates the corresponding ggplot/R commands so you can recreate the charts in your own R environment.

Because R is built for doing stats, ggplot has access to a powerful set of data transforming routines that can be applied to the data set you want to visualise. This includes being able to group data elements, plot different facets on different charts in a lattice style display, colour and size nodes according to a particular column and so on:

The data set I had was (no surprises) F1 timing data. The columns included things like the driver name, car number, lap number, lap time and stint number (each stint for a car is separated by a pitstop). Each row corresponded to data from a single lap for a single car. So looking down the rows, we see driver names repeat (once for each lap they completed); and looking down the stint column, we can filter out the rows that correspond to just the first or second stint for each driver. Which is to say: the data has structure, and can be viewed in a variety of ways.

So for example, in the following chart, we can plot for each driver (each separate graph, specified by treating the car number as a facet) their fuel corrected laptimes (y-axis) over the course of the race (lap number on x-axis). The colour of each point is determined by the pitstop column value which identifies whether a car has pitted on that lap:

Fuel corrected laptimes

The chart was created in six steps: one to upload the data, one to create a scatterplot layer and four menu selections (x, y, facet, colour).
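For anyone who wants to try the same thing at the R command line, here is a minimal sketch of roughly what those menu selections amount to in ggplot2. The data is synthetic and the column names (lap, car, fuel_corrected, pitstop) are my own assumptions, not the headers from the real timing file or the commands the web interface actually generated.

```r
library(ggplot2)

# Synthetic stand-in for the timing data: one row per car per lap.
# All column names here are assumptions, not the real file's headers.
set.seed(1)
laps <- expand.grid(lap = 1:20, car = 1:4)
laps$fuel_corrected <- 95 + rnorm(nrow(laps), sd = 0.5)
laps$pitstop <- laps$lap %in% c(8, 15)   # TRUE on laps where the car pitted
laps$fuel_corrected[laps$pitstop] <- laps$fuel_corrected[laps$pitstop] + 20

# Scatterplot layer: x = lap, y = fuel corrected time, colour = pit flag,
# with one panel (facet) per car.
p <- ggplot(laps, aes(x = lap, y = fuel_corrected, colour = pitstop)) +
  geom_point() +
  facet_wrap(~ car)
print(p)
```

The four aesthetic/facet choices in the code line up with the four menu selections (x, y, facet, colour); the scatterplot layer is the geom_point() call.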

As well as specifying facets, we can also group data rows to display them on the same faceted graph. Here, I group the data by car (so we will see a separate line for each car), and facet by stint:

F1 2011 Hungary stint analysis
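As a rough sketch of the group-plus-facet idea in ggplot2 itself (again with made-up data and assumed column names car, lap, stint and laptime), something like the following draws one line per car and one panel per stint:

```r
library(ggplot2)

# Synthetic data: three cars, 30 laps each, split into three 10-lap stints.
# Column names are assumptions for illustration only.
set.seed(2)
stints <- expand.grid(lap = 1:30, car = c(1, 2, 3))
stints$stint <- (stints$lap - 1) %/% 10 + 1
stints$laptime <- 90 + 0.3 * stints$car + rnorm(nrow(stints), sd = 0.4)

# group = car gives a separate line for each car;
# facet_wrap(~ stint) gives a separate panel for each stint.
p <- ggplot(stints, aes(x = lap, y = laptime,
                        group = car, colour = factor(car))) +
  geom_line() +
  facet_wrap(~ stint, scales = "free_x")
print(p)
```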

Hopefully, you’ll see how it’s possible to explore a wide range of views over the data, cutting it up in all sorts of ways, quite easily. At a glance, we can get a view of different aspects of the whole data set, and start looking for surprising features that might be worth closer investigation. Because it’s quick to create different views over the data, we can quickly get an overall picture of how the data values are arranged.

There are issues of course: whilst the facet and group options provide one way of arranging the data, I don’t think there’s a way of filtering the data using the web interface (for example, to show the times by stint for cars 1 to 5). Adding support for filters (that is, data/subset operations) would make the web interface far more powerful as an interactive visual analysis tool, methinks, for example using the sort of dialogue used to specify filters in Gephi.
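In a local R session the missing filter step is just a subset operation before the plot. A minimal sketch, assuming a data frame with hypothetical columns car, lap, stint and laptime (synthetic values here), that shows times by stint for cars 1 to 5 only:

```r
library(ggplot2)

# Synthetic timing data for ten cars; column names are assumed.
set.seed(3)
timing <- expand.grid(lap = 1:12, car = 1:10)
timing$stint <- ifelse(timing$lap <= 6, 1, 2)
timing$laptime <- 92 + rnorm(nrow(timing), sd = 0.5)

# The filter: keep only the rows for cars 1 to 5.
front <- subset(timing, car >= 1 & car <= 5)

# Then plot as before, grouped by car and faceted by stint.
p <- ggplot(front, aes(x = lap, y = laptime,
                       group = car, colour = factor(car))) +
  geom_line() +
  facet_wrap(~ stint)
print(p)
```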

That said, I’ve now added Ooms’ ggplot web interface to the list of visualisation sites (along with IBM Many Eyes, for example), that I can use for getting a visual overview of a dataset quickly and easily. If you want to give it a go, you can find it at:

Here’s a quick video overview if you need more convincing:

Playing With R/ggplot2 Online (err, I think..?!)

Trying to get my head round what to talk about in another couple of presentations – an online viz tools presentation for the JISC activity data synthesis project tomorrow, and an OU workshop around the iChart eSTeEM project – I rediscovered an app that I’d completely forgotten about: an online R server that supports the plotting of charts using the ggplot library (err, I think?!):

Example of how to use

By the by, I have started trying to get my head round R using RStudio, but the online ggplot2 environment masks the stats commands and just focusses on helping you create quick charts. I randomly uploaded one of my F1 timing data files from the British Grand Prix, had a random click around, and in 8(?) clicks – from uploading the file, to rendering the chart – I’d managed to create this:

ggplot - British Grand Prix

What it shows is a scatterplot for each car of the gap to the race leader: for each lap the leader is on, how many seconds ahead of that car the leader is. When the plotted points drop from 100 or so seconds behind to just a few seconds behind, that car has been lapped.

What this chart shows (which I stumbled across just by playing with the environment) is a birds-eye view over the whole of the race, from each driver’s point of view. One thing I don’t make much use of is the colour dimension – or the size of each plotted point – but if I tweak the input file to include the number of laps a car is behind the leader, their race position, the number of pitstops they’ve had, or their current tyre selection, I could easily view a couple more of these dimensions.

Where there’s a jump in the plotted points for a lap or two, if the step/break goes above the trend line (the gap to the leader increases by 20s or so), the car has pitted before the leader. If the jump goes below the trend line (the gap to the leader has decreased), the leader has pitted before the car in question.

But that’s not really the point; what is the point is that here is a solution (and I think mirroring options are a possibility) for hosting within an institution an interactive chart generator. I also wonder to what extent it would be possible to extend the environment to detect single sign on credentials and allow a student to access a set of files related to a particular course, for example? Alternatively, it looks as if there is support for loading files in from Google Docs, so would it be possible to use this environment as a way of providing a graphing environment for data files stored (and maybe shared via a course) within a student’s Google Apps account?

Red-R – Pipeline Visual Editor for Doing Stats With R

Over the last few weeks, I’ve started scraping through some of the visual things that statisticians do with data. I have to admit that I’m not that interested in learning arcane tests for significance in peculiarly distributed populations, but I am keen to see what statistical graphs are out there that you can throw data at to see whether or not the data hides an interesting story (trends, clusters and outliers are three story signatures that make folk-sense to me in data terms).

And as R seems to have traction at the moment, as well as: a) being cross platform and free; b) having an active plugin “developer” community, it seems to make sense to find a way in through that… (There seems to be a shed load of recent and soon-to-be-published books around R on Amazon at the moment, and if Google uses it… (also: Google’s R style guide)).

To ease my way into R, I’ve started using R-Studio, an in-development IDE. But the other day, I was also tipped off about Red-R, a visual programming environment for R that seems to be built around the same tooling as the Orange data analysis tool I wrote about last year.

It’s still pretty ropey at the moment (on a Mac at least), but works enough to be going on with…

The metaphor is based on pipeline processing of data, chaining together functional blocks with wires in the order you want the functions to be executed. Getting data in is currently from a file (it would be nice to see hooks into online datasources supported too), with a range of options for getting the data into the environment in a structured way:

After loading the data in (click the Load button) we can preview the data using a View Data block:

As with Orange, there are lots of opportunities to comment and add notes to record what you’re doing/seeing within a block.

Going through the widget menus quickly, there are blocks for – reshaping data:

(I know from web traffic into this blog that this is something that a lot of people appreciate tool support for; I’ll try to do a summary post of how R can help in this regard in the next somewhen…)
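By way of a taster, here is a minimal sketch of one common reshaping job done directly in base R: turning "long" data (one row per driver per lap) into "wide" data (one row per lap, one column per driver). The data and column names are invented for illustration, and this says nothing about how the Red-R blocks do it internally.

```r
# Long format: one row per driver per lap (made-up values).
long <- data.frame(
  lap = rep(1:3, times = 2),
  driver = rep(c("A", "B"), each = 3),
  laptime = c(101.2, 99.8, 99.5, 102.5, 100.1, 99.9)
)

# Wide format: one row per lap, one laptime column per driver.
# idvar identifies the rows to keep; timevar supplies the new column suffixes.
wide <- reshape(long, idvar = "lap", timevar = "driver", direction = "wide")
print(wide)
```

The resulting columns are named by joining the value column and the driver with a dot (laptime.A, laptime.B).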

We can also generate subsets of data:

(I found it easy enough to pull out a set of columns, but I didn’t immediately spot how to pull out rows by cell value in a given column, which is pretty fundamental?)
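For comparison, this is what the two operations look like at the plain R command line, using a small made-up data frame (the column names are assumptions). Pulling out columns and pulling out rows by cell value are both one-liners:

```r
# Made-up results table for illustration.
results <- data.frame(
  driver = c("A", "A", "B", "B", "C", "C"),
  lap = c(1, 2, 1, 2, 1, 2),
  laptime = c(101.2, 99.8, 102.5, 100.1, 103.0, 100.9),
  stringsAsFactors = FALSE
)

# Pull out a set of columns by name.
cols_only <- results[, c("driver", "laptime")]

# Pull out rows by cell value in a given column (logical indexing).
driver_b <- results[results$driver == "B", ]

# subset() does the same job with slightly less typing.
fast_laps <- subset(results, laptime < 100.5)
```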

There are also blocks for plotting various charts and graphs:

And of course there’s the whole stats thing too…

I’ll try to have a play over the coming weeks and pop up some simple recipes of how to use this app…

For anyone who really struggles with using a command line, Red-R may provide a handy way in; from a very quick play, though, it’s not obvious how to do certain trivial things (filter in or filter out a set of rows for example).

If you already know R, it may or may not be obvious how to use Red-R immediately and cut down on the typing; but if you’re thinking of using Red-R as a visual environment for plotting graphs and charts with no prior experience of R, there may be “issues”. For example, if Red-R works best for people who have an understanding of R and how it works, Red-R may not actually be a very good tool for teaching the model that underpins R, and as such it may be hard to learn to use through direct manipulation. This might be particularly true if Red-R is a very literal interpretation of R, and simply puts a widget layer on top of R-commands to make them dialogue driven. Which is to say, it may be that an additional abstraction layer is required that combines several basic R commands into higher level and more natural dialogues that make Red-R as a visual environment easier to use without instruction?

Because “without instruction” is currently the only way to engage with Red-R: the documentation is currently very sparse indeed:-(

In the meantime, I think I’m going to stick with using R-Studio

PS ..but then, maybe a few quick tutorial posts here might help to start addressing the lack of quick start howto’s???;-)

First Play With R and R-Studio – F1 Lap Time Box Plots

Last summer, at the European Centre for Journalism round table on data driven journalism, I remember saying something along the lines of “your eyes can often do the stats for you”, the implication being that our perceptual apparatus is good at pattern detection, and can often see things in the data that most of us would miss using the very limited range of statistical tools that we are either aware of, or are comfortable using.

I don’t know how good a statistician you need to be to distinguish between the four data sets in Anscombe’s quartet from their summary statistics, but the differences are obvious to the eye:

Anscombe's quartet /via Wikipedia
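If you want to see this for yourself, the quartet ships with R as the built-in anscombe data frame (columns x1 to x4 and y1 to y4). A quick sketch that prints a few summary statistics for each of the four sets and then plots them side by side:

```r
# anscombe is part of base R's datasets package: columns x1..x4, y1..y4.
summary_stats <- sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(mean_x = mean(x), mean_y = mean(y), cor_xy = cor(x, y))
})
print(round(summary_stats, 2))

# The numbers are near identical; the plots are what tell the sets apart.
op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = paste0("x", i), ylab = paste0("y", i))
}
par(op)
```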

Another shamistician (h/t @daveyp) heuristic (or maybe it’s a crapistician rule of thumb?!) might go something along the lines of: “if you use the right visualisations, you don’t necessarily need to do any statistics yourself”. In this case, the implication is that if you choose a visualisation technique that embodies or implements a statistical process in some way, the maths is done for you, and you get to see what the statistical tool has uncovered.

Now I know that as someone working in education, I’m probably supposed to uphold the “should learn it properly” principle… But needing to know statistics in order to benefit from the use of statistical tools seems to me to be a massive barrier to entry in the use of this technology (statistics is a technology…) You just need to know how to use the technology appropriately, or at least, not use it “dangerously”…

So to this end (“democratising access to technology”), I thought it was about time I started to play with R, the statistical programming language (and rival to SPSS?) that appears to have a certain amount of traction at the moment given the number of books about to come out around it… R is a command line language, but the recently released R-Studio seems to offer an easier way in, so I thought I’d go with that…

Flicking through A First Course in Statistical Programming with R, a book I bought a few weeks ago in the hope that the osmotic reading effect would give me some idea as to what it’s possible to do with R, I found a command line example showing how to create a simple box plot (box and whiskers plot) that I could understand enough to feel confident I could change…

Having an F1 data set/CSV file to hand (laptimes and fuel adjusted laptimes) from the China 2011 grand prix, I thought I’d see how easy it was to just dive in… And it was 2 minutes easy… (If you want to play along, here’s the data file).

Here’s the command I used:
boxplot(Lap.Time ~ Driver, data=lapTimeFuel)
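If you don’t have the file to hand, here is a self-contained sketch of the whole round trip with a tiny made-up CSV. The header row ("Driver,Lap,Lap Time") and the values are my assumptions, not the contents of the real data file; the point is that read.csv turns a header like "Lap Time" into the syntactic name Lap.Time, which is the sort of column name the boxplot formula uses.

```r
# Write a tiny made-up CSV to a temporary file (header names are assumed).
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Driver,Lap,Lap Time",
  "VET,1,105.2", "VET,2,101.1", "VET,3,100.8", "VET,4,100.9",
  "HAM,1,106.0", "HAM,2,101.4", "HAM,3,101.0", "HAM,4,100.7"
), csv_path)

# read.csv makes column names syntactic: "Lap Time" becomes "Lap.Time".
lapTimeFuel <- read.csv(csv_path)
print(names(lapTimeFuel))

# One box per driver; boxplot() also returns the summary stats it drew.
stats <- boxplot(Lap.Time ~ Driver, data = lapTimeFuel)
```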

Remembering a comment in a Making up the Numbers blogpost (Driver Consistency – Bahrain 2010) about the effect on laptime distributions from removing opening, in and out lap times, a quick Google turned up a way of quickly stripping out slow times. (This isn’t as clean as removing the actual opening, in and out lap times – it also removes mistake laps, for example, but I’m just exploring, right? Right?!;-)

lapTime2 <- subset(lapTimeFuel, Lap.Time < 110.1)

I could then plot the distribution in the reduced lapTime2 dataset by changing the original boxplot command to use (data=lapTime2). (Note that as with many interactive editors, using your keyboard’s up arrow displays previously entered commands in the current command line; so you can re-enter a previously entered command by hitting the up arrow a few times, then entering return. You can also edit the current command line, using the left and right arrow keys to move the cursor, and the delete key to delete text.)

Prior programming experience suggests this should also work…

boxplot(Lap.Time ~ Driver, data=subset(lapTimeFuel, Lap.Time < 110))
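The cleaner alternative mentioned above, dropping the actual opening, in and out laps rather than everything over a threshold, might look something like this. It is only a sketch: it assumes the data has a Lap column and a logical Pit column flagging the lap on which a car pitted, neither of which I have confirmed is in the real file, and the values are made up.

```r
# Made-up laps for two drivers; Pit is TRUE on the in lap (assumed column).
laps <- data.frame(
  Driver = rep(c("HAM", "VET"), each = 8),
  Lap = rep(1:8, times = 2),
  Lap.Time = c(112.0, 101.0, 100.8, 108.5, 115.2, 100.5, 100.4, 100.6,
               110.5, 100.9, 100.7, 100.6, 109.0, 116.0, 100.3, 100.2),
  Pit = c(FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE,
          FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE)
)
laps <- laps[order(laps$Driver, laps$Lap), ]

# An out lap is the lap straight after an in lap for the same driver,
# so shift each driver's Pit flags down by one lap.
prev_pit <- as.logical(ave(laps$Pit, laps$Driver,
                           FUN = function(p) c(FALSE, head(p, -1))))

# Keep everything except the opening lap, in laps and out laps.
clean <- laps[laps$Lap > 1 & !laps$Pit & !prev_pit, ]
boxplot(Lap.Time ~ Driver, data = clean)
```

Unlike the threshold approach, this keeps any slow "mistake" laps in the distribution, which may or may not be what you want.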

Something else I tried was to look at the distribution of fuel weight adjusted laptimes (where the time penalty from the weight of the fuel in the car is removed):

boxplot(Fuel.Adjusted.Laptime ~ Driver, data=lapTimeFuel)
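To give a feel for what a fuel adjustment does, here is a deliberately crude sketch: assume the car burns a fixed amount of fuel each lap and that every kilogram on board costs a fixed amount of lap time. The race length, burn rate and penalty below are illustrative guesses on my part, not the figures used to produce the Fuel.Adjusted.Laptime column in the data file.

```r
# Illustrative assumptions only, not the values behind the real data.
total_laps <- 56
fuel_per_lap_kg <- 2.7     # assumed fuel burned per lap
penalty_s_per_kg <- 0.03   # assumed lap time cost per kg of fuel on board

# Remove the time penalty due to the fuel still in the car on a given lap.
fuel_adjust <- function(lap_time, lap) {
  fuel_on_board <- (total_laps - lap) * fuel_per_lap_kg
  lap_time - fuel_on_board * penalty_s_per_kg
}

raw <- data.frame(Lap = c(1, 28, 56), Lap.Time = c(104.0, 102.0, 100.0))
raw$Fuel.Adjusted.Laptime <- fuel_adjust(raw$Lap.Time, raw$Lap)
print(raw)
```

The correction is biggest early in the race, when the car is heaviest, and falls to zero on the final lap.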

Looking at the release notes for the latest version of R-Studio suggests that you can build interactive controls into your plots (a bit like Mathematica supports?). The example provided shows how to change the x-range on a plot:
library(manipulate)
manipulate(plot(cars, xlim=c(0,x.max)), x.max=slider(15,25))

Hmm… can we set the filter value dynamically I wonder?

manipulate(boxplot(Lap.Time ~ Driver, data=subset(lapTimeFuel, Lap.Time < maxval)), maxval=slider(100,140))

Seems like it…?:-) We can also combine interactive controls:

manipulate(boxplot(Lap.Time ~ Driver, data=subset(lapTimeFuel, Lap.Time < maxval),outline=outline),maxval=slider(100,140),outline = checkbox(FALSE, "Show outliers"))

Okay – that’s enough for now… I reckon that with a handful of commands on a crib sheet, you can probably get quite a lot of chart plot visualisations done, as well as statistical visualisations, in the R-Studio environment; it also seems easy enough to build in interactive controls that let you play with the data in a visually interactive way…

The trick comes from choosing visual statistics approaches to analyse your data that don’t break any of the assumptions about the data that the particular statistical approach relies on in order for it to be applied in any sensible or meaningful way.

[This blog post is written, in part, as a way for me to try to come up with something to say at the OU Statistics Group’s one day conference on Visualisation and Presentation in Statistics. One idea I wanted to explore was: visualisations are powerful; visualisation techniques may incorporate statistical methods or let you “see” statistical patterns; most people know very little statistics; that shouldn’t stop them being able to use statistics as a technology; so what are we going to do about it? Feedback welcome… Err….?!]