Using One Programming Language In the Context of Another – Python and R

Over the last couple of years, I’ve settled on R and Python as my languages of choice for getting things done:

  • R, because RStudio is a nice environment, I can blend code and text using R Markdown and knitr, ggplot2 and rCharts make generating graphics easy, and packages such as plyr make wrangling with data relatively easy(?!) once you get into the swing of it… (though sometimes OpenRefine can be easier…;-)
  • Python, because it’s an all-round general purpose language with lots of handy libraries, good for scraping, and a joy to work with in the IPython notebook…

Sometimes, however, you know – or remember – how to do one thing in one language that you’re not sure how to do in another. Or you find a library that is just right for the task at hand, but it’s in the other language from the one in which you’re working, and routing the data out and back again can be a pain.

How handy would it be if you could make use of one language in the context of another? Well, it seems as if we can (note: I haven’t tried any of these recipes yet…):

Using R Inside Python Programs

Whilst Python has a range of plotting tools available for it, such as matplotlib, I haven’t found anything quite as expressive as R’s ggplot2 (there is a Python port of ggplot underway, but it’s still early days and the syntax, as well as the functionality, is still far from complete compared to the original [though not as far as it was, given the recent update;-)]). So how handy would it be to be able to throw a pandas data frame, for example, into an R data frame and then use ggplot to render a graphic?

The Rpy and Rpy2 libraries support exactly that, allowing you to run R code within a Python program. For an example, see this Example of using ggplot2 from IPython notebook.

There also seems to be some magic support for running R in IPython notebooks, and some experimental integration work going on in pandas: pandas: rpy2 / R interface.

(See also: ggplot2 in Python: A major barrier broken.)

Using Python Inside R

Whilst one of the things I often want to do in Python is plot R-style ggplots, one of the hurdles I often encounter in R is getting the data in in the first place. For example, the data may come from a third party source that needs screenscraping, or via a web API that has a Python wrapper but not an R one. Python is my preferred tool for writing scrapers, so is there a quick way I can add a Python data grabber into my R context? It seems as if there is: rPython, though the way code is included looks rather clunky and Windows support appears to be uncertain. What would be nice would be for RStudio to include some similar magic, or be able to support Python-based chunks…
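To sketch the shape of that workflow (function and file names are my own invention, and the “grabber” is a stub rather than a real scraper), you’d put the Python side in a plain module, which rPython can load and call from R, passing simple Python structures back as R lists:

```python
# grabber.py -- a toy "data grabber" of the sort one might call from R.
# Hypothetical R session, assuming the rPython package is installed:
#   library(rPython)
#   python.load("grabber.py")
#   laps <- python.call("get_laptimes", "hamilton")

def get_laptimes(driver):
    """Stand-in for a real scraper or web-API call: return a list of
    lap/laptime records as plain Python dicts and numbers, which is the
    sort of simple structure rPython can translate into R objects."""
    data = {
        "hamilton": [(1, 92.3), (2, 91.8)],
        "button": [(1, 93.1), (2, 92.6)],
    }
    return [{"lap": lap, "laptime": t} for lap, t in data.get(driver, [])]
```

Keeping the interface down to plain numbers, strings, lists and dicts is the pragmatic move here – the less exotic the data you pass across the boundary, the less can go wrong in translation.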

(See also: Calling Python from R with rPython.)

(Note: I’m currently working on the production of an Open University course on data management and use, and I can imagine the upset about overcomplicating matters if I mooted this sort of blended approach in the course materials. But this is exactly the sort of pragmatic use that technologists put code to – as a tool that comes to hand and can be used quickly and relatively efficiently in concert with other tools, at least when you’re working in a problem solving (rather than production) mode.)

Working Visually with the ggplot2 Web Interface…

We’re all familiar with line charts and pie charts, but how might we create new ways of visualising data? One way is to play…

Over the weekend, I was doodling with an online, GUI driven hosted version of the ggplot graphing library for the statistical analysis language R: Jeroen C.L. Ooms (2010). A web interface for the R package ggplot2

This service lets you upload a data file and then, using a handful of menu options, create a wide variety of chart types. It also generates the corresponding ggplot/R commands, so you can recreate the charts in your own R environment.

Because R is built for doing stats, ggplot has access to a powerful set of data transforming routines that can be applied to the data set you want to visualise. This includes being able to group data elements, plot different facets on different charts in a lattice style display, colour and size nodes according to a particular column and so on:

The data set I had was (no surprises) F1 timing data. The columns included things like driver name, car number, lap number, lap time, and stint number (each stint for a car is separated by a pitstop). Each row corresponded to data from a single lap for a single car. Which is to say, looking down the rows, we see driver names repeat (once for each lap they completed); or, looking down the stint column, we can filter out rows that correspond to just the first or second stint for each driver. In short: the data has structure, and can be viewed in a variety of ways.
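That row-per-lap structure is easy to sketch in code (the column names and values here are my own guesses at the schema, not the actual file):

```python
import pandas as pd

# One row per car per lap, as described above (illustrative values only)
laps = pd.DataFrame({
    "driver":  ["HAM", "HAM", "HAM", "BUT", "BUT", "BUT"],
    "car":     [4, 4, 4, 3, 3, 3],
    "lap":     [1, 2, 3, 1, 2, 3],
    "laptime": [92.3, 91.8, 112.0, 93.1, 92.6, 92.4],
    "stint":   [1, 1, 2, 1, 1, 1],
})

# Looking down the rows, driver names repeat once per completed lap...
laps_per_driver = laps.groupby("driver")["lap"].count()

# ...or we can filter down to just the first stint for each car
first_stint = laps[laps["stint"] == 1]
```

The grouping and filtering here are exactly the “cuts” through the data that the faceting and grouping menus in the web interface expose visually.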

So for example, in the following chart, we can plot for each driver (each separate graph, specified by treating the car number as a facet) their fuel corrected laptimes (y-axis) over the course of the race (lap number on x-axis). The colour of each point is determined by the pitstop column value which identifies whether a car has pitted on that lap:

Fuel corrected laptimes

The chart was created in six steps: one to upload the data, one to create a scatterplot layer and four menu selections (x, y, facet, colour).

As well as specifying facets, we can also group data rows to display them on the same faceted graph. Here, I group the data by car (so we will see a separate line for each car), and facet by stint:

F1 2011 Hungary stint analysis

Hopefully, you’ll see how it’s possible to explore a wide range of views over the data, cutting it up in all sorts of ways, quite easily. At a glance, we can get a view of different aspects of the whole data set, and start looking for surprising features that might be worth closer investigation. Because it’s quick to create different views over the data, we can quickly get an overall picture of how the data values are arranged.

There are issues, of course: whilst the facet and group options provide one way of arranging the data, I don’t think there’s a way of filtering the data using the web interface (for example, to show the times by stint for cars 1 to 5 only). Adding support for filters (that is, data/subset operations) would make the web interface far more powerful as an interactive visual analysis tool, methinks – for example, using the sort of dialogue used to specify filters in Gephi.

That said, I’ve now added Ooms’ ggplot web interface to the list of visualisation sites (along with IBM Many Eyes, for example) that I can use for getting a visual overview of a dataset quickly and easily. If you want to give it a go, you can find it at:

Here’s a quick video overview if you need more convincing:

Playing With R/ggplot2 Online (err, I think..?!)

Trying to get my head round what to talk about in another couple of presentations – an online viz tools presentation for the JISC activity data synthesis project tomorrow, and an OU workshop around the iChart eSTeEM project – I rediscovered an app that I’d completely forgotten about: an online R server that supports the plotting of charts using the ggplot library (err, I think?!):

Example of how to use

By the by, I have started trying to get my head round R using RStudio, but the online ggplot2 environment masks the stats commands and just focusses on helping you create quick charts. I randomly uploaded one of my F1 timing data files from the British Grand Prix, had a random click around, and in 8(?) clicks – from uploading the file, to rendering the chart – I’d managed to create this:

ggplot - British Grand Prix

What it shows is a scatterplot for each car, plotting how far behind the leader (in seconds) that car is on each leader lap. When the plotted points drop from 100 or so seconds behind to just a few seconds behind, that car has been lapped.

What this chart shows (which I stumbled across just by playing with the environment) is a bird’s-eye view over the whole of the race, from each driver’s point of view. One thing I don’t make much use of is the colour dimension – or the size of each plotted point – but if I tweaked the input file to include the number of laps a car is behind the leader, its race position, the number of pitstops it has made, or its current tyre selection, I could easily bring a couple more of these dimensions into view.

Where there’s a jump in the plotted points for a lap or two: if the step/break goes above the trend line (the gap to the leader increases by 20s or so), the car in question has pitted before the leader; if the jump goes below the trend line (the gap to the leader has decreased), the leader has pitted before the car in question.
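The “sudden drop in the gap means the car has just been lapped” reading is simple enough to express in code. Here’s a hedged sketch over a made-up gap-to-leader series (the threshold and the numbers are mine, not from the real data):

```python
# Toy heuristic: a car is lapped when its gap to the leader collapses
# from a large value to a small one between consecutive leader laps
# (the gap is now being counted against the next leader lap).
def lapped_on(gaps, threshold=60.0):
    """Return 0-based indices of laps where the gap suddenly collapses,
    i.e. laps on which the car appears to have just been lapped."""
    return [i for i in range(1, len(gaps))
            if gaps[i - 1] > threshold and gaps[i] < gaps[i - 1] - threshold]

gaps = [80.0, 90.0, 101.0, 5.0, 9.0, 14.0]  # seconds behind the leader
# the collapse between indices 2 and 3 marks the lapping
```

The smaller 20-second pit-stop steps pass through this filter untouched, which is the point: big collapses mean lapping, small bumps mean pitstops.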

But that’s not really the point. The point is that here is a solution (and I think mirroring options are a possibility) for hosting an interactive chart generator within an institution. I also wonder to what extent it would be possible to extend the environment to detect single sign-on credentials and allow a student to access a set of files related to a particular course, for example? Alternatively, it looks as if there is support for loading files in from Google Docs, so would it be possible to use this environment as a way of providing a graphing environment for data files stored (and maybe shared via a course) within a student’s Google Apps account?