This recipe uses the cross-platform statistical analysis package R. I use R via the RStudio client, which provides an IDE wrapper around the R environment.

So for example, here’s how to merge a couple of files sharing elements in a common column…

First, load in your two data files – for example, I’m going to load in separate files that contain qualifying and race stats from the last Grand Prix:
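The original load step isn't shown here, so by way of a minimal, self-contained sketch (the column names and values below are made up; the real files would be your own timing exports):

```r
# Fabricate two tiny demo CSV files so the snippet runs as-is;
# in practice these would be your downloaded qualifying and race stats.
writeLines("driverNum,qualipos\n1,2\n2,1\n3,5", "hun_2011qualistats.csv")
writeLines("driverNum,racepos\n1,1\n2,3\n4,2", "hun_2011racestats.csv")

hun_2011qualistats = read.csv("hun_2011qualistats.csv")
hun_2011racestats  = read.csv("hun_2011racestats.csv")
```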

We can merge the datasets using a command of the form:

`m = merge(hun_2011racestats, hun_2011qualistats, by="driverNum")`

The *by* parameter identifies which column we want to merge the tables around. (If the two datasets have different column names, you need to set *by.x=* and *by.y=* to specify the column from each dataset that is the focus for merging).

So for example, in the simple case where we are merging around two columns of the same name in different tables:

After the merge, any non-key columns that share a name across the two tables are disambiguated with suffixes: *.x* for columns from the first table, *.y* for columns from the second.
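A quick sketch of both behaviours, using made-up data frames (the column names *ultimate* and *qualipos* are just for illustration):

```r
# Both frames share a non-key column called "ultimate", so merge()
# disambiguates it with .x (first table) and .y (second table) suffixes.
race  = data.frame(driverNum=c(1,2,3), ultimate=c(83.5, 83.1, 84.2))
quali = data.frame(driverNum=c(1,2,3), ultimate=c(81.2, 80.9, 82.0))
m = merge(race, quali, by="driverNum")
names(m)  # "driverNum" "ultimate.x" "ultimate.y"

# If the key columns are named differently, use by.x and by.y:
quali2 = data.frame(number=c(1,2,3), qualipos=c(2,1,3))
m2 = merge(race, quali2, by.x="driverNum", by.y="number")
```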

We can then export the combined dataset as a CSV file:

`write.csv(m, file = "demoMerge.csv")`

[If you want to extract a subset of the columns, specify the required columns in an R command of the form: `m2=m[c("driverNum","name.x","ultimate.x","ultimate.y")]` See also: R: subset]

Simples;-)

PS in the above example, the merged table only contains matched rows. If there are elements in the common column of one table but not the other, that partial data will *not* be included in the merged table. To include all rows, set *all=TRUE*. To include all rows from the first table, but not unmatched rows from the second, set *all.x=TRUE* (the cells from columns in the unmatched rows of the second table will be set to *NA*). (*all.y=TRUE* is also legitimate.) From the R *merge* documentation:

In SQL database terminology, the default value of *all = FALSE* gives a natural join, a special case of an inner join. Specifying *all.x = TRUE* gives a left (outer) join, *all.y = TRUE* a right (outer) join, and both (*all = TRUE*) a (full) outer join. DBMSes do not match NULL records, equivalent to *incomparables = NA* in R.
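Those join types can be seen with a couple of toy data frames whose key columns only partially overlap:

```r
# Sketch of the all / all.x / all.y options, using tiny made-up frames
# whose driverNum values overlap only at 2 and 3.
x = data.frame(driverNum=c(1,2,3), racepos=c(1,3,2))
y = data.frame(driverNum=c(2,3,4), qualipos=c(1,2,5))

nrow(merge(x, y))              # 2 - inner join: matched rows only
nrow(merge(x, y, all=TRUE))    # 4 - full outer join
nrow(merge(x, y, all.x=TRUE))  # 3 - left join; unmatched y-columns become NA
nrow(merge(x, y, all.y=TRUE))  # 3 - right join
```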

For other ways of combining data from two different data sets, see:

– Merging Datasets with Common Columns in Google Refine

– A Further Look at the Orange Data Playground – Filters and File Merging

– Merging CSV data files with Google Fusion Tables

If you know of any other simple ways of joining data files about a common column, please reveal all in the comments:-)

Whilst looking for an easy way of shaping an R data frame so that I could plot categorical values sector1, sector2, sector3 on the x-axis, and then a line for each driver showing their time in the sector on the y-axis (I still haven’t worked out how to do that? Any hints? Add them to the comments please…;-), I came across a variant of the parallel coordinate plot hidden away in the lattice package:

What this plot does is for each row (i.e. each driver) take values from separate columns (i.e. times from each sector), normalise them, and then plot lines between the normalised value, one “axis” per column; each row defines a separate category.
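Here's a minimal sketch of that sort of plot with fabricated sector times (note that recent versions of *lattice* expose the function as *parallelplot()*; older versions used *parallel()*):

```r
library(lattice)

# Made-up sector times in seconds, one row per driver.
a = data.frame(driver  = c("d1","d2","d3","d4"),
               sector1 = c(28.8, 28.9, 29.3, 29.4),
               sector2 = c(22.6, 22.5, 23.0, 23.1),
               sector3 = c(27.1, 27.3, 27.6, 27.5))

# One normalised "axis" per sector column; each driver is a separate line.
p = parallelplot(~a[, c("sector1","sector2","sector3")])
print(p)  # lattice objects only render when printed
```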

The normalisation obviously hides the magnitude of the differences between the time deltas in each sector (the min-max range might be hundredths in one sector, tenths in another), but this plot does show us that there are different groupings of cars – there is clear(?!) white space in the diagram:

Whilst the parallel co-ordinate plot helps identify groupings of cars, and shows where they may have similar performance, it isn’t so good at helping us get an idea of which sector had most impact on the final lap time. For this, I think we need to have a single axis in seconds showing the delta from the fastest time in the sector. That is, we should have a parallel plot where the parallel axes have the same scale, but in terms of sector time, a floating origin (so e.g. the origin for one sector might be 28.6s and for another, 22.4s). For convenience, I’d also like to see the deltas shown on the y-axis, and the categorical ranges arranged on the x-axis (in contrast to the diagrams above, where the different ranges appear along the y-axis).
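As a rough sketch of that idea (again with fabricated times), we can subtract each sector's fastest time so that every axis becomes a delta in seconds with its own floating origin, then put the sectors along the x-axis using base graphics:

```r
# Made-up sector times (seconds), one row per driver.
a = data.frame(sector1 = c(28.8, 28.9, 29.3, 29.4),
               sector2 = c(22.6, 22.5, 23.0, 23.1),
               sector3 = c(27.1, 27.3, 27.6, 27.5))

# Delta from the fastest time in each sector: a floating origin per sector,
# but a common scale in seconds everywhere.
deltas = as.data.frame(lapply(a, function(s) s - min(s)))

# Sectors on the x-axis, deltas (seconds) on the y-axis, one line per driver.
matplot(t(deltas), type="l", xaxt="n", ylab="delta from fastest (s)")
axis(1, at=1:3, labels=names(deltas))
```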

PS I also wonder to what extent we can identify signatures for the different teams? Eg the fifth and sixth slowest cars in sector 1 have the same signature across all three sectors and come from the same team; and the third and fourth slowest cars in sector 2 have a very similar signature (and again, represent the same team).

Where else might we look for signatures? In the speed traps maybe? Here’s what the parallel plot for the speed traps looks like:

(In this case, max is better = faster speed.)

To combine the views (timings and speed), we might use a formulation of the flavour:

`parallel(~data.frame(a$sector1,a$sector2,a$sector3, -a$inter1,-a$inter2,-a$finish,-a$trap))`

This is a bit too cluttered to pull much out of though? I wonder if changing the order of parallel axes might help, e.g. by trying to come up with an order that minimises the number of crossed lines?

And if we colour lines by team, can we see any characteristics?

*Using a dashed, rather than solid, line makes the chart a little easier to read (more white space). Using a thicker line also helps bring out the colours.*

`parallel(~data.frame(a$sector1,-a$inter1,-a$inter2,a$sector2,a$sector3, -a$finish,-a$trap),col=a$team,lty=2,lwd=2)`

Here’s another ordering of the axes:

Here are the sector times ordered by team (min is better):

Here are the speeds by team (max is better):

Again, we can reorder this to try to make it easier(?!) to pull out team signatures:

(I wonder – would it make sense to try to order these based on similarity eg derived from a circuit guide?)

Hmmm… I need to ponder this…

One of the things that struck me from conversations and some of the presentations was that statistics – and in particular public engagement around statistics – appears to be lagging science efforts in this area.

When I first moved to the OU as a lecturer a dozen or so years ago, I got involved with various activities that, at the time, were classed as “public understanding of science and technology”, though at the time the whole sci-comm area was in a state of flux and ideas were moving towards a focus on public *engagement* with science. As a member of the NESTA Crucible one year, I saw how there was also concern around engagement with science and technology policy, and how it could be moved “upstream”, to a point where dialogue with various publics could actually contribute to, and even influence, policy development.

(The NESTA Crucible experience significantly influenced my world view and was one of the most rewarding schemes I have ever been involved with…)

Since then, it seems to me that the school science curriculum has witnessed a similar change, with a move away from a focus purely on the basic science (and perhaps industrial applications?) to one that includes a consideration of socio-technical considerations (one might say, policy implications…)

At the VIPS event, one of the phrases that jumped out at me in at least one presentation (aside from repeated mentions of RSS…;-) talked about difficulties in promoting the *public understanding of statistics*. Ally this with the fact that the school maths curriculum seems *not* to have evolved so much (“averages”, means and histograms still seem to be the focus?!) and I wonder: is statistics today where science was a decade or so ago?

The recent rhetoric around – and actual release of – “open public data” suggests that, as citizens and journalists, there is an increasing number of opportunities to hold governments and public bodies to account using evidentiary data and maybe also engage in data-driven (or at least data informed) policy formulation. With so much data out there, and so many possible ways of combining and interrogating it – so many possible different questions to ask and places to ask them – there are increasingly opportunities for informed amateurs to make a very real contribution (in the same way that amateur astronomers can make a real contribution to the recording and analysis of astronomical observations).

The growing instrumentation of our world also means that there are increasing amounts of data about ourselves that we can have access to in the form of personal data dashboards (think of various social media/reputation tools, for example, but also expect to see various tools appearing that allow you to mine your health/fitness, financial or shopping transaction data). These dashboards will be visually rich, and designed to give at-a-glance overviews of the state of this or that quantity or metric. But to get the most from them, we will need to include more complex and powerful visualisation types, *and find a way of helping people learn how to “see” them, “read” them and interpret them*.

So to what extent do we need to engage with the “public understanding of statistics” as compared to the development of skills in the public *appreciation* of statistics and improvements in the way the public can engage with each other and with policy makers in discussions where statistics play a role? (Public engagement *in* statistics? Public engagement *with* statistics?)

Over the last few weeks, I’ve started trying to immerse myself in the world of statistical graphics, on the basis that our perceptual apparatus is pretty good at pattern detection and can help us get to grip with visually meaningful properties of distributions of data without us necessarily having to understand much in the way of formal statistics. (Of course, the visual apparatus can also be conned by misleading graphs and charts, which is where some semblance of critical understanding and, dare I say it, statistical literacy, comes in.)

My intuition is that it will be easier to develop a visual literacy in the reading and interpretation of charts (i.e. building on “folk statistical graphics/visual statistics”) than a widespread mathematical understanding of statistics. (I suspect that for most people, pie charts – and more recently ‘donut’ charts – as well as line graphs and simple bar charts are about the limit of what they are comfortable with, along with thematic maps (in particular, *choropleth* maps) and (in recent years again?) *proportional symbol maps*. I also know from asking even well-informed audiences that awareness of more recently developed techniques, such as treemaps, is not widespread.)

At the moment, the infographics designers appear to be leading the charge into public consciousness of data-driven graphics, but as I’m finding out, the stats community has a wealth of visual techniques already to hand that are maybe “sounder” in terms of deriving visual representations that reflect statistical properties and concerns than the tricks the infographics crowd are using. (This is all just my anecdotal opinion, and not based in any formal research!)

Many infographics build on a common visual grammar (in the West, line charts up to the right increase over time; for area based charts, the bigger the area the more of something is being represented). But many infographics are also limited by the chart types we are all familiar with (line charts, bar charts, coloured maps…) Maybe the place to start is the stats community finding ways of introducing new-to-the-majority statistical graphs into the mainstream media along with a strong narrative to explain what is going on in those charts (and not necessarily so much discussion about the actual maths and stats…)?
