OUseful.Info, the blog…

Trying to find useful things to do with emerging technologies in open education

Merging Two Different Datasets Containing a Common Column With R and R-Studio

Another way for the database challenged (such as myself!) for merging two datasets that share at least one common column…

This recipe using the cross-platform stats analysis package, R. I use R via the R-Studio client, which provides an IDE wrapper around the R environment.

So for example, here’s how to merge a couple of files sharing elements in a common column…

First, load in your two data files – for example, I’m going to load in separate files that contain qualifying and race stats from the last Grand Prix:

R import data

We can merge the datasets using a command of the form:

m=merge(hun_2011racestats,hun_2011qualistats,by="driverNum")

The by parameter identifies which column we want to merge the tables around. (If the two datasets have different column names, you need to set by.x= and by.y= to specify the column from each dataset that is the focus for merging).

So for example, in the simple case where we are merging around two columns of the same name in different tables:

R merge
Merging datasets in R

After the merge, column names for columns from the first table have the .x suffix added, and from the second, .y.

We can then export the combined dataset as a CSV file:

write.csv(m, file = "demoMerge.csv")

fragment of csv file

[If you want to extract a subset of the columns, specify the required columns in an R command of the form: m2=m[c(“driverNum”,”name.x”,”ultimate.x”,”ultimate.y”)] See also: R: subset]

Simples;-)

PS in the above example, the merge table only contains merged rows. If there are elements in the common column of one table, but not the other, that partial data will not be included in the merged table. To include all rows, set all=TRUE. To include all rows from the first table, but not unmatched rows from the second, set all.x=TRUE; (the cells from columns in the unmatched row of the second table will be set to NA). (all.y=TRUE is also legitimate). From the R merge documentation:

In SQL database terminology, the default value of all = FALSE [the default] gives a natural join, a special case of an inner join. Specifying all.x = TRUE gives a left (outer) join, all.y = TRUE a right (outer) join, and both (all=TRUE a (full) outer join. DBMSes do not match NULL records, equivalent to incomparables = NA in R.

For other ways of combining data from two different data sets, see:
Merging Datasets with Common Columns in Google Refine
A Further Look at the Orange Data Playground – Filters and File Merging
Merging CSV data files with Google Fusion Tables

If you know of any other simple ways of joining data files about a common column, please reveal all in the comments:-)

Written by Tony Hirst

August 2, 2011 at 10:38 am

Posted in Data, Infoskills, onlinejournalismblog, Rstats

Tagged with ,

4 Responses

Subscribe to comments with RSS.

  1. Have you played with Google Refine yet Tony?

    It’s a great tool for merging and ammending datasets.

    Alex Bilbie

    August 2, 2011 at 2:52 pm

    • Yep, I played with it first when it was Freebase Gridworks;-)

      See also eg http://blog.ouseful.info/2011/05/06/merging-datesets-with-common-columns-in-google-refine/

      I really want to spend a proper chunk of time on Refine, and pull a few more data cleaning recipes together. I also think I need to learn a bit more about data reshaping possibilities in R. Finally, there’s Stanford Data Wrangler which I think has moved on somewhat since I last played with it, and seems to complement both Refine and R eg in the way it can handled unstructured data and pull it into a regular shape.

      If I was a bit more disciplined, I should probably think about setting up a pattern collection in a wiki that would show how to choose between these tools to do different sorts of data wrangling and achieve particular transformations. But that seems too much like real work!

      Tony Hirst

      August 2, 2011 at 3:06 pm

  2. […] other ways of combining data from two different data sets, see: – Merging Two Different Datasets Containing a Common Column With R and R-Studio – A Further Look at the Orange Data Playground – Filters and File Merging – Merging CSV data […]

  3. […] codes with practice names (Merging Datasets with Common Columns in Google Refine might be one way? Merging Two Different Datasets Containing a Common Column With R and R-Studio […]


Comments are closed.

Follow

Get every new post delivered to your Inbox.

Join 869 other followers

%d bloggers like this: