OUseful.Info, the blog…

Trying to find useful things to do with emerging technologies in open education

Splitting a Large CSV File into Separate Smaller Files Based on Values Within a Specific Column

One of the problems with working with data files containing tens of thousands (or more) rows is that they can become unwieldy, if not impossible, to use with “everyday” desktop tools. When I was Revisiting MPs’ Expenses, the expenses data I downloaded from IPSA (the Independent Parliamentary Standards Authority) came in one large CSV file per year containing expense items for all the sitting MPs.

In many cases, however, we might want to look at the expenses for a specific MP. So how can we easily split the large data file containing expense items for all the MPs into separate files containing expense items for each individual MP? Here’s one way using a handy little R script in RStudio

Load the full expenses data CSV file into RStudio (for example, calling the dataframe it is loaded into mpExpenses2012. Previewing it we see there is a column MP.s.Name identifying which MP each expense claim line item corresponds to:

mpexpenses2012R

We can easily pull out the unique values of the MP names using the levels command, and then for each name take a subset of the data containing just the items related to that MP and print it out to a new CSV file in a pre-existing directory:

mpExpenses2012 = read.csv("~/Downloads/DataDownload_2012.csv")
#mpExpenses2012 is the large dataframe containing data for each MP
#Get the list of unique MP names
for (name in levels(mpExpenses2012$MP.s.Name)){
  #Subset the data by MP
  tmp=subset(mpExpenses2012,MP.s.Name==name)
  #Create a new filename for each MP - the folder 'mpExpenses2012' should already exist
  fn=paste('mpExpenses2012/',gsub(' ','',name),sep='')
  #Save the CSV file containing separate expenses data for each MP
  write.csv(tmp,fn,row.names=FALSE)
}

Simples:-)

This technique can be used to split any CSV file into multiple CSV files based on the unique values contained within a particular, specified column.

Written by Tony Hirst

April 3, 2013 at 8:54 am

Posted in Rstats

Tagged with ,

3 Responses

Subscribe to comments with RSS.

  1. You could also use `split()` to create a list of dataframes, one for each value of the mpExpenses2012$MP.s.Name variable and then `lapply()` over the list to save each smaller dataframe.
    The code might look like this:
    MPlist <- split(mpExpenses2012, mpExpenses2012$MP.s.Name)
    lapply(names(MPlist), function(name) write.csv(MPlist[[name]], file = paste('mpExpenses2012/',gsub(' ','',name),sep = ''), row.names = F))

    Robert Norberg

    April 3, 2013 at 7:41 pm

    • @robert Thanks for that tip – I haven’t used split() before… Your approach looks far more R idiomatic:-)

      Tony Hirst

      April 3, 2013 at 10:41 pm


Comments are closed.

Follow

Get every new post delivered to your Inbox.

Join 797 other followers

%d bloggers like this: