OUseful.Info, the blog…

Trying to find useful things to do with emerging technologies in open education

Quick and Dirty Recipe: Merging (Concatenating) Multiple CSV files (ODA Spending)

There’s been a flurry of tweets over the last few days about LOCOG’s exemption from FOI (example LOCOG response to an FOI request), but the Olympic Delivery Authority (ODA, one of the owner stakeholders) is rather more open, and publishes its spends over £25k: ODA Finance: Transparency Reports.

CSV files containing spend on a monthly basis are available from the site, using a consistent CSV file format each time (I think…). For what it’s worth, I thought it might be worth sharing a pragmatic, though not ideal, Mac/Linux/unixtools commandline recipe for generating a single file containing all this data.

  1. Right-click and download each of the CSV files on the page to the same directory (eg odaSpending) on your local machine. (There are easier ways of doing this – I tried wget on the command line, but got an Access Denied response (workaround anyone?); there are probably more than a few browser extensions/plugins that will also download all the files linked to from a page. If so, you just want to grab the csv files; if you get them all, from the command line, just copy the csv files to a new directory: eg mkdir csvfiles;cp *.csv csvfiles)
  2. On the commandline, change directory to the files directory – eg cd odaSpending/csvfiles; then join all the files together: files=*; cat $files > odaspending.csv
  3. You should now have a big file, odaspending.csv, containing all the data, although it will also contain multiple header rows (each csv file had its own header row). Open the file in a text editor (I use TextWrangler), copy from the start of the first line to the start of the second (ie copy the header row, including the end of line/carriage return), then do a Find on the header and global Replace with nothing replacing the search string. Then, depending where you started the replace, maybe paste the header (if required) back into the first row

To turn the data file into something you can explore more interactively, upload it to something like Google Fusion Tables, as I did here (data to May 2012): ODA Spending in Google Fusion Tables

Note that this recipe is a pragmatic one. Unix gurus would surely be able to work out far more efficient scripts that concatenate the files after stripping out the header in all but the first file, for example, or that maybe even check the columns are the same etc etc. But if you want something quick and dirty, this is one way of doing it… (Please feel free to add alternative recipes for achieving the same thing in the comments…)

PS here’s an example of one sort of report you can then create in Fusion Tables – ODA spend with G4S; here’s another: Seconded staff

Written by Tony Hirst

July 16, 2012 at 10:16 am

4 Responses

Subscribe to comments with RSS.

  1. Thanks for writing this post! It has been really helpful, but has only gotten me so far… I have a slightly more complicated problem:

    Say the files were published every month using inconsistant header rows containing similar data each month. For example maybe one month contains a header labeled “Expenses” and the next month the header is labeled “Expenditures” but both columns contain the same kind of data: what was spent that month. What’s more, maybe these columns are in different locations in the CSV. Expenses is located in the first column and the next month Expenditures is located in the fourth column.

    Is there a “quick and dirty” way to change the header names and locations and then concatenate the two months so that everything will be in the correct order?

    calliewheeler

    November 1, 2012 at 11:19 pm


Comments are closed.

Follow

Get every new post delivered to your Inbox.

Join 729 other followers

%d bloggers like this: