<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>OUseful.Info, the blog... &#187; Postcards from a Text Processing Excursion</title>
	<atom:link href="http://blog.ouseful.info/2011/06/03/postcards-from-a-text-processing-excursion/feed/?withoutcomments=1" rel="self" type="application/rss+xml" />
	<link>http://blog.ouseful.info</link>
	<description>Trying to find useful things to do with emerging technologies in open education</description>
	<lastBuildDate>Fri, 24 May 2013 12:19:26 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
<cloud domain='blog.ouseful.info' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' />
<image>
		<url>http://s2.wp.com/i/buttonw-com.png</url>
		<title>OUseful.Info, the blog... &#187; Postcards from a Text Processing Excursion</title>
		<link>http://blog.ouseful.info</link>
	</image>
	<atom:link rel="search" type="application/opensearchdescription+xml" href="http://blog.ouseful.info/osd.xml" title="OUseful.Info, the blog..." />
	<atom:link rel='hub' href='http://blog.ouseful.info/?pushpress=hub'/>
		<item>
		<title>Postcards from a Text Processing Excursion</title>
		<link>http://blog.ouseful.info/2011/06/03/postcards-from-a-text-processing-excursion/</link>
		<comments>http://blog.ouseful.info/2011/06/03/postcards-from-a-text-processing-excursion/#comments</comments>
		<pubDate>Fri, 03 Jun 2011 11:53:54 +0000</pubDate>
		<dc:creator>Tony Hirst</dc:creator>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[Infoskills]]></category>
		<category><![CDATA[onlinejournalismblog]]></category>
		<category><![CDATA[Tutorial]]></category>
		<category><![CDATA[Uncourse]]></category>
		<category><![CDATA[text processing]]></category>
		<category><![CDATA[text wrangling]]></category>

		<guid isPermaLink="false">http://blog.ouseful.info/?p=5579</guid>
		<description><![CDATA[It never ceases to amaze me how I lack even the most basic computer skills, but that&#8217;s one of the reasons I started this blog: to demonstrate and record my fumbling learning steps so that others maybe don&#8217;t have to spend so much time being as dazed and confused as I am most of the [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>It never ceases to amaze me how I lack even the most basic computer skills, but that&#8217;s one of the reasons I started this blog: to demonstrate and record my fumbling learning steps so that others maybe don&#8217;t have to spend so much time being as dazed and confused as I am most of the time&#8230;</p>
<p>Anyway, I spent a fair chunk of yesterday trying to find a way of getting started with grappling with CSV data text files that are just a bit too big to comfortably manage in a text editor or simple spreadsheet (so files over 50,000 or so rows, up to low millions) and that should probably be dumped into a database <em>if</em> that option was available, but for whatever reason, isn&#8217;t&#8230; (Not feeling comfortable with setting up and populating a database is one example&#8230;But I doubt I&#8217;ll get round to blogging my SQLite 101 for a bit yet&#8230;)</p>
<p>Note that the following tools are Unix tools &#8211; so they work on Linux and on a Mac, but probably not on Windows unless you install a unix tools package (such as <a href="http://gnuwin32.sourceforge.net/">GnuWin</a> &#8211; <em>coreutils</em> and <em>sed</em>, which look good for starters&#8230;). Another alternative would be to download the <a href="http://susegallery.com/a/RQrRBY/data-journalism-developer-studio--2">Data Journalism Developer Studio</a> and run it either as a bootable CD/DVD, or as a virtual machine using something like <a href="http://www.vmware.com/">VMWare</a> or <a href="http://www.virtualbox.org/">VirtualBox</a>.</p>
<p>All the tools below are related to the basic mechanics of wrangling with text files, which include CSV (comma separated) and TSV (tab separated) files. Your average unix jockey will look at you with sympathetic eyes if you rave about them, but for us mere mortals, they may make life easier for you than you ever thought possible&#8230;</p>
<p><em>[If you know of simple tricks in the style of what follows that I haven't included here, please feel free to add them in as a comment, and I'll maybe try to work them into a continual updating of this post...]</em></p>
<p>If you want to play along, why not check out this <a href="http://openurl.ac.uk/doc/data/data.html">openurl data from EDINA</a> (<a href="http://openurl.ac.uk/doc/data/sample.html">data sample</a>; a more comprehensive set is also available if you&#8217;re feeling brave: <a href="http://openurl.ac.uk/doc/data/thedata.html">monthly openurl data</a>).</p>
<p>So let&#8217;s start at the beginning and imagine you&#8217;re faced with a large CSV file &#8211; 10MB, 50MB, 100MB, 200MB large &#8211; and when you try to open it in your text editor (the file&#8217;s too big for Google spreadsheets and maybe even for Google Fusion tables) the whole thing just grinds to a halt, if it doesn&#8217;t actually fall over.</p>
<p>What to do?</p>
<p>To begin with, you may want to take a deep breath and find out just what sort of beast you have to contend with. You know the file size, but what else might you learn? (I&#8217;m assuming the file has a csv suffix, <em>L2sample.csv</em> say, so for starters we&#8217;re assuming it&#8217;s a text file&#8230;)</p>
<p>The <tt>wc</tt> (word count) command is a handy little tool that will give you a quick overview of how many rows there are in the file:</p>
<p><tt>wc -l L2sample.csv</tt></p>
<p>I get the response <em>101 L2sample.csv</em>, so there are presumably 100 data rows and 1 header row.</p>
<p>We can learn a little more by taking the <tt>-l</tt> linecount switch off, and getting a report back on the number of words and characters in the file as well:</p>
<p><tt>wc L2sample.csv</tt></p>
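<p>If you want the row count as a bare number (to subtract the header row, say), one option is to feed the file to <tt>wc</tt> through a pipe, so that no filename is printed. Here&#8217;s a minimal sketch; the scratch directory and the tiny <em>sample.tsv</em> file are made up purely for illustration:</p>

```shell
# Hypothetical sample file: one header row plus three data rows, tab separated
work=$(mktemp -d)
printf 'logDate\tlogTime\tsource\n2011-04-04\t00:00:03\tukfed\n2011-04-04\t00:00:14\twayf\n2011-04-04\t00:00:15\tathens\n' > "$work/sample.tsv"

# Reading from a pipe means wc prints just the number, with no filename after it
total=$(cat "$work/sample.tsv" | wc -l)

# Subtract the header row to get the number of data rows
dataRows=$((total - 1))
echo "$dataRows data rows"
```

<p>With the made-up file above, this should report 3 data rows.</p>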
<p>Another thing that you might consider doing is just having a look at the structure of the file, by sampling the first few rows of it and having a peek at them. The <tt>head</tt> command can help you here.</p>
<p><tt>head L2sample.csv</tt></p>
<p>By default, it returns the first 10 rows of the file. If we want to change the number of rows displayed, we can use the <tt>-n</tt> switch:</p>
<p><tt>head -n 4 L2sample.csv</tt></p>
<p>As well as the <tt>head</tt> command, there is the <tt>tail</tt> command; this can be used to peek at the lines at the end of the file:</p>
<p><tt>tail L2sample.csv<br />
tail -n 15 L2sample.csv</tt></p>
<p>When I look at the rows, I see they have the form:</p>
<pre>logDate	logTime	encryptedUserIP	institutionResolverID	routerRedirectIdentifier ...
2011-04-04	00:00:03	kJJNjAytJ2eWV+pjbvbZTkJ19bk	715781	ukfed ...
2011-04-04	00:00:14	/DAGaS+tZQBzlje5FKsazNp2lhw	289516	wayf ...
2011-04-04	00:00:15	NJIy8xkJ6kHfW74zd8nU9HJ60Bc	569773	athens ...</pre>
<p>So, not <em>comma</em> separated then; <em>tab</em> separated&#8230;;-)</p>
<p>If you were to upload a tab separated file to something like Google Fusion Tables, which I think currently only parses CSV text files for some reason, it will happily spend the time uploading the data &#8211; and then shove it into a single column.</p>
<p><em>I&#8217;m not sure if there are column splitting tools available in Fusion Tables &#8211; there weren&#8217;t last time I looked, though maybe we might expect a fuller range of import tools to appear at some point; many applications that accept text based data files allow you to specify the separator type, as for example in Google spreadsheets:</p>
<p><img src="http://ouseful.files.wordpress.com/2011/06/googspreadsheetimportdialogue.png?w=700" alt="" title="googspreadsheetimportdialogue"   class="alignnone size-full wp-image-5582" /></p>
<p>I&#8217;m personally living in hope that some sort of integration with the <a href="http://code.google.com/p/google-refine/">Google Refine data cleaning tool</a> will appear one day&#8230;</em></p>
<p>If you want to take a sample of a large data file and put into another smaller file that you can play with or try things out with, the <tt>head</tt> (or <tt>tail</tt>) tool provides one way of doing that thanks to the magic of Unix <em>redirection</em> (which you might like to think of as a &#8220;pipe&#8221;, although that has a slightly different meaning in Unix land&#8230;). The words/jargon may sound confusing, and the syntax may look cryptic, but the effect is really powerful: <strong>take the output from a command and shove it into a file.</strong></p>
<p>So, given a CSV file with a million rows, suppose we want to run a few tests in an application using a couple of hundred rows. <em>This trick will help you generate the file containing the couple of hundred rows.</em></p>
<p>Here&#8217;s an example using <em>L2sample.csv</em> &#8211; we&#8217;ll create a file containing the first 20 rows, plus the header row:</p>
<p><tt>head -n 21 L2sample.csv <strong>&gt;</strong> subSample.csv</tt></p>
<p>See the <strong>&gt;</strong> sign? That says &#8220;take the output from the command on the left, and shove it into the file on the right&#8221;. (Note that if <em>subSample.csv</em> already exists, it will be overwritten, and you will lose the original.)</p>
<p>There&#8217;s probably a better way of doing this, but if you want to generate a CSV file (with headers) containing the last 10 rows, for example, of a file, you can use the <em>cat</em> command to join a file containing the headers with a file containing the last 10 rows:</p>
<p><tt>head -n 1 L2sample.csv &gt; headers.csv<br />
tail -n 10 L2sample.csv &gt; subSample.csv<br />
<strong>cat</strong> headers.csv subSample.csv &gt; subSampleWithHeaders.csv</tt></p>
<p>(Note: don&#8217;t try to <em>cat</em> a file into itself, or Ouroboros may come calling&#8230;)</p>
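<p>One possible shortcut for the header-plus-last-rows recipe is to group the <tt>head</tt> and <tt>tail</tt> commands in braces, so that both write to the same output file and no intermediate files are needed. This is only a sketch; the file names and the sample data are invented for the example:</p>

```shell
# Hypothetical sample file: a header row plus five data rows
work=$(mktemp -d)
printf 'id\tname\n1\ta\n2\tb\n3\tc\n4\td\n5\te\n' > "$work/sample.tsv"

# The braces group the two commands, so one redirection captures the output of both:
# first the header row, then the last two rows
{ head -n 1 "$work/sample.tsv"; tail -n 2 "$work/sample.tsv"; } > "$work/lastRowsWithHeaders.tsv"

cat "$work/lastRowsWithHeaders.tsv"
```

<p>One thing to watch: if the file is so short that the last rows include the header row, the header will appear twice.</p>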
<p>Another very powerful concept from the Unix command line is the notion of <strong>|</strong> (the <em>pipe</em>). This lets you take the output from one command and direct it to another command (rather than directing it into a file, as &gt; does). So for example, if we want to extract rows 10 to 15 from a file, we can use <em>head</em> to grab the first 15 rows, then <em>tail</em> to grab the last 6 rows of those 15 rows (count them: 10, 11, 12, 13, 14, 15):</p>
<p><tt>head -n 15 L2sample.csv | tail -n 6 &gt; middleSample.csv</tt></p>
<p>Try to read it as an English phrase (the | and &gt; are punctuation): <em>take the first [<strong>head</strong>] 15 rows [<strong>-n 15</strong>] of the file <strong>L2sample.csv</strong> and use them as input [<strong>|</strong>] to the <tt>tail</tt> command; take the last [<strong>tail</strong>] 6 lines [<strong>-n 6</strong>] of the input data and save them [<strong>&gt;</strong>] as the file <strong>middleSample.csv</strong></em>.</p>
<p>If we want to add in the headers, we can use the <em>cat</em> command:</p>
<p><tt>cat headers.csv middleSample.csv &gt; middleSampleWithHeaders.csv</tt></p>
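<p>As an aside, <tt>sed</tt> can also pull out a range of rows directly by line number, which saves working out how many rows to <em>tail</em>. The sketch below runs both approaches on a small invented file so they can be compared:</p>

```shell
# Hypothetical six-line sample file
work=$(mktemp -d)
printf 'row1\nrow2\nrow3\nrow4\nrow5\nrow6\n' > "$work/sample.txt"

# head/tail version: take the first 5 rows, then the last 3 of those (rows 3, 4 and 5)
head -n 5 "$work/sample.txt" | tail -n 3 > "$work/viaHeadTail.txt"

# sed version: -n turns off the default printing, and 3,5p prints only rows 3 to 5
sed -n '3,5p' "$work/sample.txt" > "$work/viaSed.txt"

cat "$work/viaSed.txt"
```

<p>Both output files should contain the same three rows.</p>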
<p>We can use a pipe to join all sorts of commands. If our file only uses a single word for each column header, we can count the number of columns (single words) by grabbing the header row and sending it to <tt>wc</tt>, which will count the words for us:</p>
<p><tt>head -n 1 L2sample.csv | wc</tt></p>
<p>(Take the first row of L2sample.csv and count the lines/words/characters. If there is one word per column header, the word count gives us the column count&#8230;;-)</p>
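<p>If some column headers contain more than one word, the word count trick will overcount. For a tab separated file, one workaround is to turn each tab in the header row into a newline using <tt>tr</tt>, and then count lines instead of words. A minimal sketch, using an invented header row:</p>

```shell
# Hypothetical sample file whose first column header is two words long
work=$(mktemp -d)
printf 'log date\tlogTime\tsource\n2011-04-04\t00:00:03\tukfed\n' > "$work/sample.tsv"

# Grab the header row, turn each tab into a newline, then count the resulting lines
columns=$(head -n 1 "$work/sample.tsv" | tr '\t' '\n' | wc -l)
echo "columns: $columns"
```

<p>Here the header row has three columns, even though it contains four words.</p>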
<p>Sometimes we just want to split a big file into a set of smaller files. The <tt>split</tt> command is our friend here, and lets us split a file into smaller files containing up to a known number of rows/lines:</p>
<p><tt>split -l 15 L2sample.csv subSamples</tt></p>
<p>This will generate a series of files named <em>subSamples<strong>aa</strong></em>, <em>subSamples<strong>ab</strong></em>, &#8230;, each containing 15 lines (except for the last one, which may contain fewer&#8230;).</p>
<p>Note that the first file will contain the header and 14 data rows, and the other files will contain 15 data rows but no column headings. To get round this, you might want to <em>split</em> on a file that doesn&#8217;t contain the header. (So maybe use <em>wc -l</em> to find the number of rows in the original file, create a header free version of the data by using <em>tail</em> on one less than the number of rows in the file, then <em>split</em> the header free version. You might then want to use <em>cat</em> to put the header back in to each of the smaller files&#8230;)</p>
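<p>The header juggling described above might be sketched as follows. It leans on <tt>tail -n +2</tt> (start at line 2) to drop the header without needing a row count first; the file names and the <em>chunk_</em> prefix are made up for the example:</p>

```shell
# Hypothetical sample file: a header row plus five data rows
work=$(mktemp -d)
printf 'id\tname\n1\ta\n2\tb\n3\tc\n4\td\n5\te\n' > "$work/sample.tsv"

# Keep the header row to one side
head -n 1 "$work/sample.tsv" > "$work/headers.tsv"

# Drop the header, then split what is left into pieces of 2 rows each;
# the - tells split to read from the pipe rather than from a named file
tail -n +2 "$work/sample.tsv" | split -l 2 - "$work/chunk_"

# Put the header back on the front of each piece
for piece in "$work"/chunk_??; do
  cat "$work/headers.tsv" "$piece" > "$piece.tsv"
done

ls "$work"
```

<p>Each <em>chunk_??.tsv</em> file should then start with the header row, followed by up to 2 data rows.</p>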
<p>A couple of other Unix text processing tools let us use a CSV file as a crude database. The <tt>grep</tt> command searches a file for a particular term <em>or text pattern</em> (known as a regular expression, which I&#8217;m not going to cover much in this post&#8230; suffice to note for now that you can do real text processing voodoo magic with regular expressions&#8230;;-)</p>
<p>So for example, in our test file, I can search for rows that contain the word <em>mendeley</em>:</p>
<p><tt>grep mendeley L2sample.csv</tt></p>
<p>We can also redirect the output into a file:</p>
<p><tt>grep EBSCO L2sample.csv &gt; rowsContainingEBSCO.csv</tt></p>
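<p>Two <tt>grep</tt> switches that may come in handy here are <tt>-i</tt>, which ignores upper/lower case differences, and <tt>-c</tt>, which counts the matching rows rather than printing them. A small sketch using invented data:</p>

```shell
# Hypothetical sample file in which the same name appears with different capitalisation
work=$(mktemp -d)
printf 'source\tpublisher\nmendeley\tEBSCO\nathens\tebsco\nwayf\tElsevier\n' > "$work/sample.tsv"

# -i ignores case, so this matches both EBSCO and ebsco
grep -i ebsco "$work/sample.tsv" > "$work/rowsContainingEBSCO.tsv"

# -c reports how many rows matched instead of printing the rows themselves
matches=$(grep -c -i ebsco "$work/sample.tsv")
echo "$matches matching rows"
```

<p>With the sample above, two rows should match.</p>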
<p>If the text file contains columns that are separated by a unique delimiter (that is, some symbol that is <em>only</em> ever used to separate the columns), we can use the <tt>cut</tt> command to just pull out particular columns. The cut command assumes a tab delimiter (we can specify other delimiters explicitly if we need to), so we can use it to pull out data from the third column in our test file:</p>
<p><tt><strong>cut -f 3</strong> L2sample.csv</tt></p>
<p>We can also pull out multiple columns and save them in a file:</p>
<p><tt><strong>cut -f 1,2,14,17</strong> L2sample.csv &gt; columnSample.csv</tt></p>
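<p>For a file that really is comma separated, the delimiter can be set with the <tt>-d</tt> switch. The sketch below assumes a simple CSV file with no quoted fields that themselves contain commas (<tt>cut</tt> knows nothing about CSV quoting, so those would be split in the wrong place); the data is invented:</p>

```shell
# Hypothetical comma separated sample file, with no commas inside any field
work=$(mktemp -d)
printf 'logDate,logTime,source\n2011-04-04,00:00:03,ukfed\n2011-04-04,00:00:14,wayf\n' > "$work/sample.csv"

# -d ',' says the columns are separated by commas rather than tabs;
# -f 1,3 keeps the first and third columns
cut -d ',' -f 1,3 "$work/sample.csv" > "$work/columnSample.csv"

cat "$work/columnSample.csv"
```

<p>The output keeps the comma as the separator between the selected columns.</p>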
<p>If you pull out just a single column, you can sort the entries to see what different entries are included in the column using the <tt>sort</tt> command:</p>
<p><tt>cut -f 40 L2sample.csv | sort</tt></p>
<p>(Take column 40 of the file L2sample.csv and sort the items.)</p>
<p>We can also take this sorted list and identify the unique entries using the <tt>uniq</tt> command; so here are the different entries in column 40 of our test file:</p>
<p><tt>cut -f 40 L2sample.csv | sort | uniq</tt></p>
<p>(Take column 40 of the file L2sample.csv, sort the items, and display the unique values.)</p>
<p>(The <tt>uniq</tt> command only compares consecutive lines, hence the need to sort first.)</p>
<p>The <tt>uniq</tt> command will also count the repeat occurrence of unique entries if we ask it nicely (<tt>-c</tt>):</p>
<p><tt>cut -f 40 L2sample.csv | sort | uniq <strong>-c</strong></tt></p>
<p>(Take column 40 of the file L2sample.csv, sort the items, and display the unique values along with how many times they appear in the column as a whole.)</p>
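<p>A common follow-on step is to rank the counts, so the most frequent values come first, by piping the output through <tt>sort</tt> a second time. A minimal sketch on an invented single column of values:</p>

```shell
# Hypothetical single column of values, as might come out of cut
work=$(mktemp -d)
printf 'wayf\nathens\nwayf\nukfed\nwayf\nathens\n' > "$work/column.txt"

# Sort so identical values sit next to each other, count them,
# then sort the counts numerically (-n) in reverse order (-r)
sort "$work/column.txt" | uniq -c | sort -rn > "$work/counts.txt"

cat "$work/counts.txt"
```

<p>Each output line is a count followed by the value, with the most common value on the first line.</p>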
<p>The final command I&#8217;m going to mention here is a magic search and replace operator called <tt>sed</tt>. I&#8217;m aware that this post is already overlong, so I&#8217;ll maybe return to this in a later post, aside from giving you a tease of some scary voodoo&#8230; how to convert a tab delimited file to a comma separated file. One recipe is given by Kevin Ashley as follows:</p>
<p><a href="https://twitter.com/#!/kevingashley/statuses/76274081737084928"><img src="http://ouseful.files.wordpress.com/2011/06/tab2csvtweet.png?w=700" alt="" title="tab2csvtweet"   class="alignnone size-full wp-image-5583" /></a></p>
<p><tt>sed 's/"/\\\"/g; s/^/"/; s/$/"/; s/<em>ctrl-V&lt;TAB&gt;</em>/","/g;' origFile.tsv &gt; newFile.csv</tt></p>
<p>(See also this related question on #getTheData: <a href="http://getthedata.org/questions/642/converting-large-ish-tab-separated-files-to-csv">Converting large-ish tab separated files to CSV</a>.)</p>
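<p>Here is a variant of that recipe, offered as a sketch rather than a drop-in replacement. It differs in two ways, both of which are my assumptions rather than part of the tweeted recipe: the tab is held in a shell variable so it doesn&#8217;t have to be typed with ctrl-V, and existing double quotes are escaped by doubling them, which is what many CSV readers expect. The sample data is invented:</p>

```shell
# Hypothetical tab separated file with a field containing quotes and a comma
work=$(mktemp -d)
printf 'id\tnote\n1\tsaid "hi", then left\n' > "$work/origFile.tsv"

# Hold a literal tab character in a variable
tab=$(printf '\t')

# 1. double any existing quotes; 2. add a quote at the start of each line;
# 3. add a quote at the end; 4. replace each tab with ","
sed 's/"/""/g; s/^/"/; s/$/"/; s/'"$tab"'/","/g' "$work/origFile.tsv" > "$work/newFile.csv"

cat "$work/newFile.csv"
```

<p>Every field ends up wrapped in double quotes, so commas inside a field no longer look like column separators.</p>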
<p>Note: if you have a small amount of text and need to wrangle it in some way, the <a href="http://textmechanic.com/">Text Mechanic</a> site might have what you need&#8230;</p>
<p>This lecture note on <a href="http://www.ling.upenn.edu/courses/Spring_2003/ling538/Lecnotes/UnixTools.html">Unix Tools</a> provides a really handy cribsheet of Unix command line text wrangling tools, though the syntax doesn&#8217;t appear to work for me using some of the commands as given there (the important thing is the <em>idea</em> of what&#8217;s possible&#8230;).</p>
<p>If you&#8217;re looking for regular expression helpers (I haven&#8217;t really mentioned these at all in this post, suffice to say they&#8217;re a mechanism for doing pattern based search and replace, and which in the right hands can look like real voodoo text processing magic!), check out <a href="http://txt2re.com/">txt2re</a> and <a href="http://regexpal.com/">Regexpal</a> (<a href="http://blog.stevenlevithan.com/archives/regexpal">about regexpal</a>).</p>
<p>TO DO: this is a biggie &#8211; the <em>join</em> command will join rows from two files with common elements in specified columns. I can&#8217;t get it working properly with my test files, so I&#8217;m not blogging it just yet, but here&#8217;s a starter for 10 if you want to try&#8230; <a href="http://www.albany.edu/~ig4895/join.htm">Unix <em>join</em> examples</a></p>
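<p>For anyone who wants to experiment with <em>join</em>, here is a minimal sketch on two tiny invented files (no header rows) that share an ID in their first column. Two details that commonly trip people up: both files need to be sorted on the join column first, and for tab separated data the delimiter has to be given explicitly with <tt>-t</tt>:</p>

```shell
work=$(mktemp -d)
tab=$(printf '\t')

# Two hypothetical tab separated files sharing an ID in column 1
printf '289516\twayf\n569773\tathens\n715781\tukfed\n' > "$work/logins.tsv"
printf '715781\tUniversity A\n289516\tUniversity B\n' > "$work/names.tsv"

# join expects both inputs to be sorted on the column being joined
sort -t "$tab" -k 1,1 "$work/logins.tsv" > "$work/logins.sorted.tsv"
sort -t "$tab" -k 1,1 "$work/names.tsv" > "$work/names.sorted.tsv"

# Join on column 1 of the first file (-1 1) and column 1 of the second (-2 1)
join -t "$tab" -1 1 -2 1 "$work/logins.sorted.tsv" "$work/names.sorted.tsv" > "$work/joined.tsv"

cat "$work/joined.tsv"
```

<p>By default only rows whose ID appears in both files are kept, so the row with ID 569773 is dropped from the output.</p>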
]]></content:encoded>
			<wfw:commentRss>http://blog.ouseful.info/2011/06/03/postcards-from-a-text-processing-excursion/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/abbd9f90565ce9ae4d065d93a81d8c03?s=96&#38;d=http%3A%2F%2F1.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D96" medium="image">
			<media:title type="html">Tony Hirst</media:title>
		</media:content>

		<media:content url="http://ouseful.files.wordpress.com/2011/06/googspreadsheetimportdialogue.png" medium="image">
			<media:title type="html">googspreadsheetimportdialogue</media:title>
		</media:content>

		<media:content url="http://ouseful.files.wordpress.com/2011/06/tab2csvtweet.png" medium="image">
			<media:title type="html">tab2csvtweet</media:title>
		</media:content>
	</item>
	</channel>
</rss>
