<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>OUseful.Info, the blog... &#187; Finding (Nearly) Duplicate Items in a Data Column</title>
	<atom:link href="http://blog.ouseful.info/2012/11/14/finding-nearly-duplicate-items-in-a-data-column/feed/?withoutcomments=1" rel="self" type="application/rss+xml" />
	<link>http://blog.ouseful.info</link>
	<description>Trying to find useful things to do with emerging technologies in open education</description>
	<lastBuildDate>Sat, 18 May 2013 22:36:58 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
<cloud domain='blog.ouseful.info' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' />
<image>
		<url>http://s2.wp.com/i/buttonw-com.png</url>
		<title>OUseful.Info, the blog... &#187; Finding (Nearly) Duplicate Items in a Data Column</title>
		<link>http://blog.ouseful.info</link>
	</image>
	<atom:link rel="search" type="application/opensearchdescription+xml" href="http://blog.ouseful.info/osd.xml" title="OUseful.Info, the blog..." />
	<atom:link rel='hub' href='http://blog.ouseful.info/?pushpress=hub'/>
		<item>
		<title>Finding (Nearly) Duplicate Items in a Data Column</title>
		<link>http://blog.ouseful.info/2012/11/14/finding-nearly-duplicate-items-in-a-data-column/</link>
		<comments>http://blog.ouseful.info/2012/11/14/finding-nearly-duplicate-items-in-a-data-column/#comments</comments>
		<pubDate>Wed, 14 Nov 2012 15:09:51 +0000</pubDate>
		<dc:creator>Tony Hirst</dc:creator>
				<category><![CDATA[onlinejournalismblog]]></category>
		<category><![CDATA[Tinkering]]></category>
		<category><![CDATA[Google Refine]]></category>
		<category><![CDATA[openrefine]]></category>

		<guid isPermaLink="false">http://blog.ouseful.info/?p=8897</guid>
		<description><![CDATA[[WARNING - THIS IS A *BAD ADVICE* POST - it describes a trick that sort of works, but the example is contrived and has a better solution - text facet and then cluster on facet (h/t to @mhawksey's Mining and OpenRefine(ing) JISCMail: A look at OER-DISCUSS [Listserv] for making me squirm so much through that [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>[WARNING - THIS IS A *BAD ADVICE* POST - it describes a trick that sort of works, but the example is contrived and has a better solution - text facet and then cluster on facet (h/t to @mhawksey's <a href="http://mashe.hawksey.info/2012/11/mining-and-openrefineing-jiscmail-a-look-at-oer-discuss/">Mining and OpenRefine(ing) JISCMail: A look at OER-DISCUSS [Listserv]</a> for making me squirm so much through that oversight that I felt the need to post this warning&#8230;)]</p>
<p>Suppose you have a dataset containing a list of Twitter updates, and you are looking for tweets that are retweets or modified retweets of the same original tweet. The OpenRefine <em>Duplicate</em> custom facet will identify different row items in that column that are exact duplicates of each other, but what about when they just don&#8217;t quite match: for example, an original tweet and its appearance in an RT (where the retweet string contains RT and the name of the original sender), or an MT, where the tweet may have been shortened, or an RT of an RT, or an RT with an additional hashtag? Here&#8217;s one strategy for finding similar-ish tweets, such as popular retweets, in the dataset using custom text facets in OpenRefine.</p>
<p>The <tt>ngram</tt> GREL function generates a list of word ngrams of a specified length from a string. If you imagine a sliding window N words long, the first ngram will be the first N words in the string, the second ngram will run from the second word to the (N+1)th word, and so on:</p>
<p><a href="http://ouseful.files.wordpress.com/2012/11/openrefine-ngram.png"><img src="http://ouseful.files.wordpress.com/2012/11/openrefine-ngram.png?w=700" alt="" title="openrefine ngram"   class="alignnone size-full wp-image-8904" /></a></p>
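<p>To make the sliding window idea concrete, here is a minimal Python sketch of word ngram generation. It is not OpenRefine&#8217;s own code: the function name <tt>word_ngrams</tt> and the example tweet are made up for illustration, and the sketch simply returns an empty list when the string has fewer than N words, which may not match what OpenRefine does in that case.</p>

```python
def word_ngrams(text, n):
    """Return every run of n consecutive words in a whitespace-split string."""
    words = text.split()
    # max(..., 0) keeps the range empty when there are fewer than n words
    window_count = max(len(words) - n + 1, 0)
    return [" ".join(words[i:i + n]) for i in range(window_count)]


# Hypothetical tweet text, used only to show the windows
tweet = "RT @example: open data is useful"
trigrams = word_ngrams(tweet, 3)
for gram in trigrams:
    print(gram)
```

<p>A six word string gives four three-word windows, each one starting a word later than the one before it.</p>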
<p>If we can reasonably expect word sequences of length N to appear in our &#8220;duplicate-ish&#8221; strings, we can generate a facet on ngrams of that length.</p>
<p>It may also be worth experimenting with combining the ngram function with the <tt>fingerprint</tt> GREL function. The fingerprint function identifies unique words in a string, reduces them to lower case, and then sorts them in alphabetical order:</p>
<p><a href="http://ouseful.files.wordpress.com/2012/11/openrefine-fingerprint.png"><img src="http://ouseful.files.wordpress.com/2012/11/openrefine-fingerprint.png?w=700" alt="" title="openrefine fingerprint"   class="alignnone size-full wp-image-8903" /></a></p>
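<p>As a rough illustration of that behaviour, the following Python sketch lower-cases a string, drops ASCII punctuation, and returns the unique words in alphabetical order. It is a simplified approximation and an assumption on my part rather than OpenRefine&#8217;s implementation, which may normalise characters in ways this sketch ignores; the function name and example string are hypothetical.</p>

```python
import string


def fingerprint(text):
    """Approximate a fingerprint key: lower-case, no punctuation, unique sorted words."""
    # Remove ASCII punctuation so that "@example:" and "example" compare equal
    table = str.maketrans("", "", string.punctuation)
    cleaned = text.strip().lower().translate(table)
    # set() removes repeated words; sorted() puts them in alphabetical order
    return " ".join(sorted(set(cleaned.split())))


key = fingerprint("RT @example: Open data is open")
print(key)
```

<p>The repeated word collapses to a single entry, and the original word order is lost.</p>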
<p>If we generate the fingerprint of a string, and then run the ngram function, we generate ngrams around the alphabetically ordered fingerprint terms:</p>
<p><a href="http://ouseful.files.wordpress.com/2012/11/openrefine-ngram-fingerprint.png"><img src="http://ouseful.files.wordpress.com/2012/11/openrefine-ngram-fingerprint.png?w=700" alt="" title="OpenRefine ngram fingerprint"   class="alignnone size-full wp-image-8902" /></a></p>
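<p>The sketch below chains the two steps in Python to show what ngrams over a fingerprint look like for an original tweet and a retweet of it. Both helper functions and both tweets are hypothetical stand-ins, not OpenRefine code. It also shows a limitation worth bearing in mind: words added by the retweet can land between the original words once everything is sorted, so only some of the ngrams survive.</p>

```python
import string


def fingerprint(text):
    """Lower-case, strip ASCII punctuation, and return unique words in sorted order."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(sorted(set(text.strip().lower().translate(table).split())))


def word_ngrams(text, n):
    """Return every run of n consecutive words in a whitespace-split string."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 0))]


# Hypothetical tweets: an original and a retweet with an extra hashtag
original = "Open data is useful for journalists"
retweet = "RT @example: Open data is useful for journalists #opendata"

original_grams = set(word_ngrams(fingerprint(original), 2))
retweet_grams = set(word_ngrams(fingerprint(retweet), 2))

# Bigrams that appear in both fingerprints
shared = sorted(original_grams.intersection(retweet_grams))
print(shared)
```

<p>Here three of the original&#8217;s five bigrams are shared with the retweet; the other two are broken up by <tt>example</tt> and <tt>opendata</tt> sorting in between the original words.</p>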
<p>For a sizeable dataset, and/or long strings, it&#8217;s likely that we&#8217;ll get a lot of ngram facet terms:</p>
<p><a href="http://ouseful.files.wordpress.com/2012/11/openrefine-too-many-choices.png"><img src="http://ouseful.files.wordpress.com/2012/11/openrefine-too-many-choices.png?w=700" alt="" title="openrefine too many choices"   class="alignnone size-full wp-image-8901" /></a></p>
<p>We could list them all by setting an appropriate <em>choice count limit</em>, or we can limit the facet items that are displayed: facet by choice counts, set the range slider to show only those facet values that appear in a large number of rows, for example, and then order the facet items by count:</p>
<p><a href="http://ouseful.files.wordpress.com/2012/11/openrefine-facetcount.png"><img src="http://ouseful.files.wordpress.com/2012/11/openrefine-facetcount.png?w=700" alt="" title="openrefine facetcount"   class="alignnone size-full wp-image-8900" /></a></p>
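<p>Outside OpenRefine, the same filtering idea can be sketched in a few lines of Python: count how many rows each ngram appears in, keep only the ngrams that reach a minimum row count, and list them in descending order of count. The function names, the threshold, and the sample tweets are all assumptions made for the example.</p>

```python
from collections import Counter


def word_ngrams(text, n):
    """Return every run of n consecutive words in a whitespace-split string."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 0))]


def common_ngrams(rows, n, min_rows):
    """List (ngram, row_count) pairs for ngrams found in at least min_rows rows."""
    counts = Counter()
    for row in rows:
        # set() counts an ngram once per row, however often it repeats in that row
        counts.update(set(word_ngrams(row.lower(), n)))
    # most_common() orders the pairs by descending count
    return [(gram, count) for gram, count in counts.most_common() if count >= min_rows]


# Hypothetical rows standing in for a column of tweets
tweets = [
    "Open data is useful",
    "RT @example: Open data is useful",
    "MT @example: open data is useful #opendata",
    "Something else entirely here",
]
popular = common_ngrams(tweets, 3, 3)
print(popular)
```

<p>With a threshold of three rows, only the two trigrams common to the original tweet, the RT, and the MT are kept.</p>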
<p>Even if tweets aren&#8217;t identical, if they contain common ngrams we can pull them out.</p>
<p>Note that we might then order the tweets as displayed in the table using time/date order (a note on string to time format conversions can be found in this post: <a href="http://blog.ouseful.info/2012/11/06/chit-chat-with-new-datasets-facets-in-open-was-google-refine/">Facets in OpenRefine</a>):</p>
<p><a href="http://ouseful.files.wordpress.com/2012/11/openrefine-datesort.png"><img src="http://ouseful.files.wordpress.com/2012/11/openrefine-datesort.png?w=700" alt="" title="openrefine datesort"   class="alignnone size-full wp-image-8899" /></a></p>
<p>Alternatively, we might choose to view them using a time facet:</p>
<p><a href="http://ouseful.files.wordpress.com/2012/11/openrefine-timefacet.png"><img src="http://ouseful.files.wordpress.com/2012/11/openrefine-timefacet.png?w=700" alt="" title="openrefine timefacet"   class="alignnone size-full wp-image-8898" /></a></p>
<p>Note that when you set a range in the time facet, you can then click on it and slide it as a range, essentially providing a sliding time window control for viewing records that appear over a given time range/duration.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.ouseful.info/2012/11/14/finding-nearly-duplicate-items-in-a-data-column/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/abbd9f90565ce9ae4d065d93a81d8c03?s=96&#38;d=http%3A%2F%2F1.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D96" medium="image">
			<media:title type="html">Tony Hirst</media:title>
		</media:content>

		<media:content url="http://ouseful.files.wordpress.com/2012/11/openrefine-ngram.png" medium="image">
			<media:title type="html">openrefine ngram</media:title>
		</media:content>

		<media:content url="http://ouseful.files.wordpress.com/2012/11/openrefine-fingerprint.png" medium="image">
			<media:title type="html">openrefine fingerprint</media:title>
		</media:content>

		<media:content url="http://ouseful.files.wordpress.com/2012/11/openrefine-ngram-fingerprint.png" medium="image">
			<media:title type="html">OpenRefine ngram fingerprint</media:title>
		</media:content>

		<media:content url="http://ouseful.files.wordpress.com/2012/11/openrefine-too-many-choices.png" medium="image">
			<media:title type="html">openrefine too many choices</media:title>
		</media:content>

		<media:content url="http://ouseful.files.wordpress.com/2012/11/openrefine-facetcount.png" medium="image">
			<media:title type="html">openrefine facetcount</media:title>
		</media:content>

		<media:content url="http://ouseful.files.wordpress.com/2012/11/openrefine-datesort.png" medium="image">
			<media:title type="html">openrefine datesort</media:title>
		</media:content>

		<media:content url="http://ouseful.files.wordpress.com/2012/11/openrefine-timefacet.png" medium="image">
			<media:title type="html">openrefine timefacet</media:title>
		</media:content>
	</item>
	</channel>
</rss>
