<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>OUseful.Info, the blog... &#187; Chit Chat with New Datasets &#8211; Facets in OpenRefine (Was /Google Refine/)</title>
	<atom:link href="http://blog.ouseful.info/2012/11/06/chit-chat-with-new-datasets-facets-in-open-was-google-refine/feed/?withoutcomments=1" rel="self" type="application/rss+xml" />
	<link>http://blog.ouseful.info</link>
	<description>Trying to find useful things to do with emerging technologies in open education</description>
	<lastBuildDate>Tue, 21 May 2013 06:00:57 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
<cloud domain='blog.ouseful.info' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' />
<image>
		<url>http://s2.wp.com/i/buttonw-com.png</url>
		<title>OUseful.Info, the blog... &#187; Chit Chat with New Datasets &#8211; Facets in OpenRefine (Was /Google Refine/)</title>
		<link>http://blog.ouseful.info</link>
	</image>
	<atom:link rel="search" type="application/opensearchdescription+xml" href="http://blog.ouseful.info/osd.xml" title="OUseful.Info, the blog..." />
	<atom:link rel='hub' href='http://blog.ouseful.info/?pushpress=hub'/>
		<item>
		<title>Chit Chat with New Datasets &#8211; Facets in OpenRefine (Was /Google Refine/)</title>
		<link>http://blog.ouseful.info/2012/11/06/chit-chat-with-new-datasets-facets-in-open-was-google-refine/</link>
		<comments>http://blog.ouseful.info/2012/11/06/chit-chat-with-new-datasets-facets-in-open-was-google-refine/#comments</comments>
		<pubDate>Tue, 06 Nov 2012 10:39:50 +0000</pubDate>
		<dc:creator>Tony Hirst</dc:creator>
				<category><![CDATA[onlinejournalismblog]]></category>
		<category><![CDATA[OpenRefine]]></category>
		<category><![CDATA[Tinkering]]></category>
		<category><![CDATA[open refine]]></category>
		<category><![CDATA[openrefine]]></category>

		<guid isPermaLink="false">http://blog.ouseful.info/?p=8798</guid>
		<description><![CDATA[One of the many ways of using OpenRefine is as a toolkit for getting a feel for the range of variation contained within a dataset using the various faceting options. In the sense of analysis being a conversation with data, this is a bit like an idle chit-chat/getting to know you phase, as a [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>One of the many ways of using <s>Google</s> OpenRefine is as a toolkit for getting a feel for the range of variation contained within a dataset using the various <em>faceting</em> options. In the sense of analysis being a conversation with data, this is a bit like an idle chit-chat/getting to know you phase, as a precursor to a full blown conversation.</p>
<p><em>Faceted search</em> or <em>faceted browsing/navigation</em> typically applies a set of filters to a set of search results, restricting the displayed results to those that fulfil certain conditions. In a library catalogue, the facets might refer to metadata fields such as publication date or publisher, allowing a user to search within a given date range or to limit results to a particular publisher:</p>
<p><a href="http://ouseful.files.wordpress.com/2012/11/faceted-search-ou-library.png"><img src="http://ouseful.files.wordpress.com/2012/11/faceted-search-ou-library.png?w=700&#038;h=443" alt="" title="faceted search OU library" width="700" height="443" class="alignnone size-full wp-image-8800" /></a></p>
<p>Where the facet relates to a <em>categorical</em> variable &#8211; that is, where there is a set of unique values that the facet can take (such as the names of different publishers) &#8211; a view of the facet values will show the names of the different publishers extracted from the original search results. Selecting a particular publisher, for example, will then limit the displayed results to just those results associated with that publisher. For <em>numerical</em> facets, where the quantities associated with the facet relate to a number or date (that is, a set of things that have a numerical <em>range</em>), the facet view will show the full range of values contained within that particular facet. The user can then select a subset of results that fall within a specified part of that range.</p>
<p>In the case of OpenRefine, facets can be defined on a per-column basis. For categorical facets, Refine will identify the set of unique values contained within a column, along with a count of how many times each facet value occurs throughout the column. The user can then choose to view only those rows with a particular (facet selected) value in the faceted column. For columns that contain numbers, Refine will generate a numerical facet that spans the range of values contained within the column, along with a histogram that provides a count of occurrences of numbers within small ranges across the full range.</p>
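<p>To make the idea concrete, here is a minimal Python sketch of what a categorical facet and a numerical facet compute. It is not how Refine itself is implemented; the function names, the equal-width binning and the sample rows (with their <tt>publisher</tt> and <tt>price</tt> columns) are all assumptions made up for illustration.</p>

```python
from collections import Counter


def text_facet(rows, column):
    """Count how often each distinct value occurs in one column."""
    return Counter(row[column] for row in rows)


def numeric_facet(rows, column, bins=5):
    """Return (low, high, counts): the column's range plus an equal-width histogram."""
    values = [row[column] for row in rows]
    low, high = min(values), max(values)
    width = (high - low) / bins or 1  # fall back to 1 when every value is the same
    counts = [0] * bins
    for v in values:
        index = min(int((v - low) / width), bins - 1)  # the maximum goes in the last bin
        counts[index] += 1
    return low, high, counts


# Hypothetical sample data, invented for this sketch.
rows = [
    {"publisher": "Acme", "price": 10},
    {"publisher": "Acme", "price": 12},
    {"publisher": "Birch", "price": 30},
    {"publisher": "Cedar", "price": 50},
]

publisher_counts = text_facet(rows, "publisher")
# Selecting one facet value keeps only the rows that carry it.
selected = [row for row in rows if row["publisher"] == "Acme"]
price_range = numeric_facet(rows, "price", bins=4)
```

<p>The counter plays the part of the facet panel (value plus occurrence count), and the list comprehension plays the part of clicking on one value in that panel.</p>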
<p>So what faceting options does OpenRefine provide?</p>
<p><a href="http://ouseful.files.wordpress.com/2012/11/refine-facets.png"><img src="http://ouseful.files.wordpress.com/2012/11/refine-facets.png?w=700" alt="" title="refine facets"   class="alignnone size-full wp-image-8802" /></a></p>
<p>Here&#8217;s how they work (data used for the examples comes from <a href="http://analyticsmadeskeezy.com/2012/10/11/graphing-clustering-community-detection-wholesale-drug-deals-using-excel-and-gephi/">Even Wholesale Drug Dealers Can Use a Little Retargeting: Graphing, Clustering &amp; Community Detection in Excel and Gephi</a> and <a href="http://blog.ouseful.info/2012/10/02/grabbing-twitter-search-results-into-google-refine-and-exporting-conversations-into-gephi/">JSON import from the Twitter search API</a>&#8230;):</p>
<p>- exploring the set of categories described within a column using the <em>text</em> facet:<br />
<a href="http://ouseful.files.wordpress.com/2012/11/refine-text-facet.png"><img src="http://ouseful.files.wordpress.com/2012/11/refine-text-facet.png?w=700&#038;h=334" alt="" title="refine text facet" width="700" height="334" class="alignnone size-full wp-image-8803" /></a></p>
<p>Faceted views also allow you to view the facet values by occurrence count, so it&#8217;s easy to see which the most popular facet values are:<br />
<a href="http://ouseful.files.wordpress.com/2012/11/refine-facet-sort-by-count.png"><img src="http://ouseful.files.wordpress.com/2012/11/refine-facet-sort-by-count.png?w=700" alt="" title="refine facet sort by count"   class="alignnone size-full wp-image-8841" /></a></p>
<p>You can also get a tab-separated list of facet values:<br />
<a href="http://ouseful.files.wordpress.com/2012/11/refine-facet-values-tab-separated.png"><img src="http://ouseful.files.wordpress.com/2012/11/refine-facet-values-tab-separated.png?w=700&#038;h=318" alt="" title="refine facet values tab separated" width="700" height="318" class="alignnone size-full wp-image-8842" /></a></p>
<p>Sometimes it can be useful to view rows associated with facet values that occur a particular number of times, particularly at the limits (for example, very popular facet values, or uniquely occurring facet values):</p>
<p><a href="http://ouseful.files.wordpress.com/2012/11/refine-facet-count.png"><img src="http://ouseful.files.wordpress.com/2012/11/refine-facet-count.png?w=700" alt="" title="refine facet count"   class="alignnone size-full wp-image-8843" /></a></p>
<p>- looking at the range of numerical values contained in a column using the <em>numeric</em> facet:<br />
<a href="http://ouseful.files.wordpress.com/2012/11/refine-numeric-facet.png"><img src="http://ouseful.files.wordpress.com/2012/11/refine-numeric-facet.png?w=700&#038;h=274" alt="" title="refine numeric facet" width="700" height="274" class="alignnone size-full wp-image-8804" /></a></p>
<p>- looking at the distribution over time of column contents using the <em>timeline</em> facet:<br />
<a href="http://ouseful.files.wordpress.com/2012/11/refine-date-facet.png"><img src="http://ouseful.files.wordpress.com/2012/11/refine-date-facet.png?w=700&#038;h=365" alt="" title="refine date facet" width="700" height="365" class="alignnone size-full wp-image-8805" /></a><br />
Faceting by time requires time-related strings to be parsed as such; sometimes, Refine needs a little bit of help in interpreting an imported string as a time string. So for example, given a &#8220;time&#8221; string such as <em>Mon, 29 Oct 2012 10:56:52 +0000</em> from the Twitter search API, we can use the GREL function <tt>toDate(value,"EEE, dd MMM y H:m:s")</tt> to create a new column with time-cast elements.<br />
<a href="http://ouseful.files.wordpress.com/2012/11/refine-datetime-conversion.png"><img src="http://ouseful.files.wordpress.com/2012/11/refine-datetime-conversion.png?w=700" alt="" title="refine datetime conversion"   class="alignnone size-full wp-image-8806" /></a><br />
(See <a href="http://code.google.com/p/google-refine/wiki/GRELDateFunctions">GRELDateFunctions</a> and the Java <a href="http://docs.oracle.com/javase/1.4.2/docs/api/java/text/SimpleDateFormat.html">SimpleDateFormat class documentation</a> for more details.)</p>
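<p>For comparison, the same parse-then-bin idea can be sketched outside Refine in Python. The <tt>strptime</tt> pattern below is my Python counterpart of the GREL pattern, with <tt>%z</tt> added so the offset is parsed too; it assumes an English locale for the day and month names. The hourly binning, the function names and the second and third timestamps are made up for the example.</p>

```python
from collections import Counter
from datetime import datetime, timezone

# Assumed Python equivalent of the date pattern, plus %z for the UTC offset.
TWITTER_FORMAT = "%a, %d %b %Y %H:%M:%S %z"


def parse_time(text):
    """Turn a Twitter search API time string into a timezone-aware datetime."""
    return datetime.strptime(text, TWITTER_FORMAT)


def timeline_counts(times):
    """Count items per hour, roughly what a timeline histogram shows."""
    return Counter(t.replace(minute=0, second=0, microsecond=0) for t in times)


stamps = [
    "Mon, 29 Oct 2012 10:56:52 +0000",  # the example string from the post
    "Mon, 29 Oct 2012 10:12:01 +0000",  # invented
    "Mon, 29 Oct 2012 11:03:40 +0000",  # invented
]
parsed = [parse_time(s) for s in stamps]
per_hour = timeline_counts(parsed)
```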
<p>- getting a feel for the correlation of values across numerical columns, and exploring those correlations further, using the <em>scatterplot</em> facet.<br />
<a href="http://ouseful.files.wordpress.com/2012/11/refine-scatterplot0.png"><img src="http://ouseful.files.wordpress.com/2012/11/refine-scatterplot0.png?w=700" alt="" title="refine scatterplot0"   class="alignnone size-full wp-image-8807" /></a><br />
This produces a view containing a set of scatterplots, one for each pairwise combination of the numerical columns in the dataset:<br />
<a href="http://ouseful.files.wordpress.com/2012/11/refine-scatterplot.png"><img src="http://ouseful.files.wordpress.com/2012/11/refine-scatterplot.png?w=700" alt="" title="refine scatterplot"   class="alignnone size-full wp-image-8808" /></a><br />
Clicking on one of these panels allows you to filter points within a particular area of the corresponding scatter chart (click and drag a rectangular area over the points you want to view), effectively allowing you to filter the data across related ranges of two numerical columns at the same time:<br />
<a href="http://ouseful.files.wordpress.com/2012/11/refine-scatterplot-range.png"><img src="http://ouseful.files.wordpress.com/2012/11/refine-scatterplot-range.png?w=700" alt="" title="refine scatterplot range"   class="alignnone size-full wp-image-8809" /></a></p>
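<p>The rectangular selection amounts to a pair of range filters applied at once. A minimal Python sketch, assuming inclusive edges and using made-up <tt>price</tt> and <tt>quantity</tt> columns (the function name <tt>brush</tt> is also mine, not Refine's):</p>

```python
def brush(rows, x_col, y_col, x_range, y_range):
    """Keep rows whose (x, y) point lies inside the dragged rectangle, edges included."""
    (x_lo, x_hi), (y_lo, y_hi) = x_range, y_range
    return [
        row for row in rows
        if x_lo <= row[x_col] <= x_hi and y_lo <= row[y_col] <= y_hi
    ]


# Hypothetical sample data, invented for this sketch.
rows = [
    {"price": 10, "quantity": 5},
    {"price": 12, "quantity": 40},
    {"price": 30, "quantity": 8},
]
cheap_and_small = brush(rows, "price", "quantity", (0, 15), (0, 10))
```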
<p>A range of customisable faceting options are also provided that allow you to define your own faceting functions:</p>
<ul>
<li>the <em>Custom text&#8230;</em> facet;</li>
<li>the <em>Custom Numeric&#8230;</em> facet</li>
</ul>
<p>More conveniently, a range of predefined <em>Customized facets</em> offer shortcuts to &#8220;bespoke&#8221; faceting functions:</p>
<p><a href="http://ouseful.files.wordpress.com/2012/11/refine-custom-facets.png"><img src="http://ouseful.files.wordpress.com/2012/11/refine-custom-facets.png?w=700" alt="" title="refine custom facets"   class="alignnone size-full wp-image-8810" /></a></p>
<p>So for example:</p>
<ul>
<li>the <em>word facet</em> splits strings contained in cells into single words, counts their occurrences throughout the column, and then lists unique words and their occurrence count in the facet panel. This faceting option thus provides a way of selecting rows where the contents of a particular column contain one or more specified words. (The user defined GREL custom text facet <tt>ngram(value,1)</tt> provides a similar (though not identical) result &#8211; duplicated words in a cell are identified as unique by the single word ngram function; see also <tt>split(value," ")</tt>, which does seem to replicate the behaviour of the word facet function.)</li>
<li>the <em>duplicates facet</em> returns boolean values of <tt>true</tt> and <tt>false</tt>; filtering on <tt>true</tt> values returns all the rows that have duplicated values within a particular column; filtering on <tt>false</tt> displays the rows whose value in that column is unique.</li>
<li>the <em>text length facet</em> produces a facet based on the character count(?) of strings in cells within the faceted column; the custom numeric facet <tt>length(value)</tt> achieves something similar; the related measure, word count, can be achieved using the custom numeric facet <tt>length(split(value," "))</tt></li>
</ul>
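<p>The three shortcuts in the list above can be imitated in a few lines of Python. This is a rough sketch rather than a description of Refine's internals: the function names and sample values are invented, and counting a word once per cell even when it is repeated within that cell is an assumption on my part.</p>

```python
from collections import Counter


def word_facet(values):
    """Count the cells containing each word (assumption: repeats in one cell count once)."""
    counts = Counter()
    for value in values:
        counts.update(set(value.split(" ")))
    return counts


def duplicates_facet(values):
    """True for every cell whose value occurs more than once in the column."""
    counts = Counter(values)
    return [counts[value] > 1 for value in values]


def text_length_facet(values):
    """Character count per cell, like length(value)."""
    return [len(value) for value in values]


def word_count_facet(values):
    """Word count per cell, like length(split(value," "))."""
    return [len(value.split(" ")) for value in values]


# Hypothetical column contents, invented for this sketch.
column = ["open data", "open refine", "open data", "facets"]

words = word_facet(column)
is_duplicate = duplicates_facet(column)
lengths = text_length_facet(column)
word_counts = word_count_facet(column)
```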
<p>Note that facet views can be combined. Selecting multiple values within a particular facet panel provides a Boolean OR over the selected values (that is, if <em>any</em> of the selected values appear in the column, the corresponding rows will be displayed). To AND conditions, even within the same facet, create a separate facet panel for each ANDed condition.</p>
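<p>The OR-within-a-panel, AND-across-panels logic can be sketched in Python as follows. Each panel is modelled as a pair: a function that extracts the facet values from a row, and the set of values ticked in that panel. The representation, the names and the sample rows are all assumptions made for illustration.</p>

```python
def panel_matches(row, panel):
    """OR: the row passes if any of its facet values is among the ticked ones."""
    facet_values, chosen = panel
    return any(value in chosen for value in facet_values(row))


def filter_rows(rows, panels):
    """AND: the row must pass every open facet panel."""
    return [row for row in rows if all(panel_matches(row, p) for p in panels)]


def words(row):
    """A word-facet style extractor over a hypothetical 'text' column."""
    return row["text"].split(" ")


# Hypothetical sample data, invented for this sketch.
rows = [{"text": "open data"}, {"text": "open refine"}, {"text": "refine facets"}]

# One panel with two values ticked: rows containing "open" OR "facets".
either = filter_rows(rows, [(words, {"open", "facets"})])
# Two panels on the same column: rows containing "open" AND "refine".
both = filter_rows(rows, [(words, {"open"}), (words, {"refine"})])
```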
<p>PS On the <em>OpenRefine</em> (<strong>was</strong> <em>Google Refine</em>) name change, see <a href="http://googlerefine.blogspot.co.uk/2012/10/from-freebase-gridworks-to-google.html">From Freebase Gridworks to Google Refine and now OpenRefine</a>. The code repository is now on GitHub: <a href="https://github.com/OpenRefine">OpenRefine Repository</a>. I also notice that <a href="http://openrefine.org/">openrefine.org/</a> has been minted and is running a placeholder instance of WordPress. I wonder if it would be worth setting up an aggregator for community posts, a bit like R-bloggers (for example, I have an RStats category feed from this blog that I syndicate to the R-bloggers aggregator, and have just created an <em>OpenRefine</em> category that could feed an OpenRefinery aggregator channel).</p>
<p>PPS For an example of using OpenRefine to find differences between two recordsets, see Owen Stephens&#8217; <a href="http://www.meanboyfriend.com/overdue_ideas/2012/11/using-open-refine/">Using Open Refine for e-journal data</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.ouseful.info/2012/11/06/chit-chat-with-new-datasets-facets-in-open-was-google-refine/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/abbd9f90565ce9ae4d065d93a81d8c03?s=96&#38;d=http%3A%2F%2F1.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D96" medium="image">
			<media:title type="html">Tony Hirst</media:title>
		</media:content>

		<media:content url="http://ouseful.files.wordpress.com/2012/11/faceted-search-ou-library.png" medium="image">
			<media:title type="html">faceted search OU library</media:title>
		</media:content>

		<media:content url="http://ouseful.files.wordpress.com/2012/11/refine-facets.png" medium="image">
			<media:title type="html">refine facets</media:title>
		</media:content>

		<media:content url="http://ouseful.files.wordpress.com/2012/11/refine-text-facet.png" medium="image">
			<media:title type="html">refine text facet</media:title>
		</media:content>

		<media:content url="http://ouseful.files.wordpress.com/2012/11/refine-facet-sort-by-count.png" medium="image">
			<media:title type="html">refine facet sort by count</media:title>
		</media:content>

		<media:content url="http://ouseful.files.wordpress.com/2012/11/refine-facet-values-tab-separated.png" medium="image">
			<media:title type="html">refine facet values tab separated</media:title>
		</media:content>

		<media:content url="http://ouseful.files.wordpress.com/2012/11/refine-facet-count.png" medium="image">
			<media:title type="html">refine facet count</media:title>
		</media:content>

		<media:content url="http://ouseful.files.wordpress.com/2012/11/refine-numeric-facet.png" medium="image">
			<media:title type="html">refine numeric facet</media:title>
		</media:content>

		<media:content url="http://ouseful.files.wordpress.com/2012/11/refine-date-facet.png" medium="image">
			<media:title type="html">refine date facet</media:title>
		</media:content>

		<media:content url="http://ouseful.files.wordpress.com/2012/11/refine-datetime-conversion.png" medium="image">
			<media:title type="html">refine datetime conversion</media:title>
		</media:content>

		<media:content url="http://ouseful.files.wordpress.com/2012/11/refine-scatterplot0.png" medium="image">
			<media:title type="html">refine scatterplot0</media:title>
		</media:content>

		<media:content url="http://ouseful.files.wordpress.com/2012/11/refine-scatterplot.png" medium="image">
			<media:title type="html">refine scatterplot</media:title>
		</media:content>

		<media:content url="http://ouseful.files.wordpress.com/2012/11/refine-scatterplot-range.png" medium="image">
			<media:title type="html">refine scatterplot range</media:title>
		</media:content>

		<media:content url="http://ouseful.files.wordpress.com/2012/11/refine-custom-facets.png" medium="image">
			<media:title type="html">refine custom facets</media:title>
		</media:content>
	</item>
	</channel>
</rss>
