<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>OUseful.Info, the blog... &#187; Fragments: Glueing Different Data Sources Together With Google Refine</title>
	<atom:link href="http://blog.ouseful.info/2011/05/04/fragments-gluing-different-data-sources-together-with-google-refine/feed/?withoutcomments=1" rel="self" type="application/rss+xml" />
	<link>http://blog.ouseful.info</link>
	<description>Trying to find useful things to do with emerging technologies in open education</description>
	<lastBuildDate>Wed, 22 May 2013 00:41:14 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
<cloud domain='blog.ouseful.info' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' />
<image>
		<url>http://s2.wp.com/i/buttonw-com.png</url>
		<title>OUseful.Info, the blog... &#187; Fragments: Glueing Different Data Sources Together With Google Refine</title>
		<link>http://blog.ouseful.info</link>
	</image>
	<atom:link rel="search" type="application/opensearchdescription+xml" href="http://blog.ouseful.info/osd.xml" title="OUseful.Info, the blog..." />
	<atom:link rel='hub' href='http://blog.ouseful.info/?pushpress=hub'/>
		<item>
		<title>Fragments: Glueing Different Data Sources Together With Google Refine</title>
		<link>http://blog.ouseful.info/2011/05/04/fragments-gluing-different-data-sources-together-with-google-refine/</link>
		<comments>http://blog.ouseful.info/2011/05/04/fragments-gluing-different-data-sources-together-with-google-refine/#comments</comments>
		<pubDate>Wed, 04 May 2011 12:13:31 +0000</pubDate>
		<dc:creator>Tony Hirst</dc:creator>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[Infoskills]]></category>
		<category><![CDATA[onlinejournalismblog]]></category>
		<category><![CDATA[Pipework]]></category>
		<category><![CDATA[Tinkering]]></category>
		<category><![CDATA[datastore explorer]]></category>
		<category><![CDATA[Google Refine]]></category>

		<guid isPermaLink="false">http://blog.ouseful.info/?p=5377</guid>
		<description><![CDATA[I&#8217;m working on a new pattern using Google Refine as the hub for a data fusion experiment pulling together data from different sources. I&#8217;m not sure how it&#8217;ll play out in the end, but here are some fragments&#8230;. Grab Data into Google Refine as CSV from a URL (Proxied Google Spreadsheet Query via Yahoo Pipes) [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>I&#8217;m working on a new pattern using Google Refine as the hub for a data fusion experiment pulling together data from different sources. I&#8217;m not sure how it&#8217;ll play out in the end, but here are some fragments&#8230;.</p>
<p><strong>Grab Data into Google Refine as CSV from a URL (Proxied Google Spreadsheet Query via Yahoo Pipes)</strong></p>
<p>Firstly, getting data into Google Refine&#8230; I had hoped to pull a subset of data from a Google Spreadsheet into Google Refine by importing the CSV returned by a query generated using my Google Spreadsheet/Guardian datastore explorer (see <a href="http://blog.ouseful.info/2009/05/18/using-google-spreadsheets-as-a-databace-with-the-google-visualisation-api-query-language/">Using Google Spreadsheets as a Database with the Google Visualisation API Query Language</a> for more on this), but it seems that Refine would rather pull in the whole of the spreadsheet (or at least, I think, the whole of the first sheet).</p>
<p>Instead, I had to create a proxy using a Yahoo Pipe (<a href="http://pipes.yahoo.com/pipes/pipe.info?_id=4562a5ec2631ce242ebd25a0756d6381">Google Spreadsheet as a database proxy pipe</a>), which runs the spreadsheet query, gets the data back as CSV, and then relays it onwards as JSON:</p>
<p><a href="http://pipes.yahoo.com/pipes/pipe.info?_id=4562a5ec2631ce242ebd25a0756d6381"><img src="http://ouseful.files.wordpress.com/2011/05/yahoo-pipe-google-spreadsheet-as-db-proxy.png?w=700&#038;h=517" alt="" title="Yahoo Pipe - Google spreadsheet as db proxy" width="700" height="517" class="alignnone size-full wp-image-5378" /></a></p>
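<p>As a rough sketch of what the proxy pipe does with the query result, the following Python turns CSV text into a JSON document. The <tt>value.items</tt> envelope is inferred from the parsing expressions used later in this post rather than taken from any Pipes documentation, and the function name and sample rows are made up for illustration:</p>

```python
import csv
import io
import json


def csv_to_pipe_json(csv_text):
    """Convert CSV text into a JSON string shaped like the pipe's output.

    The {"count": ..., "value": {"items": [...]}} envelope is an assumption.
    """
    # DictReader uses the first CSV row as the field names for each item.
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return json.dumps({"count": len(rows), "value": {"items": rows}})


# Hypothetical sample data, not taken from the Guardian spreadsheet.
sample_csv = "HESA Code,Institution\n21,Example University\n"
print(csv_to_pipe_json(sample_csv))
```

<p>Every value comes through as a string here, because CSV carries no type information; any numeric conversion would have to happen further downstream.</p>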
<p>Here&#8217;s the interface to the pipe &#8211; it requires the Google spreadsheet public key id, the sheet id, and the query&#8230; The data I&#8217;m using is a spreadsheet maintained by the Guardian datastore containing <a href="http://www.guardian.co.uk/news/datablog/2011/mar/25/higher-education-universityfunding">UK university fees data</a> (<a href="https://spreadsheets2.google.com/spreadsheet/ccc?hl=en&amp;key=tupBUgwFqBDfB4878EK2vUQ&amp;hl=en#gid=1">spreadsheet</a>).</p>
<p><a href="http://pipes.yahoo.com/pipes/pipe.info?_id=4562a5ec2631ce242ebd25a0756d6381"><img src="http://ouseful.files.wordpress.com/2011/05/yahoo-pipe-google-spreadsheet-db-proxy.png?w=700&#038;h=366" alt="" title="Yahoo pipe - google spreadsheet db proxy" width="700" height="366" class="alignnone size-full wp-image-5379" /></a></p>
<p>You can get the JSON version of the data out directly, or a proxied version of the CSV, <em>as CSV</em> via the <em>More options</em> menu&#8230;</p>
<p>Using the Yahoo Pipes CSV output URL, I <em>can</em> now get a subset of data from a Google Spreadsheet into Google Refine&#8230;</p>
<p><img src="http://ouseful.files.wordpress.com/2011/05/pipes-proxy-import-into-google-refine.png?w=700&#038;h=366" alt="" title="Pipes proxy import into Google Refine" width="700" height="366" class="alignnone size-full wp-image-5380" /></p>
<p>Here&#8217;s the result &#8211; a subset of data as defined by the query:</p>
<p><img src="http://ouseful.files.wordpress.com/2011/05/google-refine-imported-data.png?w=700&#038;h=260" alt="" title="Google Refine - imported data" width="700" height="260" class="alignnone size-full wp-image-5381" /></p>
<p>We can now augment this data with data from another source using Google Refine&#8217;s ability to <a href="http://code.google.com/p/google-refine/wiki/FetchingURLsFromWebServices">import/fetch data from a URL</a>. In particular, I&#8217;m going to use the Yahoo Pipe described above to grab data from a different spreadsheet and pass it back to Google Refine as a JSON data feed. (Google spreadsheets will publish data as JSON, but the format is a bit clunky&#8230;)</p>
<p><img src="http://ouseful.files.wordpress.com/2011/05/google-refine-generate-column-from-url.png?w=700" alt="" title="Google Refine generate column from URL"   class="alignnone size-full wp-image-5384" /></p>
<p>To prototype the query, I&#8217;m going to use my <a href="http://ouseful.open.ac.uk/datastore/gspreadsheetdb4.php?gsKey=tpxpwtyiYZwCMowl3gNaIKQ#gid=0">datastore explorer</a> with the Guardian datastore HESA returns (2010) spreadsheet URL (<em><a href="http://spreadsheets1.google.com/spreadsheet/ccc?hl&#038;key=tpxpwtyiYZwCMowl3gNaIKQ#gid=0" rel="nofollow">http://spreadsheets1.google.com/spreadsheet/ccc?hl&#038;key=tpxpwtyiYZwCMowl3gNaIKQ#gid=0</a></em>), which also has a column containing HESA numbers. (Ultimately, I&#8217;m going to generate a URL that treats the Guardian datastore spreadsheet as a database and returns the data from the row with a particular value in the HESA code column. By using the HESA number column in Google Refine to provide the key, I can generate a URL for each institution that grabs its HESA data from the Datastore HESA spreadsheet.)</p>
<p><a href="http://ouseful.open.ac.uk/datastore/gspreadsheetdb4.php?gsKey=tpxpwtyiYZwCMowl3gNaIKQ#gid=0"><img src="http://ouseful.files.wordpress.com/2011/05/datstore-explorer-set-up.png?w=700&#038;h=159" alt="" title="Datstore explorer - set up" width="700" height="159" class="alignnone size-full wp-image-5391" /></a></p>
<p>Hit &#8220;Preview Table Headings&#8221;, then scroll down to try out a query:</p>
<p><a href="http://ouseful.open.ac.uk/datastore/gspreadsheetdb4.php?gsKey=tpxpwtyiYZwCMowl3gNaIKQ#gid=0"><img src="http://ouseful.files.wordpress.com/2011/05/guardian-datastore-building-up-a-query.png?w=700&#038;h=464" alt="" title="Guardian Datastore - building up a query" width="700" height="464" class="alignnone size-full wp-image-5392" /></a></p>
<p>Having tested my query, I can now try the parameters out in the Yahoo pipe. (For example, my query is <em>select D,E,H where D=21</em> and the key is <em>tpxpwtyiYZwCMowl3gNaIKQ</em>; this grabs data from columns <em>D</em>, <em>E</em> and <em>H</em> where the value of <em>D</em> (HESA Code) is <em>21</em>.) Grab the JSON output URL from the pipe, and use it as the basis for the URL template in Google Refine. Here&#8217;s the JSON output URL I obtained:</p>
<p><em><a href="http://pipes.yahoo.com/pipes/pipe.run?_id=4562a5ec2631ce242ebd25a0756d6381" rel="nofollow">http://pipes.yahoo.com/pipes/pipe.run?_id=4562a5ec2631ce242ebd25a0756d6381</a><br />
&amp;_render=json&amp;key=tpxpwtyiYZwCMowl3gNaIKQ<br />
&amp;q=select+D%2CE%2CH+where+D%3D21</em></p>
<p>Remember, the HESA code I experimented with was <em>21</em>, so this is what we want to replace in the URL with the value from the HESA code column in Google Refine&#8230;</p>
<p>Here&#8217;s how we create the URLs built around/keyed by an appropriate HESA code&#8230;</p>
<p><img src="http://ouseful.files.wordpress.com/2011/05/google-refine-add-column-from-url.png?w=700" alt="" title="Google Refine - Add column from URL"   class="alignnone size-full wp-image-5385" /></p>
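<p>For anyone who wants to check the URL pattern outside Refine, here&#8217;s a minimal Python sketch that rebuilds the pipe&#8217;s JSON output URL for an arbitrary HESA code. The pipe id, spreadsheet key and query come from the URL above; the function and constant names are just illustrative. Inside Refine itself, the equivalent is an expression that appends the cell value to the fixed part of the URL.</p>

```python
from urllib.parse import urlencode

PIPE_RUN_URL = "http://pipes.yahoo.com/pipes/pipe.run"
PIPE_ID = "4562a5ec2631ce242ebd25a0756d6381"
SPREADSHEET_KEY = "tpxpwtyiYZwCMowl3gNaIKQ"


def hesa_query_url(hesa_code):
    """Build the pipe's JSON output URL for one HESA code."""
    params = [
        ("_id", PIPE_ID),
        ("_render", "json"),
        ("key", SPREADSHEET_KEY),
        ("q", "select D,E,H where D={}".format(hesa_code)),
    ]
    # urlencode escapes spaces, commas and the equals sign in the query.
    return PIPE_RUN_URL + "?" + urlencode(params)


print(hesa_query_url(21))
```

<p>Calling the function with <em>21</em> should reproduce the URL I got from the pipe, which makes for a handy sanity check before generating one URL per row.</p>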
<p>Google Refine does its thing and fetches the data&#8230;</p>
<p><img src="http://ouseful.files.wordpress.com/2011/05/google-refine-augmented-data.png?w=700&#038;h=370" alt="" title="Google Refine - Augmented data" width="700" height="370" class="alignnone size-full wp-image-5386" /></p>
<p>Now we process the JSON response to generate some meaningful data columns (for more on how to do this, see <a href="http://blog.ouseful.info/2011/04/12/tech-tips-making-sense-of-json-strings-follow-the-structure/">Tech Tips: Making Sense of JSON Strings – Follow the Structure</a>).</p>
<p>First, say that we want to create a new column based on the imported JSON data:</p>
<p><img src="http://ouseful.files.wordpress.com/2011/05/google-refine-creating-a-derived-column.png?w=700" alt="" title="Google Refine - creating a derived column"   class="alignnone size-full wp-image-5387" /></p>
<p>Then parse the JSON to extract the data field required in the new column.</p>
<p><img src="http://ouseful.files.wordpress.com/2011/05/google-refine-parsing-json.png?w=700" alt="" title="Google Refine - parsing JSON"   class="alignnone size-full wp-image-5388" /></p>
<p>For example, from the HESA data we might extract the <em>Expenditure per student / 10</em>:</p>
<p><tt>value.parseJson().value.items[0]["Expenditure per student / 10"]</tt></p>
<p>or the <em>Average Teaching Score</em> (<tt>value.parseJson().value.items[0]["Average Teaching Score"]</tt>):</p>
<p><img src="http://ouseful.files.wordpress.com/2011/05/google-refine-json-parsing.png?w=700" alt="" title="Google Refine - JSON Parsing"   class="alignnone size-full wp-image-5389" /></p>
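<p>The same path can be followed in Python, which may help when working out the structure of an unfamiliar response. This is only a sketch: the sample response is invented (its shape is inferred from the expressions above, and the values are not real HESA figures), and <tt>extract_field</tt> is a hypothetical helper name.</p>

```python
import json

# Hypothetical response for one institution; the values are invented.
sample_response = json.dumps({
    "count": 1,
    "value": {
        "items": [
            {
                "HESA Code": "21",
                "Expenditure per student / 10": "5.5",
                "Average Teaching Score": "70",
            }
        ]
    },
})


def extract_field(cell_value, field):
    """Return one field from the first item in a fetched JSON string."""
    items = json.loads(cell_value)["value"]["items"]
    # Return None when the query matched no rows for this HESA code.
    return items[0].get(field) if items else None


print(extract_field(sample_response, "Average Teaching Score"))
```

<p>Unlike the one-line expressions above, this version checks for an empty <tt>items</tt> list, so an institution with no matching row gives an empty value rather than an error.</p>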
<p>And here&#8217;s the result:</p>
<p><a href="http://ouseful.files.wordpress.com/2011/05/google-refine-derived-data.png"><img src="http://ouseful.files.wordpress.com/2011/05/google-refine-derived-data.png?w=700&#038;h=310" alt="" title="Google Refine - derived data" width="700" height="310" class="alignnone size-full wp-image-5390" /></a></p>
<p>So to recap:</p>
<p>- we use a Yahoo Pipe to query a Google spreadsheet and get a subset of data from it;<br />
- we take the CSV output from the pipe and use it to create a new Google Refine database;<br />
- we note that the data table in Google Refine has a HESA code column; we also note that the Guardian datastore HESA spreadsheet has a HESA code column;<br />
- we realise we can treat the HESA spreadsheet as a database, and further that we can create a query (prototyped in the datastore explorer) as a URL keyed by HESA code;<br />
- we create a new column by fetching, for each row, a URL generated from its HESA code; that URL pulls JSON data from a Yahoo pipe that queries a Google spreadsheet;<br />
- we parse the JSON to give us a couple of new columns.</p>
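<p>Stripped of the plumbing, the recap above amounts to a join on the HESA code column. Here&#8217;s a minimal Python sketch of that idea, using two tiny in-memory CSV tables as stand-ins for the two spreadsheets; the table contents and the <tt>merge_on_key</tt> name are invented for illustration.</p>

```python
import csv
import io

# Hypothetical stand-ins for the two spreadsheets; all values are invented.
FEES_CSV = "HESA Code,Institution\n21,Example University\n22,Sample College\n"
HESA_CSV = "HESA Code,Average Teaching Score\n21,70\n"


def merge_on_key(left_csv, right_csv, key):
    """Augment each left-hand row with the right-hand row sharing its key."""
    right_rows = {
        row[key]: row for row in csv.DictReader(io.StringIO(right_csv))
    }
    merged = []
    for row in csv.DictReader(io.StringIO(left_csv)):
        # Rows without a match are kept unchanged.
        extra = right_rows.get(row[key], {})
        merged.append({**row, **extra})
    return merged


for record in merge_on_key(FEES_CSV, HESA_CSV, "HESA Code"):
    print(record)
```

<p>Keeping unmatched rows mirrors the way the Refine table retains every row, whether or not the lookup returns anything for it.</p>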
<p>And there we have it &#8211; a clunky, but workable, route for merging data from two different Google spreadsheets using Google Refine.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.ouseful.info/2011/05/04/fragments-gluing-different-data-sources-together-with-google-refine/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/abbd9f90565ce9ae4d065d93a81d8c03?s=96&#38;d=http%3A%2F%2F1.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D96" medium="image">
			<media:title type="html">Tony Hirst</media:title>
		</media:content>

		<media:content url="http://ouseful.files.wordpress.com/2011/05/yahoo-pipe-google-spreadsheet-as-db-proxy.png" medium="image">
			<media:title type="html">Yahoo Pipe - Google spreadsheet as db proxy</media:title>
		</media:content>

		<media:content url="http://ouseful.files.wordpress.com/2011/05/yahoo-pipe-google-spreadsheet-db-proxy.png" medium="image">
			<media:title type="html">Yahoo pipe - google spreadsheet db proxy</media:title>
		</media:content>

		<media:content url="http://ouseful.files.wordpress.com/2011/05/pipes-proxy-import-into-google-refine.png" medium="image">
			<media:title type="html">Pipes proxy import into Google Refine</media:title>
		</media:content>

		<media:content url="http://ouseful.files.wordpress.com/2011/05/google-refine-imported-data.png" medium="image">
			<media:title type="html">Google Refine - imported data</media:title>
		</media:content>

		<media:content url="http://ouseful.files.wordpress.com/2011/05/google-refine-generate-column-from-url.png" medium="image">
			<media:title type="html">Google Refine generate column from URL</media:title>
		</media:content>

		<media:content url="http://ouseful.files.wordpress.com/2011/05/datstore-explorer-set-up.png" medium="image">
			<media:title type="html">Datstore explorer - set up</media:title>
		</media:content>

		<media:content url="http://ouseful.files.wordpress.com/2011/05/guardian-datastore-building-up-a-query.png" medium="image">
			<media:title type="html">Guardian Datastore - building up a query</media:title>
		</media:content>

		<media:content url="http://ouseful.files.wordpress.com/2011/05/google-refine-add-column-from-url.png" medium="image">
			<media:title type="html">Google Refine - Add column from URL</media:title>
		</media:content>

		<media:content url="http://ouseful.files.wordpress.com/2011/05/google-refine-augmented-data.png" medium="image">
			<media:title type="html">Google Refine - Augmented data</media:title>
		</media:content>

		<media:content url="http://ouseful.files.wordpress.com/2011/05/google-refine-creating-a-derived-column.png" medium="image">
			<media:title type="html">Google Refine - creating a derived column</media:title>
		</media:content>

		<media:content url="http://ouseful.files.wordpress.com/2011/05/google-refine-parsing-json.png" medium="image">
			<media:title type="html">Google Refine - parsing JSON</media:title>
		</media:content>

		<media:content url="http://ouseful.files.wordpress.com/2011/05/google-refine-json-parsing.png" medium="image">
			<media:title type="html">Google Refine - JSON Parsing</media:title>
		</media:content>

		<media:content url="http://ouseful.files.wordpress.com/2011/05/google-refine-derived-data.png" medium="image">
			<media:title type="html">Google Refine - derived data</media:title>
		</media:content>
	</item>
	</channel>
</rss>
