<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>OUseful.Info, the blog... &#187; Data Scraping Wikipedia with Google Spreadsheets</title>
	<atom:link href="http://blog.ouseful.info/2008/10/14/data-scraping-wikipedia-with-google-spreadsheets/feed/?withoutcomments=1" rel="self" type="application/rss+xml" />
	<link>http://blog.ouseful.info</link>
	<description>Trying to find useful things to do with emerging technologies in open education</description>
	<lastBuildDate>Fri, 24 May 2013 12:19:26 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
<cloud domain='blog.ouseful.info' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' />
<image>
		<url>http://s2.wp.com/i/buttonw-com.png</url>
		<title>OUseful.Info, the blog... &#187; Data Scraping Wikipedia with Google Spreadsheets</title>
		<link>http://blog.ouseful.info</link>
	</image>
	<atom:link rel="search" type="application/opensearchdescription+xml" href="http://blog.ouseful.info/osd.xml" title="OUseful.Info, the blog..." />
	<atom:link rel='hub' href='http://blog.ouseful.info/?pushpress=hub'/>
		<item>
		<title>Data Scraping Wikipedia with Google Spreadsheets</title>
		<link>http://blog.ouseful.info/2008/10/14/data-scraping-wikipedia-with-google-spreadsheets/</link>
		<comments>http://blog.ouseful.info/2008/10/14/data-scraping-wikipedia-with-google-spreadsheets/#comments</comments>
		<pubDate>Tue, 14 Oct 2008 22:21:09 +0000</pubDate>
		<dc:creator>Tony Hirst</dc:creator>
				<category><![CDATA[CandS_HowTo]]></category>
		<category><![CDATA[Pipework]]></category>
		<category><![CDATA[Tinkering]]></category>
		<category><![CDATA[Visualisation]]></category>

		<guid isPermaLink="false">http://ouseful.wordpress.com/?p=306</guid>
		<description><![CDATA[Prompted in part by a presentation I have to give tomorrow as an OU eLearning community session (I hope some folks turn up &#8211; the 90 minute session on Mashing Up the PLE &#8211; RSS edition is the only reason I&#8217;m going in&#8230;), and in part by Scott Leslie&#8217;s compelling programme for a similar duration [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>Prompted in part by a presentation I have to give tomorrow as an OU eLearning community session (I hope some folks turn up &#8211; the 90 minute session on <a href="http://www.slideshare.net/psychemedia/mashing-up-the-ple-rss-edition-presentation/">Mashing Up the PLE &#8211; RSS edition</a> is the <em>only</em> reason I&#8217;m going in&#8230;), and in part by Scott Leslie&#8217;s compelling programme for a similar duration <a>Mashing Up your own PLE</a> session (scene setting here: <a href="http://wcet.informz.net/admin31/content/template.asp?sid=1795&amp;ptid=55&amp;brandid=4147&amp;uid=1003126017&amp;mi=162846">Hunting the Wily &#8220;PLE&#8221;</a>), I started having a tinker with using Google spreadsheets for data table screenscraping.</p>
<p>So here&#8217;s a quick summary of (part of) what I found I could do.</p>
<p>The Google spreadsheet function <em>=importHTML(&quot;URL&quot;,&quot;table&quot;,N)</em> will scrape a table from an HTML web page into a Google spreadsheet. The URL of the target web page, and the target <em>table</em> element both need to be in double quotes. The number <em>N</em> identifies the <em>N&#8217;th</em> table in the page as the target table for data scraping (I found counting starts at 0, though Google&#8217;s documentation for the function describes the index as starting at 1, so check which table you actually get back).</p>
<p>So for example, have a look at the following Wikipedia page &#8211; <a href="http://en.wikipedia.org/wiki/List_of_largest_United_Kingdom_settlements_by_population">List of largest United Kingdom settlements by population</a> (found using a search on Wikipedia for <a href="http://en.wikipedia.org/wiki/Special:Search?search=uk+city+population&amp;go=Go">uk city population</a> &#8211; NOTE: URLs (web addresses) and actual data tables may have changed since this post was written, BUT you should be able to find something similar&#8230;):</p>
<p><a href="http://www.flickr.com/photos/psychemedia/2942047723/" title="Photo Sharing"><img src="http://farm4.static.flickr.com/3286/2942047723_6b93f078ee.jpg" width="500" height="337"></a></p>
<p>Grab the URL, fire up a new Google spreadsheet, and start to enter the formula &#8220;=importHTML&#8221; into one of the cells:</p>
<p><a href="http://www.flickr.com/photos/psychemedia/2942043363/" title="Photo Sharing"><img src="http://farm4.static.flickr.com/3202/2942043363_d1ba3cc1ca.jpg" width="279" height="184"></a></p>
<p>Autocompletion works a treat, so finish off the expression:</p>
<p><em>=ImportHtml(&quot;<a href="http://en.wikipedia.org/wiki/List_of_largest_United_Kingdom_settlements_by_population" rel="nofollow">http://en.wikipedia.org/wiki/List_of_largest_United_Kingdom_settlements_by_population</a>&quot;,&quot;table&quot;,1)</em></p>
<p><a href="http://www.flickr.com/photos/psychemedia/2942045811/" title="Photo Sharing"><img src="http://farm4.static.flickr.com/3013/2942045811_be76a746c3.jpg" width="500" height="76"></a></p>
<p>And as if by magic, a data table appears:</p>
<p><a href="http://www.flickr.com/photos/psychemedia/2942913394/" title="Photo Sharing"><img src="http://farm4.static.flickr.com/3170/2942913394_0c5d618177.jpg" width="500" height="199"></a></p>
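<p>If you want a feel for roughly what the formula is doing, here is a minimal Python sketch (mine, not Google&#8217;s implementation) that pulls the rows of the N&#8217;th table element out of a page&#8217;s HTML using only the standard library. The names <tt>TableScraper</tt> and <tt>scrape_table</tt> are made up for illustration. Counting starts at 1 in this sketch, and a table nested inside the target table is simply folded into its parent cell; both of those are assumptions I have made, not documented behaviour of the spreadsheet function.</p>

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the rows of the n-th table element in an HTML page."""

    def __init__(self, n):
        super().__init__()
        self.n = n          # which table to keep (1 = first table; an assumption)
        self.seen = 0       # table start tags seen outside the target table
        self.depth = 0      # nesting depth inside the target table
        self.rows = []      # finished rows of the target table
        self.row = None     # row currently being built
        self.cell = None    # text fragments of the cell currently being built

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            if self.depth:
                self.depth += 1      # a table nested inside the target
            else:
                self.seen += 1
                if self.seen == self.n:
                    self.depth = 1   # entering the target table
        elif self.depth == 1 and tag == "tr":
            self.row = []
        elif self.depth == 1 and tag in ("td", "th") and self.row is not None:
            self.cell = []

    def handle_endtag(self, tag):
        if tag == "table" and self.depth:
            self.depth -= 1
        elif self.depth == 1 and tag in ("td", "th") and self.cell is not None:
            self.row.append("".join(self.cell).strip())
            self.cell = None
        elif self.depth == 1 and tag == "tr" and self.row is not None:
            self.rows.append(self.row)
            self.row = None

    def handle_data(self, data):
        # Only keep text that falls inside a cell of the target table.
        if self.cell is not None:
            self.cell.append(data)


def scrape_table(html, n):
    """Return the n-th table in html as a list of rows of cell strings."""
    parser = TableScraper(n)
    parser.feed(html)
    parser.close()
    return parser.rows
```

<p>Feed it the page source as a string, for example <tt>scrape_table(page_html, 1)</tt>, and you get back a list of rows, each row being a list of cell strings. If the page has fewer tables than the number you ask for, the result is an empty list.</p>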
<p>All well and good &#8211; if you want to create a chart or two, why not try the Google charting tools?</p>
<p><a href="http://www.flickr.com/photos/psychemedia/2942920870/" title="Google chart by psychemedia, on Flickr"><img src="http://farm4.static.flickr.com/3024/2942920870_fe25d82d1b.jpg" width="500" height="353" alt="Google chart" /></a></p>
<p>Where things get really interesting, though, is when you start letting the data flow around&#8230;</p>
<p>So for example, if you publish the spreadsheet you can liberate the document in a variety of formats:</p>
<p><a href="http://www.flickr.com/photos/psychemedia/2942067261/" title="Photo Sharing"><img src="http://farm4.static.flickr.com/3241/2942067261_bdcb2afcb5_o.png" width="213" height="160"></a></p>
<p>As well as publishing the spreadsheet as an HTML page that anyone can see (and that is pulling data from the Wikipedia page, remember), you can also get access to an RSS feed of the data &#8211; and a host of other data formats:</p>
<p><a href="http://www.flickr.com/photos/psychemedia/2942928462/" title="Photo Sharing"><img src="http://farm4.static.flickr.com/3219/2942928462_d4cf559b38.jpg" width="500" height="255"></a></p>
<p>See the &#8220;More publishing options&#8221; link? Lurvely :-)</p>
<p><a href="http://www.flickr.com/photos/psychemedia/2942073181/" title="Photo Sharing"><img src="http://farm4.static.flickr.com/3238/2942073181_c1b81e4f5f.jpg" width="385" height="347"></a></p>
<p>Let&#8217;s have a bit of CSV goodness: </p>
<p><a href="http://spreadsheets.google.com/pub?key=p1rHUqg4g422Ia1T4s1b-CQ&amp;output=csv&amp;gid=0&amp;range=A1:C66" title="Photo Sharing"><img src="http://farm4.static.flickr.com/3022/2942949732_f76353a837.jpg" width="376" height="388"></a></p>
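<p>The CSV link above is just the published spreadsheet address with a handful of query parameters (<tt>key</tt>, <tt>output</tt>, <tt>gid</tt> and <tt>range</tt>). As a rough sketch, this is how you might build that sort of URL in Python. The function name <tt>csv_url</tt> is hypothetical, and the URL pattern is simply the one used in this post, so it may well not match whatever Google serves today.</p>

```python
from urllib.parse import urlencode


def csv_url(key, cell_range, gid=0):
    """Build a published-spreadsheet CSV URL in the style used in this post."""
    params = {
        "key": key,           # the spreadsheet key from the publish dialogue
        "output": "csv",      # ask for CSV rather than HTML
        "gid": gid,           # which sheet in the spreadsheet
        "range": cell_range,  # the cells to publish, e.g. A1:C66
    }
    # Keep the colon in the range readable rather than percent-encoding it.
    return "http://spreadsheets.google.com/pub?" + urlencode(params, safe=":")
```

<p>Changing the range from <tt>A1:C66</tt> to <tt>A2:C66</tt> is one way of skipping the header row of the table.</p>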
<p>Why CSV? Here&#8217;s why:</p>
<p><a href="http://pipes.yahoo.com" title="Photo Sharing"><img src="http://farm4.static.flickr.com/3001/2942937626_8869402934.jpg" width="500" height="270"></a></p>
<p>Lurvely&#8230; :-)</p>
<p><a href="http://pipes.yahoo.com/pipes/pipe.info?_id=fteLzTua3RGp0Cn7BB50VA/" title="Photo Sharing"><img src="http://farm4.static.flickr.com/3196/2942954398_719b3d227a.jpg" width="327" height="500"></a></p>
<p>(NOTE &#8211; Google spreadsheets&#8217; CSV generator can be a bit crap at times and may require some fudging (and possibly a loss of data) in the pipe &#8211; here&#8217;s an example: <a href="http://blog.ouseful.info/2013/02/05/when-a-hack-goes-wrong-google-spreadsheets-and-yahoo-pipes/">When a Hack Goes Wrong… Google Spreadsheets and Yahoo Pipes</a>.)</p>
<p>Unfortunately, the *&#8217;s in the element names mess things up a bit, so let&#8217;s rename them (don&#8217;t forget to dump the original row of the feed; alternatively, tweak the CSV URL so it starts with row 2). We might as well create a proper RSS feed too, by making sure we at least have a title and description element in there:</p>
<p><a href="http://pipes.yahoo.com/pipes/pipe.info?_id=fteLzTua3RGp0Cn7BB50VA" title="Photo Sharing"><img src="http://farm4.static.flickr.com/3147/2942974878_75d71ea473.jpg" width="500" height="236"></a></p>
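<p>For anyone without Pipes to hand, the same tidying up can be sketched in a few lines of Python: read the CSV, throw away the original header row, give the columns clean names, and make sure each item has a <tt>title</tt> and a <tt>description</tt>. The column names (<tt>rank</tt>, <tt>city</tt>, <tt>population</tt>) and the wording of the description are assumptions for the sake of the example, not what the pipe itself uses.</p>

```python
import csv
import io


def csv_to_items(csv_text, names=("rank", "city", "population")):
    """Turn CSV text into feed-like items with title and description."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # dump the original header row (the one with the *'s)
    items = []
    for row in reader:
        if len(row) != len(names):
            continue  # skip blank or malformed rows
        item = dict(zip(names, row))
        # A proper feed item needs at least a title and a description.
        item["title"] = item["city"]
        item["description"] = "{city} has a population of {population}".format(**item)
        items.append(item)
    return items
```

<p>Each item comes back as a plain dictionary, which is about as close to a Pipes feed item as Python gets out of the box.</p>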
<p>Make the description a little more palatable using a regular expression to rewrite the description element, and work some magic with the location extractor block (see how it finds the lat/long co-ordinates, and adds them to each item?;-):</p>
<p><strong>DEPRECATED&#8230;. The following image is the OLD WAY of doing this and is not to be recommended&#8230;<br />
<a href="http://pipes.yahoo.com/pipes/pipe.info?_id=fteLzTua3RGp0Cn7BB50VA" title="Photo Sharing"><img src="http://farm4.static.flickr.com/3040/2942983646_97cb82ecaa.jpg" width="500" height="271"></a><br />
&#8230;DEPRECATED</strong></p>
<p><em>Geocoding in Yahoo Pipes is done more reliably through the following trick &#8211; replace the Location Builder block with a <strong>Loop</strong> block into which you should insert a <strong>Location Builder Block</strong></em></p>
<p><a href="http://ouseful.files.wordpress.com/2008/10/yahoo-pipe-loop.png"><img src="http://ouseful.files.wordpress.com/2008/10/yahoo-pipe-loop.png?w=700" alt="yahoo pipe loop"   class="alignnone size-full wp-image-9781" /></a></p>
<p>The location builder will look to a specified element for the content we wish to geocode:</p>
<p><a href="http://ouseful.files.wordpress.com/2008/10/yahoo-pipe-location-builder.png"><img src="http://ouseful.files.wordpress.com/2008/10/yahoo-pipe-location-builder.png?w=700&#038;h=166" alt="yahoo pipe location builder" width="700" height="166" class="alignnone size-full wp-image-9782" /></a></p>
<p>The <em>Location Builder</em> block should be configured to output the geocoded result to the <tt>y:location</tt> element. NOTE: the geocoder often assumes US town/city names. If you have a list of town names that you know come from a given country, you may wish to annotate them with a country identifier before you try to geocode them. A regular expression block can do this:</p>
<p><a href="http://ouseful.files.wordpress.com/2008/10/regex-uk.png"><img src="http://ouseful.files.wordpress.com/2008/10/regex-uk.png?w=700&#038;h=155" alt="regex uk" width="700" height="155" class="alignnone size-full wp-image-9780" /></a></p>
<p>This block says &#8211; in the <em>title</em> element, grab a copy of everything &#8211; .* &#8211; into a variable &#8211; (.*) &#8211; and then replace the contents of the <em>title</em> element with its original value &#8211; $1 &#8211; as well as &#8220;, UK&#8221; &#8211; $1, UK.</p>
<p>Note that this regular expression block would need to be wired in BEFORE the geocoding Loop block. That is, we want the geocoder to act on a title element containing &#8220;Cambridge, UK&#8221; for example, rather than just &#8220;Cambridge&#8221;.</p>
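<p>The same rewrite can be sketched in Python with <tt>re.sub</tt>. Two things differ from the Pipes block and are my assumptions rather than anything Pipes does: Python writes the captured group as <tt>\1</tt> rather than <tt>$1</tt>, and I have anchored the pattern with <tt>^</tt> and <tt>$</tt> so that it matches the whole title exactly once. The function name <tt>add_country</tt> is made up for illustration.</p>

```python
import re


def add_country(title, country="UK"):
    """Append a country hint to a place name before geocoding it."""
    # Capture the whole title, then write it back followed by ", UK".
    # The ^ and $ anchors stop the pattern matching a second, empty time.
    return re.sub(r"^(.*)$", r"\1, " + country, title)
```

<p>So <tt>add_country("Cambridge")</tt> gives a title the geocoder is less likely to place in the US.</p>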
<p>Lurvely&#8230;</p>
<p>And to top it all off:</p>
<p><a href="http://pipes.yahoo.com/pipes/pipe.info?_id=fteLzTua3RGp0Cn7BB50VA" title="Photo Sharing"><img src="http://farm4.static.flickr.com/3293/2942129809_d2760c0480.jpg" width="500" height="391"></a></p>
<p>And for the encore? Grab the <a href="http://pipes.yahoo.com/pipes/pipe.run?_id=fteLzTua3RGp0Cn7BB50VA&amp;_render=kml">KML feed out of the pipe</a>:</p>
<p><a href="http://pipes.yahoo.com/pipes/pipe.run?_id=fteLzTua3RGp0Cn7BB50VA&amp;_render=kml" title="Photo Sharing"><img src="http://farm4.static.flickr.com/3054/2942995398_41f4436866.jpg" width="182" height="280"></a></p>
<p>&#8230;and shove it in a Google map:<br />
<a href="http://maps.google.co.uk/maps?f=q&amp;hl=en&amp;geocode=&amp;q=http:%2F%2Fpipes.yahoo.com%2Fpipes%2Fpipe.run%3F_id%3DfteLzTua3RGp0Cn7BB50VA%26_render%3Dkml&amp;ie=UTF8&amp;z=6" title="Photo Sharing"><img src="http://farm4.static.flickr.com/3051/2942140589_6e00bd37d4.jpg" width="500" height="353"></a></p>
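<p>If you are curious what sits inside a KML feed like this, here is a hedged sketch of how geocoded items could be written out as KML in Python using <tt>xml.etree.ElementTree</tt>. It is not how Pipes does it. The function name <tt>items_to_kml</tt> and the item keys (<tt>title</tt>, <tt>description</tt>, <tt>lat</tt>, <tt>lon</tt>) are assumptions for the example. The one detail that regularly catches people out is that KML coordinates are written longitude first, then latitude.</p>

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"


def items_to_kml(items):
    """Serialise geocoded items as a KML document string."""
    ET.register_namespace("", KML_NS)  # write KML as the default namespace

    def tag(name):
        return "{%s}%s" % (KML_NS, name)

    kml = ET.Element(tag("kml"))
    doc = ET.SubElement(kml, tag("Document"))
    for item in items:
        placemark = ET.SubElement(doc, tag("Placemark"))
        ET.SubElement(placemark, tag("name")).text = item["title"]
        ET.SubElement(placemark, tag("description")).text = item["description"]
        point = ET.SubElement(placemark, tag("Point"))
        # KML wants longitude,latitude (in that order).
        coords = "%s,%s" % (item["lon"], item["lat"])
        ET.SubElement(point, tag("coordinates")).text = coords
    return ET.tostring(kml, encoding="unicode")
```

<p>Save the returned string to a file with a <tt>.kml</tt> extension and most mapping tools that read KML should be able to plot the points.</p>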
<p>So to recap, we have scraped some data from a Wikipedia page into a Google spreadsheet using the <em>=importHTML</em> formula, published a handful of rows from the table as CSV, consumed the CSV in a Yahoo pipe and created a geocoded KML feed from it, and then displayed it in a <s>Yahoo</s>Google map.</p>
<p>Kewel :-)</p>
<p>PS If you &#8220;own&#8221; the web page that a table appears on, there is actually quite a lot you can do to either visualise it, or make it &#8216;interactive&#8217;, with very little effort &#8211; see <a href="http://ouseful.open.ac.uk/blogarchive/014014.html">Progressive Enhancement &#8211; Some Examples</a> and <a href="http://ouseful.wordpress.com/2008/08/29/html-tables-and-the-data-web/">HTML Tables and the Data Web</a> for more details&#8230;</p>
<p>PPS for a version of this post in German, see: <a href="http://plerzelwupp.pl.funpic.de/wikitabellen_in_googlemaps/">http://plerzelwupp.pl.funpic.de/wikitabellen_in_googlemaps/</a>. (Please post a linkback if you&#8217;ve translated this post into any other languages :-)</p>
<p>PPPS this is neat &#8211; geocoding in Google spreadsheets itself: <a href="http://apitricks.blogspot.com/2008/10/geocoding-by-google-spreadsheets.html">Geocoding by Google Spreadsheets</a>.</p>
<p>PPPPS Once you have scraped the data into a Google spreadsheet, it&#8217;s possible to treat it as a database using the QUERY spreadsheet function. For more on the QUERY function, see <a href="http://ouseful.wordpress.com/2010/01/19/using-google-spreadsheets-like-a-database-the-query-formula/">Using Google Spreadsheets Like a Database – The QUERY Formula</a> and <a href="http://ouseful.wordpress.com/2010/02/15/creating-a-winter-olympics-2010-medal-map-in-google-spreadsheets/">Creating a Winter Olympics 2010 Medal Map In Google Spreadsheets</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.ouseful.info/2008/10/14/data-scraping-wikipedia-with-google-spreadsheets/feed/</wfw:commentRss>
		<slash:comments>186</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/abbd9f90565ce9ae4d065d93a81d8c03?s=96&#38;d=http%3A%2F%2F1.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D96" medium="image">
			<media:title type="html">Tony Hirst</media:title>
		</media:content>

		<media:content url="http://farm4.static.flickr.com/3286/2942047723_6b93f078ee.jpg" medium="image" />

		<media:content url="http://farm4.static.flickr.com/3202/2942043363_d1ba3cc1ca.jpg" medium="image" />

		<media:content url="http://farm4.static.flickr.com/3013/2942045811_be76a746c3.jpg" medium="image" />

		<media:content url="http://farm4.static.flickr.com/3170/2942913394_0c5d618177.jpg" medium="image" />

		<media:content url="http://farm4.static.flickr.com/3024/2942920870_fe25d82d1b.jpg" medium="image">
			<media:title type="html">Google chart</media:title>
		</media:content>

		<media:content url="http://farm4.static.flickr.com/3241/2942067261_bdcb2afcb5_o.png" medium="image" />

		<media:content url="http://farm4.static.flickr.com/3219/2942928462_d4cf559b38.jpg" medium="image" />

		<media:content url="http://farm4.static.flickr.com/3238/2942073181_c1b81e4f5f.jpg" medium="image" />

		<media:content url="http://farm4.static.flickr.com/3022/2942949732_f76353a837.jpg" medium="image" />

		<media:content url="http://farm4.static.flickr.com/3001/2942937626_8869402934.jpg" medium="image" />

		<media:content url="http://farm4.static.flickr.com/3196/2942954398_719b3d227a.jpg" medium="image" />

		<media:content url="http://farm4.static.flickr.com/3147/2942974878_75d71ea473.jpg" medium="image" />

		<media:content url="http://farm4.static.flickr.com/3040/2942983646_97cb82ecaa.jpg" medium="image" />

		<media:content url="http://ouseful.files.wordpress.com/2008/10/yahoo-pipe-loop.png" medium="image">
			<media:title type="html">yahoo pipe loop</media:title>
		</media:content>

		<media:content url="http://ouseful.files.wordpress.com/2008/10/yahoo-pipe-location-builder.png" medium="image">
			<media:title type="html">yahoo pipe location builder</media:title>
		</media:content>

		<media:content url="http://ouseful.files.wordpress.com/2008/10/regex-uk.png" medium="image">
			<media:title type="html">regex uk</media:title>
		</media:content>

		<media:content url="http://farm4.static.flickr.com/3293/2942129809_d2760c0480.jpg" medium="image" />

		<media:content url="http://farm4.static.flickr.com/3054/2942995398_41f4436866.jpg" medium="image" />

		<media:content url="http://farm4.static.flickr.com/3051/2942140589_6e00bd37d4.jpg" medium="image" />
	</item>
	</channel>
</rss>
