Data Scraping Wikipedia with Google Spreadsheets

Prompted in part by a presentation I have to give tomorrow at an OU eLearning community session (I hope some folks turn up – the 90 minute session on Mashing Up the PLE – RSS edition is the only reason I’m going in…), and in part by Scott Leslie’s compelling programme for a similar duration Mashing Up your own PLE session (scene setting here: Hunting the Wily “PLE”), I started having a tinker with using Google spreadsheets for data table screenscraping.

So here’s a quick summary of (part of) what I found I could do.

The Google spreadsheet function =importHtml("URL", "table", N) will scrape a table from an HTML web page into a Google spreadsheet. The URL of the target web page and the target element type (“table”) both need to be in double quotes. The number N identifies the N’th table in the page as the target table for data scraping (counting starts at 1 in the current version of Google spreadsheets).

So for example, have a look at the following Wikipedia page – List of largest United Kingdom settlements by population (found using a search on Wikipedia for uk city population – NOTE: URLs (web addresses) and actual data tables may have changed since this post was written, BUT you should be able to find something similar…):

Grab the URL, fire up a new Google spreadsheet, and start to enter the formula “=importHTML” into one of the cells:

Autocompletion works a treat, so finish off the expression:

=ImportHtml("http://en.wikipedia.org/wiki/List_of_largest_United_Kingdom_settlements_by_population", "table", 1)

And as if by magic, a data table appears:
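(As an aside, if you’d rather do this sort of table scraping in code than in a spreadsheet, the pandas library offers a close equivalent. Here’s a minimal Python sketch – with the caveat that the table index is a guess and may need adjusting for the current page layout:)

    # Scrape the tables from a Wikipedia page, much as =importHTML does
    # (pip install pandas lxml)
    import pandas as pd

    url = "http://en.wikipedia.org/wiki/List_of_largest_United_Kingdom_settlements_by_population"
    tables = pd.read_html(url)  # returns a list of all tables in the page
    df = tables[1]  # pick out the population table - index may need tweaking
    print(df.head())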

All well and good – if you want to create a chart or two, why not try the Google charting tools?

Google chart

Where things get really interesting, though, is when you start letting the data flow around…

So for example, if you publish the spreadsheet you can liberate the document in a variety of formats:

As well as publishing the spreadsheet as an HTML page that anyone can see (and that is pulling data from the Wikipedia page, remember), you can also get access to an RSS feed of the data – and a host of other data formats:

See the “More publishing options” link? Lurvely :-)

Let’s have a bit of CSV goodness:

Why CSV? Here’s why:

Lurvely… :-)

(NOTE – Google spreadsheets’ CSV generator can be a bit crap at times and may require some fudging (and possibly a loss of data) in the pipe – here’s an example: When a Hack Goes Wrong… Google Spreadsheets and Yahoo Pipes.)
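(If you want to eyeball the CSV outside of Pipes, a couple of lines of Python will pull it down – note that the spreadsheet key in the URL below is a placeholder; use the address given in your own spreadsheet’s publishing options:)

    # Fetch the published CSV from a Google spreadsheet - YOUR_KEY is a
    # made-up placeholder for the key in your own published CSV URL
    import csv
    import urllib.request

    csv_url = "https://docs.google.com/spreadsheets/d/YOUR_KEY/export?format=csv"
    with urllib.request.urlopen(csv_url) as f:
        rows = list(csv.reader(f.read().decode("utf-8").splitlines()))

    header, data = rows[0], rows[1:]
    print(header)
    print(data[0])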

Unfortunately, the *’s in the element names mess things up a bit, so let’s rename them (don’t forget to dump the original header row of the feed – alternatively, tweak the CSV URL so it starts at row 2); we might as well create a proper RSS feed too, by making sure we at least have a title and description element in there:

Make the description a little more palatable using a regular expression to rewrite the description element, and work some magic with the location extractor block (see how it finds the lat/long co-ordinates, and adds them to each item?;-):

DEPRECATED…. The following image is the OLD WAY of doing this and is not to be recommended…

…DEPRECATED

Geocoding in Yahoo Pipes is done more reliably through the following trick – replace the Location Builder block with a Loop block, into which you should insert a Location Builder block:

yahoo pipe loop

The location builder will look to a specified element for the content we wish to geocode:

yahoo pipe location builder

The Location Builder block should be configured to output the geocoded result to the y:location element. NOTE: the geocoder often assumes US town/city names. If you have a list of town names that you know come from a given country, you may wish to annotate them with a country identifier before you try to geocode them. A regular expression block can do this:

regex uk

This block says: in the title element, grab a copy of everything – .* – into a variable – (.*) – and then replace the contents of the title element with its original value – $1 – followed by “, UK” – $1, UK.

Note that this regular expression block would need to be wired in BEFORE the geocoding Loop block. That is, we want the geocoder to act on a title element containing “Cambridge, UK” for example, rather than just “Cambridge”.
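(Outside of Pipes, the same annotate-then-geocode step looks something like the following Python sketch – it uses the geopy library’s Nominatim geocoder as a stand-in for the Pipes Location Builder, and the town names are purely illustrative:)

    # Append ", UK" to each place name, then geocode it - a rough
    # equivalent of the regex block followed by the Location Builder loop
    # (pip install geopy)
    import re
    from geopy.geocoders import Nominatim

    geolocator = Nominatim(user_agent="wikipedia-table-demo")

    titles = ["Cambridge", "Boston"]  # illustrative town names
    for title in titles:
        # the Pipes regex: capture the whole title (.*) and replace it
        # with "$1, UK"
        annotated = re.sub(r"^(.*)$", r"\1, UK", title)
        location = geolocator.geocode(annotated)
        if location:
            print(annotated, location.latitude, location.longitude)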

Lurvely…

And to top it all off:

And for the encore? Grab the KML feed out of the pipe:

…and shove it in a Google map:
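(For the curious, the KML the pipe spits out is just a simple XML file of placemarks, and it’s easy enough to hand-roll – a sketch, with made-up place data:)

    # Write a minimal KML file of the kind the pipe emits - the names
    # and coordinates below are illustrative
    places = [("London", 51.507, -0.128), ("Birmingham", 52.480, -1.903)]

    kml = ['<?xml version="1.0" encoding="UTF-8"?>',
           '<kml xmlns="http://www.opengis.net/kml/2.2"><Document>']
    for name, lat, lng in places:
        # note that KML coordinates go longitude,latitude
        kml.append('<Placemark><name>%s</name>'
                   '<Point><coordinates>%s,%s</coordinates></Point>'
                   '</Placemark>' % (name, lng, lat))
    kml.append('</Document></kml>')

    with open("settlements.kml", "w") as f:
        f.write("\n".join(kml))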

So to recap: we have scraped some data from a Wikipedia page into a Google spreadsheet using the =importHTML formula, published a handful of rows from the table as CSV, consumed the CSV in a Yahoo pipe, created a geocoded KML feed from it, and then displayed it in a Google map.

Kewel :-)

PS If you “own” the web page that a table appears on, there is actually quite a lot you can do to either visualise it, or make it ‘interactive’, with very little effort – see Progressive Enhancement – Some Examples and HTML Tables and the Data Web for more details…

PPS for a version of this post in German, see: http://plerzelwupp.pl.funpic.de/wikitabellen_in_googlemaps/. (Please post a linkback if you’ve translated this post into any other languages :-)

PPPS this is neat – geocoding in Google spreadsheets itself: Geocoding by Google Spreadsheets.

PPPPS Once you have scraped the data into a Google spreadsheet, it’s possible to treat it as a database using the QUERY spreadsheet function. For more on the QUERY function, see Using Google Spreadsheets Like a Database – The QUERY Formula and Creating a Winter Olympics 2010 Medal Map In Google Spreadsheets.

Visualising the OU Twitter Network

Readers of any prominent OU bloggers will probably have noticed that we appear to have something of a Twitter culture developing within the organisation (e.g. “Twitter, microblogging and living in the stream“). After posting a few Thoughts on Visualising the OU Twitter Network…, I couldn’t resist the urge to have a go at drawing the OpenU twittergraph at the end of last week (although I had hoped someone else on the lazyweb might take up the challenge…) and posted a few teaser images (using broken code – oops) via twitter.

Anyway, I tidied up the code a little, and managed to produce the following images, which I have to say are spectacularly uninteresting. The membership of the ‘OU twitter network’ was identified using a combination of searches on Twitter for “open.ac.uk” and “Open University”, coupled with personal knowledge. Which is to say, the membership list may well be incomplete.

The images are based on a graph that plots who follows whom. If B follows A, then B is a follower and A is followed. In the network graphs, an arrow goes from A to B if A is followed by B (so in the network graph, the arrows point to the people who follow you). The graph was constructed by making calls to the Twitter API for the names of the people an individual follows, for each member of the OU Twitter network. An edge appears in the graph if a person in the OU Twitter network follows another person in the OU Twitter network. (One thing I haven’t looked at is whether there are individuals followed by a large number of OpenU twitterers who aren’t in the OpenU twitter network… which might be interesting…)
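(For anyone who fancies replicating the graph construction, the logic boils down to something like the following Python sketch using the networkx library – the follower lists are stubbed out with dummy names, since in the original they came from per-member Twitter API calls:)

    # Build the "who follows whom" graph: an arrow goes from A to B
    # if A is followed by B (pip install networkx)
    import networkx as nx

    # stub data - person -> people they follow, as returned by the API
    follows = {
        "alice": ["bob", "carol"],
        "bob": ["alice"],
        "carol": ["alice", "bob"],
    }

    members = set(follows)  # the OU Twitter network membership list
    G = nx.DiGraph()
    for person, followed in follows.items():
        for target in followed:
            if target in members:  # only keep edges within the network
                # person follows target, so the arrow points from the
                # followed person to their follower
                G.add_edge(target, person)

    print(G.number_of_nodes(), "people,", G.number_of_edges(), "follow links")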

Wordle view showing who in the network has the most followers (the word size is proportional to the number of followers, so the bigger your name, the more people there are in the OU network that follow you). As Stuart predicted, this largely looks like a function of active time spent on Twitter.

We can compare this with a Many Eyes tag cloud showing how widely people follow other members of the OU network (the word size is proportional to the number of people in the OU network that the named individual follows – so the bigger your name, the more people in the OU network you follow).

Note that it may be interesting to scale this result according to the total number of people a user is following:

@A’s OU network following density = (number of people @A follows in the OU Twitter network) / (total number of people @A follows)

Similarly, maybe we could also look at:

@A’s OU network follower density = (number of people in the OU Twitter network following @A) / (total number of people following @A)
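(In code, those two measures are trivial – a quick sketch, with made-up counts standing in for the real API numbers:)

    # OU network following/follower density for @A, per the formulas above
    def following_density(in_network_follows, total_follows):
        return in_network_follows / float(total_follows)

    def follower_density(in_network_followers, total_followers):
        return in_network_followers / float(total_followers)

    # e.g. @A follows 12 in-network people out of 150 followed in total
    print(following_density(12, 150))
    print(follower_density(20, 400))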

(In the tag clouds, the number of people following is less than the number of people followed; I think this is in part because I couldn’t pull down the names of who a person was following for people who have protected their tweets?)

Here’s another view of people who actively follow other members of the OU twitter network:

And who’s being followed?

These treemaps uncover another layer of information if we add a search…

So for example, who is Niall following/not following?

And who’s following Niall?

I’m not sure how useful a view of the OU Twittergraph is in itself, though?

Maybe more interesting is to look at the connectivity between people who have sent each other an @message. So for example, here’s how Niall has been chatting to people in the OU twitter network (a link goes from A to B if @A sends a tweet to @B):

ou personal active twittermap

We can also compare the ‘active connectivity’ of several people in the OU Twitter network. For example, who is Martin talking to (and who’s talking to Martin), compared with Niall’s conversations?

martin and niall conversation twittermaps

As to why I’m picking on Niall…? Well, apart from making the point that when you engage in ‘public’ social networks, other people can look at what you’re doing, it’s partly because thinking about this post on ‘Twitter impact factors’ kept me up all night: Twitter – how interconnected are you?.

The above is all “very interesting”, of course, but I’m not sure how valuable it is, e.g. in helping us understand how knowledge might flow around the OU Twitter network? Maybe I need to go away and start looking at some of the social network analysis literature, as well as some of the other Twitter network analysis tools, such as Twinfluence (Thanks, @Eingang:-)

PS Non S. – Many Eyes may give you a way of embedding a Wordle tagcloud…?

iTunes in Your Pocket… Almost…

Having been tipped off about a Netvibes page that the Library folks are pulling together about how to discover video resources (Finding and reusing video – 21st century librarianship in action, methinks? ;-), I thought I’d have a look at pulling together an OU iTunes OPML bundle that could be used to provide access to OU iTunes content in a Grazr widget (or my old RadiOBU OpenU ‘broadcast’ widget ;-) and maybe also act as a nice little container for viewing/listening to iTunes content on an iPhone/iPod Touch.

To find the RSS feed for a particular content area in iTunesU, navigate to the appropriate page (one with lists of actual downloadable content showing in the bottom panel), make sure you have the right tab selected, then right click on the “Subscribe” button and copy the feed/subscription URL (or is there an easier way? I’m not much of an iTunes user?):

You’ll notice in the above case that as well as the iPod video (mp4v format?), there is a straight video option (.mov???) and a transcript. I haven’t started to think about how to make hackable use of the transcripts yet, but in my dreams I’d imagine something like these Visual Interfaces for Audio/Visual Transcripts! ;-) In addition, some of the OU iTunesU content areas offer straight audio content.

Because finding the feeds is quite a chore (at least in the way I’ve described it above), I’ve put together an OU on iTunesU OPML file that bundles together all the separate RSS feeds from the OU on iTunesU area (to view this file in an OPML widget, try here: OU iTunesU content in a Grazr widget).
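(If you fancy rolling your own bundle, OPML is trivial to generate – here’s a minimal Python sketch that wraps a list of feed URLs in the necessary outline elements; the feed titles and URLs shown are placeholders for the real iTunesU ones:)

    # Bundle a list of RSS feed URLs into a simple OPML file - the feed
    # titles/URLs are placeholders
    feeds = [("OU Arts podcasts", "http://example.com/arts.rss"),
             ("OU Science podcasts", "http://example.com/science.rss")]

    lines = ['<?xml version="1.0" encoding="UTF-8"?>',
             '<opml version="1.0">',
             '<head><title>OU on iTunesU</title></head>',
             '<body>']
    for title, url in feeds:
        lines.append('<outline type="rss" text="%s" title="%s" xmlUrl="%s" />'
                     % (title, title, url))
    lines += ['</body>', '</opml>']

    with open("ou_itunesu.opml", "w") as f:
        f.write("\n".join(lines))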

The Grazr widget lets you browse through all the feeds, and if you click on an actual content item link, it should launch a player (most likely Quicktime). Although the Grazr widget has a nice embedded player for MP3 files, it doesn’t seem to offer an embedded player for iTunes content (or maybe I’m missing something?).

You can listen to the audio tracks well enough on an iPod Touch (so the same is presumably true for an iPhone?) using the Grazr iPhone widget – but for some reason I can’t get the iPod videos to play. I’m wondering if this might be a mime-type issue, or maybe there’s some other reason?

(By the by, it looks like the content is being served from an Amazon S3 server… so has the OU bought into using S3 I wonder? :-)

For completeness, I also started to produce a handcrafted OPML bundle of OU Learn Youtube playlists, but then discovered I’d put together a little script ages ago that will create one of these automatically, and route each playlist feed through a feed augmentation pipe that adds a link to each video as a video enclosure:

http://ouseful.open.ac.uk/xmltools/youtubeUserPlaylistsOPML.php?user=oulearn

Why would you want to do this? Because if there’s a video payload as an enclosure, Grazr will provide an embedded player for you… as you can see in this screenshot of Portable OUlearn Youtube playlists widget (click through the image to play with the actual widget):

These videos will play on an iPod Touch, although the interaction is a bit clunky (it’s actually slightly cleaner using the handcrafted OPML): OUlearn youtube widget for iphone.

PS it’s also worth remembering that Grazr can embed Slideshare presentations, though I’m pretty sure these won’t work on the iPhone…

eduTwitterin’

Jane’s list of “100+ (E-)Learning Professionals to follow on Twitter” (which includes yours truly, Martin and Grainne from the OpenU :-) has been doing the rounds today, so in partial response to Tony Karrer asking “is there an equivalent to OPML import for twitter for those of us who don’t want to go through the list and add people one at a time?”, I took an alternative route to achieving a similar effect (tracking those 100+ e-learning professionals’ tweets) and put together a Yahoo pipe to produce an aggregated feed – Jane’s edutwitterers pipe.

Scrape the page and create a semblance of a feed of the edutwitterers:

Tidy the feed up a bit and make sure we only include items that link to valid twitter RSS feed URLs (note that the title could do with a little more tidying up…) – the regular expression for the link creates the feed URL for each edutwitterer:

Replace each item in the edutwitterers feed with the tweets from that person:

From the pipe, subscribe to the aggregated edutwitters’ feed.
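(The aggregation itself is easy enough to replicate in code – a sketch using the feedparser library, with placeholder feed URLs, since the old twitter.com RSS endpoints the pipe relied on are long gone:)

    # Merge several users' feeds into one date-sorted stream - roughly
    # what the pipe does (pip install feedparser); the URLs are placeholders
    import feedparser

    feed_urls = ["http://example.com/user1.rss",
                 "http://example.com/user2.rss"]

    entries = []
    for url in feed_urls:
        entries.extend(feedparser.parse(url).entries)

    # keep only entries with a parsed date, newest first
    entries = [e for e in entries if e.get("published_parsed")]
    entries.sort(key=lambda e: e.published_parsed, reverse=True)
    for e in entries[:20]:
        print(e.get("title", "(untitled)"))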

Note, however, that the aggregated feed is a bit slow – it takes time to pull out the tweets for each edutwitterer, and there is the potential for feeds being cached all over the place (by Yahoo pipes, by your browser, or by whatever you happen to view the pipe’s output feed in, etc.).

A more efficient route might be to produce an OPML feed containing links to each edutwitterer’s RSS feed, and then view this as a stream in a Grazr widget.

Creating the OPML file is left as an exercise for the reader (!) – if you do create one, please post a link as a comment or trackback… ;-) Here are three ways I can think of for creating such a file:

  1. add the feed URL for each edutwitterer as a separate feed in a Grazr reading list (How to create a Grazr (OPML) reading list). If you don’t like/trust Grazr, try OPML Manager;
  2. build a screenscraper to scrape the usernames and then create an output OPML file automatically (see the sketch after this list);
  3. view source of Jane’s original edutwitterers page, cut out the table that lists the edutwitterers, paste the text into a text editor and work some regular expression ‘search and replace’ magic (if you do this, how about posting your recipe/regular expressions somewhere?! ;-)
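For option 2, a minimal screenscraping sketch might look like the following – note that the page structure is guessed at, and the old twitter.com RSS URL scheme it prints no longer works:

    # Scrape Twitter usernames from links on the edutwitterers page and
    # print an OPML outline per user (pip install requests beautifulsoup4)
    import re
    import requests
    from bs4 import BeautifulSoup

    page_url = "http://example.com/100-elearning-twitterers"  # placeholder
    soup = BeautifulSoup(requests.get(page_url).text, "html.parser")

    usernames = set()
    for a in soup.find_all("a", href=True):
        m = re.match(r"https?://twitter\.com/(\w+)$", a["href"])
        if m:
            usernames.add(m.group(1))

    for name in sorted(usernames):
        print('<outline type="rss" text="%s" '
              'xmlUrl="http://twitter.com/statuses/user_timeline/%s.rss" />'
              % (name, name))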

Enough – time to start reading Presentation Zen

ORO Results in Yahoo SearchMonkey

It’s been a long – and enjoyable – day today (err, yesterday, I forgot to post this last night!), so just a quick placeholder post, that I’ll maybe elaborate on with techie details at a later date, to show one way of making some use of the metadata that appears in the ORO/eprints resource splash pages (as described in ORO Goes Naked With New ePrints Server): a Yahoo SearchMonkey ORO augmented search result – ORO Reference Details (OUseful).

The SearchMonkey extension, when “installed” in your Yahoo profile, will augment ORO results in organic Yahoo search listings with details about the publication the reference appears in, the full title (or at least, the first few characters of the title!), the keywords used to describe the reference, and the first author, along with links to a BibTeX reference and the document download (I guess I could also add a link in there to a full HTML reference?).

The SearchMonkey script comes in two parts – a “service” that scrapes the page linked to from the results listing:

And a “presentation” part, that draws on the service to augment the results:

It’s late – I’m tired – so no more for now; if you’re interested, check out the Yahoo SearchMonkey documentation, or Build your own SearchMonkey app.

Rehashing Old Tools to Look at CCK08

I haven’t posted for a few days (nothing to write about, sigh….) so here’s a cheap’n’lazy post reusing a couple of old visual demos (edupunk chatter, More Hyperbolic Tree Visualisations – delicious URL History: Users by Tag) to look at what’s happening around the use of the CCK08 tag that’s being used to annotate – in a distributed way – the Connectivism and Connective Knowledge online course.

For example, here’s a view of people who have been using the cck08 tag on delicious:

People twittering mentions of cck08:

And here’s how people have been tagging the Connectivism and Connective Knowledge course homepage on delicious (along with the people who’ve been using those tags).

The next step is to move from hierarchical info displays (such as the above) to mining networks – groups of people who are talking about the same URLs on delicious and twitter, and maybe even blogging about CCK08 too…

UK HEI “Page Not Found” Error Pages

In Back from Behind Enemy Lines, Without Being Autodiscovered(?!), I described a simple service that displays the autodiscoverable RSS feeds from UK HEI homepages (it went down over the weekend as the screenscraping broke, but it’s back now and some of the ‘issues’ with the linkscraping have been fixed ;-)

Over the weekend, I tweaked the code and created a parallel service that displays the ‘Page Not Found’ (HTTP 404) splash pages for UK HEIs, using thumbnails generated using websnapr.

You can find the service here: UK HEI “Page Not Found” pages

The page takes in a list of UK HEI homepage URLs, generates a nonsense URL off each domain, and uses that nonexistent page URL as the basis for the thumbnail screenshot.
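(The nonsense URL trick amounts to picking a path that is vanishingly unlikely to exist and letting the server’s 404 handler do the rest – a quick sketch, with an illustrative homepage list:)

    # Generate a nonsense URL for each homepage and check it 404s
    # (pip install requests); the homepage list is illustrative
    import uuid
    import requests

    homepages = ["http://www.open.ac.uk/"]
    for home in homepages:
        nonsense = home.rstrip("/") + "/" + uuid.uuid4().hex
        r = requests.get(nonsense)
        # the page returned is the site's "Page Not Found" splash page
        print(nonsense, r.status_code)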

You’ll also notice I took the opportunity to do a little (unofficial) OU course advertising… At times like this, it would be so handy for us to have a contextual course ad serving javascript snippet/widget…

PS Brian Kelly pinged me with a note that he’s had a UK HEI 404 viewer around for just about forever… University 404 pages rolling demo… Just by the by, the script that Brian used to scroll through the pages was the inspiration for the original “deliShow” version of feedshow (about feedshow).