Back from Behind Enemy Lines, Without Being Autodiscovered(?!)

Home again after a few very enjoyable days away at IWMW2008 in Aberdeen, and I feel like I need a way of saying thank you to the web managers’ community for allowing an academic in…heh heh ;-)

So I spent half an hour or so (no… really…;-) on the train back from the airport putting together a front end for an HEI feed autodiscovery pipe that I knocked up in one of the presentations yesterday (I was giving the talk that was on at the time my full partial attention, of course ;-) that picks up on some of the conversation that was in the air at the end of the innovation competition session (I didn’t win, of course…;-(

The context is/was a comment from Mike Ellis that HEIs et al. could start opening up their data by offering RSS feeds of news releases, job/recruitment ads and event listings, because there’s no reason not to…. So my, err, gift(?!) back to the IWMW community is a little something where UK HEI web managers can proudly show off how they’ve taken up the challenge and published a whole suite of autodiscoverable RSS feeds from their home pages ;-): UK HEI autodiscoverable feeds.

(Cynics may say that the page actually names and shames sites that don’t offer any autodiscoverable feeds; I couldn’t possibly comment… ;-)

Anyway, the pipework goes like this…

First of all I grab a feed of UK HEI homepages… There isn’t an official one, of course, so as a stopgap I’ve scraped a dodgy secondary source (safe in the expectation that Mike Ellis will have an authoritative, hacked feed available from sometime soon…)

All that’s required then is to pull out the link in each item, that hopefully corresponds to the HEI homepage, and use that as the focus for feed autodiscovery:

Any feed URLs that are autodiscovered are then added as elaborations to the corresponding HEI feed item. Although these elaborations aren’t exposed in the RSS feed output from the pipe, they are available in the pipe’s JSON output, so the half-hour (offline) hack on the train earlier this afternoon consumes the JSON feed and gives a quick and dirty show’n’tell display of which institutions have autodiscoverable feeds on their homepage: UK HEI autodiscoverable feeds.

Looking at a couple of comments to the post Nudge: Improving Decisions About RSS Usage, (in which Brian Kelly tabulated the provision of RSS feeds from Scottish HEIs), it seems that publicly highlighting the lack of support for feed autodiscovery can encourage people to look at their pages and add the feature… So I wonder: when IWMW comes around next year, will the phrase No autodiscoverable feeds… be missing from the UK HEI autodiscoverable feeds page, possibly in part because that page exists?!

(Note that if you use this page to test a homepage you’ve added feed autodiscovery to, there is cacheing going on everywhere so you may not see any change in the display for an hour or so… I’ll try and post a cache-resistant feed autodiscovery page over the next few days; in the meantime, most browsers glow somewhere if they load an HTML page containing autodiscoverable feeds…)

UK HEI “Page Not Found” Error Pages

In Back from Behind Enemy Lines, Without Being Autodiscovered(?!), I described a simple service that displays the autodiscoverable RSS feeds from UK HEI homepages (it went down over the weekend as the screenscraping broke, but it’s back now and some of the ‘issues’ with some of the linkscraping has been fixed ;-)

Over the weekend, I tweaked the code and created a parallel service that displays the ‘Page Not Found’ (HTML error code 404) splash page for UK HEIs using thumbnails generated using websnapr.

You can find the service here: UK HEI “Page Not Found” pages

The page takes in a list of UK HEI homepage URLs, generates a nonsense URL off each domain, and uses that nonexistent page URL as the basis for the thumbnail screenshot.

You’ll also notice I took the opportunity to do a little (unofficial) OU course advertising… At times like this, it would so handy for us to have a contextual course ad serving javascript snippet/widget…

PS Brian Kelly pinged me with a note that he’s had a UK HEI 404 viewer around for just about forever… University 404 pages rolling demo… Just by the by, the script that Brian used to scroll through the pages was the inspiration for the original “deliShow” version of feedshow (about feedshow).

OU News Tracking

A couple of days ago, Stuart pointed me to Quarkbase, a one stop shop for looking at various web stats, counts and rankings for a particular domain (here’s the domain on quarkbase, for example; see also: the Silobreaker view of the OU), which reminded me that I hadn’t created a version of the media release related news stories tracker that won me a gift voucher at IWMW2008 ;-)

So here it is: OU Media release effectiveness tracker pipe.

And to make it a little more palatable, here’s a view of the same in a Dipity timeline (which will also have the benefit of aggregating these items over time): OU media release effectiveness tracker timeline.

I also had a mess around trying to see how I could improve the implementation (one was was to add the “sort by date” flag to the Google news AJAX call (News Search Specific Arguments)), but then, of course, I got sidetracked… because it seemed that the Google News source I was using to search for news stories didn’t cover the THES (Times Higher Education Supplement).

My first thought was to use a Yahoo pipe to call a normal Google search, limited by domain to OU THES Search (via Google).

But that was a bit hit and miss, and didn’t necessarily return the most recent results… so instead I created a pipe to search over the last month of the THES for stories that mention “open university” and then scrape the THES search results page: OU THES Scraper.

If you want to see how it works, clone the pipe and edit it…

One reusable component of the pipe is this fragment that will make sure the date is in the correct format for an RSS feed (if it isn’t in the right format, Dipity may well ignore it…):

Here’s the full expression (actually, a PHP strftime expression) for outputting the date in the required RFC 822 date-time format: %a, %d %b %Y %H:%M:%S %z

To view the OU in the THES tracker over time, I’ve fed it into another Dipity timeline: OU in the THES.

(I’ve also added the THES stories to the OUseful “OU in the news” tab at

Going back to the media release effectiveness tracker, even if I was to add the THES as another news source, the coverage of that service would still be rather sparse. For a more comprehensive version, it would be better to plug in to something like the LexisNexis API and search their full range of indexed news from newspapers, trade magazines and so on… That said, I’m not sure if we have a license to use that API, and/or a key for it? But then again, that’s not really my job… ;-)

Managing Time in Yahoo Pipes

In the previous post – OU News Tracking – I briefly described how to get a Yahoo pipe to output a “pubDate” timestamp in the “correct” RSS standard format:

Here’s the full expression (actually, a PHP strftime expression) for outputting the date in the required RFC 822 date-time format: %a, %d %b %Y %H:%M:%S %z

The Date Builder block is being applied to a particular field (cdate) in every feed item, and assigning the results to the y:published tag. But what exactly is it outputting? A special datetime object, that’s what:

The Date Builder module is actually quite flexible in what it accepts – in the above pipe, cdate contains values like “21 August 2008”, but it can do much more than that…

For example, take the case of the THES search pipe, also described in the previous post. The pipe constructs a query that searches the Times Higher from the current date back to the start of the year. Heres what the query looks like in the original search form:

And here’s what the URL it generates looks like:

sday is “start date”, emth is “end month”, and so on…

Posting the URL into the Pipe URL builder separates out the different arguments nicely:

You’ll notice I’ve hardcoded the sday and smth to the January 1st, but the other date elements are wired in from a datetime object that has been set to the current date:

Terms like “now” also work…

Taken together, the two date/time related blocks allow you to manipulate time constructs quite easily within the Yahoo pipe :-)

Joining the Flow – Invisible Library Tech Support

Idling some thoughts about what to talk about in a session the OU Library* is running with some folks from Cambridge University Library services as part of an Arcadia Trust funded project there (blog), I started wondering about how info professionals in an organisation might provide invisible support to their patrons by joining in the conversation…

*err – oops; I mentioned the OU Library without clearing the text first; was I supposed to submit this post for censor approval before publishing it? ;-)

One way to do this is to comment on blog posts, as our own Tim Wales does on pages from time to time (when I don’t reply, Tim, it’s because I can’t add any more… but I’ll be looking out for your comments with an eagle eye from now on… ;-) [I also get delicious links for:d to me by Keren – who’s also on Twitter – and emailed links and news stories from Juanita on the TU120 course team.]

Another way is to join the twitterati…

“Ah”, you might say, “I can see how that would work. We set up @OULibrary, then our users subscribe to us and then when they want help they can send us a message, and we can get back to them… Cool… :-)”

Err… no.

The way I’d see it working would be for @OULibrary, for example, to subscribe to the OU twitterati and then help out when they can; “legitimate, peripheral, participatory support” would be one way of thinking about it…

Now of course, it may be that @OULibrary doesn’t want to be part of the whole conversation (at least, not at first…), but just the question asking parts…

In which case, part of the recipe might go something like this: use the advanced search form to find out the pattern for cool uri that lets you search for “question-like” things from a particular user:

(Other queries I’ve found work well are searches for: ?, how OR when OR ? , etc.)


The query gives you something like the above, including a link to an RSS feed for the search:

So now what do we do? We set up a script that takes a list of the twitter usernames of OU folks – you know how to find that list, right? I took the easy way ;-)

Liam’s suggestion links to an XML stream of status messages from people who follow PlanetOU, so the set might be leaky and/or tainted, right, and include people who have nothing to do with the OU… but am I bovvered? ;-)

(You can see a list of the followers names here, if you log in:

Hmmm… a list of status messages from people who may have something to do with the OU… Okay, dump the search thing, how about this…

The XML feed of friends statuses appears to be open (at the moment) so just filter the status messages of friends of PlanetOU and hope that OU folks have declared themselves to PlanetOU? (Which I haven’t… ;-)

Subscribe to this and you’ll have a stream of questions from OU folks who you can choose to help out, if you want…

A couple of alternatives would be to take a list of OU folks twitter names, and either follow them and filter your own friends stream for query terms, or generate search feed URLs for all them (my original thought, above) and roll those feeds into a single stream…

In each case, you have set up where the Library is invisibly asking “can I help you?”

Now you might think that libraries in general don’t work that way, that they’re “go to” services who help “lean forward” users, rather than offering help to “lean back” users who didn’t think to ask the library in the first place (err…..?), but I couldn’t possibly comment…

PS More links in to OU communities…

which leads to:

PPS (March 2011) seems like the web ha caught up: InboxQ


Jane’s list of “100+ (E-)Learning Professionals to follow on Twitter” (which includes yours truly, Martin and Grainne from the OpenU :-) has been doing the rounds today, so in partial response to Tony Karrer asking “is there an equivalent to OPML import for twitter for those of us who don’t want to go through the list and add people one at a time?”, I took an alternative route to achieving a similar effect (tracking those 100+ e-learning professionals’ tweets) and put together a Yahoo pipe to produce an aggregated feed – Jane’s edutwitterers pipe

Scrape the page and create a semblance of a feed of the edutwitterers:

Tidy the feed up a bit and make sure we only include items that link to valid twitter RSS feed URLs (note that the title could do with a little more tidying up…) – the regular expression for the link creates the feed URL for each edutwitterer:

Replace each item in the edutwitterers feed with the tweets from that person:

From the pipe, subscribe to the aggregated edutwitters’ feed.

Note, however, that the aggregated feed is a bit slow – it takes time to pull out tweets for each edutwitterer, and there is the potential for feeds being cached all over the place (by Yahoo pipes, by your browser, or whatever you happen to view the pipes output feed etc. etc.)

A more efficient route might be to produce an OPML feed containing links to each edutwitterer’s RSS feed, and then view this as a stream in a Grazr widget.

Creating the OPML file is left as an exercise for the reader (!) – if you do create one, please post a link as a comment or trackback… ;-) Here are three ways I can think of for creating such a file:

  1. add the feed URL for each edutwitter as a separate feed in an Grazr reading list (How to create a Grazr (OPML) reading list). If you don’t like/trust Grazr, try OPML Manager;
  2. build a screenscraper to scrape the usernames and then create an output OPML file automatically;
  3. view source of Jane’s orginal edutwitterers page, cut out the table that lists the edutwitterers, paste the text into a text editor and work some regular ecpression ‘search and replace’ magic; (if you do this, how about posting your recipe/regular expressions somewhere?!;-)

Enough – time to start reading Presentation Zen

Data Scraping Wikipedia with Google Spreadsheets

Prompted in part by a presentation I have to give tomorrow as an OU eLearning community session (I hope some folks turn up – the 90 minute session on Mashing Up the PLE – RSS edition is the only reason I’m going in…), and in part by Scott Leslie’s compelling programme for a similar duration Mashing Up your own PLE session (scene scetting here: Hunting the Wily “PLE”), I started having a tinker with using Google spreadsheets as for data table screenscraping.

So here’s a quick summary of (part of) what I found I could do.

The Google spreadsheet function =importHTML(“”,”table”,N) will scrape a table from an HTML web page into a Google spreadsheet. The URL of the target web page, and the target table element both need to be in double quotes. The number N identifies the N’th table in the page (counting starts at 0) as the target table for data scraping.

So for example, have a look at the following Wikipedia page – List of largest United Kingdom settlements by population (found using a search on Wikipedia for uk city population – NOTE: URLs (web addresses) and actual data tables may have changed since this post was written, BUT you should be able to find something similar…):

Grab the URL, fire up a new Google spreadsheet, and satrt to enter the formula “=importHTML” into one of the cells:

Autocompletion works a treat, so finish off the expression:


And as if by magic, a data table appears:

All well and good – if you want to create a chart or two, why not try the Google charting tools?

Google chart

Where things get really interesting, though, is when you start letting the data flow around…

So for example, if you publish the spreadsheet you can liberate the document in a variety of formats:

As well publishing the spreadsheet as an HTML page that anyone can see (and that is pulling data from the WIkipedia page, remember), you can also get access to an RSS feed of the data – and a host of other data formats:

See the “More publishing options” link? Lurvely :-)

Let’s have a bit of CSV goodness:

Why CSV? Here’s why:

Lurvely… :-)

(NOTE – Google spreadsheets’ CSV generator can be a bit crap at times and may require some fudging (and possibly a loss of data) in the pipe – here’s an example: When a Hack Goes Wrong… Google Spreadsheets and Yahoo Pipes.)

Unfortunately, the *’s in the element names mess things up a bit, so let’s rename them (don’t forget to dump the original row of the feed (alternatively, tweak the CSV URL so it starts with row 2); we might as well create a proper RSS feed too, by making sure we at least have a title and description element in there:

Make the description a little more palatable using a regular expression to rewrite the description element, and work some magic with the location extractor block (see how it finds the lat/long co-ordinates, and adds them to each item?;-):

DEPRECATED…. The following image is the OLD WAY of doing this and is not to be recommended…


Geocoding in Yahoo Pipes is done more reliably through the following trick – replace the Location Builder block with a Loop block into which you should insert a Location Builder Block

yahoo pipe loop

The location builder will look to a specified element for the content we wish to geocode:

yahoo pipe location builder

The Location Builder block should be configured to output the geocoded result to the y:location element. NOTE: the geocode often assumes US town/city names. If you have a list of town names that you know come from a given country, you may wish to annotate them with a country identify before you try to geocode them. A regular expression block can do this:

regex uk

This block says – in the title element, grab a copy of everything – .* – into a variable – (.*) – and then replace the contents of the title element with it’s original value – $1 – as well as “, UK” – $1, UK

Note that this regular expression block would need to be wired in BEFORE the geocoding Loop block. That is, we want the geocoder to act on a title element containing “Cambridge, UK” for example, rather than just “Cambridge”.


And to top it all off:

And for the encore? Grab the KML feed out of the pipe:

…and shove it in a Google map:

So to recap, we have scraped some data from a wikipedia page into a Google spreadsheet using the =importHTML formula, published a handful of rows from the table as CSV, consumed the CSV in a Yahoo pipe and created a geocoded KML feed from it, and then displayed it in a YahooGoogle map.

Kewel :-)

PS If you “own” the web page that a table appears on, there is actually quote a lot you can do to either visualise it, or make it ‘interactive’, with very little effort – see Progressive Enhancement – Some Examples and HTML Tables and the Data Web for more details…

PPS for a version of this post in German, see: (Please post a linkback if you’ve translated this post into any other languages :-)

PPPS this is neat – geocoding in Google spreadsheets itself: Geocoding by Google Spreadsheets.

PPPS Once you have scraped the data into a Google spreadsheet, it’s possible to treat it as a database using the QUERY spreadsheet function. For more on the QUERY function, see Using Google Spreadsheets Like a Database – The QUERY Formula and Creating a Winter Olympics 2010 Medal Map In Google Spreadsheets.

Viewing Campaign Finance Data In a Google Spreadsheet via the New York Times Campaign Data API

It’s all very well when someone like the New York Times opens up an API to campaign finance data (announced here), but only the geeks can play, right? It’s not like they’re making the data available in a spreadsheet so that more people have a chance to try and do something with it, is it? Err, well, maybe this’ll help? A way of getting data out from the NYT API and into an online Google spreadsheet, where you can do something with it… (you can see an example here: campaign data in a Google Spreadsheet (via The New York Times Campaign Data API)).

Following up on a comment to Visualising Financial Data In a Google Spreadsheet Motion Chart by Dan Mcquillan: “FWIW, there’s a lot of campaigning potential in these tools”, I just had a quick tinker with another Google spreadsheet function – importXML – that pulls XML data into a Google spreadsheet, and in particular pointed it at the newly released New York Times Campaign Finance API.

To use this API you need to get yourself a key, which requires you to register with the New York Times, and also provide a URL for your application: create a new Google spreadsheet, and use the URL for it (it’ll look something like as the URL for your NYT API app.

First up, enter “API KEY” as a label in a cell – I used G22 – and paste your New York Times API Campaign Finance API key for that spreadsheet into the cell next to it – e.g. H32. (My spreadsheet was cobbled together in scratchpad mode, so the cells are all over the place!)

In cell G23 enter the label “Names URL” and in cell H23 the following formula, which constructs a URL for the “Candidate Summaries”:

[aaargghhh – crappy styylesheet… the formula is this, but all on one line:


(The formula concatenates a NYT API calling URL with your API key.)

We’re now going to pull in the list of candidates: in e.g. cell A24 enter the formula:

H23, remember, is a reference to the cell that contains the URL for the API call that grabs all the candidate summaries. The =importXML grabs the XML result from the API call, and the “//candidate” bit looks through the XML file till it finds a “candidate” element. If you look at the XML from the webservice, you’ll see it actually contains several of these:

The spreadsheet actually handles all of these and prints out a summary table for you:

All we did was configure one cell to grab all this information, remember: cell A24 which contains =ImportXML(H23,”//candidate”).

We’re now going to pull in more detailed campaign info for each candidate using the “Candidate Details” API request, which looks something like

(The (lack of) capitalisation of the surname of the candidate in the URL is not necessary – it seems to work okay with “Romney.xml” for example).

To construct the URLs for each candidate, let’s write a formula to construct the URL for one of them.

In my spreadsheet, the Candidate Summary import in cell A24 built a table that filled cells A24:E42, so we’re going to start working in additional columns alongside that table.

You’ll notice that the URL for the detailed report just requires the candidate’s surname, whereas the candidate summary provides the forename as well.

In cell F24, add the following formula: SPLIT(A24, “,”)

What this does is split the contents of cell A24 (e.g. Obama, Barack) at the comma, and populate two cells (in columns F and G) as a result:

If you highlight the cell (F24), click on the little blue square in the bottom right hand corner, and drag down, you should be able to magically copy the formula and so spilt all the candidates’ names.

Now what we need to do is construct the URL that will pull back the detailed campaign information for each candidate. We’ll build the URLs in the next free column along (column H in my spreadsheet – the split name is across columns F and G).

In cell H24, enter the following formula (all on one line – split here for convenience):


H$22 refers to the cell that contains your NYT API key. The $ is there to anchor the row number in the formula, so when you drag the cell to copy it, the copied formula still refers to that exact same cell.

Highlight cell H24 and drag the little blue square down to complete the table and construct URLs for each candidate.

Good stuff :-)

Now we can construct another formula to import the full campaign details for each candidate.

in cell A2 in my spreadsheet, I entered the following:


H24 is the URL for the candidate details API call for the first candidate (Barack Obama in my case).

Hopefully, you should get a row of data pulled into the spreadsheet:

The XML file responsible looks something like this:

Use your eyes and work out what the contents of each cell might refer to:

Click on cell A2, and drag the blue square down to A20. The contents of cell A2 will be copied in a relative way, and should now be importing detailed information for all the candidates.

Now you have the data in a form you can play with :-)

And it should update whenever the New York Times updates the data it exposes through the API.

(I would publish the spreadsheet in all its glory, but then I’d have to give away my API key… Ah, what the heck = here it is: Spreadsheet of Campaign Finances, courtesy of the New York Times. (Maybe I should read the T&C to check this is a fair use…!))

For other New York Times Campaign Data API functions, see the documentation. You should know enough now to be able to work out how to use it…

PS the “//candidate” XPATH stuff can be tricky to get your head round. If anyone wants to post as a comment a list of XPATH routines to pull out different data elements, feel free to do so :-)

If you make use of the data in any interesting ways, please link back here so we can see what you’re getting up to and how you’re using the data…

Getting an RSS Feed Out of a Google Custom Search Engine (CSE)

Alan posted me a tweet earlier today asking me to prove my “genius” credentials (heh, heh;-):

As far as I know, Google CSEs don’t offer an RSS output (yet: Google websearch doesn’t either, though rumour has it that it will, soon… so maybe CSEs will open up with opensearch too?)

So here’s a workaround…

If you make a query in a Google CSE – such as the rather wonderful How Do I? instructional video CSE ;-) – you’ll notice in the URL an argument that says &cx=somEGobbleDYGookNumber234sTUfF&cof….

google cse

The characters between cx= and either the end of the URL or an ampersand (&) are the ID of the CSE. In the case of How Do I?, the ID is 009190243792682903990%3Aqppoopa3lxa – almost; the “%3A” is a safe encoding for the web of the character “:”, so the actual CSE ID is 009190243792682903990:qppoopa3lxa. But we can work round that, and work with the encoded CSE ID cut straight from the URL.

Using the Google AJAX search API, you can create a query on any CSE that will return a result using the JSON format (a javascript object that can be loaded into a web page). The Google AJAX search API documentation tells you how: construct a Google AJAX web search query using the root and add a few extra arguments to pull in results from a particular CSE: Web Search Specific Arguments.

JSON isn’t RSS, but we can get it into RSS quite easily, using a Yahoo pipe…

Just paste in the ID of a CSE (or the whole results URL), add your query, and subscribe to the results as an RSS feed from the More Options menu:

The pipe works as follows…

First up, create a text box to let a user enter a CSE ID cut and pasted from a CSE results page URL (this should work if you paste in the whole of the URL of the results page from a query made on your CSE):

Then create the search query input box, and along with the CSE ID use it to create a URL that calls the Google AJAX API:

Grab the JSON data feed from the Google AJAX Search API and translate the results so that the pipe will output a valid RSS feed:

And there you have it – an RSS feed for a particular query made on a particular Google CSE can be obtained from the Get as RSS output on the pipe’s More Options menu.

Mashup Reuse – Are You Lazy Enough?

Late on Friday night, I picked up a challenge (again on Twitter) from Scott Leslie:

After a little more probing, the problem turned out to be to do with pulling a list of URLs from a page on the Guardian blog together into a playlist: The 50 greatest arts videos on YouTube.

As Scott suggested, it would have been far more useful to provide the list as a Youtube playlist. But they didn’t… So was there an easy way of creating one?

Now it’s quite possible that there is a way to programmatically create a playlist via the Youtube gdata API, but here’s a workaround that uses a Grazr widget as a player for a list of Youtube URLs:

So let’s work back from this widget to see how it was put together.

The Grazr widget works by loading in the URL of an OPML or RSS feed, or the URL of an HTML page that contains an autodiscoverable feed:

The URL loaded into the widget is this:

If we load this URL into our normal browser, (and then maybe also “View Source” from the browser Edit menu, or “Page Source” from the browser View menu) this is what the Grazr widget is consuming:

If you know anything about Grazr widgets, then you’ll maybe know that if the feed contains a media enclosure, Grazr will try to embed it in an appropriate player…

So where is the feed the Grazr widget is feeding on actually come from? The answer, a Yahoo pipe. This pipe in fact:

Let’s have a look at how it works – click on the “Edit Pipe” button (or maybe Clone the pipe first to get your own copy of it – though you’ll have to log in and/or create a Pipes user account using your Yahoo account first…):

And here’s what we find:

What this pipe does is Fetch a CSV file from another URL and rename the pipe’s internal representation of the data that was loaded in from the CSV file in such a way that the pipe now represents a valid, if not too informative, RSS feed:

The Loop element is used to run another pipe function (TinyURL to full (preview) URL), whose name suggests that it returns the target (original) URL from a TinyURL:

This pipe block is actually one I created before (and remembered creating) – if you inspect the debug output of the block, you’ll see the TinyURLs have been unpacked to Youtube video page URLs.

(If you want to explore how it works, you can find it here: TInyurl to full (preview) URL.)

The final bit of the pipe renames the attribute that was added to the pipe’s internal feed representation as an enclosure.url, and then rewrites the links from URLs that point to a Youtube video (splash) page to the URL of the video asset itself (a Flash swf file).

So that’s how the pipe works – it takes a CSV file input from somewhere on the web, and generates an RSS feed with a video file enclosure that can be played in a Grazr widget.

So where does the CSV file come from? If we look at the URL that the CSV block is loading in in the pipe we can find out:

Here’s the spreadsheet itself:

And the cells we want are cells A30:A82:

TinyURLs :-)

But where are they coming from?

Hmm, they are CONTINUEing in from cell A1:

The little orange square in the top right of cell A1 in the spreadsheet shows a formula is being used in that cell…

So let’s see what the formula is:

Here it is:

I recognise that URL! ;-) So what this formula does is is load in the The 50 greatest arts videos on YouTube page from the Guardian website, and then pull out all the anchor tags – all the <a> tags… which happen to include the links to the movies which we found at cells A30:A82.

Just looking back at the original page, here’s what it looked like:

And here’s what it really looks like if we view the page source:

So to recap, what appears to be going on is this:

A Google spreadsheet loads in the Guardian web page as an XML document, and strips out the URLs. The top 50 video URLs appear in contiguous cells as TinyURLs. These are published as a CSV file and consumed by a Yahoo pipe. The pipe takes the TinyURLs in the CSV feed, creates an RSS feed from them, unpacks them to their full Youtube URL form, and adds a link to the actual Youtube video asset as a feed enclosure. The feed is then rendered in a Grazr widget that automatically loads an embedded video player when it sees the Youtube video enclosure.

So how did I put this “mashup” together?

Firstly, I looked at the original page that contained the links that Scott (remember Scott?… strains of Alice’s Restaurant etc etc ;-), and knowing that I could use a Grazr widget as a player for a feed that contained Youtube movie enclosures all I had to do was get the URLs into a feed… and so I looked at the 50 top videos for cluses as to whether the links were in a form I could do something with, maybe using the Yahoo pipes ‘impoort HTML’ page scraping block; but the page didn’t look that friendly, so then I View source‘d. And the page structure didn’t look overly helpful either; but the links were there so rather than look t the page too closely, I though (on the off chance) I’d see what they looked like if I just link scraped the page. And knowing that the Google importXML function (with the //a XPATH) is a link scraper, I gave it a try; and the TinyURLs were all blocked together, so I knew I could use ’em by publishing that block of cells via a CSV file. And I remembered I’d created a TinyURL decoder block in Yahoo pipes some time ago, and I remembered creating a Youtube enclosure pipe before too, so I could crib that as well. And so it fell together…

And it fell together because I’d built reuable bits before, and I remembered where I could find them, so l took the lazy solution and wired them up together.

And I think that’s maybe what we need more of it mashups are going to become more widely used – more laziness… Many mashups are bespoke one-off’s because it’s getting easier to build “vertical” disposable mashup solutions. But sometimes there’s even easier to put together if they’re made out of Lego… ;-)

See also: Mashup Recycling: Now this is Green IT! and The Revolution Starts (near) Here