Archive for the ‘Pipework’ Category
Home again after a few very enjoyable days away at IWMW2008 in Aberdeen, and I feel like I need a way of saying thank you to the web managers’ community for allowing an academic in…heh heh ;-)
So I spent half an hour or so (no… really…;-) on the train back from the airport putting together a front end for an HEI feed autodiscovery pipe that I knocked up in one of the presentations yesterday (I was giving the talk that was on at the time my full partial attention, of course ;-) that picks up on some of the conversation that was in the air at the end of the innovation competition session (I didn’t win, of course…;-(
The context is/was a comment from Mike Ellis that HEIs et al. could start opening up their data by offering RSS feeds of news releases, job/recruitment ads and event listings, because there’s no reason not to…. So my, err, gift(?!) back to the IWMW community is a little something where UK HEI web managers can proudly show off how they’ve taken up the challenge and published a whole suite of autodiscoverable RSS feeds from their home pages ;-): UK HEI autodiscoverable feeds.
(Cynics may say that the page actually names and shames sites that don’t offer any autodiscoverable feeds; I couldn’t possibly comment… ;-)
Anyway, the pipework goes like this…
First of all I grab a feed of UK HEI homepages… There isn’t an official one, of course, so as a stopgap I’ve scraped a dodgy secondary source (safe in the expectation that Mike Ellis will have an authoritative, hacked feed available from studentviews.net sometime soon…)
All that’s required then is to pull out the link in each item, that hopefully corresponds to the HEI homepage, and use that as the focus for feed autodiscovery:
Any feed URLs that are autodiscovered are then added as elaborations to the corresponding HEI feed item. Although these elaborations aren’t exposed in the RSS feed output from the pipe, they are available in the pipe’s JSON output, so the half-hour (offline) hack on the train earlier this afternoon consumes the JSON feed and gives a quick and dirty show’n’tell display of which institutions have autodiscoverable feeds on their homepage: UK HEI autodiscoverable feeds.
Looking at a couple of comments to the post Nudge: Improving Decisions About RSS Usage, (in which Brian Kelly tabulated the provision of RSS feeds from Scottish HEIs), it seems that publicly highlighting the lack of support for feed autodiscovery can encourage people to look at their pages and add the feature… So I wonder: when IWMW comes around next year, will the phrase No autodiscoverable feeds… be missing from the UK HEI autodiscoverable feeds page, possibly in part because that page exists?!
(Note that if you use this page to test a homepage you’ve added feed autodiscovery to, there is cacheing going on everywhere so you may not see any change in the display for an hour or so… I’ll try and post a cache-resistant feed autodiscovery page over the next few days; in the meantime, most browsers glow somewhere if they load an HTML page containing autodiscoverable feeds…)
In Back from Behind Enemy Lines, Without Being Autodiscovered(?!), I described a simple service that displays the autodiscoverable RSS feeds from UK HEI homepages (it went down over the weekend as the screenscraping broke, but it’s back now and some of the ‘issues’ with some of the linkscraping has been fixed ;-)
Over the weekend, I tweaked the code and created a parallel service that displays the ‘Page Not Found’ (HTML error code 404) splash page for UK HEIs using thumbnails generated using websnapr.
You can find the service here: UK HEI “Page Not Found” pages
The page takes in a list of UK HEI homepage URLs, generates a nonsense URL off each domain, and uses that nonexistent page URL as the basis for the thumbnail screenshot.
PS Brian Kelly pinged me with a note that he’s had a UK HEI 404 viewer around for just about forever… University 404 pages rolling demo… Just by the by, the script that Brian used to scroll through the pages was the inspiration for the original “deliShow” version of feedshow (about feedshow).
A couple of days ago, Stuart pointed me to Quarkbase, a one stop shop for looking at various web stats, counts and rankings for a particular domain (here’s the open.ac.uk domain on quarkbase, for example; see also: the Silobreaker view of the OU), which reminded me that I hadn’t created a version of the media release related news stories tracker that won me a gift voucher at IWMW2008 ;-)
So here it is: OU Media release effectiveness tracker pipe.
And to make it a little more palatable, here’s a view of the same in a Dipity timeline (which will also have the benefit of aggregating these items over time): OU media release effectiveness tracker timeline.
I also had a mess around trying to see how I could improve the implementation (one was was to add the “sort by date” flag to the Google news AJAX call (News Search Specific Arguments)), but then, of course, I got sidetracked… because it seemed that the Google News source I was using to search for news stories didn’t cover the THES (Times Higher Education Supplement).
But that was a bit hit and miss, and didn’t necessarily return the most recent results… so instead I created a pipe to search over the last month of the THES for stories that mention “open university” and then scrape the THES search results page: OU THES Scraper.
If you want to see how it works, clone the pipe and edit it…
One reusable component of the pipe is this fragment that will make sure the date is in the correct format for an RSS feed (if it isn’t in the right format, Dipity may well ignore it…):
Here’s the full expression (actually, a PHP strftime expression) for outputting the date in the required RFC 822 date-time format: %a, %d %b %Y %H:%M:%S %z
To view the OU in the THES tracker over time, I’ve fed it into another Dipity timeline: OU in the THES.
(I’ve also added the THES stories to the OUseful “OU in the news” tab at http://ouseful.open.ac.uk/.)
Going back to the media release effectiveness tracker, even if I was to add the THES as another news source, the coverage of that service would still be rather sparse. For a more comprehensive version, it would be better to plug in to something like the LexisNexis API and search their full range of indexed news from newspapers, trade magazines and so on… That said, I’m not sure if we have a license to use that API, and/or a key for it? But then again, that’s not really my job… ;-)
In the previous post – OU News Tracking – I briefly described how to get a Yahoo pipe to output a “pubDate” timestamp in the “correct” RSS standard format:
Here’s the full expression (actually, a PHP strftime expression) for outputting the date in the required RFC 822 date-time format: %a, %d %b %Y %H:%M:%S %z
The Date Builder block is being applied to a particular field (cdate) in every feed item, and assigning the results to the y:published tag. But what exactly is it outputting? A special datetime object, that’s what:
The Date Builder module is actually quite flexible in what it accepts – in the above pipe, cdate contains values like “21 August 2008″, but it can do much more than that…
For example, take the case of the THES search pipe, also described in the previous post. The pipe constructs a query that searches the Times Higher from the current date back to the start of the year. Heres what the query looks like in the original search form:
And here’s what the URL it generates looks like:
sday is “start date”, emth is “end month”, and so on…
Posting the URL into the Pipe URL builder separates out the different arguments nicely:
You’ll notice I’ve hardcoded the sday and smth to the January 1st, but the other date elements are wired in from a datetime object that has been set to the current date:
Terms like “now” also work…
Taken together, the two date/time related blocks allow you to manipulate time constructs quite easily within the Yahoo pipe :-)
Idling some thoughts about what to talk about in a session the OU Library* is running with some folks from Cambridge University Library services as part of an Arcadia Trust funded project there (blog), I started wondering about how info professionals in an organisation might provide invisible support to their patrons by joining in the conversation…
*err – oops; I mentioned the OU Library without clearing the text first; was I supposed to submit this post for censor approval before publishing it? ;-)
One way to do this is to comment on blog posts, as our own Tim Wales does on OUseful.info pages from time to time (when I don’t reply, Tim, it’s because I can’t add any more… but I’ll be looking out for your comments with an eagle eye from now on… ;-) [I also get delicious links for:d to me by Keren – who’s also on Twitter – and emailed links and news stories from Juanita on the TU120 course team.]
Another way is to join the twitterati…
“Ah”, you might say, “I can see how that would work. We set up @OULibrary, then our users subscribe to us and then when they want help they can send us a message, and we can get back to them… Cool… :-)”
The way I’d see it working would be for @OULibrary, for example, to subscribe to the OU twitterati and then help out when they can; “legitimate, peripheral, participatory support” would be one way of thinking about it…
Now of course, it may be that @OULibrary doesn’t want to be part of the whole conversation (at least, not at first…), but just the question asking parts…
In which case, part of the recipe might go something like this: use the advanced search form to find out the pattern for cool uri that lets you search for “question-like” things from a particular user:
(Other queries I’ve found work well are searches for: ?, how OR when OR ? , etc.)
The query gives you something like the above, including a link to an RSS feed for the search:
So now what do we do? We set up a script that takes a list of the twitter usernames of OU folks – you know how to find that list, right? I took the easy way ;-)
Liam’s suggestion links to an XML stream of status messages from people who follow PlanetOU, so the set might be leaky and/or tainted, right, and include people who have nothing to do with the OU… but am I bovvered? ;-)
(You can see a list of the followers names here, if you log in:
Hmmm… a list of status messages from people who may have something to do with the OU… Okay, dump the search thing, how about this…
The XML feed of friends statuses appears to be open (at the moment) so just filter the status messages of friends of PlanetOU and hope that OU folks have declared themselves to PlanetOU? (Which I haven’t… ;-)
Subscribe to this and you’ll have a stream of questions from OU folks who you can choose to help out, if you want…
A couple of alternatives would be to take a list of OU folks twitter names, and either follow them and filter your own friends stream for query terms, or generate search feed URLs for all them (my original thought, above) and roll those feeds into a single stream…
In each case, you have set up where the Library is invisibly asking “can I help you?”
Now you might think that libraries in general don’t work that way, that they’re “go to” services who help “lean forward” users, rather than offering help to “lean back” users who didn’t think to ask the library in the first place (err…..?), but I couldn’t possibly comment…
PS More links in to OU communities…
which leads to:
PPS (March 2011) seems like the web ha caught up: InboxQ
Jane’s list of “100+ (E-)Learning Professionals to follow on Twitter” (which includes yours truly, Martin and Grainne from the OpenU :-) has been doing the rounds today, so in partial response to Tony Karrer asking “is there an equivalent to OPML import for twitter for those of us who don’t want to go through the list and add people one at a time?”, I took an alternative route to achieving a similar effect (tracking those 100+ e-learning professionals’ tweets) and put together a Yahoo pipe to produce an aggregated feed – Jane’s edutwitterers pipe…
Scrape the page and create a semblance of a feed of the edutwitterers:
Tidy the feed up a bit and make sure we only include items that link to valid twitter RSS feed URLs (note that the title could do with a little more tidying up…) – the regular expression for the link creates the feed URL for each edutwitterer:
Replace each item in the edutwitterers feed with the tweets from that person:
From the pipe, subscribe to the aggregated edutwitters’ feed.
Note, however, that the aggregated feed is a bit slow – it takes time to pull out tweets for each edutwitterer, and there is the potential for feeds being cached all over the place (by Yahoo pipes, by your browser, or whatever you happen to view the pipes output feed etc. etc.)
A more efficient route might be to produce an OPML feed containing links to each edutwitterer’s RSS feed, and then view this as a stream in a Grazr widget.
Creating the OPML file is left as an exercise for the reader (!) – if you do create one, please post a link as a comment or trackback… ;-) Here are three ways I can think of for creating such a file:
- add the feed URL for each edutwitter as a separate feed in an Grazr reading list (How to create a Grazr (OPML) reading list). If you don’t like/trust Grazr, try OPML Manager;
- build a screenscraper to scrape the usernames and then create an output OPML file automatically;
- view source of Jane’s orginal edutwitterers page, cut out the table that lists the edutwitterers, paste the text into a text editor and work some regular ecpression ‘search and replace’ magic; (if you do this, how about posting your recipe/regular expressions somewhere?!;-)
Enough – time to start reading Presentation Zen…
Prompted in part by a presentation I have to give tomorrow as an OU eLearning community session (I hope some folks turn up – the 90 minute session on Mashing Up the PLE – RSS edition is the only reason I’m going in…), and in part by Scott Leslie’s compelling programme for a similar duration Mashing Up your own PLE session (scene scetting here: Hunting the Wily “PLE”), I started having a tinker with using Google spreadsheets as for data table screenscraping.
So here’s a quick summary of (part of) what I found I could do.
The Google spreadsheet function =importHTML(“”,”table”,N) will scrape a table from an HTML web page into a Google spreadsheet. The URL of the target web page, and the target table element both need to be in double quotes. The number N identifies the N’th table in the page (counting starts at 0) as the target table for data scraping.
So for example, have a look at the following Wikipedia page – List of largest United Kingdom settlements by population (found using a search on Wikipedia for uk city population – NOTE: URLs (web addresses) and actual data tables may have changed since this post was written, BUT you should be able to find something similar…):
Grab the URL, fire up a new Google spreadsheet, and satrt to enter the formula “=importHTML” into one of the cells:
Autocompletion works a treat, so finish off the expression:
And as if by magic, a data table appears:
All well and good – if you want to create a chart or two, why not try the Google charting tools?
Where things get really interesting, though, is when you start letting the data flow around…
So for example, if you publish the spreadsheet you can liberate the document in a variety of formats:
As well publishing the spreadsheet as an HTML page that anyone can see (and that is pulling data from the WIkipedia page, remember), you can also get access to an RSS feed of the data – and a host of other data formats:
See the “More publishing options” link? Lurvely :-)
Let’s have a bit of CSV goodness:
Why CSV? Here’s why:
(NOTE – Google spreadsheets’ CSV generator can be a bit crap at times and may require some fudging (and possibly a loss of data) in the pipe – here’s an example: When a Hack Goes Wrong… Google Spreadsheets and Yahoo Pipes.)
Unfortunately, the *’s in the element names mess things up a bit, so let’s rename them (don’t forget to dump the original row of the feed (alternatively, tweak the CSV URL so it starts with row 2); we might as well create a proper RSS feed too, by making sure we at least have a title and description element in there:
Make the description a little more palatable using a regular expression to rewrite the description element, and work some magic with the location extractor block (see how it finds the lat/long co-ordinates, and adds them to each item?;-):
Geocoding in Yahoo Pipes is done more reliably through the following trick – replace the Location Builder block with a Loop block into which you should insert a Location Builder Block
The location builder will look to a specified element for the content we wish to geocode:
The Location Builder block should be configured to output the geocoded result to the y:location element. NOTE: the geocode often assumes US town/city names. If you have a list of town names that you know come from a given country, you may wish to annotate them with a country identify before you try to geocode them. A regular expression block can do this:
This block says – in the title element, grab a copy of everything – .* – into a variable – (.*) – and then replace the contents of the title element with it’s original value – $1 – as well as “, UK” – $1, UK
Note that this regular expression block would need to be wired in BEFORE the geocoding Loop block. That is, we want the geocoder to act on a title element containing “Cambridge, UK” for example, rather than just “Cambridge”.
And to top it all off:
And for the encore? Grab the KML feed out of the pipe:
So to recap, we have scraped some data from a wikipedia page into a Google spreadsheet using the =importHTML formula, published a handful of rows from the table as CSV, consumed the CSV in a Yahoo pipe and created a geocoded KML feed from it, and then displayed it in a
PS If you “own” the web page that a table appears on, there is actually quote a lot you can do to either visualise it, or make it ‘interactive’, with very little effort – see Progressive Enhancement – Some Examples and HTML Tables and the Data Web for more details…
PPS for a version of this post in German, see: http://plerzelwupp.pl.funpic.de/wikitabellen_in_googlemaps/. (Please post a linkback if you’ve translated this post into any other languages :-)
PPPS this is neat – geocoding in Google spreadsheets itself: Geocoding by Google Spreadsheets.
PPPS Once you have scraped the data into a Google spreadsheet, it’s possible to treat it as a database using the QUERY spreadsheet function. For more on the QUERY function, see Using Google Spreadsheets Like a Database – The QUERY Formula and Creating a Winter Olympics 2010 Medal Map In Google Spreadsheets.