OUseful.Info, the blog…

Trying to find useful things to do with emerging technologies in open education

Archive for July 2009

First Dabblings With @daveyp’s MOSAIC Library Competition Data API

with 3 comments

A couple of days ago, Dave Pattern published a simple API to the JISC MOSAIC project HE library loans data that has been opened up as part of a developer competition running over the summer (Simple API for JISC MOSAIC Project Developer Competition data [competition announcement]).

The API offers a variety of simple queries on the records data (we like simple queries:-) that allow the user to retrieve records according to the ISBN of the book that was borrowed, the books that were taken out by people on a particular course (using a UCAS course code identifier), or a combination of the two.

The results file from a search is returned as an XML file containing zero or more results records; each record has the form:

<useRecord row="19">
 <from>
  <institution>University of Huddersfield</institution>
  <academicYear>2008</academicYear>
 </from>
 <resource>
  <media>book</media>
  <globalID type="ISBN">1856752321</globalID>
  <author>Mojay, Gabriel.</author>
  <title>Aromatherapy for healing the spirit : a guide to restoring emotional and mental balance through essential oils /</title>
  <localID>582083</localID>
  <catalogueURL>http://library.hud.ac.uk/catlink/bib/582083</catalogueURL>
  <publisher>Gaia</publisher><published>2005</published>
 </resource>
 <context>
  <courseCode type="ucas">B390</courseCode>
  <courseName>FdSc Holistic Therapies</courseName>
  <progression>UG1</progression>
 </context>
</useRecord>

[How to style code fragments in a hosted WordPress.com blog, via WordPress Support: "Code" shortcode]

(For a more complete description of the record format, see Mosaic data collection – A guide [PDF].)

As a warm up exercise to familiarise myself with the data, I did a proof of concept hack (in part inspired by the Library Traveller script) that would simply linkify course codes appearing on a UCAS courses results page such that each course code would link to a results page listing the books that had been borrowed by students associated with courses with a similar course code in previous years.

Looking at the results page, we can see the course code appears next to the name of each course:

A simple bookmarklet can be used to linkify the qualification code so that it points to the results of a query on the MOSAIC data with the appropriate course code:

javascript:(function(){
 /* Regex literal: inside a quoted string the backslashes in "\(...\)" are dropped, so the parentheses only ever acted as a capture group */
 var regu=/([A-Z0-9]{4})/;
 var s=document.getElementsByTagName('span');
 for (var i=0;i<s.length;i++){
  if (s[i].getAttribute('class')=='bodyTextSmallGrey'){
   var id=regu.exec(s[i].innerHTML);
   /* id is null when the span holds no four-character code */
   if (id){
    s[i].innerHTML=s[i].innerHTML.replace(regu,
     "<a href=\"http://library.hud.ac.uk/mosaic/api.pl?ucas="
     +id[1]+"\">"+id[1]+"</a>");}}}})()

(It’s trivial to convert this to a Greasemonkey script: simply add the required Greasemonkey header and save the file with an appropriate filename – e.g. ucasCodeLinkify.user.js)

Clicking on the bookmarklet linkifies the qualification code to point to a search on http://library.hud.ac.uk/mosaic/api.pl?ucas= with the appropriate code.
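As a rough sketch of what that conversion might look like: the metadata block below follows the standard Greasemonkey header format, but the @namespace and @include values are placeholder assumptions (the real address pattern of the UCAS results pages would need to go there), and linkifyCourseCode is a hypothetical helper name rather than anything from the original bookmarklet.

```javascript
// ==UserScript==
// @name        UCAS course code linkifier (sketch)
// @namespace   http://example.org/ucas-mosaic
// @include     http://example.org/ucas-results/*
// ==/UserScript==

// NOTE: the @namespace and @include values above are placeholders (assumptions).

var MOSAIC_STUB = "http://library.hud.ac.uk/mosaic/api.pl?ucas=";

// Wrap the first four-character code found in a fragment of HTML in a link
// built from the given URI stub; return the fragment unchanged if none is found.
function linkifyCourseCode(html, stub) {
  var match = /([A-Z0-9]{4})/.exec(html);
  if (!match) return html;
  var code = match[1];
  var link = '<a href="' + stub + encodeURIComponent(code) + '">' + code + '</a>';
  // A replacer function avoids any special handling of "$" in the link text.
  return html.replace(code, function () { return link; });
}

// Only touch the page when actually running inside a browser.
if (typeof document !== "undefined") {
  var spans = document.getElementsByTagName("span");
  for (var i = 0; i < spans.length; i++) {
    if (spans[i].getAttribute("class") === "bodyTextSmallGrey") {
      spans[i].innerHTML = linkifyCourseCode(spans[i].innerHTML, MOSAIC_STUB);
    }
  }
}
```

Keeping the string handling in its own function means it can be tried out away from the page, and the URI stub can be swapped without touching the loop.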

To make the results a little friendlier, I created a simple Yahoo pipe that generates an RSS feed containing a list of books (along with their book covers) that had been borrowed more than a specified minimum number of times by people associated with that course code in previous years.

To start with, create the URI with a particular UCAS qualification code to call the web service and pull in the XML results feed:

Map elements onto legitimate RSS feed elements:

and strip out duplicate books. Note that the pipe also counts how many duplicate (?!) items there are:

Now filter the items based on the duplication (replication?) count – we want to only see books that have been borrowed at least a minimum number of times (this rules out ‘occasional’ loans on unrelated courses by single individuals – we only want the popular or heavily borrowed books associated with a particular UCAS qualification code in the first instance):
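For anyone who would rather see the dedupe, count and filter steps as code than as pipe blocks, here is a minimal sketch. It assumes each loan record has already been reduced to an object with isbn and title properties (hypothetical field names), and that books are grouped by ISBN, which may not be exactly how the pipe does it.

```javascript
// Collapse duplicate loan records into one entry per book, counting how many
// times each book appears, then keep only books seen at least minLoans times.
function popularBooks(records, minLoans) {
  var byIsbn = new Map();
  records.forEach(function (rec) {
    if (!byIsbn.has(rec.isbn)) {
      byIsbn.set(rec.isbn, { isbn: rec.isbn, title: rec.title, count: 0 });
    }
    byIsbn.get(rec.isbn).count += 1;
  });
  return Array.from(byIsbn.values())
    .filter(function (book) { return book.count >= minLoans; })
    .sort(function (a, b) { return b.count - a.count; }); // most borrowed first
}
```

The minLoans argument plays the same role as the minimum count in the pipe: raising it trims the list down to the more heavily borrowed titles.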

Finally, create a link to each book on Google books, and grab the book cover. Note that I assume the global identifier is an ISBN10… (if it’s an ISBN13, see Looking Up Alternative Copies of a Book on Amazon, via ThingISBN for a defensive measure using the LibraryThing Check ISBN API. That post also points towards a way by which we might find other courses that are associated with different editions of a particular book… ;-)
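Since the pipe simply assumes the identifier is an ISBN10, a local sanity check may be useful before building any links. The sketch below uses the standard ISBN-10 and ISBN-13 check digit rules; it is not the LibraryThing API, and the function names are my own.

```javascript
// True if the string is a well-formed ISBN-10 with a correct check digit.
// The digits are weighted 10 down to 1 and the total must divide by 11;
// a final "X" stands for the value 10.
function isValidIsbn10(isbn) {
  if (!/^[0-9]{9}[0-9X]$/.test(isbn)) return false;
  var sum = 0;
  for (var i = 0; i < 10; i++) {
    var ch = isbn.charAt(i);
    var value = ch === "X" ? 10 : parseInt(ch, 10);
    sum += value * (10 - i);
  }
  return sum % 11 === 0;
}

// Convert an ISBN-10 to ISBN-13: prefix 978, drop the old check digit and
// compute a new one using alternating weights of 1 and 3.
function isbn10to13(isbn10) {
  var core = "978" + isbn10.slice(0, 9);
  var sum = 0;
  for (var i = 0; i < 12; i++) {
    sum += parseInt(core.charAt(i), 10) * (i % 2 === 0 ? 1 : 3);
  }
  return core + String((10 - (sum % 10)) % 10);
}
```

For the ISBN in the sample record above, isValidIsbn10("1856752321") returns true and isbn10to13("1856752321") gives "9781856752329".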

http://pipes.yahoo.com/pipes/pipe.info?_id=HqJfbDR93hGE5DAewmH_9A

You can find the pipe here: MOSAIC data: books borrowed on particular courses.

If we now change the linkifying URL to the RSS output of the pipe, we can provide a link for each course on the UCAS course search results page that points to an RSS feed of reading material presumably popularly associated with the course in previous years (note, however, that not all codes have books associated with them).

To do this, simply change the following URI stub in the bookmarklet:
http://library.hud.ac.uk/mosaic/api.pl?ucas=
to
http://pipes.yahoo.com/pipes/pipe.run?_id=HqJfbDR93hGE5DAewmH_9A&_render=rss&min=3&cc=
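Put another way, the two link targets differ only in the stub and the extra min parameter. A small sketch, using the two URIs given above (the function names are made up for illustration):

```javascript
var MOSAIC_API = "http://library.hud.ac.uk/mosaic/api.pl?ucas=";
var PIPE_RSS = "http://pipes.yahoo.com/pipes/pipe.run?_id=HqJfbDR93hGE5DAewmH_9A&_render=rss";

// Link to the raw XML results for a UCAS course code.
function mosaicApiUrl(code) {
  return MOSAIC_API + encodeURIComponent(code);
}

// Link to the RSS reading list for a course code, showing only books
// borrowed at least minLoans times.
function readingListFeedUrl(code, minLoans) {
  return PIPE_RSS + "&min=" + encodeURIComponent(minLoans) +
    "&cc=" + encodeURIComponent(code);
}
```

With min fixed at 3, as in the stub above, readingListFeedUrl("B390", 3) produces the same address the modified bookmarklet would link to for course B390.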

The “popular related borrowings” reading list generated now allows a potential student to explore some of the books associated with a particular course at decision time:-)

One possible follow on step would be to look up other courses related to each original course by virtue of several people having borrowed the same book (or other editions of it) on other courses. Can you see how you might achieve that, or at least what sorts of steps you need to implement?
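One possible shape for an answer, as a sketch only: index the loan records by book, then count how many distinct books each other course shares with the starting course. It assumes records reduced to objects with isbn and courseCode properties (hypothetical names), it counts shared titles rather than individual borrowers, and it ignores other editions of the same book.

```javascript
// Find courses that share at least minShared borrowed books with courseCode.
function relatedCourses(records, courseCode, minShared) {
  // Step 1: map each ISBN to the set of course codes it was borrowed on.
  var coursesByIsbn = new Map();
  records.forEach(function (rec) {
    if (!coursesByIsbn.has(rec.isbn)) coursesByIsbn.set(rec.isbn, new Set());
    coursesByIsbn.get(rec.isbn).add(rec.courseCode);
  });
  // Step 2: for every book borrowed on the starting course, credit each
  // other course that also borrowed it.
  var shared = new Map();
  coursesByIsbn.forEach(function (courses) {
    if (!courses.has(courseCode)) return;
    courses.forEach(function (other) {
      if (other !== courseCode) shared.set(other, (shared.get(other) || 0) + 1);
    });
  });
  // Step 3: keep the courses with enough overlap, strongest overlap first.
  return Array.from(shared.entries())
    .filter(function (entry) { return entry[1] >= minShared; })
    .map(function (entry) { return { courseCode: entry[0], sharedBooks: entry[1] }; })
    .sort(function (a, b) { return b.sharedBooks - a.sharedBooks; });
}
```

Against the live API, step 1 would presumably mean one ISBN query per book returned for the original course code, so it would be worth applying the minimum loan count first.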

PS If anyone would like to work this recipe up properly as (part of) a competition entry, feel free, though I’d appreciate a cut of any prize money if you win;-)

Written by Tony Hirst

July 31, 2009 at 12:31 am

Posted in Library, Pipework, Tinkering


IWMW Mashups Round the Edges: Scraping Tables

with 5 comments

Even though most people, most of the time, don’t set out to publish data in a semantically structured way on the web, it’s often possible to derive the semantics from the HTML elements used to represent the content within the page.

The most obvious example of this is the HTML table (see also: HTML Tables and the Data Web). Whilst often misused as a way of organising the visual layout of content on a page by designers who don’t yet understand the DOM or CSS (?!;-), tables are more appropriately used as containers for tabulated data.

They are also eminently amenable to screenscraping, if they are used properly…

So in this post I’ll show you three ways of scraping tabular data from a web page without having to leave the comfort of your own browser, just by using freely available online applications (although you may need to set up a user account with Yahoo and/or Google to play with them…):

- table scraping with YQL;
- table scraping with Yahoo pipes;
- table scraping with Google docs (in fact, with Google spreadsheets).

To illustrate table scraping with each of these tools, we’ll use the page at http://www3.open.ac.uk/study/undergraduate/science/courses/index.htm:

If we look at the HTML source of the page, we see that it contains at least one table:

So let’s see how we can scrape that information about courses out of the web page, so that we can republish it elsewhere.

Table Scraping With YQL

First up, YQL, the Yahoo Query Language. Using the html datatable, it’s easy enough to just grab the HTML table out of the page as is, to allow us to republish it elsewhere.

Inspection of the HTML source of the original page shows that the table containing the course information is identified as follows:

<table class="courses">

The following query trivially extracts the table from the original page:

select * from html where url="http://www3.open.ac.uk/study/undergraduate/science/courses/index.htm" and xpath='//table[@class="courses"]'

The HTML is directly available in the query results field:

This table could then be redisplayed elsewhere, or processed using a YQL Execute script.

Table Scraping With Yahoo Pipes

Another simple screen scraping route can be achieved using Yahoo Pipes. In this example, we’ll start to explore how we can get a bit more of a handle on the information contained in each row, although as we’ll see, Pipes is not necessarily the ideal environment for doing this in this particular case.

Whilst Pipes started out as a drag and drop environment for processing RSS feeds, a Pipe can also import an HTML page at the start of the pipe and then process it in order to produce an RSS feed output.

To get the HTML page into the pipe, we use a Fetch Page block from the Sources menu:

This gives us separate items within the pipe that correspond to each row in the first table in the page. Let’s just tidy up those elements a little bit using the regular expression block:

And a bit more tidying – filter out the first couple of rows:

We’re now going to create a proper RSS feed, with title, link and description elements, so start out by setting each of them to be a copy of the content attribute in the internal feed representation:

Now we use some more regular expressions to define those elements appropriately:

Note that we could further process the description element and maybe reintroduce semantics, for example by specifically and explicitly identifying the level, number of points, next start date, and cost.

The resulting RSS feed output from the pipe can now be used to syndicate details of the OU’s science courses, information that was originally locked down to only appear in the original web page, or, by using the YQL trick shown above, as a simple republication of the original table.

Now I know that the semantic web purists will already be up in arms about the way the semantics were decoupled from the data at the start of the pipe, and that we’re now considering arbitrarily reintroducing them in a completely cavalier way. But so what..? We have to be pragmatic. Yes, we’d need to keep checking that the fields hadn’t changed, but this is always a risk when we’re screen scraping.

But are there any better ways of scraping this data that maybe preserve the columnar ordering in a more semantically transparent way? Maybe…

Table Scraping With Google Spreadsheets

One of the common features of the online spreadsheet offerings from Google and Zoho, as well as the online database service DabbleDB, is the ability to import tabular data scraped from a web page directly into a spreadsheet, given the URL of the original web page.

In Google spreadsheets, this is as simple as using a formula along the lines of:

=ImportHtml("http://www3.open.ac.uk/study/undergraduate/science/courses/index.htm","table",1)

And the result?

What’s particularly handy about this representation is that we have access to the semantics of each cell value via the column headings.
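To make the point about column headings concrete, here is the same idea done by hand: turn the first row of a table into property names and every later row into a record keyed by them. This is a deliberately naive sketch that picks rows and cells out with regular expressions, so it assumes a simple, well-behaved table with no nested tables or spanning cells; tableToRecords and cellText are names I have made up.

```javascript
// Strip any markup from a cell and tidy up the whitespace.
function cellText(cellHtml) {
  return cellHtml.replace(/<[^>]*>/g, "").replace(/\s+/g, " ").trim();
}

// Convert the HTML of a simple table into an array of records, using the
// cells of the first row as the property names for every later row.
function tableToRecords(tableHtml) {
  var rows = tableHtml.match(/<tr\b[^>]*>[\s\S]*?<\/tr>/gi) || [];
  var parsed = rows.map(function (row) {
    var cells = row.match(/<t[hd]\b[^>]*>[\s\S]*?<\/t[hd]>/gi) || [];
    return cells.map(cellText);
  });
  if (parsed.length === 0) return [];
  var headings = parsed[0];
  return parsed.slice(1).map(function (cells) {
    var record = {};
    headings.forEach(function (heading, i) {
      // Short rows get an empty string for any missing cell.
      record[heading] = cells[i] !== undefined ? cells[i] : "";
    });
    return record;
  });
}
```

Each value can then be addressed by its heading rather than by its position in the row, which is exactly what the spreadsheet import gives us for free.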

We can see this by viewing the output of the CSV representation of the spreadsheet in a Yahoo Pipe:

(Note that we can’t actually rename these columns directly within the Pipe because the names aren’t well formed and Pipes can’t handle them. But there are workarounds – such as ignoring the first row and defining our own column names… though again, this breaks the direct semantic linkage between the original column name and the pipe’s internal element naming.)

One of the little known features of Google spreadsheets (for the moment, at least), is an API that allows a user to construct queries on the data contained within the spreadsheet. In actual fact, there are two APIs; a GData spreadsheets API, and a ‘datasource’ API defined as a Google Query Language component of the Google Visualization API; it’s the latter, the Google Query Language service we’ll play with :-)

One of the easiest ways that I know of to get started with writing GQL queries (which can all be expressed as RESTful URI calls to a web service that exposes the query language interface to a particular document or data source) is to use the Datastore Explorer that I put together to help people write simple queries on spreadsheets collected together in the Guardian datastore.

The user interface is not the best designed interface you’ve ever seen, but once you get to grips with it, it can help you put together query URIs quite quickly. (If anyone is interested in working on an open source project with me to try and build a proper system, please drop me an email:-)

Go to the datastore explorer, paste in the URI of the spreadsheet you want to explore and click on Preview table headings. The key for the spreadsheet will be extracted and used to query the spreadsheet. Note that you may also need to enter the sheet number separately. (If there is a gid= argument in the URI, then that’s a big giveaway of the sheet number;-)

You can now start to build up a query to explore the data source, such as the following:

Preview URLs are also generated that will display the table as an HTML table, or as a CSV file.
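If you would rather build such a query URI yourself, the sketch below shows the general shape. Treat the address format as an assumption: it uses the docs.google.com/spreadsheets/d/KEY/gviz/tq form of the Visualization API query endpoint, which may not match the URIs the explorer generates, and spreadsheetQueryUrl is a hypothetical helper name.

```javascript
// Build a query URI for a Google spreadsheet.
//   key    - the spreadsheet key taken from the spreadsheet's own URI
//   query  - a query language statement; columns are referred to by letter
//   format - "csv" or "html"
//   gid    - the sheet number (the gid= argument in the spreadsheet URI)
function spreadsheetQueryUrl(key, query, format, gid) {
  return "https://docs.google.com/spreadsheets/d/" + encodeURIComponent(key) +
    "/gviz/tq?tqx=out:" + encodeURIComponent(format) +
    "&tq=" + encodeURIComponent(query) +
    "&gid=" + encodeURIComponent(gid);
}
```

For example, spreadsheetQueryUrl("KEY", "select A, B where C > 30", "csv", 0) asks for columns A and B of the first sheet as CSV, limited to rows where column C is greater than 30 (KEY stands in for a real spreadsheet key).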

So for example, here is the HTML preview table for the above query:

Using tools such as the datasource explorer, it’s possible to generate queries on Google spreadsheets that contain data that has been imported into the spreadsheet via a simple table screenscraping function.

What this means in practice is: if you’ve ever published tabular information in a simple HTML table, it’s easy enough for a third party to treat that table as a database table :-)

Written by Tony Hirst

July 28, 2009 at 2:47 pm

Posted in Tinkering


IWMW Mashups Round the Edges: YQL, Microformats and Structured Data


Whenever you publish HTML on the web, there is a possibility that someone, somewhere, will think that the information on your page is so interesting or valuable that they want to republish a particular part of it elsewhere.

So for example, if you publish a list of calendar events on a webpage, there may well be someone out there who sees the benefit of making that list more widely available, for example by including the dates in another, more comprehensive aggregating calendar. (A great example of this is Jon Udell’s Elm City project; in an academic context, see Jim Groom’s Aggregating Google Calendars.)

In the case of calendar dates, if you’re feeling helpful you can publish the calendar events using a syndication format, such as iCal. If you’re feeling unhelpful, you can write a mess of HTML with no particular structure, putting dates all over the web page in a variety of formats and using a variety of HTML markup. And somewhere in-between, you could publish the information in a semantically meaningful way, where the HTML structure can be used to identify the different components of an event record (event name, date, location, and so on).

Why would you do this? Well, if semantics are included in the page structure, CSS styling might be able to reveal that meaning in an appropriate presentational way, which makes life as a web page designer easier. And depending on how you semantically mark-up your web page, browser add-ons might be able to exploit that structure to provide additional user functionality, such as adding selected calendar dates on a web page to a personal calendar.

Semantics can be added to calendar information in a web page in an informal way as tabulated data, where separate columns in an HTML table might identify things like the event name, date, location, etc. Semantics can also be associated with each element in an event record using a standardised markup convention such as the hCalendar microformat.

A good example of microformats in action can be found on the University of Bath Semester Timetable:

Here’s how each cell (each ‘event’) is represented in the table using hCalendar:

Because hCalendar is a recognised format (by some people at least!), several tools already exist for scraping it in an efficient way from a web page. For example, Brian Suda’s X2V service, “a BETA implementation of an XSLT file to transform and hCa* encoded XHTML file into the corresponding vCard/iCalendar file”.

Generating the iCal feed at the click of a button gives me something I can subscribe to in my desktop calendar:

And here it is:

Brian’s approach is based on the use of XSLT to extract the microformatted data from the page and represent it. Essentially, using microformats ‘in page’ allows pre-defined screenscraping utilities to effectively implement an API on top of the page that exposes particular data contained within it. The W3C GRDDL Recommendation generalises this sort of approach.

Another, more recent take on scraping conventionally (or consistently) marked up information from web pages is Yahoo’s YQL. For some time, Yahoo have been providing tools and utilities for scraping structured data from webpages so that it can be used to augment search results listings (Yahoo SearchMonkey), but YQL takes this a whole step further.

YQL offers a SQL like query language that provides a search query console over the web, as well as individual pages on the web. So for example, we can scrape all the microformatted entries from a webpage using a query of the form:

Here’s the result:

A RESTful URI can be constructed to run this query and return the results as XML or JSON, which can then be used elsewhere.
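A sketch of how such a URI might be put together, with some loud caveats: the endpoint shown is the public YQL address as I remember it (http://query.yahooapis.com/v1/public/yql), the microformats table name is likewise from memory, the page address is a placeholder, and Yahoo has since retired YQL, so read this as an illustration of the URI's shape rather than something to run.

```javascript
// Assumed public YQL endpoint (the service has since been retired).
var YQL_ENDPOINT = "http://query.yahooapis.com/v1/public/yql";

// Build the RESTful URI for a YQL statement; format is "xml" or "json".
function yqlUrl(query, format) {
  return YQL_ENDPOINT + "?q=" + encodeURIComponent(query) +
    "&format=" + encodeURIComponent(format);
}

// A query for all the microformatted entries on a page, assuming YQL's
// microformats table.
function microformatsQuery(pageUrl) {
  return "select * from microformats where url='" + pageUrl + "'";
}
```

So yqlUrl(microformatsQuery("http://example.org/timetable"), "json") would ask for the microformatted entries on that (placeholder) page as JSON; any other YQL statement, such as one using the html table with an XPath expression, can be passed to yqlUrl in the same way.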

(Note that YQL can be used equally well to scrape more loosely structured pages – XPATH and CSS Selector statements can both be used in a YQL query to extract the part of the page you want to get hold of.)

As well as microformats, YQL also sees a wide range of other content as queryable “datatables on the web”, and provides a way for developers to define their own datatable interfaces to their own web pages, or equally pages on third party sites.

YQL Execute extends the power of YQL further by allowing the developer to run server-side Javascript programmes that will process the results obtained from a YQL query howsoever they want (With YQL Execute, the Internet becomes your database). But that’s for another time… ;-)

Written by Tony Hirst

July 28, 2009 at 2:45 pm

Posted in Tinkering


Social Media Releases and the University Press Office

with 10 comments

A couple of weeks ago, the BBC launched the production of a new OU/BBC series about the history of the web (The Web at 20: Digital Revolution).

Looking over the OU’s recent OU/BBC related media releases, I can’t see anything mentioning the launch event, although it did get a fair amount of coverage from several of the BBC’s technology bloggers.

My impression after the event was that here, if ever, was an opportunity to have produced a social media news release, a media release that makes content available in a form that bloggers and other online publishers can readily pillage for freely licensed embedded videos, images, social bookmarking buttons and related links to fill out their own post.


[Social media release template, from PR-Squared]

Now I haven’t done a trawl of Higher Education media sites to see how many have started experimenting with social media releases, though I do know that most HEIs don’t publish autodiscoverable RSS news feeds from their homepage – only 30.8% (41 out of 133 institutions) published any autodiscoverable feeds when I last checked using the UK HEI autodiscoverable feeds link from Back from Behind Enemy Lines, Without Being Autodiscovered(?!). I’m not sure what the current state of play is with official university Youtube channels, though. My round-up of UK HEI official Youtube channels is probably rather out of date now.

On the government department front, there has been a little more exploration. A couple of months ago, Steph Gray reviewed the first Baby steps in Social Media News Releases that were being made by the now deprecated(?!) Department for Innovation, Universities and Skills (DIUS), and helpfully identified some of the issues involved in piecing together and measuring the effectiveness of a social media release. Now part of BIS, that team’s experiments with social media releases continue, as exemplified by the recent social media release for a white paper on A Better Deal for Consumers – Delivering Real Help Now and Change for the Future.

(For more general examples of social media news releases, there are plenty of links on Social Media News Room Examples. In the government sphere, Snapshot of UK govnt use of social tools – and Press Office involvement also has a round up of related activity from a couple of months or so ago.)

However, whilst reading @neillyneil’s How to write a corporate Twitter strategy (…and here’s one I made earlier) yesterday, a great summary of what’s involved in comms related tweeting, that also includes a round-up of current UK gov and traditional media twitterers, it struck me that maybe we don’t need social media releases at all, as such? Wouldn’t a social media release theme on a platform such as WordPress work equally well? That is, why should the press release look like the half finished article? Indeed, is there a reason for having press releases at all? And if there is, what’s the best way of releasing them? Live feeds are one way, but I doubt that any HEI press offices have adopted them yet? (For a related example, see GovFresh, a site that collects together US Government live feeds.)

Or do institutions just need a social media strategy, with an aggregation site (potentially partitioned into different topic, or category, areas) that at any given instant in time essentially acts as a social media release for whatever is newsworthy at the time? In the same way that the Digital Revolution TV production will (allegedly) be producing rushes of content as the programme is produced, an institution’s social media news site could provide rushes of content – delivered via a templated theme – that can be taken and reworked by others. The release isn’t a thing in its own right – it’s the current state of the social media news site, the result of newsmastering by the institution’s press office of items related to newsworthy activity in the institution itself.

Wrapped up with that could be a press office strategy for news that breaks out on the web related to a particular institution. For example, here’s how the US Air Force apparently handle outbreaks of news about the USAF in the blogosphere:

In related news(?! ;-), the reinvention (or not?!) of the academic article continues: Elsevier’s Article of the Future pilot (don’t you just love that press release?!;-) demos one possible way of presenting an interactive journal article. (Here’s a different example from a couple of months ago: Academia 2.0: What Would a Fully Interactive Journal Article Look Like.)

Again, I can’t help wondering whether research schools should have a rolling social media news site that provides a living social media release about the most recent or exciting research going on in the institution at the current time that may or may not include snippets taken from ‘interactive’ journal articles?

Written by Tony Hirst

July 23, 2009 at 1:03 pm

Posted in OU2.0

Brand Association and Your Twitter Followers

with 15 comments

One of the things that Martin appears to have been thinking a lot about lately is metrics for rating ‘digital scholars’, i.e. those of us who don’t do any of the reputation bearing things that traditional academics do (though whether that’s because we’re not very good at those things is not for me to say;-)

So for example, in The Keynote Equivalent? he reviews the notions of reputation and impact factor, and in Connectors versus outputs he calls on some really dubious social media buzz metrics to raise the far more valid issue of how we measure influence within, and value the contributions made to, a social network (peer community?) in order to recognise the extent of someone’s influence within that network from outside of it.

Using Twitter as a base case, one of the many interesting features of Twitter’s ‘open privacy’ model is that in most cases it’s possible for you to look at someone else’s followers to see who they are.

The value of that network to an individual is at least twofold – firstly, as a source of information, observations, news and feedback to you as the person at the centre of your own network; secondly as an amplifier of your own ego broadcast messages. (There are other benefits of course – like being able to see who is talking to whom about what.) You may also feel there is some benefit to just having a large number of followers, if only in the bragging stakes.

That is, the more followers the better, right? It’s bound to be good for my reputation, if nothing else, surely…?

Well….. maybe not…?

Consider these two questions:

- who follows you? if I look at your followers what can I tell about you, from them?
- what is your blocking policy? who you block is just as much a part of the way you manage your network as the people you actively follow.

As far as my own Twitter network goes, I am on a follow:followed ratio of about 1:4. That is, approximately four times as many people follow me as I follow back. For every 10 or so new followers I get, I block one or two.

I check my followers list maybe once every two or three days, which lets me keep up with the pruning on just one or two screens of followers using the Twitter web interface. If the name or avatar is suspect, I’ll check out the tweets to see if I want to block. (I really miss the ability to hover over a person’s name and get a tooltip containing their bio:-( If the name or avatar is familiar or intriguing, I’ll check the tweets to see if I’m going to follow back (maybe 1 in 20? Following back is not the main source for me of new people to follow – you’ll have to get to me another way;-).

The people I block? People whose tweets are never replies, but who just tweet out advertising links all the time; Britney, whatever she happens to be sucking or loving at the time; product tweeters; and so on. If you’re following lots of people and only followed by a few? Not good – why should I follow you if no-one else does? If you’re following lots of people and are followed by lots of people? Also not good: either you’re a spammer being spammed back, or you’re an indiscriminate symmetric follower so why should I trust you, or you’ve so many followers I’m not going to get a look in. If I’m not sure about a new follower, it’s 50/50 that I’ll either block them or not, so there may well be the odd false positive amongst the people I’ve blocked (if so, sorry…) And why do I block them? Because they add no value to me… Like junk mail… And because by association, if you look at my followers and see they’re all Britney, you’ll know my amplification network is worthless. And by association… ;-)

The people I follow? People I’ve chatted to, have been introduced to through RTs, or via interesting/valuable multiaddressed tweets that include me; people who appear not to be part of any other network I follow (or who might add value in a sphere of influence or interest that I don’t feel I currently benefit from), and so on.

And the people I don’t follow but don’t block (i.e. the majority) – nothing personal, but I only have so many hours in the day, and can’t cope with too many new messages every update cycle in my Twitter client!

So all this might sound a little bit arrogant, but it’s my space and it’s me that has to navigate it!

PS just by the by, it struck me during an exchange last week that networks can also act as PR channels. A tweet went out from @ruskin147 asking if anyone knew anyone “who can analyse how viral emails,campaigns etc, can knock a firm off course?” Now I should probably have recommended someone from the OU Business School, because I think there is someone there who knows this stuff; but they’re not part of any of my networks so I’d have to go and search for them and essentially recommend them cold. So instead I suggested @mediaczar (who blogs under the same ID) because he’s been sharing code and insight about his analysis of connectivity and the flow of ideas across social networks for the PR firm (I think?) he works for. (Some irony there, methinks?;-) And it turned out that the two of them hooked up and had a chat…

So why’s that good for me? Because it strengthened the network that I inhabit. It increased the likelihood of those people having an interesting conversation that I was likely to also be interested in. I get value not just from people telling me things, but also from people in my network telling each other things that I am likely to find interesting.

And as a spin-off, it maybe increases my reputation with those two people for having helped create that conversation between them?

In terms of externally recognised value though? How are you going to measure that, Martin?

See also: Time to Get Scared, People?

Written by Tony Hirst

July 21, 2009 at 10:20 am

Posted in Thinkses


Feed Powered Auto-Responders


A few weeks ago, I got my first “real” mobile phone, an HTC Magic (don’t ask; suffice to say, I wish I’d got an iPhone:-( and as part of the follow up service from the broker (phones4U – I said I might be tempted to recommend them, so I am) I got a ‘will you take part in a short customer satisfaction survey’ type text message.

So when I responded (by text) I immediately got the next message in the sequence back as a response.

That is, the SMS I sent back was caught and handled by an auto-responder, that parsed my response, and automatically replied with an appropriate return message.

Auto-responders are widely used in email marketing and instant messaging environments, of course, and as well as acting in a direct response mode, can also be used to schedule the delivery of outgoing messages either according to a fixed calendar schedule (a bulk email to all subscribers on the first of the month, for example) or according to a more personalised, relative time schedule.

So for example, a day or two after getting my new phone, Vodafone started sending me texts about how to use my phone on their network*, presumably according to a schedule that was initiated when I registered the phone for the first time on the network; and the Phones4U courtesy chase up was presumably also triggered according to some preset schedule.

* something sucks here, somewhere: I keep finding my phone has connected to other, rival networks, and as such seems to spend large amounts of its time roaming, even when in a Vodafone signal area. Flanders – you owe me for making such a crappy recommendation… and Kelly, you have something to answer for, too…

So, these auto-scheduled, auto-responding systems are exactly the same idea as daily feeds: whenever you subscribe, a clock starts ticking and content is delivered to you according to a predefined schedule via that same channel.

In a true autoresponder, of course, the next mailing in a predefined sequence is sent in response to some sort of receipt from the recipient, rather than a relative time schedule, and in the case of autoresponding feeds this can be supported too if the feed scheduler supports unique identifiers for each subscription.

(The simplest daily feed system has a subscription URL that contains the start date; content is then delivered according to a relative time schedule that starts on the date contained in the subscription URL. A more elaborate syndication platform would use a unique identifier in the subscription URL, and the content delivery schedule is then tied to the current state of the schedule associated with that unique identifier.)
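The simplest case is easy to sketch. The version below assumes the start date travels in a start=YYYY-MM-DD query parameter (a made-up convention) and that exactly one item is released per day, counting the start date itself as day one; dates are compared in UTC to sidestep timezone wobbles.

```javascript
var MS_PER_DAY = 24 * 60 * 60 * 1000;

// Pull the start date out of a subscription URL such as
// http://example.org/feed?start=2009-07-20 (hypothetical format).
function startDateFromUrl(subscriptionUrl) {
  var match = /[?&]start=(\d{4})-(\d{2})-(\d{2})/.exec(subscriptionUrl);
  if (!match) return null;
  return new Date(Date.UTC(+match[1], +match[2] - 1, +match[3]));
}

// Return the items that should be in the feed on the given day:
// one on the start date, two the day after, and so on.
function itemsDue(items, startDate, today) {
  var start = Date.UTC(startDate.getUTCFullYear(), startDate.getUTCMonth(), startDate.getUTCDate());
  var now = Date.UTC(today.getUTCFullYear(), today.getUTCMonth(), today.getUTCDate());
  var daysElapsed = Math.floor((now - start) / MS_PER_DAY);
  if (daysElapsed < 0) return []; // subscription has not started yet
  return items.slice(0, daysElapsed + 1);
}
```

Because everything needed is in the URL, the server keeps no state at all for this kind of feed; the unique identifier variant trades that simplicity for a per-subscriber schedule.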

So how might a feed autoresponder work? How about in the same way as a feed stats package such as Feedburner? These measure ‘reach’ by inserting a small image at the very end of each feed item that is loaded whenever the feed item is viewed. By tracking how many images are served, it’s possible to get an idea of how many times the feed item was viewed.

The same mechanism can be used as part of a feed auto-responder system: for a subscription via a URI that contains a unique identifier, serve an image with a unique, obfuscated (impossible to guess at, and robots excluded) filename for each item. When the image is polled from a browser client, assume that the subscriber has read that item and publish the next item to the feed after a short delay. The next time the user visits their feedreader, the next item should be there waiting for them.
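Here’s a rough sketch of that bookkeeping in Python. The class and method names are hypothetical, the random tokens stand in for the obfuscated image filenames, and the ‘short delay’ before publishing the next item is left out to keep things short:

```python
import secrets


class FeedAutoresponder:
    """Toy in-memory autoresponding feed (hypothetical names throughout)."""

    def __init__(self, items):
        self.items = items
        self.published = {}   # subscription id -> number of items published so far
        self.token_for = {}   # (subscription id, item index) -> image token
        self.target_of = {}   # image token -> (subscription id, item index)

    def subscribe(self):
        """Create a subscription with a unique identifier and publish the first item."""
        sub_id = secrets.token_hex(8)
        self.published[sub_id] = min(1, len(self.items))
        return sub_id

    def feed(self, sub_id):
        """Return (item, tracking image filename) pairs published so far."""
        entries = []
        for index in range(self.published[sub_id]):
            key = (sub_id, index)
            if key not in self.token_for:
                token = secrets.token_hex(16)  # unguessable filename stem
                self.token_for[key] = token
                self.target_of[token] = key
            entries.append((self.items[index], self.token_for[key] + ".gif"))
        return entries

    def image_requested(self, filename):
        """Treat a request for a tracking image as a receipt for that item."""
        token = filename.rsplit(".", 1)[0]
        sub_id, index = self.target_of[token]
        is_latest = index == self.published[sub_id] - 1
        if is_latest and self.published[sub_id] < len(self.items):
            self.published[sub_id] += 1  # a real system would wait a little first


responder = FeedAutoresponder(["Part 1", "Part 2", "Part 3"])
sub = responder.subscribe()
first_item, first_image = responder.feed(sub)[0]
responder.image_requested(first_image)  # the subscriber views Part 1...
print([item for item, _ in responder.feed(sub)])  # ...so Part 2 joins the feed
```

Only a request for the most recently published item’s image moves the schedule on, so reloading an old item doesn’t cause the feed to race ahead.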

PS Note that someone somewhere has probably patented this, although as a mechanism it’s been around and blogged about for years (prior art doesn’t seem to be respected much in the world of software patents…). If you have a reference, please provide a link to it in the comments to this post.

Written by Tony Hirst

July 20, 2009 at 1:05 pm

Why Private Browsing Isn’t…

with 20 comments

One of the features of the latest crop of browsers is a ‘private browsing’ mode (aka the porn mode) in which cookies and URL histories from a browsing session are discarded at the end of the session, leaving ‘no trace’.

Hmmm…

Whilst watching the BBC iPlayer last night, I got fed up with the programme stalling (too many open apps, etc etc) so I decided to move to another browser. On going to the appropriate programme page, I had the option to “Resume” the programme at the point I had just stopped watching it in the other browser.

A quick tweet asking how this might work was met with the response that iPlayer was probably making use of “Stored Objects, Flash’s equivalent of cookies”, as confirmed (I think?!) by @dansumption.

That is: when you visit a website, most browsers are capable of storing a small amount of data (known as a cookie) specified by the website. This might include a unique identifier that allows the website to recognise you when you visit the site again using the same browser, for example, or store personalisation information for you. Third party cookies allow adservers to recognise who you are when you wander across different websites, too. (A brief intro to cookies can be found on the OpenLearn site: What are cookies?.)

If you don’t want a website to be able to recognise you if you revisit it, you can either block the cookies it wants to set, or delete the cookies it has set in a previous session. Private browsing handles this for you automatically.

Another thing that browsers do is maintain a history of websites that you have visited. Once again, private browsing steps in here to prevent the browser from remembering what sites you have visited during a private browsing session. And finally, private browsing doesn’t keep track of any searches you might have made in the private browsing session using the browser’s built in search box.

Whilst there are still traces all over the place of the sites you have visited, from the firewall log on your computer or your broadband router box to your ISP, if you were browsing within a private browsing session, you might expect that at least your computer would remain ‘free of evidence’ about what you had been searching for, or which sites you had visited (along with removing any tell tale cookies they may otherwise have left behind).

Well, as the BBC iPlayer cross-browser ‘Resume programme’ facility suggests: no.

Many sites that use Flash (BBC iPlayer included) make use of Flash Stored Objects, which sit outside the control (for now at least, and as I understand it) of a browser’s private browsing mode. I’m guessing they also sit outside the scope of a browser’s ‘clear cookies’ and ‘clear history’ actions?

If you’re intrigued about what Flash ‘cookies’ you might have set on your computer, you can inspect them (and delete them) using this Adobe tool: Flash Player: Website Storage Settings panel

Anyway, if you run info skills courses, it’s maybe one to remember…

PS we may not need Flash for much longer anyway, as Mike Ellis suggested when I pointed him to this rather wonderful site demoing the power of CSS in a modern browser: Text Shadow CSS effect ;-)

PPS see also When Delete Doesn’t

Written by Tony Hirst

July 15, 2009 at 12:52 pm

Posted in Infoskills

Visualising Where the Money Goes: Westminster Quangos, Part 2


One of the things I try to impress on folk whenever I do a talk about web stats is how showing charts of mean (that is, simple average) numbers of visitors or time on site is all but useless, because the actual distribution of values is not likely to be normal, and so the simple averages reported are, to all intents and purposes, not good for anything.
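To put some (entirely made up) numbers on that, here’s a quick Python sketch with one very long visit in amongst a handful of short ones; the values are invented for illustration and are not taken from any real stats package:

```python
from statistics import mean, median

# Invented time-on-site values in seconds: eight short visits and one
# tab that was left open for half an hour.
visit_seconds = [5, 8, 10, 12, 15, 20, 25, 30, 1800]

average = mean(visit_seconds)
typical = median(visit_seconds)
below_average = sum(1 for v in visit_seconds if v < average)

print(f"mean: {average:.0f}s, median: {typical}s")
print(f"{below_average} of {len(visit_seconds)} visits were shorter than the mean")
```

With a skewed distribution like this, the mean describes none of the actual visits, which is why I’d rather see the distribution (or at least the median) charted.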

So when putting together a change treemap to summarise the changes in funding of quangos from government departments (Visualising Where the Money Goes: Westminster Quangos, Part 1), what I did originally was to visualise the change in funding for each quango within each department, rather than just displaying the following simpler, overall funding by government department treemap that I posted about previously:

The reason being? The overall change treemap might show increases or decreases in expenditure for a department as a whole, but it doesn’t reveal whether the funded quangos were all treated equally, or whether lots of small quangos received a cut whilst one big one received a huge rise in funding, for example:

So how did I create this treemap? The simplest way is just to create a query on the original spreadsheet that pulls in 4 columns – department, quango, and two expenditure columns (one for 2006 and one for 2007). A query built around this kernel, in fact:

select C,D,R,S order by C,R desc

(To see how such a query is put together, see Part 1 of this exercise.)

To generate a query that only displays quangos that had a non-zero expenditure in both 2006 and 2007, just add:

where (R>0 AND S>0) after the select statement and before order by.
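Putting the pieces together in the right order can be sketched with a small Python helper. The function name and its arguments are hypothetical; the column letters and clauses are the ones used above:

```python
def build_query(columns, order_by, where=None):
    """Assemble a query string with the clauses in select / where / order by order."""
    parts = ["select " + ",".join(columns)]
    if where:
        parts.append("where " + where)
    parts.append("order by " + order_by)
    return " ".join(parts)


# The kernel query from above
print(build_query(["C", "D", "R", "S"], "C,R desc"))
# select C,D,R,S order by C,R desc

# The same query restricted to quangos with expenditure in both years
print(build_query(["C", "D", "R", "S"], "C,R desc", where="(R>0 AND S>0)"))
# select C,D,R,S where (R>0 AND S>0) order by C,R desc
```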

Just as an aside, note that it’s possible to pull the output of this query into another spreadsheet. This allows you to add derived columns to the spreadsheet, for example, by using relative formulae that act on quantities automatically imported into other columns in the spreadsheet. (One thing I intend to explore with the data store explorer is a 1-click way of creating a new spreadsheet that pulls in a query created using the explorer. See also: Using Google Spreadsheets and Viz API Queries to Roll Your Own Data Rich Version of Google Squared on Steroids (Almost…))
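The same sort of derived column can also be calculated outside the spreadsheet. As a sketch only, here’s how the change in expenditure might be worked out in Python from the four column CSV output of the query; the column headings and the rows are placeholders I’ve made up, not real quango data:

```python
import csv
import io

# Placeholder rows standing in for the query output (not real figures).
sample_csv = """department,quango,spend_2006,spend_2007
Dept A,Quango 1,100,150
Dept A,Quango 2,80,60
Dept B,Quango 3,40,40
"""


def add_change_columns(csv_text):
    """Add absolute and percentage change columns to each row."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        before = float(row["spend_2006"])
        after = float(row["spend_2007"])
        row["change"] = after - before
        # Safe to divide if the query has already filtered out zero expenditure.
        row["percent_change"] = 100 * (after - before) / before
        rows.append(row)
    return rows


for row in add_change_columns(sample_csv):
    print(row["quango"], row["change"], row["percent_change"])
```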

The CSV feed can then be pulled into a Many Eyes Wikified data page and visualised in a variety of ways that reveal both the overall expenditure of quangos funded from within each department, as well as the relative funding over 2006/2007 of each quango:

So for example, for the two biggest quangos by expenditure in Defence, one spent more over 2006/7, and the other spent less…

Using the same data page, we can create other visualisations within the Many Eyes Wikified environment that allow us to explore the data in a little more detail. So for example, here’s a bit more detail about the funding of quangos in the Department of Health. Parallel bands show that quangos spent equivalent amounts in 2006 and 2007, lines that diverge going from left to right show an increase in expenditure, and lines that converge going from left to right depict decreasing expenditure.

A scatter chart readily shows large changes in expenditure over the two years:

And some charts are just shiny ;-)

Compared with just trawling through the spreadsheet data, I think there is a lot to be said for using visualisations to identify out of the norm data points (e.g. using the scatterplot), or unpacking totals (as in the case of departmentally funded quango expenditure) in a quick and easy way as a starting point to exploring the data in a more systematic way, for example in order to ask journalistic questions (whatever they might be).

Written by Tony Hirst

July 14, 2009 at 1:40 pm

Posted in Data, Tinkering


Relative Time Replay: History, In Real Time

with 4 comments

Brilliant, brilliant, brilliant: via my feed subscription to Jane’s e-learning pick of the day, I just came across We Choose The Moon – http://www.wechoosethemoon.org, a real-time recreation of the Apollo 11 moon landings.

Twitter feeds are also available…

That is, @ap11_capcom , @ap11_spacecraft and @ap11_eagle.

This ability to follow along replays of historical events in relative real time (realitive [re/al/i/tive ?] time;-) provides a degree of authenticity that makes the event ‘real’. So is technology enabled real-time replay being used in education at all (other than in hugely expensive simulation training environments)?

Although I rarely play online games, there is one I keep coming back to – Sharkrunners which allows you to go on a scientific mission in search of sharks…

And whilst I don’t read many history books, I’ve often thought that replays of things like Harry Lamin’s letters from the World War 1 trenches were ideal for either real-time, or daily feed replays. (A quick trawl will probably pull up several other replays of letters from the trenches.)

Then again, Peter Watkins’ black and white “documentary” of the battle of Culloden is pretty much the only thing I remember from that period of time in my school History lessons! Maybe it’s the apparent authenticity of the medium that’s what engages me?;-)

Which is maybe why I like this Google Earth recreation of the Hudson River plane crash so much:-)

Replays from the diarists also hit the blogosphere from time to time: Pepys diary for example. (I’m not sure if there’s a real-time/relative time replayable version of Anne Frank’s diary online somewhere?)

If I didn’t have way too much to do already, or if I had “independent means”, I think I’d be tempted to try to put together a “relative realtime re-player”, maybe chasing someone like 4iP for the funds (though we’re still chasing them for something else…), or trying to do a deal with various news archives…

Hmm, now there’s a thought – maybe the news media could extract a bit of value from their archives by allowing people to replay historical events in real time? How far back does the Guardian API search I wonder? Certainly, I think Google news has a historical news search… I’d be quite tempted to follow the “Operation Julie” trial from the 1970s in relative real time, I think, rather than just reading a book about it, for example.

There are several things that I think can add to the feeling of authenticity by replaying content in this way:

1) the chunk size – the material is chunked at a human scale; lots of us write (or at least read) 500 word articles, watch 30 minute programmes etc in one sitting; but we don’t tend to read a book in one go, or watch a 20 hour DVD box set in one go;

2) the delivery schedule follows a human time scale – it becomes real to us as the events play out on a human timescale, and one that we are comfortable with;

3) the replayed material is authentic, and was generated at the time of the actual events;

4) the opportunity to look back at events as they unfold in relative realtime, but with the added advantage of hindsight.
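As for what a ‘relative realtime re-player’ might do at its core, here’s a minimal Python sketch: shift every original timestamp by the same offset so that the gaps between events are preserved. The function name is hypothetical and the diary entries are invented placeholders rather than real archive material:

```python
from datetime import datetime


def replay_schedule(events, replay_start):
    """Map (original_time, text) pairs onto a schedule starting at replay_start.

    The first event is delivered at replay_start; every later event keeps
    its original distance in time from the first one.
    """
    ordered = sorted(events, key=lambda event: event[0])
    if not ordered:
        return []
    offset = replay_start - ordered[0][0]
    return [(when + offset, text) for when, text in ordered]


# Invented placeholder entries with their original timestamps
entries = [
    (datetime(1917, 3, 1, 9, 0), "First letter home"),
    (datetime(1917, 3, 4, 18, 30), "Second letter home"),
    (datetime(1917, 3, 12, 7, 15), "Third letter home"),
]

for when, text in replay_schedule(entries, datetime(2009, 7, 13, 9, 0)):
    print(when.isoformat(), text)
```

Each subscriber could be given their own `replay_start`, which is the same trick as the daily feed subscription URL that carries its own start date.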

Are there any history educators out there already teaching in this way, I wonder?

Oh for more time to think about this sort of realitivity, but ‘real work’ beckons :-(

Written by Tony Hirst

July 13, 2009 at 11:57 am

Posted in Thinkses


The Web at 20: Digital Revolution


Along with a host of others, I had to cut short my attendance at the #newsinnovation unconference* on Friday to hop on a Central line tube to White City for the launch of a new OU/BBC collaboration tentatively called “Digital Revolution”.

* I did and I didn’t feel bad about leaving #newsinnovation early. Bad, because I could have stayed longer and probably got quite a lot out of the presentations and, more importantly, the conversations; not bad, because in the time I was there I got value out of it (a head full of thoughts and possibilities), chatted to a few people who were there, and hopefully added my own little bit of value. I’ll try to jot some notes down before the end of the week…

Digital Revolution is, in the jargon, a new 360 production for BBC Two, hosted by Aleks Krotoski (good choice, chaps…) that will document the revolutionary impact of the first 7000 days of the web in a series of four TV programmes, backed up by a major website.

The launch event was a panel affair, with four leading light speakers giving us their 2p’s worth about the web and then taking questions… Rather than me repeating what they had to say, you should check out Rory Cellan-Jones’ write up at The Web at 20, or these notes from @KarenK: BBC Digital Revolution launch event.

If that’s too much trouble, you can watch Tim Berners-Lee’s keynote here:

As befits the nature of the project, a programme blog has already been set up at: Digital Revolution (Working Title): Blog, along with the programme’s production website: Digital Revolution (Working Title): Home.

As the programme’s producer (Russell Barnes) writes at Charting the Digital Revolution:

we have decided to adopt a radical, open-source approach to the production process. We don’t just want to observe bloggers from on high; we want to blog ourselves and get feedback and comment on our ideas.

(Stifles a yawn…shouldn’t these more open production processes be the norm already? [I know they're not, of course, which is why I started things like Digital Worlds and Visual Gadgets to draft ideas out in public for courses I've recently been involved with.] For example, will production staff be posting personal thoughts on personal blogs, twittering along etc etc too? That’s where the real community grows…)

But there’s more:

The second phase of our online project will begin in September. We want to share our rushes online, as they are filmed, including our encounters with the web’s head honchos.
We hope to release those under a permissive licence so that web users can re-use them or do their own mash-ups as they please. Whenever we can, we’re trying to rewrite the traditional BBC script and create something truer to the spirit of the web.

Ooh… sounds like a bit of R&DTV to me, the BBC RAD Lab initiative to produce a geek news programme in raw form, as well as the finished online article. (I am hopeful that we can do something similar with an OU co-produced version of Digital Planet, so if you have any ideas for how we can make that work, please get in touch ;-)

In just the same way I couldn’t start experimenting with different feed powered ways of how we might deliver OU course materials until OpenLearn opened up some authentic OU materials, it seems that the easiest way for the OU to add remix value to OU/BBC co-pros could well be to take publicly shared assets and have a tinker with them, rather than try to negotiate rights hurdles over what we can do with stuff we’ve, err, paid for… ;-)

(Although that said, when I embedded some audio from a Digital Planet episode on an open2 blog, no-one was concerned about it… DIY tech – just making it up with the Arduino.)

in our next phase, and working in partnership with Tim Berners-Lee’s Web Science Research Initiative, we will be engaging web users in a number of online experiments that we hope will put long-held assumptions to the test.
For instance, it is said we now write more than we read, but what percentage of web users create genuinely new content out there? We want to find out.
Are there still six degrees of separation between anyone on the planet or has Facebook crushed it to two?

Good stuff… Maybe they can take things like Digital Planet listeners’ map to the next level, or use the power of the web to pull together something I haven’t managed to persuade anyone to support yet: a photosynth edition of Global Sunrise?;-)

(For other examples of what we have done to date with Digital Planet, see Exploring the GeoWeb with Digital Planet and A Week on the Digital Planet….)

Finally, in the last phase of production, after transmission of the series on BBC Two, our website will host a fully interactive version of the series that will remain online indefinitely.
Here web users will be able to browse through shortform video clips linking off to all the debate and discussion that we’ve generated on bbc.co.uk and around the web.

Ah, the legacy site… Shame we’re not allowed to host copies of the Digital Planet programmes we co-produce on open2.net (a site that is soon to be deprecated – I’m just hoping that the current URIs persist…), although there is the odd programme clip embedded there… That said, I think we have a reasonable legacy from the sites we’ve produced so far, from the listeners’ map, to a couple of travel bugs from the geo episode that are still out there; our photosynths (from the studio and the Colossus at Bletchley Park); a Digital Planet font; and of course, the Digital Planet ringtone. All of which is hosted on the Digital Planet open2 site itself, of course. And all done with no real budget to speak of ;-)

Anyway – it’s great to see the Digital Revolution programme being launched in such an open way – let’s hope the production process continues in the same vein :-)

PS if we can hook an episode of Digital Planet into the process somehow, I think it’d be a good thing to do…. but maybe that’s a little bit too 360…?;-)

Written by Tony Hirst

July 13, 2009 at 10:33 am

Posted in BBC, OBU

