Archive for November 2009
Viewing SPARQLed data.gov.uk Data in a Google Spreadsheet
This is a stub post as much as anything to help me keep tabs on a technique I’ve not had any time to properly play with, let alone document: consuming Linked Data in a Google spreadsheet.
First up – why might this be useful? In short, I think that many people who might want to make use of data.gov.uk data are probably comfortable with spreadsheets but not with code, so giving them access to SPARQL and RDF is not necessarily useful.
So whilst the recipe shown here is a hacky one, at least it opens up the playground a little to explore what the issues might be – and what things might be facilitated by – providing fluid, live data routes from RDF triple stores into spreadsheets.
So here’s what I’m aiming for – some data from the education datastore in a spreadsheet:
And how did it get there? Via a csv import using the =importData formula.
And where did the CSV come from? The sparqlproxy webservice:
As well as giving CSV output, the service can also generate HTML and a variety of JSON formats, including the MIT Simile and Google Viz API formats (which means it’s easy to just plug other data into a wide variety of visualisation formats).
To get the data into a Google spreadsheet, simply copy the CSV URI into an =importData(“CSV_URI_HERE”) formula in a spreadsheet cell.
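As a rough sketch of how that CSV URI and formula might be put together in code: the proxy and endpoint addresses are the ones used in this post, but the parameter names (query, service-uri, output) are my assumption about how the sparqlproxy service is called, and the query is just an illustrative placeholder, so check both against the service before relying on them.

```python
from urllib.parse import urlencode

# Endpoint and proxy addresses as quoted in this post.
SPARQL_ENDPOINT = "http://services.data.gov.uk/education/sparql"
SPARQL_PROXY = "http://data-gov.tw.rpi.edu/ws/sparqlproxy.php"


def csv_uri(query, endpoint=SPARQL_ENDPOINT, proxy=SPARQL_PROXY):
    """Build a proxy URI that asks for the query results as CSV.

    The parameter names used here are assumptions, not confirmed by the post.
    """
    params = {"query": query, "service-uri": endpoint, "output": "csv"}
    return proxy + "?" + urlencode(params)


def import_data_formula(uri):
    """Wrap a URI in the spreadsheet formula, doubling any embedded quotes."""
    return '=importData("' + uri.replace('"', '""') + '")'


# Placeholder query: any ten triples from the store.
query = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10"
print(import_data_formula(csv_uri(query)))
```

The printed formula can then be pasted into a spreadsheet cell as described above.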
The sparqlproxy service can also pull in and transform queries that have been posted on the web:
So for example, in the above case the query at http://data-gov.tw.rpi.edu/sparql/stat_ten_triples_default.sparql looks like:
What this means is that someone else can write complex queries and mortals can then access the data and display it however they want. (What I’d really like to see is a social site that supports the sharing of endpoint/query pairs for particular queries (I could probably even hack something to do this using delicious?) ;-)
Once the data is in the spreadsheet, it can be played with in the normal way of course. So for example, I can query the spreadsheet using my old prototype Guardian datastore explorer:
In the above example, the route is then:
1) a sparql query onto http://services.data.gov.uk/education/sparql is run through
2) the http://data-gov.tw.rpi.edu/ws/sparqlproxy.php sparqlproxy service to produce a CSV output that is
3) pulled into a Google spreadsheet using an =importData() formula, and in turn
4) queried using my Google Datastore explorer using the Google visualisation API query language and then
5) rendered in a Google Visualisation table widget.
Lurvely… :-)
Looking Back to the Future – Where Did It All Go Wrong..?
So I first bookmarked a variant of EPIC 2015 four and a half years ago…
Where does it go wrong..?
Remind me – what year are we in now?
PS I’m thinking…. it’s time for a remake…
“Look at me, Look at me” – Rewriting Google Analytics Tracking Codes
A couple of quick post hoc thoughts to add to Google/Feedburner Link Pollution:
1) there’s an infoskills issue here based on an understanding of what proxied links are, what is superfluous in a URI (Google tracking attributes etc);
2) there’s fun to be had… so for example, @ajcann recently posted on how students at Leicester are getting into the bookmarked resource thing and independently “doing some excellent work on delicious, creating module resources”: Where’s the social?.
Here’s the original link as polluted by Feedburner (I clicked through to the page from Google Reader):
http://scienceoftheinvisible.blogspot.com/2009/11/wheres-social.html
?utm_source=feedburner
&utm_medium=feed
&utm_campaign=Feed%3A+SOTI+%28Science+of+the+Invisible%29
&utm_content=Google+Reader
Normally, I would have stripped the tracking code from the link I made above to Alan’s post. Instead, I used this:
http://scienceoftheinvisible.blogspot.com/2009/11/wheres-social.html
?utm_source=ouseful.info
&utm_medium=blogosphere
&utm_campaign=infoskills,analytics
&utm_content=http://wp.me/p1mEF-EH
(The campaign element is the category I used for this post, the content is the shortcode for the post.)
Don’t ya just love it: tracking code spam :-)
So I’m thinking – maybe I need a WordPress plugin that will preemptively clean all external links of Google tracking codes and then add my own ‘custom’ tracking stuff on instead (under the assumption that the linked to site is running Google Analytics). If it isn’t, then the annotations are just an unsightly irrelevance, or noise in the URI…
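The link rewriting step itself is simple enough to sketch. This isn’t a WordPress plugin, just a minimal Python illustration of the idea, and the function name retag is made up for the purpose: drop any utm_ parameters already on a link, keep everything else, and append a new set.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def retag(uri, source, medium, campaign, content):
    """Drop existing utm_* parameters from uri and append our own."""
    parts = urlsplit(uri)
    # Keep every query parameter that is not a Google Analytics tracking code.
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if not k.startswith("utm_")]
    ours = [("utm_source", source), ("utm_medium", medium),
            ("utm_campaign", campaign), ("utm_content", content)]
    return urlunsplit(parts._replace(query=urlencode(kept + ours)))


polluted = ("http://scienceoftheinvisible.blogspot.com/2009/11/wheres-social.html"
            "?utm_source=feedburner&utm_medium=feed"
            "&utm_campaign=Feed%3A+SOTI+%28Science+of+the+Invisible%29"
            "&utm_content=Google+Reader")
print(retag(polluted, "ouseful.info", "blogosphere",
            "infoskills,analytics", "http://wp.me/p1mEF-EH"))
```

Note that urlencode percent-encodes the comma and the embedded short URI, so the result is a slightly tidier version of the hand-built link shown above.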
Google/Feedburner Link Pollution
Just a quick observation…
If you run a blog (or any other) RSS feed through Feedburner, the title links in the feed point to a Feedburner proxy for the link.
If you use Google Reader, and send a post to delicious:
the Feedburner proxy link is the link that you’ll bookmark:
(Hmmmm, methinks it would be handy if Delicious gave you the option to bookmark the ‘terminal’ URI rather than a proxied or short URI? Maybe by getting Google proxied links into Delicious, Google is amassing data about social bookmarking behaviour from RSS feeds on Delicious? So how about this for a scenario: you wake up tomorrow to find the Goog has bought Delicious off Yahoo, and all your bookmarked links are suddenly rewritten in the form: http://deliproxy.google.com/~r/gamesetwatch/~3/Yci8wJb49yk/fighting_fantasy_flowcharts.php)
If you click on the link to take you through to the actual linked page, and the actual page URI, you may well get something like this:
http://www.gamesetwatch.com/2009/11/fighting_fantasy_flowcharts.php?
utm_source=feedburner&utm_medium=feed
&utm_campaign=Feed%3A+gamesetwatch+%28GameSetWatch%29
That is, a URI with Google Analytics tracking info attached automagically by Feedburner (see Google Analytics, Feedburner and Google Reader for more on this).
Here, then, are a couple of good examples of why you might not want to use (Google) Feedburner for your RSS feeds:
1) it can pollute your links, first by appending them with Google Analytics tracking codes, then by rewriting the link as a proxied link;
2) you have no idea what future ‘innovations’ the Goog will introduce to pollute your feed even further.
(Bear in mind that Google Feedburner also allows you to inject ads into a feed you have burned using AdSense for Feeds.)
Using JISCPress/Digress.it for Reading List Publication
One of the things I’ve been doodling with but not managing to progress much thinking wise (not enough dog walking time lately!) is how we might be able to use the digress.it WordPress theme to support various course related functions in ways that exploit the disaggregating features of the theme.
Chatting with Huw Jones last week about his upcoming Arcadia seminar on “The Problem of Reading Lists” (this coming Tuesday, Nov 24th – all welcome;-) I started thinking again about the potential for using digress.it as a means of publishing, and collecting comments on, reading lists.
So for example, over on the doodlings WriteToReply site I’ve posted an example of how a reading list posted under the theme is automatically disaggregated into separate, uniquely identified references:
The reading list was generated simply by copying and pasting a PDF based reading list into a WordPress blog post. Looking at the format of the list, one could imagine adding further comments or notes relating to each reference using a blog comment. Given that the basis of each paragraph is a citation to a particular work, it might be possible to parse out enough information to generate a link to a search on the University OPAC for the corresponding work (and if so, pull back an indication of the availability of the book as, for example, my Library Traveler script used to do for books viewed on Amazon).
Under the current in-testing digress.it theme, each paragraph on the page can be made available as a separate item in an RSS feed; that is, as well as the standard ‘single item’ RSS page feed that WordPress generates automatically, we can get an N-item feed from the page for the N-paragraphs contained on a page.
Which in turn means that to generate an itemised RSS feed version of a reading list, all I need to do is paste the reading list – with each reference in a separate paragraph – into a single blog post. (The same is true for disaggregating/feed itemising previous exam papers, for example, or I guess video links in order to generate a DeliTV programme bundle…?!)
(For more details of the various ways in which digress.it can automatically disaggregate/atomise a document, see Open Data: What Have We Got?.)
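To make the paragraph-per-item idea concrete, here is a minimal sketch of turning a pasted reading list into an N-item RSS feed. It is not how digress.it does it; the function name and the #1, #2, … fragment identifiers on the item links are inventions for the purposes of illustration.

```python
import xml.etree.ElementTree as ET


def paragraphs_to_rss(title, page_url, text):
    """Return an RSS 2.0 string with one <item> per non-empty paragraph of text."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = page_url
    ET.SubElement(channel, "description").text = "Paragraph-level feed for " + title
    # Paragraphs are taken to be blocks of text separated by blank lines.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    for number, paragraph in enumerate(paragraphs, start=1):
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = "%s: paragraph %d" % (title, number)
        # The fragment identifier scheme is made up for this sketch.
        ET.SubElement(item, "link").text = "%s#%d" % (page_url, number)
        ET.SubElement(item, "description").text = paragraph
    return ET.tostring(rss, encoding="unicode")


reading_list = "First reference.\n\nSecond reference.\n\nThird reference."
print(paragraphs_to_rss("Reading list", "http://example.org/reading-list/", reading_list))
```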
PS just a reminder again – Huw’s Reading List project talk, which is about far more than just reading lists, is on Tuesday in the Old Combination Room, Wolfson College, Cambridge, at 6pm.
Google Analytics, Feedburner and Google Reader
Over the last couple of weeks, it seems as if the Goog has been doing a bit of reconciliation on the old analytics front, in particular the ability to track traffic driven back to your website from links contained within a feed published from that site using Feedburner…
The first thing I’d noticed as being different was the appearance of Google Analytics tracking codes on Feedburner powered posts that I was reading in Google Reader – opening such a post in a new window seems to display it with a full blown set of GA tracking attributes. So for example, opening a post from the Feedburnered OUseful.Info feed results in a URI like this:
http://ouseful.wordpress.com/2009/11/18/under-the-radar/?
utm_source=feedburner&utm_medium=feed
&utm_campaign=Feed%3A+ouseful+%28OUseful+Info%29&utm_content=Google+Reader
…and I’m pretty sure I didn’t put those tracking codes in there explicitly…
In “Campaign” Tracking With Google Analytics, I started sketching out how it might be possible to use Google Analytics campaign tracking codes to track the spread of referrer links to documents or document fragments hosted on WriteToReply or JISCPress, so let’s see how the Feedburner annotations are structured:
- utm_source=feedburner (that is, the originator of the feed);
- utm_medium=feed (that is, the means by which the content was transported/syndicated);
- utm_campaign=Feed: ouseful (OUseful Info) (that is, the name of the Feedburner feed (I think: the feed URL is http://feedburner.com/ouseful), followed by the feed title (OUseful Info));
- utm_content=Google Reader (that is, the place where I viewed the link).
Compare this with the suggestion I made for annotating WriteToReply links:
- utm_source=twitter.com (that is, the place a link was ‘launched’);
- utm_medium=question (that is, the type of slug content used to qualify the link);
- utm_campaign=jiscri (that is, the consultation document linked to, e.g. for the link http://writetoreply.org/jiscri/2009/03/11/rapid-innovation-projects/);
- utm_content=slug3 (that is, a unique ID to identify the text used to qualify the syndicated link).
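Annotations like these are easy to read back out of a link. As a quick sketch (the helper name tracking_codes is mine), Python’s standard URL parsing will pull out the utm_ parameters from the Feedburnered URI shown earlier and decode the percent-encoding along the way:

```python
from urllib.parse import urlsplit, parse_qs

uri = ("http://ouseful.wordpress.com/2009/11/18/under-the-radar/?"
       "utm_source=feedburner&utm_medium=feed"
       "&utm_campaign=Feed%3A+ouseful+%28OUseful+Info%29&utm_content=Google+Reader")


def tracking_codes(uri):
    """Return the utm_* parameters of a URI as a dict of decoded values."""
    query = parse_qs(urlsplit(uri).query)
    # parse_qs gives a list of values per key; take the first of each.
    return {k: v[0] for k, v in query.items() if k.startswith("utm_")}


codes = tracking_codes(uri)
for name, value in codes.items():
    print(name, "=", value)
```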
So how can you get Googalytics tracking codes on your Feedburner feeds? Details are still sketchy (e.g. see the original announcement on the Google Analytics blog here: An Integration With Feedburner, and the Google AdSense for Feeds blog here: “Afternoon, Frank.” “Hey howdy, George.”), but there is this Google FAQ post on How do I set up my FeedBurner feed to report feed clicks in Google Analytics?:
If you use Google Analytics to track web site visitors, you can see feed clicks originating from your FeedBurner feed by activating an option on the Analyze tab.
When someone clicks one of your feed items and ends up back on your web site, Google Analytics will track that activity and include it in the “Traffic Sources” section.
The post also tells you where you can set up the tracking details – from the Configure Stats menu option. And selecting that, I can now see why my feed links are annotated as they are:
(I’m not sure how the $distributionEndpoint is treated for non-Google properties?)
The Google AdSense for Feeds post suggests that:
By default, these analytics will show up in the “All Traffic Sources” and “Campaigns” views in Google Analytics. You can filter the results just to only the traffic that comes from Google FeedBurner by filtering on “feedburner” on the All Traffic Sources page or “Feed:” on the campaigns view. You can also use these sources in the Advanced Segments views.
which suggests that for sites like JISCPress/WriteToReply that use Google Analytics on the main site and Feedburner for the public/promoted feeds, the Feedburner integration will automatically annotate feed links with tracking codes that can be tracked from the site’s Google Analytics dashboard.
Under the Radar…
Here’s a quick post from under the radar… Apparently, folk from Cam Libraries get together every so often for an informal but issues related brown bag lunch somewhere… It seems like the where and whenabouts of these events is a closely guarded secret.
I think I’m ‘presenting’ at a brown bag lunch session next week, Nov 27th, but I don’t have access to the mailing list the announcement went out on so don’t know any more details than that.
I did, however, manage to grab a bootleg of a trailer for what may or may not be this event, based on what I think I said I could talk about if I managed to get an invite:
If the event is on, I guess I’ll be told immediately before the event and taken to the location blindfolded (presumably using a brown paper bag?)
Just in case, best keep this hush hush, okay? ;-)
The Real-Time Web and its Relationship with Discovery and Search
As well as presenting at Online Information this year, I’m also moderating a session on The realtime web: Discovery vs. Search.
To try and frame what I think this topic means to me, I jotted down a set of questions that I’m not sure I have any answers for that I hope will, at least in part, be covered by one or more of the speakers:
Over the last few weeks, the major search engines all announced realtime search capabilities – but what is ‘realtime search’?
To what extent can traditional search engine techniques for determining relevancy and ranking results “just cope” with real time signals? Will realtime search focus on ‘turning up’ real time ‘atomic’ status updates, or will it focus on using realtime information as part of a ranking algorithm applied to more traditional web pages? Will ‘social discovery’ or social amplification factors just provide yet another ranking factor to traditional search engines, or is it more complicated than that?
In financial markets, being able to publish price information in as near as realtime as possible is key to success, but may also require dedicated hardware and high speed/low latency networks. How ‘real time’ does the real time search’n'discovery web really need to be? Will speed be a determining factor in which ‘realtime’ search engine people use?
For some time, it’s been possible to identify trends in search behaviour with a variety of periodicities (eg Trendspotting). To what extent does/could/should the ‘periodic web’ influence the results that search engines return? If certain signals tend to lead particular behaviours (eg a spike in a real time signal on a Thursday predicts a certain behaviour on a Friday, or a cheer of “goal” on twitter predicts a spike in electricity generation as everyone goes off to boil the kettle and make a cup of tea), might that affect not only search engine rankings in ‘real predictive time’, but also other instrumented systems (such as AdWord pricing, or even energy supply)?
To what extent is the realtime web just ‘high frequency’ noise (low effort to produce, quick to disappear) compared to more substantial and expensive to produce ‘low frequency’ signals such as the steady accretion of long lasting links to a web page over time?
The web is the elephant in the room, and it never forgets. Given we can now capture, and potentially store, ever increasing amounts of real time sourced data, are we going to need ‘forgetting algorithms’? If so, what might those algorithms do, and over what timescales might they operate?
With the increasing instrumentation of the web we are seeing a rise in live “operations” data on the web via services such as Pachube (as well as monthly, quarterly or annual data dumps, such as those released increasingly by government here in the UK, in the US, in Australia and so on). To what extent, if any, might this real time machine collected data about our environment play a role in supporting the (public) discovery of real time events in near real time? (So for example, road traffic information, travel information, weather warnings, earthquake warnings etc)
To what extent might realtime data in one medium influence discovery in another – for example, if a lot of photos tagged in a similar way are uploaded to flickr with a particular location, how might that signal be used?
To what extent does the discovery of events in real time impact on the enterprise? What sort of role, if any, is there for a real time, or real time supported capability within a corporate/enterprise/intranet search engine?
In the UK, an increasing amount of Linked Data is being made available from government sources. What role, if any, does real time sourced data have to play in the discovery of, or reasoning across, Linked Data? What risks might there be to Linked Data systems by the inclusion or availability of dynamic data, or data that is continually generated in real time?
And who are the speakers in the session?
- Stephen Arnold, President, Arnold Information Technology, USA
- Antonio Gulli, Principal Development Manager, STC Europe, Microsoft
- Conrad Wolfram, Co-Founder, Wolfram Research and Wolfram|Alpha, UK
- Morgan Zimmerman, VP Business Development, Exalead, France
If there are any other questions that you think need asking, please add them as a comment below…
Recommendations By Magic
I’m not sure how I feel about this – maybe the magic is good magic, maybe it’s voodoo magic, or maybe it’s fake magic, the work of a charlatan, but I wonder, I wonder, might Google’s ‘Personalised Ranking’ utility in Google Reader be useful in filtering, or at least ranking, latest issue table of contents feeds from somewhere like TicTocs?
Only have a 10-minute coffee break and want to see the best items first? All feeds now have a new sort option called “magic” that re-orders items in the feed based on your personal usage, and overall activity in Reader, instead of default chronological order. Click “Sort by magic” under the Folder Settings menu of your feed to switch to personalized ranking. Unlike the old “auto” ranking, this new ranking is personalized for you, and gets better with time as we learn what you like best — the more you “like” and “share” stuff, the better your magic sort will be. Give it a try on a high-volume feed folder or All items and see for yourself!
[Google Reader Personalised Ranking]
Now I believe that there is also a JISCRI project looking at a related sort of thing – Bayesian Feed Filter…: “The Bayesian Feed Filtering project will be trying to identify those articles that are of interest to specific researchers from a set of RSS feeds of Journal Tables of Content by applying the same approach that is used to filter out junk emails.” [Project Kicks Off]
So I’m thinking: it’d be great to see how their approach might filter subscribed to feeds bayesed (!;-) on what users read from those feeds, compared to the Google magic?
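For what it’s worth, the junk-mail style approach is easy to sketch in a few lines. What follows is a textbook naive Bayes classifier with add-one smoothing, trained on the titles of items a user did or didn’t read; it is emphatically not the Bayesian Feed Filtering project’s code, nor anything to do with how Google’s magic works, and the class name, labels and training titles are all made up.

```python
import math
from collections import Counter


class FeedFilter:
    """Tiny naive Bayes classifier over item titles: 'read' versus 'skipped'."""

    def __init__(self):
        self.words = {"read": Counter(), "skipped": Counter()}
        self.items = Counter()

    def train(self, title, label):
        """Record one title under the label 'read' or 'skipped'."""
        self.items[label] += 1
        self.words[label].update(title.lower().split())

    def score(self, title):
        """Log-odds that the title is 'read'; positive means more likely read."""
        vocab = len(set(self.words["read"]) | set(self.words["skipped"]))
        result = 0.0
        for label, sign in (("read", 1), ("skipped", -1)):
            total = sum(self.words[label].values())
            # Add-one smoothing for the class prior and for each word.
            logp = math.log((self.items[label] + 1) / (sum(self.items.values()) + 2))
            for word in title.lower().split():
                logp += math.log((self.words[label][word] + 1) / (total + vocab + 1))
            result += sign * logp
        return result


feed_filter = FeedFilter()
feed_filter.train("Linked data in university libraries", "read")
feed_filter.train("RSS feeds for journal tables of contents", "read")
feed_filter.train("Protein folding kinetics", "skipped")
feed_filter.train("Crystal structure of a membrane protein", "skipped")

titles = ["Protein crystal growth", "Linked data and RSS feeds"]
ranked = sorted(titles, key=feed_filter.score, reverse=True)
print(ranked)
```

Sorting new items by score gives a crude ‘magic’ ordering that could then be compared against what the user actually goes on to read.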
Create Your Own Google Custom News Sections
For many years now, it’s been possible to subscribe to persistent (“saved”) Google News searches and so build up your own custom dashboard views of news… Indeed, it was over three years ago now that I hacked together a demo news feed roller (Persistent News Search OPML Feed Roller) that let users bundle up a roll of feeds in an OPML file (sort of!) for easy viewing elsewhere.
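The feed rolling idea amounts to little more than this sketch, which is not the original demo’s code: take a list of saved searches, turn each one into a feed address, and wrap the lot in an OPML file. The feed address template below is a placeholder on example.org, not a real news search URL, so swap in whatever RSS address your news search actually provides.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote_plus

# Placeholder template: substitute the real RSS address of your news search.
FEED_TEMPLATE = "http://example.org/news/search?q={query}&output=rss"


def search_feeds_to_opml(title, searches, template=FEED_TEMPLATE):
    """Return an OPML string with one feed outline per saved search."""
    opml = ET.Element("opml", version="1.0")
    head = ET.SubElement(opml, "head")
    ET.SubElement(head, "title").text = title
    body = ET.SubElement(opml, "body")
    for search in searches:
        url = template.format(query=quote_plus(search))
        ET.SubElement(body, "outline",
                      {"type": "rss", "text": search, "xmlUrl": url})
    return ET.tostring(opml, encoding="unicode")


print(search_feeds_to_opml("My news searches",
                           ["open university", "isle of wight"]))
```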

And if OPML isn’t your thing, then services like Netvibes or Pageflakes let you easily wire up your own news dashboard:
But we all know in our heart of hearts that RSS and Atom feed subscriptions are just not popular or widespread as a consumer technology. Folk aren’t knowingly using feeds, and they’re not unknowingly using them directly either. (But feeds are being used as wiring/plumbing behind the scenes, so RSS is not dead yet, okay?!;-)
(In the Library world, as well as the wider news reading world, this failure to engage with feed subscriptions can be seen (in part) by the lack of significant uptake of RSS alerts.)
So when Google announced last week that you can now Create and Share custom News sections, it struck me that they were getting round the exposed plumbing problem that subscribing to a feed implies, and instead making it easy to create a custom view (the output of which can also be subscribed to) without the appearance of having to do much plumbing at all – How to Create Your Own Google Custom News Section (Tutorial):
You can search the directory of already created news sections – as well as find a link to a page that lets you create your own news sections, here: Google News: Custom sections directory.
So for example, here are a few I have already made:
- UK Higher Education News
- Isle of Wight News
- UK Broadcasting News
- Formula One News
The extent to which you can create a finely tuned view of the news is, admittedly, limited. You can’t, for example, limit the search to specified publications (which you can do in a Google news advanced/search limited search) – filtering is limited to keywords and locale (I’m not sure of the extent to which the order in which you enter the keywords affects things?). But if you already know how to create that sort of filtered search, you probably also know how to set up a new search alert, wire up a feed powered dashboard of your own, and so on. And if the Google Custom News sections editor was any more complicated, I dare say it would put off the users I imagine Google are reaching out to…











