Open Standards Consultation and Open Data Standards Challenges

Take a look around you… see that plug socket? If you’re in the UK, it should conform to British Standard BS1363 (you can read the spec if you have your credit card to hand…). Take a listen around you… is that someone listening to an audio device playing an MP3 music file? ISO/IEC 11172-3:1993 (or ISO/IEC 13818-3:1995) helped make that possible… “that” being the agreed upon standard that let the music publisher put the audio file into a digital format that the maker of the audio device knows how to recognise and decode. (Beware, though. The MP3 specification is tainted with all sorts of patents – so you need to check whether or not you need to pay someone in order to build a device that encodes or decodes MP3 files.) If the music happens to be playing from a CD (hard to believe, but bear with me!), then you’ll be thankful the CD maker and the audio player manufacturer agreed to both work with a physical object that conforms to IEC 60908 ed2.0 (“Audio recording – Compact disc digital audio system”), and that maybe makes use of Standard ECMA-130 (also available as ISO/IEC 10149:1995). That Microsoft Office XML document you just opened somewhere? ISO/IEC 29500-1:2011. And so on…

Standards make interoperability possible. Which means that standards can be a valuable thing. If I create a standard that allows lots of things to interoperate, and I “own” the “intellectual property” associated with that standard, I can make you pay every time you sell a device that implements that standard. If I control the process by which the standard is defined and updated, then I can make changes to the standard that may or may not be to your benefit but with which you have to comply if you want to continue to be able to use the standard.

There are at least a couple of issues we need to take into account, then, when we look at adopting or “buying in” to a standard: who says what goes in to the standard, and how is agreement reached about those things; and under what terms is usage of the standard allowed (for example, do I have to pay to make use of the standard? Do I even have to pay to read it?).

At the adoption level, there is also the question of who decides which standard to adopt, and the means by which adoption of the standard is forced onto other parties. In the case of legislation, governments have the power to inflict a considerable financial burden on companies and government agencies by passing legislation that mandates the adoption of a particular standard that has some sort of fee associated with its use. Even outside of legislation, if a large organisation requires its suppliers to use a particular standard, then it could be commercial suicide for a supplier not to adopt the standard, even if there are direct licensing costs associated with using it.

If we want to reduce the amount of friction in a process that is introduced by costs associated with the adoption of standards that make that process possible, then “open standards” may be a way forward. But what are “open standards” and what might we expect of them?

A new consultation from the Cabinet Office seeks views on this matter, with a view towards adopting open standards (whatever they are?!;-) across government, wherever possible: Cabinet Office calls on IT Community to engage in Open Standards consultation. In particular, the consultation will inform:

– the definition of open standards in the context of government IT;
– the meaning of mandation and the effects compulsory standards may have on government departments, delivery partners and supply chains; and
– international alignment and cross-border interoperability.

The consultation closes on 1 May 2012.

(Hmm, the consultation doesn’t seem to be online commentable… wouldn’t it be handy if there was something around like the old WriteToReply…?;-)

Here’s a related “open data standards in government” session from UKGovCamp 2012:

Related to the whole open standards thang is a new challenge on the Standards Hub posted by the HM Gov Open Data Standards (Shadow*) Panel (disclaimer: I’m a member of said panel; it’s (Shadow) because the board it will report to has not been formally constituted yet). The challenge covers open standards for “Managing and Using Terms and Codes” and seeks input from concerned parties on document standards and specifications relating to the coding and publication of controlled term lists, their provenance, version control/change files, and so on. (So for example, if you happened to work on the W3C provenance data model (which I note has reached the third working draft stage), and think it’s relevant, it might be worth bringing it to the attention of the panel as a reply to the challenge.)

It occurs to me that recent JISC activity relating to the UK Discovery initiative may have something to say about the issues involved with, and formats appropriate for, representing and sharing data lists, so I commend the challenge to them: open standards for “Managing and Using Terms and Codes” (I’ll also pick my way through the #ukdiscovery docs and feed anything I find there back to the panel). I also suspect the library and shambrarian community may have something to offer, as well as members of the Linked Universities community…

[A quick note on the Open Data Standards Panel – its role, in part, is to help identify and recommend open standards appropriate for adoption across government, as well as identify areas where there is a need for open standards development. It won’t directly develop any standards, although it may have a role in recommending the commissioning of standards.]

A couple of other things to note on sort of tangentially related matters (this post is in danger of turning into a newsletter, methinks… [hmmm: should I do a weekly newsletter?!]):

  • JISC just announced some invitations to tender on the production of some reports on Digital Infrastructure Directions. The reports are to cover the following areas: Advantages of APIs, Embedded Licences: What, Why and How, Activity Data: Analytics and Metrics, The Open Landscape, Access to citation data: a cost-benefit and risk review and forward look.
  • the Open Knowledge Foundation has a post up, Announcing the School of Data, “a joint venture between the Open Knowledge Foundation and Peer 2 Peer University (P2PU)”. The course is still in the early planning stage, and volunteers are being sought…

Related: last year, the OU co-produced a special series of programmes on “openness” with the BBC World Service Digital Planet/Click (radio) programme. You can listen to the programmes again here:

Open Data Processes: the Open Metadata Laundry

Another quick note from yesterday’s mini-mash at Cambridge, hosted by Ed Chamberlain, and with participation from consultant Owen Stephens, Lincoln’s Paul Stainthorp and his decentralised developers, and Sussex’s Chris Keene. This idea came from the Lincoln Jerome project (I’m not sure if this has been blogged on the Jerome project blog?), and provides a way of scrubbing MARC based records to free the metadata up from license restrictions.

The recipe goes along the lines of reconciling the record for each item with openly licensed equivalents, and creating a new record for each item where data fields are populated with content that is known to be openly licensed. In part, this relies on having a common identifier. One approach that was discussed was generating hashes based on titles with punctuation removed. This feels a bit arbitrary to me…? I’d probably reduce all the letters to the same case at the very least in an attempt to normalise the things we might be trying to hash?
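
By way of illustration, here’s a minimal sketch of the kind of normalise-then-hash step I have in mind (the particular normalisation rules – lowercasing, stripping punctuation, collapsing whitespace – are my own assumption, not necessarily what Jerome actually does):

import hashlib
import re

def normalised_title_hash(title):
    """Generate a crude matching key for a title: lowercase it,
    strip punctuation, collapse whitespace, then hash the result."""
    key = title.lower()
    key = re.sub(r"[^\w\s]", "", key)       # drop punctuation
    key = re.sub(r"\s+", " ", key).strip()  # collapse runs of whitespace
    return hashlib.md5(key.encode("utf-8")).hexdigest()

# Two differently cased/punctuated renderings of the same title hash alike
print(normalised_title_hash("The Origin of Species."))
print(normalised_title_hash("the  origin of species"))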

I wonder if Ed’s mapping of metadata ownership might also have a role to play in developing a robust laundry service? (e.g. “Ownership” of MARC-21 records and Where exactly DOES a record come from?).

We also discussed recipes where different libraries, each with their own MARC records for a work, might be compared field by field to identify differences between the ways similar items might be catalogued differently. As well as identifying records that maybe contain errors, this approach might also enhance discovery, for example through widening a set of keywords or classification indices.
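
As a toy illustration of the sort of field-by-field comparison I mean (the flat tag-to-value dicts below are a gross simplification of real MARC records, and the sample values are made up):

def field_differences(record_a, record_b):
    """Compare two simplified records (dicts mapping MARC tag -> value)
    and report the tags where the two catalogue records disagree."""
    diffs = {}
    for tag in set(record_a) | set(record_b):
        a, b = record_a.get(tag), record_b.get(tag)
        if a != b:
            diffs[tag] = (a, b)
    return diffs

# Hypothetical, heavily simplified records for the same work from two libraries
lib1 = {"245": "On the origin of species", "650": "Evolution (Biology)"}
lib2 = {"245": "On the origin of species", "650": "Natural selection"}

print(field_differences(lib1, lib2))  # {'650': ('Evolution (Biology)', 'Natural selection')}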

One of the issues we keep returning to is why it might be interesting to release lots of open data in a given context. Being able to pivot from a resource in one context to a resource in another context is a general/weak way of answering this question, but here are a couple of more specific issues that came up in conversation:

1) having unique identifiers is key, and becomes useful when people use the same identifier, or same-as’d identifiers, to refer to the same thing;

2) we need tool support to encourage people creating metadata to start linking into recognised/shared identifier spaces. I wonder if there might be value in institutions starting to publish reconciliation services that can be addressed from tools like Google Refine. (For example, How to use OpenCorporates to match companies in Google Refine or Google Refine Reconciliation Service API). Note that it might make sense for reconciliation services to employ various string similarity heuristics as part of the service (there’s a rough sketch of one such heuristic just after this list).

3) we still don’t have enough compelling use cases about the benefits of linked IDs, or tools that show why it’s powerful. (I think of linked identifier spaces that are rich enough to offer benefits as if they were (super)saturated solutions, where it’s easy to crystallise out interesting things…) One example I like is how Open Corporates use reconciliation to allow you to map company names in local council accounts to specific corporate entities. In time, one can imagine mapping company directors and local council councillors onto person entities and then starting to map these councillor-corporate-contract networks out…;-)
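
To make the string similarity point in 2) a little more concrete, here’s a rough sketch of the kind of heuristic a reconciliation service might employ, using the Python standard library’s difflib (the candidate names and the 0.8 threshold are illustrative assumptions):

from difflib import SequenceMatcher

def best_match(name, candidates, threshold=0.8):
    """Return the candidate most similar to name, if it clears the threshold."""
    def similarity(a, b):
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()
    scored = [(similarity(name, candidate), candidate) for candidate in candidates]
    score, match = max(scored)
    return (match, score) if score >= threshold else (None, score)

# Hypothetical supplier names as they might appear in a set of council accounts
candidates = ["Acme Widgets Ltd", "Acme Widgets Limited", "Apex Windows Ltd"]
print(best_match("ACME WIDGETS LTD.", candidates))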

Finally, something Owen mentioned that resonates with some of my thinking on List Intelligence: Superduping/Work Superclusters, in which we take an ISBN, look at its equivalents using ThingISBN or xISBN, and then for each of those alternatives, look at their ThingISBN/xISBN alternatives, until we reach a limit set. (cf my approaches for looking at lists a Twitter UserID is included on, looking at the other members of the same lists, then finding the other lists they are mentioned on, etc. Note in the case of Twitter lists, this doesn’t necessarily hit a limit without the use of thresholding!)
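
Here’s a minimal sketch of that supercluster-growing loop (lookup_equivalents is a stand-in for whatever ThingISBN/xISBN call you’d actually make; the max_rounds cap is my own addition, the sort of thresholding the Twitter list case seems to need):

def supercluster(seed_isbn, lookup_equivalents, max_rounds=10):
    """Grow the set of equivalent ISBNs until no new ones turn up,
    or until a safety cap on the number of expansion rounds is hit.

    lookup_equivalents(isbn) should return an iterable of ISBNs; it is a
    stand-in here for a ThingISBN or xISBN lookup.
    """
    seen = {seed_isbn}
    frontier = {seed_isbn}
    for _ in range(max_rounds):
        new_isbns = set()
        for isbn in frontier:
            new_isbns.update(lookup_equivalents(isbn))
        new_isbns -= seen
        if not new_isbns:   # limit set reached
            break
        seen |= new_isbns
        frontier = new_isbns
    return seen

For ISBNs you’d expect the limit set to be reached well before the cap bites; for Twitter lists, as noted, it’s the cap (or some other threshold) that ends up doing the work.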

Getting Library Catalogue Searches Out There…

As a long-time fan of custom search engine offerings, I keep wondering why Google doesn’t seem to have much active interest in this area. Google Custom Search updates are few and far between, and typically go unreported by the tech blogs. Perhaps more surprisingly, Custom Search Engines don’t appear to have much, if any, recognition in the Google Apps for Education suite, although I think they are available with a Google Apps for Education ID?

One of the things I’ve been mulling over for years is the role that automatically created course-related search engines might have to play as part of a course’s VLE offering. The search engine would offer search results either over a set of web domains linked to from the actual course materials, or simply boost results from those domains in the context of a “normal” set of search results. I’ve recently started thinking that we could also make use of “promoted” results to highlight specific required or recommended readings when a particular topic is searched for (for example, Integrating Course Related Search and Bookmarking?).

During an informal “technical” meeting around three JISC funded resource discovery projects at Cambridge yesterday (Comet, Jerome, SALDA; disclaimer: I didn’t work on any of them, but I was in the area over the weekend…), there were a few brief mentions of how various university libraries were opening up their catalogues to the search engine crawlers. So for example, if you do a site: limited search on the following paths:

– sabre.sussex.ac.uk/vufindsmu/Record/
– jerome.library.lincoln.ac.uk/catalogue/
– webcat.hud.ac.uk/catlink/bib/
– search.lib.cam.ac.uk/

you can get (partial?) search results, with a greater or lesser degree of success, from the Sussex, Lincoln, Huddersfield and Cambridge catalogues respectively.

In a Google custom search engine context, we can tunnel in a little deeper in an attempt to return results limited to actual records:

– sabre.sussex.ac.uk/vufindsmu/Record/*/Description
– jerome.library.lincoln.ac.uk/catalogue/*
– webcat.hud.ac.uk/catlink/bib/*
– search.lib.cam.ac.uk/?itemid=*

I’ve added these to a new Catalogues tab on my UK HE library website CSE (about), so we can start to search over these catalogues using Google.
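
If you wanted to query a CSE like this programmatically rather than through the web page, the Custom Search JSON API is one route. Here’s a minimal sketch; the API key and cx identifier are placeholders you’d need to supply yourself:

import json
import urllib.parse
import urllib.request

API_KEY = "YOUR_GOOGLE_API_KEY"  # placeholder
CSE_ID = "YOUR_CSE_ID"           # placeholder: the cx identifier of the CSE

def cse_search(query):
    """Run a query against a Google Custom Search Engine via the JSON API."""
    params = urllib.parse.urlencode({"key": API_KEY, "cx": CSE_ID, "q": query})
    url = "https://www.googleapis.com/customsearch/v1?" + params
    with urllib.request.urlopen(url) as response:
        results = json.load(response)
    for item in results.get("items", []):
        print(item["title"], "-", item["link"])

cse_search("open data")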

I’m not sure how useful or interesting this is at the moment, except maybe to the library systems developers, who can compare how informatively their library catalogue content is indexed and displayed in Google search results compared to other libraries… (So for example, I noticed that Google appears to be indexing the “related items” that Huddersfield publishes on a record page, which means that if a search term appears in a related work, you might get back a record that at first glance appears to have little to do with your search term – in effect providing a “reverse related work” search: search on related works and return items that have the search term as a related work.)

Searching UK HE library catalogues via a Google CSE

But it’s a start… and with the addition of customised rankings, might provide a jumping off point for experimenting with novel ways of searching across UK HE catalogues using Google indexed content. (For example, a version of the CSE on the cam.ac.uk domain might boost the Cambridge results; within an institution, works related to a particular course through mention on a reading list might get a boost if a student on that course runs a search… and so on…)

PS A couple of other things that may be worth pondering… could Google Apps for Education account holders be signed up to Subscribed Links offering customised search results in the main Google domain relating to a particular course? (That is, define subscribed link profiles for each course, and automatically add those subscriptions to an Apps for Edu user’s account based on the courses they’re taking?) Or I wonder if it would be possible to associate subscribed links with public access browsers in some way?

And how about finding some way of working with Google to open up “professional” search profiles, where for example students are provided with “read only” versions of the personalised search results of an expert in a particular area who has tuned, through personalisation, a search profile that is highly specialised in a particular subject area, e.g. as mentioned in Google Personal Custom Search Engines? (see also Could Librarians Be Influential Friends? And Who Owns Your Search Persona?).

If anyone out there is working on ways of using Google customised and personalised search as a way of delivering “improved” search results in an educational context, I’d love to hear more about what you’re getting up to…

Fragments… Obtaining Individual Photo Descriptions from flickr Sets

I think I’ve probably left it too late to think up some sort of hack for the UK Discovery developer competition, but here’s a fragment that might provide a starting point for someone else: how to use a Yahoo pipe to grab a list of photos in a particular flickr set (such as one of the sets posted by the UK National Archive to the flickr commons).

The recipe makes use of two calls to the flickr API: one to get a list of photos in a particular set; the second, called repeatedly, to grab down the details for each photo in the set.

In pseudo-code, we would write the algorithm along the lines of:

get list of photos in a given flickr set
for each photo in the set:
  get the details for the photo

Here’s the pipe:

Example of calling flickr api to obtain descriptions of photos in a flickr set

The first step is to construct a call to the flickr API to pull down the photos in a given set. The API is called via a URI of the form:

http://api.flickr.com/services/rest/?method=flickr.photosets.getPhotos
&api_key=APIKEY&photoset_id=PHOTOSETID&format=json&nojsoncallback=1

The API returns a JSON object containing separate items identifying each photo in the set.

The Rename block constructs a new attribute for each photo item (detailURI) containing the corresponding photo ID. The RegEx block applies a regular expression to each item’s detailURI attribute to transform it into a URI that calls the flickr API for the details of a particular photo, by photo ID. The call this time is of the form:

http://api.flickr.com/services/rest/?method=flickr.photos.getInfo
&api_key=APIKEY&photo_id=PHOTOID&format=rest

Finally, the Loop block runs through each item in the original set, calls the flickr API using the detailURI to get the details for the corresponding photo, and replaces each item with the full details of each photo.
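
For anyone who’d rather not use Pipes, here’s a minimal Python sketch of the same two-step recipe, requesting JSON for both calls rather than the rest format the pipe uses for the second one. The API key and set ID are placeholders, and the field names I pull out of the responses are from memory, so check them against the flickr API docs:

import json
import urllib.parse
import urllib.request

API_KEY = "YOUR_FLICKR_API_KEY"   # placeholder
PHOTOSET_ID = "YOUR_PHOTOSET_ID"  # placeholder

REST_ENDPOINT = "http://api.flickr.com/services/rest/"

def flickr_call(method, **params):
    """Call a flickr API method and return the parsed JSON response."""
    params.update(method=method, api_key=API_KEY,
                  format="json", nojsoncallback=1)
    url = REST_ENDPOINT + "?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url) as response:
        return json.load(response)

# Step 1: list the photos in the set
photos = flickr_call("flickr.photosets.getPhotos",
                     photoset_id=PHOTOSET_ID)["photoset"]["photo"]

# Step 2: pull down the full details for each photo in turn
for photo in photos:
    info = flickr_call("flickr.photos.getInfo", photo_id=photo["id"])["photo"]
    print(info["title"]["_content"])
    print(info["description"]["_content"])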

flickr api - photo details

You can find the pipe here: grabbing photo details for photos in a flickr set

An obvious next step might be to enrich the photo descriptions with semantic tags using something like the Reuters OpenCalais service. On a quick demo, this didn’t seem to work in the pipes context (I wonder if there is Calais API throttling going on, or maybe a timeout?), but I’ve previously posted a recipe using Python that shows how to call the OpenCalais service in a Python context: Augmenting OU/BBC Co-Pro Programme Data With Semantic Tags.

Again in pseudo code, we might do something like:

get JSON feed out of Yahoo pipe
for each item:
  call the Calais API against the description element
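
In slightly more concrete (but still sketchy) terms, something like the following might do it. The pipe output URL uses the standard _render=json form; the OpenCalais endpoint and header names are from memory and should be checked against the Calais documentation, and both IDs are placeholders:

import json
import urllib.request

PIPE_URL = ("http://pipes.yahoo.com/pipes/pipe.run"
            "?_id=YOUR_PIPE_ID&_render=json")   # placeholder pipe ID
CALAIS_KEY = "YOUR_CALAIS_LICENCE_KEY"          # placeholder

def calais_tags(text):
    """POST some text to OpenCalais and return the parsed JSON response.
    The endpoint and header names here are assumptions - check the docs."""
    request = urllib.request.Request(
        "http://api.opencalais.com/tag/rs/enrich",
        data=text.encode("utf-8"),
        headers={"x-calais-licenseID": CALAIS_KEY,
                 "Content-Type": "text/raw",
                 "outputformat": "application/json"})
    with urllib.request.urlopen(request) as response:
        return json.load(response)

# Grab the pipe's JSON output and tag each photo description in turn
with urllib.request.urlopen(PIPE_URL) as response:
    items = json.load(response)["value"]["items"]

for item in items:
    print(item.get("title"), calais_tags(item.get("description", "")))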

We could then index the description, title, and semantic tags for each photo and use them to support search and linking between items in the set.

Other pivots potentially arise from identifying whether photos are members of other sets, and using the metadata associated with those other sets to support discovery, as well as information contained within the comments associated with each photo.