Academic Library Usage Data as Reported to SCONUL, via FOI, And a Thought About Whitebox Data Reporting

Something I’ve been meaning to do for ages, but have only just got round to starting, is to send up trial balloon FOI requests about the data that one public organisation might release to other organisations as part of a formal or templated reporting procedure.

So here’s the first one – an FOI request, made via MySociety’s WhatDoTheyKnow service, to the University of Bath Library for a copy of the data they returned to SCONUL for the period 2008/2009 – and here’s the response, along with a copy of the return.

(In general, I wonder whether it would be more useful to ask for a copy of the document in the format in which it was submitted (for example, a Microsoft Word document, if that was what was submitted).)

The information reported to SCONUL is not available from SCONUL for free, although aggregated data from across the UK HE sector is available via a paid-for report. A copy of the current questionnaire used to collect the data is available.

It seems to me that requests of this sort establish a precedent for the release of data produced as part of a formal or standardised reporting process, a precedent that can then be used to encourage (oblige?) other institutions in the same sector to make the same information available in the same way.

So here’s what I have in mind: a site that collects and collates information about the standard reports used to pass information between public sector organisations (including copies of the forms used to gather that data), covering, but not limited to, the information/data that public institutions are obliged to return to government or overseeing agencies.

For example, this DCLG list of the minimum data central Government needs from local authorities is a good start – is there an equivalent for universities (come on, BIS…;-)? [Ah, maybe this is a place to start, at least as far as HESA goes: HEFCE Report 2008: Making your data work for you – Data quality and efficiency in higher education. I imagine there is also a considerable data burden arising from REF reporting?]

As and when reports are demonstrated to be FOIable, their contents also become candidates for open data release. One aim here is to start making data chains visible to the organisations that are producing the data (internal transparency) so that the organisation can become more aware of its own data resources and how they might be used elsewhere within the organisation. (Transparency within the organisation may also lead to a reduction in the duplication of effort involved in creating or collating the same data at several different locations within the same organisation?)

The claim I guess I’m making for this approach to opening up data may be summarised as follows: data that is produced as part of formal reporting and that is FOIable should be made public as a matter of course. As a consequence, there should be little extra effort required to open up the data. Indeed, it may be possible to submit the reports via an open and transparent whitebox reporting process.

[See also: Putting Public Open Data to Work…?]

PS for what it’s worth, I think the SCONUL data case provides another example of a situation where it might be useful to have a WhatDoTheyKnow service that allows you to make the same (bulk) request to every institution in a particular sector (such as universities, or local councils). I can see there may need to be controls around such a service to prevent abuse, but even so it seems worth exploring.

PPS I wonder, do MySociety license WhatDoTheyKnow to any public institutions to help them manage their FOI process?

PPPS Here’s a related comment I posted to the Public Data Corporation engagement exercise:

Question 5 – What methods of access to datasets would most benefit you or your organisation?

One particular class of data that interests me is data that is:

1) reported by a local organisation to a central body;
2) reported using a standardised, templated reporting format;
3) and FOIable from the local organisation and/or from the central body.

For example, in Higher Education, this might include data on library usage as reported to SCONUL, or marketing information about courses submitted to UCAS.

It can often be hard to find out how to phrase an FOI request to obtain this data as submitted, unless you know the type of reporting form used to submit it.

What I would like to see is the Public Data Corporation acting in part as a Public Data Exchange Directory, showing how different classes of public organisation make standard (public data containing) reports to other public organisations, detailing the standard report formats, with names/identifiers for those forms if appropriate, and describing which sections of the report are FOIable. This could also link in to the list of local council data burdens, for example ( http://www.communities.gov.uk/… ), and/or the code of practice for local authority transparency ( http://www.communities.gov.uk/… ).
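For illustration, here’s the sort of thing a single directory entry might record, sketched as a plain Python dictionary. The field names and the form template URL are invented for the sake of the example rather than taken from any existing scheme:

```python
# A hypothetical directory entry for one standard reporting flow, using the
# SCONUL return discussed above as the example. All field names are made up.
sconul_return_entry = {
    "form_name": "SCONUL annual library statistics return",
    "reporting_org_class": "HE institution (library)",
    "receiving_org": "SCONUL",
    "reporting_period": "annual",
    "form_template_url": "http://example.org/forms/sconul-return",  # placeholder
    "foiable_from": ["reporting institution"],  # per the Bath precedent above
    "notes": "aggregated sector data only available via a paid-for report",
}

print(sconul_return_entry["form_name"], "->", sconul_return_entry["receiving_org"])
```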

The next step would be to introduce a pubsub (publish-subscribe) model into the reporting chain for reporting documents* that are wholly FOIable. This could happen in several ways (a rough code sketch of the first option follows the three descriptions):

A) /open report publication/ – the publishing organisation could post their report to their open data reporting store, and the consuming organisation (the one to which the report was being made) would subscribe to that store, collecting the data from there as it was published; third parties could also subscribe to the local publishing store and be alerted to reports as they are published. If co-publication to the central organisation and the public is not appropriate, the report could be withheld from public/press consumption for a specified number of days, or published to the press, but not the public, under embargo.

B) /open deposit/ – the publishing organisation publishes the report/data to an open deposit box owned by the central organisation that is receiving the report. After a specified period of time, the report is made public (i.e. published) via that central deposit box.

C) /data corp in the middle/ – a centralised architecture in which local organisations submit public reports to a Public Data Exchange, which then passes them on to the central body to which reports are made, and publishes them to the public, maybe after a fixed period of time.
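By way of illustration, here’s a minimal, throwaway Python sketch of option A: a publisher’s open reporting store that notifies its subscribers (the central body, plus any interested third parties) as reports are published, and that only exposes a report in the public view once any embargo period has elapsed. The class and field names are invented for the example and don’t correspond to any existing service.

```python
# Sketch of option (A): the publishing organisation's own open reporting store.
# Everything here (ReportStore, Report, the form identifiers) is hypothetical.
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import Callable, List


@dataclass
class Report:
    form_id: str           # e.g. the identifier of the standard reporting form
    period: str            # reporting period, e.g. "2008/2009"
    data_url: str          # link to the submitted document / raw data
    published: date
    embargo_days: int = 0  # optional delay before public release

    def is_public(self, today: date) -> bool:
        """A report becomes public once any embargo period has elapsed."""
        return today >= self.published + timedelta(days=self.embargo_days)


@dataclass
class ReportStore:
    """The publishing organisation's open reporting store."""
    subscribers: List[Callable[[Report], None]] = field(default_factory=list)
    reports: List[Report] = field(default_factory=list)

    def subscribe(self, callback: Callable[[Report], None]) -> None:
        self.subscribers.append(callback)

    def publish(self, report: Report) -> None:
        # Publishing a report notifies every subscriber (the central body,
        # the press, public data aggregators) in a single step.
        self.reports.append(report)
        for notify in self.subscribers:
            notify(report)

    def public_reports(self, today: date) -> List[Report]:
        """The view a member of the public sees: embargoed items withheld."""
        return [r for r in self.reports if r.is_public(today)]


# Usage: a library publishes its annual return; the central body and a third
# party are both notified, but the public view respects a 30-day embargo.
store = ReportStore()
store.subscribe(lambda r: print("Central body received:", r.form_id, r.period))
store.subscribe(lambda r: print("Third party alerted:", r.data_url))
store.publish(Report("sconul-annual-return", "2008/2009",
                     "http://example.org/returns/2008-09.csv",
                     published=date(2010, 1, 15), embargo_days=30))
print(len(store.public_reports(date(2010, 1, 20))))  # 0: still under embargo
print(len(store.public_reports(date(2010, 3, 1))))   # 1: now public
```

Options B and C would look much the same, except that the store would be owned by the central receiving organisation (B) or by a Public Data Exchange sitting in the middle (C), rather than by the publishing organisation itself.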

The intention of all three approaches described above is to provide an open window onto the reporting chain. At the current time, open public data tends to be data that is published via a separate branch “to the public”. In contrast, the above approach suggests that public data publication acts as a view onto all or part of the data as it goes about its daily business, being passed from one organisation to another. That is, public data publication becomes a “tap” onto a dataflow/workflow process.

If one of the desires for data exploitation is to help introduce efficiencies, as well as reuse, in data related activities, third parties need to be able to work with the data as it is currently used.

A final issue relates to the way data is published. The JISC Resource Discovery Taskforce is currently consulting [ http://rdtfmetadata.jiscpress…. ] about metadata standards for describing resources in the Museums, Libraries and Archives field, and work is also ongoing with respect to efficient and complete ways of publishing scientific data. To the extent that generic models or guidance are possible with respect to the representation of arbitrary data sets, it may be worth liaising with those working groups on generic guidelines for effective data publishing conventions. [Disclaimer: I am on the RDTF Technical Advisory Group]

* When talking about reports, I include the following sense: where a report is made, it is likely to include summary figures and maybe complete datasets. Ideally, data contained in reports should also be made available as “raw data” in an open data format, for example compliant with two or more stars of the W3C 5 star open Linked Data publishing scheme [ http://www.w3.org/DesignIssues… ]. In addition, where summary reports appear, referencing views over raw data sets, the database queries that generate the summary report view from the raw data should also be published, thus providing transparency over how the raw data generates the summary statistics that appear in the final report.
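To make that last point concrete, here’s a small sketch, using Python’s built-in sqlite3 module and made-up table and column names, of what publishing the summary query alongside the raw data might look like:

```python
# Illustrative only: the table, columns and figures are invented. The point is
# that the query text is published alongside the raw data and the summary it
# produces, so anyone can reproduce (and audit) the reported figures.
import sqlite3

SUMMARY_QUERY = """
SELECT institution, SUM(loans) AS total_loans
FROM library_usage
GROUP BY institution
ORDER BY total_loans DESC
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE library_usage (institution TEXT, month TEXT, loans INTEGER)")
conn.executemany(
    "INSERT INTO library_usage VALUES (?, ?, ?)",
    [("Institution A", "2008-09", 1200),
     ("Institution A", "2008-10", 980),
     ("Institution B", "2008-09", 450)],
)

# The summary figures that would appear in the report...
for row in conn.execute(SUMMARY_QUERY):
    print(row)

# ...published together with the query itself.
print(SUMMARY_QUERY)
```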

Broken RSS, and a Comment About Blog Comments

Originally posted as a comment on Brian Kelly’s Is It Too Late To Exploit RSS In Repositories?:

I used to advocate the adoption of RSS a lot, and came across some of the problems you mention repeatedly, such as the inability to consume certain pages in off-the-shelf feed consuming apps.

Many of the problems resulted from non-standard character encodings, or incorrectly encoded item.description text. Links/URLs were occasionally missing or pointless (e.g. pointing to the root domain from which the feed was served, rather than anything relating to the particular feed item). Generating sensible URLs for feed items could also turn up issues with the way pages were served, e.g. on sites where session variables or other arbitrary keys were required.
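As a rough illustration of the sort of check that would catch these issues, here’s a minimal sketch using the Python feedparser library; the feed URL is just a placeholder:

```python
# Quick feed sanity check: does the feed parse cleanly, and does every item
# carry a usable, item-specific link? FEED_URL is a placeholder.
import feedparser

FEED_URL = "http://example.org/news/rss"

d = feedparser.parse(FEED_URL)

# bozo is set when the feed is not well formed (bad XML, dodgy encodings...)
if d.bozo:
    print("Feed did not parse cleanly:", d.bozo_exception)

site_root = d.feed.get("link", "")
for entry in d.entries:
    link = entry.get("link")
    title = entry.get("title", "(untitled)")
    if not link:
        print("Item with no link:", title)
    elif link.rstrip("/") == site_root.rstrip("/"):
        # A 'pointless' link: it just points back at the site, not the item.
        print("Item link is just the site root:", title)
```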

The problems were allowed to slip through because of the context in which the feeds were published. E.g. a request goes in, ‘we need a feed’; the developer adds a feed, runs it through a validator, job done.

But the job isn’t done, just as the job isn’t done when someone publishes a public/open data set but doesn’t do anything more than that, or someone publishes an OER and considers that, now it’s public, it’s useful.

I spend way too much of my time trying to glue things together, and finding more often than not that they don’t play nice. For example, Guardian datastore data often falls just short of being easily combined with other data sets, even other Guardian datastore published datasets, though this is getting better all the time as workflows are tweaked ever so slightly…

One possible solution, where things are published /with the intention that others re* them/, is for the publisher to demonstrate a simple remix or combination with at least one other information source.

If you publish an RSS feed, demonstrate one or two off-the-shelf ways of consuming it. This is what any user is likely to try first, so save them the grief of finding out it doesn’t work by making sure it does.

When releasing data, if you’re publishing data relating to countries, for example, see if you can use one of the many services for generating map mashups to map the data. If you can’t, what is it in, or missing from, your data that’s making it hard to do?
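For example, a quick pre-flight check along those lines might look something like the sketch below; the “recognised” list and the sample rows are invented purely for illustration, and in practice you’d check against something like the ISO 3166 country names or codes that mapping services expect:

```python
# Illustrative pre-flight check: will the country labels in this data set
# actually join against the names a mapping service is likely to recognise?
import csv
import io

RECOGNISED = {"United Kingdom", "France", "Germany", "Ireland"}  # toy reference list

sample_csv = """country,value
United Kingdom,10
UK,7
Eire,3
"""

unmappable = []
for row in csv.DictReader(io.StringIO(sample_csv)):
    if row["country"] not in RECOGNISED:
        unmappable.append(row["country"])

# Anything listed here would silently drop off a map mashup unless the
# labels were normalised first.
print("Labels a mapping service may not recognise:", unmappable)
```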

If you’re publishing an OER, big or little, /how/ might you see it being remixed/reused with other OERs? If your content includes lots of diagrams, how easy is it for someone else to reuse one of those images (with attribution and in compliance with any other licence requirements) in their own presentation? If they want to embed it in a blog post (generating not only more views of the content, but also trackable data that you can measure), just try giving a few examples of embedded use. If it’s hard for you as publisher to do the baby steps, why should anyone else bother? (Saying you’re publishing something because you don’t know how other people will use it is not the issue… if it’s hard to do the easy stuff, very few people will bother. The publisher needs to demonstrate the easy stuff, and see it as a way of getting a couple of pragmatic tests implemented, as well as a quick tutorial in getting started with re*ing the warez.)

PS one of the things I’m considering doing more of next year is commenting on other people’s posts directly. The danger with taking such an approach is that those responses get lost (i.e. I can’t easily search for them, and as the major user of this blog as a personal notebook, searching over things I’ve previously written is an important feature). Of course, I could blog a response to other people’s posts, but this fractures the conversation somewhat. I also know from experience that whilst folk may read comments on a blog post, they may not always click through on trackbacked links, if such links exist.

So, I’m considering adding a new category to this blog – CommentedElsewhere – that captures the longer comments as reposts here, with a link back to the original comment and the original context. Good plan, or not? Will it just make OUseful.info even harder to follow? Should I set up a separate ‘OUsefulComments’ blog, repost substantial comments there and then maybe draw a feed into the sidebar here? Your comments would be appreciated…:-)