OUseful.Info, the blog…

Trying to find useful things to do with emerging technologies in open education

Archive for the ‘Policy’ Category

Protection of Freedoms Bill – Release and publication of datasets held by public authorities

with 5 comments

With UK legislation still lacking online commentability, I thought I’d grab a copy of the part of the Bill dealing with the release of public datasets, Part 6 Freedom of information and data protection: Publication of certain datasets.

However, I couldn’t find an easy way of grabbing the appropriate sections in such a way that I could paste them into this document and retain the formatting (given additional constraints that this is a WordPress hosted blog and I can’t apply custom styling to span tagged elements…). So here are some image grabs of the appropriate pages of the PDF version of the Bill…

PS if anyone has any ideas on embedding Bill contents in a blogpost, please let me know…

PS via the International Forum for Responsible Media Blog, here’s a commentary on the above clauses by (legal trainer?) Ibrahim Hasan: FOI Datasets: The Protection of Freedoms Bill

Written by Tony Hirst

February 19, 2012 at 11:48 am

Posted in Policy

Open Standards Consultation and Open Data Standards Challenges

Take a look around you… see that plug socket? If you’re in the UK, it should conform to British Standard BS1363 (you can read the spec if you have your credit card to hand…). Take a listen around you… is that someone listening to an audio device playing an MP3 music file? ISO/IEC 11172-3:1993 (or ISO/IEC 13818-3:1995) helped make that possible… “that” being the agreed upon standard that let the music publisher put the audio file into a digital format that the maker of the audio device knows how to recognise and decode. (Beware, though. The MP3 specification is tainted with all sorts of patents – so you need to check whether you need to pay someone in order to build a device that encodes or decodes MP3 files.) If the music happens to be being played from a CD (hard to believe, but bear with me!), then you’ll be thankful the CD maker and the audio player manufacturer agreed to both work with a physical object that conforms to IEC 60908 ed2.0 (“Audio recording – Compact disc digital audio system”), and that maybe makes use of Standard ECMA-130 (also available as ISO/IEC 10149:1995). That Microsoft Office XML document you just opened somewhere? ISO/IEC 29500-1:2011. And so on…

Standards make interoperability possible. Which means that standards can be a valuable thing. If I create a standard that allows lots of things to interoperate, and I “own” the “intellectual property” associated with that standard, I can make you pay every time you sell a device that implements that standard. If I control the process by which the standard is defined and updated, then I can make changes to the standard that may or may not be to your benefit but with which you have to comply if you want to continue to be able to use the standard.

There are at least a couple of issues we need to take into account, then, when we look at adopting or “buying in” to a standard: who says what goes in to the standard, and how is agreement reached about those things; and under what terms is usage of the standard allowed (for example, do I have to pay to make use of the standard, do I have to pay in order to even read the standard).

At the adoption level, there is also the question of who decides what standard to adopt, and the means by which adoption of the standard is forced onto other parties. In the case of legislation, governments have the power to inflict a considerable financial burden on companies and government agencies by passing legislation that mandates the adoption of a particular standard that has some sort of fee associated with its use. Even outside of legislation, if a large organisation requires its suppliers to use a particular standard, then it could be commercial suicide for a supplier not to adopt the standard even if there are direct licensing costs associated with using it.

If we want to reduce the amount of friction in a process that is introduced by costs associated with the adoption of standards that make that process possible, then “open standards” may be a way forward. But what are “open standards” and what might we expect of them?

A new consultation from the Cabinet Office seeks views on this matter, with a view towards adopting open standards (whatever they are?!;-) across government, wherever possible: Cabinet Office calls on IT Community to engage in Open Standards consultation. In particular, the consultation will inform:

- the definition of open standards in the context of government IT;
- the meaning of mandation and the effects compulsory standards may have on government departments, delivery partners and supply chains; and
- international alignment and cross-border interoperability.

The consultation closes on 1 May 2012.

(Hmm, the consultation doesn’t seem to be online commentable… wouldn’t it be handy if there was something around like the old WriteToReply…?;-)

Here’s a related “open data standards in government” session from UKGovCamp 2012:

Related to the whole open standards thang is a new challenge on the Standards Hub posted by the HM Gov Open Data Standards (Shadow*) Panel (disclaimer: I’m a member of said panel; it’s (Shadow) because the board it will report to has not been formally constituted yet). The challenge covers open standards for “Managing and Using Terms and Codes” and seeks input from concerned parties relating to document standards and specifications relating to the coding and publication of controlled term lists, their provenance, version control/change files, and so on. (So for example, if you happened to work on the W3C provenance data model (which I note has reached the third working draft stage), and think it’s relevant, it might be worth bringing it to the attention of the panel as a reply to the challenge).

It occurs to me that recent JISC activity relating to the UK Discovery initiative may have something to say about the issues involved with, and formats appropriate for, representing and sharing data lists, so I commend the challenge to them: open standards for “Managing and Using Terms and Codes” (I’ll also pick my way through the #ukdiscovery docs and feed anything I find there back to the panel). I also suspect the library and shambrarian community may have something to offer, as well as members of the Linked Universities community…

[A quick note on the Open Data Standards Panel - its role in part is to help identify and recommend open standards appropriate for adoption across government, as well as identify areas where there is a need for open standards development. It won't directly develop any standards, although it may have a role in recommending the commissioning of standards.]

A couple of other things to note on sort of tangentially related matters (this post is in danger of turning in to a newsletter, methinks… [hmmm: should I do a weekly newsletter?!]):

  • JISC just announced some invitations to tender on the production of some reports on Digital Infrastructure Directions. The reports are to cover the following areas: Advantages of APIs, Embedded Licences: What, Why and How, Activity Data: Analytics and Metrics, The Open Landscape, Access to citation data: a cost-benefit and risk review and forward look.
  • the Open Knowledge Foundation has a post up Announcing the School of Data, “a joint venture between the Open Knowledge Foundation and Peer 2 Peer University (P2PU)”. The course is still in the early planning stage, and volunteers are being sought…

Related: last year, the OU co-produced a special series of programmes on “openness” with the BBC World Service Digital Planet/Click (radio) programme. You can listen to the programmes again here:

Written by Tony Hirst

February 10, 2012 at 12:36 pm

Several Takes on the Notion of “Data Laundering”

Picking up on Sleight of Hand and Data Laundering in Evidence Based Policy Making and Paul Bradshaw’s response that we should maybe follow the data, here’s a quick summary of several competing conceptualisations of “data laundering”.

The first relates to the usage in the sense of “[o]bscuring, removing, or fabricating the provenance of illegally obtained data such that it may be used for lawful purposes” ([SectorPrivate's] definition of Data Laundering – as inspired by William Gibson from Mona Lisa Overdrive).

SectorPrivate cites this example from a 2005 Privacy conference paper by Thilo Weichert (Privacy and Data Protection in federal police cooperation):

As working in joint committees naturally is a more informal cooperation, supervision as regards data protection is practically impossible. It is not ensured that personal data transmissions are put down in a protocol being checked. Therefore, it is usually impossible to find out the point of origin of specific information, whether it was obtained lawfully and how its utilisation is limited. In this context one can even talk about data laundering facility: data obtained unlawfully can be passed across the table and be processed without complaints by the receiver in a now cleaned form and can thereupon be passed back.

A more recent reference in ThinkMind // International Journal On Advances in Security, volume 2, numbers 2 and 3, 2009 on Design Patterns for a Systemic Privacy Protection identifies the following:

Problem Situation 4 – Data laundering. Companies are paying a lot of money for personal and group profiles and there are market actors in position to sell them.
This is clearly against data protection principles. This phenomenon is known as ‘data laundering’. Similar to money laundering, data laundering aims to make illegally obtained personal data look as if they were obtained legally, so that they can be used to target customers.

This example is also referred to from an EU Sixth Framework Information Society Technologies deliverable – Safeguards in a World of Ambient Intelligence (SWAMI) Threats, Vulnerabilities and Safeguards in Ambient Intelligence Deliverable D3 3 July 2006 which cites the source as the second SWAMI deliverable. SWAMI-D2 describes the process of data laundering as follows: “Via a large number of transactions and operations, the illegal origin (illegal collection) of personal data can be camouflaged”.

The third deliverable goes on to make the following recommendation:

A means to prevent data laundering could be an obligation imposed on those who buy or
otherwise acquire databases, profiles and vast amounts of personal data, to check diligently
the legal origin of the data. If the buyer does not check the origin and/or the legality of the
databases and profiles, he could be considered equal to a receiver of stolen goods and thus
held liable for illegal data processing. An obligation could also be created which would
require buyers to notify the national data protection officers when personal data(bases) are
acquired. Persons or companies involved or assisting in data laundering could be made
subject to criminal sanctions.

The SWAMI reports thus situate data laundering in the context of invasions of privacy and/or contraventions of data protection legislation. State sponsored, rather than evil criminal mafia initiated, usage of illegally acquired data (eg US gov’t data-laundering: using corporate databases to get around privacy law) also falls into a broadly similar area of data protection/privacy law abuse.

The term “data laundering” also appears to have varied usage in the sense of data cleaning (aka data cleansing), (eg Quick and Dirty Data Laundering: A Scalable Solution for Range Checking Data, Data laundering by target rotation in chemistry-based oil exploration).

The sense in which I first came across the term was whilst discussing a data laundry process that could replace metadata records or fields in library catalogues that are tainted with commercial license restrictions with data of equivalent or higher quality, known provenance and open license terms (Open Data Processes: the Open Metadata Laundry).

The notion I was going for in Sleight of Hand and Data Laundering in Evidence Based Policy Making is different again. Whilst it shares the SWAMI characterisation insofar as it relates to the practice of removing provenance traces from a data set, it does not assume that the data was acquired illegally and it also differs in the purpose to which the laundered data is applied. In the sense I intended, the data is legal but is of low or unverified quality, contains a significant bias, or has a provenance that may lead to a conflict of interest arising from the use to which the data is to be put. The laundering is there not to remove traces of the illegal provenance of the data, but to mask the original provenance with a provenance, authority or veneer of quality associated with another agent, such that the data becomes accepted “at face value” with the imprimatur of an independent trusted party. The second part of my take on data laundering was the use to which the laundered data might be put. Specifically, having been laundered of its dubious provenance, and remarqued with a stamp of independent and/or trusted authority, the data would continue to make its way through a policy development process with the intention that it would influence the policy decision in favour of the outcome preferred by the agent who insinuated the original data into the data laundering chain.

Compare this with the Wikipedia description of money laundering: “Money laundering often occurs in three steps: first, cash is introduced into the financial system by some means (‘placement’), the second involves carrying out complex financial transactions in order to camouflage the illegal source (‘layering’), and the final step entails acquiring wealth generated from the transactions of the illicit funds (‘integration’).”

I would contend that there are thus several different sorts of data malpractice that we might term as data laundering and that one of the tasks facing a Fourth Estate might be to clarify and chase down these various abuses of process whether they occur in the corporate world, academia, the public sector or in government itself.

Written by Tony Hirst

February 3, 2012 at 12:03 am

Posted in Anything you want, Policy

Sleight of Hand and Data Laundering in Evidence Based Policy Making

with 5 comments

I’ve still to make this year’s New Year’s Resolution, but one of the things that I think I’d like to spend more time getting my head round is the notion of “evidence based policy making” (e.g. Is Evidence-Based Government Possible?).

As far as I can tell, this is often caricatured as either involving Googling around a policy area using ministerially obvious Google terms and referencing whatever’s in the top 5 hits, or taking a policy decision then looking for selective evidence to support that decision, along with contrary evidence against competing alternatives (in a related area of evidence based practice, see for example: Some Questions about Evidence-based Practice in Education). If you have other examples in a similar vein, please let me know… #lookingForAnEvidenceBase Also e.g. the idea of policy based evidence making [h/t Jon Warbrick];-)

One of the suspicions I have is that “evidence” inherits the authority associated with the most reputable source associated with it when we wish to call on it to justify it, (and possibly as a complement to that, the least reputable source if we wish to discount it?)

So for example, in his Networker Observer column last weekend, John Naughton describes a presentation given to a technology conference by Facebook’s chief operating officer, Sheryl Sandberg, that pre-empted a European commission announcement on privacy:

Sandberg made claims about the economic benefits of privacy abuse that defy parody. For example, she unveiled a report that Facebook had commissioned from Deloitte, a consultancy firm, which estimated that Facebook – an outfit with a global workforce of about 3,000 – indirectly helped create 232,000 jobs in Europe in 2011 and enabled more than $32bn in revenues.

Inspection of the “report” confirms one’s suspicion that you couldn’t make this stuff up. Or, rather, only an international consulting firm could make it up. Interestingly, Deloitte itself appears to be ambivalent about it. “The information contained in the report”, it cautions, “has been obtained from Facebook Inc and third party sources that are clearly referenced in the appropriate sections of the report. Deloitte has neither sought to corroborate this information nor to review its overall reasonableness. Further, any results from the analysis contained in the report are reliant on the information available at the time of writing the report and should not be relied upon in subsequent periods.” (Emphasis added by JN.)

Accordingly, continues Deloitte, “no representation or warranty, express or implied, is given and no responsibility or liability is or will be accepted by or on behalf of Deloitte or by any of its partners, employees or agents or any other person as to the accuracy, completeness or correctness of the information contained in this document or any oral information made available and any such liability is expressly disclaimed”.

In this case, the Deloitte report was used as evidence by Facebook to demonstrate a particular economic benefit made possible by Facebook’s activities. The consultancy firm’s caveats were ignored (including the fact that the data may in part at least have come from Facebook itself) in reporting this claim. So: this is data laundering, right? We have some dodgy evidence, about which we’re biased, so we give it to an “independent” consultant who re-reports it, albeit with caveats, that we can then report, minus the caveats. Lovely, clean evidence. Our lobbyists can then go to a lazy policy researcher and take this scrubbed evidence, referencing it as a finding in the Deloitte report, so that it can make its way into a policy briefing. Or that’s how I imagine it, anyway…

John’s take was in a similar vein:

The sole purpose of “reports” such as this is to impress or intimidate politicians and regulators, many of whom still seem unaware of the extent to which international consulting firms are used by corporations to lend an aura of empirical respectability to hogwash.

Quite so. ;-) I think my concerns go further though – not only is the Deloitte cachet used to bludgeon evidence-poor audiences into submission, it may also perniciously make its way into documents further up the policy development ladder where only the findings, and none of the caveats (including the dodgy provenance of the data) are disclosed.

So here are a couple of things for the data journalists to take away, maybe?

1) there may be stories to be told about the way other people have sourced and used their data. Where one report quotes data from another, treat it with as much suspicion as you would hearsay… Check with the source.

2) when developing your own data stories, keep really good tabs on where the data’s come from and be suspicious about it. If you can be, be open with republishing the data, or links to it.

PS if you have other examples of data provenance laundering, please add a link as a comment to this post:-)

PPS see also How SOPA and PIPA did and didn’t change how Washington lobbying works: “The political scientist E.E. Schattschneider once called politics “the mobilization of bias.” By this, he meant something both simple and profound. All political battles are fights between competing interests, he noted, but political outcomes are almost always determined by the bias of those paying attention to the conflict. The trick is to make sure you mobilize the crowd that will cheer for you.”

PPPS A bit of history relating to the “data laundry” idea, originally in the context of scrubbing rights tainted records from library catalogue metadata: http://blog.ouseful.info/2011/08/09/open-data-processes-the-open-metadata-laundry/

Written by Tony Hirst

February 1, 2012 at 11:26 am

Posted in Anything you want, Policy

Tracking Down Local Government Consultation Web Pages

with 2 comments

One of the things I have on my to do list for this year is to try to get a joint paper out with Danilo Rothberg on public consultation platforms at local, national and European level.

In the UK, many local councils have an area of their website dedicated to local consultations, so my first hacky thought for a way to track them down was to scrape something together around a Google search of the form: site:gov.uk intitle:consultation intitle:council.

By chance, I stumbled across a page on OpenlyLocal linking to the services offered by a particular council, which made me wonder if I could actually pull down a list of the URLs of consultation pages by council directly from OpenlyLocal.

A quick Twitter exchange with that site’s maestro, Chris Taggart/@countculture, suggested that OpenlyLocal “[s]piders the Localgov redirect urls every week… …trick is knowing the LGD service id code, and then you can get all URLs for councils with URL for it”. In addition, “It’s the OL key you need (that maps to the ldg native uid). Something like this: http://openlylocal.com/services?ldg_service_id=370”.

So, decoding that, and with a bit of extra Googling, here’s where I’m at:

  • from the esd/effective service delivery toolkit (“Facilitated by the Local Government Association (LGA) working for local government improvement so councils can serve people and places better. esd-toolkit is owned and led by the local government sector”), we can find the LGD service ID codes for services relating to consultations:

    • Council – consultation – service delivery (867): All councils are expected to consult on specific areas of their service delivery. This allows service users and other interested parties to have to opportunities to be involved in planning, prioritising and monitoring of services. It also gives customers an opportunity to see all consultation activity, both current and in the past, and a mechanism for customers to research satisfaction with service delivery, opinions about specific projects and looks at lifestyle profiles which helps us design better local services.
    • Council – consultation and community engagement (366): The local authority uses various means to consult and engage with local communities including development of community and citizens’ forums and panels, consultation events, public events, young people’s participation.
    • Council – spending plans – consultation (658): Arrangement of public meetings or other means by which citizens can be consulted on budget plans for the forthcoming year. Previous consultations may be published or available for view on request.
    • Education – consultations (49): The education authority consult with all interested parties (schools, teachers, parents, pupils) on all issues concerning education provision and in particular on any proposed changes to education within schools run by the authority.
    • Equalities and diversity – assessment and consultation (861): The LA is responsible for ensuring that equality and diversity is considered at all times both in employment policy and in the provision of services. Every authority should assess, and consult on, the impact of policy in relation to equality and diversity within their community
    • Planning – consultation (855): The involvement of the public in the planning process. When planning applications are submitted there is a comprehensive system in place which ensures that proposals are publicised in order to invite comments from the local community.
  • To pull down the URL associated with each service for each council from OpenlyLocal (URLs of the form http://openlylocal.com/services?ldg_service_id=370), we need to know the mapping from Local Service ID codes shown above to the corresponding OpenlyLocal service codes (link???)
  • The DirectGov A-Z Directory of Local Services page links to alphabetical listings of service related pages presumably keyed on the Local Gov Service ID (LGSL= in the URL?), though on a quick skim through the listings I couldn’t find any consultation related services? [Ah, I should probably have tried from here: Directgov: Find out about local consultations]
  • From the Local Directgov on the Dept for Communities and Local Government website, I found a newsletter link to Local Directgov: open datasets
  • On data.gov.uk, there’s a handy CSV data file referred to as the Local directgov services list: “This dataset is held on the Local Directgov platform which provides the deep links into Local council websites for a number of services in Directgov. The Local Authority Service details holds the local council URLS for over 240 services where the customer can directly transfer to the appropriate service page on any council in England.” The CSV data is organised as follows:
    Authority Name,SNAC,LAid,Service Name,LGSL,LGIL,Service URL
    ...
    Adur District Council,45UB,1,Find out about local consultations,867,8,http://www.adur.gov.uk/consultation/index.htm
    ...

So, that’s where I’m at… I now have a CSV file from data.gov.uk with a list of deep link URLs in to local gov websites, and a set of Local Gov Service IDs from esd that allow me to identify the links corresponding to various sorts of consultation.
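
As a rough sketch of that filtering step (not something I've run against the full file), the following Python picks out the consultation rows by their LGSL code. It assumes the CSV really does carry the header row shown above; the second sample row, `SAMPLE_CSV` and the `consultation_links` helper are made up for illustration.

```python
import csv
import io

# LGSL codes for the consultation-related services listed above (from esd-toolkit).
CONSULTATION_LGSL = {"867", "366", "658", "49", "861", "855"}

# Illustrative data in the layout of the Local Directgov services list.
# The first data row is the one quoted above; the second is invented for contrast.
SAMPLE_CSV = """Authority Name,SNAC,LAid,Service Name,LGSL,LGIL,Service URL
Adur District Council,45UB,1,Find out about local consultations,867,8,http://www.adur.gov.uk/consultation/index.htm
Example Borough Council,00XX,999,Hypothetical unrelated service,1,8,http://www.example.org/other
"""


def consultation_links(csv_file, lgsl_codes=CONSULTATION_LGSL):
    """Yield (authority, service name, URL) for rows whose LGSL code is of interest."""
    for row in csv.DictReader(csv_file):
        if row["LGSL"].strip() in lgsl_codes:
            yield row["Authority Name"], row["Service Name"], row["Service URL"]


links = list(consultation_links(io.StringIO(SAMPLE_CSV)))
for authority, service, url in links:
    print(authority, "|", service, "|", url)
```

To use it on the real download, swap the `io.StringIO(SAMPLE_CSV)` for an open file handle on the data.gov.uk CSV.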

If I run those URLs through an RSS/Atom feed autodiscovery service, how many open/current consultation feeds do you think I’ll find?!
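
For what it's worth, feed autodiscovery usually comes down to looking for `<link rel="alternate">` elements with an RSS or Atom media type in the page head. Here's a minimal sketch of that check using only the Python standard library. The sample HTML and the example.org URLs are invented, and fetching each council page is left out (that's an assumption about how you'd wire it up, not something shown here).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Media types conventionally used to advertise RSS and Atom feeds.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkParser(HTMLParser):
    """Collect feed URLs advertised via <link rel="alternate" type="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = {name: (value or "") for name, value in attrs}
        rels = attr.get("rel", "").lower().split()
        is_feed = attr.get("type", "").lower() in FEED_TYPES
        if "alternate" in rels and is_feed and attr.get("href"):
            # Resolve relative hrefs against the page URL.
            self.feeds.append(urljoin(self.base_url, attr["href"]))


def discover_feeds(html, base_url):
    """Return a list of feed URLs advertised in the given HTML document."""
    parser = FeedLinkParser(base_url)
    parser.feed(html)
    return parser.feeds


# Invented page for illustration; a real run would fetch each Service URL instead.
SAMPLE_HTML = """<html><head>
<link rel="stylesheet" href="/style.css">
<link rel="alternate" type="application/rss+xml" title="Consultations" href="/consultation/rss.xml">
</head><body></body></html>"""

feeds = discover_feeds(SAMPLE_HTML, "http://www.example.org/consultation/index.htm")
print(feeds)
```

Pages that only link to a feed from the body text, rather than advertising it in the head, won't be picked up by this approach.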

PS One of the things OpenlyLocal is managing to do is provide an abstraction/normalisation layer over the myriad local council websites. It’s interesting to compare this with the JISC funded Linking You Toolkit that surveyed URL patterns across various UK university websites and made a series of recommendations about a normalised URL scheme that could potentially be used (via URL rewrites) to provide a common URL interface over common areas of UK HE websites (a simplification that I think also fits into the spirit of normalised data presentation approach being taken with the Key Information Sets). It strikes me that an alternative scheme, at least for the purposes of building services that can map from a central service to deep links related to particular services or content areas of a university website, would be to follow the Local Gov Service ID model and come up with a set of university related services or content areas (potentially reusing those identified by the Linking You project), and then request that universities publish site maps relating deeplink URLs to the appropriate identifier.
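
To make the site map idea a little more concrete, here's a toy sketch of the sort of lookup a central service might do if universities published such mappings. Everything in it is hypothetical: the institution names, the service identifiers, the URLs and the `deep_links` helper are invented, and no such scheme exists as far as I know.

```python
# Hypothetical "service sitemap": (institution, service identifier) -> deep link.
# The identifiers and URLs below are invented for illustration only.
SERVICE_SITEMAP = {
    ("example-university-a", "courses"): "http://a.example.ac.uk/study/courses/",
    ("example-university-a", "library"): "http://a.example.ac.uk/library/",
    ("example-university-b", "courses"): "http://b.example.ac.uk/prospectus/",
}


def deep_links(service_id, sitemap=SERVICE_SITEMAP):
    """Return {institution: URL} for every institution publishing the given service."""
    return {
        institution: url
        for (institution, sid), url in sitemap.items()
        if sid == service_id
    }


course_pages = deep_links("courses")
for institution, url in sorted(course_pages.items()):
    print(institution, "->", url)
```

The point is just that a shared identifier, rather than a shared URL pattern, is enough to let a central service find the right page on each site.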

PPS as to why I bothered with this post: I’m just trying to document/model an example of the sort of search process I go through whenever I try to find anything out… Which as you can see, is still messed up and informal, starting with Google, then moving to tapping folk I suspect might know the answer to questions I’m trying to articulate, and finally ending up by checking out data.gov.uk…

PPPS Given the full list of government consultation websites for departmental and agency consultations, I wonder: is there a service/content area coding scheme used to identify common areas of central gov department websites?

Written by Tony Hirst

January 6, 2012 at 6:22 pm

Licensing and Tracking Online Content – News and OERs

with 2 comments

Trying to pull a quote from the FreePint/fumsi article “Frictionless sharing” – exploring the changes to Facebook yesterday, I was presented with this pop up dialogue:

I just tried to copy some text from fumsi.com, and here's what I saw...

Clicking “No” meant that rather than grabbing the text I was trying to copy into my clipboard, no copying action took place (when I tried to paste the content, the thing that was pasted was the text I had last successfully copied and pasted…)

Clicking “Yes” took me to another dialogue, shown here in two parts. Firstly, a preview of the text I was trying to copy, and the price charged for reusing it:

fumsi: buying the license...

The “Do I need a License…” line actually ended with a link to Fair Use Statement: Do I need a license to republish an excerpt from this article?.

And the second part – the payment bit:

fumsi - buying a license part 2

I didn’t go through with the purchase, so I can’t say what happens next, or what sort of embed code I get as a result. (I did, however, View Source, and just check that I could copy as much of the content from the original post as I wanted…;-)

The Fair Use statement links to the site that provides the technology behind the popup, iCopyright:

iCopyright provides a comprehensive suite of services to publishers to help them protect, promote and profit from their content. … With one simple implementation of the iCopyright tag, publishers may take advantage of all of these services. No other content licensing solution comes close to matching the iCopyright platform.

Designed to discourage individuals and organizations from using your content without permission, or exceeding the terms of their original license, iCopyright’s peer-policing and license authentication feature allows those who receive content to know whether a proper license has been obtained. People won’t pirate your content if anyone can verify whether it is an authorized copy.

The Feed & Tag Syndication service enables you to license feeds of your own copyrighted content to other publishers, bloggers and websites — instantly! Your iCopyright toolbar and licensing services are embedded in the feed of your content when displayed on subscribing web sites. You earn new revenue for each page view and share in all secondary uses according to terms you specify. Similarly, you can obtain a licensed feed of content from other publishers in the iCopyright network to enhance your own original content. And, of course, you earn a revenue share on all licensed reuses of that content on your site.
Tag-Only Syndication allows you to add iCopyright tags to content you license from third-party content producers who may or may not be in the iCopyright network. Similarly, it allows you to authorize and manage the tagging of your copyrighted content when it appears on third-party sites, such as aggregators and research databases.
In all cases, iCopyright tracks all licenses, collects the fees, and remits revenue shares among the partners and publishers each month. Publishers never lose control over their brand or their content

And so on…


Wouldn’t it be great? Fan-friggin-tastic. Arsem…

In other news, a press release that’s being rehashed* over various media blogs today announces Newsright, a platform for licensing and tracking the (re)use of online (textual?) news content.

(* Churnalism at work, the bane of many of the larger pop tech and pop media blogs; at least when people just share a link to news releases, they’re admitting all they’re doing is forwarding PR statements, rather than copying and pasting text into a blog post without comment, annotation, curation or contextual linking and then pretending they’re doing more than just syndicating PR fluff.)

According to the Washington Post take on the story (AP, NYTimes, McClatchy, others launch NewsRight online rights clearinghouse), Newsright appears to build a tracking system based around the NewsRegistry [FAQ] launched by AP a couple of years or so ago (here’s a report from the time (July 2009): AP takes action on copyright breaches with new tracking system; the NewsRegistry was based around the hNews microformat, I think? (Ref: No Need for Violence in Microformat War Between hNews, rNews)). Certainly, if you go to the NewsRegistry site today you get quickly led to Newsright.

So why’s this interesting in an online education sense? Tracking.

Having released open educational resources onto the web, folk are now getting worried about impact (i.e. one of those forms of return on investment you can use when there isn’t obviously a direct return on the bottom line). I don’t really have much clue about what impact is or is supposed to be, or how it is supposed to be measured (nor, I think, does anyone else), but tracking seems to be one of the gut reaction responses. Which is why the approach taken by Newsright may be of interest to folk wanting to track direct reuse of things like open educational resources.

(For a summary of approaches and technical solutions that have been explored to date, see the JISC/CETIS wiki: Tracking OERs: Technical Approaches to Usage Monitoring for UKOER, the RAPTOR e-resource log analysis toolkit, and JISC’s latest favourite toy in the area, the Learning Registry [press release; I haven't really paid much attention to this, so don't really know what it's all about. Based on this early review (JISC Learning Registry Node Experiment), it's a database that will aggregate metadata and usage and tracking data (the Learning Registry folk call that "paradata", I think? I guess you get way more budget for coining a neologism than you do an acronym?!) that other people have figured out how to collect: "The Learning Registry itself is not a search engine, a repository, or a registry in the conventional sense. Instead the project aims to produce a core transport network infrastructure and will rely on the community to develop their own discovery tools and services, such as search engines, community portals, recommender systems, on top of this infrastructure. Dan commented: 'We assume some smart people will do some interesting (and unanticipated) things with the timeline data stream.' The Learning Registry infrastructure is built on couchDb, a noSQL style 'document oriented database' providing a RESTful JSON API."]).

And finally… This post started with a look at the ways of policing copyright of text in an online setting. We all know that the current copyright laws aren’t particularly suited to the digital context, so it’s perhaps refreshing that they’re under review. In the UK, the Intellectual Property Office are currently running a consultation around copyright (Consultation on proposals to change the UK’s copyright system). The consultation was launched on December 14th, 2011, and runs until March 21st, 2012, so you still have plenty of time to respond;-) A good starting point may well be Extending Copyright Exceptions for Educational Use [PDF].

Written by Tony Hirst

January 6, 2012 at 11:20 am

Posted in Anything you want, Policy

Why Open Data Dumps On Their Own Add Little to Transparency…

with 2 comments

“2012 will be the year where folk realise there’s more to transparent data release than dropping huge data tables”

Along with the “it’ll bootstrap innovation” chant, one of the oft-made claims about the release of open public data is that it’ll be a great boon to the “cause” of transparency. Publishing data, in electronic form, under an open license, is a start, but when it comes to actually trying to make use of public data releases, it can often be a long hard slog, from coping with non-obvious character encodings and data layouts that are all over the place, to reconciling column headings and sheet numbers with explanatory keys provided in a separate document, to trying to make sense of spreadsheet cell formats that mask the form the data is actually in, to [ADD YOUR FAVOURITE BUGBEAR HERE]…

As an example of how crazy things can get, take this tweet from @objectgroup yesterday:

"Now I'm FOI requesting better guidance on getting useful data out of COINS http://www.whatdotheyknow.com/request/cash_flow_and_balance_sheet_for#outgoing-173108"

Here are the highlights (for me) of that request:

Thank you for your reply to my Freedom of Information Request.

You replied to say that the information I requested would be available in the next COINS release on the data.gov.uk website.

However the guidance provided to the COINS data …is insufficient to extract the cash flow, balance sheet and profit and loss statement for each of the 1,500 public bodies included in the WGA.

Could you either:

1) update the guidance so it includes a section on how to extract the cash flow, balance sheet and profit and loss statement for each of the 1,500 public bodies included in the WGA

or

2) add the cash flow, balance sheet and profit and loss statement for each of the 1,500 public bodies included in the WGA to the COINS page of data.gov.uk

(In part this reminded me of the horrible, horrible time I tried to get to grips with COINS Linked Data.)

So what’s the solution? Many software developers are familiar with the notion of an API, an online webservice that computer programmes can talk to. Services like Facebook publish comprehensive APIs that let third party developers build services on top of the Facebook platform, pulling data from and writing data to it. To make life easier for the developers, API publishers often publish software libraries that make it easy to use the API, or code examples that show how to achieve some of the tasks that can at least get you started with the API (for example, The Six Pillars of Complete Developer Documentation or Web API Documentation Best Practices).

So in the context of open data, how can we make life easier? Publishing example use cases, even really, really simple ones (especially really, really simple ones!) along with the data is one way, and it achieves at least the following:

1) you actually have to use the data you’re releasing, even if just in a toy way. So if you find a problem with accessing the data, or how the data is represented, chances are your users will find a problem with it too. And it might be that there’s a really quick fix to the problem. Like fixing a broken link, or checking the filetype… (As a practical step, try this: if you publish a spreadsheet via a link, click on the link, download the file, and just see if you can open it… Or see if you can view it using something like Zoho viewer.)

2) you can show your working. Reports often contain summary data tables or charts generated from raw data sets. The tables and charts appear in PDF documents, and the raw data is dumped as one or more spreadsheets or database tables. The query that is used to generate the summary data table, or chart, or as in the case above, the profit and loss, is typically not released. And that’s the bit that needs to be at least as transparent as the data, if not more so. (I referred to this as a query path in Open Data Processes – Taps, Query Paths/Audit Trails and Round Tripping; see also So Where Do the Numbers in Government Reports Come From?).
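As a rough sketch of what "showing your working" might look like in practice, the query that turns a raw release into a summary table can be saved as its own small script and published next to the data. Everything here is made up for illustration: the file names, the department and amount columns, and the figures are all assumptions, not anything taken from a real release.

```shell
# Hypothetical raw data release (invented columns and values)
workdir=$(mktemp -d)
cat > "$workdir/raw.csv" <<'EOF'
department,amount
Health,100
Transport,30
Health,50
EOF

# The "query path": kept as its own file so it can be released with the data
cat > "$workdir/summary_query.sh" <<'EOF'
#!/bin/sh
# Total amount per department; $1 is the raw CSV (header row skipped)
awk -F, 'NR > 1 { sum[$1] += $2 } END { for (d in sum) print d FS sum[d] }' "$1" | LC_ALL=C sort
EOF

# Regenerate the summary table from the raw data using the published query
sh "$workdir/summary_query.sh" "$workdir/raw.csv" > "$workdir/summary.csv"
cat "$workdir/summary.csv"
```

Anyone with the raw file and the script can then regenerate the summary and check it against the figures in the report.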

See also: Quick Core Dump of Idle Thoughts on the “Making Open Data Real” Consultation and How Might Data Journalists Show Their Working? Sweave.

PS with a bit of luck, the new UK Gov Open Standards Hub will play some sort of role in identifying and improving best practice in data release, and maybe also in raising awareness of good practice conventions that can make life easier for users…

Written by Tony Hirst

January 5, 2012 at 6:15 pm

News, Courses and Scrutiny

with 3 comments

I think I may have confused Stephen Downes yesterday with my notes around consultation based courses, so here are some more loosely connected thoughts that will probably only serve to muddle the situation further, at least for now…;-)

Take the forthcoming UK Parliamentary Communications Green Paper that will lead to a revision of the legislation surrounding communications in the UK. In part, this will draw on the DCMS Communications review carried out earlier this year according to the following process: “An open letter was published on 16 May 2011 asking a broad range of questions about the communications sector. All non-confidential responses to the letter were published on 7 December 2011. Submissions received will be used to inform the Green Paper.” (The public submissions are available as individual documents in either RTF or PDF format.)

The open letter [PDF] included a series of questions relating to communications policy. For example:


Q6. What are the competing demands for spectrum, how is the market changing and how can a regulatory framework best accommodate any rapidly changing demands on spectrum and market development?

Q12. What barriers are there to innovation in new digital media sectors, including video games, telemedicine, local television and education?

In a consultation-framed course, the consultation questions may be thought of as part of the assessment model. One of the aims of the course is to provide “students” taking the course with the knowledge, skills and understanding required to provide a considered response to some or all of these questions.

Note that we may wish to qualify the reading of a question, or wrap it with additional criteria; for example, we might tune Q6 above along the lines of: “What particular issues are likely to arise in the 300MHz to 3GHz band?”, or something like that!

In the Related Information section of the Communication Review, links were provided to a Research report [on] the Contribution of the digital communications sector to economic growth and productivity in the UK and the Government’s broadband strategy among other things. In a sense, we have been gifted some “course readings”. There are also opportunities to dip into research that maybe doesn’t get read (or scrutinised) as widely as it might, in the form of Parliamentary Library Research Briefing papers.

So that’s part of the jigsaw: reviews, consultations, calls for evidence all involve policy makers soliciting evidence and opinion around a topic area that may include technical considerations. Where questions are asked, these may form part of the reflection/self-assessment/course assessment framework. The original call may itself be viewed as a high level syllabus of the topics to be addressed in the course. The course can then address these issues with reference to teaching material (for example, if we’re considering innovation, we might call out some introductory OpenLearn materials on “Characteristics of consumers and the market”).

Whilst the aim of the review, consultation or piece of proposed legislation may not in itself go too deeply into technical areas, it can be used to provide the SPEL (social, political, ethical, legal) context around a technology area and provide a jumping-off point for a technical lesson in that subject area (for example, we may want to consider the similarities and differences between wired networks and wireless networks; or we may need to get up to speed on what optical fibre networks are good for).

Part of the story then, is to try to take the lazy route to curriculum development, and reuse someone else’s, which in this case also amounts to a repurposing of a document or process that wasn’t intended as a course to provide some of the content, topic, cohort discovering and pacing components of a course.

This repurposing lends an element of authenticity and relevance to the course of study (though as mentioned in my previous post, we must be wary that the course is not used as a vehicle for delivering propaganda).

What the approach may also do is increase the amount of scrutiny around a review or route to legislation. In the post No Minister: Any chance for the Communications Act?, Guardian Professional writer Dick Vinegar notes:

Last time around, in 2003, Lord Puttnam, a film director with the right blend of artistic and technical expertise, carried out a pre-legislative scrutiny. I believe that this knocked the heads of broadcasters (fluffies) and comms engineers (techies) together to produce a good bill. From what I have heard so far, I am not sure whether this time around we will get such a mature, ‘two cultures’ approach.

By providing a view over a consultation, or review that is course-like, we can maybe increase the amount of scrutiny involved in the process and also (maybe) deepen people’s understanding of the issues.

The course view thus provides a structured pathway through the relevant issues at a deeper level than provided by the typical supporting documentation, or perhaps just in a more reflective way. The course also provides a way in to citizen engagement from individuals who just want to explore the topic.

The consultation-framed course also provides a way of straddling news and academia, an area that has also interested me in a lifelong learning context for some time.

This could manifest itself in a couple of ways. For example, long form news articles could feature “academic” breakout boxes using OERs sourced from the course, or course discussions could be positioned around issues raised in recent news articles; in a wider context, entry routes to the course may be provided through the news media, from readers who want to know a little more about the issues involved within a particular consultation area (c.f. News, Analysis, Academia and Demand Education or Educative Media?).

Another interesting feature that arises out of the consultation based course learning journey is that “authentic assessment opportunities” present themselves: for example, a student may submit an actual response to the consultation, or, if they entered via the news route, write a letter to the editor. Writing responses in the form of research briefing papers also provides another format for producing work that may be used to demonstrate understanding and knowledge in a meaningful and potentially useful way, as well as an assessable way.

The tone with which reviews or consultations are presented is also interesting from an educational perspective, in that the questions that are asked may be open and may not have a single right answer. (On the other hand, in calls for technical expert evidence, there may well be “correct” answers which the evidentiary call is intended to discover.) This frames the learning activity in the context of “we don’t know what the right answer is, but we need to find out/learn more”. That is, the consultation is in some sense modeling part of the lifelong learning behaviour we want to inculcate in our students (learning is not just for school or university, right?!;-)

Is there a demand for such an exercise though? Again referring to the Guardian Professional article:

In the run up to the green paper, Westminster has been awash with conferences and seminars with titles like ‘What should be in the new Communications Bill?’ and ‘Dear Jeremy…’ (Hunt). Most of the speakers at these portentous events have been full of patriotic hyperbole and statements of the obvious. “The next Comms Act should focus on ensuring that the UK’s communications sector remained one of the most competitive in the world.” “A level playing field is needed in the internet ecosystem with global issues considered carefully.” “Regulation must not chill innovation.” “The limits of online privacy must be defined.” “Children must be protected.”

PS I mentioned in the previous post how at least one of the forums around the forthcoming Communications Green Paper was “CPD certified”. A little digging turned up The CPD Certification Service, which is presumably what that referred to. Anyway, I’ve added it to my watchlist to see if Pearson, or other companies of that ilk, start sniffing around it as a gateway to one possible new credentials market…

PPS Are there any emerging leaders in the qualification verification arena yet?

Written by Tony Hirst

December 22, 2011 at 2:01 pm

Accessing and Visualising Sentencing Data for Local Courts

with 3 comments

A recent provisional data release from the Ministry of Justice contains sentencing data from English(?) courts, at the offence level, for the period July 2010-June 2011: “Published for the first time every sentence handed down at each court in the country between July 2010 and June 2011, along with the age and ethnicity of each offender.” Criminal Justice Statistics in England and Wales [data]

In this post, I’ll describe a couple of ways of working with the data to produce some simple graphical summaries of the data using Google Fusion Tables and R…

…but first, a couple of observations:

- the web page subheading is “Quarterly update of statistics on criminal offences dealt with by the criminal justice system in England and Wales.”, but the sidebar includes the link to the 12 month set of sentencing data;
- the URL of the sentencing data is http://www.justice.gov.uk/downloads/publications/statistics-and-data/criminal-justice-stats/recordlevel.zip, which does not contain a time reference, although the data is time bound. What URL will be used if data for the period 7/11-6/12 is released in the same way next year?

The data is presented as a zipped CSV file, 5.4MB in the zipped form, and 134.1MB in the unzipped form.

The unzipped CSV file is too large to upload to a Google Spreadsheet or a Google Fusion Table, which are two of the tools I use for treating large CSV files as a database, so here are a couple of ways of getting in to the data using tools I have to hand…

Unix Command Line Tools

I’m on a Mac, so like Linux users I have ready access to a Console and several common unix commandline tools that are ideally suited to wrangling text files (on Windows, I suspect you need to install something like Cygwin; a search for windows unix utilities should turn up other alternatives too).

In Playing With Large (ish) CSV Files, and Using Them as a Database from the Command Line: EDINA OpenURL Logs and Postcards from a Text Processing Excursion I give a couple of examples of how to get started with some of the Unix utilities, which we can crib from in this case. So for example, after unzipping the recordlevel.csv document I can look at the first 10 rows by opening a console window, changing directory to the directory the file is in, and running the following command:

head recordlevel.csv

Or I can pull out rows that contain a reference to the Isle of Wight using something like this command:

grep -i wight recordlevel.csv > recordsContainingWight.csv

(The -i reads: “ignoring case”; grep is a command that identifies rows that contain the search term (wight in this case). The > recordsContainingWight.csv says “send the result to the file recordsContainingWight.csv” )
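One wrinkle with filtering this way: grep only keeps lines that match, so the header row is dropped unless it happens to contain the search term, and tools that expect column names in the first row then lose them. A minimal sketch of one way round that, using a made-up miniature file rather than the real recordlevel.csv (the rows and the choice of columns are illustrative assumptions):

```shell
# Hypothetical stand-in for recordlevel.csv (invented rows)
workdir=$(mktemp -d)
cat > "$workdir/sample.csv" <<'EOF'
court,AGE,sex,Offence_type
Isle of Wight Magistrates,25-34,Male,Theft
Leeds Magistrates,35-44,Female,Fraud
ISLE OF WIGHT Crown Court,18-24,Female,Drugs
EOF

# Copy the header row first...
head -n 1 "$workdir/sample.csv" > "$workdir/wight.csv"
# ...then append the matching rows from the rest of the file
tail -n +2 "$workdir/sample.csv" | grep -i wight >> "$workdir/wight.csv"

cat "$workdir/wight.csv"
```

The same two commands should work on the full file by swapping in recordlevel.csv and the output filename used above.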

Having extracted rows that contain a reference to the Isle of Wight into a new file, I can upload this smaller file to a Google Spreadsheet, or as a Google Fusion Table such as this one: Isle of Wight Sentencing Fusion table.

Isle of Wight sentencing data

Once in the fusion table, we can start to explore the data. So for example, we can aggregate the data around different values in a given column and then visualise the result (aggregate and filter options are available from the View menu; visualisation types are available from the Visualize menu):

Visualising data in google fusion tables

We can also introduce filters to allow us to explore subsets of the data. For example, here are the offences committed by females aged 35+:

Data exploration in Google Fusion Tables

Looking at data from a single court may be of passing local interest, but the real data journalism is more likely to be focussed around finding mismatches between sentencing behaviour across different courts. (Hmm, unless we can get data on who passed sentences at a local level, and look to see if there are differences there?) That said, at a local level we could try to look for outliers maybe? As far as making comparisons go, we do have Court and Force columns, so it would be possible to compare Force against force and within a Force area, Court with Court?
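As a first step towards that sort of comparison, a simple count of rows per Force and Court can be pulled out at the command line. This is only a sketch: the sample rows are invented, the column positions are assumed (they would need checking against the header of the real file), and the naive comma splitting will break on any quoted field that itself contains a comma.

```shell
# Hypothetical miniature file with Force and court as the first two columns
workdir=$(mktemp -d)
cat > "$workdir/sample.csv" <<'EOF'
Force,court,Offence_type
Hampshire,Isle of Wight Magistrates,Theft
Hampshire,Southampton Magistrates,Theft
West Yorkshire,Leeds Magistrates,Fraud
Hampshire,Isle of Wight Magistrates,Drugs
EOF

# Count rows for each Force/court pair, skipping the header row,
# and sort the result so the output order is predictable
counts=$(awk -F, 'NR > 1 { n[$1 FS $2]++ } END { for (k in n) print k FS n[k] }' \
  "$workdir/sample.csv" | LC_ALL=C sort)

echo "Force,court,sentences"
echo "$counts"
```

Raw counts on their own say little about sentencing behaviour, of course; they mainly show which courts have enough rows to be worth comparing.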

R/RStudio

If you really want to start working the data, then R may be the way to go… I use RStudio to work with R, so it’s a simple matter to just import the whole of the recordlevel.csv dataset.

Once the data is loaded in, I can use a regular expression to pull out the subset of the data corresponding once again to sentencing on the Isle of Wight (I apply the regular expression to the contents of the court column):

recordlevel <- read.csv("~/data/recordlevel.csv")
iw=subset(recordlevel,grepl("wight",court,ignore.case=TRUE))

We can then start to produce simple statistical charts based on the data. For example, a bar plot of the sentencing numbers by age group:

age=table(iw$AGE)
barplot(age, main="IW: Sentencing by Age", xlab="Age Range")

R - bar plot

We can also start to look at combinations of factors. For example, how do offence types vary with age?

ageOffence=table(iw$AGE, iw$Offence_type)
barplot(ageOffence,beside=T,las=3,cex.names=0.5,main="Isle of Wight Sentences", xlab=NULL, legend = rownames(ageOffence))

R barplot - offences on IW

If we remove the beside=T argument, we can produce a stacked bar chart:

barplot(ageOffence,las=3,cex.names=0.5,main="Isle of Wight Sentences", xlab=NULL, legend = rownames(ageOffence))

R - stacked bar chart

If we import the ggplot2 library, we have even more flexibility over the presentation of the graph, as well as what we can do with this sort of chart type. So for example, here’s a simple plot of the number of offences per offence type:

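library(ggplot2)
# You may need to install ggplot2 first if it isn't already installed: install.packages("ggplot2")
# theme() and element_text() replace the opts() and theme_text() calls used in older ggplot2 releases
ggplot(iw, aes(factor(Offence_type))) + geom_bar() + theme(axis.text.x=element_text(angle=-90)) + xlab('Offence Type')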

GGPlot2 in R

Alternatively, we can break down offence types by age:

ggplot(iw, aes(AGE))+ geom_bar() +facet_wrap(~Offence_type)

ggplot facet barplot

We can bring a bit of colour into a stacked plot that also displays the gender split on each offence:

ggplot(iw, aes(AGE,fill=sex))+geom_bar() +facet_wrap(~Offence_type)

ggplot with stacked factor

One thing I’m not sure how to do is rip the data apart in a ggplot context so that we can display percentage breakdowns, so we could compare the percentage breakdown by offence type on sentences awarded to males vs. females, for example? If you do know how to do that, please post a comment below ;-)

PS Here’s an easy way of getting started with ggplot… use the online hosted version at http://www.yeroon.net/ggplot2/ using this data set: wightCrimRecords.csv; download the file to your computer then upload it as shown below:

yeroon.net/ggplot2

PPS I got a little way towards identifying percentage breakdowns using a crib from here. The following command:
iwp=tapply(iw$Offence_type,iw$sex,function(x){prop.table(table(x))})
generates a (multidimensional) array for the responseVar (Offence) about the groupVar (sex). I don’t know how to generate a single data frame from this, but we can create separate ones for each sex as follows:
iwpMale=data.frame(iwp['Male'])
iwpFemale=data.frame(iwp['Female'])

We can then plot these percentages using constructions of the form:
ggplot(iwpMale)+geom_bar(aes(x=Male.x,y=Male.Freq),stat="identity")
What I haven’t worked out how to do is elegantly map from the multidimensional array to a single data.frame? If you know how, please add a comment below…(I also posted a question on Cross Validated, the stats bit of Stack Exchange…)
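Sidestepping R for a moment, the same percentage breakdown can be sketched as a single long-format table (one row per sex and offence type) using awk at the command line. This swaps in a different technique from the tapply approach above, and the sample rows are invented purely to show the shape of the result:

```shell
# Hypothetical miniature file with just the two columns of interest
workdir=$(mktemp -d)
cat > "$workdir/sample.csv" <<'EOF'
sex,Offence_type
Male,Theft
Male,Theft
Male,Drugs
Male,Fraud
Female,Theft
Female,Fraud
EOF

# For each sex, work out what percentage of its rows fall under each offence type
breakdown=$(awk -F, '
  NR > 1 { pair[$1 FS $2]++; total[$1]++ }
  END {
    for (k in pair) {
      split(k, part, FS)
      printf "%s,%s,%.1f\n", part[1], part[2], 100 * pair[k] / total[part[1]]
    }
  }' "$workdir/sample.csv" | LC_ALL=C sort)

echo "sex,Offence_type,percent"
echo "$breakdown"
```

Saved to a file, a table in this shape could be read back into R with read.csv as an ordinary data frame, with sex available as a column for filling or faceting.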

Written by Tony Hirst

November 29, 2011 at 1:20 pm

Quick Core Dump of Idle Thoughts on the Public Data Corporation (PDC) Consultation

leave a comment »

A first pass at answering the questions on the Public Data Corporation Consultation

“Please provide evidence to support your answer where possible.”
I read this as: “We haven’t really provided any evidence in this consultation, but if you don’t, we can ignore what you say on the grounds it’s anecdotal at best, or more likely, completely unjustified…”

***”1. How do you think Government should best balance its objectives around increasing access to data and providing more freely available data for re-use year on year within the constraints of affordability?”

[Paras 1.12, 1.17, 1.18] Presumably, the first implication is that the PDC will incorporate public bodies involved with the production of “core reference data”/those organisations “whose primary purpose is collecting, managing and disseminating data and providing value-added services based on that data” and the policy framework for deciding who’s in and who’s out must in the first instance rule HM Land Registry, Met Office and Ordnance Survey in? Based on the criteria used to rule these organisations in, is the gut feeling that organisations such as Companies House (who mint unique corporate identifiers), the General Register Office, Office for National Statistics, DVLA (eg Vehicle Checking or Driver Validation Service), Highways Agency (eg http://www.trafficengland.com/index.aspx?ct=true ), academic data repositories such as http://www.data-archive.ac.uk/ or http://www.census.ac.uk/ , The National Archives, the data models and data assets being developed as part of the BBC Digital Public Space project or the JISC UK Discovery initiative would also be ruled in? With publicly funded research increasingly being required to disseminate findings through open access publications (eg http://www.epsrc.ac.uk/about/standards/researchdata/Pages/default.aspx ), to what extent might (or should?) the deposit and/or release of research data be covered: a) by open data principles, b) via the PDC or a research council equivalent, bearing in mind that access to publicly funded research data may be subject to FOI requests (eg http://www.jisc.ac.uk/publications/programmerelated/2010/foiresearchdata.aspx )?

["1.22. The way that Government has sought to cover those high fixed costs and to ensure sustainable investment in data infrastructure has been to encourage public sector bodies to licence their core reference data to third parties"]

But costs introduce friction downstream and may result in one part of government (the data publisher) recouping more than cost from other public bodies? In addition, is it possible that central data collection bodies may as part of their remit collect data from other public bodies that is then resold back to those bodies in an alternative (albeit potentially enriched) form?

PDC Objective 3 states: “create a vehicle that can attract private investment.” So there is presumably a requirement that money flows in from the private sector to the PDC and then out again in spades (because investors will want a return)?

[4.12] If there is a large up-front cost in producing/releasing data, and limited marginal cost, another model would be a one-off fee to offset the production/release cost, rather than a metered/ongoing usage fee offset against production/release cost + marginal cost? That the public body would carry the marginal cost is just a consequence of its data being worked/used, which is partly the point of releasing it in the first place? (ie presumably some benefit accrues elsewhere in the system as a result of the data being worked?)

[4.22] A single fee may provide a barrier to entry to personal users, researchers, SMEs engaged in invention and innovation where there is no established market for as yet undeveloped products or services. Might fee waivers be a possibility, and if so, how would they be awarded? Might there be an equivalent of a public lending library service (a service that traditionally has provided universal access to information, including information from resources that may have a significant acquisition cost associated with them) that will provide “personal research” access to a public task dataset?

[4.23] Is the work of producing datasets part of the public task of the Office for National Statistics, and if so, will we have to pay for access to those statistics?

[4.24] As a corollary to the case, for example, of local councils licensing out the management and operation of civic carparks, would the public bodies be allowed to do the same with contracting out the management, publication of and charging for their public data usage by third parties, and if so, how will limits be set on the pricing, bearing in mind any commercial operator would expect to make a financial return on the operation of that service, and would it impact on the way the public body collects, quality checks and operates its own data processes?

If public bodies are to develop “commercial products to serve commercial markets”, how does this sit with para. 4.18 (profit maximisation model) where an “incentivised PDC [would] fully commercialise all its products and services. While aligned with a strategy focussed purely on maximising value for the taxpayer such a model is unlikely to be consistent with Managing Public Money guidance and delivering on a commitment for free data”? Presumably the “commercial products to serve commercial markets” would be expected to be profit maximising, or not? Cost recovering (as in 4.23)? But what cost (eg would that include the cost of advertising, marketing, and other activities associated with commercial services)?

[4.25, 4.26, 4.28] Freemium does not necessarily imply “try out”. Many freemium services provide an access quota that allows an on-user to use the service as part of their own service, for free, up to certain usage limits. If the usage is heavy, then the commercial plan kicks in. But the small player can run a small service, for free, until they hit usage limits. In some cases, a condition of using the freemium service may be that the user cannot cache the data; ie they must faithfully draw the data down as they use it, rather than building up a local copy. In other cases, they may be encouraged to cache the data so as to prevent repeated service calls for the same data, in which case their usage quota is based on unique data accesses rather than repeated data accesses.

***”2. Are there particular datasets or information that you believe would create particular economic or social benefits if they were available free for use and re-use? Who would these benefit and how? Please provide evidence to support your answer where possible.”

***”3. What do you think the impacts of the three options would be for you and/or other groups outlined above?”

[4.39 Government as user of PDC data] If the fees go up, and public bodies are charged universal commercial rates, they will have to pay more, which will introduce further friction into the process and reduce opportunities for effective data (re)use.

How I read the “options”:
["4.40. Under all options, charges for some units of PDC information are likely to change, with more data being provided free at the point of use."] So there are no additional benefits from Option 1.. so rule this one out?
["4.41. Under Option 2, it is possible that some efficiency savings could be delivered through having a single price, although there will be some upfront investment and resource required to implement a change."] Savings possible, but it will cost in the short term? Rule this one out too?
["4.42. Under Option 3, it is likely that in the short term income would decrease, but if the freemium model was successful income might then increase over time."] Presumably we’re expected to read this as: “It won’t cost anything, and profits may go down in the short term; but then we might get a viable business out of it, and moreover a business capable of growth, using a sexy sounding techie inspired business model… Cool… let’s have that then”? The truth being, of course, that costs are generally associated with any change, and that this is a status quo offering, where public bodies charge other public bodies and private enterprises for data collected as a matter of course (although admittedly at some expense) as part of the operating environment for government.

***”4. A further variation of any of the options could be to encourage PDC and its constituent parts to make better use of the flexibility to develop commercial data products and services outside of their public task. What do you think the impacts of this might be?”

["4.30 There is the potential for providing a PDC and its constituent parts with greater encouragement to make better use of the existing flexibility to develop commercial products to serve commercial markets."] Does this include the ability to develop commercial services based around expertise and support? (Expertise that may not be available widely, for example, particularly to SMEs? In which case, the service would also help support knowledge transfer from the public sphere and into the private sphere?)

Rather than produce data and make it available to other public bodies as well as developing commercial products, would it be possible to give the data away under a truly open license and task the PDC with developing data products and services that save the other public bodies money, working with them to reduce costs that can then be considered as in-kind direct returns on investment in the data services and products?

***”5. Are there any alternative options that might balance Government’s objectives which are not covered here? Please provide details and evidence to support your response where possible”

[4.10] The assumption here appears to be that payment for data should work on the basis that commercial users purchase a license to make use of data from the PDC and pay the PDC directly in financial terms. However, might a commercial user not offer an in-kind payment, such as a guarantee to resell services /at a discount or reduced margin/ to other public services, or make value-added versions of the data produced by the commercial user available for free to specified public bodies? This compares with 4.17 where data is provided free to users who then resell added-value data back to the public body. What is important is that if PDC bodies are producing value-added data, this should be provided free of rights encumbrance to other public bodies, and ideally free of cost; the issue then remains of how the value-added data may be passed on to non-public bodies? The intent of these users might also be worth considering: for example, personal or academic research, commercial research/innovation by SMEs, or as part of a service offering by an established larger company. Differential license/charging agreements of course need to be fair, but might this not be handled through offset grants, for example public data access grants awarded to SMEs via the TSB?

[4.36] Defining the future PDC on the basis of supporting incumbent business models predicated on current processes and ‘the old way of doing things’ is a dangerous step to take. If the open data policy framework is intended to foster innovation, it would be foolish to constrain innovation and limit the future possible use of open and public data to legacy models and processes that represent the current status quo. True innovation may well be disruptive, and upset the current status quo. Such is life.

A set of models that do not appear to have been considered are business models that develop around open source software. In the same way that data can be expensive to produce, may be protected by ownership and licensing rights, and may be used as the basis of other commercially viable services, so too can software. A summary of business and sustainability models appropriate for the open source software domain can be found here: http://www.oss-watch.ac.uk/resources/businessandsustainability.xml It may be worth doing a simple mapping of these models, based as they are around open source software, onto an open data (rather than software) resource context. There is the tension that whilst cost recovery by selling on data may be deemed to be an acceptable mode of operation for a public body, in part because it supports cost recovery through getting a return on sale of goods/services for minimal marginal cost of making those goods/services available, the sale of high value consultancy, for example, requires large additional cost and activity not aligned directly with the provision of public service (in effect, the use of public service to provide private commercial services outside the public sphere, not just internally on a cost recovery basis).

If it is the case that better access to information – and data – helps us make better decisions (and I’m not convinced that what we want is to make decisions: most people have no real choice and just want effective local public services), then the reward to the public body is not so much a direct financial return as a minimisation of costs incurred elsewhere because a bad decision was made.

Recent years have seen a return to prize fund/Grand Challenge based funding models in which prizes are awarded to technology solutions to particular technical challenges. These funding models replace the research funding support model with a reward based model. To what extent might the PDC act as a prize fund awarding body that can reward innovation around the use of public data, and sponsor parties engaged in such competition with “data permits” or “data credits” that provide them with data access in return for them submitting responses to data related Grand Challenges?

To what extent might the Technology Strategy Board ( http://www.innovateuk.org/ ), maybe under the auspices of the Small Business Research Initiative (SBRI) [ http://www.innovateuk.org/deliveringinnovation/smallbusinessresearchinitiative/whatissbri.ashx ] work with SMEs to provide “data credits” that provide companies with access to PDC content if it must be otherwise made available for a fee?

Could the TSB, in association with the PDC, even operate as an angel fund, supporting companies wishing to develop services or products based on public data, in exchange for a share in the companies involved, harking back to ideas behind the foundation of 3i, for example?

***”6. To what extent do you agree that there should be greater consistency, clarity and simplicity in the licensing regime adopted by a PDC?”

Experience of using Creative Commons licences and open source software licenses suggests that even within an open licensing framework, if different license types are combined it rapidly becomes difficult to work out how license conditions surrounding differently licensed components interact. Having a single license guards against creating confusion through complex, and possibly inconsistent, combinations of license conditions arising from the novel combination of differently licensed resources.

The confusion as to what is allowable may act as a significant barrier to developing services that combine resources licensed in different ways. Since much innovation is likely to arise from combination of resources, the multi-license approach is not really viable. Regulating on how datasets may be used/what license terms apply for different use cases may place arbitrary conditions on the innovation of new models that fall outside or across models that are assumed to be possible when the model license conditions are framed.

Furthermore, in a truly open licensing regime, the scope of reuse would not be artificially bounded and the user would be free to reuse the resources in any way they wanted.
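By way of illustration, here is a minimal sketch of why combining differently licensed resources gets awkward so quickly. The licence names and the three conditions modelled here are made up for the purposes of the example (they are not real licences, and real licence terms are rather more subtle); the point is just that once more than one licence type is in play, someone has to work out how the conditions interact.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Licence:
    # Hypothetical, much simplified licence model (an assumption for this sketch)
    name: str
    allows_commercial_use: bool
    allows_derivatives: bool
    requires_share_alike: bool


def combination_problems(licences):
    """List reasons why resources under these licences can't simply be combined."""
    problems = []
    for lic in licences:
        if not lic.allows_derivatives:
            problems.append(f"{lic.name} forbids derived works")
    # Each share-alike licence insists the combined work carries *its* terms,
    # so two different share-alike licences pull in different directions.
    share_alike = sorted({lic.name for lic in licences if lic.requires_share_alike})
    if len(share_alike) > 1:
        problems.append("conflicting share-alike terms: " + ", ".join(share_alike))
    return problems


def commercial_use_allowed(licences):
    """The combined work is only as permissive as its most restrictive part."""
    return all(lic.allows_commercial_use for lic in licences)


# Made-up licences, purely for illustration
OPEN_A = Licence("Open-A", True, True, False)
SHARE_B = Licence("ShareAlike-B", True, True, True)
SHARE_C = Licence("ShareAlike-C", False, True, True)

print(combination_problems([OPEN_A, SHARE_B]))   # no problems found
print(combination_problems([SHARE_B, SHARE_C]))  # share-alike terms clash
print(commercial_use_allowed([OPEN_A, SHARE_C])) # restricted by ShareAlike-C
```

With a single licence, every resource carries the same name and the same conditions, so neither check has anything to trip over.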

***”7. To what extent do you think each of the options set out would address those issues (or any others)? Please provide evidence to support your comments where possible.”

Options 1 or 2 may lead to situations where complex and even pathological combinations of different license types make it impossible for a user to work out whether or not they are allowed to combine a set of resources in a particular way, or develop business models that operate across different license condition regimes.

***”8. What do you think the advantages and disadvantages of each of the options would be? Please provide evidence to support your comments”

Option 1 “[5.10] … each organisation within a PDC would have its own portfolio of standard licences, terms and conditions appropriate to the nature of their business.”

Option 2: Overarching PDC licence agreement, subject to: “[5.17] flexibility to add additional schedules where necessary underneath that overarching agreement.”

Complications from ill-specified consequences of combining differently licensed resources arise here just as they do in the case of Option 1.

Option 3 “While a single licence would offer greater consistency of standard terms and conditions it is likely that there would be a wide range of other terms, clauses and schedules required to cover the various types and uses of PDC information. It is therefore likely to be lengthy and will contain clauses and schedules that will not be relevant to all users”

Does this mean that there will essentially be different licenses according to the status of the user (personal use, academic, commercial, etc) rather than the situation in options 1 and 2 where there are essentially different licenses relating to the use to which resources will be put?

***”9. Will the benefits of changing the models from those in use across Government outweigh the impacts of taking out new or replacement licences?”

I don’t know.

***”10. To what extent is the current regulatory environment appropriate to deliver the vision for a PDC?”

“[6.8] … it is envisaged that all organisations within a PDC will be advised to develop and agree with the regulator the statement of their public task.”

So the management of the current operating funds will be expected to work together to produce their own regulatory framework, at least insofar as the definition of their public task goes? This is likely to be backward looking and protective of current operating models rather than being open to new models and potentially even new ways of defining public tasks that do not respect current organisational boundaries, processes and modes of operation. The PDC as thus described is a way of bringing together current organisations and their associated business models and allowing them to work together to protect those interests, interests that were defined to support a data environment that may no longer exist.

“6.10. In the freemium model there may be a role for the regulator, as indicated earlier, in advising PDC bodies how they can best go about making practical arrangements to make more data free for re-use while ensuring a sustainable business model.”

Requiring that any innovations also protect the current operating model suggests that the establishment of the PDC is actually a rearguard action to protect against a radical change in the ways in which data is produced, managed and exploited within government, as well as for the wider public good through development of third sector and even private services.

***”11. Are there any additional oversight activities needed to deliver the vision for a PDC and if so what are they?”

The vision being the preservation of the status quo through the creation of a conglomerate of current data selling public bodies? And through oversight, you presumably do not mean the creation of a body that can force through changes to the way the board of the PDC decide it will operate, but rather will limit its role to seeing that the board does what it says it will do?

The current proposal for a PDC seems to favour the creation of a conglomerate charged with exploiting public data for financial return wherever it can, rather than acting as a regulator, ombudsman, or advocate tasked with getting the most value out of public data through making effective use of it, and maximising the possibility of making effective use of it?

“[6.1] Given the confines of this consultation, and its remit to focus only on the data policy options for a PDC itself, it would not be appropriate to consult on the whole policy and legislative framework”

Which is to say, you are not soliciting ideas about how to set up a governance regime that will require a nascent PDC to develop structures and processes that seek to innovate in the way public data is collected, processed and exploited, or helps realise a vision where free flowing open public data revitalises the way in which public bodies operate?

***”12. What would be an appropriate timescale for reviewing a PDC or its constituent parts public task(s)?”

It seems you have a done deal already, and the PDC will be set up in a way that means it will be difficult to dismantle or restructure significantly, and that any regulatory scheme that is established will have to be defined so as to regulate an entity that has itself defined how it wants to be regulated?

As with the quick comments on the Making Data Public consultation, I probably need to spend a bit of time reviewing these immediate impressions, but as before, time is short… If you want to harangue me on any obvious howlers, or call me out on any obvious inconsistencies (it might well be the case that comments appear to come from contradictory positions!), feel free to post a comment:-)

Written by Tony Hirst

October 20, 2011 at 10:17 pm
