OUseful.Info, the blog…

Trying to find useful things to do with emerging technologies in open education

Archive for October 2011

Appropriate IT – My ILI2011 Presentation

Here’s a copy of the slides from my ILI2011 presentation on Appropriate IT:

One thing I wanted to explore was, if discovery happens elsewhere, and the role of the librarian is no longer focussed on discovery related issues, where can library folk help out? Here’s where I think we need to start placing some attention: sensemaking, and knowing what’s possible (aka helping redistribute the future that is already around us;-) Allied with this is the idea that we need to make more out of using appropriate IT for particular tasks, as well as appropriating IT where we can to make our lives easier.

In part, sensemaking is turning the wealth of relevant data out there into something meaningful for the question or issue at hand, or the choice we have to make. My own dabblings with social network analysis are approaches I’m working on that help me make sense of interest networks and social positioning within those networks so I can get a feel for how those communities are structured and who the major actors are within them.

As far as knowing what’s possible, I think we have a real issue with “folk IT” knowledge. Most of us have a reasonable grasp of folk physics and folk psychology. That is, we have a reasonable common-sense model of how the world works at the human scale (let go of an apple, it falls to the floor), and we can generally read other people from their behaviour; but how well developed is “folk IT” knowledge? Given that most people find alien the idea that you can search within a page in a wide variety of electronic documents using ctrl-F as a keyboard shortcut to a “search within page/document” feature, I think our folk understanding of IT is limited to the principle of “if you switch it off and on again it should start working again”.

Folk IT is also tied up with computational thinking, but at a practical, “human scale”. So here are a few ideas I think the librarians need to start pushing:

- the idea of a graph; it’s what the web’s based around, after all, and it also helps us understand social networks. If you think of your website as a graph, with edges representing links that connect nodes/pages together, and realise that your on-site homepage is whatever page someone lands on from a search engine or third party link, you soon start to realise that maybe your website is not as usefully structured as you thought…
- some sort of common sense understanding of the role that URLs/URIs play in the browser, along with the idea that URIs are readable and hackable and also may say something about the way a website, or the resources it makes available, is organised;
- the notion of “View Source”, that allows you to copy and crib the work of others when constructing your own applications, along with the very idea that you might be able to build web pages yourself out of free standing components.
- the idea of document types and applications that can work all sorts of magic given documents of that type; the knowledge that an MP3 file works well with an audio player or audio editor, for example, or that a PNG or JPG encodes an image, along with more esoteric formats such as KML (paste a URL to a KML file into the search box of a Google Maps search and see what happens, for example…). Knowledge of the filetype/document type gives you some sort of power over it, and helps you realise what sorts of thing you can do with it… (except for things like PDF, for example, which is to all intents and purposes a “can’t do anything with it” filetype;-)
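To make the graph idea a little more concrete, here is a minimal sketch in base R that treats a tiny website as a set of edges (links) joining nodes (pages), and then asks which pages are linked to and which are dead ends. The page names and links are invented purely for illustration; they don't describe any real site.

```r
# A hypothetical five-page website: each row is one link (edge) from one page (node) to another
links <- data.frame(
  from = c("home", "home", "about", "blog", "blog"),
  to   = c("about", "blog", "contact", "post1", "home"),
  stringsAsFactors = FALSE
)

# Every page that appears at either end of a link is a node in the graph
pages <- sort(unique(c(links$from, links$to)))

# In-degree: how many on-site links point at each page
in_degree <- table(factor(links$to, levels = pages))

# Out-degree: how many on-site links lead away from each page
out_degree <- table(factor(links$from, levels = pages))

# Pages with no onward links are dead ends for anyone who lands on them directly
dead_ends <- pages[as.vector(out_degree) == 0]

print(in_degree)
print(dead_ends)
```

If a search engine drops a visitor onto one of the dead-end pages, that page is effectively their homepage, and in this toy site it gives them nowhere else to go.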

I also think an understanding of pattern based string matching and what regular expressions allow you to do would go a long way towards helping folk who ever have to manipulate text or text-based data files, at least in terms of letting them know that there are often better ways of cleaning up a text file automagically rather than having to repeat the same operation over and over again on each separate row in a file containing several thousand lines… They don’t need to know how to write the regular expression from the off, just that the sorts of operation regular expressions support are possible, and that someone will probably be able to show you how to do it…
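As a rough illustration of the sort of thing that's possible, here is a small base R sketch that cleans up a few messy rows in one pass rather than row by row. The rows are made up, and the patterns are just one way of doing it under the assumption that the first comma separates a name from a value.

```r
# Made-up rows from a messy text file: stray spaces, thousands separators and "n/a" markers
rows <- c("  Leeds , 1,234 ", "York,  n/a", "Hull ,56,789")

# Trim whitespace from the start and end of every row in one go
rows <- gsub("^\\s+|\\s+$", "", rows)

# Tidy up any spaces around the first comma, which separates the name from the value
rows <- sub("\\s*,\\s*", ",", rows)

# Split each row into a name part and a value part at the first comma
name  <- sub(",.*$", "", rows)
value <- sub("^[^,]*,", "", rows)

# Remove the thousands separators, then mark anything that is not all digits as missing
value <- gsub(",", "", value)
value[!grepl("^[0-9]+$", value)] <- NA
value <- as.numeric(value)

cleaned <- data.frame(name, value, stringsAsFactors = FALSE)
print(cleaned)
```

The same handful of lines would work unchanged on three rows or on several thousand, which is really the point.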

Written by Tony Hirst

October 31, 2011 at 3:50 pm

Posted in Infoskills, Library, Presentation

Power Tools for Aspiring Data Journalists: Funnel Plots in R

Picking up on Paul Bradshaw’s post A quick exercise for aspiring data journalists which hints at how you can use Google Spreadsheets to grab – and explore – a mortality dataset highlighted by Ben Goldacre in DIY statistical analysis: experience the thrill of touching real data, I thought I’d describe a quick way of analysing the data using R, a very powerful statistical programming environment that should probably be part of your toolkit if you ever want to get round to doing some serious stats, and have a go at reproducing the analysis using a bit of judicious websearching and some cut-and-paste action…

R is an open-source, cross-platform environment that allows you to do programming like things with stats, as well as producing a wide range of graphical statistics (stats visualisations) as if by magic. (Which is to say, it can be terrifying to try to get your head round… but once you’ve grasped a few key concepts, it becomes a really powerful tool… At least, that’s what I’m hoping as I struggle to learn how to use it myself!)

I’ve been using RStudio to work with R, a) because it’s free and works cross-platform, b) because it can be run as a service and accessed via the web (though I haven’t tried that yet; the hosted option hasn’t appeared yet, either…), and c) because it offers a structured environment for managing R projects.

So, to get started. Paul describes a dataset posted as an HTML table by Ben Goldacre that is used to generate the dots on this graph:

The lines come from a probabilistic model that helps us see the likely spread of death rates given a particular population size.

If we want to do stats on the data, then we could, as Paul suggests, pull the data into a spreadsheet and then work from there… Or, we could pull it directly into R, at which point all manner of voodoo stats capabilities become available to us.

As with the =importHTML formula in Google spreadsheets, R has a way of scraping data from an HTML table anywhere on the public web:

#First, we need to load in the XML library that contains the scraper function
library(XML)
#Scrape the table
cancerdata=data.frame( readHTMLTable( 'http://www.guardian.co.uk/commentisfree/2011/oct/28/bad-science-diy-data-analysis', which=1, header=c('Area','Rate','Population','Number')))

The format is simple: readHTMLTable(url,which=TABLENUMBER) (TABLENUMBER is used to extract the N’th table in the page.) The header part labels the columns (the data pulled in from the HTML table itself contains all sorts of clutter).

We can inspect the data we’ve imported as follows:

#Look at the whole table
cancerdata
#Look at the column headers
names(cancerdata)
#Look at the first few rows (head() shows 6 by default)
head(cancerdata)
#Look at the last few rows (tail() also shows 6 by default)
tail(cancerdata)
#What sort of datatype is in the Number column?
class(cancerdata$Number)

The last line – class(cancerdata$Number) – identifies the data as type ‘factor’. In order to do stats and plot graphs, we need the Number, Rate and Population columns to contain actual numbers… (Factors organise data according to categories; when the table is loaded in, the data is loaded in as strings of characters; rather than seeing each number as a number, it’s identified as a category.)

#Convert the numerical columns to a numeric datatype
cancerdata$Rate=as.numeric(levels(cancerdata$Rate)[as.integer(cancerdata$Rate)])
cancerdata$Population=as.numeric(levels(cancerdata$Population)[as.integer(cancerdata$Population)])
cancerdata$Number=as.numeric(levels(cancerdata$Number)[as.integer(cancerdata$Number)])

#Just check it worked…
class(cancerdata$Number)
head(cancerdata)
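The conversion goes via levels() because calling as.numeric() directly on a factor returns the internal category codes rather than the numbers you can see printed. Here's a minimal sketch using a made-up column (not the scraped data) that shows the difference:

```r
# A made-up column as it might arrive from a scraped table: numbers stored as a factor
rate <- factor(c("17.2", "9.5", "21.0", "9.5"))

# as.numeric() applied directly returns the internal category codes, not the values
codes <- as.numeric(rate)

# Looking each code up in levels() recovers the original strings, which then convert cleanly
values <- as.numeric(levels(rate)[as.integer(rate)])

# An equivalent and perhaps more readable route is to go via character strings
values_via_character <- as.numeric(as.character(rate))

print(codes)
print(values)
```

Both routes give the same numbers; the as.character() version is probably the easier one to remember.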

We can now plot the data:

#Plot the Number of deaths by the Population
plot(Number ~ Population,data=cancerdata)

If we want to, we can add a title:

#Add a title to the plot
plot(Number ~ Population,data=cancerdata, main='Bowel Cancer Occurrence by Population')

We can also tweak the axis labels:

plot(Number ~ Population,data=cancerdata, main='Bowel Cancer Occurrence by Population',ylab='Number of deaths')

The plot command is great for generating quick charts. If we want a bit more control over the charts we produce, the ggplot2 library is the way to go. (ggplot2 isn't part of the standard R bundle, so you'll need to install the package yourself if you haven't already installed it. In RStudio, find the Packages tab, click Install Packages, search for ggplot2 and then install it, along with its dependencies...):

require(ggplot2)
#Plot the points, then add a title and a y-axis label
ggplot(cancerdata)+geom_point(aes(x=Population,y=Number))+ggtitle('Bowel Cancer Data')+ylab('Number of Deaths')

Doing a bit of searching for the "funnel plot" chart type used to display the data in Goldacre's article, I came across a post on Cross Validated, the Stack Overflow/Stack Exchange site dedicated to statistics related Q&A: How to draw funnel plot using ggplot2 in R?

The meta-analysis answer seemed to produce a similar chart type, so I had a go at cribbing the code... This is a dangerous thing to do, and I can't guarantee that the analysis is the same type of analysis as the one Goldacre refers to... but what I'm trying to do is show (quickly) that R provides a very powerful stats analysis environment and could probably do the sort of analysis you want in the hands of someone who knows how to drive it, and also knows what stats methods can be appropriately applied for any given data set...

Anyway - here's something resembling the Goldacre plot, using the cribbed code which has confidence limits at the 95% and 99.9% levels. Note that I needed to do a couple of things:

1) work out what values to use where! I did this by looking at the ggplot code to see what was plotted. p was on the y-axis and should be used to present the death rate. The data provides this as a rate per 100,000, so we need to divide by 100,000 to make it a rate in the range 0..1. The x-axis is the population.

#TH: funnel plot code from:
#TH: http://stats.stackexchange.com/questions/5195/how-to-draw-funnel-plot-using-ggplot2-in-r/5210#5210
#TH: Use our cancerdata
number=cancerdata$Population
#TH: The rate is given as a 'per 100,000' value, so normalise it
p=cancerdata$Rate/100000

p.se <- sqrt((p*(1-p)) / (number))
df <- data.frame(p, number, p.se)

## common effect (fixed effect model)
p.fem <- weighted.mean(p, 1/p.se^2)

## lower and upper limits for 95% and 99.9% CI, based on FEM estimator
#TH: I'm going to alter the spacing of the samples used to generate the curves
number.seq <- seq(1000, max(number), 1000)
number.ll95 <- p.fem - 1.96 * sqrt((p.fem*(1-p.fem)) / (number.seq))
number.ul95 <- p.fem + 1.96 * sqrt((p.fem*(1-p.fem)) / (number.seq))
number.ll999 <- p.fem - 3.29 * sqrt((p.fem*(1-p.fem)) / (number.seq))
number.ul999 <- p.fem + 3.29 * sqrt((p.fem*(1-p.fem)) / (number.seq))
dfCI <- data.frame(number.ll95, number.ul95, number.ll999, number.ul999, number.seq, p.fem)

## draw plot
#TH: note that we need to tweak the limits of the y-axis
fp <- ggplot(aes(x = number, y = p), data = df) +
  geom_point(shape = 1) +
  geom_line(aes(x = number.seq, y = number.ll95), data = dfCI) +
  geom_line(aes(x = number.seq, y = number.ul95), data = dfCI) +
  # linetype is a fixed setting rather than a data mapping, so it sits outside aes()
  geom_line(aes(x = number.seq, y = number.ll999), linetype = 2, data = dfCI) +
  geom_line(aes(x = number.seq, y = number.ul999), linetype = 2, data = dfCI) +
  geom_hline(aes(yintercept = p.fem), data = dfCI) +
  scale_y_continuous(limits = c(0,0.0004)) +
  xlab("number") + ylab("p") + theme_bw()

fp
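For anyone wondering where the 1.96 and 3.29 multipliers in the cribbed code come from, and how the limits might be used to flag points rather than just draw curves, here is a rough base R sketch. It uses made-up areas rather than the scraped data, and it takes a simple pooled rate as the centre line instead of the inverse-variance weighted mean used in the cribbed code, so treat it as an illustration of the idea and not a replication of either analysis.

```r
# The multipliers are two-sided quantiles of the standard normal distribution
z95  <- qnorm(1 - 0.05 / 2)    # roughly 1.96
z999 <- qnorm(1 - 0.001 / 2)   # roughly 3.29

# Made-up areas: population and number of deaths (NOT the data from the article)
toy <- data.frame(
  area       = c("A", "B", "C", "D"),
  population = c(50000, 120000, 300000, 80000),
  deaths     = c(9, 20, 51, 40)
)
toy$p <- toy$deaths / toy$population

# Pooled rate across all areas, used here as the centre line of the funnel (an assumption)
p_bar <- sum(toy$deaths) / sum(toy$population)

# Binomial standard error of the pooled rate at each area's population size
se <- sqrt(p_bar * (1 - p_bar) / toy$population)

# Flag areas whose observed rate falls outside the 99.9% limits
toy$outside999 <- abs(toy$p - p_bar) > z999 * se
print(toy)
```

Because the standard error shrinks as the population grows, the limits narrow from left to right, which is what gives the funnel its shape.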

As I said above, it can be quite dangerous just pinching other folks' stats code if you aren't a statistician and don't really know whether you have actually replicated someone else's analysis or done something completely different... (this is a situation I often find myself in!); which is why I think we need to encourage folk who release statistical reports to not only release their data, but also show their working, including the code they used to generate any summary tables or charts that appear in those reports.

In addition, it's worth noting that cribbing other folks' code and analyses and applying them to your own data may lead to a nonsense result because some stats analyses only work if the data has the right sort of distribution... So be aware of that, always post your own working somewhere, and if someone then points out that it's nonsense, you'll hopefully be able to learn from it...

Given those caveats, what I hope to have done is raise awareness of what R can be used to do (including pulling data into a stats computing environment via an HTML table screenscrape) and also produced some sort of recipe we could take to a statistician to say: is this the sort of thing Ben Goldacre was talking about? And if not, why not?

[If I've made any huge - or even minor - blunders in the above, please let me know... There's always a risk in cutting and pasting things that look like they produce the sort of thing you're interested in, but may actually be doing something completely different!]

PS for how to generate reports that can (optionally) also self-document with the actual source R code, see How might data journalists show their working? Sweave. The code used in, and comments added to, that post make further refinements to the funnel plot code.

Written by Tony Hirst

October 31, 2011 at 1:32 pm

The F*****d Up World That is Personal Identity Theft Fraud Investigation

Suppose you get a letter, with your name and address on it, asking you to call an “Investigations Manager” for a retail finance outfit. The number appears to check out, although forum posts seem to come down 50/50 on ‘is this a scam or not’.

You call the number and get given another number – again, the crowd is split. The “official body”, CIFAS, is even less useful…

You call the second number and immediately they start asking you for personal information “so they can check it against the data they were provided with” that was used to set up a potentially fraudulently created account.

This is wrong, wrong, wrong; broken, broken, broken.

There are no trusted parties either side of this exchange.

The company (C) and the individual (X) don’t trust each other, and neither trusts giving the other information about the individual because: the company doesn’t know the individual is the individual, or the hoaxer; the individual doesn’t know the company at all.

What C wants is personal information from X, so that it can check that information against its records. But X doesn’t trust C, because C may just be phishing.

What X wants is the information that C reputedly holds about X, so that X at least knows whether C has true or false information about X. What X does next is moot, if they don’t trust C anyway. In fact, there is no obvious way for X to develop trust in C except by using public keys, such as phone numbers on websites that X trusts.

One solution might be to go to a trusted party – such as a high street bank, B. If B has a trusted route to C, and also data on record about X, X can go to B, who will relay to C that X is known to B; B passes C’s reference about X to C, maybe along with a single confirming piece of information. If C trusts that B is in the presence of X, and X grants permission to C to divulge information to B, then C might divulge what it knows about X to B, who is the only party both X and C trust in the exchange. But there’s still a problem, because B may not be trustworthy, and may not be in the presence of X (consider a corrupt bank employee with access to X’s records and an accomplice willing to pretend to be X).

Hmmmm…

See also: “For Data Protection Purposes”, Can You Give Me Some Personal Data…

Written by Tony Hirst

October 25, 2011 at 2:10 pm

Posted in Anything you want

Guardian Tag Explorer, Take 2 – Martin Works Some Magic

For whatever reason, I seem to lack the discipline – or insight (or skills) – required to make anything anyone would want to actually use, although I do take delight in exploring new ways of combining existing applications and services to see how they might (in principle) work together…

…which is why I’m really fortunate to have Martin as an informal (and I hope not too put-upon!) network collaborator. Take the Hawkseyfied Guardian Tag Explorer for example, which takes a couple of half-baked hacks of mine and puts them together in a way that allows you to get a mesoscopic view over how particular companies, individuals or news stories have been represented in the Guardian based on the Guardian OpenPlatform tag metadata used to describe the articles they are mentioned in, along with a summary of how many of the corresponding articles were tagged that way, what those articles actually were, and with a link to them:

Guardian tag explorer - via @mhawksey

Martin relates something of its genesis in (Guardian Tag Explorer: When the Guardian Open Platform met d3.js), describing a loosely coupled way of working we have stumbled upon that I think is highly creative, leads to potentially interesting innovations, and often results in incredibly useful – and powerful – recipes and building blocks. And all unfunded, at least in terms of bid for, planned project funding…

In addition, Martin’s been my goto person for all matters relating to Google Apps Script for quite some time now; has built up quite a suite of self-deployable Twitter archiving tools using Google Spreadsheets; and I still don’t understand why more folk haven’t picked up on how far he managed to push the idea of Twitter video subtitles.

I should also add that a lot of my thinking is inspired as a response to something that Martin has created, and that shines light on something that is possible that I’d have probably never considered before. So for example, looking at the reworked tag explorer above, it suggests to me that rather than linking out to the article, we could probably now just as easily pull the story text from the API and display it in-app…If I get a chance this weekend, I’ll try to explore that;-)

I would try to reflect a little on why our loose collaboration feels so productive (at least from my side, for the ideas it sparks, the cribs I can reuse, and the noticings Martin comes up with), but I don’t want to break it…! I think it does have something to do with loosely collaborating – in public, and often in real-time – on reactive unprojects, though, which are often inspired by one or the other of us noticing something that has just been released or commented upon elsewhere, reacting to something the other of us has said, responding to a question we’ve seen tweeted, or because we’ve wondered whether something is possible. And then just tried to do it. Or asked if the other has already done it! Without obligation. But often with the idea that if the other finds it interesting… (#hackbait ;-)

PS If you don’t follow Martin’s blog already, I suggest you start doing so. His recipes also tend to be far easier to follow (and less buggy!) than mine! MASHe.

Written by Tony Hirst

October 21, 2011 at 9:48 am

Quick Core Dump of Idle Thoughts on the Public Data Corporation (PDC) Consultation

A first pass at answering the questions on the Public Data Corporation Consultation

“Please provide evidence to support your answer where possible.”
I read this as: “We haven’t really provided any evidence in this consultation, but if you don’t, we can ignore what you say on the grounds it’s anecdotal at best, or more likely, completely unjustified…”

***”1. How do you think Government should best balance its objectives around increasing access to data and providing more freely available data for re-use year on year within the constraints of affordability?”

[Paras 1.12, 1.17, 1.18] Presumably, the first implication is that the PDC will incorporate public bodies involved with the production of “core reference data”/those organisations “whose primary purpose is collecting, managing and disseminating data and providing value-added services based on that data” and the policy framework for deciding who’s in and who’s out must in the first instance rule HM Land Registry, Met Office and Ordnance Survey in? Based on the criteria used to rule these organisations in, is the gut feeling that organisations such as Companies House (who mint unique corporate identifiers), the General Register Office, Office for National Statistics, DVLA (eg Vehicle Checking or Driver Validation Service), Highways Agency (eg http://www.trafficengland.com/index.aspx?ct=true ), academic data repositories such as http://www.data-archive.ac.uk/ or http://www.census.ac.uk/ , The National Archives, the data models and data assets being developed as part of the BBC Digital Public Space project or the JISC UK Discovery initiative would also be ruled in? With publicly funded research increasingly being required to disseminate findings through open access publications (eg http://www.epsrc.ac.uk/about/standards/researchdata/Pages/default.aspx ), to what extent might (or should?) the deposit and/or release of research data be covered: a) by open data principles, b) via the PDC or a research council equivalent, bearing in mind that access to publicly funded research data may be subject to FOI requests (eg http://www.jisc.ac.uk/publications/programmerelated/2010/foiresearchdata.aspx )?

["1.22. The way that Government has sought to cover those high fixed costs and to ensure sustainable investment in data infrastructure has been to encourage public sector bodies to licence their core reference data to third parties"]

But costs introduce friction downstream and may result in one part of government (the data publisher) recouping more than cost from other public bodies? In addition, is it possible that central data collection bodies may as part of their remit collect data from other public bodies that is then resold back to those bodies in an alternative (albeit potentially enriched) form?

PDC Objective 3 states: “create a vehicle that can attract private investment.” So there is presumably a requirement that money flows in from the private sector to the PDC and then out again in spades (because investors will want a return)?

[4.12] If there is a large up-front cost in producing/releasing data, and limited marginal cost, another model would be a one-off fee to offset the production/release cost, rather than a metered/ongoing usage fee offset against production/release cost + marginal cost? That the public body would carry the marginal cost is just a consequence of its data being worked/used, which is partly the point of releasing it in the first place? (ie presumably some benefit accrues elsewhere in the system as a result of the data being worked?)

[4.22] A single fee may provide a barrier to entry to personal users, researchers, SMEs engaged in invention and innovation where there is no established market for as yet undeveloped products or services. Might fee waivers be a possibility, and if so, how would they be awarded? Might there be an equivalent of a public lending library service (a service that traditionally has provided universal access to information, including information from resources that may have a significant acquisition cost associated with them) that will provide “personal research” access to a public task dataset?

[4.23] Is the work of producing datasets part of the public task of the Office for National Statistics, and if so, will we have to pay for access to those statistics?

[4.24] As a corollary to the case, for example, of local councils licensing out the management and operation of civic carparks, would the public bodies be allowed to do the same with contracting out the management, publication of and charging for their public data usage by third parties, and if so, how will limits be set on the pricing, bearing in mind any commercial operator would expect to make a financial return on the operation of that service, and would it impact on the way the public body collects, quality checks and operates its own data processes?

If public bodies are to develop “commercial products to serve commercial markets”, how does this sit with para. 4.18 (profit maximisation model) where an “incentivised PDC [would] fully commercialise all its products and services. While aligned with a strategy focussed purely on maximising value for the taxpayer such a model is unlikely to be consistent with Managing Public Money guidance and delivering on a commitment for free data”? Presumably the “commercial products to serve commercial markets” would be expected to be profit maximising, or not? Cost recovering (as in 4.23)? But what cost (eg would that include the cost of advertising, marketing, and other activities associated with commercial services)?

[4.25, 4.26, 4.28] Freemium does not necessarily imply “try out”. Many freemium services provide an access quota that allows an on-user to use the service as part of their own service, for free, up to certain usage limits. If the usage is heavy, then the commercial plan kicks in. But the small player can run a small service, for free, until they hit usage limits. In some cases, a condition of using the freemium service may be that the user cannot cache the data; ie they must faithfully draw the data down as they use it, rather than building up a local copy. In other cases, they may be encouraged to cache the data so as to prevent repeated service calls for the same data, in which case their usage quota is based on unique data accesses rather than repeated data accesses.
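The difference between the two quota models can be sketched in a few lines of base R. The request log, the item names and the free allowance below are all made up for illustration; they aren't taken from the consultation or from any real service.

```r
# A made-up log of data requests from one user of a hypothetical freemium data service
requests <- c("postcode/MK7", "postcode/MK7", "postcode/LS1",
              "postcode/MK7", "postcode/YO1")

# A made-up free allowance
free_quota <- 3

# No-caching model: every call counts against the quota, repeats included
calls_counted <- length(requests)

# Caching model: only the first access to each distinct item counts
unique_counted <- length(unique(requests))

# Has the user gone past the free allowance under each model?
over_quota_no_cache <- calls_counted > free_quota
over_quota_cached   <- unique_counted > free_quota

print(c(no_cache = over_quota_no_cache, cached = over_quota_cached))
```

With the same pattern of use, the user in this toy example tips into the commercial plan under the no-caching condition but stays within the free tier when the quota is based on unique accesses.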

***”2. Are there particular datasets or information that you believe would create particular economic or social benefits if they were available free for use and re-use? Who would these benefit and how? Please provide evidence to support your answer where possible.”

***”3. What do you think the impacts of the three options would be for you and/or other groups outlined above?”

[4.39 Government as user of PDC data] If the fees go up, and public bodies are charged universal commercial rates, they will have to pay more, which will introduce further friction into the process and reduce opportunities for effective data (re)use.

How I read the “options”:
["4.40. Under all options, charges for some units of PDC information are likely to change, with more data being provided free at the point of use."] So there are no additional benefits from Option 1… so rule this one out?
["4.41. Under Option 2, it is possible that some efficiency savings could be delivered through having a single price, although there will be some upfront investment and resource required to implement a change."] Savings possible, but it will cost in the short term? Rule this one out too?
["4.42. Under Option 3, it is likely that in the short term income would decrease, but if the freemium model was successful income might then increase over time."] Presumably we’re expected to read this as: “It won’t cost anything, and profits may go down in the short term; but then we might get a viable business out of it, and moreover a business capable of growth, using a sexy sounding techie inspired business model… Cool… let’s have that then”? The truth being, of course, that costs are generally associated with any change, and that this is a status quo offering, where public bodies charge other public bodies and private enterprises for data collected as a matter of course (although admittedly at some expense) as part of the operating environment for government.

***”4. A further variation of any of the options could be to encourage PDC and its constituent parts to make better use of the flexibility to develop commercial data products and services outside of their public task. What do you think the impacts of this might be?”

["4.30 There is the potential for providing a PDC and its constituent parts with greater encouragement to make better use of the existing flexibility to develop commercial products to serve commercial markets."] Does this include the ability to develop commercial services based around expertise and support? (Expertise that may not be available widely, for example, particularly to SMEs? In which case, the service would also help support knowledge transfer from the public sphere and into the private sphere?)

Rather than produce data and make it available to other public bodies as well as developing commercial products, would it be possible to give the data away under a truly open license and task the PDC with developing data products and services that save the other public bodies money, working with them to reduce costs that can be then considered as in kind direct returns on investment in the data services and products?

***”5. Are there any alternative options that might balance Government’s objectives which are not covered here? Please provide details and evidence to support your response where possible”

[4.10] The assumption here appears to be that payment for data should be on the basis that commercial users purchase a license to make use of data from the PDC and pay the PDC directly in financial terms. However, might a commercial user not offer an in-kind payment, such as a guarantee to resell services /at a discount or reduced margin/ to other public services, or make value-added versions of the data produced by the commercial user available for free to specified public bodies? This compares with 4.17 where data is provided free to users who then resell added-value data back to the public body. What is important is that if PDC bodies are producing value add data, this should be provided free of rights encumbrance to other public bodies, and ideally free of cost; the issue then remains of how the value added data may be passed on to non-public bodies? The intent of these users might also be worth considering: for example, personal or academic research, commercial research/innovation by SMEs, or as part of a service offering by an established larger company. Differential license/charging agreements of course need to be fair, but might this not be handled through offset grants, for example public data access grants awarded to SMEs via the TSB?

[4.36] Defining the future PDC on the basis of supporting incumbent business models predicated on current processes and ‘the old way of doing things’ is a dangerous step to take. If the open data policy framework is intended to foster innovation, it would be foolish to constrain innovation and limit the future possible use of open and public data to legacy models and processes that represent the current status quo. True innovation may well be disruptive, and upset the current status quo. Such is life.

A set of models that do not appear to have been considered are business models that develop around open source software. In the same way that data can be expensive to produce, may be protected by ownership and licensing rights, and may be used as the basis of other commercially viable services, so too can software. A summary of business and sustainability models appropriate for the open source software domain can be found here: http://www.oss-watch.ac.uk/resources/businessandsustainability.xml It may be worth doing a simple mapping of these models, based as they are around open source software, onto an open data (rather than software) resource context. There is the tension that whilst cost recovery by selling on data may be deemed to be an acceptable mode of operation for a public body, in part because it supports cost recovery through getting a return on sale of goods/services for minimal marginal cost of making those goods/services available, the sale of high value consultancy, for example, requires large additional cost and activity not aligned directly with the provision of public service (in effect, the use of public service to provide private commercial services outside the public sphere, not just internally on a cost recovery basis).

If it is the case that better access to information – and data – helps us make better decisions (and I’m not convinced that what we want is to make decisions: most people have no real choice and just want effective local public services), then the reward to the public body is not so much a direct financial return as a minimisation of costs incurred elsewhere because a bad decision was made.

Recent years have seen a return to prize fund/Grand Challenge based funding models in which prizes are awarded to technology solutions to particular technical challenges. These funding models replace the research funding support model with a reward based model. To what extent might the PDC act as a prize fund awarding body that can reward innovation around the use of public data, and sponsor parties engaged in such competition with “data permits” or “data credits” that provide them with data access in return for them submitting responses to data related Grand Challenges?

To what extent might the Technology Strategy Board ( http://www.innovateuk.org/ ), maybe under the auspices of the Small Business Research Initiative (SBRI) [ http://www.innovateuk.org/deliveringinnovation/smallbusinessresearchinitiative/whatissbri.ashx ] work with SMEs to provide “data credits” that provide companies with access to PDC content if it must be otherwise made available for a fee?

Could the TSB, in association with the PDC, even operate as an angel fund, supporting companies wishing to develop services or products based on public data, in exchange for a share in the companies involved, harking back to ideas behind the foundation of 3i, for example?

***”6. To what extent do you agree that there should be greater consistency, clarity and simplicity in the licensing regime adopted by a PDC?”

Experience of using Creative Commons licences and open source software licenses suggests that even within an open licensing framework, if different license types are combined it rapidly becomes difficult to work out how license conditions surrounding differently licensed components interact. Having a single license guards against the confusion that arises from complex, and possibly inconsistent, combinations of license conditions when differently licensed resources are combined in novel ways.

The confusion as to what is allowable may act as a significant barrier to developing services that combine resources licensed in different ways. Since much innovation is likely to arise from combination of resources, the multi-license approach is not really viable. Regulating on how datasets may be used/what license terms apply for different use cases may place arbitrary conditions on the innovation of new models that fall outside or across models that are assumed to be possible when the model license conditions are framed.

Furthermore, in a truly open licensing regime, the scope of reuse would not be artificially bounded and the user would be free to reuse the resources in any way they wanted.

***”7. To what extent do you think each of the options set out would address those issues (or any others)? Please provide evidence to support your comments where possible.”

Options 1 or 2 may lead to situations where complex and even pathological combinations of different license types make it impossible for a user to work out whether or not they are allowed to combine a set of resources in a particular way, or develop business models that operate across different license condition regimes.

***”8. What do you think the advantages and disadvantages of each of the options would be? Please provide evidence to support your comments”

Option 1 “[5.10] … each organisation within a PDC would have its own portfolio of standard licences, terms and conditions appropriate to the nature of their business.”

Option 2: Overarching PDC licence agreement, subject to: “[5.17] flexibility to add additional schedules where necessary underneath that overarching agreement.”

Complications from the ill-specified consequences of combining differently licensed resources arise here just as they do in the case of option 1.

Option 3 “While a single licence would offer greater consistency of standard terms and conditions it is likely that there would be a wide range of other terms, clauses and schedules required to cover the various types and uses of PDC information. It is therefore likely to be lengthy and will contain clauses and schedules that will not be relevant to all users”

Does this mean that there will essentially be different licenses according to the status of the user (personal use, academic, commercial, etc) rather than the situation in options 1 and 2 where there are essentially different licenses relating to the use to which resources will be put?

***”9. Will the benefits of changing the models from those in use across Government outweigh the impacts of taking out new or replacement licences?”

I don’t know.

***”10.To what extent is the current regulatory environment appropriate to deliver the vision for a PDC?”

“[6.8] … it is envisaged that all organisations within a PDC will be advised to develop and agree with the regulator the statement of their public task.”

So the management of the current operating funds will be expected to work together to produce their own regulatory framework, at least insofar as the definition of their public task goes? This is likely to be backward looking and protective of current operating models rather than being open to new models and potentially even new ways of defining public tasks that do not respect current organisational boundaries, processes and modes of operation. The PDC as thus described is a way of bringing together current organisations and their associated business models and allowing them to work together to protect those interests, interests that were defined to support a data environment that may no longer exist.

“6.10. In the freemium model there may be a role for the regulator, as indicated earlier, in advising PDC bodies how they can best go about making practical arrangements to make more data free for re-use while ensuring a sustainable business model.”

Requiring that any innovations also protect the current operating model suggests that the establishment of the PDC is actually a rearguard action to protect against a radical change in the ways in which data is produced, managed and exploited within government, as well as for the wider public good through development of third sector and even private services.

***”11.Are there any additional oversight activities needed to deliver the vision for a PDC and if so what are they?”

The vision being the preservation of the status quo through the creation of a conglomerate of current data selling public bodies? And through oversight, you presumably do not mean the creation of a body that can force through changes to the way the board of the PDC decide it will operate, but rather will limit its role to seeing that the board does what it says it will do?

The current proposal for a PDC seems to favour the creation of a conglomerate charged with exploiting public data for financial return wherever it can, rather than acting as a regulator, ombudsman, or advocate tasked with getting the most value out of public data through making effective use of it, and maximising the possibility of making effective use of it?

“[6.1] Given the confines of this consultation, and its remit to focus only on the data policy options for a PDC itself, it would not be appropriate to consult on the whole policy and legislative framework”

Which is to say, you are not soliciting ideas about how to set up a governance regime that will require a nascent PDC to develop structures and processes that seek to innovate in the way public data is collected, processed and exploited, or helps realise a vision where free flowing open public data revitalises the way in which public bodies operate?

***”12.What would be an appropriate timescale for reviewing a PDC or its constituent parts public task(s)?”

It seems you have a done deal already, and the PDC will be set up in a way that means it will be difficult to dismantle or restructure significantly, and that any regulatory scheme that is established will have to be defined so as to regulate an entity that has itself defined how it wants to be regulated?

As with the quick comments on the Making Data Public consultation, I probably need to spend a bit of time reviewing these immediate impressions, but as before, time is short… If you want to harangue me on any obvious howlers, or call me out on any obvious inconsistencies (it might well be the case that comments appear to come from contradictory positions!), feel free to post a comment:-)

Written by Tony Hirst

October 20, 2011 at 10:17 pm

To do: Critical Infrastructure Status Maps

leave a comment »

From time to time we get a domestic broadband outage, but whenever it occurs, I’m never quite sure whether it’s a problem at my end, or with the exchange. On such occasions, I have a bookmarked link on my phone to the BT broadband status/issues page: http://btbusiness.custhelp.com/app/service_status

Digging around last night, I found a site that links to similar pages for a wide variety of ISPs (though it doesn’t go as far as scraping the status updates – it just embeds the relevant pages in an iframe): Netstatus.

Some of the other utility companies – gas, water, electricity – also provide status pages where you can find information about planned as well as unscheduled outages. Here’s what I’ve found so far:

Electricity
Eon-UK Power Cut map (Broken link – redirects to http://www.central-networks.co.uk/; power cuts link on that page just goes round the loop too… However, Powercuts.info seems to be pulling this data somehow? I *think* that site is only regional at the moment – would be good if it could start to offer national coverage…?;-)

Electricity NorthWest – Power cuts information (also Power outage map)

Electricity NorthWest power outage map

There’s also data on recent electricity demand from the National Grid: National Grid electricity data: realtime demand

Gas

None found?

Water
South West Water – Live

South West Water - live

Thames Water – Live

Thames Water Live

Yorkshire Water – In Your Area -includes incidents, roadworks…

So here’s something for a hack day: compile a list of status update pages, build a set of scrapers and/or a meta-API, and publish a critical infrastructure status map/app…
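
As a rough sketch of what the meta-API part might look like: each scraper returns records in one common shape, and a thin layer merges them and filters by area. Everything here is made up for illustration. The `StatusReport` fields, the provider names and the sample records are all assumptions, and the scrapers are stubs rather than anything pointed at a real status page.

```python
from dataclasses import dataclass, asdict
from typing import Callable, Dict, List


@dataclass
class StatusReport:
    """One normalised incident record (a hypothetical common schema)."""
    utility: str   # e.g. "water", "electricity", "broadband"
    provider: str  # name of the company reporting the incident
    area: str      # postcode district or place name as published
    status: str    # "planned" or "unplanned"
    summary: str


# Each scraper is just a function returning a list of StatusReport objects.
# Real ones would fetch and parse a provider's status page; these are stubs
# returning invented sample data.
def scrape_example_water() -> List[StatusReport]:
    return [StatusReport("water", "Example Water", "EX1", "unplanned",
                         "Burst main, supply interrupted")]


def scrape_example_power() -> List[StatusReport]:
    return [StatusReport("electricity", "Example Power", "EX1", "planned",
                         "Substation maintenance")]


SCRAPERS: Dict[str, Callable[[], List[StatusReport]]] = {
    "example-water": scrape_example_water,
    "example-power": scrape_example_power,
}


def collect_reports() -> List[dict]:
    """Run every registered scraper and merge the results into one list."""
    reports: List[dict] = []
    for name, scraper in SCRAPERS.items():
        try:
            reports.extend(asdict(r) for r in scraper())
        except Exception as err:  # one broken scraper shouldn't sink the rest
            reports.append({"provider": name, "status": "scraper-error",
                            "summary": str(err)})
    return reports


def reports_for_area(area: str) -> List[dict]:
    """Filter the merged feed down to a single area, ready for a map layer."""
    return [r for r in collect_reports() if r.get("area") == area]
```

Adding a new utility is then a matter of writing one more scraper function and registering it in `SCRAPERS`; the map/app only ever sees the merged feed.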

If you have links to any other UK utility status pages, please post a link below and I’ll try to pull a comprehensive list together.

I haven’t yet looked to see which of the utilities have Twitter accounts or Facebook pages, and if so whether they are used to provide live status updates. If you know more, please let me know via a comment:-)

Written by Tony Hirst

October 19, 2011 at 11:04 am

Quick Core Dump of Idle Thoughts on the “Making Open Data Real” Consultation

with 2 comments

Quick core dump of thoughts, largely culled from things I’ve doodled before… Needs much more work, but time is running out on me… So any/all comments appreciated….

Open Data Consultation

Notes

***1. Do the definitions of the key terms go far enough or too far?
“Public services are either provided by public bodies, or providers who have been funded, commissioned or established by statute to provide a service”

I assume the definitions of open data and public services are to be taken together, with the consultation focussing on ‘open (public) data produced by public services’? For such bodies, I assume there is also a formal “data burden” that defines the public data reporting requirements to the centre, as well as devolved data burdens eg into local government from schools? Would it make sense to clarify the notion and extent of data burdens, and the extent to which elements of these (and the organisations they apply to) should be subject to open data requirements? I guess there is also a data burden placed on individual citizens in respect of filing tax forms, for example, that are not subject to openness requirements?

A clear statement at least of data burdens/formal reporting requirements between public bodies that are in scope for mandatory release as open public data should be made available, eg along the lines of http://www.communities.gov.uk/localgovernment/decentralisation/tacklingburdens/singledatalist/ http://getthedata.org/questions/500/data-burden-on-uk-higher-education (I know some work has already been done on this that I used as the basis for a simple data burden visualisation exercise ( http://www.flickr.com/photos/psychemedia/5536836259/ ).)

“Dataset”
It may be useful to distinguish between data collected for operational, administrative or statistical use, as well as the extent to which data produced in the normal course of events is being legitimately requested as is, or whether it must be processed before release (eg http://www.adls.ac.uk/what-is-administrative-data-and-why-use-it-for-research/ http://www.unsiap.or.jp/ms/ms7/DennisP1_OppoChalle.pdf ).

It may also be worth distinguishing between the release of complete data sets, views over the data that represent a query on a complete dataset, and queries, sampling procedures or any other means that are used to generate those data views. For example, providing data relating to performance indicators for a particular school in response to an FOI request from a citizen equates to the provision of a particular view over the database containing performance indicators for UK schools as a whole; providing a copy of the database as a whole to a developer of a school comparison website represents the provision of a complete dataset.

Datasets may provide value to others in a variety of ways: for example:
- using complete datasets as the basis of comparison or recommendation services;
- using complete datasets to support statistical analyses, segmentation/clustering of data;
- generating very particular or specific views over the dataset by constructing meaningful and appropriate queries on the datasets. Queries are also reusable, and whilst some cost may be incurred in creating them, making them open, and suitably parameterising them, the marginal cost of reusing the queries is then minimal. It is possible that queries that take a long time to create/optimise become valuable in their own right, and that the dataset and the view can be given away freely. The query unlocks value in the dataset and delivers it to the requester. When it comes to government reporting, where reports include summary views over open datasets, the openness/transparency requirement should not be deemed to be met unless the query that generates the view from the dataset is also openly published.
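
To make the dataset/query/view distinction a bit more concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table, the school names and the indicator values are all invented for illustration; the point is only that the published query text plus its parameters is what turns the complete dataset into a particular view.

```python
import sqlite3

# Hypothetical complete dataset: performance indicators for a few schools.
ROWS = [
    ("School A", "Northtown", 2010, 71.5),
    ("School B", "Northtown", 2010, 64.0),
    ("School C", "Southville", 2010, 80.2),
]

# The published, reusable query. A view is defined by this text plus
# whatever parameter values the requester supplies.
SCHOOLS_IN_AREA = """
    SELECT school, indicator
    FROM performance
    WHERE area = :area AND year = :year
    ORDER BY indicator DESC
"""


def build_dataset() -> sqlite3.Connection:
    """Load the complete dataset into an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE performance "
        "(school TEXT, area TEXT, year INTEGER, indicator REAL)"
    )
    conn.executemany("INSERT INTO performance VALUES (?, ?, ?, ?)", ROWS)
    return conn


def view(conn: sqlite3.Connection, area: str, year: int) -> list:
    """Generate a view over the dataset by running the published query."""
    params = {"area": area, "year": year}
    return conn.execute(SCHOOLS_IN_AREA, params).fetchall()
```

Anyone holding the dataset and the query text can regenerate the same view, or change the parameters to get a different one, which is the sense in which the query is reusable at minimal marginal cost.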

Datasets may also include recordsets relating to an individual; where personal access to personal data/mydata is possible, we need to distinguish between the private/personal right for an individual to access their data, or an agent acting on their behalf and with their permission, as opposed to general public access.

****2. Where a decision is being taken about whether to make a dataset open, what tests should be applied?
If data is part of a formally defined data burden, should that data burden be tiered in terms of openness requirements, for example along lines of:
- open on submission to the centre;
- open following embargo period and subject to checking by the centre, but with the presumption that it will be opened;
- not open;

Where data is FOIable, that may be taken as evidence in favour of presumed openness. If data is regularly requested via FOI, it could be made available in open form as a matter of course in order to reduce FOI overheads in the future. When data is released via FOI, it could be made available via an open data site in partial fulfilment of handling the FOI request. When responding to FOI requests for data, the process required to obtain and release that data could be captured and compared with the actual processes relating to operational and administrative use of that data in order to identify whether an open data tap can be introduced into the current data process to open it as a matter of course, or release it efficiently in response to an FOI request.

As the major producer and consumer of public data, public bodies are well placed to benefit from more open public data. “Publicness” and “openness” both help make data accessible for use within and between public bodies, as well as reuse by third parties; accessibility is also improved by timely release of data, and the publication of data using open standards and formats.

Consequences of making data open should also be considered; for example, once released, will there be continued access to regular updates of the data using the same format? (If the data is released sporadically and with inconsistent formats, services that automate the regular collection of the data are not really viable.)

****3. If the costs to publish or release data are not judged to represent value for money, to what extent should the requestor be required to pay for public services data, and under what circumstances?

Where work must be done that does not represent value for money (what would an example of this be? Having to get data into a form the public body would never use?), it may be appropriate to consider the amount of value that is added in processing the data that the requester might otherwise be expected to add, for example as just reward for the cost of processing that data. If the raw data is open, and the requester asks for processed data, it may be appropriate to give the raw data away freely but charge for the value add of processing it that the requester seeks to exploit in the course of their business? However, there will also be a tension between people who want to gain access to a small amount of data, either for personal use or for research/innovation purposes, and companies who make use of that data in volume as part of a business. In the latter case, we might expect some payment for use of the data once the business is operating, although it could be argued that if the business is profitable, there is a return built in through taxation.

A balance may need to be struck based on the number of independent requests that are likely to be received for a particular data set and the use they wish to put it to. If N requests are made for the data, and all N parties need to do the same work cleaning or processing the data in the same way, that is obviously inefficient. It may be that third parties process and repackage data, for a fee. But the question arises – if data as published is not fit for use by third parties, is it fit for use by the first (producer) or second (‘official consumer’) party, or has the data been produced solely in response to some openness criteria, and not because the data is actually used for anything?

The ability to save cost elsewhere in government may also be an issue. For example, local authorities who make disbursements to care homes need to guard against fraud by regularly checking death reports, often through the purchase of commercial death registers or by checking the local newspaper’s death notices. Whilst a cost may be associated with signatories of death certificates ensuring this data enters the public body data chain in an accessible and open way, it may well save costs in multiple other areas of government.

Where data is processed and released in exchange for a payment, would it be possible for the raw underlying data also to be made available for free so that third parties can, at their own expense, carry out the required processing if they can do so for less overall cost than piecewise purchase of data from the public body?

****4. How do we get the right balance in relation to the range of organisations (providers of public services) our policy proposals apply to? What threshold would be appropriate to determine the range of public services in scope and what key criteria should inform this?

If an organisation is subject to FOI requests, or data it produces and returns as part of an official data burden may be requested through FOI requests, it should be in scope?

Analysis of data processes associated with fulfilling data burden requirements might provide a basis for identifying where in a data process data might reasonably be made public and open.

*****5. What would be appropriate mechanisms to encourage or ensure publication of data by public service providers?
If data related FOI responses are published via open data sites, the open data site can become a repository of commonly requested data and help identify which processes might benefit from releasing open data as a matter of course.

Where public data is reported as a matter of course by the local press and in the local interest, (for example, court reports, planning notices, traffic notices), public bodies might be encouraged to publish the corresponding data in an open way in order to facilitate the local dissemination of that information. Note that much of this data is transitory/may only be relevant for a limited period. In this case, we need to consider: whether there is a public interest in making the data publicly available and open on an archival basis, or not providing archives per se, but responding to requests for archival copies of data; the extent to which third parties can archive/aggregate such data and continue to make it available; whether there are privacy reasons for not supporting archival access (for example, court reports in local newspapers have a “short memory”).

Are there guidelines available that cover the interactions between things like:
- data eligible for release under FOI;
- data that may be redacted on grounds of Data Protection Act;
- data covered by Database Right or data that is covered by copyright;
- data released through National Statistics ( http://www.legislation.gov.uk/ukpga/2007/18/contents )
- reusable public sector information ( http://www.legislation.gov.uk/uksi/2005/1515/contents/made )

Analysis of data burden reporting process might identify appropriate points at which data can be made open as part of the process. For example, reported data may be posted to an open data site from where it is collected (“pull reporting”). See also: http://blog.ouseful.info/2011/03/18/open-data-processes-taps-query-pathsaudit-trails-and-round-tripping/

And as I responded to the PDC Engagement Exercise, [o]ne particular class of data that interests me is data that is:

1) reported by a local organisation to a central body;
2) using a standardised, templated reporting format,
3) and that is FOIable either from the local organisation, and/or from the central body.

For example, in Higher Education, this might include data on library usage as reported to SCONUL, or marketing information about courses submitted to UCAS.

It can often be hard to find out how to phrase an FOI request to obtain this data as submitted, unless you know the type of reporting form used to submit it.

What I would like to see is the Public Data Corporation acting in part as a Public Data Exchange Directory, showing how different classes of public organisation make standard (public data containing) reports to other public organisations, detailing the standard report formats, with names/identifiers for those forms if appropriate, and describing which sections of the report are FOIable. This could also link in to the list of local council data burdens, for example ( http://www.communities.gov.uk/… and/or the code of practice for local authority transparency ( http://www.communities.gov.uk/… )

The next step would be to introduce a pubsub (publish-subscribe) model in the reporting chain for reporting documents* that are wholly FOIable. This could happen in several ways:

A) /open report publication/ – the publishing organisation could post their report to their opendata reporting store, and the consuming organisation (the one to which the report was being made) would subscribe to that store, collecting the data from there as it was published; third parties could also subscribe to the local publishing store and be alerted to reports as they are published. If co-publication to the central organisation and the public is not appropriate, the report could be withheld from public/press consumption for a specified period of days, or published to the press but not the public under embargo.

B) /open deposit/ – the publishing organisation publishes the report/data to an open deposit box owned by the central organisation which is receiving the report. After a specified period of time, the report is made public (ie published) via that central deposit box.

C) /data corp in the middle/ – a centralised architecture in which local organisations submit public reports to a Public Data Exchange, which then passes them on to the central body to which reports are made, and publishes them to the public, maybe after a fixed period of time.

The intention of all three approaches described above is to provide an open window onto the reporting chain. At the current time, open public data tends to be data that is published via a separate branch “to the public”. In contrast, the above approach suggests that public data publication acts as a view onto all or part of the data as it goes about its daily business being published from one organisation to another. That is, public data publication becomes a “tap” onto a dataflow/workflow process.
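
Here is a toy sketch of the idea, loosely along the lines of option A (options B and C differ mainly in who hosts the store). The class and field names are hypothetical and nothing here corresponds to an existing system: the central body subscribes and is notified as soon as a report is published, whilst the public view is simply a tap onto the same store that holds reports back until the embargo period has passed.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import Callable, List


@dataclass
class Report:
    publisher: str
    title: str
    submitted: date


@dataclass
class ReportingStore:
    """Hypothetical open reporting store with an optional public embargo."""
    embargo_days: int = 0
    reports: List[Report] = field(default_factory=list)
    official_subscribers: List[Callable[[Report], None]] = field(
        default_factory=list)

    def subscribe_official(self, callback: Callable[[Report], None]) -> None:
        """Register the body to which the report is formally being made."""
        self.official_subscribers.append(callback)

    def publish(self, report: Report) -> None:
        """Store the report and notify official subscribers straight away."""
        self.reports.append(report)
        for notify in self.official_subscribers:
            notify(report)

    def public_view(self, today: date) -> List[Report]:
        """The public 'tap': the same reports, minus any still under embargo."""
        cutoff = timedelta(days=self.embargo_days)
        return [r for r in self.reports if today - r.submitted >= cutoff]
```

The design choice worth noticing is that there is only one copy of the report: the official consumer and the public read from the same place, and the embargo is just a filter on the public view.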

If one of the desires for data exploitation is to help introduce efficiencies as well as reuse in data related activities, third parties need to be able to work with data as it is currently used.

***How will we ensure that Open Data standards are embedded in new ICT contracts?
By providing a test suite as part of the contract that includes tests such as running data import/export/query operations against centralised validation services.

***What is the best way to achieve compliance on high and common standards to allow usability and interoperability?
Require data reporting to proceed through open interfaces or interfaces where public data taps can be applied. Released data should be authentic, and representative of data used as part of a public body’s activities or reporting duties rather than data that is produced purely for release on an open data site.

***How would we ensure that public service providers in their day to day decisionmaking honour a commitment to Open Data, while respecting privacy and security considerations?
Take a lead from open source software projects and publish requests via an issue tracker that can show when an ‘issue’ was raised, what its current status is, and how it was resolved. Related approaches include services like WhatDoTheyKnow or GetTheData.

***How should public services make use of data inventories? What is the optimal way to develop and operate this?
If we distinguish between datasets, queries on datasets, and reports/data views generated by queries on datasets on the one hand, and data burdens on the other, we can start to map out how queries are used on datasets to generate reports that fulfil data burden requirements. That has the benefit of making the data burden fulfilment process more transparent, as well as contextualising both the way those reports are generated (through exposing the queries) and the original data sets used as a basis for creating reports.

***Should the data that government releases always be of high quality? How do we define quality? To what extent should public service providers “polish” the data they publish, if at all?
One rule of thumb is that the data should be “good enough”. The question then arises, ‘good enough for whom?’. If the data is released and never referred to, its quality is irrelevant as regards the non-existent on-users, although it may signal problems elsewhere. If data is used by a third party and found to contain errors or omissions, the question arises: does the publisher also suffer from those same quality issues (and if so, how are they handling them?); or are they using a different data set as part of the process that the released dataset relates to (and if so, why isn’t that data being released?)

There are different levels of cleanliness we may associate with data: a major issue in many datasets relates to the use of inconsistent labels to refer to the same entity (something that can be addressed by using universal persistent identifiers). Character set encodings can also cause problems, especially where it is hard to identify what character sets are used within a file.
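
By way of illustration, here is a small sketch of both problems. The alias table and the `org:` identifiers are invented (a real scheme would use whatever persistent identifiers the sector had agreed on), and the encoding fallback is just one common heuristic, not a reliable detector.

```python
from typing import Optional

# Hypothetical lookup from label variants seen in the wild to one
# persistent identifier per entity. The identifiers are invented.
ALIASES = {
    "open university": "org:0001",
    "the open university": "org:0001",
    "ou": "org:0001",
    "milton keynes council": "org:0002",
    "milton keynes borough council": "org:0002",
}


def decode_bytes(raw: bytes) -> str:
    """Try UTF-8 first, then fall back to Windows-1252 (a heuristic only)."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # errors="replace" because a few byte values are undefined in cp1252
        return raw.decode("cp1252", errors="replace")


def to_identifier(label: str) -> Optional[str]:
    """Map a free-text label to its identifier, or None if unrecognised."""
    key = " ".join(label.lower().split())  # fold case, collapse whitespace
    return ALIASES.get(key)
```

Once every row carries an identifier rather than a free-text label, joining two datasets that refer to the same organisations becomes a straightforward key match.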

***How should government approach the release of existing data for policy and research purposes: should this be held in a central portal or held on departmental portals?
As I understand the current situation, public body reports often produce summary tables and as part of transparency requirements, release as public data raw datasets that are used to generate those summary tables. In such cases, the query used to generate the summary table from the raw data should also be published. The transparency does not come from releasing summary tables and saying “it summarises that pile of data”. It comes from saying – here is the summary, and here is how it was generated from that data, allowing the observer to check the assumptions of the query, redo the analysis, and so on.

Using services such as Google spreadsheets or Zoho spreadsheets, it is possible to provide a preview view over the data contained in a dataset made available as a simple CSV file (this approach is taken on some datastores). It is also possible to use services such as Google spreadsheets as a database, and so provide a certain level of intermediate developer access to the raw data as if read access were made available to the database that sourced the released data (eg http://blog.ouseful.info/2010/11/19/government-spending-data-explorer/ ). A range of powerful hosted statistical analysis and visualisation tools are now available that can also provide a user interface layer over data published in such environments (“analysis at the point of delivery”). For example, the popular R environment can provide an online statistical analysis UI to online hosted datasets via services such as http://www.stat.ucla.edu/~jeroen/ggplot2.html or http://www.rstudio.org/docs/server/getting_started . These tools provide an intermediary step that allows interested parties to explore datasets in situ. Recent developments with the Linked Data API ( http://www.epimorphics.com/web/tools/linked-data-api.html ) offer similar capabilities, including the ability to share persistent URLs to queries that are applied to public Linked Data stores such as those hosted under the data.gov.uk umbrella.

****Is there a role for government to stimulate innovation in the use of Open Data? If so, what is the best way to achieve this?
Allow free access to public data for personal, research, social enterprise and SME commercial research/development purposes. If the service using the data ever becomes popular, worry about how to charge for it then…

Written by Tony Hirst

October 18, 2011 at 4:11 pm

Posted in Anything you want, Policy

Google Social API Otherme Service – How Does It Work Exactly?

leave a comment »

Mulling over where I’m situated on Google+, it occurred to me that I should probably start looking at cross-network networks…for example, comparing public social networks across Google+ and Twitter.

A handy tool for this is the (little known?) Google Social API otherme service.

This lets you look up identities that are, in one way or another, associated with each other across the web… but the question is, how is the lookup working? For example, if I look up my otherme profiles using my Twitter and Google+ accounts, I get different results, even though my Twitter profile is revealed when I look up my Google+ identity:

Google otherme service

Maybe the service only reveals the otherme links that are publicly declared on the identity I’m actually looking up? (But then, I don’t think my Twitter ID links to my Quora identity?)

Anyway, this is something I think I’ll try to have a play with over the next week or so, though I’m not quite sure how yet… I guess the first step might be to just plot a Google+ map and relabel the nodes, where possible, with Twitter IDs, then compare the Twitter friends map with the Google+ map. I’m not sure quite why this might be interesting (or maybe it won’t be), but if nothing else, it’ll get me exploring the different ways in which it’s possible to compare the overlap and structure of two networks with similar nodes…
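
One way the relabel-and-compare step might look, as a minimal sketch in Python. Everything here is made up for illustration: the edge lists, the IDs and the `gplus_to_twitter` mapping are hypothetical stand-ins for whatever real lookups would return, and no actual otherme API call is shown. Edges are treated as directed (follower, followed) pairs, and overlap is measured with a simple Jaccard index, which is just one of several possible measures.

```python
def relabel(edges, mapping):
    """Relabel both ends of each edge, dropping edges with an unmapped node."""
    relabelled = set()
    for source, target in edges:
        if source in mapping and target in mapping:
            relabelled.add((mapping[source], mapping[target]))
    return relabelled


def jaccard(edges_a, edges_b):
    """Jaccard index of two edge sets: shared edges / all distinct edges."""
    union = edges_a | edges_b
    if not union:
        return 0.0
    return len(edges_a & edges_b) / len(union)


# Hypothetical Google+ edges, using made-up Google+ IDs.
gplus_edges = {("g1", "g2"), ("g2", "g3"), ("g1", "g4")}

# Hypothetical mapping from Google+ IDs to Twitter IDs (g4 has no known match).
gplus_to_twitter = {"g1": "alice", "g2": "bob", "g3": "carol"}

# Hypothetical Twitter friend edges.
twitter_edges = {("alice", "bob"), ("bob", "carol"), ("carol", "alice")}

if __name__ == "__main__":
    gplus_as_twitter = relabel(gplus_edges, gplus_to_twitter)
    print(sorted(gplus_as_twitter & twitter_edges))  # edges present in both
    print(jaccard(gplus_as_twitter, twitter_edges))
```

Dropping unmapped nodes is a design choice: it restricts the comparison to people who can be identified on both networks, at the cost of ignoring everyone else.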

PS if you’re feeling flush, there are also plenty of services out there that will sell you personal data associated with online identifiers, eg fliptop, Rapleaf, Rapportive.

Written by Tony Hirst

October 18, 2011 at 1:24 pm

Posted in Anything you want

Questions Being Asked by the Current Open Public Data Consultations

With the October 27th closing date looming for the two current open data consultations, I need to start putting my response together. Here’s a quick recap of the questions being asked:

The PDC consultation questions were as follows:

Chapter 4 – Charging for PDC information
1. How do you think Government should best balance its objectives around increasing access to data and providing more freely available data for re-use year on year within the constraints of affordability? Please provide evidence to support your answer where possible.
2. Are there particular datasets or information that you believe would create particular economic or social benefits if they were available free for use and reuse? Who would these benefit and how? Please provide evidence to support your answer where possible.
3. What do you think the impacts of the three options would be for you and/or other groups outlined above? Please provide evidence to support your answer where possible.
4. A further variation of any of the options could be to encourage PDC and its constituent parts to make better use of the flexibility to develop commercial data products and services outside of their public task. What do you think the impacts of this might be?
5. Are there any alternative options that might balance Government’s objectives which are not covered here? Please provide details and evidence to support your response where possible.
Chapter 5 – Licensing
6. To what extent do you agree that there should be greater consistency, clarity and simplicity in the licensing regime adopted by a PDC?
7. To what extent do you think each of the options set out would address those issues (or any others)? Please provide evidence to support your comments where possible.
8. What do you think the advantages and disadvantages of each of the options would be? Please provide evidence to support your comments.
9. Will the benefits of changing the models from those in use across Government outweigh the impacts of taking out new or replacement licences?
Chapter 6 – Regulatory oversight
10. To what extent is the current regulatory environment appropriate to deliver the vision for a PDC?
11. Are there any additional oversight activities needed to deliver the vision for a PDC and if so what are they?
12. What would be an appropriate timescale for reviewing a PDC or its constituent parts’ public task(s)?

And from the Making Open Data Real consultation:

Chapter 8
How would we establish a stronger presumption in favour of publication than that which currently exists?
Is providing an independent body, such as the Information Commissioner, with enhanced powers and scope the most effective option for safeguarding a right to access and a right to data?
Are existing safeguards to protect personal data and privacy measures adequate to regulate the Open Data agenda?
What might the resource implications of an enhanced right to data be for those bodies within its scope? How do we ensure that any additional burden is proportionate to this aim?
How will we ensure that Open Data standards are embedded in new ICT contracts?

What is the best way to achieve compliance on high and common standards to allow usability and interoperability?
Is there a role for government to establish consistent standards for collecting user experience across public services?
Should we consider a scheme for accreditation of information intermediaries, and if so how might that best work?

How would we ensure that public service providers in their day to day decision-making honour a commitment to Open Data, while respecting privacy and security considerations?
What could personal responsibility at Board-level do to ensure the right to data is being met include? Should the same person be responsible for ensuring that personal data is properly protected and that privacy issues are met?
Would we need to have a sanctions framework to enforce a right to data?
What other sectors would benefit from having a dedicated Sector Transparency Board?

How should public services make use of data inventories? What is the optimal way to develop and operate this?
How should data be prioritised for inclusion in an inventory? How is value to be established?
In what areas would you expect government to collect and publish data routinely?
What data is collected “unnecessarily”? How should these datasets be identified? Should collection be stopped?
Should the data that government releases always be of high quality? How do we define quality? To what extent should public service providers “polish” the data they publish, if at all?

How should government approach the release of existing data for policy and research purposes: should this be held in a central portal or held on departmental portals?
What factors should inform prioritisation of datasets for publication, at national, local or sector level?
Which is more important: for government to prioritise publishing a broader set of data, or existing data at a more detailed level?

Is there a role for government to stimulate innovation in the use of Open Data? If so, what is the best way to achieve this?

Hmmm….

Written by Tony Hirst

October 17, 2011 at 5:57 pm

Posted in Policy

Most Active Organisations wrt Ministerial Meetings?

A quick follow-on to the previous post on UK Ministers’ meetings… I resolved the differences between organisation names, calculated the weighted degree of each node (which takes edge weights into account, rather than just counting edges), and plotted the organisations with the greatest weighted degree. Node size indicates weighted degree; the colour ranges from blue to red and indicates the actual degree. So a large red node has a high degree (lots of meetings with different Ministers), and a large blue node has a low degree (meetings with few different Ministers, but presumably repeated ones).

Active lobbyists

So are these the most active organisations in terms of arranging formal meetings with Ministers?
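
To make the difference between degree and weighted degree concrete, here is a minimal sketch in Python. The meeting records are invented for the example (they are not taken from the Ministers’ meetings data), and the function name is hypothetical; each record is a minister, an organisation, and the number of meetings between them, which acts as the edge weight.

```python
from collections import defaultdict

# Hypothetical records: (minister, organisation, number of meetings).
MEETINGS = [
    ("Minister A", "Org X", 5),
    ("Minister A", "Org Y", 1),
    ("Minister B", "Org Y", 1),
    ("Minister C", "Org Y", 1),
]


def degree_and_weighted_degree(edges):
    """Return {organisation: (degree, weighted degree)}.

    Degree counts the distinct ministers an organisation met;
    weighted degree sums the number of meetings across those ministers.
    """
    ministers_met = defaultdict(set)
    meeting_total = defaultdict(int)
    for minister, organisation, count in edges:
        ministers_met[organisation].add(minister)
        meeting_total[organisation] += count
    return {
        organisation: (len(ministers_met[organisation]), meeting_total[organisation])
        for organisation in ministers_met
    }


if __name__ == "__main__":
    for organisation, (degree, weighted) in sorted(
        degree_and_weighted_degree(MEETINGS).items()
    ):
        print(organisation, "degree:", degree, "weighted degree:", weighted)
```

In this toy data, Org X would be a large blue node (one minister, many repeat meetings) and Org Y a smaller but redder one (fewer meetings, spread across three ministers).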

Written by Tony Hirst

October 17, 2011 at 3:29 pm

Posted in Anything you want
