Archive for the ‘Open Data’ Category
Whenever a new open data dataset is released, the #opendata wires hum a little more. More open data is a Good Thing, right? Why? Haven’t we got enough already?
In a blog post a few weeks ago, Alan Levine, aka @cogdog, set about Stalking the Mythical OER Reuse: Seeking Non-Blurry Videos. OERs are open educational resources, openly licensed materials produced by educators and released to the world so others could make use of them. Funding was put into developing and releasing them and then, … what?
OERs. People build them. People house them in repositories. People do journal articles, conference presentations, research on them. I doubt not their existence.
But the ultimate thing they are supposed to support, maybe their raison d’être – the reuse by other educators – what do we have to show for that except whispered stories, innuendo, and blurry photos in the forest?
Alan went in search of the OER reuse in his own inimitable way…
… but came back without much success. He then used the rest of the post to put out a call for stories about how OERs have actually been used in the world… Not just mythical stories, not coulds and mights: real examples.
So what about opendata – is there much use, or reuse, going on there?
It seems as if more datasets get opened every day, but is there more use every day? First-day use of newly released datasets, incremental reuse of the datasets that are already out, linkage between the new datasets and the previously released ones?
Yesterday, I spotted via @owenboswarva the release of a dataset that aggregated and normalised data relating to charitable grant awards: A big day for charity data. Interesting… The supporting website – 360 Giving – (self-admittedly in its early days) allows you to search by funder, recipient or key word. You have to search using the right keywords, though, and the right capitalisation of keywords…
And you may have to add in white space… so *University of Oxford * as well as *University of Oxford*.
I don’t want to knock the site, but I am really interested to know how this data might be used. Really. Genuinely. I am properly interested. How would someone working in the charitable sector use that website to help them do something? What thing? How would it support them? My imagination may be able to go off on crazy flights of fancy in certain areas, but my lack of sector knowledge or a current headful of summer cold leaves me struggling to work out what this website would tangibly help someone to do. (I tried to ask a similar question around charities data before, giving the example of Charities Commission data grabbed from OpenCharities, but drew a blank then.) Like @cogdog in his search for real OER use case stories, I’d love to hear examples of real questions – no matter how trivial – that the 360 Giving site could help answer.
As well as the website, 360 Giving folk provide a data download as a CSV file containing getting on for a quarter of a million records. The date stamp on the file I grabbed is 5th June 2014. Skimming through the data quickly – my own opening conversation with it can be found here: 360 Giving Grant Navigator – Initial Data Conversation – I noticed through comparison with the data on the website some gaps…
- this item doesn’t seem to appear in the CSV download, perhaps because it doesn’t appear to have a funder?
- this item on the website has an address for the recipient organisation, but the CSV document doesn’t have any address fields. In fact, on close inspection, the record relates to a grant by the Northern Rock Foundation, and I see no records from that body in the CSV file?
- Although there is a project title field in the CSV document, no project titles are supplied. Looking through a sample of grants on the website, I couldn’t see any titles provided there either?
- The website lists the following funders:
Arts Council England
Arts Council Wales
Heritage Lottery Fund
Northern Rock Foundation
Paul Hamlyn Foundation
Sport Northern Ireland
The CSV file has data from these funders:
Arts Council England
Arts Council Wales
Sport Northern Ireland
That is, the CSV contains a subset of the data on the website; data from Heritage Lottery Fund, Indigo Trust, Northern Rock Foundation, Paul Hamlyn Foundation doesn’t seem to have made it into the data download? I also note that data from the Research Councils’ Gateway to Research (aside from the TSB data) doesn’t seem to have made it into either dataset. For anyone researching grants to universities, this could be useful information. (Could?! Why?!;-)
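By way of illustration, the sort of funder-gap check described above can be scripted in a few lines of pandas. (The data and the “Funder” column name below are made-up stand-ins for the real download, whose headers may differ.)

```python
import pandas as pd

# Toy stand-in for the 360 Giving CSV download; the real file and its
# column headers will differ - "Funder" here is an assumption.
df = pd.DataFrame({
    "Funder": ["Arts Council England", "Arts Council Wales",
               "Sport Northern Ireland", "Arts Council England"],
    "Recipient": ["A", "B", "C", "D"],
})

# Funders listed on the 360 Giving website
website_funders = {
    "Arts Council England", "Arts Council Wales", "Heritage Lottery Fund",
    "Northern Rock Foundation", "Paul Hamlyn Foundation",
    "Sport Northern Ireland",
}

csv_funders = set(df["Funder"].dropna().unique())
missing = sorted(website_funders - csv_funders)
print("Funders missing from the download:", missing)
```

Running the same comparison against the real file is then just a matter of swapping in `pd.read_csv()` and whatever the actual funder column is called.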
- No company numbers or charity numbers are given. Using opendata from Companies House a quick join on recipient names and company names from the Companies House register (without any attempts at normalising out things like LTD and LIMITED – that is, purely looking for an exact match) gives me just over 15,000 matched company names (which means I now have their address, company number, etc. too). And presumably if I try to match on names from the OpenCharities data, I’ll be able to match some charity numbers. Now both these annotations will be far from complete, but they’d be more than we have at the moment. A question to then ask is – is this better or worse? Does the dataset only have value if it is in some way complete? One of the clarion calls for open data initiatives has been to ‘just get the data out there’ so that it can start to be worked on, and improved on. So presumably having some company numbers or charity numbers matched is a plus?
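The sort of exact-match join I mean might look something like this sketch (toy data throughout; the real Companies House extract has many more columns, and the names here are invented):

```python
import pandas as pd

# Invented extracts: grant recipients vs a slice of the Companies House register
grants = pd.DataFrame({"Recipient": ["ACME ARTS LTD", "Oswaldtwistle Trust"]})
companies = pd.DataFrame({
    "CompanyName": ["ACME ARTS LTD", "WIDGETS LIMITED"],
    "CompanyNumber": ["01234567", "07654321"],
})

# Purely exact matching on name - no normalising of LTD vs LIMITED,
# case, or whitespace; unmatched recipients get NaN company numbers
matched = grants.merge(companies, left_on="Recipient",
                       right_on="CompanyName", how="left")
print(matched[["Recipient", "CompanyNumber"]])
```

The left join keeps every grant record, matched or not, so the miss rate falls straight out of counting the NaNs.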
Now I know there is a risk to this. Funders may want to not release details about the addresses of the charities they are funding because that data may be used to plot maps to say “this is where the money’s going” when it isn’t. The charity may have a Kensington address and have received funding for an initiative in Oswaldtwistle, but the map might see all the money sinking into Kensington; which would be wrong. But that’s where you have to start educating the data users. Or releasing data fields like “address of charity” and “postcode area of point of use”, or whatever, even if the latter is empty. As it is, if you give me a charity or company name, I can look up its address. And its company or charity number if it has one.
As I mentioned, I don’t want to knock the work 360 Giving have done, but I’m keen to understand what it is they have done, what they haven’t done, and what the opendata they have aggregated and re-presented could – practically, tractably, tangibly – be used for. Really used for.
Time to pack my bags and head out into the wood, maybe…
Some rambling but possibly associated thoughts… I suggest you put Alice’s Restaurant on…
For some time now, I’ve had an uncomfortable feeling about the asymmetries that exist in the open data world as well as total confusion about the notion of transparency.
Part of the nub of the problem (for me) lies with the asymmetric disclosure requirements of public and private services. Public bodies have disclosure requirements (eg Local Government Transparency Code), private companies don’t. Public bodies disclose metrics and spend data, data that can be used in public contract tendering processes by private bodies bidding against public ones tendering for the same service. The private body uses this information – and prices into its bid a discount associated with not having to carry the cost of public reporting. The next time the contract is tendered, the public body won’t have access to the (previously publicly disclosed) information that the private body originally had when making its bid. Possibly. I don’t know how tendering works. But from the outside, that’s how it appears to me. (Maybe there needs to be more transparency about the process?)
Open data is possibly a Big Thing. Who knows? Maybe it isn’t. Certainly the big consulting firms are calling it as something worth squillionty billionty of pounds. I’m not sure how they cost it. Maybe I need to dig through the references and footnotes in their reports (Cap Gemini’s Open Data Economy: Unlocking Economic Value by Opening Government and Public Data, Deloitte’s Open growth: Stimulating demand for open data in the UK or McKinsey’s Open data: Unlocking innovation and performance with liquid information). I don’t know how much those companies have received in fees for producing those reports, or how much they have received in consultancy fees associated with public open data initiatives – somehow, that spend data doesn’t seem to have been curated in a convenient way, or as a #opendatadata bundle? – but I have to assume they’re not doing it to fleece the public bodies and tee up benefits for their other private corporate clients.
Reminds me – I need to read Owen Boswarva’s Who supports the privatisation of Land Registry? and ODUG benefits case for open data release of an authoritative GP dataset again… And remind myself of who sits on the Open Data User Group (ODUG), and other UK gov departmental transparency boards…
And read the FTC’s report Data Brokers: A Call For Transparency and Accountability…
Just by the by, one thing I’ve noticed about a lot of opendata releases is that, along with many other sorts of data, they are most useful when aggregated over time or space, and/or combined with other data sets. Looking at the month on month reports of local spending data from my local council is all very well, but it gets more interesting when viewed over several months or years. Looking at the month on month reports of local spending data from my local council is all very well, but it gets more interesting when looking at spend across councils, as for example in the case of looking at spend to particular companies.
Aggregating public data is one of the business models that helps create some of the GDP figure that contributes to the claimed, anticipated squillionty billionty pounds of financial benefit that will arise from open data – companies like opencorporates aggregating company data, or Spend Network aggregating UK public spending data who hope to start making money selling products off the back of public open data they have curated. Yes – I know a lot of work goes in to cleaning and normalising that data, and that exploiting the data collection as a whole is what their business models are about – and why they don’t offer downloads of their complete datasets, though maybe licenses require they do make links to, or downloads of, the original (“partial”) datasets available?
But you know where I think the real value of those companies lies? In being bought out. By Experian, or Acxiom (if there’s even a hint of personally identifiable data through reverse engineering in the mix), or whoever… A weak, cheap, cop out business model. Just like this: Farmers up in arms over potential misuse of data. (In case you missed it, Climate Corporation was one of the OpenData500 that aggregated shed loads of open data – according to Andrew Stott’s Open Data for Economic Growth report for the World Bank, Climate Corp “uses 60 years of detailed crop yield data, weather observations from one million locations in the United States and 14 terabytes of soil quality data – all free from the US Government – to provide applications that help farmers improve their profits by making better informed operating and financing decisions”. It was also recently acquired by Monsanto – Monsanto – for just under a billion US $. That’s part of the squillionty billionties I guess. Good ol’ open data. Monsanto.)
Sort of related to this – that is, companies buying others to asset strip them for their data – you know all that data of yours locked up in Facebook and Google? Remember MySpace? Remember NNTP? According to the Sophos blog, Just Because You Don’t Give Your Personal Data to Google Doesn’t Mean They Can’t Acquire It. Or that someone else might buy it.
And as another aside – Google – remember Google? They don’t really “read” your email, at least, people don’t, they just let algorithms process it so the algorithms can privately just use that data to send you ads, but no-one will ever know what the content of the email was to trigger you getting that ad (‘cos the cookie tracking, cookie matching services can’t unpick ad bids, ad displays, click thrus, surely, can they?!), well – maybe there are side effects: Google tips off cops after spotting child abuse images in email (for some reason, after initially being able to read that article, my browser can’t load it atm. Server fatigue?). Of course, if Google reads your mail for blind business purposes and ad serving is part of that blind process, you accept it. But how does the law enforcement ‘because we can even though you didn’t warrant us to?’ angle work? Does the Post Office look inside the envelope? Is surveillance actually part of Google’s business model?
If you want to up the paranoia stakes, this (from Ray Corrigan, in particular: “Without going through the process of matching each government assurance with contradictory evidence, something I suspect would be of little interest, I would like to draw your attention to one important misunderstanding. It seems increasingly to be the belief amongst MPs that blanket data collection and retention is acceptable in law and that the only concern should be the subsequent access to that data. Assertions to this effect are simply wrong.”) + that. Because one day, one day, they may just find your name on an envelope of some sort under a tonne of garbage. Or an algorithm might… Kid.
But that’s not what this post is about – what this post is about is… Way back when, so very long ago, not so very long ago, there was a license called GPL. GPL. And GPL was a tainting license. FindLaw describes the consequences of reusing GPL licensed code as follows: Kid, ‘if a user of GPL code decides to distribute or publish a work that “in whole or in part contains or is derived from the [open source] or any part thereof,” it must make the source code available and license the work as a whole at no charge to third parties under the terms of the GPL (thereby allowing further modification and redistribution).
‘In other words, this can be a trap for the unwary: a company can unwittingly lose valuable rights to its proprietary code.’
Now, friends, GPL scared people so much that another license called LGPL was created, and LGPL allowed you to use LGPL licensed code without fear of tainting your own code with the requirement to open up your own code as GPL would require of it. ‘Cos licenses can be used against you.
And when it comes to open data licenses, they seem to be like LGPL. You can take open public data and aggregate it, and combine it, and mix it and mash it and do what you like with it and that’s fine… And then someone can come along buy that good work you’ve done and do what they want with it. Even Monsanto. Even Experian. And that’s good and right, right? Wrong. The ODUG. Remember the ODUG? The ODUG is the Open Data User Group that lobbies government for what datasets to open up next. And who’s on the ODUG? Who’s there, sitting there, on the ODUG bench, right there, right next to you?
Kid… you wanna be the all-open, all-liberal open data advocate? You wanna see open data used for innovation and exploitation and transparency and all the Good Things (big G, big T) that open data might be used for? Or you wanna sit down on the ODUG bench? With Deloitte, and Experian, and so on…
And if you think that using a tainting open data license so anyone who uses that data has to share it likewise, aggregated, congregated, conjugated, disaggregated, mixed, matched, joined, summarised or just otherwise and anyways processed, is a Good Thing…? Then kid… they’ll all move away from you on the bench there…
Because when they come to buy you, they won’t want your data to be tainted in any way that means they’ll have to give up the commercial advantage they’ll have from buying up your work on that open data…
But this post? That’s not what this post is about. This post is about holding companies to account. Open data used to hold companies to account. There’s a story to be told that’s not been told about Dr Foster, and open NHS data and fear-mongering and the privatisation of the NHS and that’s one thing…
But another thing is how government might use data to help us protect ourselves. Because government can’t protect us. Government can’t make companies pay taxes and behave responsibly and not rip off consumers. Government needs our help to do that. But can government help us do that too? Protect and Survive.
There’s a thing that DECC – the Department of Energy and Climate Change – do, and that’s publish statistics about domestic energy prices and industrial energy prices and road fuel and other petroleum product prices, and they’re all meaningless. Because they bear little resemblance to the spot prices paid when consumers pay their domestic energy bills and road fuel and other petroleum product bills.
To find out what those prices are you have to buy the data from someone like Experian, from something like Experian’s Catalist fuel price data – daily site retail fuel prices – data product. You may be able to calculate the DECC statistics from that data (or you may not) but you certainly can’t go the other way, from the DECC statistics to anything like the Experian data.
But can you go into your local library and ask to look at a copy of the Experian data? A copy of the data that may or may not be used to generate the DECC road fuel and other petroleum product price statistics (how do they generate those statistics anyway? What raw data do they use to generate those statistics?)
Can you imagine ant-eye-ant-eye-consumer data sets being published by your local council or your county council or your national government that can be used to help you hold companies to account and help you tell them that you know they’re ripping you off and your council off and your government off and that together, you’re not going to stand for it?
Can you imagine your local council publishing the forecourt fuel prices for one petrol station, just one petrol station, in your local council area every day? And how about if they do it for two petrol stations, two petrol stations, each day? And if they do it for three forecourts, three, can you imagine if they do it for three petrol stations…? And can you, can you imagine prices for 50 petrol stations a day being published by your local council, your council helping you inform yourself about how you’re being manipulated, can you imagine…? (It may not be so hard – food hygiene ratings are published for food retail environments across England, Northern Ireland and Wales…)
So let’s hear it for open data, and how open data can be used to hold corporates to account, and how public bodies can use open data to help you make better decisions (which is a good neoliberal position to take, and one which the other folk on the bench tell you is what you want, and that markets work, though they also fall short of telling you that the models say that markets work with full information but you don’t have the information, and even if you did, you wouldn’t understand it, because you don’t really know how to make a good decision, but at the end of the day you don’t want a decision, you just want a good service fairly delivered, but they don’t tell you that it’s all right to just want that…)
And let’s hear it for public bodies making data available whether it’s open or not, making it available by paying for it if they have to and making it available via library services so that we can start using it to start holding companies to account and start helping our public services, and ourselves, protect ourselves from the attacks being mounted on us by companies, and their national government supporters, who take on debt, and who allow them to take on debt, to make dividend payouts but not capital investment and subsidise the temporary driving down of prices (which is NOT a capital investment) through debt subsidised loss leading designed to crush competition in a last man standing contest that will allow monopolistic last man standing price hikes at the end of it…
And just remember, if there’s anything you want, you know where you can get it… At Alice’s… or the library… only they’re shutting them down, aren’t they…? So that leaves what..? Google?
Via @simonperry, news that AP will use robots to write some business stories (Automated Insights are one of several companies I’ve been tracking over the years who are involved in such activities, eg Notes on Narrative Science and Automated Insights).
The claim is that using algorithms to do the procedural writing opens up time for the journalists to do more of the sensemaking. One way I see this is that we can use data2text techniques to produce human readable press releases of things like statistical releases, which has a couple of advantages at least.
Firstly, the grunt – and error prone – work of running the numbers (calculating month on month or year on year changes, handling seasonal adjustments etc) can be handled by machines using transparent and reproducible algorithms. Secondly, churning numbers into simple words (“x went up month on month from Sept 2013 to Oct 2013 and down year on year from 2012″) makes them searchable using words, rather than having to write our own database or spreadsheet queries with lots of inequalities in them.
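As a toy sketch of that sort of number-to-words churning (my own invented helper, not any particular library):

```python
def change_phrase(current, previous):
    """Turn a pair of figures into an 'up/down x%' phrase."""
    if previous == 0:
        return "no previous value to compare"
    pct = 100.0 * (current - previous) / previous
    if pct == 0:
        return "unchanged"
    direction = "up" if pct > 0 else "down"
    return f"{direction} {abs(pct):.1f}%"

# e.g. month on month and year on year comparisons for a toy indicator
mom = change_phrase(104.2, 101.5)   # Oct 2013 vs Sept 2013
yoy = change_phrase(104.2, 108.0)   # Oct 2013 vs Oct 2012
sentence = f"The index was {mom} month on month and {yoy} year on year."
print(sentence)
```

Sentences generated this way are then findable with a plain text search for, say, “down year on year” – which is exactly the searchability point made above.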
In this respect, something that’s been on my to do list for way too long is to produce some simple “press release” generators based on ONS releases (something I touched on in Data Textualisation – Making Human Readable Sense of Data).
Matt Waite’s upcoming course on “automated story bots” looks like it might produce some handy resources in this regard (code repo). In the meantime, he already shared the code described in How to write 261 leads in a fraction of a second here: ucr-story-bot.
For the longer term, on my “to ponder” list is what might something like “The Grammar of Graphics” be for data textualisation? (For background, see A Simple Introduction to the Graphing Philosophy of ggplot2.)
For example, what might a ggplot2 inspired gtplot library look like for converting data tables not into chart elements, but textual elements? Does it even make sense to try to construct such a grammar? What would the corollaries to aesthetics, geoms and scales be?
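Purely as a speculative doodle – every name below is invented, by analogy with ggplot2’s `ggplot() + geom_*()` pattern – a layered “grammar of text” might compose something like this:

```python
# A speculative sketch of a 'grammar of text': tplot, geom_value and
# geom_trend are all made-up names, mimicking ggplot2's layering idiom.
class tplot:
    def __init__(self, data):
        self.data = data          # list of dicts, one per row
        self.layers = []

    def __add__(self, layer):
        # '+' adds a textual layer, as ggplot2's '+' adds a graphical one
        self.layers.append(layer)
        return self

    def render(self):
        return " ".join(layer(row)
                        for row in self.data for layer in self.layers)

def geom_value(var):
    # A layer that simply states a variable's value
    return lambda row: f"{var} was {row[var]}."

def geom_trend(var, prev):
    # A layer that compares a variable against a previous figure
    def layer(row):
        word = "up" if row[var] > row[prev] else "down"
        return f"That is {word} on the previous figure of {row[prev]}."
    return layer

report = (tplot([{"month": "Oct", "sales": 120, "prev": 110}])
          + geom_value("sales")
          + geom_trend("sales", "prev"))
print(report.render())
```

Whether the corollaries to aesthetics and scales would be things like tone-of-voice mappings or rounding/units conventions is exactly the sort of question mocking things up might answer.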
I think I perhaps need to mock up some examples to see if anything comes to mind, and what the function names, as well as the outputs, might look like, let alone the code to implement them! Or maybe code first is the way, to get a feel for how to build up the grammar from sensible looking implementation elements? Or more likely, perhaps a bit of iteration may be required?!
A couple of notices taken out in this week’s Isle of Wight County Press by the Isle of Wight Council give notice that a couple more areas are to become “designated public spaces”, which means that the police and other recognised officers are allowed to confiscate alcohol (no drinking on the beach…).
Despite my best efforts, I failed to find a listing on the Isle of Wight Council website detailing all designated public spaces on the island, or showing maps of their extent.
As with other council orders that have a geographical component, I am ever hopeful that appropriate maps will be made available, in the case of designated public places as a shapefile, not least because I generally have no idea what boundaries the notices refer to when they just describe road names and other geographical features (“along the beach to the point in line with the junction of Whitecross Lane and Whitecross Farm Lane”, for example…). The notice does say “as shown on the map to be attached to the Order”, but I can’t readily find the order on the council website to see whether the map has yet been attached (maybe it’s buried in papers relating to licensing committee meetings?) I did manage to find a map (search: designated public place lake/) but only because I read URLs…
Re: finding maps relating to DPPOs, I’ve had this problem before. The same criticisms still apply – the Home Office don’t appear to maintain a single register of DPPOs in force, despite the original guidance note saying they would, although they will release the data at a crude level in response to an FOI request. Again, I wonder about setting up an FOI repeater to schedule monthly submissions of the same request to handle this sort of query to the Home Office.
Drawing on the fan in/fan out notion, a Home Office aggregation of the data represents a large fan-in of data from separate councils, and seems to be the easiest place to get hold of such data. However, it would help if they also requested and made available information about the geographical extent of each order as a shapefile.
Here’s the map, though I’m not clear whether I’m breaching copyright in displaying it without permission:
Returning to the notice, I can’t find it displayed on the Isle of Wight County Press website, or on OnTheWight, our hyperlocal blog, despite online fora being legitimate spaces for publishing notices, so I guess buying the County Press is another of those effective taxes I need to pay to keep up with council notices (spending – (click search to run…)).
(As well as posting notices in the IWCP, why doesn’t the council have a /notices area of the website to which it could post its statutory announcements in addition to any other media channels it uses that would act as a convenient, authoritative and archival repository of such notices? [NB quite a few councils do have an area for public notices on their website. For example, do a web search for intitle:"public notices" site:gov.uk] In addition, an area of the site detailing orders in place and in force on the Island so we could keep up with how freedoms are being restricted at any time by the council.)
In passing, whilst trying to find details about DPPOs on the island, I came across a copy of the consultation around the DPPOs on the Gurnard Parish Council website:
“The Isle of Wight County Council has received a request from the Hampshire Constabulary to consider the imposition of Designated Public Place Orders (DPPOs) in …”
Hmm… so does the Hampshire Constabulary website list the extent of DPPOs in the region it covers? (I note on WhatDoTheKnow several people have FOId information about DPPOs from police forces.)
I’m pretty sure there’s a DPPO covering Ryde (here’s an example of additional evidence from Hampshire Constabulary offered in support of DPPO in Ryde in 2007 that was used to support an application for an order at that time) but I can’t find any statement about where, or if, such an order applies on the Isle of Wight Council website, or on the appropriate local neighbourhood area of the Hampshire Constabulary site.
So the only way for me to find out is to go round Ryde looking for signs, and maybe (outside chance) a notice that has a map of the area? Hmm…
PS Just by the by, shapefiles would also help when it comes to working out which bits of the beach I can and can’t walk the dog on…
In digital electronics, the notions of fan in and fan out describe, respectively, the number of inputs a gate (or, on a chip, a pin) can handle, or the number of output connections it can drive. I’ve been thinking about this notion quite a bit, recently, in the context of concentrating information, or data, about a particular service.
For example, suppose I want to look at the payments made by a local council, as declared under transparency regulations. I can get the data for a particular council from a particular source. If we consider each organisation that the council makes a payment to as a separate output (that is, as a connection that goes between that council and the particular organisation), the fan out of the payment data gives the number of distinct organisations that the council has made a declared payment to.
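A minimal sketch of that fan out calculation, using made-up spend data (real transparency releases use varying column names and would need cleaning first):

```python
import pandas as pd

# Toy transparency-spend data for one council; real releases differ
# in column naming (supplier, beneficiary, amount, date, etc.)
payments = pd.DataFrame({
    "payee": ["Acme Care Ltd", "Acme Care Ltd", "Another Council", "Roads Co"],
    "amount": [12000, 8000, 25000, 4000],
})

# Fan out = number of distinct organisations this council paid
fan_out = payments["payee"].nunique()
print("Fan out:", fan_out)
```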
One thing councils do is make payments to other public bodies who have provided them with some service or other. This may include other councils (for example, for the delivery of services relating to out of area social care).
Why might this be useful? If we aggregate the payments data from different councils, we can set up a database that allows us to look at all payments from different councils to a particular organisation, (which may also be a particular council, which is obliged to publish its transaction data, as well as a private company, which currently isn’t). (See Using Aggregated Local Council Spending Data for Reverse Spending (Payments to) Lookups for an example of this. I think startup Spend Network are aggregating this data, but they don’t seem to be offering any useful open or free services, or data collections, off the back of it. OpenSpending has some data, but it’s scattergun in what’s there and what isn’t, depending as it does on volunteer data collectors and curators.)
The payments incoming to a public body from other public bodies are therefore available as open data, but not in a generally, or conveniently, concentrated way. The fan in of public payments is given by the number of public bodies that have made a payment to a particular body (which may itself be a public body or may be a private company). If the fan in is large, it can be a major chore searching through the payments data of all the other public bodies trying to track down payments to the body of interest.
Whilst I can easily discover the fan out payments from a public body, I can’t easily discover the originators of the fan in payments to a body, public or otherwise. Except that I could possibly FOI a public body for this information (“please send me a list of payments you have received from these bodies…”).
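Aggregating several councils’ toy payment files shows how a fan in count might be derived, once someone has done the (non-trivial) work of collecting and normalising the separate releases:

```python
import pandas as pd

# Toy payment files from three councils; in practice each would be a
# separately published CSV with its own column conventions.
councils = {
    "Anytown": pd.DataFrame({"payee": ["Acme Care Ltd", "Roads Co"],
                             "amount": [12000, 4000]}),
    "Othershire": pd.DataFrame({"payee": ["Acme Care Ltd"],
                                "amount": [9000]}),
    "Seaview": pd.DataFrame({"payee": ["Bus Co"], "amount": [7000]}),
}

# Stack the files, tagging each row with its originating council
allpay = pd.concat(
    [df.assign(council=name) for name, df in councils.items()],
    ignore_index=True,
)

# Fan in for a given body = number of distinct public bodies paying it
target = allpay[allpay["payee"] == "Acme Care Ltd"]
fan_in = target["council"].nunique()
print("Fan in for Acme Care Ltd:", fan_in,
      "- total receipts:", target["amount"].sum())
```

This is essentially the reverse-lookup trick the aggregated-spending post linked above describes: the hard part is collecting and reconciling the inputs, not the query.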
As more and more public services get outsourced to private contractors, I wonder if those private contractors will start to buy services off the public providers? I may be able to FOI the public providers for their receipts data (any examples of this, successful or otherwise?), but I wouldn’t be able to find any publicly disclosed payments data from the private provider to the public provider.
The transparency matrix thus looks something like this:
- payment from public body to public body: payment disclosed as public data, receipts available from analysis of all public body payment data (and receipts FOIable from receiver?)
- payment from public body to private body: payment disclosed as public data; total public payments to private body can be ascertained by inspecting payments data of all public bodies. Effective fan-in can be increased by setting up different companies to receive payments and make it harder to aggregate total public monies incoming to a corporate group. (Would be useful if private companies had to disclose: a) total amount of public monies received from any public source, exceeding some threshold; b) individual payments above a certain value from a public body)
- payment from private body to public body: receipt FOIable from public body? No disclosure requirement on private body? Private body can effectively reduce fan out (that is, easily identified concentration of outgoing payments) by setting up different companies through which payments are made.
- payment from private body to private body: no disclosure requirements.
I have of course already wondered Do We Need Open Receipts Data as Well as Open Spending Data?. My current take on this would perhaps argue in favour of requiring all bodies, public or private, that receive more than £25,000, for example, in total per financial year from a particular corporate group* to declare all the transactions (over £500, say) from that body. A step on the road towards that would be to require bodies that receive more than a certain amount of receipts summed from across all public bodies to be subject to FOI at least in respect of payments data received from public bodies.
* We would need to define a corporate group somehow, to get round companies setting up EvilCo Public Money Receiving Company No. 1, EvilCo Public Money Receiving Company No. 2354 Ltd, etc, each of which only ever invoices up to £24,999. There would also have to be a way of identifying payments from the same public body but made through different accounts (for example, different local council directorates).
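A sketch of what the threshold test might look like, assuming the hard problem flagged in the footnote – mapping companies to corporate groups – had somehow been solved (the mapping here is invented):

```python
import pandas as pd

# Toy aggregated payments, with a hypothetical company-to-group mapping;
# building that mapping is the hard part the footnote describes.
payments = pd.DataFrame({
    "recipient": ["EvilCo Receiving Co", "EvilCo Receiving Co Ltd", "NiceCo"],
    "group": ["EvilCo Group", "EvilCo Group", "NiceCo"],
    "amount": [24999, 24999, 10000],
})

# Sum receipts per corporate group and flag those over the threshold:
# each shell company stays under it, but the group total does not
THRESHOLD = 25000
totals = payments.groupby("group")["amount"].sum()
over = totals[totals > THRESHOLD]
print(over)
```

The point of grouping before thresholding is exactly the anti-shell-company one: individually each invoice stays under £25,000, but the group total is what gets tested.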
Whilst this would place a burden on all bodies, it would also start to level out the asymmetry between public body reporting and private body reporting in the matter of publicly funded transactions. At the moment, private company overheads for delivering subcontracted public services are lower than public body overheads for delivering the same services in the matter of, for example, transparency disclosures, placing the public body at a disadvantage when it comes to such disclosures. (Note that things may be changing, at least in FOI stakes… See for example the latter part of Some Notes on Extending FOI.)
One might almost think the government was promoting transparency of public services gleeful in the expectation that, as their privatisation agenda moves on, a decreasing proportion of service providers will actually have to make public disclosures. Again, this asymmetry would make for unfair comparisons between service providers based on publicly available data if only data from public body providers of public services, rather than private providers of tendered public services, had to be disclosed.
So the take home, which has got diluted somewhat, is the proposal that the joint notions of fan in and fan out, when it comes to payment/receipts data, may be useful when it comes to helping us think about how easy it is to concentrate data/information about payments to, or from, a particular body, and how policy can be defined to shine light where it needs shining.