The FOI Route to Real (Fake) Open Data via WhatDoTheyKnow

In FOI Signals on Useful Open Data?, I pondered whether we could make use of information about FOI to help identify what sorts of data folk might actually be interested in, by virtue of their making Freedom of Information (FOI) requests for that data.

I couldn’t help but start to try working various elements of that idea through, so here’s a simple baby step to begin with – a scraper on Scraperwiki (Scraperwiki scraper: WhatDoTheyKnow requests) that searches for FOI requests made through WhatDoTheyKnow that got one or more Excel/xls spreadsheets back as an attachment.
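By way of illustration, here’s a minimal sketch of the sort of search-and-scrape loop involved; the search path, the filetype: query and the page markup are all assumptions for the purposes of the sketch, not the actual scraper code:

import requests
from bs4 import BeautifulSoup

BASE = "https://www.whatdotheyknow.com"

def requests_with_xls(query="filetype:xls", pages=3):
    """Search WhatDoTheyKnow, then check each request page for .xls
    attachment links. The search path and page markup are assumed."""
    found = []
    for page in range(1, pages + 1):
        r = requests.get(BASE + "/search/" + query, params={"page": page})
        soup = BeautifulSoup(r.text, "html.parser")
        # Individual requests are linked at /request/<slug> paths
        links = {a["href"] for a in soup.select('a[href^="/request/"]')}
        for href in sorted(links):
            req_soup = BeautifulSoup(requests.get(BASE + href).text, "html.parser")
            xls = [a["href"] for a in req_soup.find_all("a", href=True)
                   if a["href"].lower().endswith(".xls")]
            if xls:
                found.append({"request": BASE + href, "attachments": xls})
    return found

Each hit pairs a request URL with its spreadsheet attachments, which is essentially the record the scraper stores.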

I’ve also popped up a Scraperwiki view that allows you to browse the data-returning requests made to local councils or universities.

Clicking through on an FOI request link takes you to the response that contains the data file, which can be downloaded directly or previewed on Zoho:

It strikes me that if I crawled the response pages, I could build my own index of data files, catalogued according to FOI request titles, in effect generating a “fake” data.gov.uk or data.ac.uk opendata catalogue as powered by FOI requests…? (What would be really handy in the local council requests would be if the responses were tagged with appropriate LGSL codes or IPSV terms (indexing on the way out) as a form of useful public metadata that could help put the FOI released data to work…?)

Insofar as the requests may or may not be useful for signalling particular topic areas as good candidates for “standard” open data releases, I still need to do some text analysis on the request titles. In the meantime, you can enter a keyword/key phrase in the Request text box in order to filter the table results to only show requests whose title contains the keyword/phrase. (The Council drop down list allows you to filter the table so that it only shows requests for a particular university/council.)

PS via a post on HelpMeInvestigate, I came across this list of FOI responses to requests made to the NHS Prescription Pricing Division. From a quick skim, some of the responses have “data” file attachments, though in the form of PDFs rather than spreadsheets/CSV. However, it would be possible to scrape the pages to at least identify ones that do have attachments (which is a clue they may contain data sets?)

So now I’m wondering – what other bodies produce full lists of FOI requests they have received, along with the responses to them?

PPS See also this gov.uk search query on FOI Release publications.

Aggregated Local Government Verticals Based on LocalGov Service IDs

(Punchy title, eh?!) If you’re a researcher interested in local government initiatives or service provision across the UK on a particular theme, such as air quality, or you’re looking to start pulling together an aggregator of local council consultation exercises, where would you start?

Really – where would you start? (Please post a comment saying how you’d make a start on this before reading the rest of this post… then we can compare notes;-)

My first thought would be to use a web search engine and search for the topic term using a site:gov.uk search limit, maybe along with intitle:council, or at least council. This would generate a list of pages on (hopefully) local gov websites relating to the topic or service I was interested in. That approach is a bit hit or miss though, so next up I’d probably go to DirectGov, or the new gov.uk site, to see if they had a single page on the corresponding resource area that linked to appropriate pages on the various local council websites. (The gov.uk site takes a different approach to the old DirectGov site, I think, trying to find a single page for a particular council given your location rather than providing a link for each council to a corresponding service page?) If I was still stuck, OpenlyLocal, the site set up several years ago by Chris Taggart/@countculture to provide a single point of reference for looking up common administrivia details relating to local councils, would be the next thing that came to mind. For a data related query, I would probably have a trawl around data.gov.uk, the centralised (but far from complete) UK index of open public datasets.

How much more convenient it would be if there was a “vertical” search or resource site relating to just the topic or service you were interested in, that aggregated relevant content from across the UK’s local council websites in a single place.

(Erm… or maybe it wouldn’t?!)

Anyway, here are a few notes for how we might go about constructing just such a thing out of two key ingredients. The first ingredient is the rather wonderful Local directgov services list:

This dataset is held on the Local Directgov platform which provides the deep links into Local council websites for a number of services in Directgov. The Local Authority Service details holds the local council URLs for over 240 services where the customer can directly transfer to the appropriate service page on any council in England.

The date on the dataset post is 16/09/2011, although I’m not sure if the data file itself is more current (which is one of the issues with data.gov.uk, you could argue…). Presumably, gov.uk runs off a current version of the index? (Share…. ;-) Each item in the local directgov services list carries with it a service identifier code that describes the local government service or provision associated with the corresponding web page. That is, each URL has associated with it a piece of metadata identifying a service or provision type.

Which leads to the second ingredient: the esd standards Local Government Service List. This list maps service codes onto a short key phrase description of the corresponding service. So for example, Council – consultation and community engagement has service identifier 366, and Pollution control – air quality is 413. (See the standards page for the actual code/vocabulary list in a variety of formats…)

As a starter for ten, I’ve pulled the Directgov local gov URL listing and local gov service list into scraperwiki (Local Gov Web Pages). Using the corresponding scraper API, we can easily run a query looking up service codes relating to pollution, for example:

select * from `serviceDesc` where ToName like '%pollution%'

From this, we can pick up what service code we need to use to look up pages related to that service (413 in the case of air pollution):

select * from `localgovpages` where LGSL=413

We can also get a link to an HTML table (or JSON representation, etc) of the data via a hackable URI:

https://api.scraperwiki.com/api/1.0/datastore/sqlite?format=htmltable&name=local_gov_web_pages&query=select%20*%20from%20%60localgovpages%60%20where%20LGSL%20%3D413

(Hackable in the sense we can easily change the service code to generate the table for the service with that code.)
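For programmatic use, the same endpoint can return JSON; here’s a minimal sketch of a lookup helper, assuming the API still supports the format=jsondict option and that the table and column names are as in the queries above:

import requests

API = "https://api.scraperwiki.com/api/1.0/datastore/sqlite"

def localgov_pages(lgsl_code):
    """Fetch the local council web pages tagged with a given LGSL service code."""
    resp = requests.get(API, params={
        "format": "jsondict",  # assumed; htmltable works as per the URI above
        "name": "local_gov_web_pages",
        "query": "select * from `localgovpages` where LGSL={}".format(int(lgsl_code)),
    })
    return resp.json()

# e.g. Pollution control - air quality (LGSL 413)
for row in localgov_pages(413)[:5]:
    print(row)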

So that’s the starter for 10. The next step that comes to my mind is to generate a dynamic Google custom search engine configuration file that defines a search engine that will search over just those URLs (or maybe those URLs plus the pages they link to). This would then provide the ability to generate custom search engines on the fly that searched over particular service pages from across localgov in a single, dynamically generated vertical.
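As a rough sketch of what that might look like: a Google CSE configuration is essentially an annotations XML file listing URL patterns, so one could be generated on the fly from the query results (the label name below is a placeholder, and the url column name is an assumption):

from xml.sax.saxutils import quoteattr

def cse_annotations(urls, label="_cse_localgov_service"):
    """Build a Google CSE annotations file restricting the engine to the
    given service pages and pages beneath them."""
    rows = ['  <Annotation about={}>\n    <Label name={}/>\n  </Annotation>'.format(
                quoteattr(url.rstrip("/") + "/*"), quoteattr(label))
            for url in urls]
    return "<Annotations>\n" + "\n".join(rows) + "\n</Annotations>"

# reusing localgov_pages() from the sketch above; "url" is an assumed column name
print(cse_annotations(row["url"] for row in localgov_pages(413)))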

A second thought is to grab those pages, index them myself, crawl them/scrape them to find the pages they link to, and index those pages also (using something like tf-idf within each local council site to identify and remove common template elements from the index). (Hmmm… that could be an interesting complement to scraperwiki… SolrWiki, a site for compiling lists of links, indexing them, crawling them to depth N, and then configuring search ranking algorithms over the top of them… Hmmm… It’s a slightly different approach to generating custom search engines as a subset of a monolithic index, which is how the Google CSE and (previously) the Yahoo BOSS engines worked… Not scalable, of course, but probably okay for small index engines and low thousands of search engines?)
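On the template-stripping point, the document-frequency half of tf-idf is what does the work: terms that appear on nearly every page within a single council’s site are almost certainly navigation or footer furniture rather than content. A minimal sketch, assuming pages have already been fetched and tokenised:

from collections import Counter

def strip_template_terms(site_pages, df_threshold=0.8):
    """site_pages maps page URL to a list of tokens for one council site.
    Terms appearing on more than df_threshold of the site's pages are
    treated as template boilerplate and dropped before indexing."""
    n = len(site_pages)
    df = Counter()
    for tokens in site_pages.values():
        df.update(set(tokens))  # document frequency: count pages, not occurrences
    boilerplate = {term for term, count in df.items() if count / n > df_threshold}
    return {url: [t for t in tokens if t not in boilerplate]
            for url, tokens in site_pages.items()}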

First Sightings of the Data Strategy Board

Via a BIS press release earlier this week – Better access to public sector information moves a step closer – it seems that the Data Strategy Board is on its way, along with a Public Data Group and an Open Data User Group (these are separate from the yet-to-be-constituted Open Standards Board (if you’re quick, the deadline for membership of the board is tomorrow: Open Standards Board – Volunteer Members and Board Advisers – Ref:1238758) and its feeder Open Data Standards and Open Technical Standards panels).

So what does the press release promise?

A new independently chaired Data Strategy Board (DSB) will advise Ministers on what data should be released [will this draw on data requests made to data.gov.uk, I wonder? – TH] and has the potential to unlock growth opportunities for businesses across the UK. At least one in three members of the DSB will be from outside government, including representatives of data re-users.

The DSB will work with the Public Data Group (PDG) – which consists of Trading Funds the Met Office, Ordnance Survey, Land Registry and Companies House – to provide a more consistent approach to improving access to public sector information. These organisations have already made some data available, which has provided opportunities for developers and entrepreneurs to create imaginative ways to develop or start up their own businesses based on high quality data.

Looking at the Terms of reference for the Data Strategy Board & the Public Data Group, we can broadly see how they’re organised:

Three departmental agendas then…?! A good sign, or, erm..?! (I haven’t read the Terms of reference properly yet – that’s maybe for another post…)

How these fit in with the Public Sector Transparency Board and the Local Public Data Panel, I’m not quite sure, though it might be quite interesting to try and map out the strong and weak ties between them once their memberships are announced? It’d also be interesting to know whether there’d be any mechanism for linking in with open data standards recommendations and development (via the Standards Hub process) to ensure that as and when data gets released, there is at least an eye towards releasing it in a usable form!

The Government is making £7m available from April 2013 for the DSB to purchase additional data for free release from the Trading Funds and potentially other public sector organisations, funded by efficiency savings. An Open Data User Group, which will be made up of representatives from the Open Data community, will be directly involved in decisions on the release of Open Data, advising the DSB on what data to purchase from the Trading Funds and other public organisations and release free of charge.

So the PDG is a pseudo-cartel of sort-of government data providers (the Trading Funds), and the DSB is being given £7 million or so to buy open releases of data that the public purse (I think?) paid those providers to collect. The cash is there to offset the charges they would otherwise have made selling the data. (Erm… so, in order for those agencies to give their data away for free, we have to pay them to do it? Right… got it…) Presumably, the DSB members won’t be on the ODUG, which will be advising the DSB on what data to purchase from the Trading Funds and other public organisations and release free of charge (my emphasis). Note the explicit recognition here that free actually costs. In this case, public bodies are having data that central gov paid them to collect bought off them by central gov so that it (central gov, or the bodies themselves) can then release it “for free”? Good. That’s clear then…

Francis Maude also clarifies this point: “The new structure for Open Data will ensure a more inclusive discussion, including private sector data users, on future data releases, how they should be paid for and which should be available free of charge.”

In addition: The DSB will provide evidence on how data from the Trading Funds – including what is released free of charge – will generate economic growth and social benefit. It will act as an intelligent customer advising Government on commissioning and purchasing key data and services from the PDG, and ensuring the best deal for the taxpayer. So maybe this means the Public Sector Transparency Board will now focus more on “public good” and “transparency” arguments, leaving the DSB to demonstrate the financial returns of open data?

The Open Data User Group (ODUG) [will] support the work of the new Data Strategy Board (DSB). [The position of Chair of the group is currently being advertised, if you fancy it…: Chair of Open Data User Group, – Ref:1240914 -TH]. The ODUG will advise the DSB on public sector data that should be prioritised for release as open data, to the benefit of the UK.

As part of the process, an open suggestion site has been set up using the Delib Dialogue app to ask “the community” How should the Open Data User Group engage with users and re-users of Open Data?: [i]n advance of appointing a Chair and Members of the group, the Cabinet Office wants to bring together suggestions for how the ODUG should go about this engagement with wider users and re-users. We are looking for ideas about things like how the ODUG should gather evidence for the release of open data, how it should develop its advice to the DSB, how it should run its meetings and how it should keep the wider community up to date on developments (as well as other ideas you have).

A Twitter account has also been pre-emptively set up to manage some of the social media engagement activities of the group: @oduguk

The account currently has just over a couple of hundred followers, so I grabbed the lists of all the folk those followers follow, then graphed the folk followed by 30 or more current followers of @oduguk.

Here’s the graph, laid out in Gephi using a force-directed layout, with nodes coloured according to modularity group and sized by eigenvector centrality:

Here’s the same graph with nodes sized by betweenness centrality:
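For what it’s worth, the commonly-followed filter and the centrality sizing can also be reproduced outside Gephi. Here’s a minimal sketch in networkx, assuming the follower/friend edges have already been pulled from the Twitter API into a list of pairs:

from collections import Counter
import networkx as nx

def common_friends_graph(edges, min_followers=30):
    """edges: (follower, friend) pairs, one per account followed by a
    follower of @oduguk. Keep only friends followed by at least
    min_followers of those followers, then attach centrality scores."""
    counts = Counter(friend for _, friend in edges)
    keep = {friend for friend, c in counts.items() if c >= min_followers}
    g = nx.DiGraph([(follower, friend) for follower, friend in edges
                    if friend in keep])
    # these play the same node-sizing role as in the Gephi views above
    nx.set_node_attributes(g, nx.eigenvector_centrality(g, max_iter=1000), "eigenvector")
    nx.set_node_attributes(g, nx.betweenness_centrality(g), "betweenness")
    return g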

By the by, responses to the Data Policy for a Public Data Corporation consultation have also been published, along with the Government response, which I haven’t had a chance to read yet… If I get a chance, I’ll try to post some thoughts/observations on that alongside a commentary on the terms of reference doc linked to above somewhere…

Government Communications – Department Press Releases and Autodiscoverable Syndication Feeds

A flurry of articles earlier this week (mine will be along shortly) about the Data Strategy Board all broadly rehashed the original press release from BIS. Via the Cabinet Office Transparency minisite, I found a link to the press release on the COI News Distribution Service…

…whereupon I noticed that the COI – Central Office of Information – is to close at the end of this month (31 March 2012), taking with it the News Distribution Service for Government and the Public Sector (soon to be ex- of http://nds.coi.gov.uk/).

In its place is the following advice: “For government press releases please follow this link to find the department that you require http://www.direct.gov.uk/en/Dl1/Directories/A-ZOfCentralGovernment/index.htm”. This leads to a set of alphabetised pages with links to the various government departments… i.e. it points to a starting point for likely fruitless browsing and searching if you’re after aggregated press releases from gov departments.

(I’m not sure where News Sauce: UK Government Edition gets its data from, but if it scrapes departmental press releases directly rather than just scraping and syndicating the old COI content, then it’s probably the site I’ll be using to keep tabs on government press releases.)

FWIW, centralisation and aggregation are not the same in terms of architectures of control. Aggregation (then filter on the way out, if needs be) can be a really really useful way of keeping tabs on otherwise distributed systems… I had a quick look to see whether anyone was scraping and aggregating UKGov departmental press releases on Scraperwiki, but only came up with @pezholio’s LGA Press Releases scraper…

An easier way would be to hook up my feed reader to an OPML bundle that collected together RSS/Atom feeds of news releases from the various government websites. I’m not sure if such a bundle is available anywhere (if you know of one, please add a link in the comments below), but if: 1) gov departments do publish RSS/Atom feeds containing their press releases; 2) they make these feeds autodiscoverable via their homepages, and: 3) ensure that said feeds are reliably identifiable as press release/media release feeds, it wouldn’t be too hard to build a simple OPML feed generator, as the sketch below suggests.
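A minimal sketch of such a generator, assuming a hand-maintained list of departmental homepages to start from: discover feeds via the standard <link rel="alternate"> convention, keep the ones whose titles look press/news related, and wrap them in OPML (the keyword filter and homepage list are placeholders):

import requests
from bs4 import BeautifulSoup
from xml.sax.saxutils import quoteattr

FEED_TYPES = ("application/rss+xml", "application/atom+xml")

def discover_press_feeds(homepages, keywords=("press", "news", "media")):
    """Yield (title, feed_url) for autodiscoverable feeds whose title
    suggests they carry press/news releases."""
    for url in homepages:
        soup = BeautifulSoup(requests.get(url).text, "html.parser")
        for link in soup.find_all("link", rel="alternate"):
            href = link.get("href")
            if href and link.get("type") in FEED_TYPES:
                title = link.get("title") or url
                if any(k in title.lower() for k in keywords):
                    yield title, requests.compat.urljoin(url, href)

def opml(feeds):
    items = ['    <outline type="rss" text={} xmlUrl={}/>'.format(
                 quoteattr(title), quoteattr(feed_url))
             for title, feed_url in feeds]
    return ('<opml version="1.0">\n  <head><title>Gov press release feeds</title></head>\n'
            '  <body>\n' + "\n".join(items) + '\n  </body>\n</opml>')

# placeholder homepage list; in practice it would come from a departments directory
print(opml(discover_press_feeds(["https://www.gov.uk/"])))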

So for example, trawling through old posts, I note that the post 404 “Page Not Found” Error pages and Autodiscoverable Feeds for UK Government Departments used a Yahoo Pipes pipe to try to automatically audit feed autodiscovery on UK gov departmental homepages, though it may well have rotted by now. If I was to fix it, I’d probably reimplement it in Scraperwiki, as I did with my UK HEI feed autodiscovery thang (UK university autodiscoverable RSS Feeds (Scraperwiki scraper), and Scraperwiki View; about: Autodiscoverable Feeds and UK HEIs (Again…)). If you beat me to that, please post a link to your scraper below;-)

I have to admit I haven’t checked the state of feed autodiscovery on UK gov, local gov, or university websites recently. Sigh… another thing to add to the list of ‘maybe useful’ diversions…;-)

See also: Public Data Principles: RSS Autodiscovery on Government Department Websites?

PS This tool may or may not be handy if feed autodiscovery is new to you? Feed Autodiscovery in Javascript

PPS Hmm, from Tracking Down Local Government Consultation Web Pages, I recall there are LGSL service ID codes that list identifiers for local government services and that can be used to tag webpages/URLs on local government sites. Are there service identifiers for central government communication services (e.g. provision of press releases) that could be used to find central gov department press releases (or local gov press releases, for that matter)? Of course, if departments all had autodiscoverable press release feeds on their homepages, it’d be a more weblike way;-)