Archive for the ‘Data’ Category
Via my feeds, a handful of consultations and codes relating to open data, particularly in a local government context:
- first up, from the Cabinet Office, an open consultation on FOI data-release guidelines (see also the press release and draft code of practice); this relates to ways of handling the changes to the Freedom of Information Act that will come in to force in April 2013 that arise from the changes introduced through the Protection of Freedoms Act earlier this year;
- from DCLG, a consultation under the banner of “Improving Local Government Transparency” that takes a look at matters arising from making the Code of recommended practice for local authorities on Data Transparency mandatory [PDF] (here’s a copy of the The Code of Recommended Practice for Local Authorities on Data Transparency, which “sets out key principles for local authorities in creating greater transparency through the publication of public data”, where “‘Public data’ therefore means the objective, factual data, on which policy decisions are based and on which public services are assessed, or which is collected or generated in the course of public service delivery.”);
- from the Advisory Panel on Public Sector Information (APPSI), we have A National Information Framework for Public Sector Information and Open Data (I saw this via a Guardian commentary piece).
Also of note this week, the ICO published its Anonymisation: managing data protection risk code of practice [PDF] (here’s the press release). ENISA, the European Network and Information Security Agency, have also just published the latest in a series of reports on privacy: Privacy considerations of online behavioural tracking. My colleague Ray Corrigan has done a quick review here.
Although it’s hard to know who has influence where, to the extent that the UK’s Open Government Partnership National Action Plan suggests a general roadmap for open government activity, this is maybe worth noting: Involve workshops: Developing the UK’s next open government National Action Plan
For a recent review of the open data policy context, see InnovateUK’s Open Data and it’s role in powering economic growth.
(I’ll update this post with a bit more commentary over the next few days. For now, I thought I’d share the links in case any readers out there fancy a bit of weekend reading…;-)
PS though not in the news this week, here are a couple of links to standing and appealed case law around database IPR:
- background – OKF review of data/database rights and Out-law review of database rights
- (ongoing) Court of Justice of European Union Appeal – Football Dataco and others: are football event listings protected? (context and commentary on review.)
- case law: The British Horseracing Board Ltd and Others v William Hill (horse-racing information) (commentary by out-law).
Whilst preparing for my typically overloaded #online12 presentation, I thought I should make at least a passing attempt at contextualising it for the corporate attendees. The framing idea I opted for, but all too briefly reviewed, was whether open public data might be disruptive to the information industry, particularly purveyors of information services in vertical markets.
If you’ve ever read Clayton Christensen’s The Innovator’s Dilemma, you’ll be familiar with the idea behind disruptive innovations: incumbents allow start-ups with cheaper ways of tackling the less profitable, low-quality end of the market to take that part of the market; the start-ups improve their offerings, take market share, and the incumbent withdraws to the more profitable top-end. Learn more about this on OpenLearn: Sustaining and disruptive innovation or listen again to the BBC In Business episode on The Innovator’s Dilemma, from which the following clip is taken.
In the information industry, the following question then arises: will the availability of free, open public data be adopted at the low, or non-consuming end of the market, for example by micro- and small companies who haven’t necessarily be able to buy in to expensive information or data services, either on financial grounds or through lack of perceived benefits? Will the appearance of new aggregation services, often built around screenscrapers and/or public open data sources start to provide useful and useable alternatives at the low end of the market, in part because of their (current) lack of comprehensiveness or quality? And if such services are used, will they then start to improve in quality, comprehensiveness and service offerings, and in so doing start a ratcheting climb to quality that will threaten the incumbents?
Here are a couple of quick examples, based around some doodles I tried out today using data from OpenCorporates and OpenlyLocal. The original sketch (demo1() in the code here) was a simple scraper on Scraperwiki that accepted a person’s name, looked them up via a director search using the new 0.2 version of the OpenCorporates API, pulled back the companies they were associated with, and then looked up the other directors associated with those companies. For example, searching around Nigel Richard Shadbolt, we get this:
One of the problems with the data I got back is that there are duplicate entries for company officers; as Chris Taggart explained, “[data for] UK officers [comes] from two Companies House sources — data dump and API”. Another problem is that officers’ records don’t necessarily have start/end dates associated with them, so it may be the case that directors’ terms of office don’t actually overlap within a particular company. In my own scraper, I don’t check to see whether an officer is marked as “director”, “secretary”, etc, nor do I check to see whether the company is still a going concern or whether it has been dissolved. Some of these issues could be addressed right now, some may need working on. But in general, the data quality – and the way I work with it – should only improve from this quick’n'dirty minimum viable hack. As it is, I now have a tool that at a push will give me a quick snapshot of some of the possible director relationships surrounding a named individual.
The second sketch (demo2() in the code here) grabbed a list of elected council members for the Isle of Wight Council from another of Chris’ properties, OpenlyLocal, extracted the councillors names, and then looked up directorships held by people with exactly the same name using a two stage exact string match search. Here’s the result:
As with many data results, this is probably most meaningful to people who know the councillors – and companies – involved. The results may also surprise people who know the parties involved if they start to look-up the companies that aren’t immediately recognisable: surely X isn’t a director of Y? Here we have another problem – one of identity. The director look-up I use is based on an exact string match: the query to OpenCorporates returns directors with similar names, which I then filter to leave only directors with exactly the same name (I turn the strings to lower case so that case errors don’t cause a string mismatch). (I also filter companies returned to be solely ones with a gb jurisdiction.) In doing the lookup, we therefore have the possibility of false positive matches (X is returned as a director, but it’s not the X we mean, even though they have exactly the same name); and false negative lookups (eg where we look up a made up director John Alex Smith who is actually recorded in one or more filings as (the again made-up) John Alexander Smith.
That said, we do have a minimum viable research tool here that gives us a starting point for doing a very quick (though admittedly heavily caveated) search around companies that a councillor may be (or may have been – I’m not checking dates, remember) associated with.
We also have a tool around which we can start to develop a germ of an idea around conflict of interest detection.
The Isle of Wight Armchair Auditor, maintained by hyperlocal blog @onthewight (and based on an original idea by @adrianshort) hosts local spending information relating to payments made by the Isle of Wight Council. If we look at the payments made to a company, we see the spending is associated with a particular service area.
If you’re a graph thinker, as I am;-), the following might then suggest itself to you:
- From OpenlyLocal, we can get a list of councillors and the committees they are on;
- from OnTheWight’s Armchair Auditor, we can get a list of companies the council has spent money with;
- from OpenCorporates, we can get a list of the companies that councillors may be directors of;
- from OpenCorporates, we should be able to get identifiers for at least some of the companies that the council has spent money with;
- putting those together, we should be able to see whether or not a councillor may be a director of a company that the council is spending with and how much is being spent with them in which spending areas;
- we can possibly go further, if we can associate council committees with spending areas – are there councillors who are members of a committee that is responsible for a particular spending area who are also directors of companies that the council has spent money with in those spending areas? Now there’s nothing wrong with people who have expertise in a particular area sitting on a related committee (it’s probably a Good Thing). And it may be that they got their experience by working as a director for a company in that area. Which again, could be a Good Thing. But it begs a transparency question that a journalist might well be interested in asking. And in this case, with open data to hand, might technology be able to help out? For example, could we automatically generate a watch list to check whether or not councillors who are directors of companies that have received monies in particular spending areas (or more generally) have declared an interest, as would be appropriate? I think so…(caveated of course by the fact that there may be false positives and false negatives in the report…; but it would be a low effort starting point).
Once you get into this graph based thinking, you can take it mich further of course, for example looking to see whether councillors in one council are directors of companies that deal heavily with neighbouring councils… and so on.. (Paranoid? Me? Nah… Just trying to show how graphs work and how easy it can be to start joining dots once you start to get hold of the data…;-)
Anyway – this is all getting off the point and too conspiracy based…! So back to the point, which was along the lines of this: here we have the fumblings of a tool for mixing and matching data from two aggregators of public information, OpenlyLocal and OpenCorporates that might allow us to start running crude conflict of interest checks. (It’s easy enough to see how we can run the same idea using lists of MP names from the TheyWorkForYou API; or looking up directorships previously held by Ministers and the names of companies of lobbiests they meet (does WhosLobbying have an API of such things?). And so on…
Now I imagine there are commercial services around that do this sort of thing properly and comprehensively, and for a fee. But it only took me a couple of hours, for free, to get started, and having started, the paths to improvement become self-evident… and some of them can be achieved quite quickly (it just takes a little (?!) but of time…) So I wonder – could the information industry be at risk of disruption from open public data?
PS if you’re into conspiracies, Cambridge’s Centre for Research in the Arts, Social Sciences and Humanities (CRASSH) has a post-doc positions open with Professior John Naughton on The impact of global networking on the nature, dissemination and impact of conspiracy theories. The position is complemented by several parallel fellowships, including ones on Rational Choice and Democratic Conspiracies and Ideals of Transparency and Suspicion of Democracy.
Via a trackback from Check Yo Self: 5 Things You Should Know About Data Science (Author Note) criticising tweet-mapping without further analysis (“If you’re making Gephi graphs out of tweets, you’re probably doing more data science marketing than data science analytics. And stop it. Please. I can’t take any more. … what does it gain a man to have graphs of tweets and do jack for analysis with them?”), I came across John Foreman’s Analytics Made Skeezy [uncourse] blog:
Analytics Made Skeezy is a fake blog. Each post is part of a larger narrative designed to teach a variety of analytics topics while keeping it interesting. Using a single narrative allows me to contrast various approaches within the same fake world. And ultimately that’s what this blog is about: teaching the reader when to use certain analytic tools.
Skimming through the examples described in some of the posts to date, Even Wholesale Drug Dealers Can Use a Little Retargeting: Graphing, Clustering & Community Detection in Excel and Gephi not surprisingly caught my attention. That post describes, in narrative form, how to use Excel to prepare and shape a dataset so that it can be imported into Gephi as a faux CSV file and then run through Gephi’s modularity statistic; the modularity class augmented dataset can then be exported from the Gephi Data Lab and re-presented in Excel, whereupon the judicious use of column sorting and conditional formatting is used to try to generate some sort of insight about the clusters/groups discovered in the data – apparently, “Gephi can kinda suck for giving us that kind of insight sometimes. Depends on the graph and what you’re trying to do”. And furthermore:
If you had a big dataset that you prepped into a trimmed nearest neighbors graph, keep in mind that visualizing it in Gephi is just for fun. It’s not necessary for actual insight regardless of what the scads of presentations of tweets-spreading-as-visualized-in-Gephi might tell you (gag me). You just need to do the community detection piece. You can use Gephi for that or the libraries it uses. R and python both have a package called igraph that does this stuff too. Whatever you use, you just need to get community assignments out of your large dataset so that you can run things like the aggregate analysis over them to bubble up intelligence about each group.
I don’t necessarily disagree with the implication that we often need to do more than just look at pretty pictures in Gephi to make sense of a dataset; but I do also believe that we can use Gephi in an active way to have a conversation with the data, generating some sort of preliminary insights out about the data set that we can then explore further using other analytical techniques. So what I’ll try to do in the rest of this post is offer some suggestions about one or two ways in which we might use Gephi to start conversing with the same dataset described in the Drug Dealer Retargeting post. Before I do so, however, I suggest you read through the original post and try to come to some of your own conclusions about what the data might be telling us…
Done that? To recap, the original dataset (“Inventory”) is a list of “deals”, with columns relating to two sorts of thing: 1) attribute of a deal; 2) one column per dealer showing whether they took up that deal. A customer/customer matrix is then generated and the cosine similarity between each customer calculated (note: other distance metrics are available…) showing the extent to which they participated in similar deals. Selecting the three most similar neighbours of each customer creates a “trimmed nearest neighbors graph”, which is munged into a CSV-resembling data format that Gephi can import. Gephi is then used to do a very quick/cursory (and discounted) visual analysis, and run the modularity/clustering detection algorithm.
So how would I have attacked this dataset (note: IANADS (I am not a data scientist;-)
One way would be to treat it from the start as defining a graph in which dealers are connected to trades. Using a slightly tidied version of the ‘inventory tab from the original dataset in which I removed the first (metadata) and last (totals) rows, and tweaked one of the column names to remove the brackets (I don’t think Gephi likes brackets in attribute names?), I used the following script to generate a GraphML formatted version of just such a graph.
#Python script to generate GraphML file import csv #We're going to use the really handy networkx graph library: easy_install networkx import networkx as nx import urllib #Create a directed graph object DG=nx.DiGraph() #Open data file in universal newline mode reader=csv.DictReader(open("inventory.csv","rU")) #Define a variable to act as a deal node ID counter dcid=0 #The graph is a bimodal/bipartite graph containing two sorts of node - deals and customers #An identifier is minted for each row, identifying the deal #Deal attributes are used to annotate deal nodes #Identify columns used to annotate nodes taking string values nodeColsStr=['Offer date', 'Product', 'Origin', 'Ready for use'] #Identify columns used to annotate nodes taking numeric values nodeColsInt=['Minimum Qty kg', 'Discount'] #The customers are treated as nodes in their own right, rather than as deal attributes #Identify columns used to identify customers - each of these will define a customer node customerCols=['Smith', 'Johnson', 'Williams', 'Brown', 'Jones', 'Miller', 'Davis', 'Garcia', 'Rodriguez', 'Wilson', 'Martinez', 'Anderson', 'Taylor', 'Thomas', 'Hernandez', 'Moore', 'Martin', 'Jackson', 'Thompson', 'White' ,'Lopez', 'Lee', 'Gonzalez','Harris', 'Clark', 'Lewis', 'Robinson', 'Walker', 'Perez', 'Hall', 'Young', 'Allen', 'Sanchez', 'Wright', 'King', 'Scott','Green','Baker', 'Adams', 'Nelson','Hill', 'Ramirez', 'Campbell', 'Mitchell', 'Roberts', 'Carter', 'Phillips', 'Evans', 'Turner', 'Torres', 'Parker', 'Collins', 'Edwards', 'Stewart', 'Flores', 'Morris', 'Nguyen', 'Murphy', 'Rivera', 'Cook', 'Rogers', 'Morgan', 'Peterson', 'Cooper', 'Reed', 'Bailey', 'Bell', 'Gomez', 'Kelly', 'Howard', 'Ward', 'Cox', 'Diaz', 'Richardson', 'Wood', 'Watson', 'Brooks', 'Bennett', 'Gray', 'James', 'Reyes', 'Cruz', 'Hughes', 'Price', 'Myers', 'Long', 'Foster', 'Sanders', 'Ross', 'Morales', 'Powell', 'Sullivan', 'Russell', 'Ortiz', 'Jenkins', 'Gutierrez', 'Perry', 'Butler', 'Barnes', 'Fisher'] #Create a node for each customer, and classify it as a 'customer' node type for customer in customerCols: DG.add_node(customer,typ="customer") #Each row defines a deal for row in reader: #Mint an ID for the deal dealID='deal'+str(dcid) #Add a node for the deal, and classify it as a 'deal' node type DG.add_node(dealID,typ='deal') #Annotate the deal node with string based deal attributes for deal in nodeColsStr: DG.node[dealID][deal]=row[deal] #Annotate the deal node with numeric based deal attributes for deal in nodeColsInt: DG.node[dealID][deal]=int(row[deal]) #If the cell in a customer column is set to 1, ## draw an edge between that customer and the corresponding deal for customer in customerCols: if str(row[customer])=='1': DG.add_edge(dealID,customer) #Increment the node ID counter dcid=dcid+1 #write graph nx.write_graphml(DG,"inventory.graphml")
The graph we’re generating (download .graphml) has a basic structure that looks something like the following:
Which is to say, in this example customer C1 engaged in a single deal, D1; customer C2 participated in every deal, D1, D2 and D3; and customer C3 partook of deals D2 and D3.
Opening the graph file into Gephi as a directed graph, we get a count of the number of actual trades there were from the edge count:
If we run the Average degree statistic, we can see that there are some nodes that are not connected to any other nodes (that is, they are either deals with no takers, or customers who never took part in a deal):
We can view these nodes using a filter:
We can also use the filter the other way, to exclude the unaccepted deals, and then create a new workspace containing just the deals that were taken up, and the customers that bought into them:
The workspace selector is at the bottom of the window, on the right hand side:
(Hmmm… for some reason, the filtered graph wasn’t exported for me… the whole graph was. Bug? Fiddling with with Giant Component filter, then exporting, then running the Giant Component filter on the exported graph and cancelling it seemed to fix things… but something is not working right?)
We can now start to try out some interactive visual analysis. Firstly, let’s lay out the nodes using a force-directed layout algorithm (ForceAtlas2) that tries to position nodes so that nodes that are connected are positioned close to each other, and nodes that aren’t connected are kept apart (imagine each node as trying to repel the other nodes, with edges trying to pull them together).
Our visual perception is great at identifying spatial groupings (see, for example, the Gestalt principles, which lead to many a design trick and a bucketful of clues about how to tease data apart in a visually meaningful way…), but are they really meaningful?
At this point in the conversation we’re having with the data, I’d probably call on a statistic that tries to place connected groups of nodes into separate groups so that I could colour the nodes according to their group membership: the modularity statistic:
The modularity statistic is a random algorithm, so you may get different (though broadly similar) results each time you run it. In this case, it discovered six possible groupings or clusters of interconnected nodes (often, one group is a miscellany…). We can see which group each node was place in by applying a Partition colouring:
We see how the modularity groupings broadly map on to the visual clusters revealed by the ForceAtlas2 layout algorithm. But do the clusters relate to anything meaningful? What happens if we turn the labels on?
The green group appear to relate to Weed transactions, reds are X, Meth and Ketamine deals, and yellow for the coke heads. So the deals do appear to cluster around different types of deal.
So what else might we be able to learn? Does the Ready for Use dimension on a deal separate out at all (null nodes on this dimension relate to customers)?
We’d need to know a little bit more about what the implications of “Ready for Use” might be, but at a glance we get a feeling the the cluster on the far left is dominated by trades with large numbers of customers (there are lots of white/customer nodes), and the Coke related cluster on the right has quite a few trades (the green nodes) that aren’t ready for use. (A question that comes to mind looking at that area is: are there any customers who seem to just go for not Ready for Use trades, and what might this tell us about them if so?)
Something else we might look to is the size of the trades, and any associated discounts. Let’s colour the nodes using the Partition tool to according to node type (attribute name is “typ” – nodes are deals (red) or customers (aqua)) and then size the nodes according to deal size using the Ranking display:
Small fry deals in the left hand group. Looking again at the Coke grouping, where there is a mix of small and large deals, another question we might file away is “are there customers who opt either for large or small trades?”
Let’s go back to the original colouring (via the Modularity coloured Partition; note that the random assignment of colours might change from the original colour set; right click allows you to re-randomise colours; clicking on a colour square allows you to colour select by hand) and size the nodes by OutDegree (that is, the sum total of edges outgoing from a node – remember, the graph was described as a directed graph, with edges going from deals to customers):
I have then sized the labels so that they are proportional to node size:
The node/label sizing shows which deals had plenty of takers. Sizing by OutDegree shows how many deals each customer took part in:
This is quite a cluttered view… returning to the Layout panel, we can use the Expansion layout to stretch out the whole layout, as well as the Label Adjust tool to jiggle nodes so that the labels don’t overlap. Note that you can also click on a node to drag it around, or a group of nodes by increasing the “field of view” of the mouse cursor:
Here’s how I tweaked the layout by expanding the layout then adjusting the labels…:
(One of the things we might be tempted to do is filter out the users who only engaged in one or two or deals, perhaps as a wau of identifying regular customers; of course, a user may only engage in a single, but very large deal, so we’d need to think carefully about what question we were actually asking when making such a choice. For example, we might also be interested in looking for customers engaging in infrequent large trades, which would require a different analysis strategy.)
Insofar as it goes, this isn’t really very interesting – what might be more compelling would be data relating to who was dealing with whom, but that isn’t immediately available. What we should be able to do, though, is see which customers are related by virtue of partaking of the same deals, and see which deals are related by virtue of being dealt to the same customers. We can maybe kid ourselves into thinking we can see this in the customer-deal graph, but we can be a little bit more rigorous two by constructing two new graphs: one that shows edges between deals that share one or more common customers; and one that shows edges between customers who shared one or more of the same deals.
Recalling the “bimodal”/bipartite graph above:
that means we should be able to generate unimodal graphs along the following lines:
D1 is connected to D2 and D3 through customer C2 (that is, an edge exists between D1 and D2, and another edge between D1 and D3). D2 and D3 are joined together through two routes, C2 and C3. We might thus weight the edge between D2 and D3 as being heavier, or more significant, than the edge between either D1 and D2, or D1 and D3.
And for the customers?
C1 is connected to C2 through deal D1. C2 and C3 are connected by a heavier weighted edge reflecting the fact that they both took part in deals D2 and D3.
You will hopefully be able to imagine how more complex customer-deal graphs might collapse into customer-customer or deal-deal graphs where there are multiple, disconnected (or only very weakly connected) groups of customers (or deals) based on the fact that there are sets of deals that do not share any common customers at all, for example. (As an exercise, try coming up with some customer-deal graphs and then “collapsing” them to customer-customer or deal-deal graphs that have disconnected components.)
So can we generate graphs of this sort using Gephi? Well, it just so happens we can, using the Multimode Networks Projection tool. To start with let’s generate another couple of workspaces containing the original graph, minus the deals that had no customers. Selecting one of these workspaces, we can now generate the deal-deal (via common customer) graph:
When we run the projection, the graph is mapped onto a deal-deal graph:
The thickness of the edges describes the number of customers any two deals shared.
If we run the modularity statistic over the deal-deal graph and colour the graph by the modularity partition, we can see how the deals are grouped by virtue of having shared customers:
If we then filter the graph on edge thickness so that we only show edges with a thickness of three or more (three shared customers) we can see some how some of the deal types look as if they are grouped around particular social communities (i.e they are supplied to the same set of people):
If we now go to the other workspace we created containing the original (less unsatisfied deals) graph, we can generate the customer-customer projection:
Run the modularity statistic and recolour:
Whilst there is a lot to be said for maintaining the spatial layout so that we can compare different plots, we might be tempted to rerun the layout algorithm to the see if it highlights the structural associations any more clearly? In this case, there isn’t much difference:
If we run the Network diameter tool, we can generate some network statistics over this customer-customer network:
If we now size the nodes by betweenness centrality, size labels proportional nodes, and use the expand/label overlap layout tools to tweak the display, here’s what we get:
Thompson looks to be an interesting character, spanning the various clusters… but what deals is he actually engaging in? If we go back to the orignal customer-deal graph, we can use an ego filter to see:
To look for actual social groupings, we might filter the network based on edge weight, for example to show only edges above a particular weight (that is, number of shared deals), and then drop this set into a new workspace. If we then run the Average Degree statistic, we can calculate the degree of nodes in this graph, and size nodes accordingly. Relaying out the graph shows us some corse social netwroks based on significant numbers of shared trades:
Hopefully by now you are starting to “see” how we can start to have a visual conversation with the data, asking different questions of it based on things we are learning about it. Whilst we may need to actually look at the numbers (and Gephi’s Data Laboratory tab allows us to do that), I find that visual exploration can provide a quick way of orienting (orientating?) yourself with respect to a particular dataset, and getting a feel for the sorts of questions you might ask of it, questions that might well involve a detailed consideration of the actual numbers themselves. But for starters, the visual route often works for me…
PS There is a link to the graph file here, so if you want to try exploring it for yourself, you can do so:-)
This is really just a searchable placeholder post for me to bundle up a set of links for things I keep forgetting about relating to opendata in the third sector/NGO land that may be useful in generating financial interest maps, trustee interest maps, etc… As such, it’ll probably grow as time goes by..
- Open Charities – headline data relating to UK registered(?) charities
- Nominet Trust Open Data and Charitties report [PDF]
- IATI data – “International Aid Transparency Initiative (IATI) is a global transparency standard that makes information about aid spending easier to access, use and understand” (see IATI Standard)
It’s also possible to find payments to charities from local councils and goevernment departments using services like OpenlyLocal, OpenSpending, (I’m not sure if detail appears in Open Data Communities).
In a post from way back when – Does Funding Equal Happiness in Higher Education? – that I still get daily traffic to, though the IBM Many Eyes Wikified hack described in it no longer works, I aggregated and reused a selection of data sets collected by the Guardian datastore relating to HE.
Whilst the range of datasets used in that hack don’t seem to have been re-collected more recently, the Guardian DataStore does still publish and annual set of aggregated data (from Unistats?) for courses by subject area across UK HE (University guide 2013 – data tables).
The DataStore data is published using Google spreadsheets, which as regular readers will know also double up as a database. The Google Visualisation API that’s supported by Google Spreadsheets also makes it easy to pull data from the spreadsheets into an interactive dashboard view.
As an example, I’ve popped a quick demo up as a Scraperwiki View showing how to pull data from a selected sheet within the Guardian University data spreadsheet and filter it using a range of controls. I’ve also added a tabular view, and a handful of scatterplots, to show off the filtered data.
To play with the view, visit here: Guardian Student Rankings.
If you want to hack around with the view, it’s wikified here: wikified source code.
PS I’ve also pulled all the subject data tables into a single Scraperwiki database: Guardian HE Data 2013
A BIS Press Release (Next steps making midata a reality) seems to have resulted in folk tweeting today about the #midata consultation that was announced last month. If you haven’t been keeping up, #midata is the policy initiative around getting companies to make “[consumer data] that may be actionable and useful in making a decision or in the course of a specific activity” (whatever that means) available to users in a machine readable form. To try to help clarify matters, several vignettes are described in this July 2012 report – Example applications of the midata programme – which plays the role of a ‘draft for discussion’ at the September midata Strategy Board [link?]. Here’s a quick summary of some of them:
- form filling: a personal datastore will help you pre-populate forms and provide certified evidence of things like: proof of her citizenship, qualified to drive, passed certain exams and achieved certain qualifications, passed a CRB check, and so on. (Note: I’ve previously tried to argue the case for the OU starting to develop a service (OU Qualification Verification Service) around delivering verified tokens relating to the award of OU degrees, and degrees awarded by the polytechnics, as was (courtesy of the OU’s CNAA Aftercare Service), but after an initial flurry of interest, it was passed on. midata could bring it back maybe?
- home moving admin: change your details in a personal “mydata” data store, and let everyone pick up the changes from there. Just think what fun you could have with an attack on this;-)
- contracts and warranties dashboard: did my crApple computer die the week before or after the guarantee ran out?
- keeping track of the housekeeping: bank and financial statement data management and reporting tools. I thought there already was software for doing this? do we use it though? I’d rather my bank improved the tools it provided me with?
- keeping up with the Jones’s: how does my house’s energy consumption compare with that of my neighbours?
- which phone? Pick a tariff automatically based on your actual phone usage. From going through this recently, the problem is not with knowing how I use my phone (easy enough to find out), it’s with navigating the mobile phone sites trying to understand their offers. (And why can’t Vodafone send me an SMS to say I’m 10 minutes away from using up this month’s minutes, rather than letting me go over? The midata answer might be an agent that looks at my usage info and tells me when I’m getting close to my limit, which requires me having access to my contract details in a machine readable form, I guess?
And here’s a BIS blog post summarising them: A midata future: 10 ways it could shape your choices.
(The #midata policy seems based on a belief that users want better access to data so they can do things with it. I’m not convinced – why should I have to export my bank data to another service (increasing the number of services I must trust) rather than my bank providing me with useful tools directly? I guess one way this might play out is that any data that does dribble out may get built around by developers who then sell the tools back to the data providers so they can offer them directly? In this context, I guess I should read the BIS commissioned Jigsaw Research report: Potential consumer demand for midata.)
Today has also seen a minor flurry of chat around the call for evidence on the Communications Data Bill, presumably because the closing date for responses is tomorrow (draft Communications Data Bill). (Related reading: latest Annual Report of the Interception of Communications Commissioner.) Again, if you haven’t been keeping up, the draft Communications Data Bill describes communications data in the following terms:
- Communications data is information about a communication; it can include the details of the time, duration, originator and recipient of a communication; but not the content of the communication itself
- Communications data falls into three categories: subscriber data; use data; and traffic data.
The categories are further defined in an annex:
- Subscriber Data – Subscriber data is information held or obtained by a provider in relation to persons to whom the service is provided by that provider. Those persons will include people who are subscribers to a communications service without necessarily using that service and persons who use a communications service without necessarily subscribing to it. Examples of subscriber information include:
– ‘Subscriber checks’ (also known as ‘reverse look ups’) such as “who is the subscriber of phone number 012 345 6789?”, “who is the account holder of e-mail account email@example.com?” or “who is entitled to post to web space http://www.xyz.anyisp.co.uk?”;
– Subscribers’ or account holders’ account information, including names and addresses for installation, and billing including payment method(s), details of payments;
– information about the connection, disconnection and reconnection of services which the subscriber or account holder is allocated or has subscribed to (or may have subscribed to) including conference calling, call messaging, call waiting and call barring telecommunications services;
– information about the provision to a subscriber or account holder of forwarding/redirection services;
– information about apparatus used by, or made available to, the subscriber or account holder, including the manufacturer, model, serial numbers and apparatus codes.
– information provided by a subscriber or account holder to a provider, such as demographic information or sign-up data (to the extent that information, such as a password, giving access to the content of any stored communications is not disclosed).
- Use data – Use data is information about the use made by any person of a postal or telecommunications service. Examples of use data may include:
– itemised telephone call records (numbers called);
– itemised records of connections to internet services;
– itemised timing and duration of service usage (calls and/or connections);
– information about amounts of data downloaded and/or uploaded;
– information about the use made of services which the user is allocated or has subscribed to (or may have subscribed to) including conference calling, call messaging, call waiting and call barring telecommunications services;
– information about the use of forwarding/redirection services;
– information about selection of preferential numbers or discount calls;
- Traffic Data – Traffic data is data that is comprised in or attached to a communication for the purpose of transmitting the communication. Examples of traffic data may include:
– information tracing the origin or destination of a communication that is in transmission;
– information identifying the location of equipment when a communication is or has been made or received (such as the location of a mobile phone);
– information identifying the sender and recipient (including copy recipients) of a communication from data comprised in or attached to the communication;
– routing information identifying equipment through which a communication is or has been transmitted (for example, dynamic IP address allocation, file transfer logs and e-mail headers – to the extent that content of a communication, such as the subject line of an e-mail, is not disclosed);
– anything, such as addresses or markings, written on the outside of a postal item (such as a letter, packet or parcel) that is in transmission;
– online tracking of communications (including postal items and parcels).
To put the communications data thing into context, here’s something you could try for yourself if you have a smartphone. Using something like the SMS to Text app (if you trust it!), grab your txt data from your phone and try charting it: SMS analysis (coming from an Android smartphone or an IPhone). And now ask yourself: what if I also mapped my location data, as collected by my phone? And will this sort of thing be available as midata, or will I have to collect it myself using a location tracking app if I want access to it? (There’s an asymmetry here: the company potentially collecting the data, or me collecting the data…)
It’s also worth bearing in mind that even if access to your data is locked down, access to the data of people associated with you might reveal quite a lot of information about you, including your location, as Adam Sadilek et al. describe: Finding Your Friends and Following Them to Where You Are (see also Far Out: Predicting Long-Term Human Mobility). My own tinkerings with emergent social positioning (looking at who the followers of particular twitter users also follow en masse) also suggest we can generate indicators about potential interests of a user by looking at the interests of their followers… Even if you’re careful about who your friends are, your followers might still reveal something about you you have tried not to disclose yourself (such as your birthday…). (That’s one of the problems with asymmetric trust models! Hmmm… could be interesting to start trying to model some of this… )
Both of these consultations provide a context for reflecting on the extent to which companies use data for their own processing purposes (for a recent review, see What happens to my data? A novel approach to informing users of data processing practices), the extent to which they share this data in raw and processed form with other companies or law enforcement agencies, the extent to which they may use it to underwrite value-added/data-powered services to users directly or when combined with data from other sources, the extent to which they may be willing to share it in raw or processed form back with users, and the extent to which users may then be willing (or licensed) to share that data with other providers, and/or combine it with data from other providers.
One of the biggest risks from a “what might they learn about me” point of view – as well as some of the biggest potential benefits – comes from the reconciliation of data from multiple different sources. Mosaic theory is an idea taken from the intelligence community that captures the idea that when data from multiple sources is combined, the value of the whole view may be greater than the sum of the parts. When privacy concerns are idly raised as a reason against the release of data, it is often suspicion and fears around what a data mosaic picture might reveal that act as drivers of these concerns. (Similar fears are also used as a reason against the release of data, for example under Freedom of Information requests, in case a mosaic results in a picture that can be used against national interests: eg D.E. Pozen, The Mosaic Theory, National Security, and the Freedom of Information Act and MP Goodwin, A National Security Puzzle: Mosaic Theory and the First Amendment Right of Access in the Federal Courts).
Note that within a particular dataset, we might also appeal to mosaic theory thinking; for example, might we learn different things when we observe individual data records as singletons, as opposed to a set of data (and the structures and patterns it contains) as a single thing: GPS Tracking and a ‘Mosaic Theory’ of Government Searches. And as a consequence, might we want to treat individual data records, and complete datasets, differently?
PS via this ORG post – Consulympics: opportunities to have your say on tech policies – which details a whole raft of currently open ICT related consultations in the UK, I am reminded of this ICO Consultation on the draft Anonymisation code of practice along with a draft of the anaoymisation code itself.
A week or two ago, the Government Data Service started publishing a summary document containing website transaction stats from across central government departments (GDS: Data Driven Delivery). The transactional services explorer uses a bubble chart to show the relative number of transactions occurring within each department:
The sizes of the bubbles are related to the volume of transactions (although I’m not sure what the exact relationship is?). They’re also positioned on a spiral, so as you work clockwise round the diagram starting from the largest bubble, the next bubble in the series is smaller (the “Other” catchall bubble is the exception, sitting as it does on the end of the tail irrespective of its relative size). This spatial positioning helps communicate relative sizes when the actual diameter of two bubbles next to each other is hard to differentiate between.
Clicking on a link takes you down into a view of the transactions occurring within that department:
Out of idle curiosity, I wondered what a treemap view of the data might reveal. The order of magnitude differences in the number of transactions across departments meant the the resulting graphic was dominated by departments with large numbers of transactions, so I did what you do in such cases and instead set the size of the leaf nodes in the tree to be the log10 of the number of transactions in a particular category, rather than the actual number of transactions. Each node higher up the tree was then simply the sum of values in the lower levels.
The result is a treemap that I decided shows “interestingness”, which I defined for the purposes of this graphic as being some function of the number and variety of transactions within a departement. Here’s a nested view of it, generated using a Google chart visualisation API treemap component:
The data I grabbed had a couple of usable structural levels that we can make use of in the chart. Here’s going down to the first level:
…and then the second:
Whilst the block sizes aren’t really a very good indicator of the number of transactions, it turns out that the default colouring does indicate relative proportions in the transaction count reasonably well: deep red corresponds to a low number of transactions, dark green a large number.
As a management tool, I guess the colours could also be used to display percentage change in transaction count within an area month on month (red for a decrease, green for an increase), though a slightly different size transformation function might be sensible in order to draw out the differences in relative transaction volumes a little more?
I’m not sure how well this works as a visualisation that would appeal to hardcore visualisation puritans, but as a graphical macroscopic device, I think it does give some sort of overview of the range and volume of transactions across departments that could be used as an opening gambit for a conversation with this data?
I’m starting to feel as if I need to do myself a weekly round-up, or newsletter, on open data, if only to keep track of what’s happening and how it’s being represented. Today, for example, the Commons Public Accounts Committee published a report on Implementing the Transparency Agenda.
From a data wrangling point of view, it was interesting that the committee picked up on the following point in its Conclusions and recommendations (thanks for the direct link, Hadley:-), whilst also missing the point…:
2. The presentation of much government data is poor. The Cabinet Office recognises problems with the functionality and usability of its data.gov.uk portal. Government efforts to help users access data, as in crime maps and the schools performance website, have yielded better rates of access. But simply dumping data without appropriate interpretation can be of limited use and frustrating. Four out of five people who visit the Government website leave it immediately without accessing links to data. So there is a clear benefit to the public when government data is analysed and interpreted by third parties – whether that be, for example, by think-tanks, journalists, or those developing online products and smartphone applications. Indeed, the success of the transparency agenda depends on such broader use of public data. The Cabinet Office should ensure that:
– the publication of data is accessible and easily understood by all; and
– where government wants to encourage user choice, there are clear criteria to determine whether government itself should repackage information to promote public use, or whether this should be done by third parties.
A great example of how data not quite being published consistently can cause all sorts of grief when trying to aggregate it came to my attention yesterday via @lauriej:
Laura James (@LaurieJ) July 31, 2012
It leads to a game where you can help make sense of not quite right column names used to describe open spending data… (I have to admit, I found the instructions a little hard to follow – a screenshot walked through example would have helped? It is, after all, largely a visual pattern matching exercise…)
From a spend mapping perspective, this is also relevant:
6. We are concerned that ‘commercial confidentiality’ may be used as an inappropriate reason for non-disclosure of data. If transparency is to be meaningful and comprehensive, private organisations providing public services under contract must make available all relevant public information. The Cabinet Office should set out policies and guidance for public bodies to build full information requirements into their contractual agreements, in a consistent way. Transparency on contract pricing which is often hidden behind commercial confidentiality clauses would help to drive down costs to the taxpayer.
And from a knowing “what the hell is going on?” perspective, there was also this:
7. Departments do not make it easy for users to understand the full range of information available to them. Public bodies have not generally provided full inventories of all of the information they hold, and which may be available for disclosure. The Cabinet Office should develop guidance for departments on information inventories, covering, for example, classes of information, formats, accuracy and availability; and it should mandate publication of the inventories, in an easily accessible way.
The publication of government department open data strategies may go some way to improving this. I’ve also been of a mind that more accessible ways of releasing data burden reporting requirements could help clarify what “working data” is available, in what form, and the ways in which it is routinely being generated and passed between bodies. Sorting out better pathways between FOI releases of data and the then regular release of such data as opendata is also something I keep wittering on about (eg FOI Signals on Useful Open Data? and The FOI Route to Real (Fake) Open Data via WhatDoTheyKnow).
From within the report, I also found a reiteration of this point notable:
This Committee has previously argued that it is vital that we and the public can access data from private companies who contract to provide public services. We must be able to follow the taxpayers’ pound wherever it is spent. The way contracts are presently written does not enable us to override rules about commercial confidentiality. Data on public contracts delivered by private contractors must be available for scrutiny by Parliament and the public. Examples we have previously highlighted include the lack of transparency of financial information relating to the Private Finance Initiative and welfare to work contractors.
…not least because data releases from companies is also being addressed on another front, midata, most notably via the recently announced BIS Midata 2012 review and consultation [consultation doc PDF]. For example, the consultation document suggests:
1.10 The Government is not seeking to require the release of data electronically at this stage, and instead is proposing to take a power to do so. The Secretary of State would then have to make an order to give effect to the power. An order making power, if utilised, would compel suppliers of services and goods to provide to their customers, upon request, historic transaction/ consumption data in a machine readable format. The requirement would only apply to businesses that already hold this information electronically about individual consumers.
1.11. Data would only have to be released electronically at the request of the consumer and would be restricted to an individual’s consumption and transaction data, since in our view this can be used to better understand consumers’ behaviour. It would not cover any proprietary analysis of the data, which has been done for its own purposes by the business receiving the request.
(More powers to the Minister then…?!) I wonder how this requirement would extend rights available under the Data Protection Act (and why couldn’t that act be extended? For example, Data Protection Principle 6 includes “a right of access to a copy of the information comprised in their personal data” – couldn’t that be extended to include transaction data, suitably defined? Though I note 1.20. There are a number of different enforcement bodies that might be involved in enforcing midata. Data protection is enforced by the Information Commissioner’s Office (ICO), whilst the Office of Fair Trading (OFT), Trading Standards and sector regulators currently enforce consumer protection law. and Question 17: Which body/bodies is/are best placed to perform the enforcement role for this right?) There are so many bits of law around relating to data that I don’t understand at all that I think I need to do myself an uncourse on them… (I also need to map out the various panels, committees and groups that have an open data interest… The latest, of course, is the Open Data User Group (ODUG), the minutes of whose first meeting were released some time ago now, although not in a directly web friendly format…)
The consultation goes on:
1.18. For midata to work well the data needs be made available to the consumer in electronic format as quickly as possible following a request (maybe immediately) and as inexpensively as possible. This will minimise friction and ensure that consumers are able to access meaningful data at the point it is most useful to them. This requirement will only cover data that is already held electronically at the time of the request so we expect that the time needed to respond to a consumer’s request will be short – in many cases instant
Does the Data Protection Act require the release of data in an electronic format, and ideally a structured electronic format (i.e. as something resembling a dataset? The recent Protection of Freedoms Act amended the FOI Act with language relating to the definition and release of datasets, so I wonder if this approach might extend elsewhere?
Coming at the transparency thing from another direction, I also note with interest (via the BBC) that MPs say all lobbyists should be on new register:
All lobbyists, including charities, think tanks and unions, should be subject to new lobbying regulation, a group of MPs have said. They criticised government plans to bring in a statutory register for third-party lobbyists, such as PR firms, only. They said the plan would “do nothing to improve transparency”. Instead, the MPs said, regulation should be brought in to cover all those who lobby professionally.
This is surely a blocking move? If we can’t have a complete register, we shouldn’t have any register. So best not to have one at all for a year or two.. or three… or four… Haven’t they heard of bootstrapping and minimum viability releases?! Or maybe I got the wrong idea from the lead I took from the start of the news report? I guess I need to read what the MPs actually said in the Political and Constitutional Reform – Second Report: Introducing a statutory register of lobbyists.
PS For a round-up of other recent reports on open data, see OpenData Reports Round Up (Links…).
PPS This is also new to me: new UK Data Service “starting on 1 October 2012, [to] integrate the Economic and Social Data Service (ESDS), the Census Programme, the Secure Data Service and other elements of the data service infrastructure currently provided by the ESRC, including the UK Data Archive.”
This is so much a blog post as a dumping ground for bits and pieces relating to Olympics data coverage…
BBC Internet blog: Olympic Data Services and the Interactive Video Player – has a brief overview of how the BBC gets its data from LOCOG; and Building the Olympic Data Services describes something of the technical architecture.
Computer Weekly report from Sept 2011: Olympic software engineers enter final leg of marathon IT development project
Examples of some of the Olympics related products you can buy from the Press Association: Press Association: Olympics Graphics (they also do a line of widgets…;-)
I haven’t found a public source of press releases detailing results that has been published as such (seems like you need to register to get them?) but there are some around if you go digging (for example, gymnastics results, or more generally, try a recent websearch for something like this: "report created" site:london2012.olympics.com.au filetype:pdf olympics results).
[PDFs detailing biographical details of entrants to track and field events at lease: games XXX olympiad biographical inurl:www.iaaf.org/mm/Document/ filetype:pdf]
A really elegant single web page app from @gabrieldance: Was an Olympic Record Set Today? Great use of the data…:-)
This also makes sense – Journalism.co.uk story on how Telegraph builds Olympics graphics tool for its reporters to make it easy to generate graphical views over event results.
PS though it’s not data related at all, you may find this amusing: OU app for working out which Olympic sport you should try out… Olympisize Me (not sure how you know it was an OU app from the landing page though, other than by reading the URL…?)
PPS I tweeted this, but figure it’s also worth a mention here: isn’t it a shame that LOCOG haven’t got into the #opendata thing with the sports results…
A discussion, earlier, about whether it was now illegal to drink in public…
…I thought not, think not, at least, not generally… My understanding was, that local authorities can set up controlled, alcohol free zones and create some sort of civil offence for being caught drinking alcohol there. (As it is, councils can set up regions where public consumption of alcohol may be prohibited and this prohibition may be enforced by the police.) So surely there must be an #opendata powered ‘no drinking here’ map around somewhere? The sort of thing that might result from a newspaper hack day, something that could provide a handy layer on a pub map? I couldn’t find one, though…
I did a websearch, turned up The Local Authorities (Alcohol Consumption in Designated Public Places) Regulations 2007, which does indeed appear to be the bit of legislation that regulates drinking alcohol in public, along with a link to a corresponding guidance note: Home Office circular 013 / 2007:
16. The provisions of the CJPA [Criminal Justice and Police Act 2001, Chapter 2 Provisions for combatting alcohol-related disorder] should not lead to a comprehensive ban on drinking in the open air.
17. It is the case that where there have been no problems of nuisance or annoyance to the public or disorder having been associated with drinking in that place, then a designation order … would not be appropriate. However, experience to date on introducing DPPOs has found that introducing an Order can lead to nuisance or annoyance to the public or disorder associated with public drinking being displaced into immediately adjacent areas that have not been designated for this purpose. … It might therefore be appropriate for a local authority to designate a public area beyond that which is experiencing the immediate problems caused by anti-social drinking if police evidence suggests that the existing problem is likely to be displaced once the DPPO was in place. In which case the designated area could include the area to which the existing problems might be displaced.
Creepy, creep, creep…
This, I thought, was interesting too, in the guidance note:
37. To ensure that the public have full access to information about designation orders made under section 13 of the Act and for monitoring arrangements, Regulation 9 requires all local authorities to send a copy of any designation order to the Secretary of State as soon as reasonably practicable after it has been made.
38. The Home Office will continue to maintain a list of all areas designated under the 2001 Act on the Home Office website: www.crimereduction.gov.uk/alcoholorders01.htm [I'm not convinced that URL works any more...?]
39. In addition, local authorities may wish to consider publicising designation orders made on their own websites, in addition to the publicity requirements of the accompanying Regulations, to help to ensure full public accessibility to this information.
So I’m thinking: this sort of thing could be a great candidate for a guidance note from the Home Office to local councils recommending ways of releasing information about the extent of designation orders as open geodata. (Related? Update from ONS on data interoperability (“Overcoming the incompatibility of statistical and geographic information systems”).)
I couldn’t immediately find a search on data.gov.uk that would turn up related datasets (though presumably the Home Office is aggregating this data, even if it’s just in a filing cabinet or mail folder somewhere*), but a quick websearch for Designated Public Places site:gov.uk intitle:council turned up a wide selection of local council websites along with their myriad ways of interpreting how to release the data. I’m not sure if any of them release the data as geodata, though? Maybe this would be an appropriate test of the scope of the Protection of Freedoms Act Part 6 regulations on the right to request data as data (I need to check them again…)?
* The Home Office did release a table of designated public places in response to an FOI request about designated public place orders, although not as data… But it got me wondering: if I scheduled a monthly FOI request to the Home Office requesting the data on a monthly basis, would they soon stop fulfilling the requests as timewasting? How about if we got a rota going?! Is there any notion of a longitudinal/persistent FOI request, that just keeps on giving (could I request the list of designated public places the Home Office has been informed about over the last year, along with a monthly update of requests in the previous month (or previous month but one, or whatever is reasonable…) over the next 18 months, or two years, or for the life of the regulation, or until such a time as the data is published as open data on a regular basis?
As for the report to government that a local authority must make on passing a designation order – 9. A copy of any order shall be sent to the Secretary of State as soon as reasonably practicable after it has been made. – it seems that how the area denoted as a public space is described is moot: “5. Before making an order, a local authority shall cause to be published in a newspaper circulating in its area a notice— (a)identifying specifically or by description the place proposed to be identified;“. Hmmm, two things jump out there…
Firstly, a local authority shall cause to be published in a newspaper circulating in its area [my emphasis; how is a newspaper circulating in its area defined? Do all areas of England have a non-national newspaper circulating in that area? Does this implicitly denote some "official channel" responsibility on local newspapers for the communication of local government notices?]. Hmmm…..
Secondly, the area identified specifically or by description. On commencement, the order must also be made public by “identifying the place which has been identified in the order”, again “in a newspaper circulating in its area”. But I wonder – is there an opportunity there to require something along the lines of and published using an appropriate open data standard in a open public data repository, and maybe further require that this open public data copy is the one that is used as part of the submission informing the Home Office about the regulation? And if we go overboard, how about we further require that each enacted and proposed order is published as such along with a machine readable geodata description and that a single aggregate files containing all that Local Authority’s currently and planned Designated Public Spaces are also published (so one URL for all current spaces, one for all planned ones). Just by the by, does anyone know of any local councils publishing boundary date/shapefiles that mark out their Designated Public Spaces? Please let me know via the comments, if so…
A couple of other, very loosely (alcohol) related, things I found along the way:
- Local Alcohol Profiles for England: the aim appears to have been the collation of, and a way of exploring, a “national alcohol dataset”, that maps alcohol related health indicators on a PCT (Primary Care Trust) and LA (local authority) basis. What this immediately got me wondering was: did they produce any tooling, recipes or infrastructure that would it make a few clicks easy to pull together a national tobacco dataset and associated website, for example? And then I found the Local Tobacco Control Profiles for England toolkit on the London Health Observatory website, along with a load of other public health observatories and it made me remember – again – just how many data sensemaking websites there already are out there…
- UK Alcohol Strategy – maybe some leads into other datasets/data stories?
PS I wonder if any of the London Boroughs or councils hosting regional events have recently declared any new Designated Public Spaces #becauseOfTheOlympics.