The Closed Route to Open Data

A couple of weeks ago, I gave a presentation to the WebScience students at the University of Southampton on the topic of open data, using it as an opportunity to rehearse a view of open data based on the premise that it starts out closed. In much the same way that Darwin’s Theory of Evolution by Natural Selection rests on a major presupposition – a theory of inheritance and the existence of processes that support reproduction with minor variation – so too does much of our thinking about open data rest on a presupposition: that many of the freedoms we associate with the use of open data arise, in legal terms, from license conditions that the “owner” of the data grants to us.

Viewing data in this light, we might start by considering what constitutes “closed” data and how it comes to be so, before identifying the means by which freedoms are granted and the data is opened up. (Sometimes it can also be easier to consider what you can’t do than what you can, especially when answers to questions such as “so what can you actually do with open data?” attract the (rather meaningless) response: “anything”. We can then contrast the freedoms you are granted with the constraints that remain…)

So how can data be “closed”?

One lens I particularly like for considering the constraints placed on actions and actors, particularly in the digital world (although the model can be applied elsewhere), is one I first saw described by Lawrence Lessig in Code and Other Laws of Cyberspace: What Things Regulate: A Dot’s Life.

Here’s the dot and the forces that constrain its behaviour:

[Figure: the dot and the four constraints that regulate it – law, norms, the market and architecture]

So we see, for example, the force of law, social norms, the market (that is, economic forces) and architecture, that is the “digital physical” way the world is implemented. (Architecture may of course be designed in order to enforce particular laws, but it is likely that other “natural laws” will arise as a result of any particular architecture or system implementation.)

Without too much thought, we might identify some constraints around data and its use under each of these separate lenses. For example:

  • Law: copyright and database right grant the creator of a dataset certain protective rights over that data; data protection laws (and other “privacy laws”) limit access to, or disclosure of, data that contains personal information, as well as restricting the use of that data to the purposes disclosed at the time it was collected. The UK Data Protection Act also underwrites the right of individuals to claim additional limits on data use, for example the rights “to object to processing that is likely to cause or is causing damage or distress; to prevent processing for direct marketing; to object to decisions being taken by automated means” (ICO Guide to the DPA, Principle 6 – The rights of individuals).
  • Norms: social mores, behaviour and taboos limit the ways in which we might use data, even if that use is not constrained by legal, economic or technical concerns. For example, applications that invite people to “burgle my house” based on analysing social network data to discover when they are likely to be away from home and what sorts of valuable product might be on the premises are generally not welcomed. Norms of behaviour and everyday work practice also mean that much data is not published even when there are no real reasons why it couldn’t be.
  • Market: in the simplest case, charging for access to data places a constraint on who can gain access to the data even in advance of trying to make use of it. If we extend “market” to cover other financial constraints, there may be a cost associated with preparing data so that it can be openly released.
  • Architecture: technical constraints can restrict what you can do with data. Digital rights management (DRM) uses encryption to render data streams unusable to all but the intended client; more prosaically, document formats such as PDF, or the “release” of data charts as flat image files, make it difficult for the end user to manipulate as data any data resources contained in those documents.

Laws can also be used to grant freedoms where freedoms are otherwise restricted. For example:

  • the Freedom of Information Act (FOI) provides a mechanism for requesting copies of datasets from public bodies; in addition, the Environmental Information Regulations “provide public access to environmental information held by public authorities”.
  • the laws around copyright relax certain copyright constraints for the purposes of criticism and review, reporting, research, teaching (IPO – Permitted uses of copyright works);
  • in the UK, the Data Protection Act provides for “a right of access to a copy of the information comprised in their personal data” (ICO Guide to the DPA, Principle 6).
  • in the UK, the Data Protection Act regulates what can be done legitimately with “personal” data. However, other pieces of legislation relax confidentiality requirements when it comes to sharing data for research purposes. For example:
    • the NHS Act s. 251 Control of patient information; for example, the Secretary of State for Health may “make regulations to set aside the common law duty of confidentiality for medical purposes where it is not possible to use anonymised information and where seeking individual consent is not practicable” (discussion). Note that there are changes afoot regarding s. 251…
    • The Secretary of State for Education has specific powers to share pupil data from the National Pupil database (NPD) “with named bodies and third parties who require access to the data to undertake research into the educational achievements of pupils”. The NPD “tracks a pupil’s progress through schools and colleges in the state sector, using pupil census and exam information. Individual pupil level attainment data is also included (where available) for pupils in non-maintained and independent schools” (access arrangements).
  • the Enterprise and Regulatory Reform Bill currently making its way through Parliament legislates around the Supply of Customer Data (the “#midata” clauses) which is intended to open up access to customer transaction data from suppliers of energy, financial services and mobile phones “(a) to a customer, at the customer’s request; (b) to a person who is authorised by a customer to receive the data, at the customer’s request or, if the regulations so provide, at the authorised person’s request.” Although proclaimed as a way of opening up individual rights to access this data, the effect will more likely see third parties enticing individuals to authorise the release to the third party of the individual first party’s personal transaction data held by a second party (for example, #Midata Is Intended to Benefit Whom, Exactly?). (So you’ll presumably legally be able to grant Facebook access to your mobile phone records… Or Facebook will find a way of getting you to release that data to them without you realising you granted them that permission;-)

Contracts (which I guess fall somewhere between norms and laws from the dot’s perspective – I need to read that section of Lessig’s book again!) can also be used by rights holders to grant freedoms over the data they hold the rights for. For example, the Creative Commons licensing framework provides a copyright holder with a set of tools for relaxing some of the rights afforded to them by copyright when they license the work accordingly.

Note that “I am not a lawyer”, so my understanding of all this is pretty hazy;-) I also wonder how the various pieces of legislation interact, and whether there are cracks and possible inconsistencies between them? If there are pieces of legislation around the regulation and use of data that I’m missing, please post links in the comments below, and I’ll try and do a more thorough round up in a follow on post.

Using SPARQL Query Libraries to Generate Simple Linked Data API Wrappers

A handful of open Linked Data related posts have appeared through my feeds in the last couple of days, including (via R-bloggers) SPARQL with R in less than 5 minutes, which shows how to query US data.gov Linked Data, and Leigh Dodds’ Brief Review of the Land Registry Linked Data.

I was going to post a couple of examples merging those two posts – showing how to access Land Registry data via Leigh’s example queries in R, then plotting some of the results using ggplot2 – but another post of Leigh’s today – SPARQL-doc, a simple convention for documenting individual SPARQL queries – has sparked another thought…

For some time I’ve been intrigued by the idea of a marketplace in queries over public datasets, as well as the public sharing of generally useful queries. A good query is like a good gold pan, or a good interview question – it can get a dataset to reveal something valuable that might otherwise have lain hidden. Coming up with a good query requires in part a good understanding of the structure of a dataset, and in part an eye for what sorts of secret the data may contain; the next step is crafting a well phrased query that can tease that secret out. Creating the query might take some time, some effort, and some degree of expertise in query optimisation to make it actually runnable in reasonable time (which is why I figure there may be a market for such things*), but once written, the query is there. And if it can be appropriately parameterised, it may generalise.

(*There are actually a couple of models I can think of: 1) I keep the query secret, but run it and give you the results; 2) I license the “query source code” to you and let you run it yourself. Hmm, I wonder: do folk license queries they share? How, and to what extent, might derived queries/query modifications be accommodated in such a licensing scheme?)

Pondering Leigh’s SPARQL-doc post alongside a few other things – another post via R-bloggers, Building a package in RStudio is actually very easy (which describes how to package a set of R files for distribution via github); asdfree (analyze survey data for free), a site that “announces obsessively-detailed instructions to analyze us government survey data with free tools” (and which includes R bundles to get you started quickly…); the resource listing Documentation for package ‘datasets’ version 2.15.2, which describes a bundled package of datasets for R; and the Linked Data API, which sought to provide a simple RESTful API over SPARQL endpoints – I wondered the following:

How about developing and sharing commented query libraries around Linked Data endpoints that could be used in arbitrary Linked Data clients?

(By “Linked Data clients”, I mean different user agent contexts. So for example, calling a query from Python, or R, or Google Spreadsheets.) That’s it… Simple.

One approach (the simplest?) might be to put each separate query into a separate file, with a filename that could be used to spawn a function name that could be used to call that query. Putting all the queries into a directory and zipping them up would provide a minimal packaging format. An additional manifest file might minimally document the filename along with the parameters that can be passed into and returned from the query. Helper libraries in arbitrary languages would open the query package and “compile” a programme library/set of “API” calling functions for that language (so for example, in R it would create a set of R functions, in Python a set of Python functions).
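
By way of illustration, a single query file in such a bundle might look something like the following. This is a made-up sketch: the comment-based metadata convention and the ${…} parameter placeholder are just assumptions for the sake of example, not an attempt to reproduce Leigh’s SPARQL-doc syntax:

#File: labelSearch.rq
#Description: find resources whose rdfs:label contains a given search term
#Param: ${searchTerm} - the text to look for
#Returns: ?resource, ?label

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?resource ?label WHERE {
  ?resource rdfs:label ?label .
  FILTER(CONTAINS(LCASE(STR(?label)), LCASE("${searchTerm}")))
}
LIMIT 50

A manifest entry for the bundle would then only need to record the filename, the parameter names, and the variables each query returns.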

(This reminds me of a Twitter exchange with Nick Jackson/@jacksonj04 a couple of days ago around “self-assembling” API programme libraries that could be compiled in an arbitrary language from a JSON API, cf. Swagger (presentation), which I haven’t had time to look at yet.)

The idea, then, is this:

  1. Define a simple file format for declaring documented SPARQL queries
  2. Define a simple packaging format for bundling separate SPARQL queries
  3. The simply packaged set of queries then defines a simple “raw query” API over a Linked Data dataset
  4. Describe a simple protocol for creating programming language specific library wrappers around the API defined by the query bundle package (a rough sketch of what this might look like in R follows).
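
For what it’s worth, here’s a rough, untested sketch of what step 4 might look like in R, using the SPARQL package mentioned in the R-bloggers post above. It assumes the “bundle” is just an unzipped directory of *.rq files, with ${…} placeholders for parameters (both assumptions on my part):

#Rough sketch - load a directory of documented *.rq query files and turn each one
#into an R function that fills in ${...} placeholders and runs the query
library(SPARQL)

loadQueryBundle=function(dir, endpoint){
  qfiles=list.files(dir, pattern="\\.rq$", full.names=TRUE)
  api=list()
  for (qf in qfiles){
    #Use the filename (minus extension) as the function name
    qname=sub("\\.rq$", "", basename(qf))
    qtemplate=paste(readLines(qf), collapse="\n")
    api[[qname]]=local({
      template=qtemplate
      function(params=list()){
        q=template
        #Substitute each named parameter into its ${...} placeholder
        for (p in names(params)){
          q=gsub(paste0("${", p, "}"), params[[p]], q, fixed=TRUE)
        }
        SPARQL(url=endpoint, query=q)$results
      }
    })
  }
  api
}

#Hypothetical usage, assuming a bundle directory containing labelSearch.rq:
#qlib=loadQueryBundle("myQueryBundle", "http://example.org/sparql")
#df=qlib$labelSearch(list(searchTerm="Isle of Wight"))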

So… I guess two questions arise: 1) would this be useful? 2) how hard could it be?

[See also: @ldodds again, on Publishing SPARQL queries and-documentation using github]

This Week in Open and Communications Data Land…

Following the official opening of the Open Data Institute (ODI) last week, there has been a flurry of data related announcements this week.

Things have been moving on the Communications Data front too. Communications Data got a look in as part of the 2011/2012 Security and Intelligence Committee Annual Report with a review of what’s currently possible and “why change may be necessary”. Apparently:

118. The changes in the telecommunications industry, and the methods being used by people to communicate, have resulted in the erosion of the ability of the police and Agencies to access the information they require to conduct their investigations. Historically, prior to the introduction of mobile telephones, the police and Agencies could access (via CSPs, when appropriately authorised) the communications data they required, which was carried exclusively across the fixed-line telephone network. With the move to mobile and now internet-based telephony, this access has declined: the Home Office has estimated that, at present, the police and Agencies can access only 75% of the communications data that they would wish, and it is predicted that this will significantly decline over the next few years if no action is taken. Clearly, this is of concern to the police and intelligence and security Agencies as it could significantly impact their ability to investigate the most serious of criminal offences.

N. The transition to internet-based communication, and the emergence of social networking and instant messaging, have transformed the way people communicate. The current legislative framework – which already allows the police and intelligence and security Agencies to access this material under tightly defined circumstances – does not cover these new forms of communication. [original emphasis]

Elsewhere in Parliament, the Joint Select Committee Report on the Draft Communications Data Bill was published and took a critical tone (Home Secretary should not be given carte blanche to order retention of any type of data under draft communications data bill, says joint committee. “There needs to be some substantial re-writing of the Bill before it is brought before Parliament” adds Lord Blencathra, Chair of the Joint Committee.) Friend and colleague Ray Corrigan links to some of the press reviews of the report here: Joint Committee declare CDB unworkable.

In other news, Prime Minister David Cameron’s announcement of DNA tests to revolutionise fight against cancer and help 100,000 patients was reported via a technology angle – Everybody’s DNA could be on genetic map in ‘very near future’ [Daily Telegraph] – as well as by means of more reactionary headlines: Plans for NHS database of patients’ DNA angers privacy campaigners [Guardian], Privacy fears over DNA database for up to 100,000 patients [Daily Telegraph].

If DNA is your thing, don’t forget that the Home Office already operates a National DNA Database for law enforcement purposes.

And if national databases are your thing, there’s always the National Pupil Database, which was in the news recently with the launch of a consultation on proposed amendments to individual pupil information prescribed persons regulations, which seeks to “maximise the value of this rich dataset” by widening access to the data. (Again, Ray provides some context and commentary: Mr Gove touting access to National Pupil Database.)

PS A late inclusion: DECC announcement around smart meter rollout with some potential links to #midata strategy (eg “suppliers will not be able to use energy consumption data for marketing purposes unless they have explicit consent”). A whole raft of consultations were held around smart metering and Government responses are also published today, including Government Response on Data Access and Privacy Framework, the Smart Metering Privacy Impact Assessment and a report on public attitudes research around smart metering. I also spotted an earlier consultation that had passed me by around the Data and Communications Company (DCC) License Conditions; here’s the response, which opens with: “The communications and data transfer and management required to support smart metering is to be organised by a new central communications body – the Data and Communications Company (“the DCC”). The DCC will be a new licensed entity regulated by the Gas and Electricity Markets Authority (otherwise referred to as “the Authority”, or “Ofgem”). A single organisation will be granted a licence under each of the Electricity and Gas Acts (there will be two licences in a single document, referred to as the “DCC Licence”) to provide these services within the domestic sector throughout Great Britain”. Another one to put on the reading pile…

Putting a big brother watch hat on, the notion of “meter surveillance” brings to mind a BBC article about an upcoming radio programme on “Electric Network Frequency (ENF) analysis” (which will hopefully then be persistently available on iPlayer?), The hum that helps to fight crime. According to Wikipedia, ENF is a forensic science technique for validating audio recordings by comparing frequency changes in the background mains hum in the recording with long-term high-precision historical records of mains frequency changes from a database. In turn, this reminds me of appliance signature detection (identifying which appliance is switched on or off from its electrical load curve signature), for example Leveraging smart meter data to recognize home appliances. In the context of audio surveillance, how about supplementing surveillance video cameras with microphones? Public Buses Across Country [US] Quietly Adding Microphones to Record Passenger Conversations.

Mapping Primary Care Trust (PCT) Data, Part 1

The launch or official opening or whatever it was of the Open Data Institute this week provided another chance to grab a snapshot of notable folk in the community, as for example demonstrated by people commonly followed by users of the #ODIlaunch hashtag on Twitter. The PR campaign also resulted in the appearance of some open data related use cases, such as a report in the Economist about an analysis by MastodonC and Prescribing Analytics mapping prescription charges (R code available), with a view to highlighting where prescriptions for branded, as opposed to the recommended generic, drugs are being issued at wasteful expense to the NHS. (See Exploring GP Practice Level Prescribing Data for some of my entry level doodlings with prescription data.)

Quite by chance, I’ve been looking at some other health data recently (Quick Shiny Demo – Exploring NHS Winter Sit Rep Data), which has been a real bundle of laughs. Looking at a range of health related datasets, data seems to be published at a variety of aggregation levels – individual practices and hospitals, Primary Care Trusts (PCTs), Strategic Health Authorities (SHAs) and the new Clinical Commissioning Groups (CCGs). Some of these map onto geographical regions that can then be coloured according to a particular measure value associated with that area.

I’ve previously experimented with rendering shapefiles and choropleth maps (Amateur Mapmaking: Getting Started With Shapefiles), so I know R provides one possible environment for generating these maps, and I thought I’d try to pull together a recipe or two for supporting the creation of thematic maps based on health related geographical regions.

A quick trawl for PCT shapefiles turned up nothing useful. @jenit suggested @mastodonc, and @paulbradshaw pointed me to a dataset on Google Fusion Tables, discovered through the Fusion Tables search engine, that included PCT geometry data. So no shapefiles, but there is exportable KML data from Fusion Tables.

At this point I should have followed Paul Bradshaw’s advice, and just uploaded my own data (I was going to start out with mapping per capita uptake of dental services by PCT) to Fusion Tables, merging with the other data set, and generating my thematic maps that way.

But that wasn’t quite the point, which was actually an exercise in pulling together an R based recipe for generating these maps…

Anyway, I’ve made a start, and here’s the code I have to date:

##Example KML: https://dl.dropbox.com/u/1156404/nhs_pct.kml
##Example data: https://dl.dropbox.com/u/1156404/nhs_dent_stat_pct.csv

install.packages("rgdal")
library(rgdal)
library(ggplot2)

#The KML data downloaded from Google Fusion Tables
fn='nhs_pct.kml'

#Look up the list of layers
ogrListLayers(fn)

#The KML file was originally grabbed from Google Fusion Tables
#There's only one layer...but we still need to identify it
kml=readOGR(fn,layer='Fusiontables folder')

#This seems to work for plotting boundaries:
plot(kml)

#And this:
kk=fortify(kml)
ggplot(kk, aes(x=long, y=lat,group=group))+ geom_polygon()

#Add some data into the mix
#I had to grab a specific sheet from the original spreadsheet and then tidy the data a little...
nhs <- read.csv("nhs_dent_stat_pct.csv")

kml@data=merge(kml@data,nhs,by.x='Name',by.y='PCT.ONS.CODE')

#I think I can plot against this data using plot()?
plot(kml,col=gray(kml@data$A.30.Sep.2012/100))
#But is that actually doing what I think it's doing?!
#And if so, how can experiment using other colour palettes?

#But the real question is: HOW DO I DO COLOUR PLOTS USING gggplot?
ggplot(kk, aes(x=long, y=lat,group=group)) #+ ????

Here’s what an example of the raw plot looks like:

[Image: raw plot() of the PCT boundaries]

And the greyscale plot, using one of the dental services uptake columns:

[Image: greyscale plot() of the PCT boundaries, shaded by a dental services uptake column]

Here’s the base ggplot() view:

[Image: base ggplot() rendering of the PCT boundaries]

However, I don’t know how to actually now plot the data into the different areas? (Oh – might this help? CRAN Task View: Analysis of Spatial Data.)
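
My best guess at the moment – a completely untested sketch, so treat with caution (kml2 and kk2 are just new names for re-read copies of the objects above, and the fill column is the one used in the plot() example) – is that the trick is to merge the area attribute data into the fortified data frame, via the polygon id and Name columns, rather than into the @data slot, and then map the measure onto the fill aesthetic:

#Untested sketch: re-read the KML so we have an unmerged copy of the attribute data
#(merging into kml@data above may reorder rows relative to the polygons)
kml2=readOGR(fn,layer='Fusiontables folder')
kml2@data$id=rownames(kml2@data)

#fortify() labels each polygon's points with an 'id' taken from the data slot rownames
kk2=fortify(kml2)
kk2=merge(kk2,kml2@data[,c('id','Name')],by='id')

#Merge in the measure of interest by PCT code, then restore the point plotting order
kk2=merge(kk2,nhs,by.x='Name',by.y='PCT.ONS.CODE')
kk2=kk2[order(kk2$order),]

ggplot(kk2, aes(x=long, y=lat, group=group, fill=A.30.Sep.2012)) +
  geom_polygon() + coord_equal() +
  scale_fill_gradient(low='white', high='darkblue')

If that works, other palettes should be easy enough to swap in via the various scale_fill_* functions.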

If you know how to do the colouring, or ggplotting, please leave a comment, or alternatively, chip in an answer to a related question I posted on StackOverflow: Plotting Thematic Maps from KML Data Using ggplot2

Thanks:-)

PS The recent Chief Medical Officer’s Report makes widespread use of a whole range of graphical devices and charts, including cartograms:

[Image: cartogram from the Chief Medical Officer’s Report]

Is there R support for cartograms yet, I wonder?! (Hmmm… maybe?)

PPS on the public facing national statistics front, I spotted this job ad yesterday – Head of Rich Content Development, ONS:

The postholder is responsible for inspiring and leading development of innovative rich content outputs for the ONS website and other channels, which anticipate and meet user needs and expectations, including those of the Citizen User. The role holder has an important part to play in helping ONS to realise its vision “for official statistics to achieve greater impact on key decisions affecting the UK and to encourage broader use across the country”.

Key Responsibilities:

1. Inspires, builds, leads and develops a multi-disciplinary team of designers, developers, data analysts and communications experts to produce innovative new outputs for the ONS website and other channels.
2. Keeps abreast of emerging trends and identifies new opportunities for the use of rich web content with ONS outputs.
3. Identifies new opportunities, proposes new directions and developments and gains buy in and commitment to these from Senior Executives and colleagues in other ONS business areas.
4. Works closely with business areas to identify, assess and commission new rich-content projects.
5. Provides vision, guidance and editorial approval for new projects based on a continual understanding of user needs and expectations.
6. Develops and manages an ongoing portfolio of innovative content, maximising impact and value for money.
7. Builds effective partnerships with media to increase outreach and engagement with ONS content.
8. Establishes best practice in creation of rich content for the web and other channels, and works to improve practice and capability throughout ONS.

Interesting…

Quick Shiny Demo – Exploring NHS Winter Sit Rep Data

Having spent a chunk of the weekend and a piece of yesterday trying to pull NHS Winter sitrep data into some sort of shape in Scraperwiki (described, in part, here: When Machine Readable Data Still Causes “Issues” – Wrangling Dates…), I couldn’t help myself last night and had a quick go at using RStudio’s Shiny tooling to put together a quick, minimal explorer for it:

For proof of concept, I just pulled in data relating to the Isle of Wight NHS Trust, but it should be possible to build a more generic explorer: Isle of Wight NHS Sit Rep Explorer Demo.

Three files are used to create the app – a script to define the user interface (ui.R), a script to define the server that responds to UI actions and displays the charts (server.R), and a supporting file that creates variables and functions that are globally available to both the server and UI scripts (global.R).

##wightsitrep2/global.R

#Loading in CSV directly from https seems to cause problems but this workaround seems okay
floader=function(fn){
  temporaryFile <- tempfile()
  download.file(fn,destfile=temporaryFile, method="curl")
  read.csv(temporaryFile)
}

#This is the data source - a scraperwiki API call
#It would make sense to abstract this further, eg allowing the creation of the URL based around a passed in a select statement
u="https://api.scraperwiki.com/api/1.0/datastore/sqlite?format=csv&name=nhs_sit_reps&query=select%20SHA%2CName%2C%20fromDateStr%2CtoDateStr%2C%20tableName%2CfacetB%2Cvalue%20from%20fulltable%20%20where%20Name%20like%20'%25WIGH%25'"

#Load the data and do a bit typecasting, just in case...
d=floader(u)
d$fdate=as.Date(d$fromDateStr)
d$tdate=as.Date(d$toDateStr)
d$val=as.integer(d$value)
##wightsitrep2/ui.R

library(shiny)

tList=levels(d$tableName)
names(tList) = tList

# Define UI for the sit rep explorer application
shinyUI(pageWithSidebar(
  
  
  # Application title
  headerPanel("IW NHS Trust Sit Rep Explorer"),
  
  sidebarPanel(
    #Just a single selector here - which table do you want to view?
    selectInput("tbl", "Report:",tList),
    
    div("This demo provides a crude graphical view over data extracted from",
        a(href='http://transparency.dh.gov.uk/2012/10/26/winter-pressures-daily-situation-reports-2012-13/',
          "NHS Winter pressures daily situation reports"),
        "relating to the Isle of Wight NHS Trust."),
    div("The data is pulled in from a scraped version of the data stored on Scraperwiki",
        a(href="https://scraperwiki.com/scrapers/nhs_sit_reps/","NHS Sit Reps"),".")
    
 ),
  
  #The main panel is where the "results" charts are plotted
  mainPanel(
    plotOutput("testPlot"),
    tableOutput("view")
    
  )
))
##wightsitrep2/server.R

library(shiny)
library(ggplot2)

# Define server logic
shinyServer(function(input, output) {
  
  #Do a simple barchart of data in the selected table.
  #Where there are "subtables", display these using the faceted view
  output$testPlot = reactivePlot(function() {
    g=ggplot(subset(d,fdate>as.Date('2012-11-01') & tableName==input$tbl))
    g=g+geom_bar(aes(x=fdate,y=val),stat='identity')+facet_wrap(~tableName+facetB)
    g=g+theme(axis.text.x=element_text(angle=-90),legend.position="none")+labs(title="Isle of Wight NHS Trust")
    #g=g+scale_y_discrete(breaks=0:10)
    print(g)
  })
  
  #It would probably make sense to reshape the data presented in this table
  #For example, define columns based on facetB values, so we have one row per date range
  #I also need to sort the table by date
  output$view = reactiveTable(function() {
    head(subset(d,tableName==input$tbl,select=c('Name','fromDateStr','toDateStr','tableName','facetB','value')),n=100)
  })
  
})

I get the feeling that it shouldn’t be too hard to create quite complex Shiny apps relatively quickly, pulling on things like Scraperwiki as a remote data source. One thing I haven’t tried is to use googleVis components, which would support in the first instance at least a sortable table view… Hmmm…

PS for an extended version of this app, see NHS Winter Situation Reports Shiny Viewer v2

When Machine Readable Data Still Causes “Issues” – Wrangling Dates…

With changes to the FOI Act brought about by the Protection of Freedoms Act, FOI will allow requests to be made for data in a machine readable form. In this post, I’ll give an example of a dataset that is, arguably, released in a machine readable way – as an Excel spreadsheet – but that still requires quite a bit of work to become useful as data; because presumably the intent behind the aforementioned amendment to the FOI Act is to make data releases useful and useable as data? As a secondary result, through trying to make the data useful as data, I realise I have no idea what some of the numbers that are reported in the context of a date range actually relate to… Which makes those data columns misleading at best, useless at worst… And as to the February data in a release allegedly relating to a weekly release from November…? Sigh…

[Note – I’m not meaning to be critical in the sense of “this data is too broken to be useful so don’t publish it”. My aim in documenting this is to show some of the difficulties involved with actually working with open data sets and at least flag up some of the things that might need addressing so that the process can be improved and more “accessible” open data releases published in the future. ]

So what, and where is, the data…? Via my Twitter feed over the weekend, I saw an exchange between @paulbradshaw and @carlplant relating to a scraper built around the NHS Winter pressures daily situation reports 2012 – 13. This seems like a handy dataset for anyone wanting to report on weekly trends, spot hospitals that appear to be under stress, and so on, so I had a look at the scraper, took issue with it ;-) and spawned my own…

The data looks like it’ll be released as a set of weekly Excel spreadsheets, with a separate sheet for each data report.

All well and good… almost…

If we load the data into something like Scraperwiki, we find that some of the dates are actually represented as such; that is, rather than character strings (such as the literal “9-Nov-2012”), they are represented as date types (in this case, the number of days since a baseline starting date). A quick check on StackOverflow turned up the following recipe for handling just such a thing and returning a date element that Python (my language of choice on Scraperwiki) recognises as such:

#http://stackoverflow.com/a/1112664/454773
import datetime

def minimalist_xldate_as_datetime(xldate, datemode):
    # datemode: 0 for 1900-based, 1 for 1904-based
    return (
        datetime.datetime(1899, 12, 30)
        + datetime.timedelta(days=xldate + 1462 * datemode)
        )

The next thing we notice is that some of the date column headings actually specify date ranges, and do so in a variety of styles across the different sheets. For example:

  • 16 – 18/11/2012
  • 16 Nov 12 to 18-NOV-2012
  • 16 to 18-Nov-12

In addition, we see that some of the sheets split the data into what we might term further “subtables” as you should notice if you compare the following sheet with the previous one shown above:

Notwithstanding that the “shape” of the data table is far from ideal when it comes to aggregating data from several weeks in the same database (as I’ll describe in another post), we are faced with a problem here: if we want to look at the data by date range in a mechanical, programmable way, we need to cast these differently represented dates into the same form, ideally a date structure that Python or the Scraperwiki SQLite database can recognise as such.

[For a library that can automatically reshape this sort of hierarchical tabular data arrangement in R, see Automatic Conversion of Tables to LongForm Dataframes]

The approach I took was as follows (it could be interesting to try to replicate this approach using OpenRefine?). Firstly, I took the decision to map dates onto “fromDates” and “toDates”. ***BEWARE – I DON’T KNOW IF THIS IS THE CORRECT THING TO DO*** Where there is a single specified date in a column heading, the fromDate and toDate are set to one and the same value. In cases where the date value was specified as an Excel represented date (the typical case), the code snippet above casts it to a Pythonic date value, which I can then print out as required (I opted to display dates in the YYYY-MM-DD format) using a construction along the lines of:

dateString=minimalist_xldate_as_datetime(cellValue,book.datemode).date().strftime("%Y-%m-%d")

In this case, cellValue is the value of a header cell that is represented as an Excel time element, book is the workbook, as parsed using the xlrd library:

import xlrd
xlbin = scraperwiki.scrape(spreadsheetURL)
book = xlrd.open_workbook(file_contents=xlbin)

and book.datemode is a library call that looks up how dates are being represented in the spreadsheet. If the conversion fails, we default to setting dateString to the original value:
dateString=cellValue

The next step was to look at the date range cells, and cast any “literal” date strings into a recognised date format. (I’ve just realised I should have optimised the way this is called in the Scraperwiki code – I am doing so many unnecessary lookups at the moment!) In the following snippet, I try to split a date range heading into its from and to parts, and then normalise each part into a recognised date:

import time
from time import mktime
import datetime

def dateNormalise(d):
    #This is a bit of a hack - each time we find new date formats for the cols, we'll need to extend this
    #The idea is to try to identify the date pattern used, and parse the string accordingly
    for trials in ["%d %b %y",'%d-%b-%y','%d-%b-%Y','%d/%m/%Y','%d/%m/%y']:
        try:
            dtf =datetime.datetime.fromtimestamp(mktime(time.strptime(d, trials)))
            break
        except: dtf=d
    if type(dtf) is datetime.datetime:
        dtf=dtf.strftime("%Y-%m-%d")
    return dtf

def patchDate(f,t):
    #Grab the month and year elements from the todate, and add in the from day of month number
    tt=t.split('-')
    fromdate='-'.join( [ str(tt[0]),str(tt[1]),str(f) ])
    return fromdate

def dateRangeParse(daterange):
    #In this first part, we simply try to identify from and to portions
    dd=daterange.split(' to ')
    if len(dd)<2:
        #That is, split on 'to' doesn't work
        dd2=daterange.split(' - ')
        if len(dd2)<2:
            #Doesn't split on '-' either; set from and todates to the string, just in case.
            fromdate=daterange
            todate=daterange
        else:
            fromdate=dd2[0]
            todate=dd2[1]
    else:
        fromdate=dd[0]
        todate=dd[1]
    #By inspection, the todate looks like it's always a complete date, so try to parse it as such 
    todate=dateNormalise(todate)
    #I think we'll require another fudge here, eg if date is given as '6 to 8 Nov 2012' we'll need to finesse '6' to '6 Nov 2012' so we can make a date from it
    fromdate=dateNormalise(fromdate)
    if len(fromdate)<3:
        fromdate=patchDate(fromdate,todate)
    return (fromdate,todate)

#USAGE:
(fromdate,todate)=dateRangeParse(dateString)

One thing this example shows, I think, is that even though the data is being published as a dataset, albeit in an Excel spreadsheet, we need to do some work to make it properly useable.

[Image: xkcd cartoon on ISO 8601 date formats]

The sheets look as if they are an aggregate of data produced by different sections, or different people: that is, they use inconsistent ways of representing date ranges.

When it comes to using the date, we will need to take care in how we represent or report on figures collected over a date range (presumably a weekend? I haven’t checked), compared to daily totals. Indeed, as the PS below shows, I’m now starting to doubt what the number in the date range column represents. Is it: a) the sum total of values for days in that range; b) the average daily rate over that period; c) the value on the first or last date of that period? [This was written under the assumption it was summed daily values over the period, which the PS below suggests is NOT the case, in one sheet at least?] One approach might be to generate “as-if daily” returns simply by dividing ranged date totals by the number of days in the range. A more “truthful” approach may be to plot summed counts over time (date on the x-axis, sum of values to date on the y-axis), with the increment for the date-ranged values that is being added in to the summed value taking the “toDate” date as its x/date value?
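
For what it’s worth, the “as-if daily” fudge is trivial to sketch. In R, using the column names from the Shiny explorer described earlier (and setting aside the question of whether dividing is even the right thing to do), it might look something like this:

#Spread a date-ranged value evenly across the days it covers
#(assumes fromDateStr/toDateStr/value columns, as used in the Shiny demo above)
asIfDaily=function(df){
  df$days=as.numeric(as.Date(df$toDateStr)-as.Date(df$fromDateStr))+1
  df$dailyValue=as.integer(df$value)/df$days
  df
}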

When I get a chance, I’ll do a couple more posts around this dataset:
– one looking at datashaping in general, along with an example of how I shaped the data in this particular case
– one looking at different queries we can run over the shaped data.

PS Another problem… on the NHS site, we see that there appear to be weekly spreadsheet releases and an aggregated release:

Because I didn’t check the stub of scraper code used to pull off the spreadsheet URLs from the NHS site, I accidentally scraped both the weekly and the aggregated sheets. I’m using a unique key based on a hash that includes the toDate as part of the hashed value, in an attempt to keep dupes out of the data from just this sort of mistake, but looking at a query over the scraped data I spotted this:

If we look at the weekly sheet we see this:

That is, a column for November 15th, and then one for November 18th, but nothing to cover November 16 or 17?

Looking at a different sheet – Adult Critical Care – we get variation at the other end of the range:

If we look into the aggregated sheet, we get:

Which is to say – the weekly report displayed a single date as a column heading where the aggregated sheet gives a date range, although the same cell values are reported in this particular example. So now I realise I have no idea what the cell values in the date range columns represent. Is it: a) the sum total of values for days in that range; b) the average daily rate over that period; c) the value on the first or last date of that period?

And here’s another query:

February data??? I thought we were looking at November data?

Hmmm…

PPS If you’re looking for learning outcomes from this post, here are a few, starting with three ways in which we need to wrangle sense out of dates:

  1. representing Excel dates or strings-that-look-like-dates as dates in some sort of datetime representation (which is the most useful sort of representation, even if we end up casting dates back into string form);
  2. parsing date ranges into pairs of date represented elements (from and to dates);
  3. where a dataset/spreadsheet contains heterogenous single date and date range columns, how do we interpret the numbers that appear in the date range column?
  4. shoving the data into a database and running queries on it can sometimes flag up possible errors or inconsistencies in the dataset that might otherwise be hard to spot (eg if you had to manually inspect lots of different sheets in lots of different spreadsheets…)

Hmmm….

PPPS Another week, another not-quite-right feature:

[Image: spreadsheet screenshot showing another date mixup]

PPPPS An update on what the numbers actually mean, from an email exchange (does that make me more a journalist than a blogger?!;-) with the contact address contained within the spreadsheets: “On the columns, where we have a weekend, all items apart from beds figures are summed across the weekend (eg number of diverts in place over the weekend, number of cancelled ops). Beds figures (including beds closed to norovirus) are snapshots at the collection time (i.e 8am on the Monday morning).”

PPPPPS Another week, and this time three new ways of writing the date range over the weekend: 14-16-Dec-12, 14-16-Dec 12, 14-16 Dec 12. Anyone would think they were trying to break my scraper;-)

This Week in Privacy, Transparency and Open Public Data (fragment/links)

Via my feeds this week, a handful of consultations and codes relating to open data, particularly in a local government context, have been doing the rounds.

Also of note this week, the ICO published its Anonymisation: managing data protection risk code of practice [PDF] (here’s the press release). ENISA, the European Network and Information Security Agency, have also just published the latest in a series of reports on privacy: Privacy considerations of online behavioural tracking. My colleague Ray Corrigan has done a quick review here.

Although it’s hard to know who has influence where, to the extent that the UK’s Open Government Partnership National Action Plan suggests a general roadmap for open government activity, this is maybe worth noting: Involve workshops: Developing the UK’s next open government National Action Plan

For a recent review of the open data policy context, see InnovateUK’s Open Data and its role in powering economic growth.

(I’ll update this post with a bit more commentary over the next few days. For now, I thought I’d share the links in case any readers out there fancy a bit of weekend reading…;-)

PS though not in the news this week, here are a couple of links to standing and appealed case law around database IPR:
– background – OKF review of data/database rights and Out-law review of database rights
– (ongoing) Court of Justice of European Union Appeal – Football Dataco and others: are football event listings protected? (context and commentary on review.)
– case law: The British Horseracing Board Ltd and Others v William Hill (horse-racing information) (commentary by out-law).

#online12 Reflections – Can Open Public Data Be Disruptive to Information Vendors?

Whilst preparing for my typically overloaded #online12 presentation, I thought I should make at least a passing attempt at contextualising it for the corporate attendees. The framing idea I opted for, but all too briefly reviewed, was whether open public data might be disruptive to the information industry, particularly purveyors of information services in vertical markets.

If you’ve ever read Clayton Christensen’s The Innovator’s Dilemma, you’ll be familiar with the idea behind disruptive innovations: incumbents allow start-ups with cheaper ways of tackling the less profitable, low-quality end of the market to take that part of the market; the start-ups improve their offerings, take market share, and the incumbent withdraws to the more profitable top-end. Learn more about this on OpenLearn: Sustaining and disruptive innovation or listen again to the BBC In Business episode on The Innovator’s Dilemma, from which the following clip is taken.

In the information industry, the following question then arises: will the availability of free, open public data be adopted at the low, or non-consuming, end of the market, for example by micro- and small companies who haven’t necessarily been able to buy in to expensive information or data services, either on financial grounds or through lack of perceived benefits? Will the appearance of new aggregation services, often built around screenscrapers and/or public open data sources, start to provide useful and useable alternatives at the low end of the market, in part because of their (current) lack of comprehensiveness or quality? And if such services are used, will they then start to improve in quality, comprehensiveness and service offerings, and in so doing start a ratcheting climb to quality that will threaten the incumbents?

Here are a couple of quick examples, based around some doodles I tried out today using data from OpenCorporates and OpenlyLocal. The original sketch (demo1() in the code here) was a simple scraper on Scraperwiki that accepted a person’s name, looked them up via a director search using the new 0.2 version of the OpenCorporates API, pulled back the companies they were associated with, and then looked up the other directors associated with those companies. For example, searching around Nigel Richard Shadbolt, we get this:

One of the problems with the data I got back is that there are duplicate entries for company officers; as Chris Taggart explained, “[data for] UK officers [comes] from two Companies House sources — data dump and API”. Another problem is that officers’ records don’t necessarily have start/end dates associated with them, so it may be the case that directors’ terms of office don’t actually overlap within a particular company. In my own scraper, I don’t check to see whether an officer is marked as “director”, “secretary”, etc, nor do I check to see whether the company is still a going concern or whether it has been dissolved. Some of these issues could be addressed right now, some may need working on. But in general, the data quality – and the way I work with it – should only improve from this quick’n’dirty minimum viable hack. As it is, I now have a tool that at a push will give me a quick snapshot of some of the possible director relationships surrounding a named individual.

The second sketch (demo2() in the code here) grabbed a list of elected council members for the Isle of Wight Council from another of Chris’ properties, OpenlyLocal, extracted the councillors names, and then looked up directorships held by people with exactly the same name using a two stage exact string match search. Here’s the result:

As with many data results, this is probably most meaningful to people who know the councillors – and companies – involved. The results may also surprise people who know the parties involved if they start to look up the companies that aren’t immediately recognisable: surely X isn’t a director of Y? Here we have another problem – one of identity. The director look-up I use is based on an exact string match: the query to OpenCorporates returns directors with similar names, which I then filter to leave only directors with exactly the same name (I turn the strings to lower case so that case errors don’t cause a string mismatch). (I also filter the companies returned to be solely ones with a gb jurisdiction.) In doing the lookup, we therefore have the possibility of false positive matches (X is returned as a director, but it’s not the X we mean, even though they have exactly the same name); and false negative lookups (eg where we look up a made-up director John Alex Smith who is actually recorded in one or more filings as (the again made-up) John Alexander Smith).

That said, we do have a minimum viable research tool here that gives us a starting point for doing a very quick (though admittedly heavily caveated) search around companies that a councillor may be (or may have been – I’m not checking dates, remember) associated with.

We also have a tool around which we can start to develop a germ of an idea around conflict of interest detection.

The Isle of Wight Armchair Auditor, maintained by hyperlocal blog @onthewight (and based on an original idea by @adrianshort) hosts local spending information relating to payments made by the Isle of Wight Council. If we look at the payments made to a company, we see the spending is associated with a particular service area.

If you’re a graph thinker, as I am;-), the following might then suggest itself to you:

  1. From OpenlyLocal, we can get a list of councillors and the committees they are on;
  2. from OnTheWight’s Armchair Auditor, we can get a list of companies the council has spent money with;
  3. from OpenCorporates, we can get a list of the companies that councillors may be directors of;
  4. from OpenCorporates, we should be able to get identifiers for at least some of the companies that the council has spent money with;
  5. putting those together, we should be able to see whether or not a councillor may be a director of a company that the council is spending with and how much is being spent with them in which spending areas;
  6. we can possibly go further, if we can associate council committees with spending areas – are there councillors who are members of a committee that is responsible for a particular spending area who are also directors of companies that the council has spent money with in those spending areas? Now there’s nothing wrong with people who have expertise in a particular area sitting on a related committee (it’s probably a Good Thing). And it may be that they got their experience by working as a director for a company in that area. Which again, could be a Good Thing. But it begs a transparency question that a journalist might well be interested in asking. And in this case, with open data to hand, might technology be able to help out? For example, could we automatically generate a watch list to check whether or not councillors who are directors of companies that have received monies in particular spending areas (or more generally) have declared an interest, as would be appropriate? I think so…(caveated of course by the fact that there may be false positives and false negatives in the report…; but it would be a low effort starting point).

Once you get into this graph based thinking, you can take it much further of course, for example looking to see whether councillors in one council are directors of companies that deal heavily with neighbouring councils… and so on… (Paranoid? Me? Nah… Just trying to show how graphs work and how easy it can be to start joining dots once you start to get hold of the data…;-)

Anyway – this is all getting off the point and too conspiracy based…! So back to the point, which was along the lines of this: here we have the fumblings of a tool for mixing and matching data from two aggregators of public information, OpenlyLocal and OpenCorporates, that might allow us to start running crude conflict of interest checks. (It’s easy enough to see how we could run the same idea using lists of MP names from the TheyWorkForYou API, or by looking up directorships previously held by Ministers and the names of companies of the lobbyists they meet (does WhosLobbying have an API for such things?). And so on…)
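
To make the shape of that check a little more concrete, here’s a rough sketch in R of the cross-matching step. The two getter functions (and the example arguments in the usage line) are purely hypothetical placeholders for whatever calls you make to OpenlyLocal, OpenCorporates and the spending data; the matching itself is just the same crude lower-cased exact string match described above, with all the false positive/negative caveats that implies:

#Hypothetical helpers (not real API wrappers):
# getCouncillors(councilID) - returns a data frame with a 'name' column
# getDirectorships(fullName) - returns a data frame of companies (with a 'company_name'
#                              column) for officers whose name exactly matches fullName

checkCouncillorInterests=function(councillors,spendingCompanies){
  hits=NULL
  for (n in councillors$name){
    directorships=getDirectorships(n)
    #Crude lower-cased exact match against companies the council has spent money with
    matched=directorships[tolower(directorships$company_name) %in%
                            tolower(spendingCompanies$company_name),]
    if (nrow(matched)>0)
      hits=rbind(hits,data.frame(councillor=n,company=matched$company_name))
  }
  hits
}

#eg hits=checkCouncillorInterests(getCouncillors('isle-of-wight'), armchairAuditorPayees)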

Now I imagine there are commercial services around that do this sort of thing properly and comprehensively, and for a fee. But it only took me a couple of hours, for free, to get started, and having started, the paths to improvement become self-evident… and some of them can be achieved quite quickly (it just takes a little (?!) bit of time…) So I wonder – could the information industry be at risk of disruption from open public data?

PS if you’re into conspiracies, Cambridge’s Centre for Research in the Arts, Social Sciences and Humanities (CRASSH) has a post-doc position open with Professor John Naughton on The impact of global networking on the nature, dissemination and impact of conspiracy theories. The position is complemented by several parallel fellowships, including ones on Rational Choice and Democratic Conspiracies and Ideals of Transparency and Suspicion of Democracy.

“Drug Deal” Network Analysis with Gephi (Tutorial)

Via a trackback from Check Yo Self: 5 Things You Should Know About Data Science (Author Note) criticising tweet-mapping without further analysis (“If you’re making Gephi graphs out of tweets, you’re probably doing more data science marketing than data science analytics. And stop it. Please. I can’t take any more. … what does it gain a man to have graphs of tweets and do jack for analysis with them?”), I came across John Foreman’s Analytics Made Skeezy [uncourse] blog:

Analytics Made Skeezy is a fake blog. Each post is part of a larger narrative designed to teach a variety of analytics topics while keeping it interesting. Using a single narrative allows me to contrast various approaches within the same fake world. And ultimately that’s what this blog is about: teaching the reader when to use certain analytic tools.

Skimming through the examples described in some of the posts to date, Even Wholesale Drug Dealers Can Use a Little Retargeting: Graphing, Clustering & Community Detection in Excel and Gephi not surprisingly caught my attention. That post describes, in narrative form, how to use Excel to prepare and shape a dataset so that it can be imported into Gephi as a faux CSV file and then run through Gephi’s modularity statistic; the modularity class augmented dataset can then be exported from the Gephi Data Lab and re-presented in Excel, whereupon column sorting and conditional formatting are used judiciously to try to generate some sort of insight about the clusters/groups discovered in the data – apparently, “Gephi can kinda suck for giving us that kind of insight sometimes. Depends on the graph and what you’re trying to do”. And furthermore:

If you had a big dataset that you prepped into a trimmed nearest neighbors graph, keep in mind that visualizing it in Gephi is just for fun. It’s not necessary for actual insight regardless of what the scads of presentations of tweets-spreading-as-visualized-in-Gephi might tell you (gag me). You just need to do the community detection piece. You can use Gephi for that or the libraries it uses. R and python both have a package called igraph that does this stuff too. Whatever you use, you just need to get community assignments out of your large dataset so that you can run things like the aggregate analysis over them to bubble up intelligence about each group.

I don’t necessarily disagree with the implication that we often need to do more than just look at pretty pictures in Gephi to make sense of a dataset; but I do also believe that we can use Gephi in an active way to have a conversation with the data, generating some preliminary insights about the dataset that we can then explore further using other analytical techniques. So what I’ll try to do in the rest of this post is offer some suggestions about one or two ways in which we might use Gephi to start conversing with the same dataset described in the Drug Dealer Retargeting post. Before I do so, however, I suggest you read through the original post and try to come to some of your own conclusions about what the data might be telling us…

Done that? To recap, the original dataset (“Inventory”) is a list of “deals”, with columns relating to two sorts of thing: 1) the attributes of a deal; 2) one column per dealer showing whether they took up that deal. A customer/customer matrix is then generated and the cosine similarity between each pair of customers calculated (note: other distance metrics are available…), showing the extent to which they participated in similar deals. Selecting the three most similar neighbours of each customer creates a “trimmed nearest neighbors graph”, which is munged into a CSV-resembling data format that Gephi can import. Gephi is then used to do a very quick/cursory (and discounted) visual analysis, and run the modularity/clustering detection algorithm.
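
As an aside, the similarity and neighbour-trimming steps are compact enough to sketch in R. The following is untested and assumes a 0/1 matrix called deals, with one row per deal and one named column per customer (the original post does all of this in Excel; this is just to illustrate the idea):

#Cosine similarity between the columns (customers) of a 0/1 deals matrix
cosineSim=function(m){
  cross=t(m) %*% m
  norms=sqrt(diag(cross))
  #Customers who took part in no deals at all would give NaNs here and need handling
  cross/(norms %o% norms)
}

sim=cosineSim(deals)
diag(sim)=0 #ignore self-similarity

#Keep each customer's three most similar neighbours to get the "trimmed" graph
edges=do.call(rbind, lapply(colnames(sim), function(cust){
  top3=names(sort(sim[cust,],decreasing=TRUE))[1:3]
  data.frame(from=cust,to=top3,weight=sim[cust,top3])
}))

From there, the edge list could be handed to igraph (eg via graph.data.frame()) and run through one of its community detection functions – essentially the route the Analytics Made Skeezy quote above suggests.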

So how would I have attacked this dataset? (Note: IANADS – I am not a data scientist;-)

One way would be to treat it from the start as defining a graph in which dealers are connected to trades. Using a slightly tidied version of the ‘Inventory’ tab from the original dataset, in which I removed the first (metadata) and last (totals) rows and tweaked one of the column names to remove the brackets (I don’t think Gephi likes brackets in attribute names?), I used the following script to generate a GraphML formatted version of just such a graph.

#Python script to generate GraphML file
import csv
#We're going to use the really handy networkx graph library: easy_install networkx
import networkx as nx
import urllib

#Create a directed graph object
DG=nx.DiGraph()

#Open data file in universal newline mode
reader=csv.DictReader(open("inventory.csv","rU"))

#Define a variable to act as a deal node ID counter
dcid=0

#The graph is a bimodal/bipartite graph containing two sorts of node - deals and customers
#An identifier is minted for each row, identifying the deal
#Deal attributes are used to annotate deal nodes
#Identify columns used to annotate nodes taking string values
nodeColsStr=['Offer date', 'Product', 'Origin', 'Ready for use']
#Identify columns used to annotate nodes taking numeric values
nodeColsInt=['Minimum Qty kg', 'Discount']

#The customers are treated as nodes in their own right, rather than as deal attributes
#Identify columns used to identify customers - each of these will define a customer node
customerCols=['Smith', 'Johnson', 'Williams', 'Brown', 'Jones', 'Miller', 'Davis', 'Garcia', 'Rodriguez', 'Wilson', 'Martinez', 'Anderson', 'Taylor', 'Thomas', 'Hernandez', 'Moore', 'Martin', 'Jackson', 'Thompson', 'White' ,'Lopez', 'Lee', 'Gonzalez','Harris', 'Clark', 'Lewis', 'Robinson', 'Walker', 'Perez', 'Hall', 'Young', 'Allen', 'Sanchez', 'Wright', 'King', 'Scott','Green','Baker', 'Adams', 'Nelson','Hill', 'Ramirez', 'Campbell', 'Mitchell', 'Roberts', 'Carter', 'Phillips', 'Evans', 'Turner', 'Torres', 'Parker', 'Collins', 'Edwards', 'Stewart', 'Flores', 'Morris', 'Nguyen', 'Murphy', 'Rivera', 'Cook', 'Rogers', 'Morgan', 'Peterson', 'Cooper', 'Reed', 'Bailey', 'Bell', 'Gomez', 'Kelly', 'Howard', 'Ward', 'Cox', 'Diaz', 'Richardson', 'Wood', 'Watson', 'Brooks', 'Bennett', 'Gray', 'James', 'Reyes', 'Cruz', 'Hughes', 'Price', 'Myers', 'Long', 'Foster', 'Sanders', 'Ross', 'Morales', 'Powell', 'Sullivan', 'Russell', 'Ortiz', 'Jenkins', 'Gutierrez', 'Perry', 'Butler', 'Barnes', 'Fisher']

#Create a node for each customer, and classify it as a 'customer' node type
for customer in customerCols:
	DG.add_node(customer,typ="customer")

#Each row defines a deal
for row in reader:
	#Mint an ID for the deal
	dealID='deal'+str(dcid)
	#Add a node for the deal, and classify it as a 'deal' node type
	DG.add_node(dealID,typ='deal')
	#Annotate the deal node with string based deal attributes
	for deal in nodeColsStr:
		DG.node[dealID][deal]=row[deal]
	#Annotate the deal node with numeric based deal attributes
	for deal in nodeColsInt:
		DG.node[dealID][deal]=int(row[deal])
	#If the cell in a customer column is set to 1,
	## draw an edge between that customer and the corresponding deal
	for customer in customerCols:
		if str(row[customer])=='1':
			DG.add_edge(dealID,customer)
	#Increment the node ID counter
	dcid=dcid+1

#write graph
nx.write_graphml(DG,"inventory.graphml")

The graph we’re generating (download .graphml) has a basic structure that looks something like the following:

Which is to say, in this example customer C1 engaged in a single deal, D1; customer C2 participated in every deal, D1, D2 and D3; and customer C3 partook of deals D2 and D3.
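
If it helps to see that toy structure as code rather than as a picture, here’s a minimal networkx sketch of it (the D/C node names are just illustrative):

#Toy version of the deal-customer structure described above
import networkx as nx

toy = nx.DiGraph()
for deal in ['D1', 'D2', 'D3']:
    toy.add_node(deal, typ='deal')
for customer in ['C1', 'C2', 'C3']:
    toy.add_node(customer, typ='customer')
#C1 took part in D1 only; C2 in all three deals; C3 in D2 and D3
toy.add_edges_from([('D1','C1'), ('D1','C2'), ('D2','C2'), ('D3','C2'), ('D2','C3'), ('D3','C3')])
print(toy.edges())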

Opening the graph file in Gephi as a directed graph, we get a count of the number of actual trades there were from the edge count:

If we run the Average degree statistic, we can see that there are some nodes that are not connected to any other nodes (that is, they are either deals with no takers, or customers who never took part in a deal):

We can view these nodes using a filter:

We can also use the filter the other way, to exclude the unaccepted deals, and then create a new workspace containing just the deals that were taken up, and the customers that bought into them:
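
As an aside, if you’d rather do that pruning in code before the graph ever gets to Gephi, networkx can do something similar; a sketch, continuing from the DG graph built by the script above (the output filename is just illustrative):

#Drop nodes with no edges (untaken deals and inactive customers) before export
import networkx as nx

isolated = list(nx.isolates(DG))
DG.remove_nodes_from(isolated)
nx.write_graphml(DG, "inventory_active.graphml")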

The workspace selector is at the bottom of the window, on the right hand side:

(Hmmm… for some reason, the filtered graph wasn’t exported for me… the whole graph was. Bug? Fiddling with the Giant Component filter, then exporting, then running the Giant Component filter on the exported graph and cancelling it seemed to fix things… but something is not working right?)

We can now start to try out some interactive visual analysis. Firstly, let’s lay out the nodes using a force-directed layout algorithm (ForceAtlas2) that tries to position nodes so that nodes that are connected are positioned close to each other, and nodes that aren’t connected are kept apart (imagine each node as trying to repel the other nodes, with edges trying to pull them together).
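
ForceAtlas2 itself is a Gephi layout, but if you want a rough feel for what a force directed layout is doing outside Gephi, networkx’s spring_layout works in a similar spirit; a quick sketch (assumes matplotlib is installed):

#Quick and dirty force directed layout of the deal-customer graph
import networkx as nx
import matplotlib.pyplot as plt

pos = nx.spring_layout(DG)  #Fruchterman-Reingold force directed placement
nx.draw_networkx(DG, pos, node_size=30, with_labels=False)
plt.show()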

Our visual perception is great at identifying spatial groupings (see, for example, the Gestalt principles, which lead to many a design trick and a bucketful of clues about how to tease data apart in a visually meaningful way…), but are the groupings really meaningful?

At this point in the conversation we’re having with the data, I’d probably call on a statistic that tries to place connected groups of nodes into separate groups so that I could colour the nodes according to their group membership: the modularity statistic:

The modularity statistic uses a randomised algorithm, so you may get different (though broadly similar) results each time you run it. In this case, it discovered six possible groupings or clusters of interconnected nodes (often, one group is a miscellany…). We can see which group each node was placed in by applying a Partition colouring:
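
Incidentally, the same sort of community detection can be run outside Gephi too; here’s a sketch using the python-louvain package (imported as community), which expects an undirected graph and, like Gephi’s modularity statistic, may give slightly different partitions on different runs:

#Louvain community detection on an undirected copy of the graph
import community  #the python-louvain package

UG = DG.to_undirected()
partition = community.best_partition(UG)
#partition maps each node name to a community number
print(partition.get('Smith'))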

We see how the modularity groupings broadly map on to the visual clusters revealed by the ForceAtlas2 layout algorithm. But do the clusters relate to anything meaningful? What happens if we turn the labels on?

The green group appears to relate to Weed transactions, the reds to X, Meth and Ketamine deals, and the yellows to the coke heads. So the deals do appear to cluster around different types of deal.

So what else might we be able to learn? Does the Ready for Use dimension on a deal separate out at all (null nodes on this dimension relate to customers)?

We’d need to know a little bit more about what the implications of “Ready for Use” might be, but at a glance we get a feeling that the cluster on the far left is dominated by trades with large numbers of customers (there are lots of white/customer nodes), and the Coke related cluster on the right has quite a few trades (the green nodes) that aren’t ready for use. (A question that comes to mind looking at that area is: are there any customers who seem to just go for not Ready for Use trades, and what might this tell us about them if so?)

Something else we might look to is the size of the trades, and any associated discounts. Let’s colour the nodes using the Partition tool according to node type (attribute name is “typ” – nodes are deals (red) or customers (aqua)) and then size the nodes according to deal size using the Ranking display:

Small fry deals in the left hand group. Looking again at the Coke grouping, where there is a mix of small and large deals, another question we might file away is “are there customers who opt either for large or small trades?”

Let’s go back to the original colouring (via the Modularity coloured Partition; note that the random assignment of colours might change from the original colour set; right click allows you to re-randomise colours; clicking on a colour square allows you to select a colour by hand) and size the nodes by OutDegree (that is, the sum total of edges outgoing from a node – remember, the graph was described as a directed graph, with edges going from deals to customers):

I have then sized the labels so that they are proportional to node size:

The node/label sizing shows which deals had plenty of takers. Sizing by InDegree shows how many deals each customer took part in:
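
If you want the numbers rather than the picture, the same counts are easy to pull straight out of the networkx graph; a sketch, again continuing from the DG graph above:

#Edges run from deals to customers, so:
## out-degree of a deal node = number of customers who took that deal
## in-degree of a customer node = number of deals that customer took part in
for node, attrs in DG.nodes(data=True):
    if attrs['typ'] == 'customer':
        print(node, DG.in_degree(node))
    else:
        print(node, DG.out_degree(node))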

This is quite a cluttered view… returning to the Layout panel, we can use the Expansion layout to stretch out the whole layout, as well as the Label Adjust tool to jiggle nodes so that the labels don’t overlap. Note that you can also click on a node to drag it around, or a group of nodes by increasing the “field of view” of the mouse cursor:

Here’s how I tweaked the layout by expanding the layout then adjusting the labels…:

(One of the things we might be tempted to do is filter out the users who only engaged in one or two deals, perhaps as a way of identifying regular customers; of course, a user may only engage in a single, but very large, deal, so we’d need to think carefully about what question we were actually asking when making such a choice. For example, we might also be interested in looking for customers engaging in infrequent large trades, which would require a different analysis strategy.)

Insofar as it goes, this isn’t really very interesting – what might be more compelling would be data relating to who was dealing with whom, but that isn’t immediately available. What we should be able to do, though, is see which customers are related by virtue of partaking of the same deals, and see which deals are related by virtue of being dealt to the same customers. We can maybe kid ourselves into thinking we can see this in the customer-deal graph, but we can be a little bit more rigorous by constructing two new graphs: one that shows edges between deals that share one or more common customers; and one that shows edges between customers who shared one or more of the same deals.

Recalling the “bimodal”/bipartite graph above:

that means we should be able to generate unimodal graphs along the following lines:

D1 is connected to D2 and D3 through customer C2 (that is, an edge exists between D1 and D2, and another edge between D1 and D3). D2 and D3 are joined together through two routes, C2 and C3. We might thus weight the edge between D2 and D3 as being heavier, or more significant, than the edge between either D1 and D2, or D1 and D3.

And for the customers?

C1 is connected to C2 through deal D1. C2 and C3 are connected by a heavier weighted edge reflecting the fact that they both took part in deals D2 and D3.

You will hopefully be able to imagine how more complex customer-deal graphs might collapse into customer-customer or deal-deal graphs where there are multiple, disconnected (or only very weakly connected) groups of customers (or deals) based on the fact that there are sets of deals that do not share any common customers at all, for example. (As an exercise, try coming up with some customer-deal graphs and then “collapsing” them to customer-customer or deal-deal graphs that have disconnected components.)
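
For what it’s worth, networkx can generate these projected graphs too, via its bipartite functions; here’s a sketch, assuming the pruned deal-customer DG graph from above (the output filenames are just illustrative):

#Project the deal-customer graph onto deal-deal and customer-customer graphs;
## edge weights count shared customers and shared deals respectively
import networkx as nx
from networkx.algorithms import bipartite

deals = [n for n, d in DG.nodes(data=True) if d['typ'] == 'deal']
customers = [n for n, d in DG.nodes(data=True) if d['typ'] == 'customer']

#The projection functions expect an undirected bipartite graph
UG = DG.to_undirected()
dealGraph = bipartite.weighted_projected_graph(UG, deals)
customerGraph = bipartite.weighted_projected_graph(UG, customers)

nx.write_graphml(dealGraph, "deal-deal.graphml")
nx.write_graphml(customerGraph, "customer-customer.graphml")

The shared customer/deal counts end up in a weight attribute on each projected edge, which Gephi should pick up as the edge weight when the GraphML file is imported.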

So can we generate graphs of this sort using Gephi? Well, it just so happens we can, using the Multimode Networks Projection tool. To start with let’s generate another couple of workspaces containing the original graph, minus the deals that had no customers. Selecting one of these workspaces, we can now generate the deal-deal (via common customer) graph:

When we run the projection, the graph is mapped onto a deal-deal graph:

The thickness of the edges describes the number of customers any two deals shared.

If we run the modularity statistic over the deal-deal graph and colour the graph by the modularity partition, we can see how the deals are grouped by virtue of having shared customers:

If we then filter the graph on edge thickness so that we only show edges with a thickness of three or more (three shared customers) we can see how some of the deal types look as if they are grouped around particular social communities (i.e. they are supplied to the same set of people):
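
The equivalent edge weight filter is straightforward in code too; a sketch over the dealGraph projection from the earlier snippet, where the shared customer count is stored in the weight edge attribute:

#Keep only deal-deal edges with at least three shared customers
import networkx as nx

strongEdges = [(u, v, d) for u, v, d in dealGraph.edges(data=True) if d['weight'] >= 3]
strongDealGraph = nx.Graph()
strongDealGraph.add_edges_from(strongEdges)
print(strongDealGraph.number_of_edges())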

If we now go to the other workspace we created containing the original graph (less the unsatisfied deals), we can generate the customer-customer projection:

Run the modularity statistic and recolour:

Whilst there is a lot to be said for maintaining the spatial layout so that we can compare different plots, we might be tempted to rerun the layout algorithm to see if it highlights the structural associations any more clearly. In this case, there isn’t much difference:

If we run the Network diameter tool, we can generate some network statistics over this customer-customer network:
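
Betweenness centrality, which we size nodes by in a moment, is also available in networkx if you want the raw numbers; a sketch over the customerGraph projection from earlier:

#Betweenness centrality over the customer-customer graph (unweighted here)
import networkx as nx

bc = nx.betweenness_centrality(customerGraph)
#Top five customers by betweenness centrality
top = sorted(bc.items(), key=lambda x: x[1], reverse=True)[:5]
print(top)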

If we now size the nodes by betweenness centrality, size labels proportionally to node size, and use the expand/label overlap layout tools to tweak the display, here’s what we get:

Thompson looks to be an interesting character, spanning the various clusters… but what deals is he actually engaging in? If we go back to the original customer-deal graph, we can use an ego filter to see:
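
networkx’s ego_graph function does much the same job as Gephi’s ego filter; a sketch pulling out Thompson’s immediate neighbourhood from the deal-customer graph, treating the edges as undirected (a radius of 2 would also pull in the other customers on those deals):

#Deals Thompson took part in
import networkx as nx

ego = nx.ego_graph(DG.to_undirected(), 'Thompson', radius=1)
print(ego.nodes())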

To look for actual social groupings, we might filter the network based on edge weight, for example to show only edges above a particular weight (that is, number of shared deals), and then drop this set into a new workspace. If we then run the Average Degree statistic, we can calculate the degree of nodes in this graph, and size nodes accordingly. Re-laying out the graph shows us some coarse social networks based on significant numbers of shared trades:

Hopefully by now you are starting to “see” how we can start to have a visual conversation with the data, asking different questions of it based on things we are learning about it. Whilst we may need to actually look at the numbers (and Gephi’s Data Laboratory tab allows us to do that), I find that visual exploration can provide a quick way of orienting (orientating?) yourself with respect to a particular dataset, and getting a feel for the sorts of questions you might ask of it, questions that might well involve a detailed consideration of the actual numbers themselves. But for starters, the visual route often works for me…

PS There is a link to the graph file here, so if you want to try exploring it for yourself, you can do so:-)

Open Data, Development, Charities and the Third Sector

This is really just a searchable placeholder post for me to bundle up a set of links for things I keep forgetting about relating to open data in the third sector/NGO land that may be useful in generating financial interest maps, trustee interest maps, etc… As such, it’ll probably grow as time goes by…

It’s also possible to find payments to charities from local councils and government departments using services like OpenlyLocal and OpenSpending (I’m not sure if that detail appears in Open Data Communities).