OUseful.Info, the blog…

Trying to find useful things to do with emerging technologies in open education

Posts Tagged ‘ddj

#online12 Reflections – Can Open Public Data Be Disruptive to Information Vendors?

with 2 comments

Whilst preparing for my typically overloaded #online12 presentation, I thought I should make at least a passing attempt at contextualising it for the corporate attendees. The framing idea I opted for, but all too briefly reviewed, was whether open public data might be disruptive to the information industry, particularly purveyors of information services in vertical markets.

If you’ve ever read Clayton Christensen’s The Innovator’s Dilemma, you’ll be familiar with the idea behind disruptive innovations: incumbents allow start-ups with cheaper ways of tackling the less profitable, low-quality end of the market to take that part of the market; the start-ups improve their offerings, take market share, and the incumbent withdraws to the more profitable top-end. Learn more about this on OpenLearn: Sustaining and disruptive innovation or listen again to the BBC In Business episode on The Innovator’s Dilemma, from which the following clip is taken.


In the information industry, the following question then arises: will the availability of free, open public data be adopted at the low, or non-consuming end of the market, for example by micro- and small companies who haven’t necessarily be able to buy in to expensive information or data services, either on financial grounds or through lack of perceived benefits? Will the appearance of new aggregation services, often built around screenscrapers and/or public open data sources start to provide useful and useable alternatives at the low end of the market, in part because of their (current) lack of comprehensiveness or quality? And if such services are used, will they then start to improve in quality, comprehensiveness and service offerings, and in so doing start a ratcheting climb to quality that will threaten the incumbents?

Here are a couple of quick examples, based around some doodles I tried out today using data from OpenCorporates and OpenlyLocal. The original sketch (demo1() in the code here) was a simple scraper on Scraperwiki that accepted a person’s name, looked them up via a director search using the new 0.2 version of the OpenCorporates API, pulled back the companies they were associated with, and then looked up the other directors associated with those companies. For example, searching around Nigel Richard Shadbolt, we get this:

One of the problems with the data I got back is that there are duplicate entries for company officers; as Chris Taggart explained, “[data for] UK officers [comes] from two Companies House sources — data dump and API”. Another problem is that officers’ records don’t necessarily have start/end dates associated with them, so it may be the case that directors’ terms of office don’t actually overlap within a particular company. In my own scraper, I don’t check to see whether an officer is marked as “director”, “secretary”, etc, nor do I check to see whether the company is still a going concern or whether it has been dissolved. Some of these issues could be addressed right now, some may need working on. But in general, the data quality – and the way I work with it – should only improve from this quick’n'dirty minimum viable hack. As it is, I now have a tool that at a push will give me a quick snapshot of some of the possible director relationships surrounding a named individual.

The second sketch (demo2() in the code here) grabbed a list of elected council members for the Isle of Wight Council from another of Chris’ properties, OpenlyLocal, extracted the councillors names, and then looked up directorships held by people with exactly the same name using a two stage exact string match search. Here’s the result:

As with many data results, this is probably most meaningful to people who know the councillors – and companies – involved. The results may also surprise people who know the parties involved if they start to look-up the companies that aren’t immediately recognisable: surely X isn’t a director of Y? Here we have another problem – one of identity. The director look-up I use is based on an exact string match: the query to OpenCorporates returns directors with similar names, which I then filter to leave only directors with exactly the same name (I turn the strings to lower case so that case errors don’t cause a string mismatch). (I also filter companies returned to be solely ones with a gb jurisdiction.) In doing the lookup, we therefore have the possibility of false positive matches (X is returned as a director, but it’s not the X we mean, even though they have exactly the same name); and false negative lookups (eg where we look up a made up director John Alex Smith who is actually recorded in one or more filings as (the again made-up) John Alexander Smith.

That said, we do have a minimum viable research tool here that gives us a starting point for doing a very quick (though admittedly heavily caveated) search around companies that a councillor may be (or may have been – I’m not checking dates, remember) associated with.

We also have a tool around which we can start to develop a germ of an idea around conflict of interest detection.

The Isle of Wight Armchair Auditor, maintained by hyperlocal blog @onthewight (and based on an original idea by @adrianshort) hosts local spending information relating to payments made by the Isle of Wight Council. If we look at the payments made to a company, we see the spending is associated with a particular service area.

If you’re a graph thinker, as I am;-), the following might then suggest itself to you:

  1. From OpenlyLocal, we can get a list of councillors and the committees they are on;
  2. from OnTheWight’s Armchair Auditor, we can get a list of companies the council has spent money with;
  3. from OpenCorporates, we can get a list of the companies that councillors may be directors of;
  4. from OpenCorporates, we should be able to get identifiers for at least some of the companies that the council has spent money with;
  5. putting those together, we should be able to see whether or not a councillor may be a director of a company that the council is spending with and how much is being spent with them in which spending areas;
  6. we can possibly go further, if we can associate council committees with spending areas – are there councillors who are members of a committee that is responsible for a particular spending area who are also directors of companies that the council has spent money with in those spending areas? Now there’s nothing wrong with people who have expertise in a particular area sitting on a related committee (it’s probably a Good Thing). And it may be that they got their experience by working as a director for a company in that area. Which again, could be a Good Thing. But it begs a transparency question that a journalist might well be interested in asking. And in this case, with open data to hand, might technology be able to help out? For example, could we automatically generate a watch list to check whether or not councillors who are directors of companies that have received monies in particular spending areas (or more generally) have declared an interest, as would be appropriate? I think so…(caveated of course by the fact that there may be false positives and false negatives in the report…; but it would be a low effort starting point).

Once you get into this graph based thinking, you can take it mich further of course, for example looking to see whether councillors in one council are directors of companies that deal heavily with neighbouring councils… and so on.. (Paranoid? Me? Nah… Just trying to show how graphs work and how easy it can be to start joining dots once you start to get hold of the data…;-)

Anyway – this is all getting off the point and too conspiracy based…! So back to the point, which was along the lines of this: here we have the fumblings of a tool for mixing and matching data from two aggregators of public information, OpenlyLocal and OpenCorporates that might allow us to start running crude conflict of interest checks. (It’s easy enough to see how we can run the same idea using lists of MP names from the TheyWorkForYou API; or looking up directorships previously held by Ministers and the names of companies of lobbiests they meet (does WhosLobbying have an API of such things?). And so on…

Now I imagine there are commercial services around that do this sort of thing properly and comprehensively, and for a fee. But it only took me a couple of hours, for free, to get started, and having started, the paths to improvement become self-evident… and some of them can be achieved quite quickly (it just takes a little (?!) but of time…) So I wonder – could the information industry be at risk of disruption from open public data?

PS if you’re into conspiracies, Cambridge’s Centre for Research in the Arts, Social Sciences and Humanities (CRASSH) has a post-doc positions open with Professior John Naughton on The impact of global networking on the nature, dissemination and impact of conspiracy theories. The position is complemented by several parallel fellowships, including ones on Rational Choice and Democratic Conspiracies and Ideals of Transparency and Suspicion of Democracy.

Written by Tony Hirst

November 22, 2012 at 7:37 pm

So It Seems My Ballot Didn’t Count Twice in the PCC Election…

leave a comment »

At the start of my second year 6th, way back when, we had an external speaker – a Labour miner, through and through – come and talk to us about voting. In colourful language, he made it clear that he didn’t mind who we voted for, as long as we voted. I’m not sure what he had to say about spoiled votes, but as far as I can remember, I have always cast a ballot whenever I have been eligible to vote in a public election.

For folk dissatisfied with the candidates standing, I guess there are three +1 options available: 1) don’t vote at all; 2) spoil the paper; 3) cast an empty ballot (showing just how much you trust the way ballots are processed and counted); I can actually think of a couple of ways of spoiling or casting an empty ballot – one in the privacy of the the voting booth, the other in full site of the people staffing the ballot box. The +1 is stand yourself… For the first time ever, I cast an empty ballot this time round and it felt wrong, somehow… I should have made my mark on the voting form.

Anyway… the PCC (Police and Crime Commissioner) election forms allowed voters to nominate a first choice and (optionally) a second choice under a supplementary vote mechanism, described by the BBC as follows: “If a candidate has won more than 50% of first preferences they are elected. If no candidate has won more than 50%, all but the top two candidates are then eliminated. Any second preferences for the top two candidates from the eliminated candidates are added to the two remaining candidates’ totals. Whoever has the most votes combined is declared the winner.”

The Guardian Datablog duly published a spreadsheet of the PCC election results (sans spoiled ballot counts) and Andy Powell hacked them around to do a little bit of further analysis. In particular, Andy came up with a stacked bar chart showing the proportion of votes cast for the winner, vs. others, vs. didn’t vote. Note that the count recorded for the winner in the Guardian data, and Andy’s data (which is derived from the Guardian data) appears to the first round count

…which means we can look to see which elections returned a Commissioner based on second preference votes. If I use my Datastore Explorer tool to treat the spreadsheet as a database, and run a query looking for rows where the winner’s vote was less than any of the other vote counts, here’s what we get:

Here’s a link to my spreadsheet explorer view over Andy’s spreadsheet: PCC count – spreadsheet explorer:

So it seems that as someone in the Hampshire area, I could have had two preferences counted in the returned result, if I had voted for the winner as my second choice.

Written by Tony Hirst

November 19, 2012 at 11:43 am

Posted in Anything you want

Tagged with

Interactive Scenarios With Shiny – The Race to the F1 2012 Drivers’ Championship

leave a comment »

In Paths to the F1 2012 Championship Based on How They Might Finish in the US Grand Prix I posted a quick hack to calculate the finishing positions that would determine the F1 2012 Drivers’ Championship in today’s United States Grand Prix, leaving a tease dangling around the possibility of working out what combinations would lead to a VET or ALO victory if the championship isn’t decided today. So in the hour before the race started, I began to doodle a quick’n'dirty interactive app that would let me keep track of what the championship scenarios would be for the Brazil race given the lap by lap placement of VET and ALO during the US Grand Prix. Given the prep I’d done in the aforementioned post, this meant figuring out how to code up a similar algorithm in R, and then working out how to make it interactive…

But before I show you how I did it, here’s the scenario for Brazil given how the US race finished:

So how was this quick hack app done…?

Trying out the new Shiny interactive stats app builder from the RStudio folk has been on my to do list for some time. It didn’t take long to realise that an interactive race scenario builder would provide an ideal context for trying it out. There are essentially two (with a minor middle third) steps to a Shiny model:

  1. work out the points difference between VET and ALO for all their possible points combinations in the US Grand Prix;
  2. calculate the points difference going into the Brazilian Grand Prix;
  3. calculate the possible outcomes depending on placements in the Brazilian Grand Prix (essentially, an application of the algorithm I did in the original post).

The Shiny app requires two bits of code – a UI in file ui.R, in which I define two sliders that allow me to set the actual (or anticpated, or possible;-) race classifications in the US for Vettel and Alonso:

library(shiny)

shinyUI(pageWithSidebar(
  
  # Application title
  headerPanel("F1 Driver Championship Scenarios"),
  
  # Sidebar with a slider input for number of observations
  sidebarPanel(
    sliderInput("alo", 
                "ALO race pos in United States Grand Prix:", 
                min = 1, 
                max = 11, 
                value = 1),
    sliderInput("vet", 
                "VET race pos in United States Grand Prix:", 
                min = 1, 
                max = 11, 
                value = 2)
  ),
  
  # Show a plot of the generated model
  mainPanel(
    plotOutput("distPlot")
  )
))

And some logic, in file server.R (original had errors; hopefully now bugfixed…) – the original “Paths to the Championship” unpicks elements of the algorithm in a little more detail, but basically I figure out the points difference between VET and ALO based on the points difference at the start of the race and the additional points difference arising from the posited finishing positions for the US race, and then generate a matrix that works out the difference in points awarded for each possible combination of finishes in Brazil:

library(shiny)
library(ggplot2)
library(reshape)

# Define server logic required to generate and plot a random distribution
shinyServer(function(input, output) {
  points=data.frame(pos=1:11,val=c(25,18,15,12,10,8,6,4,2,1,0))
  points[[1,2]]
  a=245
  v=255
  
  pospoints=function(a,v,pdiff,points){
    pp=matrix(ncol = nrow(points), nrow = nrow(points))
    for (i in 1:nrow(points)){
      for (j in 1:nrow(points))
        pp[[i,j]]=v-a+pdiff[[i,j]]
    }
    pp
  }
  
  pdiff=matrix(ncol = nrow(points), nrow = nrow(points))
  for (i in 1:nrow(points)){
    for (j in 1:nrow(points))
      pdiff[[i,j]]=points[[i,2]]-points[[j,2]]
  }
  
  ppx=pospoints(a,v,pdiff,points)
  
  winmdiff=function(vadiff,pdiff,points){
    win=matrix(ncol = nrow(points), nrow = nrow(points))
    for (i in 1:nrow(points)){
      for (j in 1:nrow(points))
        if (i==j) win[[i,j]]=''
        else if ((vadiff+pdiff[[i,j]])>=0) win[[i,j]]='VET'
        else win[[i,j]]='ALO'
    }
    win
  }
  
  # Function that generates a plot of the distribution. The function
  # is wrapped in a call to reactivePlot to indicate that:
  #
  #  1) It is "reactive" and therefore should be automatically 
  #     re-executed when inputs change
  #  2) Its output type is a plot 
  #
  output$distPlot <- reactivePlot(function() {
    wmd=winmdiff(ppx[[input$vet,input$alo]],pdiff,points)
    wmdm=melt(wmd)
    g=ggplot(wmdm)+geom_text(aes(X1,X2,label=value,col=value))
    g=g+xlab('VET position in Brazil')+ ylab('ALO position in Brazil')
    g=g+labs(title="Championship outcomes in Brazil")
    g=g+ theme(legend.position="none")
    g=g+scale_x_continuous(breaks=seq(1, 11, 1))+scale_y_continuous(breaks=seq(1, 11, 1))
    print(g)
  })
})

To run the app, if your server and ui files are in some directory shinychamp, then something like the following should et the Shiny app running:

library(shiny)
runApp("~/path/to/my/shinychamp")

Here’s what it looks like:

You can find the code on github here: F1 Championship 2012 – scenarios if the race gets to Brazil…

Unfortunately, until a hosted service is available, you’ll have to run it yourself if you want to try it out…

Disclaimer: I’ve been rushing to get this posted before the start of the race… If you spot errors, please shout!

Written by Tony Hirst

November 18, 2012 at 6:38 pm

Posted in Rstats, Tinkering

Tagged with ,

Paths to the F1 2012 Championship Based on How They Might Finish in the US Grand Prix

with one comment

If you haven’t already seen it, one of the breakthrough visualisations of the US elections was the New York Times Paths to the Election scenario builder. With the F1 drivers’ championship in the balance this weekend, I wondered what chances were of VET claiming the championship this weekend. The only contender is ALO, who is currently ten points behind.

A quick Python script shows the outcome depending on the relative classification of ALO and VET at the end of today’s race. (If the drivers are 25 points apart, and ALO then wins in Brazil with VET out of the points, I think VET will win on countback based on having won more races.)

#The current points standings
vetPoints=255
aloPoints=245

#The points awarded for each place in the top 10; 0 points otherwise
points=[25,18,15,12,10,8,6,4,2,1,0]

#Print a header row (there's probably a more elegant way of doing this...;-)
for x in ['VET\ALO',1,2,3,4,5,6,7,8,9,10,'11+']: print str(x)+'\t',
print ''

#I'm going to construct a grid, VET's position down the rows, ALO across the columns
for i in range(len(points)):
	#Build up each row - start with VET's classification
	row=[str(i+1)]
	#Now for the columns - that is, ALO's classification
	for j in range(len(points)):
		#Work out the points if VET is placed i+1  and ALO at j+1 (i and j start at 0)
		#Find the difference between the points scores
		#If the difference is >= 25 (the biggest points diff ALO could achieve in Brazil), VET wins
		if ((vetPoints+points[i])-(aloPoints+points[j])>=25):
			row.append("VET")
		else: row.append("?")
	#Print the row a slightly tidier way...
	print '\t'.join(row)

(Now I wonder – how would I write that script in R?)

And the result?

VET\ALO	1	2	3	4	5	6	7	8	9	10	11+	
1	?	?	?	?	VET	VET	VET	VET	VET	VET	VET
2	?	?	?	?	?	?	?	?	VET	VET	VET
3	?	?	?	?	?	?	?	?	?	?	VET
4	?	?	?	?	?	?	?	?	?	?	?
5	?	?	?	?	?	?	?	?	?	?	?
6	?	?	?	?	?	?	?	?	?	?	?
7	?	?	?	?	?	?	?	?	?	?	?
8	?	?	?	?	?	?	?	?	?	?	?
9	?	?	?	?	?	?	?	?	?	?	?
10	?	?	?	?	?	?	?	?	?	?	?
11	?	?	?	?	?	?	?	?	?	?	?

Which is to say, VET wins if:

  • VET wins the race and ALO is placed 5th or lower;
  • VET is second in the race and ALO is placed 9th or lower;
  • VET is third in the race and ALO is out of the points (11th or lower)

We can also look at the points differences (define a row2 as row, then use row2.append(str((vetPoints+points[i])-(aloPoints+points[j])))):

VET\ALO	1	2	3	4	5	6	7	8	9	10	11+	
1	10	17	20	23	25	27	29	31	33	34	35
2	3	10	13	16	18	20	22	24	26	27	28
3	0	7	10	13	15	17	19	21	23	24	25
4	-3	4	7	10	12	14	16	18	20	21	22
5	-5	2	5	8	10	12	14	16	18	19	20
6	-7	0	3	6	8	10	12	14	16	17	18
7	-9	-2	1	4	6	8	10	12	14	15	16
8	-11	-4	-1	2	4	6	8	10	12	13	14
9	-13	-6	-3	0	2	4	6	8	10	11	12
10	-14	-7	-4	-1	1	3	5	7	9	10	11
11	-15	-8	-5	-2	0	2	4	6	8	9	10

We could then do a similar exercise for the Brazil race, and essentially get all the information we need to do a scenario builder like the New York Times election scenario builder… Which I would try to do, but I’ve had enough screen time for the weekend already…:-(

PS FWIW, here’s a quick table showing the awarded points difference between two drivers depending on their relative classification in a race:

A\B	1	2	3	4	5	6	7	8	9	10	11+
1	X	7	10	13	15	17	19	21	23	24	25
2	-7	X	3	6	8	10	12	14	16	17	18
3	-10	-3	X	3	5	7	9	11	13	14	15
4	-13	-6	-3	X	2	4	6	8	10	11	12
5	-15	-8	-5	-2	X	2	4	6	8	9	10
6	-17	-10	-7	-4	-2	X	2	4	6	7	8
7	-19	-12	-9	-6	-4	-2	X	2	4	5	6
8	-21	-14	-11	-8	-6	-4	-2	X	2	3	4
9	-23	-16	-13	-10	-8	-6	-4	-2	X	1	2
10	-24	-17	-14	-11	-9	-7	-5	-3	-1	X	1
11	-25	-18	-15	-12	-10	-8	-6	-4	-2	-1	X

Here’s how to use this chart in association with the previous. Looking at the previous chart, if VET finishes second and ALO third, the points difference is 13 in favour of VET. Looking at the chart immediately above, if we let VET = A and ALO = B, then the columns correspond to ALO’s placement, and the rows to VET. VET (A) needs to lose 14 or more points to lose the championship (that is, we’re looking for values of -14 or less). In particular, ALO (B, columns) needs to finish 1st with VET (A) 5th or worse, 2nd with A 8th or worse, or 3rd with VET 10th or worse.

And the script:

print '\t'.join(['A\B','1','2','3','4','5','6','7','8','9','10','11+'])
for i in range(len(points)):
	row=[str(i+1)]
	for j in range(len(points)):
		if i!=j:row.append(str(points[i]-points[j]))
		else: row.append('X')

And now for the rest of the weekend…

Written by Tony Hirst

November 18, 2012 at 12:59 pm

Posted in Infoskills, Tinkering

Tagged with ,

The Race to the F1 2012 Drivers’ Championship – Initial Sketches

with 4 comments

In part inspired by the chart described in The electoral map sans the map, I thought I’d start mulling over a quick sketch showing the race to the 2012 Formula One Drivers’ Championship.

The chart needs to show tension somehow, so in this first really quick and simple rough sketch, you really do have to put yourself in the graph and start reading it from left to right:

The data is pulled in from the Ergast API as JSON data, which is then parsed and visualised using R:

require(RJSONIO)
require(ggplot2)

#initialise a data frame
champ <- data.frame(round=numeric(),
                 driverID=character(), 
                 position=numeric(), points=numeric(),wins=numeric(),
                 stringsAsFactors=FALSE)

#This is a fudge at the moment - should be able to use a different API call to 
#get the list of races to date, rather than hardcoding latest round number
for (j in 1:18){
  resultsURL=paste("http://ergast.com/api/f1/2012/",j,"/driverStandings",".json",sep='')
  print(resultsURL)
  results.data.json=fromJSON(resultsURL,simplify=FALSE)
  rd=results.data.json$MRData$StandingsTable$StandingsLists[[1]]$DriverStandings
  for (i in 1:length(rd)){
    champ=rbind(champ,data.frame(round=j, driverID=rd[[i]]$Driver$driverId,
                               position=as.numeric(as.character(rd[[i]]$position)),
                                points=as.numeric(as.character(rd[[i]]$points)),
                                                  wins=as.numeric(as.character(rd[[i]]$wins)) ))
  }
}
champ

#Horrible fudge - should really find a better way of filtering?
test2=subset(champ,( driverID=='vettel' | driverID=='alonso' | driverID=='raikkonen'|driverID=='webber' | driverID=='hamilton'|driverID=='button' ))

#Really rough sketch, in part inspired by http://junkcharts.typepad.com/junk_charts/2012/11/the-electoral-map-sans-the-map.html
ggplot(test2)+geom_line(aes(x=round,y=points,group=driverID,col=driverID))+labs(title="F1 2012 - Race to the Championship")

#I wonder if it would be worth annotating the chart with labels explaining any DNF reasons at parts where points stall?

So, that’s the quickest and dirtiest chart I could think of – where to take this next? One way would be to start making the chart look cleaner; another possibility would be to start looking at adding labels, highlights, and maybe pushing all but ALO and VET into the background? (GDS do some nice work in this vein, eg Updating the GOV.UK Performance Dashboard; this StoryTellingWithData post on stacked bar charts also has some great ideas about how to make simple, clean and effective use of text and highlighting…).

Let’s try cleaning it up a little, and then highlight the championship contenders?

test3=subset(test,( driverID=='vettel' | driverID=='alonso' ))
test4=subset(test,( driverID=='raikkonen'|driverID=='webber' | driverID=='hamilton'|driverID=='button' ))

ggplot(test4) + geom_line(aes(x=round,y=position,group=driverID),col='lightgrey') + geom_line(data=test3,aes(x=round,y=position,group=driverID,col=driverID)) + labs(title="F1 2012 - Race to the Championship")

Hmm… I’m not sure about those colours? Maybe use Blue for VET and Red for ALO?

I really hacked the path to this – there must be a cleaner way?!

ggplot(test4)+geom_line(aes(x=round,y=points,group=driverID),col='lightgrey') + geom_line(data=subset(test3,driverID=='vettel'),aes(x=round,y=points),col='blue') + geom_line(data=subset(test3,driverID=='alonso'),aes(x=round,y=points),col='red') + labs(title="F1 2012 - Race to the Championship")

Other chart types are possible too, I suppose? Such as something in the style of a lap chart?

ggplot(test2)+geom_line(aes(x=round,y=position,group=driverID,col=driverID))+labs(title="F1 2012 - Race to the Championship")

Hmmm… Just like the first sketch, this one is cluttered and confusing too… How about if we clean it as above to highlight just the contenders?

ggplot(test4) + geom_line(aes(x=round,y=points,group=driverID),col='lightgrey') + geom_line(data=test3,aes(x=round,y=points,group=driverID,col=driverID)) + labs(title="F1 2012 - Race to the Championship")

A little cleaner, maybe? And with the colour tweak:

ggplot(test4) + geom_line(aes(x=round,y=position,group=driverID),col='lightgrey') + geom_line(data=subset(test3,driverID=='vettel'),aes(x=round,y=position),col='blue') + geom_line(data=subset(test3,driverID=='alonso'),aes(x=round,y=position),col='red') + labs(title="F1 2012 - Race to the Championship")

Something that really jumps out at me in this chart are the gridlines – they really need fixing? But what would be best to show?

Hmm, before we do that, how about an animation? (Does WordPress.com allow animated gifs?)

Here’s the code (it requires the animation package):

library(animation)
race.ani= function(...) {
  for (i in 1:18) {
    g=ggplot(subset(test3, round<=i)) + geom_line(aes(x=round,y=position,group=driverID),col='lightgrey')+geom_line(data=subset(test3,driverID=='vettel' & round<=i),aes(x=round,y=position),col='blue')+geom_line(data=subset(test3,driverID=='alonso' & round <=i),aes(x=round,y=position),col='red')+labs(title="F1 2012 - Race to the Championship")+xlim(1,18)
    print(g)
  }
}
saveMovie(race.ani(), interval = 0.4, outdir = getwd())

And for the other chart:

Hmmm…

How’s about another sort of view – the points difference between VET and ALO?

alo=subset(test3,driverID=='alonso')
vet=subset(test3,driverID=='vettel')
colnames(vet)=c("round","driverID","vposition","vpoints","vwins")
colnames(alo)=c("round","driverID","aposition","apoints","awins")
cf= merge(alo,vet,by=c('round'))
ggplot(cf) + geom_bar( aes(x=round,y=vpoints-apoints,fill=(vpoints-apoints)>0), stat='identity') + labs(title="F1 2012 Championship - VET vs ALO")

Written by Tony Hirst

November 16, 2012 at 11:59 pm

Posted in Rstats

Tagged with ,

Sketched Thoughts On Data Journalism Technologies and Practice

leave a comment »

Over the last year or two, I’ve given a handful of talks to postgrad and undergrad students broadly on the topic of “technology for data driven journalism”. The presentations are typically uncompromising, which is to say I assume a lot. There are many risks in taking such an approach, of course, as waves of confusion spread out across the room… But it is, in part, a deliberate strategy intended to shock people into an awareness of some of the things that are possible with tools that are freely available for use in the desktop and browser based sheds of today’s digital tinkerers… Having delivered one such presentation yesterday, at UCA, Farnham, here are some reflections on the whole topic of “#ddj”. Needless to say, they do not necessarily reflect even my opinions, let alone those of anybody else;-)

The data-driven journalism thing is being made up as we go along. There is a fine tradition of computer assisted journalism, database journalism, and so on, but the notion of “data driven journalism” appears to have rather more popular appeal. Before attempting a definition, what are some of the things we associate with ddj that might explain the recent upsurge of interest around it?

  • access to data: this must surely be a part of it. In one version we might tell of the story, the arrival of Google Maps and the reverse engineering of an API to it by Paul Rademacher for his April 2005 “Housing Maps mashup”, opened up people’s eyes to the possibility of map-based mashups; a short while later, in May 2005, Adrian Holovaty’s Chicago Crime Map showed how the same mashup idea could be used as an example of “live”, automated and geographically contextualised reporting of crime data. Mashups were all about appropriating web technologies and web content, building new “stuff” from pre-existing “stuff” that was already out there. And as an idea, mashups became all the rage way back then, offering as they did the potential for appropriating, combining and re-presenting elements of different web applications and publications without the need for (further) programming.
    In March 2006, a year or so after the first demonstration of the Housing Maps mashup, and in part as a response to the difficulty in getting hold of latitude and longitude data for UK based locations that was required to build Google maps mashups around British locations, the Guardian Technology supplement (remember that? It had Kakoru puzzles and everything?!;-) launched the “Free Our Data” campaign (history). This campaign called for the free release of data collected at public expense, such as the data that gave the latitude and longitude for UK postcodes.
    The early promise of, and popular interest in “mashups” waxed, and then waned; but there was a new tide rising in the information system that is the web: access to data. The mashups had shown the way forward in terms of some of the things you could do if you could wire different applications together, but despite the promise of no programming it was still too techie, too geeky, too damned hard and fiddly for most people; and despite what the geeks said, it was still programming, and there often still was coding involved. So the focus changed. Awareness grew about the sorts of “mashup” were possible, so now you could ask a developer to build you “something like that”, as you pointed to an appropriate example. The stumbling block now was access to the data to power an app that looked like that, but did the same thing for this.
    For some reason, the notion of “open” public data hit a policy nerve, and in the UK, as elsewhere, started to receive cross-party support. (A brief history of open public data in a UK context is illustrated in the first part of Open Standards and Open Data.) The data started to flow, or at least, started to become both published (through mandated transparency initiatives, such as the release of public accounting data) and requestable (for example, via an extension to FOI by the Protection of Freedoms Act 2012).
    We’ve now got access in principle and in practice to increasing amounts of data, we’ve seen some of the ways in which it can be displayed and, to a certain extent, started to explore some of the ways in which we can use it as a source for news stories. So the time is right in data terms for data driven journalism, right?
  • access to visualisation technologies: it wasn’t very long ago when it was still really hard to display data on screen using anything other than canned chart types – pie charts, line charts, bar charts (that is, the charts you were introduced to in primary school. How many chart types have you learned to read, or create, since then?). Spreadsheets offer a range of grab-and-display chart generating wizards, of course, but they’re not ideal when working with large datasets, and they’re typically geared for generating charts for reports, rather than being used analytically. The visual analysis mantra – Overview first, zoom and filter, then details-on-demand – (coined in Ben Schneiderman’s 1997 article A Grander Goal: A Thousand-Fold Increase in Human Capabilities, I think?) arguably requires fast computers and big screens to achieve the levels of responsiveness that is required for interactive usage, and we have those now…

There are, however, still some considerable barriers to access:

  • access to clean data: you might think I’m repeating myself here, but access to data and access to clean data are two separate considerations. A lot of the data that’s out there and published is still not directly usable (you can’t just load it into a spreadsheet and work on it directly); things that are supposed to match often don’t (we might know that Open Uni, OU and Open University refer to the same thing, but why should a spreadsheet?); number columns often contain things that aren’t numbers (such as commas or other punctuation); dates are provided in a wide variety of formats that we can recognise as such, but a computer can’t – at least, not unless we give it a bit of help; data gets misplaced across columns; character encodings used by different applications and operating systems don’t play nicely; typos proliferate; and so on. So whose job is it to clean the data before it can be inspected or analysed?
  • access to skills and workflows: engineering practice tends to have a separation between the notion of “engineer” and “technician”. Over-generalising and trivialising matters somewhat, engineers have academic training, and typically come at problems from a theory dominated direction; technicians (or technical engineers) have the practical skills that can be used to enact the solutions produced by the engineers. (Of course, technicians can often suggest additional, or alternative, solutions, in part reflecting a better, or more immediate, knowledge about the practical considerations involved in taking one course of action compared to another.) At the moment, the demarcation of roles (and skills required at each step of the way) in a workflow based around data discovery, preparation, analysis and reporting is still confused.
  • What questions should ask? If you think of data as a source, with a story to tell: how do you set about finding that source? Why do you even think you want to talk to that source? What sorts of questions should you ask that source, and what sorts of answer might you reasonably expect it to provide you with? How can you tell if that source is misleading you, lying to you, hiding something from you, or is just plain wrong? To what extent do you or should you trust a data source? Remember, ever cell in a spreadsheet is a fact. If you have a spreadsheet containing a million data cells, that’s a lot of fact checking to do…
  • low or misplaced expectations: we don’t necessarily expect Journalism students to know how to drive to a spreadsheet let alone run or apply complex statistics, or even have a great grasp on “the application of number”; but should they? I’m not totally convinced we need to get them up to speed with yesterday’s tools and techniques… As a tool builder/tool user, I keep looking for tools and ways of using tools that may be thought of as emerging “professional” tools for people who work with data on a day-to-day basis, but wouldn’t class themselves as data scientists, or data researchers; tools for technicians, maybe. When presenting tools to students, I try showing the tools that are likely to be found on a technician’s workbench. As such, they may look a little bit more technical than tools developed for home use (compare a socket set from a trade supplier with a £3.50 tool-roll bargain offer from your local garage), but that’s because they’re quality tools that are fit for purpose. And as such, it may take a bit of care, training and effort to learn how to use them. But I thought the point was to expose students to “industry-strength” ideas and applications? And in an area where tools are developing quite quickly, students are exactly the sort of people we need to start engaging with them: 1) at the level of raising awareness about what these tools can do; 2) as a vector for knowledge and technology transfer, getting these tools (or at least, ideas about what they can do) out into industry; 3) for students so inclined, recruiting those students for the further development of the tools, recruiting power users to help drive requirements for future iterations of the tools, and so on. If the journalism students are going to be the “engineers” to the data wrangler technicians, it’ll be good for them to know the sorts of things they can reasonably ask their technicians to help them to do…Which is to say, the journalists need exposing to the data wrangling factory floor.

Although a lot of the #ddj posts on this OUseful.info blog relate to tools, the subtext is all about recognising data as a medium, the form particular datasets take, and the way in which different tools can be used to work with these forms. In part this leads to a consideration of the process questions that can be asked of a data source based on identifying natural representations that may be contained within it (albeit in hidden form). For example, a list of MPs hints at a list of constituencies, which have locations, and therefore may benefit from representation in a geographical, map based form; a collection of emails might hint at a timeline based reconstruction, or network analysis showing who corresponded with whom (and in what order), maybe?

And finally, something that I think is still lacking in the formulation of data journalism as a practice is an articulation of the process of discovering the stories from data: I like the notion of “conversations with data” and this is something I’ll try to develop over forthcoming blog posts.

PS see also @dkernohan’s The campaigning academic?. At the risk of spoiling the punchline (you should nevertheless go and read the whole thing), David writes: “There is a space – in the gap between academia and journalism, somewhere in the vicinity of the digital humanities movement – for what I would call the “campaigning academic”, someone who is supported (in a similar way to traditional research funding) to investigate issues of interest and to report back in a variety of accessible media. Maybe this “reporting back” could build up into equivalence to an academic reward, maybe not.

These would be cross-disciplinary scholars, not tied to a particular critical perspective or methodology. And they would likely be highly networked, linking in both to the interested and the involved in any particular area – at times becoming both. They might have a high media profile and an accessible style (Ben Goldacre comes to mind). Or they might be an anonymous but fascinating blogger (whoever it is that does the wonderful Public Policy and The Past). Or anything in between.

But they would campaign, they would investigate, they would expose and they would analyse. Bringing together academic and old-school journalistic standards of integrity and verifiability.”

Mixed up in my head – and I think in David’s – is the question of “public accounting”, as well as sensemaking around current events and trends, and the extent to which it’s the role of “the media” or “academic” to perform such a function. I think there’s much to be said for reimagining how we inform and educate in a network-centric web-based world, and it’s yet another of those things on my list of things I intend to ponder further… See also: From Academic Privilege to Consultations as Peer Review.

Written by Tony Hirst

November 6, 2012 at 2:39 pm

Posted in Infoskills, onlinejournalismblog

Tagged with ,

Inter-Council Payments and the Google Fusion Tables Network Graph

with 2 comments

One of the great things about aggregating local spending data from different councils in the same place – such as on OpenlyLocal – is that you can start to explore structural relations in the way different public bodies of a similar type spend money with each other.

On the local spend with corporates scraper on Scraperwiki, which I set up to scrape how different councils spent money with particular suppliers, I realised I could also use the scraper to search for how councils spent money with other councils, by searching for suppliers containing phrases such as “district council” or “town council”. (We could also generate views to to see how councils wre spending money with different police authorities, for example.)

(The OpenlyLocal API doesn’t seem to work with the search, so I scraped the search results HTML pages instead. Results are paged, with 30 results per page, and what seems like a maximum of 1500 (50 pages) of results possible.)

The publicmesh table on the scraper captures spend going to a range of councils (not parish councils) from other councils. I also uploaded the data to Google Fusion tables (public mesh spending data), and then started to explore it using the new network graph view (via the Experiment menu). So for example, we can get a quick view over how the various county councils make payments to each other:

Hovering over a node highlights the other nodes its connected to (though it would be good if the text labels from the connected nodes were highlighted and labels for unconnected nodes were greyed out?)

(I think a Graphviz visualisation would actually be better, eg using Canviz, because it can clearly show edges from A to B as well as B to A…)

As with many exploratory visualisations, this view helps us identify some more specific questions we might want to ask of the data, rather than presenting a “finished product”.

As well as the experimental network graph view, I also noticed there’s a new Experimental View for Google Fusion Tables. As well as the normal tabular view, we also get a record view, and (where geo data is identified?) a map view:

What I’d quite like to see is a merging of map and network graph views…

One thing I noticed whilst playing with Google Fusion Tables is that getting different aggregate views is rather clunky and relies on column order in the table. So for example, here’s an aggregated view of how different county councils supply other councils:

In order to aggregate by supplied council, we need to reorder the columns (the aggregate view aggregates columns as thet appear from left to right in the table view). From the Edit column, Modify Table:

(In my browser, I then had to reload the page for the updated schema to be reflected in the view). Then we can get the count aggregation:

It would be so much easier if the aggregation view allowed you to order the columns there…

PS no time to blog this properly right now, but there are a couple of new javascript libraries that are worth mentioning in the datawrangling context.

In part coming out of the Guardian stable, Misoproject is “an open source toolkit designed to expedite the creation of high-quality interactive storytelling and data visualisation content”. The initial dataset library provides a set of routines for: loading data into the browser from a variety of sources (CSV, Google spreadsheets, JSON), including regular polling; creating and managing data tables and views of those tables within the browser, including column operations such as grouping, statistical operations (min, max, mean, moving average etc); playing nicely with a variety of client side graphics libraries (eg d3.js, Highcharts, Rickshaw and other JQuery graphics plugins).

Recline.js is a library from Max Ogden and the Open Knowledge Foundation that if its name is anything to go by is positioning itself as an alternative (or complement?) to Google Refine. To my mind though, it’s more akin to a Google Fusion Tables style user interface (“classic” version) wherever you need it, via a Javascript library. The data explorer allows you to import and preview CSV, Excel, Google Spreadsheet and ElasticSearch data from a URL, as well as via file upload (so for example, you can try it with the public spend mesh data CSV from Scraperwiki). Data can be sorted, filtered and viewed by facet, and there’s a set of integrated graphical tools for previewing and displaying data too. Refine.js views can also be shared and embedded, which makes this an ideal tool for data publishers to embed in their sites as a way of facilitating engagement with data on-site, as I expect we’ll see on the Data Hub before too long.

More reviews of these two libraries later…

PPS These are also worth a look in respect of generating visualisations based on data stored in Google spreadsheets: DataWrapper and Freedive (like my old Guardian Datastore explorer, but done properly… Wizard led UI that helps you create your own searchable and embeddable database view direct from a Google Spreadsheet).

Written by Tony Hirst

May 21, 2012 at 9:25 am

A Tinkerer’s Toolbox: Data Driven Journalism

with 2 comments

Earlier this week, I popped over to Lincoln to chat to @josswinn and @jmahoney127 about their ON Course course data project (I heartily recommend Jamie’s ON Course project blog), hopefully not setting them off down too many ratholes, erm, err, oops?!, as well as bewildering a cohort of online journalism students with a rapid fire presentation about data driven journalism…

I think I need to draw a map…

Written by Tony Hirst

March 23, 2012 at 9:49 pm

Posted in Anything you want, Presentation

Tagged with

Sleight of Hand and Data Laundering in Evidence Based Policy Making

with 8 comments

I’ve still to make this year’s New Year’s Resolution, but one of the things that I thing I’d like to spend more time getting my head round is the notion of “evidence based policy making” (e.g. Is Evidence-Based Government Possible?.

As far as I can tell, this is often caricatured as either involving Googling around a policy area using ministerially obvious Google terms and referencing whatever’s in the top 5 hits, or taking a policy decision then looking for selective evidence to support that decision, along with contrary evidence against competing alternatives; (in a related area of evidence based practice, see for example: Some Questions about Evidence-based Practice in Education. If you have other examples in a similar vein, please let me know… #lookingForAnEvidenceBase Also e.g. the idea of policy based evidence making [h/t Jon Warbrick];-)

One of the suspicions I have is that “evidence” inherits the authority associated with the most reputable source associated with it when we wish to call on it to justify it, (and possibly as a complement to that, the least reputable source if we wish to discount it?)

So for example, in his Networker Observer column last weekend, John Naughton describes a presentation given to a technology conference by Facebook’s chief operating officer, Sheryl Sandberg, that pre-empted a European commission announcement on privacy:

Sandberg made claims about the economic benefits of privacy abuse that defy parody. For example, she unveiled a report that Facebook had commissioned from Deloitte, a consultancy firm, which estimated that Facebook – an outfit with a global workforce of about 3,000 – indirectly helped create 232,000 jobs in Europe in 2011 and enabled more than $32bn in revenues.

Inspection of the “report” confirms one’s suspicion that you couldn’t make this stuff up. Or, rather, only an international consulting firm could make it up. Interestingly, Deloitte itself appears to be ambivalent about it. “The information contained in the report”, it cautions, “has been obtained from Facebook Inc and third party sources that are clearly referenced in the appropriate sections of the report. Deloitte has neither sought to corroborate this information nor to review its overall reasonableness. Further, any results from the analysis contained in the report are reliant on the information available at the time of writing the report and should not be relied upon in subsequent periods.” (Emphasis added by JN.)

Accordingly, continues Deloitte, “no representation or warranty, express or implied, is given and no responsibility or liability is or will be accepted by or on behalf of Deloitte or by any of its partners, employees or agents or any other person as to the accuracy, completeness or correctness of the information contained in this document or any oral information made available and any such liability is expressly disclaimed”.

In this case, the Deloitte report was used as evidence by Facebook to demonstrate a particular economic benefit made possible by Facebook’s activities. The consultancy firms caveats were ignored, (including the fact that the data may in part at least have come from Facebook itself), in reporting this claim. So: this is data laundering, right? We have some dodgy evidence, about which we’re biased, so we give it to an “independent” consultant who re-reports it, albeit with caveats, that we can then report, minus the caveats. Lovely, clean evidence. Our lobbyists can then go to a lazy policy researcher and take this scrubbed evidence, referencing it as finding in the Deloitte report, so that it can make it’s way into a policy briefing. Or that’s how I imagine it, any way..

John’s take was in a similar vein:

The sole purpose of “reports” such as this is to impress or intimidate politicians and regulators, many of whom still seem unaware of the extent to which international consulting firms are used by corporations to lend an aura of empirical respectability to hogwash.

Quite so. ;-) I think my concerns go further though – not only is the Deloitte cachet used to bludgeon evidence-poor audiences into submission, it may also perniciously make it’s way into documents further up the policy development ladder where only the findings, and none of the caveats (including the dodgy provenance of the data) are disclosed.

So here are a couple of things for the data journalists to take away, maybe?

1) there may be stories to be told about the way other people have sourced and used their data. Where one report quotes data from another, treat it with as much suspicion as you would hearsay… Check with the source.

2) when developing your own data stories, keep really good tabs on where the data’s come from and be suspicious about it. If you can be, be open with republishing the data, or links to it.

PS if you have other examples of data provenance laundering, please add a link as a comment to this post:-)

PPS see also How SOPA and PIPA did and didn’t change how Washington lobbying works: “The political scientist E.E. Schattschneider once called politics “the mobilization of bias.” By this, he meant something both simple and profound. All political battles are fights between competing interests, he noted, but political outcomes are almost always determined by the bias of those paying attention to the conflict. The trick is to make sure you mobilize the crowd that will cheer for you.”

PPPS A bit of history relating to the “data laundry” idea, originally in the context of scrubbing rights tainted records from library catalogue metadata:
http://blog.ouseful.info/2011/08/09/open-data-processes-the-open-metadata-laundry/

PPPPS via the Twitter Abused blog, the notion of “recursive abstraction” (“where datasets are summarized; those summaries are then further summarized and so on. The end result is a more compact summary that would have been difficult to accurately discern without the preceding steps of distillation.”) and a corollary in the sense of elements from a qualified infographic being republished in summary form without the original qualification (yet presumably with the need for even more qualification on top of the original disclaimers!)

Written by Tony Hirst

February 1, 2012 at 11:26 am

Posted in Anything you want, Policy

Tagged with

A Handful of Google Hacks…

with 5 comments

Last day of the holidays today, so I thought I’d try not to spend the day learning stuff from the web, but do something else instead, like read a book, or watch a film, clean the greenhouse, try out a new recipe, or maybe even get a (late) head start on some marking, start looking at putting a bid together, or draft a paper for some tinpot journal or other, (because I really, really, really need to generate some ‘research’ income, publish something academically credible, find some way, any way, of justifying my salary…).

But it never works out like that, does it…? Because I saw a couple of tweets from Martin, and they lead, as ever, to a bit of playing…

So for example, from this post of Martin’s – Free (and rebuild) the tweets! Export TwapperKeeper archives using Google Refine – I realised that Google Refine offers all sorts of import options I hadn’t really noticed before (like the ability to import data from an XML (incl. Atom/RSS), RDF or JSON feed. Which means I really need to have a quick play with that…

…for example, by seeing if I can load in some data directly from a Twitter search using a URL of the form:


http://search.twitter.com/search.json?q=from:mhawksey&rpp=5&include_entities=true&result_type=mixed

JSON import in Google Refine

Ooh… magic:-)

Google Refine - import data from url/project import

Looking around various Google Refine works in progress, I also noticed that there appears to be a Python library for running Google Refine project scripts, which means you can presumably automate the Google Refine process? (Note to self – I still haven’t found a noddy tutorial to help me get started with running the Gephi Toolkit from Jython).

Or how about this tweet, “forwarded” by Martin in the sense that he replied to a tweet from @SuButcher and included me in the response, so I could look up the thread being replied to: “@mhawksey I want to use a Googledocs form to populate a google map with twitter users – see http://t.co/uJGC78b8 and http://t.co/hLbBEUXY&#8221; which got me idly wondering about the quickest way to geocode data in a Google spreadsheet, remembering the Google Gadget that will do just this (Google Spreadsheet geocoding map gadget – a quick play suggests you can set the range to a whole column (e.g. B:B), but setting the range from a specific row to the end of the column using the format B2:B (eg for use mapping results from a google form, excluding the title row), doesn’t seem to work?!)

Googling around, I also found this handy trick for getting the lat/long of a point out of Google maps in a recipe for Geocoding with Google Spreadsheets:

- CSV output for Google maps point geocoder:
http://maps.google.com/maps/geo?output=csv&q=PLACENAME_OR_ADDRESS

And hence in the Google Spreadsheet context:

=ImportData(CONCATENATE(“http://maps.google.com/maps/geo?output=csv&q=&#8221;,”LOCATION”))

where “LOCATION” might equally be a reference to a cell that contains a location.

So… day wasted… again… may as well spend the rest of it doing F1DataJunkie stuff now…!

PS seems I was unwittingly sort of involved in the publication path of that Google Maps/CSV hack I linked to above… That’s why I do what I do… unforeseen consequences.

Written by Tony Hirst

January 3, 2012 at 5:30 pm

Posted in Anything you want, Tinkering

Tagged with

Follow

Get every new post delivered to your Inbox.

Join 428 other followers