OUseful.Info, the blog…

Trying to find useful things to do with emerging technologies in open education

Archive for the ‘Tinkering’ Category

By Me, on the Scraperwiki Blog: Glue Logic and Flowable Data

Regular readers will know I quite often make use of Scraperwiki for grabbing datasets and hosting views over scraped scraped data. A few days ago, I contributed a guest post to the Scraperwiki blog:

As well as being a great tool for scraping and aggregating content from third party sites, Scraperwiki can be used as a transformational “glue logic” tool: joining together applications that utilise otherwise incompatible data formats. Typically, we might think of using a scraper to pull data into one or more Scraperwiki database tables and then a view to develop an application style view over the data. Alternatively, we might just download the data so that we can analyse it elsewhere. There is another way of using Scraperwiki, though, and that is to give life to data as flowable web data.

Read the whole thing here: Glue Logic and Flowable Data.

PS I hope to write more about “flowable data”, feeds, and feed enrichment in a later post here on OUseful.info.

Written by Tony Hirst

March 6, 2013 at 1:13 pm

Posted in elsewhere, Tinkering

Tagged with

Further Dabblings with the Cloudworks API

Picking up on A Couple of Proof of Concept Demos with the Cloudworks API, and some of the comments that came in around it (thanks Sheila et al:-), I spent a couple more hours tinkering around it and came up with the following…

A prettier view, stolen from Mike Bostock (I think?)

prettier view d3js force directed layout

I also added a slider to tweak the layout (opening it up by increasing the repulsion between nodes) [h/t @mhawksey for the trick I needed to make this work] but still need to figure this out a bit more…

I also added in some new parameterised ways of accessing various different views over Cloudworks data using the root https://views.scraperwiki.com/run/cloudworks_network_d3js_force_directed_view_pretti/

Firstly, we can make calls of the form: ?cloudscapeID=2451&viewtype=cloudscapecloudcloudscape

cloudworks cloudscapes by cloud

This grabs the clouds associated with a particular cloudscape (given the cloudscape ID), and then constructs the network containing those clouds and all the cloudscapes they are associated with.

The next view uses a parameter set of the form cloudscapeID=2451&viewtype=cloudscapecloudtags and displays the clouds associated with a particular cloudscape (given the cloudscape ID), along with the tags associated with each cloud:

cloudworks cloudscape cloud tags

Even though there aren’t many nodes or edges, this is quite a cluttered view, so I maybe need to rethink how best to visualise this information?

I’ve also done a couple of views that make use of follower data. For example, here’s how to call on a view that visualises how the folk who follow a particular cloudscape follow each other (this is actually the default if no viewtype is given) -

cloudworks cloudscape innerfollowers

And here’s how to call a view that grabs a particular user’s clouds, looks up the cloudscapes they belong to, then graphs those cloudscapes and the people who follow them: ?userID=1174&viewtype=usercloudcloudscapefollower

cloudworks followers of cloudscapes containing a user's clouds

Here’s another way of describing that graph – followers of cloudscapes containing a user’s clouds.

The optional argument filterNdegree=N (where N is an integer) will filter the diaplayed network to remove nodes with degree <=N. Here’s the above example, but filtered to remove the nodes that have degree 2 or less: ?userID=1174&viewtype=usercloudcloudscapefollower&filterNdegree=2

cloudworks graph filtered

That is, we prune the graph of people who follow no more than two of the cloudscapes to which the specified user has added a cloud. In other words, we depict folk who follow at least three of the cloudscapes to which the specified user has added a cloud.

(Note that on inspecting that graph it looks as if there is at least one node that has degree 2, rather than degree 3 and above. I’m guessing that it originally had degree 3 or more but that at least one of the nodes it was connected to was pruned out? If that isn’t the case, something’s going wrong…)

Also note that it would be neater to pull in the whole graph and filter the d3.js rendered version interactively, but I don’t know how to do this?

However…I also added a parameter to the script that generates the JSON data files from data pulled from the Cloudworks API calls that allows me to generate a GEXF network file that can be saved as an XML file (.gexf suffix, cf. Visualising Networks in Gephi via a Scraperwiki Exported GEXF File) and then visualised using a tool such as Gephi. The trick? Add the URL parameter &format=gexf (the (optional) default is &format=json) [example].

gephiview of cloudworks graph

Gephi, of course, is a wonderful tool for the interactive exploration of graph-based data sets…. including a wide range of filters…

So, where are we at? The d3.js force directed layout is all very shiny but the graphs quickly get cluttered. I’m not sure if there are any interactive parameter controls I can add, but at the moment the visualisations border on the useless. At the very least, I need to squirt a header into the page from the supplied parameters so we know what the visualisation refers to. (The data I’ve played with to date – which has been very limited – doesn’t seem to be that interesting either from what I’ve seen? But maybe the rich structure isn’t there yet? Or maybe there is nothing to be had from these simple views?)

It may be worth exploring some other visualisation types to see if they are any more legible, at least, though it would be even more helpful if they were simply more informative ;-)

PS just in case, here’s a link to the Cloudworks API documentation.

PPS if there are any terms of service associated with the API, I didn’t read them. So if I broke them, oops. But that said – such is life; never ever trust that anybody you give data to will look after it;-)

Written by Tony Hirst

January 17, 2013 at 11:00 pm

A Couple of Proof of Concept Demos With the Cloudworks API

Via a tweet from @mhawksey in response to a tweet from @sheilmcn, or something like that, I came across a post by Sheila on the topic of Cloud gazing, maps and networks – some thoughts on #oldsmooc so far. The post mentioned a prototyped mindmap style browser for Cloudworks, created in part to test out the Cloudworks API.

Having tinkered with mindmap style presentations using the d3.js library in the browser before (Viewing OpenLearn Mindmaps Using d3.js; the app itself may well have rotted by now) I thought I’d have a go at exploring something similar for Cloudworks. With a promptly delivered API key by Nick Freear, it only took a few minutes to repurpose an old script to cast a test call to the Cloudworks API into a form that could easily be visualised using the d3.js library. The approach I took? To grab JSON data from the API, construct a tree using the Python networkx library, and drop a JSON serialisation of the network into a templated d3.js page. (networkx has a couple of JSON export functions that will create tree based and graph/network based JSON data structures that d3.js can feed from.

Here’s the Python fragment:


import urllib2,json, networkx as nx
from networkx.readwrite import json_graph

id=cloudscapeID #need logic




#print entities

#I seem to remember issues with non-ascii before, though maybe that was for XML? Hmmm...
def ascii(s): return "".join(i for i in s.encode('utf-8') if ord(i)<128)

def graphRoot(DG,title,root=1):
    return DG,root

def gNodeAdd(DG,root,node,name):
    return DG,node


#This simple example just grabs a list of clouds associated with a cloudscape
for c in entities['items']:
#We're going to use the tree based JSON data format to feed the d3.js mindmap view
jdata = json_graph.tree_data(DG,root=1)
#print json.dumps(jdata)

#The page template is defined elsewhere.
#It loads the JSON from a declaration in the Javascript of the form: jsonData=%(jdata)s
print page_template % vars()

The rendered view is something along the lines of:


You can find the original code here.

Now I know that: a) this isn’t very interesting to look at; and b) doesn’t even work as a navigation surface, but my intention was purely to demonstrate a recipe from getting data out of the Cloudworks API and into a d3.js mindmap view in the browser, and it does that. A couple of obvious next steps: i) add in additional API calls to grow the tree (easy); ii) linkify some of the nodes (I’m not sure I know who to do that at them moment?)

Sheila’s post ended with a brief reflection: “I’m also now wondering if a network diagram of cloudscape (showing the interconnectedness between clouds, cloudscapes and people) would be helpful ? Both in terms of not only visualising and conceptualising networks but also in starting to make more explicit links between people, activities and networks.”

So here’s another recipe, again using networkx but this time dropping the data into a graph based JSON format and using the d3.js force based layout to render it. What the script does is grab the followers of a particular cloudscape, grab each of their followers, and then graph how the followers of a particular cloudscape follow each other.

Because I had some problems getting the data into the template, I also used a slightly different wiring approach:

import urllib2,json,scraperwiki,networkx as nx
from networkx.readwrite import json_graph

id=cloudscapeID #need logic



def ascii(s): return "".join(i for i in s.encode('utf-8') if ord(i)<128)

def getUserFollowers(id):

    #print results
    for r in results['items']: f.append(r['user_id'])
    return f



#Seed graph with nodes corresponding of followers of a cloudscape
for c in entities['items']:

#construct graph of how followers of a cloudscape follow each other
for c in entities['items']:
    for followerid in followers:
        if followerid in followerIDs:

scraperwiki.utils.httpresponseheader("Content-Type", "text/json")

#Print out the json representation of the network/graph as JSON
jdata = json_graph.node_link_data(DG)
print json_graph.dumps(jdata)

In this case, I generate a JSON representation of the network that is then loaded into a separate HTML page that deploys the d3.js force directed layout visualisation, in this case how the followers of a particular cloudscape follow each other.


This hits the Cloudworks API once for the cloudscape, then once for each follower of the cloudscape, in order to construct the graph and then pass the JSON version to the HTML page.

Again, I’m posting it as a minimum viable recipe that could be developed as a way of building out Sheila’s idea (though the graph definition would probably need to be a little more elaborate, eg in terms of node labeling). Some work on the graph rendering probably wouldn’t go amiss either, eg in respect of node sizing, colouring and labeling.

Still, what do you expect in just a couple of hours?!;-)

Written by Tony Hirst

January 16, 2013 at 10:08 pm

Posted in Tinkering

Tagged with ,

WordPress Stats in R

A trackback from Martin Hawksey’s recent post on Analysing WordPress post velocity and momentum stats with Google Sheets (Spreadsheet), which demonstrates how to pull WordPress stats into a Google Spreadsheet and generates charts and reports therein, reminded me of the WordPress stats API.

So here’s a quick function for pulling WordPress reports into R.

#Wordpress Stats
#Wordpress Stats API docs (from http://stats.wordpress.com/csv.php)

#You can get a copy of your API key (required) from Akismet:
#Login with you WordPress account: http://akismet.com/account/
#Resend API key: https://akismet.com/resend/

#Required parameters: api_key, blog_id or blog_uri.
#Optional parameters: table, post_id, end, days, limit, summarize.

#api_key     String    A secret unique to your WordPress.com user account.
#blog_id     Integer   The number that identifies your blog. Find it in other stats URLs.
#blog_uri    String    The full URL to the root directory of your blog. Including the full path.
#table       String    One of views, postviews, referrers, referrers_grouped, searchterms, clicks, videoplays.
#post_id     Integer   For use with postviews table.
#end         String    The last day of the desired time frame. Format is 'Y-m-d' (e.g. 2007-05-01) and default is UTC date.
#days        Integer   The length of the desired time frame. Default is 30. "-1" means unlimited.
#period      String    For use with views table and the 'days' parameter. The desired time period grouping. 'week' or 'month'
#Use 'days' as the number of results to return (e.g. '&period=week&days=12' to return 12 weeks)
#limit       Integer   The maximum number of records to return. Default is 100. "-1" means unlimited. If days is -1, limit is capped at 500.
#summarize   Flag      If present, summarizes all matching records.
#format      String    The format the data is returned in, 'csv', 'xml' or 'json'. Default is 'csv'.
#NOTE: some of the report calls I tried didn't seem to work properly?
#Need to build up a list of tested calls to the API that actually do what you think they should?

wordpress.getstats.demo=function(apikey, blogurl, table='postviews', end=Sys.Date(), days='12', period='week', limit='', summarise=''){
  #default parameters gets back last 12 weeks of postviews aggregated by week
    '&',summarise, #set this to 'summarise=T' if required
  #Martin's post notes that JSON appears to work better than CSV
  #May be worth doing a JSON parsing version?

#Use the URL of a WordPress blog associated with the same account as the API key


getDomain=function(url) str_match(url, "^http[s]?://([^/]*)/.*?")[, 2]

#We can pull out the domains clicks were sent to or referrals came from


#Scruffy bar chart - is there a way of doing this sorted chart using geom_bar? How would we reorder x?
ggplot(c)+geom_bar(aes(x=reorder(Var1,Freq),y=Freq),stat='identity')+theme( axis.text.x=element_text(angle=-90))

ggplot(c)+geom_bar(aes(x=reorder(Var1,Freq),y=Freq),stat='identity')+theme( axis.text.x=element_text(angle=-90))

(Code as a gist.)

I guess there’s scope for coming up with a set of child functions that pull back specific report types? Also, if we pull in the blog XML archive and extract external links from each page, we could maybe start to analyse we pages are sending traffic where? (Of course, you can use Google Analytics to do this more efficiently, for hosted WordPress blogs don’t support Google Analytics (for no very good reason that I can tell…?)

PS for more WordPress tinkerings, see eg How OUseful.Info Posts Link to Each Other…,which links to a Python script for extracting data from WordPress blog export files that show how blogs posts in a particular WordPress blog link to each other.

Written by Tony Hirst

January 9, 2013 at 11:50 am

Posted in Anything you want, Rstats, Tinkering

Tagged with

Emergent Social Interest Mapping – Red Bull Racing Facebook Group

With the possibility that my effectively unlimited Twitter API key will die at some point in the Spring with the Twitter API upgrade, I’m starting to look around for alternative sources of interest signal (aka getting ready to say “bye, bye, Twitter interest mapping”). And Facebook groups look like they may offer once possibility…

Some time ago, I did a demo of how to map the the common Facebook Likes of my Facebook friends (Social Interest Positioning – Visualising Facebook Friends’ Likes With Data Grabbed Using Google Refine). In part inspired by a conversation today about profiling the interests of members of particular Facebook groups, I thought I’d have a quick peek at the Facebook API to see if it’s possible to grab the membership list of arbitrary, open Facebook groups, and then pull down the list of Likes made by the members of the group.

As with my other social positioning/social interest mapping experiments, the idea behind this approach is broadly this: users express interest through some sort of public action, such as following a particular Twitter account that can be associated with a particular interest. In this case, the signal I’m associating with an expression of interest is a Facebook Like. To locate something in interest space, we need to be able to detect a set of users associated with that thing, identify each of their interests, and then find interests they have in common. These shared interests (ideally over and above a “background level of shared interest”, aka the Stephen Fry effect (from Twitter, where a large number of people in any set of people appear to follow Stephen Fry oblivious of other more pertinent shared interests that are peculiar to that set of people) are then assumed to be representative of the interests associated with the thing. In this case, the thing is a Facebook group, the users associated with the thing are the group members, and the interests associated with the thing are the things commonly liked by members of the group.


So for example, here is the social interest positioning of the Red Bull Racing group on Facebook, based on a sample of 3000 members of the group. Note that a significant number of these members returned no likes, either because they haven’t liked anything, or because their personal privacy settings are such that they do not publicly share their likes.


As we might expect, the members of this group also appear to have an interest in other Formula One related topics, from F1 in general, to various F1 teams and drivers, and to motorsport and motoring in general (top half of the map). We also find music preferences (the cluster to the left of the map) and TV programmes (centre bottom of the map) that are of common interest, though I have no idea yet whether these are background radiation interests (that is, the Facebook equivalent of the Stephen Fry effect on Twitter) or are peculiar to this group. I’m not sure whether the cluster of beverage related preferences at the bottom right corner of the map is notable either?

This information is visualised using Gephi, using data grabbed via the following Python script (revised version of this code as a gist):

#This is a really simple script:
##Grab the list of members of a Facebook group (no paging as yet...)
###For each member, try to grab their Likes

import urllib,simplejson,csv,argparse

#Grab a copy of a current token from an example Facebook API call, eg from clicking a keyed link on:
#Something a bit like this:

parser = argparse.ArgumentParser(description='Generate social positioning map around a Facebook group')

parser.add_argument('-gid',default='2311573955',help='Facebook group ID')

parser.add_argument('-FBTOKEN',help='Facebook API token')

if args.gid!=None: gid=args.gid

#Quick test - output file is simple 2 column CSV that we can render in Gephi


def getGroupMembers(gid):
	if "error" in data:
		print "Something seems to be going wrong - check OAUTH key?"
		print data['error']['message'],data['error']['code'],data['error']['type']
		return data

#Grab the likes for a particular Facebook user by Facebook User ID
def getLikes(uid,gid):
	#Should probably implement at least a simple cache here
	print ldata
	if len(ldata['data'])>0:	
		for i in ldata['data']:
			if 'name' in i:
				#We could colour nodes based on category, etc, though would require richer output format.
				#In the past, I have used the networkx library to construct "native" graph based representations of interest networks.
				if 'category' in i: 
					print str(uid),i['name'],i['category']

#For each user in the group membership list, get their likes				
def parseGroupMembers(groupData,gid):
	for user in groupData['data']:
		#x is just a fudge used in progress reporting
		#Prevent duplicate fetches
		if uid not in uids:
			#Really crude progress reporting
			print x
	#need to handle paging?
	#parse next page URL and recall this function


Note that I have no idea whether or not this is in breach of Facebook API terms and conditions, nor have I reflected on the ethical implications of running this sort of analysis, over and the above remarking that it’s the same general approach I apply to mapping social interests on Twitter.

As to where next with this? It brings into focus again the question of identifying common interests pertinent to this particular group, compared to background popular interest that might be expressed by any random set of people. But having got a new set of data to play with, it will perhaps make it easier to test the generalisability of any model or technique I do come up with for filtering out, or normalising against, background interest.

Other directions this could go? Using a single group to bootstrap a walk around the interest space? For example, in the above case, trying to identify groups associated with Sebastian Vettel, or F1, and then repeating the process? It might also make sense to look at the categories of the notable shared interests; (from a quick browse, these include, for example, things like Movie, Product/service, Public figure, Games/toys, Sports Company, Athlete, Interest, Sport; is there a full vocabulary available, I wonder? How might we use this information?)

Written by Tony Hirst

December 5, 2012 at 11:29 pm

Posted in Tinkering

Tagged with ,

More Shiny Goodness – Tinkering With the Ergast Motor Racing Data API

I had a bit of a play with Shiny over the weekend, using the Ergast Motor Racing Data API and the magical Shiny library for R, that makes building interactive, browser based applications around R a breeze.

As this is just a quick heads-up/review post, I’ll largely limit myself to a few screenshots. When I get a chance, I’ll try to do a bit more of a write-up, though this may actually just take the form of more elaborate documentation of the app, both within the code and in the form of explanatory text in the app itself.

If you want to try ou the app, you can find an instance here: F1 2012 Laptime Explorer. The code is also available.

Here’s the initial view – the frist race of the season is selected as a default and data loaded in. The driver list is for all drivers represented during the season.

f1 2012 shiny ergast explorer

THe driver selectors allow us to just display traces for selected drivers.

The Race History chart is a classic results chart. It show the difference between the race time to date for each driver, by lap, compared to the average lap time for the winner times the lap number. (As such, this is an offline statistic – it is calculated when the winner’s overall average laptime is known).

race hisotry - classic chart

Variants of the classic Race History chart are possible, for example, using different base line times, but I haven’t implemented any of them – or the necessary UI controls. Yet…

The Lap Chart is another classic:

Lap chart - another classic

Annotations for this chart are also supported, describing all drivers who final status was not “Finished”.

lap chart with annotations

The Lap Evolution chart shows how each driver’s laptime evolved over the course of the race compared with the fastest overall recorded laptime.

Lap evolution

The Personal Lap Evolution chart shows how each driver’s laptime evolved over the course of the race compared with their personal fastest laptime.

Personal lap evolution

The Personal Deltas Chart shows the difference between one laptime and the next for each driver.

Personal deltas

The Race Summary Chart is a chart of my own design that tries to capture notable features relating to race position – the grid position (blue circle), final classification (red circle), position at the end of the first lap (the + or horizontal bar). The violin plot shows the distribution of how many laps the driver spent in each race position. Where the chart is wide, the driver spent a large number of laps in that position.

race summary

The x-axis ordering pulls out different features about how the race progressed. I need to add in a control that lets the user select different orderings.

Finally, the Fast Lap text scatterplot shows the fastest laptime for each driver and the lap at which they recorded it.


So – that’s a quick review of the app. All in all it took maybe 3 hours getting my head round the data parsing, 2-3 hours figuring what I wanted to do and learning how to do it in Shiny, and a couple of hours doing it/starting to document/annotate it. Next time, it’ll be much quicker…

Written by Tony Hirst

December 4, 2012 at 2:14 pm

Posted in Rstats, Tinkering

Tagged with , , ,

Interactive Scenarios With Shiny – The Race to the F1 2012 Drivers’ Championship

In Paths to the F1 2012 Championship Based on How They Might Finish in the US Grand Prix I posted a quick hack to calculate the finishing positions that would determine the F1 2012 Drivers’ Championship in today’s United States Grand Prix, leaving a tease dangling around the possibility of working out what combinations would lead to a VET or ALO victory if the championship isn’t decided today. So in the hour before the race started, I began to doodle a quick’n’dirty interactive app that would let me keep track of what the championship scenarios would be for the Brazil race given the lap by lap placement of VET and ALO during the US Grand Prix. Given the prep I’d done in the aforementioned post, this meant figuring out how to code up a similar algorithm in R, and then working out how to make it interactive…

But before I show you how I did it, here’s the scenario for Brazil given how the US race finished:

So how was this quick hack app done…?

Trying out the new Shiny interactive stats app builder from the RStudio folk has been on my to do list for some time. It didn’t take long to realise that an interactive race scenario builder would provide an ideal context for trying it out. There are essentially two (with a minor middle third) steps to a Shiny model:

  1. work out the points difference between VET and ALO for all their possible points combinations in the US Grand Prix;
  2. calculate the points difference going into the Brazilian Grand Prix;
  3. calculate the possible outcomes depending on placements in the Brazilian Grand Prix (essentially, an application of the algorithm I did in the original post).

The Shiny app requires two bits of code – a UI in file ui.R, in which I define two sliders that allow me to set the actual (or anticpated, or possible;-) race classifications in the US for Vettel and Alonso:


  # Application title
  headerPanel("F1 Driver Championship Scenarios"),
  # Sidebar with a slider input for number of observations
                "ALO race pos in United States Grand Prix:", 
                min = 1, 
                max = 11, 
                value = 1),
                "VET race pos in United States Grand Prix:", 
                min = 1, 
                max = 11, 
                value = 2)
  # Show a plot of the generated model

And some logic, in file server.R (original had errors; hopefully now bugfixed…) – the original “Paths to the Championship” unpicks elements of the algorithm in a little more detail, but basically I figure out the points difference between VET and ALO based on the points difference at the start of the race and the additional points difference arising from the posited finishing positions for the US race, and then generate a matrix that works out the difference in points awarded for each possible combination of finishes in Brazil:


# Define server logic required to generate and plot a random distribution
shinyServer(function(input, output) {
    pp=matrix(ncol = nrow(points), nrow = nrow(points))
    for (i in 1:nrow(points)){
      for (j in 1:nrow(points))
  pdiff=matrix(ncol = nrow(points), nrow = nrow(points))
  for (i in 1:nrow(points)){
    for (j in 1:nrow(points))
    win=matrix(ncol = nrow(points), nrow = nrow(points))
    for (i in 1:nrow(points)){
      for (j in 1:nrow(points))
        if (i==j) win[[i,j]]=''
        else if ((vadiff+pdiff[[i,j]])>=0) win[[i,j]]='VET'
        else win[[i,j]]='ALO'
  # Function that generates a plot of the distribution. The function
  # is wrapped in a call to reactivePlot to indicate that:
  #  1) It is "reactive" and therefore should be automatically 
  #     re-executed when inputs change
  #  2) Its output type is a plot 
  output$distPlot <- reactivePlot(function() {
    g=g+xlab('VET position in Brazil')+ ylab('ALO position in Brazil')
    g=g+labs(title="Championship outcomes in Brazil")
    g=g+ theme(legend.position="none")
    g=g+scale_x_continuous(breaks=seq(1, 11, 1))+scale_y_continuous(breaks=seq(1, 11, 1))

To run the app, if your server and ui files are in some directory shinychamp, then something like the following should et the Shiny app running:


Here’s what it looks like:

You can find the code on github here: F1 Championship 2012 – scenarios if the race gets to Brazil…

Unfortunately, until a hosted service is available, you’ll have to run it yourself if you want to try it out…

Disclaimer: I’ve been rushing to get this posted before the start of the race… If you spot errors, please shout!

Written by Tony Hirst

November 18, 2012 at 6:38 pm

Posted in Rstats, Tinkering

Tagged with ,

Paths to the F1 2012 Championship Based on How They Might Finish in the US Grand Prix

If you haven’t already seen it, one of the breakthrough visualisations of the US elections was the New York Times Paths to the Election scenario builder. With the F1 drivers’ championship in the balance this weekend, I wondered what chances were of VET claiming the championship this weekend. The only contender is ALO, who is currently ten points behind.

A quick Python script shows the outcome depending on the relative classification of ALO and VET at the end of today’s race. (If the drivers are 25 points apart, and ALO then wins in Brazil with VET out of the points, I think VET will win on countback based on having won more races.)

#The current points standings

#The points awarded for each place in the top 10; 0 points otherwise

#Print a header row (there's probably a more elegant way of doing this...;-)
for x in ['VET\ALO',1,2,3,4,5,6,7,8,9,10,'11+']: print str(x)+'\t',
print ''

#I'm going to construct a grid, VET's position down the rows, ALO across the columns
for i in range(len(points)):
	#Build up each row - start with VET's classification
	#Now for the columns - that is, ALO's classification
	for j in range(len(points)):
		#Work out the points if VET is placed i+1  and ALO at j+1 (i and j start at 0)
		#Find the difference between the points scores
		#If the difference is >= 25 (the biggest points diff ALO could achieve in Brazil), VET wins
		if ((vetPoints+points[i])-(aloPoints+points[j])>=25):
		else: row.append("?")
	#Print the row a slightly tidier way...
	print '\t'.join(row)

(Now I wonder – how would I write that script in R?)

And the result?

VET\ALO	1	2	3	4	5	6	7	8	9	10	11+	
2	?	?	?	?	?	?	?	?	VET	VET	VET
3	?	?	?	?	?	?	?	?	?	?	VET
4	?	?	?	?	?	?	?	?	?	?	?
5	?	?	?	?	?	?	?	?	?	?	?
6	?	?	?	?	?	?	?	?	?	?	?
7	?	?	?	?	?	?	?	?	?	?	?
8	?	?	?	?	?	?	?	?	?	?	?
9	?	?	?	?	?	?	?	?	?	?	?
10	?	?	?	?	?	?	?	?	?	?	?
11	?	?	?	?	?	?	?	?	?	?	?

Which is to say, VET wins if:

  • VET wins the race and ALO is placed 5th or lower;
  • VET is second in the race and ALO is placed 9th or lower;
  • VET is third in the race and ALO is out of the points (11th or lower)

We can also look at the points differences (define a row2 as row, then use row2.append(str((vetPoints+points[i])-(aloPoints+points[j])))):

VET\ALO	1	2	3	4	5	6	7	8	9	10	11+	
1	10	17	20	23	25	27	29	31	33	34	35
2	3	10	13	16	18	20	22	24	26	27	28
3	0	7	10	13	15	17	19	21	23	24	25
4	-3	4	7	10	12	14	16	18	20	21	22
5	-5	2	5	8	10	12	14	16	18	19	20
6	-7	0	3	6	8	10	12	14	16	17	18
7	-9	-2	1	4	6	8	10	12	14	15	16
8	-11	-4	-1	2	4	6	8	10	12	13	14
9	-13	-6	-3	0	2	4	6	8	10	11	12
10	-14	-7	-4	-1	1	3	5	7	9	10	11
11	-15	-8	-5	-2	0	2	4	6	8	9	10

We could then do a similar exercise for the Brazil race, and essentially get all the information we need to do a scenario builder like the New York Times election scenario builder… Which I would try to do, but I’ve had enough screen time for the weekend already…:-(

PS FWIW, here’s a quick table showing the awarded points difference between two drivers depending on their relative classification in a race:

A\B	1	2	3	4	5	6	7	8	9	10	11+
1	X	7	10	13	15	17	19	21	23	24	25
2	-7	X	3	6	8	10	12	14	16	17	18
3	-10	-3	X	3	5	7	9	11	13	14	15
4	-13	-6	-3	X	2	4	6	8	10	11	12
5	-15	-8	-5	-2	X	2	4	6	8	9	10
6	-17	-10	-7	-4	-2	X	2	4	6	7	8
7	-19	-12	-9	-6	-4	-2	X	2	4	5	6
8	-21	-14	-11	-8	-6	-4	-2	X	2	3	4
9	-23	-16	-13	-10	-8	-6	-4	-2	X	1	2
10	-24	-17	-14	-11	-9	-7	-5	-3	-1	X	1
11	-25	-18	-15	-12	-10	-8	-6	-4	-2	-1	X

Here’s how to use this chart in association with the previous. Looking at the previous chart, if VET finishes second and ALO third, the points difference is 13 in favour of VET. Looking at the chart immediately above, if we let VET = A and ALO = B, then the columns correspond to ALO’s placement, and the rows to VET. VET (A) needs to lose 14 or more points to lose the championship (that is, we’re looking for values of -14 or less). In particular, ALO (B, columns) needs to finish 1st with VET (A) 5th or worse, 2nd with A 8th or worse, or 3rd with VET 10th or worse.

And the script:

print '\t'.join(['A\B','1','2','3','4','5','6','7','8','9','10','11+'])
for i in range(len(points)):
	for j in range(len(points)):
		if i!=j:row.append(str(points[i]-points[j]))
		else: row.append('X')

And now for the rest of the weekend…

Written by Tony Hirst

November 18, 2012 at 12:59 pm

Posted in Infoskills, Tinkering

Tagged with ,

Finding (Nearly) Duplicate Items in a Data Column

[WARNING – THIS IS A *BAD ADVICE* POST – it describes a trick that sort of works, but the example is contrived and has a better solution – text facet and then cluster on facet (h/t to @mhawksey’s Mining and OpenRefine(ing) JISCMail: A look at OER-DISCUSS [Listserv] for making me squirm so much through that oversight I felt the need to post this warning…]

Suppose you have a dataset containing a list of Twitter updates, and you are looking for tweets that are retweets or modified retweets of the same original tweet. The OpenRefine Duplicate custom facet will identify different row items in that column that are exact duplicates of each other, but what about when they just don’t quite match: for example, an original tweet and it’s appearance in an RT (where the retweet string contains RT and the name of the original sender), or an MT, where the tweet may have been shortened, or an RT of an RT, or an RT with an additional hashtag. Here’s one strategy for finding similar-ish tweets, such as popular retweets, in the data set using custom text facets in OpenRefine.

The Ngram GREL function generates a list of word ngrams of a specified length from a string. If you imagine a sliding window N words long, the first ngram will be the first N words in the string, the second ngram the second word to the second+N’th word, and so on:

If we can reasonably expect word sequences of length N to appear in out “duplicate-ish” strings, we can generate a facet on ngrams of that length.

It may also be worth experimenting with combining the ngram function with the fingerprint GREL function. The fingerprint function identifies unique words in a string, reduces them to lower case, and then sorts them in alphabetical order:

If we generate the fingerprint of a string, and then run the ngram function, we generate ngrams around the alphabetically ordered fingerprint terms:

For a sizeable dataset, and/or long strings, it’s likely that we’ll get a lot of ngram facet terms:

We could list them all by setting an appropriate choice count limit, or we can limit the facet items to be displayed by displaying the choice counts, setting the range slider to show those facet values that appear in a large number of columns for example, and then ordering the facet items by count:

Even if tweets aren’t identical, if they contain common ngrams we can pull them out.

Note that we might then order the tweets as displayed in the table using time/date order (a note on string to time format conversions can be found in this postFacets in OpenRefine):

Or alternatively, we might choose to view them using a time facet:

Note that when you set a range in the time facet, you can then click on it and slide it as a range, essentially providing a sliding time window control for viewing records that appear over a given time range/duration.

Written by Tony Hirst

November 14, 2012 at 3:09 pm

“Drug Deal” Network Analysis with Gephi (Tutorial)

Via a trackback from Check Yo Self: 5 Things You Should Know About Data Science (Author Note) criticising tweet-mapping without further analysis (“If you’re making Gephi graphs out of tweets, you’re probably doing more data science marketing than data science analytics. And stop it. Please. I can’t take any more. … what does it gain a man to have graphs of tweets and do jack for analysis with them?”), I came across John Foreman’s Analytics Made Skeezy [uncourse] blog:

Analytics Made Skeezy is a fake blog. Each post is part of a larger narrative designed to teach a variety of analytics topics while keeping it interesting. Using a single narrative allows me to contrast various approaches within the same fake world. And ultimately that’s what this blog is about: teaching the reader when to use certain analytic tools.

Skimming through the examples described in some of the posts to date, Even Wholesale Drug Dealers Can Use a Little Retargeting: Graphing, Clustering & Community Detection in Excel and Gephi not surprisingly caught my attention. That post describes, in narrative form, how to use Excel to prepare and shape a dataset so that it can be imported into Gephi as a faux CSV file and then run through Gephi’s modularity statistic; the modularity class augmented dataset can then be exported from the Gephi Data Lab and re-presented in Excel, whereupon the judicious use of column sorting and conditional formatting is used to try to generate some sort of insight about the clusters/groups discovered in the data – apparently, “Gephi can kinda suck for giving us that kind of insight sometimes. Depends on the graph and what you’re trying to do”. And furthermore:

If you had a big dataset that you prepped into a trimmed nearest neighbors graph, keep in mind that visualizing it in Gephi is just for fun. It’s not necessary for actual insight regardless of what the scads of presentations of tweets-spreading-as-visualized-in-Gephi might tell you (gag me). You just need to do the community detection piece. You can use Gephi for that or the libraries it uses. R and python both have a package called igraph that does this stuff too. Whatever you use, you just need to get community assignments out of your large dataset so that you can run things like the aggregate analysis over them to bubble up intelligence about each group.

I don’t necessarily disagree with the implication that we often need to do more than just look at pretty pictures in Gephi to make sense of a dataset; but I do also believe that we can use Gephi in an active way to have a conversation with the data, generating some sort of preliminary insights out about the data set that we can then explore further using other analytical techniques. So what I’ll try to do in the rest of this post is offer some suggestions about one or two ways in which we might use Gephi to start conversing with the same dataset described in the Drug Dealer Retargeting post. Before I do so, however, I suggest you read through the original post and try to come to some of your own conclusions about what the data might be telling us…

Done that? To recap, the original dataset (“Inventory”) is a list of “deals”, with columns relating to two sorts of thing: 1) attribute of a deal; 2) one column per dealer showing whether they took up that deal. A customer/customer matrix is then generated and the cosine similarity between each customer calculated (note: other distance metrics are available…) showing the extent to which they participated in similar deals. Selecting the three most similar neighbours of each customer creates a “trimmed nearest neighbors graph”, which is munged into a CSV-resembling data format that Gephi can import. Gephi is then used to do a very quick/cursory (and discounted) visual analysis, and run the modularity/clustering detection algorithm.

So how would I have attacked this dataset (note: IANADS (I am not a data scientist;-)

One way would be to treat it from the start as defining a graph in which dealers are connected to trades. Using a slightly tidied version of the ‘inventory tab from the original dataset in which I removed the first (metadata) and last (totals) rows, and tweaked one of the column names to remove the brackets (I don’t think Gephi likes brackets in attribute names?), I used the following script to generate a GraphML formatted version of just such a graph.

#Python script to generate GraphML file
import csv
#We're going to use the really handy networkx graph library: easy_install networkx
import networkx as nx
import urllib

#Create a directed graph object

#Open data file in universal newline mode

#Define a variable to act as a deal node ID counter

#The graph is a bimodal/bipartite graph containing two sorts of node - deals and customers
#An identifier is minted for each row, identifying the deal
#Deal attributes are used to annotate deal nodes
#Identify columns used to annotate nodes taking string values
nodeColsStr=['Offer date', 'Product', 'Origin', 'Ready for use']
#Identify columns used to annotate nodes taking numeric values
nodeColsInt=['Minimum Qty kg', 'Discount']

#The customers are treated as nodes in their own right, rather than as deal attributes
#Identify columns used to identify customers - each of these will define a customer node
customerCols=['Smith', 'Johnson', 'Williams', 'Brown', 'Jones', 'Miller', 'Davis', 'Garcia', 'Rodriguez', 'Wilson', 'Martinez', 'Anderson', 'Taylor', 'Thomas', 'Hernandez', 'Moore', 'Martin', 'Jackson', 'Thompson', 'White' ,'Lopez', 'Lee', 'Gonzalez','Harris', 'Clark', 'Lewis', 'Robinson', 'Walker', 'Perez', 'Hall', 'Young', 'Allen', 'Sanchez', 'Wright', 'King', 'Scott','Green','Baker', 'Adams', 'Nelson','Hill', 'Ramirez', 'Campbell', 'Mitchell', 'Roberts', 'Carter', 'Phillips', 'Evans', 'Turner', 'Torres', 'Parker', 'Collins', 'Edwards', 'Stewart', 'Flores', 'Morris', 'Nguyen', 'Murphy', 'Rivera', 'Cook', 'Rogers', 'Morgan', 'Peterson', 'Cooper', 'Reed', 'Bailey', 'Bell', 'Gomez', 'Kelly', 'Howard', 'Ward', 'Cox', 'Diaz', 'Richardson', 'Wood', 'Watson', 'Brooks', 'Bennett', 'Gray', 'James', 'Reyes', 'Cruz', 'Hughes', 'Price', 'Myers', 'Long', 'Foster', 'Sanders', 'Ross', 'Morales', 'Powell', 'Sullivan', 'Russell', 'Ortiz', 'Jenkins', 'Gutierrez', 'Perry', 'Butler', 'Barnes', 'Fisher']

#Create a node for each customer, and classify it as a 'customer' node type
for customer in customerCols:

#Each row defines a deal
for row in reader:
	#Mint an ID for the deal
	#Add a node for the deal, and classify it as a 'deal' node type
	#Annotate the deal node with string based deal attributes
	for deal in nodeColsStr:
	#Annotate the deal node with numeric based deal attributes
	for deal in nodeColsInt:
	#If the cell in a customer column is set to 1,
	## draw an edge between that customer and the corresponding deal
	for customer in customerCols:
		if str(row[customer])=='1':
	#Increment the node ID counter

#write graph

The graph we’re generating (download .graphml) has a basic structure that looks something like the following:

Which is to say, in this example customer C1 engaged in a single deal, D1; customer C2 participated in every deal, D1, D2 and D3; and customer C3 partook of deals D2 and D3.

Opening the graph file into Gephi as a directed graph, we get a count of the number of actual trades there were from the edge count:

If we run the Average degree statistic, we can see that there are some nodes that are not connected to any other nodes (that is, they are either deals with no takers, or customers who never took part in a deal):

We can view these nodes using a filter:

We can also use the filter the other way, to exclude the unaccepted deals, and then create a new workspace containing just the deals that were taken up, and the customers that bought into them:

The workspace selector is at the bottom of the window, on the right hand side:

(Hmmm… for some reason, the filtered graph wasn’t exported for me… the whole graph was. Bug? Fiddling with with Giant Component filter, then exporting, then running the Giant Component filter on the exported graph and cancelling it seemed to fix things… but something is not working right?)

We can now start to try out some interactive visual analysis. Firstly, let’s lay out the nodes using a force-directed layout algorithm (ForceAtlas2) that tries to position nodes so that nodes that are connected are positioned close to each other, and nodes that aren’t connected are kept apart (imagine each node as trying to repel the other nodes, with edges trying to pull them together).

Our visual perception is great at identifying spatial groupings (see, for example, the Gestalt principles, which lead to many a design trick and a bucketful of clues about how to tease data apart in a visually meaningful way…), but are they really meaningful?

At this point in the conversation we’re having with the data, I’d probably call on a statistic that tries to place connected groups of nodes into separate groups so that I could colour the nodes according to their group membership: the modularity statistic:

The modularity statistic is a random algorithm, so you may get different (though broadly similar) results each time you run it. In this case, it discovered six possible groupings or clusters of interconnected nodes (often, one group is a miscellany…). We can see which group each node was place in by applying a Partition colouring:

We see how the modularity groupings broadly map on to the visual clusters revealed by the ForceAtlas2 layout algorithm. But do the clusters relate to anything meaningful? What happens if we turn the labels on?

The green group appear to relate to Weed transactions, reds are X, Meth and Ketamine deals, and yellow for the coke heads. So the deals do appear to cluster around different types of deal.

So what else might we be able to learn? Does the Ready for Use dimension on a deal separate out at all (null nodes on this dimension relate to customers)?

We’d need to know a little bit more about what the implications of “Ready for Use” might be, but at a glance we get a feeling the the cluster on the far left is dominated by trades with large numbers of customers (there are lots of white/customer nodes), and the Coke related cluster on the right has quite a few trades (the green nodes) that aren’t ready for use. (A question that comes to mind looking at that area is: are there any customers who seem to just go for not Ready for Use trades, and what might this tell us about them if so?)

Something else we might look to is the size of the trades, and any associated discounts. Let’s colour the nodes using the Partition tool to according to node type (attribute name is “typ” – nodes are deals (red) or customers (aqua)) and then size the nodes according to deal size using the Ranking display:

Small fry deals in the left hand group. Looking again at the Coke grouping, where there is a mix of small and large deals, another question we might file away is “are there customers who opt either for large or small trades?”

Let’s go back to the original colouring (via the Modularity coloured Partition; note that the random assignment of colours might change from the original colour set; right click allows you to re-randomise colours; clicking on a colour square allows you to colour select by hand) and size the nodes by OutDegree (that is, the sum total of edges outgoing from a node – remember, the graph was described as a directed graph, with edges going from deals to customers):

I have then sized the labels so that they are proportional to node size:

The node/label sizing shows which deals had plenty of takers. Sizing by OutDegree shows how many deals each customer took part in:

This is quite a cluttered view… returning to the Layout panel, we can use the Expansion layout to stretch out the whole layout, as well as the Label Adjust tool to jiggle nodes so that the labels don’t overlap. Note that you can also click on a node to drag it around, or a group of nodes by increasing the “field of view” of the mouse cursor:

Here’s how I tweaked the layout by expanding the layout then adjusting the labels…:

(One of the things we might be tempted to do is filter out the users who only engaged in one or two or deals, perhaps as a wau of identifying regular customers; of course, a user may only engage in a single, but very large deal, so we’d need to think carefully about what question we were actually asking when making such a choice. For example, we might also be interested in looking for customers engaging in infrequent large trades, which would require a different analysis strategy.)

Insofar as it goes, this isn’t really very interesting – what might be more compelling would be data relating to who was dealing with whom, but that isn’t immediately available. What we should be able to do, though, is see which customers are related by virtue of partaking of the same deals, and see which deals are related by virtue of being dealt to the same customers. We can maybe kid ourselves into thinking we can see this in the customer-deal graph, but we can be a little bit more rigorous two by constructing two new graphs: one that shows edges between deals that share one or more common customers; and one that shows edges between customers who shared one or more of the same deals.

Recalling the “bimodal”/bipartite graph above:

that means we should be able to generate unimodal graphs along the following lines:

D1 is connected to D2 and D3 through customer C2 (that is, an edge exists between D1 and D2, and another edge between D1 and D3). D2 and D3 are joined together through two routes, C2 and C3. We might thus weight the edge between D2 and D3 as being heavier, or more significant, than the edge between either D1 and D2, or D1 and D3.

And for the customers?

C1 is connected to C2 through deal D1. C2 and C3 are connected by a heavier weighted edge reflecting the fact that they both took part in deals D2 and D3.

You will hopefully be able to imagine how more complex customer-deal graphs might collapse into customer-customer or deal-deal graphs where there are multiple, disconnected (or only very weakly connected) groups of customers (or deals) based on the fact that there are sets of deals that do not share any common customers at all, for example. (As an exercise, try coming up with some customer-deal graphs and then “collapsing” them to customer-customer or deal-deal graphs that have disconnected components.)

So can we generate graphs of this sort using Gephi? Well, it just so happens we can, using the Multimode Networks Projection tool. To start with let’s generate another couple of workspaces containing the original graph, minus the deals that had no customers. Selecting one of these workspaces, we can now generate the deal-deal (via common customer) graph:

When we run the projection, the graph is mapped onto a deal-deal graph:

The thickness of the edges describes the number of customers any two deals shared.

If we run the modularity statistic over the deal-deal graph and colour the graph by the modularity partition, we can see how the deals are grouped by virtue of having shared customers:

If we then filter the graph on edge thickness so that we only show edges with a thickness of three or more (three shared customers) we can see some how some of the deal types look as if they are grouped around particular social communities (i.e they are supplied to the same set of people):

If we now go to the other workspace we created containing the original (less unsatisfied deals) graph, we can generate the customer-customer projection:

Run the modularity statistic and recolour:

Whilst there is a lot to be said for maintaining the spatial layout so that we can compare different plots, we might be tempted to rerun the layout algorithm to the see if it highlights the structural associations any more clearly? In this case, there isn’t much difference:

If we run the Network diameter tool, we can generate some network statistics over this customer-customer network:

If we now size the nodes by betweenness centrality, size labels proportional nodes, and use the expand/label overlap layout tools to tweak the display, here’s what we get:

Thompson looks to be an interesting character, spanning the various clusters… but what deals is he actually engaging in? If we go back to the orignal customer-deal graph, we can use an ego filter to see:

To look for actual social groupings, we might filter the network based on edge weight, for example to show only edges above a particular weight (that is, number of shared deals), and then drop this set into a new workspace. If we then run the Average Degree statistic, we can calculate the degree of nodes in this graph, and size nodes accordingly. Relaying out the graph shows us some corse social netwroks based on significant numbers of shared trades:

Hopefully by now you are starting to “see” how we can start to have a visual conversation with the data, asking different questions of it based on things we are learning about it. Whilst we may need to actually look at the numbers (and Gephi’s Data Laboratory tab allows us to do that), I find that visual exploration can provide a quick way of orienting (orientating?) yourself with respect to a particular dataset, and getting a feel for the sorts of questions you might ask of it, questions that might well involve a detailed consideration of the actual numbers themselves. But for starters, the visual route often works for me…

PS There is a link to the graph file here, so if you want to try exploring it for yourself, you can do so:-)

Written by Tony Hirst

November 9, 2012 at 6:17 pm

Posted in Data, Insight, Tinkering


Get every new post delivered to your Inbox.

Join 784 other followers