OUseful.Info, the blog…

Trying to find useful things to do with emerging technologies in open education

Archive for the ‘Tinkering’ Category

Emergent Social Interest Mapping – Red Bull Racing Facebook Group

with 2 comments

With the possibility that my effectively unlimited Twitter API key will die at some point in the Spring with the Twitter API upgrade, I’m starting to look around for alternative sources of interest signal (aka getting ready to say “bye, bye, Twitter interest mapping”). And Facebook groups look like they may offer once possibility…

Some time ago, I did a demo of how to map the the common Facebook Likes of my Facebook friends (Social Interest Positioning – Visualising Facebook Friends’ Likes With Data Grabbed Using Google Refine). In part inspired by a conversation today about profiling the interests of members of particular Facebook groups, I thought I’d have a quick peek at the Facebook API to see if it’s possible to grab the membership list of arbitrary, open Facebook groups, and then pull down the list of Likes made by the members of the group.

As with my other social positioning/social interest mapping experiments, the idea behind this approach is broadly this: users express interest through some sort of public action, such as following a particular Twitter account that can be associated with a particular interest. In this case, the signal I’m associating with an expression of interest is a Facebook Like. To locate something in interest space, we need to be able to detect a set of users associated with that thing, identify each of their interests, and then find interests they have in common. These shared interests (ideally over and above a “background level of shared interest”, aka the Stephen Fry effect (from Twitter, where a large number of people in any set of people appear to follow Stephen Fry oblivious of other more pertinent shared interests that are peculiar to that set of people) are then assumed to be representative of the interests associated with the thing. In this case, the thing is a Facebook group, the users associated with the thing are the group members, and the interests associated with the thing are the things commonly liked by members of the group.

Simples.

So for example, here is the social interest positioning of the Red Bull Racing group on Facebook, based on a sample of 3000 members of the group. Note that a significant number of these members returned no likes, either because they haven’t liked anything, or because their personal privacy settings are such that they do not publicly share their likes.

rbr_fbGroup_commonLikes

As we might expect, the members of this group also appear to have an interest in other Formula One related topics, from F1 in general, to various F1 teams and drivers, and to motorsport and motoring in general (top half of the map). We also find music preferences (the cluster to the left of the map) and TV programmes (centre bottom of the map) that are of common interest, though I have no idea yet whether these are background radiation interests (that is, the Facebook equivalent of the Stephen Fry effect on Twitter) or are peculiar to this group. I’m not sure whether the cluster of beverage related preferences at the bottom right corner of the map is notable either?

This information is visualised using Gephi, using data grabbed via the following Python script (revised version of this code as a gist):

#This is a really simple script:
##Grab the list of members of a Facebook group (no paging as yet...)
###For each member, try to grab their Likes

import urllib,simplejson,csv,argparse

#Grab a copy of a current token from an example Facebook API call, eg from clicking a keyed link on:
#https://developers.facebook.com/docs/reference/api/examples/
#Something a bit like this:
#AAAAAAITEghMBAOMYrWLBTYpf9ciZBLXaw56uOt2huS7C4cCiOiegEZBeiZB1N4ZCqHgQZDZD

parser = argparse.ArgumentParser(description='Generate social positioning map around a Facebook group')

parser.add_argument('-gid',default='2311573955',help='Facebook group ID')
#gid='2311573955'

parser.add_argument('-FBTOKEN',help='Facebook API token')

args=parser.parse_args()
if args.gid!=None: gid=args.gid
if args.FBTOKEN!=None: FBTOKEN=args.FBTOKEN

#Quick test - output file is simple 2 column CSV that we can render in Gephi
fn='fbgroupliketest_'+str(gid)+'.csv'
writer=csv.writer(open(fn,'wb+'),quoting=csv.QUOTE_ALL)

uids=[]

def getGroupMembers(gid):
	gurl='https://graph.facebook.com/'+str(gid)+'/members?limit=5000&access_token='+FBTOKEN
	data=simplejson.load(urllib.urlopen(gurl))
	if "error" in data:
		print "Something seems to be going wrong - check OAUTH key?"
		print data['error']['message'],data['error']['code'],data['error']['type']
		exit(-1)
	else:
		return data

#Grab the likes for a particular Facebook user by Facebook User ID
def getLikes(uid,gid):
	#Should probably implement at least a simple cache here
	lurl="https://graph.facebook.com/"+str(uid)+"/likes?access_token="+FBTOKEN
	ldata=simplejson.load(urllib.urlopen(lurl))
	print ldata
	
	if len(ldata['data'])>0:	
		for i in ldata['data']:
			if 'name' in i:
				writer.writerow([str(uid),i['name'].encode('ascii','ignore')])
				#We could colour nodes based on category, etc, though would require richer output format.
				#In the past, I have used the networkx library to construct "native" graph based representations of interest networks.
				if 'category' in i: 
					print str(uid),i['name'],i['category']

#For each user in the group membership list, get their likes				
def parseGroupMembers(groupData,gid):
	for user in groupData['data']:
		uid=user['id']
		writer.writerow([str(uid),str(gid)])
		#x is just a fudge used in progress reporting
		x=0
		#Prevent duplicate fetches
		if uid not in uids:
			getLikes(user['id'],gid)
			uids.append(uid)
			#Really crude progress reporting
			print x
			x=x+1
	#need to handle paging?
	#parse next page URL and recall this function


groupdata=getGroupMembers(gid)
parseGroupMembers(groupdata,gid)

Note that I have no idea whether or not this is in breach of Facebook API terms and conditions, nor have I reflected on the ethical implications of running this sort of analysis, over and the above remarking that it’s the same general approach I apply to mapping social interests on Twitter.

As to where next with this? It brings into focus again the question of identifying common interests pertinent to this particular group, compared to background popular interest that might be expressed by any random set of people. But having got a new set of data to play with, it will perhaps make it easier to test the generalisability of any model or technique I do come up with for filtering out, or normalising against, background interest.

Other directions this could go? Using a single group to bootstrap a walk around the interest space? For example, in the above case, trying to identify groups associated with Sebastian Vettel, or F1, and then repeating the process? It might also make sense to look at the categories of the notable shared interests; (from a quick browse, these include, for example, things like Movie, Product/service, Public figure, Games/toys, Sports Company, Athlete, Interest, Sport; is there a full vocabulary available, I wonder? How might we use this information?)

Written by Tony Hirst

December 5, 2012 at 11:29 pm

Posted in Tinkering

Tagged with ,

More Shiny Goodness – Tinkering With the Ergast Motor Racing Data API

with 3 comments

I had a bit of a play with Shiny over the weekend, using the Ergast Motor Racing Data API and the magical Shiny library for R, that makes building interactive, browser based applications around R a breeze.

As this is just a quick heads-up/review post, I’ll largely limit myself to a few screenshots. When I get a chance, I’ll try to do a bit more of a write-up, though this may actually just take the form of more elaborate documentation of the app, both within the code and in the form of explanatory text in the app itself.

If you want to try ou the app, you can find an instance here: F1 2012 Laptime Explorer. The code is also available.

Here’s the initial view – the frist race of the season is selected as a default and data loaded in. The driver list is for all drivers represented during the season.

f1 2012 shiny ergast explorer

THe driver selectors allow us to just display traces for selected drivers.

The Race History chart is a classic results chart. It show the difference between the race time to date for each driver, by lap, compared to the average lap time for the winner times the lap number. (As such, this is an offline statistic – it is calculated when the winner’s overall average laptime is known).

race hisotry - classic chart

Variants of the classic Race History chart are possible, for example, using different base line times, but I haven’t implemented any of them – or the necessary UI controls. Yet…

The Lap Chart is another classic:

Lap chart - another classic

Annotations for this chart are also supported, describing all drivers who final status was not “Finished”.

lap chart with annotations

The Lap Evolution chart shows how each driver’s laptime evolved over the course of the race compared with the fastest overall recorded laptime.

Lap evolution

The Personal Lap Evolution chart shows how each driver’s laptime evolved over the course of the race compared with their personal fastest laptime.

Personal lap evolution

The Personal Deltas Chart shows the difference between one laptime and the next for each driver.

Personal deltas

The Race Summary Chart is a chart of my own design that tries to capture notable features relating to race position – the grid position (blue circle), final classification (red circle), position at the end of the first lap (the + or horizontal bar). The violin plot shows the distribution of how many laps the driver spent in each race position. Where the chart is wide, the driver spent a large number of laps in that position.

race summary

The x-axis ordering pulls out different features about how the race progressed. I need to add in a control that lets the user select different orderings.

Finally, the Fast Lap text scatterplot shows the fastest laptime for each driver and the lap at which they recorded it.

fastlaps

So – that’s a quick review of the app. All in all it took maybe 3 hours getting my head round the data parsing, 2-3 hours figuring what I wanted to do and learning how to do it in Shiny, and a couple of hours doing it/starting to document/annotate it. Next time, it’ll be much quicker…

Written by Tony Hirst

December 4, 2012 at 2:14 pm

Posted in Rstats, Tinkering

Tagged with , , ,

Interactive Scenarios With Shiny – The Race to the F1 2012 Drivers’ Championship

leave a comment »

In Paths to the F1 2012 Championship Based on How They Might Finish in the US Grand Prix I posted a quick hack to calculate the finishing positions that would determine the F1 2012 Drivers’ Championship in today’s United States Grand Prix, leaving a tease dangling around the possibility of working out what combinations would lead to a VET or ALO victory if the championship isn’t decided today. So in the hour before the race started, I began to doodle a quick’n'dirty interactive app that would let me keep track of what the championship scenarios would be for the Brazil race given the lap by lap placement of VET and ALO during the US Grand Prix. Given the prep I’d done in the aforementioned post, this meant figuring out how to code up a similar algorithm in R, and then working out how to make it interactive…

But before I show you how I did it, here’s the scenario for Brazil given how the US race finished:

So how was this quick hack app done…?

Trying out the new Shiny interactive stats app builder from the RStudio folk has been on my to do list for some time. It didn’t take long to realise that an interactive race scenario builder would provide an ideal context for trying it out. There are essentially two (with a minor middle third) steps to a Shiny model:

  1. work out the points difference between VET and ALO for all their possible points combinations in the US Grand Prix;
  2. calculate the points difference going into the Brazilian Grand Prix;
  3. calculate the possible outcomes depending on placements in the Brazilian Grand Prix (essentially, an application of the algorithm I did in the original post).

The Shiny app requires two bits of code – a UI in file ui.R, in which I define two sliders that allow me to set the actual (or anticpated, or possible;-) race classifications in the US for Vettel and Alonso:

library(shiny)

shinyUI(pageWithSidebar(
  
  # Application title
  headerPanel("F1 Driver Championship Scenarios"),
  
  # Sidebar with a slider input for number of observations
  sidebarPanel(
    sliderInput("alo", 
                "ALO race pos in United States Grand Prix:", 
                min = 1, 
                max = 11, 
                value = 1),
    sliderInput("vet", 
                "VET race pos in United States Grand Prix:", 
                min = 1, 
                max = 11, 
                value = 2)
  ),
  
  # Show a plot of the generated model
  mainPanel(
    plotOutput("distPlot")
  )
))

And some logic, in file server.R (original had errors; hopefully now bugfixed…) – the original “Paths to the Championship” unpicks elements of the algorithm in a little more detail, but basically I figure out the points difference between VET and ALO based on the points difference at the start of the race and the additional points difference arising from the posited finishing positions for the US race, and then generate a matrix that works out the difference in points awarded for each possible combination of finishes in Brazil:

library(shiny)
library(ggplot2)
library(reshape)

# Define server logic required to generate and plot a random distribution
shinyServer(function(input, output) {
  points=data.frame(pos=1:11,val=c(25,18,15,12,10,8,6,4,2,1,0))
  points[[1,2]]
  a=245
  v=255
  
  pospoints=function(a,v,pdiff,points){
    pp=matrix(ncol = nrow(points), nrow = nrow(points))
    for (i in 1:nrow(points)){
      for (j in 1:nrow(points))
        pp[[i,j]]=v-a+pdiff[[i,j]]
    }
    pp
  }
  
  pdiff=matrix(ncol = nrow(points), nrow = nrow(points))
  for (i in 1:nrow(points)){
    for (j in 1:nrow(points))
      pdiff[[i,j]]=points[[i,2]]-points[[j,2]]
  }
  
  ppx=pospoints(a,v,pdiff,points)
  
  winmdiff=function(vadiff,pdiff,points){
    win=matrix(ncol = nrow(points), nrow = nrow(points))
    for (i in 1:nrow(points)){
      for (j in 1:nrow(points))
        if (i==j) win[[i,j]]=''
        else if ((vadiff+pdiff[[i,j]])>=0) win[[i,j]]='VET'
        else win[[i,j]]='ALO'
    }
    win
  }
  
  # Function that generates a plot of the distribution. The function
  # is wrapped in a call to reactivePlot to indicate that:
  #
  #  1) It is "reactive" and therefore should be automatically 
  #     re-executed when inputs change
  #  2) Its output type is a plot 
  #
  output$distPlot <- reactivePlot(function() {
    wmd=winmdiff(ppx[[input$vet,input$alo]],pdiff,points)
    wmdm=melt(wmd)
    g=ggplot(wmdm)+geom_text(aes(X1,X2,label=value,col=value))
    g=g+xlab('VET position in Brazil')+ ylab('ALO position in Brazil')
    g=g+labs(title="Championship outcomes in Brazil")
    g=g+ theme(legend.position="none")
    g=g+scale_x_continuous(breaks=seq(1, 11, 1))+scale_y_continuous(breaks=seq(1, 11, 1))
    print(g)
  })
})

To run the app, if your server and ui files are in some directory shinychamp, then something like the following should et the Shiny app running:

library(shiny)
runApp("~/path/to/my/shinychamp")

Here’s what it looks like:

You can find the code on github here: F1 Championship 2012 – scenarios if the race gets to Brazil…

Unfortunately, until a hosted service is available, you’ll have to run it yourself if you want to try it out…

Disclaimer: I’ve been rushing to get this posted before the start of the race… If you spot errors, please shout!

Written by Tony Hirst

November 18, 2012 at 6:38 pm

Posted in Rstats, Tinkering

Tagged with ,

Paths to the F1 2012 Championship Based on How They Might Finish in the US Grand Prix

with one comment

If you haven’t already seen it, one of the breakthrough visualisations of the US elections was the New York Times Paths to the Election scenario builder. With the F1 drivers’ championship in the balance this weekend, I wondered what chances were of VET claiming the championship this weekend. The only contender is ALO, who is currently ten points behind.

A quick Python script shows the outcome depending on the relative classification of ALO and VET at the end of today’s race. (If the drivers are 25 points apart, and ALO then wins in Brazil with VET out of the points, I think VET will win on countback based on having won more races.)

#The current points standings
vetPoints=255
aloPoints=245

#The points awarded for each place in the top 10; 0 points otherwise
points=[25,18,15,12,10,8,6,4,2,1,0]

#Print a header row (there's probably a more elegant way of doing this...;-)
for x in ['VET\ALO',1,2,3,4,5,6,7,8,9,10,'11+']: print str(x)+'\t',
print ''

#I'm going to construct a grid, VET's position down the rows, ALO across the columns
for i in range(len(points)):
	#Build up each row - start with VET's classification
	row=[str(i+1)]
	#Now for the columns - that is, ALO's classification
	for j in range(len(points)):
		#Work out the points if VET is placed i+1  and ALO at j+1 (i and j start at 0)
		#Find the difference between the points scores
		#If the difference is >= 25 (the biggest points diff ALO could achieve in Brazil), VET wins
		if ((vetPoints+points[i])-(aloPoints+points[j])>=25):
			row.append("VET")
		else: row.append("?")
	#Print the row a slightly tidier way...
	print '\t'.join(row)

(Now I wonder – how would I write that script in R?)

And the result?

VET\ALO	1	2	3	4	5	6	7	8	9	10	11+	
1	?	?	?	?	VET	VET	VET	VET	VET	VET	VET
2	?	?	?	?	?	?	?	?	VET	VET	VET
3	?	?	?	?	?	?	?	?	?	?	VET
4	?	?	?	?	?	?	?	?	?	?	?
5	?	?	?	?	?	?	?	?	?	?	?
6	?	?	?	?	?	?	?	?	?	?	?
7	?	?	?	?	?	?	?	?	?	?	?
8	?	?	?	?	?	?	?	?	?	?	?
9	?	?	?	?	?	?	?	?	?	?	?
10	?	?	?	?	?	?	?	?	?	?	?
11	?	?	?	?	?	?	?	?	?	?	?

Which is to say, VET wins if:

  • VET wins the race and ALO is placed 5th or lower;
  • VET is second in the race and ALO is placed 9th or lower;
  • VET is third in the race and ALO is out of the points (11th or lower)

We can also look at the points differences (define a row2 as row, then use row2.append(str((vetPoints+points[i])-(aloPoints+points[j])))):

VET\ALO	1	2	3	4	5	6	7	8	9	10	11+	
1	10	17	20	23	25	27	29	31	33	34	35
2	3	10	13	16	18	20	22	24	26	27	28
3	0	7	10	13	15	17	19	21	23	24	25
4	-3	4	7	10	12	14	16	18	20	21	22
5	-5	2	5	8	10	12	14	16	18	19	20
6	-7	0	3	6	8	10	12	14	16	17	18
7	-9	-2	1	4	6	8	10	12	14	15	16
8	-11	-4	-1	2	4	6	8	10	12	13	14
9	-13	-6	-3	0	2	4	6	8	10	11	12
10	-14	-7	-4	-1	1	3	5	7	9	10	11
11	-15	-8	-5	-2	0	2	4	6	8	9	10

We could then do a similar exercise for the Brazil race, and essentially get all the information we need to do a scenario builder like the New York Times election scenario builder… Which I would try to do, but I’ve had enough screen time for the weekend already…:-(

PS FWIW, here’s a quick table showing the awarded points difference between two drivers depending on their relative classification in a race:

A\B	1	2	3	4	5	6	7	8	9	10	11+
1	X	7	10	13	15	17	19	21	23	24	25
2	-7	X	3	6	8	10	12	14	16	17	18
3	-10	-3	X	3	5	7	9	11	13	14	15
4	-13	-6	-3	X	2	4	6	8	10	11	12
5	-15	-8	-5	-2	X	2	4	6	8	9	10
6	-17	-10	-7	-4	-2	X	2	4	6	7	8
7	-19	-12	-9	-6	-4	-2	X	2	4	5	6
8	-21	-14	-11	-8	-6	-4	-2	X	2	3	4
9	-23	-16	-13	-10	-8	-6	-4	-2	X	1	2
10	-24	-17	-14	-11	-9	-7	-5	-3	-1	X	1
11	-25	-18	-15	-12	-10	-8	-6	-4	-2	-1	X

Here’s how to use this chart in association with the previous. Looking at the previous chart, if VET finishes second and ALO third, the points difference is 13 in favour of VET. Looking at the chart immediately above, if we let VET = A and ALO = B, then the columns correspond to ALO’s placement, and the rows to VET. VET (A) needs to lose 14 or more points to lose the championship (that is, we’re looking for values of -14 or less). In particular, ALO (B, columns) needs to finish 1st with VET (A) 5th or worse, 2nd with A 8th or worse, or 3rd with VET 10th or worse.

And the script:

print '\t'.join(['A\B','1','2','3','4','5','6','7','8','9','10','11+'])
for i in range(len(points)):
	row=[str(i+1)]
	for j in range(len(points)):
		if i!=j:row.append(str(points[i]-points[j]))
		else: row.append('X')

And now for the rest of the weekend…

Written by Tony Hirst

November 18, 2012 at 12:59 pm

Posted in Infoskills, Tinkering

Tagged with ,

Finding (Nearly) Duplicate Items in a Data Column

with one comment

[WARNING - THIS IS A *BAD ADVICE* POST - it describes a trick that sort of works, but the example is contrived and has a better solution - text facet and then cluster on facet (h/t to @mhawksey's Mining and OpenRefine(ing) JISCMail: A look at OER-DISCUSS [Listserv] for making me squirm so much through that oversight I felt the need to post this warning…]

Suppose you have a dataset containing a list of Twitter updates, and you are looking for tweets that are retweets or modified retweets of the same original tweet. The OpenRefine Duplicate custom facet will identify different row items in that column that are exact duplicates of each other, but what about when they just don’t quite match: for example, an original tweet and it’s appearance in an RT (where the retweet string contains RT and the name of the original sender), or an MT, where the tweet may have been shortened, or an RT of an RT, or an RT with an additional hashtag. Here’s one strategy for finding similar-ish tweets, such as popular retweets, in the data set using custom text facets in OpenRefine.

The Ngram GREL function generates a list of word ngrams of a specified length from a string. If you imagine a sliding window N words long, the first ngram will be the first N words in the string, the second ngram the second word to the second+N’th word, and so on:

If we can reasonably expect word sequences of length N to appear in out “duplicate-ish” strings, we can generate a facet on ngrams of that length.

It may also be worth experimenting with combining the ngram function with the fingerprint GREL function. The fingerprint function identifies unique words in a string, reduces them to lower case, and then sorts them in alphabetical order:

If we generate the fingerprint of a string, and then run the ngram function, we generate ngrams around the alphabetically ordered fingerprint terms:

For a sizeable dataset, and/or long strings, it’s likely that we’ll get a lot of ngram facet terms:

We could list them all by setting an appropriate choice count limit, or we can limit the facet items to be displayed by displaying the choice counts, setting the range slider to show those facet values that appear in a large number of columns for example, and then ordering the facet items by count:

Even if tweets aren’t identical, if they contain common ngrams we can pull them out.

Note that we might then order the tweets as displayed in the table using time/date order (a note on string to time format conversions can be found in this postFacets in OpenRefine):

Or alternatively, we might choose to view them using a time facet:

Note that when you set a range in the time facet, you can then click on it and slide it as a range, essentially providing a sliding time window control for viewing records that appear over a given time range/duration.

Written by Tony Hirst

November 14, 2012 at 3:09 pm

“Drug Deal” Network Analysis with Gephi (Tutorial)

with 5 comments

Via a trackback from Check Yo Self: 5 Things You Should Know About Data Science (Author Note) criticising tweet-mapping without further analysis (“If you’re making Gephi graphs out of tweets, you’re probably doing more data science marketing than data science analytics. And stop it. Please. I can’t take any more. … what does it gain a man to have graphs of tweets and do jack for analysis with them?”), I came across John Foreman’s Analytics Made Skeezy [uncourse] blog:

Analytics Made Skeezy is a fake blog. Each post is part of a larger narrative designed to teach a variety of analytics topics while keeping it interesting. Using a single narrative allows me to contrast various approaches within the same fake world. And ultimately that’s what this blog is about: teaching the reader when to use certain analytic tools.

Skimming through the examples described in some of the posts to date, Even Wholesale Drug Dealers Can Use a Little Retargeting: Graphing, Clustering & Community Detection in Excel and Gephi not surprisingly caught my attention. That post describes, in narrative form, how to use Excel to prepare and shape a dataset so that it can be imported into Gephi as a faux CSV file and then run through Gephi’s modularity statistic; the modularity class augmented dataset can then be exported from the Gephi Data Lab and re-presented in Excel, whereupon the judicious use of column sorting and conditional formatting is used to try to generate some sort of insight about the clusters/groups discovered in the data – apparently, “Gephi can kinda suck for giving us that kind of insight sometimes. Depends on the graph and what you’re trying to do”. And furthermore:

If you had a big dataset that you prepped into a trimmed nearest neighbors graph, keep in mind that visualizing it in Gephi is just for fun. It’s not necessary for actual insight regardless of what the scads of presentations of tweets-spreading-as-visualized-in-Gephi might tell you (gag me). You just need to do the community detection piece. You can use Gephi for that or the libraries it uses. R and python both have a package called igraph that does this stuff too. Whatever you use, you just need to get community assignments out of your large dataset so that you can run things like the aggregate analysis over them to bubble up intelligence about each group.

I don’t necessarily disagree with the implication that we often need to do more than just look at pretty pictures in Gephi to make sense of a dataset; but I do also believe that we can use Gephi in an active way to have a conversation with the data, generating some sort of preliminary insights out about the data set that we can then explore further using other analytical techniques. So what I’ll try to do in the rest of this post is offer some suggestions about one or two ways in which we might use Gephi to start conversing with the same dataset described in the Drug Dealer Retargeting post. Before I do so, however, I suggest you read through the original post and try to come to some of your own conclusions about what the data might be telling us…

Done that? To recap, the original dataset (“Inventory”) is a list of “deals”, with columns relating to two sorts of thing: 1) attribute of a deal; 2) one column per dealer showing whether they took up that deal. A customer/customer matrix is then generated and the cosine similarity between each customer calculated (note: other distance metrics are available…) showing the extent to which they participated in similar deals. Selecting the three most similar neighbours of each customer creates a “trimmed nearest neighbors graph”, which is munged into a CSV-resembling data format that Gephi can import. Gephi is then used to do a very quick/cursory (and discounted) visual analysis, and run the modularity/clustering detection algorithm.

So how would I have attacked this dataset (note: IANADS (I am not a data scientist;-)

One way would be to treat it from the start as defining a graph in which dealers are connected to trades. Using a slightly tidied version of the ‘inventory tab from the original dataset in which I removed the first (metadata) and last (totals) rows, and tweaked one of the column names to remove the brackets (I don’t think Gephi likes brackets in attribute names?), I used the following script to generate a GraphML formatted version of just such a graph.

#Python script to generate GraphML file
import csv
#We're going to use the really handy networkx graph library: easy_install networkx
import networkx as nx
import urllib

#Create a directed graph object
DG=nx.DiGraph()

#Open data file in universal newline mode
reader=csv.DictReader(open("inventory.csv","rU"))

#Define a variable to act as a deal node ID counter
dcid=0

#The graph is a bimodal/bipartite graph containing two sorts of node - deals and customers
#An identifier is minted for each row, identifying the deal
#Deal attributes are used to annotate deal nodes
#Identify columns used to annotate nodes taking string values
nodeColsStr=['Offer date', 'Product', 'Origin', 'Ready for use']
#Identify columns used to annotate nodes taking numeric values
nodeColsInt=['Minimum Qty kg', 'Discount']

#The customers are treated as nodes in their own right, rather than as deal attributes
#Identify columns used to identify customers - each of these will define a customer node
customerCols=['Smith', 'Johnson', 'Williams', 'Brown', 'Jones', 'Miller', 'Davis', 'Garcia', 'Rodriguez', 'Wilson', 'Martinez', 'Anderson', 'Taylor', 'Thomas', 'Hernandez', 'Moore', 'Martin', 'Jackson', 'Thompson', 'White' ,'Lopez', 'Lee', 'Gonzalez','Harris', 'Clark', 'Lewis', 'Robinson', 'Walker', 'Perez', 'Hall', 'Young', 'Allen', 'Sanchez', 'Wright', 'King', 'Scott','Green','Baker', 'Adams', 'Nelson','Hill', 'Ramirez', 'Campbell', 'Mitchell', 'Roberts', 'Carter', 'Phillips', 'Evans', 'Turner', 'Torres', 'Parker', 'Collins', 'Edwards', 'Stewart', 'Flores', 'Morris', 'Nguyen', 'Murphy', 'Rivera', 'Cook', 'Rogers', 'Morgan', 'Peterson', 'Cooper', 'Reed', 'Bailey', 'Bell', 'Gomez', 'Kelly', 'Howard', 'Ward', 'Cox', 'Diaz', 'Richardson', 'Wood', 'Watson', 'Brooks', 'Bennett', 'Gray', 'James', 'Reyes', 'Cruz', 'Hughes', 'Price', 'Myers', 'Long', 'Foster', 'Sanders', 'Ross', 'Morales', 'Powell', 'Sullivan', 'Russell', 'Ortiz', 'Jenkins', 'Gutierrez', 'Perry', 'Butler', 'Barnes', 'Fisher']

#Create a node for each customer, and classify it as a 'customer' node type
for customer in customerCols:
	DG.add_node(customer,typ="customer")

#Each row defines a deal
for row in reader:
	#Mint an ID for the deal
	dealID='deal'+str(dcid)
	#Add a node for the deal, and classify it as a 'deal' node type
	DG.add_node(dealID,typ='deal')
	#Annotate the deal node with string based deal attributes
	for deal in nodeColsStr:
		DG.node[dealID][deal]=row[deal]
	#Annotate the deal node with numeric based deal attributes
	for deal in nodeColsInt:
		DG.node[dealID][deal]=int(row[deal])
	#If the cell in a customer column is set to 1,
	## draw an edge between that customer and the corresponding deal
	for customer in customerCols:
		if str(row[customer])=='1':
			DG.add_edge(dealID,customer)
	#Increment the node ID counter
	dcid=dcid+1

#write graph
nx.write_graphml(DG,"inventory.graphml")

The graph we’re generating (download .graphml) has a basic structure that looks something like the following:

Which is to say, in this example customer C1 engaged in a single deal, D1; customer C2 participated in every deal, D1, D2 and D3; and customer C3 partook of deals D2 and D3.

Opening the graph file into Gephi as a directed graph, we get a count of the number of actual trades there were from the edge count:

If we run the Average degree statistic, we can see that there are some nodes that are not connected to any other nodes (that is, they are either deals with no takers, or customers who never took part in a deal):

We can view these nodes using a filter:

We can also use the filter the other way, to exclude the unaccepted deals, and then create a new workspace containing just the deals that were taken up, and the customers that bought into them:

The workspace selector is at the bottom of the window, on the right hand side:

(Hmmm… for some reason, the filtered graph wasn’t exported for me… the whole graph was. Bug? Fiddling with with Giant Component filter, then exporting, then running the Giant Component filter on the exported graph and cancelling it seemed to fix things… but something is not working right?)

We can now start to try out some interactive visual analysis. Firstly, let’s lay out the nodes using a force-directed layout algorithm (ForceAtlas2) that tries to position nodes so that nodes that are connected are positioned close to each other, and nodes that aren’t connected are kept apart (imagine each node as trying to repel the other nodes, with edges trying to pull them together).

Our visual perception is great at identifying spatial groupings (see, for example, the Gestalt principles, which lead to many a design trick and a bucketful of clues about how to tease data apart in a visually meaningful way…), but are they really meaningful?

At this point in the conversation we’re having with the data, I’d probably call on a statistic that tries to place connected groups of nodes into separate groups so that I could colour the nodes according to their group membership: the modularity statistic:

The modularity statistic is a random algorithm, so you may get different (though broadly similar) results each time you run it. In this case, it discovered six possible groupings or clusters of interconnected nodes (often, one group is a miscellany…). We can see which group each node was place in by applying a Partition colouring:

We see how the modularity groupings broadly map on to the visual clusters revealed by the ForceAtlas2 layout algorithm. But do the clusters relate to anything meaningful? What happens if we turn the labels on?

The green group appear to relate to Weed transactions, reds are X, Meth and Ketamine deals, and yellow for the coke heads. So the deals do appear to cluster around different types of deal.

So what else might we be able to learn? Does the Ready for Use dimension on a deal separate out at all (null nodes on this dimension relate to customers)?

We’d need to know a little bit more about what the implications of “Ready for Use” might be, but at a glance we get a feeling the the cluster on the far left is dominated by trades with large numbers of customers (there are lots of white/customer nodes), and the Coke related cluster on the right has quite a few trades (the green nodes) that aren’t ready for use. (A question that comes to mind looking at that area is: are there any customers who seem to just go for not Ready for Use trades, and what might this tell us about them if so?)

Something else we might look to is the size of the trades, and any associated discounts. Let’s colour the nodes using the Partition tool to according to node type (attribute name is “typ” – nodes are deals (red) or customers (aqua)) and then size the nodes according to deal size using the Ranking display:

Small fry deals in the left hand group. Looking again at the Coke grouping, where there is a mix of small and large deals, another question we might file away is “are there customers who opt either for large or small trades?”

Let’s go back to the original colouring (via the Modularity coloured Partition; note that the random assignment of colours might change from the original colour set; right click allows you to re-randomise colours; clicking on a colour square allows you to colour select by hand) and size the nodes by OutDegree (that is, the sum total of edges outgoing from a node – remember, the graph was described as a directed graph, with edges going from deals to customers):

I have then sized the labels so that they are proportional to node size:

The node/label sizing shows which deals had plenty of takers. Sizing by OutDegree shows how many deals each customer took part in:

This is quite a cluttered view… returning to the Layout panel, we can use the Expansion layout to stretch out the whole layout, as well as the Label Adjust tool to jiggle nodes so that the labels don’t overlap. Note that you can also click on a node to drag it around, or a group of nodes by increasing the “field of view” of the mouse cursor:

Here’s how I tweaked the layout by expanding the layout then adjusting the labels…:

(One of the things we might be tempted to do is filter out the users who only engaged in one or two or deals, perhaps as a wau of identifying regular customers; of course, a user may only engage in a single, but very large deal, so we’d need to think carefully about what question we were actually asking when making such a choice. For example, we might also be interested in looking for customers engaging in infrequent large trades, which would require a different analysis strategy.)

Insofar as it goes, this isn’t really very interesting – what might be more compelling would be data relating to who was dealing with whom, but that isn’t immediately available. What we should be able to do, though, is see which customers are related by virtue of partaking of the same deals, and see which deals are related by virtue of being dealt to the same customers. We can maybe kid ourselves into thinking we can see this in the customer-deal graph, but we can be a little bit more rigorous two by constructing two new graphs: one that shows edges between deals that share one or more common customers; and one that shows edges between customers who shared one or more of the same deals.

Recalling the “bimodal”/bipartite graph above:

that means we should be able to generate unimodal graphs along the following lines:

D1 is connected to D2 and D3 through customer C2 (that is, an edge exists between D1 and D2, and another edge between D1 and D3). D2 and D3 are joined together through two routes, C2 and C3. We might thus weight the edge between D2 and D3 as being heavier, or more significant, than the edge between either D1 and D2, or D1 and D3.

And for the customers?

C1 is connected to C2 through deal D1. C2 and C3 are connected by a heavier weighted edge reflecting the fact that they both took part in deals D2 and D3.

You will hopefully be able to imagine how more complex customer-deal graphs might collapse into customer-customer or deal-deal graphs where there are multiple, disconnected (or only very weakly connected) groups of customers (or deals) based on the fact that there are sets of deals that do not share any common customers at all, for example. (As an exercise, try coming up with some customer-deal graphs and then “collapsing” them to customer-customer or deal-deal graphs that have disconnected components.)

So can we generate graphs of this sort using Gephi? Well, it just so happens we can, using the Multimode Networks Projection tool. To start with let’s generate another couple of workspaces containing the original graph, minus the deals that had no customers. Selecting one of these workspaces, we can now generate the deal-deal (via common customer) graph:

When we run the projection, the graph is mapped onto a deal-deal graph:

The thickness of the edges describes the number of customers any two deals shared.

If we run the modularity statistic over the deal-deal graph and colour the graph by the modularity partition, we can see how the deals are grouped by virtue of having shared customers:

If we then filter the graph on edge thickness so that we only show edges with a thickness of three or more (three shared customers) we can see some how some of the deal types look as if they are grouped around particular social communities (i.e they are supplied to the same set of people):

If we now go to the other workspace we created containing the original (less unsatisfied deals) graph, we can generate the customer-customer projection:

Run the modularity statistic and recolour:

Whilst there is a lot to be said for maintaining the spatial layout so that we can compare different plots, we might be tempted to rerun the layout algorithm to the see if it highlights the structural associations any more clearly? In this case, there isn’t much difference:

If we run the Network diameter tool, we can generate some network statistics over this customer-customer network:

If we now size the nodes by betweenness centrality, size labels proportional nodes, and use the expand/label overlap layout tools to tweak the display, here’s what we get:

Thompson looks to be an interesting character, spanning the various clusters… but what deals is he actually engaging in? If we go back to the orignal customer-deal graph, we can use an ego filter to see:

To look for actual social groupings, we might filter the network based on edge weight, for example to show only edges above a particular weight (that is, number of shared deals), and then drop this set into a new workspace. If we then run the Average Degree statistic, we can calculate the degree of nodes in this graph, and size nodes accordingly. Relaying out the graph shows us some corse social netwroks based on significant numbers of shared trades:

Hopefully by now you are starting to “see” how we can start to have a visual conversation with the data, asking different questions of it based on things we are learning about it. Whilst we may need to actually look at the numbers (and Gephi’s Data Laboratory tab allows us to do that), I find that visual exploration can provide a quick way of orienting (orientating?) yourself with respect to a particular dataset, and getting a feel for the sorts of questions you might ask of it, questions that might well involve a detailed consideration of the actual numbers themselves. But for starters, the visual route often works for me…

PS There is a link to the graph file here, so if you want to try exploring it for yourself, you can do so:-)

Written by Tony Hirst

November 9, 2012 at 6:17 pm

Posted in Data, Insight, Tinkering

Chit Chat with New Datasets – Facets in OpenRefine (Was /Google Refine/)

with 7 comments

One of the many ways of using Google OpenRefine is as a toolkit for getting a feel for the range of variation contained within a dataset using the various faceting options. In the sense of analysis being a conversation with data, this is a bit like an idle chit-chat/getting to know you phase, as a precursor to a full blown conversation.

Faceted search or faceted browsing/navigation typically provides a set of limiting search filters to a set of search results that limits or restricts the displayed results to ones that fulfil certain conditions. In a library catalogue, the facets might refer to metadata fields such as publication date, thus allowing a user to search within a given date range, or publisher:

Where the facet relates to a categorical variable – that is, where there is a set of unique values that the facet can take (such as the names of different publishers) – a view of the facet values will show the names of the different publishers extracted from the original search results. Selecting a particular publisher, for example, will then limit the displayed results to just those results associated with that publisher. For numerical facets, where the quantities associated with the facet related to a number or date (that is, a set of things that have a numerical range), the facet view will show the full range of values contained within that particular facet. The user can then select a subset of results that fall within a specified part of that range.

In the case of Open Refine, facets can be defined on a per column basis. For categorical facets, Refine will identify the set of unique values associated with a particular faceted view that are contained within a column, along with a count of how many times each facet value occurs throughout the column. The user can then choose to view only those rows with a particular (facet selected) value in the faceted column. For columns that contain numbers, Refine will generate a numerical facet that spans the range of values contained within the column, along with a histogram that provides a count of occurrences of numbers within small ranges across the full range.

So what faceting options does Google Refine provide?

Here’s how they work (data used for the examples comes from Even Wholesale Drug Dealers Can Use a Little Retargeting: Graphing, Clustering & Community Detection in Excel and Gephi and JSON import from the Twitter search API…):

- exploring the set of categories described within a column using the text facet:

Faceted views also allow you to view the facet values by occurrence count, so it’s easy to see which the most popular facet values are:

You can also get a tab separated list of facet values:

Sometimes it can be useful to view rows associated with particular facet values that occur a particular number of times, particulalry at the limits (for example, very popular facet values, or uniquely occurring facet values):

- looking at the range of numerical values contained in a column using the numeric facet:

- looking at the distribution over time of column contents using the timeline facet:

Faceting by time requires time-related strings to be parsed as such; sometimes, Refine needs a little bit of help in interpreting an imported string as a time string. So for example, given a “time” string such as Mon, 29 Oct 2012 10:56:52 +0000 from the Twitter search API, we can use the GREL function toDate(value,"EEE, dd MMM y H:m:s") to create a new column with time-cast elements.

(See GRELDateFunctions and the Java SimpleDateFormat class documentation for more details.)

- getting a feel for the correlation of values across numerical columns, and exploring those correlations further, using the scatterplot facet.

This generates a view that generates a set of scatterplots relating to pairwise combinations of all the numerical columns in the dataset:

Clicking on one of these panels allows you to filter points within a particular area of the corresponding scatter chart (click and drag a rectangular area over the points you want to view), effectively allowing you to filter the data across related ranges of two numerical columns at the same time:

A range of customisable faceting options are also provided that allow you to define your own faceting functions:

  • the Custom text… facet;
  • the Custom Numeric… facet

More conveniently, a range of predefined Customized facets are provided that provide shortcuts to “bespoke” faceting functions:

So for example:

  • the word facet splits strings contained in cells into single words, counts their occurrences throughout the column, and then lists unique words and their occurrence count in the facet panel. This faceting option thus provides a way of selecting rows where the contents of a particular column contain one or more specified words. (The user defined GREL custom text facet ngram(value,1) provides a similar (though not identical) result – duplicated words in a cell are identified as unique by the single word ngram function; see also split(value," "), which does seem to replicate the behaviour of the word facet function.)

  • the duplicates facet returns boolean values of true and false; filtering on true values returns all the rows that have duplicated values within a particular column; filtering on false displays all unique rows.
  • the text length facet produces a facet based on the character count(?) of strings in cells within the faceted column; the custom numeric facet length(value) achieves something similar; the related measure, word count, can be achieved using the custom numeric facet length(split(value," "))

Note that facet views can be combined. Selecting multiple rows within a particular facet panel provides a Boolean OR over the selected values (that is, if any of the selected values appear in the column, the corresponding rows will be displayed). To AND conditions, even within the same facet, create a separate facet panel for each ANDed condition.

PS On the OpenRefine (was Google Refine) name change, see From Freebase Gridworks to Google Refine and now OpenRefine. The code repository is now on github: OpenRefine Repository. I also notice that openrefine.org/ has been minted and is running a placeholder instance of WordPress. I wonder if it would be worth setting up an aggregator for community posts, a bit like R-Blogger (for example, I have an RStats category feed from this blog that I syndicate to the RBloggers aggregator, and have just created an OpenRefine category that could feed a OpenRefinery aggregator channel).

PPS for an example of using OpenRefine to find differences between two recordsets, see Owen Stephens’ Using Open Refine for e-journal data.

Written by Tony Hirst

November 6, 2012 at 10:39 am

Grabbing Twitter Search Results into Google Refine And Exporting Conversations into Gephi

with 9 comments

How can we get a quick snapshot of who’s talking to whom on Twitter in the context of a particular hashtag?

Here’s a quick recipe that shows how…

First we need to grab some search data. The Twitter API documentation provides us with some clues about how to construct a web address/URL that will grab results back from a particular search on Twitter in a machine readable way (that is, as data):

  • http://search.twitter.com/search.format is the base URL, and the format we require is json, which gives us http://search.twitter.com/search.json
  • the query we want is presented using the q= parameter: http://search.twitter.com/search.json?q=searchterm
  • if we want multiple search terms (for example, library skills), they need encoding in a particular way. The easiest was is just to construct your URL, enter it into the location/URL bar of your browser and hit enter, or use a service such as this string encoder. The browser should encode the URL for you. (If the only punctuation in your search phrase are spaces, you can encode them yourself: just change each space to %20, to give something like library%20skills. If you want to encode the # in a hashtag, use %23
  • We want to get back as many results as are allowed at any one time (which happens to be 100), so set rpp=100, that is: http://search.twitter.com/search.json?q=library%20skills&rpp=100
  • results are paged (in the sense of different pages of Google search results, for example), which means we can ask for the first 100 results, the second 100 results and so on as far back as the most recent 1500 tweets (page 15 for rpp=100, or page 30 if we were using rpp=50 (since 15*100 = 30*50 = 1500): http://search.twitter.com/search.json?q=library%20skills&rpp=100&page=1

Clicking on Next provides us with a dialogue that will allow us to load the data from the URLs into Google Refine:

Clicking “Configure Parsing Options” loads the data and provides us with a preview of it:

If you inspect the data that is returned, you should see it has a repeating pattern. Hovering over the various elements allows you to identify what repeating part of the result we want to import. For example, we could just import each tweet:

Or we could import all the data fields – let’s grab them all:

If you click the highlighted text, or click “Update Preview View”, you can get a preview of how the data will appear. To return to the selection view, click “Pick Record Nodes”:

“Create Project” actually generates the project and pulls all the data in… The column names are a little messy, but we can tidy those:

Look for the from_user and to_user columns and rename them source and target respectively… (hovering over a column name pops up tooltip that shows the full column name):

For the example I’m going to describe, we don’t actually need to rename the columns, but it’s handy to know how to do it;-)

We can now filter out all the rows with a “null” value in the target column. It seems a bit fiddly at first, but you soon get used to the procedure… Select the text facet to pop up a window that show the unique elements in the target column and how often they occur. Sort the list by count, and click on the “null” element – it should be highlighted and its setting should appear as “exclude”. The column will now be showing elements in the column that have the null value:

Click on the “Invert” option and the column will now filter out all the “null” elements and only show the elements that have a non-null value – that is, tweets that have a “to_user” value (which is to say, those tweets were sent to a particular user). Here’s what we get:

Let’s now export the source and target data so we can get it into Gephi:

Deselect all the columns, and then select source and target columns; also deselect the ‘output column headers’ – we don’t need headers where this file is going…

Export the custom layout as CSV data:

We can now import this data into another application – Gephi. Gephi is a cross platform package for visualising networks. In the simplest case, it can import two column data files where each row represents two things that are connected to each other. In our case, we have connections between “source” and “target” Twitter names – that is, connections that show when one Twitter user in our search sample has sent a message to another.

Launch Gephi and from the file menu, open the file you exported from Google Refine:

We’ve now got our data into Gephi, where we can start to visualise it…

…but that is a post for another day… (or if you’re impatient, you can find some examples of how to drive Gephi here).

Written by Tony Hirst

October 2, 2012 at 4:45 pm

Posted in Tinkering

Tagged with , , ,

Merging Data Sets Based on Partially Matched Data Elements

with 15 comments

A tweet from @coneee yesterday about merging two datasets using columns of data that don’t quite match got me wondering about a possible R recipe for handling partial matching. The data in question related to country names in a datafile that needed fusing with country names in a listing of ISO country codes,although the recipe I’m describing here is intended to be a general purpose partial matcher. (See the comments for alternative methods for the special case of reconciling country names with country codes.)

The original data set had the form:

RANK,ECONOMY,PERCENTAGE OF INDIVIDUALS USING THE INTERNET 2011
1,Iceland,95
2,Norway,94
3,Netherlands,92.3
4,Sweden,91
5,Luxembourg,90.9
6,Denmark,90
7,Finland,89.4
8,Qatar,86.2
9,New Zealand,86
10,Switzerland,85.2
11,Liechtenstein,85
12,S. Korea,83.8
13,Germany,83
14,Canada,83
15,United Kingdom,82
16,Antigua & Barbuda,82

I’m not sure what country code listing was being used, but it probably looked something like this list of ISO Country Codes by Country:

ANDORRA;AD
ANGOLA;AO
ANGUILLA;AI
ANTARCTICA;AQ
ANTIGUA AND BARBUDA;AG
ARGENTINA;AR
ARMENIA;AM
ARUBA;AW
AUSTRALIA;AU
AUSTRIA;AT
AZERBAIJAN;AZ
BAHAMAS;BS
BAHRAIN;BH

It’s easy enough to reduce all the country names to lowercase characters so that we can try to match them exactly without worrying about any differences in capitalisation, but how do we match country names that don’t match exactly – ANTIGUA AND BARBUDA and Antigua & Barbuda, for example, or Central African Rep. and Central African Republic?

One trick is to use one of the well known partial string matching algorithms, such as the Levenshtein Distance. Here’s a recipe I hacked together that first tries to find an exact match on country names by attempting to merge the two country lists directly, and then tries to partially match any remaining unmatched names in the original list. A signature function is constructed to help out the partial matching attempt that reduces all words in the country name to lower case, sorts them alphabetically, and then concatenates them with no adjoining spaces.

(The signature idea was cribbed from the fingerprint that is available in Google Refine and that I employed in Merging Datasets with Common Columns in Google Refine.)

[@downes commented that the code wasn't really written with clarity in mind - so I've added some comments...]

#Load in the data from the URLs:
PercentageUsingTheNet=read.csv(url('http://s.telegraph.co.uk/graphics/conrad/PercentageUsingTheNet.csv, encoding='MACROMAN'))
ccode=read.csv(url('http://www.iso.org/iso/country_names_and_code_elements_txt'),sep=';')

##Here's where the algorithm starts...
##I'm going to generate a signature from country names to reduce some of the minor differences between strings
##In this case:
### convert all characters to lower case (tolower())
### split the string into a vector (unlist()) of separate words (strsplit())
### sort the words alphabetically (sort())
### and then concatenate them with no spaces (paste(y,collapse='')).
##So for example, United Kingdom would become kingdomunited
##To extend this function, we might also remove stopwords such as 'the' and 'of', for example (not shown).
signature=function(x){
  sig=paste(sort(unlist(strsplit(tolower(x)," "))),collapse='')
  return(sig)
}

#The partialMatch function takes two wordlists as vectors (x,y) and an optional distance threshold (levDist)
#The aim is to find words in the second list (y) that match or partially match words in the first (x)
partialMatch=function(x,y,levDist=0.1){
  #Create a data framecontainind the signature for each word
  xx=data.frame(sig=sapply(x, signature),row.names=NULL)
  yy=data.frame(sig=sapply(y, signature),row.names=NULL)
  #Add the original words to the data frame too...
  xx$raw=x
  yy$raw=y
  #We only want words that have a signature...
  xx=subset(xx,subset=(sig!=''))

  #The first matching pass - are there any rows in the two lists that have exactly the same signature?
  xy=merge(xx,yy,by='sig',all=T)
  matched=subset(xy,subset=(!(is.na(raw.x)) & !(is.na(raw.y))))
  #?I think matched=xy[ complete.cases(raw.x,raw.y) ] might also work here?
  #Label the items with identical signatures as being 'Duplicate' matches
  matched$pass="Duplicate"

  #Grab the rows from the first list that were unmatched - that is, no matching item from the second list appears
  todo=subset(xy,subset=(is.na(raw.y)),select=c(sig,raw.x))
  #We've grabbed the signature and original raw text from the first list that haven't been matched up yet
  #Name the columns so we know what's what
  colnames(todo)=c('sig','raw')

  #This is the partial matching magic - agrep finds items in the second list that are within a 
  ## certain Levenshtein distance of items in the first list.
  ##Note that I'm finding the distance between signatures.
  todo$partials= as.character(sapply(todo$sig, agrep, yy$sig,max.distance = levDist,value=T))

  #Bring the original text into the partial match list based on the sig key.
  todo=merge(todo,yy,by.x='partials',by.y='sig')

  #Find the items that were actually partially matched, and pull out the columns relating to signatures and raw text
  partial.matched=subset(todo,subset=(!(is.na(raw.x)) & !(is.na(raw.y))),select=c("sig","raw.x","raw.y"))
  #Label these rows as partial match items
  partial.matched$pass="Partial"
  #Add the set of partially matched items to the set of duplicate matched items
  matched=rbind(matched,partial.matched)
  
  #Find the rows that still haven't been matched
  un.matched=subset(todo,subset=(is.na(raw.x)),select=c("sig","raw.x","raw.y"))

  #If there are any unmatched rows, add them to the list of matched rows, but labelled as such
  if (nrow(un.matched)>0){
    un.matched$pass="Unmatched"
    matched=rbind(matched,un.matched)
  }

  #Grab the columns of raw text from x and y from the matched list, along with how they were matched/are unmatched
  matched=subset(matched,select=c("raw.x","raw.y","pass"))
  #Ideally, the length of this should be the same as the length of valid rows in the original first list (x)

  return(matched)
}

#A rogue character in @coneee's data file borked things for me, so I originally needed to do a character code conversion first
#PercentageUsingTheNet$ECONOMY=iconv(PercentageUsingTheNet$ECONOMY)
#Loading the CSV with the encoding attribute set (h/t Kent Johnson) seemed to work properly though...

#Call the partial match function using two vectors
#The aim is to find items in the second vector that partially match items in the first
#The function returns the first list annotated with partial match items from the second and a match type
matches=partialMatch(PercentageUsingTheNet$ECONOMY,ccode$Country.Name)

As ever, this code was arrived at by tinkering and searching on Stack Overflow (using search terms along the lines of “partial match R” and “R levenshtein”). If you can improve on it, please do so and paste a link to the improved code, or a code fragment itself, in the comments below:-)

UPDATE: via the comments, the following suggestion that I don’t have time to check right now…
#Bring the original text into the partial match list based on the sig key.
-todo=merge(todo,yy,by.x=’partials’,by.y=’sig’)
+todo=merge(todo,yy,by.x=’partials’,by.y=’sig’,all.x=T)

#Label these rows as partial match items
-partial.matched$pass=”Partial”
+if (nrow(partial.matched) > 0) partial.matched$pass=”Partial”

Thanks:-)

When we run the script and look at the contents of the matches dataframe, this is an example of what we get:

This data frame can then be merged with the originally loaded data to give us the required country code annotations:

a=PercentageUsingTheNet
b=ccode
#Merge the original data set with the ISO country code country name keys
aa=merge(a,matches,by.x='ECONOMY',by.y='raw.x',all.x=T)
#Merge in the ISO country codes
aa=merge(aa,b,by.x='raw.y',by.y='Country.Name',all.x=T)
aa=subset(aa,select=c('ECONOMY','RANK','PERCENTAGE.OF.INDIVIDUALS.USING.THE.INTERNET.2011','ISO.3166.1.alpha.2.code'))

Something like this, for example:

Unfortunately, not every country in the original data set is matched:

In particular, here are the unmatched items and what they presumably should have been matched with:

Lao P.D.R. - LAO PEOPLE'S DEMOCRATIC REPUBLIC
Syria - SYRIAN ARAB REPUBLIC
S. Korea - KOREA, REPUBLIC OF
Bolivia - BOLIVIA, PLURINATIONAL STATE OF
Russia - RUSSIAN FEDERATION
Guinea Bissau - GUINEA-BISSAU
St. Vincent & Grenadines - SAINT VINCENT AND THE GRENADINES
S. Sudan - SOUTH SUDAN
Eq. Guinea - EQUATORIAL GUINEA
Congo (Dem. Rep.) - CONGO, THE DEMOCRATIC REPUBLIC OF THE
Congo (Kinshasa) - CONGO
Slovak Republic - SLOVAKIA
Iran - IRAN, ISLAMIC REPUBLIC OF
TFYR Macedonia - MACEDONIA, THE FORMER YUGOSLAV REPUBLIC OF

We could try to increase the Levenshtein Distance within which a partial match is suggested, but then we run the risk of introducing false positives.

Removing stopwords as part of the signature function may help, for example in the case of St. Vincent & Grenadines. (Also, I’m not sure what went wrong with Guinea Bissau?) We could presumably also make something of the abbreviations (S., Dem., Rep., Eq., P. D. R.), using these as NGrams or stubs in a regular expression? So for example, convert Lao P. D. R. to a regular expression invoking searches for a three word phrase in which consecutive words start with P, D, R. (Of course, this would require matching across the full country name rather than the signature.)

The Google Refine documentation page Clustering In Depth (Methods and theory behind the clustering functionality in Google Refine) also identifies several other fuzzy/partial matching algorithms that could be tried here. In which case, I guess I need to rewrite the above function so that it can accommodate different matching algorithms? IF you manage to implement any of these other fuzzy matching approaches, please post a link in the comments.

PS If you use any other approaches for partial matching of strings, please feel free to share how you do it in the comments below…:-) For example:
- theBioBucket: Merging Dataframes by Partly Matching String. This post also introduces adist(), an R function that returns the Levenshtein distance between two strings.

Written by Tony Hirst

September 26, 2012 at 9:00 pm

Fragment: Pondering Mapping the Pearson Network

with 3 comments

Tipped off by @brlamb to this Huffington Post story on Pearson ‘Education’ — Who Are These People? (which in turn led to this SEC filing (I found the risk assessment on pages 8-10 interesting), I started wondering about the web domains owned by Pearson.

Looking up the domain registration details for pearson.com turned up a a handful of nameservers – ns.pearson.com, ns2.pearson.com, oldtxdns2.pearsontc.com, usrxdns1.pearsontc.com – which we can use as the basis for a reverse lookup to see what other sites are registered with the same domain server (and which presumably, therefore, relate to Pearson activities).

SO for example, Gwebtools turns up a couple of thousand or so domains dangling off usrxdns1.pearsontc.com, but I couldn’t get Haystax extractor to scrape more than a single page (not used it before? Maybe I was doing something wrong? Or maybe Chrome was playing up (too many open tabs again?!). I’m also too tired right now to write a scraper – been struggling to answer ReCaptchas all night (I guess that by now they’re completely inaccessible if you have dyslexia? It often takes me 5 or 6 refreshes before I feel confident going for one!) Which is to say, if you scrape the data describing all the domains associated with each of the Pearson nameservers, please post a link to it in the comments;-)

I don’t remember if I tried grabbing Pearson data from OpenCorporates to do a corporate sprawl graph..? I guess I should try and find what trademarks they have registered too?

WHich reminds me: is there a free open source of directors listing for UK companies yet? And how’s the Lobbiests register campaign (or WhosLobbying scraping) coming on? Is there a reverse lookup by company, so for example we could look to see who reps from Pearson had been chatting to?

I wonder also if Pearson support any All Parliamentary Groups…?

PS this was handy, at first… How to Find the other Websites of a Person?

PPS See also A Gust of WInd BLows Across HE on Pearson’s VUE assessment centres being used for open online course supervised examinations.

Written by Tony Hirst

September 7, 2012 at 10:11 pm

Posted in Tinkering

Follow

Get every new post delivered to your Inbox.

Join 428 other followers