Via a trackback from Check Yo Self: 5 Things You Should Know About Data Science (Author Note) criticising tweet-mapping without further analysis (“If you’re making Gephi graphs out of tweets, you’re probably doing more data science marketing than data science analytics. And stop it. Please. I can’t take any more. … what does it gain a man to have graphs of tweets and do jack for analysis with them?”), I came across John Foreman’s Analytics Made Skeezy [uncourse] blog:
Analytics Made Skeezy is a fake blog. Each post is part of a larger narrative designed to teach a variety of analytics topics while keeping it interesting. Using a single narrative allows me to contrast various approaches within the same fake world. And ultimately that’s what this blog is about: teaching the reader when to use certain analytic tools.
Skimming through the examples described in some of the posts to date, Even Wholesale Drug Dealers Can Use a Little Retargeting: Graphing, Clustering & Community Detection in Excel and Gephi not surprisingly caught my attention. That post describes, in narrative form, how to use Excel to prepare and shape a dataset so that it can be imported into Gephi as a faux CSV file and then run through Gephi’s modularity statistic; the modularity class augmented dataset can then be exported from the Gephi Data Lab and re-presented in Excel, whereupon the judicious use of column sorting and conditional formatting is used to try to generate some sort of insight about the clusters/groups discovered in the data – apparently, “Gephi can kinda suck for giving us that kind of insight sometimes. Depends on the graph and what you’re trying to do”. And furthermore:
If you had a big dataset that you prepped into a trimmed nearest neighbors graph, keep in mind that visualizing it in Gephi is just for fun. It’s not necessary for actual insight regardless of what the scads of presentations of tweets-spreading-as-visualized-in-Gephi might tell you (gag me). You just need to do the community detection piece. You can use Gephi for that or the libraries it uses. R and python both have a package called igraph that does this stuff too. Whatever you use, you just need to get community assignments out of your large dataset so that you can run things like the aggregate analysis over them to bubble up intelligence about each group.
I don’t necessarily disagree with the implication that we often need to do more than just look at pretty pictures in Gephi to make sense of a dataset; but I do also believe that we can use Gephi in an active way to have a conversation with the data, generating some preliminary insights about the dataset that we can then explore further using other analytical techniques. So what I’ll try to do in the rest of this post is offer some suggestions about one or two ways in which we might use Gephi to start conversing with the same dataset described in the Drug Dealer Retargeting post. Before I do so, however, I suggest you read through the original post and try to come to some of your own conclusions about what the data might be telling us…
Done that? To recap, the original dataset (“Inventory”) is a list of “deals”, with columns relating to two sorts of thing: 1) attributes of a deal; 2) one column per customer (dealer) showing whether they took up that deal. A customer/customer matrix is then generated and the cosine similarity between each pair of customers calculated (note: other distance metrics are available…), showing the extent to which they participated in similar deals. Selecting the three most similar neighbours of each customer creates a “trimmed nearest neighbors graph”, which is munged into a CSV-resembling data format that Gephi can import. Gephi is then used to do a very quick/cursory (and discounted) visual analysis, and to run the modularity/clustering detection algorithm.
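To make that recap a bit more concrete, here is a minimal sketch of the cosine similarity and nearest neighbour trimming steps in plain Python. The take-up data is made up (four hypothetical customers, A to D, and four deals), the function names are my own, and I keep the two nearest neighbours rather than three because the toy dataset is so small; it is not the spreadsheet method used in the original post.

```python
from math import sqrt

# Hypothetical take-up vectors: one entry per deal, 1 if the customer took it
takeup = {
    "A": [1, 1, 0, 0],
    "B": [1, 1, 1, 0],
    "C": [0, 0, 1, 1],
    "D": [0, 0, 0, 1],
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def nearest_neighbours(data, k):
    """For each customer, list the k most similar other customers."""
    neighbours = {}
    for name, vec in data.items():
        scored = [
            (cosine_similarity(vec, other_vec), other)
            for other, other_vec in data.items()
            if other != name
        ]
        # Highest similarity first; ties are broken alphabetically
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        neighbours[name] = [other for _, other in scored[:k]]
    return neighbours

trimmed = nearest_neighbours(takeup, k=2)
for customer, nearest in trimmed.items():
    print(customer, "->", nearest)
```

Note that a customer with few overlaps can end up "nearest" to someone they share nothing with (similarity zero), which is one reason to think about a similarity threshold as well as a neighbour count.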
So how would I have attacked this dataset? (Note: IANADS, as in I am not a data scientist ;-)
One way would be to treat it from the start as defining a graph in which dealers are connected to trades. Using a slightly tidied version of the ‘Inventory’ tab from the original dataset, in which I removed the first (metadata) and last (totals) rows and tweaked one of the column names to remove the brackets (I don’t think Gephi likes brackets in attribute names?), I used the following script to generate a GraphML formatted version of just such a graph.
#Python script to generate GraphML file
#We're going to use the really handy networkx graph library: pip install networkx
import csv
import networkx as nx

#Create a directed graph object
DG = nx.DiGraph()
#Open data file and read in the rows
#(the filename is a placeholder for a CSV export of the tidied Inventory tab)
with open('inventory.csv', newline='') as f:
    reader = list(csv.DictReader(f))
#Define a variable to act as a deal node ID counter
dcid = 0
#The graph is a bimodal/bipartite graph containing two sorts of node - deals and customers
#An identifier is minted for each row, identifying the deal
#Deal attributes are used to annotate deal nodes
#Identify columns used to annotate nodes taking string values
nodeColsStr=['Offer date', 'Product', 'Origin', 'Ready for use']
#Identify columns used to annotate nodes taking numeric values
nodeColsInt=['Minimum Qty kg', 'Discount']
#The customers are treated as nodes in their own right, rather than as deal attributes
#Identify columns used to identify customers - each of these will define a customer node
customerCols=['Smith', 'Johnson', 'Williams', 'Brown', 'Jones', 'Miller', 'Davis', 'Garcia', 'Rodriguez', 'Wilson', 'Martinez', 'Anderson', 'Taylor', 'Thomas', 'Hernandez', 'Moore', 'Martin', 'Jackson', 'Thompson', 'White' ,'Lopez', 'Lee', 'Gonzalez','Harris', 'Clark', 'Lewis', 'Robinson', 'Walker', 'Perez', 'Hall', 'Young', 'Allen', 'Sanchez', 'Wright', 'King', 'Scott','Green','Baker', 'Adams', 'Nelson','Hill', 'Ramirez', 'Campbell', 'Mitchell', 'Roberts', 'Carter', 'Phillips', 'Evans', 'Turner', 'Torres', 'Parker', 'Collins', 'Edwards', 'Stewart', 'Flores', 'Morris', 'Nguyen', 'Murphy', 'Rivera', 'Cook', 'Rogers', 'Morgan', 'Peterson', 'Cooper', 'Reed', 'Bailey', 'Bell', 'Gomez', 'Kelly', 'Howard', 'Ward', 'Cox', 'Diaz', 'Richardson', 'Wood', 'Watson', 'Brooks', 'Bennett', 'Gray', 'James', 'Reyes', 'Cruz', 'Hughes', 'Price', 'Myers', 'Long', 'Foster', 'Sanders', 'Ross', 'Morales', 'Powell', 'Sullivan', 'Russell', 'Ortiz', 'Jenkins', 'Gutierrez', 'Perry', 'Butler', 'Barnes', 'Fisher']
#Create a node for each customer, and classify it as a 'customer' node type
for customer in customerCols:
    DG.add_node(customer, typ='customer')
#Each row defines a deal
for row in reader:
    #Mint an ID for the deal
    dealID = 'deal' + str(dcid)
    #Add a node for the deal, and classify it as a 'deal' node type
    DG.add_node(dealID, typ='deal')
    #Annotate the deal node with string based deal attributes
    for deal in nodeColsStr:
        DG.nodes[dealID][deal] = row[deal]
    #Annotate the deal node with numeric based deal attributes
    #(this assumes the cells contain plain whole numbers)
    for deal in nodeColsInt:
        DG.nodes[dealID][deal] = int(row[deal])
    #If the cell in a customer column is set to 1,
    ## draw an edge from the deal to the corresponding customer
    for customer in customerCols:
        if str(row[customer]).strip() == '1':
            DG.add_edge(dealID, customer)
    #Increment the node ID counter
    dcid = dcid + 1
#Write the graph out as a GraphML file (the output filename is also a placeholder)
nx.write_graphml(DG, 'inventory.graphml')
The graph we’re generating (download .graphml) has a basic structure that looks something like the following:
Which is to say, in this example customer C1 engaged in a single deal, D1; customer C2 participated in every deal, D1, D2 and D3; and customer C3 partook of deals D2 and D3.
Opening the graph file in Gephi as a directed graph, we can read off the number of actual trades from the edge count:
If we run the Average degree statistic, we can see that there are some nodes that are not connected to any other nodes (that is, they are either deals with no takers, or customers who never took part in a deal):
We can view these nodes using a filter:
We can also use the filter the other way, to exclude the unaccepted deals, and then create a new workspace containing just the deals that were taken up, and the customers that bought into them:
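If you want to sanity check these counts outside Gephi, something along the following lines should work in networkx. This is just a sketch on a made-up toy graph (three customers, four deals, one of which has no takers); the node names are my own inventions rather than anything in the real dataset.

```python
import networkx as nx

# Toy deal -> customer graph; D4 is a deal that nobody took up
DG = nx.DiGraph()
DG.add_nodes_from(["D1", "D2", "D3", "D4"], typ="deal")
DG.add_nodes_from(["C1", "C2", "C3"], typ="customer")
DG.add_edges_from([
    ("D1", "C1"), ("D1", "C2"),
    ("D2", "C2"), ("D2", "C3"),
    ("D3", "C2"), ("D3", "C3"),
])

# Each edge is one customer taking up one deal, so the edge count is the trade count
trade_count = DG.number_of_edges()

# Nodes with no edges at all: untaken deals or customers who never traded
unconnected = list(nx.isolates(DG))

# A copy of the graph containing only the nodes that took part in a trade
taken_up = DG.copy()
taken_up.remove_nodes_from(unconnected)

print(trade_count, unconnected, taken_up.number_of_nodes())
```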
The workspace selector is at the bottom of the window, on the right hand side:
(Hmmm… for some reason, the filtered graph wasn’t exported for me… the whole graph was. Bug? Fiddling with the Giant Component filter, then exporting, then running the Giant Component filter on the exported graph and cancelling it seemed to fix things… but something is not working right?)
We can now start to try out some interactive visual analysis. Firstly, let’s lay out the nodes using a force-directed layout algorithm (ForceAtlas2) that tries to position nodes so that nodes that are connected are positioned close to each other, and nodes that aren’t connected are kept apart (imagine each node as trying to repel the other nodes, with edges trying to pull them together).
Our visual perception is great at identifying spatial groupings (see, for example, the Gestalt principles, which lead to many a design trick and a bucketful of clues about how to tease data apart in a visually meaningful way…), but are they really meaningful?
At this point in the conversation we’re having with the data, I’d probably call on a statistic that tries to place connected groups of nodes into separate groups so that I could colour the nodes according to their group membership: the modularity statistic:
The modularity statistic is a randomised algorithm, so you may get different (though broadly similar) results each time you run it. In this case, it discovered six possible groupings or clusters of interconnected nodes (often, one group is a miscellany…). We can see which group each node was placed in by applying a Partition colouring:
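As far as I know, Gephi's modularity statistic is based on the Louvain method, and recent versions of networkx (2.8 or later, I believe) include a Louvain implementation too. The sketch below runs it over a made-up graph of two tight triangles joined by a single bridging edge; fixing the `seed` is what makes the result repeatable from run to run. The graph and variable names are mine, for illustration only.

```python
import networkx as nx
from networkx.algorithms.community import louvain_communities, modularity

# Two tightly knit triangles joined by one bridging edge (c - x)
G = nx.Graph()
G.add_edges_from([
    ("a", "b"), ("b", "c"), ("a", "c"),
    ("x", "y"), ("y", "z"), ("x", "z"),
    ("c", "x"),
])

# Louvain is randomised; pass a seed to get the same partition each time
communities = louvain_communities(G, seed=42)

# The modularity score says how much denser the groups are than chance would suggest
score = modularity(G, communities)

# Map each node to a group number, much like Gephi's modularity class column
group_of = {node: i for i, group in enumerate(communities) for node in group}

print(communities, round(score, 3))
```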
We see how the modularity groupings broadly map on to the visual clusters revealed by the ForceAtlas2 layout algorithm. But do the clusters relate to anything meaningful? What happens if we turn the labels on?
The green group appears to relate to Weed transactions, reds are X, Meth and Ketamine deals, and yellow is for the coke heads. So the deals do appear to cluster around different types of product.
So what else might we be able to learn? Does the Ready for Use dimension on a deal separate out at all (null nodes on this dimension relate to customers)?
We’d need to know a little bit more about what the implications of “Ready for Use” might be, but at a glance we get a feeling that the cluster on the far left is dominated by trades with large numbers of customers (there are lots of white/customer nodes), and the Coke related cluster on the right has quite a few trades (the green nodes) that aren’t ready for use. (A question that comes to mind looking at that area is: are there any customers who seem to just go for not Ready for Use trades, and what might this tell us about them if so?)
Something else we might look to is the size of the trades, and any associated discounts. Let’s colour the nodes using the Partition tool according to node type (attribute name is “typ” – nodes are deals (red) or customers (aqua)) and then size the nodes according to deal size using the Ranking display:
Small fry deals in the left hand group. Looking again at the Coke grouping, where there is a mix of small and large deals, another question we might file away is “are there customers who opt either for large or small trades?”
Let’s go back to the original colouring (via the Modularity coloured Partition; note that the random assignment of colours might change from the original colour set; right click allows you to re-randomise colours; clicking on a colour square allows you to colour select by hand) and size the nodes by OutDegree (that is, the sum total of edges outgoing from a node – remember, the graph was described as a directed graph, with edges going from deals to customers):
I have then sized the labels so that they are proportional to node size:
The node/label sizing shows which deals had plenty of takers. Sizing by InDegree shows how many deals each customer took part in:
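Because the edges run from deals to customers, out-degree is the measure that makes sense for deals (how many takers) and in-degree is the one that makes sense for customers (how many deals). Here's a quick networkx sketch of that distinction on the same sort of made-up toy graph used to describe the structure earlier; the node names are hypothetical.

```python
import networkx as nx

# Toy directed graph with edges running from deals to customers
DG = nx.DiGraph()
DG.add_edges_from([
    ("D1", "C1"), ("D1", "C2"),
    ("D2", "C2"), ("D2", "C3"),
    ("D3", "C2"), ("D3", "C3"),
])

# Out-degree of a deal node: how many customers took the deal
takers_per_deal = {n: d for n, d in DG.out_degree() if n.startswith("D")}

# In-degree of a customer node: how many deals the customer took part in
deals_per_customer = {n: d for n, d in DG.in_degree() if n.startswith("C")}

print(takers_per_deal)
print(deals_per_customer)
```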
This is quite a cluttered view… returning to the Layout panel, we can use the Expansion layout to stretch out the whole layout, as well as the Label Adjust tool to jiggle nodes so that the labels don’t overlap. Note that you can also click on a node to drag it around, or a group of nodes by increasing the “field of view” of the mouse cursor:
Here’s how I tweaked the layout by expanding the layout then adjusting the labels…:
(One of the things we might be tempted to do is filter out the users who only engaged in one or two deals, perhaps as a way of identifying regular customers; of course, a user may only engage in a single, but very large deal, so we’d need to think carefully about what question we were actually asking when making such a choice. For example, we might also be interested in looking for customers engaging in infrequent large trades, which would require a different analysis strategy.)
Insofar as it goes, this isn’t really very interesting – what might be more compelling would be data relating to who was dealing with whom, but that isn’t immediately available. What we should be able to do, though, is see which customers are related by virtue of partaking of the same deals, and see which deals are related by virtue of being dealt to the same customers. We can maybe kid ourselves into thinking we can see this in the customer-deal graph, but we can be a little bit more rigorous by constructing two new graphs: one that shows edges between deals that share one or more common customers; and one that shows edges between customers who shared one or more of the same deals.
Recalling the “bimodal”/bipartite graph above:
that means we should be able to generate unimodal graphs along the following lines:
D1 is connected to D2 and D3 through customer C2 (that is, an edge exists between D1 and D2, and another edge between D1 and D3). D2 and D3 are joined together through two routes, C2 and C3. We might thus weight the edge between D2 and D3 as being heavier, or more significant, than the edge between either D1 and D2, or D1 and D3.
And for the customers?
C1 is connected to C2 through deal D1. C2 and C3 are connected by a heavier weighted edge reflecting the fact that they both took part in deals D2 and D3.
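The same collapsing trick can be tried out in networkx using its bipartite projection helper. The sketch below rebuilds the three customer, three deal example as an undirected graph and projects it both ways; the edge weight on each projected edge is the number of shared neighbours. Treating the graph as undirected here is my simplification, made for the sake of the projection.

```python
import networkx as nx
from networkx.algorithms import bipartite

deals = ["D1", "D2", "D3"]
customers = ["C1", "C2", "C3"]

# Undirected version of the toy customer-deal graph
B = nx.Graph()
B.add_nodes_from(deals, bipartite=0)
B.add_nodes_from(customers, bipartite=1)
B.add_edges_from([
    ("D1", "C1"), ("D1", "C2"),
    ("D2", "C2"), ("D2", "C3"),
    ("D3", "C2"), ("D3", "C3"),
])

# Deal-deal graph: weight = number of customers the two deals share
deal_graph = bipartite.weighted_projected_graph(B, deals)

# Customer-customer graph: weight = number of deals the two customers share
customer_graph = bipartite.weighted_projected_graph(B, customers)

for u, v, w in deal_graph.edges(data="weight"):
    print(u, v, w)
for u, v, w in customer_graph.edges(data="weight"):
    print(u, v, w)
```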
You will hopefully be able to imagine how more complex customer-deal graphs might collapse into customer-customer or deal-deal graphs where there are multiple, disconnected (or only very weakly connected) groups of customers (or deals) based on the fact that there are sets of deals that do not share any common customers at all, for example. (As an exercise, try coming up with some customer-deal graphs and then “collapsing” them to customer-customer or deal-deal graphs that have disconnected components.)
So can we generate graphs of this sort using Gephi? Well, it just so happens we can, using the Multimode Networks Projection tool. To start with let’s generate another couple of workspaces containing the original graph, minus the deals that had no customers. Selecting one of these workspaces, we can now generate the deal-deal (via common customer) graph:
When we run the projection, the graph is mapped onto a deal-deal graph:
The thickness of the edges describes the number of customers any two deals shared.
If we run the modularity statistic over the deal-deal graph and colour the graph by the modularity partition, we can see how the deals are grouped by virtue of having shared customers:
If we then filter the graph on edge thickness so that we only show edges with a thickness of three or more (three shared customers), we can see how some of the deal types look as if they are grouped around particular social communities (i.e. they are supplied to the same set of people):
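An edge weight filter of this kind is easy enough to mimic in networkx: keep only the edges whose weight meets a threshold and let any nodes left without edges drop away. The weights below are made up for a handful of hypothetical deals; they are not taken from the real projection.

```python
import networkx as nx

# Hypothetical deal-deal graph; weight = number of shared customers
G = nx.Graph()
G.add_weighted_edges_from([
    ("D1", "D2", 1),
    ("D2", "D3", 4),
    ("D3", "D4", 3),
    ("D1", "D4", 2),
])

THRESHOLD = 3  # only keep edges representing three or more shared customers

# Build a new graph from just the heavy edges; unconnected nodes are not carried over
strong = nx.Graph()
strong.add_edges_from(
    (u, v, data) for u, v, data in G.edges(data=True)
    if data["weight"] >= THRESHOLD
)

print(sorted(strong.nodes()), strong.number_of_edges())
```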
If we now go to the other workspace we created containing the original (less unsatisfied deals) graph, we can generate the customer-customer projection:
Run the modularity statistic and recolour:
Whilst there is a lot to be said for maintaining the spatial layout so that we can compare different plots, we might be tempted to rerun the layout algorithm to see if it highlights the structural associations any more clearly. In this case, there isn’t much difference:
If we run the Network diameter tool, we can generate some network statistics over this customer-customer network:
If we now size the nodes by betweenness centrality, size the labels in proportion to node size, and use the expand/label overlap layout tools to tweak the display, here’s what we get:
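For a feel of what betweenness centrality is picking up on, here's a small networkx sketch over a made-up network in which a single node (which I've called "bridge") sits between two otherwise separate triangles. The network is invented for illustration and has nothing to do with the real customer data.

```python
import networkx as nx

# Two triangles that can only reach each other via the "bridge" node
G = nx.Graph()
G.add_edges_from([
    ("a", "b"), ("b", "c"), ("a", "c"),
    ("c", "bridge"), ("bridge", "x"),
    ("x", "y"), ("y", "z"), ("x", "z"),
])

# Longest shortest path in the network (one of the Network diameter outputs)
diameter = nx.diameter(G)

# Normalised share of shortest paths that pass through each node
betweenness = nx.betweenness_centrality(G)

# The node that most shortest paths pass through
most_central = max(betweenness, key=betweenness.get)

print(diameter, most_central, round(betweenness[most_central], 2))
```

Nodes that span clusters in this way tend to score highly, which is why sizing by betweenness makes them stand out.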
Thompson looks to be an interesting character, spanning the various clusters… but what deals is he actually engaging in? If we go back to the original customer-deal graph, we can use an ego filter to see:
To look for actual social groupings, we might filter the network based on edge weight, for example to show only edges above a particular weight (that is, number of shared deals), and then drop this set into a new workspace. If we then run the Average Degree statistic, we can calculate the degree of nodes in this graph, and size nodes accordingly. Re-laying out the graph shows us some coarse social networks based on significant numbers of shared trades:
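An ego filter at depth one just pulls out a node and its immediate neighbours. A rough networkx equivalent is sketched below on the made-up toy graph, picking out hypothetical customer C3; because the edges point from deals to customers, I ask for the neighbourhood to be found ignoring edge direction.

```python
import networkx as nx

# Toy directed graph with edges running from deals to customers
DG = nx.DiGraph()
DG.add_edges_from([
    ("D1", "C1"), ("D1", "C2"),
    ("D2", "C2"), ("D2", "C3"),
    ("D3", "C2"), ("D3", "C3"),
])

# Ego network of C3: the node plus everything one hop away in either direction
ego = nx.ego_graph(DG, "C3", radius=1, undirected=True)

# The deals C3 took part in are the nodes with an edge pointing at C3
deals_taken = sorted(DG.predecessors("C3"))

print(sorted(ego.nodes()), deals_taken)
```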
Hopefully by now you are starting to “see” how we can start to have a visual conversation with the data, asking different questions of it based on things we are learning about it. Whilst we may need to actually look at the numbers (and Gephi’s Data Laboratory tab allows us to do that), I find that visual exploration can provide a quick way of orienting (orientating?) yourself with respect to a particular dataset, and getting a feel for the sorts of questions you might ask of it, questions that might well involve a detailed consideration of the actual numbers themselves. But for starters, the visual route often works for me…
PS There is a link to the graph file here, so if you want to try exploring it for yourself, you can do so:-)