Archive for the ‘Insight’ Category
Via a trackback from Check Yo Self: 5 Things You Should Know About Data Science (Author Note) criticising tweet-mapping without further analysis (“If you’re making Gephi graphs out of tweets, you’re probably doing more data science marketing than data science analytics. And stop it. Please. I can’t take any more. … what does it gain a man to have graphs of tweets and do jack for analysis with them?”), I came across John Foreman’s Analytics Made Skeezy [uncourse] blog:
Analytics Made Skeezy is a fake blog. Each post is part of a larger narrative designed to teach a variety of analytics topics while keeping it interesting. Using a single narrative allows me to contrast various approaches within the same fake world. And ultimately that’s what this blog is about: teaching the reader when to use certain analytic tools.
Skimming through the examples described in some of the posts to date, Even Wholesale Drug Dealers Can Use a Little Retargeting: Graphing, Clustering & Community Detection in Excel and Gephi not surprisingly caught my attention. That post describes, in narrative form, how to use Excel to prepare and shape a dataset so that it can be imported into Gephi as a faux CSV file and then run through Gephi’s modularity statistic; the modularity class augmented dataset can then be exported from the Gephi Data Lab and re-presented in Excel, whereupon the judicious use of column sorting and conditional formatting is used to try to generate some sort of insight about the clusters/groups discovered in the data – apparently, “Gephi can kinda suck for giving us that kind of insight sometimes. Depends on the graph and what you’re trying to do”. And furthermore:
If you had a big dataset that you prepped into a trimmed nearest neighbors graph, keep in mind that visualizing it in Gephi is just for fun. It’s not necessary for actual insight regardless of what the scads of presentations of tweets-spreading-as-visualized-in-Gephi might tell you (gag me). You just need to do the community detection piece. You can use Gephi for that or the libraries it uses. R and python both have a package called igraph that does this stuff too. Whatever you use, you just need to get community assignments out of your large dataset so that you can run things like the aggregate analysis over them to bubble up intelligence about each group.
I don’t necessarily disagree with the implication that we often need to do more than just look at pretty pictures in Gephi to make sense of a dataset; but I do also believe that we can use Gephi in an active way to have a conversation with the data, generating some sort of preliminary insights about the dataset that we can then explore further using other analytical techniques. So what I’ll try to do in the rest of this post is offer some suggestions about one or two ways in which we might use Gephi to start conversing with the same dataset described in the Drug Dealer Retargeting post. Before I do so, however, I suggest you read through the original post and try to come to some of your own conclusions about what the data might be telling us…
Done that? To recap, the original dataset (“Inventory”) is a list of “deals”, with columns relating to two sorts of thing: 1) attributes of a deal; 2) one column per dealer showing whether they took up that deal. A customer/customer matrix is then generated and the cosine similarity between each pair of customers is calculated (note: other distance metrics are available…), showing the extent to which they participated in similar deals. Selecting the three most similar neighbours of each customer creates a “trimmed nearest neighbors graph”, which is munged into a CSV-resembling data format that Gephi can import. Gephi is then used to do a very quick/cursory (and discounted) visual analysis, and run the modularity/clustering detection algorithm.
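To make the recap a little more concrete, here is a minimal sketch of the cosine similarity and "keep the k nearest neighbours" steps in plain Python. The toy customers and their 0/1 deal vectors are made up for illustration, as are the function names; dropping neighbours with zero similarity and breaking ties alphabetically are assumptions of this sketch rather than anything taken from the original spreadsheet.

```python
from math import sqrt

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a customer with no deals is similar to nobody
    return dot / (norm_a * norm_b)

def trimmed_neighbours(vectors, k=3):
    """Map each customer to (at most) its k most similar other customers."""
    graph = {}
    for name, vec in vectors.items():
        scored = [(cosine_similarity(vec, other), other_name)
                  for other_name, other in vectors.items()
                  if other_name != name]
        # Highest similarity first; ties broken alphabetically (an assumption)
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        graph[name] = [other_name for score, other_name in scored[:k]
                       if score > 0]
    return graph

# Hypothetical toy data: one 0/1 entry per deal (D1, D2, D3)
toy = {
    "C1": [1, 0, 0],
    "C2": [1, 1, 1],
    "C3": [0, 1, 1],
}
neighbours = trimmed_neighbours(toy, k=1)
```

With k=3 and the real customer columns, the resulting dictionary is the edge list of the trimmed nearest neighbours graph.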
So how would I have attacked this dataset? (Note: IANADS, I am not a data scientist ;-))
One way would be to treat it from the start as defining a graph in which dealers are connected to trades. Using a slightly tidied version of the ‘inventory’ tab from the original dataset in which I removed the first (metadata) and last (totals) rows, and tweaked one of the column names to remove the brackets (I don’t think Gephi likes brackets in attribute names?), I used the following script to generate a GraphML formatted version of just such a graph.
#Python script to generate GraphML file
import csv
#We're going to use the really handy networkx graph library: pip install networkx
import networkx as nx

#Create a directed graph object
DG=nx.DiGraph()

#Open data file (newline='' lets the csv module handle the line endings itself)
reader=csv.DictReader(open("inventory.csv",newline=''))

#Define a variable to act as a deal node ID counter
dcid=0

#The graph is a bimodal/bipartite graph containing two sorts of node - deals and customers
#An identifier is minted for each row, identifying the deal
#Deal attributes are used to annotate deal nodes
#Identify columns used to annotate nodes taking string values
nodeColsStr=['Offer date', 'Product', 'Origin', 'Ready for use']
#Identify columns used to annotate nodes taking numeric values
nodeColsInt=['Minimum Qty kg', 'Discount']

#The customers are treated as nodes in their own right, rather than as deal attributes
#Identify columns used to identify customers - each of these will define a customer node
customerCols=['Smith', 'Johnson', 'Williams', 'Brown', 'Jones', 'Miller', 'Davis', 'Garcia', 'Rodriguez', 'Wilson',
    'Martinez', 'Anderson', 'Taylor', 'Thomas', 'Hernandez', 'Moore', 'Martin', 'Jackson', 'Thompson', 'White',
    'Lopez', 'Lee', 'Gonzalez', 'Harris', 'Clark', 'Lewis', 'Robinson', 'Walker', 'Perez', 'Hall',
    'Young', 'Allen', 'Sanchez', 'Wright', 'King', 'Scott', 'Green', 'Baker', 'Adams', 'Nelson',
    'Hill', 'Ramirez', 'Campbell', 'Mitchell', 'Roberts', 'Carter', 'Phillips', 'Evans', 'Turner', 'Torres',
    'Parker', 'Collins', 'Edwards', 'Stewart', 'Flores', 'Morris', 'Nguyen', 'Murphy', 'Rivera', 'Cook',
    'Rogers', 'Morgan', 'Peterson', 'Cooper', 'Reed', 'Bailey', 'Bell', 'Gomez', 'Kelly', 'Howard',
    'Ward', 'Cox', 'Diaz', 'Richardson', 'Wood', 'Watson', 'Brooks', 'Bennett', 'Gray', 'James',
    'Reyes', 'Cruz', 'Hughes', 'Price', 'Myers', 'Long', 'Foster', 'Sanders', 'Ross', 'Morales',
    'Powell', 'Sullivan', 'Russell', 'Ortiz', 'Jenkins', 'Gutierrez', 'Perry', 'Butler', 'Barnes', 'Fisher']

#Create a node for each customer, and classify it as a 'customer' node type
for customer in customerCols:
    DG.add_node(customer,typ="customer")

#Each row defines a deal
for row in reader:
    #Mint an ID for the deal
    dealID='deal'+str(dcid)
    #Add a node for the deal, and classify it as a 'deal' node type
    DG.add_node(dealID,typ='deal')
    #Annotate the deal node with string based deal attributes
    #(DG.nodes is the current networkx spelling; older releases used DG.node)
    for deal in nodeColsStr:
        DG.nodes[dealID][deal]=row[deal]
    #Annotate the deal node with numeric based deal attributes
    for deal in nodeColsInt:
        DG.nodes[dealID][deal]=int(row[deal])
    #If the cell in a customer column is set to 1,
    ## draw an edge between that customer and the corresponding deal
    for customer in customerCols:
        if str(row[customer])=='1':
            DG.add_edge(dealID,customer)
    #Increment the node ID counter
    dcid=dcid+1

#write graph
nx.write_graphml(DG,"inventory.graphml")
The graph we’re generating (download .graphml) has a basic structure that looks something like the following:
Which is to say, in this example customer C1 engaged in a single deal, D1; customer C2 participated in every deal, D1, D2 and D3; and customer C3 partook of deals D2 and D3.
Opening the graph file into Gephi as a directed graph, we get a count of the number of actual trades there were from the edge count:
If we run the Average degree statistic, we can see that there are some nodes that are not connected to any other nodes (that is, they are either deals with no takers, or customers who never took part in a deal):
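As a rough illustration of what the edge count and the "unconnected nodes" check amount to, here is a small plain-Python sketch over a made-up edge list. The node names, including the extra deal D4 and customer C4 that nobody is connected to, are hypothetical; this is not how Gephi computes its statistics internally, just the same idea in miniature.

```python
def degrees(nodes, edges):
    """Total degree (in plus out) of every node, including unconnected ones."""
    deg = {node: 0 for node in nodes}
    for source, target in edges:
        deg[source] += 1
        deg[target] += 1
    return deg

# Hypothetical toy graph: edges run from deal to customer
nodes = ["D1", "D2", "D3", "D4", "C1", "C2", "C3", "C4"]
edges = [("D1", "C1"), ("D1", "C2"), ("D2", "C2"),
         ("D3", "C2"), ("D2", "C3"), ("D3", "C3")]

trade_count = len(edges)  # one edge per deal actually taken up
deg = degrees(nodes, edges)
# Deals with no takers, or customers who never took part in a deal
isolates = sorted(node for node, d in deg.items() if d == 0)
```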
We can view these nodes using a filter:
We can also use the filter the other way, to exclude the unaccepted deals, and then create a new workspace containing just the deals that were taken up, and the customers that bought into them:
The workspace selector is at the bottom of the window, on the right hand side:
(Hmmm… for some reason, the filtered graph wasn’t exported for me… the whole graph was. Bug? Fiddling with the Giant Component filter, then exporting, then running the Giant Component filter on the exported graph and cancelling it seemed to fix things… but something is not working right?)
We can now start to try out some interactive visual analysis. Firstly, let’s lay out the nodes using a force-directed layout algorithm (ForceAtlas2) that tries to position nodes so that nodes that are connected are positioned close to each other, and nodes that aren’t connected are kept apart (imagine each node as trying to repel the other nodes, with edges trying to pull them together).
Our visual perception is great at identifying spatial groupings (see, for example, the Gestalt principles, which lead to many a design trick and a bucketful of clues about how to tease data apart in a visually meaningful way…), but are they really meaningful?
At this point in the conversation we’re having with the data, I’d probably call on a statistic that tries to place connected groups of nodes into separate groups so that I could colour the nodes according to their group membership: the modularity statistic:
The modularity statistic is a randomised algorithm, so you may get different (though broadly similar) results each time you run it. In this case, it discovered six possible groupings or clusters of interconnected nodes (often, one group is a miscellany…). We can see which group each node was placed in by applying a Partition colouring:
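Gephi’s Modularity statistic is based on the Louvain method, which searches for a partition with a high modularity score. The sketch below does not reproduce that search; it only computes the score itself for a partition that is already given, treating the graph as undirected and unweighted, which is enough to see why some groupings count as "better" than others. The function name and the toy graph (two triangles joined by a single edge) are assumptions for illustration.

```python
def modularity(edges, community):
    """Modularity Q of a partition of an undirected, unweighted graph.

    Q = sum over communities of (internal_edges / m) - (degree_sum / 2m)^2
    where m is the total number of edges.
    """
    m = len(edges)
    internal = {}    # edges with both ends in the same community
    degree_sum = {}  # total degree of the nodes in each community
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(internal.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree_sum.items())

# Hypothetical toy graph: two triangles joined by the edge c-d
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
two_groups = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_group = {node: 0 for node in "abcdef"}

q_split = modularity(edges, two_groups)
q_single = modularity(edges, one_group)
```

Splitting the toy graph into its two triangles scores higher than lumping everything into one group, which is the sort of preference the clustering search exploits.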
We see how the modularity groupings broadly map on to the visual clusters revealed by the ForceAtlas2 layout algorithm. But do the clusters relate to anything meaningful? What happens if we turn the labels on?
The green group appear to relate to Weed transactions, reds are X, Meth and Ketamine deals, and yellow for the coke heads. So the deals do appear to cluster around different types of deal.
So what else might we be able to learn? Does the Ready for Use dimension on a deal separate out at all (null nodes on this dimension relate to customers)?
We’d need to know a little bit more about what the implications of “Ready for Use” might be, but at a glance we get a feeling that the cluster on the far left is dominated by trades with large numbers of customers (there are lots of white/customer nodes), and the Coke related cluster on the right has quite a few trades (the green nodes) that aren’t ready for use. (A question that comes to mind looking at that area is: are there any customers who seem to just go for not Ready for Use trades, and what might this tell us about them if so?)
Something else we might look to is the size of the trades, and any associated discounts. Let’s colour the nodes using the Partition tool according to node type (attribute name is “typ” – nodes are deals (red) or customers (aqua)) and then size the nodes according to deal size using the Ranking display:
Small fry deals in the left hand group. Looking again at the Coke grouping, where there is a mix of small and large deals, another question we might file away is “are there customers who opt either for large or small trades?”
Let’s go back to the original colouring (via the Modularity coloured Partition; note that the random assignment of colours might change from the original colour set; right click allows you to re-randomise colours; clicking on a colour square allows you to colour select by hand) and size the nodes by OutDegree (that is, the sum total of edges outgoing from a node – remember, the graph was described as a directed graph, with edges going from deals to customers):
I have then sized the labels so that they are proportional to node size:
The node/label sizing shows which deals had plenty of takers. Sizing by InDegree instead shows how many deals each customer took part in (customers sit at the receiving end of the deal-to-customer edges):
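Because the edges run from deals to customers, the two degree measures answer different questions: a deal’s out-degree counts its takers, and a customer’s in-degree counts the deals they took part in. A minimal sketch with a made-up edge list (the node names are hypothetical):

```python
from collections import Counter

# Hypothetical toy graph: each edge is (deal, customer)
edges = [("D1", "C1"), ("D1", "C2"), ("D2", "C2"),
         ("D3", "C2"), ("D2", "C3"), ("D3", "C3")]

# Number of customers who took up each deal
out_degree = Counter(deal for deal, _ in edges)
# Number of deals each customer took part in
in_degree = Counter(customer for _, customer in edges)
```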
This is quite a cluttered view… returning to the Layout panel, we can use the Expansion layout to stretch out the whole layout, as well as the Label Adjust tool to jiggle nodes so that the labels don’t overlap. Note that you can also click on a node to drag it around, or a group of nodes by increasing the “field of view” of the mouse cursor:
Here’s how I tweaked the layout by expanding the layout then adjusting the labels…:
(One of the things we might be tempted to do is filter out the users who only engaged in one or two deals, perhaps as a way of identifying regular customers; of course, a user may only engage in a single, but very large deal, so we’d need to think carefully about what question we were actually asking when making such a choice. For example, we might also be interested in looking for customers engaging in infrequent large trades, which would require a different analysis strategy.)
Insofar as it goes, this isn’t really very interesting – what might be more compelling would be data relating to who was dealing with whom, but that isn’t immediately available. What we should be able to do, though, is see which customers are related by virtue of partaking of the same deals, and see which deals are related by virtue of being dealt to the same customers. We can maybe kid ourselves into thinking we can see this in the customer-deal graph, but we can be a little bit more rigorous by constructing two new graphs: one that shows edges between deals that share one or more common customers; and one that shows edges between customers who shared one or more of the same deals.
Recalling the “bimodal”/bipartite graph above:
that means we should be able to generate unimodal graphs along the following lines:
D1 is connected to D2 and D3 through customer C2 (that is, an edge exists between D1 and D2, and another edge between D1 and D3). D2 and D3 are joined together through two routes, C2 and C3. We might thus weight the edge between D2 and D3 as being heavier, or more significant, than the edge between either D1 and D2, or D1 and D3.
And for the customers?
C1 is connected to C2 through deal D1. C2 and C3 are connected by a heavier weighted edge reflecting the fact that they both took part in deals D2 and D3.
You will hopefully be able to imagine how more complex customer-deal graphs might collapse into customer-customer or deal-deal graphs where there are multiple, disconnected (or only very weakly connected) groups of customers (or deals) based on the fact that there are sets of deals that do not share any common customers at all, for example. (As an exercise, try coming up with some customer-deal graphs and then “collapsing” them to customer-customer or deal-deal graphs that have disconnected components.)
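The collapsing step described above can be sketched in a few lines of plain Python, using the C1/C2/C3 and D1/D2/D3 example from the text. The `project` function and its `onto` argument are names invented for this sketch; the edge weight is simply the count of shared customers (or shared deals).

```python
from collections import defaultdict
from itertools import combinations

def project(edges, onto="deal"):
    """Collapse (deal, customer) edges onto one node type.

    Returns {(a, b): weight}, where weight is the number of customers
    two deals share (onto="deal") or the number of deals two customers
    share (onto="customer").
    """
    groups = defaultdict(set)
    for deal, customer in edges:
        if onto == "deal":
            groups[customer].add(deal)   # deals reached via this customer
        else:
            groups[deal].add(customer)   # customers reached via this deal
    weights = defaultdict(int)
    for members in groups.values():
        for a, b in combinations(sorted(members), 2):
            weights[(a, b)] += 1
    return dict(weights)

# The example from the text: C1 took D1; C2 took D1, D2 and D3; C3 took D2 and D3
edges = [("D1", "C1"), ("D1", "C2"), ("D2", "C2"),
         ("D3", "C2"), ("D2", "C3"), ("D3", "C3")]

deal_deal = project(edges, onto="deal")
customer_customer = project(edges, onto="customer")
```

Pairs that never co-occur simply do not appear in the result, which is how disconnected components show up in the collapsed graphs.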
So can we generate graphs of this sort using Gephi? Well, it just so happens we can, using the Multimode Networks Projection tool. To start with let’s generate another couple of workspaces containing the original graph, minus the deals that had no customers. Selecting one of these workspaces, we can now generate the deal-deal (via common customer) graph:
When we run the projection, the graph is mapped onto a deal-deal graph:
The thickness of the edges describes the number of customers any two deals shared.
If we run the modularity statistic over the deal-deal graph and colour the graph by the modularity partition, we can see how the deals are grouped by virtue of having shared customers:
If we then filter the graph on edge thickness so that we only show edges with a thickness of three or more (three shared customers) we can see how some of the deal types look as if they are grouped around particular social communities (i.e. they are supplied to the same set of people):
If we now go to the other workspace we created containing the original (less unsatisfied deals) graph, we can generate the customer-customer projection:
Run the modularity statistic and recolour:
Whilst there is a lot to be said for maintaining the spatial layout so that we can compare different plots, we might be tempted to rerun the layout algorithm to see if it highlights the structural associations any more clearly. In this case, there isn’t much difference:
If we run the Network diameter tool, we can generate some network statistics over this customer-customer network:
If we now size the nodes by betweenness centrality, size labels proportional to node size, and use the expand/label overlap layout tools to tweak the display, here’s what we get:
Thompson looks to be an interesting character, spanning the various clusters… but what deals is he actually engaging in? If we go back to the original customer-deal graph, we can use an ego filter to see:
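An ego filter keeps a chosen node plus everything within a given number of hops of it. The sketch below is one simple way of expressing that idea in plain Python; ignoring edge direction when walking outwards is an assumption of the sketch, and the function name and toy data are made up for illustration.

```python
from collections import defaultdict

def ego_network(edges, ego, depth=1):
    """Edges among the nodes within `depth` hops of `ego` (direction ignored)."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen = {ego}
    frontier = {ego}
    for _ in range(depth):
        # Step one hop outwards, keeping only nodes not visited before
        frontier = {n for node in frontier for n in adjacency[node]} - seen
        seen |= frontier
    return [(a, b) for a, b in edges if a in seen and b in seen]

# Hypothetical toy graph: each edge is (deal, customer)
edges = [("D1", "C1"), ("D1", "C2"), ("D2", "C2"),
         ("D3", "C2"), ("D2", "C3"), ("D3", "C3")]

deals_for_c3 = ego_network(edges, "C3", depth=1)
two_hops = ego_network(edges, "C3", depth=2)
```

At depth 1 on the customer-deal graph this returns just the deals a customer took part in; at depth 2 it also pulls in the other customers who shared those deals.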
To look for actual social groupings, we might filter the network based on edge weight, for example to show only edges above a particular weight (that is, number of shared deals), and then drop this set into a new workspace. If we then run the Average Degree statistic, we can calculate the degree of nodes in this graph, and size nodes accordingly. Laying out the graph again shows us some coarse social networks based on significant numbers of shared trades:
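The filter-then-measure step can be sketched as follows, assuming the customer-customer graph is held as a dictionary of weighted pairs. The weights below are invented, and the threshold is treated as inclusive ("at least this many shared deals"), which is a choice made for this sketch.

```python
def filter_by_weight(weights, minimum):
    """Keep only edges whose weight is at least `minimum`."""
    return {pair: w for pair, w in weights.items() if w >= minimum}

def degree(weights):
    """Number of remaining edges touching each node."""
    deg = {}
    for a, b in weights:
        deg[a] = deg.get(a, 0) + 1
        deg[b] = deg.get(b, 0) + 1
    return deg

# Hypothetical customer-customer weights (number of shared deals)
shared_deals = {("C1", "C2"): 1, ("C2", "C3"): 2,
                ("C3", "C4"): 3, ("C2", "C4"): 2}

strong = filter_by_weight(shared_deals, minimum=2)
strong_degree = degree(strong)
```

Customers whose only ties fall below the threshold drop out of the degree table altogether, much as they would disappear from the filtered workspace.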
Hopefully by now you are starting to “see” how we can start to have a visual conversation with the data, asking different questions of it based on things we are learning about it. Whilst we may need to actually look at the numbers (and Gephi’s Data Laboratory tab allows us to do that), I find that visual exploration can provide a quick way of orienting (orientating?) yourself with respect to a particular dataset, and getting a feel for the sorts of questions you might ask of it, questions that might well involve a detailed consideration of the actual numbers themselves. But for starters, the visual route often works for me…
PS There is a link to the graph file here, so if you want to try exploring it for yourself, you can do so:-)
Picking up on Tinkering with the Guardian Platform API – Tag Signals, here’s a variant of the same thing using the New York Times Article Search API.
As with the Guardian OpenPlatform demo, the idea is to look up recent articles containing a particular search term, find the tag(s) used to describe the articles, and then graph them. The idea behind this approach is to get a quick snapshot of how the search target is represented, or positioned, in this case, by the New York Times.
Here is the code. The main differences compared to the Guardian API gist are as follows:
– a hacked recipe for getting several paged results back; I really need to sort this out properly, just as I need to generalise the code so it will work with either the Guardian or the NYT API, but that’s for another day now…
– the use of NetworkX as a way of representing the undirected tag-tag graph;
Note that the D3 Python library generates a vanilla force directed layout diagram. In the end, I just grabbed the tiny bit of code that loads the JSON data into a D3 network, and then used it along with the code behind the rather more beautiful network layout used for this Mobile Patent Suits visualisation.
Here’s a snapshot of the result of a search for recent articles on the phrase Gates Foundation:
At this point, it’s probably worth pointing out that the Python script generates the graph file, and then the d3.js library generates the graph visualisation within a web browser. There is no human hand (other than force layout parameter setting) involved in the layout. I guess with the tweaking of a few parameters, maybe juggling the force layout parameters a little more, I could get an even clearer layout. It might also be worth trying to find a way of sizing, or at least colouring, the nodes according to degree (or even better, weighted degree?). I also need to find a way, if possible, of representing the weight of edges if the D3 Python library actually exports this (or if it exports multiple edges between the same two nodes).
Anyway, for an hour or so’s tinkering, it’s quite a neat little recipe to be able to add to my toolkit. Here’s how it works again: Python script calls NYT Article Search API and grabs articles based on a search term. Grab the tags used to describe each article and build up a graph using NetworkX that connects any and all tags that are mentioned on the same article. Dump the graph from its NetworkX representation as a JSON file using the D3 library, then use the D3 Patent Suits layout recipe to render it in the browser :-)
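The middle of that recipe (tags in, co-occurrence graph out, JSON for the browser) can be sketched without touching the API at all. The article tag lists below are invented, the function names are hypothetical, and the nodes/links JSON shape is an assumption loosely following the layout commonly seen in D3 force-directed examples rather than the output of the D3 Python library mentioned above.

```python
import json
from collections import Counter
from itertools import combinations

def tag_cooccurrence(articles):
    """Count how often each pair of tags appears on the same article."""
    weights = Counter()
    for tags in articles:
        for a, b in combinations(sorted(set(tags)), 2):
            weights[(a, b)] += 1
    return weights

def to_d3_json(weights):
    """Serialise the weighted tag graph as a nodes/links JSON document."""
    names = sorted({tag for pair in weights for tag in pair})
    index = {name: i for i, name in enumerate(names)}
    return json.dumps({
        "nodes": [{"name": name} for name in names],
        "links": [{"source": index[a], "target": index[b], "value": w}
                  for (a, b), w in sorted(weights.items())],
    })

# Hypothetical tag lists, one per article returned by a search
articles = [
    ["Philanthropy", "Microsoft", "Health"],
    ["Philanthropy", "Health"],
    ["Microsoft"],
]
weights = tag_cooccurrence(articles)
graph_json = to_d3_json(weights)
```

Carrying the pair count through as a "value" on each link is one way of keeping the edge weight available to the layout code.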
Now all I have to do is find out how I can grab an SVG dump of the network from a browser into a shareable file…
Given a company or personal name, what’s a quick way of generating meaningful tags around what it’s publicly known for, or associated with?
Over the last couple of weeks or so, I’ve been doodling around a few ideas with Miguel Andres-Clavera from the JWT (London) Innovation Lab looking for novel ways of working out how brands and companies seem to be positioned by virtue of their social media followers, as well as their press mentions.
Here’s a quick review of one of those doodles: looking up tags associated with Guardian news articles that mention a particular search term (such as a company, or personal name) as a way of getting a crude snapshot of how the Guardian ‘positions’ that referent in its news articles.
It’s been some time since I played with the Guardian Platform API, but the API explorer makes it pretty easy to automatically generate some (the Python library for the Guardian Platform API appears to have rotted somewhat with various updates made to the API after its initial public testing period).
Here’s a snapshot over recent articles mentioning “The Open University” (bipartite article-tag graph):
Here’s a view of the co-occurrence tag graph:
The code is available as a Gist: Guardian Platform API Tag Grapher
As with many of my OUseful tools and techniques, this view over the data is intended to be used as a sensemaking tool as much as anything. In this case, the aim is to help folk get an idea of how, for example, “The Open University” is emergently positioned in the context of Guardian articles. As with the other ‘discovering social media positioning’ techniques I’m working on, I see the approach useful not so much for reporting, but more as a way of helping us understand how communities position brands/companies etc relative to each other, or relative to particular ideas/concepts.
It’s maybe also worth noting that the Guardian Platform article tag positioning view described above makes use of curated metadata published by the Guardian as the basis of the map. (I also tried running full text articles through the Reuters OpenCalais service, and extracting entity data (‘implicit metadata’) that way, but the results were generally a bit cluttered. (I think I’d need to clean the article text a little first before passing it to the OpenCalais service.)) That is, we draw on the ‘expert’ tagging applied to the articles, and whatever sense is made of the article during the tagging process, to construct our own sensemaking view over a wider set of articles that all refer to the topic of interest.
PS would anyone from the Guardian care to comment on the process by which tags are applied to articles?
PPS a couple more… here’s how the Guardian position JISC recently…
And here’s how “student fees” has recently been positioned:
List Intelligence – Finding Reliable, Trustworthy and Comprehensive Topic/Sector Based Twitter Lists
I woke up full of good intentions to do some “proper” work today, but a post by Brian Kelly on Numbers Matter: Let’s Provide Open Access to Usage Data and Not Just Research Papers triggered the hacker ethic in me and so I immediately set off down a rabbit hole… And here’s what I came back with…
The statistics for Facebook, Twitter and YouTube are easily obtained – although I am not aware of a way of automating the gathering of such statistics across all UK University presences which would be helpful if we wished to provide a national picture of how UK Universities are using these services.
“I am not aware of a way of…” Arghhhhhhh… Need – a – fix …
Something else Brian had done previously was a post on Institutional Use of Twitter by the 1994 Group of UK Universities, which had got me wondering about Setting An Exercise In Social Media “Research”, or “how I’d go about finding, or compiling, a comprehensive list of official Twitter accounts for UK HE institutions”.
I didn’t actually do anything about it at the time, but today I thought I’d spend an hour or so* mulling it over (after all, it might give me something to talk about at the OU hosted and UKOLN promoted Metrics and Social Web Services: Quantitative Evidence for their Use and Impact; I wish I hadn’t used the “script kiddie” title now - I’m a wannabe hacker, goddammit ;-))
(* It’s actually taken me about 4 hours, including this write up…)
So here’s what I came up with. The recipe is this:
– take one twitter list* that is broadly in the area you’re interested in (it doesn’t have to be a list, it might be a list of names of folk using a particular hashtag, or tweeting from a particular location, or even just the folk followed by a particular individual, for example); in the case of UK HEIs, I’m going to use @thirdyearabroad‘s uk-universities list;
– for every person on that “list”, look up the lists those people are on, and sort the results by list; this gives me something like:
(Note that this also gives me a list of Twitter accounts of folk who may be interested in the sector…? I notice Routledge and Blackwell’s books for example? I wonder what a social network analysis of the friendship connections between the top 50 list maintainers (in the above list sorting) would look like [another rabbit hole appears…]?)
– by eye, scan those lists, and see if we can identify another list that looks promising. For example, what happens if I run the routine again using the @campusprabi/ukuniversities list?
Or how about @UniversitiesUK/member-institutions?
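The "look up the lists those people are on, and sort the results by list" step in the recipe above boils down to a counting exercise once the list memberships have been fetched. The sketch below assumes that lookup has already been done and stored in a dictionary; the account names, list paths and the `rank_lists` function are all made up, and no Twitter API calls are shown.

```python
from collections import Counter

def rank_lists(seed_accounts, memberships):
    """Rank lists by how many of the seed accounts appear on them."""
    counts = Counter()
    for account in seed_accounts:
        # set() guards against the same list being reported twice
        counts.update(set(memberships.get(account, [])))
    # Most seed members first; ties broken by list name
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical data: account -> lists that account appears on
memberships = {
    "uniX": ["/a/unis", "/b/edu"],
    "uniY": ["/a/unis"],
    "uniZ": ["/a/unis", "/b/edu", "/c/misc"],
}
ranking = rank_lists(["uniX", "uniY", "uniZ"], memberships)
```

Lists near the top of the ranking are the candidates worth eyeballing as the next seed list.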
There’s an algorithm waiting to be found here, somewhere, for identifying who the members of the university set are. A very crude start may be something like: using the members of the top 10(?) lists, create a list of possible universities, and class as universities those that appear on at least 5(?) of the lists?
Let’s try that, using the 15 lists from the @UniversitiesUK/member-institutions run…
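That crude voting heuristic might be sketched like this, assuming the members of each of the top lists have already been collected. The list names, account names and threshold are invented for illustration, and `likely_members` is a hypothetical helper rather than part of any library.

```python
from collections import Counter

def likely_members(list_members, min_lists):
    """Accounts that appear on at least `min_lists` of the given lists."""
    counts = Counter(account
                     for members in list_members.values()
                     for account in set(members))
    return {account: n for account, n in counts.items() if n >= min_lists}

# Hypothetical data: list name -> accounts on that list
list_members = {
    "listA": ["uniX", "uniY", "bookshop"],
    "listB": ["uniX", "uniY"],
    "listC": ["uniX", "news"],
}
candidates = likely_members(list_members, min_lists=2)
```

The count for each account doubles as a rough confidence score, which is what the 14-or-15-of-15 cut used later in this post is trading on.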
Here’s who’s listed in 14 or 15 of those lists:
So we presumably have good confidence that those are UK university accounts… it maybe also says something about the reach of those accounts?
Here’s that list of accounts with 14-15 mentions again and data corresponding to: number of lists, friends, followers, status updates, creation date:
UniWestminster 14 2420 3204 455 2007-08-29 13:53:51
UCLan 14 129 2925 1286 2008-07-23 11:30:08
bournemouthuni 14 228 3436 783 2009-07-14 16:47:34
EdinburghNapier 14 101 3589 621 2009-02-25 08:43:30
RobertGordonUni 14 129 1185 580 2010-02-04 09:17:23
RoehamptonUni 14 1345 2241 969 2009-04-19 09:25:06
UniversityLeeds 14 265 5416 974 2009-04-20 22:44:14
HeriotWattUni 14 59 2137 317 2009-03-03 09:10:37
LondonU 14 981 9078 615 2008-12-16 22:38:05
UniofEdinburgh 14 195 6150 771 2009-03-09 12:01:11
UniOfYork 14 131 4058 467 2009-03-27 16:12:05
uniglam 14 49 1769 858 2008-04-11 13:39:54
newportuni 14 271 2676 452 2008-07-17 12:52:08
oxford_brookes 14 931 3645 805 2009-06-22 12:59:32
Bristoluni 14 70 5952 1398 2009-03-16 16:51:45
GlasgowUni 14 124 8441 888 2009-01-30 09:06:35
SussexUni 14 1434 7703 2822 2009-02-16 16:35:37
AbertayUni 14 272 3173 1931 2009-02-10 13:36:00
ManMetUni 14 92 4888 703 2008-05-12 11:44:03
GlyndwrUni 14 837 1447 1637 2009-02-27 14:07:49
aberdeenuni 14 2370 5068 683 2009-02-09 10:38:18
LancasterUni 14 186 4417 445 2009-03-20 15:35:26
BradfordUni 14 369 3795 781 2009-01-05 14:06:26
uniofglos 14 282 2280 793 2009-05-20 12:55:28
GoldsmithsUoL 14 485 4008 465 2009-02-12 14:45:17
QMUL 14 1244 3692 899 2009-01-23 16:36:23
BoltonUni 14 114 1015 594 2009-03-06 09:21:28
KeeleUniversity 14 1528 5476 2890 2008-07-11 10:29:19
imperialcollege 14 2484 8014 1030 2008-07-08 14:56:33
uniwales 14 52 1507 319 2009-05-28 09:55:48
UEL_News 15 1907 2010 1780 2009-07-22 11:52:12
univofstandrews 15 118 3679 340 2009-02-02 15:12:20
edgehill 15 821 3556 1384 2008-05-26 11:53:56
AberUni 15 3986 3827 693 2009-01-07 14:07:57
Uni_of_Essex 15 258 3475 1156 2009-02-27 15:43:26
SwanseaMet 15 289 1573 856 2009-05-20 07:31:41
UniOfSurrey 15 487 5331 861 2009-01-24 13:42:48
UniofBath 15 90 6977 2093 2009-01-19 10:39:19
UniofPlym 15 599 3672 1173 2009-06-22 19:23:19
UniofExeter 15 3824 3921 1291 2009-07-27 16:35:16
BangorUni 15 96 3288 420 2009-01-23 15:12:50
BathSpaUni 15 36 1529 193 2010-01-08 10:39:07
sheffhallamuni 15 498 4565 513 2009-02-11 14:47:10
TeessideUni 15 971 5334 2261 2009-02-04 16:20:01
UniStrathclyde 15 181 3445 700 2009-03-06 11:45:17
AstonUniversity 15 384 5221 852 2008-03-26 11:14:12
unibirmingham 15 388 7665 1095 2008-12-04 11:27:39
HuddersfieldUni 15 1124 2941 989 2009-06-18 12:53:54
UniofOxford 15 53 19569 475 2009-06-18 08:28:28
KingstonUni 15 5 4215 216 2009-02-04 16:49:55
portsmouthuni 15 142 4073 1346 2009-01-26 16:04:41
MiddlesexUni1 15 115 2212 389 2009-05-13 16:15:56
sheffielduni 15 7240 8069 882 2009-01-22 15:06:17
UniOfHull 15 572 3061 587 2009-07-24 11:50:29
SalfordUni 15 441 5783 1177 2008-10-30 16:04:15
StaffsUni 15 329 2854 1326 2009-08-19 09:02:40
warwickuni 15 784 7987 1440 2008-08-20 11:12:25
uniofbrighton 15 150 3822 375 2009-05-08 14:18:14
cardiffuni 15 137 8597 1113 2008-01-11 22:48:49
UlsterUni 15 92 3671 1638 2008-07-17 22:27:15
leedsmet 15 4610 4977 4224 2009-02-10 10:28:24
DundeeUniv 15 2794 4407 4624 2009-04-24 14:14:29
For the unis at the bottom of the list, I wonder if the data identifies possible reasons why? A newly created account, maybe, not many updates, few followers? Presence lower down the list also maybe signals to the relevant marketing departments that their account maybe doesn’t have the reach they thought it did?
If you want to explore the data, it’s in a sortable table here (click on column header to sort by that column).
Not that many people are following lists yet… hmmm… maybe I need to add that in – numbers of people following a list when choosing lists?
Here’s the top 15 lists containing at least 20 unis from the @UniversitiesUK/member-institutions list, ordered by subscriber count (the columns are twittername/list, no. of unis from original list on list, number of subscribers):
/UniversitiesUK/member-institutions 127 125
/EtiquetteWise/colleges-and-universities 24 53
/aderitus/universities 44 34
/EMGonline/universities-and-colleges 30 30
/Farmsphere/followfarmer 22 29
/ellielovell/universities 46 29
/CR4_News/engineering-education 46 28
/HowToEnjoyCoUk/biz-trade-work 23 25
/mbaexchange/schools 31 24
/blackwellbooks/academia 79 23
How about if we filter on 75 or more unis from the @UniversitiesUK/member-institutions list?
/UniversitiesUK/member-institutions 127 125
/blackwellbooks/academia 79 23
/umnoticias/universidades-2 79 16
/Routledge_StRef/universities-colleges 85 14
/uhu_global/uk-irish 81 10
/eMotivator/higher-education 81 9
/_StudentvisasUK/uk-uni-s-and-colleges 88 8
/UKTEFL/colleges-universities 87 7
/Universityru/the-university-round-up 85 6
/bellerbys/uk-universities 95 5
/e4scouk/universities 77 5
/campusprabi/ukuniversities 105 5
/EuropaWOL/uk-univs 123 4
/OMorris/uk-universities 123 4
/studentsoftware/uk-universities-colleges 76 4
/thirdyearabroad/uk-universities 87 3
/SPA_Digest/universities-7 91 3
/targetjobsUK/universitiesuk 83 2
/VJEuroRSCGHeist/institutions 82 2
/blackwellbooks/he-institutes 86 2
/livuni/universities 96 1
/StaffsUni/other-universities 78 1
/DanteSystems/universities 77 1
/christchurchsu/universities 78 0
/Bi5on/educational 119 0
One last thing… Discovering newly created accounts is problematic with this approach – only accounts that appear on lists are included. One heuristic for finding new tweeps is to look the followers of the sorts of people they might follow (figuring that you’re unlikely to be followed by anyone in the early days, but you are likely to follow certain easily discovered, “big name” accounts in your sector). Many university accounts don’t follow large numbers of people, but maybe by looking at the accounts they do commonly follow (who are presumably the “big players” in the area), and then tracking back to the intersection of the people who follow those accounts (you still with me?!;-), we can get signal about new entrants to the area…?
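One possible reading of that heuristic, as a sketch: first find the accounts that a good share of the known set follows, then intersect the followers of those accounts and discard anyone already known. The friend and follower data, the 60% share threshold and both function names are assumptions made up for this illustration; requiring a candidate to follow every one of the "big players" is probably stricter than you would want in practice.

```python
from collections import Counter

def commonly_followed(friends, min_share=0.5):
    """Accounts followed by at least `min_share` of the known accounts."""
    counts = Counter(f for followed in friends.values() for f in set(followed))
    threshold = min_share * len(friends)
    return {account for account, n in counts.items() if n >= threshold}

def candidate_new_entrants(followers, hubs, known):
    """Accounts that follow every hub but are not in the known set."""
    follower_sets = [set(followers.get(hub, [])) for hub in hubs]
    if not follower_sets:
        return set()
    return set.intersection(*follower_sets) - set(known)

# Hypothetical data: who the known accounts follow...
friends = {
    "uniX": ["hubA", "hubB"],
    "uniY": ["hubA", "hubB", "other"],
    "uniZ": ["hubA"],
}
# ...and who follows the commonly followed accounts
followers = {
    "hubA": ["uniX", "uniY", "uniZ", "newUni", "fan"],
    "hubB": ["uniX", "uniY", "newUni"],
}
hubs = commonly_followed(friends, min_share=0.6)
new_entrants = candidate_new_entrants(followers, hubs, known=friends)
```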
Having done a first demo of how to use Gource to visualise activity around the EDINA OpenURL data (Visualising OpenURL Referrals Using Gource), I thought I’d try something a little more artistic, and use the colour features to try to pull out a bit more detail from the data [video].
What this one shows is how the Mendeley referrals glow brightly green, which – if I’ve got my code right – suggests a lot of e-issn lookups are going on (the red nodes correspond to an issn lookup, blue to an isbn lookup and yellow/orange to an unknown lookup). The regularity of activity around particular nodes also shows how a lot of the activity is actually driven by a few dominant services, at least during the time period I sampled to generate this video.
So how was this visualisation created?
Firstly, I pulled out a few more data columns, specifically the issn, eissn, isbn and genre data. I then opted to set node colour according to whether the issn (red), eissn (green) or isbn (blue) columns were populated using a default reasoning approach (if all three were blank, I coloured the node yellow). I then experimented with colouring the actors (I think?) according to whether the genre was article-like, book-like or unknown (mapping these on to add, modify or delete actions), before dropping the size of the actors altogether in favour of just highlighting referrers and asset type (i.e. issn, e-issn, book or unknown).
cut -f 1,2,3,4,27,28,29,32,40 L2_2011-04.csv > openurlgource.csv
When running the Python script, I got a “NULL Byte” error that stopped the script working (something obviously snuck in via one of the newly added columns), so I googled around and turned up a little command line cleanup routine for the cut data file:
tr < openurlgource.csv -d '\000' > openurlgourcenonulls.csv
Here’s the new Python script too that shows the handling of the colour fields:
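import csv
from time import mktime, strptime

# Command line pre-processing step to handle NULL characters
#tr < openurlgource.csv -d '\000' > openurlgourcenonulls.csv
#alternatively?: sed 's/\x0/ /g' openurlgource.csv > openurlgourcenonulls.csv

# NOTE: the row indices below are reconstructed from the order of the cut
# fields (0=date, 1=time, 4=issn, 5=eissn, 6=isbn, 7=genre, 8=referrer).
# Using index 3 as the Gource "user" is an assumption.
with open('openurlgourcenonulls.csv', newline='') as f, \
     open('openurlgource.txt', 'w', newline='') as out:
    reader = csv.reader(f, delimiter='\t')
    writer = csv.writer(out, delimiter='|')
    headerline = next(reader)  # skip the header row
    for row in reader:
        # Only keep rows that actually have a referrer
        if row[8].strip() != '':
            # Gource wants a Unix timestamp
            t = int(mktime(strptime(row[0] + " " + row[1], "%Y-%m-%d %H:%M:%S")))
            # Node colour: issn red, eissn green, isbn blue, otherwise yellowish
            if row[4] != '':
                col = 'FF0000'
            elif row[5] != '':
                col = '00FF00'
            elif row[6] != '':
                col = '0000FF'
            else:
                col = '666600'
            # Map genre onto the Gource add/modify/delete actions
            if row[7] == 'article' or row[7] == 'journal':
                typ = 'A'
            elif row[7] == 'book' or row[7] == 'bookitem':
                typ = 'M'
            else:
                typ = 'D'
            # Turn the colon separated referrer into a path
            agent = row[8].rstrip(':').replace(':', '/')
            writer.writerow([t, row[3], typ, agent, col])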
The new gource command is:
gource -s 1 --hide usernames --start-position 0.8 --stop-position 0.82 --user-scale 0.1 openurlgource.txt
and the command to generate the video:
gource -s 1 --hide usernames --start-position 0.8 --stop-position 0.82 --user-scale 0.1 -o - openurlgource.txt | ffmpeg -y -b 3000K -r 60 -f image2pipe -vcodec ppm -i - -vcodec libx264 -vpre slow -threads 0 gource.mp4
If you’ve been tempted to try Gource out yourself on some of your own data, please post a link in the comments below:-) (I wonder just how many different sorts of data we can force into the shape that Gource expects?!;-)
One thing that particularly interested me then, as it still does now, was the way that certain search trends reveal rhythmic behaviour over the course of weeks, months or years.
At the start of this year, I revisited the topic with a post on Identifying Periodic Google Trends, Part 1: Autocorrelation (followed by Improving Autocorrelation Calculations on Google Trends Data).
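As a reminder of the general idea, here is a minimal sketch of spotting a period using a plain sample autocorrelation. It is not necessarily the calculation used in those earlier posts; the function names are made up, and the "trend" is a synthetic weekly series with a built-in 52 week cycle rather than real Google Trends data.

```python
import math

def autocorrelation(series, lag):
    """Sample autocorrelation of a series with itself shifted by lag steps."""
    n = len(series)
    if not 0 <= lag < n:
        raise ValueError("lag must be between 0 and len(series) - 1")
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    if var == 0:
        raise ValueError("series is constant")
    cov = sum((series[i] - mean) * (series[i + lag] - mean) for i in range(n - lag))
    return cov / var

def first_peak_lag(series, max_lag):
    """First lag (>= 1) at which the autocorrelation hits a local maximum."""
    r = [autocorrelation(series, k) for k in range(max_lag + 2)]
    for k in range(1, max_lag + 1):
        if r[k] > r[k - 1] and r[k] >= r[k + 1]:
            return k
    return None

# Four years of synthetic weekly data with an annual rhythm
weekly = [math.sin(2 * math.pi * t / 52) for t in range(208)]
period = first_peak_lag(weekly, max_lag=104)
print(period)
```

Looking for the first peak, rather than the largest value, matters because the autocorrelation at very short lags is always high for a smooth series.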
Anyway today it seems that Google has cracked the scaling issues with discovering correlations between search trends (using North American search trend data), as well as opening up a service that will identify what search trends correlate most closely with your own uploaded time series data: Correlate (announcement: Mining patterns in search data with Google Correlate)
For the quick overview, check out the Google Correlate Comic.
So what’s on offer? First, enter a search term and see what it’s correlated with:
As well as the line chart, correlations can also be plotted as a scatterplot:
You can also run “spatial correlations”, though at the moment this appears to be limited to US states. (I *think* this works by looking for search terms that are popular in the requested areas and not popular in the other listed areas. To generalise this, I guess you need three things: the total list of areas that work for the spatial correlation query; the areas where you want the search volume for the “to be discovered correlated phrase” to be high; the areas where you want the search volume for the “to be discovered correlated phrase” to be low?)
At this point it’s maybe worth remembering that correlation does not imply causation…
A couple of other interesting things to note: firstly, you can offset the data (so shift it a few weeks forwards or backwards in time, as you might do if you were looking for lead/lag behaviour); secondly, you can export/download the data.
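To make the offset idea concrete, here is a minimal sketch of a lagged Pearson correlation over two exported series. This is just the textbook calculation, not a description of what Google Correlate does behind the scenes; the function names and the toy numbers are made up.

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return sxy / (sxx * syy) ** 0.5

def lagged_correlation(xs, ys, offset):
    """Correlate xs[t] with ys[t + offset], using only the overlapping part."""
    if offset >= 0:
        a, b = xs[:len(xs) - offset], ys[offset:]
    else:
        a, b = xs[-offset:], ys[:len(ys) + offset]
    return pearson(a, b)

def best_offset(xs, ys, max_offset):
    """Offset in [-max_offset, max_offset] giving the highest correlation."""
    return max(range(-max_offset, max_offset + 1),
               key=lambda k: lagged_correlation(xs, ys, k))

# Toy weekly series: the follower repeats the leader two weeks later
leader = [0, 1, 3, 6, 4, 2, 1, 0, 2, 5, 7, 3, 1, 0, 1, 4]
follower = [0, 0] + leader[:-2]
offset = best_offset(leader, follower, max_offset=4)
print(offset, lagged_correlation(leader, follower, offset))
```

A positive best offset here means the second series lags the first; swap the arguments and the sign flips.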
You can also upload your own data to see what terms correlate with it:
(I wonder if they’ll start offering time series analysis features on uploaded, as well as other trend, data too? For example, frequency analysis or trend analysis? This is presumably going on in the background (though I haven’t read the white paper [PDF] yet…))
As if that’s not enough, you can also draw a curve/trendline and then see what correlates with it (so this is a weak alternative to uploading your own data, right? Just draw something that looks like it…) (h/t to Mike Ellis for first pointing this out to me).
I’m not convinced that search trends map literally onto the well known “hype cycle” curve, but I thought I’d try out a hype cycle reminiscent curve where the hype was a couple of years ago, and we’re now maybe seeing it start to reach mainstream maturity, with maybe the first inklings of a plateau…
Hmmm… the pr0n industry is often identified as a predictor of certain sorts of technology adoption… maybe the 5ex searchers are too?! (Note that correlated hand-drawn charts are linkable).
So – that’s Google Correlate; nifty, eh?
PS Here’s another reason why I blog… my blog history helps me work out how far in the future I live;-) So currently between about three years in the future.. how about you?!;-)
PPS I can imagine Google’s ThinkInsights (insight marketing) loving the thought that folk are going to check out their time series data against Google Trends so the Goog can weave that into its offerings… A few additional thoughts leading on from that: 1) when will correlations start to appear in Google AdWords support tools to help you pick adwords based on your typical web traffic patterns or even sales patterns? 2) how far are we off seeing a Google Insights box to complement the Google Search Appliances, that will let you run correlations – as well as Google Prediction type services – onsite without feeling as if you have to upload your data to Google’s servers, and instead, becoming part of Google’s our-kit-in-your-racks offering; 3) when is Google going to start buying up companies like Prism and will it then maybe go after the likes of Experian and Dunnhumby to become a company that organises information about the world of people, as well as just the world’s information…?!
PPPS Seems like as well as “traditional” link sharing offerings, you can share the link via your Google Reader account…