A Couple of Recent Quotes from Downes…

…that made me smile:

“Could it be that the hacker spirit is really what is responsible for our progress?” asks Daniel Lemire. “My theory is that if you want to cure cancer or Alzheimer’s, or if you want nuclear fusion, or bases on Mars… you need the hackers.” Once this is accomplished, he suggests, “academics will take the credit.” I’ve seen enough of this first-hand. But of course there are the academics who are also hackers, who I think play a pretty important role (and I like to think I am one). And there is – to my mind – a concern that while hackers are skilled at making things, they may not be skilled at things like social policy, governance and media (though they do not have the humility to realize this).

/via

I suck at meetings…

As is always the case, the stuff that can be done by technical people today will be provided by some application for everyone tomorrow.

/via

Yep. The latter is the reason I play with code and services as they are made available. If I can get them to work and do something halfway useful or interesting with them in an hour or two, maybe someone else will come along and build a proper app or utility around a similar idea that becomes far more widely used…

“Drug Deal” Network Analysis with Gephi (Tutorial)

Via a trackback from Check Yo Self: 5 Things You Should Know About Data Science (Author Note) criticising tweet-mapping without further analysis (“If you’re making Gephi graphs out of tweets, you’re probably doing more data science marketing than data science analytics. And stop it. Please. I can’t take any more. … what does it gain a man to have graphs of tweets and do jack for analysis with them?”), I came across John Foreman’s Analytics Made Skeezy [uncourse] blog:

Analytics Made Skeezy is a fake blog. Each post is part of a larger narrative designed to teach a variety of analytics topics while keeping it interesting. Using a single narrative allows me to contrast various approaches within the same fake world. And ultimately that’s what this blog is about: teaching the reader when to use certain analytic tools.

Skimming through the examples described in some of the posts to date, Even Wholesale Drug Dealers Can Use a Little Retargeting: Graphing, Clustering & Community Detection in Excel and Gephi not surprisingly caught my attention. That post describes, in narrative form, how to use Excel to prepare and shape a dataset so that it can be imported into Gephi as a faux CSV file and then run through Gephi’s modularity statistic; the modularity class augmented dataset can then be exported from the Gephi Data Lab and re-presented in Excel, whereupon judicious use of column sorting and conditional formatting is employed to try to generate some sort of insight about the clusters/groups discovered in the data – apparently, “Gephi can kinda suck for giving us that kind of insight sometimes. Depends on the graph and what you’re trying to do”. And furthermore:

If you had a big dataset that you prepped into a trimmed nearest neighbors graph, keep in mind that visualizing it in Gephi is just for fun. It’s not necessary for actual insight regardless of what the scads of presentations of tweets-spreading-as-visualized-in-Gephi might tell you (gag me). You just need to do the community detection piece. You can use Gephi for that or the libraries it uses. R and python both have a package called igraph that does this stuff too. Whatever you use, you just need to get community assignments out of your large dataset so that you can run things like the aggregate analysis over them to bubble up intelligence about each group.

I don’t necessarily disagree with the implication that we often need to do more than just look at pretty pictures in Gephi to make sense of a dataset; but I do also believe that we can use Gephi in an active way to have a conversation with the data, generating some sort of preliminary insight about the dataset that we can then explore further using other analytical techniques. So what I’ll try to do in the rest of this post is offer some suggestions about one or two ways in which we might use Gephi to start conversing with the same dataset described in the Drug Dealer Retargeting post. Before I do so, however, I suggest you read through the original post and try to come to some of your own conclusions about what the data might be telling us…

Done that? To recap, the original dataset (“Inventory”) is a list of “deals”, with columns relating to two sorts of thing: 1) attributes of a deal; 2) one column per customer showing whether they took up that deal. A customer/customer matrix is then generated and the cosine similarity between each pair of customers calculated (note: other distance metrics are available…) showing the extent to which they participated in similar deals. Selecting the three most similar neighbours of each customer creates a “trimmed nearest neighbors graph”, which is munged into a CSV-resembling data format that Gephi can import. Gephi is then used to do a very quick/cursory (and discounted) visual analysis, and run the modularity/clustering detection algorithm.
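For reference, the similarity/nearest-neighbour step is easy enough to sketch in code. This is only a rough illustration using a toy take-up matrix, not the original post’s Excel workings:

import numpy as np

# Toy customer/deal take-up matrix: one row per customer, one column per deal,
# 1 if the customer took part in that deal, 0 otherwise
M = np.array([[1, 0, 0],
              [1, 1, 1],
              [0, 1, 1]], dtype=float)

# Cosine similarity between customers: dot products of the rows,
# normalised by the product of the row norms
norms = np.linalg.norm(M, axis=1)
sim = np.dot(M, M.T) / np.outer(norms, norms)

# "Trimmed nearest neighbors": for each customer, keep just the k most similar others
k = 3
neighbours = {}
for i in range(sim.shape[0]):
    order = np.argsort(sim[i])[::-1]
    neighbours[i] = [j for j in order if j != i][:k]
print(neighbours)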

So how would I have attacked this dataset? (Note: IANADS – I am not a data scientist ;-)

One way would be to treat it from the start as defining a graph in which dealers are connected to trades. Using a slightly tidied version of the ‘Inventory’ tab from the original dataset in which I removed the first (metadata) and last (totals) rows, and tweaked one of the column names to remove the brackets (I don’t think Gephi likes brackets in attribute names?), I used the following script to generate a GraphML formatted version of just such a graph.

#Python script to generate GraphML file
import csv
#We're going to use the really handy networkx graph library: easy_install networkx
import networkx as nx
import urllib

#Create a directed graph object
DG=nx.DiGraph()

#Open data file in universal newline mode
reader=csv.DictReader(open("inventory.csv","rU"))

#Define a variable to act as a deal node ID counter
dcid=0

#The graph is a bimodal/bipartite graph containing two sorts of node - deals and customers
#An identifier is minted for each row, identifying the deal
#Deal attributes are used to annotate deal nodes
#Identify columns used to annotate nodes taking string values
nodeColsStr=['Offer date', 'Product', 'Origin', 'Ready for use']
#Identify columns used to annotate nodes taking numeric values
nodeColsInt=['Minimum Qty kg', 'Discount']

#The customers are treated as nodes in their own right, rather than as deal attributes
#Identify columns used to identify customers - each of these will define a customer node
customerCols=['Smith', 'Johnson', 'Williams', 'Brown', 'Jones', 'Miller', 'Davis', 'Garcia', 'Rodriguez', 'Wilson', 'Martinez', 'Anderson', 'Taylor', 'Thomas', 'Hernandez', 'Moore', 'Martin', 'Jackson', 'Thompson', 'White' ,'Lopez', 'Lee', 'Gonzalez','Harris', 'Clark', 'Lewis', 'Robinson', 'Walker', 'Perez', 'Hall', 'Young', 'Allen', 'Sanchez', 'Wright', 'King', 'Scott','Green','Baker', 'Adams', 'Nelson','Hill', 'Ramirez', 'Campbell', 'Mitchell', 'Roberts', 'Carter', 'Phillips', 'Evans', 'Turner', 'Torres', 'Parker', 'Collins', 'Edwards', 'Stewart', 'Flores', 'Morris', 'Nguyen', 'Murphy', 'Rivera', 'Cook', 'Rogers', 'Morgan', 'Peterson', 'Cooper', 'Reed', 'Bailey', 'Bell', 'Gomez', 'Kelly', 'Howard', 'Ward', 'Cox', 'Diaz', 'Richardson', 'Wood', 'Watson', 'Brooks', 'Bennett', 'Gray', 'James', 'Reyes', 'Cruz', 'Hughes', 'Price', 'Myers', 'Long', 'Foster', 'Sanders', 'Ross', 'Morales', 'Powell', 'Sullivan', 'Russell', 'Ortiz', 'Jenkins', 'Gutierrez', 'Perry', 'Butler', 'Barnes', 'Fisher']

#Create a node for each customer, and classify it as a 'customer' node type
for customer in customerCols:
	DG.add_node(customer,typ="customer")

#Each row defines a deal
for row in reader:
	#Mint an ID for the deal
	dealID='deal'+str(dcid)
	#Add a node for the deal, and classify it as a 'deal' node type
	DG.add_node(dealID,typ='deal')
	#Annotate the deal node with string based deal attributes
	for deal in nodeColsStr:
		DG.node[dealID][deal]=row[deal]
	#Annotate the deal node with numeric based deal attributes
	for deal in nodeColsInt:
		DG.node[dealID][deal]=int(row[deal])
	#If the cell in a customer column is set to 1,
	## draw an edge between that customer and the corresponding deal
	for customer in customerCols:
		if str(row[customer])=='1':
			DG.add_edge(dealID,customer)
	#Increment the node ID counter
	dcid=dcid+1

#write graph
nx.write_graphml(DG,"inventory.graphml")

The graph we’re generating (download .graphml) has a basic structure that looks something like the following:

Which is to say, in this example customer C1 engaged in a single deal, D1; customer C2 participated in every deal, D1, D2 and D3; and customer C3 partook of deals D2 and D3.
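In networkx terms, that toy structure is just a handful of deal-to-customer edges (the node names here are purely illustrative):

import networkx as nx

# Toy version of the bipartite deal/customer structure described above
toy = nx.DiGraph()
for d in ("D1", "D2", "D3"):
    toy.add_node(d, typ="deal")
for c in ("C1", "C2", "C3"):
    toy.add_node(c, typ="customer")
# C1 took deal D1 only; C2 took D1, D2 and D3; C3 took D2 and D3
toy.add_edges_from([("D1", "C1"), ("D1", "C2"),
                    ("D2", "C2"), ("D2", "C3"),
                    ("D3", "C2"), ("D3", "C3")])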

Opening the graph file into Gephi as a directed graph, we get a count of the number of actual trades there were from the edge count:

If we run the Average degree statistic, we can see that there are some nodes that are not connected to any other nodes (that is, they are either deals with no takers, or customers who never took part in a deal):

We can view these nodes using a filter:

We can also use the filter the other way, to exclude the unaccepted deals, and then create a new workspace containing just the deals that were taken up, and the customers that bought into them:

The workspace selector is at the bottom of the window, on the right hand side:

(Hmmm… for some reason, the filtered graph wasn’t exported for me… the whole graph was. Bug? Fiddling with the Giant Component filter, then exporting, then running the Giant Component filter on the exported graph and cancelling it seemed to fix things… but something is not working right?)

We can now start to try out some interactive visual analysis. Firstly, let’s lay out the nodes using a force-directed layout algorithm (ForceAtlas2) that tries to position nodes so that nodes that are connected are positioned close to each other, and nodes that aren’t connected are kept apart (imagine each node as trying to repel the other nodes, with edges trying to pull them together).

Our visual perception is great at identifying spatial groupings (see, for example, the Gestalt principles, which lead to many a design trick and a bucketful of clues about how to tease data apart in a visually meaningful way…), but are those groupings really meaningful?

At this point in the conversation we’re having with the data, I’d probably call on a statistic that tries to place connected groups of nodes into separate groups so that I could colour the nodes according to their group membership: the modularity statistic:

The modularity statistic uses a randomised algorithm, so you may get different (though broadly similar) results each time you run it. In this case, it discovered six possible groupings or clusters of interconnected nodes (often, one group is a miscellany…). We can see which group each node was placed in by applying a Partition colouring:

We see how the modularity groupings broadly map on to the visual clusters revealed by the ForceAtlas2 layout algorithm. But do the clusters relate to anything meaningful? What happens if we turn the labels on?

The green group appears to relate to Weed transactions, the reds are X, Meth and Ketamine deals, and the yellows are the coke heads. So the deals do appear to cluster around different types of deal.
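(If you’d rather do the community detection step outside Gephi, as the Analytics Made Skeezy post suggests, here’s a minimal sketch using the python-louvain package (imported as community); Gephi’s modularity statistic is a Louvain-style method, so the groupings should be broadly comparable, though not identical:)

import networkx as nx
import community  # the python-louvain package: pip install python-louvain
from collections import Counter

# Load the deal/customer graph generated earlier and drop edge direction,
# since the Louvain method expects an undirected graph
G = nx.read_graphml("inventory.graphml").to_undirected()

# best_partition maps each node to a community/cluster id
partition = community.best_partition(G)

# How big is each of the discovered groups?
print(Counter(partition.values()))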

So what else might we be able to learn? Does the Ready for Use dimension on a deal separate out at all (null nodes on this dimension relate to customers)?

We’d need to know a little bit more about what the implications of “Ready for Use” might be, but at a glance we get a feeling that the cluster on the far left is dominated by trades with large numbers of customers (there are lots of white/customer nodes), and the Coke related cluster on the right has quite a few trades (the green nodes) that aren’t ready for use. (A question that comes to mind looking at that area is: are there any customers who seem to just go for not Ready for Use trades, and what might this tell us about them if so?)

Something else we might look to is the size of the trades, and any associated discounts. Let’s colour the nodes using the Partition tool according to node type (attribute name is “typ” – nodes are deals (red) or customers (aqua)) and then size the nodes according to deal size using the Ranking display:

Small fry deals in the left hand group. Looking again at the Coke grouping, where there is a mix of small and large deals, another question we might file away is “are there customers who opt either for large or small trades?”

Let’s go back to the original colouring (via the Modularity coloured Partition; note that the random assignment of colours might change from the original colour set; right click allows you to re-randomise colours; clicking on a colour square allows you to colour select by hand) and size the nodes by OutDegree (that is, the sum total of edges outgoing from a node – remember, the graph was described as a directed graph, with edges going from deals to customers):

I have then sized the labels so that they are proportional to node size:

The node/label sizing shows which deals had plenty of takers. Sizing by InDegree shows how many deals each customer took part in:

This is quite a cluttered view… returning to the Layout panel, we can use the Expansion layout to stretch out the whole layout, as well as the Label Adjust tool to jiggle nodes so that the labels don’t overlap. Note that you can also click on a node to drag it around, or a group of nodes by increasing the “field of view” of the mouse cursor:

Here’s how I tweaked the layout by expanding the layout then adjusting the labels…:

(One of the things we might be tempted to do is filter out the users who only engaged in one or two deals, perhaps as a way of identifying regular customers; of course, a user may only engage in a single, but very large, deal, so we’d need to think carefully about what question we were actually asking when making such a choice. For example, we might also be interested in looking for customers engaging in infrequent large trades, which would require a different analysis strategy.)

Insofar as it goes, this isn’t really very interesting – what might be more compelling would be data relating to who was dealing with whom, but that isn’t immediately available. What we should be able to do, though, is see which customers are related by virtue of partaking of the same deals, and see which deals are related by virtue of being dealt to the same customers. We can maybe kid ourselves into thinking we can see this in the customer-deal graph, but we can be a little bit more rigorous by constructing two new graphs: one that shows edges between deals that share one or more common customers; and one that shows edges between customers who shared one or more of the same deals.

Recalling the “bimodal”/bipartite graph above:

that means we should be able to generate unimodal graphs along the following lines:

D1 is connected to D2 and D3 through customer C2 (that is, an edge exists between D1 and D2, and another edge between D1 and D3). D2 and D3 are joined together through two routes, C2 and C3. We might thus weight the edge between D2 and D3 as being heavier, or more significant, than the edge between either D1 and D2, or D1 and D3.

And for the customers?

C1 is connected to C2 through deal D1. C2 and C3 are connected by a heavier weighted edge reflecting the fact that they both took part in deals D2 and D3.

You will hopefully be able to imagine how more complex customer-deal graphs might collapse into customer-customer or deal-deal graphs where there are multiple, disconnected (or only very weakly connected) groups of customers (or deals) based on the fact that there are sets of deals that do not share any common customers at all, for example. (As an exercise, try coming up with some customer-deal graphs and then “collapsing” them to customer-customer or deal-deal graphs that have disconnected components.)

So can we generate graphs of this sort using Gephi? Well, it just so happens we can, using the Multimode Networks Projection tool. To start with let’s generate another couple of workspaces containing the original graph, minus the deals that had no customers. Selecting one of these workspaces, we can now generate the deal-deal (via common customer) graph:

When we run the projection, the graph is mapped onto a deal-deal graph:

The thickness of the edges describes the number of customers any two deals shared.
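We could also generate the same projections programmatically, which may be handier for larger datasets; here’s a minimal sketch using networkx’s bipartite helpers, assuming the inventory.graphml file generated earlier:

import networkx as nx
from networkx.algorithms import bipartite

# Load the deal/customer graph and drop direction for the projection step
B = nx.read_graphml("inventory.graphml").to_undirected()

# Split the node set using the 'typ' attribute added when the graph was built
customers = [n for n, d in B.nodes(data=True) if d.get("typ") == "customer"]
deals = [n for n, d in B.nodes(data=True) if d.get("typ") == "deal"]

# Deal-deal graph: edge weight = number of customers two deals share
# (deals with no takers simply come through as isolated nodes)
dd = bipartite.weighted_projected_graph(B, deals)
# Customer-customer graph: edge weight = number of deals two customers share
cc = bipartite.weighted_projected_graph(B, customers)

nx.write_graphml(dd, "deal-deal.graphml")
nx.write_graphml(cc, "customer-customer.graphml")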

If we run the modularity statistic over the deal-deal graph and colour the graph by the modularity partition, we can see how the deals are grouped by virtue of having shared customers:

If we then filter the graph on edge thickness so that we only show edges with a thickness of three or more (three shared customers) we can see how some of the deal types look as if they are grouped around particular social communities (i.e. they are supplied to the same set of people):

If we now go to the other workspace we created containing the original graph (minus the unsatisfied deals), we can generate the customer-customer projection:

Run the modularity statistic and recolour:

Whilst there is a lot to be said for maintaining the spatial layout so that we can compare different plots, we might be tempted to rerun the layout algorithm to see if it highlights the structural associations any more clearly. In this case, there isn’t much difference:

If we run the Network diameter tool, we can generate some network statistics over this customer-customer network:

If we now size the nodes by betweenness centrality, size the labels proportionally to node size, and use the expand/label overlap layout tools to tweak the display, here’s what we get:

Thompson looks to be an interesting character, spanning the various clusters… but what deals is he actually engaging in? If we go back to the original customer-deal graph, we can use an ego filter to see:

To look for actual social groupings, we might filter the network based on edge weight, for example to show only edges above a particular weight (that is, number of shared deals), and then drop this set into a new workspace. If we then run the Average Degree statistic, we can calculate the degree of nodes in this graph, and size nodes accordingly. Relaying out the graph shows us some coarse social networks based on significant numbers of shared trades:

Hopefully by now you are starting to “see” how we can start to have a visual conversation with the data, asking different questions of it based on things we are learning about it. Whilst we may need to actually look at the numbers (and Gephi’s Data Laboratory tab allows us to do that), I find that visual exploration can provide a quick way of orienting (orientating?) yourself with respect to a particular dataset, and getting a feel for the sorts of questions you might ask of it, questions that might well involve a detailed consideration of the actual numbers themselves. But for starters, the visual route often works for me…

PS There is a link to the graph file here, so if you want to try exploring it for yourself, you can do so:-)

Visualising New York Times Article API Tag Graphs Using d3.js

Picking up on Tinkering with the Guardian Platform API – Tag Signals, here’s a variant of the same thing using the New York Times Article Search API.

As with the Guardian OpenPlatform demo, the idea is to look up recent articles containing a particular search term, find the tag(s) used to describe the articles, and then graph them. The idea behind this approach is to get a quick snapshot of how the search target is represented, or positioned, in this case, by the New York Times.

Here is the code. The main differences compared to the Guardian API gist are as follows:

– a hacked recipe for getting several paged results back; I really need to sort this out properly, just as I need to generalise the code so it will work with either the Guardian or the NYT API, but that’s for another day now…

– the use of NetworkX as a way of representing the undirected tag-tag graph;

– the use of the NetworkX D3 helper library (networkx-d3) to generate a JSON output file that works with the d3.js force directed layout library.

Note that the D3 Python library generates a vanilla force directed layout diagram. In the end, I just grabbed the tiny bit of code that loads the JSON data into a D3 network, and then used it along with the code behind the rather more beautiful network layout used for this Mobile Patent Suits visualisation.

Here’s a snapshot of the result of a search for recent articles on the phrase Gates Foundation:

At this point, it’s probably worth pointing out that the Python script generates the graph file, and then the d3.js library generates the graph visualisation within a web browser. There is no human hand (other than force layout parameter setting) involved in the layout. I guess with the tweaking of a few parameters, maybe juggling the force layout parameters a little more, I could get an even clearer layout. It might also be worth trying to find a way of sizing, or at least colouring, the nodes according to degree (or even better, weighted degree?). I also need to find a way, if possible, of representing the weight of edges if the D3 Python library actually exports this (or if it exports multiple edges between the same two nodes).

Anyway, for an hour or so’s tinkering, it’s quite a neat little recipe to be able to add to my toolkit. Here’s how it works again: Python script calls NYT Article Search API and grabs articles based on a search term. Grab the tags used to describe each article and build up a graph using NetworkX that connects any and all tags that are mentioned on the same article. Dump the graph from its NetworkX representation as a JSON file using the D3 library, then use the D3 Patent Suits layout recipe to render it in the browser :-)
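In outline, the graph building step is just a tag co-occurrence counter; here’s a rough sketch of that part (the NYT API call and paging are omitted, the example tags are made up, and rather than the networkx-d3 helper it uses networkx’s own node_link_data export to get d3-friendly JSON):

import itertools
import json
import networkx as nx
from networkx.readwrite import json_graph

# Pretend we've already called the Article Search API and pulled out
# one list of tags per matching article
article_tags = [
    ["Gates Foundation", "Philanthropy", "Malaria"],
    ["Gates Foundation", "Education", "Philanthropy"],
]

G = nx.Graph()
for tags in article_tags:
    # Connect every pair of tags appearing on the same article,
    # bumping the edge weight each time the pair co-occurs
    for t1, t2 in itertools.combinations(sorted(set(tags)), 2):
        if G.has_edge(t1, t2):
            G[t1][t2]["weight"] += 1
        else:
            G.add_edge(t1, t2, weight=1)

# Dump in a form the d3.js force layout examples can load
with open("tags.json", "w") as f:
    json.dump(json_graph.node_link_data(G), f)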

Now all I have to do is find out how I can grab an SVG dump of the network from a browser into a shareable file…

Tinkering with the Guardian Platform API – Tag Signals

Given a company or personal name, what’s a quick way of generating meaningful tags around what it’s publicly known for, or associated with?

Over the last couple of weeks or so, I’ve been doodling around a few ideas with Miguel Andres-Clavera from the JWT (London) Innovation Lab looking for novel ways of working out how brands and companies seem to be positioned by virtue of their social media followers, as well as their press mentions.

Here’s a quick review of one of those doodles: looking up tags associated with Guardian news articles that mention a particular search term (such as a company, or personal name) as a way of getting a crude snapshot of how the Guardian ‘positions’ that referent in its news articles.

It’s been some time since I played with the Guardian Platform API, but the API explorer makes it pretty easy to automatically generate some example API queries (the Python library for the Guardian Platform API appears to have rotted somewhat with various updates made to the API after its initial public testing period).

Guardian OpenPlatform API

Here’s a snapshot over recent articles mentioning “The Open University” (bipartite article-tag graph):

Open university - article-tag graph

Here’s a view of the co-occurrence tag graph:

'Open University' – co-occurrence tag graph

The code is available as a Gist: Guardian Platform API Tag Grapher

As with many of my OUseful tools and techniques, this view over the data is intended to be used as a sensemaking tool as much as anything. In this case, the aim is to help folk get an idea of how, for example, “The Open University” is emergently positioned in the context of Guardian articles. As with the other ‘discovering social media positioning’ techniques I’m working on, I see the approach as useful not so much for reporting, but more as a way of helping us understand how communities position brands/companies etc relative to each other, or relative to particular ideas/concepts.

It’s maybe also worth noting that the Guardian Platform article tag positioning view described above makes use of curated metadata published by the Guardian as the basis of the map. (I also tried running full text articles through the Reuters OpenCalais service, and extracting entity data (‘implicit metadata’) that way, but the results were generally a bit cluttered. (I think I’d need to clean the article text a little first before passing it to the OpenCalais service.)) That is, we draw on the ‘expert’ tagging applied to the articles, and whatever sense is made of the article during the tagging process, to construct our own sensemaking view over a wider set of articles that all refer to the topic of interest.

PS would anyone from the Guardian care to comment on the process by which tags are applied to articles?

PPS a couple more… here’s how the Guardian positioned JISC recently…

JISC Positioning... Guardian

And here’s how “student fees” has recently been positioned:

In the context of tuition fees - openplatform tag-tag graph

Hmmm…

Identifying the Twitterati Using List Analysis

Given absolutely no-one picked up on List Intelligence – Finding Reliable, Trustworthy and Comprehensive Topic/Sector Based Twitter Lists, here’s an example of what the technique might be good for…

Seeing the tag #edusum11 in my feed today, and not being minded to follow it, I used the list intelligence hack to see:

– which lists might be related to the topic area covered by the tag, based on looking at which Twitter lists folk recently using the tag appear on;
– which folk on twitter might be influential in the area, based on their presence on lists identified as maybe relevant to the topic associated with the tag…

Here’s what I found…

Some lists that maybe relate to the topic area (username/list, number of folk who used the hashtag appearing on the list, number of list subscribers), sorted by number of people using the tag present on the list:

/joedale/ukedtech 6 6
/TWMarkChambers/edict 6 32
/stevebob79/education-and-ict 5 28
/mhawksey/purposed 5 38
/fosteronomo/chalkstars-combined 5 12
/kamyousaf/uk-ict-education 5 77
/ssat_lia/lia 5 5
/tlists/edtech-995 4 42
/ICTDani/teched 4 33
/NickSpeller/buzzingeducators 4 2
/SchoolDuggery/uk-ed-admin-consultancy 4 65
/briankotts/educatorsuk 4 38
/JordanSkole/jutechtlets 4 10
/nyzzi_ann/teacher-type-people 4 9
/Alexandragibson/education 4 3
/danielrolo/teachers 4 20
/cstatucki/educators 4 13
/helenwhd/e-learning 4 29
/TechSmithEDU/courosalets 4 2
/JordanSkole/chalkstars-14 4 25
/deerwood/edtech 4 144

Some lists that maybe relate to the topic area (username/list, number of folk who used the hashtag appearing on the list, number of list subscribers), sorted by number of people subscribing to the list (a possible ranking factor for the list):
/deerwood/edtech 4 144
/kamyousaf/uk-ict-education 5 77
/SchoolDuggery/uk-ed-admin-consultancy 4 65
/tlists/edtech-995 4 42
/mhawksey/purposed 5 38
/briankotts/educatorsuk 4 38
/ICTDani/teched 4 33
/TWMarkChambers/edict 6 32
/helenwhd/e-learning 4 29
/stevebob79/education-and-ict 5 28
/JordanSkole/chalkstars-14 4 25
/danielrolo/teachers 4 20
/cstatucki/educators 4 13
/fosteronomo/chalkstars-combined 5 12
/JordanSkole/jutechtlets 4 10
/nyzzi_ann/teacher-type-people 4 9
/joedale/ukedtech 6 6
/ssat_lia/lia 5 5
/Alexandragibson/education 4 3
/NickSpeller/buzzingeducators 4 2
/TechSmithEDU/courosalets 4 2

Other ranking factors might include the follower count, or factors from some sort of social network analysis, of the list maintainer.

Having got a set of lists, we can then look for people who appear on lots of those lists to see who might be influential in the area. Here’s the top 10 (user, number of lists they appear on, friend count, follower count, number of tweets, time of arrival on twitter):

['terryfreedman', 9, 4570, 4831, 6946, datetime.datetime(2007, 6, 21, 16, 41, 17)]
['theokk', 9, 1564, 1693, 12029, datetime.datetime(2007, 3, 16, 14, 36, 2)]
['dawnhallybone', 8, 1482, 1807, 18997, datetime.datetime(2008, 5, 19, 14, 40, 50)]
['josiefraser', 8, 1111, 7624, 17971, datetime.datetime(2007, 2, 2, 8, 58, 46)]
['tonyparkin', 8, 509, 1715, 13274, datetime.datetime(2007, 7, 18, 16, 22, 53)]
['dughall', 8, 2022, 2794, 16961, datetime.datetime(2009, 1, 7, 9, 5, 50)]
['jamesclay', 8, 453, 2552, 22243, datetime.datetime(2007, 3, 26, 8, 20)]
['timbuckteeth', 8, 1125, 7198, 26150, datetime.datetime(2007, 12, 22, 17, 17, 35)]
['tombarrett', 8, 10949, 13665, 19135, datetime.datetime(2007, 11, 3, 11, 45, 50)]
['daibarnes', 8, 1592, 2592, 7673, datetime.datetime(2008, 3, 13, 23, 20, 1)]
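In case it helps make the recipe concrete, the list-scoring step is essentially just counting; here’s a rough sketch of its shape (the Twitter API calls that find which lists each hashtag user appears on are omitted, and the function and variable names are illustrative placeholders):

from collections import Counter

def rank_candidate_lists(user_list_memberships):
    """user_list_memberships maps each recent hashtag user to the set of
    Twitter lists they appear on; score each list by how many of those
    users it contains."""
    counts = Counter()
    for lists in user_list_memberships.values():
        counts.update(set(lists))
    return counts.most_common()

# e.g. rank_candidate_lists({'userA': {'/joedale/ukedtech', '/mhawksey/purposed'},
#                            'userB': {'/joedale/ukedtech'}})
# gives [('/joedale/ukedtech', 2), ('/mhawksey/purposed', 1)]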

The algorithms I’m using have a handful of tuneable parameters, which means there’s all sorts of scope for running with this idea in a “research” context…

One possible issue that occurred to me was that identified lists might actually cover different topic areas – this is something I need to ponder…

List Intelligence – Finding Reliable, Trustworthy and Comprehensive Topic/Sector Based Twitter Lists

I woke up full of good intentions to do some “proper” work today, but a post by Brian Kelly on Numbers Matter: Let’s Provide Open Access to Usage Data and Not Just Research Papers triggered the hacker ethic in me and so I immediately set off down a rabbit hole… And here’s what I came back with…

Brian wrote:

The statistics for Facebook, Twitter and YouTube are easily obtained – although I am not aware of a way of automating the gathering of such statistics across all UK University presences which would be helpful if we wished to provide a national picture of how UK Universities are using these services.

Brian Kelly: Numbers Matter: Let’s Provide Open Access to Usage Data and Not Just Research Papers

“I am not aware of a way of…” Arghhhhhhh… Need – a – fix …

Something else Brian had done previously was a post on Institutional Use of Twitter by the 1994 Group of UK Universities, which had got me wondering about Setting An Exercise In Social Media “Research”, or “how I’d go about finding, or compiling, a comprehensive list of official Twitter accounts for UK HE institutions”.

I didn’t actually do anything about it at the time, but today I thought I’d spend an hour or so* mulling it over (after all, it might give me something to talk about at the OU hosted and UKOLN promoted Metrics and Social Web Services: Quantitative Evidence for their Use and Impact (I wish I hadn’t used the “script kiddie” title now – I’m a wannabe hacker, goddammit ;-))

(* It’s actually taken me about 4 hours, including this write up…)

So here’s what I came up with. The recipe is this:

– take one twitter list* that is broadly in the area you’re interested in (it doesn’t have to be a list, it might be a list of names of folk using a particular hashtag, or tweeting from a particular location, or even just the folk followed by a particular individual, for example); in the case of UK HEIs, I’m going to use @thirdyearabroad‘s uk-universities list;

– for every person on that “list”, look up the lists those people are on, and sort the results by list; this gives me something like:

...
/blackwellbooks/he-institutes 76
/christchurchsu/universities 76
/uhu_global/uk-irish 77
/targetjobsUK/universitiesuk 77
/umnoticias/universidades-2 78
/Routledge_StRef/universities-colleges 78
/eMotivator/higher-education 81
/Universityru/the-university-round-up 81
/VJEuroRSCGHeist/institutions 82
/UKTEFL/colleges-universities 82
/SPA_Digest/universities-7 85
/_StudentvisasUK/uk-uni-s-and-colleges 85
/OMorris/uk-universities 86
/UniversitiesUK/member-institutions 87
/bellerbys/uk-universities 87
/EuropaWOL/uk-univs 89
/livuni/universities 89
/Bi5on/educational 90
/campusprabi/ukuniversities 90
/thirdyearabroad/uk-universities 100

(Note that this also gives me a list of Twitter accounts of folk who may be interested in the sector…? I notice Routledge and Blackwell’s books for example? I wonder what a social network analysis of the friendship connections between the top 50 list maintainers (in the above list sorting) would look like [another rabbit hole appears…]?)

– by eye, scan those lists, and see if we can identify another list that looks promising. For example, what happens if I run the routine again using the @campusprabi/ukuniversities list?

...
/Universityru/the-university-round-up 88
/thirdyearabroad/uk-universities 90
/_StudentvisasUK/uk-uni-s-and-colleges 90
/UKTEFL/colleges-universities 90
/bellerbys/uk-universities 93
/SPA_Digest/universities-7 93
/livuni/universities 96
/OMorris/uk-universities 104
/UniversitiesUK/member-institutions 105
/EuropaWOL/uk-univs 106
/Bi5on/educational 109
/campusprabi/ukuniversities 140

Or how about @UniversitiesUK/member-institutions?

...
/targetjobsUK/universitiesuk 83
/Universityru/the-university-round-up 85
/Routledge_StRef/universities-colleges 85
/blackwellbooks/he-institutes 86
/thirdyearabroad/uk-universities 87
/UKTEFL/colleges-universities 87
/_StudentvisasUK/uk-uni-s-and-colleges 88
/SPA_Digest/universities-7 91
/bellerbys/uk-universities 95
/livuni/universities 96
/campusprabi/ukuniversities 105
/Bi5on/educational 119
/EuropaWOL/uk-univs 123
/OMorris/uk-universities 123
/UniversitiesUK/member-institutions 127

There’s an algorithm waiting to be found here, somewhere, for identifying who the members of the university set are? A very crude start may be something like: using the members of the top 10(?) lists, create a list of possible universities, and class as universities those that appear on at least 5(?) of the lists?
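Here’s a rough sketch of that crude voting step (the list-membership lookups via the Twitter API are omitted; the function and thresholds are just the guesses above, written down):

from collections import Counter

def vote_for_universities(candidate_list_members, min_votes=5):
    """candidate_list_members is a list of member sets, one per candidate
    'UK universities' list (say, the members of the top 10 lists above);
    class as universities any account appearing on at least min_votes of them."""
    votes = Counter()
    for members in candidate_list_members:
        votes.update(set(members))
    return sorted((account, n) for account, n in votes.items() if n >= min_votes)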

Let’s try that, using the 15 lists from the @UniversitiesUK/member-institutions run…

Here’s who’s listed in 14 or 15 of those lists:

UniKent 14
UniWestminster 14
UCLan 14
bournemouthuni 14
EdinburghNapier 14
RobertGordonUni 14
RoehamptonUni 14
UniversityLeeds 14
HeriotWattUni 14
LondonU 14
UniofEdinburgh 14
UniOfYork 14
uniglam 14
newportuni 14
oxford_brookes 14
Bristoluni 14
GlasgowUni 14
SussexUni 14
AbertayUni 14
ManMetUni 14
GlyndwrUni 14
aberdeenuni 14
LancasterUni 14
BradfordUni 14
uniofglos 14
GoldsmithsUoL 14
QMUL 14
BoltonUni 14
KeeleUniversity 14
imperialcollege 14
uniwales 14
UEL_News 15
univofstandrews 15
edgehill 15
AberUni 15
Uni_of_Essex 15
SwanseaMet 15
UniOfSurrey 15
UniofBath 15
UniofPlym 15
UniofExeter 15
BangorUni 15
BathSpaUni 15
sheffhallamuni 15
TeessideUni 15
UniStrathclyde 15
AstonUniversity 15
unibirmingham 15
HuddersfieldUni 15
UniofOxford 15
KingstonUni 15
portsmouthuni 15
MiddlesexUni1 15
sheffielduni 15
UniOfHull 15
SalfordUni 15
StaffsUni 15
warwickuni 15
uniofbrighton 15
cardiffuni 15
UlsterUni 15
leedsmet 15
DundeeUniv 15

So we presumably have good confidence that those are UK university accounts… it maybe also says something about the reach of those accounts?

Here’s that list of accounts with 14-15 mentions again and data corresponding to: number of lists, friends, followers, status updates, creation date:

UniWestminster 14 2420 3204 455 2007-08-29 13:53:51
UCLan 14 129 2925 1286 2008-07-23 11:30:08
bournemouthuni 14 228 3436 783 2009-07-14 16:47:34
EdinburghNapier 14 101 3589 621 2009-02-25 08:43:30
RobertGordonUni 14 129 1185 580 2010-02-04 09:17:23
RoehamptonUni 14 1345 2241 969 2009-04-19 09:25:06
UniversityLeeds 14 265 5416 974 2009-04-20 22:44:14
HeriotWattUni 14 59 2137 317 2009-03-03 09:10:37
LondonU 14 981 9078 615 2008-12-16 22:38:05
UniofEdinburgh 14 195 6150 771 2009-03-09 12:01:11
UniOfYork 14 131 4058 467 2009-03-27 16:12:05
uniglam 14 49 1769 858 2008-04-11 13:39:54
newportuni 14 271 2676 452 2008-07-17 12:52:08
oxford_brookes 14 931 3645 805 2009-06-22 12:59:32
Bristoluni 14 70 5952 1398 2009-03-16 16:51:45
GlasgowUni 14 124 8441 888 2009-01-30 09:06:35
SussexUni 14 1434 7703 2822 2009-02-16 16:35:37
AbertayUni 14 272 3173 1931 2009-02-10 13:36:00
ManMetUni 14 92 4888 703 2008-05-12 11:44:03
GlyndwrUni 14 837 1447 1637 2009-02-27 14:07:49
aberdeenuni 14 2370 5068 683 2009-02-09 10:38:18
LancasterUni 14 186 4417 445 2009-03-20 15:35:26
BradfordUni 14 369 3795 781 2009-01-05 14:06:26
uniofglos 14 282 2280 793 2009-05-20 12:55:28
GoldsmithsUoL 14 485 4008 465 2009-02-12 14:45:17
QMUL 14 1244 3692 899 2009-01-23 16:36:23
BoltonUni 14 114 1015 594 2009-03-06 09:21:28
KeeleUniversity 14 1528 5476 2890 2008-07-11 10:29:19
imperialcollege 14 2484 8014 1030 2008-07-08 14:56:33
uniwales 14 52 1507 319 2009-05-28 09:55:48
UEL_News 15 1907 2010 1780 2009-07-22 11:52:12
univofstandrews 15 118 3679 340 2009-02-02 15:12:20
edgehill 15 821 3556 1384 2008-05-26 11:53:56
AberUni 15 3986 3827 693 2009-01-07 14:07:57
Uni_of_Essex 15 258 3475 1156 2009-02-27 15:43:26
SwanseaMet 15 289 1573 856 2009-05-20 07:31:41
UniOfSurrey 15 487 5331 861 2009-01-24 13:42:48
UniofBath 15 90 6977 2093 2009-01-19 10:39:19
UniofPlym 15 599 3672 1173 2009-06-22 19:23:19
UniofExeter 15 3824 3921 1291 2009-07-27 16:35:16
BangorUni 15 96 3288 420 2009-01-23 15:12:50
BathSpaUni 15 36 1529 193 2010-01-08 10:39:07
sheffhallamuni 15 498 4565 513 2009-02-11 14:47:10
TeessideUni 15 971 5334 2261 2009-02-04 16:20:01
UniStrathclyde 15 181 3445 700 2009-03-06 11:45:17
AstonUniversity 15 384 5221 852 2008-03-26 11:14:12
unibirmingham 15 388 7665 1095 2008-12-04 11:27:39
HuddersfieldUni 15 1124 2941 989 2009-06-18 12:53:54
UniofOxford 15 53 19569 475 2009-06-18 08:28:28
KingstonUni 15 5 4215 216 2009-02-04 16:49:55
portsmouthuni 15 142 4073 1346 2009-01-26 16:04:41
MiddlesexUni1 15 115 2212 389 2009-05-13 16:15:56
sheffielduni 15 7240 8069 882 2009-01-22 15:06:17
UniOfHull 15 572 3061 587 2009-07-24 11:50:29
SalfordUni 15 441 5783 1177 2008-10-30 16:04:15
StaffsUni 15 329 2854 1326 2009-08-19 09:02:40
warwickuni 15 784 7987 1440 2008-08-20 11:12:25
uniofbrighton 15 150 3822 375 2009-05-08 14:18:14
cardiffuni 15 137 8597 1113 2008-01-11 22:48:49
UlsterUni 15 92 3671 1638 2008-07-17 22:27:15
leedsmet 15 4610 4977 4224 2009-02-10 10:28:24
DundeeUniv 15 2794 4407 4624 2009-04-24 14:14:29

For the unis at the bottom of the list, I wonder if the data identifies possible reasons why? A newly created account, maybe, not many updates, few followers? Presence lower down the list also maybe signals to the relevant marketing departments that their account maybe doesn’t have the reach they thought it did?

If you want to explore the data, it’s in a sortable table here (click on column header to sort by that column).

Not that many people are following lists yet… hmmm… maybe I need to add that in – numbers of people following a list when choosing lists?

Here’s the top 15 lists containing at least 20 unis from the @UniversitiesUK/member-institutions list, ordered by subscriber count (the columns are twittername/list, no. of unis from original list on list, number of subscribers):

/UniversitiesUK/member-institutions 127 125
/EtiquetteWise/colleges-and-universities 24 53
/aderitus/universities 44 34
/EMGonline/universities-and-colleges 30 30
/Farmsphere/followfarmer 22 29
/ellielovell/universities 46 29
/CR4_News/engineering-education 46 28
/HowToEnjoyCoUk/biz-trade-work 23 25
/mbaexchange/schools 31 24
/blackwellbooks/academia 79 23

How about if we filter on 75 or more unis from the @UniversitiesUK/member-institutions list?

/UniversitiesUK/member-institutions 127 125
/blackwellbooks/academia 79 23
/umnoticias/universidades-2 79 16
/Routledge_StRef/universities-colleges 85 14
/uhu_global/uk-irish 81 10
/eMotivator/higher-education 81 9
/_StudentvisasUK/uk-uni-s-and-colleges 88 8
/UKTEFL/colleges-universities 87 7
/Universityru/the-university-round-up 85 6
/bellerbys/uk-universities 95 5
/e4scouk/universities 77 5
/campusprabi/ukuniversities 105 5
/EuropaWOL/uk-univs 123 4
/OMorris/uk-universities 123 4
/studentsoftware/uk-universities-colleges 76 4
/thirdyearabroad/uk-universities 87 3
/SPA_Digest/universities-7 91 3
/targetjobsUK/universitiesuk 83 2
/VJEuroRSCGHeist/institutions 82 2
/blackwellbooks/he-institutes 86 2
/livuni/universities 96 1
/StaffsUni/other-universities 78 1
/DanteSystems/universities 77 1
/christchurchsu/universities 78 0
/Bi5on/educational 119 0

One last thing… Discovering newly created accounts is problematic with this approach – only accounts that appear on lists are included. One heuristic for finding new tweeps is to look at the followers of the sorts of people they might follow (figuring that you’re unlikely to be followed by anyone in the early days, but you are likely to follow certain easily discovered, “big name” accounts in your sector). Many university accounts don’t follow large numbers of people, but maybe by looking at the accounts they do commonly follow (who are presumably the “big players” in the area), and then tracking back to the intersection of the people who follow those accounts (you still with me?!;-), we can get signal about new entrants to the area…?
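Here’s a rough sketch of the shape of that last heuristic (the follower sets would come from the Twitter API, which is omitted here; all the names are placeholders):

from collections import Counter

def likely_new_entrants(big_name_follower_sets, known_accounts, min_big_names=3):
    """Given the follower sets of a few 'big name' sector accounts, surface
    accounts that follow several of them but aren't already on our lists."""
    counts = Counter()
    for followers in big_name_follower_sets:
        counts.update(set(followers))
    return [(account, n) for account, n in counts.most_common()
            if n >= min_big_names and account not in known_accounts]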

Another Blooming Look at Gource and the Edina OpenURL Data

Having done a first demo of how to use Gource to visualise activity around the EDINA OpenURL data (Visualising OpenURL Referrals Using Gource), I thought I’d try something a little more artistic, and use the colour features to try to pull out a bit more detail from the data [video].

What this one shows is how the mendeley referrals glow brightly green, which – if I’ve got my code right – suggests a lot of e-issn lookups are going on (the red nodes correspond to an issn lookup, blue to an isbn lookup and yellow/orange to an unknown lookup). The regularity of activity around particular nodes also shows how a lot of the activity is actually driven by a few dominant services, at least during the time period I sampled to generate this video.

So how was this visualisation created?

Firstly, I pulled out a few more data columns, specifically the issn, eissn, isbn and genre data. I then opted to set node colour according to whether the issn (red), eissn (green) or isbn (blue) columns were populated using a default reasoning approach (if all three were blank, I coloured the node yellow). I then experimented with colouring the actors (I think?) according to whether the genre was article-like, book-like or unknown (mapping these on to add, modify or delete actions), before dropping the size of the actors altogether in favour of just highlighting referrers and asset type (i.e. issn, e-issn, book or unknown). The extra columns were pulled from the raw data file using a command line cut:

cut -f 1,2,3,4,27,28,29,32,40 L2_2011-04.csv > openurlgource.csv

When running the Python script, I got a “NULL Byte” error that stopped the script working (something obviously snuck in via one of the newly added columns), so I googled around and turned up a little command line cleanup routine for the cut data file:

tr < openurlgource.csv -d '\000' > openurlgourcenonulls.csv

Here’s the new Python script too that shows the handling of the colour fields:

import csv
from time import *

# Command line pre-processing step to handle NULL characters
#tr < openurlgource.csv -d '\000' > openurlgourcenonulls.csv
#alternatively?: sed 's/\x0/ /g' openurlgource.csv > openurlgourcenonulls.csv

f=open('openurlgourcenonulls.csv', 'rb')

reader = csv.reader(f, delimiter='\t')
writer = csv.writer(open('openurlgource.txt','wb'),delimiter='|')
headerline = reader.next()

for row in reader:
	if row[8].strip() !='':
		t=int(mktime(strptime(row[0]+" "+row[1], "%Y-%m-%d %H:%M:%S")))
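		# Node colour by identifier type: red = issn, green = e-issn, blue = isbn, dark yellow = unknown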
		if row[4]!='':
			col='FF0000'
		elif row[5]!='':
			col='00FF00'
		elif row[6]!='':
			col='0000FF'
		else:
			col='666600'
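		# Map the genre onto a Gource action code: articles/journals = A(dd), books = M(odify), anything else = D(elete)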
		if row[7]=='article' or row[7]=='journal':
			typ='A'
		elif row[7]=='book' or row[7]=='bookitem':
			typ='M'
		else:
			typ='D'
		agent=row[8].rstrip(':').replace(':','/')
		writer.writerow([t,row[3],typ,agent,col])

The new gource command is:

gource -s 1 --hide usernames --start-position 0.8 --stop-position 0.82 --user-scale 0.1 openurlgource.txt

and the command to generate the video:

gource -s 1 --hide usernames --start-position 0.8 --stop-position 0.82 --user-scale 0.1 -o - openurlgource.txt | ffmpeg -y -b 3000K -r 60 -f image2pipe -vcodec ppm -i - -vcodec libx264 -vpre slow -threads 0 gource.mp4

If you’ve been tempted to try Gource out yourself on some of your own data, please post a link in the comments below:-) (I wonder just how many different sorts of data we can force into the shape that Gource expects?!;-)

Google Correlate: What Search Terms Does Your Time Series Data Correlate With?

Just a few days over three years ago, I blogged about a site I’d put together to try to crowdsource observations about correlated search trends: TrendSpotting.

One thing that particularly interested me then, as it still does now, was the way that certain search trends reveal rhythmic behaviour over the course of weeks, months or years.

At the start of this year, I revisited the topic with a post on Identifying Periodic Google Trends, Part 1: Autocorrelation (followed by Improving Autocorrelation Calculations on Google Trends Data).

Anyway today it seems that Google has cracked the scaling issues with discovering correlations between search trends (using North American search trend data), as well as opening up a service that will identify what search trends correlate most closely with your own uploaded time series data: Correlate (announcement: Mining patterns in search data with Google Correlate)

For the quick overview, check out the Google Correlate Comic.

So what’s on offer? First, enter a search term and see what it’s correlated with:

As well as the line chart, correlations can also be plotted as a scatterplot:

You can also run “spatial correlations”, though at the moment this appears to be limited to US states. (I *think* this works by looking for search terms that are popular in the requested areas and not popular in the other listed areas. To generalise this, I guess you need three things: the total list of areas that work for the spatial correlation query; the areas where you want the search volume for the “to be discovered correlated phrase” to be high; and the areas where you want the search volume for the “to be discovered correlated phrase” to be low?)

At this point it’s maybe worth remembering that correlation does not imply causation…

A couple of other interesting things to note: firstly, you can offset the data (so shift it a few weeks forwards or backwards in time, as you might do if you were looking for lead/lag behaviour); secondly, you can export/download the data.

You can also upload your own data to see what terms correlate with it:

(I wonder if they’ll start offering time series analysis features on uploaded data, as well as other trend data, too? For example, frequency analysis or trend analysis? This is presumably going on in the background (though I haven’t read the white paper [PDF] yet…).)

As if that’s not enough, you can also draw a curve/trendline and then see what correlates with it (so this is a weak alternative to uploading your own data, right? Just draw something that looks like it…). (H/t to Mike Ellis for first pointing this out to me.)

I’m not convinced that search trends map literally onto the well known “hype cycle” curve, but I thought I’d try out a hype cycle reminiscent curve where the hype was a couple of years ago, and we’re now maybe seeing it start to reach mainstream maturity, with maybe the first inklings of a plateau…

Hmmm… the pr0n industry is often identified as a predictor of certain sorts of technology adoption… maybe the 5ex searchers are too?! (Note that correlated hand-drawn charts are linkable).

So – that’s Google Correlate; nifty, eh?

PS Here’s another reason why I blog… my blog history helps me work out how far in the future I live;-) So, currently, about three years in the future… how about you?!;-)

PPS I can imagine Google’s ThinkInsights (insight marketing) loving the thought that folk are going to check out their time series data against Google Trends so the Goog can weave that into its offerings… A few additional thoughts leading on from that: 1) when will correlations start to appear in Google AdWords support tools to help you pick adwords based on your typical web traffic patterns or even sales patterns? 2) how far are we off seeing a Google Insights box to complement the Google Search Appliances, that will let you run correlations – as well as Google Prediction type services – onsite without feeling as if you have to upload your data to Google’s servers, and instead, becoming part of Google’s out-kit-in-your-racks offering; 3) when is Google going to start buying up companies like Prism and will it then maybe go after the likes of Experian and Dunnhumby to become a company that organises information about the world of people, as well as just the world’s information…?!)

PPPS Seems like as well as “traditional” link sharing offerings, you can share the link via your Google Reader account…

Interesting…

Situated Video Advertising With Tesco Screens

In 2004, Tesco launched an in-store video service under the name Tesco TV as part of its Digital Retail Network service. The original service is described in TESCO taps into the power of satellite broadband to create a state-of-the-art “Digital Retail Network” and is well worth a read. A satellite delivery service provided “news and entertainment, as well as promotional information on both TESCO’s own products and suppliers’ branded products” that was displayed on video screens around the store.

In order to make content as relevant as possible (i.e. to maximise the chances of it influencing a purchase decision;-), the content was zoned:

Up to eight different channels are available on TESCO TV, each channel specifically intended for a particular zone of the store. The screens in the Counters area, for instance, display different content from the screens in the Wines and Spirits area. The latest music videos are shown in the Home Entertainment department and Health & Beauty has its own channel, too. In the Cafe, customers can relax watching the latest news, sports clips, and other entertainment programs.

I’d have loved to have seen the control room:

Remote control from a central location of which content is played on each screen, at each store, in each zone, is an absolute necessity. One reason is that advertisers are only obligated to pay for their advertisements if they are shown in the contracted zones and at the contracted times.

In parallel to the large multimedia files, smaller files with the scripts and programming information are sent to all branches simultaneously or separately, depending on what is required. These scripts are available per channel and define which content is played on which screen at which time. Of course, it is possible to make real-time changes to the schedule enabling TESCO to react within minutes, if required.

In 2006, dunnhumby, the company that runs the Tesco Clubcard service and that probably knows more about your shopping habits at Tesco than you do, won the ad sales contract for Tesco TV’s “5,000 LCD and plasma screens across 100 Tesco Superstores and Extra outlets”. Since then, it has “redeveloped the network to make it more targeted, so that it complements in-store marketing and ties in with above-the-line campaigns”, renaming Tesco TV as Tesco Screens in 2007 as part of that effort (Dunnhumby expands Tesco TV content, dunnhumby relaunches Tesco in-store TV screens). Apparently, “[a]ll campaigns on Tesco Screens are analysed with a bespoke control group using EPOS and Clubcard data.” (If you’ve read any of my previous posts on the topic (e.g. The Tesco Data Business (Notes on “Scoring Points”) or ) you’ll know that dunnhumby excels at customer profiling and targeting.)

Now I don’t know about you, but dunnhumby’s apparent reach and ability to influence millions of shoppers at points of weakness is starting to scare me…(as well as hugely impressing me;-)

On a related note, it’s not just Tesco that use video screen advertising, of course. In Video, Video, Everywhere…, for example, I described how video advertising has now started appearing throughout the London Underground network.

So with the growth of video advertising, it’s maybe not so surprising that Joel Hopwood, one of the management team behind Tesco Screens Retail Media Group, should strike out with a start-up: Capture Marketing.

[Capture Marketing] may well be the first agency in the UK to specialise in planning, buying and optimising Retail Media across all retailers – independent of any retailer or media owner!!

They aim to buy from the likes of dunnhumby, JCDecaux, Sainsbury, Asda Media Centre etc in order to give clients a single, independent and authoritative buying and planning point for the whole sector. [DailyDOOH: What On Earth Is Shopper Marketing?]

So what’s the PR strapline for Capture Marketing? “Turning insight into influence”.

If you step back and look at our marketing mix across most of the major brands, it’s clearly shifting, and it’s shifting to in-store, to the internet and to trial activity.
So what’s the answer? Marketing to shoppers. We’ll help you get your message to the consumer when they’re in that crucial zone, after they’ve become a shopper, but before they’ve made a choice. We’ll help you take your campaign not just outside the home, but into the store. Using a wide range of media vehicles, from digital screens to web favourite interrupts to targeted coupons, retail media is immediate, proximate, effective and measurable.

I have no idea where any of this is going… Do you? Could it shift towards making use of VRM (“vendor relationship management”) content, in which customers are able to call up content they desire to help them make a purchase decision (such as price, quality, or nutrition information comparisons)? After all, scanner apps are already starting to appear on Android phones (e.g. ShopSavvy) and the iPhone (Snappr), not to mention the ability to recognise books from their cover or music from the sound of it (The Future of Search is Already Here).

PS Just by the by, here’s some thoughts about how Tesco might make use of SMS:

PPS for a quick A-Z of all you need to know to start bluffing about video based advertising, see Billboards and Beyond: The A-Z of Digital Screens.

Recession, What Recession?

Following on from my own Playing With Google Search Data Trends, and John’s Google’s predictive power (contd.) pick-up on this post from Bill Thompson, The net reveals the ties that bind, here’s one possible quick look at the impending state of the recession…

What search terms would you say are recession indicators?

PS I wonder to what extent, if any, the financial wizards factor real time “search intent” tracking into their stock trading strategies?

PPS I have to admit, I don’t really understand the shape of this trend at all?

The minima around Christmas look much of a muchness, but the New Year peak – and then the yearly average – are increasing, year on year? Any ideas?