Opening Up University Energy Data

Knowing how much energy a building uses is often the first step towards reducing its energy footprint, so here’s a quick round-up of the university open data initiatives I know of that are based around energy data. (If you know of more, please let me know via the comments.)

First up, via @lncd, here’s a heatmap showing change in building energy usage compared across the last two days:

Lincoln - 2-day energy data utilisation comparison

This visualisation (built on top of Lincoln U’s open data feeds (http://data.lincoln.ac.uk/)) charts energy usage over a calendar month:

Lincoln U - energy usage over time

Read more about Lincoln’s energy data hacks here: University of Lincoln Energy Data …. an update!.

Over in Oxford, there’s a tool called OpenMeters that displays charts of energy usage by building:

Oxford open energy data

The Oxford data is available as Linked Data from http://data.ox.ac.uk/datasets/, err, I think… As ever, it’d probably take me an hour or two to find out how to make my first query that returns anything meaningful to me in a form I could actually do anything with!;-) I also wonder whether the Oxford project feeds into another Oxford initiative: iMeasure?

(By the by, I misread the title of this paper from the other place to Oxford as “A Linked Data Model Of Building Energy Consumption”: A Limited-Data Model Of Building Energy Consumption, which explores a model “targeted at practical, wide-scale deployment which produces an ongoing breakdown of building energy consumption”.)

Whilst I don’t think Warwick University’s energy monitoring page is based on exposed open data, it does have some dials/gauges on it:

Warwick U - energy monitoring

(By the by – that page doesn’t have a Warwick favicon associated with it in my browser; should it?)

So – a quick round up (and huuuuuuge displacement activity from what I should have been doing this afternoon), but I think it’s worth tracking and reporting on these early demonstrations…

See also: JISC Green ICT projects (which I note don’t seem to return popular Google results for queries relating to university energy data, even when searches are limited to the .ac.uk domain…) and the Greening ICT Programme Community Site; Govspark, which compares energy usage across government departments; Innovations in Campus Mapping for a review of how open data is being used to support enhanced interactive campus maps; and Open Data Powered Location Based Services in UK Higher Education.

PS Given Lincoln are publishing all manner of open data, I wonder whether there is enough there to do an ad hoc version of the Heat and light by timetable project using data they’re producing anyway?!

Visualising New York Times Article API Tag Graphs Using d3.js

Picking up on Tinkering with the Guardian Platform API – Tag Signals, here’s a variant of the same thing using the New York Times Article Search API.

As with the Guardian OpenPlatform demo, the idea is to look up recent articles containing a particular search term, find the tag(s) used to describe the articles, and then graph them. The idea behind this approach is to get a quick snapshot of how the search target is represented, or positioned, in this case, by the New York Times.

Here is the code. The main differences compared to the Guardian API gist are as follows:

– a hacked recipe for getting several paged results back; I really need to sort this out properly, just as I need to generalise the code so it will work with either the Guardian or the NYT API, but that’s for another day now…

– the use of NetworkX as a way of representing the undirected tag-tag graph;

– the use of the NetworkX D3 helper library (networkx-d3) to generate a JSON output file that works with the d3.js force directed layout library.

Note that the D3 Python library generates a vanilla force directed layout diagram. In the end, I just grabbed the tiny bit of code that loads the JSON data into a D3 network, and then used it along with the code behind the rather more beautiful network layout used for this Mobile Patent Suits visualisation.

Here’s a snapshot of the result of a search for recent articles on the phrase Gates Foundation:

At this point, it’s probably worth pointing out that the Python script generates the graph file, and then the d3.js library generates the graph visualisation within a web browser. There is no human hand (other than force layout parameter setting) involved in the layout. I guess that with a little more juggling of the force layout parameters I could get an even clearer layout. It might also be worth trying to find a way of sizing, or at least colouring, the nodes according to degree (or even better, weighted degree?). I also need to find a way, if possible, of representing the weight of edges, if the D3 Python library actually exports this (or if it exports multiple edges between the same two nodes).

Anyway, for an hour or so’s tinkering, it’s quite a neat little recipe to be able to add to my toolkit. Here’s how it works again: Python script calls NYT Article Search API and grabs articles based on a search term. Grab the tags used to describe each article and build up a graph using NetworkX that connects any and all tags that are mentioned on the same article. Dump the graph from its NetworkX representation as a JSON file using the D3 library, then use the D3 Patent Suits layout recipe to render it in the browser :-)
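For reference, here’s a minimal sketch of the graph-building and JSON-dumping steps. The fetch_article_tags() helper is a hypothetical placeholder for the NYT Article Search API call (it should return one list of tag strings per matching article); the API key handling, real paging and error handling are all left out.

# Minimal sketch of the tag co-occurrence graph step (not the original script).
# fetch_article_tags() is a hypothetical placeholder for the NYT Article Search
# API call; it should return one list of tag strings per matching article.
import json
import itertools
import networkx as nx
from networkx.readwrite import json_graph

def fetch_article_tags(query, page=0):
    """Placeholder: call the NYT Article Search API, return [[tag, ...], ...]."""
    raise NotImplementedError

def build_tag_graph(query, pages=3):
    G = nx.Graph()
    for page in range(pages):                      # crude paging loop
        for tags in fetch_article_tags(query, page):
            # connect every pair of tags that appear on the same article
            for t1, t2 in itertools.combinations(sorted(set(tags)), 2):
                if G.has_edge(t1, t2):
                    G[t1][t2]['weight'] += 1       # count co-occurrences
                else:
                    G.add_edge(t1, t2, weight=1)
    return G

# Usage, once fetch_article_tags() is filled in with a real API call:
# G = build_tag_graph('Gates Foundation')
# with open('tags.json', 'w') as f:
#     json.dump(json_graph.node_link_data(G), f)   # node-link JSON for d3.js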

Now all I have to do is find out how I can grab an SVG dump of the network from a browser into a shareable file…

Active Lobbying Through Meetings with UK Government Ministers

In a move that seemed to upset @whoslobbying, collector of UK ministerial meeting data, on the grounds of wasted effort, the Guardian datastore last night published a spreadsheet containing data relating to ministerial meetings between May 2010 and March 2011.

(The first release of the spreadsheet actually omitted the column containing who the meeting was with, but that seems to be fixed now… There are, however, still plenty of character encoding issues (apostrophes, accented characters, some sort of em-dash, etc) that might cripple some plug and play tools.)

Looking over the data, we can use it as the basis for a network diagram, with actors (Ministers and lobbyists) as nodes and edges representing meetings between Ministers and lobbyists. There is one slight complication: where there is a meeting between a Minister and several lobbyists, we ideally need to split the lobbyists out into their own separate nodes.

UK gov meetings spreadsheet

This probably provides an ideal opportunity to have a play with the Stanford Data Wrangler and try forcing these separate lobbyists onto separate rows, but I didn’t allow myself much time for the tinkering (and the requisite learning!), so I resorted to a Python script to read in the data file and split out the different lobbyists. (I also did an iterative step, cleaning the downloaded CSV file in a text editor by replacing nasty characters that caused the script to choke.) You can find the script here (note that it makes use of the networkx network analysis library, which you’ll need to install if you want to run the script.)
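For what it’s worth, the nub of that splitting-and-graph-building step looks something like the following sketch (not the published script: the column positions and the ‘;’ separator between multiple lobbyists in one cell are assumptions about the spreadsheet layout, and the file needs the character cleanup described above first).

# Rough sketch of the split-and-build step (not the published script). The
# column positions and the ';' separator between multiple lobbyists in one
# cell are assumptions about the spreadsheet layout.
import csv
import networkx as nx

G = nx.DiGraph()

f = open('ministerial-meetings.csv', 'rb')
reader = csv.reader(f)
reader.next()  # skip the header row
for row in reader:
    minister = row[0].strip()
    lobbyists = row[3]  # this cell may name several organisations at once
    for lobbyist in lobbyists.split(';'):
        lobbyist = lobbyist.strip()
        if minister and lobbyist:
            # one node per organisation, edge from Minister to lobbyist
            G.add_edge(minister, lobbyist)
f.close()

nx.write_graphml(G, 'ministerMeetings.graphml')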

The script generates a directed graph with links from Ministers to lobbyists and dumps it to a GraphML file (available here) that can be loaded directly into Gephi. Here’s a view – using Gephi – of the heart of the network. If we filter the graph to show nodes that met with at least five different Ministers…

Gephi - k-core filter

we can get a view into the heart of the UK lobbying network:

Active Lobbyists

I sized the lobbyist nodes according to eigenvector centrality, which gives an indication of how well connected they are in the network.

One of the nice things about Gephi is that it allows for interactive exploration of a graph. For example, I can hover over a lobbyist node – Barclays in this case – to see which Ministers were met:

Bankers connect...

Alternatively, we can see who of the well connected met with the Minister for Welfare Reform:

Welfare meetings...

Looking over the data, we also see how some Ministers are inconsistently referenced within the original dataset:

Multiple mentions

Note that the layout algorithm is such that the different representations of the same name are likely to meet similar lobbyists, which will end up placing the nodes in similar locations under the force directed layout I used. Which is to say – we may be able to use visual tools to help us identify fractured representations of the same individual. (Note that multiple meetings between the same parties can be visualised using the thickness of the edges, which are weighted according to the number of times the edge is described in the GraphML file…)

Unifying the different representations of the same individual is something that Google Refine could help us tidy up with its various clustering tools, although it would be nice if the Datastore folk addressed this at source (or at least, as part of an ongoing data quality enhancement process…;-)
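As a quick stopgap before reaching for Refine, something as simple as Python’s difflib can at least flag likely near-duplicate name strings for manual checking – a rough illustration (not Refine’s clustering algorithms), assuming names is the list of Minister name strings pulled out of the data:

# Quick illustration of flagging near-duplicate names with difflib (this is
# not Google Refine's clustering, just a rough stand-in for eyeballing).
import difflib

def near_duplicates(names, cutoff=0.85):
    seen = sorted(set(names))
    for i, name in enumerate(seen):
        matches = difflib.get_close_matches(name, seen[i+1:], n=3, cutoff=cutoff)
        if matches:
            print name, '~', ', '.join(matches)

# dummy example input, not taken from the actual spreadsheet
near_duplicates(['Nick Clegg', 'Nick Clegg MP', 'Rt Hon Nick Clegg'])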

I guess we could also try reconciling company names against universal company identifiers, for example by using Google Refine’s reconciliation service and the Open Corporates database? Hmmm, which makes me wonder: do MySociety, or Public Whip, offer an MP or Ministerial position reconciliation service that works with Google Refine?

A couple of things I haven’t done: represented the department (which could be done via a node attribute, maybe, at least for the Ministers); represented actual meetings, and what I guess we might term co-lobbying behaviour, where several organisations are in the same meeting.

Getting My Eye In Around F1 Quali Data – Parallel Coordinate Plots, Sort of…

Looking over the sector times from the qualifying session for tomorrow’s Hungarian Grand Prix, I noticed that Vettel was only fastest in one of the sectors.

Whilst looking for an easy way of shaping an R data frame so that I could plot categorical values sector1, sector2, sector3 on the x-axis, and then a line for each driver showing their time in the sector on the y-axis (I still haven’t worked out how to do that? Any hints? Add them to the comments please…;-), I came across a variant of the parallel coordinate plot hidden away in the lattice package:

f1 2011 hun quali sector times parallel coordinate plot

What this plot does is, for each row (i.e. each driver), take values from separate columns (i.e. times from each sector), normalise them, and then plot lines between the normalised values, one “axis” per column; each row defines a separate category.

The normalisation obviously hides the magnitude of the differences between the time deltas in each sector (the min-max range might be hundredths in one sector, tenths in another), but this plot does show us that there are different groupings of cars – there is clear(?!) white space in the diagram:

Clusters in sectors times

Whilst the parallel co-ordinate plot helps identify groupings of cars, and shows where they may have similar performance, it isn’t so good at helping us get an idea of which sector had most impact on the final lap time. For this, I think we need to have a single axis in seconds showing the delta from the fastest time in the sector. That is, we should have a parallel plot where the parallel axes have the same scale, but in terms of sector time, a floating origin (so e.g. the origin for one sector might be 28.6s and for another, 22.4s). For convenience, I’d also like to see the deltas shown on the y-axis, and the categorical ranges arranged on the x-axis (in contrast to the diagrams above, where the different ranges appear along the y-axis).
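(The post sticks with R’s lattice package, but just to pin the idea down, here’s a rough Python/pandas sketch of that floating-origin view: subtract the fastest time in each sector from everyone’s time, then plot the resulting deltas on a common seconds scale. The data frame here is dummy data, not the actual quali times.)

# Illustration of the "delta from fastest in sector" view described above,
# using dummy numbers rather than the real Hungaroring quali data.
import pandas as pd
import matplotlib.pyplot as plt

quali = pd.DataFrame({
    'driver':  ['Car A', 'Car B', 'Car C'],       # dummy labels
    'sector1': [28.80, 28.95, 29.10],             # dummy times (seconds)
    'sector2': [27.30, 27.25, 27.60],
    'sector3': [22.45, 22.60, 22.55],
})

sectors = ['sector1', 'sector2', 'sector3']
deltas = quali[sectors] - quali[sectors].min()    # floating origin per sector

for label, row in zip(quali['driver'], deltas.values):
    plt.plot(range(len(sectors)), row, marker='o', label=label)

plt.xticks(range(len(sectors)), sectors)
plt.ylabel('Delta from fastest in sector (s)')
plt.legend()
plt.show()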

PS I also wonder to what extent we can identify signatures for the different teams? Eg the fifth and sixth slowest cars in sector 1 have the same signature across all three sectors and come from the same team; and the third and fourth slowest cars in sector 2 have a very similar signature (and again, represent the same team).

Where else might we look for signatures? In the speed traps maybe? Here’s what the parallel plot for the speed traps looks like:

Speed trap parallel plot

(In this case, max is better = faster speed.)

To combine the views (timings and speed), we might use a formulation of the flavour:

parallel(~data.frame(a$sector1,a$sector2,a$sector3, -a$inter1,-a$inter2,-a$finish,-a$trap))

Combined parallel plot - times and speeds

This is a bit too cluttered to pull much out of though? I wonder if changing the order of the parallel axes might help, e.g. by trying to come up with an order that minimises the number of crossed lines?

And if we colour lines by team, can we see any characteristics?

Looking for patterns across teams

Using a dashed, rather than solid, line makes the chart a little easier to read (more white space). Using a thicker line also helps bring out the colours.

parallel(~data.frame(a$sector1,-a$inter1,-a$inter2,a$sector2,a$sector3, -a$finish,-a$trap),col=a$team,lty=2,lwd=2)

Here’s another ordering of the axes:

Another attempt at ordering axes

Here are the sector times ordered by team (min is better):

Sector times coloured by team

Here are the speeds by team (max is better):

Speeds by team

Again, we can reorder this to try to make it easier(?!) to pull out team signatures:

Reordering speed traps

(I wonder – would it make sense to try to order these based on similarity eg derived from a circuit guide?)

Hmmm… I need to ponder this…

Surveying the Territory: Open Source, Open-Ed and Open Data Folk on Twitter

Over the last few weeks, I’ve been tinkering with various ways of using the Twitter API to discover Twitter lists relating to a particular topic area, whether discovered through a particular hashtag, search term, a list that already exists on a topic, or one or more people who may be associated with a particular topic area.

On my to do list is a map of the “open” community on Twitter – and the relationships between them – that will try to identify notable folk in different areas of openness (open government, open data, open licensing, open source software) and the communities around them, then aggregate all these open aficionados, plot the network connections between them, and remap the result (to see whether the distinct communities we started with fall out, as well as to discover who acts as the bridges between them, or alternatively discover whether new emergent groupings appear to crystallise out based on network connectivity).

As a step on the road to that, I had a quick peek around to see who was tweeting using the #oscon hashtag over the weekend. Through analysing people who were tweeting regularly around the topic, I identified several lists in the area: @realist/opensource, @twongkee/opensource, @lemasney/opensource, @suncao/open-linked-free, @jasebo/open-source

Pulling down the members of these lists, and then looking for connections between them, I came up with this map of the open source community on Twitter:

A peek at FOSS community on Twitter

Using a different technique not based on lists, I generated a map of the open data community based on the interconnections between people followed by @openlylocal:

How the people @countculture follows follow each other

and the open education community based on the people that follow @opencontent:

How followers of @Opencontent follow each other

(So those are different ways of identifying the members of each community, right? One based on lists that mention users of a particular hashtag, one based on folk a particular individual follows, and one based on the folk that follow a particular individual.)

I’ve also toyed with looking at communities defined by members of lists that mention a particular individual, or people followed by a particular individual, as well as ones based on members of lists that contain folk listed on one or more trusted, curated lists in a particular topic area (got that?!;-).

Whilst the graphs based on mapping the friends or followers of an individual give a good overview of that individual’s sphere of interest or influence, I think the community graphs derived from finding connections between people mentioned on “lists in the area” are a bit more robust in terms of mapping out communities in general, though I guess I’d need to do “proper” research to demonstrate that?

As mentioned at the start, the next thing on my list is a map across the aggregated “open” communities on Twitter. Of course, being digerati, many of these people will have decamped to GooglePlus. So maybe I shouldn’t bother, but instead wait for Google+ to mature a bit, an API to become available, blah, blah, blah…

Quick Command Line Reports from CSV Data Parsed Out of XML Data Files

It’s amazing how time flies, isn’t it..? Spotting today’s date I realised that there’s only a week left before the closing date of the UK Discovery Developer Competition, which is making available several UK “cultural metadata” datasets from library catalogue and activity data, EDINA OpenUrl resolver data, National Archives images and English Heritage places metadata, as well as ArchivesHub project related Linked Data and Tyne and Wear Museums Collections metadata.

I was intending to have a look at how easy it was to engage with the datasets (e.g. by blogging initial explorations for each dataset along the lines of the text processing tricks I posted around the EDINA data in Postcards from a Text Processing Excursion, Playing With Large (ish) CSV Files, and Using Them as a Database from the Command Line: EDINA OpenURL Logs and Visualising OpenURL Referrals Using Gource), but I seem to have left it a bit late considering other work I need to get done this week… (i.e. marking:-(

..except for posting this old bit of code I don’t think I’ve posted before that demonstrates how to use the Python scripting language to parse an XML file, such as the Huddersfield University library MOSAIC activity data, and create a CSV/text file that we can then run simple text processing tools against.

If you download the Huddersfield or Lincoln data (e.g. from http://library.hud.ac.uk/wikis/mosaic/index.php/Project_Data) and have a look at a few lines from the XML files (e.g. using the Unix command line tool head, as in head -n 150 filename.xml to show the first 150 lines of the file), you will notice records of the form:

<useRecordCollection>
  <useRecord>
    <from>
      <institution>University of Huddersfield</institution>
      <academicYear>2008</academicYear>
      <extractedOn>
        <year>2009</year>
        <month>6</month>
        <day>4</day>
      </extractedOn>
      <source>LMS</source>
    </from>
    <resource>
      <media>book</media>
      <globalID type="ISBN">1903365430</globalID>
      <author>Elizabeth I, Queen of England, 1533-1603.</author>
      <title>Elizabeth I : the golden reign of Gloriana /</title>
      <localID>585543</localID>
      <catalogueURL>http://library.hud.ac.uk/catlink/bib/585543</catalogueURL>
      <publisher>National Archives</publisher>
      <published>2003</published>
    </resource>
    <context>
      <courseCode type="ucas">LQV0</courseCode>
      <courseName>BA(H) Humanities</courseName>
      <progression>UG2</progression>
    </context>
  </useRecord>
</useRecordCollection>

Suppose we want to extract the data showing which courses each resource was borrowed against. That is, for each use record, we want to extract a localID and the courseCode. The following script achieves that:

from lxml import etree
import csv
#Inspired by http://www.blog.pythonlibrary.org/2010/11/20/python-parsing-xml-with-lxml/
#----------------------------------------------------------------------
def parseMOSAIC_Level1_XML(xmlFile,writer):
	context = etree.iterparse(xmlFile)
	record = {}
	# we are going to use record to create a record containing UCAS codes
	# record={ucasCode:[],localID:''}
	records = []
	print 'starting...'
	for action, elem in context:
		if elem.tag=='useRecord' and action=='end':
			#we have parsed the end of a useRecord, so output course data
			if 'ucasCode' in record and 'localID' in record:
				for cc in record['ucasCode']:
					writer.writerow([record['localID'],cc])
			record={}
		if elem.tag=='localID':
			record['localID']=elem.text
		elif elem.tag=='courseCode' and 'type' in elem.attrib and elem.attrib['type']=='ucas':
			if 'ucasCode' not in record:
				record['ucasCode']=[]
			record['ucasCode'].append(elem.text)
		elif elem.tag=='progression' and elem.text=='staff':
			record['staff']='staff'
	#return records

writer = csv.writer(open("test.csv", "wb"))

f='mosaic.2008.level1.1265378452.0000001.xml'
s='mosaic.2008.sampledata.xml'
parseMOSAIC_Level1_XML(f,writer)

Usage (if you save the code to the file mosaicXML2csv.py): python mosaicXML2csv.py
Note: this minimal example uses the file specified by f=, in the above case mosaic.2008.level1.1265378452.0000001.xml and writes the CSV out to test.csv

(You can also find the code as a gist on Github: simple Python XML2CSV converter)

Running the script gives data of the form:

185215,L500
231109,L500
180965,W400
181384,W400
180554,W400
201002,W400
...

Note that we might add an additional column for progression. Add in something like:
if elem.tag=='progression': record['progression']=elem.text
and modify the write command to something like writer.writerow([record['localID'],cc,record['progression']])

We can now generate quick reports over the simplified test.csv data file.

For example, how many records did we extract:
wc -l test.csv

If we sort the records (and by so doing, group duplicated rows) [sort test.csv], we can then pull out unique rows and count the number of times they repeat [uniq -c], then sort them by the number of reoccurrences [sort -k 1 -n -r] and pull out the top 20 [head -n 20] using the combined, piped Unix commandline command:

sort test.csv | uniq -c | sort -k 1 -n -r | head -n 20

This gives an output of the form:
186 220759,L500
134 176259,L500
130 176895,L500

showing that resource with localID 220759 was taken out on course L500 186 times.

If we just want to count the number of books taken out on a course as a whole, we can just pull out the coursecode column using the cut command, setting the delimiter to be a comma:
cut -f 2 -d ',' test.csv

Having extracted the course code column, we can sort, find repeat counts, sort again and show the courses with the most borrowings against them:

cut -f 2 -d ',' test.csv | sort | uniq -c | sort -k 1 -n -r | head -n 10

This gives a result of the form:
13476 L500
8799 M931
7499 P301

In other words, we can create very quick and dirty reports over the data using simple commandline tools once we generate a row based, simply delimited text file version of the original XML data report.

Having got the data in a simple test.csv text file, we can also load it directly into the graph plotting tool Gephi, where the two columns (localID and courseCode) are both interpreted as nodes, with an edge going from the localID to the courseCode. (That is, we can treat the two column CSV file as defining a bipartite graph structure.)

Running a clustering statistic and a statistic that allows us to size nodes according to degree, we can generate a view over the data that shows the relative activity against courses:

Huddersfield mosaic data

Here’s another view, using a different layout:

Huddersfield JISC MOSAIC activity data

Also note that by choosing an appropriate layout algorithm, the network structure visually identifies courses that are “similar” by virtue of being connected to similar resources. The thickness of the edges is proportional to the number of times a resource was borrowed against a particular course, so we can also visually identify such items at a glance.

Visualising Twitter Friend Connections Using Gephi: An Example Using the @WiredUK Friends Network

To corrupt a well known saying, “cook a man a meal and he’ll eat it; teach a man a recipe, and maybe he’ll cook for you…”, I thought it was probably about time I posted the recipe I’ve been using for laying out Twitter friends networks using Gephi, not least because I’ve been generating quite a few network files for folk lately, giving them copies, and then not having a tutorial to point them to. So here’s that tutorial…

The starting point is actually quite a long way down the “how did you do that?” chain, but I have to start somewhere, and the middle’s easier than the beginning, so that’s where we’ll step in (I’ll give some clues as to how the beginning works at the end…;-)

Here’s what we’ll be working towards: a diagram that shows how the people on Twitter that @wiredUK follows follow each other:

@wireduk innerfriends

The tool we’re going to use to layout this graph from a data file is a free, extensible, open source, cross platform Java based tool called Gephi. If you want to play along, download the datafile. (Or try with a network of your own, such as your Facebook network or social data grabbed from Google+.)

From the Gephi file menu, Open the appropriate graph file:

Gephi - file open

Import the file as a Directed Graph:

Gephi - import directed graph

The Graph window displays the graph in a raw form:

Gephi -graph view of imported graph

Sometimes a graph may contain nodes that are not connected to any other nodes. (For example, protected Twitter accounts do not publish – and are not published in – friends or followers lists publicly via the Twitter API.) Some layout algorithms may push unconnected nodes far away from the rest of the graph, which can affect generation of presentation views of the network, so we need to filter out these unconnected nodes. The easiest way of doing this is to filter the graph using the Giant Component filter.

Gephi - filter on Giant Component

To colour the graph, I often make use of the modularity statistic. This algorithm attempts to find clusters in the graph by identifying components that are highly interconnected.

Gephi - modularity statistic

This algorithm is a random one, so it’s often worth running it several times to see how many communities typically get identified.

A brief report is displayed after running the statistic:

Gephi - modularity statistic report

While we have the Statistics panel open, we can take the opportunity to run another measure: the HITS algorithm. This generates the well known Authority and Hub values which we can use to size nodes in the graph.

Gephi - HITS statistic

The next step is to actually colour the graph. In the Partition panel, refresh the partition options list and then select Modularity Class.

Gephi - select modularity partition

Choose appropriate colours (right click on each colour panel to select an appropriate colour for each class – I often select pastel colours) and apply them to the graph.

Gephi - colour nodes by modularity class

The next thing we want to do is lay out the graph. The Layout panel contains several different layout algorithms that can be used to support the visual analysis of the structures inherent in the network (try some of them – each works in a slightly different way; some are also better than others for coping with large networks). For a network this size and this densely connected, I’d typically start out with one of the force directed layouts, which position nodes according to how tightly linked they are to each other.

Gephi select a layout

When you select the layout type, you will notice there are several parameters you can play with. The default set is often a good place to start…

Run the layout tool and you should see the network start to lay itself out. Some algorithms require you to actually Stop the layout algorithm; others terminate themselves according to a stopping criterion, or because they are a “one-shot” application (such as the Expansion algorithm, which just scales the x and y values by a given factor).

Gephi - forceAtlas 2

We can zoom in and out on the layout of the graph using a mouse wheel (on my MacBook trackpad, I use a two finger slide up and down), or use the zoom slider from the “More options” tab:

Gephi zoom

To see which Twitter ID each node corresponds to, we can turn on the labels:

Gephi - labels

This view is very cluttered – the nodes are too close to each other to see what’s going on. The labels and the nodes are also all the same size, giving the same visual weight to each node and each label. One thing I like to do is resize the nodes relative to some property, and then scale the label size to be proportional to the node size.

Here’s how we can scale the node size and then set the text label size to be proportional to node size. In the Ranking panel, select the node size property, and the attribute you want to make the size proportional to. I’m going to use Authority, which is a network property that we calculated when we ran the HITS algorithm. Essentially, it’s a measure of how well linked a node is.

Gephi - node sizing

The min size/max size slider lets us define the minimum and maximum node sizes. By default, a linear mapping from attribute value to size is used, but the spline option lets us use a non-linear mapping.

Gephi - node sizing spline

I’m going with the default linear mapping…

Gephi - size nodes

We can now scale the labels according to node size:

Gephi - scale labels

Note that you can continue to use the text size slider to scale the size of all the displayed labels together.

This diagram is now looking quite cluttered – to make it easier to read, it would be good if we could spread it out a bit. The Expansion layout algorithm can help us do this:

Gephi - expansion

A couple of other layout algorithms that are often useful: the Transformation layout algorithm lets us scale the x and y axes independently (compared to the Expansion algorithm, which scales both axes by the same amount); and the Clockwise Rotate and Counter-Clockwise Rotate algorithms let us rotate the whole layout (this can be useful if you want to rotate the graph so that it fits neatly into a landscape view).

The expanded layout is far easier to read, but some of the labels still overlap. The Label Adjust layout tool can jiggle the nodes so that they don’t overlap.

gephi - label adjust

(Note that you can also move individual nodes by clicking on them and dragging them.)

So – nearly there… The final push is to generate a good quality output. We can do this from the preview window:

Gephi preview window

The preview window is where we can generate good quality SVG renderings of the graph. The node size, colour and scaled label sizes are determined in the original Overview area (the one we were working in), although additional customisations are possible in the Preview area.

To render our graph, I just want to make a couple of tweaks to the original Default preview settings: Show Labels and set the base font size.

Gephi - preview settings

Click on the Refresh button to render the graph:

Gephi - preview refresh

Oops – I overdid the font size… let’s try again:

gephi - preview resize

Okay – so that’s a good start. Now I find I often enter into a dance between the Preview and Overview panels, tweaking the layout until I get something I’m satisfied with, or at least, that’s half-way readable.

How to read the graph is another matter of course, though by using colour, sizing and placement, we can hopefully draw out in a visual way some interesting properties of the network. The recipe described above, for example, results in a view of the network that shows:

– groups of people who are tightly connected to each other, as identified by the modularity statistic and consequently group colour; this often defines different sorts of interest groups. (My follower network shows distinct groups of people from the Open University, and JISC, the HE library and educational technology sectors, UK opendata and data journalist types, for example.)
– people who are well connected in the graph, as displayed by node and label size.

Here’s my final version of the @wiredUK “inner friends” network:

@wireduk innerfriends

You can probably do better though…;-)

To recap, here’s the recipe again:

– filter on the connected component (private accounts don’t disclose friend/follower detail to the API key I use) to give a connected graph;
– run the modularity statistic to identify clusters; sometimes I try several attempts
– colour by modularity class identified in previous step, often tweaking colours to use pastel tones
– I often use a force directed layout, then Expansion to spread the network out a bit if necessary; the Clockwise Rotate or Counter-Clockwise Rotate will rotate the network view; I often try to get a landscape format; the Transformation layout lets you expand or contract the graph along a single axis, or both axes by different amounts.
– run HITS statistic and size nodes by authority
– size labels proportional to node size
– use Label Adjust and Expansion to tweak the layout
– use preview with proportional labels to generate a nice output graph
– iterate previous two steps to a get a layout that is hopefully not completely unreadable…

Got that?!;-)

Finally, to return to the beginning. The recipe I use to generate the data is as follows:

  1. grab a list of twitter IDs (call it L); there are several ways of doing this, for example: obtain a list of tweets on a particular topic by searching for a particular hashtag, then grab the set of unique IDs of people using the hashtag; grab the IDs of the members of one or more Twitter lists; grab the IDs of people following or followed by a particular person; grab the IDs of people sending geo-located tweets in a particular area;
  2. for each person P in L, add them as a node to a graph;
  3. for each person P in L, get a list of people followed by the corresponding person, e.g. Fr(P)
  4. for each X in e.g. Fr(P): if X in Fr(P) and X in L, create an edge [P,X] and add it to the graph
  5. save the graph in a format that can be visualised in Gephi.

To make this recipe, I use Tweepy and a Python script to call the Twitter API and get the friends lists from there, but you could use the Google Social API to get the same data. There’s an example of calling that API using Javascript in my “live” Twitter friends visualisation script (Using Protovis to Visualise Connections Between People Tweeting a Particular Term) as well as in the A Bit of NewsJam MoJo – SocialGeo Twitter Map.
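For the record, the graph-building part of that recipe is only a few lines of NetworkX – something like the sketch below, where get_friend_ids() stands in for whichever Tweepy call your version of the library provides for fetching the IDs a user follows (authentication, paging and rate-limit handling are all left out).

# Sketch of the friends-network recipe (steps 1-5 above), not the actual script.
# get_friend_ids() is a placeholder for the Tweepy/Twitter API call that returns
# the IDs a user follows; auth, paging and rate limiting are left out.
import networkx as nx

def get_friend_ids(user_id):
    """Placeholder: return the list of IDs that user_id follows."""
    raise NotImplementedError

def friends_network(ids):
    members = set(ids)                     # step 1: the list L of Twitter IDs
    G = nx.DiGraph()
    G.add_nodes_from(members)              # step 2: one node per person in L
    for p in members:
        for x in get_friend_ids(p):        # step 3: who does P follow?
            if x in members:               # step 4: only keep edges within L
                G.add_edge(p, x)
    return G

# step 5: save in a format Gephi can open, e.g.
# G = friends_network(list_of_twitter_ids)
# nx.write_graphml(G, 'friends.graphml')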

Follower Networks and “List Intelligence” List Contexts for @JiscCetis

I’ve been tinkering with some of my “List Intelligence” code again, and thought it worth capturing some examples of the sort of network exploration recipes I’m messing around with at the moment.

Let’s take @jiscCetis as an example; this account follows no-one, is followed by a few, hasn’t much of a tweet history, and is listed by a handful of others.

Here’s the follower network, based on how the followers of @jiscetis follow each other:

Friend connections between @Jisccetis followers

There are three (maybe four) clusters there, plus all the folk who don’t follow any of the other @jisccetis followers…: do these follower clusters make any sort of sense, I wonder? (How would we label them…?)

The next thing I thought to do was look at the people who were on the same lists as @jisccetis, and get an overview of the territory that @jisccetis inhabits by virtue of shared list membership.

Here’s a quick view over the folk on lists that @jisccetis is a member of. The nodes are users named on the lists that @jisccetis is named on, the edges are undirected and join individuals who are on the same list.

Distribution of users named on lists that jisccetis is a member of

Plotting “co-membership” edges is hugely expensive in terms of upping the edge count that has to be rendered, but we can use a directed bipartite graph to render the same information (and arguably even more information); here, there are two sorts of nodes: lists, and the members of lists. Edges go from members to list names (I should swap this direction really to make more sense of authority/hub metrics…?)

jisccetis co-list membership
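(For what it’s worth, the bipartite version is only a few lines of NetworkX – a sketch along these lines, where list_members is an assumed dict mapping each list name to the set of people named on it, fetched separately from the Twitter API.)

# Sketch of the directed bipartite list-membership graph described above;
# list_members is an assumed {list_name: set_of_member_names} dict that has
# been fetched from the Twitter API elsewhere.
import networkx as nx

def bipartite_list_graph(list_members):
    G = nx.DiGraph()
    for list_name, members in list_members.items():
        G.add_node(list_name, kind='list')       # one node per list
        for member in members:
            G.add_node(member, kind='user')      # one node per person
            G.add_edge(member, list_name)        # edges run member -> list
    return G

# e.g. G = bipartite_list_graph(list_members)
#      nx.write_graphml(G, 'colistMembership.graphml')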

Another thing I thought I’d explore is the structure of the co-list membership community. That is, for all the people on the lists that @jisccetis is a member of, how do those users follow each other?

How folk on same lists as @jisccetis follow each other

It may be interesting to explore in a formal way the extent to which the community groups that appear to arise from the friending relationships are reflected (or not) by the make up of the lists?

It would probably also be worth trying to label the follower group – are there “meaningful” (to @jisccetis? to the @jisccetis community?) clusters in there? How would you label the different colour groupings? (Let me know in the comments…;-)

A Map of My Twitter Follower Network

Twitter may lay claim to millions of users, but we tend to only inhabit a small part of it… I follow 500 or so people, and am followed by 3000 or so “curated” followers (I block maybe 20-50 a week, and not all of them obvious spam accounts, in part because I see the list of folk who follow me as a network that defines several particular spheres of interest, and I don’t want to drown out signal with noise.)

So here’s a map of the connected component of the graph of how my Twitter followers follow each other; it excludes people who aren’t followed by anyone in the graph (which may include folk who do follow me but who have private accounts).

The layout is done in Gephi using the Force Atlas 2 layout. It’s quite by chance that the layout resembles the Isle of Wight… or a heart? Yes, maybe it’s a great big heart:-)

My twitter follower net - connected component

By running the HITS statistic over the graph, we can get a feel for who the influential folk are; sizing and labeling nodes by Authority, we get this (click through to see a bigger version):

My twitter follower network

Here’s an annotated version, as I see it (click through to see a bigger version):

My annotated twitter follower network

If you’d like me to spend up to 20 mins making a map for you, I’ll pick up on an idea from Martin Hawksey and maybe do two or three maps for a donation to charity (in particular, to Ovacome). In fact, if you feel as if you’ve ever benefited from anything posted to this blog, why not give them a donation anyway…? Donate to ovacome.

Another Blooming Look at Gource and the Edina OpenURL Data

Having done a first demo of how to use Gource to visualise activity around the EDINA OpenURL data (Visualising OpenURL Referrals Using Gource), I thought I’d try something a little more artistic, and use the colour features to try to pull out a bit more detail from the data [video].

What this one shows is how the mendeley referrals glow brightly green, which – if I’ve got my code right – suggests a lot of e-issn lookups are going on (the red nodes correspond to an issn lookup, blue to an isbn lookup and yellow/orange to an unknown lookup). The regularity of activity around particular nodes also shows how a lot of the activity is actually driven by a few dominant services, at least during the time period I sampled to generate this video.

So how was this visualisation created?

Firstly, I pulled out a few more data columns, specifically the issn, eissn, isbn and genre data. I then opted to set node colour according to whether the issn (red), eissn (green) or isbn (blue) columns were populated, using a default reasoning approach (if all three were blank, I coloured the node yellow). I then experimented with colouring the actors (I think?) according to whether the genre was article-like, book-like or unknown (mapping these on to add, modify or delete actions), before dropping the size of the actors altogether in favour of just highlighting referrers and asset type (i.e. issn, e-issn, book or unknown).

cut -f 1,2,3,4,27,28,29,32,40 L2_2011-04.csv > openurlgource.csv

When running the Python script, I got a “NULL Byte” error that stopped the script working (something obviously snuck in via one of the newly added columns), so I googled around and turned up a little command line cleanup routine for the cut data file:

tr < openurlgource.csv -d '\000' > openurlgourcenonulls.csv

Here’s the new Python script too that shows the handling of the colour fields:

import csv
from time import *

# Command line pre-processing step to handle NULL characters
#tr < openurlgource.csv -d '\000' > openurlgourcenonulls.csv
#alternatively?: sed 's/\x0/ /g' openurlgource.csv > openurlgourcenonulls.csv

f=open('openurlgourcenonulls.csv', 'rb')

reader = csv.reader(f, delimiter='\t')
writer = csv.writer(open('openurlgource.txt','wb'),delimiter='|')
headerline = reader.next()

for row in reader:
	if row[8].strip() !='':
		t=int(mktime(strptime(row[0]+" "+row[1], "%Y-%m-%d %H:%M:%S")))
		if row[4]!='':
			col='FF0000'
		elif row[5]!='':
			col='00FF00'
		elif row[6]!='':
			col='0000FF'
		else:
			col='666600'
		if row[7]=='article' or row[7]=='journal':
			typ='A'
		elif row[7]=='book' or row[7]=='bookitem':
			typ='M'
		else:
			typ='D'
		agent=row[8].rstrip(':').replace(':','/')
		writer.writerow([t,row[3],typ,agent,col])

The new gource command is:

gource -s 1 --hide usernames --start-position 0.8 --stop-position 0.82 --user-scale 0.1 openurlgource.txt

and the command to generate the video:

gource -s 1 --hide usernames --start-position 0.8 --stop-position 0.82 --user-scale 0.1 -o - openurlgource.txt | ffmpeg -y -b 3000K -r 60 -f image2pipe -vcodec ppm -i - -vcodec libx264 -vpre slow -threads 0 gource.mp4

If you’ve been tempted to try Gource out yourself on some of your own data, please post a link in the comments below:-) (I wonder just how many different sorts of data we can force into the shape that Gource expects?!;-)