OUseful.Info, the blog…

Trying to find useful things to do with emerging technologies in open education

Posts Tagged ‘gephi’

Visualising F1 Timing Sheet Data

Putting together a couple of tricks from recent posts (Visualising Vodafone Mclaren F1 Telemetry Data in Gephi and PDF Data Liberation: Formula One Press Release Timing Sheets), I thought I’d have a little play with the timing sheet data in Gephi…

The representations I have used to date are graph based, with each node corresponding to a particular lap performance by a particular driver, and edges connecting consecutive laps.

**If you want to play along, you’ll need to download Gephi and this data file: F1 timing, Malaysia 2011 (NB it’s not thoroughly checked… glitches may have got through in the scraping process:-(**

The nodes carry the following data, as specified using the GDF format:

  • name VARCHAR: the ID of each node, given as driverNumber_lapNumber (e.g. 12_43)
  • label VARCHAR: the name of the driver (e.g. S. VETTEL)
  • driverID INT: the driver number (e.g. 7)
  • driverNum VARCHAR: an ID for the driver of the lap (e.g. driver_12)
  • team VARCHAR: the team name (e.g. Vodafone McLaren Mercedes)
  • lap INT: the lap number (e.g. 41)
  • pos INT: the position at the end of the lap (e.g. 5)
  • pitHistory INT: the number of pitstops to date (e.g. 2)
  • pitStopThisLap DOUBLE: the duration of any pitstop this lap, else 0 (e.g. 12.321)
  • laptime DOUBLE: the laptime, in seconds (e.g. 72.125)
  • lapdelta DOUBLE: the difference between the current laptime and the previous laptime (e.g. 1.327)
  • elapsedTime DOUBLE: the summed laptime to date (e.g. 1839.021)
  • elapsedTimeHun DOUBLE: the elapsed time divided by a hundred (e.g. )
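
By way of illustration, here's roughly what the GDF file looks like. The sample row below just strings together the field examples above, so it isn't an internally consistent row from the actual data file, and the edge column names may differ slightly from those in the download:

nodedef> name VARCHAR,label VARCHAR,driverID INT,driverNum VARCHAR,team VARCHAR,lap INT,pos INT,pitHistory INT,pitStopThisLap DOUBLE,laptime DOUBLE,lapdelta DOUBLE,elapsedTime DOUBLE,elapsedTimeHun DOUBLE
12_43,S. VETTEL,7,driver_12,Vodafone McLaren Mercedes,41,5,2,12.321,72.125,1.327,1839.021,18.39021
edgedef> node1 VARCHAR,node2 VARCHAR
12_42,12_43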

Using the geolayout with an equirectangular (presumably this means Cartesian?) projection, we can generate a range of charts simply by selecting suitable co-ordinate dimensions. For example, if we select laptime as the y ("latitude") co-ordinate and lap number as the x ("longitude") co-ordinate, filtering out the nodes with a null laptime value, we can generate a graph of the form:

We can then tweak this a little - e.g. colour the nodes by driver (using a Partition based colouring), and the edges according to their nodes, resize the nodes to show the number of pit stops to date, and then filter to compare just a couple of drivers:

This sort of lap time comparison is all very well, but it doesn't necessarily tell us relative track positions. If we size the nodes non-linearly according to position, with a larger size for the "smaller" numerical position (so first is less than second, and hence first is sized larger than second), we can see whether the relative positions change (in this case, they don't...)

Another sort of chart we might generate will be familiar to many race fans, with a tweak - simply plot position against lap, colour according to driver, and then size the nodes according to lap time:

Again, filtering is trivial:

If we plot the elapsed time against lap, we get a view of separations (deltas between cars are available in the media centre reports, but I haven't used this data yet...):

In this example, lap time flows up the graph, elapsed time increases left to right. Nodes are coloured by driver, and sized according to position. If a driver has a higher lap count and a lower total elapsed time than a driver on the previous lap, then it has lapped that car... Within a lap, we also see the separation of the various cars. (This difference should be the same as the deltas that are available via FIA press releases.)

If we zoom into a lap, we can better see the separation between cars. (Using the data I have, I'm hoping I haven't introduced any systematic errors arising from essentially dead reckoning the deltas between cars...)

Also note that where lines between two laps cross, we have a change of position between laps.

[ADDED] Here's another view, plotting elapsed time against itself to see where folk are on the track-as-laptime:

Okay, that's enough from me for now.. Here's something far more beautiful from @bencc/Ben Charlton that was built on top of the McLaren data...

First up, a 3D rendering of the lap data:

And then a rather nice lap-by-lap visualisation:

So come on F1 teams - give us some higher resolution data to play with and let's see what we can really do... ;-)

PS I see that Joe Saward is a keen user of Lap charts…. That reminds me of an idea for an app I meant to do for race days that makes grabbing position data as cars complete a lap as simple as clicking...;-) Hmmm....

PPS for another take on visualising the timing data/timing stats, see Keith Collantine/F1Fanatic's Malaysia summary post.

Written by Tony Hirst

April 16, 2011 at 7:30 pm

Visualising Vodafone Mclaren F1 Telemetry Data in Gephi

Last year, I popped up an occasional series of posts visualising captures of the telemetry data that was being streamed by the Vodafone McLaren F1 team (F1 Data Junkie).

I’m not sure what I’m going to do with the data this year, but being a lazy sort, it struck me that I should be able to visualise the data using Gephi (using in particular the geo layout, which lets you specify which node attributes should be used as x and y co-ordinates when placing the nodes).

Taking a race worth of data, and visualising each node as follows (size as throttle value, colour as brake) we get something like this:

(Note that the resolution of the data is 1Hz, which explains the gaps…)

It’s possible to filter the data to show only a lap’s worth:

We could also filter out the data to only show points where the throttle value is above a certain value, or the lateral acceleration (“G-force”) and so on… or a combination of things (points where throttle and brake are applied, for example). I’ll maybe post examples of these using data from this year’s races…. err..?;-)

For now though, here’s a little video tour of Gephi in action on the data:

What I’d like to be able to do is animate this so I could look at each lap in turn (or maybe even animate an onion skin of the “current” point and a couple of previous ones) but that’s a bit beyond me… (for now….?!;-) If you know how, maybe we should talk?!:-)

[Thanks to McLaren F1 for streaming this data. Data was captured from the McLaren F1 website in 2010. I believe the speed, throttle and brake data were sponsored by Vodafone.]

PS If McLaren would like to give me some slightly higher resolution data, maybe from an old car on a test circuit, I’ll see what I can do with it… Similarly, any other motor racing teams in any other formula who have data they’d like to share, I’m happy to have a play… I’m hoping to go to a few of the BTCC races this year, so I’d particularly like to hear from anyone from any of those teams, or teams in the supporting races:-) If a Ginetta Junior team is up for it, we might even be able to get an education/outreach thing going into school maths, science, design and engineering clubs…;-)

Written by Tony Hirst

April 14, 2011 at 12:41 pm

Dominant Tags in My Delicious Network

Following on from Social Networks on Delicious, here’s a view over my delicious network (that is, the folk I “follow” on delicious) and the dominant tags they use:

The image is created from a source file generated by:

1) grabbing the list of folk in my delicious network;
2) grabbing the tags each of them uses;
3) generating a bipartite network specification graph containing user and tag nodes, with weighted links corresponding to the number of times a user has used a particular tag (i.e. the number of bookmarks they have bookmarked using that tag).

Because the original graph is a large, sparse one (many users define lots of tags but only use them rarely), I filtered the output view to show only those tags that have been used at least 150 times by any particular user, based on the weight of each edge (remember, the edge weight describes the number of times a user has used a particular tag). (So if every user had used the same tag no more than 149 times, it wouldn't be displayed.) The tag nodes are sized according to the number of users who have used the tag 150 or more times.

I also had a go at colouring the nodes to identify tags used heavily by a single user, compared to tags heavily used by several members of my network.

Here’s the Python code:

import urllib, simplejson

def getDeliciousUserNetwork(user,network):
  url='http://feeds.delicious.com/v2/json/networkmembers/'+user
  data = simplejson.load(urllib.urlopen(url))
  for u in data:
    network.append(u['user'])
    #time also available: u['dt']
  #print network
  return network

def getDeliciousTagsByUser(user):
  tags={}
  url='http://feeds.delicious.com/v2/json/tags/'+user
  data = simplejson.load(urllib.urlopen(url))
  for tag in data:
    tags[tag]=data[tag]
  return tags

def printDeliciousTagsByNetwork(user,minVal=2):
  f=openTimestampedFile('delicious-socialNetwork','network-tags-' + user+'.gdf')
  f.write(gephiCoreGDFNodeHeader(typ='delicious')+'\n')
 
  network=[]
  network=getDeliciousUserNetwork(user,network)

  for user in network:
    f.write(user+','+user+',user\n')
  f.write('edgedef> user1 VARCHAR,user2 VARCHAR,weight DOUBLE\n')
  for user in network:
    tags={}
    tags=getDeliciousTagsByUser(user)
    for tag in tags:
      if tags[tag]>=minVal:
         f.write(user+',"'+tag.encode('ascii','ignore') + '",'+str(tags[tag])+'\n')
  f.close()
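
The script assumes a couple of helper functions from earlier posts that aren't reproduced above (openTimestampedFile and gephiCoreGDFNodeHeader). If you don't have them to hand, something along these lines should work; this is a minimal sketch, and the original versions may differ in detail:

import time

def openTimestampedFile(stub,fname):
  #open a file for writing whose name includes a timestamp between the stub and filename parts
  timestamp=time.strftime('%Y%m%d-%H%M%S')
  return open(stub+'_'+timestamp+'_'+fname,'w')

def gephiCoreGDFNodeHeader(typ='min'):
  #return a GDF nodedef header line; the 'delicious' variant expects
  #name, label and node type columns, matching the node rows written above
  if typ=='delicious':
    return 'nodedef> name VARCHAR,label VARCHAR,type VARCHAR'
  return 'nodedef> name VARCHAR,label VARCHAR'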

Looking at the network, it's possible to see which members of my network are heavy users of a particular tag, and furthermore, which tags are heavily used by more than one member of my network. The question now is: to what extent might this information help me identify whether or not I am following people who are likely to turn up resources in my interest area, by virtue of the tags used by the members of my network?

Picking up on the previous post on Social Networks on Delicious, might it be worth looking at the tags used heavily by my followers to see what subject areas they are interested in, and potentially the topic area(s) in which they see me as acting as a resource investigator?

Written by Tony Hirst

January 14, 2011 at 1:47 pm

Social Networks on Delicious

One of the many things that the delicious social networking site appears to have got wrong is how to gain traction from its social network. As well as the incidental social network that arises from two or more different users using the same tag or bookmarking the same resource (for example, Visualising Delicious Tag Communities Using Gephi), there is also an explicit social network constructed using an asymmetric model similar to that used by Twitter: specifically, you can follow me (become a “fan” of me) without my permission, and I can add you to my network (become a fan of you, again without your permission).

Realising that you are part of a social network on delicious is not really that obvious though, nor is the extent to which it is a network. So I thought I’d have a look at the structure of the social network that I can crystallise out around my delicious account, by:

1) grabbing the list of my “fans” on delicious;
2) grabbing the list of the fans of my fans on delicious and then plotting:
2a) connections between my fans and their fans who are also my fans;
2b) all the fans of my fans.

(Writing “fans” feels a lot more ego-bollox than writing “followers”; is that maybe one of the nails in the delicious social SNAFU coffin?!)

Here’s the way my “fans” on delicious follow each other (maybe? I’m not sure if the fans call always grabs all the fans, or whether it pages the results?):

The network is plotted using Gephi, of course; nodes are coloured according to modularity clusters, and the layout is derived from a Force Atlas layout.

Here’s the wider network – that is, showing fans of my fans:

In this case, nodes are sized according to betweenness centrality and coloured according to in-degree (that is, the number of my fans who have that person as a fan). [This works in so far as we're trying to identify reputation networks. If we're looking for reach in terms of using folk as a resource discovery network, it would probably make more sense to look at the members of my network, and the networks of those folk...]

If you want to try to generate your own, here's the code:

import simplejson
import urllib
import time

def getDeliciousUserFans(user,fans):
  url='http://feeds.delicious.com/v2/json/networkfans/'+user
  #needs paging? or does this grab all the fans?
  data = simplejson.load(urllib.urlopen(url))
  for u in data:
    fans.append(u['user'])
    #time also available: u['dt']
  #print fans
  return fans

def getDeliciousFanNetwork(user):
  f=openTimestampedFile("fans-delicious","all-"+user+".gdf")
  f2=openTimestampedFile("fans-delicious","inner-"+user+".gdf")
  f.write(gephiCoreGDFNodeHeader(typ="min")+"\n")
  f.write("edgedef> user1 VARCHAR,user2 VARCHAR\n")
  f2.write(gephiCoreGDFNodeHeader(typ="min")+"\n")
  f2.write("edgedef> user1 VARCHAR,user2 VARCHAR\n")
  fans=[]
  fans=getDeliciousUserFans(user,fans)
  for fan in fans:
    time.sleep(1)
    fans2=[]
    print "Fetching data for fan "+fan
    fans2=getDeliciousUserFans(fan,fans2)
    for fan2 in fans2:
      f.write(fan+","+fan2+"\n")
      if fan2 in fans:
        f2.write(fan+","+fan2+"\n")
  f.close()
  f2.close()
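
To generate your own files, just call the function with a delicious username (the one below is a placeholder); as with the previous script, the openTimestampedFile and gephiCoreGDFNodeHeader helpers sketched earlier are assumed:

getDeliciousFanNetwork('yourDeliciousUsername')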

So what’s the next step…?!

Written by Tony Hirst

January 13, 2011 at 2:58 pm

Posted in Analytics


Visualising Delicious Tag Communities Using Gephi

Years ago, I used the Javascript Infovis Toolkit to put together a handful of data visualisations around the idea of the “social life of a URL” by looking up bookmarked URLs on delicious and then seeing who had bookmarked them and using what tags (delicious URL History – Hyperbolic Tree Visualisation, More Hyperbolic Tree Visualisations – delicious URL History: Users by Tag). Whilst playing with some Twitter hashtag network visualisations today, I wondered whether I could do something similar based around delicious bookmark tags, so here’s a first pass attempt…

As a matter of course, delicious publishes RSS and JSON feeds from tag pages, optionally containing up to 100 bookmarked entries. Each item in the response is a bookmarked URL, along with details of the individual who saved that particular bookmark and the tags they used.

That is, for a particular tag on delicious we can trivially get hold of the 100 most recent bookmarks saved with that tag and data on:

- who bookmarked it;
- what tags they used.

Here’s a little script in Python to grab the user and tag data for each lak11 bookmark and generate a Gephi gdf file to represent the bipartite graph that associates users with the tags they have used:

import simplejson
import urllib

def getDeliciousTagURL(tag,typ='json', num=100):
  #need to add a pager to get data when more than 1 page
  #use the typ and num arguments rather than hardwiring the format and count
  return "http://feeds.delicious.com/v2/"+typ+"/tag/"+tag+"?count="+str(num)

def getDeliciousTaggedURLDetailsFull(tag):
  durl=getDeliciousTagURL(tag)
  data = simplejson.load(urllib.urlopen(durl))
  userTags={}
  uniqTags=[]
  for i in data:
    url= i['u']
    user=i['a']
    tags=i['t']
    title=i['d']
    if user in userTags:
      for t in tags:
        if t not in uniqTags:
          uniqTags.append(t)
        if t not in userTags[user]:
          userTags[user].append(t)
    else:
      userTags[user]=[]
      for t in tags:
        userTags[user].append(t)
        if t not in uniqTags:
          uniqTags.append(t)
  
  f=open('bookmarks-delicious_'+tag+'.gdf','w')
  f.write('nodedef> name VARCHAR,label VARCHAR, type VARCHAR\n')
  for user in userTags:
    f.write(user+','+user+',user\n')
  for t in uniqTags:
    f.write(t+','+t+',tag\n')

  f.write('edgedef> user VARCHAR,tag VARCHAR\n')
  for user in userTags:
    for t in userTags[user]:
      f.write(user+','+t+'\n')
  f.close()

tag='lak11'
getDeliciousTaggedURLDetailsFull(tag)

[Note to self: this script needs updating to grab additional results pages?]

Here's an example of the output, in this case using the tag for Jim Groom's Digital Storytelling course: ds106. The nodes are coloured according to whether they represent a user or a tag, and sized according to degree, and the layout is based on a force atlas layout with a few tweaks to allow us to see labels clearly.

Note that the actual URLs that are bookmarked are not represented in any way in this visualisation. The network shows the connections between users and the tags they have used, irrespective of whether the tags were applied to the same or different URLs. Even if two users share common tags, they may not share any common bookmarks...

Here's another example, this time using the lak11 tag:

Looking at these networks, a couple of things struck me:

- the commonly used tags might express a category or conceptual tag that describes the original tag used to source the data;

- folk sharing similar tags may share similar interests.

Here's a view over part of the LAK11 network with the LAK11 tag filtered out, and the Gephi ego filter applied with depth two to a particular user, in this case delicious user rosemary20:

The filtered view shows us:

- the tags a particular user (in this case, rosemary20) has used;

- the people who have used the same tags as rosemary20; note that this does not necessarily mean that they bookmarked any of the same URLs, nor that they are using the tags to mean the same thing*...

(* delicious does allow users to provide a description of a tag, though I'm not sure if this information is generally available via a public API?)

By sizing the nodes according to degree in this subnetwork, we can readily identify the tags commonly used alongside the tag used to source the data, and also the users who have used the largest number of identical tags.

PS it struck me that a single page web app should be quite easy to put together to achieve something similar to the above visualisations. The JSON feed from delicious is easy enough to pull in to any page, and the Protovis code library has a force directed layout package that works on a simple graph representation not totally dissimilar to the Gephi/GDF format.

If I get an hour I'll try to have a play to put a demo together. If you beat me to it, please post a link to your demo (or even fully blown app!) in the comments:-)

Written by Tony Hirst

January 8, 2011 at 8:52 pm

Posted in Tinkering


Sketching Connections Between US House and Senate Tweeps

Following on from Sketching the Structure of the UK Political Media Twittersphere, I did a quick trawl of US Senators and Congress members based on the lists maintained by @HuffPostPol, and visualised the connections between them in Gephi using the Force Layout algorithm:

First pass sketch of US house and senate tweeps

Node size is proportional to betweenness centrality measured over the directed graph where edges go from one person to a person they follow. The edge set is limited to edges between members of the four @huffPostPol lists: republican-senators, democratic-senators, democratic-members, republican-members.

Written by Tony Hirst

January 5, 2011 at 1:41 pm

Posted in Data, Visualisation


Sketching the Structure of the UK Political Media Twittersphere

I’m not sure how, now, but earlier today I came across the first part of a two part NPR article on In London, A Case Study In Opinionated Press, looking at “ideology in the media”, and how the British print media at least tend to be associated with a particular political affiliation.

This reminded me in part of something I read in Click*, by Bill Tancer over the Christmas break (don’t worry – I didn’t pay even the full discounted price; it was remaindered for a couple of quid in an end-of-line bookshop in Cheltenham…probably still is…) relating to the flow of traffic between different overtly politically affiliated blogs in the US. One chart showed traffic flows between blogs based on some work done by Matthew Hindman (Political Traffic, June 2007). Another cited Hitwise statistics about upstream and downstream traffic to/from certain news media sites.

* (The book itself is essentially a distillation of observations drawn from Hitwise stats. If nothing else, it prompted me to wonder (again) about the extent to which academic researchers make use of the big data generated on the web, as well as market research data for segmenting (large) sets of (social science) research data. (If you’re concerned about invasion of privacy online, at least in terms of what marketing profiles are being built around your behaviour, are you similarly concerned about what trad direct mail marketers know about you?!;-))

One of the things I’d idly considered whilst reading the book was what a map of the UK media might look like. The OU has a couple of Hitwise seats, I think, though I’m probably not allowed anywhere near them; Alexa offers a certain amount of data (I’ve no idea how reliable it is, though!), though I’m not sure about its provision of an API and any license conditions around the use of the data:

Guardian upstream traffic on alexa

The interesting sites are likely to start appearing further down the long tail…

Data I do have access to is social graph data on Twitter. The Tweetminster folk maintain a variety of relevant lists collating MPs by party affiliation, journalists by media organisation, government departments and so on, so it was easy enough to grab this data and check out who’s following whom. (I really need to clean the data to provide the option of allowing only active twitterers to be displayed…)

Here are a few idle snapshots generated using Gephi, in the preview view; consider it a tease…

Nodes coloured and grouped by affiliation (party, media organisation, etc). This initial layout shows the groupings (I don’t remember what size is? Child nodes I think? (i.e. the number of actual tweeps in that category)) The actual layout was achieved by using a force directed layout (which is sensitive to the number of links between sets of nodes) on the individual people nodes, and then grouping them by category.

Uk political media on twitter

The next chart shows the individual twitterer nodes, coloured by party, under the force atlas layout; a link exists between A and B if A follows B:

uk political media on twitter

We see that nodes with similar affiliation are, on the whole, closer together; which is to say, folk in a particular party or organisation follow each other like crazy, and then follow some other folk too;-)

To explore the structure in a little more detail, I sized the nodes using betweenness centrality, which is related to how structurally important a node is in connecting different parts of the overall graph:

UK politics on twitter - sized by betweenness centrality

Out of interest, I also ran the modularity statistic over the network to see whether there were any natural forming clusters; because the whole network is quite highly interlinked, only three main clusters were identified:

- what looks largely like a Labour clump:

UK political tweeps cluster 1
“the left is a ghetto”?

- a clump of folk associated with (or tracking) government shenanigans, maybe?

Uk political tweeps cluster 2

- I did toy with calling this cluster “hangers-on around the political scene”;-)

Uk political tweeps cluster 3

Okay – enough for now; if you want the data, then you have to demonstrate what impact it’ll get me;-) Or you can make a donation to a charity of my choice. Or you can just grab it yourself (I’ve blogged how enough times;-)

PS I really need to sort an effective colour palette out…

PPS on the wishlist – find a way of looking at edges going from nodes in one group to another group; I think the Gephi MASK operator may do this but I don’t have a clue how it works… A new function to take two lists and just plot connections between members of the separate lists could be quite handy though, so I’ll save that up for my next train journey:-)
DOH! it’s obvious – just use a UNION filter and get the required attribute filtered nodes:

gephi filter two groups

PPPS just by the by, I found a way of grabbing friends and followers data from protected Tweeps. My understanding of the Twitter API was that this required: a) authentication; b) the requesting party to be a friend of the protected party, but I don’t require any of that. I’m not sure if the data I’m getting is current, or whether it’s a bit stale, but it’s more than I think I should be able to access via the Twitter API?! I’m not going to blog the how-to, though… figure it out yourself;-)

Written by Tony Hirst

January 4, 2011 at 8:49 pm

Posted in Data, Visualisation


Using Gephi to Create Bubble Charts: Exploring Government Tenders

[Elements of this post have been largely deprecated since I drafted it a couple of weeks ago, but I'm posting it anyway because this is my open notebook, and as such it has a role in logging the things that are maybe dead ends, as well as hopefully more useful stuff...]

At its heart, Gephi is an environment for visualising graphs (or networks) in which “nodes” are connected to each other by “edges”. Nodes are represented using circles whose size and colour represent particular characteristics of the node. So for example, if you were to visualise your Facebook friends, a node might represent a particular friend, the size of the node might be proportional to the number of friends they have, and the colour to how many photos they have uploaded. Lines between nodes would then show who is a friend of whom. But must we always add the lines between the nodes? If we leave them out, can we effectively use Gephi as a tool for generating charts like the Many Eyes bubble charts?

Many Eyes Bubble Chart

One of the data import formats supported by Gephi is the gdf format (gdf documentation), which expects a list of node definitions, followed by a list of edge connections. If we ignore the edges, then we can just import a set of node definitions, and create a bubble chart.

As an example of this, let’s see what we can do with some of the Transparency in procurement and contracting information released by the Cabinet Office. As part of the data release, they publish a CSV file containing a summary of all the tender documents held:

Gov procurement tender docs

Looking at the data, we see that each tender is represented by one or more documents. Each row in the CSV file gives us information about the tender (its project ID, originating department, expected value, expected duration) as well as the particular document. So if we view each tender as a “bubble” or node in Gephi, we might want to represent it as follows:

nodedef> name VARCHAR,label VARCHAR, procid VARCHAR, estVal DOUBLE,estDur DOUBLE,date VARCHAR, dept VARCHAR, desc VARCHAR, nature VARCHAR
402846,"Spring Electoral Events Contact Centre",402846,125000,48,"17/09/2010","Central Office of Information","Invitation to Tender","Competition as part of an existing framework agreement"
"2010CMTLSE00001","Supply of body armour to HMCS","2010CMTLSE00001",250000,48,"24/09/2010","Ministry of Justice","RFQ instructions","Competition as part of an existing framework agreement"

Note that the GDF file requires a particular sort of header, followed by CSV rows of data. It's easy enough in this case to simply edit the original CSV file by deleting the first line, tweaking the column headers to the name VARCHAR, label VARCHAR... format required by the GDF file, and prefixing the new first row (the header row) with nodedef>.
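
If you'd rather script that tweak than hand edit the file, a few lines of Python will do it. Here's a minimal sketch; the filenames are made up, and it assumes the CSV columns have already been reduced and ordered to match the GDF header shown above:

import csv

#hypothetical filenames - substitute the CSV downloaded from the Cabinet Office site
reader=csv.reader(open('tenders.csv','rb'))
reader.next() #drop the original CSV header row

f=open('tenders.gdf','w')
#GDF node header matching the column order we expect in each data row
f.write('nodedef> name VARCHAR,label VARCHAR, procid VARCHAR, estVal DOUBLE,estDur DOUBLE,date VARCHAR, dept VARCHAR, desc VARCHAR, nature VARCHAR\n')

#csv.writer adds quotes around fields that contain commas, which is what the GDF format expects
writer=csv.writer(f)
for row in reader:
  writer.writerow(row)
f.close()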

However, I've recently started exploring the use of the browser based desktop application Google Refine (formerly Freebase Gridworks) as a step in my workflow for tidying up CSV data and then getting it into the GDF format.

Loading a CSV file into gridworks

Here's what it looks like once the data has been imported:

Data in gridworks

(For a great overview of what Gridworks allows you to do with data, see @jenit's Using Freebase Gridworks to Create Linked Data.)

The data I want to visualise in Gephi relates to the current tenders, rather than anything to do with the actual documents, so we can use Gridworks to simplify the data set by deleting the document type, document name and contact email columns. We can also check that columns using a restricted vocabulary (e.g. the type of tender being offered) do so consistently. For example, if we look at the Nature of the Tender Process column, and select Cluster and Edit...:

Clustering in Gridworks

we can see that there may be the odd typo that we can correct automatically:

Tidying data in Gridworks

The Description column also has various categories we can tidy up:

Tidying data in Gridworks

Here are the data tidying steps I've applied:

Data cleaning steps in Gridworks

At the time of writing, a new version of Google Refine/Gridworks is about to be released. In the version I'm using, I don't think it's possible to remove duplicate rows, which we have aplenty in my tidied up dataset (where several documents were listed for a tender, there are now several identical rows in my dataset). [Google Refine 2.0 is out now - and I don't think it can de-dupe?] However, I happen to know that Gephi will ignore duplicate nodes when data is loaded, so we can do the de-duplication step there...
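
If you did want to de-dupe before the data gets anywhere near Gephi, a few lines of Python over the parsed rows would also do the job; here's a sketch, assuming the project ID is the first column of each row:

def dedupeRows(rows):
  #keep only the first row seen for each project ID (assumed to be in column 0)
  seen=set()
  deduped=[]
  for row in rows:
    if row[0] not in seen:
      seen.add(row[0])
      deduped.append(row)
  return deduped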

To generate the GDF file, we need to create a header line, and then define the output pattern for each row. We can do this using Gridworks' Templating support:

Gridworks templating

Here's how I define the output document:

Preparing the export template in gridworks

Note that the linebreaks will need removing in order to generate the correct output format. Also, in the version of Gridworks I'm using, it's worth noting that whenever you run the template, you're returned to the main data window and your template definition is lost... (so before running the template code, grab a copy of it into a text editor just to be safe;-)

When you export the data, it's exported to your browser downloads directory, as a text file. Change the filetype from .txt to .gdf and import it into Gephi:

DeDupe in Gephi

You'll see that Gephi has detected the duplicate rows based on common name elements (that is, common project IDs), and ignored the copies/duplicates.

Now we can view the procurement data using proportional symbol visualisations - here I size the nodes by estimated value (and display the label size in proportion to node size), and colour the nodes according to estimated duration:

Procurement in gephi

[Since drafting this post, I've found a far better way of getting just the node data into Gephi - load it into the data table as a node table. I'll post more on how to do this in a follow on post...]

(The Many Eyes take on Bubble Charts ignores x/y co-ordinates as useful data, although other definitions of Bubble Charts include x/y location as important factors. In the current example, I allow Gephi to layout the nodes/bubbles. However, we can define x and y co-ordinates in the gdf file if we want to specifically locate the bubbles on the canvas.)

We can also use Gephi to cluster the data according to calling departments, or type of procurement exercise:

Clustering data in gephi

I *think* the size of the resulting bubble is proportional to the sum of the values used to inform the node size of the original components, so we should be able to group by procurement exercise type and have the bubble size be proportional to the sum of the estimated values of the procurement projects in that procurement class.

We can also expand a clustered node to see what activity is related to it - in this case, here are tenders from the British Library:

Exploding a cluster in Gephi

Going back to the full list, here we size by estimated value and colour by type of procurement:

category colouring in gephi

We can also generate views over the data using filters - so for example, COI sponsored procurement:

Filtering in gephi

One thing that Gephi doesn't currently support is a treemap style visualisation. However, now we have deduped the data by importing it into Gephi, we can export it as a simple CSV file from the datatable view, and then upload the data to Many Eyes and make use of its treemap:

Gephi datatable export

We use TSV because that is the preferred format for Many Eyes... (data file on Many Eyes)

Importing data into Many Eyes

Here's one configuration of the treemap:

Procurement treemap

With the data in Many Eyes, we can of course easily generate other views over it, such as a histogram:

Many Eyes histogram

(NB in the original data, the Estimated value column - which should contain numbers - also contained a few unknown elements:

More data cleaning in gridworks

Because code that expects numbers sometimes chokes on text, I should maybe have set the unknown values to a default value as shown above?)

Okay - so what have we covered in this post?
- how to start cleaning/preparing data in Freebase Gridworks/Google Refine;
- how to use Freebase Gridworks/Google Refine to generate an output file according to a template;
- how to use Gephi to deduplicate data based on a common field (in this case, the project id);
- how to use Gephi as a proportional symbol/bubble chart visualisation tool;
- how to export data from Gephi and upload it into Many Eyes;
- how to use Many Eyes to generate a treemap.

As ever, this blog post took longer to write than it took me to work through the exercise originally.

Written by Tony Hirst

November 19, 2010 at 10:00 am

Posted in Data


data.open.ac.uk Linked Data Now Exposing Module Information

As HE becomes more and more corporatised, I suspect we’re going to see online supermarkets appearing that help you identify – and register on – degree courses in exchange for an affiliate/referral fee from the university concerned. For those sites to appear, they’ll need access to course catalogues, of course. UCAS currently holds the most comprehensive one that I know of, but it’s a pain to scrape and all but useless as a datasource. But if the universities publish course catalogue information themselves in a clean way (and ideally, a standardised way), it shouldn’t be too hard to construct aggregation sites ourselves…

So it was encouraging to see earlier this week an announcement that the OU’s data.open.ac.uk site has started publishing module data from the course catalogue – that is, data about the modules (as we now call them – they used to be called courses) that you can study with the OU.

The data includes various bits of administrative information about each module, the territories it can be studied in, and (most importantly?!) pricing information;-)

data.open.ac.uk - module data

You may remember that the data.open.ac.uk site itself launched a few weeks ago with the release of Linked Data sets including data about deposits in the open repository, as well as OU podcasts on iTunes (data.open.ac.uk Arrives, With Linked Data Goodness). Where podcasts are associated with a course, the magic of Linked Data means that we can easily get to the podcasts via the course/module identifier:

data.open.ac.uk

It’s also possible to find modules that bear an isSimilarTo relation to the current module, where isSimilarTo means (I think?) “was also studied by students taking this module”.

As an example of how to get at the data, here’s a Python script using the Python YQL library that lets me run a SPARQL query over the data.open.ac.uk course module data (the code includes a couple of example queries):

import yql

def run_sparql_query(query, endpoint):
    y = yql.Public()
    query='select * from sparql.search where query="'+query+'" and service="'+endpoint+'"'
    env = "http://datatables.org/alltables.env"
    return y.execute(query, env=env)

endpoint='http://data.open.ac.uk/query'

# This query finds the identifiers of postgraduate technology courses that are similar to each other
q1='''
select distinct ?x ?z from <http://data.open.ac.uk/context/course> where {
?x a <http://purl.org/vocab/aiiso/schema#Module>.
?x <http://data.open.ac.uk/saou/ontology#courseLevel> <http://data.open.ac.uk/saou/ontology#postgraduate>.
?x <http://purl.org/dc/terms/subject> <http://data.open.ac.uk/topic/technology>.
?x <http://purl.org/goodrelations/v1#isSimilarTo> ?z
} limit 10
'''

# This query finds the names and course codes of 
# postgraduate technology courses that are similar to each other
q2='''
select distinct ?code1 ?name1 ?code2 ?name2 from <http://data.open.ac.uk/context/course> where {
?x a <http://purl.org/vocab/aiiso/schema#Module>.
?x <http://data.open.ac.uk/saou/ontology#courseLevel> <http://data.open.ac.uk/saou/ontology#postgraduate>.
?x <http://purl.org/dc/terms/subject> <http://data.open.ac.uk/topic/technology>.
?x <http://courseware.rkbexplorer.com/ontologies/courseware#has-title> ?name1.
?x <http://purl.org/goodrelations/v1#isSimilarTo> ?z.
?z <http://courseware.rkbexplorer.com/ontologies/courseware#has-title> ?name2.
?x <http://purl.org/vocab/aiiso/schema#code> ?code1.
?z <http://purl.org/vocab/aiiso/schema#code> ?code2.
}
'''

# This query finds the names and course codes of 
# postgraduate courses that are similar to each other
q3='''
select distinct ?code1 ?name1 ?code2 ?name2 from <http://data.open.ac.uk/context/course> where {
?x a <http://purl.org/vocab/aiiso/schema#Module>.
?x <http://data.open.ac.uk/saou/ontology#courseLevel> <http://data.open.ac.uk/saou/ontology#postgraduate>.
?x <http://courseware.rkbexplorer.com/ontologies/courseware#has-title> ?name1.
?x <http://purl.org/goodrelations/v1#isSimilarTo> ?z.
?z <http://courseware.rkbexplorer.com/ontologies/courseware#has-title> ?name2.
?x <http://purl.org/vocab/aiiso/schema#code> ?code1.
?z <http://purl.org/vocab/aiiso/schema#code> ?code2.
}
'''

result=run_sparql_query(q3, endpoint)

for row in result.rows:
	for r in row['result']:
		print r

I'm not sure what purposes we can put any of this data to yet, but for starters I wondered just how connected the various postgraduate courses are based on the isSimilarTo relation. Using q3 from the code above, I generated a Gephi GDF/network file using the following snippet:

# Generate a Gephi GDF file showing connections between 
# modules that are similar to each other
fname='out2.gdf'
f=open(fname,'w')

f.write('nodedef> name VARCHAR, label VARCHAR, title VARCHAR\n')
ccodes=[]
for row in result.rows:
	for r in row['result']:
		if r['code1']['value'] not in ccodes:
			ccodes.append(r['code1']['value'])
			f.write(r['code1']['value']+','+r['code1']['value']+',"'+r['name1']['value']+'"\n')
		if r['code2']['value'] not in ccodes:
			ccodes.append(r['code2']['value'])
			f.write(r['code2']['value']+','+r['code2']['value']+',"'+r['name2']['value']+'"\n')
		
f.write('edgedef> c1 VARCHAR, c2 VARCHAR\n')
for row in result.rows:
	for r in row['result']:
		#print r
		f.write(r['code1']['value']+','+r['code2']['value']+'\n')

f.close()

to produce the following graph. (Size is out degree, colour is in degree. Edges go from ?x to ?z. Layout: Fruchterman Reingold, followed by Expansion.)

OU postgrad courses in gephi

The layout style is a force directed algorithm, which in this case has had the effect of picking out various clusters of highly connected courses (so for example, the E courses are clustered together, as are the M courses, B courses, T courses and so on.)

If we run the ego filter over this network on a particular module code, we can see which modules were studied alongside it:

ego filter on course codes

Note that in the above diagram, the nodes are sized/coloured according to out-degree/in-degree in the original, complete graph. If we re-calculate those measures on just this partition, we get the following:

Recoloured course network

If we return to the whole network, and run the Modularity class statistic, we can identify several different course clusters:

Modules - modularity class

Here's one of them expanded:

A module cluster

Here are some more:

Course clusters

I'm not sure what use any of this is, but if nothing else, it shows there's structure in that data (which is exactly what we'd expect, right?;-)

PS as to how I wrote my first query on this data, I copied the 'postgraduate modules in computing' example query from data.open.ac.uk:

http://data.open.ac.uk/query?query=select%20distinct%20%3Fx%20from%20%3Chttp://data.open.ac.uk/context/course%3Ewhere%20{%3Fx%20a%20%3Chttp://purl.org/vocab/aiiso/schema%23Module%3E.%0A%3Fx%20%3Chttp://data.open.ac.uk/saou/ontology%23courseLevel%3E%20%3Chttp://data.open.ac.uk/saou/ontology%23postgraduate%3E.%0A%3Fx%20%3Chttp://purl.org/dc/terms/subject%3E%20%3Chttp://data.open.ac.uk/topic/computing%3E%0A}%0A&limit=200

and pasted it into a tool that "unescapes" encoded URLs, which recovers the SPARQL query:

Unescaping text

I was then able to pull out the example query:
select distinct ?x from <http://data.open.ac.uk/context/course>
where {?x a <http://purl.org/vocab/aiiso/schema#Module>.
?x <http://data.open.ac.uk/saou/ontology#courseLevel> <http://data.open.ac.uk/saou/ontology#postgraduate>.
?x <http://purl.org/dc/terms/subject> <http://data.open.ac.uk/topic/computing>
}
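
For what it's worth, Python's standard library will do the unescaping too; here's a quick sketch (unescapeQueryURL is just a throwaway name, and it assumes the encoded SPARQL sits between the query= and &limit= parts of the URL):

import urllib

def unescapeQueryURL(url):
  #pull out the escaped query string and decode the %xx escapes
  return urllib.unquote(url.split('query=',1)[1].split('&limit=')[0])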

Just by the by, there's a host of other handy text tools at Text Mechanic.

Written by Tony Hirst

November 17, 2010 at 11:28 am

Friends of the Community: Who’s Effectively Following a Hashtag

Picking up on @briankelly’s Thoughts on ILI 2010, where he reports on a few gross level stats about #ili2010 hashtag activity grabbed from Summarizr, here are a few things I observed from looking at some of the hashtag community network stats…

To start with, I looked at the “inner hashtag community” where I grab the list of hashtaggers and their friends who have also used the hashtag and make links between them to give this sort of graph, as used before in many OUseful.info posts:

ILI2010 hashtaggers

(Directed graph from person to friend (i.e. to person they follow); node size proportional to in-degree, heat to out-degree.)

After running a few network statistics generated using Gephi, and exporting the data from the Gephi Data Table view, I uploaded the statistics data to IBM’s ManyEyes site here. This allows us to view the distribution of the hashtaggers based on various statistical and network measures using a range of other visualisation techniques, such as histograms (view interactive histogram chart for ILI2010 hashtaggers, interactive scatterplot)

So for example, here’s the distribution of hashtaggers by total number of followers (that is, including followers outside the hashtag community) as a histogram:

ILI2010 hashtaggers - total numbers of followers

If we look at the betweenness measure, which was calculated over the friends connections between the hashtaggers, we can see who’s best suited to getting a message broadcast across the community through direct and friend-of-a-friend links:

ILIhashtaggers - inner friends betweenness

If we look at the in-degree (the number of people in the hashtag community who have friended (i.e. are following) an individual) divided by the total number of friends of that individual, we can identify people who are being followed by more people in the community than they have as friends:

ILI2010 hashtaggers - in-degree divided by total friends

If we look at the in-degree divided by a user's total number of followers, we can see the extent to which a person's twitter feed is dominated by updates from folk who have used the ILI2010 hashtag:

ILI2010 hashtaggers - extent to which stream is dominated by hashtaggers

In the above case, we see one person who appears to only follow members of the ILI2010 hashtag community. (I’m guessing that if folk come to twitter through a conference, this might be a signature of that?) Before you get too excited though, a little more digging suggests that that person only follows 1 person;-)

The interactive scatterplot allows us to view 3 dimensions of data – in the following case, I’m looking for well connected (good betweenness centrality), well respected (high in-degree) folk in the hashtag community who also have a large reach in terms of their total number of followers:

ILI2010 hashtaggers - scatterplot

In terms of audience development, we can also create a network based on the complete follower lists of the ILI2010 hashtaggers. Creating such a graph generates a network with 71627 nodes, of which 236 were hashtaggers – meaning that in principle 71,391 people outside the hashtag community might have seen an ILI2010 hashtagged tweet…

Using a directed graph from hashtaggers to their followers, if we filter the graph to only show individuals with an in-degree of at least 60, say, we can see those people who are following at least 60 people who have used the hashtag:

ILI2010 hashtagger followers

In the way I have constructed this graph, the nodes showing Twitter usernames are in the hashtag community, the numerical IDs are individuals who didn’t use the ILI2010 hashtag but who do follow at least 60 people who did, and therefore presumably saw quite a lot of tweets about the event.

Looking up the twitter IDs of the “friends of the hashtag community”, we see the following people did not use the hashtag over the sample period, but do follow lots of people who did: @ijclark, @aekins, @metalibrarian, @schammond, @Jo_Bo_Anderson, @research_inform, @tomroper, @facetpublishing, @DavidGurteen

Of course, to know the extent to which hashtagger activity dominates the twitterstream of these "friends of the hashtag community", we'd need to normalise this against their total number of friends; because, for example, if I follow 20k people, of which 60 were hashtaggers, I'd probably miss most of the hashtagged tweets; whereas, if I follow 100 people, of which 60 are hashtaggers, the density of tweets received from hashtaggers could be expected to be quite high.
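
As a crude first pass, that normalisation is just a ratio; here's a sketch (the function and variable names are made up for illustration):

def hashtaggerDensity(hashtaggerFriendCount,totalFriendCount):
  #hashtaggerFriendCount: how many hashtaggers someone follows (the in-degree above)
  #totalFriendCount: the total number of people they follow
  if totalFriendCount==0:
    return 0.0
  return float(hashtaggerFriendCount)/totalFriendCount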

Okay – enough for now… although if you can think of anything else that might be interesting to know about the wider community around the hashtaggers, please post it in a comment below:-)

Written by Tony Hirst

October 18, 2010 at 9:23 am

Posted in Tinkering

