Archive for April 2010
A recent post on the journalism.co.uk site asks: How much computer science does a journalist really need?, commenting that whilst coding skills may undoubtedly be useful for journalists, knowing what can be achieved easily in a computational way may be more important, because there are techies around who can do the coding for you… (For another take on this, see Charles Arthur’s If I had one piece of advice to a journalist starting out now, it would be: learn to code, and this response to it: Learning to Think Like A Programmer.)
Picking up on a few thoughts that came to mind around a presentation I gave yesterday (Web Lego And Format Glue, aka Get Yer Mashup On), here’s a slightly different take on it, based on the idea that programming doesn’t necessarily mean writing arcane computer code.
Note that a lot of what follows I’d apply to librarians as well as journalists… (So for example, see Infoskills for the Future – If You Can’t Handle Information, Get Out of the Library for infoskills that I think librarians as information professionals should at least be aware of (and these probably apply to journalists too…); Data Handling in Action is also relevant – it describes some of the practical skills involved in taking a “dirty” data set and getting it into a form where it can be easily visualised…)
So here we go…. An idea I’ve started working on recently as an explanatory device is the notion of feed oriented programming. I appreciate that this probably already sounds scarily geeky, but it’s a made-up phrase and I’ll try to explain it. A feed is something like an RSS feed. (If you don’t know what an RSS feed is, this isn’t a remedial class, okay… go and find out… this old post should get you started: We Ignore RSS at OUr Peril.)
Typically, an RSS feed will contain a set of items, such as a set of blog posts, news stories, or bookmarks. Each item has the same structure in terms of how it is represented on a computer. Typically, the content of the feed will change over time – a blog feed represents the most recent posts on a blog, for example. That is, the publisher of the feed makes sure that the feed has current content in it – as a “programmer” you don’t really need to do anything to get the fresh content in the feed – you just need to look at the feed to see if there is new content in it – or let your feed reader show you that new content when it arrives. The feed is accessed via a web address/URL.
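(By way of illustration, here’s the bare bones of what an RSS feed looks like under the hood – a minimal, made-up example, not a feed from any real site:)

```xml
<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <link>http://example.com/</link>
    <item>
      <title>A recent post</title>
      <link>http://example.com/a-recent-post</link>
      <description>The content, or a summary, of the post...</description>
    </item>
    <!-- ...more items, each with exactly the same structure... -->
  </channel>
</rss>
```

Each item has the same structure; the publisher just keeps topping the feed up with new ones.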
Some RSS feeds might not change over time. On WriteToReply, where we republish public documents, it’s possible to get hold of an RSS version of the document. The document RSS feed doesn’t change because the content of the document doesn’t change, although the content of the comment feeds might change as people comment on the document.
A nice thing about RSS is that lots of things publish it, and lots of things can import it. Importing an RSS feed into an application such as Google Reader simply means pasting the web address of the feed into a “Subscribe to feed” box in the application. Although it can do other things too, like supporting search, Google Reader is primarily a display application. It takes in RSS feeds and presents them to the user in an easy to read way. Google Maps and Google Earth are other display applications – they display geographical information in an appropriate way, a way that we can readily make sense of.
So what do we learn from this? Information can be represented in a standard way, such as RSS, and displayed in a visual way by an application that accepts RSS as an input. By subscribing to an RSS feed, which we identify by a fixed/permanent web address, we can get new content into our reader without doing anything. Subscribing is just a matter of copying a web address from the publisher’s web site and pasting it into our reader application. Cut and paste. No coding required. The feed publisher is responsible for putting new content into the feed, and our reader application is responsible for pulling that new content out and displaying it to us.
One of the tools I use a lot is Yahoo Pipes. Yahoo Pipes can take in RSS feeds and do stuff with them; it can take in a list of blog posts as an RSS feed and filter them so that you only get posts out that do – or don’t – mention cats, for example. And the output is in the form of an RSS feed…
What this means is that if we have a Yahoo pipe that does something we want in computational terms to an RSS feed, all we have to do is give it the web address of the feed we want to process, and then grab the RSS output web address from the Pipe. Cut and paste the original feed web address into the Pipe’s input. Cut and paste the web address of the RSS output from the pipe into our feed reader. No coding required.
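(For the curious, here’s roughly what a cat-filtering pipe does behind the scenes – a Python sketch working on a made-up feed, not anything Yahoo Pipes actually runs:)

```python
import xml.etree.ElementTree as ET

def filter_feed(feed_xml, keyword):
    """Keep only the items in an RSS feed that mention a keyword."""
    root = ET.fromstring(feed_xml)
    channel = root.find('channel')
    for item in channel.findall('item'):
        text = (item.findtext('title') or '') + ' ' + (item.findtext('description') or '')
        if keyword.lower() not in text.lower():
            channel.remove(item)  # Drop items that don't mention the keyword.
    return ET.tostring(root, encoding='unicode')

# A minimal, made-up feed with two items...
feed = """<rss version="2.0"><channel><title>Demo feed</title>
<item><title>My cat</title><description>A post about cats</description></item>
<item><title>My dog</title><description>A post about dogs</description></item>
</channel></rss>"""

# ...filtered so only the cat post survives.
filtered = filter_feed(feed, 'cat')
```

The point, of course, is that Yahoo Pipes does all of this for you – no coding required.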
Another couple of tools I use are Google Spreadsheets (a spreadsheet application) and Many Eyes Wikified (an interactive visualisation application). If you publish a spreadsheet on Google Docs, you can get a web address/URL that points to a CSV (comma-separated values) version of the selected sheet. A CSV file is a simple text file where each spreadsheet row is represented as a row in the structured text file, and the value of each cell along a row in the original spreadsheet is represented as the same value in the text file, separated from the previous value by a comma. But you don’t need to know that… All you do need to know is that you can think of it as a feed… With a web address… And in a particular format…
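(So a sheet with a header row and a couple of data rows might come out looking something like this – made-up values:)

```csv
Constituency,Party,Votes
Anytown,Party A,21000
Anytown,Party B,17500
```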
Going to the “Share” options in the spreadsheet, you can publish the sheet and generate a web address that points to a range of cells in the spreadsheet (eg: B1:D120) represented as a CSV file. If we now turn to Many Eyes Wikified, I can provide it with the web address of a CSV file and it will go and fetch the data for me. At the click of a button I can then generate an interactive visualisation of the data in the spreadsheet. Cut and paste the web address of the CSV version of the data in a spreadsheet that Google Docs will generate for me into Many Eyes Wikified, and I can then create an interactive visualisation using the spreadsheet at the click of a button. Cut and paste a URL/web address that is generated for me. No coding required.
As to where the data in the spreadsheet came from? Well maybe it came from somewhere else on the web, via a URL? Like this, maybe?
So the model I’m working towards with feed oriented programming is the idea that you can get the web address of a feed to which a publisher will publish current content or data, and paste that address into an application that will render or display the content (e.g. Google Reader, Many Eyes Wikified) or process/transform that data on your behalf.
So for example, Google Spreadsheets can transform an HTML table to CSV for you (it also lets you do all the normal spreadsheet things, so you could generate one sheet from another sheet using whatever spreadsheet formulae you like, and publish the CSV representation of that second sheet). Or in Yahoo Pipes, you can process an RSS feed by filtering its contents so that you only see posts that mention cats.
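(If you’re wondering, the spreadsheet formula that pulls an HTML table into Google Spreadsheets looks something like this – the URL and table index here are just placeholders:)

```
=ImportHtml("http://en.wikipedia.org/wiki/Some_page", "table", 1)
```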
Yahoo Pipes offers other sorts of transformation as well. For example, in my original Wikipedia scraping demo, I took the feed from a Google spreadsheet and passed it to Yahoo Pipes where I geocoded city names and let pipes generate a map friendly feed (known as a KML feed) for me. Copying the web address of the KML feed output from the pipe and pasting it into Google Maps means I can generate an embeddable Google map view of data originally pulled from Wikipedia:
Once you start to think of the world in this way:
- where the web contains data and information that is represented in various standard ways and made available via a unique and persistent web address,
- where web applications can accept data and content that is represented in a standard way given the web address of that data,
- where web applications can transform data represented at one web address in one particular way and republish it in another standard format at another web address,
- or where web applications can take data represented in a particular way from one web address and provide you with the tools to then visualise or display that data,
then the world is your toolbox. Have URL, will travel. All you need to know is which applications can import what format data, and how they can republish that data for you, whether in a different format, such as Google spreadsheets taking an HTML table from Wikipedia and publishing it as a CSV file, or as a visualisation/human friendly display (Many Eyes Wikified, Google Reader). And if you need to do “proper” programmer type things, then you might be able to do it using a spreadsheet formula or a Yahoo Pipe (no coding required…;-)
What OU Facebook App would you like to see? Here’s a chance to get it made…
Two and a half years ago, as part of an informally convened skunkworks team, we released a couple of Open University related Facebook apps inspired, in part, by observing student behaviour in online course forums.
Current privacy fears aside (?!;-), the apps are getting another push as part of an announcement about a “Design an OU Facebook App” Competition (Share your Facebook app ideas for chance to win OU vouchers):
The rules are simple: tell us what new app you’d like to see us build on Facebook.
And in return? For the winner of our competition, which runs until June 8, 2010, there’s £100 course vouchers, and, even more exciting perhaps, the chance to see your app built, with your name on the developer credits.
Although there are only a handful of ideas posted so far, it’s interesting to see how some of them already tally with ideas we had for additional apps at the time Course Profiles and My OU Story were built. (Liam, maybe we should dig out the old email exchanges we had bouncing round new app ideas, and submit them to the current competition?! Heh heh;-)
The competition is being managed through an online suggestion-and-voting system that appeared on the OU Platform site earlier this year and which is being used to solicit ideas for new courses from anyone who registers on the site (Platform is open to all, not just OU students, staff, and alumni).
The Platform team seem to have really got into the idea of competitions, so presumably it works as a marketing exercise. Just out of interest, are there any other HEIs out there that run competitions in a similar way?
Another week, another presentation… I dunno about death by Powerpoint on the audience side, but even though I’ve started finding ways of reusing slides, it still takes me forever and a day (well, 4-6 hrs), to put a slide deck together… One day – one day – I’ll have to produce a presentation I can just give over and over again… ;-)
Anyway, here are slides for a presentation I’m due to give tomorrow (Thursday) at the University of Portsmouth. The plan is for a 1 hr “lecture”, and a 1 hr hands-on workshop session. The slides are for the talk – but also set the scene for the practical activity…
So what’s the practical? (For anyone reading this in advance of attending the session, I suggest you get yourself sorted with accounts for Google/Google Spreadsheets, Yahoo/Yahoo Pipes and IBM/Many Eyes and Many Eyes Wikified.) As time is tight, I suggest the best way in is to just try recreating some of the demos shown in the presentation above, and then going from there…. A good alternative would be to start working through this intro to Yahoo Pipes:
For the more adventurously minded, looking through the pipework category on this blog might provide a little more inspiration…
If it’s Google Spreadsheets hacks you’re after, searching for Google spreadsheet import formula should turn up some example posts…
Many Eyes and Many Eyes Wikified demos can be found by searching for “Many Eyes”
I was hoping to put together a couple of rather better structured self-paced workshop examples, but I’m afraid I’ve run out of time for today…:-(
A couple of weeks ago I did a phone interview for the OU’s DISCO project – OU Digital Scholarship Portal. From what I remember of the call, it rambled over many and varied topics, including possible metrics that might be taken into account when putting together promotion cases that include a demonstration of excellence in digital scholarship (whatever that is…).
Anyway, today I wasted a day – a whole day – updating my CV and writing stuff that seems to be the wrong stuff for an OU promotion case. Ever the reflective sort(?!), here are some observations I came away with:
- Slideshare is my presentation memory; I need to get in the habit of recording the date and event a presentation is for when I upload it to make it easier to list the presentations I’ve given. Alternatively, it might make sense to use a calendar to record the dates and events I’ve spoken at and then use the iCal feed to display the result;
- not writing formal academic papers means I have nothing to cite that t’committee would accept as credible. However, I have given quite a few interviews over the last couple of years to folk writing formal reports, doing research projects, or writing books. I’ve also participated in a few Delphi exercises and attended invitation only workshops and brainstorming sessions, as well as being invited to speak at events folk pay money to attend. Here’s part of what I wrote on this topic in my draft case: I have all but given up on formal academic publishing, in favour of short-form informal blog posts, occasional articles, and interviews for people who are writing long-form pieces (books, reports) which typically offer a greater or more immediate reach than scholarly articles in refereed journals, or benefit from a greater impact or better targeted audience than I could personally reach. The problem? That whilst I regularly participate in interviews and conversations with people writing official reports, books, etc as well as participating in Delphi Exercises[,] I’m not very good at keeping records of these or tracking down citations…
What occurs to me, then, is that I am more interested in direct or immediate communications of ideas as part of an ongoing process of learning and discovery (as part of a conversation, to use that well-worn and increasingly pointless phrase…) rather than archiving ideas for the record. (This also reflects my cynical attitude that the majority of stuff that appears in the formal record is not, to my mind, a contribution to anything other than the bulk of a journal sold for profit…)
If I’m going into the archive, someone else can put me there… But for the promotion case, acknowledgements are the lowest of the low in terms of academic credibility, rivaled only by (pers comm). Which is a shame – because one of the quotes I carry with me (though I can’t credit it, because I can’t for the life of me remember who said it, except that it was someone from outside the OU giving a seminar in the OU) is that the whole point of being an academic is to have interesting conversations.
Anyway, the reason why I started to write this post is this: if the digital scholarship folks want metrics around how effective a scholar’s online activities are, it may be worth looking at tangible outcomes in the real world – such as invitations (e.g. to speak at seminars and workshops) and acknowledgements (e.g. in books, articles and reports). This conversion from informal online activity to a formal request in physical space is where the “citation” is evidenced.
And as Stephen Downes writes in a recent Half an Hour post:
By sharing my work freely, people around the world are able to see it, and they willingly pay for me to come and speak to them. I do not collect speaker fees, but I do require that they pay my expenses, because otherwise I could not afford to travel to their cities. We both benefit, because I then use these trips to produce work that we share with other people around the world, and the cycle continues.
You might think, it’s not a very good deal for some organization to pay several thousand dollars to fly me to their city. But consider the cost were they to buy books from me instead. They could get maybe 30 or 40 copies of an academic text for the same amount. This way, they get all my content I ever create for free, as many copies as they would ever need. [Paying For Art]
If the point of publishing is to communicate ideas, then presentations count. And if the refereeing process is to guarantee quality, then being given an invitation to speak also reflects reputation brownie points and an element of trust on the part of the person responsible for extending the invitation, even if they are not explicitly evaluating the actual content of a presentation a priori.
As to the benefits accruing to Stephen’s employer: “[t]hey get the reputation from sponsoring my work” as well as influencing whatever he is working on.
I’m not sure what metrics Stephen uses if he goes through an annual staff development/appraisal cycle (I thought I’d read something he’d written on this before, but I can’t find it if he did…?) but it would be interesting to see them…
PS today has been crap day. The only enjoyable part has been this bit – thinking about how I might be able to build a living CV… Paraphrasing Fermat, if I didn’t have to walk the dog just now, I’d have been able to build the neatest little demonstration site for this, which would include parsing the events out of my CV into a spreadsheet, and then using my Maintaining a Google Calendar from a Google Spreadsheet recipe to get them into a calendar;-)
Over the weekend, a bit of random googling with filetype:csv OR filetype:xls limits turned up an archive of what looked like a complete set of UK general election results from 2005 on the Electoral Commission website.
(By the by, I also came across the National Digital Archive of Datasets (NDAD), which looks like it could be interesting: “The National Digital Archive of Datasets (NDAD) preserves and provides online access to archived digital datasets and documents from UK central government departments. Our collection spans 40 years of recent history, with the earliest available dataset dating back to about 1963.”)
Anyway, the election results data was presented in a CSV file with a format that looked like:
The first thing I wanted to try to do with it was to produce an election votes treemap (e.g. along similar lines to Work In Progress – Voting Treemaps). The target visualisation tool was Many Eyes, simply because it’s quick and easy to use (I think a little bit of setting up is still required to use the wrapper I’ve placed around the JIT treemaps…:-(
In order to use the Many Eyes treemap to its full effect, I needed each row to start with the constituency name. Looking at the above screengrab of the Electoral Commission data, we see that the constituency is only explicitly stated for the winning candidate. So what’s the best way of filling in the blanks?
In a couple of recent talks, I’ve made the claim that library folk should be in the business of teaching effective data handling skills, and this is a good case in point. So what would you do?
The naive response is to go into the spreadsheet and by hand click and drag each separate constituency name into the blanks. With over six hundred constituencies, that should take you how long..?!
Two rather more efficient ways came to my mind. Firstly, use a Python script to read in the CSV file, check to see if the constituency field is populated, and if it isn’t, populate it with the constituency name from the previous row:
import csv

filein = 'candidates_49576.csv'
fileout = 'election2005.csv'

fout = open(fileout, "wb+")
csv_file = csv.writer(fout)

file_obj = open(filein, "rb")
last = ''
for line in csv.reader(file_obj,
                       delimiter=',',           # Your custom delimiter.
                       skipinitialspace=True):  # Strips whitespace after delimiter.
    if line:                 # Make sure there's at least one entry.
        if line[0] == '':    # Constituency cell is blank...
            line[0] = last   # ...so fill it in from the row above.
        else:
            last = line[0]   # Remember the current constituency.
        csv_file.writerow(line)
fout.close()
As ever, most of the above code isn’t mine, it’s stolen (in this case, from Stack Overflow).
The second approach I thought of was to use a Google Apps script… (this exercise is left to the reader… feel free to post a copy of the script, or a link to a spreadsheet containing it, in the comments;-)
The third approach was to use a spreadsheet formula and create a new, fully populated constituency column. Here’s one formula that I thought of:
That is, if the cell to the left is empty, copy the contents of the cell above…
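(The formula itself might look something like this, assuming the patchily populated constituency names are in column A and the new, fully populated column is B – your column letters may well differ. In cell B2:)

```
=IF(ISBLANK(A2), B1, A2)
```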
So now I had a copy of the data in the form I needed, I could just copy and paste it from a spreadsheet into Many Eyes
You can see a copy of a derived treemap here. The view below uses colour to show the constituencies where the winning candidate won more than half of the valid votes cast in that constituency:
The next thing I wanted to do was to look at the actual number of votes a winning candidate beat the second placed candidate by, and also view this as a percentage of the total number of valid votes cast in that constituency. To do this, I created three new columns:
- one containing the total number of valid votes cast;
- one containing the number of votes the winning candidate beat the second placed candidate by;
- one containing the number of votes the winning candidate beat the second placed candidate by divided by the total number of valid votes cast, given as a percentage.
So how would you calculate those items?
Go on – think about it – how would you do it?
Here’s how I did it…
To count the total number of votes cast in a constituency, I used the formula:
(Note that before yesterday, I’d never heard of this formula… I just went googling for help on something like excel sum similar items…;-)
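(The formula in question was almost certainly a SUMIF, or something very like it – the column letters here are assumptions, with constituency names in column A and vote counts in column F:)

```
=SUMIF(A:A, A2, F:F)
```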
Select and drag this cell to populate the column…
For the vote majority formula, I only wanted this to apply to the winning candidate, so I put a trap in to see whether the constituency from the previous row is the same as the current row. If it isn’t, then because I assume the first candidate is the winning candidate, I can find the difference between the current row’s votes and that of the second placed candidate, which is immediately below:
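(As a formula, that trap might look something like this – again assuming constituency names in column A and vote counts in column F, with the winner listed first in each constituency:)

```
=IF(A2<>A1, F2-F3, "")
```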
Once again, select and drag this cell to populate the column…
I guess I really should also trap just to check that the row below is in the same constituency in case someone (the Speaker?) stands unopposed:
The percentage value is found simply by dividing the absolute number of votes majority by the total number of valid votes cast and multiplying by 100.
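(Assuming the total valid votes are in column G and the votes majority in column H, that’s just:)

```
=IF(H2<>"", 100*H2/G2, "")
```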
I uploaded the data to Many Eyes here.
So now we can look to see the number of votes majority an individual had, for example, compared to the turn out:
This chart brings two other things to mind. Firstly, it shows all the candidates – it would be more convenient in this case to just show the data for the winning candidates. Secondly, it suggests that we might also plot the number of votes behind the winning candidate each of the other candidates was… but that’s for another day…
What I will do though is consider the first point, specifically how we might just look at the data for the winning candidates… One way of achieving this is to use Google Spreadsheets as a database using the Google spreadsheets API as described in First Steps Towards a Generic Google Spreadsheets Query Tool, or At Least, A Guardian Datastore Interactive Playground.
The query also generates a link to a CSV version of the data. It’s not hard then to pop the data into Many Eyes so now we can visualise the winning MP majorities more easily – here we see the votes ahead of the second placed candidate vs. the valid number of votes cast:
We can also use the Datastore explorer to interrogate the database. For example, which MPs are defending a slim majority in terms of absolute number of votes?
Okay, that’s more than enough for now – this was only supposed to be a half hour post… What I hope this post has done, though, is show how I mix and match tools and techniques as I wander around a data set, as well as happily trawling the net to look for solutions to problems that I’m sure someone else must have solved before (such as summing a number of separate items labeled the same way in an Excel spreadsheet).
PS Somewhere along the line, I think when looking at majorities, a couple of points had negative values – the visualisation had turned up missing rows, like this one:
One of the things I experimented with a long time ago was the ability to use an RSS feed to power a presentation (e.g. Feedshow Link Presenter – Testing Audience Synch). The idea was that you should be able to bookmark a set of webpages using a social bookmarking tool, and then take the RSS feed of those bookmarked links and use it to drive a presentation; the presentation itself would be a walkthrough of the bookmarked pages.
Anyway, not so long ago, the Delicious social bookmarking service started to offer a similar service: Browse These Bookmarks, so that’s all well and good..:-)
One of the things that I had on the Feedshow Presenter to do list (and I’m not sure whether I ever coded this up or not) was to be able to display any description text saved with a bookmark before showing the contents of the bookmarked page. This in turn suggests another way of using a feed powered presentation tool – as a vehicle simply for displaying text elements from an RSS feed in a presentation like format.
Now I know that death by Powerpoint is often brought on by text heavy slides, but sometimes you may need to chat around text; and sometimes, splashing the text you want to talk around on a screen might be a handy thing to do…
So what I’m thinking is, if you want to talk around a document, then maybe talking around the document at a paragraph level is a handy thing to be able to do.
And a good place to find paragraph level feeds is from something like WriteToReply…
So for example, if you look at the Digital Economy Act on WriteToReply, and go to a particular page such as http://writetoreply.org/deact/subscriber-appeals/, you can get an RSS feed from that page by constructing a URL of the form http://writetoreply.org/deact/feed/paragraphlevel/subscriber-appeals/
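(The pattern is simple enough that we can sketch the rewrite in a few lines of Python – a toy illustration of the URL construction, not anything WriteToReply itself runs:)

```python
def paragraph_feed_url(page_url):
    """Turn a WriteToReply document page URL into its paragraph-level feed URL."""
    # http://writetoreply.org/DOC/PAGE/ -> http://writetoreply.org/DOC/feed/paragraphlevel/PAGE/
    parts = page_url.rstrip('/').split('/')
    page = parts.pop()       # the page slug, e.g. 'subscriber-appeals'
    base = '/'.join(parts)   # e.g. 'http://writetoreply.org/deact'
    return base + '/feed/paragraphlevel/' + page + '/'

feed = paragraph_feed_url('http://writetoreply.org/deact/subscriber-appeals/')
# feed is 'http://writetoreply.org/deact/feed/paragraphlevel/subscriber-appeals/'
```

The bookmarklets described below do this rewrite for you, of course – the point is just how mechanical it is.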
(Note there’s a minor glitch at the moment – the title of the feed itself is incorrect…)
So to discuss each paragraph in turn from that page, all we need to do is view the feed in Google Reader Play.
To make things easier, I’ve created a couple of bookmarks (bootstrapped from my “Get Current URL” pattern bookmarklet generator).
Firstly, given any RSS feed, here’s a bookmarklet for viewing it in Google Reader Play:
Secondly, given a page from a document hosted on WriteToReply, (not the RSS feed – the actual page; such as this one) this bookmarklet will construct the paragraph level page feed and pass it to Google Reader Play:
So there you have it, another way of supporting discussion around documents hosted on WriteToReply :-)
Whenever Facebook rolls out a major change, there’s a backlash… Here’s why I posted recently about how to opt out of Facebook’s new services…
Firstly, I’m quite happy to admit that you might benefit from opting in to the Facebook personalisation and behavioural targeting services. If you take the line that better targeted ads are content, and behavioural advertising is one way to achieve that, all well and good. Just bear in mind that your purchasing decisions will be even more directly influenced ;-)
What does concern me is that part of the attraction of Facebook for many people are its privacy controls. But when they’re too confusing to understand, and potentially misleading, it’s a Bad Thing… (I suppose you could argue that Facebook is innovating in terms of privacy, openness, and data sharing on behalf of its users, but is that a Good Thing?)
If folk think they have set their privacy settings one way, but those settings operate in another through the myriad interactions of the different options, users may find that the images and updates they think they are posting into a closed garden are in fact being made public in other ways, whether by the actions of their friends, applications they have installed, pages they have connected to, or websites they visit.
The Facebook privacy settings also seem to me to suggest various asymmetries. For example, if I think I am only sharing videos with friends, then if those friends can also share that content on because of the way I have set (or not changed) the default on another setting, I may be publishing content in a way that was not intended. It seems to me that Facebook is set up to devolve trust to the edge of my network – I publish to the edge of my network, for example, but the people or pages on the edge of my network can then push the content out further.
So for example, in the case of connecting to pages, Facebook says: “Keep in mind that Facebook Pages you connect to are public. You can control which friends are able to see connections listed on your profile, but you may still show up on Pages you’re connected to. If you don’t want to show up on those Pages, simply disconnect from them by clicking the “Unlike” link in the bottom left column of the Page.”
The privacy settings around how friends can share content I have shared with them are also confusing – do their privacy settings override mine on content I have published to them?
I’m starting to think (and maybe I’m wrong on this) that the best way of thinking about Facebook is to assume that everything you publish to your Facebook network can be republished by the members of your network under the terms of their privacy conditions. So if I publish a photo that you can see, then I have to assume that you can also publish it under your privacy settings. And so on…
This contrasts with a view of each object having a privacy setting, and that by publishing an object, the publisher controls that setting. So for example, I could publish an object and say it could only be seen by friends of mine, and that setting would stick with the object. If you tried to republish it, it could only be republished to your friends who are also my friends. My privacy settings would set the scope, or maximum reach, of your republication of it.
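(That “maximum reach” idea is just a set intersection – a toy sketch with made-up names:)

```python
# My privacy setting travels with the photo: only my friends may see it.
my_friends = {'alice', 'bob', 'carol'}

# You republish it under your settings...
your_friends = {'bob', 'carol', 'dave'}

# ...but the republication can only ever reach people in both circles.
reach = my_friends & your_friends
# reach is {'bob', 'carol'} - my setting caps the maximum reach.
```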
Regular readers will know I’ve started looking at ways of visualising Facebook networks using Gephi. What I’m starting to think is that Facebook should offer a visualisation of the furthest reach of a person’s data, videos, images, updates, etc, given their current privacy settings (or preview changes to that reach if they want to test out new privacy settings).
PS re the visualisation thing – something like this, generated from your current settings, would do the job nicely:
More at The Evolution of Privacy on Facebook, including a view of just how open things are now…
So – if you’re on Facebook, here’s a link you should all follow and take action about:
It should look like this:
But WordPress saves it like this:
The WordPress saved version isn’t properly resolved by Facebook, it just goes to:
It should go to a page that looks like this:
Here’s a shortened link that does work: http://bit.ly/bwG9Xe
Follow it, and decide whether you like what you see. You do know who your friends are, don’t you, and you do know who they know? And where they go? And what applications they have installed? Because my reading of the above is that they can share information you shared with them with all those people, whether you approve or not? Or maybe I just misunderstand the permissions granted by the above form in the weird and wacky game of Top Trumps that is the Facebook privacy environment. Maybe the permissions you set to only share photos and videos with friends trumps the settings that let friends share your photos and videos with applications and sites they visit. Or maybe they don’t? Does anyone know for certain…?!
This is what mine looks like now:
For more on this, see: Keeping Up with Facebook Privacy Changes (Again)
PS You should probably also consider unchecking this ( http://www.facebook.com/settings/?tab=privacy&section=applications#!/settings/?tab=privacy&section=applications&field=instant_personalization ):
If you leave it set to Allow, then when you visit a site that Facebook is friendly with, it might share your data with that partner site for you… bless…
PPS Because Facebook is getting increasingly cavalier with what it lets applications do with your data, I suggest you take a look at the applications you have installed from the Applications page at:
This page is not easy to find from the privacy settings, but can be reached from the Facebook Account menu, under Application Settings.
If you don’t use an app, particularly an external one, I suggest you delete it…
In Getting Started With Gephi Network Visualisation App – My Facebook Network, Part I I described how to get up and running with the Gephi network visualisation tool using social graph data pulled out of my Facebook account. In this post, I’ll explore some of the tools that Gephi provides for exploring a network in a more structured way.
If you aren’t familiar with Gephi, or haven’t read Part I of this series, I suggest you read that post first…
Okay, so where do we begin? As before, I’m going to start with a fresh worksheet, and load my Facebook network data, downloaded via the netvizz app, into Gephi, but as an undirected graph this time! So far, so exactly the same as last time. Just to give me some pointers over the graph, I’m going to set the node size to be proportional to the degree of each node (that is, the number of people each person is connected to).
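As an aside, if you’d rather check the degree numbers programmatically, here’s a minimal sketch using Python’s networkx library on a made-up toy network (the names are invented placeholders, not real netvizz data):

```python
# Toy sketch of the degree calculation Gephi uses for node sizing;
# the names here are invented placeholders, not real netvizz data.
import networkx as nx

G = nx.Graph()  # undirected, as with the Facebook friends graph
G.add_edges_from([
    ("me", "alice"), ("me", "bob"), ("me", "carol"),
    ("alice", "bob"),  # two of my friends who also know each other
])

# Degree = the number of people each person is connected to
for person, degree in G.degree():
    print(person, degree)
```

Gephi’s node-size-by-degree ranking is doing exactly this sort of count under the hood.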
I can activate text labels for the nodes that are proportional to the node sizes from the toolbar along the bottom of the Graph panel:
…remembering to turn on the text labels, of course!
So – how can we explore the data visually using Gephi? One way is to use filters. The notion of filtering is an incredibly powerful one, and one that I think is often both taken for granted and underestimated, so let’s just have a quick recap on what filtering is all about.
["green beans" by House Of Sims]
Filters – such as sieves or colanders, but also the bass and treble controls or graphic equalisers on music players, colour filters on cameras, and so on – are things that can be used to separate one thing from another based on their different properties. So for example, a colander can be used to separate green beans from the water they were boiled in, and a bass filter can be used to filter out the low frequency pounding of the bass on a music track. In Gephi, we can use filters to separate out parts of a network that have particular properties from other parts of the network.
The graph of Facebook friends that we’re looking at shows people I know as nodes; a line connecting two nodes (generally known as an edge) shows that the two people represented by the corresponding nodes are also friends with each other. The size of the node depicts its degree, that is, the number of edges that are connected to it. We might interpret this as the popularity (or at least, the connectedness) of a particular person in my Facebook network, as determined by the number of my friends that they are also a friend of.
(In an undirected network like Facebook, where if A is a friend of B, B is also a friend of A, the edges are simple lines. In a directed network, such as the social graph provided by Twitter, the edges have a direction, and are typically represented by arrows. The arrow shows the direction of the relationship defined by the edge, so in Twitter an arrow going from A to B might represent that A is a follower of B; but if there is no second arrow going from B to A, then B is not following A.)
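The distinction is easy to see in code, too – a quick networkx sketch, with A and B as placeholder names:

```python
# Undirected (Facebook-style) vs directed (Twitter-style) edges.
import networkx as nx

fb = nx.Graph()        # undirected: friendship is mutual
fb.add_edge("A", "B")  # A and B are friends of each other

tw = nx.DiGraph()      # directed: following is one-way
tw.add_edge("A", "B")  # A follows B

print(fb.has_edge("B", "A"))  # True - the friendship holds both ways
print(tw.has_edge("B", "A"))  # False - B is not following A back
```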
We’ve already used the degree property of the nodes to scale the size of the nodes as depicted in the network graph window. But we can also use this property to filter the graph, and see just who the most (or least) connected members of my Facebook network are. That is, we can see which people are friends with lots of the people I am friends with.
So for example – of my Facebook friends, which of them are friends of at least 35 people I am friends with? In the Filter panel, click on the Degree Range element in the Topology folder of the Filter panel Library, and drag and drop it on to the “Drag filter here” area.
Adjust the Degree Range settings slider and hit the Filter button. This allows us to see different views over the network corresponding to the number of connections. So for example, in the view shown above, we can see members of my Facebook network who are friends with at least 30 other friends in my network. In my case, the best connected are work colleagues.
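For what it’s worth, the Degree Range filter is doing no more than this sort of thing – here’s a networkx sketch over a random toy graph, with the threshold as a stand-in for the slider setting:

```python
# A sketch of Gephi's Degree Range filter in networkx.
import networkx as nx

# A random toy graph standing in for the Facebook data
G = nx.gnm_random_graph(50, 200, seed=1)

THRESHOLD = 5  # the "slider" setting, scaled down for the toy graph

# Keep only nodes with at least THRESHOLD connections
well_connected = [n for n, d in G.degree() if d >= THRESHOLD]
view = G.subgraph(well_connected)

print(len(view), "of", len(G), "nodes survive the filter")
```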
Going the other way, we can see who is not well connected:
One of the nice things we can do with Gephi is use the filters to create new graphs to work with, using the notion of workspaces.
If I export the graph of people in my network with more than 35 connections, it is placed into a new workspace, where I can work on it separately from the complete graph.
Navigating between workspaces is achieved via a controller in the status bar at the bottom right of the Gephi environment:
The new workspace contains just the nodes that had 35 or more connections in the original graph. (I’m not sure if we can rename the workspace, or add a description to it? If you know how to do this, please add a comment to the post saying how:-)
If we go back to the original graph, we can now delete the filter (right click, delete) and see the whole network again.
One very powerful filter rule that it’s worth getting to grips with is the Union filter. This allows you to view nodes (and the connections between them) from different filtered views of the graph that might otherwise be disjoint. So for example, if I want to look at members of my network with ten or fewer connections, but also see how they connect to each other and to Martin Weller, who has over 60 connections, the Union filter is the way to do it:
That is, the Union filter will display all nodes, and the connections between them, that either have 10 or fewer connections, or 60 or more connections.
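Again, as a sanity check on what the Union filter is doing, here’s a networkx sketch with an invented toy network (“martin” here is just a placeholder hub node, not real data):

```python
# A sketch of Gephi's Union filter: keep nodes whose degree is <= 10
# OR >= 60, discarding the middling ones. All names are invented.
import networkx as nx

G = nx.Graph()
# A well-connected hub...
G.add_edges_from(("martin", f"friend{i}") for i in range(65))
# ...a sparse chain of barely-connected people...
G.add_edges_from([("x", "y"), ("y", "z")])
# ...and someone of middling degree who should be filtered out.
G.add_edges_from(("middling", f"friend{i}") for i in range(20))

# The union of the two degree-range filters
keep = [n for n, d in G.degree() if d <= 10 or d >= 60]
union_view = G.subgraph(keep)

print("martin" in union_view)    # True  - degree 65, passes the >= 60 band
print("middling" in union_view)  # False - degree 20 falls between the bands
```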
As before, I can save just the members of this subnetwork to a new workspace, and save the whole project from the File menu in the normal way.
Okay, that’s enough for now… have a play with some of the other filter options, and paste a comment back here about any that look like they might be interesting. For example, can you find a way of displaying just the people who are connected to Martin Weller?