OUseful.Info, the blog…

Trying to find useful things to do with emerging technologies in open education

Archive for the ‘Tutorial’ Category

Postcards from a Text Processing Excursion

with 3 comments

It never ceases to amaze me how I lack even the most basic computer skills, but that’s one of the reasons I started this blog: to demonstrate and record my fumbling learning steps so that others maybe don’t have to spend so much time being as dazed and confused as I am most of the time…

Anyway, I spent a fair chunk of yesterday trying to find a way of getting started with grappling with CSV data text files that are just a bit too big to comfortably manage in a text editor or simple spreadsheet (so files over 50,000 or so rows, up to low millions) and that should probably be dumped into a database if that option was available, but for whatever reason, isn’t… (Not feeling comfortable with setting up and populating a database is one example…But I doubt I’ll get round to blogging my SQLite 101 for a bit yet…)

Note that the following tools are Unix tools – so they work on Linux and on a Mac, but probably not on Windows unless you install a unix tools package (such as GnuWincoreutils and sed, which look good for starters…). Another alternative would be to download the Data Journalism Developer Studio and run it either as a bootable CD/DVD, or as a virtual machine using something like VMWare or VirtualBox.

All the tools below are related to the basic mechanics of wrangling with text files, which include CSV (comma separated) and TSV (tab separated) files. Your average unix jockey will look at you with sympathetic eyes if you rave bout them, but for us mere mortals, they may make life easier for you than you ever thought possible…

[If you know of simple tricks in the style of what follows that I haven't included here, please feel free to add them in as a comment, and I'll maybe try to work then into a continual updating of this post...]

If you want to play along, why not check out this openurl data from EDINA (data sample; a more comprehensive set is also available if you’re feeling brave: monthly openurl data).

So let’s start at the beginning and imagine your faced with a large CSV file – 10MB, 50MB, 100MB, 200MB large – and when you try to open it in your text editor (the file’s too big for Google spreadsheets and maybe even for Google Fusion tables) the whole thing just grinds to a halt, if doesn’t actually fall over.

What to do?

To begin with, you may want to take a deep breath and find out just what sort of beast you have to contend with. You know the file size, but what else might you learn? (I’m assuming the file has a csv suffix, L2sample.csv say, so for starters we’re assuming it’s a text file…)

The wc (word count) command is a handy little tool that will give you a quick overview of how many rows there are in the file:

wc -l L2sample.csv

I get the response 101 L2sample.csv, so there are presumably 100 data rows and 1 header row.

We can learn a little more by taking the -l linecount switch off, and getting a report back on the number of words and characters in the file as well:

wc L2sample.csv

Another thing that you might consider doing is just having a look at the structure of the file, by sampling the first few rows of it and having a peek at them. The head command can help you here.

head L2sample.csv

By default, it returns the first 10 rows of the file. IF we want to change the number of rows displayed, we can use the -n switch:

head -n 4 L2sample.csv

As well as the head command, there is the tail command; this can be used to peek at the lines at the end of the file:

tail L2sample.csv
tail -n 15 L2sample.csv

When I look at the rows, I see they have the form:

logDate	logTime	encryptedUserIP	institutionResolverID	routerRedirectIdentifier ...
2011-04-04	00:00:03	kJJNjAytJ2eWV+pjbvbZTkJ19bk	715781	ukfed ...
2011-04-04	00:00:14	/DAGaS+tZQBzlje5FKsazNp2lhw	289516	wayf ...
2011-04-04	00:00:15	NJIy8xkJ6kHfW74zd8nU9HJ60Bc	569773	athens ...

So, not comma separated then; tab separated…;-)

If you were to upload a tab separated file to something like Google Fusion Tables, which I think currently only parses CSV text files for some reason, it will happily spend the time uploading the data – and then shove it into a single column.

I’m not sure if there are column splitting tools available in Fusion Tables – there weren’t last time I looked, though maybe we might expect a fuller range of import tools to appear at some point; many applications that accept text based data files allow you to specify the separator type, as for example in Google spreadsheets:

I’m personally living in hope that some sort of integration with the Google Refine data cleaning tool will appear one day…

If you want to take a sample of a large data file and put into another smaller file that you can play with or try things out with, the head (or tail) tool provides one way of doing that thanks to the magic of Unix redirection (which you might like to think of as a “pipe”, although that has a slightly different meaning in Unix land…). The words/jargon may sound confusing, and the syntax may look cryptic, but the effect is really powerful: take the output from a command and shove it into a file.

So, given a CSV file with a million rows, suppose we want to run a few tests in an application using a couple of hundred rows. This trick will help you generate the file containing the couple of hundred rows.

Here’s an example using L2sample.csv – we’ll create a file containing the first 20 rows, plus the header row:

head -n 21 L2sample.csv > subSample.csv

See the > sign? That says “take the output from the command on the left, and shove it into the file on the right”. (Note that if subSample.csv already exists, it will be overwritten, and you will lose the original.)

There’s probably a better way of doing this, but if you want to generate a CSV file (with headers) containing the last 10 rows, for example, of a file, you can use the cat command to join a file containing the headers with a file containing the last 10 rows:

head -n 1 L2sample.csv > headers.csv
tail -n 20 L2sample.csv > subSample.csv
cat headers.csv subSample.csv > subSampleWithHeaders.csv

(Note: don’t try to cat a file into itself, or Ouroboros may come calling…)

Another very powerful concept from the Unix command line is the notion of | (the pipe). This lets you take the output from one command and direct it to another command (rather than directing it into a file, as > does). So for example, if we want to extract rows 10 to 15 from a file, we can use head to grab the first 15 rows, then tail to grab the last 6 rows of those 15 rows (count them: 10, 11, 12, 13, 14, 15):

head -n 15 L2sample.csv | tail -n 6 > middleSample.csv

Try to read in as an English phrase (the | and > are punctuation): take the the first [head] 15 rows [-n 15] of the file L2sample.csv and use them as input [|] to the tail command; take the last [tail] 6 lines [-n 6] of the input data and save them [>] as the file middleSample.csv.

If we want to add in the headers, we can use the cat command:

cat headers.csv middleSample.csv > middleSampleWithHeaders.csv

We can use a pipe to join all sorts of commands. If our file only uses a single word for each column header, we can count the number of columns (single words) by grabbing the header row and sending it to wc, which will count the words for us:

head -n 1 L2sample.csv | wc

(Take the first row of L2sample.csv and count the lines/words/characters. If there is one word per column header, the word count gives us the column count…;-)

Sometimes we just want to split a big file into a set of smaller files. The split command is our frind here, and lets us split a file into smaller files containing up to a know number of rows/lines:

split -l 15 L2sample.csv subSamples

This will generate a series of files named subSamplesaa, subSamplesab, …, each containing 15 lines (except for the last one, which may contain less…).

Note that the first file will contain the header and 14 data rows, and the other files will contain 15 data rows but no column headings. To get round this, you might want to split on a file that doesn’t contain the header. (So maybe use wc -l to find the number of rows in the original file, create a header free version of the data by using tail on one less than the number of rows in the file, then split the header free version. You might then one to use cat to put the header back in to each of the smaller files…)

A couple of other Unix text processing tools let us use a CSV file as a crude database. The grep searches a file for a particular term or text pattern (known as a regular expression, which I’m not going to cover much in this post… suffice to note for now that you can do real text processing voodoo magic with regular expressions…;-)

So for example, in out test file, I can search for rows that contain the word mendeley

grep mendeley L2sample.csv

We can also redirect the output into a file:

grep EBSCO L2sample.csv > rowsContainingEBSCO.csv

If the text file contains columns that are separated by a unique delimiter (that is, some symbol that is only ever used to separate the columns), we can use the cut command to just pull out particular columns. The cut command assumes a tab delimiter (we can specify other delimiters explicitly if we need to), so we can use it on our testfile to pull out data from the third column in our test file:

cut -f 3 L2sample.csv

We can also pull out multiple columns and save them in a file:

cut -f 1,2,14,17 L2sample.csv > columnSample.csv

If you pull out just a single column, you can sort the entries to see what different entries are included in the column using the sort command:

cut -f 40 L2sample.csv | sort

(Take column 40 of the file L2sample.csv and sort the items.)

We can also take this sorted list and identify the unique entries using the uniq command; so here are the different entries in column 40 of our test file:

cut -f 40 L2sample.csv | sort | uniq

(Take column 40 of the file L2sample.csv, sort the items, and display the unique values.)

(The uniq command appears to make comparaisons between consecutive lines, hence the nee to sort first.)

The uniq command will also count the repeat occurrence of unique entries if we ask it nicely (-c):

cut -f 40 L2sample.csv | sort | uniq -c

(Take column 40 of the file L2sample.csv, sort the items, and display the unique values along with how many times they appear in the column as a whole.)

The final command I’m going to mention here is magic search and replace operator called sed. I’m aware that this post is already over long, so I’ll maybe return to this in a later post, aside from giving you a tease of scome scarey voodoo… how to convert a tab delimited file to a comma separated file. One recipe is given by Kevin Ashley as follows:

sed 's/"/\\\"/g; s/^/"/; s/$/"/; s/ctrl-V<TAB>/","/g;' origFile.tsv > newFile.csv

(See also this related question on #getTheData: Converting large-ish tab separated files to CSV.)

Note: if you have a small amount of text and need to wrangle it on some way, the Text Mechanic site might have what you need…

This lecture note on Unix Tools provides a really handy cribsheet of Unix command line text wrangling tools, though the syntax does appear to work for me using some of the commands as given their (the important thing is the idea of what’s possible…).

If you’re looking for regular expression helpers (I haven’t really mentioned these at all in this post, suffice to say they’re a mechanism for doing pattern based search and replace, and which in the right hands can look like real voodoo text processing magic!), check out txt2re and Regexpal (about regexpal).

TO DO: this is a biggie – the join command will join rows from two files with common elements in specified columns. I canlt get it working properly with my test files, so I’m not blogging it just yet, but here’s a starter for 10 if you want to try… Unix join examples

Written by Tony Hirst

June 3, 2011 at 11:53 am

Tech Tips: Making Sense of JSON Strings – Follow the Structure

with 4 comments

Reading through the Online Journalism blog post on Getting full addresses for data from an FOI response (using APIs), the following phrase – relating to the composition of some Google Refine code to parse a JSON string from the Google geocoding API – jumped out at me: “This took a bit of trial and error…”

Why? Two reasons… Firstly, because it demonstrates a “have a go” attitude which you absolutely need to have if you’re going to appropriate technology and turn it to your own purposes. Secondly, because it maybe (or maybe not…) hints at a missed trick or two…

So what trick’s missing?

Here’s an example of the sort of thing you get back from the Google Geocoder:

{ “status”: “OK”, “results”: [ { "types": [ "postal_code" ], “formatted_address”: “Milton Keynes, Buckinghamshire MK7 6AA, UK”, “address_components”: [ { "long_name": "MK7 6AA", "short_name": "MK7 6AA", "types": [ "postal_code" ] }, { “long_name”: “Milton Keynes”, “short_name”: “Milton Keynes”, “types”: [ "locality", "political" ] }, { “long_name”: “Buckinghamshire”, “short_name”: “Buckinghamshire”, “types”: [ "administrative_area_level_2", "political" ] }, { “long_name”: “Milton Keynes”, “short_name”: “Milton Keynes”, “types”: [ "administrative_area_level_2", "political" ] }, { “long_name”: “United Kingdom”, “short_name”: “GB”, “types”: [ "country", "political" ] }, { “long_name”: “MK7″, “short_name”: “MK7″, “types”: [ "postal_code_prefix", "postal_code" ] } ], “geometry”: { “location”: { “lat”: 52.0249136, “lng”: -0.7097474 }, “location_type”: “APPROXIMATE”, “viewport”: { “southwest”: { “lat”: 52.0193722, “lng”: -0.7161451 }, “northeast”: { “lat”: 52.0300728, “lng”: -0.6977000 } }, “bounds”: { “southwest”: { “lat”: 52.0193722, “lng”: -0.7161451 }, “northeast”: { “lat”: 52.0300728, “lng”: -0.6977000 } } } } ] }

The data represents a Javascript object (JSON = JavaScript Object Notation) and as such has a standard form, a hierarchical form.

Here’s another way of writing the same object code, only this time laid out in a way that reveals the structure of the object:

{
  "status": "OK",
  "results": [ {
    "types": [ "postal_code" ],
    "formatted_address": "Milton Keynes, Buckinghamshire MK7 6AA, UK",
    "address_components": [ {
      "long_name": "MK7 6AA",
      "short_name": "MK7 6AA",
      "types": [ "postal_code" ]
    }, {
      "long_name": "Milton Keynes",
      "short_name": "Milton Keynes",
      "types": [ "locality", "political" ]
    }, {
      "long_name": "Buckinghamshire",
      "short_name": "Buckinghamshire",
      "types": [ "administrative_area_level_2", "political" ]
    }, {
      "long_name": "Milton Keynes",
      "short_name": "Milton Keynes",
      "types": [ "administrative_area_level_2", "political" ]
    }, {
      "long_name": "United Kingdom",
      "short_name": "GB",
      "types": [ "country", "political" ]
    }, {
      "long_name": "MK7",
      "short_name": "MK7",
      "types": [ "postal_code_prefix", "postal_code" ]
    } ],
    "geometry": {
      "location": {
        "lat": 52.0249136,
        "lng": -0.7097474
      },
      "location_type": "APPROXIMATE",
      "viewport": {
        "southwest": {
          "lat": 52.0193722,
          "lng": -0.7161451
        },
        "northeast": {
          "lat": 52.0300728,
          "lng": -0.6977000
        }
      },
      "bounds": {
        "southwest": {
          "lat": 52.0193722,
          "lng": -0.7161451
        },
        "northeast": {
          "lat": 52.0300728,
          "lng": -0.6977000
        }
      }
    }
  } ]
}

Making Sense of the Notation

At its simplest, the structure has the form: {“attribute”:”value”}

If we parse this object into the jsonObject, we can access the value of the attribute as jsonObject.attribute or jsonObject["attribute"]. The first style of notation is called a dot notation.

We can add more attribute:value pairs into the object by separating them with commas: a={“attr”:”val”,”attr2″:”val2″} and address them (that is, refer to them) uniquely: a.attr, for example, or a["attr2"].

Try it out for yourself… Copy and past the following into your browser address bar (where the URL goes) and hit return (i.e. “go to” that “location”):

javascript:a={"attr":"val","attr2":"val2"}; alert(a.attr);alert(a["attr2"])

(As an aside, what might you learn from this? Firstly, you can “run” javascript in the browser via the location bar. Secondly, the javascript command alert() pops up an alert box:-)

Note that the value of an attribute might be another object.

obj={ attrWithObjectValue: { “childObjAttr”:”foo” } }

Another thing we can see in the Google geocoder JSON code are square brackets. These define an array (one might also think of it as an ordered list). Items in the list are address numerically. So for example, given:

arr[ "item1", "item2", "item3" ]

we can locate “item1″ as arr[0] and “item3″ as arr[2]. (Note: the index count in the square brackets starts at 0.) Try it in the browser… (for example, javascript:list=["apples","bananas","pears"]; alert( list[1] );).

Arrays can contain objects too:

list=[ "item1", {"innerObjectAttr":"innerObjVal" } ]

Can you guess how to get to the innerObjVal? Try this in the browser location bar:

javascript: list=[ "item1", { "innerObjectAttr":"innerObjVal" } ]; alert( list[1].innerObjectAttr )

Making Life Easier

Hopefully, you’ll now have a sense that there’s structure in a JSON object, and that that (sic) structure is what we rely on if we want to cut down on the “trial an error” when parsing such things. To make life easier, we can also use “tree widgets” to display the hierarchical JSON object in a way that makes it far easier to see how to construct the dotted path that leads to the data value we want.

A tool I have appropriated for previewing JSON objects is Yahoo Pipes. Rather than necessarily using Pipes to build anything, I simply make use of it as a JSON viewer, loading JSON into the pipe from a URL via the Fetch Data block, and then previewing the result:

Another tool (and one I’ve just discovered) is an Air application called JSON-Pad. You can paste in JSON code, or pull it in from a URL, and then preview it again via a tree widget:

Clicking on one of the results in the tree widget provides a crib to the path…

Summary

Getting to grips with writing addresses into JSON objects helps if you have some idea of the structure of a JSON object. Tree viewers make the structure of an object explicit. By walking down the tree to the part of it you want, and “dotting” together* the nodes/attributes you select as you do so, you can quickly and easily construct the path you need.

* If the JSON attributes have spaces or non-alphanumeric characters in them, use the obj["attr"] notation rather than the dotted obj.attr notation…

PS Via my feeds today, though something I had bookmarked already, this Data Converter tool may be helpful in going the other way… (Disclaimer: I haven’t tried using it…)

If you know of any other related tools, please feel free to post a link to them in the comments:-)

Written by Tony Hirst

April 12, 2011 at 10:25 am

Posted in onlinejournalismblog, Tutorial, Uncourse

Tagged with

Gephi Bits 2: A Further Look at Comments on Social Objects in a Closed Community

with one comment

In the previous post in this set (Gephi Bits 1: Comments on Social Objects in a Closed Community), I started having a play with comment and favourites data from a series of peer review activities in the OU course Design thinking: creativity for the 21st century.

In particular, I loaded simple pairwise CSV data directly into Gephi, relating comment id and favourite ids to photo ids. The resulting images provided a view over the photos that showed which photos were heavily commented and/or favourited. Towards the end of the post, I suggested it might be interesting to be able to distinguish between the comment and favorite nodes by colouring them somehow. So let’s start by seeing how we might achieve that…

The easiest way I can think of is to preload Gephi with a definition of each node and the assignment of a type label to each node – photo, comment or favourite. We can then partition – and colour – each node based on the type label.

To define the nodes and type labels, we can use a file defined using the GUESS .gdf format. In particular, we define the nodes as follows:

nodedef> name VARCHAR, ltype VARCHAR
p189, photo
p191, photo

c1428, comment
c1429, comment

f1005, fave
f1006, fave

Load this file into Gephi, and then append the contents of the comment-photo and favourite-photo CSV files to the graph. We can then colour the nodes (sized according to, err, something!) according to partition:

Coloured partitions in Gephi

If we filter the network for a particular photo using an ego filter, we can get a colour coded view of the comment and favourite IDs associated with that image:

Coloured nodes and labels in Gephi

What we’ve achieved so far is a way of exploring how heavily commented or favourited a photo is, as well as picking up a tip or two about labeling and colouring nodes. But what about if we wanted a person level analysis, where we could visually identify the individuals who had posted the most images, or whose images were most heavily commented upon and favourited?

To start with, let’s capture some information about each of the nodes. In the following example, we have an identifer (for a photo, favourite or comment), followed by a user id (the person who made the comment or favourite, or who uploaded the photo), and a label (photo, comment or fave). (The ltype field also captures a sense of this.)

nodedef> name VARCHAR, username VARCHAR, ltype VARCHAR
p189,jd342,photo
p191,jd342,photo
p192,pn43,photo
..
c1189,pd73,comment
c1190,srs22,comment
..
f46,ww66,fave
f47,ee79,fave

Rather than describe edges based on connecting comment or favourite ID to photo ID, we can easily generate links of the form userID, photoID, where userID is the ID of the user making a comment or favouriting an image. However, it is possible to annotate the edges to describe whether or not the link relates to a comment or favouriting action. So for example:

edgedef> otherUser VARCHAR, photo VARCHAR, etype VARCHAR
pd73,p189,comment
srs22,p226,comment

ww66,p176,fave

Alternatively, we might just use the simpler format:
edgedef> otherUser VARCHAR, photo VARCHAR
pd73,p189
srs22,p226

ww66,p176

In this simpler case, we can just load in the node definition gdf file, and follow it by adding the actual graph edge data from CSV files, which is what I’ve done for what follows.

Firstly, here’s the partition colour palette:

Gephi - partition colours

The null entities relate to nodes that didn’t get an explicit node specification (i.e. the person nodes).

To provide a bit of flexibility over the graph, I loaded the the favourites and comment edges in as directed edges from “Other user” to photo ID, where “Other user” is the user ID of the person making the comment or favourite.

If we size the graph by out-degree, we can look at which users are actively engaged in commenting on photos:

Gephi - who's commenting/favouriting

The size of the arrow depicts whether or not they are multiple edges going from one person to a photo, so we can see, for example, where someone has made multiple comments on the same photo.

If we size by in-degree, we can see which photos are popular:

Gephi - what photos are popular

If we run an ego filter over over a photo id, we can see who commented on it.

However, what we would really like to be able to do is look at the connections between people via a photo (for example, to see who has favourited who’s photos). If we add in another edge data file that links from a photo ID to a person ID (the person who uploaded the photo), we can start to explore these relationships.

NB the colour palette changes in what follows…

Having captured user to photo relationships based on commenting, favouriting or uploading behaviour, we can now do things like the following. Here for example is a use of a simple filter to see which of a user’s photo’s are popular:

Gephi - simple filter

If we run a simple ego filter, we can see the photos that a user has uploaded or commented on/favourited:

Gephi - ego filter

If we increase the depth to 2, we can see who else a particular user is connected to by virtue of a shared interest in the same photographs (I’m not sure what edge size relates to here…?):

Ego depth 2 in gephi - who connects to whom

Here, user ba49 is outsize because they uploaded a lot of the images that are shown. (The above graph shows linkage between ba49 and other users who either commented on/favourited one of ba49′s images, or who commented/favourited photo that ba49 also commented on/favourited.)

Doh – it appears I’ve crashed Gephi, so as it’s late, I’m going to stop for now! In the next post, I’ll show how we can further elaborate the nodes using extended user identifiers that designate the role a person is acting in (eg as a commenter, favouriter or photo uploader) to see what sorts of view this lets us take over the network.

Written by Tony Hirst

May 27, 2010 at 10:31 pm

Posted in Tutorial, Visualisation

Tagged with ,

Getting Started With The Gephi Network Visualisation App – My Facebook Network, Part IV

with 3 comments

In the first two posts in this series, I described how to use Gephi to visualise various different views over a personal social network in Facebook using data pulled from a Facebook account using the Netvizz application. This was followed by a post describing how to and run some simple social network analysis statistics over the network. In this post, we’ll look at another powerful analytic tool provided by Gephi: clustering. [Note that after publishing this post, a far better way of visualising the clustered groups was suggested to me - find out more in the next post in this series.]

Clustering is a mathematical process in which different elements in a particular set are grouped together based on certain similarities between the different elements. That is, like is grouped with like. In a social network, clustering algorithms typically group together individuals who form “subnetworks” – for example, subsets of the the whole population who all know one another.

Being able to identify clusters within a social network allows you to identify “subnetworks” within that larger network. In Gephi, a graph can be clustered by running the Modularity in the Statistics panel – so here’s what I get when I run this measure over my Facebook network:

Modularity clustering in Gephi

The Modularity clustering tool identifies several different clusters, (that is, groups or classes) within the network as a whole, associating each node with one of the groups.

One you have run the tool, you can view the cluster each node has been associated with by using the Modularity Class Ranking parameter; I find that the colour mapping is the most effective:

Gephi modularity applied

You can inspect the size of the various classes in a crude fashion via the Filter panel: select the Modularity Class option from the Partition folder in the filter Library and drag the filter to the query window. If you click on the Partition column element, you will be presented with a window showing each pf the partition classes:

Modularity class partitions, gephi

If you now filter on one of the partition classes, and switch on the node names, you can see which nodes have been clustered together. Looking separately at two of the largest clusters in my Facebook network, I can see two OU clusters:

OU cluster in my Facebook network via Gephi

and:

My Facebook community - OU cluster via gephi

and a North America/Canada dominated ed-tech cluster (which also includes some BBC folk via Bill Thompson…):

Clustering my Facebook network in gephi

(Note that it is possible to highlight/filter on more than one cluster within a single filter (on a Mac, fn-click allows you to select multiple individual clusters).

The Modularity statistic thus provides us with a powerful tool for identifying subgroupings within a social network. So why not try it using your own data – are the clusters that are identified meaningful to you?

PS a far better way of visualising the clustered groups was suggested to me – find out more in the next post in this series.

Written by Tony Hirst

May 12, 2010 at 10:19 pm

Posted in Tutorial, Uncourse

Getting Started With Gephi Network Visualisation App – My Facebook Network, Part III: Ego Filters and Simple Network Stats

with one comment

In a couple of previous posts on exploring my Facebook network with Gephi, I’ve shown how to plot visualise the network, and how to start constructing various filtered views over it (Getting Started With The Gephi Network Visualisation App – My Facebook Network, Part I and Getting Started With Gephi Network Visualisation App – My Facebook Network, Part II: Basic Filters). In this post, I’ll explore a new feature, ego filters, as well as looking at some simple social network analysis tools that can help us better understand the structure of a social network.

To start with, I’m going to load my Facebook network data (grabbed via the Netvizz app, as before) into Gephi as an undirected graph. As mentioned above, the ego network filter is a new addition to Gephi, which will show that part of a graph that is connected to a particular person. So for example, I can apply the ego filter (from the Topology folder in the list of filters) to “George Siemens” to see which of my Facebook friends George knows.

Gephi - ego filter - my Facebook friends who are friends with George Siemens

If I save this as a workspace, I can then tunnel into it a little more, for example by applying a new ego filter to the subgraph of my friends who George Siemens knows. In this case, lets add Grainne to the mix – and see who of my friends know both George Siemens and Grainne:

Ego filter applied within an ego filtered workspace

Note that I could have achieved a similar effect with the full graph by using the intersection filter (as introduced in the previous post in this series):

Seeing my facebook connections that two of my Facebook friends know

The depth of the ego filter also allows you to see who of of my friends the named individual knows either directly, or through one of my other friends. Using an ego filtered network to depth two (frined of a friend) around George Siemens, I can run some network statistics over just that group of people. So for example, if I run the Degree statistics over the network, and then set the node size according to node degree within that network this is what I get:

Running stats on the network

(I also turned node labels on and set their size proportional to node size.)

Running Network Diameter stats generates the following sorts of report:

Gephi Network diameter stats

That is:

- betweenness centrality;
- closeness centrality;
- eccentricity.

These all sound pretty technical, so what do they refer to?

Betweenness centrality is a measure based on the number of shortest paths between any two nodes that pass through a particular node. Nodes around the edge of the network would typically have a low betweenness centrality. A high betweenness centrality might suggest that the individual is connecting various different parts of the network together.

Closeness centrality is a measure that indicates how close a node is to all the other nodes in a network, whether or not the node lays on a shortest path between other nodes. A high closeness centrality means that there is a large average distance to other nodes in the network. (So a small closeness centrality means there is a short average distance to all other nodes in the network. Geddit? (I think sometimes the reciprocal of this measure is given as closeness centrality:-).

The eccentricity measure captures the distance between a node and the node that is furthest from it; so a high eccentricity means that the furthest away node in the network is a long way away, and a low eccentricity means that the furthest away node is actually quite close.

So let’s have a look at the structure of my Facebook network, as filtered according to George’s ego filter, depth 2:

Plotting size proportional to betweenness centrality, we see Martin Weller, Grainne and Stephen Downes are influential in keeping different parts of my network connected:

Betweenness centrality

As far as outliers go, we can look at the closeness centrality and eccentricity (to protect the innocent, I shall suppress the names!)

Eccentricity (size) and closeness centrality (colour) in gephi

Here, the colour field defines the closeness centrality and the size of the node the eccentricity. It’s quite easy to identify the people in this network who are not well connected and who are unlikely to be able to reach each other easily through those of my friends they know.

From nods with similar sizes and different colours, we also see how it’s quite possible for two nodes to have a similar eccentricity (similar distances to the furthest away nodes) and very different closeness centrality (that is, the node may have a small or large average distance to every other node in the graph). For example, if a node is connected to a very well connected node, it will lower the closeness centrality.

So for example, if we look at the ego network with the above netwrok based around the very well connected Martin Weller, what do we see?

Further filter

The colder, blue shaded circles (high closeness centrality) have disappeared. Being a Martin Weller friend (in my Facebook network at least) has the effect of lowering your closeness centrality, i.e. bringing you closer to all the other people in the network.

Okay, that’s definitely more than enough for now. Why not have a play looking at your Facebook network, and seeing if you can identify who the best connected folk are?

PS when plotting charts, I think Gephi uses data from the last statistics run it did, even if that was in another workspace, so it’s always worth running the statistics over the current graph if you intend to chart something based on those stats…

Written by Tony Hirst

May 10, 2010 at 1:02 pm

Posted in Tutorial, Uncourse

Tagged with ,

Mashlib Pipes Tutorial: 2D Journal Search

with 8 comments

[This post is a more complete working of Mash Oop North – Pipes Mashup by Way of an Apology]

Keeping current with journal articles in a particular subject area is one of the challenges that faces many researchers, and by implication the academic and research librarians tasked with supporting the information needs of those researchers.

This relatively simple recipe shows how to create a “two dimensional” search that allows a user to provide two sets of keywords, one to identify a set of journals in a particular subject area, the other to filter the current articles down to a particular subtopic in that subject area.

What this demo shows is:
- how to pull in a list of journals in a particular subject area based on user provided keywords into the Yahoo pipes environment;
- how to pull in the most recent table of contents from those journals into that environment;
- how to then filter those recent articles to only display articles on a particular subtopic.

Th starting point for this recipe is jOPML, a service created by Scott Wilson that allows you to run a keyword search on the titles of journals whose tables of contents are made available as RSS on ticTOCs and a generate an OPML feed containing the RSS feed URLs for those journal TOCS. (OPML is an XML formatted language that, among other things, can be used to transport bundles of RSS feed URLs around the web. In much the same way that RSS is one of the most effective ways of transporting sets of links to web pages around the web (as for example in the case of RSS feeds from social bookmarking sites such as delicious), so OPML is one of the best ways of moving collections of RSS links around.)

Now as well as consuming RSS feeds, Yahoo Pipes can also pull in other data formats. So for example:

- the Fetch Feed block can pull in a wide variety of RSS flavoured forms (different versions of RSS, Atom etc); [Handy tip - a pipe that just wires a Fetch Feed block direct to the pip output can be used to "normalise" different flavours of RSS/Atom in order to provide a single, standard feed format at the output of the pipe. ]<

- Fetch Data can be used to import XML and JSON into the pipes environment (with Fetch CSV pulling in CSV data files, from sources such as Google Spreadsheets);

- Fetch Page can be used to load HTML web pages into Yahoo Pipes, providing the means by which to develop simple screen scraping applications within the Pipes environment.

What this means is that we can pull in the OPML file generated by jOPML into the Yahoo Pipes environment and have a play with it :-)

So let’s see how. To start with, we need to find a way of getting arbitrary OPML files out of jOPML. Running a search for science history on jOPML returns:

with the OPML available here: http://jopml.org/feeds.opml?q=science+history

Looking at this URI, you’ll hopefully see that it contains the search terms used to query the journals database on jOPML. In effect, the URI is an API to the jOPML service. By rewriting the URI, we can make different calls on the jOPML service, and return different OPML files for different topic areas.

AS-AN-ASIDE TAKE HOME POINT: many URIs effectively provide an API to a web service. If you ever see a search form, run some queries using it, and look at the URIs of th results page. If you can see your search terms in the URI, you are now in a position to construct your own queires to that service simply by using the URI, rather than having to go by the search form.

Here are a couple of services you can try this out with:
- Google: http://google.com;
- Twitter: http://search.twitter.com.
Remember, the goal is to:
1) run a search;
2) look at the the URI of the results page and see if you can spot the search terms;
3) try to hack the URI to run a search using a new set of search terms.
Are there any other hackable items in the URI? For example, run several Twitter searches returning different numbers of search results, and look at the URI in each case. Can you see how to hack it to return the number of results items that you want? (Note that there is a hard limit set at the Twitter end that defines the maximun numbr of results that can be returned.)

It’s not just search terms that appear in URIs either. For example, the ISBN Playground will generate a wide variety of URIs that are all “keyed” using an arbitrary ISBN. (Actually, thot’s not quite true; many of them require ISBN 10 format ISBNs. But there are ways around that, as I’ll show in a later post…) If I’m missing any URIs you know of that contain ISBNs, please let me know in a comment to this post ;-)

Anyway, that’s more than enough of that! Let’s go back to the 2D journal search recipe, and let the pipework begin…

The main idea bhind Yahoo pipes is to “wire” together different components in order to complete some sort of task. When you create a new Yahoo pipe you are presented with an empty canvas that dominates the screen on which to create your “pipe”, and a menu area on the left that contains different blocks that you can use to create your pipe.

Blocks are added to the canvas either by dragging them from th menu area and dropping them on the canvas, or clicking the + symbol on the block you want in the menu area, which adds it to the canvas automatically.

Blocks are wired together by clicking on the circle on the bottom of a block and dragging and dropping the “wire” onto a circle on the top of the next block in your pipe.

The idea is that content flows through one block into the next, entering the block along its top edge, being processed by the block as appropriate, and then passing out through the bottom edge of the block.

Blocks that do not have an input circle on the top edge are used to pull content into the pipe from elsewhere. (These can be found in the Sources part of the menu panel.)

In contrast, the Pipe output block does not have any circles on its lower edge – the output from this block is exposed to the outside world on the the pipe’s public home page. (The single pipe output block is added to the canvas automatically when you add an input block. Pipes can have multiple input blocks, but only on output block.)

(If this sort of interaction design appeals to you – that is, “wiring” separate components in some sort of linear workflow together – a Javascript library is available that implements the drag, drop and wire features so you can implement an interface similar to the Yahoo Pipes interface in your own web applications: WireIt – a Javascript Wiring Library. To see WireIt in action, check out Tarpipe.)

So where do we start? The first thing to do is to construct the URI to the OPML feed that we can then use to pull in the OPML feed for a set of journals on a particular topic:

If you highlight a block by clicking on it, it will glow orange. You can then inspect the output from just this block by looking in the preview pan at the bottom of the screen:

The “Journal keywords (text)” block is actually a Text input block:

The URL builder constructs a URL from the fixed elements of the URI (the page location http://jopml.org/feeds.opml and the query variable q) and the user inputted search terms. The user inputs are exposed as text entry boxes on the front page of the pipe as well by arguments in the URI for the pipe (e.g. in the same way that the query terms appear in the jOPML URIs).

In order to import the contents of the jOPML file, we can use the Fetch Datablock.

To see what we’ll be working with, here’s what an original OPML file looks like:

If we load this XML file into Pipes, we need to tell the Fetch Data block what parts of the OPML file which should use as separate items in within the pipe. Looking at the OPML file, we ideally want each journal to be represented as a separate item within the pipe. We do this by specifying the path to the outlin element in the OPML feed, noting that each journal listing is represented using a separate outline element.

Within the pipes environment, the OPML file is represented as follows:

Each outline element contains information regarding a single journal – it’s title, xmlUrl, and so on. The xmlUrl element contains the URI of the RSS feed for the contents of the current issue of the particular journal. You’ll see that the xmlURI points to the RSS feed of the journal from the publisher’s site.

So for example, the RSS version of the TOCs for the journal The British Journal for the History of Science can be found at http://journals.cambridge.org/data/rss/feed_BJH_rss_2.0.xml.

Now you could of course subscribe to all these journal table of contents feeds simply importing the OPML file into an RSS reader such as Google Reader, but where would the fun be in that? After all, most of the time, I’m not actually that interested in most of the articles in any particular journal. For example, it would be far more efficient (?!) if I was only alerted to articles that were in my subject area. So let’s see how to do that…

The Loop block lets us work with each item in the selected journals feed. Essentially, it says “for each item in a feed, do something to or with that item”. (For each is a really powerful idea in computational thinking. It does pretty much exactly what it says on the tin: for each item in a list, do something with it. In the Yahoo Pipes environment, the Loop block essentially implements for each):

You’ll see that the loop block has a space for adding another block – the block whose functionality will be applied to each element in the incoming feed. As well as placing ‘standard’ pipes blocks taken from the blocks menu in a Loop element, you can also use pipes you have created yourself.

If we embed a Fetch Feed block in the Loop, then for each journal item identified in the imported OPML feed, we can locate its TOCs RSS feed URI (the xmlUrl element) and use it to fetch the contents of that feed.

Now you may notice that the Loop block can output the results of the Fetch Feed call in one of two ways; it can either annotate the original feed items, for example by assigning (that is, adding) the current list of contents for a journal to each a subelement of each item in the pipe:

In more abstract terms, we might represent that as follows:

Or byemitting the items, which is to say that each item that comes into the the Loop block is replaced by the set of items that were created within the Loop block:

Here’s how that looks diagrammatically:

Because I want to produce a feed that just contains links to articles that may be of interest to me, we’re going to use the “emit all results” option.

So let’s just recap where we are. Here’s the pipe so far:

We start by taking some user keyword terms and construct a URI that calls the jOPML service, returning an OPML file that contains the titles and TOC RSS URLs of journals related to those keywords. We then loop through that list of journals, replacing each journal item with a list of items corresponding to the current table of contents of each journal. These items are pulled from the table of contents RSS feed for each journal as obtained from the ticTOCs listings.

The next step is to filter the contents list so that we only get passed journal articles on a particular topic. We’ll do that using a crude keyword filter that only lets articles through whose contents contain a particular keyword or set of keywords.

Taking the filter block, we wire in another user input that allows the user to specify keywords that must appear in the title element of an article for it to be emitted from the pipe, and take the output from this filter to the output of the pipe.

So there we have it: a 2D search that takes two sets of keywords, one set that pulls out likely suspect journals on a topic, and the second set that filters articles from those journals on a more detailed subject.

The output form the pipe is then available as an RSS feed in its own right, as a Google personal (iGoogle) widget, etc etc.

The whole pipe looks like this:

It works by generating a jOPML URI based on user provided keyword terms, importing the jOPML feed into the pipe, grabbing the RSS feed of the table of contents for each journal specified in the OPML feed and then filtering those contents listings using another set of keyword terms based on the title of each article.

In doing so, you have seen how to use the URL Builder block to construct the jOPML URI using user provided search terms entered using a Text Input block; the Import Data block to grab the jOPML XML feed; the Loop and Fetch Feed blocks to pull in the table of contents RSS feed from the journal publisher for each journal identified in the jOPML feed; and the Filter block to pass through only those articles that contain a second set of user specified keywords in the article title.

Enjoy :-)

PS if I manage to blag being able to run a Library Mashup uncourse in the Autumn, this is about the level of detail post I’d was planning to write. So – too much detail? Not enough? Just right? How’s it for leveling? Appropriate for a ‘not necessarily techie, but keen to learn’ audience?

Written by Tony Hirst

July 9, 2009 at 8:17 am

Posted in Pipework, Tutorial

Tagged with , , ,

Follow

Get every new post delivered to your Inbox.

Join 134 other followers