OUseful.Info, the blog…

Trying to find useful things to do with emerging technologies in open education

Archive for the ‘Tutorial’ Category

To What Extent Do Candidates Support Each Other Redux – A One-Liner, Thirty Second Route to the Info

In More Storyhunting Around Local Elections Data Using Gephi – To What Extent Do Candidates Support Each Other? I described a visual route to finding out which local council candidates had supported each other on their nomination papers. There is also a thirty second route to that data that I should probably have mentioned;-)

To get the data out of the Scraperwiki database, we need to interrogate the API:

scraperwiki api

To do this, we’ll use a database query language – SQL.

What we need to ask the database is which of the assentors (members of the support column) are also candidates (members of the candinit column), and to return just those rows. The SQL command is simply this:

select * from support where support in (select candinit from support)

Note that "support" refers to two different things in the query

select * from support where support in (select candinit from support)

The support in "where support in" is a column – the one holding the assentor names – while the support that appears after each from names the table the columns are being pulled from.
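As an aside, the Scraperwiki datastore is just a SQLite database, so if you have a local copy of the scraped table you can test the same subquery logic offline. Here's a minimal Python sketch of that – the database filename is a placeholder, and I'm assuming the table and column names are as above:

import sqlite3

#Placeholder filename - point this at a local SQLite copy of the scraped data
conn = sqlite3.connect('nominations.sqlite')

#The same query as above: keep only rows whose assentor also appears as a candidate
query = """
select * from support
where support in (select candinit from support)
"""

for row in conn.execute(query):
    print(row)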

Here’s the result of running the query on Scraperwiki:

sql select on scraperwiki

We can also get a direct link to a tabular view of the data (or generate a link to a CSV output etc from the format selector).

candidates mutual table

There are 15 rows in this result compared to the 15 edges/connecting lines discovered in the Gephi approach, so each method corroborates the other:

Tidier intra-candidate support map

Simples:-)

Written by Tony Hirst

May 8, 2013 at 10:50 am

Infoskills 2.012 – Practical Exercises in Social Media Network Analysis #change11

As ever, it seems the longer I have to prepare something, the less likely I am to do it. I was supposed to be running a #change11 MOOC session this week – Infoskills 2.012 How to do a lot with a little – but having had it in the diary for six months or so, I have, of course, done nothing to prepare for it… (I didn’t come up with the 2.012 – not sure who did?)

Anyway….over the weekend, I gave a presentation (Social Media Visualisation Hacks) that, typically, bewildered the audience with a blizzard of things that are possible when it comes to looking at social networks but that are still alien to most:

As ever, the presentation is not complete (i.e. the slides really need to be complemented by a commentary), but that’s something I hope to start working on improving – maybe starting this week…

The deck is a review – of sorts – of some of the various ways we can look at social networks and the activity that takes place within them. The slides are prompts, keys, search phrase suggestions that provide a starting point for finding out more. Many of the slides contain screenshots – and if you peer closely enough, you can often see the URL. For posts on my blog, searching with the word ouseful followed by key terms from the post title will often turn up the result on major search engines. Many of the slides identify a “hack” that is described in pseudo-tutorial form on this blog, or on Martin Hawksey’s MASHe blog.

I put together a delicious stack of links relating to the presentation here: #drg12 – Visualising Social Networks (Tutorial Posts)

For a tutorial stack that focusses more on Yahoo Pipes (though who knows how much longer that is for this world, given the perilous nature of Yahoo at the moment), see: Twitter Pipes

My #change11 week was supposed to be about new info skills, with a practical emphasis. A couple of other presentations relating to how we might appropriate (a-pro-pre-eight) tools and applications can be found here: Appropriate IT – My ILI2011 Presentation and Just Do IT Yourself… My UKSG Presentation.

If all you do in Google is 2.3 keyword searches, this deck – Making the Most of Google – (though possibly a little dated by now), may give you some new ideas.

For a more formal take on infoskills for the new age (though I need to write a critique of this from my own left-field position), see the Cambridge University Library/Arcadia Project “New Curriculum for Information Literacy (ANCIL)”, available via the Arcadia project.

If you want to do some formal reading in the visualisation space, check out 7 Classic Foundational Vis Papers You Might not Want to Publicly Confess you Don’t Know.

Via @cogdog/Alan Levine, I am reminded of Jon Udell’s Seven ways to think like the web. You do think like that, right?!;-)

Mulling over @downes’ half hour post on Education as Platform: The MOOC Experience and what we can do to make it better, I see the MOOC framework as providing an element of co-ordination, pacing and a legacy resource package. For my week, I was expected(?!) to put together some readings and exercises and maybe a webinar. But I haven’t prepared anything. (I tried giving a talk at Dev8D earlier this year completely unprepared (though I did have an old presentation I could have used) and it felt to me like a car crash/a disaster, so I know I do need to prep things – even though it may not seem like it if you’ve heard me speak before!;-)

But maybe that’s okay, for one week of this MOOC? The OUseful.info blog, where you’re maybe reading this, is an ongoing presentation, with a post typically every day or so. When I learn something related to the general themes of this blog, I post it here, often as a partial tutorial (partial in the sense that you often have to work through the tutorial for the words to make sense – they complement the things you should be seeing on screen, if you look, as you work through it; in a sense, the tutorial posts are often second screen complements to, and drivers of, an activity on another screen).

I’ve personally tried registering with MOOCs a couple of times, but never completed any of the activities. Some of the MOOC related readings or activities pass my way through blogs I follow, or tweeted links that pique my interest, and sometimes I try them out. I guess I’m creating my own unstructured uncourse* daily anyway… (*uncourses complement MOOCs, sort of… They’re courses created live in partial response to feedback, but also reflecting the “teacher”‘s learning journey through a topic. Here’s an example that led to a formal OU course: Digital Worlds uncourse blog experiment. The philosophy is based on the “teacher” modelling – and documenting – a learning journey. Uncourses fully expect the “teacher” not to be totally knowledgeable about the subject area, but to be happy to demonstrate how they go about making sense of a topic that may well be new to them.)

So… this is my #change11 offering. It’s not part of the “formal” course (how weird does that sound?!), unless it is… As the MOOC is now in week 29, if its principles have been taken on board, you should be starting to figure out your own distributed co-ordination mechanisms by now. Because what else will you do when the course is over? Or will it be a course that never ends, yet ceases to have a central co-ordination point?

PS if you want to chat, this blog is open to comments; you can also find me on Twitter as @psychemedia

PPS seems like I’ve had at least one critical response (via trackbacks) towards my lackadaisical “contribution” to my “teaching” week on the #Change11 MOOC. True. Sorry. But not. I should have kept it simple, posted my motto – identify a problem, then hack a solution to it, every day – and left it at that… It’s how I learn about this stuff… (And any teaching I receive tends to be indirect, by virtue of stuff other folk have published that I’ve discovered through web search – that is, through the search queries/questions I’ve had to formulate to help me answer the problem I have identified/created…)

Written by Tony Hirst

March 27, 2012 at 9:53 am

Posted in Open Education, Tutorial, Uncourse


Social Interest Positioning – Visualising Facebook Friends’ Likes With Data Grabbed Using Google Refine

What do my Facebook friends have in common in terms of the things they have Liked, or in terms of their music or movie preferences? (And does this say anything about me?!) Here’s a recipe for visualising that data…

After discovering via Martin Hawksey that the recent (December, 2011) 2.5 release of Google Refine allows you to import JSON and XML feeds to bootstrap a new project, I wondered whether it would be able to pull in data from the Facebook API if I was logged in to Facebook (Google Refine does run in the browser after all…)

Looking through the Facebook API documentation whilst logged in to Facebook, it’s easy enough to find exemplar links to things like your friends list (https://graph.facebook.com/me/friends?access_token=A_LONG_JUMBLE_OF_LETTERS) or the list of likes someone has made (https://graph.facebook.com/me/likes?access_token=A_LONG_JUMBLE_OF_LETTERS); replacing me with the Facebook ID of one of your friends should pull down a list of their friends, or likes, etc.

(Note that the validity of the access token is time limited, so you can’t grab a copy of the access token and hope to use the same one day after day.)

Grabbing your friends list from Facebook is simply a case of opening a new project, choosing to get the data from a Web Address, and then pasting in the friends list URL:

Google Refine - import Facebook friends list

Click on next, and Google Refine will download the data, which you can then parse as a JSON file, and from which you can identify individual record types:

Google Refine - import Facebook friends

If you click the highlighted selection, you should see the data that will be used to create your project:

Google Refine - click to view the data

You can now click on Create Project to start working on the data – the first thing I do is tidy up the column names:

Google Refine - rename columns

We can now work some magic – such as pulling in the Likes our friends have made. To do this, we need to create the URL for each friend’s Likes using their Facebook ID, and then pull the data down. We can use Google Refine to harvest this data for us by creating a new column containing the data pulled in from a URL built around the value of each cell in another column:

Google Refine - new column from URL

The Likes URL has the form https://graph.facebook.com/me/likes?access_token=A_LONG_JUMBLE_OF_LETTERS which we’ll tinker with as follows:

Google Refine - crafting URLs for new column creation

The throttle control tells Refine how long to wait between calls. I set this to 500ms (that is, half a second), so it takes a few minutes to pull in my couple of hundred or so friends (I don’t use Facebook a lot;-). I’m not sure what rate the Facebook API is happy with – if you hit it too fast (i.e. set the throttle time too low), you may find the Facebook API stops returning data to you for a cooling down period…
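For what it’s worth, here’s a rough Python sketch of what that fetching step amounts to if you wanted to do it outside Refine – build each friend’s Likes URL and fetch it with a pause between calls. (The use of the requests library, the placeholder token and the made-up friend IDs are my assumptions; the URL pattern is the one shown above.)

import time
import requests

#Placeholder token and hypothetical friend IDs - in Refine these come from the imported friends list
ACCESS_TOKEN = 'A_LONG_JUMBLE_OF_LETTERS'
friend_ids = ['1234567890', '2345678901']

likes_by_friend = {}
for fid in friend_ids:
    url = 'https://graph.facebook.com/%s/likes?access_token=%s' % (fid, ACCESS_TOKEN)
    likes_by_friend[fid] = requests.get(url).json()
    time.sleep(0.5)  #roughly the 500ms throttle setting used in Refine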

Having imported the data, you should find a new column:

Google Refine - new data imported

At this point, it is possible to generate a new column from each of the records/Likes in the imported data… in theory (or maybe not…). I found this caused Refine to hang though, so instead I exported the data using the default Templating… export format, which produces some sort of JSON output…

I then used this Python script to generate a two column data file where each row contained a (new) unique identifier for each friend and the name of one of their likes:

import simplejson,csv

#Output file: one row per (friend, Like) pair, with every field quoted
writer=csv.writer(open('fbliketest.csv','wb+'),quoting=csv.QUOTE_ALL)

#The file exported from Google Refine using the Templating... exporter
fn='my-fb-friends-likes.txt'

data = simplejson.load(open(fn,'r'))
#Use a simple counter as an anonymised identifier for each friend
id=0
for d in data['rows']:
    id=id+1
    #'interests' is the column name containing the Likes data
    interests=simplejson.loads(d['interests'])
    for i in interests['data']:
        print str(id),i['name'],i['category']
        #Write out the friend identifier and the name of the Like (dropping non-ascii characters)
        writer.writerow([str(id),i['name'].encode('ascii','ignore')])

[I think this R script, in answer to a related @mhawksey Stack Overflow question, also does the trick: R: Building a list from matching values in a data.frame]

I could then import this data into Gephi and use it to generate a network diagram of what they commonly liked:

Sketching common likes amongst my facebook friends

Rather than returning Likes, I could equally have pulled back lists of the movies, music or books they like, their own friends lists (permissions settings allowing), etc etc, and then generated friends’ interest maps on that basis.

[See also: Getting Started With The Gephi Network Visualisation App – My Facebook Network, Part I and how to visualise Google+ networks]

PS dropping out of Google Refine and into a Python script is a bit clunky, I have to admit. What would be nice would be to be able to do something like a “create new rows with new column from column” pattern that would let you set up an iterator through the contents of each of the cells in the column you want to generate the new column from, and for each pass of the iterator: 1) duplicate the original data row to create a new row; 2) add a new column; 3) populate the cell with the contents of the current iteration state. Or something like that…
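In the meantime, here’s a rough Python sketch of the row-expanding pattern I have in mind – for each row, iterate over the items held in one cell and emit a duplicated row per item. (The input rows and column names here are made up purely for illustration.)

#Hypothetical input: one row per friend, with a multi-valued 'interests' cell
rows = [
    {'friend': 'f1', 'interests': ['cycling', 'jazz']},
    {'friend': 'f2', 'interests': ['jazz']},
]

expanded = []
for row in rows:
    for item in row['interests']:
        new_row = dict(row)          #1) duplicate the original row
        del new_row['interests']
        new_row['interest'] = item   #2) add a new column; 3) populate it with the current item
        expanded.append(new_row)

print(expanded)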

PPS Related to the PS request, there is a sort of related feature in the 2.5 release of Google Refine that lets you merge data from across rows with a common key into a newly shaped data set: Key/value Columnize. Seeing this, it got me wondering what a fusion of Google Refine and RStudio might be like (or even just R support within Google Refine?)

PPPS this could be interesting – looks like you can test to see if a friendship exists given two Facebook user IDs.

PPPPS This paper in PNAS – Private traits and attributes are predictable from digital records of human behavior – by Kosinski et al. suggests it’s possible to profile people based on their Likes. It would be interesting to see how robust that profiling is compared to profiles based on the common Likes of a person’s followers, or the common Likes of folk in the same Facebook groups as an individual?

Written by Tony Hirst

January 4, 2012 at 11:06 am

A Tool Chain for Plotting Twitter Archive Retweet Graphs – Py, R, Gephi

Another set of stepping stones that provide a clunky route to a solution that @mhawksey has been working on a far more elegant expression of (eg Free the tweets! Export TwapperKeeper archives using Google Spreadsheet and Twitter: How to archive event hashtags and create an interactive visualization of the conversation)…

The recipe is as follows:

- download a Twapperkeeper archive to a CSV file using a Python script as described in Python Script for Exporting (Large) Twapperkeeper Archives By User; the CSV file should contain a single column with one row per archive entry; each row includes the sender, the tweet, the tweet ID and a timestamp; **REMEMBER – TWAPPERKEEPER ARCHIVES WILL BE DISABLED ON JAN 6TH, 2012**

- in an R environment (I use RStudio), reuse code from Rescuing Twapperkeeper Archives Before They Vanish and Cornelius Puschmann’s post Generating graphs of retweets and @-messages on Twitter using R and Gephi:

require(stringr)

#A helper function to remove @ symbols from user names...
trim <- function (x) sub('@','',x)

twapperkeeperCSVParse=function(fp){
    df = read.csv(fp, header=F)
    df$from=sapply(df$V1,function(tweet) str_extract(tweet,"^([[:alnum:]_]*)"))
    df$id=sapply(df$V1,function(tweet) str_extract(tweet,"[[:digit:]/s]*$"))
    df$txt=sapply(df$V1,function(tweet) str_trim(str_replace(str_sub(str_replace(tweet,'- tweet id [[:digit:]/s]*$',''),end=-35),"^([[:alnum:]_]*:)",'')))
    df$to=sapply(df$txt,function(tweet) trim(str_extract(tweet,"^(@[[:alnum:]_]*)")))
    df$rt=sapply(df$txt,function(tweet) trim(str_match(tweet,"^RT (@[[:alnum:]_]*)")[2]))
    return(df)
}
#usage: 
#twarchive.df=twapperkeeperCSVParse("PATH_TO_YOUR_FILE")
#For example:
df=twapperkeeperCSVParse("~/code/twapps/reports/twArchive_online11.txt")

ats.df <- data.frame(df$from,df$to)
rts.df <- data.frame(df$from,df$rt)

#Cribbing http://blog.ynada.com/339
require(igraph)
ats.g <- graph.data.frame(ats.df, directed=T)
rts.g <- graph.data.frame(rts.df, directed=T)

write.graph(ats.g, file="ats.graphml", format="graphml")
write.graph(rts.g, file="rts.graphml", format="graphml")

- Cornelius’ code uses the igraph library to construct a graph and export graphml files that describe graphs of @ behaviour (tweets in the archive sent from one user to another) and RT behaviour (tweets from one person retweeting another using the RT @name convention).

- visualise the graphml files in Gephi. Note a couple of things: empty nodes aren’t handled properly in my version of the code, so the graph includes a dummy node that the senders of tweets with no @ or RT target all point to; when you visualise the graph, this node will be obvious, so just delete it ;-)

- the Gephi visualisation by default uses the Label attribute for labeling nodes – we need to change this:

Gephi - setting node label choice

You should now be able to view graphs that illustrate RT or @ behaviour as captured in a Twapperkeeper archive in Gephi.

ILI2011 RT behaviour
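As an aside, if you wanted to do the RT edge extraction in Python rather than R, the same regular expression idea carries over directly – a minimal sketch, assuming the sender and tweet text have already been pulled out of the archive into a list of pairs:

import re

#Hypothetical (sender, tweet text) pairs parsed out of an archive
tweets = [
    ('userA', 'RT @userB: something worth repeating'),
    ('userC', '@userA nice one'),
]

#Mirrors the "^RT (@[[:alnum:]_]*)" pattern used in the R code
rt_pattern = re.compile(r'^RT @([A-Za-z0-9_]+)')

rt_edges = []
for sender, text in tweets:
    m = rt_pattern.match(text)
    if m:
        rt_edges.append((sender, m.group(1)))  #edge: retweeter -> retweeted user

print(rt_edges)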

Just by the by, we can also generate stats’n graphs of the contents of the archive. For example, via Getting Started With Twitter Analysis in R, we can generate a bar plot to show who was retweeted most:

require(ggplot2)

ggplot()+geom_bar(aes(x=na.omit(df$rt)))+opts(axis.text.x=theme_text(angle=-90,size=6))+xlab(NULL)

We can also do some counting to find out who was RT’d the most, for example:

#count the occurrences of each name in the rt column
rt.count=data.frame(table(df$rt))
#sort the results in descending order and display the top 5 results
head(rt.count[order(-rt.count$Freq),],5)
#There are probably better ways of doing that! If so, let me know via comments

Next on the to do list is:
– automate the production of archive reports
– work in the time component so we can view behaviour over time in Gephi… (here’s a starting point maybe, again from Cornelius Puschmann’s blog: Dynamic Twitter graphs with R and Gephi (clip and code))

As things stand though, I may not be able to get round to either of those for a while…

Written by Tony Hirst

December 21, 2011 at 4:55 pm

Postcards from a Text Processing Excursion

It never ceases to amaze me how I lack even the most basic computer skills, but that’s one of the reasons I started this blog: to demonstrate and record my fumbling learning steps so that others maybe don’t have to spend so much time being as dazed and confused as I am most of the time…

Anyway, I spent a fair chunk of yesterday trying to find a way of getting started with grappling with CSV data text files that are just a bit too big to comfortably manage in a text editor or simple spreadsheet (so files over 50,000 or so rows, up to low millions) and that should probably be dumped into a database if that option was available, but for whatever reason, isn’t… (Not feeling comfortable with setting up and populating a database is one example…But I doubt I’ll get round to blogging my SQLite 101 for a bit yet…)

Note that the following tools are Unix tools – so they work on Linux and on a Mac, but probably not on Windows unless you install a unix tools package (such as the GnuWin coreutils and sed packages, which look good for starters…). Another alternative would be to download the Data Journalism Developer Studio and run it either as a bootable CD/DVD, or as a virtual machine using something like VMWare or VirtualBox.

All the tools below are related to the basic mechanics of wrangling with text files, which include CSV (comma separated) and TSV (tab separated) files. Your average unix jockey will look at you with sympathetic eyes if you rave about them, but for us mere mortals, they may make life easier than you ever thought possible…

[If you know of simple tricks in the style of what follows that I haven’t included here, please feel free to add them in as a comment, and I’ll maybe try to work them into a continual updating of this post…]

If you want to play along, why not check out this openurl data from EDINA (data sample; a more comprehensive set is also available if you’re feeling brave: monthly openurl data).

So let’s start at the beginning and imagine you’re faced with a large CSV file – 10MB, 50MB, 100MB, 200MB large – and when you try to open it in your text editor (the file’s too big for Google spreadsheets and maybe even for Google Fusion tables) the whole thing just grinds to a halt, if it doesn’t actually fall over.

What to do?

To begin with, you may want to take a deep breath and find out just what sort of beast you have to contend with. You know the file size, but what else might you learn? (I’m assuming the file has a csv suffix, L2sample.csv say, so for starters we’re assuming it’s a text file…)

The wc (word count) command is a handy little tool that will give you a quick overview of how many rows there are in the file:

wc -l L2sample.csv

I get the response 101 L2sample.csv, so there are presumably 100 data rows and 1 header row.

We can learn a little more by taking the -l linecount switch off, and getting a report back on the number of words and characters in the file as well:

wc L2sample.csv

Another thing that you might consider doing is just having a look at the structure of the file, by sampling the first few rows of it and having a peek at them. The head command can help you here.

head L2sample.csv

By default, it returns the first 10 rows of the file. If we want to change the number of rows displayed, we can use the -n switch:

head -n 4 L2sample.csv

As well as the head command, there is the tail command; this can be used to peek at the lines at the end of the file:

tail L2sample.csv
tail -n 15 L2sample.csv

When I look at the rows, I see they have the form:

logDate	logTime	encryptedUserIP	institutionResolverID	routerRedirectIdentifier ...
2011-04-04	00:00:03	kJJNjAytJ2eWV+pjbvbZTkJ19bk	715781	ukfed ...
2011-04-04	00:00:14	/DAGaS+tZQBzlje5FKsazNp2lhw	289516	wayf ...
2011-04-04	00:00:15	NJIy8xkJ6kHfW74zd8nU9HJ60Bc	569773	athens ...

So, not comma separated then; tab separated…;-)

If you were to upload a tab separated file to something like Google Fusion Tables, which I think currently only parses CSV text files for some reason, it will happily spend the time uploading the data – and then shove it into a single column.

I’m not sure if there are column splitting tools available in Fusion Tables – there weren’t last time I looked, though maybe we might expect a fuller range of import tools to appear at some point; many applications that accept text based data files allow you to specify the separator type, as for example in Google spreadsheets:

I’m personally living in hope that some sort of integration with the Google Refine data cleaning tool will appear one day…

If you want to take a sample of a large data file and put into another smaller file that you can play with or try things out with, the head (or tail) tool provides one way of doing that thanks to the magic of Unix redirection (which you might like to think of as a “pipe”, although that has a slightly different meaning in Unix land…). The words/jargon may sound confusing, and the syntax may look cryptic, but the effect is really powerful: take the output from a command and shove it into a file.

So, given a CSV file with a million rows, suppose we want to run a few tests in an application using a couple of hundred rows. This trick will help you generate the file containing the couple of hundred rows.

Here’s an example using L2sample.csv – we’ll create a file containing the first 20 rows, plus the header row:

head -n 21 L2sample.csv > subSample.csv

See the > sign? That says “take the output from the command on the left, and shove it into the file on the right”. (Note that if subSample.csv already exists, it will be overwritten, and you will lose the original.)

There’s probably a better way of doing this, but if you want to generate a CSV file (with headers) containing just the last 20 rows of a file, for example, you can use the cat command to join a file containing the headers with a file containing the last 20 rows:

head -n 1 L2sample.csv > headers.csv
tail -n 20 L2sample.csv > subSample.csv
cat headers.csv subSample.csv > subSampleWithHeaders.csv

(Note: don’t try to cat a file into itself, or Ouroboros may come calling…)

Another very powerful concept from the Unix command line is the notion of | (the pipe). This lets you take the output from one command and direct it to another command (rather than directing it into a file, as > does). So for example, if we want to extract rows 10 to 15 from a file, we can use head to grab the first 15 rows, then tail to grab the last 6 rows of those 15 rows (count them: 10, 11, 12, 13, 14, 15):

head -n 15 L2sample.csv | tail -n 6 > middleSample.csv

Try to read it as an English phrase (the | and > are punctuation): take the first [head] 15 rows [-n 15] of the file L2sample.csv and use them as input [|] to the tail command; take the last [tail] 6 lines [-n 6] of the input data and save them [>] as the file middleSample.csv.

If we want to add in the headers, we can use the cat command:

cat headers.csv middleSample.csv > middleSampleWithHeaders.csv

We can use a pipe to join all sorts of commands. If our file only uses a single word for each column header, we can count the number of columns (single words) by grabbing the header row and sending it to wc, which will count the words for us:

head -n 1 L2sample.csv | wc

(Take the first row of L2sample.csv and count the lines/words/characters. If there is one word per column header, the word count gives us the column count…;-)

Sometimes we just want to split a big file into a set of smaller files. The split command is our friend here, and lets us split a file into smaller files containing up to a known number of rows/lines:

split -l 15 L2sample.csv subSamples

This will generate a series of files named subSamplesaa, subSamplesab, …, each containing 15 lines (except for the last one, which may contain less…).

Note that the first file will contain the header and 14 data rows, and the other files will contain 15 data rows but no column headings. To get round this, you might want to split a version of the file that doesn’t contain the header. (So maybe use wc -l to find the number of rows in the original file, create a header free version of the data by using tail on one less than the number of rows in the file, then split the header free version. You might then want to use cat to put the header back in to each of the smaller files…)
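If you’d rather script that header juggling, here’s a minimal Python sketch of the same idea – split a file into chunks of a given number of data rows, repeating the header line at the top of each chunk (the chunk size and filenames are arbitrary):

chunk_size = 15

with open('L2sample.csv') as f:
    header = f.readline()
    rows = f.readlines()

#Write out numbered chunk files, each starting with the header row
for n, start in enumerate(range(0, len(rows), chunk_size)):
    with open('subSample_%02d.csv' % n, 'w') as out:
        out.write(header)
        out.writelines(rows[start:start + chunk_size])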

A couple of other Unix text processing tools let us use a CSV file as a crude database. The grep command searches a file for a particular term or text pattern (known as a regular expression, which I’m not going to cover much in this post… suffice to note for now that you can do real text processing voodoo magic with regular expressions…;-)

So for example, in our test file, I can search for rows that contain the word mendeley:

grep mendeley L2sample.csv

We can also redirect the output into a file:

grep EBSCO L2sample.csv > rowsContainingEBSCO.csv

If the text file contains columns that are separated by a unique delimiter (that is, some symbol that is only ever used to separate the columns), we can use the cut command to pull out particular columns. The cut command assumes a tab delimiter (we can specify other delimiters explicitly if we need to), so we can use it to pull out data from the third column of our test file:

cut -f 3 L2sample.csv

We can also pull out multiple columns and save them in a file:

cut -f 1,2,14,17 L2sample.csv > columnSample.csv

If you pull out just a single column, you can sort the entries to see what different entries are included in the column using the sort command:

cut -f 40 L2sample.csv | sort

(Take column 40 of the file L2sample.csv and sort the items.)

We can also take this sorted list and identify the unique entries using the uniq command; so here are the different entries in column 40 of our test file:

cut -f 40 L2sample.csv | sort | uniq

(Take column 40 of the file L2sample.csv, sort the items, and display the unique values.)

(The uniq command appears to make comparisons between consecutive lines, hence the need to sort first.)

The uniq command will also count the repeated occurrences of each unique entry if we ask it nicely (-c):

cut -f 40 L2sample.csv | sort | uniq -c

(Take column 40 of the file L2sample.csv, sort the items, and display the unique values along with how many times they appear in the column as a whole.)

The final command I’m going to mention here is the magic search and replace operator called sed. I’m aware that this post is already over long, so I’ll maybe return to this in a later post, aside from giving you a tease of some scary voodoo… how to convert a tab delimited file to a comma separated file. One recipe is given by Kevin Ashley as follows:

sed 's/"/\\\"/g; s/^/"/; s/$/"/; s/ctrl-V<TAB>/","/g;' origFile.tsv > newFile.csv

(See also this related question on #getTheData: Converting large-ish tab separated files to CSV.)

Note: if you have a small amount of text and need to wrangle it in some way, the Text Mechanic site might have what you need…

This lecture note on Unix Tools provides a really handy cribsheet of Unix command line text wrangling tools, though the syntax doesn’t appear to work for me for some of the commands as given there (the important thing is the idea of what’s possible…).

If you’re looking for regular expression helpers (I haven’t really mentioned these at all in this post; suffice to say they’re a mechanism for doing pattern based search and replace, which in the right hands can look like real voodoo text processing magic!), check out txt2re and Regexpal (about regexpal).

TO DO: this is a biggie – the join command will join rows from two files with common elements in specified columns. I can’t get it working properly with my test files, so I’m not blogging it just yet, but here’s a starter for 10 if you want to try… Unix join examples

Written by Tony Hirst

June 3, 2011 at 11:53 am

Tech Tips: Making Sense of JSON Strings – Follow the Structure

Reading through the Online Journalism blog post on Getting full addresses for data from an FOI response (using APIs), the following phrase – relating to the composition of some Google Refine code to parse a JSON string from the Google geocoding API – jumped out at me: “This took a bit of trial and error…”

Why? Two reasons… Firstly, because it demonstrates a “have a go” attitude which you absolutely need to have if you’re going to appropriate technology and turn it to your own purposes. Secondly, because it maybe (or maybe not…) hints at a missed trick or two…

So what trick’s missing?

Here’s an example of the sort of thing you get back from the Google Geocoder:

{ "status": "OK", "results": [ { "types": [ "postal_code" ], "formatted_address": "Milton Keynes, Buckinghamshire MK7 6AA, UK", "address_components": [ { "long_name": "MK7 6AA", "short_name": "MK7 6AA", "types": [ "postal_code" ] }, { "long_name": "Milton Keynes", "short_name": "Milton Keynes", "types": [ "locality", "political" ] }, { "long_name": "Buckinghamshire", "short_name": "Buckinghamshire", "types": [ "administrative_area_level_2", "political" ] }, { "long_name": "Milton Keynes", "short_name": "Milton Keynes", "types": [ "administrative_area_level_2", "political" ] }, { "long_name": "United Kingdom", "short_name": "GB", "types": [ "country", "political" ] }, { "long_name": "MK7", "short_name": "MK7", "types": [ "postal_code_prefix", "postal_code" ] } ], "geometry": { "location": { "lat": 52.0249136, "lng": -0.7097474 }, "location_type": "APPROXIMATE", "viewport": { "southwest": { "lat": 52.0193722, "lng": -0.7161451 }, "northeast": { "lat": 52.0300728, "lng": -0.6977000 } }, "bounds": { "southwest": { "lat": 52.0193722, "lng": -0.7161451 }, "northeast": { "lat": 52.0300728, "lng": -0.6977000 } } } } ] }

The data represents a Javascript object (JSON = JavaScript Object Notation) and as such has a standard form, a hierarchical form.

Here’s another way of writing the same object code, only this time laid out in a way that reveals the structure of the object:

{
  "status": "OK",
  "results": [ {
    "types": [ "postal_code" ],
    "formatted_address": "Milton Keynes, Buckinghamshire MK7 6AA, UK",
    "address_components": [ {
      "long_name": "MK7 6AA",
      "short_name": "MK7 6AA",
      "types": [ "postal_code" ]
    }, {
      "long_name": "Milton Keynes",
      "short_name": "Milton Keynes",
      "types": [ "locality", "political" ]
    }, {
      "long_name": "Buckinghamshire",
      "short_name": "Buckinghamshire",
      "types": [ "administrative_area_level_2", "political" ]
    }, {
      "long_name": "Milton Keynes",
      "short_name": "Milton Keynes",
      "types": [ "administrative_area_level_2", "political" ]
    }, {
      "long_name": "United Kingdom",
      "short_name": "GB",
      "types": [ "country", "political" ]
    }, {
      "long_name": "MK7",
      "short_name": "MK7",
      "types": [ "postal_code_prefix", "postal_code" ]
    } ],
    "geometry": {
      "location": {
        "lat": 52.0249136,
        "lng": -0.7097474
      },
      "location_type": "APPROXIMATE",
      "viewport": {
        "southwest": {
          "lat": 52.0193722,
          "lng": -0.7161451
        },
        "northeast": {
          "lat": 52.0300728,
          "lng": -0.6977000
        }
      },
      "bounds": {
        "southwest": {
          "lat": 52.0193722,
          "lng": -0.7161451
        },
        "northeast": {
          "lat": 52.0300728,
          "lng": -0.6977000
        }
      }
    }
  } ]
}

Making Sense of the Notation

At its simplest, the structure has the form: {"attribute":"value"}

If we parse this object into the jsonObject, we can access the value of the attribute as jsonObject.attribute or jsonObject["attribute"]. The first style of notation is called a dot notation.

We can add more attribute:value pairs into the object by separating them with commas: a={"attr":"val","attr2":"val2"} and address them (that is, refer to them) uniquely: a.attr, for example, or a["attr2"].

Try it out for yourself… Copy and paste the following into your browser address bar (where the URL goes) and hit return (i.e. “go to” that “location”):

javascript:a={"attr":"val","attr2":"val2"}; alert(a.attr);alert(a["attr2"])

(As an aside, what might you learn from this? Firstly, you can “run” javascript in the browser via the location bar. Secondly, the javascript command alert() pops up an alert box:-)

Note that the value of an attribute might be another object.

obj={ attrWithObjectValue: { "childObjAttr":"foo" } }

Another thing we can see in the Google geocoder JSON code is square brackets. These define an array (one might also think of it as an ordered list). Items in the list are addressed numerically. So for example, given:

arr=[ "item1", "item2", "item3" ]

we can locate "item1" as arr[0] and "item3" as arr[2]. (Note: the index count in the square brackets starts at 0.) Try it in the browser… (for example, javascript:list=["apples","bananas","pears"]; alert( list[1] );).

Arrays can contain objects too:

list=[ "item1", {"innerObjectAttr":"innerObjVal" } ]

Can you guess how to get to the innerObjVal? Try this in the browser location bar:

javascript: list=[ "item1", { "innerObjectAttr":"innerObjVal" } ]; alert( list[1].innerObjectAttr )

Making Life Easier

Hopefully, you’ll now have a sense that there’s structure in a JSON object, and that that (sic) structure is what we rely on if we want to cut down on the “trial and error” when parsing such things. To make life easier, we can also use “tree widgets” to display the hierarchical JSON object in a way that makes it far easier to see how to construct the dotted path that leads to the data value we want.

A tool I have appropriated for previewing JSON objects is Yahoo Pipes. Rather than necessarily using Pipes to build anything, I simply make use of it as a JSON viewer, loading JSON into the pipe from a URL via the Fetch Data block, and then previewing the result:

Another tool (and one I’ve just discovered) is an Air application called JSON-Pad. You can paste in JSON code, or pull it in from a URL, and then preview it again via a tree widget:

Clicking on one of the results in the tree widget provides a crib to the path…

Summary

Getting to grips with writing addresses into JSON objects helps if you have some idea of the structure of a JSON object. Tree viewers make the structure of an object explicit. By walking down the tree to the part of it you want, and “dotting” together* the nodes/attributes you select as you do so, you can quickly and easily construct the path you need.

* If the JSON attributes have spaces or non-alphanumeric characters in them, use the obj[“attr”] notation rather than the dotted obj.attr notation…
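To make that concrete, here’s how walking the tree works out for the geocoder response shown above, this time in Python (where square bracket indexing does the job of the dots); the only assumption is that the JSON has been saved to a local file:

import json

#Placeholder filename containing the geocoder response shown above
with open('geocoder_response.json') as f:
    data = json.load(f)

print(data['status'])                        # "OK"

#results is a list, so we take its first element before going any deeper
first = data['results'][0]
print(first['formatted_address'])            # "Milton Keynes, Buckinghamshire MK7 6AA, UK"
print(first['geometry']['location']['lat'])  # 52.0249136
print(first['geometry']['location']['lng'])  # -0.7097474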

PS Via my feeds today, though something I had bookmarked already, this Data Converter tool may be helpful in going the other way… (Disclaimer: I haven’t tried using it…)

If you know of any other related tools, please feel free to post a link to them in the comments:-)

Written by Tony Hirst

April 12, 2011 at 10:25 am

Posted in onlinejournalismblog, Tutorial, Uncourse


Gephi Bits 2: A Further Look at Comments on Social Objects in a Closed Community

In the previous post in this set (Gephi Bits 1: Comments on Social Objects in a Closed Community), I started having a play with comment and favourites data from a series of peer review activities in the OU course Design thinking: creativity for the 21st century.

In particular, I loaded simple pairwise CSV data directly into Gephi, relating comment id and favourite ids to photo ids. The resulting images provided a view over the photos that showed which photos were heavily commented and/or favourited. Towards the end of the post, I suggested it might be interesting to be able to distinguish between the comment and favorite nodes by colouring them somehow. So let’s start by seeing how we might achieve that…

The easiest way I can think of is to preload Gephi with a definition of each node and the assignment of a type label to each node – photo, comment or favourite. We can then partition – and colour – each node based on the type label.

To define the nodes and type labels, we can use a file defined using the GUESS .gdf format. In particular, we define the nodes as follows:

nodedef> name VARCHAR, ltype VARCHAR
p189, photo
p191, photo

c1428, comment
c1429, comment

f1005, fave
f1006, fave

Load this file into Gephi, and then append the contents of the comment-photo and favourite-photo CSV files to the graph. We can then colour the nodes (sized according to, err, something!) according to partition:

Coloured partitions in Gephi

If we filter the network for a particular photo using an ego filter, we can get a colour coded view of the comment and favourite IDs associated with that image:

Coloured nodes and labels in Gephi

What we’ve achieved so far is a way of exploring how heavily commented or favourited a photo is, as well as picking up a tip or two about labeling and colouring nodes. But what about if we wanted a person level analysis, where we could visually identify the individuals who had posted the most images, or whose images were most heavily commented upon and favourited?

To start with, let’s capture some information about each of the nodes. In the following example, we have an identifier (for a photo, favourite or comment), followed by a user id (the person who made the comment or favourite, or who uploaded the photo), and a type label (photo, comment or fave), which is captured in the ltype field.

nodedef> name VARCHAR, username VARCHAR, ltype VARCHAR
p189,jd342,photo
p191,jd342,photo
p192,pn43,photo
..
c1189,pd73,comment
c1190,srs22,comment
..
f46,ww66,fave
f47,ee79,fave

Rather than describing edges by connecting a comment or favourite ID to a photo ID, we can easily generate links of the form userID, photoID, where userID is the ID of the user making a comment or favouriting an image. However, it is also possible to annotate the edges to describe whether the link relates to a comment or a favouriting action. So for example:

edgedef> otherUser VARCHAR, photo VARCHAR, etype VARCHAR
pd73,p189,comment
srs22,p226,comment

ww66,p176,fave

Alternatively, we might just use the simpler format:
edgedef> otherUser VARCHAR, photo VARCHAR
pd73,p189
srs22,p226

ww66,p176

In this simpler case, we can just load in the node definition gdf file, and follow it by adding the actual graph edge data from CSV files, which is what I’ve done for what follows.
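As an aside, the node definition file and the edge CSV are easy enough to generate with a few lines of script rather than by hand; here’s a rough Python sketch, where the example records and their field names are made up purely for illustration:

import csv

#Hypothetical records pulled from the activity data
photos = [{'id': 'p189', 'user': 'jd342'}, {'id': 'p191', 'user': 'jd342'}]
comments = [{'id': 'c1428', 'user': 'pd73', 'photo': 'p189'}]
faves = [{'id': 'f1005', 'user': 'ww66', 'photo': 'p191'}]

#Node definitions: one typed row per photo/comment/fave
with open('nodes.gdf', 'w') as f:
    f.write('nodedef> name VARCHAR, username VARCHAR, ltype VARCHAR\n')
    for p in photos:
        f.write('%s,%s,photo\n' % (p['id'], p['user']))
    for c in comments:
        f.write('%s,%s,comment\n' % (c['id'], c['user']))
    for fv in faves:
        f.write('%s,%s,fave\n' % (fv['id'], fv['user']))

#Edges: one row per commenting/favouriting user -> photo
with open('edges.csv', 'w') as f:
    writer = csv.writer(f)
    for c in comments:
        writer.writerow([c['user'], c['photo']])
    for fv in faves:
        writer.writerow([fv['user'], fv['photo']])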

Firstly, here’s the partition colour palette:

Gephi - partition colours

The null entities relate to nodes that didn’t get an explicit node specification (i.e. the person nodes).

To provide a bit of flexibility over the graph, I loaded the favourites and comment edges in as directed edges from “Other user” to photo ID, where “Other user” is the user ID of the person making the comment or favourite.

If we size the graph by out-degree, we can look at which users are actively engaged in commenting on photos:

Gephi - who's commenting/favouriting

The size of the arrow depicts whether or not there are multiple edges going from one person to a photo, so we can see, for example, where someone has made multiple comments on the same photo.

If we size by in-degree, we can see which photos are popular:

Gephi - what photos are popular

If we run an ego filter over a photo id, we can see who commented on it.

However, what we would really like to be able to do is look at the connections between people via a photo (for example, to see who has favourited whose photos). If we add in another edge data file that links from a photo ID to a person ID (the person who uploaded the photo), we can start to explore these relationships.

NB the colour palette changes in what follows…

Having captured user to photo relationships based on commenting, favouriting or uploading behaviour, we can now do things like the following. Here for example is a use of a simple filter to see which of a user’s photos are popular:

Gephi - simple filter

If we run a simple ego filter, we can see the photos that a user has uploaded or commented on/favourited:

Gephi - ego filter

If we increase the depth to 2, we can see who else a particular user is connected to by virtue of a shared interest in the same photographs (I’m not sure what edge size relates to here…?):

Ego depth 2 in gephi - who connects to whom

Here, user ba49 is outsize because they uploaded a lot of the images that are shown. (The above graph shows linkage between ba49 and other users who either commented on/favourited one of ba49’s images, or who commented on/favourited a photo that ba49 also commented on/favourited.)

Doh – it appears I’ve crashed Gephi, so as it’s late, I’m going to stop for now! In the next post, I’ll show how we can further elaborate the nodes using extended user identifiers that designate the role a person is acting in (eg as a commenter, favouriter or photo uploader) to see what sorts of view this lets us take over the network.

Written by Tony Hirst

May 27, 2010 at 10:31 pm

Posted in Tutorial, Visualisation

