Category: Analytics

FutureLearn Data Doodles Notebook and a Reflection on unLearning Analytics

With LAK16 (this year’s Learning Analytics and Knowledge Conference) upon us, not that I’m there, I thought I’d post an (updated) notebook showing some of my latest data sketches’n’doodles around the data made available to FutureLearn partners arising from courses they have presented on FutureLearn. You can find the notebook in the notebooks folder of this GitHub repository: psychemedia/futurelearnStatsSketches.

Recalling the takedown notice I received for posting “unauthorised” screenshots around some of the data from a FutureLearn course I’d worked on, the notebook doesn’t actually demonstrate the analysis of any data at all. Instead, it provides a set of recipes that can be applied to your own FutureLearn course data to try to help you make sense of it.

In contrast to many learning analytics approaches, where the focus is on building models of learners so that you can adaptively force them to do particular things to make your metrics look better, the recipes here are about checking how the course materials themselves are being used.

My thinking hasn’t really moved on that much since my original take on course analytics in 2007, or in a presentation I gave in 2008 (Course Analytics in Context presentation) and it can (still) be summarised pretty much as follows:

Insofar as we are producers of online courses for delivery “at scale” (that is, to large numbers of learners), our first duty is to ensure that the course materials appear to be working. That is, we should regard the online materials in the same way as the publisher of any content focussed website, as pages that can be optimised in terms of their own performance against any expectations we place on them.

So, if a page has links on it, we should keep track of whether folk click on the links in the volumes we expect. If we expect a person to spend a certain amount of time on a page, we should be concerned if, en masse, they appear to be spending a much shorter or longer period of time on the page. In short, we should be catering to the mass behaviour of the visitors, to try to ensure that the page appears to be delivering (albeit at a surface level) the sort of experience we expect of it for the majority of visitors. (Unless the page has been designed to target a very particular audience, in which case we need to segment our visitor stats to ensure that for that particular audience, the page meets our expectations of it in terms of crude user engagement metrics.) This is not about dynamically trying to manage the flow of users through the course materials, it’s about making sure the static site content is behaving as we expect it to.
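As a rough illustration of the sort of check described above, here is a minimal Python sketch that flags pages where the typical (median) time on page is a long way from what the authors expected. The page names, the timings and the 50% tolerance are all made up for the example; none of them come from a real course or from the notebook.

```python
from statistics import median


def flag_pages(observed_seconds, expected_seconds, tolerance=0.5):
    """Flag pages whose median dwell time falls outside expected * (1 +/- tolerance).

    observed_seconds: dict mapping page name -> list of visit durations (seconds)
    expected_seconds: dict mapping page name -> duration the authors expected
    """
    flagged = {}
    for page, times in observed_seconds.items():
        expected = expected_seconds.get(page)
        if expected is None or not times:
            # No expectation recorded, or no visits: nothing to compare
            continue
        typical = median(times)
        low = expected * (1 - tolerance)
        high = expected * (1 + tolerance)
        if typical < low:
            flagged[page] = ("shorter", typical)
        elif typical > high:
            flagged[page] = ("longer", typical)
    return flagged


# Hypothetical data, for illustration only
observed = {
    "week1/intro": [280, 310, 295, 330],
    "week1/reading": [40, 55, 35, 60],
    "week2/exercise": [2400, 2600, 2100],
}
expected = {"week1/intro": 300, "week1/reading": 600, "week2/exercise": 900}

flags = flag_pages(observed, expected)
for page, (direction, typical) in flags.items():
    print(page, "median visit is", direction, "than expected:", typical, "seconds")
```

The median is used rather than the mean so that a handful of browser tabs left open overnight don’t swamp the en masse picture; that is a design choice for the sketch rather than a recommendation.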

This is possibly naive, and could be seen as showing a certain level of disinterest in users’ individual learning behaviours, but I think it reflects how we tend to write static materials. In the case of the OU, this tends to be with a very strong, single narrative line, almost as if the materials were presented as a set of short books. I suspect that material that is intended to be dynamically served up in response to an algorithm’s perceived model of the user needs to be authored differently, using chunks that can be connected in ways that allow for multiple narrative pathways through them.

In certain respects, this is a complementary approach to learning design, where educators are encouraged in advance of writing a course to identify various sorts of structural activity that I suspect LD advocates would then like to see being used as the template for an automated course production process; templated steps conforming to the structural design elements could then be dropped into the academic workflow for authors to fill out. (At the same time, my experience of authoring workflows for online material delivery is that they generally suck, despite my best efforts…. See also: here.)


The notebook is presented as a Jupyter notebook with code written using Python3. It requires pandas and seaborn but no other special libraries and should work on a variety of notebook hosting workbenches (see for example Seven Ways of Running IPython / Jupyter Notebooks). I’ve also tested it against the psychemedia/ou-tm351-pystack container image on Github, which is a clone of the Jupyter set-up we’re using in the current presentation of the OU course TM351 Data management and analysis. My original FutureLearn data analysis notebook only used techniques developed in the FutureLearn course Learn to Code for Data Analysis, but the current one goes a little bit further than that…

The notebook includes recipes that analyse all four FutureLearn data file types, both individually and in combination with each other. It also demonstrates a few interactive widgets. Aside from a few settings (identifying the location and name of the data files), and providing some key information such as course enrolment opening date and start date, and any distinguished steps (specific social activity or exercise steps, for example, that you may want to highlight), the analyses should all run themselves. (At least, they ran okay on the dataset I had access to. If you run them and get errors, please send me a copy of any error messages and/or fixes you come up with.) All the code is provided though, so if you want to edit or otherwise play with any of it, you are free to do so. The code is provided without warranty and may not actually do what I claim for it (though if you find any such examples, please let me know).

The notebook and code are licensed as attribution required works. I thought about an additional clause expressly forbidding commercial use or financial gain from the content by FutureLearn Ltd, but on reflection I thought that might get me into more grief than it was worth!;-) (It could also come over as perhaps a bit arrogant and I’m not sure the notebooks have anything that novel or interesting in them; they’re more of a travel diary that records my initial meanderings around the data, as well as a few sketches and doodles as I tried to work out how to wrangle the data.)

As I was about to post the notebooks, I happened to come across a report of a recent investigation on the financial flows in academic publishing (It’s time to stand up to greedy academic publishers) which raises questions about “the issue of how research is communicated in society … that cut to the heart of what academics do, and what academia is about”; this resonated with a couple of recent quotes from Downes that made me smile, an off-the-cuff remark from Martin Weller last week mooting whether there was – or wasn’t – a book around the idea of guerrilla research as compared to digital scholarship, and the observation that I wasn’t the only person who took holiday and covered my own expenses to attend the OER conference in Edinburgh last week (though I am most grateful to the organisers for letting me in and giving me the opportunity to bounce a few geeky ideas around with Jim Groom, Brian Lamb, Grant Potter, Martin Hawksey and David Kernohan that I still need to think through. I need to start pondering the data driven stand-up and panel show games too…!:-)

Arising from that confused melee of ideas around what I guess is the economics of gonzo academia, I decided to post a version of the notebook on Leanpub as the first part of a possible work in progress: Course Analytics – Wrangling FutureLearn Data With Python and R. (I’ve been pondering a version of the notebook recast as an R shiny app, a Jupyter dashboard, and an RMarkdown report, and I think that title will accommodate such ramblings under the same cover.) So if you find value in the notebook and feel as if you should pay for it, you now have an opportunity to do so. (Any monies generated will be used to cover costs of activities related to the topic of the work, along with the progression and dissemination of ideas related to it. Receipts and expenditure arising therefrom will be itemised in full in the repository.) And if you don’t think it’s worth anything, the book is flexibly priced with a starting price of free.

PS the notebook is a Jupyter notebook, as used in the OU/FutureLearn course Learn to Code for Data Analysis. My original FutureLearn data analysis notebook used only techniques developed in that course, although the current version uses a few more, including interactive widgets that let you analyse the data interactively within the notebook. If you need any further reasons as to why you should take the course, here’s a marketing pitch…

Teaching Material Analytics

A couple of weeks ago, I had a little poke around some of the standard reports that we can get out of the OU VLE. OU course materials are generated from a structured document format – OU XML – that generates one or more HTML pages bound to a particular Moodle resource id. Additional Moodle resources are associated with forums, admin pages, library resource pages, and so on.

One of the standard reports provides a count of how many times each resource has been accessed within a given time period, such as a weekly block. Data can only be exported for so many weeks at a time, so to get stats for course materials over the presentation of a course (which may be up to 9 months long) requires multiple exports and the aggregation of the data.

We can then generate simple visual summaries over the data such as the following heatmap.


Usage is indicated by colour density, and time in weeks is organised along the horizontal x-axis. From the chart, we can clearly see waves of activity over the course of the module as students access resources associated with particular study weeks. We can also see when materials aren’t being accessed, or are only being accessed a low number of times (that is, necessarily by a low proportion of students). If we get data about unique user accesses or unique user first use activity, we can get a better idea about the proportion of students in a cohort as a whole accessing a resource.
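By way of a sketch of the aggregation step, the following Python combines several partial exports into a single resource-by-week table of the sort that could sit behind a heatmap like the one described above. It assumes each export has already been read into rows of (resource id, week number, access count); the resource ids and counts are invented, and the real VLE report format is not reproduced here. Where two exports overlap on the same resource and week, the sketch keeps the larger count rather than adding them, which is an assumption about how overlapping exports should be treated.

```python
from collections import defaultdict


def aggregate_exports(exports):
    """Combine several exports of (resource_id, week, count) rows into one table."""
    table = defaultdict(dict)
    for export in exports:
        for resource_id, week, count in export:
            # Overlapping exports may repeat a cell: keep the larger value
            # instead of double counting (an assumption, see the text above)
            previous = table[resource_id].get(week, 0)
            table[resource_id][week] = max(previous, count)
    return table


def to_matrix(table):
    """Return (resources, weeks, matrix) with one row per resource, one column per week."""
    weeks = sorted({week for row in table.values() for week in row})
    resources = sorted(table)
    matrix = [[table[r].get(w, 0) for w in weeks] for r in resources]
    return resources, weeks, matrix


# Hypothetical exports covering overlapping blocks of weeks
export_a = [("res101", 1, 120), ("res101", 2, 30), ("res102", 2, 95)]
export_b = [("res101", 2, 30), ("res102", 3, 12), ("res103", 3, 80)]

resources, weeks, matrix = to_matrix(aggregate_exports([export_a, export_b]))
for resource, row in zip(resources, matrix):
    print(resource, row)
```

The resulting list of lists can be handed to whatever heatmap plotting function is to hand; rows that are all zeros, or nearly so, are the resources that aren’t being accessed.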

This sort of reporting – about material usage rather than student attainment – was what originally attracted me to thinking about data in the context of OU courses (eg Course Analytics in Context). That is, I wasn’t that interested in how well students were doing, per se, or interested in trying to find ways of spying on individual students to build clever algorithms behind experimental personalisation and recommender systems that would never make it out of the research context.

That could come later.

What I originally just wanted to know was whether this resource was ever looked at, whether that resource was accessed when I expected (eg if an end of course assessment page was accessed when students were prompted to start thinking about it during an exercise two thirds of the way in to the course), whether students tended to study for half an hour or three hours (so I could design the materials accordingly), how (and when) students searched the course materials – and for what (keyphrase searches copied wholesale out of the continuous assessment materials) and so on.

Nothing very personal in there – everything aggregate. Nothing about students, particularly, everything about course materials. As a member of the course team, asking how are the course materials working rather than how is that student performing?

There’s nothing very clever about this – it’s just basic web stats run with an eye to looking for patterns of behaviour over the life of a course to check that the materials appear to be being worked in the way we expected. (At the OU, course team members are often a step removed from supporting students.)

But what it is, I think, is an important complement to the “student centred” learning analytics. It’s analytics about the usage and utilisation of the course materials, the things we actually spend a couple of years developing but don’t really seem to track the performance of?

It’s data that can be used to inform and check on “learning designs”. Stats that act as indicators about whether the design is being followed – that is, used as expected, or planned.

As a course material designer, I may want to know how well students perform based on how they engage with the materials, but I should really want to know how the materials are being utilised, because they’re designed to be utilised in a particular way? And if they’re not being used in that way, maybe I need to have a rethink?

MOOC Platforms and the A/B Testing of Course Materials

[The following is my *personal* opinion only. I know as much about FutureLearn as Google does. Much of the substance of this post was circulated internally within the OU prior to posting here.]

In common with other MOOC platforms, one of the possible ways of positioning FutureLearn is as a marketing platform for universities. Another might see it as a tool for delivering informal versions of courses to learners who are not currently registered with a particular institution. [A third might position it in some way around the notion of “learning analytics”, eg as described in a post today by Simon Buckingham Shum: The emerging MOOC data/analytics ecosystem] If I understand it correctly, “quality of the learning experience” will be at the heart of the FutureLearn offering. But what of innovation? In the same way that there is often a “public benefit feelgood” effect for participants in medical trials, could FutureLearn provide a way of engaging, at least to a limited extent, in “learning trials”?

This need not be onerous, but could simply relate to trialling different exercises or wording or media use (video vs image vs interactive) in particular parts of a course. In the same way that Google may be running dozens of different experiments on its homepage in different combinations at any one time, could FutureLearn provide universities with a platform for trying out differing learning experiments whilst running their MOOCs?

The platform need not be too complex – at first. Google Analytics provides a mechanism for running A/B tests and “experiments” across users who have not disabled Google Analytics cookies, and as such may be appropriate for initial trialling of learning content A/B tests. The aim? Deciding on metrics is likely to prove a challenge, but we could start with simple things to try out – does the ordering or wording of resource lists affect click-through or download rates for linked resources, for example? (And what should we do about those links that never get clicked and those resources that are never downloaded?) Does offering a worked through exercise before an interactive quiz improve success rates on the quiz, and so on.
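To make the click-through example a little more concrete, here is a minimal Python sketch of one common way of comparing two variants of a page: a two-proportion z-test on the click-through rates. It is a sketch under stated assumptions rather than anything FutureLearn or Google Analytics provides; the counts are invented, and the test assumes learners were allocated to variants at random and that the numbers are large enough for the normal approximation to be reasonable.

```python
from math import erf, sqrt


def two_proportion_z_test(clicks_a, views_a, clicks_b, views_b):
    """Compare click-through rates for variants A and B.

    Returns (rate_a, rate_b, z, p_value), where p_value is two-sided.
    """
    rate_a = clicks_a / views_a
    rate_b = clicks_b / views_b
    # Pooled rate under the null hypothesis that the variants perform the same
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    standard_error = sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    if standard_error == 0:
        raise ValueError("No variation in outcomes, so the test is undefined")
    z = (rate_b - rate_a) / standard_error
    # Standard normal CDF written in terms of the error function
    cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    p_value = 2 * (1 - cdf)
    return rate_a, rate_b, z, p_value


# Hypothetical counts: the same resource list presented in two orderings
rate_a, rate_b, z, p_value = two_proportion_z_test(120, 1000, 150, 1000)
print("A: {:.1%}  B: {:.1%}  z = {:.2f}  p = {:.3f}".format(rate_a, rate_b, z, p_value))
```

Deciding in advance which metric counts, and how many learners need to see each variant, matters rather more than the arithmetic.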

The OU has traditionally been cautious when running learning experiments, delivering fee-waived pilots rather than testing innovations as part of A/B testing on live courses with large populations. In part this may be through a desire to be ‘equitable’ and not jeopardise the learning experience for any particular student by providing them with a lesser quality offering than we could*. (At the same time, the OU celebrates the diversity and range of skills and abilities of OU students, which makes treating them all in exactly the same way seem rather incongruous?)

* Medical trials face similar challenges. But it must be remembered that we wouldn’t trial a resource we thought stood a good chance of being /less/ effective than one we were already running… For a brief overview of the broken worlds of medical trials and medical academic publishing, as well as how they could operate, see Ben Goldacre’s Bad Pharma.

FutureLearn could start to change that, and open up a pathway for experimentally testing innovations in online learning as well as at a more micro-level, tuning images and text in order to optimise content for its anticipated use. By providing course publishers with a means of trialling slightly different versions of their course materials, FutureLearn could provide an effective environment for trialling e-learning innovations. Branding FutureLearn not only as a platform for quality learning, but also as a platform for “doing” innovation in learning, gives it a unique point of difference. Organisations trialling on the platform do not face the threat of challenges made about them delivering different learning experiences to students on formally offered courses, but participants in courses are made aware that they may be presented with slightly different variants of the course materials to each other. (Or they aren’t told… if an experiment is based on success in reading a diagram where the labels are presented in different fonts or slightly different positions, or with or without arrows, and so on, does that really matter if the students aren’t told?)

Consultancy opportunities are also likely to arise in the design and analysis of trials and new interventions. The OU is also provided with both an opportunity to act according to its beacon status as far as communicating innovative adult online learning/pedagogy goes, as well as gaining access to large trial populations.

Note that what I’m proposing is not some sort of magical, shiny learning analytics dashboard; it’d be a procedural, could-have-been-doing-it-for-years application of web analytics that makes use of online learning cohorts that are at least an order of magnitude or two larger than is typical in a traditional university course setting. Numbers that are maybe big enough to spot patterns of behaviour in (either positive, or avoidant).

There are ethical challenges and educational challenges in following such a course of action, of course. But in the same way that doctors might randomly prescribe between two equally good (as far as they know) treatments, or who systematically use one particular treatment over another that is equally good, I know that folk who create learning materials also pick particular pedagogical treatments “just because”. So why shouldn’t we start trialling on a platform that is branded as such?

Once again, note that I am not part of the FutureLearn project team and my knowledge of it is largely limited to what I have found on Google.

See also: Treating MOOC Platforms as Websites to be Optimised, Pure and Simple…. For some very old “course analytics” ideas about using Google Analytics, see Online Course Analytics, which resulted in OUseful blogarchive: “course analytics”. Note that these experiments never got as far as content optimisation, A/B testing, search log analysis etc. The approach I started to follow with the Library Analytics series had a little more success, but still never really got past the starting post and into a useful analyse/adapt cycle. Google Analytics has moved on since then of course… If I were to start over, I’d probably focus on creating custom dashboards to illustrate very particular use cases, as well as REDACTED.

Open Webstats from GovUK

In the #solo12eval session* on Monday organised by Shane McCracken and Karen Bultitude on the topic of evaluating impact (whatever that is) of online science engagement, I was reminded (yet again…) of the Culture24 report on Evaluating Impact online. The report included as one of its deliverables a set of example Google Analytics report templates (now rotted?) that provided a starting point for what could be a commonly-accepted-as-sensible reporting framework. (I keep wondering whether it would be useful to try to do the same for academic library websites/library website analytics?) One of the things I pondered afterwards was whether it would make sense to open up Google Analytics from a ‘typical’ website in that sector to all-comers, so that different parties could demonstrate what stories and information they could pull out of the stats using a common data basis. Something a bit like CSS Zen Garden, but around a common Google Analytics dataset, for example?

* From the session, I also learned of the JISC Impact Analysis Programme, which includes an interestingly titled project on Tracking Digital Impact (TDI). That project is presumably in stealth mode, because it was really hard to find out anything about it… (I thought JISC projects were all encouraged to do the blogging thing? Or is that just from certain enlightened parts of JISC…?)

Loosely related to the workshop, and from my feeds, I noticed a couple of announcements over the last couple of days relating to the publication of web/traffic stats on a couple of government web properties.

First up, the Government Digital Service/@gdsteam posted on their Updat[ed] GOV.UK Performance Dashboard, which you can find: Performance Platform Dashboard.

As you can see, this dashboard reports on a variety of Google Analytics stats – average unique visitors, weekly pageviews, and so on.

As well as the dashboard itself, the @gds_datashark team seem to be quite happy to show their working and presumably allow others to propose check-ins of their own bug fixes and code solutions to .. Gov github

To make it easy to play along, they’re publishing a set of raw data feeds (Headline narrative text, Yesterday’s hourly traffic and comparison average, Weekly visits to GOV.UK, Direct Gov and Businesslink, Weekly unique visitors to GOV.UK, Direct Gov and Businesslink, Format success metrics) although the blog post notes these are ‘internal’ URLs and hence are subject to change…

(Via tweets from @jukesie and @lseteph, I was also reminded that Steph experimented with publishing BIS’ departmental webstats way back when)

In the past, UKGov has posted a certain amount of costings related data around website provision (for example, So Where Do the Numbers in Government Reports Come From?), so if there are any armchair web analysts/auditors out there (unlikely, I know;-), it seems as if data is there for the taking, as well as the asking (the GDS folk seem to be quite open to ideas…)

The second announcement that caught my eye was the opening up of site usage stats on the website.

Data is broken down into site-wide, publisher and datasets groupings, and reports on things like:

– browser type
– O/S type
– social network referrals
– language
– country

The data is also available via a CSV file.

So I wonder: could we use the GDS and data/data feeds as the basis for a crude webstats Zen Garden? How would such a site best be architected? (One central github repo pulling in exemplar view requests from cloned repos?) And would it make sense to publish webstats data/analytics from a “typical” science engagement website (or library website, or course website), and allow the community to see what sorts of take on it folk can come up with in respect of different ways of presenting the data and, more importantly, identifying different ways of making sense of it/finding different ways of telling stories with it?

Do Retweeters Lack Commitment to a Hashtag?

I seem to be going down more ratholes than usual at the moment, in this case relating to activity round Twitter hashtags. Here’s a quick bit of reflection around a chart from Visualising Activity Around a Twitter Hashtag or Search Term Using R that shows activity around a hashtag that was minted for an event that took place before the sample period.

The y-axis is organised according to the time of first use (within the sample period) of the tag by a particular user. The x axis is time. The dots represent tweets containing the hashtag, coloured blue by default, red if they are an old-style RT (i.e. they begin RT @username:).

So what sorts of thing might we look for in this chart, and what are the problems with it? Several things jump out at me:

  • For many of the users, their first tweet (in this sample period at least) is an RT; that is, they are brought into the hashtag community through issuing an RT;
  • Many of the users whose first use is via an RT don’t use the hashtag again within the sample period. Is this typical? Does this signal represent amplification of the tag without any real sense of engagement with it?
  • A noticeable proportion of folk whose first use is not an RT go on to post further non-RT tweets. Does this represent an ongoing commitment to the tag? Note that this chart does not show whether tweets are replies, or “open” tweets. Replies (that is, tweets beginning @username) are likely to represent conversational threads within a tag context rather than “general” tag usage, so it would be worth using an additional colour to identify reply based conversational tweets as such.
  • “New style” retweets are not displayed as retweets by colouring… I need to check whether or not new-style RT information is available that I could use to colour such tweets appropriately. (Or alternatively, I’d have to do some sort of string matching to see whether or not a tweet was the same as a previously seen tweet, which is a bit of a pain:-( )
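The old-style RT and reply distinctions mentioned in the list above are easy enough to sketch. The following Python is illustrative only (the chart itself was generated in R): it labels a tweet as an old-style RT, a reply, or an “open” tweet purely from the way its text starts, which is the same heuristic the bullets describe. New-style retweets are not detected, and the example tweets are made up.

```python
import re

# Old-style RTs begin "RT @username"; replies begin "@username"
OLD_STYLE_RT = re.compile(r"^RT @(\w+)")
REPLY = re.compile(r"^@(\w+)")


def classify_tweet(text):
    """Return 'RT', 'reply' or 'open' based on how the tweet text starts."""
    if OLD_STYLE_RT.match(text):
        return "RT"
    if REPLY.match(text):
        return "reply"
    return "open"


# Made-up example tweets
examples = [
    "RT @alice: Slides from my talk #tag",
    "@bob thanks for the pointer #tag",
    "Just arrived at the venue #tag",
]
labels = [classify_tweet(text) for text in examples]
print(labels)
```

The labels could then be mapped onto three colours in the chart rather than two.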

(Note that when I started mapping hashtag communities, I used to generate tag user names based on a filtered list of tweets that excluded RTs. This meant that folk who only used the tag as part of an RT and did not originate tweets that contained the tag, either in general or as part of a conversation, would not be counted as a member of the hashtag community. More recently, I have added filters that include RTs but exclude users who used the tag only once, for example, thus retaining serial RTers, but not single use users.)
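The two membership filters described in the note above can be sketched as follows. This is a Python illustration under my own assumptions, not the code used for the hashtag maps: tweets are represented as (username, text) pairs, an RT is anything starting "RT @", and the usernames and tweets are invented.

```python
from collections import Counter


def originators(tweets):
    """Earlier filter: users who posted at least one tagged tweet that was not an RT."""
    return {user for user, text in tweets if not text.startswith("RT @")}


def repeat_users(tweets, min_uses=2):
    """Later filter: users who used the tag at least min_uses times, RTs included."""
    counts = Counter(user for user, _ in tweets)
    return {user for user, n in counts.items() if n >= min_uses}


# Made-up sample of tagged tweets as (username, text) pairs
tweets = [
    ("alice", "Great talk #tag"),
    ("bob", "RT @alice: Great talk #tag"),
    ("carol", "RT @alice: Great talk #tag"),
    ("carol", "RT @dave: Slides here #tag"),
    ("dave", "Slides here #tag"),
]

print(sorted(originators(tweets)))   # users who originated tagged tweets
print(sorted(repeat_users(tweets)))  # serial users, including serial RTers
```

In this sample the first filter drops both bob and carol, while the second keeps carol as a serial RTer but drops everyone who only used the tag once.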

So what else might this chart tell us? Looking at vertical slices, it seems that new entrants to the tag community appear to come in waves, maybe as part of rapid fire RT bursts. This chart doesn’t tell us for sure that this is happening, but it does highlight areas of the timeline that might be worth investigating more closely if we are interested in what happened at those times when there does appear to be a spike in activity. (Are there any modifications we could make to this chart to make it more informative in this respect? The time resolution is very poor, for example, so being able to zoom in on a particular time might be handy. Or are there other charts that might provide a different lens that can help us see what was happening at those times?)
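One way of checking the “waves of new entrants” impression without squinting at the chart is to count first-time users per time bin. The Python below is a minimal sketch of that idea, assuming tweets are available as (username, timestamp) pairs; the ten minute bin width and the sample data are arbitrary choices for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta


def new_entrants_per_bin(tweets, bin_minutes=10):
    """Count users making their first (sampled) use of the tag in each time bin.

    tweets: iterable of (username, created) pairs, created being a datetime.
    Returns a list of (bin_start, count) for bins with at least one new entrant.
    """
    first_seen = {}
    for user, created in tweets:
        if user not in first_seen or created < first_seen[user]:
            first_seen[user] = created
    if not first_seen:
        return []
    start = min(first_seen.values())
    width = timedelta(minutes=bin_minutes)
    # Integer division of timedeltas gives the index of the bin
    counts = Counter((created - start) // width for created in first_seen.values())
    return [(start + index * width, counts[index]) for index in sorted(counts)]


# Made-up sample data
t0 = datetime(2012, 1, 1, 12, 0)
sample = [
    ("alice", t0),
    ("bob", t0 + timedelta(minutes=3)),
    ("alice", t0 + timedelta(minutes=5)),
    ("carol", t0 + timedelta(minutes=25)),
    ("dave", t0 + timedelta(minutes=27)),
    ("bob", t0 + timedelta(minutes=40)),
]
entrants = new_entrants_per_bin(sample)
for bin_start, count in entrants:
    print(bin_start.strftime("%H:%M"), count)
```

Bins with unusually high counts are the places in the timeline that might be worth a closer look; shrinking bin_minutes is a crude form of zooming in.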

And as a final point – this stuff may be all very interesting, but is it useful? And if so, how? I also wonder how generalisable it is to other sorts of communication analysis. For example, I think we could use similar graphical techniques to explore engagement with an active comment thread on a blog, or Google+, or additions to an online forum thread. (For forums with multiple threads, we maybe need to rethink how this sort of chart would work, or how it might be coloured/what symbols we might use, to distinguish between starting a new thread, or adding to a pre-existing one, for example. I’m sure the literature is filled with dozens of examples for how we might visualise forum activity, so if you know of any good references/links…?! ;-) #lazyacademic)

Visualising Activity Around a Twitter Hashtag or Search Term Using R

I think one of the valid criticisms around a lot of the visualisations I post here and on my various #f1datajunkie blogs is that I often don’t post any explanatory context around the visualisations. This is partly a result of the way I use my blog posts in a selfish way to document the evolution of my own practice, but not necessarily the “so what” elements that represent any meaning or sense I take from the visualisations. In many cases, this is because the understanding I come to of a dataset is typically the result of an (inter)active exploration of the data set; what I blog are the pieces of the puzzle that show how I personally set about developing a conversation with a dataset, pieces that you can try out if you want to…;-)

An approach that might get me more readers would be to post commentary around what I’ve learned about a dataset from having a conversation with it. A good example of this can be seen in @mediaczar’s post on How should Page Admins deal with Flame Wars?, where this visualisation of activity around a Facebook post is analysed in terms of effective (or not!) strategies for moderating a flame war.

@mediaczar visualisation of engagement around facebook flamewars

The chart shows a sequential ordering of posts in the order they were made along the x-axis, and the unique individual responsible for each post, ordered by accession to the debate along the y-axis. For interpretation and commentary, see the original post: How should Page Admins deal with Flame Wars? ;-)

One take away of the chart for me is that it provides a great snapshot of new people entering into a conversation (vertical lines) as well as engagement by an individual (horizontal lines). If we use a time proportional axis on x, we can also see engagement over time.

In a Twitter context, it’s likely that a rapid increase in numbers of folk engaging with a hashtag, for example, might be the result of an RT related burst of activity. For folk who have already engaged in hashtag usage, for example as part of a live event backchannel, a large number of near co-occurring tweets that are not RTs might signal some notable happenstance within the event.

To explore this idea, here’s a quick bit of R tooling inspired by Mat’s post… It uses the twitteR library and sources tweets via a Twitter search.

require(twitteR)
#(Recent versions of twitteR need authenticating first, eg via setup_twitter_oauth())
#Pull in a search around a hashtag.
#The search term here is a placeholder - substitute your own hashtag or search term
searchTerm <- '#yourhashtag'
rdmTweets <- searchTwitter(searchTerm, n=500)
# Note that the Twitter search API only goes back 1500 tweets

#Plot of tweet behaviour by user over time
#Based on @mediaczar's plot
#Make use of a handy dataframe creating twitteR helper function
tw.df=twListToDF(rdmTweets)
#@mediaczar's plot uses a list of users ordered by accession to user list
## 1) find earliest tweet in searchlist for each user
require(plyr)
tw.dfx=ddply(tw.df, .var = "screenName", .fun = function(x) {return(subset(x, created %in% min(created),select=c(screenName,created)))})
## 2) arrange the users in accession order
tw.dfxa=arrange(tw.dfx,-desc(created))
## 3) Use the username accession order to order the screenName factors in the searchlist
tw.df$screenName=factor(tw.df$screenName, levels = tw.dfxa$screenName)
#ggplot seems to be able to cope with time typed values...
require(ggplot2)
ggplot(tw.df)+geom_point(aes(x=created,y=screenName))

We can get a feeling for which occurrences were old-style RTs by identifying tweets that start with a classic RT, and then colouring each tweet appropriately (note there may be some overplotting/masking of points…I’m not sure how big the x-axis time bins are…)

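#Identify and colour the RTs...
library(stringr)
library(ggplot2)
#A helper function to remove @ symbols from user names...
trim <- function (x) sub('@','',x)
#Identify classic style RTs
tw.df$rt=sapply(tw.df$text,function(tweet) trim(str_match(tweet,"^RT (@[[:alnum:]_]*)")[2]))
#Label each tweet as a plain tweet ('T') or a classic retweet ('RT')
tw.df$rtt=sapply(tw.df$rt,function(rt) if (is.na(rt)) 'T' else 'RT')
ggplot(tw.df)+geom_point(aes(x=created,y=screenName,col=rtt))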

So now we can see when folk entered into the hashtag community via a classic RT.

We can also start to explore who was classically retweeted when:

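#Generate a plot showing how a person is RTd
tw.df$rtof=sapply(tw.df$text,function(tweet) trim(str_match(tweet,"^RT (@[[:alnum:]_]*)")[2]))
#Only plot the classic RTs, against the name of the person who was RTd
ggplot(subset(tw.df,subset=(!is.na(rtof))))+geom_point(aes(x=created,y=rtof))
#Note that this doesn't show how many RTs each person got in a given time period if they got more than one...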

Another view might show who was classically RTd by whom (activity along a row indicating someone was retweeted a lot through one or more tweets, activity within a column identifying an individual who RTs a lot…):

#We can start to get a feel for who RTs whom...
#We don't want to display screenNames of folk who tweeted but didn't RT
tw.df.rt=droplevels(subset(tw.df,subset=(!is.na(rtof))))
#Order the screennames of folk who did RT by accession order (ie order in which they RTd)
tw.df.rta=arrange(ddply(tw.df.rt, .var = "screenName", .fun = function(x) {return(subset(x, created %in% min(created),select=c(screenName,created)))}),-desc(created))
tw.df.rt$screenName=factor(tw.df.rt$screenName, levels = tw.df.rta$screenName)
# Plot who RTd whom
# (theme()/element_text() replace the opts()/theme_text() calls of older ggplot2 versions)
ggplot(tw.df.rt)+geom_point(aes(x=screenName,y=rtof))+theme(axis.text.x=element_text(angle=-90,size=6)) + xlab(NULL)
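The scatterplots above show that someone was RTd, but not how often. As a complement, here is a small Python sketch (rather than R, and not part of the original tooling) that counts how many times each person classically retweeted each other person, and how many classic RTs each person received overall. It assumes tweets are available as (username, text) pairs and uses invented data.

```python
import re
from collections import Counter

# Same heuristic as the R code: a classic RT starts "RT @username"
RT_PATTERN = re.compile(r"^RT @(\w+)")


def rt_pairs(tweets):
    """Count (retweeter, retweeted) pairs among old-style RTs."""
    pairs = Counter()
    for user, text in tweets:
        match = RT_PATTERN.match(text)
        if match:
            pairs[(user, match.group(1))] += 1
    return pairs


def times_retweeted(pairs):
    """Total number of classic RTs received by each person."""
    totals = Counter()
    for (_, retweeted), n in pairs.items():
        totals[retweeted] += n
    return totals


# Made-up sample of tagged tweets
tweets = [
    ("alice", "Great talk #tag"),
    ("bob", "RT @alice: Great talk #tag"),
    ("carol", "RT @alice: Great talk #tag"),
    ("carol", "RT @dave: Slides here #tag"),
    ("dave", "Slides here #tag"),
]

pairs = rt_pairs(tweets)
totals = times_retweeted(pairs)
print(dict(pairs))
print(dict(totals))
```

The pair counts could be used to size or colour the points in the who-RTd-whom chart.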

What sense you might make of all this, or where to take it next, is down to you of course… Err, erm…?! ;-)


Social Interest Positioning – Visualising Facebook Friends’ Likes With Data Grabbed Using Google Refine

What do my Facebook friends have in common in terms of the things they have Liked, or in terms of their music or movie preferences? (And does this say anything about me?!) Here’s a recipe for visualising that data…

After discovering via Martin Hawksey that the recent (December, 2011) 2.5 release of Google Refine allows you to import JSON and XML feeds to bootstrap a new project, I wondered whether it would be able to pull in data from the Facebook API if I was logged in to Facebook (Google Refine does run in the browser after all…)

Looking through the Facebook API documentation whilst logged in to Facebook, it’s easy enough to find exemplar links to things like your friends list or the list of likes someone has made; replacing me in those links with the Facebook ID of one of your friends should pull down a list of their friends, or likes, etc.

(Note that validity of the access token is time limited, so you can’t grab a copy of the access token and hope to use the same one day after day.)

Grabbing the link to your friends on Facebook is simply a case of opening a new project, choosing to get the data from a Web Address, and then pasting in the friends list URL:

Google Refine - import Facebook friends list

Click on next, and Google Refine will download the data, which you can then parse as a JSON file, and from which you can identify individual record types:

Google Refine - import Facebook friends

If you click the highlighted selection, you should see the data that will be used to create your project:

Google Refine - click to view the data

You can now click on Create Project to start working on the data – the first thing I do is tidy up the column names:

Google Refine - rename columns

We can now work some magic – such as pulling in the Likes our friends have made. To do this, we need to create the URL for each friend’s Likes using their Facebook ID, and then pull the data down. We can use Google Refine to harvest this data for us by creating a new column containing the data pulled in from a URL built around the value of each cell in another column:

Google Refine - new column from URL

The Likes URL follows the same pattern as the exemplar links, with the friend’s Facebook ID in place of me, which we’ll tinker with as follows:

Google Refine - crafting URLs for new column creation

The throttle control tells Refine how long to wait between calls. I set this to 500ms (that is, half a second), so it takes a few minutes to pull in my couple of hundred or so friends (I don’t use Facebook a lot;-). I’m not sure what limit the Facebook API is happy with: if you hit it too fast (i.e. set the throttle time too low), you may find the Facebook API stops returning data to you for a cooling down period…
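If you wanted to do the same sort of throttled harvesting outside Refine, the pattern is roughly as follows. This is a minimal sketch under stated assumptions: fetch_throttled, demo_url and demo_fetch are names invented here, the URL shape is a placeholder rather than the real Facebook API address, and the fetch step is faked so the example runs without touching the network.

```python
import time

def fetch_throttled(ids, build_url, fetch, delay=0.5, sleep=time.sleep):
    """Fetch one URL per id, pausing `delay` seconds between calls."""
    results = {}
    for n, fid in enumerate(ids):
        if n > 0:
            sleep(delay)  # wait between calls, not before the first one
        results[fid] = fetch(build_url(fid))
    return results

def demo_url(fid):
    # Hypothetical URL shape, standing in for the real Likes URL
    return "https://example.invalid/" + str(fid) + "/likes"

def demo_fetch(url):
    # Pretend response body; a real version would make an HTTP request here
    return '{"data": []}'

pauses = []  # record the requested pauses instead of actually sleeping
demo = fetch_throttled(["101", "102", "103"], demo_url, demo_fetch, sleep=pauses.append)
```

Passing the sleep function in as an argument is just a convenience for trying the sketch out quickly; leave it at its default and the loop really will wait half a second between calls.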

Having imported the data, you should find a new column:

Google Refine - new data imported

At this point, it is possible to generate a new column from each of the records/Likes in the imported data… in theory (or maybe not..). I found this caused Refine to hang though, so instead I exported the data using the default Templating… export format, which produces some sort of JSON output…

I then used this Python script to generate a two column data file where each row contained a (new) unique identifier for each friend and the name of one of their likes:

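import simplejson,csv

#fn is the file exported from Refine; both file names here are placeholders
fn='my-fb-friends-likes.txt'
writer=csv.writer(open('fb-friends-likes.csv','w',newline=''),quoting=csv.QUOTE_ALL)

data = simplejson.load(open(fn,'r'))
#Use the row number as a (new) unique identifier for each friend
for id,d in enumerate(data['rows'],start=1):
	#'interests' is the column name containing the Likes data
	#(assumed here to hold the fetched JSON as a string)
	interests=simplejson.loads(d['interests'])
	for i in interests['data']:
		print(str(id),i['name'],i['category'])
		writer.writerow([str(id),i['name']])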

[I think this R script, in answer to a related @mhawksey Stack Overflow question, also does the trick: R: Building a list from matching values in a data.frame]

I could then import this data into Gephi and use it to generate a network diagram of what they commonly liked:

Sketching common likes amongst my facebook friends
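Before loading the two column file into Gephi, it can be handy to get a quick feel for which likes are actually shared. Here’s a minimal sketch of that, assuming the (friend id, like) pairs have already been read in; the function name common_likes, the min_friends parameter and the sample data are all made up for illustration.

```python
from collections import defaultdict

def common_likes(pairs, min_friends=2):
    """Map each like to the set of friend ids who share it,
    keeping only likes held by at least min_friends friends."""
    by_like = defaultdict(set)
    for friend_id, like in pairs:
        by_like[like].add(friend_id)
    return {like: ids for like, ids in by_like.items() if len(ids) >= min_friends}

# Made-up (friend id, like) pairs in the same shape as the two column file
pairs = [
    ("1", "Gephi"), ("1", "R"),
    ("2", "R"), ("2", "Python"),
    ("3", "R"), ("3", "Gephi"),
]
shared = common_likes(pairs)
```

Likes that only one friend has made drop out, which is roughly what filtering out the degree-one like nodes would achieve in the network view.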

Rather than returning Likes, I could equally have pulled back lists of the movies, music or books they like, their own friends lists (permissions settings allowing), etc etc, and then generated friends’ interest maps on that basis.

[See also: Getting Started With The Gephi Network Visualisation App – My Facebook Network, Part I and how to visualise Google+ networks]

PS dropping out of Google Refine and into a Python script is a bit clunky, I have to admit. What would be nice would be to be able to do something like a “create new rows with new column from column” pattern that would let you set up an iterator through the contents of each of the cells in the column you want to generate the new column from, and for each pass of the iterator: 1) duplicate the original data row to create a new row; 2) add a new column; 3) populate the cell with the contents of the current iteration state. Or something like that…
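To make that wished-for pattern a little more concrete, here’s how it might look as a few lines of Python. This is a sketch of the idea only, not anything Refine offers: rows_from_column and the column names are hypothetical, and each cell in the source column is assumed to already hold a list of items.

```python
def rows_from_column(rows, source, target):
    """For each item in row[source], duplicate the row and
    add a new column, target, holding that item."""
    out = []
    for row in rows:
        for item in row[source]:
            new_row = dict(row)      # 1) duplicate the original data row
            new_row[target] = item   # 2) + 3) add the new column and populate it
            out.append(new_row)
    return out

# Made-up data: one row per friend, with a list of likes in one cell
friends = [
    {"id": 1, "likes": ["Gephi", "R"]},
    {"id": 2, "likes": ["R"]},
    {"id": 3, "likes": []},
]
exploded = rows_from_column(friends, "likes", "like")
id_like = [(r["id"], r["like"]) for r in exploded]
```

A row whose source cell is empty produces no new rows at all, which may or may not be what you’d want from a real implementation.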

PPS Related to the PS request, there is a sort of related feature in the 2.5 release of Google Refine that lets you merge data from across rows with a common key into a newly shaped data set: Key/value Columnize. Seeing this, it got me wondering what a fusion of Google Refine and RStudio might be like (or even just R support within Google Refine?)

PPPS this could be interesting – looks like you can test to see if a friendship exists given two Facebook user IDs.

PPPPS This paper in PNAS – Private traits and attributes are predictable from digital records of human behavior – by Kosinski et al. suggests it’s possible to profile people based on their Likes. It would be interesting to compare how robust that profiling is, compared to profiles based on the common Likes of a person’s followers, or the common likes of folk in the same Facebook groups as an individual?