OUseful.Info, the blog…

Trying to find useful things to do with emerging technologies in open education

Experimenting With iGraph – and a Hint Towards Ways of Measuring Engagement?

For fear of being left way behind as Martin Hawksey starts to get to grips with R (see for example how he’s using R to automate the annotation of Google Spreadsheets with calculations that don’t come readily or efficiently to hand in Google Spreadsheets itself), I thought I’d better try to get to grips with R’s igraph library…

So here’s a script that gives me some hints as to how to start migrating chunks of my clunky Python script into R, as well as some ideas about how to start reporting on the structure of hashtag communities in a graphical as well as stats analytical way.

require(igraph)

#load in a graph from a graphml file
g=read.graph('/Users/ajh59/code/twapps/newt/reports/scmvESP/scmvESP_2012-01-26-22-53-45/friends_outerfriendsdegree_X_25_25_X_esp.graphml',format='graphml')
summary(g)
#Vertices are obtained via V(g). The summary() tells us what attributes are available.
#So for example, inspect the label attribute
V(g)$label
#in and out degree counts for each (labelled) node/vertex
df=data.frame(name=V(g)$label,indegree=degree(g,mode='in'),outdegree=degree(g,mode='out'))

#inspect the top 10 nodes sorted by indegree
#the plyr arrange function makes sorting dataframes a doddle...
require(plyr)
df2=head(arrange(df,desc(indegree)),10)
df2

#get ready to do some plots
require(ggplot2)

#It might be interesting to look at the in-degree and out-degree distributions
#out-degree, because we see how promiscuous folk are in their following behaviour
#h/t to @mhawksey for pointing out the mode argument to me.. doh!
ddout=degree.distribution(g,mode='out')
#degree.distribution() "a numeric vector of the same length as the maximum degree plus one. The first element is the relative frequency zero degree vertices, the second vertices with degree one,etc." 
#We can use the vector vals as the y-value, but x is unspecified/implied by the position in the vector
#So we need to generate the x vals explicitly; the first element is degree 0, so x starts at 0
ggplot()+geom_point(aes(seq_along(ddout)-1,ddout))
#If we want to ignore the outdegree==0 value, we can skip the first item in the vector
ggplot()+geom_point(aes(seq_len(length(ddout)-1),ddout[-1]))

#in-degree
ddin=degree.distribution(g,mode='in')
#as before, the first element of the vector is degree 0, so x starts at 0
ggplot()+geom_point(aes(seq_along(ddin)-1,ddin))
ggplot()+geom_point(aes(seq_len(length(ddin)-1),ddin[-1]))

#We can also plot indegree and outdegree together
#Use colour to distinguish the points, and make the upper layer smaller in case we overplot
#x vals run from degree 1 because we drop the degree 0 element of each vector
ggplot() + geom_point(aes(seq_len(length(ddin)-1),ddin[-1]),colour='red') + geom_point(aes(seq_len(length(ddout)-1),ddout[-1]),colour='blue',size=1)

Note that I really should have labelled the axes – x-axis is “in (or out) degree”, y-axis is “proportion of nodes with corresponding in (or out) degree”.

Out-degree:

Out-degree (except out-degree==0):

In-degree:

In-degree (except in-degree==0):

One thing I notice about the in-degree is that there is a very high number of very low in-degree nodes, which tail off quickly, and then another head at in-degree 25 which then tails off. This is an artefact of the way the graph file was pre-processed – I generated a friends network of hashtag users, then filtered the network to only include nodes that had indegree of at least 25 and/or outdegree of at least 25. The nodes with in-degree between 1 and 25 are nodes corresponding to hashtaggers that are friended by other hashtaggers (they survive the filter because of their outdegree).

In- (red) and out- (blue) degree:

Reflecting on the in-degree graph, we have a way of identifying those folk who used the hashtag and are connected to other hashtaggers:

arrange(subset(df,subset=(outdegree>0 & indegree>0)),desc(indegree))

The dataset I’m using is based on folk using the #bbcqt hashtag. Here are the hashtaggers most linked to by other hashtaggers:

> head(arrange(subset(df,subset=(outdegree>0)),desc(indegree)))
             name indegree outdegree
1 bbcquestiontime      190       102
2       DIMBLEBOT       76        61
3   markinreading       34       121
4 politicalhackuk       27       236
5          10anta       25        73
6 Parlez_me_nTory       24        63

So now I’m wondering… does this hint at a way of measuring some sort of engagement with the Twitter account set up to promote the programme and, presumably, the hashtag???

If we consider @bbcquestiontime, the high indegree tells us that the @bbcquestiontime account is being followed by a significant number of the hashtag users (we could find out what proportion by dividing through by the number of folk with out-degree>0, minus 1; minus 1 because @bbcquestiontime is one of those hashtaggers). That @bbcquestiontime has outdegree > 0 tells us it was sampled as a user of the hashtag (the graph was originally generated with directed edges from folk who used the tag to their friends). The high (ish?!) out-degree tells us that this account is linking to a reasonable number of folk popularly followed by users of the #bbcqt hashtag or who used the hashtag; so @bbcquestiontime is listening to folk that the #bbcqt taggers listen to, which is probably a good thing. (I guess what we could do here is compare the outdegree of the @bbcquestiontime account with its total friend count, ie, with the total number of accounts it follows. Because if the account was following 1000 people or so, and only 10% of them were being followed by #bbcqt hashtaggers, we might wonder whether they’re interested in different things?) Once again, we could also normalise the out-degree number with respect to one less than the number of accounts with indegree >0 (again, we subtract one to account for the self reference) to get the proportion of folk being followed by hashtaggers that are being followed by @bbcquestiontime. This gives us some idea of the extent to which @bbcquestiontime is listening to folk that the #bbcqt hashtaggers are listening to.

Let’s try that latter normalisation to get a feel for what the proportions are…

#Count the number of rows where folk have indegree, or outdegree, as required, > 0
df$inReach=df$indegree/(nrow(subset(df,df$outdegree>0))-1)
df$outReach=df$outdegree/(nrow(subset(df,df$indegree>0))-1)
#First let's see who reaches furthest out into the interest community
head(arrange(subset(df,inReach>0),desc(outReach)))
             name indegree outdegree  outReach     inReach
1        Damientg        5       341 0.4782609 0.013054830
2      danmknight        9       265 0.3716690 0.023498695
3         martysm        1       261 0.3660589 0.002610966
4       MrJacHart       18       257 0.3604488 0.046997389
5           VMcAV        5       237 0.3323983 0.013054830
6 politicalhackuk       27       236 0.3309958 0.070496084

#now let's see who is touched by most of the community
head(arrange(subset(df,outReach>0),desc(inReach)))
             name indegree outdegree   outReach    inReach
1 bbcquestiontime      190       102 0.14305750 0.49608355
2       DIMBLEBOT       76        61 0.08555400 0.19843342
3   markinreading       34       121 0.16970547 0.08877285
4 politicalhackuk       27       236 0.33099579 0.07049608
5          10anta       25        73 0.10238429 0.06527415
6 Parlez_me_nTory       24        63 0.08835905 0.06266319

So, from that, we see that @Damientg is following a large number of the folk popularly followed by users of the #bbcqt hashtag or who used the hashtag. I don’t think this is interesting. However, the fact that @bbcquestiontime is followed by about half the folk who used the #bbcqt tag (in the sample I grabbed) is maybe useful as a measure of how engaged the hashtaggers may be with the programme Twitter account?

The latter report also brings to mind another question – how many of the hashtaggers does any particular account follow – that is, how connected is any particular account to folk who used the hashtag (which is the set of folk with outdegree>0)? This is important I think – distinguishing between hashtaggers who link to each other as part of a conversation, and other accounts they follow en masse but who aren’t engaging in conversation via the hashtag?
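By way of a first stab at that question, here is a minimal Python sketch (plain Python rather than R, and not part of the script above). It assumes the graph is available as a list of (follower, friend) pairs in which every follower is a sampled hashtag user; the function name `hashtaggers_followed` and the account names are made up for illustration.

```python
from collections import Counter

def hashtaggers_followed(edges):
    """Count, for each hashtagger, how many other hashtaggers they follow.

    edges: iterable of (follower, friend) pairs. Every follower is assumed
    to be a sampled hashtag user, so hashtaggers are the nodes with
    outdegree > 0.
    """
    edges = set(edges)                        # ignore duplicate links
    hashtaggers = {src for src, _ in edges}   # nodes with outdegree > 0
    counts = Counter({h: 0 for h in hashtaggers})
    for src, dst in edges:
        # only count links that land on another hashtagger
        if dst in hashtaggers and dst != src:
            counts[src] += 1
    return dict(counts)

# Made-up example: three hashtaggers plus one account that never used the tag
example_edges = [
    ("alice", "bob"), ("alice", "celeb"),
    ("bob", "alice"), ("bob", "celeb"),
    ("carol", "celeb"),
]
followed = hashtaggers_followed(example_edges)
print(sorted(followed.items()))
```

In this toy data alice and bob follow each other, whereas carol only follows the non-tagging account, which is the sort of distinction between conversation and en masse following that the question is getting at.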

Hmmm…something to ponder over the weekend I think;-)

Written by Tony Hirst

January 27, 2012 at 6:17 pm

Social Media Interest Maps of Newsnight and BBCQT Twitterers

I grabbed independent samples of 1500 recent users of the #newsnight and #bbcqt hashtags within a minute or two of each other about half an hour ago. Here’s who’s followed by 25 or more of the recent hashtaggers in each case. Can you distinguish the programmes each audience interest projection map relates to?

Here’s the first one – are these folk followed by 25 or more of the folk who recently used the #bbcqt or the #newsnight hashtag?

#bbcqt 1500 forward friends 25 25

Here’s the second one – are these folk followed by 25 or more of the folk who recently used the #bbcqt or the #newsnight hashtag?

#newsnight 1500   forward friends  projection 25 25

The answer is only a click away…

PS I’ve got a couple of scripts in the pipeline that should be able to generate data that I can use to generate this sort of differencing word cloud, the idea being I should be able to identify at a glance accounts that different hashtag communities both follow, and accounts that they differently follow…

UPDATE: so here’s a quick first pass at comparing the audiences. I’m not sure how reliable the method is, but it’s as follows:

- for each hashtag, grab 1500 recent tweets. Grab the list of folk the hashtagging users follow and retain a list (the ‘interest list’) of folk followed by at least 25 of the hashtaggers. Filter the hashtagger list so that it only contains hashtaggers who follow at least 25 people (this cuts out brand new users and newly created spam accounts). Count the number of filtered hashtaggers that follow each person in the interest list, and normalise by dividing through by the total number of filtered hashtaggers. To recap, for each tag, we now have a list of folk who were popularly followed by users of that tag, along with a number for each one between 0 and 1 describing proportionally how much of the hashtagging sample follow them.
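The counting and normalising steps described above might be sketched along the following lines. This is an illustrative Python sketch under stated assumptions, not the script that produced the files loaded into R: it assumes the friends lists have already been fetched into a dict keyed by hashtagger, and the names `interest_scores`, `min_friends` and `min_followers` are hypothetical.

```python
from collections import Counter

def interest_scores(friends_by_user, min_friends=25, min_followers=25):
    """Return {account: proportion of filtered hashtaggers following it}."""
    friend_sets = {u: set(fr) for u, fr in friends_by_user.items()}

    # Interest list: accounts followed by at least min_followers hashtaggers
    raw_counts = Counter(f for fr in friend_sets.values() for f in fr)
    interest = {acct for acct, n in raw_counts.items() if n >= min_followers}

    # Filtered hashtaggers: those who follow at least min_friends accounts
    filtered = {u: fr for u, fr in friend_sets.items() if len(fr) >= min_friends}
    if not filtered:
        return {}

    # Count filtered hashtaggers following each interest-list account,
    # then normalise by the number of filtered hashtaggers
    counts = Counter(f for fr in filtered.values() for f in fr if f in interest)
    total = len(filtered)
    return {acct: counts[acct] / total for acct in interest}

# Tiny made-up sample, with both thresholds lowered to 2 so something survives
sample = {
    "u1": ["a", "b"],
    "u2": ["a", "b", "c"],
    "u3": ["a"],          # follows too few accounts, so filtered out
    "u4": ["c", "a"],
}
scores = interest_scores(sample, min_friends=2, min_followers=2)
```

With the toy thresholds, three hashtaggers survive the filter; all three follow "a" (score 1.0), and two of the three follow each of "b" and "c".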

(Note that there may be all sorts of sampling errors… I guess I need to qualify reports with the number of unique folk tweeting in the twitter sample captured. I maybe also need to improve sampling so rather than searching for 1500 tweets, I generate a sample of 1000 unique users of the tag?)

I then load these files into R and run through the following process:

#Multiply this normalised follower proportion by 1000 and round down to get an integer between 0 and 1000 representing a score relative to the proportion of filtered hashtaggers who follow each person in the interest list.
counts_newsnight$normIn=as.integer(counts_newsnight$inNorm*1000)
counts_bbcqt$normIn=as.integer(counts_bbcqt$inNorm*1000)

#Another filtering step: we're going to plot similarities and differences between folk followed by at least 25% of the corresponding filtered hashtaggers
newsnight=subset(counts_newsnight,select=c(username,normIn),subset=(inNorm>=0.25))
bbcqt=subset(counts_bbcqt,select=c(username,normIn),subset=(inNorm>=0.25))

#Now generate a dataframe
qtvnn=merge(bbcqt,newsnight,by="username",all=T)
colnames(qtvnn)=c('username','bbcqt','newsnight')

#replace the NA cell values (where for example someone in the bbcqt list is not in the newsnight list) with 0
qtvnn[is.na(qtvnn)] <- 0

That generates a dataframe that looks something like this:

      username bbcqt newsnight
1    Aiannucci   414       408
2  BBCBreaking   455       464
3 BBCNewsnight   317       509
4  BBCPolitics     0       256
5   BBCr4today     0       356
6  BarackObama   296       334

Thanks to Josh O’Brien on Stack Overflow, I can recast this data frame into a term.matrix that plays nicely with the latest version of the R wordcloud package.

require(wordcloud)
#build a matrix from the two score columns, then use the usernames as row names
mat <- as.matrix(qtvnn[-1])
dimnames(mat)[1] <- qtvnn[1]
comparison.cloud(term.matrix = mat)
commonality.cloud(term.matrix = mat)

Here’s the result – commonly followed folk:

And differentially followed folk (at above the 25% level, remember…)

So from this what can we say? Both audiences have a general news interest, into pop politics and perhaps satirical comedy, maybe leaning to the left? The Question Time audience is a more casual audience, more minded to following celebrities, whereas the Newsnight audience is a bit more into following notable media folk (journalists, editors) and also political news. (I’d be keen to hear any other readings of these maps – please feel free to leave a comment containing your interpretations/observations/reading:-)

UPDATE2: to try to get a handle on what the word clouds might be telling us from an alternative visual perspective on the data, rather than inspecting the actual code for example, here’s a scatterplot showing how the follower proportions compare directly:

Comparison of who #newsnight and #bbcqt hashtaggers follow

ggplot(na.omit(subset(qtvnn,bbcqt>0 & newsnight>0))) + geom_text(aes(x=bbcqt,y=newsnight,label=username),angle=45,size=4) + xlim(200,600) + ylim(200,600) + geom_abline(intercept=0, slope=1,colour='grey')

Here’s another view – this time plotting followed folk for each tag who are not followed by the friends of the other tag:

hashtag comparison - folk not followed by other tag

I couldn’t remember/didn’t have Google to hand to find the best way of reshaping the data for this, so I ended up with a horrible horrible hack…

nn=data.frame(typ='newsnight',subset(qtvnn,select=c(username,newsnight),subset=(newsnight>0 & bbcqt==0)))
qt=data.frame(typ='bbcqt',subset(qtvnn,select=c(username,bbcqt),subset=(newsnight==0 & bbcqt>0)))
colnames(nn)=c('typ','name','val')
colnames(qt)=c('typ','name','val')
qtnn=rbind(nn,qt)
ggplot()+geom_text(data=qtnn,aes(x=typ,y=val,label=name),size=3)

I think this is all starting to get to the point where I need to team up with a proper developer and get *all* the code properly written and documented before any errors that are currently there get baked in too deeply…

Written by Tony Hirst

January 26, 2012 at 11:23 pm

Posted in Anything you want, Rstats

TV Audience Social Interest Mapping – Shameless vs. Newsnight vs Masterchef

with 2 comments

How easy is it to differentiate between audiences of different types of TV programme based on their socially signalled interests?

This evening, I ran a couple of Twitter searches against the #shameless and #newsnight hashtags. In each case, I grabbed 1500 of the most recent tweets and generated lists of folk who had tweeted the corresponding hashtag at least twice in the sample set. I then grabbed the lists of all the friends of the folk in each list to generate a projection map of the friends of recent hashtaggers. The final preprocessing step was to filter each network to contain only nodes that had at least an indegree or outdegree of 25 (that is, I filtered the network to only include folk who had at least 25 friends, or were linked to by at least 25 of the folk in the corresponding hashtaggers list).
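The final filtering step might look something like the following Python sketch. It is an illustration under assumptions rather than the code actually used: the network is taken to be a list of (hashtagger, friend) pairs, degrees are computed once on the unfiltered network, and the name `filter_by_degree` is made up.

```python
from collections import Counter

def filter_by_degree(edges, threshold=25):
    """Keep nodes with indegree >= threshold or outdegree >= threshold.

    Degrees are computed once, on the unfiltered network (an assumption);
    an edge is kept only when both of its endpoints survive.
    """
    edges = set(edges)
    indegree = Counter(dst for _, dst in edges)
    outdegree = Counter(src for src, _ in edges)
    nodes = set(indegree) | set(outdegree)
    keep = {n for n in nodes
            if indegree[n] >= threshold or outdegree[n] >= threshold}
    kept_edges = {(s, d) for s, d in edges if s in keep and d in keep}
    return keep, kept_edges

# Made-up network, with the threshold lowered to 2 for illustration
toy_edges = [
    ("h1", "a"), ("h1", "b"),
    ("h2", "a"),
    ("h3", "a"), ("h3", "c"),
]
kept_nodes, kept_edges = filter_by_degree(toy_edges, threshold=2)
```

Here h1 and h3 survive because they each have two friends, "a" survives because three hashtaggers follow it, and h2, "b" and "c" drop out.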

Here’s the resulting map generated around the #shameless tag – it gives an impression of folk who tend to be followed by folk using the #shameless tag:

Positioning #shameless

Music, celebrities, footballers and comedians, err, I think?!

By way of comparison, here’s a sketch of who the folk using the #newsnight tag follow:

Positioning the #newsnight audience

MPs, political hacks, and higher brow level of comedian maybe?! ;-)

PS for what it’s worth, here’s another map for tweets grabbed now around #masterchef, which aired a few hours ago:

Positioning #masterchef

So that’ll be a cooking show with some high profile talent (in the narrower scheme of things) then?!

Written by Tony Hirst

January 25, 2012 at 12:08 am

Mapping Corporate Twitter Account Networks Using Twitter Contributions/Contributees API Calls

with 2 comments

Savvy users of social networks are probably well-versed in the idea that corporate Twitter accounts are often “staffed” by several individuals (often identified by the ^AB convention at the end of a tweet, where AB are the initials of the person wearing that account hat (^)); they may also know that social media accounts for smaller companies may actually be operated by a PR company or “social media guru” who churns out tweets on their behalf via Twitter accounts operated in the company’s name and in support of its online marketing activity.

Rooting around the Twitter API looking for something else, I spotted a GET users/contributees API call, along with a complementary GET users/contributors call, that return “an array of users (i.e. Twitter accounts) that the specified user can contribute to” and the accounts that can contribute to a particular Twitter account, respectively.

I didn’t know this functionality existed, so I put out a fishing tweet to see if anyone knew of any accounts running this feature other than the twitterapi account used by way of example in the API documentation. A response from Martin Hawksey (on whom I’m increasingly reliant for helping me keep up and get my head around the daily novelties that the web throws up!) suggested it was a feature that has been quietly rolling out to premium users: Twitter Starts Rolling Out Contributors Feature, Salesforce Activated. Via his reading of that post (I think), Martin suggested that a Bing(;-) search for site:twitter.com “via web by” would turn up a few likely candidates, and so it did…

So why’s this interesting? Because given the ID of an account that a company uses for corporate tweets, or the ID of a user who also contributes to a corporate account via their own account, we might be able to map out something of the corporate comms network for an organisation operating multiple accounts (maybe a company, but maybe also a government department, local council or lobbyist group), or the client list of a “social media guru” operating various accounts for different SMEs.

Anyway, here’s a quick script for exploring the Twitter contributors/contributees API. The output is a graphml file we can visualise in Gephi.
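The general shape of such a crawl can be sketched in Python as below. To be clear about the assumptions: this is not the linked script, it makes no Twitter API calls and writes no graphml. The two lookup functions are hypothetical stand-ins for whatever wraps the contributors and contributees calls, and the account names are invented.

```python
from collections import deque

def crawl_contributor_network(seed, get_contributors, get_contributees,
                              max_accounts=100):
    """Breadth-first crawl returning directed (contributor, account) edges.

    get_contributors(name) -> accounts that can contribute to `name`
    get_contributees(name) -> accounts that `name` can contribute to
    Both callables are placeholders supplied by the caller.
    """
    edges = set()
    seen = {seed}
    queue = deque([seed])
    while queue:
        account = queue.popleft()
        links = [(person, account) for person in get_contributors(account)]
        links += [(account, target) for target in get_contributees(account)]
        for src, dst in links:
            edges.add((src, dst))
            for node in (src, dst):
                # cap the crawl so it cannot run away
                if node not in seen and len(seen) < max_accounts:
                    seen.add(node)
                    queue.append(node)
    return edges

# Invented lookup tables standing in for the API responses
CONTRIBUTORS = {"corp_news": ["ann", "bob"], "corp_sport": ["bob"]}
CONTRIBUTEES = {"ann": ["corp_news"], "bob": ["corp_news", "corp_sport"]}

network = crawl_contributor_network(
    "corp_news",
    lambda name: CONTRIBUTORS.get(name, []),
    lambda name: CONTRIBUTEES.get(name, []),
)
```

Starting from the invented corp_news account, the crawl finds ann and bob as contributors and then, via bob, discovers corp_sport. The resulting edge set is the sort of thing that could then be written out for Gephi.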

And here are a couple of views over what it comes up with. Firstly, a map bootstrapped from the @twitterapi account:

Twitter contributors network

And here’s one I built out from HuffingtonPost:

HuffingtonPost twitter contributors network

So what do we learn from this? Firstly it’s yet another example of how networks get everywhere. Secondly, it raises the question (for me) of whether there are any cribs in other multi-contributor social network apps (maybe in tweet metadata) that allow us to identify originating authors/users and hence find a way into mapping their contribution networks.

As well as building out from an account name to which users contribute, we can bootstrap a map from a user who is known to contribute to one or more accounts (code not included in Github gist atm).

So for example, here’s a map built out from user @VeeVee:

Twitter contributors network built out from @veevee

I guess one of the next questions from a tool building point of view is: is there a more reliable way of getting cribs into possible contributor/contributee networks? Another is: are any other multi-contributor services (on Twitter or other networks, such as Google+) similarly mappable?

PS Just noticed this: Google to drop Google Social API. I also read on a Google blog that the Needlebase screenscraping tool Google acquired as part of the ITA acquisition will be shut down later this year…

Written by Tony Hirst

January 23, 2012 at 5:18 pm

A Quick View Over a MASHe Google Spreadsheet Twitter Archive of UKGC12 Tweets

with 5 comments

Following on from A Tool Chain for Plotting Twitter Archive Retweet Graphs – Py, R, Gephi, here’s a quick summary view over #UKGC12 tweets saved in a Google Spreadsheet archive as developed by Martin Hawksey, generated from an R script (R code available here; #ukgc12 tweet archive here)…

(I did mean to tidy these up, add in titles etc etc but it’s late and I’m realllly tired:-(

So for example, an ordered bar chart showing who was @’d most by hashtagged tweets:

Tweets to an individual

And a scatterplot showing the number of tagged tweets to and from particular individuals, sized by how many RTs of a person’s tweets there were:

ukgc2012 tweeps

(Hmmm..strikes me I could use a fourth dimension (colour) to capture the number of RTs issued by each person too…? I wonder if I can also tie the angle of each label to a parameter value?!)

I also had a quick peek at looking at folk who were using the tag and/or were heavily followed by tag users (nodes sized according to betweenness centrality):

Connections between recent users of the #ukgc12 hashtag and the folk they tend to follow (node size: betweenness centrality)

You can view a dynamic version of the conversation graph around the tag using Martin’s TAGSExplorer (about).

PS See the first comment below from Ben Marwick for a link to a text analysis script in R that can be easily tweaked to use archived tweets. When I get a chance, I’ll try to wrap this into a Sweave script (cf. How Might Data Journalists Show Their Working? Sweave for the automated generation of PDF and HTML reports.).

Written by Tony Hirst

January 21, 2012 at 12:31 am

Posted in Anything you want, Rstats

Apple of My i(TunesU)?

with 5 comments

Apple made some sort of announcement today revolutionising everything, reinventing everything, or something…

The OU may have been in the presentation as part of the story to date or as part of the story to come…

OU gets a mention then at #appleed announcement?

…but I couldn’t tell from a quick trawl of related OU sites…(http://www.open.ac.uk/itunes redirects to http://www.open.edu/itunes/ but there’s no “news” there, http://www.open.ac.uk/itunesu 404s, and there’s nothing (yet?) on the OU media release page (note that http://www.open.ac.uk/press 404s). http://www.open.edu/openlearn/ has no news, nor does http://openlearn.open.ac.uk/ . The community site http://www.open.ac.uk/platform/ has, err, no news I can see, except maybe providing a bit of balance?!;-)

OU platform on itunesu/ibooks announcement day

(Will Android users be invited in to the new world order? Hmm..and I wonder, does the iTunesU stuff come with “free” DRM to protect the user….?[It's been pointed out to me in comments that iTunesU is DRM free... Still worth checking round the various iTunesU and iBook Author license conditions though...])

To show solidarity with the often (incorrectly perceived as) flakey nature of academic computery related things, Apple adopted sector standards in terms of dynamic page building (“an error occurred”):

iBooks Author - an error occurred...

And link checking:

I follow a link on iTunesU app preview page and what do I get...?

Good stuff…

[Related: from Patrick McAndrew, iBook Author – is it OER incompatible?]

PS I don’t have an iPad, or an iPhone, my Mac is running an O/S two versions old, and my iPod touch is first generation. So presumably I won’t be able to try any of the new stuff out anyway?

PPS I’m a little intrigued as to why http://open.edu redirects to http://www.open.ac.uk/ but http://www.open.ac.uk/openlearn redirects to http://www.open.edu/openlearn/ and http://www.open.ac.uk/itunes redirects to http://www.open.edu/itunes/ The university wide top level navigation on the .edu site also leads in to open.ac.uk pages, but then there is no browseable way back to the .edu site?

PPPS nice to see the iBooks Author app is “free”… you need a Mac with the latest O/S of course; and an iPad for the best reading experience. £1k should get you started, if you have the educational discount. Bargain. Free’s nice, isn’t it?:-)

Written by Tony Hirst

January 19, 2012 at 5:28 pm

Posted in Anything you want

A Couple of Takes on Social Media Triangulation

One of the things that interests me in social interest networks is the extent to which we can generate quick sketch maps of notable folk in an interest area, or generate quick profiles of the shared interests of a group of users. So here are a couple of doodles around the idea of social media triangulation, generated using Twitter, where I try to get an impression of the shared interest space that two or more users are located in.

The first attempt is a sample/sketch of the common friends of three Twitter users (@clhw1, @melissaterras and @annewelsh) from the same institution (UCL).

The sketch is generated by grabbing the friends list for each user, constructing a directed graph of the folk they follow (from each person in the list to each of the people they follow), then filtering the graph to show nodes with outdegree > 0 or in-degree of 3 (note to self: automate an option for this based on len(userList)):
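A minimal Python sketch of that filter, with the in-degree requirement derived from the length of the user list as per the note to self, might look like this. It assumes the friends lists are already to hand as a dict; the function name `triangulate` and the account names are invented for the example.

```python
from collections import Counter

def triangulate(friends_by_user):
    """Return the seed users plus the accounts that every seed user follows.

    Equivalent to keeping nodes with outdegree > 0 (the seed users) or
    indegree equal to the number of seed users (their common friends).
    """
    users = list(friends_by_user)
    indegree = Counter(f for u in users for f in set(friends_by_user[u]))
    required = len(users)   # automated version of the hard-coded 3
    common = {acct for acct, n in indegree.items() if n == required}
    return set(users) | common

# Invented friends lists for three seed users
toy_friends = {
    "a": ["x", "y", "z"],
    "b": ["x", "y"],
    "c": ["x", "z", "a"],
}
kept = triangulate(toy_friends)
```

Only "x" is followed by all three seed users here, so the filtered graph contains the three seeds plus "x".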

Triangulation - common friends of @clhw1 @melissaterras @annewelsh

If you know the identities of the folk identified in this sketch (I don’t) it may or may not be meaningful in terms of the names that are collected there, and maybe any names you might expect to be there that are absent… One obvious next step in profiling the shared interest area of the three originally identified users would be to generate a word cloud across the biographical descriptions of each of the folk identified in the sketch.

The second doodle is inspired by a James Allen post – What you will see Lewis Hamilton doing next:

“Lewis Hamilton has had a relationship with sports clothing brand Reebok since the early days of his F1 career and he is set to become one of the key faces in a new multi million pound campaign for the brand, set to launch in March.

According to Marketing Week, Reebok, which is part of the Adidas group, has identified that many people like to treat fitness and the act of getting and staying fit as a sport in itself and they are going to market it that way, using Hamilton and other sports personalities. It will be interesting to see how he is positioned”

Hmm… which got me wondering – if I grab the friends of a sample of the followers of @lewishamilton, and the friends of an independent sample of the followers of @reebok, graph the result and then filter the net to show folk who follow both @reebok and @lewishamilton, who tends to be followed by these common followers? That is, given independent samples of the followers of @reebok and @lewishamilton, retain any folk in either sample who also follow the other target account and then map who this set of people follow to any significant extent?

Using a sample size of 197 random followers of @lewishamilton and an independent sample of 197 followers of @reebok, the number of individuals that followed both accounts was minimal, which is maybe not surprising given the small sample size and large follower counts.

However, if we relax the filtering constraint a little and instead plot the intersection of the (undirected) degree 2 egonets of @reebok and @lewishamilton, here’s what we get:

Where @reebok and @lewishamilton intersect...

This graph was generated by plotting the friends network of 197 random followers of @lewishamilton and 197 random followers of @reebok and then applying the following filter:

triangulation of a sort...

That is, we filter the network (treating it as an undirected graph) to show all the people who are within a network distance of two of both @lewishamilton and @reebok. Note that this does not mean that folk in the graph follow both @lewishamilton and @reebok, as this example shows [need a diagram!]: suppose A and B follow X, C follows Y, and D follows X and C.

Example graph

Treating the edges as undirected, D is within a distance of 2 of both X and Y, but does not follow both X and Y directly.
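The toy example can be checked with a short Python sketch. This is only an illustration of the distance-2 filter on the five edges described above, not the Gephi filter itself, and the helper names are made up.

```python
from collections import deque

def undirected_adjacency(edges):
    """Build an adjacency map that ignores edge direction."""
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, set()).add(dst)
        adjacency.setdefault(dst, set()).add(src)
    return adjacency

def within_distance(adjacency, start, max_dist):
    """Breadth-first search: nodes at most max_dist hops from start."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == max_dist:
            continue            # do not expand beyond the distance limit
        for neighbour in adjacency.get(node, ()):
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return set(dist)

# A and B follow X, C follows Y, and D follows X and C
edges = [("A", "X"), ("B", "X"), ("C", "Y"), ("D", "X"), ("D", "C")]
adjacency = undirected_adjacency(edges)

near_both = within_distance(adjacency, "X", 2) & within_distance(adjacency, "Y", 2)
follows_both = {s for s, _ in edges if (s, "X") in edges and (s, "Y") in edges}
print(sorted(near_both), sorted(follows_both))
```

Both C and D pass the distance-2 filter, yet nobody in the toy graph follows both X and Y directly, which is exactly the caveat being made.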

A question that now comes to mind is this: does the “within distance 2” map have anything meaningful to say about the relative positioning of @lewishamilton and @reebok, or does the method of construction/filtering of the graph just produce meaningless noise?

Written by Tony Hirst

January 17, 2012 at 3:47 pm

Posted in Anything you want

Invisible Library Support – Now You Can’t Afford Not to be Social?

with 2 comments

If you live by pop tech feed or Twitter, you’ve probably heard that Google is rolling out a new style of socially powered search results. If not, or if you’re still not clear about what it entails, read Phil Bradley’s post on the matter: Why Google Search Plus is a disaster for search.

Done that? If not, why not? This post isn’t likely to make much sense at all if you don’t know the context. Here’s the link again: Why Google Search Plus is a disaster for search

So the starting point for this post is this: Google is in the process of rolling out a new web search service that (optionally) offers very personal search results that contains content from folk that Google thinks you’re associated with, and that Google is willing to show you based on license agreements and corporate politics.

Think about this for a minute…. in the totally personalised view, folk will only see content that their friends have published or otherwise shared…

In Could Librarians Be Influential Friends?, I wondered aloud whether it made sense for librarians and other folk involved with providing support relating to resource discovery and recommendation to start a) creating social network profiles and encouraging their patrons to friend them, and b) recommending resources using those profiles in order to start influencing the ordering/ranking of results in patrons’ search results based on those personal recommendations. The idea here was that you could start to make invisible frictionless recommendations by influencing the search engine results returned to your patrons. (The results aren’t invisible because your profile picture may appear by the result showing that you recommend it. They’re frictionless in the sense that having made the original recommendation, you no longer have to do any work in trying to bring it to the attention of your patron – the search engines take care of that for you; okay, I know that’s a simplistic view;-) [Hmm.. how about referring to it as recommendation mode support?]

(Note that there is a complementary form of support to this approach, which I’ve previously referred to as Invisible Library Tech Support (responsive mode support?) and which I guess is also frictionless, at least from the perspective of the patron, in which librarians friend their patrons or monitor generic search terms/tags on Q&A sites and then proactively respond to requests that users post into their social networks more generally.)

With the aggressive stance Google now seems to be taking towards pushing social circle powered results, I think we need to face up to the fact – as Phil Bradley pointed out – that if librarians want to make sure they’re heard by their patrons, they’re going to need to start setting up social profiles, getting their patrons to friend them, and start making content and resource recommendations just anyway in order to make them available as resources that are indexed by patrons’ personal search engines. The same goes for publishers of OERs, academic teaching staff, and “courses”.

If we think of Google social search as searching over custom search engines bound by resources created and recommended by members of a user’s social circle, if you want to make (invisible) recommendations to a user via their (personalised) web search results, you’re going to need to make sure that the resources/content you want to recommend are indexed by their personal search engines. Which means: a) you need to friend them; and b) you need to share that content/those resources in that social context.

(Hmmm…this makes me think there may be something in the course custom search engine approach after all… Specifically, if the course has a social profile, and recommends the links contained within the course via that profile, they become part of the personalised search index of students following that course profile?)

Just by the by, as another example of Google completely messing things up at the moment, I notice that when I share links to posts on this blog via Google+, they don’t appear as trackbacks to the post in question. Which means that if someone refers to a post on this blog on Google+, I don’t know about it… whereas if they blog the link, I do…

See also my chronologically ordered posts on the eroding notion of “Google Ground Truth”.

[Invisible vs frictionless (and various notions of that word) is all getting a bit garbled; see eg @briankelly's Should Higher Education Welcome Frictionless Sharing and my comments to it for a little more on this...]

PS I’ve been getting increasingly infuriated by the clutter around, and lack of variation within, Google search results lately, so I changed my default search engine to Bing. The results are a bit all over the place compared to the Google results I tend to get, but this may be down in part to personalisation/training. I am still making occasional forays to Google, but for now, Bing is it… (because Bing is not Google…)

PPS Hah – just noticed: Google Search Plus doesn’t mean plus in the sense of search more, it means search Google+, which is less, or minus the wider world view…;-)

PPPS I keep meaning to blog this, and keep forgetting: Turn[ing] off [Google] search history personalization, in particular: “If you’ve disabled signed-out search history personalization, you’ll need to disable it again after clearing your browser cookies. Clearing your Google cookie clears your search settings, thereby turning history-based customizations back on.” Which is to say, when you disable personalisation, you don’t disable personalisation against your Google account, you disable it only insofar as it relates to your current cookie ID?

Written by Tony Hirst

January 13, 2012 at 1:35 pm

Amateur Mapmaking: Getting Started With Shapefiles

with one comment

One of the great things about (software) code is that people build on it and out from it… Which means that as well as producing ever more complex bits of software, tools also get produced over time that make it easier to do things that were once hard to do, or required expensive commercial software tools.

Producing maps is a fine example of this. Not so very long ago, producing your own annotated maps was a hard thing to do. Then in June, 2005, or so, the Google Maps API came along and suddenly you could create your own maps (or at least, put markers on to a map if you had latitude and longitude co-ordinates available). Since then, things have just got easier. If you want to put markers on a map just given their addresses, it’s easy (see for example Mapping the New Year Honours List – Where Did the Honours Go?). You can make use of Ordnance Survey maps if you want to, or recolour and style maps so they look just the way you want.

Sometimes, though, when using maps to visualise numerical data sets, just putting markers onto a map, even when they are symbols sized proportionally in accordance with your data, doesn’t quite achieve the effect you want. Sometimes you just have to have a thematic, choropleth map:

OS thematic map example

The example above is taken from an Ordnance Survey OpenSpace tutorial, which walks you through the creation of thematic maps using the OS API.

But what do you do if the boundaries/shapes you want to plot aren’t supported by the OS API?

One of the common ways of publishing boundary data is in the form of shapefiles (suffix .shp, though they are often combined with several other files in a .zip package). So here’s a quick first attempt at plotting shapefiles and colouring them according to an appropriately defined data set.

The example is based on a couple of data sets – shapefiles of the English Government Office Regions (GORs), and a dataset from the Ministry of Justice relating to insolvencies that, amongst other things, describes numbers of insolvencies per time period by GOR.

The language I’m using is R, within the RStudio environment. Here’s the code:

#Download English Government Office Network Regions (GOR) from:
#http://www.sharegeo.ac.uk/handle/10672/50
##tmpdir/share geo loader courtesy of http://stackoverflow.com/users/1033808/paul-hiemstra
tmp_dir = tempdir()
url_data = "http://www.sharegeo.ac.uk/download/10672/50/English%20Government%20Office%20Network%20Regions%20(GOR).zip"
zip_file = sprintf("%s/shpfile.zip", tmp_dir)
download.file(url_data, zip_file)
unzip(zip_file, exdir = tmp_dir)

library(maptools)

#Load in the data file (could this be done from the downloaded zip file directly?)
gor=readShapeSpatial(sprintf('%s/Regions.shp', tmp_dir))

#I can plot the shapefile okay...
plot(gor)

Here’s what it looks like:

Thematic maps for UK Government Office Regions in R

#I can use these commands to get a feel for the data contained in the shapefile...
summary(gor)
attributes(gor@data)
gor@data$NAME
#[1] North East               North West              
#[3] Greater London Authority West Midlands           
#[5] Yorkshire and The Humber South West              
#[7] East Midlands            South East              
#[9] East of England         
#9 Levels: East Midlands East of England ... Yorkshire and The Humber

#download data from http://www.justice.gov.uk/downloads/publications/statistics-and-data/courts-and-sentencing/csq-q3-2011-insolvency-tables.csv
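#(stringsAsFactors=TRUE keeps the factor columns that the steps below rely on; since R 4.0, read.csv no longer creates them by default)
insolvency<- read.csv("http://www.justice.gov.uk/downloads/publications/statistics-and-data/courts-and-sentencing/csq-q3-2011-insolvency-tables.csv", stringsAsFactors=TRUE)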
#Grab a subset of the data, specifically to Q3 2011 and numbers that are aggregated by GOR
insolvencygor.2011Q3=subset(insolvency,Time.Period=='2011 Q3' & Geography.Type=='Government office region')

#tidy the data - you may need to download and install the gdata package first
#The subsetting step doesn't remove extraneous original factor levels, so I will.
require(gdata)
insolvencygor.2011Q3=drop.levels(insolvencygor.2011Q3)

names(insolvencygor.2011Q3)
#[1] "Time.Period"                 "Geography"                  
#[3] "Geography.Type"              "Company.Winding.up.Petition"
#[5] "Creditors.Petition"          "Debtors.Petition"  

levels(insolvencygor.2011Q3$Geography)
#[1] "East"                     "East Midlands"           
#[3] "London"                   "North East"              
#[5] "North West"               "South East"              
#[7] "South West"               "Wales"                   
#[9] "West Midlands"            "Yorkshire and the Humber"
#Note that these names for the GORs don't quite match the ones used in the shapefile, though how they relate one to another is obvious to us...

#So what next? [That was the original question...!]

#Here's the answer I came up with...
#Convert factors to numeric [ http://stackoverflow.com/questions/4798343/convert-factor-to-integer ]
#There's probably a much better formulaic way of doing this/automating this?
insolvencygor.2011Q3$Creditors.Petition=as.numeric(levels(insolvencygor.2011Q3$Creditors.Petition))[insolvencygor.2011Q3$Creditors.Petition]
insolvencygor.2011Q3$Company.Winding.up.Petition=as.numeric(levels(insolvencygor.2011Q3$Company.Winding.up.Petition))[insolvencygor.2011Q3$Company.Winding.up.Petition]
insolvencygor.2011Q3$Debtors.Petition=as.numeric(levels(insolvencygor.2011Q3$Debtors.Petition))[insolvencygor.2011Q3$Debtors.Petition]

#Tweak the levels so they match exactly (really should do this via a lookup table of some sort?)
i2=insolvencygor.2011Q3
i2c=c('East of England','East Midlands','Greater London Authority','North East','North West','South East','South West','Wales','West Midlands','Yorkshire and The Humber')
i2$Geography=factor(i2$Geography,labels=i2c)

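#Align the data with the shapefile rows
#(merge() re-sorts its result by the key column, which can leave rows out of step with the polygons, so use match() to keep the shapefile's row order)
gor@data=data.frame(gor@data,i2[match(gor@data$NAME,i2$Geography),])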

#Plot the data using a greyscale
plot(gor,col=gray(gor@data$Creditors.Petition/max(gor@data$Creditors.Petition)))

And here’s the result:

Thematic map via augmented shapefile in R

Okay – so it’s maybe not the most informative of maps, it needs a scale, the London data is skewed, etc etc… But it shows that the recipe seems to work..
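On the skew point: one possible way of stopping a single large value from washing out the rest of the map is to shade by class rather than by raw proportion. The following is only a sketch, using made-up counts in place of gor@data$Creditors.Petition; the variable names (counts, bins, greys) are mine, and the choice of four quantile classes is arbitrary.

```r
# Made-up counts standing in for gor@data$Creditors.Petition (hypothetical values)
counts <- c(80, 95, 110, 120, 150, 160, 175, 190, 900)

# Proportional shading, as in the recipe above: one big value squashes the others
prop_shade <- gray(counts / max(counts))

# Quantile classes: four bins, each holding roughly the same number of regions
breaks <- quantile(counts, probs = seq(0, 1, by = 0.25))
bins <- cut(counts, breaks = breaks, include.lowest = TRUE)

# One grey per class; here darker means higher (the opposite of the recipe above)
greys <- gray(seq(0.9, 0.2, length.out = nlevels(bins)))
bin_shade <- greys[as.integer(bins)]

# The class labels can double as legend text, e.g.
# plot(gor, col = bin_shade)
# legend("topleft", legend = levels(bins), fill = greys)
```

Because the bin labels come straight from cut(), the same object supplies both the fill colours and the text for a legend, which would go some way to providing the missing scale.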

(Here’s a glimpse of how I worked my way to this example using a question to Stack Overflow: Plotting Thematic Maps in R Using Shapefiles and Data Files from Different Sources. Note: better solutions may have since been posted to that question, which may improve on the recipe provided in this post…)
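Two of the loose ends flagged in the code comments, the lookup table for region names and a more formulaic factor-to-number conversion, might be tidied up along the following lines. This is a sketch rather than a tested replacement for the recipe: the data frames shp and ins are toy stand-ins for gor@data and the insolvency subset, and all the numbers are invented.

```r
# Toy stand-ins for gor@data and the insolvency subset (all values invented)
shp <- data.frame(NAME = c("North East", "Greater London Authority",
                           "Yorkshire and The Humber", "East of England"),
                  stringsAsFactors = FALSE)
ins <- data.frame(
  Geography = factor(c("East", "London", "North East", "Wales",
                       "Yorkshire and the Humber")),
  Creditors.Petition = factor(c("120", "450", "80", "60", "150")),
  Debtors.Petition = factor(c("300", "700", "210", "190", "320"))
)

# Lookup table: insolvency name -> shapefile name, listing only names that differ
lookup <- c("East" = "East of England",
            "London" = "Greater London Authority",
            "Yorkshire and the Humber" = "Yorkshire and The Humber")

geog <- as.character(ins$Geography)
renamed <- geog %in% names(lookup)
geog[renamed] <- unname(lookup[geog[renamed]])
ins$Geography <- geog

# Convert every factor-coded count column in one pass
count_cols <- c("Creditors.Petition", "Debtors.Petition")
ins[count_cols] <- lapply(ins[count_cols],
                          function(x) as.numeric(as.character(x)))

# Regions in the data that have no shape (Wales, in this toy example)
unmatched <- setdiff(ins$Geography, shp$NAME)

# match() lines the data up against the shapefile's own row order
aligned <- data.frame(shp, ins[match(shp$NAME, ins$Geography), count_cols])
```

Listing only the names that differ keeps the lookup short, and checking unmatched before joining makes it obvious when a region (Wales, here) has data but no polygon.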

PS If the R thing is just too scary, here’s a recipe for plotting data using shapefiles in Google Fusion Tables [PDF] (alternative example) that makes use of the ShpEscape service for importing shapefiles into Fusion Tables (note that shpescape can be a bit slow converting an uploaded file and may appear to be doing nothing much at all for 10-20 minutes…). See also: Quantum GIS

Written by Tony Hirst

January 13, 2012 at 11:32 am

Different Speeches? Digital Skills Aren’t just About Coding…

with 4 comments

Secretary of State for Education, Michael Gove, gave a speech yesterday on rethinking the ICT curriculum in UK schools. You can read a copy of the speech variously on the Department for Education website, or, err, on the Guardian website.

Seeing these two copies of what is apparently the same speech, I started wondering:

a) which is the “best” source to reference?
b) how come the Guardian doesn’t add a disclaimer about the provenance of, and link to, the DfE version? [Note the disclaimer in the DfE version - "Please note that the text below may not always reflect the exact words used by the speaker."]
c) is the Guardian version an actual transcript, maybe? That is, does the Guardian reprint the “exact words” used by the speaker?

And that made me think I should do a diff… About which, more below…

Before that, however, here’s a quick piece of reflection on how these two things – the reinvention of the IT curriculum, and the provenance of, and value added to, content published on news and tech industry blog sites – collide in my mind…

So for example, I’ve been pondering what the role of journalism is, lately, in part because I’m trying to clarify in my own mind what I think the practice and role of data journalism are (maybe I should apply for a Nieman-Berkman Fellowship in Journalism Innovation to work on this properly?!). It seems to me that “communication” is one important part (raising awareness of particular issues, events, or decisions), and holding governments and companies to account is another. (Actually, I think Paul Bradshaw has called me out on that before, suggesting it was more to do with providing an evidence base through verification and triangulation, as well as comment, against which governments and companies could be held to account. Err, I think? As an unjournalist, I don’t have notes or a verbatim quote against which to check that statement, and I’m too lazy to email/DM/phone Paul to clarify what he may or may not have said… The extent of my checking is typically limited to what I can find on the web or in personal archives… which appear to be lacking on this point…)

Another thing I’ve been mulling over recently in a couple of contexts relates to the notion of what are variously referred to as digital or information skills.

The first context is “data journalism”, and the extent to which data journalists need to be able to do programming (in the sense of identifying the steps in a process that can be automated and how they should be sequenced or organised) versus writing code. (I can’t write code for toffee, but I can read it well enough to copy, paste and change bits that other people have written. That is, I can appropriate and reuse other people’s code, but can’t write it from scratch very well… Partly because I can’t ever remember the syntax and low level function names. I can also use tools such as Yahoo Pipes and Google Refine to do coding like things…) Then there’s the question of what to call things like URL hacking or (search engine) query building?

The second context is geeky computer techie stuff in schools, the sort of thing covered by Michael Gove’s speech at the BETT show on the national ICT curriculum (or lack thereof), which the educational digerati were all over on Twitter yesterday. Over the weekend, houseclearing my way through various “archives”, I came across all manner of press clippings from 2000-2005 or so about the activities of the OU Robotics Outreach Group, of which I was a co-founder (the web presence has only recently been shut down, in part because of the retirement of the sys admin on whose server the websites resided). This group ran an annual open meeting every November for several years hosting talks from the educational robotics community in the UK (from primary school to HE level). The group also co-ordinated the RoboCup Junior competition in the UK, ran outreach events, developed various support materials and activities for use with Lego Mindstorms, and led the EPSRC/AHRC Creative Robotics Research Network.

At every robotics event, we’d try to involve kids and/or adults in elements of problem solving, mechanical design, programming (not really coding…) based around some sort of themed challenge: a robot fashion show, for example, or a treasure hunt (both variants on edge following/line following;-) Or a robot rescue mission, as used in a day long activity in the “Engineering: An Active Introduction” (TXR120) OU residential school, or the 3 hour “Robot Theme Park” team building activity in the Masters level “Team Engineering” (T885) weekend school. [If you're interested, we may be able to take bookings to run these events at your institution. We can make them work at a variety of difficulty levels from KS3-4 and up;-)]

Given that working at the bits-atoms interface is where a lot of the not-purely-theoretical-or-hardcore-engineering innovation and application development is likely to take place over the next few years, any mandate to drop the “boring” Windows training ICT stuff in favour of programming (which I suspect can be taught in not only a really tedious way, but a really confusing and badly delivered way too) is probably Not the Best Plan.

Slightly better, and something that I know is currently being mooted for reigniting interest in computing, is the Raspberry Pi, a cheap, self-contained, programmable computer on a board (good for British industry, just like the BBC Micro was…;-) that allows you to work at the interface between the real world of atoms and the virtual world of bits that exists inside the computer. (See also things like the OU Senseboard, as used on the OU course “My Digital Life” (TU100).)

If schools were actually being encouraged to make a financial investment on a par with the level of investment around the introduction of the BBC Micro, back in the day, I’d suggest a 3D printer would have more of the wow factor…(I’ll doodle more on the rationale behind this in another post…) The financial climate may not allow for that (but I bet budget will manage to get spent anyway…) but whatever the case, I think Gove needs to be wary about consigning kids to lessons of coding hell. And maybe take a look at programming in a wider creative context, such as robotics. (The word “robotics” is one of the reasons why I think it’s seen as a very specialised, niche subject; we need a better phrase, such as “Creative Technologies”, which could combine elements of robotics, games programming, Photoshop, and, yes, Powerpoint too…) Hmm… thinks.. the OU has a couple of courses that have just come to the end of their life that between them provide a couple of hundred hours of content and activity on robotics (T184) and games programming (T151), and that we delivered, in part, to 6th formers under the OU’s Young Applicants in Schools Scheme.

Anyway, that’s all as maybe… Because there are plenty of digital skills that let you do coding like things without having to write code. Such as finding out whether there are any differences between the text in the DfE copy of Gove’s BETT speech, and the Guardian copy.

Copy the text from each page into a separate text file, and save it. (You’ll need a text editor for that..) Then, if you haven’t already got one, find yourself a good text editor. I use Text Wrangler on a Mac. (Actually, I think MS Word may have a diff function?)

Finding diffs between txt docs in Text Wrangler

The differences all tend to be in the characters used for quotation marks. (Character encodings are one of the things that can make all sorts of programmes fall over, or misbehave. Just being aware that they may cause a problem, as well as how and why, would be a great step in improving the baseline level of understanding of folk IT.) Some of the line breaks don’t quite match up either, but other than that, the text is the same.
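For anyone who does fancy a code-based check, the quotation mark issue can be sketched in a few lines of R. The two short vectors below are invented stand-ins for the lines of the two saved text files (they are not quotes from the speech), and normalise_quotes is a helper name I’ve made up for the purpose.

```r
# Invented stand-ins for the lines of the two saved text files
dfe <- c("\u201cComputer science\u201d is a rigorous subject.",
         "It\u2019s time for a change.")
guardian <- c("\"Computer science\" is a rigorous subject.",
              "It's time for a change.")

# Map curly double and single quotes to their plain ASCII equivalents
normalise_quotes <- function(x) {
  x <- gsub("[\u201c\u201d]", "\"", x)
  gsub("[\u2018\u2019]", "'", x)
}

# Line numbers that differ in the raw text...
raw_diffs <- which(dfe != guardian)

# ...and those that still differ once the quote characters are normalised
real_diffs <- which(normalise_quotes(dfe) != normalise_quotes(guardian))
```

With these toy lines, every line differs in the raw comparison and none differ after normalisation, which is the pattern you would expect if the two copies vary only in their quote characters.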

Now, this may be because Gove was a good little minister and read out the words exactly as they had been prepared. Or it may be the case that the Guardian just reprinted the speech without mentioning provenance, or the disclaimer that he may not actually have read the words of that speech (I have vague memories of an episode of Yes, Minister, here…;-)

Whatever the case, if you know: a) that it’s even possible to compare two documents to see if they are different (a handy piece of folk IT knowledge); and b) know a tool that does it (or how to find a tool that does it, or a person that may have a tool that can do it), then you can compare the texts for yourself. And along the way, maybe learn that churnalism, in a variety of forms, is endemic in the media. Or maybe just demonstrate to yourself when the media is acting in a purely comms, rather than journalistic, role?

PS other phrases in the area: “computational thinking”. Hear, for example: A conversation with Jeannette Wing about computational thinking

PPS I just remembered – there’s a data journalism hook around this story too… from a tweet exchange last night that I was reminded of by an RT:

josiefraser: RT @grmcall: Of the 28,000 new teachers last year in the UK, 3 had a computer-related degree. Not 3000, just 3.
dlivingstone: @josiefraser Source??? Not found it yet. RT @grmcall: 28000 new UK teachers last year, 3 had a computer-related degree. Not 3000, just 3
josiefraser: That ICT qualification teacher stat RT @grmcall: Source is the Guardian http://www.guardian.co.uk/education/2012/jan/09/computer-studies-in-schools

I did a little digging and found the following document on the General Teaching Council of England website – Annual digest of statistics 2010–11 – Profiles of registered teachers in England [PDF] – that contains demographic stats, amongst others, for UK teachers. But no stats relating to subject areas of degree level qualifications held, which is presumably the data referred to in the tweet. So I’m thinking: this is partly where the role of data journalist comes in… They may not be able to verify the numbers by checking independent sources, but they may be able to shed some light on where the numbers came from and how they were arrived at, and maybe even secure their release (albeit as a single point source?)

Written by Tony Hirst

January 12, 2012 at 1:10 pm
