Archive for the ‘Infoskills’ Category
Journalist Filters on Twitter – The Reuters View
It seems that Reuters has a new product out – Reuters Social Pulse. As well as highlighting “the stories being talked about by the newsmakers we follow”, there is an area highlighting “the Reuters & Klout 50 where we rank America’s most social CEOs.” Of note here is that this list is ordered by Klout score. Reuters don’t own Klout (yet?!) do they?!
The offering also includes a view of the world through the tweets of Reuters’ own staff. Apparently, “Reuters has over 3,000 journalists around the world, many of whom are doing amazing work on Twitter. That is too many to keep up with on a Twitter list, so we created a directory [Reuters Twitter Directory] that shows you our best tweeters by topic. It lets you find our reporters, bloggers and editors by category and location so you can drill down to business journalists in India, if you so choose, or tech writers in the UK.”
If you view the source of the Reuters Twitter directory page, you can find a JavaScript object that lists all(?) the folk in the Reuters Twitter directory and the tags they are associated with… Hmm, I thought… Hmmm…
If we grab that object, and pop it into Python, it’s easy enough to create a bipartite network that links journalists to the categories they are associated with:
#json (standard library) stands in here for the simplejson package used originally
import json
import networkx as nx
#http://mlg.ucd.ie/files/summer/tutorial.pdf
from networkx.algorithms import bipartite

g = nx.Graph()

#need to bring in reutersJournalistList (the JSON text copied from the page source)
users = json.loads(reutersJournalistList)

#I had some 'issues' with the parsing for some reason? Required this hack in the end...
for user in users:
    for x in user:
        if x == 'users':
            u = user[x][0]['twitter_screen_name']
            print('user:', u)
            for topic in user[x][0]['topics']:
                print('- topic:', topic)
                #Add edges from journalist name to each tag they are associated with
                g.add_edge(u, topic)

#print(bipartite.is_bipartite(g))
#print(bipartite.sets(g))

#Save a graph file we can visualise in Gephi corresponding to bipartite graph
nx.write_graphml(g, "usertags.graphml")

#We can find the sets of names/tags associated with the disjoint sets in the graph
#(recent NetworkX versions raise an error here if the graph is disconnected;
#check which of the two returned sets actually holds the journalists)
users, tags = bipartite.sets(g)

#Collapse the bipartite graph to a graph of journalists connected via a common tag
ugraph = bipartite.projected_graph(g, users)
nx.write_graphml(ugraph, "users.graphml")

#Collapse the bipartite graph to a set of tags connected via a common journalist
tgraph = bipartite.projected_graph(g, tags)
nx.write_graphml(tgraph, "tags.graphml")

#Dump a list of the journalists' Twitter IDs
with open("users.txt", "w") as f:
    for uo in users:
        f.write(uo + '\n')
Having generated graph files, we can then look to see how the tags cluster as a result of how they were applied to journalists associated with several tags:
Alternatively, we can look to see which journalists are connected by virtue of being associated with similar tags (hmm, I wonder if edge weight carries information about how many tags each connected pair may be associated through?). In this case, I size the nodes by betweenness centrality to try to highlight journalists that bridge topic areas:
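On the edge weight question: as far as I can tell, the plain NetworkX projection only records that a pair shares at least one tag, whereas the weighted projection counts the shared tags. Here is a minimal sketch on a toy graph; the journalist and tag names are made up for illustration and are not taken from the Reuters data.

```python
import networkx as nx
from networkx.algorithms import bipartite

# Toy bipartite graph: made-up journalists linked to made-up tags
g = nx.Graph()
g.add_edges_from([
    ("alice", "tech"), ("alice", "media"),
    ("bob", "tech"), ("bob", "media"),
    ("carol", "media"),
])
journalists = {"alice", "bob", "carol"}

# The plain projection only records that a shared tag exists
plain = bipartite.projected_graph(g, journalists)

# The weighted projection stores the number of shared tags as 'weight'
weighted = bipartite.weighted_projected_graph(g, journalists)

for a, b, data in weighted.edges(data=True):
    print(a, b, data["weight"])
```

Writing the weighted graph out with nx.write_graphml should then carry the weight attribute through to Gephi, where it can be used to set edge thickness.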
Association through shared tags (as applied by Reuters) is one thing, but there is also structure arising from friendship networks…So to what extent do the Reuters Twitter List journalists follow each other (again, sizing by betweenness centrality):
Finally, here’s a quick look at folk followed by 15 or more of the folk in the Reuters Twitter journalists list: this is the common source area on Twitter for the journalists on the list. This time, I size nodes by eigenvector centrality.
So why bother with this? Because journalists provide a filter onto the way the world is reported to us through the media, and as a result the perspective we have of the world as portrayed through the media. If we see journalists as providing independent fairwitness services, then having some sort of idea about the extent to which they are sourcing their information severally, or from a common pool, can be handy. In the above diagram, for example, I try to highlight common sources (folk followed by at least 15 of the journalists on the Twitter list). But I could equally have got a feeling for the range of sources by producing a much larger and sparser graph, such as all the folk followed by journalists on the list, or folk followed by only 1 person on the list (40,000 people or so in all – see below), or by 2 to 5 people on the list…
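The "followed by at least 15" and "followed by 2 to 5" filters come down to counting how many list members follow each account. A rough sketch, assuming the friends lists have already been harvested into a dict keyed by journalist; the function name and the toy data are my own, for illustration only.

```python
from collections import Counter

def followed_by(friends_by_journalist, low=15, high=None):
    """Return accounts followed by between low and high list members (inclusive)."""
    counts = Counter()
    for friends in friends_by_journalist.values():
        # set() guards against an account appearing twice in one friends list
        counts.update(set(friends))
    return {
        account: n
        for account, n in counts.items()
        if n >= low and (high is None or n <= high)
    }

# Toy data: three made-up journalists and the accounts they follow
friends = {
    "j1": ["a", "b"],
    "j2": ["a", "c"],
    "j3": ["a", "b"],
}
common = followed_by(friends, low=2)             # the shared pool
singletons = followed_by(friends, low=1, high=1)  # followed by only one person
print(common, singletons)
```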
Friends lists are one sort of filter every Twitter user has onto the content being shared on Twitter, and something that’s easy to map. There are other views of course – the list of people mentioning a user is readily available to every Twitter user, and it’s easy enough to set up views around particular hashtags or search terms. Grabbing the journalists associated with one or more particular tags, and then mapping their friends (or, indeed, followers) is also possible, as is grabbing the follower lists for one or more journalists and then looking to see who the friends of the followers are, thus positioning the journalist in the social media environment as perceived by their followers.
I’m not sure what value Reuters sees in the stream of tweets from the folk on its Twitter journalists lists, or the Twitter networks they have built up, but the friend lenses at least we can try to map out. And via the bipartite user/tag graph, it also becomes trivial for us to find journalists with interests in Facebook and advertising, for example…
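Finding journalists who carry several tags at once is a set intersection over the tag nodes' neighbours in the bipartite graph. A small sketch, with invented screen names and a helper function of my own naming:

```python
import networkx as nx

def journalists_with_all_tags(g, tags):
    """Return the nodes connected to every one of the given tag nodes."""
    tags = list(tags)
    if not tags or any(tag not in g for tag in tags):
        return set()
    neighbour_sets = [set(g[tag]) for tag in tags]
    return set.intersection(*neighbour_sets)

# Toy user/tag graph; the screen names are made up for illustration
g = nx.Graph()
g.add_edges_from([
    ("reporter_a", "Facebook"), ("reporter_a", "advertising"),
    ("reporter_b", "Facebook"),
    ("reporter_c", "advertising"), ("reporter_c", "Facebook"),
])

both = journalists_with_all_tags(g, ["Facebook", "advertising"])
print(sorted(both))
```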
PS for associated techniques related to the emergent social positioning of hashtags and shared links on Twitter, see Socially Positioning #Sherlock and Dr John Watson’s Blog… and Social Media Interest Maps of Newsnight and BBCQT Twitterers. For a view over @skynews Twitter friends, and how they connect, see Visualising How @skynews’ Twitter Friends Connect.
Mapping Corporate Twitter Account Networks Using Twitter Contributions/Contributees API Calls
Savvy users of social networks are probably well-versed in the idea that corporate Twitter accounts are often “staffed” by several individuals (often identified by the ^AB convention at the end of a tweet, where AB are the initials of the person wearing that account hat (^)); they may also know that social media accounts for smaller companies may actually be operated by a PR company or “social media guru” who churns out tweets on their behalf via Twitter accounts operated in the company’s name and in support of its online marketing activity.
Rooting around the Twitter API looking for something else, I spotted a GET users/contributees API call, along with a complementary GET users/contributors call; these return “an array of users (i.e. Twitter accounts) that the specified user can contribute to”, and the accounts that can contribute to a particular Twitter account, respectively.
I didn’t know this functionality existed, so I put out a fishing tweet to see if anyone knew of any accounts running this feature other than the twitterapi account used by way of example in the API documentation. A response from Martin Hawksey (on whom I’m increasingly reliant for helping me keep up and get my head around the daily novelties that the web throws up!), suggested it was a feature that has been quietly rolling out to premium users: Twitter Starts Rolling Out Contributors Feature, Salesforce Activated. Via his reading of that post (I think), Martin suggested that a Bing(;-) search for site:twitter.com “via web by” would turn up a few likely candidates, and so it did…
So why’s this interesting? Because given the ID of an account that a company uses for corporate tweets, or the ID of a user who also contributes to a corporate account via their own account, we might be able to map out something of the corporate comms network for an organisation operating multiple accounts (maybe a company, but maybe also a government department, local council, or lobbyist group), or the client list of a “social media guru” operating various accounts for different SMEs.
Anyway, here’s a quick script for exploring the Twitter contributors/contributees API. The output is a graphml file we can visualise in Gephi.
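For a flavour of the snowballing involved, here is a sketch of the general approach (not the script itself). It assumes two lookup functions, get_contributors and get_contributees, which are hypothetical stand-ins for whatever wraps the corresponding API calls; in the demo they simply read from made-up dictionaries.

```python
from collections import deque
import networkx as nx

def build_contribution_graph(seed, get_contributors, get_contributees, max_depth=2):
    """Breadth-first crawl outwards from seed; edges point contributor -> account."""
    g = nx.DiGraph()
    seen = {seed}
    queue = deque([(seed, 0)])
    while queue:
        account, depth = queue.popleft()
        g.add_node(account)
        if depth >= max_depth:
            continue
        links = [(c, account) for c in get_contributors(account)]
        links += [(account, t) for t in get_contributees(account)]
        for src, dst in links:
            g.add_edge(src, dst)
            for other in (src, dst):
                if other not in seen:
                    seen.add(other)
                    queue.append((other, depth + 1))
    return g

# Made-up data standing in for API responses
contributors = {"corp": ["ann", "ben"], "corp2": ["ann"]}
contributees = {"ann": ["corp", "corp2"], "ben": ["corp"]}

graph = build_contribution_graph(
    "corp",
    lambda a: contributors.get(a, []),
    lambda a: contributees.get(a, []),
)
print(sorted(graph.edges()))
# nx.write_graphml(graph, "contributors.graphml")  # uncomment to save for Gephi
```

The max_depth cap is a design choice of this sketch, there to stop the crawl running away when a prolific contributor links many accounts together.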
And here are a couple of views over what it comes up with. Firstly, a map bootstrapped from the @twitterapi account:
And here’s one I built out from HuffingtonPost:
So what do we learn from this? Firstly it’s yet another example of how networks get everywhere. Secondly, it raises the question (for me) of whether there are any cribs in other multi-contributor social network apps (maybe in tweet metadata) that allow us to identify originating authors/users and hence find a way into mapping their contribution networks.
As well as building out from an account name to which users contribute, we can bootstrap a map from a user who is known to contribute to one or more accounts (code not included in Github gist atm).
So for example, here’s a map built out from user @VeeVee:
I guess one of the next questions from a tool building point of view is: is there a more reliable way of getting cribs into possible contributor/contributee networks? Another is: are any other multi-contributor services (on Twitter or other networks, such as Google+) similarly mappable?
PS Just noticed this: Google to drop Google Social API. I also read on a Google blog that the Needlebase screenscraping tool Google acquired as part of the ITA acquisition will be shut down later this year…
Invisible Library Support – Now You Can’t Afford Not to be Social?
If you live by pop tech feed or Twitter, you’ve probably heard that Google is rolling out a new style of socially powered search results. If not, or if you’re still not clear about what it entails, read Phil Bradley’s post on the matter: Why Google Search Plus is a disaster for search.
Done that? If not, why not? This post isn’t likely to make much sense at all if you don’t know the context. Here’s the link again: Why Google Search Plus is a disaster for search
So the starting point for this post is this: Google is in the process of rolling out a new web search service that (optionally) offers very personal search results that contain content from folk that Google thinks you’re associated with, and that Google is willing to show you based on license agreements and corporate politics.
Think about this for a minute…. in the totally personalised view, folk will only see content that their friends have published or otherwise shared…
In Could Librarians Be Influential Friends?, I wondered aloud whether it made sense for librarians and other folk involved with providing support relating to resource discovery and recommendation to start a) creating social network profiles and encouraging their patrons to friend them, and b) start recommending resources using those profiles in order to start influencing the ordering/ranking of results in patrons’ search results based on those personal recommendations. The idea here was that you could start to make invisible frictionless recommendations by influencing the search engine results returned to your patrons (the results aren’t invisible because your profile picture may appear by the result showing that you recommend it. They’re frictionless in the sense that having made the original recommendation, you no longer have to do any work in trying to bring it to the attention of your patron – the search engines take care of that for you (okay, I know that’s a simplistic view;-). [Hmm.. how about referring to it as recommendation mode support?]
(Note that there is a complementary form of support to the approach which I’ve previously referred to as Invisible Library Tech Support (responsive mode support?; which I guess is also frictionless, at least from the perspective of the patron) in which librarians friend their patrons or monitor generic search terms/tags on Q&A sites and then proactively respond to requests that users post into their social networks more generally.)
With the aggressive stance Google now seems to be taking towards pushing social circle powered results, I think we need to face up to the fact – as Phil Bradley pointed out – that if librarians want to make sure they’re heard by their patrons, they’re going to need to start setting up social profiles, getting their patrons to friend them, and start making content and resource recommendations just anyway in order to make them available as resources that are indexed by patrons’ personal search engines. The same goes for publishers of OERs, academic teaching staff, and “courses”.
If we think of Google social search as searching over custom search engines bound by resources created and recommended by members of a user’s social circle, if you want to make (invisible) recommendations to a user via their (personalised) web search results, you’re going to need to make sure that the resources/content you want to recommend are indexed by their personal search engines. Which means: a) you need to friend them; and b) you need to share that content/those resources in that social context.
(Hmmm…this makes me think there may be something in the course custom search engine approach after all… Specifically, if the course has a social profile, and recommends the links contained within the course via that profile, they become part of the personalised search index of students following that course profile?)
Just by the by, as another example of Google completely messing things up at the moment, I notice that when I share links to posts on this blog via Google+, they don’t appear as trackbacks to the post in question. Which means that if someone refers to a post on this blog on Google+, I don’t know about it… whereas if they blog the link, I do…
See also my chronologically ordered posts on the eroding notion of “Google Ground Truth”.
[Invisible vs frictionless (and various notions of that word) is all getting a bit garbled; see eg @briankelly's Should Higher Education Welcome Frictionless Sharing and my comments to it for a little more on this...]
PS I’ve been getting increasingly infuriated by the clutter around, and lack of variation within, Google search results lately, so I changed my default search engine to Bing. The results are a bit all over the place compared to the Google results I tend to get, but this may be down in part to personalisation/training. I am still making occasional forays to Google, but for now, Bing is it… (because Bing is not Google…)
PPS Hah – just noticed: Google Search Plus doesn’t mean plus in the sense of search more, it means search Google+, which is less, or minus the wider world view…;-)
PPPS I keep meaning to blog this, and keep forgetting: Turn[ing] off [Google] search history personalization, in particular: “If you’ve disabled signed-out search history personalization, you’ll need to disable it again after clearing your browser cookies. Clearing your Google cookie clears your search settings, thereby turning history-based customizations back on.” Which is to say, when you disable personalisation, you don’t disable personalisation against your Google account, you disable it only insofar as it relates to your current cookie ID?
Amateur Mapmaking: Getting Started With Shapefiles
One of the great things about (software) code is that people build on it and out from it… Which means that as well as producing ever more complex bits of software, tools also get produced over time that make it easier to do things that were once hard to do, or required expensive commercial software tools.
Producing maps is a fine example of this. Not so very long ago, producing your own annotated maps was a hard thing to do. Then in June, 2005, or so, the Google Maps API came along and suddenly you could create your own maps (or at least, put markers on to a map if you had latitude and longitude co-ordinates available). Since then, things have just got easier. If you want to put markers on a map just given their addresses, it’s easy (see for example Mapping the New Year Honours List – Where Did the Honours Go?). You can make use of Ordnance Survey maps if you want to, or recolour and style maps so they look just the way you want.
Sometimes, though, when using maps to visualise numerical data sets, just putting markers onto a map, even when they are symbols sized proportionally in accordance with your data, doesn’t quite achieve the effect you want. Sometimes you just have to have a thematic, choropleth map:

The example above is taken from an Ordnance Survey OpenSpace tutorial, which walks you through the creation of thematic maps using the OS API.
But what do you do if the boundaries/shapes you want to plot aren’t supported by the OS API?
One of the common ways of publishing boundary data is in the form of shapefiles (suffix .shp, though they are often combined with several other files in a .zip package). So here’s a quick first attempt at plotting shapefiles and colouring them according to an appropriately defined data set.
The example is based on a couple of data sets – shapefiles of the English Government Office Regions (GORs), and a dataset from the Ministry of Justice relating to insolvencies that, amongst other things, describes numbers of insolvencies per time period by GOR.
The language I’m using is R, within the RStudio environment. Here’s the code:
#Download English Government Office Network Regions (GOR) from:
#http://www.sharegeo.ac.uk/handle/10672/50
##tmpdir/share geo loader courtesy of http://stackoverflow.com/users/1033808/paul-hiemstra
tmp_dir = tempdir()
url_data = "http://www.sharegeo.ac.uk/download/10672/50/English%20Government%20Office%20Network%20Regions%20(GOR).zip"
zip_file = sprintf("%s/shpfile.zip", tmp_dir)
download.file(url_data, zip_file)
unzip(zip_file, exdir = tmp_dir)
library(maptools)
#Load in the data file (could this be done from the downloaded zip file directly?)
gor=readShapeSpatial(sprintf('%s/Regions.shp', tmp_dir))
#I can plot the shapefile okay...
plot(gor)
Here’s what it looks like:
#I can use these commands to get a feel for the data contained in the shapefile...
summary(gor)
attributes(gor@data)
gor@data$NAME
#[1] North East North West
#[3] Greater London Authority West Midlands
#[5] Yorkshire and The Humber South West
#[7] East Midlands South East
#[9] East of England
#9 Levels: East Midlands East of England ... Yorkshire and The Humber
#download data from http://www.justice.gov.uk/downloads/publications/statistics-and-data/courts-and-sentencing/csq-q3-2011-insolvency-tables.csv
insolvency<- read.csv("http://www.justice.gov.uk/downloads/publications/statistics-and-data/courts-and-sentencing/csq-q3-2011-insolvency-tables.csv")
#Grab a subset of the data, specifically to Q3 2011 and numbers that are aggregated by GOR
insolvencygor.2011Q3=subset(insolvency,Time.Period=='2011 Q3' & Geography.Type=='Government office region')
#tidy the data - you may need to download and install the gdata package first
#The subsetting step doesn't remove extraneous original factor levels, so I will.
require(gdata)
insolvencygor.2011Q3=drop.levels(insolvencygor.2011Q3)
names(insolvencygor.2011Q3)
#[1] "Time.Period" "Geography"
#[3] "Geography.Type" "Company.Winding.up.Petition"
#[5] "Creditors.Petition" "Debtors.Petition"
levels(insolvencygor.2011Q3$Geography)
#[1] "East" "East Midlands"
#[3] "London" "North East"
#[5] "North West" "South East"
#[7] "South West" "Wales"
#[9] "West Midlands" "Yorkshire and the Humber"
#Note that these names for the GORs don't quite match the ones used in the shapefile, though how they relate one to another is obvious to us...
#So what next? [That was the original question...!]
#Here's the answer I came up with...
#Convert factors to numeric [ http://stackoverflow.com/questions/4798343/convert-factor-to-integer ]
#There's probably a much better formulaic way of doing this/automating this?
insolvencygor.2011Q3$Creditors.Petition=as.numeric(levels(insolvencygor.2011Q3$Creditors.Petition))[insolvencygor.2011Q3$Creditors.Petition]
insolvencygor.2011Q3$Company.Winding.up.Petition=as.numeric(levels(insolvencygor.2011Q3$Company.Winding.up.Petition))[insolvencygor.2011Q3$Company.Winding.up.Petition]
insolvencygor.2011Q3$Debtors.Petition=as.numeric(levels(insolvencygor.2011Q3$Debtors.Petition))[insolvencygor.2011Q3$Debtors.Petition]
#Tweak the levels so they match exactly (really should do this via a lookup table of some sort?)
i2=insolvencygor.2011Q3
i2c=c('East of England','East Midlands','Greater London Authority','North East','North West','South East','South West','Wales','West Midlands','Yorkshire and The Humber')
i2$Geography=factor(i2$Geography,labels=i2c)
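#Join the data onto the shapefile
#(merge() re-sorts rows by the key, which would break the row-to-polygon alignment,
#so look rows up with match() to keep the shapefile's own row order)
gor@data=data.frame(gor@data,i2[match(gor@data$NAME,i2$Geography),])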
#Plot the data using a greyscale
plot(gor,col=gray(gor@data$Creditors.Petition/max(gor@data$Creditors.Petition)))
And here’s the result:
Okay – so it’s maybe not the most informative of maps, it needs a scale, the London data is skewed, etc etc… But it shows that the recipe seems to work..
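On the "really should do this via a lookup table" point in the code comments, the idea is language independent, so here is a minimal sketch in Python rather than R. The region names are the ones printed in the listings above; the dictionary and function names are my own, and only the names that actually differ need an entry.

```python
# Names as they appear in the shapefile (from gor@data$NAME above)
SHAPEFILE_NAMES = {
    "North East", "North West", "Greater London Authority", "West Midlands",
    "Yorkshire and The Humber", "South West", "East Midlands", "South East",
    "East of England",
}

# Lookup table: insolvency dataset name -> shapefile name (only the mismatches)
NAME_LOOKUP = {
    "East": "East of England",
    "London": "Greater London Authority",
    "Yorkshire and the Humber": "Yorkshire and The Humber",
}

def to_shapefile_name(name):
    """Translate a dataset region name, leaving already-matching names alone."""
    return NAME_LOOKUP.get(name, name)

dataset_names = [
    "East", "East Midlands", "London", "North East", "North West",
    "South East", "South West", "Wales", "West Midlands",
    "Yorkshire and the Humber",
]

translated = [to_shapefile_name(n) for n in dataset_names]
unmatched = [n for n in translated if n not in SHAPEFILE_NAMES]
print(unmatched)
```

Checking for unmatched names before joining makes it obvious which rows (here, the one region with no polygon in the English GOR shapefile) will have nowhere to go on the map.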
(Here’s a glimpse of how I worked my way to this example using a question to Stack Overflow: Plotting Thematic Maps in R Using Shapefiles and Data Files from Different Sources (note: better solutions may have since been posted to that question, which may improve on the recipe provided in this post…))
PS If the R thing is just too scary, here’s a recipe for plotting data using shapefiles in Google Fusion Tables [PDF] (alternative example) that makes use of the ShpEscape service for importing shapefiles into Fusion Tables (note that shpescape can be a bit slow converting an uploaded file and may appear to be doing nothing much at all for 10-20 minutes…). See also: Quantum GIS
JISC Project Blog Metrics – Making Use of WordPress Stats. Plus, An Aside…
Brian has a post out on Beyond Blogging as an Open Practice, What About Associated Open Usage Data?, and proposes that “when adopting open practices, one should be willing to provide open accesses to usage data associated with the practices” (his emphasis).
What usage stats are relevant though? If you’re on a hosted WordPress blog, it’s easy enough to pull out in a machine readable way the stats that WordPress collects about your blog and makes available to you (albeit at the cost of revealing a blog specific API key in the URL. Which means that if this key provides access to anything other than stats, particularly if it provides write access to any part of your blog, it’s probably not something you’d really want to share in public… [Getting your WordPress.com Stats API Key])
That said, you can still hand craft your own calls to the WordPress Stats API, and extract your own usage data as data.
So for example, a URL of the form:
http://stats.wordpress.com/csv.php?api_key=YOURKEY&blog_uri=BLOG.EXAMPLE.COM&end=2011-11-30&table=views
will pull in a summary of November’s views data; or:
http://stats.wordpress.com/csv.php?api_key=KEY&blog_uri=YOURBLOG&end=2011-11-30&table=referrers_grouped
will pull in a list of referrers.
For what it’s worth, I’ve started cobbling together a spreadsheet that can pull in live data, or custom ranged reports, from WordPress: WordPress Stats into Google Spreadsheets (make your own personal copy of the spreadsheet if you want to give it a try). This may or may not become a work in progress… at the moment, it doesn’t even support the full range of URL parameters/report configurations (for the time being at least, that is left “as an exercise for the reader”;-)
The approach I took is very simplistic, simply based around crafting URLs that grab specified sets of CSV formatted data, and pop them into a spreadsheet using the =importData() formula (I’m sure Martin could come up with something far more elegant;-); that said, it does provide an example of how to get started with a bit of programmatic URL hacking… and if you want to get started with handcrafting your own URLs, it provides a few examples there too….:-)
The pattern I used was to define a parameter spreadsheet, and then CONCATENATE parameter values to create the URLs; for example:
=importdata(CONCATENATE("http://stats.wordpress.com/csv.php?", "api_key=", Config!B2, "&blog_uri=", Config!B3, "&end=", TEXT(Config!B6,"YYYY-MM-DD"), "&table=referrers_grouped"))
One trick to note is that I defined the end parameter setting in the configuration sheet as a date type, displayed in a particular format. When we grab this data value out of the configuration sheet we’re actually grabbing a date typed record, so we need to use the TEXT() formula to put it into the format that the WordPress API requires (arguments of the form 2011-11-30).
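The same pattern (keep the parameters in one place, format the date, then assemble the URL) can be sketched outside the spreadsheet too. This uses only the URL parameters shown in the examples above; the function name and the placeholder key and blog address are illustrative.

```python
from datetime import date
from urllib.parse import urlencode

BASE_URL = "http://stats.wordpress.com/csv.php"

def stats_url(api_key, blog_uri, end, table):
    """Build a stats URL; end is a date object, formatted as YYYY-MM-DD."""
    params = {
        "api_key": api_key,
        "blog_uri": blog_uri,
        "end": end.strftime("%Y-%m-%d"),
        "table": table,
    }
    return BASE_URL + "?" + urlencode(params)

# Placeholder credentials: substitute your own key and blog address
url = stats_url("YOURKEY", "blog.example.com", date(2011, 11, 30), "views")
print(url)
```

Using urlencode rather than plain string concatenation means any awkward characters in a parameter value get escaped for you.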
If you want to use the spreadsheet to publish your own data, I guess one way would be to keep the privacy settings private, but publish the sheets you are happy for people to see. Just make sure you don’t reveal your API key;-) [If you know of a good link/resource describing best practice around publishing public sheets from spreadsheets that also contain, and draw on, private data, such as API keys, please post a link in the comments below;-)]
[A note on the stats: the WordPress stats made available via the API seem to relate to page views/visits to the website. Looking at my own stats, views from RSS feeds seem to be reported separately, and (I think) this data is not available via the WordPress stats API? If, as I do, you run your blog RSS feed through a service like Feedburner, to get a fuller picture of how widely the content on a blog is consumed, you'd need to report both the WordPress stats and the Feedburner stats, for example. Which leads to the next question, I guess: how can we (indeed, can we at all?) pull feed stats out of Feedburner?]
At this point, I need to come back to the question raised above: what usage stats are relevant, particularly in the case of a JISC project blog? To my mind, a JISC project blog can support a variety of functions:
- it serves as a diary for the project team allowing them to record micro-milestones and solutions to problems; if developers are allowed to post to the blog, this might include posts at the level of granularity of a Stack Overflow Q and A, compared to the 500 word end-of-project post that tries to summarise how a complete system works;
- it can provide a feed that others can subscribe to to keep up to date with the project without having to hassle the project team for updates;
- it can provide context for the work by linking out to related resources, an approach that also might alert other projects who watch for trackbacks and pingbacks to the project;
- it provides an opportunity to go fishing in a couple of ways: firstly, by acting as a resource others can link to (with the triple payoff that it contextualises the project further, it may suggest related work the project team are unaware of by means of trackbacks/pingbacks into the project blog, and it may turn up useful commentary around the project); secondly, by providing a place where other interested parties might engage in discussion, commentary or feedback around elements of the project, via blog comments.
Even if a blog only ever gets three views per post, they may be really valuable views. For me what’s important is how the blog can be used to document interesting things that might have been turned up in the course of doing the project that wouldn’t ordinarily get documented. Problems, gotchas, clever solutions, the sudden discovery of really useful related resources. The blog also provides an ongoing link-basis for the project, something that can bring it to life in a networked context (a context that may have a far longer life, and scope, than just the life or scope of the actual project).
For many projects that don’t go past a pilot, it may well be that the real value of the project is the blogged documentation of things turned up during the process, rather than any of the formal outputs… Maybe..?!;-)
PS in passing, Google Webmaster tools now lets you track search stats around articles Google associates you with as an author: Clicks and impressions for authors. It’s been some time since I looked at Google Webmaster tools, but as Ouseful.info is registered there, I thought I’d check my broken links…and realised just how many pages get logged by Google as containing broken links when a single post erroneously contains a relative link… (i.e. when the <a href=’ doesn’t start with http://)
PPS Related to the above is a nice example of why I think being able to read and write URLs is an important skill, something Jon Udell also picks up on in Forgotten knowledge. In the above case, I needed to unpick the WordPress Stats API documentation a little to work out how to put the URLs together (something that a knowledge of how to read and write URLs helped me with). Jon Udell’s case was an example of how a conference organiser was able to send a customised URL to the conference hotel that embedded the relevant booking dates.
But I wonder, in an age where folk use Google+search term (e.g. typing Facebook into Google) rather than URLs (eg typing facebook.com into a browser location bar), a behaviour that can surely only be compounded by the fusion of location and search bars in browsers such as Google Chrome, is “URL literacy” becoming even more of a niche skill, rather than becoming more widespread? Is there some corollary here to the world of phones and addressbooks? I don’t need to remember phone numbers any more (I don’t even necessarily recognise them) because my contacts lists masks the number with the name of the person it corresponds to. How many kids are going to lose out on a basic education in map reading because there’s no longer a need to learn route planning or map-based navigation – GPS, SatNav and online journey planners now do that for us… And does this distancing from base skills and low level technologies extend further? Into the kitchen, maybe? Who needs ingredients when you have ready meals (and yes, frozen croissants and gourmet meals from the farm shop do count as ready meals;-), for example? Who needs to actually use a cookery book (or really engage with a lecture) when you can watch a TV chef, (or TED Talks)..?
Information Literacy, Graphs, Hierarchies and the Structure of Information Networks
Over dinner at Côte in Cambridge last week, during the Arcadia Project review event, I doodled a couple of data structures, one on either side of a scrap of paper, and asked my co-Arcadians what sort of thing the drawing might represent, or what the structures they described might be called in general terms.
The sketches were broadly along the lines of the following, though without the circular nodes and labels displayed, just a set of connecting lines:
and:
So if I asked you the same question (what would you call these two different things?), how would you answer?
To my mind, the different organisational structures these represent, and how we can exploit and manipulate them, represents a whole host of issues in the reimagining of information literacy and the teaching of information skills. This ranges from an understanding of the structure of information spaces through the representation and analysis of those structures, to ways in which we can navigate and discover things in those spaces as well as how we can visualise and otherwise make sense of them.
So how would I describe the two different things shown above? The first image represents a hierarchy and is often referred to as a tree. Many library classification schemes, and many organisational management structures, are based around that sort of information structure.
The second image is a depiction of a more general network structure. Whenever I talk about graphs on the OUseful.info blog (in fact, pretty much whenever I talk about a graph anywhere), that’s the sort of thing I’m talking about. This mess of connections is the way the web is structured. (The tree structure is also a graph, but subject to particular constraints; can you work out what some of those constraints might be?)
Note: it’s maybe worth reiterating at this point that when I talk about graphs, I mean the messy network thing, not line charts like this:
One of the terms I got to describe one of the graphs was “a matrix”. Matrices are in fact a very powerful way of describing the structure of a graph – if you fancy a treasure hunt, the terms adjacency matrix and incidence matrix should give you a head start…
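By way of a head start on that treasure hunt, here's a minimal sketch of how both sorts of matrix can be built from a plain list of connections. The node names and edges are made up for illustration, and I'm assuming an undirected graph; treat it as a doodle rather than a definitive recipe.

```python
# A small undirected graph described purely as "what connects to what".
# The node names and edges are invented for illustration.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]
nodes = sorted({n for edge in edges for n in edge})
index = {name: i for i, name in enumerate(nodes)}

# Adjacency matrix: one row and one column per node;
# a 1 means the row node and the column node are directly connected.
adjacency = [[0] * len(nodes) for _ in nodes]
for a, b in edges:
    adjacency[index[a]][index[b]] = 1
    adjacency[index[b]][index[a]] = 1  # undirected, so mirror the entry

# Incidence matrix: one row per node, one column per edge;
# a 1 means the row node is an endpoint of the column edge.
incidence = [[0] * len(edges) for _ in nodes]
for j, (a, b) in enumerate(edges):
    incidence[index[a]][j] = 1
    incidence[index[b]][j] = 1

for name, row in zip(nodes, adjacency):
    print(name, row)
```

Each column of the incidence matrix contains exactly two 1s (the two ends of an edge), whereas the adjacency matrix of an undirected graph is symmetric about its diagonal.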
I’m not sure what the problem is, but I think there is a problem that arises from not appreciating how powerful graph structures are as a way of making sense of the world. And I’m not really sure what I wanted to say in this post… except maybe go on a little fishing expedition to see how widespread the lack of familiarity with the notion of a graph as something like this:
really is…? So, if I asked you to draw a graph: a) what would you draw? b) would you even remotely consider drawing something like the image directly above? If you answered “no” to (b), does it “say” anything to you at all?! Would you ever draw a diagram that had that flavour when explaining something (what?!) to someone else? (And the same question for the hierarchy…?)
PS a nice thing about graphs is you don’t have to draw them by hand – all you have to do is describe what connects to what, and then you can let a machine draw it for you. So for example:
- here is the “source code” for the tree
- here is the “source code” for the messy network graph
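To give a flavour of what "describing what connects to what" can look like, here's a rough Python sketch in which each structure is just a list of connected pairs. The node names are hypothetical, not the ones used in the sketches above. The is_tree function is my own illustrative helper: it checks one common statement of the constraints that make a graph a tree (everything is connected, and there is exactly one fewer edge than there are nodes).

```python
from collections import defaultdict, deque

def is_tree(edges):
    """Return True if the undirected edge list describes a tree:
    connected, with exactly one fewer edge than nodes (so no cycles)."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    nodes = set(neighbours)
    if not nodes or len(edges) != len(nodes) - 1:
        return False
    # Breadth-first search from an arbitrary node to check connectivity
    start = next(iter(nodes))
    seen = {start}
    queue = deque([start])
    while queue:
        for nxt in neighbours[queue.popleft()]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen == nodes

# Hypothetical "source code" for a hierarchy...
tree_edges = [("root", "a"), ("root", "b"), ("a", "a1"), ("a", "a2"), ("b", "b1")]
# ...and for a messier network made by adding a couple of cross links
network_edges = tree_edges + [("a1", "b1"), ("a2", "b")]

print(is_tree(tree_edges))
print(is_tree(network_edges))
```

The same edge list could be handed to a layout tool to do the drawing; the point here is only that the description is nothing more than a list of pairs.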
PPS when folk hear other folk wittering on about “the social graph”, what do they think it is? If asked to draw an indicative sketch of “the social graph”, what would they draw?!
Citation Positioning
It’s been years and years since I did either a formal literature review, or used a reference manager like EndNote or RefWorks in anger, but whilst at the Arcadia Project review in Cambridge a couple of days ago, I started wondering what sorts of ‘added value’ features I’d like to see, maybe even expect, from referencing software nowadays…
One of the ideas I’ve been playing with recently is the idea of emergent social positioning (ESP;-) in online social networks, which I’m defining in terms of where an individual or an expression of a particular interest group might be positioned in terms of the socially projected interests of people following that person or interest group.
For the case of an individual, the approach I’m taking is to look at who the followers of that individual follow to any great extent; for the case of an interest group, as evidenced by users of a particular hashtag, for example, it might be to look at who the followers of the users of the hashtag also follow in significant numbers.
A slightly more constrained approach might be to look at how the followers of the individual or the hashtag users follow each other (a depth 1.5 follower network about an individual or set of individuals, in effect).
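As a rough sketch of the two approaches just described, and assuming the friends lists have already been grabbed into a dictionary (the friends_of mapping, the function names and the sample accounts below are all made up for illustration), the counting might go something like this:

```python
from collections import Counter

def commonly_followed(followers, friends_of, min_followers=3):
    """Count how many of the given followers follow each account, and keep
    the accounts followed by at least min_followers of them."""
    counts = Counter()
    for follower in followers:
        # Use a set so each follower counts at most once per account
        counts.update(set(friends_of.get(follower, [])))
    return {acct: n for acct, n in counts.items() if n >= min_followers}

def depth_1_5_edges(members, friends_of):
    """Edges (a, b) where a follows b and both a and b are in the member set."""
    members = set(members)
    return sorted((a, b) for a in members
                  for b in friends_of.get(a, []) if b in members)

# Invented sample data: who each follower follows
friends_of = {
    "f1": ["x", "y", "f2"],
    "f2": ["x", "y", "z"],
    "f3": ["x", "z", "f1"],
    "f4": ["x"],
}
followers = ["f1", "f2", "f3", "f4"]

print(commonly_followed(followers, friends_of, min_followers=3))
print(depth_1_5_edges(followers, friends_of))
```

The first function gives the "who do the followers follow to any great extent" view; the second gives the more constrained view of how the followers follow each other.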
So for example, here’s a map I just grabbed of folk who are followed by 3 or more followers from a sampling of the followers of recent users of the #gdslaunch (Government Digital Service launch) hashtag.
So what does this have to do with reference managers? Let’s start with a single academic paper (the ‘target’ paper) that contains a list of references to other works. If we can easily grab the reference lists from all those works, we can generate a depth 1.5 reference map that shows how the works referenced in the first paper reference each other. Exploring the structural properties of this map may help us better understand the support basis for the ideas covered in our target paper.
By looking at the depth 2 reference network (that is, the network that shows references included in the target paper, and all their references), we may be able to discover additional (re)sources relevant to the target paper.
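Here's a minimal sketch of how the depth 1.5 and depth 2 reference networks might be pulled apart, assuming we already had the reference lists to hand as a dictionary mapping each paper to the papers it references. The references_of mapping, the function name and the paper ids are all hypothetical; getting hold of the real data is the hard part.

```python
def reference_networks(target, references_of):
    """Build depth 1.5 and depth 2 reference networks around a target paper.

    references_of maps a paper id to the list of paper ids it references.
    Returns (depth_1_5_edges, depth_2_edges, candidate_papers).
    """
    cited = set(references_of.get(target, []))
    depth_1_5 = set()
    # The depth 2 network starts with the target's own references...
    depth_2 = {(target, paper) for paper in cited}
    second_level = set()
    for paper in cited:
        for ref in references_of.get(paper, []):
            # ...and adds every reference made by a referenced work
            depth_2.add((paper, ref))
            second_level.add(ref)
            if ref in cited:
                # Depth 1.5: referenced works that reference each other
                depth_1_5.add((paper, ref))
    # Works found at depth 2 that the target paper does not itself reference
    candidates = second_level - cited - {target}
    return depth_1_5, depth_2, candidates

# Invented reference lists: T is the target paper
references_of = {
    "T": ["A", "B", "C"],
    "A": ["B", "D"],
    "B": ["D", "E"],
    "C": ["A"],
}
d15, d2, candidates = reference_networks("T", references_of)
print(sorted(d15))
print(sorted(candidates))
```

The candidates set is where any additional (re)sources would turn up: works referenced by the target paper's references but not by the target paper itself.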
Unfortunately, getting free and easy machine readable access to the lists of references contained within journal articles, conference papers and books is not trivial. There are patchy services such as CiteSeer, Citebase or opencitations.net, but I don’t think services like Mendeley, Zotero or CiteULike are yet expressing this sort of data? Or maybe they are, and I’m missing a trick somewhere.
(Just by the by, presumably some of the commercial citation services have APIs that support at least accessing this data? If you know of any, could you add a link in the comments please?:-)
Another hack I’d like to try is to generate what more closely corresponds to the social positioning idea, which is to grab the references from a target paper, and then the papers that cite those references and see how they all link together. This would help position the target paper in the space of other papers referencing similar works. I think CiteSeer has this sort of functionality, though not in a graphical form?
PS on my to do list is seeing whether I can get reference lists for articles out of Citeseer using the Citeseer OAI-PMH endpoint. I’ve got as far as installing the pyoai Python library, but not had time to try it out yet. If anyone knows of a guide to OAI for complete novices, ideally with pyoai examples I can crib from, please post a link (or some examples) via the comments:-)
Getting Started With Twitter Analysis in R
Earlier today, I saw, via the aggregating R-Bloggers service, a post on Using Text Mining to Find Out What @RDataMining Tweets are About. The post provides a walkthrough of how to grab tweets into an R session using the twitteR library, and then do some text mining on them.
I’ve been meaning to have a look at pulling Twitter bits into R for some time, so I couldn’t but have a quick play…
Starting from @RDataMining’s lead, here’s what I did… (Notes: I use R in an R-Studio context. If you follow through the example and a library appears to be missing, search for the missing library from the Packages tab and install it, then try to reload the library in the script. The # denotes a commented out line.)
require(twitteR)
#The original example used the twitteR library to pull in a user stream
#rdmTweets <- userTimeline("psychemedia", n=100)
#Instead, I'm going to pull in a search around a hashtag.
rdmTweets <- searchTwitter('#mozfest', n=500)
# Note that the Twitter search API only goes back 1500 tweets (I think?)
#Create a dataframe based around the results
df <- do.call("rbind", lapply(rdmTweets, as.data.frame))
#Here are the columns
names(df)
#And some example content
head(df,3)
So what can we do out of the can? One thing is look to see who was tweeting most in the sample we collected:
counts=table(df$screenName)
barplot(counts)
# Let's do something hacky:
# Limit the data set to show only folk who tweeted twice or more in the sample
cc=subset(counts,counts>1)
barplot(cc,las=2,cex.names =0.3)
Now let’s have a go at parsing some tweets, pulling out the names of folk who have been retweeted or who have had a tweet sent to them:
#Whilst tinkering, I came across some errors that seemed
# to be caused by unusual character sets
#Here's a hacky defence that seemed to work...
df$text=sapply(df$text,function(row) iconv(row,to='UTF-8'))
#A helper function to remove @ symbols from user names...
trim <- function (x) sub('@','',x)
#A couple of tweet parsing functions that add columns to the dataframe
#We'll be needing this, I think?
library(stringr)
#Pull out who a message is to
df$to=sapply(df$text,function(tweet) str_extract(tweet,"^(@[[:alnum:]_]*)"))
df$to=sapply(df$to,function(name) trim(name))
#And here's a way of grabbing who's been RT'd
df$rt=sapply(df$text,function(tweet) trim(str_match(tweet,"^RT (@[[:alnum:]_]*)")[2]))
So for example, now we can plot a chart showing how often a particular person was RT’d in our sample. Let’s use ggplot2 this time…
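require(ggplot2)
#Note: from ggplot2 0.9.2 onwards, opts() and theme_text() are replaced by theme() and element_text()
ggplot()+geom_bar(aes(x=na.omit(df$rt)))+theme(axis.text.x=element_text(angle=-90,size=6))+xlab(NULL)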
Okay – enough for now… if you’re tempted to have a play yourself, please post any other avenues you explored in a comment, or in your own post with a link in my comments;-)
Notes on Custom Course Search Engines Derived from OU Structured Authoring Documents
Over the last few days, I’ve been tinkering with OU Structured Authoring documents, XML docs from which OU course materials – both print and HTML – are generated (you can get an idea about what they look like from OpenLearn: find a course page with a URL of the form http://openlearn.open.ac.uk/mod/oucontent/view.php?id=397337&direct=1 and change direct to content: http://openlearn.open.ac.uk/mod/oucontent/view.php?id=397337&content=1; h/t to Colin Chambers for that one;-). I’ve been focussing in particular on the documents used to describe T151, an entry level online course I developed around all things gaming (culture, business, design and development), and the way in which we can automatically generate custom search engines based on these documents.
The course had a very particular structure – weekly topic explorations framed as a preamble, set of guiding questions, suggested resources (organised by type) and a commentary, along with a weekly practical session.
One XML doc was used per week, and was used to generate the separate HTML pages for each week’s study.
One of the experimental components of the course has been a Google Custom Search Engine, that supports searches over external resources that are linked to from the blog. The course also draws heavily on the Digital Worlds Uncourse blog, a site used to scope out the design of the course, as well as draft some of the materials used within it, and the CSE indexes both that site and the sites that are linked from it. (See eSTEeM Project: Custom Course Search Engines and Integrating Course Related Search and Bookmarking? for a little more context around this.)
Through using the course custom search engine myself, I have found several issues with it:
1) with a small index, it’s not really very satisfactory. If you only index exact pages that are linked to from the site, it can be quite hard getting any hits at all. A more relaxed approach might be to index the domains associated with resources, and also include the exact references explicitly with a boosted search rank. At the current time, I have code that scrapes external links from across the T151 course materials and dumps them into a single annotations file (the file that identifies which resources are included in the CSE) without further embellishment. I also have code that identifies the unique domains that are linked to from the course materials and which can also be added to the annotations file. On the to do list is to annotate the resources with labels that identify which topic they are associated with so we can filter results by topic.
2) the Google Custom Search Engines seem to behave very oddly indeed. In several of my experiments, one word queries often returned few results, while more specific queries building on the original search term delivered more and different results. This gives a really odd search experience, and one that I suspect would put many users off.
3) I’ve been coming round more and more to the idea that the best way of highlighting course resources in a search context is through the use of Subscribed Links, which a user can subscribe to and which then appear in their Google search results if there is an exact query match. Unfortunately, Google pulled the Subscribed Links service in September (A fall spring-clean; for an example of what’s been lost, see Stone Temple Consulting: Google Co-Op Subscribed Links).
4) The ability to feed promotions into the top of the CSE results listing is attractive (up to 3 promoted links can be displayed for any given query), but the automatic generation of query terms is problematic. Promotion definitions take the form:
<Promotion image_url="http://kmi.open.ac.uk/images/ou-logo.gif" title="Week 4" id="T151_Week4" queries="week 4,T151 week 4,t151 week 4" url="http://www.open.ac.uk" description="Topic Exploration 4A - An animated experience Topic exploration 4B - Flow in games "/>
There are several components we need to consider here:
- queries: these are the phrases that are used to trigger the display of particular promotion links. Informal testing suggests that where multiple promotions are triggered by the same query, the order in which they are defined in the Promotions file determines the order in which they appear in the results. Note that at most three (3) promotions can be displayed for any query. Queries may be based around structural components (such as study week or topic number), subject matter terms (e.g. tags, keywords, or headings) or resource type (e.g. audio/video material, academic readings etc), although we might argue the resource type is not such a meaningful distinction (just because we can make it doesn’t mean we should!). In the T151 materials, presentation conventions provide us with a way of extracting structural components and using these to seed the promotions file. Identifying keywords or phrases is more challenging: students are unlikely to enter search phrases that exactly match section or subsection headings, so some element of term extraction would be required in order to generate query terms that are likely to be used.
- title: this is what catches the attention, so we need to put something sensible in here. There is a limit of 160 characters on the length of the title.
- description: the description allows us to elaborate on the title. There is a limit of 200 characters on the length of the description.
- url: the URL is required but not necessarily ‘used’ by our promotion. That is, if we are using the promotion for informational reasons, and not necessarily wanting to offer a click through, the link may be redundant. (However, the CSE still requires it to be defined.) Alternatively, we might argue that a click through action should always be generated, in which case it might be worth considering whether we can generate a click through to a more refined query on the CSE itself?
Where multiple promotions are provided, we need to think about:
a) how they are ordered;
b) what other queries they are associated with (i.e. their specificity);
c) where they link to.
In picking apart the T151 structured authoring documents, I have started by focussing on the low hanging fruit when it comes to generating promotion links. Looking through the document format, it is possible to identify topics associated with separate weeks and questions associated with particular topics. This allows us to build up a set of informational promotions that allow the search engine to respond to queries of what we might term a navigational flavour. So for example, we can ask what topics are covered in a particular week (I also added the topic query as a query for questions related to a particular topic):
Or what a particular question is within a particular topic:
The promotion definitions are generated automatically and are all very procedural. For example, here’s a fragment from the definition of the promotion from question 4 in topic 4A:
<Promotion title="Topic Exploration 4A Question 4" queries="topic 4a q4,T151 topic 4a q4,t151 topic 4a q4,topic 4a,T151 topic 4a,t151 topic 4a" ... />
The queries this promotion will be displayed for are all based around navigational structural elements. This requires some knowledge of the navigational query syntax, and also provides an odd user experience, because the promotions only display on the main CSE tab, and the organic results from indexed sites turn up all manner of odd results for queries like “week 3″ and “topic 1a q4″… (You can try out the CSE here.)
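To show the sort of procedural generation involved, here's a minimal sketch (not the actual code I used) that builds a Promotion element with the same pattern of navigational query variants as the fragment above. The function names, the promotion id and the placeholder description are hypothetical; the 160 and 200 character limits are the ones mentioned earlier, applied here by simple truncation.

```python
import xml.etree.ElementTree as ET

TITLE_LIMIT = 160        # title length limit noted above
DESCRIPTION_LIMIT = 200  # description length limit noted above

def query_variants(phrase, course="T151"):
    """Return the bare phrase plus course-code-prefixed variants of it."""
    return [phrase, f"{course} {phrase}", f"{course.lower()} {phrase}"]

def make_promotion(promo_id, title, description, url, phrases, course="T151"):
    """Build a Promotion element whose queries cover every phrase variant."""
    queries = []
    for phrase in phrases:
        queries.extend(query_variants(phrase, course))
    return ET.Element("Promotion", {
        "id": promo_id,
        "title": title[:TITLE_LIMIT],
        "description": description[:DESCRIPTION_LIMIT],
        "url": url,
        "queries": ",".join(queries),
    })

promo = make_promotion(
    promo_id="T151_Topic4A_Q4",  # hypothetical id
    title="Topic Exploration 4A Question 4",
    description="(question text would go here)",  # placeholder
    url="http://www.open.ac.uk",
    phrases=["topic 4a q4", "topic 4a"],
)
print(ET.tostring(promo, encoding="unicode"))
```

Truncating the title and description is the crudest possible way of respecting the length limits; anything destined for real use would probably want to cut at a word boundary.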
The promotions I have specified so far also lack several things:
1) queries based on the actual question description, so that a query related to the question might pull the corresponding promotion into the search results (would that be useful?)
2) a sensible link. At the moment, there is no obvious way in the SA document of identifying one or more resources that particularly relate to a specific question. If there was such a link, then we could use that information to automatically associate a link with a question in the corresponding promotions element. (The original design of the course imagined the Structured Authoring document itself being constructed automatically from component parts. In particular, it was envisioned that suggested links would be tagged on a social bookmarking service and then automatically pulled into the appropriate area of the Structured Authoring document. Resources could then be tagged in a way that associates them with one or more questions (or topics), either directly through a question ID, or indirectly through matching subject tags on a question and on a resource. The original model also considered the use of “suggested search queries” that would be used to populate suggested resources lists with results pulled in live from a (custom) search engine…)
At the moment, it is possible to exploit the T151 document structure to generate these canned navigational queries. The question now is: are promotional links a useful feature, and how might we go about automatically identifying subject meaningful queries?
At the limit, we might imagine the course custom search engine interface being akin to the command line in a text based adventure game, with the adventure itself being the learning journey, and the possible next step a combination of Promotions based guidance and actual search results…
[Code for the link scraping/CSE file generation and mindmap generator built around the T151 SA docs can be found at Github: Course Custom Search Engines]
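For anyone who just wants the gist of the link scraping step without digging through the repository, here's a simplified stand-in (it is not the code from the repository). I'm assuming, purely for illustration, that links appear in the XML as a elements with an href attribute, which may not match the real Structured Authoring schema; the sample document, the function names and the _cse_example label are likewise placeholders.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

def external_links(xml_text, internal_host="open.ac.uk"):
    """Collect distinct links that point outside the internal host."""
    root = ET.fromstring(xml_text)
    links = []
    for el in root.iter("a"):  # assumed element name, see note above
        href = el.get("href", "")
        host = urlparse(href).netloc
        if host and not host.endswith(internal_host) and href not in links:
            links.append(href)
    return links

def unique_domains(links):
    """Return the sorted set of domains the links point to."""
    return sorted({urlparse(link).netloc for link in links})

def annotations_xml(links, domains, label="_cse_example"):
    """Write exact links plus domain wildcard patterns as CSE annotations."""
    root = ET.Element("Annotations")
    for pattern in list(links) + [d + "/*" for d in domains]:
        ann = ET.SubElement(root, "Annotation", {"about": pattern})
        ET.SubElement(ann, "Label", {"name": label})
    return ET.tostring(root, encoding="unicode")

# Invented sample document
sample = """<Item>
  <Paragraph>See <a href="http://www.example.com/articles/flow.html">this article</a>
  and <a href="http://example.org/flow">this one</a>, plus
  <a href="http://openlearn.open.ac.uk/course/view.php?id=1">an OU page</a>.</Paragraph>
</Item>"""

links = external_links(sample)
domains = unique_domains(links)
print(annotations_xml(links, domains))
```

This dumps both the exact links and the wildcarded domains into one annotations file without any boosting or topic labels, which is roughly where the description in issue 1) above leaves off.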
PS as ever, I tend to focus on tinkering around a rapid prototype/demonstration at the technical systems overview level, with a passing nod to the usefulness of the service (which, as noted above, is a bit patchy where the search engine index is sparse). What I haven’t done is spend much time thinking about the pedagogical aspects relating to how we might make most effective use of custom search engines in the course context. From a scoping point of view, I think there are several things we need to unpick that relate to this: what is it that students are searching for, what context are they searching in, and what are they expecting to discover?
My original thinking around custom course search engines was that they would augment a search across course materials by providing a way of searching across the full text of resources* linked to from the course materials, and maybe also provide a context for searching over student suggested resources.
It strikes me that the course search engine is most likely to be relevant where there is active curation of the search engine that provides a search view over a reasonably sized set of resources discovered by folk taking the course and sharing resources related to it. “MOOCs” might be interesting in this respect, particularly where: 1) it is possible to include MOOC blog tag feeds in the CSE as a source of relevant content (both the course blog content and resources linked to from that content – the CSE can be configured to include resources that are linked to from a specified resource); 2) we can grab links that are tagged and shared with the MOOC code on social media and add those to the CSE annotations file. (Note that in this case, it would make sense to resolve shortened links to their ultimate destination URL before adding them to the CSE.) I’m not sure what role promotions might play in a MOOC though, or the extent to which they could be automatically generated?
*Full text search across linked to resources is actually quite problematic. Consider the following classes of online resources that we might expect to be linked to from course materials:
- academic papers, often behind a paywall: links are likely to be redirected through a library proxy service allowing for direct click-thru to the resource using institutional credentials (I assume the user is logged in to the VLE to see the link, and single sign on support allows direct access to any subscribed to resources via appropriate proxies. That is, the link to the resource leads directly to the full text, subscribed to version of the resource if the user is signed on to the institutional system and has appropriate credentials). There are several issues here: the link that is supplied to the CSE should be the public link to the article homepage; the article homepage is likely to reveal little more than the paper abstract to the search engine. I’m not sure if Google Scholar does full-text indexing of articles, but even if it does, Scholar results are not available to the CSE. (There is also the issue of how we would limit the Scholar search to the articles we are linking to from the course materials.)
- news and magazine articles: again, these may be behind a paywall, but even if they are, they may have been indexed by Google. So they may be full text discoverable via a CSE, even if they aren’t accessible once you click through…
- video and audio resources: discovery will depend in large part on the text on the web page the resources are hosted on. If the link is directly to an audio or video file, discoverability via the CSE may well be very limited!
- books: Google book search provides full text search, but this is not available via a CSE. Full text searchable collections of books are achievable using Google Books Library Shelves; there’s also an API available.
I guess the workaround to all this is not to use a Google Custom Search Engine as the basis for a course search engine. Instead, construct a repository that contains full text copies of all resources linked to from the course, and index that using a local search engine, providing aliased links to the original sources if required?
However, that wasn’t what this experiment was about!;-)
Course Resources as part of a larger connected graph
Another way of thinking about linked to course resources is that they are a gateway into a set of connected resources. Most obviously, for an academic paper it is part of a graph structure that includes:
- links to papers referenced in the article;
- links to papers that cite the article;
- links to other papers written by the same author;
- links to other papers in collections containing the article on services such as Mendeley;
- links into the social graph, such as the social connections of the author, or the discovery of people who have shared a link to the resource on a public social network.
For an informal resource such as a blog post, it might be other posts linked to from the post of interest, or other posts that link to it.
Thinking about resources as being part of one or more connected graphs may influence our thinking about the pedagogy. If the intention is that a learner is directed to a resource as a terminal, atomic resource, from which they are expected to satisfy a particular learning requirement, then we aren’t necessarily interested in the context surrounding the resource. If the intention is that the resource is a gateway to a networked context around one or more ideas or concepts, then we need to select our resources so that they provide a springboard to other resources. This can be done directly (e.g. through following references contained within the work, or tracking down resources that cite it), or indirectly, for example by suggesting keywords or search phrases that might be used to discover related resources by independent means. Alternatively, we might link to a resource as an exemplar of the sort of resource students are expected to work with on a given activity, and then expect them to find similar sorts of, but alternative, resources for themselves.
Who Do The Science Literati Listen to on Twitter?
I really shouldn’t have got distracted by this today, but I did; via Owen Stephens: seen altmetric – tracking social media & other mentions of academic papers (by @stew)?
Monthly Altmetric data downloads of tweets containing mentions of published articles are available for download from Buzzdata, so I grabbed the September dataset, pulled out the names of folk sending the tweets, and how many paper mentioning tweets they had sent from the Unix command line:
cut -d ',' -f 3 twitterDataset0911_v1.csv | sort | uniq -c | sort -k 1 -n -r > tmp.txt
Read this list into a script, pulled out the folk who had sent 10 or more paper mentioning updates, grabbed their Twitter friends lists and plotted a graph using Gephi to see how they connected (nodes are coloured according to a loose grouping and sized according to eigenvector centrality):
My handle for this view is that it shows who’s influential in the social media (Twitter) domain of discourse relating to the scientific topic areas covered by the Altmetric tweet collection I downloaded. To be included in the graph, you need to have posted 10 or more tweets referring to one or more scientific papers in the collection period.
We can get a different sort of view over trusted accounts in the scientific domain by graphing the network of all the friends of (that is, people followed by) the people who sent 10 or more paper referencing tweets in September, as collected by altmetric, edges going from altmetric tweeps to all their friends. This is a big graph, so if we limit it to show folk followed by 100 or more of the folk who sent paper mentioning tweets and display those accounts, this is what we get:
My reading of this one is that it shows folk who are widely trusted by folk who post regular updates about scientific papers in particular subject areas.
Hmmm… now, I wonder: what else might I be able to do with the Altmetric data???
PS Okay – after some blank “looks”, here’s the method for the first graph:
1) get the September list of tweets from Buzzdata that contain a link to a scientific paper (as determined by Altmetric filters);
2) extract the Twitter screen names of the people who sent those tweets.
3) count how many different tweets were sent by each screen name.
4) extract the list of screen-names that sent 10 or more of the tweets that Altmetric collected. This list is a list of people who sent 10 or more tweets containing links to academic papers. Let’s call it ‘the September10+ list’.
5) for each person on the September10+ list, grab the list of people they follow on Twitter.
6) plot the graph of social connections between people on the September10+ list.
Okay? Got that?
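If a recipe in code helps, here's a minimal Python sketch of steps 2 to 4 and step 6. It assumes the screen name sits in the third comma separated column (as in the cut command above) and that friends lists have already been grabbed into a dictionary; the sample rows, the friends_of mapping and the function names are invented for illustration, and the threshold is dropped to 2 so the toy data has something to show.

```python
import csv
import io
from collections import Counter

def frequent_tweeters(csv_text, column=2, threshold=10):
    """Steps 2-4: count tweets per screen name and keep the names that
    sent at least `threshold` of them. Column 2 is the third field."""
    reader = csv.reader(io.StringIO(csv_text))
    counts = Counter(row[column] for row in reader if len(row) > column)
    return {name for name, n in counts.items() if n >= threshold}

def internal_edges(members, friends_of):
    """Step 6: follow links where both ends are on the list."""
    return sorted((a, b) for a in members
                  for b in friends_of.get(a, ()) if b in members)

# Invented rows: tweet id, date, screen name
sample_csv = (
    "t1,2011-09-01,alice\n"
    "t2,2011-09-01,bob\n"
    "t3,2011-09-02,alice\n"
    "t4,2011-09-02,carol\n"
    "t5,2011-09-03,bob\n"
)
# Step 5 stand-in: who each list member follows
friends_of = {"alice": ["bob", "dave"], "bob": ["dave", "erin"]}

members = frequent_tweeters(sample_csv, threshold=2)
print(sorted(members))
print(internal_edges(members, friends_of))
```

Whether the real data file has a header row is something I'd check before trusting the counts; a header would simply show up as a screen name with a count of one.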
Here’s how the second graphic was generated.
a) take the September10+ list and for each person on it, get the list of all their friends on Twitter. (This is the same as step 5 above).
b) Build a graph as follows: for each person on the September10+ list, add a link from them to each person they follow on Twitter. This is a big graph. (The graph in 6 above only shows links between people on the September10+ list.)
c) I was a little disingenuous in the description in the body of this post… I now filter the graph to only show nodes with a degree of 100 or more. For folk who are on the September10+ list, this means that the number of people on the September10+ list who follow them, plus the total number of people they follow, is equal to or greater than 100. For folk not on the September10+ list, this means that they are being followed by people with a degree of 100 or more who are on the September10+ list (which is to say they are being followed by at least 100 or so people on the September10+ list; I guess there could be folk followed by more than 100 people on the September10+ list who don’t appear in the graph if, for example, they were followed by folk in the original graph who predominantly had a degree of less than 100?).
d) to plot the word cloud graphic above, I visualise the filtered graph and then hide the nodes whose in-degree is 0 (that is, they aren’t followed by anyone else in the graph).
Got that? Simples… ;-)
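And for the second graphic, a rough sketch of steps b to d in Python. This is my reading of the recipe rather than a description of exactly what Gephi does: degree is counted over the full graph, the filter keeps edges whose two ends both pass the threshold, and the final view keeps only nodes that something still points at. The function names and sample data are made up, with the threshold set to 3 rather than 100.

```python
from collections import Counter

def follow_graph(members, friends_of):
    """Step b: a directed edge from each list member to everyone they follow."""
    return [(m, f) for m in sorted(members) for f in friends_of.get(m, ())]

def filter_by_degree(edges, min_degree):
    """Step c: keep edges whose endpoints both have total degree
    (in-degree plus out-degree) of at least min_degree."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    keep = {n for n, d in degree.items() if d >= min_degree}
    return [(a, b) for a, b in edges if a in keep and b in keep]

def followed_nodes(edges):
    """Step d: nodes whose in-degree in the filtered graph is above zero."""
    return {b for _, b in edges}

# Invented sample data
members = {"m1", "m2", "m3"}
friends_of = {"m1": ["x", "y", "m2"], "m2": ["x", "y"], "m3": ["x"]}

edges = follow_graph(members, friends_of)
filtered = filter_by_degree(edges, min_degree=3)
print(filtered)
print(sorted(followed_nodes(filtered)))
```

In the toy data, y is followed by two list members but drops out because its degree is below the threshold, which is the sort of edge case the caveat in step c is hinting at.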




















