Archive for the ‘Infoskills’ Category
Getting Started With Twitter Analysis in R
Earlier today, via the R-Bloggers aggregation service, I saw a post on Using Text Mining to Find Out What @RDataMining Tweets are About. The post provides a walkthrough of how to grab tweets into an R session using the twitteR library, and then do some text mining on them.
I’ve been meaning to have a look at pulling Twitter bits into R for some time, so I couldn’t but have a quick play…
Starting from @RDataMining’s lead, here’s what I did… (Notes: I use R in an R-Studio context. If you follow through the example and a library appears to be missing, from the Packages tab search for the missing library and install it, then try to reload the library in the script. The # denotes a commented out line.)
require(twitteR)
#The original example used the twitteR library to pull in a user stream
#rdmTweets <- userTimeline("psychemedia", n=100)
#Instead, I'm going to pull in a search around a hashtag.
rdmTweets <- searchTwitter('#mozfest', n=500)
# Note that the Twitter search API only goes back 1500 tweets (I think?)
#Create a dataframe based around the results
df <- do.call("rbind", lapply(rdmTweets, as.data.frame))
#Here are the columns
names(df)
#And some example content
head(df,3)
So what can we do out of the can? One thing is look to see who was tweeting most in the sample we collected:
counts=table(df$screenName)
barplot(counts)
# Let's do something hacky:
# Limit the data set to show only folk who tweeted twice or more in the sample
cc=subset(counts,counts>1)
barplot(cc,las=2,cex.names =0.3)
Now let’s have a go at parsing some tweets, pulling out the names of folk who have been retweeted or who have had a tweet sent to them:
#Whilst tinkering, I came across some errors that seemed
# to be caused by unusual character sets
#Here's a hacky defence that seemed to work...
df$text=sapply(df$text,function(row) iconv(row,to='UTF-8'))
#A helper function to remove @ symbols from user names...
trim <- function (x) sub('@','',x)
#A couple of tweet parsing functions that add columns to the dataframe
#We'll be needing this, I think?
library(stringr)
#Pull out who a message is to
df$to=sapply(df$text,function(tweet) str_extract(tweet,"^(@[[:alnum:]_]*)"))
df$to=sapply(df$to,function(name) trim(name))
#And here's a way of grabbing who's been RT'd
df$rt=sapply(df$text,function(tweet) trim(str_match(tweet,"^RT (@[[:alnum:]_]*)")[2]))
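To see what those two patterns actually pick out, here is a minimal sketch run against a few made-up tweets. It uses base R's regexpr/regmatches rather than stringr, purely so that it runs without any extra packages; the sample tweets and the helper names extract_to and extract_rt are illustrative assumptions, not part of the original script.

```r
# Made-up sample tweets (illustrative only)
tweets <- c("@alice have you seen this? #mozfest",
            "RT @bob: great session on open data #mozfest",
            "Just arrived at #mozfest")

# Remove the leading @ from a user name, as in the trim() helper above
trim <- function(x) sub('@', '', x)

# Who is a tweet addressed to? (tweet starts with @name); NA if no match
extract_to <- function(tweet) {
  m <- regmatches(tweet, regexpr("^@[[:alnum:]_]*", tweet))
  if (length(m) == 0) NA_character_ else trim(m)
}

# Who has been retweeted? (tweet starts with "RT @name"); NA if no match
extract_rt <- function(tweet) {
  m <- regmatches(tweet, regexpr("^RT @[[:alnum:]_]*", tweet))
  if (length(m) == 0) NA_character_ else sub("^RT @", "", m)
}

to <- unname(sapply(tweets, extract_to))
rt <- unname(sapply(tweets, extract_rt))
print(data.frame(text = tweets, to = to, rt = rt))
```

Tweets that are neither replies nor retweets come back as NA in both columns, which is why the charting step that follows drops the NA values before plotting.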
So for example, now we can plot a chart showing how often a particular person was RT’d in our sample. Let’s use ggplot2 this time…
require(ggplot2)
#Rotate and shrink the x-axis labels so the screen names stay legible
ggplot()+geom_bar(aes(x=na.omit(df$rt)))+theme(axis.text.x=element_text(angle=-90,size=6))+xlab(NULL)
Okay – enough for now… if you’re tempted to have a play yourself, please post any other avenues you explored in a comment, or in your own post with a link in my comments;-)
Notes on Custom Course Search Engines Derived from OU Structured Authoring Documents
Over the last few days, I’ve been tinkering with OU Structured Authoring documents, XML docs from which OU course materials – both print and HTML – are generated (you can get an idea about what they look like from OpenLearn: find a course page with a URL of the form http://openlearn.open.ac.uk/mod/oucontent/view.php?id=397337&direct=1 and change direct to content: http://openlearn.open.ac.uk/mod/oucontent/view.php?id=397337&content=1; h/t to Colin Chambers for that one;-). I’ve been focussing in particular on the documents used to describe T151, an entry level online course I developed around all things gaming (culture, business, design and development), and the way in which we can automatically generate custom search engines based on these documents.
The course had a very particular structure – weekly topic explorations framed as a preamble, set of guiding questions, suggested resources (organised by type) and a commentary, along with a weekly practical session.
One XML doc was used per week, and was used to generate the separate HTML pages for each week’s study.
One of the experimental components of the course has been a Google Custom Search Engine, that supports searches over external resources that are linked to from the blog. The course also draws heavily on the Digital Worlds Uncourse blog, a site used to scope out the design of the course, as well as draft some of the materials used within it, and the CSE indexes both that site and the sites that are linked from it. (See eSTEeM Project: Custom Course Search Engines and Integrating Course Related Search and Bookmarking? for a little more context around this.)
Through using the course custom search engine myself, I have found several issues with it:
1) with a small index, it’s not really very satisfactory. If you only index exact pages that are linked to from the site, it can be quite hard getting any hits at all. A more relaxed approach might be to index the domains associated with resources, and also include the exact references explicitly with a boosted search rank. At the current time, I have code that scrapes external links from across the T151 course materials and dumps them into a single annotations file (the file that identifies which resources are included in the CSE) without further embellishment. I also have code that identifies the unique domains that are linked to from the course materials and which can also be added to the annotations file. On the to do list is to annotate the resources with labels that identify which topic they are associated with so we can filter results by topic.
2) the Google Custom Search Engines seem to behave very oddly indeed. In several of my experiments, one word queries often returned few results, while more specific queries building on the original search term delivered more and different results. This gives a really odd search experience, and one that I suspect would put many users off.
3) I’ve been coming round more and more to the idea that the best way of highlighting course resources in a search context is through the use of Subscribed Links, that a user can subscribe to and that then appear in their Google search results if there is an exact query match. Unfortunately, Google pulled the Subscribed Links service in September (A fall spring-clean; for an example of what’s been lost, see e.g. Stone Temple Consulting: Google Co-Op Subscribed Links).
4) The ability to feed promotions into the top of the CSE results listing is attractive (up to 3 promoted links can be displayed for any given query), but the automatic generation of query terms is problematic. Promotion definitions take the form:
<Promotion image_url="http://kmi.open.ac.uk/images/ou-logo.gif" title="Week 4" id="T151_Week4" queries="week 4,T151 week 4,t151 week 4" url="http://www.open.ac.uk" description="Topic Exploration 4A - An animated experience Topic exploration 4B - Flow in games "/>
There are several components we need to consider here:
- queries: these are the phrases that are used to trigger the display of the particular promotion links. Informal testing suggests that where multiple promotions are triggered by the same query, the order in which they are defined in the Promotions file determines the order in which they appear in the results. Note that at most three (3) promotions can be displayed for any query. Queries may be based around structural components (such as study week or topic number), subject matter terms (e.g. tags, keywords, or headings) or resource type (e.g. audio/video material, academic readings etc), although we might argue the resource type is not such a meaningful distinction (just because we can make it doesn’t mean we should!). In the T151 materials, presentation conventions provide us with a way of extracting structural components and using these to seed the promotions file. Identifying keywords or phrases is more challenging: students are unlikely to enter search phrases that exactly match section or subsection headings, so some element of term extraction would be required in order to generate query terms that are likely to be used.
- title: this is what catches the attention, so we need to put something sensible in here. There is a limit of 160 characters on the length of the title.
- description: the description allows us to elaborate on the title. There is a limit of 200 characters on the length of the description.
- url: the URL is required but not necessarily ‘used’ by our promotion. That is, if we are using the promotion for informational reasons, and not necessarily wanting to offer a click through, the link may be redundant. (However, the CSE still requires it to be defined.) Alternatively, we might argue that a click through action should always be generated, in which case it might be worth considering whether we can generate a click through to a more refined query on the CSE itself?
Where multiple promotions are provided, we need to think about:
a) how they are ordered;
b) what other queries they are associated with (i.e. their specificity);
c) where they link to.
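The limits quoted above (160 characters for a title, 200 for a description, at most three promotions per query) are easy to trip over when promotions are generated automatically, so a quick check over the generated set may be worthwhile. The following is only a sketch: the promos table, its contents and the variable names are made up for illustration, and the limits are simply the figures mentioned in this post.

```r
# Hypothetical promotions table: one row per promotion, in file order
promos <- data.frame(
  id = c("T151_Week4", "T151_T4A_Q1", "T151_T4A_Q2", "T151_T4A_Q3", "T151_T4A_Q4"),
  title = c("Week 4", paste("Topic Exploration 4A Question", 1:4)),
  description = c("Topic Exploration 4A - An animated experience",
                  "Short question text", "Short question text", "Short question text",
                  paste(rep("A very long question description.", 10), collapse = " ")),
  queries = c("week 4,T151 week 4,t151 week 4",
              paste0("topic 4a q", 1:4, ",topic 4a")),
  stringsAsFactors = FALSE
)

# Limits as quoted in the post
MAX_TITLE <- 160
MAX_DESCRIPTION <- 200
MAX_PER_QUERY <- 3

# Promotions whose title or description is over length
too_long <- promos$id[nchar(promos$title) > MAX_TITLE |
                      nchar(promos$description) > MAX_DESCRIPTION]

# Split each queries attribute into phrases, ignoring case and stray spaces,
# and count each phrase at most once per promotion
query_list <- lapply(strsplit(promos$queries, ",", fixed = TRUE),
                     function(q) unique(tolower(trimws(q))))

# How many promotions would each query phrase trigger?
per_query <- table(unlist(query_list))
crowded <- names(per_query)[as.vector(per_query) > MAX_PER_QUERY]

print(too_long)  # promotions that need shortening
print(crowded)   # queries that trigger more promotions than can be shown
```

In this toy data the shared "topic 4a" query is attached to four question promotions, so only the first three defined in the file would be expected to display, which is exactly the ordering question raised in a) above.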
In picking apart the T151 structured authoring documents, I have started by focussing on the low hanging fruit when it comes to generating promotion links. Looking through the document format, it is possible to identify topics associated with separate weeks and questions associated with particular topics. This allows us to build up a set of informational promotions that allow the search engine to respond to queries of what we might term a navigational flavour. So for example, we can ask what topics are covered in a particular week (I also added the topic query as a query for questions related to a particular topic):
Or what a particular question is within a particular topic:
The promotion definitions are generated automatically and are all very procedural. For example, here’s a fragment from the definition of the promotion from question 4 in topic 4A:
<Promotion title="Topic Exploration 4A Question 4" queries="topic 4a q4,T151 topic 4a q4,t151 topic 4a q4,topic 4a,T151 topic 4a,t151 topic 4a" ... />
The queries this promotion will be displayed for are all based around navigational structural elements. This requires some knowledge of the navigational query syntax, and also provides an odd user experience, because the promotions only display on the main CSE tab, and the organic results from indexed sites turn up all manner of odd results for queries like “week 3” and “topic 1a q4”… (You can try out the CSE here.)
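To give a flavour of how procedural this generation step is, here is a rough sketch that rebuilds the queries attribute shown in the fragment above and wraps it in a Promotion element. It is not the code from the repository mentioned later in this post: the function names, the placeholder description and the choice of attributes (taken from the Week 4 example earlier) are assumptions made for illustration.

```r
# Escape the characters that are special inside XML attribute values
xml_escape <- function(x) {
  x <- gsub("&", "&amp;", x, fixed = TRUE)
  x <- gsub("<", "&lt;", x, fixed = TRUE)
  x <- gsub(">", "&gt;", x, fixed = TRUE)
  gsub('"', "&quot;", x, fixed = TRUE)
}

# Navigational query variants for one question in one topic:
# each stem with no prefix, an upper case and a lower case course code
question_queries <- function(course, topic, q) {
  stems <- c(paste0("topic ", topic, " q", q), paste0("topic ", topic))
  prefixes <- c("", paste0(toupper(course), " "), paste0(tolower(course), " "))
  as.vector(outer(prefixes, stems, paste0))
}

# Assemble a single Promotion element as a string
promotion_element <- function(id, title, queries, url, description) {
  sprintf('<Promotion title="%s" id="%s" queries="%s" url="%s" description="%s"/>',
          xml_escape(title), xml_escape(id),
          xml_escape(paste(queries, collapse = ",")),
          xml_escape(url), xml_escape(description))
}

qs <- question_queries("T151", "4a", 4)
p <- promotion_element(id = "T151_T4A_Q4",
                       title = paste("Topic Exploration", toupper("4a"), "Question 4"),
                       queries = qs,
                       url = "http://www.open.ac.uk",
                       description = "Question text goes here")  # placeholder
cat(p, "\n")
```

Escaping the attribute values matters once real question text is used for the description, since questions can easily contain ampersands or quotation marks.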
The promotions I have specified so far also lack several things:
1) queries based on the actual question description, so that a query related to the question might pull the corresponding promotion into the search results (would that be useful?)
2) a sensible link. At the moment, there is no obvious way in the SA document of identifying one or more resources that particularly relate to a specific question. If there was such a link, then we could use that information to automatically associate a link with a question in the corresponding promotions element. (The original design of the course imagined the Structured Authoring document itself being constructed automatically from component parts. In particular, it was envisioned that suggested links would be tagged on a social bookmarking service and then automatically pulled into the appropriate area of the Structured Authoring document. Resources could then be tagged in a way that associates them with one or more questions (or topics), either directly through a question ID, or indirectly through matching subject tags on a question and on a resource. The original model also considered the use of “suggested search queries” that would be used to populate suggested resources lists with results pulled in live from a (custom) search engine…)
At the moment, it is possible to exploit the T151 document structure to generate these canned navigational queries. The question now is: are promotional links a useful feature, and how might we go about automatically identifying subject meaningful queries?
At the limit, we might imagine the course custom search engine interface being akin to the command line in a text based adventure game, with the adventure itself being the learning journey, and the possible next step a combination of Promotions based guidance and actual search results…
[Code for the link scraping/CSE file generation and mindmap generator built around the T151 SA docs can be found at Github: Course Custom Search Engines]
PS as ever, I tend to focus on tinkering around a rapid prototype/demonstration at the technical systems overview level, with a passing nod to the usefulness of the service (which, as noted above, is a bit patchy where the search engine index is sparse). What I haven’t done is spend much time thinking about the pedagogical aspects relating to how we might make most effective use of custom search engines in the course context. From a scoping point of view, I think there are several things we need to unpick that relate to this: what is it that students are searching for, what context are they searching in, and what are they expecting to discover?
My original thinking around custom course search engines was that they would augment a search across course materials by providing a way of searching across the full text of resources* linked to from the course materials, and maybe also provide a context for searching over student suggested resources.
It strikes me that the course search engine is most likely to be relevant where there is active curation of the search engine that provides a search view over a reasonably sized set of resources discovered by folk taking the course and sharing resources related to it. “MOOCs” might be interesting in this respect, particularly where: 1) it is possible to include MOOC blog tag feeds in the CSE as a source of relevant content (both the course blog content and resources linked to from that content – the CSE can be configured to include resources that are linked to from a specified resource); 2) we can grab links that are tagged and shared with the MOOC code on social media and add those to the CSE annotations file. (Note that in this case, it would make sense to resolve shortened links to their ultimate destination URL before adding them to the CSE.) I’m not sure what role promotions might play in a MOOC though, or the extent to which they could be automatically generated?
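Whether the links come from course materials or from tagged social media shares, the step before writing them to the annotations file is much the same: deduplicate the links and work out which domains they belong to. The sketch below shows one way that might look. The URLs are made up, the domain_of helper is hypothetical, the "/*" suffix is only my understanding of how a whole-domain pattern is written for a CSE, and resolving shortened links (which needs a network call) is left out.

```r
# Made-up links harvested from course pages or shared tweets
links <- c("http://www.example.com/games/flow.html",
           "http://www.example.com/games/flow.html",
           "https://blog.example.org/2011/10/animation/",
           "http://www.example.com/reviews?id=42")

# Exact pages, each listed once
unique_links <- unique(links)

# Strip the scheme, then everything from the first "/", "?" or "#" onwards.
# (Simplified: ports and user:password@ prefixes are not handled.)
domain_of <- function(url) {
  host <- sub("^[a-zA-Z][a-zA-Z0-9+.-]*://", "", url)
  tolower(sub("[/?#].*$", "", host))
}

# Unique domains, and a whole-domain pattern for each (assumed pattern style)
domains <- unique(domain_of(unique_links))
patterns <- paste0(domains, "/*")

print(unique_links)
print(patterns)
```

Keeping the exact links and the domain patterns as two separate lists leaves open the option, mentioned earlier, of boosting the exact references over the rest of the domain.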
*Full text search across linked to resources is actually quite problematic. Consider the following classes of online resources that we might expect to be linked to from course materials:
- academic papers, often behind a paywall: links are likely to be redirected through a library proxy service allowing for direct click-thru to the resource using institutional credentials (I assume the user is logged in to the VLE to see the link, and single sign on support allows direct access to any subscribed to resources via appropriate proxies. That is, the link to the resource leads directly to the full text, subscribed to version of the resource if the user is signed on to the institutional system and has appropriate credentials). There are several issues here: the link that is supplied to the CSE should be the public link to the article homepage; the article homepage is likely to reveal little more than the paper abstract to the search engine. I’m not sure if Google Scholar does full-text indexing of articles, but even if it does, Scholar results are not available to the CSE. (There is also the issue of how we would limit the Scholar search to the articles we are linking to from the course materials.)
- news and magazine articles: again, these may be behind a paywall, but even if they are, they may have been indexed by Google. So they may be full text discoverable via a CSE, even if they aren’t accessible once you click through…
- video and audio resources: discovery in a large part will depend on the text on the web page the resources are hosted on. If the link is directly to an audio or video file, discoverability via the CSE may well be very limited!
- books: Google book search provides full text search, but this is not available via a CSE. Full text searchable collections of books are achievable using Google Books Library Shelves; there’s also an API available.
I guess the workaround to all this is not to use a Google Custom Search Engine as the basis for a course search engine. Instead, construct a repository that contains full text copies of all resources linked to from the course, and index that using a local search engine, providing aliased links to the original sources if required?
However, that wasn’t what this experiment was about!;-)
Course Resources as part of a larger connected graph
Another way of thinking about linked to course resources is that they are a gateway into a set of connected resources. Most obviously, for an academic paper it is part of a graph structure that includes:
- links to papers referenced in the article;
- links to papers that cite the article;
- links to other papers written by the same author;
- links to other papers in collections containing the article on services such as Mendeley;
- links into the social graph, such as the social connections of the author, or the discovery of people who have shared a link to the resource on a public social network.
For an informal resource such as a blog post, it might be other posts linked to from the post of interest, or other posts that link to it.
Thinking about resources as being part of one or more connected graphs may influence our thinking about the pedagogy. If the intention is that a learner is directed to a resource as a terminal, atomic resource, from which they are expected to satisfy a particular learning requirement, then we aren’t necessarily interested in the context surrounding the resource. If the intention is that the resource is a gateway to a networked context around one or more ideas or concepts, then we need to select our resources so that they provide a springboard to other resources. This can be done directly (eg through following references contained within the work, or tracking down resources that cite it), or indirectly, for example by suggesting keywords or search phrases that might be used to discover related resources by independent means. Alternatively, we might link to a resource as an exemplar of the sort of resource students are expected to work with on a given activity, and then expect them to find similar sorts of, but alternative, resources for themselves.
Who Do The Science Literati Listen to on Twitter?
I really shouldn’t have got distracted by this today, but I did; via Owen Stephens: seen altmetric – tracking social media & other mentions of academic papers (by @stew)?
Monthly Altmetric data downloads of tweets containing mentions of published articles are available for download from Buzzdata, so I grabbed the September dataset, pulled out the names of folk sending the tweets, and how many paper mentioning tweets they had sent from the Unix command line:
cut -d ',' -f 3 twitterDataset0911_v1.csv | sort | uniq -c | sort -rn > tmp.txt
Read this list into a script, pulled out the folk who had sent 10 or more paper mentioning updates, grabbed their Twitter friends lists and plotted a graph using Gephi to see how they connected (nodes are coloured according to a loose grouping and sized according to eigenvector centrality):
My handle for this view is that it shows who’s influential in the social media (Twitter) domain of discourse relating to the scientific topic areas covered by the Altmetric tweet collection I downloaded. To be included in the graph, you need to have posted 10 or more tweets referring to one or more scientific papers in the collection period.
We can get a different sort of view over trusted accounts in the scientific domain by graphing the network of all the friends of (that is, people followed by) the people who sent 10 or more paper referencing tweets in September, as collected by altmetric, edges going from altmetric tweeps to all their friends. This is a big graph, so if we limit it to show folk followed by 100 or more of the folk who sent paper mentioning tweets and display those accounts, this is what we get:
My reading of this one is that it shows folk who are widely trusted by folk who post regular updates about scientific papers in particular subject areas.
Hmmm… now, I wonder: what else might I be able to do with the Altmetric data???
PS Okay – after some blank “looks”, here’s the method for the first graph:
1) get the September list of tweets from Buzzdata that contain a link to a scientific paper (as determined by Altmetric filters);
2) extract the Twitter screen names of the people who sent those tweets.
3) count how many different tweets were sent by each screen name.
4) extract the list of screen-names that sent 10 or more of the tweets that Altmetric collected. This list is a list of people who sent 10 or more tweets containing links to academic papers. Let’s call it ‘the September10+ list’.
5) for each person on the September10+ list, grab the list of people they follow on Twitter.
6) plot the graph of social connections between people on the September10+ list.
Okay? Got that?
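For anyone who prefers code to prose, steps 3 to 6 might be sketched in R along the following lines. This is not the script used to produce the graph: the screen names and friends lists are toy data standing in for the Altmetric dump and the Twitter API calls, and the variable names are made up for illustration.

```r
# Toy stand-in for the screen-name column of the Altmetric tweet dump (step 2)
senders <- c(rep("ann", 12), rep("bob", 10), rep("cat", 3), rep("dan", 15))

# Step 3: count tweets per screen name
counts <- table(senders)

# Step 4: keep names with 10 or more tweets (the 'September10+ list')
sept10 <- names(counts)[as.vector(counts) >= 10]

# Step 5: who each person on the list follows.
# Hypothetical data here; in the post these lists come from Twitter.
friends <- list(
  ann = c("bob", "dan", "nature"),
  bob = c("ann", "nature", "sciencemag"),
  dan = c("ann", "cat")
)

# Step 6: keep only the connections between people on the list
edges <- do.call(rbind, lapply(sept10, function(p) {
  f <- intersect(friends[[p]], sept10)
  if (length(f) == 0) return(NULL)
  data.frame(from = p, to = f, stringsAsFactors = FALSE)
}))

print(sept10)
print(edges)
```

An edge list like this, with one row per "from follows to" connection, can be saved as a CSV file and loaded into Gephi for layout and colouring.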
Here’s how the second graphic was generated.
a) take the September10+ list and for each person on it, get the list of all their friends on Twitter. (This is the same as step 5 above).
b) Build a graph as follows: for each person on the September10+ list, add a link from them to each person they follow on Twitter. This is a big graph. (The graph in 6 above only shows links between people on the September10+ list.)
c) I was a little disingenuous in the description in the body of this post… I now filter the graph to only show nodes with degree of 100 or more. For folk who are on the September10+ list, this means that the number of people on the September10+ list who follow them, plus the total number of people they follow, is equal to or greater than 100. For folk not on the September10+ list, this means that they are being followed by people with a degree of 100 or more who are on the September10+ list (which is to say they are being followed by at least 100 or so people on the September10+ list; I guess there could be folk followed by more than 100 people on the September10+ list who don’t appear in the graph if, for example, they were followed by folk in the original graph who predominantly had a degree of less than 100?).
d) to plot the word cloud graphic above, I visualise the filtered graph and then hide the nodes whose in-degree is 0 (that is, they aren’t followed by anyone else in the graph).
Got that? Simples… ;-)
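Steps b) to d) can be sketched in the same way. Again, this is an illustration rather than the method actually used (the filtering in the post was done in Gephi): the friends lists are toy data, and the threshold is set to 3 instead of 100 so that something survives the filter.

```r
# Hypothetical friends lists for the people on the September10+ list
friends <- list(
  ann = c("nature", "sciencemag", "bob"),
  bob = c("nature", "sciencemag"),
  dan = c("nature", "xkcd")
)

# Step b: an edge from each list member to everyone they follow
edges <- do.call(rbind, lapply(names(friends), function(p) {
  data.frame(from = p, to = friends[[p]], stringsAsFactors = FALSE)
}))

# Step c: total degree (followers in the graph plus people followed)
accounts <- union(edges$from, edges$to)
degree <- sapply(accounts, function(a) sum(edges$from == a) + sum(edges$to == a))

threshold <- 3  # the post used 100
kept <- accounts[degree >= threshold]

# Step d: within the filtered graph, hide nodes that nobody follows
kept_edges <- edges[edges$from %in% kept & edges$to %in% kept, ]
shown <- kept[kept %in% kept_edges$to]

print(degree)
print(shown)
```

Note how "ann" passes the degree filter purely on the strength of the number of accounts she follows, and is then hidden in the last step because no one left in the graph follows her.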
Digital Scholarship and Academically Discoverable Blogs
What does it take for a digital scholar’s blog to become academically credible?
At a time when we know that folk go to Google for a lot of their search needs, the academic library argues its case, in part, as a place where you can go to get access to “good quality” (academically credible) and comprehensive information through what we might term academic search engines.
The library’s search offerings are presumably subscription based (?) and their results often link through to subscription content; but the academic life is a privileged one, and our institutions cover the access costs on our behalf. (I guess this could almost be considered one of the “+ benefits” you might imagine an enthusiastic copywriter assuming for an academic job ad!)
The library and information access privilege extends to students too, so we might imagine a well-intentioned, but perhaps naive, student thinking that if they run a search using the Library’s “academically certified” search engine, they will get the sort of result they can happily cite in an essay, without fear of criticism about the academic credibility of the source publication.
We might imagine, too, that academics and researchers also place an element of trust in the credibility of sources returned as results to search queries raised using library discovery services.
So here’s a claim (which is untested and may or may not be true): if you want your work to stand a chance of being referenced in a piece of scholarly work, you need it to be discoverable in the places that the scholar goes to discover supporting claims or related material for the work they’re doing. The assumption is that the scholar will use a library provided discovery service because it is less noisy than a general web search engine and is likely to return resources from credible sources. The curation of sources – and what is not included in the index – is in part what the subscription discovery service offers.
What this means is that if digital scholars want their blogging activity to be discoverable in the academic context, they need to find some way of getting some of their blogposts at least into academic discovery service indices.
But this is not likely to happen, right? Wrong… Here’s what I noticed when I ran a search using the OU Library’s “one-stop” search earlier today:
A top two reference to a Mashable article (albeit identified as a news item) via the Newsbank database and a top ranked periodical article from Fast Company (via the UK/EIRE Reference Centre database). (Hmmm, I wonder how quickly this content is indexed? That is, how soon after posting on Mashable does an article become discoverable here?)
So maybe I need to start writing for Mashable?!
Or maybe not…?
One of the attractive features of WordPress as a publishing platform is that it provides feeds for everything, including category and tag level feeds. A handful of my category feeds are syndicated, for example to R-Bloggers, the Guardian Datablog blogroll and (I’m not sure if this still works?) the Online Journalism blog. Only posts tagged in a particular way are sent to the syndicated feeds.
So I’m wondering this: how much mileage would there be in setting up aggregation blogs around particular academic areas that not only syndicate content from publisher members, but also act as a focus for indexing by a service such as Newsbank? The content would be publisher-moderated (I don’t post content on non-R related matters to my R-bloggers syndication feed) and hopefully responsive to the norms of the aggregation community itself.
Precedents already exist of course; for example, Nature.com blogs aggregates blogs from a variety of working scientists. Is this content discoverable via the OU Library’s one stop/Ebsco search?
For an academic’s work to count in RAE terms, it needs to be cited. In order to be cited, it needs to be discoverable. Even if it isn’t citeable as a formal article, it can still make a contribution if it’s discoverable. To be academically discoverable, content needs to be discoverable via academic search engines. So why should Mashable count, but not personal academic blogs that are respected within their own communities?
PS I’m a bit out of touch with referencing conventions; I remember that pers. comm. used to be an acceptable way of crediting someone’s ideas they had personally communicated to you; is there a pub. comm. (that’s pub. comm. not just pub comm. ;-) equivalent that might be used to refer to online or offline public communications that might not otherwise be citeable?
Appropriate IT – My ILI2011 Presentation
Here’s a copy of the slides from my ILI2011 presentation on Appropriate IT:
One thing I wanted to explore was, if discovery happens elsewhere, and the role of the librarian is no longer focussed on discovery related issues, where can library folk help out? Here’s where I think we need to start placing some attention: sensemaking, and knowing what’s possible (aka helping redistribute the future that is already around us;-) Allied with this is the idea that we need to make more out of using appropriate IT for particular tasks, as well as appropriating IT where we can to make our lives easier.
In part, sensemaking is turning the wealth of relevant data out there into something meaningful for the question or issue at hand, or the choice we have to make. My own dabblings with social network analysis are approaches I’m working on that help me make sense of interest networks and social positioning within those networks so I can get a feel for how those communities are structured and who the major actors are within them.
As far as knowing what’s possible, I think we have a real issue with “folk IT” knowledge. Most of us have a reasonable grasp of folk physics and folk psychology. That is, we have a reasonable common-sense model of how the world works at the human scale (let go of an apple, it falls to the floor), and we can generally read other people from their behaviour; but how well developed is “folk IT” knowledge? Given that to most people the idea that you can search within a page in a wide variety of electronic documents using ctrl-F as a keyboard shortcut to a “search within page/document” feature is alien to them, I think our folk understanding of IT is limited to the principle of “if you switch it off and on again it should start working again”.
Folk IT is also tied up with computational thinking, but at a practical, “human scale”. So here are a few ideas I think the librarians need to start pushing:
- the idea of a graph; it’s what the web’s based around, after all, and it also helps us understand social networks. If you think of your website as a graph, with edges representing links that connect nodes/pages together, and realise that your on-site homepage is whatever page someone lands on from a search engine or third party link, you soon start to realise that maybe your website is not as usefully structured as you thought…
- some sort of common sense understanding of the role that URLs/URIs play in the browser, along with the idea that URIs are readable and hackable and also may say something about the way a website, or the resources it makes available, is organised;
- the notion of “View Source”, that allows you to copy and crib the work of others when constructing your own applications, along with the very idea that you might be able to build web pages yourself out of free standing components.
- the idea of document types and applications that can work all sorts of magic given documents of that type; the knowledge that an MP3 file works well with an audio player or audio editor, for example, or that a PNG or JPG encodes an image, along with more esoteric formats such as KML (paste a URL to a KML file into the search box of a Google Maps search and see what happens, for example…). Knowledge of the filetype/document type gives you some sort of power over it, and helps you realise what sorts of thing you can do with it… (except for things like PDF, for example, which is to all intents and purposes a “can’t do anything with it” filetype;-)
I also think an understanding of pattern based string matching and what regular expressions allow you to do would go a long way towards helping folk who ever have to manipulate text or text-based data files, at least in terms of letting them know that there are often better ways of cleaning up a text file automagically rather than having to repeat the same operation over and over again on each separate row in a file containing several thousand lines… They don’t need to know how to write the regular expression from the off, just that the sorts of operation regular expressions support are possible, and that someone will probably be able to show you how to do it…
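As a small taste of the sort of operation meant here, the following sketch tidies up a few messy rows in one pass each, rather than row by row by hand. The rows are made up, and the particular clean-ups (trimming spaces, tightening separators, swapping "Surname,First" into "First Surname") are just examples of what a pattern can do.

```r
# Made-up rows from a messy text file
rows <- c("  Smith,  John ;  01/02/2011 ",
          "Jones,Mary;3/11/2011",
          " Patel , Anil ; 12/5/2011  ")

# 1. Trim whitespace from the start and end of every row
clean <- gsub("^[[:space:]]+|[[:space:]]+$", "", rows)

# 2. Remove any whitespace around the , and ; separators
clean <- gsub("[[:space:]]*([,;])[[:space:]]*", "\\1", clean)

# 3. Swap "Surname,First;" into "First Surname;" using two capture groups
clean <- sub("^([^,]+),([^;]+);", "\\2 \\1;", clean)

print(clean)
```

The same three lines would work unchanged on a file of several thousand rows read in with readLines(), which is rather the point.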
Power Tools for Aspiring Data Journalists: Funnel Plots in R
Picking up on Paul Bradshaw’s post A quick exercise for aspiring data journalists which hints at how you can use Google Spreadsheets to grab – and explore – a mortality dataset highlighted by Ben Goldacre in DIY statistical analysis: experience the thrill of touching real data, I thought I’d describe a quick way of analysing the data using R, a very powerful statistical programming environment that should probably be part of your toolkit if you ever want to get round to doing some serious stats, and have a go at reproducing the analysis using a bit of judicious websearching and some cut-and-paste action…
R is an open-source, cross-platform environment that allows you to do programming-like things with stats, as well as producing a wide range of graphical statistics (stats visualisations) as if by magic. (Which is to say, it can be terrifying to try to get your head round… but once you’ve grasped a few key concepts, it becomes a really powerful tool… At least, that’s what I’m hoping as I struggle to learn how to use it myself!)
I’ve been using R-Studio to work with R because a) it’s free and works cross-platform, b) it can be run as a service and accessed via the web (though I haven’t tried that yet; the hosted option still hasn’t appeared, either…), and c) it offers a structured environment for managing R projects.
So, to get started. Paul describes a dataset posted as an HTML table by Ben Goldacre that is used to generate the dots on this graph:

The lines come from a probabilistic model that helps us see the likely spread of death rates given a particular population size.
If we want to do stats on the data, then we could, as Paul suggests, pull the data into a spreadsheet and then work from there… Or, we could pull it directly into R, at which point all manner of voodoo stats capabilities become available to us.
As with the =importHTML formula in Google spreadsheets, R has a way of scraping data from an HTML table anywhere on the public web:
#First, we need to load in the XML library that contains the scraper function
library(XML)
#Scrape the table
cancerdata=data.frame( readHTMLTable( 'http://www.guardian.co.uk/commentisfree/2011/oct/28/bad-science-diy-data-analysis', which=1, header=c('Area','Rate','Population','Number')))
The format is simple: readHTMLTable(url,which=TABLENUMBER) (TABLENUMBER is used to extract the N’th table in the page.) The header part labels the columns (the data pulled in from the HTML table itself contains all sorts of clutter).
We can inspect the data we’ve imported as follows:
#Look at the whole table
cancerdata
#Look at the column headers
names(cancerdata)
#Look at the first 10 rows (head() shows only 6 rows unless told otherwise)
head(cancerdata, 10)
#Look at the last 10 rows
tail(cancerdata, 10)
#What sort of datatype is in the Number column?
class(cancerdata$Number)
The last line – class(cancerdata$Number) – identifies the data as type ‘factor’. In order to do stats and plot graphs, we need the Number, Rate and Population columns to contain actual numbers… (Factors organise data according to categories; when the table is loaded in, the data is loaded in as strings of characters; rather than seeing each number as a number, it’s identified as a category.)
#Convert the numerical columns to a numeric datatype
cancerdata$Rate=as.numeric(levels(cancerdata$Rate)[as.integer(cancerdata$Rate)])
cancerdata$Population=as.numeric(levels(cancerdata$Population)[as.integer(cancerdata$Population)])
cancerdata$Number=as.numeric(levels(cancerdata$Number)[as.integer(cancerdata$Number)])
#Just check it worked…
class(cancerdata$Number)
head(cancerdata)
We can now plot the data:
#Plot the Number of deaths by the Population
plot(Number ~ Population,data=cancerdata)
If we want to, we can add a title:
#Add a title to the plot
plot(Number ~ Population,data=cancerdata, main='Bowel Cancer Occurrence by Population')
We can also tweak the axis labels:
plot(Number ~ Population,data=cancerdata, main='Bowel Cancer Occurrence by Population',ylab='Number of deaths')
The plot command is great for generating quick charts. If we want a bit more control over the charts we produce, the ggplot2 library is the way to go. (ggplot2 isn't part of the standard R bundle, so you'll need to install the package yourself if you haven't already installed it. In RStudio, find the Packages tab, click Install Packages, search for ggplot2 and then install it, along with its dependencies...):
require(ggplot2)
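#Note: ggtitle() replaces the opts(title=...) call, which recent versions of ggplot2 no longer support
ggplot(cancerdata)+geom_point(aes(x=Population,y=Number))+ggtitle('Bowel Cancer Data')+ylab('Number of Deaths')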
Doing a bit of searching for the "funnel plot" chart type used to display the data in Goldacre's article, I came across a post on Cross Validated, the Stack Overflow/Stack Exchange site dedicated to statistics related Q&A: How to draw funnel plot using ggplot2 in R?
The meta-analysis answer seemed to produce a similar chart type, so I had a go at cribbing the code... This is a dangerous thing to do, and I can't guarantee that the analysis is the same type of analysis as the one Goldacre refers to... but what I'm trying to do is show (quickly) that R provides a very powerful stats analysis environment and could probably do the sort of analysis you want in the hands of someone who knows how to drive it, and also knows what stats methods can be appropriately applied for any given data set...
Anyway - here's something resembling the Goldacre plot, using the cribbed code which has confidence limits at the 95% and 99.9% levels. Note that I needed to do a couple of things:
1) work out what values to use where! I did this by looking at the ggplot code to see what was plotted. p was on the y-axis and should be used to present the death rate. The data provides this as a rate per 100,000, so we need to divide by 100,000 to make it a rate in the range 0..1. The x-axis is the population.
#TH: funnel plot code from:
#TH: http://stats.stackexchange.com/questions/5195/how-to-draw-funnel-plot-using-ggplot2-in-r/5210#5210
#TH: Use our cancerdata
number=cancerdata$Population
#TH: The rate is given as a 'per 100,000' value, so normalise it
p=cancerdata$Rate/100000
p.se <- sqrt((p*(1-p)) / (number))
df <- data.frame(p, number, p.se)
## common effect (fixed effect model)
p.fem <- weighted.mean(p, 1/p.se^2)
## lower and upper limits for 95% and 99.9% CI, based on FEM estimator
#TH: I'm going to alter the spacing of the samples used to generate the curves
number.seq <- seq(1000, max(number), 1000)
number.ll95 <- p.fem - 1.96 * sqrt((p.fem*(1-p.fem)) / (number.seq))
number.ul95 <- p.fem + 1.96 * sqrt((p.fem*(1-p.fem)) / (number.seq))
number.ll999 <- p.fem - 3.29 * sqrt((p.fem*(1-p.fem)) / (number.seq))
number.ul999 <- p.fem + 3.29 * sqrt((p.fem*(1-p.fem)) / (number.seq))
dfCI <- data.frame(number.ll95, number.ul95, number.ll999, number.ul999, number.seq, p.fem)
## draw plot
#TH: note that we need to tweak the limits of the y-axis
fp <- ggplot(aes(x = number, y = p), data = df) +
  geom_point(shape = 1) +
  geom_line(aes(x = number.seq, y = number.ll95), data = dfCI) +
  geom_line(aes(x = number.seq, y = number.ul95), data = dfCI) +
  #TH: linetype is a fixed setting rather than a data mapping, so it sits outside aes()
  geom_line(aes(x = number.seq, y = number.ll999), linetype = 2, data = dfCI) +
  geom_line(aes(x = number.seq, y = number.ul999), linetype = 2, data = dfCI) +
  geom_hline(aes(yintercept = p.fem), data = dfCI) +
  scale_y_continuous(limits = c(0,0.0004)) +
  xlab("number") + ylab("p") + theme_bw()
fp
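To check my own understanding of what the cribbed code computes, here is the same arithmetic as a minimal sketch in Python, the language I use for scripting elsewhere on this blog, with no plotting. The figures in areas are made up (they are not the Guardian data) and the function name funnel_limits is mine. The limits are the pooled rate plus or minus z standard errors, with the standard error of a proportion taken as sqrt(p(1-p)/n); it assumes every rate lies strictly between 0 and 1.

```python
import math

# Hypothetical (rate per 100,000, population) pairs, not the Guardian figures
areas = [(15.0, 120000), (22.0, 45000), (18.5, 310000), (9.0, 20000)]


def funnel_limits(p_centre, n, z):
    """Lower and upper control limits for a proportion at population size n."""
    half_width = z * math.sqrt(p_centre * (1 - p_centre) / n)
    return p_centre - half_width, p_centre + half_width


# Convert each rate to a proportion in the range 0..1
p = [rate / 100000 for rate, _ in areas]
n = [population for _, population in areas]

# Inverse-variance weights, 1/se^2, where se^2 = p(1-p)/n
weights = [n_i / (p_i * (1 - p_i)) for p_i, n_i in zip(p, n)]
p_fem = sum(w * p_i for w, p_i in zip(weights, p)) / sum(weights)

# 95% limits use z = 1.96; 99.9% limits use z = 3.29
low95, high95 = funnel_limits(p_fem, 50000, 1.96)
low999, high999 = funnel_limits(p_fem, 50000, 3.29)

# An area is flagged when its rate falls outside the limits for its own size
outside_95 = []
for p_i, n_i in zip(p, n):
    low, high = funnel_limits(p_fem, n_i, 1.96)
    outside_95.append(not (low <= p_i <= high))
```

Because n sits under the square root, the limits narrow as the population grows, which is what gives the plot its funnel shape.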
As I said above, it can be quite dangerous just pinching other folks' stats code if you aren't a statistician and don't really know whether you have actually replicated someone else's analysis or done something completely different... (this is a situation I often find myself in!); which is why I think we need to encourage folk who release statistical reports to not only release their data, but also show their working, including the code they used to generate any summary tables or charts that appear in those reports.
In addition, it's worth noting that cribbing other folk's code and analyses and applying them to your own data may lead to a nonsense result because some stats analyses only work if the data has the right sort of distribution... So be aware of that, always post your own working somewhere, and if someone then points out that it's nonsense, you'll hopefully be able to learn from it...
Given those caveats, what I hope to have done is raise awareness of what R can be used to do (including pulling data into a stats computing environment via an HTML table screenscrape) and also produced some sort of recipe we could take to a statistician to say: is this the sort of thing Ben Goldacre was talking about? And if not, why not?
[If I've made any huge - or even minor - blunders in the above, please let me know... There's always a risk in cutting and pasting things that look like they produce the sort of thing you're interested in, but may actually be doing something completely different!]
PS for how to generate reports that can (optionally) also self-document with the actual source R code, see How might data journalists show their working? Sweave. The code used in, and comments added to, that post make further refinements to the funnel plot code.
Guardian Tag Explorer, Take 2 – Martin Works Some Magic
For whatever reason, I seem to lack the discipline – or insight (or skills) – required to make anything anyone would want to actually use, although I do take delight in exploring new ways of combining existing applications and services to see how they might (in principle) work together…
…which is why I’m really fortunate to have Martin as an informal (and I hope not too put-upon!) network collaborator. Take the Hawkseyfied Guardian Tag Explorer for example, which takes a couple of half-baked hacks of mine and puts them together in a way that allows you to get a mesoscopic view over how particular companies, individuals or news stories have been represented in the Guardian based on the Guardian OpenPlatform tag metadata used to describe the articles they are mentioned in, along with a summary of how many of the corresponding articles were tagged that way, what those articles actually were, and with a link to them:
Martin relates something of its genesis in (Guardian Tag Explorer: When the Guardian Open Platform met d3.js), describing a loosely coupled way of working we have stumbled upon that I think is highly creative, leads to potentially interesting innovations, and often results in incredibly useful – and powerful – recipes and building blocks. And all unfunded, at least in terms of bid for, planned project funding…
In addition, Martin’s been my go-to person for all matters relating to Google Apps Script for quite some time now; has built up quite a suite of self-deployable Twitter archiving tools using Google Spreadsheets; and I still don’t understand why more folk haven’t picked up on how far he managed to push the idea of Twitter video subtitles.
I should also add that a lot of my thinking is inspired as a response to something that Martin has created, and that shines light on something that is possible that I’d have probably never considered before. So for example, looking at the reworked tag explorer above, it suggests to me that rather than linking out to the article, we could probably now just as easily pull the story text from the API and display it in-app…If I get a chance this weekend, I’ll try to explore that;-)
I would try to reflect a little on why our loose collaboration feels so productive (at least from my side, for the ideas it sparks, the cribs I can reuse, and the noticings Martin comes up with), but I don’t want to break it…! I think it does have something to do with loosely collaborating – in public, and often in real-time – on reactive unprojects, though, which are often inspired by one or the other of us noticing something that has just been released or commented on elsewhere, reacting to something the other of us has said, responding to a question we’ve seen tweeted, or because we’ve wondered whether something is possible. And then just tried to do it. Or asked if the other has already done it! Without obligation. But often with the idea that if the other finds it interesting… (#hackbait ;-)
PS If you don’t follow Martin’s blog already, I suggest you start doing so. His recipes also tend to be far easier to follow (and less buggy!) than mine! MASHe.
Visualising New York Times Article API Tag Graphs Using d3.js
Picking up on Tinkering with the Guardian Platform API – Tag Signals, here’s a variant of the same thing using the New York Times Article Search API.
As with the Guardian OpenPlatform demo, the idea is to look up recent articles containing a particular search term, find the tag(s) used to describe the articles, and then graph them. The idea behind this approach is to get a quick snapshot of how the search target is represented, or positioned, in this case, by the New York Times.
Here is the code. The main differences compared to the Guardian API gist are as follows:
- a hacked recipe for getting several paged results back; I really need to sort this out properly, just as I need to generalise the code so it will work with either the Guardian or the NYT API, but that’s for another day now…
- the use of NetworkX as a way of representing the undirected tag-tag graph;
- the use of the NetworkX D3 helper library (networkx-d3) to generate a JSON output file that works with the d3.js force directed layout library.
Note that the D3 Python library generates a vanilla force directed layout diagram. In the end, I just grabbed the tiny bit of code that loads the JSON data into a D3 network, and then used it along with the code behind the rather more beautiful network layout used for this Mobile Patent Suits visualisation.
Here’s a snapshot of the result of a search for recent articles on the phrase Gates Foundation:
At this point, it’s probably worth pointing out that the Python script generates the graph file, and then the d3.js library generates the graph visualisation within a web browser. There is no human hand (other than force layout parameter setting) involved in the layout. I guess with the tweaking of a few parameters, maybe juggling the force layout parameters a little more, I could get an even clearer layout. It might also be worth trying to find a way of sizing, or at least colouring, the nodes according to degree (or even better, weighted degree?). I also need to find a way, if possible, of representing the weight of edges if the D3 Python library actually exports this (or if it exports multiple edges between the same two nodes).
Anyway, for an hour or so’s tinkering, it’s quite a neat little recipe to be able to add to my toolkit. Here’s how it works again: Python script calls NYT Article Search API and grabs articles based on a search term. Grab the tags used to describe each article and build up a graph using NetworkX that connects any and all tags that are mentioned on the same article. Dump the graph from its NetworkX representation as a JSON file using the D3 library, then use the D3 Patent Suits layout recipe to render it in the browser :-)
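As a rough illustration of the graph-building step in that recipe, here is a minimal sketch in Python. It is not the script linked above: it swaps plain dictionaries and the standard json module in for NetworkX and the networkx-d3 helper, the article records are invented, and the nodes/links JSON shape with index-based links is an assumption based on the early d3.js force layout examples.

```python
import json
from collections import Counter
from itertools import combinations

# Hypothetical article records; a real run would fill these from API results
articles = [
    {"title": "Article one", "tags": ["Philanthropy", "Malaria", "Vaccines"]},
    {"title": "Article two", "tags": ["Philanthropy", "Education"]},
    {"title": "Article three", "tags": ["Malaria", "Philanthropy"]},
]


def tag_graph(articles):
    """Count how often each unordered pair of tags appears on the same article."""
    edge_counts = Counter()
    for article in articles:
        # sorted() gives each pair a canonical order, so A-B and B-A are one edge
        for pair in combinations(sorted(set(article["tags"])), 2):
            edge_counts[pair] += 1
    return edge_counts


def to_d3_json(edge_counts):
    """Build a nodes/links structure of the kind d3.js force layouts can load."""
    names = sorted({tag for pair in edge_counts for tag in pair})
    index = {name: i for i, name in enumerate(names)}
    nodes = [{"name": name} for name in names]
    links = [
        {"source": index[a], "target": index[b], "value": count}
        for (a, b), count in sorted(edge_counts.items())
    ]
    return json.dumps({"nodes": nodes, "links": links}, indent=2)


edges = tag_graph(articles)
graph_json = to_d3_json(edges)
```

Keeping a count on each edge means the weight is available in the JSON as value, so the browser-side code could use it for edge thickness if it chose to.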
Now all I have to do is find out how I can grab an SVG dump of the network from a browser into a shareable file…
Active Lobbying Through Meetings with UK Government Ministers
In a move that seemed to upset collectors of UK ministerial meeting data, @whoslobbying, on grounds of wasted effort, the Guardian datastore published a spreadsheet last night containing data relating to ministerial meetings between May 2010 and March 2011.
(The first release of the spreadsheet actually omitted the column containing who the meeting was with, but that seems to be fixed now… There are, however, still plenty of character encoding issues (apostrophes, accented characters, some sort of em-dash, etc) that might cripple some plug and play tools.)
Looking over the data, we can use it as the basis for a network diagram in which the nodes are actors (Ministers and lobbyists) and the edges represent meetings between Ministers and lobbyists. There is one slight complication: where a meeting involves a Minister and several lobbyists, we ideally need to split the lobbyists out into their own nodes.
This probably provides an ideal opportunity to have a play with the Stanford Data Wrangler and try forcing these separate lobbyists onto separate rows, but I didn’t allow myself much time for the tinkering (and the requisite learning!), so I resorted to a Python script to read in the data file and split out the different lobbyists. (I also did an iterative step, cleaning the downloaded CSV file in a text editor by replacing nasty characters that caused the script to choke.) You can find the script here (note that it makes use of the networkx network analysis library, which you’ll need to install if you want to run the script.)
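For anyone who wants the gist of the splitting step without opening the linked script, here is a minimal stand-alone sketch in Python. It is not that script: the column names, the sample rows and the separators between organisations are all assumptions, and it uses only the standard library rather than networkx.

```python
import csv
import io
from collections import Counter

# Hypothetical sample in the rough shape of the spreadsheet; the real column
# names and the separators used between organisations are assumptions
SAMPLE = """Minister,Organisation,Purpose
Minister A,"Barclays, HSBC",Banking reform
Minister B,Barclays,Introductory meeting
Minister A,Barclays; Shelter,Housing finance
"""


def split_organisations(cell):
    """Split one cell naming several organisations into separate names."""
    parts = cell.replace(";", ",").split(",")
    return [part.strip() for part in parts if part.strip()]


def meeting_edges(csv_text):
    """Count (minister, organisation) meetings, one edge per organisation."""
    edges = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        for organisation in split_organisations(row["Organisation"]):
            edges[(row["Minister"].strip(), organisation)] += 1
    return edges


edges = meeting_edges(SAMPLE)

# Organisations that met at least two different Ministers in the sample
ministers_met = Counter(organisation for _, organisation in edges)
well_connected = sorted(o for o, count in ministers_met.items() if count >= 2)
```

Splitting naively on commas will wrongly break up any organisation whose name itself contains a comma, so the output would still need checking by eye.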
The script generates a directed graph with links from Ministers to lobbyists and dumps it to a GraphML file (available here) that can be loaded directly into Gephi. Here’s a view – using Gephi – of the heart of the network. If we filter the graph to show nodes that met with at least five different Ministers…
we can get a view into the heart of the UK lobbying network:
I sized the lobbyist nodes according to eigenvector centrality, which gives an indication of how well connected they are in the network.
One of the nice things about Gephi is that it allows for interactive exploration of a graph. For example, I can hover over a lobbyist node – Barclays in this case – to see which Ministers were met:
Alternatively, we can see who of the well connected met with the Minister for Welfare Reform:
Looking over the data, we also see how some Ministers are inconsistently referenced within the original dataset:
Note that the layout algorithm is such that the different representations of the same name are likely to meet similar lobbyists, which will end up placing the node in a similar location under the force directed layout I used. Which is to say – we may be able to use visual tools to help us identify fractured representations of the same individual. (Note that multiple meetings between the same parties can be visualised using the thickness of the edges, which are weighted according to the number of times the edge is described in the GraphML file…)
Unifying the different representations of the same individual is something that Google Refine could help us tidy up with its various clustering tools, although it would be nice if the Datastore folk addressed this at source (or at least, as part of an ongoing data quality enhancement process…;-)
I guess we could also try reconciling company names against universal company identifiers, for example by using Google Refine’s reconciliation service and the Open Corporates database? Hmmm, which makes me wonder: do MySociety, or Public Whip, offer an MP or Ministerial position reconciliation service that works with Google Refine?
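To give a flavour of how that sort of clustering can work, here is a minimal sketch in Python, loosely modelled on the "fingerprint" style of key collision clustering. It is not Google Refine's own code, the names are invented, and the exact normalisation steps are my assumption.

```python
import re
import unicodedata
from collections import defaultdict

# Hypothetical variants of the same names, for illustration only
names = [
    "Jane Smith",
    "jane  smith",
    "Smith, Jane",
    "John Doe",
    "Doe, John.",
    "Alex Example",
]


def fingerprint(name):
    """Normalise a name so trivially different spellings share one key."""
    # Strip accents, lowercase, and replace punctuation with spaces
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_name = decomposed.encode("ascii", "ignore").decode("ascii")
    cleaned = re.sub(r"[^\w\s]", " ", ascii_name.lower())
    # Sort the unique tokens so that word order does not matter
    return " ".join(sorted(set(cleaned.split())))


clusters = defaultdict(list)
for name in names:
    clusters[fingerprint(name)].append(name)

# Keep only keys that gather more than one distinct spelling
candidates = {key: group for key, group in clusters.items() if len(set(group)) > 1}
```

Each entry in candidates is a set of spellings a human should look over before merging; a key match is only a suggestion, and variants that add or drop a title would not collide under this scheme.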
A couple of things I haven’t done: represented the department (which could be done via a node attribute, maybe, at least for the Ministers); represented actual meetings, and what I guess we might term co-lobbying behaviour, where several organisations are in the same meeting.
Segmented Communications on Twitter via @-partner Messaging
As this blog rarely attracts comments, it can be quite hard for me to know who, if anyone, regularly reads it (likely known suspects and the Googlebot aside). The anonymous nature of feed reader subscriptions also means it is tricky to know who (if anyone) is reading the blog at all…
Twitter is slightly different in this regard, because for the majority of accounts, the friends and followers lists are public; which means it’s possible to “position” a particular account in terms of the interests of the folk it follows and who follow it.
Whilst I was putting together A Couple More Social Media Positioning Maps for UK HE Twitter Accounts, I considered including a brief comment on how the audience of a popular account will probably segment into different interest groups, and whether or not there was any mileage in trying to customise messages to particular segments without alienating the other parts of the audience.
Seeing @eingang’s use yesterday of a new (to me) Twitter convention of sending hashtagged messages to @hidetag, so that folk following the hashtag would see the tweet, but Michelle’s followers wouldn’t necessarily see the tagged tweets (no-one should follow @hidetag, NO_ONE ;-), it struck me that we might be able to use a related technique to send messages that are only visible to a particular segment of the followers of a Twitter account…
How so?
Firstly, you need to know that public Twitter messages sent to a particular person by starting the message with an @name are generally only visible in the streams of folk who follow both the sender and @name (identifying this population was one of the reasons I put together the Who Can See Whose Conversations In-stream on Twitter? tool).
Secondly, you need to do a bit of social network analysis. (In what follows, I assume a directed graph where an edge from A to B means that A follows B, or equivalently, B is a friend of A.) A quick and dirty approach might be to use in-degree and out-degree, or maybe the HITS algorithm/authority and hub values, as follows: identify the audience segment you want to target by looking for clusters in how your followers follow each other, then do a bit of network analysis on that segment to look for Authority nodes or nodes that are followed by a large number of people in that segment who also follow you. If you now send a message to that Authority/high in-degree node, it will be seen in-stream by that user, as well as those of your followers who also follow that Authority account.
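The quick and dirty in-degree version of that idea can be sketched in a few lines of Python. The follow graph, the account names and the function names below are all invented for illustration, and the segment is simply given rather than found by clustering.

```python
# Hypothetical follow graph: each account maps to the set of accounts it follows
follows = {
    "brand_x": {"authority_y"},
    "alice": {"brand_x", "authority_y"},
    "bob": {"brand_x", "authority_y", "alice"},
    "carol": {"brand_x"},
    "dave": {"authority_y"},
    "authority_y": {"brand_x"},
}


def followers_of(account, follows):
    """Return everyone who follows the given account."""
    return {user for user, friends in follows.items() if account in friends}


def in_stream_audience(sender, target, follows):
    """Accounts following both sender and target, who see '@target ...' in-stream."""
    return followers_of(sender, follows) & followers_of(target, follows)


def best_partner(sender, segment, follows):
    """Pick the segment member followed by most of the sender's followers there."""
    audience = followers_of(sender, follows) & segment
    in_degree = {
        candidate: len(followers_of(candidate, follows) & audience)
        for candidate in segment
    }
    # sorted() makes the choice deterministic when two candidates tie
    return max(sorted(in_degree), key=in_degree.get)


audience = in_stream_audience("brand_x", "authority_y", follows)
partner = best_partner("brand_x", {"alice", "bob", "authority_y"}, follows)
```

In this toy graph a message from brand_x that starts with @authority_y reaches alice and bob in-stream (as well as authority_y itself), but not carol, who follows only brand_x, or dave, who follows only authority_y.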
This approach can be seen as a version of co-branding/brand partnership: conversational co-branding/conversational brand partnerships. Here’s how it may work: brand X has an audience that segments into groups A, B and C. Suppose that company Y is an authority in segment B. If X and Y form a conversational brand partnership, X can send messages ostensibly to Y that also reach segment B. For a mutually beneficial relationship, X would also have to be an authority in one of Y’s audience segments (for example, segment P out of segments P, Q, and R.) Ideally, P and B would largely overlap, meaning they can have a “sensible” conversation and it will hit both their targeted audiences…
For monitoring discussions within a particular segment, it strikes me that we could monitor the messages seen by an individual with a large Hub value/out-degree (that is, someone who follows large numbers of (influential) folk within the segment). By tapping into the Hub’s stream, we get some sort of sampling of the conversations taking place within the segment.
These ideas are completely untested (by me) of course… But they’re something I may well start to tinker with if an appropriate opportunity arises…





















