OUseful.Info, the blog…

Trying to find useful things to do with emerging technologies in open education

Archive for the ‘Tinkering’ Category

Experimenting With iGraph – and a Hint Towards Ways of Measuring Engagement?

with one comment

For fear of being left way behind as Martin Hawksey starts to get to grips with R, (see for example how he’s using R to automate the annotation of Google Spreadsheets with calculations that don’t come readily or efficiently to hand in Google Spreadsheets itself), I thought I better try to get to grips with R’s igraph library…

So here’s a script that gives me some hints as to how to start migrating chunks of my clunky Python script into R, as well as some ideas about how to start reporting on the structure of hashtag communities in a graphical as well as stats analytical way.

require(igraph)

#load in a graph from a graphml file; the graph contains nodes representing Twitter users connected by directed edges that represent friend or follower relations, depending on the actual experimental condition I ran
g=read.graph('/Users/ajh59/code/twapps/newt/reports/scmvESP/scmvESP_2012-01-26-22-53-45/friends_outerfriendsdegree_X_25_25_X_esp.graphml',format='graphml')
summary(g)
#Vertices are obtained via V(g). The summary() tells us what attributes are available.
#So for example, inspect the label attribute
V(g)$label
#in and out degree counts for each (labelled) node/vertex
df=data.frame(name=V(g)$label,indegree=degree(g,mode='in'),outdegree=degree(g,mode='out'))

#inspect the top 10 nodes sorted by indegree
#the plyr arrange function makes sorting dataframes a doddle...
require(plyr)
df2=head(arrange(df,desc(indegree)),10)
df2

#get ready to do some plots
require(ggplot2)

#It might be interesting to look at the in-degree and out-degree distributions
#out-degree, because we see how promiscuous folk are in their following behaviour
#h/t to @mhawksey for pointing out the mode argument to me.. doh!
ddout=degree.distribution(g,mode='out')
#degree.distribution() "a numeric vector of the same length as the maximum degree plus one. The first element is the relative frequency zero degree vertices, the second vertices with degree one,etc." 
#We can use the vector vals as the y-value, but x is unspecified/implied by the row number
#So we need to generate the x vals explicitly...?
ggplot()+geom_point(aes(c(1:length(ddout)),ddout))
#If we want to ignore the outdegree==0 value, we can skip the first item in the list
ggplot()+geom_point(aes(c(2:length(ddout)),ddout[-1]))

#in-degree
ddin=degree.distribution(g,mode='in')
ggplot()+geom_point(aes(c(1:length(ddin)),ddin))
ggplot()+geom_point(aes(c(2:length(ddin)),ddin[-1]))

#We can also plot indegree and outdegree together
#Use colour to distinguish the points, and make the upper layer smaller in case we overplot
ggplot() + geom_point(aes(c(2:length(ddin)),ddin[-1]),colour='red') + geom_point(aes(c(2:length(ddout)),ddout[-1]),colour='blue',size=1)

Note that I really should have labelled the axes – x-axis is “in (or out) degree”, y-axis is “proportion of nodes with corresponding in (or out) degree”.

Out-degree:

Out-degree (except out-degree==0):

In-degree:

In-degree (except in-degree==0):

One thing I notice about the in-degree is that there is a very high number of very low in-degree nodes, which tail off quickly, and then another head at in-degree 25 which then tails off. This is an artefact of the way the graph file was pre-processed – I generated a friends network of hashtag users, then filter the network to only include nodes that had indegree of at least 25 and/or outdegree of at least 25. The nodes with in-degree between 1 and 25 are nodes corresponding to hashtaggers that are friended by other hashtaggers.

In- (blue) and out- (red) degree:

Reflecting on the in-degree graph, we have a way of identifying those folk who used the hashtag and are connected to other hashtaggers:

arrange(subset(df,subset=(outdegree>0 & indegree>0)),desc(indegree))

The dataset I’m using refers is based on folk using the #bbcqt hashtag. Here are the hashtaggers most linked to by other hashtaggers:

> head(arrange(subset(df,subset=(outdegree>0)),desc(indegree)))
             name indegree outdegree
1 bbcquestiontime      190       102
2       DIMBLEBOT       76        61
3   markinreading       34       121
4 politicalhackuk       27       236
5          10anta       25        73
6 Parlez_me_nTory       24        63

So now I’m wondering… does this hint at a way of measuring some sort of engagement with the Twitter account set up to promote the programme and, presumably, the hashtag???

If we consider @bbcquestiontime, the high indegree tells us that the @bbcquestiontime account is being followed by a significant number of the hashtag users (we could find out what proportion by dividing through by the number of folk with out-degree>1 minus 1 (minus 1 because @bbcquestiontime is one of those hashtaggers). That @bbcquestiontime has outdegree > 0 tells us it was sampled as a user of the hashtag (the graph was originally generated with directed edges from folk who used the tag to their friends.) The high (ish?!) out-degree tells us that this account is linking to a reasonable number of folk popularly followed by users of the #bbcqt hashtag or who used the hashtag; so #bbcquestiontime is listening to folk that the #bbcqt taggers listen to, which is probably a good thing. (I guess what we could do here is compare the outdegree of the @bbcquestiontime account with its total friend count (ie, with the total number of accounts it follows. Because if the account was following 1000 people or so, and only 10% of them were being followed by #bbcqt hashtaggers, we might wonder whether they’re interested in different things?) Once again, we could also normalise the out-degree number with respect to one less number of accounts with indegree >0 (again, we subtract one to account for the self reference) to get the proportion of folk being followed by hashtaggers that are being followed by @bbcquestiontime. This gives us some idea of the extent to which @bbcquestiontime is listening to folk that the #bbcqt hashtaggers are listening to.

Let’s try that latter normalisation to get a feel for what the proportions are…

#Count the number of rows where folk have indegree, or outdegree, as required, > 0
df$inReach=df$indegree/(nrow(subset(df,df$outdegree>0))-1)
df$outReach=df$outdegree/(nrow(subset(df,df$indegree>0))-1)
#First let's see who reaches furthest out into the interest community
head(arrange(subset(df,inReach>0),desc(outReach)))
             name indegree outdegree  outReach     inReach
1        Damientg        5       341 0.4782609 0.013054830
2      danmknight        9       265 0.3716690 0.023498695
3         martysm        1       261 0.3660589 0.002610966
4       MrJacHart       18       257 0.3604488 0.046997389
5           VMcAV        5       237 0.3323983 0.013054830
6 politicalhackuk       27       236 0.3309958 0.070496084

#now let's see who is touched by most of the community
head(arrange(subset(df,outReach>0),desc(inReach)))
             name indegree outdegree   outReach    inReach
1 bbcquestiontime      190       102 0.14305750 0.49608355
2       DIMBLEBOT       76        61 0.08555400 0.19843342
3   markinreading       34       121 0.16970547 0.08877285
4 politicalhackuk       27       236 0.33099579 0.07049608
5          10anta       25        73 0.10238429 0.06527415
6 Parlez_me_nTory       24        63 0.08835905 0.06266319

So, from that, we see that @Damientg is following a large number of the folk popularly followed by users of the #bbcqt hashtag or who used the hashtag. I don’t think this is interesting. However, the fact that @bbcquestiontime is followed by about half the folk who used the #bbcqt tag (in the sample I grabbed) is maybe useful as a measure of how engaged the hashtaggers may be with the programme Twitter account?

The latter report also brings to mind another question – how many of the hashtaggers does any particular account follow – that is, how connected is any particular account to folk who used the hashtag (which is the set of folk with outdegree>0)? This is important I think – distinguishing between hashtaggers who link to each other as part of a conversation, and other accounts they follow en masse but who aren’t engaging in conversation via the hashtag?

Hmmm…something to ponder over the weekend I think;-)

Written by Tony Hirst

January 27, 2012 at 6:17 pm

A Handful of Google Hacks…

with 3 comments

Last day of the holidays today, so I thought I’d try not to spend the day learning stuff from the web, but do something else instead, like read a book, or watch a film, clean the greenhouse, try out a new recipe, or maybe even get a (late) head start on some marking, start looking at putting a bid together, or draft a paper for some tinpot journal or other, (because I really, really, really need to generate some ‘research’ income, publish something academically credible, find some way, any way, of justifying my salary…).

But it never works out like that, does it…? Because I saw a couple of tweets from Martin, and they lead, as ever, to a bit of playing…

So for example, from this post of Martin’s – Free (and rebuild) the tweets! Export TwapperKeeper archives using Google Refine – I realised that Google Refine offers all sorts of import options I hadn’t really noticed before (like the ability to import data from an XML (incl. Atom/RSS), RDF or JSON feed. Which means I really need to have a quick play with that…

…for example, by seeing if I can load in some data directly from a Twitter search using a URL of the form:

http://search.twitter.com/search.json?q=from:mhawksey&rpp=5&include_entities=true&result_type=mixed

JSON import in Google Refine

Ooh… magic:-)

Google Refine - import data from url/project import

Looking around various Google Refine works in progress, I also noticed that there appears to be a Python library for running Google Refine project scripts, which means you can presumably automate the Google Refine process? (Note to self – I still haven’t found a noddy tutorial to help me get started with running the Gephi Toolkit from Jython).

Or how about this tweet, “forwarded” by Martin in the sense that he replied to a tweet from @SuButcher and included me in the response, so I could look up the thread being replied to: “@mhawksey I want to use a Googledocs form to populate a google map with twitter users – see http://t.co/uJGC78b8 and http://t.co/hLbBEUXY” which got me idly wondering about the quickest way to geocode data in a Google spreadsheet, remembering the Google Gadget that will do just this (Google Spreadsheet geocoding map gadget – a quick play suggests you can set the range to a whole column (e.g. B:B), but setting the range from a specific row to the end of the column using the format B2:B (eg for use mapping results from a google form, excluding the title row), doesn’t seem to work?!)

Googling around, I also found this handy trick for getting the lat/long of a point out of Google maps in a recipe for Geocoding with Google Spreadsheets:

- CSV output for Google maps point geocoder: http://maps.google.com/maps/geo?output=csv&q=PLACENAME_OR_ADDRESS

And hence in the Google Spreadsheet context:

=ImportData(CONCATENATE(“http://maps.google.com/maps/geo?output=csv&q=”,”LOCATION”))

where “LOCATION” might equally be a reference to a cell that contains a location.

So… day wasted… again… may as well spend the rest of it doing F1DataJunkie stuff now…!

PS seems I was unwittingly sort of involved in the publication path of that Google Maps/CSV hack I linked to above… That’s why I do what I do… unforeseen consequences.

Written by Tony Hirst

January 3, 2012 at 5:30 pm

Posted in Anything you want, Tinkering

Tagged with

Mapping the New Year Honours List – Where Did the Honours Go?

with 3 comments

When I get a chance, I’ll post a (not totally unsympathetic) response to Milo Yiannopoulos’ post The pitiful cult of ‘data journalism’, but in the meantime, here’s a view over some data that was released a couple of days ago – a map of where the New Year Honours went [link]

New Year Honours map

[Hmm... so WordPress.com doesn't seem to want to let me embed a Google Fusion Table map iframe, and Google Maps (which are embeddable) just shows an empty folder when I try to view the Fusion Table KML... (the Fusion Table export KML doesn't seem to include lat/lng data either? Maybe I need to explore some hosting elsewhere this year...]

Note that I wouldn’t make the claim that this represents an example of data journalism. It’s a sketch map showing which parts of the country various recipients of honours this time round presumably live. Just by posting the map, I’m not reporting any particular story. Instead, I’m trying to find a way of looking at the day to see whether or not there may be any interesting stories that are suggested by viewing the data in this way.

There was a small element of work involved in generating the map view, though… Working backwards, when I used Google Fusion tables to geocode the locations of the honoured, some of the points were incorrectly located:

Google Fusion Tables - correcting fault geocoding

(It would be nice to be able to force a locale to the geocoder, maybe telling it to use maps.google.co.uk as the base, rather than (presumably) maps.google.com?)

The approach I took to tidying these was rather clunky, first going into the table view and filtering on the mispositioned locations:

Google Fusion Tables - correcting geocoding errors

Then correcting them:

Google Fusion Table, Correct Geocode errors

What would be really handy would be if Google Fusion Tables let you see a tabular view of data within a particular map view – so for example, if I could zoom in to the US map and then get a tabular view of the records displayed on that particular local map view… (If it does already support this and I just missed it, please let me know via the comments..;-)

So how did I get the data into Google Fusion Tables? The original data was posted as a PDF on the DirectGov website (New Year Honours List 2012 – in detail)…:

New Year Honours data

…so I used Scraperwiki to preview and read through the PDF and extract the honours list data (my scraper is a little clunky and doesnlt pull out 100% of the data, missing the occasional name and contribution details when it’s split over several lines; but I think it does a reasonable enough job for now, particularly as I am currently more interested in focussing on the possible high level process for extracting and manipulating the data, rather than the correctness of it…!;-)

Here’s the scraper (feel free to improve upon it….:-): Scraperwiki: New Year Honours 2012

I then did a little bit of tweaking in Google Refine, normalising some of the facets and crudely attempting to separate out each person’s role and the contribution for which the award was made.

For example, in the case of Dr Glenis Carole Basiro DAVEY, given column data of the form “The Open University, Science Faculty and Health Education and Training Programme, Africa. For services to Higher and Health Education.“, we can use the following expressions to generate new sub-columns:

- value.match(/.*(For .*)/)[0] to pull out things like “For services to Higher and Health Education.”
- value.match(/(.*)For .*/)[0] to pull out things like “The Open University, Science Faculty and Health Education and Training Programme, Africa.”

I also ran each person’s record through Reuters Open Calais service using Google Refine’s ability to augment data with data from a URL (“Add column by fetching URLs”), pulling the data back as JSON. Here’s the URL format I used (polling once every 500ms in order to stay with the max. 4 calls per limit threshold mandated by the API.)

"http://api.opencalais.com/enlighten/rest/?licenseID=<strong>MY_LICENSE_KEY</strong>&content=" + escape(value,'url') + "&paramsXML=%3Cc%3Aparams%20xmlns%3Ac%3D%22http%3A%2F%2Fs.opencalais.com%2F1%2Fpred%2F%22%20xmlns%3Ardf%3D%22http%3A%2F%2Fwww.w3.org%2F1999%2F02%2F22-rdf-syntax-ns%23%22%3E%20%20%3Cc%3AprocessingDirectives%20c%3AcontentType%3D%22TEXT%2FRAW%22%20c%3AoutputFormat%3D%22Application%2FJSON%22%20%20%3E%20%20%3C%2Fc%3AprocessingDirectives%3E%20%20%3Cc%3AuserDirectives%3E%20%20%3C%2Fc%3AuserDirectives%3E%20%20%3Cc%3AexternalMetadata%3E%20%20%3C%2Fc%3AexternalMetadata%3E%20%20%3C%2Fc%3Aparams%3E"

Unpicking this a little:

- licenseID is set to my license key value
- content is the URL escaped version of the text I wanted to process (in this case, I created a new column from the name column that also pulled in data from a second column (the contribution column). The GREL formula I used to join the columns took the form: value+', '+cells["contribution"].value)
- paramsXML is the URL encoded version of the following parameters, which set the content encoding for the result to be JSON (the default is XML):

<c:params xmlns:c="http://s.opencalais.com/1/pred/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<c:processingDirectives c:contentType="TEXT/RAW" c:outputFormat="Application/JSON"  >
</c:processingDirectives>
<c:userDirectives>
</c:userDirectives>
<c:externalMetadata>
</c:externalMetadata>
</c:params>

So much for process – now where are the stories? That’s left, for now, as an exercise for the reader. An obvious starting point is just to see who received honours in your locale. Remember, Google Fusion Tables lets you generate all sorts of filtered views, so it’s not too hard to map where the MBEs vs OBEs are based, for example, or have a stab at where awards relating to services to Higher Education went. Some awards also have a high correspondence with a particular location, as for example in the case of Enfield…

If you do generate any interesting views from the New Year Honours 2012 Fusion Table, please post a link in the comments. And if you find a problem with/fix for the data or the scraper, please post that info in a comment too:-)

Written by Tony Hirst

January 2, 2012 at 6:13 pm

A Quick Peek at Three Content Analysis Services

with 5 comments

A long, long time ago, I tinkered with a hack called Serendipitwitterous (long since rotted, I suspect), that would look through a Twitter stream (personal feed, or hashtagged tweets), use the Yahoo term extraction service to try to identify concepts or key words/phrases in each tweet, and then use these as a search term on Slideshare, Youtube and so on to find content that may or may not be loosely related to each tweet.

The Yahoo Term Extraction is still hanging in there – just – but I think it finally gets deprecated early next year. From my feeds today, however, it seems there may be a replacement in the form of a new content analysis service via YQL – Yahoo! Opens Content Analysis Technology to all Developers:

[The Y! COntent Analysis service will] extract key terms from the content, and, more importantly, rank them based on their overall importance to the content. The output you receive contains the keywords and their ranks along with other actionable metadata.
On top of entity extraction and ranking, developers need to know whether key terms correspond to objects with existing rich metadata. Having this entity/object connection allows for the creation of highly engaging user experiences. The Y! Content Analysis output provides related Wikipedia IDs for key terms when they can be confidently identified. This enables interoperability with linked data on the semantic Web.

What this means is that you can push a content feed through the service, and get an annotated version out that includes identifier based hooks into other domains (i.e. little-l, little-d linked data). You can find the documentation here: Content Analysis Documentation for Yahoo! Search

So how does it fare? As I’ve previously explored using the Reuters Open Calais service to annotate OU/BBC programme listings (e.g. Augmenting OU/BBC Co-Pro Programme Data With Semantic Tags), I thought I’d use a programme feed from The Bottom Line again…

To start, we need to open the YQL developer console: http://developer.yahoo.com/yql/console/

We can then pull in an example programme description from the BBC using a YQL query of the form:

select long_synopsis from xml where url='http://www.bbc.co.uk/programmes/b00vy3l1.xml'

Grabbing a BBC programme feed into YQL

For reference, the text looks like this:

The view from the top of business. Presented by Evan Davis, The Bottom Line cuts through confusion, statistics and spin to present a clearer view of the business world, through discussion with people running leading and emerging companies.
In the week that Facebook launched its own new messaging service, Evan and his panel of top business guests discuss the role of email at work, amid the many different ways of messaging and communicating.
And location, location, location. It’s a cliche that location can make or break a business, but how true is it really? And what are the advantages of being next door to the competition?
Evan is joined in the studio by Chris Grigg, chief executive of property company British Land; Andrew Horton, chief executive of insurance company Beazley; Raghav Bahl, founder of Indian television news group Network 18.
Producer: Ben Crighton
Last in the series. The Bottom Line returns in January 2011.

The content analysis query example provided looks like this:

select * from contentanalysis.analyze where text="Italian sculptors and painters of the renaissance favored the Virgin Mary for inspiration"

but we can nest queries in order to pass the long_synposis from the BBC programme feed through the service:

select * from contentanalysis.analyze where text in (select long_synopsis from xml where url='http://www.bbc.co.uk/programmes/b00vy3l1.xml')

Here’s the result:

<?xml version="1.0" encoding="UTF-8"?>
<query xmlns:yahoo="http://www.yahooapis.com/v1/base.rng"
    yahoo:count="2" yahoo:created="2011-12-22T11:03:51Z" yahoo:lang="en-US">
    <diagnostics>
        <publiclyCallable>true</publiclyCallable>
        <url execution-start-time="2" execution-stop-time="370"
            execution-time="368" proxy="DEFAULT"><![CDATA[http://www.bbc.co.uk/programmes/b00vy3l1.xml]]></url>
        <user-time>572</user-time>
        <service-time>565</service-time>
        <build-version>24402</build-version>
    </diagnostics> 
    <results>
        <categories xmlns="urn:yahoo:cap">
            <yct_categories>
                <yct_category score="0.536">Business &amp; Economy</yct_category>
                <yct_category score="0.421652">Finance</yct_category>
                <yct_category score="0.418182">Finance/Investment &amp; Company Information</yct_category>
            </yct_categories>
        </categories>
        <entities xmlns="urn:yahoo:cap">
            <entity score="0.979564">
                <text end="57" endchar="57" start="48" startchar="48">Evan Davis</text>
                <wiki_url>http://en.wikipedia.com/wiki/Evan_Davis</wiki_url>
                <types>
                    <type region="us">/person</type>
                    <type region="us">/place/place_of_interest</type>
                    <type region="us">/place/us/town</type>
                </types>
                <related_entities>
                    <wikipedia>
                        <wiki_url>http://en.wikipedia.com/wiki/Don%27t_Tell_Mama</wiki_url>
                        <wiki_url>http://en.wikipedia.com/wiki/Lenny_Dykstra</wiki_url>
                        <wiki_url>http://en.wikipedia.com/wiki/Los_Angeles_Police_Department</wiki_url>
                        <wiki_url>http://en.wikipedia.com/wiki/Today_%28BBC_Radio_4%29</wiki_url>
                        <wiki_url>http://en.wikipedia.com/wiki/Chrisman,_Illinois</wiki_url>
                    </wikipedia>
                </related_entities>
            </entity>
            <entity score="0.734099">
                <text end="265" endchar="265" start="258" startchar="258">Facebook</text>
                <wiki_url>http://en.wikipedia.com/wiki/Facebook</wiki_url>
                <types>
                    <type region="us">/organization</type>
                    <type region="us">/organization/domain</type>
                </types>
                <related_entities>
                    <wikipedia>
                        <wiki_url>http://en.wikipedia.com/wiki/Mark_Zuckerberg</wiki_url>
                        <wiki_url>http://en.wikipedia.com/wiki/Social_network_service</wiki_url>
                        <wiki_url>http://en.wikipedia.com/wiki/Twitter</wiki_url>
                        <wiki_url>http://en.wikipedia.com/wiki/Social_network</wiki_url>
                        <wiki_url>http://en.wikipedia.com/wiki/Digital_Sky_Technologies</wiki_url>
                    </wikipedia>
                </related_entities>
            </entity>
            <entity score="0.674621">
                <text end="477" endchar="477" start="450" startchar="450">location, location, location</text>
            </entity>
            <entity score="0.651227">
                <text end="79" endchar="79" start="60" startchar="60">The Bottom Line cuts</text>
                <types>
                    <type region="us">/other/movie/movie_name</type>
                </types>
            </entity>
            <entity score="0.646818">
                <text end="799" endchar="799" start="789" startchar="789">Raghav Bahl</text>
                <wiki_url>http://en.wikipedia.com/wiki/Raghav_Bahl</wiki_url>
                <types>
                    <type region="us">/person</type>
                </types>
                <related_entities>
                    <wikipedia>
                        <wiki_url>http://en.wikipedia.com/wiki/Network_18</wiki_url>
                        <wiki_url>http://en.wikipedia.com/wiki/Superpower</wiki_url>
                        <wiki_url>http://en.wikipedia.com/wiki/Deng_Xiaoping</wiki_url>
                        <wiki_url>http://en.wikipedia.com/wiki/The_Amazing_Race</wiki_url>
                        <wiki_url>http://en.wikipedia.com/wiki/Hare</wiki_url>
                    </wikipedia>
                </related_entities>
            </entity>
            <entity score="0.644349">
                <text end="144" endchar="144" start="133" startchar="133">clearer view</text>
            </entity>
            <entity score="0.54609">
                <text end="675" endchar="675" start="665" startchar="665">Chris Grigg</text>
                <types>
                    <type region="us">/person</type>
                </types>
            </entity>
        </entities>
    </results>
</query>

So, some success in pulling out person names, and limited success on company names. The subject categories look reasonably appropriate too.

[UPDATE: I should have run the desc contentanalysis.analyze query before publishing this post to pull up the docs/examples... As well as the where text= argument, there is a where url= argument that will pul back semantic information about a URL. Running the query over the OU homepage, for example, using select * from contentanalysis.analyze where url="http://www.open.ac.uk" identifies the OU as an organisation, with links out to Wikipedia, as well as geo-information and a Yahoo woe_id.]

Another related service in this area that I haven’t really explored yet is TSO’s Data Enrichment Service (API).

Here’s how it copes with the same programme synposis:

TSO Data Enrichment Service

Pretty good… and links in to dbpedia (better for machine readability) compared to the Wikipedia links that the Yahoo service offers.

For completeness, here’s what the Reuters Open Calais service comes up with:

OPen Calais - content analysis

The best of the bunch on this sample of one, I think, albeit admittedly in the domain the Reuters focus on?

But so what…? What are these services good for? Automatic metadata generation/extraction is one thing, as I’ve demonstrated in Visualising OU Academic Participation with the BBC’s “In Our Time”, where I generated a quick visualisation that showed the sorts of topics that OU academics had talked about as guests on Melvyn Bragg’s In Our Time, along with the topics that other universities had been engaged with on that programme.

Written by Tony Hirst

December 22, 2011 at 11:23 am

Rescuing Twapperkeeper Archives Before They Vanish, Redux

with 3 comments

In Rescuing Twapperkeeper Archives Before They Vanish, I described a routine for grabbing Twapperkeeper archives, parsing them, and saving them to a local desktop file using the R programming language (downloading RStudio is the easiest way I know of getting R…).

Following a post fron @briankelly (Responding to the Forthcoming Demise of TwapperKeeper), where Brian described how to lookup all the archives saved by a person on Twapperkeeper and using that as a basis for an archive rescue strategy, I thought I’d tweak my code to grab all the hashtag archives for a particular user (other archives are also available, search as search term archives; I don’t grab the list of these… IF you fancy generalising the code, please post a link to it in the comments;-)

What should have been a trivial task didn’t work, of course: the R XML parser seemed to choke on some of the archive files claiming they weren’t the claimed UTF-8 encoding. Character encodings are still something that I don’t understand at all (and more than a few times have caused me to give up on a hack), but on the offchance, I tried using a more resilient file loader (curl, if that means anything to you…;-) rather than the XML package loader, and it seems to do the trick (warnings are still raised, but that’s an improvement on errors, that tend cause everything to stop).

Anyway, here’s the revised code, along with an additional routine for grabbing all the hashtag archives saved on Twapperkeeper by a named individual. If I get a chance (i.e. when I learn how to do it!), I’ll add in a line to two that will grab all the archives from a list of named individuals…

require(XML)
require(stringr)
require(RCurl)

#A helper function to remove @ symbols from user names...
trim <- function (x) sub('@','',x)
tagtrim <- function (x) sub('#','',x)

twapperkeeperRescue=function(hashtag,num=10000){
    #startCrib: http://libreas.wordpress.com/2011/12/09/twapperkeeper/
    #tweak - reduce to a grab of 10000 archived tweets
    url <- paste("http://twapperkeeper.com/rss.php?type=hashtag&name=",hashtag,"&l=",num, sep="")
    print(url)
    #This is a hackfix I tried on spec - use the RCurl library to load in the file...
    lurl=getURL(url)
    #...then parse it, rather than loading it in directly using the XML parser...
    doc <- xmlTreeParse(lurl,useInternal=T,encoding = "UTF-8")
    tweet <- xpathSApply(doc, "//item//title", xmlValue)  
    pubDate <- xpathSApply(doc, "//item//pubDate", xmlValue)
    #endCrib
    df=data.frame(cbind(tweet,pubDate))
    print('...extracting from...')
    df$from=sapply(df$tweet,function(tweet) str_extract(tweet,"^([[:alnum:]_]*)"))
    print('...extracting id...')
    df$id=sapply(df$tweet,function(tweet) str_extract(tweet,"[[:digit:]/s]*$"))
    print('...extracting txt...')
    df$txt=sapply(df$tweet,function(tweet) str_trim(str_replace(str_sub(str_replace(tweet,'- tweet id [[:digit:]/s]*$',''),end=-35),"^([[:alnum:]_]*:)",'')))
    print('...extracting to...')
    df$to=sapply(df$txt,function(tweet) trim(str_extract(tweet,"^(@[[:alnum:]_]*)")))
    print('...extracting rt...')
    df$rt=sapply(df$txt,function(tweet) trim(str_match(tweet,"^RT (@[[:alnum:]_]*)")[2]))
    return(df)
}

#usage: 
#tag='ukdiscovery'
#twarchive.df=twapperkeeperRescue(tag)

#if you want to save the parsed archive:
twapperkeeperSave=function(hashtag,num=10000,path='./'){
    tweet.df=twapperkeeperRescue(hashtag,num)
    fn <- paste(path,"twArchive_",hashtag,".csv")
    write.csv(tweet.df,fn)
}
#usage:
#twapperkeeperSave(tag)


#The following function grabs a list of hashtag archives saved by a given user
# and then rescues each archive in turn...
twapperkeeperUserRescue=function(uname='psychemedia',num=10000){
	#This routine only grabs hashtag archives;
	#Search archives and other archives can also be identified an downloaded if you feel like generalising this bit of code...;-)
	url=paste('http://twapperkeeper.com/allnotebooks.php?type=hashtag&name=&description=&tag=&created_by=',uname,sep='')
	archives=readHTMLTable(url,which=2,header=T)
	archives$Name=sapply(archives$Name,function(tag) tagtrim(tag))
	mapply(twapperkeeperSave,archives$Name,num)
}
#usage:
#user='psychemedia'
#twapperkeeperUserRescue(user)
#twapperkeeperUserRescue(user,1000)
#The numerical argument is the number of archived tweets you want to save (max 50000)
#Note to self: need to trap this maxval...

Now… do I build some archive analytics and visualisation on top of this, or do I have a play with building an archive rescuer in Scraperwiki?!

PS I also doodled a Python script to download (even large) Twapperkeeper archives, by user

Written by Tony Hirst

December 11, 2011 at 8:21 pm

Posted in Rstats, Tinkering

Tagged with

Finding Common Terms around a Twitter Hashtag

with 2 comments

@aendrew sent me a link to a StackExchange question he’s just raised, in a tweet asking: “Anyone know how to find what terms surround a Twitter trend/hashtag?”

I’ve dabbled in this area before, though not addressing this question exactly, using Yahoo Pipes to find what hashtags are being used around a particular search term (Searching for Twitter Hashtags and Finding Hashtag Communities) or by members of a particular list (What’s Happening Now: Hashtags on Twitter Lists; that post also links to a pipe that identifies names of people tweeting around a particular search term.).

So what would we need a pipe to do that finds terms surrounding a twitter hashtag?

Firstly, we need to search on the tag to pull back a list of tweets containing that tag. Then we need to split the tweets into atomic elements (i.e. separate words). At this point, it might be useful to count how many times each one occurs, and display the most popular. We might also need to generate a “stop list” containing common words we aren’t really interested in (for example, the or and.

So here’s a quick hack at a pipe that does just that (Popular words round a hashtag).

For a start, I’m going to construct a string tokeniser that just searches for 100 tweets containing a particular search term, and then splits each tweet up in separate words, where words are things that are separated by white space. The pipe output is just a list of all the words from all the tweets that the search returned:

Twitter string tokeniser

You might notice the pipe also allows us to choose which page of results we want…

We can now use the helper pipe in another pipe. Firstly, let’s grab the words from a search that returns 200 tweets on the same search term. The helper pipe is called twice, once for the first page of results, once for the second page of results. The wordlists from each search query are then merged by the union block. The Rename block relabels the .content attribute as the .title attribute of each feed item.

Grab 200 tweets and check we have set the title element

The next thing we’re going to do is identify and count the unique words in the combined wordlist using the Unique block, and then sort the list accord to the number of times each word occurs.

Preliminary parsing of a wordlist

The above pipe fragment also filters the wordlist so that only words containing alphabetic characters are allowed through, as well as words with four or more characters. (The regular expression .{4,} reads: allow any string of four or more ({4,}) characters of any type (.). An expression .{5,7} would say – allow words through with length 5 to 7 characters.)

I’ve also added a short routine that implements a stop list. The regular expression pattern (?i)\b(word1|word2|word3)\b says: ignoring case ((?i)),try to match any of the words word1, word2, word3. (\b denotes word boundary.) Note that in the filter below, some of the words in my stop list are redundant (the ones with three or fewer characters. Remember, we have already filtered the word list to show only words of length four or more characters.)

Stop list

I also added a user input that allows additional stop terms to be added (they should be pipe (|) separated, with no spaces between them). You can find the pipe here.

Written by Tony Hirst

November 22, 2011 at 1:47 pm

A Quick Lookup Service for UK University Bursary & Scholarship Pages

with one comment

Here’s a quick recipe for grabbing a set of links from an alphabetised set of lookup pages and then providing a way of looking them up… The use case is to lookup URLs of pages on the websites of colleges and universities offering financial support for students as part of the UK National Scholarship Programme, as described on the DirectGov website:

National Scholarship programme

Index pages have URLs of the form:
http://www.direct.gov.uk/en/EducationAndLearning/UniversityAndHigherEducation/StudentFinance/StudentfinanceA-Z/index.htm?indexChar=D

<div class="subContent">
			<h3>A</h3>
						<div class="subContent">
						<h4></h4>
					<ul class="subLinks">
						<li><a href="http://www.anglia.ac.uk/nsp"   target="_blank">Anglia Ruskin University<span class='tooltip' title='Opens new window'> <span>Opens new window</span></span></a></li>
					</ul>
				</div>
				<div class="subContent">
						<h4></h4>
					<ul class="subLinks">
						<li><a href="http://www.aucb.ac.uk/international/feesandfinance/financialhelp.aspx"   target="_blank">Arts University College at Bournemouth<span class='tooltip' title='Opens new window'> <span>Opens new window</span></span></a></li>
					</ul>
				</div>
				<div class="subContent">
						<h4></h4>
					<ul class="subLinks">
						<li><a href="http://www1.aston.ac.uk/study/undergraduate/student-finance/tuition-fees/2012-entry/ "   target="_blank">Aston University Birmingham<span class='tooltip' title='Opens new window'> <span>Opens new window</span></span></a></li>
					</ul>
				</div>
		
	</div>

I’ve popped a quick scraper onto Scraperwiki (University Bursaries / Scholarship / Bursary Pages) that trawls the the index pages A-Z, grabs the names of the institutions and the URLs they link to and pops them into a database.

import scraperwiki
import string,lxml.html

# A function I usually bring in with lxml that strips tags and just give you text contained in an XML substree
## via http://stackoverflow.com/questions/5757201/help-or-advice-me-get-started-with-lxml/5899005#5899005
def flatten(el):           
    result = [ (el.text or "") ]
    for sel in el:
        result.append(flatten(sel))
        result.append(sel.tail or "")
    return "".join(result)
#As it happens, we're not actually going to use this function in this scraper, so we could remove it from the code...

# We want to poll through page URLs indexed by an uppercase alphachar
allTheLetters = string.uppercase

for letter in allTheLetters:
    #Generate the URL
    url="http://www.direct.gov.uk/en/EducationAndLearning/UniversityAndHigherEducation/StudentFinance/StudentfinanceA-Z/index.htm?indexChar="+letter
    print letter
    #Grab the HTML page from the URL and generate an XML object from it
    page=lxml.html.fromstring(scraperwiki.scrape(url))
    #There are probably more efficient ways of doing this scrape...
    for element in page.findall('.//div'):
        if element.find('h3')!=None and element.find('h3').text==letter:
            for uni in element.findall('.//li/a'):
                print uni.text,uni.get('href')
                scraperwiki.sqlite.save(unique_keys=["href"], data={"href":uni.get('href'), "uni":uni.text})

Running this gives a database containing the names of the institutions that signed up to the National Scholarship Programmea and the information that have about scholarships and bursaries availabale in that context.

The Scraperwiki API allows you to run queries on this database and get the results back as JSON, HTML, CSV or RSS: University Bursaries API. So for example, we can search for bursary pages on Liverpool colleges and universities websites:

Scraperwiki API

We can also generate a view over the data on Scraperwiki… (this script shows how to interrogate the Scraperwiki database from within a webpage.

Finally, if we take the URLs from the bursary pages and pop them into a Google custom search engine, we can now search over just those pages… UK HE Financial Support (National Scholarship Programme) Search Engine. (Note that this is a bit ropey at them moment.) If you own the CSE, it’s easy enough to grab embed codes that allow you to pop search and results controls for the CSE into your own webpage.

(On the to do list is generate a view over the data that defines a Google Custom Search Engine Annotations file that can be used to describe the sites/pages searched over by the CSE.)

Written by Tony Hirst

November 13, 2011 at 12:18 pm

Posted in Tinkering

Tagged with ,

Generating Mind Maps from OU/OpenLearn Structured Authoring XML Documents

leave a comment »

One of the really useful things about publishing documents in a structured way is that we can treat the document as a database, or generate an outline view of it automatically.

Whilst looking through the OU Structured Authoring XML docs looking for things I could reliably extract from them in order to configure a course custom search engine (Notes on Custom Course Search Engines Derived from OU Structured Authoring Documents), I put together a quick script to generate a course mind map based around the course structure.

It struck me that as structured document/XML views of OpenLearn material is available, I could do the same for OpenLearn docs. So here’s an example. If you visit the OpenLearn site, you should be able to find several modules derived from the old OU course T175. Going to the first page proper for each of the derived modules (URLs have the form http://openlearn.open.ac.uk/mod/oucontent/view.php?id=398868&direct=1), it is possible to grab a copy of the source XML document for the unit by rewriting the URL to include the setting&content=1: for example, http://openlearn.open.ac.uk/mod/oucontent/view.php?id=398868&content=1

OpenLearn source XML

Downloading the XML files for each of the T175 derived modules on OpenLearn into a single folder, I put together a quick script to mine the structure of the document and pull out the learning objectives for each unit, as well as the headings of each section and subsection. The resulting mindmap provides an outline of the course as a whole, something that can be used to provide a macroscopic view over the whole course, as well as providing a document that could be made available to people following the unit as a resource they could use to organise their notes or annotations around the unit.

T175 on Openlearn mindmap

Download a copy of the T175 on OpenLearn Outline Freemind/.mm mindmap

If we could find a way of getting the OpenLearn page URLs for each section, we could add them in as links within the mindmap, thus allowing it to be used as a navigation surface. (See also MindMap Navigation for Online Courses in this regard.)

Here’s a copy of the Python script I ran over the folder to generate the Freemind mindmap definition file (filetype .mm) based on the section and subsection elements used to structure the document.

# DEPENDENCIES
## We're going to load files in from a course related directory
import os
## Quick hack approach - use lxml parser to parse SA XML files
from lxml import etree
# We may find it handy to generate timestamps...
import time


# CONFIGURATION

## The directory the course XML files are in (separate directory for each course for now) 
SA_XMLfiledir='data'
## We can get copies of the XML versions of Structured Authoring documents
## that are rendered in the VLE by adding &content=1 to the end of the URL
## [via Colin Chambers]
## eg http://learn.open.ac.uk/mod/oucontent/view.php?id=526433&content=1


# UTILITIES

#lxml flatten routine - grab text from across subelements
#via http://stackoverflow.com/questions/5757201/help-or-advice-me-get-started-with-lxml/5899005#5899005
def flatten(el):           
    result = [ (el.text or "") ]
    for sel in el:
        result.append(flatten(sel))
        result.append(sel.tail or "")
    return "".join(result)

#Quick and dirty handler for saving XML trees as files
def xmlFileSave(fn,xml):
	# Output
	txt = etree.tostring(xml, pretty_print=True)
	#print txt
	fout=open(fn,'wb+')
	#fout.write('<?xml version="1.0" encoding="UTF-8" ?>\n')
	fout.write(txt)
	fout.close()


#GENERATE A FREEMIND MINDMAP FROM A SINGLE T151 SA DOCUMENT
## The structure of the T151 course lends itself to a mindmap/tree style visualisation
## Essentially what we are doing here is recreating an outline view of the course that was originally used in the course design phase
def freemindRoot(page):
	tree = etree.parse('/'.join([SA_XMLfiledir,page]))
	courseRoot = tree.getroot()
	mm=etree.Element("map")
	mm.set("version", "0.9.0")
	root=etree.SubElement(mm,"node")
	root.set("CREATED",str(int(time.time())))
	root.set("STYLE","fork")
	#We probably need to bear in mind escaping the text strings?
	#courseRoot: The course title is not represented consistently in the T151 SA docs, so we need to flatten it
	title=flatten(courseRoot.find('CourseTitle'))
	root.set("TEXT",title)
	
	## Grab a listing of the SA files in the target directory
	listing = os.listdir(SA_XMLfiledir)

	#For each SA doc, we need to handle it separately
	for page in listing:
		print 'Page',page
		#Week 0 and Week 10 are special cases and don't follow the standard teaching week layout
		if page!='week0.xml' and page!='week10.xml':
			tree = etree.parse('/'.join([SA_XMLfiledir,page]))
			courseRoot = tree.getroot()
			parsePage(courseRoot,root)
	return mm

def learningOutcomes(courseRoot,root):
	mmlos=etree.SubElement(root,"node")
	mmlos.set("TEXT","Learning Outcomes")
	mmlos.set("FOLDED","true")
	
	los=courseRoot.findall('.//FrontMatter/LearningOutcomes/LearningOutcome')
	for lo in los:
		mmsession=etree.SubElement(mmlos,"node")
		mmsession.set("TEXT",flatten(lo))

def parsePage(courseRoot,root):
	unitTitle=courseRoot.find('.//Unit/UnitTitle')

	mmweek=etree.SubElement(root,"node")
	mmweek.set("TEXT",flatten(unitTitle))
	mmweek.set("FOLDED","true")

	learningOutcomes(courseRoot,mmweek)
	
	sessions=courseRoot.findall('.//Unit/Session')
	for session in sessions:
		title=flatten(session.find('.//Title'))
		mmsession=etree.SubElement(mmweek,"node")
		mmsession.set("TEXT",title)
		mmsession.set("FOLDED","true")
		subsessions=session.findall('.//Section')
		for subsession in subsessions:
			heading=subsession.find('.//Title')
			if heading !=None:
				title=flatten(heading)
				mmsubsession=etree.SubElement(mmsession,"node")
				mmsubsession.set("TEXT",title)
				mmsubsession.set("FOLDED","true")


mm=freemindRoot('t175_1.xml')
print etree.tostring(mm, pretty_print=True)
xmlFileSave('reports/test_t175_full.mm',mm)

If you try to run it over other OpenLearn materials, you may need to tweak the parser slightly. For example, some documents may make use of InnerSection elements, or Header rather than Title elements.

If youdo try using the above script to generate mindmaps/outlines of other OpenLearn courses, please let me know how you got on in the comments below (eg whether you needed to tweak the script, or whether you found other structural elements that could be pulled into the mindmap.)

Written by Tony Hirst

November 10, 2011 at 1:40 pm

Posted in Open Content, OU2.0, Tinkering

Tagged with ,

Getting Started With Twitter Analysis in R

with 8 comments

Earlier today, I saw a post vis the aggregating R-Bloggers service a post on Using Text Mining to Find Out What @RDataMining Tweets are About. The post provides a walktrhough of how to grab tweets into an R session using the twitteR library, and then do some text mining on it.

I’ve been meaning to have a look at pulling Twitter bits into R for some time, so I couldn’t but have a quick play…

Starting from @RDataMiner’s lead, here’s what I did… (Notes: I use R in an R-Studio context. If you follow through the example and a library appears to be missing, from the Packages tab search for the missing library and import it, then try to reload the library in the script. The # denotes a commented out line.)

require(twitteR)
#The original example used the twitteR library to pull in a user stream
#rdmTweets <- userTimeline("psychemedia", n=100)
#Instead, I'm going to pull in a search around a hashtag.
rdmTweets <- searchTwitter('#mozfest', n=500)
# Note that the Twitter search API only goes back 1500 tweets (I think?)

#Create a dataframe based around the results
df <- do.call("rbind", lapply(rdmTweets, as.data.frame))
#Here are the columns
names(df)
#And some example content
head(df,3)

So what can we do out of the can? One thing is look to see who was tweeting most in the sample we collected:

counts=table(df$screenName)
barplot(counts)

# Let's do something hacky:
# Limit the data set to show only folk who tweeted twice or more in the sample
cc=subset(counts,counts>1)
barplot(cc,las=2,cex.names =0.3)

Now let’s have a go at parsing some tweets, pulling out the names of folk who have been retweeted or who have had a tweet sent to them:

#Whilst tinkering, I came across some errors that seemed
# to be caused by unusual character sets
#Here's a hacky defence that seemed to work...
df$text=sapply(df$text,function(row) iconv(row,to='UTF-8'))

#A helper function to remove @ symbols from user names...
trim <- function (x) sub('@','',x)

#A couple of tweet parsing functions that add columns to the dataframe
#We'll be needing this, I think?
library(stringr)
#Pull out who a message is to
df$to=sapply(df$text,function(tweet) str_extract(tweet,"^(@[[:alnum:]_]*)"))
df$to=sapply(df$to,function(name) trim(name))

#And here's a way of grabbing who's been RT'd
df$rt=sapply(df$text,function(tweet) trim(str_match(tweet,"^RT (@[[:alnum:]_]*)")[2]))

So for example, now we can plot a chart showing how often a particular person was RT’d in our sample. Let’s use ggplot2 this time…

require(ggplot2)
ggplot()+geom_bar(aes(x=na.omit(df$rt)))+opts(axis.text.x=theme_text(angle=-90,size=6))+xlab(NULL)

Okay – enough for now… if you’re tempted to have a play yourself, please post any other avenues you explored with in a comment, or in your own post with a link in my comments;-)

Written by Tony Hirst

November 9, 2011 at 2:47 pm

Notes on Custom Course Search Engines Derived from OU Structured Authoring Documents

with 5 comments

Over the last few days, I’ve been tinkering with OU Structured Authoring documents, XML docs from which OU course materials – both print and HTML – are generated (you can get an idea about what they look like from OpenLearn: find a course page with a URL of the form http://openlearn.open.ac.uk/mod/oucontent/view.php?id=397337&direct=1 and change direct to content: http://openlearn.open.ac.uk/mod/oucontent/view.php?id=397337&content=1; h/t to Colin Chambers for that one;-). I’ve been focussing in particular on the documents used to describe T151, an entry level online course I developed around all things gaming (culture, business, design and development), and the way in which we can automatically generate custom search engines based on these documents.

The course had a very particular structure – weekly topic explorations framed as a preamble, set of guiding questions, suggested resources (organised by type) and a commentary, along with a weekly practical session.

One XML doc was used per week, and was used to generate the separate HTML pages for each week’s study.

One of the experimental components of the course has been a Google Custom Search Engine, that supports searches over external resources that are linked to from the blog. The course also draws heavily on the Digital Worlds Uncourse blog, a site used to scope out the design of the course, as well as draft some of the materials used within it, and the CSE indexes both that site and the sites that are linked from it. (See eSTEeM Project: Custom Course Search Engines and Integrating Course Related Search and Bookmarking? for a little more context around this.)

Through using the course custom search engine myself, I have found several issues with it:

1) with a small index, it’s not really very satisfactory. If you only index exact pages that are linked to from the site, it can be quite hard getting any hits at all. A more relaxed approach might be to index the domains associated with resources, and also include the exact references explicitly with a boosted search rank. At the current time, I have code that scrapes external links from across the T151 course materials and dumps them into a single annotations file (the file that identifies which resources are included in the CSE) without further embellishment. I also have code that identifies the unique domains that are linked to from the course materials and which can also be added to the annotations file. On the to do list is to annotate the resources with labels that identify which topic they are associated with so we can filter results by topic.

2) the Google Custom Search Engines seem to behave very oddly indeed. In several of my experiments, one word queries often returned few results, more specific queries building on the original search term delivered more and different results. This gives a really odd search experience, and one that I suspect would put many users off.

3) I’ve been coming round more and more to the idea that the best way of highlighting course resources in a search context is through the use of Subscribed Links, that a user can subscribe to and that then appear in their Google search results if there is an exact query match. Unfortunately, Google pulled the Subscribed Links service in September (A fall spring-clean; for example of what’s been lost, see e.g. Stone Temple Consulting: Google Co-Op Subscribed Links).

4) The ability to feed promotions into the top of the CSE results listing is attractive (up to 3 promoted links can be displayed for any given query), but the automatic generation of query terms is problematic. Promotion definitions take the form:

<Promotion image_url="http://kmi.open.ac.uk/images/ou-logo.gif"
  	title="Week 4"
  	id="T151_Week4"
  	queries="week 4,T151 week 4,t151 week 4"
  	url="http://www.open.ac.uk"
  	description="Topic Exploration 4A - An animated experience Topic exploration 4B - Flow in games "/>

Course CSE - week promotion

There are several components we need to consider here:

  1. queries: these are the phrases that are used to trigger the display of the particular promotions links. Informal testing suggests that where multiple promotions are triggered by the same query, the order in which they are defined in the Promotions file determines the order in which they appear in the results. Note that the at most three (3) promotions can be displayed for any query. Queries may be based at least around either structural components (such as study week, topic number), subject matter terms (e.g. tags, keywords, or headings) and resource type (eg audio/video material, academic readings etc), although we might argue the resource type is not such a meaningful distinction (just because we can make it doesn’t mean we should!). In the T51 materials, presentation conventions provide us with a way of extracting structural components and using these to seed the promotions file. Identifying keywords or phrases is more challenging: students are unlikely to enter search phrases that exactly match section or subsection headings, so some element of term extraction would be required in order to generate query terms that are likely to be used.
  2. title: this is what catches the attention, so we need to put something sensible in here. There is a limit of 160 characters on the length of the title.
  3. description: the description allows us to elaborate on the title. There is a limit of 200 characters on the length of the description.
  4. url: the URL is required but not necessarily ‘used’ by our promotion. That is, if we are using the promotion for informational reasons, and not necessarily wanting to offer a click through, the link may be redundant. (However, the CSE still requires it to be defined.) Alternatively, we might argue that the a click through action should always be generated, in which case it might be worth considering whether we can generate a click through to a more refined query on the CSE itself?

Where multiple promotions are provided, we need to think about:
a) how they are ordered;
b) what other queries they are associated with (i.e. their specificity);
c) where they link to.

In picking apart the T151 structured authoring documents, I have started by focussing on the low hanging fruit when it comes to generating promotion links. Looking through the document format, it is possible to identify topics associated with separate weeks and questions associated with particular topics. This allows us to build up a set of informational promotions that allow the search engine to respond to queries of what we might term a navigational flavour. So for example, we can ask what topics are covered in a particular week (I also added the topic query as a query for questions related to a particular topic):

Course CSE - multiple promotions

Or what a particular question is within a particular topic:

COurse CSE - what's the question?

The promotion definitions are generated automatically and are all very procedural. For example, here’s a fragment from the definition of the promotion from question 4 in topic 4A:

<Promotion 
  	title="Topic Exploration 4A Question 4"
  	queries="topic 4a q4,T151 topic 4a q4,t151 topic 4a q4,topic 4a,T151 topic 4a,t151 topic 4a"
  	... />

The queries this promotion will be displayed for are all based around navigational structural elements. This requires some knowledge of the navigational query syntax, and also provides an odd user experience, because the promotions only display on the main CSE tab, and the organic results from indexed sites turn up all manner of odd results for queries like “week 3″ and “topic 1a q4″… (You can try out the CSE here.)

The promotions I have specified so far also lack several things:

1) queries based on the actual question description, so that a query related to the question might pull the corresponding promotion into the search results (would that be useful?)

2) a sensible link. At the moment, there is no obvious way in the SA document of identifying one or more resources that particularly relate to a specific question. If there was such a link, then we could use that information to automatically associate a link with a question in the corresponding promotions element. (The original design of the course imagined the Structured Authoring document itself being constructed automatically from component parts. In particular, it was envisioned that suggested links would be tagged on a social bookmarking service and then automatically pulled into the appropriate area of the Structured Authoring document. Resources could then be tagged in a way that associates them with one or more questions (or topics), either directly though a question ID, or indirectly through matching subject tags on a question and on a resource. The original model also considered the use of “suggested search queries” that would be used to populate suggested resources lists with results pulled in live from a (custom) search engine…)

At the moment, it is possible to exploit the T151 document structure to generate these canned navigational queries. The question now is: are promotional links a useful feature, and how might we go about automatically identifying subject meaningful queries?

At the limit, we might imagine the course custom search engine interface being akin to the command line in a text based adventure game, with the adventure itself being the learning journey, and the possible next step a combination of Promotions based guidance and actual search results…

[Code for the link scraping/CSE file generation and mindmap generator built around the T151 SA docs can be found at Github: Course Custom Search Engines]

PS as ever, I tend to focus on tinkering around a rapid prototype/demonstration at the technical systems overview level, with a passing nod to the usefulness of the service (which, as noted above, is a bit patchy where the searchengine index is sparse). What I haven’t done is spend much time thinking about the pedagogical aspects relating to how we might make most effective use of custom search engines in the course context. From a scoping point of view, I think there are a several things we need to unpick that relate to this: what is it that students are searching for, what context are they searching in, and what are they expecting to discover?

My original thinking around custom course search engines was that they would augment a search across course materials by providing a way of searching across the full text of resources* linked to from the course materials, and maybe also provide a context for searching over student suggested resources.

Search hierarchy

It strikes me that the course search engine is most likely to be relevant where there is active curation of the search engine that provides a search view over a reasonably sized set of resources discovered by folk taking the course and sharing resources related to it. “MOOCs” might be interesting in this respect, particularly where: 1) it is possible to include MOOC blog tag feeds in the CSE as a source of relevant content (both the course blog content and resources linked to from that content – the CSE can be configured to include resources that are linked to from a specified resource); 2) we can grab links that are tagged and shared with the MOOC code on social media and add those to the CSE annotations file. (Note that in this case, it would make sense to resolve shortened links to their ultimate destination URL before adding them to the CSE.) I’m not sure what role promotions might play in a MOOC though, or the extent to which they could be automatically generated?

*Full text search across linked to resources is actually quite problematic. Consider the following classes of online resources that we might expect to be linked to from course materials:

  • academic papers, often behind a paywall: links are likely to be redirected through a library proxy service allowing for direct click-thru to the resource using institutional credentials (I assume the user is logged in to the VLE to see the link, and single sign on support allows direct access to any subscribed to resources via appropriate proxies. That is, the link to the resource leads directly to the full text, subscribed to version of the resource if the user is signed on to the institutional system and has appropriate credentials). There are several issues here: the link that is supplied to the CSE should be be the public link to the article homepage; the article homepage is likely to reveal little more than the paper abstract to the search engine. I’m not sure if Google Scholar does full-text indexing of articles, but even if it does, Scholar results are not available to the CSE. (There is also the issue of how we would limit the Scholar search to the articles we are linking to from the course materials.)
  • news and magazine articles: again, these may be behind a paywall, but even if they are, they may have been indexed by Google. So they may be full text discoverable via a CSE, even if they aren’t accessible once you click through…
  • video and audio resources: discovery in a large part will depend on the text on the web page the resources are hosted on. If the link is directly to an audio or video file, discoverability via the CSE may well be very limited!
  • books: Google book search provides full text search, but this is not available via a CSE. Full text searchable collections of books are achievable using Google Books Library Shelves; there’s also an API available.

I guess the workaround to all this is not to use a Google Custom Search Engine as the basis for a course search engine. Instead, construct a repository that contains full text copies of all resources linked to from the course, and index that using a local search engine, providing aliased links to the original sources if required?

Fudging the CSE with a local searchengine

However, that wasn’t what this experiment was about!;-)

Course Resources as part of a larger connected graph

Another way of thinking about linked to course resources is that they are a gateway into a set of connected resources. Most obviously, for an academic paper it is part of a graph structure that includes:
- links to papers referenced in the article;
- links to papers that cite the article;
- links to other papers written by the same author;
- links to other papers in collections containing the article on services such as Mendeley;
- links into the social graph, such as the social connections of the author, or the discovery of people who have shared a link to the resource on a public social network.
For an informal resource such as a blog post, it might be other posts linked to from the post of interest, or other posts that link to it.

Thinking about resources as being part of one or more connected graphs may influence our thinking about the pedagogy. If the intention is that a learner is directed to a resource as a terminal, atomic resource, from which they are expected to satisfy a particular learning requirement, then we aren’t necessarily interested in the context surrounding the resource. If the intention is that the resource is gateway to a networked context around one or more ideas or concepts, then we need to select our resources so that they provide a springboard to other resources. This can be done directly (eg though following references contained within the work, or tracking down resources that cite it), or indirectly, for example by suggesting keywords or search phrases that might be used to discover related resources by independent means. Alternatively, we might link to a resource as an exemplar of the sort of resource students are expected to work with on a given activity, and then expect them to find similar sorts of, but alternative, resources for themselves.

Written by Tony Hirst

November 8, 2011 at 1:54 pm

Follow

Get every new post delivered to your Inbox.

Join 134 other followers