Notes on Custom Course Search Engines Derived from OU Structured Authoring Documents

Over the last few days, I’ve been tinkering with OU Structured Authoring documents, XML docs from which OU course materials – both print and HTML – are generated (you can get an idea about what they look like from OpenLearn: find a course page with a URL of the form http://openlearn.open.ac.uk/mod/oucontent/view.php?id=397337&direct=1 and change direct to content: http://openlearn.open.ac.uk/mod/oucontent/view.php?id=397337&content=1; h/t to Colin Chambers for that one;-). I’ve been focussing in particular on the documents used to describe T151, an entry level online course I developed around all things gaming (culture, business, design and development), and the way in which we can automatically generate custom search engines based on these documents.

The course had a very particular structure – weekly topic explorations framed as a preamble, set of guiding questions, suggested resources (organised by type) and a commentary, along with a weekly practical session.

One XML doc was used per week, from which the separate HTML pages for each week’s study were generated.

One of the experimental components of the course has been a Google Custom Search Engine, that supports searches over external resources that are linked to from the blog. The course also draws heavily on the Digital Worlds Uncourse blog, a site used to scope out the design of the course, as well as draft some of the materials used within it, and the CSE indexes both that site and the sites that are linked from it. (See eSTEeM Project: Custom Course Search Engines and Integrating Course Related Search and Bookmarking? for a little more context around this.)

Through using the course custom search engine myself, I have found several issues with it:

1) with a small index, it’s not really very satisfactory. If you only index exact pages that are linked to from the site, it can be quite hard getting any hits at all. A more relaxed approach might be to index the domains associated with resources, and also include the exact references explicitly with a boosted search rank. At the current time, I have code that scrapes external links from across the T151 course materials and dumps them into a single annotations file (the file that identifies which resources are included in the CSE) without further embellishment. I also have code that identifies the unique domains that are linked to from the course materials and which can also be added to the annotations file. On the to do list is to annotate the resources with labels that identify which topic they are associated with so we can filter results by topic.

2) the Google Custom Search Engines seem to behave very oddly indeed. In several of my experiments, one-word queries often returned few results, while more specific queries building on the original search term delivered more, and different, results. This gives a really odd search experience, and one that I suspect would put many users off.

3) I’ve been coming round more and more to the idea that the best way of highlighting course resources in a search context is through the use of Subscribed Links, that a user can subscribe to and that then appear in their Google search results if there is an exact query match. Unfortunately, Google pulled the Subscribed Links service in September (A fall spring-clean; for an example of what’s been lost, see Stone Temple Consulting: Google Co-Op Subscribed Links).

4) The ability to feed promotions into the top of the CSE results listing is attractive (up to 3 promoted links can be displayed for any given query), but the automatic generation of query terms is problematic. Promotion definitions take the form:

<Promotion image_url="http://kmi.open.ac.uk/images/ou-logo.gif"
  	title="Week 4"
  	id="T151_Week4"
  	queries="week 4,T151 week 4,t151 week 4"
  	url="http://www.open.ac.uk"
  	description="Topic Exploration 4A - An animated experience Topic exploration 4B - Flow in games "/>

Course CSE - week promotion

There are several components we need to consider here:

  1. queries: these are the phrases that are used to trigger the display of the particular promotion links. Informal testing suggests that where multiple promotions are triggered by the same query, the order in which they are defined in the Promotions file determines the order in which they appear in the results. Note that at most three (3) promotions can be displayed for any query. Queries may be based around structural components (such as study week or topic number), subject matter terms (e.g. tags, keywords, or headings) or resource type (e.g. audio/video material, academic readings etc), although we might argue the resource type is not such a meaningful distinction (just because we can make it doesn’t mean we should!). In the T151 materials, presentation conventions provide us with a way of extracting structural components and using these to seed the promotions file. Identifying keywords or phrases is more challenging: students are unlikely to enter search phrases that exactly match section or subsection headings, so some element of term extraction would be required in order to generate query terms that are likely to be used.
  2. title: this is what catches the attention, so we need to put something sensible in here. There is a limit of 160 characters on the length of the title.
  3. description: the description allows us to elaborate on the title. There is a limit of 200 characters on the length of the description.
  4. url: the URL is required but not necessarily ‘used’ by our promotion. That is, if we are using the promotion for informational reasons, and not necessarily wanting to offer a click through, the link may be redundant. (However, the CSE still requires it to be defined.) Alternatively, we might argue that a click through action should always be generated, in which case it might be worth considering whether we can generate a click through to a more refined query on the CSE itself?
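
As a minimal sketch of how the four components above might be assembled, here is one way of building a single Promotion element in Python. The function name make_promotion and the decision to simply truncate over-long text are my own assumptions for illustration; the length limits are the ones noted in the list above.

```python
import xml.etree.ElementTree as ET

TITLE_LIMIT = 160        # title length limit noted in the list above
DESCRIPTION_LIMIT = 200  # description length limit noted in the list above

def make_promotion(promo_id, title, queries, url, description):
    """Return a <Promotion> element (hypothetical helper).

    Over-long titles and descriptions are crudely truncated; a real
    generator would probably want to cut at a word boundary instead.
    """
    return ET.Element("Promotion", {
        "id": promo_id,
        "title": title[:TITLE_LIMIT],
        "queries": ",".join(queries),
        "url": url,
        "description": description[:DESCRIPTION_LIMIT],
    })

promo = make_promotion(
    "T151_Week4",
    "Week 4",
    ["week 4", "T151 week 4", "t151 week 4"],
    "http://www.open.ac.uk",
    "Topic Exploration 4A - An animated experience "
    "Topic exploration 4B - Flow in games",
)
print(ET.tostring(promo, encoding="unicode"))
```

Keeping the queries as a Python list until the last moment makes it easier to add or drop trigger phrases before they are joined into the comma separated queries attribute.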

Where multiple promotions are provided, we need to think about:
a) how they are ordered;
b) what other queries they are associated with (i.e. their specificity);
c) where they link to.

In picking apart the T151 structured authoring documents, I have started by focussing on the low hanging fruit when it comes to generating promotion links. Looking through the document format, it is possible to identify topics associated with separate weeks and questions associated with particular topics. This allows us to build up a set of informational promotions that allow the search engine to respond to queries of what we might term a navigational flavour. So for example, we can ask what topics are covered in a particular week (I also added the topic query as a query for questions related to a particular topic):

Course CSE - multiple promotions

Or what a particular question is within a particular topic:

Course CSE - what's the question?

The promotion definitions are generated automatically and are all very procedural. For example, here’s a fragment from the definition of the promotion from question 4 in topic 4A:

<Promotion 
  	title="Topic Exploration 4A Question 4"
  	queries="topic 4a q4,T151 topic 4a q4,t151 topic 4a q4,topic 4a,T151 topic 4a,t151 topic 4a"
  	... />

The queries this promotion will be displayed for are all based around navigational structural elements. This requires some knowledge of the navigational query syntax, and also provides an odd user experience, because the promotions only display on the main CSE tab, and the organic results from indexed sites turn up all manner of odd results for queries like “week 3” and “topic 1a q4”… (You can try out the CSE here.)
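
The query strings in the fragment above follow a simple pattern: each navigational phrase appears bare, then prefixed with the course code in upper and lower case. A rough sketch of that pattern follows; the function names are hypothetical, and this is a reconstruction from the example rather than the code actually used.

```python
def query_variants(phrase, course_code="T151"):
    """Return the bare phrase plus course-code-prefixed variants."""
    return [
        phrase,
        f"{course_code} {phrase}",
        f"{course_code.lower()} {phrase}",
    ]

def question_queries(topic, question, course_code="T151"):
    """Build the comma separated queries string for one question.

    The question-level phrase comes first, followed by the topic-level
    phrase, so a topic query also pulls in the question promotions.
    """
    topic_phrase = f"topic {topic.lower()}"
    phrases = [f"{topic_phrase} q{question}", topic_phrase]
    queries = []
    for phrase in phrases:
        queries.extend(query_variants(phrase, course_code))
    return ",".join(queries)

print(question_queries("4A", 4))
```

For topic 4A, question 4, this reproduces the queries attribute shown in the fragment.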

The promotions I have specified so far also lack several things:

1) queries based on the actual question description, so that a query related to the question might pull the corresponding promotion into the search results (would that be useful?)

2) a sensible link. At the moment, there is no obvious way in the SA document of identifying one or more resources that particularly relate to a specific question. If there was such a link, then we could use that information to automatically associate a link with a question in the corresponding promotions element. (The original design of the course imagined the Structured Authoring document itself being constructed automatically from component parts. In particular, it was envisioned that suggested links would be tagged on a social bookmarking service and then automatically pulled into the appropriate area of the Structured Authoring document. Resources could then be tagged in a way that associates them with one or more questions (or topics), either directly through a question ID, or indirectly through matching subject tags on a question and on a resource. The original model also considered the use of “suggested search queries” that would be used to populate suggested resources lists with results pulled in live from a (custom) search engine…)

At the moment, it is possible to exploit the T151 document structure to generate these canned navigational queries. The question now is: are promotional links a useful feature, and how might we go about automatically identifying subject meaningful queries?

At the limit, we might imagine the course custom search engine interface being akin to the command line in a text based adventure game, with the adventure itself being the learning journey, and the possible next step a combination of Promotions based guidance and actual search results…

[Code for the link scraping/CSE file generation and mindmap generator built around the T151 SA docs can be found at Github: Course Custom Search Engines]

PS as ever, I tend to focus on tinkering around a rapid prototype/demonstration at the technical systems overview level, with a passing nod to the usefulness of the service (which, as noted above, is a bit patchy where the search engine index is sparse). What I haven’t done is spend much time thinking about the pedagogical aspects relating to how we might make most effective use of custom search engines in the course context. From a scoping point of view, I think there are several things we need to unpick that relate to this: what is it that students are searching for, what context are they searching in, and what are they expecting to discover?

My original thinking around custom course search engines was that they would augment a search across course materials by providing a way of searching across the full text of resources* linked to from the course materials, and maybe also provide a context for searching over student suggested resources.

Search hierarchy

It strikes me that the course search engine is most likely to be relevant where there is active curation of the search engine that provides a search view over a reasonably sized set of resources discovered by folk taking the course and sharing resources related to it. “MOOCs” might be interesting in this respect, particularly where: 1) it is possible to include MOOC blog tag feeds in the CSE as a source of relevant content (both the course blog content and resources linked to from that content – the CSE can be configured to include resources that are linked to from a specified resource); 2) we can grab links that are tagged and shared with the MOOC code on social media and add those to the CSE annotations file. (Note that in this case, it would make sense to resolve shortened links to their ultimate destination URL before adding them to the CSE.) I’m not sure what role promotions might play in a MOOC though, or the extent to which they could be automatically generated?
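
Whether the links come from course materials or from links shared around a MOOC, the step of boiling a pile of URLs down to exact references plus the unique domains they sit on might look something like the sketch below. The function names and example URLs are made up, and the "domain/*" wildcard form is my assumption about the pattern style the annotations file expects.

```python
from urllib.parse import urlparse

def unique_domains(urls):
    """Return the distinct host names found in urls, in first-seen order."""
    seen = []
    for url in urls:
        host = urlparse(url).netloc.lower()
        if host and host not in seen:
            seen.append(host)
    return seen

def annotation_patterns(urls):
    """Exact URLs first (de-duplicated), then one wildcard per domain.

    The "domain/*" form is an assumed pattern style, not checked here
    against the CSE annotations documentation.
    """
    exact = list(dict.fromkeys(urls))
    return exact + [domain + "/*" for domain in unique_domains(urls)]

# Placeholder links standing in for ones scraped from course materials
links = [
    "http://example.org/games/flow.html",
    "http://example.org/games/animation.html",
    "http://example.com/paper.pdf",
    "http://example.org/games/flow.html",
]
for pattern in annotation_patterns(links):
    print(pattern)
```

Shortened links would need resolving to their destination URL before being passed in, otherwise the only domain recovered is that of the link shortener.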

*Full text search across linked to resources is actually quite problematic. Consider the following classes of online resources that we might expect to be linked to from course materials:

  • academic papers, often behind a paywall: links are likely to be redirected through a library proxy service allowing for direct click-thru to the resource using institutional credentials (I assume the user is logged in to the VLE to see the link, and single sign on support allows direct access to any subscribed to resources via appropriate proxies. That is, the link to the resource leads directly to the full text, subscribed to version of the resource if the user is signed on to the institutional system and has appropriate credentials). There are several issues here: the link that is supplied to the CSE should be the public link to the article homepage; the article homepage is likely to reveal little more than the paper abstract to the search engine. I’m not sure if Google Scholar does full-text indexing of articles, but even if it does, Scholar results are not available to the CSE. (There is also the issue of how we would limit the Scholar search to the articles we are linking to from the course materials.)
  • news and magazine articles: again, these may be behind a paywall, but even if they are, they may have been indexed by Google. So they may be full text discoverable via a CSE, even if they aren’t accessible once you click through…
  • video and audio resources: discovery in a large part will depend on the text on the web page the resources are hosted on. If the link is directly to an audio or video file, discoverability via the CSE may well be very limited!
  • books: Google book search provides full text search, but this is not available via a CSE. Full text searchable collections of books are achievable using Google Books Library Shelves; there’s also an API available.

I guess the workaround to all this is not to use a Google Custom Search Engine as the basis for a course search engine. Instead, construct a repository that contains full text copies of all resources linked to from the course, and index that using a local search engine, providing aliased links to the original sources if required?

Fudging the CSE with a local search engine

However, that wasn’t what this experiment was about!;-)

Course Resources as part of a larger connected graph

Another way of thinking about linked to course resources is that they are a gateway into a set of connected resources. Most obviously, for an academic paper it is part of a graph structure that includes:
– links to papers referenced in the article;
– links to papers that cite the article;
– links to other papers written by the same author;
– links to other papers in collections containing the article on services such as Mendeley;
– links into the social graph, such as the social connections of the author, or the discovery of people who have shared a link to the resource on a public social network.
For an informal resource such as a blog post, it might be other posts linked to from the post of interest, or other posts that link to it.

Thinking about resources as being part of one or more connected graphs may influence our thinking about the pedagogy. If the intention is that a learner is directed to a resource as a terminal, atomic resource, from which they are expected to satisfy a particular learning requirement, then we aren’t necessarily interested in the context surrounding the resource. If the intention is that the resource is a gateway to a networked context around one or more ideas or concepts, then we need to select our resources so that they provide a springboard to other resources. This can be done directly (e.g. through following references contained within the work, or tracking down resources that cite it), or indirectly, for example by suggesting keywords or search phrases that might be used to discover related resources by independent means. Alternatively, we might link to a resource as an exemplar of the sort of resource students are expected to work with on a given activity, and then expect them to find similar sorts of, but alternative, resources for themselves.

Rediscovering Playlists…

Yesterday, I had a quick peek at the beta version of SocialLearn (currently open to OU staff, at least…). A key feature of the site is “learning paths”, ordered sets of annotated resources with associated progress status indicators:

I haven’t had a proper play with the site yet, so I will hold off a review of the site just for now, but my first-glimpse reaction to that feature was: “isn’t that what H2O Playlists did?”

(I thought I must have posted some sort of review of H2O Playlists, “shared list[s] of readings and other content about a topic of intellectual interest”, but it seems I only made passing mention of them. However, I do remember creating several H2O playlists as a way of curating links associated with several presentations I gave when I was trying to advocate the use of social bookmarking in education. For a review of H2O Playlists posted elsewhere around the same time, see More on H20 Playlist as a Social Bookmarking Tool for Business.)

H2O Playlists

Hmmm.. it seems I misremembered: you couldn’t check off progress on the playlist items, though you could save items off one list onto your own playlist, and you could also discover “related lists” that shared some of the same items.

Also yesterday, I came across the BBC Food Recipe Binder site:

BBC Food Recipe Binder

This site lets you save, and annotate, recipes on the BBC Food site to a personal “Recipe Binder” page from an on-page call to action button.

BBC Food - Recipe Binder

(Okay, so the Recipe Binder isn’t a playlist, but it is an example of embedded bookmarking/personal curation of web resources…)

And then, today, I fire up my feeds to see all sorts of chatter about the delicious website redesign, a key feature of which appears to be… stacks (aka playlists):

The mechanics for putting together the playlists still seem a bit clunky (do I really need to add three links to create a new playlist?) but I guess it’s still early days… Anyway, here’s my first playlist stack: Crafty Stats…

Suddenly, it seems like 2005 again…

Surveying the Territory: Open Source, Open-Ed and Open Data Folk on Twitter

Over the last few weeks, I’ve been tinkering with various ways of using the Twitter API to discover Twitter lists relating to a particular topic area, whether discovered through a particular hashtag, search term, a list that already exists on a topic, or one or more people who may be associated with a particular topic area.

On my to do list is a map of the “open” community on Twitter – and the relationships between them – that will try to identify notable folk in different areas of openness (open government, open data, open licensing, open source software) and the communities around them, then aggregate all these open aficionados, plot the network connections between them, and remap the result (to see whether the distinct communities we started with fall out, as well as to discover who acts as the bridges between them, or alternatively discover whether new emergent groupings appear to crystallise out based on network connectivity).

As a step on the road to that, I had a quick peek around to find who was tweeting using the #oscon hashtag over the weekend. Through analysing people who were tweeting regularly around the topic, I identified several lists in the area: @realist/opensource, @twongkee/opensource, @lemasney/opensource, @suncao/open-linked-free, @jasebo/open-source.

Pulling down the members of these lists, and then looking for connections between them, I came up with this map of the open source community on Twitter:

A peek at FOSS community on Twitter

Using a different technique not based on lists, I generated a map of the open data community based on the interconnections between people followed by @openlylocal:

How the people @countculture follows follow each other

and the open education community based on the people that follow @opencontent:

How followers of @Opencontent follow each other

(So that’s a different way of identifying the members of each community, right? One based on lists that mention users of a particular hashtag, one based on folk a particular individual follows, and one based on the folk that follow a particular individual.)
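
Whichever way the members of a community are identified, the mapping step is the same: keep only the follow relationships where both ends are in the member set. A minimal sketch, assuming the friend lists have already been fetched into a dict (the user names below are made up, and no Twitter API calls are shown):

```python
def connections_within(members, friends_of):
    """Return directed edges (a, b) where a follows b and both are members.

    friends_of maps a user name to the set of names that user follows;
    anyone followed who is outside the member set is ignored.
    """
    members = set(members)
    edges = []
    for user in sorted(members):
        for friend in sorted(friends_of.get(user, ())):
            if friend in members and friend != user:
                edges.append((user, friend))
    return edges

# Toy data: "zed" is followed by alice but is not part of the community
friends_of = {
    "alice": {"bob", "carol", "zed"},
    "bob": {"alice"},
    "carol": set(),
}
edges = connections_within(["alice", "bob", "carol"], friends_of)
print(edges)
```

The resulting edge list is what gets handed to a tool like Gephi for layout; the three approaches described above differ only in how the members argument is put together.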

I’ve also toyed with looking at communities defined by members of lists that mention a particular individual, or people followed by a particular individual, as well as ones based on members of lists that contain folk listed on one or more trusted, curated lists in a particular topic area (got that?!;-).

Whilst the graphs based on mapping friends or followers of an individual give a good overview of that individual’s sphere of interest or influence, I think the community graphs derived from finding connections between people mentioned on “lists in the area” are a bit more robust in terms of mapping out communities in general, though I guess I’d need to do “proper” research to demonstrate that?

As mentioned at the start, the next thing on my list is a map across the aggregated “open” communities on Twitter. Of course, being digerati, many of these people will have decamped to GooglePlus. So maybe I shouldn’t bother, but instead wait for Google+ to mature a bit, an API to become available, blah, blah, blah…

Google Playing the SEO Link Building Game to Drive Uptake Of Google Profiles?

As you’re probably aware by now, yesterday Google announced its Google+ social network. A key part of every social network is a user’s personal profile page, the “social object” that other people can actually connect to.

Google has offered personal profile pages for some time (here’s my rather basic one), but they’ve never really been a part of anything, and they’re not really linkable to – which means there’s little reason for PageRank based search algorithms such as Google’s to return Google Profile pages in the top results if anyone ever searches for you.

(PageRank is the algorithm that gave Google its early edge in the search engine wars; links from one page to another count as “votes” regarding the quality of the page that is linked to. Crudely put, if people link to you, those links contribute to your PageRank and you’re more likely to make it to the top of a search results page.)
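
To make the “links as votes” idea a little more concrete, here is a toy version of the textbook PageRank iteration. This is the simplified published form of the algorithm with the commonly quoted damping factor of 0.85, not a description of what Google actually runs; the three page link graph is made up.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Crude power-iteration PageRank over a dict of page -> linked pages."""
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # every page gets a small baseline share regardless of inlinks
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # a page splits its "vote" evenly across the pages it links to
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # a page with no outlinks spreads its rank across all pages
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Toy web: a and b both link to c, and c links back to a
ranks = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```

Page c, with two inbound links, ends up ranked highest; page b, which nobody links to, ends up lowest.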

Until now, that is (or at least, until a couple of weeks ago… I missed this announcement at the time it was made…): Authorship markup and web search, a technique for “supporting markup that enables websites to publicly link within their site from content to author pages”.

The method is described as follows:

To identify the author of an article, Google checks for a connection between the content page (such as an article), an author page, and a Google Profile.

A content page can be any piece of content with an author: a news article, blog post, short story …
An author page is a page about a specific author, on the same domain as the content page.
A Google Profile is Google’s version of an author page. It’s how you present yourself to the web and to Google.

In confirming authorship, Google looks for:

Links from the content page to the author page (if the path of links continues to a Google Profile, we can also show Profile information in search results)
A path of links back from your Google Profile to your content.
These reciprocal links are important: without them, anyone could attribute content to you, or you could take credit for any content on the web.
….
The rel="author" link indicates the author of an article [so for example: <a rel="author" href="https://profiles.google.com/tony.hirst/">Google Profile: Tony Hirst</a>]

Source: Authorship

Here’s why you might be tempted to do this…:

Many of you create great content on the web, and we work hard to make that content discoverable on Google. Today, we will start highlighting the people creating this content in Google.com search results.

Google author identified links

As you can see …, certain results will display an author’s picture and name — derived from and linked to their Google Profile — next to their content on the Google Search results page.

Source: Highlighting content creators in search results; [my emphasis]

So… if you want to assert authorship and be recognised as the author in the Google search results, you need to start linking all your content back to your Google Profile Page…

…and so start feeding PageRank juice to your Google profile page…

…so that when folk search for you on the web, they’re more likely to see that page…

This is a harsh reading, of course: authorship can also be asserted by linking within a domain to a page that you have asserted to Google that represents you: The rel="author" link indicates the author of an article, and can point to .. an author page on the same domain as the content page: Written by <a rel="author" href="../authors/mattcutts">Matt Cutts</a>. The author page should link to your Google Profile using rel="me".

(I wonder why <link rel="author" href="../authors/mattcutts"/> isn’t supported? Or maybe it is?)
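
If you wanted to check which authorship assertions a page is making, a quick sketch using Python’s standard library HTML parser might look like this. The class name is my own, and picking up link elements as well as anchors is purely for inspection; it says nothing about which of the two Google honours.

```python
from html.parser import HTMLParser

class RelLinkParser(HTMLParser):
    """Collect (rel, href) pairs from <a> and <link> tags marked author/me."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attrs = dict(attrs)
        # rel may hold several space separated values
        rels = (attrs.get("rel") or "").lower().split()
        for rel in rels:
            if rel in ("author", "me") and attrs.get("href"):
                self.found.append((rel, attrs["href"]))

parser = RelLinkParser()
parser.feed(
    'Written by <a rel="author" href="../authors/mattcutts">Matt Cutts</a>'
)
print(parser.found)
```

Following the chain from content page to author page to profile, and then checking for the reciprocal link back, would mean fetching each page in turn and running the same parser over it.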

Algorithmically, the assertion of authorship might also help in Google’s fight against spamblogs, which republish content blindly from original sources. That is, by asserting authorship of a page, if someone reposts your content, Google will be able to identify you as the original author and return a link back to your page in the search results listing, rather than the republished page.

I imagine there might also be personal reputation benefits – for example, if people +1 a page you have claimed authorship of, it might give you a “Reputation Rank” boost for the subject area associated with that page?

Follower Networks and “List Intelligence” List Contexts for @JiscCetis

I’ve been tinkering with some of my “List Intelligence” code again, and thought it worth capturing some examples of the sort of network exploration recipes I’m messing around with at the moment.

Let’s take @jisccetis as an example; this account follows no-one, is followed by a few, hasn’t much of a tweet history and is listed by a handful of others.

Here’s the follower network, based on how the followers of @jisccetis follow each other:

Friend connections between @Jisccetis followers

There are three (maybe four) clusters there, plus all the folk who don’t follow any of the @jisccetis’ followers…: do these follower clusters make any sort of sense I wonder? (How would we label them…?)

The next thing I thought to do was look at the people who were on the same lists as @jisccetis, and get an overview of the territory that @jisccetis inhabits by virtue of shared list membership.

Here’s a quick view over the folk on lists that @jisccetis is a member of. The nodes are users named on the lists that @jisccetis is named on, the edges are undirected and join individuals who are on the same list.

Distribution of users named on lists that jisccetis is a member of

Plotting “co-membership” edges is hugely expensive in terms of upping the edge count that has to be rendered, but we can use a directed bipartite graph to render the same information (and arguably even more information); here, there are two sorts of nodes: lists, and the members of lists. Edges go from members to listnames (I should swap this direction really to make more sense of authority/hub metrics…?)

jisccetis co-list membership
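
To see why the co-membership view gets expensive, compare the edge counts for the two representations on some made up data: a list with n members generates n(n-1)/2 co-membership edges but only n member-to-list edges. The list and user names below are placeholders.

```python
from itertools import combinations

def co_membership_edges(lists):
    """Undirected edges joining every pair of users who share a list."""
    edges = set()
    for members in lists.values():
        for a, b in combinations(sorted(set(members)), 2):
            edges.add((a, b))
    return edges

def bipartite_edges(lists):
    """Directed edges from each member to each list they appear on."""
    return {
        (member, name)
        for name, members in lists.items()
        for member in set(members)
    }

# Two overlapping lists of ten users each (users 5 to 9 are on both)
lists = {
    "list1": [f"user{i}" for i in range(10)],
    "list2": [f"user{i}" for i in range(5, 15)],
}
print(len(co_membership_edges(lists)), "co-membership edges")
print(len(bipartite_edges(lists)), "bipartite edges")
```

The bipartite form also keeps hold of which list each connection came from, which the co-membership edges throw away.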

Another thing I thought I’d explore is the structure of the co-list membership community. That is, for all the people on the lists that @jisccetis is a member of, how do those users follow each other?

How folk on same lists as @jisccetis follow each other

It may be interesting to explore in a formal way the extent to which the community groups that appear to arise from the friending relationships are reflected (or not) by the make up of the lists?

It would probably also be worth trying to label the follower group – are there “meaningful” (to @jisccetis? to the @jisccetis community?) clusters in there? How would you label the different colour groupings? (Let me know in the comments…;-)

Identifying the Twitterati Using List Analysis

Given absolutely no-one picked up on List Intelligence – Finding Reliable, Trustworthy and Comprehensive Topic/Sector Based Twitter Lists, here’s an example of what the technique might be good for…

Seeing the tag #edusum11 in my feed today, and not being minded to follow it, I used the list intelligence hack to see:

– which lists might be related to the topic area covered by the tag, based on looking at which Twitter lists folk recently using the tag appear on;
– which folk on twitter might be influential in the area, based on their presence on lists identified as maybe relevant to the topic associated with the tag…

Here’s what I found…

Some lists that maybe relate to the topic area (username/list, number of folk who used the hashtag appearing on the list, number of list subscribers), sorted by number of people using the tag present on the list:

/joedale/ukedtech 6 6
/TWMarkChambers/edict 6 32
/stevebob79/education-and-ict 5 28
/mhawksey/purposed 5 38
/fosteronomo/chalkstars-combined 5 12
/kamyousaf/uk-ict-education 5 77
/ssat_lia/lia 5 5
/tlists/edtech-995 4 42
/ICTDani/teched 4 33
/NickSpeller/buzzingeducators 4 2
/SchoolDuggery/uk-ed-admin-consultancy 4 65
/briankotts/educatorsuk 4 38
/JordanSkole/jutechtlets 4 10
/nyzzi_ann/teacher-type-people 4 9
/Alexandragibson/education 4 3
/danielrolo/teachers 4 20
/cstatucki/educators 4 13
/helenwhd/e-learning 4 29
/TechSmithEDU/courosalets 4 2
/JordanSkole/chalkstars-14 4 25
/deerwood/edtech 4 144

Some lists that maybe relate to the topic area (username/list, number of folk who used the hashtag appearing on the list, number of list subscribers), sorted by number of people subscribing to the list (a possible ranking factor for the list):
/deerwood/edtech 4 144
/kamyousaf/uk-ict-education 5 77
/SchoolDuggery/uk-ed-admin-consultancy 4 65
/tlists/edtech-995 4 42
/mhawksey/purposed 5 38
/briankotts/educatorsuk 4 38
/ICTDani/teched 4 33
/TWMarkChambers/edict 6 32
/helenwhd/e-learning 4 29
/stevebob79/education-and-ict 5 28
/JordanSkole/chalkstars-14 4 25
/danielrolo/teachers 4 20
/cstatucki/educators 4 13
/fosteronomo/chalkstars-combined 5 12
/JordanSkole/jutechtlets 4 10
/nyzzi_ann/teacher-type-people 4 9
/joedale/ukedtech 6 6
/ssat_lia/lia 5 5
/Alexandragibson/education 4 3
/NickSpeller/buzzingeducators 4 2
/TechSmithEDU/courosalets 4 2

Other ranking factors might include the follower count, or factors from some sort of social network analysis, of the list maintainer.
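
The counting behind listings like the two above can be sketched roughly as follows. The function name and toy data are hypothetical, and the sketch assumes the list memberships and subscriber counts have already been pulled down into dicts.

```python
from collections import Counter

def rank_lists(tag_users, lists_by_user, subscribers):
    """Count how many hashtag users appear on each list, then rank two ways.

    Returns rows of (list name, tag users on list, subscriber count),
    once sorted by tag users and once sorted by subscribers.
    """
    counts = Counter()
    for user in tag_users:
        for list_name in set(lists_by_user.get(user, [])):
            counts[list_name] += 1
    rows = [(name, n, subscribers.get(name, 0)) for name, n in counts.items()]
    by_tag_users = sorted(rows, key=lambda row: (-row[1], row[0]))
    by_subscribers = sorted(rows, key=lambda row: (-row[2], row[0]))
    return by_tag_users, by_subscribers

# Toy data: three people seen using the tag, and the lists they appear on
tag_users = ["ann", "bob", "cat"]
lists_by_user = {
    "ann": ["/x/edtech", "/y/teachers"],
    "bob": ["/x/edtech"],
    "cat": ["/x/edtech", "/z/ict"],
}
subscribers = {"/x/edtech": 12, "/y/teachers": 40, "/z/ict": 3}

by_tag_users, by_subscribers = rank_lists(tag_users, lists_by_user, subscribers)
print(by_tag_users)
print(by_subscribers)
```

Other ranking factors, such as the follower count of the list maintainer, would slot in as extra columns and alternative sort keys.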

Having got a set of lists, we can then look for people who appear on lots of those lists to see who might be influential in the area. Here’s the top 10 (user, number of lists they appear on, friend count, follower count, number of tweets, time of arrival on twitter):

['terryfreedman', 9, 4570, 4831, 6946, datetime.datetime(2007, 6, 21, 16, 41, 17)]
['theokk', 9, 1564, 1693, 12029, datetime.datetime(2007, 3, 16, 14, 36, 2)]
['dawnhallybone', 8, 1482, 1807, 18997, datetime.datetime(2008, 5, 19, 14, 40, 50)]
['josiefraser', 8, 1111, 7624, 17971, datetime.datetime(2007, 2, 2, 8, 58, 46)]
['tonyparkin', 8, 509, 1715, 13274, datetime.datetime(2007, 7, 18, 16, 22, 53)]
['dughall', 8, 2022, 2794, 16961, datetime.datetime(2009, 1, 7, 9, 5, 50)]
['jamesclay', 8, 453, 2552, 22243, datetime.datetime(2007, 3, 26, 8, 20)]
['timbuckteeth', 8, 1125, 7198, 26150, datetime.datetime(2007, 12, 22, 17, 17, 35)]
['tombarrett', 8, 10949, 13665, 19135, datetime.datetime(2007, 11, 3, 11, 45, 50)]
['daibarnes', 8, 1592, 2592, 7673, datetime.datetime(2008, 3, 13, 23, 20, 1)]
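
The first two columns of a table like this come from counting how many of the identified lists each person appears on. A minimal sketch of that step, using made up list memberships (the profile details in the other columns would need looking up separately):

```python
from collections import Counter

def users_on_most_lists(list_members, top_n=10):
    """Return (user, number of lists) pairs for the most-listed users.

    list_members maps a list name to the users named on it; ties are
    broken alphabetically so the output is repeatable.
    """
    counts = Counter()
    for members in list_members.values():
        counts.update(set(members))  # count each user once per list
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_n]

list_members = {
    "list1": ["ann", "bob", "cat"],
    "list2": ["ann", "bob"],
    "list3": ["ann"],
}
top = users_on_most_lists(list_members)
print(top)
```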

The algorithms I’m using have a handful of tuneable parameters, which means there’s all sorts of scope for running with this idea in a “research” context…

One possible issue that occurred to me was that identified lists might actually cover different topic areas – this is something I need to ponder…

Tags Associated With Other Tags on Delicious Bookmarked Resources

If you’re using a particular tag to aggregate content around a particular course or event, what do the other tags used to bookmark those resources tell you about that course or event?

In a series of recent posts, I’ve started exploring again some of the structure inherent in socially bookmarked and tagged resource collections (Visualising Delicious Tag Communities Using Gephi, Social Networks on Delicious, Dominant Tags in My Delicious Network). In this post, I’m going to look at the tags that co-occur with a particular tag that may be used to bookmark resources relating to an event or course, for example.

Here are a few examples, starting with cck11, using the most recent bookmarks tagged with ‘cck11’:

The nodes are sized according to degree; the edges represent that the two tags were both applied by an individual user to the same resource (so if N tags were applied to a resource, there are N!/(K!(N-K)!) combinations of size K; for three tags (A, B, C) and pairwise combinations (K=2), that gives three combinations: AB, AC, BC).
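
As a quick check on the edge arithmetic, the pairwise combinations for the three tag case can be enumerated and counted with the Python standard library (math.comb needs Python 3.8 or later):

```python
from itertools import combinations
from math import comb

tags = ["A", "B", "C"]

# Each pair of tags on the same bookmark becomes one edge in the graph
pairs = list(combinations(tags, 2))
print(pairs)

# N! / (K! (N-K)!) with N=3, K=2
print(comb(len(tags), 2))
```

A bookmark carrying ten tags would contribute comb(10, 2) = 45 edges, which is why heavily tagged resources dominate these graphs.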

Here are the tags for lak11 – can you tell what this online course is about from them?

Finally, here are tags for the OU course T151; again, can you tell what the course is most likely to be about?

Here’s the Python code I used to generate the gdf network definition files used to generate the diagrams shown above in Gephi:

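import json, os, time
import urllib.request

def getDeliciousTagURL(tag,typ='json', num=100):
  #build the feed URL for the most recent bookmarks carrying the tag
  #need to add a pager to get data when more than 1 page
  return "http://feeds.delicious.com/v2/"+typ+"/tag/"+tag+"?count="+str(num)

def openTimestampedFile(fpath,fname):
  #assumed stand-in for a helper that is not shown in this post:
  #opens fpath/<timestamp>-fname for writing, creating fpath if needed
  if not os.path.exists(fpath):
    os.makedirs(fpath)
  ts=time.strftime('%Y-%m-%d-%H-%M-%S')
  return open(os.path.join(fpath,ts+'-'+fname),'w')

def getDeliciousTaggedURLTagCombos(tag):
  durl=getDeliciousTagURL(tag)
  data = json.load(urllib.request.urlopen(durl))
  uniqTags=[]
  tagCombos=[]
  for item in data:
    #'t' holds the list of tags applied to this bookmark
    tags=item['t']
    for t in tags:
      if t not in uniqTags:
        uniqTags.append(t)
    if len(tags)>1:
      #every pair of tags on the same bookmark becomes an edge
      for t1,t2 in combinations(tags,2):
        print(t1,t2)
        tagCombos.append((t1,t2))
  #write out the node definitions, then the edge definitions, in gdf format
  f=openTimestampedFile('delicious-tagCombos',tag+'.gdf')
  header='nodedef> name VARCHAR,label VARCHAR, type VARCHAR'
  f.write(header+'\n')
  for t in uniqTags:
    f.write(t+','+t+',tag\n')
  f.write('edgedef> tag1 VARCHAR,tag2 VARCHAR\n')
  for t1,t2 in tagCombos:
      f.write(t1+','+t2+'\n')
  f.close()

def combinations(iterable, r):
    # combinations('ABCD', 2) --> AB AC AD BC BD CD
    # combinations(range(4), 3) --> 012 013 023 123
    pool = tuple(iterable)
    n = len(pool)
    if r > n:
        return
    # indices must be a mutable list so positions can be advanced in place
    indices = list(range(r))
    yield tuple(pool[i] for i in indices)
    while True:
        # find the rightmost index that can still be advanced
        for i in reversed(range(r)):
            if indices[i] != i + n - r:
                break
        else:
            return
        indices[i] += 1
        for j in range(i+1, r):
            indices[j] = indices[j-1] + 1
        yield tuple(pool[i] for i in indices)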

Next up? I’m wondering whether a visualisation of the explicit fan/network (i.e. follower/friend) delicious network for users of a given tag might be interesting, to see how it compares to the ad hoc/informal networks that grow up around a tag?