Information Literacy, Graphs, Hierarchies and the Structure of Information Networks

Over dinner at Côte in Cambridge last week, during the Arcadia Project review event, I doodled a couple of data structures, one on either side of a scrap of paper, and asked my co-Arcadians what sort of thing the drawing might represent, or what the structures they described might be called in general terms.

The sketches were broadly along the lines of the following, though without the circular nodes and labels displayed, just a set of connecting lines:

Hierarchy

and:

Graph

So if I asked you the same question (what would you call these two different things?), how would you answer?

To my mind, the different organisational structures these represent, and how we can exploit and manipulate them, represents a whole host of issues in the reimagining of information literacy and the teaching of information skills. This ranges from an understanding of the structure of information spaces through the representation and analysis of those structures, to ways in which we can navigate and discover things in those spaces as well as how we can visualise and otherwise make sense of them.

So how would I describe the two different things shown above? The first image represents a hierarchy and is often referred to as a tree. Many library classification schemes, and many organisational management structures, are based around that sort of information structure.

The second image is a depiction of a more general network structure. Whenever I talk about graphs on the OUsefu.info blog (in fact, pretty much whenever I talk about a graph anywhere), that’s the sort of thing I’m talking about. This mess of connections is the way the web is structured. (The tree structure is also a graph, but subject to particular constraints; can you work out what some of those constraint might be?)

Note: it’s maybe worth reiterating at this point when I talk about graphs, the messy network thing I mean, not line charts like this:

Line chart

One of the terms I got to describe one of the graphs was “a matrix”. Matrices are in fact a very powerful way of describing the structure of a graph – if you fancy a treasure hunt, the terms adjacency matrix and incidence matrix should give you a head start…

I’m not sure what the problem is, but I think there is a problem that arises from not appreciating how powerful graph structures are as a way of making sense of the world. And I’m not really sure what I wanted to say in this post… except maybe go on a little fishing expedition to see how widespread the lack of familiarity with the notion of a graph as something like this:

Graph

really is…? So, if I asked you to draw a graph: a) what would you draw? b) would you even remotely consider drawing something the the image directly above? If you answered “no’ to (b), does it “say” anything to you at all?! Would you ever draw a diagram that had that flavour when explaining something (what?!) to someone else? (And the same question for the hierarchy…?)

PS a nice thing about graphs is you don’t have to draw them by hand – all you have to do is describe what connects to what, and then you can let a machine draw it for you. So for example:

– here is the “source code” for the tree
– here is the “source code” for the messy network graph

PPS when folk hear other folk wittering on about “the social graph”, what do they think it is? If asked to draw an indicative sketch of “the social graph”, what would they draw?!

What’s Inside a Book?

A couple of months ago, when I started looking at the idea of emergent social positioning in online social networks, I was focussing on trying to model the positioning of certain brands and companies, in part with a view to trying to identify ones that were associated with innovation, or future thinking in some way.

Based on absolutely no evidence at all, I surmised that one useful signal in this regard might be the context in which companies or brands are mentioned in popular, MBA-related business books, the sort of thing that Harvard Business Review publish, for example.

Here’s how my thinking went then:

– generate a bipartite network graph that connects the book’s index terms with page numbers of the pages they appear on based on the index entries* in a given book. A bipartite graph is one that contains two sorts or classes of node (in this case, index term nodes and book page number nodes). The index terms are likely to include companies, brands, people and ideas/concepts. Sometimes, particular index terms may be identified as companies, names, etc, through presentational mark up – a bold font, or italics, for example. These presentational conventions can often be mapped onto semantic equivalents. Terms might also be passed through something like the Reuters’ Open Calais service, or TSO’s Data Enrichment Service.

– collapse the network graph by generating links between things that are connected to the same page number and remove the page number nodes from the graph. You now have a graph that connects brands, people and other index terms with each other, where edges represent the relation “is on the same page in the same book as”. If companies and other index terms appear on several pages together, we might reflect this by increasing the weight of the edge that connects them, for example by using edge weight to represent the number of pages where the two terms co-exist.

(*This will be obvious to some, but not to others. To a certain extent, a book index provides a faceted/search term limited search engine interface to a book, that returns certain pages as results to particular queries…)

Note that we can generate a network for a specific book, in which case we can render a graphical summary of the content, relations within and structure of that book, or we can generate more comprehensive networks that summarise the index term relations across several books.

My thinking then was that if we can grab the indexes of a set of business books, we could map which companies and brands were being associated either with each other or with particular concepts in MBA land.

Which is where the problem lays – because I haven’t found anywhere where I can readily get hold of the indexes of business books in a sensible machine readable format. Given an electronic cpy of a book, I guess I could run some text processing algorithms over it looking for word pairs in close association with each other and generating my own view over the book. But the reason for using an actual book index is at least twofold: firstly, because there has presumably been a a quality process that determines what terms are entered into the index; secondly, because the index, if used by a human reader, will be influencing which parts of the book (and hence which related terms) they will be exposed to.

(It’s maybe also worth noting that books also contain a lot of other structured metadata – tables of contents, lists of figures, titles, headings, subheadings, emphasis, lists, captions, and so on, all of which provide cues as to how the book is structured and how ideas and entities contained within it relate to each other.)

As to why I’m posting this now? I first floated this idea with @edchamberlain following a JISC bibliography data event, and he reminded me of it at the Arcadia Project review a couple of days ago ;-)

Related, sort of: Augmenting OU/BBC Co-Pro Programme Data With Semantic Tags, which looked at mapping corporate mentions in the BBC/OU co-pro business programme The Bottom Line:

First attempt at tagging BBC/OU 'The Bottom Line' progs using opencalais

Also Citation Positioning.

PS this is clever – and related – via @ostephens: http://www.eatyourbooks.com/ (“‘Tell us which books you own’ We have indexed the most popular cookbooks & magazines so recipes become instantly searchable.”).

Tags Associated With Other Tags on Delicious Bookmarked Resources

If you’re using a particular tag to aggregate content around a particular course or event, what do the other tags used to bookmark those resource tell you about that course or event?

In a series of recent posts, I’ve started exploring again some of the structure inherent in socially bookmarked and tagged resource collections (Visualising Delicious Tag Communities Using Gephi, Social Networks on Delicious, Dominant Tags in My Delicious Network). In this post, I’m going to look at the tags that co-occur with a particular tag that may be used to bookmark resources relating to an event or course, for example.

Here are a few examples, starting with cck11, using the most recent bookmarks tagged with ‘cck11’:

The nodes are sized according to degree; the edges represent that the two tags were both applied by an individual user person to the same resource (so if three (N) tags were applied to a resource (A, B, C), there are N!/(K!(N-K)!) pairwise (K=2) combinations (AB, AC, BC; that is, three combinations in this case.).

Here are the tags for lak11 – can you tell what this online course is about from them?

Finally, here are tags for the OU course T151; again, can you tell what the course is most likely to be about?

Here’s the Python code I used to generate the gdf network definition files used to generate the diagrams shown above in Gephi:

import simplejson, urllib

def getDeliciousTagURL(tag,typ='json', num=100):
  #need to add a pager to get data when more than 1 page
  return "http://feeds.delicious.com/v2/json/tag/"+tag+"?count=100"

def getDeliciousTaggedURLTagCombos(tag):
  durl=getDeliciousTagURL(tag)
  data = simplejson.load(urllib.urlopen(durl))
  uniqTags=[]
  tagCombos=[]
  for i in data:
    tags=i['t']
    for t in tags:
      if t not in uniqTags:
        uniqTags.append(t)
    if len(tags)>1:
      for i,j in combinations(tags,2):
        print i,j
        tagCombos.append((i,j))
  f=openTimestampedFile('delicious-tagCombos',tag+'.gdf')
  header='nodedef> name VARCHAR,label VARCHAR, type VARCHAR'
  f.write(header+'\n')
  for t in uniqTags:
    f.write(t+','+t+',tag\n')
  f.write('edgedef> tag1 VARCHAR,tag2 VARCHAR\n')
  for i,j in tagCombos:
      f.write(i+','+j+'\n')
  f.close()

def combinations(iterable, r):
    # combinations('ABCD', 2) --> AB AC AD BC BD CD
    # combinations(range(4), 3) --> 012 013 023 123
    pool = tuple(iterable)
    n = len(pool)
    if r > n:
        return
    indices = range(r)
    yield tuple(pool[i] for i in indices)
    while True:
        for i in reversed(range(r)):
            if indices[i] != i + n - r:
                break
        else:
            return
        indices[i] += 1
        for j in range(i+1, r):
            indices[j] = indices[j-1] + 1
        yield tuple(pool[i] for i in indices)

Next up? I’m wondering whether a visualisation of the explicit fan/network (i.e. follower/friend) delicious network for users of a given tag might be interesting, to see how it compares to the ad hoc/informal networks that grow up around a tag?

Arcadia Project – OU Report Back Presentation

Short notice, but then, if I gave more notice there’d have been all sorts of calendar negotiations over a week or two then we’d have rescheduled anyway…


Presentation trailer

Many OU folk will have already spent an hour or two at the Learn About fair (fayre?) on that day, so you might as well as right the whole day off in terms of doing “proper work”…;-)

Time for a University Prepress?

When I first joined the OU as a lecturer, I was self-motivated, research active, publishing to peer reviewed academic conferences outside of the context of a formal research group. That didn’t last more than a couple of years, though… In that context, and at that time, one of the things that struck me about the OU was that research active academics were expected to produce written work for publication in two ways: for research, through academic conferences and journals; and for teaching, via OU course materials.

The internal course material production route was, and still is, managed through a process of course team review in the authoring stage and then supported by editors, artists and picture researchers for publication, although I don’t remember so much involvement from media project managers ten years or so ago, if they even existed then? Pagination and layout was managed elsewhere, and for authors who struggled to use the provided document templates, the editor was at hand for technical review as well as typos and grammar, as well as reference checking, and a course secretary could be brought in to style the document appropriately. Third party rights were handled by the course manager, and so on.

In contrast, researchers had to research and write their papers, produce images, charts, tables as required, and style the document as a camera ready document using a provided style sheet. In addition, published researchers would also review (and essentially help edit) works submitted to other journals and conferences. Th publisher contributed nothing except perhaps project management and the production and distribution of the actual print material (though I seem to remember getting offprints, receiving requests for them, and mailing them out with an OU stamp on an OU envelope).

Although I haven’t published research formally for some time, I suspect the same is still largely true nowadays…

Given that the OU is a publication house, publishing research and teaching materials as a way of generating income, I wonder if there is an opportunity for the Library to support the research publication process providing specialist support for research authors, including optimising them for discovery!

At the current time, many academic libraries host their institution’s repository, providing a central location within which are lodge copies of academic research publications produced by members of that institution. Some academic publishers even offer an ‘added value’ service in their publication route whereby a published article, as written, corrected, layed out, paginated, rights cleared, and rights waived by the author (and reviewed for free by one or more of their peers) will be submitted back to the institution’s repository.

[Cue bad Catherine Tate impression]: what a f*****g liberty… [!]

So as the year ends, here’s a thought I’ve ranted to several people over the year: academic libraries should seize the initiative from the academic publishers, adopt the view that the content being produced by the academy is valuable to publishers as well as academics, that the reputation of journals is in part built on the reputation of the institutions and academics responsible for producing the research papers, and set up a system in which:

– academics submit articles to the repository using an institutional XML template (no more faffing around with different style sheets from different publishers), at which point they are released using a preview stylesheet as a preprint;

– journals to which articles are to be submitted are required to collect the articles from the repository. Layout and pagination is for them to do, before getting it signed off by the author;

– optionally, journal editors might be invited to bid for the right to publish an article formally. The benefit of formal publication for the publisher is that when a work is cited, the journal gets the credit for having published the work.

That is all… ;-)

PS RAE/REF style accounting could also be used in part to set journal pricing and payments. Crap journals that no-on cites content in would get nothing. Well cited journals would be recompensed more generously… There would of course bee opportunities for gaming the system, but addressing this would be similar in kind to implementing measures that search engines based on PageRank style algorithms take against link farms, etc.

Google/Feedburner Link Pollution

Just a quick observation…

If you run a blog (or any other) RSS feed through Feedburner, the title links in the feed point to a Feedburner proxy for the link.

If you use Google Reader, and send a post to delicious:

the Feedburner proxy link is the link that you’ll bookmark:

(Hmmmm, methinks it would be handy if Delicious gave you the option to bookmark the ‘terminal’ URI rather than a proxied or short URI? Maybe by getting Google proxied links into Delicious, Google is amassing data about social bookmarking behaviour from RSS feeds on Delicious? So how about this for a scenario: you wake up tomorrow to find the Goog has bought Delicious off Yahoo, and all your bookmarked links are suddenly rewritten in the form: http://deliproxy.google.com/~r/gamesetwatch/~3/Yci8wJb49yk/fighting_fantasy_flowcharts.php)

If you click on the link to take you through to the actual linked page, and the actual page URI, you may well get something like this:

http://www.gamesetwatch.com/2009/11/fighting_fantasy_flowcharts.php?
utm_source=feedburner&utm_medium=feed
&utm_campaign=Feed%3A+gamesetwatch+%28GameSetWatch%29

That is, a URI with Google Analytics tracking info attached automagically by Feedburner (see Google Analytics, Feedburner and Google Reader for more on this).

Here, then, are a couple of good examples of why you might not want to use (Google) Feedburner for your RSS feeds:

1) it can pollute your links, first by appending them with Google Analytics tracking codes, then by rewriting the link as a proxied link;
2) you have no idea what future ‘innovations’ the Goog will introduce to pollute your feed even further.

(Bear in mind that Google Feedburner also allows you to inject ads into a feed you have burned using AdSense for Feeds.)