## Information Literacy, Graphs, Hierarchies and the Structure of Information Networks

Over dinner at Côte in Cambridge last week, during the Arcadia Project review event, I doodled a couple of data structures, one on either side of a scrap of paper, and asked my co-Arcadians what sort of thing the drawing might represent, or what the structures they described might be called in general terms.

The sketches were broadly along the lines of the following, though without the circular nodes and labels displayed, just a set of connecting lines:

and:

So if I asked you the same question (what would you call these two different things?), how would you answer?

To my mind, the different organisational structures these represent, and how we can exploit and manipulate them, raise a whole host of issues in the reimagining of information literacy and the teaching of information skills. These range from an understanding of the structure of information spaces, through the representation and analysis of those structures, to ways in which we can navigate and discover things in those spaces, as well as how we can visualise and otherwise make sense of them.

So how would I describe the two different things shown above? The first image represents a hierarchy and is often referred to as a tree. Many library classification schemes, and many organisational management structures, are based around that sort of information structure.

The second image is a depiction of a more general network structure. Whenever I talk about graphs on the OUseful.info blog (in fact, pretty much whenever I talk about a graph anywhere), that’s the sort of thing I’m talking about. This mess of connections is the way the web is structured. (The tree structure is also a graph, but subject to particular constraints; can you work out what some of those constraints might be?)

Note: it’s maybe worth reiterating at this point that when I talk about graphs, it’s the messy network thing I mean, not line charts like this:

One of the terms I got back to describe one of the graphs was “a matrix”. Matrices are in fact a very powerful way of describing the structure of a graph – if you fancy a treasure hunt, the terms adjacency matrix and incidence matrix should give you a head start…
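To give a flavour of the adjacency matrix idea, here’s a minimal sketch in plain Python – the four-node graph is made up for illustration:

```python
# A small undirected graph described as a list of edges between labelled nodes.
nodes = ["A", "B", "C", "D"]
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]

# The adjacency matrix has one row and one column per node; entry [i][j] is 1
# if an edge connects node i to node j, and 0 otherwise.
idx = {n: i for i, n in enumerate(nodes)}
matrix = [[0] * len(nodes) for _ in nodes]
for a, b in edges:
    matrix[idx[a]][idx[b]] = 1
    matrix[idx[b]][idx[a]] = 1  # undirected graph: the matrix is symmetric

for node, row in zip(nodes, matrix):
    print(node, row)
```

Reading along a row tells you immediately which nodes a given node connects to; and for a tree, each column (bar the root’s) would contain exactly one “parent” link.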

I’m not sure what the problem is, but I think there is a problem that arises from not appreciating how powerful graph structures are as a way of making sense of the world. And I’m not really sure what I wanted to say in this post… except maybe go on a little fishing expedition to see how widespread the lack of familiarity with the notion of a graph as something like this:

really is…? So, if I asked you to draw a graph: a) what would you draw? b) would you even remotely consider drawing something like the image directly above? If you answered “no” to (b), does it “say” anything to you at all?! Would you ever draw a diagram that had that flavour when explaining something (what?!) to someone else? (And the same question for the hierarchy…?)

PS a nice thing about graphs is you don’t have to draw them by hand – all you have to do is describe what connects to what, and then you can let a machine draw it for you. So for example:

– here is the “source code” for the tree
– here is the “source code” for the messy network graph
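To give a flavour of what such a description might look like (the node names below are made up, not taken from the linked files), here’s a sketch that emits Graphviz DOT source for a tree and for a messier network; a layout engine such as dot or neato then does the drawing for you:

```python
# "Drawing" a graph just means listing what connects to what, then handing
# the description to a layout engine. Here we generate Graphviz DOT source.
def to_dot(name, edges):
    lines = ['graph %s {' % name]
    lines += ['  "%s" -- "%s";' % (a, b) for a, b in edges]
    lines.append('}')
    return '\n'.join(lines)

# A tree: every node except the root has exactly one parent.
tree_edges = [("root", "a"), ("root", "b"), ("a", "a1"), ("a", "a2"), ("b", "b1")]

# A general network: cycles and cross-links are allowed.
network_edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d"), ("d", "b")]

print(to_dot("tree", tree_edges))
print(to_dot("network", network_edges))
```

Save either description to a file and run something like `dot -Tpng tree.gv -o tree.png` and the machine does the drawing.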

PPS when folk hear other folk wittering on about “the social graph”, what do they think it is? If asked to draw an indicative sketch of “the social graph”, what would they draw?!

## What’s Inside a Book?

A couple of months ago, when I started looking at the idea of emergent social positioning in online social networks, I was focussing on trying to model the positioning of certain brands and companies, in part with a view to trying to identify ones that were associated with innovation, or future thinking in some way.

Based on absolutely no evidence at all, I surmised that one useful signal in this regard might be the context in which companies or brands are mentioned in popular, MBA-related business books, the sort of thing that Harvard Business Review publish, for example.

Here’s how my thinking went then:

– generate a bipartite network graph that connects the book’s index terms with page numbers of the pages they appear on based on the index entries* in a given book. A bipartite graph is one that contains two sorts or classes of node (in this case, index term nodes and book page number nodes). The index terms are likely to include companies, brands, people and ideas/concepts. Sometimes, particular index terms may be identified as companies, names, etc, through presentational mark up – a bold font, or italics, for example. These presentational conventions can often be mapped onto semantic equivalents. Terms might also be passed through something like the Reuters’ Open Calais service, or TSO’s Data Enrichment Service.

– collapse the network graph by generating links between things that are connected to the same page number and remove the page number nodes from the graph. You now have a graph that connects brands, people and other index terms with each other, where edges represent the relation “is on the same page in the same book as”. If companies and other index terms appear on several pages together, we might reflect this by increasing the weight of the edge that connects them, for example by using edge weight to represent the number of pages where the two terms co-exist.

(*This will be obvious to some, but not to others. To a certain extent, a book index provides a faceted/search term limited search engine interface to a book, that returns certain pages as results to particular queries…)
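The two steps above can be sketched in plain Python – the toy index data here is made up, and networkx’s bipartite projection functions would do the same job at scale:

```python
from itertools import combinations
from collections import Counter

# A toy book index: each term maps to the pages it appears on (invented data).
index = {
    "Apple": [12, 40, 77],
    "innovation": [12, 77, 90],
    "Kodak": [40],
    "disruption": [90, 91],
}

# Step 1: the bipartite graph, as term -> page edges.
bipartite_edges = [(term, page) for term, pages in index.items() for page in pages]

# Step 2: collapse out the page nodes - connect two terms whenever they share
# a page, using the edge weight to count the number of shared pages.
terms_by_page = {}
for term, page in bipartite_edges:
    terms_by_page.setdefault(page, []).append(term)

weights = Counter()
for terms in terms_by_page.values():
    for a, b in combinations(sorted(terms), 2):
        weights[(a, b)] += 1

for (a, b), w in sorted(weights.items()):
    print("%s -- %s (weight %d)" % (a, b, w))
```

So in this made-up index, “Apple” ends up tied to “innovation” with weight 2 (they co-occur on two pages), which is exactly the kind of signal the approach is fishing for.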

Note that we can generate a network for a specific book, in which case we can render a graphical summary of the content, relations within and structure of that book, or we can generate more comprehensive networks that summarise the index term relations across several books.

My thinking then was that if we can grab the indexes of a set of business books, we could map which companies and brands were being associated either with each other or with particular concepts in MBA land.

Which is where the problem lies – because I haven’t found anywhere I can readily get hold of the indexes of business books in a sensible machine readable format. Given an electronic copy of a book, I guess I could run some text processing algorithms over it looking for word pairs in close association with each other and generating my own view over the book. But the reason for using an actual book index is at least twofold: firstly, because there has presumably been a quality process that determines what terms are entered into the index; secondly, because the index, if used by a human reader, will be influencing which parts of the book (and hence which related terms) they will be exposed to.

(It’s maybe also worth noting that books also contain a lot of other structured metadata – tables of contents, lists of figures, titles, headings, subheadings, emphasis, lists, captions, and so on, all of which provide cues as to how the book is structured and how ideas and entities contained within it relate to each other.)

As to why I’m posting this now? I first floated this idea with @edchamberlain following a JISC bibliography data event, and he reminded me of it at the Arcadia Project review a couple of days ago ;-)

Related, sort of: Augmenting OU/BBC Co-Pro Programme Data With Semantic Tags, which looked at mapping corporate mentions in the BBC/OU co-pro business programme The Bottom Line:

Also Citation Positioning.

PS this is clever – and related – via @ostephens: http://www.eatyourbooks.com/ (“‘Tell us which books you own’ We have indexed the most popular cookbooks & magazines so recipes become instantly searchable.”).

## Tags Associated With Other Tags on Delicious Bookmarked Resources

If you’re using a particular tag to aggregate content around a particular course or event, what do the other tags used to bookmark those resources tell you about that course or event?

In a series of recent posts, I’ve started exploring again some of the structure inherent in socially bookmarked and tagged resource collections (Visualising Delicious Tag Communities Using Gephi, Social Networks on Delicious, Dominant Tags in My Delicious Network). In this post, I’m going to look at the tags that co-occur with a particular tag that may be used to bookmark resources relating to an event or course, for example.

Here are a few examples, starting with cck11, using the most recent bookmarks tagged with ‘cck11’:

The nodes are sized according to degree; an edge represents that the two tags were both applied by an individual user to the same resource. (So if N=3 tags (A, B, C) were applied to a resource, there are N!/(K!(N−K)!) pairwise (K=2) combinations – AB, AC, BC – that is, three combinations in this case.)
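As a quick check on that counting, Python’s itertools gives the pairwise combinations directly (the tag names here are just placeholders):

```python
from itertools import combinations

tags = ["A", "B", "C"]  # N = 3 tags applied to one bookmarked resource
pairs = list(combinations(tags, 2))  # every K = 2 selection: N!/(K!(N-K)!) = 3
print(pairs)
```

The same routine scales up: ten tags on a bookmark would contribute 10!/(2!8!) = 45 tag-pair edges to the graph.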

Here are the tags for lak11 – can you tell what this online course is about from them?

Finally, here are tags for the OU course T151; again, can you tell what the course is most likely to be about?

Here’s the Python code I used to generate the gdf network definition files used to generate the diagrams shown above in Gephi:

```
import simplejson, urllib
import time

def openTimestampedFile(stub, fname):
    # Assumed helper (not shown in the original post): open a file for
    # writing whose name includes a timestamp.
    return open(stub + '-' + time.strftime('%Y%m%d-%H%M%S') + '-' + fname, 'w')

def getDeliciousTagURL(tag, typ='json', num=100):
    # Need to add a pager to get data when there is more than one page of results
    return "http://feeds.delicious.com/v2/json/tag/" + tag + "?count=" + str(num)

def getDeliciousTaggedURLTagCombos(tag):
    durl = getDeliciousTagURL(tag)
    data = simplejson.load(urllib.urlopen(durl))
    uniqTags = []
    tagCombos = []
    for bookmark in data:
        tags = bookmark['t']
        for t in tags:
            if t not in uniqTags:
                uniqTags.append(t)
        if len(tags) > 1:
            for i, j in combinations(tags, 2):
                print i, j
                tagCombos.append((i, j))
    f = openTimestampedFile('delicious-tagCombos', tag + '.gdf')
    f.write('nodedef> name VARCHAR,label VARCHAR,type VARCHAR\n')
    for t in uniqTags:
        f.write(t + ',' + t + ',tag\n')
    f.write('edgedef> tag1 VARCHAR,tag2 VARCHAR\n')
    for i, j in tagCombos:
        f.write(i + ',' + j + '\n')
    f.close()

def combinations(iterable, r):
    # itertools.combinations recipe, for Pythons that lack it
    # combinations('ABCD', 2) --> AB AC AD BC BD CD
    # combinations(range(4), 3) --> 012 013 023 123
    pool = tuple(iterable)
    n = len(pool)
    if r > n:
        return
    indices = range(r)
    yield tuple(pool[i] for i in indices)
    while True:
        for i in reversed(range(r)):
            if indices[i] != i + n - r:
                break
        else:
            return
        indices[i] += 1
        for j in range(i + 1, r):
            indices[j] = indices[j - 1] + 1
        yield tuple(pool[i] for i in indices)
```

Next up? I’m wondering whether a visualisation of the explicit fan/network (i.e. follower/friend) delicious network for users of a given tag might be interesting, to see how it compares to the ad hoc/informal networks that grow up around a tag?

## Arcadia Project – OU Report Back Presentation

Short notice, but then, if I’d given more notice there’d have been all sorts of calendar negotiations over a week or two and then we’d have rescheduled anyway…

Presentation trailer

Many OU folk will have already spent an hour or two at the Learn About fair (fayre?) on that day, so you might as well write the whole day off in terms of doing “proper work”…;-)

## Time for a University Prepress?

When I first joined the OU as a lecturer, I was self-motivated, research active, publishing to peer reviewed academic conferences outside of the context of a formal research group. That didn’t last more than a couple of years, though… In that context, and at that time, one of the things that struck me about the OU was that research active academics were expected to produce written work for publication in two ways: for research, through academic conferences and journals; and for teaching, via OU course materials.

The internal course material production route was, and still is, managed through a process of course team review at the authoring stage, then supported by editors, artists and picture researchers for publication, although I don’t remember so much involvement from media project managers ten years or so ago, if they even existed then? Pagination and layout were managed elsewhere; for authors who struggled to use the provided document templates, the editor was on hand for technical review as well as typos, grammar and reference checking, and a course secretary could be brought in to style the document appropriately. Third party rights were handled by the course manager, and so on.

In contrast, researchers had to research and write their papers, produce images, charts and tables as required, and style the document as a camera-ready document using a provided style sheet. In addition, published researchers would also review (and essentially help edit) works submitted to other journals and conferences. The publisher contributed nothing except perhaps project management and the production and distribution of the actual print material (though I seem to remember getting offprints, receiving requests for them, and mailing them out with an OU stamp on an OU envelope).

Although I haven’t published research formally for some time, I suspect the same is still largely true nowadays…

Given that the OU is a publication house, publishing research and teaching materials as a way of generating income, I wonder if there is an opportunity for the Library to support the research publication process by providing specialist support for research authors, including optimising their outputs for discovery!

At the current time, many academic libraries host their institution’s repository, providing a central location within which are lodged copies of academic research publications produced by members of that institution. Some academic publishers even offer an ‘added value’ service in their publication route whereby a published article, as written, corrected, laid out, paginated, rights cleared, and rights waived by the author (and reviewed for free by one or more of their peers) will be submitted back to the institution’s repository.

[Cue bad Catherine Tate impression]: what a f*****g liberty… [!]

So as the year ends, here’s a thought I’ve ranted to several people over the year: academic libraries should seize the initiative from the academic publishers, adopt the view that the content being produced by the academy is valuable to publishers as well as academics, that the reputation of journals is in part built on the reputation of the institutions and academics responsible for producing the research papers, and set up a system in which:

– academics submit articles to the repository using an institutional XML template (no more faffing around with different style sheets from different publishers), at which point they are released using a preview stylesheet as a preprint;

– journals to which articles are to be submitted are required to collect the articles from the repository. Layout and pagination is for them to do, before getting it signed off by the author;

– optionally, journal editors might be invited to bid for the right to publish an article formally. The benefit of formal publication for the publisher is that when a work is cited, the journal gets the credit for having published the work.

That is all… ;-)

PS RAE/REF style accounting could also be used in part to set journal pricing and payments. Crap journals that no-one cites content in would get nothing. Well cited journals would be recompensed more generously… There would of course be opportunities for gaming the system, but addressing this would be similar in kind to implementing measures that search engines based on PageRank style algorithms take against link farms, etc.

Just a quick observation…

If you run a blog (or any other) RSS feed through Feedburner, the title links in the feed point to a Feedburner proxy for the link.

If you click on the link to take you through to the actual linked page, and the actual page URI, you may well get something like this:

http://www.gamesetwatch.com/2009/11/fighting_fantasy_flowcharts.php?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+gamesetwatch+%28GameSetWatch%29

That is, a URI with Google Analytics tracking info attached automagically by Feedburner (see Google Analytics, Feedburner and Google Reader for more on this).
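If the tracking clutter bothers you, cleaning such URIs back up is straightforward – here’s a minimal sketch using only the Python standard library (the example URL is the one above):

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def strip_tracking(url, prefixes=("utm_",)):
    """Drop query parameters whose names start with any of the given prefixes."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if not k.startswith(tuple(prefixes))]
    return urlunsplit(parts._replace(query=urlencode(kept)))

url = ("http://www.gamesetwatch.com/2009/11/fighting_fantasy_flowcharts.php"
       "?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+gamesetwatch")
print(strip_tracking(url))
```

Non-tracking parameters survive untouched, so `strip_tracking("http://example.com/a?x=1&utm_source=f")` hands back `http://example.com/a?x=1`.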

Here, then, are a couple of good examples of why you might not want to use (Google) Feedburner for your RSS feeds:

2) you have no idea what future ‘innovations’ the Goog will introduce to pollute your feed even further.

(Bear in mind that Google Feedburner also allows you to inject ads into a feed you have burned using AdSense for Feeds.)

## Using JISCPress/Digress.it for Reading List Publication

One of the things I’ve been doodling with, but not managing to progress much thinking-wise (not enough dog walking time lately!), is how we might be able to use the digress.it WordPress theme to support various course related functions in ways that exploit the disaggregating features of the theme.

Chatting with Huw Jones last week about his upcoming Arcadia seminar on “The Problem of Reading Lists” (this coming Tuesday, Nov 24th – all welcome;-) I started thinking again about the potential for using digress.it as a means of publishing, and collecting comments on, reading lists.

So for example, over on the doodlings WriteToReply site I’ve posted an example of how a reading list posted under the theme is automatically disaggregated into separate, uniquely identified references:

The reading list was generated simply by copying and pasting a PDF based reading list into a WordPress blog post. Looking at the format of the list, one could imagine adding further comments or notes relating to each reference using a blog comment. Given that the basis of each paragraph is a citation to a particular work, it might be possible to parse out enough information to generate a link to a search on the University OPAC for the corresponding work (and if so, pull back an indication of the availability of the book as, for example, my Library Traveler script used to do for books viewed on Amazon).

Under the current in-testing digress.it theme, each paragraph on the page can be made available as a separate item in an RSS feed; that is, as well as the standard ‘single item’ RSS page feed that WordPress generates automatically, we can get an N-item feed from the page for the N-paragraphs contained on a page.

Which in turn means that to generate an itemised RSS feed version of a reading list, all I need to do is paste the reading list – with each reference in a separate paragraph – into a single blog post. (The same is true for disaggregating/feed itemising previous exam papers, for example, or I guess video links in order to generate a DeliTV programme bundle…?!)

(For more details of the various ways in which digress.it can automatically disaggregate/atomise a document, see Open Data: What Have We Got?.)

PS just a reminder again – Huw’s Reading List project talk, which is about far more than just reading lists, is on Tuesday in the Old Combination Room, Wolfson College, Cambridge, at 6pm.

Here’s a quick post from under the radar… Apparently, folk from Cam Libraries get together every so often for an informal but issues related brown bag lunch somewhere… It seems like the where and whenabouts of these events is a closely guarded secret.

I think I’m ‘presenting’ at a brown bag lunch session next week, Nov 27th, but I don’t have access to the mailing list the announcement went out on so don’t know any more details than that.

I did, however, manage to grab a bootleg of a trailer for what may or may not be this event, based on what I think I said I could talk about if I managed to get an invite:

If the event is on, I guess I’ll be told immediately before the event and taken to the location blindfolded (presumably using a brown paper bag?).

Just in case, best keep this hush hush, okay? ;-)

For many years now, it’s been possible to subscribe to persistent (“saved”) Google News searches and so build up your own custom dashboard views of news… Indeed, it was over three years ago now that I hacked together a demo news feed roller (Persistent News Search OPML Feed Roller) that let users bundle up a roll of feeds in an OPML file (sort of!) for easy viewing elsewhere.

And if OPML isn’t your thing, then services like Netvibes or Pageflakes let you easily wire up your own news dashboard:

But we all know in our heart of hearts that RSS and Atom feed subscriptions are just not widespread as a consumer technology. Folk aren’t knowingly using feeds, and they’re not unknowingly using them directly either. (But feeds are being used as wiring/plumbing behind the scenes, so RSS is not dead yet, okay?!;-)

(In the Library world, as well as the wider news reading world, this failure to engage with feed subscriptions can be seen (in part) by the lack of significant uptake of RSS alerts.)

So when Google announced last week that you can now Create and Share custom News sections, it struck me that they were getting round the exposed plumbing problem that subscribing to a feed implies, and instead making it easy to create a custom view (the output of which can also be subscribed to) without the appearance of having to do much plumbing at all – How to Create Your Own Google Custom News Section (Tutorial):

You can search the directory of already created news sections – as well as find a link to a page that lets you create your own news sections, here: Google News: Custom sections directory.