OUseful.Info, the blog…

Trying to find useful things to do with emerging technologies in open education

Archive for July 2011

Getting My Eye In Around F1 Quali Data – Parallel Coordinate Plots, Sort of…

Looking over the sector times from the qualifying session for tomorrow’s Hungarian Grand Prix, I noticed that Vettel was only fastest in one of the sectors.

Whilst looking for an easy way of shaping an R data frame so that I could plot categorical values sector1, sector2, sector3 on the x-axis, and then a line for each driver showing their time in the sector on the y-axis (I still haven’t worked out how to do that? Any hints? Add them to the comments please…;-), I came across a variant of the parallel coordinate plot hidden away in the lattice package:

f1 2011 hun quali sector times parallel coordinate plot

What this plot does is for each row (i.e. each driver) take values from separate columns (i.e. times from each sector), normalise them, and then plot lines between the normalised value, one “axis” per column; each row defines a separate category.
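As a rough illustration of that normalisation step (this is not lattice's own code), here is a minimal Python sketch that rescales each column to the range 0..1 using that column's own minimum and maximum. The driver labels and times are made up for the example.

```python
def min_max_normalise(rows, columns):
    """Rescale each named column to 0..1 using that column's own min and max."""
    normalised = [dict(row) for row in rows]
    for col in columns:
        values = [row[col] for row in rows]
        lo, hi = min(values), max(values)
        span = hi - lo
        for row in normalised:
            # A constant column carries no information, so map it to 0.
            row[col] = (row[col] - lo) / span if span else 0.0
    return normalised


# Hypothetical sector times in seconds (illustrative values, not the real session data).
times = [
    {"driver": "A", "sector1": 28.6, "sector2": 22.4, "sector3": 28.9},
    {"driver": "B", "sector1": 28.8, "sector2": 22.5, "sector3": 28.7},
    {"driver": "C", "sector1": 29.1, "sector2": 22.9, "sector3": 29.3},
]
scaled = min_max_normalise(times, ["sector1", "sector2", "sector3"])
```

After scaling, the fastest car in each sector sits at 0 and the slowest at 1, whatever the actual spread in seconds was.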

The normalisation obviously hides the magnitude of the differences between the time deltas in each sector (the min-max range might be hundredths in one sector, tenths in another), but this plot does show us that there are different groupings of cars – there is clear(?!) white space in the diagram:

Clusters in sectors times

Whilst the parallel co-ordinate plot helps identify groupings of cars, and shows where they may have similar performance, it isn’t so good at helping us get an idea of which sector had most impact on the final lap time. For this, I think we need to have a single axis in seconds showing the delta from the fastest time in the sector. That is, we should have a parallel plot where the parallel axes have the same scale, but in terms of sector time, a floating origin (so e.g. the origin for one sector might be 28.6s and for another, 22.4s). For convenience, I’d also like to see the deltas shown on the y-axis, and the categorical ranges arranged on the x-axis (in contrast to the diagrams above, where the different ranges appear along the y-axis).
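One way of getting at those deltas, sketched in Python rather than R and using made-up times, is to subtract the fastest time in each sector from every driver's time in that sector. The second function reshapes the result into one (driver, sector, delta) row per point, which is the sort of "long" layout a line-per-driver chart would typically want. All names and values here are assumptions for illustration.

```python
def sector_deltas(rows, sectors):
    """Return, per driver, the gap in seconds to the fastest time in each sector."""
    best = {s: min(row[s] for row in rows) for s in sectors}
    return [
        {"driver": row["driver"], **{s: round(row[s] - best[s], 3) for s in sectors}}
        for row in rows
    ]


def to_long(rows, sectors):
    """Reshape wide rows into (driver, sector, value) tuples, one per plotted point."""
    return [(row["driver"], s, row[s]) for row in rows for s in sectors]


# Hypothetical sector times in seconds (illustrative values, not the real session data).
SECTORS = ["sector1", "sector2", "sector3"]
times = [
    {"driver": "A", "sector1": 28.6, "sector2": 22.4, "sector3": 28.9},
    {"driver": "B", "sector1": 28.8, "sector2": 22.5, "sector3": 28.7},
    {"driver": "C", "sector1": 29.1, "sector2": 22.9, "sector3": 29.3},
]
deltas = sector_deltas(times, SECTORS)
long_form = to_long(deltas, SECTORS)
```

Because every axis is now in seconds from the sector best, the axes share a scale but each has its own floating origin.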

PS I also wonder to what extent we can identify signatures for the different teams? Eg the fifth and sixth slowest cars in sector 1 have the same signature across all three sectors and come from the same team; and the third and fourth slowest cars in sector 2 have a very similar signature (and again, represent the same team).

Where else might we look for signatures? In the speed traps maybe? Here’s what the parallel plot for the speed traps looks like:

Speed trap parallel plot

(In this case, max is better = faster speed.)

To combine the views (timings and speed), we might use a formulation of the flavour:

library(lattice)
parallel(~data.frame(a$sector1,a$sector2,a$sector3, -a$inter1,-a$inter2,-a$finish,-a$trap))

Combined parallel plot - times and speeds

This is a bit too cluttered to pull much out of though? I wonder if changing the order of parallel axes might help, e.g. by trying to come up with an order that minimises the number of crossed lines?

And if we colour lines by team, can we see any characteristics?

Looking for patterns across teams

Using a dashed, rather than solid, line makes the chart a little easier to read (more white space). Using a thicker line also helps bring out the colours.

parallel(~data.frame(a$sector1,-a$inter1,-a$inter2,a$sector2,a$sector3, -a$finish,-a$trap),col=a$team,lty=2,lwd=2)

Here’s another ordering of the axes:

Another attempt at ordering axes

Here are the sector times ordered by team (min is better):

Sector times coloured by team

Here are the speeds by team (max is better):

Speeds by team

Again, we can reorder this to try to make it easier(?!) to pull out team signatures:

Reordering speed traps

(I wonder – would it make sense to try to order these based on similarity eg derived from a circuit guide?)

Hmmm… I need to ponder this…

Written by Tony Hirst

July 30, 2011 at 11:45 pm

Fragments… Obtaining Individual Photo Descriptions from flickr Sets

I think I’ve probably left it too late to think up some sort of hack for the UK Discovery developer competition, but here’s a fragment that might provide a starting point for someone else… How to use a Yahoo pipe to grab a list of photos in a particular flickr set (such as one of the sets posted by the UK National Archive to the flickr commons).

The recipe makes use of two calls to the flickr API: one to get a list of photos in a particular set, and a second, made repeatedly, to pull down the details for each photo in the set.

In pseudo-code, we would write the algorithm along the lines of:

get list of photos in a given flickr set
for each photo in the set:
  get the details for the photo

Here’s the pipe:

Example of calling flickr api to obtain descriptions of photos in a flickr set

The first step is to construct a call to the flickr API to pull down the photos in a given set. The API is called via a URI of the form:

http://api.flickr.com/services/rest/?method=flickr.photosets.getPhotos
&api_key=APIKEY&photoset_id=PHOTOSETID&format=json&nojsoncallback=1

The API returns a JSON object containing separate items identifying each photo in the set.

The rename block constructs a new attribute for each photo item (detailURI) containing the corresponding photo ID. The RegEx block applies a regular expression to each item’s detailURI attribute to transform it to a URI that calls the flickr API for details of a particular photo, by photo id. The call this time is of the form:

http://api.flickr.com/services/rest/?method=flickr.photos.getInfo
&api_key=APIKEY&photo_id=PHOTOID&format=rest

Finally, the Loop block runs through each item in the original set, calls the flickr API using the detailURI to get the details for the corresponding photo, and replaces each item with the full details of each photo.
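For anyone who would rather script this than build a pipe, the same two-step recipe might look something like the following Python sketch. It is not a translation of the pipe itself: the function names are made up, both calls ask for JSON (the pipe used format=rest for the second call), and the `photoset` / `photo` keys are my assumption about how the photo list is laid out in the JSON response. The API root is copied from the URIs above; the live service may now require https.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_ROOT = "http://api.flickr.com/services/rest/"


def photoset_url(api_key, photoset_id):
    """Build the URI that lists the photos in a set."""
    return API_ROOT + "?" + urlencode({
        "method": "flickr.photosets.getPhotos",
        "api_key": api_key,
        "photoset_id": photoset_id,
        "format": "json",
        "nojsoncallback": 1,
    })


def photo_info_url(api_key, photo_id):
    """Build the URI that returns the details for one photo."""
    return API_ROOT + "?" + urlencode({
        "method": "flickr.photos.getInfo",
        "api_key": api_key,
        "photo_id": photo_id,
        "format": "json",
        "nojsoncallback": 1,
    })


def fetch_json(url):
    """Fetch a URL and decode the JSON body."""
    with urlopen(url) as response:
        return json.load(response)


def photo_details(api_key, photoset_id, fetch=fetch_json):
    """Get the list of photos in a set, then fetch the details for each one."""
    listing = fetch(photoset_url(api_key, photoset_id))
    photos = listing["photoset"]["photo"]  # assumed response layout
    return [fetch(photo_info_url(api_key, photo["id"])) for photo in photos]
```

Passing in the `fetch` function makes it easy to try the loop out against canned data before spending any API calls.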

flickr api - photo details

You can find the pipe here: grabbing photo details for photos in a flickr set

An obvious next step might be to enrich the photo descriptions with semantic tags using something like the Reuters OpenCalais service. On a quick demo, this didn’t seem to work in the pipes context (I wonder if there is Calais API throttling going on, or maybe a timeout?) but I’ve previously posted a recipe using Python that shows how to call the Open Calais service in a Python context: Augmenting OU/BBC Co-Pro Programme Data With Semantic Tags.

Again in pseudo code, we might do something like:

get JSON feed out of Yahoo pipe
for each item:
  call the Calais API against the description element

We could then index the description, title, and semantic tags for each photo and use them to support search and linking between items in the set.
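A minimal sketch of that indexing step might look like this. The tagging call is deliberately left as a function you pass in: `naive_tagger` below is a hypothetical stand-in that just picks out capitalised words, not a call to the Calais API, and the item fields (`id`, `title`, `description`) are assumed rather than taken from the pipe output.

```python
from collections import defaultdict


def naive_tagger(text):
    """Hypothetical stand-in for a semantic tagging service: returns capitalised words."""
    words = (w.strip(".,;:!?") for w in text.split())
    return [w for w in words if w[:1].isupper()]


def build_index(items, extract_tags):
    """Map each lower-cased term (from title, description and tags) to the ids of items using it."""
    index = defaultdict(set)
    for item in items:
        tags = extract_tags(item["description"])
        words = (item["title"] + " " + item["description"]).split()
        for term in words + tags:
            key = term.strip(".,;:!?").lower()
            if key:
                index[key].add(item["id"])
    return index


def related(index, item_id):
    """Return ids of other items that share at least one indexed term with item_id."""
    others = set()
    for ids in index.values():
        if item_id in ids:
            others |= ids
    others.discard(item_id)
    return others


# Made-up items standing in for the photo details returned for a set.
items = [
    {"id": "p1", "title": "Kew Gardens", "description": "Glasshouse at Kew in London."},
    {"id": "p2", "title": "Tower Bridge", "description": "Bridge over the Thames in London."},
]
index = build_index(items, naive_tagger)
```

A real version would want a stop word list, otherwise everything ends up linked to everything else via words like "in" and "the".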

Other pivots potentially arise from identifying whether photos are members of other sets, and using the metadata associated with those other sets to support discovery, as well as information contained within comments associated with each photo.

Written by Tony Hirst

July 29, 2011 at 10:18 pm

Circles vs Community Detection

One take on the story so far:

- Facebook supports symmetrical follows and allows you to see connections between your Facebook friends;
- Twitter supports asymmetric follows and allows you to see pretty much everyone’s friend and follower connections;
- Google+ supports asymmetric follows.

Facebook and Twitter both support lists but hardly anyone uses them. Google+ encourages you to put people into addressable circles (i.e. lists).

If you can grab a copy of connections between folk in your social network, you can run social network statistics that will partition out different social groupings:

My annotated twitter follower network

If you’re familiar with the interests of people in a particular cluster, you can label them (there are also ways you might try to do this automagically).

Now a Facebook app, Super Friends, will help you identify – and label – clusters in your Facebook network (via ReadWriteWeb):

Super friends facebook app

This is a great feature, and something I could imagine being supported to some extent in Gephi, for example, by allowing the user to create a node attribute where the values represent label mappings from different modularity clusters (or more simply by allowing a user to add a label to each modularity class?).

The SuperFriends app also stands in contrast to the Google+ approach. I’d class SuperFriends as gardening, whereas the Google+ approach is more one of planning. The Google+ approach encourages you to think you’re in control of different parts of your network and makes your life really complicated (which circle do I put this person in; do I need a new circle for this?); the SuperFriends approach helps you realise how complicated (or not) your social circle is. In terms of filters, the Google+ approach encourages you to add your own, whereas the SuperFriends approach helps you identify settings that emerge out of network properties.

Given that in many respects Google is an AI/machine learning company, it’s odd that they’re getting the user to define circle/set membership; maybe it’d be too creepy if they automatically suggested groups? Maybe there’s too much scope for error if you don’t deliberately place people into a group yourself (and instead trust an algorithm to do it?)

SuperFriends helps uncover structure; Google+ forces you to make all sorts of choices and decisions every time you “follow” another person. Google+ makes you define tags and categories to label people up front; SuperFriends identifies clusters that might be covered by an obvious tag.

Looking at my delicious bookmarks, I have almost as many tags as bookmarks… But if I ran some sort of grouping analysis (not sure what?!), maybe natural clusters – and natural tags – would emerge as a result?

Maybe I need to read Everything is Miscellaneous again…?

PS if you want to run a more hands on analysis of your Facebook network, try this: Getting Started With The Gephi Network Visualisation App – My Facebook Network, Part I

PPS here’s another Facebook app that identifies clusters: http://www.fellows-exp.com/ h/t @jacomyal

PPPS @danmcquillan also tweeted that LinkedIn InMaps do a similar clustering job on LinkedIn connections. They do indeed; and they use Gephi. I wonder if they’ve released the code that handles things from the point at which social network graph data is provided to the rendering of the map?

Written by Tony Hirst

July 28, 2011 at 10:38 am

Posted in Anything you want, Infoskills

Autocuration Signals in My Personalised Google Search Results

I spotted this for the first time last night:

Auto-curation signals in my search results

I had actually read the post in the Google Reader context (so Google knew that), but I wonder: if I hadn’t read the post, would it still have shown up like that?

As far as personalised ranking signals go:

- does the fact that I subscribe to the feed in Google Reader affect the rank of items from that feed in my personalised search results?
- if I have read the post in Google Reader, does that also affect the rank of that specific post in my personalised search results?

If I have shared a link – through Google+, or Twitter, for example – is the ranking of that link positively affected in my personalised search results? That is, might social search actually be most useful when the Goog picks up on things I have shared myself, and then “reminds” me of them via a ranking boost in my personalised search results when I’m searching on a related topic?

Maybe tweeting and sharing into the void is actually yet another way of invisibly building search refinements into your personalised search context?

Written by Tony Hirst

July 27, 2011 at 1:09 pm

Posted in Search, SEO

Extracting Data From Misbehaving RSS/Atom Feeds

A quick recipe prompted by a query from Joss Winn about getting data out of an apparently broken Atom feed: http://usesthis.com/feed/?r=n&n=150

The feed previews in Google Reader okay – http://www.google.com/reader/atom/feed/http://usesthis.com/feed/?r=n&n=150 – and is also viewable in my browser, but neither Google Spreadsheets (via the =importFeed() formula) nor YQL (?!) appear to like it.

[Note: h/t to Joss for pointing this out to me: http://www.google.com/reader/atom/feed/http://usesthis.com/feed/?r=n&n=150 is a recipe for accessing Google Reader's archive of a feed, and pulling out e.g. n=150 items (r=n is maybe an ordering argument?) Which is to say: here's a way of accessing an archive of RSS feed items...:-)]

However, Yahoo Pipes does, so a simple proxy pipe normalises the feed and gives us one that is properly formatted:

Sample Yahoo feed proxy - http://pipes.yahoo.com/ouseful/feedproxy

The normalised feed can now be accessed via:

http://pipes.yahoo.com/ouseful/feedproxy?_render=rss&furl=http%3A%2F%2Fusesthis.com%2Ffeed%2F%3Fr%3Dn%26n%3D150
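The only fiddly bit in building that proxy URL by hand is escaping the original feed URL so that its own ? and & characters don't get mixed up with the pipe's arguments. Here is a small Python sketch of one way to do it; `proxy_url` is just a name I've made up for the helper.

```python
from urllib.parse import urlencode

PIPE = "http://pipes.yahoo.com/ouseful/feedproxy"


def proxy_url(feed_url, render="rss"):
    """Build the pipe URL, percent-encoding the original feed URL as the furl argument."""
    return PIPE + "?" + urlencode({"_render": render, "furl": feed_url})


rss_proxy = proxy_url("http://usesthis.com/feed/?r=n&n=150")
csv_proxy = proxy_url("http://usesthis.com/feed", render="csv")
```

Building the query string this way also means there are no stray separators in the URL.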

We can also get access to a CSV output:

http://pipes.yahoo.com/ouseful/feedproxy?_render=csv&furl=http%3A%2F%2Fusesthis.com%2Ffeed

The CSV can be imported into a Google spreadsheet using the =importData() formula:
=importData("http://pipes.yahoo.com/ouseful/feedproxy?_render=csv&furl=http%3A%2F%2Fusesthis.com%2Ffeed")

[Gotcha: if you have ?&_render in the URL (i.e. ?&), Spreadsheets can't import the data...]

Once in the spreadsheet it’s easy enough to just pull out e.g. the description text from each feed item because it all appears in a single column.

Google spreadsheets can also query the feed and just pull in the description element. For example:

=ImportFeed("http://pipes.yahoo.com/ouseful/feedproxy?_render=rss&furl=http%3A%2F%2Fusesthis.com%2Ffeed","items summary")

(Note that it seemed to time out on me when I tried to pull in the full set of 150 elements in Joss’ original feed, but it worked fine with 15.)

We can also use the YQL developer console to just pull out the description elements:

select description from rss where url='http://pipes.yahoo.com/ouseful/feedproxy?_render=rss&furl=http%3A%2F%2Fusesthis.com%2Ffeed%2F%3Fr%3Dn%26n%3D150'

YQL querying an rss feed

YQL then provides XML or JSON output as required.
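If you have the normalised RSS to hand as text, pulling out just the description elements is also easy enough in a few lines of Python. This is a sketch using the standard library XML parser against a tiny made-up feed, not against the real pipe output.

```python
import xml.etree.ElementTree as ET


def item_descriptions(rss_text):
    """Return the text of the description element of every item in an RSS document."""
    root = ET.fromstring(rss_text)
    return [item.findtext("description", default="") for item in root.iter("item")]


# A made-up two-item feed standing in for the output of the proxy pipe.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <description>Channel level description</description>
    <item><title>One</title><description>First interview</description></item>
    <item><title>Two</title><description>Second interview</description></item>
  </channel>
</rss>"""

descriptions = item_descriptions(SAMPLE_RSS)
```

Note that the channel's own description is skipped, because only elements inside an item are looked at.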

Written by Tony Hirst

July 27, 2011 at 8:59 am

Autodiscoverable Feeds and UK HEIs (Again…)

It’s that time of year again when Brian’s banging on about IWMW, the Institutional[ised?] Web Managers’ Workshop, and hence that time of year again when he reminds me* about my UK HE Feed Autodiscovery app that trawls through various UK HEI home pages (the ones on .ac.uk, rather than the one you get by searching for a uni name in Google;-)

* that is, tells me the script is broken and, by implication, gently suggests that I should fix it…;-)

As ever, most universities don’t seem to be supporting autodiscoverable feeds (neither are many councils…), so here are a few thoughts about what feeds you might link to, and why…

- news feeds: the canonical example. News feeds can be used to pipe news around various university websites, and also syndicate content to any local press or hyperlocal news sites. If every UK HEI published a news feed that was autodiscoverable as such, it would be trivial to set up a UK universities aggregated newswire.

- research announcements: I was told that one reason for putting out press releases was simply to build up an institutional memory/archive of notable events. Many universities run research newsletters that remark on awarded grants. How about a “funded research” feed from each university detailing grant awards and other research funding? Again, at a national level, this could be aggregated to provide a research funding newswire, as well as contributing data to local archives of research funding success.

- jobs: if every UK HEI published a jobs/vacancies RSS feed, it would be trivial to build an aggregator and let people roll their own versions of jobs.ac.uk.

- events: universities contribute a lot to local culture through public talks and exhibitions. Make it easy for the local press and hyperlocal news sites to syndicate this info, and add events to their own aggregated “what’s on” calendars. (And as well as RSS, give ‘em an iCal feed for your events.)

- recent submissions to local repository: provide a feed listing recent submissions to the local research output/paper repository (and/or maybe a feed of the most popular downloads); if local feeds are your thing, the library quite possibly makes things like recent acquisition feeds available…

- YouTube uploads: you might as well add an autodiscoverable feed to your university’s recent uploads on YouTube. If nothing else, it contributes an informal ownership link to the web for folk who care about things like that.

- your university Twitter feed: if you’ve got one. I noticed Glasgow Caledonian linked to their Twitter feed through an autodiscoverable link on their university homepage.

- tenders: there’s a whole load of work going on in gov at the moment regarding transparency as it relates to procurement and tendering. So why not get open with your procurement and tendering data, and increase the chances of SMEs finding out what you’re tendering around? If the applications have to go through a particular process, no problem: link to the appropriate landing page in each feed item.

- energy data: releasing this data may well become a requirement in the not so far off future, so why not get ahead of the game, e.g. as Lincoln are starting to do (Lincoln U energy data)? If everyone was publishing energy data feeds, I’m sure DevCSI hackday folk would quickly roll together something like the aggregating service built by college student @issyl0 out of a Rewired State hack that pulls together UK gov department energy data: GovSpark

- XCRI-CAP course marketing data feeds: JISC is giving away shed loads of cash to support this, so pull your finger out and get the thing published.

- location data: got a KML feed yet? If not, why not? e.g. Innovations in Campus Mapping

PS the backend of my RSS feed autodiscovery app (founded: 2008) is a Yahoo pipe. Just because, I thought I’d take half an hour out to try and build something related on Scraperwiki. The code is here: UK University Autodiscoverable RSS feeds. Please feel free to improve it, fork it, etc. University homepage URLs are identified by scraping a page on the Universities UK website, but I probably should use a feed from the JISC Monitoring Unit (e.g. getting UK University location/contact data).
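For what it's worth, the core of feed autodiscovery is just looking for link elements in a page's head that declare an alternate RSS or Atom version. The following is a minimal Python sketch of that check, not the Scraperwiki code or the pipe; the class name, function name and sample page are all made up for the example.

```python
from html.parser import HTMLParser

FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkParser(HTMLParser):
    """Collect (title, href) pairs for autodiscoverable feed links in an HTML page."""

    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = {name: (value or "") for name, value in attrs}
        rels = attributes.get("rel", "").lower().split()
        if "alternate" in rels and attributes.get("type", "").lower() in FEED_TYPES:
            self.feeds.append((attributes.get("title", ""), attributes.get("href", "")))


def discover_feeds(html):
    """Return the autodiscoverable feeds declared in a page."""
    parser = FeedLinkParser()
    parser.feed(html)
    return parser.feeds


# A made-up homepage fragment.
SAMPLE_PAGE = """<html><head>
<link rel="stylesheet" href="/style.css">
<link rel="alternate" type="application/rss+xml" title="News" href="http://www.example.ac.uk/news.rss">
<link rel="alternate" type="application/atom+xml" title="Events" href="/events.atom" />
</head><body></body></html>"""

feeds = discover_feeds(SAMPLE_PAGE)
```

Run over a list of university homepages, a count of the pages that return an empty list gives a quick measure of who isn't playing.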

PPS this could be handy for some folk – the code that runs the talks@cam events site: http://source.caret.cam.ac.uk/svn/projects/talks.cam/. (Thanks Laura:-) – does it do feeds nicely now?! Related: Keeping Up With Events, a quickly hacked app from my Arcadia project that (used to) aggregate Cambridge events feeds.)

Written by Tony Hirst

July 26, 2011 at 6:59 pm

Innovations in Campus Mapping

A post on the ever interesting Google Maps Mania blog alerted me to a new interactive campus map produced by the Marketing folk at Loughborough University built on top of Google maps (I wonder if this is partly driven by Loughborough’s seemingly close ties with the Google education folk? If so, I wonder if they’re doing anything interesting relating to search in education too?)

Loughborough campus map maps.lboro.ac.uk

The layers are reminiscent of the layers in the Southampton campus map (e.g. Open Data Powered Location Based Services in UK Higher Education).

A simpler approach is taken by Liverpool University, who have bookmarked a few markers onto a map:

Liverpool campus map

Bristol University’s map appears to fall somewhere between the two in terms of complexity:

Bristol U campus map

Have any other UK HEIs rolled out interactive maps built on top of a third party API such as Google’s or maybe OpenStreetMap?

I know that Lincoln and Southampton are also currently battling it out in the “who can do the best 3D models of university buildings in Google Earth” (e.g. Lincoln’s KML file – h/t @alexbilbie;-). Any other UK HEIs with 2.5/3D building model layers? [h/t re: 2.5D to @gothwin] (Why’s this useful? Because if you can identify buildings placed as 3D models in Google Earth, you also have the lat/long data for those building profiles;-) Of course, just having building markers on a Google Map would also give you that geo-location data.

Lincoln U 3d layers/kml in google earth

I wonder whether any universities have Google Street View (or equivalent) style navigation? Where campuses are town centre based, and built on public roads, I guess they may be? I also wonder how many universities have started marking up locations as Google Places, or places on other location based services (I notice the Lincoln KML layer links through to a venue/place definition on Foursquare). This sort of detail would then support services based around checkins. For a homebrew example of what this might look like, see the OU’s check-in app: wayOU – mobile location tracking app using linked data.

A crude alternative, for promo purposes at least, is the campus tour, of course. Here’s one from several years ago looking at the OU’s Walton Hall campus:

Any others out there? Or any other map/location based initiatives being carried out by UK HEIs, whether similar to the above mentioned ones, or maybe even something completely different?!;-)

Written by Tony Hirst

July 26, 2011 at 6:08 pm

Posted in Anything you want

Integrating Course Related Search and Bookmarking?

Not surprisingly, I’m way behind on the two eSTEeM projects I put proposals in for – my creative juices don’t seem to have been flowing in those areas for a bit:-( – but as a marking avoidance strategy I thought I’d jot down some thoughts that have been coming to mind about how the custom search project at least might develop (eSTEeM Project: Custom Course Search Engines).

The original idea was to provide a custom search engine that indexes pages and domains that are referenced within a course in order to provide a custom search engine for that course. The OU course T151 is structured as a series of topic explorations using the structure:

- topic overview
- framing questions
- suggested resources
- my reflections on the topic, guided by the questions, drawing on the suggested resources and a critique of them

One original idea for the course was that rather than give an explicit list of suggested resources, we provide a set of links pulled in live from a predefined search query. The list would look as if it was suggested by the course team but it would actually be created dynamically. As instructors, we wouldn’t be specifying particular readings, instead we would be trusting the search algorithm to return relevant resources. (You might argue this is a neglectful approach… a more realistic model might be to have specifically recommended items as well as a dynamically created list of “Possibly related resources”.)

At this point it’s maybe worth stepping back a moment to consider what goes into producing a set of search results. Essentially, there are three key elements:

- the index, the set of content that the search engine has “searched” and from which it can return a set of results;
- the search query; this is run against the index to identify a set of candidate search results;
- a presentation algorithm that determines how to order the search results as presented to the user.

If the search engine and the presentation algorithm are fixed, then for a given index and a given set of search terms we get a known set of results back. So in this case, we could use a fixed custom search engine, with known search terms, and return a known list of suggested readings. The search engine would provide some sort of “ground truth” – same answer for the same query, always.
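To make the three elements concrete, here is a toy Python sketch: the index is a list of documents, the query is matched against each document's text, and the presentation algorithm is a sort key. Everything in it (the documents, the URLs, the newest-first ranking) is invented for illustration and bears no relation to how a real search engine works internally.

```python
def search(index, query, rank):
    """Return the URLs of documents containing every query term, ordered by the rank key."""
    terms = query.lower().split()
    hits = [doc for doc in index if all(term in doc["text"].lower() for term in terms)]
    return [doc["url"] for doc in sorted(hits, key=rank)]


def newest_first(doc):
    """Presentation algorithm: most recent first, ties broken by URL."""
    return (-doc["year"], doc["url"])


# The index: a made-up set of course resources.
index = [
    {"url": "http://example.com/a", "text": "Robot ethics overview", "year": 2009},
    {"url": "http://example.com/b", "text": "Robot design basics", "year": 2010},
    {"url": "http://example.com/c", "text": "Ethics of surveillance", "year": 2011},
]

first = search(index, "robot", newest_first)
again = search(index, "robot", newest_first)  # same index, query, ranking: same results

# Once the index changes, the same query can return something the instructor never saw.
index.append({"url": "http://example.com/d", "text": "Robot swarms", "year": 2011})
later = search(index, "robot", newest_first)
```

With the index and ranking held fixed the results are repeatable; let either of them change and the "suggested reading" list changes too.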

If we trust the sources and the presentation algorithm, and we trust that we have written an effective search query, then if the index is not fixed, or if a personalised ranking algorithm (that we trust) is used as part of the search engine, we would potentially be returning search results that the instructor has not seen before. For example, the resources may be more recent than the last time the instructor searched for resources to recommend, or they better fit the personalisation criteria for the user under the ranking algorithm used as part of the presentation algorithm.

In this case, the instructor is not saying: “I want you to read this particular resource”. They are saying something more along the lines of: “these are potentially the sorts of resource I might suggest you look at in order to study this topic”. (Lots of caveats in there… If you believe in content led instruction, with students referring to specifically referenced resources, I imagine that you would totally rail against this approach!)

At times, we might want to explicitly recommend one or two particular resources, but also open up some other recommendations to “the algorithm”. It struck me that it might be possible to do this within the context of a Google Custom Search approach using “special results” (e.g. Google CSEs: Creating Special Results/Promotions).

For example, Google CSEs support:

- promotions: “A promotion is simply an association between a pre-defined set of query terms and a link to a webpage. When a user types a search that exactly matches one of your query terms, the promotion appears at the top of the page.” So by using a specific search term, we can force the return of a specific result as the top result. In the context of a topic exploration, we could thus prepopulate the search form of an embedded search engine with a known search phrase, and use a promotion to force a “recommended reading” link to the top of the results listing.

Promotion links are stored in a separate config file and have the form:

<Promotions>
  <Promotion id="1"
    queries="wanderer, the wanderer" 
    title="Groo the Wanderer" 
    url="http://www.groo.com/"
    description="Comedy. American series illustrated by Sergio Aragonés."
    image_url="http://www.newsfromme.com/images5/groo11.jpg" />
</Promotions>

- subscribed links: subscribed links allow you to return results in a specific format (such as text, or text and a link, or other structured results) based on a perfect match with a specific search term. In a sense, subscribed links represent a generalised version of promotions. Subscribed links are also available to users outside the context of a CSE. If a user subscribes to a particular subscribed link file, then if there is an exact match against one of the search phrases in the subscribed link file and a search phrase used by a subscribing user on Google web search (i.e. on google.com or google.co.uk), the subscribed link will be returned in the results listing.

In the simplest case, subscribed links can be defined at the individual link level:

Google subscribed link definition

If your search term is an exact match for the term in the subscribed link definition, it will appear in the main search results page:

Google subscribed links
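Going back to the promotions file shown above, the exact-match behaviour can be mimicked in a few lines of Python. This is only a sketch of the idea: it parses the example file, splits the queries attribute on commas, and looks up the user's query. Ignoring case and surrounding whitespace is my assumption, not something taken from the Google documentation.

```python
import xml.etree.ElementTree as ET

# The example promotions file from the post.
PROMOTIONS_XML = """<Promotions>
  <Promotion id="1"
    queries="wanderer, the wanderer"
    title="Groo the Wanderer"
    url="http://www.groo.com/"
    description="Comedy. American series illustrated by Sergio Aragonés."
    image_url="http://www.newsfromme.com/images5/groo11.jpg" />
</Promotions>"""


def load_promotions(xml_text):
    """Map each promoted query phrase to the (title, url) of its promotion."""
    table = {}
    for promo in ET.fromstring(xml_text).iter("Promotion"):
        for query in promo.get("queries", "").split(","):
            table[query.strip().lower()] = (promo.get("title"), promo.get("url"))
    return table


def promoted_result(table, user_query):
    """Return the promotion for an exactly matching query, or None if there isn't one."""
    return table.get(user_query.strip().lower())


promotions = load_promotions(PROMOTIONS_XML)
```

So a prepopulated search box containing "the wanderer" would trigger the promotion, whereas a longer phrase that merely contains those words would not.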

It’s also possible to define subscribed link definition files, either as simple tab separated docs or RSS/Atom feeds, or using a more formal XML document structure. One advantage of creating subscribed links files for use within a custom search engine is that users (i.e. students) can subscribe to them as a way of augmenting or enhancing their own Google search results. This has the joint effect of increasing the surface area of the course, so that course related recommendations can be pushed to the student for relevant queries made through the Google search engine, as well as providing a legacy offering: students can potentially take away a subscription when they finish the course to continue to receive “academically credible” results on relevant search topics. (By issuing subscription links on a per course presentation basis (or even on a personalised, unique feed per student basis), feeds to course alumni might be customised, for example by removing links to subscription content (or suggesting how such content might be obtained through a subscription to the university library), or occasionally adding in advertising related links. So if a student searches using a “course” keyword, make recommendations around that via a subscribed links feed; in the limit, this could even take on the form of a personalised, subscription based advertising channel.)

Another way in which “recommended” links can be boosted in a custom search result listing is through boosting search results via their ranking factors (Changing the Ranking of Your Search Results).

In the case of both subscribed links and boosted search results, it’s possible to create a configuration file dynamically. Where students are bookmarking search results relating to a course, it would therefore be possible to feed these into a course related custom search engine definition file, or a subscribed link file. If subscribed link files are maintained at a personal level, it would also be possible to integrate a student’s bookmarked links in to their subscribed links feed, at least for use on Google websearch (probably not in the custom search engine context?). This would support rediscovery of content bookmarked by the student through subscribed link recommendations.
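As a sketch of what generating such a file dynamically might involve, the following Python builds a promotions-style XML document from a list of bookmarks. It only uses the attributes that appear in the promotions example earlier in this post (id, queries, title, url); the bookmark structure is made up, and whether Google would accept the output as-is is an assumption I haven't tested.

```python
import xml.etree.ElementTree as ET


def promotions_from_bookmarks(bookmarks):
    """Build a Promotions XML document with one Promotion element per bookmark."""
    root = ET.Element("Promotions")
    for number, bookmark in enumerate(bookmarks, start=1):
        ET.SubElement(root, "Promotion", {
            "id": str(number),
            "queries": ", ".join(bookmark["queries"]),
            "title": bookmark["title"],
            "url": bookmark["url"],
        })
    return ET.tostring(root, encoding="unicode")


# Hypothetical bookmarks saved by students on a course.
bookmarks = [
    {"title": "Robot ethics overview", "url": "http://example.com/a",
     "queries": ["robot ethics", "ethics of robots"]},
    {"title": "Surveillance primer", "url": "http://example.com/c",
     "queries": ["surveillance"]},
]
promotions_xml = promotions_from_bookmarks(bookmarks)
```

The same loop could just as easily write out a tab separated subscribed links file instead.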

Just by the by, a PR mailing in my inbox today threw up another example of how search and bookmarking might be brought more closely together: SearchTeam (screenshots [pdf]).

The model here is based around defining search contexts that one or more users can contribute to, and then saving out results from a search into a topic based bookmark area. The video suggests that particular results can also be blocked (and maybe boosted? The greyed plus on the left hand side?) – presumably this is a persistent feature, so if you, or another member of your “search team” runs the search, the blocked result doesn’t appear? (Is a list of blocked results and their corresponding search terms available anywhere I wonder?) In common with the clipping blog model used by sites such as posterous, it’s possible to post links and short blog posts into a topic area. Commenting is also supported.

To say that search was Google’s initial big idea, it’s surprising that it seems to play no significant role in Google’s offerings for education through Google Apps. Thinking back, search related topics were what got me into blogging and quick hacks; maybe it’s time to return to that area…

Written by Tony Hirst

July 26, 2011 at 12:34 pm

Posted in Library, Search, SEO

Surveying the Territory: Open Source, Open-Ed and Open Data Folk on Twitter

Over the last few weeks, I’ve been tinkering with various ways of using the Twitter API to discover Twitter lists relating to a particular topic area, whether discovered through a particular hashtag, search term, a list that already exists on a topic, or one or more people who may be associated with a particular topic area.

On my to do list is a map of the “open” community on Twitter – and the relationships between them – that will try to identify notable folk in different areas of openness (open government, open data, open licensing, open source software) and the communities around them, then aggregate all these open aficionados, plot the network connections between them, and remap the result (to see whether the distinct communities we started with fall out, as well as to discover who acts as the bridges between them, or alternatively discover whether new emergent groupings appear to crystallise out based on network connectivity).

As a step on the road to that, I had a quick peek around to find who was tweeting using the #oscon hashtag over the weekend. Through analysing people who were tweeting regularly around the topic, I identified several lists in the area: @realist/opensource, @twongkee/opensource, @lemasney/opensource, @suncao/open-linked-free, @jasebo/open-source.

Pulling down the members of these lists, and then looking for connections between them, I came up with this map of the open source community on Twitter:

A peek at FOSS community on Twitter

Using a different technique not based on lists, I generated a map of the open data community based on the interconnections between people followed by @openlylocal:

How the people @countculture follows follow each other

and the open education community based on the people that follow @opencontent:

How followers of @Opencontent follow each other

(So that’s a different way of identifying the members of each community, right? One based on lists that mention users of a particular hashtag, one based on folk a particular individual follows, and one based on the folk that follow a particular individual.)
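
Whichever way the members are chosen, the mapping step is much the same: keep only the follow relationships that run between people in the set. Here is a minimal sketch of that filtering step, assuming the friends lists have already been fetched into a dictionary. The account names and the friends data are made up for illustration, inner_edges is an invented helper name, and no Twitter API calls are shown:

```python
def inner_edges(members, friends):
    """Return (follower, followed) pairs where both accounts are in members.

    members: iterable of account names making up the community
    friends: dict mapping an account name to the set of accounts it follows
             (hypothetical, pre-fetched data)
    """
    members = set(members)
    edges = []
    for user in sorted(members):
        # Keep only the people this user follows who are also in the community
        for followed in sorted(friends.get(user, set()) & members):
            edges.append((user, followed))
    return edges


# Made-up example data: three community members and one outsider
friends = {
    "alice": {"bob", "carol", "outsider"},
    "bob": {"alice"},
    "carol": set(),
}
edges = inner_edges(["alice", "bob", "carol"], friends)
for follower, followed in edges:
    print(follower, "->", followed)
```

The resulting list of directed pairs is the sort of thing that can be written out as a two column file and handed to a graph layout tool.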

I’ve also toyed with looking at communities defined by members of lists that mention a particular individual, or people followed by a particular individual, as well as ones based on members of lists that contain folk listed on one or more trusted, curated lists in a particular topic area (got that?!;-).

Whilst the graphs based on mapping friends or followers of an individual give a good overview of that individual’s sphere of interest or influence, I think the community graphs derived from finding connections between people mentioned on “lists in the area” are a bit more robust in terms of mapping out communities in general, though I guess I’d need to do “proper” research to demonstrate that?

As mentioned at the start, the next thing on my list is a map across the aggregated “open” communities on Twitter. Of course, being digerati, many of these people will have decamped to GooglePlus. So maybe I shouldn’t bother, but instead wait for Google+ to mature a bit, an API to become available, blah, blah, blah…

Written by Tony Hirst

July 25, 2011 at 2:32 pm

Quick Command Line Reports from CSV Data Parsed Out of XML Data Files

It’s amazing how time flies, isn’t it..? Spotting today’s date I realised that there’s only a week left before the closing date of the UK Discovery Developer Competition, which is making available several UK “cultural metadata” datasets from library catalogue and activity data, EDINA OpenUrl resolver data, National Archives images and English Heritage places metadata, as well as ArchivesHub project related Linked Data and Tyne and Wear Museums Collections metadata.

I was intending to have a look at how easy it was to engage with datasets (e.g. by blogging initial explorations for each dataset along the lines of the text processing tricks I posted around the EDINA data in Postcards from a Text Processing Excursion, Playing With Large (ish) CSV Files, and Using Them as a Database from the Command Line: EDINA OpenURL Logs and Visualising OpenURL Referrals Using Gource), but I seem to have left it a bit late considering other work I need to get done this week… (i.e. marking:-(

..except for posting this old bit of code I don’t think I’ve posted before that demonstrates how to use the Python scripting language to parse an XML file, such as the Huddersfield University library MOSAIC activity data, and create a CSV/text file that we can then run simple text processing tools against.

If you download the Huddersfield or Lincoln data (e.g. from http://library.hud.ac.uk/wikis/mosaic/index.php/Project_Data) and have a look at a few lines from the XML files (e.g. using the Unix command line tool head, as in head -n 150 filename.xml to show the first 150 lines of the file), you will notice records of the form:

<useRecordCollection>
  <useRecord>
    <from>
      <institution>University of Huddersfield</institution>
      <academicYear>2008</academicYear>
      <extractedOn>
        <year>2009</year>
        <month>6</month>
        <day>4</day>
      </extractedOn>
      <source>LMS</source>
    </from>
    <resource>
      <media>book</media>
      <globalID type="ISBN">1903365430</globalID>
      <author>Elizabeth I, Queen of England, 1533-1603.</author>
      <title>Elizabeth I : the golden reign of Gloriana /</title>
      <localID>585543</localID>
      <catalogueURL>http://library.hud.ac.uk/catlink/bib/585543</catalogueURL>
      <publisher>National Archives</publisher>
      <published>2003</published>
    </resource>
    <context>
      <courseCode type="ucas">LQV0</courseCode>
      <courseName>BA(H) Humanities</courseName>
      <progression>UG2</progression>
    </context>
  </useRecord>
</useRecordCollection>

Suppose we want to extract the data showing which courses each resource was borrowed against. That is, for each use record, we want to extract a localID and the courseCode. The following script achieves that:

from lxml import etree
import csv
#Inspired by http://www.blog.pythonlibrary.org/2010/11/20/python-parsing-xml-with-lxml/
#----------------------------------------------------------------------
def parseMOSAIC_Level1_XML(xmlFile,writer):
    # iterparse streams the file; by default it only reports 'end' events,
    # i.e. each element is handed over once its closing tag has been read
    context = etree.iterparse(xmlFile)
    # record collects the fields seen inside the current useRecord
    # record={ucasCode:[],localID:''}
    record = {}
    print('starting...')
    for action, elem in context:
        if elem.tag=='useRecord' and action=='end':
            #we have parsed the end of a useRecord, so output course data
            if 'ucasCode' in record and 'localID' in record:
                for cc in record['ucasCode']:
                    writer.writerow([record['localID'],cc])
            #start afresh for the next useRecord
            record={}
        if elem.tag=='localID':
            record['localID']=elem.text
        elif elem.tag=='courseCode' and 'type' in elem.attrib and elem.attrib['type']=='ucas':
            if 'ucasCode' not in record:
                record['ucasCode']=[]
            record['ucasCode'].append(elem.text)
        elif elem.tag=='progression' and elem.text=='staff':
            record['staff']='staff'

f='mosaic.2008.level1.1265378452.0000001.xml'
#s is the smaller sample data file; pass it instead of f for a quick test
s='mosaic.2008.sampledata.xml'

#newline='' stops the csv module adding blank lines on some platforms
with open("test.csv", "w", newline="") as csvfile:
    writer = csv.writer(csvfile)
    parseMOSAIC_Level1_XML(f,writer)

Usage (if you save the code to the file mosaicXML2csv.py): python mosaicXML2csv.py
Note: this minimal example uses the file specified by f=, in the above case mosaic.2008.level1.1265378452.0000001.xml and writes the CSV out to test.csv

(You can also find the code as a gist on Github: simple Python XML2CSV converter)

Running the script gives data of the form:

185215,L500
231109,L500
180965,W400
181384,W400
180554,W400
201002,W400
...

Note that we might add an additional column for progression. Add in something like:
if elem.tag=='progression': record['progression']=elem.text
and modify the write command to something like writer.writerow([record['localID'],cc,record['progression']])
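
As a hedged sketch of that change, the snippet below runs the same logic with the progression column added against a cut-down copy of the sample record shown earlier. To keep it self-contained it swaps lxml for the standard library’s xml.etree.ElementTree, and extract_rows is a name invented here. Using record.get guards against use records that have no progression element, which is an assumption about the data rather than something checked against the full files.

```python
import io
import xml.etree.ElementTree as ET

# Cut-down copy of the sample useRecord shown above
SAMPLE = b"""<useRecordCollection>
  <useRecord>
    <resource>
      <localID>585543</localID>
    </resource>
    <context>
      <courseCode type="ucas">LQV0</courseCode>
      <progression>UG2</progression>
    </context>
  </useRecord>
</useRecordCollection>"""


def extract_rows(source):
    """Yield [localID, courseCode, progression] for each UCAS course code."""
    record = {}
    # By default iterparse reports an element once its closing tag is read
    for _, elem in ET.iterparse(source):
        if elem.tag == 'localID':
            record['localID'] = elem.text
        elif elem.tag == 'courseCode' and elem.attrib.get('type') == 'ucas':
            record.setdefault('ucasCode', []).append(elem.text)
        elif elem.tag == 'progression':
            record['progression'] = elem.text
        elif elem.tag == 'useRecord':
            if 'localID' in record:
                for cc in record.get('ucasCode', []):
                    # An empty string stands in for a missing progression value
                    yield [record['localID'], cc, record.get('progression', '')]
            record = {}


rows = list(extract_rows(io.BytesIO(SAMPLE)))
print(rows)
```

Each yielded row can be passed straight to writer.writerow to give a three column CSV file.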

We can now generate quick reports over the simplified test.csv data file.

For example, how many records did we extract:
wc -l test.csv

If we sort the records (and by so doing, group duplicated rows) [sort test.csv], we can then pull out unique rows and count the number of times they repeat [uniq -c], then sort them by the number of reoccurrences [sort -k 1 -n -r] and pull out the top 20 [head -n 20] using the combined, piped Unix commandline command:

sort test.csv | uniq -c | sort -k 1 -n -r | head -n 20

This gives an output of the form:
186 220759,L500
134 176259,L500
130 176895,L500

showing that the resource with localID 220759 was taken out against course L500 186 times.

If we just want to count the number of books taken out on a course as a whole, we can just pull out the coursecode column using the cut command, setting the delimiter to be a comma:
cut -f 2 -d ',' test.csv

Having extracted the course code column, we can sort, find repeat counts, sort again and show the courses with the most borrowings against them:

cut -f 2 -d ',' test.csv | sort | uniq -c | sort -k 1 -n -r | head -n 10

This gives a result of the form:
13476 L500
8799 M931
7499 P301

In other words, we can create very quick and dirty reports over the data using simple commandline tools once we generate a row based, simply delimited text file version of the original XML data report.
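
The same counts can be reproduced in Python when the shell tools aren’t to hand. Here is a rough equivalent of the two pipelines above, using collections.Counter from the standard library. The name top_counts is made up for this sketch, and the sample rows are invented rather than taken from the Huddersfield data:

```python
import csv
import io
from collections import Counter


def top_counts(csv_file, column=None, n=20):
    """Count repeated rows (or the values in one column), most frequent first.

    column=None mirrors: sort | uniq -c | sort -k 1 -n -r | head -n N
    column=1 mirrors:    cut -f 2 -d ',' | sort | uniq -c | sort ... | head
    """
    counts = Counter()
    for row in csv.reader(csv_file):
        key = tuple(row) if column is None else row[column]
        counts[key] += 1
    return counts.most_common(n)


# Invented sample rows in the same localID,courseCode layout as test.csv
sample = "220759,L500\n176259,L500\n220759,L500\n180965,W400\n"

pair_counts = top_counts(io.StringIO(sample))
course_counts = top_counts(io.StringIO(sample), column=1)
print(pair_counts)
print(course_counts)
```

To run it over the real file, pass in something like open('test.csv', newline='') in place of the io.StringIO objects. Rows with equal counts may come out in a different order to the shell version.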

Having got the data in a simple test.csv text file, we can also load it directly into the graph plotting tool Gephi, where the two columns (localID and courseCode) are both interpreted as nodes, with an edge going from the localID to the courseCode. (That is, we can treat the two column CSV file as defining a bipartite graph structure.)

Running a clustering statistic and a statistic that allows us to size nodes according to degree, we can generate a view over the data that shows the relative activity against courses:

Huddersfield mosaic data

Here’s another view, using a different layout:

Huddersfield JISC MOSAIC activity data

Also note that by choosing an appropriate layout algorithm, the network structure visually identifies courses that are “similar” by virtue of being connected to similar resources. The thickness of the edges is proportional to the number of times a resource was borrowed against a particular course, so we can also visually identify such items at a glance.
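
One way of making the edge weights explicit before loading the data is to collapse repeated rows into a single weighted edge. The sketch below does that and also tallies a simple degree for each course node. The Source, Target and Weight column names are assumed here to match what Gephi’s spreadsheet importer expects, so check them against your version; weighted_edges is an invented helper name and the rows are made-up sample data.

```python
import csv
import io
from collections import Counter


def weighted_edges(rows):
    """Collapse repeated (localID, courseCode) rows into weighted edges."""
    weights = Counter((local_id, course) for local_id, course in rows)
    # Degree of a course node = number of distinct resources borrowed against it
    degree = Counter(course for _, course in weights)
    return weights, degree


# Invented sample rows in the same layout as test.csv
rows = [("220759", "L500"), ("220759", "L500"), ("176259", "L500"), ("180965", "W400")]
weights, degree = weighted_edges(rows)

out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
writer.writerow(["Source", "Target", "Weight"])  # assumed Gephi column names
for (local_id, course), weight in sorted(weights.items()):
    writer.writerow([local_id, course, weight])
print(out.getvalue())
```

Here the weight plays the role of the edge thickness described above, and the degree count gives a rough guide to how nodes would be sized.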

Written by Tony Hirst

July 25, 2011 at 11:11 am
