Archive for October 2010
OUseless.info
Legacy Systems Management – Time For a Cobol Course?!;-)
Listening to Jeff Papows discussing “Glitch: The Impact of Faulty Software” on Technometria via my IT Conversations podcast feed this morning (book available here), I started wondering whether there was an opportunity for a Masters level course on managing legacy systems, which might include a unit or two on programming in Cobol?
Whilst not the most fashionable of languages, Cobol still keeps many financial and enterprise systems running, and the old hand maintainers are now starting to retire, if they haven’t already.
Taking the thought a little further, it also occurred to me that many legacy systems may well have been designed according to old-fashioned and no longer popular architectural styles, styles that are not necessarily taught to today’s students… which means if you’re faced with managing a legacy system, you may not have the right model of it in mind. So what’s the way round this? Maybe picking up old course materials, contemporary to the time that the systems were put in place, and using them to provide an almost archaeological overview of the styles and approaches used to engineer the legacy systems that still need managing.
Related to managing legacy systems is the idea of digital preservation strategies, a forward looking complement to legacy systems which seeks to identify processes, procedures and representations that ensure today’s digital creations don’t become an unreadable legacy tomorrow.
I knew the OU already ran a course on Learning from Information System failures, but it seems there’s also one on Information systems legacy and evolution, though I’m not sure it teaches any Cobol?!;-) So the OU’s already on the case… I wonder how popular the course is…?
In fact, I’m not sure if the OU ever did a Cobol course…? Hmmm… is there an “old OU courses catalogue” anywhere? I know the OU Library acts as archive to OU courses, so maybe I can track one down through the Library catalogue…? The answer is “yes, (sort of… no..)”:
…though I don’t think it’s available in a digital/scanned form…
Hmm… ;-)
PS I wonder… for quirky advanced users, would a “course materials archive” filter be a useful addition to the library catalogue search…? What would happen if we ran search terms coming in to the OU website – and on the OU website – over archived course descriptions, to see if any old courses are actually starting to pick up long tail interest…?
Orange Visual Visualisation Tool
A few days ago, I came across a drag’n’drop, wire it together visualisation and data analysis tool called Orange.
Here’s a quick run through of some of the basics (at least, a run through of the first few things I tried to do with the tool…)
First off, we need some data. Orange likes TSV (tab separated values) rather than CSV, so I grabbed some TSV from one of the Guardian Datastore spreadsheets on Google Docs (use “Save as Text” to get the tab separated value format…)
Orange is a canvas based visual programming environment, in which functional blocks are added to the canvas and certain parameters set within the block. Here’s how we get some data into Orange from a TSV file:
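If the data you want is only available as CSV, something along the following lines might do the conversion. This is just a minimal sketch using Python's standard csv module; the helper name csv_to_tsv and the sample rows are made up for illustration, and it makes no attempt to add any Orange-specific header annotations.

```python
import csv
import io

def csv_to_tsv(csv_text):
    """Convert CSV text to tab separated text (hypothetical helper)."""
    reader = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    # Write the same rows back out, using a tab as the field delimiter
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for row in reader:
        writer.writerow(row)
    return out.getvalue()

# Made-up sample data, including a quoted field that contains a comma
sample = 'name,comments\n"Smith, J",12\nJones,3\n'
tsv = csv_to_tsv(sample)
print(tsv)
```

Going via the csv module rather than a simple find-and-replace on commas means quoted fields that themselves contain commas survive the trip.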
The File icon is giving me a warning (no dependent variable) but I’m not sure why…? I’m sure Orange has managed to detect labels and quantities correctly from other files I’ve tried?
Anyway… we can inspect the data by looking at it in a data table widget – just wire one in:
The table is sortable by column, and the Report button can be used to save a version of the table. Looking at the data table, we see it has identified columns with missing entries. We can clean these from our data set using the Preprocessing widget:
If we now wire the output of the Preprocessing widget into the Scatterplot widget, we can generate a variety of scatterplots:
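For comparison, here is roughly what that sort of cleaning step might look like outside Orange. This is only a sketch of the general idea, not a description of what the widget does internally; the function name, the sample rows and the choice of an empty cell or "?" as the missing-value markers are all my assumptions.

```python
import csv
import io

def drop_incomplete_rows(tsv_text, missing=("", "?")):
    """Return (header, rows), keeping only rows with no missing cells.

    The markers treated as missing are an assumption for this sketch.
    """
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader)
    kept = [
        row for row in reader
        if len(row) == len(header)
        and all(cell.strip() not in missing for cell in row)
    ]
    return header, kept

# Made-up sample: rows B and C each have a missing entry
sample = "area\tspend\tstaff\nA\t10\t2\nB\t\t3\nC\t7\t?\nD\t4\t1\n"
header, kept = drop_incomplete_rows(sample)
print(header, kept)
```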
If you want to save a copy of the chart, it’s easy enough to do so. (I can’t get colour palettes to work on my Mac, so I’m stuck with greyscale displays. Also, the blob sizing doesn’t seem very responsive…)
The Report tool allows us to create a report from various bits of the dataflow, including adding information from several widgets to either separate report pages or the same report page.
Saving a Report saves all the report pages to a navigable set of HTML pages that resemble the Orange Report viewer.
Here are a couple of other things we can do with the data, this time using a data set that isn’t throwing the “dependent variable missing” error, in particular the distribution of comments in a small Friendfeed network…
So for example, here’s how the number of comments made by members of the network is distributed:
Alternatively, we may look at the distribution in a more “statistical” way:
(Remember, we can generate these reports interactively, and then add them to a growing report.)
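The two views of the distribution above can also be approximated in a few lines of script. This is a rough sketch using only the Python standard library; the comment counts are invented for illustration (they are not the Friendfeed data), and the function name is hypothetical.

```python
from collections import Counter
import statistics

def comment_distribution(counts):
    """Return a histogram of comment counts plus a few summary statistics."""
    histogram = Counter(counts)  # how many users made 0, 1, 2... comments
    summary = {
        "mean": statistics.mean(counts),
        "median": statistics.median(counts),
        "max": max(counts),
    }
    return histogram, summary

# Invented data: number of comments made by each of eight users
comments_per_user = [0, 0, 1, 1, 1, 2, 5, 12]
histogram, summary = comment_distribution(comments_per_user)
print(sorted(histogram.items()))
print(summary)
```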
The survey plot gives us a macroscopic bird’s eye view over the whole of the data set:
Okay, that’s enough for starters – hopefully you get the idea: wire stuff together and generate visual reports… So why not go and download Orange now?!;-)
There are a whole range of clustering tools, too, which look like they could be interesting…
And I think the platform is extensible, which means there’s a way of adding your own widgets (written in Python, maybe..?)
It’s All About Flow…
One of the compelling features of Yahoo Pipes for me is the way the user interface encourages you to think of programming in terms of pipelines and feeds, in which a bundle of stuff (RSS feed, CSV data, or whatever) is processed in a sequence of steps (the pipeline), with each step being applied to each item in the feed.
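The same way of thinking maps quite naturally onto Python generators, where each step pulls items from the previous one. What follows is only an illustration of the idea; it is not how Yahoo Pipes works internally, nor the code that pipe2py generates, and the step names and feed items are made up.

```python
def fetch_items(feed):
    """Source step: yield each item from an in-memory 'feed'."""
    for item in feed:
        yield item

def filter_items(items, keyword):
    """Filter step: pass on items whose title mentions the keyword."""
    for item in items:
        if keyword.lower() in item["title"].lower():
            yield item

def rename_field(items, old, new):
    """Rename step: copy each item, renaming one of its fields."""
    for item in items:
        item = dict(item)
        item[new] = item.pop(old)
        yield item

# Made-up feed items standing in for an RSS feed
feed = [
    {"title": "Orange visualisation tool", "link": "http://example.com/1"},
    {"title": "Cobol courses", "link": "http://example.com/2"},
]

# Wire the steps together: each step is applied to each item in turn
pipeline = rename_field(filter_items(fetch_items(feed), "orange"), "link", "url")
results = list(pipeline)
print(results)
```

Nothing is actually processed until the final list() call pulls items through the chain, which is part of what makes the pipeline metaphor feel so natural.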
A few days ago I blogged about pipe2py, a toolkit from Greg Gaughan that lets you “compile” a simple Yahoo pipe into a Python code equivalent programme (Yahoo Pipes Code Generator (Python)). Given that, in general, I don’t believe the “build it and they will come” mantra, I spent half an hour or so this morning looking round the web for people who had posted queries about how to generate code equivalents of Yahoo Pipes, so that I could point them to pipe2py.
In doing so, I came across a couple of other visual pipeline environments that are maybe worth looking at in a little more detail.
PyF is a “[flow based] open source Python programming framework and platform dedicated to large data processing, mining, transforming, reporting and more.”
On the other hand, Orange claims to offer “[o]pen source data visualization and analysis for novice and experts. Data mining through visual programming or Python scripting. Components for machine learning. Extensions for bioinformatics and text mining. Packed with features for data analytics.”
Here’s one of their promo shots:
I haven’t had a chance to play with either of these environments – and probably won’t for a little time yet – so whilst I feel like I’m cheating by posting about them in such a cursory way without having even a simple demo to show, they’re maybe of interest to anyone who stumbles across this blog by way of pipe2py… [Update: see my Orange Visualisation tool review.]
PS as well as PyF, see also: Pypes [via @dartdog]
Graph Structure of an Open Science Notebook – “Linked Science” FTW…
Early days on this, but what, if anything, can we learn from looking at the link structure of an open science lab-book, based on the use of hyperlinks between pages in the lab-book?
A couple of days ago, I started informally bouncing ideas around with @cameronneylon about quick wins/low hanging fruit visualisations around his open science notebook (a full description of our conversation – and indeed the whole history of this ad hoc “mini-project” – can be found on Cameron’s blog: A little bit of federated Open Notebook Science). So here are a couple of Gephi takes on the lab-book (original data/scripts can be found from the github links in Cameron’s post.)
The lab notebook identifies different types of post, which can be used to colour the graph:
The network graph also shows the presence of highly linked “procedure” type nodes relating to a particular experimental procedure. If we apply the ego filter to the graph we can get a close look at which posts are connected to a procedure:
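The idea behind an ego filter is simple enough to sketch by hand: keep the chosen node, everything within a given number of links of it, and the links between the nodes that remain. The following is my own minimal take on that idea, treating links as undirected; it is not Gephi's implementation, and the node names are invented.

```python
def ego_network(edges, centre, depth=1):
    """Return the nodes within `depth` links of `centre`, and the edges among them."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    seen = {centre}
    frontier = {centre}
    for _ in range(depth):
        # Step out one link from the current frontier, ignoring nodes already seen
        frontier = {n for f in frontier for n in neighbours.get(f, ())} - seen
        seen |= frontier
    kept_edges = [(a, b) for a, b in edges if a in seen and b in seen]
    return seen, kept_edges

# Invented link structure: two posts link to a procedure page
edges = [
    ("post1", "procedureA"),
    ("post2", "procedureA"),
    ("post2", "post3"),
    ("post3", "post4"),
]
nodes, ego_edges = ego_network(edges, "procedureA", depth=1)
print(sorted(nodes), ego_edges)
```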
If we run the modularity statistic, we can automatically partition the posts into groupings of posts that are linked together – here they are grouped by modularity class:
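To get a feel for what a modularity score measures, here is a small sketch that computes the standard modularity value Q for a given partition of an undirected, unweighted graph: for each group, the fraction of links that fall inside it, minus the fraction you would expect if links were placed at random. It only scores a partition that you hand it; it does not search for a good partition, which is the hard part that Gephi does for us. The graph and the groupings are invented.

```python
def modularity(edges, community_of):
    """Modularity Q of a partition of an undirected, unweighted graph."""
    m = len(edges)
    internal = {}    # number of edges with both ends in the same community
    degree_sum = {}  # sum of node degrees in each community
    for a, b in edges:
        ca, cb = community_of[a], community_of[b]
        degree_sum[ca] = degree_sum.get(ca, 0) + 1
        degree_sum[cb] = degree_sum.get(cb, 0) + 1
        if ca == cb:
            internal[ca] = internal.get(ca, 0) + 1
    return sum(
        internal.get(c, 0) / m - (d / (2 * m)) ** 2
        for c, d in degree_sum.items()
    )

# Invented example: two triangles of posts joined by a single link
edges = [
    ("a", "b"), ("b", "c"), ("a", "c"),
    ("d", "e"), ("e", "f"), ("d", "f"),
    ("c", "d"),
]
two_groups = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_group = {node: 0 for node in "abcdef"}
q_two = modularity(edges, two_groups)
q_one = modularity(edges, one_group)
print(q_two, q_one)
```

Splitting the graph into its two triangles scores higher than lumping everything into one group, which is the sort of difference a partitioning algorithm tries to exploit.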
We can expand different class nodes to see the posts associated with them:
Here’s one close up:
If we apply the ego network, we see the modularity cluster does seem to have acted in a meaningful way:
Notice though that we lose sight of the internal link structure within that modularity class that was evident in the previous image.
Was that connecting node important in some way?
With his intimate knowledge of the experiments recorded in the lab book, Cameron also observed that Gephi has (largely) successfully clustered the correct posts together [according to protein classification] and thus separated the purifications from each other based only on connectivity. This suggests that even if posts aren’t explicitly tagged by a particular experiment, the link analysis may be useful in finding posts that are related to a particular experiment; in cases where a post is included in one group and links out to another, it may indicate some sort of relationship between the separate clusters, such as a shared reagent.
So why might the visualisation of the whole notebook be a useful thing to do? My take on it is that the visualisation acts as a macroscope.
As Jonathan Schull put it in the Macroscope Manifesto:
Most natural patterns are not easily perceived, for they do not happen to produce lasting stimuli to which our nervous systems are attuned. But everything we know about biology, epidemiology, social networks, computational algorithms and data structures, tells us that branching patterns are “out there”, waiting to be mapped, illuminated, seen-anew. In the last few decades new data sources, new data-analytic tools, and new tracking techniques have become available to scientists and school children. It is now possible to envision a “macroscope” that present these invisible but ubiquitous patterns to human perceptual systems so that they would engage our innate ability to perceive millions of leaves as scores of trees…and a forest
For me, Gephi can act as a macroscope in the way it reveals structure from across the whole of Cameron’s open science lab book in a single image, and allows us to interrogate the lab book from a variety of perspectives in an interactive way.
The approach is amenable to displaying structures aggregated from across multiple blogs, as long as they link to each other. It may also serve to identify related processes, as for example when modularity clusters are connected by one or more links.
And what might this suggest as a baby next step for Open Notebook Science? Well I can’t help thinking that maybe open Lab Notebooks should also be publishing their link graph, with URI referenced external links as well as internal links included… then we can create some big graphs across notebooks and start to see what might fall out…
Linked Science FTW ;-)
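As a rough idea of how a notebook might generate its own link graph for publication, here is a minimal sketch that pulls the links out of each page and records them as an edge list, flagging each link as internal or external. It uses only the Python standard library; the page URL, the HTML and the hostnames are invented, and a real notebook would presumably work from its own database or feed rather than scraped HTML.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a page."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

def link_graph(pages, notebook_host):
    """Return (source, target, kind) edges for a dict of {page_url: html}."""
    edges = []
    for page_url, html in pages.items():
        parser = LinkCollector()
        parser.feed(html)
        for href in parser.hrefs:
            target = urljoin(page_url, href)  # resolve relative links to full URIs
            same_host = urlparse(target).netloc == notebook_host
            edges.append((page_url, target, "internal" if same_host else "external"))
    return edges

# Invented page from a hypothetical notebook
pages = {
    "http://notebook.example.org/post/1": (
        '<p>See <a href="/post/2">the procedure</a> and '
        '<a href="http://other.example.com/reagent">a reagent</a>.</p>'
    ),
}
edges = link_graph(pages, "notebook.example.org")
for edge in edges:
    print("\t".join(edge))
```

The tab separated edge list that gets printed is the sort of thing that could be merged across notebooks and loaded into a tool like Gephi.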
PS One thing that may or may not be missing from the above – links to a video demonstrating each procedure, if appropriate, on a visual protocols site. Just by the by, here’s a Google custom search engine I created some time ago that implements a Science Experimental Protocols Videos meta-search engine. (It doesn’t turn up anything for /Purification of sortase/ though;-()
UK Open Data Guidance Resources
This is a live post where I will try to collect together advice relating to the release and use of open public data in the UK, as much for my own reference as anything… (I guess this really should be a wiki page somewhere…?)
- Including data in the data.gov.uk index
- Very basic, standard file format for data
- Using and creating data with Ordnance Survey products
The O/S advice includes examples of what is and is not an acceptable ‘free’ reuse of data:
- UK Government Licensing Framework
- Open Data and FOI Requests – Maude: Data must be published in a machine readable way
- Local spending data guidance (DRAFT)
- Local government Transparency programme:
- Why publish data as data, rather than PDF? An example taken from the world of calendars: Jon Udell: Developing Intuitions About Data






















