Following up on 404 “Page Not Found” Error pages and Autodiscoverable Feeds for UK Government Departments and Back from Behind Enemy Lines, Without Being Autodiscovered(?!), I felt morally obliged to put together a page showing how well the HEI Libraries are doing at getting autodiscoverable RSS feeds declared on their website homepages.
So here it is: Autodiscoverable RSS feeds from UK HEI Library homepages.
So far, about 10% of Library homepages have autodiscoverable feeds on their home pages, showing how libraries are leading their institutional web policies into the world of syndicated content…. err… maybe… (not)…
There are several reasons why this percentage may be so low:
- I’m not using the correct Library homepage URLs. (Do you know how hard it is to find such a list?)
- The Libraries don’t control their web page <head> – or the CMS just can’t cope with it – so they can’t add the autodiscovery code. (Some of the library pages do have links to RSS feeds, just not autodiscoverable ones).
- The computing services folks who probably do control the <head> are so ashamed about their own inability to get RSS feeds up and running (Autodiscoverable RSS feeds from UK HEI homepages) they don’t want anyone else to have feeds either…
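For anyone wanting to check a homepage for themselves, here is a minimal sketch of what "autodiscoverable" means in practice: the page declares its feed in a `<link rel="alternate">` element with a feed MIME type, as opposed to just linking to the feed from the page body. The sample HTML, the feed title and the helper names are made up for illustration, and the sketch doesn't check that the link actually sits inside the `<head>`.

```python
from html.parser import HTMLParser

# MIME types conventionally used when declaring RSS/Atom feeds for autodiscovery
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkFinder(HTMLParser):
    """Collect (title, href) pairs from <link rel="alternate"> feed declarations."""

    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        a = {k.lower(): (v or "") for k, v in attrs}
        rels = a.get("rel", "").lower().split()
        if "alternate" in rels and a.get("type", "").lower() in FEED_TYPES:
            self.feeds.append((a.get("title", ""), a.get("href", "")))


def find_feeds(html):
    """Return the autodiscoverable feeds declared in an HTML document."""
    finder = FeedLinkFinder()
    finder.feed(html)
    return finder.feeds


# Hypothetical page: one autodiscoverable feed, plus an ordinary link to the same feed
sample = """<html><head>
<title>Library</title>
<link rel="alternate" type="application/rss+xml" title="Library News" href="/news/rss.xml">
<link rel="stylesheet" href="/style.css">
</head><body><a href="/news/rss.xml">RSS</a></body></html>"""

print(find_feeds(sample))
```

A page that only carries the ordinary `<a href>` link in the body would return an empty list here, which is the situation described in the second bullet above.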
See also: Nudge: Improving Decisions About RSS Usage – a post by Brian Kelly from July 2008 encouraging HEIs to get into the RSS game. Does nudging work? Has much progress been made over the last year? Or have the computer services folks got it right and UK HEI RSS feeds are just a total waste of time? Or at least, autodiscoverable ones are…?
When the Guardian launched their OpenPlatform DataStore, a collection of public data, curated by Guardian folk, hosted on Google Spreadsheets, it raised the question as to whether this initiative would influence the attitude of the Office for National Statistics, and in particular the way they publish their results (e.g. Guardian Data Store: threat to ONS or its saviour?).
In the three sexy skills of data geeks, Michael Driscoll reinterprets Google’s Chief Economist’s prediction that “the sexy job in the next ten years will be statisticians… The ability to take data—to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it” with his belief that “the folks to whom Hal Varian i.e. [Google’s Chief Economist] is referring are not statisticians in the narrow sense, but rather people who possess skills in three key, yet independent areas: statistics [studying], data munging [suffering] and data visualization [storytelling]”
I’ve already suggested that what I’d quite like to see is plug’n’play public data that’s easy for people to play with in a variety of ways, and publishing it via Google Spreadsheets certainly lowers quite a few barriers to entry from a technical perspective which can make life easier for statisticians and the visualisers, and reduce the need for the data mungers, the poor folks who go through “the painful process of cleaning, parsing, and proofing one’s data before it’s suitable for analysis. Real world data is messy” as well as providing access to data where it is difficult to access: “related to munging but certainly far less painful is the ability to retrieve, slice, and dice well-structured data from persistent data stores”.
But if you don’t take care of the data you’re publishing, then even though there are friendly APIs to the data it doesn’t necessarily follow that the data will be useful.
As Steph Gray says in Cui bono? The problem with opening up data:
Here’s my thought: open data needs a new breed of data gardeners – not necessarily civil servants, but people who know data, what it means and how to use it, and have a role like the editors of Wikipedia or the mods of a busy forum in keeping it clean and useful for the rest of us. … Support them with some data groundsmen with heavy-lifting tools and technical skills to organise, format, publish and protect large datasets.
So with all that in mind, is the Guardian DataStore adding value to the data in the data store in an accessibility sense by reducing the need for data mungers to have to process the data so that it can be used in a plug’n’play way by the statisticians and the data visualisers, whether they’re professionals, amateurs or good old Jo Public?
As a way in to this question, let’s look at the various HE datasets. The Guardian has published several of these:
Before we look at the data, though, let’s look at the URIs to see if the architecture of the site makes it easy to discover potentially related datasets. (Finding data is another of the skills related to the black arts of the data mungers, I think?!;-)
The URI for the metapage that hosts a link to the RAE/research data blog post is:
and links to the teaching related posts is:
Going back up the common path to http://www.guardian.co.uk/news/datablog+education/ we get…. a 404 :-(
Hmmm… So how come the datablog+education page doesn’t link down to the HE collection pages, as well as the schools data blog pages (e.g. these are both valid:
– http://www.guardian.co.uk/news/datablog+education/school-tables and
and might naturally be expected to be linked to from:
Looking back to the HE teaching related datasets, we see they are both listed on the http://www.guardian.co.uk/news/datablog+education/higher-education page. So might we then expect them to be ‘compatible’ datasets in some sense?
That is, do the HE data sets share common values, for instance in the way the HEIs are named?
If we generate a couple of queries on to the university satisfaction tables and the dropout tables (maybe trying to look for correlations between drop out rate and student satisfaction) by pulling the results from different queries on those tables in to a data grid within a Google spreadsheet (cf. the approach taken in Using Google Spreadsheets and Viz API Queries to Roll Your Own Data Rich Version of Google Squared on Steroids (Almost…)), what do we get?
Here’s a search for “Leeds”, for example:
One table contains items:
– Leeds Trinity & All Saints
– Leeds Met
and the other contains:
– Leeds College of Music
– The University of Leeds
– Leeds Metropolitan University
– Leeds Trinity and All Saints
So already, even with quite a young datastore, we have an issue with data quality. In Data Driven: Profiting from Your Most Important Business Asset, Thomas Redman identifies “seven common data quality issues” which include the related problems of too much data (i.e. multiple copies of the same data in different places – that is, redundancy), data inconsistency across sources (not a problem the datastore suffers from – yet?) and poor data definition (p41 – preview available on Google books?).
This latter issue, poor data definition, is evident in the naming of the HEIs above: I can’t simply import the overall tables and dropout tables into DabbleDB and let it magically create a combined table based on common (i.e. canonical) HEI names (using the approach described in Mash/Combining Data from Three Separate Sources Using Dabble DB, for example) because the HEIs don’t have common names.
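To make the problem concrete, here is a rough sketch of the sort of munging needed before the two "Leeds" result sets can be joined on name. The institution names are the ones from the search results above; the `canonical()` helper, the normalisation rules and the hand-maintained `ALIASES` table are my own illustrative assumptions, not anything the datastore provides.

```python
import re

# Hand-maintained lookup for abbreviations that no simple rule can expand (hypothetical)
ALIASES = {
    "leeds met": "leeds metropolitan university",
}


def canonical(name):
    """Reduce an institution name to a crude canonical key."""
    key = name.strip().lower()
    key = key.replace("&", " and ")       # "&" and "and" should match
    key = re.sub(r"^the\s+", "", key)      # drop a leading "The"
    key = re.sub(r"\s+", " ", key).strip()  # collapse runs of whitespace
    return ALIASES.get(key, key)


# Names as they appear in the two sets of search results
table_a = ["Leeds Trinity & All Saints", "Leeds Met"]
table_b = [
    "Leeds College of Music",
    "The University of Leeds",
    "Leeds Metropolitan University",
    "Leeds Trinity and All Saints",
]

keys_a = {canonical(n): n for n in table_a}
keys_b = {canonical(n): n for n in table_b}

matched = sorted(set(keys_a) & set(keys_b))    # joinable once names are normalised
unmatched = sorted(set(keys_b) - set(keys_a))  # present in one table only

print(matched)
print(unmatched)
```

Without the normalisation step a straight join on the raw names matches nothing at all, which is exactly why a shared naming scheme (or a shared identifier) at source would be so much kinder.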
So what does Redman have to say about this (p.55)?
– Find and fix errors
– Prevent them at their source [in this case, the error is inconsistency and could have been prevented by using a common HEI naming scheme, OR providing another unique identifier that could act as a key across multiple data tables; but name is easier – because name is what people are likely to search by…].
(See also Redman’s “Hierarchy of Data and Information Needs” (p. 58), which identifies the need for consistency across sources.)
Note that we have a problem though – the datastore curators can’t change the names in the current spreadsheets, because people may already be using them and keying on the current name format. We shouldn’t create another spreadsheet containing the same data because that causes duplication/redundancy? So what would be the best approach? Answers on the back of a postcard to, err, the Guardian datastore, I guess?!;-)
So is it the Guardian’s job to be curating this data, or tending it as one of Steph’s data gardeners/groundsmen might? If they want it to be a serious resource, then I would say so. But if it’s just a toy? Well, who cares…?
PS Just in passing, what other value might the DataStore add to spreadsheets to make them more amenable to “mashups”? For data like the university data, providing geo-data might be interesting (even at the crude level of just providing a single geographical co-ordinate for the central location of the institution). If I could easily get geo-data for the HEIs, and combine it with the satisfaction tables or dropout rates, it would be trivial to generate map based views of the data.
PPS one other gripe I have with the Guardian datablog, where many of the datastore data sets are announced, is that the links are misleading:
Now call me naive, but I’d expect those DATA links to point to spreadsheets, as indeed the first two do, but the third points to another blog post and so I’ve lost trust in being able to use those DATA links (e.g. in a screenscraper) as a direct pointer to a spreadsheet.
What about monetization? Well, first of all, there are already many private entities who make a nice living processing public data. Why not the newsmedia? Take the education market: Why not having editorial products, designed by professional journalists, capitalizing on powerful label such as Le Monde, VG or The Guardian to address this audience with well designed products, in print or online? Think about students, how they could use this new knowledge with their laptops or iPhones. This market is up for grabs. And medias are well positioned to enter it. (Or someone else will.)
My gut feeling is that with the news media trying to redefine itself for a future where revenues aren’t guaranteed by ad-sales in daily or weekend papers, and the higher education market (in the UK at least) potentially looking set for a fall in the short term as graduate openings disappear and institutions look for ways to increase student fees, there is an opportunity for a new sort of service provider, perhaps not dissimilar to a professional institution, to take up the slack and provide quality comment and analysis to the media (think: paidContent for the quality papers’ analysis sections, powered by academe redefined (i.e. Academe 2.0;-)); and FE/HE level lifelong training “learning” to whosoever needs it.
When the OU was founded 40 years ago, it opened up access to Higher Education in the UK for those who couldn’t otherwise access it, and opened the doorway for many to membership of a professional institution. One of the driving reasons for the institutions was to keep their members up-to-date with innovations in their profession. However, those institutions have suffered terribly in recent years (declining numbers of members – you can probably guess the rest…), so maybe it’s time for a rethink…
Indeed, maybe it’s time for something that combines elements of Higher Education, professional institutions and “products” like Guardian Professional (such as their research service, and maybe even the events* part?) wrapped up with some form of verification service that blends elements of professional, academic and maybe even vendor certification?
And maybe data analysis and commentary is one way in to that?
* I keep wondering why it is that Guardian, TED and O’Reilly conferences (as well as a wide variety of unconferences) are of far more interest to me than academic ones? It can’t just be that they tend to publish their audio streams online?;-)
PS see also Guerrilla Education: Teaching and Learning at the Speed of News and its associated comments.
My tweetwatching has been a little sporadic of late, so I haven’t really been keeping up with the data sets that keep being posted to the Guardian Datablog, but today I had a quick skim through some of the recent uploads and had my eye caught by a post about funding of UK government quangos (Every quango in Britain [spreadsheet]).
What I’ll show in this post is how to construct a query on one of the quangos data sheets that can then be visualised as a change treemap, showing at a single glance how funding over all the quangos (arranged by sponsor department) has changed over the period 2006-2007.
The quangos spreadsheet is split into several different areas and at face value is quite hard to make sense of (what’s the difference in status (and provenance of the data) between the NHS and Health quangos, for example?).
But I take nothing if not a pragmatic view about all this data stuff, and whilst there may be, err, issues with doing “proper” data journalism with this spreadsheet, I think we can still get value from just seeing what sorts of technology enhanced questions we might ask of this data, such as it is (as well as identifying any problems with the data as it is presented), and as a result maybe identifying various issues with how to present and engage with this data were it to be republished again.
As ever, my doodles don’t properly acknowledge the provenance or source of the data, nor do I try to make any sense out of the data or look for any specific ‘meaning’ within it – I’m still at the stage of sharpening my technology pencil and seeing what sorts of marks it can make – but this is something I know I don’t do, and will be something I start to look at somewhen, honest!
So let’s make a start. To provide a bit of context, the questions I set out with when doodling through this post were:
1) is the data clean enough to run summarising queries on the data (that is, queries that summed totals for different departments)?
2) is the data clean enough to not break Many Eyes Wikified if I pass it to that visualisation tool via a CSV feed?
And a matter arising:
3) how do I write a query that specifies column headings (the headings in the spreadsheet leave something to be desired, at times….)?
The spreadsheet I chose to play with was the Westminster sheet, which you can find from here: UK Quangos [Guardian Datastore] (you’ll need to select the appropriate tab).
Just by looking at the data in the spreadsheet we notice a couple of things, things that suggest certain queries we might run on the data. (I probably need to unpack that phrase at some point (“run certain queries”) but the essence of it is this: if we treat the spreadsheet as a database, what sort of reports can we generate from it? Typically, in a database environment, reports are generated by running queries using some sort of database query language, which in the case of Google spreadsheets is the SQL-like Google Visualisation API Query Language.)
So, the first thing I noticed was the two columns on the left – Government departments and presumably the quangos sponsored by those departments. And what these suggested to me was that I should be able to generate reports that summarise expenditure over all quangos in each department. (Whether or not this is interesting, I don’t know; but it’s something that we should be able to do easily enough, and it may spark off other questions in our minds).
The second thing I noticed was that lots of the data straddled two years (2006 and 2007).
And finally, gross expenditure seemed like a meaningful quantity and maybe least open to contention, so I decided to pick on that as the quantity to sharpen my pencil with:
To start playing with the spreadsheet, I bookmarked it so that I could play with it in my Data Store explorer (note that I needed to specify the sheet number, where the first sheet is sheet 0, the second is sheet 1, and so on; so the Westminster sheet (the third sheet from the left in the spreadsheet) is sheet 2):
When we preview the column headings, (which the API assumes are in the top row, I think?), we get – not a lot…
If we scroll down in the data store explorer, we get at least the spreadsheet column labels:
Anyway, let’s try to run a query that summarises the overall gross expenditure for 2006 (column R) and 2007 (column S) for each department (column C):
The query is encoded as:
select C,sum(R),sum(S) group by C order by sum(R) desc
So we select three columns: we group the rows according to department (column C), display the summed values over those grouped rows for columns R and S, and present the results in descending (desc) order of sum(R):
If we click on the HTML preview link, we can view the table in its own page:
(A link to the CSV version is also generated.)
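If the query language is unfamiliar, the group/sum/order steps can be sketched in a few lines of Python. This is only meant to illustrate what the query asks for, not how the spreadsheet API evaluates it, and the rows (department names and amounts) are made-up placeholders rather than values from the quangos sheet.

```python
from collections import defaultdict

# Made-up rows: (department [column C], 2006 gross [column R], 2007 gross [column S])
rows = [
    ("Dept A", 100, 120),
    ("Dept B", 300, 250),
    ("Dept A", 50, 40),
]


def summarise(rows):
    """Mimic: select C,sum(R),sum(S) group by C order by sum(R) desc"""
    totals = defaultdict(lambda: [0, 0])
    for dept, r, s in rows:       # group by C
        totals[dept][0] += r      # sum(R)
        totals[dept][1] += s      # sum(S)
    report = [(dept, t[0], t[1]) for dept, t in totals.items()]
    # order by sum(R) desc
    return sorted(report, key=lambda row: row[1], reverse=True)


for line in summarise(rows):
    print(line)
```

Each department ends up as a single row, with the biggest 2006 total at the top.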
The data explorer doesn’t support forms for all the queries we can write yet, so the next step requires hacking the HTML table URI directly to add labels to the columns:
select C,sum(R),sum(S) group by C order by sum(R) desc label C 'Dept', sum(R) '2006 Gross Expenditure', sum(S) '2007 Expenditure'
If you’re hacking the URI in a recent browser address/location bar, you don’t need to encode things like spaces as %20, or single quotes as %27, because the browser will take care of it for you:
If you then copy this URI and paste it back into the location bar, the encoded version will be generated for you, e.g. so that you can use it as a link in a web page:
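The same encoding can be done outside the browser, which is handy if you want to generate these links from a script. A minimal sketch using Python's standard library follows; note that `quote()` with `safe=""` is more aggressive than most browsers (it also encodes the commas and brackets), which should be harmless but is an assumption worth testing against the real service.

```python
from urllib.parse import quote, unquote

# The labelled query from above, written with plain (straight) single quotes
QUERY = (
    "select C,sum(R),sum(S) group by C order by sum(R) desc "
    "label C 'Dept', sum(R) '2006 Gross Expenditure', sum(S) '2007 Expenditure'"
)


def encode_query(query):
    """Percent-encode a query so it can be used as a URI parameter value."""
    return quote(query, safe="")


encoded = encode_query(QUERY)
print(encoded)

# Decoding gets us back to the readable query
print(unquote(encoded) == QUERY)
```

Spaces come out as %20 and single quotes as %27, just as the browser would produce.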
So we have our basic summary report, and can now use the CSV output of the spreadsheet so that it can be transported elsewhere. There are two things we need to do now.
The first is to just change the output format of the data from an HTML table to CSV. Take this part of the URI:
and change it to this:
If you preview the CSV, you’ll notice there’s a problem with it though:
There are rogue commas everywhere, appearing within the ‘numerical’ data, and this file is supposed to use commas to separate out different elements. To get proper numbers out, we need to set their format, which means adding something to the end of the URI:
format sum(R) '#',sum(S) '#'
(Note that you do need to encode the #s by hand, as %23)
That is, present sum(R) and sum(S) in a proper numerical format:
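To see why the rogue commas matter to anything downstream, here is a small sketch of what a CSV parser makes of the two versions. The figures are invented and I'm assuming the thousands separators arrive unquoted, which is how the preview looked; the last couple of lines show the format clause being encoded, including the # to %23 step.

```python
import csv
import io
from urllib.parse import quote


def parse_csv(text):
    """Parse CSV text into a list of rows (each row a list of strings)."""
    return list(csv.reader(io.StringIO(text)))


# Invented figures: the same row with and without thousands separators
with_commas = "Dept,2006,2007\nHome,1,234,5,678\n"
plain = "Dept,2006,2007\nHome,1234,5678\n"

print(parse_csv(with_commas)[1])  # five fields: the numbers have been split up
print(parse_csv(plain)[1])        # three fields, as intended

# The format clause that asks for plain numbers, encoded for use in the URI
FORMAT_CLAUSE = "format sum(R) '#',sum(S) '#'"
encoded_clause = quote(FORMAT_CLAUSE, safe="")
print(encoded_clause)
```

Left unencoded, a # in a URI marks the start of the fragment, so everything after it would never reach the server; hence the %23.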
So there we have summary tables showing the expenditure for each government department. Many Eyes Wikified isn’t letting me import that data directly via CSV at the moment, but it’s easy enough to download the CSV and copy the data into a Many Eyes Wikified data page:
(Casting an eye over the CSV data, there are also a couple of glitches in it that upset the grouping – eg “Home” and “Home ” (trailing space) are treated differently.)
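The trailing space glitch is easy to reproduce, and just as easy to tidy up if you are post-processing the CSV yourself. A minimal sketch, with made-up amounts:

```python
from collections import defaultdict

# Made-up amounts; note the trailing space in the second department name
rows = [("Home", 10), ("Home ", 5), ("Health", 7)]


def group_sum(rows, clean=False):
    """Sum amounts per department, optionally stripping stray whitespace first."""
    totals = defaultdict(int)
    for dept, amount in rows:
        key = dept.strip() if clean else dept
        totals[key] += amount
    return dict(totals)


print(group_sum(rows))              # "Home" and "Home " are kept apart
print(group_sum(rows, clean=True))  # one "Home" group once the names are stripped
```

Of course, this only fixes things on the consumer side; the better fix is for the glitch to be removed in the source spreadsheet.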
We can now use this page to create some visualisations – so put some placeholders in place in a visualisation page:
And then generate some visualisations…
One of the potentially most informative summary views for this data is a change treemap, that shows the relative gross expenditure for each department, along with whether it has increased or decreased between 2006 and 2007:
Blue is decrease from 2006 to 2007, orange is an increase.
The next step is now to create a change treemap for each quango within a department, but that’s a post for another day… [UPDATE: see it here – Visualising Where the Money Goes: Westminster Quangos, Part 2]
The OU Library website has been running Google Analytics for ages, but from what I can tell they haven’t done a huge amount with the results in terms of making the analytics actionable and using them to improve the site design (I’d love for someone to correct me with a blog post or two about how analytics have been used to improve site performance. If anyone would like to publish such a post, I’ll happily give you a guest slot here on OUseful.info…:-)
(As a bit of background, see Library Analytics, (Part 1), Library Analytics, (Part 2), Library Analytics, (Part 3), Library Analytics, (Part 4), Library Analytics, (Part 5), Library Analytics, (Part 6), Library Analytics, (Part 7) and Library Analytics, (Part 8))
Anyway, here’s the Library homepage (August 2009):
And here are two of the real OU Library homepages:
(See also: Where is the Open University Homepage?;-)
And here’s the OU Library homepage as a treemap, where the block size shows where the traffic goes (as recorded over the last month) as a percentage of all traffic to the OU Library homepage.
So if each click was equally valuable, and each pixel on the screen was equally valuable, then that’s how the screen area should be allocated… (Hmm – that could be, err, interesting – an adaptive homepage where there’s one block element per link, and a treemap algorithm that allocates the area each block has when the page is rendered? Heh heh :-)
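Just to doodle the adaptive homepage idea a little further: the crudest possible version gives each link a full-height column whose width is proportional to its share of the clicks. A proper treemap would use something like a squarified layout to keep the blocks closer to square, so treat this as a sketch of the allocation principle only; the link names, click counts and screen size are all hypothetical.

```python
def allocate(clicks, width, height):
    """Slice layout: one full-height block per link, width proportional to clicks.

    Returns {link: (x, y, block_width, block_height)}.
    """
    total = sum(clicks.values())
    x = 0.0
    blocks = {}
    # Biggest traffic earner first, working left to right across the page
    for link, n in sorted(clicks.items(), key=lambda kv: kv[1], reverse=True):
        w = width * n / total
        blocks[link] = (x, 0.0, w, float(height))
        x += w
    return blocks


# Hypothetical click counts for three homepage links on a 1000x600 canvas
clicks = {"Search": 50, "Journals": 30, "Help": 20}
layout = allocate(clicks, 1000, 600)

for link, block in layout.items():
    print(link, block)
```

Each block's area is then in the same proportion to the screen as that link's clicks are to the total traffic.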
I did think about showing a heatmap of where on the homepage the clicks were made, but I figure I’ve probably already upset the Library folk enough by now. I also considered doing a treemap showing the relative proportions of different keywords on Google that drove traffic to the OU Library homepage, but I figure that may be commercially sensitive in terms of bidding for Adsense keywords…
Getting blog links in to the OUseful.info blog has been getting harder over the last few weeks, but my post on Open Educational Resources and the University Library Website (which I’d tweeted as “Are academic libraries conspiring against OERs?”) generated a couple that I thought I’d comment on here.
First up, Stephen Downes suggested: “Is the university library actively subverting the movement toward opn [sic] educational resources? One could argue that it has significant incentive to do so. … We cannot, I argue, expect support for open educational resources from institutions dedicated to their elimination.”
The thing that first came to my mind when I read that was that it’s the vendors of systems into libraries who have a commercial stake in directing attention towards their systems, particularly in the journals/ebooks area. (I’m not sure what the library catalogue vendors sell other than their pretty much unusable OPAC systems?) On second breath, it then struck me that the people in the library who hold the budgets to pay for these expensive systems also have a stake in retaining them, under the assumption that the bigger the budget you hold/spend, the more, I dunno, important you are?
I haven’t had a look round the library websites to see how obvious the open repository searches are, but I’d be willing to bet that they still aren’t as prominent as the bought in systems that direct attention towards subscription journal content. (Maybe that’s right? Maybe the prominence should be proportional to the amount of money spent on the resource compared with allocating it in proportion to website traffic (cf. OU Library Home Page – Normalised Click Density). But it is worth remembering that the research repository projects are often run under the auspices of the archiving role of the library.)
The next point that came to mind was clarified for me by David Davies in OER and library websites, time for integration:
[H]istorically it’s not been libraries that worried about those kinds of educational resources [i.e. OERs]. While libraries were cataloguing books and journals, other parts of the central institutional services were managing learning objects, multimedia resources, e-learning content, whatever you want to call the stuff. These resources were locked up in WebCT or some other VLE/LMS and were discoverable there, at least in theory. Teachers and their students knew, and still know, where to look for books & journals and where to look for other kinds of learning resources.
This suggests that teaching material type stuff is to be found through the VLE or CMS. But I’m not sure that’s right? Locally suggested resources are almost definitely linked to from those environments, but do those environments also offer wider search facilities over external teaching materials? Why shouldn’t that sort of material be the sort of material that you’d expect to find in using academic library search tools? (See also: ACRLog: The Question They Forgot To Ask: “But why are we only considering the role of the academic library as gateway, archive and buyer? I would argue [we need] to add a new dimension for faculty to consider – the academic library’s role as learning center and instruction partner [rather than focusing] on the acadmic [sic] library’s traditional role as collector, organizer and gateway provider. … I would argue that an equally essential part of the academic library’s digital transformation is the shift from the gateway role to the teaching and learning role in a much more aggressive way that integrates the library into the digital learning environment that has become many faculty’s preferred method of delivering their educational content”; okay, so here we get into invisible library territory – the library providing services in other locations. But it’s still a library service… What I think we need to do is tease apart library services and where those services are accessed. It may be that we don’t want OERs to be discoverable through the library website, but then we need to ask exactly what sort of proposition the library website is offering?)
There’s another thing I’d like to pick up from David’s post – the phrase: “While libraries were…”; because while they were (doing whatever? conspiring with the publishers against academic authors’ rights?, or conspiring with the publishers to pass off sponsored publications as academic journals?;-), the web arrived, websearch arrived, Google arrived, Google Books and Google Scholar arrived, arXiv already was there (but did the library build on it?), OERs started gaining funding, if not traction (yet), Library Thing arrived, Creative Commons arrived, print-on-demand has almost arrived (Amazon has a POD capability, I think?), all manner of how to video tutorial sites arrived (and their aggregators – like How Do I?;-), custom search engines arrived, and so on…
Ho hum, I hadn’t intended that to be so much of a rant… Sigh… maybe I need to go and look at JISC’s Library of the Future debate to see how to think about libraries in a more considered manner?;-)
Some disconnected thoughts about who gives a whatever about OERs, brought on in part by @liamgh’s Why remix an Open Educational Resource? (see also this 2 year old post: So What Exactly Is An OpenLearn Content Remix?). A couple of other bits of context too, to situate HE in a wider context of educational broadcasting:
– Trust partially upholds fair trading complaints against the BBC: “BESA appealed to the Trust regarding three of the BBC’s formal learning offerings on bbc.co.uk between 1997 and 2009. … the Trust considers it is necessary for the Trust to conduct an assessment of the potential competitive impacts of Bitesize, Learning Zone Broadband and the Learning Portal, covering developments to these offerings since June 2007, and the way in which they deliver against the BBC’s Public Purposes. This will enable the Trust to determine whether the BBC Executive’s failure to conduct its own competitive impact assessment since 2007 had any substantive effect. … No further increases in investment levels for Bitesize, Learning Zone Broadband and the Learning Portal will be considered until the Trust has completed its competitive impact assessment on developments since 2007”
– Getting nearer day by day: “We launched a BBC College of Journalism intranet site back in January 2007 … aimed at the 7,500 journalists in the BBC … A handful of us put together about 1200 pages of learning – guides, tips, advice – and about 250 bits of video; a blog, podcasts, interactive tests and quizzes and built the tools to deliver them. A lot of late nights and a lot of really satisfying work. Satisfying, too, because we put into effect some really cool ideas about informal learning and were able to find out how early and mid career journalists learn best. … The plan always was to share this content with the people who’d paid for it – UK licence fee payers. And to make it available for BBC journalists to work on at home or in parts of the world where a www connection was more reliable than an intranet link. Which is where we more or less are now.” [my emphasis; see also BBC Training and Development]
So why my jaded attitude? Because I wonder (again) what it is we actually expect to happen to these OERs (how many OER projects re-use other people’s bids to get funding? How many reuse each other’s ‘what are OERs stuff’? How many OER projects ever demonstrate a remix of their content, or a compelling reuse of it? How many publish their sites as a wiki so other people can correct errors? How many are open to public comments, ffs? How many give a worked example of any of the twenty items on Liam’s list with their content, and how many of them mix in other people’s OER content if they ever do so? How many attempt to publish running stats on how their content is being reused, and how many demonstrate showcase examples of content remix and reuse?)
That said, there are signs of some sort of use: ‘Self-learners’ creating university of online; maybe the open courseware is providing a discovery context for learners looking for specific learning aids (or educators looking for specific teaching aids)? That is, while use might be most likely at the disaggregated level, discovery will be mediated through course level aggregations (the wider course context providing the SEO, or discovery metadata, that leads to particular items being discovered? Maybe Google turns up the course, and local navigation helps (expert) users browse to the resource they were hoping to discover?)
Early days yet, I know, but how much of the #ukoer content currently being produced will be remixed with, or reused alongside, content from other parts of that project as part of end-of-project demos? (Of course, if reuse/remix isn’t really what you expect, then fine… and, err, what are you claiming, exactly? Simple consumption? That’s fine, but say it; limit yourself to that…)
Ok, rant part over. Deep breath. Here comes another… as academics, we like to think we do the education thing, not the training thing. But for those of you who do learn new stuff, maybe every day, what do you find most useful to support that presumably self-motivated learning? For my own part, I tend to search for tutorials, and maybe even use How Do I?. That is, I look for training materials. A need or a question frames the search, and then being able to do something, make something, get my head round something enough to be able to make use of it, or teach it on, frames the admittedly utilitarian goal. Maybe that ability to look for those materials is a graduate level information skill, so it’s something we teach, right…? (Err… but that would be training…?!)
So here’s where I’m at – OERs are probably [possibly?] not that useful. But open training materials potentially are. (Or maybe not..?;-) Here are some more: UNESCO Training Platform
And so is open documentation.
They probably all could come under the banner of open information resources, but they are differently useful, and differently likely to be reused/reusable, remixed/remixable, maintained/maintainable or repurposed/repurposeable. Of them all, I suspect that the opencourseware subset of OERs is the least re* of them all.
That is all…