Getting Started With Linked Data – OpenUp Laboratories Example SPARQL Queries

On and off over the last few months, I’ve had the occasional rant and Twitter heckle about how hard it seems to be getting going with Linked Data, not least because there appear to be so few rungs at the bottom of the getting started ladder.

The typical response of a Linked Data geek to a request for “how do I start” is often along the lines of:

do the following query to your LD datastore SPARQL endpoint:
SELECT ?a ?b ?c WHERE { ?a ?b ?c } LIMIT 50
and it’ll dump out 50 triples, so you can easily see what’s in there…

Right…

I personally find an easier way in is to look at example queries that might, in and of themselves, be useful, or interesting; at the very least, they should provide an example of the sort of query you might make to the datastore in a real world context. As David Flanders pointed out in a tweet last night, just being able to get hold of lists is often a useful start. (Paul Miller retorted with a “lists are only a tiny part of the usefulness” quip, and then quietly failed to pull out even the most simple list of schools by council area from the education datastore. I think he was going to phone the council this morning, instead..?!;-)

Anyway, I was pleased to see that today(?) the TSO have launched the TSO OpenUP Laboratories, a showcase for work in progress that includes their Linked Data work. As well as linking to data sets they have a hand in:

Open Up Laboratories - http://openuplabs.tso.co.uk/

there are also links to example queries on their Linked Data stores. So for example, here are some example queries on the statistics datastore

TSO openup labs linked data/sparql examples - http://openuplabs.tso.co.uk/

Meaningful example queries are given separately for all the datasets, and in many cases appear to have been chosen so that you can recombine fragments from two or more queries in order to produce more complex queries. So for example, in the research datastore, I could readily combine elements of the query [t]o find all the projects, which started in 2010 with the query to [f]ind the details for all the projects, whose project status is live and are funded by Technology Strategy Board to come up with a mashed together query to find the projects that started in 2010 funded by the Technology Strategy Board.

I also made a couple of guesses at queries that don’t appear to be supported yet, but I hope will be. For example, the Technology Strategy Board was identified using its DBpedia identifier: <http://dbpedia.org/resource/Technology_Strategy_Board>
So I went to DBpedia and searched for JISC (<Joint_Information_Systems_Committee>), and the EPSRC (<Engineering_and_Physical_Sciences_Research_Council>), and tried those queries, although they failed to return anything. But at least there was enough of a crib in the example queries to help me start thinking about possible queries I might be able to make on a research projects database.
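As a rough sketch of that sort of fragment mashing, here is one way of gluing two query fragments together and swapping in a different DBpedia identifier. The ex: prefix and the ex:startDate and ex:fundedBy predicates are placeholders made up for illustration, not the vocabulary the research datastore actually uses; only the DBpedia resource URL pattern comes from the example query.

```python
DBPEDIA_RESOURCE = "http://dbpedia.org/resource/"

# Hypothetical fragments: the ex: predicates are made-up placeholders,
# NOT the real vocabulary of the research datastore.
STARTED_IN_YEAR = '?project ex:startDate ?start . FILTER(REGEX(STR(?start), "^{year}"))'
FUNDED_BY = "?project ex:fundedBy <{funder}> ."


def dbpedia_uri(name):
    """Turn a DBpedia resource name into a full identifier."""
    return DBPEDIA_RESOURCE + name.replace(" ", "_")


def projects_query(year, funder_name, limit=50):
    """Combine the two fragments into a single SELECT query string."""
    patterns = [
        STARTED_IN_YEAR.format(year=year),
        FUNDED_BY.format(funder=dbpedia_uri(funder_name)),
    ]
    body = "\n  ".join(patterns)
    return (
        "PREFIX ex: <http://example.org/hypothetical#>\n"
        "SELECT ?project ?start WHERE {\n  " + body + "\n} LIMIT " + str(limit)
    )
```

Calling projects_query(2010, "Technology Strategy Board") gives a query string that could be pasted into a SPARQL endpoint form once the placeholder predicates have been replaced with real ones cribbed from the published examples; changing the funder name is then a one-word edit.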

As well as the sample queries, the TSO have been working on a RESTful/ URL based API that sits on top of a SPARQL query into a particular datastore and masks some of the complexity of the query from the user: Open Up Labs – APIs. (See also A Developers’ Guide to the Linked Data APIs.)

Work is still in progress on this API, but as well as abstracting up from queries via nicely patterned URLs, it also looks set to offer a variety of preview output displays, as you can see on the project wiki – for example: Linked Data API: BasicHTMLViewer

For some time, I’ve been of the mind that sharing queries is going to be a Good Thing to do, and that a small economy or query-based market may even evolve around them. Good examples are an important part of that, particularly from an educational point of view, and the OpenUP Labs provide a great example of this. With a bit of luck, I’ll have even more to post on this topic in the next few days…

PS by the by, it’s maybe also worth mentioning that the OpenUP competition deadline has been extended to the end of the year:

We are looking for ideas of how you, the public, want to have that information made available to you. What different pieces of information do you need to make that informed decision? How would you want that information presented to you – on a map, combined with other statistics, delivered as a regular email? What other information does government data need to be combined with to make it more useful?

We are looking for ideas from everyone; parents, students, businesspeople, GPs, local government officers, pensioners and everyone else who has ever needed to use a piece of government information.

This is your opportunity to share your ideas, win £1,000 and have your idea developed.

If you’re a developer, or just want to get involved we would love to hear from you as well.

Competition details
The closing date for the OpenUp challenge is now 31 December 2010. The best five ideas will then be selected and if yours is one of them you’ll be invited to pitch your idea for 10 minutes to a panel of experts. They will have the difficult job of choosing the winning idea which will be brought to life with a £50,000 development fund as well as £1,000 cash for the creator of the idea.

Hmmm.. I maybe need to have a think about that…?!;-)

PS see also: Hackable SPARQL Queries: Parameter Spotting Tutorial

TSO OpenUP Competition – Opening Up UCAS Data

Here’s the presentation I gave to the judging panel at the TSO OpenUp competition final yesterday. As ever, it doesn’t make sense with[out] (doh!) me talking, though I did add some notes in to the Powerpoint deck: Opening up UCAS Course Code Data

(I had hoped Slideshare would be able to use the notes as a transcript, but it doesn’t seem to do that, and I can’t see how to cut and paste the notes in by hand?:-(

A quick summary:

The “Big Idea” behind my entry to the TSO competition was a simple one – make UCAS course data (course code, title and institution) available as data. By opening up the data we make it possible for third parties to construct services and applications based around a complete data skeleton of all the courses offered for undergraduate entry through clearing in a particular year across UK higher education.
The data acts as scaffolding that can be used to develop consumer facing applications across HE (e.g. improved course choice applications) as well as support internal “vertical” activities within HEIs that may also be transferable across HEIs.
Primary value is generated from taking the course code scaffolding and annotating it with related data. Access to this dataset may be sold on in a B2B context via data platform services. Consumer facing applications with their own revenue streams may also be built on top of the data platform.
This idea makes data available that can potentially disrupt the current discovery model for course choice and selection (but in its current form, not in university application or enrollment), in Higher Education in the UK.

Here are the notes I doodled to myself in preparation for the pitch. Now the idea has been picked up, it will need tightening up and may change significantly! ;-) Which is to say – in this form, it is just my original personal opinion on the idea, and all ‘facts’ need checking…

  1. I thought the competition was as much about opening up the data as anything… So the original idea was simply that it would be really handy to have machine readable access to course code and course name information for UK HE courses from UCAS – which is presumably the closest thing we have to a national catalogue of higher education courses.

    But when selected to pitch the idea, it became clear that an application or two were also required, or at least some good business reasons for opening up this data…

    So here we go…

  2. UCAS is the clearing house for applying to university in the UK. It maintains a comprehensive directory of HE courses available in the UK.

    Postgraduate students and Open University students do not go through UCAS. Other direct entry routes to higher education courses may also be available.

    According to UCAS, in 2010, there were 697,351 applicants with 487,329 acceptances, compared with 639,860 applications and 481,854 acceptances in 2009. [ Slightly different figures in end of cycle report 2009/10? ]

    For convenience, hold in mind the thought that course codes could be to course marketing, what postcodes are for geo related applications… They provide a natural identifier that other things can be associated with.

    Associated with each degree course is a course code. UCAS course codes are also associated with JACS codes – Joint Academic Coding System identifiers – that relate to particular topics of study. “The UCAS course codes have no meaning other than “this course is offered by this institution for this application cycle”.” [link]

    “UCAS course code is 4 character reference which can be any combination of letters and numbers.

    Each course is also assigned up to three JACS (Joint Academic Coding System) codes in order to classify the course for *J purposes. The JACS system was introduced for 2002 entry, and replaced UCAS Standard Classification of Academic Subjects (SCAS). Each JACS code consists of a single letter followed by 3 numbers. JACS is divided into subject areas, with a related initial letter for each. JACS codes are allocated to courses for the *J return.

    The JACS system is used by the Higher Education Statistics Agency (HESA), and is the result of a joint UCAS-HESA subject code harmonization project.

    JACS is also used by UK institutions to identify the subject matter of programmes and modules. These institutions include the Department for Innovation, Universities and Skills (DIUS), the Home Office and the Higher Education Funding Council for England (HEFCE).”

    Keywords: up to 10 keywords per course are allocated to each course from a restricted list of just over 4,500 valid keywords.
    “Main keyword: This is generally a broad subject category, usually expressed as a single word, for example ‘Business’.
    Suggested keyword (SUG): Where a search on a main keyword identifies more than 200 courses, the Course Search user is prompted to select from a set of secondary keywords or phrases. These are the more specific ‘Suggested keywords’ attached to the courses identified. For example, ‘Business Administration’ is one of a range of ‘Suggested keywords’ which could be attached to a Business course (there are more than 60 others to choose from). A course in Business Administration would typically have this as the ‘Suggested keyword’, with ‘Business’ as the main keyword.
    However, if a course only has a ‘Suggested keyword’ and not a related ‘Main keyword’, the course will not be displayed in any search under the ‘Main keyword’ alone.

    Single subject: Main keywords can be ticked as ‘Single subject’. This means that the course will be displayed by a keyword search on the subject, when the user chooses the ‘single subject’ option below. You may have a maximum of two keywords indicated as single subjects per course.”

    “Between January and March 2010, approximately 600,000 unique IP addresses access the UCAS course code search function. During the same time period, almost 5 million unique IP addresses accessed the UCAS subject search function.” [link]

    “New courses from 2012 will be given UCAS codes that should not be used for subject classification purposes. However, all courses will still be assigned up to three individual JACS3 codes based on the subject content of the course.

    An analysis of unique IP address activity on the UCAS Course Search has shown that very few searches are conducted using the course code, compared to the subject search function. UCAS Courses Data Team will be working to improve the subject search and course keywords over the coming year to enable potential applicants to accurately find suitable courses.” [link]

    Course code identifiers have an important role to play within university administrations, for example in marshalling resources around a course, although they are not used by students. (On the other hand, students may have a familiarity with module codes.) Course codes identify courses that are the subject of quality assessment by the QAA. To a certain extent, a complete catalogue of course codes allows third parties to organise offerings based around UK higher education degrees in a comprehensive way and link in to the UCAS application procedure.

  3. If released as open data, and particularly as Linked Open Data, the course data can be used to support:
    – the release of horizontal data across the UK HE sector by HEIs, such as course catalogue information;
    – vertical scaffolding within an institution for elaboration by module codes, which in turn may be associated with module descriptions, reading lists, educational resources, etc.
    – the development across HE of services supporting student choice – for example “compare the uni” type services
  4. At the moment the data is siloed inside UCAS behind a search engine with unfriendly session based URLs and a poor results UI. Whilst it is possible to scrape or crowd-source course code information, such ad hoc collection mechanisms run the danger of being far from complete, which means that bias may be introduced into the collection as a side effect of the collection method.
  5. Making the data available via an API or Linked data store makes it easier for third parties to build course related services of whatever flavour – course comparison sites, badging services, resource recommendation services. The availability of the data also makes it easier for developers within an institution to develop services around course codes that might be directly transferable to, or scaleable across, other institutions.
  6. What happens if the API becomes writeable? An appropriately designed data store, and corresponding ingest routes, might encourage HEIs to start releasing the course data themselves in a more structured way.

    XCRI is JISC’s preferred way of doing this, and I think there has been some lobbying of HEFCE from various JISC projects, but I’m not sure how successful it’s been?

  7. Ultimately, we might be able to aggregate data from locally maintained local data stores. Course marketing becomes a feature of the Linked Data cloud.

    Also context of data burden on HEIs, reporting to Professional, Statutory and Regulatory Bodies – PSURBS.

    Reconciliation with HESA Institution and campus identifiers, as well as the JISCMU API and Guardian Datablog Rosetta Stone spreadsheet

    By hosting course code data, and using it as scaffolding within a Linked Data cloud around HE courses, a valuable platform service can be made available to HEIs as well as commercial operations designed to support student choice when it comes to selecting an appropriate course and university.

  8. Several recent JISC projects have started to explore the release of course related activity data on the one hand, and Linked Data approaches to enterprise wide data management on the other. What is currently lacking is a national data-centric view over all HEI course offerings. UCAS has that data.

    Opening up the data facilitates rapid innovation projects within HEIs, and makes it possible for innovators within an HEI to make progress on projects that span across course offerings even if they don’t have easy access to that data from their own institution.

  9. Consumer services are also a possibility. As HEIs become more businesslike, treating students as customers, and paying customers at that, we might expect to see the appearance of university course comparison sites.

    CompareTheUni has had a holding page up for months – but will it ever launch? Uni&Books crowd sources module codes and associated reading links. Talis Aspire is a commercial reading list system that associates resources with module codes.

  10. Last year, I pulled together a few separate published datasets and threw them into Google Fusion Tables, then plotted the results. The idea was that you could chart research ratings against student satisfaction, or drop out rates against the academic pay. [link ]

    Guardian datablog picked up the post, and I still get traffic from there on a daily basis… [link ]

  11. The JISC MOSAIC Library data challenge saw Huddersfield University open up book loans data associated with course codes – so you could map between courses and books, and vice versa (“People who studied this course borrowed this book”, “this book was referred to by students on this course”)

    One demonstrator I built used a bookmarklet to annotate UCAS course pages with a link to a resource page showing what books had been borrowed by students on that course at Huddersfield University. [Link ]

  12. Enter something like compare the uni, but data driven, and providing aggregated views over data from universities and courses.
  13. To set the scene, the site needs to be designed with a user in mind. I see a 16-17 year old, sloughing on the sofa, TV on with the most partial of attention being paid to it, laptop or tablet to hand and the main focus of attention. Facebook chat and a browser are grabbing attention on screen, with occasional distractions from the TV and mobile phone.
  14. The key is course data – this provides a natural set of identifiers that span the full range of clearing based HE course offerings in the UK and allows third parties to build services on this basis.

    The course codes also provide hooks against which it may be possible to deploy mappings across skills frameworks, e.g. SFIA in IT world. The course codes will also have associated JACS subject code mappings and UCAS search terms, which in turn may provide weak links into other domains, such as the world of books using vocabularies such as the Library of Congress Subject headings and Dewey classification codes.

  15. Further down the line, if we can start to associate module codes with course codes, we can start to develop services to support current students, or informal learners, by hooking in educational resources at the module level.
  16. Marketing can go several ways. For the data platform, evangelism into the HE developer community may spark innovation from within HEIs, most likely under the auspices of JISC projects. Platform data services may also be marketed to third party developers and innovators/entrepreneurs.

    Services built on top of the data platform will need to be marketed to the target audience using appropriate channels. Specialist marketers such as Campus Group may be appropriate partners here.

  17. The idea pitched is disruptive in that one of the major competitors is at first UCAS. However, if UCAS retains its unique role in university application and clearing, then UCAS will still play an essential, and heavily trafficked, role in undergraduate student applications to university. Course discovery and selection will, however, move away from the UCAS site towards services that better meet the needs of potential applicants. One then might imagine UCAS becoming a B2B service that acts as intermediary between student choice websites and universities, or even entertain a scenario in which UCAS is disintermediated and some other clearing mechanism instituted between universities and potential-student facing course choice portals.
  18. According to UCAS, between January and March 2010 “almost 5 million unique IP addresses accessed the UCAS subject search function” [link]. In each of the last couple of years, the annual application/acceptance numbers have been of the order of 500,000 students intake per year, on 600,000 applicants. If a service reaches 10% of applicants and generates £5 per applicant, that’s £300k pa. £10 from 20% of intake, that’s £1M pa. £50 each from 40% is £10M. I haven’t managed to find out what the acquisition cost of a successful applicant is, or the advertising budget allocated to an undergraduate recruitment marketing campaign, but there are 200 or so HE institutions (going by the number of allocated HESA institution codes).

    For platform business – e.g. business model based around selling queries on linked/aggregated/mapped datasets. If you imagine a query returning results with several attributes, each result is a row and each attribute is a column. If you allow free access to x thousand query cells returned a day, and then charge for cells above that limit, you:
    Encourage wider innovation around your platform; let people run narrow queries or broad queries. License on use of data for folk to use on their own datastores/augmented with their own triples.
    Generate revenue that scales on a metered basis according to usage;
    – offer additional analytics that get your tracking script in third party web pages, helping train your learning classifiers, which makes platform more valuable.

    For a consumer facing application – eg a course choice site for potential applicants is the easiest to imagine:
    – Short term model would be advertising (e.g. course/uni ads), affiliate fees on booksales for first year books? Second hand books market eg via Facebook marketplace?
    – Medium term – affiliate fee for prospectus application/fulfilment
    Long term – affiliate fee for course registration

  19. At the end of the day, if the data describing all the HE courses available in the UK is available as data, folk will be able to start to build interesting things around it…
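The back-of-envelope sums in note 18 can be checked with a few lines of Python. This is only a sketch: the applicant and intake figures are the rounded ones quoted in the notes, and the free allowance and per-cell price used for the metered platform model are hypothetical numbers picked purely to show how the charge would scale.

```python
APPLICANTS = 600_000  # rounded applicants per year, as quoted in the notes
INTAKE = 500_000      # rounded intake per year, as quoted in the notes


def annual_revenue(population, percent_reached, pounds_each):
    """Revenue in pounds if percent_reached% of population each generate pounds_each."""
    return population * percent_reached // 100 * pounds_each


def metered_charge(rows, columns, free_cells, pence_per_cell):
    """Charge in pence for one day of queries under a 'free up to a limit' model.

    A query result with `rows` results and `columns` attributes counts as
    rows * columns cells; only cells above the free allowance are billed.
    """
    cells = rows * columns
    billable = max(0, cells - free_cells)
    return billable * pence_per_cell


SCENARIOS = [
    ("10% of applicants at £5", annual_revenue(APPLICANTS, 10, 5)),
    ("20% of intake at £10", annual_revenue(INTAKE, 20, 10)),
    ("40% of intake at £50", annual_revenue(INTAKE, 40, 50)),
]
```

The three scenarios come out at £300,000, £1,000,000 and £10,000,000 respectively, matching the figures in the notes. With a made-up allowance of 2,000 free cells and a made-up price of 1p per cell, a day's worth of queries returning 1,000 rows of 5 attributes would be billed for 3,000 cells.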

A First Quick Viz of UK University Fees

Regular readers will know how I do quite like to dabble with visual analysis, so here are a couple of doodles with some of the university fees data that is starting to appear.

The data set I’m using is a partial one, taken from the Guardian Datastore: Tuition fees 2012: what are the universities charging?. (If you know where there’s a full list of UK course fees data by HEI and course, please let me know in a comment below, or even better, via an answer to this Where’s the fees data? question on GetTheData.)

My first thought was to go for a proportional symbol map. (Does anyone know of a javascript library that can generate proportional symbol overlays on a Google Map or similar, even better if it can trivially pull in data from a Google spreadsheet via the Google visualisation? I have an old hack (supermarket catchment areas), but there must be something nicer to use by now, surely? [UPDATE: ah – forgot this: Polymaps])

In the end, I took the easy way out, and opted for Geocommons. I downloaded the data from the Guardian datastore, and tidied it up a little in Google Refine, removing non-numerical entries (including ranges, such as 4,500-6,000) in the Fees column and replacing them with minimum fee values. Sorting the fees column as a numerical type with errors at the top made the cells that needed tweaking easy to find.
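For anyone who would rather script that tidying step than click through Refine, here is a minimal sketch of the same idea in Python: pull every number out of a fee cell and keep the smallest. It assumes commas only ever appear as thousands separators, which is a guess about the data rather than something I have checked against every row, and the function name is my own.

```python
import re


def minimum_fee(raw):
    """Return the lowest number found in a fee cell, or None if there is none."""
    # Assumption: commas are thousands separators, so "4,500-6,000"
    # becomes "4500-6000" before the numbers are picked out.
    cleaned = raw.replace(",", "")
    numbers = [int(n) for n in re.findall(r"\d+", cleaned)]
    return min(numbers) if numbers else None
```

So a range such as "4,500-6,000" is reduced to 4500, a plain "9000" is left as 9000, and a cell with no digits at all comes back as None so it can be picked up and fixed by hand.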

The Guardian data included an address column, which I thought Geocommons should be able to cope with. It didn’t seem to work out for me though (I’m sure I checked the UK territory, but only seemed to get US geocodings?) so in the end I used a trick posted to the OnlineJournalism blog to geocode the addresses (Getting full addresses for data from an FOI response (using APIs); rather than use the value.parseJson().results[0].formatted_address construct, I generated a couple of columns from the JSON results column using value.parseJson().results[0].geometry.location.lng and value.parseJson().results[0].geometry.location.lat).
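The same extraction can be sketched outside Refine. The snippet below mirrors the two parseJson() expressions, pulling the latitude and longitude of the first result out of a geocoder response. The sample response is made up and trimmed to just the fields used here; a real response carries a lot more.

```python
import json


def lat_lng(geocoder_json):
    """Return (lat, lng) for the first result in a geocoder JSON response.

    Returns None when the response contains no results.
    """
    data = json.loads(geocoder_json)
    results = data.get("results", [])
    if not results:
        return None
    location = results[0]["geometry"]["location"]
    return location["lat"], location["lng"]


# Made-up sample response, trimmed to the fields the function reads.
SAMPLE = (
    '{"results": [{"formatted_address": "Example Road, Exampletown", '
    '"geometry": {"location": {"lat": 51.5, "lng": -0.12}}}]}'
)
```

Returning None for an empty result list makes the addresses that failed to geocode easy to filter out before uploading.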

Uploading the data to Geocommons and clicking where prompted, it was quite easy to generate this map of the fees to date:

Anyone know if there’s a way of choosing the order of fields in the pop-up info box? And maybe even a way of selecting which ones to display? Or do I have to generate a custom dataset and then create a map over that?

What I had hoped to be able to do was use coloured proportional symbols to generate a two dimensional data plot, e.g. comparing fees with drop out rates, but Geocommons doesn’t seem to support that (yet?). It would also be nice to have an interactive map where the user could select which numerical value(s) are displayed, but again, I missed that option if it’s there…

The second thing I thought I’d try would be an interactive scatterplot on Many Eyes. Here’s one view that I thought might identify what sort of return on value you might get for your course fee…;-)

Click thru’ to have a play with the chart yourself;-)

PS I can’t not say this, really – you’ve let me down again, @datastore folks…. where’s a university ID column using some sort of standard identifier for each university? I know you have them, because they’re in the Rosetta sheet… although that is lacking a HESA INST-ID column, which might be handy in certain situations… ;-) [UPDATE – apparently, HESA codes are in the spreadsheet…. ;-0]

PPS Hmm… that Rosetta sheet got me thinking – what identifier scheme does the JISC MU API use?

PPPS If you’re looking for a degree, why not give the Course Detective search engine a go? It searches over as many of the UK university online prospectus web pages that we could find and offer up as a sacrifice to a Google Custom search engine ;-)

Getting Access to University Course Code Data (or not… (yet…))

A couple of weeks or so ago, having picked up the TSO OpenUp competition prize for suggesting that it would be a Good Thing for UCAS/university course code data to be made available, I had a meeting with the TSO folk to chat over “what next?” The meeting was an upbeat one with a plan to get started as soon as possible with a scrape of the UCAS website… so what’s happened since…?

First up – a reading of the UCAS website Terms and Conditions suggests that scraping is a no-no…

6. Intellectual property rights
e. Copying, distributing or any use of the material contained on the website for any commercial purpose is prohibited.
f. You may not create a database by systematically downloading substantial parts of the website

(In the finest traditions of the web, you aren’t allowed to deep link into the site without permission either: 6.c Links to the website are not permitted, other than links to the homepage for your personal use, except with our prior written permission. Links to the website from within a frameset definition are not permitted except with our prior written permission.)

So, err, I guess my link to the terms and conditions breaks those terms and conditions? Oops…;-) Should I be sending them something like this do you think?

Dear enquiries@ucas.ac.uk,
As per your terms and conditions, (paragraph 6 c) please may I publish a link to your terms and conditions web page [ http://www.ucas.com/terms_and_conditions ] in a blog post I am writing that, in part, refers to your terms and conditions?
Luv'n'hugs,
tony

As a fallback, I put a couple of trial balloon FOI requests in to a couple of universities asking for the course names and UCAS course codes for courses offered in 2010/11, along with the search keywords associated with each course (doh! I did it again, deep linking into the UCAS site…)

PS Please may I also link to the page describing course search keywords [ http://www.ucas.com/he_staff/courses/coursesearchkeywords ] ?

The first request went to the University of Southampton, in part because I knew that they already publish chunks of the data (as data) as part of their #opensoton Open Data initiative. (This probably means I was abusing the FOI system, but a point maybe needed to be made…?!;-) The second request was put in to the University of Bristol.

The requests were of the form:

I would be grateful if you could send me in spreadsheet, machine readable electronic form or plain text a copy of the course codes, course titles and search keywords for each course as submitted to UCAS for the 2010-2011 (October 2010) student entry.

If possible, would you also provide HESA subject category codes associated with each course.
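To make the ask concrete, here is a sketch of the sort of machine readable response I had in mind, along with a check on the code formats. The column names and rows are made up for illustration. The format rules are the ones quoted in my pitch notes (a 4 character course code, and up to three JACS codes of one letter followed by three numbers), and I am assuming the HESA subject codes are JACS codes.

```python
import csv
import io
import re

COURSE_CODE = re.compile(r"[A-Z0-9]{4}")  # 4 characters, letters and/or numbers
JACS_CODE = re.compile(r"[A-Z][0-9]{3}")  # one letter followed by 3 numbers

# Made-up rows in a made-up layout, NOT real UCAS or university data.
SAMPLE = """course_code,title,jacs_codes
G400,Computer Science,G400
GN41,Computing and Business,G400;N100
"""


def read_courses(text):
    """Parse CSV text into (course_code, title, [jacs_codes]) tuples."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        code = row["course_code"].strip().upper()
        jacs = [j.strip().upper() for j in row["jacs_codes"].split(";") if j.strip()]
        if not COURSE_CODE.fullmatch(code):
            raise ValueError("bad course code: " + code)
        if len(jacs) > 3 or not all(JACS_CODE.fullmatch(j) for j in jacs):
            raise ValueError("bad JACS codes for " + code)
        rows.append((code, row["title"].strip(), jacs))
    return rows


def code_pairs(rows):
    """Flatten the rows into (course code, subject code) pairs."""
    return [(code, j) for code, _title, jacs in rows for j in jacs]
```

The pairs returned by code_pairs(read_courses(SAMPLE)) are the (course code, subject code) 2-tuples, one per subject code attached to each course.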

So how did I get on?

Bristol’s response was as follows:

On discussion with our Admissions and Student Information teams, it appears that the University does not actually hold this data – it is held on a UCAS database. UCAS are not currently subject to the Freedom of Information Act (they will be in due course) but it may be worth talking to them directly to see if they are willing to assist.

And Southampton’s FOI response?

Course codes and titles may be found here: http://www.soton.ac.uk/corporateservices/foi/request-66210-6124d691.pdf Keywords were not held by the University – you should inquire with UCAS (http://www.ucas.com). HESA subject category codes may be found here: http://www.hesa.ac.uk/index.php/content/view/1806/296/

So what did I learn?

  1. I don’t seem to have made it clear enough to Southampton that I wanted the 2-tuple (course code, HESA code) for each course. So how should I have asked for that data (the response pointed me to the list of all HESA codes. What I wanted was, for each course code, the course code/HESA code pair).
  2. Generalising from an example of one;-), there seems to be a disconnect between FOI and open data branches of organisations. In my ideal world, the FOI person (an advocate for the person making the request) would also be on good terms with the Open Data team in the organisation, if not a data wrangler themselves. For data requests, the FOI person would make sure the data is released as open data as part of the process of fulfilling the request and then refer the person making the request to the open data site (see also: Open Data Processes – Taps, Query Paths/Audit Trails and Round Tripping). Southampton have part of this process already – the course data is in a PDF on their site and I was referred to it. (Note that the PDF is not just any PDF – have a look at it! – rather than the spreadsheet, machine readable electronic form or plain text I requested, even though @cgutteridge had posted a link to the SPARQL opendata query for the course code/UCAS code information I’d requested as a reply to my FOI request on the WhatDoTheyKnow site.)
  3. Universities don’t necessarily have any record of the search keywords they associate with the courses they post on UCAS. The UCAS website suggests that (doh!) “[r]ecent analysis of unique IP address use of the UCAS Course Search indicates that the subject search is by far the most popular of the 3 search options currently available”, such that “[w]hen an applicant uses our Course Search facility to search for available courses, they can choose a keyword by which to search, known as the ‘subject search’.” Which is to say, universities have no local record of the terms they use to describe courses that are the primary way of discovering their courses on UCAS? Blimey… (I wonder how much universities spend on Google AdWords for advertising particular courses on their own course prospectus websites and how they go about selecting those terms?)
  4. Asking for a machine readable “data as data” response has no teeth at the current time. I don’t know if the Protection of Freedoms bill clause that “extends Freedom of Information rights by requiring datasets to be available in a re-usable format” will change this? It seems like it might?

    Where—
    (a) an applicant makes a request for information to a public authority in respect of information that is, or forms part of, a dataset held by the public authority, and
    (b) on making the request for information, the applicant expresses a preference for communication by means of the provision to the applicant of a copy of the information in electronic form, the public authority must, so far as reasonably practicable, provide the information to the applicant in an electronic form which is capable of re-use.

  5. So what next? UCAS is a charity that appears to be operated by, for, and on behalf of UK Higher Education (e.g. UCAS Directors’ Report and Accounts 2009). Whilst not FOIable yet, it looked set to become FOIable from October 2011 (Ministry of Justice: Greater transparency in Freedom of Information), though I haven’t been able to find the SI and commencement date that enact this…? If it does become FOIable, we may be able to get the data out that way (although memories of the battle between open data advocates and the Ordnance Survey come to mind…) Hopefully, though, we’ll be able to get the data open by more amicable means before then…:-)

    PS a couple of other things that I’ve been dipping into relating to this project. Firstly, the UCAS Business Plan 2009-2012 (doh!):

    PPS Please may I also link to your Corporate Business Plan 2009-2012 [ http://www.ucas.com/documents/corporate/corpbusplan09-12.pdf ]

    Secondly, the Cabinet Office’s “Better Choices: Better Deals” strategy document [PDF], which as well as its “MyData” right to personal data initiative, also encourages business to put their information (and data…) to work. Whether or not you agree that more information may help to make for better choices from potential students, or that comparison sites have a role to play in this, the UK government appears to believe it and looks set to support the development of businesses operating in this area. For example:

    Effective consumer choices are also important in the public sector – such as decisions about what and where to study.
    However, unlike in private markets, public services are generally:
    ● Free at the point of delivery, so prices do not give us clues about quality or popularity.
    ● Not motivated by profits, so there is little incentive to highlight differences and encourage switching.
    ● Supplied under a universal service obligation, such that they serve a particularly broad range of users, from the very informed to the highly vulnerable.
    In the same way that comparison and feedback sites have developed for private markets, some choice-tools have already emerged for public services. For example, parents and prospective students can use league tables to compare school and university performance, while patients can access websites comparing waiting times for treatments across different healthcare providers, and feedback from fellow consumers about the performance of a local GP practice. Their role is likely to become more important in future as public service markets are opened up and there is scope for further choice-tools to be developed [Better Choices: Better Deals, p. 32]

    If you’re looking to put a bid or business plan together based on using public data as a basis for comparison services, the Better Choices document has more than a few quotable sections;-)

    [Related: Course Detective metasearch/custom search across UK University prospectus websites]

Immediate Thoughts on the “Provision of information about higher education”

Some immediate thoughts on reading the “Provision of information about higher education” consultation report. Note that the opinions expressed below may not even belong to me, let alone my employer. (They’re just imaginings… or nightmare visions…)

What I still need to do is try to find out how the requirement to provide KIS data over the coming months fits in with JISC’s current Grant Funding Call 8/11: ‘Course Data: Making the most of Course Information’ Capital Programme – Call for Letters of Commitment which is “designed to ensure a high number of engaged institutions, which is vital to get the critical mass needed to effectively demonstrate to the sector the huge potential of organising and presenting course information in a standardised way.” (The initial call is for £10k for each eligible UK HEI, and a second tranche of £40-80,000 for each of 80 or so plan execution projects. “Do the math”, as they say…) I don’t know how much HEFCE intend to give to UK HEIs to help underwrite the roll out of KIS (a fair chunk will go to the vendors that provide enterprise software to the HEIs, I guess..?) but I imagine that that will be a not insignificant sum. I just wonder what we’d have been able to do if we’d managed to get hold of the set of course code data that corresponds to the courses offered by UK HEIs? If UCAS would just relax their license conditions, I’m guessing we could even scrape the data and they wouldn’t even have to work out how to drop the corresponding table and make it available in some way… But if we respect their license conditions, we’re *****d.
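Doing that math on the JISC figures quoted above, as a rough sketch: the £10k, the £40-80k range and the "80 or so" projects come from the call as described; the number of institutions that take up the initial £10k isn't given here, so it is left as a parameter and the value of 100 used below is purely a placeholder assumption.

```python
# Rough funding arithmetic for the JISC call described above.
# From the text: £10k per eligible HEI (first tranche) and
# £40-80k for each of "80 or so" projects (second tranche).

FIRST_TRANCHE_PER_HEI = 10_000
SECOND_TRANCHE_RANGE = (40_000, 80_000)
SECOND_TRANCHE_PROJECTS = 80


def second_tranche_total(projects=SECOND_TRANCHE_PROJECTS):
    """Return the (low, high) total for the second tranche, in pounds."""
    low, high = SECOND_TRANCHE_RANGE
    return projects * low, projects * high


def first_tranche_total(eligible_heis):
    """Total for the first tranche; the HEI count is supplied by the caller."""
    return eligible_heis * FIRST_TRANCHE_PER_HEI


low, high = second_tranche_total()
print(f"Second tranche: £{low:,} to £{high:,}")

# ASSUMPTION: 100 is a placeholder, not a figure from the call.
print(f"First tranche at 100 HEIs: £{first_tranche_total(100):,}")
```

So the second tranche alone comes to somewhere between £3.2m and £6.4m, before counting whatever the first tranche adds up to.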

1. This is a joint publication by HEFCE, Universities UK (UUK) and GuildHE, setting out how it is intended to improve the accessibility and usefulness of information about higher education (HE).

Who says what’s useful?

6. Universities and colleges should publish Key Information Sets (KISs) for undergraduate courses, whether full- or part-time. These KISs will contain information on student satisfaction, graduate outcomes, learning and teaching activities, assessment methods, tuition fees and student finance, accommodation and professional accreditation.

A lot of this data is already available as public data from original sources, or via curated datastores such as the Guardian Datablog. What is lacking at the current time is the scaffolding that lets us create resources capable of spanning the sector at qualification level. Some time ago, I described a simple visual application for comparing summary statistics relating to satisfaction, fees, salary levels and so on across UK universities (Does Funding Equal Happiness in Higher Education?). That was a first step. The second step was to try to start building up information from the course level and begin using that as the focus for comparisons (as well as building out other services, such as book recommendations related to courses). Which was in part why I entered the TSO OpenUp competition.

Through HESA subject codes (which structure subject areas into a three level categorisation), it is possible to compare statistics relating to broad teaching subject offerings across multiple different providers within a particular topic area. Cross-relating teaching subject areas to research areas is still an ad hoc process though, as is obtaining research funding data from across the UK research funding councils and agencies, let alone trying to relate it to teaching subject areas. (Exploiting research for teaching is one of the claimed benefits of undergraduate study; maybe through making accessible an easy way of comparing the amount of research funding provided to particular institutions in different subject areas and the related teaching areas we might get a better handle on the actual relationship that exists between teaching and research excellence?)

Nor is it a simple matter to compare, in detail, the qualifications across teaching providers within a geographical area. The only place that currently describes all the current UK HE qualifications on offer each year is the UCAS website, which also acts as a gateway to applications to HE. One of the key considerations when developing comparison services is the extent to which a service can provide comprehensive coverage over the range of offerings that are being compared. In a very real sense, a comprehensive catalogue of offerings provides the key infrastructure that innovative third parties can build upon. By enriching and annotating a common, core dataset, vendors can develop differentiated services whilst maintaining a level of consistency between them (i.e. the services become comparable). An opportunity also arises for vendors to offer business to business services over that core data set.

The provision of a common key information set about each course/qualification within a university can thus be picked apart as follows:

– firstly, that there exists a comprehensive directory of courses;
– secondly, that for each course, there exists a common set of data attributes, aligned to a common scale;
– thirdly, that the information is provided in a consistent way so as to “support” comparison.

As I have already mentioned, there is a significant amount of data available in public through open licenses that could already be used for the provision of comparison services. What is missing is the scaffolding – the complete course catalogue – that allows this to be done reliably across the sector.

(There is also arguably a lack of opportunity in certain areas for business development. One model might see comparison services acting almost in the role of “independent” educational advisers, helping guide a potential student to an informed choice, and reaping some benefit from that process. For example, let us crudely model the student application lifecycle as: discovery (where to go/what to do), application, study, completion/graduation, employment. In the discovery phase, services might sell advertising, and pick up affiliate fees for prospectus requests for example. In a mature market, the application phase might also accommodate affiliate or referral fees, for example, based on encouraging applications, or even better, accepted and taken up applications. The financial services industry, for all its sins, supports a variety of models for repaying an agent who signs up a client to a longstanding financial product, replete with bonuses and other incentives that encourage the agent to find a product that the client will actually stick with. On completing a degree with a given grade, the agent may get a bonus. (Retention initiatives can start early, arguably before the student even accepts a place at university, through helping them make a decision regarding a course that is likely to suit them!) If you can imagine that universities might set up as recruitment agents, taking a fee for placing a graduate in a particular job on graduation, it’s not hard to also imagine that a bonus might be paid from that placement fee to the agent responsible for referring the undergraduate applicant, as was, in the first place.)

13. Institutions will be required to submit data to HEFCE for inclusion in the KIS. Institutions who subscribe to the QAA but who do not currently take part in the NSS and DLHE should take steps to do so.

So a data burden will be placed on institutions to provide information in a standard way to a central aggregating service? Will there be an opportunity for HEIs to publish this data via an open API, and allow HEFCE to pull/harvest the data from there? Or will the data be required to pass from the HEI, through HEFCE so that HEFCE can put a stamp of approval on it, before it is allowed to be branded as part of the institution’s KIS?

14. All KISs should be made available via institutional web-sites by the end of September 2012.

But will the KIS data also support services that allow the direct comparison of KIS data across institutions on first (university), second (HEFCE, UCAS, Unistats, etc.) and third (commercial, or not-for-profit) party sites without having to visit each of those institutions separately?

22. The plans are based on extensive research, consultation and pilot processes. We are very grateful to all who have given their time and views so generously. There were 215 responses to HEFCE 2010/31, all of which have been carefully considered. We have also taken into account: the views of 2,000 prospective and current students on useful information; several expert working groups considering specific parts of the KIS; a pilot with eight institutions; and user testing with more than 200 prospective HE students. We are particularly pleased to have engaged closely with the National Union of Students in this project, and to have received consultation responses from 30 student unions. We have also liaised with the Academic Registrars’ Council, in an attempt to ensure that the next steps are both feasible and proportionate to implement.

I wonder: did they also consult with open data advocates or web development companies who are familiar with putting data to work in a customer-facing, value adding way? To my shame, I didn’t respond – I came across the consultation after it had closed. (Which suggests the consultation didn’t reach out into that part of the open data community I inhabit? Or maybe I did see it and missed/didn’t pick up on the significance of it at the time:-(

27. The consultation made three primary proposals which are summarised in this section. The first question focused on the purposes of providing information about HE. Responses broadly agreed that information about HE has three purposes:
– to inform people about the quality of higher education and, in particular, to give prospective students information that will help them choose what and where to study
– as evidence for quality assurance processes in institutions
– as information that institutions can use to enhance the quality of their HE provision.

29. The consultation proposed that universities’ and colleges’ web-sites should use a standardised way of publishing key pieces of information about each undergraduate course they offer, by using KISs.

30. KISs would make it easier to find information that prospective students have identified as important to their decisions, and which is mostly already available. The categories of information were identified during research undertaken with 2,000 prospective students, current students and careers advisers by Oakleigh Consulting and Staffordshire University.

So the implication here is that I can compare the data, because each university will separately publish a standard set of data in the same format. So to compare 14 different courses across 8 universities, I probably need to have 14 browser windows open on the same screen at the same time?!
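To make the alternative concrete, here is a minimal sketch of what a comparison service would do once it had gathered those separately published records in one place: pivot them into a single table and rank them. The field names (`institution`, `course`, `satisfaction`, `fees`) and the sample values are made up for illustration; they are not taken from the KIS specification.

```python
# Hypothetical KIS-like records, one per course per institution.
# Field names and values are illustrative only, not the real KIS schema.
records = [
    {"institution": "Uni A", "course": "Physics BSc", "satisfaction": 88, "fees": 9000},
    {"institution": "Uni B", "course": "Physics BSc", "satisfaction": 91, "fees": 8500},
    {"institution": "Uni C", "course": "Physics BSc", "satisfaction": 84, "fees": 9000},
]


def comparison_table(records, fields):
    """Pivot per-institution records into {field: {institution: value}}."""
    table = {field: {} for field in fields}
    for record in records:
        for field in fields:
            table[field][record["institution"]] = record.get(field)
    return table


def rank_by(records, field, descending=True):
    """Return institution names ordered by the given field."""
    ordered = sorted(records, key=lambda r: r[field], reverse=descending)
    return [r["institution"] for r in ordered]


table = comparison_table(records, ["satisfaction", "fees"])
for field, row in table.items():
    print(field, row)
print(rank_by(records, "satisfaction"))
```

The pivot itself is trivial; the hard part is getting all the records into one list in the first place, which is exactly what a widget-per-website approach makes awkward.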

31. In parallel to the consultation, a programme of KIS development work was undertaken. This looked specifically at the information items that do not currently exist in a national comparable format (about learning and teaching, assessment, professional accreditation and accommodation costs) and piloting the processes institutions need to undertake to provide these data. There were also user tests with prospective students. For further details see Annex A.

So the consultation looked at what sort of data might be used to enrich the core data set. One might argue that if the core, course data set were available, third party comparison services might already have started to explore various ways of annotating, enriching and pivoting around the data?

36. The principle of the KIS is that it presents information we have identified that prospective students find useful, in a place we know they already look for such information. In summary, this is information on study, satisfaction, costs and employability, presented on the course information sections of institution’s own web-sites.

“[I]n a place we know they already look for such information”: you could read that as being anti-competitive…? I’d also argue that it doesn’t support the ability to make comparisons. I assume that energy suppliers and mobile phone operators publish similar sorts of information about tariffs on their websites? Why, then, do comparison sites exist?! I’d argue it’s not because they don’t have KIS tables on their sites (though that may contribute). Rather, it’s easier to make a comparison across sites in the context of a single location. (And here, I fear, I start to smell a trap… Because “a place we know they already look for such information” exists in the form of UCAS…)

46. There will be three categories of learning and teaching activities:
– scheduled learning and teaching activities
– guided independent study
– placement/study abroad.

47. Information on these will be presented in a bar chart, as a proportion of hours, on a year-by-year basis, showing each year/stage of study, rather than aggregated for the course as a whole. For KISs relating to part-time study, three bars should also be provided for a standard undergraduate course, each referring to the time equivalent to one year of study if studied full-time

48. In the interest of providing as much relevant information to the user as possible, a web-link would follow that would lead users to more detailed information. This might be the programme specification or other document, but we would expect this would present more detailed information about learning and teaching, for instance possibly module-level contact hours. This would provide useful contextualised data – something that was a strong theme emerging from consultation responses.

Being able to reliably identify links to programme specifications could be really handy, e.g. for things like the Course Detective approach to custom search engine development…?

67. The salaries for all institutions data will be adjusted to account for regional variations in the salaries earned by graduates in different parts of the country. A link from the KIS to institutional web-sites will enable institutions to provide additional contextual information with particular reference to the different circumstances of different employment sectors (for example the creative industries.)

I can see this causing all sorts of problems when it comes to offering comparisons?

92. Information derived from the NSS and DLHE survey will be presented at course level if sufficient data are available; otherwise NSS and DLHE data will be presented at the most detailed level possible of the Joint Academic Coding System (JACS), subject to the surveys’ response rates and threshold requirements. This information is held by HEFCE and HESA for publicly funded institutions and others that subscribe to HESA.

If data describing UK HE courses were freely available, work could already have started on this…?

93. Annex C provides a detailed breakdown of the expected coverage of the KIS for HEIs, but in summary:
a. The data thresholds we intend to apply to the NSS and DLHE data (which mirror the thresholds we apply on Unistats) mean that roughly one in seven single subject, full-time, first degree KISs will have both DLHE and NSS data available at course level, although in some cases the data presented may need to be aggregated across two years. However, over 95 per cent of KISs will be able to present DLHE or NSS data, or both, when data are included that is aggregated to JACS level 1 and across two years.
b. We expect that about 2 per cent of single subject, part-time, first degree KISs will have full data available; this rises to about 35 per cent when data are included which are aggregated to JACS level 1 and across two years.
c. We expect the KISs where full data are available to cover about 40 per cent of the student body; after allowing for aggregation, the proportion where some data are available is likely to cover over 90 per cent of the student body.

One argument against making a comprehensive course catalogue available under an open public license is that if it were to be used as scaffolding for aggregating different, comparative data sources, lack of coverage over the whole course listing would be confusing and offer a poor user experience. Err…? “[R]oughly one in seven single subject, full-time, first degree KISs will have both DLHE and NSS data available at course level”. So that reason isn’t a deal breaker, then?!

95. We recognise that, even aggregating data over years or over JACS levels, there will be, as on Unistats at present, a number of courses for which it will not be possible to provide data derived from the NSS or DLHE due to the small size of the student cohorts concerned. The thresholds for publication reflect both the need to ensure the statistical validity of the information and the need to meet data protection requirements. There will still be elements of the KIS that will be useful to prospective students, but we recognise the need to ensure prospective students do not negatively interpret the absence of data. We will undertake further user testing over the next few months to finalise appropriate explanatory text.

Ditto.

98. Consideration has been given to who should undertake the production of the KISs, and how. Requiring individual institutions to create their own KISs was considered, but it was felt to be problematic because it would place a significant burden on individual institutions and would pose a challenge in controlling the quality of – potentially – several hundred different production processes, hindering the creation of a single, uniform and credible information source. This task therefore needs to be undertaken by a single body.

So institutions are not going to have a new data burden placed on them?

99. The first year of KISs (those to be published in September 2012) will be centrally created by HEFCE in partnership with HESA. From year two onwards it is intended that central creation will pass to HESA.

100. In the first year, HEFCE will draw data from the NSS and DLHE and institutions will provide additional data (as set out in Table 1). Once this has been collated, HEFCE will provide institutions with web code to be inserted appropriately on their own web-sites.

Hmmm.. when I won the TSO Open Up competition, the plan was to get UCAS course code data and then start annotating and enriching it howsoever we could. The reason why I wanted the UCAS data is that it provides the scaffolding to build from. The user focus is the course, so it made sense to build up views over the data from the course level. (We could have started trying to build services out at the level of HESA codes, but that wasn’t what the prize was awarded for.) During the competition pitch, I made the claim that course code data was akin to postcode data to the extent that rights over the seemingly most useful identifier space were controlled by a restrictive license. I don’t yet know what services I want to build out over the course code space, but why is that a reason to prevent innovation in the development of services around course codes by locking those codes down?

103. In order for KISs to be published during September 2012, for use by applicants for entry in academic year 2013-14, institutions must submit their data returns to HEFCE by summer 2012.

So the data burden is on the universities?! But the aggregation – where the value is locked up – is under the control of the centre? Hmmm… thinks… SCONUL charge 80 quid (?) for their aggregated report on HE library stats data, but I’ve managed to FOI the return made to SCONUL by individual libraries. So if there is a KIS like return from HEIs to HEFCE, it should be FOIable, and we can create a copy of the aggregate by aggregating FOI requests. Hmmm…

105. In the main, we would expect the KIS to be revised at most annually; however, a system will be set up to enable exceptions to be processed, for example, corrections to be made or financial information updated. More detail will follow in the technical guidance.

Another of the arguments I’ve heard – this time from universities – to explain a university’s unwillingness to publish a course data API or data dumps is that a third party that aggregates data from universities may end up with data that is stale or out of synch with data on the university website. I suspect that a third party would be quicker to respond to changes than once every 52 weeks…

106. HEFCE is in discussion with the primary providers of institutional data management software to ensure that the new data requirements for the KIS can be incorporated into existing applications as soon as possible.

So how much do we think the third party software vendors are going to claim for to make the changes to their systems? And hands up who thinks that those changes will also be antagonistic to developers who might be minded to open up the data via APIs? After all, if you can get data out of your commercially licensed enterprise software via a programmable API, there’s less requirement to stump up the cash to pay for maintenance and the implementation of “additional” features…

107. The KIS will have a strong brand, including a unique logo. This is to ensure that the KIS is as engaging to users as possible, as well as distinguishing it from any other information sources available.

…which sounds to me like someone’s twigged there may be value locked up in the data, and they’re not willing to let it go…

108. A core feature of the KIS is that it is standardised and comparable across HEIs, with consistent branding and presentation. Therefore, in order to avoid confusion, institutions should not publish a document called the KIS or with the KIS logo for any courses where not required.

Brand police… Total ownership. We can haz ur data; we pwn ur data.

110. It is likely the KIS for each course will be available through an embedded ‘widget’ on the institution’s web-site. We do not intend to be prescriptive about where on the web-site this should appear, other than that it should be found near other course information. The widget would contain three items of top-line information, and the option to click through for the full KIS.

Hmmm… did somebody just discover widgets?! So the idea here is to control the brand through a KIS branded widget that can be embedded on University websites?

The obvious question to an open data freak would be: will there be a freely available open API with that, and will the data made available through the API be openly licensed?

The open systems advocate in me would also wonder whether there might be mileage in each institution publishing its own data in an open way via an open API that could be harvested by the central HEFCE aggregator or by a third party. In addition, the KIS data would be available as a service within the institution to institutional developers.
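As a rough sketch of that harvesting model, and nothing more than that: assume each institution exposes a JSON feed of KIS-like records at some URL. The URLs, the record fields and the idea that the feed is a plain JSON list are all assumptions made for illustration; no such API is described in the report. The fetch step is passed in as a function so the sketch can be run offline with a stand-in.

```python
import json
from urllib.request import urlopen


def fetch_json(url):
    """Fetch and decode a JSON document from a URL."""
    with urlopen(url, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


def harvest(endpoints, fetch=fetch_json):
    """Pull KIS-like records from each institution's (hypothetical) feed.

    `endpoints` maps an institution name to its feed URL. Failures are
    recorded rather than raised, so one broken feed does not stop the run.
    """
    harvested, failures = [], {}
    for institution, url in endpoints.items():
        try:
            for record in fetch(url):
                harvested.append({**record, "institution": institution})
        except Exception as error:  # broad on purpose: log any per-feed failure
            failures[institution] = str(error)
    return harvested, failures


def demo_fetch(url):
    """Stand-in for a real HTTP call so the sketch runs without a network."""
    feeds = {
        "https://uni-a.example/kis.json": [
            {"course": "Physics BSc", "satisfaction": 88},
        ],
    }
    if url not in feeds:
        raise IOError(f"no feed at {url}")
    return feeds[url]


# Hypothetical endpoints; the second one deliberately has no feed behind it.
endpoints = {
    "Uni A": "https://uni-a.example/kis.json",
    "Uni B": "https://uni-b.example/kis.json",
}
records, failures = harvest(endpoints, fetch=demo_fetch)
print(records)
print(failures)
```

The same `harvest` function would serve a central aggregator, a third party, or an in-house developer equally well, which is rather the point of publishing at source.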

113. In HEFCE 2010/31 we suggested that the KIS should be accessible from the UCAS web-site. Although it was pointed out that not all applications go through UCAS, there was broad support for this approach in the consultation and discussions with UCAS are continuing. UCAS is keen to link the KIS to its site and to explore the possibility of incorporating a comparison function into its planned ‘course finder’ facility, for all courses there are KISs for (including part-time courses), not just those they process applications for.

So HEFCE want to run a data service…?! Will it be an open data service? Or are HEFCE going to get a copy of the course code scaffolding grail and use it to act as infrastructure for a data service that aggregates and re-presents data that is in part already largely available, albeit in a less structured way, via a branded and content controlled widget?

114. We would also like to work with other organisations that provide student information on HE and other related careers guidance. We are keen to promote and publicise the KIS through the various student web-sites and social media outlets that exist.

So will third parties be encouraged and supported in developing their own takes on enriched KIS data?

116. Because KISs will be created centrally, a central database of KISs will be available. HEPISG needs to consider how to use this information, recognising the Government’s intention that data on publicly funded provision should be available for general use. More information will be published on the HEFCE web-site in due course.

Ah – so the data may be available via an open public license. Tip to the HEPISG folk: why not build an API around the data, and serve the widget from that? Furthermore, by making the core course code data available as a dataset, third parties would be able to annotate and enrich that data and serve it as additional information around the “officially sanctioned” KIS data pulled from the API. Finally, a question: if third parties are going to use clustering techniques so that they can provide recommendations on “similar” courses, will they have access to the whole KIS data set so that they can run their own clustering algorithms?!

119. Currently, there is information available via Unistats that will not be available through the KIS. We do not envisage, therefore, that any changes will be made to the Unistats web-site in the KIS’ first year of operation. The focus will be on ensuring that the KIS is available on institutional web-sites as advised in the Oakleigh Consulting and Staffordshire University research, with links to, and from, the UCAS web-site.

So if students want to compare courses, they need to go to N different pages to find the KIS widget on each, and then go and fight with the Unistats website?

120. However, we recognise that, in the longer term, there will be a need to revisit the arrangements to ensure we meet the needs of students for good access to information and that we secure the best use of public money and institutional time. As we move to more established arrangements for the creation and maintenance of the KIS, and look at the use of potential sites for comparing information, we will consider the future of Unistats in the light of the wider policy environment for higher education.

Just open the course/qualification scaffolding data

123. As well as the KIS and Unistats data, a wider set of information is to be made available by all publicly funded HEIs, FECs with undergraduate provision, and private providers who subscribe to the QAA.

This is the sort of thing third parties might be keen to develop. But to scaffold the collection and delivery of the additional data annotations, the course data could be really handy…

[The paper goes on a bit more, but it’s making me angry so I figure I need to take a break!]

Several Million Up for Grabs in JISC ‘Course Data’ Call. On the Other Hand…

I notice that there’s a couple of days left for institutions to get £10k from JISC in order to look at what it would take to start publishing course data via XCRI feeds, with another £40-80k each for up to 80 institutions to do something about it (JISC Grant Funding 8/11: JISC ‘Course Data: Making the most of Course Information’ Capital Programme – Call for Letters of Commitment; see also Immediate Impressions on JISC’s “Course Data: Making the most of Course Information” Funding Call, as well as the associated comments):

As funding for higher education is reduced and the cost to individuals rises, we see a move towards a consumer-led market for education and increased student expectations. One of the key themes about delivering a better student experience discussed in the recent Whitepaper mentions improving the information available to prospective students.

Nowadays, information about a college or university is more likely found via a laptop than in a prospectus. In this competitive climate publicising courses while embracing new technologies is ever more important for institutions.

JISC have made it easier for prospective students to decide which course to study by creating an internationally recognised data standard for course information, known as XCRI-CAP. This will make transferring and advertising information about courses between institutions and organisations, more efficient and effective.

The focus of this new programme is to enable institutions to publish electronic prospectus information in a standard format for all types of courses, especially online, distance, part time, post graduate and continuing professional development. This standard data could then be shared with many different aggregator agencies (such as UCAS, the National Learning Directory, 14-19 Prospectus websites, or new services yet to be developed) to collect and share with prospective student

All well and good, but:

– there still won’t be a single, centralised directory of UK courses, the sort of thing that can be used to scaffold other services. I know it isn’t perfect, but UCAS has some sort of directory of UK undergrad HE courses that can be applied for via central clearing, but it’s not available as open data.

– the universities are being offered £10k each to explore how they can start to make more of their course data. There seems to be the expectation that some good will follow, and aggregation services will flower around this data (“This standard data could then be shared with many different aggregator agencies (such as … new services yet to be developed)”). I think they might too. (For example, we’re already starting to see sites like Which University? provide shiny front ends to HESA and NSS data.) But why should these aggregation sites have to wait for the universities to scope out, plan, commission, delay and then maybe or maybe not deliver open XCRI feeds? (Hmm, I wonder: does the JISC money place any requirements on universities making their XCRI-CAP feeds available under an open license that allows commercial reuse?)

When we cobbled together the Course Detective search engine, we exploited Google’s index of UK HE websites to provide a customised search over the course prospectus webpages on those sites. Being a Google Custom Search Engine, there’s only so much we can do with it, but whilst we wait for all the UK HEIs to get round to publishing course marketing feeds, it’s a start.

Of course, if we had our own index, we could offer a more refined search service, with all sorts of potential enhancements and enrichment. Which is where copyright kicks in…

…because course catalogue webpages are generally copyright the host institution, and not published under an open license that allows for commercial reuse.

(I’m not sure how the law stands against general indexing for web search purposes vs indexing only a limited domain (such as course catalogue pages on UK HEI websites) vs scraping pages from a limited domain (such as course catalogue pages on UK HEI websites) in order to create a structured search engine over UK HE course pages. But I suspect the latter two cases breach copyright in ways that are harder to argue your way out of than a “we index everything we can find, regardless” web search engine. (I’m not sure how domain limited Google CSEs figure either? Or folk who run searches with the site: limit?))

To kickstart the “so what could we do with a UK wide aggregation of course data?” conversation, I wonder whether UK HEIs who are going to pick up the £10k from JISC’s table might also consider doing the following:

– licensing their course catalogue web pages with an open, commercial license (no one really understands what non-commercial means…and the aim would be to build sustainable services that help people find, in a fair (open algorithmic) way, courses that they might want to take…)

– publishing a sitemap/site feed that makes it clear where the course catalogue content lives (as a starter for 10, we have the Course Detective CSE definition file [XML]). That way, the sites could retain some element of control over which parts of the site good citizen scrapers could crawl. (I guess a robots.txt file might also be used to express this sort of policy?)
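By way of illustration, here’s a minimal sketch of how a good citizen scraper might check that sort of policy before crawling, using Python’s standard library robots.txt parser. Everything specific in it is an assumption made up for the example: the example.ac.uk domain, the /courses/ path, the sitemap location and the course-scraper user agent name are all hypothetical, and a real scraper would fetch the robots.txt file from the university site rather than embedding it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that a university might publish to open up
# its course catalogue pages while keeping the rest of the site closed.
ROBOTS_TXT = """\
User-agent: *
Allow: /courses/
Disallow: /

Sitemap: https://www.example.ac.uk/courses/sitemap.xml
"""


def build_policy(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a policy object (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_crawl(policy: RobotFileParser, url: str, agent: str = "course-scraper") -> bool:
    """Return True if the policy lets the (hypothetical) agent fetch the URL."""
    return policy.can_fetch(agent, url)


policy = build_policy(ROBOTS_TXT)

# Course catalogue pages are open to crawlers...
print(may_crawl(policy, "https://www.example.ac.uk/courses/physics-bsc"))
# ...but the rest of the site is not.
print(may_crawl(policy, "https://www.example.ac.uk/staff/intranet"))
# Sitemap URLs declared in the file (requires Python 3.8 or later).
print(policy.site_maps())
```

Note that Python’s parser applies the first rule that matches, which is why the Allow line comes before the blanket Disallow; other crawlers may resolve conflicting rules differently, so a real policy file would need testing against the crawlers it’s aimed at.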

The license would allow third parties to start scraping and indexing course catalogue content, develop normalised forms of that data, and start working on discovery services around that data. A major aim of such sites would presumably be to support course discovery by potential students and their families, and ultimately drive traffic back to the university websites, or on to the UCAS website. Such sites, once established, would also provide a natural sink for XCRI-CAP feeds as and when they are published (although I suspect JISC would also like to be able to run a pilot project looking at developing an aggregator service around XCRI-CAP feeds as well;-) In addition, the sites might well identify additional – pragmatic – requirements on other sorts of data that might contribute to intermediary course discovery and course comparison sites.

It’s already looking as if the KIS – Key Information Set – data that will supposedly support course choice won’t be as open as it might otherwise be (e.g. Immediate Thoughts on the “Provision of information about higher education”); it would be a shame if the universities themselves also sought to limit the discoverability of their courses via cross-sector course discovery sites…