TSO OpenUP Competition – Opening Up UCAS Data

Here’s the presentation I gave to the judging panel at the TSO OpenUp competition final yesterday. As ever, it doesn’t make sense with[out] (doh!) me talking, though I did add some notes in to the Powerpoint deck: Opening up UCAS Course Code Data

(I had hoped Slideshare would be able to use the notes as a transcript, but it doesn’t seem to do that, and I can’t see how to cut and paste the notes in by hand?:-(

A quick summary:

The “Big Idea” behind my entry to the TSO competition was a simple one – make UCAS course data (course code, title and institution) available as data. By opening up the data we make it possible for third parties to construct services and applications based around a complete data skeleton of all the courses offered for undergraduate entry through clearing in a particular year across UK higher education.
The data acts as scaffolding that can be used to develop consumer facing applications across HE (e.g. improved course choice applications) as well as support internal “vertical” activities within HEIs that may also be transferable across HEIs.
Primary value is generated from taking the course code scaffolding and annotating it with related data. Access to this dataset may be sold on in a B2B context via data platform services. Consumer facing applications with their own revenue streams may also be built on top of the data platform.
This idea makes data available that can potentially disrupt the current discovery model for course choice and selection (but in its current form, not in university application or enrollment), in Higher Education in the UK.

Here are the notes I doodled to myself in preparation for the pitch. Now the idea has been picked up, it will need tightening up and may change significantly! ;-) Which is to say – in this form, it is just my original personal opinion on the idea, and all ‘facts’ need checking…

  1. I thought the competition was as much about opening up the data as anything… So the original idea was simply that it would be really handy to have machine readable access to course code and course name information for UK HE courses from UCAS – which is presumably the closest thing we have to a national catalogue of higher education courses.

    But when selected to pitch the idea, it became clear that an application or two were also required, or at least some good business reasons for opening up this data…

    So here we go…

  2. UCAS is the clearing house for applying to university in the UK. It maintains a comprehensive directory of HE courses available in the UK.

    Postgraduate students and Open University students do not go through UCAS. Other direct entry routes to higher education courses may also be available.

    According to UCAS, in 2010, there were 697,351 applicants with 487,329 acceptances, compared with 639,860 applications and 481,854 acceptances in 2009. [ Slightly different figures in end of cycle report 2009/10? ]

    For convenience, hold in mind the thought that course codes could be to course marketing, what postcodes are for geo related applications… They provide a natural identifier that other things can be associated with.

    Associated with each degree course is a course code. UCAS course codes are also associated with JACS codes – Joint Academic Coding System identifiers – that relate to particular topics of study. “The UCAS course codes have no meaning other than “this course is offered by this institution for this application cycle”.” [link]

    “UCAS course code is 4 character reference which can be any combination of letters and numbers.

    Each course is also assigned up to three JACS (Joint Academic Coding System) codes in order to classify the course for *J purposes. The JACS system was introduced for 2002 entry, and replaced UCAS Standard Classification of Academic Subjects (SCAS). Each JACS code consists of a single letter followed by 3 numbers. JACS is divided into subject areas, with a related initial letter for each. JACS codes are allocated to courses for the *J return.

    The JACS system is used by the Higher Education Statistics Agency (HESA), and is the result of a joint UCAS-HESA subject code harmonization project.

    JACS is also used by UK institutions to identify the subject matter of programmes and modules. These institutions include the Department for Innovation, Universities and Skills (DIUS), the Home Office and the Higher Education Funding Council for England (HEFCE).”

    Keywords: up to 10 keywords are allocated to each course from a restricted list of just over 4,500 valid keywords.
    “Main keyword: This is generally a broad subject category, usually expressed as a single word, for example ‘Business’.
    Suggested keyword (SUG): Where a search on a main keyword identifies more than 200 courses, the Course Search user is prompted to select from a set of secondary keywords or phrases. These are the more specific ‘Suggested keywords’ attached to the courses identified. For example, ‘Business Administration’ is one of a range of ‘Suggested keywords’ which could be attached to a Business course (there are more than 60 others to choose from). A course in Business Administration would typically have this as the ‘Suggested keyword’, with ‘Business’ as the main keyword.
    However, if a course only has a ‘Suggested keyword’ and not a related ‘Main keyword’, the course will not be displayed in any search under the ‘Main keyword’ alone.

    Single subject: Main keywords can be ticked as ‘Single subject’. This means that the course will be displayed by a keyword search on the subject, when the user chooses the ‘single subject’ option below. You may have a maximum of two keywords indicated as single subjects per course.”

    “Between January and March 2010, approximately 600,000 unique IP addresses access the UCAS course code search function. During the same time period, almost 5 million unique IP addresses accessed the UCAS subject search function.” [link]

    “New courses from 2012 will be given UCAS codes that should not be used for subject classification purposes. However, all courses will still be assigned up to three individual JACS3 codes based on the subject content of the course.

    An analysis of unique IP address activity on the UCAS Course Search has shown that very few searches are conducted using the course code, compared to the subject search function. UCAS Courses Data Team will be working to improve the subject search and course keywords over the coming year to enable potential applicants to accurately find suitable courses.” [link]

    Course code identifiers have an important role to play within university administrations, for example in marshalling resources around a course, although they are not used by students. (On the other hand, students may have a familiarity with module codes.) Course codes identify courses that are the subject of quality assessment by the QAA. To a certain extent, a complete catalogue of course codes allows third parties to organise offerings based around UK higher education degrees in a comprehensive way and link in to the UCAS application procedure.

  3. If released as open data, and particularly as Linked Open Data, the course data can be used to support:
    – the release of horizontal data across the UK HE sector by HEIs, such as course catalogue information;
    – vertical scaffolding within an institution for elaboration by module codes, which in turn may be associated with module descriptions, reading lists, educational resources, etc.
    – the development across HE of services supporting student choice – for example “compare the uni” type services
  4. At the moment the data is siloed inside UCAS behind a search engine with unfriendly session based URLs and a poor results UI. Whilst it is possible to scrape or crowd-source course code information, such ad hoc collection mechanisms run the danger of being far from complete, which means that bias may be introduced into the collection as a side effect of the collection method.
  5. Making the data available via an API or Linked data store makes it easier for third parties to build course related services of whatever flavour – course comparison sites, badging services, resource recommendation services. The availability of the data also makes it easier for developers within an institution to develop services around course codes that might be directly transferable to, or scaleable across, other institutions.
  6. What happens if the API becomes writeable? An appropriately designed data store, and corresponding ingest routes, might encourage HEIs to start releasing the course data themselves in a more structured way.

    XCRI is JISC’s preferred way of doing this, and I think there has been some lobbying of HEFCE from various JISC projects, but I’m not sure how successful it’s been?

  7. Ultimately, we might be able to aggregate data from locally maintained data stores. Course marketing becomes a feature of the Linked Data cloud.

    Also context of data burden on HEIs, reporting to Professional, Statutory and Regulatory Bodies – PSRBs.

    Reconciliation with HESA Institution and campus identifiers, as well as the JISCMU API and Guardian Datablog Rosetta Stone spreadsheet

    By hosting course code data, and using it as scaffolding within a Linked Data cloud around HE courses, a valuable platform service can be made available to HEIs as well as commercial operations designed to support student choice when it comes to selecting an appropriate course and university.

  8. Several recent JISC projects have started to explore the release of course related activity data on the one hand, and Linked Data approaches to enterprise wide data management on the other. What is currently lacking is a national data-centric view over all HEI course offerings. UCAS has that data.

    Opening up the data facilitates rapid innovation projects within HEIs, and makes it possible for innovators within an HEI to make progress on projects that span across course offerings even if they don’t have easy access to that data from their own institution.

  9. Consumer services are also a possibility. As HEIs become more businesslike, treating students as customers, and paying customers at that, we might expect to see the appearance of university course comparison sites.

    CompareTheUni has had a holding page up for months – but will it ever launch? Uni&Books crowd sources module codes and associated reading links. Talis Aspire is a commercial reading list system that associates resources with module codes.

  10. Last year, I pulled together a few separate published datasets and threw them into Google Fusion Tables, then plotted the results. The idea was that you could chart research ratings against student satisfaction, or drop out rates against the academic pay. [link ]

    Guardian datablog picked up the post, and I still get traffic from there on a daily basis… [link ]

  11. The JISC MOSAIC Library data challenge saw Huddersfield University open up book loans data associated with course codes – so you could map between courses and books, and vice versa (“People who studied this course borrowed this book”, “this book was referred to by students on this course”)

    One demonstrator I built used a bookmarklet to annotate UCAS course pages with a link to a resource page showing what books had been borrowed by students on that course at Huddersfield University. [Link ]

  12. Enter something like compare the uni, but data driven, and providing aggregated views over data from universities and courses.
  13. To set the scene, the site needs to be designed with a user in mind. I see a 16-17 year old, slouching on the sofa, TV on with the most partial of attention being paid to it, laptop or tablet to hand and the main focus of attention. Facebook chat and a browser are grabbing attention on screen, with occasional distractions from the TV and mobile phone.
  14. The key is course data – this provides a natural set of identifiers that span the full range of clearing based HE course offerings in the UK and allows third parties to build services on this basis.

    The course codes also provide hooks against which it may be possible to deploy mappings across skills frameworks, e.g. SFIA in IT world. The course codes will also have associated JACS subject code mappings and UCAS search terms, which in turn may provide weak links into other domains, such as the world of books using vocabularies such as the Library of Congress Subject headings and Dewey classification codes.

  15. Further down the line, if we can start to associate module codes with course codes, we can start to develop services to support current students, or informal learners, by hooking in educational resources at the module level.
  16. Marketing can go several ways. For the data platform, evangelism into the HE developer community may spark innovation from within HEIs, most likely under the auspices of JISC projects. Platform data services may also be marketed to third party developers and innovators/entrepreneurs.

    Services built on top of the data platform will need to be marketed to the target audience using appropriate channels. Specialist marketers such as Campus Group may be appropriate partners here.

  17. The idea pitched is disruptive in that one of the major competitors is at first UCAS. However, if UCAS retains its unique role in university application and clearing, then UCAS will still play an essential, and heavily trafficked, role in undergraduate student applications to university. Course discovery and selection will, however, move away from the UCAS site towards services that better meet the needs of potential applicants. One then might imagine UCAS becoming a B2B service that acts as intermediary between student choice websites and universities, or even entertain a scenario in which UCAS is disintermediated and some other clearing mechanism instituted between universities and potential-student facing course choice portals.
  18. According to UCAS, between January and March 2010 “almost 5 million unique IP addresses accessed the UCAS subject search function” [link] In each of the last couple of years, the annual intake has been of the order of 500,000 students, on around 600,000 applicants. If a service reaches 10% of applicants and generates £5 per applicant, that’s £300k pa. £10 each from 20% of intake, that’s £1M pa. £50 each from 40% is £10M. I haven’t managed to find out what the acquisition cost of a successful applicant is, or the advertising budget allocated to an undergraduate recruitment marketing campaign, but there are 200 or so HE institutions (going by the number of allocated HESA institution codes).

    For a platform business – e.g. a business model based around selling queries on linked/aggregated/mapped datasets. If you imagine a query returning results with several attributes, each result is a row and each attribute is a column. If you allow free access to x thousand query cells returned a day, and then charge for cells above that limit, you:
    – encourage wider innovation around your platform; let people run narrow queries or broad queries. License on use of data for folk to use on their own datastores/augmented with their own triples;
    – generate revenue that scales on a metered basis according to usage;
    – offer additional analytics that get your tracking script in third party web pages, helping train your learning classifiers, which makes the platform more valuable.

    For a consumer facing application – e.g. a course choice site for potential applicants is the easiest to imagine:
    – Short term model would be advertising (e.g. course/uni ads), affiliate fees on book sales for first year books? Second hand books market e.g. via Facebook marketplace?
    – Medium term – affiliate fee for prospectus application/fulfilment
    – Long term – affiliate fee for course registration

  19. At the end of the day, if the data describing all the HE courses available in the UK is available as data, folk will be able to start to build interesting things around it…
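
The back-of-envelope sums in note 18 can be checked with a few lines of Python. This is only a sketch: the applicant and intake counts are the rounded figures from the note, and the metered tariff (free allowance, price per cell, query size) is entirely hypothetical, since the notes only say "x thousand" free cells a day.

```python
# Rounded figures taken from note 18.
APPLICANTS = 600_000  # approximate applicants per year
INTAKE = 500_000      # approximate accepted students per year


def annual_revenue(population: int, percent_reached: int, pounds_per_head: int) -> int:
    """Revenue in pounds if percent_reached% of population each generate pounds_per_head."""
    return population * percent_reached // 100 * pounds_per_head


def metered_charge(rows: int, columns: int, free_cells: int, pence_per_cell: int) -> int:
    """Charge in pence for one day of queries.

    A query result is a table of rows x columns cells; the first
    free_cells cells are free and the rest are billed per cell.
    The tariff values passed in below are assumptions, not from the notes.
    """
    cells = rows * columns
    billable = max(0, cells - free_cells)
    return billable * pence_per_cell


# The three consumer scenarios from the note.
print(annual_revenue(APPLICANTS, 10, 5))   # 10% of applicants at £5 each
print(annual_revenue(INTAKE, 20, 10))      # 20% of intake at £10 each
print(annual_revenue(INTAKE, 40, 50))      # 40% of intake at £50 each

# A hypothetical platform customer pulling 2,000 rows of 6 attributes in a day,
# with 10,000 free cells and 1p per cell after that.
print(metered_charge(rows=2_000, columns=6, free_cells=10_000, pence_per_cell=1))
```

Charging per cell rather than per query means a narrow query over many courses and a broad query over a few cost about the same, which is the property the note is after.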

Getting Access to University Course Code Data (or not… (yet…))

A couple of weeks or so ago, having picked up the TSO OpenUp competition prize for suggesting that it would be a Good Thing for UCAS/university course code data to be made available, I had a meeting with the TSO folk to chat over “what next?” The meeting was an upbeat one with a plan to get started as soon as possible with a scrape of the UCAS website… so what’s happened since…?

First up – a reading of the UCAS website Terms and Conditions suggests that scraping is a no-no…

6. Intellectual property rights
e. Copying, distributing or any use of the material contained on the website for any commercial purpose is prohibited.
f. You may not create a database by systematically downloading substantial parts of the website

(In the finest traditions of the web, you aren’t allowed to deep link into the site without permission either: 6.c Links to the website are not permitted, other than links to the homepage for your personal use, except with our prior written permission. Links to the website from within a frameset definition are not permitted except with our prior written permission.)

So, err, I guess my link to the terms and conditions breaks those terms and conditions? Oops…;-) Should I be sending them something like this do you think?

Dear enquiries@ucas.ac.uk,
As per your terms and conditions, (paragraph 6 c) please may I publish a link to your terms and conditions web page [ http://www.ucas.com/terms_and_conditions ] in a blog post I am writing that, in part, refers to your terms and conditions?
Luv'n'hugs,
tony

As a fallback, I put a couple of trial balloon FOI requests in to a couple of universities asking for the course names and UCAS course codes for courses offered in 2010/11, along with the search keywords associated with each course (doh! I did it again, deep linking into the UCAS site…)

PS Please may I also link to the page describing course search keywords [ http://www.ucas.com/he_staff/courses/coursesearchkeywords ] ?

The first request went to the University of Southampton, in part because I knew that they already publish chunks of the data (as data) as part of their #opensoton Open Data initiative. (This probably means I was abusing the FOI system, but a point maybe needed to be made…?!;-) The second request was put in to the University of Bristol.

The requests were of the form:

I would be grateful if you could send me in spreadsheet, machine readable electronic form or plain text a copy of the course codes, course titles and search keywords for each course as submitted to UCAS for the 2010-2011 (October 2010) student entry.

If possible, would you also provide HESA subject category codes associated with each course.

So how did I get on?

Bristol’s response was as follows:

On discussion with our Admissions and Student Information teams, it appears that the University does not actually hold this data – it is held on a UCAS database. UCAS are not currently subject to the Freedom of Information Act (they will be in due course) but it may be worth talking to them directly to see if they are willing to assist.

And Southampton’s FOI response?

Course codes and titles may be found here: http://www.soton.ac.uk/corporateservices/foi/request-66210-6124d691.pdf Keywords were not held by the University – you should inquire with UCAS (http://www.ucas.com). HESA subject category codes may be found here: http://www.hesa.ac.uk/index.php/content/view/1806/296/

So what did I learn?

  1. I don’t seem to have made it clear enough to Southampton that I wanted the 2-tuple (course code, HESA code) for each course. So how should I have asked for that data? (The response pointed me to the list of all HESA codes. What I wanted was, for each course code, the course code/HESA code pair.)
  2. Generalising from an example of one;-), there seems to be a disconnect between FOI and open data branches of organisations. In my ideal world, the FOI person (an advocate for the person making the request) would also be on good terms with the Open Data team in the organisation, if not a data wrangler themselves. For data requests, the FOI person would make sure the data is released as open data as part of the process of fulfilling the request and then refer the person making the request to the open data site (see also: Open Data Processes – Taps, Query Paths/Audit Trails and Round Tripping). Southampton have part of this process already – the course data is in a PDF on their site and I was referred to it. (Note that the PDF is not just any PDF – have a look at it! – rather than the spreadsheet, machine readable electronic form or plain text I requested, even though @cgutteridge had posted a link to the SPARQL opendata query for the course code/UCAS code information I’d requested as a reply to my FOI request on the WhatDoTheyKnow site.)
  3. Universities don’t necessarily have any record of the search keywords they associate with the courses they post on UCAS. The UCAS website suggests that (doh!) “[r]ecent analysis of unique IP address use of the UCAS Course Search indicates that the subject search is by far the most popular of the 3 search options currently available”, such that “[w]hen an applicant uses our Course Search facility to search for available courses, they can choose a keyword by which to search, known as the ‘subject search’.” Which is to say, universities have no local record of the terms they use to describe courses that are the primary way of discovering their courses on UCAS? Blimey… (I wonder how much universities spend on Google AdWords for advertising particular courses on their own course prospectus websites and how they go about selecting those terms?)
  4. Asking for a machine readable “data as data” response has no teeth at the current time. I don’t know if the Protection of Freedoms bill clause that “extends Freedom of Information rights by requiring datasets to be available in a re-usable format” will change this? It seems like it might?

    Where—
    (a) an applicant makes a request for information to a public authority in respect of information that is, or forms part of, a dataset held by the public authority, and
    (b) on making the request for information, the applicant expresses a preference for communication by means of the provision to the applicant of a copy of the information in electronic form, the public authority must, so far as reasonably practicable, provide the information to the applicant in an electronic form which is capable of re-use.

  5. So what next? UCAS is a charity that appears to be operated by, for, and on behalf of UK Higher Education (e.g. UCAS Directors’ Report and Accounts 2009). Whilst not FOIable yet, it looked set to become FOIable from October 2011 (Ministry of Justice: Greater transparency in Freedom of Information), though I haven’t been able to find the SI and commencement date that enact this…? If it does become FOIable, we may be able to get the data out that way (although memories of the battle between open data advocates and the Ordnance Survey come to mind…) Hopefully, though, we’ll be able to get the data open by more amicable means before then…:-)

    PS a couple of other things that I’ve been dipping into relating to this project. Firstly, the UCAS Business Plan 2009-2012 (doh!):

    PPS Please may I also link to your Corporate Business Plan 2009-2012 [ http://www.ucas.com/documents/corporate/corpbusplan09-12.pdf ]

    Secondly, the Cabinet Office’s “Better Choices: Better Deals” strategy document [PDF], which as well as its “MyData” right to personal data initiative, also encourages business to put their information (and data…) to work. Whether or not you agree that more information may help to make for better choices from potential students, or that comparison sites have a role to play in this, the UK government appears to believe it and looks set to support the development of businesses operating in this area. For example:

    Effective consumer choices are also important in the public sector – such as decisions about what and where to study.
    However, unlike in private markets, public services are generally:
    ● Free at the point of delivery, so prices do not give us clues about quality or popularity.
    ● Not motivated by profits, so there is little incentive to highlight differences and encourage switching.
    ● Supplied under a universal service obligation, such that they serve a particularly broad range of users, from the very informed to the highly vulnerable.
    In the same way that comparison and feedback sites have developed for private markets, some choice-tools have already emerged for public services. For example, parents and prospective students can use league tables to compare school and university performance, while patients can access websites comparing waiting times for treatments across different healthcare providers, and feedback from fellow consumers about the performance of a local GP practice. Their role is likely to become more important in future as public service markets are opened up and there is scope for further choice-tools to be developed [Better Choices: Better Deals, p. 32]

    If you’re looking to put a bid or business plan together based on using public data as a basis for comparison services, the Better Choices document has more than a few quotable sections;-)

    [Related: Course Detective metasearch/custom search across UK University prospectus websites]
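
As a footnote to point 1 above, here is a rough sketch of the shape of data I was after: one (course code, subject code) row per pairing, written out as CSV. The code formats are taken from the UCAS descriptions quoted in the pitch notes (a 4 character course code, and JACS codes made of a single letter followed by 3 numbers, with up to three per course). The sample codes are invented for illustration and are not real course data, and the assumption that the letter is upper case is mine.

```python
import csv
import io
import re

# Formats as described by UCAS: 4 characters, any mix of letters and numbers...
UCAS_CODE = re.compile(r"[A-Za-z0-9]{4}")
# ...and a single letter followed by 3 numbers (upper case letter is an assumption).
JACS_CODE = re.compile(r"[A-Z][0-9]{3}")

# Hypothetical sample: course code -> list of subject codes (invented values).
COURSES = {
    "ZZ01": ["Z999"],
    "Z9Y8": ["Z100", "Y200"],
}


def code_pairs(courses):
    """Yield one (course code, subject code) pair per subject code on each course."""
    for course_code, subject_codes in courses.items():
        if not UCAS_CODE.fullmatch(course_code):
            raise ValueError(f"unexpected course code: {course_code!r}")
        if len(subject_codes) > 3:
            raise ValueError(f"more than three subject codes on {course_code!r}")
        for subject_code in subject_codes:
            if not JACS_CODE.fullmatch(subject_code):
                raise ValueError(f"unexpected subject code: {subject_code!r}")
            yield course_code, subject_code


def to_csv(pairs):
    """Serialise the pairs as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["course_code", "subject_code"])
    writer.writerows(pairs)
    return buffer.getvalue()


print(to_csv(code_pairs(COURSES)))
```

A flat file like this is about the least that "spreadsheet, machine readable electronic form or plain text" could mean, and it loads straight into a spreadsheet or a database table.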

Several Million Up for Grabs in JISC ‘Course Data’ Call. On the Other Hand…

I notice that there’s a couple of days left for institutions to get £10k from JISC in order to look at what it would take to start publishing course data via XCRI feeds, with another £40-80k each for up to 80 institutions to do something about it (JISC Grant Funding 8/11: JISC ‘Course Data: Making the most of Course Information’ Capital Programme – Call for Letters of Commitment; see also Immediate Impressions on JISC’s “Course Data: Making the most of Course Information” Funding Call, as well as the associated comments):

As funding for higher education is reduced and the cost to individuals rises, we see a move towards a consumer-led market for education and increased student expectations. One of the key themes about delivering a better student experience discussed in the recent Whitepaper mentions improving the information available to prospective students.

Nowadays, information about a college or university is more likely found via a laptop than in a prospectus. In this competitive climate publicising courses while embracing new technologies is ever more important for institutions.

JISC have made it easier for prospective students to decide which course to study by creating an internationally recognised data standard for course information, known as XCRICAP. This will make transferring and advertising information about courses between institutions and organisations, more efficient and effective.

The focus of this new programme is to enable institutions to publish electronic prospectus information in a standard format for all types of courses, especially online, distance, part time, post graduate and continuing professional development. This standard data could then be shared with many different aggregator agencies (such as UCAS, the National Learning Directory, 14-19 Prospectus websites, or new services yet to be developed) to collect and share with prospective students.

All well and good, but:

– there still won’t be a single, centralised directory of UK courses, the sort of thing that can be used to scaffold other services. I know it isn’t perfect, but UCAS has some sort of directory of UK undergrad HE courses that can be applied for via central clearing, but it’s not available as open data.

– the universities are being offered £10k each to explore how they can start to make more of their course data. There seems to be the expectation that some good will follow, and aggregation services will flower around this data (“This standard data could then be shared with many different aggregator agencies (such as … new services yet to be developed)”). I think they might too. (For example, we’re already starting to see sites like Which University? provide shiny front ends to HESA and NSS data.) But why should these aggregation sites have to wait for the universities to scope out, plan, commission, delay and then maybe or maybe not deliver open XCRI feeds? (Hmm, I wonder: does the JISC money place any requirements on universities making their XCRI-CAP feeds available under an open license that allows commercial reuse?)

When we cobbled together the Course Detective search engine, we exploited Google’s index of UK HE websites to provide a search engine that provides a customised search over the course prospectus webpages on UK HE websites. Being a Google Custom Search Engine there’s only so much we can do with it, but whilst we wait for all the UK HEIs to get round to publishing course marketing feeds, it’s a start.

Of course, if we had our own index, we could offer a more refined search service, with all sorts of potential enhancements and enrichment. Which is where copyright kicks in…

…because course catalogue webpages are generally copyright the host institution, and not published under an open license that allows for commercial reuse.

(I’m not sure how the law stands against general indexing for web search purposes vs indexing only a limited domain (such as course catalogue pages on UK HEI websites) vs scraping pages from a limited domain (such as course catalogue pages on UK HEI websites) in order to create a structured search engine over UK HE course pages. But I suspect the latter two cases breach copyright in ways that are harder to argue your way out of than a “we index everything we can find, regardless” web search engine. (I’m not sure how domain limited Google CSEs figure either? Or folk who run searches with the site: limit?))

To kickstart the “so what could we do with a UK wide aggregation of course data?”, I wonder whether UK HEIs who are going to pick up the £10k from JISC’s table might also consider doing the following:

– licensing their course catalogue web pages with an open, commercial license (no one really understands what non-commercial means…and the aim would be to build sustainable services that help people find courses in a fair (open algorithmic) way that they might want to take…)

– publishing a sitemap/site feed that makes it clear where the course catalogue content lives (as a starter for 10, we have the Course Detective CSE definition file [XML]). That way, the sites could retain some element of control over which parts of the site good citizen scrapers could crawl. (I guess a robots.txt file might also be used to express this sort of policy?)
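
As a rough sketch of how a good citizen scraper might honour the robots.txt idea, Python’s standard urllib.robotparser module can be used to check URLs against a site’s policy before crawling them. Everything specific here is made up for illustration: the robots.txt content, the /courses/ path, the bot name and the example URLs are assumptions, not taken from any real university site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: open up the course catalogue, keep crawlers out of the rest.
ROBOTS_TXT = """\
User-agent: *
Allow: /courses/
Disallow: /
"""


def allowed_urls(robots_txt, user_agent, urls):
    """Return only those URLs that the robots.txt text lets user_agent fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


candidate_urls = [
    "https://www.example.org/courses/undergraduate/",
    "https://www.example.org/staff/intranet/",
]

# The bot name is invented; only the course catalogue URL should be returned.
print(allowed_urls(ROBOTS_TXT, "course-discovery-bot", candidate_urls))
```

A robots.txt file only expresses where crawlers are welcome; it says nothing about what may be done with the content once fetched, which is why the licensing point above still matters.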

The license would allow third parties to start scraping and indexing course catalogue content, develop normalised forms of that data, and start working on discovery services around that data. A major aim of such sites would presumably be to support course discovery by potential students and their families, and ultimately drive traffic back to the university websites, or on to the UCAS website. Such sites, once established, would also provide a natural sink for XCRI-CAP feeds as and when they are published (although I suspect JISC would also like to be able to run a pilot project looking at developing an aggregator service around XCRI-CAP feeds as well;-) In addition, the sites might well identify additional – pragmatic – requirements on other sorts of data that might contribute to intermediary course discovery and course comparison sites.

It’s already looking as if the KIS – Key Information Set – data that will supposedly support course choice won’t be as open as it might otherwise be (e.g. Immediate Thoughts on the “Provision of information about higher education”); it would be a shame if the universities themselves also sought to limit the discoverability of their courses via cross-sector course discovery sites…