A couple of weeks or so ago, having picked up the TSO OpenUp competition prize for suggesting that it would be a Good Thing for UCAS/university course code data to be made available, I had a meeting with the TSO folk to chat over “what next?” The meeting was an upbeat one with a plan to get started as soon as possible with a scrape of the UCAS website… so what’s happened since…?
First up – a reading of the UCAS website Terms and Conditions suggests that scraping is a no-no…
6. Intellectual property rights
e. Copying, distributing or any use of the material contained on the website for any commercial purpose is prohibited.
f. You may not create a database by systematically downloading substantial parts of the website
(In the finest traditions of the web, you aren’t allowed to deep link into the site without permission either: 6.c Links to the website are not permitted, other than links to the homepage for your personal use, except with our prior written permission. Links to the website from within a frameset definition are not permitted except with our prior written permission.)
So, err, I guess my link to the terms and conditions breaks those terms and conditions? Oops…;-) Should I be sending them something like this do you think?
As per your terms and conditions, (paragraph 6 c) please may I publish a link to your terms and conditions web page [ http://www.ucas.com/terms_and_conditions ] in a blog post I am writing that, in part, refers to your terms and conditions?
As a fallback, I put a couple of trial balloon FOI requests in to a couple of universities asking for the course names and UCAS course codes for courses offered in 2010/11, along with the search keywords associated with each course (doh! I did it again, deep linking into the UCAS site…)
PS Please may I also link to the page describing course search keywords [ http://www.ucas.com/he_staff/courses/coursesearchkeywords ] ?
The first request went to the University of Southampton, in part because I knew that they already publish chunks of the data (as data) as part of their #opensoton Open Data initiative. (This probably means I was abusing the FOI system, but a point maybe needed to be made…?!;-) The second request was put in to the University of Bristol.
The requests were of the form:
I would be grateful if you could send me in spreadsheet, machine readable electronic form or plain text a copy of the course codes, course titles and search keywords for each course as submitted to UCAS for the 2010-2011 (October 2010) student entry.
If possible, would you also provide HESA subject category codes associated with each course.
So how did I get on?
Bristol’s response was as follows:
On discussion with our Admissions and Student Information teams, it appears that the University does not actually hold this data – it is held on a UCAS database. UCAS are not currently subject to the Freedom of Information Act (they will be in due course) but it may be worth talking to them directly to see if they are willing to assist.
And Southampton’s response:
Course codes and titles may be found here: http://www.soton.ac.uk/corporateservices/foi/request-66210-6124d691.pdf Keywords were not held by the University you should inquire with UCAS (http://www.ucas.com). HESA subject category codes may be found here: http://www.hesa.ac.uk/index.php/content/view/1806/296/
So what did I learn?
- I don’t seem to have made it clear enough to Southampton that I wanted the 2-tuple (course code, HESA code) for each course. So how should I have asked for that data? (The response pointed me to the list of all HESA codes; what I wanted was, for each course code, the course code/HESA code pair.)
- Generalising from an example of one;-), there seems to be a disconnect between FOI and open data branches of organisations. In my ideal world, the FOI person (an advocate for the person making the request) would also be on good terms with the Open Data team in the organisation, if not a data wrangler themselves. For data requests, the FOI person would make sure the data is released as open data as part of the process of fulfilling the request and then refer the person making the request to the open data site (see also: Open Data Processes – Taps, Query Paths/Audit Trails and Round Tripping). Southampton have part of this process already – the course data is in a PDF on their site and I was referred to it. (Note that the PDF is not just any PDF – have a look at it! – and it is not the spreadsheet, machine readable electronic form or plain text I requested, even though @cgutteridge had posted a link to the SPARQL opendata query for the course code/UCAS code information I’d requested as a reply to my FOI request on the WhatDoTheyKnow site.)
- Universities don’t necessarily have any record of the search keywords they associate with the courses they post on UCAS. The UCAS website suggests that (doh!) “[r]ecent analysis of unique IP address use of the UCAS Course Search indicates that the subject search is by far the most popular of the 3 search options currently available”, such that “[w]hen an applicant uses our Course Search facility to search for available courses, they can choose a keyword by which to search, known as the ‘subject search’.” Which is to say, universities have no local record of the terms they use to describe courses that are the primary way of discovering their courses on UCAS? Blimey… (I wonder how much universities spend on Google AdWords for advertising particular courses on their own course prospectus websites and how they go about selecting those terms?)
- Asking for a machine readable “data as data” response has no teeth at the current time. I don’t know if the Protection of Freedoms bill clause that “extends Freedom of Information rights by requiring datasets to be available in a re-usable format” will change this? It seems like it might?
(a) an applicant makes a request for information to a public authority in respect of information that is, or forms part of, a dataset held by the public authority, and
(b) on making the request for information, the applicant expresses a preference for communication by means of the provision to the applicant of a copy of the information in electronic form, the public authority must, so far as reasonably practicable, provide the information to the applicant in an electronic form which is capable of re-use.
So what next? UCAS is a charity that appears to be operated by, for, and on behalf of UK Higher Education (e.g. UCAS Directors’ Report and Accounts 2009). Whilst not FOIable yet, it looked set to become FOIable from October 2011 (Ministry of Justice: Greater transparency in Freedom of Information), though I haven’t been able to find the SI and commencement date that enact this…? If it does become FOIable, we may be able to get the data out that way (although memories of the battle between open data advocates and the Ordnance Survey come to mind…). Hopefully, though, we’ll be able to get the data open by more amicable means before then…:-)
PS a couple of other things that I’ve been dipping into relating to this project. Firstly, the UCAS Business Plan 2009-2012 (doh!):
PPS Please may I also link to your Corporate Business Plan 2009-2012 [ http://www.ucas.com/documents/corporate/corpbusplan09-12.pdf ]
Secondly, the Cabinet Office’s “Better Choices: Better Deals” strategy document [PDF], which as well as its “MyData” right to personal data initiative, also encourages business to put their information (and data…) to work. Whether or not you agree that more information may help to make for better choices from potential students, or that comparison sites have a role to play in this, the UK government appears to believe it and looks set to support the development of businesses operating in this area. For example:
Effective consumer choices are also important in the public sector – such as decisions about what and where to study.
However, unlike in private markets, public services are generally:
● Free at the point of delivery, so prices do not give us clues about quality or popularity.
● Not motivated by profits, so there is little incentive to highlight differences and encourage switching.
● Supplied under a universal service obligation, such that they serve a particularly broad range of users, from the very informed to the highly vulnerable.
In the same way that comparison and feedback sites have developed for private markets, some choice-tools have already emerged for public services. For example, parents and prospective students can use league tables to compare school and university performance, while patients can access websites comparing waiting times for treatments across different healthcare providers, and feedback from fellow consumers about the performance of a local GP practice. Their role is likely to become more important in future as public service markets are opened up and there is scope for further choice-tools to be developed [Better Choices: Better Deals, p. 32]
If you’re looking to put a bid or business plan together based on using public data as a basis for comparison services, the Better Choices document has more than a few quotable sections;-)
Regular readers will know how I do quite like to dabble with visual analysis, so here are a couple of doodles with some of the university fees data that is starting to appear.
The data set I’m using is a partial one, taken from the Guardian Datastore: Tuition fees 2012: what are the universities charging?. (If you know where there’s a full list of UK course fees data by HEI and course, please let me know in a comment below, or even better, via an answer to this Where’s the fees data? question on GetTheData.)
In the end, I took the easy way out, and opted for Geocommons. I downloaded the data from the Guardian datastore, and tidied it up a little in Google Refine, removing non-numerical entries (including ranges, such as 4,500-6,000) in the Fees column and replacing them with minimum fee values. Sorting the fees column as a numerical type with errors at the top made the entries that needed tweaking easy to find:
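As an aside, for anyone who would rather script that tidy-up than click through Refine, here is a minimal Python sketch of the "replace a range with its minimum fee" step. The function name `minimum_fee` is my own invention for illustration, and it assumes fee cells look something like `9000`, `£9,000` or `4,500-6,000`; anything with no digits in it comes back as `None` so it can be checked by hand.

```python
import re
from typing import Optional


def minimum_fee(raw: str) -> Optional[float]:
    """Return the lowest number found in a fee cell, or None if there is none.

    Hypothetical helper: assumes commas are only ever thousands separators.
    """
    # Drop thousands separators first so "4,500" reads as 4500.
    cleaned = raw.replace(",", "")
    # Pick out every run of digits, allowing an optional decimal part.
    numbers = [float(n) for n in re.findall(r"[0-9]+(?:\.[0-9]+)?", cleaned)]
    return min(numbers) if numbers else None
```

Cells that come back as `None` are the equivalent of the "errors at the top" rows in the Refine sort: the ones that need a human to look at them.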
The Guardian data included an address column, which I thought Geocommons should be able to cope with. It didn’t seem to work out for me though (I’m sure I checked the UK territory, but only seemed to get US geocodings?) so in the end I used a trick posted to the OnlineJournalism blog to geocode the addresses (Getting full addresses for data from an FOI response (using APIs); rather than use the value.parseJson().results.formatted_address construct, I generated a couple of columns from the JSON results column using value.parseJson().results.geometry.location.lng and value.parseJson().results.geometry.location.lat).
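For comparison, the same latitude/longitude extraction could be sketched in Python along the following lines. This assumes the JSON column holds a Google-geocoder-style response in which `results` is a list of matches, each with a `geometry.location` object containing `lat` and `lng`; the function name `extract_lat_lng` is made up for the example, and taking the first match is my assumption about what you would want.

```python
import json
from typing import Optional, Tuple


def extract_lat_lng(geocoder_json: str) -> Optional[Tuple[float, float]]:
    """Pull (lat, lng) for the first match out of a geocoder JSON response.

    Hypothetical helper: returns None for blank, malformed or empty responses.
    """
    try:
        payload = json.loads(geocoder_json)
    except (TypeError, ValueError):
        return None  # blank or malformed cell
    if not isinstance(payload, dict):
        return None
    results = payload.get("results") or []
    if not results:
        return None  # the geocoder found no match for this address
    # Assumption: the first result is the best match.
    location = results[0]["geometry"]["location"]
    return float(location["lat"]), float(location["lng"])
```

Rows that come back as `None` are worth eyeballing: they are the addresses the geocoder could not place.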
Uploading the data to Geocommons and clicking where prompted, it was quite easy to generate this map of the fees to date:
Anyone know if there’s a way of choosing the order of fields in the pop-up info box? And maybe even a way of selecting which ones to display? Or do I have to generate a custom dataset and then create a map over that?
What I had hoped to be able to do was use coloured proportional symbols to generate a two dimensional data plot, e.g. comparing fees with drop out rates, but Geocommons doesn’t seem to support that (yet?). It would also be nice to have an interactive map where the user could select which numerical value(s) are displayed, but again, I missed that option if it’s there…
The second thing I thought I’d try would be an interactive scatterplot on Many Eyes. Here’s one view that I thought might identify what sort of return on value you might get for your course fee…;-)
Click thru’ to have a play with the chart yourself;-)
PS I can’t not say this, really – you’ve let me down again, @datastore folks…. where’s a university ID column using some sort of standard identifier for each university? I know you have them, because they’re in the Rosetta sheet… although that is lacking a HESA INST-ID column, which might be handy in certain situations… ;-) [UPDATE – apparently, HESA codes are in the spreadsheet…. ;-0]
PPS Hmm… that Rosetta sheet got me thinking – what identifier scheme does the JISC MU API use?
PPPS If you’re looking for a degree, why not give the Course Detective search engine a go? It searches over as many of the UK university online prospectus web pages that we could find and offer up as a sacrifice to a Google Custom search engine ;-)
Here’s the presentation I gave to the judging panel at the TSO OpenUp competition final yesterday. As ever, it doesn’t make sense with[out] (doh!) me talking, though I did add some notes in to the Powerpoint deck: Opening up UCAS Course Code Data
(I had hoped Slideshare would be able to use the notes as a transcript, but it doesn’t seem to do that, and I can’t see how to cut and paste the notes in by hand?:-( )
A quick summary:
The “Big Idea” behind my entry to the TSO competition was a simple one – make UCAS course data (course code, title and institution) available as data. By opening up the data we make it possible for third parties to construct services and applications based around a complete data skeleton of all the courses offered for undergraduate entry through clearing in a particular year across UK higher education.
The data acts as scaffolding that can be used to develop consumer facing applications across HE (e.g. improved course choice applications) as well as support internal “vertical” activities within HEIs that may also be transferable across HEIs.
Primary value is generated from taking the course code scaffolding and annotating it with related data. Access to this dataset may be sold on in a B2B context via data platform services. Consumer facing applications with their own revenue streams may also be built on top of the data platform.
This idea makes data available that can potentially disrupt the current discovery model for course choice and selection in Higher Education in the UK (but, in its current form, not university application or enrollment).
Here are the notes I doodled to myself in preparation for the pitch. Now the idea has been picked up, it will need tightening up and may change significantly! ;-) Which is to say – in this form, it is just my original personal opinion on the idea, and all ‘facts’ need checking…
But when selected to pitch the idea, it became clear that an application or two were also required, or at least some good business reasons for opening up this data…
So here we go…
Postgraduate students and Open University students do not go through UCAS. Other direct entry routes to higher education courses may also be available.
According to UCAS, in 2010, there were 697,351 applicants with 487,329 acceptances, compared with 639,860 applications and 481,854 acceptances in 2009. [ Slightly different figures in end of cycle report 2009/10? ]
For convenience, hold in mind the thought that course codes could be to course marketing, what postcodes are for geo related applications… They provide a natural identifier that other things can be associated with.
Associated with each degree course is a course code. UCAS course codes are also associated with JACS codes – Joint Academic Coding System identifiers – that relate to particular topics of study. “The UCAS course codes have no meaning other than ‘this course is offered by this institution for this application cycle’.” [link]
“UCAS course code is 4 character reference which can be any combination of letters and numbers.
Each course is also assigned up to three JACS (Joint Academic Coding System) codes in order to classify the course for *J purposes. The JACS system was introduced for 2002 entry, and replaced UCAS Standard Classification of Academic Subjects (SCAS). Each JACS code consists of a single letter followed by 3 numbers. JACS is divided into subject areas, with a related initial letter for each. JACS codes are allocated to courses for the *J return.
The JACS system is used by the Higher Education Statistics Agency (HESA), and is the result of a joint UCAS-HESA subject code harmonization project.
JACS is also used by UK institutions to identify the subject matter of programmes and modules. These institutions include the Department for Innovation, Universities and Skills (DIUS), the Home Office and the Higher Education Funding Council for England (HEFCE).”
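Taking the quoted descriptions at face value, the two code formats are simple enough to check mechanically. The sketch below is just my reading of the text above, not an official validation rule: a course code is any 4 letters or digits, and a JACS code is one letter followed by 3 digits, with the initial letter standing for the subject area. The function names are invented for the example, and the assumption that JACS letters are upper case is mine.

```python
import re

# Assumption: "any combination of letters and numbers", exactly 4 characters.
COURSE_CODE = re.compile(r"[A-Za-z0-9]{4}")
# Assumption: one upper-case letter followed by exactly 3 digits.
JACS_CODE = re.compile(r"[A-Z][0-9]{3}")


def is_course_code(code: str) -> bool:
    """True if the string has the shape of a UCAS course code."""
    return COURSE_CODE.fullmatch(code) is not None


def is_jacs_code(code: str) -> bool:
    """True if the string has the shape of a JACS code."""
    return JACS_CODE.fullmatch(code) is not None


def jacs_subject_area(code: str) -> str:
    """Return the initial letter that identifies the JACS subject area."""
    if not is_jacs_code(code):
        raise ValueError(f"not a JACS-shaped code: {code!r}")
    return code[0]
```

Note that every JACS-shaped string is also a valid-looking course code, so the shape alone can't tell you which of the two you are looking at; the check only catches obviously malformed values.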
Keywords: up to 10 keywords per course are allocated to each course from a restricted list of just over 4,500 valid keywords.
“Main keyword: This is generally a broad subject category, usually expressed as a single word, for example ‘Business’.
Suggested keyword (SUG): Where a search on a main keyword identifies more than 200 courses, the Course Search user is prompted to select from a set of secondary keywords or phrases. These are the more specific ‘Suggested keywords’ attached to the courses identified. For example, ‘Business Administration’ is one of a range of ‘Suggested keywords’ which could be attached to a Business course (there are more than 60 others to choose from). A course in Business Administration would typically have this as the ‘Suggested keyword’, with ‘Business’ as the main keyword.
However, if a course only has a ‘Suggested keyword’ and not a related ‘Main keyword’, the course will not be displayed in any search under the ‘Main keyword’ alone.
Single subject: Main keywords can be ticked as ‘Single subject’. This means that the course will be displayed by a keyword search on the subject, when the user chooses the ‘single subject’ option below. You may have a maximum of two keywords indicated as single subjects per course.”
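To check my own understanding of how the main and suggested keywords interact, here is a toy Python model of the subject search as described in the quoted guidance. It is only a sketch of my reading of that text, not how UCAS actually implements it: the `Course` class, the `subject_search` function and the exact-match comparison are all assumptions of mine.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple

# From the quoted guidance: prompts appear when a search
# "identifies more than 200 courses".
PROMPT_THRESHOLD = 200


@dataclass
class Course:
    """Hypothetical course record with its two kinds of keyword."""
    title: str
    main_keywords: Set[str] = field(default_factory=set)
    suggested_keywords: Set[str] = field(default_factory=set)


def subject_search(
    courses: List[Course], main_keyword: str
) -> Tuple[List[Course], Set[str]]:
    """Return the matching courses plus any suggested keywords to prompt with."""
    # Only main keywords count here, so a course that carries just a
    # suggested keyword never shows up in a search on the main keyword alone.
    matches = [c for c in courses if main_keyword in c.main_keywords]
    prompts: Set[str] = set()
    if len(matches) > PROMPT_THRESHOLD:
        # Too many hits: offer the suggested keywords attached to those hits.
        for course in matches:
            prompts |= course.suggested_keywords
    return matches, prompts
```

The practical upshot for a university is the gotcha in the guidance: tag a course with ‘Business Administration’ but forget ‘Business’ as the main keyword, and it drops out of the broad search altogether.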
“Between January and March 2010, approximately 600,000 unique IP addresses access the UCAS course code search function. During the same time period, almost 5 million unique IP addresses accessed the UCAS subject search function.” [link]
“New courses from 2012 will be given UCAS codes that should not be used for subject classification purposes. However, all courses will still be assigned up to three individual JACS3 codes based on the subject content of the course.
An analysis of unique IP address activity on the UCAS Course Search has shown that very few searches are conducted using the course code, compared to the subject search function. UCAS Courses Data Team will be working to improve the subject search and course keywords over the coming year to enable potential applicants to accurately find suitable courses.” [link]
Course code identifiers have an important role to play within university administrations, for example in marshalling resources around a course, although they are not used by students. (On the other hand, students may have a familiarity with module codes.) Course codes identify courses that are the subject of quality assessment by the QAA. To a certain extent, a complete catalogue of course codes allows third parties to organise offerings based around UK higher education degrees in a comprehensive way and link in to the UCAS application procedure.
– the release of horizontal data across the UK HE sector by HEIs, such as course catalogue information;
– vertical scaffolding within an institution for elaboration by module codes, which in turn may be associated with module descriptions, reading lists, educational resources, etc.
– the development across HE of services supporting student choice – for example “compare the uni” type services
XCRI is JISC’s preferred way of doing this, and I think there has been some lobbying of HEFCE from various JISC projects, but I’m not sure how successful it’s been?
Also context of data burden on HEIs, reporting to Professional, Statutory and Regulatory Bodies – PSRBs.
Reconciliation with HESA Institution and campus identifiers, as well as the JISCMU API and Guardian Datablog Rosetta Stone spreadsheet
By hosting course code data, and using it as scaffolding within a Linked Data cloud around HE courses, a valuable platform service can be made available to HEIs as well as commercial operations designed to support student choice when it comes to selecting an appropriate course and university.
Opening up the data facilitates rapid innovation projects within HEIs, and makes it possible for innovators within an HEI to make progress on projects that span across course offerings even if they don’t have easy access to that data from their own institution.
CompareTheUni has had a holding page up for months – but will it ever launch? Uni&Books crowd sources module codes and associated reading links. Talis Aspire is a commercial reading list system that associates resources with module codes.
Guardian datablog picked up the post, and I still get traffic from there on a daily basis… [link ]
One demonstrator I built used a bookmarklet to annotate UCAS course pages with a link to a resource page showing what books had been borrowed by students on that course at Huddersfield University. [Link ]
The course codes also provide hooks against which it may be possible to deploy mappings across skills frameworks, e.g. SFIA in IT world. The course codes will also have associated JACS subject code mappings and UCAS search terms, which in turn may provide weak links into other domains, such as the world of books using vocabularies such as the Library of Congress Subject headings and Dewey classification codes.
Services built on top of the data platform will need to be marketed to the target audience using appropriate channels. Specialist marketers such as Campus Group may be appropriate partners here.
For platform business – e.g. business model based around selling queries on linked/aggregated/mapped datasets. If you imagine a query returning results with several attributes, each result is a row and each attribute is a column. If you allow free access to x thousand query cells returned a day, and then charge for cells above that limit, you:
– encourage wider innovation around your platform; let people run narrow queries or broad queries. License on use of data for folk to use on their own datastores/augmented with their own triples;
– generate revenue that scales on a metered basis according to usage;
– offer additional analytics that get your tracking script in third party web pages, helping train your learning classifiers, which makes platform more valuable.
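To make the metering idea concrete, here is a back-of-an-envelope Python sketch of the "free cells, then pay per cell" model described above. The function names, the size of the free allowance and the price are all placeholders I have made up for illustration; nothing here reflects an actual tariff.

```python
def query_cells(rows: int, columns: int) -> int:
    """Cells returned by one query: one row per result, one column per attribute."""
    return rows * columns


def billable_cells(cells_used_today: int, free_allowance: int) -> int:
    """Cells over the daily free allowance; never negative."""
    return max(0, cells_used_today - free_allowance)


def daily_charge(
    cells_used_today: int, free_allowance: int, price_per_cell: float
) -> float:
    """Metered charge for the day, in whatever currency unit the price uses."""
    return billable_cells(cells_used_today, free_allowance) * price_per_cell
```

Counting cells rather than queries is what lets people choose between narrow queries (lots of rows, few attributes) and broad ones (few rows, lots of attributes) under the same allowance.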
For a consumer facing application – e.g. a course choice site for potential applicants is the easiest to imagine:
– Short term model would be advertising (e.g. course/uni ads), affiliate fees on book sales for first year books? Second hand books market, e.g. via Facebook marketplace?
– Medium term – affiliate fee for prospectus application/fulfilment
– Long term – affiliate fee for course registration