Several Million Up for Grabs in JISC ‘Course Data’ Call. On the Other Hand…

I notice that there’s a couple of days left for institutions to get £10k from JISC in order to look at what it would take to start publishing course data via XCRI feeds, with another £40-80k each for up to 80 institutions to do something about it (JISC Grant Funding 8/11: JISC ‘Course Data: Making the most of Course Information’ Capital Programme – Call for Letters of Commitment; see also Immediate Impressions on JISC’s “Course Data: Making the most of Course Information” Funding Call, as well as the associated comments):

As funding for higher education is reduced and the cost to individuals rises, we see a move towards a consumer-led market for education and increased student expectations. One of the key themes about delivering a better student experience discussed in the recent Whitepaper mentions improving the information available to prospective students.

Nowadays, information about a college or university is more likely found via a laptop than in a prospectus. In this competitive climate publicising courses while embracing new technologies is ever more important for institutions.

JISC have made it easier for prospective students to decide which course to study by creating an internationally recognised data standard for course information, known as XCRI-CAP. This will make transferring and advertising information about courses between institutions and organisations more efficient and effective.

The focus of this new programme is to enable institutions to publish electronic prospectus information in a standard format for all types of courses, especially online, distance, part time, post graduate and continuing professional development. This standard data could then be shared with many different aggregator agencies (such as UCAS, the National Learning Directory, 14-19 Prospectus websites, or new services yet to be developed) to collect and share with prospective students.

All well and good, but:

– there still won’t be a single, centralised directory of UK courses, the sort of thing that can be used to scaffold other services. I know it isn’t perfect, but UCAS has some sort of directory of UK undergrad HE courses that can be applied for via central clearing, but it’s not available as open data.

– the universities are being offered £10k each to explore how they can start to make more of their course data. There seems to be the expectation that some good will follow, and aggregation services will flower around this data (“This standard data could then be shared with many different aggregator agencies (such as … new services yet to be developed)”). I think they might too. (For example, we’re already starting to see sites like Which University? provide shiny front ends to HESA and NSS data.) But why should these aggregation sites have to wait for the universities to scope out, plan, commission, delay and then maybe or maybe not deliver open XCRI feeds? (Hmm, I wonder: does the JISC money place any requirements on universities making their XCRI-CAP feeds available under an open license that allows commercial reuse?)

When we cobbled together the Course Detective search engine, we exploited Google’s index of UK HE websites to provide a customised search over the course prospectus webpages on those sites. Being a Google Custom Search Engine there’s only so much we can do with it, but whilst we wait for all the UK HEIs to get round to publishing course marketing feeds, it’s a start.

Of course, if we had our own index, we could offer a more refined search service, with all sorts of potential enhancements and enrichment. Which is where copyright kicks in…

…because course catalogue webpages are generally copyright the host institution, and not published under an open license that allows for commercial reuse.

(I’m not sure how the law stands against general indexing for web search purposes vs indexing only a limited domain (such as course catalogue pages on UK HEI websites) vs scraping pages from a limited domain (such as course catalogue pages on UK HEI websites) in order to create a structured search engine over UK HE course pages. But I suspect the latter two cases breach copyright in ways that are harder to argue your way out of than a “we index everything we can find, regardless” web search engine. (I’m not sure how domain limited Google CSEs figure either? Or folk who run searches with the site: limit?))

To kickstart the “so what could we do with a UK wide aggregation of course data?”, I wonder whether UK HEIs who are going to pick up the £10k from JISC’s table might also consider doing the following:

– licensing their course catalogue web pages with an open, commercial license (no one really understands what non-commercial means…and the aim would be to build sustainable services that help people find courses that they might want to take in a fair (open algorithmic) way…)

– publishing a sitemap/site feed that makes it clear where the course catalogue content lives (as a starter for 10, we have the Course Detective CSE definition file [XML]). That way, the sites could retain some element of control over which parts of the site good citizen scrapers could crawl. (I guess a robots.txt file might also be used to express this sort of policy?)

The license would allow third parties to start scraping and indexing course catalogue content, develop normalised forms of that data, and start working on discovery services around that data. A major aim of such sites would presumably be to support course discovery by potential students and their families, and ultimately drive traffic back to the university websites, or on to the UCAS website. Such sites, once established, would also provide a natural sink for XCRI-CAP feeds as and when they are published (although I suspect JISC would also like to be able to run a pilot project looking at developing an aggregator service around XCRI-CAP feeds as well;-) In addition, the sites might well identify additional – pragmatic – requirements on other sorts of data that might contribute to intermediary course discovery and course comparison sites.
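As a rough sketch of how a “good citizen” scraper might honour the sort of robots.txt policy suggested above, the snippet below uses Python’s standard `urllib.robotparser`. The robots.txt content, the `course-indexer` user agent and the `example.ac.uk` URLs are all made up for illustration; a real institution would publish its own rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt an institution might publish: course catalogue
# pages are open to crawlers, everything else is off limits.
ROBOTS_TXT = """\
User-agent: *
Allow: /courses/
Disallow: /
"""


def may_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # "course-indexer" is an invented user agent name.
    for url in (
        "https://www.example.ac.uk/courses/physics",
        "https://www.example.ac.uk/staff/",
    ):
        print(url, may_crawl(ROBOTS_TXT, "course-indexer", url))
```

In practice the scraper would fetch the live robots.txt (for example with `RobotFileParser.set_url` and `read`) rather than parse a string, but the check itself would look much the same.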

It’s already looking as if the KIS – Key Information Set – data that will supposedly support course choice won’t be as open as it might otherwise be (e.g. Immediate Thoughts on the “Provision of information about higher education”); it would be a shame if the universities themselves also sought to limit the discoverability of their courses via cross-sector course discovery sites…

16 comments

  1. benthamfish

    … as a starter for 10, we have the Course Detective CSE definition file [XML]…

    Except that this link in Tony’s post doesn’t work (as at 16:15 Monday 5 Sep 11).

    Problem of sustainability for these informal links? That’s one of the reasons we’ve been moving ahead on more formal XCRI-CAP feeds, including a register and / or some form of discoverability process. I’m also a little bit concerned about the quality of scraped courses information. With a defined and managed feed there’s more possibility of getting comprehensive and accurate data, rather than data that looks vaguely right.

    An additional problem is that many, many institutions don’t have a coherent course catalogue at all. Many still have PDFs, for example, or data that is inherently inconsistent, maintained by individual faculties and departments.

    • Tony Hirst

      @alan the link does work – though it might be a bit slow… and it may not be being served with correct mimetype (try View Source)…

      As far as the informality goes – I can run a scraper over a site on a daily basis and refresh indices accordingly; yes, scraping is brittle, but when are the XCRI feeds going to be available? I could probably have written scrapers for most of the unis in the time the course data call has been out. And I’m not saying that we can do without machine readable course data feeds; what I am saying is that a major reason why we can’t start to explore building aggregators around course descriptions right now is because of the way the content is licensed. And that by starting to explore aggregations, we can provide a service for potential students now; we can start to identify how to make sensible recommendations; we can start to accrete other stuff around course search (like survey results, career paths, locale, how to work in recommendations based on A-levels), etc etc.

      For unis that don’t have course catalogues, just PDFs – they’re scrapeable too (unless they’re PDFs of scanned images), although less reliably and it takes longer to write the scraper. (And anyway, if the index is not comprehensive, it’s still got more content in it than the empty set at the moment.)

      As far as unis publishing reliable feeds – yeah, right: they’ll do it perfectly from the off. I suspect I’m one of the few people who looks at RSS/Atom feeds that educational establishments publish around their sites, and I’ve given up bothering to email to say: “your feed’s broken”… And just by the by, the licensing around XCRI-CAP feed publication has been clarified, has it? What will the recommendation be? And will it be enforceable? (And will it be a requirement that unis do actually publish the feeds? To what level of detail?)

  2. Tavis Reddick

    What kind of business model would your commercial course search have? How would it “help people find courses in a fair (open algorithmic) way”?

    The commercial course aggregators I have had dealings with through our College have not produced a level playing field and have either tried to charge more for “added value” or tried to channel applications through their own subscription service, while publishing information which was inaccurate or simply out of date (and were obstructive and slow in taking down our course information on request).

    What value would a commercial course aggregator provide?

    • Tony Hirst

      @tavis Every search engine has its biases – the “fair (open algorithmic) way” I had in mind would be one that wasn’t influenced by people paying to boost their rankings… Hmmm… maybe I should have said “transparent”, then results could be boosted by sponsorship, along with disclaimers saying something like “University of Poppleton paid £2.50 to appear at the top of this search results listing”… But maybe all but the most trivial algorithms are too open to bias…

      So instead, let’s go for a common bias. If I go to the UCAS site and run a course search, I get results in alphabetical order. There is no reason why an aggregator couldn’t also offer such a results listing… but in terms of generating results, having an index of full course description text would make for a friendlier search engine, capable of responding to queries like “diversity of life” (which works on coursedetective.co.uk – albeit with results provided in Google-rank-order, rather than alphabetical order).

      “while publishing information which was inaccurate or simply out of date (and were obstructive and slow in taking down our course information on request)”: if the data is hard to come by and maintain, then it makes it harder to refresh the content in the aggregator. If you’re maintaining an aggregator that is trying to provide a current, live view, then it is in your interest to respond to such requests. XCRI feeds could help here, making timely and correct data available in a machine readable way. But XCRI feeds are going to take time to appear…

      If your issue is with aggregators in principle, it’s not my fault the UK government is trashing Higher Education by trying to privatise it and make a market out of it…

  3. Tavis Reddick

    @tony, I appreciate that search engine rankings will usually have certain weightings that are not transparently described. Google does publish quite a large amount of advice for webmasters which generally falls under the web standards, netiquette and common sense categories, whilst withholding certain information to limit opportunities for, shall we say, unethical rigging.

    I do not have an issue with aggregators in principle. Skills Development Scotland have been trying to get learning providers to produce XCRI CAP feeds across the sector, and I have no concerns about the integrity of their course search.

    There could be problems simply in terms of the relative amounts of descriptive text provided by different institutions, in that more text (and the existence of useful keywords) is more likely to produce a match for a search phrase. This possibly means that controlled vocabularies (of keywords, particularly subject areas) and guidance for/weighting of descriptive text will be required to give balance to searches.

    Further, controlled vocabularies (thinking of the kind of metadata Amazon uses for its products, say, or that Google generates for its image searches) should allow course discovery rather than search, a more precise form of refinement by selecting subcategories via links (or filter options).

    You would also want thesaurus (collection) relating of educational terms (which might not be the same as the matches of a generic search engine). Lexaurus lists a few educational thesauri with broader, narrower, preferred, related, and equivalent terms.

    With regard to UCAS searches and sort bias, it appears that a course search on their website can return the institutes in alphabetical order, with matching courses in alphabetical order sorted underneath, which as you suggest is a particular concern for York College and those further down the alphabet. I agree that there are some advantages in the Google ranking used by coursedetective.co.uk, with the above provisos.

    I would be concerned if there was a trend towards exclusivity (Universities only, say, and not Higher Education colleges) to course search engines; and would prefer that any restriction (type of institution, locality and so on) was applied by the user.

    On another point: you might also consider that an additional driver to produce XCRI CAP feeds is for exchanging course data within an institution; for example, extracting and publishing a feed from databases to be consumed by the institutional website, prospectus and course brochures (our College does this), and possibly feeding reports and other information systems like Virtual Learning Environments.

  4. Tony Hirst

    @Tavis “There could be problems simply in terms of the relative amounts of descriptive text provided by different institutions, in that more text (and the existence of useful keywords) is more likely to produce a match for a search phrase. This possibly means that controlled vocabularies (of keywords, particularly subject areas) and guidance for/weighting of descriptive text will be required to give balance to searches.”

    I’m coming from the stance that a free text search on course descriptions would offer a contrasting position to the search that UCAS currently offers, which I think is based in part on a limited number of search keywords provided by each institution. Which is to say, I’m interested in finding ways of making more effective use of search, as well as looking at how third parties might be able to support university and course choice based on additional criteria.

    “You would also want thesaurus (collection) relating of educational terms (which might not be the same as the matches of a generic search engine). Lexaurus lists a few educational thesauri with broader, narrower, preferred, related, and equivalent terms.”

    Agreed… but I also want the natural language that appears in the course descriptions;-)

    “I would be concerned if there was a trend towards exclusivity (Universities only, say, and not Higher Education colleges) to course search engines; and would prefer that any restriction (type of institution, locality and so on) was applied by the user.”

    Agreed. At the very least, the aggregator should make clear what sites are being aggregated (and maybe also what sites that might reasonably be included aren’t for whatever reason/policy.) Does JISC go far enough for you in terms of who it’s trying to encourage to use XCRI-CAP?

    “On another point: you might also consider that an additional driver to produce XCRI CAP feeds is for exchanging course data within an institution; for example, extracting and publishing a feed from databases to be consumed by the institutional website, prospectus and course brochures (our College does this), and possibly feeding reports and other information systems like Virtual Learning Environments.”

    I completely agree… a lot of rhetoric around open data appears to have missed the trick that a primary consumer of institutional data might be the institution itself (as well as bodies the institution must report to as part of its public reporting data burden). I suspect it’s also sometimes easier to work with public data your institution has published openly rather than having to try to find ways of accessing it internally;-)

    • Tavis Reddick

      @tony, in answer to your question:
      “Does JISC go far enough for you in terms of who it’s trying to encourage to use XCRI-CAP?”

      I think that JISC has done a good job of making the case in various ways, including this cool animation (it mentions the older XCRI Curriculum):

      http://www.jisc.ac.uk/whatwedo/programmes/eframework/soa.aspx

      although I think that a lot of the work and promotion has been done by people outside JISC bodies, which bodies have nevertheless provided essential frameworks, forums and funding.

      However, I am aware of ignorance of and resistance to XCRI within organisations. A senior manager recently told me that he had not heard of XCRI nor did he think it was something we should be involved in (despite our College having been involved in the project nearly since its conception and using XCRI to publish our course information for years; and in spite of our Principal’s exhortation to go forth and play a leading role in projects in the sector).

      Possibly there is a visibility problem in that many people who are not in technical or information management roles just do not see the problems (that XCRI was set up to solve) in the first place, and the solutions are largely transparent (one reason I like playing about with the graphical visualisations of open data formats) and automated.

  5. Rob Englebright

    Insightful stuff.
    I’m working on the JISC #coursedata programme, and can maybe provide a few answers about it, though will probably also open up a few more cans of worms/questions.

    First off, the work on eXchanging Course Related Information (XCRI), and eXchanging Course Related Information – Course Advertising Profile (XCRI-CAP) has been sponsored by JISC since 2005.
    It provides some structure in the fragmented landscape of course information, and further work would have carried on without the JISC call.
    The European standard – Metadata for Learning Opportunities advertising (MLO) – is nearly out of the standards sausage machine, and work is underway to produce a British residual standard based on XCRI-CAP 1.2 as a conformant profile of MLO.
    What I’m trying to get at here is that most of the underpinning technical work is done, and we have a community developed standard that can be used now.

    Work on XCRI-CAP projects has shown that the biggest issues aren’t in the spec, but in where the institutional information comes from. I talked to a CMS vendor the other day who said “the only common thing is that the data sources in ALL the systems are different”
    This is why the bulk of the #coursedata money is linked to reviewing and re-engineering internal systems, and the production of an automatic XCRI-CAP feed is merely a manifestation of that streamlining… much like the minty freshness when someone brushes their teeth. What we really want is clean teeth, sweet smelling breath is just a proxy.
    How long will this take? I think the funding for stage 2 projects runs from Jan 2012 for a year and a bit, some places will be able to have a feed running quite quickly, others may need more substantial re-engineering of their processes.

    The XCRI-CAP feed itself, should obviously conform to the spec, and I have just had approval to release an ITT for test tools, more details on the JISC funding page soon.
    The feed should be registered: http://www.xcri.org/wiki/index.php/XCRI_Feeds
    Unless the projects can supply a valid reason for NOT doing so, to facilitate the ease of discovery the feed should be placed at one of the following publicly accessible locations:
    data.foo.ac.uk/courses.xml
    foo.ac.uk/data/courses.xml
    foo.ac.uk/courses.xml
    Unless the project can supply a valid reason for not doing so the feed should have an open licence that allows commercial re-use :)
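As a rough illustration of the location convention listed above (a sketch, not part of the JISC guidance), an aggregator could generate the candidate feed addresses for an institution’s domain and try each in turn. The function name and the `http://` scheme are assumptions; the guidance as quoted gives only the host and path patterns.

```python
def candidate_feed_urls(domain: str) -> list[str]:
    """Return the three suggested feed locations for a domain such as foo.ac.uk.

    The http:// scheme is an assumption made for this sketch; the quoted
    guidance lists the locations without one.
    """
    return [
        f"http://data.{domain}/courses.xml",
        f"http://{domain}/data/courses.xml",
        f"http://{domain}/courses.xml",
    ]


if __name__ == "__main__":
    for url in candidate_feed_urls("foo.ac.uk"):
        print(url)
```

An aggregator would still need to request each address and check that what comes back is a valid feed, which is where a feed register and test tools would help.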

    Scott Wilson recently suggested there needs to be a separation of the XCRI-CAP standards development from the Coursedata community needs development, much like the WiFi group and 802.11x. I think that’s a cracking idea, and can see the BSI IST43 panel that is working on the profile of MLO is well positioned on the spec side. I’m hoping the JISC #coursedata programme can help focus the future community needs side.
    Rob

    • Tony Hirst

      @rob Thanks for the comment. Interesting to hear how it’s the institutional dataflow/data process that causes so much of the grief, and just how much variety there is in the way the data is handled/managed (or not) across institutions and vendors. It’s interesting that you’re trying to standardise a home for course feeds (I guess the Linking You project could also add support here?). It might also make sense for feeds to be made autodiscoverable, much like RSS/Atom feeds…? (Although there is a spectacular lack of support for these across UK HEIs atm, too!)
      One thing I did wonder was whether the short term opening up of course pages to scrapers/aggregators might lead to developments by aggregator services that could then feed back to the institutional side, eg with recommendations about how to structure the data that is already provided in public HTML pages through template refinements? This view from the outside might help the institutions see what it is about their course data that third parties see value in?

      PS is JISC making recommendations about the license that XCRI-CAP data will be released under, and where this license information will be made available?

  6. Rob Englebright

    Auto discovery, yes, though I’m not sure of the end use case, it would be churlish not to consider it…

    I think a clearly articulated statement regarding the benefits of explicitly permitting the aggregation of existing pages whilst their XCRI-CAP pipework is being laid, could be included in the guidance to projects, and would certainly provide some useful feedback as you say.

    In terms of licencing the feed, the technical guidance (still in draft) currently says “Unless the project can supply a valid reason for not doing so the feed should have an open licence that allows commercial re-use ” – I think Scott created some guidance for the OER projects that could be re-used here.

    No idea how many applications we’ll get, I’m a natural pessimist, so see the timing of the call as an issue… In terms of getting institutions to spend time reviewing and reworking their processes I guess it doesn’t matter. In terms of a critical mass of XCRI-CAP feed producers?

    • Tony Hirst

      @rob end-case for autodiscoverability is just that it makes it easier for machines to reliably identify where the feed lives…? And also, requirement of autodiscoverability means there has to be a feed to discover…;-) Also helps in case of feed ever moving (not that it should…)
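To make the autodiscovery idea concrete, here is a minimal sketch of how a machine might look for a feed advertised in a page’s HTML head, in the same way RSS/Atom feeds are advertised with `<link rel="alternate">` elements. It uses only Python’s standard `html.parser`. The media type used for the course feed is invented for this example (the thread does not say how an XCRI-CAP feed would be advertised), as are the page content and paths.

```python
from html.parser import HTMLParser

# Invented for illustration only: the thread does not specify how an
# XCRI-CAP feed would be advertised, so this media type is hypothetical.
COURSE_FEED_TYPE = "application/x-hypothetical-xcri-cap+xml"


class FeedLinkFinder(HTMLParser):
    """Collect href values of <link rel="alternate"> tags with a wanted type."""

    def __init__(self, feed_types):
        super().__init__()
        self.feed_types = set(feed_types)
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        rel_values = (attr.get("rel") or "").lower().split()
        if (
            "alternate" in rel_values
            and attr.get("type") in self.feed_types
            and attr.get("href")
        ):
            self.found.append(attr["href"])


def discover_feeds(html, feed_types):
    """Return the feed URLs advertised in an HTML page, in document order."""
    finder = FeedLinkFinder(feed_types)
    finder.feed(html)
    finder.close()
    return finder.found


# Made-up example page with a news feed, a course feed and a stylesheet.
PAGE = """<html><head>
<link rel="alternate" type="application/atom+xml" href="/news.atom">
<link rel="alternate" type="application/x-hypothetical-xcri-cap+xml" href="/courses.xml">
<link rel="stylesheet" type="text/css" href="/site.css">
</head><body></body></html>"""

if __name__ == "__main__":
    print(discover_feeds(PAGE, {COURSE_FEED_TYPE}))
```

If the feed ever moved, only the `href` in the page template would need to change, and well-behaved consumers would pick up the new location on their next visit.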

      re: clearly articulated statement about benefits: I’d happily try to contribute to such a discussion;-)

      re: license: thanks for clarification on that; is the license likely to provide another friction point/barrier to adoption in the institutions, do you think, and if so, why? (Fear of consequences, maybe, arising from a liberal license? Uncertainty around what the licensing issues either way actually entail?)

      If aggregation sites started attracting and pushing traffic, that would provide some sort of incentive to publishing public feeds. I agree with an earlier comment though that the most significant benefits from fluid course data might well be within the institution itself. If requiring the course feeds to be open and public was feared by institutions, it might actually then act as a brake on adopting the feeds internally? One of the reasons I like the government mandated approach of “open the data to the public” is the spin-offs that might then arise within institutions because they can start to work with data that has become rather more accessible, or draw down on services that add value to data they are making available (eg RateMyPlace http://www.lichfielddc.gov.uk/site/scripts/documents_info.php?documentID=1141 ).

      Re: institutions taking up the call – would there be scope for getting someone to wander round the institutions, asking them what they’ve got, and trying to profile a set of institutions from across the UK to see if there might be quick wins available?

  7. Matt Spence

    As the developer of the previously mentioned Which University? (http://www.whichuni.info) I’d like to weigh in…

    My initial objective for Which Uni? is to help prospective students choose a University (and course) by comparing University statistics. There are a number of things I’d like the site to grow to cover; one of these is course discovery, and the XCRI project is exciting in this regard. First…

    Scraping: eeghh no, not sustainable

    The idea of scraping is horrific to me! It’s fine for a hacked-together proof of concept, but it is exactly that: a hack. Even on a small data set from a single source (or a handful of sources) it’s OK; whatever gets the job done, I suppose. Doing it across tens if not hundreds of institutions is not worth thinking about. Each one would need a variation depending on that institution’s DOM (unless there is a standard microformat markup, but why not just do a feed at that point?). The large number of institutions would make breakages frequent, and whilst any single break would not affect the service as a whole (esp. if cached), it would take large amounts of maintenance to keep everything running long term.

    The bottom line is it’s not sustainable, or viable for a business, to scrape this kind of data.

    Feeds: good idea, we’ll see how it comes off in practice, licensing could be a stopper

    The first problem is the practicalities: if the feeds work as JISC plan, then great! Personally I am not holding my breath on when they will be available, with what coverage and quality of content, whether every institution will stick to the standard (there will always be someone whose data doesn’t fit, or who knows better), etc etc, but only time will tell, I guess.

    The next problem is the licensing: the data in these feeds must be available for commercial gain. If not, then the only sites that are going to emerge to utilise them will be poorly realised, badly maintained and unsupported hobby sites.

    The data should also be available without a negotiated agreement (ie help yourself), or at least a single agreement should be all that is needed for all institutions, through a central organisation.

    A potential solution to the issue of commercial gain is to have the same agreement we have with HEFCE for the statistical data we provide, which is that it must be free to end users.

    Business model: advertising

    Our initial revenue stream will be advertising, but we have a couple of other ideas:
    i. paid for featured search results, with similar levels of transparency that Google has ie “sponsored links”
    ii. having universities sponsor their own pages replacing adverts with call to action buttons that deep link to their web site (eg “View this course on acme.ac.uk”, “Attend an open day”)

    The integrity of the data and the way it is presented to the users is paramount to what the site is about; institutions will be able to buy placement (within reason) and traffic, but they will not be able to buy a better score or have any more control over user-submitted content (comments and reviews) than a normal user (ie flag as inappropriate).

    • Tony Hirst

      Hi Matt – thanks for the comment…
      I quite agree with you that scraping doesn’t provide a sustainable, long term solution, but I proposed it (not altogether seriously;-) bearing in mind:
      1) course details don’t change that often – though for effective comparison you do need to know that the information you are basing the comparisons on is correct;
      2) by scraping the data, the scraping template can be passed back to the original institution to make it clear – a) how their page templates can be improved; b) how they might create a template to publish the data as data. It also serves to audit the extent to which content made available through course page templates is necessary and sufficient for a normalised aggregator;
      3) getting an aggregator up and running with even a small number of sites can still bootstrap thinking and service development around at least the sites being compared/aggregated. The aggregator can also act as a demonstrable consumer and maybe even validator of course feeds.
      The suggestion was thus made in the context of ‘how might we bootstrap the process of getting a course aggregator up and running’ – it wasn’t intended as a be all and end all solution, just something that could help get the invention and innovation cycle going.

      As far as feeds go: it’s that or data dumps, surely?! I quite agree that licensing could be a deal breaker: when I won the TSO OpenUp course data competition to open up the UCAS collection of qualification data as scaffolding that third parties could make use of, there were no practical problems in just scraping the data, but the licensing conditions stymied that approach completely. I have a few concerns that the KIS data will end up being licensed in a way that prevents third parties getting access to the datasets that feed the KIS widgets. As to data being open enough for folk to be able to know they can use it without having to get in touch with the publisher, I agree with that too. But as an interim step, it seems to me as if JISC could mandate, or at least make a strong recommendation to, institutions who are taking the course data funding that they make the data available in as open a form as possible to developers who are considering making a sustainable service on top of it…

      …and for sustainability: agreed – there absolutely has to be a business model associated with it; and advertising would presumably play a part in that, along with (I suspect) affiliate fees as and when institutions start incentivising third parties to send them leads/prospects. Sponsored buttons and contextual ads from the institutions might also be appropriate, if handled appropriately.

      And finally, as you say, data integrity is key. But I’m also mindful that we have to start somewhere, as you have done…:-)

    • Matt Spence

      I agree an initial small aggregator to get us going could have a catalytic effect, but as you have mentioned I think IP will get in the way here. I am also not sure it will change thinking amongst institutions; people tend to either be for (even if they are cautious) or against this kind of openness, and those entrenched and worried about losing control of this data will not be won over until the aggregators are driving large amounts of traffic to their institutions’ web sites. There are potentially a number of people who are not against the idea of open data, but not convinced of its value, who may be won over.

      A data dump would also be great, better even perhaps; however, the schema should be standardised. Certainly we will be using the JISC monitoring unit’s SQLite download rather than the RESTful API.

      I think there is hope if JISC recommend open licensing: some forward thinking institutions will make theirs open and be included in aggregators, then in time the rest will follow suit as they see the value those who have made their data open are getting from it. Mandating it as a requirement of the funding would expedite this and makes sense to me, since code developed with JISC funding must be open, so why not content made available with JISC funding?

      I had not considered affiliate fees, but makes sense if institutions become more “corporate” as everyone is suggesting. However institutions being what they are this will take time to appear if it ever does.
