University Funding – A Wider View

A post on the Guardian Datablog yesterday (Higher education funding: which institutions will be affected?) alerted me to the release of HEFCE’s “provisional allocations of recurrent funding for teaching and research, and the setting of student number control limits for institutions, for academic year 2012-13” (funding data).

Here are the OU figures for teaching:

– Funding for old-regime students (mainstream): £59,046,659
– Funding for old-regime students (co-funding): £0
– High cost funding for new-regime students: £2,637,827
– Widening participation: £23,273,796
– Teaching enhancement and student success: £17,277,704
– Other targeted allocations: £22,619,320
– Other recurrent teaching grants: £3,991,473
– Total teaching funding: £128,846,779

HEFCE preliminary teaching funding allocations to the Open University, 2012-13

Of the research funding for 2012-13, mainstream funding was £8,030,807, the RDP supervision fund came in at £1,282,371, and £604,103 of “other” funding made up the full £9,917,281 research allocation.

Adding Higher Education Innovation Funding of £950,000, the OU’s total allocation was £139,714,060.

So what other funding comes into the universities from public funds?

Open Spending publishes data relating to spend by government departments on named organisations, so we can search it for departmental spending with the universities (for example, here is a search on OpenSpending.org for “open university”):

Given the amounts spent by public bodies on consultancy (try searching OpenCorporates for mentions of PricewaterhouseCoopers, or any of EDS, Capita, Accenture, Deloitte, McKinsey, BT’s consulting arm, IBM, Booz Allen, PA, KPMG (h/t @loveitloveit)), university-based consultancy may come in reasonably cheaply?

The universities also receive funding for research via the UK research councils (EPSRC, ESRC, AHRC, MRC, BBSRC, NERC, STFC), along with innovation funding from JISC. Unpicking the research council funding awards to universities can be a bit of a chore, but scrapers are appearing on Scraperwiki that make for easier access to individual grant awards data (a sketch of pulling the results of one of these queries via the ScraperWiki API follows the list):

  • AHRC funding scraper; [grab data using queries of the form select * from `swdata` where organisation like "%open university%" on scraper arts-humanities-research-council-grants]
  • EPSRC funding scraper; [grab data using queries of the form select * from `grants` where department_id in (select distinct id as department_id from `departments` where organisation_id in (select id from `organisations` where name like "%open university%")) on scraper epsrc_grants_1]
  • ESRC funding scraper; [grab data using queries of the form select * from `grantdata` where institution like "%open university%" on scraper esrc_research_grants]
  • BBSRC funding [broken?] scraper;
  • NERC funding [broken?] scraper;
  • STFC funding scraper; [grab data using queries of the form select * from `swdata` where institution like "%open university%" on scraper stfc-institution-data]
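
By way of a quick sketch, something like the following should pull the results of one of those queries into Python. The API endpoint and its jsondict output format are assumptions based on the classic ScraperWiki external API, so check the current API docs before relying on it:

```python
# Sketch: run a SQL query against a named ScraperWiki scraper's datastore.
# The endpoint URL and parameters are assumptions - verify against the live docs.
import requests

API = "https://api.scraperwiki.com/api/1.0/datastore/sqlite"

def scraperwiki_query(scraper_name, sql):
    """Return the rows of a datastore query as a list of dicts."""
    r = requests.get(API, params={"format": "jsondict", "name": scraper_name, "query": sql})
    r.raise_for_status()
    return r.json()

# eg AHRC grants to the Open University
rows = scraperwiki_query(
    "arts-humanities-research-council-grants",
    'select * from `swdata` where organisation like "%open university%"',
)
print(len(rows), "grants found")
```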

In order to get a unified view over the detailed funding of the institutions from these different sources, the data needs to be reconciled. There are several ID schemes for identifying universities (eg UCAS or HESA codes; see for example GetTheData: Universities by Mission Group), but even official data releases tend not to make use of these, preferring instead to rely solely on institution names, as in the case of the recent HEFCE provisional funding data release [Doh! This is not the case – identifiers are there, apparently. (I have to admit, I didn’t check and was being a little hasty… See the contribution/correction from David Kernohan in the comments to this post.)]

For some time, I’ve been trying to put my finger on why data releases like this are so hard to work with, and I think I’ve twigged it… even when released in spreadsheet form, the data often still isn’t immediately “database-ready”. Getting data from a spreadsheet into a database often requires an element of hands-on crafting – coping with rows that contain irregular comment data, and handling columns or rows with multicolumn and multirow labels. So here are a couple of things that would make life easier in the short term, though they maybe don’t represent best practice in the longer term:

1) Release data as simple CSV files (odd as it may seem), because these can easily be loaded into applications that can actually work on the data as data. (I haven’t started to think too much yet about pragmatic ways of dealing with spreadsheets where cell values are generated by formulae; the formulae do at least provide an audit trail from one data set to the derived views generated from it.)

2) Include a column containing regular identifiers that use a known identification scheme, for example HESA or UCAS codes for HEIs. If the data set is a bit messy and you can only partially fill the ID column, then only partially fill it; it’ll make life easier joining at least those rows to other related datasets… (A quick sketch of both points follows below.)
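
To make both points concrete, here’s a minimal sketch (the file and column names are made up for the purposes of illustration) of how a CSV file with even a partially filled HESA ID column joins far more readily than a name-keyed spreadsheet:

```python
# Sketch: join a funding CSV to a lookup table on a partially filled ID column.
# Filenames and column names are hypothetical.
import pandas as pd

funding = pd.read_csv("hefce_allocations.csv")  # includes a partially filled HESA_ID column
lookup = pd.read_csv("hei_lookup.csv")          # HESA_ID -> canonical name, UCAS code, etc.

# Rows that carry an identifier join cleanly...
matched = funding.dropna(subset=["HESA_ID"]).merge(lookup, on="HESA_ID", how="left")

# ...leaving only the ID-less rows to be reconciled by fuzzy name matching.
unmatched = funding[funding["HESA_ID"].isna()]
print(len(matched), "rows joined on HESA code;", len(unmatched), "left to reconcile by name")
```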

As far as UK HE goes, the JISC monitoring unit/JISCMU has an API over various administrative data elements relating to UK HEIs (eg GetTheData: Postcode data for HE and FE institutes), but I don’t think it offers a Google Refine reconciliation service (ideally with some sort of optional string similarity service)…? Yet?! ;-) Maybe that’d make for a good rapid innovation project???
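
For what it’s worth, a Refine reconciliation service needn’t be much more than the following sort of thing – a toy sketch in which the lookup table and identifiers are made up, and difflib stands in for a proper string similarity service:

```python
# Toy sketch of a Google Refine style reconciliation endpoint for HEIs.
import json
from difflib import SequenceMatcher

from flask import Flask, request, jsonify

app = Flask(__name__)

# Placeholder lookup: canonical institution name -> (made-up) identifier
HEIS = {
    "The Open University": "HEI-0001",
    "University of Oxford": "HEI-0002",
    "Oxford Brookes University": "HEI-0003",
}

def candidates(name, limit=3):
    """Score each known HEI against the query string and return the best few."""
    scored = []
    for canon, code in HEIS.items():
        score = SequenceMatcher(None, name.lower(), canon.lower()).ratio()
        scored.append({
            "id": code,
            "name": canon,
            "score": score,
            "type": [{"id": "/hei", "name": "HEI"}],
            "match": score > 0.9,  # auto-match only on a near-exact hit
        })
    scored.sort(key=lambda c: c["score"], reverse=True)
    return scored[:limit]

@app.route("/reconcile", methods=["GET", "POST"])
def reconcile():
    queries = request.values.get("queries")
    if queries:
        qs = json.loads(queries)
        return jsonify({key: {"result": candidates(q["query"])} for key, q in qs.items()})
    # A bare GET returns service metadata, so Refine knows what it's talking to
    return jsonify({"name": "Toy HEI reconciliation service"})

if __name__ == "__main__":
    app.run(port=5000)
```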

PS I’m reminded of a couple of related things: 1) Test Your RESTful API With YQL – a corollary, mapped to the world of APIs, of the idea that you can check your data at least works by trying to use it (eg by generating a simple chart from it): if you can’t easily generate a YQL table/wrapper for an API, it’s maybe not that easy to use? 2) the scraperwiki/okf post from @frabcus and @rufuspollock on the need for data management systems, not content management systems.

PPS Looking at the actual Guardian figures reveals all sorts of market levers appearing… Via @dkernohan, FT: A quiet Big Bang in universities

DataFriday on GetTheData…

I’ve hereby arbitrarily decided it’s #dataFriday on GetTheData.org, the question-and-answer (Q&A) site for all your data needs (whether it’s finding a data set, working with a data set, parsing a data set, or looking for ways of analysing or visualising a data set).

The idea behind dataFriday is that while we’re growing the community, we need an occasional sprint of Q’ing and A’ing (not least in the hope we might broker a few connections between people by means of a couple of rapid asking and answering exchanges).

So if you work with data, are struggling with data, or are publishing data that meets the needs of someone who’s looking for that data, pop over to GetTheData.org right now… (go on, you know you want to… and it’s Friday, right…? Too late to start that new work project, but the perfect opportunity to spend 5 minutes doing a good deed for the web and the day…;-)

First Inklings of a Small Contract Market Around Data Services? And a concern…

A few days ago I was tipped off about a “bounty” request posted on Scraperwiki, offering 50 quid for a scrape of the DVLA test centres (a bounty on which Scraperwiki seems to add a commission).

Scraperwiki also appears to be offering a “private scraper” service as a business model. Maybe visualisation design around the wiki will be the next thing to be offered on the market?!

Another hint that folk may be willing to pay to get data into a useable form appeared on GetTheData in a request for information about currency data from a professional, non-coder journalist that suggested a payment may be in the offing for anyone who could help.

Given that a lot of the data that is apparently out there is readily scrapeable but actually subject to non-commercial, personal-use-only end user licences, I do wonder whether there will be a black market in unlicensed data that gets laundered through a series of steps that don’t respect attribution, let alone other, more stringent licence conditions.

On the other hand, I wonder whether or not GetTheData should have a facility for associating a bounty with a particular query?

And the concern? It’s to do with the ethics of scraping or aggregating large amounts of personal – albeit public – data from folk on social networks. For example, it’s easy enough to find out who’s being wished a happy birthday on Twitter, and I have more than a few tools for grabbing friends and follower lists around hashtags, search terms, Twitter lists, usernames, and so on. Once we start mining that data, it may be possible to discover things about folk from the public context they inhabit that they didn’t realise could be deduced from that context. So what should our response be if we get a request on GetTheData asking how to mine public social data around a named individual? It may not be phone tapping, but something about that sort of request, should it ever occur, wouldn’t feel quite right to me…

A Few More Thoughts on GetTheData.org

As we come up to a week in on GetTheData.org, there’s already an interesting collection of questions – and answers – starting to appear on the site, along with a fledgling community (thanks for chipping in, folks:-), so how can we maintain – and hopefully grow – interest in the site?

A couple of things strike me as most likely to make the site attractive to folk:

– the ability to find an appropriate – and useful – answer to your question without having to ask it, for example because someone has already asked the same, or a similar, question;
– timely responses to questions once asked (which leads to a sense of community, as well as utility).

I think it’s also worth bearing in mind the context that GetTheData sits in. Many of the questions result in answers that point to data resources that are listed in other directories. (The links may go to either the data home page or its directory page on a data directory site.)

Data Recommendations
One thing I think is worth exploring is the extent to which GetTheData can both receive and offer recommendations to other websites. Within a couple of days of releasing the site, Rufus had added a recommendation widget that could recommend datasets hosted on CKAN that seem to be related to a particular question.

GetTheData.org - related datasets on CKAN

What this means is that even before you get a reply, a recommendation might be made to you of a dataset that meets your requirements.
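
(For anyone wondering what such a widget does under the hood, here’s a sketch of the sort of lookup involved, assuming a CKAN instance that exposes the package_search action API; the instance URL and the API form are assumptions, not necessarily what Rufus’ widget actually calls:)

```python
# Sketch: the sort of search a related-datasets widget might run against CKAN.
import requests

def related_datasets(question_title, rows=5):
    r = requests.get(
        "https://demo.ckan.org/api/3/action/package_search",
        params={"q": question_title, "rows": rows},
    )
    r.raise_for_status()
    results = r.json()["result"]["results"]
    return [(d["title"], d["name"]) for d in results]

for title, name in related_datasets("uk university funding"):
    print(title, "->", name)
```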

(As with many other Q&A sites, GetTheData also tries to suggest related questions when you enter your question, to prompt you to consider whether or not your question has already been asked – and answered.)

I think the recommendation context is something we might be able to explore further, both in terms of linking to recommendations of related data on other websites, and in the sense of reverse links from GetTheData to those sites.

For example:

– would it be possible to have a recommendation widget on GetTheData that links to related datasets from the Guardian datastore, or National Statistics?
– are there other data directory sites that can take one or more search terms and return a list of related datasets?
– could a getTheData widget be located on CKAN data package pages to alert package owners/maintainers that a question possibly related to the dataset had been posted on GetTheData? This might encourage the data package maintainer to answer the question on the getTheData site with a link back to the CKAN data package page.

As well as recommendations, would it be useful for GetTheData to syndicate new questions asked on the site? For example, I wonder if the Guardian Datastore blog would be willing to add the new questions feed to the other datablogs they syndicate?;-) (Disclosure: data tagged posts from OUseful.info get syndicated in that way.)

Although I don’t have any good examples of this to hand from GetTheData, it strikes me that we might start to see questions that relate to obtaining data which is actually a view over a particular data set. This view might best be obtained via a particular query onto that data set, such as a specific SPARQL query on a Linked Data set, or a particular Google query language request to the visualisation API against a particular Google spreadsheet.
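
To make the idea concrete, such a “view” might be nothing more than a saved query like the one below; the endpoint URL and predicate are placeholders, but the shape is the thing:

```python
# Sketch: a "view over a data set" captured as a saved SPARQL query.
# The endpoint and predicate are hypothetical, not a real service.
import requests

ENDPOINT = "https://example.org/sparql"
QUERY = """
SELECT ?institution ?funding WHERE {
  ?institution <http://example.org/def/teachingFunding> ?funding .
}
LIMIT 10
"""

r = requests.get(ENDPOINT, params={"query": QUERY},
                 headers={"Accept": "application/sparql-results+json"})
r.raise_for_status()
for binding in r.json()["results"]["bindings"]:
    print(binding["institution"]["value"], binding["funding"]["value"])
```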

If we do start to see such queries, then it would be useful to aggregate these around the datastores they relate to, though I’m not sure how we could best do this at the moment other than by tagging?

News announcements
There are a wide variety of sites publishing data independently, and a fractured network of data directories and data catalogues. Would it make sense for GetTheData to aggregate news announcements relating to the release of new data sets, and somehow use these to provide additional recommendations around data sets?
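
One cheap-and-cheerful way in might simply be to poll the RSS/Atom feeds of known data publishers – a sketch, with placeholder feed URLs, assuming the feedparser library:

```python
# Sketch: poll publishers' feeds for new-dataset announcements.
import feedparser

FEEDS = [
    "https://example.org/datastore/rss",      # placeholder: a data blog's feed
    "https://example.com/new-datasets.xml",   # placeholder: a catalogue's new-items feed
]

for url in FEEDS:
    feed = feedparser.parse(url)
    for entry in feed.entries[:3]:
        print(entry.get("title"), "-", entry.get("link"))
```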

Hackdays and Data Fridays
As suggested in Bootstrapping GetTheData.org for All Your Public Open Data Questions and Answers:

If you’re running a hackday, why not use GetTheData.org to post questions arising in the scoping of the hacks, tweet a link to the question to your event backchannel, and give the remote participants a chance to contribute back, at the same time adding to the online legacy of your event.

Alternatively, how about “Data Fridays”, on the first Friday in the month, where folk agree to check GetTheData two or three times that day and engage in something of a distributed data related Question and Answer sprint, helping answer unanswered questions, and maybe pitching in a few new ones?

Aggregated Search
It would be easy enough to put together a Google custom search engine that searches over the domains of data aggregation sites, and possibly also offers filetype search limits?
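
(In the meantime, a similar effect can be faked by rewriting queries with site: and filetype: operators – the domains below are just examples:)

```python
# Sketch: approximate a data-focused custom search engine by rewriting the
# query with site: and filetype: operators (the domains are illustrative).
DATA_SITES = ["data.gov.uk", "ckan.net", "guardian.co.uk/data"]

def data_search_query(terms, filetype=None):
    sites = " OR ".join("site:" + s for s in DATA_SITES)
    query = "%s (%s)" % (terms, sites)
    if filetype:
        query += " filetype:%s" % filetype
    return query

print(data_search_query("university funding", filetype="csv"))
```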

So What Next?
Err, that’s it for now…;-) Unless you fancy seeing if there’s a question you can help out on right now at GetTheData.org