A post on the Guardian Datablog yesterday (Higher education funding: which institutions will be affected?) alerted me to the release of HEFCE’s “provisional allocations of recurrent funding for teaching and research, and the setting of student number control limits for institutions, for academic year 2012-13” (funding data).
Here are the OU figures for teaching:
The table (HEFCE preliminary teaching funding allocations to the Open University, 2012-13) breaks the teaching allocation down into: funding for old-regime students (mainstream and co-funding), high cost funding for new-regime students, widening participation, teaching enhancement and student success, other targeted allocations, other recurrent teaching grants, and the total teaching funding.
Of the research funding for 2012-13, mainstream funding was £8,030,807 and the RDP supervision fund came in at £1,282,371, along with £604,103 of “other” funding, making up the full £9,917,281 research allocation.
Adding Higher Education Innovation Funding of £950,000, the OU’s total allocation was £139,714,060.
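The arithmetic can be sanity-checked in a couple of lines; note that the teaching total here is implied by subtraction from the overall allocation, rather than quoted directly:

```python
# Quick sanity check of the HEFCE allocation figures quoted above.
research = 8_030_807 + 1_282_371 + 604_103  # mainstream + RDP supervision + other
assert research == 9_917_281

heif = 950_000           # Higher Education Innovation Funding
total = 139_714_060      # total OU allocation

# The teaching total implied by the other figures:
teaching = total - research - heif
print(teaching)  # 128846779
```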
So what other funding comes into the universities from public funds?
Open Spending publishes data relating to spend by government departments to named organisations, so we can search it for amounts spent by government departments with the universities. For example, here is a search on OpenSpending.org for “open university”:
Given the amounts spent by public bodies on consultancy (try searching OpenCorporates for mentions of PricewaterhouseCoopers, or any of EDS, Capita, Accenture, Deloitte, McKinsey, BT’s consulting arm, IBM, Booz Allen, PA, KPMG (h/t @loveitloveit)), university-based consultancy may come in reasonably cheaply?
The universities also receive funding for research via the UK research councils (EPSRC, ESRC, AHRC, MRC, BBSRC, NERC, STFC) along with innovation funding from JISC. Unpicking the research council funding awards to universities can be a bit of a chore, but scrapers are appearing on Scraperwiki that make for easier access to individual grant awards data:
- AHRC funding scraper; [grab data using queries of the form select * from `swdata` where organisation like "%open university%" on scraper arts-humanities-research-council-grants]
- EPSRC funding scraper; [grab data using queries of the form select * from `grants` where department_id in (select distinct id as department_id from `departments` where organisation_id in (select id from `organisations` where name like "%open university%")) on scraper epsrc_grants_1]
- ESRC funding scraper; [grab data using queries of the form select * from `grantdata` where institution like "%open university%" on scraper esrc_research_grants]
- BBSRC funding [broken?] scraper;
- NERC funding [broken?] scraper;
- STFC funding scraper; [grab data using queries of the form select * from `swdata` where institution like "%open university%" on scraper stfc-institution-data]
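At the time of writing, Scraperwiki exposes each scraper’s datastore through a simple SQL-over-HTTP API, so queries like the ones above can be run remotely. Here’s a minimal sketch of building such a query URL (the endpoint path and parameter names are as I recall them, so treat them as an assumption):

```python
import urllib.parse

def scraperwiki_sql_url(scraper, query):
    """Build a query URL for the Scraperwiki SQLite datastore API
    (endpoint and parameter names assumed)."""
    base = "https://api.scraperwiki.com/api/1.0/datastore/sqlite"
    params = {"format": "json", "name": scraper, "query": query}
    return base + "?" + urllib.parse.urlencode(params)

# e.g. the ESRC query from the list above:
url = scraperwiki_sql_url(
    "esrc_research_grants",
    'select * from `grantdata` where institution like "%open university%"',
)
print(url)
```

The JSON that comes back from a URL like this is then a list of grant records, one per row of the scraper’s table.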
In order to get a unified view over the detailed funding of the institutions from these different sources, the data needs to be reconciled. There are several ID schemes for identifying universities (eg UCAS or HESA codes; see for example GetTheData: Universities by Mission Group) but even official data releases tend not to make use of these, preferring instead to rely solely on institution names, as for example in the case of the recent HEFCE provisional funding data release [Doh! This is not the case – identifiers are there, apparently. (I have to admit, I didn’t check and was being a little hasty… See the contribution/correction from David Kernohan in the comments to this post.)]:
For some time, I’ve been trying to put my finger on why data releases like this are so hard to work with, and I think I’ve twigged it… even when released in spreadsheet form, the data often still isn’t immediately “database-ready” data. Getting data from a spreadsheet into a database often requires an element of hands-on crafting – coping with rows that contain irregular comment data, as well as handling columns or rows with multicolumn and multirow labels. So here are a couple of things that would make life easier in the short term, though they maybe don’t represent best practice in the longer term:
1) release data as simple CSV files (odd as it may seem), because these can be easily loaded into applications that can actually work on the data as data. (I haven’t yet thought much about pragmatic ways of dealing with spreadsheets where cell values are generated by formulae; those formulae do at least provide an audit trail from one data set to derived views generated from it.)
2) have a column containing regular identifiers using a known identification scheme, for example, HESA or UCAS codes for HEIs. If the data set is a bit messy, and you can only partially fill the ID column, then only partially fill it; it’ll make life easier joining those rows at least to other related datasets…
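To illustrate point 2, here’s a minimal sketch of why even a partially filled ID column helps: rows that carry an identifier join cleanly against a lookup table, and the rest are simply left unmatched rather than blocking the whole join. (All the codes and values here are made up for illustration.)

```python
import csv
import io

# Hypothetical funding data: the hesa_id column is only partially filled.
funding_csv = """institution,hesa_id,total_funding
The Open University,0168,139714060
Some New Institution,,1234567
"""

# Hypothetical lookup table keyed on the same (made-up) HESA-style code.
lookup = {"0168": {"ukprn": "10007773"}}

rows = list(csv.DictReader(io.StringIO(funding_csv)))
for row in rows:
    # Rows with an identifier pick up the extra columns; the rest stay unmatched.
    row.update(lookup.get(row["hesa_id"], {"ukprn": None}))

print(rows)
```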
As far as UK HE goes, the JISC monitoring unit/JISCMU has an API over various administrative data elements relating to UK HEIs (eg GetTheData: Postcode data for HE and FE institutes), but I don’t think it offers a Google Refine reconciliation service (ideally with some sort of optional string similarity service)…? Yet?! ;-) Maybe that’d make for a good rapid innovation project?
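In the absence of a proper reconciliation service, a crude string-similarity matcher of the sort such a service might wrap can be knocked together from the Python standard library (the codes here are placeholders, not real HESA codes):

```python
import difflib

# Canonical institution names keyed by a placeholder HESA-style code.
canonical = {
    "0168": "The Open University",
    "0149": "University of Oxford",
    "0113": "University of Birmingham",
}

def reconcile(name, cutoff=0.7):
    """Match a messy institution name against the canonical list, returning
    the code of the closest match above the similarity cutoff, or None."""
    matches = difflib.get_close_matches(name, canonical.values(), n=1, cutoff=cutoff)
    if not matches:
        return None
    matched = matches[0]
    return next(code for code, n in canonical.items() if n == matched)

print(reconcile("Open University, The"))      # matches "0168"
print(reconcile("Completely Different College"))  # no match: None
```

The cutoff needs tuning against real data, of course, and a production reconciliation service would also want to return the match score so a human can review borderline cases.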
PS I’m reminded of a couple of related things: 1) Test Your RESTful API With YQL – a corollary to the idea that you can check your data at least works by trying to use it (eg by generating a simple chart from it), mapped to the world of APIs: if you can’t easily generate a YQL table/wrapper for it, it’s maybe not that easy to use? 2) the scraperwiki/okf post from @frabcus and @rufuspollock on the need for data management systems, not content management systems.
PPS Looking at the actual Guardian figures reveals all sorts of market levers appearing… Via @dkernohan, FT: A quiet Big Bang in universities
It’s generally taken as read that folk hate doing documentation*. This is as true of documenting data and APIs as it is of code. I’m not sure if anyone has yet done a review of “what folk want from published datasets” (JISC? It’s probably worth a quick tender call…?), but there have certainly been a few reports around what developers are perceived to expect of an API and its associated documentation and community support (e.g. UKOLN’s JISC Good APIs Management Report and API Good Practice reports, and their briefing docs on APIs).
* this is one reason why I think bloggers such as myself, Martin Hawksey and Liam Green Hughes offer a useful service: we do quick demos and getting started walkthroughs of newly launched services, demonstrating their application in a “real” context…
At a recent technical advisory group meeting in support of the Resource Discovery Taskforce UK Discovery initiative (which is aiming to improve the discoverability of information resources through the publication of appropriate metadata, and hopefully a bit of thought towards practical SEO…) I suggested that a Q and A site might be in order to support developer activities: content is likely to be relevant, pre-SEOd (blending naive language questions with technical answers), and maintained and refreshed by the community:-)
In much the same way that JISCPress arose organically from the ad hoc initiative between myself and Joss Winn that was WriteToReply, I suggested that the question and answer site with a focus on data that I set up with Rufus Pollock might provide a running start to UK Discovery Q&A site: GetTheData.
API connections to OSQA, the codebase that underpins GetTheData, are still lacking, but there are mechanisms for syndicating content via RSS feeds (for example, it’s easy enough to get a feed of tagged questions, or of questions and answers relating to a particular search query); which is to say, we could pull ukdiscovery-tagged questions and answers into the UK Discovery website developers’ area.
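By way of illustration, pulling items out of such a feed for republication elsewhere takes only a few lines. The tag-feed URL pattern is an assumption on my part, so I’ve inlined a minimal RSS 2.0 fragment of the sort OSQA emits rather than fetching a live feed:

```python
import xml.etree.ElementTree as ET

# A minimal RSS 2.0 fragment of the sort an OSQA tag feed returns
# (the live feed URL pattern, e.g. /feeds/rss/?tags=ukdiscovery, is assumed).
sample_feed = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <title>GetTheData - questions tagged ukdiscovery</title>
  <item><title>Where can I find HEI metadata?</title><link>http://getthedata.org/questions/1</link></item>
  <item><title>How do I parse this feed?</title><link>http://getthedata.org/questions/2</link></item>
</channel></rss>"""

root = ET.fromstring(sample_feed)
# Pull out (title, link) pairs ready for re-publication on a third-party site.
items = [(i.findtext("title"), i.findtext("link")) for i in root.iter("item")]
for title, link in items:
    print(title, link)
```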
Another issue relates to whether or not developers would actually engage in the asking and answering of questions around UK Discovery technical issues. Something I’ve been mulling over is the extent to which GetTheData could actually be used to provide QandA styled support documentation for published data or data APIs, concentrating a wide range of data related Q&A content on GetTheData (and hence helping to build community/activity through regularly refreshed content and a critical mass of active users) and then syndicating specific content to a publisher’s site.
So for example: if a data/api publisher wants to use GetTheData as a way of supporting their documentation/FAQ effort, we could set them up as an admin and allow them rights over the posting and moderation of questions and answers on the site. (Under the current permissions model, I think we’d have to take it on trust that they wouldn’t mess with other bits of the site in a reckless or malevolent way…;-)
API/data publishers could post FAQ style questions on GetTheData and provide canned, accepted (“official”) answers. Of course, the community could also submit additional answers to the FAQs, and if they improve on the official answer be promoted to accepted answers. Through syndication feeds, maybe using a controlled tag filtered through a question submitter filter (i.e. filtering questions by virtue of who posted them), it would be possible to get a “maintained” lists of questions out of GetTheData that could then be pulled in via an RSS feed into a third party site – such as the FAQ area of a data/api publisher’s website.
Additional activity (i.e. community sourced questions and answers) around the data/API on GetTheData could also be selectively pulled in to the official support site. (We may also be able to pull out the lists of people who are active around a particular tag???) In the medium term, it might also be possible to find a way of supporting remote question submission that could be embedded on the API/data site…
If any data/API publishers would like to explore how they might be able to use GetTheData to power FAQ areas of their developer/documentation sites, please get in touch:-)
And if anyone has comments about the extent to which GetTheData, or OSQA, either is or isn’t appropriate for discovery.ac.uk, please feel free to air them below…:-)
I’ve hereby arbitrarily decided it’s #dataFriday on GetTheData.org, the question and answer / Q&A site for all your data needs (whether it’s finding a data set, working a data set, parsing a data set, or looking for ways of analysing or visualising a data set).
The idea behind dataFriday is that while we’re growing the community, we need an occasional sprint of Q’ing and A’ing (not least in the hope we might broker a few connections between people by means of a couple of rapid asking and answering exchanges).
So if you work with data, are struggling with data, or are publishing data that meets the needs of someone who’s looking for that data, pop over to getTheData.org right now… (go on, you know you want to… and it’s Friday, right…? Too late to start that new work project, but the perfect opportunity to spend 5 minutes doing a good deed for the web and the day…;-)
A few days ago I was tipped off about a “bounty” request posted on Scraperwiki, offering 50 quid for a scrape of the DVLA test centres (with Scraperwiki seemingly adding a commission on top of the bounty).
Scraperwiki also appears to be offering a “private scraper” service as a business model. Maybe visualisation design around a wiki will be next to be offered on the market?!
Another hint that folk may be willing to pay to get data into a useable form appeared on GetTheData in a request for information about currency data from a professional, non-coder journalist that suggested a payment may be in the offing for anyone who could help.
Given that a lot of the data that is apparently out there is readily scrapeable, but is actually subject to non-commercial, personal-use-only end user licences, I do wonder whether a black market will emerge in unlicensed data that gets laundered through a series of steps that don’t respect attribution, let alone other, more stringent licence conditions.
On the other hand, I wonder whether or not GetTheData should have a facility for associating a bounty with a particular query?
And the concern? It’s to do with the ethics of scraping or aggregating large amounts of personal – albeit public – data from folk on social networks. For example, it’s easy enough to find out who’s being wished a happy birthday on Twitter, and I have more than a few tools for grabbing friends and follower lists around hashtags, search terms, Twitter lists and usernames, and so on. Once we start mining data, it may be possible to discover things about folk from the public context they inhabit that maybe reveals something about them they didn’t realise could be deduced from the context? So what should our response be if we get a request on GetTheData asking someone how to mine public social data around a named individual… It may not be phone tapping, but something about that sort of request, should it ever occur, wouldn’t feel quite right to me…?
As we come up to a week in on GetTheData.org, there’s already an interesting collection of questions – and answers – starting to appear on the site, along with a fledgling community (thanks for chipping in, folks:-), so how can we maintain – and hopefully grow – interest in the site?
A couple of things strike me as the most likely things to make the site attractive to folk:
– the ability to find an appropriate – and useful – answer to your question without having to ask it, for example because someone has already asked the same, or a similar, question;
– timely responses to questions once asked (which leads to a sense of community, as well as utility).
I think it’s also worth bearing in mind the context that GetTheData sits in. Many of the questions result in answers that point to data resources that are listed in other directories. (The links may go to either the data home page or its directory page on a data directory site.)
One thing I think is worth exploring is the extent to which GetTheData can both receive and offer recommendations to other websites. Within a couple of days of releasing the site, Rufus had added a recommendation widget that could recommend datasets hosted on CKAN that seem to be related to a particular question.
What this means is that even before you get a reply, a recommendation might be made to you of a dataset that meets your requirements.
(As with many other Q&A sites, GetTheData also tries to suggest related questions to you when you enter your question, to prompt you to consider whether or not your question has already been asked – and answered.)
I think the recommendation context is something we might be able to explore further, both in terms of linking to recommendations of related data on other websites, but also in the sense of reverse links from GetTheData to those sites.
– would it be possible to have a recommendation widget on GetTheData that links to related datasets from the Guardian datastore, or National Statistics?
– are there other data directory sites that can take one or more search terms and return a list of related datasets?
– could a getTheData widget be located on CKAN data package pages to alert package owners/maintainers that a question possibly related to the dataset had been posted on GetTheData? This might encourage the data package maintainer to answer the question on the getTheData site with a link back to the CKAN data package page.
As well as recommendations, would it be useful for GetTheData to syndicate new questions asked on the site? For example, I wonder if the Guardian Datastore blog would be willing to add the new questions feed to the other datablogs they syndicate?;-) (Disclosure: data tagged posts from OUseful.info get syndicated in that way.)
Although I don’t have any good examples of this to hand from GetTheData, it strikes me that we might start to see questions that relate to obtaining data which is actually a view over a particular data set. This view might be best obtained via a particular query onto that data set, such as a specific SPARQL query onto a Linked Data set, or a particular Google query language request to the Visualization API against a particular Google spreadsheet.
If we do start to see such queries, then it would be useful to aggregate these around the datastores they relate to, though I’m not sure how we could best do this at the moment other than by tagging?
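As an example of what such a “query as view” looks like in practice, here’s a sketch of building a Google Visualization API query URL onto a Google spreadsheet, following the documented `tq` query interface (the spreadsheet key is a placeholder):

```python
import urllib.parse

def gviz_query_url(spreadsheet_key, query):
    """Build a Google Visualization API query URL onto a Google spreadsheet,
    using the gviz 'tq' query parameter."""
    base = "https://spreadsheets.google.com/tq"
    params = {"key": spreadsheet_key, "tq": query}
    return base + "?" + urllib.parse.urlencode(params)

# Placeholder key; the query itself defines the "view" over the data.
url = gviz_query_url("SPREADSHEET_KEY", "select A,B where C > 1000 order by C desc")
print(url)
```

A URL like this is exactly the sort of thing that could usefully be captured in an answer and aggregated around the datastore it queries.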
There are a wide variety of sites publishing data independently, and a fractured network of data directories and data catalogues. Would it make sense for GetTheData to aggregate news announcements relating to the release of new data sets, and somehow use these to provide additional recommendations around data sets?
Hackdays and Data Fridays
As suggested in Bootstrapping GetTheData.org for All Your Public Open Data Questions and Answers:
If you’re running a hackday, why not use GetTheData.org to post questions arising in scoping the hacks, tweet a link to the question to your event backchannel, and give remote participants a chance to contribute back, at the same time adding to the online legacy of your event.
Alternatively, how about “Data Fridays”, on the first Friday in the month, where folk agree to check GetTheData two or three times that day and engage in something of a distributed data related Question and Answer sprint, helping answer unanswered questions, and maybe pitching in a few new ones?
It would be easy enough to put together a Google custom search engine that searches over the domains of data aggregation sites, and possibly also offer filetype search limits?
So What Next?
Err, that’s it for now…;-) Unless you fancy seeing if there’s a question you can help out on right now at GetTheData.org