Archive for July 2011
On Google+ and Twitter…
So I still haven’t posted anything in Google+ (although my Twitter feed seems to be cleaning up nicely; less noise, more signal. Soon I’ll start pruning the folk I follow who seem to have moved to Google Plus (is it + or Plus?)), not participated in any hangouts (Skype does video confs and group messaging I think? I rarely use either; what group chats I participate in are often in docs anyway), and don’t feel the need to create circles (I have a couple of Twitter search columns open in Tweetdeck, and they show me filtered views of the Twitterverse that I’m interested in, typically hashtag searches (which I tweak on a regular basis to pull in tweets from live event backchannels). I don’t tend to use Twitter lists very much (“circles” from a consumption point of view), though I do use folders to aggregate feeds by topic in Google Reader).
On the rare occasions I post to LinkedIn or Facebook, I use Tweetdeck.
Which is to say… I currently use Tweetdeck…
A couple of years ago(?!), I blogged an aside comment about how it would be handy if Tweetdeck supported a plugin architecture (Filter Tweets by Language). If Tweetdeck had opened up a plugin architecture, I suspect there would have been Google+ support available in that client by now. As it is, Twitter bought up Tweetdeck, and has also been making it clear that it’s not that supportive of third party developers implementing just-another-Twitter-client.
I don’t want to start spending my time in Google+, but the lack of support within Tweetdeck means that if I do feel the need to post into that space, I need to use another client. A quick search turns up various Chrome extensions that bring Facebook and Twitter into the Google+ context, but nothing obvious to me that brings Plus into the Twitter space.
So I’m now faced with the question: continue to keep out of the Google Plus space, or move into that space with a Twitter extension (though I haven’t found one with Tweetdeck like multi-column support, which I’d want), consuming Twitter in that space and then maybe dipping into Google Plus every so often?
If Google+ starts to offer Twitter integration natively, or effective workarounds are found (e.g. creating a Twitter posting circle; for receiving the Twitter stream, can a circle be used to subscribe to e.g. an RSS feed or authenticate to a Twitter stream? Is this sort of support likely to be on the roadmap if so? How about a circle that provides a view of my GMail priority inbox too?) then I may be tempted to move to Google+, not so much for the Google offerings at first, simply as a new Twitter client, but at least one that makes it convenient to try out or dip into Google+ if I ever feel the need. And who knows, that might then lead to me spending more time in Google+ and less in Twitter…
Not being in Google+, I wonder how it is being used though…? Tweetdeck is ideal for 140 chars and link sharing with brief annotations, but it’s not a feed reader, nor a client for clipping services such as Posterous. I like the brevity of the Twitter space, but a feed of items longer than 140 chars is potentially too much. If Google+ usage were based largely around 140 char style updates, Tweetdeck column support for separate circles would be really interesting (along with a switch that made it easy to post to a particular circle). Has anyone produced a Tweetdeck like clone that offers this experience yet? Or is the content shared on Google+ typically too long form to make such a multi-column feed view over separate circles appropriate?
Playing With R/ggplot2 Online (err, I think..?!)
Trying to get my head round what to talk about in another couple of presentations – an online viz tools presentation for the JISC activity data synthesis project tomorrow, and an OU workshop around the iChart eSTeEM project – I rediscovered an app that I’d completely forgotten about: an online R server that supports the plotting of charts using the ggplot library (err, I think?!): http://www.yeroon.net/ggplot2/
Example of how to use http://www.yeroon.net/ggplot2/
By the by, I have started trying to get my head round R using RStudio, but the online ggplot2 environment masks the stats commands and just focusses on helping you create quick charts. I randomly uploaded one of my F1 timing data files from the British Grand Prix, had a random click around, and in 8(?) clicks – from uploading the file, to rendering the chart – I’d managed to create this:
What it shows is a scatterplot for each car of how far ahead the leader is, in seconds, on each lap completed by the leader. When the plotted points drop from 100 or so seconds behind to just a few seconds behind, that car has been lapped.
What this chart shows (which I stumbled across just by playing with the environment) is a birds-eye view over the whole of the race, from each driver’s point of view. One thing I don’t make much use of is the colour dimension – or the size of each plotted point – but if I tweak the input file to include the number of laps a car is behind the leader, their race position, the number of pitstops they’ve had, or their current tyre selection, I could easily view a couple more of these dimensions.
Where there’s a jump in the plotted points for a lap or two, if the step/break goes above the trend line (the gap to leader increases by 20s or so), the car has pitted before the leader. If the jump goes below the trend line (the gap to the leader has decreased), the leader has pitted before the car in question.
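A minimal Python sketch of how the patterns in the chart might be picked out automatically from one car's gap-to-leader series. The function name, the thresholds and the sample gap values are all assumptions made up for illustration; none of them comes from the actual timing file.

```python
def classify_gap_changes(gaps, lapped_drop=60.0, step=15.0):
    """Label lap-to-lap changes in a car's gap to the leader.

    gaps: gap to the leader in seconds, one value per leader lap.
    lapped_drop and step are illustrative thresholds, not calibrated values.
    """
    events = []
    for lap in range(1, len(gaps)):
        change = gaps[lap] - gaps[lap - 1]
        if change <= -lapped_drop:
            # The gap collapses from around a lap's worth of seconds to a few seconds
            events.append((lap + 1, "lapped"))
        elif change >= step:
            events.append((lap + 1, "step above trend"))
        elif change <= -step:
            events.append((lap + 1, "step below trend"))
    return events


# Hypothetical gaps (seconds) for one car over eight leader laps
example_gaps = [70.0, 72.0, 74.5, 96.0, 98.0, 100.5, 3.0, 5.5]
events = classify_gap_changes(example_gaps)
print(events)
```

The lap numbers in the result are 1-based, so an event on lap 4 describes the change between the third and fourth plotted points.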
But that’s not really the point; what is the point is that here is a solution (and I think mirroring options are a possibility) for hosting within an institution an interactive chart generator. I also wonder to what extent it would be possible to extend the environment to detect single sign on credentials and allow a student to access a set of files related to a particular course, for example? Alternatively, it looks as if there is support for loading files in from Google Docs, so would it be possible to use this environment as a way of providing a graphing environment for data files stored (and maybe shared via a course) within a student’s Google Apps account?
Immediate Impressions on JISC’s “Course Data: Making the most of Course Information” Funding Call
Notes on the JISC Grant Funding Call 8/11: “Course Data: Making the most of Course Information” Capital Programme – Call for Letters of Commitment
This post builds on quick commentaries around other reports in the area of Higher Education course data: Immediate Thoughts on the “Provision of information about higher education” and Getting Access to University Course Code Data (or not… (yet…)). It doesn’t necessarily represent my own opinions, let alone those of my employer.
1. The Joint Information Systems Committee (JISC) and the Higher Education Funding Council for England (HEFCE) invite English Universities and FE colleges (teaching over 400 HE FTEs) to become involved in a new programme of work which will help prepare the sector for increasing demands on course data.
3. Funding is available for projects starting from Monday 12 September 2011 for an initial period of approximately three months. Projects selected to go forward into Stage 2 will continue for an additional 12 to 15 months. All projects must be complete by 29 March 2013.
So how does this fit with the timeline for HEFCE Key Information Set (KIS) development if the called for work is relevant to that? (Note: HEFCE makes available much of the monies disbursed by JISC, and HEFCE is managing the KIS work directly.)
| As soon as possible and not later than the end of September 2011 | Technical guidance published by HEFCE |
| January to March 2012 | Submission system open for KISs to be published in September 2012: Institutions submit their data to HEFCE |
| June to early July 2012 | 2012 NSS and DLHE data available to HEFCE |
| July to August 2012 | HEFCE merges data submitted by institutions with 2012 NSS and DLHE data. Institutions quality check and sign off their final KISs |
| September 2012 | KISs available for institutions to upload. All KISs to be accessible via institutional web-sites by the end of the month |
[HEFCE: Provision of information about higher education]
So given the timings, the JISC second phase work looks as if it is supporting processes relating to, and publication of, different sorts of data to the KIS data, although phase 1 work may be relevant to KIS releases?
10. There are 3 main drivers for making it easier for people to find and compare courses:
- prospective fee paying students want to know more about the academic experience a course will provide and be able to compare this with other courses;
- better informed students are more likely to choose a course that they will complete, and be more motivated to achieve better results;
- increased scrutiny by quality assurance agencies and the Government’s requirement for transparency of publicly funded bodies.
11. JISC have made it easier for prospective students to decide which course to study by creating an internationally recognised data standard for course information, known as XCRI-CAP which is conformant with the new European standard for Advertising Learning Opportunities. This will make transferring and advertising information about courses between institutions and organisations more efficient and effective. Placing this data at a consistent COOL URI makes it easier to find.
So there are two end-user groups in mind for the course related information: prospective students, and the scrutineers. XCRI-CAP relates to the publication of information describing at a high level the subject content of a course, rather than the sorts of “metadata” around courses that the KIS will provide. If we were building a course comparison website, the XCRI-CAP data might provide course descriptions relating to a course, whereas the KIS data would provide student satisfaction ratings, teaching hours, assessment strategies, graduate employment rates and salaries. Pricing related information might be common to both sets?

Example of what the KIS display might look like.
Within the university website, developers will be required to identify which course a page relates to, and then call in the appropriate KIS widget from HEFCE or its agent, presumably by passing parameters relating to: institution identifier; course identifier.
In order to display both XCRI-CAP style data and KIS data on the same third party site web page, the third party will need to be able to identify the course identifier and the university identifier. It will also need a way of identifying which course codes are offered by each institution. In order to satisfy requests from potential applicants searching for a particular topic anywhere in the country*, the third party would ideally have access to an index (or at least a comprehensive list either of courses for each institution, or of institutions by course) that allows it to identify and return the set of (institution, course) pairs for which the course satisfies the search term. (Alternatively, for every request, the third party could query every university separately for related courses, aggregate these responses, and then annotate each result with a link to the corresponding KIS information, or its widget.) If the aggregator was to offer a service whereby potential applicants could rank each result according to one or more KIS data elements, it would need to associate the KIS data relating to each of the courses identified by the (institution, course) pairs with the corresponding pair, and then use this aggregated data set to present the result to the end user. Again, this could be achieved by making separate requests to the KIS information server, once for each (institution, course) pair; or it could draw on its own index of this data if the information was openly licensed.
* when thinking about course selection, I often have four scenarios in mind: a) I know what course I want to do and where I want to do it; b) I know where I want to go but don’t know what course to do; c) I know what course I want to do, but don’t know where to do it; d) I don’t know what course to do or where to do it…
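To make the aggregator idea a little more concrete, here is a rough sketch of the "own index" option: build a word index over course descriptions, return the matching (institution, course) pairs for a search term, then rank them on a single KIS data element. Every identifier, description and figure below is invented for illustration, and the field name `satisfaction` is an assumption rather than anything defined by HEFCE or XCRI-CAP.

```python
from collections import defaultdict

# Hypothetical course descriptions, keyed by (institution, course) pairs,
# standing in for data harvested from XCRI-CAP feeds.
course_descriptions = {
    ("INST-A", "C100"): "Introduction to astronomy and planetary science",
    ("INST-B", "C200"): "Astronomy with physics",
    ("INST-B", "C300"): "Medieval history",
}

# Hypothetical KIS-style data elements for the same pairs.
kis_data = {
    ("INST-A", "C100"): {"satisfaction": 82},
    ("INST-B", "C200"): {"satisfaction": 91},
    ("INST-B", "C300"): {"satisfaction": 88},
}


def build_index(descriptions):
    """Map each lower-cased word to the set of (institution, course) pairs using it."""
    index = defaultdict(set)
    for pair, text in descriptions.items():
        for word in text.lower().split():
            index[word].add(pair)
    return index


def search_and_rank(term, index, kis, element):
    """Return matching pairs, highest value first, ranked on one KIS data element."""
    matches = index.get(term.lower(), set())
    ranked = sorted(matches, key=lambda pair: kis[pair][element], reverse=True)
    return [(pair, kis[pair][element]) for pair in ranked]


index = build_index(course_descriptions)
results = search_and_rank("astronomy", index, kis_data, "satisfaction")
print(results)
```

The alternative described above (querying each university, and then the KIS server, once per request) would replace the two dictionaries with network calls but leave the pairing logic unchanged.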
Just by the by, I wonder if the intention of the HEFCE technical working group is to come up with a structured machine readable standard for communicating the KIS information via the KIS widget? That is, will content represented via the KIS widget be marked up in semantic form, or will semantics at the data representation level have to be reverse engineered from the presentation of the information? Where the KIS renders graphical elements, will the charts be generated directly from data transported to the widget, or will the provision simply be flat image files? (Charts displayed in widgets can come in three flavours: 1) as a flat image file with an arbitrary URL (e.g. kisDataImage4.png) (note that data underlying the graph may be described in surrounding metadata, such as within img attribute tags); 2) as an image file generated from data contained within the URL (e.g. as in the mechanism used by the Google Charts API); 3) through the enhancement of data contained within the page (for example, in a JavaScript data structure or an HTML table).)
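For the third flavour, where the data sits in the page as an HTML table, recovering it is fairly painless. The sketch below uses Python's standard `html.parser` module; the widget markup is entirely hypothetical, since the real KIS widget format is not yet known.

```python
from html.parser import HTMLParser


class TableDataParser(HTMLParser):
    """Collect the cell text of each row of any HTML table in the input."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        # Only keep text that appears inside a table cell
        if self._in_cell:
            self._row[-1] += data.strip()


# Hypothetical widget markup, invented for illustration only.
widget_html = """
<table>
  <tr><th>Measure</th><th>Value</th></tr>
  <tr><td>Overall satisfaction</td><td>85</td></tr>
  <tr><td>Scheduled teaching hours</td><td>12</td></tr>
</table>
"""

parser = TableDataParser()
parser.feed(widget_html)
rows = parser.rows
print(rows)
```

A flat image file offers no such route: without surrounding metadata, the numbers would have to be read off the picture.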
The KIS data only partially overlaps with the XCRI-CAP data, so I wonder: to what extent will it be possible to JOIN the two data sets (that is, how will we be able to link XCRI-CAP and KIS data? Via HEI+coursecode keys, presumably?)
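If the two data sets do share an HEI+coursecode key, the join itself is trivial; the interesting part is seeing which records fail to match. A minimal sketch, with invented identifiers and field names throughout:

```python
# Hypothetical records keyed on (HEI identifier, course code); the identifiers
# and field names are invented for illustration.
xcri_cap = {
    ("HEI-1", "X101"): {"title": "Environmental Science"},
    ("HEI-1", "X102"): {"title": "Applied Statistics"},
    ("HEI-2", "Y201"): {"title": "Computing"},
}
kis = {
    ("HEI-1", "X101"): {"satisfaction": 84},
    ("HEI-2", "Y201"): {"satisfaction": 79},
    ("HEI-2", "Y202"): {"satisfaction": 90},
}


def join_on_key(left, right):
    """Inner-join two dicts on their shared keys; also report unmatched keys."""
    shared = left.keys() & right.keys()
    joined = {key: {**left[key], **right[key]} for key in shared}
    left_only = sorted(left.keys() - right.keys())
    right_only = sorted(right.keys() - left.keys())
    return joined, left_only, right_only


joined, xcri_only, kis_only = join_on_key(xcri_cap, kis)
print(joined)
print("No KIS record for:", xcri_only)
print("No XCRI-CAP record for:", kis_only)
```

Whether the real feeds will use the same course codes on both sides is exactly the open question; if they don't, a lookup table between the two code schemes would be needed before a join like this could work.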
12. The proposed programme will support the sector to prepare for the increasing demand for course information, and increase the availability of high-quality, accurate information about part-time, online and distance learning opportunities offered by UK HEIs by:
- funding institutions to make the process and technical innovations necessary to release a structured, machine-readable feed of their course-related information, and;
- creating a proof-of-concept aggregator and discovery service to bring together this course information and enable prospective students to search it.
So – what I think the JISC are suggesting is that they are looking to fund work on the “wider information set” of information around courses? That JISC are also looking to create a “proof-of-concept aggregator and discovery service to bring together this course information and enable prospective students to search it” sounds interesting. I wonder how this would sit in the context of:
- UCAS (which currently concentrates course listings as a basis for a single point of application for entry (how will entry work for the private universities? Cf. also the OU, which has only just started to make use of the UCAS entry route, and which also supports a significant direct entry route onto modules?))
- third party services such as ???Hotcourses
- custom search engines such as CourseDetective, which search over online course prospectuses (and which cost approx. 2 volunteered FTE days to put together at a hackday…;-)
It’s also worth bearing in mind that my TSO OpenUp competition entry also suggested the opening up of course code scaffolding data so that third parties could start to create aggregated and enriched datasets around courses, as well as building services on top of that data that would potentially be revenue generating and commercially sustainable…
Just on the topic of “wider information sets”, here’s what the HEFCE KIS consultation report had to say on the matter:
The wider information set
32. Higher education providers already publish a wide range of information about their institution and the courses they deliver. The information published has been considered by QAA in the context of institutional audit (for publicly funded higher education institutions and those privately funded providers that subscribe to QAA) or of Integrated Quality and Enhancement Review (for further education colleges (FECs) offering HE courses) and is subject to a ‘comment’ in that context. The consultation proposed that institutions should make this information more public-facing, noting that published information would, in due course, be subject to a judgement in QAA review processes.
33. It was proposed that this wider information set has two purposes: to provide information about higher education to a wide variety of audiences including: prospective and current students; students’ parents and advisers; employers; the media; and the institution itself to form part of the evidence used in QAA audit and review.
34. The required information set was presented in the consultation document as a minimum requirement, with institutions continuing to publish as much other information as they wished. Institutions were asked to consider whether any of the information could be presented in more accessible ways.
…
Information about aspects of course/awards (not available in the KIS):
| Information to be provided | Level of information | Availability |
| Prospectuses, programme guides, module descriptors or similar programme specifications; results of internal student surveys; links with employers – where employers have input into a course or programme (this could be quite a high-level statement); partnership agreements, links with awarding bodies/delivery partners | Course/programme level | All apart from results of internal surveys to be publicly available. Results of internal surveys should be available internally |
If there is such pent-up demand for aggregated course discovery services, then they should also be able to run as commercial services? One thing that I would argue currently limits innovation in this area is access to a comprehensive qualification catalogue across the UK. UCAS do have this data, and they do sell it. But I want to play with it and see if I can build a service round it, rather than deciding to quit my job, raise finance, buy the data from UCAS and then see if I can make a go of building a commercial service around the data. UCAS would still benefit from traffic driven to the UCAS site for course registrations. (But then, if aggregators were also aggregating information about courses in the private sector that supported direct entry and did not require central applications and clearing, aggregators might also start recommending courses outside the scope of UCAS…? Hmmm… Because the private universities would probably provide a commercial incentive to drive traffic to them in the form of affiliate fees based on registrations resulting from referrals… Hmmm… This is all starting to put me in mind of things like FOTA, Formula One and the FIA…!)
Another route to a comprehensive course catalogue is through indexing catalogue feeds (akin to website sitemap feeds that detail all the pages on a website to make it easy for search engines to index them) published directly by the universities, such as XCRI-CAP feeds…
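As a sketch of what indexing from such a feed might involve, the following pulls course page URLs out of a sitemap using Python's standard `xml.etree.ElementTree`. The namespace is the one defined by the sitemaps.org protocol; the sitemap content, the example.ac.uk URLs and the `/courses/` path convention are all assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org protocol
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap for an online prospectus; the URLs are placeholders.
sitemap_xml = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.example.ac.uk/courses/b100</loc></url>
  <url><loc>http://www.example.ac.uk/courses/b200</loc></url>
  <url><loc>http://www.example.ac.uk/about</loc></url>
</urlset>
"""


def course_urls(xml_text, path_fragment="/courses/"):
    """List the <loc> URLs in a sitemap that look like course pages."""
    root = ET.fromstring(xml_text)
    locs = [loc.text.strip() for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)]
    return [url for url in locs if path_fragment in url]


urls = course_urls(sitemap_xml)
print(urls)
```

An XCRI-CAP feed would carry the course descriptions themselves rather than just page URLs, so it would need a different parser, but the harvesting pattern is much the same.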
13. The availability of useable course data feeds, and the demonstration of the proof-of-concept aggregator, is intended to provide a catalyst to the feeds being used within existing aggregators, catalogues or information, advice and guidance services, or to form the basis of new services.
I’m not sure an incentive is required.. just open access to the data, free in the first instance. (And if companies do start to make money from it, then license fees can kick in. I don’t think people would have a problem with that…)
15. Between September 2011 and March 2013, JISC intends to fund projects that help institutions review and adapt their internal processes to permit easier access to their course data to meet the needs of various stakeholders. As a minimum, and to provide a clear focus for this overarching activity, the programme will concentrate on the implementation of an XCRI-CAP standard system-generated feed. The programme will be staged to ensure maximum benefit is achieved.
If this data is already exposed via online course prospectuses, a developer with data scraper in hand could probably get a large chunk of this data anyway over the next three to six months. (The CourseDetective CSE definition file already provides a basis for anyone wanting to spider university course catalogues… Hmmm… maybe that’s a good reason for me to get to grips with Lucene…? Ideally, course prospectuses would also produce a sitemap (or XCRI) feed providing URLs for all the course pages currently published via the online prospectus to make it easy for third parties to index, or harvest, this data. The provision of semantic markup in a page, whether through RDFa, microformats, microdata or metadata would also simplify the scraping (i.e. machine parseability) of the course pages. At the very least, using template based, sensibly structured presentation markup that enforces markup conventions that suggest de facto semantics makes pages reliably scrapeable and provides one way of supporting the harvesting of data (if license conditions allow…)) Because, of course, a major reason why potentially commercial services don’t just scrape the data to build course comparison sites relates to the licensing/copyright restrictions that may exist, deliberately or by default, over the university prospectus data that is published online… (Not everyone’s a pirate;-)
16. In Stage 1, institutions will review the maturity of their management of course data using the XCRI Self Assessment Framework. This could cover the full course data life cycle, but must include a particular focus on prospectus and course advertising information. Based on the outcomes of this review, institutions will produce an implementation plan for how they will improve processes to, as a minimum, create a system-generated course advertising feed in a XCRI CAP 1.2 format with a COOL URI.
Ah, ha….
So I wonder, would JISC indemnify a third party looking to scrape, aggregate, and republish this data in a standard form via an open API and a permissive license, against actions taken against them by UCAS and the universities for breach of copyright?! I also wonder whether JISC will be providing guidance about what license conditions they expect XCRI-CAP data to be published under? Or is that out of scope?
19. The anticipated outcomes from this programme of work are:
- There will be increased usage of appropriate technology to streamline course data processes leading to:
– More standardised, and therefore comparable, course information in a consistent location making discovery easier.
– Improved quality and therefore more efficient and effective course data.
– Increased ease in finding and comparing courses, especially types of courses that are currently hard to find, such as ones delivered by distance learning.
- Institutions are able to make appropriate and informed decisions about their processes for managing course-related data, leading to a reduced administrative data burden, cost-effective working, and better business intelligence.
Ah… this is actually different to getting the data out there, then, in a way that third parties can use it? It’s more about tweaking systems and processes inside the institution to support the provisioning of data in ways that make it more accessible to third party aggregators? The course aggregator is then a red herring – it’s just there to provide a reference/candidate client/consumer against which the released data can be targeted.
25. There will be a support and synthesis project that will be working with projects from the start of the programme to help them shape their implementation plans in Stage 1 and other outputs in Stage 2 that are of most use to the sector. Projects are expected to engage with the support and synthesis project and to be proactive in sharing outputs throughout the project. This information will be synthesised and shared with the sector; where that information is sensitive, it will be shared in an aggregated, anonymised form.
A “support and synthesis project” within JISC presumably, (i.e. run by the usual suspects)? Rather than sponsoring and indemnifying the open data community on the one hand, or encouraging potential startups on the other, to start building user facing (potential student) services, along with the necessary business model that will make them sustainable, and maybe even profitable?
26. Funding is provided to enable institutions to carry out project work, but also to release key staff to prepare for, take part in and follow up on these programme-level activities. Projects should allow at least 5 person-days in Stage 1 and 10 person-days in Stage 2.
Such is the price of funding HE based developer activities. 5 days project work: £10k. 10 days project work: £40k-80k. So now you know…
BBC Click Radio Recording As-Live at the OU
In case you’re around the OU campus in Milton Keynes on Monday 18th July, we’re recording the final episode in our season of special episodes on openness with Gareth Mitchell and the BBC Click Radio team.
If you’d like to attend, the recording will take place in the Berrill Lecture Theatre from 1pm to 1.30. Please be seated by 12.45 for the sound check;-)
Immediate Thoughts on the “Provision of information about higher education”
Some immediate thoughts on reading the “Provision of information about higher education” consultation report. Note that the opinions expressed below may not even belong to me, let alone my employer. (They’re just imaginings… or nightmare visions…)
What I still need to do is try to find out how the requirement to provide KIS data over the coming months fits in with JISC’s current Grant Funding Call 8/11: ‘Course Data: Making the most of Course Information’ Capital Programme – Call for Letters of Commitment which is “designed to ensure a high number of engaged institutions, which is vital to get the critical mass needed to effectively demonstrate to the sector the huge potential of organising and presenting course information in a standardised way.” (The initial call is for £10k for each eligible UK HEI, and a second tranche of £40-80,000 for each of 80 or so plan execution projects. (“Do the math”, as they say…)) I don’t know how much HEFCE intend to give to UK HEIs to help underwrite the roll out of KIS (a fair chunk will go to the vendors that provide enterprise software to the HEIs, I guess..?) but I imagine that that will be a not insignificant sum. I just wonder what we’d have been able to do if we’d managed to get hold of the set of course code data that corresponds to the courses offered by UK HEIs? If UCAS would just relax their license conditions, I’m guessing we could even scrape the data and they wouldn’t even have to work out how to drop the corresponding table and make it available in some way… But if we respect their license conditions, we’re *****d.
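Doing the math, as far as the figures above allow: the second tranche can be bounded from the numbers given, but the first depends on how many institutions take up the initial grant, which isn't stated. In the sketch below the figure of 100 institutions is purely a placeholder, and the function names are made up.

```python
def stage2_range(projects=80, low=40_000, high=80_000):
    """Return the (low, high) total for the second tranche, in pounds."""
    return projects * low, projects * high


def stage1_total(institutions, grant=10_000):
    """Total of the initial £10k grants for a given number of institutions."""
    return institutions * grant


low_total, high_total = stage2_range()
print(f"Stage 2: £{low_total:,} to £{high_total:,}")

# The number of institutions taking the initial grant is not stated,
# so 100 is a placeholder rather than a real figure.
print(f"Stage 1 (if 100 institutions): £{stage1_total(100):,}")
```

So the second tranche alone comes to somewhere between £3.2m and £6.4m on the "80 or so" projects figure.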
1. This is a joint publication by HEFCE, Universities UK (UUK) and GuildHE, setting out how it is intended to improve the accessibility and usefulness of information about higher education (HE).
Who says what’s useful?
6. Universities and colleges should publish Key Information Sets (KISs) for undergraduate courses, whether full- or part-time. These KISs will contain information on student satisfaction, graduate outcomes, learning and teaching activities, assessment methods, tuition fees and student finance, accommodation and professional accreditation.
A lot of this data is already available as public data from original sources, or via curated datastores such as the Guardian Datablog. What is lacking at the current time is the scaffolding that lets us create resources capable of spanning the sector at qualification level. Some time ago, I described a simple visual application for comparing summary statistics relating to satisfaction, fees, salary levels and so on across UK universities (Does Funding Equal Happiness in Higher Education?). That was a first step. The second step was to try to start building up information from the course level and begin using that as the focus for comparisons (as well as building out other services, such as book recommendations related to courses). Which was in part why I entered the TSO OpenUp competition…
Through HESA subject codes (which structure subject areas into a three level categorisation), it is possible to compare statistics relating to broad teaching subject offerings across multiple different providers within a particular topic area. Cross-relating teaching subject areas to research areas is still an ad hoc process though, as is obtaining research funding data from across the UK research funding councils and agencies, let alone trying to relate it to teaching subject areas. (Exploiting research for teaching is one of the claimed benefits of undergraduate study; maybe through making accessible an easy way of comparing the amount of research funding provided to particular institutions in different subject areas and the related teaching areas we might get a better handle on the actual relationship that exists between teaching and research excellence?)
Nor is it a simple matter to compare, in detail, the qualifications across teaching providers within a geographical area. The only place that currently describes all the current UK HE qualifications on offer each year is the UCAS website, which also acts as a gateway to applications to HE. One of the key considerations when developing comparison services is the extent to which a service can provide comprehensive coverage over the range of offerings that are being compared. In a very real sense, a comprehensive catalogue of offerings provides the key infrastructure that innovative third parties can build upon. By enriching and annotating a common, core dataset, vendors can develop differentiated services whilst maintaining a level of consistency between them (i.e. the services become comparable). An opportunity also arises for vendors to offer business to business services over that core data set.
The provision of a common key information set about each course/qualification within a university can thus be picked apart as follows:
- firstly, that there exists a comprehensive directory of courses;
- secondly, that for each course, there exists a common set of data attributes, aligned to a common scale;
- thirdly, that the information is provided in a consistent way so as to “support” comparison.
As I have already mentioned, there is a significant amount of data available in public through open licenses that could already be used for the provision of comparison services. What is missing is the scaffolding – the complete course catalogue – that allows this to be done reliably across the sector.
(There is also arguably a lack of opportunity in certain areas for business development. One model might see comparison services acting almost in the role of “independent” educational advisers, helping guide a potential student to an informed choice, and reaping some benefit from that process. For example, let us crudely model the student application lifecycle as: discovery (where to go/what to do), application, study, completion/graduation, employment. In the discovery phase, services might sell advertising, and pick up affiliate fees for prospectus requests for example. In a mature market, the application phase might also accommodate affiliate or referral fees, for example, based on encouraging applications, or even better, accepted and taken up applications. The financial services industry, for all its sins, supports a variety of models for repaying an agent who signs up a client to a longstanding financial product, replete with bonuses and other incentives that encourage the agent to find a product that the client will actually stick with. On completing a degree with a given grade, the agent may get a bonus. (Retention initiatives can start early, arguably before the student even accepts a place at university, through helping them make a decision regarding a course that is likely to suit them!) If you can imagine that universities might set up as recruitment agents, taking a fee for placing a graduate in a particular job on graduation, it’s not hard to also imagine that a bonus might be paid from that placement fee to the agent responsible for referring the undergraduate applicant, as was, in the first place.)
13. Institutions will be required to submit data to HEFCE for inclusion in the KIS. Institutions who subscribe to the QAA but who do not currently take part in the NSS and DLHE should take steps to do so.
So a data burden will be placed on institutions to provide information in a standard way to a central aggregating service? Will there be an opportunity for HEIs to publish this data via an open API, and allow HEFCE to pull/harvest the data from there? Or will the data be required to pass from the HEI, through HEFCE so that HEFCE can put a stamp of approval on it, before it is allowed to be branded as part of the institution’s KIS?
14. All KISs should be made available via institutional web-sites by the end of September 2012.
But will the KIS data also support services that allow the direct comparison of KIS data across institutions on first (university), second (HEFCE, UCAS, Unistats, etc.) and third (commercial, or not-for-profit) party sites without having to visit each of those institutions separately?
22. The plans are based on extensive research, consultation and pilot processes. We are very grateful to all who have given their time and views so generously. There were 215 responses to HEFCE 2010/31, all of which have been carefully considered. We have also taken into account: the views of 2,000 prospective and current students on useful information; several expert working groups considering specific parts of the KIS; a pilot with eight institutions; and user testing with more than 200 prospective HE students. We are particularly pleased to have engaged closely with the National Union of Students in this project, and to have received consultation responses from 30 student unions. We have also liaised with the Academic Registrars’ Council, in an attempt to ensure that the next steps are both feasible and proportionate to implement.
I wonder: did they also consult with open data advocates or web development companies who are familiar with putting data to work in a customer-facing, value adding way? To my shame, I didn’t respond – I came across the consultation after it had closed. (Which suggests the consultation didn’t reach out into that part of the open data community I inhabit? Or maybe I did see it and missed/didn’t pick up on the significance of it at the time:-(
27. The consultation made three primary proposals which are summarised in this section. The first question focused on the purposes of providing information about HE. Responses broadly agreed that information about HE has three purposes:
- to inform people about the quality of higher education and, in particular, to give prospective students information that will help them choose what and where to study
- as evidence for quality assurance processes in institutions
- as information that institutions can use to enhance the quality of their HE provision.
29. The consultation proposed that universities’ and colleges’ web-sites should use a standardised way of publishing key pieces of information about each undergraduate course they offer, by using KISs.
30. KISs would make it easier to find information that prospective students have identified as important to their decisions, and which is mostly already available. The categories of information were identified during research undertaken with 2,000 prospective students, current students and careers advisers by Oakleigh Consulting and Staffordshire University.
So the implication here is that I can compare the data, because each university will separately publish a standard set of data in the same format. So to compare 14 different courses across 8 universities, I probably need to have 14 browser windows open on the same screen at the same time?!
31. In parallel to the consultation, a programme of KIS development work was undertaken. This looked specifically at the information items that do not currently exist in a national comparable format (about learning and teaching, assessment, professional accreditation and accommodation costs) and piloting the processes institutions need to undertake to provide these data. There were also user tests with prospective students. For further details see Annex A.
So the consultation looked at what sort of data might be used to enrich the core data set. One might argue that if the core, course data set were available, third party comparison services might already have started to explore various ways of annotating, enriching and pivoting around the data?
36. The principle of the KIS is that it presents information we have identified that prospective students find useful, in a place we know they already look for such information. In summary, this is information on study, satisfaction, costs and employability, presented on the course information sections of institution’s own web-sites.
“[I]n a place we know they already look for such information”: you could read that as being anti-competitive…? I’d also argue that it doesn’t support the ability to make comparisons. I assume that energy suppliers and mobile phone operators publish similar sorts of information about tariffs on their websites? Why, then, do comparison sites exist?! I’d argue it’s not because they don’t have KIS tables on their sites (though that may contribute). Rather, it’s easier to make a comparison across sites in the context of a single location. (And here, I fear, I start to smell a trap… Because “a place we know they already look for such information” exists in the form of UCAS…)
46. There will be three categories of learning and teaching activities:
- scheduled learning and teaching activities
- guided independent study
- placement/study abroad.
47. Information on these will be presented in a bar chart, as a proportion of hours, on a year-by-year basis, showing each year/stage of study, rather than aggregated for the course as a whole. For KISs relating to part-time study, three bars should also be provided for a standard undergraduate course, each referring to the time equivalent to one year of study if studied full-time.
48. In the interest of providing as much relevant information to the user as possible, a web-link would follow that would lead users to more detailed information. This might be the programme specification or other document, but we would expect this would present more detailed information about learning and teaching, for instance possibly module-level contact hours. This would provide useful contextualised data – something that was a strong theme emerging from consultation responses.
Being able to reliably identify links to programme specifications could be really handy, e.g. for things like the Course Detective approach to custom search engine development…?
67. The salaries for all institutions data will be adjusted to account for regional variations in the salaries earned by graduates in different parts of the country. A link from the KIS to institutional web-sites will enable institutions to provide additional contextual information with particular reference to the different circumstances of different employment sectors (for example the creative industries.)
I can see this causing all sorts of problems when it comes to offering comparisons?
92. Information derived from the NSS and DLHE survey will be presented at course level if sufficient data are available; otherwise NSS and DLHE data will be presented at the most detailed level possible of the Joint Academic Coding System (JACS), subject to the surveys’ response rates and threshold requirements. This information is held by HEFCE and HESA for publicly funded institutions and others that subscribe to HESA.
If data describing UK HE courses were freely available, work could already have started on this…?
93. Annex C provides a detailed breakdown of the expected coverage of the KIS for HEIs, but in summary:
a. The data thresholds we intend to apply to the NSS and DLHE data (which mirror the thresholds we apply on Unistats) mean that roughly one in seven single subject, full-time, first degree KISs will have both DLHE and NSS data available at course level, although in some cases the data presented may need to be aggregated across two years. However, over 95 per cent of KISs will be able to present DLHE or NSS data, or both, when data are included that is aggregated to JACS level 1 and across two years.
b. We expect that about 2 per cent of single subject, part-time, first degree KISs will have full data available; this rises to about 35 per cent when data are included which are aggregated to JACS level 1 and across two years.
c. We expect the KISs where full data are available to cover about 40 per cent of the student body; after allowing for aggregation, the proportion where some data are available is likely to cover over 90 per cent of the student body.
One argument against making a comprehensive course catalogue available under an open public license is that if it were to be used as scaffolding for aggregating different, comparative data sources, lack of coverage over the whole course listing would be confusing and offer a poor user experience. Err…? “[R]oughly one in seven single subject, full-time, first degree KISs will have both DLHE and NSS data available at course level”. So that reason isn’t a deal breaker, then?!
95. We recognise that, even aggregating data over years or over JACS levels, there will be, as on Unistats at present, a number of courses for which it will not be possible to provide data derived from the NSS or DLHE due to the small size of the student cohorts concerned. The thresholds for publication reflect both the need to ensure the statistical validity of the information and the need to meet data protection requirements. There will still be elements of the KIS that will be useful to prospective students, but we recognise the need to ensure prospective students do not negatively interpret the absence of data. We will undertake further user testing over the next few months to finalise appropriate explanatory text.
Ditto.
98. Consideration has been given to who should undertake the production of the KISs, and how. Requiring individual institutions to create their own KISs was considered, but it was felt to be problematic because it would place a significant burden on individual institutions and would pose a challenge in controlling the quality of – potentially – several hundred different production processes, hindering the creation of a single, uniform and credible information source. This task therefore needs to be undertaken by a single body.
So institutions are not going to have a new data burden placed on them?
99. The first year of KISs (those to be published in September 2012) will be centrally created by HEFCE in partnership with HESA. From year two onwards it is intended that central creation will pass to HESA.
100. In the first year, HEFCE will draw data from the NSS and DLHE and institutions will provide additional data (as set out in Table 1). Once this has been collated, HEFCE will provide institutions with web code to be inserted appropriately on their own web-sites.
Hmmm… when I won the TSO Open Up competition, the plan was to get UCAS course code data and then start annotating and enriching it howsoever we could. The reason why I wanted the UCAS data is that it provides the scaffolding to build from. The user focus is the course, so it made sense to build up views over the data from the course level. (We could have started trying to build services out at the level of HESA codes, but that wasn’t what the prize was awarded for.) During the competition pitch, I made the claim that course code data was akin to postcode data to the extent that rights over the seemingly most useful identifier space were controlled by a restrictive license. I don’t yet know what services I want to build out over the course code space, but why is that a reason to prevent innovation in the development of services around course codes by locking those codes down?
103. In order for KISs to be published during September 2012, for use by applicants for entry in academic year 2013-14, institutions must submit their data returns to HEFCE by summer 2012.
So the data burden is on the universities?! But the aggregation – where the value is locked up – is under the control of the centre? Hmmm… thinks… SCONUL charge 80 quid (?) for their aggregated report on HE library stats data, but I’ve managed to FOI the return made to SCONUL by individual libraries. So if there is a KIS like return from HEIs to HEFCE, it should be FOIable, and we can create a copy of the aggregate by aggregating FOI requests. Hmmm…
105. In the main, we would expect the KIS to be revised at most annually; however, a system will be set up to enable exceptions to be processed, for example, corrections to be made or financial information updated. More detail will follow in the technical guidance.
Another of the arguments I’ve heard – this time from universities – to explain a university’s unwillingness to publish a course data API or data dumps is that a third party that aggregates data from universities may end up with data that is stale or out of synch with data on the university website. I suspect that a third party would be quicker to respond to changes than once every 52 weeks…
106. HEFCE is in discussion with the primary providers of institutional data management software to ensure that the new data requirements for the KIS can be incorporated into existing applications as soon as possible.
So how much do we think the third party software vendors are going to claim for making the changes to their systems? And hands up who thinks that those changes will also be antagonistic to developers who might be minded to open up the data via APIs. After all, if you can get data out of your commercially licensed enterprise software via a programmable API, there’s less requirement to stump up the cash to pay for maintenance and the implementation of “additional” features…
107. The KIS will have a strong brand, including a unique logo. This is to ensure that the KIS is as engaging to users as possible, as well as distinguishing it from any other information sources available.
…which sounds to me like someone’s twigged there may be value locked up in the data, and they’re not willing to let it go…
108. A core feature of the KIS is that it is standardised and comparable across HEIs, with consistent branding and presentation. Therefore, in order to avoid confusion, institutions should not publish a document called the KIS or with the KIS logo for any courses where not required.
Brand police… Total ownership. We can haz ur data; we pwn ur data.
110. It is likely the KIS for each course will be available through an embedded ‘widget’ on the institution’s web-site. We do not intend to be prescriptive about where on the web-site this should appear, other than that it should be found near other course information. The widget would contain three items of top-line information, and the option to click through for the full KIS.
Hmmm… did somebody just discover widgets?! So the idea here is to control the brand through a KIS branded widget that can be embedded on University websites?
The obvious question to an open data freak would be: will there be a freely available open API with that, and will the data made available through the API be openly licensed?
The open systems advocate in me would also wonder whether there might be mileage in each institution publishing its own data in an open way via an open API that could be harvested by the central HEFCE aggregator or by a third party. In addition, the KIS data would be available as a service within the institution to institutional developers.
113. In HEFCE 2010/31 we suggested that the KIS should be accessible from the UCAS web-site. Although it was pointed out that not all applications go through UCAS, there was broad support for this approach in the consultation and discussions with UCAS are continuing. UCAS is keen to link the KIS to its site and to explore the possibility of incorporating a comparison function into its planned ‘course finder’ facility, for all courses there are KISs for (including part-time courses), not just those they process applications for.
So HEFCE want to run a data service…?! Will it be an open data service? Or are HEFCE going to get a copy of the course code scaffolding grail and use it to act as infrastructure for a data service that aggregates and re-presents data that is in part already largely available, albeit in a less structured way, via a branded and content controlled widget?
114. We would also like to work with other organisations that provide student information on HE and other related careers guidance. We are keen to promote and publicise the KIS through the various student web-sites and social media outlets that exist.
So will third parties be encouraged and supported in developing their own takes on enriched KIS data?
116. Because KISs will be created centrally, a central database of KISs will be available. HEPISG needs to consider how to use this information, recognising the Government’s intention that data on publicly funded provision should be available for general use. More information will be published on the HEFCE web-site in due course.
Ah – so the data may be available via an open public license. Tip to the HEPISG folk: why not build an API around the data, and serve the widget from that? Furthermore, by making the core course code data available as a dataset, third parties would be able to annotate and enrich that data and serve it as additional information around the “officially sanctioned” KIS data pulled from the API. Finally, a question: if third parties are going to use clustering techniques so that they can provide recommendations on “similar” courses, will they have access to the whole KIS data set so that they can run their own clustering algorithms?!
119. Currently, there is information available via Unistats that will not be available through the KIS. We do not envisage, therefore, that any changes will be made to the Unistats web-site in the KIS’ first year of operation. The focus will be on ensuring that the KIS is available on institutional web-sites as advised in the Oakleigh Consulting and Staffordshire University research, with links to, and from, the UCAS web-site.
So if students want to compare courses, they need to go to N different pages to find the KIS widget on each, and then go and fight with the Unistats website?
120. However, we recognise that, in the longer term, there will be a need to revisit the arrangements to ensure we meet the needs of students for good access to information and that we secure the best use of public money and institutional time. As we move to more established arrangements for the creation and maintenance of the KIS, and look at the use of potential sites for comparing information, we will consider the future of Unistats in the light of the wider policy environment for higher education.
Just open the course/qualification scaffolding data…
123. As well as the KIS and Unistats data, a wider set of information is to be made available by all publicly funded HEIs, FECs with undergraduate provision, and private providers who subscribe to the QAA.
This is the sort of thing third parties might be keen to develop. But to scaffold the collection and delivery of the additional data annotations, the course data could be really handy…
[The paper goes on a bit more, but it's making me angry so I figure I need to take a break!]
Social Media Monitoring: Bit.ly ClickThrus for Your Domain
Quite by chance, I noticed this on my bit.ly settings page earlier today – bit.ly domain tracking:
What it does is allow you to register a verified domain, and get crude stats back about the number of people clicking through bit.ly shortened links for that domain.
I registered blog.ouseful.info, verifying it via a CNAME DNS entry with my domain name provider (easily.co.uk), and then spent a bit of time hunting for the stats. I eventually found them via a drop down element on the Analyse page:
I don’t think you can get this data via the API, but there is a way of doing OAuth API connections, so I might see if I can grab whatever data is available for my account via the bit.ly API into Google Spreadsheet using a tweak of this Google spreadsheets/OAuth recipe.
It’s still not a replacement for BackType of course – I can’t get Twitter IDs of folk who have shared shortened links to my domain, (nor even a list of people who have shortened links to my domain?), but maybe that’s in the pipeline? Or maybe Twitter, Google, Microsoft or Yahoo will buy up bit.ly and shut down what useful services it currently offers…?
MojoEventViz: What’s Going On in Trafalgar Square?
I just saw a tweet commenting on crowds around Trafalgar Square, wondering what was going on…
Do the Twitter usernames provide a clue?
Here are the tweets:
Ah – that’ll explain it…
So What’s Open Government Data Good For? Government and “Independent Advisers”, maybe?
Although I got an invite to today’s “Government Transparency: Opening Up Public Services” briefing, I didn’t manage to attend (though I’m rather wishing I had), but I did manage to keep up with what was happening through the #openuk hashtag commentary.
It all kicked off with the Prime Minister’s Letter to Cabinet Ministers on transparency and open data, which sets out the roadmap for government data releases over the coming months in the areas of health, education, criminal justice, transport and public spending; it also sets the scene for the forthcoming Open Public Services White Paper (see also the public complement to that letter: David Cameron’s article in The Telegraph on transparency).
The Telegraph article suggests there will be a “profound impact” in four areas:
- First, it will enable choice, particularly for patients and parents. …
- Second, it will raise standards. All the evidence shows that when you give professionals information about other people’s performance, there’s a race to the top as they learn from the best. …
- Third, this information is going to help us mend our economy. To begin with, it’s going to save money. Already, the information we have published on public spending has rooted out waste, stopped unnecessary duplication and put the brakes on ever-expanding executive salaries. Combine that with this new information on the performance of our public services, and there will be even more pressure to get real value for taxpayers’ money.
- But transparency can help with the other side of the economic equation too – boosting enterprise. Estimates suggest the economic value of government data could be as much as £6 billion a year. Why? Because the possibilities for new business opportunities are endless. Imagine the innovations that could be created – the apps that provide up-to-date travel information; the websites that compare local school performance. But releasing all this data won’t just support new start-ups – it will benefit established industries too.
David Cameron’s article in The Telegraph on transparency
All good stuff… all good rhetoric. But what does that actually mean? What are people actually going to be able to do differently, Melody?
As far as I can tell, the main business models for making money on the web are:
- sell the audience: the most obvious example of this is to sell adverts to the visitors of your site. The rate advertisers pay is dependent on the number of people who see the ads, and their specificity (different media attract different, possibly niche, audiences. If an audience is the one you’re particularly trying to target, you’re likely to pay more than you would for a general audience, in part because it means you don’t have to go out and find that focussed audience yourself.) Another example is to sell information about the users of your site (for example, banks selling shopping data).
- take a cut: so for example, take an affiliate fee, referral fee or booking fee for each transaction brokered through your site, or levy some other transaction cost.
Where data is involved, there is also the opportunity to analyse other peoples’ data and then sell analysis of that data back to the publishing organisations as consultancy. Or maybe use that data to commercial advantage in putting together tenders and approaches to public bodies?
When all’s said and done, though, the biggest potential is surely within government itself? By making data from one department or agency available, other departments or agencies will have easier access to it. Within departments and agencies too, open data has the potential to reduce friction and barriers to access, as well as opening up the very existence of data sets that may be being created in duplicate fashion across areas of government.
By consuming their own and each others’ open data, departments will also start to develop processes that improve the cleanliness and quality of data sets, (for example, see Putting Public Open Data to Work…? and Open Data Processes – Taps, Query Paths/Audit Trails and Round Tripping; Library Location Data on data.gov.uk gives examples of how the same data can be released in several different (i.e. not immediately consistent) ways).
I’m more than familiar with the saying that “the most useful thing that can be done with your data will probably be done by someone else”, but if an organisation can’t find a way to make use of its own data, why should anyone else even try?! Especially if it means they have to go through the difficulty of cleaning the published data and preparing it for first use. By making use of open data as part of everyday government processes: a) we know the data’s good (hopefully!); b) cleanliness and inconsistency issues will be detected by the immediate publisher/user of the data; c) we know the data will have at least one user.
Finally, one other thing that concerns me is the extent to which “the public” want access to data in order to provide choice. As far as I can tell, choice is often the enemy of contentment; choice can sow the seeds of doubt and inner turmoil when to all intents and purposes there is no choice. I live on an island with a single hospital and not the most effective of rural transport systems. I’d guess the demographics of the island skew old and poor. So being able to “choose” a hospital with performance figures better than the local one for a given procedure is quite possibly no choice at all if I want visitors, or to be able to attend the hospital as an outpatient.
But that’s by the by: because the real issues are that the data that will be made available will in all likelihood be summary statistic data, which actually masks much of the information you’d need to make an informed decision; and if there is any meaningful intelligence in the data, or its summary statistics, you’ll need to know how to interpret the statistics, or even just read the pretty graphs, in order to take anything meaningful from them. And therein lies a public education issue…
Maybe then, there is a route to commercialisation of public facing public data? By telling people the data’s there for you to make the informed choice, the lack of knowledge about how to use that information effectively will open up (?!) a whole new sector of “independent advisers”: want to know how to choose a good school? Ask your local independent education adviser; they can pay for training on how to use the monolithic, more-stats-than-you-can-throw-a-distribution-at one-stop education data portal and charge you to help you decide which school is best for your child. Want comforting when you have to opt for treatment in a hospital that the league tables say is failing? Set up an appointment with your statistical counsellor, who can explain to you that actually things may not be so bad as you fear. And so on…
Fragments: Accessing YouTube Account Data in Google Spreadsheets via OAuth
If you’re running a YouTube account, how might you collect Insights data for all your videos as spreadsheet entries that can be used in the preparation of reports about your social media effectiveness?
One way might be to go to each video in turn and download the separate CSV data files created for each video. Alternatively, you can grab the data via the YouTube/GData API (http://code.google.com/apis/youtube/2.0/developers_guide_protocol_insight.html).
I haven’t actually got round to getting any data out of my YouTube account and into a Google spreadsheet yet, but I have done the first step, which is to set up the authentication using OAuth. Here’s the Google Apps script I used…
function youtube(){
  // Set up the OAuth service configuration for the YouTube GData API
  var oAuthConfig = UrlFetchApp.addOAuthService("youtube");
  oAuthConfig.setAccessTokenUrl("https://www.google.com/accounts/OAuthGetAccessToken");
  oAuthConfig.setRequestTokenUrl("https://www.google.com/accounts/OAuthGetRequestToken?scope=http%3A%2F%2Fgdata.youtube.com%2F");
  oAuthConfig.setAuthorizationUrl("https://www.google.com/accounts/OAuthAuthorizeToken");
  oAuthConfig.setConsumerKey("anonymous");
  oAuthConfig.setConsumerSecret("anonymous");
  // Set up optional parameters to point the request at the OAuth service. The "youtube"
  // value matches the argument to "addOAuthService" above.
  var options =
  {
    "oAuthServiceName" : "youtube",
    "oAuthUseToken" : "always"
  };
  // Fetch the authenticated user's favourites feed as JSON and log the parsed result
  var result = UrlFetchApp.fetch("http://gdata.youtube.com/feeds/api/users/default/favorites?v=2&alt=json", options);
  var o = Utilities.jsonParse(result.getContentText());
  Logger.log(o);
}
[Gist here: https://gist.github.com/1067283]
The first time you run the script, it should request access from your YouTube account…
The next step is to work out what to pull from YouTube, and how to actually store it in the spreadsheet…
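As a rough sketch of what the storing step might look like, here's a helper that flattens a parsed feed into a header row plus one row per entry. The field names (feed.entry, title.$t, yt$statistics.viewCount) and the function name feedToRows are my assumptions about the shape of the JSON, not something I've checked against the live feed, so compare them with what Logger.log(o) actually reports first.

```javascript
// Sketch only: turn a parsed GData-style JSON feed into rows for a spreadsheet.
// ASSUMPTION: entries live in o.feed.entry, titles in entry.title.$t and view
// counts in entry["yt$statistics"].viewCount; check against your logged output.
function feedToRows(o) {
  var entries = (o && o.feed && o.feed.entry) || [];
  var rows = [["Title", "Views"]]; // header row
  for (var i = 0; i < entries.length; i++) {
    var entry = entries[i];
    var title = (entry.title && entry.title.$t) || "";
    var stats = entry["yt$statistics"] || {};
    // Counts arrive as strings in this style of feed, so convert to numbers;
    // leave the cell blank if the statistic is missing
    var views = stats.viewCount !== undefined ? Number(stats.viewCount) : "";
    rows.push([title, views]);
  }
  return rows;
}
```

Inside Apps Script, the resulting rows could then probably be written out in one go with something like SpreadsheetApp.getActiveSheet().getRange(1, 1, rows.length, rows[0].length).setValues(rows), since every row has the same number of columns.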
PS a couple more YouTube snippets of interest:
- YouTube documentation wizard: customise your YouTube API documentation view
- interactive YouTube API explorer
Visualising Twitter Friend Connections Using Gephi: An Example Using the @WiredUK Friends Network
To corrupt a well known saying, “cook a man a meal and he’ll eat it; teach a man a recipe, and maybe he’ll cook for you…”, I thought it was probably about time I posted the recipe I’ve been using for laying out Twitter friends networks using Gephi, not least because I’ve been generating quite a few network files for folk lately, giving them copies, and then not having a tutorial to point them to. So here’s that tutorial…
The starting point is actually quite a long way down the “how did you do that?” chain, but I have to start somewhere, and the middle’s easier than the beginning, so that’s where we’ll step in (I’ll give some clues as to how the beginning works at the end…;-)
Here’s what we’ll be working towards: a diagram that shows how the people on Twitter that @wiredUK follows follow each other:
The tool we’re going to use to lay out this graph from a data file is a free, extensible, open source, cross platform Java based tool called Gephi. If you want to play along, download the datafile. (Or try with a network of your own, such as your Facebook network.)
From the Gephi file menu, Open the appropriate graph file:
Import the file as a Directed Graph:
The Graph window displays the graph in a raw form:
Sometimes a graph may contain nodes that are not connected to any other nodes. (For example, protected Twitter accounts do not publish – and are not published in – friends or followers lists publicly via the Twitter API.) Some layout algorithms may push unconnected nodes far away from the rest of the graph, which can affect generation of presentation views of the network, so we need to filter out these unconnected nodes. The easiest way of doing this is to filter the graph using the Giant Component filter.
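If you're wondering what a filter like that is doing, the idea can be sketched in a few lines: ignore edge direction, find the connected groups of nodes, and keep the biggest. This is just my illustration of the idea, not Gephi's code; the function name and the nodes/edges input format (an array of ids and an array of [source, target] pairs) are made up for the example.

```javascript
// Sketch only: return the ids in the largest weakly connected component.
// ASSUMPTION: nodes is an array of ids, edges an array of [source, target] pairs.
function giantComponent(nodes, edges) {
  var neighbours = Object.create(null);
  nodes.forEach(function (n) { neighbours[n] = []; });
  edges.forEach(function (e) {
    // Ignore direction: a follow in either direction counts as a connection
    if (e[0] in neighbours && e[1] in neighbours) {
      neighbours[e[0]].push(e[1]);
      neighbours[e[1]].push(e[0]);
    }
  });
  var seen = Object.create(null);
  var best = [];
  nodes.forEach(function (start) {
    if (seen[start]) return;
    // Breadth-first search collects everything reachable from this start node
    var component = [];
    var queue = [start];
    seen[start] = true;
    while (queue.length > 0) {
      var current = queue.shift();
      component.push(current);
      neighbours[current].forEach(function (next) {
        if (!seen[next]) {
          seen[next] = true;
          queue.push(next);
        }
      });
    }
    if (component.length > best.length) best = component;
  });
  return best;
}
```

Unconnected nodes each form a component of size one, so they drop out as soon as any larger component exists.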
To colour the graph, I often make use of the modularity statistic. This algorithm attempts to find clusters in the graph by identifying components that are highly interconnected.
This algorithm is a random one, so it’s often worth running it several times to see how many communities typically get identified.
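To give a feel for what is being measured, here's a sketch that scores a given grouping of nodes using the standard modularity formula for an undirected graph: for each group, the fraction of edges that fall inside it, minus the square of the fraction of edge ends attached to it. It doesn't do the community finding itself, and it treats edges as undirected for simplicity; the function name and the community lookup object are my own inventions for the example.

```javascript
// Sketch only: modularity score Q of a given partition, treating edges as undirected.
// ASSUMPTION: edges is an array of [source, target] pairs and community maps
// each node id to a community label.
function modularity(edges, community) {
  var m = edges.length;
  if (m === 0) return 0;
  var internal = Object.create(null); // edges with both ends in the community
  var degree = Object.create(null);   // edge ends attached to the community
  edges.forEach(function (e) {
    var cs = community[e[0]];
    var ct = community[e[1]];
    degree[cs] = (degree[cs] || 0) + 1;
    degree[ct] = (degree[ct] || 0) + 1;
    if (cs === ct) internal[cs] = (internal[cs] || 0) + 1;
  });
  var q = 0;
  Object.keys(degree).forEach(function (c) {
    var fractionInside = (internal[c] || 0) / m;
    var fractionOfEnds = degree[c] / (2 * m);
    q += fractionInside - fractionOfEnds * fractionOfEnds;
  });
  return q;
}
```

Putting every node in a single group scores zero; a grouping that keeps most edges inside groups scores higher, which is why different random runs that land on similar scores can still disagree about exactly where the boundaries go.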
A brief report is displayed after running the statistic:
While we have the Statistics panel open, we can take the opportunity to run another measure: the HITS algorithm. This generates the well known Authority and Hub values which we can use to size nodes in the graph.
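For the curious, here is a minimal sketch of the HITS idea: a node's authority is the sum of the hub scores of the nodes pointing at it, and a node's hub score is the sum of the authority scores of the nodes it points to, repeated until things settle. The normalisation used here (unit Euclidean length) and the fixed iteration count are assumptions, so the absolute numbers may not match Gephi's.

```python
import math

def hits(nodes, edges, iterations=50):
    """Return (hub, authority) score dicts for a directed graph."""
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}

    def normalise(scores):
        norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
        return {n: v / norm for n, v in scores.items()}

    for _ in range(iterations):
        # Authority: total hub score of everyone linking to the node.
        new_auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_auth[dst] += hub[src]
        auth = normalise(new_auth)

        # Hub: total authority score of everyone the node links to.
        new_hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_hub[src] += auth[dst]
        hub = normalise(new_hub)
    return hub, auth

# Toy "follows" network: an edge (x, y) means x follows y.
nodes = ["a", "b", "c", "d"]
edges = [("a", "c"), ("b", "c"), ("d", "c"), ("a", "b")]
hub, auth = hits(nodes, edges)

top_authority = max(auth, key=auth.get)
top_hub = max(hub, key=hub.get)
```

In this toy network `c` is followed by everyone else and comes out as the top authority, while `a`, who follows both `b` and `c`, is the top hub.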
The next step is to actually colour the graph. In the Partition panel, refresh the partition options list and then select Modularity Class.
Choose appropriate colours (right click on each colour panel to select an appropriate colour for each class – I often select pastel colours) and apply them to the graph.
The next thing we want to do is lay out the graph. The Layout panel contains several different layout algorithms that can be used to support the visual analysis of the structures inherent in the network (try some of them – each works in a slightly different way; some are also better than others for coping with large networks). For a network this size and this densely connected, I’d typically start out with one of the force directed layouts, which position nodes according to how tightly linked they are to each other.
When you select the layout type, you will notice there are several parameters you can play with. The default set is often a good place to start…
Run the layout tool and you should see the network start to lay itself out. Some algorithms require you to actually Stop the layout algorithm; others terminate themselves according to a stopping criterion, or because they are a “one-shot” application (such as the Expansion algorithm, which just scales the x and y values by a given factor).
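As a rough illustration of what a force directed layout does, the sketch below follows the general Fruchterman-Reingold idea: every pair of nodes pushes apart, linked nodes pull together, and the maximum step shrinks each round so the picture settles. It is a toy, not any of Gephi's actual layouts, and the parameter values (`iterations`, the starting step of 0.1, the 0.97 cooling factor) are arbitrary assumptions.

```python
import math
import random

def force_layout(nodes, edges, iterations=200, seed=42):
    """Return {node: (x, y)} from a simple spring-style layout."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    pos = {n: [rng.uniform(-0.5, 0.5), rng.uniform(-0.5, 0.5)] for n in nodes}
    k = math.sqrt(1.0 / len(nodes))  # rough ideal distance between nodes
    temp = 0.1                       # largest move allowed per round

    for _ in range(iterations):
        disp = {n: [0.0, 0.0] for n in nodes}

        # Repulsion between every pair of nodes.
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                dx = pos[a][0] - pos[b][0]
                dy = pos[a][1] - pos[b][1]
                dist = math.hypot(dx, dy) or 1e-9
                push = k * k / dist
                disp[a][0] += dx / dist * push
                disp[a][1] += dy / dist * push
                disp[b][0] -= dx / dist * push
                disp[b][1] -= dy / dist * push

        # Attraction along each edge.
        for a, b in edges:
            dx = pos[a][0] - pos[b][0]
            dy = pos[a][1] - pos[b][1]
            dist = math.hypot(dx, dy) or 1e-9
            pull = dist * dist / k
            disp[a][0] -= dx / dist * pull
            disp[a][1] -= dy / dist * pull
            disp[b][0] += dx / dist * pull
            disp[b][1] += dy / dist * pull

        # Move each node, capped by the current "temperature".
        for n in nodes:
            length = math.hypot(disp[n][0], disp[n][1]) or 1e-9
            step = min(length, temp)
            pos[n][0] += disp[n][0] / length * step
            pos[n][1] += disp[n][1] / length * step
        temp *= 0.97  # cool down so the layout settles

    return {n: (x, y) for n, (x, y) in pos.items()}

nodes = ["a", "b", "c", "d", "e", "f"]
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"), ("c", "d")]
pos = force_layout(nodes, edges)
```

This version stops itself after a fixed number of rounds; a layout that you have to Stop by hand is essentially the same loop without the iteration limit.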
We can zoom in and out on the layout of the graph using a mouse wheel (on my MacBook trackpad, I use a two finger slide up and down), or use the zoom slider from the “More options” tab:
To see which Twitter ID each node corresponds to, we can turn on the labels:
This view is very cluttered – the nodes are too close to each other to see what’s going on. The labels and the nodes are also all the same size, giving the same visual weight to each node and each label. One thing I like to do is resize the nodes relative to some property, and then scale the label size to be proportional to the node size.
Here’s how we can scale the node size and then set the text label size to be proportional to node size. In the Ranking panel, select the node size property, and the attribute you want to make the size proportional to. I’m going to use Authority, which is a network property that we calculated when we ran the HITS algorithm. Essentially, it’s a measure of how well linked-to a node is: nodes that are pointed to by good hubs get a high authority score.
The min size/max size slider lets us define the minimum and maximum node sizes. By default, a linear mapping from attribute value to size is used, but the spline option lets us use a non-linear mapping.
I’m going with the default linear mapping…
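The linear mapping is just a rescaling of the attribute range onto the size range. A minimal sketch, with made-up authority scores and arbitrary size limits (the function name `linear_size` is an assumption for illustration):

```python
def linear_size(values, min_size=10.0, max_size=50.0):
    """Map each attribute value linearly onto [min_size, max_size]."""
    lo = min(values.values())
    hi = max(values.values())
    span = hi - lo
    if span == 0:
        # All values equal: nothing to rank, so use the minimum size.
        return {n: min_size for n in values}
    return {n: min_size + (v - lo) / span * (max_size - min_size)
            for n, v in values.items()}

# Hypothetical authority scores for three nodes.
authority = {"a": 0.0, "b": 0.5, "c": 1.0}
sizes = linear_size(authority)
```

The lowest-scoring node gets the minimum size, the highest gets the maximum, and everything else falls proportionally in between.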
We can now scale the labels according to node size:
Note that you can continue to use the text size slider to scale the size of all the displayed labels together.
This diagram is now looking quite cluttered – to make it easier to read, it would be good if we could spread it out a bit. The Expansion layout algorithm can help us do this:
A couple of other layout algorithms that are often useful: the Transformation layout algorithm lets us scale the x and y axes independently (compared to the Expansion algorithm, which scales both axes by the same amount); and the Clockwise Rotate and Counter-Clockwise Rotate algorithms let us rotate the whole layout (this can be useful if you want to rotate the graph so that it fits neatly into a landscape view).
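These "one-shot" layouts are simple coordinate transforms. A minimal sketch, assuming positions are held as a dict of (x, y) pairs and that scaling and rotation happen about the origin (Gephi may well use the centre of the graph instead; the function names are illustrative):

```python
import math

def expand(pos, factor):
    """Scale both axes by the same factor."""
    return {n: (x * factor, y * factor) for n, (x, y) in pos.items()}

def transform(pos, x_factor, y_factor):
    """Scale the x and y axes independently."""
    return {n: (x * x_factor, y * y_factor) for n, (x, y) in pos.items()}

def rotate(pos, degrees):
    """Rotate about the origin; positive angles turn counter-clockwise
    in conventional maths axes (screen axes may flip the sense)."""
    t = math.radians(degrees)
    c, s = math.cos(t), math.sin(t)
    return {n: (x * c - y * s, x * s + y * c) for n, (x, y) in pos.items()}

pos = {"a": (1.0, 0.0), "b": (0.0, 2.0)}
wider = expand(pos, 1.2)
stretched = transform(pos, 2.0, 1.0)
turned = rotate(pos, 90)
```

Because each one only multiplies coordinates, applying Expansion repeatedly keeps the shape of the layout and just spreads the nodes further apart.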
The expanded layout is far easier to read, but some of the labels still overlap. The Label Adjust layout tool can jiggle the nodes so that they don’t overlap.
(Note that you can also move individual nodes by clicking on them and dragging them.)
So – nearly there… The final push is to generate a good quality output. We can do this from the preview window:
The preview window is where we can generate good quality SVG renderings of the graph. The node size, colour and scaled label sizes are determined in the original Overview area (the one we were working in), although additional customisations are possible in the Preview area.
To render our graph, I just want to make a couple of tweaks to the original Default preview settings: Show Labels and set the base font size.
Click on the Refresh button to render the graph:
Oops – I overdid the font size… let’s try again:
Okay – so that’s a good start. Now I find I often enter into a dance between the Preview and Overview panels, tweaking the layout until I get something I’m satisfied with, or at least, that’s half-way readable.
How to read the graph is another matter of course, though by using colour, sizing and placement, we can hopefully draw out in a visual way some interesting properties of the network. The recipe described above, for example, results in a view of the network that shows:
- groups of people who are tightly connected to each other, as identified by the modularity statistic and consequently group colour; this often defines different sorts of interest groups. (My follower network shows distinct groups of people from the Open University and JISC, the HE library and educational technology sectors, and UK opendata and data journalist types, for example.)
- people who are well connected in the graph, as displayed by node and label size.
Here’s my final version of the @wiredUK “inner friends” network:
You can probably do better though…;-)
To recap, here’s the recipe again:
- filter on connected component (private accounts don’t disclose friend/follower detail to the API key I use) to give a connected graph;
- run the modularity statistic to identify clusters; sometimes I try several attempts
- colour by modularity class identified in previous step, often tweaking colours to use pastel tones
- I often use a force directed layout, then Expansion to spread the network out a bit if necessary; the Clockwise Rotate or Counter-Clockwise Rotate will rotate the network view; I often try to get a landscape format; the Transformation layout lets you expand or contract the graph along a single axis, or both axes by different amounts.
- run HITS statistic and size nodes by authority
- size labels proportional to node size
- use label adjust and expand to tweak the layout
- use preview with proportional labels to generate a nice output graph
- iterate previous two steps to get a layout that is hopefully not completely unreadable…
Got that?!;-)
Finally, to return to the beginning. The recipe I use to generate the data is as follows:
- grab a list of twitter IDs (call it L); there are several ways of doing this, for example: obtain a list of tweets on a particular topic by searching for a particular hashtag, then grab the set of unique IDs of people using the hashtag; grab the IDs of the members of one or more Twitter lists; grab the IDs of people following or followed by a particular person; grab the IDs of people sending geo-located tweets in a particular area;
- for each person P in L, add them as a node to a graph;
- for each person P in L, get a list of the people followed by that person; call it Fr(P)
- for each X in Fr(P): if X is also in L, create a directed edge [P,X] and add it to the graph
- save the graph in a format that can be visualised in Gephi.
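The steps above can be sketched in a few lines of Python. This is only an outline under assumptions: `get_friends` is a hypothetical stand-in for whatever call fetches a friends list (the sketch makes no API calls and uses a fake lookup table), and the GDF header lines reflect my understanding of that format, so check them against the Gephi documentation before relying on them.

```python
def build_friend_graph(ids, get_friends):
    """Build nodes and directed edges for the 'inner' friends network.

    ids: list of unique Twitter IDs (the list L in the recipe).
    get_friends: hypothetical callable returning the IDs that a
                 given person follows, i.e. Fr(P).
    """
    members = set(ids)
    nodes = list(ids)
    edges = []
    for person in ids:
        for friend in get_friends(person):
            # Only keep links that stay inside the original list.
            if friend in members:
                edges.append((person, friend))
    return nodes, edges

def to_gdf(nodes, edges):
    """Serialise the graph as GDF text (format details assumed)."""
    lines = ["nodedef>name VARCHAR,label VARCHAR"]
    lines += [f"{n},{n}" for n in nodes]
    lines.append("edgedef>node1 VARCHAR,node2 VARCHAR,directed BOOLEAN")
    lines += [f"{src},{dst},true" for src, dst in edges]
    return "\n".join(lines) + "\n"

# Fake friends data standing in for real API responses.
fake_friends = {
    "alice": ["bob", "carol", "zed"],  # zed is not in L, so is dropped
    "bob": ["alice"],
    "carol": [],
}
ids = ["alice", "bob", "carol"]
nodes, edges = build_friend_graph(ids, lambda p: fake_friends.get(p, []))
gdf_text = to_gdf(nodes, edges)
```

Writing `gdf_text` to a file with a `.gdf` extension should give something that can be opened from Gephi's file menu, assuming the header lines are right.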
To make this recipe, I use Tweepy and a Python script to call the Twitter API and get the friends lists from there, but you could use the Google Social Graph API to get the same data. There’s an example of calling that API using JavaScript in my “live” Twitter friends visualisation script (Using Protovis to Visualise Connections Between People Tweeting a Particular Term) as well as in the A Bit of NewsJam MoJo – SocialGeo Twitter Map.