Whilst preparing the presentation, I had a dig through the DCLG sponsored Improving Visualisation for the Public Sector site, which provides pathways for identifying appropriate visualisation types based on data type, policy objectives/communication goals and anticipated audience level. It struck me that being able to pick an appropriate visualisation type is one thing, but being able to create it is another.
My presentation, for example, was based very much around tools that could provide a way in to actually creating visualisations, as well as shaping and representing data so that it can be plugged straight in to particular visualisation views.
So I’m wondering, is there maybe an opportunity here for a practical programme of work that builds on the DCLG Improving Visualisation toolkit by providing worked, and maybe templated, examples, with access to code and recipes wherever possible, for actually creating examples of exemplar visualisation types from actual open/public data sets that can be found on the web?
Could this even be the basis for a set of School of Data practical exercises, I wonder, to actually create some of these examples?
I guess this counts as a dissemination activity for my related eSTEeM project on course related custom search engines, since the work(?!) sort of evolved out of that idea…
The thesis is this:
Course Units on OpenLearn are available as XML docs – a URL pointing to the XML version of a unit can be derived from the Moodle URL for the HTML version of the course; (the same is true of “closed” OU course materials). The OU machine uses the XML docs as a feedstock for a publication process that generates HTML views, ebook views, etc, etc of a course.
We can treat XML docs as if they were database records; sets of structured XML elements can be viewed as if they define database tables; the values taken by the structured elements are like database table entries. Which is to say, we can treat each XML doc as a mini-database, or we can trivially extract the data and pop it into a “proper”/“real” database.
given a list of courses we can grab all the corresponding XML docs and build a big database of their contents; that is, a single database that contains records pulled from course XML docs.
the sorts of things that we can pull out of a course include: links, images, glossary items, learning objectives, section and subsection headings;
if we mine the (sub)section structure of a course from the XML, we can easily provide an interactive treemap version of the sections and subsections in a course; generating a Freemind mindmap document type, we can automatically generate course-section mindmap files that students can view – and annotate – in Freemind. We can also generate bespoke mindmaps, for example based on sections across OpenLearn courses that contain a particular search term.
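By way of illustration, here is a minimal sketch of the section-to-mindmap idea in Python. The course XML and its element names (Unit, Session, Section, Title) are a simplified, hypothetical stand-in rather than the actual OU schema, and the output is only the bare nested node structure that Freemind .mm files are built from.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified course XML; real OU XML element names may differ.
COURSE_XML = """
<Unit>
  <Title>Example unit</Title>
  <Session>
    <Title>1 Introduction</Title>
    <Section><Title>1.1 Background</Title></Section>
    <Section><Title>1.2 Aims</Title></Section>
  </Session>
  <Session>
    <Title>2 Methods</Title>
  </Session>
</Unit>
"""

def add_nodes(src, parent):
    """Recursively copy Session/Section elements into mindmap <node> elements."""
    for child in src:
        if child.tag in ("Session", "Section"):
            title = child.findtext("Title", default="(untitled)")
            node = ET.SubElement(parent, "node", TEXT=title)
            add_nodes(child, node)

def course_to_mindmap(xml_text):
    """Return a Freemind-style <map> document as a string."""
    unit = ET.fromstring(xml_text.strip())
    mm = ET.Element("map", version="1.0.1")  # version string is an assumption
    root = ET.SubElement(mm, "node", TEXT=unit.findtext("Title", default="Course"))
    add_nodes(unit, root)
    return ET.tostring(mm, encoding="unicode")

mindmap = course_to_mindmap(COURSE_XML)
print(mindmap)
```

Saving the printed string to a file with a .mm extension should give something Freemind can open, though I'd treat that as something to check rather than a promise.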
By disaggregating individual course units into “typed” elements or faceted components, and then reaggregating items of a similar class or type across all course units, we can provide faceted search across, as well as a university wide “meta” view over, different classes of content. For example:
by aggregating learning objectives from across OpenLearn units, we can trivially create a search tool that provides a faceted search over just the learning objectives associated with each unit; the search returns learning outcomes associated with a search term and links to course units associated with those learning objectives; this might help in identifying reusable course elements based around reuse or extension of learning outcomes;
by aggregating glossary items from across OpenLearn units, we can trivially create a meta glossary for the whole of OpenLearn (or similarly across all OU courses). That is, we could produce a monolithic OpenLearn, or even OU wide, glossary; or maybe it’s useful to redefine the same glossary terms using different definitions, rather than reuse the same definition(s) consistently across different courses? As with learning objectives, we can also create a search tool that provides a faceted search over just the glossary items associated with each unit; the search returns glossary items associated with a search term and links to course units associated with those glossary items;
by aggregating images from across OpenLearn units, we can trivially create a search tool that provides a faceted search over just the descriptions/captions of images associated with each unit; the search returns the images whose description/captions are associated with the search term and links to course units associated with those images. This disaggregation provides a direct way of searching for images that have been published through OpenLearn. Rights information may also be available, allowing users to search for images that have been rights cleared, as well as openly licensed images.
the original route in was the extraction of links from course units that could be used to seed custom search engines that search over resources referenced from a course. This could in principle also include books using Google book search.
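To make the “XML docs as database records” idea a little more concrete, here is a rough Python sketch that pulls glossary items out of a couple of made-up unit documents, pops them into a single SQLite table and searches across them. The element names (GlossaryItem, Term, Definition), the unit ids and the definitions are all invented for the example; the same pattern would apply to learning objectives, images or links.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Made-up unit XML keyed by a made-up unit id; element names are assumptions.
UNITS = {
    "unit_a": "<Unit><GlossaryItem><Term>Entropy</Term>"
              "<Definition>A measure of disorder in a system.</Definition>"
              "</GlossaryItem></Unit>",
    "unit_b": "<Unit><GlossaryItem><Term>Entropy</Term>"
              "<Definition>The average information content of a message.</Definition>"
              "</GlossaryItem><GlossaryItem><Term>Bit</Term>"
              "<Definition>A binary digit.</Definition></GlossaryItem></Unit>",
}

def build_glossary_db(units):
    """Load every glossary item from every unit into one in-memory table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE glossary (unit TEXT, term TEXT, definition TEXT)")
    for unit_id, xml_text in units.items():
        root = ET.fromstring(xml_text)
        for item in root.iter("GlossaryItem"):
            conn.execute(
                "INSERT INTO glossary VALUES (?, ?, ?)",
                (unit_id, item.findtext("Term", ""), item.findtext("Definition", "")),
            )
    conn.commit()
    return conn

def search_glossary(conn, query):
    """Return (term, definition, unit) rows whose term or definition mentions query."""
    like = f"%{query}%"
    return conn.execute(
        "SELECT term, definition, unit FROM glossary "
        "WHERE term LIKE ? OR definition LIKE ? ORDER BY term, unit",
        (like, like),
    ).fetchall()

conn = build_glossary_db(UNITS)
results = search_glossary(conn, "entropy")
for term, definition, unit in results:
    print(unit, term, definition)
```

The search throws up the same term defined differently in two units, which is exactly the sort of thing a meta glossary would surface.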
I also briefly described an approach for appropriating Google custom search engine promotions as the basis for a search engine mediated course, something I think could be used in a sMoocH (search mediated MOOC hack). But then MOOCs as popularised have f**k all to do with innovation, don’t they, other than in a marketing sense for people with very little imagination.
During questions, @briankelly asked if any of the reported dabblings/demos (and there are several working demos) were just OUseful experiments or whether they could in principle be adopted within the OU, or even more widely across HE. The answers are ‘yes’ and ‘yes’ but in reality ‘yes’ and ‘no’. I haven’t even been able to get round to writing up (or persuading someone else to write up) any of my dabblings as ‘proper’ research, let alone fight the interminable rounds of lobbying and stakeholder acquisition it takes to get anything rolled out as an adopted innovation. If any of the ideas were/are useful, they’re Googleable and folk are free to run with them…but because they had no big budget holding champion associated with their creation, and hence no stake (even defensively) in seeing some sort of use from them, they’re unlikely to register anywhere.
Whilst preparing for my typically overloaded #online12 presentation, I thought I should make at least a passing attempt at contextualising it for the corporate attendees. The framing idea I opted for, but all too briefly reviewed, was whether open public data might be disruptive to the information industry, particularly purveyors of information services in vertical markets.
If you’ve ever read Clayton Christensen’s The Innovator’s Dilemma, you’ll be familiar with the idea behind disruptive innovations: incumbents allow start-ups with cheaper ways of tackling the less profitable, low-quality end of the market to take that part of the market; the start-ups improve their offerings, take market share, and the incumbent withdraws to the more profitable top-end. Learn more about this on OpenLearn: Sustaining and disruptive innovation or listen again to the BBC In Business episode on The Innovator’s Dilemma, from which the following clip is taken.
In the information industry, the following question then arises: will the availability of free, open public data be adopted at the low, or non-consuming end of the market, for example by micro- and small companies who haven’t necessarily been able to buy in to expensive information or data services, either on financial grounds or through lack of perceived benefits? Will the appearance of new aggregation services, often built around screenscrapers and/or public open data sources start to provide useful and useable alternatives at the low end of the market, in part because of their (current) lack of comprehensiveness or quality? And if such services are used, will they then start to improve in quality, comprehensiveness and service offerings, and in so doing start a ratcheting climb to quality that will threaten the incumbents?
Here are a couple of quick examples, based around some doodles I tried out today using data from OpenCorporates and OpenlyLocal. The original sketch (demo1() in the code here) was a simple scraper on Scraperwiki that accepted a person’s name, looked them up via a director search using the new 0.2 version of the OpenCorporates API, pulled back the companies they were associated with, and then looked up the other directors associated with those companies. For example, searching around Nigel Richard Shadbolt, we get this:
One of the problems with the data I got back is that there are duplicate entries for company officers; as Chris Taggart explained, “[data for] UK officers [comes] from two Companies House sources — data dump and API”. Another problem is that officers’ records don’t necessarily have start/end dates associated with them, so it may be the case that directors’ terms of office don’t actually overlap within a particular company. In my own scraper, I don’t check to see whether an officer is marked as “director”, “secretary”, etc, nor do I check to see whether the company is still a going concern or whether it has been dissolved. Some of these issues could be addressed right now, some may need working on. But in general, the data quality – and the way I work with it – should only improve from this quick’n’dirty minimum viable hack. As it is, I now have a tool that at a push will give me a quick snapshot of some of the possible director relationships surrounding a named individual.
The second sketch (demo2() in the code here) grabbed a list of elected council members for the Isle of Wight Council from another of Chris’ properties, OpenlyLocal, extracted the councillors’ names, and then looked up directorships held by people with exactly the same name using a two stage exact string match search. Here’s the result:
As with many data results, this is probably most meaningful to people who know the councillors – and companies – involved. The results may also surprise people who know the parties involved if they start to look-up the companies that aren’t immediately recognisable: surely X isn’t a director of Y? Here we have another problem – one of identity. The director look-up I use is based on an exact string match: the query to OpenCorporates returns directors with similar names, which I then filter to leave only directors with exactly the same name (I turn the strings to lower case so that case errors don’t cause a string mismatch). (I also filter companies returned to be solely ones with a gb jurisdiction.) In doing the lookup, we therefore have the possibility of false positive matches (X is returned as a director, but it’s not the X we mean, even though they have exactly the same name); and false negative lookups (eg where we look up a made up director John Alex Smith who is actually recorded in one or more filings as (the again made-up) John Alexander Smith).
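The second stage filter can be sketched in a few lines of Python. This is not the scraper code itself: the record layout (name, company, jurisdiction_code) is a hypothetical stand-in for whatever the director search actually returns, and the names and companies are made up.

```python
def exact_name_matches(target_name, officers, jurisdiction="gb"):
    """Keep only officers whose lower-cased name equals the target name exactly
    and whose company sits in the given jurisdiction."""
    wanted = target_name.strip().lower()
    return [
        o for o in officers
        if o["name"].strip().lower() == wanted
        and o["jurisdiction_code"] == jurisdiction
    ]

# Made-up candidates, as might come back from a fuzzy director name search.
candidates = [
    {"name": "JOHN ALEX SMITH", "company": "Example Widgets Ltd", "jurisdiction_code": "gb"},
    {"name": "John Alexander Smith", "company": "Example Holdings Ltd", "jurisdiction_code": "gb"},
    {"name": "John Alex Smith", "company": "Example Inc", "jurisdiction_code": "us_de"},
]

matches = exact_name_matches("John Alex Smith", candidates)
for m in matches:
    print(m["name"], "-", m["company"])
```

Only the first record survives: the second is dropped as a (possibly false negative) near miss on the name, the third because it isn’t in the gb jurisdiction. Nothing here can tell us whether the surviving John Alex Smith is the one we meant.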
That said, we do have a minimum viable research tool here that gives us a starting point for doing a very quick (though admittedly heavily caveated) search around companies that a councillor may be (or may have been – I’m not checking dates, remember) associated with.
We also have a tool around which we can start to develop a germ of an idea around conflict of interest detection.
The Isle of Wight Armchair Auditor, maintained by hyperlocal blog @onthewight (and based on an original idea by @adrianshort) hosts local spending information relating to payments made by the Isle of Wight Council. If we look at the payments made to a company, we see the spending is associated with a particular service area.
If you’re a graph thinker, as I am;-), the following might then suggest itself to you:
From OpenlyLocal, we can get a list of councillors and the committees they are on;
from OnTheWight’s Armchair Auditor, we can get a list of companies the council has spent money with;
from OpenCorporates, we can get a list of the companies that councillors may be directors of;
from OpenCorporates, we should be able to get identifiers for at least some of the companies that the council has spent money with;
putting those together, we should be able to see whether or not a councillor may be a director of a company that the council is spending with and how much is being spent with them in which spending areas;
we can possibly go further, if we can associate council committees with spending areas – are there councillors who are members of a committee that is responsible for a particular spending area who are also directors of companies that the council has spent money with in those spending areas? Now there’s nothing wrong with people who have expertise in a particular area sitting on a related committee (it’s probably a Good Thing). And it may be that they got their experience by working as a director for a company in that area. Which again, could be a Good Thing. But it begs a transparency question that a journalist might well be interested in asking. And in this case, with open data to hand, might technology be able to help out? For example, could we automatically generate a watch list to check whether or not councillors who are directors of companies that have received monies in particular spending areas (or more generally) have declared an interest, as would be appropriate? I think so…(caveated of course by the fact that there may be false positives and false negatives in the report…; but it would be a low effort starting point).
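The joins in that list can be roughed out as follows. All the data is invented, the dictionaries stand in for whatever we’d actually pull from OpenlyLocal, the Armchair Auditor and OpenCorporates, and the committee-to-spending-area mapping is an assumption we’d have to build by hand.

```python
# Invented data throughout; each structure stands in for one of the sources above.
councillor_committees = {"Cllr A": {"Planning Committee"}, "Cllr B": {"Highways Committee"}}
directorships = {"Cllr A": {"Acme Builders Ltd"}, "Cllr B": {"Beta Catering Ltd"}}
payments = [  # (company, spending area, amount)
    ("Acme Builders Ltd", "Planning", 12000.0),
    ("Gamma Paving Ltd", "Highways", 5000.0),
    ("Acme Builders Ltd", "Highways", 3000.0),
]
committee_areas = {"Planning Committee": {"Planning"}, "Highways Committee": {"Highways"}}

def watch_list(councillor_committees, directorships, payments, committee_areas):
    """List payments to companies a councillor may direct, flagging those that
    fall in a spending area overseen by one of the councillor's committees."""
    rows = []
    for councillor, companies in directorships.items():
        # Spending areas covered by any committee this councillor sits on
        areas = set()
        for committee in councillor_committees.get(councillor, ()):
            areas |= committee_areas.get(committee, set())
        for company, area, amount in payments:
            if company in companies:
                rows.append((councillor, company, area, amount, area in areas))
    return rows

report = watch_list(councillor_committees, directorships, payments, committee_areas)
for row in report:
    print(row)
```

Each row is a prompt for a question (has an interest been declared?), not a finding, and it inherits all the false positives and false negatives of the name matching.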
Once you get into this graph based thinking, you can take it much further of course, for example looking to see whether councillors in one council are directors of companies that deal heavily with neighbouring councils… and so on… (Paranoid? Me? Nah… Just trying to show how graphs work and how easy it can be to start joining dots once you start to get hold of the data…;-)
Anyway – this is all getting off the point and too conspiracy based…! So back to the point, which was along the lines of this: here we have the fumblings of a tool for mixing and matching data from two aggregators of public information, OpenlyLocal and OpenCorporates, that might allow us to start running crude conflict of interest checks. (It’s easy enough to see how we can run the same idea using lists of MP names from the TheyWorkForYou API; or looking up directorships previously held by Ministers and the names of companies of lobbyists they meet (does WhosLobbying have an API of such things?). And so on…)
Now I imagine there are commercial services around that do this sort of thing properly and comprehensively, and for a fee. But it only took me a couple of hours, for free, to get started, and having started, the paths to improvement become self-evident… and some of them can be achieved quite quickly (it just takes a little (?!) bit of time…) So I wonder – could the information industry be at risk of disruption from open public data?
PS if you’re into conspiracies, Cambridge’s Centre for Research in the Arts, Social Sciences and Humanities (CRASSH) has a post-doc position open with Professor John Naughton on The impact of global networking on the nature, dissemination and impact of conspiracy theories. The position is complemented by several parallel fellowships, including ones on Rational Choice and Democratic Conspiracies and Ideals of Transparency and Suspicion of Democracy.
I was due to be at #odw13 today, but circumstances beyond my control intruded…
The presentation I was down to give related to some of the things we could do with company data from OpenCorporates. Here’s a related thing that covers some of what I was intending to talk about…
(I’m experimenting with a new way of putting together presentations by actually writing notes for each slide. Please let me know via the comments whether you think this approach makes my slidedecks any easier to understand!)
I’m fortunate enough to be visiting LASI13, the Learning Analytics Summer Institute, in Stanford this week, and got to lead a workshop session yesterday on tools for tinkering and playing with data.
The presentation I prepped can be found on Slideshare – LASI13 datawrangling tools – though as ever I didn’t get through all the slides, and, as ever again, went slightly off-piste at various points. (The session was a 2hr 15 session, split 1h30 and 45 mins; I reckon the whole slidedeck would be a 4hr session; as it was, we got as far as grabbing data out of Facebook and into OpenRefine, with a v brief tease about starting to analyse the data in Gephi.)
I mentioned several tutorial posts and resource pages in the session – here are a few links to some of them:
if search limits for use in Google searches are new to you, I like site: for searching sites or domains (eg site:open.ac.uk or site:edu); filetype: for searching by document type (eg filetype:xls or filetype:pdf); for limiting by document titles, intitle: and for limiting by terms that appear in a url, inurl:
If you want to follow through on OpenRefine/LODRefine, there’s lots more to know and I’m still working through tutorials to cover some of that. Grabbing Facebook friends’ likes using OpenRefine varies slightly from the w/s recipe; I’ll post a new version over the next week or two. For now, if you want to parse the JSON data, the magic phrase is forEach(value.parseJson()['data'],v,[v.category,v.name,v.id].join('::')).join('||'). You then need to Edit cells – split multi valued cells (by ||) then Edit column – split into several columns (using :: as the separator). I’ve posted a whole host of OpenRefine tutorials using the OpenRefine category on this blog.
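If the GREL expression looks a bit opaque, here is roughly what it does, sketched in Python rather than OpenRefine. The sample cell value is made up, and I’m assuming each item in the data list carries category, name and id as strings.

```python
import json

# Made-up cell value of the kind the expression is applied to.
cell_value = json.dumps({"data": [
    {"category": "Musician/band", "name": "Example Band", "id": "101"},
    {"category": "Book", "name": "Example Book", "id": "102"},
]})

def flatten_likes(value):
    """Mimic the GREL: join each item's fields with '::', then join items with '||'."""
    data = json.loads(value)["data"]
    return "||".join("::".join([v["category"], v["name"], v["id"]]) for v in data)

def split_likes(flat):
    """Mimic the two follow-up steps: split on '||' into rows, then on '::' into columns."""
    return [item.split("::") for item in flat.split("||")]

flat = flatten_likes(cell_value)
rows = split_likes(flat)
print(flat)
print(rows)
```

The two separators are arbitrary; they just need to be strings that won’t turn up inside a category or name.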
If anyone still here at LASI would like to chat further about data related skills development, let’s grab a table… similarly if you’d like to learn more about the School of Data data expedition model as a possible learning or training exercise. Finally, if you’re looking for folk to run data skills workshops, let’s talk;-)
PS if I’ve missed any links you think should be here, let me know and I’ll add them…
slides 2-6 – some thoughts on getting your eye into some tech trends: OU Innovating Pedagogy reports (2012, 2013), possible data-sources and reports;
slides 6-11 – what can we learn from Google Trends and related tools? A big thing: the importance of segmenting your stats; means are often meaningless. The Mothers’ Day example demonstrates two signal causes (in different territories – i.e. different segments) for the compound flowers trend. The Google Correlate example shows how one signal may lead – or lag – another. So the question: do you segment your library data? Do you look for leading or lagging indicators?
slides 12-18 – what role should/does/could the library play in developing the reputation of the organisation’s knowledge producers/knowledge outputs, not least as a way of making them more discoverable; this builds on the question of whose role it is to facilitate access to knowledge (along with the question: facilitate access for whom?)? – my take is this fits in the role librarians often take of organising an institution’s knowledge.
slides 19-27 – what is a library for? Supporting discovery (of what, by whom)? (Helping others) organise knowledge, and gain access to information? Do research?
slides 28-30 – the main focus of my own presentation during the main ILI2013 conference (I’ll post the slides/brief commentary in another post): if the information we want to discover is buried in data, who’s there to help us extract or discover the information from within the data?
slides 31-32 – sometimes reframing your perception of an organisation’s offerings can help you rethink the proposition, and sometimes using an analogy helps you switch into that frame of mind. So if energy utilities provide “warm house” and “clean, dry clothes” service, rather than gas or electricity, what shift might libraries adopt?
slides 33-39 – a few idle idea prompts around the question of just what is it that libraries do, what services do they provide?
slide 40 – one of the items from this slide caused a nightmare tangent! The riff started with a trivial observation – a telling off I received for trying to use the camera on my phone to take a photo of a sign saying “no cameras in the library”, with a photocopier as a backdrop (original story). The purpose of this story was two-fold: 1) to get folk into the idea of spotting anachronisms or situations where one technology is acceptable where an equivalent or alternative is not (and then wonder why/what fun can be had around that thought;-); 2) to get folk into wondering how users might appropriate technology they have to hand to make their lives easier, even if it “goes against the rules”.
slide 41 – a thought experiment that I still have high hopes for in the right workshop setting…! if you overheard someone answer a question you didn’t hear with the phrase “did you try the library?”, what might the question be? You can then also pivot the question to identify possible competitors; for example, if a sensible answer to the same question is “did you try Amazon?”, Amazon might be a competitor for the delivery of that service.
slide 42 – this can lead on from the previous slide, either directly (replace “library” with “Amazon” or “Google”), or as a way of generating ideas about how else a service might be delivered.
As with many idea generating techniques, things can be combined. For example, having introduced the notion of Amazon lockers, we might then ask: so what use might libraries make of such a system, or thing? Or if such things become commonplace, how might this affect or influence the expectations of our users??