Archive for the ‘Thinkses’ Category
A picture may be worth a thousand words, but whilst many of us may get a pre-attentive gut reaction reading from a data set visualised using a chart type we’re familiar with, how many of us actually take the time to read a chart thoroughly and maybe verbalise, even if only to ourselves, what the marks on the chart mean, and how they relate to each other? (See How fertility rates affect population for an example of how to read a particular sort of chart.)
An idea that I’m finding increasingly attractive is the notion of text visualisation (or text visualization for the US-English imperialistic searchbots). That is, the generation of mechanical text from data tables so we can read words that describe the numbers – and how they relate – rather than looking at pictures of them or trying to make sense of the table itself.
Here’s a quick example of the sort of thing I mean – the generation of this piece of text:
The total number of people claiming Job Seeker’s Allowance (JSA) on the Isle of Wight in October was 2781, up 94 from 2687 in September, 2013, and down 377 from 3158 in October, 2012.
from a data table that can be sliced like this:
In the same way that we make narrative decisions when it comes to choosing what to put into a data visualisation, as well as how to read it (and how the various elements displayed in it relate to each other), so we make choices about the textual, or narrative, mapping from the data set to the text version (that is, the data textualisation) of it. When we present a chart or data table to a reader, we can try to influence their reading of it in a variety of ways: by choosing the sort order of bars on a bar chart, or rows in a table, for example; or by highlighting one or more elements in a chart or table through the use of colour, font, transparency, and so on.
The actual reading of the chart or table is still largely under the control of the reader, however, and may be thought of as non-linear in the sense that the author of the chart or table can’t really control the order in which the various attributes of the table or chart, or relationships between the various elements, are encountered by the reader. In a linear text, however, the author retains a far more significant degree of control over the exposition, and the way it is presented to the reader.
There is thus a considerable amount of editorial judgement put into the mapping from a data table to text interpretations of the data contained within a particular row, or down a column, or from some combination thereof. The selection of the data points and how the relationships between them are expressed in the sentences formed around them directs attention in terms of how to read the data in a very literal way.
There may also be a certain amount of algorithmic analysis used along the way as sentences are constructed from looking at the relationships between different data elements: “up 94” is a representation (both in the sense of rep-resentation and re-presentation) of a month on month change of +94, and “down 377” is generated mechanically from a year on year comparison.
Every cell in a table may be a fact that can be reported, but there are many more stories to be told by comparing how different data elements in a table stand in relation to each other.
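By way of illustration, here is a minimal Python sketch of how a sentence like the JSA one might be put together mechanically. It is not the R proof of concept mentioned in this post; the function names and the shape of the `counts` table are assumptions made up for the example, and only the three figures quoted above are taken from the text.

```python
MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]


def change_phrase(current, previous):
    """Turn a difference between two counts into 'up N', 'down N' or 'unchanged'."""
    diff = current - previous
    if diff > 0:
        return f"up {diff}"
    if diff < 0:
        return f"down {-diff}"
    return "unchanged"


def jsa_sentence(area, counts, year, month):
    """Build the sentence for one month.

    counts is a hypothetical table keyed by (year, month) -> claimant count.
    """
    current = counts[(year, month)]
    # Month on month comparison (wrapping round to December of the previous year).
    prev_year, prev_month = (year, month - 1) if month > 1 else (year - 1, 12)
    last_month = counts[(prev_year, prev_month)]
    # Year on year comparison.
    last_year = counts[(year - 1, month)]
    return (
        f"The total number of people claiming Job Seeker’s Allowance (JSA) "
        f"on the {area} in {MONTHS[month - 1]} was {current}, "
        f"{change_phrase(current, last_month)} from {last_month} in "
        f"{MONTHS[prev_month - 1]}, {prev_year}, and "
        f"{change_phrase(current, last_year)} from {last_year} in "
        f"{MONTHS[month - 1]}, {year - 1}."
    )


# The three figures quoted in the example sentence above.
counts = {(2013, 10): 2781, (2013, 9): 2687, (2012, 10): 3158}
sentence = jsa_sentence("Isle of Wight", counts, 2013, 10)
print(sentence)
```

The editorial choices (which comparisons to make, and in what order to report them) are baked into `jsa_sentence`; the "analysis" is just the subtraction inside `change_phrase`.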
The area of geekery related to this style of computing is known as NLG – natural language generation – but I’ve not found any useful code libraries (in R or Python, preferably…) for messing around with it. (The JSA example above was generated using R as a proof of concept around generating monthly press releases from ONS/nomis job-figures.)
PS why “data textualisation”, when we can consider even graphical devices as “texts” to be read? I considered “data characterisation” in the sense of turning data into characters, but characterisation is more general a term. Data narration was another possibility, but those crazy Americans patenting everything that moves might think I was “stealing” ideas from Narrative Science. Narrative Science (as well as Data2Text and Automated Insights etc. (who else should I mention?)) are certainly interesting but I have no idea how any of them do what they do. And in terms of narrating data stories – I think that’s a higher level process than the mechanical textualisation I want to start with. Which is not to say I don’t also have a few ideas about how to weave a bit of analysis into the textualisation process too…
Reading Game Analytics: Maximizing the Value of Player Data earlier this morning (which I suggest might be a handy read if you’re embarking on a learning analytics project…) I was struck by the mention of “player dossiers”. A Game Studies article from 2011 by Ben Medler, Player Dossiers: Analyzing Gameplay Data as a Reward, describes them as follows:
Recording player gameplay data has become a prevalent feature in many games and platform systems. Players are now able to track their achievements, analyze their past gameplay behaviour and share their data with their gaming friends. A common system that gives players these abilities is known as a player dossier, a data-driven reporting tool comprised of a player’s gameplay data. Player dossiers presents a player’s past gameplay by using statistical and visualization methods while offering ways for players to connect to one another using online social networking features.
Which is to say – you can grab your own performance and achievement data and then play with it, maybe in part to help you game the game.
The Game Analytics book also mentioned the availability of third party services built on top of game APIs that let third parties build analytics tools for users that are not otherwise supported by the game publishers.
What I started to wonder was – are there any services out there that allow you to aggregate dossier material from different games to provide a more rounded picture of your performance as a gamer, or maybe services that homologate dossiers from different games to give overall rankings?
In the learning analytics space, this might correspond to getting your data back from a MOOC provider, for example, and giving it to a third party to analyse. As a user of a MOOC platform, I doubt that you’ll be allowed to see much of the raw data that’s being collected about you; I’m also wary that institutions that sign up to MOOC platforms will also get screwed by the platform providers when it comes to asking for copies of the data. (I suggest folk signing their institutions up to MOOC platforms talk to their library colleagues, and ask how easy it is for them to get data (metadata, transaction data, usage data etc etc) out of the library system vendors, and what sort of contracts got them into the mess they may admit to being in.)
(By the by, again the Game Analytics book made a useful distinction – that of viewing folk as customers (i.e. people you can eventually get money from), or as players of the game (or maybe in MOOC land, learners). Whilst you may think of yourself as a player (learner), what they really want to do is develop you as a customer. In this respect, I think one of the great benefits of the arrival of MOOCs is that it allows us to see just how we can “monetise” education and lets us talk freely and, erm, openly, in cold hard terms about the revenue potential of these things, and how they can be used as part of a money making/sales venture, without having to pretend to talk about educational benefits, which we’d probably feel obliged to do if we were talking about universities. Just like game publishers create product (games) to make money, MOOCspace is about businesses making money from education. (If it isn’t, why is venture capital interested?))
Anyway, all that’s all by the by, not just the by the by bit: this was just supposed to be a quick post, rather than a rant, about how we might do a little bit to open up part of the learning analytics data collection process to the community. (The technique generalises to other sectors…) The idea is built on appropriating a technology that many website publishers use to collect data, the third party service that is Google Analytics (eg from 2012, 88% of Universities UK members use Google Analytics on their public websites). I’m not sure how many universities use Google Analytics to track VLE activity though? Or how many MOOC operators use Google Analytics to track activity on course related pages? But if there are some, I think we can grab that data and pop it into a communal data pool; or grab that data into our own Google Account.
So how might we do that?
That’s all a rather roundabout way of saying we can quite easily write extensions that change the behaviour of a web page. (Hmm… can we do this for mobile devices?) So what I propose – though I don’t have time to try it and test it right now (the rant used up the spare time I had!) – is an extension that simply replaces the Google Analytics tracking code with another tracking code:
- either a “common” one, that pools data from multiple individuals into the same Google Analytics account;
- or a “personal” one, that lets you collect all the data that the course provider was using Google Analytics to collect about you.
(Ideally the rewrite would take place before the tracking script is loaded? Or we’d have to reload the script with the new code if the rewrite happens too late? I’m not sure how the injection/replacement of the original tracking code with the new one actually takes place when the extension loads?)
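A browser extension would be written in JavaScript, and I haven’t tried that, but the rewrite step itself is easy to sketch. The Python below only illustrates swapping one classic Google Analytics property ID (the `UA-…` style) for another in a chunk of page source; the function name and both IDs are made up for the example, and it says nothing about when or how an extension would actually get to run the rewrite.

```python
import re

# Classic Google Analytics property IDs take the form "UA-<account>-<property>".
TRACKING_ID = re.compile(r"\bUA-\d+-\d+\b")


def swap_tracking_id(page_source, replacement_id):
    """Replace every classic GA property ID found in page_source.

    replacement_id would be the "common" or "personal" tracking code;
    the value used below is purely hypothetical.
    """
    return TRACKING_ID.sub(replacement_id, page_source)


# A line of the kind found in the classic asynchronous tracking snippet.
snippet = "_gaq.push(['_setAccount', 'UA-1234567-1']);"
rewritten = swap_tracking_id(snippet, "UA-7654321-2")
print(rewritten)
```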
Another “advantage” of this approach is that you hijack the Google Analytics data so it doesn’t get sent to the account of the person whose site you’re visiting. (Google Analytics docs suggest that using multiple tracking codes is “not supported”, though this doesn’t mean it can’t be done if you wanted to just overload the data collection (i.e. let the publisher collect the data to their account, and you just grab a copy of it too…).)
(An alternative, cruder, approach might be to create an extension that purges Google Analytics code within a page, and then inject your own Google Analytics scripts/code. This would have the downside of not incorporating the instrumentation that the original page publisher added to the page. Hmm.. seems I looked at this way back when too… Collecting Third Party Website Statistics (like Yahoo’s) with Google Analytics.)
All good fun, eh? And for folk operating cMOOCs, maybe this represents a way of tracking user activity across multiple sites (though to mollify ethical considerations, tracking/analytics code should probably only be injected onto whitelisted course related domains, or users presented with a “track my activity on this site” button…?)
A Few More Thoughts on the Forensic Analysis of Twitter Friend and Follower Timelines in a MOOCalytics Context
Immediately after posting Evaluating Event Impact Through Social Media Follower Histories, With Possible Relevance to cMOOC Learning Analytics, I took the dog out for a walk to ponder the practicalities of constructing follower (or friend) acquisition charts for accounts with only a low number of followers, or friends, as might be the case for folk taking a MOOC or who have attended a particular event. One aim I had in mind was to probe the extent to which a MOOC may help develop social ties between folk taking a MOOC, whether MOOC participants know each other prior to taking the MOOC, or whether they come to develop social links after taking the MOOC. Another aim was simply to see whether we could identify from changes in velocity or makeup of follower acquisition curves whether particular events led either to growth in follower numbers or community development between followers.
To recap on the approach used for constructing follower acquisition charts (as described in Estimated Follower Accession Charts for Twitter, and which also works (in principle!) for plotting when Twitter users started following folk):
- you can’t start following someone on Twitter until you join Twitter;
- follower lists on Twitter are reverse chronological statements of the order in which folk started following the corresponding account;
- starting with the first follower of an account (the bottom end of the follower list), we can estimate when they started following the account from the most recent account creation date seen so far amongst people who started following before that user.
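The recipe above can be sketched in a few lines of Python. This is a minimal sketch rather than the code behind the original charts: the function name and the toy follower list are invented, and I’ve assumed the running maximum also takes in the follower’s own creation date (since they can’t have followed before joining).

```python
from datetime import date


def estimate_follow_dates(followers_newest_first):
    """Estimate the earliest date each follower could have started following.

    followers_newest_first: list of (name, account_created) tuples in the
    reverse chronological order that Twitter returns follower lists in.
    Returns (name, estimated_date) tuples, oldest follower first.
    """
    estimates = []
    latest_seen = None
    # Walk from the first follower (bottom of the list) to the most recent.
    for name, created in reversed(followers_newest_first):
        # Keep the most recent account creation date seen so far.
        if latest_seen is None or created > latest_seen:
            latest_seen = created
        estimates.append((name, latest_seen))
    return estimates


# Toy data (entirely made up), newest follower first.
followers = [
    ("dee", date(2012, 3, 1)),
    ("cat", date(2010, 5, 1)),
    ("bob", date(2011, 6, 1)),
    ("ann", date(2009, 1, 1)),
]
estimates = estimate_follow_dates(followers)
```

Note how cat, whose account dates from 2010, gets an estimate of June 2011: she started following after bob, and bob’s account didn’t exist until then.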
A methodological problem arises when we have a low number of followers, because we don’t necessarily have enough newly created (follower) accounts starting to follow a target account soon after the creation of the follower account to give us a solid basis for estimating when folk started following the target account. (If someone creates a new account and then immediately uses it to follow a target account, we get a good sample in time relating to when that follower started following the target account… If you have lots of people following an account there’s more of a chance that some of them will be quick-after-creation to start following the target account.)
There may also be methodological problems with trying to run an analysis over a short period of time (too much noise/lack of temporal definition in the follower acquisition curve over a limited time range).
So with low follower numbers, where can we get our timestamps from?
In the context of a MOOC, let’s suppose that there is a central MOOC account with lots of followers, and those followers don’t have many friends or followers (certainly not enough for us to be able to generate smooth – and reliable – acquisition curves).
If the MOOC account has lots of followers, let’s suppose we can generate a reasonable follower acquisition curve from them.
This means that for each follower, fo_i, we can associate with them a time when they started following the MOOC account, fo_i_t. Let’s write that as fo(MOOC, fo_i)=fo_i_t, where fo(MOOC, fo_i) reads “the estimated time when MOOC is followed by fo_i”.
(I’m making this up as I’m going along…;)
If we look at the friends of fo_i (that is, the people they follow), we know that they started following the MOOC account at time fo_i_t. So let’s write that as fr(fo_i, MOOC)=fo_i_t, where fr(fo_i, MOOC) reads “the estimated time when fo_i friends MOOC”.
Since public friend/follower relationships are symmetrical on Twitter (if A friends B, then B is at that instant followed by A), we can also write fr(fo_i, MOOC) = fo(MOOC, fo_i), which is to say that the time when fo_i friends MOOC is the same time as when MOOC is followed by fo_i.
Got that?!;-) (I’m still making this up as I’m going along…!)
We now have a sample in time for calibrating at least a single point in the friend acquisition chart for fo_i. If fo_i follows other “celebrity” accounts for which we can generate reasonably sound follower acquisition charts, we should be able to add other timestamp estimates into the friend acquisition timeline.
If fo_i follows three accounts A,B,C in that order, with fr(fo_i,A)=t1 and fr(fo_i,C)=t2, we know that fr(fo_i,B) lies somewhere between t1 and t2, where t1 < t2. Let’s call that [t1,t2], reading it as [not earlier than t1, not later than t2]. Which is to say, fr(fo_i,B)=[t1,t2], or “fo_i makes friends with B not before t1 and not after t2”, or more simply “fo_i makes friends with B somewhen between t1 and t2”.
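To make the bracketing concrete, here’s a rough Python sketch. The function name is invented, times are plain numbers standing in for timestamps, and `None` is my own convention for “no bound known on this side”.

```python
def bound_follow_times(ordered_accounts, known_times):
    """Bracket each friending time between the nearest calibrated neighbours.

    ordered_accounts: accounts in the order they were friended, oldest first.
    known_times: dict of account -> estimated time, for calibrated accounts.
    Returns dict of account -> (not_earlier_than, not_later_than).
    """
    bounds = {}
    lower = None
    # Forward pass: carry the latest calibrated time seen so far as a lower bound.
    for account in ordered_accounts:
        if account in known_times:
            lower = known_times[account]
        bounds[account] = [lower, None]
    upper = None
    # Backward pass: carry the earliest calibrated time still to come as an upper bound.
    for account in reversed(ordered_accounts):
        if account in known_times:
            upper = known_times[account]
        bounds[account][1] = upper
    return {account: tuple(pair) for account, pair in bounds.items()}


# fo_i follows A, B, C in that order, with t1=10 and t2=25 (made-up values).
bounds = bound_follow_times(["A", "B", "C"], {"A": 10, "C": 25})
```

Here `bounds["B"]` comes out as `(10, 25)`, which is the [t1,t2] interval described above; calibrated accounts get a zero width interval.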
Let’s now look at fo_j, who has only a few followers, one of whom is fo_i. Suppose that fo_j is actually account B. We know that fo(fo_j,fo_i), and furthermore that fo(fo_j,fo_i)=fr(fo_i,fo_j). Since we know that fr(fo_i,B)=[t1,t2], and B=fo_j, we know that fr(fo_i,fo_j)=[t1,t2]. (Just swap the symbols in and out of the equations…) But what we now also have is a timestamp estimate into the followers list for fo_j, that is: fo(fo_j,fo_i)=[t1,t2].
If MOOC has lots of friends, as well as lots of followers, and MOOC has a policy of following back followers immediately, we can use it to generate timestamp probes into the friend timelines of its followers, via fo(MOOC,X)=fr(X,MOOC), and its friends, via fr(MOOC,Y)=fo(Y,MOOC). (We should be able to use other accounts with large friend or follower counts and reasonably well defined acquisition curves to generate additional samples?)
We can possibly also start to play off the time intervals from friend and follower curves against each other to try and reduce the uncertainty within them (that is, the range of them).
For example, if we have fr(fo_i,B)=[t1,t2], and from fo(B,fo_i)=[t3,t4], if t3 > t1, we can tighten up fr(fo_i,B)=[t3,t2]. Similarly, if t2 < t4, we can tighten up fo(B,fo_i)=[t3,t2]. Which I think in general is:
if fr(A,B)=[t1,t2] and fo(B,A)=[t3,t4], we can tighten up to fr(A,B) = fo(B,A) = [ greater_of(t1,t3), lesser_of(t2,t4) ]
Erm, maybe? (I should probably read through that again to check the logic!) Things also get a little more complex when we only have time range estimates for most of the friends or followers, rather than good single point timestamp estimates for when they were friended or started to follow…;-) I’ll leave it as an exercise for the reader to figure out how to write that down and solve it!;-)
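For what it’s worth, the simple tightening rule (the greater of the lower bounds, the lesser of the upper bounds) is just an interval intersection. A minimal Python sketch, with an invented function name and made-up numbers standing in for timestamps, and with a check for the case where the two estimates don’t overlap at all (which the rule as written doesn’t cover):

```python
def tighten(fr_interval, fo_interval):
    """Intersect two (not_earlier_than, not_later_than) estimates of the same event."""
    lower = max(fr_interval[0], fo_interval[0])  # greater_of(t1, t3)
    upper = min(fr_interval[1], fo_interval[1])  # lesser_of(t2, t4)
    if lower > upper:
        # The two estimates contradict each other; something upstream is wrong.
        raise ValueError("estimates are inconsistent: the intervals do not overlap")
    return (lower, upper)


# fr(A,B)=[10,25] and fo(B,A)=[14,30] tighten to [14,25].
tightened = tighten((10, 25), (14, 30))
```

This only handles the single pair case; it doesn’t attempt the harder problem of propagating ranges along a whole friend or follower list.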
If this thought experiment does work out, then several rules of thumb jump out if we want to maximise our chances of generating reasonably accurate friend and follower acquisition curves:
- set up your MOOC Twitter account close to the time you want to start using it so its creation date is as late as possible;
- encourage folk to follow the MOOC account, and follow back, to improve the chances of getting reasonable resolution in the follower acquisition curve for the MOOC account. These connections also provide time-estimated probes into follower acquisition curves of friends and friend acquisition curves of followers;
- consider creating new “fake” timestamp Twitter accounts that can immediately on creation follow and be friended by the MOOC account to place temporal markers into the acquisition curves;
- if followers follow other celebrity accounts (or are followed (back) by them), we should be able to generate timestamp samples by analysing the celebrity account acquisition curves.
I think I need to go and walk the dog again.
PS a couple more trivial fixed points: for a target account, the earliest time at which they were first followed or when they first friended another account is the creation date of the target account; the latest possible time they acquired their most recent friend or follower is the time at which the data was collected.
Evaluating Event Impact Through Social Media Follower Histories, With Possible Relevance to cMOOC Learning Analytics
Last year I sat on a couple of panels organised by I’m a Scientist’s Shane McCracken at various science communication conferences. A couple of days ago, I noticed Shane had popped up a post asking Who are you Twitter?, a quick review of a social media mapping exercise carried out on the followers of the @imascientist Twitter account.
Using the technique described in Estimated Follower Accession Charts for Twitter, we can estimate a follower acquisition growth curve for the @imascientist Twitter account:
I’ve already noted how we may be able to use “spikes” in follower acquisition rates to identify news events that involved the owner of a particular Twitter account and caused a surge in follower numbers as a result (What Happened Then? Using Approximated Twitter Follower Accession to Identify Political Events).
Thinking back to the context of evaluating the impact of events that include social media as part of the overall campaign, it struck me that whilst running a particular event may not lead to a huge surge in follower numbers on the day of the event or in the immediate aftermath, the followers who do sign up over that period might have signed up as a result of the event. And now we have the first inklings of a post hoc analysis tool that lets us try to identify these people, and perhaps look to see if their profiles are different to profiles of followers who signed up at different times (maybe reflecting the audience interest profile of folk who attended a particular event, or reflecting sign ups from a particular geographical area?)
In other words, through generating the follower acquisition curve, can we use it to filter down to folk who started following around a particular time in order to then see whether there is a possibility that they started following as a result of a particular event, and if so can count as some sort of “conversion”? (I appreciate that there are a lot of caveats in there!;-)
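As a sketch of what that filtering step might look like in Python: given estimated follow dates, pull out the folk whose estimate falls in a window around the event. The function name, the default window and the toy data are all assumptions for illustration, and the caveats about how rough the estimates are still apply.

```python
from datetime import date, timedelta


def followers_near_event(estimates, event_day, days_before=0, days_after=7):
    """Return names whose estimated follow date falls in a window around event_day.

    estimates: list of (name, estimated_follow_date) tuples.
    The window runs from days_before before the event to days_after after it.
    """
    start = event_day - timedelta(days=days_before)
    end = event_day + timedelta(days=days_after)
    return [name for name, when in estimates if start <= when <= end]


# Toy estimates (made up) and a hypothetical event on 10 March 2013.
estimates = [
    ("ann", date(2013, 3, 1)),
    ("bob", date(2013, 3, 12)),
    ("cat", date(2013, 3, 15)),
    ("dee", date(2013, 4, 20)),
]
candidates = followers_near_event(estimates, date(2013, 3, 10))
```

The profiles of the `candidates` could then be compared with those of followers who signed up at other times.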
A similar approach may also be relevant in the context of analysing link building around historical online community events, such as MOOCs… If we know somebody took a particular MOOC at a particular time, might we be able to construct their follower acquisition curve and then analyse it around the time of the MOOC, looking to see if the connections built over that period are different to the user’s other followers, and as such may represent links developed as a result of taking the MOOC? Analysing the timelines of the respective parties may further reveal conversational dynamics between those parties, and as such allow us to see whether a fruitful social learning relationship developed out of contact made in the MOOC?
Arguably primed by the open courseware and open learning initiatives that started a decade or so ago (precursor: OSTP), several notable MOOC platforms (Coursera, Udacity) provide a one stop supermarket for 20-100 hour large cohort, online “course experiences” offered by traditional universities. Using a blend of readings and video lectures, courses provide pacing through a predetermined syllabus on a particular topic, with social tools to allow students to discuss elements of the course with each other. To provide the feedback on progress, computer marking systems and scaleable “peer assessment” provide a means of letting a student know how well they are doing on the course.
At least, I think that’s how it works. I don’t really know. Though I’ve signed up for several MOOCs, I’ve never actually started any of them, or tried to work through them.
Maybe that’s because I tend to learn from resources discovered on the web in response to questions I’ve formulated myself (questions which often derive from reading other resources and either being confused by them or not being able to make sense of them!). But just see how far a web search for visualisation site:coursera.org gets you. As I believe Patrick McAndrew suggested. (Hmmm… I appear to have hit my article limit… Here’s Patrick’s OER13 keynote which I think led to the article.)
I’m not sure who, within the universities that have signed up to delivering courses on the various MOOC platforms, is deciding which courses to run with, but I suspect the marketing department has a hand in it.
Marketing departments also used to run university web presences, too, didn’t they?
Way back when, when I was still at school, I used to watch tech documentaries – I remember seeing a window based GUI for the first time on an old episode of Horizon – and OU programmes, amongst other things… If you’re over a certain age, you’ll remember them:
Things have moved on since then, of course. OU/BBC programmes now are of a far more mainstream flavour (for example, clips from recent OU/BBC co-pros currently on iPlayer). Add to that, a wide variety of online videos, such as the 60 second animation series (such as 60 Second Adventures in Thought, or 60 Second Adventures in Astronomy, or even 60 Second Adventures in Economics), or clips from OU produced videos sent to students as part of their course materials (this sort of thing, for example: A Cyborg Future?, The Buildings of Ancient Rome or Environmental Policy in an International Context.)
As “tasters”, the OU course programmes that appeared in the dead parts of the schedule on BBC2 introduced me to a particular form of discourse, somewhere between a lecture and a tutorial, at a particular level of understanding (higher education academic). The programmes were created as embedded parts of a distance education course, complementing readings and exercises, and though I find it hard to remember back, I think that came across: that the programmes were a glimpse into a wider, and deeper, exploration of a particular topic that the course of which they were a part provided. I don’t really recall.
So whilst the OU course programmes were part of a course, they were not the whole of it. They were windows into a course, rather than a microcosm of one. I get the impression that the MOOCs are intended in some way to be “complete”, albeit short, courses, that are maybe intended to act as tasters of more comprehensive, formal offerings. I don’t really know.
My introduction to the OU, then, aged ten, or thereabouts, so 35 years or so ago, was through these glimpses into courses that other people were studying. They were opportunities for me to walk into a lecture to see what it was like. The programmes were not intended for me, but I could partake of them. That is more of a “passively discoverable OER” model than a MOOC. Maybe. I don’t really know.
I wonder now, if now was then, how I would have come to discover the world of “academic” communications. Through Google, presumably. Through the web. Through the marketing department? Or through the academics, (but discovered how?).
I guess we could argue that MOOCs represent, in part, higher education marketing departments waking up to the fact that the web exists, that it is a contentful medium, in part at least, and that the universities have content that may attract eyeballs. Maybe. I don’t really know.
If the marketing departments are leading the MOOC campaigns, I wonder what sort of return they expect? Raising “brand awareness”? Being seen to be there/having a presence on a platform because other universities have? Generating leads and building up mailing lists? Online courses as “promotional items” (whatever that means)? Edutainment offerings?! I don’t really know.
Going back to the OU programmes on BBC2, the primary audience then were presumably students on a course, because the programmes were part of the course. Partially open courses. Courses being run for real that also had an open window open on to them, so that other people could see what sorts of thing were covered by those courses, and maybe learn a little along the way.
This is more in keeping with the model of online course delivery being pursued by Jim Groom’s ds106 or Martin Weller’s H817 module on “Open Education” (I think; I don’t really know). I think. I don’t really know.
Other models are possible, of course. The “cMOOC” – connectionist/connectivist MOOC – idea explores a different pedagogy. The xMOOC offerings of Coursera and Udacity wrap not-open courseware in a delivery platform and run scheduled cohorts. The original OU OpenLearn offering had the platform (Moodle), had the open content, but didn’t have the community that comes from marshalling learners into a scheduled offering. Or the hype. Or at least, not the right sort of hype (the hype that follows VC investment, where VC does not refer to Vice Chancellor). The cMOOC idea tries to be open as to curriculum, in part – a more fluid learning environment where loose co-ordination amongst participants encourages an element of personal research into a topic and sharing of the lessons learned. A pedagogy that seeks to foster independent learning in the sense of being able to independently research a topic, rather than independently pay to follow a prescribed path through a set of learning materials. In the xMOOC, a prescribed path that propagates a myth of there being one true way to learn a subject. One true path.
My own open course experiment was an uncourse. Tasked with writing a course on a subject about which I knew nothing, I sought to capture my own learning path through it, using that trail to inform the design of a supposedly more polished offering. The traces are still there, still open – Digital Worlds – Interactive Media and Game Design. The pages still get hit, the resources still get downloaded. I could – should – do a little more to make evident the pathways through the content.
Whilst the reverse chronological form of a blog made sense as I was discovering a trail through the subject area – new content was revealed to any others following the uncourse in a sensible order – looking back at the material now the journeys through each topic area start at the end, presenting anyone wishing to follow the path I took with an apparent problem. Though not really… If you select a Category on the Digital Worlds blog, and add ?order=asc – as for example http://digitalworlds.wordpress.com/category/animation/?order=asc, the posts will be presented in chronological order. I wonder if there is a switch – or a plugin – that can make chronological views of posts within a particular tag or category on WordPress automatically display in a chronological order? I don’t really know. This would provide one way of transforming a platform configured as a “live presentation” site into one that worked better as a legacy site. It’s not hard to imagine a Janus theme that would provide these two faces of a site, one in reverse chronological order for live delivery, the other in chronological order for folk wishing to follow the steps taken by a previous journeyman in the same order as they were originally taken.
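If you wanted to generate the chronological links mechanically (for a list of category pages, say), the ?order=asc trick is easy to script. A small Python sketch using the standard library; the function name is made up, and it does nothing more than add the one query parameter mentioned above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def chronological_url(category_url):
    """Return category_url with order=asc set, keeping any other query parameters."""
    parts = urlsplit(category_url)
    query = dict(parse_qsl(parts.query))
    query["order"] = "asc"
    return urlunsplit(parts._replace(query=urlencode(query)))


url = chronological_url("http://digitalworlds.wordpress.com/category/animation/")
print(url)
```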
I still don’t know what forces are at play that may result in MOOC-hype and whatever shakes out as a result transforming, if at all, higher education as we know it and as developing countries may yet come to know it. I really don’t know.
And I still don’t have a good feeling for how we can make most effective use of the web to support a knowledge driven society; how best we can make use of online content resources and social communication tools to help people to develop their own personal, and deeper, understanding about whatever topic, to help them make sense of whatever they need to make sense of; how best schools and universities can draw on the web to help people develop lifelong learning skills; what it means to use the web in furtherance of lifelong learning.
I really, really, don’t know.
As well as serendipity, I believe in confluence…
A headline in the Press Gazette declares that Trinity Mirror will be roll[ing] out five templates across 130-plus regional newspapers as emphasis moves to digital. Apparently, this follows a similar initiative by Johnston Press midway through last year: Johnston to roll out five templates for network of titles.
It seems that “key” to the Trinity Mirror initiative is the creation of a new “Shared Content Unit” based in Liverpool that will provide features content to Trinity’s papers across the UK [which] will produce material across the regional portfolio in print and online including travel, fashion, food, films, books and “other content areas that do not require a wholly local flavour”.
In my local rag last week, (the Isle of Wight County Press), a front page story on the Island’s gambling habit localised a national report by the Campaign for Fairer Gambling on Fixed Odds Betting Terminals. The report included a dataset (“To find the stats for your area download the spreadsheet here and click on the arrow in column E to search for your MP”) that I’m guessing (I haven’t checked…) provided some of the numerical facts in the story. (The Guardian Datastore also republished the data (£5bn gambled on Britain’s poorest high streets: see the data) with an additional column relating to “claimant count”, presumably the number of unemployment benefit claimants in each area (again, I haven’t checked…)) Localisation appeared in several senses:
So for example, the number of local betting shops and Fixed Odds betting terminals was identified, the mooted spend across those and the spend per head of population. Sensemaking of the figures was also applied by relating the spend to an equivalent number of NHS procedures or police jobs. (Things like the BBC Dimensions How Big Really provide one way of coming up with equivalent or corresponding quantities, at least in geographical area terms. (There is also a “How Many Really” visualisation for comparing populations.) Any other services out there like this? Maybe it’s possible to craft Wolfram Alpha queries to do this?)
Something else I spotted, via RBloggers: a post by Alex Singleton of the University of Liverpool on an Open Atlas around the 2011 Census for England and Wales. Alex has “been busy writing (and then running – around 4 days!) a set of R code that would map every Key Statistics variable for all local authority districts”. The result is a set of PDF docs for each Local Authority district mapping out each indicator. As well as publishing the separate PDFs, Alex has made the code available.
So what’s confluential about those?
The IWCP article localises the Fairer Gambling data in several ways:
- the extent of the “problem” in the local area, in terms of numbers of betting shops and terminals;
- a consideration of what the spend equates to on a per capita basis (the report might also have used a population of over 18s to work out the average “per adult islander”); note that there are also at least a couple of significant problems with calculating per capita averages in this example: first, the Island is a holiday destination, and the population swings over the summer months; secondly, do holidaymakers spend differently to residents on these machines?
- a corresponding quantity explanation that recasts the numbers into an equivalent spend on matters with relevant local interest.
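As a rough sketch of the per capita arithmetic, and of how much the answer depends on which population you divide by, here is a small Python example. All of the figures are made-up placeholders for illustration (they are not taken from the Fairer Gambling report or the IWCP story), and the function name is my own.

```python
def spend_per_head(total_spend, population):
    """Return the average spend per person for a given population."""
    if population <= 0:
        raise ValueError("population must be positive")
    return total_spend / population


# Placeholder figures, for illustration only; not taken from the report.
total_spend = 1_000_000.0
resident_population = 100_000   # hypothetical all-age resident population
adult_population = 80_000       # hypothetical over-18 resident population
summer_population = 150_000     # hypothetical peak-season population

per_resident = spend_per_head(total_spend, resident_population)
per_adult = spend_per_head(total_spend, adult_population)
per_summer_head = spend_per_head(total_spend, summer_population)

print(f"per resident:      £{per_resident:.2f}")
print(f"per adult:         £{per_adult:.2f}")
print(f"per head (summer): £{per_summer_head:.2f}")
```

The same total gives three quite different "per head" numbers, which is the caveat about denominators in miniature.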
The Census Atlas takes one recipe and uses it to create localised reports for each LA district. (I’m guessing with a quick tweak, separate reports could be generated for the different areas within a single Local Authority.)
Trinity Mirror’s “Shared Content Unit” will produce content “that do[es] not require a wholly local flavour”, presumably syndicating it to its relevant outlets. But it’s not hard to also imagine a “Localisable Content” unit that develops applications that can help produce localised variants of “templated” stories produced centrally. This needn’t be quite as automated as the line taken by computational story generation outfits such as Narrative Science (for example, Can the Computers at Narrative Science Replace Paid Writers? or Can an Algorithm Write a Better News Story Than a Human Reporter?) but instead could produce a story outline or shell that can be localised.
A shorter term approach might be to centrally produce data driven applications that can be used to generate charts, for example, relevant to a locale in an appropriate style. So for example, using my current tool of choice for generating charts, R, we could generate something and then allow local press to grab data relevant to them and generate a chart in an appropriate style (for example, Style your R charts like the Economist, Tableau … or XKCD). This approach saves duplication of effort in getting the data, cleaning it, building basic analysis and chart tools around it, and so on, whilst allowing for local customisation in the data views presented. There is also an increasing number of workflows available around R (for example, RPubs, knitr, github, and a new phase for the lab notebook, Create elegant, interactive presentations from R with Slidify, [Wordpress] Bloggin’ from R).
Using R frameworks such as Shiny, we can quickly build applications such as my example NHS Winter Sitrep data viewer (about) that explores how users may be able to generate chart reports at Trust or Strategic Health Authority level, and (if required) download data sets related to those areas alone for further analysis. The data is scraped and cleaned once, “centrally”, and common analyses and charts coded once, “centrally”, and can then be used to generate items at a local level.
The next step would be to create scripted story templates that allow journalists to pull in charts and data as required, and then add local colour – quotes from local representatives, corresponding quantities that are somehow meaningful. (I should try to build an example app from the Fairer Gambling data, maybe, and pick up on the Guardian idea of also adding in additional columns…again, something where the work can be done centrally, looking for meaningful datasets and combining them with the original data set.)
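By way of illustration, a scripted story shell could be as simple as a text template filled from one row of a centrally prepared table. The sketch below uses Python's standard `string.Template`; the column names, the wording of the shell and the two example rows are all hypothetical, not taken from any real dataset or story.

```python
from string import Template

# A centrally written story shell. The bracketed slot is left for the
# local journalist to fill in by hand.
STORY_SHELL = Template(
    "There are $shops betting shops in $area, housing $terminals "
    "fixed odds betting terminals. [LOCAL QUOTE HERE]"
)


def localise(shell, row):
    """Fill a story shell from one row of a data table (a dict)."""
    return shell.substitute(row)


# Hypothetical rows standing in for a cleaned, centrally held dataset.
rows = [
    {"area": "Area A", "shops": 10, "terminals": 40},
    {"area": "Area B", "shops": 3, "terminals": 12},
]

drafts = {row["area"]: localise(STORY_SHELL, row) for row in rows}
print(drafts["Area A"])
```

The point of keeping the shell this dumb is that the numbers arrive pre-checked from the centre, while the quote, the angle and the corresponding quantities stay with the local reporter.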
Business opportunities also arise outside media groups. For example, a similar service idea could be used to provide story templates – and pull-down local data – to hyperlocal blogs. Or a ‘data journalism wire service’ could develop applications to aid in the creation of data supported stories on a particular topic. PR companies could do a similar thing (for example, appifying the Fairer Gambling data as I “appified” the NHS Winter sitrep data, maybe adding in data such as the actual location of fixed odds betting terminals). (On my to do list is packaging up the recently announced UCAS 2013 entries data.)
The insight here is not to produce interactive data apps (aka “news applications”) for “readers” who have no idea how to use them or how to read from them whatever stories they might tell; rather, it is the production of interactive applications for generating charts and data views that can be used by a “data” journalist. Rather than having a local journalist working with a local team of developers and designers to get a data flavoured story out, a central team produces a single application that local journalists can use to create a localised version of a particular story that has local meaning but at national scale.
Note that by concentrating specialisms in a central team, there may also be the opportunity to then start exploring the algorithmic annotation of local data records. It is worth noting that Narrative Science are already engaged in this sort of activity too, as for example described in this ProPublica article on How To Edit 52,000 Stories at Once, a news application that includes “short narrative descriptions of almost all of the more than 52,000 schools in our database, generated algorithmically by Narrative Science”.
PS Hmm… I wonder… is there time to get a proposal together on this sort of idea for the Carnegie Trust Neighbourhood News Competition? Get in touch if you’re interested…
A handful of open Linked Data posts have appeared through my feeds in the last couple of days, including (via RBloggers) SPARQL with R in less than 5 minutes, which shows how to query US data.gov Linked Data, and then Leigh Dodds’ Brief Review of the Land Registry Linked Data.
I was going to post a couple of examples merging those two posts – showing how to access Land Registry data via Leigh’s example queries in R, then plotting some of the results using ggplot2 – but another post of Leigh’s today – SPARQL-doc – a simple convention for documenting individual SPARQL queries, has sparked another thought…
For some time I’ve been intrigued by the idea of a marketplace in queries over public datasets, as well as the public sharing of generally useful queries. A good query is like a good gold pan, or a good interview question – it can get a dataset to reveal something valuable that may otherwise have lain hidden. Coming up with a good query in part requires having a good understanding of the structure of a dataset, in part having an eye for what sorts of secret the data may contain: the next step is crafting a well phrased query that can tease that secret out. Creating the query might take some time, some effort, and some degree of expertise in query optimisation to make it actually runnable in reasonable time (which is why I figure there may be a market for such things*) but once written, the query is there. And if it can be appropriately parameterised, it may generalise.
(*There are actually a couple of models I can think of: 1) I keep the query secret, but run it and give you the results; 2) I license the “query source code” to you and let you run it yourself. Hmm, I wonder: do folk license queries they share? How, and to what extent, might derived queries/query modifications be accommodated in such a licensing scheme?)
Pondering Leigh’s SPARQL-doc post, another post via R-bloggers, Building a package in RStudio is actually very easy (which describes how to package a set of R files for distribution via github), asdfree (analyze survey data for free), a site that “announces obsessively-detailed instructions to analyze us government survey data with free tools” (and which includes R bundles to get you started quickly…), the resource listing Documentation for package ‘datasets’ version 2.15.2 that describes a bundled package of datasets for R and the Linked Data API, which sought to provide a simple RESTful API over SPARQL endpoints, I wondered the following:
How about developing and sharing commented query libraries around Linked Data endpoints that could be used in arbitrary Linked Data clients?
(By “Linked Data clients”, I mean different user agent contexts. So for example, calling a query from Python, or R, or Google Spreadsheets.) That’s it… Simple.
One approach (the simplest?) might be to put each separate query into a separate file, with a filename that could be used to spawn a function name that could be used to call that query. Putting all the queries into a directory and zipping them up would provide a minimal packaging format. An additional manifest file might minimally document the filename along with the parameters that can be passed into and returned from the query. Helper libraries in arbitrary languages would open the query package and “compile” a programme library/set of “API” calling functions for that language (so for example, in R it would create a set of R functions, in Python a set of Python functions).
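Here is a minimal sketch of what such a helper library might look like in Python. Everything about the conventions is an assumption made for illustration: one query per `.rq` file, the file stem used as the function name, and `$name` placeholders for parameters (note that SPARQL itself allows `$` as a variable prefix, so a real format would need a less collision-prone marker). The generated functions only build the query text; sending it to an endpoint is left to whatever client you already use.

```python
import tempfile
from pathlib import Path
from string import Template


def load_query_package(directory):
    """Build one query-building function per *.rq file in a directory.

    Assumed conventions: the file stem becomes the function name, and
    $name placeholders in the query text become keyword arguments.
    """
    functions = {}
    for path in sorted(Path(directory).glob("*.rq")):
        template = Template(path.read_text(encoding="utf-8"))

        def build(_template=template, **params):
            # Return the filled-in query text only; no network call here.
            return _template.substitute(**params)

        build.__name__ = path.stem
        functions[path.stem] = build
    return functions


# Demo: a throwaway "package" containing one hypothetical query file.
with tempfile.TemporaryDirectory() as package_dir:
    Path(package_dir, "things_of_type.rq").write_text(
        "SELECT ?s WHERE { ?s a <$type_uri> } LIMIT $limit\n",
        encoding="utf-8",
    )
    api = load_query_package(package_dir)

query = api["things_of_type"](type_uri="http://example.org/Thing", limit=10)
print(query)
```

An R or Google Spreadsheets helper would read the same files and spawn its own functions, which is the whole attraction: the queries are written and documented once.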
(This reminds me of a Twitter exchange with Nick Jackson/@jacksonj04 a couple of days ago around “self-assembling” API programme libraries that could be compiled in an arbitrary language from a JSON API, cf. Swagger (presentation), which I haven’t had time to look at yet.)
The idea, then, is this:
- Define a simple file format for declaring documented SPARQL queries
- Define a simple packaging format for bundling separate SPARQL queries
- The simply packaged set of queries defines a simple “raw query” API over a Linked Data dataset
- Describe a simple protocol for creating programming language specific library wrappers around the API from the query bundle package.
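For the first item in that list, a sketch of how a documented query file might be read: leading comment lines of the form `# @key value` are treated as documentation, and everything after them as the query. The tag names used in the example (`@title`, `@param`) are illustrative assumptions on my part rather than a quotation of Leigh's SPARQL-doc convention.

```python
def parse_documented_query(text):
    """Split a query file into (metadata, query).

    Leading '# @key value' comment lines are collected into a dict of
    lists (a key may repeat); the first non-comment, non-blank line
    starts the query body.
    """
    metadata = {}
    lines = text.splitlines()
    body_start = 0
    for index, line in enumerate(lines):
        stripped = line.strip()
        if stripped.startswith("#"):
            comment = stripped.lstrip("#").strip()
            if comment.startswith("@"):
                key, _, value = comment[1:].partition(" ")
                metadata.setdefault(key, []).append(value.strip())
            body_start = index + 1
        elif stripped == "":
            body_start = index + 1
        else:
            break
    query = "\n".join(lines[body_start:]).strip()
    return metadata, query


# A hypothetical documented query file.
example = """# @title Ten things
# @param limit maximum number of rows
# @param type_uri class to match
SELECT ?s WHERE { ?s a <$type_uri> } LIMIT $limit
"""

metadata, query = parse_documented_query(example)
print(metadata["title"])
print(metadata["param"])
print(query)
```

A manifest for a whole bundle could then be generated by running this over every file, rather than being maintained by hand.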
So.. I guess two questions arise: 1) would this be useful? 2) how hard could it be?
[See also: @ldodds again, on Publishing SPARQL queries and-documentation using github]
I’ve been pondering what it is to be an engineer, lately, in the context of trying to work out what it is that I actually do and what sort of “contract” I feel I’m honouring (and with whom) by doing whatever it is that I spend my days doing…
According to Wikipedia, [t]he term engineering … deriv[es] from the word engineer, which itself dates back to 1325, when an engine’er (literally, one who operates an engine) originally referred to “a constructor of military engines.” … The word “engine” itself is of even older origin, ultimately deriving from the Latin ingenium (c. 1250).
Via Wiktionary, [e]ngine originally meant ‘ingenuity, cunning’ which eventually developed into meaning ‘the product of ingenuity, a plot or snare’ and ‘tool, weapon’. Engines as the products of cunning, then, and hence, naturally, war machines. And engineers as their operators, or constructors.
One of the formative books in my life (mid-teens, I think) was Richard Gregory’s Mind in Science, from which I took away the idea of tools as things that embodied and executed an idea. You see a way of doing something or how to do something, and then put that idea into an artefact – a tool – that does it. Code is a particularly expressive medium in this respect, AI (in the sense of Artificial Intelligence) one way of explicitly trying to give machines ideas, or embody mind in machine. (I have an AI background – my PhD in evolutionary computation was pursued in a cognitive science unit (HRCL, as was) at the OU; what led me to “AI”, I think, was a school of thought relating to the practice of how to use code to embody mind and natural process in machines, as well as how to use code that can act on, and be acted on by, the physical world.)
So part of what I (think I) do is build tools, executable expressions of ideas. I’m not so interested in how they are used. I’ve also started sketching maps a lot, lately, of social networks and other things that can be represented as graphs. These are tools too – macroscopes for peering at structural relationships within a system – and again, once produced, I’m not so interested in how they’re used. (What excites me is finding the process that allows the idea to be represented or executed.)
If we go back to the idea of “engineer”, and dig a little deeper by tracing the notion of ingenium, we find this take on it:
ingenium is the original and natural faculty of humans; it is the proper faculty with which we achieve certain knowledge. It is original because it is the first “facility” young people untouched by prejudices exemplify upon seeing similarities between disparate things. It is natural because it is to us what the power to create is to God. just as God easily begets a world of nature, so we ingeniously make discoveries in the sciences and artifacts in the arts. Ingenium is a productive and creative form of knowledge. It is poietic in the creation of the imagination; it is rhetorical in the creation of language, through which all sciences are formalized. Hence, it requires its own logic, a logic that combines both the art of finding or inventing arguments and that of judging them. Vico argues that topical art allows the mind to locate the object of knowledge and to see it in all its aspects and not through “the dark glass” of clear and distinct ideas. The logic of discovery and invention which Vico uses against Descartes’s analytics is the art of apprehending the true. With this Vico come full circle in his arguments against Descartes. [From the Introduction by L.M. Palmer to Vico on Ingenium, in Giambattista Vico: On the Most Ancient Wisdom of the Italians. Trans. L.M. Palmer. London: Cornell University Press, 1988. 31-34, 96-104. Originally published 1710.]
And for some reason, at first reading, that brings me peace…
…which I shall savour on a quick dog walk. I wonder if the woodpecker will be out in the woods today?
[aka A Note on Noticing...] Ever since I can remember, information discovery has fascinated me (a year after graduating, just before the web appeared, I started exploring with a university friend how we might rival subscription providers such as Dialog with tooling built around Archie, gopher, Veronica and so on that could provide one stop information destinations in vertical content areas… And then I landed a postgrad position… And soon after saw Mosaic for the first time…). The intelligence part of the stack – making stories on top of information maps, I guess (paraphrasing Fragments of Amber: Map Makers and Story Tellers) – was always the next step, the more obviously value-add step, the more obviously saleable step, the step that could directly support decision making and as such initiate actions that could directly drive income or savings. Or thwart terrorist plots. Or whatever.
Anyway, via @mhawksey, I see this tweet: “Curating Big Data in the Cloud rww.to/H8aVzb via @sheilmcn“
The ReadWriteWeb story describes a company called Flow that allows users to construct their own Techmeme like content trackers. The tweet – and I think something of the sentiment behind Martin’s tweeting it – immediately put me in mind of* Strategy Eye, a platform I came across yesterday, that bills itself as a “[c]loud platform enabling media partners to rapidly launch B2B intelligence portals and drive subscription or ad revenues”. In short, a platform that (in part at least) seems to let you build vertical content aggregators such as Wind Power Intelligence (“Windpower Intelligence monitors key activities in the global wind industry including pipeline wind farms, deals, contracts, investments and policies. Clients access our content by subscribing to our Investment Reports and real-time Tracker.”).
[* that is, the things I've noticed in the last few hours, days, weeks, months, years, sensitise me to the things I notice today as I try to link in today's noticings with things I've noticed before...]
I came across Strategy Eye and Windpower Intelligence via this post: Media Pioneer: Windpower Intelligence, which opens as follows: “Windpower Intelligence is a paid-for digital tracker service launched in September 2010. The tracker is an experimental new platform developed by Haymarket with StrategyEye. The aim was to create a high value information service that wove together journalism, data and analytical tools in an intuitive system delivering customised information and data. In under a year, this service has generated a six-figure sum in subscription revenue, earning a place on the 2012 Media Pioneers shortlist.”
As to how I found the InPublishing article? That was another tweet, this time from @paulbradshaw: “Another example of a magazine with a business model centred on data: Windpower Intelligence is a paid-for digita… http://bit.ly/HLXrJo”
The “business” keyword in Paul’s post (along with the “six figure sum” reference in the article he linked to), along with the attention grabbing “imagine ££££” from Martin’s post, also helped sensitise me to the spot-it-or-don’t natural pairing of Flow and Strategy Eye along the “(new?) ways of businesses making money from data” dimension. And as every follower of Dirk Gently knows, you have to be sensitive to coincidence and the fundamental interconnectedness of things…
So why this post? Because this little example is just a note to self about my “practice”, that captures part of it well. The chance noticing of two vaguely related things that together help crystallise out a clearer idea about what’s going on in the world. In this case, flow based tracking verticals. Which isn’t really a new idea at all… Although that isn’t to say it won’t be a huge market, albeit just an iteration of a current one.
PS on the sort of related ‘not quite futurism’ front, see this recent post from John Battelle: If-Then and Antiquities of the Future (which also put me in mind of @briankelly’s The History Of The Web Backwards).
Secretary of State for Education, Michael Gove, gave a speech yesterday on rethinking the ICT curriculum in UK schools. You can read a copy of the speech variously on the Department for Education website, or, err, on the Guardian website.
Seeing these two copies of what is apparently the same speech, I started wondering:
a) which is the “best” source to reference?
b) how come the Guardian doesn’t add a disclaimer about the provenance of, and link to, the DfE version? [Note the disclaimer in the DfE version - "Please note that the text below may not always reflect the exact words used by the speaker."]
c) is the Guardian version an actual transcript, maybe? That is, does the Guardian reprint the “exact words” used by the speaker?
And that made me think I should do a diff… About which, more below…
Before that, however, here’s a quick piece of reflection on how these two things – the reinvention of the IT curriculum, and the provenance of, and value added to, content published on news and tech industry blog sites – collide in my mind…
So for example, I’ve been pondering what the role of journalism is, lately, in part because I’m trying to clarify in my own mind what I think the practice and role of data journalism are (maybe I should apply for a Nieman-Berkman Fellowship in Journalism Innovation to work on this properly?!). It seems to me that “communication” is one important part (raising awareness of particular issues, events, or decisions), and holding governments and companies to account is another. (Actually, I think Paul Bradshaw has called me out on that, before, suggesting it was more to do with providing an evidence base through verification and triangulation, as well as comment, against which governments and companies could be held to account (err, I think? As an unjournalist, I don’t have notes or a verbatim quote against which to check that statement, and I’m too lazy to email/DM/phone Paul to clarify what he may or may not have said…(The extent of my checking is typically limited to what I can find on the web or in personal archives…which appear to be lacking on this point…)))
Another thing I’ve been mulling over recently in a couple of contexts relates to the notion of what are variously referred to as digital or information skills.
The first context is “data journalism”, and the extent to which data journalists need to be able to do programming (in the sense of identifying the steps in a process that can be automated and how they should be sequenced or organised) versus writing code. (I can’t write code for toffee, but I can read it well enough to copy, paste and change bits that other people have written. That is, I can appropriate and reuse other people’s code, but can’t write it from scratch very well… Partly because I can’t ever remember the syntax and low level function names. I can also use tools such as Yahoo Pipes and Google Refine to do coding like things…) Then there’s the question of what to call things like URL hacking or (search engine) query building?
The second context is geeky computer techie stuff in schools, the sort of thing covered by Michael Gove’s speech at the BETT show on the national ICT curriculum (or lack thereof), which the educational digerati were all over on Twitter yesterday. Over the weekend, houseclearing my way through various “archives”, I came across all manner of press clippings from 2000-2005 or so about the activities of the OU Robotics Outreach Group, of which I was a co-founder (the web presence has only recently been shut down, in part because of the retirement of the sys admin on whose server the websites resided). This group ran an annual open meeting every November for several years hosting talks from the educational robotics community in the UK (from primary school to HE level). The group also co-ordinated the RoboCup Junior competition in the UK, ran outreach events, developed various support materials and activities for use with Lego Mindstorms, and led the EPSRC/AHRC Creative Robotics Research Network.
At every robotics event, we’d try to involve kids and/or adults in elements of problem solving, mechanical design, programming (not really coding…) based around some sort of themed challenge: a robot fashion show, for example, or a treasure hunt (both variants on edge following/line following;-) Or a robot rescue mission, as used in a day long activity in the “Engineering: An Active Introduction” (TXR120) OU residential school, or the 3 hour “Robot Theme Park” team building activity in the Masters level “Team Engineering” (T885) weekend school. [If you're interested, we may be able to take bookings to run these events at your institution. We can make them work at a variety of difficulty levels from KS3-4 and up;-)]
Given that working at the bits-atoms interface is where a lot of the not-purely-theoretical-or-hardcore-engineering innovation and application development is likely to take place over the next few years, any mandate to drop the “boring” Windows training ICT stuff in favour of programming (which I suspect can be taught in not only a really tedious way, but a really confusing and badly delivered way too) is probably Not the Best Plan.
Slightly better, and something that I know is currently being mooted for reigniting interest in computing, is the Raspberry Pi, a cheap, self-contained, programmable computer on a board (good for British industry, just like the BBC Micro was…;-) that allows you to work at the interface between the real world of atoms and the virtual world of bits that exists inside the computer. (See also things like the OU Senseboard, as used on the OU course “My Digital Life” (TU100).)
If schools were actually being encouraged to make a financial investment on a par with the level of investment around the introduction of the BBC Micro, back in the day, I’d suggest a 3D printer would have more of the wow factor…(I’ll doodle more on the rationale behind this in another post…) The financial climate may not allow for that (but I bet budget will manage to get spent anyway…) but whatever the case, I think Gove needs to be wary about consigning kids to lessons of coding hell. And maybe take a look at programming in a wider creative context, such as robotics (the word “robotics” is one of the reasons why I think it’s seen as a very specialised, niche subject; we need a better phrase, such as “Creative Technologies”, which could combine elements of robotics, games programming, photoshop, and, yes, Powerpoint too… Hmm… thinks.. the OU has a couple of courses that have just come to the end of their life that between them provide a couple of hundred hours of content and activity on robotics (T184) and games programming (T151), and that we delivered, in part, to 6th formers under the OU’s Young Applicants in Schools Scheme.)
Anyway, that’s all as maybe… Because there are plenty of digital skills that let you do coding like things without having to write code. Such as finding out whether there are any differences between the text in the DfE copy of Gove’s BETT speech, and the Guardian copy.
Copy the text from each page into a separate text file, and save it. (You’ll need a text editor for that..) Then, if you haven’t already got one, find yourself a good text editor. I use Text Wrangler on a Mac. (Actually, I think MS Word may have a diff function?)
The differences all tend to be in the characters used for quotation marks (character encodings are one of the things that can make all sorts of programmes fall over, or misbehave. Just being aware that they may cause a problem, as well as how and why, would be a great step in improving the baseline level understanding of folk IT). Some of the line breaks don’t quite match up either, but other than that, the text is the same.
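If you did want to do the comparison in code rather than in a text editor, here is a minimal sketch using Python's standard `difflib`. The two short strings are made-up stand-ins for the two copies of the speech (not the actual text), and the helper names are my own; the "normalise" step simply swaps curly quotes for straight ones and collapses line breaks, which are the two kinds of difference described above.

```python
import difflib

# Map curly single and double quotes onto their straight equivalents.
QUOTE_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'",
    "\u201c": '"', "\u201d": '"',
})


def normalise(text):
    """Straighten quotes and collapse all whitespace, line breaks included."""
    return " ".join(text.translate(QUOTE_MAP).split())


def differing_lines(a, b):
    """Return the lines that ndiff marks as removed ('- ') or added ('+ ')."""
    diff = difflib.ndiff(a.splitlines(), b.splitlines())
    return [line for line in diff if line.startswith(("- ", "+ "))]


# Made-up stand-ins for the two copies: same words, different quote
# characters and a different line break.
copy_one = "He said \u201ccode\u201d matters.\nSchools should teach it."
copy_two = 'He said "code" matters.\nSchools should\nteach it.'

raw = differing_lines(copy_one, copy_two)
cleaned = differing_lines(normalise(copy_one), normalise(copy_two))

print("raw differences:", len(raw))
print("after normalising:", len(cleaned))
```

Because the normalising step flattens each text to a single line, any real difference in wording would show up as one long changed line; for a finer-grained view you would compare word by word instead.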
Now, this may be because Gove was a good little minister and read out the words exactly as they had been prepared. Or it may be the case that the Guardian just reprinted the speech without mentioning provenance, or the disclaimer that he may not actually have read the words of that speech (I have vague memories of an episode of Yes, Minister, here…;-)
Whatever the case, if you know: a) that it’s even possible to compare two documents to see if they are different (a handy piece of folk IT knowledge); and b) a tool that does it (or how to find a tool that does it, or a person that may have a tool that can do it), then you can compare the texts for yourself. And along the way, maybe learn that churnalism, in a variety of forms, is endemic in the media. Or maybe just demonstrate to yourself when the media is acting in a purely comms, rather than journalistic, role?
PS other phrases in the area: “computational thinking”. Hear, for example: A conversation with Jeannette Wing about computational thinking
PPS I just remembered – there’s a data journalism hook around this story too… from a tweet exchange last night that I was reminded of by an RT:
josiefraser: RT @grmcall: Of the 28,000 new teachers last year in the UK, 3 had a computer-related degree. Not 3000, just 3.
dlivingstone: @josiefraser Source??? Not found it yet. RT @grmcall: 28000 new UK teachers last year, 3 had a computer-related degree. Not 3000, just 3
josiefraser: That ICT qualification teacher stat RT @grmcall: Source is the Guardian http://www.guardian.co.uk/education/2012/jan/09/computer-studies-in-schools
I did a little digging and found the following document on the General Teaching Council for England website – Annual digest of statistics 2010–11 – Profiles of registered teachers in England [PDF] – that contains demographic stats, amongst others, for UK teachers. But no stats relating to subject areas of degree level qualifications held, which is presumably the data referred to in the tweet. So I’m thinking: this is partly where the role of data journalist comes in… They may not be able to verify the numbers by checking independent sources, but they may be able to shed some light on where the numbers came from and how they were arrived at, and maybe even secure their release (albeit as a single point source?)