Archive for the ‘Thinkses’ Category
A Few More Thoughts on the Forensic Analysis of Twitter Friend and Follower Timelines in a MOOCalytics Context
Immediately after posting Evaluating Event Impact Through Social Media Follower Histories, With Possible Relevance to cMOOC Learning Analytics, I took the dog out for a walk to ponder the practicalities of constructing follower (or friend) acquisition charts for accounts with only a low number of followers, or friends, as might be the case for folk taking a MOOC or who have attended a particular event. One aim I had in mind was to probe the extent to which a MOOC may help develop social ties between participants – whether MOOC participants know each other prior to taking the MOOC, or whether they come to develop social links after taking it. Another aim was simply to see whether changes in the velocity or makeup of follower acquisition curves could tell us whether particular events led either to growth in follower numbers or to community development between followers.
To recap on the approach used for constructing follower acquisition charts (as described in Estimated Follower Accession Charts for Twitter, and which also works (in principle!) for plotting when Twitter users started following folk):
- you can’t start following someone on Twitter until you join Twitter;
- follower lists on Twitter are reverse chronological statements of the order in which folk started following the corresponding account;
- starting with the first follower of an account (the bottom end of the follower list), we can estimate when each follower started following the account as the most recent account creation date seen so far amongst that follower and the people who started following before them (a minimal sketch of this estimation step follows this list).
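By way of illustration, here’s that estimation step as a minimal Python sketch. It assumes we already have each follower’s account creation date, ordered oldest follower first; the field names and example dates are made up for illustration, this isn’t the Twitter API itself, just the bookkeeping.

```python
from datetime import datetime

def estimate_accession_times(follower_creation_dates):
    """Estimate when each follower started following the target account.

    follower_creation_dates: list of datetimes, one per follower, ordered
    oldest-follower-first (the reverse of the order the Twitter follower
    list is returned in).

    A follower can't have started following before their own account
    existed, nor before the latest-created account amongst the followers
    that preceded them, so the running maximum of creation dates gives a
    lower-bound estimate of each follow time.
    """
    estimates = []
    latest_seen = None
    for created in follower_creation_dates:
        latest_seen = created if latest_seen is None else max(latest_seen, created)
        estimates.append(latest_seen)
    return estimates

# Example: the second follower joined Twitter just before following,
# which pins down the estimate for everyone who follows after them.
creation_dates = [datetime(2010, 3, 1), datetime(2012, 5, 20), datetime(2011, 1, 7)]
print(estimate_accession_times(creation_dates))
# -> [2010-03-01, 2012-05-20, 2012-05-20]
```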
A methodological problem arises when we have a low number of followers, because we don’t necessarily have enough newly created (follower) accounts starting to follow a target account soon after the creation of the follower account to give us a solid basis for estimating when folk started following the target account. (If someone creates a new account and then immediately uses it to follow a target account, we get a good sample in time relating to when that follower started following the target account… If you have lots of people following an account there’s more of a chance that some of them will be quick-after-creation to start following the target account.)
There may also be methodological problems with trying to run an analysis over a short period of time (too much noise/lack of temporal definition in the follower acquisition curve over a limited time range).
So with low follower numbers, where can we get our timestamps from?
In the context of a MOOC, let’s suppose that there is a central MOOC account with lots of followers, and those followers don’t have many friends or followers (certainly not enough for us to be able to generate smooth – and reliable – acquisition curves).
If the MOOC account has lots of followers, let’s suppose we can generate a reasonable follower acquisition curve from them.
This means that for each follower, fo_i, we can associate with them a time when they started following the MOOC account, fo_i_t. Let’s write that as fo(MOOC, fo_i)=fo_i_t, where fo(MOOC, fo_i) reads “the estimated time when MOOC is followed by fo_i”.
(I’m making this up as I’m going along…;)
If we look at the friends of fo_i (that is, the people they follow), we know that they started following the MOOC account at time fo_i_t. So let’s write that as fr(fo_i, MOOC)=fo_i_t, where fr(fo_i, MOOC) reads “the estimated time when fo_i friends MOOC”.
Since public friend/follower relationships are symmetrical on Twitter (if A friends B, then B is at that instant followed by A), we can also write fr(fo_i, MOOC) = fo(MOOC, fo_i), which is to say that the time when fo_i friends MOOC is the same time as when MOOC is followed by fo_i.
Got that?!;-) (I’m still making this up as I’m going along…!)
We now have a sample in time for calibrating at least a single point in the friend acquisition chart for fo_i. If fo_i follows other “celebrity” accounts for which we can generate reasonably sound follower acquisition charts, we should be able to add other timestamp estimates into the friend acquisition timeline.
If fo_i follows three accounts A,B,C in that order, with fr(fo_i,A)=t1 and fr(fo_i,C)=t2, we know that fr(fo_i,B) lies somewhere between t1 and t2, where t1 < t2; let’s call that [t1,t2], reading it as [not earlier than t1, not later than t2]. Which is to say, fr(fo_i,B)=[t1,t2], or “fo_i makes friends with B not before t1 and not after t2”, or more simply “fo_i makes friends with B somewhen between t1 and t2”.
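As a rough sketch of that bookkeeping (the ids and the single point estimates passed in are placeholders; in practice the point estimates would come from the follower accession curves of well-followed accounts, and the missing outer bounds could be filled with the account creation date and the data collection time):

```python
def bound_friend_times(friend_ids, known_times):
    """Assign each friend a (not_earlier_than, not_later_than) interval.

    friend_ids: friend ids in the order they were friended (oldest first).
    known_times: dict mapping some of those ids to single point estimates.
    Friends without a known time inherit the nearest known time before them
    as a lower bound and the nearest known time after them as an upper bound;
    missing bounds are left as None.
    """
    n = len(friend_ids)
    lower, upper = [None] * n, [None] * n
    last = None
    for i, fid in enumerate(friend_ids):          # forward pass: lower bounds
        if fid in known_times:
            last = known_times[fid]
        lower[i] = known_times.get(fid, last)
    nxt = None
    for i in range(n - 1, -1, -1):                # backward pass: upper bounds
        fid = friend_ids[i]
        if fid in known_times:
            nxt = known_times[fid]
        upper[i] = known_times.get(fid, nxt)
    return {fid: (lower[i], upper[i]) for i, fid in enumerate(friend_ids)}

# fo_i friends A, B, C in that order; we only have point estimates for A and C
print(bound_friend_times(["A", "B", "C"], {"A": 1, "C": 2}))
# -> {'A': (1, 1), 'B': (1, 2), 'C': (2, 2)}
```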
Let’s now look at fo_j, who has only a few followers, one of whom is fo_i. Suppose that fo_j is actually account B. We know there is a time fo(fo_j,fo_i) at which fo_j was followed by fo_i, and furthermore that fo(fo_j,fo_i)=fr(fo_i,fo_j). Since we know that fr(fo_i,B)=[t1,t2], and B=fo_j, we know that fr(fo_i,fo_j)=[t1,t2]. (Just swap the symbols in and out of the equations…) But what we now also have is a timestamp estimate into the followers list for fo_j, that is: fo(fo_j,fo_i)=[t1,t2].
If MOOC has lots of friends, as well as lots of followers, and MOOC has a policy of following back followers immediately, we can use it to generate timestamp probes into the friend timelines of its followers, via fo(MOOC,X)=fr(X,MOOC), and its friends, via fr(MOOC,Y)=fo(Y,MOOC). (We should be able to use other accounts with large friend or follower counts and reasonably well defined acquisition curves to generate additional samples?)
We can possibly also start to play off the time intervals from friend and follower curves against each other to try and reduce the uncertainty within them (that is, the range of them).
For example, if we have fr(fo_i,B)=[t1,t2], and from fo(B,fo_i)=[t3,t4], if t3 > t1, we can tighten up fr(fo_i,B)=[t3,t2]. Similarly, if t2 < t4, we can tighten up fo(B,fo_i)=[t3,t2]. Which I think in general is:
if fr(A,B)=[t1,t2] and fo(B,A)=[t3,t4], we can tighten up to fr(A,B) = fo(B,A) = [ greater_of(t1,t3), lesser_of(t2,t4) ]
Erm, maybe? (I should probably read through that again to check the logic!) Things also get a little more complex when we only have time range estimates for most of the friends or followers, rather than good single point timestamp estimates for when they were friended or started to follow…;-) I’ll leave it as an exercise for the reader to figure out how to write that down and solve it!;-)
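For what it’s worth, here’s that tightening rule as a minimal sketch, treating each estimate as an (earliest, latest) pair:

```python
def tighten(fr_ab, fo_ba):
    """Intersect two (earliest, latest) estimates of the same friending event.

    fr_ab: interval estimate for when A friended B.
    fo_ba: interval estimate for when B was followed by A.
    Since these describe the same instant, both can be tightened to
    (max of the lower bounds, min of the upper bounds).
    """
    lower = max(fr_ab[0], fo_ba[0])
    upper = min(fr_ab[1], fo_ba[1])
    if lower > upper:
        raise ValueError("inconsistent interval estimates")
    return (lower, upper)

print(tighten((1, 10), (3, 12)))  # -> (3, 10)
```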
If this thought experiment does work out, then several rules of thumb jump out if we want to maximise our chances of generating reasonably accurate friend and follower acquisition curves:
- set up your MOOC Twitter account close to the time you want to start using it so its creation date is as late as possible;
- encourage folk to follow the MOOC account, and follow back, to improve the chances of getting reasonable resolution in the follower acquisition curve for the MOOC account. These connections also provide time-estimated probes into follower acquisition curves of friends and friend acquisition curves of followers;
- consider creating new “fake” timestamp Twitter accounts that can, immediately on creation, follow and be friended by the MOOC account to place temporal markers into the acquisition curves;
- if followers follow other celebrity accounts (or are followed (back) by them), we should be able to generate timestamp samples by analysing the celebrity account acquisition curves.
I think I need to go and walk the dog again.
PS a couple more trivial fixed points: for a target account, the earliest time at which they were first followed or when they first friended another account is the creation date of the target account; the latest possible time they acquired their most recent friend or follower is the time at which the data was collected.
Evaluating Event Impact Through Social Media Follower Histories, With Possible Relevance to cMOOC Learning Analytics
Last year I sat on a couple of panels organised by I’m a Scientist’s Shane McCracken at various science communication conferences. A couple of days ago, I noticed Shane had popped up a post asking Who are you Twitter?, a quick review of a social media mapping exercise carried out on the followers of the @imascientist Twitter account.
Using the technique described in Estimated Follower Accession Charts for Twitter, we can estimate a follower acquisition growth curve for the @imascientist Twitter account:
I’ve already noted how we may be able to use “spikes” in follower acquisition rates to identify news events that involved the owner of a particular Twitter account and caused a surge in follower numbers as a result (What Happened Then? Using Approximated Twitter Follower Accession to Identify Political Events).
Thinking back to the context of evaluating the impact of events that include social media as part of the overall campaign, it struck me that whilst running a particular event may not lead to a huge surge in follower numbers on the day of the event or in the immediate aftermath, the followers who do sign up over that period might have signed up as a result of the event. And now we have the first inklings of a post hoc analysis tool that lets us try to identify these people, and perhaps look to see if their profiles are different to profiles of followers who signed up at different times (maybe reflecting the audience interest profile of folk who attended a particular event, or reflecting sign ups from a particular geographical area?)
In other words, through generating the follower acquisition curve, can we use it to filter down to folk who started following around a particular time in order to then see whether there is a possibility that they started following as a result of a particular event, and if so whether they can count as some sort of “conversion”? (I appreciate that there are a lot of caveats in there!;-)
A similar approach may also be relevant in the context of analysing link building around historical online community events, such as MOOCs… If we know somebody took a particular MOOC at a particular time, might we be able to construct their follower acquisition curve and then analyse it around the time of the MOOC, looking to see if the connections built over that period are different to the user’s other followers, and as such may represent links developed as a result of taking the MOOC? Analysing the timelines of the respective parties may further reveal conversational dynamics between those parties, and as such allow us to see whether a fruitful social learning relationship developed out of contact made in the MOOC?
Arguably primed by the open courseware and open learning initiatives that started a decade or so ago (precursor: OSTP), several notable MOOC platforms (Coursera, Udacity) provide a one stop supermarket for 20-100 hour large cohort, online “course experiences” offered by traditional universities. Using a blend of readings and video lectures, courses provide pacing through a predetermined syllabus on a particular topic, with social tools to allow students to discuss elements of the course with each other. To provide feedback on progress, computer marking systems and scalable “peer assessment” offer a means of letting a student know how well they are doing on the course.
At least, I think that’s how it works. I don’t really know. Though I’ve signed up for several MOOCs, I’ve never actually started any of them, or tried to work through them.
Maybe that’s because I tend to learn from resources discovered on the web in response to questions I’ve formulated myself (questions which often derive from reading other resources and either being confused by them or not being able to make sense of them!). But just see how far a web search for visualisation site:coursera.org gets you. As I believe Patrick McAndrew suggested. (Hmmm… I appear to have hit my article limit… Here’s Patrick’s OER13 keynote which I think led to the article.)
I’m not sure who, within the universities that have signed up to delivering courses on the various MOOC platforms, is deciding which courses to run with, but I suspect the marketing department has a hand in it.
Marketing departments also used to run university web presences, too, didn’t they?
Way back when, when I was still at school, I used to watch tech documentaries – I remember seeing a window based GUI for the first time on an old episode of Horizon – and OU programmes, amongst other things… If you’re over a certain age, you’ll remember them:
Things have moved on since then, of course. OU/BBC programmes now are of a far more mainstream flavour (for example, clips from recent OU/BBC co-pros currently on iPlayer). Add to that, a wide variety of online videos, such as the 60 second animation series (such as 60 Second Adventures in Thought, or 60 Second Adventures in Astronomy, or even 60 Second Adventures in Economics), or clips from OU produced videos sent to students as part of their course materials (this sort of thing, for example: A Cyborg Future?, The Buildings of Ancient Rome or Environmental Policy in an International Context.)
As “tasters”, the OU course programmes that appeared in the dead parts of the schedule on BBC2 introduced me to a particular form of discourse, somewhere between a lecture and a tutorial, at a particular level of understanding (higher education academic). The programmes were created as embedded parts of a distance education course, complementing readings and exercises, and though I find it hard to remember back, I think that came across. That the programmes were a glimpse into the wider, and deeper, exploration of a particular topic provided by the course of which they were a part. I don’t really recall.
So whilst the OU course programmes were part of a course, they were not the whole of it. They were windows into a course, rather than a microcosm of one. I get the impression that the MOOCs are intended in some way to be “complete”, albeit short, courses, that are maybe intended to act as tasters of more comprehensive, formal offerings. I don’t really know.
My introduction to the OU, then, aged ten, or thereabouts, so 35 years or so ago, was through these glimpses into courses that other people were studying. They were opportunities for me to walk into a lecture to see what it was like. The programmes were not intended for me, but I could partake of them. That is more of a “passively discoverable OER” model than a MOOC. Maybe. I don’t really know.
I wonder now, if now was then, how I would have come to discover the world of “academic” communications. Through Google, presumably. Through the web. Through the marketing department? Or through the academics, (but discovered how?).
I guess we could argue that MOOCs represent, in part, higher education marketing departments waking up to the fact that the web exists, that it is a contentful medium, in part at least, and that the universities have content that may attract eyeballs. Maybe. I don’t really know.
If the marketing departments are leading the MOOC campaigns, I wonder what sort of return they expect? Raising “brand awareness”? Being seen to be there/having a presence on a platform because other universities have? Generating leads and building up mailing lists? Online courses as “promotional items” (whatever that means)? Edutainment offerings?! I don’t really know.
Going back to the OU programmes on BBC2, the primary audience then were presumably students on a course, because the programmes were part of the course. Partially open courses. Courses being run for real that also had a window open onto them, so that other people could see what sorts of thing were covered by those courses, and maybe learn a little along the way.
This is more in keeping with the model of online course delivery being pursued by Jim Groom’s ds106 or Martin Weller’s H817 module on “Open Education”. I think. I don’t really know.
Other models are possible, of course. The “cMOOC” – connectionist/connectivist MOOC – idea explores a different pedagogy. The xMOOC offerings of Coursera and Udacity wrap (not open) courseware in a delivery platform and run scheduled cohorts. The original OU OpenLearn offering had the platform (Moodle), had the open content, but didn’t have the community that comes from marshalling learners into a scheduled offering. Or the hype. Or at least, not the right sort of hype (the hype that follows VC investment, where VC does not refer to Vice Chancellor). The cMOOC idea tries to be open as to curriculum, in part – a more fluid learning environment where loose co-ordination amongst participants encourages an element of personal research into a topic and sharing of the lessons learned. A pedagogy that seeks to foster independent learning in the sense of being able to independently research a topic, rather than independently pay to follow a prescribed path through a set of learning materials. In the xMOOC, a prescribed path that propagates a myth of there being one true way to learn a subject. One true path.
My own open course experiment was an uncourse. Tasked with writing a course on a subject about which I knew nothing, I sought to capture my own learning path through it, using that trail to inform the design of a supposedly more polished offering. The traces are still there, still open – Digital Worlds – Interactive Media and Game Design. The pages still get hit, the resources still get downloaded. I could – should – do a little more to make evident the pathways through the content.
Whilst the reverse chronological form of a blog made sense as I was discovering a trail through the subject area – new content was revealed to any others following the uncourse in a sensible order – looking back at the material now, the journeys through each topic area start at the end, presenting anyone wishing to follow the path I took with an apparent problem. Though not really… If you select a Category on the Digital Worlds blog, and add ?order=asc – as for example http://digitalworlds.wordpress.com/category/animation/?order=asc – the posts will be presented in chronological order. I wonder if there is a switch – or a plugin – that can make views of posts within a particular tag or category on WordPress automatically display in chronological order? I don’t really know. This would provide one way of transforming a platform configured as a “live presentation” site into one that worked better as a legacy site. It’s not hard to imagine a Janus theme that would provide these two faces of a site, one in reverse chronological order for live delivery, the other in chronological order for folk wishing to follow the steps taken by a previous journeyman in the same order as they were originally taken.
I still don’t know what forces are at play that may result in MOOC-hype and whatever shakes out as a result transforming, if at all, higher education as we know it and as developing countries may yet come to know it. I really don’t know.
And I still don’t have a good feeling for how we can make most effective use of the web to support a knowledge driven society; how best we can make use of online content resources and social communication tools to help people to develop their own personal, and deeper, understanding about whatever topic, to help them make sense of whatever they need to make sense of; how best schools and universities can draw on the web to help people develop lifelong learning skills; what it means to use the web in furtherance of lifelong learning.
I really, really, don’t know.
As well as serendipity, I believe in confluence…
A headline in the Press Gazette declares that Trinity Mirror will be roll[ing] out five templates across 130-plus regional newspapers as emphasis moves to digital. Apparently, this follows a similar initiative by Johnston Press midway through last year: Johnston to roll out five templates for network of titles.
It seems that “key” to the Trinity Mirror initiative is the creation of a new “Shared Content Unit” based in Liverpool that will provide features content to Trinity’s papers across the UK [which] will produce material across the regional portfolio in print and online including travel, fashion, food, films, books and “other content areas that do not require a wholly local flavour”.
In my local rag last week, (the Isle of Wight County Press), a front page story on the Island’s gambling habit localised a national report by the Campaign for Fairer Gambling on Fixed Odds Betting Terminals. The report included a dataset (“To find the stats for your area download the spreadsheet here and click on the arrow in column E to search for your MP”) that I’m guessing (I haven’t checked…) provided some of the numerical facts in the story. (The Guardian Datastore also republished the data (£5bn gambled on Britain’s poorest high streets: see the data) with an additional column relating to “claimant count”, presumably the number of unemployment benefit claimants in each area (again, I haven’t checked…)) Localisation appeared in several senses:
So for example, the number of local betting shops and Fixed Odds betting terminals was identified, the mooted spend across those and the spend per head of population. Sensemaking of the figures was also applied by relating the spend to an equivalent number of NHS procedures or police jobs. (Things like the BBC Dimensions How Big Really provide one way of coming up with equivalent or corresponding quantities, at least in geographical area terms. (There is also a “How Many Really” visualisation for comparing populations.) Any other services out there like this? Maybe it’s possible to craft Wolfram Alpha queries to do this?)
Something else I spotted, via RBloggers: a post by Alex Singleton of the University of Liverpool on an Open Atlas around the 2011 Census for England and Wales. Alex has “been busy writing (and then running – around 4 days!) a set of R code that would map every Key Statistics variable for all local authority districts”. The result is a set of PDF docs for each Local Authority district mapping out each indicator. As well as publishing the separate PDFs, Alex has made the code available.
So what’s confluential about those?
The IWCP article localises the Fairer Gambling data in several ways:
- the extent of the “problem” in the local area, in terms of numbers of betting shops and terminals;
- a consideration of what the spend equates to on a per capita basis (the report might also have used the population of over-18s to work out the average “per adult islander”); note that there are also at least a couple of significant problems with calculating per capita averages in this example: first, the Island is a holiday destination, and the population swings over the summer months; secondly, do holidaymakers spend differently to residents on these machines?
- a corresponding quantity explanation that recasts the numbers into an equivalent spend on matters with relevant local interest.
The Census Atlas takes one recipe and uses it to create localised reports for each LA district. (I’m guessing with a quick tweak, separate reports could be generated for the different areas within a single Local Authority.)
Trinity Mirror’s “Shared Content Unit” will produce content “that do[es] not require a wholly local flavour”, presumably syndicating it to its relevant outlets. But it’s not hard to also imagine a “Localisable Content” unit that develops applications that can help produce localised variants of “templated” stories produced centrally. This needn’t be quite as automated as the line taken by computational story generation outfits such as Narrative Science (for example, Can the Computers at Narrative Science Replace Paid Writers? or Can an Algorithm Write a Better News Story Than a Human Reporter?) but instead could produce a story outline or shell that can be localised.
A shorter term approach might be to centrally produce data driven applications that can be used to generate charts, for example, relevant to a locale in an appropriate style. So for example, using my current tool of choice for generating charts, R, we could generate something centrally and then allow local press to grab data relevant to them and generate a chart in an appropriate style (for example, Style your R charts like the Economist, Tableau … or XKCD). This approach saves duplication of effort in getting the data, cleaning it, building basic analysis and chart tools around it, and so on, whilst allowing for local customisation in the data views presented. Add to that the increasing number of workflows available around R (for example, RPubs, knitr, github, and a new phase for the lab notebook, Create elegant, interactive presentations from R with Slidify, or [Wordpress] Bloggin’ from R).
Using R frameworks such as Shiny, we can quickly build applications such as my example NHS Winter Sitrep data viewer (about) that explores how users may be able to generate chart reports at Trust or Strategic Health Authority level, and (if required) download data sets related to those areas alone for further analysis. The data is scraped and cleaned once, “centrally”, and common analyses and charts coded once, “centrally”, and can then be used to generate items at a local level.
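The underlying pattern – clean once centrally, chart locally against a parameter – might look something like the following sketch. It’s written in Python/pandas purely for illustration (the examples above use R and Shiny), and the file and column names are made up:

```python
import pandas as pd
import matplotlib.pyplot as plt

def load_national_data(path="fobt_by_constituency.csv"):
    # The central, do-once step: load/clean the national dataset and derive
    # the common measures every local outlet is likely to want.
    df = pd.read_csv(path)
    df["spend_per_head"] = df["estimated_spend"] / df["population"]
    return df

def local_chart(df, constituency, outfile=None):
    # The local, parameterised step: chart one area against the national
    # distribution, in whatever house style is required.
    ax = df["spend_per_head"].plot.hist(bins=30, alpha=0.5)
    local_value = df.loc[df["constituency"] == constituency, "spend_per_head"].iloc[0]
    ax.axvline(local_value, color="red", label=constituency)
    ax.set_xlabel("Estimated spend per head (£)")
    ax.set_title(f"{constituency} against all constituencies")
    ax.legend()
    if outfile:
        plt.savefig(outfile, bbox_inches="tight")
    return ax

# e.g. df = load_national_data(); local_chart(df, "Isle of Wight", "iw_spend.png")
```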
The next step would be to create scripted story templates that allow journalists to pull in charts and data as required, and then add local colour – quotes from local representatives, corresponding quantities that are somehow meaningful. (I should try to build an example app from the Fairer Gambling data, maybe, and pick up on the Guardian idea of also adding in additional columns… again, something where the work can be done centrally, looking for meaningful datasets and combining them with the original data set.)
Business opportunities also arise outside media groups. For example, a similar service idea could be used to provide story templates – and pull-down local data – to hyperlocal blogs. Or a ‘data journalism wire service’ could develop applications to aid in the creation of data supported stories on a particular topic. PR companies could do a similar thing (for example, appifying the Fairer Gambling data as I “appified” the NHS Winter sitrep data, maybe adding in data such as the actual location of fixed odds betting terminals. (On my to do list is packaging up the recently announced UCAS 2013 entries data.)).
The insight here is not to produce interactive data apps (aka “news applications”) for “readers” who have no idea how to use them or how to read from them whatever stories they might tell; rather, it is the production of interactive applications for generating charts and data views that can be used by a “data” journalist. Rather than having a local journalist working with a local team of developers and designers to get a data flavoured story out, a central team produces a single application that local journalists can use to create a localised version of a particular story – one that has local meaning but is produced at national scale.
Note that by concentrating specialisms in a central team, there may also be the opportunity to then start exploring the algorithmic annotation of local data records. It is worth noting that Narrative Science are already engaged in this sort of activity too, as for example described in this ProPublica article on How To Edit 52,000 Stories at Once, a news application that includes “short narrative descriptions of almost all of the more than 52,000 schools in our database, generated algorithmically by Narrative Science”.
PS Hmm… I wonder… is there time to get a proposal together on this sort of idea for the Carnegie Trust Neighbourhood News Competition? Get in touch if you’re interested…
A handful of open Linked Data related posts have appeared through my feeds in the last couple of days, including (via RBloggers) SPARQL with R in less than 5 minutes, which shows how to query US data.gov Linked Data, and then Leigh Dodds’ Brief Review of the Land Registry Linked Data.
I was going to post a couple of examples merging those two posts – showing how to access Land Registry data via Leigh’s example queries in R, then plotting some of the results using ggplot2 – but another post of Leigh’s today – SPARQL-doc – a simple convention for documenting individual SPARQL queries – has sparked another thought…
For some time I’ve been intrigued by the idea of a marketplace in queries over public datasets, as well as the public sharing of generally useful queries. A good query is like a good gold pan, or a good interview question – it can get a dataset to reveal something valuable that may otherwise have lain hidden. Coming up with a good query in part requires having a good understanding of the structure of a dataset, in part having an eye for what sorts of secret the data may contain: the next step is crafting a well phrased query that can tease that secret out. Creating the query might take some time, some effort, and some degree of expertise in query optimisation to make it actually runnable in reasonable time (which is why I figure there may be a market for such things*) but once written, the query is there. And if it can be appropriately parameterised, it may generalise.
(*There are actually a couple of models I can think of: 1) I keep the query secret, but run it and give you the results; 2) I license the “query source code” to you and let you run it yourself. Hmm, I wonder: do folk license queries they share? How, and to what extent, might derived queries/query modifications be accommodated in such a licensing scheme?)
Pondering Leigh’s SPARQL-doc post, along with: another post via R-bloggers, Building a package in RStudio is actually very easy (which describes how to package a set of R files for distribution via github); asdfree (analyze survey data for free), a site that “announces obsessively-detailed instructions to analyze us government survey data with free tools” (and which includes R bundles to get you started quickly…); the resource listing Documentation for package ‘datasets’ version 2.15.2, which describes a bundled package of datasets for R; and the Linked Data API, which sought to provide a simple RESTful API over SPARQL endpoints, I wondered the following:
How about developing and sharing commented query libraries around Linked Data endpoints that could be used in arbitrary Linked Data clients?
(By “Linked Data clients”, I mean different user agent contexts. So for example, calling a query from Python, or R, or Google Spreadsheets.) That’s it… Simple.
One approach (the simplest?) might be to put each separate query into a separate file, with a filename that could be used to spawn a function name that could be used to call that query. Putting all the queries into a directory and zipping them up would provide a minimal packaging format. An additional manifest file might minimally document the filename along with the parameters that can be passed into and returned from the query. Helper libraries in arbitrary languages would open the query package and “compile” a programme library/set of “API” calling functions for that language (so for example, in R it would create a set of R functions, in Python a set of Python functions).
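As a very rough sketch of that “compile a query bundle into calling functions” idea: the file layout, the manifest format, the %%name%% placeholder convention and the use of the SPARQLWrapper client are all assumptions on my part, not an existing format.

```python
import json
import zipfile
from SPARQLWrapper import SPARQLWrapper, JSON

def load_query_bundle(path):
    """Open a zipped bundle of .rq query files plus a manifest.json and return
    a dict of calling functions, one per query, keyed on the query file name."""
    api = {}
    with zipfile.ZipFile(path) as bundle:
        manifest = json.loads(bundle.read("manifest.json"))
        for entry in manifest["queries"]:
            query_text = bundle.read(entry["file"]).decode("utf-8")

            def call(endpoint=manifest["endpoint"], _q=query_text, **params):
                # naive parameterisation: swap %%name%% placeholders for values
                q = _q
                for name, value in params.items():
                    q = q.replace("%%" + name + "%%", str(value))
                sparql = SPARQLWrapper(endpoint)
                sparql.setQuery(q)
                sparql.setReturnFormat(JSON)
                return sparql.query().convert()

            api[entry["file"].rsplit(".", 1)[0]] = call
    return api

# e.g. api = load_query_bundle("landregistry_queries.zip")
#      results = api["average_price_by_region"](region="south-east")
```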
(This reminds me of a Twitter exchange with Nick Jackson/@jacksonj04 a couple of days ago around “self-assembling” API programme libraries that could be compiled in an arbitrary language from a JSON API, cf. Swagger (presentation), which I haven’t had time to look at yet.)
The idea, then, is this:
- Define a simple file format for declaring documented SPARQL queries
- Define a simple packaging format for bundling separate SPARQL queries
- The simply packaged set of queries defines a simple “raw query” API over a Linked Data dataset
- Describe a simple protocol for creating programming language specific library wrappers around the API defined by the query bundle package.
So.. I guess two questions arise: 1) would this be useful? 2) how hard could it be?
[See also: @ldodds again, on Publishing SPARQL queries and-documentation using github]
I’ve been pondering what it is to be an engineer, lately, in the context of trying to work out what it is that I actually do and what sort of “contract” I feel I’m honouring (and with whom) by doing whatever it is that I spend my days doing…
According to Wikipedia, [t]he term engineering … deriv[es] from the word engineer, which itself dates back to 1325, when an engine’er (literally, one who operates an engine) originally referred to “a constructor of military engines.” … The word “engine” itself is of even older origin, ultimately deriving from the Latin ingenium (c. 1250).
Via Wiktionary, [e]ngine originally meant ‘ingenuity, cunning’ which eventually developed into meaning ‘the product of ingenuity, a plot or snare’ and ‘tool, weapon’. Engines as the products of cunning, then, and hence, naturally, war machines. And engineers as their operators, or constructors.
One of the formative books in my life (mid-teens, I think) was Richard Gregory’s Mind in Science, from which I took away the idea of tools as things that embodied and executed an idea. You see a way of doing something or how to do something, and then put that idea into an artefact – a tool – that does it. Code is a particularly expressive medium in this respect, AI (in the sense of Artificial Intelligence) one way of explicitly trying to give machines ideas, or embody mind in machine. (I have an AI background – my PhD in evolutionary computation was pursued in a cognitive science unit (HRCL, as was) at the OU; what led me to “AI”, I think, was a school of thought relating to the practice of how to use code to embody mind and natural process in machines, as well as how to use code that can act on, and be acted on by, the physical world.)
So part of what I (think I) do is build tools, executable expressions of ideas. I’m not so interested in how they are used. I’ve also started sketching maps a lot, lately, of social networks and other things that can be represented as graphs. These are tools too – macroscopes for peering at structural relationships within a system – and again, once produced, I’m not so interested in how they’re used. (What excites me is finding the process that allows the idea to be represented or executed.)
If we go back to the idea of “engineer”, and dig a little deeper by tracing the notion of ingenium, we find this take on it:
ingenium is the original and natural faculty of humans; it is the proper faculty with which we achieve certain knowledge. It is original because it is the first “facility” young people untouched by prejudices exemplify upon seeing similarities between disparate things. It is natural because it is to us what the power to create is to God. just as God easily begets a world of nature, so we ingeniously make discoveries in the sciences and artifacts in the arts. Ingenium is a productive and creative form of knowledge. It is poietic in the creation of the imagination; it is rhetorical in the creation of language, through which all sciences are formalized. Hence, it requires its own logic, a logic that combines both the art of finding or inventing arguments and that of judging them. Vico argues that topical art allows the mind to locate the object of knowledge and to see it in all its aspects and not through “the dark glass” of clear and distinct ideas. The logic of discovery and invention which Vico uses against Descartes’s analytics is the art of apprehending the true. With this Vico come full circle in his arguments against Descartes. [From the Introduction by L.M. Palmer to Vico on Ingenium, in Giambattista Vico: On the Most Ancient Wisdom of the Italians. Trans. L.M. Palmer. London: Cornell University Press, 1988. 31-34, 96-104. Originally published 1710.]
And for some reason, at first reading, that brings me peace…
…which I shall savour on a quick dog walk. I wonder if the woodpecker will be out in the woods today?
[aka A Note on Noticing...] Ever since I can remember, information discovery has fascinated me (a year after graduating, just before the web appeared, I started exploring with a university friend how we might rival subscription providers such as Dialog with tooling built around Archie, gopher, Veronica and so on that could provide one stop information destinations in vertical content areas… And then I landed a postgrad position… And soon after saw Mosaic for the first time…). The intelligence part of the stack – making stories on top of information maps, I guess (paraphrasing Fragments of Amber: Map Makers and Story Tellers) – was always the next step, the more obviously value-add step, the more obviously saleable step, the step that could directly support decision making and as such initiate actions that could directly drive income or savings. Or thwart terrorist plots. Or whatever.
Anyway, via @mhawksey, I see this tweet: “Curating Big Data in the Cloud rww.to/H8aVzb via @sheilmcn“
The ReadWriteWeb story describes a company called Flow that allows users to construct their own Techmeme like content trackers. The story – and I think something of the sentiment behind Martin’s tweet – immediately put me in mind of* Strategy Eye, a platform I came across yesterday, that bills itself as a “[c]loud platform enabling media partners to rapidly launch B2B intelligence portals and drive subscription or ad revenues”. In short, a platform that (in part at least) seems to let you build vertical content aggregators such as Wind Power Intelligence (“Windpower Intelligence monitors key activities in the global wind industry including pipeline wind farms, deals, contracts, investments and policies. Clients access our content by subscribing to our Investment Reports and real-time Tracker.”)
[* that is, the things I've noticed in the last few hours, days, weeks, months, years, sensitise me to the things I notice today as I try to link in today's noticings with things I've noticed before...]
I came across Strategy Eye and Windpower Intelligence via this post: Media Pioneer: Windpower Intelligence, which opens as follows: “Windpower Intelligence is a paid-for digital tracker service launched in September 2010. The tracker is an experimental new platform developed by Haymarket with StrategyEye. The aim was to create a high value information service that wove together journalism, data and analytical tools in an intuitive system delivering customised information and data. In under a year, this service has generated a six-figure sum in subscription revenue, earning a place on the 2012 Media Pioneers shortlist.”
As to how I found the InPublishing article? That was another tweet, this time from @paulbradshaw: “Another example of a magazine with a business model centred on data: Windpower Intelligence is a paid-for digita… http://bit.ly/HLXrJo”
The “business” keyword in Paul’s post (along with the “six figure sum” reference in the article he linked to), together with the attention grabbing “imagine ££££” from Martin’s post, also helped sensitise me to the spot-it-or-don’t natural pairing of Flow and Strategy Eye along the “(new?) ways of businesses making money from data” dimension. And as every follower of Dirk Gently knows, you have to be sensitive to coincidence and the fundamental interconnectedness of things…
So why this post? Because this little example is just a note to self about my “practice”, that captures part of it well. The chance noticing of two vaguely related things that together help crystallise out a clearer idea about what’s going on in the world. In this case, flow based tracking verticals. Which isn’t really a new idea at all… Although that isn’t to say it won’t be a huge market, albeit just an iteration of a current one.
PS on the sort of related ‘not quite futurism’ front, see this recent post from John Battelle: If-Then and Antiquities of the Future (which also put me in mind of @briankelly’s The History Of The Web Backwards).
Secretary of State for Education, Michael Gove, gave a speech yesterday on rethinking the ICT curriculum in UK schools. You can read a copy of the speech variously on the Department for Education website, or, err, on the Guardian website.
Seeing these two copies of what is apparently the same speech, I started wondering:
a) which is the “best” source to reference?
b) how come the Guardian doesn’t add a disclaimer about the provenance of, and link, to the DfE version? [Note the disclaimer in the DfE version - "Please note that the text below may not always reflect the exact words used by the speaker."]
c) is the Guardian version an actual transcript, maybe? That is, does the Guardian reprint the “exact words” used by the speaker?
And that made me think I should do a diff… About which, more below…
Before that, however, here’s a quick piece of reflection on how these two things – the reinvention of the IT curriculum, and the provenance of, and value added to, content published on news and tech industry blog sites – collide in my mind…
So for example, I’ve been pondering what the role of journalism is, lately, in part because I’m trying to clarify in my own mind what I think the practice and role of data journalism are (maybe I should apply for a Nieman-Berkman Fellowship in Journalism Innovation to work on this properly?!). It seems to me that “communication” is one important part (raising awareness of particular issues, events, or decisions), and holding governments and companies to account is another. (Actually, I think Paul Bradshaw has called me out on that before, suggesting it was more to do with providing an evidence base, through verification and triangulation, as well as comment, against which governments and companies could be held to account. Err, I think? As an unjournalist, I don’t have notes or a verbatim quote against which to check that statement, and I’m too lazy to email/DM/phone Paul to clarify what he may or may not have said… The extent of my checking is typically limited to what I can find on the web or in personal archives… which appear to be lacking on this point…)
Another thing I’ve been mulling over recently in a couple of contexts relates to the notion of what are variously referred to as digital or information skills.
The first context is “data journalism”, and the extent to which data journalists need to be able to do programming (in the sense of identifying the steps in a process that can be automated and how they should be sequenced or organised) versus writing code. (I can’t write code for toffee, but I can read it well enough to copy, paste and change bits that other people have written. That is, I can appropriate and reuse other people’s code, but can’t write it from scratch very well… Partly because I can’t ever remember the syntax and low level function names. I can also use tools such as Yahoo Pipes and Google Refine to do coding like things…) Then there’s the question of what to call things like URL hacking or (search engine) query building?
The second context is geeky computer techie stuff in schools, the sort of thing covered by Michael Gove’s speech at the BETT show on the national ICT curriculum (or lack thereof), which the educational digerati were all over on Twitter yesterday. Over the weekend, houseclearing my way through various “archives”, I came across all manner of press clippings from 2000-2005 or so about the activities of the OU Robotics Outreach Group, of which I was a co-founder (the web presence has only recently been shut down, in part because of the retirement of the sys admin on whose server the websites resided.) This group ran an annual open meeting every November for several years hosting talks from the educational robotics community in the UK (from primary school to HE level). The group also co-ordinated the RoboCup Junior competition in the UK, ran outreach events, developed various support materials and activities for use with Lego Mindstorms, and led the EPSRC/AHRC Creative Robotics Research Network.
At every robotics event, we’d try to involve kids and/or adults in elements of problem solving, mechanical design, programming (not really coding…) based around some sort of themed challenge: a robot fashion show, for example, or a treasure hunt (both variants on edge following/line following;-) Or a robot rescue mission, as used in a day long activity in the “Engineering: An Active Introduction” (TXR120) OU residential school, or the 3 hour “Robot Theme Park” team building activity in the Masters level “Team Engineering” (T885) weekend school. [If you're interested, we may be able to take bookings to run these events at your institution. We can make them work at a variety of difficulty levels from KS3-4 and up;-)]
Given that working at the bits-atoms interface is where a lot of the not-purely-theoretical-or-hardcore-engineering innovation and application development is likely to take place over the next few years, any mandate to drop the “boring” Windows training ICT stuff in favour of programming (which I suspect can be taught in not only a really tedious way, but a really confusing and badly delivered way too) is probably Not the Best Plan.
Slightly better, and something that I know is currently being mooted for reigniting interest in computing, is the Raspberry Pi, a cheap, self-contained, programmable computer on a board (good for British industry, just like the BBC Micro was…;-) that allows you to work at the interface between the real world of atoms and the virtual world of bits that exists inside the computer. (See also things like the OU Senseboard, as used on the OU course “My Digital Life” (TU100).)
If schools were actually being encouraged to make a financial investment on a par with the level of investment around the introduction of the BBC Micro, back in the day, I’d suggest a 3D printer would have more of the wow factor… (I’ll doodle more on the rationale behind this in another post…) The financial climate may not allow for that (but I bet budget will manage to get spent anyway…) but whatever the case, I think Gove needs to be wary about consigning kids to lessons of coding hell. And maybe take a look at programming in a wider creative context, such as robotics (the word “robotics” is one of the reasons why I think it’s seen as a very specialised, niche subject; we need a better phrase, such as “Creative Technologies”, which could combine elements of robotics, games programming, photoshop, and, yes, Powerpoint too…). Hmm… thinks… the OU has a couple of courses that have just come to the end of their life that between them provide a couple of hundred hours of content and activity on robotics (T184) and games programming (T151), and that we delivered, in part, to 6th formers under the OU’s Young Applicants in Schools Scheme.
Anyway, that’s all as maybe… Because there are plenty of digital skills that let you do coding like things without having to write code. Such as finding out whether there are any differences between the text in the DfE copy of Gove’s BETT speech, and the Guardian copy.
Copy the text from each page into a separate text file, and save it. (You’ll need a text editor for that…) Then find yourself a text editor with a file comparison (“diff”) feature, if yours doesn’t have one. I use Text Wrangler on a Mac. (Actually, I think MS Word may have a diff function?)
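Alternatively, a few lines of code will do the comparison for you. Here’s one hedged sketch using Python’s difflib; the file names are just whatever you saved the two copies as:

```python
import difflib
import re

def normalise(text):
    # map typographic ("curly") quotes to plain ones so that encoding-level
    # differences don't mask real differences in wording
    text = re.sub(r"[\u2018\u2019]", "'", text)
    text = re.sub(r"[\u201c\u201d]", '"', text)
    return text

with open("gove_dfe.txt", encoding="utf-8") as f:
    dfe = normalise(f.read()).splitlines()
with open("gove_guardian.txt", encoding="utf-8") as f:
    guardian = normalise(f.read()).splitlines()

# print only the lines that differ between the two copies of the speech
for line in difflib.unified_diff(dfe, guardian, fromfile="DfE", tofile="Guardian", lineterm=""):
    print(line)
```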
The differences all tend to be in the characters used for quotation marks. (Character encodings are one of the things that can make all sorts of programmes fall over, or misbehave; just being aware that they may cause a problem, as well as how and why, would be a great step in improving the baseline level of understanding of folk IT.) Some of the line breaks don’t quite match up either, but other than that, the text is the same.
Now, this may be because Gove was a good little minister and read out the words exactly as they had been prepared. Or it may be the case that the Guardian just reprinted the speech without mentioning provenance, or the disclaimer that he may not actually have read the words of that speech (I have vague memories of an episode of Yes, Minister, here…;-)
Whatever the case, if you know: a) that it’s even possible to compare two documents to see if they are different (a handy piece of folk IT knowledge); and b) know a tool that does it (or how to find a tool that does it, or a person that may have a tool that can do it), then you can compare the texts for yourself. And along the way, maybe learn that churnalism, in a variety of forms, is endemic in the media. Or maybe just demonstrate to yourself when the media is acting in a purely comms, rather than journalistic, role?
PS other phrases in the area: “computational thinking”. Hear, for example: A conversation with Jeannette Wing about computational thinking
PPS I just remembered – there’s a data journalism hook around this story too… from a tweet exchange last night that I was reminded of by an RT:
josiefraser: RT @grmcall: Of the 28,000 new teachers last year in the UK, 3 had a computer-related degree. Not 3000, just 3.
dlivingstone: @josiefraser Source??? Not found it yet. RT @grmcall: 28000 new UK teachers last year, 3 had a computer-related degree. Not 3000, just 3
josiefraser: That ICT qualification teacher stat RT @grmcall: Source is the Guardian http://www.guardian.co.uk/education/2012/jan/09/computer-studies-in-schools
I did a little digging and found the following document on the General Teaching Council of England website – Annual digest of statistics 2010–11 – Profiles of registered teachers in England [PDF] – that contains demographic stats, amongst others, for UK teachers. But no stats relating to subject areas of degree level qualifications held, which is presumably the data referred to in the tweet. So I’m thinking: this is partly where the role of data journalist comes in… They may not be able to verify the numbers by checking independent sources, but they may be able to shed some light on where the numbers came from and how they were arrived at, and maybe even secure their release (albeit as a single point source?)
Way back when, when I first started blogging, I tried to push the idea of “live documents” that supported transclusion of content from elsewhere (e.g. Keeping Courses Current with Live Links; there was also a demo, but I think it’s rotted…?) A couple of days ago, Owen Stephens (re)introduced me to the notion of literate programming, “a methodology that combines a programming language with a documentation language”. The context was active reading of reactive documents, in which a reader interacts with a document that contains human readable paragraphs that describe some sort of mathematical or logical model which is embedded in the text as interactive, parameterised elements. (I can’t give a demo in this WordPress.com hosted blog because what I am allowed to do is really locked down… so to see what I’m talking about, check out the Explorable Explanations example document… I can wait…)
Done that? Up to speed now?
My immediate impression was that it reminded me of the interactive, browser based programming style (e.g. Online Apps for Live Code Tutorials/Demos), in which learners can read and run, edit and run, and write and run, code examples in the browser (or more generally, in the context of an “electronic study guide” (eSG)). It also brought to mind similarities with dexy.it and Sweave, a couple of (literate programming;-) frameworks that allow you to include programme code within a document and then execute it in order to produce an output that also appears in the document. (I remember one of the joys of course writing for an eSG is that you often have to hand over the text (including code and output in situ), a text file containing the code (for testing), and a text file containing the output. If (when) an error is found, version control across the various files can become really problematic. Far easier if the document were to include code fragments that are then executed and used to produce the actual output that is in turn piped directly into the final document.) Wolfram’s Computable Document Format also comes to mind, as a document format that allows a reader to express executable mathematical statements, whether formally specified or, increasingly, using natural language.
So the document space I’m imagining here is one in which the document contains one or more components that are generated in response to some sort of request from an operational part of the document, or a part of the document that encodes some sort of performative action[?????], such as a search term that is used to trigger a search whose results are then included within the page, a piece of programme code that can be executed in order to generate an output, or a parameter for a model that can be run with the specified parameter value in order to produce an output that is rendered live within the document.
For example, this might include a ‘live’ document that transcludes content from an external source.
A literate programme, that combines:
- some explanatory text;
- fragments of, or complete, programmes;
- the output of the programme.
Or a reactive document which contains:
- some explanatory text;
- parameterised programme code, or a parameterised mathematical or logical model; the code/model should also be executable, using parameter values specified by the reader;
- the output from executing the code or model.
(I guess a live document might be viewed as reactive in certain cases, for example, when a user specifies a search term or query that determines what content is pulled live into a document from an external source.)
There is something almost cell like going on here, in that part of the document contains the instructions that some document machinery can process in order to produce other parts of the document…
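To make that “document machinery” idea slightly more concrete, here’s a toy sketch: it looks for embedded code fragments between made-up [code]…[/code] markers in a plain text document, runs each one, and splices the printed output back in after it. (The markers and the document format are assumptions for illustration only, not any real platform’s convention.)

```python
import contextlib
import io
import re

# Toy convention: executable fragments sit between [code] and [/code] markers.
BLOCK = re.compile(r"\[code\]\n(.*?)\[/code\]", re.S)

def process(doc_text):
    """Return a copy of doc_text with each embedded code fragment followed by
    the output it printed when executed."""
    parts, pos = [], 0
    for match in BLOCK.finditer(doc_text):
        parts.append(doc_text[pos:match.end()])
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            exec(match.group(1), {})  # run the embedded fragment
        parts.append("\nOutput:\n" + buffer.getvalue())
        pos = match.end()
    parts.append(doc_text[pos:])
    return "".join(parts)

doc = "The answer is computed below.\n[code]\nprint(6 * 7)\n[/code]\nEnd."
print(process(doc))
```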
One obvious use case for living documents is in educational materials. For a long time now (even before the time of education CD-ROMs;-), eLearning materials have included interactive components. But these have often been external components that have been slotted into the educational text, rather than being generated from the execution of a specified part of the text. For example, many OU course materials include interactive self-assessment questions, or Flash based interactive exercises (hmm… I wonder when these are going to be rebranded as edu-apps and made available, for a fee, or via open license, in an OU edu-app market;-) [Note: the OU used to be a pretty significant educational software house in terms of output, with large numbers of highly skilled educational software developers who knew how to turn out software that worked in educational terms... but that was before the VLE came along...;-)]
Another use case is the area of data journalism. A criticism of many interactive visualisations produced to support news stories is that whilst they’re all very nice and shiny, they don’t actually work that well to communicate anything of substance at all (for example, see my comments on Michael Blastland’s talk at the OU Stats conference). Maybe a few well crafted reactive documents might start to redress this balance, and engage at least part of the audience in a contextualised consideration of a data (or model) based story…?
A third area I’d like to spend some time mulling over (maybe even in the context of Public Platforms…?) is policy development and public consultation, scoping out what may be possible and plausible if consultation documents were to propose particular models and then allow the engaged reader to explore the various parameter regimes associated with those models?
Hmmm…. maybe I need to start working on my resolutions for next year…?!
PS just in passing, as well as treating documents as living things, it can also be instructive to think of them as databases. This is a trivial mapping if the document has a regular tabular structure, such as a spreadsheet sheet, or is otherwise formally structured (as for example in the case of an XML document, which typically describes some sort of hierarchical (document as data) structure), or even if it contains conventions in either style or content (for example, section headings being phrased in the form “Section NN: blah blah blah”; “Section NN: ” is a convention that can be used to identify the semantics of the text “blah blah blah” – in this case, as the text representing the header of section NN).
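A trivial sketch of that last, convention-based case (the “Section NN:” pattern is just the hypothetical convention mentioned above, and the sample text is made up):

```python
import re

# Hypothetical convention: section headings written as "Section NN: title".
HEADING = re.compile(r"^Section (\d+):\s*(.+)$", re.M)

text = """Section 1: Follower accession charts
...
Section 2: Interval estimates
..."""

# Treating the document as a queryable structure: heading number -> heading text
sections = {int(num): title for num, title in HEADING.findall(text)}
print(sections)   # -> {1: 'Follower accession charts', 2: 'Interval estimates'}
```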
I notice that there’s a couple of days left for institutions to get £10k from JISC in order to look at what it would take to start publishing course data via XCRI feeds, with another £40-80k each for up to 80 institutions to do something about it (JISC Grant Funding 8/11: JISC ‘Course Data: Making the most of Course Information’ Capital Programme – Call for Letters of Commitment; see also Immediate Impressions on JISC’s “Course Data: Making the most of Course Information” Funding Call, as well as the associated comments):
As funding for higher education is reduced and the cost to individuals rises, we see a move towards a consumer-led market for education and increased student expectations. One of the key themes about delivering a better student experience discussed in the recent Whitepaper mentions improving the information available to prospective students.
Nowadays, information about a college or university is more likely found via a laptop than in a prospectus. In this competitive climate publicising courses while embracing new technologies is ever more important for institutions.
JISC have made it easier for prospective students to decide which course to study by creating an internationally recognised data standard for course information, known as XCRI-CAP. This will make transferring and advertising information about courses between institutions and organisations more efficient and effective.
The focus of this new programme is to enable institutions to publish electronic prospectus information in a standard format for all types of courses, especially online, distance, part time, post graduate and continuing professional development. This standard data could then be shared with many different aggregator agencies (such as UCAS, the National Learning Directory, 14-19 Prospectus websites, or new services yet to be developed) to collect and share with prospective students.
All well and good, but:
- there still won’t be a single, centralised directory of UK courses, the sort of thing that can be used to scaffold other services. I know it isn’t perfect, but UCAS has some sort of directory of UK undergrad HE courses that can be applied for via central clearing, but it’s not available as open data.
- the universities are being offered £10k each to explore how they can start to make more of their course data. There seems to be the expectation that some good will follow, and aggregation services will flower around this data (“This standard data could then be shared with many different aggregator agencies (such as … new services yet to be developed)”). I think they might too. (For example, we’re already starting to see sites like Which University? provide shiny front ends to HESA and NSS data.) But why should these aggregation sites have to wait for the universities to scope out, plan, commission, delay and then maybe or maybe not deliver open XCRI feeds? (Hmm, I wonder: does the JISC money place any requirements on universities making their XCRI-CAP feeds available under an open license that allows commercial reuse?)
When we cobbled together the Course Detective search engine, we exploited Google’s index of UK HE websites to provide a search engine that provides a customised search over the course prospectus webpages on UK HE websites. Being a Google Custom Search Engine there’s only so much we can do with it, but whilst we wait for all the UK HEIs to get round to publishing course marketing feeds, it’s a start.
Of course, if we had our own index, we could offer a more refined search service, with all sorts of potential enhancements and enrichment. Which is where copyright kicks in…
…because course catalogue webpages are generally copyright the host institution, and not published under an open license that allows for commercial reuse.
(I’m not sure how the law stands against general indexing for web search purposes vs indexing only a limited domain (such as course catalogue pages on UK HEI websites) vs scraping pages from a limited domain (such as course catalogue pages on UK HEI websites) in order to create a structured search engine over UK HE course pages. But I suspect the latter two cases breach copyright in ways that are harder to argue your way out of than a “we index everything we can find, regardless” web search engine. (I’m not sure how domain limited Google CSEs figure either? Or folk who run searches with the site: limit?))
To kickstart the “so what could we do with a UK wide aggregation of course data?”, I wonder whether UK HEIs who are going to pick up the £10k from JISC’s table might also consider doing the following:
- licensing their course catalogue web pages with an open license that allows commercial reuse (no one really understands what non-commercial means… and the aim would be to build sustainable services that help people find courses in a fair (open algorithmic) way that they might want to take…)
- publishing a sitemap/site feed that makes it clear where the course catalogue content lives (as a starter for 10, we have the Course Detective CSE definition file [XML]). That way, the sites could retain some element of control over which parts of the site good citizen scrapers could crawl. (I guess a robots.txt file might also be used to express this sort of policy?)
The license would allow third parties to start scraping and indexing course catalogue content, develop normalised forms of that data, and start working on discovery services around that data. A major aim of such sites would presumably be to support course discovery by potential students and their families, and ultimately drive traffic back to the university websites, or on to the UCAS website. Such sites, once established, would also provide a natural sink for XCRI-CAP feeds as and when they are published (although I suspect JISC would also like to be able to run a pilot project looking at developing an aggregator service around XCRI-CAP feeds as well;-) In addition, the sites might well identify additional – pragmatic – requirements on other sorts of data that might contribute to intermediary course discovery and course comparison sites.
It’s already looking as if the KIS – Key Information Set – data that will supposedly support course choice won’t be as open as it might otherwise be (e.g. Immediate Thoughts on the “Provision of information about higher education”); it would be a shame if the universities themselves also sought to limit the discoverability of their courses via cross-sector course discovery sites…