Mixing Stuff Up

Remember mashups? Five years or so ago they were all the rage. At their heart, they provided ways of combining things that already existed to do new things. This is a lazy approach, and one I favour.

One of the key inspirations for me in this idea of combinatorial tech, or tech combinatorics, is Jon Udell. His Library Lookup project blew me away in its creativity (the use of bookmarklets, the way the project encouraged you to access one IT service from another, the use of “linked data”, common/core-canonical identifiers to bridge services and leverage or enrich one from another, and so on) and was the spark that fired many of my own doodlings. (Just thinking about it again excites me now…)

As Jon wrote on his blog yesterday (Shiny old tech) (my emphasis):

What does worry me, a bit, is the recent public conversation about ageism in tech. I’m 20 years past the point at which Vinod Khosla would have me fade into the sunset. And I think differently about innovation than Silicon Valley does. I don’t think we lack new ideas. I think we lack creative recombination of proven tech, and the execution and follow-through required to surface its latent value.

Elm City is one example of that. Another is my current project, Thali, Yaron Goland’s bid to create the peer-to-peer web that I’ve long envisioned. Thali is not a new idea. It is a creative recombination of proven tech: Couchbase, mutual SSL authentication, Tor hidden services. To make Thali possible, Yaron is making solid contributions to Thali’s open source foundations. Though younger than me, he is beyond Vinod Khosla’s sell-by date. But he is innovating in a profoundly important way.

Can we draw a clearer distinction between innovation and novelty?

Creative recombination.

I often think of this in terms of appropriation (eg Appropriating Technology, Appropriating IT: innovative uses of emerging technologies or Appropriating IT: Glue Steps).

Or repurposing, a form of reuse that differs from the intended original use.

Openness helps here. Open technologies allow users to innovate without permission. Open licensing is just part of that open technology jigsaw; open standards another; open access and accessibility a third. Open interfaces accessed sideways. And so on.

Looking back over archived blog posts from five, six, seven years ago, the web used to be such fun. An open playground, full of opportunities for creative recombination. Now we have Facebook, where authenticated APIs give you access to local social neighbourhoods, but little more. Now we have Google using link redirection and link pollution at every opportunity. Services once open are closed according to economic imperatives (and maybe scaling issues; maybe some creative recombinations are too costly to support when a network scales). Maybe my memory of a time when the web was more open is a false memory?

Creative recombination, ftw.

PS just spotted this (Walking on custard), via @plymuni. If you don’t see why it’s relevant, you probably don’t get the sense of this post!

Visualising Pandas DataFrames With IPythonBlocks – Proof of Concept

A few weeks ago I came across IPythonBlocks, a Python library developed to support the teaching of Python programming. The library provides an HTML grid that can be manipulated using simple programming constructs, presenting the outcome of the operations in a visually meaningful way.

As part of a new third level OU course we’re putting together on databases and data wrangling, I’ve been getting to grips with the python pandas library. This library provides a dataframe based framework for data analysis and data-styled programming that bears a significant resemblance to R’s notion of dataframes and vectorised computing. pandas also provides a range of dataframe based operations that resemble SQL style operations – joining tables, for example, and performing grouping style summary operations.

One of the things we’re quite keen to do as a course team is identify visually appealing ways of illustrating a variety of data manipulating operations; so I wondered whether we might be able to use ipythonblocks as a basis for visualising – and debugging – pandas dataframe operations.

I’ve posted a demo IPython notebook here: ipythonblocks/pandas proof of concept [nbviewer preview]. In it, I’ve started to sketch out some simple functions for visualising pandas dataframes using ipythonblocks blocks.

For example, the following minimal function finds the size and shape of a pandas dataframe and uses it to configure a simple block:

from ipythonblocks import BlockGrid

def pBlockGrid(df):
    # A dataframe's .shape is (rows, columns); BlockGrid takes (width, height)
    (y, x) = df.shape
    return BlockGrid(x, y)

We can also colour individual blocks – the following example uses colour to reveal the different datatypes of columns within a dataframe:

[Image: ipythonblocks pandas type colour]
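By way of illustration, here’s a minimal sketch of how such a datatype view might be put together (this is not the demo notebook’s actual code – the function name, colour values and dtype mapping are my own assumptions):

import pandas as pd
from ipythonblocks import BlockGrid

def pBlockTypeGrid(df, colorTypes=None, color_default=(150, 150, 150)):
    # Colour each column of the grid according to the dtype of the
    # corresponding dataframe column (dtype -> RGB mapping is assumed)
    if colorTypes is None:
        colorTypes = {'int64': (50, 100, 200),
                      'float64': (50, 200, 100),
                      'object': (200, 100, 50)}
    (y, x) = df.shape
    grid = BlockGrid(x, y)
    for c, dtype in enumerate(df.dtypes):
        colour = colorTypes.get(str(dtype), color_default)
        for r in range(y):
            grid[r, c] = colour
    return grid

# Example usage
pBlockTypeGrid(pd.DataFrame({'a': [1, 2], 'b': [0.5, 1.5], 'c': ['x', 'y']}))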

A more elaborate function attempts to visualise the outcome of merging two data frames:

[Image: ipythonblocks pandas demo]

The green colour identifies key columns, the red and blue cells data elements from the left and right joined dataframes respectively, and the black cells NA/NaN cells.
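For what it’s worth, here’s a rough sketch of the sort of logic involved (again, not the notebook’s actual code; the function name and colour values are placeholders, and overlapping non-key column names that get suffixed by the merge aren’t handled):

import pandas as pd
from ipythonblocks import BlockGrid

def pBlockMergeGrid(left, right, on,
                    color_key=(0, 200, 0), color_left=(200, 50, 50),
                    color_right=(50, 50, 200), color_NA=(0, 0, 0)):
    # Outer-merge the two frames, then colour: key columns green, columns from
    # the left frame red, columns from the right frame blue, NA/NaN cells black
    merged = pd.merge(left, right, on=on, how='outer')
    key_cols = set([on]) if isinstance(on, str) else set(on)
    left_cols = set(left.columns) - key_cols
    isna = merged.isnull().values
    (y, x) = merged.shape
    grid = BlockGrid(x, y)
    for c, col in enumerate(merged.columns):
        for r in range(y):
            if isna[r, c]:
                grid[r, c] = color_NA
            elif col in key_cols:
                grid[r, c] = color_key
            elif col in left_cols:
                grid[r, c] = color_left
            else:
                grid[r, c] = color_right
    return grid, merged

# Example usage
left = pd.DataFrame({'k': [1, 2, 3], 'a': ['p', 'q', 'r']})
right = pd.DataFrame({'k': [2, 3, 4], 'b': ['x', 'y', 'z']})
grid, merged = pBlockMergeGrid(left, right, on='k')
grid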

One thing I started wondering about that I have to admit quite excited me (?!;-) was whether it would be possible to extend the pandas dataframe itself with methods for producing ipythonblocks visual representations of the state of a dataframe, or the effect of dataframe based operations such as .concat() and .merge() on source dataframes.

If you have any comments on this approach, suggestions for additional or alternative ways of visualising dataframe transformations, or thoughts about how to extend pandas dataframes with ipythonblocks style visualisations of those datastructures and/or the operations that can be applied to them, please let me know via the comments:-)

PS some thoughts on a possible pandas interface:

  • DataFrame().blocks() to show the blocks
  • .cat(blocks=True) and .merge(blocks=True) to return (df, blocks)
  • DataFrame().blocks(blockProperties={}) and eg .merge(blocks=True, blockProperties={})
  • blockProperties: showNA=True|False, color_base=(), color_NA=(), color_left=(), color_right=(), color_gradient=[] (eg for a .cat() on many dataframes), colorView=structure|datatypes|missing (where colorView reveals the datatypes of the columns, the structural origins of cells returned from a .merge() or .cat(), or a view of missing data, ie revealing NA/NaN etc over a base color), colorTypes={} (to set the colors for different datatypes)
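As a quick feasibility check on the “extend the dataframe itself” idea, methods can be bolted onto pandas DataFrames at runtime, so a .blocks() method along the lines sketched above could be prototyped by monkey-patching (the method name and blockProperties argument come from my wishlist, not from any existing pandas or ipythonblocks API):

import pandas as pd
from ipythonblocks import BlockGrid

def _blocks(self, blockProperties=None):
    # Minimal placeholder behaviour: a grid matching the dataframe's shape;
    # blockProperties could eventually select the structure/datatypes/missing views
    (y, x) = self.shape
    return BlockGrid(x, y)

# Monkey-patch the method onto DataFrame
pd.DataFrame.blocks = _blocks

pd.DataFrame({'a': [1, 2], 'b': [3, 4]}).blocks()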

So Is This Guerrilla Research?

A couple of days ago I delivered a workshop with Martin Weller on the topic of “Guerrilla Research”.

[Image: guerrilla research workshop slides (PDF)]

The session was run under the #elesig banner, and was the result of an invitation to work through the germ of an idea that was a blog post Martin had published in October 2013, The Art Of Guerrilla Research.

In that post, Martin had posted a short list of what he saw as “guerrilla research” characteristics:

  1. It can be done by one or two researchers and does not require a team
  2. It relies on existing open data, information and tools
  3. It is fairly quick to realise
  4. It is often disseminated via blogs and social media

Looking at these principles now, as in, right now, as I type (I don’t know what I’m going to write…), I don’t necessarily see any of these as defining, at least, not without clarification. Let’s reflect, and see how my fingers transcribe my inner voice…

In the first case, a source crowd or network may play a role in the activity, so maybe it’s the initiation of the activity that only requires one or two people?

Open data, information and tools help, but I’d gear this more towards pre-existing data, information and tools, rather than necessarily open ones: if you work inside an organisation, you may be able to appropriate resources that are not open or available outside the organisation, and that may even have limited access within it; you may even have to “steal” access to them. Open resources do mean that other people can engage in the same activity using the same resources, though, which provides transparency and reproducibility; open resources also make inside/outside activities possible.

The activity may be quick to realise, sort of: I can quickly set a scraper going to collect data about X, and the analysis of the data may be quick to realise; but I may need the scraper to run for days, or weeks, or months. More qualifying, I think, is that the activity only requires a relatively small number of relatively quick bursts of effort.

Online means of dissemination are natural, because they’re “free”, immediate, and have potentially wide reach; but I think an email to someone who can act on it, or a letter to the local press, or an activity that is its own publication, such as a submission to a consultation in which the responses are all published, could count too.

Maybe I should have looked at those principles a little more closely before the workshop…;-) And maybe I should have made reference to them in my presentation. Martin did, in his.

PS WordPress just “related” this back to me, from June, 2009: Guerrilla Education: Teaching and Learning at the Speed of News

Time to Drop Calculators in Favour of Notebook Programming?

With the UK national curriculum for schools set to include a healthy dose of programming from September 2014 (Statutory guidance – National curriculum in England: computing programmes of study) I’m wondering what the diff will be on the school day (what gets dropped if computing is forced in?) and who’ll be teaching it?

A few years ago I spent way too much time engaged in robotics related school outreach activities. One of the driving ideas was that we could use practical and creative robotics as a hands-on platform in a variety of curriculum contexts: maths and robotics, for example, or science and robotics. We also ran some robot fashion shows – I particularly remember a two(?) day event at the Quay Arts Centre on the Isle of Wight where a couple of dozen or so kids put on a fashion show with tabletop robots – building and programming the robots, designing fashion dolls to sit on them, choosing the music, doing the lights, videoing the show, and then running the show itself in front of a live audience. Brilliant.

On the science side, we ran an extended intervention with the Pompey Study Centre, a study centre attached to the Portsmouth Football Club, that explored scientific principles in the context of robot football. As part of the ‘fitness training’ programme for the robot footballers, the kids had to run scientific experiments as they calibrated and configured their robots.

The robot platform – mechanical design, writing control programmes, working with sensors, understanding interactions with the real world, dealing with uncertainty – provided a medium for creative problem solving that could provide a context for, or be contextualised by, the academic principles being taught from a range of curriculum areas. The emphasis was very much on learning by doing, using an authentic problem solving context to motivate the learning of principles in order to be able to solve problems better or more easily. The idea was that kids should be able to see what the point was, and rehearse the ideas, strategies and techniques of informed problem solving inside the classroom that they might then be able to draw on outside the classroom, or in other classrooms. Needless to say, we were disrespectful of curriculum boundaries and felt free to draw on other curriculum areas when working within a particular curriculum area.

In many respects, robotics provides a great container for teaching pragmatic and practical computing. But robot kit is still pricey and if not used across curriculum areas can be hard for schools to afford. There are also issues of teacher skilling, and the set-up and tear-down time required when working with robot kits across several different classes over the same school day or week.

So how is the new computing curriculum teaching going to be delivered? One approach that I think could have promise, if kids are expected to use text based programming languages (which they are required to do at KS3), is to use a notebook style programming environment. The first notebook style environment I came across was Mathematica, though expensive license fees mean I’ve never really used it (Using a Notebook Interface).

More recently, I’ve started playing with IPython Notebooks (“ipynb”; for example, Doodling With IPython Notebooks for Education).

(Start at 2 minutes 16 seconds in – I’m not sure that WordPress embeds respect the time anchor I set. Yet another piece of hosted WordPress crapness.)

For a history of IPython Notebooks, see The IPython notebook: a historical retrospective.

Whilst these can be used for teaching programming, they can also be used for doing simple arithmetic, calculator style, as well as simple graph plotting. If we’re going to teach kids to use calculators, then maybe:

1) we should be teaching them to use “found calculators”, such as on their phone, via the Google search box, in those two-dimensional programming surfaces we call spreadsheets, using tools such as WolframAlpha, etc;

2) maybe we shouldn’t be teaching them to use calculators at all? Maybe instead we should be teaching them to use “programmatic calculations”, as for example in Mathematica, or IPython Notebooks?
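For example, a couple of IPython Notebook cells already give you a calculator and a grapher in one place (a minimal sketch; the numbers are arbitrary):

# Calculator-style arithmetic in one cell...
(3 * 4.5 + 2) / 7

# ...and quick graph plotting in another
%matplotlib inline
import matplotlib.pyplot as plt
plt.plot([1, 2, 3, 4, 5], [1, 4, 9, 16, 25])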

Maths is a tool and a language, and notebook environments, or other forms of (inter)active, executable worksheets that can be constructed and/or annotated by learners, experimented with, and whose exercises can be repeated, provide a great environment for exploring how to use and work with that language. They’re also great for learning how the automated execution of mathematical statements can allow you to do mathematical work far more easily than you can by hand. (This is something I think we often miss when teaching kids the mechanics of maths – they never get a chance to execute powerful mathematical ideas with computational tool support. One argument against using tools is that kids don’t learn to spot when a result a calculator gives is nonsense if they don’t also learn the mechanics by hand. I don’t think many people are that great at estimating numbers, even across orders of magnitude, with the maths they have learned to do by hand, so I don’t really rate that argument!)

Maybe it’s because I’m looking for signs of uptake of notebook ideas, or maybe it’s because it’s an emerging thing, but I noticed another example of notebook working again today, courtesy of @alexbilbie: reports written over Neo4J graph databases submitted to the Neo4j graph gist winter challenge. The GraphGist how to guide looks like they’re using a port of, or extensions to, an IPython Notebook, though I’ve not checked…

Note that IPython notebooks have access to the shell, so other languages can be used within them if appropriate support is provided. For example, we can use R code in the IPython notebook context.
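A minimal sketch of what that looks like in practice, assuming the rpy2 package is installed (the extension is called rmagic in older IPython releases and rpy2.ipython in more recent ones):

# Load the R magic extension (on older installs: %load_ext rmagic)
%load_ext rpy2.ipython

# A cell that starts with %%R is then executed as R code
%%R
x <- c(1, 2, 3, 4, 5)
summary(x)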

Note that interactive, computational and data analysis notebooks are also starting to gain traction in certain areas of research under the moniker “reproducible research”. An example I came across just the other day was The Dataverse Network Project, and an R package that provides an interface to it: dvn – Sharing Reproducible Research from R.

In much the same way that I used to teach programming as a technique for working with robots, we can also teach programming in the context of data analysis. A major issue here is how we get data into and out of a programming environment in a seamless way. Increasingly, data sources hosted online are presenting APIs (programmable interfaces) with wrappers that provide a nice interface to a particular programming language. This makes it easy to use a function call in the programming language to pull data into the programme context. Working with data, particularly when it comes to charting data, provides another authentic hook between maths and programming. Using them together allows us to present each as a tool that works with the other, helping answer the question “but why are we learning this?” with the response “so now you can do this, see this, work with this, find this out”, etc. (I appreciate this is quite a utilitarian view of the value of knowledge…)

But how far can we go in terms of using “raw”, but very powerful, computational tools in school? The other day, I saw this preview of the Wolfram Language:

There is likely to be a cost barrier to using this language, but I wonder: why shouldn’t we use this style of language, or at least the notebook style of computing, in KS3 and 4? What are the barriers (aside from licensing cost and machine access) to using such a medium for teaching computing in context (in maths, in science, in geography, etc)?

Programming puritans might say that notebook style computing isn’t real programming… (I’m not sure why, but I could imagine they might… erm… anyone fancy arguing that line in the comments?!:-) But so what? We don’t want to teach everyone to be a programmer, but we do maybe want to help them realise what sorts of computational levers there are, even if they don’t become computational mechanics?

Data Textualisation – Making Human Readable Sense of Data

A picture may be worth a thousand words, but whilst many of us may get a pre-attentive gut reaction reading from a data set visualised using a chart type we’re familiar with, how many of us actually take the time to read a chart thoroughly and maybe verbalise, even if only to ourselves, what the marks on the chart mean, and how they relate to each other? (See How fertility rates affect population for an example of how to read a particular sort of chart.)

An idea that I’m finding increasingly attractive is the notion of text visualisation (or text visualization for the US-English imperialistic searchbots). That is, the generation of mechanical text from data tables so we can read words that describe the numbers – and how they relate – rather than looking at pictures of them or trying to make sense of the table itself.

Here’s a quick example of the sort of thing I mean – the generation of this piece of text:

The total number of people claiming Job Seeker’s Allowance (JSA) on the Isle of Wight in October was 2781, up 94 from 2687 in September, 2013, and down 377 from 3158 in October, 2012.

from a data table that can be sliced like this:

[Image: slicing nomis JSA figures]

In the same way that we make narrative decisions when it comes to choosing what to put into a data visualisation, as well as how to read it (and how the various elements displayed in it relate to each other), so we make choices about the textual, or narrative, mapping from the data set to the text version (that is, the data textualisation) of it. When we present a chart or data table to a reader, we can try to influence their reading of it in a variety of ways: by choosing the sort order of bars on a bar chart, or rows in a table, for example; or by highlighting one or more elements in a chart or table through the use of colour, font, transparency, and so on.

The actual reading of the chart or table is still largely under the control of the reader, however, and may be thought of as non-linear in the sense that the author of the chart or table can’t really control the order in which the various attributes of the table or chart, or relationships between the various elements, are encountered by the reader. In a linear text, however, the author retains a far more significant degree of control over the exposition, and the way it is presented to the reader.

There is thus a considerable amount of editorial judgement put into the mapping from a data table to text interpretations of the data contained within a particular row, or down a column, or from some combination thereof. The selection of the data points and how the relationships between them are expressed in the sentences formed around them directs attention in terms of how to read the data in a very literal way.

There may also be a certain amount of algorithmic analysis used along the way as sentences are constructed from looking at the relationships between different data elements (“up 94” is a representation (in the sense both of rep-resentation and re-presentation) of a month on month change of +94; “down 377” is generated mechanically from a year on year comparison).

Every cell in a table may be a fact that can be reported, but there are many more stories to be told by comparing how different data elements in a table stand in relation to each other.

The area of geekery related to this style of computing is known as NLG – natural language generation – but I’ve not found any useful code libraries (in R or Python, preferably…) for messing around with it. (The JSA example above was generated using R as a proof of concept around generating monthly press releases from ONS/nomis job figures.)
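To give a flavour of the sort of mechanical mapping involved, here’s a minimal sketch in Python (the original proof of concept was in R and isn’t reproduced here; the function names are made up, and the figures are just the JSA numbers quoted above):

def change_phrase(delta):
    # Render a signed change as "up N" / "down N"
    return ('up ' if delta >= 0 else 'down ') + str(abs(delta))

def jsa_sentence(area, month, current, prev_month, prev_year):
    mom = current - prev_month  # month-on-month change
    yoy = current - prev_year   # year-on-year change
    return ("The total number of people claiming Job Seeker's Allowance (JSA) "
            "on {a} in {m} was {c}, {mm} from {pm} the previous month, "
            "and {yy} from {py} in {m} the previous year.").format(
                a=area, m=month, c=current, mm=change_phrase(mom), pm=prev_month,
                yy=change_phrase(yoy), py=prev_year)

print(jsa_sentence('the Isle of Wight', 'October', 2781, 2687, 3158))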

PS why “data textualisation”, when we can consider even graphical devices as “texts” to be read? I considered “data characterisation” in the sense of turning data into characters, but characterisation is a more general term. Data narration was another possibility, but those crazy Americans patenting everything that moves might think I was “stealing” ideas from Narrative Science. Narrative Science (as well as Data2Text and Automated Insights etc. (who else should I mention?)) are certainly interesting but I have no idea how any of them do what they do. And in terms of narrating data stories – I think that’s a higher level process than the mechanical textualisation I want to start with. Which is not to say I don’t also have a few ideas about how to weave a bit of analysis into the textualisation process too…

MOOC Busting: Personal Googalytics…

Reading Game Analytics: Maximizing the Value of Player Data earlier this morning (which I suggest might be a handy read if you’re embarking on a learning analytics project…), I was struck by the mention of “player dossiers”. A Game Studies article from 2011 by Ben Medler – Player Dossiers: Analyzing Gameplay Data as a Reward – describes them as follows:

Recording player gameplay data has become a prevalent feature in many games and platform systems. Players are now able to track their achievements, analyze their past gameplay behaviour and share their data with their gaming friends. A common system that gives players these abilities is known as a player dossier, a data-driven reporting tool comprised of a player’s gameplay data. Player dossiers presents a player’s past gameplay by using statistical and visualization methods while offering ways for players to connect to one another using online social networking features.

Which is to say – you can grab your own performance and achievement data and then play with it, maybe in part to help you game the game.

The Game Analytics book also mentioned the availability of third party services built on top of game APIs that let third parties build analytics tools for users that are not otherwise supported by the game publishers.

Hmmm…

What I started to wonder was – are there any services out there that allow you to aggregate dossier material from different games to provide a more rounded picture of your performance as a gamer, or maybe services that homologate dossiers from different games to give overall rankings?

In the learning analytics space, this might correspond to getting your data back from a MOOC provider, for example, and giving it to a third party to analyse. As a user of a MOOC platform, I doubt that you’ll be allowed to see much of the raw data that’s being collected about you; I’m also wary that institutions that sign up to MOOC platforms will also get screwed by the platform providers when it comes to asking for copies of the data. (I suggest folk signing their institutions up to MOOC platforms talk to their library colleagues, and ask how easy it is for them to get data (metadata, transaction data, usage data etc etc) out of the library system vendors, and what sort of contracts got them into the mess they may admit to being in.)

(By the by, again the Game Analytics book made a useful distinction – that of viewing folk as customers (i.e. people you can eventually get money from), or as players of the game (or maybe, in MOOC land, learners). Whilst you may think of yourself as a player (learner), what they really want to do is develop you as a customer. In this respect, I think one of the great benefits of the arrival of MOOCs is that it allows us to see just how we can “monetise” education and lets us talk freely and, erm, openly, in cold hard terms about the revenue potential of these things, and how they can be used as part of a money making/sales venture, without having to pretend to talk about educational benefits, which we’d probably feel obliged to do if we were talking about universities. Just like game publishers create product (games) to make money, MOOCspace is about businesses making money from education. (If it isn’t, why is venture capital interested?))

Anyway, all that’s all by the by, not just the by the by bit: this was just supposed to be a quick post, rather than a rant, about how we might do a little bit to open up part of the learning analytics data collection process to the community. (The technique generalises to other sectors…) The idea is built on appropriating a technology that many website publishers use to collect data, the third party service that is Google Analytics (eg from 2012, 88% of Universities UK members use Google Analytics on their public websites). I’m not sure how many universities use Google Analytics to track VLE activity though? Or how many MOOC operators use Google Analytics to track activity on course related pages? But if there are some, I think we can grab that data and pop it into a communal data pool; or grab that data into our own Google Account.

So how might we do that?

Almost seven years ago now – SEVEN YEARS! – in a post entitled They Stole OUr Learning Environment – Now We’re Stealing It Back, I described a recipe for customising a VLE (virtual learning environment – the thing that MOOC operators are reimagining and will presumably start (re)selling back to educational institutions as “Cloud based solutions”) by injecting a panel that allowed you to add your own widgets from third party providers. The technique relied on a browser extension that allowed you to write your own custom javascript programmes that would be injected into the page just before it finished loading. In short, it used an extension that essentially allowed you to create your own additional extensions within it. It was an easy way of writing browser extensions.

That’s all a rather roundabout way of saying we can quite easily write extensions that change the behaviour of a web page. (Hmm… can we do this for mobile devices?) So what I propose – though I don’t have time to try it and test it right now (the rant used up the spare time I had!) – is an extension that simply replaces the Google Analytics tracking code with another tracking code:

– either a “common” one, that pools data from multiple individuals into the same Google Analytics account;
– or a “personal” one, that lets you collect all the data that the course provider was using Google Analytics to collect about you.

(Ideally the rewrite would take place before the tracking script is loaded? Or we’d have to reload the script with the new code if the rewrite happens too late? I’m not sure how the injection/replacement of the original tracking code with the new one actually takes place when the extension loads?)

Another “advantage” of this approach is that you hijack the Google Analytics data so it doesn’t get sent to the account of the person whose site you’re visiting. (Google Analytics docs suggest that using multiple tracking codes is “not supported”, though this doesn’t mean it can’t be done if you wanted to just overload the data collection – i.e. let the publisher collect the data to their account, and you just grab a copy of it too…)

(An alternative, cruder, approach might be to create an extension that purges Google Analytics code within a page, and then inject your own Google Analytics scripts/code. This would have the downside of not incorporating the instrumentation that the original page publisher added to the page. Hmm.. seems I looked at this way back when too… Collecting Third Party Website Statistics (like Yahoo’s) with Google Analytics.)

All good fun, eh? And for folk operating cMOOCs, maybe this represents a way of tracking user activity across multiple sites (though to mollify ethical considerations, tracking/analytics code should probably only be injected onto whitelisted course related domains, or users presented with a “track my activity on this site” button…?)

A Few More Thoughts on the Forensic Analysis of Twitter Friend and Follower Timelines in a MOOCalytics Context

Immediately after posting Evaluating Event Impact Through Social Media Follower Histories, With Possible Relevance to cMOOC Learning Analytics, I took the dog out for a walk to ponder the practicalities of constructing follower (or friend) acquisition charts for accounts with only a low number of followers, or friends, as might be the case for folk taking a MOOC or who have attended a particular event. One aim I had in mind was to probe the extent to which a MOOC may help develop social ties between folk taking it, whether MOOC participants know each other prior to taking the MOOC, or whether they come to develop social links after taking the MOOC. Another aim was simply to see whether we could identify, from changes in the velocity or makeup of follower acquisition curves, whether particular events led either to growth in follower numbers or to community development between followers.

To recap on the approach used for constructing follower acquisition charts (as described in Estimated Follower Accession Charts for Twitter, and which also works (in principle!) for plotting when Twitter users started following folk):

  • you can’t start following someone on Twitter until you join Twitter;
  • follower lists on Twitter are reverse chronological statements of the order in which folk started following the corresponding account;
  • starting with the first follower of an account (the bottom end of the follower list), we can estimate when each follower started following the account as the most recent account creation date seen so far amongst that follower and the people who started following before them.
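In code, the estimation step amounts to something like this (a minimal sketch, assuming we already have the follower list as (user, account-created-at) pairs in the reverse chronological order Twitter returns them in):

from datetime import datetime

def estimate_follow_times(followers_newest_first):
    # followers_newest_first: list of (user, account_created_at) tuples,
    # most recent follower first, as returned by the Twitter followers list.
    # Returns (user, estimated_follow_time) tuples, oldest follower first; the
    # estimate is the latest account creation date seen so far walking up the list.
    estimates = []
    latest_creation = None
    for user, created_at in reversed(followers_newest_first):
        if latest_creation is None or created_at > latest_creation:
            latest_creation = created_at
        estimates.append((user, latest_creation))
    return estimates

# Toy example (dates made up)
followers = [('c', datetime(2013, 12, 1)),   # most recent follower
             ('b', datetime(2012, 6, 1)),
             ('a', datetime(2013, 3, 1))]    # first follower
print(estimate_follow_times(followers))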

A methodological problem arises when we have a low number of followers, because we don’t necessarily have enough newly created (follower) accounts starting to follow a target account soon after the creation of the follower account to give us a solid basis for estimating when folk started following the target account. (If someone creates a new account and then immediately uses it to follow a target account, we get a good sample in time relating to when that follower started following the target account… If you have lots of people following an account, there’s more of a chance that some of them will be quick-after-creation to start following the target account.)

There may also be methodological problems with trying to run an analysis over a short period of time (too much noise/lack of temporal definition in the follower acquisition curve over a limited time range).

So with low follower numbers, where can we get our timestamps from?

In the context of a MOOC, let’s suppose that there is a central MOOC account with lots of followers, and those followers don’t have many friends or followers (certainly not enough for us to be able to generate smooth – and reliable – acquisition curves).

If the MOOC account has lots of followers, let’s suppose we can generate a reasonable follower acquisition curve from them.

This means that for each follower, fo_i, we can associate with them a time when they started following the MOOC account, fo_i_t. Let’s write that as fo(MOOC, fo_i)=fo_i_t, where fo(MOOC, fo_i) reads “the estimated time when MOOC is followed by fo_i”.

(I’m making this up as I’m going along…;)

If we look at the friends of fo_i (that is, the people they follow), we know that they started following the MOOC account at time fo_i_t. So let’s write that as fr(fo_i, MOOC)=fo_i_t, where fr(fo_i, MOOC) reads “the estimated time when fo_i friends MOOC”.

Since public friend/follower relationships are symmetrical on Twitter (if A friends B, then B is at that instant followed by A), we can also write fr(fo_i, MOOC) = fo(MOOC, fo_i), which is to say that the time when fo_i friends MOOC is the same time as when MOOC is followed by fo_i.

Got that?!;-) (I’m still making this up as I’m going along…!)

We now have a sample in time for calibrating at least a single point in the friend acquisition chart for fo_i. If fo_i follows other “celebrity” accounts for which we can generate reasonably sound follower acquisition charts, we should be able to add other timestamp estimates into the friend acquisition timeline.

If fo_i follows three accounts A,B,C in that order, with fr(fo_i,A)=t1 and fr(fo_i,C)=t2, we know that fr(fo_i,B) lies somewhere between t1 and t2, where t1 < t2, let’s call that [t1,t2], reading it as [not earlier than t1, not later than t2]. Which is to say, fr(fo_i,B)=[t1,t2], or “fo_i makes friends with B not before t1 and not after t2”, or more simply “fo_i makes friends with B somewhen between t1 and t2”.

Let’s now look at fo_j, who has only a few followers, one of whom is fo_i. Suppose that fo_j is actually account B. We know that fo_j is followed by fo_i, and furthermore that fo(fo_j,fo_i)=fr(fo_i,fo_j). Since we know that fr(fo_i,B)=[t1,t2], and B=fo_j, we know that fr(fo_i,fo_j)=[t1,t2]. (Just swap the symbols in and out of the equations…) But what we now also have is a timestamp estimate into the followers list for fo_j, that is: fo(fo_j,fo_i)=[t1,t2].

If MOOC has lots of friends, as well as lots of followers, and MOOC has a policy of following back followers immediately, we can use it to generate timestamp probes into the friend timelines of its followers, via fo(MOOC,X)=fr(X,MOOC), and its friends, via fr(MOOC,Y)=fo(Y,MOOC). (We should be able to use other accounts with large friend or follower accounts and reasonably well defined acquisition curves to generate additional samples?)

We can possibly also start to play off the time intervals from friend and follower curves against each other to try and reduce the uncertainty within them (that is, the range of them).

For example, if we have fr(fo_i,B)=[t1,t2], and from fo(B,fo_i)=[t3,t4], if t3 > t1, we can tighten up fr(fo_i,B)=[t3,t2]. Similarly, if t2 < t4, we can tighten up fo(B,fo_i)=[t3,t2]. Which I think in general is:

if fr(A,B)=[t1,t2] and fo(B,A)=[t3,t4], we can tighten up to fr(A,B) = fo(B,A) = [ greater_of(t1,t3), lesser_of(t2,t4) ]
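Written as code, the tightening step is just an interval intersection (a throwaway sketch; timestamps shown as plain numbers for readability):

def tighten(fr_AB, fo_BA):
    # fr(A,B) and fo(B,A) describe the same event, so the combined
    # estimate is the intersection of the two [not-before, not-after] ranges
    (t1, t2), (t3, t4) = fr_AB, fo_BA
    return (max(t1, t3), min(t2, t4))

print(tighten((1, 10), (3, 12)))  # -> (3, 10)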

Erm, maybe? (I should probably read through that again to check the logic!) Things also get a little more complex when we only have time range estimates for most of the friends or followers, rather than good single point timestamp estimates for when they were friended or started to follow…;-) I’ll leave it as an exercise for the reader to figure out how to write that down and solve it!;-)

If this thought experiment does work out, then several rules of thumb jump out if we want to maximise our chances of generating reasonably accurate friend and follower acquisition curves:

– set up your MOOC Twitter account close to the time you want to start using it so its creation date is as late as possible;
– encourage folk to follow the MOOC account, and follow back, to improve the chances of getting reasonable resolution in the follower acquisition curve for the MOOC account. These connections also provide time-estimated probes into follower acquisition curves of friends and friend acquisition curves of followers;
– consider creating new “fake” timestamp Twitter accounts that can, immediately on creation, follow and be friended by the MOOC account to place temporal markers into the acquisition curves;
– if followers follow other celebrity accounts (or are followed (back) by them), we should be able to generate timestamp samples by analysing the celebrity account acquisition curves.

I think I need to go and walk the dog again.

PS a couple more trivial fixed points: for a target account, the earliest time at which they were first followed or when they first friended another account is the creation date of the target account; the latest possible time they acquired their most recent friend or follower is the time at which the data was collected.

Evaluating Event Impact Through Social Media Follower Histories, With Possible Relevance to cMOOC Learning Analytics

Last year I sat on a couple of panels organised by I’m a Scientist’s Shane McCracken at various science communication conferences. A couple of days ago, I noticed Shane had popped up a post asking Who are you Twitter?, a quick review of a social media mapping exercise carried out on the followers of the @imascientist Twitter account.

Using the technique described in Estimated Follower Accession Charts for Twitter, we can estimate a follower acquisition growth curve for the @imascientist Twitter account:

[Image: estimated follower accession chart for @imascientist]

I’ve already noted how we may be able to use “spikes” in follower acquisition rates to identify news events that involved the owner of a particular Twitter account and caused a surge in follower numbers as a result (What Happened Then? Using Approximated Twitter Follower Accession to Identify Political Events).

Thinking back to the context of evaluating the impact of events that include social media as part of the overall campaign, it struck me that whilst running a particular event may not lead to a huge surge in follower numbers on the day of the event or in the immediate aftermath, the followers who do sign up over that period might have signed up as a result of the event. And now we have the first inklings of a post hoc analysis tool that lets us try to identify these people, and perhaps look to see if their profiles are different to profiles of followers who signed up at different times (maybe reflecting the audience interest profile of folk who attended a particular event, or reflecting sign ups from a particular geographical area?)

In other words, having generated the follower acquisition curve, can we use it to filter down to folk who started following around a particular time, in order to then see whether there is a possibility that they started following as a result of a particular event, and if so whether they can count as some sort of “conversion”? (I appreciate that there are a lot of caveats in there!;-)
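In practical terms, the filtering step might look something like this (a sketch, reusing the estimated follow times produced by the accession chart technique; names and dates are made up):

from datetime import datetime

def followers_in_window(estimated_follow_times, start, end):
    # estimated_follow_times: (user, estimated_follow_time) tuples
    # Return the users whose estimated follow time falls inside the event window
    return [user for user, t in estimated_follow_times if start <= t <= end]

estimates = [('a', datetime(2013, 9, 20)),
             ('b', datetime(2013, 10, 2)),
             ('c', datetime(2013, 11, 5))]
event_window = (datetime(2013, 10, 1), datetime(2013, 10, 7))
print(followers_in_window(estimates, *event_window))  # -> ['b']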

A similar approach may also be relevant in the context of analysing link building around historical online community events, such as MOOCs… If we know somebody took a particular MOOC at a particular time, might we be able to construct their follower acquisition curve and then analyse it around the time of the MOOC, looking to see if the connections built over that period are different to the user’s other followers, and as such may represent links developed as a result of taking the MOOC? Analysing the timelines of the respective parties may further reveal conversational dynamics between those parties, and as such allow us to see whether a fruitful social learning relationship developed out of contact made in the MOOC?

What Are MOOCs (Good For)? I Don’t Really Know…

There’s mutterings in the air at the moment relating to MOOCs – Massive(ly) Open Online Courses: over the last few days, missives from David and Jim and Pat and Stephen amongst others.

Arguably primed by the open courseware and open learning initiatives that started a decade or so ago (precursor: OSTP), several notable MOOC platforms (Coursera, Udacity) provide a one stop supermarket for 20-100 hour, large cohort, online “course experiences” offered by traditional universities. Using a blend of readings and video lectures, courses provide pacing through a predetermined syllabus on a particular topic, with social tools to allow students to discuss elements of the course with each other. To provide feedback on progress, computer marking systems and scalable “peer assessment” provide a means of letting a student know how well they are doing on the course.

At least, I think that’s how it works. I don’t really know. Though I’ve signed up for several MOOCs, I’ve never actually started any of them, or tried to work through them.

Maybe that’s because I tend to learn from resources discovered on the web in response to questions I’ve formulated myself (questions which often derive from reading other resources and either being confused by them or not being able to make sense of them!). But just see how far a web search for visualisation site:coursera.org gets you. As I believe Patrick McAndrew suggested. (Hmmm… I appear to have hit my article limit… Here’s Patrick’s OER13 keynote which I think led to the article.)

I’m not sure who, within the universities that have signed up to delivering courses on the various MOOC platforms, is deciding which courses to run with, but I suspect the marketing department has a hand in it.

Marketing departments also used to run university web presences, too, didn’t they?

Way back when, when I was still at school, I used to watch tech documentaries – I remember seeing a window based GUI for the first time on an old episode of Horizon – and OU programmes, amongst other things… If you’re over a certain age, you’ll remember them:

Things have moved on since then, of course. OU/BBC programmes now are of a far more mainstream flavour (for example, clips from recent OU/BBC co-pros currently on iPlayer). Add to that, a wide variety of online videos, such as the 60 second animation series (such as 60 Second Adventures in Thought, or 60 Second Adventures in Astronomy, or even 60 Second Adventures in Economics), or clips from OU produced videos sent to students as part of their course materials (this sort of thing, for example: A Cyborg Future?, The Buildings of Ancient Rome or Environmental Policy in an International Context.)

As “tasters”, the OU course programmes that appeared in the dead parts of the schedule on BBC2 introduced me to a particular form of discourse, somewhere between a lecture and a tutorial, at a particular level of understanding (higher education academic). The programmes were created as embedded parts of a distance education course, complementing readings and exercises, and though I find it hard to remember back, I think that came across: that the programmes were a glimpse into the wider, and deeper, exploration of a particular topic provided by the course of which they were a part. I don’t really recall.

So whilst the OU course programmes were part of a course, they were not the whole of it. They were windows into a course, rather than a microcosm of one. I get the impression that the MOOCs are intended in some way to be “complete”, albeit short, courses, that are maybe intended to act as tasters of more comprehensive, formal offerings. I don’t really know.

My introduction to the OU, then, aged ten, or thereabouts, so 35 years or so ago, was through these glimpses into courses that other people were studying. They were opportunities for me to walk into a lecture to see what it was like. The programmes were not intended for me, but I could partake of them. That is more of a “passively discoverable OER” model than a MOOC. Maybe. I don’t really know.

I wonder now, if now was then, how I would have come to discover the world of “academic” communications. Through Google, presumably. Through the web. Through the marketing department? Or through the academics, (but discovered how?).

I guess we could argue that MOOCs represent, in part, higher education marketing departments waking up to the fact that the web exists, that it is a contentful medium, in part at least, and that the universities have content that may attract eyeballs. Maybe. I don’t really know.

If the marketing departments are leading the MOOC campaigns, I wonder what sort of return they expect? Raising “brand awareness”? Being seen to be there/having a presence on a platform because other universities have? Generating leads and building up mailing lists? Online courses as “promotional items” (whatever that means)? Edutainment offerings?! I don’t really know.

Going back to the OU programmes on BBC2, the primary audience then were presumably students on a course, because the programmes were part of the course. Partially open courses. Courses being run for real that also had an open window onto them, so that other people could see what sorts of thing were covered by those courses, and maybe learn a little along the way.

This is more in keeping with the model of online course delivery being pursued by Jim Groom’s ds106 or Martin Weller’s H817 module on “Open Education”. I think. I don’t really know.

Other models are possible, of course. The “cMOOC” – connectionist/connectivist MOOC – idea explores a different pedagogy. The xMOOC offerings of Coursera and Udacity wrap not-open courseware in a delivery platform and run scheduled cohorts. The original OU OpenLearn offering had the platform (Moodle), had the open content, but didn’t have the community that comes from marshalling learners into a scheduled offering. Or the hype. Or at least, not the right sort of hype (the hype that follows VC investment, where VC does not refer to Vice Chancellor). The cMOOC idea tries to be open as to curriculum, in part – a more fluid learning environment where loose co-ordination amongst participants encourages an element of personal research into a topic and sharing of the lessons learned. A pedagogy that seeks to foster independent learning in the sense of being able to independently research a topic, rather than independently pay to follow a prescribed path through a set of learning materials. In the xMOOC, a prescribed path that propagates a myth of there being one true way to learn a subject. One true path.

My own open course experiment was an uncourse. Tasked with writing a course on a subject about which I knew nothing, I sought to capture my own learning path through it, using that trail to inform the design of a supposedly more polished offering. The traces are still there, still open – Digital Worlds – Interactive Media and Game Design. The pages still get hit, the resources still get downloaded. I could – should – do a little more to make evident the pathways through the content.

Whilst the reverse chronological form of a blog made sense as I was discovering a trail through the subject area – new content was revealed to any others following the uncourse in a sensible order – looking back at the material now, the journeys through each topic area start at the end, presenting anyone wishing to follow the path I took with an apparent problem. Though not really… If you select a Category on the Digital Worlds blog, and add ?order=asc – as for example http://digitalworlds.wordpress.com/category/animation/?order=asc – the posts will be presented in chronological order. I wonder if there is a switch – or a plugin – that can make views of posts within a particular tag or category on WordPress automatically display in chronological order? I don’t really know. This would provide one way of transforming a platform configured as a “live presentation” site into one that worked better as a legacy site. It’s not hard to imagine a Janus theme that would provide these two faces of a site, one in reverse chronological order for live delivery, the other in chronological order for folk wishing to follow the steps taken by a previous journeyman in the same order as they were originally taken.

I still don’t know what forces are at play that may result in MOOC-hype and whatever shakes out as a result transforming, if at all, higher education as we know it and as developing countries may yet come to know it. I really don’t know.

And I still don’t have a good feeling for how we can make most effective use of the web to support a knowledge driven society; how best we can make use of online content resources and social communication tools to help people to develop their own personal, and deeper, understanding about whatever topic, to help them make sense of whatever they need to make sense of; how best schools and universities can draw on the web to help people develop lifelong learning skills; what it means to use the web in furtherance of lifelong learning.

I really, really, don’t know.

Local News Templates – A Business Opportunity for Data Journalists?

As well as serendipity, I believe in confluence…

A headline in the Press Gazette declares that Trinity Mirror will be roll[ing] out five templates across 130-plus regional newspapers as emphasis moves to digital. Apparently, this follows a similar initiative by Johnston Press midway through last year: Johnston to roll out five templates for network of titles.

It seems that “key” to the Trinity Mirror initiative is the creation of a new “Shared Content Unit” based in Liverpool that will provide features content to Trinity’s papers across the UK [which] will produce material across the regional portfolio in print and online including travel, fashion, food, films, books and “other content areas that do not require a wholly local flavour”.

[Update – 25/3/13: Trinity Mirror to create digital data journalism unit to produce content for online and printed titles]

In my local rag last week (the Isle of Wight County Press), a front page story on the Island’s gambling habit localised a national report by the Campaign for Fairer Gambling on Fixed Odds Betting Terminals. The report included a dataset (“To find the stats for your area download the spreadsheet here and click on the arrow in column E to search for your MP”) that I’m guessing (I haven’t checked…) provided some of the numerical facts in the story. (The Guardian Datastore also republished the data (£5bn gambled on Britain’s poorest high streets: see the data) with an additional column relating to “claimant count”, presumably the number of unemployment benefit claimants in each area (again, I haven’t checked…)) Localisation appeared in several senses:

[Image: IWCP gambling front page story]

So for example, the number of local betting shops and Fixed Odds betting terminals was identified, along with the mooted spend across those and the spend per head of population. Sensemaking of the figures was also applied by relating the spend to an equivalent number of NHS procedures or police jobs. (Things like the BBC Dimensions How Big Really provide one way of coming up with equivalent or corresponding quantities, at least in geographical area terms. (There is also a “How Many Really” visualisation for comparing populations.) Any other services out there like this? Maybe it’s possible to craft Wolfram Alpha queries to do this?)

Something else I spotted, via RBloggers, was a post by Alex Singleton of the University of Liverpool on an Open Atlas around the 2011 Census for England and Wales; he has “been busy writing (and then running – around 4 days!) a set of R code that would map every Key Statistics variable for all local authority districts”. The result is a set of PDF docs for each Local Authority district mapping out each indicator. As well as publishing the separate PDFs, Alex has made the code available.

So what’s confluential about those?

The IWCP article localises the Fairer Gambling data in several ways:
– the extent of the “problem” in the local area, in terms of numbers of betting shops and terminals;
– a consideration of what the spend equates to on a per capita basis (the report might also have used a population of over 18s to work out the average “per adult islander”); note that there are also at least a couple of significant problems with calculating per capita averages in this example: first, the Island is a holiday destination, and the population swings over the summer months; secondly, do holidaymakers spend differently to residents on these machines?
– a corresponding quantity explanation that recasts the numbers into an equivalent spend on matters with relevant local interest.

The Census Atlas takes one recipe and uses it to create localised reports for each LA district. (I’m guessing that with a quick tweak, separate reports could be generated for the different areas within a single Local Authority.)

Trinity Mirror’s “Shared Content Unit” will produce content “that do[es] not require a wholly local flavour”, presumably syndicating it to its relevant outlets. But it’s not hard to also imagine a “Localisable Content” unit that develops applications that can help produce localised variants of “templated” stories produced centrally. This needn’t be quite as automated as the line taken by computational story generation outfits such as Narrative Science (for example, Can the Computers at Narrative Science Replace Paid Writers? or Can an Algorithm Write a Better News Story Than a Human Reporter?) but instead could produce a story outline or shell that can be localised.

A shorter term approach might be to centrally produce data driven applications that can be used to generate charts, for example, relevant to a locale in an appropriate style. So for example, using my current tool of choice for generating charts, R, we could generate something centrally and then allow the local press to grab data relevant to them and generate a chart in an appropriate style (for example, Style your R charts like the Economist, Tableau … or XKCD). This approach saves duplication of effort in getting the data, cleaning it, building basic analysis and chart tools around it, and so on, whilst allowing for local customisation in the data views presented. And with the increasing number of workflows available around R (for example, RPubs, knitr, github, and a new phase for the lab notebook, Create elegant, interactive presentations from R with Slidify, [Wordpress] Bloggin’ from R), this sort of approach is getting ever easier to put into practice.

Using R frameworks such as Shiny, we can quickly build applications such as my example NHS Winter Sitrep data viewer (about) that explores how users may be able to generate chart reports at Trust or Strategic Health Authority level, and (if required) download data sets related to those areas alone for further analysis. The data is scraped and cleaned once, “centrally”, and common analyses and charts coded once, “centrally”, and can then be used to generate items at a local level.

The next step would be to create scripted story templates that allow journalists to pull in charts and data as required, and then add local colour – quotes from local representatives, corresponding quantities that are somehow meaningful. (I should try to build an example app from the Fairer Gambling data, maybe, and pick up on the Guardian idea of also adding in additional columns… again, something where the work can be done centrally, looking for meaningful datasets and combining them with the original data set.)

Business opportunities also arise outside media groups. For example, a similar service idea could be used to provide story templates – and pull-down local data – to hyperlocal blogs. Or a ‘data journalism wire service’ could develop applications to aid in the creation of data supported stories on particular topics. PR companies could do a similar thing (for example, appifying the Fairer Gambling data as I “appified” the NHS Winter sitrep data, maybe adding in data such as the actual locations of fixed odds betting terminals). (On my to do list is packaging up the recently announced UCAS 2013 entries data.)

The insight here is not to produce interactive data apps (aka “news applications”) for “readers” who have no idea how to use them or what to read from them, whatever stories they might tell; rather, it is to produce interactive applications for generating charts and data views that can be used by a “data” journalist. Rather than having a local journalist working with a local team of developers and designers to get a data flavoured story out, a central team produces a single application that local journalists can use to create a localised version of a particular story that has local meaning but works at national scale.

Note that by concentrating specialisms in a central team, there may also be the opportunity to then start exploring the algorithmic annotation of local data records. It is worth noting that Narrative Science are already engaged in this sort of activity too, as for example described in this ProPublica article on How To Edit 52,000 Stories at Once, a news application that includes “short narrative descriptions of almost all of the more than 52,000 schools in our database, generated algorithmically by Narrative Science”.

PS Hmm… I wonder… is there time to get a proposal together on this sort of idea for the Carnegie Trust Neighbourhood News Competition? Get in touch if you’re interested…