Tinkering With MOOC Data – Rebasing Time

[I’ve been asked to take this post down because it somehow goes against, I dunno, something, but as a stop gap I’ll try to just remove the charts and leave the text, to see if I get another telling off…]

As a member of an organisation where academics tend to be course designers and course producers, and kept as far away from students as possible (Associate Lecturers handle delivery as personal tutors and personal points of contact), I’ve never really got my head around what “learning analytics” is supposed to deliver: it always seemed far more useful to me to think about course analytics as way of tracking how the course materials are working and whether they seem to be being used as intended. Rather than being interested in particular students, the emphasis would be more on how a set of online course materials work in much the same way as tracking how any website works. Which is to say, are folk going to the pages you expect, spending the time on them you expect, reaching goal pages as and when you expect, and so on.

Having just helped out on a MOOC, I was allowed to have a copy of the course related data files the provider makes available to partners:

I'm not allowed to show you this, apparently...

The course was on learning to code for data analysis using the Python pandas library, so I thought I’d try to apply what was covered in the course (perhaps with a couple of extra tricks…) to the data that flowed from the course…

And here’s one of the tricks… rebasing (normalising) time.

For example, one of the things I was interested in was how long learners were spending on particular steps and particular weeks on the one hand, and how long their typical study sessions were on the other. This could then all be aggregated to provide some course stats about loading which could feed back into possible revisions of the course material, activity design (and redesign) etc.

Here’s an example of how a randomly picked learner progressed through the course:

I'm not allowed to show you this, apparently...

The horizontal x-axis is datetime, the vertical y axis is an encoding of the week and step number, with clear separation between the weeks and steps within a week incrementally ordered. The points show the datetime at which the learner first visited the step. The points are coloured by “stint”, a trick I borrowed from my F1 data wrangling stuff: during the course of a race, cars complete several “stints”, where a stint corresponds to a set laps completed on a particular set of tyres; analysing races based on stints can often turn up interesting stories…

To identify separate study session (“stints”) I used a simple heuristic – if the gap between start-times of consecutively studied stints exceeded a certain threshold (55 minutes, say), then I assumed that the steps were considered in separate study sessions. This needs a bit of tweaking, possibly, perhaps including timestamps from comments or question responses that can intrude on long gaps to flag them as not being breaks in study, or perhaps making the decision about whether the gap between two steps is actually a long one compared to a typically short median time for that step? (There are similar issues in the F1 data, for example when trying to work out whether a pit stop may actually be a drive-through penalty rather than an actual stop.)

In the next example, I rebased the time for two learners based on the time they first encountered the first step of the course. That is, the “learner time” (in hours) is the time between them first seeing a particular step, and the time they first saw their first step. The colour field distiguishes between the two learners.

I'm not allowed to show you this, apparently...

We can draw on the idea of “stints”, or learner sessions further, and use the earliest time within a stint to act as the origin. So for example, for another random learner, here we see an incremental encoding on of the step number on the y-axis, with the weeks clearly separated, the “elapsed study session time” along the horizontal y-axis, and the colour mapping out the different study sessions.

I'm not allowed to show you this, apparently...

The spacing on the y-axis needs sorting out a bit more so that it shows clearer progression through steps, perhaps by using an ordered categorical axis with a faint horizontal rule separator to distinguish the separate weeks. (Having an interactive pop-up that contains some information the particular step each mark refers to, as well as information about how much time was spent on it, whether there was commenting activity, etc, what the mean and median study time for the step is, etc etc, could also be useful.) However, I have to admit that I find charting in pandas/matplotlib really tricky, and only seem to have slightly more success with seaborn; I think I may need to move this stuff over to R so that I can make use of ggplot, which I find far more intuitive…

Finally, whilst the above charts are at the individual learner level, my motivation for creating them was to better understand how the course materials were working, and to try to get my eye in to some numbers that I could start to track as aggregate numbers (means, medians, etc) over the course as a whole. (Trying to find ways of representing learner activity so that we could start to try to identify clusters or particular common patterns of activity / signatures of different ways of studying the course, is then another whole other problem – though visual insights may also prove helpful there.)