It’s that time of year again when the vagaries of marking assert themselves and we get to third-mark end-of-course assessment scripts where the first two markers award significantly different marks or their marks straddle the pass/fail boundary.
There is a fiddle function available to us where we can “standardise” marks, shifting the mean of particular markers (and maybe the sd?) whose stats suggest they may be particularly harsh or lenient. Ostensibly, standardisation just fixes the distribution; ideally, this is probably something a trained statistician should do; pragmatically, it’s often an attempt to reduce the amount of third marking; intriguingly, we could probably hack some code to “optimise” standardisation to bring everyone into line and reduce third marking to zero; but we tend to avoid that process, leaving the raw marks in all their glory.
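For the curious, here’s roughly what I mean by hacking some code: a minimal sketch, assuming a hypothetical table of raw marks keyed by marker, that shifts each marker’s marks onto the cohort mean and standard deviation. The column names and numbers are made up for illustration.

```python
import pandas as pd

# Hypothetical data: one row per script, with the marker who marked it
# and the raw mark they awarded. Values are illustrative only.
marks = pd.DataFrame({
    "marker": ["A", "A", "B", "B", "C", "C"],
    "raw_mark": [55, 62, 48, 71, 80, 66],
})

cohort_mean = marks["raw_mark"].mean()
cohort_sd = marks["raw_mark"].std()

def standardise(group):
    # Rescale this marker's marks so their mean (and sd) match the cohort's.
    z = (group - group.mean()) / group.std()
    return cohort_mean + z * cohort_sd

marks["standardised"] = marks.groupby("marker")["raw_mark"].transform(standardise)
print(marks)
```

If you only wanted to shift the mean of a flagged marker, rather than also rescale their spread, you’d drop the division by `group.std()` and just add the difference between the cohort mean and the marker’s mean.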
I’ve never really understood why we don’t do a post mortem after the final award board and compare the marks awarded by markers in their MK1 (first marker) and MK2 (second marker) roles against the mark finally awarded to a script. This would generate some sort of error signal that module teams, staff tutors and markers could use to see how effective any given marker is at “predicting” the final grade awarded to a script. But we don’t do that. Analytics are purely for applying to learners because it’s their fault. (I often wonder if the learning analytics folk look at marker identity as one of the predictors for a student’s retention and grade; and if there is an effect; or maybe some things are best left under the stone…)
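For what it’s worth, the sort of error signal I have in mind wouldn’t take much code. Here’s a minimal sketch, assuming a hypothetical table of (script, marker, role) records holding the mark each marker gave and the mark finally awarded; column names and values are invented.

```python
import pandas as pd

# Hypothetical data: one row per (script, marker role) pairing, with the mark
# that marker gave and the mark finally awarded to the script.
records = pd.DataFrame({
    "marker":  ["A", "B", "A", "C", "B", "C"],
    "role":    ["MK1", "MK2", "MK1", "MK2", "MK1", "MK2"],
    "awarded": [62, 55, 71, 68, 48, 50],
    "final":   [58, 58, 70, 70, 52, 52],
})

records["error"] = records["awarded"] - records["final"]

# Per-marker signal: positive bias suggests leniency, negative suggests
# harshness; the mean absolute error shows how far from the finally awarded
# grade that marker's marks tend to land.
signal = records.groupby("marker")["error"].agg(
    bias="mean",
    mean_abs_error=lambda e: e.abs().mean(),
    n_scripts="count",
)
print(signal)
```

Grouping by role as well as marker would also show whether someone behaves differently in their MK1 and MK2 roles.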
Anyway… third marking time.
In a sense, it’s a really useful activity because we get to see a full range of student scripts and get a feel for what they’ve got out of the course.
But the fact that we often get a large disparity between marks does, as ever, raise questions about the reliability of the marks awarded to scripts we don’t third mark (for example, if two harsh or lenient markers mark the same scripts). I’m sure there are ways the numbers could be churned to give some useful and simple insights into individual marker behaviour, rather than the not overly helpful views we’re given of marker distributions. And I wonder, if we just trained a simple text classifier on raw scripts against the final awarded mark on a script, how much it would vary compared to human markers. And maybe one that classifies based on screenshots of the report (a 20 second skim of how a report looks often gives me a sense of which grade boundary it is as likely as not to fall in…)
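To be clear about what I mean by a simple text classifier: something like the sketch below, a TF-IDF bag-of-words model feeding a logistic regression that predicts the grade band finally awarded. The script texts and bands here are invented placeholders; in practice you’d pull the full text out of the submitted documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical inputs: the plain text of each script and the grade band
# finally awarded to it. These toy strings stand in for real submissions.
script_texts = [
    "clear structure, evidence cited, limitations discussed",
    "rambling description, no evaluation, references missing",
    "solid analysis with minor gaps in the discussion",
    "incomplete answer, key concepts misunderstood",
]
final_bands = ["distinction", "fail", "pass", "fail"]

# Bag-of-words baseline: TF-IDF features into a logistic regression classifier.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(script_texts, final_bands)

# Compare the model's predicted band with what the human markers settled on.
print(clf.predict(["reasonable analysis but thin on evidence"]))
```

The screenshot version would presumably swap the TF-IDF features for some sort of image model over page thumbnails, but that’s another experiment.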
But ours not to reason why…