Working with Broken

OpenAI announce the release of an AI-generated text identification tool that they admit is broken (“not fully reliable”, as they euphemistically describe it) and that you have to figure out how to use as best you can, given its unreliability.

Even though we spend 2-3 years producing new courses, the first presentation always has broken bits: broken bits that sometimes take years to be discovered, and others that are discovered quickly but still take years before anyone gets round to fixing them. Sometimes, courses need a lot of work after the first presentation reveals major issues with them. Updates to materials are discouraged, in case they are themselves broken or break things in turn, which means materials start to rot in place (modules can remain in place for 5 years, even 10, with few changes).

My attitude has been that we can ship things that are a bit broken, if we fix them quickly, and/or actively engage with students to mitigate or even explore the issue. We just need to be open. Quality improved through iteration. Quality assured through rapid response (not least because the higher the quality at the start, the less work we make for ourselves by having to fix things).

Tinkering with ChatGPT, I started wondering about how we can co-opt ChatGPT as a teaching and learning tool, given that its output may be broken. ChatGPT as an unreliable but well-intentioned tutor. Over the last week or two, I’ve been trying to produce a new “sample report” that models the sort of thing that we expect students to produce for their data analysis and management course end-of-course assessment (a process that made me realise how much technical brokenness we are probably letting slip through if the presented reports look plausible enough — the lesson of ChatGPT again comes to mind here, with the student submitting the report being akin to an unreliable author who can slip all sorts of data management abuses and analysis mistakes through in a report, a largely non-technical document that only displays the results and doesn’t show the working). In so doing, I wondered whether it might be more useful to create an “unreliable” sample report, but annotate it with comments, as if from a tutor, that acknowledged good points and picked up on bad ones.

My original thinking, seven or eight years ago now, was that the final assessment report for the data management and analysis course would be presented as a reproducible document. That never happened — I had little role in designing the assessment, and things were not so mature getting on for a decade ago when the course was first mooted — but as tools like Jupyter Book and Quarto show, there are now tools in place that can produce good quality interactive HTML reports with hideable/revealable code, or easily produce two parallel MS Word or PDF document outputs – a “finished document” output (with code hidden), and a “fully worked” document with all the code showing. This would add a lot of work for the student, though. Currently, the model we use is for students to work in a single project diary notebook (though some students occasionally use multiple notebooks) that contains all manner of horrors, and then paste things like charts and tables into the final report. The final report typically contains a quick discursive commentary where the students explain what they think they did. We reserve the right to review the code (the report is the thing that is assessed), but I suspect the notebook contents rarely get a detailed look from markers, if they are looked at at all. For students to tease only the relevant code out of their notebook into a reproducible report would be a lot of extra work…
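By way of illustration, here is a minimal sketch of the kind of Quarto front matter that could drive those parallel outputs (a hypothetical example, not our actual assessment template): code folded away but revealable in the HTML view, suppressed entirely in the “finished” Word output, and shown in full in the “fully worked” PDF.

```yaml
---
title: "Sample data analysis report"   # hypothetical report title
format:
  html:
    code-fold: true    # HTML: code hidden behind click-to-reveal "Code" folds
  docx:
    echo: false        # "finished" Word output: results only, code suppressed
  pdf:
    echo: true         # "fully worked" PDF: all code chunks displayed
---
```

Rendering the document with `quarto render report.qmd` (or with `--to docx` / `--to pdf` for a single output) would then generate the parallel versions from the same source file.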

For the sample report, my gut feeling is that the originating notebook for the data handling and analysis should not be shared. We need to leave the students with some work to do, and for a technical course, that is the method. In this model, we give a sample unreproducible report, unreliable but commented upon, that hints at the sort of thing we expect to get back, but that hides the working. Currently, we assess the student’s report, which operates at the same level. But ideally, they’d give us a reproducible report back that gives us the chance to look at their working inline.

Anyway, that’s all an aside. The point of this post was an announcement I saw from OpenAI — New AI classifier for indicating AI-written text — where they claim to have “trained a classifier to distinguish between text written by a human and text written by AIs from a variety of providers”. Or not:

Our classifier is not fully reliable. [OpenAI’s emphasis] In our evaluations on a “challenge set” of English texts, our classifier correctly identifies 26% of AI-written text (true positives) as “likely AI-written,” while incorrectly labeling human-written text as AI-written 9% of the time (false positives).

OpenAI then make the following offer: “[w]e’re making this classifier publicly available to get feedback on whether imperfect tools like this one are useful.”

So I’m wondering: is this the new way of doing things? Giving up on the myth that things work properly, and instead accepting that we have to work with tools that are known to be a bit broken? That we have to find ways of working with them that accommodate that? Accepting that everything we use is broken-when-shipped, that everything is unreliable, and that it is up to us to use our own craft, and come up with our own processes, in order to produce things that are up to the standard we expect, even given the unreliability of everything we have to work with? Quality assurance as an end user problem?


One thought on “Working with Broken”

  1. Hi. Interesting conclusion! Hasn’t it always been a bit like this with software (and hardware)? Shipped with bugs or faults and let the user tell us about it!
