If you look back not that far in history, the word “computer” was a term applied a person working in a particular role. According to Webster’s 1828 American Dictionary of the English Language, a computer was defined as “[o]ne who computes or reckons; one who estimates or considers the force and effect of causes, with a view to form a correct estimate of the effects”.

Going back a bit further, to Samuel Johnson’s magnum opus, we see a “computer” is defined more concisely as a “reckoner” or “accountant”.

In a disambiguation page, Wikipedia identifies Computer_(job_description), quoting Turing’s Computing Machinery and Intelligence paper in Mind (Volume LIX, Issue 236, October 1950, Pages 433–460):

The human computer is supposed to be following fixed rules; he has no authority to deviate from them in any detail. We may suppose that these rules are supplied in a book, which is altered whenever he is put on to a new job.

Skimming through a paper that appeared in my feeds today — CHARTDIALOGS: Plotting from Natural Language Instructions [ACL 2020; code repo] — the following jumped out at me:

In order to further inspect the quality and difficulty of our dataset, we sampled a subset of 444 partial dialogs. Each partial dialog consists of the first several turns of a dialog, and ends with a Describer utterance. The corresponding Operator response is omitted. Thus, the human has to predict what the Operator (the plotting agent) will plot, given this partial dialog. We created a new MTurk task, where we presented each partial dialog to 3 workers and collected their responses.

Humans. As computers. Again.

Originally, the computer was a person doing a mechanical task.

Now, a computer is a digital device.

Now a computer aspires to be AI, artificial (human) intelligence.

Now AI is, in many cases, behind the Wizard of Oz curtain, inside von Kempelen’s “The Turk” automaton (not…), a human.

Human Inside.

A couple of of other things that jumped out at me, relating to instrumentation and comparison between machines:

The cases in which the majority of the workers (3/3 or 2/3) exactly match the original Operator, corresponding to the first two rows, happen 72.6% of the time. The cases when at least 3 out of all 4 humans (including the original Operator) agree, corresponding to row 1, 2 and 5, happen 80.6% of the time. This setting is also worth considering because the original Operator is another MTurk worker, who can also make mistakes. Both of these numbers show that a large fraction of the utterances in our dataset are intelligible implying an overall good quality dataset. Fleiss’ Kappa among all 4 humans is 0.849; Cohen’s Kappa between the original Operator and the majority among 3 new workers is 0.889. These numbers indicate a strong agreement as well.

Just like you might compare the performance  of different implementations of an algorithm in code, we also compare the performance of their  instationation in digitial or human computers.

At the moment, for “intelligence” tasks (and it’s maybe worth noting that Mechanical Turk has work packages defined as HITs, “Human Intelligence Tasks”) humans are regarded as providing the benchmark god standard, imperfect as it is.

7.5 Models vs. Gold Human Performance (P3) The gold human performance was obtained by having one of the authors perform the same task as described in the previous subsection, on a subset


See also: Robot Workers?

Author: Tony Hirst

I'm a Senior Lecturer at The Open University, with an interest in #opendata policy and practice, as well as general web tinkering...

%d bloggers like this: