More Tukey Gems

Via a half-quote from Adam Cooper in his SoLAR Flare talk today, elucidated in his blog post Exploratory Data Analysis, I am led to a talk by John Tukey – The Technical Tools of Statistics – read at the 125th Anniversary Meeting of the American Statistical Association, Boston, November 1964.

As ever (see, for example, Quoting Tukey on Visual Storytelling with Data), it contains some gems… The following is a spoiler of the joy of reading the paper itself. I suggest you do that instead – you’ll more than likely find your own gems in the text: The Technical Tools of Statistics.

If you’re too lazy to click away, here are some of the quotes and phrases I particularly enjoyed.

To start with, the quote referenced by Adam:

Some of my friends felt that I should be very explicit in warning you of how much time and money can be wasted on computing, how much clarity and insight can be lost in great stacks of computer output. In fact, I ask you to remember only two points:

  • The tool that is so dull that you cannot cut yourself on it is not likely to be sharp enough to be either useful or helpful.
  • Most uses of the classical tools of statistics have been, are, and will be, made by those who know not what they do.

And here’s one I’m going to use when talking about writing diagrams:

Hand-drawing of graphs, except perhaps for reproduction in books and in some journals, is now economically wasteful, slow, and on the way out.

(It strikes me that using a spreadsheet wizard to create charts in a research or production setting, where we are working in a reproducible, document-generation context, is akin to the “hand-drawing of graphs” of yesteryear?)
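For example, a chart scripted in a few lines of code can be regenerated exactly every time the data changes or the document is rebuilt, which is the point the wizard-driven workflow misses. A minimal sketch, assuming pandas and matplotlib; the file and column names here are made up for illustration:

```python
# A chart defined in code rather than via a spreadsheet wizard, so the figure
# can be regenerated exactly whenever the source data changes.
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data file standing in for whatever the analysis actually uses;
# assumed columns: "year" and "value".
df = pd.read_csv("results.csv")

ax = df.plot(x="year", y="value", kind="line", legend=False)
ax.set_ylabel("value")

# The document build can re-run this script to refresh the figure.
plt.savefig("results.png")
```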

“I know of no person or group that is taking nearly adequate advantage of the graphical potentialities of the computer.”

Nothing’s changed?!

[W]e are going to reach a position we should have reached long ago. We are going, if I have to build it myself, to have a programming system — a “language” if you like — with all that that implies, suited to the needs of data analysis. This will be planned to handle numbers in organized patterns of very different shapes, to apply a wide variety of data-analytical operations to make new patterns from old, to carry out the oddest sequences of apparently unrelated operations, to provide a wide variety of outputs, to automatically store all time-expensive intermediate results “on disk” until the user decides whether or not he will want to do something else with them, and to do all this and much more easily.

Since I’ve started playing with pandas, my ability to have written conversations with data has improved. Returning to R after a few months away, I’m finding that easier to write too (the tabular data models, and elements of the syntax, are broadly similar across the two).
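To give a flavour of what I mean by a written conversation, here’s a minimal sketch in pandas (the data and column names are invented); the base R counterpart, aggregate(score ~ group, df, mean), reads much the same way:

```python
# Ask a question of some tabular data: what is the mean score per group?
import pandas as pd

# Toy data standing in for a real dataset
df = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "score": [1.0, 2.0, 3.0, 4.0],
})

# Group, then summarise: the statement reads like the question being asked
summary = df.groupby("group")["score"].mean()
print(summary)
```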

Most of the technical tools of the future statistician will bear the stamp of computer manufacture, and will be used in a computer. We will be remiss in our duty to our students if we do not see that they learn to use the computer more easily, flexibly, and thoroughly than we ever have; we will be remiss in our duties to ourselves if we do not try to improve and broaden our own uses.

This does not mean that we shall have to continue to teach our students the elements of computer programming; most of the class of ’70 is going to learn that as freshmen or sophomores. Nor does it mean that each student will write his own program for analysis of variance or for seasonal adjustment; this would be a waste. … It must mean learning to put together, effectively and easily — on a program-self-modifying computer and by means of the most helpful software then available — data analytical steps appropriate to the need, whether this is to uncover an anticipated specific appearance or to explore some broad area for unanticipated, illuminating appearances, or, as is more likely, to do both.

Interesting to note that in the UK, “text-based programming” has made it into the curriculum. (Related: Text Based Programming, One Line at a Time (short course pitch).)

Tukey also talks about how computing will offer flexibility and fluidity. Flexibility includes the “freedom to introduce new approaches; freedom, in a word, to be a journeyman carpenter of data-analytical tools”. Fluidity “means that we are prepared to use structures of analysis that can flow rather freely … to fit the apparent desires of the data”.

As the computer revolution finally penetrates into the technical tools of statistics, it will not change the essential characteristics of these tools, no matter how much it changes their appearance, scope, appositeness and economy. We can only look for:

  • more of the essential erector-set character of data analysis techniques, in which a kit of pieces are available for assembly into any of a multitude of analytical schemes,
  • an increasing swing toward a greater emphasis on graphicality and informality of inference,
  • a greater and greater role for graphical techniques as aids to exploration and incisiveness,
  • steadily increasing emphasis on flexibility and on fluidity,
  • wider and deeper use of empirical inquiry, of actual trials on potentially interesting data, as a way to discover new analytic techniques,
  • greater emphasis on parsimony of representation and inquiry, on the focussing, in each individual analysis, of most of our attention on relatively specific questions, usually in combination with a broader spreading of the remainder of our attention to the exploration of more diverse possibilities.

In order that our tools, and their uses, develop effectively … we shall have to give still more attention to doing the approximately right, rather than the exactly wrong, …

All quotes from John Tukey, The Technical Tools of Statistics, 1964.

Wonderful :-)