Data Textualisation – Making Human Readable Sense of Data

A picture may be worth a thousand words, but whilst many of us may get a pre-attentive gut reaction reading from a data set visualised using a chart type we’re familiar with, how many of us actually take the time to read a chart thoroughly and maybe verbalise, even if only to ourselves, what the marks on the chart mean, and how they relate to each other? (See How fertility rates affect population for an example of how to read a particular sort of chart.)

An idea that I’m finding increasingly attractive is the notion of text visualisation (or text visualization for the US-English imperialistic searchbots). That is, the generation of mechanical text from data tables so we can read words that describe the numbers – and how they relate – rather than looking at pictures of them or trying to make sense of the table itself.

Here’s a quick example of the sort of thing I mean – the generation of this piece of text:

The total number of people claiming Job Seeker’s Allowance (JSA) on the Isle of Wight in October was 2781, up 94 from 2687 in September, 2013, and down 377 from 3158 in October, 2012.

from a data table that can be sliced like this:

slicing nomis JSA figures

In the same way that we make narrative decisions when it comes to choosing what to put into a data visualisation, as well as how to read it (and how the various elements displayed in it relate to each other), so we make choices about the textual, or narrative, mapping from the data set to the text version (that is, the data textualisation) of it. When we present a chart or data table to a reader, we can try to influence their reading of it in variety of ways: by choosing the sort of order of bars on a bar chart, or rows in table, for example; or by highlighting one or more elements in a chart or table through the use of colour, font, transparency, and so on.

The actual reading of the chart or table is still largely under the control of the reader, however, and may be thought of as non-linear in the sense that the author of the chart or table can’t really control the order in which the various attributes of the table or chart, or relationships between the various elements, are encountered by the reader. In a linear text, however, the author retains a far more significant degree of control over the exposition, and the way it is presented to the reader.

There is thus a considerable amount of editorial judgement put into the mapping from a data table to text interpretations of the data contained within a particular row, or down a column, or from some combination thereof. The selection of the data points and how the relationships between them are expressed in the sentences formed around them directs attention in terms of how to read the data in a very literal way.

There may also be a certain amount of algorithmic analysis used along the way as sentences are constructed from looking at the relationships between different data elements; (“up 94” is a representation (both in sense of rep-resentation and re-presentation) of a month on month change of +94, “down 377” generated mechanically from a year on year comparison).

Every cell in a table may be a fact that can be reported, but there are many more stories to be told by comparing how different data elements in a table stand in relation to each other.

The area of geekery related to this style of computing is known as NLG – natural language generation – but I’ve not found any useful code libraries (in R or Python, preferably…) for messing around with it. (The JSA example above was generated using R as a proof of concept around generating monthly press releases from ONS/nomis job-figures.

PS why “data textualisation“, when we can consider even graphical devices as “texts” to be read? I considered “data characterisation” in the sense of turning data in characters, but characterisation is more general a term. Data narration was another possibility, but those crazy Americans patenting everything that moves might think I was “stealing” ideas from Narrative Science. Narrative Science (as well as Data2Text and Automated Insights etc. (who else should I mention?)) are certainly interesting but I have no idea how any of them do what they do. And in terms of narrating data stories – I think that’s a higher level process than the mechanical textualisation I want to start with. Which is not to say I don’t also have a few ideas about how to weave a bit of analysis into the textualisation process too…

7 comments

    • Tony Hirst

      Hi Gergely – thanks for that link, looks interesting (and the Rapport framework, too). I spent the weekend handcrafting some bespoke functions for giving text summaries of single rows of a table selected out on particular cell values, but the code is for a private project and not shareable atm:-(

      However, I do intend to work out related ideas openly and in a bit more detail around some more ONS reports over the next few weeks. The immediate aim is to find a method for producing press-release like reports for pulling out possibly newsworthy features around simple data releases using pre-identified newsrules and templated canned text/trope text generators. Would love to see your code if it’s available…:-)

      FWIW, the original code for IW stats shown in the post is at: https://gist.github.com/psychemedia/7536017 but by that point I hadn’t really got into the swing of thinking how to start writing sentence generating functions.