A few days ago, I saw via the @HSCICOpenData Twitter feed that an annually released dataset on Written Complaints in the NHS has just been published.
The data comes in the form of a couple of spreadsheets in which each row describes a count of the written complaints received and upheld under a variety of categories for each GP and dental practice, or local NHS trust.
The practice level spreadsheet looks like this:
Each practice is identified solely by a practice code – to find the name and address of the actual practice requires looking up the code in another dataset.
The column headings supplied in the CSV document only partially identify each column (and indeed, there are duplicates, such as Total number of written complaints received, that a spreadsheet reader might disambiguate by adding a numerical suffix) – a more complete description (that shows how the columns are actually hierarchically defined) is provided in an associated metadata spreadsheet.
For a reporter wanting to know whether or not any practices in their area fared particularly badly in terms of the number of upheld complaints, the task might be broken down as follows:
- identify the practices of interest from their practice codes (which requires finding a set of practice codes of interest);
- for each of those practices, look along the row to see whether or not there are any large numbers in the complaints upheld column.
But if you have a spreadsheet with 10, 20, 30 or more columns, scanning along a row looking for items of interest can rapidly become quite a daunting task.
So an idea I’ve been working on, which I suspect harkens back to the earliest days of database reporting, is to look at ways of turning each row of data into a text based, human readable report.
Something like the following, for example:
Each record, each “Complaint Report”, is a textual rendering of a single row from the practice complaints spreadsheet, with a bit of administrative metadata enrichment in the form of the practice name, address (and in later versions, telephone number).
These reports are quicker to scan, and could be sorted or highlighted depending on the number of upheld complaints, for example. The journalist can then quickly review the reports and identify any practices that might be worth phoning for a comment, to ask why they appear to have received a large number of upheld complaints in a particular area… Data driven press releases used to assist reporting, in other words.
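As a rough illustration of the idea (this is not the sketch script linked below), here is a minimal Python version of rendering one row as a text report. The column headings, practice code, name and address are all made up for the example, and the practice lookup is assumed to have already been loaded into a dictionary keyed by practice code. Only the non-zero counts are reported.

```python
# Hypothetical column names and values, for illustration only; the real
# spreadsheet uses different (and partly duplicated) headings.
def complaint_report(row, practices, id_column="practice_code"):
    """Render one spreadsheet row as a short, human readable report."""
    code = row[id_column]
    # Enrich the row with name and address from a separate lookup keyed by code
    details = practices.get(
        code, {"name": "Unknown practice", "address": "address not found"}
    )
    lines = [f"Complaint Report: {details['name']} ({code})", details["address"]]
    for column, value in row.items():
        if column == id_column:
            continue
        if value:  # only report the non-zero counts
            lines.append(f"- {column}: {value}")
    if len(lines) == 2:
        lines.append("No written complaints recorded.")
    return "\n".join(lines)


# Made-up lookup table and row
practices = {
    "X00001": {"name": "Example Surgery", "address": "1 Example Road, Exampletown"}
}
row = {
    "practice_code": "X00001",
    "Total written complaints received": 5,
    "Total written complaints upheld": 2,
    "Clinical complaints received": 0,
    "Clinical complaints upheld": 0,
}

report = complaint_report(row, practices)
print(report)
```

Sorting or highlighting the reports would then just be a matter of ordering the rows by one of the upheld counts before rendering them.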
FWIW, I popped up a sketch script that generates the above report from the data, and also pulls in practice administrative metadata from an epracurr spreadsheet, here: NHS complaints spreadsheet2text sketch. See also: Data Driven Press Releases From HSCIC Data – Diabetes Prescribing.
PS I’m not a Microsoft Office suite user, but I suspect you can get a fair way along this sort of process by using a mail merge? There may be other ways of generating templated reports too. Any Microsoft Office users fancy letting me know how you’d go about doing something like the above in Word and Excel? I’d guess complicating factors are the requirements to make use of the column headers and only display the items associated with non-zero counts, which perhaps requires some macro magic? Things could perhaps be simplified by reshaping the data, perhaps putting it into a long form by melting the complaints columns, or melting the complaints columns cannily to provide two value columns, one for complaints received and one for complaints upheld?
Then you could filter out the blank rows before the merge.
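To make the reshaping idea a little more concrete, here is a small sketch in plain Python (a library such as pandas has its own melt function, but this avoids any dependencies). It assumes, purely for illustration, that each complaints heading ends in either "received" or "upheld"; the real headings are messier than that, and the codes and counts below are invented.

```python
def melt_complaints(rows, id_column="practice_code"):
    """Reshape wide rows into long form: one record per practice and
    category, with two value columns, 'received' and 'upheld'."""
    long_rows = {}
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue
            # Assumes headings of the form "<category> received" / "<category> upheld"
            category, _, measure = column.rpartition(" ")
            if measure not in ("received", "upheld"):
                continue
            key = (row[id_column], category)
            record = long_rows.setdefault(
                key,
                {id_column: row[id_column], "category": category,
                 "received": 0, "upheld": 0},
            )
            record[measure] = value
    # Filter out the "blank" records, where there is nothing to report
    return [r for r in long_rows.values() if r["received"] or r["upheld"]]


# Made-up wide format data
wide = [
    {"practice_code": "X00001", "Clinical received": 3, "Clinical upheld": 1,
     "Administrative received": 0, "Administrative upheld": 0},
    {"practice_code": "X00002", "Clinical received": 0, "Clinical upheld": 0,
     "Administrative received": 2, "Administrative upheld": 0},
]

long_form = melt_complaints(wide)
for record in long_form:
    print(record)
```

Each surviving record then corresponds to one line that a mail merge, or any other template, would need to display.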
Via @simonperry, news that AP will use robots to write some business stories (Automated Insights are one of several companies I’ve been tracking over the years who are involved in such activities, eg Notes on Narrative Science and Automated Insights).
The claim is that using algorithms to do the procedural writing opens up time for the journalists to do more of the sensemaking. One way I see this is that we can use data2text techniques to produce human readable press releases of things like statistical releases, which has a couple of advantages at least.
Firstly, the grunt – and error prone – work of running the numbers (calculating month on month or year on year changes, handling seasonal adjustments etc) can be handled by machines using transparent and reproducible algorithms. Secondly, churning numbers into simple words (“x went up month on month from Sept 2013 to Oct 2013 and down year on year from 2012”) makes them searchable using words, rather than having to write our own database or spreadsheet queries with lots of inequalities in them.
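A toy example of the second point, with invented series names and figures: once the comparisons have been turned into words, finding everything that went "down year on year" is a plain text search rather than a query full of inequalities.

```python
def direction(current, previous):
    """Describe the direction of a change in words."""
    if current > previous:
        return "went up"
    if current < previous:
        return "went down"
    return "was unchanged"


def change_sentence(name, current, month_ago, year_ago):
    """Summarise month on month and year on year changes as a sentence."""
    return (
        f"{name} {direction(current, month_ago)} month on month "
        f"and {direction(current, year_ago)} year on year."
    )


# Made-up figures: (current value, value a month ago, value a year ago)
series = {
    "Series A": (105, 100, 120),
    "Series B": (90, 95, 80),
}

sentences = [change_sentence(name, *values) for name, values in series.items()]

# Search using words rather than inequalities
matches = [s for s in sentences if "down year on year" in s]
print(matches)
```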
In this respect, something that’s been on my to do list for way too long is to produce some simple “press release” generators based on ONS releases (something I touched on in Data Textualisation – Making Human Readable Sense of Data).
Matt Waite’s upcoming course on “automated story bots” looks like it might produce some handy resources in this regard (code repo). In the meantime, he already shared the code described in How to write 261 leads in a fraction of a second here: ucr-story-bot.
For the longer term, on my “to ponder” list is what might something like “The Grammar of Graphics” be for data textualisation? (For background, see A Simple Introduction to the Graphing Philosophy of ggplot2.)
For example, what might a ggplot2 inspired gtplot library look like for converting data tables not into chart elements, but textual elements? Does it even make sense to try to construct such a grammar? What would the corollaries to aesthetics, geoms and scales be?
I think I perhaps need to mock up some examples to see if anything comes to mind about what the function names, as well as the outputs, might look like, let alone the code to implement them! Or maybe code first is the way, to get a feel for how to build up the grammar from sensible looking implementation elements? Or more likely, perhaps a bit of iteration may be required?!
A picture may be worth a thousand words, but whilst many of us may get a pre-attentive gut reaction reading from a data set visualised using a chart type we’re familiar with, how many of us actually take the time to read a chart thoroughly and maybe verbalise, even if only to ourselves, what the marks on the chart mean, and how they relate to each other? (See How fertility rates affect population for an example of how to read a particular sort of chart.)
An idea that I’m finding increasingly attractive is the notion of text visualisation (or text visualization for the US-English imperialistic searchbots). That is, the generation of mechanical text from data tables so we can read words that describe the numbers – and how they relate – rather than looking at pictures of them or trying to make sense of the table itself.
Here’s a quick example of the sort of thing I mean – the generation of this piece of text:
The total number of people claiming Job Seeker’s Allowance (JSA) on the Isle of Wight in October was 2781, up 94 from 2687 in September, 2013, and down 377 from 3158 in October, 2012.
from a data table that can be sliced like this:
In the same way that we make narrative decisions when it comes to choosing what to put into a data visualisation, as well as how to read it (and how the various elements displayed in it relate to each other), so we make choices about the textual, or narrative, mapping from the data set to the text version (that is, the data textualisation) of it. When we present a chart or data table to a reader, we can try to influence their reading of it in a variety of ways: by choosing the sort order of bars on a bar chart, or rows in a table, for example; or by highlighting one or more elements in a chart or table through the use of colour, font, transparency, and so on.
The actual reading of the chart or table is still largely under the control of the reader, however, and may be thought of as non-linear in the sense that the author of the chart or table can’t really control the order in which the various attributes of the table or chart, or relationships between the various elements, are encountered by the reader. In a linear text, however, the author retains a far more significant degree of control over the exposition, and the way it is presented to the reader.
There is thus a considerable amount of editorial judgement put into the mapping from a data table to text interpretations of the data contained within a particular row, or down a column, or from some combination thereof. The selection of the data points and how the relationships between them are expressed in the sentences formed around them directs attention in terms of how to read the data in a very literal way.
There may also be a certain amount of algorithmic analysis used along the way as sentences are constructed from looking at the relationships between different data elements: “up 94” is a representation (both in the sense of rep-resentation and re-presentation) of a month on month change of +94, and “down 377” is generated mechanically from a year on year comparison.
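By way of illustration, here is a minimal Python sketch of how a sentence like the JSA one might be assembled from three numbers. It uses the figures quoted above, but the function names and the template are my own assumptions for the example, not a port of the original proof of concept.

```python
def signed_change(current, previous):
    """Turn a numerical difference into an "up N" / "down N" phrase."""
    diff = current - previous
    if diff > 0:
        return f"up {diff}"
    if diff < 0:
        return f"down {-diff}"
    return "unchanged"


def jsa_sentence(area_phrase, month, prev_month, year,
                 current, month_ago, year_ago):
    """Fill a fixed sentence template from the current figure and the
    figures for the previous month and the same month a year earlier."""
    return (
        "The total number of people claiming Job Seeker's Allowance (JSA) "
        f"{area_phrase} in {month} was {current}, "
        f"{signed_change(current, month_ago)} from {month_ago} "
        f"in {prev_month}, {year}, and "
        f"{signed_change(current, year_ago)} from {year_ago} "
        f"in {month}, {year - 1}."
    )


sentence = jsa_sentence(
    "on the Isle of Wight", "October", "September", 2013,
    current=2781, month_ago=2687, year_ago=3158,
)
print(sentence)
```

The editorial choices are all baked into the template: which comparisons get made, the order in which they appear, and the words used to describe them.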
Every cell in a table may be a fact that can be reported, but there are many more stories to be told by comparing how different data elements in a table stand in relation to each other.
The area of geekery related to this style of computing is known as NLG – natural language generation – but I’ve not found any useful code libraries (in R or Python, preferably…) for messing around with it. (The JSA example above was generated using R as a proof of concept around generating monthly press releases from ONS/nomis job-figures.)
PS why “data textualisation”, when we can consider even graphical devices as “texts” to be read? I considered “data characterisation” in the sense of turning data into characters, but characterisation is more general a term. Data narration was another possibility, but those crazy Americans patenting everything that moves might think I was “stealing” ideas from Narrative Science. Narrative Science (as well as Data2Text and Automated Insights etc. (who else should I mention?)) are certainly interesting but I have no idea how any of them do what they do. And in terms of narrating data stories – I think that’s a higher level process than the mechanical textualisation I want to start with. Which is not to say I don’t also have a few ideas about how to weave a bit of analysis into the textualisation process too…