How Might Data Journalists Show Their Working? Sweave
If part of the role of data journalism is to make transparent the justification behind claims that are, or aren’t, backed up by data, there’s good reason to suppose that the journalists should be able to back up their own data-based claims with evidence about how they made use of the data. Posting links to raw data helps to a certain extent – at least third parties can then explore the data themselves and check the claims the press are making – but you could also argue that the journalists should make available their notes on how they worked the data. (The same is true of public reports, where summary statistics and charts are included in a report, along with a link to the raw data, but with no transparency about how the summary statistics and charts were actually produced from the data.)
In Power Tools for Aspiring Data Journalists: R, I explored how we might use the R statistical programming language to replicate a chart that appeared in one of Ben Goldacre’s Bad Science columns. I included code snippets in the post, along with the figures they generated. But is there a way of getting even closer to the source, as it were, and produce documents that essentially generate their output from some sort of “source code”?
For example, take this view of my working relating to the production of the funnel chart described in Goldacre’s column:
You can find the actual “source code” for that document here: bowel cancer funnel plot working notes. If you load it into something like RStudio, you can “run” the code and generate your own PDF from it.
The “source” of the document includes both text and R code. When the Sweave document is processed, the R code contained within the document is executed and the results are also included in the document. The charts shown in the report are generated directly from the code included in the document, using data pulled into the document from a source referenced within the document. If the source data is changed, or the R code is changed, what’s contained in the output document will change as well.
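As a rough illustration of that mechanism (not the bowel cancer notes themselves), the sketch below writes a tiny Sweave source file and processes it with Sweave() from R's utils package. The file name demo.Rnw, the chunk labels and the numbers are all made up for the example.

```r
# A minimal Sweave (.Rnw) source: LaTeX text with embedded R code chunks.
# Chunks start with <<label, options>>= and end with @.
rnw <- c(
  "\\documentclass{article}",
  "\\begin{document}",
  "Some narrative text explaining the analysis.",
  "",
  "<<summary, echo=TRUE>>=",
  "x <- c(12, 15, 9, 20)  # placeholder data, not the bowel cancer figures",
  "mean(x)",
  "@",
  "",
  "The mean of the values is \\Sexpr{mean(x)}.",
  "",
  "<<scatter, fig=TRUE, echo=FALSE>>=",
  "plot(x)",
  "@",
  "\\end{document}"
)

# Work in a throwaway directory so nothing is left lying around.
workdir <- tempfile("sweave-demo")
dir.create(workdir)
writeLines(rnw, file.path(workdir, "demo.Rnw"))

# Sweave() executes each chunk and writes demo.tex (plus the figure file)
# into the current working directory.
old <- setwd(workdir)
Sweave("demo.Rnw")
setwd(old)

tex_file <- file.path(workdir, "demo.tex")
cat(readLines(tex_file), sep = "\n")
```

Running LaTeX over the resulting demo.tex (for example with tools::texi2pdf()) would then give the PDF. Change the data or the code in a chunk, re-run, and the text, numbers and chart in the output all change with it.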
This sort of workflow will be familiar to many experimental scientists, but I wonder: is it something that data journalists have considered, at least as a way of keeping working notes about data related projects they are working on?
PS as well as Sweave, see dexy.it, which generalises the Sweave approach to allow you to create self-documenting software/code. Educators, also take note…;-)

There seem to be some problems with your use of Sweave. The first ggplot doesn’t appear, and the code for the other plots is part of the plot image so it is not rendered well.
Kent Johnson
November 1, 2011 at 5:20 pm
@Kent I’m not sure what to suggest… the file runs ok for me in RStudio when I copy and paste the raw file from the gist that’s online (you did copy and paste the raw file, yes?). I’m still an R and Sweave novice, so my debugging skills are still pretty limited, but without seeing the output trace of your run or the document it produces, I can’t apply even those limited skills/observations…
Tony Hirst
November 1, 2011 at 8:05 pm
Thanks! I look forward to looking at it closer.
I got an error when I tried this:
https://raw.github.com/gist/1330309/acece1cff5e985c5d19df7b0f3e4800eb202ffd9/bowelCancer.Rnw
The header wasn’t parsed properly (it had some NA, not sure why header= didn’t work).
As a quick fix, I set the column names on line 28
names(cancerdata)<-c('Area','Rate','Population','Number')
ChrisL
November 1, 2011 at 5:28 pm
@ChrisL Hmmm… the header should be being set in the readHTML function already? readHTMLTable( srcURL, which=1, header=c('Area','Rate','Population','Number') )
I’m afraid I don’t really understand R enough to be able to debug it much further?! Thanks for suggesting the workaround (that was actually the route I originally took, but then the header thing seemed to be more elegant….)
Tony Hirst
November 1, 2011 at 7:59 pm
I don’t understand either. It should have worked. I’m using XML 3.4-3 on R 2.13.2, linux, maybe it makes a difference.
Chris L
November 1, 2011 at 8:08 pm
I’m on a Mac using I have no idea what! My version of R is maybe a 4 month old install, and I’m using the current version of RStudio.
Tony Hirst
November 1, 2011 at 8:10 pm
I tweaked the gist to add your fix in a commented out line, as a crib… thanks:-)
Tony Hirst
November 1, 2011 at 8:13 pm
@chrisL On another point, I wondered about trying to automate the setting of number.seq to have 1000 steps in the range 0..max(number) or min(number)..max(number). Would that make sense do you think?
Tony Hirst
November 1, 2011 at 8:44 pm
To have number.seq <- seq(min(number), max(number), 100) you mean?
Yes I guess it'd be a good enough heuristic. You'll have all the data shown as well.
(100 points is enough to get a smooth curve)
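One thing worth checking when trying this: the third positional argument of seq() is the step size (`by`), not the number of points. A small sketch of the difference, using made-up values for `number`:

```r
number <- c(250, 1200, 4800, 15000)  # hypothetical values, for illustration only

# Third positional argument is `by`: steps of 100, however many that takes.
by_step <- seq(min(number), max(number), 100)

# `length.out` fixes the number of points instead, and always reaches max(number).
n_points <- seq(min(number), max(number), length.out = 100)

length(by_step)   # 148
length(n_points)  # 100
```

With `length.out` the curve gets the same number of points whatever the range of the data, which is probably what you want for an automated setting of number.seq.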
Also, I was playing with the colouring and just realized that one has to go to shape 21 to get the transparency working. For example:
geom_point(shape=21,fill=alpha('#FF7F2A',0.5),colour=alpha('#000000',0.7))
Thanks again for your post!
Chris L
November 1, 2011 at 10:05 pm
@ChrisL Do you know if there is a z-value lookup function in R anywhere (I can’t find it)? It could then be used to generalise the funnel plot function further, and allow the user to specify the confidence values that are plotted.
Tony Hirst
November 2, 2011 at 8:07 am
Sorry, I meant “to get the transparency and both the fill and colour options working”. I wanted the shape to be an orange disk with a black circle.
Chris L
November 1, 2011 at 10:08 pm
@ChrisL Re: colour and transparency – I’ll try that… I still haven’t got my head round colour palettes etc in ggplot at all yet… there’s a few things I want to try out with heatmaps as soon as I crack it:-)
PS thanks for the feedback:-)
Tony Hirst
November 1, 2011 at 11:09 pm
Re: the z-scores
providing it’s normally distributed, you can simply calculate them with pnorm and qnorm:
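A minimal sketch of the idea (not the commenter's original snippet), assuming two-sided limits under a normal distribution. The helper names `z_value` and `funnel_limits` are hypothetical, and the limits use the normal approximation to the binomial:

```r
# Two-sided z-value for a given confidence level, assuming a normal distribution.
z_value <- function(conf) qnorm(1 - (1 - conf) / 2)

z <- z_value(c(0.95, 0.999))
round(z, 2)        # 1.96 3.29

# pnorm() goes the other way: from a z-value back to the confidence level.
2 * pnorm(z) - 1

# Hypothetical helper: funnel limits around an overall rate p for population
# sizes n, using the normal approximation to the binomial.
funnel_limits <- function(p, n, conf = 0.95) {
  half_width <- z_value(conf) * sqrt(p * (1 - p) / n)
  data.frame(n = n,
             lower = pmax(0, p - half_width),  # a rate cannot go below zero
             upper = p + half_width)
}

# Illustrative numbers only, not the bowel cancer data.
funnel_limits(p = 0.0002, n = c(50000, 200000, 800000), conf = 0.999)
```

Passing `conf` as an argument is what would let the user pick which confidence limits get drawn, rather than hard-coding 1.96 and 3.29.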
Chris L
November 2, 2011 at 9:58 am
@ChrisL thanks..I fear my lack of stats knowledge is showing through;-)
Tony Hirst
November 2, 2011 at 10:45 am
[...] also put together a couple of posts describing how the funnel plot could be generated from a data set using the statistical programming [...]
Data Referenced Journalism and the Media – Still a Long Way to Go Yet? « OUseful.Info, the blog…
November 5, 2011 at 1:26 am
[...] also: Quick Core Dump of Idle Thoughts on the “Making Open Data Real” Consultation, How Might Data Journalists Show Their Working? Sweave. [...]
Why Open Data Dumps On Their Own Add Little to Transparency… « OUseful.Info, the blog…
January 5, 2012 at 6:15 pm
[...] how to generate reports that can (optionally) also self-document with actually source R code, see How might data journalists show their working? Sweave. The code used in, and comments added to, that post make further refinements to the funnel plot [...]
Power Tools for Aspiring Data Journalists: Funnel Plots in R « OUseful.Info, the blog…
January 12, 2012 at 10:18 am
[...] to use archived tweets. When I get a chance, I’ll try to wrap this into a Sweave script (cf. How Might Data Journalists Show Their Working? Sweave for the automated generation of PDF and HTML reports.). [...]
A Quick View Over a MASHe Google Spreadsheet Twitter Archive of UKGC12 Tweets « OUseful.Info, the blog…
January 22, 2012 at 12:34 pm
[...] took more notes of the useful sites I went to. I’m pretty sure I got started with Tony Hirst’s How Might Data Journalists Show Their Working? Sweave, I know also that I looked at Vanderbilt’s Converting Documents Produced by Sweave, Nicola [...]
OER Visualisation Project: Exploring automated reporting using linked data and R/Sweave/R2HTML [day 36] – MASHe
February 7, 2012 at 8:32 pm