How Might Data Journalists Show Their Working? Sweave

If part of the role of data journalism is to make transparent the justification behind claims that are, or aren’t, backed up by data, there’s good reason to suppose that journalists should be able to back up their own data-based claims with evidence about how they used the data. Posting links to raw data helps to a certain extent – at least third parties can then explore the data themselves and check the claims the press are making – but you could argue that journalists should also make available their notes on how they worked the data. (The same is true of public reports, where summary statistics and charts are included along with a link to the raw data, but with no transparency about how the summaries and charts were actually produced from the data.)

In Power Tools for Aspiring Data Journalists: R, I explored how we might use the R statistical programming language to replicate a chart that appeared in one of Ben Goldacre’s Bad Science columns. I included code snippets in the post, along with the figures they generated. But is there a way of getting even closer to the source, as it were, and producing documents that essentially generate their output from some sort of “source code”?

For example, take this view of my working relating to the production of the funnel chart described in Goldacre’s column:

You can find the actual “source code” for that document here: bowel cancer funnel plot working notes. If you load it into something like RStudio, you can “run” the code and generate your own PDF from it.

The “source” of the document includes both text and R code. When the Sweave document is processed, the R code contained within the document is executed and the results are also included in the output. The charts shown in the report are generated directly from the code included in the document, using data pulled into the document from a source referenced within the document. If the source data is changed, or the R code is changed, what’s contained in the output document will change as well.
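As a rough illustration of the idea (this is not the working notes file linked above), a minimal Sweave source might look something like the sketch below. The file name, the chunk label and the made-up `rates` values are all assumptions for the sake of the example; the only function relied on is `Sweave()` from R's built-in utils package, which weaves the text and code into a .tex file that can then be run through LaTeX.

```r
# A minimal Sweave source: LaTeX text with an embedded R chunk.
# The rates below are made-up values, used purely for illustration.
rnw_lines <- c(
  "\\documentclass{article}",
  "\\begin{document}",
  "Some explanatory text, followed by a chunk of R code:",
  "<<summary, echo=TRUE>>=",
  "rates <- c(10.2, 11.5, 9.8)",
  "mean(rates)",
  "@",
  "The mean rate is \\Sexpr{round(mean(rates), 1)}.",
  "\\end{document}"
)

# Write the source to a temporary file so the example is self-contained.
rnw_file <- file.path(tempdir(), "notes.Rnw")
writeLines(rnw_lines, rnw_file)

# Sweave writes its .tex output to the current working directory,
# so switch to the temporary directory while weaving.
old_wd <- setwd(tempdir())
Sweave(rnw_file, quiet = TRUE)
setwd(old_wd)

# The woven LaTeX file contains the text, the code and its results.
tex_file <- file.path(tempdir(), "notes.tex")
```

Changing the values in the chunk and re-running `Sweave()` regenerates the output, which is the point: the reported figures are never typed in by hand.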

This sort of workflow will be familiar to many experimental scientists, but I wonder: is it something that data journalists have considered, at least as a way of keeping working notes about the data-related projects they are working on?

PS as well as Sweave, see dexy.it, which generalises the Sweave approach to allow you to create self-documenting software/code. Educators, also take note…;-)

Author: Tony Hirst

I'm a Senior Lecturer at The Open University, with an interest in #opendata policy and practice, as well as general web tinkering...

19 thoughts on “How Might Data Journalists Show Their Working? Sweave”

  1. There seem to be some problems with your use of Sweave. The first ggplot doesn’t appear, and the code for the other plots is part of the plot image so it is not rendered well.

    1. @Kent I’m not sure what to suggest… the file runs ok for me in RStudio when I copy and paste the raw file from the gist that’s online (you did copy and paste the raw file, yes?). I’m still an R and Sweave novice, so my debugging skills are pretty limited, but without seeing the output trace of your run or the document it produces, I can’t apply even those limited skills/observations…

    1. @ChrisL Hmmm… the header should already be set in the readHTMLTable call? readHTMLTable( srcURL, which=1, header=c('Area','Rate','Population','Number') )
      I’m afraid I don’t really understand R enough to be able to debug it much further?! Thanks for suggesting the workaround (that was actually the route I originally took, but then the header thing seemed to be more elegant….)

    2. @ChrisL On another point, I wondered about trying to automate the setting of number.seq to have 1000 steps in the range 0..max(number) or min(number)..max(number). Would that make sense, do you think?

      1. To have number.seq <- seq(min(number), max(number), length.out=100) you mean?
        Yes I guess it'd be a good enough heuristic. You'll have all the data shown as well.
        (100 points is enough to get a smooth curve)

        Also, I was playing with the colouring and just realized that one has to go to shape 21 to get the transparency working. For example:
        geom_point(shape=21,fill=alpha('#FF7F2A',0.5),colour=alpha('#000000',0.7))

        Thanks again for your post!

        1. @ChrisL Do you know if there is a z-value lookup function in R anywhere (I can’t find it)? It could then be used to generalise the funnel plot function further, and allow the user to specify the confidence values that are plotted.

    3. Sorry, I meant “to get the transparency and both the fill and colour options working”. I wanted the shape to be an orange disk with a black circle.

      1. @ChrisL Re: colour and transparency – I’ll try that… I still haven’t got my head round colour palettes etc in ggplot at all yet… there’s a few things I want to try out with heatmaps as soon as I crack it:-)
        PS thanks for the feedback:-)

  2. Re: the z-scores
    providing it’s normally distributed, you can simply calculate them with pnorm and qnorm:

    z<-c(0.67,1,1.64,1.96,2.58)
    p<-1-2*(1-pnorm(z))
    print(p,digits=2)
    # and back
    qnorm((p+1)/2)
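
Putting the qnorm lookup together with the number.seq idea discussed earlier in the thread, a generalised funnel limit calculation might be sketched as follows. This is only an illustration under stated assumptions: `funnel_limits` is a hypothetical helper name, the limits use the normal approximation to the binomial, and the population sizes and overall rate are made-up values rather than the figures from the original chart.

```r
# Hypothetical helper: funnel plot control limits at a chosen confidence level,
# assuming the normal approximation to the binomial is acceptable.
funnel_limits <- function(p, n, conf = 0.95) {
  z <- qnorm(1 - (1 - conf) / 2)   # two-sided z-value for the requested level
  se <- sqrt(p * (1 - p) / n)      # standard error of a proportion at size n
  data.frame(n = n, lower = p - z * se, upper = p + z * se)
}

# Made-up population sizes; the sequence spans the observed range in 100 points.
number <- c(5000, 20000, 75000, 150000)
number.seq <- seq(min(number), max(number), length.out = 100)

# Limits at two confidence levels around an assumed overall rate.
lims95  <- funnel_limits(p = 0.0002, n = number.seq, conf = 0.95)
lims998 <- funnel_limits(p = 0.0002, n = number.seq, conf = 0.998)
head(lims95)
```

The lower limit can dip below zero for small populations under this approximation, so a real chart would probably want to clip it at zero or use exact binomial limits instead.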
    

