OUseful.Info, the blog…

Trying to find useful things to do with emerging technologies in open education

How Might Data Journalists Show Their Working? Sweave

If part of the role of data journalism is to make transparent the justification behind claims that are, or aren’t, backed up by data, there’s good reason to suppose that the journalists should be able to back up their own data-based claims with evidence about how they made use of the data. Posting links to raw data helps to a certain extent – at least third parties can then explore the data themselves and check the claims the press are making – but you could also argue that journalists should make available their notes on how they worked the data. (The same is true of public reports, where summary statistics and charts are included in a report, along with a link to the raw data, but with no transparency about how the summary reports/charts were actually produced from the data.)

In Power Tools for Aspiring Data Journalists: R, I explored how we might use the R statistical programming language to replicate a chart that appeared in one of Ben Goldacre’s Bad Science columns. I included code snippets in the post, along with the figures they generated. But is there a way of getting even closer to the source, as it were, and produce documents that essentially generate their output from some sort of “source code”?

For example, take this view of my working relating to the production of the funnel chart described in Goldacre’s column:

You can find the actual “source code” for that document here: bowel cancer funnel plot working notes. If you load it into something like RStudio, you can “run” the code and generate your own PDF from it.

The “source” of the document includes both text and R code. When the Sweave document is processed, the R code contained within the document is executed and the results are also included in the document. The charts shown in the report are generated directly from the code included in the document, using data pulled into the document from a source referenced within the document. If the source data is changed, or the R code is changed, what’s contained in the output document will change as well.
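To give a feel for what such a “source” looks like, here is a minimal sketch. It is not the bowel cancer working notes file; the file name `example.Rnw`, the chunk label `summary` and the toy numbers are all made up for illustration. The R code writes a tiny Sweave document to a temporary folder and then processes it with `Sweave()`, which runs the embedded R and writes a `.tex` file (turning that into a PDF is a separate LaTeX step).

```r
# A tiny Sweave source: LaTeX text plus one R chunk (between <<...>>= and @)
# and one inline \Sexpr{} expression. All contents here are illustrative.
rnw <- c(
  "\\documentclass{article}",
  "\\begin{document}",
  "Some explanatory text, followed by a code chunk.",
  "<<summary, echo=TRUE>>=",
  "x <- c(2, 4, 6)",
  "mean(x)",
  "@",
  "The mean reported inline is \\Sexpr{mean(x)}.",
  "\\end{document}"
)

workdir <- tempdir()
rnw_file <- file.path(workdir, "example.Rnw")
writeLines(rnw, rnw_file)

# Sweave writes its output into the current working directory,
# so switch to the temporary folder while it runs.
old_wd <- setwd(workdir)
utils::Sweave(rnw_file, quiet = TRUE)
setwd(old_wd)

# The generated LaTeX now contains the code, its output, and the inline result.
tex <- readLines(file.path(workdir, "example.tex"))
cat(tex, sep = "\n")
```

Change the numbers in the chunk and rerun, and the value reported in the text changes with them, which is the whole point of the approach.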

This sort of workflow will be familiar to many experimental scientists, but I wonder: is it something that data journalists have considered, at least as a way of keeping working notes about data related projects they are working on?

PS as well as Sweave, see dexy.it, which generalises the Sweave approach to allow you to create self-documenting software/code. Educators, also take note…;-)

Written by Tony Hirst

November 1, 2011 at 11:04 am

Posted in Data, onlinejournalismblog, Rstats

19 Responses

  1. There seem to be some problems with your use of Sweave. The first ggplot doesn’t appear, and the code for the other plots is part of the plot image so it is not rendered well.

    Kent Johnson

    November 1, 2011 at 5:20 pm

    • @Kent I’m not sure what to suggest… the file runs ok for me in RStudio when I copy and paste the raw file from the gist that’s online (you did copy and paste the raw file, yes?). I’m still an R and Sweave novice, so my debugging skills are still pretty limited, but without seeing the output trace of your run or the document it produces, I can’t apply even those limited skills/observations…

      Tony Hirst

      November 1, 2011 at 8:05 pm

  2. Thanks! I look forward to looking at it closer.

    I got an error when I tried this:
    https://raw.github.com/gist/1330309/acece1cff5e985c5d19df7b0f3e4800eb202ffd9/bowelCancer.Rnw
    The header wasn’t parsed properly (it had some NA, not sure why header= didn’t work).
    As a quick fix, I set the columns names on line 28
    names(cancerdata)<-c('Area','Rate','Population','Number')

    ChrisL

    November 1, 2011 at 5:28 pm
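The workaround above can be tried without touching the network. This is only a sketch: the data frame below is a hypothetical stand-in for the scraped table (the areas and figures are invented), built without column names to mimic a header that was not picked up.

```r
# Hypothetical stand-in for the scraped table; values are invented.
cancerdata <- data.frame(
  c("Area A", "Area B"),
  c(12.5, 9.8),
  c(150000, 98000),
  c(19, 10)
)

# The quick fix: assign the column names explicitly after reading the table.
names(cancerdata) <- c("Area", "Rate", "Population", "Number")
str(cancerdata)
```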

    • @ChrisL Hmmm… the header should be being set in the readHTML function already? readHTMLTable( srcURL, which=1, header=c('Area','Rate','Population','Number') )
      I’m afraid I don’t really understand R enough to be able to debug it much further?! Thanks for suggesting the workaround (that was actually the route I originally took, but then the header thing seemed to be more elegant….)

      Tony Hirst

      November 1, 2011 at 7:59 pm

      • I don’t understand either. It should have worked. I’m using XML 3.4-3 on R 2.13.2, linux, maybe it makes a difference.

        Chris L

        November 1, 2011 at 8:08 pm

    • I’m on a Mac using I have no idea what! My version of R is maybe a 4 month old install, and I’m using the current version of RStudio.

      Tony Hirst

      November 1, 2011 at 8:10 pm

    • I tweaked the gist to add your fix in a commented out line, as a crib… thanks:-)

      Tony Hirst

      November 1, 2011 at 8:13 pm

    • @chrisL On another point, I wondered about trying to automate the setting of number.seq to have 1000 steps in the range 0..max(number) or min(number)..max(number). Would that make sense do you think?

      Tony Hirst

      November 1, 2011 at 8:44 pm

      • To have number.seq <- seq(min(number), max(number), length.out=100) you mean?
        Yes I guess it'd be a good enough heuristic. You'll have all the data shown as well.
        (100 points is enough to get a smooth curve)

        Also, I was playing with the colouring and just realized that one has to go to shape 21 to get the transparency working. For example:
        geom_point(shape=21,fill=alpha('#FF7F2A',0.5),colour=alpha('#000000',0.7))

        Thanks again for your post!

        Chris L

        November 1, 2011 at 10:05 pm
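One thing to watch when automating the range: the third positional argument of `seq()` is the step size (`by`), so asking for a fixed number of points means naming `length.out`. A minimal sketch, using made-up `number` values rather than the real population figures:

```r
# Hypothetical population sizes, for illustration only.
number <- c(1200, 5400, 23000, 87000)

# 100 evenly spaced points spanning the observed range.
number.seq <- seq(min(number), max(number), length.out = 100)

length(number.seq)
range(number.seq)
```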

        • @ChrisL Do you know if there is a z-value lookup function in R anywhere (I can’t find it)? It could be then used to generalise the funnel plot function further, and allow the user to specify the confidence values that are plotted?

          Tony Hirst

          November 2, 2011 at 8:07 am

    • Sorry, I meant “to get the transparency and both the fill and colour options working”. I wanted the shape to be an orange disk with a black circle.

      Chris L

      November 1, 2011 at 10:08 pm

      • @ChrisL Re: colour and transparency – I’ll try that… I still haven’t got my head round colour palettes etc in ggplot at all yet… there’s a few things I want to try out with heatmaps as soon as I crack it:-)
        PS thanks for the feedback:-)

        Tony Hirst

        November 1, 2011 at 11:09 pm

  3. Re: the z-scores
    providing it’s normally distributed, you can simply calculate them with pnorm and qnorm:

    z<-c(0.67,1,1.64,1.96,2.58)
    p<-1-2*(1-pnorm(z))
    print(p,digits=2)
    # and back
    qnorm((p+1)/2)
    

    Chris L

    November 2, 2011 at 9:58 am
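Building on the `qnorm` suggestion, here is one way the lookup might be wrapped up so that the confidence levels become parameters. This is a sketch only: `z_for_confidence` and `funnel_limits` are hypothetical helper names, the rate and population sizes are invented, and the limits assume the usual normal approximation for a proportion (rate ± z × standard error), which may or may not match the funnel plot code in the working notes.

```r
# Two-sided confidence level -> z-value (hypothetical helper name).
z_for_confidence <- function(conf) qnorm((1 + conf) / 2)

z <- z_for_confidence(c(0.95, 0.998))
print(round(z, 2))  # 1.96 3.09

# Funnel limits around an overall rate p for population sizes n,
# assuming a normal approximation to the binomial (an assumption here).
funnel_limits <- function(p, n, conf) {
  z <- z_for_confidence(conf)
  se <- sqrt(p * (1 - p) / n)
  data.frame(n = n, lower = p - z * se, upper = p + z * se)
}

# Invented rate and population sizes, purely for illustration.
limits <- funnel_limits(p = 0.0002, n = c(1e5, 5e5, 1e6), conf = 0.95)
print(limits)
```

The limits narrow as `n` grows, which is what gives the plot its funnel shape.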

    • @ChrisL thanks..I fear my lack of stats knowledge is showing through;-)

      Tony Hirst

      November 2, 2011 at 10:45 am

  4. [...] also put together a couple of posts describing how the funnel plot could be generated from a data set using the statistical programming [...]

  5. [...] also: Quick Core Dump of Idle Thoughts on the “Making Open Data Real” Consultation , How Might Data Journalists Show Their Working? Sweave. [...]

  6. [...] how to generate reports that can (optionally) also self-document with actually source R code, see How might data journalists show their working? Sweave. The code used in, and comments added to, that post make further refinements to the funnel plot [...]

  7. [...] to use archived tweets. When I get a chance, I’ll try to wrap this into a Sweave script (cf. How Might Data Journalists Show Their Working? Sweave for the automated generation of PDF and HTML reports.). [...]

  8. [...] took more notes of the useful sites I went to. I’m pretty sure I got started with Tony Hirst’s How Might Data Journalists Show Their Working? Sweave, I know also that I looked at Vanderbilt’s Converting Documents Produced by Sweave, Nicola [...]


Comments are closed.
