If part of the role of data journalism is to make transparent the justification behind claims that are, or aren’t, backed up by data, there’s good reason to suppose that journalists should be able to back up their own data-based claims with evidence about how they made use of the data. Posting links to raw data helps to a certain extent – at least third parties can then explore the data themselves and check the claims the press are making – but you could also argue that journalists should publish their working notes describing how they processed the data. (The same is true of public reports, where summary statistics and charts are included, perhaps along with a link to the raw data, but with no transparency about how those summaries and charts were actually produced from the data.)
In Power Tools for Aspiring Data Journalists: R, I explored how we might use the R statistical programming language to replicate a chart that appeared in one of Ben Goldacre’s Bad Science columns. I included code snippets in the post, along with the figures they generated. But is there a way of getting even closer to the source, as it were, and producing documents that essentially generate their output from some sort of “source code”?
For example, take this view of my working relating to the production of the funnel chart described in Goldacre’s column:
You can find the actual “source code” for that document here: bowel cancer funnel plot working notes. If you load it into something like RStudio, you can “run” the code and generate your own PDF from it.
The “source” of the document includes both text and R code. When the Sweave document is processed, the R code contained within the document is executed and the results are also included in the document. The charts shown in the report are generated directly from the code included in the document, using data pulled into the document from a source referenced within the document. If the source data is changed, or the R code is changed, what’s contained in the output document will change as well.
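By way of illustration, here’s a minimal sketch of what a Sweave (.Rnw) file looks like – the chunk names and contents below are purely illustrative, not taken from the working notes themselves. LaTeX text and R code chunks are interleaved, with each chunk delimited by `<<…>>=` and `@`:

```latex
\documentclass{article}
\begin{document}

Some explanatory text, written in \LaTeX{} as usual.

<<summaryChunk, echo=TRUE>>=
# This R code runs when the document is processed;
# both the code and its output appear in the PDF
x <- rnorm(100)
summary(x)
@

<<plotChunk, fig=TRUE, echo=FALSE>>=
# fig=TRUE embeds the generated figure in the output document
plot(x)
@

\end{document}
```

Running `Sweave()` on the file (or clicking Compile PDF in RStudio) executes each chunk in turn and weaves the results into the final LaTeX/PDF output.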
This sort of workflow will be familiar to many experimental scientists, but I wonder: is it something that data journalists have considered, at least as a way of keeping working notes about data related projects they are working on?
PS as well as Sweave, see dexy.it, which generalises the Sweave approach to allow you to create self-documenting software/code. Educators, also take note…;-)
There seem to be some problems with your use of Sweave. The first ggplot doesn’t appear, and the code for the other plots is part of the plot image so it is not rendered well.
@Kent I’m not sure what to suggest… the file runs ok for me in RStudio when I copy and paste the raw file from the gist that’s online (you did copy and paste the raw file, yes?). I’m still an R and Sweave novice, so my debugging skills are pretty limited, but without seeing the output trace of your run, or the document it produces, I can’t apply even those limited skills/observations…
Thanks! I look forward to looking at it closer.
I got an error when I tried this:
https://raw.github.com/gist/1330309/acece1cff5e985c5d19df7b0f3e4800eb202ffd9/bowelCancer.Rnw
The header wasn’t parsed properly (it had some NAs; not sure why header= didn’t work).
As a quick fix, I set the column names on line 28:
names(cancerdata)<-c('Area','Rate','Population','Number')
@ChrisL Hmmm… the header should already be being set in the readHTMLTable call? readHTMLTable( srcURL, which=1, header=c('Area','Rate','Population','Number') )
I’m afraid I don’t really understand R enough to be able to debug it much further?! Thanks for suggesting the workaround (that was actually the route I originally took, but then the header thing seemed to be more elegant….)
I don’t understand either. It should have worked. I’m using XML 3.4-3 on R 2.13.2, linux, maybe it makes a difference.
I’m on a Mac using I have no idea what! My version of R is maybe a 4 month old install, and I’m using the current version of RStudio.
I tweaked the gist to add your fix in a commented out line, as a crib… thanks:-)
@chrisL On another point, I wondered about trying to automate the setting of number.seq to have 1000 steps in the range 0..max(number) or min(number)..max(number). Would that make sense do you think?
To have number.seq <- seq(min(number), max(number), 100) you mean?
Yes I guess it'd be a good enough heuristic. You'll have all the data shown as well.
(100 points is enough to get a smooth curve)
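Pulling those suggestions together, here’s a hedged sketch of what the automated version might look like – `number` is assumed to hold the sample sizes used in the funnel plot, and `length.out` is used instead of a fixed step size so you always get the same number of points regardless of the data range:

```r
# Illustrative sample sizes (stand-in for the population column in the
# bowel cancer data)
number <- c(250, 1200, 5400, 30000)

# Generate the x-axis sequence for the funnel limits automatically,
# spanning the observed range with a fixed number of points
number.seq <- seq(min(number), max(number), length.out = 100)

range(number.seq)   # covers the full range of the data
length(number.seq)  # 100 points - enough for a smooth curve
```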
Also, I was playing with the colouring and just realized that one has to go to shape 21 to get the transparency working. For example:
geom_point(shape=21, fill=alpha('#FF7F2A',0.5), colour=alpha('#000000',0.7))
Thanks again for your post!
@ChrisL Do you know if there is a z-value lookup function anywhere in R (I can’t find one)? It could then be used to generalise the funnel plot function further, and allow the user to specify the confidence values that are plotted.
Sorry, I meant “to get the transparency and both the fill and colour options working”. I wanted the shape to be an orange disk with a black circle.
@ChrisL Re: colour and transparency – I’ll try that… I still haven’t got my head round colour palettes etc in ggplot at all yet… there’s a few things I want to try out with heatmaps as soon as I crack it:-)
PS thanks for the feedback:-)
Re: the z-scores
provided it’s normally distributed, you can simply calculate them with pnorm and qnorm:
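For example (a sketch – the 0.95 and 0.999 confidence levels are just illustrative values a user might pass to a generalised funnel plot function):

```r
# qnorm gives the z-value for a given cumulative probability;
# for a two-sided confidence level, split the tail probability in half
conf.levels <- c(0.95, 0.999)
z <- qnorm(1 - (1 - conf.levels) / 2)
z  # approximately 1.96 and 3.29

# pnorm goes the other way: from a z-value back to a cumulative probability
pnorm(1.959964)  # approximately 0.975
```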
@ChrisL thanks..I fear my lack of stats knowledge is showing through;-)