Sketching Scatterplots to Demonstrate Different Correlations

Looking just now for an openly licensed graphic showing a set of scatterplots that illustrate different correlation coefficients between X and Y values, I couldn’t find one.

[UPDATE: following a comment, Rich Seiter has posted a much cleaner – and general – method here: NORTA Algorithm Examples; refer to that post – rather than this – for the method…(my archival copy of rseiter’s algorithm)]

So here’s a quick R script for constructing one, based on a Cross Validated question/answer (Generate two variables with precise pre-specified correlation):

library(MASS)

corrdata=function(samples=200,r=0){
  data = mvrnorm(n=samples, mu=c(0, 0), Sigma=matrix(c(1, r, r, 1), nrow=2), empirical=TRUE)
  X = data[, 1]  # standard normal (mu=0, sd=1)
  Y = data[, 2]  # standard normal (mu=0, sd=1)
  data.frame(x=X,y=Y)
}

df=data.frame()
for (i in c(1,0.8,0.5,0.2,0,-0.2,-0.5,-0.8,-1)){
  tmp=corrdata(200,i)
  tmp['corr']=i
  df=rbind(df,tmp)
}

library(ggplot2)

g=ggplot(df,aes(x=x,y=y))+geom_point(size=1)
g+facet_wrap(~corr)+ stat_smooth(method='lm',se=FALSE,color='red')

And here’s an example of the result:

scatterCorr

It’s actually a little tidier if we also add in + coord_fixed() to fix up the geometry/aspect ratio of the chart so the axes are of the same length:

scatterCorrSquare

So what sort of OER does that make this post?!;-)

PS methinks it would be nice to be able to use different distributions, such as a uniform distribution across x. Is there a similarly straightforward way of doing that?

UPDATE: via comments, rseiter (maybe Rich Seiter?) suggests the NORmal-To-Anything (NORTA) algorithm (about, also here). I have no idea what it does, but here’s what it looks like!;-)

//based on https://blog.ouseful.info/2014/12/17/sketching-scatterplots-to-demonstrate-different-correlations/#comment-69184
#The NORmal-To-Anything (NORTA) algorithm
library(MASS)
library(ggplot2)

#NORTA - h/t rseiter
corrdata2=function(samples, r){
  mu <- rep(0,4)
  Sigma <- matrix(r, nrow=4, ncol=4) + diag(4)*(1-r)
  rawvars <- mvrnorm(n=samples, mu=mu, Sigma=Sigma)
  #unifvars <- pnorm(rawvars)
  unifvars <- qunif(pnorm(rawvars)) # qunif not needed, but shows how to convert to other distributions
  print(cor(unifvars))
  unifvars
}

df2=data.frame()
for (i in c(1,0.9,0.6,0.4,0)){
  tmp=data.frame(corrdata2(200,i)[,1:2])
  tmp['corr']=i
  df2=rbind(df2,tmp)
}
g=ggplot(df2,aes(x=X1,y=X2))+geom_point(size=1)+facet_wrap(~corr)
g+ stat_smooth(method='lm',se=FALSE,color='red')+ coord_fixed()

Here’s what it looks like with 1000 points:

unifromScatterCorr

Note that with smaller samples, for the correlation at zero, the best fit line may wobble and may not have zero gradient, though in the following case, with 200 points, it looks okay…

unifscattercorrsmall

The method breaks if I set the correlation (r parameter) values to less than zero – Error in mvrnorm(n = samples, mu = mu, Sigma = Sigma) : ‘Sigma’ is not positive definite – but we can just negate the y-values (unifvars[,2]=-unifvars[,2]) and it seems to work…

If in the corrdata2 function we stick with the pnorm(rawvars) distribution rather than the uniform (qunif(pnorm(rawvars))) one, we get something that looks like this:

corrnorm1000

Hmmm. Not sure about that…?

PS see also this Anscombe’s Quartet notebook and this recipe for creating datasets with the same summary statistics

8 comments

  1. Scott Akenhead

    Nice contribution. I will use it. Thanks.
    But, you know what’s not nice?
    The distribution of the ratio of randomly distributed variables is not nice.
    The mathematical approach via Taylor Series will stop most analysts dead in their tracks. But every analyst needs to understand that a variable which is the ratio of two values, each estimated with uncertainty, such as survival (estimated survivors over estimated starters), can be surprisingly distributed and is best avoided. Can be seriously misleading if treated superficially. If unavoidable, at least recognize and acknowledge the distribution of the ratio you are bandying about. So a nifty tool like you presented, that simulates data showing how ratios are distributed, and what happens to those distributions as the shapes, scales, and correlation of numerator and denominator distributions change, would also be a nice contribution.

    • Tony Hirst

      @Scott re: “The distribution of the ratio of randomly distributed variables is not nice.”:-)
      What I originally wanted was a scatterplot that provided a strong visual representation of different correlation values, so I started out looking for something where the distribution of values over x was uniform. But I couldn’t find a code fragment that I could understand that did that;-) (If you have one, please share it:-)
      Even then, the code I reused I have taken on trust.

      Re: making a more general app – I guess it could be nice to be able to select different distributions across x and then generate appropriately correlated values for y. But is there a general recipe for this, or do we need to have different generator functions for each one? (My grasp of doing things with stats is still really hazy;-)

      • rseiter

        How about using the NORmal-To-Anything (NORTA) algorithm?

        library(MASS)
        library(ggplot2)
        mu <- rep(0,4)
        Sigma <- matrix(.7, nrow=4, ncol=4) + diag(4)*.3
        rawvars <- mvrnorm(n=10000, mu=mu, Sigma=Sigma)
        #unifvars <- pnorm(rawvars)
        unifvars <- qunif(pnorm(rawvars)) # qunif not needed, but shows how to convert to other distributions
        cor(unifvars)
        g <- ggplot(data.frame(unifvars[,1:2]),aes(x=unifvars[,1],y=unifvars[,2])) + geom_point(size=1) + stat_smooth(method='lm',se=FALSE,color='red')
        plot(g)

        Some explanation: Operational Risk Modelling and Management (Google eBook) – NORTA

  2. Pingback: Condensed News | Data Analytics & R