Looking just now for an openly licensed graphic showing a set of scatterplots that illustrate different correlation coefficients between X and Y values, I couldn’t find one.
[UPDATE: following a comment, Rich Seiter has posted a much cleaner – and general – method here: NORTA Algorithm Examples; refer to that post – rather than this – for the method…(my archival copy of rseiter’s algorithm)]
So here’s a quick R script for constructing one, based on a Cross Validated question/answer (Generate two variables with precise pre-specified correlation):
library(MASS) corrdata=function(samples=200,r=0){ data = mvrnorm(n=samples, mu=c(0, 0), Sigma=matrix(c(1, r, r, 1), nrow=2), empirical=TRUE) X = data[, 1] # standard normal (mu=0, sd=1) Y = data[, 2] # standard normal (mu=0, sd=1) data.frame(x=X,y=Y) } df=data.frame() for (i in c(1,0.8,0.5,0.2,0,-0.2,-0.5,-0.8,-1)){ tmp=corrdata(200,i) tmp['corr']=i df=rbind(df,tmp) } library(ggplot2) g=ggplot(df,aes(x=x,y=y))+geom_point(size=1) g+facet_wrap(~corr)+ stat_smooth(method='lm',se=FALSE,color='red')
And here’s an example of the result:
It’s actually a little tidier if we also add in + coord_fixed() to fix up the geometry/aspect ratio of the chart so the axes are of the same length:
So what sort of OER does that make this post?!;-)
PS methinks it would be nice to be able to use different distributions, such as a uniform distribution across x. Is there a similarly straightforward way of doing that?
UPDATE: via comments, rseiter (maybe Rich Seiter?) suggests the NORmal-To-Anything (NORTA) algorithm (about, also here). I have no idea what it does, but here’s what it looks like!;-)
//based on https://blog.ouseful.info/2014/12/17/sketching-scatterplots-to-demonstrate-different-correlations/#comment-69184 #The NORmal-To-Anything (NORTA) algorithm library(MASS) library(ggplot2) #NORTA - h/t rseiter corrdata2=function(samples, r){ mu <- rep(0,4) Sigma <- matrix(r, nrow=4, ncol=4) + diag(4)*(1-r) rawvars <- mvrnorm(n=samples, mu=mu, Sigma=Sigma) #unifvars <- pnorm(rawvars) unifvars <- qunif(pnorm(rawvars)) # qunif not needed, but shows how to convert to other distributions print(cor(unifvars)) unifvars } df2=data.frame() for (i in c(1,0.9,0.6,0.4,0)){ tmp=data.frame(corrdata2(200,i)[,1:2]) tmp['corr']=i df2=rbind(df2,tmp) } g=ggplot(df2,aes(x=X1,y=X2))+geom_point(size=1)+facet_wrap(~corr) g+ stat_smooth(method='lm',se=FALSE,color='red')+ coord_fixed()
Here’s what it looks like with 1000 points:
Note that with smaller samples, for the correlation at zero, the best fit line may wobble and may not have zero gradient, though in the following case, with 200 points, it looks okay…
The method breaks if I set the correlation (r parameter) values to less than zero – Error in mvrnorm(n = samples, mu = mu, Sigma = Sigma) : ‘Sigma’ is not positive definite – but we can just negate the y-values (unifvars[,2]=-unifvars[,2]) and it seems to work…
If in the corrdata2 function we stick with the pnorm(rawvars) distribution rather than the uniform (qunif(pnorm(rawvars))) one, we get something that looks like this:
Hmmm. Not sure about that…?
PS see also this Anscombe’s Quartet notebook and this recipe for creating datasets with the same summary statistics
PPS For a Python equivalent: https://stackoverflow.com/questions/18683821/generating-random-correlated-x-and-y-points-using-numpy