Looking just now for an openly licensed graphic showing a set of scatterplots that illustrate different correlation coefficients between X and Y values, I couldn’t find one.
[UPDATE: following a comment, Rich Seiter has posted a much cleaner – and general – method here: NORTA Algorithm Examples; refer to that post – rather than this – for the method…(my archival copy of rseiter’s algorithm)]
So here’s a quick R script for constructing one, based on a Cross Validated question/answer (Generate two variables with precise pre-specified correlation):
library(MASS) corrdata=function(samples=200,r=0){ data = mvrnorm(n=samples, mu=c(0, 0), Sigma=matrix(c(1, r, r, 1), nrow=2), empirical=TRUE) X = data[, 1] # standard normal (mu=0, sd=1) Y = data[, 2] # standard normal (mu=0, sd=1) data.frame(x=X,y=Y) } df=data.frame() for (i in c(1,0.8,0.5,0.2,0,-0.2,-0.5,-0.8,-1)){ tmp=corrdata(200,i) tmp['corr']=i df=rbind(df,tmp) } library(ggplot2) g=ggplot(df,aes(x=x,y=y))+geom_point(size=1) g+facet_wrap(~corr)+ stat_smooth(method='lm',se=FALSE,color='red')
And here’s an example of the result:
It’s actually a little tidier if we also add in + coord_fixed() to fix up the geometry/aspect ratio of the chart so the axes are of the same length:
So what sort of OER does that make this post?!;-)
PS methinks it would be nice to be able to use different distributions, such as a uniform distribution across x. Is there a similarly straightforward way of doing that?
UPDATE: via comments, rseiter (maybe Rich Seiter?) suggests the NORmal-To-Anything (NORTA) algorithm (about, also here). I have no idea what it does, but here’s what it looks like!;-)
//based on https://blog.ouseful.info/2014/12/17/sketching-scatterplots-to-demonstrate-different-correlations/#comment-69184 #The NORmal-To-Anything (NORTA) algorithm library(MASS) library(ggplot2) #NORTA - h/t rseiter corrdata2=function(samples, r){ mu <- rep(0,4) Sigma <- matrix(r, nrow=4, ncol=4) + diag(4)*(1-r) rawvars <- mvrnorm(n=samples, mu=mu, Sigma=Sigma) #unifvars <- pnorm(rawvars) unifvars <- qunif(pnorm(rawvars)) # qunif not needed, but shows how to convert to other distributions print(cor(unifvars)) unifvars } df2=data.frame() for (i in c(1,0.9,0.6,0.4,0)){ tmp=data.frame(corrdata2(200,i)[,1:2]) tmp['corr']=i df2=rbind(df2,tmp) } g=ggplot(df2,aes(x=X1,y=X2))+geom_point(size=1)+facet_wrap(~corr) g+ stat_smooth(method='lm',se=FALSE,color='red')+ coord_fixed()
Here’s what it looks like with 1000 points:
Note that with smaller samples, for the correlation at zero, the best fit line may wobble and may not have zero gradient, though in the following case, with 200 points, it looks okay…
The method breaks if I set the correlation (r parameter) values to less than zero – Error in mvrnorm(n = samples, mu = mu, Sigma = Sigma) : ‘Sigma’ is not positive definite – but we can just negate the y-values (unifvars[,2]=-unifvars[,2]) and it seems to work…
If in the corrdata2 function we stick with the pnorm(rawvars) distribution rather than the uniform (qunif(pnorm(rawvars))) one, we get something that looks like this:
Hmmm. Not sure about that…?
PS see also this Anscombe’s Quartet notebook and this recipe for creating datasets with the same summary statistics
PPS For a Python equivalent: https://stackoverflow.com/questions/18683821/generating-random-correlated-x-and-y-points-using-numpy
Maybe you could modify this method to use a different distribution:
https://stat.ethz.ch/pipermail/r-help/2007-April/128925.html
Nice contribution. I will use it. Thanks.
But, you know what’s not nice?
The distribution of the ratio of randomly distributed variables is not nice.
The mathematical approach via Taylor Series will stop most analysts dead in their tracks. But every analyst needs to understand that a variable which is the ratio of two values, each estimated with uncertainty, such as survival (estimated survivors over estimated starters), can be surprisingly distributed and is best avoided. Can be seriously misleading if treated superficially. If unavoidable, at least recognize and acknowledge the distribution of the ratio you are bandying about. So a nifty tool like you presented, that simulates data showing how ratios are distributed, and what happens to those distributions as the shapes, scales, and correlation of numerator and denominator distributions change, would also be a nice contribution.
@Scott re: “The distribution of the ratio of randomly distributed variables is not nice.”:-)
What I originally wanted was a scatterplot that provided a strong visual representation of different correlation values, so I started out looking for something where the distribution of values over x was uniform. But I couldn’t find a code fragment that I could understand that did that;-) (If you have one, please share it:-)
Even then, the code I reused I have taken on trust.
Re: making a more general app – I guess it could be nice to be able to select different distributions across x and then generate appropriately correlated values for y. But is there a general recipe for this, or do we need to have different generator functions for each one? (My grasp of doing things with stats is still really hazy;-)
How about using the NORmal-To-Anything (NORTA) algorithm?
library(MASS)
library(ggplot2)
mu <- rep(0,4)
Sigma <- matrix(.7, nrow=4, ncol=4) + diag(4)*.3
rawvars <- mvrnorm(n=10000, mu=mu, Sigma=Sigma)
#unifvars <- pnorm(rawvars)
unifvars <- qunif(pnorm(rawvars)) # qunif not needed, but shows how to convert to other distributions
cor(unifvars)
g <- ggplot(data.frame(unifvars[,1:2]),aes(x=unifvars[,1],y=unifvars[,2])) + geom_point(size=1) + stat_smooth(method='lm',se=FALSE,color='red')
plot(g)
Some explanation: Operational Risk Modelling and Management (Google eBook) – NORTA
Ah, interesting – thanks for that. (Not sure I understand what’s happening just yet, but will try to make sense of it! Duly added to the post – hope that’s okay with you?:-)
That’s OK with me. Thanks for the original post!
I published an extension of your plot above at http://rpubs.com/rseiter/50148
I included code that shows my output for pnorm without qunif (the second plot). Not sure why we are seeing different results.
@Rich Brilliant – thanks; I’ve added a link at the top of the post. Much appreciated:-)