How to generate a matrix with the same correlation coefficient between all variables?

Question

1 vote

How would I go about generating a 10x4 matrix, with values in each column between 1-10, that has equal correlation (0.5) between all variables?

i.e. if the matrix is X, I would like the corr(X) to look like this:

0  0.5  0.5  0.5
5  1.0  0.5  0.5
5  0.5  1.0  0.5
5  0.5  0.5  1.0

4 Comments
Show 2 older comments Hide 2 older comments

John D'Errico on 20 Mar 2018

Edited: John D'Errico on 20 Mar 2018

Open in MATLAB Online

You still have not said if your goal is to have a "randomly" generated set with that property. If they must be integers, etc. For example, is your goal something like this:

V
V =
           10       5.0059
       9.9696       3.3314
         5.69           10
            1            1
       1.6416       1.2878
       2.4309       1.6418
       2.9086       1.8561
       7.6598       3.9871
       3.9759       2.3348
       5.5832       3.0557
corrcoef(V)
ans =
            1          0.5
          0.5            1

So, there is a pair of vectors that appear fairly random, lie entirely in [1,10], and have exactly a correlation of 0.5. Were there reason to do so, I could show a method by which one would generate a larger set of vectors with an exact correlation matrix. But I won't bother to do so without knowing if this is your goal, since David has provided a solution that is entirely sufficient for what you have said so far.

David Lindén on 20 Mar 2018

ah sorry. No they do not need to be integers. normal distribution. It would be cool with randomly generated! but not necessary.

Sign in to comment.

Sign in to answer this question.

Follow Question

Answer 1

John D'Errico on 20 Mar 2018

Edited: John D'Errico on 20 Mar 2018

Open in MATLAB Online

2 votes

So, can we generate a "pseudo-random" array, with the EXACTLY specified correlation matrix? I'll play a little fast and loose with the distribution of the result here but I am limited in that respect because of the requirements.

C = [1.0  0.5  0.5  0.5
     0.5  1.0  0.5  0.5
     0.5  0.5  1.0  0.5
     0.5  0.5  0.5  1.0]
L = chol(C);
X = null([rand(5,10);ones(1,10)])*L;
X = X - min(X);
X = X./max(X)*9 + 1;

Does X satisfy your requirements, as I compute it above?

X
X =
          1            1       3.4325            1
     1.2749       1.4652            1       4.2534
      2.203       2.7917       3.3868       4.9848
     2.2315       2.8286           10       6.3546
     2.2306       2.8276       3.6105       4.8588
     2.3264       2.9522       4.3873       4.4211
         10       7.0027       8.0362       8.2408
     2.2413       2.8415       3.6973       4.8099
     2.2957           10       6.3947       6.6823
     2.3766       3.0175       4.7941           10
min(X)
ans =
     1     1     1     1
max(X)
ans =
    10    10    10    10
corrcoef(X)
ans =
            1          0.5          0.5          0.5
          0.5            1          0.5          0.5
          0.5          0.5            1          0.5
          0.5          0.5          0.5            1

The only requirement is that you have a current, or at least recent MATLAB release, since I used implicit expansion in some of those computations. With older releases, you can still use bsxfun.

If I re-run the above computations, you will see you get a different result each time, because of the call to rand in this early line:

X = null([rand(5,10);ones(1,10)])*L;

So a somewhat pseudo-random array, such that the correlation matrix is exactly as desired, and the min and max of the elements is satisfied.

There is some method in the madness of those computations. But I want to leave something for you to think about why it works. :-) Honestly, I think there is only one tricky part - to understand why I computed X initially as I did, and why that row of ones was included in the call to null.

You would want to think, for example, why this alternative would not have done quite as well for the initial computation of X:

X = null(rand(6,10))*L;

It will be close to the required correlation matrix, but not exact.

6 Comments
Show 4 older comments Hide 4 older comments

John D'Errico on 21 Mar 2018

Edited: John D'Errico on 21 Mar 2018

Open in MATLAB Online

Drat. I was hoping you would not ask for an explanation. ;-) As I think about it, I tend to answer roughly one question a day, because an interesting question/answer often requires an interesting explanation thereof. But I knew it deserved an explanation, or at least more than the subtle hint I did give. So here it goes...

First, what did chol do? This is a very classic trick to get a given correlation matrix, usually applied to multivariate Gaussian problems. And there, it is usually used for a covariance matrix. But a correlation matrix is really just a scaled covariance matrix, so no problem there. You can think of the Cholesky factor as a transformation matrix, taking you from a problem with uncorrelated variables to one that has the desired correlations.

But how do we compute a covariance matrix using linear algebra? It looks like this:

C = (X - Xbar)'*(X - Xbar)/(n-1)

So, essentially we subtract off the mean of each column from X. Then you multiply the result by itself transposed, divide by n-1, where n is the number of rows, because this is an expectation computed from a sample.

Look at the term (X-Xbar) in there. What does that mean? Assume that Xis a data matrix, here 10x4. Xbar is a matrix that has the mean of the rows of X, replicated down the rows. We can write Xbar as the rank 1 matrix formed by an outer product:

Xbar = ones(n,1)*mean(X)

So mean(X) is a single row, the average of all rows of X. ones(n,1) is a column vector of 1's. All of this may seem pretty simple, but I wanted to spell it out, because it is important. Why?

I gave you a hint to this when I pointed out that you MIGHT have tried a different solution, that would in fact fail. The reason for the failure is what the vector of ones in my solution corrects. So, suppose we had tried this variation:

U = null(rand(6,10));
X = U*L;
X = X - min(X);
X = X./max(X)*9 + 1;

The last two lines are unchanged, as all they do is shift and scale the variables. That has no impact on correlation. Lets try it, and see what happens.

corrcoef(X)
ans =
          1      0.50551      0.49171      0.48886
    0.50551            1        0.504      0.50688
    0.49171        0.504            1      0.49093
    0.48886      0.50688      0.49093            1

So I got a correlation matrix that was CLOSE. In fact, if I had decided to build a 100x4 matrix like this:

U = null(rand(96,100));
X = U*L;
X = X - min(X);
X = X./max(X)*9 + 1;
format long g
corrcoef(X)
ans =
                       1         0.499999601633632         0.499995938366208         0.500001294099641
       0.499999601633632                         1          0.50000170892656         0.500000149572411
       0.499995938366208          0.50000170892656                         1         0.500005247567822
       0.500001294099641         0.500000149572411         0.500005247567822                         1

here the result is almost dead on, but still not correct. I had to write the matrix out using more digits just to show you there is a true difference.

So, while this scheme ALMOST works, it does not give the exact correlations desired. And that is the clue. It fails because the null space of U as I computed it is not orthogonal to a vector of ones!

U = null(rand(6,10));
format short g
U'*U
ans =
            1   4.1633e-17   5.5511e-17  -1.3878e-17
   4.1633e-17            1   6.9389e-18    1.249e-16
   5.5511e-17   6.9389e-18            1   2.7756e-17
  -1.3878e-17    1.249e-16   2.7756e-17            1

The columns of U are orthonormal vectors. Unit vectors that are mutually orthogonal. So if I form U'*U the diagonal will be 1, the off diagonal be zero (here, just least significant bit trash.) U'*U is the identity matrix, as it should be.

So now, lets compute the matrix product that we need to compute. Again, what matters was this:

(X - ones(n,1)*mean(X))' * (X - ones(n,1)*mean(X))

where X=U*L. As I computed it here, with

U = null(rand(6,10));
U'*ones(10,1)
ans =
     0.1587
    0.25869
   -0.08894
   -0.10126

we see that U is not orthogonal to a vector of all 1's. The result will be randomly small numbers, that as the vector gets longer, these dot products will be less and less significant.

But, suppose I computed it as:

U = null([rand(5,10);ones(1,10)]);
U'*ones(10,1)
ans =
            0
  -1.1102e-16
   1.1102e-16
  -8.3267e-17

ZERO. At least least significant bit trash. This is because U is explicitly forced to be orthogonal to a random array, as well as a vector of ones!

But why is this important? It seems like I am going somewhere for no good purpose. But suppose we expand the matrix product out here:

(X - ones(n,1)*mean(X))' * (X - ones(n,1)*mean(X))

We get terms like

X'*X + X'*ones(n,1)*mean(X) + ...

when X=U*L

L'*U'*U*L + L'*U'*ones(n,1)*mean(X) + ...

If U is an orthogonal matrix, so we know that

U'*U = eye(4)

But as well, we know that U as I am creating it is orthogonal to a vector of ones! So we have U'*ones(n,1)=0.

So the whole thing reduces to

L'*U'*U*L = L'*L = C

A lot of hand waving in there. But the point of all this hand waving is I had to force X as I create it to be orthogonal to a vector of ones, in order for the correlation matrix to be exactly correct. Null did exactly that for us here:

U = null([rand(5,10);ones(1,10)]);

As I wrote it in my solution, I did a couple of things in one line, then multiplying that by L to form X directly.

Finally, now you know why I did not explain this in detail in my answer. Even the above is a bit lacking in rigor, but it should be sufficient. I did indeed give a subtle hint about the reasoning when I said in my original answer that this would fail:

X = null(rand(6,10))*L;

It would be close, but it will not be perfect. And you wanted the EXACT correlations to be there. A nice thing about the solution I posed is it will be correct for any correlation matrix.

David Lindén on 22 Mar 2018

Edited: David Lindén on 22 Mar 2018

Open in MATLAB Online

Awesome! This is so cool. Once again, thanks a lot for taking the time to answer. Very comprehensive explanation.

Two things I still need to confirm:

1)

So, since the rand() function generates values between 0<x<1, an increased size of the matrix will result in more values closer to one. Hence, the null space will be "closer" to being orthogonal to a vector of ones.

And therefore,

 U = null(rand(96,100))
 X = U*L

will give a correlation matrix which is closer to the one desired than

 U = null(rand(6,10))
 X = U*L

However, both these methods will never be able to generate EXACTLY the desired correlation matrix due to not being orthogonal to a vector of ones. Thus, we need to add a column of ones

U = null(rand(5,10);ones(1,10))

to satisfy

U'*ones(n,1)=0

Am I understanding this part correctly?

-

2)

You say" A nice thing about the solution I posed is it will be correct for any correlation matrix.". However if you would like to produce a correlation matrix with negative correlation between all variables, as such:

C= [1.0  -0.5  -0.5  -0.5
   -0.5   1.0  -0.5  -0.5
   -0.5  -0.5   1.0  -0.5
   -0.5  -0.5  -0.5   1.0]

This method is no longer feasible due to chol(C) requiring a positive definite matrix. So this method only applies to non-negative correlations?

John D'Errico on 22 Mar 2018

Edited: John D'Errico on 26 Mar 2018

Open in MATLAB Online

Several questions here, with my usual long winded explanations, written on the fly. But as long as they are interesting questions, what the heck. ;-)

1. You are on a very reasonable track to understand why the vector of ones was needed, although I would probably use a slightly different way of visualizing it. Lets compare a problem in 10 dimensions to one in 100 dimensions. As usual here, I'll come at it from a perhaps different direction.

A10 = rand(10,6);
A100 = rand(100,96);

Now, suppose I wanted to ask how well a vector of ones is represented as a linear combination of the sets A10 and A100? Thus, can we solve the approximate problems...

A10*X10 = ones(10,1)
A100*X100 = ones(100,1)

So in MATLAB we see...

A10 = rand(10,6);
A100 = rand(100,96);
X10 = A10\ones(10,1);
X100 = A100\ones(100,1);
norm(ones(10,1) - A10*X10)
ans =
       0.495545084880473
norm(ones(100,1) - A100*X100)
ans =
       0.157112841217639

So it looks like a vector of all ones is better approximated by a random set of 96 vectors in 100 dimensions than a random set of 6 vectors in 10 dimensions. That makes perfect sense of course, certainly in hindsight. But it might give you more intuition as to why the error drops as we expand the number of dimensions here.

No matter, as you already had a good understanding here.

2. Does the scheme I gave you fail for negative correlations? Well, no. Ok, maybe yes. As always, the answer gets more complicated. :-)

Is the matrix C a valid correlation matrix?

C = [1.0  -0.5  -0.5  -0.5
    -0.5   1.0  -0.5  -0.5
    -0.5  -0.5   1.0  -0.5
    -0.5  -0.5  -0.5   1.0];

A correlation matrix has several required properties.

1. It is symmetric.

2. It has a diagonal of all ones.

3. All correlations must lie in [-1,1]

4. It must be at least non-negative definite, so all non-negative eigenvalues. Even then, chol might burp if one or more was zero, but we can survive that with some work.

There may be some other restrictions too, but 1:4 is a good start.

1, 2, 3 are easily seen to be true for this matrix. But #4?

L = chol(C)
Error using chol
Matrix must be positive definite. 
eig(C)
ans =
                      -0.5
                       1.5
                       1.5
                       1.5

This matrix fails the test for #4.

Does that mean all correlation matrices with negative correlations are impossible? No.

Cfun = @(k) (1-k)*eye(4) + k*ones(4);

You will see this generates a 4x4 matrix as you want it.

Cfun(.5)
ans =
            1          0.5          0.5          0.5
          0.5            1          0.5          0.5
          0.5          0.5            1          0.5
          0.5          0.5          0.5            1
Cfun(-.5)
ans =
            1         -0.5         -0.5         -0.5
         -0.5            1         -0.5         -0.5
         -0.5         -0.5            1         -0.5
         -0.5         -0.5         -0.5            1

A little experimentation is often useful, even though I know where I want to go here.

eig(Cfun(.5))
ans =
                       0.5
                       0.5
                       0.5
                       2.5
eig(Cfun(-.5))
ans =
                      -0.5
                       1.5
                       1.5
                       1.5
eig(Cfun(-.1))
ans =
                       0.7
                       1.1
                       1.1
                       1.1
eig(Cfun(-.25))
ans =
                      0.25
                      1.25
                      1.25
                      1.25
eig(Cfun(-1/3))
ans =
     -2.77555756156289e-17
          1.33333333333333
          1.33333333333333
          1.33333333333333

As it turns out, you can indeed have negative correlations in a matrix of that form. You cannot go too far though. -1/3 is the limit, and it is in fact, at the hairy edge of utility. (Gershgorin disks comes to mind here...)

chol(Cfun(-1/3))
ans =
            1     -0.33333     -0.33333     -0.33333
            0      0.94281      -0.4714      -0.4714
            0            0       0.8165      -0.8165
            0            0            0   2.1073e-08

chol is barely able to form a decomposition without failing. Any further, and it will start to complain.

chol(Cfun(-1/3 - 1.e-12))
Error using chol
Matrix must be positive definite.

So had your question originally have been, can you generate a 10x4 matrix with an invalid, impossible correlation matrix, my answer would have been a flat out no. Not all such matrices with negative correlations are invalid for such a purpose though.

C = rand(4)*2 - 1;
C = (C + C')/2;
C = C + eye(4) - diag(diag(C))
C =
          1      0.38149     -0.54464    -0.073504
    0.38149            1      0.26695      0.49452
   -0.54464      0.26695            1   0.00047764
  -0.073504      0.49452   0.00047764            1
eig(C)
ans =
   0.083407
    0.78609
     1.5348
     1.5957

Had this been your matrix, there would be no problem.

I might add that there is a valid reason why a correlation matrix has all non-negative eigenvalues. We can think of a correlation matrix as just a scaled covariance matrix. Now, we can think of the eigenvalues of a covariance matrix as sort of principal variances, as opposed to variances with principles. ;)

A covariance matrix that had negative eigenvalues would effectively correspond to a system with a negative variance (in SOME direction). The same applies to a correlation matrix. And since variances are always non-negative, we would have a problem if any was negative. So it is ok if an individual correlation or covariance (off-diagonal) is negative within limits. But not an eigenvalue.

David Lindén on 26 Mar 2018

John, some very interesting stuff. Massive thanks for the help! This has given me some new insights to digest.

John D'Errico on 26 Mar 2018

:)

Sign in to comment.

Answer 2

David Goodmanson on 20 Mar 2018

Edited: David Goodmanson on 20 Mar 2018

Open in MATLAB Online

1 vote

Hi David,

MODIFIED

the next thing that comes to mind is

a = [2*ones(1,4);-2*eye(4);zeros(5,4)] + 5;
a =
   7     7     7
   5     5     5
   3     5     5
   5     3     5
   5     5     3
   5     5     5
   5     5     5
   5     5     5
   5     5     5
   5     5     5
corrcoef(a)
ans =
0000    0.5000    0.5000    0.5000
5000    1.0000    0.5000    0.5000
5000    0.5000    1.0000    0.5000
5000    0.5000    0.5000    1.0000

or of course any similar scheme where the multiplicative 2 and added 5 are different. The simplest case along these lines is probably

a = [ones(1,4);-eye(4);zeros(5,4)] + 2;

5 Comments
Show 3 older comments Hide 3 older comments

David Lindén on 21 Mar 2018

Appreciate the answer David! Thanks. I did however received a solution from John that works better for my purpose.

David Goodmanson on 21 Mar 2018

Edited: David Goodmanson on 21 Mar 2018

Hi David,

Yes, since integer values are not required, John's method is definitely the way to do this. I looked briefly at what can be done with integer values, and although it's possible to introduce some randomness, I haven't been able to do better than solutions that have the same integer appearing six times in each column (different integers between columns at least).

If you dropped the 1 < n < 10 restriction which I am sure you would have to do, it would be interesting to see how far you could get on a more complicated matrix with rational entries like John mentioned.

Sign in to comment.

How to generate a matrix with the same correlation coefficient between all variables?

4 Comments
Show 2 older comments Hide 2 older comments

Accepted Answer

6 Comments
Show 4 older comments Hide 4 older comments

More Answers (1)

5 Comments
Show 3 older comments Hide 3 older comments

Categories

Tags

Community Treasure Hunt

How to generate a matrix with the same correlation coefficient between all variables?

4 Comments Show 2 older comments Hide 2 older comments

Accepted Answer

6 Comments Show 4 older comments Hide 4 older comments

More Answers (1)

5 Comments Show 3 older comments Hide 3 older comments

Categories

Tags

See Also

Community Treasure Hunt

4 Comments
Show 2 older comments Hide 2 older comments

6 Comments
Show 4 older comments Hide 4 older comments

5 Comments
Show 3 older comments Hide 3 older comments