Covariance and Correlation


Covariance

Covariance is a measure of the joint variability of two random variables. If larger values of one variable correspond to larger values of the other, and the same holds for smaller values (the variables tend to show similar behavior), the covariance is positive.

In the opposite case, when larger values of one variable correspond to smaller values of another variable (variables tend to show opposite behavior), the covariance is negative.

If the variables show no linear tendency in either direction (for example, when they are independent), the covariance is zero. The converse does not hold: a covariance of zero does not by itself mean the variables are independent.

The sign of the covariance shows the direction of the linear relationship between the variables. However, the magnitude of the covariance is not easy to interpret because it is not normalized and depends on the scale of the variables. The normalized version of the covariance, the correlation coefficient, shows the strength of the linear relationship by its magnitude.

Sample Covariance

The sample covariance of two samples \(X\) and \(Y\) estimates the population covariance \(E[(X-E(X))(Y-E(Y))]\) and is computed as:

\(cov(X,Y)=\frac{1}{N-1}\sum_{i=1}^{N}{(X_{i}-\bar{X})(Y_{i}-\bar{Y})}\)

We use \(N-1\) instead of \(N\) to make the estimator unbiased, because the computation uses the sample means (\(\bar{X}\) and \(\bar{Y}\)) instead of the population means. If the population means are known, the unbiased estimator is:

\(cov(X,Y)=\frac{1}{N}\sum_{i=1}^{N}{(X_{i}-E(X))(Y_{i}-E(Y))}\)

The variance is a special case of the covariance in which the 2 samples are identical.
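The sample formula above can be sketched in a few lines of plain Kotlin. This is only an illustration: `sampleCovariance` is a hypothetical helper written for this example, not an NM Dev function, and the data are made up so that the arithmetic is easy to check by hand.

```kotlin
// Sample covariance with the N - 1 denominator (the means are estimated from the data).
// `sampleCovariance` is a hypothetical helper for this sketch, not part of NM Dev.
fun sampleCovariance(x: DoubleArray, y: DoubleArray): Double {
    require(x.size == y.size && x.size > 1) { "need two samples of equal size N > 1" }
    val meanX = x.average()
    val meanY = y.average()
    var sum = 0.0
    for (i in x.indices) {
        sum += (x[i] - meanX) * (y[i] - meanY)
    }
    return sum / (x.size - 1)
}

// made-up data: y tends to rise with x, z falls as x rises
val x = doubleArrayOf(1.0, 2.0, 3.0, 4.0, 5.0)
val y = doubleArrayOf(2.0, 4.0, 5.0, 4.0, 5.0)
val z = doubleArrayOf(5.0, 4.0, 3.0, 2.0, 1.0)

println("cov(x, y) = " + sampleCovariance(x, y)) // 1.5, positive
println("cov(x, z) = " + sampleCovariance(x, z)) // -2.5, negative
println("cov(x, x) = " + sampleCovariance(x, x)) // 2.5, the sample variance of x
```

The last line illustrates the special case: passing the same sample twice yields its sample variance.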

Correlation

Correlation commonly refers to the degree to which a pair of variables are linearly related. Pearson’s correlation coefficient of two samples is defined as the covariance divided by the product of their standard deviations.

\(corr(X,Y)=\rho_{XY}=\frac{cov(X,Y)}{\sigma_{X}\sigma_{Y}}\)

Unlike covariance, the magnitude of correlation shows the strength of the linear relation as it is normalized such that the results always have a value between -1 and 1. However, similar to covariance, the measure only reflects a linear correlation of variables, and ignores many other types of relationships or correlation.
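To make the effect of normalization concrete, here is a minimal sketch in plain Kotlin. The helpers `covariance` and `pearson` are hypothetical names written for this example (they are not NM Dev functions), and the height and weight values are made up. Expressing the same heights in centimetres instead of metres scales the covariance by about 100 but leaves the correlation unchanged up to rounding.

```kotlin
import kotlin.math.sqrt

// Hypothetical helpers for this sketch, not part of NM Dev.
fun covariance(x: DoubleArray, y: DoubleArray): Double {
    require(x.size == y.size && x.size > 1) { "need two samples of equal size N > 1" }
    val meanX = x.average()
    val meanY = y.average()
    var sum = 0.0
    for (i in x.indices) {
        sum += (x[i] - meanX) * (y[i] - meanY)
    }
    return sum / (x.size - 1)
}

// Pearson's coefficient: covariance divided by the product of the standard deviations
fun pearson(x: DoubleArray, y: DoubleArray): Double =
    covariance(x, y) / (sqrt(covariance(x, x)) * sqrt(covariance(y, y)))

// made-up heights in metres and weights in kilograms
val heightM = doubleArrayOf(1.60, 1.70, 1.75, 1.80, 1.90)
val weightKg = doubleArrayOf(55.0, 68.0, 70.0, 80.0, 88.0)
// the same heights expressed in centimetres
val heightCm = DoubleArray(heightM.size) { heightM[it] * 100.0 }

println("cov (m, kg)   = " + covariance(heightM, weightKg))
println("cov (cm, kg)  = " + covariance(heightCm, weightKg)) // about 100 times larger
println("corr (m, kg)  = " + pearson(heightM, weightKg))
println("corr (cm, kg) = " + pearson(heightCm, weightKg))    // same value up to rounding
```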

Spearman’s rank correlation coefficient has been developed to be more sensitive to non-linear relationships. It measures rank correlation and assesses how well the relationship between two variables can be described using a monotonic function (that is to say the function is monotonically increasing or decreasing).

A Spearman correlation of 1 results when one variable is a monotonically increasing function of the other, even if their relationship is not linear (a monotonically decreasing relationship gives -1). This means that the data points are ranked in the same order in both samples. In contrast, such a relationship gives a perfect Pearson correlation only if it is also linear.

The Spearman correlation coefficient is defined as the Pearson correlation coefficient between the rank variables \(R(X)\) and \(R(Y)\):

\(r_{s}=\rho_{R(X),R(Y)}=\frac{cov(R(X),R(Y))}{\sigma_{R(X)}\sigma_{R(Y)}}\)
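The definition can be sketched directly in plain Kotlin: rank each sample, then apply Pearson's formula to the ranks. The helpers `ranks`, `pearson` and `spearman` are hypothetical names written for this example, not NM Dev functions. Giving tied values the average of their ranks is one common convention and is assumed here. The data follow \(y=x^{3}\), a monotonic but non-linear relationship.

```kotlin
import kotlin.math.sqrt

// Hypothetical helpers for this sketch, not part of NM Dev.

// Ranks from 1 (smallest) to N (largest); tied values share the average of their ranks.
fun ranks(v: DoubleArray): DoubleArray {
    val order = v.indices.sortedBy { v[it] }   // indices of v in ascending order of value
    val r = DoubleArray(v.size)
    var i = 0
    while (i < order.size) {
        var j = i
        while (j + 1 < order.size && v[order[j + 1]] == v[order[i]]) j++
        val avgRank = (i + j) / 2.0 + 1.0      // average of ranks i+1 .. j+1
        for (k in i..j) r[order[k]] = avgRank
        i = j + 1
    }
    return r
}

// Pearson correlation; the 1/(N-1) factors of covariance and variances cancel out.
fun pearson(x: DoubleArray, y: DoubleArray): Double {
    require(x.size == y.size && x.size > 1) { "need two samples of equal size N > 1" }
    val meanX = x.average()
    val meanY = y.average()
    var sxy = 0.0
    var sxx = 0.0
    var syy = 0.0
    for (i in x.indices) {
        sxy += (x[i] - meanX) * (y[i] - meanY)
        sxx += (x[i] - meanX) * (x[i] - meanX)
        syy += (y[i] - meanY) * (y[i] - meanY)
    }
    return sxy / sqrt(sxx * syy)
}

// Spearman correlation: Pearson correlation of the rank variables
fun spearman(x: DoubleArray, y: DoubleArray): Double = pearson(ranks(x), ranks(y))

// made-up data: y = x^3 is monotonically increasing but not linear
val x = doubleArrayOf(1.0, 2.0, 3.0, 4.0, 5.0)
val y = DoubleArray(x.size) { x[it] * x[it] * x[it] }

println("Pearson:  " + pearson(x, y))  // below 1, the relationship is not linear
println("Spearman: " + spearman(x, y)) // 1, the ranks agree perfectly
```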

Code

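In NM Dev, the class SpearmanRankCorrelation computes the Spearman rank correlation between two samples.

// package path assumed from the NM Dev Javadoc; adjust it if your version differs
import dev.nm.stat.descriptive.correlation.SpearmanRankCorrelation

// create arrays of doubles for our datasets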
val x = doubleArrayOf(106.0, 86.0, 100.0, 101.0, 99.0, 103.0, 97.0, 113.0, 112.0, 110.0)
val y = doubleArrayOf(7.0, 0.0, 27.0, 50.0, 28.0, 29.0, 20.0, 12.0, 6.0, 17.0)

// create SpearmanRankCorrelation object
val correlation = SpearmanRankCorrelation(x, y)

println("Sample correlation: " + correlation.value())

Output:

Sample correlation: -0.17575757575757575

Covariance Matrix

When there are more than 2 samples, we use a covariance matrix to describe the relationships between them. A covariance matrix is a square matrix that gives the covariance between each pair of samples. The covariance matrix of a random vector \(X\) is typically denoted by \(K_{XX}\) or \(\Sigma_{X}\).

\(\Sigma_{X}=\)

	a	b	c
a	cov(a, a)	cov(a, b)	cov(a, c)
b	cov(b, a)	cov(b, b)	cov(b, c)
c	cov(c, a)	cov(c, b)	cov(c, c)

Let’s break it down more simply. Since the variance is a special case of covariance where the 2 samples are identical as stated earlier, we can simplify cov(a, a) into var(a) and likewise for the other covariances along the main diagonal.

\(\Sigma_{X}=\)

	a	b	c
a	cov(a, a) = var(a)	cov(a, b)	cov(a, c)
b	cov(b, a)	cov(b, b) = var(b)	cov(b, c)
c	cov(c, a)	cov(c, b)	cov(c, c) = var(c)

We can further simplify this matrix as cov(a, b) = cov(b, a).

\(\Sigma_{X}=\)

	a	b	c
a	var(a)	cov(a, b)	cov(a, c)
b	cov(b, a) = cov(a, b)	var(b)	cov(b, c)
c	cov(c, a) = cov(a, c)	cov(c, b) = cov(b, c)	var(c)

From the above, we can see that the covariance matrix is symmetric and that its main diagonal contains the variances.
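Both properties can be checked with a short plain-Kotlin sketch. `covarianceMatrix` is a hypothetical helper written for this example, not an NM Dev function. It assumes that each row of the input is one observation and each column is one variable, and it uses the \(N-1\) denominator. The data are made up.

```kotlin
// Hypothetical helper for this sketch, not part of NM Dev.
// Assumption: each row of `data` is one observation, each column is one variable.
fun covarianceMatrix(data: Array<DoubleArray>): Array<DoubleArray> {
    val n = data.size
    require(n > 1) { "need at least two observations" }
    val p = data[0].size
    require(data.all { it.size == p }) { "all rows must have the same length" }

    // column means
    val means = DoubleArray(p) { j ->
        var s = 0.0
        for (row in data) s += row[j]
        s / n
    }

    val cov = Array(p) { DoubleArray(p) }
    for (j in 0 until p) {
        for (k in j until p) {
            var sum = 0.0
            for (row in data) sum += (row[j] - means[j]) * (row[k] - means[k])
            cov[j][k] = sum / (n - 1)
            cov[k][j] = cov[j][k] // cov(a, b) = cov(b, a)
        }
    }
    return cov
}

// made-up data with three variables a, b, c in the columns
val data = arrayOf(
    doubleArrayOf(1.0, 2.0, 5.0),
    doubleArrayOf(2.0, 4.0, 4.0),
    doubleArrayOf(3.0, 5.0, 3.0),
    doubleArrayOf(4.0, 4.0, 2.0),
    doubleArrayOf(5.0, 5.0, 1.0)
)

val sigma = covarianceMatrix(data)
for (row in sigma) println(row.joinToString("\t"))
// rows printed:
//  2.5   1.5  -2.5
//  1.5   1.5  -1.5
// -2.5  -1.5   2.5
```

The diagonal entries 2.5, 1.5 and 2.5 are the sample variances of a, b and c, and each entry above the diagonal is mirrored below it.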

Code

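In NM Dev, we can use the class SampleCovariance to compute the covariance matrix.

// package paths assumed from the NM Dev Javadoc; adjust them if your version differs
import dev.nm.algebra.linear.matrix.doubles.matrixtype.dense.DenseMatrix
import dev.nm.stat.descriptive.covariance.SampleCovariance

// create a matrix for our dataset: each array is one row (observation), each position one column (variable)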
val a = doubleArrayOf(1.4022225, -0.04625344, 1.26176112, -1.8394428, 0.7182637)
val b = doubleArrayOf(-0.2230975, 0.91561987, 1.17086252, 0.2282348, 0.0690674)
val c = doubleArrayOf(0.6939930, 1.94611387, -0.82939259, 1.0905923, 0.1458883)
val d = doubleArrayOf(0.6939930, 0.18818663, -0.29040783, 0.6937185, 0.4664052)
val e = doubleArrayOf(0.6939930, -0.10749210, 3.27376532, 0.5141217, 0.7691778)
val f = doubleArrayOf(-2.5275280, 0.64942255, 0.07506224, -1.0787524, 1.6217606)
val dataset: Array<DoubleArray> = arrayOf(a, b, c, d, e, f)
val dmatrix = DenseMatrix(dataset) // 6 rows (observations) x 5 columns (variables)

// create SampleCovariance object
val cov = SampleCovariance(dmatrix)

println("Sample covariance: " + cov)

Output:

Sample covariance: 5x5
	[,1] [,2] [,3] [,4] [,5] 
[1,] 1.951918, -0.187493, 0.448644, 0.347863, -0.522400, 
[2,] -0.187493, 0.600273, -0.742584, 0.404512, -0.173547, 
[3,] 0.448644, -0.742584, 2.167306, -0.250672, 0.085099, 
[4,] 0.347863, 0.404512, -0.250672, 1.301751, -0.385892, 
[5,] -0.522400, -0.173547, 0.085099, -0.385892, 0.317301,
