Methods of correlation

Correlation Methods

Correlation and Covariance Matrices

You can generate a correlation or covariance matrix from numeric data columns, and either store the results in an autogenerated worksheet or display them as a table whose values can be color coded.

This method requires multiple numeric data columns, all stored in a single worksheet.
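As a rough illustration of what such a matrix contains, the following Python sketch computes both matrices from a small set of columns. The column values and the function names `covariance_matrix` and `correlation_matrix` are made up for this example, and the sample (n − 1) denominator is an assumption; the worksheet tool itself may use different conventions.

```python
from math import sqrt

def covariance_matrix(columns):
    """Sample covariance matrix (n - 1 denominator) of equal-length numeric columns."""
    n = len(columns[0])
    k = len(columns)
    means = [sum(col) / n for col in columns]
    cov = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(k):
            cov[i][j] = sum(
                (a - means[i]) * (b - means[j])
                for a, b in zip(columns[i], columns[j])
            ) / (n - 1)
    return cov

def correlation_matrix(columns):
    """Pearson correlation matrix: each covariance divided by the two standard deviations."""
    cov = covariance_matrix(columns)
    k = len(cov)
    return [
        [cov[i][j] / sqrt(cov[i][i] * cov[j][j]) for j in range(k)]
        for i in range(k)
    ]

# Hypothetical worksheet with three numeric columns (illustrative values only)
columns = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [4.0, 3.0, 2.0, 1.0],
]
corr = correlation_matrix(columns)
for row in corr:
    print("\t".join(f"{value: .3f}" for value in row))
```

The diagonal of the correlation matrix is always 1, and the matrix is symmetric, so only the entries above or below the diagonal carry information.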

An example of a correlation matrix displayed as a color-coded table is shown below.

Using Fisher's z Transformation (z_r)

This option transforms the skewed sampling distribution of the correlation coefficient into an approximately normal one.

The theoretical sampling distribution of the correlation coefficient r can be approximated by the normal distribution when the population correlation ρ = 0, but as the correlation moves away from zero, the sampling distribution becomes increasingly skewed. Fisher's z transformation converts r into a value z_r whose sampling distribution is approximately normal.
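A minimal Python sketch of the transformation, assuming the usual definition z_r = artanh(r) = ½ ln((1 + r)/(1 − r)). The function names `fisher_z` and `inverse_fisher_z` are illustrative only.

```python
import math

def fisher_z(r):
    """Fisher's z transformation of a correlation coefficient r, with -1 < r < 1."""
    return math.atanh(r)  # identical to 0.5 * log((1 + r) / (1 - r))

def inverse_fisher_z(z):
    """Map a transformed value back to the correlation scale."""
    return math.tanh(z)

# Near zero the transformation barely changes r; close to 1 it stretches the scale.
for r in (0.0, 0.3, 0.6, 0.9):
    print(f"r = {r:.1f}  ->  z_r = {fisher_z(r):.4f}")
```

The transformation is undefined at r = ±1, so a perfect correlation has to be handled separately.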

The relationship between Pearson's product-moment correlation coefficient and the Fisher-transformed values is shown in the image on the right.

The image below shows the Fisher-transformed values of the correlation matrix displayed above.

Pearson Product-Moment Correlation Coefficient (Pearson's r)

Pearson's correlation coefficient, when applied to a population, is commonly represented by the Greek letter ρ (rho) and may be referred to as the population correlation coefficient or the population Pearson correlation coefficient. The formula for ρ is:

$\rho_{X,Y}={\mathrm{cov}(X,Y) \over \sigma_X \sigma_Y} ={E[(X-\mu_X)(Y-\mu_Y)] \over \sigma_X\sigma_Y}$

For a sample

Pearson's correlation coefficient when applied to a sample is commonly represented by the letter r and may be referred to as the sample correlation coefficient or the sample Pearson correlation coefficient. We can obtain a formula for r by substituting estimates of the covariances and variances based on a sample into the formula above. That formula for r is:

$r = \frac{\sum ^n _{i=1}(X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum ^n _{i=1}(X_i - \bar{X})^2} \sqrt{\sum ^n _{i=1}(Y_i - \bar{Y})^2}}$
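The formula translates almost directly into code. The sketch below is one possible Python rendering; the function name `pearson_r` and the sample values are illustrative and do not come from the text.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator: sum of products of deviations from the means
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Denominator: square roots of the two sums of squared deviations
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / (sqrt(s_xx) * sqrt(s_yy))

# Illustrative data only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = pearson_r(xs, ys)
print(f"r = {r:.4f}")
```

If either sequence is constant, its sum of squared deviations is zero and the coefficient is undefined; this sketch would then raise a `ZeroDivisionError`.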

An equivalent expression gives the correlation coefficient as the mean of the products of the standard scores. Based on a sample of paired data (X_i, Y_i), the sample Pearson correlation coefficient is

$r = \frac{1}{n-1} \sum ^n _{i=1} \left( \frac{X_i - \bar{X}}{s_X} \right) \left( \frac{Y_i - \bar{Y}}{s_Y} \right)$

where

$\frac{X_i - \bar{X}}{s_X}, \bar{X}=\frac{1}{n}\sum_{i=1}^n X_i, \text{ and } s_X=\sqrt{\frac{1}{n-1}\sum_{i=1}^n(X_i-\bar{X})^2}$

are the standard score, sample mean, and sample standard deviation, respectively.
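The standard-score form can be checked numerically. This sketch uses Python's `statistics.mean` and `statistics.stdev` (the latter uses the n − 1 denominator, matching s_X above); the function name and data are illustrative.

```python
from statistics import mean, stdev

def pearson_r_from_standard_scores(xs, ys):
    """Sample Pearson r as the sum of products of standard scores divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (n - 1 denominator)
    products = (
        ((x - x_bar) / s_x) * ((y - y_bar) / s_y)
        for x, y in zip(xs, ys)
    )
    return sum(products) / (n - 1)

# Illustrative data only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r_standard = pearson_r_from_standard_scores(xs, ys)
print(f"r = {r_standard:.4f}")
```

Because the n − 1 factors in s_X and s_Y cancel against the leading 1/(n − 1), this gives the same value as the sum-of-deviations formula.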

Mathematical properties

The absolute values of both the sample and population Pearson correlation coefficients are less than or equal to 1. Correlations equal to 1 or −1 correspond to data points lying exactly on a line (in the case of the sample correlation), or to a bivariate distribution entirely supported on a line (in the case of the population correlation). The Pearson correlation coefficient is symmetric: corr(X,Y) = corr(Y,X).

A key mathematical property of the Pearson correlation coefficient is that it is invariant (up to a sign) to separate changes in location and scale in the two variables. That is, we may transform X to a + bX and transform Y to c + dY, where a, b, c, and d are constants with b, d > 0, without changing the correlation coefficient (this fact holds for both the population and sample Pearson correlation coefficients). If exactly one of b and d is negative, the coefficient keeps its magnitude but changes sign. Note that more general linear transformations, which mix X and Y, do change the correlation.
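The invariance is easy to confirm numerically. In this Python sketch the data and the constants a, b, c, d are arbitrary choices made for illustration, and `pearson_r` is a plain implementation of the sample formula.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / (sqrt(s_xx) * sqrt(s_yy))

# Illustrative data and constants only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b, c, d = 10.0, 2.5, -3.0, 0.5

r_original = pearson_r(xs, ys)
# Positive scale factors: the coefficient is unchanged
r_rescaled = pearson_r([a + b * x for x in xs], [c + d * y for y in ys])
# Negating one scale factor: same magnitude, opposite sign
r_flipped = pearson_r([a - b * x for x in xs], [c + d * y for y in ys])

print(r_original, r_rescaled, r_flipped)
```

In practice this means the coefficient does not depend on the units of measurement, for example Celsius versus Fahrenheit.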

The Pearson correlation can be expressed in terms of uncentered moments. Since μ_X = E(X), σ_X² = E[(X − E(X))²] = E(X²) − (E(X))² and likewise for Y, and since

$E[(X-E(X))(Y-E(Y))]=E(XY)-E(X)E(Y),\,$

the correlation can also be written as

$\rho_{X,Y}=\frac{E(XY)-E(X)E(Y)}{\sqrt{E(X^2)-(E(X))^2}~\sqrt{E(Y^2)- (E(Y))^2}}.$
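For readers who want the intermediate step, the covariance identity used above follows from expanding the product and applying linearity of expectation, treating E(X) and E(Y) as constants:

$E[(X-E(X))(Y-E(Y))] = E[XY - X\,E(Y) - Y\,E(X) + E(X)E(Y)] = E(XY) - E(X)E(Y) - E(X)E(Y) + E(X)E(Y) = E(XY) - E(X)E(Y)$

Setting Y = X in the same expansion gives the variance identity $E[(X-E(X))^2] = E(X^2) - (E(X))^2$, which is what appears under each square root in the denominator.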

Spearman's rank correlation coefficient

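Spearman's rank correlation coefficient, here also denoted ρ, is the Pearson correlation coefficient computed on the ranks of the two variables rather than on their raw values. In applications where duplicate values (ties) are known to be absent, a simpler procedure can be used to calculate ρ [3][4]. Differences $d_i = x_i - y_i$ between the ranks of each observation on the two variables are calculated, and ρ is given by: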

$\rho = 1- {\frac {6 \sum d_i^2}{n(n^2 - 1)}}.$

Note that this latter method should not be used when the data set is truncated; that is, when the Spearman correlation coefficient is desired for only the top X records (whether by pre-change rank, post-change rank, or both), the user should instead apply the Pearson correlation coefficient formula given above to the ranks.
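When the shortcut formula does not apply, one option is to rank both variables and feed the ranks into the Pearson formula. The Python sketch below does this, assuming tied values share the average of the rank positions they occupy (a common convention, not one stated in the text). The function names are illustrative.

```python
from math import sqrt

def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values equal to values[order[i]]
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / (sqrt(s_xx) * sqrt(s_yy))

def spearman_rho(xs, ys):
    """Spearman's rho as the Pearson correlation of the (average) ranks."""
    return pearson_r(average_ranks(xs), average_ranks(ys))

print(average_ranks([10, 20, 20, 30]))
print(spearman_rho([1, 2, 3, 4], [1, 4, 9, 16]))
```

Any strictly increasing relationship, linear or not, yields a coefficient of 1 (up to floating-point rounding), because only the ordering of the values matters.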

Example

In this example, we will use the raw data in the table below to calculate the correlation between the IQ of a person and the number of hours that person spends in front of the TV per week.

IQ, $X_i$	Hours of TV per week, $Y_i$
106	7
86	0
100	27
101	50
99	28
103	29
97	20
113	12
112	6
110	17

First, we must find the values of the terms $d^2_i$ . To do so we use the following steps, reflected in the table below.

Sort the data by the first column ( $X_i$ ). Create a third column $x_i$ and assign it the ranked values 1, 2, 3, ..., n.
Next, sort the data by the second column ( $Y_i$ ). Create a fourth column $y_i$ and similarly assign it the ranked values 1, 2, 3, ..., n.
Create a fifth column $d_i$ to hold the differences between the two rank columns ( $x_i$ and $y_i$ ).
Create one final column $d^2_i$ to hold the value of column $d_i$ squared.

IQ, $X_i$	Hours of TV per week, $Y_i$	rank $x_i$	rank $y_i$	$d_i$	$d^2_i$
86	0	1	1	0	0
97	20	2	6	−4	16
99	28	3	8	−5	25
100	27	4	7	−3	9
101	50	5	10	−5	25
103	29	6	9	−3	9
106	7	7	3	4	16
110	17	8	5	3	9
112	6	9	2	7	49
113	12	10	4	6	36
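The ranking steps and the table above can be reproduced with a short Python sketch. The data are the ten IQ and TV-hours pairs from the example; the helper name `ranks_without_ties` is illustrative and relies on the fact that this data set contains no ties.

```python
iq = [106, 86, 100, 101, 99, 103, 97, 113, 112, 110]
tv_hours = [7, 0, 27, 50, 28, 29, 20, 12, 6, 17]

def ranks_without_ties(values):
    """Rank 1 for the smallest value up to n for the largest (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks

x_rank = ranks_without_ties(iq)
y_rank = ranks_without_ties(tv_hours)
d = [rx - ry for rx, ry in zip(x_rank, y_rank)]
d_squared = [di ** 2 for di in d]

# Print the rows sorted by IQ, in the same column order as the table
for row in sorted(zip(iq, tv_hours, x_rank, y_rank, d, d_squared)):
    print(*row, sep="\t")
print("sum of d_i^2 =", sum(d_squared))
```

Ranking by index rather than physically sorting the table twice keeps each IQ value paired with its own TV-hours value throughout.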

With the $d^2_i$ values found, we add them to obtain $\sum d_i^2 = 194$ . The value of n is 10. These values can now be substituted back into the equation,

$\rho = 1- {\frac {6\times194}{10(10^2 - 1)}}$

which evaluates to ρ = −29/165 = −0.175757575..., with a two-sided p-value of approximately 0.627 (using the t distribution with n − 2 = 8 degrees of freedom).

This value is close to zero, which shows that the correlation between IQ and hours spent watching TV is very low.
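The final substitution can be checked with exact arithmetic. This Python sketch uses `fractions.Fraction` to confirm the value −29/165, and also computes the t statistic t = ρ·√((n − 2)/(1 − ρ²)) that is commonly used for a t-distribution test with n − 2 degrees of freedom; that the example's significance test uses exactly this statistic is an assumption.

```python
from fractions import Fraction
from math import sqrt

n = 10
sum_d_squared = 194

# Exact evaluation of rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))
rho = 1 - Fraction(6 * sum_d_squared, n * (n ** 2 - 1))
print("rho =", rho, "=", float(rho))

# t statistic commonly paired with rho for a test on n - 2 degrees of freedom
rho_float = float(rho)
t_stat = rho_float * sqrt((n - 2) / (1 - rho_float ** 2))
print("t =", round(t_stat, 3))
```

Working with `Fraction` avoids any rounding until the very last step, which makes it easy to compare against the fraction quoted in the text.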

