

Correlation Methods 


    Correlation and Covariance Matrices


    You can generate a correlation or covariance matrix from numeric data columns, and you can choose to store the computed results in an autogenerated worksheet or to display them in a table whose values can be color coded.
    • This method requires multiple numeric data columns whose values should be stored in a single worksheet.
    • An example of a correlation matrix displayed as a color-coded table is shown below; a minimal code sketch of the underlying computation follows this list.
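    The sketch below is an illustrative example only (the column names and values are made up), showing how the same correlation and covariance matrices could be computed with pandas rather than with the worksheet tool described above:

        import pandas as pd

        # Hypothetical worksheet with three numeric columns (names and values are illustrative only).
        data = pd.DataFrame({
            "height": [162, 175, 168, 181, 159, 172],
            "weight": [61.0, 78.5, 70.2, 84.1, 55.9, 74.3],
            "age":    [23, 35, 29, 41, 21, 33],
        })

        corr_matrix = data.corr(method="pearson")   # correlation matrix
        cov_matrix = data.cov()                     # covariance matrix

        print(corr_matrix.round(3))
        print(cov_matrix.round(3))

        # In a notebook, corr_matrix.style.background_gradient() renders a
        # color-coded view, similar in spirit to the table mentioned above.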

    Using Fisher's z Transformation (z_r)


    This option transforms the skewed sampling distribution of the correlation coefficient into an approximately normal one.
    • The theoretical sampling distribution of the correlation coefficient is well approximated by the normal distribution when the population correlation ρ = 0, but as r deviates from zero the sampling distribution becomes increasingly skewed. Fisher's z transformation maps this skewed distribution onto an approximately normal one.
    • The relationship between Pearson's product-moment correlation coefficient and the Fisher-transformed values is shown in the image on the right.
      The image below shows the Fisher-transformed values of the correlation matrix displayed above; a short code sketch of the transformation follows.
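    As an illustrative sketch (not the software's own routine), Fisher's transformation z_r = arctanh(r) = (1/2) ln((1 + r)/(1 − r)) and its inverse tanh can be applied with numpy; the sample size n below is an assumed value:

        import numpy as np

        r = np.array([-0.9, -0.5, 0.0, 0.5, 0.9])   # sample correlation coefficients

        # Fisher's z transformation: z_r = arctanh(r) = 0.5 * ln((1 + r) / (1 - r))
        z = np.arctanh(r)

        # The transformed values are approximately normal with standard error
        # 1 / sqrt(n - 3); n = 30 is an assumed sample size for this sketch.
        n = 30
        se = 1.0 / np.sqrt(n - 3)

        # 95% confidence intervals for rho, built in z space and mapped back with tanh.
        lower = np.tanh(z - 1.96 * se)
        upper = np.tanh(z + 1.96 * se)

        for ri, lo, hi in zip(r, lower, upper):
            print(f"r = {ri:+.2f}  ->  95% CI for rho: ({lo:+.3f}, {hi:+.3f})")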

    Pearson Product-Moment Correlation Coefficient (Pearson's r)



    For a population, Pearson's correlation coefficient ρ is defined as the covariance of the two variables divided by the product of their standard deviations:
     \rho_{X,Y} = \frac{\mathrm{cov}(X,Y)}{\sigma_X \sigma_Y} = \frac{E[(X-\mu_X)(Y-\mu_Y)]}{\sigma_X \sigma_Y}

    For a sample

    Pearson's correlation coefficient when applied to a sample is commonly represented by the letter r and may be referred to as the sample correlation coefficient or the sample Pearson correlation coefficient. We can obtain a formula for r by substituting estimates of the covariances and variances based on a sample into the formula above. That formula for r is:
    r = \frac{\sum ^n _{i=1}(X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum ^n _{i=1}(X_i - \bar{X})^2} \sqrt{\sum ^n _{i=1}(Y_i - \bar{Y})^2}}
    An equivalent expression gives the correlation coefficient as the mean of the products of the standard scores. Based on a sample of paired data (X_i, Y_i), the sample Pearson correlation coefficient is
    r = \frac{1}{n-1} \sum ^n _{i=1} \left( \frac{X_i - \bar{X}}{s_X} \right) \left( \frac{Y_i - \bar{Y}}{s_Y} \right)
    where
    \frac{X_i - \bar{X}}{s_X}, \bar{X}=\frac{1}{n}\sum_{i=1}^n X_i, \text{ and } s_X=\sqrt{\frac{1}{n-1}\sum_{i=1}^n(X_i-\bar{X})^2}
    are the standard score, sample mean, and sample standard deviation, respectively.
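    To make the two equivalent sample formulas concrete, the following sketch (with made-up data) computes r both ways and checks the result against numpy.corrcoef:

        import numpy as np

        # Made-up paired sample (X_i, Y_i) for illustration.
        x = np.array([2.0, 4.0, 5.0, 7.0, 9.0])
        y = np.array([1.5, 3.1, 4.9, 6.2, 8.8])
        n = len(x)

        # Formula 1: sum of cross-products over the product of root sums of squares.
        dx, dy = x - x.mean(), y - y.mean()
        r1 = np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2))

        # Formula 2: mean (with the n - 1 divisor) of products of standard scores,
        # using the sample standard deviations s_X and s_Y (ddof=1).
        zx = dx / x.std(ddof=1)
        zy = dy / y.std(ddof=1)
        r2 = np.sum(zx * zy) / (n - 1)

        print(r1, r2, np.corrcoef(x, y)[0, 1])   # all three values agree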

    Mathematical properties

    The absolute values of both the sample and population Pearson correlation coefficients are less than or equal to 1. Correlations equal to 1 or -1 correspond to data points lying exactly on a line (in the case of the sample correlation), or to a bivariate distribution entirely supported on a line (in the case of the population correlation). The Pearson correlation coefficient is symmetric: corr(X,Y) = corr(Y,X).
    A key mathematical property of the Pearson correlation coefficient is that it is invariant (up to a sign) to separate changes in location and scale in the two variables. That is, we may transform X to a + bX and transform Y to c + dY, where a, b, c, and d are constants with b and d nonzero, without changing the correlation coefficient (this fact holds for both the population and sample Pearson correlation coefficients). Note that more general linear transformations do change the correlation.
    The Pearson correlation can be expressed in terms of uncentered moments. Since \mu_X = E(X), \sigma_X^2 = E[(X - E(X))^2] = E(X^2) - (E(X))^2, and likewise for Y, and since
    E[(X-E(X))(Y-E(Y))]=E(XY)-E(X)E(Y),\,
    the correlation can also be written as
    \rho_{X,Y}=\frac{E(XY)-E(X)E(Y)}{\sqrt{E(X^2)-(E(X))^2}~\sqrt{E(Y^2)- (E(Y))^2}}.
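    These properties can be checked numerically; the sketch below (with arbitrary simulated data and arbitrary constants a, b, c, d) verifies that an affine change of location and scale leaves |r| unchanged and that the uncentered-moment expression reproduces the same value:

        import numpy as np

        rng = np.random.default_rng(0)
        x = rng.normal(size=200)
        y = 0.6 * x + rng.normal(size=200)

        r = np.corrcoef(x, y)[0, 1]

        # Invariance up to sign: transform X -> a + bX and Y -> c + dY (b, d nonzero).
        a, b, c, d = 5.0, -2.0, -1.0, 3.0
        r_transformed = np.corrcoef(a + b * x, c + d * y)[0, 1]
        print(r, r_transformed)   # equal in magnitude; the sign follows sign(b * d)

        # Uncentered-moment form:
        # (E[XY] - E[X]E[Y]) / (sqrt(E[X^2] - E[X]^2) * sqrt(E[Y^2] - E[Y]^2))
        exy, ex, ey = np.mean(x * y), np.mean(x), np.mean(y)
        ex2, ey2 = np.mean(x**2), np.mean(y**2)
        r_moments = (exy - ex * ey) / (np.sqrt(ex2 - ex**2) * np.sqrt(ey2 - ey**2))
        print(r_moments)          # agrees with np.corrcoef up to floating-point error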

    Spearman's rank correlation coefficient


    The Spearman correlation coefficient is defined as the Pearson correlation coefficient applied to the ranks of the two variables. In applications where duplicate values (ties) are known to be absent, a simpler procedure can be used to calculate ρ.[3][4] Differences d_i = x_i - y_i between the ranks of each observation on the two variables are calculated, and ρ is given by:
     \rho = 1- {\frac {6 \sum d_i^2}{n(n^2 - 1)}}.
    Note that this latter method should not be used in cases where the data set is truncated; that is, when the Spearman correlation coefficient is desired for the top X records (whether by pre-change rank or post-change rank, or both), the user should instead apply the Pearson correlation coefficient formula given above to the ranks.
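    As a minimal sketch of the simplified formula (assuming no ties; the function name is made up for illustration), it can be coded directly:

        import numpy as np
        from scipy import stats

        def spearman_rho_no_ties(x, y):
            """Spearman's rho via the rank-difference formula (valid only when there are no ties)."""
            x, y = np.asarray(x), np.asarray(y)
            n = len(x)
            d = stats.rankdata(x) - stats.rankdata(y)   # rank differences d_i
            return 1 - 6 * np.sum(d**2) / (n * (n**2 - 1))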

    Example

    In this example, we will use the raw data in the table below to calculate the correlation between a person's IQ and the number of hours spent watching TV per week.
    IQ, X_i   Hours of TV per week, Y_i
    106       7
    86        0
    100       27
    101       50
    99        28
    103       29
    97        20
    113       12
    112       6
    110       17
    First, we must find the values of the terms d^2_i. To do so, we use the following steps, reflected in the table below.
    1. Sort the data by the first column (X_i). Create a new column x_i and assign it the ranked values 1,2,3,...n.
    2. Next, sort the data by the second column (Y_i). Create a fourth column y_i and similarly assign it the ranked values 1,2,3,...n.
    3. Create a fifth column d_i to hold the differences between the two rank columns (x_i and y_i).
    4. Create one final column d^2_i to hold the value of column d_i squared.
    IQ, X_i   Hours of TV per week, Y_i   rank x_i   rank y_i   d_i   d^2_i
    86        0                           1          1          0     0
    97        20                          2          6          -4    16
    99        28                          3          8          -5    25
    100       27                          4          7          -3    9
    101       50                          5          10         -5    25
    103       29                          6          9          -3    9
    106       7                           7          3          4     16
    110       17                          8          5          3     9
    112       6                           9          2          7     49
    113       12                          10         4          6     36
    With d^2_i found, we can add them to find \sum d_i^2 = 194. The value of n is 10. So these values can now be substituted back into the equation,
     \rho = 1- {\frac {6\times194}{10(10^2 - 1)}}
    which evaluates to ρ = −29/165 = −0.175757575..., with a two-sided P-value of 0.627188 (using the t distribution).
    This value, close to zero, shows that the correlation between IQ and hours spent watching TV in this sample is very weak. 
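    The worked example can be reproduced in a few lines; the sketch below ranks both columns, applies the rank-difference formula, and cross-checks the result (and its p-value) against scipy.stats.spearmanr:

        import numpy as np
        from scipy import stats

        iq = np.array([106, 86, 100, 101, 99, 103, 97, 113, 112, 110])
        tv = np.array([7, 0, 27, 50, 28, 29, 20, 12, 6, 17])
        n = len(iq)

        # Ranks 1..n for each column (there are no ties in this data set).
        rank_x = stats.rankdata(iq)
        rank_y = stats.rankdata(tv)

        d = rank_x - rank_y
        rho = 1 - 6 * np.sum(d**2) / (n * (n**2 - 1))
        print(np.sum(d**2), rho)        # 194.0 and -0.17575...

        # Cross-check; spearmanr also reports the two-sided p-value from the t distribution.
        print(stats.spearmanr(iq, tv))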
