
Methods of correlation

Correlation Methods 


    Correlation and Covariance Matrices


    You can generate a correlation or covariance matrix from numeric data columns, and either store the results in an auto-generated worksheet or display them as a table whose values can be color coded.
    • This method requires multiple numeric data columns whose values should be stored in a single worksheet.
    • An example of a correlation matrix displayed as a color-coded table is shown below.
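    As a rough illustration of what such a matrix contains, the sketch below computes sample covariance and correlation matrices for a few numeric columns using only the Python standard library. The column values and the function names (covariance_matrix, correlation_matrix) are hypothetical and are not part of the software described here.

```python
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def covariance(x, y):
    """Sample covariance of two equal-length columns (n - 1 denominator)."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)


def covariance_matrix(columns):
    """Entry [i][j] is the covariance between column i and column j."""
    return [[covariance(ci, cj) for cj in columns] for ci in columns]


def correlation_matrix(columns):
    """Entry [i][j] is the covariance divided by both standard deviations."""
    cov = covariance_matrix(columns)
    k = len(columns)
    return [
        [cov[i][j] / sqrt(cov[i][i] * cov[j][j]) for j in range(k)]
        for i in range(k)
    ]


# Hypothetical worksheet: three numeric columns of equal length.
columns = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.0, 4.1, 5.9, 8.2, 9.8],
    [5.0, 3.0, 4.0, 1.0, 2.0],
]

cov = covariance_matrix(columns)
corr = correlation_matrix(columns)

for row in corr:
    print("  ".join(f"{value:+.3f}" for value in row))
```

    The diagonal of the correlation matrix is 1 and the matrix is symmetric, which is why such tables are often shown as a single color-coded triangle.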

    Using Fisher's z Transformation (zr)


    This option transforms correlation coefficients so that their skewed sampling distribution becomes approximately normal.
    • The theoretical sampling distribution of the correlation coefficient r can be approximated by the normal distribution when the population correlation ρ = 0, but as ρ moves away from zero, the sampling distribution of r becomes increasingly skewed. Fisher's z transformation converts this skewed sampling distribution into an approximately normal one.
    • The relationship between Pearson's product-moment correlation coefficient and the Fisher-transformed values is shown in the right-hand side image.
      The image below shows the Fisher-transformed values of the correlation matrix displayed above.
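    For reference, Fisher's transformation is z = artanh(r) = (1/2) ln((1 + r) / (1 − r)). The following minimal sketch applies it to a few coefficients; the function names are illustrative only. The transform is undefined at r = ±1, so the diagonal of a correlation matrix has to be skipped or handled separately.

```python
import math


def fisher_z(r):
    """Fisher's z transformation of a correlation coefficient, -1 < r < 1."""
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    return math.atanh(r)  # same as 0.5 * log((1 + r) / (1 - r))


def inverse_fisher_z(z):
    """Map a z value back to the correlation scale."""
    return math.tanh(z)


# Near zero the transform is almost the identity; near +/-1 it stretches.
for r in (0.0, 0.1, 0.5, 0.9, 0.99):
    print(f"r = {r:4.2f}  ->  z = {fisher_z(r):.4f}")
```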

    Pearson Product-Moment Correlation Coefficient (Pearson's r)



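    For a population, the Pearson correlation coefficient of two random variables X and Y is their covariance divided by the product of their standard deviations:
     \rho_{X,Y}={\mathrm{cov}(X,Y) \over \sigma_X \sigma_Y} ={E[(X-\mu_X)(Y-\mu_Y)] \over \sigma_X\sigma_Y}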

    For a sample

    Pearson's correlation coefficient when applied to a sample is commonly represented by the letter r and may be referred to as the sample correlation coefficient or the sample Pearson correlation coefficient. We can obtain a formula for r by substituting estimates of the covariances and variances based on a sample into the formula above. That formula for r is:
    r = \frac{\sum ^n _{i=1}(X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum ^n _{i=1}(X_i - \bar{X})^2} \sqrt{\sum ^n _{i=1}(Y_i - \bar{Y})^2}}
    An equivalent expression gives the correlation coefficient as the mean of the products of the standard scores. Based on a sample of paired data (X_i, Y_i), the sample Pearson correlation coefficient is
    r = \frac{1}{n-1} \sum ^n _{i=1} \left( \frac{X_i - \bar{X}}{s_X} \right) \left( \frac{Y_i - \bar{Y}}{s_Y} \right)
    where
    \frac{X_i - \bar{X}}{s_X}, \bar{X}=\frac{1}{n}\sum_{i=1}^n X_i, \text{ and } s_X=\sqrt{\frac{1}{n-1}\sum_{i=1}^n(X_i-\bar{X})^2}
    are the standard score, sample mean, and sample standard deviation, respectively.
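    The two expressions for r above can be checked against each other numerically. The sketch below implements both with the standard library; the data are made up for illustration.

```python
from math import sqrt


def pearson_r(x, y):
    """Sample Pearson r from sums of centered products."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sqrt(sxx) * sqrt(syy))


def pearson_r_standard_scores(x, y):
    """Same coefficient via products of standard scores, divided by n - 1."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = sqrt(sum((a - mx) ** 2 for a in x) / (n - 1))
    sy = sqrt(sum((b - my) ** 2 for b in y) / (n - 1))
    return sum(((a - mx) / sx) * ((b - my) / sy) for a, b in zip(x, y)) / (n - 1)


# Hypothetical paired sample.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r_sums = pearson_r(x, y)
r_scores = pearson_r_standard_scores(x, y)
print(r_sums, r_scores)
```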

    Mathematical properties

    The absolute values of both the sample and population Pearson correlation coefficients are less than or equal to 1. Correlations equal to 1 or −1 correspond to data points lying exactly on a line (in the case of the sample correlation), or to a bivariate distribution entirely supported on a line (in the case of the population correlation). The Pearson correlation coefficient is symmetric: corr(X,Y) = corr(Y,X).
    A key mathematical property of the Pearson correlation coefficient is that it is invariant (up to a sign) to separate changes in location and scale in the two variables. That is, we may transform X to a + bX and transform Y to c + dY, where a, b, c, and d are constants with b and d nonzero, without changing the magnitude of the correlation coefficient; the sign is preserved when b and d have the same sign and reversed otherwise (this fact holds for both the population and sample Pearson correlation coefficients). Note that more general linear transformations do change the correlation.
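    A quick numerical check of this invariance, using made-up data and arbitrary constants: rescaling both variables with positive scale factors leaves r unchanged, while a negative scale factor on one variable only flips its sign.

```python
from math import sqrt


def pearson_r(x, y):
    """Sample Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

r = pearson_r(x, y)

# X -> 3 + 2X and Y -> -1 + 0.5Y: both scale factors positive.
r_same = pearson_r([3 + 2 * v for v in x], [-1 + 0.5 * v for v in y])

# X -> 3 - 2X: a negative scale factor on one variable.
r_flipped = pearson_r([3 - 2 * v for v in x], y)

print(r, r_same, r_flipped)
```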
    The Pearson correlation can be expressed in terms of uncentered moments. Since \mu_X = E(X), \sigma_X^2 = E[(X - E(X))^2] = E(X^2) - (E(X))^2 and likewise for Y, and since
    E[(X-E(X))(Y-E(Y))]=E(XY)-E(X)E(Y),\,
    the correlation can also be written as
    \rho_{X,Y}=\frac{E(XY)-E(X)E(Y)}{\sqrt{E(X^2)-(E(X))^2}~\sqrt{E(Y^2)- (E(Y))^2}}.
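    Replacing each expectation with a sample average (a 1/n average, which is an assumption of this sketch) gives a one-pass style formula that agrees with the centered computation. In floating point, the uncentered form can lose precision when the means are large relative to the spread, so it is shown here only to illustrate the identity.

```python
from math import sqrt


def average(values):
    return sum(values) / len(values)


def pearson_uncentered(x, y):
    """Correlation from uncentered moments E(XY), E(X), E(Y), E(X^2), E(Y^2)."""
    e_x, e_y = average(x), average(y)
    e_xy = average([a * b for a, b in zip(x, y)])
    e_x2 = average([a * a for a in x])
    e_y2 = average([b * b for b in y])
    return (e_xy - e_x * e_y) / (sqrt(e_x2 - e_x ** 2) * sqrt(e_y2 - e_y ** 2))


def pearson_centered(x, y):
    """Correlation from centered sums, for comparison."""
    mx, my = average(x), average(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

r_uncentered = pearson_uncentered(x, y)
r_centered = pearson_centered(x, y)
print(r_uncentered, r_centered)
```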

    Spearman's rank correlation coefficient


    In applications where duplicate values (ties) are known to be absent, a simpler procedure can be used to calculate ρ.[3][4] Differences d_i = x_i - y_i between the ranks of each observation on the two variables are calculated, and ρ is given by:
     \rho = 1- {\frac {6 \sum d_i^2}{n(n^2 - 1)}}.
    Note that this latter method should not be used in cases where the data set is truncated; that is, when the Spearman correlation coefficient is desired for the top X records (whether by pre-change rank or post-change rank, or both), the user should use the Pearson correlation coefficient formula given above.
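    A minimal sketch of the no-ties formula in Python follows. The helper names are illustrative; the function refuses input with ties, because the shortcut formula is only valid when every value in each variable is distinct.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n (largest); ties are rejected."""
    if len(set(values)) != len(values):
        raise ValueError("ties present: the shortcut formula does not apply")
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho_no_ties(x, y):
    """Spearman's rho via 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))."""
    n = len(x)
    rank_x, rank_y = ranks(x), ranks(y)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


# Any strictly increasing relationship gives +1, strictly decreasing gives -1.
print(spearman_rho_no_ties([1, 2, 3, 4], [10, 100, 1000, 10000]))
print(spearman_rho_no_ties([1, 2, 3, 4], [8, 6, 4, 2]))
```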

    Example

    In this example, we will use the raw data in the table below to calculate the correlation between the IQ of a person and the number of hours spent in front of TV per week.
    IQ, X_i    Hours of TV per week, Y_i
    106         7
     86         0
    100        27
    101        50
     99        28
    103        29
     97        20
    113        12
    112         6
    110        17
    First, we must find the value of the term d^2_i. To do so we use the following steps, reflected in the table below.
    1. Sort the data by the first column (X_i). Create a new column x_i and assign it the ranked values 1,2,3,...n.
    2. Next, sort the data by the second column (Y_i). Create a fourth column y_i and similarly assign it the ranked values 1,2,3,...n.
    3. Create a fifth column d_i to hold the differences between the two rank columns (x_i and y_i).
    4. Create one final column d^2_i to hold the value of column d_i squared.
    IQ, X_i    Hours of TV per week, Y_i    rank x_i    rank y_i    d_i    d^2_i
     86         0                            1           1           0      0
     97        20                            2           6          −4     16
     99        28                            3           8          −5     25
    100        27                            4           7          −3      9
    101        50                            5          10          −5     25
    103        29                            6           9          −3      9
    106         7                            7           3           4     16
    110        17                            8           5           3      9
    112         6                            9           2           7     49
    113        12                           10           4           6     36
    With d^2_i found, we can add them to find \sum d_i^2 = 194. The value of n is 10. So these values can now be substituted back into the equation,
     \rho = 1- {\frac {6\times194}{10(10^2 - 1)}}
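    which evaluates to ρ = −29/165 = −0.175757575... with a two-sided P-value of 0.627188 (using the t distribution with n − 2 = 8 degrees of freedom).
    This value is close to zero, so the data show at most a weak negative monotonic association between IQ and hours spent watching TV, and the P-value gives no evidence that the correlation differs from zero.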
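    The worked example can be reproduced with a short script. The sketch below ranks the ten pairs, sums the squared rank differences, and evaluates the formula. It also computes the t statistic t = ρ·sqrt((n − 2)/(1 − ρ²)) and a two-sided P-value by numerically integrating the t density with Simpson's rule; that integration is a choice made for this sketch to stay within the standard library, not a method taken from the text.

```python
import math

# Raw data from the example, in the original (unsorted) order.
iq = [106, 86, 100, 101, 99, 103, 97, 113, 112, 110]
tv_hours = [7, 0, 27, 50, 28, 29, 20, 12, 6, 17]


def ranks(values):
    """Rank values from 1 (smallest) to n; assumes no ties, as in this data."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


n = len(iq)
d = [rx - ry for rx, ry in zip(ranks(iq), ranks(tv_hours))]
sum_d2 = sum(v * v for v in d)
rho = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

# t statistic with n - 2 degrees of freedom.
df = n - 2
t_stat = rho * math.sqrt(df / (1 - rho ** 2))


def t_pdf(u, dof):
    """Density of Student's t distribution with dof degrees of freedom."""
    c = math.gamma((dof + 1) / 2) / (math.sqrt(dof * math.pi) * math.gamma(dof / 2))
    return c * (1 + u * u / dof) ** (-(dof + 1) / 2)


def two_sided_p(t_value, dof, steps=2000):
    """P(|T| >= |t_value|) via Simpson's rule on [0, |t_value|]; steps is even."""
    b = abs(t_value)
    h = b / steps
    total = t_pdf(0.0, dof) + t_pdf(b, dof)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, dof)
    return 1 - 2 * (h / 3) * total


p_value = two_sided_p(t_stat, df)
print(f"sum d^2 = {sum_d2}, rho = {rho:.6f}, t = {t_stat:.4f}, p = {p_value:.6f}")
```

    The script recovers the same sum of squared differences (194) and the same coefficient (−29/165) as the table-based calculation above.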
