
Methods of correlation

    Correlation and Covariance Matrices


    You can generate a correlation or covariance matrix from numeric data columns, and choose either to store the computed results in an auto-generated worksheet or to display them in a table whose values can be color-coded.
    • This method requires multiple numeric data columns whose values should be stored in a single worksheet.
    • [Figure: example of a correlation matrix displayed as a color-coded table.]
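    As a rough illustration of the idea, a correlation and a covariance matrix can be computed from a set of numeric columns stored together; a minimal sketch using NumPy (the data are synthetic and the column structure is an assumption for the sketch):

```python
import numpy as np

# Three numeric data columns stored together, as the method requires.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 0.8 * x + rng.normal(scale=0.5, size=100)   # correlated with x
z = rng.normal(size=100)                         # independent noise

data = np.vstack([x, y, z])

corr = np.corrcoef(data)   # 3x3 correlation matrix
cov = np.cov(data)         # 3x3 covariance matrix

print(np.round(corr, 3))
```

    A plotting library's heat-map function can then render `corr` as the color-coded table described above.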

    Using Fisher's z Transformation (z_r)


    This option transforms the skewed sampling distribution of the correlation coefficient into an approximately normal one.
    • The sampling distribution of the correlation coefficient is approximately normal when the population correlation ρ = 0, but it becomes increasingly skewed as r moves away from zero. Fisher's z transformation converts this skewed sampling distribution into an approximately normal one.
    • Pearson's r and its Fisher-transformed value z_r are related by z_r = ½ ln((1 + r)/(1 − r)) = arctanh(r); the transformation stretches values of r near ±1 while leaving values near 0 almost unchanged.
    • [Figure: Fisher-transformed values of the correlation matrix from the previous section.]
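    In code, the transformation and its inverse reduce to the hyperbolic tangent and its inverse; a minimal sketch with NumPy (the example value r = 0.8 is arbitrary):

```python
import numpy as np

def fisher_z(r):
    """Fisher's z transformation: z_r = arctanh(r) = 0.5 * ln((1+r)/(1-r))."""
    return np.arctanh(r)

def inverse_fisher_z(z):
    """Map a transformed value back to a correlation: r = tanh(z)."""
    return np.tanh(z)

# The transformed values are approximately normal with standard error
# 1/sqrt(n - 3) for a sample of n pairs, which makes confidence
# intervals for r straightforward to construct.
r = 0.8
z = fisher_z(r)
print(z)  # ≈ 1.0986
```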

    Pearson Product-Moment Correlation Coefficient (Pearson's r)



    For a population

    Pearson's correlation coefficient for a population, denoted ρ, is defined as the covariance of the two variables divided by the product of their standard deviations:

     \rho_{X,Y}={\mathrm{cov}(X,Y) \over \sigma_X \sigma_Y} ={E[(X-\mu_X)(Y-\mu_Y)] \over \sigma_X\sigma_Y}

    For a sample

    Pearson's correlation coefficient when applied to a sample is commonly represented by the letter r and may be referred to as the sample correlation coefficient or the sample Pearson correlation coefficient. We can obtain a formula for r by substituting estimates of the covariances and variances based on a sample into the formula above. That formula for r is:
    r = \frac{\sum ^n _{i=1}(X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum ^n _{i=1}(X_i - \bar{X})^2} \sqrt{\sum ^n _{i=1}(Y_i - \bar{Y})^2}}
    An equivalent expression gives the correlation coefficient as the mean of the products of the standard scores. Based on a sample of paired data (X_i, Y_i), i = 1, ..., n, the sample Pearson correlation coefficient is
    r = \frac{1}{n-1} \sum ^n _{i=1} \left( \frac{X_i - \bar{X}}{s_X} \right) \left( \frac{Y_i - \bar{Y}}{s_Y} \right)
    where
    \frac{X_i - \bar{X}}{s_X}, \bar{X}=\frac{1}{n}\sum_{i=1}^n X_i, \text{ and } s_X=\sqrt{\frac{1}{n-1}\sum_{i=1}^n(X_i-\bar{X})^2}
    are the standard score, sample mean, and sample standard deviation, respectively.
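    The sample formula above can be coded directly; a minimal sketch (the data values are made up for illustration):

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson r via the covariance / (sd_x * sd_y) formula above."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8]
print(pearson_r(xs, ys))  # close to 1: nearly linear data
```

    The standard-score form in the text gives the same value, since the (n − 1) factors in the sample standard deviations cancel against the 1/(n − 1) in front of the sum.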

    Mathematical properties

    The absolute values of both the sample and population Pearson correlation coefficients are less than or equal to 1. Correlations equal to 1 or −1 correspond to data points lying exactly on a line (in the case of the sample correlation), or to a bivariate distribution entirely supported on a line (in the case of the population correlation). The Pearson correlation coefficient is symmetric: corr(X, Y) = corr(Y, X).
    A key mathematical property of the Pearson correlation coefficient is that it is invariant (up to a sign) under separate changes in location and scale in the two variables. That is, we may transform X to a + bX and transform Y to c + dY, where a, b, c, and d are constants with b, d ≠ 0, without changing the correlation coefficient beyond a sign flip when b and d have opposite signs (this holds for both the population and sample Pearson correlation coefficients). More general linear transformations do change the correlation: see a later section for an application of this.
    The Pearson correlation can be expressed in terms of uncentered moments. Since μ_X = E(X), σ_X^2 = E[(X − E(X))^2] = E(X^2) − (E(X))^2, and likewise for Y, and since
    E[(X-E(X))(Y-E(Y))]=E(XY)-E(X)E(Y),\,
    the correlation can also be written as
    \rho_{X,Y}=\frac{E(XY)-E(X)E(Y)}{\sqrt{E(X^2)-(E(X))^2}~\sqrt{E(Y^2)- (E(Y))^2}}.
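    The centered and uncentered forms can be checked against each other numerically; a small sketch treating a toy data set as the whole population (the values are arbitrary):

```python
import math

xs = [1.0, 2.0, 4.0, 4.5, 6.0]
ys = [3.0, 2.5, 5.0, 6.5, 7.0]
n = len(xs)

# Population moments of the toy data.
Ex = sum(xs) / n
Ey = sum(ys) / n
Exy = sum(x * y for x, y in zip(xs, ys)) / n
Ex2 = sum(x * x for x in xs) / n
Ey2 = sum(y * y for y in ys) / n

# Uncentered-moment form from the text.
rho = (Exy - Ex * Ey) / (math.sqrt(Ex2 - Ex ** 2) * math.sqrt(Ey2 - Ey ** 2))

# Centered form: cov(X, Y) / (sigma_X * sigma_Y).
cov = sum((x - Ex) * (y - Ey) for x, y in zip(xs, ys)) / n
sx = math.sqrt(sum((x - Ex) ** 2 for x in xs) / n)
sy = math.sqrt(sum((y - Ey) ** 2 for y in ys) / n)
rho_centered = cov / (sx * sy)

print(rho, rho_centered)  # the two forms agree
```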

    Spearman's rank correlation coefficient


    Spearman's rank correlation coefficient assesses how well the relationship between two variables can be described by a monotonic function; it is equivalent to the Pearson correlation computed on the ranks of the data. In applications where duplicate values (ties) are known to be absent, a simpler procedure can be used to calculate ρ.[3][4] The differences d_i = x_i − y_i between the ranks of each observation on the two variables are calculated, and ρ is given by:
     \rho = 1- {\frac {6 \sum d_i^2}{n(n^2 - 1)}}.
    Note that this shortcut formula should not be used when the data set is truncated; that is, when the Spearman correlation coefficient is desired for only the top X records (whether by pre-change rank, post-change rank, or both), the Pearson correlation formula given above should be applied to the ranks instead.
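    The rank-difference shortcut can be sketched as follows (assuming tie-free data, as the formula requires; the helper names and sample values are illustrative):

```python
def ranks(values):
    """Rank each value 1..n, assuming no ties (as the shortcut requires)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman_rho(xs, ys):
    """Spearman's rho from squared rank differences (tie-free data only)."""
    n = len(xs)
    rx, ry = ranks(xs), ranks(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# A mostly monotone relationship gives rho close to +1.
print(spearman_rho([1, 2, 3, 4, 5], [5, 6, 7, 8, 7.5]))  # → 0.9
```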

    Example

    In this example, we will use the raw data in the table below to calculate the correlation between a person's IQ and the number of hours spent in front of the TV per week.
    IQ, X_i | Hours of TV per week, Y_i
    106     | 7
    86      | 0
    100     | 27
    101     | 50
    99      | 28
    103     | 29
    97      | 20
    113     | 12
    112     | 6
    110     | 17
    First, we must find the values of the term d_i^2. To do so, we use the following steps, reflected in the table below.
    1. Sort the data by the first column (X_i). Create a new column x_i and assign it the ranked values 1, 2, 3, ..., n.
    2. Next, sort the data by the second column (Y_i). Create a fourth column y_i and similarly assign it the ranked values 1, 2, 3, ..., n.
    3. Create a fifth column d_i to hold the differences between the two rank columns (x_i and y_i).
    4. Create one final column d_i^2 to hold the values of column d_i squared.
    IQ, X_i | Hours of TV per week, Y_i | rank x_i | rank y_i | d_i | d_i^2
    86      | 0                         | 1        | 1        | 0   | 0
    97      | 20                        | 2        | 6        | −4  | 16
    99      | 28                        | 3        | 8        | −5  | 25
    100     | 27                        | 4        | 7        | −3  | 9
    101     | 50                        | 5        | 10       | −5  | 25
    103     | 29                        | 6        | 9        | −3  | 9
    106     | 7                         | 7        | 3        | 4   | 16
    110     | 17                        | 8        | 5        | 3   | 9
    112     | 6                         | 9        | 2        | 7   | 49
    113     | 12                        | 10       | 4        | 6   | 36
    With the d_i^2 values found, we can sum them to obtain \sum d_i^2 = 194. The value of n is 10, so these values can now be substituted back into the equation,
     \rho = 1- {\frac {6\times194}{10(10^2 - 1)}}
    which evaluates to ρ = −29/165 = −0.175757575... with a P-value ≈ 0.627 (using the t distribution).
    This value close to zero shows that the correlation between IQ and hours spent watching TV is very weak, and the high P-value indicates that it is not statistically distinguishable from zero.
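    The worked example can be verified in a few lines; a sketch that re-derives the ranks and substitutes into the shortcut formula (the ranking helper is illustrative and assumes no ties, which holds for this data set):

```python
iq    = [106, 86, 100, 101, 99, 103, 97, 113, 112, 110]
hours = [7, 0, 27, 50, 28, 29, 20, 12, 6, 17]
n = len(iq)

def ranks(values):
    """Rank each value 1..n (no ties in this data set)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

d2 = sum((a - b) ** 2 for a, b in zip(ranks(iq), ranks(hours)))
rho = 1 - 6 * d2 / (n * (n ** 2 - 1))

print(d2, rho)  # 194 and about -0.1758
```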
