

Correlation Methods 


    Correlation and Covariance Matrices


    You can generate a correlation or covariance matrix from numeric data columns, and you can choose to store the computed results in an autogenerated worksheet or to display them in a table whose values can be color coded.
    • This method requires multiple numeric data columns whose values should be stored in a single worksheet.
    • An example of a correlation matrix displayed as a color-coded table is shown below; a minimal code sketch of the underlying computation follows this list.
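    The sketch below is an illustrative example only (the column names and values are made up), showing how the same correlation and covariance matrices could be computed with pandas rather than with the worksheet tool described above:

        import pandas as pd

        # Hypothetical worksheet with three numeric columns (names and values are illustrative only).
        data = pd.DataFrame({
            "height": [162, 175, 168, 181, 159, 172],
            "weight": [61.0, 78.5, 70.2, 84.1, 55.9, 74.3],
            "age":    [23, 35, 29, 41, 21, 33],
        })

        corr_matrix = data.corr(method="pearson")   # correlation matrix
        cov_matrix = data.cov()                     # covariance matrix

        print(corr_matrix.round(3))
        print(cov_matrix.round(3))

        # In a notebook, corr_matrix.style.background_gradient() renders a
        # color-coded view, similar in spirit to the table mentioned above.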

    Using Fisher's z Transformation (z_r)


    This option transforms the skewed sampling distribution of the correlation coefficient into an approximately normal one.
    • The theoretical sampling distribution of the correlation coefficient is well approximated by the normal distribution when the population correlation ρ = 0, but as r deviates from zero the sampling distribution becomes increasingly skewed. Fisher's z transformation maps this skewed distribution onto an approximately normal one.
    • The relationship between Pearson's product-moment correlation coefficient and the Fisher-transformed values is shown in the image on the right.
      The image below shows the Fisher-transformed values of the correlation matrix displayed above; a short code sketch of the transformation follows.
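    As an illustrative sketch (not the software's own routine), Fisher's transformation z_r = arctanh(r) = (1/2) ln((1 + r)/(1 − r)) and its inverse tanh can be applied with numpy; the sample size n below is an assumed value:

        import numpy as np

        r = np.array([-0.9, -0.5, 0.0, 0.5, 0.9])   # sample correlation coefficients

        # Fisher's z transformation: z_r = arctanh(r) = 0.5 * ln((1 + r) / (1 - r))
        z = np.arctanh(r)

        # The transformed values are approximately normal with standard error
        # 1 / sqrt(n - 3); n = 30 is an assumed sample size for this sketch.
        n = 30
        se = 1.0 / np.sqrt(n - 3)

        # 95% confidence intervals for rho, built in z space and mapped back with tanh.
        lower = np.tanh(z - 1.96 * se)
        upper = np.tanh(z + 1.96 * se)

        for ri, lo, hi in zip(r, lower, upper):
            print(f"r = {ri:+.2f}  ->  95% CI for rho: ({lo:+.3f}, {hi:+.3f})")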

    Pearson Product-Moment Correlation Coefficient (Pearson's r)



    For a population, Pearson's correlation coefficient ρ is defined as the covariance of the two variables divided by the product of their standard deviations:
     \rho_{X,Y} = \frac{\mathrm{cov}(X,Y)}{\sigma_X \sigma_Y} = \frac{E[(X-\mu_X)(Y-\mu_Y)]}{\sigma_X \sigma_Y}

    For a sample

    Pearson's correlation coefficient when applied to a sample is commonly represented by the letter r and may be referred to as the sample correlation coefficient or the sample Pearson correlation coefficient. We can obtain a formula for r by substituting estimates of the covariances and variances based on a sample into the formula above. That formula for r is:
    r = \frac{\sum ^n _{i=1}(X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum ^n _{i=1}(X_i - \bar{X})^2} \sqrt{\sum ^n _{i=1}(Y_i - \bar{Y})^2}}
    An equivalent expression gives the correlation coefficient as the mean of the products of the standard scores. Based on a sample of paired data (X_i, Y_i), the sample Pearson correlation coefficient is
    r = \frac{1}{n-1} \sum ^n _{i=1} \left( \frac{X_i - \bar{X}}{s_X} \right) \left( \frac{Y_i - \bar{Y}}{s_Y} \right)
    where
    \frac{X_i - \bar{X}}{s_X}, \bar{X}=\frac{1}{n}\sum_{i=1}^n X_i, \text{ and } s_X=\sqrt{\frac{1}{n-1}\sum_{i=1}^n(X_i-\bar{X})^2}
    are the standard score, sample mean, and sample standard deviation, respectively.
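    To make the two equivalent sample formulas concrete, the following sketch (with made-up data) computes r both ways and checks the result against numpy.corrcoef:

        import numpy as np

        # Made-up paired sample (X_i, Y_i) for illustration.
        x = np.array([2.0, 4.0, 5.0, 7.0, 9.0])
        y = np.array([1.5, 3.1, 4.9, 6.2, 8.8])
        n = len(x)

        # Formula 1: sum of cross-products over the product of root sums of squares.
        dx, dy = x - x.mean(), y - y.mean()
        r1 = np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2))

        # Formula 2: mean (with the n - 1 divisor) of products of standard scores,
        # using the sample standard deviations s_X and s_Y (ddof=1).
        zx = dx / x.std(ddof=1)
        zy = dy / y.std(ddof=1)
        r2 = np.sum(zx * zy) / (n - 1)

        print(r1, r2, np.corrcoef(x, y)[0, 1])   # all three values agree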

    Mathematical properties

    The absolute values of both the sample and population Pearson correlation coefficients are less than or equal to 1. Correlations equal to 1 or -1 correspond to data points lying exactly on a line (in the case of the sample correlation), or to a bivariate distribution entirely supported on a line (in the case of the population correlation). The Pearson correlation coefficient is symmetric: corr(X,Y) = corr(Y,X).
    A key mathematical property of the Pearson correlation coefficient is that it is invariant (up to a sign) to separate changes in location and scale in the two variables. That is, we may transform X to a + bX and transform Y to c + dY, where a, b, c, and d are constants with b and d nonzero, without changing the correlation coefficient (this fact holds for both the population and sample Pearson correlation coefficients). Note that more general linear transformations do change the correlation.
    The Pearson correlation can be expressed in terms of uncentered moments. Since \mu_X = E(X), \sigma_X^2 = E[(X - E(X))^2] = E(X^2) - (E(X))^2, and likewise for Y, and since
    E[(X-E(X))(Y-E(Y))]=E(XY)-E(X)E(Y),\,
    the correlation can also be written as
    \rho_{X,Y}=\frac{E(XY)-E(X)E(Y)}{\sqrt{E(X^2)-(E(X))^2}~\sqrt{E(Y^2)- (E(Y))^2}}.
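    These properties can be checked numerically; the sketch below (with arbitrary simulated data and arbitrary constants a, b, c, d) verifies that an affine change of location and scale leaves |r| unchanged and that the uncentered-moment expression reproduces the same value:

        import numpy as np

        rng = np.random.default_rng(0)
        x = rng.normal(size=200)
        y = 0.6 * x + rng.normal(size=200)

        r = np.corrcoef(x, y)[0, 1]

        # Invariance up to sign: transform X -> a + bX and Y -> c + dY (b, d nonzero).
        a, b, c, d = 5.0, -2.0, -1.0, 3.0
        r_transformed = np.corrcoef(a + b * x, c + d * y)[0, 1]
        print(r, r_transformed)   # equal in magnitude; the sign follows sign(b * d)

        # Uncentered-moment form:
        # (E[XY] - E[X]E[Y]) / (sqrt(E[X^2] - E[X]^2) * sqrt(E[Y^2] - E[Y]^2))
        exy, ex, ey = np.mean(x * y), np.mean(x), np.mean(y)
        ex2, ey2 = np.mean(x**2), np.mean(y**2)
        r_moments = (exy - ex * ey) / (np.sqrt(ex2 - ex**2) * np.sqrt(ey2 - ey**2))
        print(r_moments)          # agrees with np.corrcoef up to floating-point error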

    Spearman's rank correlation coefficient


    The Spearman correlation coefficient is defined as the Pearson correlation coefficient applied to the ranks of the two variables. In applications where duplicate values (ties) are known to be absent, a simpler procedure can be used to calculate ρ.[3][4] Differences d_i = x_i - y_i between the ranks of each observation on the two variables are calculated, and ρ is given by:
     \rho = 1- {\frac {6 \sum d_i^2}{n(n^2 - 1)}}.
    Note that this latter method should not be used in cases where the data set is truncated; that is, when the Spearman correlation coefficient is desired for the top X records (whether by pre-change rank or post-change rank, or both), the user should instead apply the Pearson correlation coefficient formula given above to the ranks.
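    As a minimal sketch of the simplified formula (assuming no ties; the function name is made up for illustration), it can be coded directly:

        import numpy as np
        from scipy import stats

        def spearman_rho_no_ties(x, y):
            """Spearman's rho via the rank-difference formula (valid only when there are no ties)."""
            x, y = np.asarray(x), np.asarray(y)
            n = len(x)
            d = stats.rankdata(x) - stats.rankdata(y)   # rank differences d_i
            return 1 - 6 * np.sum(d**2) / (n * (n**2 - 1))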

    Example

    In this example, we will use the raw data in the table below to calculate the correlation between a person's IQ and the number of hours spent watching TV per week.
    IQ, X_i   Hours of TV per week, Y_i
    106       7
    86        0
    100       27
    101       50
    99        28
    103       29
    97        20
    113       12
    112       6
    110       17
    First, we must find the values of the terms d^2_i. To do so, we use the following steps, reflected in the table below.
    1. Sort the data by the first column (X_i). Create a new column x_i and assign it the ranked values 1,2,3,...n.
    2. Next, sort the data by the second column (Y_i). Create a fourth column y_i and similarly assign it the ranked values 1,2,3,...n.
    3. Create a fifth column d_i to hold the differences between the two rank columns (x_i and y_i).
    4. Create one final column d^2_i to hold the value of column d_i squared.
    IQ, X_i   Hours of TV per week, Y_i   rank x_i   rank y_i   d_i   d^2_i
    86        0                           1          1          0     0
    97        20                          2          6          -4    16
    99        28                          3          8          -5    25
    100       27                          4          7          -3    9
    101       50                          5          10         -5    25
    103       29                          6          9          -3    9
    106       7                           7          3          4     16
    110       17                          8          5          3     9
    112       6                           9          2          7     49
    113       12                          10         4          6     36
    With d^2_i found, we can add them to find \sum d_i^2 = 194. The value of n is 10. So these values can now be substituted back into the equation,
     \rho = 1- {\frac {6\times194}{10(10^2 - 1)}}
    which evaluates to ρ = −29/165 = −0.175757575..., with a two-sided P-value of 0.627188 (using the t distribution).
    This value, close to zero, shows that the correlation between IQ and hours spent watching TV in this sample is very weak. 
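    The worked example can be reproduced in a few lines; the sketch below ranks both columns, applies the rank-difference formula, and cross-checks the result (and its p-value) against scipy.stats.spearmanr:

        import numpy as np
        from scipy import stats

        iq = np.array([106, 86, 100, 101, 99, 103, 97, 113, 112, 110])
        tv = np.array([7, 0, 27, 50, 28, 29, 20, 12, 6, 17])
        n = len(iq)

        # Ranks 1..n for each column (there are no ties in this data set).
        rank_x = stats.rankdata(iq)
        rank_y = stats.rankdata(tv)

        d = rank_x - rank_y
        rho = 1 - 6 * np.sum(d**2) / (n * (n**2 - 1))
        print(np.sum(d**2), rho)        # 194.0 and -0.17575...

        # Cross-check; spearmanr also reports the two-sided p-value from the t distribution.
        print(stats.spearmanr(iq, tv))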
