Correlation coefficient
A correlation coefficient is a numerical measure of some type of linear correlation, meaning a statistical relationship between two variables.[a] The variables may be two columns of a given data set of observations, often called a sample, or two components of a multivariate random variable with a known distribution.
Several types of correlation coefficient exist, each with its own definition, range of applicability, and characteristics. They all assume values in the range from −1 to +1, where ±1 indicates the strongest possible correlation and 0 indicates no correlation.[2] As tools of analysis, correlation coefficients present certain problems, including the propensity of some types to be distorted by outliers and the possibility of incorrectly being used to infer a causal relationship between the variables (for more, see Correlation does not imply causation).[3]
Types
There are several different measures for the degree of correlation in data, depending on the kind of data: principally whether the data is a measurement, ordinal, or categorical.
Pearson
The Pearson product-moment correlation coefficient, also known as r, R, or Pearson's r, is a measure of the strength and direction of the linear relationship between two variables that is defined as the covariance of the variables divided by the product of their standard deviations.[4] This is the best-known and most commonly used type of correlation coefficient. When the term "correlation coefficient" is used without further qualification, it usually refers to the Pearson product-moment correlation coefficient.
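As a minimal sketch of this definition (the data vectors below are invented for illustration), Pearson's r can be computed directly as the sample covariance divided by the product of the sample standard deviations, and checked against NumPy's equivalent built-in routine:

```python
import numpy as np

# Invented sample data; any two numeric vectors of equal length work.
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([1.5, 3.1, 5.9, 8.2, 9.8])

# Pearson's r: covariance divided by the product of standard deviations.
cov_xy = np.cov(x, y, ddof=1)[0, 1]
r = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))

# np.corrcoef applies the same definition; the two values agree.
assert np.isclose(r, np.corrcoef(x, y)[0, 1])
print(r)
```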
Intra-class
Intraclass correlation (ICC) is a descriptive statistic that can be used when quantitative measurements are made on units that are organized into groups; it describes how strongly units in the same group resemble each other.
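One common estimator is the one-way random-effects ICC(1), computed from the between-group and within-group mean squares of a one-way ANOVA. The sketch below is a minimal illustration assuming a balanced design (equal group sizes); the grouped data are invented:

```python
import numpy as np

# Invented example: repeated measurements grouped by unit.
groups = [np.array([9.0, 10.0, 11.0]),
          np.array([4.0, 5.0, 6.0]),
          np.array([7.0, 8.0, 9.0])]

k = len(groups[0])                      # measurements per group (balanced design)
n = len(groups)                         # number of groups
grand_mean = np.mean(np.concatenate(groups))

# Between-group and within-group mean squares from a one-way ANOVA.
ss_between = k * sum((g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
ms_between = ss_between / (n - 1)
ms_within = ss_within / (n * (k - 1))

# One-way random-effects ICC(1): resemblance of units within the same group.
icc1 = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
print(icc1)   # about 0.857 for this invented data
```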
Rank
Rank correlation is a measure of the relationship between the rankings of two variables, or two rankings of the same variable (a short SciPy-based sketch follows this list):
- Spearman's rank correlation coefficient is a measure of how well the relationship between two variables can be described by a monotonic function.
- The Kendall tau rank correlation coefficient is a measure of the portion of ranks that match between two data sets.
- Goodman and Kruskal's gamma is a measure of the strength of association of the cross tabulated data when both variables are measured at the ordinal level.
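Both of the first two rank coefficients are available in SciPy. The sketch below (with invented data) shows that a perfectly monotonic but non-linear relationship attains the maximum value of 1 for both:

```python
from scipy import stats

# Invented data: a monotonic but non-linear relationship.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]          # y = x**2, perfectly monotonic in x

# Spearman's rho: Pearson correlation of the ranks; equals 1.0 for
# any perfectly monotonic increasing relationship.
rho, _ = stats.spearmanr(x, y)

# Kendall's tau: based on concordant vs. discordant pairs of ranks.
tau, _ = stats.kendalltau(x, y)

print(rho, tau)                # both are 1.0 here
```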
Tetrachoric and polychoric
The polychoric correlation coefficient measures association between two ordered-categorical variables. It is technically defined as the estimate of the Pearson correlation coefficient one would obtain if:
- The two variables were measured on a continuous scale, instead of as ordered-category variables.
- The two continuous variables followed a bivariate normal distribution.
When both variables are dichotomous instead of ordered-categorical, the polychoric correlation coefficient is called the tetrachoric correlation coefficient.
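Full estimation of the tetrachoric coefficient maximizes a bivariate-normal likelihood, for which the standard Python scientific stack has no single built-in routine. As a hedged sketch, the classical cosine ("cos-π") approximation based on the odds ratio of a 2×2 table can be used instead; the cell counts below are invented:

```python
import math

# Invented 2x2 table of two dichotomized variables:
#              Y=0   Y=1
#   X=0         a     b
#   X=1         c     d
a, b, c, d = 40, 10, 15, 35

# Classical cos-pi approximation to the tetrachoric correlation,
# driven by the odds ratio ad/bc; exact estimation would instead
# maximize a bivariate-normal likelihood over the table.
odds_ratio = (a * d) / (b * c)
r_tet = math.cos(math.pi / (1 + math.sqrt(odds_ratio)))
print(r_tet)   # about 0.71 for this invented table
```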
Interpreting correlation coefficient values
The strength of the association between two variables is described by values such as r or R. Correlation values range from −1 to +1, where ±1 indicates the strongest possible correlation and 0 indicates no correlation between variables.[5] The table below gives one conventional labelling; a short lookup sketch follows it.
| Positive r or R | Negative r or R | Strength or weakness of association between variables[6] |
|---|---|---|
| +1.0 to +0.8 | -1.0 to -0.8 | Perfect or very strong association |
| +0.8 to +0.6 | -0.8 to -0.6 | Strong association |
| +0.6 to +0.4 | -0.6 to -0.4 | Moderate association |
| +0.4 to +0.2 | -0.4 to -0.2 | Weak association |
| +0.2 to 0.0 | -0.2 to 0.0 | Very weak or no association |
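For convenience, the table can be encoded as a small lookup function. How ties at bin boundaries are resolved is a choice made here (assigned to the stronger label), since the table above leaves it ambiguous:

```python
def association_strength(r: float) -> str:
    """Map a correlation coefficient to the conventional labels in the
    table above (boundary values assigned to the stronger bin)."""
    a = abs(r)
    if a > 1.0:
        raise ValueError("correlation coefficients lie in [-1, 1]")
    if a >= 0.8:
        return "Perfect or very strong association"
    if a >= 0.6:
        return "Strong association"
    if a >= 0.4:
        return "Moderate association"
    if a >= 0.2:
        return "Weak association"
    return "Very weak or no association"

print(association_strength(0.996))   # "Perfect or very strong association"
print(association_strength(-0.35))   # "Weak association"
```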
See also
- Correlation disattenuation
- Coefficient of determination
- Correlation and dependence
- Correlation ratio
- Distance correlation
- Goodness of fit, any of several measures of how well a statistical model fits observations, summarizing the discrepancy between observed values and the values expected under the model
- Multiple correlation
- Partial correlation
Notes
- ^ Correlation coefficient: A statistic used to show how the scores from one measure relate to scores on a second measure for the same group of individuals. A high value (approaching +1.00) indicates a strong direct relationship, values near 0.50 are considered moderate, and values below 0.30 are considered to show a weak relationship. A low negative value (approaching −1.00) similarly indicates a strong inverse relationship, and values near 0.00 indicate little, if any, relationship.[1]
References
[edit]- ^ "correlation coefficient". NCME.org. National Council on Measurement in Education. Archived from the original on July 22, 2017. Retrieved April 17, 2014.
- ^ Taylor, John R. (1997). An Introduction to Error Analysis: The Study of Uncertainties in Physical Measurements (PDF) (2nd ed.). Sausalito, CA: University Science Books. p. 217. ISBN 0-935702-75-X. Archived from the original (PDF) on 15 February 2019. Retrieved 14 February 2019.
- ^ Boddy, Richard; Smith, Gordon (2009). Statistical Methods in Practice: For scientists and technologists. Chichester, U.K.: Wiley. pp. 95–96. ISBN 978-0-470-74664-6.
- ^ Weisstein, Eric W. "Statistical Correlation". mathworld.wolfram.com. Retrieved 2020-08-22.
- ^ Taylor, John R. (1997). An Introduction to Error Analysis: The Study of Uncertainties in Physical Measurements (PDF) (2nd ed.). Sausalito, CA: University Science Books. p. 217. ISBN 0-935702-75-X. Archived from the original (PDF) on 15 February 2019. Retrieved 14 February 2019.
- ^ "The Correlation Coefficient (r)". Boston University.
Fundamentals
Definition
In statistics, correlation refers to a measure of statistical dependence between two random variables, indicating how they tend to vary together without implying causation, as a relationship may arise from confounding factors or coincidence rather than one variable directly influencing the other.[11] This dependence can manifest as linear or monotonic associations, where changes in one variable are systematically accompanied by changes in the other, either in the same direction (positive) or opposite direction (negative). Correlation coefficients standardize this relationship to provide a dimensionless quantity that facilitates comparison across different datasets or scales.

To understand correlation, it is essential to first consider prerequisite concepts such as random variables, which are variables whose values are determined by outcomes of a random process, and covariance, an unnormalized measure of the joint variability between two such variables that quantifies how they deviate from their expected values in tandem.[12] Covariance captures the direction and magnitude of this co-variation but is sensitive to the units of measurement, making it less comparable across contexts; correlation coefficients address this by normalizing covariance relative to the individual variabilities of the variables involved.[13]

The correlation coefficient typically ranges from −1 to +1, where a value of +1 signifies perfect positive association (both variables increase together), −1 indicates perfect negative association (one increases as the other decreases), and 0 suggests no linear association, though non-linear dependencies may still exist.[14] This bounded scale allows for intuitive interpretation of the strength and direction of the relationship. The concept was introduced by Francis Galton in the late 1880s as part of his work on regression and heredity, with Karl Pearson providing a formal mathematical definition in the 1890s, establishing it as a cornerstone of statistical analysis.[15][16] The Pearson correlation coefficient serves as the most common example of this measure in practice.[17]
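To make the normalization concrete, the sketch below (with invented data) shows that covariance changes when a variable is rescaled, e.g. converted to different units, while the correlation coefficient does not:

```python
import numpy as np

# Invented paired data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.9])

cov = np.cov(x, y, ddof=1)[0, 1]
corr = np.corrcoef(x, y)[0, 1]

# Rescale x (e.g., a change of units): covariance scales with it,
# but the correlation coefficient is unchanged.
x_scaled = 100 * x
print(np.cov(x_scaled, y, ddof=1)[0, 1] / cov)   # 100.0
print(np.corrcoef(x_scaled, y)[0, 1] - corr)     # ~0.0
```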
General Properties

Correlation coefficients exhibit several fundamental mathematical properties that make them useful for measuring associations between variables. The population correlation coefficient, denoted by the Greek letter $\rho$, quantifies the true linear relationship between two random variables in the entire population, while the sample correlation coefficient, denoted $r$, serves as an estimate of $\rho$ based on observed data from a finite sample.[18] This distinction is crucial because $r$ is subject to sampling variability and converges to $\rho$ as the sample size increases.[8]

A key property is the decomposition of the correlation coefficient in terms of covariance and standard deviations. Specifically, the population correlation is given by

$$\rho_{X,Y} = \frac{\operatorname{cov}(X,Y)}{\sigma_X \sigma_Y}$$

where $\operatorname{cov}(X,Y)$ is the covariance between $X$ and $Y$, and $\sigma_X$ and $\sigma_Y$ are their respective standard deviations.[19] This relation standardizes the covariance, rendering the correlation coefficient dimensionless and independent of the units of measurement for the variables. The sample analog follows the same form, replacing population parameters with sample estimates.[20]

Due to this standardization, correlation coefficients are bounded between −1 and +1, with values of ±1 indicating perfect positive or negative linear relationships, 0 indicating no linear association, and intermediate values reflecting the strength and direction of the linear dependence.[19] Additionally, the coefficient is symmetric, such that $\rho_{X,Y} = \rho_{Y,X}$, and invariant under linear transformations of the variables, meaning that affine shifts (adding constants) or scalings (multiplying by positive constants) do not alter its value.[21][22] These properties hold for standardized measures like the Pearson correlation coefficient.[23]

However, these properties come with limitations: correlation coefficients are designed to detect linear associations and may produce low values even for strong nonlinear relationships, failing to capture dependencies that deviate from linearity.[24] For instance, variables related through a quadratic or exponential function might yield a correlation near zero despite a clear pattern.[25]
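The linearity caveat is easy to demonstrate. In the sketch below (invented data), a perfect quadratic dependence yields a Pearson correlation of exactly zero:

```python
import numpy as np

# A perfect but non-linear (quadratic) dependence.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = x ** 2

# Pearson's r is 0 here: y is fully determined by x,
# but the relationship has no linear component.
print(np.corrcoef(x, y)[0, 1])   # 0.0
```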
Pearson Correlation Coefficient
Formula and Computation
The Pearson correlation coefficient, denoted $\rho_{X,Y}$ for a population, measures the linear relationship between two random variables $X$ and $Y$. It is defined as

$$\rho_{X,Y} = \frac{\operatorname{cov}(X,Y)}{\sigma_X \sigma_Y} = \frac{\mathbb{E}[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y}$$

where $\operatorname{cov}(X,Y)$ is the covariance, $\sigma_X$ and $\sigma_Y$ are the standard deviations, $\mu_X$ and $\mu_Y$ are the means, and $\mathbb{E}$ denotes the expected value.[26][27]

For a sample of $n$ paired observations $(x_1, y_1), \ldots, (x_n, y_n)$, the sample Pearson correlation coefficient $r$ estimates $\rho$ using

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$$

where $\bar{x}$ and $\bar{y}$ are the sample means. This formula arises from the sample covariance divided by the product of sample standard deviations; the sample covariance is typically computed as $\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$ to provide an unbiased estimate of the population covariance, while the sample variances use the same $n-1$ denominator for unbiasedness. In the correlation formula, the $\frac{1}{n-1}$ terms cancel out, yielding the expression above, which is consistent but slightly biased as an estimator of $\rho$.[4]

To compute $r$, first calculate the sample means $\bar{x}$ and $\bar{y}$. Next, center the data by subtracting these means to obtain deviations $x_i - \bar{x}$ and $y_i - \bar{y}$. Then, compute the numerator as the sum of the products of these deviations, which estimates the covariance (up to a factor of $n-1$). Finally, compute the denominator as the square root of the product of the sums of squared deviations, which are proportional to the sample variances. Dividing yields $r$, which ranges from −1 to 1.[4]

Consider a small dataset of paired observations on heights (in inches) and weights (in pounds) for illustration: (60, 120), (62, 125), (65, 130), (68, 135).

| $i$ | Height $x_i$ | Weight $y_i$ | $x_i - \bar{x}$ | $y_i - \bar{y}$ | $(x_i - \bar{x})(y_i - \bar{y})$ | $(x_i - \bar{x})^2$ | $(y_i - \bar{y})^2$ |
|---|---|---|---|---|---|---|---|
| 1 | 60 | 120 | -3.75 | -7.5 | 28.125 | 14.0625 | 56.25 |
| 2 | 62 | 125 | -1.75 | -2.5 | 4.375 | 3.0625 | 6.25 |
| 3 | 65 | 130 | 1.25 | 2.5 | 3.125 | 1.5625 | 6.25 |
| 4 | 68 | 135 | 4.25 | 7.5 | 31.875 | 18.0625 | 56.25 |
| Sum | 255 | 510 | 0 | 0 | 67.5 | 36.75 | 125 |
Substituting the column sums gives $r = \frac{67.5}{\sqrt{36.75 \times 125}} \approx 0.996$, indicating a very strong positive linear relationship. In R, the cor() function from the base stats package calculates $r$ for vectors x and y using the formula above. Similarly, in Python, the scipy.stats.pearsonr(x, y) function from SciPy provides $r$ and its p-value.
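The worked example can be checked directly. The sketch below recomputes $r$ from the table's deviations, step by step, and compares it with SciPy's built-in routine:

```python
import numpy as np
from scipy import stats

# The height/weight pairs from the worked example above.
height = np.array([60.0, 62.0, 65.0, 68.0])
weight = np.array([120.0, 125.0, 130.0, 135.0])

# Manual computation following the step-by-step procedure:
# center the data, sum the products, divide by the root of
# the product of the sums of squared deviations.
dx = height - height.mean()
dy = weight - weight.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Library computation for comparison.
r_scipy, p_value = stats.pearsonr(height, weight)

print(r_manual, r_scipy)   # both approximately 0.996
```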
