Pearson's Product-Moment Correlation Coefficient

A statistical measure that quantifies the strength and direction of the linear relationship between two continuous interval or ratio variables.

r = \frac{n \sum x y - ( \sum x ) ( \sum y )}{\sqrt{[ n \sum x^{2} - ( \sum x )^{2} ] [ n \sum y^{2} - ( \sum y )^{2} ]}}

Core idea

Overview

Pearson's r takes a value between -1 and +1: +1 indicates a perfect positive linear relationship, -1 a perfect negative linear relationship, and 0 no linear correlation. In geographical research it is used to test hypotheses about how two variables, such as distance from a CBD and property prices, covary across a landscape. The coefficient measures only linear association, and significance tests on r additionally assume that the data are approximately normally distributed.

When to use: Use when analyzing two sets of interval or ratio data to determine if a linear trend exists between them.

Why it matters: It allows geographers to move beyond visual inspection of scatter graphs and put a number on the relationship between environmental or social variables, which can then be tested for statistical significance.

Symbols

Variables

r = Correlation coefficient
n = Sample size
x = Variable 1 data points
y = Variable 2 data points

Walkthrough

Derivation

Derivation of Pearson's Product-Moment Correlation Coefficient

The formula is derived from the definition of the correlation coefficient as the covariance of two variables divided by the product of their standard deviations. It simplifies the algebraic expression of the Pearson coefficient for easier computational use.

The relationship between the two variables is linear.
The data points are paired as (x, y) observations.
The variables are measured on an interval or ratio scale.

Defining the Correlation Coefficient

Start with the population definition where r is the covariance divided by the product of the standard deviations.

r = \frac{Cov ( X , Y )}{\sigma_{x} \sigma_{y}} = \frac{\frac{1}{n} \sum ( x_{i} - \bar{x} ) ( y_{i} - \bar{y} )}{\sqrt{\frac{1}{n} \sum ( x_{i} - \bar{x} )^{2}} \sqrt{\frac{1}{n} \sum ( y_{i} - \bar{y} )^{2}}}

Note: The 1/n factors cancel during simplification, because the two square-rooted 1/n factors in the denominator multiply to a single 1/n.
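As a minimal sketch of this definition, assuming plain Python lists of paired values (the function name `pearson_r_definition` and the sample data are illustrative, not taken from the specification), r can be computed directly as the covariance divided by the product of the standard deviations:

```python
import math

def pearson_r_definition(xs, ys):
    """Pearson's r as covariance over the product of standard deviations."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("xs and ys must be paired and hold at least two points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Population covariance: mean product of the paired deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    # Population standard deviations of each variable.
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    if sd_x == 0 or sd_y == 0:
        raise ValueError("r is undefined when either variable is constant")
    return cov / (sd_x * sd_y)

# Made-up paired observations, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]
print(pearson_r_definition(xs, ys))
```

For these made-up values the result is approximately 0.8, a fairly strong positive relationship.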

Expanding the Covariance Term

Expand the brackets and apply the summation to each term, using the property that the sum of the mean is n times the mean.

\sum (x_{i} - \bar{x}) (y_{i} - \bar{y}) = \sum (x_{i} y_{i} - x_{i} \bar{y} - y_{i} \bar{x} + \bar{x} \bar{y}) = \sum x_{i} y_{i} - n \bar{x} \bar{y}

Note: Recall that $n \bar{x} = \sum x$ and $n \bar{y} = \sum y$.

Simplifying the Covariance Expression

Substitute the definitions of the means (x-bar and y-bar) into the expanded covariance expression to clear the denominators.

\sum x_{i} y_{i} - n (\frac{\sum x}{n}) (\frac{\sum y}{n}) = \frac{n \sum x y - ( \sum x ) ( \sum y )}{n}

Note: This creates the numerator of the final formula.
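The identity can be checked numerically. The sketch below, which uses hypothetical helper names and made-up data, computes the sum of cross-deviations both ways, from the deviations themselves and from the shortcut expression:

```python
def cross_deviation_sum(xs, ys):
    """Sum of (x - mean_x)(y - mean_y), computed from the deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

def cross_deviation_shortcut(xs, ys):
    """The same quantity via (n * sum(xy) - sum(x) * sum(y)) / n."""
    n = len(xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    return (n * sum_xy - sum(xs) * sum(ys)) / n

# Made-up paired observations, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]
print(cross_deviation_sum(xs, ys), cross_deviation_shortcut(xs, ys))
```

Both calls give 8 for this data, which is why the shortcut can replace the deviation-based sum in hand calculations.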

Simplifying the Variance Denominator

Apply the same algebraic expansion to the variance terms for x and y. When substituted back into the denominator, the 'n' factors cancel out.

\sum (x_{i} - \bar{x})^{2} = \sum x^{2} - n \bar{x}^{2} = \sum x^{2} - \frac{( \sum x )^{2}}{n} = \frac{n \sum x^{2} - ( \sum x )^{2}}{n}

Note: Calculate the sum of the squares, $\sum x^{2}$, and the square of the sum, $(\sum x)^{2}$, separately to avoid errors.
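A short sketch makes the distinction concrete. The function names and the data here are assumptions chosen for illustration:

```python
def sum_squared_deviations(xs):
    """Sum of (x - mean)^2, computed from the deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    return sum((x - mean_x) ** 2 for x in xs)

def sum_squared_deviations_shortcut(xs):
    """The same quantity via (n * sum(x^2) - (sum x)^2) / n."""
    n = len(xs)
    sum_of_squares = sum(x ** 2 for x in xs)  # square each value, then add
    square_of_sum = sum(xs) ** 2              # add the values, then square
    return (n * sum_of_squares - square_of_sum) / n

xs = [1, 2, 3, 4, 5]
print(sum(x ** 2 for x in xs))              # sum of squares: 55
print(sum(xs) ** 2)                         # square of sum: 225
print(sum_squared_deviations(xs))           # 10.0
print(sum_squared_deviations_shortcut(xs))  # 10.0
```

Swapping the two quantities would give (5 × 225 − 55) / 5 = 214 instead of 10, which is the kind of error the note warns about.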

Final Assembly

Combine the simplified numerator and denominator to arrive at the computational formula.

r = \frac{\frac{n \sum x y - ( \sum x ) ( \sum y )}{n}}{\sqrt{\frac{n \sum x^{2} - ( \sum x )^{2}}{n} \cdot \frac{n \sum y^{2} - ( \sum y )^{2}}{n}}} = \frac{n \sum x y - ( \sum x ) ( \sum y )}{\sqrt{[ n \sum x^{2} - ( \sum x )^{2} ] [ n \sum y^{2} - ( \sum y )^{2} ]}}

Note: This form is often called the 'computational formula' because it is more efficient for manual calculation.
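The computational form, with the square root taken over the product of the two bracketed terms, can be sketched in Python as follows. The function name `pearson_r` and the sample data are illustrative assumptions:

```python
import math

def pearson_r(xs, ys):
    """Pearson's r from raw paired data using the computational formula."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("xs and ys must be paired and hold at least two points")
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    x_part = n * sum_x2 - sum_x ** 2
    y_part = n * sum_y2 - sum_y ** 2
    if x_part == 0 or y_part == 0:
        raise ValueError("r is undefined when either variable is constant")
    return numerator / math.sqrt(x_part * y_part)

# Made-up paired observations, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]
print(pearson_r(xs, ys))
```

Only five running totals are needed (Σx, Σy, Σxy, Σx², Σy²), which is what makes this form convenient for a calculation table.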

Result

r = \frac{n \sum x y - ( \sum x ) ( \sum y )}{\sqrt{[ n \sum x^{2} - ( \sum x )^{2} ] [ n \sum y^{2} - ( \sum y )^{2} ]}}

Source: AQA/Edexcel A-Level Geography Specification - Quantitative Skills: Statistical Analysis

Free formulas

Rearrangements

Solve for $r$

Make r the subject

r = \frac{n \sum x y - ( \sum x ) ( \sum y )}{\sqrt{[ n \sum x^{2} - ( \sum x )^{2} ] [ n \sum y^{2} - ( \sum y )^{2} ]}}

The formula is already defined with r as the subject.

Difficulty: 1/5

Solve for $n$

Make n the subject

n^{2} [r^{2} (\sum x^{2}) (\sum y^{2}) - (\sum x y)^{2}] + n [2 (\sum x y) (\sum x) (\sum y) - r^{2} (\sum x^{2}) (\sum y)^{2} - r^{2} (\sum y^{2}) (\sum x)^{2}] + [r^{2} (\sum x)^{2} (\sum y)^{2} - (\sum x)^{2} (\sum y)^{2}] = 0

Isolating n requires squaring both sides and using the quadratic formula or variable substitution techniques.

Difficulty: 5/5
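As a rough numerical check, squaring r² [nΣx² − (Σx)²][nΣy² − (Σy)²] = [nΣxy − (Σx)(Σy)]² and collecting powers of n gives a quadratic whose cross term 2(Σxy)(Σx)(Σy) comes from the squared numerator and so carries no r² factor. The sketch below uses a hypothetical function name and made-up sums to evaluate the left-hand side, which should be close to zero when n is the true sample size:

```python
def quadratic_in_n_residual(r, n, sum_x, sum_y, sum_xy, sum_x2, sum_y2):
    """Left-hand side of the quadratic a*n^2 + b*n + c = 0 in n."""
    r2 = r ** 2
    a = r2 * sum_x2 * sum_y2 - sum_xy ** 2
    b = (2 * sum_xy * sum_x * sum_y
         - r2 * sum_x2 * sum_y ** 2
         - r2 * sum_y2 * sum_x ** 2)
    c = r2 * sum_x ** 2 * sum_y ** 2 - sum_x ** 2 * sum_y ** 2
    return a * n ** 2 + b * n + c

# Sums for the made-up data x = 1..5, y = 2, 1, 4, 3, 5, for which r = 0.8.
residual = quadratic_in_n_residual(
    r=0.8, n=5, sum_x=15, sum_y=15, sum_xy=53, sum_x2=55, sum_y2=55
)
print(abs(residual) < 1e-6)
```

In practice n is known from the data and the sums themselves change with n, so this rearrangement is an algebra exercise rather than a working method.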

Solve for $\sum x y$

Make Σxy the subject

\sum x y = \frac{r \sqrt{[ n \sum x^{2} - ( \sum x )^{2} ] [ n \sum y^{2} - ( \sum y )^{2} ]} + ( \sum x ) ( \sum y )}{n}

Isolate the numerator term by multiplying by the denominator and rearranging.

Difficulty: 3/5
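A small sketch of this rearrangement, assuming the denominator of r is the square root of the product of the two bracketed terms (the function name and the sums are illustrative):

```python
import math

def sum_xy_from_r(r, n, sum_x, sum_y, sum_x2, sum_y2):
    """Recover sum(xy) from r and the other summary totals."""
    root = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return (r * root + sum_x * sum_y) / n

# Sums for the made-up data x = 1..5, y = 2, 1, 4, 3, 5, for which r = 0.8.
print(sum_xy_from_r(r=0.8, n=5, sum_x=15, sum_y=15, sum_x2=55, sum_y2=55))
```

For this data the result is approximately 53, matching Σxy computed directly from the pairs (2 + 2 + 12 + 12 + 25).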

Why it behaves this way

Intuition

Think of the data as a cloud of points on a scatter graph. This equation calculates how well those points fit onto a straight line. Imagine trying to draw a 'best-fit' line through the cloud: the numerator measures how much the x and y values 'move together' (covariance), while the denominator acts as a scaling factor (standard deviations) to normalize the result, ensuring the value always sits between -1 and 1 regardless of the units used.
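The claim that r does not depend on the units can be illustrated with a short sketch. The distances and flood depths below are hypothetical values invented for the example, and `pearson_r` is an illustrative helper using the computational formula:

```python
import math

def pearson_r(xs, ys):
    """Computational form of Pearson's r for paired lists."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Hypothetical data: distance from a river and flood depth.
distance_km = [0.5, 1.0, 1.5, 2.0, 3.0]
flood_depth_m = [1.8, 1.5, 1.1, 0.9, 0.4]

# The same observations expressed in metres and centimetres.
distance_m = [d * 1000 for d in distance_km]
flood_depth_cm = [d * 100 for d in flood_depth_m]

r_original = pearson_r(distance_km, flood_depth_m)
r_rescaled = pearson_r(distance_m, flood_depth_cm)
print(r_original, r_rescaled)
```

The two values agree up to floating-point rounding, because rescaling multiplies the numerator and the denominator by the same factor.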

r

Pearson's Correlation Coefficient

The 'tightness' and direction of the linear relationship; +1 or -1 is a perfect line, 0 is a cloud with no discernible slope.

n

Sample size

The number of pairs of observations; it acts as an 'averaging' agent in the calculation.

Σ x y

Sum of the products

Captures the interaction between variables; when high values of x are paired with high values of y, and low with low, this sum is large relative to (Σx)(Σy)/n, which makes the numerator positive.

Σx² / Σy²

Sum of squares

Represents the total 'spread' or variance of each individual variable.

Signs and relationships

Numerator (nΣxy - (Σx)(Σy)): If the numerator is positive, x and y increase together (positive correlation). If negative, one increases as the other decreases (negative correlation).
Square root in denominator: This forces the result into the -1 to +1 range by dividing the covariance by the product of the two variables' individual standard deviations (normalization).

One free problem

Practice Problem

Given a small sample where n=5, Σx=15, Σy=20, Σxy=70, Σx²=55, and Σy²=90, calculate Pearson's r.

n = 5
Σxy = 70
Σx = 15
Σy = 20
Σx² = 55
Σy² = 90

Solve for: $r$

Hint: Calculate the numerator first, then the denominator parts separately.
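To check a hand calculation, the hint can be followed step by step in a short sketch. The function name `pearson_r_from_sums` is an assumption; it takes the summary totals rather than the raw data:

```python
import math

def pearson_r_from_sums(n, sum_x, sum_y, sum_xy, sum_x2, sum_y2):
    """Pearson's r from summary totals, following the hint's order of work."""
    # Step 1: the numerator.
    numerator = n * sum_xy - sum_x * sum_y
    # Step 2: the two denominator parts, kept separate.
    x_part = n * sum_x2 - sum_x ** 2
    y_part = n * sum_y2 - sum_y ** 2
    if x_part <= 0 or y_part <= 0:
        raise ValueError("totals are inconsistent or a variable is constant")
    # Step 3: divide by the square root of their product.
    return numerator / math.sqrt(x_part * y_part)

r = pearson_r_from_sums(n=5, sum_x=15, sum_y=20, sum_xy=70, sum_x2=55, sum_y2=90)
print(round(r, 3))
```

Work the problem by hand first, then run the sketch to compare.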

Where it shows up

Real-World Context

Investigating the correlation between the distance of settlements from a river (x) and the average annual flood depth (y) to determine flood risk zones.

Study smarter

Tips

Always plot a scatter graph first to check for linearity before calculating r.
Ensure that your sample size (n) is sufficiently large to avoid skewed results from outliers.
Remember that correlation does not imply causation.

Avoid these traps

Common Mistakes

Forgetting to square the sum (Σx)² versus summing the squares Σx².
Applying the test to non-linear relationships (e.g., exponential growth patterns).
Ignoring the impact of extreme outliers which can heavily bias the result.
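Two of these traps can be demonstrated with made-up data. In the sketch below (illustrative helper and values, not from the specification), a perfectly curved relationship gives r = 0, and a single extreme point reverses the sign of an otherwise positive correlation:

```python
import math

def pearson_r(xs, ys):
    """Computational form of Pearson's r for paired lists."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Non-linear trap: y = x^2 is a perfect relationship, but not a linear one.
curve_x = [-2, -1, 0, 1, 2]
curve_y = [x ** 2 for x in curve_x]
r_curve = pearson_r(curve_x, curve_y)

# Outlier trap: one extreme pair added to otherwise positively related data.
base_x = [1, 2, 3, 4, 5]
base_y = [2, 1, 4, 3, 5]
r_without = pearson_r(base_x, base_y)
r_with = pearson_r(base_x + [20], base_y + [0])

print(r_curve, r_without, r_with)
```

Here r for the curve is 0, r without the outlier is about 0.8, and r with the outlier is negative, which is why a scatter graph should be inspected before the coefficient is trusted.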

References

Sources

Pearson, K. (1896). Mathematical Contributions to the Theory of Evolution.
Burt, J. E., Barber, G. M., & Rigby, D. L. (2009). Elementary Statistics for Geographers.
AQA/Edexcel A-Level Geography Specification - Quantitative Skills: Statistical Analysis
