SOCI832: Lesson 5.3: Simple Correlation

0. How do I get my computer set up for today’s class?

# Install Packages
if(!require(dplyr)) {install.packages("dplyr", repos='', dependencies=TRUE)}
if(!require(sjlabelled)) {install.packages("sjlabelled", repos='', dependencies=TRUE)}
if(!require(sjmisc)) {install.packages("sjmisc", repos='', dependencies=TRUE)}
if(!require(sjstats)) {install.packages("sjstats", repos='', dependencies=TRUE)}
if(!require(sjPlot)) {install.packages("sjPlot", repos='', dependencies=TRUE)}
if(!require(summarytools)) {install.packages("summarytools", repos='', dependencies=TRUE)}
if(!require(ggplot2)) {install.packages("ggplot2", repos='', dependencies=TRUE)}
if(!require(ggthemes)) {install.packages("ggthemes", repos='', dependencies=TRUE)}

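# Load packages into memory
library(dplyr)
library(sjlabelled)
library(sjmisc)
library(sjstats)
library(sjPlot)
library(summarytools)
library(ggplot2)
library(ggthemes)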

# Turn off scientific notation
options(digits=5, scipen=15) 

# Stop View from overloading memory with large datasets
RStudioView <- View
View <- function(x) {
  # Show only the first 500 rows of a data frame; pass anything else through unchanged
  if (is.data.frame(x)) { RStudioView(head(x, 500)) } else { RStudioView(x) }
}

elect_2013 <- read.csv(url(""))

3. How do I calculate correlation coefficients?

Correlation coefficients are some of the most widely used and reported statistics in academic papers. In this section we learn how to obtain the most common types of correlation coefficients. We also create our own function for making very attractive correlation matrices.

Correlation coefficients can roughly be defined as:

  • a measure of the relationship between two variables
  • that is standardised so that
  • a perfect positive correlation is 1
  • a perfect negative correlation is -1
  • and no relationship is 0
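
To make these three reference points concrete, here is a minimal sketch using five made-up numbers (the data are invented purely for illustration). It uses R's cor() function, which we meet properly later in this lesson.

```r
# Hypothetical toy data: five made-up values
x <- c(1, 2, 3, 4, 5)

# y rises in lockstep with x: perfect positive correlation
r_pos <- cor(x, 2 * x + 3)

# y falls in lockstep with x: perfect negative correlation
r_neg <- cor(x, -x)

# y rises and then falls by the same amount: no overall linear trend
r_zero <- cor(x, c(1, 2, 3, 2, 1))

r_pos
r_neg
r_zero
```

Note that the last example is zero because there is no straight-line trend, which is the kind of relationship Pearson's coefficient picks up.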

It turns out that there are lots of different correlation coefficients, and they can be useful for different specialised purposes, such as variables measured at the binary or ordinal level.

However, when most people talk about correlation coefficients, or just correlation, they are talking about the most famous and widely used correlation coefficient, called the “Pearson correlation coefficient”. This was invented by the famous statistician Karl Pearson.

It is also called ‘Pearson’s r’ and sometimes just ‘r’. It has no relationship to the statistical package ‘R’!

Each correlation coefficient is calculated in a different way, so that it takes account of the type of data that the coefficient is working with.

When we have approximately continuous data, Pearson’s correlation coefficient is the best measure, and by learning about this measure, you will get a sense of what all the different correlation coefficients are trying to do.

There is an excellent four page explanation of Pearson’s r in Field et al. 2012 pages 206-209 (6.3.1-6.3.2), which I recommend everyone read. I will try to summarise it here, but my explanation will be overly simple.

A correlation coefficient brings together two main ideas: (1) covariance; and (2) standardisation.

3.1 What is covariance? What does it have to do with correlation coefficients?

Covariance is the idea that for two variables, when one variable is high for an individual, the other variable will also tend to be high, and vice versa - when one is low, the other will tend to be low.

We could think of a very simple example of measuring height and weight on everyone in a school. Now we know that height and weight are not perfectly correlated because there are lots of other factors. But we also know that they tend to move together. This “moving together” is covariance.

The way statisticians measure covariance is by looking at how much two measures move away from the mean of the whole sample. So if we have a school where the average height is 130cm and the average weight is 50kg, statisticians would measure the covariance by asking “When someone differs from the mean height, do they also differ by a similar amount in the same direction from the mean weight?”

Mathematically the covariance of a person called Anne is:

Anne's covariance = (Anne's weight - average school weight) * (Anne's height - average school height)

For the whole school, we can calculate the average of these individual values, which is simply called the covariance (actually n-1 is the denominator, rather than n, but this has very nearly the same effect for large sample sizes). This will give us a number that is positive if height and weight move together, negative if they move in opposite directions, and approximately zero if there is no relationship.
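
The steps above can be sketched in R. The five students below are invented for illustration (I have chosen the numbers so the means are 130cm and 50kg, as in the school example), and the variable names are my own.

```r
# Hypothetical data for five students (values invented for illustration)
height_cm <- c(120, 125, 130, 135, 140)
weight_kg <- c(42, 47, 50, 52, 59)

# Each student's deviation from the school mean
height_dev <- height_cm - mean(height_cm)
weight_dev <- weight_kg - mean(weight_kg)

# Each student's contribution: the product of their two deviations
products <- height_dev * weight_dev

# Covariance: add up the products and divide by n - 1
n <- length(height_cm)
cov_by_hand <- sum(products) / (n - 1)

cov_by_hand                 # our calculation
cov(height_cm, weight_kg)   # R's built-in cov() gives the same number
```

The result is positive, which tells us that in this toy school taller students tend to be heavier.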

3.2 What is standardisation? What does it have to do with correlation coefficients?

However, we have a problem: the units of covariance depend on what we are measuring and the units we measure it in.

Statisticians solve this by using ‘standardisation’, which is actually used in many different settings.

The formula for standardizing a variable (in almost all settings) is:

standardized value = subtract the mean from each case, and then divide by the standard deviation


Standardized X = (x for each case - mean X)/(standard deviation of X)
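
As a minimal sketch of this formula, the code below standardises a made-up height variable by hand and then checks the result against R's built-in scale() function. The data and variable names are invented for illustration.

```r
# Hypothetical heights for five students (values invented for illustration)
height_cm <- c(120, 125, 130, 135, 140)

# Subtract the mean from each case, then divide by the standard deviation
height_z <- (height_cm - mean(height_cm)) / sd(height_cm)

height_z
mean(height_z)   # the standardised variable has a mean of zero
sd(height_z)     # and a standard deviation of one

# scale() does the same job; as.numeric() drops the extra attributes it adds
height_z_builtin <- as.numeric(scale(height_cm))
height_z_builtin
```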

3.2.1 What does the formula for standardisation actually mean?

  • subtract mean from each case: The reason you subtract the mean from each case is that it causes the new mean of all cases to be zero. [Note: it turns out that for covariance, we have already subtracted the mean as part of the formula, so we don’t need to worry about this for Pearson correlations.]
  • divide by standard deviation: The second rule is to divide by the standard deviation of a variable (or in this case, the product of the standard deviations of the two variables). Why? Because it transforms ALL our units of measurement - that is, EVERY variable we divide by its own standard deviation - into the same unit, which is standard deviations. This means that the numbers “1” and “-1” mean very similar things for each variable: they mean 1 standard deviation from the mean of the variable in the sample, irrespective of whether the variable is age, weight, height, or number of pets.

This can be seen in Field et al. 2012 in equation 6.3.

And the result is a magical number called Pearson’s r.

When a statistical package such as R calculates a Pearson’s r for you, two important questions to ask yourself are:
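
Putting covariance and standardisation together, here is a minimal sketch of Pearson’s r calculated by hand: the covariance divided by the product of the two standard deviations. The data are invented for illustration. The last few lines also show why we bother: changing the units changes the covariance but leaves r untouched.

```r
# Hypothetical data for five students (values invented for illustration)
height_cm <- c(120, 125, 130, 135, 140)
weight_kg <- c(42, 47, 50, 52, 59)

# Pearson's r: covariance divided by the product of the standard deviations
r_by_hand <- cov(height_cm, weight_kg) / (sd(height_cm) * sd(weight_kg))

r_by_hand
cor(height_cm, weight_kg)   # R's built-in cor() gives the same number

# Re-express the same data in metres and grams
height_m <- height_cm / 100
weight_g <- weight_kg * 1000

cov(height_cm, weight_kg)   # covariance in cm x kg
cov(height_m, weight_g)     # a different number, purely because of the units
r_other_units <- cor(height_m, weight_g)
r_other_units               # the correlation does not change
```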

  • Is it ‘significant’? - which we measure with p-value and/or confidence intervals.
  • Is it ‘important’? - which we measure with the correlation coefficient, which is itself an example of one measure of effect size.

3.3 How do I read significance, p-values, and confidence intervals of a correlation coefficient?

Reading significance with p-values is pretty straightforward. We look at the p-value associated with each correlation, and ask if the p-value < 0.05. Sometimes the p-value is reported as ‘sig.’ or just ‘p’ or as asterisks like ‘*’ or ‘**’ or ‘***’ (which normally mean p < 0.05, 0.01, and 0.001 respectively).

Reading confidence intervals involves looking at the 95% confidence interval and asking whether the interval includes zero (i.e. is one end of the interval negative and the other end positive). If the 95% confidence interval does not include zero, then we say the correlation is statistically significant at the 5% level: zero is not among the values for the true parameter that are plausible given our data.
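
Both checks can be sketched in R. The example below uses simulated data rather than our election data (the variables x and y and the seed are my own invention), and it pulls the p-value and the confidence interval out of the object that cor.test() returns.

```r
# Simulated data, for illustration only
set.seed(832)
x <- rnorm(200)
y <- 0.5 * x + rnorm(200)

result <- cor.test(x, y)

# Check 1: is the p-value below 0.05?
result$p.value
significant_by_p <- result$p.value < 0.05

# Check 2: does the 95% confidence interval exclude zero?
ci <- result$conf.int
ci
significant_by_ci <- ci[1] > 0 || ci[2] < 0

significant_by_p
significant_by_ci
```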

3.4 How do I identify importance and effect size for correlation coefficients?

But how do we know ‘importance’?

There is a concept in statistics which has grown in popularity over the last 30 years, and this is known as effect size. The general argument for focusing on effect size is this:

If you have a large enough sample size, then you can get a p-value less than 0.05 for something that is actually of only trivial importance. And for this reason, we can’t just use p-values to tell us whether something really matters in the real world. We could have a statistically significant difference - particularly with a very large sample - that is of very little real world importance because the difference is so small.

For example, we might find that the height of two groups of a million people differs by 0.1 millimetre, with p < 0.05. In most cases, we don’t care about such a small difference, especially when the variation between tallest and shortest within each group is probably at least 1 metre.
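
The same point can be illustrated with a correlation rather than a difference in heights. This is a minimal sketch on simulated data (the sample size, seed, and strength of relationship are my own choices): with 200,000 cases, a correlation of roughly 0.02 comes out as statistically significant, even though it is far too small to matter in most real world settings.

```r
# Simulated data, for illustration only
set.seed(832)
n <- 200000
x <- rnorm(n)
y <- 0.02 * x + rnorm(n)   # the true correlation is roughly 0.02

result <- cor.test(x, y)

result$estimate   # a tiny correlation
result$p.value    # but a p-value well below 0.05
```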

As a result of this criticism, the notion of effect size has developed.

A variable has a large effect on another variable if a one standard deviation change in the first variable causes (or is correlated with) a large change, also measured in standard deviations, in the second variable.

We can see here that we are thinking about how variation, measured in standardised units, in one variable is related to variation in another.

We will be returning to effect size quite a few times in this course. For the moment, I just want you to know that in the case of Pearson’s r, effect size is measured DIRECTLY by the coefficient.

And the rule of thumb for interpreting the effect size of a Pearson’s r is:

  • small effect size: +/- 0.1 - 0.2
  • medium effect size: +/- 0.3 - 0.4
  • large effect size: +/- 0.5 +

One way to get a grip on what these mean is to take the square of Pearson’s r, which is called ‘R-square’ or ‘R-squared’.

It turns out that R-squared can be read directly as the “Proportion of variation in one variable explained by the second and vice versa”.

In short, it can basically be read like a proportion (or if you like, a percentage - if you multiply the R-square by 100).

We can rewrite the rule of thumb for effect size measured in R-square as:

  • small effect size: R-square = 0.01-0.04 (i.e. 1%-4% of variance explained by the other variable)
  • medium effect size: R-square = 0.09-0.16 (i.e. 9%-16%)
  • large effect size: R-square = 0.25+ (i.e. 25%+)
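
These R-square thresholds are simply the squares of the Pearson’s r thresholds, which can be checked with a couple of lines of R:

```r
# The rule-of-thumb boundaries for Pearson's r
r <- c(0.1, 0.2, 0.3, 0.4, 0.5)

# Square each one to get R-square, then multiply by 100 for a percentage
r_squared <- r^2
data.frame(r = r, r_squared = r_squared, percent_explained = 100 * r_squared)
```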

With these rules of thumb in hand, let’s look at our data and analyse it with R.

3.5 How do I calculate correlation coefficients in R? cor() and cor.test()

The two simplest commands to get a correlation coefficient are cor() and cor.test().

cor() unfortunately returns only the coefficient itself, with no p-value or confidence interval, so it’s not that useful on its own.

cor.test(), however, is quite useful.
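
To see the difference between the two commands side by side, here is a minimal sketch using mtcars, a small dataset that ships with R. It is only a stand-in so the example runs on any computer; it is not our election data. The last lines show one practical catch with cor(): if a variable has missing values, cor() returns NA unless you tell it to use only the complete pairs.

```r
# mtcars ships with R; used here only as a stand-in for the course data
r_only <- cor(mtcars$wt, mtcars$mpg)      # a single number, nothing else
r_only

full <- cor.test(mtcars$wt, mtcars$mpg)   # coefficient, t, df, p-value, 95% CI
full

# With a missing value, cor() returns NA by default
wt_missing <- mtcars$wt
wt_missing[1] <- NA
r_na <- cor(wt_missing, mtcars$mpg)
r_na

# Ask cor() to use only the cases where both variables are present
r_complete <- cor(wt_missing, mtcars$mpg, use = "complete.obs")
r_complete
```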

Let’s look at the correlation between five different variables and likelihood of voting.

I’m going to move from those with little or no effect, through to those with a large effect size.

This first correlation shows the relationship between urban/rural location and likelihood of voting, for which there is little relationship. The command is below, followed by the output.

cor.test(elect_2013$rural_urban, elect_2013$likelihood_vote)
##  Pearson's product-moment correlation
## data:  elect_2013$rural_urban and elect_2013$likelihood_vote
## t = 1.37, df = 3801, p-value = 0.17
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.0096043  0.0539327
## sample estimates:
##      cor 
## 0.022187

How do we read this? Remember the first thing to do is to look at the p-value. In this case it is 0.17, which is greater than 0.05, so we can safely say the correlation coefficient is not significantly different from zero.

We can also see this when we look at the 95% confidence interval, which is from -0.0096 to 0.054. Notice that the 95% confidence interval includes zero, which means that we can’t be confident the true parameter is not zero.

We can also see that the correlation coefficient is 0.022 - which is tiny. Remember that a correlation of 0.1 is considered a small effect size, and this is one fifth of that.
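
If you are curious where the other numbers in this output come from, they can be reproduced from just the correlation and the degrees of freedom. This is a sketch using the standard textbook formulas: a t statistic for the test, and Fisher’s z transformation for the confidence interval, which is the approach R’s help page for cor.test() describes. The variable names are my own, and small differences in the last decimal place arise because we start from the rounded correlation.

```r
r  <- 0.022187   # correlation reported in the output above
df <- 3801       # degrees of freedom reported in the output above
n  <- df + 2     # number of complete pairs of observations

# t statistic for testing whether the true correlation is zero
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df)

# 95% confidence interval via Fisher's z transformation
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

round(t_stat, 2)
round(p_value, 2)
ci
```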

Let’s next look at the effect of highest educational qualification:

cor.test(elect_2013$highest_qual, elect_2013$likelihood_vote)
##  Pearson's product-moment correlation
## data:  elect_2013$highest_qual and elect_2013$likelihood_vote
## t = 2.28, df = 3768, p-value = 0.022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.0052743 0.0690319
## sample estimates:
##      cor 
## 0.037191

In this case, the p-value is below 0.05, so the correlation is significant, but when you look at the correlation coefficient - it is basically 0.04, which again is tiny. So this is a case of ‘significant’ but ‘not very meaningful’.

Next let’s look at the impact of internet skills on the likelihood of voting:

cor.test(elect_2013$internet_skills, elect_2013$likelihood_vote)
##  Pearson's product-moment correlation
## data:  elect_2013$internet_skills and elect_2013$likelihood_vote
## t = 5.39, df = 3924, p-value = 0.000000076
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.054546 0.116652
## sample estimates:
##      cor 
## 0.085682

In this case, the correlation is highly significant and the effect size - basically 0.09 - is small but meaningful. So those people who have better internet skills tend to be more likely to vote.

Next let’s look at something with a much larger effect size: political knowledge.

cor.test(elect_2013$pol_knowledge, elect_2013$likelihood_vote)
##  Pearson's product-moment correlation
## data:  elect_2013$pol_knowledge and elect_2013$likelihood_vote
## t = 25.1, df = 3924, p-value <0.0000000000000002
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.34449 0.39841
## sample estimates:
##     cor 
## 0.37176

In this case the p-value is incredibly small, and the correlation coefficient is 0.37 - a medium sized effect.

Last, let’s look at the impact of interest in politics:

cor.test(elect_2013$interest_pol, elect_2013$likelihood_vote)
##  Pearson's product-moment correlation
## data:  elect_2013$interest_pol and elect_2013$likelihood_vote
## t = -37.8, df = 3913, p-value <0.0000000000000002
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.53937 -0.49343
## sample estimates:
##      cor 
## -0.51677

In this case the relationship is highly significant and effect size is large, but the relationship is negative: -0.52.

Does this mean that people who are interested in politics are less likely to vote? Are those interested in politics all anarchists?

What do you think?

When you get a result like this, you should check two things: your codebook, and your data-entry/data cleaning.

If you look at the codebook, then you will find that “interest_pol” is reverse coded. (1) = A good deal of interest in politics, while (4) = None (no interest in politics).

Thus, the real meaning of this correlation is that those who are more interested in politics are much more likely to say they would vote if voting were not compulsory.
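
One way to make a reverse-coded variable easier to read is to flip it before running the correlation. The sketch below uses eight invented responses, not the real election data, and assumes that a higher likelihood_vote score means more likely to vote. For a scale running from 1 to 4, subtracting each value from 5 reverses the coding, and the correlation keeps its size but changes sign.

```r
# Hypothetical responses (invented for illustration)
interest_pol    <- c(1, 1, 2, 2, 3, 3, 4, 4)   # 1 = a good deal of interest, 4 = none
likelihood_vote <- c(5, 4, 5, 3, 3, 2, 2, 1)   # assumed: higher = more likely to vote

# As coded, the correlation is negative
r_original <- cor(interest_pol, likelihood_vote)
r_original

# Reverse the coding so that 4 = a good deal of interest, 1 = none
interest_pol_rev <- 5 - interest_pol

# Same size of correlation, but now positive
r_reversed <- cor(interest_pol_rev, likelihood_vote)
r_reversed
```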

In fact, we can say that approximately 27% of the variance in likelihood of voting is explained by variance in interest in politics (by calculating the R-square: -0.52 squared is approximately 0.27).
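
The R-square can be checked by squaring the correlation reported in the output above:

```r
r <- -0.51677   # correlation reported by cor.test() for interest_pol

# Squaring removes the sign, so reverse coding makes no difference here
r_squared <- r^2

round(r_squared, 2)      # proportion of variance explained
round(100 * r_squared)   # the same thing as a percentage
```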

Last updated on 26 August, 2019 by Dr Nicholas Harrigan