SOCI832: Lesson 8.1: Linear Regression

Learning Objectives: Lesson 8.1 + 8.2

By the end of lessons 8.1 + 8.2, students should be able to: (1) understand the theory behind; (2) run the R code for; and (3) interpret the R output for:

    * a linear regression
    * measuring effect size with R-squared
    * measuring effect size with standardised coefficients (beta)


0. Code to run to set up your computer.

# Update Packages
update.packages(ask = FALSE, repos='https://cran.csiro.au/', dependencies = TRUE)
# Install Packages
if(!require(dplyr)) {install.packages("dplyr", repos='https://cran.csiro.au/', dependencies=TRUE)}
if(!require(sjlabelled)) {install.packages("sjlabelled", repos='https://cran.csiro.au/', dependencies=TRUE)}
if(!require(sjmisc)) {install.packages("sjmisc", repos='https://cran.csiro.au/', dependencies=TRUE)}
if(!require(sjstats)) {install.packages("sjstats", repos='https://cran.csiro.au/', dependencies=TRUE)}
if(!require(sjPlot)) {install.packages("sjPlot", repos='https://cran.csiro.au/', dependencies=TRUE)}

# Load packages into memory
library(dplyr)
library(sjlabelled)
library(sjmisc)
library(sjstats)
library(sjPlot)

# Turn off scientific notation
options(digits=3, scipen=8) 

# Stop View from overloading memory with large datasets
RStudioView <- View
View <- function(x) {
  if ("data.frame" %in% class(x)) { RStudioView(x[1:500,]) } else { RStudioView(x) }
}

# Datasets
# Example 1: Crime Dataset
lga <- readRDS(url("https://methods101.com/data/nsw-lga-crime-clean.RDS"))

# Example 2: AuSSA Dataset
aus2012 <- readRDS(url("https://mqsociology.github.io/learn-r/soci832/aussa2012.RDS"))

# Example 3: Australian Electoral Survey
aes_full <- readRDS(gzcon(url("https://mqsociology.github.io/learn-r/soci832/aes_full.rds")))

# Example 4: AES 2013, reduced
elect_2013 <- read.csv(url("https://methods101.com/data/elect_2013.csv"))

1. Linear Regression

So far we have looked at the relationship between two variables with analytical techniques such as correlations and comparisons of means.

We have also looked at the relationship between three or more variables with analytical techniques like factor analysis and measures of the reliability of scales (like Cronbach’s Alpha). These allowed us to see if a set of three or more variables are ‘moving together’ - whether they are measuring some common, unobserved variable.

However, in much of social science, what we are concerned with is explaining an outcome. In most social science papers, we try to explain (or predict) an outcome (dependent variable), and report the impact of many different potential causal variables (independent variables).

A linear regression is one model - and probably the most popular model - for exploring the impact of multiple independent variables on a single dependent variable.

A linear regression has a number of fundamental characteristics, including:

  1. there is one outcome (dependent) variable;
  2. the outcome variable is continuous or interval (i.e. not a binary or categorical variable);
  3. the predictor (independent) variables are either binary (0/1) or continuous/interval (i.e. not categorical); and
  4. the model of the outcome variable is a linear combination of the independent variables, which means that the equation for a linear regression is:

\(\begin{aligned} y = &b_1 x_1 + b_2 x_2 + b_3 x_3 + ... + c + e \\ \\ \text{Where:} \\ y=&\text{ dependent variable (outcome variables)} \\ x_1 , x_2 , \text{ ... }x_n = &\text{ independent variables (predictor variables)} \\ b_1 , b_2 , \text{ ... }b_n = &\text{ slope of relationship between }y\text{ and }x_1 , x_2 , \text{ ... }x_n \\ c = &\text{ intercept (i.e. value of }y \text{ when }x_1 , x_2 , \text{ ... }x_n=0 \text{)} \\ e = &\text{ error (i.e. that part of }y \text{ not explained by model)} \\ \end{aligned}\)

1.1 A Silly Example: Predicting Visits to the Rest Room

Let’s use a silly example - because it is memorable.

Say we are interested in knowing how many times a patron of a bar or pub or night club visits the rest room in an evening.

And let’s also suppose that we have some ethical way to measure this and collect data on 100 people who are in this busy bar on a Saturday night.

In this case, our dependent variable is ‘visits to the rest room’, and it might vary between 0 and 10 for our 100 people.

And the question we ask ourselves is: what causes or predicts the number of visits?

Well, we might think that there are two important variables:

  1. the number of hours the person has been in the bar; and
  2. the number of drinks the person has had at the bar.

To express this as a line we could say that, on average:

\(\begin{aligned} \text{ rest_room_visits } = &b_1 \text{ hours_at_bar } + b_2 \text{ number_of_drinks } + c\\ \\ \text{Where:} \\ b_1 = &\text{ slope of relationship between hours_at_bar and rest_room_visits } \\ b_2 = &\text{ slope of relationship between number_of_drinks and rest_room_visits} \\ c = &\text{ the constant or y-intercept, which is the number of rest_room_visits (on average) } \\ &\text{ for someone with zero for the two other variables, i.e. they just arrived at bar } \\ &\text{ (hours_at_bar = 0) and has not had a drink yet (number_of_drinks = 0) } \\ \end{aligned}\)

Now, if we want to work out what \(b_1\), \(b_2\), and \(c\) are, we would first need a whole lot of data. Well, we are lucky, because we have collected data from 100 people on a Saturday night at a local Irish Pub.

The data looks like this, with four columns:

Patron   rest_room_visits   hours_at_bar   number_of_drinks
01       3                  1              4
02       1                  0.5            1
03       1                  2              1
04       6                  4              4
05       9                  5              6
06       2                  3              2
07       1                  0              1
08       0                  2              0
09       15                 9              13
10       4                  2              3

… AND SO ON FOR 100 people …

We can ignore the first column (patron), which is just an identifier column.

Now we could, if we wanted to, plot each person on a chart, where the y axis is visits, and the x axis is one of the predictor/independent variables (e.g. hours).

If we did this we would have a very messy graph, but there would be a general tendency for visits to go up as the number of hours at the pub and the number of drinks go up.

In a linear regression, the statistical software calculates the ‘line of best fit’ - meaning the line that is as close as possible to all persons - all the dots on our graph.

This line of best fit can then be described with the three coefficients discussed earlier: \(b_1\), \(b_2\), and \(c\).
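
For example, using just the ten patrons shown above, we could draw this graph for one predictor and add the line of best fit (a quick sketch; the data frame below simply re-types the table):

bar <- data.frame(
  rest_room_visits = c(3, 1, 1, 6, 9, 2, 1, 0, 15, 4),
  hours_at_bar     = c(1, 0.5, 2, 4, 5, 3, 0, 2, 9, 2),
  number_of_drinks = c(4, 1, 1, 4, 6, 2, 1, 0, 13, 3))
graphics::plot(rest_room_visits ~ hours_at_bar, data = bar)   # scatterplot: one dot per patron
graphics::abline(stats::lm(rest_room_visits ~ hours_at_bar, data = bar))  # add the line of best fit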

\(b_1\) and \(b_2\) are ‘slopes’ - which you would have learnt about in year 8 mathematics. Remember the equation \(y = bx + c\)? Well, \(b\) is the slope and \(c\) is the intercept, and it is the same in linear regression models.

Exactly how statistical packages calculate the line of best fit is not our concern (R’s lm() uses ‘ordinary least squares’, which chooses the line that minimises the sum of squared errors). What matters to us as social scientists is that - so long as basic assumptions about the nature of the data hold, and the statistical package is sound - these models can estimate the relationship between various predictors (independent variables) and an outcome (dependent variable) we care about.

So let’s say we did analyse our data, and R told us the values for the coefficients were:

  • \(b_1 = 0.5\)
  • \(b_2 = 0.8\)
  • \(c = 0.2\)

What does this mean?

It means that - on average:

  • for each one hour a person spends at the bar, they will visit the rest room an additional 0.5 times;
  • for each drink they have, they will visit the rest room an additional 0.8 times; and
  • for someone who stays for zero hours and has no drinks, they will visit the rest room 0.2 times.
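
To make this concrete, here is a minimal sketch that simulates data for 100 imaginary patrons with exactly these coefficients, and shows that lm() recovers them (all numbers and variable names here are invented for illustration):

set.seed(123)                                       # make the simulation reproducible
n                <- 100
hours_at_bar     <- runif(n, min = 0, max = 9)      # hypothetical hours spent at the bar
number_of_drinks <- rpois(n, lambda = hours_at_bar) # drinks loosely increase with hours
rest_room_visits <- 0.5 * hours_at_bar + 0.8 * number_of_drinks + 0.2 +
                    rnorm(n, sd = 0.5)              # the 'true' model, plus random error
stats::lm(rest_room_visits ~ hours_at_bar + number_of_drinks)

The estimated coefficients should come out close to 0.5, 0.8, and 0.2.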

Exercise: Make your own mental model of ‘visits to the rest room’

We will now do an exercise on the white board where we discuss and come up with our own ‘best guess’ of a linear model of visits to the rest room.

Questions:

  1. What variables, that you can measure, might impact on the number of times a patron goes to the rest room?

  2. Once you have chosen your variables, take an educated guess at the coefficients for each independent variable, and for the constant (intercept - the value of the outcome variable when all independent variables equal zero).


2. Linear Regression Example: Political Knowledge

Let’s run a model in R ourselves now, and see how it is done with a real dataset - the dataset from McAllister 2016.

Let’s start with a very simple model of political knowledge, looking at the impact of being 18-24 years of age (or not) on political knowledge.

Now remember, our political knowledge scale goes from 0 to 10, and each one point represents one quiz question the person got right (out of a total of 10 questions). The independent variable we are going to use is a binary variable (also called a dummy variable), which is equal to 1 if the person is aged 18-24 years, and zero otherwise.

To run a linear regression model, we need to run two lines of code. The first runs the model, with the command ‘lm()’. Inside the brackets we put the dependent variable, then a tilde (~), and then the independent variable/s (if we have multiple independent variables, we separate them with a plus (+) sign). We put the name of the dataset at the end of the set of arguments.

NOTE: In this case, we are using piping with %>%. The meaning of this piping is explained in the comments after the hash symbols on each line.

We then send the output of this line to a summary command to give us a very useful summary of the model.

When you are ready, run the following command:

elect_2013 %>%          # Take this data frame and PUT it...
    stats::lm(pol_knowledge ~ age_18_24, data = .) %>%     # Into this model. Data frame will appear where the full stop is (.)
    base::summary()         # and then we put the output of lm() into summary()
## 
## Call:
## stats::lm(formula = pol_knowledge ~ age_18_24, data = .)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.855 -2.541  0.145  2.145  6.459 
## 
## Coefficients:
##             Estimate Std. Error t value       Pr(>|t|)    
## (Intercept)   4.8545     0.0479  101.31        < 2e-16 ***
## age_18_24    -1.3140     0.1996   -6.58 0.000000000052 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.89 on 3849 degrees of freedom
##   (104 observations deleted due to missingness)
## Multiple R-squared:  0.0111, Adjusted R-squared:  0.0109 
## F-statistic: 43.3 on 1 and 3849 DF,  p-value: 0.000000000052

Now this code might look confusing, but it will become easy - with some practice reading it.

Remember the rule “First look at the p-value”.

However, for models (e.g. regression models like this), we aren’t that interested in the p-value for the model overall. What you should look at is the p-value for the coefficients. In R these p-values are listed under the heading “Pr(>|t|)” because each one comes from a t-test of the coefficient: it is the probability of getting a t-value at least this extreme if the true coefficient were zero.

Anyway, so we look at the two numbers under “Pr(>|t|)” and they are both extremely small numbers (with ten or more zeros after the decimal point), which means that p < 0.05, so both are significant and we can interpret the coefficients.

What do the coefficients say? Well, let’s start with the first, which is just called “(Intercept)”. This is the \(c\) we discussed in the earlier section: the value of y when all the x’s are equal to zero.

In this case, the intercept has a value of 4.85, which means that people who score zero on the 18-24 year old variable (i.e. everyone aged 25 or older) have a mean score on the political knowledge quiz of 4.85.

And now let’s look at the second coefficient - the first variable - which is -1.31. This means that persons who are 18-24 years of age get, on average, 1.31 fewer questions correct in the quiz than persons who are older.

So, in essence, this model says that 25+ year olds get on average 4.85 quiz questions right, and 18-24 year olds get 3.54 questions right.
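
Because the model has a single binary predictor, it is really just a comparison of two group means, and we can check this directly with dplyr (a quick sketch, using the same elect_2013 data frame as above):

elect_2013 %>%
  dplyr::group_by(age_18_24) %>%     # split into 18-24 year olds vs everyone else
  dplyr::summarise(mean_pol_knowledge = mean(pol_knowledge, na.rm = TRUE))

The two means should be approximately 4.85 (for age_18_24 = 0) and 3.54 (for age_18_24 = 1), matching the intercept and the intercept-plus-coefficient from the regression.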

3. R-squared and Adjusted R-squared

What else can this model tell us? Well, towards the bottom of the output you can see that it gives a value for the R-squared and the Adjusted R-squared. We were introduced to the notion of an R-squared in the last set of notes (Week 2, Part 2 - correlation), because R-squared is literally the square of Pearson’s R. We learnt there that R-squared is a measure of effect size: it represents the proportion (percentage) of variation in the dependent variable that MIGHT be able to be explained by the independent variable/s.

We also said that, roughly speaking, we can look at R-squared and say that:

  • small effect size is 0.01-0.04 (i.e. 1%-4%)
  • medium effect size is 0.09-0.16 (i.e. 9%-16%)
  • large effect size is 0.25+ (i.e. 25%+)

So given this, what does the output for our regression model say? It says that the multiple R-squared is 0.01. So basically the difference between 18-24 year olds and the rest of the population explains about 1% of the variation in political knowledge - at best!
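
If you want to pull these numbers out of the model directly (for example, to report or reuse them), they are stored in the summary object. A quick sketch:

model_1 <- stats::lm(pol_knowledge ~ age_18_24, data = elect_2013)
base::summary(model_1)$r.squared       # Multiple R-squared (about 0.011)
base::summary(model_1)$adj.r.squared   # Adjusted R-squared (about 0.011)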

3.1 Example: A more complex model

So let’s now make a more complex model of political knowledge. I will just write the model out below. You should understand its meaning:

elect_2013 %>%
    stats::lm(pol_knowledge ~ age_18_24 
  + internet_skills
  + female
  + tertiary_ed
  + income_quintiles
  + aust_born
  + interest_pol, data = .) %>%
    base::summary()
## 
## Call:
## stats::lm(formula = pol_knowledge ~ age_18_24 + internet_skills + 
##     female + tertiary_ed + income_quintiles + aust_born + interest_pol, 
##     data = .)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -7.052 -1.819  0.014  1.749  8.047 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        7.2572     0.1675   43.34  < 2e-16 ***
## age_18_24         -0.7244     0.1876   -3.86  0.00011 ***
## internet_skills    0.1172     0.0358    3.27  0.00107 ** 
## female            -0.7485     0.0835   -8.97  < 2e-16 ***
## tertiary_ed        0.8749     0.0988    8.85  < 2e-16 ***
## income_quintiles   0.1304     0.0334    3.90 0.000098 ***
## aust_born          0.0431     0.0952    0.45  0.65064    
## interest_pol      -1.5622     0.0537  -29.11  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.42 on 3423 degrees of freedom
##   (524 observations deleted due to missingness)
## Multiple R-squared:  0.293,  Adjusted R-squared:  0.291 
## F-statistic:  202 on 7 and 3423 DF,  p-value: <2e-16

If you run this model and look at the output you will see that it is a much better model, with nearly one-third of the variation in political knowledge explained by the model. One question you might have is: “Which of these variables is most important?”

This is essentially an effect size question. There are two ways of answering it: the first uses R-squared; the second uses standardised regression coefficients.

We can measure the effect size of a particular variable by looking at the change in the R-squared when we run the model with and without the variable.

For example, let’s run exactly the same model as model 2 but we will drop out interest in politics:

elect_2013 %>%
  stats::lm(pol_knowledge ~ age_18_24 
     + internet_skills
     + female
     + tertiary_ed
     + income_quintiles
     + aust_born, data = .) %>%
  base::summary()
## 
## Call:
## stats::lm(formula = pol_knowledge ~ age_18_24 + internet_skills + 
##     female + tertiary_ed + income_quintiles + aust_born, data = .)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.898 -2.134  0.091  2.145  6.669 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        4.2579     0.1472   28.94  < 2e-16 ***
## age_18_24         -1.5296     0.2071   -7.39  1.9e-13 ***
## internet_skills    0.1680     0.0399    4.22  2.6e-05 ***
## female            -1.0899     0.0921  -11.83  < 2e-16 ***
## tertiary_ed        1.1527     0.1097   10.51  < 2e-16 ***
## income_quintiles   0.1630     0.0372    4.38  1.2e-05 ***
## aust_born          0.1016     0.1059    0.96     0.34    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.7 on 3435 degrees of freedom
##   (513 observations deleted due to missingness)
## Multiple R-squared:  0.118,  Adjusted R-squared:  0.116 
## F-statistic: 76.5 on 6 and 3435 DF,  p-value: <2e-16

What is the R-squared (or rather the Adjusted R-squared, which is a better measure)? It is around 0.12, meaning that the explanatory power of the model has dropped from explaining around 29% of the variation in political knowledge to only around 12%.
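
Rather than eyeballing the two summaries, we can calculate the change in R-squared directly. A minimal sketch (with the caveat that the two models are estimated on slightly different samples because of missing data on interest_pol, so the comparison is approximate):

model_full    <- stats::lm(pol_knowledge ~ age_18_24 + internet_skills + female
                           + tertiary_ed + income_quintiles + aust_born
                           + interest_pol, data = elect_2013)
model_reduced <- stats::update(model_full, . ~ . - interest_pol)  # same model, minus interest_pol
# Share of variation attributable to interest in politics (roughly 0.29 - 0.12)
base::summary(model_full)$adj.r.squared - base::summary(model_reduced)$adj.r.squared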

We can also use a second method to measure effect size: standardized regression coefficients.

4. Standardised Coefficients - Beta

The other method that is commonly used to measure effect size and to compare effect size across independent variables in a regression model is ‘standardised regression coefficients’.

At some level these are quite simple to understand. To estimate standardised coefficients, all of the variables are standardised (also called normalised) so that their mean is zero and their standard deviation is 1. And then they are put in the regression model.

The effect of this is to make the coefficients all measured in the same units. Each coefficient is a number that represents the number of standard deviations of change in y (the dependent variable) predicted by a one standard deviation change in each x (each independent variable).
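
To see what this means concretely, here is a sketch that standardises the variables by hand with scale() and refits model 2. The coefficients should be close to the betas reported by lm.beta below, though they can differ slightly because this sketch standardises using all non-missing cases rather than only the cases used to estimate the model:

vars <- c("pol_knowledge", "age_18_24", "internet_skills", "female",
          "tertiary_ed", "income_quintiles", "aust_born", "interest_pol")
elect_std <- as.data.frame(scale(elect_2013[vars]))   # rescale each variable to mean 0, sd 1
stats::lm(pol_knowledge ~ ., data = elect_std) %>%    # '.' here means 'all other columns'
    base::summary()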

Let’s estimate the standardised regression coefficients - which are also called ‘standardized betas’ or just ‘betas’ - for model 2.

To do this we need to install and then load the package “lm.beta”.

if(!require(lm.beta)) {install.packages("lm.beta", repos='https://cran.csiro.au/', dependencies=TRUE)}
base::library(lm.beta)

and then the command is simple: we just run lm.beta() before we run summary() (and after we have run lm()):

elect_2013 %>%
  stats::lm(pol_knowledge ~ age_18_24 
     + internet_skills
     + female
     + tertiary_ed
     + income_quintiles
     + aust_born
     + interest_pol, data = .) %>%
  lm.beta::lm.beta() %>%
  base::summary()
## 
## Call:
## stats::lm(formula = pol_knowledge ~ age_18_24 + internet_skills + 
##     female + tertiary_ed + income_quintiles + aust_born + interest_pol, 
##     data = .)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -7.052 -1.819  0.014  1.749  8.047 
## 
## Coefficients:
##                  Estimate Standardized Std. Error t value Pr(>|t|)    
## (Intercept)       7.25724      0.00000    0.16747   43.34  < 2e-16 ***
## age_18_24        -0.72437     -0.05817    0.18756   -3.86  0.00011 ***
## internet_skills   0.11722      0.05497    0.03581    3.27  0.00107 ** 
## female           -0.74846     -0.13042    0.08348   -8.97  < 2e-16 ***
## tertiary_ed       0.87490      0.14094    0.09883    8.85  < 2e-16 ***
## income_quintiles  0.13042      0.06266    0.03345    3.90 0.000098 ***
## aust_born         0.04310      0.00655    0.09517    0.45  0.65064    
## interest_pol     -1.56222     -0.43175    0.05367  -29.11  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.42 on 3423 degrees of freedom
##   (524 observations deleted due to missingness)
## Multiple R-squared:  0.293,  Adjusted R-squared:  0.291 
## F-statistic:  202 on 7 and 3423 DF,  p-value: <2e-16
# For reference, the standard deviations of two of the variables, to give a
# sense of what a 'one standard deviation change' means in raw units:
stats::sd(elect_2013$age, na.rm = TRUE)
## [1] 17
stats::sd(elect_2013$pol_knowledge, na.rm = TRUE)
## [1] 2.94

This output is identical to the summary of a normal linear regression model, except that it has added the standardised beta coefficients as a second column of numbers (under the heading ‘Standardized’).

If we look at these, we can see that the highest beta is for interest in politics, which is -0.43. When assessing effect size, we can ignore the sign (+/-), and a beta of 0.4 is very large.
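
If you are curious where a beta comes from, it is just the unstandardised coefficient rescaled by the ratio of standard deviations: beta = b × sd(x) / sd(y). A quick sketch checking the beta for interest in politics (it will be close to, but not exactly, -0.43, because these standard deviations are calculated on all non-missing cases rather than the model’s estimation sample):

model_2 <- stats::lm(pol_knowledge ~ age_18_24 + internet_skills + female
                     + tertiary_ed + income_quintiles + aust_born
                     + interest_pol, data = elect_2013)
stats::coef(model_2)["interest_pol"] *
  stats::sd(elect_2013$interest_pol, na.rm = TRUE) /
  stats::sd(elect_2013$pol_knowledge, na.rm = TRUE)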

When thinking about effect size and beta, a rough rule of thumb is:

  • small = ~ 0.05
  • medium = ~ 0.1-0.2
  • large = ~ 0.3+

Reading the betas, we can see that the three variables in our model with the largest impact on political knowledge are, in rank order:

  1. Interest in politics
  2. Tertiary education
  3. Gender
Last updated on 30 September, 2019 by Dr Nicholas Harrigan (nicholas.harrigan@mq.edu.au)