The final lab introduceblos 1) how to conduct a multiple regression model, 2).how to use dummy variables in the regression model, and 3) how to create regression tables using R. We use five packages in this lab. Load them using the following code:
library(dplyr)
library(sjlabelled)
library(sjmisc)
library(summarytools)
library(sjPlot)
Also, run the following code. Otherwise, you will see scientific notations (e.g., 2e-16) instead of numbers in the R output.
options(digits=5, scipen=15)
This lab uses the 2009 AuSSa dataset. Download the data file on iLearn and put it into your working directory. Then, run the following code:
aus2009 <-readRDS("aussa2009.rds")
Using the 2009 AuSSa dataset, this lab investigates gender income gaps in Australia. For this investigation, we will use six variables: gender (sex), education (educyrs), employment status (wrkst), supervisor status (wrksup), working hours per week (wrkhrs) and annual income (rinc). We also add a variable of identification numbers (id) to this dataset. We create a new dataset consisting of only complete cases which have no missing values on the six variables of our choice. The following code performs this task. If you cannot understand the code, review Creating a Dataset of Complete Cases.
income <- aus2009 %>%
select(id, sex, educyrs, wrkst, wrksup, wrkhrs, rinc)
income <- income %>%
mutate(mark = 0)
income <- income %>%
mutate(mark = replace(mark, complete.cases(income), 1))
income <- income %>%
filter(mark == 1)
Multiple Regression Analysis
Running regression models is quite simple. We use ‘lm()
’ again but add more independent variables. Thus, ‘model name <- lm(dependent ~ independent 1 + independent 2 + independent 3 + …, data = data name)
’ conducts multiple regression analysis. The following code estimates a regression model in which education (educyrs) and working hours per week (wrkhrs) explain the annual income(rinc).
model <- lm(rinc ~ educyrs + wrkhrs, data = income)
summary(model)
##
## Call:
## lm(formula = rinc ~ educyrs + wrkhrs, data = income)
##
## Residuals:
## Min 1Q Median 3Q Max
## -81451 -16559 -2982 17717 88216
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -12938.1 4758.5 -2.72 0.0067 **
## educyrs 2668.0 271.5 9.83 <0.0000000000000002 ***
## wrkhrs 737.8 67.2 10.99 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 25600 on 732 degrees of freedom
## Multiple R-squared: 0.235, Adjusted R-squared: 0.232
## F-statistic: 112 on 2 and 732 DF, p-value: <0.0000000000000002
In the output, the intercept is -12,938.1, which means that the annual income will be - $12,938.1 if people have 0 years of schooling and do not work. The coefficient of educyrs is 2,668, which means that for each additional one-year increase in years of schooling, the annual income will increase by $2,668 while controlling for the effect of working hours per week. This positive effect of education is statistically significant at α = .05 because the p-value (<0.0000000000000002) is much less than .05. The coefficient of wrkhrs is 737.8, which means that for each additional one-hour increase in the working hours per week, the annual income will increase by $737.8 while controlling for the effect of education. This positive effect of working hours is statistically significant at α = .05 because the p-value (<0.0000000000000002) is much less than .05. Thus, the regression equation of this model is \(rinc = 2,688 × educyrs + 737.8 × wrkhrs – 12,938.1\). The coefficient of determination (Multiple R-squared) is 0.235, which means that 23.5% of the total variance in the annual incomes is explained by the two independent variables.
Dummy Variable Analysis
Creating a Dummy Variable.
A dummy variable is a binary variable used in the regression analysis that represents subgroups (or categories) of the sample in your study. And it is used to estimate the effect of nominal variables. Let’s make a dummy variable of gender. The following code shows two categories of gender (sex): male and female.
frq(income$sex)
##
## R: Sex (x) <numeric>
## # total N=735 valid N=735 mean=1.53 sd=0.50
##
## Value | Label | N | Raw % | Valid % | Cum. %
## -----------------------------------------------
## 1 | Male | 346 | 47.07 | 47.07 | 47.07
## 2 | Female | 389 | 52.93 | 52.93 | 100.00
## <NA> | <NA> | 0 | 0.00 | <NA> | <NA>
Since we want to investigate how much less women earn compared to men, males should be the reference category which takes a value of 0. Consequently, females should take a value of 1. Thus, let’s recode sex in such a way. In a nutshell, the reference category must take 0 values in the dummy variable.
income <- rec(income, sex, rec = "1=0;2=1", append = TRUE)
Then, we rename the newly recoded variable as female and set the variable and value label.
income <- var_rename(income, sex_r = female)
income$female <- set_label(income$female, label = "Gender")
income$female <- set_labels(income$female, labels = c("Male" = 0, "Female" = 1))
The most important step is to change this dummy variable as a factor. Otherwise, R will not treat it as a dummy variable in the regression analysis.
income$female <- to_factor(income$female)
In some cases, you may want to use women instead of men as your reference category. In this case, you can easily change a reference category by using the ‘ref_lvl(variable, lvl = value for the reference category)
’ function. Make sure that the variable used for ‘ref_lvl()
’ should be a variable that is already converted to a factor. The following code makes a new variable (male) in which 0 is female, and 1 is male.
income$male <- ref_lvl(income$female, lvl = 1)
The following code makes a dummy variable of employment status (wrkst). wrkst has ten categories, which is too many to analyse them. When you look at the frequency table of wrkst, value 4 to 10 can be collapsed into one employment status, ‘not in labour force”. We will use this category as a reference category.
frq(aus2009$wrkst)
##
## R: Current employment status (x) <numeric>
## # total N=1525 valid N=1488 mean=4.14 sd=3.17
##
## Value | Label | N | Raw % | Valid % | Cum. %
## -----------------------------------------------------------------------------
## 1 | Employed-full time, main job | 579 | 37.97 | 38.91 | 38.91
## 2 | Employed-part time, main job | 206 | 13.51 | 13.84 | 52.76
## 4 | Helping family member | 25 | 1.64 | 1.68 | 54.44
## 5 | Unemployed | 23 | 1.51 | 1.55 | 55.98
## 6 | Student, school, vocational training | 44 | 2.89 | 2.96 | 58.94
## 7 | Retired | 336 | 22.03 | 22.58 | 81.52
## 8 | Housewife, -man, home duties | 171 | 11.21 | 11.49 | 93.01
## 9 | Permanently disabled | 71 | 4.66 | 4.77 | 97.78
## 10 | Other,not in labour force | 33 | 2.16 | 2.22 | 100.00
## <NA> | <NA> | 37 | 2.43 | <NA> | <NA>
income <- rec(income, wrkst, rec = "1=1; 2=2; 4:10=0", append = TRUE)
income$wrkst_r <- set_label(income$wrkst_r, label = "Current employment status")
income$wrkst_r <- set_labels(income$wrkst_r, labels = c("Not in labour force" = 0,
"Full-time" = 1,
"Part-time" = 2))
income$wrkst_r <- to_factor(income$wrkst_r)
The following code makes a dummy variable of supervisor status (wrksup). Supervisees are treated as a reference category.
frq(income$wrksup)
##
## R: Supervises others at work (x) <numeric>
## # total N=735 valid N=735 mean=1.49 sd=0.50
##
## Value | Label | N | Raw % | Valid % | Cum. %
## ----------------------------------------------
## 1 | Yes | 374 | 50.88 | 50.88 | 50.88
## 2 | No | 361 | 49.12 | 49.12 | 100.00
## <NA> | <NA> | 0 | 0.00 | <NA> | <NA>
income <- rec(income, wrksup, rec = "1=1; 2=0", append = TRUE)
income$wrksup_r <- set_label(income$wrksup_r, label = "Supervisor jobs")
income$wrksup_r <- set_labels(income$wrksup_r, labels = c("Suvervisee" = 0,
"Supervisor" = 1))
income$wrksup_r <- to_factor(income$wrksup_r)
Assigning Variable Labels
We also assign variable labels to all the other variables of the income dataset.
income$educyrs <- set_label(income$educyrs, label = "Education")
income$wrkhrs <- set_label(income$wrkhrs, label = "Hours worked per week")
income$rinc <- set_label(income$rinc, label = "Annual income")
Dummy Variable Regression Analysis
First, we estimate the annual income only by gender, which shows gender difference in the annual income.
model.4.1 <- lm(rinc ~ female, data = income)
summary(model.4.1)
##
## Call:
## lm(formula = rinc ~ female, data = income)
##
## Residuals:
## Min 1Q Median 3Q Max
## -65099 -22495 -99 19105 58105
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 65099 1485 43.83 <0.0000000000000002 ***
## female1 -19204 2042 -9.41 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 27600 on 733 degrees of freedom
## Multiple R-squared: 0.108, Adjusted R-squared: 0.106
## F-statistic: 88.5 on 1 and 733 DF, p-value: <0.0000000000000002
In the output, ‘female1’ indicates ‘being a female (when the value of female is 1). Thus, the coefficient of ‘female1’ means that women earn $19,204 less than men. And this female disadvantage is statistically significant (the p-value is less than .001).
For the comparison, let’s use male instead of female in the regression analysis. Note that ‘female’ is the reference category in the male variable.
model.4.1.1 <- lm(rinc ~ male, data = income)
summary(model.4.1.1)
##
## Call:
## lm(formula = rinc ~ male, data = income)
##
## Residuals:
## Min 1Q Median 3Q Max
## -65099 -22495 -99 19105 58105
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 45895 1401 32.76 <0.0000000000000002 ***
## male1 19204 2042 9.41 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 27600 on 733 degrees of freedom
## Multiple R-squared: 0.108, Adjusted R-squared: 0.106
## F-statistic: 88.5 on 1 and 733 DF, p-value: <0.0000000000000002
In the output, look at the coefficient of ‘male1’. It is 19,204. Note that it is a POSITIVE number. It means that men earn $19,204 MORE than women. This example shows well that regression coefficients change when a different category is set as a reference category and that the interpretation of coefficients for dummy variables should be made in relation to the assigned reference category.
The second model estimates female disadvantage in the annual income while controlling for education.
model.4.2 <- lm(rinc ~ female + educyrs, data = income)
summary(model.4.2)
##
## Call:
## lm(formula = rinc ~ female + educyrs, data = income)
##
## Residuals:
## Min 1Q Median 3Q Max
## -63729 -17900 -2893 19989 59271
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 24743 4136 5.98 0.0000000034 ***
## female1 -19715 1908 -10.33 < 0.0000000000000002 ***
## educyrs 2836 274 10.36 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 25800 on 732 degrees of freedom
## Multiple R-squared: 0.222, Adjusted R-squared: 0.22
## F-statistic: 104 on 2 and 732 DF, p-value: <0.0000000000000002
After controlling education, the coefficient of female drops to -19,715, which means that women earn $19,715 less than men who have the same level of education.
The third model estimates female disadvantage in the annual income while controlling for education and working hours per week.
model.4.3 <- lm(rinc ~ female + educyrs + wrkhrs, data = income)
summary(model.4.3)
##
## Call:
## lm(formula = rinc ~ female + educyrs + wrkhrs, data = income)
##
## Residuals:
## Min 1Q Median 3Q Max
## -73473 -15943 -2091 17612 75984
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 361.3 4917.4 0.07 0.94
## female1 -14496.9 1929.2 -7.51 0.00000000000016727 ***
## educyrs 2743.1 262.0 10.47 < 0.0000000000000002 ***
## wrkhrs 571.0 68.5 8.34 0.00000000000000036 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24700 on 731 degrees of freedom
## Multiple R-squared: 0.289, Adjusted R-squared: 0.287
## F-statistic: 99.3 on 3 and 731 DF, p-value: <0.0000000000000002
After controlling education and working hours per week, the coefficient of female reduced to -14,496, which means that women earn $14,496 less than men if they have the same level of education and work as much as men do. This reduction implies that women work less than men, which in turn decreases their annual income.
The fourth model adds another dummy variable, employment status (wrkst_r).
model.4.4 <- lm(rinc ~ female + educyrs + wrkhrs + wrkst_r, data = income)
summary(model.4.4)
##
## Call:
## lm(formula = rinc ~ female + educyrs + wrkhrs + wrkst_r, data = income)
##
## Residuals:
## Min 1Q Median 3Q Max
## -69347 -15734 -1928 16941 76334
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -18471.8 7654.2 -2.41 0.01606 *
## female1 -12781.4 1889.4 -6.76 0.000000000027 ***
## educyrs 2551.5 255.7 9.98 < 0.0000000000000002 ***
## wrkhrs 309.9 90.7 3.42 0.00067 ***
## wrkst_r1 35211.0 5827.7 6.04 0.000000002427 ***
## wrkst_r2 22627.1 6079.8 3.72 0.00021 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24000 on 729 degrees of freedom
## Multiple R-squared: 0.333, Adjusted R-squared: 0.328
## F-statistic: 72.8 on 5 and 729 DF, p-value: <0.0000000000000002
The output shows two coefficients of employment status (wrkst_r). In wrkst_r, 1 means “Full-time employment”, and 2 means “Part-time employment”. Thus, the coefficient of ‘wrkst_r1’ is for “Full-time employment and that of ‘wrkst_r2’ is for “Part-time employment”. The interpretation is that full-time workers earn $35,211 more, and part-time workers earn $22,627 more than those who are not in the labour force (the reference category) while controlling for gender, education and working hours per week. After controlling employment status as well as education and working hours per week, women earn $12,781 less than men. The gender gap is reduced a lot compared to that in the second model, which implies that employment status accounts substantially for female disadvantage in the annual income. Women are less likely to have stable jobs than men, which decrease their annual income.
The fifth model adds a dummy variable of supervisor status, wrksup_r.
model.4.5 <- lm(rinc ~ female + educyrs + wrkhrs + wrkst_r + wrksup_r, data = income)
summary(model.4.5)
##
## Call:
## lm(formula = rinc ~ female + educyrs + wrkhrs + wrkst_r + wrksup_r,
## data = income)
##
## Residuals:
## Min 1Q Median 3Q Max
## -71303 -14958 -1293 16018 77518
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17601 7563 -2.33 0.02023 *
## female1 -12462 1868 -6.67 0.00000000005 ***
## educyrs 2367 256 9.24 < 0.0000000000000002 ***
## wrkhrs 238 91 2.62 0.00904 **
## wrkst_r1 35738 5758 6.21 0.00000000091 ***
## wrkst_r2 22764 6005 3.79 0.00016 ***
## wrksup_r1 7992 1824 4.38 0.00001346487 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23700 on 728 degrees of freedom
## Multiple R-squared: 0.35, Adjusted R-squared: 0.345
## F-statistic: 65.4 on 6 and 728 DF, p-value: <0.0000000000000002
The coefficient of ‘wrksup_r1’ (which means those who supervise others at work) is 7,992. This means that supervisors earn $7,992 more than supervisees while controlling for gender, education, working hours per week and employment status. The coefficient of ‘female1’ is a little bit changed, implying that supervisor status does not contribute much to female disadvantage in the annual income. The regression equation of this model is \(rinc =-12,462 × female + 2,367 × educyrs + 238 × wrkhrs + 35,738 × wrkst_r1 + 22,764 × wrkst_r2 +7,992 × wrksup_r1 – 17,601\) R-square (Multiple R-squared) is 0.35, which means that 35% of the total variance in the annual income is explained by the five variables (gender, education, working hours, employment status, and supervisor status).
Creating a Regression Table
We have so far analysed female disadvantage in the annual income using several regression models. How can we present all these regression models in a just single table? Using the ‘tab_model()
’ function would be the easiest way. The following code combines the five regression models we have estimated so far and creates a single regression table.
tab_model(model.4.1, model.4.2, model.4.3, model.4.4, model.4.5, digits = 3,
pred.labels = c("Intercept", "Female (ref. = Male)",
"Education", "Hours worked per week",
"Full-time (ref. = Not in labour force)",
"Part-time (ref. = Not in labour force)",
"Supervisor (ref. = Supervisee)"),
dv.labels = c("Model 1", "Model 2", "Model 3", "Model 4", "Model 5"),
title = "Gender Effects on Annual Income",
show.se = TRUE, show.ci = FALSE, collapse.se = TRUE, p.style = "stars")
Model 1 | Model 2 | Model 3 | Model 4 | Model 5 | |
---|---|---|---|---|---|
Predictors | Estimates | Estimates | Estimates | Estimates | Estimates |
Intercept |
65099.191 *** (1485.239) |
24743.249 *** (4135.493) |
361.329 (4917.426) |
-18471.756 * (7654.244) |
-17601.103 * (7563.033) |
Female (ref. = Male) |
-19204.178 *** (2041.574) |
-19714.886 *** (1908.532) |
-14496.910 *** (1929.191) |
-12781.385 *** (1889.367) |
-12461.921 *** (1867.632) |
Education |
2835.734 *** (273.737) |
2743.087 *** (261.986) |
2551.511 *** (255.720) |
2366.974 *** (256.071) |
|
Hours worked per week |
571.010 *** (68.454) |
309.875 *** (90.654) |
238.257 ** (91.022) |
||
Full-time (ref. = Not in labour force) |
35211.005 *** (5827.722) |
35737.917 *** (5757.544) |
|||
Part-time (ref. = Not in labour force) |
22627.062 *** (6079.790) |
22764.274 *** (6005.349) |
|||
Supervisor (ref. = Supervisee) |
7992.262 *** (1823.718) |
||||
Observations | 735 | 735 | 735 | 735 | 735 |
R2 / R2 adjusted | 0.108 / 0.106 | 0.222 / 0.220 | 0.289 / 0.287 | 0.333 / 0.328 | 0.350 / 0.345 |
|
In the code, you must list the name of regression models that will be included in a regression table. ‘digits = value
’ specifies the decimal places for numbers in the table. ‘pred.labels = c()
’ sets the labels of independent variables displayed in the first column of the table. ‘dv.labels = c()
’ sets the labels of models displayed in the first row of the table. ‘title = “”
’ sets the title of the table. ‘show.se = TRUE
’ displays the standard errors of coefficients, ‘show.ci = FALSE
’ does not show the confidence intervals of coefficients, and ‘collapse.se = TRUE
’ shows standard errors along with coefficients in the same cell. ‘p.style = “stars”
’ displays the p-value in the format of stars.
Lab 11 Participation Activity |
No Lab Participation Activity this week. Completing R Analysis Task 4 will contribute to your participation mark. |
The R codes you have written so far look like:
################################################################################
# Lab 11: Multiple Regression
# 31/5/2021
# SOCI8015 & SOCX8015
################################################################################
# Load packages
library(dplyr)
library(sjlabelled)
library(sjmisc)
library(summarytools)
library(sjPlot)
# Avoid scientific notation
options(digits=5, scipen=15)
# Load dataset
aus2009 <- readRDS("aussa2009.RDS")
# Create a dataset of complete cases
income <- aus2009 %>%
select(id, sex, educyrs, wrkst, wrksup, wrkhrs, rinc)
income <- income %>%
mutate(mark = 0)
income <- income %>%
mutate(mark = replace(mark, complete.cases(income), 1))
income <- income %>%
filter(mark == 1)
# Multiple regression
model <- lm(rinc ~ educyrs + wrkhrs, data = income)
summary(model)
# Make dummy variables
## Gender
frq(income$sex)
income <- rec(income, sex, rec = "1=0;2=1", append = TRUE)
income <- var_rename(income, sex_r = female) # rename `sex_r` as `female`
income$female <- set_label(income$female, label = "Gender")
income$female <- set_labels(income$female, labels = c("Male" = 0, "Female" = 1))
income$female <- to_factor(income$female) # convert into a factor for dummy
income$male <- ref_lvl(income$female, lvl = 1) # set females as a reference
## Current employment status
frq(aus2009$wrkst)
income <- rec(income, wrkst, rec = "1=1; 2=2; 4:10=0", append = TRUE)
income$wrkst_r <- set_label(income$wrkst_r, label = "Current employment status")
income$wrkst_r <- set_labels(income$wrkst_r, labels = c("Not in labour force" = 0,
"Full-time" = 1,
"Part-time" = 2))
income$wrkst_r <- to_factor(income$wrkst_r)
## Supervisor
frq(income$wrksup)
income <- rec(income, wrksup, rec = "1=1; 2=0", append = TRUE)
income$wrksup_r <- set_label(income$wrksup_r, label = "Supervisor jobs")
income$wrksup_r <- set_labels(income$wrksup_r, labels = c("Suvervisee" = 0,
"Supervisor" = 1))
income$wrksup_r <- to_factor(income$wrksup_r)
## Assign Variable Name
income$educyrs <- set_label(income$educyrs, label = "Education")
income$wrkhrs <- set_label(income$wrkhrs, label = "Hours worked per week")
income$rinc <- set_label(income$rinc, label = "Annual income")
# Dummy variable regression analysis
model.4.1 <- lm(rinc ~ female, data = income) # male as the reference
summary(model.4.1)
model.4.1.1 <- lm(rinc ~ male, data = income) # female as the reference
summary(model.4.1.1)
model.4.2 <- lm(rinc ~ female + educyrs, data = income)
summary(model.4.2)
model.4.3 <- lm(rinc ~ female + educyrs + wrkhrs, data = income)
summary(model.4.3)
model.4.4 <- lm(rinc ~ female + educyrs + wrkhrs + wrkst_r, data = income)
summary(model.4.4)
model.4.5 <- lm(rinc ~ female + educyrs + wrkhrs + wrkst_r + wrksup_r, data = income)
summary(model.4.5)
# Make a regression table
tab_model(model.4.1, model.4.2, model.4.3, model.4.4, model.4.5, digits = 3,
pred.labels = c("Intercept", "Female (ref. = Male)",
"Education", "Hours worked per week",
"Full-time (ref. = Not in labour force)",
"Part-time (ref. = Not in labour force)",
"Supervisor (ref. = Supervisee)"),
dv.labels = c("Model 1", "Model 2", "Model 3", "Model 4", "Model 5"),
title = "Gender Effects on Annual Income",
show.se = TRUE, show.ci = FALSE, collapse.se = TRUE, p.style = "stars")