0. Code to run to set up your computer.
# Update Packages
update.packages(ask = FALSE, repos='https://cran.csiro.au/', dependencies = TRUE)
# Install Packages
if(!require(dplyr)) {install.packages("dplyr", repos='https://cran.csiro.au/', dependencies=TRUE)}
if(!require(sjlabelled)) {install.packages("sjlabelled", repos='https://cran.csiro.au/', dependencies=TRUE)}
if(!require(sjmisc)) {install.packages("sjmisc", repos='https://cran.csiro.au/', dependencies=TRUE)}
if(!require(sjstats)) {install.packages("sjstats", repos='https://cran.csiro.au/', dependencies=TRUE)}
if(!require(sjPlot)) {install.packages("sjPlot", repos='https://cran.csiro.au/', dependencies=TRUE)}
if(!require(lm.beta)) {install.packages("lm.beta", repos='https://cran.csiro.au/', dependencies=TRUE)}
# Load packages into memory
base::library(dplyr)
base::library(sjlabelled)
base::library(sjmisc)
base::library(sjstats)
base::library(sjPlot)
base::library(lm.beta)
# Discourage scientific notation (scipen) and print 3 significant digits (digits)
options(digits=3, scipen=8)
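# Stop View from overloading memory with large datasets:
# show at most the first 500 rows of a data frame
RStudioView <- View
View <- function(x) {
  # head() avoids the rows of NA that x[1:500, ] creates for short data frames
  if (is.data.frame(x)) { RStudioView(utils::head(x, 500)) } else { RStudioView(x) }
}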
# Datasets
# Example 1: Crime Dataset
lga <- readRDS(url("https://methods101.com/data/nsw-lga-crime-clean.RDS"))
# Example 2: AuSSA Dataset
aus2012 <- readRDS(url("https://mqsociology.github.io/learn-r/soci832/aussa2012.RDS"))
# Example 3: Australian Electoral Survey
aes_full <- readRDS(gzcon(url("https://mqsociology.github.io/learn-r/soci832/aes_full.rds")))
# Example 4: AES 2013, reduced
elect_2013 <- read.csv(url("https://methods101.com/data/elect_2013.csv"))
1. Other types of regression models
There is an almost unlimited number of regression models available for data analysis.
As a researcher, it is impossible to know all of them.
What I want to do in this lesson is introduce you to:
- The main ways that models vary, such as their dependent variable, their assumptions about the distribution of the dependent variable, and the method of estimation.
- The basic commands for running the most common models you are likely to come across.
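
As a minimal sketch of what these basic commands look like, the example below fits a linear, a logistic, and a probit regression to simulated data. The data frame `students` and its variables (`hours`, `mark`, `pass`) are invented for illustration and are not part of the lesson's datasets.

```r
# Simulated data: hypothetical study hours and exam marks (not a real dataset)
set.seed(832)
n <- 200
students <- data.frame(hours = runif(n, min = 0, max = 20))
students$mark <- 40 + 2 * students$hours + rnorm(n, mean = 0, sd = 10)
students$pass <- as.integer(students$mark >= 50)

# Linear regression (OLS): continuous dependent variable
fit_ols <- stats::lm(mark ~ hours, data = students)

# Logistic regression: binary dependent variable
fit_logit <- stats::glm(pass ~ hours, data = students, family = binomial)

# Probit regression: same binary DV, different link function
fit_probit <- stats::glm(pass ~ hours, data = students,
                         family = binomial(link = "probit"))

summary(fit_ols)
summary(fit_logit)
summary(fit_probit)
```

The three calls share the same formula interface (`DV ~ IV`), so moving between these models is mostly a matter of changing the function or the `family` argument.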
2. How regression models vary.
I think we can identify, even if it is an oversimplification, four main ways that regression models systematically differ from each other:
- The measurement of the dependent variable: is it continuous/interval, binary, ordinal, a range of choices, or something else?
- The (assumed) statistical distribution of the dependent variable: is it normally distributed, or a count, or is it best represented by a logistic (or probit) distribution?
- Dependencies between the cases (units of analysis): Are these repeated measurements on the same cases (such as in time-series)? Are the cases nested within larger organisational units (e.g. classes, schools, states, nations)?
- The method of estimation: there are lots of different ways of calculating the best model. Some involve direct calculation, while others involve simulations and maximising/minimising certain ‘fit’ statistics.
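
To make the point about estimation methods concrete, here is a rough sketch that estimates the same linear model in two ways: by direct calculation (solving the normal equations) and by numerically minimising a 'fit' statistic (the negative log-likelihood). The data are simulated, and the names `neg_loglik`, `b_direct`, and `fit_ml` are made up for this illustration.

```r
# Simulated data (hypothetical): one predictor, known true coefficients
set.seed(101)
n <- 500
x <- rnorm(n)
y <- 1 + 0.5 * x + rnorm(n)

# 1. Direct calculation: solve the normal equations (X'X) b = X'y
X <- cbind(Intercept = 1, x = x)
b_direct <- solve(crossprod(X), crossprod(X, y))

# 2. Iterative estimation: search for the coefficients (and residual SD)
#    that minimise the negative log-likelihood of a normal model
neg_loglik <- function(par) {
  mu <- par[1] + par[2] * x
  sigma <- exp(par[3])  # optimise log(sigma) so that sigma stays positive
  -sum(dnorm(y, mean = mu, sd = sigma, log = TRUE))
}
fit_ml <- stats::optim(par = c(0, 0, 0), fn = neg_loglik, method = "BFGS")

# Compare both approaches with the built-in command
cbind(direct  = as.vector(b_direct),
      max_lik = fit_ml$par[1:2],
      lm      = coef(stats::lm(y ~ x)))
```

For a linear model the two approaches should agree closely. For most of the other models discussed in this lesson no closed-form solution exists, so some form of iterative estimation is the only option.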
In the table below we list a number of the most important regression models, and their characteristics.
Model name | Dep Var | When to use? | Command in R |
---|---|---|---|
Linear regression (ordinary least squares - OLS) | Cont. or Intval | DV is continuous or interval. e.g. Mark out of 100 in exam. | stats::lm(...) |
Logistic regression (Logit) | Binary | DV is binary. e.g. Pass(1)/Fail(0). Alternative to Probit; follow the convention of your discipline. | stats::glm(..., family = binomial) |
Probit regression (Probit) | Binary | DV is binary. e.g. Pass(1)/Fail(0). Alternative to Logit; follow the convention of your discipline. | stats::glm(..., family = binomial(link = "probit")) |
Conditional logit | Choices | DV is three or more (unordered) nominal choices. e.g. Brand of phone; Favourite colour. IVs = characteristics of choices. | survival::clogit(...) |
Multinomial logit | Choices | DV is three or more (unordered) nominal choices. e.g. Brand of phone; Favourite colour. IVs = characteristics of individuals. | mlogit::mlogit(...) |
Ordinal logistic regression (Ordered logit) | Ordinal | DV is an ordinal variable (few options). e.g. Agree/Neutral/Disagree that Trump is a good President. | MASS::polr(...) or ordinal::clm(...) |
Poisson regression | Count | DV is a count variable (assumes variance = mean). e.g. Number of students who fail in each class. | stats::glm(..., family = "poisson") |
Negative binomial regression | Count | DV is a count variable (does not assume variance = mean). e.g. Number of students who fail in each class. | MASS::glm.nb(...) |
Zero inflated negative binomial regression | Count | DV is a count variable (large number of zero cases). e.g. Number of students who fail in each class. | pscl::zeroinfl(..., dist = "negbin") |
Multilevel models | Any | Cases are clustered into groups, which means they are not independent. e.g. students in classes, classes in schools, schools in states. | lme4::lmer(...) or lme4::glmer(...) |
Tobit regression | Cont. but censored | DV is censored, i.e. you cannot observe the DV above or below a certain value. e.g. ‘ATAR less than 30’; ‘survival longer than 5 years’. | VGAM::vglm(..., tobit(Upper = ...)) or AER::tobit(...) |
Survival analysis (Cox regression) | Time | DV is time until event. e.g. Years survival from diagnosis; Years studying PhD until graduation. | survival::coxph(...) |
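
The difference between the two main count models in the table can be illustrated with a short sketch. It simulates overdispersed counts, fits a Poisson regression, checks whether the variance = mean assumption looks plausible, and then fits a negative binomial regression. The data frame `classes` and its variables are hypothetical, and the example assumes the `MASS` package (normally shipped with R) is installed.

```r
# Simulated data (hypothetical): overdispersed counts of failures per class
set.seed(2024)
n <- 300
class_size <- sample(20:60, n, replace = TRUE)
mu <- exp(-1 + 0.04 * class_size)
fails <- rnbinom(n, mu = mu, size = 1.5)  # variance = mu + mu^2 / 1.5
classes <- data.frame(fails, class_size)

# Poisson regression: assumes variance = mean
fit_pois <- stats::glm(fails ~ class_size, data = classes, family = "poisson")

# Rough overdispersion check: Pearson chi-square / residual df.
# Values well above 1 suggest the Poisson assumption does not hold.
dispersion <- sum(residuals(fit_pois, type = "pearson")^2) / df.residual(fit_pois)
dispersion

# Negative binomial regression: estimates an extra dispersion parameter
fit_nb <- MASS::glm.nb(fails ~ class_size, data = classes)

# Lower AIC indicates the better-fitting model for these data
AIC(fit_pois, fit_nb)
```

With real data the same two steps apply: fit the Poisson model first, and consider the negative binomial model if the dispersion statistic is clearly larger than 1.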
3. References
Conditional logistic regression
- https://stat.ethz.ch/R-manual/R-devel/library/survival/html/clogit.html
Multinomial logistic regression
- https://stats.idre.ucla.edu/r/dae/multinomial-logistic-regression/
- Hoffman, S.D. & Duncan, G.J. (1988) ‘Multinomial and conditional logit discrete-choice models in demography.’ Demography 25: 415. DOI: 10.2307/2061541
- Estimation of multinomial logit models in R: The mlogit Packages
Poisson Regression
- https://www.dataquest.io/blog/tutorial-poisson-regression-in-r/
- https://stats.idre.ucla.edu/r/dae/poisson-regression/
- https://www.theanalysisfactor.com/generalized-linear-models-in-r-part-6-poisson-regression-count-variables/
- https://www.theanalysisfactor.com/glm-r-overdispersion-count-regression/
Ordered Logit
- https://stats.idre.ucla.edu/r/dae/ordinal-logistic-regression/
- https://www.rdocumentation.org/packages/MASS/versions/7.3-51.4/topics/polr
- https://www.r-bloggers.com/how-to-perform-ordinal-logistic-regression-in-r/
- https://cran.r-project.org/web/packages/ordinal/ordinal.pdf
Zero inflated negative binomial
- https://stats.idre.ucla.edu/r/dae/zinb/
- https://cran.r-project.org/web/packages/pscl/pscl.pdf
Multilevel models
- https://www.rdocumentation.org/packages/lme4/versions/1.1-21/topics/lmer
Tobit models
- https://stats.idre.ucla.edu/r/dae/tobit-models/
- https://www.rdocumentation.org/packages/AER/versions/1.2-7/topics/tobit
- https://cran.r-project.org/web/packages/VGAM/VGAM.pdf
- https://www.rdocumentation.org/packages/VGAM/versions/1.1-1/topics/vglm