SOCI832: Lesson 11.2: Other Regression Models

0. Code to run to set up your computer.

# Update Packages
update.packages(ask = FALSE, repos='https://cran.csiro.au/', dependencies = TRUE)
# Install Packages
if(!require(dplyr)) {install.packages("dplyr", repos='https://cran.csiro.au/', dependencies=TRUE)}
if(!require(sjlabelled)) {install.packages("sjlabelled", repos='https://cran.csiro.au/', dependencies=TRUE)}
if(!require(sjmisc)) {install.packages("sjmisc", repos='https://cran.csiro.au/', dependencies=TRUE)}
if(!require(sjstats)) {install.packages("sjstats", repos='https://cran.csiro.au/', dependencies=TRUE)}
if(!require(sjPlot)) {install.packages("sjPlot", repos='https://cran.csiro.au/', dependencies=TRUE)}
if(!require(lm.beta)) {install.packages("lm.beta", repos='https://cran.csiro.au/', dependencies=TRUE)}

# Load packages into memory
base::library(dplyr)
base::library(sjlabelled)
base::library(sjmisc)
base::library(sjstats)
base::library(sjPlot)
base::library(lm.beta)

# Print 3 significant digits and discourage scientific notation
options(digits=3, scipen=8)

# Stop View from overloading memory with large datasets
# (show at most the first 500 rows; head() avoids NA rows when there are fewer)
RStudioView <- View
View <- function(x) {
  if (is.data.frame(x)) { RStudioView(utils::head(x, 500)) } else { RStudioView(x) }
}

# Datasets
# Example 1: Crime Dataset
lga <- readRDS(gzcon(url("https://methods101.com/data/nsw-lga-crime-clean.RDS")))

# Example 2: AuSSA Dataset
aus2012 <- readRDS(gzcon(url("https://mqsociology.github.io/learn-r/soci832/aussa2012.RDS")))

# Example 3: Australian Electoral Survey
aes_full <- readRDS(gzcon(url("https://mqsociology.github.io/learn-r/soci832/aes_full.rds")))

# Example 4: AES 2013, reduced
elect_2013 <- read.csv(url("https://methods101.com/data/elect_2013.csv"))

1. Other types of regression models

There are an almost infinite number of regression models available for data analysis.

As a researcher, it is impossible to know all of them.

What I want to do in this lesson is introduce you to:

  1. The main ways that models vary, such as their dependent variable, their assumptions about the distribution of the dependent variable, and the method of estimation.
  2. The basic commands for running the most common models you are likely to come across.

2. How regression models vary.

I think we can identify, even if it is an oversimplification, four main ways that regression models systematically differ from each other:

  1. The measurement of the dependent variable: is it continuous/interval, binary, ordinal, a range of choices, or something else?
  2. The (assumed) statistical distribution of the dependent variable: is it normally distributed, is it a count, or is it best represented by a logistic (or probit) distribution?
  3. Dependencies between the cases (units of analysis): Are these repeated measurements on the same cases (such as in time-series)? Are the cases nested within larger organisational units (e.g. classes, schools, states, nations)?
  4. The method of estimation: there are lots of different ways of calculating the best model. Some involve direct calculation, while others involve simulations and maximising/minimising certain ‘fit’ statistics.
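
As a minimal sketch of the first two points, the code below simulates one predictor and three kinds of dependent variable, then fits the model that matches each. The data, the variable names (hours, mark, pass, absences) and the coefficient values are all invented for illustration; none of them come from the lesson's datasets.

```r
# Simulated data: all names and values here are hypothetical
set.seed(832)
n <- 500
hours <- runif(n, 0, 10)   # predictor: hours of study

# Continuous DV: exam mark, roughly normal around a linear trend
mark <- 40 + 4 * hours + rnorm(n, sd = 8)
# Binary DV: pass (1) / fail (0), probability follows a logistic curve
pass <- rbinom(n, size = 1, prob = plogis(-2 + 0.6 * hours))
# Count DV: number of classes missed, Poisson with a log-linear mean
absences <- rpois(n, lambda = exp(1.5 - 0.15 * hours))

sim <- data.frame(hours, mark, pass, absences)

# The DV decides the model: same predictor, three different commands
m_ols     <- stats::lm(mark ~ hours, data = sim)
m_logit   <- stats::glm(pass ~ hours, data = sim, family = binomial)
m_poisson <- stats::glm(absences ~ hours, data = sim, family = poisson)

# The slopes are on different scales:
# marks, log-odds of passing, and log of the expected count
round(c(ols     = coef(m_ols)[["hours"]],
        logit   = coef(m_logit)[["hours"]],
        poisson = coef(m_poisson)[["hours"]]), 2)
```

Because each slope is on a different scale, the three coefficients cannot be compared directly, which is one reason the choice of model matters for interpretation as well as for estimation.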

In the table below we list a number of the most important regression models, and their characteristics.

| Model name | Dep. var. | When to use? | Command in R |
|---|---|---|---|
| Linear regression (ordinary least squares, OLS) | Continuous or interval | DV is continuous or interval. e.g. Mark out of 100 in exam. | stats::lm(...) |
| Logistic regression (Logit) | Binary | DV is binary. e.g. Pass(1)/Fail(0). Alternative to Probit. Follow convention of discipline. | stats::glm(..., family = binomial) |
| Probit regression (Probit) | Binary | DV is binary. e.g. Pass(1)/Fail(0). Alternative to Logit. Follow convention of discipline. | stats::glm(..., family = binomial(link = "probit")) |
| Conditional logit | Choices | DV is three or more (unordered) nominal choices. e.g. Brand of phone; Favourite colour. IVs = characteristics of choices. | survival::clogit(...) |
| Multinomial logit | Choices | DV is three or more (unordered) nominal choices. e.g. Brand of phone; Favourite colour. IVs = characteristics of individuals. | mlogit::mlogit(...) |
| Ordinal logistic regression (Ordered logit) | Ordinal | DV is an ordinal variable (few options). e.g. Agree/Neutral/Disagree with ‘Trump is a good President’. | MASS::polr(...) or ordinal::clm(...) |
| Poisson regression | Count | DV is a count variable (assumes variance = mean). e.g. Number of students who fail in each class. | stats::glm(..., family = "poisson") |
| Negative binomial regression | Count | DV is a count variable (does not assume variance = mean). e.g. Number of students who fail in each class. | MASS::glm.nb(...) |
| Zero inflated negative binomial regression | Count | DV is a count variable (large number of zero cases). e.g. Number of students who fail in each class. | pscl::zeroinfl(..., dist = "negbin") |
| Multilevel models | Any | Cases are clustered into groups, which means they are not independent. e.g. students in classes, classes in schools, schools in states. | lme4::lmer(...) |
| Tobit regression | Continuous but censored | DV is censored, i.e. you cannot observe the DV above or below a certain value. e.g. ‘ATAR less than 30’; ‘survival longer than 5 years’. | VGAM::vglm(..., tobit(Upper = ...)) or AER::tobit(...) |
| Survival analysis (Cox regression) | Time | DV is time until event. e.g. Years survival from diagnosis; Years studying PhD until graduation. | survival::coxph(...) |
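
A few of these commands can be sketched with simulated data. The example below fits a logit and a probit to the same binary outcome, an ordered logit to a three-category outcome, and a Poisson and a negative binomial model to the same overdispersed count. All variable names and parameter values are made up for illustration and do not come from the lesson's datasets; the sketch assumes the MASS package is installed (it normally ships with R as a recommended package).

```r
# Hypothetical simulated data; nothing here comes from the lesson's datasets
set.seed(112)
n <- 1000
x <- rnorm(n)

# Binary DV: logit and probit are alternatives for the same outcome
y_bin <- rbinom(n, size = 1, prob = plogis(0.5 + 1.0 * x))
m_logit  <- stats::glm(y_bin ~ x, family = binomial)   # logit is the default link
m_probit <- stats::glm(y_bin ~ x, family = binomial(link = "probit"))

# Coefficients differ in scale, but the fitted probabilities are very close
prob_gap <- max(abs(fitted(m_logit) - fitted(m_probit)))
print(prob_gap)

# Ordinal DV: cut a latent score into three ordered categories
latent <- 0.8 * x + rlogis(n)
y_ord <- cut(latent, breaks = c(-Inf, -1, 1, Inf),
             labels = c("Disagree", "Neutral", "Agree"),
             ordered_result = TRUE)
m_ord <- MASS::polr(y_ord ~ x, Hess = TRUE)   # Hess = TRUE so summary() can report SEs
print(coef(m_ord))

# Count DV with variance greater than the mean (overdispersion)
y_cnt <- rnbinom(n, size = 1, mu = exp(1 + 0.5 * x))
m_pois <- stats::glm(y_cnt ~ x, family = poisson)
m_nb   <- MASS::glm.nb(y_cnt ~ x)

# Rough overdispersion check: Pearson chi-square / residual df.
# Values well above 1 suggest the Poisson assumption (variance = mean) fails.
dispersion <- sum(residuals(m_pois, type = "pearson")^2) / df.residual(m_pois)
print(dispersion)

# Lower AIC indicates the better-fitting count model
print(AIC(m_pois, m_nb))
```

With real data the same checks apply: if the dispersion statistic is close to 1, a Poisson model may be adequate, and the choice between logit and probit usually comes down to the convention of the discipline rather than fit.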

3. References

Multinomial logistic regression

https://stats.idre.ucla.edu/r/dae/multinomial-logistic-regression/

Hoffman, S.D. & Duncan, G.J. (1988) ‘Multinomial and conditional logit discrete-choice models in demography.’ Demography 25: 415. DOI: 10.2307/2061541

Estimation of multinomial logit models in R: The mlogit Packages