SOCI8015 Lab 11: Multiple Regression

Simple Regression Analysis
Multiple Regression Analysis
Creating a Regression Table
Dummy Variable Analysis
- Creating a Dummy Variable.
Dummy Variable Regression Analysis
Creating a Regression Table
Lab 11 Participation Activity

The final lab introduces 1) how to run a multiple regression model, 2) how to use dummy variables in the regression model, and 3) how to create regression tables using R. We use five packages in this lab. Load them with the following code:

library(dplyr)
library(sjlabelled)
library(sjmisc)
library(summarytools)
library(sjPlot)

Also, run the following code. Otherwise, you will see scientific notations (e.g., 2e-16) instead of numbers in the R output.

options(digits=5, scipen=15)

This lab uses the Crime Rates Dataset of NSW Local Government Areas (NSW Crime) that you used in the last lab. I assume that the dataset is already in your working folder. If so, run the following code:

crime <- readRDS("nsw-lga-crime-2023.RDS")

In Lab 10, you constructed a dataset of complete observations from the crime dataset. We will use this dataset again in this lab. Therefore, you will need to run the follwoing code again to construct this complete observation dataset. The new dataset name is crime2. And you also rescale the median sales price of houses by dividing it by 10,000. If you can’t remember how to construct a dataset of complete observations, see Creating a Dataset of Complete Cases.

crime2 <- crime %>%
  select(code, lga, medage, unemploy, housmedprice, sexoff, region, urban)
crime2 <- crime2 %>%
  mutate(mark = 0)
crime2 <- crime2 %>%
  mutate(mark = replace(mark, complete.cases(crime2), 1))
crime2 <- crime2 %>%
  filter(mark == 1)
crime2 <- crime2 %>%
  mutate(housmedprice = housmedprice/10000)
crime2$housmedprice <- set_label(crime2$housmedprice, label = "Median sale price of house ($10,000)")

Simple Regression Analysis

Let’s start with some simple regression analyses. In the first model, we estimate the effect of the median sales price of houses (housmedprice) on the rate of sexual offences (sexoff). Run the following code.

model.2.1 <- lm(sexoff ~ housmedprice, data = crime2)
summary(model.2.1)

## 
## Call:
## lm(formula = sexoff ~ housmedprice, data = crime2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -208.5  -56.8  -17.1   45.9  314.4 
## 
## Coefficients:
##              Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)  276.6542    11.1605   24.79 < 0.0000000000000002 ***
## housmedprice  -0.6382     0.0929   -6.87        0.00000000033 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 88.6 on 118 degrees of freedom
## Multiple R-squared:  0.286,  Adjusted R-squared:  0.279 
## F-statistic: 47.2 on 1 and 118 DF,  p-value: 0.000000000326

In the second model, we estimate the effect of the median age of residents (medage) on the rate of sexual offences. Run the following code.

model.2.2 <- lm(sexoff ~ medage, data = crime2)
summary(model.2.2)

## 
## Call:
## lm(formula = sexoff ~ medage, data = crime2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -182.4  -89.6  -11.3   54.6  356.8 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   156.81      75.27    2.08    0.039 *
## medage          1.58       1.76    0.90    0.371  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 105 on 118 degrees of freedom
## Multiple R-squared:  0.00679,    Adjusted R-squared:  -0.00163 
## F-statistic: 0.807 on 1 and 118 DF,  p-value: 0.371

In the third model, we estimate the effect of the unemployment rate (unemploy) on the rate of sexual offences. Run the following code.

model.2.3 <- lm(sexoff ~ unemploy, data = crime2)
summary(model.2.3)

## 
## Call:
## lm(formula = sexoff ~ unemploy, data = crime2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -210.74  -63.10   -9.37   46.23  269.19 
## 
## Coefficients:
##             Estimate Std. Error t value  Pr(>|t|)    
## (Intercept)    74.44      32.85    2.27     0.025 *  
## unemploy       24.69       5.23    4.72 0.0000065 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 96.2 on 118 degrees of freedom
## Multiple R-squared:  0.159,  Adjusted R-squared:  0.152 
## F-statistic: 22.3 on 1 and 118 DF,  p-value: 0.00000654

Multiple Regression Analysis

Running multiple regression models is quite simple. We use lm() again but add more independent variables. Thus, model name <- lm(dependent ~ independent 1 + independent 2 + independent 3 + …, data = data name) will conduct a multiple regression analysis. The following code simmultaneously estimates the effect of the median sales price of houses and the median age of residents on the rate of sexual offences. For an interpretation of this result, see the week 13 lecture sldies.

model.2.4 <- lm(sexoff ~ housmedprice + medage, data = crime2)
summary(model.2.4)

## 
## Call:
## lm(formula = sexoff ~ housmedprice + medage, data = crime2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -196.7  -50.8  -16.9   44.0  315.1 
## 
## Coefficients:
##              Estimate Std. Error t value      Pr(>|t|)    
## (Intercept)  343.1565    69.4031    4.94 0.00000257768 ***
## housmedprice  -0.6655     0.0971   -6.85 0.00000000036 ***
## medage        -1.5153     1.5608   -0.97          0.33    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 88.7 on 117 degrees of freedom
## Multiple R-squared:  0.291,  Adjusted R-squared:  0.279 
## F-statistic:   24 on 2 and 117 DF,  p-value: 0.00000000179

In the next model, let’s try to estimate the effect of the median sales price of houses and the unemploment rates on the rate of sexual offences simultaneously. Run the following code.

model.2.5 <- lm(sexoff ~ housmedprice + unemploy, data = crime2)
summary(model.2.5)

## 
## Call:
## lm(formula = sexoff ~ housmedprice + unemploy, data = crime2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -187.3  -48.4  -17.3   41.0  250.2 
## 
## Coefficients:
##              Estimate Std. Error t value    Pr(>|t|)    
## (Intercept)  172.6219    33.7375    5.12 0.000001232 ***
## housmedprice  -0.5423     0.0941   -5.76 0.000000068 ***
## unemploy      15.8759     4.8810    3.25      0.0015 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 85.3 on 117 degrees of freedom
## Multiple R-squared:  0.345,  Adjusted R-squared:  0.334 
## F-statistic: 30.8 on 2 and 117 DF,  p-value: 0.0000000000181

In the final model, let’s try to estimate the effect of all the three independent variables on the rate of sexual offences simultaneously. Run the following code. As you can see in the code, all the three independent variables are added using the plus(+) sign.

model.2.6 <- lm(sexoff ~ housmedprice + medage + unemploy, data = crime2)
summary(model.2.6)

## 
## Call:
## lm(formula = sexoff ~ housmedprice + medage + unemploy, data = crime2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -180.4  -46.7  -13.6   39.8  249.4 
## 
## Coefficients:
##              Estimate Std. Error t value   Pr(>|t|)    
## (Intercept)  216.7383    78.0375    2.78     0.0064 ** 
## housmedprice  -0.5616     0.0993   -5.66 0.00000011 ***
## medage        -0.9506     1.5153   -0.63     0.5317    
## unemploy      15.5098     4.9284    3.15     0.0021 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 85.5 on 116 degrees of freedom
## Multiple R-squared:  0.347,  Adjusted R-squared:  0.33 
## F-statistic: 20.5 on 3 and 116 DF,  p-value: 0.0000000000952

Creating a Regression Table

So far, We have run the six regression models that examine the effect of community characteristics on the rate of sexual offences. However, it is not efficient to look at the results of each model separately. Instead, we can present all of the regression models in a just one single table. We will use the tab_model() function to summarise all the regression models so far. The following code will combine the six regression models we have run and create a single regression table.

tab_model(model.2.1, model.2.2, model.2.3, model.2.4, model.2.5, model.2.6, 
          dv.labels = c("Model 1", "Model 2", "Model 3", "Model 4", "Model 5", "Model 6"), 
          title = "Regression Coefficients on Sexual Offence Rates", digits = 3, 
          show.se = TRUE, show.ci = FALSE, collapse.se = TRUE, p.style = "stars")

Regression Coefficients on Sexual Offence Rates
	Model 1	Model 2	Model 3	Model 4	Model 5	Model 6
Predictors	Estimates	Estimates	Estimates	Estimates	Estimates	Estimates
(Intercept)	276.654 ^*** (11.161)	156.814 ^* (75.269)	74.438 ^* (32.853)	343.156 ^*** (69.403)	172.622 ^*** (33.737)	216.738 ^** (78.038)
Median sale price of house($10,000)	-0.638 ^*** (0.093)			-0.665 ^*** (0.097)	-0.542 ^*** (0.094)	-0.562 ^*** (0.099)
Median age of residents		1.582 (1.761)		-1.515 (1.561)		-0.951 (1.515)
Unemployment rate			24.686 ^*** (5.230)		15.876 ^** (4.881)	15.510 ^** (4.928)
Observations	120	120	120	120	120	120
R² / R² adjusted	0.286 / 0.279	0.007 / -0.002	0.159 / 0.152	0.291 / 0.279	0.345 / 0.334	0.347 / 0.330
p<0.05 p<0.01 * p<0.001

In the code, you need to list all the name of regression models to be included in a regression table. dv.labels = c() sets the labels of the models to be displayed in the first row of the regression table. title = “” sets the title of the table. digits = value sets the decimal places for the numbers in the table. show.se = TRUE shows the standard errors of the coefficients, show.ci = FALSE does not show the confidence intervals of the coefficients, and collapse.se = TRUE shows the standard errors together with the coefficients in the same cell. p.style = “stars” displays the p-value in star format.

Dummy Variable Analysis

Creating a Dummy Variable.

A dummy variable is a binary variable used in the regression analysis that represents subgroups (or categories) of the sample in your study. And it is used to estimate the effect of nominal variables. Let’s make a dummy variable of urban. The following code shows two categories of urban: rural and urban.

frq(crime2$urban)

## Urban or Rural (x) <categorical> 
## # total N=120 valid N=120 mean=0.59 sd=0.49
## 
## Value |    Label |  N | Raw % | Valid % | Cum. %
## ------------------------------------------------
##     0 | 0. Rural | 49 | 40.83 |   40.83 |  40.83
##     1 | 1. Urban | 71 | 59.17 |   59.17 | 100.00
##  <NA> |     <NA> |  0 |  0.00 |    <NA> |   <NA>

urban has two categories and the reference category, rural, already has a lower value assigned to it. Therefore, we can use urban as it is for a dummy variable. However, we need to change this dummy variable to a factor. Otherwise, R will not treat it as a dummy variable in the regression analysis.

crime2$urban <- to_factor(crime2$urban)

In some cases, you may want to use urban as a reference category instead of rural. In this case, you can easily change the reference category by using the ref_lvl(variable, lvl = value for the reference category) function. Make sure that the variable used for ref_lvl() is one that has already been converted to a factor. The following code creates a new variable (rural) where 0 is urban, and 1 is rural.

crime2$rural <- ref_lvl(crime2$urban, lvl = 1)

Then, let’s make dummy variables for region. The following code shows two categories of region.

frq(crime2$region)

## Regions of LGA (x) <categorical> 
## # total N=120 valid N=120 mean=5.93 sd=3.92
## 
## Value |                          Label |  N | Raw % | Valid % | Cum. %
## ----------------------------------------------------------------------
##     1 | 1. Greater metropolitan Sydney | 30 | 25.00 |   25.00 |  25.00
##     2 |            2. Sydney surrounds |  4 |  3.33 |    3.33 |  28.33
##     3 |             3. Mid north coast |  6 |  5.00 |    5.00 |  33.33
##     4 |                      4. Murray |  9 |  7.50 |    7.50 |  40.83
##     5 |                5. Murrumbidgee |  9 |  7.50 |    7.50 |  48.33
##     6 |                      6. Hunter | 10 |  8.33 |    8.33 |  56.67
##     7 |                   7. Illawarra |  5 |  4.17 |    4.17 |  60.83
##     8 |              8. Richmond-Tweed |  6 |  5.00 |    5.00 |  65.83
##     9 |             9. Canberra region |  8 |  6.67 |    6.67 |  72.50
##    10 |                   10. Northern | 12 | 10.00 |   10.00 |  82.50
##    11 |               11. Central west | 13 | 10.83 |   10.83 |  93.33
##    12 |              12. North western |  7 |  5.83 |    5.83 |  99.17
##    13 |                   13. Far west |  1 |  0.83 |    0.83 | 100.00
##  <NA> |                           <NA> |  0 |  0.00 |    <NA> |   <NA>

As this result shows, region has 13 categories. We will recode it into a variable with three categories following the recoding scheme of <Table 1>.

Table 1: Recoding scheme for new region variable
Old variable(region)		New variable(region2)
Values	Labels	Values	Labels
1	Greater Metropolitan Sydney	1	Greater Metropolitan Sydney
2	Sydney Surrounds	2	Sydney Surrounds
3	Mid North Coast	3	Rural and Regional Areas
4	Murray
5	Murrumbidgee
6	Hunter
7	Illawarra
8	Richmond-Tweed
9	Canberra Region
10	Northern
11	Central West
12	North Western
13	Far west

The following code will create this new three-region cateogy variable.

crime2 <- rec(crime2, region, rec = "1=1; 2=2; 3:13=3")

Then, we rename the newly-recoded variable(region_r) to region2 and assign an appropriate variable name and value labels.

crime2 <- var_rename(crime2, region_r = region2)
crime2$region2 <- set_label(crime2$region2, label = "Three regions")
crime2$region2 <- set_labels(crime2$region2, labels = c("Greater Metropolitan Sydney" = 1,
                                                "Sydney Surrounds" = 2,
                                                "Rural and Regional Areas" = 3))

Finally, we need to hange this dummy variable to a factor.

crime2$region2 <- to_factor(crime2$region2)

Dummy Variable Regression Analysis

First, we estimate the effect of urban on the rate of sexual offences, which shows the difference in the sexual offences between urban and rural LGAs.

model.3.1 <- lm(sexoff ~ urban, data = crime2)
summary(model.3.1)

## 
## Call:
## lm(formula = sexoff ~ urban, data = crime2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -215.0  -75.3  -11.7   78.2  312.8 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)    270.6       13.9   19.47 < 0.0000000000000002 ***
## urban1         -79.0       18.1   -4.37             0.000027 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 97.3 on 118 degrees of freedom
## Multiple R-squared:  0.139,  Adjusted R-squared:  0.132 
## F-statistic: 19.1 on 1 and 118 DF,  p-value: 0.0000268

In the output, ‘urban1’ indicates ‘being a urban LGA (when the value of urban is 1). Thus, the coefficient of ‘urban1’ means that the sexual offence rate of urban LGAs is lower than that of rural LGAs by 79.0. This difference is statistically significant (the p-value is less than .05).

Second, we estimate the effect of region on the rate of sexual offences, which shows the difference in the sexual offences across the three regions.

model.3.2 <- lm(sexoff ~ region2, data = crime2)
summary(model.3.2)

## 
## Call:
## lm(formula = sexoff ~ region2, data = crime2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -204.0  -49.3   -8.9   32.0  323.8 
## 
## Coefficients:
##             Estimate Std. Error t value        Pr(>|t|)    
## (Intercept)    127.2       16.0    7.94 0.0000000000013 ***
## region22        52.8       46.7    1.13            0.26    
## region23       132.4       18.6    7.12 0.0000000000925 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 87.7 on 117 degrees of freedom
## Multiple R-squared:  0.307,  Adjusted R-squared:  0.295 
## F-statistic: 25.9 on 2 and 117 DF,  p-value: 0.000000000491

The output shows two coefficients of region2. In region2, 2 means “Sydney Surrounds”, and 3 means “Rural or Regional Areas”. Thus, the coefficient of ‘region22’ is for “Sydney Surrounds” and that of ‘region23’ is for Rural or Regional Areas”. In this dummy variable, the reference category is “Greater Metropolitan Sydney”. As a result, the interpretation is that the sexual offence rate of LGAs in Sydney Surrounds is higher than that of LGAs in Greater Metropolitan Sydney by 52.8. However, this difference is NOT statistically significant (the p-value is greater than .05). It also indicate that the sexual offence rate of LGAs in Rural and Regional Areas is higher than that of LGAs in Greater Metropolitan Sydney by 132.4. And this difference is statistically significant (the p-value is less than .05).

Third, we estimate the effect of region on the rate of sexual offences after controlling for the median sales price of houses, the median age of residents, and the unemployment rate.

model.3.3 <- lm(sexoff ~ urban + housmedprice + medage + unemploy, data = crime2)
summary(model.3.3)

## 
## Call:
## lm(formula = sexoff ~ urban + housmedprice + medage + unemploy, 
##     data = crime2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -188.87  -53.53   -2.45   42.31  212.38 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   273.411     77.288    3.54  0.00058 ***
## urban1        -61.209     19.368   -3.16  0.00201 ** 
## housmedprice   -0.389      0.110   -3.53  0.00060 ***
## medage         -2.385      1.529   -1.56  0.12144    
## unemploy       19.817      4.940    4.01  0.00011 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 82.3 on 115 degrees of freedom
## Multiple R-squared:  0.399,  Adjusted R-squared:  0.378 
## F-statistic: 19.1 on 4 and 115 DF,  p-value: 0.00000000000454

Fourth, we estimate the effect of region on the rate of sexual offences fter controlling for the median sales price of houses, the median age of residents, and the unemployment rate.

model.3.4 <- lm(sexoff ~ region2 + housmedprice + medage + unemploy, data = crime2)
summary(model.3.4)

## 
## Call:
## lm(formula = sexoff ~ region2 + housmedprice + medage + unemploy, 
##     data = crime2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -153.1  -47.7  -13.7   33.7  240.6 
## 
## Coefficients:
##              Estimate Std. Error t value  Pr(>|t|)    
## (Intercept)  207.3124    72.8374    2.85   0.00525 ** 
## region22      73.0265    46.4907    1.57   0.11901    
## region23     141.0931    30.3602    4.65 0.0000091 ***
## housmedprice  -0.0716     0.1407   -0.51   0.61176    
## medage        -4.4925     1.5875   -2.83   0.00550 ** 
## unemploy      18.0721     4.6425    3.89   0.00017 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 78.9 on 114 degrees of freedom
## Multiple R-squared:  0.454,  Adjusted R-squared:  0.43 
## F-statistic: 18.9 on 5 and 114 DF,  p-value: 0.000000000000116

Finally, we estimate the effect of urban and region on the rate of sexual offences fter controlling for the median sales price of houses, the median age of residents, and the unemployment rate.

model.3.5 <- lm(sexoff ~ urban + region2 + housmedprice + medage + unemploy, data = crime2)
summary(model.3.5)

## 
## Call:
## lm(formula = sexoff ~ urban + region2 + housmedprice + medage + 
##     unemploy, data = crime2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -160.62  -50.81    0.96   31.28  236.06 
## 
## Coefficients:
##              Estimate Std. Error t value  Pr(>|t|)    
## (Intercept)  254.7107    71.5715    3.56   0.00055 ***
## urban1       -58.8171    18.3154   -3.21   0.00172 ** 
## region22     100.3419    45.5029    2.21   0.02947 *  
## region23     138.7570    29.2004    4.75 0.0000060 ***
## housmedprice   0.0911     0.1445    0.63   0.52952    
## medage        -5.7397     1.5750   -3.64   0.00041 ***
## unemploy      22.6293     4.6839    4.83 0.0000043 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 75.8 on 113 degrees of freedom
## Multiple R-squared:  0.499,  Adjusted R-squared:  0.473 
## F-statistic: 18.8 on 6 and 113 DF,  p-value: 0.00000000000000463

Creating a Regression Table

Let’s make a regression table that shows all the results of dummy variable analysis.

tab_model(model.3.1, model.3.2, model.3.3, model.3.4, model.3.5,
          pred.labels = c("Intercept", "Urban (ref. = Rural)", 
                          "Sydney surrounds (ref. = Greater Sydney)",
                          "Rural or regional (ref. = Greater Sydney)",
                          "Median sale price of houses",
                          "Median age of residents",
                          "Unemployment rates"),
          dv.labels = c("Model 1", "Model 2", "Model 3", "Model 4", "Model 5"), 
          title = "Regional Differences in Sexual Offence Rates", digits = 3, 
          show.se = TRUE, show.ci = FALSE, collapse.se = TRUE, p.style = "stars")

Regional Differences in Sexual Offence Rates
	Model 1	Model 2	Model 3	Model 4	Model 5
Predictors	Estimates	Estimates	Estimates	Estimates	Estimates
Intercept	270.596 ^*** (13.899)	127.190 ^*** (16.010)	273.411 ^*** (77.288)	207.312 ^** (72.837)	254.711 ^*** (71.571)
Urban (ref. = Rural)	-78.972 ^*** (18.069)		-61.209 ^** (19.368)		-58.817 ^** (18.315)
Sydney surrounds (ref. = Greater Sydney)		52.835 (46.676)		73.026 (46.491)	100.342 ^* (45.503)
Rural or regional (ref. = Greater Sydney)		132.446 ^*** (18.594)		141.093 ^*** (30.360)	138.757 ^*** (29.200)
Median sale price of houses			-0.389 ^*** (0.110)	-0.072 (0.141)	0.091 (0.144)
Median age of residents			-2.385 (1.529)	-4.492 ^** (1.587)	-5.740 ^*** (1.575)
Unemployment rates			19.817 ^*** (4.940)	18.072 ^*** (4.643)	22.629 ^*** (4.684)
Observations	120	120	120	120	120
R² / R² adjusted	0.139 / 0.132	0.307 / 0.295	0.399 / 0.378	0.454 / 0.430	0.499 / 0.473
p<0.05 p<0.01 * p<0.001

We use codes similar to what we used for the multiple regression model. But we changed variable names using pred.labels. pred.labels = c() sets the labels of independent variables displayed in the first column of the table.

Lab 11 Participation Activity

No Lab Participation Activity this week. Completing R Analysis Task 4 will contribute to your participation mark.

The R codes you have written so far look like:

################################################################################
# Lab 11: Multiple Regression
# 29/5/2024
# SOCI8015 & SOCX8015
################################################################################
# Load packages
library(dplyr)
library(sjlabelled)
library(sjmisc)
library(summarytools)
library(sjPlot)

# Avoid scientific notation
options(digits=5, scipen=15) 

# Load datasets
crime <- readRDS("nsw-lga-crime-2023.RDS")

# Creating a Dataset of Complete Cases
crime2 <- crime %>%
  select(code, lga, medage, unemploy, housmedprice, sexoff, region, urban)

crime2 <- crime2 %>%
  mutate(mark = 0)

crime2 <- crime2 %>%
  mutate(mark = replace(mark, complete.cases(crime2), 1))

crime2 <- crime2 %>%
  filter(mark == 1)

# Changing the Scale of Variables
crime2 <- crime2 %>%
  mutate(housmedprice = housmedprice/10000)
crime2$housmedprice <- set_label(crime2$housmedprice, label = "Median sale price of house ($10,000)")

# Simple regression: Model 1
model.2.1 <- lm(sexoff ~ housmedprice, data = crime2)
summary(model.2.1)

# Simple regression: Model 2
model.2.2 <- lm(sexoff ~ medage, data = crime2)
summary(model.2.2)

# Simple regression:Model 3
model.2.3 <- lm(sexoff ~ unemploy, data = crime2)
summary(model.2.3)

# Multiple regression: Model 4
model.2.4 <- lm(sexoff ~ housmedprice + medage, data = crime2)
summary(model.2.4)

# Multiple regression: Model 5
model.2.5 <- lm(sexoff ~ housmedprice + unemploy, data = crime2)
summary(model.2.5)

# Multiple regression: Model 6
model.2.6 <- lm(sexoff ~ housmedprice + medage + unemploy, data = crime2)
summary(model.2.6)

# Creating a regression table
tab_model(model.2.1, model.2.2, model.2.3, model.2.4, model.2.5, model.2.6, 
          dv.labels = c("Model 1", "Model 2", "Model 3", "Model 4", "Model 5", "Model 6"), 
          title = "Regression Coefficients on Sexual Offence Rates", digits = 3, 
          show.se = TRUE, show.ci = FALSE, collapse.se = TRUE, p.style = "stars")

# Check the variable of urban
frq(crime2$urban)

# convert into a factor for dummy
crime2$urban <- to_factor(crime2$urban)

# make a dummy variable for rural in which urban are the reference group
crime2$rural <- ref_lvl(crime2$urban, lvl = 1)

# Creating a dummy variable for region
frq(crime2$region)
crime2 <- rec(crime2, region, rec = "1=1; 2=2; 3:13=3")

# rename variable
crime2 <- var_rename(crime2, region_r = region2)

# Assign variable name and value labels
crime2$region2 <- set_label(crime2$region2, label = "Three regions")
crime2$region2 <- set_labels(crime2$region2, labels = c("Greater Metropolitan Sydney" = 1,
                                                "Sydney Surrounds" = 2,
                                                "Rural and Regional Areas" = 3))
# convert into a factor for dummy
crime2$region2 <- to_factor(crime2$region2)


# Dummy Variable Analysis
model.3.1 <- lm(sexoff ~ urban, data = crime2)
summary(model.3.1)

model.3.1.2 <- lm(sexoff ~ rural, data = crime2)
summary(model.3.1.2)

model.3.3 <- lm(sexoff ~ urban + housmedprice + medage + unemploy, data = crime2)
summary(model.3.3)

model.3.2 <- lm(sexoff ~ region2, data = crime2)
summary(model.3.2)

model.3.4 <- lm(sexoff ~ region2 + housmedprice + medage + unemploy, data = crime2)
summary(model.3.4)

model.3.5 <- lm(sexoff ~ urban + region2 + housmedprice + medage + unemploy, data = crime2)
summary(model.3.5)

# Creating regression table
tab_model(model.3.1, model.3.2, model.3.3, model.3.4, model.3.5,
          pred.labels = c("Intercept", "Urban (ref. = Rural)", 
                          "Sydney surrounds (ref. = Greater Sydney)",
                          "Rural or regional (ref. = Greater Sydney)",
                          "Median sale price of houses",
                          "Median age of residents",
                          "Unemployment rates"),
          dv.labels = c("Model 1", "Model 2", "Model 3", "Model 4", "Model 5"), 
          title = "Regional Differences in Sexual Offence Rates", digits = 3, 
          show.se = TRUE, show.ci = FALSE, collapse.se = TRUE, p.style = "stars")

Last updated on 26 May, 2024 by Dr Hang Young Lee(hangyoung.lee@mq.edu.au)