# Week 5: Bivariate Analysis, Scales & Indices, and Dimension Reduction

Learning Objectives

By the end of this class, students should be able to (1) define, (2) know when to use, (3) interpret R output for, and (4) - with the assistance of methods101.com and Google - run the R commands for the following types of statistical analysis:

• comparison of means
• chi square test
• correlation coefficient
• visualisation of a matrix of correlation coefficients

# 0. How do I get my computer set up for today’s class?

# Install Packages
if(!require(dplyr)) {install.packages("dplyr", repos='https://cran.csiro.au/', dependencies=TRUE)}
if(!require(sjlabelled)) {install.packages("sjlabelled", repos='https://cran.csiro.au/', dependencies=TRUE)}
if(!require(sjmisc)) {install.packages("sjmisc", repos='https://cran.csiro.au/', dependencies=TRUE)}
if(!require(sjstats)) {install.packages("sjstats", repos='https://cran.csiro.au/', dependencies=TRUE)}
if(!require(sjPlot)) {install.packages("sjPlot", repos='https://cran.csiro.au/', dependencies=TRUE)}
if(!require(summarytools)) {install.packages("summarytools", repos='https://cran.csiro.au/', dependencies=TRUE)}
if(!require(ggplot2)) {install.packages("ggplot2", repos='https://cran.csiro.au/', dependencies= TRUE)}
if(!require(ggthemes)) {install.packages("ggthemes", repos='https://cran.csiro.au/', dependencies= TRUE)}

# Load packages into memory
library(dplyr)
library(sjlabelled)
library(sjmisc)
library(sjstats)
library(sjPlot)
library(summarytools)
library(ggplot2)
library(ggthemes)

# Turn off scientific notation
options(digits=5, scipen=15)

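# Stop View from overloading memory with large datasets
RStudioView <- View
View <- function(x) {
  # Show at most the first 500 rows of a data frame (head() does not pad short data frames with NA rows)
  if (inherits(x, "data.frame")) { RStudioView(head(x, 500)) } else { RStudioView(x) }
}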

elect_2013 <- read.csv(url("https://methods101.com/data/elect_2013.csv"))

# 2. How do I compare the mean of two variables?

We will learn about the basic ways to compare the difference in means of two groups - one where the groups are independent (e.g. height of men and women), and one where they are paired (e.g. height of the same people at age 15 and age 15.5 years).

## 2.1 How do I compare two different groups? Independent samples t-test

Let’s say we want to compare the political knowledge of men and women in our dataset. We want to ask if the mean for men and the mean for women are different. The command to test this is ‘t.test’.

Below is the command to run t.test. Below that is the output in the R console.

t.test(elect_2013$pol_knowledge ~ elect_2013$female)
##
##  Welch Two Sample t-test
##
## data:  elect_2013$pol_knowledge by elect_2013$female
## t = 11.6, df = 3839, p-value <0.0000000000000002
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.88917 1.24939
## sample estimates:
## mean in group 0 mean in group 1
##          5.3283          4.2590

Now I know this looks like a mess, but it is actually not too difficult to understand.

Remember the first rule of reading statistical output: look at the p-value.

If the p-value is above 0.05, then there is normally no need to interpret anything else because the test is not significant.

WARNING: As mentioned, there is a complication to this instruction, because some statistical commands give multiple p-values - one for each variable, and one for the model overall. This instruction is about evaluating the p-value for each individual variable.

So if we follow this rule of looking at the p-value first, what does it say?

The p-value in this case is “p-value < 2.2e-16” (if you haven’t turned off scientific notation) or “p-value <0.0000000000000002”. What does that mean? Is that less than 0.05? Yes! So the test tells us the difference in means is highly statistically significant. If there were really no difference between men and women in the population, there would be only a very low probability of getting a difference this large by random chance.
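If you want R to do this check for you, you can save the output of t.test and pull the p-value out of it. The sketch below is only an illustration: it uses a small simulated data frame called sim (made-up numbers, not the real elect_2013 data) with the same two column names, so the values it prints will differ from the output above.

```r
# Simulated stand-in for elect_2013 (hypothetical values, NOT the real survey data)
set.seed(101)
sim <- data.frame(
  female = rep(c(0, 1), each = 500),
  pol_knowledge = c(rnorm(500, mean = 5.3, sd = 2.5),
                    rnorm(500, mean = 4.3, sd = 2.5))
)

# Store the test result instead of only printing it
result <- t.test(pol_knowledge ~ female, data = sim)

# The p-value is stored as a plain number inside the result
result$p.value

# TRUE when the test is significant at the usual 0.05 level
result$p.value < 0.05
```

Writing the command as `pol_knowledge ~ female, data = sim` is just a tidier way of writing `sim$pol_knowledge ~ sim$female`; both run the same test.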

So what is the next step for interpreting this output?

Let’s look at the last three lines. They say:

## sample estimates:
## mean in group 0 mean in group 1
##    5.328302        4.259021 

This is telling us that the mean of the group with value “0” is 5.33, and the mean for the group with value “1” is 4.26. But what is group 0 and 1? Well, we need to look at our data. The means are measured in “pol_knowledge” units, and the variable for gender is 1 = female, and 0 = male. So this tells us that the mean political knowledge for men in our sample is 5.3, and for women is 4.3.
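One way to check what the groups are, and to confirm the means yourself, is to count the cases in each group and calculate the group means directly. This is a minimal sketch using simulated data (the data frame sim and its values are made up for illustration; only the column names match elect_2013).

```r
# Simulated stand-in for elect_2013 (hypothetical values, NOT the real survey data)
set.seed(102)
sim <- data.frame(
  female = rep(c(0, 1), times = c(150, 170)),
  pol_knowledge = round(runif(320, min = 0, max = 10))
)

# How many cases are in each group? (in this sketch 0 = male, 1 = female)
table(sim$female)

# Mean of pol_knowledge within each group
group_means <- tapply(sim$pol_knowledge, sim$female, mean, na.rm = TRUE)
group_means

# These are the same numbers t.test reports under "sample estimates"
test_means <- t.test(pol_knowledge ~ female, data = sim)$estimate
test_means
```

With the real data you would replace sim with elect_2013 and check the coding of female against the survey codebook.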

We could stop interpreting our data here, but there is another useful part of the output to interpret. Look at these two lines:

## 95 percent confidence interval:
## 0.8891745 1.2493867

This tells us that the ‘difference of means’ between men and women has a 95% confidence interval of 0.89 to 1.25. In other words, we can be 95% confident that the TRUE difference between men and women - the population parameter - lies between 0.89 and 1.25.
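It can help to see where this interval sits relative to the two means. In the output above, 5.3283 - 4.2590 = 1.0693, which is the midpoint of 0.889 and 1.249. The sketch below shows the same relationship on simulated data (sim is a made-up data frame, not the real survey), and how to pull the interval out of a saved t.test result.

```r
# Simulated stand-in for elect_2013 (hypothetical values, NOT the real survey data)
set.seed(103)
sim <- data.frame(
  female = rep(c(0, 1), each = 250),
  pol_knowledge = c(rnorm(250, mean = 5.3, sd = 2.5),
                    rnorm(250, mean = 4.3, sd = 2.5))
)

result <- t.test(pol_knowledge ~ female, data = sim)

# Observed difference: mean of group 0 minus mean of group 1
observed_diff <- unname(result$estimate[1] - result$estimate[2])
observed_diff

# Lower and upper bounds of the 95% confidence interval for that difference
ci <- as.numeric(result$conf.int)
ci

# The observed difference is the midpoint of the interval
all.equal(observed_diff, mean(ci))
```

If you need a different level of confidence, t.test accepts a conf.level argument, for example `conf.level = 0.99`.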

## 2.2 How do I compare two sets of data on the same set of cases? Paired (or dependent) samples t-test

The second type of comparison of means we are going to run is the paired test. In a paired test, the two variables are measured on the same units of analysis.

The reason we need a different test for this is that when the same unit of analysis is used for the two variables, the two variables are dependent on each other - they are not independent samples - and so the statistical test changes to account for this.
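One way to see how the test changes: a paired t-test gives the same result as calculating the difference for each case and then testing whether the mean of those differences is zero. The sketch below shows this with simulated scores (tv and newspaper here are made-up vectors, not columns from elect_2013).

```r
# Simulated paired scores for 100 people (hypothetical values, NOT the real survey data)
set.seed(104)
n <- 100
baseline  <- rnorm(n, mean = 2.2, sd = 0.8)          # each person's general interest
tv        <- baseline + rnorm(n, mean = 0, sd = 0.3)
newspaper <- baseline + 0.4 + rnorm(n, mean = 0, sd = 0.3)

# Paired t-test on the two sets of scores
paired_result <- t.test(tv, newspaper, paired = TRUE)

# One-sample t-test on the per-person differences
diff_result <- t.test(tv - newspaper, mu = 0)

# Same t statistic and same p-value from both approaches
paired_result$statistic
diff_result$statistic
paired_result$p.value
diff_result$p.value

# Ignoring the pairing (independent samples) gives a different answer
t.test(tv, newspaper)$statistic
```

Because each person acts as their own comparison, the paired test removes the variation between people, which is why its t statistic differs from the independent samples version.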

In the next example, we are going to compare participants’ average score for ‘following the election on TV’ vs ‘following the election in the newspaper’.

t.test(elect_2013$election_tv, elect_2013$election_newspaper,
paired = TRUE)
##
##  Paired t-test
##
## data:  elect_2013$election_tv and elect_2013$election_newspaper
## t = -26.6, df = 3883, p-value <0.0000000000000002
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.46002 -0.39683
## sample estimates:
## mean of the differences
##                -0.42842

This can be read in the same way as the previous t-test except that in this case the last line reports the difference in means, not the two means.

The mean of the differences is negative (TV minus newspaper), so intuitively we know that this means people followed the election more in the newspaper than on TV. We can double check this by running mean() on each variable:

mean(elect_2013$election_tv, na.rm = TRUE)
## [1] 2.008
mean(elect_2013$election_newspaper, na.rm = TRUE)
## [1] 2.4337

And you can see that what we thought is true: people have an average score of 2.01 for election_tv, and 2.43 for election_newspaper.
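You may notice that 2.008 - 2.4337 = -0.4257, which is close to, but not exactly, the -0.42842 reported by the paired t-test. A likely explanation (an assumption, since we have not checked the missing values in elect_2013 here) is that the paired test only uses people who answered both questions, while each mean() uses everyone who answered that one question. The toy example below, with made-up scores, shows how the two calculations can differ.

```r
# Toy scores with some missing answers (hypothetical values, NOT the real survey data)
tv        <- c(1, 2, 2, 3, NA, 2, 1, 4)
newspaper <- c(2, 3, 2, 4, 3, NA, 2, 4)

# Difference of the two separate means: each mean uses every non-missing value
diff_of_means <- mean(tv, na.rm = TRUE) - mean(newspaper, na.rm = TRUE)
diff_of_means

# Mean of the differences: only rows where both scores are present
complete <- !is.na(tv) & !is.na(newspaper)
mean_of_diffs <- mean(tv[complete] - newspaper[complete])
mean_of_diffs

# The paired t-test reports the second quantity
paired_estimate <- unname(t.test(tv, newspaper, paired = TRUE)$estimate)
paired_estimate
```

When no answers are missing, the two calculations give exactly the same number.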

Last updated on 26 August, 2019 by Dr Nicholas Harrigan (nicholas.harrigan@mq.edu.au)