SOCI832: Introduction to Statistics and Univariate Analysis

Reading

Field, A., Miles, J., and Field, Z. (2012). Discovering statistics using R. Sage publications.

  • Chapter 2: Everything you ever wanted to know about statistics (well, sort of)

Concepts

Descriptive statistics
Mean/Median/Mode (Central tendency)
Standard deviation/Interquartile range (Variation/dispersion)
Minimum/Maximum
Percentage
N (number of non-missing cases)
Inferential statistics
Population
Population parameter
Sample
Sample statistic
Hypothesis testing
Null hypothesis
Sampling distribution
Central limit theorem
Standard error
Confidence interval
p-value
Effect size
Bivariate inferential statistics
Correlation, comparison of means, chi-squared
Multivariate inferential statistics
Linear and logistic regression
Dimension reduction and finding categories
Factor analysis, Cluster analysis


Summary

Most quantitative social science – i.e. social science that uses numbers to measure and characterise the world – uses inferential statistics to model data and test theories and hypotheses. At the core of inferential statistics is the idea that there is a characteristic (such as the mean number of beers drunk per week) which we are interested in measuring in a population. This characteristic of the population is called a parameter. However, the population is usually too large to measure directly (e.g. the population of a country), so instead we need to ‘infer’ (hence ‘inferential’ statistics) this parameter. We infer the parameter by taking a random sample from the population. Random in this case means true randomness – all members of the population have an equal chance of being in the sample. When we measure this characteristic (i.e. the mean number of beers drunk per week) on the sample, we get a statistic. This sample statistic is our estimate of the population parameter.

Because a sample statistic is based on a random sample, it is not a perfect reflection of the population parameter. However, because the sample is random, the sample statistic has a mathematical and probabilistic relationship with the population parameter. This probabilistic relationship is expressed in statistical modelling with several different measures, such as confidence intervals, significance tests, and p-values.

Most sample statistics in inferential statistics are expressed in terms of two numbers: (1) a coefficient; and (2) a standard error. The coefficient represents the estimate of the population parameter (e.g. mean number of beers drunk per week in the population). The standard error is a measure of the uncertainty of this estimate. The standard error is a function of the variability of the sample and the sample size (for a mean, it is the standard deviation divided by the square root of the sample size). With most sample statistics, the major claim that we make from a model is that “There is a 95% chance that the population parameter is within +/- 1.96 standard errors of the coefficient.” The number 1.96 comes from the fact that 95% of cases in a normal distribution lie within plus or minus 1.96 standard deviations of the mean of the distribution. For most practical situations, we can round 1.96 to 2, and thus say that 95% of the time the population parameter is in the range of (sample coefficient +/- 2 x standard error).

So, for example, we may take a sample of 1000 Australians and find that the mean number of beers drunk per week is 3, and the standard error of this mean is 1.4. Thus, from this we can say that there is a 95% chance that the population parameter lies between 0.2 and 5.8. Or more simply, the 95% confidence interval is (0.2 – 5.8).
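As a rough sketch in R (using the made-up beer numbers above, not real survey data), the interval can be computed directly from the coefficient and its standard error:

```r
# 95% confidence interval from a coefficient (sample mean) and its standard error.
# The numbers are the illustrative values from the text, not real data.
coef <- 3     # mean beers drunk per week in the sample
se   <- 1.4   # standard error of that mean

coef + c(-1, 1) * 1.96 * se   # about 0.26 to 5.74 (0.2 to 5.8 if you round 1.96 to 2)
```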

Another way of expressing the uncertainty (or certainty) of a coefficient’s estimate of the population parameter is a significance test and p-value. In this case, we ask “What is the percentage chance of getting this sample statistic (coefficient) if the true population parameter is zero?” We take the sample coefficient and the sample standard error, and ask whether the coefficient is more than 1.96 (our magic number) standard errors away from zero. If so, then we know that the p-value - the probability of observing a coefficient this far from zero if the population parameter really were zero - is less than 5%.
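The same logic as a small R sketch, again using the hypothetical beer numbers: how many standard errors is the coefficient away from zero, and what is the (two-tailed) probability of a value that far out if the true parameter were zero?

```r
coef <- 3
se   <- 1.4

z <- (coef - 0) / se      # distance from zero, in standard errors
p <- 2 * pnorm(-abs(z))   # two-tailed p-value from the normal distribution
c(z = z, p = p)           # z is about 2.14, p about 0.03, so p < .05 here
```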

When describing and modelling data, we tend to think of three main classes of statistics: univariate, bivariate, and multivariate statistics.

Univariate statistics summarise single variables. The main univariate statistics include mean, median, standard deviation, frequency, minimum, maximum, quartiles and quintiles. We can use graphical representations such as histograms to illustrate univariate statistics.

Bivariate statistics express the relationship between two variables. Two of the most important bivariate statistical measures are correlation coefficients (such as Pearson’s correlation coefficient), and comparisons of means. We often also use crosstabulations (crosstabs) of two variables to illustrate such data. We can represent bivariate statistics with a wide range of graphical representations, the most common being the scatterplot.

Multivariate statistics model the relationship between three or more variables. Probably the canonical example of a multivariate statistic is the linear regression model, which models an outcome variable (y) as a linear combination (weighted sum) of two or more predictor variables (e.g. university grade = hours of studying + ability to focus).
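As a minimal illustration in R (the variables and data below are simulated purely to mirror the grade example in the text, not taken from any real dataset):

```r
# Linear regression: grade modelled as a weighted sum of two predictors.
set.seed(1)
study_hours <- runif(200, 0, 30)             # made-up hours of studying
focus       <- rnorm(200, mean = 5, sd = 2)  # made-up 'ability to focus' score
grade       <- 40 + 1.2 * study_hours + 2.5 * focus + rnorm(200, sd = 8)

summary(lm(grade ~ study_hours + focus))     # coefficients, standard errors, p-values, R-squared
```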


1. Descriptive statistics: two numbers that are as good as a million

When we use maths for social sciences, we generally use it in one of two ways:

  • to generate descriptive statistics - which allow us to summarise a large amount of data in a few numbers - and/or
  • to generate inferential statistics - which allow us to draw conclusions about the wider world based on a tiny sample of that world.

In this section, we will learn about descriptive statistics.

The main purpose of descriptive statistics is to summarise a group.

In descriptive statistics, we can actually see, touch, and/or count the group. We use descriptive statistics to describe entire populations we have measured - such as in a census - or to describe a sample (as the sample, not as a representative of the population).

This summary helps us turn an unwieldy set of numbers into one or two: a measure of central tendency, and some indication of variability.

One example of using descriptive statistics is calculating the median weekly personal income of people aged 15 years and over (which is $662) across the 23 million Australians who completed the 2016 Census. We get one number - $662/week - to represent 23,000,000 numbers.

1.1 Mean/Median/Mode: The centre

As mentioned, the most common types of descriptive statistics try to summarise the “centre” of a dataset (or more correctly, the centre of the values of a variable). We call this ‘the central tendency’ of the data. The most common example of this is the average - formally called the mean - of a set of numbers.

Other measures of the central tendency are the median, and the mode. Box 1 provides definitions, uses, and examples.

All of these measures of the central tendency have a common characteristic: they are trying to find one number that can represent them all. It’s like running a talent contest to find the ‘most ordinary, normal, typical Australian’, except in this case you are trying to find the ‘most ordinary, normal, typical number in this set’.

Box 1: Definitions, uses, and examples of mean, median, and mode

Mean (the arithmetic mean): The average. The sum of all cases divided by the number of cases.

The mean is good for summarising the centre of a set of numbers that are normally distributed (we will learn about this later). For example, height is normally distributed, so the mean is a good summary measure for a group of people’s heights.

Median: The middle point in a set of numbers. If we order a set of numbers from highest to lowest, and find the number in the middle, this is the median. If two numbers are in the middle (there is an even number of cases), then the median is the average of the two numbers.

The median is good for summarising the centre of a set of numbers that is skewed (i.e. there are a large number of small values, or a large number of large values). For example, income is highly skewed - i.e. unequally distributed, with a large number of poor people and a few people who are very, very rich - and the median is a better measure of its centre. In Australia the mean income is around $80,000/year, while the median is closer to $50,000/year.

Mode: The most common number in a set.

The mode is good for summarising the centre of a set of numbers that are not ordered (i.e. categorical/nominal). For example, if we have a list of 40,000 students at Macquarie Uni, and a variable which is the name of their degree, it doesn’t make sense to talk about the mean or median ‘degree’. However, the ‘mode’ makes sense. In everyday language, we would say the most common degree at Macquarie University is an Arts degree. Another way to say this is that the ‘mode’ degree is an Arts degree.
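A small R sketch of the three measures in Box 1 (the numbers and degree names are made up; base R has no built-in function that returns the statistical mode, so it is found by tabulating):

```r
incomes <- c(30, 35, 40, 45, 50, 60, 250)   # made-up incomes ($000s), skewed by one high value
mean(incomes)     # about 72.9 - pulled upwards by the 250
median(incomes)   # 45 - the middle value, a better 'typical' number here

degrees <- c("Arts", "Arts", "Science", "Business", "Arts", "Law")
names(which.max(table(degrees)))   # "Arts" - the mode (most common category)
```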

1.2 Standard deviation/Interquartile range: Giving variation a number

We know from life, however, that the ‘most ordinary, normal, typical’ single example of anything is a gross simplification.

All sets have variety, and the second major type of descriptive statistic tries to capture this variety, and do it in just one number.

Standard deviation

The most common measure of variety - or technically ‘variability’, ‘variation’, or ‘dispersion’ - in a set of numbers is probably the standard deviation.

Box 2 provides an explanation of standard deviation - as an equation, and also in everyday language.

Intuitively - not strictly - you can think of standard deviation as the average amount - as an absolute value (so negatives become positive, and positives stay positive) - that each person’s value for a variable ‘misses’ (deviates from) the mean of the variable.

If the standard deviation is close to zero, then virtually everyone in the dataset has the same value for the variable.

If the standard deviation is large, then you expect that - on average - any particular person you pick from the dataset is going to be quite distant from (greater or lesser than) the mean.

Interquartile range

While standard deviation is a good measure of variability for normally distributed data, when we have skewed data (where there are a lot of numbers at one end of the range - like lots of low income people and a few wealthy people in an income distribution), a better measure of variation is the interquartile range, which simply means the range of the ‘middle 50%’ of a dataset. IQR = Q3 - Q1, where Q3 is the number on the boundary between the largest 25% and the next 25%, and Q1 is the number on the boundary between the smallest 25% and the next 25%.
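A short R sketch of the interquartile range, using made-up skewed incomes (quantile() gives Q1 and Q3, and IQR() gives Q3 minus Q1 directly):

```r
incomes <- c(20, 25, 30, 35, 40, 45, 55, 70, 90, 400)   # made-up, skewed ($000s)
quantile(incomes, probs = c(0.25, 0.75))   # Q1 and Q3
IQR(incomes)                               # Q3 - Q1: the spread of the middle 50%
sd(incomes)                                # much larger, dragged out by the one very high value
```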

Box 2: Standard Deviation: As an equation and in everyday language

\(\begin{aligned} \text{Standard Deviation of variable x} = &\sqrt{\frac{\sum_{i=1}^n (x_i - \bar x)^2}{N}}\\ \\ \text{Where:} \\ x_i = &\text{value of x for each individual i} \\ \bar x = &\text{ mean of x} \\ N,n = &\text{ number of observations} \\ \sum_{i=1}^n = &\text{sum for all observations from 1 to n} \\ \end{aligned}\)

While this equation looks scary to many people, I think the best way to break it down - conceptually, and by analogy, not strictly mathematically - is to think of it as having just three main parts:

  • Variation from mean: First, it basically asks how much each value of a variable (e.g. the age of each particular student in a class) varies from the mean for the whole set (e.g. the average age of the class). This is represented in the equation as: \[(x_i - \bar x)\]
  • Averaged over all individuals: Second, it takes the average (mean) of this across all observations in the dataset (e.g. all students in the class). This is represented by the sum (the funny E, a capital sigma) divided by N: \[\frac{\sum_{i=1}^n \text{variation from mean} }{N}\]
  • Punishing (counting) big variations more than small ones: Third, it actually squares the deviation from the mean - which has the effect of ‘punishing’ (i.e. inflating or more heavily weighting) larger deviations from the mean. It then takes the square root of the whole thing, basically to counteract the effect of the squaring, and make it all measured in the same ‘units’ as the mean.

BUT, the take away from all this is: the standard deviation is - more or less - the average amount all values of a variable vary from the mean.

For example, suppose you have two classes of 100 students each, and the average age (mean) of each class is 22, but the standard deviation of the first class is zero, and the standard deviation of the second class is one.

Based on this standard deviation, you would know that - roughly and on average - if you picked any, say, six students from each class:

  • the six students from the class with zero standard deviation would all be aged 22 (i.e. ages 22, 22, 22, 22, 22, 22), while
  • the six students from the class with a standard deviation of one year would - on average - deviate from the mean age by about one year (e.g. ages 20, 21, 22, 22, 23, 24).
Notice that both groups have the same mean (22), but the second class is much more ‘dispersed’ and ‘varied’: the higher the standard deviation, the more ‘spread out’ the ages in the class are.
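Here is the two-class example as an R sketch. One caveat: the equation in Box 2 divides by N (the population formula), while R's built-in sd() divides by n - 1 (the sample formula), so for these six made-up ages the two versions differ slightly (roughly 1.3 versus 1.4, both 'about one year' in the spirit of the example).

```r
class_a <- c(22, 22, 22, 22, 22, 22)   # everyone the same age
class_b <- c(20, 21, 22, 22, 23, 24)   # same mean, more spread out

mean(class_a); mean(class_b)   # both 22

sqrt(mean((class_b - mean(class_b))^2))   # Box 2 formula (divide by N): about 1.29
sd(class_a)                               # 0
sd(class_b)                               # about 1.41 (R divides by n - 1)
```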

1.3 Minimum/Maximum: The bounds of our data

Another important set of descriptive statistics of any variable are the minimum and maximum values.

In the world of practical statistics, these numbers are very important because they give a sense of the absolute limits of the values of a dataset.

There are two main reasons you want to know this:

  1. Identify errors: Sometimes you make mistakes (or others make mistakes), and the minimum and maximum are often very easy ways to spot this (see the sketch after this list). For example, if the minimum and maximum ages of a sample of adults are 17 and 999, you know that you have problems with your data at both ends: 17 means you have a person who could not consent to do the study, and 999 probably means that you have a missing value, and have not recoded it so the computer/statistical package knows it is missing (and not just a really big number).
  2. Identify skewness: It gives you another quick feel for the data. If the distribution is very skewed (e.g. income), then the minimum and maximum will look very different to the minimum and maximum for a classic normal distribution (e.g. height). Apparently if the average wealth was rescaled so it was equal to the average height, then the richest person would be more than 100km tall.
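A minimal R sketch of the error-spotting idea, with a made-up age variable that has both of the problems described above:

```r
age <- c(23, 35, 17, 42, 999, 58, 31)   # made-up ages with two suspicious values

range(age)     # minimum 17 and maximum 999 both flag problems
summary(age)   # min, quartiles, mean and max in one call

age[age == 999] <- NA      # recode the missing-value code so R treats it as missing
range(age, na.rm = TRUE)   # re-check after recoding
```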

1.4 Percentage: For categories

Another type of descriptive statistic is the percentage. For example, the percentage of people who do not speak English at home, or the percentage of people who own dogs.

Percentages are particularly useful for summarising categorical (nominal) data, where means, standard deviations, minimums, and maximums don’t tend to make sense.
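In R, a quick sketch with made-up categories (prop.table() turns counts into proportions, which can then be multiplied by 100):

```r
language_at_home <- c("English", "English", "Mandarin", "English", "Arabic", "English")
table(language_at_home)                     # counts per category
prop.table(table(language_at_home)) * 100   # percentage in each category
```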

1.5 N (number of non-missing cases): Did people actually answer this question?

The final important descriptive statistic we will cover is ‘N’ or ‘n’. This is the number of valid cases in a dataset, or the number of valid cases for a variable.

This number tells you how many people actually answered this question.

This is useful for spotting variables that have large amounts of missing data (which could skew your results), or some sort of data entry problem. A common example is coding people as ‘missing’ when they simply were not asked a question but can reasonably be assumed to sit at one end of the scale. For instance, “How religious are you?” may not be asked of people without a religion. However, we know that their answer is not really missing, it is just very low. How religious are people without a religion? It is pretty safe to assume that they are not very religious!
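A small R sketch of checking N for a variable with missing values (the scores are made up):

```r
religiosity <- c(5, 2, NA, 7, NA, 1, 4)   # NA = no answer recorded

length(religiosity)        # total cases: 7
sum(!is.na(religiosity))   # N: valid (non-missing) cases: 5
sum(is.na(religiosity))    # number of missing cases: 2
```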

2. Inferential statistics: Drawing conclusions about the big bad world

Figure 1: The Population and The Sample

2.1 Population: The big bad world, hidden behind a curtain

As we can see in Figure 1, the sample is drawn from the population, and the population and the sample each have their own summary measure: the population parameter and the sample statistic.

2.1.1 Population parameter: The number we will never know

Population parameter – A measure (for example, mean or standard deviation) used to describe a population distribution. For example, the mean age of Australians.

2.2 Sample: The mini world we can really see and touch

2.2.1 Sample statistic: Our approximation of the real world

Sample statistic – A measure (for example, mean or standard deviation) used to describe a sample distribution. For example, the mean age of 500 Australians in a random sample.

2.3 Hypothesis testing: How do I know if I am wrong?

2.3.1 Null hypothesis: The hypothesis that nothing is happening

Research hypothesis (H1) – A statement reflecting the substantive hypothesis. It is always expressed in terms of population parameters, but its specific form varies from test to test. It is also called the alternative hypothesis.

Null hypothesis (H0) – A statement of “no difference (or no relationship),” which contradicts the research hypothesis and is always expressed in terms of population parameters.

Examples:

H1: There is a difference in occupational prestige scores between men and women

  • H0: There is no difference in occupational prestige scores between men and women.

H1: Men have a higher level of education than women.

  • H0: Men have an equal or lower level of education than women.

H1: Aboriginals have a smaller number of friends than non-Aboriginals.

  • H0: Aboriginals have an equal or larger number of friends than non-Aboriginals.

2.3.2 Sampling distribution: If I could do this survey 1,000 times…

Figure 2: A sampling distribution

2.3.3 Central limit theorem: If you roll a die an infinite number of times you always get 3.5 (on average)

The central limit theorem says three things:

  1. As the number of random samples of size N from a population approaches infinity, the mean of the sampling distribution converges on (approximates) the mean of the population.
  2. As the sample size (N) of each random sample gets larger, the sampling distribution of the mean approaches a normal distribution (regardless of the shape of the population distribution)
  3. As the number of random samples (repetitions -> infinity) of size N (N = sample size for each sample) from a population approaches infinity, the standard deviation of that sampling distribution approximates (the standard deviation of the population/square root of N)

2.3.4 Standard error: If I repeated this survey 1,000 times… the answer would vary this much (holds up hands pretending to have caught a big fish).

We call this last number - (the standard deviation of the population/square root of N) - the standard error.

Figure 3: The standard error of a sampling distribution gets smaller as the sample size of each sample increases.

Implication: As the sample size gets larger, the standard error (standard deviation of sampling distributions) becomes smaller, which means that a larger sample can better represent the population and thus more accurately infer the parameter.

Figure 4: The standard error of a sampling distribution gets smaller as the standard deviation of the population gets smaller.

Implication: As the standard deviation of populations gets smaller, the standard error becomes smaller. When populations have smaller variances, the sample can better represent the population and thus more accurately infer the parameter.
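A simulation sketch in R of 'repeating the survey 1,000 times': draw many random samples from a made-up population, take the mean of each, and compare the standard deviation of those sample means with the population standard deviation divided by the square root of N.

```r
set.seed(42)
population <- rexp(100000, rate = 1/3)   # made-up skewed population (mean about 3)

n_repetitions <- 1000   # how many times we 'repeat the survey'
N <- 100                # sample size for each sample

sample_means <- replicate(n_repetitions, mean(sample(population, N)))

sd(sample_means)           # standard error, estimated by simulation
sd(population) / sqrt(N)   # standard error from the formula - very close
hist(sample_means)         # roughly normal, even though the population is skewed
```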

2.3.5 Confidence interval: The population parameter is somewhere between here and here

Confidence Level: The likelihood, expressed as a percentage or a probability, that a specified interval will contain the population parameter.

  • 95% confidence level (standard) – there is a .95 probability that a specified interval DOES contain the population mean. In other words, there are 5 chances out of 100 (or 1 chance out of 20) that the interval DOES NOT contain the population mean.
  • 99% confidence level – there is 1 chance out of 100 that the interval DOES NOT contain the population mean.

Confidence Interval: The range within which we are confident (with 90%/95%/99% certainty) that the population parameter lies.

For a normal distribution:

95% Confidence Interval = Sample Statistic +/- 1.96 * Standard Error

Confidence Interval = Sample Statistic +/- Z * Standard Error
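The ‘magic numbers’ for different confidence levels come straight from the normal distribution, which qnorm() returns in R (the statistic and standard error below are the made-up beer values again):

```r
stat <- 3; se <- 1.4

qnorm(0.975)   # 1.96 - cuts off 2.5% in each tail (95% level)
qnorm(0.950)   # 1.64 (90% level)
qnorm(0.995)   # 2.58 (99% level)

stat + c(-1, 1) * qnorm(0.975) * se   # 95% confidence interval
stat + c(-1, 1) * qnorm(0.995) * se   # 99% interval - wider, because we demand more certainty
```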

2.3.6 p-value: the chance I’m wrong. The chance nothing is happening

p-value: The probability of obtaining a test statistic at least this extreme if the null hypothesis is true.

  • test statistic = (sample statistic - null hypothesis value)/standard error

Since the null hypothesis value is generally that nothing is happening, i.e. 0, the test statistic reduces to:

  • test statistic = sample_statistic/standard_error

Main examples of test statistics = Z, t.

  • t is for samples n < 81
  • z is for everything else.

p-value for various levels of z-score (two-tailed; verified in the sketch below):

  • z = 1.65, p = .10
  • z = 1.96, p = .05
  • z = 2.58, p = .01
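These cut-offs can be checked directly in R:

```r
z <- c(1.65, 1.96, 2.58)
round(2 * pnorm(-z), 2)   # two-tailed p-values: 0.10, 0.05, 0.01
```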

2.3.7 Effect size: Does it really matter? How much?

Significance asks “Does it exist?” Effect size (strength) asks “Does it really matter?”

As we move towards measuring bivariate associations, we tend to be concerned with three things:

  • Significance: Is there evidence that an association exists (beyond what we would expect by chance)?
  • Direction: Is it positive or negative (or no relationship)?
  • Strength: How strong is the association? We call this effect size.

While strength and significance are associated, they are not the same.

With large samples, even quite small effects (strengths) can produce statistically significant results.
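A quick simulation sketch of that point: with a very large (simulated) sample, even a trivially small correlation comes out as highly statistically significant.

```r
set.seed(7)
n <- 100000                # a very large simulated sample
x <- rnorm(n)
y <- 0.02 * x + rnorm(n)   # the true effect is tiny (correlation of roughly 0.02)

cor.test(x, y)             # p-value is tiny (highly 'significant'), but r is only about 0.02
```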

Examples of measures of effect size:

Bivariate:

  • r/correlation coefficient: 0.1-0.2 small; 0.3-0.4 medium; 0.5+ large
  • R square: the proportion of the variance in the dependent variable explained by the independent variable.
  • in cross tabulation: the maximum difference in the percentage of the DV associated with a change in the IV

Regression:

  • Standardized Beta: the impact of a one standard deviation change in the IV on the DV (measured in standard deviations of the DV)
  • R square: the proportion of the variance in the dependent variable explained by the model.
  • Change in R square when you add a variable to a model: the proportion of the variance in the dependent variable explained by that variable.
Last updated on 03 May, 2020 by Dr Nicholas Harrigan (nicholas.harrigan@mq.edu.au)