SOCI832: Lesson 1.3 Introduction to R

====================================

Reading

Field, A., Miles, J., and Field, Z. (2012). Discovering statistics using R. Sage publications.

  • Chapter 3: The R environment


Concepts


Overview: Intro to R and R Studio

Questions

  • Why use R and RStudio?
  • What do I do when R breaks? How do I get help?
  • Getting started: How do we open projects, scripts, and data?
  • How do I run simple commands in R? Create objects? Calculations? Run functions?
  • How do I manipulate data frames? New variables? Subset?

Exercises

  • Exercise 1: Importing the data from The Guardian article.
  • Exercise 2: How do I get help?
  • Exercise 3: Subsetting the data from The Guardian article.


1. Why choose to do statistics in R?

A lot of the most basic statistical analysis can be done in Microsoft Excel, and for very simple things this will be your best option.

Another very popular statistical package is SPSS by IBM. This package has the advantage of having a Graphical User Interface (GUI), which means you can run most commands by pointing and clicking with a mouse.

However, for this course we are using R. Why R? There are three main reasons.

  1. Free and open source: First, R is free and open source. This means that (1) you can run it from home, instead of having to be connected to your university wifi network, and (2) you can use R for the rest of your life, without having to pay thousands of dollars in annual licence fees. As an example of those licence fees, Figure 1 shows SPSS yearly subscription costs from 2018.
  2. Advanced models: Second, there is a plethora of packages, available for free, that extend R’s functionality. These include the most advanced statistical techniques available. For SPSS, many of the more advanced statistical models, such as Structural Equation Modelling, require the highest level of subscription. Other types of statistical analysis that are available in R, such as social network analysis, are just not available in SPSS or other popular statistical software like Stata.
  3. Growing popularity: Third, for the above two reasons, R is enjoying a large and increasing share of the market for statistical programs, so learning R should stand you in good stead for future social research. For example, see this article, which documents the growing popularity of R in recent years. The most important figures are 2d and 2e, which can be seen in Figure 2 below. Figure 2 shows the number of Google Scholar hits (basically, scholarly usage) for each package: SPSS usage peaked in 2009, while R is rapidly increasing in popularity and will likely become the most widely used statistical package in the next few years.

Figure 1: Cost of SPSS licences per year (26/7/2018).

Figure 2: Scholarly articles using each major statistical package.

Source: http://r4stats.com/articles/popularity/

Installing & Updating R and R Studio

I have written a detailed walk through of the steps to install R and RStudio here: Appendix 1: Installing R and RStudio.

Updating R and R Studio

It is generally a very good idea, particularly when working along with instructions in this class, to make sure you are using the latest versions of RStudio and R.

If you installed either more than a month ago, then chances are they are out of date.

However, you can check by following these instructions:

  1. Updating RStudio

In RStudio:

  • Open RStudio
  • In the menu select “Help” (on the far right side), and then select “Check for Updates”. See Figure 3.
  • If it is up to date, a window will open that says “No Update Available: You’re using the newest version of RStudio.” See Figure 4.
  • If you are NOT up to date, either follow the link provided by RStudio, or simply navigate to https://www.rstudio.com/products/rstudio/download/, and select “Download” under the heading “RStudio Desktop Open Source Licence FREE”

Figure 3: How to check your RStudio version is current.

Figure 4: How to check your RStudio version is current.

  2. Updating R

What version of R am I using?

You can tell which version of R you are running by looking in RStudio, in the Console window, as you can see in Figure 5.

Figure 5: How to check your RStudio version is current.

What is the latest version of R?

In any browser:

You can see the version number on this page.

If you are not up to date, then on the pages above simply:

  • click on R-X.X.X.pkg (for Mac) or ‘Download R X.X.X for Windows’ (for Windows), where X.X.X is the latest version number of R.
  • download the file
  • click on it to run and install
  • default settings are fine.

Once the new version of R is installed, you need to close RStudio and then reopen it. RStudio should automatically recognise the new version of R. You can confirm the new version of R is installed by looking at the version number of R in the console window (as explained just above in Figure 5).
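
The version can also be checked from code instead of reading the console banner. The lines below are a minimal sketch using only base R; the threshold "3.5.0" is an arbitrary value chosen for illustration, and the output on your machine will differ.

```r
# The version of R that RStudio is currently talking to, as a line of text
R.version.string

# The same information as a version object, which can be compared
current <- getRversion()
current >= "3.5.0"   # TRUE if this installation is at least R 3.5.0 (example threshold)
```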


2. Getting Started in RStudio

COMMON CONFUSION: We almost never open R directly. We just open RStudio.

One common mistake people new to R and RStudio make is to think that they need to directly open R on their computer. You don’t. You don’t need to click on the R icon and open it at any stage.

Instead, we just open RStudio, and RStudio will communicate with R for us.

RStudio is a nice interface that makes it much easier to use R.


2.1 Open New Project

When you first open RStudio you should see something like Figure 6.

Figure 6: What RStudio looks like when we first open it.

The first thing you want to do is open a new project.

A project is like a separate workspace you create for each of your areas of work, such as your thesis, or an assignment, or a project at work. Each project will have its own working directory, files, folders, and history.

Step 1: To open a project go to File and then New Project, as shown in Figure 7.

Figure 7: Step 1: How to open a new project in RStudio.

Step 2: In the window that opens, you have a number of choices. At this stage, select “New Directory”, as shown in Figure 8.

Figure 8: Step 2: How to open a new project in RStudio.

Step 3: Select “New Project”, as shown in Figure 9.

Figure 9: Step 3: How to open a new project in RStudio.

Step 4: Type in the name of your new directory. In my case I am creating one called “2019, SOCI832 Practice” (see Figure 10).

Figure 10: Step 4: How to open a new project in RStudio.

Step 5: RStudio will now open a new project for you. You can see the name of the folder at the top of the RStudio window, and also under the ‘Console’ tab.
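
You can also ask R directly which folder it is working in. This is a small sketch using base R only; the folder printed will be your own project folder, and the variable name project_folder is just an illustrative choice.

```r
# The folder R is currently working in; inside a project this is the project folder
project_folder <- getwd()
project_folder

# The files and folders R can see there
list.files(project_folder)
```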

2.2 Open Script file

The next thing we want to do, before we really get started with RStudio, is open a script file.

As you probably know, we need to type commands into R - we can’t just point and click like we do in Microsoft Word or SPSS.

When we type commands, three things happen:

  • We make mistakes - typos.
  • We often want to run our commands again - repeating the same or similar analysis.
  • We often want to save our commands so we know what we did.

Using a script - which is basically just a text file with all our commands in it - allows us to correct our mistakes, save our commands, and also copy and paste them later when we want to do similar analysis.

Step 1: To open a script file go to File menu and then New File and then R Script as shown in Figure 11.

Figure 11: Step 4: How to open a new script file in RStudio.

There are also two shortcuts for doing this:

  • On the keyboard (Windows), you can type: Ctrl+Shift+N
  • You can also click the white paper icon with the green plus sign on it in the top left corner of the RStudio screen.

Step 2: To save your script file go to the menu File and then Save As. Navigate to your project folder (in my case the “2019, SOCI832 Practice” folder), and then give your script file a name. I will call mine nicks_script. Then press Save.

2.3 Running your first script

Let’s run your first script.

In the script window you can type the classic first command of any book that teaches programming.

print("Hello World")

And then put your cursor anywhere on that line of code in the script window and press Ctrl+Enter (in Windows) or Cmd+Enter (on a Mac).

In the console window (below the script) you should see this.

## [1] "Hello World"

Actually, what you will see in the console window is what appears in Figure 12.

Figure 12: Output in Console window from print("Hello World") command.

What does it mean?

It is a command that says “Print to the console the text inside the double inverted commas”, which in our case is “Hello World”.

It is a pretty useless command, but it shows how you can run a line in a script file, and see the results in the console window.

2.4 Looking around R Studio

Let’s take a quick look around RStudio, and orient ourselves.

Figure 13 shows what RStudio looks like in a default setting.

Figure 13: Looking around the RStudio interface.

The numbers in Figure 13 refer to:

  1. Script Window: Where you type and save your commands as a text file.
  2. Console Window: Where the output of our R commands appears. The commands we run will appear in blue, while the output will appear in black.
  3. Environment Tab: This is where all your currently open data will appear. At the moment you can see that I have created a variable called “x”, which holds the value “Hello World Again”.
  4. History Tab: This contains the history of all the commands we have recently run in this project. We can go to this tab and re-run any commands we have run previously.
  5. Files Tab: This shows the files and folders within our project folder.
  6. Plots Tab: Where graphs and figures will appear, when we make them.
  7. Packages Tab: A tab that makes it easier to install Packages in R.
  8. Help Tab: The Help window
  9. Viewer Tab: A built in browser in RStudio, which can be used to open webpages (e.g. HTML files) created by R.
  10. Save Button: Click to save your current script
  11. Find Button: To find text in your script
  12. Run Button: To run any highlighted script.

3. Running Commands in R

In this section we are going to learn about (1) simple objects in R (often called variables in other programming languages); (2) conducting calculations on these objects; and then (3) transforming these objects with functions.

3.1 Simple objects

We can create a simple object with this command:

x <- "Hello World"

In R we call <- the assignment operator. It says “Put the value on the right into the object name on the left.”

You can see that x appears in the Environment tab in the top right of the screen.

We can see the object x by simply typing x and then running that line of code

x
## [1] "Hello World"

We can also do this with a number, for example:

y <- 9

y appears in the environment window.

And now we can look at the contents of y by just typing y and running the line of code:

y
## [1] 9

Notice if we assign some different value to y, then y changes:

y <- 9
y
## [1] 9
y <- 2
y
## [1] 2

We can also delete variables with the command rm(). Notice when we call y after we have removed it, we get an error.

y <- 9
y
## [1] 9
rm(y)
y
## Error in eval(expr, envir, enclos): object 'y' not found
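
If you would rather check whether an object exists without triggering an error, base R offers ls() and exists(). The sketch below uses a made-up object name, my_demo_number, purely for illustration.

```r
# Create an object with a deliberately unusual, made-up name
my_demo_number <- 9

ls()                                  # lists the objects currently in the environment
before <- exists("my_demo_number")    # TRUE: the object is there
before

rm(my_demo_number)                    # delete it
after <- exists("my_demo_number")     # FALSE: a logical answer instead of an error
after
```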

3.2 Calculations in R

We can do calculations in R by entering them as you would normally, as in a calculator.

For instance, you could calculate 1+1 by typing in and running the following piece of code.

1+1
## [1] 2

We can also do calculations on variables. In the next example we create new variables, and then call them to see their contents.

a <- 100
a
b <- 3
b
c <- a + b
c
d <- a * b
d
e <- a / b
e
f <- a - b
f
## [1] 100
## [1] 3
## [1] 103
## [1] 300
## [1] 33.33333
## [1] 97
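
Beyond the four basic operations, R has a few more arithmetic operators that come up often. The following is a short sketch using only literal numbers; the comments give the value each line returns.

```r
2^3          # exponent: 8
7 %% 2       # remainder after division: 1
7 %/% 2      # whole-number division: 3
1 + 2 * 3    # multiplication happens before addition: 7
(1 + 2) * 3  # brackets change the order: 9
```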

3.3 An introduction to functions

Most of the power of R is contained in commands that are called ‘functions’.

Functions take one or more variables, process these variables, and produce an output.

The conventional syntax for calling most functions in R is

  1. The name of the function
  2. A round open bracket “(”
  3. One or more arguments: variables or values being passed to the function
  4. A round close bracket “)”

Example 1: For example, we might create a variable called z and put the value 100 into it. And then we run the function “sqrt()” on it. “sqrt()” computes the square root of the number in the brackets.

z <- 100
sqrt(z)
## [1] 10

We can see that R returns the square root of 100, which is 10.

Example 2: We can do a more complex example with a vector. We can create a vector with the function “c()”, and then we can calculate the sum with “sum()”, the mean with “mean()”, and the standard deviation with the function “sd()”.

q <- c(1,2,3,4,5,6,7,8,9,10)
sum(q)
mean(q)
sd(q)
## [1] 55
## [1] 5.5
## [1] 3.02765
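
Functions with more than one argument can be given those arguments either by position or by name. The sketch below illustrates this with two base R functions, seq() and sort(); the vector called numbers is made up for the example.

```r
# Arguments matched by position, then the same call with arguments matched by name
by_position <- seq(1, 10, 3)
by_name <- seq(from = 1, to = 10, by = 3)
by_position                      # 1 4 7 10
by_name                          # 1 4 7 10

# A named argument can switch on optional behaviour
numbers <- c(3, 1, 2)
sort(numbers)                    # 1 2 3
sort(numbers, decreasing = TRUE) # 3 2 1
```

Naming arguments makes a call longer, but it is often easier to read back later.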

Optional exercise: Round this number

Create a variable called cat which equals 10/3 (i.e. 3.333333)

Use the function round() to:

  • First, round cat to one whole digit, i.e. 3
  • Second, round cat to two decimal places, i.e. 3.33

Use R Help files and/or Google to work out what arguments you need to give round().
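
As a starting point for finding help, the sketch below shows two base R tools: args(), which lists the arguments a function accepts, and help(), which opens the full help page. The help calls are wrapped in interactive() here only so the snippet is safe to run as a script; in the console you can type them directly.

```r
# args() shows the arguments a function accepts, without opening the help page
round_args <- args(round)
round_args

# The full help page opens in the Help tab of RStudio
if (interactive()) {
  help("round")             # the same as typing ?round
  help.search("rounding")   # search installed help pages for a keyword
}
```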


4. Working with Data Frames

When we are doing social science data analysis, we are almost always working with a type of object known as a data frame.

A data frame is basically like an Excel Spreadsheet, where each row is a unit of analysis (such as a person), and each column is a variable (such as a characteristic of the person, like their age, gender, or the party they voted for at the last election).

While we tend not to do this, we can make our own data frames from scratch.

In this case, we are going to make a data frame about my rabbits (my partner, Rachel, calls them our flatmates).

This is what the data would look like as a table.

name      colour       Sex     breed              weight  DOB         Alive
scamper   grey         male    minilop            1       2018-01-01  false
shredder  grey         male    minilop            1.8     2018-06-01  true
chia      white_spots  female  mixed              1.2     2018-08-01  false
jess      white        female  mixed              2.5     2018-06-01  true
celeste   white        female  nz_white           4.5     2018-08-01  true
stephii   white        female  netherlands_dwarf  0.9     2018-12-01  true

We could enter this as a series of vectors

name = c("scamper", "shredder", "chia", "jess", "celeste", "stephii")
colour = c("grey", "grey", "white_spots", "white", "white", "white")
sex = c("male", "male", "female", "female", "female", "female")
breed = c("minilop", "minilop", "mixed", "mixed", "nz_white", "netherlands_dwarf")
weight_kg = c(1, 1.8, 1.2, 2.5, 4.5, 0.9)
dob = c("2018-01-01", "2018-06-01", "2018-08-01", "2018-06-01", NA, NA)
alive = c(FALSE, TRUE, FALSE, TRUE, TRUE, TRUE)

We could look at some of the different characteristics of some of these vectors

length(name)
## [1] 6
dim(name)
## NULL
str(name)
##  chr [1:6] "scamper" "shredder" "chia" "jess" "celeste" "stephii"
str(weight_kg)
##  num [1:6] 1 1.8 1.2 2.5 4.5 0.9
str(alive)
##  logi [1:6] FALSE TRUE FALSE TRUE TRUE TRUE

We could now combine them into a data frame

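# stringsAsFactors = TRUE gives the factor columns shown below (the default is FALSE since R 4.0.0)
rabbits <- data.frame(name, colour, sex, breed, weight_kg, dob, alive, stringsAsFactors = TRUE)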

We can then examine the data frame with some simple commands.

str(rabbits)
## 'data.frame':    6 obs. of  7 variables:
##  $ name     : Factor w/ 6 levels "celeste","chia",..: 4 5 2 3 1 6
##  $ colour   : Factor w/ 3 levels "grey","white",..: 1 1 3 2 2 2
##  $ sex      : Factor w/ 2 levels "female","male": 2 2 1 1 1 1
##  $ breed    : Factor w/ 4 levels "minilop","mixed",..: 1 1 2 2 4 3
##  $ weight_kg: num  1 1.8 1.2 2.5 4.5 0.9
##  $ dob      : Factor w/ 3 levels "2018-01-01","2018-06-01",..: 1 2 3 2 NA NA
##  $ alive    : logi  FALSE TRUE FALSE TRUE TRUE TRUE
rabbits
##       name      colour    sex             breed weight_kg        dob alive
## 1  scamper        grey   male           minilop       1.0 2018-01-01 FALSE
## 2 shredder        grey   male           minilop       1.8 2018-06-01  TRUE
## 3     chia white_spots female             mixed       1.2 2018-08-01 FALSE
## 4     jess       white female             mixed       2.5 2018-06-01  TRUE
## 5  celeste       white female          nz_white       4.5       <NA>  TRUE
## 6  stephii       white female netherlands_dwarf       0.9       <NA>  TRUE

We can then access the individual variables (columns) of the dataset with the dollar sign $.

For example, we could create a new variable weight_pounds with this command, which calculates the new variable by multiplying the weight in kilograms by 2.2 (the conversion factor for kg to pounds).

rabbits$weight_pounds <- rabbits$weight_kg * 2.2
rabbits
##       name      colour    sex             breed weight_kg        dob alive
## 1  scamper        grey   male           minilop       1.0 2018-01-01 FALSE
## 2 shredder        grey   male           minilop       1.8 2018-06-01  TRUE
## 3     chia white_spots female             mixed       1.2 2018-08-01 FALSE
## 4     jess       white female             mixed       2.5 2018-06-01  TRUE
## 5  celeste       white female          nz_white       4.5       <NA>  TRUE
## 6  stephii       white female netherlands_dwarf       0.9       <NA>  TRUE
##   weight_pounds
## 1          2.20
## 2          3.96
## 3          2.64
## 4          5.50
## 5          9.90
## 6          1.98
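
A few other base R commands are handy for getting a quick feel for a data frame. The sketch below rebuilds a cut-down copy of the rabbits data (three columns only, under the illustrative name rabbits_small) so that it runs on its own.

```r
# A cut-down copy of the rabbits data so this example is self-contained
rabbits_small <- data.frame(
  name      = c("scamper", "shredder", "chia", "jess", "celeste", "stephii"),
  sex       = c("male", "male", "female", "female", "female", "female"),
  weight_kg = c(1, 1.8, 1.2, 2.5, 4.5, 0.9)
)

nrow(rabbits_small)                # number of rows (rabbits): 6
ncol(rabbits_small)                # number of columns (variables): 3
names(rabbits_small)               # the column names
head(rabbits_small, 3)             # the first three rows
mean(rabbits_small$weight_kg)      # mean of one column, reached with $
summary(rabbits_small$weight_kg)   # minimum, quartiles, mean and maximum
```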

Aside: Types of Data (units of data)

The main types of data (i.e. fundamental units of data - in a single ‘cell’) are:

  • Numeric: A number, that can be decimal.
  • Character: A string of characters. Could be a word or a sentence or a random string of characters.
  • Integer: A whole number. Cannot be a decimal.
  • Logical: Can be logically TRUE or FALSE. If converted to numeric, TRUE = 1; FALSE = 0.
  • Dates: A complex data type, which is generally stored as the number of days since 1 January 1970. We may do more on this later.
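
The function class() reports which of these types a value has. The following sketch uses made-up values; the comments show what each line returns.

```r
class(3.14)                         # "numeric"
class("rabbit")                     # "character"
class(3L)                           # "integer" (the L suffix marks a whole number)
class(TRUE)                         # "logical"
as.numeric(TRUE)                    # 1

birthday <- as.Date("2018-06-01")   # convert text to a date
class(birthday)                     # "Date"
as.numeric(as.Date("1970-01-10"))   # 9: days since 1 January 1970
```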


Aside: Types of Objects (data structures)

The main types of objects (i.e. structures that can hold data) in R are:

  • vector: A list of items, like a row or column of an Excel Spreadsheet, BUT all items must be the same type of data (i.e. all numeric, all character, all logical, etc.).
  • list: A list of items, like a vector, but items can have different data types (so the one list can have numbers and characters in different items.)
  • matrix: A two dimensional vector, like an Excel Spreadsheet, BUT like a vector, all items must be the same type of data (i.e. all numeric, all character, all logical, etc.).
  • data frame: Like a matrix or spreadsheet - two dimensions, rows (units of analysis) and columns (variables) - BUT (1) items in the same column must have same data type, (2) different columns (variables) can have different data types.
  • factor: A complex data type, which is basically a vector of numbers, where each number refers to one of a fixed set of categories (called levels).
  • array: these are like multidimensional matrices (i.e. they can have more than two dimensions). All items have to be of one data type (e.g. all numeric, or all character).
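
To make these structures concrete, here is a small sketch that builds one of each with base R. All object names and values are made up for illustration.

```r
# vector: every item has the same type, so mixing types converts everything
mixed_vector <- c(1, "a", TRUE)
class(mixed_vector)                      # "character"

# list: each item keeps its own type
mixed_list <- list(1, "a", TRUE)
class(mixed_list[[1]])                   # "numeric"
class(mixed_list[[2]])                   # "character"

# matrix: two dimensions, one type
small_matrix <- matrix(1:6, nrow = 2)
dim(small_matrix)                        # 2 3 (rows, columns)

# factor: categories stored as numbers that point to levels
sex_factor <- factor(c("male", "female", "male"))
levels(sex_factor)                       # "female" "male"
as.integer(sex_factor)                   # 2 1 2

# array: more than two dimensions
small_array <- array(1:24, dim = c(2, 3, 4))
dim(small_array)                         # 2 3 4
```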


Aside: Missing Data

NA

Sometimes research assistants don’t enter data correctly. Sometimes respondents don’t answer all questions in a survey. Sometimes questions are not relevant to some respondents (e.g. men generally don’t have a bra size).

In these cases, we have what is called ‘Missing Data’.

There are many different conventions for recording missing data. Some programs use a blank cell, others a dot, others -99.

R uses NA.

For example, if you look at the date of birth of the rabbits stephii and celeste, they have NA for their birth dates, because we don’t know when they were born: Stephii was found abandoned in a local council pound (in western Sydney), while Celeste was found abandoned in Centennial Park (in the eastern suburbs of Sydney).

To ask if a particular cell (element) is missing, we use the function is.na()

rabbits$name
## [1] scamper  shredder chia     jess     celeste  stephii 
## Levels: celeste chia jess scamper shredder stephii
rabbits$dob
## [1] 2018-01-01 2018-06-01 2018-08-01 2018-06-01 <NA>       <NA>      
## Levels: 2018-01-01 2018-06-01 2018-08-01
is.na(rabbits$dob[4])
## [1] FALSE
is.na(rabbits$dob[5])
## [1] TRUE
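
is.na() also works on a whole vector at once, and many summary functions need to be told what to do with missing values. A short sketch, where the vector some_weights is made up for illustration:

```r
dob <- c("2018-01-01", "2018-06-01", "2018-08-01", "2018-06-01", NA, NA)
is.na(dob)                         # FALSE FALSE FALSE FALSE TRUE TRUE
sum(is.na(dob))                    # 2: the number of missing values

some_weights <- c(1, 1.8, NA)      # made-up values, one missing
mean(some_weights)                 # NA: the mean is unknown if a value is unknown
mean(some_weights, na.rm = TRUE)   # 1.4: the mean of the values that are present
```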

NULL

R has another type of missing value, which is called NULL. NULL is a bit different to NA. NULL is nothingness. NULL is the absence of anything. It can’t be stored as an element of a vector: it only exists as a single object on its own. To test whether an object is NULL, use is.null().

You don’t need to worry too much about NULL. We won’t deal with it much, but it is worth knowing it is lurking around in the background.
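
The difference between NA and NULL is easiest to see by comparing lengths. A minimal sketch in base R, with the object name nothing chosen for illustration:

```r
nothing <- NULL
is.null(nothing)        # TRUE
length(nothing)         # 0: NULL takes up no space at all
length(NA)              # 1: NA is a placeholder that still fills a slot
length(c(1, NULL, 2))   # 2: NULL disappears when combined into a vector
length(c(1, NA, 2))     # 3: NA stays in the vector
```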


5. Reading and Saving Data

5.1. Saving R data frames and other objects

We have created a data frame with the rabbit data in it. We could save that to our computer if we wanted. This will save it into a file in our working directory.

There are three main formats to save a standard data frame in:

  • CSV: comma separated values, which are basically a very simple form of spreadsheet
  • rds: saves a single R object. Has the advantage that you can load the object into any name when you restore it.
  • RData: saves a single or multiple objects. Has the disadvantage that it can only load objects back into their original name when you restore it.

write.csv(rabbits, file = "rabbits.csv")
saveRDS(rabbits, file = "my_data.rds")
a <- c(1,2,3,4,5)
b <- c("one", "ten", "one hundred")
save(rabbits, a, b, file = "much_data.RData")

5.2 Reading data from files

We can then read this data back into R using any one of the three following commands.

flatmates <- read.csv(file="rabbits.csv", header=TRUE, sep=",")
flatmates_from_rds <- readRDS("my_data.rds")
load("much_data.RData")
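
To check that saving and reading give back what you started with, you can do a round trip. The sketch below uses a made-up two-row data frame called toy and writes to R's temporary folder (tempdir()) so it does not clutter your project. Note the row.names = FALSE argument: without it, write.csv() also writes the row numbers, and read.csv() then brings them back as an extra column named X.

```r
# A tiny made-up data frame to save and restore
toy <- data.frame(name = c("jess", "celeste"), weight_kg = c(2.5, 4.5))

# Build file paths inside R's temporary folder
csv_path <- file.path(tempdir(), "toy.csv")
rds_path <- file.path(tempdir(), "toy.rds")

write.csv(toy, file = csv_path, row.names = FALSE)  # leave out the row numbers
saveRDS(toy, file = rds_path)

toy_from_csv <- read.csv(csv_path)
toy_from_rds <- readRDS(rds_path)

names(toy_from_csv)            # "name" "weight_kg"
identical(toy, toy_from_rds)   # TRUE: the rds copy is an exact copy
```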

5.3 Reading from Excel Files

If we want to read data from Excel files, we need to download it to our working directory, and install the package readxl, and then use the command read_excel.

install.packages("readxl", dependencies = TRUE)
library(readxl)
my_excel_file <- read_excel("excel-example.xlsx")
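
Because install.packages() downloads the package every time it runs, you may prefer to install only when the package is missing. The following is one possible sketch, not the only way to do it: requireNamespace() is a base R function that reports whether a package is installed, and the interactive() guard is an assumption added here so the script never installs anything when run unattended. The file name is the one used above.

```r
# TRUE if readxl is already installed, FALSE otherwise
readxl_available <- requireNamespace("readxl", quietly = TRUE)

# Install only when missing, and only when someone is at the keyboard
if (!readxl_available && interactive()) {
  install.packages("readxl", dependencies = TRUE)
}

# Read the file only if both the package and the file are present
excel_path <- "excel-example.xlsx"
if (requireNamespace("readxl", quietly = TRUE) && file.exists(excel_path)) {
  my_excel_file <- readxl::read_excel(excel_path)
  str(my_excel_file)
} else {
  message("Either readxl is not installed or ", excel_path, " is not in the working directory.")
}
```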

5.4 Reading from SPSS files

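If we want to read data from SPSS files (called .sav files) then we need to install the package sjlabelled, and then use the command read_spss. The same package also provides read_stata for Stata files (called .dta files).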

install.packages("sjlabelled", dependencies = TRUE)
library(sjlabelled) 
my_spss_file <- read_spss("spss-example.sav")
my_stata_file <- read_stata("stata-example.dta")

5.5 Reading from standard datasets in R

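data()             # list the datasets that come with R and any loaded packages
data(ChickWeight)  # load the ChickWeight dataset into the environment
str(ChickWeight)   # look at its structure
?ChickWeight       # open its help page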
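
Once a built-in dataset is loaded, it behaves like any other data frame. A short sketch using ChickWeight, which ships with base R:

```r
data(ChickWeight)

dim(ChickWeight)           # 578 rows and 4 columns
names(ChickWeight)         # "weight" "Time" "Chick" "Diet"
head(ChickWeight)          # the first six rows
mean(ChickWeight$weight)   # mean of the weight column
```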

5.6 Importing from a website

library(sjlabelled) 
guardian_from_csv <- read.csv(url("https://methods101.com/data/guard_data.csv"), header=TRUE, sep=",")
guardian_from_rds <- readRDS(url("https://methods101.com/data/guard_data.rds"))
guardian_from_spss <- sjlabelled::read_spss(url("https://methods101.com/data/guard_data.sav"))

You can also open the data from an Excel file you can download from https://www.methods101.com by:

  1. going to https://methods101.com/data/guard_data.xlsx and
  2. downloading the .xlsx file to your computer, and then
  3. saving the file to your working directory and
  4. making sure you have readxl package installed and loaded with the command library(readxl), and then
  5. opening it in RStudio with the command guardian_from_excel <- readxl::read_xlsx("guard_data.xlsx")

Exercise 1: Importing data from The Guardian article.

Follow the directions above to import the data from The Guardian article.

In pairs or groups of 3, attempt the following questions, and then write your answers on the Google Doc to share with the class.

  1. What different commands (and point and click in RStudio) can you use to get a ‘quick feel’ for The Guardian data? This can include commands we haven’t run today.
  2. What command do we use to access just one variable (column) of a data frame? How would you create a new variable percent_year12_or_equiv which expresses percentage of population who have completed year 12 or equivalent as a percentage (i.e. between 100% and 0%) and not as a proportion (i.e. between 1 and 0)?
  3. What exactly do the numbers mean for each of the variables? Start from the first column and try to work out - using logic, and the internet, and the references in The Guardian article - what the numbers in each column actually mean? What is the name of the ‘book’ that is missing from this dataset?


Exercise 2: Getting Help.

Remember the philosophy ‘The typos are the pedagogy’. Making mistakes is how we learn to code and how to do statistics.

  1. What mistakes in coding have you already made today? List them on the Google Doc.
  2. What are the best ways of solving problems you run into when using R? Write a list of useful sources, websites, and strategies.


6. Subsetting

Often we want to analyse only a subset of our data.

For example, we want to analyse only the women or only the men in our dataset.

To do this, we use square brackets [] after the name of an object.

We will learn about subsetting through an exercise.
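
As a warm-up before the exercise, here is a small sketch of the "only the women or only the men" case, using a cut-down, self-contained copy of the rabbits data (named rabbits_small for illustration). A logical condition inside the square brackets keeps only the items where the condition is TRUE.

```r
# A cut-down copy of the rabbits data so this example runs on its own
rabbits_small <- data.frame(
  name      = c("scamper", "shredder", "chia", "jess", "celeste", "stephii"),
  sex       = c("male", "male", "female", "female", "female", "female"),
  weight_kg = c(1, 1.8, 1.2, 2.5, 4.5, 0.9),
  stringsAsFactors = FALSE
)

# Keep only the rows where sex is "female" (the comma is followed by nothing,
# which keeps all the columns)
females <- rabbits_small[rabbits_small$sex == "female", ]
nrow(females)                # 4

# Keep only the names of rabbits heavier than 2 kg
heavy_names <- rabbits_small$name[rabbits_small$weight_kg > 2]
heavy_names                  # "jess" "celeste"
```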

Exercise 3: Subsetting with [].

Try these different commands, and try to work out, based on the output in your console window, what the various numbers inside the square brackets mean.

guardian_from_csv[1]  # command 1
guardian_from_csv[1,5]  # command 2
guardian_from_csv$electorate[5]  # command 3
guardian_from_csv$electorate[c(1,5)]  # command 4
guardian_from_csv$electorate[1,5]  # command 5
guardian_from_csv$electorate[guardian_from_csv$X2pp_swing > 10]  # command 6

  1. Why does command 5 give an error, but command 2 and command 4 do not give errors?
  2. CHALLENGE: Write some code that would illustrate one of the findings from The Guardian graphs?

For example, you might want to show the mean two party preferred vote for high and low income areas.

If you are having trouble, I would suggest doing this:

  1. Go back to The Guardian article and find one figure/graph that looks particularly dramatic or interesting.
  2. Identify the two variables in the graph (the x and y axis).
  3. Look at your dataset in R. What is the name of these variables in R? What command do you need to call to access the variables?
  4. Calculate the mean or median of your independent variable and put it in a variable like income_mean.
  5. Calculate the mean of your dependent variable at two levels of your independent variable (one level above the mean of the independent variable, and one below the mean).
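
The pattern in steps 4 and 5 can be sketched on toy data. Everything in the block below is made up for illustration: the data frame toy_areas, its columns income and swing, and all of the numbers. None of it comes from The Guardian dataset, so you will need to substitute the real data frame and variable names.

```r
# Made-up toy data: four areas with an income figure and a swing figure
toy_areas <- data.frame(
  income = c(40, 55, 70, 85),
  swing  = c(2, 4, 6, 10)
)

# Step 4: the mean of the independent variable
income_mean <- mean(toy_areas$income)
income_mean                                                   # 62.5

# Step 5: the mean of the dependent variable above and below that level
swing_high <- mean(toy_areas$swing[toy_areas$income > income_mean])
swing_low  <- mean(toy_areas$swing[toy_areas$income <= income_mean])
swing_high                                                    # 8
swing_low                                                     # 3
```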


Last updated on 04 August, 2019 by Dr Nicholas Harrigan (nicholas.harrigan@mq.edu.au)