SOCI832: Lesson 4.5: Recoding Variables

1. Run the standard set up code

# Install Packages
if(!require(dplyr)) {install.packages("dplyr", repos='https://cran.csiro.au/', dependencies=TRUE)}
if(!require(sjlabelled)) {install.packages("sjlabelled", repos='https://cran.csiro.au/', dependencies=TRUE)}
if(!require(sjmisc)) {install.packages("sjmisc", repos='https://cran.csiro.au/', dependencies=TRUE)}
if(!require(sjPlot)) {install.packages("sjPlot", repos='https://cran.csiro.au/', dependencies=TRUE)}
if(!require(summarytools)) {install.packages("summarytools", repos='https://cran.csiro.au/', dependencies=TRUE)}

# Load packages into memory
library(dplyr)
library(sjlabelled)
library(sjmisc)
library(sjPlot)
library(summarytools)

# Turn off scientific notation
options(digits=5, scipen=15) 

# Stop View from overloading memory with large datasets
RStudioView <- View
View <- function(x) {
  # Show only the first 500 rows of a data frame; head() avoids padding short data with NA rows
  if (is.data.frame(x)) { RStudioView(head(x, 500)) } else { RStudioView(x) }
}

2. Major Types of Recoding

2.0: Checking Coding with frq()

If you need to check the coding for any variable - before or after recoding - the best function to use is frq().

For example:

frq(mydata$age_r)


## 
## # Age Category (x) <numeric> 
## # total N=30  valid N=30  mean=4.77  sd=1.92
##  
##  val label frq raw.prc valid.prc cum.prc
##    1   10s   2    6.67      6.67    6.67
##    2   20s   1    3.33      3.33   10.00
##    3   30s   5   16.67     16.67   26.67
##    4   40s   5   16.67     16.67   43.33
##    5   50s   6   20.00     20.00   63.33
##    6   60s   6   20.00     20.00   83.33
##    7   70s   3   10.00     10.00   93.33
##    8   80s   1    3.33      3.33   96.67
##    9   90s   1    3.33      3.33  100.00
##   NA    NA   0    0.00        NA      NA

2.1: Recode

Source SOC830 Lab 04

To recode variables, use the general form data name <- rec(data name, variable name, rec = "recoding scheme", append = TRUE). Setting append = TRUE means that R will append the newly recoded variable to data name; by default the new variable takes the old name plus the suffix _r. For example, the following code recodes the age variable in mydata and generates a new age group variable named age_r.

mydata <- rec(mydata, age, rec = "min:19 = 1; 20:29 = 2; 30:39 = 3; 40:49 = 4;
              50:59 = 5; 60:69 = 6; 70:79 = 7; 80:89 = 8; 90:max = 9", 
              append = TRUE)

Let me explain more about the “recoding scheme” in the above code. a:b means all values from a to b. For example, min:19 means all the numbers from the minimum value to 19. In the “recoding scheme”, we specify how the values of the old variable are converted into the values of the new variable. The left side of each equals sign (=) gives the values of the old variable, and the right side gives the value of the new variable. For example, min:19 = 1 means that all the values from the minimum value to 19 will be converted into 1. A semicolon (;) separates the rule for each new value.
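
To see how a recoding scheme maps old values to new ones, here is a minimal sketch on a made-up data frame. The data frame toy and its ages are invented for illustration and are not part of the lab data, and the scheme is shortened to three bands. It assumes the sjmisc package is installed.

```r
library(sjmisc)

# Hypothetical toy data, invented for illustration only
toy <- data.frame(age = c(15, 19, 20, 29, 30, 64))

# Shortened scheme: up to 19 -> 1, 20 to 29 -> 2, 30 and over -> 3
toy <- rec(toy, age, rec = "min:19 = 1; 20:29 = 2; 30:max = 3", append = TRUE)

# Cross-tabulate old against new values to confirm each age landed in the intended band
table(toy$age, toy$age_r)
```

Cross-tabulating the old variable against the new one like this is a quick way to catch a typo in the scheme before moving on.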

Running this code will make a new variable named “age_r”. Then, we assign a variable label and value labels to this new variable by running the following code:

mydata$age_r <- set_label(mydata$age_r, label = "Age Category")
mydata$age_r <- set_labels(mydata$age_r, labels = c ("10s" = 1,
                                                     "20s" = 2,
                                                     "30s" = 3,
                                                     "40s" = 4,
                                                     "50s" = 5,
                                                     "60s" = 6,
                                                     "70s" = 7,
                                                     "80s" = 8,
                                                     "90s" = 9))

Then, let us check the new variable by making a frequency table of it.

frq(mydata$age_r)


## 
## # Age Category (x) <numeric> 
## # total N=30  valid N=30  mean=4.77  sd=1.92
##  
##  val label frq raw.prc valid.prc cum.prc
##    1   10s   2    6.67      6.67    6.67
##    2   20s   1    3.33      3.33   10.00
##    3   30s   5   16.67     16.67   26.67
##    4   40s   5   16.67     16.67   43.33
##    5   50s   6   20.00     20.00   63.33
##    6   60s   6   20.00     20.00   83.33
##    7   70s   3   10.00     10.00   93.33
##    8   80s   1    3.33      3.33   96.67
##    9   90s   1    3.33      3.33  100.00
##   NA    NA   0    0.00        NA      NA

2.2: Calculate

We can also create new variables with simple arithmetic, such as adding, subtracting, or dividing.

Say we have a variable we want to standardise (transform into a variable with a mean of zero and a standard deviation of 1). We can run code like this:

ds$standardised_age <- (ds$age - mean(ds$age))/(sd(ds$age))
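
The line above returns NA for every case if age contains even one missing value. Here is a minimal sketch of one way to guard against that, using made-up data (the values in ds are invented for illustration) and only base R.

```r
# Hypothetical toy data; ds and age mirror the names used above
ds <- data.frame(age = c(23, 35, 41, 58, NA, 67))

# na.rm = TRUE makes mean() and sd() skip missing values;
# without it, a single NA would turn every standardised value into NA
ds$standardised_age <- (ds$age - mean(ds$age, na.rm = TRUE)) / sd(ds$age, na.rm = TRUE)

# Check: the mean should be (almost exactly) 0 and the standard deviation 1
mean(ds$standardised_age, na.rm = TRUE)
sd(ds$standardised_age, na.rm = TRUE)

# Base R's scale() performs the same centring and scaling
all.equal(as.numeric(scale(ds$age)), ds$standardised_age)
```

Whether dropping missing values is appropriate depends on your data; the point here is only that the check at the end confirms the transformation did what we intended.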

2.3: Reversing Codes

To reverse the direction of a Likert scale (or any set of numbers where you want to flip the numbering, for example so that a scale running from 1 to 6 runs from 6 to 1), the formula is

reversed_scale <- (min(scale) + max(scale)) - scale

Normally we just do this by hand. For example, for likelihood of voting, where 6 is the minimum plus the maximum of the scale (this assumes a12 is coded from 1 to 5):

ds$likely_vote <- 6 - ds$a12
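
As a quick illustration of the general formula, here is a minimal sketch on a made-up item. The vector a12 below is invented and assumed to be coded 1 to 5; your own variable may use a different range.

```r
# Hypothetical 1-to-5 Likert item with one missing answer
a12 <- c(1, 2, 3, 4, 5, NA, 2)

# General formula: (min + max) - value; na.rm = TRUE ignores the missing answer
reversed <- (min(a12, na.rm = TRUE) + max(a12, na.rm = TRUE)) - a12

# Cross-tabulate old against new codes: 1 should pair with 5, 2 with 4, and so on
table(original = a12, reversed = reversed)
```

Note that min() and max() use the values actually observed. If nobody chose the lowest or highest category, type the endpoints of the scale by hand instead.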

2.4: Constructing Scale or Index

To create a scale or index, the simplest thing to do is to add the variables that compose the scale together.

For example

pol_knowledge <- q1 + q2 + q3 + q4 + q5 + q6 + q7 + q8 + q9 + q10
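
One thing to watch when adding items is missing data. The sketch below uses made-up quiz answers (the data frame and its three items are invented for illustration) to contrast plain addition with base R's rowSums().

```r
# Hypothetical quiz data: 1 = correct, 0 = incorrect, NA = no answer
ds <- data.frame(q1 = c(1, 0, 1),
                 q2 = c(1, 1, NA),
                 q3 = c(0, 0, 1))

# Adding with + gives NA for any respondent who skipped a question
ds$pol_knowledge <- ds$q1 + ds$q2 + ds$q3

# rowSums() with na.rm = TRUE instead counts a skipped question as 0
ds$pol_knowledge_narm <- rowSums(ds[, c("q1", "q2", "q3")], na.rm = TRUE)

ds
```

Which behaviour you want is a substantive decision: treating a skipped question as incorrect may be reasonable for a knowledge quiz, but not for an attitude scale.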

2.5: Creating Dummy Variables + making factors + reference level

Source SOC830 Lab 11

A dummy variable is a binary variable used in regression analysis to represent subgroups (or categories) of the sample in your study. It is used to estimate the effect of nominal variables. Let’s make a dummy variable for gender. The following code shows the two categories of gender (sex): male and female.

frq(income$sex)
 
## # R: Sex (x) <numeric> 
## # total N=735  valid N=735  mean=1.53  sd=0.50
##  
##  val  label frq raw.prc valid.prc cum.prc
##    1   Male 346   47.07     47.07   47.07
##    2 Female 389   52.93     52.93  100.00
##   NA     NA   0    0.00        NA      NA

Since we want to investigate how much less women earn compared to men, males should be the reference category, which takes the value 0, and females should take the value 1. Let’s recode sex accordingly. In a nutshell, the reference category must be coded 0 in a dummy variable.

income <- rec(income, sex, rec = "1=0;2=1", append = TRUE)

Then, we rename the newly recoded variable (sex_r) to female and set its variable label and value labels.

income <- var_rename(income, sex_r = female)
income$female <- set_label(income$female, label = "Gender")
income$female <- set_labels(income$female, labels = c("Male" = 0, "Female" = 1))

An important step is to convert this dummy variable to a factor, so that R treats it as a categorical variable rather than as a number in regression analysis.

income$female <- to_factor(income$female)

In some cases, you may want to use women instead of men as your reference category. You can change the reference category with the ref_lvl(variable, lvl = value for the reference category) function. Make sure that the variable passed to ref_lvl() has already been converted to a factor. The following code makes a new variable (male) in which 0 is female and 1 is male.

income$male <- ref_lvl(income$female, lvl = 1)
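
To see what the reference category does in a regression, here is a minimal sketch using only base R. Note that it swaps in factor() and relevel() as stand-ins for the rec(), to_factor() and ref_lvl() steps above, and that the wage figures are invented for illustration.

```r
# Hypothetical toy data, invented for illustration only
income <- data.frame(
  wage = c(50, 60, 55, 40, 45, 35),
  sex  = c(1, 1, 1, 2, 2, 2)          # 1 = Male, 2 = Female
)

# Base R stand-in for rec() + to_factor(): the first level is the reference category
income$female <- factor(income$sex, levels = c(1, 2), labels = c("Male", "Female"))
fit_female <- lm(wage ~ female, data = income)
coef(fit_female)   # "femaleFemale" is the gap for women relative to men

# Base R stand-in for ref_lvl(): make Female the reference category instead
income$male <- relevel(income$female, ref = "Female")
fit_male <- lm(wage ~ male, data = income)
coef(fit_male)     # "maleMale" is the same gap with the opposite sign
```

Changing the reference category does not change the model, only which group the coefficient is compared against.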

2.6: Making Binary and Categorical Variables Factors

2.7: Removing Variables

Source SOC830 Lab 03

To remove unnecessary variables from your data, use the remove_var() function. The general form is

data name <- data name %>% remove_var(variable name)

You can remove multiple variables at once by listing them:

remove_var(var 1, var 2, var 3, var 4, ...)

For example, the following code removes b_year from mydata:

mydata <- mydata %>%
  remove_var(b_year)

In this code, %>% is called a pipe. Think of it simply as “then”. Thus, the code means: take mydata, then remove the variable b_year, and store the result as the new mydata.
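
Here is a minimal sketch of the same step on a made-up data frame, so you can check the result with names(). The data frame toy and its columns are invented for illustration; the sketch assumes dplyr and sjmisc are installed.

```r
library(dplyr)
library(sjmisc)

# Hypothetical toy data, invented for illustration only
toy <- data.frame(id = 1:3,
                  b_year = c(1990, 1985, 2001),
                  age = c(29, 34, 18))

# Take toy, then remove b_year; the result is stored under a new name here
# so that the original data frame stays available for comparison
toy_small <- toy %>% remove_var(b_year)
names(toy_small)

# dplyr's select() with a minus sign drops the same column
names(select(toy, -b_year))
```

Saving the result under a new name, as in this sketch, is a cautious habit while you are still checking your work; overwriting mydata is fine once you are sure.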

2.8: Creating New Variable Label and Value Labels

Source SOC830 Lab 03

The general form data name$variable name <- set_label(data name$variable name, label = "variable label") assigns a variable label to the specified variable. After assigning it, the get_label() function will show the newly assigned variable label.

For example, the following code assigns variable labels to four variables in mydata:
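
A minimal sketch of setting a label and then reading it back, on a made-up data frame (toy is invented for illustration; the sketch assumes sjlabelled is installed):

```r
library(sjlabelled)

# Hypothetical toy data, invented for illustration only
toy <- data.frame(age = c(23, 35, 41))

# Assign the variable label, then read it back to confirm it was stored
toy$age <- set_label(toy$age, label = "Age")
get_label(toy$age)
```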

mydata$sex <- set_label(mydata$sex, label = "Gender")
mydata$age <- set_label(mydata$age, label = "Age")
mydata$polorient <- set_label(mydata$polorient, label = "Political Orientation")
mydata$class <- set_label(mydata$class, label = "Social Class")

You now need to undertake a close analysis of the article, the official codebook, and your own r_codebook to make sure all your variables are correctly coded and, if not, decide how you are going to change them.

2.9: How to Create New Value Labels

The R code for creating value labels is:

data name$variable name <- set_labels(data name$variable name, labels = c("category 1" = value 1, "category 2" = value 2, ...))

For example:

mydata$sex <- set_labels(mydata$sex, labels = c("male" = 1, "female" = 2))

mydata$polorient <- set_labels(mydata$polorient, 
                               labels = c("Far left" = 1,
                                          "Left" = 2,
                                          "Center" = 3,
                                          "Right" = 4,
                                          "Far right" = 5))

mydata$class <- set_labels(mydata$class, labels = c("Lower class" = 1,
                                                   "Working class" = 2,
                                                   "Lower middle class" = 3,
                                                   "Middle class" = 4,
                                                   "Upper middle class" = 5,
                                                   "Upper class" = 6))

Last updated on 24 August, 2019 by Dr Nicholas Harrigan (nicholas.harrigan@mq.edu.au)