2  Data Structures in R

2.1 Introduction

This chapter introduces R's basic data containers and data types. We discuss the data structures that R provides for storing and manipulating data, together with the properties that matter for effective data management.

2.2 Containers

R provides several data structures for storing collections of data. The most commonly used are vectors, matrices, arrays, lists, and data frames. In this section, we focus on vectors, matrices, arrays, and lists; data frames are covered in Chapter 7.

2.2.1 Vectors

In R, we use a vector to store an ordered collection of objects of the same type. We can create a vector using one of the following functions:

  • c() is the most general way to create a vector.
  • : creates a sequence of integers.
  • seq() is used to generate regular sequences.
  • rep() is useful to generate vectors of replicated elements.

Some illustrative examples are given below.

# Creating a vector of numbers
v1 = c(2.5, 4, 7.3, 0.1)
v1
[1] 2.5 4.0 7.3 0.1
# Creating a character vector
v2 = c("A", "B", "C", "D")
v2
[1] "A" "B" "C" "D"
# Creating an integer sequence
v3 = -3:3
v3
[1] -3 -2 -1  0  1  2  3
# Creating a sequence with seq()
seq(0, 2, by = 0.5) # increment by 0.5
[1] 0.0 0.5 1.0 1.5 2.0
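# Creating a sequence of a given length with seq()
seq(0, 2, length.out = 6) # length of the sequence is 6
[1] 0.0 0.4 0.8 1.2 1.6 2.0
# Creating a replicated vector with rep(): repeat each element twice
rep(1:5, each = 2)
 [1] 1 1 2 2 3 3 4 4 5 5
# Creating a replicated vector with rep(): repeat the whole vector twice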
rep(1:5, times = 2)
 [1] 1 2 3 4 5 1 2 3 4 5
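
Because all elements of a vector share one type, c() converts mixed inputs to a common type. The following is a minimal sketch of that behavior using base R only; the object names v_num and v_chr are ours and purely illustrative.

```r
# c() coerces mixed inputs to a single common type
v_num <- c(1.5, 2L, TRUE)   # integer and logical values become double
v_chr <- c(1, "a", TRUE)    # everything becomes character

typeof(v_num)               # "double"
typeof(v_chr)               # "character"
v_chr                       # "1" "a" "TRUE"

# rep() can combine each and times in one call
r <- rep(1:2, each = 2, times = 2)
r                           # 1 1 2 2 1 1 2 2
```

Checking typeof() after building a vector is a quick way to detect an unintended coercion.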

To access elements of a vector, we use [ ] with a scalar or vector of positions. A minus sign before the positions returns the vector without the indexed elements; the original vector is not modified unless the result is assigned back.

# Referencing elements of a vector
x <- c(4, 7, 2, 10, 1, 0)
x[4] # return the fourth element
[1] 10
# Return elements from index 1 to 3
x[1:3] 
[1] 4 7 2
# Return elements at indices 2, 5 and 6
x[c(2,5,6)] 
[1] 7 1 0
# Remove the third element from x
x[-3] 
[1]  4  7 10  1  0
# Remove multiple elements from x
x[-c(4,5)] 
[1] 4 7 2 0
# Logical referencing
x[x>4] # return elements bigger than 4
[1]  7 10
# Modifying elements of a vector
x[3] <- 999 
x
[1]   4   7 999  10   1   0

The following functions return the indices of vector elements.

  • which(): returns the positions (indices) of the elements that satisfy a given condition.
  • which.max(): returns the location of the (first) maximum element of a numeric vector.
  • which.min(): returns the location of the (first) minimum element of a numeric vector.
  • match(): returns, for each element of its first argument, the position of its first match in the second argument.

x <- c(4, 7, 2, 10, 1, 0)
x>=4 # return a logical vector of TRUE and FALSE
[1]  TRUE  TRUE FALSE  TRUE FALSE FALSE
# Return indices of elements satisfying the condition
which(x>=4) # return indices
[1] 1 2 4
# Return indices of the maximum element
which.max(x) 
[1] 4
# Return the maximum element using which.max()
x[which.max(x)] # return the first maximum element
[1] 10
# Return the maximum element using max()
max(x)
[1] 10
# Using match()
y <- rep(1:5, times=5:1)
y
 [1] 1 1 1 1 1 2 2 2 2 3 3 3 4 4 5
match(1:5, y) # first position in y of each element of 1:5
[1]  1  6 10 13 15
# Return unique elements
unique(y)
[1] 1 2 3 4 5
match(unique(y), y) # first position in y of each unique element of y
[1]  1  6 10 13 15

When vectors are used in math expressions, the operations are performed element-wise.

# Element-wise operations
x = c(4, 7, 2, 10, 1, 0)
y = x^2 + 1
y
[1]  17  50   5 101   2   1
x*y # element-wise multiplication
[1]   68  350   10 1010    2    0
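
Element-wise operations also work when the operands differ in length, because R recycles the shorter vector. Recycling is not discussed in the text above, so the following is offered only as a brief sketch of standard base R behavior.

```r
x <- c(4, 7, 2, 10, 1, 0)

# A scalar is recycled across all elements
x + 1          # 5 8 3 11 2 1

# A length-2 vector is recycled three times: 1, 10, 1, 10, 1, 10
x * c(1, 10)   # 4 70 2 100 1 0
```

R issues a warning when the longer length is not a multiple of the shorter one, which usually signals a mistake.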

In Table 2.1, we provide some useful functions for vector operations.

Table 2.1: Useful functions for vectors
Function                      Description
sum(x), prod(x)               Sum/product of the elements of x
cumsum(x), cumprod(x)         Cumulative sum/product of the elements of x
min(x), max(x)                Minimum/maximum element of x
mean(x), median(x)            Mean/median of x
var(x), sd(x)                 Variance/standard deviation of x
cov(x, y), cor(x, y)          Covariance/correlation of x and y
range(x)                      Range of x
quantile(x)                   Quantiles of x for the given probabilities
fivenum(x)                    Five number summary of x
length(x)                     Number of elements in x
unique(x)                     Unique elements of x
rev(x)                        Reverse the elements of x
sort(x)                       Sort the elements of x
which(x)                      Indices of TRUEs in a logical vector
which.max(x), which.min(x)    Index of the max/min element of x
match(x, y)                   First position in y of each element of x
union(x, y)                   Union of x and y
intersect(x, y)               Intersection of x and y
setdiff(x, y)                 Elements of x that are not in y
setequal(x, y)                Do x and y contain the same elements?

Below, we provide some illustrative examples.

# Useful functions for vectors
x <- c(4,7,2,10,1,0)
y <- 3*x^2 + 1
y
[1]  49 148  13 301   4   1
sum(x) # sum of elements
[1] 24
range(x) # range of elements
[1]  0 10
length(x) # number of elements
[1] 6
rev(x) # reverse the elements
[1]  0  1 10  2  7  4
sort(x) # sort in increasing order
[1]  0  1  2  4  7 10
sort(x, decreasing = TRUE) # sort in decreasing order
[1] 10  7  4  2  1  0
which(x==7) # return indices
[1] 2
union(x,y) # union of x and y
 [1]   4   7   2  10   1   0  49 148  13 301
setdiff(x,y) # elements in x but not in y
[1]  7  2 10  0
intersect(x,y) # intersection of x and y
[1] 4 1
setequal(x,y) # do x and y contain the same elements?
[1] FALSE
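
Several functions in Table 2.1 are not exercised above. As a small supplementary sketch, the summary functions can be applied to the same vector x; the values in the comments were worked out by hand for this particular x.

```r
x <- c(4, 7, 2, 10, 1, 0)

cumsum(x)                            # 4 11 13 23 24 24
cumprod(x)                           # 4 28 56 560 560 0
mean(x)                              # 4
median(x)                            # 3
var(x)                               # 14.8
quantile(x, probs = c(0.25, 0.75))   # 1.25 and 6.25 with the default method
fivenum(x)                           # 0 1 3 7 10
```

Note that fivenum() uses hinges, so its second and fourth values can differ from the quartiles returned by quantile().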

2.2.2 Matrices

A matrix is a two-dimensional generalization of a vector. To create a matrix, we use the function matrix(), with the syntax

matrix(data=NA, nrow=1, ncol=1, byrow = FALSE, dimnames = NULL)

The arguments are as follows:

  • data is a vector that gives data to fill the matrix
  • nrow is the desired number of rows
  • ncol is the desired number of columns
  • byrow is FALSE by default, which means the matrix is filled by columns. If byrow = TRUE, the matrix is filled by rows.
  • dimnames is an optional list of length 2 giving the row and column names, respectively.

Below, we provide some illustrative examples.

# Creating a matrix
y = matrix(nrow = 3, ncol = 4)
y
     [,1] [,2] [,3] [,4]
[1,]   NA   NA   NA   NA
[2,]   NA   NA   NA   NA
[3,]   NA   NA   NA   NA
# Creating a matrix
x = matrix(c(5,0,6,1,3,5,9,5,7,1,5,3), 
           nrow = 3, ncol = 4, 
           byrow = TRUE,
           dimnames = list(rows = c("r.1", "r.2", "r.3"),
                           cols = c("c.1", "c.2", "c.3", "c.4")))
x
     cols
rows  c.1 c.2 c.3 c.4
  r.1   5   0   6   1
  r.2   3   5   9   5
  r.3   7   1   5   3

Some useful functions for matrices are given below.

class(x) # class of x
[1] "matrix" "array" 
colnames(x) # access to column names
[1] "c.1" "c.2" "c.3" "c.4"
rownames(x) # access to row names
[1] "r.1" "r.2" "r.3"
rownames(x)=c("A","B","C") # change the row names
dimnames(x) # access to both row and column names
$rows
[1] "A" "B" "C"

$cols
[1] "c.1" "c.2" "c.3" "c.4"
dimnames(x)$rows # access to row names
[1] "A" "B" "C"
dimnames(x)$cols # access to column names
[1] "c.1" "c.2" "c.3" "c.4"
dim(x) # dimensions of x
[1] 3 4
nrow(x) # number of rows
[1] 3
ncol(x) # number of columns
[1] 4

The elements of a matrix can be referenced using [ ], just as with vectors, but now with two indices: one for rows and one for columns.

x = matrix(c(5,0,6,1,3,5,9,5,7,1,5,3), 
           nrow = 3, ncol = 4)
x
     [,1] [,2] [,3] [,4]
[1,]    5    1    9    1
[2,]    0    3    5    5
[3,]    6    5    7    3
x[2, 3] # Element in Row 2, Column 3
[1] 5
x[1, ] # Row 1, all Columns
[1] 5 1 9 1
x[ , 2] # All Rows, Column 2
[1] 1 3 5
x[c(1, 3), ] # Rows 1 and 3, all columns
     [,1] [,2] [,3] [,4]
[1,]    5    1    9    1
[2,]    6    5    7    3

When matrices are used in math expressions, the operations are performed element-wise.

  • For matrix multiplication use the %*% operator.
  • If a vector is used in matrix multiplication, it will be coerced to either a row or column matrix to make the arguments conformable.
  • Using %*% on two vectors will return the inner product as a matrix and not a scalar.

A = matrix(1:4, nrow = 2)
B = matrix(1, nrow = 2, ncol = 2)
A*B # element-wise multiplication
     [,1] [,2]
[1,]    1    3
[2,]    2    4
A%*%B # matrix multiplication
     [,1] [,2]
[1,]    4    4
[2,]    6    6
y = 1:3
y%*%y # inner product as a matrix
     [,1]
[1,]   14
A/c(y%*%y) 
           [,1]      [,2]
[1,] 0.07142857 0.2142857
[2,] 0.14285714 0.2857143
# A/(y%*%y) # Error: non-conformable arrays
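
The coercion of a vector to a row or column matrix can be seen directly by multiplying on either side. This is a minimal sketch; the names v, Av, and vA are ours.

```r
A <- matrix(1:4, nrow = 2)
v <- 1:2

Av <- A %*% v   # v is treated as a 2 x 1 column; entries are 7 and 10
vA <- v %*% A   # v is treated as a 1 x 2 row; entries are 5 and 11

dim(Av)         # 2 1
dim(vA)         # 1 2

# drop() removes the matrix dimensions of the 1 x 1 inner product
drop(v %*% v)   # 5
```

Using drop() is an alternative to the c() wrapper used above when a plain scalar is needed.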

We use the apply() function to apply a function to the margins of a matrix, array, or data frame. The syntax is

apply(X, MARGIN, FUN, ...)

where

  • X is a matrix, array or data frame,
  • MARGIN is a vector of subscripts indicating which margins to apply the function to (1=rows, 2=columns, c(1,2)=rows and columns),
  • FUN is the function to be applied,
  • ... stands for optional arguments for FUN.

# Using apply function
x = matrix(1:12, nrow = 3, ncol = 4)
apply(x, MARGIN=1, sum)  # Row sums
[1] 22 26 30
apply(x, 2, mean) # Column means
[1]  2  5  8 11
# Handling missing data with apply
x[1,1] <- NaN
apply(x, 2, mean, na.rm=TRUE) # Column means ignoring NaN
[1]  2.5  5.0  8.0 11.0
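
FUN does not have to be a named function. The sketch below, which assumes nothing beyond base R, passes an anonymous function to apply() and shows that a FUN returning more than one value yields a matrix; it also compares apply() with the dedicated helpers rowSums() and colMeans().

```r
x <- matrix(1:12, nrow = 3, ncol = 4)

# Anonymous function: spread (max minus min) of each row
row_spread <- apply(x, 1, function(r) max(r) - min(r))
row_spread                 # 9 9 9

# FUN returning two values gives a 2 x 4 matrix (min and max per column)
col_range <- apply(x, 2, range)
dim(col_range)             # 2 4

# Dedicated helpers give the same numbers as apply() with sum/mean
all.equal(rowSums(x), apply(x, 1, sum))    # TRUE
all.equal(colMeans(x), apply(x, 2, mean))  # TRUE
```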

In Table 2.2, we list some useful functions for matrix operations.

Table 2.2: Useful functions for matrices
Function                      Description
t(A)                          Transpose of A
det(A)                        Determinant of A
solve(A, b)                   Solves the equation Ax = b for x
solve(A)                      Matrix inverse of A
MASS::ginv(A)                 Generalized inverse of A (MASS package)
eigen(A)                      Eigenvalues and eigenvectors of A
chol(A)                       Cholesky factorization of A
diag(n)                       Creates an n by n identity matrix
diag(A)                       Returns the diagonal elements of a matrix A
diag(x)                       Create a diagonal matrix from a vector x
lower.tri(A), upper.tri(A)    Matrix of logicals indicating lower/upper triangular matrix
apply(A, MARGIN, FUN)         Apply a function to the margins of a matrix
rbind(...)                    Combines arguments by rows
cbind(...)                    Combines arguments by columns
dim(A)                        Dimensions of A
nrow(A), ncol(A)              Number of rows/columns of A
colnames(A), rownames(A)      Get or set the column/row names of A
dimnames(A)                   Get or set the dimension names of A
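
As a brief illustration of several functions in Table 2.2, the sketch below uses a small symmetric positive definite matrix chosen by us so that chol() applies; the names A, b, x_sol, A_inv, and R are illustrative.

```r
A <- matrix(c(2, 1, 1, 3), nrow = 2)   # symmetric, positive definite
b <- c(1, 2)

t(A)                    # transpose (equal to A here because A is symmetric)
det(A)                  # 5
x_sol <- solve(A, b)    # solves A x = b; gives 0.2 0.6
A_inv <- solve(A)       # matrix inverse
A %*% A_inv             # identity matrix, up to rounding error
diag(A)                 # 2 3
diag(2)                 # 2 x 2 identity matrix

R <- chol(A)            # upper triangular factor with t(R) %*% R equal to A

dim(cbind(A, b))        # 2 3: b appended as a column
dim(rbind(A, b))        # 3 2: b appended as a row
```

For linear systems, solve(A, b) is generally preferable to computing solve(A) %*% b, since it avoids forming the inverse explicitly.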

2.2.3 Arrays

An array is a multi-dimensional generalization of a vector. To create an array, we use array(data = NA, dim = length(data), dimnames = NULL), where data is a vector that provides the values to fill the array; dim specifies the dimensions of the array (a vector of one or more elements giving the maximum indices in each dimension); and dimnames defines the names of the dimensions (a list with one component for each dimension, either NULL or a character vector of the length specified by dim for that dimension).

Arrays are filled in column-major order, as matrices are: the first index varies fastest, then the second, and so on. Math operations on arrays are performed element-wise, as with vectors and matrices. Also, all elements of an array must be of the same type.

# Creating an array
w = array(1:24, 
          dim = c(4, 3, 2),
          dimnames = list(c("A","B","C","D"), c("X","Y","Z"), c("N","M")))
w
, , N

  X Y  Z
A 1 5  9
B 2 6 10
C 3 7 11
D 4 8 12

, , M

   X  Y  Z
A 13 17 21
B 14 18 22
C 15 19 23
D 16 20 24

The option dim = c(4, 3, 2) specifies that the array has 4 rows, 3 columns, and 2 “pages” (or layers). Thus, the array w has 4 x 3 x 2 = 24 elements. We can think of c("N","M") as the names of the pages. Thus, w[ , , "N"] returns the first page and w[ , , "M"] returns the second page.

# Referencing elements of an array
w[ , , "N"] # First page
  X Y  Z
A 1 5  9
B 2 6 10
C 3 7 11
D 4 8 12
w[2, , ] # Second row, all columns, all pages
   N  M
X  2 14
Y  6 18
Z 10 22
w[ , "Y", ] # All rows, second column, all pages
  N  M
A 5 17
B 6 18
C 7 19
D 8 20
w[1, 2, 2] # Element in Row 1, Column 2, Page 2
[1] 17
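
The apply() function introduced for matrices also works on arrays, with MARGIN = 3 referring to the pages. The following sketch rebuilds w without dimension names and also checks the column-major fill order; it relies only on base R.

```r
w <- array(1:24, dim = c(4, 3, 2))

dim(w)                    # 4 3 2
length(w)                 # 24

# Sum over each page (third margin)
apply(w, 3, sum)          # 78 222

# Sum across pages for every (row, column) cell: a 4 x 3 matrix
cell_sums <- apply(w, c(1, 2), sum)
cell_sums[1, 1]           # 14, that is 1 + 13

# Column-major storage: element [1, 2, 2] is the 17th element overall
w[1, 2, 2] == w[17]       # TRUE
```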

2.2.4 Lists

A list is a general form of a vector whose components can be of different types and dimensions. To create a list, we use list(...). We can name the elements of a list using the name = value syntax. Arguments can be specified with or without names. In the example below, we create a list with three elements: the first element is named num, the second element has no name, and the third element is named identity.

# Creating a list
x = list(num = c(1,2,3), "Econometrics", identity=diag(2))
x
$num
[1] 1 2 3

[[2]]
[1] "Econometrics"

$identity
     [,1] [,2]
[1,]    1    0
[2,]    0    1
# Names of list elements
names(x)
[1] "num"      ""         "identity"

We use [ ], [[ ]] and $ to reference elements of a list. Below, we provide some examples of referencing elements of the list x.

x[[2]]     # the second element of x
[1] "Econometrics"
x[["num"]] # element named "num"
[1] 1 2 3
x$identity # element named "identity"
     [,1] [,2]
[1,]    1    0
[2,]    0    1
x[[3]][1,] # first row of the third element 
[1] 1 0
x[1:2]     # a sublist from the first two elements
$num
[1] 1 2 3

[[2]]
[1] "Econometrics"
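
The difference between [ ] and [[ ]] is easy to overlook: single brackets return a sublist, whereas double brackets (and $) return the element itself. The sketch below makes this explicit and also shows one common way to add and remove elements; the element name note is ours.

```r
x <- list(num = c(1, 2, 3), "Econometrics", identity = diag(2))

class(x[1])       # "list": a sublist of length one
class(x[[1]])     # "numeric": the element itself
x[[1]][2]         # 2
x$num[2]          # 2, same element accessed by name

# Adding and removing elements
x$note <- "added later"
length(x)         # 4
x$note <- NULL    # assigning NULL removes the element
length(x)         # 3
```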

In Table 2.3, we provide some useful functions for lists.

Table 2.3: Useful functions for lists
Function       Description
lapply()       Apply a function to each element of a list; returns a list
sapply()       Same as lapply(), but returns a vector or matrix by default
vapply()       Similar to sapply(), but has a pre-specified type of return value
replicate()    Repeated evaluation of an expression; useful for replicating lists
unlist(x)      Produce a vector of all the components that occur in x
length(x)      Number of objects in x
names(x)       Names of the objects in x
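
To make the differences among the functions in Table 2.3 concrete, here is a short sketch on a small list of numeric vectors. The list lst and its element names are made up for illustration.

```r
lst <- list(a = 1:5, b = c(2.5, 3.5), c = 10)

means_list <- lapply(lst, mean)   # a list with elements a = 3, b = 3, c = 10
means_vec  <- sapply(lst, mean)   # a named numeric vector with the same values

# vapply() checks that each result is a single integer
lens <- vapply(lst, length, integer(1))
lens                              # a = 5, b = 2, c = 1

flat <- unlist(lst)               # one numeric vector with 8 elements
length(flat)                      # 8

reps <- replicate(3, "x")         # "x" "x" "x"
```

vapply() is the safer choice inside functions, because it fails with an error if a result does not have the declared type and length.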

2.3 Data types

2.3.1 Numeric data

In R, we use numeric to represent real numbers. Numeric data can be either double or integer, but in practice numeric data is almost always double (type double refers to real numbers). We can use the format() function to format an object for pretty printing. See ?format() for additional arguments.

x = 123.456789
is.numeric(x) # check if x is numeric
[1] TRUE
is.double(x) # check if x is double
[1] TRUE
is.integer(x) # check if x is integer
[1] FALSE
format(13.7, nsmall = 3) # Minimum number of digits to the right of the decimal point
[1] "13.700"
format(2^16, scientific = TRUE) # scientific notation
[1] "6.5536e+04"
format(2^16, scientific = FALSE) # fixed notation
[1] "65536"
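
The distinction between double and integer can be inspected with typeof(). In the sketch below, the L suffix creates an integer literal; this is standard R syntax, although it is not used elsewhere in this chapter.

```r
x <- 5            # a plain number is stored as double
y <- 5L           # the L suffix requests an integer

typeof(x)         # "double"
typeof(y)         # "integer"
is.numeric(y)     # TRUE: integers are also numeric
typeof(1:3)       # "integer": the : operator creates integers here

typeof(y / 2L)    # "double": division always returns a double
y %/% 2L          # 2, integer division keeps the integer type
as.integer(3.9)   # 3, conversion truncates toward zero
```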

2.3.2 Booleans

Boolean (or logical) values are represented by the reserved words TRUE and FALSE, written in all caps. The shorthands T and F also work, but they are ordinary variables that can be reassigned, so TRUE and FALSE are safer in code. We can use relational operators to compare numeric values or vectors element-wise; the result of each comparison is a logical value (TRUE or FALSE). In Table 2.4, we list some useful operators and functions for logical and relational operations.

Table 2.4: Useful functions for logical and relational operations
Function           Description
!x                 NOT x
x & y              x AND y element-wise; returns a vector
x && y             x AND y; returns a single value
x | y              x OR y element-wise; returns a vector
x || y             x OR y; returns a single value
xor(x, y)          Exclusive OR of x and y, element-wise
x %in% y           x IN y
x < y              x < y
x > y              x > y
x <= y             x ≤ y
x >= y             x ≥ y
x == y             x = y
x != y             x ≠ y
isTRUE(x)          TRUE if x is TRUE
all(...)           TRUE if all arguments are TRUE
any(...)           TRUE if at least one argument is TRUE
identical(x, y)    Safe and reliable way to test two objects for being EXACTLY equal
all.equal(x, y)    Test if two objects are NEARLY equal

Below, we provide some illustrative examples.

x = 1:10
(x%%2 == 0) | (x > 5) # Even numbers or greater than 5
 [1] FALSE  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
y = 5:15 
x %in% y # Elements of x that are also in y
 [1] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
x[x %in% y] # Elements of x that are also in y
[1]  5  6  7  8  9 10
any(x > 5) # Is at least one element of x greater than 5?
[1] TRUE
all(x > 5) # Are all elements of x greater than 5?
[1] FALSE

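Comparison operators work element-wise, so they may return a vector rather than a single value, and they return NA wherever an element is NA or NaN. In addition, == tests for exact equality, which can fail for floating-point numbers because of rounding error. The functions identical() and all.equal() are better suited to comparing whole objects, as the following example shows.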

x = sqrt(2)
x^2
[1] 2
x^2 == 2 # Returns FALSE due to rounding error
[1] FALSE
identical(x^2, 2) # Returns FALSE because not exactly equal
[1] FALSE
all.equal(x^2, 2) # Returns TRUE because nearly equal
[1] TRUE
all.equal(x^2, 1) # Returns a message indicating the difference
[1] "Mean relative difference: 0.5"
isTRUE(all.equal(x^2, 1)) # Returns FALSE
[1] FALSE
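
The effect of NA on logical operations, and the difference between the element-wise and single-value operators, can be sketched as follows. The vector z is ours and exists only for this illustration.

```r
z <- c(1, NA, 3)

z > 2                      # FALSE NA TRUE
any(z > 2)                 # TRUE: one TRUE is enough
all(z > 0)                 # NA: the missing element leaves the answer open
all(z > 0, na.rm = TRUE)   # TRUE once the NA is removed

c(TRUE, FALSE) & c(TRUE, TRUE)   # element-wise: TRUE FALSE
TRUE && FALSE                    # single value: FALSE
xor(TRUE, FALSE)                 # TRUE
```

The operators && and || are meant for length-one conditions, such as those in if statements.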

2.3.3 Characters

In R, the text data type is called character. We use single or double quotes to create character strings.

s1 = c("A", "B", "C")
class(s1) # check the class of s1
[1] "character"
s2 = 'The world is beautiful'
class(s2) # check the class of s2
[1] "character"

In Table 2.5, we provide some useful functions for character vectors.

Table 2.5: Some useful functions for character vectors
Function                Description
cat()                   Concatenate objects and print to console (\n for newline)
paste()                 Concatenate objects and return a string
print()                 Print an object
substr()                Extract or replace substrings in a character vector
strtrim()               Trim character vectors to specified display widths
strsplit()              Split elements of a character vector according to a substring
grep()                  Search for matches to a pattern within a character vector; returns a vector of the indices that matched
grepl()                 Like grep(), but returns a logical vector
agrep()                 Similar to grep(), but searches for approximate matches
regexpr()               Similar to grep(), but returns the position of the first instance of a pattern within a string
gsub()                  Replace all occurrences of a pattern with a character vector
sub()                   Like gsub(), but only replaces the first occurrence
tolower(), toupper()    Convert to all lower/upper case
noquote()               Print a character vector without quotations
nchar()                 Number of characters
letters, LETTERS        Built-in vector of lower and upper case letters

Below, we provide some illustrative examples.

animals = c("bird", "horse", "fish")
home = c("tree", "barn", "lake")

length(animals) # Number of elements in the vector
[1] 3
nchar(animals) # Number of characters in each string
[1] 4 5 4
cat("Animals:", animals) # Need \n to move cursor to a newline
Animals: bird horse fish
cat(animals, home, "\n") # Joins one vector after the other
bird horse fish tree barn lake 
paste(animals, collapse=" ") # Create one long string of animals
[1] "bird horse fish"
a = paste(animals, home, sep=".") # Pairwise joining of animals and home
a
[1] "bird.tree"  "horse.barn" "fish.lake" 
unlist(strsplit(a, ".", fixed=TRUE)) # Split the strings back
[1] "bird"  "tree"  "horse" "barn"  "fish"  "lake" 
substr(animals, 2, 4) # Get characters 2-4 of each animal
[1] "ird" "ors" "ish"
strtrim(animals, 3) # Print the first three characters
[1] "bir" "hor" "fis"
toupper(animals) # Print animals in all upper case
[1] "BIRD"  "HORSE" "FISH" 

Table 2.5 includes several functions for pattern matching and replacement: grep(), grepl(), agrep(), regexpr(), gsub(), and sub(). We can use special characters in regular expressions to define search patterns. Some common special characters are listed below.

  • ^: Beginning of character string
  • $: End of character string
  • .: Any single character
  • *: Zero or more of the preceding character
  • +: One or more of the preceding character
  • ?: Zero or one of the preceding character
  • {n}: Exactly n of the preceding character
  • [a-c]: Any one of the characters a, b, or c
  • [^a-c]: Any one character other than a, b, or c
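
Before turning to a larger example, the special characters can be tried on a few short strings with grepl(). The vector words is made up for this sketch. Note that inside square brackets a leading ^ negates the character class, whereas outside brackets it anchors the match at the start of the string.

```r
words <- c("cat", "cot", "coat", "ct", "dog")

grepl("^c.t$", words)     # TRUE TRUE FALSE FALSE FALSE: exactly one character between c and t
grepl("^co?t$", words)    # FALSE TRUE FALSE TRUE FALSE: zero or one "o"
grepl("^[a-c]", words)    # TRUE TRUE TRUE TRUE FALSE: starts with a, b, or c
grepl("^[^a-c]", words)   # FALSE FALSE FALSE FALSE TRUE: starts with any other character
grepl("o{2}", c("zoo", "work"))   # TRUE FALSE: exactly two consecutive "o"
```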

To illustrate the use of these functions, we use the built-in colors() function that returns a vector of color names in R.

# First ten colors
colors()[1:10]
 [1] "white"         "aliceblue"     "antiquewhite"  "antiquewhite1"
 [5] "antiquewhite2" "antiquewhite3" "antiquewhite4" "aquamarine"   
 [9] "aquamarine1"   "aquamarine2"  
# Number of colors
length(colors())
[1] 657
colors()[grep("red", colors())] # All colors that contain "red"
 [1] "darkred"         "indianred"       "indianred1"      "indianred2"     
 [5] "indianred3"      "indianred4"      "mediumvioletred" "orangered"      
 [9] "orangered1"      "orangered2"      "orangered3"      "orangered4"     
[13] "palevioletred"   "palevioletred1"  "palevioletred2"  "palevioletred3" 
[17] "palevioletred4"  "red"             "red1"            "red2"           
[21] "red3"            "red4"            "violetred"       "violetred1"     
[25] "violetred2"      "violetred3"      "violetred4"     
colors()[grep("^red", colors())] # Colors that start with "red"
[1] "red"  "red1" "red2" "red3" "red4"
colors()[grep("red$", colors())] # Colors that end with "red"
[1] "darkred"         "indianred"       "mediumvioletred" "orangered"      
[5] "palevioletred"   "red"             "violetred"      
colors()[grep("red.", colors())] # Colors with at least one character after "red"
 [1] "indianred1"     "indianred2"     "indianred3"     "indianred4"    
 [5] "orangered1"     "orangered2"     "orangered3"     "orangered4"    
 [9] "palevioletred1" "palevioletred2" "palevioletred3" "palevioletred4"
[13] "red1"           "red2"           "red3"           "red4"          
[17] "violetred1"     "violetred2"     "violetred3"     "violetred4"    
colors()[grep("^[r-t]", colors())] # Colors that begin with r, s, or t
 [1] "red"          "red1"         "red2"         "red3"         "red4"        
 [6] "rosybrown"    "rosybrown1"   "rosybrown2"   "rosybrown3"   "rosybrown4"  
[11] "royalblue"    "royalblue1"   "royalblue2"   "royalblue3"   "royalblue4"  
[16] "saddlebrown"  "salmon"       "salmon1"      "salmon2"      "salmon3"     
[21] "salmon4"      "sandybrown"   "seagreen"     "seagreen1"    "seagreen2"   
[26] "seagreen3"    "seagreen4"    "seashell"     "seashell1"    "seashell2"   
[31] "seashell3"    "seashell4"    "sienna"       "sienna1"      "sienna2"     
[36] "sienna3"      "sienna4"      "skyblue"      "skyblue1"     "skyblue2"    
[41] "skyblue3"     "skyblue4"     "slateblue"    "slateblue1"   "slateblue2"  
[46] "slateblue3"   "slateblue4"   "slategray"    "slategray1"   "slategray2"  
[51] "slategray3"   "slategray4"   "slategrey"    "snow"         "snow1"       
[56] "snow2"        "snow3"        "snow4"        "springgreen"  "springgreen1"
[61] "springgreen2" "springgreen3" "springgreen4" "steelblue"    "steelblue1"  
[66] "steelblue2"   "steelblue3"   "steelblue4"   "tan"          "tan1"        
[71] "tan2"         "tan3"         "tan4"         "thistle"      "thistle1"    
[76] "thistle2"     "thistle3"     "thistle4"     "tomato"       "tomato1"     
[81] "tomato2"      "tomato3"      "tomato4"      "turquoise"    "turquoise1"  
[86] "turquoise2"   "turquoise3"   "turquoise4"  
colors()[grep("green.*", colors())] # Colors that contain "green" followed by zero or more characters
 [1] "darkgreen"         "darkolivegreen"    "darkolivegreen1"  
 [4] "darkolivegreen2"   "darkolivegreen3"   "darkolivegreen4"  
 [7] "darkseagreen"      "darkseagreen1"     "darkseagreen2"    
[10] "darkseagreen3"     "darkseagreen4"     "forestgreen"      
[13] "green"             "green1"            "green2"           
[16] "green3"            "green4"            "greenyellow"      
[19] "lawngreen"         "lightgreen"        "lightseagreen"    
[22] "limegreen"         "mediumseagreen"    "mediumspringgreen"
[25] "palegreen"         "palegreen1"        "palegreen2"       
[28] "palegreen3"        "palegreen4"        "seagreen"         
[31] "seagreen1"         "seagreen2"         "seagreen3"        
[34] "seagreen4"         "springgreen"       "springgreen1"     
[37] "springgreen2"      "springgreen3"      "springgreen4"     
[40] "yellowgreen"      
colors()[grep("green.+", colors())] # Colors that contain "green" followed by one or more characters
 [1] "darkolivegreen1" "darkolivegreen2" "darkolivegreen3" "darkolivegreen4"
 [5] "darkseagreen1"   "darkseagreen2"   "darkseagreen3"   "darkseagreen4"  
 [9] "green1"          "green2"          "green3"          "green4"         
[13] "greenyellow"     "palegreen1"      "palegreen2"      "palegreen3"     
[17] "palegreen4"      "seagreen1"       "seagreen2"       "seagreen3"      
[21] "seagreen4"       "springgreen1"    "springgreen2"    "springgreen3"   
[25] "springgreen4"   

Finally, we illustrate the use of gsub() and sub() functions for pattern replacement.

places = c("home", "zoo", "school", "work", "park")
gsub("o", "O", places) # Replace all "o" with "O"
[1] "hOme"   "zOO"    "schOOl" "wOrk"   "park"  
sub("o", "O", places)  # Replace the first "o" with "O"
[1] "hOme"   "zOo"    "schOol" "wOrk"   "park"  
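
The remaining matching functions from Table 2.5 differ mainly in what they return. As a short sketch on the same places vector, grepl() returns a logical vector, regexpr() returns the position of the first match (or -1 when there is none), and grep() with value = TRUE returns the matching strings rather than their indices.

```r
places <- c("home", "zoo", "school", "work", "park")

grepl("o", places)                   # TRUE TRUE TRUE TRUE FALSE
as.integer(regexpr("o", places))     # 2 2 4 2 -1
grep("oo", places)                   # 2 3
grep("oo", places, value = TRUE)     # "zoo" "school"
```

The as.integer() wrapper only strips the extra attributes (such as match lengths) that regexpr() attaches to its result.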

2.3.4 Factors

In R, we use the factor data type to represent ordered or unordered categorical variables. The function factor() is used to create factor type variables.

factor(rep(1:2, 4), labels=c("BA", "BS")) # Unordered factor variable
[1] BA BS BA BS BA BS BA BS
Levels: BA BS
factor(rep(1:3, 4), labels=c("low", "med", "high"), ordered=TRUE) # Ordered factor variable
 [1] low  med  high low  med  high low  med  high low  med  high
Levels: low < med < high

In the following example, we load the STAR.csv dataset that contains a variable gender indicating the gender of students. The initial class of gender is character. We convert gender to a factor variable with labels “0” and “1” using the factor() function.

# Load STAR.csv dataset
star = read.csv("data/STAR.csv")
class(star$gender) # Check the class of gender
[1] "character"
star$gender0 = factor(star$gender, labels=c("0", "1"), ordered=FALSE) # Convert to factor variable
class(star$gender0) # Check the class again
[1] "factor"
head(star[, c("gender", "gender0")]) # View the first few rows
     gender gender0
1122 female       0
1137 female       0
1143 female       0
1160   male       1
1183   male       1
1195   male       1

In Table 2.6, we provide some useful functions for factor variables.

Table 2.6: Useful functions for factor variables
Function           Description
levels(x)          Retrieve or set the levels of x
nlevels(x)         Returns the number of levels in x
relevel(x, ref)    Levels of x are reordered so that the level specified by ref is first
reorder()          Reorders levels based on the values of a second variable
gl()               Generate factors by specifying the pattern of their levels
cut(x, breaks)     Divides the range of x into intervals (factors) determined by breaks

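Converting a factor variable to a numeric variable requires care. Calling as.numeric(f) on a factor returns the internal integer codes of the levels (1, 2, ...), not the values shown by the labels. To recover the numeric values of the labels, we convert to character first with as.numeric(as.character(f)), as illustrated below.

star$gender1 = as.numeric(star$gender0) # Returns the internal level codes (1, 2), not the labels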
class(star$gender1)
[1] "numeric"
head(star[, c("gender", "gender1")]) # View the first few rows
     gender gender1
1122 female       1
1137 female       1
1143 female       1
1160   male       2
1183   male       2
1195   male       2
star$gender2 = as.numeric(as.character(star$gender0)) # Convert factor to numeric correctly
head(star[, c("gender", "gender1", "gender2")]) # View the first few rows
     gender gender1 gender2
1122 female       1       0
1137 female       1       0
1143 female       1       0
1160   male       2       1
1183   male       2       1
1195   male       2       1
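
The same pitfall can be reproduced without the STAR.csv file. The following self-contained sketch uses a made-up factor whose labels are numbers stored as text.

```r
f <- factor(c("10", "20", "10", "30"))

levels(f)                      # "10" "20" "30"
as.numeric(f)                  # 1 2 1 3: internal level codes, not the labels
as.numeric(as.character(f))    # 10 20 10 30: the values of the labels
as.numeric(levels(f))[f]       # 10 20 10 30: same result, converting each level only once
```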

In the following examples, we illustrate some of the functions listed in Table 2.6.

f = gl(3, 2, labels=paste("ECN", 1:3, sep="_")) # Create a factor variable 
f
[1] ECN_1 ECN_1 ECN_2 ECN_2 ECN_3 ECN_3
Levels: ECN_1 ECN_2 ECN_3
levels(f) # Get the levels of f
[1] "ECN_1" "ECN_2" "ECN_3"
nlevels(f) # Get the number of levels of f
[1] 3
relevel(f, "ECN_2") # Reorder levels so that "ECN_2" is first
[1] ECN_1 ECN_1 ECN_2 ECN_2 ECN_3 ECN_3
Levels: ECN_2 ECN_1 ECN_3
f = gl(3, 2, labels=1:3) # Create a factor variable 
as.numeric(levels(f))[f] # Convert factor to numeric correctly
[1] 1 1 2 2 3 3
# Cut a numeric vector into intervals (x is random, so the output varies between runs)
x = runif(10)
cut(x, 3) # Cut x into three intervals
 [1] (0.0434,0.246] (0.0434,0.246] (0.0434,0.246] (0.0434,0.246] (0.246,0.448] 
 [6] (0.448,0.65]   (0.448,0.65]   (0.0434,0.246] (0.448,0.65]   (0.246,0.448] 
Levels: (0.0434,0.246] (0.246,0.448] (0.448,0.65]
cut(x, c(0,.25,.5,.75,1)) # Cut x at the given cut points
 [1] (0,0.25]   (0,0.25]   (0,0.25]   (0,0.25]   (0.25,0.5] (0.5,0.75]
 [7] (0.5,0.75] (0,0.25]   (0.5,0.75] (0.25,0.5]
Levels: (0,0.25] (0.25,0.5] (0.5,0.75] (0.75,1]
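
Because runif() produces different values on every run, a reproducible variant of the cut() example may be helpful. The sketch below uses a fixed vector and the labels argument of cut() to name the intervals; the vector scores and the labels Q1 to Q4 are made up for illustration.

```r
scores <- c(0.05, 0.30, 0.55, 0.80, 0.25)

g <- cut(scores,
         breaks = c(0, 0.25, 0.5, 0.75, 1),
         labels = c("Q1", "Q2", "Q3", "Q4"))

as.character(g)   # "Q1" "Q2" "Q3" "Q4" "Q1"
table(g)          # counts per interval: 2 1 1 1
```

Intervals are closed on the right by default, which is why 0.25 falls into the first interval (0, 0.25].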

2.3.5 Dates and Times

We use the Date class for calendar dates, which R prints in year-month-day format. We can add days to or subtract days from a date, and we can compare dates using relational operators. In Table 2.7, we provide some useful functions for date objects.

Table 2.7: Useful functions for date objects
Function                            Description
Sys.Date()                          Current date
as.Date()                           Convert a character string to a date object
format.Date()                       Change the format of a date object
seq.Date()                          Generate sequence of dates
cut.Date()                          Cut dates into intervals
weekdays(), months(), quarters()    Extract parts of a date object
julian()                            Number of days since a given origin

The Date suffix is optional for calling format.Date(), seq.Date() and cut.Date(), but is necessary for viewing the appropriate documentation. To convert a character string to a date object, we use as.Date(x, format), where x is a character vector representing dates and format is a character string representing the date format of x. In Table 2.8, we list some common conversion specifications for date formats.

Table 2.8: Common date format specifiers
Specifier    Description
%a           Abbreviated weekday name
%A           Full weekday name
%d           Day of the month
%B           Full month name
%b           Abbreviated month name
%m           Numeric month (01-12)
%y           Year without century
%Y           Year with century

Below, we provide some illustrative examples.

d1 = c("5-01-2008", "19-08-2008", "2-02-2009", "29-09-2009")
dates1 = as.Date(d1, format = "%d-%m-%Y") # Convert to date object
class(dates1) # Check the class
[1] "Date"
dates1 # View the date object
[1] "2008-01-05" "2008-08-19" "2009-02-02" "2009-09-29"

Here is an example with dates written in year/month/day format.

d2 = c("2008/01/05", "2008/08/19", "2009/02/02", "2009/09/29")
dates2 = as.Date(d2, format="%Y/%m/%d")
class(dates2) # Check the class
[1] "Date"
dates2 # View the date object
[1] "2008-01-05" "2008-08-19" "2009-02-02" "2009-09-29"

To create a sequence of dates, we can use seq.Date(from, to, by, length.out = NULL), where

  • from, to: Start and ending date objects
  • by: A character string containing one of "day", "week", "month", "quarter" or "year", optionally preceded by an integer and a space (as in "3 days") or followed by "s"
  • length.out: Integer, desired length of the sequence

Below, we provide some illustrative examples.

seq.Date(as.Date("2011/1/1"), as.Date("2011/1/31"), by = "week")
[1] "2011-01-01" "2011-01-08" "2011-01-15" "2011-01-22" "2011-01-29"
seq.Date(as.Date("2011/1/1"), as.Date("2011/1/31"), by = "3 days")
 [1] "2011-01-01" "2011-01-04" "2011-01-07" "2011-01-10" "2011-01-13"
 [6] "2011-01-16" "2011-01-19" "2011-01-22" "2011-01-25" "2011-01-28"
[11] "2011-01-31"
seq.Date(as.Date("2011/1/1"), by = "week", length.out = 10)
 [1] "2011-01-01" "2011-01-08" "2011-01-15" "2011-01-22" "2011-01-29"
 [6] "2011-02-05" "2011-02-12" "2011-02-19" "2011-02-26" "2011-03-05"

To divide a sequence of dates into levels, we can use cut.Date(x, breaks, start.on.monday = TRUE).

jan = seq.Date(as.Date("2011/1/1"), as.Date("2011/1/31"), by="days")
cut(jan, breaks="weeks")
 [1] 2010-12-27 2010-12-27 2011-01-03 2011-01-03 2011-01-03 2011-01-03
 [7] 2011-01-03 2011-01-03 2011-01-03 2011-01-10 2011-01-10 2011-01-10
[13] 2011-01-10 2011-01-10 2011-01-10 2011-01-10 2011-01-17 2011-01-17
[19] 2011-01-17 2011-01-17 2011-01-17 2011-01-17 2011-01-17 2011-01-24
[25] 2011-01-24 2011-01-24 2011-01-24 2011-01-24 2011-01-24 2011-01-24
[31] 2011-01-31
6 Levels: 2010-12-27 2011-01-03 2011-01-10 2011-01-17 ... 2011-01-31

Recall that date objects can be used in arithmetic operations. Specifically,

  • Days can be added to or subtracted from a date.
  • One date can be subtracted from another to obtain the time difference between them.
  • Dates can be compared using logical operators.

Below, we provide some illustrative examples.

jan1 = as.Date("2011/1/1")
(jan8 = jan1 + 7) # Add 7 days to 2011/1/1
[1] "2011-01-08"
jan1 - 14 # Subtract 2 weeks from 2011/1/1
[1] "2010-12-18"
jan8 - jan1 # Number of days between 2011/1/1 and 2011/1/8
Time difference of 7 days
jan8 > jan1 # Compare dates
[1] TRUE
# Use format to extract parts of a date object or change the appearance
format.Date(jan8, "%Y")
[1] "2011"
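format.Date(jan8, "%b-%d") # Month abbreviation depends on the locale (the output below is from a Turkish locale, where "Oca" stands for January)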
[1] "Oca-08"
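
Since month and weekday names depend on the locale, the following sketch sticks to numeric format specifiers, whose output should be the same on any system. It also shows how to turn a date difference into a plain number; the object name d is ours.

```r
d <- as.Date("2011-01-08")

format(d, "%d/%m/%y")                   # "08/01/11"
format(d, "%Y-%m")                      # "2011-01"

# A difference of two dates is a difftime object; as.numeric() gives the number of days
as.numeric(d - as.Date("2011-01-01"))   # 7

# First day of three consecutive months
m <- seq(as.Date("2011-01-01"), by = "month", length.out = 3)
format(m, "%Y-%m-%d")                   # "2011-01-01" "2011-02-01" "2011-03-01"

# julian() counts days since the origin, which is 1970-01-01 by default
julian(d)
```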

2.3.6 Missing Data

Missing data in R are represented by NA. Built-in functions do not handle missing data uniformly. For example, mean() ignores NA only if na.rm=TRUE, whereas which() treats NA as FALSE and therefore never returns the position of a missing value.

x = c(4, 7, 2, 0, 1, NA)
mean(x)
[1] NA
mean(x, na.rm=TRUE)
[1] 2.8
which(x > 4)
[1] 2

We need to check the documentation of functions to see how they handle missing data.

We use NaN (Not a Number) to denote undefined numeric results, such as 0/0. In R, NaN implies NA but not the reverse: is.na(NaN) is TRUE, whereas is.nan(NA) is FALSE (NaN refers to unavailable numeric data and NA refers to any type of unavailable data). Undefined or null objects are denoted in R by NULL.

# Undefined row names
x = matrix(1:4, ncol=2, dimnames=list(NULL, c("c.1", "c.2")))
x
     c.1 c.2
[1,]   1   3
[2,]   2   4
# NULL value in a list
x = list(a=1:5, b=NULL, c="Econometrics")
x
$a
[1] 1 2 3 4 5

$b
NULL

$c
[1] "Econometrics"
# Empty list
y = list()
y
list()

In Table 2.9, we provide some useful functions for testing missing data.

Table 2.9: Functions for testing missing and null data
Function      Description
is.na(x)      Tests for NA or NaN data in x
is.nan(x)     Tests for NaN data in x
is.null(x)    Tests if x is NULL

x = c(4, 7, 2, 0, 1, NA)
(x == NA) 
[1] NA NA NA NA NA NA
is.na(x) # Check which elements are NA
[1] FALSE FALSE FALSE FALSE FALSE  TRUE
is.nan(x) # Check which elements are NaN
[1] FALSE FALSE FALSE FALSE FALSE FALSE
any(is.na(x)) # Check if there is any NA in x
[1] TRUE
y <- x/0 # Create NaN values
y
[1] Inf Inf Inf NaN Inf  NA
is.nan(y) 
[1] FALSE FALSE FALSE  TRUE FALSE FALSE
is.na(y)
[1] FALSE FALSE FALSE  TRUE FALSE  TRUE

The first example above shows that we cannot use the equality operator == to test for NA values. Instead, we should use the is.na() function.
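
In practice, is.na() is most often used to count, locate, or drop missing values. The sketch below shows these uses together with is.null(); it relies only on base R, and the list lst mirrors the earlier example.

```r
x <- c(4, 7, 2, 0, 1, NA)

sum(is.na(x))          # 1: number of missing values
which(is.na(x))        # 6: position of the missing value
x[!is.na(x)]           # 4 7 2 0 1: keep only the observed values
mean(x[!is.na(x)])     # 2.8, same as mean(x, na.rm = TRUE)

# NULL is an absent object, not a missing value
lst <- list(a = 1:5, b = NULL, c = "Econometrics")
is.null(lst$b)         # TRUE
is.null(lst$a)         # FALSE
length(NULL)           # 0
```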