2 Data Structures in R

2.1 Introduction

This chapter introduces the basic data containers and data types in R. We discuss each data structure and its properties, which form the basis for effective data management.

2.2 Containers

In R, there are several data structures to store collections of data. The most commonly used data structures are: vectors, matrices, arrays, lists, and data frames. In this part, we will focus on vectors, matrices, arrays, and lists, and cover data frames in Chapter 7.

2.2.1 Vectors

In R, we use a vector to store an ordered collection of objects of the same type. We can create a vector using one of the following functions:

c() is the most general way to create a vector.
: (the colon operator) creates a sequence of consecutive integers.
seq() is used to generate regular sequences.
rep() is useful to generate vectors of replicated elements.

Some illustrative examples are given below.

# Creating a vector of numbers
v1 = c(2.5, 4, 7.3, 0.1)
v1

[1] 2.5 4.0 7.3 0.1

# Creating a character vector
v2 = c("A", "B", "C", "D")
v2

[1] "A" "B" "C" "D"

# Creating an integer sequence
v3 = -3:3
v3

[1] -3 -2 -1  0  1  2  3

# Creating a sequence with seq()
seq(0, 2, by = 0.5) # increment by 0.5

[1] 0.0 0.5 1.0 1.5 2.0

# Creating a sequence with seq()
seq(0, 2, len = 6) # length of the sequence is 6

[1] 0.0 0.4 0.8 1.2 1.6 2.0

# Creating a replicated vector with rep()
rep(1:5, each = 2) # repeat each element twice

 [1] 1 1 2 2 3 3 4 4 5 5

# Creating a replicated vector with rep()
rep(1:5, times = 2)

 [1] 1 2 3 4 5 1 2 3 4 5

To index certain element(s) of a vector, we use [ ] with a vector/scalar of positions to reference the elements of the vector. Including a minus sign before the vector/scalar removes the indexed elements from the vector.

# Referencing elements of a vector
x <- c(4, 7, 2, 10, 1, 0)
x[4] # return the fourth element

[1] 10

# Return elements from index 1 to 3
x[1:3]

[1] 4 7 2

# Return elements at indices 2, 5 and 6
x[c(2,5,6)]

[1] 7 1 0

# Remove the third element from x
x[-3]

[1]  4  7 10  1  0

# Remove multiple elements from x
x[-c(4,5)]

[1] 4 7 2 0

# Logical referencing
x[x>4] # return elements bigger than 4

[1]  7 10

# Modifying elements of a vector
x[3] <- 999 
x

[1]   4   7 999  10   1   0

The following additional functions can be useful to return the indices of a vector.

which(): returns the position or the index of the value which satisfies the given condition.
which.max(): returns the location of the (first) maximum element of a numeric vector.
which.min(): returns the location of the (first) minimum element of a numeric vector.
match(): returns the first position of an element of a vector in another vector.

x <- c(4, 7, 2, 10, 1, 0)
x>=4 # return a logical vector of TRUE and FALSE values

[1]  TRUE  TRUE FALSE  TRUE FALSE FALSE

# Return indices of elements satisfying the condition
which(x>=4) # return indices

[1] 1 2 4

# Return indices of the maximum element
which.max(x)

[1] 4

# Return the maximum element using which.max()
x[which.max(x)] # return the first maximum element

[1] 10

# Return the maximum element using max()
max(x)

[1] 10

# Using match()
y <- rep(1:5, times=5:1)
y

 [1] 1 1 1 1 1 2 2 2 2 3 3 3 4 4 5

match(1:5, y) # for each element of 1:5, return the position of its first match in y

[1]  1  6 10 13 15

# Return unique elements
unique(y)

[1] 1 2 3 4 5

match(unique(y), y) # for each unique value of y, return the position of its first occurrence in y

[1]  1  6 10 13 15

When vectors are used in math expressions the operations are performed element-wise.

# Element-wise operations
x = c(4, 7, 2, 10, 1, 0)
y = x^2 + 1
y

[1]  17  50   5 101   2   1

x*y # element-wise multiplication

[1]   68  350   10 1010    2    0
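Element-wise operations also work when the two vectors differ in length: R recycles the shorter vector until it matches the longer one. The following is a minimal sketch of this behavior; the names a and b are illustrative and do not appear elsewhere in the chapter.

```r
# Recycling: the shorter vector is repeated to match the longer one
a <- c(1, 2, 3, 4, 5, 6)
b <- c(10, 100)
ab <- a * b      # same as a * c(10, 100, 10, 100, 10, 100)
ab
a1 <- a + 1      # a single number is a vector of length one, recycled over a
a1
```

If the length of the longer vector is not a multiple of the length of the shorter one, R still recycles but issues a warning.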

In Table 2.1, we provide some useful functions for vector operations.

Table 2.1: Useful functions for vectors

Function	Description
`sum(x)`, `prod(x)`	Sum/product of the elements of `x`
`cumsum(x)`, `cumprod(x)`	Cumulative sum/product of the elements of `x`
`min(x)`, `max(x)`	Minimum/maximum element of `x`
`mean(x)`, `median(x)`	Mean/median of `x`
`var(x)`, `sd(x)`	Variance/standard deviation of `x`
`cov(x, y)`, `cor(x, y)`	Covariance/correlation of `x` and `y`
`range(x)`	Range of `x`
`quantile(x)`	Quantiles of `x` for the given probabilities
`fivenum(x)`	Five number summary of `x`
`length(x)`	Number of elements in `x`
`unique(x)`	Unique elements of `x`
`rev(x)`	Reverse the elements of `x`
`sort(x)`	Sort the elements of `x`
`which(x)`	Indices of `TRUE`s in a logical vector
`which.max(x)`, `which.min(x)`	Index of the max/min element of `x`
`match(x, y)`	Position in `y` of the first match for each element of `x`
`union(x, y)`	Union of `x` and `y`
`intersect(x, y)`	Intersection of `x` and `y`
`setdiff(x, y)`	Elements of `x` that are not in `y`
`setequal(x, y)`	Do `x` and `y` contain the same elements?

Below, we provide some illustrative examples.

# Useful functions for vectors
x <- c(4,7,2,10,1,0)
y <- 3*x^2 + 1
y

[1]  49 148  13 301   4   1

sum(x) # sum of elements

[1] 24

range(x) # range of elements

[1]  0 10

length(x) # number of elements

[1] 6

rev(x) # reverse the elements

[1]  0  1 10  2  7  4

sort(x) # sort in increasing order

[1]  0  1  2  4  7 10

sort(x, decreasing = TRUE) # sort in decreasing order

[1] 10  7  4  2  1  0

which(x==7) # return indices

[1] 2

union(x,y) # union of x and y

 [1]   4   7   2  10   1   0  49 148  13 301

setdiff(x,y) # elements in x but not in y

[1]  7  2 10  0

intersect(x,y) # intersection of x and y

[1] 4 1

setequal(x,y) # do x and y contain the same elements?

[1] FALSE
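Some functions in Table 2.1 are not used in the examples above. As a brief sketch, the code below applies a few of them to the same vector x; the object names cs, cp, q, and pos are chosen here for illustration only.

```r
# Further functions from Table 2.1, applied to the same vector
x <- c(4, 7, 2, 10, 1, 0)
cs <- cumsum(x)     # running totals: 4 11 13 23 24 24
cp <- cumprod(x)    # running products: 4 28 56 560 560 0
q <- quantile(x, probs = c(0.25, 0.5, 0.75))  # quartiles (default method)
pos <- match(c(10, 3), x)  # 10 is at position 4; 3 is absent, so NA
cs; cp; q; pos
```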

2.2.2 Matrices

A matrix is a two-dimensional generalization of a vector. To create a matrix, we use the function matrix(), with the syntax

matrix(data=NA, nrow=1, ncol=1, byrow = FALSE, dimnames = NULL)

The arguments are as follows:

data is a vector that gives data to fill the matrix
nrow is the desired number of rows
ncol is the desired number of columns
byrow is FALSE by default, which means that the matrix is filled column by column. If byrow = TRUE, the matrix is filled row by row.
dimnames is an optional list of length 2 giving the row and column names, respectively.

Below, we provide some illustrative examples.

# Creating a matrix
y = matrix(nrow = 3, ncol = 4)
y

     [,1] [,2] [,3] [,4]
[1,]   NA   NA   NA   NA
[2,]   NA   NA   NA   NA
[3,]   NA   NA   NA   NA

# Creating a matrix
x = matrix(c(5,0,6,1,3,5,9,5,7,1,5,3), 
           nrow = 3, ncol = 4, 
           byrow = TRUE,
           dimnames = list(rows = c("r.1", "r.2", "r.3"),
                           cols = c("c.1", "c.2", "c.3", "c.4")))
x

     cols
rows  c.1 c.2 c.3 c.4
  r.1   5   0   6   1
  r.2   3   5   9   5
  r.3   7   1   5   3

Some useful functions for matrices are given below.

class(x) # class of x

[1] "matrix" "array"

colnames(x) # access to column names

[1] "c.1" "c.2" "c.3" "c.4"

rownames(x) # access to row names

[1] "r.1" "r.2" "r.3"

rownames(x)=c("A","B","C") # change the row names
dimnames(x) # access to both row and column names

$rows
[1] "A" "B" "C"

$cols
[1] "c.1" "c.2" "c.3" "c.4"

dimnames(x)$rows # access to row names

[1] "A" "B" "C"

dimnames(x)$cols # access to column names

[1] "c.1" "c.2" "c.3" "c.4"

dim(x) # dimensions of x

[1] 3 4

nrow(x) # number of rows

[1] 3

ncol(x) # number of columns

[1] 4

The elements of a matrix can be referenced using [ ], just like with vectors, but now with two indices: the first for rows and the second for columns.

x = matrix(c(5,0,6,1,3,5,9,5,7,1,5,3), 
           nrow = 3, ncol = 4)
x

     [,1] [,2] [,3] [,4]
[1,]    5    1    9    1
[2,]    0    3    5    5
[3,]    6    5    7    3

x[2, 3] # Element in Row 2, Column 3

[1] 5

x[1, ] # Row 1, all Columns

[1] 5 1 9 1

x[ , 2] # All Rows, Column 2

[1] 1 3 5

x[c(1, 3), ] # Rows 1 and 3, all columns

     [,1] [,2] [,3] [,4]
[1,]    5    1    9    1
[2,]    6    5    7    3

When matrices are used in math expressions the operations are performed element-wise.

For matrix multiplication use the %*% operator.
If a vector is used in matrix multiplication, it will be coerced to either a row or column matrix to make the arguments conformable.
Using %*% on two vectors will return the inner product as a matrix and not a scalar.

A = matrix(1:4, nrow = 2)
B = matrix(1, nrow = 2, ncol = 2)
A*B # element-wise multiplication

     [,1] [,2]
[1,]    1    3
[2,]    2    4

A%*%B # matrix multiplication

     [,1] [,2]
[1,]    4    4
[2,]    6    6

y = 1:3
y%*%y # inner product as a matrix

     [,1]
[1,]   14

A/c(y%*%y)

           [,1]      [,2]
[1,] 0.07142857 0.2142857
[2,] 0.14285714 0.2857143

# A/(y%*%y) # Error: non-conformable arrays
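The coercion of vectors in matrix multiplication can be seen directly by checking the dimensions of the results. The sketch below reuses the matrix A from above; the vector v and the names Av, vA, and ip are illustrative.

```r
A <- matrix(1:4, nrow = 2)
v <- c(1, 2)
Av <- A %*% v          # v is treated as a 2 x 1 column; the result is 2 x 1
vA <- v %*% A          # v is treated as a 1 x 2 row; the result is 1 x 2
dim(Av)
dim(vA)
ip <- drop(v %*% v)    # drop() reduces the 1 x 1 matrix to a plain number
ip
```

Using drop() (or c(), as in the example above) is one way to turn the inner product into a scalar that can be used in element-wise arithmetic.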

We use the apply() function to apply a function to the margins of a matrix, array, or data frame. The syntax is

apply(X, MARGIN, FUN, ...)

where

X is a matrix, array, or data frame,
MARGIN is a vector of subscripts indicating which margins to apply the function to (1=rows, 2=columns, c(1,2)=rows and columns),
FUN is the function to be applied,
... stands for optional arguments for FUN.

# Using apply function
x = matrix(1:12, nrow = 3, ncol = 4)
apply(x, MARGIN=1, sum)  # Row sums

[1] 22 26 30

apply(x, 2, mean) # Column means

[1]  2  5  8 11

# Handling missing data with apply
x[1,1] <- NaN
apply(x, 2, mean, na.rm=TRUE) # Column means ignoring NaN

[1]  2.5  5.0  8.0 11.0
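FUN does not have to be a built-in function, and further arguments are passed on to it through the ... argument. The following sketch illustrates both points under the assumption that x is the original 3 by 4 matrix without missing values; the names col_spread and row_q are illustrative.

```r
x <- matrix(1:12, nrow = 3, ncol = 4)
# An anonymous function applied to each column: max minus min
col_spread <- apply(x, 2, function(col) max(col) - min(col))
col_spread
# Arguments after FUN are passed on to FUN; here probs goes to quantile()
row_q <- apply(x, 1, quantile, probs = c(0.25, 0.75))
dim(row_q)   # one column of results per row of x
# rowSums() and colMeans() return the same values as the apply() calls above
same_rows <- all(rowSums(x) == apply(x, 1, sum))
same_cols <- all(colMeans(x) == apply(x, 2, mean))
```

When FUN returns a vector of length greater than one, apply() stacks the results as columns, which is why row_q has one column per row of x.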

In Table 2.2, we list some useful functions for matrix operations.

Table 2.2: Useful functions for matrices

Function	Description
`t(A)`	Transpose of `A`
`det(A)`	Determinant of `A`
`solve(A, b)`	Solves the equation `Ax = b` for `x`
`solve(A)`	Matrix inverse of `A`
`MASS::ginv(A)`	Generalized inverse of `A` (MASS package)
`eigen(A)`	Eigenvalues and eigenvectors of `A`
`chol(A)`	Cholesky factorization of `A`
`diag(n)`	Creates an `n` by `n` identity matrix
`diag(A)`	Returns the diagonal elements of a matrix `A`
`diag(x)`	Create a diagonal matrix from a vector `x`
`lower.tri(A)`, `upper.tri(A)`	Matrix of logicals indicating lower/upper triangular matrix
`apply`	Apply a function to the margins of a matrix
`rbind(...)`	Combines arguments by rows
`cbind(...)`	Combines arguments by columns
`dim(A)`	Dimensions of `A`
`nrow(A)`, `ncol(A)`	Number of rows/columns of `A`
`colnames(A)`, `rownames(A)`	Get or set the column/row names of `A`
`dimnames(A)`	Get or set the dimension names of `A`
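To make some of the entries in Table 2.2 concrete, the sketch below works with a small symmetric matrix. The matrix A, the vector b, and the names x_sol, A_inv, and U are chosen here purely for illustration.

```r
A <- matrix(c(2, 1, 1, 3), nrow = 2)  # symmetric and positive definite
b <- c(1, 2)
t(A)                    # transpose (equal to A here because A is symmetric)
det(A)                  # determinant: 2*3 - 1*1 = 5
x_sol <- solve(A, b)    # solves A x = b
A_inv <- solve(A)       # inverse of A
A %*% A_inv             # identity matrix, up to rounding error
U <- chol(A)            # upper triangular factor with t(U) %*% U equal to A
rbind(A, b)             # append b as a third row
cbind(A, b)             # append b as a third column
```

Because of floating-point rounding, results such as A %*% A_inv are best checked with all.equal() rather than ==.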

2.2.3 Arrays

An array is a multi-dimensional generalization of a vector. To create an array, we use array(data = NA, dim = length(data), dimnames = NULL), where data is a vector that provides the values to fill the array; dim specifies the dimensions of the array (a vector of one or more elements giving the maximum indices in each dimension); and dimnames defines the names of the dimensions (a list with one component for each dimension, either NULL or a character vector of the length specified by dim for that dimension).

R fills an array in column-major order (the first index varies fastest), similar to matrices. The math operations on arrays are performed element-wise, similar to vectors and matrices. Also, all elements of an array must be of the same type.

# Creating an array
w = array(1:24, 
          dim = c(4, 3, 2),
          dimnames = list(c("A","B","C","D"), c("X","Y","Z"), c("N","M")))
w

The option dim = c(4, 3, 2) specifies that the array has 4 rows, 3 columns, and 2 “pages” (or layers). Thus, the array w has 4 x 3 x 2 = 24 elements. We can think of c("N","M") as the names of the pages. Thus, w[ , , "N"] returns the first page and w[ , , "M"] returns the second page.

# Referencing elements of an array
w[ , , "N"] # First page

w[2, , ] # Second row, all columns, all pages

w[ , "Y", ] # All rows, second column, all pages

w[1, 2, 2] # Element in Row 1, Column 2, Page 2

[1] 17
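The apply() function introduced for matrices also works on arrays, where MARGIN may refer to the third dimension. The sketch below recreates the array w and summarizes it by page and by cell; the names page_sums and cell_sums are illustrative.

```r
w <- array(1:24,
           dim = c(4, 3, 2),
           dimnames = list(c("A","B","C","D"), c("X","Y","Z"), c("N","M")))
dim(w)                                # 4 3 2
page_sums <- apply(w, 3, sum)         # one sum per page
page_sums
cell_sums <- apply(w, c(1, 2), sum)   # 4 x 3 matrix: each cell summed over pages
cell_sums
```

Here page "N" holds the values 1 to 12 and page "M" holds 13 to 24, so the page sums are 78 and 222.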

2.2.4 Lists

A list is a general form of a vector whose components can be of different types and dimensions. To create a list, we use list(...). We can name the elements of a list using the name = value syntax. Arguments can be specified with or without names. In the example below, we create a list with three elements: the first element is named num, the second element has no name, and the third element is named identity.

# Creating a list
x = list(num = c(1,2,3), "Econometrics", identity=diag(2))
x

$num
[1] 1 2 3

[[2]]
[1] "Econometrics"

$identity
     [,1] [,2]
[1,]    1    0
[2,]    0    1

# Names of list elements
names(x)

[1] "num"      ""         "identity"

We use [ ], [[ ]] and $ to reference elements of a list. Below, we provide some examples of referencing elements of the list x.

x[[2]]     # the second element of x

[1] "Econometrics"

x[["num"]] # element named "num"

[1] 1 2 3

x$identity # element named "identity"

     [,1] [,2]
[1,]    1    0
[2,]    0    1

x[[3]][1,] # first row of the third element

[1] 1 0

x[1:2]     # a sublist from the first two elements

$num
[1] 1 2 3

[[2]]
[1] "Econometrics"

In Table 2.3, we provide some useful functions for lists.

Table 2.3: Useful functions for lists

Function	Description
`lapply()`	Apply a function to each element of a list; returns a list
`sapply()`	Same as `lapply()`, but returns a vector or matrix by default
`vapply()`	Similar to `sapply()`, but has a pre-specified type of return value
`replicate()`	Repeated evaluation of an expression; useful for replicating lists
`unlist(x)`	Produce a vector of all the components that occur in `x`
`length(x)`	Number of objects in `x`
`names(x)`	Names of the objects in `x`
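As a short sketch of the functions in Table 2.3, the code below applies length() and class() to each element of the list x created earlier. The object names lens_list, lens_vec, lens_safe, classes, and flat are illustrative.

```r
x <- list(num = c(1, 2, 3), "Econometrics", identity = diag(2))
lens_list <- lapply(x, length)   # a list with the elements 3, 1 and 4
lens_vec <- sapply(x, length)    # the same values, simplified to a vector
lens_safe <- vapply(x, length, FUN.VALUE = integer(1))  # return type declared
classes <- sapply(x, function(el) class(el)[1])  # first class of each element
flat <- unlist(x)   # all components coerced into one character vector
lens_vec
classes
flat
```

Since the list contains a character string, unlist() coerces every component, including the numbers, to character.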

2.3 Data types

2.3.1 Numeric data

In R, the numeric type represents real numbers. Numeric data are stored as either double or integer, but in practice they are almost always double (type double refers to double-precision real numbers). We can use the format() function to format an object for pretty printing. See ?format for additional arguments.

x = 123.456789
is.numeric(x) # check if x is numeric

[1] TRUE

is.double(x) # check if x is double

[1] TRUE

is.integer(x) # check if x is integer

[1] FALSE

format(13.7, nsmall = 3) # Minimum number of digits to the right of the decimal point

[1] "13.700"

format(2^16, scientific = TRUE) # scientific notation

[1] "6.5536e+04"

format(2^16, scientific = FALSE) # fixed notation

[1] "65536"
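The distinction between double and integer can be checked with typeof(). The sketch below is a minimal illustration; the variable n is a name chosen here.

```r
n <- 5L            # the L suffix creates an integer
typeof(n)          # "integer"
typeof(5)          # "double": a plain number literal is stored as a double
typeof(1:3)        # "integer": the colon operator returns integers here
is.numeric(n)      # TRUE: integers are numeric too
typeof(n / 2L)     # "double": division always returns a double
as.integer(3.9)    # 3: conversion truncates toward zero
```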

2.3.2 Booleans

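Boolean (or logical) values are represented by the keywords TRUE and FALSE, written in all caps. The abbreviations T and F also work, but they are ordinary variables that can be reassigned, so TRUE and FALSE are the safer choice. We can use relational operators to compare numeric values or vectors element-wise, and logical operators to combine the results. The result of a comparison is a logical value (TRUE or FALSE). In Table 2.4, we provide some useful functions for logical and relational operations.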

Table 2.4: Useful functions for logical and relational operations

Function	Description
`!x`	NOT `x`
`x & y`	`x` AND `y` element-wise; returns a vector
`x && y`	`x` AND `y`; returns a single value
`x | y`	`x` OR `y` element-wise; returns a vector
`x || y`	`x` OR `y`; returns a single value
`xor(x, y)`	Exclusive OR of `x` and `y`, element-wise
`x %in% y`	Is each element of `x` contained in `y`? Returns a logical vector
`x < y`	`x < y`
`x > y`	`x > y`
`x <= y`	`x ≤ y`
`x >= y`	`x ≥ y`
`x == y`	`x = y`
`x != y`	`x ≠ y`
`isTRUE(x)`	`TRUE` if `x` is `TRUE`
`all(...)`	`TRUE` if all arguments are `TRUE`
`any(...)`	`TRUE` if at least one argument is `TRUE`
`identical(x, y)`	Safe and reliable way to test two objects for being EXACTLY equal
`all.equal(x, y)`	Test if two objects are NEARLY equal

Below, we provide some illustrative examples.

x = 1:10
(x%%2 == 0) | (x > 5) # Even numbers or greater than 5

 [1] FALSE  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE

y = 5:15 
x %in% y # Elements of x that are also in y

 [1] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE

x[x %in% y] # Elements of x that are also in y

[1]  5  6  7  8  9 10

any(x > 5) # Is at least one element of x greater than 5?

[1] TRUE

all(x > 5) # Are all elements of x greater than 5?

[1] FALSE
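The difference between the element-wise operator & and the single-value operator && matters in practice. The following sketch illustrates it, together with xor() and isTRUE(); the names evens_gt5 and msg are illustrative.

```r
x <- 1:10
# Element-wise AND returns one logical value per element, useful for subsetting
evens_gt5 <- x[(x %% 2 == 0) & (x > 5)]
evens_gt5   # 6 8 10
# && expects single values on both sides, as in an if() condition
if (length(x) > 0 && x[1] == 1) msg <- "x starts at 1"
msg
xor(TRUE, FALSE)        # TRUE: exactly one argument is TRUE
isTRUE(c(TRUE, TRUE))   # FALSE: isTRUE() requires a single TRUE value
```

In recent versions of R, supplying a vector of length greater than one to && or || is an error, so these operators should be reserved for single values.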

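In general, logical operators may not produce a single value and may return NA if an element is NA or NaN. Equality comparisons of real numbers also need care: because of floating-point rounding, == and identical() can return FALSE for values that are mathematically equal, so all.equal() is preferable for such tests.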

x = sqrt(2)
x^2

[1] 2

x^2 == 2 # Returns FALSE due to rounding error

[1] FALSE

identical(x^2, 2) # Returns FALSE because not exactly equal

[1] FALSE

all.equal(x^2, 2) # Returns TRUE because nearly equal

[1] TRUE

all.equal(x^2, 1) # Returns a message indicating the difference

[1] "Mean relative difference: 0.5"

isTRUE(all.equal(x^2, 1)) # Returns FALSE

[1] FALSE
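The remark that logical operations may return NA can be illustrated with a few one-line checks. This is a minimal sketch; the vector z is introduced here only for illustration.

```r
NA > 1        # NA: the comparison cannot be decided
NA & FALSE    # FALSE: the result is FALSE whatever the missing value is
NA | TRUE     # TRUE: the result is TRUE whatever the missing value is
NA & TRUE     # NA: the result depends on the missing value
z <- c(3, NA, 8)
any(z > 5)                 # TRUE: one element is known to exceed 5
all(z > 5)                 # FALSE: 3 is known not to exceed 5
all(z > 1)                 # NA: the missing element could fail the test
all(z > 1, na.rm = TRUE)   # TRUE once the missing element is removed
```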

2.3.3 Characters

In R, the text data type is called character. We use single or double quotes to create character strings.

s1 = c("A", "B", "C")
class(s1) # check the class of s1

[1] "character"

s2 = 'The world is beautiful'
class(s2) # check the class of s2

[1] "character"

In Table 2.5, we provide some useful functions for character vectors.

Table 2.5: Some useful functions for character vectors

Function	Description
`cat()`	Concatenate objects and print to console (`\n` for newline)
`paste()`	Concatenate objects and return a string
`print()`	Print an object
`substr()`	Extract or replace substrings in a character vector
`strtrim()`	Trim character vectors to specified display widths
`strsplit()`	Split elements of a character vector according to a substring
`grep()`	Search for matches to a pattern within a character vector; returns a vector of the indices that matched
`grepl()`	Like `grep()`, but returns a logical vector
`agrep()`	Similar to `grep()`, but searches for approximate matches
`regexpr()`	Similar to `grep()`, but returns the position of the first instance of a pattern within a string
`gsub()`	Replace all occurrences of a pattern with a character vector
`sub()`	Like `gsub()`, but only replaces the first occurrence
`tolower()`, `toupper()`	Convert to all lower/upper case
`noquote()`	Print a character vector without quotations
`nchar()`	Number of characters
`letters`, `LETTERS`	Built-in vector of lower and upper case letters

Below, we provide some illustrative examples.

animals = c("bird", "horse", "fish")
home = c("tree", "barn", "lake")

length(animals) # Number of elements in the vector

[1] 3

nchar(animals) # Number of characters in each string

[1] 4 5 4

cat("Animals:", animals) # Need \n to move cursor to a newline

Animals: bird horse fish

cat(animals, home, "\n") # Joins one vector after the other

bird horse fish tree barn lake

paste(animals, collapse=" ") # Create one long string of animals

[1] "bird horse fish"

a = paste(animals, home, sep=".") # Pairwise joining of animals and home
a

[1] "bird.tree"  "horse.barn" "fish.lake"

unlist(strsplit(a, ".", fixed=TRUE)) # Split the strings back

[1] "bird"  "tree"  "horse" "barn"  "fish"  "lake"

substr(animals, 2, 4) # Get characters 2-4 of each animal

[1] "ird" "ors" "ish"

strtrim(animals, 3) # Print the first three characters

[1] "bir" "hor" "fis"

toupper(animals) # Print animals in all upper case

[1] "BIRD"  "HORSE" "FISH"

Table 2.5 includes several functions for pattern matching and replacement: grep(), grepl(), agrep(), regexpr(), gsub(), and sub(). We can use special characters in regular expressions to define search patterns. Some common special characters are listed below.

^: Beginning of character string
$: End of character string
.: Any single character
*: Zero or more of the preceding character
+: One or more of the preceding character
?: Zero or one of the preceding character
{n}: Exactly n of the preceding character
[a-c]: Any one of the characters a, b, or c
[^a-c]: Any one character other than a, b, or c

To illustrate the use of these functions, we use the built-in colors() function that returns a vector of color names in R.

# First ten colors
colors()[1:10]

 [1] "white"         "aliceblue"     "antiquewhite"  "antiquewhite1"
 [5] "antiquewhite2" "antiquewhite3" "antiquewhite4" "aquamarine"   
 [9] "aquamarine1"   "aquamarine2"

# Number of colors
length(colors())

[1] 657

colors()[grep("red", colors())] # All colors that contain "red"

 [1] "darkred"         "indianred"       "indianred1"      "indianred2"     
 [5] "indianred3"      "indianred4"      "mediumvioletred" "orangered"      
 [9] "orangered1"      "orangered2"      "orangered3"      "orangered4"     
[13] "palevioletred"   "palevioletred1"  "palevioletred2"  "palevioletred3" 
[17] "palevioletred4"  "red"             "red1"            "red2"           
[21] "red3"            "red4"            "violetred"       "violetred1"     
[25] "violetred2"      "violetred3"      "violetred4"

colors()[grep("^red", colors())] # Colors that start with "red"

[1] "red"  "red1" "red2" "red3" "red4"

colors()[grep("red$", colors())] # Colors that end with "red"

[1] "darkred"         "indianred"       "mediumvioletred" "orangered"      
[5] "palevioletred"   "red"             "violetred"

colors()[grep("red.", colors())] # Colors with at least one character after "red"

 [1] "indianred1"     "indianred2"     "indianred3"     "indianred4"    
 [5] "orangered1"     "orangered2"     "orangered3"     "orangered4"    
 [9] "palevioletred1" "palevioletred2" "palevioletred3" "palevioletred4"
[13] "red1"           "red2"           "red3"           "red4"          
[17] "violetred1"     "violetred2"     "violetred3"     "violetred4"

colors()[grep("^[r-t]", colors())] # Colors that begin with r, s, or t

 [1] "red"          "red1"         "red2"         "red3"         "red4"        
 [6] "rosybrown"    "rosybrown1"   "rosybrown2"   "rosybrown3"   "rosybrown4"  
[11] "royalblue"    "royalblue1"   "royalblue2"   "royalblue3"   "royalblue4"  
[16] "saddlebrown"  "salmon"       "salmon1"      "salmon2"      "salmon3"     
[21] "salmon4"      "sandybrown"   "seagreen"     "seagreen1"    "seagreen2"   
[26] "seagreen3"    "seagreen4"    "seashell"     "seashell1"    "seashell2"   
[31] "seashell3"    "seashell4"    "sienna"       "sienna1"      "sienna2"     
[36] "sienna3"      "sienna4"      "skyblue"      "skyblue1"     "skyblue2"    
[41] "skyblue3"     "skyblue4"     "slateblue"    "slateblue1"   "slateblue2"  
[46] "slateblue3"   "slateblue4"   "slategray"    "slategray1"   "slategray2"  
[51] "slategray3"   "slategray4"   "slategrey"    "snow"         "snow1"       
[56] "snow2"        "snow3"        "snow4"        "springgreen"  "springgreen1"
[61] "springgreen2" "springgreen3" "springgreen4" "steelblue"    "steelblue1"  
[66] "steelblue2"   "steelblue3"   "steelblue4"   "tan"          "tan1"        
[71] "tan2"         "tan3"         "tan4"         "thistle"      "thistle1"    
[76] "thistle2"     "thistle3"     "thistle4"     "tomato"       "tomato1"     
[81] "tomato2"      "tomato3"      "tomato4"      "turquoise"    "turquoise1"  
[86] "turquoise2"   "turquoise3"   "turquoise4"

colors()[grep("green.*", colors())] # Colors that contain "green" followed by zero or more characters

 [1] "darkgreen"         "darkolivegreen"    "darkolivegreen1"  
 [4] "darkolivegreen2"   "darkolivegreen3"   "darkolivegreen4"  
 [7] "darkseagreen"      "darkseagreen1"     "darkseagreen2"    
[10] "darkseagreen3"     "darkseagreen4"     "forestgreen"      
[13] "green"             "green1"            "green2"           
[16] "green3"            "green4"            "greenyellow"      
[19] "lawngreen"         "lightgreen"        "lightseagreen"    
[22] "limegreen"         "mediumseagreen"    "mediumspringgreen"
[25] "palegreen"         "palegreen1"        "palegreen2"       
[28] "palegreen3"        "palegreen4"        "seagreen"         
[31] "seagreen1"         "seagreen2"         "seagreen3"        
[34] "seagreen4"         "springgreen"       "springgreen1"     
[37] "springgreen2"      "springgreen3"      "springgreen4"     
[40] "yellowgreen"

colors()[grep("green.+", colors())] # Colors that contain "green" followed by one or more characters

 [1] "darkolivegreen1" "darkolivegreen2" "darkolivegreen3" "darkolivegreen4"
 [5] "darkseagreen1"   "darkseagreen2"   "darkseagreen3"   "darkseagreen4"  
 [9] "green1"          "green2"          "green3"          "green4"         
[13] "greenyellow"     "palegreen1"      "palegreen2"      "palegreen3"     
[17] "palegreen4"      "seagreen1"       "seagreen2"       "seagreen3"      
[21] "seagreen4"       "springgreen1"    "springgreen2"    "springgreen3"   
[25] "springgreen4"

Finally, we illustrate the use of gsub() and sub() functions for pattern replacement.

places = c("home", "zoo", "school", "work", "park")
gsub("o", "O", places) # Replace all "o" with "O"

[1] "hOme"   "zOO"    "schOOl" "wOrk"   "park"

sub("o", "O", places)  # Replace the first "o" with "O"

[1] "hOme"   "zOo"    "schOol" "wOrk"   "park"
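The functions grepl() and regexpr(), as well as the {n} quantifier and negated character classes, are not used in the examples above. The following sketch applies them to the same places vector; the names has_o, double_o, first_o, and masked are illustrative.

```r
places <- c("home", "zoo", "school", "work", "park")
has_o <- grepl("o", places)                     # logical vector, one value per string
has_o
double_o <- grep("o{2}", places, value = TRUE)  # strings with two o's in a row
double_o
first_o <- as.integer(regexpr("o", places))     # position of the first "o"; -1 if none
first_o
# Replace the first letter when it is not one of the letters a to r
masked <- sub("^[^a-r]", "_", places)
masked
```

Setting value = TRUE makes grep() return the matching strings themselves instead of their indices.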

2.3.4 Factors

In R, we use the factor data type to represent ordered or unordered categorical variables. The function factor() is used to create factor type variables.

factor(rep(1:2, 4), labels=c("BA", "BS")) # Unordered factor variable

[1] BA BS BA BS BA BS BA BS
Levels: BA BS

factor(rep(1:3, 4), labels=c("low", "med", "high"), ordered=TRUE) # Ordered factor variable

 [1] low  med  high low  med  high low  med  high low  med  high
Levels: low < med < high

In the following example, we load the STAR.csv dataset that contains a variable gender indicating the gender of students. The initial class of gender is character. We convert gender to a factor variable with labels “0” and “1” using the factor() function.

# Load STAR.csv dataset
star = read.csv("data/STAR.csv")
class(star$gender) # Check the class of gender

[1] "character"

star$gender0 = factor(star$gender, labels=c("0", "1"), ordered=FALSE) # Convert to factor variable
class(star$gender0) # Check the class again

[1] "factor"

head(star[, c("gender", "gender0")]) # View the first few rows

     gender gender0
1122 female       0
1137 female       0
1143 female       0
1160   male       1
1183   male       1
1195   male       1

In Table 2.6, we provide some useful functions for factor variables.

Table 2.6: Useful functions for factor variables

Function	Description
`levels(x)`	Retrieve or set the levels of `x`
`nlevels(x)`	Returns the number of levels in `x`
`relevel(x, ref)`	Levels of `x` are reordered so that the level specified by `ref` is first
`reorder()`	Reorders levels based on the values of a second variable
`gl()`	Generate factors by specifying the pattern of their levels
`cut(x, breaks)`	Divides the range of `x` into intervals (factors) determined by `breaks`

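Converting a factor variable to a numeric variable requires care. Applying as.numeric(f) directly returns the internal integer codes of the levels (1, 2, ...), not the numeric values of the labels. To recover the values of the labels, we first convert the factor to character, as illustrated below.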

star$gender1 = as.numeric(star$gender0) # Returns the internal level codes (1, 2), not the labels
class(star$gender1)

[1] "numeric"

head(star[, c("gender", "gender1")]) # View the first few rows

     gender gender1
1122 female       1
1137 female       1
1143 female       1
1160   male       2
1183   male       2
1195   male       2

star$gender2 = as.numeric(as.character(star$gender0)) # Convert factor to numeric correctly
head(star[, c("gender", "gender1", "gender2")]) # View the first few rows

     gender gender1 gender2
1122 female       1       0
1137 female       1       0
1143 female       1       0
1160   male       2       1
1183   male       2       1
1195   male       2       1

In the following examples, we illustrate some of the functions listed in Table 2.6.

f = gl(3, 2, labels=paste("ECN", 1:3, sep="_")) # Create a factor variable 
f

[1] ECN_1 ECN_1 ECN_2 ECN_2 ECN_3 ECN_3
Levels: ECN_1 ECN_2 ECN_3

levels(f) # Get the levels of f

[1] "ECN_1" "ECN_2" "ECN_3"

nlevels(f) # Get the number of levels of f

[1] 3

relevel(f, "ECN_2") # Reorder levels so that "ECN_2" is first

[1] ECN_1 ECN_1 ECN_2 ECN_2 ECN_3 ECN_3
Levels: ECN_2 ECN_1 ECN_3

f = gl(3, 2, labels=1:3) # Create a factor variable 
as.numeric(levels(f))[f] # Convert factor to numeric correctly

[1] 1 1 2 2 3 3

# Cutting a numeric vector into intervals (x is random, so the output varies between runs)
x = runif(10)
cut(x, 3) # Cut x into three intervals

 [1] (0.0434,0.246] (0.0434,0.246] (0.0434,0.246] (0.0434,0.246] (0.246,0.448] 
 [6] (0.448,0.65]   (0.448,0.65]   (0.0434,0.246] (0.448,0.65]   (0.246,0.448] 
Levels: (0.0434,0.246] (0.246,0.448] (0.448,0.65]

cut(x, c(0,.25,.5,.75,1)) # Cut x at the given cut points

 [1] (0,0.25]   (0,0.25]   (0,0.25]   (0,0.25]   (0.25,0.5] (0.5,0.75]
 [7] (0.5,0.75] (0,0.25]   (0.5,0.75] (0.25,0.5]
Levels: (0,0.25] (0.25,0.5] (0.5,0.75] (0.75,1]

2.3.5 Dates and Times

We use the Date class for calendar dates (Year-Month-Day). We can add days to or subtract days from a date. We can also compare dates using logical operators. In Table 2.7, we provide some useful functions for date objects.

Table 2.7: Useful functions for date objects

Function	Description
`Sys.Date()`	Current date
`as.Date()`	Convert a character string to a date object
`format.Date()`	Change the format of a date object
`seq.Date()`	Generate sequence of dates
`cut.Date()`	Cut dates into intervals
`weekdays`, `months`, `quarters`	Extract parts of a date object
`julian`	Number of days since a given origin

The Date suffix is optional for calling format.Date(), seq.Date() and cut.Date(), but is necessary for viewing the appropriate documentation. To convert a character string to a date object, we use as.Date(x, format), where x is a character vector representing dates and format is a character string representing the date format of x. In Table 2.8, we list some common conversion specifications for date formats.

Table 2.8: Common date format specifiers

Specifier	Description
`%a`	Abbreviated weekday name
`%A`	Full weekday name
`%d`	Day of the month
`%B`	Full month name
`%b`	Abbreviated month name
`%m`	Numeric month (01-12)
`%y`	Year without century
`%Y`	Year with century

Below, we provide some illustrative examples.

d1 = c("5-01-2008", "19-08-2008", "2-02-2009", "29-09-2009")
dates1 = as.Date(d1, format = "%d-%m-%Y") # Convert to date object
class(dates1) # Check the class

[1] "Date"

dates1 # View the date object

[1] "2008-01-05" "2008-08-19" "2009-02-02" "2009-09-29"

Here is an example in which the dates are given in year/month/day order, separated by slashes.

d2 = c("2008/01/05", "2008/08/19", "2009/02/02", "2009/09/29")
dates2 = as.Date(d2, format="%Y/%m/%d")
class(dates2) # Check the class

[1] "Date"

dates2 # View the date object

[1] "2008-01-05" "2008-08-19" "2009-02-02" "2009-09-29"

To create a sequence of dates, we can use seq.Date(from, to, by, length.out = NULL), where

from, to: Start and ending date objects
by: A character string containing one of "day", "week", "month" or "year", optionally preceded by an integer and a space (e.g., "3 days")
length.out: Integer, desired length of the sequence

Below, we provide some illustrative examples.

seq.Date(as.Date("2011/1/1"), as.Date("2011/1/31"), by = "week")

[1] "2011-01-01" "2011-01-08" "2011-01-15" "2011-01-22" "2011-01-29"

seq.Date(as.Date("2011/1/1"), as.Date("2011/1/31"), by = "3 days")

 [1] "2011-01-01" "2011-01-04" "2011-01-07" "2011-01-10" "2011-01-13"
 [6] "2011-01-16" "2011-01-19" "2011-01-22" "2011-01-25" "2011-01-28"
[11] "2011-01-31"

seq.Date(as.Date("2011/1/1"), by = "week", length.out = 10)

 [1] "2011-01-01" "2011-01-08" "2011-01-15" "2011-01-22" "2011-01-29"
 [6] "2011-02-05" "2011-02-12" "2011-02-19" "2011-02-26" "2011-03-05"

To divide a sequence of dates into levels, we can use cut.Date(x, breaks, start.on.monday = TRUE).

jan = seq.Date(as.Date("2011/1/1"), as.Date("2011/1/31"), by="days")
cut(jan, breaks="weeks")

 [1] 2010-12-27 2010-12-27 2011-01-03 2011-01-03 2011-01-03 2011-01-03
 [7] 2011-01-03 2011-01-03 2011-01-03 2011-01-10 2011-01-10 2011-01-10
[13] 2011-01-10 2011-01-10 2011-01-10 2011-01-10 2011-01-17 2011-01-17
[19] 2011-01-17 2011-01-17 2011-01-17 2011-01-17 2011-01-17 2011-01-24
[25] 2011-01-24 2011-01-24 2011-01-24 2011-01-24 2011-01-24 2011-01-24
[31] 2011-01-31
6 Levels: 2010-12-27 2011-01-03 2011-01-10 2011-01-17 ... 2011-01-31

Recall that date objects can be used in arithmetic operations. Specifically,

Days can be added to or subtracted from a date.
Dates can be subtracted.
Dates can be compared using logical operators.

Below, we provide some illustrative examples.

jan1 = as.Date("2011/1/1")
(jan8 = jan1 + 7) # Add 7 days to 2011/1/1

[1] "2011-01-08"

jan1 - 14 # Subtract 2 weeks from 2011/1/1

[1] "2010-12-18"

jan8 - jan1 # Number of days between 2011/1/1 and 2011/1/8

Time difference of 7 days

jan8 > jan1 # Compare dates

[1] TRUE

# Use format to extract parts of a date object or change the appearance
format.Date(jan8, "%Y")

[1] "2011"

format.Date(jan8, "%b-%d") # Month names depend on the locale (English locale shown)

[1] "Jan-08"
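The extraction functions in Table 2.7 can be used in the same way. The sketch below assumes the date jan8 from above; the names d_str, qtr, yr, days_since_origin, and gap are illustrative. The output of weekdays() and months() is not shown because it depends on the language of the current locale.

```r
jan8 <- as.Date("2011-01-08")
d_str <- format(jan8, "%d/%m/%Y")        # numeric parts do not depend on the locale
d_str                                    # "08/01/2011"
qtr <- quarters(jan8)                    # "Q1"
yr <- as.numeric(format(jan8, "%Y"))     # the year as a number
days_since_origin <- as.numeric(julian(jan8))  # days since 1970-01-01 (default origin)
weekdays(jan8)                           # weekday name in the locale's language
months(jan8)                             # month name in the locale's language
gap <- as.numeric(as.Date("2011-01-31") - jan8)  # date difference as a plain number
gap                                      # 23
```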

2.3.6 Missing Data

Missing data in R are represented by the keyword NA. Built-in functions do not all handle missing data in the same way. For example, the mean function ignores NA only if the argument na.rm=TRUE, whereas the which function always ignores missing data.

x = c(4, 7, 2, 0, 1, NA)
mean(x)

[1] NA

mean(x, na.rm=TRUE)

[1] 2.8

which(x > 4)

[1] 2

We need to check the documentation of functions to see how they handle missing data.

We use NaN (Not a Number) to denote undefined numeric results, such as 0/0. In R, every NaN also counts as NA, but not vice versa: is.na() returns TRUE for both, whereas is.nan() returns TRUE only for NaN (NaN refers to unavailable numeric data and NA refers to any type of unavailable data). Undefined or null objects are denoted in R by NULL.

# Undefined row names
x = matrix(1:4, ncol=2, dimnames=list(NULL, c("c.1", "c.2")))
x

     c.1 c.2
[1,]   1   3
[2,]   2   4

# NULL value in a list
x = list(a=1:5, b=NULL, c="Econometrics")
x

$a
[1] 1 2 3 4 5

$b
NULL

$c
[1] "Econometrics"

# Empty list
y = list()
y

list()

In Table 2.9, we provide some useful functions for testing missing data.

Table 2.9: Functions for testing missing and null data

Function	Description
`is.na(x)`	Tests for `NA` or `NaN` data in `x`
`is.nan(x)`	Tests for `NaN` data in `x`
`is.null(x)`	Tests if `x` is `NULL`

x = c(4, 7, 2, 0, 1, NA)
(x == NA)

[1] NA NA NA NA NA NA

is.na(x) # Check which elements are NA

[1] FALSE FALSE FALSE FALSE FALSE  TRUE

is.nan(x) # Check which elements are NaN

[1] FALSE FALSE FALSE FALSE FALSE FALSE

any(is.na(x)) # Check if there is any NA in x

[1] TRUE

y <- x/0 # Division by zero gives Inf, NaN (for 0/0), or NA (for NA/0)
y

[1] Inf Inf Inf NaN Inf  NA

is.nan(y)

[1] FALSE FALSE FALSE  TRUE FALSE FALSE

is.na(y)

[1] FALSE FALSE FALSE  TRUE FALSE  TRUE

The first example above shows that we cannot use the equality operator == to test for NA values. Instead, we should use the is.na() function.
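Once missing values have been located with is.na(), they can be counted, removed, or replaced. The sketch below shows one possible way to do each; replacing NA by the mean is used only as an illustration and is not a recommendation. The names n_missing, x_complete, x_filled, and lst are illustrative.

```r
x <- c(4, 7, 2, 0, 1, NA)
n_missing <- sum(is.na(x))     # number of missing values: 1
x_complete <- x[!is.na(x)]     # drop the missing values: 4 7 2 0 1
x_filled <- x
x_filled[is.na(x_filled)] <- mean(x, na.rm = TRUE)  # replace NA by the mean, 2.8
x_filled
lst <- list(a = 1:5, b = NULL)
is.null(lst$b)                 # TRUE
length(NULL)                   # 0: NULL has no elements
length(NA)                     # 1: NA is a value that occupies a position
```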