In this chapter, we introduce the basic data containers and data types in R, and discuss their properties for effective data management.
In R, there are several data structures to store collections of data. The most commonly used data structures are: vectors, matrices, arrays, lists, and data frames. In this part, we will focus on vectors, matrices, arrays, and lists, and cover data frames in Chapter 7.
In R, we use a vector to store an ordered collection of objects of the same type. We can create a vector using one of the following functions:
- c() is the most general way to create a vector.
- The : operator creates a sequence of integers.
- seq() is used to generate regular sequences.
- rep() is useful to generate vectors of replicated elements.

Some illustrative examples are given below.
# Creating a vector of numbers
v1 = c(2.5, 4, 7.3, 0.1)
v1
[1] 2.5 4.0 7.3 0.1
# Creating a character vector
v2 = c("A", "B", "C", "D")
v2
[1] "A" "B" "C" "D"
# Creating an integer sequence
v3 = -3:3
v3
[1] -3 -2 -1 0 1 2 3
# Creating a sequence with seq()
seq(0, 2, by = 0.5) # increment by 0.5
[1] 0.0 0.5 1.0 1.5 2.0
seq(0, 2, length.out = 6) # length of the sequence is 6
[1] 0.0 0.4 0.8 1.2 1.6 2.0
# Creating a replicated vector with rep()
rep(1:5, each = 2) # repeat each element twice
[1] 1 1 2 2 3 3 4 4 5 5
rep(1:5, times = 2) # repeat the whole vector twice
[1] 1 2 3 4 5 1 2 3 4 5
To index certain element(s) of a vector, we use [ ] with a vector/scalar of positions to reference the elements of the vector. Including a minus sign before the vector/scalar removes the indexed elements from the vector.
# Referencing elements of a vector
x <- c(4, 7, 2, 10, 1, 0)
x[4] # return the fourth element
[1] 10
# Return elements from index 1 to 3
x[1:3]
[1] 4 7 2
# Return elements at indices 2, 5 and 6
x[c(2,5,6)]
[1] 7 1 0
# Remove the third element from x
x[-3]
[1] 4 7 10 1 0
# Remove multiple elements from x
x[-c(4,5)]
[1] 4 7 2 0
# Logical referencing
x[x>4] # return elements greater than 4
[1] 7 10
# Modifying elements of a vector
x[3] <- 999
x
[1] 4 7 999 10 1 0
The following additional functions can be useful to return the indices of vector elements.
- which(): returns the positions (indices) of the elements that satisfy the given condition.
- which.max(): returns the location of the (first) maximum element of a numeric vector.
- which.min(): returns the location of the (first) minimum element of a numeric vector.
- match(): returns the first position of an element of a vector in another vector.

x <- c(4, 7, 2, 10, 1, 0)
x>=4 # return a logical vector of TRUE and FALSE
[1] TRUE TRUE FALSE TRUE FALSE FALSE
# Return indices of elements satisfying the condition
which(x>=4) # return indices
[1] 1 2 4
# Return the index of the maximum element
which.max(x)
[1] 4
# Return the maximum element using which.max()
x[which.max(x)] # return the first maximum element
[1] 10
# Return the maximum element using max()
max(x)
[1] 10
# Using match()
y <- rep(1:5, times=5:1)
y
[1] 1 1 1 1 1 2 2 2 2 3 3 3 4 4 5
match(1:5, y) # return the first position in y of each element of 1:5
[1] 1 6 10 13 15
# Return unique elements
unique(y)
[1] 1 2 3 4 5
match(unique(y), y) # return the first position in y of each element of unique(y)
[1] 1 6 10 13 15
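The list above mentions which.min(), which the examples do not exercise, and it is also worth seeing what match() does when a value is not found. The following is a minimal sketch; the vector lookup and the variable names idx_min and pos are illustrative and not part of the original examples.

```r
x <- c(4, 7, 2, 10, 1, 0)

# Index of the (first) minimum element
idx_min <- which.min(x)
idx_min        # 6, because x[6] = 0 is the smallest value
x[idx_min]     # same value as min(x)

# match() returns NA for values that do not appear in the second vector
lookup <- c(10, 3, 4)      # illustrative values; 3 is not in x
pos <- match(lookup, x)
pos            # 4 NA 1

# The nomatch argument replaces NA with a value of our choice
match(lookup, x, nomatch = 0)   # 4 0 1
```

Checking for NA in the result of match() is a simple way to detect values that are absent from the target vector.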
When vectors are used in math expressions the operations are performed element-wise.
# Element-wise operations
x = c(4, 7, 2, 10, 1, 0)
y = x^2 + 1
y
[1] 17 50 5 101 2 1
x*y # element-wise multiplication
[1] 68 350 10 1010 2 0
In Table 2.1, we provide some useful functions for vector operations.
| Function | Description |
|---|---|
| sum(x), prod(x) | Sum/product of the elements of x |
| cumsum(x), cumprod(x) | Cumulative sum/product of the elements of x |
| min(x), max(x) | Minimum/maximum element of x |
| mean(x), median(x) | Mean/median of x |
| var(x), sd(x) | Variance/standard deviation of x |
| cov(x, y), cor(x, y) | Covariance/correlation of x and y |
| range(x) | Range (minimum and maximum) of x |
| quantile(x) | Quantiles of x for the given probabilities |
| fivenum(x) | Five number summary of x |
| length(x) | Number of elements in x |
| unique(x) | Unique elements of x |
| rev(x) | Reverse the elements of x |
| sort(x) | Sort the elements of x |
| which(x) | Indices of TRUEs in a logical vector |
| which.max(x), which.min(x) | Index of the max/min element of x |
| match(x, y) | First position in y of each element of x |
| union(x, y) | Union of x and y |
| intersect(x, y) | Intersection of x and y |
| setdiff(x, y) | Elements of x that are not in y |
| setequal(x, y) | Do x and y contain the same elements? |
Below, we provide some illustrative examples.
# Useful functions for vectors
x <- c(4,7,2,10,1,0)
y <- 3*x^2 + 1
y
[1] 49 148 13 301 4 1
sum(x) # sum of elements
[1] 24
range(x) # range of elements
[1] 0 10
length(x) # number of elements
[1] 6
rev(x) # reverse the elements
[1] 0 1 10 2 7 4
sort(x) # sort in increasing order
[1] 0 1 2 4 7 10
sort(x, decreasing = TRUE) # sort in decreasing order
[1] 10 7 4 2 1 0
which(x==7) # return indices
[1] 2
union(x,y) # union of x and y
[1] 4 7 2 10 1 0 49 148 13 301
setdiff(x,y) # elements in x but not in y
[1] 7 2 10 0
intersect(x,y) # intersection of x and y
[1] 4 1
setequal(x,y) # do x and y contain the same elements?
[1] FALSE
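Several functions in Table 2.1, such as cumsum(), cumprod(), median(), fivenum() and quantile(), are not covered by the examples above. The sketch below applies them to the same vector x; the variable names cs, fn and q are introduced only for illustration, and quantile() is called with its default interpolation method.

```r
x <- c(4, 7, 2, 10, 1, 0)

cs <- cumsum(x)     # running total: 4 11 13 23 24 24
cumprod(x)          # running product: 4 28 56 560 560 0

mean(x)             # 4
median(x)           # 3, the average of the two middle values 2 and 4

fn <- fivenum(x)    # minimum, lower hinge, median, upper hinge, maximum
fn                  # 0 1 3 7 10

# Quartiles of x (default method of quantile())
q <- quantile(x, probs = c(0.25, 0.5, 0.75))
q                   # 1.25 3.00 6.25
```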
A matrix is a two-dimensional generalization of a vector. To create a matrix, we use the function matrix(), with the syntax
matrix(data=NA, nrow=1, ncol=1, byrow = FALSE, dimnames = NULL)
The arguments are as follows:
- data is a vector that gives the data to fill the matrix,
- nrow is the desired number of rows,
- ncol is the desired number of columns,
- byrow is set to FALSE by default, which means the matrix is filled by columns. Otherwise, the matrix is filled by rows.
- dimnames is an optional list of length 2 giving the row and column names, respectively.

Below, we provide some illustrative examples.
# Creating an empty matrix (filled with NA)
y = matrix(nrow = 3, ncol = 4)
y
     [,1] [,2] [,3] [,4]
[1,]   NA   NA   NA   NA
[2,]   NA   NA   NA   NA
[3,]   NA   NA   NA   NA
# Creating a matrix with row and column names, filled by rows
x = matrix(c(5,0,6,1,3,5,9,5,7,1,5,3),
           nrow = 3, ncol = 4,
           byrow = TRUE,
           dimnames = list(rows = c("r.1", "r.2", "r.3"),
                           cols = c("c.1", "c.2", "c.3", "c.4")))
x
     cols
rows  c.1 c.2 c.3 c.4
  r.1   5   0   6   1
  r.2   3   5   9   5
  r.3   7   1   5   3
Some useful functions for matrices are given below.
class(x) # class of x
[1] "matrix" "array"
colnames(x) # access to column names
[1] "c.1" "c.2" "c.3" "c.4"
rownames(x) # access to row names
[1] "r.1" "r.2" "r.3"
rownames(x)=c("A","B","C") # change the row names
dimnames(x) # access to both row and column names
$rows
[1] "A" "B" "C"
$cols
[1] "c.1" "c.2" "c.3" "c.4"
dimnames(x)$rows # access to row names
[1] "A" "B" "C"
dimnames(x)$cols # access to column names
[1] "c.1" "c.2" "c.3" "c.4"
dim(x) # dimensions of x
[1] 3 4
nrow(x) # number of rows
[1] 3
ncol(x) # number of columns
[1] 4
The elements of a matrix can be referenced using [ ], just as with vectors, but now with two indices: one for rows and one for columns.
x = matrix(c(5,0,6,1,3,5,9,5,7,1,5,3),
           nrow = 3, ncol = 4)
x
     [,1] [,2] [,3] [,4]
[1,]    5    1    9    1
[2,]    0    3    5    5
[3,]    6    5    7    3
x[2, 3] # Element in Row 2, Column 3
[1] 5
x[1, ] # Row 1, all Columns
[1] 5 1 9 1
x[ , 2] # All Rows, Column 2
[1] 1 3 5
x[c(1, 3), ] # Rows 1 and 3, all columns
     [,1] [,2] [,3] [,4]
[1,]    5    1    9    1
[2,]    6    5    7    3
When matrices are used in math expressions, the operations are performed element-wise.
- For matrix multiplication, we use the %*% operator.
- %*% on two vectors will return the inner product as a matrix and not a scalar.

A = matrix(1:4, nrow = 2)
B = matrix(1, nrow = 2, ncol = 2)
A*B # element-wise multiplication
     [,1] [,2]
[1,]    1    3
[2,]    2    4
A%*%B # matrix multiplication
     [,1] [,2]
[1,]    4    4
[2,]    6    6
y = 1:3
y%*%y # inner product as a matrix
     [,1]
[1,]   14
A/c(y%*%y) # c() converts the 1 x 1 matrix to a scalar
           [,1]      [,2]
[1,] 0.07142857 0.2142857
[2,] 0.14285714 0.2857143
# A/(y%*%y) # Error: non-conformable arrays
We use the apply() function to apply a function to the margins of a matrix, array, or data frame. The syntax is
apply(X, MARGIN, FUN, ...)
where
- X is a matrix, array or data frame,
- MARGIN is a vector of subscripts indicating which margins to apply the function to (1=rows, 2=columns, c(1,2)=rows and columns),
- FUN is the function to be applied,
- ... stands for optional arguments for FUN.

# Using apply function
x = matrix(1:12, nrow = 3, ncol = 4)
apply(x, MARGIN=1, sum) # Row sums
[1] 22 26 30
apply(x, 2, mean) # Column means
[1] 2 5 8 11
# Handling missing data with apply
x[1,1] <- NaN
apply(x, 2, mean, na.rm=TRUE) # Column means ignoring NaN
[1] 2.5 5.0 8.0 11.0
In Table 2.2, we list some useful functions for matrix operations.
| Function | Description |
|---|---|
| t(A) | Transpose of A |
| det(A) | Determinant of A |
| solve(A, b) | Solves the equation Ax = b for x |
| solve(A) | Matrix inverse of A |
| MASS::ginv(A) | Generalized inverse of A (MASS package) |
| eigen(A) | Eigenvalues and eigenvectors of A |
| chol(A) | Cholesky factorization of A |
| diag(n) | Creates an n by n identity matrix |
| diag(A) | Returns the diagonal elements of a matrix A |
| diag(x) | Creates a diagonal matrix from a vector x |
| lower.tri(A), upper.tri(A) | Matrix of logicals indicating lower/upper triangular matrix |
| apply() | Apply a function to the margins of a matrix |
| rbind(...) | Combines arguments by rows |
| cbind(...) | Combines arguments by columns |
| dim(A) | Dimensions of A |
| nrow(A), ncol(A) | Number of rows/columns of A |
| colnames(A), rownames(A) | Get or set the column/row names of A |
| dimnames(A) | Get or set the dimension names of A |
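To make a few entries of Table 2.2 concrete, the sketch below works through a small linear system. The matrix A and the vector b are made-up values chosen so that the results are easy to verify by hand; the names x_sol and A_inv are illustrative.

```r
# A small symmetric matrix and a right-hand side (illustrative values)
A <- matrix(c(2, 1, 1, 3), nrow = 2)
b <- c(1, 2)

t(A)                    # transpose (equal to A here because A is symmetric)
det(A)                  # 2*3 - 1*1 = 5

x_sol <- solve(A, b)    # solves A x = b
A %*% x_sol             # reproduces b, up to rounding error

A_inv <- solve(A)       # matrix inverse
A %*% A_inv             # the 2 x 2 identity matrix, up to rounding error

diag(2)                 # 2 x 2 identity matrix
diag(A)                 # diagonal elements of A: 2 3
diag(c(1, 2, 3))        # 3 x 3 diagonal matrix built from a vector

rbind(A, b)             # append b as a third row
cbind(A, b)             # append b as a third column
```

When only the solution of Ax = b is needed, calling solve(A, b) directly is usually preferable to computing solve(A) %*% b, since it avoids forming the inverse explicitly.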
An array is a multi-dimensional generalization of a vector. To create an array, we use array(data = NA, dim = length(data), dimnames = NULL), where data is a vector that provides the values to fill the array; dim specifies the dimensions of the array (a vector of one or more elements giving the maximum indices in each dimension); and dimnames defines the names of the dimensions (a list with one component for each dimension, either NULL or a character vector of the length specified by dim for that dimension).
We fill the array by columns, similar to matrices. The math operations on arrays are performed element-wise, similar to vectors and matrices. Also, all elements of an array must be of the same type.
# Creating an array
w = array(1:24,
          dim = c(4, 3, 2),
          dimnames = list(c("A","B","C","D"), c("X","Y","Z"), c("N","M")))
w
, , N
X Y Z
A 1 5 9
B 2 6 10
C 3 7 11
D 4 8 12
, , M
X Y Z
A 13 17 21
B 14 18 22
C 15 19 23
D 16 20 24
The option dim = c(4, 3, 2) specifies that the array has 4 rows, 3 columns, and 2 “pages” (or layers). Thus, the array w has 4 x 3 x 2 = 24 elements. We can think of c("N","M") as the names of the pages. Thus, w[ , , "N"] returns the first page and w[ , , "M"] returns the second page.
# Referencing elements of an array
w[ , , "N"] # First page
  X Y  Z
A 1 5  9
B 2 6 10
C 3 7 11
D 4 8 12
w[2, , ] # Second row, all columns, all pages
   N  M
X  2 14
Y  6 18
Z 10 22
w[ , "Y", ] # All rows, second column, all pages
  N  M
A 5 17
B 6 18
C 7 19
D 8 20
w[1, 2, 2] # Element in Row 1, Column 2, Page 2
[1] 17
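Since apply() also accepts arrays, the MARGIN argument can refer to the third dimension (the pages). The following sketch rebuilds the array w and summarizes it by page; the names page_sums and col_means are illustrative.

```r
w <- array(1:24,
           dim = c(4, 3, 2),
           dimnames = list(c("A","B","C","D"), c("X","Y","Z"), c("N","M")))

dim(w)                                # 4 3 2

# Sum of all elements on each page (MARGIN = 3)
page_sums <- apply(w, 3, sum)
page_sums                             # N = 78, M = 222

# Column means within each page (MARGIN = c(2, 3)) give a 3 x 2 matrix
col_means <- apply(w, c(2, 3), mean)
col_means["X", "N"]                   # mean of 1:4, i.e. 2.5
```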
A list is a general form of a vector whose components can be of different types and dimensions. To create a list, we use list(...). We can name the elements of a list using the name = value syntax. Arguments can be specified with or without names. In the example below, we create a list with three elements: the first element is named num, the second element has no name, and the third element is named identity.
# Creating a list
x = list(num = c(1,2,3), "Econometrics", identity=diag(2))
x
$num
[1] 1 2 3
[[2]]
[1] "Econometrics"
$identity
     [,1] [,2]
[1,]    1    0
[2,]    0    1
# Names of list elements
names(x)
[1] "num" "" "identity"
We use [ ], [[ ]] and $ to reference elements of a list. Below, we provide some examples of referencing elements of the list x.
x[[2]] # the second element of x
[1] "Econometrics"
x[["num"]] # element named "num"
[1] 1 2 3
x$identity # element named "identity"
     [,1] [,2]
[1,]    1    0
[2,]    0    1
x[[3]][1,] # first row of the third element
[1] 1 0
x[1:2] # a sublist from the first two elements
$num
[1] 1 2 3
[[2]]
[1] "Econometrics"
In Table 2.3, we provide some useful functions for lists.
| Function | Description |
|---|---|
| lapply() | Apply a function to each element of a list; returns a list |
| sapply() | Same as lapply(), but returns a vector or matrix by default |
| vapply() | Similar to sapply(), but has a pre-specified type of return value |
| replicate() | Repeated evaluation of an expression; useful for replicating lists |
| unlist(x) | Produce a vector of all the components that occur in x |
| length(x) | Number of objects in x |
| names(x) | Names of the objects in x |
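The difference between lapply(), sapply() and vapply() is easiest to see on a small list. The sketch below uses a made-up list called scores; the names means and lens are also illustrative.

```r
# A list whose elements have different lengths (illustrative data)
scores <- list(a = c(1, 2, 3), b = c(10, 20), c = 5)

# lapply() always returns a list
lapply(scores, mean)

# sapply() simplifies the result to a named vector when possible
means <- sapply(scores, mean)
means                      # a = 2, b = 15, c = 5

# vapply() requires the type and length of each result to be declared
lens <- vapply(scores, length, FUN.VALUE = integer(1))
lens                       # a = 3, b = 2, c = 1

# unlist() flattens the list into a single vector of 6 elements
unlist(scores)
```

Because vapply() stops with an error when a result does not match FUN.VALUE, it tends to be the safer choice inside functions, while sapply() is convenient for interactive use.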
In R, we use numeric to represent real numbers. Numeric data can be either double or integer, but in practice numeric data is almost always double (type double refers to real numbers). We can use the format() function to format an object for pretty printing. See ?format for additional arguments.
x = 123.456789
is.numeric(x) # check if x is numeric
[1] TRUE
is.double(x) # check if x is double
[1] TRUE
is.integer(x) # check if x is integer
[1] FALSE
format(13.7, nsmall = 3) # Minimum number of digits to the right of the decimal point
[1] "13.700"
format(2^16, scientific = TRUE) # scientific notation
[1] "6.5536e+04"
format(2^16, scientific = FALSE) # fixed notation
[1] "65536"
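The examples above only show a double. As a brief sketch of the integer case, the L suffix creates an integer constant; the variable n is illustrative.

```r
n <- 5L                 # the L suffix creates an integer

is.integer(n)           # TRUE
is.numeric(n)           # TRUE: integers are also numeric
typeof(n)               # "integer"
typeof(5)               # "double": a plain number is stored as a double
typeof(1:3)             # "integer": the : operator produces integers

as.integer(3.9)         # 3: conversion truncates toward zero
n / 2L                  # 2.5: ordinary division returns a double
n %/% 2L                # 2: integer division keeps the integer type
```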
Boolean (or logical) values are represented by the keywords TRUE and FALSE in all caps or simply T and F. We can use logical operators to compare numeric values or vectors element-wise. The result of a comparison is a logical value (TRUE or FALSE). In Table 2.4, we provide some useful functions for logical and relational operations.
| Function | Description |
|---|---|
| !x | NOT x |
| x & y | x AND y element-wise; returns a vector |
| x && y | x AND y; returns a single value |
| x \| y | x OR y element-wise; returns a vector |
| x \|\| y | x OR y; returns a single value |
| xor(x, y) | Exclusive OR of x and y, element-wise |
| x %in% y | x IN y |
| x < y | x < y |
| x > y | x > y |
| x <= y | x ≤ y |
| x >= y | x ≥ y |
| x == y | x = y |
| x != y | x ≠ y |
| isTRUE(x) | TRUE if x is TRUE |
| all(...) | TRUE if all arguments are TRUE |
| any(...) | TRUE if at least one argument is TRUE |
| identical(x, y) | Safe and reliable way to test two objects for being EXACTLY equal |
| all.equal(x, y) | Test if two objects are NEARLY equal |
Below, we provide some illustrative examples.
x = 1:10
(x%%2 == 0) | (x > 5) # Even numbers or greater than 5
[1] FALSE TRUE FALSE TRUE FALSE TRUE TRUE TRUE TRUE TRUE
y = 5:15
x %in% y # Which elements of x are also in y?
[1] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
x[x %in% y] # Elements of x that are also in y
[1] 5 6 7 8 9 10
any(x > 5) # Is at least one element of x greater than 5?
[1] TRUE
all(x > 5) # Are all elements of x greater than 5?
[1] FALSE
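Table 2.4 distinguishes the element-wise operators & and | from the single-value operators && and ||. The sketch below contrasts them; the vectors a and b and the names in_range and x0 are illustrative. Note that in recent versions of R, passing a vector of length greater than one to && or || is an error, so they are used here with scalars only.

```r
a <- c(TRUE, FALSE, TRUE)
b <- c(TRUE, TRUE, FALSE)

a & b         # element-wise AND: TRUE FALSE FALSE
a | b         # element-wise OR:  TRUE TRUE TRUE
xor(a, b)     # exclusive OR:     FALSE TRUE TRUE
!a            # negation:         FALSE TRUE FALSE

# && and || work on single values, which suits if() conditions
x0 <- 7
in_range <- (x0 > 5) && (x0 < 10)
in_range      # TRUE

# all() and any() reduce a logical vector to a single value
all(a | b)    # TRUE
any(a & b)    # TRUE
```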
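In general, comparison operators work element-wise, so they may not produce a single value, and they return NA for elements that are NA or NaN. In addition, == compares floating-point numbers exactly, so rounding error can make mathematically equal values compare as unequal. For a comparison with a numerical tolerance, we use all.equal(), as in the example below.
x = sqrt(2)
x^2
[1] 2
x^2 == 2 # Returns FALSE due to rounding error
[1] FALSE
identical(x^2, 2) # Returns FALSE because not exactly equal
[1] FALSE
all.equal(x^2, 2) # Returns TRUE because nearly equal
[1] TRUE
all.equal(x^2, 1) # Returns a message indicating the difference
[1] "Mean relative difference: 0.5"
isTRUE(all.equal(x^2, 1)) # Returns FALSE
[1] FALSE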
In R, the text data type is called character. We use single or double quotes to create character strings.
s1 = c("A", "B", "C")
class(s1) # check the class of s1
[1] "character"
s2 = 'The world is beautiful'
class(s2) # check the class of s2
[1] "character"
In Table 2.5, we provide some useful functions for character vectors.
| Function | Description |
|---|---|
| cat() | Concatenate objects and print to console (\n for newline) |
| paste() | Concatenate objects and return a string |
| print() | Print an object |
| substr() | Extract or replace substrings in a character vector |
| strtrim() | Trim character vectors to specified display widths |
| strsplit() | Split elements of a character vector according to a substring |
| grep() | Search for matches to a pattern within a character vector; returns a vector of the indices that matched |
| grepl() | Like grep(), but returns a logical vector |
| agrep() | Similar to grep(), but searches for approximate matches |
| regexpr() | Similar to grep(), but returns the position of the first instance of a pattern within a string |
| gsub() | Replace all occurrences of a pattern with a character vector |
| sub() | Like gsub(), but only replaces the first occurrence |
| tolower(), toupper() | Convert to all lower/upper case |
| noquote() | Print a character vector without quotations |
| nchar() | Number of characters |
| letters, LETTERS | Built-in vector of lower and upper case letters |
Below, we provide some illustrative examples.
animals = c("bird", "horse", "fish")
home = c("tree", "barn", "lake")
length(animals) # Number of elements in the vector
[1] 3
nchar(animals) # Number of characters in each string
[1] 4 5 4
cat("Animals:", animals) # Need \n to move cursor to a newline
Animals: bird horse fish
cat(animals, home, "\n") # Joins one vector after the other
bird horse fish tree barn lake
paste(animals, collapse=" ") # Create one long string of animals
[1] "bird horse fish"
a = paste(animals, home, sep=".") # Pairwise joining of animals and home
a
[1] "bird.tree" "horse.barn" "fish.lake"
unlist(strsplit(a, ".", fixed=TRUE)) # Split the strings back
[1] "bird" "tree" "horse" "barn" "fish" "lake"
substr(animals, 2, 4) # Get characters 2-4 of each animal
[1] "ird" "ors" "ish"
strtrim(animals, 3) # Print the first three characters
[1] "bir" "hor" "fis"
toupper(animals) # Print animals in all upper case
[1] "BIRD" "HORSE" "FISH"
In Table 2.5, we provide some functions for pattern matching and replacement: grep(), grepl(), agrep(), regexpr(), gsub(), and sub(). We can use special characters in regular expressions to define search patterns. Some common special characters are listed below.
- ^: Beginning of character string
- $: End of character string
- .: Any single character
- *: Zero or more of the preceding character
- +: One or more of the preceding character
- ?: Zero or one of the preceding character
- {n}: Exactly n of the preceding character
- [a-c]: Any one of the characters a, b, or c
- [^a-c]: Any one character other than a, b, or c

To illustrate the use of these functions, we use the built-in colors() function that returns a vector of color names in R.
# First ten colors
colors()[1:10]
[1] "white" "aliceblue" "antiquewhite" "antiquewhite1"
[5] "antiquewhite2" "antiquewhite3" "antiquewhite4" "aquamarine"
[9] "aquamarine1" "aquamarine2"
# Number of colors
length(colors())
[1] 657
colors()[grep("red", colors())] # All colors that contain "red"
[1] "darkred" "indianred" "indianred1" "indianred2"
[5] "indianred3" "indianred4" "mediumvioletred" "orangered"
[9] "orangered1" "orangered2" "orangered3" "orangered4"
[13] "palevioletred" "palevioletred1" "palevioletred2" "palevioletred3"
[17] "palevioletred4" "red" "red1" "red2"
[21] "red3" "red4" "violetred" "violetred1"
[25] "violetred2" "violetred3" "violetred4"
colors()[grep("^red", colors())] # Colors that start with "red"
[1] "red" "red1" "red2" "red3" "red4"
colors()[grep("red$", colors())] # Colors that end with "red"
[1] "darkred" "indianred" "mediumvioletred" "orangered"
[5] "palevioletred" "red" "violetred"
colors()[grep("red.", colors())] # Colors with at least one character after "red"
[1] "indianred1" "indianred2" "indianred3" "indianred4"
[5] "orangered1" "orangered2" "orangered3" "orangered4"
[9] "palevioletred1" "palevioletred2" "palevioletred3" "palevioletred4"
[13] "red1" "red2" "red3" "red4"
[17] "violetred1" "violetred2" "violetred3" "violetred4"
colors()[grep("^[r-t]", colors())] # Colors that begin with r, s, or t
[1] "red" "red1" "red2" "red3" "red4"
[6] "rosybrown" "rosybrown1" "rosybrown2" "rosybrown3" "rosybrown4"
[11] "royalblue" "royalblue1" "royalblue2" "royalblue3" "royalblue4"
[16] "saddlebrown" "salmon" "salmon1" "salmon2" "salmon3"
[21] "salmon4" "sandybrown" "seagreen" "seagreen1" "seagreen2"
[26] "seagreen3" "seagreen4" "seashell" "seashell1" "seashell2"
[31] "seashell3" "seashell4" "sienna" "sienna1" "sienna2"
[36] "sienna3" "sienna4" "skyblue" "skyblue1" "skyblue2"
[41] "skyblue3" "skyblue4" "slateblue" "slateblue1" "slateblue2"
[46] "slateblue3" "slateblue4" "slategray" "slategray1" "slategray2"
[51] "slategray3" "slategray4" "slategrey" "snow" "snow1"
[56] "snow2" "snow3" "snow4" "springgreen" "springgreen1"
[61] "springgreen2" "springgreen3" "springgreen4" "steelblue" "steelblue1"
[66] "steelblue2" "steelblue3" "steelblue4" "tan" "tan1"
[71] "tan2" "tan3" "tan4" "thistle" "thistle1"
[76] "thistle2" "thistle3" "thistle4" "tomato" "tomato1"
[81] "tomato2" "tomato3" "tomato4" "turquoise" "turquoise1"
[86] "turquoise2" "turquoise3" "turquoise4"
colors()[grep("green.*", colors())] # Colors that contain "green" followed by zero or more characters
[1] "darkgreen" "darkolivegreen" "darkolivegreen1"
[4] "darkolivegreen2" "darkolivegreen3" "darkolivegreen4"
[7] "darkseagreen" "darkseagreen1" "darkseagreen2"
[10] "darkseagreen3" "darkseagreen4" "forestgreen"
[13] "green" "green1" "green2"
[16] "green3" "green4" "greenyellow"
[19] "lawngreen" "lightgreen" "lightseagreen"
[22] "limegreen" "mediumseagreen" "mediumspringgreen"
[25] "palegreen" "palegreen1" "palegreen2"
[28] "palegreen3" "palegreen4" "seagreen"
[31] "seagreen1" "seagreen2" "seagreen3"
[34] "seagreen4" "springgreen" "springgreen1"
[37] "springgreen2" "springgreen3" "springgreen4"
[40] "yellowgreen"
colors()[grep("green.+", colors())] # Colors that contain "green" followed by one or more characters
[1] "darkolivegreen1" "darkolivegreen2" "darkolivegreen3" "darkolivegreen4"
[5] "darkseagreen1" "darkseagreen2" "darkseagreen3" "darkseagreen4"
[9] "green1" "green2" "green3" "green4"
[13] "greenyellow" "palegreen1" "palegreen2" "palegreen3"
[17] "palegreen4" "seagreen1" "seagreen2" "seagreen3"
[21] "seagreen4" "springgreen1" "springgreen2" "springgreen3"
[25] "springgreen4"
Finally, we illustrate the use of gsub() and sub() functions for pattern replacement.
places = c("home", "zoo", "school", "work", "park")
gsub("o", "O", places) # Replace all "o" with "O"
[1] "hOme" "zOO" "schOOl" "wOrk" "park"
sub("o", "O", places) # Replace the first "o" with "O"
[1] "hOme" "zOo" "schOol" "wOrk" "park"
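Table 2.5 also lists grepl() and regexpr(), which the examples above do not use. The sketch below applies them to the same places vector; the names has_oo, ends_k and first_o are illustrative.

```r
places <- c("home", "zoo", "school", "work", "park")

# grepl() returns one TRUE/FALSE per element
has_oo <- grepl("oo", places)
has_oo                          # FALSE TRUE TRUE FALSE FALSE

# grep() with value = TRUE returns the matching strings instead of indices
ends_k <- grep("k$", places, value = TRUE)
ends_k                          # "work" "park"

# regexpr() gives the position of the first match, or -1 if there is none
first_o <- as.integer(regexpr("o", places))
first_o                         # 2 2 4 2 -1
```

Using value = TRUE is a shorter alternative to the x[grep(pattern, x)] idiom used with colors() earlier.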
In R, we use the factor data type to represent ordered or unordered categorical variables. The function factor() is used to create factor type variables.
factor(rep(1:2, 4), labels=c("BA", "BS")) # Unordered factor variable
[1] BA BS BA BS BA BS BA BS
Levels: BA BS
factor(rep(1:3, 4), labels=c("low", "med", "high"), ordered=TRUE) # Ordered factor variable
[1] low med high low med high low med high low med high
Levels: low < med < high
In the following example, we load the STAR.csv dataset that contains a variable gender indicating the gender of students. The initial class of gender is character. We convert gender to a factor variable with labels “0” and “1” using the factor() function.
# Load STAR.csv dataset
star = read.csv("data/STAR.csv")
class(star$gender) # Check the class of gender
[1] "character"
star$gender0 = factor(star$gender, labels=c("0", "1"), ordered=FALSE) # Convert to factor variable
class(star$gender0) # Check the class again
[1] "factor"
head(star[, c("gender", "gender0")]) # View the first few rows
     gender gender0
1122 female 0
1137 female 0
1143 female 0
1160 male 1
1183 male 1
1195 male 1
In Table 2.6, we provide some useful functions for factor variables.
| Function | Description |
|---|---|
| levels(x) | Retrieve or set the levels of x |
| nlevels(x) | Returns the number of levels in x |
| relevel(x, ref) | Levels of x are reordered so that the level specified by ref is first |
| reorder() | Reorders levels based on the values of a second variable |
| gl() | Generate factors by specifying the pattern of their levels |
| cut(x, breaks) | Divides the range of x into intervals (factors) determined by breaks |
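If we want to convert a factor variable to a numeric variable, we need to be careful: as.numeric(f) returns the internal integer codes of the levels (1, 2, ...), not the labels. To recover labels that are numbers, we first convert the factor to character, i.e., as.numeric(as.character(f)), as illustrated below.
star$gender1 = as.numeric(star$gender0) # Returns the integer codes 1 and 2, not the labels 0 and 1
class(star$gender1)
[1] "numeric"
head(star[, c("gender", "gender1")]) # View the first few rows
     gender gender1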
1122 female 1
1137 female 1
1143 female 1
1160 male 2
1183 male 2
1195 male 2
star$gender2 = as.numeric(as.character(star$gender0)) # Convert factor to numeric correctly
head(star[, c("gender", "gender1", "gender2")]) # View the first few rows
     gender gender1 gender2
1122 female 1 0
1137 female 1 0
1143 female 1 0
1160 male 2 1
1183 male 2 1
1195 male 2 1
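The STAR example depends on an external file, so the same point can be checked with a small self-contained sketch. The vector g below is made-up data that mimics the gender column; the names f, codes and values are illustrative.

```r
# Made-up data standing in for star$gender
g <- c("female", "female", "male")

# Levels are sorted alphabetically, so "female" gets label "0" and "male" gets "1"
f <- factor(g, labels = c("0", "1"))
levels(f)                              # "0" "1"

# as.integer()/as.numeric() on a factor return the internal codes
codes <- as.integer(f)
codes                                  # 1 1 2

# Converting through character recovers the labels as numbers
values <- as.numeric(as.character(f))
values                                 # 0 0 1

# An equivalent approach that converts the levels only once
as.numeric(levels(f))[f]               # 0 0 1
```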
In the following examples, we illustrate some of the functions listed in Table 2.6.
f = gl(3, 2, labels=paste("ECN", 1:3, sep="_")) # Create a factor variable
f
[1] ECN_1 ECN_1 ECN_2 ECN_2 ECN_3 ECN_3
Levels: ECN_1 ECN_2 ECN_3
levels(f) # Get the levels of f
[1] "ECN_1" "ECN_2" "ECN_3"
nlevels(f) # Get the number of levels of f
[1] 3
relevel(f, "ECN_2") # Reorder levels so that "ECN_2" is first
[1] ECN_1 ECN_1 ECN_2 ECN_2 ECN_3 ECN_3
Levels: ECN_2 ECN_1 ECN_3
f = gl(3, 2, labels=1:3) # Create a factor variable
as.numeric(levels(f))[f] # Convert factor to numeric correctly
[1] 1 1 2 2 3 3
# Cutting a numeric vector into intervals (x is random, so the output varies between runs)
x = runif(10)
cut(x, 3) # Cut x into three intervals
[1] (0.0434,0.246] (0.0434,0.246] (0.0434,0.246] (0.0434,0.246] (0.246,0.448]
[6] (0.448,0.65] (0.448,0.65] (0.0434,0.246] (0.448,0.65] (0.246,0.448]
Levels: (0.0434,0.246] (0.246,0.448] (0.448,0.65]
cut(x, c(0,.25,.5,.75,1)) # Cut x at the given cut points
[1] (0,0.25] (0,0.25] (0,0.25] (0,0.25] (0.25,0.5] (0.5,0.75]
[7] (0.5,0.75] (0,0.25] (0.5,0.75] (0.25,0.5]
Levels: (0,0.25] (0.25,0.5] (0.5,0.75] (0.75,1]
We use the Date class in R for calendar dates (year-month-day). We can add days to or subtract days from a date. We can also compare dates using logical operators. In Table 2.7, we provide some useful functions for date objects.
| Function | Description |
|---|---|
| Sys.Date() | Current date |
| as.Date() | Convert a character string to a date object |
| format.Date() | Change the format of a date object |
| seq.Date() | Generate sequence of dates |
| cut.Date() | Cut dates into intervals |
| weekdays(), months(), quarters() | Extract parts of a date object |
| julian() | Number of days since a given origin |
The Date suffix is optional for calling format.Date(), seq.Date() and cut.Date(), but is necessary for viewing the appropriate documentation. To convert a character string to a date object, we use as.Date(x, format), where x is a character vector representing dates and format is a character string representing the date format of x. In Table 2.8, we list some common conversion specifications for date formats.
| Specifier | Description |
|---|---|
| %a | Abbreviated weekday name |
| %A | Full weekday name |
| %d | Day of the month (01-31) |
| %B | Full month name |
| %b | Abbreviated month name |
| %m | Numeric month (01-12) |
| %y | Year without century |
| %Y | Year with century |
Below, we provide some illustrative examples.
d1 = c("5-01-2008", "19-08-2008", "2-02-2009", "29-09-2009")
dates1 = as.Date(d1, format = "%d-%m-%Y") # Convert to date object
class(dates1) # Check the class
[1] "Date"
dates1 # View the date object
[1] "2008-01-05" "2008-08-19" "2009-02-02" "2009-09-29"
Here is an example with dates written in year/month/day order.
d2 = c("2008/01/05", "2008/08/19", "2009/02/02", "2009/09/29")
dates2 = as.Date(d2, format="%Y/%m/%d")
class(dates2) # Check the class
[1] "Date"
dates2 # View the date object
[1] "2008-01-05" "2008-08-19" "2009-02-02" "2009-09-29"
To create a sequence of dates, we can use seq.Date(from, to, by, length.out = NULL), where
- from, to: Start and end date objects
- by: A character string containing one of "day", "week", "month" or "year", optionally preceded by an integer and a space (e.g., "3 days")
- length.out: Integer, desired length of the sequence

Below, we provide some illustrative examples.
seq.Date(as.Date("2011/1/1"), as.Date("2011/1/31"), by = "week")
[1] "2011-01-01" "2011-01-08" "2011-01-15" "2011-01-22" "2011-01-29"
seq.Date(as.Date("2011/1/1"), as.Date("2011/1/31"), by = "3 days")
[1] "2011-01-01" "2011-01-04" "2011-01-07" "2011-01-10" "2011-01-13"
[6] "2011-01-16" "2011-01-19" "2011-01-22" "2011-01-25" "2011-01-28"
[11] "2011-01-31"
seq.Date(as.Date("2011/1/1"), by = "week", length.out = 10)
[1] "2011-01-01" "2011-01-08" "2011-01-15" "2011-01-22" "2011-01-29"
[6] "2011-02-05" "2011-02-12" "2011-02-19" "2011-02-26" "2011-03-05"
To divide a sequence of dates into levels, we can use cut.Date(x, breaks, start.on.monday = TRUE).
jan = seq.Date(as.Date("2011/1/1"), as.Date("2011/1/31"), by="days")
cut(jan, breaks="weeks")
[1] 2010-12-27 2010-12-27 2011-01-03 2011-01-03 2011-01-03 2011-01-03
[7] 2011-01-03 2011-01-03 2011-01-03 2011-01-10 2011-01-10 2011-01-10
[13] 2011-01-10 2011-01-10 2011-01-10 2011-01-10 2011-01-17 2011-01-17
[19] 2011-01-17 2011-01-17 2011-01-17 2011-01-17 2011-01-17 2011-01-24
[25] 2011-01-24 2011-01-24 2011-01-24 2011-01-24 2011-01-24 2011-01-24
[31] 2011-01-31
6 Levels: 2010-12-27 2011-01-03 2011-01-10 2011-01-17 ... 2011-01-31
Recall that date objects can be used in arithmetic operations. Specifically, we can add days to or subtract days from a date, compute the number of days between two dates, and compare dates using logical operators.
Below, we provide some illustrative examples.
jan1 = as.Date("2011/1/1")
(jan8 = jan1 + 7) # Add 7 days to 2011/1/1
[1] "2011-01-08"
jan1 - 14 # Subtract 2 weeks from 2011/1/1
[1] "2010-12-18"
jan8 - jan1 # Number of days between 2011/1/1 and 2011/1/8
Time difference of 7 days
jan8 > jan1 # Compare dates
[1] TRUE
# Use format to extract parts of a date object or change the appearance
format.Date(jan8, "%Y")
[1] "2011"
format.Date(jan8, "%b-%d") # The month abbreviation depends on the locale (this output is from a Turkish locale)
[1] "Oca-08"
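Because month and weekday names depend on the locale, numeric format specifiers are a safer choice when results must be reproducible across machines. The sketch below uses only locale-independent operations; the dates and the names gap and first_of_month are illustrative.

```r
jan1 <- as.Date("2011-01-01")
mar1 <- as.Date("2011-03-01")

# The difference between two dates is a difftime object measured in days
gap <- mar1 - jan1
as.numeric(gap)                          # 59 (31 days in January + 28 in February)

# difftime() lets us choose the unit explicitly
difftime(mar1, jan1, units = "weeks")

# Numeric specifiers give the same result in every locale
format(jan1, "%d/%m/%Y")                 # "01/01/2011"
as.numeric(format(jan1, "%Y"))           # 2011

# First day of three consecutive months
first_of_month <- seq(jan1, by = "month", length.out = 3)
first_of_month                           # "2011-01-01" "2011-02-01" "2011-03-01"
```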
Missing data in R are represented by the keyword NA. The built-in functions in R do not all handle missing data in the same way. For example, the mean() function ignores NA only if the argument na.rm=TRUE is supplied, whereas the which() function always ignores missing data.
x = c(4, 7, 2, 0, 1, NA)
mean(x)
[1] NA
mean(x, na.rm=TRUE)
[1] 2.8
which(x > 4)
[1] 2
We need to check the documentation of functions to see how they handle missing data.
We use NaN (Not a Number) to denote quantities that are not a number, such as 0/0. In R, NaN implies NA (NaN refers to unavailable numeric data and NA refers to any type of unavailable data). Undefined or null objects are denoted in R by NULL.
# Undefined row names
x = matrix(1:4, ncol=2, dimnames=list(NULL, c("c.1", "c.2")))
x
     c.1 c.2
[1,]   1   3
[2,]   2   4
# NULL value in a list
x = list(a=1:5, b=NULL, c="Econometrics")
x
$a
[1] 1 2 3 4 5
$b
NULL
$c
[1] "Econometrics"
# Empty list
y = list()
y
list()
In Table 2.9, we provide some useful functions for testing missing data.
| Function | Description |
|---|---|
| is.na(x) | Tests for NA or NaN data in x |
| is.nan(x) | Tests for NaN data in x |
| is.null(x) | Tests if x is NULL |

x = c(4, 7, 2, 0, 1, NA)
(x == NA)
[1] NA NA NA NA NA NA
is.na(x) # Check which elements are NA
[1] FALSE FALSE FALSE FALSE FALSE TRUE
is.nan(x) # Check which elements are NaN
[1] FALSE FALSE FALSE FALSE FALSE FALSE
any(is.na(x)) # Check if there is any NA in x
[1] TRUE
y <- x/0 # Create NaN values
y
[1] Inf Inf Inf NaN Inf NA
is.nan(y)
[1] FALSE FALSE FALSE TRUE FALSE FALSE
is.na(y)
[1] FALSE FALSE FALSE TRUE FALSE TRUE
The first example above shows that we cannot use the equality operator == to test for NA values. Instead, we should use the is.na() function.
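In practice, is.na() is most often used to count or drop missing values. The sketch below shows a few common idioms on the same vector x; the names n_missing and x_clean are illustrative.

```r
x <- c(4, 7, 2, 0, 1, NA)

# Count the missing values
n_missing <- sum(is.na(x))
n_missing                 # 1

# Keep only the observed values
x_clean <- x[!is.na(x)]
x_clean                   # 4 7 2 0 1
mean(x_clean)             # 2.8, same as mean(x, na.rm = TRUE)

# Logical subsetting keeps an NA for every NA in the condition ...
x[x > 4]                  # 7 NA
# ... whereas which() drops it
x[which(x > 4)]           # 7

# NULL is an object of length zero, not a missing value
is.null(NULL)             # TRUE
length(NULL)              # 0
```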