2 Data Structures in R

2.1 Introduction

In this chapter, we will introduce the basic data structures and types. R provides a variety of data structures to store and manipulate data efficiently. Understanding these data structures is crucial for effective data analysis and manipulation in R.

2.2 Vectors

In R, a vector is an ordered collection of objects of the same type.
We can create a vector using one of the following functions.

  • c() is the most general way to create a vector.
  • : creates a sequence of integers.
  • seq() is used to generate regular sequences.
  • rep() is useful to generate vectors of replicated elements.

Some illustrative examples are given below.

# Creating vectors
v1 = c(2.5, 4, 7.3, 0.1)
v1
[1] 2.5 4.0 7.3 0.1
# Creating character vector
v2 = c("A", "B", "C", "D")
v2
[1] "A" "B" "C" "D"
# Creating integer sequence
v3 = -3:3
v3
[1] -3 -2 -1  0  1  2  3
# Creating sequences with seq()
seq(0, 2, by = 0.5) # increment by 0.5
[1] 0.0 0.5 1.0 1.5 2.0
# Creating sequences with seq()
seq(0, 2, len = 6) # length of sequence is 6
[1] 0.0 0.4 0.8 1.2 1.6 2.0
# Creating replicated vectors with rep()
rep(1:5, each = 2) # repeat each element twice
 [1] 1 1 2 2 3 3 4 4 5 5
rep(1:5, times = 2) # repeat the whole vector twice
 [1] 1 2 3 4 5 1 2 3 4 5
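
Because all elements of a vector must share one type, c() coerces mixed inputs to the most general type among them. The short sketch below illustrates this behavior; the object names are illustrative only.

```r
# Mixing types in c() coerces everything to the most general type:
# logical < integer < double < character
mixed_num <- c(1, TRUE, FALSE)   # logicals become 1 and 0
mixed_chr <- c(1, "a", TRUE)     # everything becomes character
class(mixed_num)                 # "numeric"
class(mixed_chr)                 # "character"
mixed_chr                        # "1" "a" "TRUE"
```

Checking class() after combining values is a quick way to spot an unintended coercion.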

To index certain element(s) of a vector, we use [ ] with a vector/scalar of positions to reference the elements of the vector. Including a minus sign before the vector/scalar removes the indexed elements from the vector.

# Referencing elements of a vector
x <- c(4, 7, 2, 10, 1, 0)
x[4] # return the fourth element
[1] 10
# Return elements from index 1 to 3
x[1:3] 
[1] 4 7 2
# Return elements at indices 2, 5 and 6
x[c(2,5,6)] 
[1] 7 1 0
# Remove the third element from x
x[-3] 
[1]  4  7 10  1  0
# Remove multiple elements from x
x[-c(4,5)] 
[1] 4 7 2 0
# Logical referencing
x[x>4] # return elements bigger than 4
[1]  7 10
# Modifying elements of a vector
x[3] <- 999 
x
[1]   4   7 999  10   1   0

The following additional functions can be useful to return the indices of a vector.

  • which(): returns the position or the index of the value which satisfies the given condition.
  • which.max(): returns the location of the (first) maximum element of a numeric vector.
  • which.min(): returns the location of the (first) minimum element of a numeric vector.
  • match(): returns the first position of an element of a vector in another vector.
x <- c(4, 7, 2, 10, 1, 0)
x>=4 # return a logical vector of TRUE and FALSE
[1]  TRUE  TRUE FALSE  TRUE FALSE FALSE
# Return indices of elements satisfying the condition
which(x>=4) # return indices
[1] 1 2 4
# Return indices of maximum element
which.max(x) 
[1] 4
# Return the maximum element using which.max()
x[which.max(x)] # return the first maximum element
[1] 10
# Return the maximum element using max()
max(x)
[1] 10
# Using match()
y <- rep(1:5, times=5:1)
y
 [1] 1 1 1 1 1 2 2 2 2 3 3 3 4 4 5
match(1:5, y) # return the first position in y of each element of 1:5
[1]  1  6 10 13 15
# Return unique elements
unique(y)
[1] 1 2 3 4 5
match(unique(y), y) # return the first position in y of each unique element
[1]  1  6 10 13 15

When vectors are used in math expressions, the operations are performed element-wise.

# Element-wise operations
x = c(4, 7, 2, 10, 1, 0)
y = x^2 + 1
y
[1]  17  50   5 101   2   1
x*y # element-wise multiplication
[1]   68  350   10 1010    2    0
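
Element-wise operations also work when the two vectors differ in length, because R recycles the shorter one. A minimal sketch of this recycling rule, with illustrative variable names:

```r
a <- c(1, 2, 3, 4, 5, 6)
b <- c(10, 20)
a + b        # b is recycled: 11 22 13 24 15 26
a * 2        # a scalar is a length-one vector, recycled to every element
# If the longer length is not a multiple of the shorter one,
# R still recycles but issues a warning.
```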

In Table 2.1, we provide some useful functions for vector operations.

Table 2.1: Useful functions for vectors
Function Description
sum(x), prod(x) Sum/product of the elements of x
cumsum(x), cumprod(x) Cumulative sum/product of the elements of x
min(x), max(x) Minimum/maximum element of x
mean(x), median(x) Mean/median of x
var(x), sd(x) Variance/standard deviation of x
cov(x, y), cor(x, y) Covariance/correlation of x and y
range(x) Range of x
quantile(x) Quantiles of x for the given probabilities
fivenum(x) Five number summary of x
length(x) Number of elements in x
unique(x) Unique elements of x
rev(x) Reverse the elements of x
sort(x) Sort the elements of x
which(x) Indices of TRUEs in a logical vector
which.max(x), which.min(x) Index of the max/min element of x
match(x, y) First position in y of each element of x
union(x, y) Union of x and y
intersect(x, y) Intersection of x and y
setdiff(x, y) Elements of x that are not in y
setequal(x, y) Do x and y contain the same elements?

Below, we provide some illustrative examples.

# Useful functions for vectors
x <- c(4,7,2,10,1,0)
y <- 3*x^2 + 1
y
[1]  49 148  13 301   4   1
sum(x) # sum of elements
[1] 24
range(x) # range of elements
[1]  0 10
length(x) # number of elements
[1] 6
rev(x) # reverse the elements
[1]  0  1 10  2  7  4
sort(x) # sort in increasing order
[1]  0  1  2  4  7 10
sort(x, decreasing = TRUE) # sort in decreasing order
[1] 10  7  4  2  1  0
which(x==7) # return indices
[1] 2
union(x,y) # union of x and y
 [1]   4   7   2  10   1   0  49 148  13 301
setdiff(x,y) # elements in x but not in y
[1]  7  2 10  0
intersect(x,y) # intersection of x and y
[1] 4 1
setequal(x,y) # do x and y contain the same elements?
[1] FALSE
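
Some functions from Table 2.1 are not covered by the examples above. The following sketch tries a few of them on the same vector; the probabilities passed to quantile() are an illustrative choice.

```r
x <- c(4, 7, 2, 10, 1, 0)
cumsum(x)                                # running total: 4 11 13 23 24 24
median(x)                                # 3
quantile(x, probs = c(0.25, 0.5, 0.75))  # quartiles (default method, type 7)
fivenum(x)                               # min, lower hinge, median, upper hinge, max
match(c(10, 3), x)                       # 4 NA: 10 is at position 4, 3 is absent
```

Note that quantile() and fivenum() can give slightly different quartiles, since they use different definitions.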

2.3 Matrices

A matrix is a two-dimensional generalization of a vector. To create a matrix, we use the function matrix(), with the syntax

matrix(data=NA, nrow=1, ncol=1, byrow = FALSE, dimnames = NULL)
  • data is a vector that gives data to fill the matrix
  • nrow is the desired number of rows
  • ncol is the desired number of columns
  • byrow is set to FALSE by default, which means matrix is filled by columns. Otherwise, matrix is filled by rows.
  • dimnames is an optional list of length 2 giving the row and column names, respectively.
# Creating a matrix
x = matrix(c(5,0,6,1,3,5,9,5,7,1,5,3), 
           nrow = 3, ncol = 4, 
           byrow = TRUE,
           dimnames = list(rows = c("r.1", "r.2", "r.3"),
                           cols = c("c.1", "c.2", "c.3", "c.4")))
x
     cols
rows  c.1 c.2 c.3 c.4
  r.1   5   0   6   1
  r.2   3   5   9   5
  r.3   7   1   5   3

Some useful functions for matrices are given below.

class(x) # class of x
[1] "matrix" "array" 
colnames(x) # access to column names
[1] "c.1" "c.2" "c.3" "c.4"
rownames(x) # access to row names
[1] "r.1" "r.2" "r.3"
rownames(x)=c("A","B","C") # change the row names
dimnames(x) # access to both row and column names
$rows
[1] "A" "B" "C"

$cols
[1] "c.1" "c.2" "c.3" "c.4"
dimnames(x)$rows # access to row names
[1] "A" "B" "C"
dimnames(x)$cols # access to column names
[1] "c.1" "c.2" "c.3" "c.4"
dim(x) # dimensions of x
[1] 3 4
nrow(x) # number of rows
[1] 3
ncol(x) # number of columns
[1] 4

The elements of a matrix can be referenced using [ ], just as with vectors, but now with two indices (row and column).

x = matrix(c(5,0,6,1,3,5,9,5,7,1,5,3), 
           nrow = 3, ncol = 4)
x
     [,1] [,2] [,3] [,4]
[1,]    5    1    9    1
[2,]    0    3    5    5
[3,]    6    5    7    3
x[2, 3] # Element in Row 2, Column 3
[1] 5
x[1, ] # Row 1, all Columns
[1] 5 1 9 1
x[ , 2] # All Rows, Column 2
[1] 1 3 5
x[c(1, 3), ] # Rows 1 and 3, all columns
     [,1] [,2] [,3] [,4]
[1,]    5    1    9    1
[2,]    6    5    7    3

When matrices are used in math expressions, the operations are performed element-wise.

  • For matrix multiplication use the %*% operator.
  • If a vector is used in matrix multiplication, it will be coerced to either a row or column matrix to make the arguments conformable.
  • Using %*% on two vectors will return the inner product as a matrix and not a scalar.
A = matrix(1:4, nrow = 2)
B = matrix(1, nrow = 2, ncol = 2)
A*B # element-wise multiplication
     [,1] [,2]
[1,]    1    3
[2,]    2    4
A%*%B # matrix multiplication
     [,1] [,2]
[1,]    4    4
[2,]    6    6
y = 1:3
y%*%y # inner product as a matrix
     [,1]
[1,]   14
A/c(y%*%y) 
           [,1]      [,2]
[1,] 0.07142857 0.2142857
[2,] 0.14285714 0.2857143
# A/(y%*%y) # Error: non-conformable arrays
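
The coercion of vectors in matrix multiplication can be seen directly by multiplying a vector from either side. This is a small sketch; the vector v is introduced here only for illustration.

```r
A <- matrix(1:4, nrow = 2)
v <- c(1, 1)
A %*% v       # v is treated as a 2 x 1 column: row sums 4 and 6
v %*% A       # v is treated as a 1 x 2 row: column sums 3 and 7
v %o% v       # outer product: a 2 x 2 matrix of ones
crossprod(v)  # t(v) %*% v, the inner product as a 1 x 1 matrix
```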

We use the apply() function to apply a function to the margins of a matrix, array, or data frame. The syntax is

apply(X, MARGIN, FUN, ...)

where

  • X is a matrix, array or data frame,
  • MARGIN is a vector of subscripts indicating which margins to apply the function to (1=rows, 2=columns, c(1,2)=rows and columns),
  • FUN is the function to be applied,
  • ... stands for optional arguments for FUN.
# Using apply function
x = matrix(1:12, nrow = 3, ncol = 4)
apply(x, MARGIN=1, sum)  # Row sums
[1] 22 26 30
apply(x, 2, mean) # Column means
[1]  2  5  8 11
# Handling missing data with apply
x[1,1] <- NaN
apply(x, 2, mean, na.rm=TRUE) # Column means ignoring NaN
[1]  2.5  5.0  8.0 11.0
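
FUN does not have to be a built-in function. The sketch below passes an anonymous function and an extra argument through apply(); the matrix name m is illustrative.

```r
m <- matrix(1:12, nrow = 3, ncol = 4)
# Range (max - min) of each row, using an anonymous function
apply(m, 1, function(r) max(r) - min(r))   # 9 9 9
# Extra arguments after FUN are passed on to FUN
apply(m, 2, quantile, probs = 0.5)         # column medians: 2 5 8 11
# For simple sums and means, rowSums(), colSums(), rowMeans() and
# colMeans() give the same results as the corresponding apply() calls
rowSums(m)                                 # 22 26 30
```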

In Table 2.2, we list some useful functions for matrix operations.

Table 2.2: Useful functions for matrices
Function Description
t(A) Transpose of A
det(A) Determinant of A
solve(A, b) Solves the equation Ax = b for x
solve(A) Matrix inverse of A
MASS::ginv(A) Generalized inverse of A (MASS package)
eigen(A) Eigenvalues and eigenvectors of A
chol(A) Cholesky factorization of A
diag(n) Creates an n by n identity matrix
diag(A) Returns the diagonal elements of a matrix A
diag(x) Create a diagonal matrix from a vector x
lower.tri(A), upper.tri(A) Matrix of logicals indicating lower/upper triangular matrix
apply Apply a function to the margins of a matrix
rbind(...) Combines arguments by rows
cbind(...) Combines arguments by columns
dim(A) Dimensions of A
nrow(A), ncol(A) Number of rows/columns of A
colnames(A), rownames(A) Get or set the column/row names of A
dimnames(A) Get or set the dimension names of A
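
A few of the functions in Table 2.2 can be tried on a small system of linear equations. This is a minimal sketch; the matrix A and the vector b are made up for illustration.

```r
A <- matrix(c(2, 1, 1, 3), nrow = 2)   # symmetric 2 x 2 matrix
b <- c(1, 2)
t(A)              # transpose (equal to A here because A is symmetric)
det(A)            # 2*3 - 1*1 = 5
x <- solve(A, b)  # solves A x = b
A %*% x           # recovers b (as a 2 x 1 matrix)
A_inv <- solve(A) # matrix inverse
A_inv %*% A       # identity matrix, up to floating-point error
cbind(A, b)       # append b as an extra column
```

When only the solution of Ax = b is needed, solve(A, b) is generally preferable to computing solve(A) %*% b.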

2.4 Arrays

An array is a multi-dimensional generalization of a vector. To create an array, we use array(data = NA, dim = length(data), dimnames = NULL), where data is a vector that provides the values to fill the array; dim specifies the dimensions of the array (a vector of one or more elements giving the maximum indices in each dimension); and dimnames defines the names of the dimensions (a list with one component for each dimension, either NULL or a character vector of the length specified by dim for that dimension).

An array is filled by columns, similar to matrices: the first index varies fastest, then the second, and so on. The math operations on arrays are performed element-wise, similar to vectors and matrices. Also, all elements of an array must be of the same type.

# Creating an array
w = array(1:24, 
          dim = c(4, 3, 2),
          dimnames = list(c("A","B","C","D"), c("X","Y","Z"), c("N","M")))
w
, , N

  X Y  Z
A 1 5  9
B 2 6 10
C 3 7 11
D 4 8 12

, , M

   X  Y  Z
A 13 17 21
B 14 18 22
C 15 19 23
D 16 20 24

The option dim = c(4, 3, 2) specifies that the array has 4 rows, 3 columns, and 2 “pages” (or layers). Thus, the array w has 4 x 3 x 2 = 24 elements. We can think of c("N","M") as the names of the pages. Thus, w[ , , "N"] returns the first page and w[ , , "M"] returns the second page.

# Referencing elements of an array
w[ , , "N"] # First page
  X Y  Z
A 1 5  9
B 2 6 10
C 3 7 11
D 4 8 12
w[2, , ] # Second row, all columns, all pages
   N  M
X  2 14
Y  6 18
Z 10 22
w[ , "Y", ] # All rows, second column, all pages
  N  M
A 5 17
B 6 18
C 7 19
D 8 20
w[1, 2, 2] # Element in Row 1, Column 2, Page 2
[1] 17
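
The apply() function also works on arrays, with MARGIN selecting the dimensions to keep. A brief sketch on an array with the same dimensions as w, without dimension names:

```r
w <- array(1:24, dim = c(4, 3, 2))
dim(w)                   # 4 3 2
apply(w, 3, sum)         # sum of each page: 78 222
apply(w, c(1, 2), sum)   # 4 x 3 matrix: sum over pages for each (row, column)
```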

2.5 Lists

A list is a general form of a vector whose components can be of different types and dimensions. To create a list, we use list(...). We can name the elements of a list using the name = value syntax. Arguments can be specified with or without names. In the example below, we create a list with three elements: the first element is named num, the second element has no name, and the third element is named identity.

# Creating a list
x = list(num = c(1,2,3), "Econometrics", identity=diag(2))
x
$num
[1] 1 2 3

[[2]]
[1] "Econometrics"

$identity
     [,1] [,2]
[1,]    1    0
[2,]    0    1

We use [ ], [[ ]] and $ to reference elements of a list. Below, we provide some examples of referencing elements of the list x.

x[[2]]     # the second element of x
[1] "Econometrics"
x[["num"]] # element named "num"
[1] 1 2 3
x$identity # element named "identity"
     [,1] [,2]
[1,]    1    0
[2,]    0    1
x[[3]][1,] # first row of the third element 
[1] 1 0
x[1:2]     # a sublist from the first two elements
$num
[1] 1 2 3

[[2]]
[1] "Econometrics"

In Table 2.3, we provide some useful functions for lists.

Table 2.3: Useful functions for lists
Function Description
lapply() Apply a function to each element of a list; returns a list
sapply() Same as lapply(), but returns a vector or matrix by default
vapply() Similar to sapply(), but has a pre-specified type of return value
replicate() Repeated evaluation of an expression; useful for replicating lists
unlist(x) Produce a vector of all the components that occur in x
length(x) Number of objects in x
names(x) Names of the objects in x
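
The difference between lapply(), sapply() and vapply() is easiest to see side by side. The following sketch uses a made-up list called scores.

```r
scores <- list(a = c(1, 2, 3), b = c(10, 20), c = 5)
lapply(scores, mean)      # list of means: a = 2, b = 15, c = 5
sapply(scores, mean)      # same values simplified to a named vector
sapply(scores, length)    # a = 3, b = 2, c = 1
# vapply() checks that every result is a single numeric value
vapply(scores, sum, FUN.VALUE = numeric(1))   # a = 6, b = 30, c = 5
unlist(scores)            # flattened, with names a1, a2, a3, b1, b2, c
names(scores)             # "a" "b" "c"
```

Because vapply() fails early when a result has an unexpected type or length, it is often a safer choice than sapply() inside functions.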

2.6 Data types

2.6.1 Numeric data

Numeric data in R can be either integer or double, but in practice numeric data is almost always double (type double refers to real numbers). See ?integer and ?double. .Machine outputs numeric characteristics of the machine running R, such as the largest integer or the machine’s precision. format() formats an object for pretty printing. It is a generic function that also works with other types of objects. See ?format for additional arguments.

format(c(1, 10, 100, 1000), trim = FALSE)
[1] "   1" "  10" " 100" "1000"
format(c(1, 10, 100, 1000), trim = TRUE)
[1] "1"    "10"   "100"  "1000"
format(13.7, nsmall = 3)
[1] "13.700"
# nsmall - Minimum number of digits to the right of the decimal point
format(2^16, scientific = TRUE)
[1] "6.5536e+04"
# scientific - Use scientific notation
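
The distinction between the two numeric types, and the information stored in .Machine, can be checked as follows. This is a small sketch with illustrative variable names.

```r
x_int <- 5L          # the L suffix creates an integer
x_dbl <- 5           # numbers without L are double
typeof(x_int)        # "integer"
typeof(x_dbl)        # "double"
is.numeric(x_int)    # TRUE: both types count as numeric
typeof(1:3)          # "integer": the : operator yields integers here
.Machine$integer.max # largest representable integer, 2147483647
.Machine$double.eps  # machine precision for doubles, about 2.2e-16
```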

2.6.2 Booleans

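Boolean (or logical) values are represented by the reserved words TRUE and FALSE in all caps. The abbreviations T and F also work, but they are ordinary variables that can be reassigned, so TRUE and FALSE are safer in code.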

Function Description
!x NOT x
x & y x AND y element-wise; returns a vector
x && y x AND y; returns a single value
x | y x OR y element-wise; returns a vector
x || y x OR y; returns a single value
xor(x, y) Exclusive OR of x and y, element-wise
x %in% y x IN y
x < y x < y
x > y x > y
x <= y x ≤ y
x >= y x ≥ y
x == y x = y
x != y x ≠ y
isTRUE(x) TRUE if x is TRUE
all(...) TRUE if all arguments are TRUE
any(...) TRUE if at least one argument is TRUE
identical(x, y) Safe and reliable way to test two objects for being EXACTLY equal
all.equal(x, y) Test if two objects are NEARLY equal

Table: Useful logical and relational functions

x = 1:10
(x%%2 == 0) | (x > 5) # What elements of x are even or greater than 5
 [1] FALSE  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
y = 5:15
x %in% y # What elements of x are in y
 [1] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
x[x %in% y]
[1]  5  6  7  8  9 10
any(x > 5) # Are any elements of x greater than 5?
[1] TRUE
all(x > 5) # Are all elements of x greater than 5?
[1] FALSE
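
The element-wise operators & and | differ from && and ||, which work on single values and stop evaluating as soon as the result is known. A minimal sketch with made-up vectors p and q:

```r
p <- c(TRUE, FALSE, TRUE)
q <- c(TRUE, TRUE, FALSE)
p & q            # element-wise: TRUE FALSE FALSE
p | q            # element-wise: TRUE TRUE TRUE
xor(p, q)        # FALSE TRUE TRUE
# && and || expect length-one operands and short-circuit:
# the right-hand side is evaluated only if needed
v <- NULL
ok <- !is.null(v) && v > 0
ok               # FALSE; v > 0 is never evaluated
isTRUE(p[1])     # TRUE
```

The short-circuit behavior makes && and || the natural choice inside if conditions.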

In general, logical operators may not produce a single value and may return NA if an element is NA or NaN. If you must get a single TRUE or FALSE, such as with if expressions, you should NOT use == or !=. Unless you are absolutely sure that nothing unusual can happen, use the identical() function instead. identical() always returns a single logical value, TRUE or FALSE, never NA.

name = "Nick"
if(name == "Nick") TRUE else FALSE 
[1] TRUE

What if name is never set to "Nick"?

name = NA
if(identical(name, "Nick")) TRUE else FALSE
[1] FALSE
# try
# if(name == "Nick") TRUE else FALSE

With all.equal(), objects are treated as equal if the only difference is probably the result of inexact floating-point calculations. It returns TRUE if the mean relative difference is less than the specified tolerance; otherwise it returns a character string that describes the difference. Therefore, do not use all.equal() directly in if expressions; instead, combine it with isTRUE() or identical().

(x = sqrt(2))
[1] 1.414214
x^2
[1] 2
x^2 == 2
[1] FALSE
all.equal(x^2, 2)
[1] TRUE
all.equal(x^2, 1)
[1] "Mean relative difference: 0.5"
isTRUE(all.equal(x^2, 1))
[1] FALSE
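
A typical use of this pattern is a floating-point comparison inside an if expression. The sketch below uses the well-known 0.1 + 0.2 example; the tolerance in the last line is an illustrative value.

```r
a <- 0.1 + 0.2
b <- 0.3
a == b                    # FALSE because of floating-point rounding
isTRUE(all.equal(a, b))   # TRUE: equal within the default tolerance
if (isTRUE(all.equal(a, b))) {
  message("a and b are equal up to numerical tolerance")
}
# A stricter tolerance can be requested explicitly
isTRUE(all.equal(1, 1 + 1e-6, tolerance = 1e-8))   # FALSE
```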

2.6.3 Characters

Character strings are defined by quotation marks, single ' ' or double " ".

Function Description
cat() Concatenate objects and print to console (\n for newline)
paste() Concatenate objects and return a string
print() Print an object
substr() Extract or replace substrings in a character vector
strtrim() Trim character vectors to specified display widths
strsplit() Split elements of a character vector according to a substring
grep() Search for matches to a pattern within a character vector; returns a vector of the indices that matched
grepl() Like grep(), but returns a logical vector
agrep() Similar to grep(), but searches for approximate matches
regexpr() Similar to grep(), but returns the position of the first instance of a pattern within a string
gsub() Replace all occurrences of a pattern with a character vector
sub() Like gsub(), but only replaces the first occurrence
tolower(), toupper() Convert to all lower/upper case
noquote() Print a character vector without quotations
nchar() Number of characters
letters, LETTERS Built-in vector of lower and upper case letters

Table: Useful functions for character vectors

animals = c("bird", "horse", "fish")
home = c("tree", "barn", "lake")

length(animals) # Number of strings
[1] 3
nchar(animals) # Number of characters in each string
[1] 4 5 4
cat("Animals:", animals) # Need \n to move cursor to a newline
Animals: bird horse fish
cat(animals, home, "\n") # Joins one vector after the other
bird horse fish tree barn lake 
paste(animals, collapse=" ") # Create one long string of animals
[1] "bird horse fish"
a_h = paste(animals, home, sep=".") # Pairwise joining of animals and home

# Split strings at ".", fixed=TRUE since "." is used for pattern matching
unlist(strsplit(a_h, ".", fixed=TRUE))
[1] "bird"  "tree"  "horse" "barn"  "fish"  "lake" 
substr(animals, 2, 4) # Get characters 2-4 of each animal
[1] "ird" "ors" "ish"
strtrim(animals, 3) # Print the first three characters
[1] "bir" "hor" "fis"
toupper(animals) # Print animals in all upper case
[1] "BIRD"  "HORSE" "FISH" 

A regular expression is a pattern that describes a set of strings.

colors()[grep("red", colors())] # All colors that contain "red"
 [1] "darkred"         "indianred"       "indianred1"      "indianred2"     
 [5] "indianred3"      "indianred4"      "mediumvioletred" "orangered"      
 [9] "orangered1"      "orangered2"      "orangered3"      "orangered4"     
[13] "palevioletred"   "palevioletred1"  "palevioletred2"  "palevioletred3" 
[17] "palevioletred4"  "red"             "red1"            "red2"           
[21] "red3"            "red4"            "violetred"       "violetred1"     
[25] "violetred2"      "violetred3"      "violetred4"     
colors()[grep("^red", colors())] # Colors that start with "red"
[1] "red"  "red1" "red2" "red3" "red4"
colors()[grep("red$", colors())] # Colors that end with "red"
[1] "darkred"         "indianred"       "mediumvioletred" "orangered"      
[5] "palevioletred"   "red"             "violetred"      
colors()[grep("red.", colors())] # Colors with at least one character after "red"
 [1] "indianred1"     "indianred2"     "indianred3"     "indianred4"    
 [5] "orangered1"     "orangered2"     "orangered3"     "orangered4"    
 [9] "palevioletred1" "palevioletred2" "palevioletred3" "palevioletred4"
[13] "red1"           "red2"           "red3"           "red4"          
[17] "violetred1"     "violetred2"     "violetred3"     "violetred4"    
colors()[grep("^[r-t]", colors())] # Colors that begin with r, s, or t
 [1] "red"          "red1"         "red2"         "red3"         "red4"        
 [6] "rosybrown"    "rosybrown1"   "rosybrown2"   "rosybrown3"   "rosybrown4"  
[11] "royalblue"    "royalblue1"   "royalblue2"   "royalblue3"   "royalblue4"  
[16] "saddlebrown"  "salmon"       "salmon1"      "salmon2"      "salmon3"     
[21] "salmon4"      "sandybrown"   "seagreen"     "seagreen1"    "seagreen2"   
[26] "seagreen3"    "seagreen4"    "seashell"     "seashell1"    "seashell2"   
[31] "seashell3"    "seashell4"    "sienna"       "sienna1"      "sienna2"     
[36] "sienna3"      "sienna4"      "skyblue"      "skyblue1"     "skyblue2"    
[41] "skyblue3"     "skyblue4"     "slateblue"    "slateblue1"   "slateblue2"  
[46] "slateblue3"   "slateblue4"   "slategray"    "slategray1"   "slategray2"  
[51] "slategray3"   "slategray4"   "slategrey"    "snow"         "snow1"       
[56] "snow2"        "snow3"        "snow4"        "springgreen"  "springgreen1"
[61] "springgreen2" "springgreen3" "springgreen4" "steelblue"    "steelblue1"  
[66] "steelblue2"   "steelblue3"   "steelblue4"   "tan"          "tan1"        
[71] "tan2"         "tan3"         "tan4"         "thistle"      "thistle1"    
[76] "thistle2"     "thistle3"     "thistle4"     "tomato"       "tomato1"     
[81] "tomato2"      "tomato3"      "tomato4"      "turquoise"    "turquoise1"  
[86] "turquoise2"   "turquoise3"   "turquoise4"  
places = c("home", "zoo", "school", "work", "park")
gsub("o", "O", places) # Replace all "o" with "O"
[1] "hOme"   "zOO"    "schOOl" "wOrk"   "park"  
sub("o", "O", places)  # Replace the first "o" with "O"
[1] "hOme"   "zOo"    "schOol" "wOrk"   "park"  
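
The other pattern-matching functions from the table can be tried on the same vector. In this short sketch, the file names in files are hypothetical.

```r
places <- c("home", "zoo", "school", "work", "park")
grepl("oo", places)               # FALSE TRUE TRUE FALSE FALSE
grep("oo", places, value = TRUE)  # "zoo" "school"
regexpr("o", places)              # position of the first "o"; -1 if none
nchar(places)                     # 4 3 6 4 4
# A literal "." must be escaped (or matched with fixed = TRUE),
# because "." means "any character" in a regular expression
files <- c("a.csv", "b_csv", "c.txt")   # hypothetical file names
grepl("\\.csv$", files)           # TRUE FALSE FALSE
```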

2.6.4 Factors

A factor variable is a categorical variable with a defined number of ordered or unordered levels. Use the function factor() to create a factor variable.

factor(rep(1:2, 4), labels=c("BA", "BS"))
[1] BA BS BA BS BA BS BA BS
Levels: BA BS
factor(rep(1:3, 4), labels=c("low", "med", "high"), ordered=TRUE)
 [1] low  med  high low  med  high low  med  high low  med  high
Levels: low < med < high

Here are some useful functions to handle factor type data.

Function Description
levels(x) Retrieve or set the levels of x
nlevels(x) Returns the number of levels in x
relevel(x, ref) Levels of x are reordered so that the level specified by ref is first
reorder() Reorders levels based on the values of a second variable
gl() Generate factors by specifying the pattern of their levels
cut(x, breaks) Divides the range of x into intervals (factors) determined by breaks

Table: Useful functions for factor variables

You may need to convert a factor variable, say f, whose levels are numbers into a numeric variable. One way is as.numeric(as.character(f)). This works, but it is inefficient for long vectors with few levels. A better approach is as.numeric(levels(f))[f], which converts each level only once.

f = gl(3, 2, labels=paste("trt", 1:3, sep="_"))
levels(f)
[1] "trt_1" "trt_2" "trt_3"
nlevels(f)
[1] 3
relevel(f, "trt_2")
[1] trt_1 trt_1 trt_2 trt_2 trt_3 trt_3
Levels: trt_2 trt_1 trt_3
f = gl(3, 2, labels=1:3)
as.numeric(levels(f))[f]
[1] 1 1 2 2 3 3
# Cut a numeric vector into intervals (x is random, so the output varies from run to run)
x = runif(10)
cut(x, 3) # Cut x into three intervals
 [1] (0.0316,0.329] (0.626,0.924]  (0.0316,0.329] (0.329,0.626]  (0.0316,0.329]
 [6] (0.0316,0.329] (0.329,0.626]  (0.626,0.924]  (0.0316,0.329] (0.626,0.924] 
Levels: (0.0316,0.329] (0.329,0.626] (0.626,0.924]
cut(x, c(0,.25,.5,.75,1)) # Cut x at the given cut points
 [1] (0,0.25]   (0.5,0.75] (0,0.25]   (0.25,0.5] (0,0.25]   (0,0.25]  
 [7] (0.5,0.75] (0.75,1]   (0,0.25]   (0.75,1]  
Levels: (0,0.25] (0.25,0.5] (0.5,0.75] (0.75,1]
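
A common pitfall when converting factors is to call as.numeric() directly on the factor, which returns the internal level codes rather than the values. The sketch below contrasts the approaches on a made-up factor.

```r
f <- factor(c("10", "20", "10", "30"))
as.numeric(f)                  # 1 2 1 3: the internal integer codes, not the values
as.numeric(as.character(f))    # 10 20 10 30
as.numeric(levels(f))[f]       # 10 20 10 30, converting each level only once
table(f)                       # counts per level
```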

2.6.5 Dates and Times

R has objects that are dates only and objects that are dates and times. We will focus on dates. See ?DateTimeClasses for information about how to handle dates and times. An R date object has the format Year-Month-Day. Days can be added to or subtracted from a date, and dates can be compared using logical operators.

Function Description
Sys.Date() Current date
as.Date() Convert a character string to a date object
format.Date() Change the format of a date object
seq.Date() Generate sequence of dates
cut.Date() Cut dates into intervals
weekdays, months, quarters Extract parts of a date object
julian Number of days since a given origin

Table: Useful functions for date objects

Note that the .Date suffix is optional when calling format.Date(), seq.Date() and cut.Date(), but it is necessary for viewing the appropriate documentation.

Converting a string to a date object requires specifying a format string that defines the date format. Any character in the format string other than the % symbol is interpreted literally. Common conversion specifications (see ?strptime for a complete list) are given below.

Specifier Description
%a Abbreviated weekday name
%A Full weekday name
%d Day of the month
%B Full month name
%b Abbreviated month name
%m Numeric month (01-12)
%y Year without century
%Y Year with century

Table: Common date format specifiers

dates1 = c("5jan2008", "19aug2008", "2feb2009", "29sep2009")
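# %b matches month abbreviations in the current locale; the result below
# assumes an English locale (in other locales this call may return NA)
as.Date(dates1, format = "%d%b%Y")
[1] "2008-01-05" "2008-08-19" "2009-02-02" "2009-09-29"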
dates2 = c("5-1-2008", "19-8-2008", "2-2-2009", "29-9-2009")
as.Date(dates2, format="%d-%m-%Y")
[1] "2008-01-05" "2008-08-19" "2009-02-02" "2009-09-29"
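
Formats that use only numeric specifiers behave the same in every locale. The following sketch parses and reformats a few made-up date strings.

```r
d1 <- as.Date("01/15/11", format = "%m/%d/%y")   # two-digit year
d1                                               # "2011-01-15"
d2 <- as.Date("2011.03.05", format = "%Y.%m.%d")
format(d2, "%d/%m/%Y")                           # "05/03/2011"
# Input that does not match the format gives NA rather than an error
as.Date("not-a-date", format = "%Y-%m-%d")
```

Checking the result with is.na() after parsing is a simple way to catch strings that did not match the format.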

To create a sequence of dates, use seq.Date(from, to, by, length.out = NULL), where

Argument Description
from, to Start and ending date objects
by A character string containing one of "day", "week", "month", "quarter" or "year", optionally preceded by an integer and a space (e.g. "3 days")
length.out Integer, desired length of the sequence

Table: Arguments for generating date sequences

seq.Date(as.Date("2011/1/1"), as.Date("2011/1/31"), by="week")
[1] "2011-01-01" "2011-01-08" "2011-01-15" "2011-01-22" "2011-01-29"
seq.Date(as.Date("2011/1/1"), as.Date("2011/1/31"), by="3 days")
 [1] "2011-01-01" "2011-01-04" "2011-01-07" "2011-01-10" "2011-01-13"
 [6] "2011-01-16" "2011-01-19" "2011-01-22" "2011-01-25" "2011-01-28"
[11] "2011-01-31"
seq.Date(as.Date("2011/1/1"), by="week", length.out=10)
 [1] "2011-01-01" "2011-01-08" "2011-01-15" "2011-01-22" "2011-01-29"
 [6] "2011-02-05" "2011-02-12" "2011-02-19" "2011-02-26" "2011-03-05"

To divide a sequence of dates into levels, use cut.Date(x, breaks, start.on.monday = TRUE).

jan = seq.Date(as.Date("2011/1/1"), as.Date("2011/1/31"), by="days")
cut(jan, breaks="weeks")
 [1] 2010-12-27 2010-12-27 2011-01-03 2011-01-03 2011-01-03 2011-01-03
 [7] 2011-01-03 2011-01-03 2011-01-03 2011-01-10 2011-01-10 2011-01-10
[13] 2011-01-10 2011-01-10 2011-01-10 2011-01-10 2011-01-17 2011-01-17
[19] 2011-01-17 2011-01-17 2011-01-17 2011-01-17 2011-01-17 2011-01-24
[25] 2011-01-24 2011-01-24 2011-01-24 2011-01-24 2011-01-24 2011-01-24
[31] 2011-01-31
6 Levels: 2010-12-27 2011-01-03 2011-01-10 2011-01-17 ... 2011-01-31

Operations with dates:

  • Days can be added to or subtracted from a date.
  • Two dates can be subtracted to give the number of days between them.
  • Dates can be compared using logical operators.
jan1 = as.Date("2011/1/1")
(jan8 = jan1 + 7) # Add 7 days to 2011/1/1
[1] "2011-01-08"
jan1 - 14 # Subtract 2 weeks from 2011/1/1
[1] "2010-12-18"
jan8 - jan1 # Number of days between 2011/1/1 and 2011/1/8
Time difference of 7 days
jan8 > jan1 # Compare dates
[1] TRUE
# Use format to extract parts of a date object or change the appearance
format.Date(jan8, "%Y")
[1] "2011"
format.Date(jan8, "%b-%d") # %b is locale-dependent; English locale shown
[1] "Jan-08"
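
Date arithmetic and the extraction of numeric parts do not depend on the locale. A brief sketch, with mar1 introduced here only for illustration:

```r
jan1 <- as.Date("2011-01-01")
mar1 <- as.Date("2011-03-01")
diff_days <- mar1 - jan1                 # a difftime object
as.numeric(diff_days)                    # 59 (31 days in January + 28 in February)
difftime(mar1, jan1, units = "weeks")    # the same difference expressed in weeks
as.numeric(format(mar1, "%m"))           # month as a number: 3
as.numeric(format(mar1, "%j"))           # day of the year: 60
julian(jan1)                             # days since 1970-01-01 (the default origin)
seq(jan1, by = "month", length.out = 3)  # first day of Jan, Feb and Mar 2011
```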

2.6.6 Missing Data

R denotes data that is not available by NA. How a function handles missing data depends on the function. For example, mean() ignores NAs only if the argument na.rm=TRUE is given, whereas which() always ignores missing data.

x = c(4, 7, 2, 0, 1, NA)
mean(x)
[1] NA
mean(x, na.rm=TRUE)
[1] 2.8
which(x > 4)
[1] 2

Consult the documentation to see how a particular function handles missing data. Quantities that are not a number, such as 0/0, are denoted by NaN. In R, NaN implies NA (NaN refers to unavailable numeric data and NA refers to any type of unavailable data). Undefined or null objects are denoted in R by NULL. For example, say we do not want to add row labels to a matrix.

x = matrix(1:4, ncol=2, dimnames=list(NULL, c("c.1", "c.2")))

To test for missing data, avoid using identical() and never use ==. Instead, use the following functions.

Function Description
is.na(x) Tests for NA or NaN data in x
is.nan(x) Tests for NaN data in x
is.null(x) Tests if x is NULL

Table: Functions for testing missing or null data

x = c(4, 7, 2, 0, 1, NA)
(x == NA)
[1] NA NA NA NA NA NA
is.na(x)
[1] FALSE FALSE FALSE FALSE FALSE  TRUE
any(is.na(x))
[1] TRUE
(y <- x/0)
[1] Inf Inf Inf NaN Inf  NA
is.nan(y)
[1] FALSE FALSE FALSE  TRUE FALSE FALSE
is.na(y)
[1] FALSE FALSE FALSE  TRUE FALSE  TRUE
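
These test functions are typically used to count, drop or replace missing values. In the minimal sketch below, replacing NA by 0 is only an illustrative choice, not a general recommendation.

```r
x <- c(4, 7, 2, 0, 1, NA)
sum(is.na(x))                    # number of missing values: 1
x[!is.na(x)]                     # drop missing values: 4 7 2 0 1
x_filled <- x
x_filled[is.na(x_filled)] <- 0   # replace NA by 0 (an illustrative choice)
x_filled                         # 4 7 2 0 1 0
m <- matrix(1:4, ncol = 2, dimnames = list(NULL, c("c.1", "c.2")))
is.null(rownames(m))             # TRUE: no row names were set
is.null(colnames(m))             # FALSE
```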