2 Data Structures in R

2.1 Introduction

In this chapter, we will introduce the basic data structures and types. R provides a variety of data structures to store and manipulate data efficiently. Understanding these data structures is crucial for effective data analysis and manipulation in R.

2.2 Vectors

In R, a vector is an ordered collection of objects of the same type.
We can create a vector using one of the following functions.

c() is the most general way to create a vector.
The colon operator : creates a sequence of consecutive integers.
seq() is used to generate regular sequences.
rep() is useful to generate vectors of replicated elements.

Some illustrative examples are given below.

# Creating vectors
v1 = c(2.5, 4, 7.3, 0.1)
v1

[1] 2.5 4.0 7.3 0.1

# Creating character vector
v2 = c("A", "B", "C", "D")
v2

[1] "A" "B" "C" "D"

# Creating integer sequence
v3 = -3:3
v3

[1] -3 -2 -1  0  1  2  3

# Creating sequences with seq()
seq(0, 2, by = 0.5) # increment by 0.5

[1] 0.0 0.5 1.0 1.5 2.0

# Creating sequences with seq()
seq(0, 2, len = 6) # length of sequence is 6

[1] 0.0 0.4 0.8 1.2 1.6 2.0

# Creating replicated vectors with rep()
rep(1:5, each = 2) # repeat each element twice

 [1] 1 1 2 2 3 3 4 4 5 5

# Creating replicated vectors with rep()
rep(1:5, times = 2)

 [1] 1 2 3 4 5 1 2 3 4 5

To index certain element(s) of a vector, we use [ ] with a vector/scalar of positions to reference the elements of the vector. Including a minus sign before the vector/scalar removes the indexed elements from the vector.

# Referencing elements of a vector
x <- c(4, 7, 2, 10, 1, 0)
x[4] # return the fourth element

[1] 10

# Return elements from index 1 to 3
x[1:3]

[1] 4 7 2

# Return elements at indices 2, 5 and 6
x[c(2,5,6)]

[1] 7 1 0

# Remove the third element from x
x[-3]

[1]  4  7 10  1  0

# Remove multiple elements from x
x[-c(4,5)]

[1] 4 7 2 0

# Logical referencing
x[x>4] # return elements bigger than 4

[1]  7 10

# Modifying elements of a vector
x[3] <- 999 
x

[1]   4   7 999  10   1   0

The following additional functions can be useful to return the indices of a vector.

which(): returns the position or the index of the value which satisfies the given condition.
which.max(): returns the location of the (first) maximum element of a numeric vector.
which.min(): returns the location of the (first) minimum element of a numeric vector.
match(): returns the first position of an element of a vector in another vector.

x <- c(4, 7, 2, 10, 1, 0)
x>=4 # return a logical vector of TRUE and FALSE

[1]  TRUE  TRUE FALSE  TRUE FALSE FALSE

# Return indices of elements satisfying the condition
which(x>=4) # return indices

[1] 1 2 4

# Return indices of maximum element
which.max(x)

[1] 4

# Return the maximum element using which.max()
x[which.max(x)] # return the first maximum element

[1] 10

# Return the maximum element using max()
max(x)

[1] 10

# Using match()
y <- rep(1:5, times=5:1)
y

 [1] 1 1 1 1 1 2 2 2 2 3 3 3 4 4 5

match(1:5, y) # return the first position of each element in y that matches with 1:5

[1]  1  6 10 13 15

# Return unique elements
unique(y)

[1] 1 2 3 4 5

match(unique(y), y) # return the first position of each element in y that matches with unique(y)

[1]  1  6 10 13 15

When vectors are used in math expressions, the operations are performed element-wise.

# Element-wise operations
x = c(4, 7, 2, 10, 1, 0)
y = x^2 + 1
y

[1]  17  50   5 101   2   1

x*y # element-wise multiplication

[1]   68  350   10 1010    2    0

In Table 2.1, we provide some useful functions for vector operations.

Table 2.1: Useful functions for vectors

Function	Description
`sum(x)`, `prod(x)`	Sum/product of the elements of `x`
`cumsum(x)`, `cumprod(x)`	Cumulative sum/product of the elements of `x`
`min(x)`, `max(x)`	Minimum/maximum element of `x`
`mean(x)`, `median(x)`	Mean/median of `x`
`var(x)`, `sd(x)`	Variance/standard deviation of `x`
`cov(x, y)`, `cor(x, y)`	Covariance/correlation of `x` and `y`
`range(x)`	Range of `x`
`quantile(x)`	Quantiles of `x` for the given probabilities
`fivenum(x)`	Five number summary of `x`
`length(x)`	Number of elements in `x`
`unique(x)`	Unique elements of `x`
`rev(x)`	Reverse the elements of `x`
`sort(x)`	Sort the elements of `x`
`which(x)`	Indices of `TRUE`s in a logical vector
`which.max(x)`, `which.min(x)`	Index of the max/min element of `x`
`match(x, y)`	First position in `y` of each element of `x`
`union(x, y)`	Union of `x` and `y`
`intersect(x, y)`	Intersection of `x` and `y`
`setdiff(x, y)`	Elements of `x` that are not in `y`
`setequal(x, y)`	Do `x` and `y` contain the same elements?

Below, we provide some illustrative examples.

# Useful functions for vectors
x <- c(4,7,2,10,1,0)
y <- 3*x^2 + 1
y

[1]  49 148  13 301   4   1

sum(x) # sum of elements

[1] 24

range(x) # range of elements

[1]  0 10

length(x) # number of elements

[1] 6

rev(x) # reverse the elements

[1]  0  1 10  2  7  4

sort(x) # sort in increasing order

[1]  0  1  2  4  7 10

sort(x, decreasing = TRUE) # sort in decreasing order

[1] 10  7  4  2  1  0

which(x==7) # return indices

[1] 2

union(x,y) # union of x and y

 [1]   4   7   2  10   1   0  49 148  13 301

setdiff(x,y) # elements in x but not in y

[1]  7  2 10  0

intersect(x,y) # intersection of x and y

[1] 4 1

setequal(x,y) # do x and y contain the same elements?

[1] FALSE
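Several functions in Table 2.1 are not exercised above. The following is a minimal sketch of the cumulative and summary functions on the same vector x. The names cs, cp, fn, q and r are introduced here only for illustration.

```r
# Illustrative sketch: cumulative and summary functions from Table 2.1
x <- c(4, 7, 2, 10, 1, 0)
y <- 3 * x^2 + 1

cs <- cumsum(x)    # running sum: 4 11 13 23 24 24
cp <- cumprod(x)   # running product: 4 28 56 560 560 0
fn <- fivenum(x)   # minimum, lower hinge, median, upper hinge, maximum
q  <- quantile(x, probs = c(0.25, 0.5, 0.75))  # sample quartiles
r  <- cor(x, y)    # Pearson correlation of x and y

print(cs); print(cp); print(fn); print(q); print(r)
```

Note that cumprod(x) ends in 0 because the last element of x is 0.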

2.3 Matrices

A matrix is a two-dimensional generalization of a vector. To create a matrix, we use the function matrix(), with the syntax

matrix(data=NA, nrow=1, ncol=1, byrow = FALSE, dimnames = NULL)

data is a vector that gives data to fill the matrix
nrow is the desired number of rows
ncol is the desired number of columns
byrow is set to FALSE by default, which means matrix is filled by columns. Otherwise, matrix is filled by rows.
dimnames is an optional list of length 2 giving the row and column names, respectively.

# Creating a matrix
x = matrix(c(5,0,6,1,3,5,9,5,7,1,5,3), 
           nrow = 3, ncol = 4, 
           byrow = TRUE,
           dimnames = list(rows = c("r.1", "r.2", "r.3"),
                           cols = c("c.1", "c.2", "c.3", "c.4")))
x

     cols
rows  c.1 c.2 c.3 c.4
  r.1   5   0   6   1
  r.2   3   5   9   5
  r.3   7   1   5   3

Some useful functions for matrices are given below.

class(x) # class of x

[1] "matrix" "array"

colnames(x) # access to column names

[1] "c.1" "c.2" "c.3" "c.4"

rownames(x) # access to row names

[1] "r.1" "r.2" "r.3"

rownames(x)=c("A","B","C") # change the row names
dimnames(x) # access to both row and column names

$rows
[1] "A" "B" "C"

$cols
[1] "c.1" "c.2" "c.3" "c.4"

dimnames(x)$rows # access to row names

[1] "A" "B" "C"

dimnames(x)$cols # access to column names

[1] "c.1" "c.2" "c.3" "c.4"

dim(x) # dimensions of x

[1] 3 4

nrow(x) # number of rows

[1] 3

ncol(x) # number of columns

[1] 4

The elements of a matrix can be referenced using [ ], just as with vectors, but now with two indices: one for rows and one for columns.

x = matrix(c(5,0,6,1,3,5,9,5,7,1,5,3), 
           nrow = 3, ncol = 4)
x

     [,1] [,2] [,3] [,4]
[1,]    5    1    9    1
[2,]    0    3    5    5
[3,]    6    5    7    3

x[2, 3] # Element in Row 2, Column 3

[1] 5

x[1, ] # Row 1, all Columns

[1] 5 1 9 1

x[ , 2] # All Rows, Column 2

[1] 1 3 5

x[c(1, 3), ] # Rows 1 and 3, all columns

     [,1] [,2] [,3] [,4]
[1,]    5    1    9    1
[2,]    6    5    7    3
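A few further indexing patterns may be worth knowing. The sketch below reuses the matrix x from above. The names row2, row2_mat, big and sub are illustrative only.

```r
# Illustrative sketch: more ways to index a matrix
x <- matrix(c(5,0,6,1,3,5,9,5,7,1,5,3), nrow = 3, ncol = 4)

row2     <- x[2, ]                # a plain vector: the dimensions are dropped
row2_mat <- x[2, , drop = FALSE]  # drop = FALSE keeps a 1 x 4 matrix
big      <- x[x > 5]              # logical indexing returns a vector (column order)
sub      <- x[-1, c(2, 4)]        # remove row 1, keep columns 2 and 4

print(dim(row2_mat)); print(big); print(sub)
```

Using drop = FALSE is a common safeguard when later code expects a matrix rather than a vector.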

When matrices are used in math expressions, the operations are performed element-wise.

For matrix multiplication use the %*% operator.
If a vector is used in matrix multiplication, it will be coerced to either a row or column matrix to make the arguments conformable.
Using %*% on two vectors will return the inner product as a matrix and not a scalar.

A = matrix(1:4, nrow = 2)
B = matrix(1, nrow = 2, ncol = 2)
A*B # element-wise multiplication

     [,1] [,2]
[1,]    1    3
[2,]    2    4

A%*%B # matrix multiplication

     [,1] [,2]
[1,]    4    4
[2,]    6    6

y = 1:3
y%*%y # inner product as a matrix

     [,1]
[1,]   14

A/c(y%*%y)

           [,1]      [,2]
[1,] 0.07142857 0.2142857
[2,] 0.14285714 0.2857143

# A/(y%*%y) # Error: non-conformable arrays

We use the apply function to apply a function to the margins of a matrix, array, or data frame. The syntax is

apply(X, MARGIN, FUN, ...)

where

X is a matrix, array or data frame,
MARGIN is a vector of subscripts indicating which margins to apply the function to (1=rows, 2=columns, c(1,2)=rows and columns),
FUN is the function to be applied,
... stands for optional arguments for FUN.

# Using apply function
x = matrix(1:12, nrow = 3, ncol = 4)
apply(x, MARGIN=1, sum)  # Row sums

[1] 22 26 30

apply(x, 2, mean) # Column means

[1]  2  5  8 11

# Handling missing data with apply
x[1,1] <- NaN
apply(x, 2, mean, na.rm=TRUE) # Column means ignoring NaN

[1]  2.5  5.0  8.0 11.0

In Table 2.2, we list some useful functions for matrix operations.

Table 2.2: Useful functions for matrices

Function	Description
`t(A)`	Transpose of `A`
`det(A)`	Determinant of `A`
`solve(A, b)`	Solves the equation `Ax = b` for `x`
`solve(A)`	Matrix inverse of `A`
`MASS::ginv(A)`	Generalized inverse of `A` (MASS package)
`eigen(A)`	Eigenvalues and eigenvectors of `A`
`chol(A)`	Cholesky factorization of `A`
`diag(n)`	Creates an `n` by `n` identity matrix
`diag(A)`	Returns the diagonal elements of a matrix `A`
`diag(x)`	Create a diagonal matrix from a vector `x`
`lower.tri(A)`, `upper.tri(A)`	Matrix of logicals indicating lower/upper triangular matrix
`apply`	Apply a function to the margins of a matrix
`rbind(...)`	Combines arguments by rows
`cbind(...)`	Combines arguments by columns
`dim(A)`	Dimensions of `A`
`nrow(A)`, `ncol(A)`	Number of rows/columns of `A`
`colnames(A)`, `rownames(A)`	Get or set the column/row names of `A`
`dimnames(A)`	Get or set the dimension names of `A`
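As a minimal sketch of some functions from Table 2.2, consider a small symmetric positive definite matrix. The matrix A, the vector b, and the names A_inv, x_sol and U are chosen here only for illustration.

```r
# Illustrative sketch: a few matrix functions from Table 2.2
A <- matrix(c(2, 1, 1, 3), nrow = 2)  # symmetric positive definite 2 x 2 matrix
b <- c(1, 2)

t(A)                   # transpose (equal to A here, since A is symmetric)
det(A)                 # determinant: 2*3 - 1*1 = 5
A_inv <- solve(A)      # matrix inverse
A_inv %*% A            # identity matrix, up to rounding error
x_sol <- solve(A, b)   # solves A x = b without forming the inverse
U <- chol(A)           # upper triangular U with t(U) %*% U equal to A
diag(A)                # diagonal elements: 2 3
rbind(A, b)            # appends b as a third row
cbind(A, b)            # appends b as a third column
```

When only the solution of a linear system is needed, solve(A, b) is generally preferable to solve(A) %*% b.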

2.4 Arrays

An array is a multi-dimensional generalization of a vector. To create an array, we use array(data = NA, dim = length(data), dimnames = NULL), where data is a vector that provides the values to fill the array; dim specifies the dimensions of the array (a vector of one or more elements giving the maximum indices in each dimension); and dimnames defines the names of the dimensions (a list with one component for each dimension, either NULL or a character vector of the length specified by dim for that dimension).

The array is filled by columns (the first index varies fastest), similar to matrices. The math operations on arrays are performed element-wise, similar to vectors and matrices. Also, all elements of an array must be of the same type.

# Creating an array
w = array(1:24, 
          dim = c(4, 3, 2),
          dimnames = list(c("A","B","C","D"), c("X","Y","Z"), c("N","M")))
w

The option dim = c(4, 3, 2) specifies that the array has 4 rows, 3 columns, and 2 “pages” (or layers). Thus, the array w has 4 x 3 x 2 = 24 elements. We can think of c("N","M") as the names of the pages. Thus, w[ , , "N"] returns the first page and w[ , , "M"] returns the second page.

# Referencing elements of an array
w[ , , "N"] # First page

w[2, , ] # Second row, all columns, all pages

w[ , "Y", ] # All rows, second column, all pages

w[1, 2, 2] # Element in Row 1, Column 2, Page 2

[1] 17
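The apply function also works on the margins of an array. The following sketch, using the same array w, sums over each page and across pages. The names w2, page_sums and cell_sums are illustrative only.

```r
# Illustrative sketch: element-wise arithmetic and apply() on an array
w <- array(1:24,
           dim = c(4, 3, 2),
           dimnames = list(c("A","B","C","D"), c("X","Y","Z"), c("N","M")))

w2 <- w * 2                          # element-wise multiplication
page_sums <- apply(w, 3, sum)        # one sum per page: N = 78, M = 222
cell_sums <- apply(w, c(1, 2), sum)  # 4 x 3 matrix: sums across the two pages

print(dim(w2)); print(page_sums); print(cell_sums)
```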

2.5 Lists

A list is a general form of a vector whose components can be of different types and dimensions. To create a list, we use list(...). We can name the elements of a list using the name = value syntax. Arguments can be specified with or without names. In the example below, we create a list with three elements: the first element is named num, the second element has no name, and the third element is named identity.

# Creating a list
x = list(num = c(1,2,3), "Econometrics", identity=diag(2))
x

$num
[1] 1 2 3

[[2]]
[1] "Econometrics"

$identity
     [,1] [,2]
[1,]    1    0
[2,]    0    1

We use [ ], [[ ]] and $ to reference elements of a list. Below, we provide some examples of referencing elements of the list x.

x[[2]]     # the second element of x

[1] "Econometrics"

x[["num"]] # element named "num"

[1] 1 2 3

x$identity # element named "identity"

     [,1] [,2]
[1,]    1    0
[2,]    0    1

x[[3]][1,] # first row of the third element

[1] 1 0

x[1:2]     # a sublist from the first two elements

$num
[1] 1 2 3

[[2]]
[1] "Econometrics"

In Table 2.3, we provide some useful functions for lists.

Table 2.3: Useful functions for lists

Function	Description
`lapply()`	Apply a function to each element of a list; returns a list
`sapply()`	Same as `lapply()`, but returns a vector or matrix by default
`vapply()`	Similar to `sapply()`, but has a pre-specified type of return value
`replicate()`	Repeated evaluation of an expression; useful for replicating lists
`unlist(x)`	Produce a vector of all the components that occur in `x`
`length(x)`	Number of objects in `x`
`names(x)`	Names of the objects in `x`
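A brief sketch of the functions in Table 2.3, applied to the list x created above, is given below. The names lens, lens_vec, classes, flat and reps, and the anonymous helper function, are illustrative only.

```r
# Illustrative sketch: functions from Table 2.3
x <- list(num = c(1, 2, 3), "Econometrics", identity = diag(2))

lens     <- lapply(x, length)   # a list with the length of each element
lens_vec <- sapply(x, length)   # simplified to an integer vector: 3 1 4
# vapply() checks that each result is a single character string
classes  <- vapply(x, function(el) class(el)[1], character(1))
flat     <- unlist(x)           # one vector; everything is coerced to character
reps     <- replicate(3, list(a = 1), simplify = FALSE)  # a list of 3 lists

names(x)    # the unnamed second element has the empty name ""
length(x)   # 3
```

Because vapply() fails when a result does not match the declared type, it is often a safer choice than sapply() inside functions.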

2.6 Data types

2.6.1 Numeric data

Numeric data in R can be either integer or double, but in practice numeric data is almost always double (type double refers to real numbers). See ?integer and ?double. .Machine outputs numeric characteristics of the machine running R, such as the largest integer or the machine’s precision. format() formats an object for pretty printing. format() is a generic function that is also used with other types of objects. See ?format for additional arguments.

format(c(1, 10, 100, 1000), trim = FALSE)

[1] "   1" "  10" " 100" "1000"

format(c(1, 10, 100, 1000), trim = TRUE)

[1] "1"    "10"   "100"  "1000"

format(13.7, nsmall = 3) # nsmall: minimum number of digits to the right of the decimal point

[1] "13.700"

format(2^16, scientific = TRUE) # scientific: use scientific notation

[1] "6.5536e+04"
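The distinction between integer and double, and the machine characteristics stored in .Machine, can be sketched as follows. The values in the comments assume a typical platform with 32-bit integers and IEEE 754 doubles.

```r
# Illustrative sketch: integer versus double, and machine characteristics
typeof(1)      # "double": numeric literals are double by default
typeof(1L)     # "integer": the L suffix creates an integer
typeof(1:3)    # "integer": the colon operator returns integers here
is.integer(5)  # FALSE, because 5 is stored as a double

.Machine$integer.max   # largest integer, 2147483647 on typical platforms
.Machine$double.eps    # machine precision, about 2.2e-16 on typical platforms

0.1 + 0.2 == 0.3                   # FALSE because of floating-point rounding
isTRUE(all.equal(0.1 + 0.2, 0.3))  # TRUE: equal up to a small tolerance
```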

2.6.2 Booleans

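Boolean (or logical) values are represented by the reserved words TRUE and FALSE in all caps. The shorthands T and F also work, but they are ordinary variables that can be reassigned, so TRUE and FALSE are the safer choice.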

Function	Description
`!x`	NOT `x`
`x & y`	`x` AND `y` element-wise; returns a vector
`x && y`	`x` AND `y`; returns a single value
`x | y`	`x` OR `y` element-wise; returns a vector
`x || y`	`x` OR `y`; returns a single value
`xor(x, y)`	Exclusive OR of `x` and `y`, element-wise
`x %in% y`	`x` IN `y`
`x < y`	`x < y`
`x > y`	`x > y`
`x <= y`	`x ≤ y`
`x >= y`	`x ≥ y`
`x == y`	`x = y`
`x != y`	`x ≠ y`
`isTRUE(x)`	`TRUE` if `x` is `TRUE`
`all(...)`	`TRUE` if all arguments are `TRUE`
`any(...)`	`TRUE` if at least one argument is `TRUE`
`identical(x, y)`	Safe and reliable way to test two objects for being EXACTLY equal
`all.equal(x, y)`	Test if two objects are NEARLY equal

Table: Useful logical and relational functions

x = 1:10
(x%%2 == 0) | (x > 5) # What elements of x are even or greater than 5

 [1] FALSE  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE

y = 5:15
x %in% y # Which elements of x are in y

 [1] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE

x[x %in% y]

[1]  5  6  7  8  9 10

any(x > 5) # Are any elements of x greater than 5?

[1] TRUE

all(x > 5) # Are all elements of x greater than 5?

[1] FALSE
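The difference between the element-wise operators & and | and the single-value operators && and || can be sketched as follows. The vectors a and b and the names and_ab, or_ab and ok are illustrative only. Recent versions of R are understood to signal an error when && or || receives an argument of length greater than one, so only scalars are used with them here.

```r
# Illustrative sketch: element-wise versus single-value logical operators
a <- c(TRUE, FALSE, NA)
b <- c(TRUE, TRUE, FALSE)

and_ab <- a & b    # TRUE FALSE FALSE (NA & FALSE is FALSE)
or_ab  <- a | b    # TRUE TRUE NA     (NA | FALSE is NA)
xor(TRUE, FALSE)   # TRUE: exactly one of the two is TRUE

x <- 1:10
# && evaluates the right-hand side only if the left-hand side is TRUE
ok <- length(x) > 0 && x[1] == 1

print(and_ab); print(or_ab); print(ok)
```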

In general, logical operators may not produce a single value and may return NA if an element is NA or NaN. If you must get a single TRUE or FALSE, such as in an if expression, you should NOT use == or !=. Unless you are absolutely sure that nothing unusual can happen, use the identical() function instead. identical() always returns a single logical value, TRUE or FALSE, never NA.

name = "Nick";
if(name == "Nick") TRUE else FALSE

[1] TRUE

What if name is never set to "Nick"?

name = NA
if(identical(name, "Nick")) TRUE else FALSE

[1] FALSE

# Try the following: it fails with an error because name == "Nick" evaluates to NA
# if(name == "Nick") TRUE else FALSE

With all.equal(), objects are treated as equal if the only difference is probably the result of inexact floating-point calculations. It returns TRUE if the mean relative difference is less than the specified tolerance. all.equal() returns either TRUE or a character string that describes the difference. Therefore, do not use all.equal() directly in if expressions; instead, use it with isTRUE() or identical().

(x = sqrt(2))

[1] 1.414214

x^2

[1] 2

x^2 == 2

[1] FALSE

all.equal(x^2, 2)

[1] TRUE

all.equal(x^2, 1)

[1] "Mean relative difference: 0.5"

isTRUE(all.equal(x^2, 1))

[1] FALSE

2.6.3 Characters

Character strings are defined by quotation marks, single ' ' or double " ".

Function	Description
`cat()`	Concatenate objects and print to console (`\n` for newline)
`paste()`	Concatenate objects and return a string
`print()`	Print an object
`substr()`	Extract or replace substrings in a character vector
`strtrim()`	Trim character vectors to specified display widths
`strsplit()`	Split elements of a character vector according to a substring
`grep()`	Search for matches to a pattern within a character vector; returns a vector of the indices that matched
`grepl()`	Like `grep()`, but returns a logical vector
`agrep()`	Similar to `grep()`, but searches for approximate matches
`regexpr()`	Similar to `grep()`, but returns the position of the first instance of a pattern within a string
`gsub()`	Replace all occurrences of a pattern with a character vector
`sub()`	Like `gsub()`, but only replaces the first occurrence
`tolower()`, `toupper()`	Convert to all lower/upper case
`noquote()`	Print a character vector without quotations
`nchar()`	Number of characters
`letters`, `LETTERS`	Built-in vector of lower and upper case letters

Table: Useful functions for character vectors

animals = c("bird", "horse", "fish")
home = c("tree", "barn", "lake")

length(animals) # Number of strings

[1] 3

nchar(animals) # Number of characters in each string

[1] 4 5 4

cat("Animals:", animals) # Need \n to move cursor to a newline

Animals: bird horse fish

cat(animals, home, "\n") # Joins one vector after the other

bird horse fish tree barn lake

paste(animals, collapse=" ") # Create one long string of animals

[1] "bird horse fish"

a_h = paste(animals, home, sep=".") # Pairwise joining of animals and home

# Split strings at "."; fixed=TRUE is needed because "." matches any character in a regular expression
unlist(strsplit(a_h, ".", fixed=TRUE))

[1] "bird"  "tree"  "horse" "barn"  "fish"  "lake"

substr(animals, 2, 4) # Get characters 2-4 of each animal

[1] "ird" "ors" "ish"

strtrim(animals, 3) # Print the first three characters

[1] "bir" "hor" "fis"

toupper(animals) # Print animals in all upper case

[1] "BIRD"  "HORSE" "FISH"

A regular expression is a pattern that describes a set of strings.

colors()[grep("red", colors())] # All colors that contain "red"

 [1] "darkred"         "indianred"       "indianred1"      "indianred2"     
 [5] "indianred3"      "indianred4"      "mediumvioletred" "orangered"      
 [9] "orangered1"      "orangered2"      "orangered3"      "orangered4"     
[13] "palevioletred"   "palevioletred1"  "palevioletred2"  "palevioletred3" 
[17] "palevioletred4"  "red"             "red1"            "red2"           
[21] "red3"            "red4"            "violetred"       "violetred1"     
[25] "violetred2"      "violetred3"      "violetred4"

colors()[grep("^red", colors())] # Colors that start with "red"

[1] "red"  "red1" "red2" "red3" "red4"

colors()[grep("red$", colors())] # Colors that end with "red"

[1] "darkred"         "indianred"       "mediumvioletred" "orangered"      
[5] "palevioletred"   "red"             "violetred"

colors()[grep("red.", colors())] # Colors with at least one character after "red"

 [1] "indianred1"     "indianred2"     "indianred3"     "indianred4"    
 [5] "orangered1"     "orangered2"     "orangered3"     "orangered4"    
 [9] "palevioletred1" "palevioletred2" "palevioletred3" "palevioletred4"
[13] "red1"           "red2"           "red3"           "red4"          
[17] "violetred1"     "violetred2"     "violetred3"     "violetred4"

colors()[grep("^[r-t]", colors())] # Colors that begin with r, s, or t

 [1] "red"          "red1"         "red2"         "red3"         "red4"        
 [6] "rosybrown"    "rosybrown1"   "rosybrown2"   "rosybrown3"   "rosybrown4"  
[11] "royalblue"    "royalblue1"   "royalblue2"   "royalblue3"   "royalblue4"  
[16] "saddlebrown"  "salmon"       "salmon1"      "salmon2"      "salmon3"     
[21] "salmon4"      "sandybrown"   "seagreen"     "seagreen1"    "seagreen2"   
[26] "seagreen3"    "seagreen4"    "seashell"     "seashell1"    "seashell2"   
[31] "seashell3"    "seashell4"    "sienna"       "sienna1"      "sienna2"     
[36] "sienna3"      "sienna4"      "skyblue"      "skyblue1"     "skyblue2"    
[41] "skyblue3"     "skyblue4"     "slateblue"    "slateblue1"   "slateblue2"  
[46] "slateblue3"   "slateblue4"   "slategray"    "slategray1"   "slategray2"  
[51] "slategray3"   "slategray4"   "slategrey"    "snow"         "snow1"       
[56] "snow2"        "snow3"        "snow4"        "springgreen"  "springgreen1"
[61] "springgreen2" "springgreen3" "springgreen4" "steelblue"    "steelblue1"  
[66] "steelblue2"   "steelblue3"   "steelblue4"   "tan"          "tan1"        
[71] "tan2"         "tan3"         "tan4"         "thistle"      "thistle1"    
[76] "thistle2"     "thistle3"     "thistle4"     "tomato"       "tomato1"     
[81] "tomato2"      "tomato3"      "tomato4"      "turquoise"    "turquoise1"  
[86] "turquoise2"   "turquoise3"   "turquoise4"

places = c("home", "zoo", "school", "work", "park")
gsub("o", "O", places) # Replace all "o" with "O"

[1] "hOme"   "zOO"    "schOOl" "wOrk"   "park"

sub("o", "O", places)  # Replace the first "o" with "O"

[1] "hOme"   "zOo"    "schOol" "wOrk"   "park"
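The functions grepl() and regexpr() from the table above can be sketched on the same vector places. The names has_oo, with_oo and first_o are illustrative only.

```r
# Illustrative sketch: grepl(), grep(value = TRUE) and regexpr()
places <- c("home", "zoo", "school", "work", "park")

has_oo  <- grepl("oo", places)               # FALSE TRUE TRUE FALSE FALSE
with_oo <- grep("oo", places, value = TRUE)  # the matching strings: "zoo" "school"
# Position of the first "o" in each string; -1 means no match
first_o <- as.integer(regexpr("o", places))  # 2 2 4 2 -1

print(has_oo); print(with_oo); print(first_o)
```

The logical vector returned by grepl() is convenient for subsetting, for example places[has_oo].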

2.6.4 Factors

A factor type variable is a categorical variable with a defined number of ordered or unordered levels. Use the function factor() to create a factor variable.

factor(rep(1:2, 4), labels=c("BA", "BS"))

[1] BA BS BA BS BA BS BA BS
Levels: BA BS

factor(rep(1:3, 4), labels=c("low", "med", "high"), ordered=TRUE)

 [1] low  med  high low  med  high low  med  high low  med  high
Levels: low < med < high

Here are some useful functions to handle factor type data.

Function	Description
`levels(x)`	Retrieve or set the levels of `x`
`nlevels(x)`	Returns the number of levels in `x`
`relevel(x, ref)`	Levels of `x` are reordered so that the level specified by `ref` is first
`reorder()`	Reorders levels based on the values of a second variable
`gl()`	Generate factors by specifying the pattern of their levels
`cut(x, breaks)`	Divides the range of `x` into intervals (factors) determined by `breaks`

Table: Useful functions for factor variables

Often you need to convert a factor variable, say f, to a numeric variable. You can do so with as.numeric(as.character(f)). This works, but it is inefficient for long vectors with few levels. A better approach is as.numeric(levels(f))[f].

f = gl(3, 2, labels=paste("trt", 1:3, sep="_"))
levels(f)

[1] "trt_1" "trt_2" "trt_3"

nlevels(f)

[1] 3

relevel(f, "trt_2")

[1] trt_1 trt_1 trt_2 trt_2 trt_3 trt_3
Levels: trt_2 trt_1 trt_3

f = gl(3, 2, labels=1:3)
as.numeric(levels(f))[f]

[1] 1 1 2 2 3 3
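The reason for the two-step conversion becomes visible when the levels are not 1, 2, 3. In the sketch below, the factor f2 and the names codes, via_char and via_levels are illustrative only.

```r
# Illustrative sketch: why as.numeric() alone is not enough for factors
f2 <- factor(c("10", "20", "10", "30"))

codes      <- as.numeric(f2)                # 1 2 1 3: the internal integer codes
via_char   <- as.numeric(as.character(f2))  # 10 20 10 30: the intended values
via_levels <- as.numeric(levels(f2))[f2]    # 10 20 10 30: converts each level once

print(codes); print(via_char); print(via_levels)
```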

# Cutting a numeric vector into intervals with cut()
x = runif(10)
cut(x, 3) # Cut x into three intervals

 [1] (0.0316,0.329] (0.626,0.924]  (0.0316,0.329] (0.329,0.626]  (0.0316,0.329]
 [6] (0.0316,0.329] (0.329,0.626]  (0.626,0.924]  (0.0316,0.329] (0.626,0.924] 
Levels: (0.0316,0.329] (0.329,0.626] (0.626,0.924]

cut(x, c(0,.25,.5,.75,1)) # Cut x at the given cut points

 [1] (0,0.25]   (0.5,0.75] (0,0.25]   (0.25,0.5] (0,0.25]   (0,0.25]  
 [7] (0.5,0.75] (0.75,1]   (0,0.25]   (0.75,1]  
Levels: (0,0.25] (0.25,0.5] (0.5,0.75] (0.75,1]

2.6.5 Dates and Times

R has objects that are dates only and objects that are dates and times. We will just focus on dates. See ?DateTimeClasses for information about how to handle dates and times. An R date object has the format Year-Month-Day. Days can be added to or subtracted from a date. Dates can be compared using logical operators.

Function	Description
`Sys.Date()`	Current date
`as.Date()`	Convert a character string to a date object
`format.Date()`	Change the format of a date object
`seq.Date()`	Generate sequence of dates
`cut.Date()`	Cut dates into intervals
`weekdays`, `months`, `quarters`	Extract parts of a date object
`julian`	Number of days since a given origin

Table: Useful functions for date objects

Note that the .Date suffix is optional for calling format.Date(), seq.Date() and cut.Date(), but is necessary for viewing the appropriate documentation.

Converting a string to a date object requires specifying a format string that defines the date format. Any character in the format string other than the % symbol is interpreted literally. Common conversion specifications (see ?strptime for a complete list) are given below.

Specifier	Description
`%a`	Abbreviated weekday name
`%A`	Full weekday name
`%d`	Day of the month
`%B`	Full month name
`%b`	Abbreviated month name
`%m`	Numeric month (01-12)
`%y`	Year without century
`%Y`	Year with century

Table: Common date format specifiers

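# %b matches abbreviated month names in the current locale; an English locale is assumed here
dates1 = c("5jan2008", "19aug2008", "2feb2009", "29sep2009")
as.Date(dates1, format = "%d%b%Y")

[1] "2008-01-05" "2008-08-19" "2009-02-02" "2009-09-29"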

dates2 = c("5-1-2008", "19-8-2008", "2-2-2009", "29-9-2009")
as.Date(dates2, format="%d-%m-%Y")

[1] "2008-01-05" "2008-08-19" "2009-02-02" "2009-09-29"
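Purely numeric formats avoid the locale dependence of month and weekday names. The sketch below parses the same date written in three different layouts. The strings and the names d1, d2 and d3 are illustrative only.

```r
# Illustrative sketch: parsing and printing dates with numeric format specifiers
d1 <- as.Date("03/15/09",   format = "%m/%d/%y")  # two-digit year, read as 2009
d2 <- as.Date("2009.03.15", format = "%Y.%m.%d")
d3 <- as.Date("15 03 2009", format = "%d %m %Y")

print(d1)                # "2009-03-15"
d1 == d2 && d2 == d3     # TRUE: all three strings denote the same date
format(d1, "%d/%m/%Y")   # "15/03/2009"
```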

To create a sequence of dates, use seq.Date(from, to, by, length.out = NULL), where

Argument	Description
`from`, `to`	Start and ending date objects
`by`	A character string containing one of `"day"`, `"week"`, `"month"`, `"quarter"` or `"year"`, optionally preceded by an integer and a space (e.g. `"3 days"`)
`length.out`	Integer, desired length of the sequence

Table: Arguments for generating date sequences

seq.Date(as.Date("2011/1/1"), as.Date("2011/1/31"), by="week")

[1] "2011-01-01" "2011-01-08" "2011-01-15" "2011-01-22" "2011-01-29"

seq.Date(as.Date("2011/1/1"), as.Date("2011/1/31"), by="3 days")

 [1] "2011-01-01" "2011-01-04" "2011-01-07" "2011-01-10" "2011-01-13"
 [6] "2011-01-16" "2011-01-19" "2011-01-22" "2011-01-25" "2011-01-28"
[11] "2011-01-31"

seq.Date(as.Date("2011/1/1"), by="week", length.out=10)

 [1] "2011-01-01" "2011-01-08" "2011-01-15" "2011-01-22" "2011-01-29"
 [6] "2011-02-05" "2011-02-12" "2011-02-19" "2011-02-26" "2011-03-05"

To divide a sequence of dates into levels, use cut.Date(x, breaks, start.on.monday = TRUE).

jan = seq.Date(as.Date("2011/1/1"), as.Date("2011/1/31"), by="days")
cut(jan, breaks="weeks")

 [1] 2010-12-27 2010-12-27 2011-01-03 2011-01-03 2011-01-03 2011-01-03
 [7] 2011-01-03 2011-01-03 2011-01-03 2011-01-10 2011-01-10 2011-01-10
[13] 2011-01-10 2011-01-10 2011-01-10 2011-01-10 2011-01-17 2011-01-17
[19] 2011-01-17 2011-01-17 2011-01-17 2011-01-17 2011-01-17 2011-01-24
[25] 2011-01-24 2011-01-24 2011-01-24 2011-01-24 2011-01-24 2011-01-24
[31] 2011-01-31
6 Levels: 2010-12-27 2011-01-03 2011-01-10 2011-01-17 ... 2011-01-31

Operations with dates:

Days can be added to or subtracted from a date.
Dates can be subtracted.
Dates can be compared using logical operators.

jan1 = as.Date("2011/1/1")
(jan8 = jan1 + 7) # Add 7 days to 2011/1/1

[1] "2011-01-08"

jan1 - 14 # Subtract 2 weeks from 2011/1/1

[1] "2010-12-18"

jan8 - jan1 # Number of days between 2011/1/1 and 2011/1/8

Time difference of 7 days

jan8 > jan1 # Compare dates

[1] TRUE

# Use format to extract parts of a date object or change the appearance
format.Date(jan8, "%Y")

[1] "2011"

format.Date(jan8, "%b-%d") # the month abbreviation depends on the locale (English shown)

[1] "Jan-08"
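Differences between dates and the julian() function from the table above can be sketched as follows. The dates and the names d and n_days are illustrative only.

```r
# Illustrative sketch: date differences and julian()
jan1 <- as.Date("2011-01-01")
mar1 <- as.Date("2011-03-01")

d <- mar1 - jan1                        # a difftime object: 59 days
n_days <- as.numeric(d)                 # plain number of days: 59
difftime(mar1, jan1, units = "weeks")   # the same difference in weeks (59/7)

as.numeric(julian(jan1))                 # days since the default origin 1970-01-01
as.numeric(julian(mar1, origin = jan1))  # days since a chosen origin: 59
```

Converting with as.numeric() is useful when the difference is needed in further arithmetic.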

2.6.6 Missing Data

R denotes data that is not available by NA. How a function handles missing data depends on the function. For example mean() ignores NAs only if the argument na.rm=TRUE, whereas which() always ignores missing data.

x = c(4, 7, 2, 0, 1, NA)
mean(x)

[1] NA

mean(x, na.rm=TRUE)

[1] 2.8

which(x > 4)

[1] 2

Check the documentation to see how a particular function handles missing data. Quantities that are not a number, such as 0/0, are denoted by NaN. In R, NaN implies NA (NaN refers to unavailable numeric data and NA refers to any type of unavailable data). Undefined or null objects are denoted in R by NULL. For example, say we do not want to add row labels to a matrix.

x = matrix(1:4, ncol=2, dimnames=list(NULL, c("c.1", "c.2")))

To test for missing data, avoid using identical() and never use ==. Instead, you can use the following functions.

Function	Description
`is.na(x)`	Tests for `NA` or `NaN` data in `x`
`is.nan(x)`	Tests for `NaN` data in `x`
`is.null(x)`	Tests if `x` is `NULL`

Table: Functions for testing missing or null data

x = c(4, 7, 2, 0, 1, NA)
(x == NA)

[1] NA NA NA NA NA NA

is.na(x)

[1] FALSE FALSE FALSE FALSE FALSE  TRUE

any(is.na(x))

[1] TRUE

(y <- x/0)

[1] Inf Inf Inf NaN Inf  NA

is.nan(y)

[1] FALSE FALSE FALSE  TRUE FALSE FALSE

is.na(y)

[1] FALSE FALSE FALSE  TRUE FALSE  TRUE
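A few common idioms built on these tests are sketched below. The names n_missing and x_complete are illustrative only.

```r
# Illustrative sketch: counting and dropping missing values, and NULL versus NA
x <- c(4, 7, 2, 0, 1, NA)

n_missing  <- sum(is.na(x))   # number of missing values: 1
x_complete <- x[!is.na(x)]    # keep only the non-missing elements

is.null(NULL)   # TRUE
length(NULL)    # 0: NULL is an empty object
length(NA)      # 1: NA is a value of length one
is.null(NA)     # FALSE: NA is missing data, not a null object

print(n_missing); print(x_complete)
```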