# Load necessary libraries
library(tidyverse)
library(readxl)
7 Dataframes
7.1 Introduction
In R, a dataframe is a matrix-like structure used for storing datasets. It has rows and columns; different columns can hold different types of data (e.g., numeric, character, factor), but all values within a column must be of the same type. In this chapter, we will explore how to create, manipulate, and analyze dataframes in R. We will cover two approaches: the base R approach and the tidyverse approach.
7.2 The base R approach
7.2.1 Using the data.frame function
In the base R approach, we use the data.frame function to create a dataframe and the str function to examine its structure. The syntax of data.frame is
data.frame(..., row.names = NULL, check.rows = FALSE,
check.names = TRUE, fix.empty.names = TRUE,
stringsAsFactors = FALSE)
where
- ...: the columns of the dataframe, specified as name = value pairs.
- row.names: the row names of the dataframe.
- check.rows: if TRUE, checks that the rows are consistent in length and names.
- check.names: if TRUE, checks that the column names are syntactically valid R variable names.
- fix.empty.names: if TRUE, automatically constructs names for columns passed without a name.
- stringsAsFactors: if TRUE, converts character vectors to factors.
Below, we provide some illustrative examples.
# Create a simple dataframe
df1 <- data.frame(
Name = c("Alice", "Bob", "Charlie"),
Age = c(25, 30, 35),
Height = c(5.5, 6.0, 5.8)
)
# Print the dataframe
df1
  Name Age Height
1 Alice 25 5.5
2 Bob 30 6.0
3 Charlie 35 5.8
# Examine the structure of the dataframe
str(df1)
'data.frame': 3 obs. of 3 variables:
$ Name : chr "Alice" "Bob" "Charlie"
$ Age : num 25 30 35
$ Height: num 5.5 6 5.8
Note that the row names are automatically assigned as 1, 2, 3, etc. if not specified. We can use the row.names argument to assign custom row names.
# Create a dataframe with custom row names
df2 <- data.frame(
Name = c("Alice", "Bob", "Charlie"),
Age = c(25, 30, 35),
Height = c(5.5, 6.0, 5.8),
row.names = c("A", "B", "C")
)
# Print the dataframe with custom row names
df2
  Name Age Height
A Alice 25 5.5
B Bob 30 6.0
C Charlie 35 5.8
In the following table, we summarize some useful functions to examine dataframes in R.
| Function | Description |
|---|---|
| str() | Displays the structure of the dataframe |
| is.data.frame() | Checks if an object is a dataframe |
| as.data.frame() | Converts an object to a dataframe |
| head() | Displays the first few rows of the dataframe |
| tail() | Displays the last few rows of the dataframe |
| dim() | Returns the dimensions (rows and columns) |
| nrow() | Returns the number of rows |
| ncol() | Returns the number of columns |
| names() | Returns the column names |
| summary() | Provides a summary of each column |
| colnames() | Gets or sets the column names |
| rownames() | Gets or sets the row names |
Below are some examples of using these functions.
# Create a dataframe
x <- runif(10)
y <- letters[1:10]
z <- sample(c(rep(TRUE, 5), rep(FALSE, 5)))
df3 <- data.frame(x, y, z)
# Display the structure of the dataframe
str(df3)
'data.frame': 10 obs. of 3 variables:
$ x: num 0.666 0.419 0.385 0.601 0.383 ...
$ y: chr "a" "b" "c" "d" ...
$ z: logi FALSE TRUE TRUE FALSE TRUE TRUE ...
# Display the first few rows of the dataframe
head(df3)
  x y z
1 0.6662274 a FALSE
2 0.4190514 b TRUE
3 0.3853232 c TRUE
4 0.6005672 d FALSE
5 0.3834097 e TRUE
6 0.8991093 f TRUE
# Display the dimensions of the dataframe
dim(df3)
[1] 10 3
# Get the column names of the dataframe
names(df3)
[1] "x" "y" "z"
# Get a summary of the dataframe
summary(df3)
  x y z
Min. :0.002241 Length:10 Mode :logical
1st Qu.:0.383888 Class :character FALSE:5
Median :0.454043 Mode :character TRUE :5
Mean :0.480009
3rd Qu.:0.595667
Max. :0.899109
# Change the column names of the dataframe
colnames(df3) <- c("RandomNumbers", "Letters", "LogicalValues")
head(df3)
  RandomNumbers Letters LogicalValues
1 0.6662274 a FALSE
2 0.4190514 b TRUE
3 0.3853232 c TRUE
4 0.6005672 d FALSE
5 0.3834097 e TRUE
6 0.8991093 f TRUE
# Check if df3 is a dataframe
is.data.frame(df3)
[1] TRUE
# Convert a matrix to a dataframe
mat <- matrix(data = 10:18, nrow = 3)
df4 <- as.data.frame(mat)
df4
  V1 V2 V3
1 10 13 16
2 11 14 17
3 12 15 18
7.2.2 Importing and exporting dataframes
The R native format for saving dataframes is .Rdata or .rda. To save a dataframe in this format, we use the save function, which has the following syntax:
save(..., list = character(),
file = stop("'file' must be specified"),
ascii = FALSE, version = NULL, envir = parent.frame(),
compress = isTRUE(!ascii), compression_level,
eval.promises = TRUE, precheck = TRUE)
where ... are the objects to be saved, list is a character vector of object names to be saved, and file is the name of the file to save the objects to.
# Save df2 and df3 to Rdata file
save(df2, df3, file = "data/dataframes.Rdata")
To load data from an .Rdata or .rda file, we use the load function.
# Load data from Rdata file
load("data/dataframes.Rdata")The read.table function can be used to import data from the following text file formats: space delimited, tab delimited, and comma separated values (CSV). The syntax of read.table is as follows:
read.table(file, header = FALSE, sep = "", quote = "\"'",
dec = ".", numerals = c("allow.loss", "warn.loss", "no.loss"),
row.names, col.names, as.is = !stringsAsFactors,
na.strings = "NA", colClasses = NA, nrows = -1,
skip = 0, check.names = TRUE, fill = !blank.lines.skip,
strip.white = FALSE, blank.lines.skip = TRUE,
comment.char = "#", allowEscapes = FALSE,
flush = FALSE, stringsAsFactors = default.stringsAsFactors(),
fileEncoding = "", encoding = "unknown", text, skipNul = FALSE)Some of the important arguments are:
file: the name of the file with the extension to be read.header: if TRUE, the first line of the file contains the column names.sep: the character that separates the values in the file (e.g., comma, tab, space).as.is: if TRUE, character columns are not converted to factors.na.strings: a character vector of strings to be interpreted as NA values.skip: the number of lines to skip before reading the data.
In the following example, we load the data from the caschool.csv file located in the data folder. This dataset contains information on kindergarten through eighth-grade students across 420 California school districts in 1999. It is a district-level dataset that includes variables on average student performance and demographic characteristics. In Table 7.2, we describe the variables in this dataset.
| Variable | Description |
|---|---|
| dist_cod | District code |
| read_scr | Average reading score |
| math_scr | Average math score |
| county | County |
| district | District |
| gr_span | Grade span of district |
| enrl_tot | Total enrollment |
| teachers | Number of teachers |
| computer | Number of computers |
| testscr | Average test score (= (read_scr + math_scr)/2) |
| comp_stu | Computers per student (= computer / enrl_tot) |
| expn_stu | Expenditures per student |
| str | Student-teacher ratio (= enrl_tot / teachers) |
| el_pct | Percent of English learners |
| meal_pct | Percent qualifying for reduced-price lunch |
| calw_pct | Percent qualifying for CalWorks |
| avginc | District average income (in $1000s) |
# Load the caschool.csv file
caschool <- read.table("data/caschool.csv", header = TRUE, sep = ",", as.is = TRUE)
# Examine the structure of the caschool dataframe
str(caschool)
'data.frame': 420 obs. of 18 variables:
$ Observation.Number: int 1 2 3 4 5 6 7 8 9 10 ...
$ dist_cod : int 75119 61499 61549 61457 61523 62042 68536 63834 62331 67306 ...
$ county : chr "Alameda" "Butte" "Butte" "Butte" ...
$ district : chr "Sunol Glen Unified" "Manzanita Elementary" "Thermalito Union Elementary" "Golden Feather Union Elementary" ...
$ gr_span : chr "KK-08" "KK-08" "KK-08" "KK-08" ...
$ enrl_tot : int 195 240 1550 243 1335 137 195 888 379 2247 ...
$ teachers : num 10.9 11.1 82.9 14 71.5 ...
$ calw_pct : num 0.51 15.42 55.03 36.48 33.11 ...
$ meal_pct : num 2.04 47.92 76.32 77.05 78.43 ...
$ computer : int 67 101 169 85 171 25 28 66 35 0 ...
$ testscr : num 691 661 644 648 641 ...
$ comp_stu : num 0.344 0.421 0.109 0.35 0.128 ...
$ expn_stu : num 6385 5099 5502 7102 5236 ...
$ str : num 17.9 21.5 18.7 17.4 18.7 ...
$ avginc : num 22.69 9.82 8.98 8.98 9.08 ...
$ el_pct : num 0 4.58 30 0 13.86 ...
$ read_scr : num 692 660 636 652 642 ...
$ math_scr : num 690 662 651 644 640 ...
To export a dataframe to a text file, we can use the write.table function. The syntax of this function is as follows:
write.table(x, file = "", append = FALSE, quote = TRUE,
sep = " ", eol = "\n", na = "NA", dec = ".",
row.names = TRUE, col.names = TRUE,
qmethod = c("escape", "double"),
fileEncoding = "")where x is the dataframe to be exported, file is the name of the file to write to, sep is the character that separates the values in the file, na is the string to represent NA values, and row.names and col.names indicate whether to include row and column names in the output file.
# Export the caschool dataframe to a CSV file
write.table(x = caschool, file = "data/caschool_exported.csv", sep = ",")
The simplest and recommended way to import Excel files is to use the read_excel function from the readxl package. The haven package provides functions read_sas() and read_stata() to import SAS and Stata files, respectively.
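As a quick illustration, the following sketch imports the first worksheet of an Excel workbook with read_excel; it assumes the file data/caschool.xlsx (also used in Section 7.3.2) is present in the data folder. Because read_excel returns a tibble, we convert the result to a plain dataframe.
# Import the first worksheet of an Excel workbook (assumes data/caschool.xlsx exists)
caschool_xl <- read_excel("data/caschool.xlsx", sheet = 1, col_names = TRUE)
# read_excel returns a tibble; convert it to a plain dataframe if preferred
caschool_xl <- as.data.frame(caschool_xl)
str(caschool_xl)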
7.2.3 Selecting rows and columns
We use the [ , ] operator to select rows and columns from a dataframe: data[row_indices, column_indices]. The row_indices and column_indices can be specified using numeric indices, logical vectors, or character vectors (column names). For column selection, we can also use the $ operator to access a specific column by its name.
# Column names
colnames(caschool)
 [1] "Observation.Number" "dist_cod" "county"
[4] "district" "gr_span" "enrl_tot"
[7] "teachers" "calw_pct" "meal_pct"
[10] "computer" "testscr" "comp_stu"
[13] "expn_stu" "str" "avginc"
[16] "el_pct" "read_scr" "math_scr"
# Selecting the 'testscr' column using the $ operator
summary(caschool$testscr)
  Min. 1st Qu. Median Mean 3rd Qu. Max.
605.6 640.0 654.4 654.2 666.7 706.8
# Selecting testscr and meal_pct columns using [ ] operator with column names
head(caschool[,c("testscr", "meal_pct")]) testscr meal_pct
1 690.80 2.0408
2 661.20 47.9167
3 643.60 76.3226
4 647.70 77.0492
5 640.85 78.4270
6 605.55 86.9565
# Selecting the 11th and 16th columns using [ ] operator with numeric indices
head(caschool[, c(11,16)])
  testscr el_pct
1 690.80 0.000000
2 661.20 4.583333
3 643.60 30.000002
4 647.70 0.000000
5 640.85 13.857677
6 605.55 12.408759
# Selecting the first 5 rows of the columns 'testscr' and 'meal_pct'
head(caschool[1:5, c("testscr", "meal_pct")])
  testscr meal_pct
1 690.80 2.0408
2 661.20 47.9167
3 643.60 76.3226
4 647.70 77.0492
5 640.85 78.4270
# Accessing specific rows using a logical condition
head(caschool[caschool$meal_pct > 90, c("district", "meal_pct")]) # rows where meal_pct > 90
   district meal_pct
7 Holt Union Elementary 94.6237
8 Vineland Elementary 100.0000
9 Orange Center Elementary 93.1398
12 West Fresno Elementary 98.6056
13 Allensworth Elementary 98.1308
16 Pixley Union Elementary 94.2957
In Table 7.3, we summarize some common dataframe indexing operations in R.
| Command | Meaning |
|---|---|
| data[n,] | Select the nth row of the dataframe |
| data[-n,] | Remove the nth row of the dataframe |
| data[1:n,] | Select all of the columns from rows 1 to n of the dataframe |
| data[-(1:n),] | Remove rows 1 to n of the dataframe |
| data[c(i,j,k),] | Select rows i, j, and k of the dataframe |
| data[x > y,] | Select rows where x > y |
| data[,m] | Select the mth column of the dataframe |
| data[,-m] | Remove the mth column of the dataframe |
| data[,1:m] | Select columns 1 to m of the dataframe |
| data[,-(1:m)] | Remove columns 1 to m of the dataframe |
| data[,c(i,j,k)] | Select columns i, j, and k of the dataframe |
| data[,x > y] | Select columns where x > y |
| data[x > y, a != b] | Select rows where x > y and columns where a != b |
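The last row of Table 7.3 combines a row condition with a column condition in a single call. As a sketch, the code below keeps the rows where meal_pct exceeds 90 and, for the columns, keeps only the numeric variables (a logical vector with one entry per column).
# Combine a row condition with a logical column selection
numeric_cols <- sapply(caschool, is.numeric)  # TRUE for each numeric column
high_meal <- caschool[caschool$meal_pct > 90, numeric_cols]
dim(high_meal)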
7.2.4 Adding and removing columns and rows
We can add new columns to a dataframe by assigning values to a new column name using the $ operator or the [ ] operator.
# Adding a new column using the $ operator
caschool$meal_pct_squared <- caschool$meal_pct^2
# Adding a new column using the [ ] operator
caschool[,"log_testscr"] <- log(caschool$testscr)Alternatively, we can use the cbind function to add new columns to a dataframe. In the following example, we add a new column sqrt_testscr to the caschool dataframe using cbind.
# Adding a new column using cbind
caschool <- cbind(caschool, sqrt_testscr = sqrt(caschool$testscr))
head(caschool[, c("testscr", "log_testscr", "sqrt_testscr")]) testscr log_testscr sqrt_testscr
1 690.80 6.537850 26.28307
2 661.20 6.494056 25.71381
3 643.60 6.467077 25.36927
4 647.70 6.473428 25.44995
5 640.85 6.462795 25.31501
6 605.55 6.406137 24.60793
We can remove columns from a dataframe by assigning NULL to the column using the $ operator or the [ ] operator.
# Removing a column using the $ operator
caschool$meal_pct_squared <- NULL
# Removing a column using the [ ] operator
caschool[,"log_testscr"] <- NULLThe subset function can also be used to select specific columns from a dataframe. The syntax of the subset function is as follows:
subset(x, subset, select, drop = FALSE, ...)
where x is the dataframe, subset is a logical expression to filter rows, and select is a vector of column names or indices to select specific columns. In the following example, we use the subset function to create a new dataframe with only the testscr, meal_pct, and enrl_tot columns from the caschool dataframe.
# Selecting specific columns using subset
caschool_subset <- subset(caschool, select = c("testscr", "meal_pct", "enrl_tot"))
head(caschool_subset)
  testscr meal_pct enrl_tot
1 690.80 2.0408 195
2 661.20 47.9167 240
3 643.60 76.3226 1550
4 647.70 77.0492 243
5 640.85 78.4270 1335
6 605.55 86.9565 137
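The subset argument of subset can be combined with select in the same call. The sketch below keeps only the districts where meal_pct exceeds 90 and returns the same three columns as above.
# Filter rows and select columns in a single subset() call
caschool_high_meal <- subset(caschool, subset = meal_pct > 90,
                             select = c("testscr", "meal_pct", "enrl_tot"))
head(caschool_high_meal)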
We use the rbind function to combine two dataframes by rows. The dataframes must have the same columns. In the following example, we create two dataframes df_a and df_b, and then combine them using rbind.
# Create two dataframes
df_a <- data.frame(
Name = c("David", "Eva"),
Age = c(28, 32),
Height = c(5.9, 5.6)
)
df_b <- data.frame(
Name = c("Frank", "Grace"),
Age = c(27, 29),
Height = c(6.1, 5.7)
)
# Combine the dataframes using rbind
df_combined <- rbind(df_a, df_b)
# Print the combined dataframe
df_combined
  Name Age Height
1 David 28 5.9
2 Eva 32 5.6
3 Frank 27 6.1
4 Grace 29 5.7
7.2.5 Sorting
We can use the order function to sort a dataframe by one or more columns. The order function returns the indices that would sort the specified columns. In the following example, we sort the caschool dataframe by the testscr column in descending order.
# Sort the caschool dataframe by testscr in descending order
caschool_sorted <- caschool[order(caschool$testscr, decreasing = TRUE), ]
head(caschool_sorted[, c("district", "testscr")])
    district testscr
417 Los Altos Elementary 706.75
416 Las Lomitas Elementary 704.30
415 Saratoga Union Elementary 700.30
414 Portola Valley Elementary 699.10
413 Cold Spring Elementary 698.45
412 Hillsborough City Elementary 698.25
In the following example, we sort the caschool dataframe first by the avginc column and then by the testscr column, both in descending order.
# Sort the caschool dataframe by avginc and testscr (both descending)
caschool_sorted1 <- caschool[order(-caschool$avginc, -caschool$testscr), ]
# Display the first 10 rows of the sorted dataframe
caschool_sorted1[1:10, c("district", "avginc", "testscr")]
    district avginc testscr
405 Reed Union Elementary 55.32800 693.95
414 Portola Valley Elementary 50.67700 699.10
404 Menlo Park City Elementary 49.93900 691.90
413 Cold Spring Elementary 43.23000 698.45
402 Montecito Union Elementary 43.23000 691.05
417 Los Altos Elementary 41.73411 706.75
387 Kentfield Elementary 41.09300 681.90
415 Saratoga Union Elementary 40.40200 700.30
411 Orinda Union Elementary 40.26400 698.20
408 Del Mar Union Elementary 38.62857 695.20
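Sort directions can also be mixed across columns. In the following sketch, we sort by county in ascending alphabetical order and, within each county, by testscr in descending order; negating a numeric column reverses its sort order.
# Sort by county (ascending) and by testscr within county (descending)
caschool_sorted2 <- caschool[order(caschool$county, -caschool$testscr), ]
head(caschool_sorted2[, c("county", "district", "testscr")])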
7.2.6 Missing values
In R, missing values are represented by NA. To see whether there are any missing values in the caschool dataframe, we can use the any function along with is.na.
# Check if there are any missing values in the caschool dataframe
any(is.na(caschool))
[1] FALSE
The na.omit function can be used to remove rows with missing values from a dataframe. In the following example, we create a dataframe with some missing values and then use is.na and na.omit to handle the missing values.
# Create a dataframe with missing values
df_missing <- data.frame(
Name = c("Alice", "Bob", "John", "David"),
Age = c(25, NA, 30, NA),
Height = c(5.5, 6.0, 5.8, NA)
)
# Check for missing values
is.na(df_missing)
      Name   Age Height
[1,] FALSE FALSE FALSE
[2,] FALSE TRUE FALSE
[3,] FALSE FALSE FALSE
[4,] FALSE TRUE TRUE
# Remove rows with missing values
df_no_missing <- na.omit(df_missing)
df_no_missing
  Name Age Height
1 Alice 25 5.5
3 John 30 5.8
To see how many missing values are in each column of a dataframe, we can use the colSums function along with is.na.
# Count missing values in each column
colSums(is.na(df_missing))
  Name Age Height
0 2 1
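A related base R helper, not used elsewhere in this chapter, is complete.cases, which returns a logical vector marking the rows that contain no missing values; indexing with it keeps the same rows as na.omit.
# Identify rows without any missing values
complete.cases(df_missing)
# Keep only the complete rows (same rows as na.omit(df_missing))
df_missing[complete.cases(df_missing), ]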
In the following example, we replace missing values in a dataframe with a specific value (e.g., 0).
# Replace missing values with 0
df <- df_missing
df[is.na(df)] <- 0
df
  Name Age Height
1 Alice 25 5.5
2 Bob 0 6.0
3 John 30 5.8
4 David 0 0.0
In the following example, we replace missing values in df_missing with the mean of the respective columns.
# Replace missing values with column means
df <- df_missing
df$Age[is.na(df$Age)] <- mean(df$Age, na.rm = TRUE)
df$Height[is.na(df$Height)] <- mean(df$Height, na.rm = TRUE)
df
  Name Age Height
1 Alice 25.0 5.500000
2 Bob 27.5 6.000000
3 John 30.0 5.800000
4 David 27.5 5.766667
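Imputing each column by hand becomes tedious for wide dataframes. The following is a more general sketch that replaces the NA values in every numeric column of df_missing with that column's mean.
# Replace NAs in every numeric column with the column mean
df <- df_missing
numeric_cols <- sapply(df, is.numeric)
df[numeric_cols] <- lapply(df[numeric_cols], function(col) {
  col[is.na(col)] <- mean(col, na.rm = TRUE)
  col
})
df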
7.2.7 Merging and reshaping
The merge function in R is used to combine two dataframes by common columns or row names. The syntax of the merge function is as follows:
merge(x, y, by = intersect(names(x), names(y)),
by.x = by, by.y = by,
all = FALSE, all.x = FALSE, all.y = FALSE,
sort = TRUE, suffixes = c(".x", ".y"),
incomparables = NULL, ...)
where x and y are the dataframes to be merged, by specifies the common columns to merge on, and all, all.x, and all.y control the type of join (inner, left, right, or full).
In the following example, we create two dataframes and merge them by a common column ID. The all = TRUE argument specifies a full outer join, which includes all rows from both dataframes.
# Create two dataframes to merge
df1 <- data.frame(
ID = c(1, 2, 3),
Name = c("Alice", "Bob", "Charlie")
)
df1
  ID Name
1 1 Alice
2 2 Bob
3 3 Charlie
df2 <- data.frame(
ID = c(2, 3, 4),
Age = c(30, 35, 40)
)
df2
  ID Age
1 2 30
2 3 35
3 4 40
# Merge the dataframes by the 'ID' column
merged_df <- merge(df1, df2, by = "ID", all = TRUE)
merged_df
  ID Name Age
1 1 Alice NA
2 2 Bob 30
3 3 Charlie 35
4 4 <NA> 40
In the following example, we perform an inner join by setting the all = FALSE argument, which includes only the rows with matching ID values in both dataframes.
# Merge the dataframes by the 'ID' column (inner join)
inner_merged_df <- merge(df1, df2, by = "ID", all = FALSE)
inner_merged_df
  ID Name Age
1 2 Bob 30
2 3 Charlie 35
In the following example, we perform a left join by setting the all.x = TRUE argument, which includes all rows from the first dataframe (df1) and only the matching rows from the second dataframe (df2).
# Merge the dataframes by the 'ID' column (left join)
left_merged_df <- merge(df1, df2, by = "ID", all.x = TRUE)
left_merged_df
  ID Name Age
1 1 Alice NA
2 2 Bob 30
3 3 Charlie 35
Finally, we perform a right join by setting the all.y = TRUE argument, which includes all rows from the second dataframe (df2) and only the matching rows from the first dataframe (df1).
# Merge the dataframes by the 'ID' column (right join)
right_merged_df <- merge(df1, df2, by = "ID", all.y = TRUE)
right_merged_df
  ID Name Age
1 2 Bob 30
2 3 Charlie 35
3 4 <NA> 40
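When the key column has a different name in each dataframe, the by.x and by.y arguments specify the two key columns separately. In the sketch below, we rename the ID column of df2 to Code purely for illustration and then merge on the differently named keys.
# Merge on key columns with different names using by.x and by.y
df2_renamed <- df2
names(df2_renamed)[names(df2_renamed) == "ID"] <- "Code"
merge(df1, df2_renamed, by.x = "ID", by.y = "Code", all = TRUE)
The merged result keeps the by.x name (ID) for the key column.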
We use the reshape function to reshape dataframes from wide format to long format and vice versa. The syntax of the reshape function is as follows:
reshape(data, varying = NULL, v.names = NULL, timevar = "time",
idvar = "id", ids = 1:NROW(data),
times = seq_along(varying[[1]]),
drop = NULL, direction, new.row.names = NULL,
sep = ".",
split = if (sep == "") {
list(regexp = "[A-Za-z][0-9]", include = TRUE)
} else {
list(regexp = sep, include = FALSE, fixed = TRUE)}
)
where
- data: the dataframe to be reshaped.
- varying: names of sets of variables in the wide format that correspond to single variables in long format ("time-varying").
- v.names: names of variables in the long format that correspond to multiple variables in the wide format.
- timevar: the name of the variable in the long format that indicates time or measurement occasions.
- times: the values to use for a newly created timevar variable in long format.
- idvar: the name of the variable that identifies the subjects or units.
- direction: the direction of reshaping, either "long" or "wide".
In the following example, we create a dataframe in wide format and then reshape it to long format.
# Create a dataframe in wide format
wide_df <- data.frame(
ID = c(1, 2, 3),
Name = c("Alice", "Bob", "Charlie"),
Score_2020 = c(85, 90, 95),
Score_2021 = c(88, 92, 96)
)
wide_df
  ID Name Score_2020 Score_2021
1 1 Alice 85 88
2 2 Bob 90 92
3 3 Charlie 95 96
# Reshape the dataframe from wide to long format
long_df <- reshape(wide_df, varying = c("Score_2020", "Score_2021"),
v.names = "Score",
timevar = "Year",
times = c(2020, 2021),
idvar = c("ID", "Name"),
direction = "long")
long_df
               ID Name Year Score
1.Alice.2020 1 Alice 2020 85
2.Bob.2020 2 Bob 2020 90
3.Charlie.2020 3 Charlie 2020 95
1.Alice.2021 1 Alice 2021 88
2.Bob.2021 2 Bob 2021 92
3.Charlie.2021 3 Charlie 2021 96
Finally, we reshape the long format dataframe back to wide format.
wide_df2 <- reshape(long_df,
v.names = "Score",
timevar = "Year",
idvar = c("ID", "Name"),
direction = "wide")
wide_df2
             ID Name Score.2020 Score.2021
1.Alice.2020 1 Alice 85 88
2.Bob.2020 2 Bob 90 92
3.Charlie.2020 3 Charlie 95 96
7.3 The tidyverse approach
7.3.1 Using the tibble function
In the tidyverse approach, dataframes are called tibbles. We use the tibble function to create tibbles. The syntax of the tibble function is as follows:
tibble(
...,
.rows = NULL,
.name_repair = c("check_unique", "unique", "universal", "minimal")
)
where ... are the columns of the tibble specified as name = value pairs, .rows specifies the number of rows, and .name_repair controls how to handle column names.
In the following example, we create a simple tibble with three columns: Name, Age, and Height. The row after the column names shows the data types of each column.
# Create a simple tibble
tb1 <- tibble(
Name = c("Alice", "Bob", "Charlie"),
Age = c(25, 30, 35),
Height = c(5.5, 6.0, 5.8)
)
# Check the class of the tibble
class(tb1)
[1] "tbl_df" "tbl" "data.frame"
# Print the tibble
tb1
# A tibble: 3 × 3
Name Age Height
<chr> <dbl> <dbl>
1 Alice 25 5.5
2 Bob 30 6
3 Charlie 35 5.8
We can use the $, [ ], and [[ ]] operators to select columns from a tibble. The operators $ and [[ ]] return a vector, while [ ] returns a tibble.
# Selecting the 'Age' column using the $ operator
tb1$Age
[1] 25 30 35
# Selecting the 'Height' column
tb1[,"Height"]# A tibble: 3 × 1
Height
<dbl>
1 5.5
2 6
3 5.8
# Selecting the 'Height' column using [[ ]]
tb1[["Height"]][1] 5.5 6.0 5.8
# Selecting the 2nd column
tb1[,2]
# A tibble: 3 × 1
Age
<dbl>
1 25
2 30
3 35
# Selecting multiple columns
tb1[,c("Name", "Height")] # A tibble: 3 × 2
Name Height
<chr> <dbl>
1 Alice 5.5
2 Bob 6
3 Charlie 5.8
7.3.2 Importing and exporting tibbles
The read_csv function from the readr package is used to import CSV files as tibbles. The syntax of the read_csv function is as follows:
read_csv(file, col_names = TRUE, col_types = NULL,
locale = default_locale(), na = c("", "NA"),
quoted_na = TRUE, quote = "\"", comment = "",
trim_ws = TRUE, skip = 0, n_max = Inf,
guess_max = min(1000, n_max), progress = show_progress(),
skip_empty_rows = TRUE)
In the following example, we load the caschool.csv file as a tibble.
# Load the caschool.csv file as a tibble
caschool_tb <- read_csv("data/caschool.csv", col_names = TRUE, na = c("", "NA"),
skip = 0)
# Check the class of the tibble
class(caschool_tb)
[1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
# Examine the structure of the tibble
glimpse(caschool_tb)
Rows: 420
Columns: 18
$ `Observation Number` <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15…
$ dist_cod <dbl> 75119, 61499, 61549, 61457, 61523, 62042, 68536, …
$ county <chr> "Alameda", "Butte", "Butte", "Butte", "Butte", "F…
$ district <chr> "Sunol Glen Unified", "Manzanita Elementary", "Th…
$ gr_span <chr> "KK-08", "KK-08", "KK-08", "KK-08", "KK-08", "KK-…
$ enrl_tot <dbl> 195, 240, 1550, 243, 1335, 137, 195, 888, 379, 22…
$ teachers <dbl> 10.90, 11.15, 82.90, 14.00, 71.50, 6.40, 10.00, 4…
$ calw_pct <dbl> 0.5102, 15.4167, 55.0323, 36.4754, 33.1086, 12.31…
$ meal_pct <dbl> 2.0408, 47.9167, 76.3226, 77.0492, 78.4270, 86.95…
$ computer <dbl> 67, 101, 169, 85, 171, 25, 28, 66, 35, 0, 86, 56,…
$ testscr <dbl> 690.80, 661.20, 643.60, 647.70, 640.85, 605.55, 6…
$ comp_stu <dbl> 0.34358975, 0.42083332, 0.10903226, 0.34979424, 0…
$ expn_stu <dbl> 6384.911, 5099.381, 5501.955, 7101.831, 5235.988,…
$ str <dbl> 17.88991, 21.52466, 18.69723, 17.35714, 18.67133,…
$ avginc <dbl> 22.690001, 9.824000, 8.978000, 8.978000, 9.080333…
$ el_pct <dbl> 0.000000, 4.583333, 30.000002, 0.000000, 13.85767…
$ read_scr <dbl> 691.6, 660.5, 636.3, 651.9, 641.8, 605.7, 604.5, …
$ math_scr <dbl> 690.0, 661.9, 650.9, 643.5, 639.9, 605.4, 609.0, …
To export a tibble to a CSV file, we can use the write_csv function from the readr package. The syntax of this function is as follows:
write_csv(x, file, na = "NA", append = FALSE,
col_names = !append, quote_escape = "double")
where x is the tibble to be exported, file is the name of the file to write to, na is the string to represent NA values, and col_names indicates whether to include column names in the output file.
In the following example, we export the caschool_tb tibble to a CSV file.
# Export the caschool_tb tibble to a CSV file
write_csv(caschool_tb, "data/caschool_export.csv")
The read_excel function from the readxl package can be used to import Excel files as tibbles. The syntax of the read_excel function is as follows:
read_excel(path, sheet = NULL, range = NULL, col_names = TRUE,
col_types = NULL, na = "", trim_ws = TRUE,
skip = 0, n_max = Inf, guess_max = min(1000, n_max),
progress = readxl_progress(), .name_repair = "unique")
In the following example, we load the caschool.xlsx file as a tibble.
caschool_tb <- read_excel("data/caschool.xlsx", col_names = TRUE, sheet = 1)
class(caschool_tb)
[1] "tbl_df" "tbl" "data.frame"
glimpse(caschool_tb)
Rows: 420
Columns: 18
$ `Observation Number` <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15…
$ dist_cod <dbl> 75119, 61499, 61549, 61457, 61523, 62042, 68536, …
$ county <chr> "Alameda", "Butte", "Butte", "Butte", "Butte", "F…
$ district <chr> "Sunol Glen Unified", "Manzanita Elementary", "Th…
$ gr_span <chr> "KK-08", "KK-08", "KK-08", "KK-08", "KK-08", "KK-…
$ enrl_tot <dbl> 195, 240, 1550, 243, 1335, 137, 195, 888, 379, 22…
$ teachers <dbl> 10.90, 11.15, 82.90, 14.00, 71.50, 6.40, 10.00, 4…
$ calw_pct <dbl> 0.5102, 15.4167, 55.0323, 36.4754, 33.1086, 12.31…
$ meal_pct <dbl> 2.0408, 47.9167, 76.3226, 77.0492, 78.4270, 86.95…
$ computer <dbl> 67, 101, 169, 85, 171, 25, 28, 66, 35, 0, 86, 56,…
$ testscr <dbl> 690.80, 661.20, 643.60, 647.70, 640.85, 605.55, 6…
$ comp_stu <dbl> 0.34358975, 0.42083332, 0.10903226, 0.34979424, 0…
$ expn_stu <dbl> 6384.911, 5099.381, 5501.955, 7101.831, 5235.988,…
$ str <dbl> 17.88991, 21.52466, 18.69723, 17.35714, 18.67133,…
$ avginc <dbl> 22.690001, 9.824000, 8.978000, 8.978000, 9.080333…
$ el_pct <dbl> 0.000000, 4.583333, 30.000002, 0.000000, 13.85767…
$ read_scr <dbl> 691.6, 660.5, 636.3, 651.9, 641.8, 605.7, 604.5, …
$ math_scr <dbl> 690.0, 661.9, 650.9, 643.5, 639.9, 605.4, 609.0, …
The haven package provides functions read_sas() and read_stata() to import SAS and Stata files, respectively, as tibbles.
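Neither format appears elsewhere in this chapter, so the following is only a sketch with hypothetical file names; like read_csv and read_excel, the haven functions return tibbles.
# Hypothetical imports of SAS and Stata files (the file names below are placeholders)
library(haven)
sas_tb <- read_sas("data/survey.sas7bdat")
stata_tb <- read_stata("data/survey.dta")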
7.3.3 Selecting rows and columns
As in the base R approach, we can use the indexing operations listed in Table 7.3 to select rows and columns from a tibble. Below, we provide some illustrative examples.
# Selecting the 'testscr' column
summary(caschool_tb$testscr)
  Min. 1st Qu. Median Mean 3rd Qu. Max.
605.6 640.0 654.4 654.2 666.7 706.8
# Selecting the first 5 rows of the columns 'testscr' and 'meal_pct'
head(caschool_tb[1:5, c("testscr", "meal_pct")])
# A tibble: 5 × 2
testscr meal_pct
<dbl> <dbl>
1 691. 2.04
2 661. 47.9
3 644. 76.3
4 648. 77.0
5 641. 78.4
# Accessing specific rows using a logical condition
head(caschool_tb[caschool_tb$meal_pct > 90, c("district", "meal_pct")]) # rows where meal_pct > 90
# A tibble: 6 × 2
district meal_pct
<chr> <dbl>
1 Holt Union Elementary 94.6
2 Vineland Elementary 100
3 Orange Center Elementary 93.1
4 West Fresno Elementary 98.6
5 Allensworth Elementary 98.1
6 Pixley Union Elementary 94.3
7.3.4 Adding and removing columns and rows
We can follow the same approach as in base R to add and remove columns and rows, sort, handle missing values, and merge and reshape tibbles. Below, we provide some illustrative examples.
# Adding a new column using the $ operator
caschool_tb$meal_pct_squared <- caschool_tb$meal_pct^2
head(caschool_tb[, c("meal_pct", "meal_pct_squared")])# A tibble: 6 × 2
meal_pct meal_pct_squared
<dbl> <dbl>
1 2.04 4.16
2 47.9 2296.
3 76.3 5825.
4 77.0 5937.
5 78.4 6151.
6 87.0 7561.
# Removing a column using the $ operator
caschool_tb$meal_pct_squared <- NULL
# Selecting specific columns using subset
caschool_subset_tb <- subset(caschool_tb, select = c("testscr", "meal_pct", "enrl_tot"))
head(caschool_subset_tb)
# A tibble: 6 × 3
testscr meal_pct enrl_tot
<dbl> <dbl> <dbl>
1 691. 2.04 195
2 661. 47.9 240
3 644. 76.3 1550
4 648. 77.0 243
5 641. 78.4 1335
6 606. 87.0 137
# Check for missing values
any(is.na(caschool_tb))
[1] FALSE
7.3.5 Manipulating tibbles with dplyr
The dplyr package provides a set of functions to manipulate tibbles. Some of the commonly used functions are:
- filter(): to filter rows based on conditions.
- select(): to select specific columns.
- mutate(): to add new columns or modify existing columns.
- arrange(): to sort the tibble by one or more columns.
- summarize(): to summarize the tibble by calculating summary statistics.
- group_by(): to group the tibble by one or more columns for summary operations.
- left_join(), inner_join(), full_join(), etc.: to merge two tibbles by common columns.
- bind_rows(): to combine two tibbles by rows.
- bind_cols(): to combine two tibbles by columns.
- rename(): to rename columns in a tibble.
- relocate(): to change the order of columns in a tibble.
- slice() and its variants (slice_head(), slice_tail(), slice_max(), slice_min(), slice_sample()): to select rows by their position.
- count(): to count the number of occurrences of unique values in a column.
The dplyr functions can be used with the pipe operator %>% or |> to create a sequence of operations. The pipe takes the input on its left and uses it as the first argument of the function on its right. For example, x |> f(y) is equivalent to f(x, y), and x |> f(y) |> g(z) is equivalent to g(f(x, y), z).
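For instance, the two calls below are equivalent ways of displaying the first three rows of caschool_tb, a minimal illustration of the rewriting rule above.
# Classic call
head(caschool_tb, 3)
# Equivalent pipe form: caschool_tb becomes the first argument of head()
caschool_tb |> head(3)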
In the following example, we first filter the rows where meal_pct is greater than 90, and then select the district and meal_pct columns.
# Filter rows where meal_pct > 90 and select specific columns
caschool_tb_filtered <- caschool_tb |>
filter(meal_pct > 90) |>
select(district, meal_pct)
# Display the first few rows of the filtered tibble
head(caschool_tb_filtered)
# A tibble: 6 × 2
district meal_pct
<chr> <dbl>
1 Holt Union Elementary 94.6
2 Vineland Elementary 100
3 Orange Center Elementary 93.1
4 West Fresno Elementary 98.6
5 Allensworth Elementary 98.1
6 Pixley Union Elementary 94.3
In the following example, we add a new column log_avginc that is the logarithm of the avginc column, relocate the log_avginc column before the avginc column, and finally sort the tibble by the log_avginc column in descending order.
# Add a new column, relocate it, and sort the tibble
caschool_tb_filtered <- caschool_tb |>
mutate(log_avginc = log(avginc)) |>
relocate(log_avginc, .before = avginc) |>
arrange(desc(log_avginc))
# Examine the structure of the filtered tibble
glimpse(caschool_tb_filtered)
Rows: 420
Columns: 19
$ `Observation Number` <dbl> 405, 414, 404, 402, 413, 417, 387, 415, 411, 408,…
$ dist_cod <dbl> 65425, 68981, 68965, 69252, 69161, 69518, 65334, …
$ county <chr> "Marin", "San Mateo", "San Mateo", "Santa Barbara…
$ district <chr> "Reed Union Elementary", "Portola Valley Elementa…
$ gr_span <chr> "KK-08", "KK-08", "KK-08", "KK-06", "KK-06", "KK-…
$ enrl_tot <dbl> 1059, 687, 1940, 535, 220, 3724, 1091, 2341, 2422…
$ teachers <dbl> 65.120, 44.590, 111.820, 31.000, 12.330, 208.480,…
$ calw_pct <dbl> 0.8499, 0.3049, 0.4124, 0.0800, 0.4545, 1.0741, 0…
$ meal_pct <dbl> 1.4164, 0.0000, 3.9175, 0.7200, 0.0000, 1.5038, 0…
$ computer <dbl> 272, 209, 265, 84, 22, 721, 113, 286, 466, 300, 3…
$ testscr <dbl> 693.95, 699.10, 691.90, 691.05, 698.45, 706.75, 6…
$ comp_stu <dbl> 0.2568461, 0.3042212, 0.1365979, 0.1570093, 0.100…
$ expn_stu <dbl> 6460.657, 7217.263, 6180.149, 7039.949, 6500.450,…
$ str <dbl> 16.26229, 15.40704, 17.34931, 17.25806, 17.84266,…
$ log_avginc <dbl> 4.013279, 3.925472, 3.910802, 3.766535, 3.766535,…
$ avginc <dbl> 55.32800, 50.67700, 49.93900, 43.23000, 43.23000,…
$ el_pct <dbl> 2.2662890, 1.1644833, 6.1340203, 1.4953271, 1.363…
$ read_scr <dbl> 695.9, 698.3, 693.7, 693.5, 693.3, 704.0, 689.8, …
$ math_scr <dbl> 692.0, 699.9, 690.1, 688.6, 703.6, 709.5, 674.0, …
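The group_by and summarize functions from the list above are not used in the previous examples. The following sketch groups the districts by county, computes the number of districts and the average test score per county, and sorts the counties by that average.
# Average test score by county using group_by() and summarize()
caschool_tb |>
  group_by(county) |>
  summarize(n_districts = n(),
            mean_testscr = mean(testscr)) |>
  arrange(desc(mean_testscr)) |>
  head()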
7.4 Further Readings
For further reading on dataframes, we recommend Jones, Harden, and Crawley (2022) for the base R approach and Wickham, Grolemund, and Çetinkaya-Rundel (2023) for the tidyverse approach.