7  Dataframes

7.1 Introduction

In R, a dataframe is a two-dimensional, table-like structure used for storing datasets. It has rows and columns; different columns can hold different types of data (e.g., numeric, character, factor), but all columns must have the same length. In this chapter, we explore how to create, manipulate, and analyze dataframes in R. We cover two approaches: the base R approach and the tidyverse approach.

# Load necessary libraries
library(tidyverse)
library(readxl)

7.2 The base R approach

7.2.1 Using the data.frame function

In the base R approach, we use the data.frame function to create a dataframe and the str function to examine its structure. The syntax of data.frame is

data.frame(..., row.names = NULL, check.rows = FALSE,
           check.names = TRUE, fix.empty.names = TRUE,
           stringsAsFactors = FALSE)

where

  • ... : these are the columns of the dataframe specified as name = value pairs.
  • row.names : this specifies the row names of the dataframe.
  • check.rows : if TRUE, the rows are checked for consistency of length and names.
  • check.names : if TRUE, checks that the column names are syntactically valid and unique, adjusting them if necessary.
  • fix.empty.names : if TRUE, automatically constructs names for columns supplied without a name.
  • stringsAsFactors : if TRUE, converts character vectors to factors.

Below, we provide some illustrative examples.

# Create a simple dataframe
df1 <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(25, 30, 35),
  Height = c(5.5, 6.0, 5.8)
)

# Print the dataframe
df1
     Name Age Height
1   Alice  25    5.5
2     Bob  30    6.0
3 Charlie  35    5.8
# Examine the structure of the dataframe
str(df1)
'data.frame':   3 obs. of  3 variables:
 $ Name  : chr  "Alice" "Bob" "Charlie"
 $ Age   : num  25 30 35
 $ Height: num  5.5 6 5.8

Note that the row names are automatically assigned as 1, 2, 3, etc. if not specified. We can use the row.names argument to assign custom row names.

# Create a dataframe with custom row names
df2 <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(25, 30, 35),
  Height = c(5.5, 6.0, 5.8),
  row.names = c("A", "B", "C")
)
# Print the dataframe with custom row names
df2
     Name Age Height
A   Alice  25    5.5
B     Bob  30    6.0
C Charlie  35    5.8

In the following table, we summarize some useful functions to examine dataframes in R.

Table 7.1: Some useful functions to examine dataframes
Function Description
str() Displays the structure of the dataframe
is.data.frame() Checks if an object is a dataframe
as.data.frame() Converts an object to a dataframe
head() Displays the first few rows of the dataframe
tail() Displays the last few rows of the dataframe
dim() Returns the dimensions (rows and columns)
nrow() Returns the number of rows
ncol() Returns the number of columns
names() Returns the column names
summary() Provides a summary of each column
colnames() Gets or sets the column names
rownames() Gets or sets the row names

Below are some examples of using these functions.

# Create a dataframe
x <- runif(10)
y <- letters[1:10]
z <- sample(c(rep(TRUE, 5), rep(FALSE, 5)))
df3 <- data.frame(x, y, z)
# Display the structure of the dataframe
str(df3)
'data.frame':   10 obs. of  3 variables:
 $ x: num  0.666 0.419 0.385 0.601 0.383 ...
 $ y: chr  "a" "b" "c" "d" ...
 $ z: logi  FALSE TRUE TRUE FALSE TRUE TRUE ...
# Display the first few rows of the dataframe
head(df3)
          x y     z
1 0.6662274 a FALSE
2 0.4190514 b  TRUE
3 0.3853232 c  TRUE
4 0.6005672 d FALSE
5 0.3834097 e  TRUE
6 0.8991093 f  TRUE
# Display the dimensions of the dataframe
dim(df3)
[1] 10  3
# Get the column names of the dataframe
names(df3)
[1] "x" "y" "z"
# Get a summary of the dataframe
summary(df3)
       x                 y                 z          
 Min.   :0.002241   Length:10          Mode :logical  
 1st Qu.:0.383888   Class :character   FALSE:5        
 Median :0.454043   Mode  :character   TRUE :5        
 Mean   :0.480009                                     
 3rd Qu.:0.595667                                     
 Max.   :0.899109                                     
# Change the column names of the dataframe
colnames(df3) <- c("RandomNumbers", "Letters", "LogicalValues")
head(df3)
  RandomNumbers Letters LogicalValues
1     0.6662274       a         FALSE
2     0.4190514       b          TRUE
3     0.3853232       c          TRUE
4     0.6005672       d         FALSE
5     0.3834097       e          TRUE
6     0.8991093       f          TRUE
# Check if df3 is a dataframe
is.data.frame(df3)
[1] TRUE
# Convert a matrix to a dataframe
mat <- matrix(data = 10:18, nrow = 3)
df4 <- as.data.frame(mat)
df4
  V1 V2 V3
1 10 13 16
2 11 14 17
3 12 15 18

7.2.2 Importing and exporting dataframes

The native R format for saving dataframes (and other R objects) is .RData or .rda. To save a dataframe in this format, we use the save function, which has the following syntax:

save(..., list = character(),
     file = stop("'file' must be specified"),
     ascii = FALSE, version = NULL, envir = parent.frame(),
     compress = isTRUE(!ascii), compression_level,
     eval.promises = TRUE, precheck = TRUE)

where ... are the objects to be saved, list is a character vector of object names to be saved, and file is the name of the file to save the objects to.

# Save df2 and df3 to Rdata file
save(df2, df3, file = "data/dataframes.Rdata")

To load data from an .Rdata or .rda file, we use the load function.

# Load data from Rdata file
load("data/dataframes.Rdata")

The read.table function can be used to import data from delimited text files, such as space-delimited, tab-delimited, and comma-separated values (CSV) files. The syntax of read.table is as follows:

read.table(file, header = FALSE, sep = "", quote = "\"'",
           dec = ".", numerals = c("allow.loss", "warn.loss", "no.loss"),
           row.names, col.names, as.is = !stringsAsFactors,
           na.strings = "NA", colClasses = NA, nrows = -1,
           skip = 0, check.names = TRUE, fill = !blank.lines.skip,
           strip.white = FALSE, blank.lines.skip = TRUE,
           comment.char = "#", allowEscapes = FALSE,
           flush = FALSE, stringsAsFactors = FALSE,
           fileEncoding = "", encoding = "unknown", text, skipNul = FALSE)

Some of the important arguments are:

  • file : the name of the file (including its path and extension) to be read.
  • header : if TRUE, the first line of the file contains the column names.
  • sep : the character that separates the values in the file (e.g., comma, tab, space).
  • as.is : if TRUE, character columns are not converted to factors.
  • na.strings : a character vector of strings to be interpreted as NA values.
  • skip : the number of lines to skip before reading the data.

In the following example, we load the data from the caschool.csv file located in the data folder. This dataset contains information on kindergarten through eighth-grade students across 420 California school districts in 1999. It is a district-level dataset that includes variables on average student performance and demographic characteristics. In Table 7.2, we describe the variables in this dataset.

Table 7.2: Description of variables in the caschool dataframe
Variable Description
dist_cod District code
read_scr Average reading score
math_scr Average math score
county County
district District
gr_span Grade span of district
enrl_tot Total enrollment
teachers Number of teachers
computer Number of computers
testscr Average test score (= (read_scr + math_scr)/2)
comp_stu Computers per student (= computer / enrl_tot)
expn_stu Expenditures per student
str Student-teacher ratio (= enrl_tot / teachers)
el_pct Percent of English learners
meal_pct Percent qualifying for reduced-price lunch
calw_pct Percent qualifying for CalWORKs
avginc District average income (in $1000s)

# Load the caschool.csv file
caschool <- read.table("data/caschool.csv", header = TRUE, sep = ",", as.is = TRUE)
# Examine the structure of the caschool dataframe
str(caschool)
'data.frame':   420 obs. of  18 variables:
 $ Observation.Number: int  1 2 3 4 5 6 7 8 9 10 ...
 $ dist_cod          : int  75119 61499 61549 61457 61523 62042 68536 63834 62331 67306 ...
 $ county            : chr  "Alameda" "Butte" "Butte" "Butte" ...
 $ district          : chr  "Sunol Glen Unified" "Manzanita Elementary" "Thermalito Union Elementary" "Golden Feather Union Elementary" ...
 $ gr_span           : chr  "KK-08" "KK-08" "KK-08" "KK-08" ...
 $ enrl_tot          : int  195 240 1550 243 1335 137 195 888 379 2247 ...
 $ teachers          : num  10.9 11.1 82.9 14 71.5 ...
 $ calw_pct          : num  0.51 15.42 55.03 36.48 33.11 ...
 $ meal_pct          : num  2.04 47.92 76.32 77.05 78.43 ...
 $ computer          : int  67 101 169 85 171 25 28 66 35 0 ...
 $ testscr           : num  691 661 644 648 641 ...
 $ comp_stu          : num  0.344 0.421 0.109 0.35 0.128 ...
 $ expn_stu          : num  6385 5099 5502 7102 5236 ...
 $ str               : num  17.9 21.5 18.7 17.4 18.7 ...
 $ avginc            : num  22.69 9.82 8.98 8.98 9.08 ...
 $ el_pct            : num  0 4.58 30 0 13.86 ...
 $ read_scr          : num  692 660 636 652 642 ...
 $ math_scr          : num  690 662 651 644 640 ...

To export a dataframe to a text file, we can use the write.table function. The syntax of this function is as follows:

write.table(x, file = "", append = FALSE, quote = TRUE,
            sep = " ", eol = "\n", na = "NA", dec = ".",
            row.names = TRUE, col.names = TRUE,
            qmethod = c("escape", "double"),
            fileEncoding = "")

where x is the dataframe to be exported, file is the name of the file to write to, sep is the character that separates the values in the file, na is the string to represent NA values, and row.names and col.names indicate whether to include row and column names in the output file.

# Export the caschool dataframe to a CSV file (drop row names for a clean CSV)
write.table(x = caschool, file = "data/caschool_exported.csv", sep = ",", row.names = FALSE)

The simplest and recommended way to import Excel files is to use the read_excel function from the readxl package. The haven package provides functions read_sas() and read_stata() to import SAS and Stata files, respectively.
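
For instance, assuming the Excel version of the dataset is available as data/caschool.xlsx (the same file is imported again in Section 7.3.2), it can be read as follows. Note that read_excel returns a tibble, the tidyverse variant of a dataframe discussed in Section 7.3.

# Import the first sheet of the Excel file
caschool_xl <- read_excel("data/caschool.xlsx", sheet = 1, col_names = TRUE)
dim(caschool_xl)
[1] 420  18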

7.2.3 Selecting rows and columns

We use the [ , ] operator to select rows and columns from a dataframe: data[row_indices, column_indices]. The row_indices and column_indices can be specified using numeric indices, logical vectors, or character vectors (column names). For column selection, we can also use the $ operator to access a specific column by its name.

# Column names
colnames(caschool)
 [1] "Observation.Number" "dist_cod"           "county"            
 [4] "district"           "gr_span"            "enrl_tot"          
 [7] "teachers"           "calw_pct"           "meal_pct"          
[10] "computer"           "testscr"            "comp_stu"          
[13] "expn_stu"           "str"                "avginc"            
[16] "el_pct"             "read_scr"           "math_scr"          
# Selecting the 'testscr' column using the $ operator
summary(caschool$testscr)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  605.6   640.0   654.4   654.2   666.7   706.8 
# Selecting testscr and meal_pct columns using [ ] operator with column names
head(caschool[,c("testscr", "meal_pct")])
  testscr meal_pct
1  690.80   2.0408
2  661.20  47.9167
3  643.60  76.3226
4  647.70  77.0492
5  640.85  78.4270
6  605.55  86.9565
# Selecting the 11th and 16th columns using [ ] operator with numeric indices
head(caschool[, c(11,16)]) 
  testscr    el_pct
1  690.80  0.000000
2  661.20  4.583333
3  643.60 30.000002
4  647.70  0.000000
5  640.85 13.857677
6  605.55 12.408759
# Selecting the first 5 rows of the columns 'testscr' and 'meal_pct'
head(caschool[1:5, c("testscr", "meal_pct")])
  testscr meal_pct
1  690.80   2.0408
2  661.20  47.9167
3  643.60  76.3226
4  647.70  77.0492
5  640.85  78.4270
# Accessing specific rows using a logical condition
head(caschool[caschool$meal_pct > 90, c("district", "meal_pct")]) # rows where meal_pct > 90
                   district meal_pct
7     Holt Union Elementary  94.6237
8       Vineland Elementary 100.0000
9  Orange Center Elementary  93.1398
12   West Fresno Elementary  98.6056
13   Allensworth Elementary  98.1308
16  Pixley Union Elementary  94.2957

In Table 7.3, we summarize some common dataframe indexing operations in R.

Table 7.3: Some common dataframe indexing operations
Command Meaning
data[n,] Select the nth row of the dataframe
data[-n,] Remove the nth row of the dataframe
data[1:n,] Select all of the columns from rows 1 to n of the dataframe
data[-(1:n),] Remove rows from 1 to n of the dataframe
data[c(i,j,k),] Select rows i, j, and k of the dataframe
data[x > y,] Select rows where x > y
data[,m] Select the mth column of the dataframe
data[,-m] Remove the mth column of the dataframe
data[,1:m] Select columns 1 to m of the dataframe
data[,-(1:m)] Remove all columns from 1 to m of the dataframe
data[,c(i,j,k)] Select columns i, j, and k of the dataframe
data[,x > y] Select columns where x > y
data[x > y,a != b] Select rows where x > y and columns where a != b
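
For example, negative indices remove rows or columns rather than selecting them. The following drops the first five rows and the first two columns of caschool, respectively:

# Remove the first 5 rows and check the resulting dimensions
dim(caschool[-(1:5), ])
[1] 415  18
# Remove the first two columns and check the resulting dimensions
dim(caschool[, -(1:2)])
[1] 420  16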

7.2.4 Adding and removing columns and rows

We can add new columns to a dataframe by assigning values to a new column name using the $ operator or the [ ] operator.

# Adding a new column using the $ operator
caschool$meal_pct_squared <- caschool$meal_pct^2
# Adding a new column using the [ ] operator
caschool[,"log_testscr"] <- log(caschool$testscr)

Alternatively, we can use the cbind function to add new columns to a dataframe. In the following example, we add a new column sqrt_testscr to the caschool dataframe using cbind.

# Adding a new column using cbind
caschool <- cbind(caschool, sqrt_testscr = sqrt(caschool$testscr))
head(caschool[, c("testscr", "log_testscr", "sqrt_testscr")])
  testscr log_testscr sqrt_testscr
1  690.80    6.537850     26.28307
2  661.20    6.494056     25.71381
3  643.60    6.467077     25.36927
4  647.70    6.473428     25.44995
5  640.85    6.462795     25.31501
6  605.55    6.406137     24.60793

We can remove columns from a dataframe by assigning NULL to the column using the $ operator or the [ ] operator.

# Removing a column using the $ operator
caschool$meal_pct_squared <- NULL
# Removing a column using the [ ] operator
caschool[,"log_testscr"] <- NULL

The subset function can also be used to select specific columns from a dataframe. The syntax of the subset function is as follows:

subset(x, subset, select, drop = FALSE, ...)

where x is the dataframe, subset is a logical expression to filter rows, and select is a vector of column names or indices to select specific columns. In the following example, we use the subset function to create a new dataframe with only the testscr, meal_pct, and enrl_tot columns from the caschool dataframe.

# Selecting specific columns using subset
caschool_subset <- subset(caschool, select = c("testscr", "meal_pct", "enrl_tot"))
head(caschool_subset)
  testscr meal_pct enrl_tot
1  690.80   2.0408      195
2  661.20  47.9167      240
3  643.60  76.3226     1550
4  647.70  77.0492      243
5  640.85  78.4270     1335
6  605.55  86.9565      137
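
The subset argument can be combined with select to filter rows and pick columns in a single call. For instance, the following reproduces the meal_pct > 90 selection from the previous subsection:

# Filter rows and select columns in one call using subset
head(subset(caschool, subset = meal_pct > 90, select = c("district", "meal_pct")))
                   district meal_pct
7     Holt Union Elementary  94.6237
8       Vineland Elementary 100.0000
9  Orange Center Elementary  93.1398
12   West Fresno Elementary  98.6056
13   Allensworth Elementary  98.1308
16  Pixley Union Elementary  94.2957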

We use the rbind function to combine two dataframes by rows. The dataframes must have the same column names. In the following example, we create two dataframes df_a and df_b, and then combine them using rbind.

# Create two dataframes
df_a <- data.frame(
  Name = c("David", "Eva"),
  Age = c(28, 32),
  Height = c(5.9, 5.6)
)
df_b <- data.frame(
  Name = c("Frank", "Grace"),
  Age = c(27, 29),
  Height = c(6.1, 5.7)
)
# Combine the dataframes using rbind
df_combined <- rbind(df_a, df_b)
# Print the combined dataframe
df_combined
   Name Age Height
1 David  28    5.9
2   Eva  32    5.6
3 Frank  27    6.1
4 Grace  29    5.7

7.2.5 Sorting

We can use the order function to sort a dataframe by one or more columns. The order function returns the indices that would sort the specified columns. In the following example, we sort the caschool dataframe by the testscr column in descending order.

# Sort the caschool dataframe by testscr in descending order
caschool_sorted <- caschool[order(caschool$testscr, decreasing = TRUE), ]
head(caschool_sorted[, c("district", "testscr")])
                        district testscr
417         Los Altos Elementary  706.75
416       Las Lomitas Elementary  704.30
415    Saratoga Union Elementary  700.30
414    Portola Valley Elementary  699.10
413       Cold Spring Elementary  698.45
412 Hillsborough City Elementary  698.25

In the following example, we sort the caschool dataframe by the avginc column and then by the testscr column, both in descending order.

# Sort the caschool dataframe by avginc and testscr (both descending)
caschool_sorted1 <- caschool[order(-caschool$avginc, -caschool$testscr), ]
# Display the first 10 rows of the sorted dataframe
caschool_sorted1[1:10, c("district", "avginc", "testscr")]
                      district   avginc testscr
405      Reed Union Elementary 55.32800  693.95
414  Portola Valley Elementary 50.67700  699.10
404 Menlo Park City Elementary 49.93900  691.90
413     Cold Spring Elementary 43.23000  698.45
402 Montecito Union Elementary 43.23000  691.05
417       Los Altos Elementary 41.73411  706.75
387       Kentfield Elementary 41.09300  681.90
415  Saratoga Union Elementary 40.40200  700.30
411    Orinda Union Elementary 40.26400  698.20
408   Del Mar Union Elementary 38.62857  695.20

7.2.6 Missing values

In R, missing values are represented by NA. To see whether there are any missing values in the caschool dataframe, we can use the any function along with is.na.

# Check if there are any missing values in the caschool dataframe
any(is.na(caschool))
[1] FALSE

The na.omit function can be used to remove rows with missing values from a dataframe. In the following example, we create a dataframe with some missing values and then use is.na and na.omit to handle the missing values.

# Create a dataframe with missing values
df_missing <- data.frame(
  Name = c("Alice", "Bob", "John", "David"),
  Age = c(25, NA, 30, NA),
  Height = c(5.5, 6.0, 5.8, NA)
)
# Check for missing values
is.na(df_missing)
      Name   Age Height
[1,] FALSE FALSE  FALSE
[2,] FALSE  TRUE  FALSE
[3,] FALSE FALSE  FALSE
[4,] FALSE  TRUE   TRUE
# Remove rows with missing values
df_no_missing <- na.omit(df_missing)
df_no_missing
   Name Age Height
1 Alice  25    5.5
3  John  30    5.8

To see how many missing values are in each column of a dataframe, we can use the colSums function along with is.na.

# Count missing values in each column
colSums(is.na(df_missing))
  Name    Age Height 
     0      2      1 

In the following example, we replace missing values in a dataframe with a specific value (e.g., 0).

# Replace missing values with 0
df <- df_missing
df[is.na(df)] <- 0
df
   Name Age Height
1 Alice  25    5.5
2   Bob   0    6.0
3  John  30    5.8
4 David   0    0.0

In the following example, we replace missing values in df_missing with the mean of the respective columns.

# Replace missing values with column means
df <- df_missing
df$Age[is.na(df$Age)] <- mean(df$Age, na.rm = TRUE)
df$Height[is.na(df$Height)] <- mean(df$Height, na.rm = TRUE)
df
   Name  Age   Height
1 Alice 25.0 5.500000
2   Bob 27.5 6.000000
3  John 30.0 5.800000
4 David 27.5 5.766667

7.2.7 Merging and reshaping

The merge function in R is used to combine two dataframes by common columns or row names. The syntax of the merge function is as follows:

merge(x, y, by = intersect(names(x), names(y)),
      by.x = by, by.y = by,
      all = FALSE, all.x = FALSE, all.y = FALSE,
      sort = TRUE, suffixes = c(".x", ".y"),
      incomparables = NULL, ...)

where x and y are the dataframes to be merged, by specifies the common columns to merge on, and all, all.x, and all.y control the type of join (inner, left, right, or full).

In the following example, we create two dataframes and merge them by a common column ID. The all = TRUE argument specifies a full outer join, which includes all rows from both dataframes.

# Create two dataframes to merge
df1 <- data.frame(
  ID = c(1, 2, 3),
  Name = c("Alice", "Bob", "Charlie")
)
df1
  ID    Name
1  1   Alice
2  2     Bob
3  3 Charlie
df2 <- data.frame(
  ID = c(2, 3, 4),
  Age = c(30, 35, 40)
)
df2
  ID Age
1  2  30
2  3  35
3  4  40
# Merge the dataframes by the 'ID' column
merged_df <- merge(df1, df2, by = "ID", all = TRUE)
merged_df
  ID    Name Age
1  1   Alice  NA
2  2     Bob  30
3  3 Charlie  35
4  4    <NA>  40

In the following example, we perform an inner join by setting the all = FALSE argument, which includes only the rows with matching ID values in both dataframes.

# Merge the dataframes by the 'ID' column (inner join)
inner_merged_df <- merge(df1, df2, by = "ID", all = FALSE)
inner_merged_df
  ID    Name Age
1  2     Bob  30
2  3 Charlie  35

In the following example, we perform a left join by setting the all.x = TRUE argument, which includes all rows from the first dataframe (df1) and only the matching rows from the second dataframe (df2).

# Merge the dataframes by the 'ID' column (left join)
left_merged_df <- merge(df1, df2, by = "ID", all.x = TRUE)
left_merged_df
  ID    Name Age
1  1   Alice  NA
2  2     Bob  30
3  3 Charlie  35

Finally, we perform a right join by setting the all.y = TRUE argument, which includes all rows from the second dataframe (df2) and only the matching rows from the first dataframe (df1).

# Merge the dataframes by the 'ID' column (right join)
right_merged_df <- merge(df1, df2, by = "ID", all.y = TRUE)
right_merged_df
  ID    Name Age
1  2     Bob  30
2  3 Charlie  35
3  4    <NA>  40

We use the reshape function to reshape dataframes from wide format to long format and vice versa. The syntax of the reshape function is as follows:

reshape(data, varying = NULL, v.names = NULL, timevar = "time",
        idvar = "id", ids = 1:NROW(data),
        times = seq_along(varying[[1]]),
        drop = NULL, direction, new.row.names = NULL,
        sep = ".",
        split = if (sep == "") {
            list(regexp = "[A-Za-z][0-9]", include = TRUE)
        } else {
            list(regexp = sep, include = FALSE, fixed = TRUE)}
        )

where

  • data : the dataframe to be reshaped.
  • varying : names of sets of variables in the wide format that correspond to single variables in long format (‘time-varying’).
  • v.names : names of variables in the long format that correspond to multiple variables in the wide format.
  • timevar : the name of the variable in the long format that indicates time or measurement occasions.
  • times : the values to use for a newly created timevar variable in long format.
  • idvar : the name of the variable that identifies the subjects or units.
  • direction : the direction of reshaping, either “long” or “wide”.

In the following example, we create a dataframe in wide format and then reshape it to long format.

# Create a dataframe in wide format
wide_df <- data.frame(
  ID = c(1, 2, 3),
  Name = c("Alice", "Bob", "Charlie"),
  Score_2020 = c(85, 90, 95),
  Score_2021 = c(88, 92, 96)
)
wide_df
  ID    Name Score_2020 Score_2021
1  1   Alice         85         88
2  2     Bob         90         92
3  3 Charlie         95         96
# Reshape the dataframe from wide to long format
long_df <- reshape(wide_df, varying = c("Score_2020", "Score_2021"),
                   v.names = "Score",
                   timevar = "Year",
                   times = c(2020, 2021),
                   idvar = c("ID", "Name"),
                   direction = "long")

long_df
               ID    Name Year Score
1.Alice.2020    1   Alice 2020    85
2.Bob.2020      2     Bob 2020    90
3.Charlie.2020  3 Charlie 2020    95
1.Alice.2021    1   Alice 2021    88
2.Bob.2021      2     Bob 2021    92
3.Charlie.2021  3 Charlie 2021    96

Finally, we reshape the long format dataframe back to wide format.

wide_df2 <- reshape(long_df, 
                    v.names = "Score",
                    timevar = "Year",
                    idvar = c("ID", "Name"),
                    direction = "wide")

wide_df2
               ID    Name Score.2020 Score.2021
1.Alice.2020    1   Alice         85         88
2.Bob.2020      2     Bob         90         92
3.Charlie.2020  3 Charlie         95         96

7.3 The tidyverse approach

7.3.1 Using the tibble function

In the tidyverse approach, we work with tibbles, an enhanced variant of dataframes with stricter subsetting behavior and more informative printing. We use the tibble function to create tibbles. The syntax of the tibble function is as follows:

tibble(
  ...,
  .rows = NULL,
  .name_repair = c("check_unique", "unique", "universal", "minimal")
)

where ... are the columns of the tibble specified as name = value pairs, .rows specifies the number of rows, and .name_repair controls how to handle column names.

In the following example, we create a simple tibble with three columns: Name, Age, and Height. In the printed output, the row below the column names shows the data type of each column.

# Create a simple tibble
tb1 <- tibble(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(25, 30, 35),
  Height = c(5.5, 6.0, 5.8)
)
# Check the class of the tibble
class(tb1)
[1] "tbl_df"     "tbl"        "data.frame"
# Print the tibble
tb1
# A tibble: 3 × 3
  Name      Age Height
  <chr>   <dbl>  <dbl>
1 Alice      25    5.5
2 Bob        30    6  
3 Charlie    35    5.8

We can use $, [ ], and [[ ]] to select columns from a tibble. The operators $ and [[ ]] return a vector, while [ ] returns a tibble.

# Selecting the 'Age' column using the $ operator
tb1$Age
[1] 25 30 35
# Selecting the 'Height' column 
tb1[,"Height"]
# A tibble: 3 × 1
  Height
   <dbl>
1    5.5
2    6  
3    5.8
# Selecting the 'Height' column using [[ ]]
tb1[["Height"]]
[1] 5.5 6.0 5.8
# Selecting the 2nd column
tb1[,2]  
# A tibble: 3 × 1
    Age
  <dbl>
1    25
2    30
3    35
# Selecting multiple columns
tb1[,c("Name", "Height")] 
# A tibble: 3 × 2
  Name    Height
  <chr>    <dbl>
1 Alice      5.5
2 Bob        6  
3 Charlie    5.8

7.3.2 Importing and exporting tibbles

The read_csv function from the readr package is used to import CSV files as tibbles. The syntax of the read_csv function is as follows:

read_csv(file, col_names = TRUE, col_types = NULL,
         locale = default_locale(), na = c("", "NA"),
         quoted_na = TRUE, quote = "\"", comment = "",
         trim_ws = TRUE, skip = 0, n_max = Inf,
         guess_max = min(1000, n_max), progress = show_progress(),
         skip_empty_rows = TRUE)

In the following example, we load the caschool.csv file as a tibble.

# Load the caschool.csv file as a tibble
caschool_tb <- read_csv("data/caschool.csv", col_names = TRUE, na = c("", "NA"),
                        skip = 0)
# Check the class of the tibble
class(caschool_tb)
[1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame" 
# Examine the structure of the tibble
glimpse(caschool_tb)
Rows: 420
Columns: 18
$ `Observation Number` <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15…
$ dist_cod             <dbl> 75119, 61499, 61549, 61457, 61523, 62042, 68536, …
$ county               <chr> "Alameda", "Butte", "Butte", "Butte", "Butte", "F…
$ district             <chr> "Sunol Glen Unified", "Manzanita Elementary", "Th…
$ gr_span              <chr> "KK-08", "KK-08", "KK-08", "KK-08", "KK-08", "KK-…
$ enrl_tot             <dbl> 195, 240, 1550, 243, 1335, 137, 195, 888, 379, 22…
$ teachers             <dbl> 10.90, 11.15, 82.90, 14.00, 71.50, 6.40, 10.00, 4…
$ calw_pct             <dbl> 0.5102, 15.4167, 55.0323, 36.4754, 33.1086, 12.31…
$ meal_pct             <dbl> 2.0408, 47.9167, 76.3226, 77.0492, 78.4270, 86.95…
$ computer             <dbl> 67, 101, 169, 85, 171, 25, 28, 66, 35, 0, 86, 56,…
$ testscr              <dbl> 690.80, 661.20, 643.60, 647.70, 640.85, 605.55, 6…
$ comp_stu             <dbl> 0.34358975, 0.42083332, 0.10903226, 0.34979424, 0…
$ expn_stu             <dbl> 6384.911, 5099.381, 5501.955, 7101.831, 5235.988,…
$ str                  <dbl> 17.88991, 21.52466, 18.69723, 17.35714, 18.67133,…
$ avginc               <dbl> 22.690001, 9.824000, 8.978000, 8.978000, 9.080333…
$ el_pct               <dbl> 0.000000, 4.583333, 30.000002, 0.000000, 13.85767…
$ read_scr             <dbl> 691.6, 660.5, 636.3, 651.9, 641.8, 605.7, 604.5, …
$ math_scr             <dbl> 690.0, 661.9, 650.9, 643.5, 639.9, 605.4, 609.0, …

To export a tibble to a CSV file, we can use the write_csv function from the readr package. The syntax of this function is as follows:

write_csv(x, file, na = "NA", append = FALSE,
          col_names = !append, quote_escape = "double")

where x is the tibble to be exported, file is the name of the file to write to, na is the string to represent NA values, and col_names indicates whether to include column names in the output file.

In the following example, we export the caschool_tb tibble to a CSV file.

# Export the caschool_tb tibble to a CSV file
write_csv(caschool_tb, "data/caschool_export.csv")

The read_excel function from the readxl package can be used to import Excel files as tibbles. The syntax of the read_excel function is as follows:

read_excel(path, sheet = NULL, range = NULL, col_names = TRUE,
           col_types = NULL, na = "", trim_ws = TRUE,
           skip = 0, n_max = Inf, guess_max = min(1000, n_max),
           progress = readxl_progress(), .name_repair = "unique")

In the following example, we load the caschool.xlsx file as a tibble.

caschool_tb <- read_excel("data/caschool.xlsx", col_names = TRUE, sheet = 1)
class(caschool_tb)
[1] "tbl_df"     "tbl"        "data.frame"
glimpse(caschool_tb)
Rows: 420
Columns: 18
$ `Observation Number` <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15…
$ dist_cod             <dbl> 75119, 61499, 61549, 61457, 61523, 62042, 68536, …
$ county               <chr> "Alameda", "Butte", "Butte", "Butte", "Butte", "F…
$ district             <chr> "Sunol Glen Unified", "Manzanita Elementary", "Th…
$ gr_span              <chr> "KK-08", "KK-08", "KK-08", "KK-08", "KK-08", "KK-…
$ enrl_tot             <dbl> 195, 240, 1550, 243, 1335, 137, 195, 888, 379, 22…
$ teachers             <dbl> 10.90, 11.15, 82.90, 14.00, 71.50, 6.40, 10.00, 4…
$ calw_pct             <dbl> 0.5102, 15.4167, 55.0323, 36.4754, 33.1086, 12.31…
$ meal_pct             <dbl> 2.0408, 47.9167, 76.3226, 77.0492, 78.4270, 86.95…
$ computer             <dbl> 67, 101, 169, 85, 171, 25, 28, 66, 35, 0, 86, 56,…
$ testscr              <dbl> 690.80, 661.20, 643.60, 647.70, 640.85, 605.55, 6…
$ comp_stu             <dbl> 0.34358975, 0.42083332, 0.10903226, 0.34979424, 0…
$ expn_stu             <dbl> 6384.911, 5099.381, 5501.955, 7101.831, 5235.988,…
$ str                  <dbl> 17.88991, 21.52466, 18.69723, 17.35714, 18.67133,…
$ avginc               <dbl> 22.690001, 9.824000, 8.978000, 8.978000, 9.080333…
$ el_pct               <dbl> 0.000000, 4.583333, 30.000002, 0.000000, 13.85767…
$ read_scr             <dbl> 691.6, 660.5, 636.3, 651.9, 641.8, 605.7, 604.5, …
$ math_scr             <dbl> 690.0, 661.9, 650.9, 643.5, 639.9, 605.4, 609.0, …

The haven package provides functions read_sas() and read_stata() to import SAS and Stata files, respectively, as tibbles.
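
As a minimal sketch (the file paths below are hypothetical placeholders; the caschool dataset is not provided here in SAS or Stata format), the corresponding calls would look as follows:

# haven is installed with the tidyverse but must be loaded explicitly
# library(haven)
# caschool_dta <- read_stata("data/caschool.dta")       # hypothetical Stata file
# caschool_sas <- read_sas("data/caschool.sas7bdat")    # hypothetical SAS file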

7.3.3 Selecting rows and columns

As in the base R approach, we can use the indexing operations listed in Table 7.3 to select rows and columns from a tibble. In the following, we provide some illustrative examples.

# Selecting the 'testscr' column
summary(caschool_tb$testscr)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  605.6   640.0   654.4   654.2   666.7   706.8 
# Selecting the first 5 rows of the columns 'testscr' and 'meal_pct'
head(caschool_tb[1:5, c("testscr", "meal_pct")])
# A tibble: 5 × 2
  testscr meal_pct
    <dbl>    <dbl>
1    691.     2.04
2    661.    47.9 
3    644.    76.3 
4    648.    77.0 
5    641.    78.4 
# Accessing specific rows using a logical condition
head(caschool_tb[caschool_tb$meal_pct > 90, c("district", "meal_pct")]) # rows where meal_pct > 90
# A tibble: 6 × 2
  district                 meal_pct
  <chr>                       <dbl>
1 Holt Union Elementary        94.6
2 Vineland Elementary         100  
3 Orange Center Elementary     93.1
4 West Fresno Elementary       98.6
5 Allensworth Elementary       98.1
6 Pixley Union Elementary      94.3

7.3.4 Adding and removing columns and rows

We can use the same techniques as in the base R approach to add and remove columns and rows, sort, handle missing values, and merge and reshape tibbles. Below, we provide some illustrative examples.

# Adding a new column using the $ operator
caschool_tb$meal_pct_squared <- caschool_tb$meal_pct^2
head(caschool_tb[, c("meal_pct", "meal_pct_squared")])
# A tibble: 6 × 2
  meal_pct meal_pct_squared
     <dbl>            <dbl>
1     2.04             4.16
2    47.9           2296.  
3    76.3           5825.  
4    77.0           5937.  
5    78.4           6151.  
6    87.0           7561.  
# Removing a column using the $ operator
caschool_tb$meal_pct_squared <- NULL
# Selecting specific columns using subset
caschool_subset_tb <- subset(caschool_tb, select = c("testscr", "meal_pct", "enrl_tot"))
head(caschool_subset_tb)
# A tibble: 6 × 3
  testscr meal_pct enrl_tot
    <dbl>    <dbl>    <dbl>
1    691.     2.04      195
2    661.    47.9       240
3    644.    76.3      1550
4    648.    77.0       243
5    641.    78.4      1335
6    606.    87.0       137
# Check for missing values
any(is.na(caschool_tb))
[1] FALSE
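
Base R sorting carries over to tibbles as well. For instance, the following brief sketch sorts caschool_tb by testscr in descending order, mirroring the base R example in Section 7.2.5:

# Sort the tibble by testscr in descending order
caschool_tb_sorted <- caschool_tb[order(caschool_tb$testscr, decreasing = TRUE), ]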

7.3.5 Manipulating tibbles with dplyr

The dplyr package provides a set of functions to manipulate tibbles. Some of the commonly used functions are:

  • filter() : to filter rows based on conditions.
  • select() : to select specific columns.
  • mutate() : to add new columns or modify existing columns.
  • arrange() : to sort the tibble by one or more columns.
  • summarize() : to summarize the tibble by calculating summary statistics.
  • group_by() : to group the tibble by one or more columns for summary operations.
  • left_join(), right_join(), inner_join(), full_join() : to merge two tibbles by common columns.
  • bind_rows() : to combine two tibbles by rows.
  • bind_cols() : to combine two tibbles by columns.
  • rename() : to rename columns in a tibble.
  • relocate() : to change the order of columns in a tibble.
  • slice() and its variants (e.g., slice_head(), slice_max(), slice_sample()) : to select rows by their position.
  • count() : to count the number of occurrences of unique values in a column.

The dplyr functions can be used with the pipe operator %>% or |> to create a sequence of operations. This operator takes the input before the pipe and uses it as the first argument of the function after the pipe. For example, x |> f(y) is equivalent to f(x, y), and x |> f(y) |> g(z) is equivalent to g(f(x, y), z).

In the following example, we first filter the rows where meal_pct is greater than 90, and then select the district and meal_pct columns.

# Filter rows where meal_pct > 90 and select specific columns
caschool_tb_filtered <- caschool_tb |>
  filter(meal_pct > 90) |>
  select(district, meal_pct)
# Display the first few rows of the filtered tibble
head(caschool_tb_filtered)
# A tibble: 6 × 2
  district                 meal_pct
  <chr>                       <dbl>
1 Holt Union Elementary        94.6
2 Vineland Elementary         100  
3 Orange Center Elementary     93.1
4 West Fresno Elementary       98.6
5 Allensworth Elementary       98.1
6 Pixley Union Elementary      94.3

In the following example, we add a new column log_avginc that is the logarithm of the avginc column, relocate the log_avginc column before the avginc column, and finally sort the tibble by the log_avginc column in descending order.

# Add a new column, relocate it, and sort the tibble
caschool_tb_filtered <- caschool_tb |>
  mutate(log_avginc = log(avginc)) |>
  relocate(log_avginc, .before = avginc) |>
  arrange(desc(log_avginc))
# Examine the structure of the filtered tibble
glimpse(caschool_tb_filtered)
Rows: 420
Columns: 19
$ `Observation Number` <dbl> 405, 414, 404, 402, 413, 417, 387, 415, 411, 408,…
$ dist_cod             <dbl> 65425, 68981, 68965, 69252, 69161, 69518, 65334, …
$ county               <chr> "Marin", "San Mateo", "San Mateo", "Santa Barbara…
$ district             <chr> "Reed Union Elementary", "Portola Valley Elementa…
$ gr_span              <chr> "KK-08", "KK-08", "KK-08", "KK-06", "KK-06", "KK-…
$ enrl_tot             <dbl> 1059, 687, 1940, 535, 220, 3724, 1091, 2341, 2422…
$ teachers             <dbl> 65.120, 44.590, 111.820, 31.000, 12.330, 208.480,…
$ calw_pct             <dbl> 0.8499, 0.3049, 0.4124, 0.0800, 0.4545, 1.0741, 0…
$ meal_pct             <dbl> 1.4164, 0.0000, 3.9175, 0.7200, 0.0000, 1.5038, 0…
$ computer             <dbl> 272, 209, 265, 84, 22, 721, 113, 286, 466, 300, 3…
$ testscr              <dbl> 693.95, 699.10, 691.90, 691.05, 698.45, 706.75, 6…
$ comp_stu             <dbl> 0.2568461, 0.3042212, 0.1365979, 0.1570093, 0.100…
$ expn_stu             <dbl> 6460.657, 7217.263, 6180.149, 7039.949, 6500.450,…
$ str                  <dbl> 16.26229, 15.40704, 17.34931, 17.25806, 17.84266,…
$ log_avginc           <dbl> 4.013279, 3.925472, 3.910802, 3.766535, 3.766535,…
$ avginc               <dbl> 55.32800, 50.67700, 49.93900, 43.23000, 43.23000,…
$ el_pct               <dbl> 2.2662890, 1.1644833, 6.1340203, 1.4953271, 1.363…
$ read_scr             <dbl> 695.9, 698.3, 693.7, 693.5, 693.3, 704.0, 689.8, …
$ math_scr             <dbl> 692.0, 699.9, 690.1, 688.6, 703.6, 709.5, 674.0, …
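
Several of the dplyr functions listed earlier, such as group_by() and summarize(), are not used in the examples above. As a brief sketch (output not shown), the following pipeline computes the average test score and the number of districts in each county, sorted from the highest to the lowest average score:

# Average test score and number of districts by county
caschool_tb |>
  group_by(county) |>
  summarize(mean_testscr = mean(testscr), n_districts = n()) |>
  arrange(desc(mean_testscr))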

7.4 Further Readings

For further reading on dataframes, we recommend Jones, Harden, and Crawley (2022) for the base R approach and Wickham, Grolemund, and Çetinkaya-Rundel (2023) for the tidyverse approach.