22 Experiments and Quasi-Experiments

# Load necessary libraries
library(fixest)
library(sandwich)
library(lmtest)
library(modelsummary)
options("modelsummary_factory_default" = "gt") 
library(haven)
library(rdrobust)
library(knitr)
library(tidyverse)
library(labelled)
library(did)
library(broom)

22.1 Randomized controlled experiments

A randomized controlled experiment is a study designed to measure the effect of a treatment or policy intervention on an outcome variable by randomly assigning experimental units to treatment and control groups. The units in the treatment group receive the treatment or are subject to the intervention and those in the control group are not subject to any changes. We then measure the treatment or intervention effect by comparing the groups in terms of the outcome variable. We usually collect data on two variables of interest: the outcome variable, denoted by $Y$, and the treatment indicator, denoted by $D$, which specifies whether a unit is in the treatment or control group.

Example 22.1 (Clinical drug trial) Consider a randomized controlled experiment designed to assess whether a proposed drug lowers cholesterol levels. In this experiment, we have:

$Y$ is the cholesterol level,
$D=1$ if the individual is in the treatment group (i.e., receives the drug), and $D=0$ if the individual is in the control group (i.e., does not receive the drug).

Example 22.2 (Job training program) Consider a randomized controlled experiment designed to assess the effect of a job training program on wages. In this experiment, we have:

$Y$ represents the wage income,
$D=1$ if the individual is in the treatment group (i.e., participates to the job training program) and $D=0$ if the individual is in the control group (i.e., does not participate to the job training program).

Example 22.3 (Tennessee class size experiment) The Tennessee class size reduction experiment, known as Project STAR (Student–Teacher Achievement Ratio), was a 4-year experiment designed to evaluate the effect of class size on learning. Students were randomly assigned to one of three groups:

a regular class (22–25 students),
a regular class with a teacher’s aide, or
a small class (13–17 students).

In this experiment, students then took standardized tests (the Stanford Achievement Tests) in reading and math. The goal of the experiment was to assess the effect of class type (regular, regular with an aide, or small) on test scores. In this experiment, we have one outcome variable and two treatment indicators:

$Y$ is the test scores,
$D_1=1$ if the student is in a small class (i.e., the student is in the treatment group), and $D_1=0$ if the student is in a regular class (i.e., the student is in the control group).
$D_2=1$ if the student is in a regular class with a teacher’s aide (i.e., the student is in the treatment group) and $D_2=0$ if the student is in a regular class (i.e., the student is in the control group).

In all these examples, the experiment is deliberately designed and implemented to assess the causal effect of a treatment or intervention on the outcome variable. The assignment mechanism is the key feature that distinguishes randomized controlled experiments from other types of studies. A randomized controlled experiment is one in which the assignment mechanism satisfies the following properties (Imbens and Rubin (2015), Titiunik (2021)):

It is designed and implemented by the researchers.
It is known to the researchers.
It ensures randomization such that the probability of participating in the treatment does not depend on the potential outcomes (defined below) of the experimental units.

22.2 Potential outcomes and causal effects

To define the treatment effect, we need to first introduce the potential outcomes framework.

Definition 22.1 (Potential outcomes) The outcome of an individual under potential treatment or non-treatment is called the potential outcome. We use $Y_i(1)$ and $Y_i(0)$ to denote the treated and untreated potential outcomes of the $i$th individual, respectively.

If the individual is in the treated group, then the observed outcome $Y_i$ is given by the treated outcome, i.e., $Y_i=Y_i(1)$, and if the individual is in the control group, the observed outcome $Y_i$ is given by the untreated outcome, i.e., $Y_i=Y_i(0)$. Note that we do not observe the untreated potential outcome for the treated individual, and the treated potential outcome for the untreated individual.

Definition 22.2 (Causal effect) The causal effect (or treatment effect) is defined as the difference between the treated and untreated potential outcomes: $Y_i(1)-Y_i(0)$.

This treatment effect cannot be observed because an individual can be either treated or untreated, but not both at the same time. This is called the the fundamental problem of causal inference in the literature.

Definition 22.3 (Average treatment effect) The average treatment effect (ATE) is defined as \[ \text{ATE}=\E(Y_i(1)-Y_i(0)). \tag{22.1}\]

The observed outcome $Y_i$ can be expressed as $Y_i=D_iY_i(1)+(1-D_i)Y_i(0)$. If the individual is treated ($D_i=1$), then $Y_i=Y_i(1)$. Conversely, if the individual is untreated ($D_i=0$), then $Y_i=Y_i(0)$. Thus,
\[ \begin{align} Y_i&=D_iY_i(1)+(1-D_i)Y_i(0)\nonumber\\ &=Y_i(0)+D_i\left(Y_i(1)-Y_i(0)\right)\nonumber\\ &=\E\left[Y_i(0)\right]+D_i\left(Y_i(1)-Y_i(0)\right)+\left(Y_i(0)-\E\left[Y_i(0)\right]\right)\nonumber\\ &=\beta_{0i}+\beta_{1i}D_i+u_i, \end{align} \]

where

$\beta_{0i}=\E\left[Y_i(0)\right]$,
$\beta_{1i}=\left(Y_i(1)-Y_i(0)\right)$ is the individual causal effect, and
$u_i=\left(Y_i(0)-\E\left[Y_i(0)\right]\right)$.

Under the assumption that the treatment effect is constant, i.e., $\beta_{1i}=\beta_1$, and $\beta_{0i}=\beta_0$, we obtain the usual regression model: \[ \begin{align} Y_i=\beta_{0}+\beta_{1}D_i+u_i. \end{align} \tag{22.2}\]

We want to show that $\beta_1$ in Equation 22.2 equals to the ATE. To that end, we need to make some assumptions about potential outcomes and the treatment assignment.

Assumptions for causal inference

Consistency assumption: The observed outcome is $Y_i=Y_i(1)$ if the $i$th unit is treated and $Y_i=Y_i(0)$ if the $i$th unit is untreated.
No interference assumption: The $i$th unit’s potential outcomes depend on only its own treatment status, not on the treatment status of other units.
Randomization assumption: The treatment indicator $D_i$ is independent of potential outcomes $Y_i(0)$ and $Y_i(1)$, and $0<P(D_i=1)<1$.

Assumptions 1 and 2 together are referred as the Stable Unit-Treatment Value Assumption (SUTVA) (Imbens and Rubin (2015)). Assumption 3 is the ramdomization assumption. In particular, it implies that $\E(u_i|D_i)=0$ because $D_i$ is independent of the potential outcomes. The last part of Assumption 3, $0<P(D_i=1)<1$, ensures that there are both treated and untreated individuals in the sample.

Theorem 22.1 (ATE and $\beta_1$) Under Assumptions 1-3, we have $ATE=\beta_1$.

Proof (Proof of Theorem 22.1). Note that we can express $\beta_1$ as $\beta_1=\E[Y_i|D_i=1]-\E[Y_i|D_i=0]$. Then, we have \[ \begin{align*} \beta_1&=\E[Y_i|D_i=1]-\E[Y_i|D_i=0]\\ &=\E\left[Y_i(1)|D_i=1\right]-\E\left[Y_i(0)|D_i=0\right]\\ &=\E\left[Y_i(1)\right]-\E\left[Y_i(0)\right]=\E\left[Y_i(1)-Y_i(0)\right], \end{align*} \] where the second equality follows from Assumption 1 and the third equality from Assumption 3.

Theorem 22.1 suggests that we can use the OLS estimator of $\beta_1$ to estimate ATE. Since $D_i$ is binary, we can express the OLS estimator $\hat{\beta}_1$ as \[ \begin{align} \hat{\beta}_1=\bar{Y}^{treated}-\bar{Y}^{control}=\frac{1}{n_1}\sum_{i:D_i=1}Y_i-\frac{1}{n_0}\sum_{i:D_i=0}Y_i, \end{align} \tag{22.3}\] where $n_1$ and $n_0$ are the sizes of treatment and control groups, respectively.

Definition 22.4 (Difference-in-means estimator) The OLS estimator $\hat{\beta}_1$ in Equation 22.3 is called the difference-in-means estimator or differences estimator.

If the Assumption 3 fails, i.e., if treatment assignment and the potential outcomes are dependent, then the OLS estimator suffers from selection bias. The difference between $\beta_1=\E[Y_i|D_i=1]-\E[Y_i|D_i=0]$ and $\text{ATE}=\E[Y_i(1)-Y_i(0)]$ is called selection bias.

In estimating the ATE using Equation 22.2, do we need to consider any other variables, such as pre-treatment characteristics or control variables? If there are variables that capture the pre-treatment characteristics of units, or if control variables are available, we can include them in Equation 22.2 to increase the efficiency of the OLS estimator: \[ Y_i=\beta_0+\beta_1D_i+\beta_2W_{1i}+\dots+\beta_{r+1}W_{ri}+u_i, \] where $W_{1i},\dots,W_{ri}$ are pre-treatment characteristics or control variables. Note that since assignment to treatment is randomized, the conditional mean independence assumption holds with respect to $D_i$: \[ \E(u_i|D_i,W_{1i},\dots,W_{ri})=\E(u_i|W_{1i},\dots,W_{ri}). \]

Thus, the OLS estimator of $\beta_1$ is an unbiased estimator of the ATE. However, recall that the coefficients on the control variables do not have a causal interpretation unless $\E(u_i|W_{1i},\dots,W_{ri})=0$.

When the assignment of individuals to treatment and control groups is influenced by one or more observable variables $W$, we refer to this as randomization based on covariates. In this case, the OLS estimator of $\beta_1$ based on Equation 22.2 suffers from omitted variable bias because $D$ is correlated with the omitted variable $W$. The omitted variable bias can be avoided by using $W$ as a control variable in the model.

22.3 Threats to validity of experiments

As in the case of multiple linear regression model, threats to internal validity of an experiment can induce correlation between the treatment indicator and the error term in Equation 22.2. These threats include:

Failure to randomize,
Failure to follow the treatment protocol,
Attrition,
Experimental effects, and
Small sample sizes.

Under the first threat, Assumption 3 fails. For example, consider the job training program in Example 22.2. If we assign disproportionately high-ability individuals to the treatment group, we induce correlation between the treatment indicator $D$ and the potential outcomes and, therefore, between $D$ and the error term $u$. This correlation arises because ability (which is part of the error term) influences both treatment assignment and the outcome variable (wage). In this case, since the zero-conditional mean assumption fails, the OLS estimator of $\beta_1$ will be biased and inconsistent.

If we have pre-treatment covariates $W_1,\dots,W_r$ on the experimental units, we can test whether Assumption 3 holds. The idea of this testing approach is that if the treatment is randomly assigned, then the treatment indicator $D$ should be uncorrelated with the pre-treatment covariates. Therefore, we can use the $F$-statistic to test the joint null hypothesis that the coefficients on the pre-treatment covariates in the regression of $D$ on $W_1,\dots,W_r$ are zero. If the treatment is randomly assigned, then the $F$-statistic should fail to reject this joint null hypothesis.

In an experiment, subjects may not comply with the treatment protocol. For example, in our job training program in Example 22.2, some subjects in the treatment group may not participate in the training, or some subjects from the control group may participate in the training sessions. In either case, the unobserved factors (e.g., ability) that are part of the error term will be correlated with $D$, since these unobserved factors influence whether subjects comply with the treatment protocol.

Attrition occurs when subjects leave either the treatment group or the control group during an experiment. For example, in our job training program in Example 22.2, the most able candidates in the treatment group may leave the program early if they are able to find a job before completing the training. This can create systematic differences between the treatment and control groups, thereby inducing a correlation between $D$ and $u$.

The experimental effect, also known as the Hawthorne effect, arises when subjects in an experiment change their behavior simply because they are aware of being part of the experiment.

The final threat to internal validity is a small sample size. This threat does not violate Assumption 3, but it can affect the standard error of the OLS estimator of $\beta_1$. Recall that the asymptotic variance of the OLS estimator is inversely proportional to the sample size. Thus, the estimated standard errorof the OLS estimator can be large when the sample size is small. Therefore, we may fail to reject the null hypothesis of no treatment effect, even if the treatment has an effect on the outcome variable.

As in the case of the multiple linear regression model, threats to the external validity of an experiment make it difficult to generalize its results to other populations and settings.

The population and setting studied should be sufficiently similar to the population and setting of interest. For example, the results of an experiment on the effect of class size on learning in elementary schools may not generalize to college students because the populations and settings are different.
If the program or policy studied in an experiment differs in some aspects from the program or policy of interest, its results may not be generalizable.
Large-scale experiments can create general equilibrium effects that alter the broader context, and thus producing results that are different from those of smaller-scale experiments. Deaton and Cartwright (2018) give the following example: “Suppose an RCT demonstrates that in the study population a new way of using fertilizer had a substantial positive effect on, say, cocoa yields, so that farmers who used the new methods saw increases in production and in incomes compared to those in the control group. If the procedure is scaled up to the whole country, or to all cocoa farmers worldwide, the price will drop, and if the demand for cocoa is price inelastic - as is usually thought to be the case, at least in the short run - cocoa farmers’ incomes will fall. Indeed, the conventional wisdom for many crops is that farmers do best when the harvest is small, not large. In this case, the scaled-up effect is opposite in sign to the trial effect.”

22.4 Experimental estimates of the effect of class size reductions

We use the dataset from Tennessee class size reduction experiment, known as Project STAR (Student-Teacher Achievement Ratio), contained in the STAR.csv file. This experiment was a 4-year experiment designed to evaluate the impact of class sizes on learning for kindergarten through third grade. The experiment cost $12 million to conduct. Upon entering the school system, students were randomly assigned to one of three groups:

Regular class (22-25 students),
Regular class with a teacher’s aide, or
Small class (13-17 students).

The interventions began when students entered kindergarten and continued through third grade. Teachers are also randomly assigned to classes. Students in regular classes were re-randomized after the first year to either a regular class or a regular class with a teacher’s aide. Students initially assigned to small classes remained in small classes throughout the experiment. Each year, students in the experiment were given standardized tests (the Stanford Achievement Test) in reading and math.

# Load the data
STAR <- read.table("data/STAR.csv",
    header = TRUE,
    sep = ",",
    stringsAsFactors = TRUE
)
# Column names
colnames(STAR)

 [1] "gender"      "ethnicity"   "birth"       "stark"       "star1"      
 [6] "star2"       "star3"       "readk"       "read1"       "read2"      
[11] "read3"       "mathk"       "math1"       "math2"       "math3"      
[16] "lunchk"      "lunch1"      "lunch2"      "lunch3"      "schoolk"    
[21] "school1"     "school2"     "school3"     "degreek"     "degree1"    
[26] "degree2"     "degree3"     "ladderk"     "ladder1"     "ladder2"    
[31] "ladder3"     "experiencek" "experience1" "experience2" "experience3"
[36] "tethnicityk" "tethnicity1" "tethnicity2" "tethnicity3" "systemk"    
[41] "system1"     "system2"     "system3"     "schoolidk"   "schoolid1"  
[46] "schoolid2"   "schoolid3"

Our goal is to estimate the effect of class type on test scores. Some of variables that we use in our empirical models are described in Table 22.1.

Table 22.1: The STAR dataset

Variable	Description
`gender`	Factor indicating the student’s gender.
`ethnicity`	Factor indicating the student’s ethnicity, with levels `cauc` (Caucasian), `afam` (African American), `asian` (Asian), `hispanic` (Hispanic), `amindian` (American Indian), or `other`.
`stark`	Factor indicating the STAR class type in kindergarten: regular, small, or regular-with-aide.
`star1`	Factor indicating the STAR class type in 1st grade: regular, small, or regular-with-aide.
`star2`	Factor indicating the STAR class type in 2nd grade: regular, small, or regular-with-aide.
`star3`	Factor indicating the STAR class type in 3rd grade: regular, small, or regular-with-aide.
`readk`, `read1`, `read2`, `read3`	Total reading scaled scores in kindergarten, 1st grade, 2nd grade, and 3rd grade, respectively.
`mathk`, `math1`, `math2`, `math3`	Total math scaled scores in kindergarten, 1st grade, 2nd grade, and 3rd grade, respectively.
`lunchk`	Factor indicating whether the student qualified for free lunch in kindergarten.
`experiencek`	Years of the teacher’s total teaching experience in kindergarten.
`schoolidk`, `schoolid1`, `schoolid2`, `schoolid3`	Factors indicating the school ID in kindergarten, 1st grade, 2nd grade, and 3rd grade, respectively.

In this experiment, there are two treatment indicators: SmallClass and RegAid. The first indicator equals to 1 if the student is in a small class, and 0 otherwise. The second indicator equals to 1 if the student is in a regular class with teacher’s aid, and 0 otherwise. The control group consists of students in regular classes, denoted by Regular, which takes the value 1 if the student is in a regular class and 0 otherwise. In our empirical model, we use Regular as the reference group. Thus, our estimation equation takes the following form: \[ \begin{align} Y_i=\beta_0+\beta_1{\tt SmallClass}_i+\beta_2{\tt RegAid}_i+u_i, \end{align} \tag{22.4}\] where

$Y_i$ is the test score of the $i$th student,
${\tt SmallClass}_i$ equals to $1$ if the $i$th student is in a small class, and $0$ otherwise,
${\tt RegAid}_i$ equals to $1$ if the $i$th student is in a regular class with teacher’s aid, and $0$ otherwise,
$\beta_1$ represents the effect of being in a small class on the test score, relative to being in a regular class,
$\beta_2$ gives the effect of being in a regular class with teacher’s aid on the test score, relative to being in a regular class.

We estimate the model by using observations from kindergarten through third grade. We expect that the error terms to be correlated for students in the same school because these student may share similar unobserved characteristics. To account for this correlation in $u_i$, we use standard errors clustered at the school level. This is achieved by using vcovCL(rk, type = "HC1", cluster = ~ schoolidk in the following code chunk.

# Model K: Kindergarten
rk <- lm(I(readk + mathk) ~ stark,
    data = STAR
)
rk_vcov <- vcovCL(rk,
    type = "HC1",
    cluster = ~schoolidk
)
rk_se <- sqrt(diag(rk_vcov))

# Model 1: First Grader
r1 <- lm(I(read1 + math1) ~ star1,
    data = STAR
)
r1_vcov <- vcovCL(r1,
    type = "HC1",
    cluster = ~schoolid1
)
r1_se <- sqrt(diag(r1_vcov))

# Model 2: Second Grader
r2 <- lm(I(read2 + math2) ~ star2,
    data = STAR
)
r2_vcov <- vcovCL(r2,
    type = "HC1",
    cluster = ~schoolid2
)
r2_se <- sqrt(diag(r2_vcov))

# Model 3: Third Grader
r3 <- lm(I(read3 + math3) ~ star3,
    data = STAR
)
r3_vcov <- vcovCL(r3,
    type = "HC1",
    cluster = ~schoolid3
)
r3_se <- sqrt(diag(r3_vcov))

The estimation results are presented in Table 22.2. For kindergarten students, those in a small class have, on average, 13.9 more points on the test compared to those in a regular class. The estimated effect of being in a regular class with a teacher’s aide is 0.31 points and is statistically insignificant. The average estimated effect of being in a small class, relative to being in a regular class, is 29.8 points for first grade, 19.4 points for second grade, and 15.6 points for third grade. The estimated effect of being in a regular class with a teacher’s aide, relative to being in a regular class, is statistically insignificant for each grade, except for first grade. Overall, these results suggest that reducing class size has a positive effect on test scores, while adding teacher’s aides to regular classes has no effect on test scores, except for first grade.

# Print results in table form
modelsummary(
    models = list(
        "(K)" = rk,
        "(1)" = r1,
        "(2)" = r2,
        "(3)" = r3
    ),
    vcov = list(rk_se, r1_se, r2_se, r3_se),
    fmt = 2,
    stars = TRUE,
    gof_omit = "AIC|BIC|Log|F"
)

Table 22.2: Project STAR: Estimation Results

	(K)	(1)	(2)	(3)
(Intercept)	918.04***	1039.39***	1157.81***	1228.51***
	(4.82)	(5.82)	(5.29)	(4.66)
starkregular+aide	0.31
	(3.77)
starksmall	13.90***
	(4.23)
star1regular+aide		11.96***
		(4.86)
star1small		29.78***
		(4.79)
star2regular+aide			3.48
			(4.91)
star2small			19.39***
			(5.12)
star3regular+aide				-0.29
				(4.04)
star3small				15.59***
				(4.21)
Num.Obs.	5786	6379	6049	5967
R2	0.007	0.017	0.009	0.010
R2 Adj.	0.007	0.017	0.009	0.010
RMSE	73.47	90.48	83.67	72.89
Std.Errors	Custom	Custom	Custom	Custom
+ p < 0.1, * p < 0.05, p < 0.01, * p < 0.001

Next, we extend the model in Equation 22.4 with some covariates because of two main reasons. First, by using additional covariates, we can increase the efficiency of the OLS estimator. Second, assignment to class types may be influenced by observable characteristics of students and teachers. In such cases, the OLS estimator based on a model that includes these covariates may yield different estimates of the treatment effects. For these two reasons, we consider the following covariates in our empirical model:

experience: Teacher’s years of experience
boy: Student is a boy (a dummy variable)
lunch: Free lunch eligibility (a dummy variable)
black: Student is African-American (a dummy variable)
race: Student’s race is other than black or white (a dummy variable)
schoolid: School indicator variables (school fixed effects)

In the following, for kindergarten students, we generate these covariates and relevel the lunch variable to set “non-free” as the reference category.

# Create covariates
STAR$black <- (STAR$ethnicity == "afam")
STAR$race <- !(STAR$ethnicity == "afam" | STAR$ethnicity == "cauc")
STAR$boy <- (STAR$gender == "male") # equals 1 if male, 0 otherwise

# Relevel lunch variable to set "non-free" as the reference category
STAR$lunchk <- relevel(STAR$lunchk, ref = "non-free")

We consider the following four models, where the first model is the baseline model without any covariates, and the second, third, and fourth models include additional covariates. \[ \begin{align} &1.\, Y_i=\beta_0+\beta_1{\tt SmallClass}_i+\beta_2{\tt RegAid}_i+u_i,\\ &2.\, Y_i=\beta_0+\beta_1{\tt SmallClass}_i+\beta_2{\tt RegAid}_i+\beta_3{\tt experience}_i+u_i,\\ &3.\, Y_i=\beta_0+\beta_1{\tt SmallClass}_i+\beta_2{\tt RegAid}_i+\beta_3{\tt experience}_i+{\tt schoolid}+u_i,\\ &4.\, Y_i=\beta_0+\beta_1{\tt SmallClass}_i+\beta_2{\tt RegAid}_i+\beta_3{\tt experience}_i+\beta_4{\tt boy}\nonumber\\ &\quad\quad+\beta_5{\tt lunch}+\beta_6{\tt black}+\beta_7{\tt race}+{\tt schoolid}+u_i. \end{align} \]

As in Stock and Watson (2020), we estimate these models using the sample data only from the kindergarten.

# Model 9
rk <- lm(I(readk + mathk) ~ stark,
    data = STAR
)
rk_vcov <- vcovCL(rk,
    type = "HC1",
    cluster = ~schoolidk
)
rk_se <- sqrt(diag(rk_vcov))

# Model 10: Controls: Teacher Experience
rk1 <- lm(I(mathk + readk) ~ stark + experiencek,
    data = STAR
)
rk1_vcov <- vcovCL(rk1,
    type = "HC1",
    cluster = ~schoolidk
)
rk1_se <- sqrt(diag(rk1_vcov))

# Model 11: Controls: Teacher Experience + School Indicator
rk2 <- lm(I(mathk + readk) ~ stark + experiencek + factor(schoolidk),
    data = STAR
)
rk2_vcov <- vcovCL(rk2,
    type = "HC1",
    cluster = ~schoolidk
)
rk2_se <- sqrt(diag(rk2_vcov))

# Model 12: Controls: Teacher Experience + School Indicator + Gender + Free Lunch + Black + Race
rk3 <- lm(
    I(mathk + readk) ~ stark + experiencek +
        factor(schoolidk) + boy + lunchk + black + race,
    data = STAR
)
rk3_vcov <- vcovCL(rk3,
    type = "HC1",
    cluster = ~schoolidk
)
rk3_se <- sqrt(diag(rk_vcov))

The estimation results are presented in Table 22.3. The results in this table show that adding these additional covariates and school fixed effects to the model does not lead to substantially different estimates of the treatment effects. However, note that the reported standard errors are relatively smaller than those reported in Table 22.2, suggesting an efficiency gain from including these covariates.

modelsummary(
    models = list(
        "(9)" = rk,
        "(10)" = rk1,
        "(11)" = rk2,
        "(12)" = rk3
    ),
    vcov = list(rk_se, rk1_se, rk2_se, rk3_se),
    fmt = 2,
    coef_omit = "schoolidk",
    stars = TRUE,
    gof_omit = "AIC|BIC|Log|F"
)

Table 22.3: Project STAR: Estimation Results with Covariates

	(9)	(10)	(11)	(12)
(Intercept)	918.04***	904.72***	925.67***	946.18***
	(4.82)	(6.20)	(3.60)	(4.82)
starkregular+aide	0.31	-0.60	1.22	1.79
	(3.77)	(3.84)	(3.67)	(3.77)
starksmall	13.90***	14.01***	15.93***	15.89***
	(4.23)	(4.25)	(4.11)	(4.23)
experiencek		1.47***	0.74***	0.66***
		(0.44)	(0.36)	(0.16)
boyTRUE				-12.09***
				(1.67)
lunchkfree				-34.70***
				(2.02)
blackTRUE				-25.43***
				(3.58)
raceTRUE				-8.50
				(12.22)
Num.Obs.	5786	5766	5766	5748
R2	0.007	0.020	0.234	0.291
R2 Adj.	0.007	0.020	0.223	0.281
RMSE	73.47	73.06	64.61	62.19
Std.Errors	Custom	Custom	Custom	Custom
+ p < 0.1, * p < 0.05, p < 0.01, * p < 0.001

22.5 Quasi-Experiments

In a quasi-experiment, also known as a natural experiment, units are assigned to treatment and control groups based on changes in their circumstances. In other words, treatment and control groups are not formed through an explicit randomized controlled experiment but as a result of a policy intervention (or other changes in circumstances). Therefore, in a quasi-experiment, we assume that there is “as if” random assignment. Since there is no complete randomization, we expect some systematic differences between the treatment and control groups, which means we cannot resort to the difference-in-means estimator introduced earlier to estimate the treatment effect. In this section, we introduce the differences-in-differences (DiD) and regression discontinuity (RD) approaches for estimating the treatment effect in quasi-experimental settings.

Example 22.4 (The effect of minimum wage on employment) Card and Krueger (1994) investigate the effect of an increase in the minimum wage on average employment at fast-food restaurants in New Jersey. On April 1, 1992, New Jersey increased the minimum hourly wage from $4.25 to $5.05, making it the highest minimum wage in the country. In contrast, the minimum hourly wage in neighboring Pennsylvania remained unchanged. In this example, the treatment and control groups are naturally formed by the increase in the minimum wage:

The treatment group consists of fast-food restaurants in New Jersey.
The control group consists of fast-food restaurants in Pennsylvania.

Card and Krueger (1994) argued that fast-food restaurants in Pennsylvania can serve as a control group because Pennsylvania is a nearby state, and both states share similar seasonal employment patterns and are subject to common unobserved shocks. The authors collected data on fast-food restaurants in both states before and after the wage change, including information on employment, starting wages, prices, and other store characteristics. Figure 22.1 shows the sate minimum wage as of July 1, 2024.

Figure 22.1: Minimum wage by state (Source: Wikipedia)

22.6 Differences-in-differences

22.6.1 Differences-in-differences with two time periods

To introduce the differences-in-differences (DiD) approach involving two groups and two time periods, we introduce the following notation:

Two groups: treatment group and control group,
Two time periods: before ($t=1$) and after ($t=2$) intervention (or policy change),
$\bar{Y}^{treatment,before}$: the sample average of $Y$ for the treated units before the intervention,
$\bar{Y}^{treatment,after}$: the sample average of $Y$ for the treated units after the intervention,
$\bar{Y}^{control,before}$: the sample average of $Y$ for the control group before the intervention, and
$\bar{Y}^{control,after}$: the sample average of $Y$ for the control group after the intervention.

Definition 22.5 (The DiD estimator) The DiD estimator is given by \[ \begin{align} \hat{\beta}&=\left(\bar{Y}^{ treatment,after}-\bar{Y}^{treatment,before}\right)-\left(\bar{Y}^{control,after}-\bar{Y}^{control,before}\right)\\ &=\Delta\bar{Y}^{treatment}-\Delta\bar{Y}^{control}. \end{align} \tag{22.5}\]

We illustrate the DiD estimator $\hat{\beta}$ in Figure 22.2. Before the treatment, the sample averages are $\bar{Y}^{treatment,before}=40$ and $\bar{Y}^{control,before}=20$, and after the treatment these averages are $\bar{Y}^{treatment,after}=80$ and $\bar{Y}^{control,after}=30$. Note that the difference in group means after the treatment is $80-30=50$. Some of this difference is due to the fact that both groups have different pre-treatment means. The DiD approach removes the effect of the difference in the pre-treatment period by focusing on the change in group means. Thus, the DiD estimator gives $(80-40)-(30-20)=30$.

The DiD approach is based on the assumption that, on average, the treatment and control groups would have followed the same trend in the absence of treatment. This assumption is called the parallel trends assumption in the literature. In Figure 22.2, the trend in the average outcome for the treated group in the absence of treatment is illustrated with the dashed black line. Thus, the counterfactual average outcome for the treated group in the post-treatment period is $\bar{Y}^{\text{treatment, after}} = 50$, so that the dashed line is parallel to the steel blue line. This assumption ensures that the difference between the observed average outcome $\bar{Y}^{treatment,after}=80$ and the counterfactual average outcome $\bar{Y}^{\text{treatment, after}} = 50$ is due to the treatment.

The DiD estimator in Equation 22.5 suggests that we can alternatively compute it using the following regression model: \[ \begin{align} \Delta Y_i=\beta_0+\beta_1D_i+u_i, \end{align} \tag{22.6}\] where $\Delta Y_i$ represents the change in the post-treatment value of $Y_i$ minus the pre-treatment value for the $i$th individual, and $D_i$ is the treatment indicator. Since $D_i$ is a binary variable, we have $\beta_1=\E[\Delta Y_i|D_i=1]-\E[\Delta Y_i|D_i=0]$. Thus, the OLS estimator of $\beta_1$ in Equation 22.6 is equivalent to the DiD estimator in Equation 22.5.

The advantage of using this regression approach is that we can rely on the standard regression tools to estimate the DiD estimator and its standard error. For example, we can easily compute the heteroskedasticity-robust standard error for the OLS estimator of $\beta_1$. Also, we can extend the regression in Equation 22.6 by including some pre-treatment characteristics of individuals as control variables to increase the efficiency of the OLS estimator: \[ \begin{align} \Delta Y_i=\beta_0+\beta_1D_i+\beta_2W_{1i}+\dots+\beta_{r+1}W_{ri}+u_i, \end{align} \tag{22.7}\] where $W_{1i},\dots,W_{ri}$ are pre-treatment characteristics of individuals. Again, the DiD estimator is the OLS estimator of $\beta_1$ in Equation 22.7.

Key Concept 22.1: Key features of the DiD approach

The key features of the DiD approach are:

There are two time periods: a pre-treatment period and a post-treatment period (or a before and after period).
There are two groups: a treatment group and a control group. The treatment group receives the treatment or is subjected to the policy change (intervention) only in the after period. The control group is not subjected to the treatment in either periods.
Parallel trends assumption: In the absence of treatment, the treatment group on average would have had the same trend in $Y$ as the control group.

22.6.2 Differences-in-differences with multiple time periods

We assume that there are two periods in the DiD approach so far. It is also possible to extend the DiD approach when we observe each unit in the treatment and control group at multiple time periods. Let us assume that we observe each unit for $t=1,2,\dots,T$ periods. We also assume that there is a common treatment timing denoted by $t_0$. Then, the DiD estimator can be computed using the following regression: \[ \begin{align} Y_{it}=\beta_0+\beta_1D_i+\beta_2{\tt Period}_t+\beta_3(D_i\times{\tt Period}_t)+u_{it}, \end{align} \tag{22.8}\] where $D_i$ is the treatment indicator, ${\tt Period}_t$ is the post-treatment indicator given by \[ \begin{align} {\tt Period}_t= \begin{cases} 1\quad\text{if}\quad t\geq t_0,\\ 0\quad\text{if}\quad t< t_0. \end{cases} \end{align} \] The DiD estimator is the OLS estimator of $\beta_3$ in Equation 22.8. We can also estimate the DiD parameter $\beta_3$ through the following two-way fixed effects regression: \[ \begin{align} Y_{it}=\alpha_i+\lambda_t+\beta_3(D_i\times{\tt Period}_t)+u_{it}, \end{align} \tag{22.9}\] where $\alpha_i$ is the individual (entity) fixed effect, $\lambda_t$ is the time fixed effect, $D_i$ is the treatment indicator, and ${\tt Period}_t$ is the post-treatment indicator. The DiD estimator is given by the fixed effects estimator of $\beta_3$. Both Equation 22.8 and Equation 22.9 can also be extended by considering variables $W_{1t},\dots,W_{rt}$ on the pre-treatment individual characteristics.

22.6.3 Application: The effect of minimum wage on employment

In this section, we replicate some the results from Card and Krueger (1994). The sample data obtained from Hansen (2022) is contained in CK1994.xlsx and consists of two surveys conducted on 331 fast-food restaurants in New Jersey and 79 fast-food restaurants in eastern Pennsylvania. The first survey was conducted between 2/15/1992 and 3/4/1992 (before the minimum wage increase), and the second between 11/5/1992 and 12/31/1992 (after the wage increase). Some of variables in the data set are described in Table 22.4.

Table 22.4: The CK1994 dataset

Variable	Description
`store`	Unique store ID.
`state`	Indicator variable equal to 1 if the store is located in New Jersey and 0 if located in Pennsylvania.
`empft`	Number of full-time employees.
`emppt`	Number of part-time employees.
`nmgrs`	Number of managers and assistant managers.
`wage_st`	Starting wage (dollars per hour).
`inctime`	Number of months until the usual first raise.
`firstinc`	Usual amount of the first raise (dollars per hour).
`meals`	Free or reduced-price meal code (see below).
`open`	Opening hour.
`hoursopen`	Number of hours open per day.
`time`	Indicator variable equal to 0 for the first survey (2/15/1992–3/4/1992) and 1 for the second survey (11/5/1992–12/31/1992).

# Load the data
df <- read_dta("data/CK1994.dta")

# Column names
colnames(df)

 [1] "store"        "chain"        "co_owned"     "state"        "southj"      
 [6] "centralj"     "northj"       "pa1"          "pa2"          "shore"       
[11] "ncalls"       "empft"        "emppt"        "nmgrs"        "wage_st"     
[16] "inctime"      "firstinc"     "meals"        "open"         "hoursopen"   
[21] "pricesoda"    "pricefry"     "priceentree"  "nregisters"   "nregisters11"
[26] "time"

We start by using Equation 22.5 to estimate the effect of the change in the minimum hourly wage on employment in New Jersey. In the following code chunk, we generate sample averages for both groups before and after the change. These estimates are provided in Table 22.5. In New Jersey, the average employment was 20.43 employees per restaurant before the increase in the minimum wage and 20.90 employees per restaurant after the change. In Pennsylvania, these averages were 23.38 and 21.10, respectively. The DiD estimator in Equation 22.5 gives $(20.90-20.43)-(21.10-23.38)=2.75$. Thus, the average employment increases by 2.75 employees per restaurant after the increase in the minimum wage, a result that contrasts with conventional economic theory.

# Compute total employment
df$fte <- df$empft + df$emppt / 2 + df$nmgrs
# Drop missing fte values
df <- df[!is.na(df$fte), ]
# Group by store and count number of periods per store
df$nperiods <- ave(df$store, df$store, FUN=length)
# Keep only stores with two periods
df <- df[df$nperiods == 2, ]
# Average Employment at Fast Food Restaurants
n1 <- mean(df[(df$state == 1 & df$time == 0), ]$fte)
n2 <- mean(df[df$state == 1 & df$time == 1, ]$fte)
p1 <- mean(df[df$state == 0 & df$time == 0, ]$fte)
p2 <- mean(df[df$state == 0 & df$time == 1, ]$fte)
# Collect means in a dataframe
c1 <- c(n1, n2, n2 - n1)
c2 <- c(p1, p2, p2 - p1)
d1 <- c(p1 - n1, p2 - n2, (n2 - n1) - (p2 - p1))
tb <- data.frame("New Jersey" = c1, "Pennsylvania" = c2, "Difference" = d1)
rownames(tb) <- c("Before Increase", "After Increase", "Difference")

# Print table
kable(tb, digits = 2, col.names = c("New Jersey", "Pennsylvania", "Difference"))

Table 22.5: Average Employment at Fast Food Restaurants

	New Jersey	Pennsylvania	Difference
Before Increase	20.43	23.38	2.95
After Increase	20.90	21.10	0.20
Difference	0.47	-2.28	2.75

Next, we use Equation 22.6 to estimate the effect of the increase in the minimum wage on employment. In the following code chunk, we generate the $\Delta Y$ and $D$ variables, and then use the OLS estimator to estimate the DiD parameter $\beta_1$. The DiD estimate of $\beta_1$ is $2.75$, which is exactly the same as the one in Table 22.5.

# Estimation of (22.6)
DeltaY <- df$fte[df$time == 1] - df$fte[df$time == 0]
D <- df$state[df$time == 1]
df1 <- data.frame(DeltaY = DeltaY, X = D)
model1 <- lm(DeltaY ~ D, data = df1)

# Print table
modelsummary(model1,
    stars = TRUE,
    vcov = "HC1",
    fmt = 3,
    gof_omit = "AIC|BIC|Log.Lik|F|RMSE",
)

Table 22.6: OLS Estimates of the DiD Parameter

	(1)
(Intercept)	-2.283+
	(1.248)
D	2.750*
	(1.338)
Num.Obs.	384
R2	0.015
R2 Adj.	0.012
Std.Errors	HC1
+ p < 0.1, * p < 0.05, p < 0.01, * p < 0.001

We can also use Equation 22.8 to estimate the DiD parameter $\beta_3$. In the following code chunk, we use the OLS estimator and report clustered standard errors at the store level to account for within-store correlation. The DiD estimate of $\beta_3$ is 2.75 with a $t$-value of 2.05.

# Estimation of (22.7)
# Create treatment and period indicators
df$D <- df$state
df$Period <- df$time

# Simple OLS regression with clustering on store
model2 <- lm(fte ~ D + Period + I(D*Period), data = df)
model2_vcov <- vcovCL(model2,
                  type = "HC1",
                  cluster = ~ store)
model2_se <- sqrt(diag(model2_vcov))

modelsummary(model2,
    stars = TRUE,
    vcov = list(model2_se),
    fmt = 3,
    gof_omit = "AIC|BIC|Log.Lik|F|RMSE"
)

Table 22.7: OLS Estimates of the DiD Parameter

	(1)
(Intercept)	23.380***
	(1.382)
D	-2.949*
	(1.478)
Period	-2.283
	(1.249)
I(D * Period)	2.750
	(1.339)
Num.Obs.	768
R2	0.008
R2 Adj.	0.004
Std.Errors	Custom
+ p < 0.1, * p < 0.05, p < 0.01, * p < 0.001

Finally, we use the two-way fixed effects model in Equation 22.9 to estimate the DiD parameter $\beta_3$. The following code chunk provides the same estimate as the previous methods.

model3 <- feols(
    fte ~ I(D * Period) | store + time,
    data = df,
    panel.id = ~ store + time,
    cluster = ~store
)

modelsummary(model3,
    stars = TRUE,
    fmt = 3,
    gof_omit = "AIC|BIC|Log.Lik|F|RMSE"
)

Table 22.8: OLS Estimates of the DiD Parameter

	(1)
I(D * Period)	2.750*
	(1.338)
Num.Obs.	768
R2	0.779
R2 Adj.	0.557
R2 Within	0.015
R2 Within Adj.	0.012
+ p < 0.1, * p < 0.05, p < 0.01, * p < 0.001

22.6.4 Identification of the differences-in-differences parameter

In this section, we use the potential outcomes framework to define the DiD parameter and then state assumptions under which the DiD parameter is identified. We assume a setting with multiple time periods and a common treatment timing denoted by $t_0$, which is the time period when the treatment starts. Let $Y_{it}(0)$ and $Y_{it}(1)$ denote the untreated and treated potential outcomes for unit $i$ at period $t$. For example, $Y_{it_0}(1)$ represents the treated potential outcome for unit $i$ in period $t_0$, while $Y_{i,t_0-1}(0)$ represents the untreated potential outcome for unit $i$ in period $t_0-1$. Also, recall that $D_i$ denotes the treatment indicator, which takes the value 1 for the treated units and 0 for the units in the control group. The observed outcome $Y_{it}$ for unit $i$ at period $t$ is related to the potential outcomes as follows: \[ Y_{it}= \begin{cases} Y_{it}(0) & \text{if }\, t< t_0, \\ D_iY_{it}(1)+(1-D_i)Y_{it}(0) & \text{if}\, t \ge t_0. \end{cases} \]

That is, we observe the untreated potential outcome for both the control and treated groups in the pre-treatment periods, and the treated potential outcome for the treated group and the untreated potential outcome for the control group in the post-treatment periods.

The target parameter of interest in this DiD design is usually the average treatment effect on the treated at time $t$, $\text{ATT}(t)$, which is defined as \[ \text{ATT}(t)=\E[Y_{it}(1)-Y_{it}(0)|D_i=1]. \]

The $\text{ATT}(t)$ represents the average difference between treated and untreated potential outcomes for the treatment group in period $t$. To show that $\text{ATT}(t)$ is identified, we need the following assumptions.

Assumptions for ATT

Parallel trends assumption: The potential outcomes satisfy \[ \E\left[Y_{it}(0)-Y_{i,t_0-1}(0)|D_i=1\right]=\E\left[Y_{it}(0)-Y_{i,t_0-1}(0)|D_i=0\right] \] for $t\ge t_0$.
No anticipation assumption: The potential outcomes satisfy \[ \E\left[Y_{it}(0)|D_i=1\right]=\E\left[Y_{it}(1)|D_i=1\right] \] for $t<t_0$.

The parallel trends assumption requires the average change in the untreated potential outcomes to be the same for both treatment and control groups in the post-treatment periods. The no anticipation assumption states that the untreated and treated potential outcomes of the treated units in the pre-treatment periods are on average equal. That is, receiving treatment in the post-treatment periods does not affect the average potential outcomes in the the pre-treatment periods on average. Hence, under the no anticipation assumption, $\text{ATT}(t)=0$ for $t<t_0$.

Theorem 22.2 (Identification of the ATT parameter) Under the parallel trends and no anticipation assumptions, $\text{ATT}(t)$ for $t\ge t_0$ is identified as \[ \text{ATT}(t)=\E[\Delta Y_{it}|D_i=1]-\E[\Delta Y_{it}|D_i=0], \] where $\Delta Y_{it}=Y_{it}-Y_{i,t_0-1}$ is the change in the observed outcome for unit $i$ from period $t_0-1$ to period $t$.

Proof (Proof of Theorem 22.2). Under the two assumptions above, for $t\ge t_0$, we have \[ \begin{align} \text{ATT}(t)&=\E[Y_{it}(1)-Y_{it}(0)|D_i=1]\\ &=\E[Y_{it}(1)-Y_{i,t_0-1}(0)|D_i=1]-\E[Y_{it}(0)-Y_{i,t_0-1}(0)|D_i=1]\\ &=\E[Y_{it}(1)-Y_{i,t_0-1}(0)|D_i=1]-\E[Y_{it}(0)-Y_{i,t_0-1}(0)|D_i=0]\\ &=\E[Y_{it}-Y_{i,t_0-1}|D_i=1]-\E[Y_{it}-Y_{i,t_0-1}|D_i=0]\\ &=\E[\Delta Y_{it}|D_i=1]-\E[\Delta Y_{it}|D_i=0], \end{align} \] where the second equality follows by adding and substracting $\E[Y_{i,t_0-1}(0)|D_i=1]$, the third equality from the parallel trends assumption, and the fourth equality holds because $Y_{it}(1)$ and $Y_{i,t_0-1}(0)$ are equal to the observed outcomes for the treatment group under no anticipation, while $Y_{it}(0)$ and $Y_{i,t_0-1}(0)$ are equal to the observed outcomes for the control group.

The identification result suggests that the ATT($t$) for $t\ge t_0$ can be estimated by the DiD estimator in the 2-period case. To this end, the DiD estimator in Equation 22.5 (or Equation 22.6 or Equation 22.7) can be used by considering $t_0-1$ as the first period and $t$ as the second period.

Alternatively, we can consider the following two-way fixed effects specification for estimating the ATT parameters: \[ Y_{it} = \alpha_i + \lambda_t + \sum_{s=1}^{t_0-2}\beta_s (D_i\times P_s) + \sum_{s=t_0}^{T}\beta_s (D_i\times P_s) + u_{it} \tag{22.10}\] where $P_s=1$ if $s=t$ and $P_s=0$ otherwise. The estimates of $\beta_s$ for $s\ge t_0$ correspond to the estimates of $\text{ATT}(t)$ for $t\ge t_0$, while the estimates of $\beta_s$ for $s<t_0$ correspond to the estimates of $\text{ATT}(t)$ for $t<t_0$.

This approach allows us using $t$-statistics and confidence intervals based on clustered standard errors. We can then plot ATT estimates along (point-wise) confidence intervals to summarize our results. This is often referred to as an event study.

Definition 22.6 (Event Study) An event study is a method used to estimate $\text{ATT}(t)$ across a range of pre- and post-treatment periods and to visualize the resulting estimates together with their confidence intervals.

The event study can be implemented by estimating the two-way fixed effects regression in Equation 22.10 and then plotting the estimated coefficients $\beta_s$ along with their confidence intervals. Note that the event study approach allows us to examine the dynamic effects of treatment over time. For example, we can investigate whether the treatment effect increases, decreases, or remains constant in the post-treatment periods.

Under the no anticipation assumption, we expect the estimates of $\beta_s$ for $s<t_0$ to be close to zero. In the literature, the estimates of $\beta_s$ for $s<t_0$ are often referred to as differential trends or pre-trends. From the proof of Theorem 22.2, we can see that \[ \begin{align} \beta_s&=\E[Y_{is}(1)-Y_{i,t_0-1}(0)|D_i=1]-\E[Y_{is}(0)-Y_{i,t_0-1}(0)|D_i=0]\\ &=\E[Y_{is}(0)-Y_{i,t_0-1}(0)|D_i=1]-\E[Y_{is}(0)-Y_{i,t_0-1}(0)|D_i=0], \end{align} \] where the second equality follows from the no anticipation assumption. Thus, the estimates of $\beta_s$ for $s<t_0$ can be interpreted as the difference in the average change in untreated potential outcomes between the treatment and control groups in the pre-treatment periods. This is why these estimates are often referred to as differential trends or pre-trends. This interpretation also motivates the use of pre-treatment estimates of $\beta_s$ for $s < t_0$ to assess the plausibility of the parallel trends assumption. If the pre-treatment estimates of $\beta_s$ for $s<t_0$ are statistically insignificant, this may provide some evidence in favor of the parallel trends assumption. However, recall that the parallel trends assumption, $\E\left[Y_{it}(0)-Y_{i,t_0-1}(0)|D_i=1\right]=\E\left[Y_{it}(0)-Y_{i,t_0-1}(0)|D_i=0\right]$ for $t\ge t_0$, is about the counterfactual trend in the post-treatment periods. Therefore, in essence, it is not possible to directly test the parallel trends assumption. Nevertheless, if the pre-treatment estimates of $\beta_s$ for $s < t_0$ are statistically insignificant, this may suggest that the parallel trends assumption is more plausible than if they are statistically significant. For further discussion on the event study approach and the assessment of the parallel trends assumption, see Baker et al. (2026).

22.6.5 Application: Impact of policing on crime

We revisit Tella and Schargrodsky (2004) to study the causal effect of police presence on crime. The dataset is contained in the DS2004.dta file, which was obtained from Hansen (2022). The intervention studied in Tella and Schargrodsky (2004) involves the assignment of more police protection to certain blocks in Buenos Aires following the terrorist attack on main Jewish center in July 1994. The data consist of monthly observations on 876 blocks of which 37 blocks are provided with extra police protection and cover the time period from April to December of 1994, excluding July. Therefore, April to June constitute the pre-treatment months, while August to December represent the post-treatment period. In Table 22.9, we describe some of the variables in this dataset.

Table 22.9: The DS2004 dataset

Variable	Description
`block`	Block ID
`sameblock`	Jewish institution on block (treatment indicator)
`distance`	Number of blocks to closest Jewish institution
`public`	Public building on block
`gasstation`	Gas station on block
`bank`	Bank on block
`thefts`	Average number of motor vehicle thefts
`month`	Month (numeric)

The outcome variable is thefts which is the average monthly number of motor vehicle thefts. Below, we load the data and create the D and Period variables, where D is the treatment indicator and Period is the post-treatment indicator. We present the summary statistics of the data in Table 22.10. According to this table, 4.2% of the blocks are treated, and the average number of car theft is around 0.093.

# Load the data
df <- read_dta("data/DS2004.dta")

# Create treatment and period indicators
df$D <- df$sameblock
df$Period <- as.integer(df$month > 7)

# Summary statistics
var_label(df$D) <- "D"
var_label(df$Period) <- "Period"
var_label(df$thefts) <- "Thefts"

datasummary_skim(
    df[, c("D", "Period", "thefts")],
    fmt = 3
)

Table 22.10: Summary statistics

	Unique	Mean	SD	Median	Max
D	2	0.042	0.201	0.000	1.000
Period	2	0.556	0.497	1.000	1.000
Thefts	10	0.093	0.242	0.000	2.500

In Figure 22.3, we plot the average number of car thefts for treated and control blocks over time. The vertical dashed line indicates the common treatment timing (August). In the pre-treatment period, the average number of car thefts is broadly similar between treated and control blocks, except in June, where the treated blocks have a higher average number of car thefts. In the post-treatment period, there is a clear divergence in the average number of car thefts between treated and control blocks, with the treated blocks having a lower average number of car thefts compared to the control blocks. This may suggest that the police presence has a deterrent effect on car thefts.

# Average number of car thefts across months
df_grouped <- df %>%
    group_by(month, D) %>%
    summarise(thefts = mean(thefts), .groups = "drop")

ggplot(df_grouped, aes(x = month, y = thefts, color = factor(D))) +
    geom_line(linewidth = 1) +
    geom_point(size = 2) +
    geom_vline(xintercept = 8, linetype = "dashed", color = "black") +
    labs(
        x = "Month",
        y = "Average number of car thefts",
        color = NULL
    ) +
    scale_color_discrete(labels = c("D=0", "D=1")) +
    theme(
        legend.position = "right",
        legend.key = element_blank()
    )

Figure 22.3: Average number of car thefts across months

We first consider estimating the average treatment effect in the post-treatment periods. To this end, we use Equation 22.9 as if we have only 2 periods, pre- and post-treament. We can use the feols function from fixest package to estimate the model. We use the clustered standard errors at the block level to account for correlation in the error terms within blocks.

# Remove data for July (treatment month)
df_est <- df[df$month != 7, ]
# Estimation of (22.9) using fixest
model4_fixest <- feols(
    thefts ~ I(D * Period) | block + month,
    data = df_est,
    panel.id = ~ block + month,
    cluster = ~block
)

# Print results
modelsummary(model4_fixest,
    stars = TRUE,
    fmt = 3,
    gof_omit = "AIC|BIC|Log.Lik|F|RMSE"
)

Table 22.11: The effect of police presence on vehicle thefts

	(1)
I(D * Period)	-0.087**
	(0.030)
Num.Obs.	7008
R2	0.212
R2 Adj.	0.098
R2 Within	0.001
R2 Within Adj.	0.001
+ p < 0.1, * p < 0.05, p < 0.01, * p < 0.001

The estimation results are presented in Table 22.11. The estimate of the average treatment effect (on treated blocks) in the post-periods is -0.087, i.e., police presence, on average, reduced car thefts by 0.087 in the treated blocks. This estimate is almost the same as the one reported in the first column of Table 3 in Tella and Schargrodsky (2004). Given that the average number of car theft is around 0.093, this estimate is not negligible.

Next, we will consider an event study analysis and estimate Equation 22.10. First, we need to generate the interactions of the treatment dummy with pre and post periods.

df_est <- df[df$month != 7, ]
# Create lead and lag dummies
df_est$lead4 <- as.integer(df_est$D == 1 & df_est$month == 4) # April
df_est$lead3 <- as.integer(df_est$D == 1 & df_est$month == 5) # May
df_est$lead2 <- as.integer(df_est$D == 1 & df_est$month == 6) # June
# df_est$lead1 <- as.integer(df_est$D == 1 & df_est$month == 7)   # July
df_est$lead0 <- as.integer(df_est$D == 1 & df_est$month == 8) # August
df_est$lag1 <- as.integer(df_est$D == 1 & df_est$month == 9) # September
df_est$lag2 <- as.integer(df_est$D == 1 & df_est$month == 10) # October
df_est$lag3 <- as.integer(df_est$D == 1 & df_est$month == 11) # November
df_est$lag4 <- as.integer(df_est$D == 1 & df_est$month == 12) # December

The common treatment timing $t_0$ is August and the $t_0-1$ period is June. Thus, in the event study analysis, June will be the baseline period (reference period) and $\beta_{\text{june}}=0$. We use the fixed effects estimator to estimate the event study regression in Equation 22.10.

# Estimation of (22.10) using fixest
model5_fixest <- feols(
    thefts ~ lead4 + lead3 + lead0 + lag1 + lag2 + lag3 + lag4 |
        block + month,
    data = df_est,
    panel.id = ~ block + month,
    cluster = ~block
)

The estimation results are presented in Table 22.12. The estimates of the lead variables (pre-treatment periods) are statistically insignificant, which provides statistical evidence for the plausibility of the parallel trends assumption.

modelsummary(model5_fixest,
    stars = TRUE,
    fmt = 3,
    gof_omit = "AIC|BIC|Log.Lik|F|RMSE"
)

Table 22.12: Event Study Estimates

	(1)
lead4	-0.041
	(0.073)
lead3	-0.064
	(0.059)
lead0	-0.116*
	(0.056)
lag1	-0.138**
	(0.048)
lag2	-0.099
	(0.062)
lag3	-0.126*
	(0.049)
lag4	-0.131**
	(0.049)
Num.Obs.	7008
R2	0.212
R2 Adj.	0.098
R2 Within	0.002
R2 Within Adj.	0.001
+ p < 0.1, * p < 0.05, p < 0.01, * p < 0.001

Alternatively, we can resort to the OLS estimator to estimate Equation 22.10. The following code chunk produces the same estimates as the fixed effects estimator. Note that we use the factor(month)+factor(block) to include month and block fixed effects, respectively.

# Fit the model with lm()
model6_lm <- lm(
    thefts ~ lead4 + lead3 + lead0 + lag1 + lag2 + lag3 + lag4
        + factor(month) + factor(block),
    data = df_est
)

To generate the event study plot, we next extract the point estimates and standard errors. We then generate (pointwise) 95% confidence intervals.

# Vector of coefficients for the leads/lags
theta <- coef(model6_lm)[c(
    "lead4", "lead3", "lead0",
    "lag1", "lag2", "lag3", "lag4"
)]

# Insert 0 at lead0 (baseline period)
theta <- c(theta[1:2], 0, theta[3:7])

# Clustered SE at block level
vcov_block <- vcovCL(model6_lm, cluster = ~ df_est$block)
se <- sqrt(diag(vcov_block))
theta_se <- se[c(
    "lead4", "lead3", "lead0",
    "lag1", "lag2", "lag3", "lag4"
)]

# Insert 0 for baseline period
theta_se <- c(theta_se[1:2], 0, theta_se[3:7])

# X-axis for plotting: -3:4
xax <- -3:4

# 95% confidence intervals
thetal <- theta - 1.96 * theta_se
thetau <- theta + 1.96 * theta_se

Finally, we plot the event study estimates and confidence intervals. The plot is shown in Figure 22.4. The treatment effect estimates for the pre-treatment periods are statistically insignificant, which provides statistical evidence for the plausibility of the parallel trends assumption. The estimates for the post-treatment periods are negative, suggesting that police presence reduces car thefts in the post-treatment periods.

# The event study plot
# Prepare a data frame for plotting
plot_df <- data.frame(
    period = xax,
    estimate = theta,
    lower = thetal,
    upper = thetau
)

# Event-study plot
ggplot(plot_df, aes(x = period, y = estimate)) +
    geom_line(color = "steelblue") +
    geom_point(color = "steelblue", size = 2) +
    geom_errorbar(aes(ymin = lower, ymax = upper), width = 0.2, color = "steelblue") +
    geom_vline(xintercept = -1, linetype = "dashed", color = "black") +
    geom_hline(yintercept = 0, linetype = "dashed", color = "black") +
    scale_x_continuous(breaks = xax) +
    ylim(-0.4, 0.4) +
    labs(
        x = "Period",
        y = "ATT estimates"
    )

Figure 22.4: The pre- and post-treatment estimates of the ATT parameter

The event study plot can also be generated using the iplot function from the fixest package, which provides a convenient way to generate event study plots directly from the fixed effects regression results. To use the iplot function, we need to create a relative period variable that indicates the time relative to the treatment period (August). For example, April corresponds to -3, May to -2, June to -1, August to 0, September to 1, and so on. We can then estimate the fixed effects regression using the feols function and use the iplot function to generate the event study plot. The i(period_iplot, D, ref = -1) term in the regression formula indicates that we want to estimate the interaction of the treatment indicator D with the relative period variable period_iplot, using -1 (June) as the reference period. The resulting plot is shown in Figure 22.5.

# Relative period used only for the fixest/iplot alternative
df_est$period_iplot <- dplyr::case_when(
    df_est$month == 4 ~ -3,
    df_est$month == 5 ~ -2,
    df_est$month == 6 ~ -1,
    df_est$month == 8 ~ 0,
    df_est$month == 9 ~ 1,
    df_est$month == 10 ~ 2,
    df_est$month == 11 ~ 3,
    df_est$month == 12 ~ 4
)

# Estimate the fixed effects regression for iplot
model5_fixest_iplot <- feols(
    thefts ~ i(period_iplot, D, ref = -1) | block + month,
    data = df_est,
    panel.id = ~ block + month,
    cluster = ~block
)

iplot(
    model5_fixest_iplot,
    xlab = "Period",
    ylab = "ATT estimates",
    main = "",
    col = "steelblue",
    pt.join = TRUE,
    pt.join.par = list(col = "steelblue", lwd = 1),
    ref.line = TRUE,
    ref.line.par = list(col = "black", lty = 2),
    zero.par = list(col = "black", lty = 2),
    ylim = c(-0.4, 0.4)
)

Figure 22.5: The pre- and post-treatment estimates of the ATT parameter using iplot

22.6.6 Differences-in-differences with covariates

We can consider including covariates in a DiD design mainly for two reasons: (i) to check for balance in covariates between treatment and control groups that are thought to influence the untreated potential outcomes, and (ii) to estimate heterogeneous treatment effects across subgroups defined by covariates (Baker et al. (2026)).

In this section, we focus on the first reason and discuss why we should check for balance in covariates between treatment and control groups in a DiD design. Checking covariate balance provides a way to assess the plausibility of the parallel trends assumption. Recall that the parallel trends assumption requires the average change in untreated potential outcomes to be the same for both treatment and control groups in the post-treatment periods. If there are covariates that are thought to influence untreated potential outcomes and these covariates are imbalanced between treatment and control groups in the pre-treatment period, this may suggest that the average change in untreated potential outcomes may differ between treatment and control groups in the post-treatment periods, thereby violating the parallel trends assumption. Therefore, checking for balance in covariates that are thought to influence untreated potential outcomes can provide some evidence for or against the plausibility of the parallel trends assumption. It is important to note that we should consider only covariates that are thought to influence untreated potential outcomes, as imbalance in covariates that do not influence untreated potential outcomes may not necessarily suggest a violation of the parallel trends assumption. In particular, covariates that are themselves affected by the treatment should not be considered when checking for balance, as imbalance in these covariates may be a consequence of the treatment rather than a cause of differences in untreated potential outcomes between treatment and control groups.

One way to check for covariate balance is to compute the normalized differences in covariates between treatment and control groups (Imbens and Rubin (2015)). The normalized difference for a covariate $X$ is defined as \[ \text{ND}=\frac{\bar{X}_1-\bar{X}_0}{\sqrt{(S_1^2+S_0^2)/2}}, \] where $\bar{X}_1$ and $\bar{X}_0$ are the sample means of $X$ for the treatment and control groups, respectively, and $S_1^2$ and $S_0^2$ are the sample variances of $X$ for the treatment and control groups, respectively. A common rule of thumb is that a normalized difference larger than 0.25 in absolute value may indicate a significant imbalance in the covariate between treatment and control groups.

In our application on the impact of policing on crime, there are three time-invariant covariates: public, gasstation, and bank. Recall that these covariates are dummy variables that indicate the presence of a public building, a gas station, and a bank on the block, respectively. If we think that these covariates influence the untreated potential outcomes because they may attract more people to the block and thus may increase the number of car thefts in the absence of police presence. Therefore, it is important to check for balance in these covariates between treated and control blocks to assess the plausibility of the parallel trends assumption. The results of the normalized differences for these covariates are provided in Table 22.13. The normalized differences for all three covariates are smaller than 0.25 in absolute value, which provides some evidence in favor of the parallel trends assumption.

# Compute means, variances, and normalized differences for covariates
covariates <- c("public", "gasstation", "bank")

balance_df <- map_dfr(covariates, function(cov) {
    mean_treated <- df %>%
        filter(D == 1) %>%
        summarise(mean = mean(.data[[cov]], na.rm = TRUE)) %>%
        pull(mean)

    mean_control <- df %>%
        filter(D == 0) %>%
        summarise(mean = mean(.data[[cov]], na.rm = TRUE)) %>%
        pull(mean)

    var_treated <- df %>%
        filter(D == 1) %>%
        summarise(var = var(.data[[cov]], na.rm = TRUE)) %>%
        pull(var)

    var_control <- df %>%
        filter(D == 0) %>%
        summarise(var = var(.data[[cov]], na.rm = TRUE)) %>%
        pull(var)

    nd <- (mean_treated - mean_control) /
        sqrt((var_treated + var_control) / 2)

    tibble(
        Covariate = cov,
        `Mean Treated` = mean_treated,
        `Mean Control` = mean_control,
        `Variance Treated` = var_treated,
        `Variance Control` = var_control,
        `Normalized Difference` = nd
    )
})

balance_df %>%
    mutate(across(where(is.numeric), ~ round(.x, 3))) %>%
    kable()

Table 22.13: Normalized differences for covariates

Covariate	Mean Treated	Mean Control	Variance Treated	Variance Control	Normalized Difference
public	0.054	0.031	0.051	0.030	0.114
gasstation	0.027	0.020	0.026	0.020	0.044
bank	0.054	0.080	0.051	0.073	-0.103

In the case of significant imbalance in covariates that are thought to influence untreated potential outcomes, we can update our unconditional parallel trends assumption to a conditional parallel trends assumption that incorporates covariates. Let $X_i$ be the vector of covariates for unit $i$. In our example on the impact of policing on crime, $X_i$ would include the public, gasstation, and bank covariates. Then, the conditional parallel trends assumption requires that the average change in untreated potential outcomes to be the same for both treatment and control groups in the post-treatment periods, conditional on covariates.

Conditional parallel trends assumption

The potential outcomes satisfy \[ \E\left[Y_{it}(0)-Y_{i,t_0-1}(0)|X_i, D_i=1\right]=\E\left[Y_{it}(0)-Y_{i,t_0-1}(0)|X_i, D_i=0\right] \] for $t\ge t_0$.

The conditional parallel trends assumption states that the average change in untreated potential outcomes is the same for treated and control units that share the same values of covariates in the post-treatment periods. In the case of our example on the impact of policing on crime, this assumption requires that, in the absence of police presence, the average change in the number of car thefts between August ($t_0$) and June ($t_0-1$) would be the same for treated and control blocks that have the same values of public, gasstation, and bank. This should also hold for the other post-treatment months, namely September, October, November, and December.

Note that the conditional parallel trends assumption explicitly requires that we have both treated and control units in the population that have the same values of covariates. If there are covariate values that are only observed in the treated group or only observed in the control group, then the conditional parallel trends assumption cannot hold for these covariate values. In this case, we can restrict our analysis to the subset of the population that has common support in covariates between treatment and control groups, i.e., the subset of the population that has covariate values that are observed in both treatment and control groups. This common support condition is also known as the overlap assumption in the potential outcomes framework.

Overlap assumption

For any value of covariates $X$, the probability of being treated is strictly between 0 and 1, i.e., $0<P(D_i=1|X_i)<1$ for all $i$.

The next theorem states that under the conditional parallel trends and overlap assumptions, the ATT($t$) for $t\ge t_0$ is identified.

Theorem 22.3 (Identification of the ATT parameter under conditional parallel trends) Under the no anticipation, parallel trends, and overlap assumptions, $\text{ATT}(t)$ for $t\ge t_0$ is identified as \[ \begin{align} \text{ATT}(t)&=\E[Y_{it}-Y_{i,t_0-1}|D_i=1]\\ &-\E\left[\E[Y_{it}-Y_{i,t_0-1}|X_i, D_i=0]|D_i=1\right]. \end{align} \]

Proof (Proof of Theorem 22.3). Under the no anticipation, parallel trends, and overlap assumptions, for $t\ge t_0$, we have \[ \begin{align} \text{ATT}(t)&=\E[Y_{it}(1)-Y_{it}(0)|D_i=1]\\ &=\E[Y_{it}(1)-Y_{i,t_0-1}(0)|D_i=1]-\E[Y_{it}(0)-Y_{i,t_0-1}(0)|D_i=1]\\ &=\E[Y_{it}(1)-Y_{i,t_0-1}(0)|D_i=1]-\E\left[\E[Y_{it}(0)-Y_{i,t_0-1}(0)|X_i, D_i=1]|D_i=1\right]\\ &=\E[Y_{it}(1)-Y_{i,t_0-1}(0)|D_i=1]-\E\left[\E[Y_{it}(0)-Y_{i,t_0-1}(0)|X_i, D_i=0]|D_i=1\right]\\ &=\E[Y_{it}-Y_{i,t_0-1}|D_i=1]-\E\left[\E[Y_{it}-Y_{i,t_0-1}|X_i, D_i=0]|D_i=1\right], \end{align} \] where the second equality follows by adding and substracting $\E[Y_{i,t_0-1}(0)|D_i=1]$, the third equality follows from the iterated law of expectation, the fourth equality from the conditional parallel trends assumption, and the fifth equality holds because $Y_{it}(1)$ and $Y_{i,t_0-1}(0)$ are equal to the observed outcomes for the treatment group under no anticipation, while $\E[Y_{it}(0)-Y_{i,t_0-1}(0)|X_i, D_i=0]$ is equal to the observed change in outcomes for the control group conditional on covariates.

Theorem 22.3 indicates that under the conditional parallel trends and overlap assumptions, the ATT($t$) for $t\ge t_0$ can be identified by the difference between two terms: (i) the average change in observed outcomes for the treated group, and (ii) the average change in observed outcomes for the control group, where the average is taken over the distribution of covariates in the treated group. Thus, in the cas of the second term, we need to first compute the average change in observed outcomes for the control group conditional on covariates, and then take the average of this quantity over the distribution of covariates in the treated group.

We can obtain the estimate of the first term by simply taking the average of the change in observed outcomes ($Y_{it}-Y_{i,t_0-1}$) for the treated group. In the case of the second term, we can formulate an estimate in two steps. In the first step, we can obtain an estimate of $\E[Y_{it}-Y_{i,t_0-1}|X_i, D_i=0]$ by running a regression of $Y_{it}-Y_{i,t_0-1}$ on covariates $X_i$ using only the control group data, i.e., $\E[Y_{it}-Y_{i,t_0-1}|X_i, D_i=0]=m_{D=0}(X_i)=X^{'}_i\beta$. In the second step, we can take the average of the fitted values $X^{'}_i\hat{\beta}$ from this regression over the covariates in the treated group to obtain an estimate of $\E\left[\E[Y_{it}-Y_{i,t_0-1}|X_i, D_i=0]|D_i=1\right]$. In the literature, this approach is often referred to as the regression-adjustment approach.

An alternative approach to estimate the second term is to use a weighting method (Abadie (2005)). This approach is based on the following equality: \[ \begin{align} &\E\left[\E[Y_{it}-Y_{i,t_0-1}|X_i, D_i=0]|D_i=1\right]=\E\left[\frac{D_i}{\E[D_i]}\E[\Delta Y_{it}|X_i,D_i=0]\right]\\ &=\E\left[\frac{\frac{(1-D_i)P(D_i=1|X_i)}{1-P(D_i=1|X_i)}}{\E\left[\frac{(1-D_i)P(D_i=1|X_i)}{1-P(D_i=1|X_i)}|\right]}\times \Delta Y_{it}\right]. \end{align} \] where $\Delta Y_{it}=Y_{it}-Y_{i,t_0-1}$ is the change in observed outcomes for unit $i$ from period $t_0-1$ to period $t$. To estimate the last term, we can first estimate the propensity score $P(D_i=1|X_i)$ by running a logistic regression of $D_i$ on covariates $X_i$ using the full sample data. We can then compute the weights $\frac{(1-D_i)P(D_i=1|X_i)}{1-P(D_i=1|X_i)}$ for each unit in the control group and take the average of $\Delta Y_{it}$ weighted by these weights to obtain an estimate of $\E\left[\E[Y_{it}-Y_{i,t_0-1}|X_i, D_i=0]|D_i=1\right]$. This approach is often referred to as the inverse probability weighting approach.

Finally, there is another approach called the doubly-robust approach that combines the regression-adjustment and inverse probability weighting approaches. In this approach, $\text{ATT}(t)$ for $t\ge t_0$ is expressed as

\[ \begin{align} \text{ATT}(t)&= \E\bigg[\left(\frac{D_i}{\E[D_i]}-\frac{\frac{(1-D_i)P(D_i=1|X_i)}{1-P(D_i=1|X_i)}}{\E\left[\frac{(1-D_i)P(D_i=1|X_i)}{1-P(D_i=1|X_i)}|\right]}\right)\\ &\times\left(\Delta Y_{it} -\E[\Delta Y_{it}|X_i, D_i=0]\right)\bigg]. \end{align} \]

The doubly robust approach is appealing because it provides a consistent estimate of $\text{ATT}(t)$ if either the regression model for $\E[Y_{it}-Y_{i,t_0-1}\mid X_i, D_i=0]$ is correctly specified or the model for the propensity score $P(D_i=1\mid X_i)$ is correctly specified. Baker et al. (2026) suggest using the doubly robust approach in practice, as it provides some protection against model misspecification.

For statistical inference, Callaway and Sant’Anna (2021) suggest using the multiplier bootstrap method to compute standard errors for these three approaches. Their method is implemented in the did package, which provides convenient tools for implementing these three approaches in DiD designs with covariates. For further details, see Callaway and Sant’Anna (2021) and Baker et al. (2026).

For our example on the impact of policing on crime, we can use the did package to implement the regression-adjustment, inverse probability weighting, and doubly-robust approaches to estimate the ATT parameters in the pre- and post-treatment periods. We first need to prepare the data for the did package. We remove the data for July and recode the months as follows: April is coded as 1, May as 2, June as 3, August as 4, September as 5, October as 6, November as 7, and December as 8. We also create a variable treat_month that indicates the treatment timing for each block. Treated blocks receive treatment in month 4 (August), while control blocks never receive treatment (coded as 0).

# Remove data for July (treatment month) and recode months as April 1, May 2, June 3, August 4, September 5, October 6, November 7, December 8
library(did)
df_did <- df %>%
    filter(month != 7) %>%
    mutate(
        time = case_when(
            month == 4 ~ 1,
            month == 5 ~ 2,
            month == 6 ~ 3,
            month == 8 ~ 4,
            month == 9 ~ 5,
            month == 10 ~ 6,
            month == 11 ~ 7,
            month == 12 ~ 8
        )
    )
# Create the gname variable for the did package, which indicates the treatment timing for each unit
df_did <- df_did %>%
    mutate(
        treat_month = case_when(
            D == 1 ~ 4, # Treated blocks receive treatment in month 4 (August)
            TRUE ~ 0 # Control blocks never receive treatment
        )
    )

In the following, we use the att_gt function from the did package to estimate the ATT parameters in the pre- and post-treatment periods using the regression-adjustment, inverse probability weighting, and doubly-robust approaches. We then use summary(doubly_robust) to summarize the estimation results for the doubly-robust approach.

# Estimate the ATT parameters using the regression-adjustment approach
reg_adj <- att_gt(
  yname = "thefts",
  tname = "time",
  idname = "block",
  gname = "treat_month",
  xformla = ~ public + gasstation + bank,
  data = df_did,
  panel = TRUE,
  bstrap = TRUE,
  base_period = "universal",  # Use June (time=3) as the base period
  est_method = "reg"
)

# Estimate the ATT parameters using the inverse probability weighting approach
ipw <- att_gt(
  yname = "thefts",
  tname = "time",
  idname = "block",
  gname = "treat_month",
  xformla = ~ public + gasstation + bank,
  data = df_did,
  panel = TRUE,
  bstrap = TRUE,
  base_period = "universal",  # Use June (time=3) as the base period
  est_method = "ipw"
)

# Estimate the ATT parameters using the doubly robust approach
doubly_robust <- att_gt(
  yname = "thefts",
  tname = "time",
  idname = "block",
  gname = "treat_month",
  xformla = ~ public + gasstation + bank,
  data = df_did,
  panel = TRUE,
  bstrap = TRUE,
  base_period = "universal",  # Use June (time=3) as the base period
  est_method = "doubly_robust"
)

# Summarize the results
summary(doubly_robust)


Call:
att_gt(yname = "thefts", tname = "time", idname = "block", gname = "treat_month", 
    xformla = ~public + gasstation + bank, data = df_did, panel = TRUE, 
    bstrap = TRUE, est_method = "doubly_robust", base_period = "universal")

Reference: Callaway, Brantly and Pedro H.C. Sant'Anna.  "Difference-in-Differences with Multiple Time Periods." Journal of Econometrics, Vol. 225, No. 2, pp. 200-230, 2021. <https://doi.org/10.1016/j.jeconom.2020.12.001>, <https://arxiv.org/abs/1803.09015> 

Group-Time Average Treatment Effects:
 Group Time ATT(g,t) Std. Error [95% Simult.  Conf. Band]  
     4    1  -0.0375     0.0770       -0.2107      0.1357  
     4    2  -0.0627     0.0580       -0.1931      0.0677  
     4    3   0.0000         NA            NA          NA  
     4    4  -0.1189     0.0588       -0.2510      0.0133  
     4    5  -0.1382     0.0473       -0.2446     -0.0318 *
     4    6  -0.0976     0.0630       -0.2392      0.0441  
     4    7  -0.1247     0.0493       -0.2356     -0.0139 *
     4    8  -0.1305     0.0502       -0.2434     -0.0176 *
---
Signif. codes: `*' confidence band does not cover 0

P-value for pre-test of parallel trends assumption:  0.53511
Control Group:  Never Treated,  Anticipation Periods:  0
Estimation Method:  doubly_robust

We next use the aggte function from the did package to aggregate the ATT estimates across pre- and post-treatment periods to obtain dynamic ATT estimates. We set biters = 25000 to use 25,000 bootstrap iterations to compute the standard errors for the dynamic ATT estimates. We then summarize the estimation results for the dynamic ATT estimates using summary(dynamicATT). The overall ATT estimate, which is the average of the dynamic ATT estimates across post-treatment periods, is -0.122, which is larger in absolute value than the estimate of -0.087 obtained from the standard DiD regression without covariates.

dynamicATT <- aggte(doubly_robust, type = "dynamic", biters = 25000)
summary(dynamicATT)


Call:
aggte(MP = doubly_robust, type = "dynamic", biters = 25000)

Reference: Callaway, Brantly and Pedro H.C. Sant'Anna.  "Difference-in-Differences with Multiple Time Periods." Journal of Econometrics, Vol. 225, No. 2, pp. 200-230, 2021. <https://doi.org/10.1016/j.jeconom.2020.12.001>, <https://arxiv.org/abs/1803.09015> 


Overall summary of ATT's based on event-study/dynamic aggregation:  
    ATT    Std. Error     [ 95%  Conf. Int.]  
 -0.122        0.0506    -0.2212     -0.0228 *


Dynamic Effects:
 Event time Estimate Std. Error [95% Simult.  Conf. Band]  
         -3  -0.0375     0.0764       -0.2095      0.1345  
         -2  -0.0627     0.0617       -0.2015      0.0761  
         -1   0.0000         NA            NA          NA  
          0  -0.1189     0.0609       -0.2560      0.0183  
          1  -0.1382     0.0488       -0.2482     -0.0283 *
          2  -0.0976     0.0644       -0.2426      0.0475  
          3  -0.1247     0.0513       -0.2402     -0.0092 *
          4  -0.1305     0.0522       -0.2480     -0.0130 *
---
Signif. codes: `*' confidence band does not cover 0

Control Group:  Never Treated,  Anticipation Periods:  0
Estimation Method:  doubly_robust

Finally, we use the ggdid function from the did package to plot the dynamic ATT estimates and confidence intervals. The resulting plot is shown in Figure 22.6. The plot shows that the ATT estimates for the pre-treatment periods are statistically insignificant, which provides some evidence for the plausibility of the parallel trends assumption. The ATT estimates for the post-treatment periods are negative, suggesting that police presence reduces car thefts in the post-treatment periods.

# Use ggdid to plot the dynamic ATT estimates with confidence intervals
ggdid(dynamicATT,
    xlab = "Event Time",
    ylab = "ATT Estimates",
    col = "steelblue",
    ylim = c(-0.4, 0.4)
)

Figure 22.6: Dynamic ATT estimates with confidence intervals

22.7 Regression discontinuity designs

Another example of quasi-experimental methods is the regression discontinuity design (RDD). In an RDD, the treatment and control groups are formed when a continuous variable $W$, known as the running variable, crosses a threshold $w_0$ (the cutoff level). The main idea of the RDD is that units with running variable values near the threshold are comparable in terms of characteristics, except for the receipt of treatment.¹

Example 22.5 (The effect of attending summer school on the next year’s GPA) Consider the hypothetical example illustrated in Figure 22.7, where $Y$ is the next year’s grade point average (GPA) and $W$ is the current year’s GPA. Let us assume that students are required to attend a mandatory summer school if their current year’s GPA is below $w_0=2$. We can then estimate the effect of attending summer school on next year’s GPA by comparing students with $W$ below $w_0=2$ to those with $W$ above $w_0=2$. In other words, the running variable defines the treatment and control groups:

Students with $W$ below $w_0=2$ constitute the treatment group.
Students with $W$ above $w_0=2$ constitute the control group.

# Hypothetical regression discontinuity design scatterplot
# Set seed for reproducibility
set.seed(45)

# 1. Generate the data
W <- runif(100, 0, 4)
y <- 0.8 * W - 0.95 * (W >= 2) + runif(100, 0, 2)
df <- data.frame(W = W, y = y)

ggplot(df, aes(x = W, y = y)) +
    geom_point(color = "steelblue", size = 2) +
    geom_vline(xintercept = 2, linetype = "dashed", color = "black", linewidth = 1.2) +
    geom_smooth(data = subset(df, W < 2), method = "lm", se = FALSE, color = "black", linewidth = 1.2) +
    geom_smooth(data = subset(df, W >= 2), method = "lm", se = FALSE, color = "black", linewidth = 1.2) +
    labs(x = "W", y = "Y")

Figure 22.7: Hypothetical regression discontinuity design scatterplot

There are two types of RDDs: (i) the sharp RDD and (ii) the fuzzy RDD. In the sharp RDD, the treatment group consists of individuals with the running variable values above (or below) the threshold $w_0$, and the control group consists of those with the running variable values below (or above) the threshold. Thus, in the sharp RDD, the cutoff clearly separates the treatment and control groups. The hypothetical example provided in Figure 22.7 is an example of the sharp RDD study. In the fuzzy RDD, the probability of treatment is a continuous function of the running variable, with a discontinuity at the cutoff.

In an RDD study, the treatment effect parameter is the average treatment effect at the threshold.

Definition 22.7 (Average treatment effect at the threshold) The average treatment effect at the threshold $w_0$, denoted by $\tau$, is defined as \[ \tau=\E[Y_i(1)-Y_i(0)|W_i=w_0]. \]

In both sharp and fuzzy designs, we first state the assumptions required to identify $\tau$, and then discuss how to estimate $\tau$ using the observed data.

22.7.1 Sharp regression discontinuity design

Let $D$ be the treatment indicator defined by $D=\mathbf{1}(W\geq w_0)$, where $\mathbf{1}(\cdot)$ is the indicator function. Thus, the $i$th unit is treated if its running value is above the treshold value, i.e., $D_i=1$ if $W_i\geq w_0$. We assume that the observed data $\{(Y_i,W_i)\}_{i=1}^n$ constitute a realization of an i.i.d sample from the distribution of $(Y,W)$.

We assume the following assumption for the identification of $\tau$ (Chernozhukov et al. (2024), Hansen (2022), Wooldridge (2025)).

Identification Assumption for the Sharp RDD

Continuity assumption: $\E[Y_i(1)|W_i=w]$ and $\E[Y_i(0)|W_i=w]$ are continuous at the threshold $w_0$.

The continuity assumption states that the conditional means of potential outcomes are continuous at $w_0$. This assumption holds if the conditional means are continuous functions of the running variable, as illustrated in Figure 22.8.

Figure 22.8: The conditional mean functions of potential outcomes

Theorem 22.4 (Identification of $\tau$ in the sharp RDD) Let $\lim_{w\downarrow w_0}$ and $\lim_{w\uparrow w_0}$ be the limits from above and from below, respectively. Then, under the continuity assumption, it follows that \[ \begin{align} \tau=\lim_{w\downarrow w_0}\E[Y_i|W_i=w]-\lim_{w\uparrow w_0}\E[Y_i|W_i=w]. \end{align} \]

Proof (Proof of Theorem 22.4). We can express $\tau$ as \[ \begin{align} \tau&=\E\left[Y_i(1)|W_i=w_0\right]-\E\left[Y_i(0)|W_i=w_0\right]\\ &=\lim_{w\downarrow w_0}\E[Y_i(1)|W_i=w]-\lim_{w\uparrow w_0}\E[Y_i(0)|W_i=w]\\ &=\lim_{w\downarrow w_0}\E[Y_i|W_i=w]-\lim_{w\uparrow w_0}\E[Y_i|W_i=w], \end{align} \] where the second equality follows from the continuity assumption and the third equality from the fact that $Y_i=Y_i(1)$ when $W_i\geq w_0$ and $Y_i=Y_i(0)$ when $W_i<w_0$. This result indicates that $\tau$ is identified because $\E[Y_i|W_i]$ is generally identified.

Let us assume that $E[Y(0)|W]=\alpha_0+\beta_0W$ and $E[Y(1)|W]=\alpha_1+\beta_1W$, where $\alpha_j$ and $\beta_j$ are unknown parameters for $j=0,1$. Also, let $u$ be an error term satisfying $E[u|W]=0$ so that $Y(0)=\alpha_0+\beta_0W+u$ and $Y(1)=\alpha_1+\beta_1W+u$. Then, we can express $\tau$ as \[ \tau=(\alpha_1+\beta_1w_0)-(\alpha_0+\beta_0w_0)=(\alpha_1-\alpha_0)+(\beta_1-\beta_0)w_0. \]

Using the above expression and $Y=DY(1)+(1-D)Y(0)$, we can relate $\tau$ to the observed data in the following way: \[ \begin{align} Y&=D(\alpha_1+\beta_1W+u)+(1-D)(\alpha_0+\beta_0W+u)\\ &=\alpha_0+\beta_0W+D(\alpha_1-\alpha_0)+(\beta_1-\beta_0)DW+u\\ &=\alpha_0+\beta_0W+D(\tau-(\beta_1-\beta_0)w_0)+(\beta_1-\beta_0)DW+u\\ &=\alpha_0+\beta_0W+\tau D+(\beta_1-\beta_0)D(W-w_0)+u,\\ \end{align} \] where we use $\tau=(\alpha_1-\alpha_0)+(\beta_1-\beta_0)w_0$ in the third equality. This result shows that we can estimate $\tau$ by regressing $Y$ on $W$, $D$, and the interaction term $D(W-w_0)$, where the coefficient on $D$ gives the estimate of $\tau$. Since $\tau$ is a local parameter, the literature suggests that we should not use all observations in the sample, but rather only those close to the threshold $w_0$. That is, given a bandwidth $h$, we should restrict the sample to observations with $W_i\in(w_0-h,w_0+h)$ for estimation. This is because the observations close to the threshold are more comparable in terms of characteristics, and thus more suitable for estimating the treatment effect at the threshold.

In the literature, there are alternative methods for choosing the bandwidth $h$ (Calonico et al. (2014), Imbens and Kalyanaraman (2012), Hansen (2022)). It is also common to weight the observations using a kernel function, which assigns relatively more weights to observations that are closer to the threshold. Let $K$ be a kernel function.² Then, we can estimate $\tau$ by the following weighted least squares estimator: \[ \hat{\tau}=e^{'}_3\argmin_{\theta\in\mathbb{R}^4}\sum_{i=1}^nK_h(W_i-w_0)(Y_i-V_i\theta)^2, \] where $K_h(W_i-w_0)=K\left((W_i-w_0)/h\right)/h$, $V_i=\left(1, W_i, D_i, D_i(W_i-w_0)\right)$, and $e_3=(0,0,1,0)^{'}$.

Among kernel-based estimators, the local linear estimator is widely used for estimating $\tau$. Calonico et al. (2014) show how to choose an optimal bandwidth and propose a bias-corrected local linear estimator with adjusted standard errors. Their method can be implemented using the rdrobust module. See https://rdpackages.github.io/rdrobust for details. This module can be installed by typing pip install rdrobust in a console.

22.7.2 Fuzzy regression discontinuity designs

In the fuzzy RDD, the probability of treatment changes continuously with the running variable, except at the threshold $w_0$, where it exhibits a discontinuous jump. Let $p(w)=P(D=1|W=w)$ be the probability of treatment (the propensity score) given the running variable $W=w$. The fuzzy RDD requires that $p(w)$ is discontinuous at the threshold, i.e., $\lim_{w\downarrow w_0}p(w)\neq \lim_{w\uparrow w_0}p(w)$.

The identification of $\tau$ in the fuzzy RDD requires the following assumptions (Hansen (2022), Wooldridge (2025)).

Identification Assumptions for the fuzzy RDD

Continuity assumption: $\E[Y_i(1)|W_i=w]$ and $\E[Y_i(0)|W_i=w]$ are continuous at the threshold $w_0$.
Discontinuity assumption: $\lim_{w\downarrow w_0}p(w)\neq \lim_{w\uparrow w_0}p(w)$.
No selection on gain assumption: $\E[Y_i(1)-Y_i(0)|D_i,W_i=w]=\E[Y_i(1)-Y_i(0)|W_i=w]$ for all $w\in(w_0-\epsilon,w_0+\epsilon)$ with some $\epsilon>0$.

The first assumption is the continuity assumption from the sharp RDD. The second assumption states that the probability of treatment is discontinuous at the threshold $w_0$. We illustrate this assumption in Figure 22.9 by assuming that the probability of treatment increases with the running variable. The figure shows that the probability of treatment jumps to 0.6 at the threshold $w_0$. This contrasts with the sharp RDD, in which the probability of treatment is 0 below the threshold and 1 above it, i.e., $p(w)=0$ if $w<w_0$ and $p(w)=1$ if $w\geq w_0$.

# Probability of treatment in the fuzzy RDD
# 1. Create the data
W_left <- seq(0, 5, length.out = 100)
W_right <- seq(5, 10, length.out = 100)

F_left <- 0.2 * sqrt(W_left)
F_right <- 0.6 + 0.05 * sqrt(W_right - 5)

df_left <- data.frame(W = W_left, F = F_left, side = "left")
df_right <- data.frame(W = W_right, F = F_right, side = "right")
df <- rbind(df_left, df_right)

# 2. Define the left and right limit points at cutoff
limit_points <- data.frame(
  W = c(5, 5),
  F = c(F_left[length(F_left)], F_right[1]),
  type = c("left", "right")
)

# 3. Plot using ggplot2
ggplot(df, aes(x = W, y = F)) +
  geom_line(color = "steelblue", linewidth = 1) +
  geom_vline(xintercept = 5, linetype = "dashed", color = "black", linewidth = 1) +
  geom_point(data = limit_points[limit_points$type=="left", ], 
             aes(x = W, y = F), color = "steelblue", fill = "white", shape = 21, size = 3, stroke = 1.2) +
  geom_point(data = limit_points[limit_points$type=="right", ], 
             aes(x = W, y = F), color = "steelblue", size = 3) +
  labs(
    x = expression(w),
    y = expression(p(w))
  )

Figure 22.9: The probability of treatment in the fuzzy RDD

The third assumption is also known as the local unconfoundedness of treatment gains. It states that the average treatment effect at $w\in(w_0-\epsilon,w_0+\epsilon)$ is the same for treated and untreated units.

Theorem 22.5 (Identification of $\tau$ in the fuzzy RDD) Under Assumptions 1-3, $\tau$ is identified as \[ \begin{align} \tau=\frac{\lim_{w\downarrow w_0}\E[Y_i|W_i=w]-\lim_{w\uparrow w_0}\E[Y_i|W_i=w]}{\lim_{w\downarrow w_0}p(w)-\lim_{w\uparrow w_0}p(w)}. \end{align} \]

Proof (Proof of Theorem 22.5). See Hansen (2022) or Wooldridge (2025) for the proof.

Theorem 22.5 indicates that $\tau$ is given by the ratio of the jump in the conditional mean of the outcome variable at the threshold to the jump in the probability of treatment at the same threshold. The identification result indicates that we can resort to the local linear estimator to estimate both the numerator and the denominator. See Calonico et al. (2014) and Hansen (2022) for details. The rdrobust module can also handle the estimation of $\tau$ in the fuzzy RDD.

22.7.3 Application: The effect of an anti-poverty program on child mortality

Ludwig and Miller (2007) use a sharp RDD to study the effect of a federal grant-writing assistence for the federal anti-poverty program called Head Start on child mortality rates in the United States. The Head Start program, founded in 1965, provides funds to local municipalities for preschool, health, and social services for poor children aged three to five and their families.³ The poverty rate measured in 1960, with a cutoff of 59.1984, is used to determine whether a county is eligible for federal grant-writing assistance. This cutoff results in 300 counties being eligible to receive such assistance. The poverty rate measured in 1960, with a cutoff of 59.1984, is used as the running variable to form the treatment and control groups. The outcome variable $Y$ is the county mortality rate during the 1973-1983 period. The dataset is contained in the LM2007.dta file, which is obtained from Hansen (2022). Some of the variables included in the dataset are listed below:

Table 22.14: Some variables in the LM2007 dataset

Variable name	Description
`povrate60`	County poverty rate in 1960 (percent)
`mort_age59_related_postHS`	Mortality, ages 5–9, Head Start–related causes, 1973–1983
`mort_age59_injury_postHS`	Mortality, ages 5–9, injuries, 1973–1983
`mort_age59_all_postHS`	Mortality, ages 5–9, all causes, 1973–1983
`mort_age25plus_related_postHS`	Mortality, ages 25+, Head Start–related causes, 1973–1983
`mort_age25plus_injuries_postHS`	Mortality, ages 25+, injuries, 1973–1983
`mort_age59_related_preHS`	Mortality, ages 5–9, Head Start–related causes, 1959–1964
`census1960_pop`	Census 1960 county population

The outcome variable is denoted by mort_age59_related_postHS and the running variable by povrate60. Table 22.15 provides the summary statistics for these two variables.

# Load the data
df_head <- read_dta("data/LM2007.dta")
colnames(df_head)

 [1] "state"                          "povrate60"                     
 [3] "mort_age59_related_postHS"      "mort_age59_injury_postHS"      
 [5] "mort_age59_all_postHS"          "mort_age25plus_related_postHS" 
 [7] "mort_age25plus_injuries_postHS" "mort_age59_related_preHS"      
 [9] "census1960_pop"                 "census1960_pctsch1417"         
[11] "census1960_pctsch534"           "census1960_pctsch25plus"       
[13] "census1960_pop1417"             "census1960_pop534"             
[15] "census1960_pop25plus"           "census1960_pcturban"           
[17] "census1960_pctblack"

# Summary statistics
var_label(df_head$mort_age59_related_postHS) <- "Mortality Rate"
var_label(df_head$povrate60) <- "Poverty Rate"

datasummary_skim(
  df_head[, c("mort_age59_related_postHS", "povrate60")],
  fmt = 3
)

Table 22.15: Summary statistics

	Unique	Missing Pct.	Mean	SD	Min	Median	Max
Mortality Rate	1034	0	2.254	5.726	0.000	0.000	136.054
Poverty Rate	2783	0	36.735	15.268	15.209	33.593	81.570

We use the rdrobust function with the cutoff c=59.1984 to estimate the RDD parameter $\tau$. The estimated RDD parameter is -2.409 with a robust t-statistic -2.032. This is a large estimate given that the mean child mortality rate is around 2.254 as shown in Table 22.15. This estimate indicates that the federal grant-writing assistance reduced mortality by 2.032 children per 100,000.

Y <- df_head$mort_age59_related_postHS
W <- df_head$povrate60
rdd_result <- rdrobust(y = Y, x = W, c = 59.1984)
summary(rdd_result)

Call: rdrobust

Sharp RD estimates using local polynomial regression.

Number of Obs.                 2783
BW type                       mserd
Kernel                   Triangular
VCE method                       NN

                               Left        Right
Number of Obs.                 2489          294
Eff. Number of Obs.             234          180
Order est. (p)                    1            1
Order bias (q)                    2            2
BW est. (h)                   6.811        6.811
BW bias (b)                  10.726       10.726
rho (h/b)                     0.635        0.635
Unique Obs.                    2489          294

=====================================================================
                   Point    Robust Inference
                Estimate         z     P>|z|      [ 95% C.I. ]       
---------------------------------------------------------------------
     RD Effect    -2.409    -2.032     0.042    [-5.462 , -0.099]    
=====================================================================

We can use the rdplot function to generate the fitted polynomials along with sample averages of outcome variable within each bin as shown in Figure 22.10. The plot shows a clear jump in the fitted polynomials at the cutoff, which is consistent with the estimated RDD parameter.

rdplot(y=Y, x=W, binselect="qsmv",c=59.1984,
       y.label="Mortality Rate",
       title=NULL,
       x.label="Poverty Rate",
       col.lines="steelblue")

Figure 22.10: RDD plot for the effect of federal grant-writing assistance on child mortality

Abadie, Alberto. 2005. “Semiparametric Difference-in-Differences Estimators.” The Review of Economic Studies 72 (1): 1–19.

Baker, Andrew, Brantly Callaway, Scott Cunningham, Andrew Goodman-Bacon, and Pedro H. C. Sant’Anna. 2026. “Difference-in-Differences Designs: A Practitioner’s Guide.” Journal of Economic Literature, ahead of print. https://doi.org/10.1257/jel.20251650.

Callaway, Brantly, and Pedro H. C. Sant’Anna. 2021. “Difference-in-Differences with Multiple Time Periods.” Journal of Econometrics 225 (2): 200–230.

Calonico, Sebastian, Matias D. Cattaneo, and Rocio Titiunik. 2014. “Robust Nonparametric Confidence Intervals for Regression-Discontinuity Designs.” Econometrica 82 (6): 2295–326.

Card, David, and Alan B. Krueger. 1994. “Minimum Wages and Employment: A Case Study of the Fast-Food Industry in New Jersey and Pennsylvania.” The American Economic Review 84 (4): 772–93.

Cattaneo, M. D., N. Idrobo, and R. Titiunik. 2020. A Practical Introduction to Regression Discontinuity Designs: Foundations. Elements in Quantitative and Computational Methods for the Social Sciences. Cambridge University Press.

Chernozhukov, Victor, Christian Hansen, Nathan Kallus, Martin Spindler, and Vasilis Syrgkanis. 2024. Applied Causal Inference Powered by ML and AI. https://causalml-book.org/.

Deaton, Angus, and Nancy Cartwright. 2018. “Understanding and Misunderstanding Randomized Controlled Trials.” Social Science & Medicine 210: 2–21.

Hansen, Bruce. 2022. Econometrics. Princeton University Press.

Imbens, Guido W., and Donald B. Rubin. 2015. Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. Cambridge University Press.

Imbens, Guido, and Karthik Kalyanaraman. 2012. “Optimal Bandwidth Choice for the Regression Discontinuity Estimator.” The Review of Economic Studies 79 (3): 933–59.

Ludwig, Jens, and Douglas L. Miller. 2007. “Does Head Start Improve Children’s Life Chances? Evidence from a Regression Discontinuity Design.” The Quarterly Journal of Economics 122 (1): 159–208.

Stock, James H., and Mark W. Watson. 2020. Introduction to Econometrics. Fourth edition. Pearson.

Tella, Rafael Di, and Ernesto Schargrodsky. 2004. “Do Police Reduce Crime? Estimates Using the Allocation of Police Forces After a Terrorist Attack.” The American Economic Review 94 (1): 115–33.

Titiunik, Rocío. 2021. “Natural Experiments.” In Advances in Experimental Political Science, edited by James N. Druckman and Donald P.Editors Green. Cambridge University Press.

Wooldridge, Jeffrey M. 2025. Introductory Econometrics: A Modern Approach. 8th ed. Cengage Learning.

The approach we presented in this section is called the continuity framework. A second approach, known as the local randomization framework, is based on the idea that the RDD can be viewed as a randomized experiment around the cutoff. See Cattaneo et al. (2020) for details.↩︎
A kernel is a non-negative real-valued function. See Wikipedia for commonly used kernel functions in non-parametric econometrics.↩︎
See Wikipedia on the details of the program.↩︎