# Load necessary libraries
library(fixest)
library(sandwich)
library(lmtest)
library(modelsummary)
options("modelsummary_factory_default" = "gt")
library(haven)
library(rdrobust)
library(knitr)
library(tidyverse)
library(labelled)
22 Experiments and Quasi-Experiments
22.1 Randomized controlled experiments
A randomized controlled experiment is a study designed to measure the effect of a treatment or policy intervention on an outcome variable by randomly assigning experimental units to treatment and control groups. The units in the treatment group receive the treatment or are subject to the intervention and those in the control group are not subject to any changes. We then measure the treatment or intervention effect by comparing the groups in terms of the outcome variable. We usually collect data on two variables of interest: the outcome variable, denoted by \(Y\), and the treatment indicator, denoted by \(D\), which specifies whether a unit is in the treatment or control group.
Example 22.1 (Clinical drug trial) Consider a randomized controlled experiment designed to assess whether a proposed drug lowers cholesterol levels. In this experiment, we have:
- \(Y\) is the cholesterol level,
- \(D=1\) if the individual is in the treatment group (i.e., receives the drug), and \(D=0\) if the individual is in the control group (i.e., does not receive the drug).
Example 22.2 (Job training program) Consider a randomized controlled experiment designed to assess the effect of a job training program on wages. In this experiment, we have:
- \(Y\) represents the wage income,
- \(D=1\) if the individual is in the treatment group (i.e., participates in the job training program) and \(D=0\) if the individual is in the control group (i.e., does not participate in the job training program).
Example 22.3 (Tennessee class size experiment) The Tennessee class size reduction experiment, known as Project STAR (Student–Teacher Achievement Ratio), was a 4-year experiment designed to evaluate the effect of class size on learning. Students were randomly assigned to one of three groups:
- a regular class (22–25 students),
- a regular class with a teacher’s aide, or
- a small class (13–17 students).
In this experiment, students then took standardized tests (the Stanford Achievement Tests) in reading and math. The goal of the experiment was to assess the effect of class type (regular, regular with an aide, or small) on test scores. In this experiment, we have one outcome variable and two treatment indicators:
- \(Y\) is the test score,
- \(D_1=1\) if the student is in a small class (i.e., the student is in the treatment group), and \(D_1=0\) if the student is in a regular class (i.e., the student is in the control group).
- \(D_2=1\) if the student is in a regular class with a teacher’s aide (i.e., the student is in the treatment group) and \(D_2=0\) if the student is in a regular class (i.e., the student is in the control group).
In all these examples, the experiment is deliberately designed and implemented to assess the causal effect of a treatment or intervention on the outcome variable. The assignment mechanism is the key feature that distinguishes randomized controlled experiments from other types of studies. A randomized controlled experiment is one in which the assignment mechanism satisfies the following properties (G. W. Imbens and Rubin (2015), Titiunik (2021)):
- It is designed and implemented by the researchers.
- It is known to the researchers.
- It ensures randomization such that the probability of participating in the treatment does not depend on the potential outcomes (defined below) of the experimental units.
22.2 Potential outcomes and causal effects
To define the treatment effect, we need to first introduce the potential outcomes framework.
Definition 22.1 (Potential outcomes) The outcome of an individual under potential treatment or non-treatment is called the potential outcome. We use \(Y_i(1)\) and \(Y_i(0)\) to denote the treated and untreated potential outcomes of the \(i\)th individual, respectively.
If the individual is in the treated group, then the observed outcome \(Y_i\) is given by the treated outcome, i.e., \(Y_i=Y_i(1)\), and if the individual is in the control group, the observed outcome \(Y_i\) is given by the untreated outcome, i.e., \(Y_i=Y_i(0)\). Note that we do not observe the untreated potential outcome for the treated individual, and the treated potential outcome for the untreated individual.
Definition 22.2 (Causal effect) The causal effect (or treatment effect) is defined as the difference between the treated and untreated potential outcomes: \(Y_i(1)-Y_i(0)\).
This treatment effect cannot be observed because an individual can be either treated or untreated, but not both at the same time. This is called the fundamental problem of causal inference in the literature.
Definition 22.3 (Average treatment effect) The average treatment effect (ATE) is defined as \[ \text{ATE}=\E(Y_i(1)-Y_i(0)). \tag{22.1}\]
The observed outcome \(Y_i\) can be expressed as \(Y_i=D_iY_i(1)+(1-D_i)Y_i(0)\). If the individual is treated (\(D_i=1\)), then \(Y_i=Y_i(1)\). Conversely, if the individual is untreated (\(D_i=0\)), then \(Y_i=Y_i(0)\). Thus,
\[
\begin{align}
Y_i&=D_iY_i(1)+(1-D_i)Y_i(0)\nonumber\\
&=Y_i(0)+D_i\left(Y_i(1)-Y_i(0)\right)\nonumber\\
&=\E\left[Y_i(0)\right]+D_i\left(Y_i(1)-Y_i(0)\right)+\left(Y_i(0)-\E\left[Y_i(0)\right]\right)\nonumber\\
&=\beta_{0i}+\beta_{1i}D_i+u_i,
\end{align}
\]
where
- \(\beta_{0i}=\E\left[Y_i(0)\right]\),
- \(\beta_{1i}=\left(Y_i(1)-Y_i(0)\right)\) is the individual causal effect, and
- \(u_i=\left(Y_i(0)-\E\left[Y_i(0)\right]\right)\).
Under the assumption that the treatment effect is constant, i.e., \(\beta_{1i}=\beta_1\), and \(\beta_{0i}=\beta_0\), we obtain the usual regression model: \[ \begin{align} Y_i=\beta_{0}+\beta_{1}D_i+u_i. \end{align} \tag{22.2}\]
We want to show that \(\beta_1\) in Equation 22.2 equals the ATE. To that end, we make the following assumptions about the potential outcomes and the treatment assignment:
- Assumption 1: The observed outcome corresponds to the potential outcome under the treatment received, i.e., \(Y_i=D_iY_i(1)+(1-D_i)Y_i(0)\), which rules out interference between units.
- Assumption 2: There are no hidden versions of the treatment, so each unit's potential outcomes are well defined.
- Assumption 3: The treatment indicator \(D_i\) is independent of the potential outcomes \((Y_i(1),Y_i(0))\), and \(0<P(D_i=1)<1\).
Assumptions 1 and 2 together are referred to as the Stable Unit-Treatment Value Assumption (SUTVA) (G. W. Imbens and Rubin (2015)). Assumption 3 is the randomization assumption. In particular, it implies that \(\E(u_i|D_i)=0\) because \(D_i\) is independent of the potential outcomes. The last part of Assumption 3, \(0<P(D_i=1)<1\), ensures that there are both treated and untreated individuals in the sample.
Theorem 22.1 (ATE and \(\beta_1\)) Under Assumptions 1-3, we have \(ATE=\beta_1\).
Proof (Proof of Theorem 22.1). Note that we can express \(\beta_1\) as \(\beta_1=\E[Y_i|D_i=1]-\E[Y_i|D_i=0]\). Then, we have \[ \begin{align*} \beta_1&=\E[Y_i|D_i=1]-\E[Y_i|D_i=0]\\ &=\E\left[Y_i(1)|D_i=1\right]-\E\left[Y_i(0)|D_i=0\right]\\ &=\E\left[Y_i(1)\right]-\E\left[Y_i(0)\right]=\E\left[Y_i(1)-Y_i(0)\right], \end{align*} \] where the second equality follows from Assumption 1 and the third equality from Assumption 3.
Theorem 22.1 suggests that we can use the OLS estimator of \(\beta_1\) to estimate ATE. Since \(D_i\) is binary, we can express the OLS estimator \(\hat{\beta}_1\) as \[ \begin{align} \hat{\beta}_1=\bar{Y}^{treated}-\bar{Y}^{control}=\frac{1}{n_1}\sum_{i:D_i=1}Y_i-\frac{1}{n_0}\sum_{i:D_i=0}Y_i, \end{align} \tag{22.3}\] where \(n_1\) and \(n_0\) are the sizes of treatment and control groups, respectively.
Definition 22.4 (Difference-in-means estimator) The OLS estimator \(\hat{\beta}_1\) in Equation 22.3 is called the difference-in-means estimator or differences estimator.
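The numerical equivalence in Equation 22.3 is easy to verify on simulated data. The following is a minimal sketch; the data-generating process (effect size, sample size, and seed) is an illustrative assumption, not from the text:

```r
set.seed(42)
n <- 1000
D <- rbinom(n, 1, 0.5)               # randomized treatment indicator
Y <- 50 + 10 * D + rnorm(n, sd = 5)  # outcome with a constant treatment effect of 10

# Difference-in-means estimator
diff_means <- mean(Y[D == 1]) - mean(Y[D == 0])

# OLS estimator of beta_1 in Y = beta_0 + beta_1 * D + u
fit <- lm(Y ~ D)
all.equal(unname(coef(fit)["D"]), diff_means)  # TRUE
```

Because \(D_i\) is binary, the equality holds exactly (up to floating-point error), not just approximately.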
If Assumption 3 fails, i.e., if the treatment assignment and the potential outcomes are dependent, then the OLS estimator suffers from selection bias. The difference between \(\beta_1=\E[Y_i|D_i=1]-\E[Y_i|D_i=0]\) and \(\text{ATE}=\E[Y_i(1)-Y_i(0)]\) is called the selection bias.
In estimating the ATE using Equation 22.2, do we need to consider any other variables, such as pre-treatment characteristics or control variables? If there are variables that capture the pre-treatment characteristics of units, or if control variables are available, we can include them in Equation 22.2 to increase the efficiency of the OLS estimator: \[ Y_i=\beta_0+\beta_1D_i+\beta_2W_{1i}+\dots+\beta_{r+1}W_{ri}+u_i, \] where \(W_{1i},\dots,W_{ri}\) are pre-treatment characteristics or control variables. Note that since assignment to treatment is randomized, the conditional mean independence assumption holds with respect to \(D_i\): \[ \E(u_i|D_i,W_{1i},\dots,W_{ri})=\E(u_i|W_{1i},\dots,W_{ri}). \]
Thus, the OLS estimator of \(\beta_1\) is an unbiased estimator of the ATE. However, recall that the coefficients on the control variables do not have a causal interpretation unless \(\E(u_i|W_{1i},\dots,W_{ri})=0\).
When the assignment of individuals to treatment and control groups is influenced by one or more observable variables \(W\), we refer to this as randomization based on covariates. In this case, the OLS estimator of \(\beta_1\) based on Equation 22.2 suffers from omitted variable bias because \(D\) is correlated with the omitted variable \(W\). The omitted variable bias can be avoided by using \(W\) as a control variable in the model.
22.3 Threats to validity of experiments
As in the case of multiple linear regression model, threats to internal validity of an experiment can induce correlation between the treatment indicator and the error term in Equation 22.2. These threats include:
- Failure to randomize,
- Failure to follow the treatment protocol,
- Attrition,
- Experimental effects, and
- Small sample sizes.
Under the first threat, Assumption 3 fails. For example, consider the job training program in Example 22.2. If we assign disproportionately high-ability individuals to the treatment group, we induce correlation between the treatment indicator \(D\) and the potential outcomes and, therefore, between \(D\) and the error term \(u\). This correlation arises because ability (which is part of the error term) influences both treatment assignment and the outcome variable (wage). In this case, since the zero-conditional mean assumption fails, the OLS estimator of \(\beta_1\) will be biased and inconsistent.
If we have pre-treatment covariates \(W_1,\dots,W_r\) on the experimental units, we can test whether Assumption 3 holds. The idea of this testing approach is that if the treatment is randomly assigned, then the treatment indicator \(D\) should be uncorrelated with the pre-treatment covariates. Therefore, we can use the \(F\)-statistic to test the joint null hypothesis that the coefficients on the pre-treatment covariates in the regression of \(D\) on \(W_1,\dots,W_r\) are zero. If the treatment is randomly assigned, then the \(F\)-statistic should fail to reject this joint null hypothesis.
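A balance check of this kind can be sketched as follows. The simulated covariates, sample size, and seed are illustrative assumptions; with real data, \(D\) and the \(W\)'s would come from the experiment:

```r
set.seed(1)
n  <- 500
W1 <- rnorm(n)            # pre-treatment covariate
W2 <- rnorm(n)            # pre-treatment covariate
D  <- rbinom(n, 1, 0.5)   # randomized assignment, unrelated to W1 and W2

# Regress D on the pre-treatment covariates and inspect the joint F-test
bal   <- lm(D ~ W1 + W2)
fstat <- summary(bal)$fstatistic
p_val <- pf(fstat["value"], fstat["numdf"], fstat["dendf"], lower.tail = FALSE)
# Under random assignment, we expect to fail to reject the joint null
# that both covariate coefficients are zero (a large p-value).
```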
In an experiment, subjects may not comply with the treatment protocol. For example, in our job training program in Example 22.2, some subjects in the treatment group may not participate in the training, or some subjects from the control group may participate in the training sessions. In either case, the unobserved factors (e.g., ability) that are part of the error term will be correlated with \(D\), since these unobserved factors influence whether subjects comply with the treatment protocol.
Attrition occurs when subjects leave either the treatment group or the control group during an experiment. For example, in our job training program in Example 22.2, the most able candidates in the treatment group may leave the program early if they are able to find a job before completing the training. This can create systematic differences between the treatment and control groups, thereby inducing a correlation between \(D\) and \(u\).
The experimental effect, also known as the Hawthorne effect, arises when subjects in an experiment change their behavior simply because they are aware of being part of the experiment.
The final threat to internal validity is a small sample size. This threat does not violate Assumption 3, but it can affect the standard error of the OLS estimator of \(\beta_1\). Recall that the asymptotic variance of the OLS estimator is inversely proportional to the sample size. Thus, the estimated standard error of the OLS estimator can be large when the sample size is small. Therefore, we may fail to reject the null hypothesis of no treatment effect, even if the treatment has an effect on the outcome variable.
As in the case of the multiple linear regression model, threats to the external validity of an experiment make it difficult to generalize its results to other populations and settings.
- The population and setting studied should be sufficiently similar to the population and setting of interest. For example, the results of an experiment on the effect of class size on learning in elementary schools may not generalize to college students because the populations and settings are different.
- If the program or policy studied in an experiment differs in some aspects from the program or policy of interest, its results may not be generalizable.
- Large-scale experiments can create general equilibrium effects that alter the broader context, thereby producing results that differ from those of smaller-scale experiments. Deaton and Cartwright (2018) give the following example: “Suppose an RCT demonstrates that in the study population a new way of using fertilizer had a substantial positive effect on, say, cocoa yields, so that farmers who used the new methods saw increases in production and in incomes compared to those in the control group. If the procedure is scaled up to the whole country, or to all cocoa farmers worldwide, the price will drop, and if the demand for cocoa is price inelastic - as is usually thought to be the case, at least in the short run - cocoa farmers’ incomes will fall. Indeed, the conventional wisdom for many crops is that farmers do best when the harvest is small, not large. In this case, the scaled-up effect is opposite in sign to the trial effect.”
22.4 Experimental estimates of the effect of class size reductions
We use the dataset from the Tennessee class size reduction experiment, known as Project STAR (Student-Teacher Achievement Ratio), contained in the STAR.csv file. This 4-year experiment was designed to evaluate the impact of class size on learning from kindergarten through third grade. The experiment cost $12 million to conduct. Upon entering the school system, students were randomly assigned to one of three groups:
- Regular class (22-25 students),
- Regular class with a teacher’s aide, or
- Small class (13-17 students).
The interventions began when students entered kindergarten and continued through third grade. Teachers were also randomly assigned to classes. Students in regular classes were re-randomized after the first year to either a regular class or a regular class with a teacher’s aide. Students initially assigned to small classes remained in small classes throughout the experiment. Each year, students in the experiment were given standardized tests (the Stanford Achievement Test) in reading and math.
# Load the data
STAR <- read.table("data/STAR.csv",
header = TRUE,
sep = ",",
stringsAsFactors = TRUE
)
# Column names
colnames(STAR)
 [1] "gender" "ethnicity" "birth" "stark" "star1"
[6] "star2" "star3" "readk" "read1" "read2"
[11] "read3" "mathk" "math1" "math2" "math3"
[16] "lunchk" "lunch1" "lunch2" "lunch3" "schoolk"
[21] "school1" "school2" "school3" "degreek" "degree1"
[26] "degree2" "degree3" "ladderk" "ladder1" "ladder2"
[31] "ladder3" "experiencek" "experience1" "experience2" "experience3"
[36] "tethnicityk" "tethnicity1" "tethnicity2" "tethnicity3" "systemk"
[41] "system1" "system2" "system3" "schoolidk" "schoolid1"
[46] "schoolid2" "schoolid3"
Our goal is to estimate the effect of class type on test scores. Some of the variables that we use in our empirical models are described in Table 22.1.
| Variable | Description |
|---|---|
| `gender` | Factor indicating the student’s gender. |
| `ethnicity` | Factor indicating the student’s ethnicity, with levels cauc (Caucasian), afam (African American), asian (Asian), hispanic (Hispanic), amindian (American Indian), or other. |
| `stark` | Factor indicating the STAR class type in kindergarten: regular, small, or regular-with-aide. |
| `star1` | Factor indicating the STAR class type in 1st grade: regular, small, or regular-with-aide. |
| `star2` | Factor indicating the STAR class type in 2nd grade: regular, small, or regular-with-aide. |
| `star3` | Factor indicating the STAR class type in 3rd grade: regular, small, or regular-with-aide. |
| `readk`, `read1`, `read2`, `read3` | Total reading scaled scores in kindergarten, 1st grade, 2nd grade, and 3rd grade, respectively. |
| `mathk`, `math1`, `math2`, `math3` | Total math scaled scores in kindergarten, 1st grade, 2nd grade, and 3rd grade, respectively. |
| `lunchk` | Factor indicating whether the student qualified for free lunch in kindergarten. |
| `experiencek` | Years of the teacher’s total teaching experience in kindergarten. |
| `schoolidk`, `schoolid1`, `schoolid2`, `schoolid3` | Factors indicating the school ID in kindergarten, 1st grade, 2nd grade, and 3rd grade, respectively. |
In this experiment, there are two treatment indicators: SmallClass and RegAid. The first indicator equals 1 if the student is in a small class, and 0 otherwise. The second indicator equals 1 if the student is in a regular class with a teacher’s aide, and 0 otherwise. The control group consists of students in regular classes, indicated by Regular, which takes the value 1 if the student is in a regular class and 0 otherwise. In our empirical model, we use Regular as the reference group. Thus, our estimation equation takes the following form: \[
\begin{align}
Y_i=\beta_0+\beta_1{\tt SmallClass}_i+\beta_2{\tt RegAid}_i+u_i,
\end{align}
\tag{22.4}\] where
- \(Y_i\) is the test score of the \(i\)th student,
- \({\tt SmallClass}_i\) equals \(1\) if the \(i\)th student is in a small class, and \(0\) otherwise,
- \({\tt RegAid}_i\) equals \(1\) if the \(i\)th student is in a regular class with a teacher’s aide, and \(0\) otherwise,
- \(\beta_1\) represents the effect of being in a small class on the test score, relative to being in a regular class, and
- \(\beta_2\) gives the effect of being in a regular class with a teacher’s aide on the test score, relative to being in a regular class.
We estimate the model using observations from kindergarten through third grade. We expect the error terms to be correlated for students in the same school because these students may share similar unobserved characteristics. To account for this correlation in \(u_i\), we use standard errors clustered at the school level. This is achieved by using vcovCL(rk, type = "HC1", cluster = ~schoolidk) in the following code chunk.
# Model K: Kindergarten
rk <- lm(I(readk + mathk) ~ stark,
data = STAR
)
rk_vcov <- vcovCL(rk,
type = "HC1",
cluster = ~schoolidk
)
rk_se <- sqrt(diag(rk_vcov))
# Model 1: First Grader
r1 <- lm(I(read1 + math1) ~ star1,
data = STAR
)
r1_vcov <- vcovCL(r1,
type = "HC1",
cluster = ~schoolid1
)
r1_se <- sqrt(diag(r1_vcov))
# Model 2: Second Grader
r2 <- lm(I(read2 + math2) ~ star2,
data = STAR
)
r2_vcov <- vcovCL(r2,
type = "HC1",
cluster = ~schoolid2
)
r2_se <- sqrt(diag(r2_vcov))
# Model 3: Third Grader
r3 <- lm(I(read3 + math3) ~ star3,
data = STAR
)
r3_vcov <- vcovCL(r3,
type = "HC1",
cluster = ~schoolid3
)
r3_se <- sqrt(diag(r3_vcov))

The estimation results are presented in Table 22.2. For kindergarten students, those in a small class score, on average, 13.9 points higher on the test than those in a regular class. The estimated effect of being in a regular class with a teacher’s aide is 0.31 points and is statistically insignificant. The average estimated effect of being in a small class, relative to being in a regular class, is 29.8 points for first grade, 19.4 points for second grade, and 15.6 points for third grade. The estimated effect of being in a regular class with a teacher’s aide, relative to being in a regular class, is statistically insignificant for each grade except first grade. Overall, these results suggest that reducing class size has a positive effect on test scores, while adding teacher’s aides to regular classes has no effect on test scores, except in first grade.
# Print results in table form
modelsummary(
models = list(
"(K)" = rk,
"(1)" = r1,
"(2)" = r2,
"(3)" = r3
),
vcov = list(rk_se, r1_se, r2_se, r3_se),
fmt = 2,
stars = TRUE,
gof_omit = "AIC|BIC|Log|F"
)

|  | (K) | (1) | (2) | (3) |
|---|---|---|---|---|
| (Intercept) | 918.04*** | 1039.39*** | 1157.81*** | 1228.51*** |
|  | (4.82) | (5.82) | (5.29) | (4.66) |
| starkregular+aide | 0.31 |  |  |  |
|  | (3.77) |  |  |  |
| starksmall | 13.90*** |  |  |  |
|  | (4.23) |  |  |  |
| star1regular+aide |  | 11.96*** |  |  |
|  |  | (4.86) |  |  |
| star1small |  | 29.78*** |  |  |
|  |  | (4.79) |  |  |
| star2regular+aide |  |  | 3.48 |  |
|  |  |  | (4.91) |  |
| star2small |  |  | 19.39*** |  |
|  |  |  | (5.12) |  |
| star3regular+aide |  |  |  | -0.29 |
|  |  |  |  | (4.04) |
| star3small |  |  |  | 15.59*** |
|  |  |  |  | (4.21) |
| Num.Obs. | 5786 | 6379 | 6049 | 5967 |
| R2 | 0.007 | 0.017 | 0.009 | 0.010 |
| R2 Adj. | 0.007 | 0.017 | 0.009 | 0.010 |
| RMSE | 73.47 | 90.48 | 83.67 | 72.89 |
| Std.Errors | Custom | Custom | Custom | Custom |
| + p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001 |  |  |  |  |
Next, we extend the model in Equation 22.4 with some covariates for two main reasons. First, by using additional covariates, we can increase the efficiency of the OLS estimator. Second, assignment to class types may be influenced by observable characteristics of students and teachers. In such cases, the OLS estimator based on a model that includes these covariates may yield different estimates of the treatment effects. For these reasons, we consider the following covariates in our empirical model:
- `experience`: Teacher’s years of experience
- `boy`: Student is a boy (a dummy variable)
- `lunch`: Free lunch eligibility (a dummy variable)
- `black`: Student is African-American (a dummy variable)
- `race`: Student’s race is other than black or white (a dummy variable)
- `schoolid`: School indicator variables (school fixed effects)
In the following, for kindergarten students, we generate these covariates and relevel the lunch variable to set “non-free” as the reference category.
# Create covariates
STAR$black <- (STAR$ethnicity == "afam")
STAR$race <- !(STAR$ethnicity == "afam" | STAR$ethnicity == "cauc")
STAR$boy <- (STAR$gender == "male") # equals 1 if male, 0 otherwise
# Relevel lunch variable to set "non-free" as the reference category
STAR$lunchk <- relevel(STAR$lunchk, ref = "non-free")

We consider the following four models, where the first model is the baseline model without any covariates, and the second, third, and fourth models include additional covariates. \[ \begin{align} &1.\, Y_i=\beta_0+\beta_1{\tt SmallClass}_i+\beta_2{\tt RegAid}_i+u_i,\\ &2.\, Y_i=\beta_0+\beta_1{\tt SmallClass}_i+\beta_2{\tt RegAid}_i+\beta_3{\tt experience}_i+u_i,\\ &3.\, Y_i=\beta_0+\beta_1{\tt SmallClass}_i+\beta_2{\tt RegAid}_i+\beta_3{\tt experience}_i+{\tt schoolid}+u_i,\\ &4.\, Y_i=\beta_0+\beta_1{\tt SmallClass}_i+\beta_2{\tt RegAid}_i+\beta_3{\tt experience}_i+\beta_4{\tt boy}\nonumber\\ &\quad\quad+\beta_5{\tt lunch}+\beta_6{\tt black}+\beta_7{\tt race}+{\tt schoolid}+u_i. \end{align} \]
As in Stock and Watson (2020), we estimate these models using only the kindergarten sample.
# Model 9
rk <- lm(I(readk + mathk) ~ stark,
data = STAR
)
rk_vcov <- vcovCL(rk,
type = "HC1",
cluster = ~schoolidk
)
rk_se <- sqrt(diag(rk_vcov))
# Model 10: Controls: Teacher Experience
rk1 <- lm(I(mathk + readk) ~ stark + experiencek,
data = STAR
)
rk1_vcov <- vcovCL(rk1,
type = "HC1",
cluster = ~schoolidk
)
rk1_se <- sqrt(diag(rk1_vcov))
# Model 11: Controls: Teacher Experience + School Indicator
rk2 <- lm(I(mathk + readk) ~ stark + experiencek + factor(schoolidk),
data = STAR
)
rk2_vcov <- vcovCL(rk2,
type = "HC1",
cluster = ~schoolidk
)
rk2_se <- sqrt(diag(rk2_vcov))
# Model 12: Controls: Teacher Experience + School Indicator + Gender + Free Lunch + Black + Race
rk3 <- lm(
I(mathk + readk) ~ stark + experiencek +
factor(schoolidk) + boy + lunchk + black + race,
data = STAR
)
rk3_vcov <- vcovCL(rk3,
type = "HC1",
cluster = ~schoolidk
)
rk3_se <- sqrt(diag(rk3_vcov))

The estimation results are presented in Table 22.3. The results in this table show that adding these additional covariates and school fixed effects to the model does not lead to substantially different estimates of the treatment effects. However, note that the reported standard errors are somewhat smaller than those reported in Table 22.2, suggesting an efficiency gain from including these covariates.
modelsummary(
models = list(
"(9)" = rk,
"(10)" = rk1,
"(11)" = rk2,
"(12)" = rk3
),
vcov = list(rk_se, rk1_se, rk2_se, rk3_se),
fmt = 2,
coef_omit = "schoolidk",
stars = TRUE,
gof_omit = "AIC|BIC|Log|F"
)

|  | (9) | (10) | (11) | (12) |
|---|---|---|---|---|
| (Intercept) | 918.04*** | 904.72*** | 925.67*** | 946.18*** |
|  | (4.82) | (6.20) | (3.60) | (4.82) |
| starkregular+aide | 0.31 | -0.60 | 1.22 | 1.79 |
|  | (3.77) | (3.84) | (3.67) | (3.77) |
| starksmall | 13.90*** | 14.01*** | 15.93*** | 15.89*** |
|  | (4.23) | (4.25) | (4.11) | (4.23) |
| experiencek |  | 1.47*** | 0.74*** | 0.66*** |
|  |  | (0.44) | (0.36) | (0.16) |
| boyTRUE |  |  |  | -12.09*** |
|  |  |  |  | (1.67) |
| lunchkfree |  |  |  | -34.70*** |
|  |  |  |  | (2.02) |
| blackTRUE |  |  |  | -25.43*** |
|  |  |  |  | (3.58) |
| raceTRUE |  |  |  | -8.50 |
|  |  |  |  | (12.22) |
| Num.Obs. | 5786 | 5766 | 5766 | 5748 |
| R2 | 0.007 | 0.020 | 0.234 | 0.291 |
| R2 Adj. | 0.007 | 0.020 | 0.223 | 0.281 |
| RMSE | 73.47 | 73.06 | 64.61 | 62.19 |
| Std.Errors | Custom | Custom | Custom | Custom |
| + p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001 |  |  |  |  |
22.5 Quasi-Experiments
In a quasi-experiment, also known as a natural experiment, units are assigned to treatment and control groups based on changes in their circumstances. In other words, treatment and control groups are not formed through an explicit randomized controlled experiment but as a result of a policy intervention (or other changes in circumstances). Therefore, in a quasi-experiment, we assume that there is “as if” random assignment. Since there is no complete randomization, we expect some systematic differences between the treatment and control groups, which means we cannot resort to the difference-in-means estimator introduced earlier to estimate the treatment effect. In this section, we introduce the differences-in-differences (DiD) and regression discontinuity (RD) approaches for estimating the treatment effect in quasi-experimental settings.
Example 22.4 (The effect of minimum wage on employment) Card and Krueger (1994) investigate the effect of an increase in the minimum wage on average employment at fast-food restaurants in New Jersey. On April 1, 1992, New Jersey increased the minimum hourly wage from $4.25 to $5.05, making it the highest minimum wage in the country. In contrast, the minimum hourly wage in neighboring Pennsylvania remained unchanged. In this example, the treatment and control groups are naturally formed by the increase in the minimum wage:
- The treatment group consists of fast-food restaurants in New Jersey.
- The control group consists of fast-food restaurants in Pennsylvania.
Card and Krueger (1994) argued that fast-food restaurants in Pennsylvania can serve as a control group because Pennsylvania is a nearby state, and both states share similar seasonal employment patterns and are subject to common unobserved shocks. The authors collected data on fast-food restaurants in both states before and after the wage change, including information on employment, starting wages, prices, and other store characteristics. Figure 22.1 shows the state minimum wage as of July 1, 2024.
22.6 Differences-in-differences
22.6.1 Differences-in-differences with two time periods
To introduce the differences-in-differences (DiD) approach involving two groups and two time periods, we introduce the following notation:
- Two groups: treatment group and control group,
- Two time periods: before (\(t=1\)) and after (\(t=2\)) intervention (or policy change),
- \(\bar{Y}^{treatment,before}\): the sample average of \(Y\) for the treated units before the intervention,
- \(\bar{Y}^{treatment,after}\): the sample average of \(Y\) for the treated units after the intervention,
- \(\bar{Y}^{control,before}\): the sample average of \(Y\) for the control group before the intervention, and
- \(\bar{Y}^{control,after}\): the sample average of \(Y\) for the control group after the intervention.
Definition 22.5 (The DiD estimator) The DiD estimator is given by \[ \begin{align} \hat{\beta}&=\left(\bar{Y}^{ treatment,after}-\bar{Y}^{treatment,before}\right)-\left(\bar{Y}^{control,after}-\bar{Y}^{control,before}\right)\\ &=\Delta\bar{Y}^{treatment}-\Delta\bar{Y}^{control}. \end{align} \tag{22.5}\]
We illustrate the DiD estimator \(\hat{\beta}\) in Figure 22.2. Before the treatment, the sample averages are \(\bar{Y}^{treatment,before}=40\) and \(\bar{Y}^{control,before}=20\), and after the treatment these averages are \(\bar{Y}^{treatment,after}=80\) and \(\bar{Y}^{control,after}=30\). Note that the difference in group means after the treatment is \(80-30=50\). Some of this difference is due to the fact that both groups have different pre-treatment means. The DiD approach removes the effect of the difference in the pre-treatment period by focusing on the change in group means. Thus, the DiD estimator gives \((80-40)-(30-20)=30\).
The DiD approach is based on the assumption that, on average, the treatment and control groups would have followed the same trend in the absence of treatment. This assumption is called the parallel trends assumption in the literature. In Figure 22.2, the trend in the average outcome for the treated group in the absence of treatment is illustrated with the dashed black line. Under this assumption, the counterfactual average outcome for the treated group in the post-treatment period is \(50\), so the dashed line is parallel to the steel blue line. This ensures that the difference between the observed average outcome \(\bar{Y}^{treatment,after}=80\) and the counterfactual average outcome of \(50\) is due to the treatment.
The DiD estimator in Equation 22.5 suggests that we can alternatively compute it using the following regression model: \[ \begin{align} \Delta Y_i=\beta_0+\beta_1D_i+u_i, \end{align} \tag{22.6}\] where \(\Delta Y_i\) represents the change in the post-treatment value of \(Y_i\) minus the pre-treatment value for the \(i\)th individual, and \(D_i\) is the treatment indicator. In this regression, the DiD estimator is the OLS estimator of \(\beta_1=\E[\Delta Y_i|D_i=1]-\E[\Delta Y_i|D_i=0]\).
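The equivalence between the difference of group means in Equation 22.5 and the regression slope in Equation 22.6 can be checked on simulated data. This is an illustrative sketch; the trend and effect sizes are assumptions chosen for the example:

```r
set.seed(7)
n  <- 400
D  <- rbinom(n, 1, 0.5)          # treatment indicator
Y1 <- 20 + 5 * D + rnorm(n)      # pre-treatment outcome (groups differ at baseline)
Y2 <- Y1 + 2 + 3 * D + rnorm(n)  # common trend of 2, treatment effect of 3
dY <- Y2 - Y1                    # change in the outcome

# DiD from the four group means
did_means <- (mean(Y2[D == 1]) - mean(Y1[D == 1])) -
             (mean(Y2[D == 0]) - mean(Y1[D == 0]))

# DiD as the OLS slope in the change regression (Equation 22.6)
fit <- lm(dY ~ D)
all.equal(unname(coef(fit)["D"]), did_means)  # TRUE
```

Note that the baseline gap of 5 between the groups does not bias the estimate, because differencing removes it.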
The DiD estimator can be extended to include additional regressors \(W_{1},\dots,W_{r}\). These variables can be pre-treatment individual characteristics, or they can be control variables: \[ \begin{align} \Delta Y_i=\beta_0+\beta_1D_i+\beta_2W_{1i}+\dots+\beta_{r+1}W_{ri}+u_i. \end{align} \tag{22.7}\]
The main reason for including additional regressors \(W_{1},\dots,W_{r}\) in the DiD model is to ensure that the parallel trends assumption holds.
22.6.2 Differences-in-differences with multiple time periods
So far, we have assumed that there are two periods in the DiD approach. It is also possible to extend the DiD approach when we observe each unit in the treatment and control group at multiple time periods. Let us assume that we observe each unit for \(t=1,2,\dots,T\) periods. We also assume that there is a common treatment timing denoted by \(t_0\). Then, the DiD estimator can be computed using the following regression: \[ \begin{align} Y_{it}=\beta_0+\beta_1D_i+\beta_2{\tt Period}_t+\beta_3(D_i\times{\tt Period}_t)+u_{it}, \end{align} \tag{22.8}\] where \(D_i\) is the treatment indicator and \({\tt Period}_t\) is the post-treatment indicator given by \[ \begin{align} {\tt Period}_t= \begin{cases} 1\quad\text{if}\quad t\geq t_0,\\ 0\quad\text{if}\quad t< t_0. \end{cases} \end{align} \] The DiD estimator is the OLS estimator of \(\beta_3\) in Equation 22.8. We can also estimate the DiD parameter \(\beta_3\) through the following two-way fixed effects regression: \[ \begin{align} Y_{it}=\alpha_i+\lambda_t+\beta_3(D_i\times{\tt Period}_t)+u_{it}, \end{align} \tag{22.9}\] where \(\alpha_i\) is the individual (entity) fixed effect, \(\lambda_t\) is the time fixed effect, \(D_i\) is the treatment indicator, and \({\tt Period}_t\) is the post-treatment indicator. The DiD estimator is given by the fixed effects estimator of \(\beta_3\). Both Equation 22.8 and Equation 22.9 can also be extended with variables \(W_{1t},\dots,W_{rt}\) capturing pre-treatment individual characteristics.
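For a balanced panel with two periods, the two-way fixed effects estimate of \(\beta_3\) in Equation 22.9 coincides exactly with the simple 2x2 DiD of group means. Below is a sketch with simulated data; the effect sizes and seed are assumptions, and the fixed effects are estimated with plain dummy variables via lm (in practice, a dedicated fixed effects routine such as the fixest package loaded above would be more convenient):

```r
set.seed(3)
n    <- 100                               # number of units
id   <- rep(1:n, each = 2)                # two periods per unit
t    <- rep(1:2, times = n)
D    <- rep(rbinom(n, 1, 0.5), each = 2)  # treatment group indicator (constant within unit)
post <- as.integer(t == 2)                # Period_t, with treatment starting at t0 = 2

alpha <- rep(rnorm(n), each = 2)          # unit fixed effects
Y <- alpha + 1.5 * post + 4 * D * post + rnorm(2 * n, sd = 0.5)

# Two-way fixed effects via unit and time dummies (Equation 22.9)
twfe  <- lm(Y ~ factor(id) + factor(t) + I(D * post))
beta3 <- unname(coef(twfe)["I(D * post)"])

# 2x2 DiD of group means (Equation 22.5)
did <- (mean(Y[D == 1 & post == 1]) - mean(Y[D == 1 & post == 0])) -
       (mean(Y[D == 0 & post == 1]) - mean(Y[D == 0 & post == 0]))
all.equal(beta3, did)  # TRUE
```

With more than two periods or staggered treatment timing, this exact equivalence no longer holds in general.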
22.6.3 Application: The effect of minimum wage on employment
In this section, we replicate some of the results from Card and Krueger (1994). The sample data, obtained from Hansen (2022), is contained in CK1994.dta and consists of two surveys conducted on 331 fast-food restaurants in New Jersey and 79 fast-food restaurants in eastern Pennsylvania. The first survey was conducted between 2/15/1992 and 3/4/1992 (before the minimum wage increase), and the second between 11/5/1992 and 12/31/1992 (after the increase). Some of the variables in the data set are described in Table 22.4.
| Variable | Description |
|---|---|
| `store` | Unique store ID. |
| `state` | Indicator variable equal to 1 if the store is located in New Jersey and 0 if located in Pennsylvania. |
| `empft` | Number of full-time employees. |
| `emppt` | Number of part-time employees. |
| `nmgrs` | Number of managers and assistant managers. |
| `wage_st` | Starting wage (dollars per hour). |
| `inctime` | Number of months until the usual first raise. |
| `firstinc` | Usual amount of the first raise (dollars per hour). |
| `meals` | Free or reduced-price meal code (see below). |
| `open` | Opening hour. |
| `hoursopen` | Number of hours open per day. |
| `time` | Indicator variable equal to 0 for the first survey (2/15/1992–3/4/1992) and 1 for the second survey (11/5/1992–12/31/1992). |
# Load the data
df <- read_dta("data/CK1994.dta")
# Column names
colnames(df)
 [1] "store" "chain" "co_owned" "state" "southj"
[6] "centralj" "northj" "pa1" "pa2" "shore"
[11] "ncalls" "empft" "emppt" "nmgrs" "wage_st"
[16] "inctime" "firstinc" "meals" "open" "hoursopen"
[21] "pricesoda" "pricefry" "priceentree" "nregisters" "nregisters11"
[26] "time"
We start by using Equation 22.5 to estimate the effect of the change in the minimum hourly wage on employment in New Jersey. In the following code chunk, we generate sample averages for both groups before and after the change. These estimates are provided in Table 22.5. In New Jersey, the average employment was 20.43 employees per restaurant before the increase in the minimum wage and 20.90 employees per restaurant after the change. In Pennsylvania, these averages were 23.38 and 21.10, respectively. The DiD estimator in Equation 22.5 gives \((20.90-20.43)-(21.10-23.38)=2.75\). Thus, the average employment increases by 2.75 employees per restaurant after the increase in the minimum wage, a result that contrasts with conventional economic theory.
# Compute total employment
df$fte <- df$empft + df$emppt / 2 + df$nmgrs
# Drop missing fte values
df <- df[!is.na(df$fte), ]
# Group by store and count number of periods per store
df$nperiods <- ave(df$store, df$store, FUN=length)
# Keep only stores with two periods
df <- df[df$nperiods == 2, ]
# Average Employment at Fast Food Restaurants
n1 <- mean(df[(df$state == 1 & df$time == 0), ]$fte)
n2 <- mean(df[df$state == 1 & df$time == 1, ]$fte)
p1 <- mean(df[df$state == 0 & df$time == 0, ]$fte)
p2 <- mean(df[df$state == 0 & df$time == 1, ]$fte)
# Collect means in a dataframe
c1 <- c(n1, n2, n2 - n1)
c2 <- c(p1, p2, p2 - p1)
d1 <- c(p1 - n1, p2 - n2, (n2 - n1) - (p2 - p1))
tb <- data.frame("New Jersey" = c1, "Pennsylvania" = c2, "Difference" = d1)
rownames(tb) <- c("Before Increase", "After Increase", "Difference")
# Print table
kable(tb, digits = 2, col.names = c("New Jersey", "Pennsylvania", "Difference"))

|  | New Jersey | Pennsylvania | Difference |
|---|---|---|---|
| Before Increase | 20.43 | 23.38 | 2.95 |
| After Increase | 20.90 | 21.10 | 0.20 |
| Difference | 0.47 | -2.28 | 2.75 |
Next, we use Equation 22.6 to estimate the effect of the increase in the minimum wage on employment. In the following code chunk, we generate the \(\Delta Y\) and \(D\) variables, and then use the OLS estimator to estimate the DiD parameter \(\beta_1\). The DiD estimate of \(\beta_1\) is \(2.75\), which is exactly the same as the one in Table 22.5.
# Estimation of (22.6)
DeltaY <- df$fte[df$time == 1] - df$fte[df$time == 0]
D <- df$state[df$time == 1]
df1 <- data.frame(DeltaY = DeltaY, D = D)
model1 <- lm(DeltaY ~ D, data = df1)
# Print table
modelsummary(model1,
stars = TRUE,
vcov = "HC1",
fmt = 3,
gof_omit = "AIC|BIC|Log.Lik|F|RMSE",
)

|  | (1) |
|---|---|
| (Intercept) | -2.283+ |
| (1.248) | |
| D | 2.750* |
| (1.338) | |
| Num.Obs. | 384 |
| R2 | 0.015 |
| R2 Adj. | 0.012 |
| Std.Errors | HC1 |
| + p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001 | |
We can also use Equation 22.8 to estimate the DiD parameter \(\beta_3\). In the following code chunk, we first create the treatment variable, which represents \(D_i \times {\tt Period}_t\), and then use the OLS estimator to estimate \(\beta_3\). Note that we use standard errors clustered at the store level to account for within-store correlation. The DiD estimate of \(\beta_3\) is 2.75 with a \(t\)-value of 2.05.
# Estimation of (22.8)
# Create treatment and period indicators
df$D <- df$state
df$Period <- df$time
# Simple OLS regression with clustering on store
model2 <- lm(fte ~ D + Period + I(D*Period), data = df)
model2_vcov <- vcovCL(model2,
type = "HC1",
cluster = ~ store)
model2_se <- sqrt(diag(model2_vcov))
modelsummary(model2,
stars = TRUE,
vcov = list(model2_se),
fmt = 3,
gof_omit = "AIC|BIC|Log.Lik|F|RMSE")

|  | (1) |
|---|---|
| (Intercept) | 23.380*** |
| (1.382) | |
| D | -2.949* |
| (1.478) | |
| Period | -2.283 |
| (1.249) | |
| I(D * Period) | 2.750 |
| (1.339) | |
| Num.Obs. | 768 |
| R2 | 0.008 |
| R2 Adj. | 0.004 |
| Std.Errors | Custom |
| + p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001 | |
Finally, we use the two-way fixed effects model in Equation 22.9 to estimate the DiD parameter \(\beta_3\). The following code chunk provides the same estimate as the previous methods.
model3 <- feols(
fte ~ I(D*Period) | store + time,
data = df,
panel.id = ~ store + time,
cluster = ~ store
)
modelsummary(model3,
stars = TRUE,
fmt = 3,
gof_omit = "AIC|BIC|Log.Lik|F|RMSE"
)

|  | (1) |
|---|---|
| I(D * Period) | 2.750* |
| (1.338) | |
| Num.Obs. | 768 |
| R2 | 0.779 |
| R2 Adj. | 0.557 |
| R2 Within | 0.015 |
| R2 Within Adj. | 0.012 |
| Std.Errors | by: store |
| + p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001 | |
22.6.4 Identification of the differences-in-differences parameter
In this section, we use the potential outcomes framework to define the DiD parameter and then state assumptions under which the DiD parameter is identified. We assume a setting with multiple time periods and a common treatment timing denoted by \(t_0\), which is the time period when the treatment starts. Let \(Y_{it}(0)\) and \(Y_{it}(1)\) denote the untreated and treated potential outcomes for unit \(i\) at period \(t\). For example, \(Y_{it_0}(1)\) represents the treated potential outcome for unit \(i\) in period \(t_0\), while \(Y_{i,t_0-1}(0)\) represents the untreated potential outcome for unit \(i\) in period \(t_0-1\). Also, recall that \(D_i\) denotes the treatment indicator, which takes the value 1 for the treated units and 0 for the units in the control group. The target parameter of interest in this DiD design is usually the average treatment effect on the treated at time \(t\), which is defined as \[ \text{ATT}(t)=\E[Y_{it}(1)-Y_{it}(0)|D_i=1]. \]
The \(\text{ATT}(t)\) represents the average difference between treated and untreated potential outcomes for the treatment group in period \(t\). To show that \(\text{ATT}(t)\) is identified, we need the following assumptions.
The parallel trends assumption requires the average change in the untreated potential outcomes to be the same for the treatment and control groups. The no anticipation assumption states that the untreated and treated potential outcomes of the treated units are equal on average in the pre-treatment periods. That is, receiving treatment in the post-treatment periods does not affect the average potential outcomes in the pre-treatment periods. Hence, \(\text{ATT}(t)=0\) for \(t<t_0\).
Theorem 22.2 (Identification of the ATT parameter) Under the parallel trends and no anticipation assumptions, the average treatment effect on the treated at time \(t\ge t_0\) is identified as \[\text{ATT}(t)=\E[\Delta Y_{it}|D_i=1]-\E[\Delta Y_{it}|D_i=0],\] where \(\Delta Y_{it}=Y_{it}-Y_{i,t_0-1}\) is the change in the observed outcome for unit \(i\) from period \(t_0-1\) to period \(t\).
Proof (Proof of Theorem 22.2). Under the two assumptions above, for \(t\ge t_0\), we have \[ \begin{align} \text{ATT}(t)&=\E[Y_{it}(1)-Y_{it}(0)|D_i=1]\\ &=\E[Y_{it}(1)-Y_{i,t_0-1}(0)|D_i=1]-\E[Y_{it}(0)-Y_{i,t_0-1}(0)|D_i=1]\\ &=\E[Y_{it}(1)-Y_{i,t_0-1}(0)|D_i=1]-\E[Y_{it}(0)-Y_{i,t_0-1}(0)|D_i=0]\\ &=\E[Y_{it}-Y_{i,t_0-1}|D_i=1]-\E[Y_{it}-Y_{i,t_0-1}|D_i=0]\\ &=\E[\Delta Y_{it}|D_i=1]-\E[\Delta Y_{it}|D_i=0], \end{align} \] where the second equality follows by adding and subtracting \(\E[Y_{i,t_0-1}(0)|D_i=1]\), the third equality from the parallel trends assumption, and the fourth equality holds because \(Y_{it}(1)\) and \(Y_{i,t_0-1}(0)\) are equal to the observed outcomes for the treatment group under no anticipation, while \(Y_{it}(0)\) and \(Y_{i,t_0-1}(0)\) are equal to the observed outcomes for the control group.
The identification result suggests that the ATT(\(t\)) for \(t\ge t_0\) can be estimated by the DiD estimator in the 2-period case. To this end, the DiD estimator in Equation 22.5 (or Equation 22.6 or Equation 22.7) can be used by considering \(t_0-1\) as the first period and \(t\) as the second period.
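To illustrate the identification result numerically, the following simulated sketch (with hypothetical parameter values) builds potential outcomes that satisfy parallel trends and no anticipation, and checks that the difference in mean outcome changes recovers the ATT while a naive cross-sectional comparison does not:

```r
# Simulated check of Theorem 22.2 (hypothetical parameter values)
set.seed(42)
n <- 100000
D <- rbinom(n, 1, 0.4)                  # treatment indicator
# Untreated potential outcomes in periods t0-1 and t: treated units have a
# different level (selection) but the SAME average trend (+1)
Y0_pre  <- 2 * D + rnorm(n)
Y0_post <- Y0_pre + 1 + rnorm(n)        # parallel trends
Y1_post <- Y0_post + 3                  # true ATT(t) = 3
# Observed outcomes (no anticipation: pre-period outcomes are untreated)
Y_pre  <- Y0_pre
Y_post <- ifelse(D == 1, Y1_post, Y0_post)
DY <- Y_post - Y_pre
att_hat <- mean(DY[D == 1]) - mean(DY[D == 0])   # close to 3
naive   <- mean(Y_post[D == 1]) - mean(Y_post[D == 0])  # biased by selection
c(att_hat, naive)
```

The naive comparison picks up the level difference between the groups (here about 5), while the DiD comparison of changes recovers the ATT of 3.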
In the multiple time period case, we can consider the following two-way fixed effects specification for estimating the ATT parameters: \[ Y_{it} = \alpha_i + \lambda_t + \sum_{s=1}^{t_0-2}\beta_s (D_i\times P_s) + \sum_{s=t_0}^{T}\beta_s (D_i\times P_s) + u_{it} \tag{22.10}\] where \(P_s=1\) if \(s=t\) and \(P_s=0\) otherwise. The estimates of \(\beta_s\) for \(s\ge t_0\) correspond to the estimates of \(\text{ATT}(t)\) for \(t\ge t_0\).
This approach allows us to use \(t\)-statistics and confidence intervals based on clustered standard errors. We can then plot the ATT estimates along with (pointwise) confidence intervals to summarize our results. This is often referred to as an event study analysis. Note also that the \(\beta_s\) estimates for the pre-treatment periods can be used to assess the plausibility of the parallel trends assumption: if the parallel trends assumption holds, then the estimates of \(\beta_s\) for \(s<t_0\) should be statistically insignificant.
22.6.5 Application: Impact of policing on crime
We revisit Tella and Schargrodsky (2004) to study the causal effect of police presence on crime. The dataset is contained in the DS2004.dta file, which was obtained from Hansen (2022). The intervention studied in Tella and Schargrodsky (2004) involves the assignment of extra police protection to certain blocks in Buenos Aires following the terrorist attack on the main Jewish center in July 1994. The data consist of monthly observations on 876 blocks, of which 37 are provided with extra police protection, and cover the period from April to December of 1994, excluding July. Therefore, April to June constitute the pre-treatment months, while August to December represent the post-treatment period. In Table 22.9, we describe some of the variables in this dataset.
| Variable | Description |
|---|---|
| `block` | Block ID |
| `sameblock` | Jewish institution on block (treatment indicator) |
| `distance` | Number of blocks to closest Jewish institution |
| `public` | Public building on block |
| `gasstation` | Gas station on block |
| `bank` | Bank on block |
| `thefts` | Average number of motor vehicle thefts |
| `month` | Month (numeric) |
The outcome variable is thefts, which is the average monthly number of motor vehicle thefts. Below, we load the data and create the D and Period variables, where D is the treatment indicator and Period is the post-treatment indicator. We present the summary statistics of the data in Table 22.10. According to this table, 4.2% of the blocks are treated, and the average number of car thefts is around 0.093.
# Load the data
df <- read_dta("data/DS2004.dta")
# Create treatment and period indicators
df$D <- df$sameblock
df$Period <- as.integer(df$month > 7)
# Summary statistics
var_label(df$D) <- "D"
var_label(df$Period) <- "Period"
var_label(df$thefts) <- "Thefts"
datasummary_skim(
df[, c("D", "Period", "thefts")],
fmt = 3
)

|  | Unique | Missing Pct. | Mean | SD | Min | Median | Max |
|---|---|---|---|---|---|---|---|
| D | 2 | 0 | 0.042 | 0.201 | 0.000 | 0.000 | 1.000 |
| Period | 2 | 0 | 0.556 | 0.497 | 0.000 | 1.000 | 1.000 |
| Thefts | 10 | 0 | 0.093 | 0.242 | 0.000 | 0.000 | 2.500 |
We first consider estimating the average treatment effect in the post-treatment periods. To this end, we use Equation 22.9 as if we had only two periods, pre- and post-treatment. We can use the feols function from the fixest package to estimate the model. We use clustered standard errors at the block level to account for correlation of the error terms within blocks.
# Estimation of (22.9) using fixest
model4_fixest <- feols(
thefts ~ I(D*Period) | block + month,
data = df,
panel.id = ~block + month,
cluster = ~ block
)
# Print results
modelsummary(model4_fixest,
stars = TRUE,
fmt = 3,
gof_omit = "AIC|BIC|Log.Lik|F|RMSE"
)

|  | (1) |
|---|---|
| I(D * Period) | -0.078** |
| (0.024) | |
| Num.Obs. | 7884 |
| R2 | 0.198 |
| R2 Adj. | 0.097 |
| R2 Within | 0.001 |
| R2 Within Adj. | 0.001 |
| Std.Errors | by: block |
| + p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001 | |
The estimation results are presented in Table 22.11. The estimate of the average treatment effect (on the treated blocks) in the post-treatment periods is -0.078, i.e., police presence, on average, reduced car thefts by 0.078 per month in the treated blocks. This estimate is almost the same as the one reported in the first column of Table 3 in Tella and Schargrodsky (2004). Given that the average number of car thefts is around 0.093, this estimate is not negligible.
Next, we consider an event study analysis and estimate Equation 22.10. First, we need to generate the interactions of the treatment dummy with the pre- and post-treatment periods.
# Create lead and lag dummies
df$lead4 <- as.integer(df$D == 1 & df$month == 4) # April
df$lead3 <- as.integer(df$D == 1 & df$month == 5) # May
df$lead2 <- as.integer(df$D == 1 & df$month == 6) # June
df$lead1 <- as.integer(df$D == 1 & df$month == 7) # July
df$lead0 <- as.integer(df$D == 1 & df$month == 8) # August
df$lag1 <- as.integer(df$D == 1 & df$month == 9) # September
df$lag2 <- as.integer(df$D == 1 & df$month == 10) # October
df$lag3 <- as.integer(df$D == 1 & df$month == 11) # November
df$lag4 <- as.integer(df$D == 1 & df$month == 12) # December

The common treatment timing \(t_0\) is August and the \(t_0-1\) period is July. Thus, in the event study analysis, July will be the baseline period (reference period) and \(\beta_{\text{july}}=0\). We use the fixed effects estimator to estimate the event study regression in Equation 22.10.
# Estimation of (22.10) using fixest
model5_fixest <- feols(
thefts ~ lead4 + lead3 + lead2 + lead0 + lag1 + lag2 + lag3 + lag4 |
block + month,
data = df,
panel.id = ~block + month,
cluster = ~block
)

The estimation results are presented in Table 22.12. The estimates of the lead variables (pre-treatment periods) are statistically insignificant, which provides statistical evidence for the plausibility of the parallel trends assumption.
modelsummary(model5_fixest,
stars = TRUE,
fmt = 3,
gof_omit = "AIC|BIC|Log.Lik|F|RMSE"
)

|  | (1) |
|---|---|
| lead4 | 0.032 |
| (0.062) | |
| lead3 | 0.009 |
| (0.037) | |
| lead2 | 0.073 |
| (0.045) | |
| lead0 | -0.043 |
| (0.033) | |
| lag1 | -0.065*** |
| (0.018) | |
| lag2 | -0.027 |
| (0.039) | |
| lag3 | -0.053** |
| (0.020) | |
| lag4 | -0.058** |
| (0.021) | |
| Num.Obs. | 7884 |
| R2 | 0.199 |
| R2 Adj. | 0.097 |
| R2 Within | 0.002 |
| R2 Within Adj. | 0.001 |
| Std.Errors | by: block |
| + p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001 | |
Alternatively, we can resort to the OLS estimator to estimate Equation 22.10. The following code chunk produces the same estimates as the fixed effects estimator. Note that we use factor(month) and factor(block) to include month and block fixed effects, respectively.
# Fit the model with lm()
model6_lm <- lm(
thefts ~ lead4 + lead3 + lead2 + lead0 + lag1 + lag2 + lag3 + lag4
+ factor(month) + factor(block),
data = df
)

To generate the event study plot, we next extract the point estimates and standard errors. We then generate (pointwise) 95% confidence intervals.
# Vector of coefficients for the leads/lags
theta <- coef(model6_lm)[c("lead4", "lead3", "lead2", "lead0",
"lag1", "lag2", "lag3", "lag4")]
# Insert 0 for the baseline period (July)
theta <- c(theta[1:3], 0, theta[4:8])
# Clustered SEs at the block level
vcov_block <- vcovCL(model6_lm, cluster = ~ df$block)
se <- sqrt(diag(vcov_block))
theta_se <- se[c("lead4", "lead3", "lead2", "lead0",
"lag1", "lag2", "lag3", "lag4")]
# Insert 0 for baseline period
theta_se <- c(theta_se[1:3], 0, theta_se[4:8])
# X-axis for plotting: -4,...,4
xax <- -4:4
# 95% confidence intervals
thetal <- theta - 1.96 * theta_se
thetau <- theta + 1.96 * theta_se

Finally, we plot the event study estimates and confidence intervals. The plot is shown in Figure 22.3. The treatment effect estimates for the pre-treatment periods are statistically insignificant, which provides statistical evidence for the plausibility of the parallel trends assumption. The estimates for the post-treatment periods are negative, suggesting that police presence reduces car thefts in the post-treatment periods.
# The event study plot
# Prepare a data frame for plotting
plot_df <- data.frame(
period = xax,
estimate = theta,
lower = thetal,
upper = thetau
)
# Event-study plot
ggplot(plot_df, aes(x = period, y = estimate)) +
geom_line(color = "steelblue") +
geom_point(color = "steelblue", size = 2) +
geom_errorbar(aes(ymin = lower, ymax = upper), width = 0.2, color = "steelblue") +
geom_vline(xintercept = -1, linetype = "dashed", color = "black") +
geom_hline(yintercept = 0, linetype = "dashed", color = "black") +
scale_x_continuous(breaks = xax) +
ylim(-0.4, 0.4) +
labs(
x = "Period",
y = "ATT estimates"
)
22.7 Regression discontinuity designs
Another example of quasi-experimental methods is the regression discontinuity design (RDD). In an RDD, the treatment and control groups are formed when a continuous variable \(W\), known as the running variable, crosses a threshold \(w_0\) (the cutoff level). The main idea of the RDD is that units with running variable values near the threshold are comparable in terms of characteristics, except for the receipt of treatment.1
Example 22.5 (The effect of attending summer school on the next year’s GPA) Consider the hypothetical example illustrated in Figure 22.4, where \(Y\) is the next year’s grade point average (GPA) and \(W\) is the current year’s GPA. Let us assume that students are required to attend a mandatory summer school if their current year’s GPA is below \(w_0=2\). We can then estimate the effect of attending summer school on next year’s GPA by comparing students with \(W\) below \(w_0=2\) to those with \(W\) above \(w_0=2\). In other words, the running variable defines the treatment and control groups:
- Students with \(W\) below \(w_0=2\) constitute the treatment group.
- Students with \(W\) above \(w_0=2\) constitute the control group.
# Hypothetical regression discontinuity design scatterplot
# Set seed for reproducibility
set.seed(45)
# 1. Generate the data
W <- runif(100, 0, 4)
y <- 0.8 * W - 0.95 * (W >= 2) + runif(100, 0, 2)
df <- data.frame(W = W, y = y)
ggplot(df, aes(x = W, y = y)) +
geom_point(color = "steelblue", size = 2) +
geom_vline(xintercept = 2, linetype = "dashed", color = "black", linewidth = 1.2) +
geom_smooth(data = subset(df, W < 2), method = "lm", se = FALSE, color = "black", linewidth = 1.2) +
geom_smooth(data = subset(df, W >= 2), method = "lm", se = FALSE, color = "black", linewidth = 1.2) +
labs(x = "W", y = "Y")
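The discontinuity in a sharp design like the one above can be estimated by fitting separate linear regressions on each side of the cutoff within a bandwidth and comparing their intercepts at \(w_0\). The following base-R sketch uses freshly simulated (hypothetical) data and an ad hoc bandwidth; in practice, data-driven bandwidth selection and robust inference are handled by packages such as rdrobust, used later in this chapter:

```r
# Sharp RDD: estimate the jump at the cutoff with two local linear fits
# (hypothetical DGP and an ad hoc bandwidth, for illustration only)
set.seed(7)
n <- 20000
W <- runif(n, 0, 4); w0 <- 2
Y <- 0.8 * W + 1.5 * (W >= w0) + rnorm(n, sd = 0.5)   # true jump = 1.5
h <- 0.5                                               # ad hoc bandwidth
left  <- W >= w0 - h & W < w0
right <- W >= w0 & W <= w0 + h
fit_l <- lm(Y ~ I(W - w0), subset = left)              # fit below the cutoff
fit_r <- lm(Y ~ I(W - w0), subset = right)             # fit above the cutoff
tau_hat <- coef(fit_r)[1] - coef(fit_l)[1]             # intercept gap at w0
tau_hat                                                # close to 1.5
```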
There are two types of RDDs: (i) the sharp RDD and (ii) the fuzzy RDD. In the sharp RDD, the treatment group consists of individuals with the running variable values above (or below) the threshold \(w_0\), and the control group consists of those with the running variable values below (or above) the threshold. Thus, in the sharp RDD, the cutoff clearly separates the treatment and control groups. The hypothetical example provided in Figure 22.4 is an example of the sharp RDD study. In the fuzzy RDD, the probability of treatment is a continuous function of the running variable, with a discontinuity at the cutoff.
In an RDD study, the treatment effect parameter is the average treatment effect at the threshold.
Definition 22.6 (Average treatment effect at the threshold) The average treatment effect at the threshold \(w_0\), denoted by \(\tau\), is defined as \[ \tau=\E[Y_i(1)-Y_i(0)|W_i=w_0]. \]
In both sharp and fuzzy designs, we first state the assumptions required to identify \(\tau\), and then discuss how to estimate \(\tau\) using the observed data.
22.7.2 Fuzzy regression discontinuity designs
In the fuzzy RDD, the probability of treatment changes continuously with the running variable, except at the threshold \(w_0\), where it exhibits a discontinuous jump. Let \(p(w)=P(D=1|W=w)\) be the probability of treatment (the propensity score) given the running variable \(W=w\). The fuzzy RDD requires that \(p(w)\) is discontinuous at the threshold, i.e., \(\lim_{w\downarrow w_0}p(w)\neq \lim_{w\uparrow w_0}p(w)\).
The identification of \(\tau\) in the fuzzy RDD requires the following assumptions (Hansen (2022), Wooldridge (2025)).
The first assumption is the continuity assumption from the sharp RDD. The second assumption states that the probability of treatment is discontinuous at the threshold \(w_0\). We illustrate this assumption in Figure 22.6 by assuming that the probability of treatment increases with the running variable. The figure shows that the probability of treatment jumps discontinuously to 0.6 at the threshold \(w_0\). This contrasts with the sharp RDD, in which the probability of treatment is 0 below the threshold and 1 above it, i.e., \(p(w)=0\) if \(w<w_0\) and \(p(w)=1\) if \(w\geq w_0\).
# Probability of treatment in the fuzzy RDD
# 1. Create the data
W_left <- seq(0, 5, length.out = 100)
W_right <- seq(5, 10, length.out = 100)
F_left <- 0.2 * sqrt(W_left)
F_right <- 0.6 + 0.05 * sqrt(W_right - 5)
df_left <- data.frame(W = W_left, F = F_left, side = "left")
df_right <- data.frame(W = W_right, F = F_right, side = "right")
df <- rbind(df_left, df_right)
# 2. Define the left and right limit points at cutoff
limit_points <- data.frame(
W = c(5, 5),
F = c(F_left[length(F_left)], F_right[1]),
type = c("left", "right")
)
# 3. Plot using ggplot2
ggplot(df, aes(x = W, y = F)) +
geom_line(color = "steelblue", linewidth = 1) +
geom_vline(xintercept = 5, linetype = "dashed", color = "black", linewidth = 1) +
geom_point(data = limit_points[limit_points$type=="left", ],
aes(x = W, y = F), color = "steelblue", fill = "white", shape = 21, size = 3, stroke = 1.2) +
geom_point(data = limit_points[limit_points$type=="right", ],
aes(x = W, y = F), color = "steelblue", size = 3) +
labs(
x = expression(w),
y = expression(p(w))
)
The third assumption is also known as the local unconfoundedness of treatment gains. It states that the average treatment effect at \(w\in(w_0-\epsilon,w_0+\epsilon)\) is the same for treated and untreated units.
Theorem 22.4 (Identification of \(\tau\) in the fuzzy RDD) Under Assumptions 1-3, \(\tau\) is identified as \[ \begin{align} \tau=\frac{\lim_{w\downarrow w_0}\E[Y_i|W_i=w]-\lim_{w\uparrow w_0}\E[Y_i|W_i=w]}{\lim_{w\downarrow w_0}p(w)-\lim_{w\uparrow w_0}p(w)}. \end{align} \]
Proof (Proof of Theorem 22.4). See Hansen (2022) or Wooldridge (2025) for the proof.
Theorem 22.4 indicates that \(\tau\) is given by the ratio of the jump in the conditional mean of the outcome variable at the threshold to the jump in the probability of treatment at the same threshold. The identification result indicates that we can resort to the local linear estimator to estimate both the numerator and the denominator. See Calonico, Cattaneo, and Titiunik (2014) and Hansen (2022) for details. The rdrobust module can also handle the estimation of \(\tau\) in the fuzzy RDD.
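A minimal simulated sketch of the ratio in Theorem 22.4 follows, estimating both jumps by local linear fits and taking their ratio. The DGP and bandwidth are hypothetical; in practice, supplying the treatment indicator through rdrobust's fuzzy argument automates this estimation with data-driven bandwidths and robust inference:

```r
# Fuzzy RDD: ratio of the jump in E[Y|W] to the jump in the treatment
# probability at the cutoff (hypothetical parameter values)
set.seed(11)
n <- 50000
W <- runif(n, -1, 1); w0 <- 0
p <- 0.2 + 0.1 * W + 0.5 * (W >= w0)   # propensity jumps by 0.5 at w0
D <- rbinom(n, 1, p)                   # realized treatment status
Y <- 1 + 2 * D + 0.3 * W + rnorm(n)    # true tau = 2
h <- 0.5                               # ad hoc bandwidth
left  <- W >= w0 - h & W < w0
right <- W >= w0 & W <= w0 + h
# Jump of a variable's conditional mean at the cutoff via local linear fits
jump <- function(v) {
  coef(lm(v[right] ~ I(W[right] - w0)))[1] -
    coef(lm(v[left] ~ I(W[left] - w0)))[1]
}
tau_hat <- jump(Y) / jump(D)           # Wald-type ratio estimator
tau_hat                                # close to 2
```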
22.7.3 Application: The effect of an anti-poverty program on child mortality
Ludwig and Miller (2007) use a sharp RDD to study the effect of federal grant-writing assistance for the anti-poverty program Head Start on child mortality rates in the United States. The Head Start program, founded in 1965, provides funds to local municipalities for preschool, health, and social services for poor children aged three to five and their families.3 The county poverty rate measured in 1960, with a cutoff of 59.1984, is used as the running variable to determine whether a county is eligible for the grant-writing assistance; this cutoff results in 300 counties being eligible. The outcome variable \(Y\) is the county mortality rate during the 1973-1983 period. The dataset is contained in the LM2007.dta file, which is obtained from Hansen (2022). Some of the variables included in the dataset are listed below:
| Variable name | Description |
|---|---|
| `povrate60` | County poverty rate in 1960 (percent) |
| `mort_age59_related_postHS` | Mortality, ages 5–9, Head Start–related causes, 1973–1983 |
| `mort_age59_injury_postHS` | Mortality, ages 5–9, injuries, 1973–1983 |
| `mort_age59_all_postHS` | Mortality, ages 5–9, all causes, 1973–1983 |
| `mort_age25plus_related_postHS` | Mortality, ages 25+, Head Start–related causes, 1973–1983 |
| `mort_age25plus_injuries_postHS` | Mortality, ages 25+, injuries, 1973–1983 |
| `mort_age59_related_preHS` | Mortality, ages 5–9, Head Start–related causes, 1959–1964 |
| `census1960_pop` | Census 1960 county population |
The outcome variable is denoted by mort_age59_related_postHS and the running variable by povrate60. Table 22.14 provides the summary statistics for these two variables.
# Load the data
df_head <- read_dta("data/LM2007.dta")
colnames(df_head)
 [1] "state" "povrate60"
[3] "mort_age59_related_postHS" "mort_age59_injury_postHS"
[5] "mort_age59_all_postHS" "mort_age25plus_related_postHS"
[7] "mort_age25plus_injuries_postHS" "mort_age59_related_preHS"
[9] "census1960_pop" "census1960_pctsch1417"
[11] "census1960_pctsch534" "census1960_pctsch25plus"
[13] "census1960_pop1417" "census1960_pop534"
[15] "census1960_pop25plus" "census1960_pcturban"
[17] "census1960_pctblack"
# Summary statistics
var_label(df_head$mort_age59_related_postHS) <- "Mortality Rate"
var_label(df_head$povrate60) <- "Poverty Rate"
datasummary_skim(
df_head[, c("mort_age59_related_postHS", "povrate60")],
fmt = 3
)

|  | Unique | Missing Pct. | Mean | SD | Min | Median | Max |
|---|---|---|---|---|---|---|---|
| Mortality Rate | 1034 | 0 | 2.254 | 5.726 | 0.000 | 0.000 | 136.054 |
| Poverty Rate | 2783 | 0 | 36.735 | 15.268 | 15.209 | 33.593 | 81.570 |
We use the rdrobust function with the cutoff c=59.1984 to estimate the RDD parameter \(\tau\). The estimated RDD parameter is -2.409 with a robust \(z\)-statistic of -2.032. This is a large effect given that the mean child mortality rate is around 2.254, as shown in Table 22.14: the estimate indicates that the federal grant-writing assistance reduced Head Start–related child mortality by about 2.4 deaths per 100,000.
Y <- df_head$mort_age59_related_postHS
W <- df_head$povrate60
rdd_result <- rdrobust(y = Y, x = W, c = 59.1984)
summary(rdd_result)

Sharp RD estimates using local polynomial regression.
Number of Obs. 2783
BW type mserd
Kernel Triangular
VCE method NN
Number of Obs. 2489 294
Eff. Number of Obs. 234 180
Order est. (p) 1 1
Order bias (q) 2 2
BW est. (h) 6.811 6.811
BW bias (b) 10.726 10.726
rho (h/b) 0.635 0.635
Unique Obs. 2489 294
=====================================================================
Point Robust Inference
Estimate z P>|z| [ 95% C.I. ]
---------------------------------------------------------------------
RD Effect -2.409 -2.032 0.042 [-5.462 , -0.099]
=====================================================================
We can use the rdplot function to generate the fitted polynomials along with the sample averages of the outcome variable within each bin, as shown in Figure 22.7. The plot shows a clear jump in the fitted polynomials at the cutoff, which is consistent with the estimated RDD parameter.
rdplot(y=Y, x=W, binselect="qsmv",c=59.1984,
y.label="Mortality Rate",
title=NULL,
x.label="Poverty Rate",
col.lines="steelblue")
The approach we presented in this section is called the continuity framework. A second approach, known as the local randomization framework, is based on the idea that the RDD can be viewed as a randomized experiment around the cutoff. See Cattaneo, Idrobo, and Titiunik (2020) for details.↩︎
A kernel is a non-negative real-valued function. See Wikipedia for commonly used kernel functions in non-parametric econometrics.↩︎