Introduction to Econometrics with Python

Authors

Published

October 23, 2025

Preface

This book serves as a Python companion to the well-known textbook by Stock and Watson (2020), which we use in our undergraduate econometrics courses. The first semester covers Chapters 1 to 9 of the textbook, and the second semester covers Chapters 10 to 17. This book compiles the lecture and lab materials we use for teaching these courses. Lecture slides are also available upon request.

We organize the book into six parts: (i) Python, (ii) Introduction and Review, (iii) Fundamentals of Regression Analysis, (iv) Further Topics in Regression Analysis, (v) Regression Analysis of Economic Time Series Data, and (vi) Theoretical Topics. Apart from the part on Python, the parts are named as in Stock and Watson (2020). Each part consists of several chapters, with each chapter covering a specific topic.

In the first part, which consists of ten chapters, we introduce the fundamentals of Python for econometric analysis. This part assumes that readers have no prior exposure to Python. In the first chapter, we show how to install Python through Anaconda, and introduce the libraries commonly used in econometrics along with Python's built-in data types and structures. In the second chapter, we introduce the NumPy module for numerical computations. The third chapter focuses on the main control structures in Python, such as loops, conditionals, comprehensions, and exception handling. In the fourth chapter, we show how to write efficient custom functions. The fifth chapter introduces key statistical functions from NumPy and SciPy. The sixth chapter presents the Matplotlib and Seaborn modules for data visualization. In the seventh chapter, we introduce the Pandas module for data management. In the eighth chapter, we cover some useful numerical methods for statistics and econometrics. The ninth chapter introduces the SymPy module for symbolic computation. Finally, in the tenth chapter, we introduce the basics of object-oriented programming (OOP) in Python.

There are three chapters in Introduction and Review. In this part, we define key econometric concepts, discuss data types, and provide a brief review of the probability theory and statistics required for econometric analysis.

The part titled Fundamentals of Regression Analysis consists of six chapters. Here, we introduce regression models with one and multiple regressors, cover hypothesis testing and confidence intervals in regression models, explore nonlinear regression models, and discuss a framework for identifying the strengths and limitations of regression studies.

In Further Topics in Regression Analysis, we explore extensions of the multiple linear regression model across five chapters. We begin with commonly used panel data models, then examine regression models with binary dependent variables. Next, we discuss the instrumental variable method when the error term is correlated with the regressor of interest, explore experimental and quasi-experimental methods, and finally cover widely used machine learning techniques for prediction when many regressors are present.

The part titled Regression Analysis of Economic Time Series Data covers time series methods and consists of three chapters. In the first chapter, we introduce basic time series concepts, including unit root testing, autoregressions, and autoregressive distributed lag models for forecasting. In the second chapter, we demonstrate how to use time series analysis to estimate dynamic causal effects. Finally, in the third chapter, we cover some additional time series models, including vector autoregressions, cointegration, vector error correction models, and volatility models.

The final part, Theoretical Topics, contains two chapters on econometric theory. In the first chapter, we present formal results along with their proofs for the regression model with one regressor. In the second chapter, using linear algebra notation, we provide theoretical results for the multiple linear regression model, instrumental variables regression, and generalized method of moments estimation for linear models.

Finally, we provide some technical details related to Chapters 14, 24, 28, and 29 in the appendices.

Notation

We adopt the standard notation used in Stock and Watson (2020). In particular, for an econometric model, we use Latin letters such as \(Y\), \(X\), \(W\), and \(Z\) to denote variables, and Greek letters such as \(\beta\), \(\gamma\), and \(\delta\) to denote unknown parameters.¹ For the convenience of readers, we list the Greek letters in the following table.

Table 1.1: Greek Letters

Greek Letter	Name	Greek Letter	Name
\(\alpha\)	alpha	\(\beta\)	beta
\(\gamma\)	gamma	\(\delta\), \(\Delta\)	delta
\(\epsilon\), \(\varepsilon\)	epsilon	\(\zeta\)	zeta
\(\eta\)	eta	\(\theta\), \(\Theta\)	theta
\(\iota\)	iota	\(\kappa\)	kappa
\(\lambda\), \(\Lambda\)	lambda	\(\mu\)	mu
\(\nu\)	nu	\(\xi\), \(\Xi\)	xi
\(o\)	omicron	\(\pi\), \(\Pi\)	pi
\(\rho\), \(\varrho\)	rho	\(\sigma\), \(\varsigma\), \(\Sigma\)	sigma
\(\tau\)	tau	\(\upsilon\), \(\Upsilon\)	upsilon
\(\varphi\), \(\phi\), \(\Phi\)	phi	\(\chi\)	chi
\(\psi\), \(\Psi\)	psi	\(\omega\), \(\Omega\)	omega

The notation used for the error term (or disturbance term) is not uniform in the literature. Some authors use Latin letters because it is a random variable, while others use Greek letters because it is an unknown term. Following Stock and Watson (2020), we use lowercase \(u\), \(v\), or \(e\) to denote the error term.

For vectors, we use lowercase boldface letters such as \(\bs{y}\), \(\bs{x}\), \(\bs{w}\), and \(\bs{z}\), and for matrices, we use uppercase boldface letters such as \(\bs{X}\), \(\bs{W}\), and \(\bs{Z}\). However, this convention does not apply exactly to the regression model in matrix form. Following Stock and Watson (2020), we use \(\bs{Y}\) to denote the \(n\times1\) vector of observations on the dependent variable, \(\bs{X}\) the \(n\times k\) matrix of independent variables, \(\bs{\beta}\) the \(k\times1\) vector of coefficients, and \(\bs{U}\) the \(n\times1\) vector of error terms.

We use hat or tilde notation to denote estimators. For example, \(\hat{\beta}_0\) and \(\hat{\beta}_1\) are estimators of \(\beta_0\) and \(\beta_1\), respectively. In scalar form, \(\hat{Y}\) denotes the predicted value of the dependent variable, and \(\hat{u}\) denotes the residual. In vector form, \(\hat{\bs{Y}}\) denotes the \(n\times1\) vector of predicted values, and \(\hat{\bs{U}}\) denotes the \(n\times1\) vector of residuals.
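As a minimal sketch of how this notation maps to code, the following computes \(\hat{\bs{\beta}}\), \(\hat{\bs{Y}}\), and \(\hat{\bs{U}}\) for a model with an intercept and one regressor. The data values and the variable names (x, Y, X, beta_hat, Y_hat, U_hat) are made up for illustration and are not taken from Stock and Watson (2020); np.linalg.lstsq is used here as one convenient way to obtain the least squares solution.

```python
import numpy as np

# Hypothetical data: n = 5 observations on one regressor (illustrative values only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])   # n x 1 vector of observations

# n x k matrix of regressors; the column of ones corresponds to the intercept (k = 2)
X = np.column_stack([np.ones_like(x), x])

# Least squares estimates: beta_hat[0] estimates beta_0, beta_hat[1] estimates beta_1
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)

Y_hat = X @ beta_hat   # n x 1 vector of predicted values
U_hat = Y - Y_hat      # n x 1 vector of residuals

print(beta_hat)
print(U_hat.sum())     # numerically zero because X contains an intercept column
```

Because the regressor matrix includes a column of ones, the residuals sum to zero up to floating-point error, which is a quick sanity check on any least squares fit.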

We denote a random sample on the random variable \(Y\) by \(\{Y_1, Y_2, \ldots, Y_n\}\) or \(\{Y_i\}_{i=1}^n\), where \(n\) is the sample size. In the case of time series data, we use \(\{Y_t\}_{t=1}^T\) to denote the random sample. In the case of panel data, we use double subscripts to denote observations, such as \(Y_{it}\) for the \(i\)th entity at time \(t\).

Import Conventions

In each chapter, we first import all the modules required for the econometric analysis introduced in that chapter. We follow the alias conventions used by the Python community for commonly used modules:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import scipy.stats as stats
import statsmodels.api as sm
import statsmodels.formula.api as smf

Thus, when we use np.array, we refer to the array function from NumPy.
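As a small illustration of the alias in use, the following sketch creates a NumPy array through the np prefix and calls one of its methods. The array values are arbitrary and chosen only for this example.

```python
import numpy as np

# Arbitrary illustrative values
a = np.array([1.0, 2.0, 3.0])

print(type(a).__name__)  # ndarray
print(a.mean())          # 2.0
```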

Code Examples

We use the conventional code font for displaying code in the text. Code chunks are presented in highlighted cells, as shown in the following example:

x = 2

The output of a code cell appears in the cell that follows it, as in the following example:

x = [1, 2, 3, 4]
x.append(99) # adding 99 to the list x
x

[1, 2, 3, 4, 99]

The above cell displays the updated list x. If a code cell returns multiple outputs, these outputs are displayed in separate cells, in the order they are produced. For example, the code below returns type(x) = list and type(y) = tuple in separate cells:

x = [10, 2, 35, 11, 12] # list
type(x)
y = (3, 5, 7) # tuple
type(y)

list

tuple
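Whether every expression in a cell is echoed depends on how the notebook session is configured. In a default Jupyter or IPython session, only the value of the last expression in a cell is displayed. As a sketch that shows both results in any environment, including a plain Python script, the values can be printed explicitly:

```python
x = [10, 2, 35, 11, 12]  # list
y = (3, 5, 7)            # tuple

# print() displays each result regardless of notebook display settings
print(type(x))  # <class 'list'>
print(type(y))  # <class 'tuple'>
```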

Callout Blocks

In Stock and Watson (2020), key concepts are highlighted in bold and defined within the relevant text of each chapter, while the main ideas are summarized in Key Concept boxes. We adopt the same approach and use a Quarto callout block for the key concept boxes, as illustrated in the following example:

Key Concept 9.2: Variance and standard deviation

Let \(Y\) be a random variable with mean \(\mu_Y\). The variance of \(Y\) is defined as:

Discrete case: \(\sigma^2_Y=\E\left[(Y-\mu_Y)^2\right]=\sum_{i=1}^k(y_i-\mu_Y)^2\times P(Y=y_i)\), where \(y_i\) are the possible values of \(Y\).
Continuous case: \(\sigma^2_Y=\E\left[(Y-\mu_Y)^2\right]=\int_D (y-\mu_Y)^2f_Y(y)\text{d}y\), where \(D\) is the support of \(Y\).

The square root of the variance is called the standard deviation of \(Y\) and is denoted by \(\sigma_Y\). The units of the standard deviation are the same as the units of \(Y\).
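The discrete-case formula in the key concept above can be checked with a short calculation. The sketch below uses a fair six-sided die as a hypothetical example (it is not taken from Stock and Watson (2020)) and exact fractions from the standard library so that the mean and variance are free of rounding error.

```python
from fractions import Fraction
import math

# Hypothetical example: a fair six-sided die
values = [1, 2, 3, 4, 5, 6]      # possible values y_i
probs = [Fraction(1, 6)] * 6     # P(Y = y_i) for each value

# Mean: sum of y_i * P(Y = y_i)
mu_Y = sum(y * p for y, p in zip(values, probs))

# Variance: sum of (y_i - mu_Y)^2 * P(Y = y_i)
var_Y = sum((y - mu_Y) ** 2 * p for y, p in zip(values, probs))

# Standard deviation: square root of the variance, in the same units as Y
sd_Y = math.sqrt(var_Y)

print(mu_Y)   # 7/2
print(var_Y)  # 35/12
print(sd_Y)
```

The mean is 7/2 and the variance is 35/12, so the standard deviation is roughly 1.71 in the units of the die roll.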

We also use a callout block to state econometric assumptions. For example, we use the following callout block for introducing the least squares assumptions:

Assumptions

The zero-conditional mean assumption: \(\E(u_i |X_i) = 0\), i.e., the conditional distribution of \(u_i\) given \(X_i\) has a mean of \(0\).
The random sampling assumption: \((X_i, Y_i):\, i =1,2,\dots,n\) are independently and identically distributed (i.i.d.) across observations.
The no large outliers assumption: \(\E(X_i^4)<\infty\) and \(\E(Y_i^4)<\infty\).

Definitions, Examples, and Theorems

To separate definitions, examples, and theorems from the main text, we use boxes with a white background, as illustrated in the following example:

Definition 5.1 A random variable is a real-valued function defined on the sample space of an experiment.

Data for Applications

For each methodological topic, Stock and Watson (2020) provide an application based on real-world data. We reproduce all tables, figures, and estimation results using the same datasets provided on the textbook’s web page. We provide all datasets used in this book in the GitHub repository: Datasets.

Acknowledgements

We use the theme from the book R for Data Science (2e) as the foundation for producing this book.

¹ According to econometrician Anil K. Bera, the English phrase “It is all Greek to me”, meaning “I do not understand it at all”, played a role in establishing the tradition of using Greek letters for unknown quantities in econometrics. There are also similar formulations in other languages. See the Wikipedia page for more details.