import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import seaborn as sns
from scipy.integrate import quad
from scipy.integrate import dblquad
9 Review of Probability
\[ \DeclareMathOperator{\cov}{cov} \DeclareMathOperator{\corr}{corr} \DeclareMathOperator{\var}{var} \DeclareMathOperator{\SE}{SE} \DeclareMathOperator{\E}{E} \DeclareMathOperator{\A}{\boldsymbol{A}} \DeclareMathOperator{\x}{\boldsymbol{x}} \DeclareMathOperator{\sgn}{sgn} \DeclareMathOperator{\argmin}{argmin} \newcommand{\tr}{\text{tr}} \newcommand{\bs}{\boldsymbol} \newcommand{\mb}{\mathbb} \]
9.1 Probability and random variables
We use the term experiment to refer to a process that yields outcomes. For example, tossing a coin or rolling a die constitutes an experiment. We assume that the outcomes of experiments are mutually exclusive. The sample space associated with an experiment consists of all possible outcomes. An event is a subset of the sample space.
Example 9.1 Consider an experiment of tossing three coins. The sample space of this experiment is \[ \begin{align*} \Omega=\{\text{HHH, HHT, HTH, THH, HTT, THT, TTH, TTT}\}, \end{align*} \] where \(H\) stands for heads and \(T\) stands for tails. The event of getting at least one head is \[ \begin{align*} E=\{\text{HHH, HHT, HTH, THH, HTT, THT, TTH}\}, \end{align*} \] which is a subset of the sample space \(\Omega\). The complement of the event \(E\) is the event of getting no heads, which is \(E^c=\{\text{TTT}\}\). The sample space \(\Omega\) is the event of getting at least one head or no heads, i.e., \(\Omega=E\cup E^c\).
Example 9.2 Consider an experiment of rolling two dice. The sample space of this experiment is \[ \begin{align*} \Omega=\{(x,y)\in\mathbb{Z}\times\mathbb{Z}: 1\leq x\leq 6, 1\leq y\leq 6\}, \end{align*} \] where \((x,y)\) is the outcome of the experiment, with \(x\) being the outcome of the first die and \(y\) being the outcome of the second die. The event of getting a sum of \(7\) is \[ \begin{align*} E&=\{(x,y)\in\Omega: x+y=7\}\\ &=\{(1,6), (2,5), (3,4), (4,3), (5,2), (6,1)\}. \end{align*} \] The complement of the event \(E\) is the event of getting a sum that is not \(7\), which is \(E^c=\Omega\setminus E\). The sample space \(\Omega\) is the event of getting a sum of \(7\) or not, i.e., \(\Omega=E\cup E^c\).
Associated with a sample space, we can define two functions: probability and random variable. We can define the probability of an outcome as the proportion of times that the outcome occurs in the repeated trials of the experiment. The probability of an event is the sum of the probabilities of the outcomes in the event. In the following, we provide a formal definition of probability.
Definition 9.1 (Probability) Probability is a real-valued function defined on the sample space that satisfies the following three properties:
- Non-negativity: The probability of an event is nonnegative, i.e., \(P(E)\geq 0\) for any event \(E\).
- Normalization: The probability of the sample space is \(1\), i.e., \(P(\Omega)=1\).
- Countable additivity: If \(E_1, E_2, \dots\) are mutually exclusive events, then \[ \begin{align*} P\left(\bigcup_{i=1}^\infty E_i\right) = \sum_{i=1}^\infty P(E_i). \end{align*} \]
Theorem 9.1 (Properties of Probability) The probability function defined in Definition 9.1 satisfies the following properties:
- \(P(\emptyset)=0\), where \(\emptyset\) is the empty set.
- \(P(A)=1-P(A^c)\), where \(A\) is an event.
- Let \(A\) and \(B\) be two events. Then, \(P(A)=P(A\cap B)+P(A\cap B^c)\).
- Let \(A\) and \(B\) be two events. If \(A\subseteq B\), then \(P(A)\leq P(B)\).
- \(P(A\cup B)=P(A)+P(B)-P(A\cap B)\) for any two events \(A\) and \(B\).
- \(P(A\cup B)\leq P(A)+P(B)\) for any two events \(A\) and \(B\).
The properties listed in Theorem 9.1 can be proved using the properties of the probability function defined in Definition 9.1 and elementary set theory.
Example 9.3 Consider Example 9.1. Let \(A\) be the event of getting exactly two heads and \(B\) be the event that the first coin tossed is tails. We want to compute \(P(A)\), \(P(B)\), \(P(A\cup B)\), \(P(A^c)\), \(P(B^c)\), and \(P(A\cap B)\). The events are \[ \begin{align*} &A=\{\text{HHT, HTH, THH}\},\\ &B=\{\text{TTT, THT, TTH, THH}\},\\ &A\cap B=\{\text{THH}\},\\ &A\cup B=\{\text{HTH, HHT, THH, TTH, THT, TTT}\}. \end{align*} \] Assuming that each outcome in the sample space is equally likely, we have \(P(A)=3/8\), \(P(B)=4/8=1/2\), \(P(A\cup B)=6/8=3/4\), \(P(A^c)=1-P(A)=5/8\), \(P(B^c)=1-P(B)=1/2\), and \(P(A\cap B)=1/8\).
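As a quick numerical check, the following short snippet (added here as a sketch) enumerates the sample space with Python's itertools and recomputes these probabilities under the equally likely assumption; the helper function prob is introduced only for illustration.
from itertools import product

# All eight equally likely outcomes of tossing three coins
omega = ["".join(s) for s in product("HT", repeat=3)]
A = {s for s in omega if s.count("H") == 2}  # exactly two heads
B = {s for s in omega if s[0] == "T"}        # first coin is tails

def prob(event):
    # Probability of an event = number of outcomes in the event / size of the sample space
    return len(event) / len(omega)

print(prob(A), prob(B), prob(A | B), prob(A & B))  # 0.375 0.5 0.75 0.125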
Definition 9.2 A random variable is a real-valued function defined on the sample space of an experiment.
Definition 9.2 suggests that a random variable is a numerical summary of the sample space of an experiment. The set of values that a random variable can take is called the support of the random variable. A discrete random variable has a support that is countable, while a continuous random variable has a support that is an uncountable set.
Example 9.4 Consider Example 9.1. Let \(X\) be the random variable that counts the number of heads in the three tosses. Then, \(X\) is a discrete random variable that takes values from the set \(\{0, 1, 2, 3\}\). The events \(\{X=x\}\), where \(x\in\{0, 1, 2, 3\}\), are defined as follows: \[ \begin{align*} &\{X=0\}=\{\text{TTT}\},\\ &\{X=1\}=\{\text{HTT, THT, TTH}\},\\ &\{X=2\}=\{\text{HHT, HTH, THH}\},\\ &\{X=3\}=\{\text{HHH}\}. \end{align*} \]
9.2 Probability distribution of a discrete random variable
The probability distribution (or probability mass function) of a discrete random variable lists all possible values the variable can take, along with the probability associated with each value.
Definition 9.3 (Probability distribution of a discrete random variable) Let \(X\) be a discrete random variable that can take on \(k\) values, \(x_1, x_2, \dots, x_k\). The probability distribution of \(X\) is a function \(P(X=x)\) that assigns a probability to each value \(x_i\), such that \[ P(X=x_i) = p_i,\quad i=1,2,\dots,k, \] where \(p_i\geq 0\) for all \(i\) and \(\sum_{i=1}^k p_i = 1\).
Example 9.5 Consider an experiment of tossing a coin three times. Let \(X\) be the number of heads in the three tosses. Then, \(X\) is a discrete random variable that takes values from the set \(\{0, 1, 2, 3\}\). The probability distribution of \(X\) is given by \[ \begin{align*} &P(X=0)=P(\text{TTT})=1/8,\\ &P(X=1)=P(\text{HTT, THT, TTH})=3/8,\\ &P(X=2)=P(\text{HHT, HTH, THH})=3/8,\\ &P(X=3)=P(\text{HHH})=1/8. \end{align*} \]
Example 9.6 Let \(M\) be a random variable that measures the number of times your wireless network connection fails while you are writing a term paper. Let us assume that \(M\) takes values from \(\{0,1,2,3,4\}\), with the corresponding probabilities given by \(\{0.80, 0.10, 0.06, 0.03, 0.01\}\). In the following, we summarize the probability distribution of \(M\) in a table.
# Probability distribution of M
outcomes = [0, 1, 2, 3, 4]
# probabilities
p = [0.80, 0.10, 0.06, 0.03, 0.01]
# cumulative probabilities
cp = np.cumsum(p)
y = pd.DataFrame([outcomes, p, cp])
y.index = ["Outcome", "Probability distribution", "Cumulative probability distribution"]
y.columns = ["", "", "", "", ""]  # y.columns = [""] * y.shape[1]
When we print the dataframe y, we obtain the probability distribution shown in Table 9.1. The first row of Table 9.1 gives the possible outcomes of the random variable \(M\), the second row gives the probability distribution of \(M\), and the last row gives the cumulative probability distribution of \(M\), which is defined below.
| Outcome | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| Probability distribution | 0.80 | 0.10 | 0.06 | 0.03 | 0.01 |
| Cumulative probability distribution | 0.80 | 0.90 | 0.96 | 0.99 | 1.00 |
We can use either a bar plot or a stem plot to illustrate the probability distribution of discrete random variables. Consider the discrete random variable defined in Example 9.6. In Figure 9.1, we use the stem plot for the probability distribution, while in Figure 9.2, we use the bar plot.
# Probability distribution of M
sns.set_style("darkgrid")
fig, ax = plt.subplots(figsize=(5, 3.5))
ax.stem(outcomes, p, linefmt="steelblue", markerfmt="o", basefmt=" ")
ax.set_xlabel("Number of failures")
ax.set_ylabel("Probability")
ax.set_ylim(0, max(p) + 0.02)
ax.set_xticks(outcomes)
plt.show()

# Probability distribution of M
fig, ax = plt.subplots(figsize=(5, 3.5))
ax.bar(outcomes, p, width=0.5, color="steelblue")
ax.set_xlabel("Number of failures")
ax.set_ylabel("Probability")
ax.set_ylim(0, max(p) + 0.02)
ax.set_xticks(outcomes)
plt.show()

The probability of an event can be computed from the probability distribution. For example, the probability of the event of one or two failures is the sum of the probabilities of the constituent outcomes: \(P(\{M=1\}\cup \{M=2\})=P(M=1)+P(M=2)= 0.16\).
Theorem 9.2 (Properties of probability distribution functions) Let \(P(Y=y)\) be the probability distribution function of the discrete random variable \(Y\), where \(y\) is a possible value of \(Y\). The following properties hold:
- \(0\leq P(Y=y)\leq 1\) for all \(y\).
- \(\sum_{y} P(Y=y)=1\), where the sum is over all possible values of \(Y\).
Definition 9.4 A Bernoulli random variable is a discrete random variable that takes only two values.
Example 9.7 Let \(G\) be the sex of the next new person you meet, where \(G = 0\) indicates that the person is male and \(G = 1\) indicates that the person is female. Then, \(G\) is a Bernoulli random variable with the following probability distribution: \[ \begin{align} G= \begin{cases} 1,\quad\text{with probability $p$},\\ 0,\quad\text{with probability $1-p$}. \end{cases} \end{align} \tag{9.1}\]
The cumulative probability distribution is the probability that the random variable is less than or equal to a particular value. A cumulative probability distribution is also referred to as a cumulative distribution function (CDF) or a cumulative distribution. We use \(F_Y\) to denote the CDF of the random variable \(Y\).
Definition 9.5 (Cumulative distribution function (CDF)) Let \(Y\) be a discrete random variable with probability distribution \(P(Y=y)\). Then, the CDF of \(Y\), denoted by \(F_Y\), is defined as \[ F_Y(y)=P(Y\leq y)=\sum_{y^{*}\leq y} P(Y=y^{*}), \] where the sum is over all possible values \(y^{*}\) of \(Y\) that are less than or equal to \(y\).
Example 9.8 The last row of Table 9.1 in Example 9.6 gives the cumulative probability distribution of the random variable \(M\). For example, the probability of at most one connection failure is \(P(M\leq1)=0.90\). The CDF of \(M\) is given by \[ F_M(m)=P(M\leq m)=\begin{cases} 0,\quad&\text{if } m<0,\\ 0.80,\quad&\text{if } 0\leq m<1,\\ 0.90,\quad&\text{if } 1\leq m<2,\\ 0.96,\quad&\text{if } 2\leq m<3,\\ 0.99,\quad&\text{if } 3\leq m<4,\\ 1,\quad&\text{if } m\geq 4. \end{cases} \]
In Figure 9.3, we give the plot of the CDF of \(M\). Note that the CDF takes a step at the outcome values \(m\in\{0, 1, 2, 3,4\}\) and stays constant otherwise. The height of the step at a particular point \(m\) is equal to \(P(M = m)\), the probability associated with \(m\).
# The CDF of M
data = np.arange(-1, 6)
y = np.cumsum(p)
yn = np.insert(y, 0, 0)
fig, ax = plt.subplots(figsize=(5, 3.5))
ax.hlines(y=yn, xmin=data[:-1], xmax=data[1:], color="steelblue")
ax.vlines(
    x=data[1:-1], ymin=yn[:-1], ymax=yn[1:], color="steelblue", linestyle="dashed"
)
ax.scatter(data[1:-1], y, color="steelblue", s=20)
ax.scatter(data[1:-1], yn[:-1], color="white", s=20, edgecolor="steelblue", zorder=2)
ax.set_xlim(data[0], data[-1])
ax.set_ylim([-0.02, 1.02])
plt.show()
Theorem 9.3 (Properties of CDFs) Let \(F\) be the CDF of a random variable. Then, \(F\) has the following properties:
- \(F(-\infty)=\lim_{y\to-\infty}F(y)=0\),
- \(F(\infty)=\lim_{y\to\infty}F(y)=1\),
- \(F(y)\) is a nondecreasing function of \(y\), i.e., if \(y_1\) and \(y_2\) are any values such that \(y_1 < y_2\), then \(F(y_1) \leq F(y_2)\), and
- \(F(y)\) is right-continuous, i.e., \(\lim_{y\downarrow y_0}F(y)=F(y_0)\).
Example 9.9 Consider the following CDF: \[ F(y)=\begin{cases} 0, \quad&\text{if } y<0,\\ 1-e^{-y^2/2},\quad&\text{if } 0\leq y<\infty.\\ \end{cases} \] We want to show that \(F\) is a CDF. We need to check the properties of \(F\) listed in Theorem 9.3.
- \(F(-\infty)=\lim_{y\to-\infty}F(y)=0\) is satisfied since \(F(y)=0\) for \(y<0\).
- \(F(\infty)=\lim_{y\to\infty}F(y)=1\) is satisfied since \(\lim_{y\to\infty}F(y)=\lim_{y\to\infty}(1-e^{-y^2/2})=1\).
- To show that \(F(y)\) is nondecreasing, we compute its derivative: \[ F'(y)=\begin{cases} 0, \quad&\text{if } y<0,\\ e^{-y^2/2}y,\quad&\text{if } 0\leq y<\infty.\\ \end{cases} \] Since \(F'(y)\geq 0\) for all \(y\in\mathbb{R}\), we conclude that \(F(y)\) is nondecreasing.
- To show that \(F(y)\) is right-continuous, we compute \(\lim_{y\downarrow y_0}F(y)\) for any \(y_0\in\mathbb{R}\). If \(y_0<0\), then \(F(y)=0\) for all \(y<0\), and thus \(\lim_{y\downarrow y_0}F(y)=0=F(y_0)\). If \(y_0\geq 0\), then \(\lim_{y\downarrow y_0}F(y)=1-e^{-y_0^2/2}=F(y_0)\). Thus, \(F(y)\) is right-continuous.
The following code chunk generates the plot of \(F\) over the interval \([-1, 3]\). Note that the CDF is zero for \(y<0\) and increases to \(1\) as \(y\) increases.
# Plot the CDF F(y)
x = np.linspace(-1, 3, 1000)
F = np.where(x < 0, 0, 1 - np.exp(-(x**2) / 2))
fig, ax = plt.subplots(figsize=(5, 4))
ax.plot(x, F, color="steelblue")
ax.vlines(x=0, ymin=0, ymax=1, color="black", linestyle="-", alpha=0.5)
ax.hlines(y=0, xmin=-1, xmax=3, color="black", linestyle="-", alpha=0.5)
ax.set_xlabel("y")
ax.set_ylabel("F(y)")
plt.show()

9.3 Probability distribution of a continuous random variable
A random variable is continuous if its CDF is continuous. Formally, this means that \(\lim_{y\to y_0}F(y)=F(y_0)\) for all \(y_0\in\mathbb{R}\). For a continuous random variable, we use the probability density function (PDF) to summarize probabilities. For example, the area under the probability density function between any two points represents the probability that the random variable falls between those two points. A PDF is also known as a density function or simply a density.
Definition 9.6 Let \(Y\) be a continuous random variable with CDF \(F_Y\). Then, the PDF of \(Y\), denoted by \(f_Y\), is defined as \[ f_Y(y)=\frac{\text{d}F_Y(y)}{\text{d}y}, \] provided that the derivative exists. If \(f_Y\) is the PDF of \(Y\), then the CDF of \(Y\) can be expressed as \[ F_Y(y)=\int_{-\infty}^y f_Y(t)\text{d}t. \]
Example 9.10 Let \(Y\) be a continuous random variable with the following CDF: \[ \begin{align*} F_Y(y)= \begin{cases} 0,\quad&\text{for}\quad y<0,\\ y^2,\quad&\text{for}\quad 0\leq y\leq 1,\\ 1,\quad&\text{for}\quad y>1. \end{cases} \end{align*} \] Then, the PDF of \(Y\) is given by \[ \begin{align*} f_Y(y)=\frac{d}{dy}F_Y(y)= \begin{cases} \frac{d}{dy}0=0,\quad&\text{for}\quad y<0,\\ \frac{d}{dy}y^2=2y, \quad&\text{for}\quad 0\leq y< 1,\\ \frac{d}{dy}1=0,\quad&\text{for}\quad y>1. \end{cases} \end{align*} \]
Note that \(f_Y(y)\) is not defined at \(y=1\) because \(F_Y(y)\) is not differentiable at this point. To see this, we compute the left and right derivatives of \(F_Y(y)\) at \(y=1\): \[ \begin{align*} &F^{'}_{Y-}(1)=\lim_{h\to 0^-}\frac{F_Y(1+h)-F_Y(1)}{h}=\lim_{h\to 0^-}\frac{(1+h)^2-1}{h}=2,\\ &F^{'}_{Y+}(1)=\lim_{h\to 0^+}\frac{F_Y(1+h)-F_Y(1)}{h}=\lim_{h\to 0^+}\frac{1-1}{h}=0. \end{align*} \]
The following code chunk generates the plots of the CDF and PDF of \(Y\). Note that the CDF is continuous, while the PDF is not continuous at \(y=1\). The CDF is zero for \(y\leq0\) and increases to \(1\) as \(y\) increases. The PDF is zero for \(y\leq0\), increases linearly from \(0\) to \(2\) as \(y\) increases from \(0\) to \(1\), and is zero for \(y>1\).
# CDF and PDF plots
# Generate y values for the range
y_cdf = np.linspace(-0.5, 1.5, 100)
# Calculate the CDF
cdf = np.where(y_cdf < 0, 0, np.where(y_cdf <= 1, y_cdf**2, 1))
fig, ax = plt.subplots(figsize=(5, 3.5))
ax.plot(y_cdf, cdf, color="steelblue")
ax.set_xlabel("y")
ax.set_ylabel("F(y)")

# Plot the PDF as piecewise
# Generate x values for the PDF
y_pdf = np.linspace(-0.5, 1.5, 1000)
# Calculate the PDF
pdf = np.where(y_pdf < 0, 0, np.where((y_pdf > 0) & (y_pdf < 1), 2 * y_pdf, 0))

fig, ax = plt.subplots(figsize=(5, 3.5))
# First part: -0.5 <= y_pdf <= 1
mask1 = (y_pdf >= -0.5) & (y_pdf <= 1)
ax.plot(y_pdf[mask1], pdf[mask1], color="steelblue", label=r"$-0.5 \leq y \leq 1$")
# Second part: 1 < y_pdf <= 1.5
mask2 = (y_pdf > 1) & (y_pdf <= 1.5)
ax.plot(y_pdf[mask2], pdf[mask2], color="steelblue", label=r"$1 < y \leq 1.5$")
ax.scatter([1], [0], facecolors="white", edgecolors="steelblue", s=40, zorder=2)
ax.scatter([1], [2], facecolors="white", edgecolors="steelblue", s=40, zorder=2)
ax.set_xlabel("y")
ax.set_ylabel("f(y)")
plt.tight_layout()
plt.show()


Example 9.11 Let \(Y\) be a continuous random variable with the following PDF: \[ \begin{align*} f_Y(y)=\begin{cases} 3y^{2},\quad&\text{for}\quad 0\leq y\leq 1,\\ 0,\quad&\text{otherwise}. \end{cases} \end{align*} \] Then, the CDF of \(Y\) is given by \[ \begin{align*} F(y)=\int_{-\infty}^yf(x)\text{d}x= \begin{cases} \int_{-\infty}^y0\text{d}x=0,\quad&\text{for}\quad y<0,\\ \int_{-\infty}^00\text{d}x+\int_{0}^y3x^2\text{d}x=y^3,\quad&\text{for}\quad 0\leq y\leq 1,\\ \int_{-\infty}^00\text{d}x+\int_{0}^13x^2\text{d}x+\int_{1}^y0\text{d}x=1,\quad&\text{for}\quad y>1. \end{cases} \end{align*} \]
The following code chunk generates the plots of the CDF and PDF of \(Y\). Note that the CDF is continuous, while the PDF is not continuous at \(y=1\). The CDF is zero for \(y\leq0\) and increases to \(1\) as \(y\) increases. The PDF is zero for \(y\leq0\), increases from \(0\) to \(3\) as \(y\) increases from \(0\) to \(1\), and is zero for \(y>1\).
# CDF and PDF plots
# Generate y values for the range
y_cdf = np.linspace(-0.5, 1.5, 100)
# Calculate the CDF
cdf = np.where(y_cdf < 0, 0, np.where(y_cdf <= 1, y_cdf**3, 1))
fig, ax = plt.subplots(figsize=(5, 3.5))
ax.plot(y_cdf, cdf, color="steelblue")
ax.set_xlabel("y")
ax.set_ylabel("F(y)")

# Plot the PDF as piecewise
y_pdf = np.linspace(-0.5, 1.5, 1000)
# Calculate the PDF
pdf = np.where(y_pdf < 0, 0, np.where((y_pdf > 0) & (y_pdf < 1), 3 * y_pdf**2, 0))
fig, ax = plt.subplots(figsize=(5, 3.5))
mask1 = (y_pdf >= -0.5) & (y_pdf <= 1)
ax.plot(y_pdf[mask1], pdf[mask1], color="steelblue", label=r"$-0.5 \leq y \leq 1$")
mask2 = (y_pdf > 1) & (y_pdf <= 1.5)
ax.plot(y_pdf[mask2], pdf[mask2], color="steelblue", label=r"$1 < y \leq 1.5$")
# Add a solid point at (1, 3)
ax.scatter([1], [3], color="steelblue", zorder=1)
# Add a hollow point at (1, 0)
ax.scatter([1], [0], facecolors="white", edgecolors="steelblue", s=40, zorder=2)
ax.set_xlabel("y")
ax.set_ylabel("f(y)")
plt.tight_layout()
plt.show()


Theorem 9.4 (Properties of probability density functions) Let \(f_Y\) be the PDF of the continuous random variable \(Y\). Then, it has the following properties:
- \(f_Y(y)\geq0\) for all \(y\in\mathbb{R}\), and
- \(\int_{\mathbb{R}}f_Y(y)\text{d}y=1\).
In Theorem 9.4, the first property states that the PDF is nonnegative, while the second property states that the area under the PDF is equal to \(1\).
Example 9.12 Consider \(Y\) with the PDF \(f_Y(y) = 3/y^4\) for \(y>1\). We show that this is a proper density by checking the properties in Theorem 9.4. First, we note that \(f_Y(y)\geq 0\) for all \(y>1\). Second, we compute the integral of \(f_Y(y)\) over its domain: \[ \begin{align*} &\int_{1}^\infty \frac{3}{z^4}\, \text{d}z = \left.-\frac{3}{3}z^{-3}\right|_{z=1}^\infty = \left.-z^{-3}\right|_{z=1}^\infty=-\left(\lim_{t\rightarrow\infty}\frac{1}{t^3}-1\right)=1. \end{align*} \]
Thus, \(f_Y(y)\) is a proper density function. Numerically, we can use the quad function from the scipy.integrate module to compute the above integral, as illustrated below.
# Define the function
def fun(z):
    return 3 / z**4

# Perform the integration
result, error = quad(fun, 1, np.inf)
print("Integration result:", result)
print("Estimated error:", error)
Integration result: 1.0
Estimated error: 1.1102230246251565e-14
Theorem 9.5 Let \(Y\) be a continuous random variable with CDF \(F_Y\). Then, \(P(Y = y) = 0\) for all \(y\in\mathbb{R}\).
Proof (Proof of Theorem 9.5). Let \(\epsilon > 0\) and consider \(\{Y = y\}\subseteq \{y-\epsilon < Y\leq y\}\). Then, we have \[ \begin{align*} P(Y = y)\leq P(y-\epsilon< Y\leq y) = P(Y\leq y)-P(Y\leq y-\epsilon) = F_Y(y)-F_Y(y-\epsilon). \end{align*} \] Since the probability function is nonnegative and this bound holds for every \(\epsilon>0\), we have \[ \begin{align*} 0\leq P(Y = y)&\leq \lim_{\epsilon\to0}\left(F_Y(y)-F_Y(y-\epsilon)\right)\\ &=F_Y(y)-\lim_{\epsilon\to0}F_Y(y-\epsilon)=0, \end{align*} \] where the last equality holds because \(F_Y\) is continuous.
Theorem 9.5 indicates that if \(Y\) is a continuous random variable with CDF \(F_Y\) and PDF \(f_Y\), then it follows that \[ \begin{align*} P(a < Y < b) &= P(a\leq Y < b) = P(a < Y \leq b) = P(a \leq Y \leq b)\\ &=F_Y(b)-F_Y(a) =\int^b_af_Y(y)\text{d}y. \end{align*} \]
This property contrasts with discrete random variables, where the probability of the random variable taking on a specific value can be nonzero.
9.4 Expected values, mean, and variance
In addition to the PDF and CDF, the moments of a random variable can also be used to characterize its distribution. The expected value of a discrete random variable \(Y\), denoted by \(\E(Y)\), is the average value of the random variable over many repeated trials or occurrences. The expected value of \(Y\) is also called the expectation of \(Y\) or the mean of \(Y\) and is denoted \(\mu_Y\). Let \(Y\) be a discrete random variable that can take on \(k\) values, \(y_1, y_2,\dots,y_k\). Then, \[ \begin{align} \mu_Y = \E(Y) =\sum_{j=1}^k y_j \times P(Y=y_j). \end{align} \] For \(r>1\), the \(r\)’th central moment of \(Y\) is defined by \[ \begin{align} \E\left(Y - \mu_Y\right)^r = \sum_{j=1}^k (y_j-\mu_Y)^r \times P(Y=y_j). \end{align} \]
Example 9.13 Consider the random variable \(M\) with probability distribution given in Table 9.1. Then, the mean of \(M\) is \[ \begin{align} \E(M)&=\sum_{j=1}^5 m_j \times P(M=m_j)\\ &= 0 \times 0.80 + 1 \times 0.10 + 2 \times 0.06 + 3 \times 0.03 + 4 \times 0.01\\ & = 0.35. \end{align} \]
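As a minimal check added here, we can recompute this mean directly from the outcomes and probabilities in Table 9.1:
# Recompute E(M) from the probability distribution in Table 9.1
outcomes = np.array([0, 1, 2, 3, 4])
p = np.array([0.80, 0.10, 0.06, 0.03, 0.01])
mu_M = np.sum(outcomes * p)
print(mu_M)  # 0.35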
Example 9.14 The expected value of the Bernoulli random variable \(G\) defined in Equation 9.1 is \[ \E(G)=1\times p+ 0\times (1-p)=p. \]
Let \(Y\) be a continuous random variable with support \(D\subset\mathbb{R}\) and PDF \(f_Y\). Then, the expected value of \(Y\) is defined by \[ \begin{align*} \E(Y) = \int_{D} y f_Y(y)\text{d}y . \end{align*} \]
For \(r>1\), the \(r\)’th central moment of \(Y\) is
\[
\begin{align*}
\E\left(Y- \mu_Y\right)^r = \int_{D} (y -\mu_Y)^r f_Y(y)\text{d}y.
\end{align*}
\]
Example 9.15 Consider the random variable \(Y\) in Example 9.12. Then, the expected value of \(Y\) is \[ \begin{align*} \E(Y) &= \int_{1}^\infty y f_Y(y)\text{d}y = \int_{1}^\infty y \cdot \frac{3}{y^4}\text{d}y\\ &= 3\int_{1}^\infty y^{-3}\text{d}y = \left.-\frac{3}{2}y^{-2}\right|_{1}^\infty = \frac{3}{2}. \end{align*} \]
Using the quad function from the scipy.integrate module, we can compute the expected value of \(Y\) numerically as follows:
# Define the function for the expected value
def fun(y):
    return y * 3 / y**4

# Perform the integration
result, error = quad(fun, 1, np.inf)
print("Expected value E(Y):", result)
Expected value E(Y): 1.5
Expectation is a linear operator, i.e., \(\E(a+bY)=a+b\E(Y)\), where \(a\) and \(b\) are two constants. Assuming that \(Y\) is a continuous random variable with the PDF \(f_Y\), we can prove this property in the following way: \[ \begin{align} \E(a+bY)&=\int(a+by)f_Y(y)dy=a\int f_Y(y)dy+b\int yf_Y(y)dy\\ &=a+b\E(Y), \end{align} \] where we use the fact that \(\int f_Y(y)dy=1\) in the third equality.
Theorem 9.6 (Properties of expectation) Let \(Y\) be a random variable and \(g\) be a real-valued function. Then, the expectation of \(g(Y)\) is given by:
- Discrete case: \(\E(g(Y))=\sum_{j=1}^k g(y_j)P(Y=y_j)\), where \(y_j\) are the possible values of \(Y\).
- Continuous case: \(\E(g(Y))=\int_D g(y)f_Y(y)\text{d}y\), where \(D\) is the support of \(Y\).
Proof (Proof of Theorem 9.6). See Wackerly, Mendenhall, and Scheaffer (2008).
Example 9.16 Let \(G\) be the Bernoulli random variable defined in Example 9.7. Then, using Theorem 9.6, we can compute the expected value of \(G^3\) as follows: \[ \begin{align*} \E(G^3)=1^3\times p+0^3\times(1-p)=p. \end{align*} \]
The second central moment of a random variable is the variance of the random variable, denoted by \(\text{var}(Y)\) or \(\sigma^2_Y\). The square root of the variance is called the standard deviation, denoted by \(\sigma_Y\). The variance and standard deviation measure the dispersion or the spread of the probability distribution.
Example 9.17 Let \(M\) be the number of times your wireless network connection fails while you are writing a term paper. Using Table 9.1, we compute \(\sigma^2_M\) as \[ \begin{align*} \sigma_M^2 &= \sum_{j=1}^5 (m_j-\mu_M)^2 \times P(M=m_j) \\ &= (0-0.35)^2\times0.80 + (1-0.35)^2\times0.10 \\ &+ (2-0.35)^2\times0.06 + (3-0.35)^2\times0.03+ (4-0.35)^2\times0.01\\ & = 0.6475. \end{align*} \]
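A similar check added here recomputes the variance and the standard deviation of \(M\):
# Recompute var(M) and the standard deviation from Table 9.1
outcomes = np.array([0, 1, 2, 3, 4])
p = np.array([0.80, 0.10, 0.06, 0.03, 0.01])
mu_M = np.sum(outcomes * p)
var_M = np.sum((outcomes - mu_M) ** 2 * p)
print(var_M, np.sqrt(var_M))  # 0.6475 and roughly 0.8047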
We can use skewness to measure the asymmetry of the probability distribution of a random variable. The skewness of the distribution of \(Y\) is defined by \[ \begin{align} \text{Skewness}=\frac{\E\left[\left(Y - \mu_Y\right)^3\right]}{\sigma_Y^3}. \end{align} \]
Skewness is unit-free; that is, changing the units of \(Y\) does not affect its skewness. Based on this measure, we can identify the following cases:
- Skewness\(=0\): The distribution of \(Y\) is symmetric around \(\mu_Y\).
- Skewness\(>0\): The distribution of \(Y\) is right-skewed, with a long right tail.
- Skewness\(<0\): The distribution of \(Y\) is left-skewed, with a long left tail.
The kurtosis is the fourth standardized moment of \(Y\). It measures the extent to which the probability mass of a distribution is concentrated in the tails. For the random variable \(Y\), it is defined by \[ \begin{align} \text{Kurtosis}=\frac{\E\left[\left(Y - \mu_Y\right)^4\right]}{\sigma_Y^4}. \end{align} \]
We distinguish between the following cases:
- If \(\text{Kurtosis}=3\), then \(Y\) has the same kurtosis as the normal distribution.
- If \(\text{Kurtosis}>3\), then the distribution of \(Y\) has heavier (fatter) tails than the normal distribution; this is referred to as leptokurtic.
- If \(\text{Kurtosis}<3\), then the distribution of \(Y\) has thinner tails than the normal distribution; this is referred to as platykurtic.
In Figure 9.7, we show the density plots of the standard normal distribution, the skewed normal distribution, the gamma distribution, and the Laplace distribution. The standard normal distribution is symmetric, the skewed-normal distribution is left-skewed, the gamma distribution is right-skewed, and the Laplace distribution is leptokurtic.
# Shape of distributions
# Generate x values for the range
x_symmetric = np.linspace(-4, 4, 1000)
x_skewed_right = np.linspace(-4, 1, 1000)
x_skewed_left = np.linspace(0, 8, 1000)
x_leptokurtic = np.linspace(-4, 4, 1000)

# Calculate the density functions
symmetric_pdf = stats.norm.pdf(x_symmetric, loc=0, scale=1)  # Normal distribution
left_skewed_pdf = stats.skewnorm.pdf(x_skewed_right, a=-4, loc=0, scale=1)  # Skew-normal distribution
right_skewed_pdf = stats.gamma.pdf(x_skewed_left, a=2)  # Gamma distribution
leptokurtic_pdf = stats.laplace.pdf(x_leptokurtic, loc=0, scale=1)  # Laplace distribution

# Create plots
fig, axes = plt.subplots(figsize=(5, 3.5))
# Plot each density function
axes.plot(x_symmetric, symmetric_pdf, color='blue', label="Normal distribution")
axes.fill_between(x_symmetric, symmetric_pdf, color='blue', alpha=0.2)
axes.legend(frameon=False)

fig, axes = plt.subplots(figsize=(5, 3.5))
axes.plot(x_skewed_right, left_skewed_pdf, color='red', label='Skewed normal distribution')
axes.fill_between(x_skewed_right, left_skewed_pdf, color='red', alpha=0.2)
axes.legend(frameon=False)

fig, axes = plt.subplots(figsize=(5, 3.5))
axes.plot(x_skewed_left, right_skewed_pdf, color='green', label="Gamma distribution")
axes.fill_between(x_skewed_left, right_skewed_pdf, color='green', alpha=0.2)
axes.legend(frameon=False)

fig, axes = plt.subplots(figsize=(5, 3.5))
axes.plot(x_leptokurtic, leptokurtic_pdf, color='purple', label="Laplace distribution")
axes.fill_between(x_leptokurtic, leptokurtic_pdf, color='purple', alpha=0.2)
axes.plot(x_symmetric, symmetric_pdf, color='blue', label="Normal distribution")
axes.legend(frameon=False)

# Adjust layout
plt.tight_layout()
plt.show()




Example 9.18 Let \(G\) be the Bernoulli random variable defined in Equation 9.1 with \(p=0.7\). Note that \(\E(G)=p\) and \(\sigma^2_G=p(1-p)\). Then, the skewness of \(G\) is given by \[ \begin{align*} \text{Skewness} &= \frac{\E\left[(G - \mu_G)^3\right]}{\sigma_G^3} = \frac{p(1-p)(1-2p)}{(p(1-p))^{3/2}}\\ &= \frac{0.7(1-0.7)(1-2\times0.7)}{(0.7(1-0.7))^{3/2}} = −0.872. \end{align*} \]
The kurtosis of \(G\) is given by \[ \begin{align*} \text{Kurtosis} &= \frac{\E\left[(G - \mu_G)^4\right]}{\sigma_G^4} = \frac{(0-p)^4(1-p)+(1-p)^4p}{(p(1-p))^2}\\ &= \frac{(-0.7)^4(1-0.7)+(1-0.7)^4\times0.7}{(0.7(1-0.7))^2} = 1.762. \end{align*} \]
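These values can be verified with scipy.stats, as in the following added sketch; note that SciPy reports excess kurtosis, i.e., kurtosis minus \(3\):
# Skewness and kurtosis of a Bernoulli(0.7) random variable
p = 0.7
mean, var, skew, ex_kurt = stats.bernoulli.stats(p, moments="mvsk")
print(skew)         # about -0.87
print(ex_kurt + 3)  # add 3 to the excess kurtosis: about 1.762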
In the following code chunk, we generate 1000 draws from the Bernoulli distribution with \(p=0.7\) and then generate the stem plot of the draws. The stem plot shows that the distribution of \(G\) is skewed to the left, i.e., the probability of \(G=1\) is higher than the probability of \(G=0\).
# Generate 1000 draws from the Bernoulli distribution with p=0.7
n_draws = 1000
draws = np.random.binomial(1, 0.7, n_draws)
# Calculate frequencies for stem plot
values, counts = np.unique(draws, return_counts=True)
probs = counts / n_draws
# Create a stem plot of the draws
fig, ax = plt.subplots(figsize=(5, 3))
ax.stem(values, probs, linefmt="steelblue", markerfmt="o", basefmt=" ")
ax.set_xlabel("Value")
ax.set_ylabel("Probability")
ax.set_xticks([0, 1])
ax.set_xticklabels(["0", "1"])
# ax.set_yticks([0, 0.5, 1])
# ax.set_yticklabels(['0', '0.5', '1'])
plt.tight_layout()
plt.show()

Definition 9.7 (Quantiles and percentiles) Let \(Y\) be a random variable and let \(0 < p < 1\). The \(p\)th quantile of \(Y\), denoted by \(q_p\), is the smallest value such that \(F(q_p) = P(Y \leq q_p) \geq p\). If \(Y\) is continuous, then \(F(q_p) = p\). The \(p\)th quantile is also known as the \(100p\)th percentile of \(Y\).
Note that \(q_{0.5}\) is the median, \(q_{0.25}\) is the first quartile, and \(q_{0.75}\) is the third quartile of \(Y\).
Example 9.19 Let \(Y\) be a continuous random variable with the following CDF: \[ \begin{align*} F_Y(y)= \begin{cases} 0,\quad&\text{for}\quad y<0,\\ 1-\left(\frac{4.5}{4.5+y}\right)^3,\quad&\text{for}\quad y\geq 0. \end{cases} \end{align*} \] The median of \(Y\), i.e., \(q_{0.5}\), can be computed as follows: \[ \begin{align*} &F(q_{0.5})=0.5\implies 1-\left(\frac{4.5}{4.5+y}\right)^3=0.5\implies\frac{4.5}{4.5+q_{0.5}}=(0.5)^{1/3}\\ &\implies q_{0.5}\approx 1.17. \end{align*} \]
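The same median can also be obtained numerically with a root finder; the following sketch (added here, using scipy.optimize.brentq) solves \(F_Y(q)=0.5\):
from scipy.optimize import brentq

# Solve F(q) = 0.5 for the median of Y
def F(y):
    return 1 - (4.5 / (4.5 + y)) ** 3

q_closed = 4.5 * (2 ** (1 / 3) - 1)               # closed form from (4.5/(4.5+q))^3 = 0.5
q_numeric = brentq(lambda y: F(y) - 0.5, 0, 100)  # root of F(y) - 0.5 on [0, 100]
print(q_closed, q_numeric)                        # both approximately 1.17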
The PDF of \(Y\) is given by \[ \begin{align*} f_Y(y) = \frac{d}{dy} F_Y(y) = \begin{cases} 0,\quad&\text{for}\quad y < 0, \\ 3 \times (4.5)^3 \times (4.5 + y)^{-4}, \quad&\text{for}\quad y \geq 0. \end{cases} \end{align*} \]
In the following code chunk, we generate plots of the CDF and PDF of \(Y\). In the CDF plot, we highlight the point at \(y = 1.17\), which corresponds to the median of \(Y\). In the PDF plot, the shaded area under the curve between \(y = 0\) and \(y = 1.17\) is \(0.50\).
# Generate y values for the range
y_cdf = np.linspace(-0.5, 5, 100)
# Calculate the CDF
cdf = np.where(y_cdf < 0, 0, 1 - (4.5 / (4.5 + y_cdf))**3)
fig, ax = plt.subplots(figsize=(5, 3))
ax.plot(y_cdf, cdf, color='steelblue')
ax.set_xlabel('y')
ax.set_ylabel('F(y)')

# Show the point at y=1.17
y_median = 1.17
cdf_median = 1 - (4.5 / (4.5 + y_median))**3
ax.scatter([y_median], [cdf_median], color='red', zorder=3)
ax.annotate(f"({y_median:.2f}, {cdf_median:.2f})", (y_median, cdf_median),
            textcoords="offset points", xytext=(10, 10), ha='left', color='black')

# Plot the PDF as piecewise
y_pdf = np.linspace(-0.5, 5, 1000)
# Calculate the PDF
pdf = np.where(y_pdf < 0, 0, 3 * (4.5**3) * (4.5 + y_pdf)**-4)
fig, ax = plt.subplots(figsize=(5, 3))
ax.plot(y_pdf, pdf, color='steelblue')

# Show the point at y=1.17
ax.scatter([y_median], [0], color='red', zorder=3)
ax.annotate(f"({y_median:.2f}, 0)", (y_median, 0),
            textcoords="offset points", xytext=(10, 10), ha='left', color='black')

# Shade the region under the curve between y=0 and y=1.17
mask = (y_pdf >= 0) & (y_pdf <= 1.17)
ax.fill_between(y_pdf[mask], pdf[mask], color='steelblue', alpha=0.3)
ax.set_xlabel('y')
ax.set_ylabel('f(y)')
plt.tight_layout()
plt.show()


9.5 Two random variables
In this section, we extend the concept of probability distributions to two random variables.
9.5.1 Joint and marginal distributions
Definition 9.8 (Joint probability distribution) The joint probability distribution of two discrete random variables \(X\) and \(Y\) is the probability that the random variables simultaneously take on certain values, say \(x\) and \(y\). The joint probability distribution is written as \(P(X=x,Y=y)\).
We usually use tables to represent the joint probability distribution of two discrete random variables, as illustrated in the following example.
Example 9.20 Let \(Y\) be a binary random variable that equals \(1\) if the commute to school is short (less than 20 minutes) and that equals \(0\) otherwise, and let \(X\) be a binary random variable that equals \(0\) if it is raining and \(1\) if not. We assume that the joint probability distribution of \(X\) and \(Y\) is given in Table 9.2.
# Joint probability distribution of weather conditions and commuting times
p1 = [0.15, 0.07, 0.22]
p2 = [0.15, 0.63, 0.78]
p3 = [0.30, 0.70, 1.00]
y = pd.DataFrame([p1, p2, p3])
y.index = ["Long commute (Y=0)", "Short commute (Y=1)", "Total"]
y.columns = ["Rain (X=0)", "No Rain (X=1)", "Total"]
y
|  | Rain (X=0) | No Rain (X=1) | Total |
|---|---|---|---|
| Long commute (Y=0) | 0.15 | 0.07 | 0.22 |
| Short commute (Y=1) | 0.15 | 0.63 | 0.78 |
| Total | 0.30 | 0.70 | 1.00 |
Using Table 9.2, the probability of having a rainy day and a long commute is \(P(X=0,Y=0)=0.15\).
The following code chunk generates a three-dimensional stem plot for the joint distribution of \(X\) and \(Y\). The height of each stem corresponds to the joint probability of \(X\) and \(Y\) taking on specific values.
# Create a 3D stem plot for the joint distribution
fig = plt.figure(figsize=(5, 4))
ax = fig.add_subplot(111, projection='3d')
# Prepare x, y positions for each stem
x = np.array([0, 1, 0, 1])
y = np.array([0, 0, 1, 1])
z = np.zeros_like(x)
dz = np.array([0.15, 0.07, 0.15, 0.63])  # Joint probabilities for (X,Y)
# Plot stems manually
for xi, yi, zi, dzi in zip(x, y, z, dz):
    ax.plot([xi, xi], [yi, yi], [zi, zi + dzi], color='steelblue', linewidth=2)
    ax.scatter([xi], [yi], [zi + dzi], color='steelblue', s=40)
ax.set_xticks([0, 1])
ax.set_xticklabels(['X=0', 'X=1'])
ax.set_yticks([0, 1])
ax.set_yticklabels(['Y=0', 'Y=1'])
plt.show()

Theorem 9.7 (Properties of joint probability distributions) Let \(X\) and \(Y\) be two discrete random variables. Then, the joint probability distribution has the following properties: (i) \(P(X=x,Y=y)\geq0\) for all \(x\) and \(y\), and (ii) \(\sum_{x}\sum_{y}P(X=x,Y=y)=1\).
The first property states that the joint probabilities are nonnegative, while the second property states that the sum of all joint probabilities is equal to \(1\).
From the joint probability distribution, we can derive the marginal probability distributions of \(X\) and \(Y\).
Definition 9.9 (Marginal distributions) If \(X\) can take on \(\ell\) different values \(x_1,\dots,x_\ell\), then the marginal probability that \(Y\) takes on the value \(y\) is \[ \begin{align} P(Y=y)=\sum_{i=1}^\ell P(Y=y, X=x_i). \end{align} \]
Example 9.21 Consider the joint probability distribution of \(X\) and \(Y\) given in Table 9.2. Then, the probability of a long commute, i.e., \(P(Y=0)\), is \[ \begin{align*} P(Y=0)&=\sum_{i=1}^\ell P(Y=0, X=x_i)\\ &=P(Y=0,X=0)+P(Y=0,X=1)\\ &=0.15+0.07=0.22. \end{align*} \] In Table 9.2, the row sums give the marginal probability distribution of \(Y\), while the column sums give the marginal probability distribution of \(X\).
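As an added illustration, the marginal distributions can be obtained as row and column sums of the joint probability matrix in Table 9.2:
# Joint probabilities from Table 9.2: rows index Y = 0, 1 and columns index X = 0, 1
joint = np.array([[0.15, 0.07],
                  [0.15, 0.63]])
p_y = joint.sum(axis=1)  # marginal distribution of Y: [0.22, 0.78]
p_x = joint.sum(axis=0)  # marginal distribution of X: [0.30, 0.70]
print(p_y, p_x)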
Definition 9.10 (Joint cumulative distribution function) The joint cumulative distribution function (CDF) of two random variables \(X\) and \(Y\) is defined as \[ \begin{align} F_{X,Y}(x,y) = P(X\leq x, Y\leq y). \end{align} \]
Note that if \(X\) is a discrete random variable, taking values \(x_1,\dots,x_\ell\), and \(Y\) is a discrete random variable taking values \(y_1,\dots,y_k\), then the joint CDF can be computed as \[ \begin{align*} F_{X,Y}(x,y) &= P(X\leq x, Y\leq y)\\ &= \sum_{i:\,x_i\leq x}\sum_{j:\,y_j\leq y} P(X=x_i,Y=y_j), \end{align*} \] where the first summation is over all values of \(X\) that are less than or equal to \(x\), and the second summation is over all values of \(Y\) that are less than or equal to \(y\).
Example 9.22 Consider the joint probability distribution of \(X\) and \(Y\) given in Table 9.2. Then, the joint CDF of \(X\) and \(Y\) at \((0,1)\) is \[ \begin{align*} F_{X,Y}(0,1) &= P(X\leq 0, Y\leq 1)\\ &= P(X=0,Y=0)+P(X=0,Y=1)\\ &= 0.15+0.15=0.30. \end{align*} \]
Definition 9.11 (Joint probability density function) Let \(X\) and \(Y\) be two continuous random variables with joint CDF \(F_{X,Y}(x,y)\). If there exists a function \(f_{X,Y}(x,y)\) such that \[ \begin{align} F_{X,Y}(x,y) = \int_{-\infty}^x\int_{-\infty}^y f_{X,Y}(u,v)\text{d}v\text{d}u, \end{align} \] then \(f_{X,Y}(x,y)\) is called the joint probability density function (PDF) of \(X\) and \(Y\).
The joint PDF \(f_{X,Y}(x,y)\) is always nonnegative, i.e., \(f_{X,Y}(x,y)\geq 0\) for all \(x\) and \(y\), and its integral over the entire space is equal to \(1\): \[ \begin{align} \int_{-\infty}^\infty\int_{-\infty}^\infty f_{X,Y}(x,y)\text{d}y\text{d}x = 1. \end{align} \]
Definition 9.10 suggests that if \(X\) and \(Y\) are continuous random variables, then their joint CDF can be written as \[ \begin{align*} F_{X,Y}(x,y) &= P(X\leq x, Y\leq y)\\ &= \int_{-\infty}^x\int_{-\infty}^y f_{X,Y}(u,v)\text{d}v\text{d}u. \end{align*} \]
Example 9.23 Let \(X\) and \(Y\) be two continuous random variables with joint PDF given by \[ \begin{align*} f(x,y)= \begin{cases} 30xy^2,\quad&x-1\leq y\leq 1-x,\, 0\leq x\leq1,\\ 0,\quad&\text{elsewhere}. \end{cases} \end{align*} \]
We want to compute \(F_{X,Y}(0.5, 0.5)\). The following code chunk illustrates the integration region, which is defined by the intersection of the inequalities \(x - 1 \leq y \leq 1 - x\), \(0 \leq x \leq 1\), \(x \leq 0.5\), and \(y \leq 0.5\). In Figure 9.11, this region is shaded in orange. The joint CDF \(F_{X,Y}(0.5, 0.5)\) is given by the integral of the joint PDF over this region. \[ \begin{align*} F(0.5,0.5)&=P(X\leq0.5, Y\leq0.5)=\int_{0}^{0.5}\int_{x-1}^{0.5}30xy^2dydx\\ &=\int_{0}^{0.5}\left(10xy^3\bigg|^{y=0.5}_{y=x-1}\right)dx=\int_{0}^{0.5}\left(\frac{10x}{8}-10x(x-1)^3\right)dx=9/16. \end{align*} \]
# Integration region for computing F_{X,Y}(0.5, 0.5)
# Generate x values for the range
x = np.linspace(0, 1, 100)
# Calculate the lower and upper bounds for y
y_lower = x - 1
y_upper = 1 - x

fig, ax = plt.subplots(figsize=(5, 3.5))
# Shade the full support
ax.fill_between(x, y_lower, y_upper, where=(y_lower <= y_upper), color='steelblue', alpha=0.3)

# Shade the area where y <= 0.5 and x <= 0.5
x_shade = x[x <= 0.5]
y_lower_shade = y_lower[x <= 0.5]
y_upper_shade = np.minimum(y_upper[x <= 0.5], 0.5)
ax.fill_between(x_shade, y_lower_shade, y_upper_shade, where=(y_lower_shade < y_upper_shade), color='orange', alpha=0.7)

ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_xlim(0, 1)
ax.set_ylim(-1, 1)
plt.axhline(0, color='black', lw=0.5, ls='--')
ax.axvline(0, color='black', lw=0.5, ls='--')
# Add lines x=0.5 and y=0.5
ax.axvline(0.5, color='red', lw=1, ls='--', label='x=0.5')
ax.axhline(0.5, color='green', lw=1, ls='--', label='y=0.5')

plt.grid()
ax.legend(loc='upper right', frameon=False)
plt.tight_layout()
plt.show()

Using the dblquad function from the scipy.integrate module, we can compute the joint CDF \(F_{X,Y}(0.5, 0.5)\) numerically as follows:
# Define the joint PDF function
def joint_pdf(y, x):
    return 30 * x * y**2

# Define the limits for y based on x
def y_lower(x):
    return x - 1

def y_upper(x):
    return 0.5

# Perform the double integration
result, error = dblquad(joint_pdf, 0, 0.5, y_lower, y_upper)
print("Joint CDF F_{X,Y}(0.5, 0.5):", result)
Joint CDF F_{X,Y}(0.5, 0.5): 0.5625
Definition 9.12 (Marginal probability density function) Let \(X\) and \(Y\) be two continuous random variables with joint PDF \(f_{X,Y}(x,y)\). The marginal probability density function (PDF) of \(X\) is given by \[ \begin{align} f_X(x) = \int_{-\infty}^\infty f_{X,Y}(x,y)\text{d}y. \end{align} \] The marginal PDF of \(Y\) is given by \[ \begin{align} f_Y(y) = \int_{-\infty}^\infty f_{X,Y}(x,y)\text{d}x. \end{align} \]
9.5.2 Conditional distributions and conditional moments
From the joint and marginal distributions, we can derive the conditional distribution of one random variable given the other.
Definition 9.13 (Conditional distribution) Let \(X\) and \(Y\) be two discrete random variables. Then, the conditional distribution of \(Y\) given \(X=x\) is given by \[ \begin{align} P(Y=y |X=x) = \frac{P(Y=y,X=x)}{P(X=x)}. \end{align} \]
Definition 9.13 indicates that \(P(Y=y |X=x)\) is the likelihood of the event that \(Y\) takes the value \(y\) given that \(X=x\) has happened.
Example 9.24 Consider the joint probability distribution of \(X\) and \(Y\) given in Table 9.2. Then, the likelihood of a short commute given that it is raining outside is \[ \begin{align*} P(Y=1 |X=0) = \frac{P(Y=1,X=0)}{P(X=0)} = 0.15/0.30 = 0.50. \end{align*} \]
The likelihood of no rain outside given that it has been a short commute is \[ \begin{align*} P(X=1 |Y=1) = \frac{P(X=1,Y=1)}{P(Y=1)} = 0.63/0.78 = 0.81. \end{align*} \]
Definition 9.14 (Conditional probability density function) Let \(X\) and \(Y\) be two continuous random variables with joint PDF \(f_{X,Y}(x,y)\). The conditional probability density function (PDF) of \(Y\) given \(X=x\) is given by \[ \begin{align} f_{Y|X}(y|x) = \frac{f_{X,Y}(x,y)}{f_X(x)}. \end{align} \]
Using the conditional distribution, we can define conditional moments. We use \(\E(Y|X)\) to denote the conditional mean of \(Y\) given \(X\). When the value of \(X\) is not specified, \(\E(Y|X)\) is random. However, if the value of \(X\) is specified, say \(X=x\), then \(\E(Y|X=x)\) is a fixed quantity and is not random.
Definition 9.15 (Conditional expectation) Let \(X\) and \(Y\) be two discrete random variables, and suppose that \(Y\) takes on \(k\) values: \(y_1, y_2, \dots, y_k\). Then, the conditional expectation of \(Y\) given \(X = x\) is defined as \[ \begin{align} \E(Y|X=x) = \sum_{j=1}^k y_jP(Y=y_j|X=x). \end{align} \]
If \(X\) and \(Y\) are continuous random variables with conditional density function \(f_{Y|X}\), then the conditional expectation of \(Y\) given \(X = x\) is defined as \[ \begin{align} \E(Y|X=x) = \int_{-\infty}^\infty y f_{Y|X}(y|x)\text{d}y. \end{align} \]
The other conditional moments of \(Y\) given \(X\) can be defined in a similar way as in Definition 9.15. For example, the conditional variance of \(Y\) given \(X = x\) is defined as \[ \begin{align} \text{var}(Y|X=x) = \sum_{j=1}^k \big(y_j - \E(Y|X=x)\big)^2P(Y=y_j|X=x). \end{align} \]
Example 9.25 Consider the joint probability distribution of \(X\) and \(Y\) given in Table 9.2. Then, the expected value of commute given that it is not a rainy day is \[ \begin{align*} \E(Y|X=1)&=1\times P(Y=1|X=1)+0\times P(Y=0|X=1)=0.63/0.70=0.9. \end{align*} \]
The variance of commute given that it is not a rainy day is \[ \begin{align*} \text{var}(Y|X=1)&=(1-0.9)^2\times P(Y=1|X=1)+(0-0.9)^2\times P(Y=0|X=1)\\ &=(0.1)^2\times0.9+(-0.9)^2\times0.1=0.09. \end{align*} \]
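The conditional distribution and moments in this example can be computed directly from the joint probability matrix, as in the following added sketch:
# Joint probabilities from Table 9.2: rows index Y = 0, 1 and columns index X = 0, 1
joint = np.array([[0.15, 0.07],
                  [0.15, 0.63]])
y_vals = np.array([0, 1])
p_x1 = joint[:, 1].sum()                  # P(X=1) = 0.70
p_y_given_x1 = joint[:, 1] / p_x1         # conditional distribution of Y given X=1
e_y_given_x1 = np.sum(y_vals * p_y_given_x1)                           # E(Y|X=1) = 0.9
var_y_given_x1 = np.sum((y_vals - e_y_given_x1) ** 2 * p_y_given_x1)   # var(Y|X=1) = 0.09
print(e_y_given_x1, var_y_given_x1)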
9.5.3 Properties of conditional expectations
The conditional expectation plays a central role in econometrics, and therefore it is important to understand its properties.
Let \(X\) be a random variable and \(g\) a function defined on the support of \(X\). Then, the conditional expectation of \(g(X)\) given \(X\) is \[ \begin{align} \E(g(X)|X) = g(X) \end{align} \]
This means that the conditional expectation of a function of a random variable, given that random variable, is equal to the function itself. Intuitively, if we know the value of \(X\), then \(g(X)\) becomes a fixed quantity, and its conditional expectation given \(X\) is equal to itself.
Just like the expectation of a random variable, the conditional expectation is a linear operator. This means that if \(X\), \(Y\) and \(Z\) are random variables, then for any constants \(a\) and \(b\), we have \[ \begin{align} \E(aX + bY | Z) = a\E(X | Z) + b\E(Y | Z). \end{align} \]
Another important property of conditional expectations is the law of iterated expectations. This law states that the unconditional expectation of a random variable can be expressed as the expectation of its conditional expectation given another random variable.
Theorem 9.8 (Law of iterated expectations (LIE)) Let \(X\) and \(Y\) be two random variables. Then, \[ \begin{align} \E(Y) = \E\left(\E(Y|X)\right), \end{align} \] where the outer expectation is with respect to the marginal distribution of \(X\).
Let \(X\) and \(Y\) be two discrete random variables, taking values \(x_1,\dots,x_\ell\) and \(y_1,\dots,y_k\). The LIE suggests that the unconditional expectation of \(Y\), i.e., \(\E(Y)\), can be defined as the weighted average of the conditional expectations of \(Y\) given \(X\), where the weights come from the marginal distribution of \(X\): \[ \begin{align} \E(Y) =\E\left(\E(Y|X)\right)= \sum_{i=1}^\ell \E(Y|X=x_i) P(X=x_i). \end{align} \]
Proof (Proof of Theorem 9.8). Let \(X\) and \(Y\) be discrete random variables taking values \(x_1,\dots,x_\ell\) and \(y_1,\dots,y_k\). Then, \[ \begin{align} \E\left(\E(Y|X)\right)&=\sum_{i=1}^\ell\E(Y|X=x_i)P(X=x_i)\\ &=\sum_{i=1}^\ell\sum_{j=1}^ky_jP(Y=y_j|X=x_i)P(X=x_i)\\ &=\sum_{i=1}^\ell\sum_{j=1}^ky_j\frac{P(Y=y_j,X=x_i)}{P(X=x_i)}P(X=x_i)\\ &=\sum_{i=1}^\ell\sum_{j=1}^ky_jP(Y=y_j,X=x_i)\\ &=\sum_{j=1}^ky_j\sum_{i=1}^\ell P(Y=y_j,X=x_i)\\ &=\sum_{j=1}^ky_jP(Y=y_j)\\ &=\E(Y), \end{align} \] where the sixth equality follows from \(\sum_{i=1}^\ell P(Y=y_j,X=x_i)=P(Y=y_j)\). If \(X\) and \(Y\) are continuous random variables with densities \(f_X\) and \(f_Y\), then we have \[ \begin{align} \E\left(\E(Y|X)\right)&=\int\E(Y|X=x)f_X(x)dx\\ &=\int\int yf_{Y|X}(y|x)f_X(x)dydx\\ &=\int\int y\frac{f_{Y,X}(y,x)}{f_X(x)}f_X(x)dydx\\ &=\int\int yf_{Y,X}(y,x)dydx\\ &=\int yf_{Y}(y)dy\\ &=\E(Y). \end{align} \]
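For the commuting example in Table 9.2, the LIE can be verified numerically; the following added sketch computes \(\E(Y)\) both as a weighted average of conditional expectations and from the marginal distribution of \(Y\):
# Verify the law of iterated expectations for the joint distribution in Table 9.2
joint = np.array([[0.15, 0.07],
                  [0.15, 0.63]])      # rows: Y = 0, 1; columns: X = 0, 1
p_x = joint.sum(axis=0)               # marginal distribution of X
e_y_given_x = joint[1, :] / p_x       # E(Y|X=x) = P(Y=1|X=x) since Y is binary
print(np.sum(e_y_given_x * p_x))      # E(E(Y|X)) = 0.78
print(joint[1, :].sum())              # E(Y) = P(Y=1) = 0.78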
The next theorem gives a result related to LIE, which is useful when we have more than one conditioning variable.
Theorem 9.9 Let \(X\), \(Y\), and \(Z\) be random variables, and assume that their expectations exist. Then, we have \[ \begin{align*} \E(Y|X)=\E(\E(Y|X,Z)|X). \end{align*} \]
Proof (Proof of Theorem 9.9). To prove this result, we assume that these variables have joint density \(f_{X,Y,Z}(x,y,z)\). Then, we have \[ \begin{align*} \E(Y|X=x)&= \int yf_{Y|X}(y|x)dy=\int\int yf_{Y,Z|X}(y,z|x)dzdy. \end{align*} \] We have \(f_{Y|Z,X}(y|z,x)=\frac{f_{Y,Z|X}(y,z|x)}{f_{Z|X}(z|x)}\), suggesting that \(f_{Y,Z|X}(y,z|x)=f_{Y|Z,X}(y|z,x)\times f_{Z|X}(z|x)\). Thus, \[ \begin{align*} \E(Y|X=x)&=\int\int yf_{Y|Z,X}(y|z,x)f_{Z|X}(z|x)dzdy\\ &=\int\left(\int yf_{Y|Z,X}(y|z,x)dy\right)f_{Z|X}(z|x)dz\\ &=\int\E(Y|X=x,Z=z)f_{Z|X}(z|x)dz\\ &=\E(\E(Y|X,Z)|X=x). \end{align*} \]
9.5.4 Independence and measures of linear dependence
Definition 9.16 (Independence of random variables) The discrete random variables \(X\) and \(Y\) are independently distributed, or independent, if for all values of \(x\) and \(y\), we have \[ \begin{align*} P(Y=y,X=x) = P(Y=y)P(X=x). \end{align*} \] Similarly, the continuous random variables \(X\) and \(Y\) are independent if and only if \(f_{X,Y}(x,y)=f_X(x)\times f_Y(y)\).
Note that Definition 9.16 also implies that if \(X\) and \(Y\) are independent, then the conditional distribution of \(Y\) given \(X\) is equal to the marginal distribution of \(Y\). That is, in the discrete case, we have \(P(Y=y|X=x) = P(Y=y)\), and in the continuous case, \(f_{Y|X}(y|x) = f_Y(y)\).
If \(X\) and \(Y\) are independent random variables, then the conditional expectation of \(Y\) given \(X\) is equal to the unconditional expectation of \(Y\): \[ \begin{align} \E(Y|X) = \E(Y). \end{align} \]
This is because, when \(X\) and \(Y\) are independent, knowing the value of \(X\) provides no information about \(Y\). As a result, the conditional expectation of \(Y\) given \(X\) is simply equal to its unconditional expectation. In the discrete case, this property can be shown as \[ \begin{align} \E(Y|X=x) = \sum_{j=1}^k y_j P(Y=y_j|X=x) = \sum_{j=1}^k y_j P(Y=y_j) = \E(Y), \end{align} \] where the second equality follows from \(P(Y=y_j|X=x) = P(Y=y_j)\) for all \(j\). In the continuous case, we have \[ \begin{align} \E(Y|X=x) = \int_{-\infty}^\infty y f_{Y|X}(y|x)\text{d}y = \int_{-\infty}^\infty y f_Y(y)\text{d}y = \E(Y), \end{align} \] where the second equality holds because \(f_{Y|X}(y|x) = f_Y(y)\) for all \(y\).
To measure the linear relationship between two random variables, we use the covariance and correlation measures. The covariance between \(X\) and \(Y\) is simply given by \[ \begin{align} \text{cov}(X,Y)=\sigma_{XY}&=\E\big((X-\mu_X)(Y-\mu_Y)\big)\\ &=\E(XY)-\mu_X\mu_Y. \end{align} \]
Suppose again both \(Y\) and \(X\) are discrete and \(Y\) takes on \(k\) values and \(X\) takes on \(\ell\) values. Then, \[ \begin{align*} \text{cov}(X,Y) &= \sum_{i=1}^k\sum_{j=1}^\ell (x_j-\mu_X)(y_i-\mu_Y)P(X=x_j,Y=y_i)\\ &=\sum_{i=1}^k\sum_{j=1}^\ell x_jy_iP(X=x_j,Y=y_i)-\mu_X\mu_Y. \end{align*} \]
If \(X\) and \(Y\) are continuous random variables with joint PDF \(f_{X,Y}(x,y)\), then the covariance is given by \[ \begin{align*} \text{cov}(X,Y) &= \int_{-\infty}^\infty\int_{-\infty}^\infty (x-\mu_X)(y-\mu_Y)f_{X,Y}(x,y)\text{d}y\text{d}x\\ &=\int_{-\infty}^\infty\int_{-\infty}^\infty xyf_{X,Y}(x,y)\text{d}y\text{d}x-\mu_X\mu_Y. \end{align*} \]
Some properties of covariance are listed below:
- \(\text{cov}(X,X)=\E(X^2)-\mu_X^2=\sigma^2_X\), which is the variance of \(X\).
- \(\text{cov}(X,Y)=\text{cov}(Y,X)\), i.e., covariance is symmetric.
- \(\text{cov}(a\,X+b\,Y,V)=a\,\text{cov}(X,V)+b\,\text{cov}(Y,V)\), where \(a\) and \(b\) are constants.
- If \(X\) and \(Y\) are independent, then \(\text{cov}(X,Y)=0\), because \(\text{cov}(X,Y)=\E(XY)-\mu_X\mu_Y=\E(X)\E(Y)-\mu_X\mu_Y=0\).
- If \(\text{cov}(X,Y)>0\), then \(X\) and \(Y\) move in the same direction. If \(\text{cov}(X,Y)<0\), then \(X\) and \(Y\) move in opposite directions.
Although covariance indicates the direction of the relationship between two variables, it does not convey the strength of the relationship. Moreover, covariance is not unit-free, meaning that it depends on the units of \(X\) and \(Y\). For example, if \(X\) is measured in meters and \(Y\) in seconds, then the covariance will be expressed in meter-seconds. To address these limitations, we introduce the correlation measure, which is a standardized version of covariance. The correlation between \(X\) and \(Y\) is defined as
\[ \begin{align}\label{eq13} \text{corr}(X,Y) = \frac{\text{cov}(X,Y)}{\sqrt{\text{var}(X)\text{var}(Y)}}=\frac{\sigma_{XY}}{\sigma_X\sigma_Y}. \end{align} \]
The correlation measure is unit free and satisfies \(-1\le \text{corr}(X,Y)\le 1\).
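As an added sketch, the covariance and correlation of \(X\) and \(Y\) in Table 9.2 can be computed directly from these definitions:
# Covariance and correlation of X and Y for the joint distribution in Table 9.2
joint = np.array([[0.15, 0.07],
                  [0.15, 0.63]])      # rows: Y = 0, 1; columns: X = 0, 1
x_vals = np.array([0, 1])
y_vals = np.array([0, 1])
p_x = joint.sum(axis=0)
p_y = joint.sum(axis=1)
mu_x = np.sum(x_vals * p_x)
mu_y = np.sum(y_vals * p_y)
e_xy = np.sum(np.outer(y_vals, x_vals) * joint)    # E(XY) = 0.63
cov_xy = e_xy - mu_x * mu_y                        # 0.084
var_x = np.sum((x_vals - mu_x) ** 2 * p_x)
var_y = np.sum((y_vals - mu_y) ** 2 * p_y)
corr_xy = cov_xy / np.sqrt(var_x * var_y)          # about 0.44
print(cov_xy, corr_xy)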
9.6 Commonly used distributions
In this section, we introduce some commonly used probability distributions for discrete and continuous random variables. In Appendix C, we provide some further details on the properties of some of these distributions.
9.6.1 Binomial distribution
To introduce the binomial distribution, we first define a binomial experiment. A binomial experiment has the following properties: (i) the experiment consists of \(n\) independent trials, (ii) each trial has only two possible outcomes, which we call “success” and “failure”, (iii) the probability of success in each trial is constant and denoted by \(p\).
Let \(X\) be the random variable representing the number of successes in \(n\) trials of a binomial experiment. Then, \(X\) is a discrete random variable that takes values \(0, 1, \dots, n\). The probability of having exactly \(x\) successes in \(n\) trials is given by the binomial probability mass function: \[ \begin{align*} P(X=x) = \binom{n}{x}p^x(1-p)^{n-x},\quad x=0,1,\dots,n, \end{align*} \] where \(\binom{n}{x}=\frac{n!}{x!(n-x)!}\) is the binomial coefficient.
If \(X\) follows a binomial distribution with parameters \(n\) and \(p\), we denote this as \(X \sim B(n, p)\). The expected value and variance of \(X\) are given by \[ \begin{align*} \E(X) = np,\quad \text{var}(X) = np(1-p). \end{align*} \]
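For instance, taking \(n=20\) and \(p=0.5\) (one of the parameter values used in the figure below), the binomial PMF, mean, and variance can be evaluated with scipy.stats; this is an added illustration:
# Binomial probabilities and moments for n = 20, p = 0.5
n, p = 20, 0.5
print(stats.binom.pmf(10, n, p))                   # P(X = 10), about 0.176
mean, var = stats.binom.stats(n, p, moments="mv")
print(mean, var)                                   # n*p = 10.0 and n*p*(1-p) = 5.0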
The following code chunk generates the stem plots of the binomial distribution with parameters \(n=20\) and \(p\in\{0.1,0.5, 0.8\}\) for \(x=0,1,\dots,20\). The height of each stem corresponds to the probability of having exactly \(x\) successes in \(n\) trials. Figure 9.12 indicates that the binomial distribution is skewed to the right when \(p\) is small, symmetric when \(p=0.5\), and skewed to the left when \(p\) is large.
# Define parameters
n = 20
p_values = [0.1, 0.5, 0.8]
# Create a 2x2 grid of subplots
fig, axs = plt.subplots(2, 2, figsize=(8, 6))
axs = axs.flatten()

# Loop through each p value and plot in a separate subplot
for i, p in enumerate(p_values):
    x = np.arange(0, n + 1)
    pmf = stats.binom.pmf(x, n, p)
    axs[i].stem(x, pmf, basefmt=" ")
    axs[i].set_xlabel('x')
    axs[i].set_ylabel('Probability')
    axs[i].set_title(f'p={p}')
    axs[i].set_xticks(np.arange(0, n + 1, 2))
    axs[i].set_xlim(-1, n + 1)

# Hide the unused subplot (bottom right)
axs[3].axis('off')

plt.tight_layout()
plt.show()

9.6.2 Uniform distribution
A continuous random variable \(Y\) is said to have a uniform distribution on the interval \([a, b]\) if its PDF is constant over this interval. The PDF of \(Y\) is given by \[ \begin{align*} f_Y(y) = \begin{cases} \frac{1}{b-a},\quad&a\leq y\leq b,\\ 0,\quad& \text{elsewhere}. \end{cases} \end{align*} \]
If \(Y\) has a uniform distribution on the interval \([a, b]\), we denote this as \(Y\sim U(a,b)\). The CDF of \(Y\) is given by \[ \begin{align*} F_Y(y) = \begin{cases} 0,\quad&y<a,\\ \frac{y-a}{b-a},\quad&a\leq y\leq b,\\ 1,\quad&y>b. \end{cases} \end{align*} \]
The mean and variance of \(Y\) are given by \[ \begin{align*} \E(Y) = \frac{a+b}{2},\quad \text{var}(Y) = \frac{(b-a)^2}{12}. \end{align*} \]
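As a quick added check of these formulas for the interval \([0, 2\pi]\) used below, we can compare them with scipy.stats.uniform (which parameterizes the distribution by loc and scale):
# Mean and variance of U(0, 2*pi), compared with the closed-form expressions
a, b = 0.0, 2 * np.pi
u = stats.uniform(loc=a, scale=b - a)   # scipy parameterizes U(loc, loc + scale)
print(u.mean(), (a + b) / 2)            # both equal pi
print(u.var(), (b - a) ** 2 / 12)       # both equal (2*pi)^2 / 12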
The following code chunk generates the PDF and CDF of the uniform distribution on the interval \([0, 2\pi]\).
# The plots of PDF and CDF of uniform distribution
# PDF
y = np.arange(0, 2 * np.pi + 0.01, 0.01)
pdf = np.ones_like(y) / (2 * np.pi)

fig, ax = plt.subplots(figsize=(5, 3))
ax.plot(y, pdf, color='steelblue', linewidth=2)
ax.axvline(2 * np.pi, linestyle='dashed', color='black', alpha=0.5)
ax.set_xlabel('y')
ax.set_ylabel('PDF')
ax.set_xlim(0, 7.5)
ax.set_ylim(0, 0.5)
ax.set_xticks([0, np.pi/2, np.pi, 3*np.pi/2, 2*np.pi])
ax.set_xticklabels(['0', r'$\pi/2$', r'$\pi$', r'$3\pi/2$', r'$2\pi$'])
ax.set_yticks([0, 1/(2*np.pi)])
ax.set_yticklabels(['0', r'$1/2\pi$'])
plt.show()

# CDF
cdf = y / (2 * np.pi)
fig, ax = plt.subplots(figsize=(5, 3))
ax.plot(y, cdf, color='steelblue', linewidth=2)
ax.set_xlabel('y')
ax.set_ylabel('CDF')
ax.set_xlim(0, 7.5)
ax.set_ylim(0, 1.2)
ax.axvline(0, linestyle='dashed', color='black')
# Draw horizontal line at y=1 from 2*pi to 7.5
ax.hlines(1, 2*np.pi, 7.5, colors='steelblue', linewidth=2)
ax.set_xticks([0, np.pi/2, np.pi, 3*np.pi/2, 2*np.pi])
ax.set_xticklabels(['0', r'$\pi/2$', r'$\pi$', r'$3\pi/2$', r'$2\pi$'])
plt.show()


9.6.3 Normal distribution
If the continuous random variable \(Y\) follows a normal distribution with mean \(\mu_Y\) and variance \(\sigma_Y^2\), we denote this as \(Y \sim N(\mu_Y, \sigma_Y^2)\). The PDF and CDF of \(Y\) are defined as \[ \begin{align*} &f_Y(y)=\frac{1}{\sqrt{2\pi}\sigma_Y}e^{-\frac{1}{2\sigma^2_Y}(y-\mu_Y)^2},\\ &F_Y(y)=P(Y\leq y)=\int_{-\infty}^y\frac{1}{\sqrt{2\pi}\sigma_Y}e^{-\frac{1}{2\sigma^2_Y}(t-\mu_Y)^2}\text{d}t. \end{align*} \]
If \(Y\sim N(0,1)\), then we say that \(Y\) has the standard normal distribution. The PDF and the CDF of \(Y\) are denoted by \(\phi_Y(y)\) and \(\Phi_Y(y)\), respectively. These functions are defined as \[ \begin{align*} &\phi_Y(y)=\frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}y^2},\\ &\Phi_Y(y)=P(Y\leq y)=\int_{-\infty}^y\frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}t^2}\text{d}t. \end{align*} \]
The CDF does not have a closed form, so numerical methods are used to compute \(\Phi_Z(z)=P(Z\leq z)\). The SciPy module provides statistical functions for probability computations as shown in Example 9.26.
Example 9.26 Let \(Y\sim N(0.35,2.25)\) and consider \(P(-0.5<Y\leq 0.75)\), which corresponds to the shaded area illustrated in Figure 9.14. Using the scipy.stats.norm.cdf
function, we can compute \(P(-0.5<Y\leq 0.75)\) in the following way:
p = stats.norm.cdf(0.75, loc=0.35, scale=1.5) - stats.norm.cdf(-0.5, loc=0.35, scale=1.5)
np.round(p, 4)
0.3197
# The area under the PDF of N(0.35, 2.25) between -0.5 and 0.75
# Define y values and compute the PDF
y = np.arange(-5, 5, 0.01)
pdf = stats.norm.pdf(y, loc=0.35, scale=1.5)
# Create a figure and axis object
fig, ax = plt.subplots(figsize=(5, 3.5))
# Plot the PDF
ax.plot(y, pdf, linestyle='-', color='steelblue', linewidth=2, label='PDF')
ax.set_xlabel('y')
ax.set_ylabel('PDF')
ax.set_xlim(-4, 5)
# Define x values for the shaded area
x = np.arange(-0.5, 0.75, 0.001)
y_fill = stats.norm.pdf(x, loc=0.35, scale=1.5)
# Fill the polygon for the shaded area
ax.fill_between(x, y_fill, color='lightblue', edgecolor='black', linewidth=1, linestyle='--')
# Add points at x = -0.5 and x = 0.75
ax.scatter([-0.5, 0.75], [0, 0], color='black', s=20)
# Show the plot
plt.show()
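Since the normal CDF has no closed form, the probability in Example 9.26 can also be recovered by numerically integrating the PDF. The following is a minimal sketch using scipy.integrate.quad (imported at the top of this chapter); it is an alternative check rather than part of the original example.
# A minimal sketch: integrate the PDF of N(0.35, 2.25) over (-0.5, 0.75] numerically
# and compare with the CDF-based result 0.3197 from Example 9.26.
f = lambda t: stats.norm.pdf(t, loc=0.35, scale=1.5)
area, abserr = quad(f, -0.5, 0.75)
np.round(area, 4)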

The normal distribution is symmetric around its mean \(\mu\) and has a bell-shaped curve. The mean, median, and mode of the normal distribution are all equal to \(\mu\). The variance \(\sigma^2\) determines the spread of the distribution, with larger values indicating a wider spread. Consider \(N(0,1)\), \(N(-2,1)\), and \(N(0,9)\). We use the following code chunk to plot the PDF and CDF of these normal distributions. Figure 9.15 shows the PDF and CDF of these distributions.
# The plots of PDF and CDF of normal distributions
# Define y values
y = np.arange(-10, 10, 0.01)
# Create a figure and axis object for PDF
fig, ax = plt.subplots(figsize=(6, 3))
# Plot the PDFs
ax.plot(y, stats.norm.pdf(y, 0, 1), linestyle='-', color='red', linewidth=1.5, label=r'$\mu=0, \sigma=1$')
ax.plot(y, stats.norm.pdf(y, -2, 1), linestyle='-', color='black', linewidth=1.5, label=r'$\mu=-2, \sigma=1$')
ax.plot(y, stats.norm.pdf(y, 0, 3), linestyle='-', color='steelblue', linewidth=1.5, label=r'$\mu=0, \sigma=3$')
# Customize the plot
ax.set_xlabel('y')
ax.set_ylabel('PDF')
ax.legend(loc='upper left', frameon=False)
# Show the plot
plt.show()
# Create a figure and axis object for CDF
fig, ax = plt.subplots(figsize=(6, 3))
# Plot the CDFs
ax.plot(y, stats.norm.cdf(y, 0, 1), linestyle='-', color='red', linewidth=1.5, label=r'$\mu=0, \sigma=1$')
ax.plot(y, stats.norm.cdf(y, -2, 1), linestyle='-', color='black', linewidth=1.5, label=r'$\mu=-2, \sigma=1$')
ax.plot(y, stats.norm.cdf(y, 0, 3), linestyle='-', color='steelblue', linewidth=1.5, label=r'$\mu=0, \sigma=3$')
# Customize the plot
ax.set_xlabel('y')
ax.set_ylabel('CDF')
ax.legend(loc='upper left', frameon=False)
# Show the plot
plt.show()


Below, we collect some important properties of random variables that have normal distributions.
- If \(X\sim N(\mu_X,\sigma^2_X)\), then \((X-\mu_X)/\sigma_X\sim N(0,1)\).
- If \(Y\sim N(\mu_Y,\sigma^2_Y)\), then \(a+bY\sim N(a+b\mu_Y,b^2\sigma^2_Y)\) for any constants \(a\) and \(b\).
- If \(X\) and \(Y\) are jointly normal, then \(X\) and \(Y\) are independent if and only if they are uncorrelated, i.e., \(\text{cov}(X,Y)=0\).
- If \(X_1,\dots,X_n\) are independent and identically distributed (i.i.d.) random variables with \(X_i\sim N(\mu_X,\sigma^2_X)\), then \(\sum_{i=1}^nX_i\sim N(n\mu_X,n\sigma^2_X)\) (see the simulation sketch after this list).
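The following simulation sketch illustrates the first and last properties; the parameter values \(\mu_X=2\), \(\sigma_X=3\), and \(n=5\) are arbitrary choices made only for illustration.
# A minimal simulation check of the standardization and summation properties.
np.random.seed(42)
mu_x, sigma_x, n = 2.0, 3.0, 5
draws = stats.norm.rvs(loc=mu_x, scale=sigma_x, size=(100000, n))
z = (draws[:, 0] - mu_x) / sigma_x   # standardized draws, approximately N(0, 1)
s = draws.sum(axis=1)                # sums of n i.i.d. normals, approximately N(n*mu_x, n*sigma_x^2)
print(np.round(z.mean(), 2), np.round(z.var(), 2))   # close to 0 and 1
print(np.round(s.mean(), 2), np.round(s.var(), 2))   # close to n*mu_x = 10 and n*sigma_x^2 = 45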
9.6.4 Chi-squared distribution
The chi-squared distribution is used when testing certain types of hypotheses in econometrics. The chi-squared distribution with \(m\) degrees of freedom, denoted by \(\chi^2_{m}\), is the distribution of the sum of \(m\) squared independent standard normal random variables. Its mean and variance are \(m\) and \(2m\), respectively.
Example 9.27 Let \(Z_1\), \(Z_2\) and \(Z_3\) be independent standard normal random variables. If we define \(Y=Z^2_1+Z^2_2+Z^2_3\), then \(Y\sim \chi^2_3\). We can use the scipy.stats.chi2()
function to compute the PDF and CDF of \(Y\). For example, we can compute \(P(Y\leq 5)\) as follows:
p = stats.chi2.cdf(5, df=3)
np.round(p, 4)
0.8282
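As a complementary check (not part of the original example), we can simulate \(Y=Z_1^2+Z_2^2+Z_3^2\) directly and compare the empirical probability with the value above; the seed and number of replications below are arbitrary.
# A minimal Monte Carlo sketch: simulate Y = Z1^2 + Z2^2 + Z3^2 and estimate P(Y <= 5).
np.random.seed(0)
z = stats.norm.rvs(size=(200000, 3))
y = (z**2).sum(axis=1)
np.round(np.mean(y <= 5), 4)   # close to stats.chi2.cdf(5, df=3) = 0.8282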
In Figure 9.16, we show the PDF and CDF of \(\chi^2_3\), \(\chi^2_4\), and \(\chi^2_{10}\). The figure indicates that the chi-squared distribution is right-skewed, and as \(m\) increases, the distribution becomes more symmetric.
# The plots of PDF and CDF of chi-squared distributions
# Define y values
y = np.arange(0, 20, 0.01)
# Create a figure and axis object for PDF
fig, ax = plt.subplots(figsize=(6, 3))
# Plot the PDFs
ax.plot(y, stats.chi2.pdf(y, df=3), linestyle='-', color='red', linewidth=1.5, label=r'$m=3$')
ax.plot(y, stats.chi2.pdf(y, df=4), linestyle='-', color='black', linewidth=1.5, label=r'$m=4$')
ax.plot(y, stats.chi2.pdf(y, df=10), linestyle='-', color='steelblue', linewidth=1.5, label=r'$m=10$')
# Customize the plot
ax.set_xlabel('y')
ax.set_ylabel('PDF')
ax.legend(loc='upper right', frameon=False)
# Show the plot
plt.show()
# Create a figure and axis object for CDF
fig, ax = plt.subplots(figsize=(6, 3))
# Plot the CDFs
ax.plot(y, stats.chi2.cdf(y, df=3), linestyle='-', color='red', linewidth=1.5, label=r'$m=3$')
ax.plot(y, stats.chi2.cdf(y, df=4), linestyle='-', color='black', linewidth=1.5, label=r'$m=4$')
ax.plot(y, stats.chi2.cdf(y, df=10), linestyle='-', color='steelblue', linewidth=1.5, label=r'$m=10$')
# Customize the plot
ax.set_xlabel('y')
ax.set_ylabel('CDF')
ax.legend(loc='lower right', frameon=False)
# Show the plot
plt.show()


9.6.5 Student t distribution
The Student t distribution with \(m\) degrees of freedom, denoted by \(t_m\), is defined as the distribution of the ratio of a standard normal random variable to the square root of an independently distributed chi-squared random variable with \(m\) degrees of freedom divided by \(m\). Specifically, let \(Z\sim N(0,1)\) and \(W\sim\chi^2_m\). Assuming that \(Z\) and \(W\) are independent, then we have \[ \frac{Z}{\sqrt{W/m}}\sim t_m. \]
In Figure 9.17, we compare the PDF and CDF of \(t_1\) and \(t_{10}\) with those of \(N(0,1)\). The \(t\) distribution has a bell shape similar to that of the normal distribution but exhibits more mass in the tails, resulting in a “fatter” bell shape compared to the normal distribution. In general, when \(m\) is \(30\) or more, the \(t\) distribution closely approximates the standard normal distribution. Formally, as \(m\) approaches infinity, the \(t\) distribution converges to the standard normal distribution.
# The plots of PDF and CDF of Student t distributions
# Define t values
t_values = np.arange(-4, 4, 0.01)
# Create a figure and axis object for PDF
fig, ax = plt.subplots(figsize=(6, 3))
# Plot the PDFs
ax.plot(t_values, stats.norm.pdf(t_values, 0, 1), linestyle='-', color='red', linewidth=1.5, label='N(0,1)')
ax.plot(t_values, stats.t.pdf(t_values, df=1), linestyle='-', color='steelblue', linewidth=1.5, label=r'$t_1$')
ax.plot(t_values, stats.t.pdf(t_values, df=10), linestyle='-', color='black', linewidth=1.5, label=r'$t_{10}$')
# Customize the plot
ax.set_xlabel('t')
ax.set_ylabel('PDF')
ax.set_ylim(0, max(stats.norm.pdf(t_values, 0, 1)) + 0.01)
ax.legend(loc='upper left', frameon=False)
# Show the plot
plt.show()
# Create a figure and axis object for CDF
fig, ax = plt.subplots(figsize=(6, 3))
# Plot the CDFs
ax.plot(t_values, stats.norm.cdf(t_values, 0, 1), linestyle='-', color='red', linewidth=1.5, label='N(0,1)')
ax.plot(t_values, stats.t.cdf(t_values, df=1), linestyle='-', color='steelblue', linewidth=1.5, label=r'$t_1$')
ax.plot(t_values, stats.t.cdf(t_values, df=10), linestyle='-', color='black', linewidth=1.5, label=r'$t_{10}$')
# Customize the plot
ax.set_xlabel('t')
ax.set_ylabel('CDF')
ax.set_ylim(0, 1.05)
ax.legend(loc='upper left', frameon=False)
# Show the plot
plt.show()


Example 9.28 Let \(Y\sim N(0,1)\) and \(X\sim t_{5}\). We want to compute \(P(Y\geq2)\) and \(P(X\geq2)\) using the scipy.stats
module. The following code chunk shows how to compute these probabilities:
p_y = stats.norm.sf(2, loc=0, scale=1)  # P(Y >= 2)
p_x = stats.t.sf(2, df=5)               # P(X >= 2)
np.round(p_y, 4), np.round(p_x, 4)
(0.0228, 0.051)
Note that stats.norm.sf()
and stats.t.sf()
compute the survival function, which equals \(1 - F_Y(y)\) and \(1 - F_X(x)\), respectively. We see that \(P(X\geq2)\) is larger than \(P(Y\geq2)\), which indicates that the \(t\) distribution has more mass in the tails than the normal distribution.
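To make the tail comparison concrete, the following sketch reports the upper \(2.5\%\) critical values of \(t_m\) for several degrees of freedom and compares them with the standard normal value of about \(1.96\); the chosen degrees of freedom are illustrative.
# A minimal sketch: t critical values approach the N(0,1) critical value as m grows.
for m in [1, 5, 10, 30, 100]:
    print(m, np.round(stats.t.ppf(0.975, df=m), 3))
print('N(0,1):', np.round(stats.norm.ppf(0.975), 3))   # about 1.96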
9.6.6 F distribution
The \(F\) distribution with \(m\) and \(n\) degrees of freedom is denoted by \(F_{m,n}\). Let \(W\sim\chi^2_m\) and \(V\sim\chi^2_n\). Assume that \(W\) and \(V\) are independent. Then, we have \[ \frac{W/m}{V/n}\sim F_{m,n}. \]
Let \(Y\sim F_{m,n}\). Then, the mean and variance of \(Y\) are given by \[ \begin{align*} &\E(Y) = \frac{n}{n-2},\quad n>2,\\ &\text{var}(Y) = \frac{2n^2(m+n-2)}{m(n-2)^2(n-4)},\quad n>4. \end{align*} \]
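As a quick sanity check (not part of the original text), the following sketch compares these formulas with the moments returned by scipy.stats.f; the degrees of freedom \(m=5\) and \(n=10\) are arbitrary, with \(n>4\) so that both formulas apply.
# A minimal sketch: compare the mean/variance formulas for F_{m,n} with scipy.stats.f.
m, n = 5, 10
mean, var = stats.f.stats(dfn=m, dfd=n, moments='mv')
print(np.round(mean, 4), np.round(n / (n - 2), 4))
print(np.round(var, 4), np.round(2 * n**2 * (m + n - 2) / (m * (n - 2)**2 * (n - 4)), 4))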
The \(F\) distribution is right-skewed, and as \(m\) and \(n\) increase, the distribution becomes more symmetric. In Figure 9.18, we provide the plots of PDF and CDF of \(F_{1,10}\), \(F_{5,50}\), and \(F_{10,100}\).
# The plots of PDF and CDF of F distributions
# Define u values
u = np.arange(0, 8, 0.01)
# Create a figure and axis object for PDF
fig, ax = plt.subplots(figsize=(6, 3))
# Plot the PDFs
ax.plot(u, stats.f.pdf(u, dfn=1, dfd=10), linestyle='-', color='red', linewidth=1.5, label='m = 1, n = 10')
ax.plot(u, stats.f.pdf(u, dfn=5, dfd=50), linestyle='-', color='steelblue', linewidth=1.5, label='m = 5, n = 50')
ax.plot(u, stats.f.pdf(u, dfn=10, dfd=100), linestyle='-', color='black', linewidth=1.5, label='m = 10, n = 100')
# Customize the plot
ax.set_xlabel('u')
ax.set_ylabel('PDF')
ax.set_ylim(0, 1)
ax.legend(loc='upper right', frameon=False)
# Show the plot
plt.show()
# Create a figure and axis object for CDF
fig, ax = plt.subplots(figsize=(6, 3))
# Plot the CDFs
ax.plot(u, stats.f.cdf(u, dfn=1, dfd=10), linestyle='-', color='red', linewidth=2, label='m = 1, n = 10')
ax.plot(u, stats.f.cdf(u, dfn=5, dfd=50), linestyle='-', color='steelblue', linewidth=2, label='m = 5, n = 50')
ax.plot(u, stats.f.cdf(u, dfn=10, dfd=100), linestyle='-', color='black', linewidth=2, label='m = 10, n = 100')
# Customize the plot
ax.set_xlabel('u')
ax.set_ylabel('CDF')
ax.set_ylim(0, 1.02)
ax.legend(loc='lower right', frameon=False)
# Show the plot
plt.show()


In econometrics, an important special case of the \(F\) distribution arises when the denominator degrees of freedom is large enough that \(F_{m,n}\) can be approximated by \(F_{m,\infty}\). The \(F_{m,\infty}\) distribution is the distribution of a chi-squared random variable with \(m\) degrees of freedom divided by \(m\); that is, if \(W\sim\chi^2_m\), then \(W/m\sim F_{m,\infty}\).
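The following sketch illustrates this approximation by comparing \(P(F_{m,n}\leq x)\) for increasing \(n\) with \(P(\chi^2_m/m\leq x)=P(\chi^2_m\leq mx)\); the values \(m=3\) and \(x=2\) are arbitrary illustrations.
# A minimal sketch: F_{m,n} probabilities approach the chi-squared-based limit as n grows.
m, x = 3, 2.0
for n in [10, 100, 10000]:
    print(n, np.round(stats.f.cdf(x, dfn=m, dfd=n), 4))
print('F(m, inf) approximation:', np.round(stats.chi2.cdf(m * x, df=m), 4))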
Example 9.29 Let \(Y\sim F_{2,10}\). We can compute the PDF and CDF of \(Y\) using the scipy.stats.f()
function. For example, we can compute \(P(Y\leq 3)\) as follows:
p = stats.f.cdf(3, dfn=2, dfd=10)
np.round(p, 4)
0.9046
This is the area under the PDF of \(F_{2,10}\) to the left of \(3\). In the following code chunk, we illustrate this area by shading the region under the PDF of \(F_{2,10}\) to the left of \(3\).
# The area under the PDF of F(2,10) to the left of 3
# Define y values and compute the PDF
y = np.arange(0, 8, 0.01)
pdf = stats.f.pdf(y, dfn=2, dfd=10)
# Create a figure and axis object
fig, ax = plt.subplots(figsize=(5, 3.5))
# Plot the PDF
ax.plot(y, pdf, linestyle='-', color='steelblue', linewidth=2)
ax.set_xlabel('y')
ax.set_ylabel('Density')
ax.set_xlim(0, 8)
# Define x values for the shaded area
x = np.arange(0, 3, 0.001)
y_fill = stats.f.pdf(x, dfn=2, dfd=10)
# Fill the polygon for the shaded area
ax.fill_between(x, y_fill, color='lightblue')
ax.axvline(3, linestyle='dashed', color='black', linewidth=1, alpha=0.5)
ax.axhline(0, color='black', linewidth=1, alpha=0.5)
ax.set_xticks([0, 1, 2, 3, 4, 5, 6, 7])
ax.set_xticklabels(['0', '1', '2', '3', '4', '5', '6', '7'])
plt.show()

9.7 Random sampling and the distribution of the sample average
If \(Y_1,\dots,Y_n\) are independent random variables from the same population distribution, then they are said to be independently and identically distributed (i.i.d.) random variables. We say that \(Y_1,\dots,Y_n\) constitute a random sample or an i.i.d. sample from the population distribution.
Assume that \(Y_1,Y_2,\dots,Y_n\) are i.i.d. random variables from a population distribution that has mean \(\mu_Y\) and variance \(\sigma_Y^2\). Consider the sample mean \(\bar{Y}=\frac{1}{n}\sum_{i=1}^nY_i\). Then, \[ \begin{align*} &\E(\bar{Y}) = \E\left(\frac{1}{n}\sum_{i=1}^n Y_i\right) = \frac{1}{n}\sum_{i=1}^n \E\left(Y_i\right) = \frac{1}{n}\sum_{i=1}^n \mu_Y = \mu_Y,\\ &\text{var}(\bar{Y}) = \text{var}\left(\frac{1}{n}\sum_{i=1}^n Y_i\right) = \frac{1}{n^2}\sum_{i=1}^n \text{var}\left(Y_i\right) = \frac{1}{n^2}\sum_{i=1}^n \sigma_Y^2 = \frac{\sigma_Y^2}{n}, \end{align*} \] where the second equality in the variance calculation uses the independence of \(Y_1,\dots,Y_n\).
These results indicate that \(\bar{Y}\) is an unbiased estimator of \(\mu_Y\), i.e., \(\E(\bar{Y})=\mu_Y\), and that \(\text{var}(\bar{Y})\) is inversely proportional to \(n\).
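A short simulation can make these two results concrete; the population \(N(5, 4)\), the sample size \(n=25\), and the number of replications below are arbitrary choices for illustration.
# A minimal sketch: simulate many samples and check E(Ybar) = mu_Y and var(Ybar) = sigma_Y^2/n.
np.random.seed(1)
mu_y, sigma_y, n, reps = 5.0, 2.0, 25, 50000
ybar = stats.norm.rvs(loc=mu_y, scale=sigma_y, size=(reps, n)).mean(axis=1)
print(np.round(ybar.mean(), 3))                    # close to mu_Y = 5
print(np.round(ybar.var(), 3), sigma_y**2 / n)     # close to sigma_Y^2/n = 0.16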
Definition 9.17 (Convergence in Probability) We say that the sample average \(\bar{Y}\) converges in probability to \(\mu_Y\) if the probability that \(\bar{Y}\) falls within the interval \((\mu_Y - c,\mu_Y + c)\) approaches \(1\) as \(n\) increases, for any constant \(c> 0\). The convergence in probability is denoted as \(\bar{Y}\xrightarrow{p}\mu_Y\). Formally, this can be expressed as \[ \lim_{n\to\infty}P(|\bar{Y}-\mu_Y|<c)=1. \]
If \(\bar{Y}\) converges in probability to \(\mu_Y\), then we say that \(\bar{Y}\) is a consistent estimator of \(\mu_Y\). In the next chapter, we will discuss the properties of estimators in more detail. The law of large numbers ensures that \(\bar{Y}\) is a consistent estimator of \(\mu_Y\).
Theorem 9.10 (The Law of Large Numbers) If \(Y_1,\dots, Y_n\) are independently and identically distributed with \(E(Y_i) = \mu_Y\) and \(\text{var}(Y_i)=\sigma^2_Y<\infty\), then \(\bar{Y}\xrightarrow{p}\mu_Y\).
To illustrate the law of large numbers, assume that \(Y_1, \dots, Y_n\) are i.i.d. random variables from a Bernoulli distribution with success probability \(p = 0.78\). We consider the sampling distribution of \(\bar{Y}\) for various sample sizes \(n\). According to the law of large numbers, \(\bar{Y}\) should converge in probability to \(0.78\) as \(n\) becomes larger. Figure 9.20 shows the sampling distribution of \(\bar{Y}\) for \(n \in \{2, 5, 25, 100\}\). We observe that the variance of the sampling distribution of \(\bar{Y}\) decreases as \(n\) increases, causing the distribution to become more tightly concentrated around its mean, \(\mu_Y = 0.78\).
# The sampling distribution of the sample average
# Set parameters
reps = 1000  # number of repetitions
sample_sizes = [2, 5, 25, 100]
p = 0.78  # probability of success
# Set seed for reproducibility
np.random.seed(123)
# Create a 2x2 subplot array
fig, axs = plt.subplots(2, 2)
# Flatten the array of axes for easy iteration
axs = axs.flatten()
# Outer loop (loop over the sample sizes)
for i, n in enumerate(sample_sizes):
    sample_means = np.zeros(reps)  # Initialize the vector of sample means
    # Inner loop over repetitions
    for j in range(reps):
        x = stats.bernoulli.rvs(p, size=n)
        sample_means[j] = np.mean(x)
    # Plot histogram
    ax = axs[i]
    ax.hist(sample_means, density=True, color='steelblue')
    ax.set_xlim(0, 1)
    ax.set_xlabel(f'n = {n}')
# Adjust layout
plt.tight_layout()
# Show the plot
plt.show()

The distribution of \(\bar{Y}\) is called the sampling distribution of \(\bar{Y}\). If we assume that \(Y_i\sim N(\mu_Y,\sigma_Y^2)\) for \(i=1,\dots,n\), then the sampling distribution of \(\bar{Y}\) is exactly \(N(\mu_Y, \sigma_Y^2/n)\). If we do not make this normality assumption, the sampling distribution of \(\bar{Y}\) can still be approximated when the sample size is large. We call the approximating distribution the asymptotic distribution.
The central limit theorem (CLT) says that, under general conditions, the distribution of \(\bar{Y}\) is well approximated by a normal distribution when \(n\) is large.
Theorem 9.11 (The Central Limit Theorem) If \(Y_1,\dots, Y_n\) are independently and identically distributed with \(E(Y_i) = \mu_Y\) and \(\text{var}(Y_i)=\sigma^2_Y<\infty\), then \(\frac{\bar{Y}-\mu_Y}{\sqrt{\text{var}(\bar{Y})}}=\frac{\bar{Y}-\mu_Y}{\sigma_Y/\sqrt{n}}\) is well approximated by \(N(0,1)\).
The CLT implies that \(\bar{Y}\) is approximately distributed as \(N(\mu_Y,\sigma^2_Y/n)\) when \(n\) is large. To illustrate the CLT, assume that \(Y_1, \dots, Y_n\) are i.i.d. random variables from a Bernoulli distribution with success probability \(p = 0.78\). We consider the sampling distribution of \(\frac{\bar{Y}-\mu_Y}{\sigma_Y/\sqrt{n}}\) for various sample sizes \(n\), where \(\sigma^2_Y=p(1-p)\). Figure 9.21 shows the histogram of \(\frac{\bar{Y}-\mu_Y}{\sigma_Y/\sqrt{n}}\) for \(n \in \{50, 100, 500, 1000\}\), along with the PDF of the standard normal distribution. The figure indicates that the distribution of \(\frac{\bar{Y}-\mu_Y}{\sigma_Y/\sqrt{n}}\) is more closely approximated by the standard normal distribution as \(n\) increases.
# The sampling distribution of the standardized sample average
# Set parameters
reps = 1000
p = 0.78  # probability of success
sample_sizes = [50, 100, 500, 1000]
# Set seed for reproducibility
np.random.seed(123)
# Create a 2x2 subplot array
fig, axs = plt.subplots(2, 2)
# Flatten the array of axes for easy iteration
axs = axs.flatten()
# Outer loop (loop over the sample sizes)
for i, n in enumerate(sample_sizes):
    std_sample_means = np.zeros(reps)  # Initialize the vector of standardized sample means
    # Inner loop over repetitions
    for j in range(reps):
        x = stats.bernoulli.rvs(p, size=n)
        std_sample_means[j] = np.sqrt(n) * (np.mean(x) - p) / np.sqrt(p*(1-p))
    # Plot histogram and overlay the N(0,1) density
    ax = axs[i]
    ax.hist(std_sample_means, bins=40, density=True, color='steelblue')
    # Overlay N(0,1) density
    xmin, xmax = -3, 3
    x_vals = np.linspace(xmin, xmax, 100)
    ax.plot(x_vals, stats.norm.pdf(x_vals), 'k', linewidth=1.5)
    ax.set_xlim(-3, 3)
    ax.set_ylim(0, 1)
    ax.set_xlabel(f'n = {n}')
# Adjust layout
plt.tight_layout()
# Show the plot
plt.show()

In Chapter 25, we provide a more detailed discussion of asymptotic analysis, including probability inequalities, Slutsky’s theorem, and the continuous mapping theorem, which are useful results for applying the law of large numbers and the central limit theorem.
Formally, probability is a set function defined on the \(\sigma\)-algebra of the subsets of the sample space \(\Omega\). A \(\sigma\)-algebra is a collection of subsets of \(\Omega\) that includes the empty set and is closed under complementation and countable unions. See Probability Measure on Wikipedia for more details.↩︎