7  Introduction to Pandas

7.1 Introduction

In this chapter, we introduce the pandas module for reading data from various formats, exporting data, and performing data analysis tasks such as cleaning, filtering, transforming, handling missing data, merging, and reshaping. This module provides two data structures, Series and DataFrame, for storing and manipulating data.

import numpy as np
import pandas as pd
import os

7.2 Importing and Exporting Data

All data-reading functions in pandas load data into a DataFrame. A DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). We introduce the DataFrame in more detail below. The pandas module provides several functions for reading data from various formats, including delimited files, fixed-width formatted files, CSV files, Excel files, Stata files, and SQL databases:

  • read_table: Reads a general delimited file into a DataFrame.
  • read_fwf: Reads a fixed-width formatted file into a DataFrame.
  • read_csv: Reads a comma-separated values (CSV) file into a DataFrame.
  • read_excel: Reads an Excel file into a DataFrame (supports both .xls and .xlsx formats).
  • read_stata: Reads a Stata .dta file into a DataFrame.
  • read_sql: Reads data from a SQL database into a DataFrame.

Each function offers a range of arguments to customize data import. For example, notable arguments for the read_csv function are listed below; several of them are illustrated after we import a dataset.

  • delimiter: The delimiter used to separate values (an alias of sep). The default is ,.
  • delim_whitespace: A boolean indicating whether the delimiter is whitespace (space or tab).
  • header: The row number to use for the column names. The default, 'infer', uses the first row (row 0) unless names is provided.
  • skiprows: An integer specifying the number of rows at the top of the file to skip before reading (default is 0).
  • skipfooter: An integer specifying the number of rows at the bottom of the file to skip when reading (default is 0).
  • index_col: An integer or column name indicating which column to use as the index. If not provided, a numeric index is generated.
  • names: A list of column names to use in place of any found in the file. If the file contains a header row, header=0 must also be passed so that it is overridden rather than read as data.
  • nrows: An integer specifying the maximum number of rows to read. This is useful for reading a subset of the file.
  • usecols: A list of integers or column names indicating which columns to retain.

We consider a dataset on California school districts, provided in the caschool.csv file. The dataset was collected in 1999 and covers n=420 school districts, with 18 district-level variables (7,560 data values in total). This dataset can be imported in the following way:

# Importing the dataset
CAschool = pd.read_csv('data/caschool.csv')
type(CAschool)
pandas.core.frame.DataFrame

The type of the CAschool object is DataFrame. We can display the first five observations in the dataset by calling the head method:

# Displaying the first five observations
CAschool.head() 
Observation Number dist_cod county district gr_span enrl_tot teachers calw_pct meal_pct computer testscr comp_stu expn_stu str avginc el_pct read_scr math_scr
0 1 75119 Alameda Sunol Glen Unified KK-08 195 10.900000 0.510200 2.040800 67 690.799988 0.343590 6384.911133 17.889910 22.690001 0.000000 691.599976 690.000000
1 2 61499 Butte Manzanita Elementary KK-08 240 11.150000 15.416700 47.916698 101 661.200012 0.420833 5099.380859 21.524664 9.824000 4.583333 660.500000 661.900024
2 3 61549 Butte Thermalito Union Elementary KK-08 1550 82.900002 55.032299 76.322601 169 643.599976 0.109032 5501.954590 18.697226 8.978000 30.000002 636.299988 650.900024
3 4 61457 Butte Golden Feather Union Elementary KK-08 243 14.000000 36.475399 77.049202 85 647.700012 0.349794 7101.831055 17.357143 8.978000 0.000000 651.900024 643.500000
4 5 61523 Butte Palermo Union Elementary KK-08 1335 71.500000 33.108601 78.427002 171 640.849976 0.128090 5235.987793 18.671329 9.080333 13.857677 641.799988 639.900024
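
To illustrate a few of the import arguments listed above, the following sketch reads only three of the columns for the first ten rows:

# Reading a subset of the file with usecols and nrows
sub_df = pd.read_csv('data/caschool.csv',
                     usecols=['district', 'str', 'testscr'], # retain three columns
                     nrows=10)                               # read only the first ten rows
sub_df.shape
(10, 3)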

There are several methods for exporting/saving a DataFrame in a variety of formats: the to_excel method exports the DataFrame to an Excel file, to_csv to a CSV file, to_stata to a Stata .dta file, and to_sql writes the DataFrame to a SQL database table. Each method allows customization through several arguments. For example, notable arguments for the to_excel method include the following:

  • sheet_name: The name of the sheet that will contain the DataFrame (default is 'Sheet1').
  • na_rep: A string representation for missing data (default is an empty string).
  • float_format: A string format for floating-point numbers (default is None).
  • columns: A list of column names to write to the file. If not provided, all columns are written.
  • header: Boolean indicating whether to write out the column names (default is True).
  • index: Boolean indicating whether to write out row names (default is True).
  • startrow: Pythonic row index for the starting position to write data (default is 0).
  • startcol: Pythonic column index for the starting position to write data (default is 0).

In the following example, we save the CAschool DataFrame to the California_school_data.xlsx file:

# Exporting the DataFrame to an Excel file
CAschool.to_excel('data/California_school_data.xlsx', header = True, index = False)
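
The other export methods work analogously. For example, a minimal to_csv sketch (the filename here is chosen for illustration):

# Exporting the DataFrame to a CSV file without the row index
CAschool.to_csv('data/California_school_data.csv', index=False)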

7.3 Series

A Series is equivalent to a one-dimensional array and can contain various data types, including floats, strings, and integers. In many ways, a Series behaves similarly to a NumPy array and can be considered a column in a DataFrame. We can initialize a Series object using array, list, tuple, or dictionary objects. In the following example, we create a Series from a NumPy array:

# From a NumPy array
a = np.array([0.1, 1.2, 2.3, 3.4, 4.5])
s1 = pd.Series(a)
type(s1)
s1
pandas.core.series.Series
0    0.1
1    1.2
2    2.3
3    3.4
4    4.5
dtype: float64

As shown in the above example, the type of s1 is Series. The index of the Series is automatically generated as integers starting from 0.

In the following examples, we create Series from a tuple and a dictionary, respectively. Notice that the index is automatically generated as integers starting from 0 in the first example, while in the second example, the keys of the dictionary are used as indices.

# Creating a series from a tuple
a = (1, 2, 3, 4, 'abs', np.nan)
s2 = pd.Series(a)
s2
0      1
1      2
2      3
3      4
4    abs
5    NaN
dtype: object
# Creating a series from a dictionary
s3 = pd.Series({'a': 0.1, 'b': 1.2, 'c': 2.3})
s3
a    0.1
b    1.2
c    2.3
dtype: float64

Finally, we can create a Series from a list. The index is automatically generated as integers starting from 0, similar to the first example with a NumPy array:

# Creating a series from a list
c = [0.1, 1.2, 2.3, 3.4, 4.5]
s4 = pd.Series(c)
s4
0    0.1
1    1.2
2    2.3
3    3.4
4    4.5
dtype: float64

The elements of a Series can be selected using the [] operator by specifying indices. We can use the index either explicitly (by label) or implicitly (by position) to slice a Series. Note that explicit label-based slices include the endpoint, while implicit positional slices exclude it. Consider the following examples:

s = pd.Series([0.1, 1.2, 2.3, 3.4, 4.5], index=['a', 'b', 'c', 'd', 'e'])
s
a    0.1
b    1.2
c    2.3
d    3.4
e    4.5
dtype: float64
# Selecting elements by index
s['a']
0.1
# Selecting elements by multiple indices
s[['a', 'c']]  
a    0.1
c    2.3
dtype: float64
# Slicing elements by explicit index
s['a':'c']  
a    0.1
b    1.2
c    2.3
dtype: float64
# Slicing elements by implicit index
s[:2] 
a    0.1
b    1.2
dtype: float64

Logical conditions can also be used to select elements of a Series. For example, we can select elements based on their values or indices. Consider the following examples:

# Selecting elements based on value
s[s > 2]  # Elements that are greater than 2
c    2.3
d    3.4
e    4.5
dtype: float64
# Selecting elements based on value range
s[(s >= 1) & (s <= 4)]  # Elements between 1 and 4 (inclusive)
b    1.2
c    2.3
d    3.4
dtype: float64
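
Logical conditions on the index work in the same way; for example, using the isin method of the index:

# Selecting elements based on the index
s[s.index.isin(['a', 'e'])]
a    0.1
e    4.5
dtype: float64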

Mathematical operations on Series objects are performed element-wise, with elements matched on the index. For index labels that appear in only one of the two Series, the result is NaN. Consider the following examples:

# Element-wise operations on Series
s1 = pd.Series({'a': 1, 'b': 2, 'c': 3})
s2 = s1**3
s2
a     1
b     8
c    27
dtype: int64
s1 = pd.Series({'a': 0.1, 'b': 1.2, 'c': 2.3})
s2 = pd.Series({'c': 10, 'd': 1.2, 'e': 2.3})
# Element-wise multiplication
r1 = s1 * s2 
r1
a     NaN
b     NaN
c    23.0
d     NaN
e     NaN
dtype: float64
# Element-wise addition
r2 = s1 + s2
r2
a     NaN
b     NaN
c    12.3
d     NaN
e     NaN
dtype: float64
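
If the NaN values produced by non-matching indices are not desired, the arithmetic methods add, sub, mul, and div accept a fill_value that is used in place of a missing entry:

# Element-wise addition with a fill value for non-matching indices
s1.add(s2, fill_value=0)
a     0.1
b     1.2
c    12.3
d     1.2
e     2.3
dtype: float64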

In Table 7.1, we list some useful methods for Series objects.

Table 7.1: Some methods for the Series objects
Method Description
values/index Returns the values as an array / returns the index
head()/tail() Displays the first/last 5 rows of a Series
isnull()/notnull() Returns a boolean same-sized object indicating whether the values are NA/not NA
loc[]/iloc[] Label-based access / integer-position-based access
describe() Returns a summary of basic statistics
unique() Returns the unique values of the Series object
nunique() Returns the number of unique elements in the Series object
drop() Returns a Series with specified index labels removed
dropna() Returns a new Series with missing values removed
fillna() Fills all null values in a Series with a specified value
append() Appends one Series to another (removed in pandas 2.0; use pd.concat instead)
replace(list, values) Replaces a list of values in a Series with new values
update(series) Replaces values in a Series with those in another Series, matching on the index

We illustrate some of these methods in the following examples:

# Setting the index of a Series
s1 = pd.Series([1, 3, 5, 6, np.nan, 'cat', 'abc', 10, 12, 5])
s1.index = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'k'] 
# Displaying the first five observations
s1.head()
a      1
b      3
c      5
d      6
e    NaN
dtype: object
# Display not-null values
s1.notnull()  
a     True
b     True
c     True
d     True
e    False
f     True
g     True
h     True
i     True
k     True
dtype: bool
# Display null values
s1.isnull() 
a    False
b    False
c    False
d    False
e     True
f    False
g    False
h    False
i    False
k    False
dtype: bool
# Accessing elements using loc
s1.loc['e']  # Element with index 'e'
nan
# Accessing elements using iloc
s1.iloc[4]  # Element with implicit index 4
nan
# Remove missing values
s1.dropna() # Remove nan values
a      1
b      3
c      5
d      6
f    cat
g    abc
h     10
i     12
k      5
dtype: object
# Fill missing values with a specified value
s1 = pd.Series([1, 3, 5, 6, np.nan, 'cat', 'abc', 10, 12, 5])
s1.index = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'k'] # Set index
s1.fillna(-99.0)
a       1
b       3
c       5
d       6
e   -99.0
f     cat
g     abc
h      10
i      12
k       5
dtype: object
# Summary statistics
s1 = pd.Series(np.arange(10.0, 20.0))
s1.describe() 
count    10.00000
mean     14.50000
std       3.02765
min      10.00000
25%      12.25000
50%      14.50000
75%      16.75000
max      19.00000
dtype: float64
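
As a further illustration of the methods in Table 7.1, consider unique, nunique, and replace:

# Unique values and replacement
s2 = pd.Series([1, 3, 5, 6, 5])
s2.unique()   # distinct values, returned as an array
array([1, 3, 5, 6])
s2.nunique()  # number of distinct values
4
s2.replace([5, 6], [50, 60])  # replace 5 with 50 and 6 with 60
0     1
1     3
2    50
3    60
4    50
dtype: int64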

7.4 DataFrames

A DataFrame is a two-dimensional tabular data structure composed of multiple Series. It is organized into rows and columns, similar to a spreadsheet. Each column in a DataFrame is a Series, and each row can be accessed by its index. A DataFrame can be created from various data structures, including lists, dictionaries, NumPy arrays, and Series.

In the following example, we create a DataFrame from a dictionary of Series objects. The keys of the dictionary become the column names, and the values are the Series that form the columns of the DataFrame:

# Creating DataFrame from Series
s1 = pd.Series(np.arange(0,5.0))
s2 = pd.Series(np.arange(1.0,6.0))
s3 = pd.Series(np.arange(0,3.0))
df = pd.DataFrame({'one': s1, 'two': s2, 'three': s3})
df
one two three
0 0.0 1.0 0.0
1 1.0 2.0 1.0
2 2.0 3.0 2.0
3 3.0 4.0 NaN
4 4.0 5.0 NaN

Since s3 contains only three elements, the last two rows of column three are filled with NaN. In the following example, we create a DataFrame from an array; the column names are then set using the columns attribute.

# Creating DataFrame from an array
df = pd.DataFrame(np.array([[1,2],[3,4]]))
df.columns = ['col1','col2'] # Set column names
df
col1 col2
0 1 2
1 3 4
# Return column names of CAschool DataFrame
CAschool.columns 
Index(['Observation Number', 'dist_cod', 'county', 'district', 'gr_span',
       'enrl_tot', 'teachers', 'calw_pct', 'meal_pct', 'computer', 'testscr',
       'comp_stu', 'expn_stu', 'str', 'avginc', 'el_pct', 'read_scr',
       'math_scr'],
      dtype='object')

The student-teacher ratio is denoted by str in CAschool. This column can be selected via CAschool.str or CAschool['str'] (attribute access works only when the column name does not clash with an existing DataFrame attribute or method and contains no spaces). To select multiple columns, such as str, read_scr, and math_scr, we can use CAschool[['str', 'read_scr', 'math_scr']]. Consider the following examples:

# Selecting multiple columns
CAschool[['str', 'read_scr', 'math_scr']].head(n=5)
str read_scr math_scr
0 17.889910 691.599976 690.000000
1 21.524664 660.500000 661.900024
2 18.697226 636.299988 650.900024
3 17.357143 651.900024 643.500000
4 18.671329 641.799988 639.900024
# Selecting multiple columns and displaying summary statistics
CAschool[['str', 'read_scr', 'math_scr']].describe().T # Summary statistics
count mean std min 25% 50% 75% max
str 420.0 19.640425 1.891812 14.000000 18.582360 19.723208 20.871815 25.799999
read_scr 420.0 654.970477 20.107980 604.500000 640.400024 655.750000 668.725006 704.000000
math_scr 420.0 653.342619 18.754202 605.400024 639.375015 652.449982 665.849991 709.500000

We can slice rows using standard numerical slices, as in CAschool[0:5], which returns the first five rows for all columns. However, standard slicing cannot select rows and columns at the same time. Instead, we can use loc[row_selector, column_selector]. Note that label-based slicing with loc includes both endpoints, so 0:5 below returns six rows:

# Selecting specific rows and columns using loc
CAschool.loc[0:5, ['str', 'read_scr', 'math_scr']]
str read_scr math_scr
0 17.889910 691.599976 690.000000
1 21.524664 660.500000 661.900024
2 18.697226 636.299988 650.900024
3 17.357143 651.900024 643.500000
4 18.671329 641.799988 639.900024
5 21.406250 605.700012 605.400024

We can also select rows and columns using the iloc method, which allows for integer-location-based indexing. Consider the following example:

# Selecting specific rows and columns using iloc
CAschool.iloc[0:2,0:5] # Returns the first two rows and the first five columns
Observation Number dist_cod county district gr_span
0 1 75119 Alameda Sunol Glen Unified KK-08
1 2 61499 Butte Manzanita Elementary KK-08

Rows can also be selected using logical selection with a Boolean array. In the following example, we select districts with student-teacher ratios less than 15:

# Selecting rows based on a condition
subset = CAschool['str'] < 15
CAschool.loc[subset, ['county', 'district', 'str', 'read_scr', 'math_scr']]
county district str read_scr math_scr
79 Siskiyou McCloud Union Elementary 14.000000 634.599976 636.599976
228 Nevada Chicago Park Elementary 14.201763 657.799988 655.200012
315 Siskiyou Fort Jones Union Elementary 14.705882 666.900024 666.799988
408 San Mateo Woodside Elementary 14.542136 701.299988 689.299988

We can add columns to a DataFrame in two ways: (i) by assigning to a new column name, and (ii) by using the insert method. Consider the following examples:

# Adding a new column by assignment
df = CAschool[['str', 'read_scr', 'math_scr']].copy()
df['logstr'] = np.log(df['str']) # Adding the logstr column
df.head()
str read_scr math_scr logstr
0 17.889910 691.599976 690.000000 2.884237
1 21.524664 660.500000 661.900024 3.069199
2 18.697226 636.299988 650.900024 2.928375
3 17.357143 651.900024 643.500000 2.854004
4 18.671329 641.799988 639.900024 2.926989
# Adding a new column using the insert method
df = CAschool[['str', 'read_scr', 'math_scr']].copy()
df.insert(1, 'logstr', np.log(df['str'])) # Adding the logstr column
df.head()
str logstr read_scr math_scr
0 17.889910 2.884237 691.599976 690.000000
1 21.524664 3.069199 660.500000 661.900024
2 18.697226 2.928375 636.299988 650.900024
3 17.357143 2.854004 651.900024 643.500000
4 18.671329 2.926989 641.799988 639.900024

In the case of the insert method, the syntax is insert(location, column_name, series), where location specifies the column position, column_name is the name of the column to be added, and series is the data to be inserted.

There are three ways to delete columns from a DataFrame: (i) the drop(labels, axis=1) method (with axis=0, rows are dropped instead), (ii) the pop(column) method, and (iii) the del keyword. Consider the following examples:

# Using the drop method to remove a column
df = CAschool[['str', 'read_scr', 'math_scr']].copy()
# Drop the str column with the drop method
df.drop('str', axis=1).head()
read_scr math_scr
0 691.599976 690.000000
1 660.500000 661.900024
2 636.299988 650.900024
3 651.900024 643.500000
4 641.799988 639.900024
df = CAschool[['str', 'read_scr', 'math_scr']].copy()
# Delete the str column using del
del df['str']
df.head()
read_scr math_scr
0 691.599976 690.000000
1 660.500000 661.900024
2 636.299988 650.900024
3 651.900024 643.500000
4 641.799988 639.900024

Note that the drop method creates a new DataFrame unless we specify inplace=True, while del modifies the DataFrame in place.
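
The pop method, option (ii) above, removes a column in place and returns it as a Series:

# Using the pop method to remove a column
df = CAschool[['str', 'read_scr', 'math_scr']].copy()
str_col = df.pop('str') # df now has two columns; str_col holds the removed column
df.head()
read_scr math_scr
0 691.599976 690.000000
1 660.500000 661.900024
2 636.299988 650.900024
3 651.900024 643.500000
4 641.799988 639.900024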

Finally, in Table 7.2, we list some useful methods for DataFrame objects.

Table 7.2: Some useful methods for DataFrame objects
Method Description
drop() Drops specified labels from rows or columns
dropna() Removes missing (NaN) values
drop_duplicates() Removes duplicate rows
values Retrieves the data as a NumPy array
index Returns the index of the DataFrame
fillna() Fills NA/NaN values with specified values
T/transpose() Transposes the rows and columns of a DataFrame
sort_values() Sorts by the values along either axis
sort_index() Sorts a DataFrame by its index labels
count() Counts non-NA cells for each column or row
describe() Generates descriptive statistics
value_counts() Returns a Series containing counts of unique values
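
As a quick illustration of drop_duplicates, consider a small toy DataFrame; duplicate rows are removed and the first occurrence of each is kept:

# Removing duplicate rows
df = pd.DataFrame({'x': [1, 1, 2, 2, 3], 'y': ['a', 'a', 'b', 'b', 'c']})
df.drop_duplicates()
x y
0 1 a
2 2 b
4 3 c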

7.5 Reshape and Merge DataFrames

Data are often stored in wide form, i.e., with one column per time period. For example, consider the following state per capita income dataset provided by the Bureau of Economic Analysis.

# Importing the state per capita income dataset
income_wide = pd.read_csv('data/state_income.csv', header = 0)
income_wide.head()
state pinc.1993 pinc.1994 pinc.1995 pinc.1996 pinc.1997 pinc.1998 pinc.1999 pinc.2000 pinc.2001 ... pinc.2006 pinc.2007 pinc.2008 pinc.2009 pinc.2010 pinc.2011 pinc.2012 pinc.2013 pinc.2014 pinc.2015
0 Alabama 18129 18977 19892 20630 21516 22692 23352 24338 25104 ... 31362 32598 33353 32608 33752 34997 35884 36107 37266 38644
1 Alaska 25036 25713 26399 26953 28185 29220 29892 31974 33517 ... 41058 43861 47749 47069 49438 52390 53627 52723 55833 57583
2 Arizona 18950 19869 20753 21737 22804 24176 24813 26235 26803 ... 34686 35789 35563 33418 33635 34968 36123 36602 38226 39676
3 Arkansas 16956 17758 18602 19458 20187 21222 21861 22762 23840 ... 29438 31070 31940 31153 31927 33863 36240 36235 38260 39343
4 California 23013 23581 24629 25912 27171 29100 30640 33364 34043 ... 42088 43629 43890 42044 43634 46170 48798 49277 52324 55758

5 rows × 24 columns

However, the wide form is often not what econometric modules expect, so we reshape the data to long form using the wide_to_long function. In the call below, stubnames='pinc' gives the common prefix of the wide columns, i='state' names the identifier column, j='year' names the new variable created from the column suffixes, and sep='.' is the character separating the stub from the suffix.

# Reshaping the DataFrame from wide to long format
income_long = pd.wide_to_long(income_wide, stubnames='pinc', i='state', j='year', sep='.')
income_long.head()
pinc
state year
Alabama 1993 18129
Alaska 1993 25036
Arizona 1993 18950
Arkansas 1993 16956
California 1993 23013

Notice that the DataFrame has a single column, named pinc. The state and year variables are used to form a MultiIndex for the DataFrame. We can use the reset_index method to turn them back into data columns.

# Resetting the index to convert the MultiIndex to columns
income_long = income_long.reset_index()
income_long.head()
state year pinc
0 Alabama 1993 18129
1 Alaska 1993 25036
2 Arizona 1993 18950
3 Arkansas 1993 16956
4 California 1993 23013

The DataFrame is currently sorted by year and then state (year varies slowest). Let us sort it by state and then year (state varies slowest), using the sort_values method.

# Sorting the DataFrame by state and year
income_long = income_long.sort_values(by = ["state", "year"])
income_long.head()
state year pinc
0 Alabama 1993 18129
50 Alabama 1994 18977
100 Alabama 1995 19892
150 Alabama 1996 20630
200 Alabama 1997 21516

Notice that the index order has changed. We reset the index again, which adds the current index as a data column named index, and then drop that column from the DataFrame.

# Resetting the index and dropping the old index column
income_long = income_long.reset_index().drop("index", axis = 1)
income_long.head()
state year pinc
0 Alabama 1993 18129
1 Alabama 1994 18977
2 Alabama 1995 19892
3 Alabama 1996 20630
4 Alabama 1997 21516

Finally, we will check the data type of each column.

# Checking the data types of the columns
income_long.dtypes
state    object
year      int64
pinc      int64
dtype: object
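
The reverse reshaping, from long back to wide, can be done with the pivot method. A minimal sketch (50 states by 23 years):

# Reshaping from long back to wide format
income_wide2 = income_long.pivot(index='state', columns='year', values='pinc')
income_wide2.shape
(50, 23)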

Next, we will consider the state expenditure data provided by the Bureau of Economic Analysis. The data cover the period from 1993 to 2015, with one CSV file for each year. We will first generate a list of filenames and then sort the list.

# Generating a list of filenames for the expenditure data
file_list = os.listdir('data/expenditure/')
file_list.sort()
file_list
['expnd_1993.csv',
 'expnd_1994.csv',
 'expnd_1995.csv',
 'expnd_1996.csv',
 'expnd_1997.csv',
 'expnd_1998.csv',
 'expnd_1999.csv',
 'expnd_2000.csv',
 'expnd_2001.csv',
 'expnd_2002.csv',
 'expnd_2003.csv',
 'expnd_2004.csv',
 'expnd_2005.csv',
 'expnd_2006.csv',
 'expnd_2007.csv',
 'expnd_2008.csv',
 'expnd_2009.csv',
 'expnd_2010.csv',
 'expnd_2011.csv',
 'expnd_2012.csv',
 'expnd_2013.csv',
 'expnd_2014.csv',
 'expnd_2015.csv']

We can now stack these datasets in a for loop; however, we must be aware of the non-uniformity of column names across years. Although the order of the columns is uniform across years, their names are not, so we rename the columns after importing each file. Rather than growing the DataFrame one year at a time, we collect the yearly frames in a list and concatenate them once:

# Importing the expenditure data from multiple CSV files
tmpnames = ['state','year','tot_revenue','taxes','tot_expnd','education','public_welfare','hospital','health']
frames = []
for file in file_list:
    tmpframe = pd.read_csv('data/expenditure/' + file, header=0)
    tmpframe.columns = tmpnames # enforce uniform column names across years
    frames.append(tmpframe)
expenditure = pd.concat(frames, axis=0) # stack all years at once

expenditure.head()
state year tot_revenue taxes tot_expnd education public_welfare hospital health
0 Alabama 1993 11,389,335 4,639,784 10,242,374 1,920,765 2,006,829 887,835 395,901
1 Alaska 1993 7,358,035 2,227,283 5,423,008 604,677 454,430 35,105 103,541
2 Arizona 1993 10,843,407 5,281,816 9,782,725 1,279,055 1,900,329 71,652 357,130
3 Arkansas 1993 6,446,052 2,942,581 6,014,728 1,077,951 1,310,779 304,906 136,990
4 California 1993 108,597,722 49,418,058 104,567,486 9,825,944 10,909,104 2,818,232 1,744,769

The expenditure dataset is sorted by year and then state (year varies slowest). Let us sort it by state and then year (state varies slowest).

# Sorting the expenditure DataFrame by state and year
expenditure = expenditure.sort_values(by = ["state", "year"])
expenditure = expenditure.reset_index().drop("index", axis = 1)
expenditure.head()
state year tot_revenue taxes tot_expnd education public_welfare hospital health
0 Alabama 1993 11,389,335 4,639,784 10,242,374 1,920,765 2,006,829 887,835 395,901
1 Alabama 1994 11,599,362 4,767,108 10,815,221 2,101,540 2,157,831 823,194 470,960
2 Alabama 1995 12448670 5077827 11634629 2260473 2282961 811433 474424
3 Alabama 1996 12741148 5257771 12126587 2240613 2325418 815698 500625
4 Alabama 1997 14,007,883 5,484,161 12,944,867 5,175,279 2,537,627 882,613 566,651

We check the data types of all columns to make sure that the expenditure variables are numeric.

# Checking the data types of the columns in the expenditure DataFrame
expenditure.dtypes
state             object
year               int64
tot_revenue       object
taxes             object
tot_expnd         object
education         object
public_welfare    object
hospital          object
health            object
dtype: object

Unfortunately, dtype is object for all expenditure variables. This is caused by the comma thousands separators in those columns. We convert these columns to numeric by removing the commas and then casting to an integer type. (Alternatively, passing thousands=',' to read_csv handles the separators at import time.)

# Converting expenditure columns to numeric by removing commas
columns = ['tot_revenue','taxes','tot_expnd','education','public_welfare','hospital','health']
expenditure[columns] = expenditure[columns].replace(",", "", regex = True).astype('int')
expenditure.dtypes
state             object
year               int64
tot_revenue        int32
taxes              int32
tot_expnd          int32
education          int32
public_welfare     int32
hospital           int32
health             int32
dtype: object

We can now merge the income and expenditure datasets on state and year. The pandas merge() and join() functions combine data following relational-algebra-style join rules. If the key columns have the same names in the left and right DataFrames, they can be passed to the on argument. If the key columns are named differently, we specify them with the left_on and right_on arguments for the left and right DataFrames, respectively. Similarly, if the left and right DataFrames have their keys set as indices, we can use the left_index=True and right_index=True arguments.

# Merging the income and expenditure DataFrames on state and year
final_df = pd.merge(income_long, expenditure, on = ["state", "year"])
final_df.head()
state year pinc tot_revenue taxes tot_expnd education public_welfare hospital health
0 Alabama 1993 18129 11389335 4639784 10242374 1920765 2006829 887835 395901
1 Alabama 1994 18977 11599362 4767108 10815221 2101540 2157831 823194 470960
2 Alabama 1995 19892 12448670 5077827 11634629 2260473 2282961 811433 474424
3 Alabama 1996 20630 12741148 5257771 12126587 2240613 2325418 815698 500625
4 Alabama 1997 21516 14007883 5484161 12944867 5175279 2537627 882613 566651
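
If the key columns were named differently in the two DataFrames, say st instead of state in the expenditure data, the same merge could be written with left_on and right_on. A hypothetical sketch:

# Hypothetical: merging when the key columns have different names
expnd2 = expenditure.rename(columns={'state': 'st'})
merged = pd.merge(income_long, expnd2, left_on=['state', 'year'], right_on=['st', 'year'])
# the result keeps both key columns, state and st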

7.6 Further Readings

In this chapter, we introduced commonly used data structures and data manipulation tools from the pandas module that are useful for econometric analysis. For further reading on pandas data structures and manipulation tools, we refer the reader to Sheppard (2021), McKinney (2022), and VanderPlas (2023).