import numpy as np
import pandas as pd
import os
7 Introduction to Pandas
7.1 Introduction
In this chapter, we introduce the pandas module for reading data from various formats, exporting data, and performing data analysis tasks such as cleaning, filtering, transforming, handling missing data, merging, and reshaping. This module provides two data structures, Series and DataFrame, for storing and manipulating data.
7.2 Importing and Exporting Data
All data-reading methods in pandas load data into a DataFrame. A DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). We introduce the DataFrame in more detail below. The pandas module provides several functions for reading data from various formats, including delimited files, fixed-width formatted files, CSV files, Excel files, and Stata files:
- read_table: Reads a general delimited file into a DataFrame.
- read_fwf: Reads a fixed-width formatted file into a DataFrame.
- read_csv: Reads a comma-separated values (CSV) file into a DataFrame.
- read_excel: Reads an Excel file into a DataFrame (supports both .xls and .xlsx formats).
- read_stata: Reads a Stata .dta file into a DataFrame.
- read_sql: Reads data from a SQL database into a DataFrame.
Each function offers a range of arguments to customize data importing. For example, the notable arguments for the read_csv function are listed below (a short sketch combining several of them follows the list):
- delimiter: The delimiter used to separate values. The default is ','.
- delim_whitespace: A boolean indicating whether the delimiter is whitespace (space or tab).
- header: An integer indicating the row number to use for the column names (default is 0).
- skiprows: An integer specifying the number of rows at the top of the file to skip before reading (default is 0).
- skipfooter: An integer specifying the number of rows at the bottom of the file to skip when reading (default is 0).
- index_col: An integer or column name indicating which column to use as the index. If not provided, a numeric index is generated.
- names: A list of column names to use in place of any found in the file. This requires header=0.
- nrows: An integer specifying the maximum number of rows to read. This is useful for reading a subset of the file.
- usecols: A list of integers or column names indicating which columns to retain.
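As a brief illustration (not from the original text), the following sketch combines several of these arguments to read only a subset of the caschool.csv file used in the next example:

# A sketch combining several read_csv arguments (illustrative choices)
sample = pd.read_csv('data/caschool.csv',
                     usecols=['county', 'district', 'str'],  # keep only three columns
                     nrows=10)                               # read only the first 10 rows
sample.shape  # (10, 3)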
We consider a dataset on the California school districts, provided in the caschool.csv file. The dataset was collected in 1999 on n=420 school districts and includes 18 district-level variables, for a total of 7560 observations. This dataset can be imported in the following way:
# Importing the dataset
CAschool = pd.read_csv('data/caschool.csv')
type(CAschool)
pandas.core.frame.DataFrame
The type of the CAschool object is DataFrame. We can see the first five observations in the dataset by calling the head method:
# Displaying the first five observations
CAschool.head()
 | Observation Number | dist_cod | county | district | gr_span | enrl_tot | teachers | calw_pct | meal_pct | computer | testscr | comp_stu | expn_stu | str | avginc | el_pct | read_scr | math_scr |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 75119 | Alameda | Sunol Glen Unified | KK-08 | 195 | 10.900000 | 0.510200 | 2.040800 | 67 | 690.799988 | 0.343590 | 6384.911133 | 17.889910 | 22.690001 | 0.000000 | 691.599976 | 690.000000 |
1 | 2 | 61499 | Butte | Manzanita Elementary | KK-08 | 240 | 11.150000 | 15.416700 | 47.916698 | 101 | 661.200012 | 0.420833 | 5099.380859 | 21.524664 | 9.824000 | 4.583333 | 660.500000 | 661.900024 |
2 | 3 | 61549 | Butte | Thermalito Union Elementary | KK-08 | 1550 | 82.900002 | 55.032299 | 76.322601 | 169 | 643.599976 | 0.109032 | 5501.954590 | 18.697226 | 8.978000 | 30.000002 | 636.299988 | 650.900024 |
3 | 4 | 61457 | Butte | Golden Feather Union Elementary | KK-08 | 243 | 14.000000 | 36.475399 | 77.049202 | 85 | 647.700012 | 0.349794 | 7101.831055 | 17.357143 | 8.978000 | 0.000000 | 651.900024 | 643.500000 |
4 | 5 | 61523 | Butte | Palermo Union Elementary | KK-08 | 1335 | 71.500000 | 33.108601 | 78.427002 | 171 | 640.849976 | 0.128090 | 5235.987793 | 18.671329 | 9.080333 | 13.857677 | 641.799988 | 639.900024 |
There are several methods corresponding to a variety of formats for exporting/saving a DataFrame. The to_excel method exports the DataFrame to an Excel file, the to_csv method to a CSV file, and the to_stata method to a Stata file, while the to_sql method writes the DataFrame to a SQL database table. Each method allows customization through several arguments. For example, notable arguments for the to_excel method include the following (a combined sketch follows the export example below):
- sheet_name: The name of the sheet that will contain the DataFrame.
- na_rep: A string representation for missing data (default is an empty string).
- float_format: A string format for floating-point numbers (default is None).
- columns: A list of column names to write to the file. If not provided, all columns are written.
- header: Boolean indicating whether to write out the column names (default is True).
- index: Boolean indicating whether to write out row names (default is True).
- startrow: Pythonic row index for the starting position to write data (default is 0).
- startcol: Pythonic column index for the starting position to write data (default is 0).
In the following example, we save the CAschool DataFrame to the California_school_data.xlsx file:
# Exporting the DataFrame to an Excel file
CAschool.to_excel('data/California_school_data.xlsx', header = True, index = False)
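The remaining arguments can be combined in a similar fashion. The following is a minimal sketch; the sheet name, missing-value placeholder, and float format below are our own illustrative choices, not from the original text:

# Exporting with a sheet name, a missing-value placeholder, and a float format
CAschool.to_excel('data/California_school_data.xlsx',
                  sheet_name='caschool',
                  na_rep='NA',
                  float_format='%.3f',
                  index=False)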
7.3 Series
A Series is equivalent to a one-dimensional array and can contain various data types, including floats, strings, and integers. In many ways, a Series behaves similarly to a NumPy array and can be considered a column in a DataFrame. We can initialize a Series object using array, list, tuple, or dictionary objects. In the following example, we create a Series from a NumPy array:
# From a NumPy array
a = np.array([0.1, 1.2, 2.3, 3.4, 4.5])
s1 = pd.Series(a)
type(s1)
s1
pandas.core.series.Series
0 0.1
1 1.2
2 2.3
3 3.4
4 4.5
dtype: float64
As shown in the above example, the type of s1 is Series. The index of the Series is automatically generated as integers starting from 0.
In the following examples, we create Series from a tuple and a dictionary, respectively. Notice that the index is automatically generated as integers starting from 0 in the first example, while in the second example, the keys of the dictionary are used as indices.
# Creating a series from a tuple
a = (1, 2, 3, 4, 'abs', np.nan)
s2 = pd.Series(a)
s2
0 1
1 2
2 3
3 4
4 abs
5 NaN
dtype: object
# Creating a series from a dictionary
s3 = pd.Series({'a': 0.1, 'b': 1.2, 'c': 2.3})
s3
a 0.1
b 1.2
c 2.3
dtype: float64
Finally, we can create a Series from a list. The index is automatically generated as integers starting from 0, similar to the first example with a NumPy array:
# Creating a series from a list
c = [0.1, 1.2, 2.3, 3.4, 4.5]
s4 = pd.Series(c)
s4
0 0.1
1 1.2
2 2.3
3 3.4
4 4.5
dtype: float64
The elements of a Series can be selected using the [] operator by specifying indices. We can use the index either explicitly or implicitly to slice a Series. Consider the following examples:
s = pd.Series([0.1, 1.2, 2.3, 3.4, 4.5], index=['a', 'b', 'c', 'd', 'e'])
s
a 0.1
b 1.2
c 2.3
d 3.4
e 4.5
dtype: float64
# Selecting elements by index
s['a']
0.1
# Selecting elements by multiple indices
s[['a', 'c']]
a 0.1
c 2.3
dtype: float64
# Slicing elements by explicit index
s['a':'c']
a 0.1
b 1.2
c 2.3
dtype: float64
# Slicing elements by implicit index
s[:2]
a 0.1
b 1.2
dtype: float64
Logical conditions can also be used to select elements of a Series. For example, we can select elements based on their values or indices. Consider the following examples:
# Selecting elements based on value
s[s > 2] # Elements that are greater than 2
c 2.3
d 3.4
e 4.5
dtype: float64
# Selecting elements based on value range
s[(s >= 1) & (s <= 4)] # Elements between 1 and 4 (inclusive)
b 1.2
c 2.3
d 3.4
dtype: float64
Mathematical operations on Series objects are performed element-wise. When the indices do not match, the result is assigned the value NaN. Consider the following examples:
# Element-wise operations on Series
s1 = pd.Series({'a': 1, 'b': 2, 'c': 3})
s2 = s1**3
s2
a 1
b 8
c 27
dtype: int64
s1 = pd.Series({'a': 0.1, 'b': 1.2, 'c': 2.3})
s2 = pd.Series({'c': 10, 'd': 1.2, 'e': 2.3})
# Element-wise multiplication
r1 = s1 * s2
r1
a NaN
b NaN
c 23.0
d NaN
e NaN
dtype: float64
# Element-wise addition
r2 = s1 + s2
r2
a NaN
b NaN
c 12.3
d NaN
e NaN
dtype: float64
In Table 7.1, we list some useful methods for Series objects.
Method | Description |
---|---|
values/index | Returns the Series as an array / returns the index |
head()/tail() | Displays the first/last 5 rows of a Series |
isnull()/notnull() | Returns a boolean same-sized object indicating whether the values are NA/not NA |
loc[]/iloc[] | Allows label-based/position-based access |
describe() | Returns a summary of basic statistics |
unique() | Returns the unique values of the Series object |
nunique() | Returns the number of unique elements in the Series object |
drop() | Returns a Series with the specified index labels removed |
dropna() | Returns a new Series with missing values removed |
fillna() | Fills all null values in a Series with a specified value |
append() | Appends one Series to another |
replace(list, values) | Replaces a list of values in a Series with new values |
update(series) | Replaces values in a Series with those in another Series, matching on the index |
We illustrate some of these methods in the following examples:
# Setting the index of a Series
s1 = pd.Series([1, 3, 5, 6, np.nan, 'cat', 'abc', 10, 12, 5])
s1.index = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'k']
# Displaying the first five observations
s1.head()
a 1
b 3
c 5
d 6
e NaN
dtype: object
# Display not-null values
s1.notnull()
a True
b True
c True
d True
e False
f True
g True
h True
i True
k True
dtype: bool
# Display null values
s1.isnull()
a False
b False
c False
d False
e True
f False
g False
h False
i False
k False
dtype: bool
# Accessing elements using loc
s1.loc['e'] # Element with index 'e'
nan
# Accessing elements using iloc
s1.iloc[4] # Element with implicit index 4
nan
# Remove missing values
s1.dropna() # Remove NaN values
a 1
b 3
c 5
d 6
f cat
g abc
h 10
i 12
k 5
dtype: object
# Fill missing values with a specified value
s1 = pd.Series([1, 3, 5, 6, np.nan, 'cat', 'abc', 10, 12, 5])
s1.index = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'k'] # Set index
s1.fillna(-99.0)
a 1
b 3
c 5
d 6
e -99.0
f cat
g abc
h 10
i 12
k 5
dtype: object
# Summary statistics
s1 = pd.Series(np.arange(10.0, 20.0))
s1.describe()
count 10.00000
mean 14.50000
std 3.02765
min 10.00000
25% 12.25000
50% 14.50000
75% 16.75000
max 19.00000
dtype: float64
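Table 7.1 also lists the unique, nunique, and replace methods, which are not used above. A minimal sketch on a small illustrative Series:

# Illustrating unique, nunique, and replace (illustrative values)
s = pd.Series([1, 2, 2, 3, 3, 3])
s.unique()                   # array([1, 2, 3])
s.nunique()                  # 3
s.replace([1, 2], [10, 20])  # Replaces 1 with 10 and 2 with 20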
7.4 DataFrames
A DataFrame is a two-dimensional tabular data structure composed of multiple Series. It is organized into rows and columns, similar to a spreadsheet. Each column in a DataFrame is a Series, and each row can be accessed by its index. A DataFrame can be created from various data structures, including lists, dictionaries, NumPy arrays, and Series.
In the following example, we create a DataFrame from a dictionary of Series objects. The keys of the dictionary become the column names, and the values are the Series that form the columns of the DataFrame:
# Creating DataFrame from Series
s1 = pd.Series(np.arange(0,5.0))
s2 = pd.Series(np.arange(1.0,6.0))
s3 = pd.Series(np.arange(0,3.0))
df = pd.DataFrame({'one': s1, 'two': s2, 'three': s3})
df
 | one | two | three |
---|---|---|---|
0 | 0.0 | 1.0 | 0.0 |
1 | 1.0 | 2.0 | 1.0 |
2 | 2.0 | 3.0 | 2.0 |
3 | 3.0 | 4.0 | NaN |
4 | 4.0 | 5.0 | NaN |
In the following example, we create a DataFrame from an array. The array is converted to a DataFrame, and we can set the column names using the columns attribute.
# Creating DataFrame from an array
df = pd.DataFrame(np.array([[1,2],[3,4]]))
df.columns = ['col1','col2'] # Set column names
df
 | col1 | col2 |
---|---|---|
0 | 1 | 2 |
1 | 3 | 4 |
# Return column names of CAschool DataFrame
CAschool.columns
Index(['Observation Number', 'dist_cod', 'county', 'district', 'gr_span',
'enrl_tot', 'teachers', 'calw_pct', 'meal_pct', 'computer', 'testscr',
'comp_stu', 'expn_stu', 'str', 'avginc', 'el_pct', 'read_scr',
'math_scr'],
dtype='object')
The student-teacher ratio is denoted by str in CAschool. This column can be selected via CAschool.str or CAschool['str']. To select multiple columns, such as str, read_scr, and math_scr, we can use CAschool[['str', 'read_scr', 'math_scr']]. Consider the following examples:
# Selecting multiple columns
CAschool[['str', 'read_scr', 'math_scr']].head(n=5)
 | str | read_scr | math_scr |
---|---|---|---|
0 | 17.889910 | 691.599976 | 690.000000 |
1 | 21.524664 | 660.500000 | 661.900024 |
2 | 18.697226 | 636.299988 | 650.900024 |
3 | 17.357143 | 651.900024 | 643.500000 |
4 | 18.671329 | 641.799988 | 639.900024 |
# Selecting multiple columns and displaying summary statistics
CAschool[['str', 'read_scr', 'math_scr']].describe().T # Summary statistics
 | count | mean | std | min | 25% | 50% | 75% | max |
---|---|---|---|---|---|---|---|---|
str | 420.0 | 19.640425 | 1.891812 | 14.000000 | 18.582360 | 19.723208 | 20.871815 | 25.799999 |
read_scr | 420.0 | 654.970477 | 20.107980 | 604.500000 | 640.400024 | 655.750000 | 668.725006 | 704.000000 |
math_scr | 420.0 | 653.342619 | 18.754202 | 605.400024 | 639.375015 | 652.449982 | 665.849991 | 709.500000 |
We can slice rows using standard numerical slices, as in CAschool[0:5], which returns the first five rows for all columns. However, it is not possible to use standard slicing to select both rows and columns. Instead, we can use loc[rowselector, columnselector], as shown in the following example:
# Selecting specific rows and columns using loc
CAschool.loc[0:5, ['str', 'read_scr', 'math_scr']]
 | str | read_scr | math_scr |
---|---|---|---|
0 | 17.889910 | 691.599976 | 690.000000 |
1 | 21.524664 | 660.500000 | 661.900024 |
2 | 18.697226 | 636.299988 | 650.900024 |
3 | 17.357143 | 651.900024 | 643.500000 |
4 | 18.671329 | 641.799988 | 639.900024 |
5 | 21.406250 | 605.700012 | 605.400024 |
Note that loc slices by label and includes the end label, which is why six rows are returned above. We can also select rows and columns using the iloc method, which allows for integer-location-based indexing. Consider the following example:
# Selecting specific rows and columns using iloc
CAschool.iloc[0:2,0:5] # Returns the first two rows and the first five columns
 | Observation Number | dist_cod | county | district | gr_span |
---|---|---|---|---|---|
0 | 1 | 75119 | Alameda | Sunol Glen Unified | KK-08 |
1 | 2 | 61499 | Butte | Manzanita Elementary | KK-08 |
Rows can also be selected using logical selection with a Boolean array. In the following example, we select districts with student-teacher ratios less than 15:
# Selecting rows based on a condition
subset = CAschool['str'] < 15
CAschool.loc[subset, ['county', 'district', 'str', 'read_scr', 'math_scr']]
 | county | district | str | read_scr | math_scr |
---|---|---|---|---|---|
79 | Siskiyou | McCloud Union Elementary | 14.000000 | 634.599976 | 636.599976 |
228 | Nevada | Chicago Park Elementary | 14.201763 | 657.799988 | 655.200012 |
315 | Siskiyou | Fort Jones Union Elementary | 14.705882 | 666.900024 | 666.799988 |
408 | San Mateo | Woodside Elementary | 14.542136 | 701.299988 | 689.299988 |
We can add columns to a DataFrame using two methods: (i) by assigning a new column name, and (ii) by using the insert method. Consider the following examples:
# Adding a new column by assignment
df = CAschool[['str', 'read_scr', 'math_scr']].copy()
df['logstr'] = np.log(df['str']) # Adding the logstr column
df.head()
 | str | read_scr | math_scr | logstr |
---|---|---|---|---|
0 | 17.889910 | 691.599976 | 690.000000 | 2.884237 |
1 | 21.524664 | 660.500000 | 661.900024 | 3.069199 |
2 | 18.697226 | 636.299988 | 650.900024 | 2.928375 |
3 | 17.357143 | 651.900024 | 643.500000 | 2.854004 |
4 | 18.671329 | 641.799988 | 639.900024 | 2.926989 |
# Adding a new column using the insert method
df = CAschool[['str', 'read_scr', 'math_scr']].copy()
df.insert(1, 'logstr', np.log(df['str'])) # Adding the logstr column
df.head()
 | str | logstr | read_scr | math_scr |
---|---|---|---|---|
0 | 17.889910 | 2.884237 | 691.599976 | 690.000000 |
1 | 21.524664 | 3.069199 | 660.500000 | 661.900024 |
2 | 18.697226 | 2.928375 | 636.299988 | 650.900024 |
3 | 17.357143 | 2.854004 | 651.900024 | 643.500000 |
4 | 18.671329 | 2.926989 | 641.799988 | 639.900024 |
In the case of the insert method, the syntax is insert(location, column_name, series), where location specifies the column position, column_name is the name of the column to be added, and series is the data to be inserted.
There are three ways to delete columns from a DataFrame: (i) using the drop(list of columns/rows, axis=1/0) method, (ii) using the pop(column) method, and (iii) using the del keyword. Consider the following examples; a short pop example is given after the del example below:
# Using the drop method to remove a column
df = CAschool[['str', 'read_scr', 'math_scr']].copy()
# Drop the str column
df.drop('str', axis=1).head()
 | read_scr | math_scr |
---|---|---|
0 | 691.599976 | 690.000000 |
1 | 660.500000 | 661.900024 |
2 | 636.299988 | 650.900024 |
3 | 651.900024 | 643.500000 |
4 | 641.799988 | 639.900024 |
df = CAschool[['str', 'read_scr', 'math_scr']].copy()
# Drop the str column
del df['str']
df.head()
 | read_scr | math_scr |
---|---|---|
0 | 691.599976 | 690.000000 |
1 | 660.500000 | 661.900024 |
2 | 636.299988 | 650.900024 |
3 | 651.900024 | 643.500000 |
4 | 641.799988 | 639.900024 |
Note that the drop method creates a new DataFrame unless we specify inplace=True, while del modifies the DataFrame in place.
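The pop method mentioned above works differently from both: it removes the column in place and returns it as a Series. A minimal sketch:

# Using the pop method to remove a column
df = CAschool[['str', 'read_scr', 'math_scr']].copy()
str_col = df.pop('str') # df now contains only read_scr and math_scr
type(str_col)           # pandas.core.series.Series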
Finally, in Table 7.2, we list some useful methods for DataFrame objects; a short sketch of two of them follows the table.
Method | Description |
---|---|
drop() | Drops specified labels from rows or columns |
dropna() | Removes missing values (NaN values) |
drop_duplicates() | Removes rows that are duplicates of other rows |
values | Retrieves the underlying data as a NumPy array |
index | Returns the index of the DataFrame |
fillna() | Fills NA/NaN or other null values with other values |
T/transpose() | Both swap the rows and columns of a DataFrame |
sort_values() | Sorts by the values along either axis |
sort_index() | Sorts a DataFrame by the values in the index |
count() | Counts non-NA cells for each column or row |
describe() | Generates descriptive statistics |
value_counts() | Returns a Series containing counts of unique values |
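As a brief sketch of sort_values and value_counts applied to the CAschool DataFrame (the choice of columns here is our own illustration):

# Sorting by the student-teacher ratio and counting districts per county
CAschool.sort_values(by='str').head()    # Districts with the smallest ratios first
CAschool['county'].value_counts().head() # Counties with the most districts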
7.5 Reshape and Merge DataFrames
Data are often stored in a wide form, i.e., in a wide-form tabulation. For example, consider the following state per capita income dataset provided by the Bureau of Economic Analysis.
# Importing the state per capita income dataset
income_wide = pd.read_csv('data/state_income.csv', header = 0)
income_wide.head()
 | state | pinc.1993 | pinc.1994 | pinc.1995 | pinc.1996 | pinc.1997 | pinc.1998 | pinc.1999 | pinc.2000 | pinc.2001 | ... | pinc.2006 | pinc.2007 | pinc.2008 | pinc.2009 | pinc.2010 | pinc.2011 | pinc.2012 | pinc.2013 | pinc.2014 | pinc.2015 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Alabama | 18129 | 18977 | 19892 | 20630 | 21516 | 22692 | 23352 | 24338 | 25104 | ... | 31362 | 32598 | 33353 | 32608 | 33752 | 34997 | 35884 | 36107 | 37266 | 38644 |
1 | Alaska | 25036 | 25713 | 26399 | 26953 | 28185 | 29220 | 29892 | 31974 | 33517 | ... | 41058 | 43861 | 47749 | 47069 | 49438 | 52390 | 53627 | 52723 | 55833 | 57583 |
2 | Arizona | 18950 | 19869 | 20753 | 21737 | 22804 | 24176 | 24813 | 26235 | 26803 | ... | 34686 | 35789 | 35563 | 33418 | 33635 | 34968 | 36123 | 36602 | 38226 | 39676 |
3 | Arkansas | 16956 | 17758 | 18602 | 19458 | 20187 | 21222 | 21861 | 22762 | 23840 | ... | 29438 | 31070 | 31940 | 31153 | 31927 | 33863 | 36240 | 36235 | 38260 | 39343 |
4 | California | 23013 | 23581 | 24629 | 25912 | 27171 | 29100 | 30640 | 33364 | 34043 | ... | 42088 | 43629 | 43890 | 42044 | 43634 | 46170 | 48798 | 49277 | 52324 | 55758 |
5 rows × 24 columns
However, the wide form is not compatible with modules used in econometric analysis. Therefore, we need to change it to the long form. To this end, we can use the wide_to_long function.
# Reshaping the DataFrame from wide to long format
income_long = pd.wide_to_long(income_wide, stubnames='pinc', i='state', j='year', sep='.')
income_long.head()
 |  | pinc |
---|---|---|
state | year |  |
Alabama | 1993 | 18129 |
Alaska | 1993 | 25036 |
Arizona | 1993 | 18950 |
Arkansas | 1993 | 16956 |
California | 1993 | 23013 |
Notice that the DataFrame has one column, named pinc. The state and year variables are used to generate a MultiIndex for the DataFrame. We can use the reset_index method to turn them back into data columns.
# Resetting the index to convert the MultiIndex to columns
income_long = income_long.reset_index()
income_long.head()
 | state | year | pinc |
---|---|---|---|
0 | Alabama | 1993 | 18129 |
1 | Alaska | 1993 | 25036 |
2 | Arizona | 1993 | 18950 |
3 | Arkansas | 1993 | 16956 |
4 | California | 1993 | 23013 |
The DataFrame is currently sorted by year and state (year is the slower index). Let us sort it by state and year (with state as the slower index). We can use the sort_values method to sort the DataFrame.
# Sorting the DataFrame by state and year
income_long = income_long.sort_values(by = ["state", "year"])
income_long.head()
 | state | year | pinc |
---|---|---|---|
0 | Alabama | 1993 | 18129 |
50 | Alabama | 1994 | 18977 |
100 | Alabama | 1995 | 19892 |
150 | Alabama | 1996 | 20630 |
200 | Alabama | 1997 | 21516 |
Notice that the index order has changed. We can reset the index again, which adds the current index as a data column named index; we then drop that column from the DataFrame.
# Resetting the index and dropping the old index column
income_long = income_long.reset_index().drop("index", axis = 1)
income_long.head()
 | state | year | pinc |
---|---|---|---|
0 | Alabama | 1993 | 18129 |
1 | Alabama | 1994 | 18977 |
2 | Alabama | 1995 | 19892 |
3 | Alabama | 1996 | 20630 |
4 | Alabama | 1997 | 21516 |
Finally, we will check the data type of each column.
# Checking the data types of the columns
income_long.dtypes
state object
year int64
pinc int64
dtype: object
Next, we will consider the state expenditure data provided by the Bureau of Economic Analysis. The data cover the period from 1993 to 2015, with one CSV file for each year. We will first generate a list of filenames and then sort the list.
# Generating a list of filenames for the expenditure data
file_list = os.listdir('data/expenditure/')
file_list.sort()
file_list
['expnd_1993.csv',
'expnd_1994.csv',
'expnd_1995.csv',
'expnd_1996.csv',
'expnd_1997.csv',
'expnd_1998.csv',
'expnd_1999.csv',
'expnd_2000.csv',
'expnd_2001.csv',
'expnd_2002.csv',
'expnd_2003.csv',
'expnd_2004.csv',
'expnd_2005.csv',
'expnd_2006.csv',
'expnd_2007.csv',
'expnd_2008.csv',
'expnd_2009.csv',
'expnd_2010.csv',
'expnd_2011.csv',
'expnd_2012.csv',
'expnd_2013.csv',
'expnd_2014.csv',
'expnd_2015.csv']
We can now stack these datasets in a for loop; however, we must be aware of the non-uniformity of column names across years. Indeed, although the order of the columns is uniform across years, the names are not. Hence, we rename the columns after importing them.
# Importing the expenditure data from multiple CSV files
tmpnames = ['state','year','tot_revenue','taxes','tot_expnd','education','public_welfare','hospital','health']
expenditure = pd.read_csv('data/expenditure/' + file_list[0], header=0) # Import the initial year
expenditure.columns = tmpnames

for i in range(1, len(file_list)):
    tmpframe = pd.read_csv('data/expenditure/' + file_list[i], header=0)
    tmpframe.columns = tmpnames
    expenditure = pd.concat([expenditure, tmpframe], axis=0)

expenditure.head()
 | state | year | tot_revenue | taxes | tot_expnd | education | public_welfare | hospital | health |
---|---|---|---|---|---|---|---|---|---|
0 | Alabama | 1993 | 11,389,335 | 4,639,784 | 10,242,374 | 1,920,765 | 2,006,829 | 887,835 | 395,901 |
1 | Alaska | 1993 | 7,358,035 | 2,227,283 | 5,423,008 | 604,677 | 454,430 | 35,105 | 103,541 |
2 | Arizona | 1993 | 10,843,407 | 5,281,816 | 9,782,725 | 1,279,055 | 1,900,329 | 71,652 | 357,130 |
3 | Arkansas | 1993 | 6,446,052 | 2,942,581 | 6,014,728 | 1,077,951 | 1,310,779 | 304,906 | 136,990 |
4 | California | 1993 | 108,597,722 | 49,418,058 | 104,567,486 | 9,825,944 | 10,909,104 | 2,818,232 | 1,744,769 |
The expenditure dataset is sorted by year and state (with year as the slower index). Let us sort it by state and year (with state as the slower index).
# Sorting the expenditure DataFrame by state and year
expenditure = expenditure.sort_values(by = ["state", "year"])
expenditure = expenditure.reset_index().drop("index", axis = 1)
expenditure.head()
 | state | year | tot_revenue | taxes | tot_expnd | education | public_welfare | hospital | health |
---|---|---|---|---|---|---|---|---|---|
0 | Alabama | 1993 | 11,389,335 | 4,639,784 | 10,242,374 | 1,920,765 | 2,006,829 | 887,835 | 395,901 |
1 | Alabama | 1994 | 11,599,362 | 4,767,108 | 10,815,221 | 2,101,540 | 2,157,831 | 823,194 | 470,960 |
2 | Alabama | 1995 | 12448670 | 5077827 | 11634629 | 2260473 | 2282961 | 811433 | 474424 |
3 | Alabama | 1996 | 12741148 | 5257771 | 12126587 | 2240613 | 2325418 | 815698 | 500625 |
4 | Alabama | 1997 | 14,007,883 | 5,484,161 | 12,944,867 | 5,175,279 | 2,537,627 | 882,613 | 566,651 |
We must check the data type of each column to make sure that the expenditure variables are numeric.
# Checking the data types of the columns in the expenditure DataFrame
expenditure.dtypes
state object
year int64
tot_revenue object
taxes object
tot_expnd object
education object
public_welfare object
hospital object
health object
dtype: object
Unfortunately, dtype is object for all expenditure variables. This is due to the commas (,) in those columns. We need to convert these columns to numeric type by dropping the commas and then resetting the dtype to numeric.
# Converting expenditure columns to numeric by removing commas
columns = ['tot_revenue','taxes','tot_expnd','education','public_welfare','hospital','health']
expenditure[columns] = expenditure[columns].replace(",", "", regex = True).astype('int')
expenditure.dtypes
state object
year int64
tot_revenue int32
taxes int32
tot_expnd int32
education int32
public_welfare int32
hospital int32
health int32
dtype: object
We can now merge the income and expenditure datasets on state and year. The pandas merge() and join() methods use a set of rules known as relational algebra to combine data. If the key columns have the same names in the left and right DataFrames, a value for the on argument can be provided. If the key columns are named differently, we can specify the key columns for the merge using the left_on and right_on arguments, corresponding to the left and right DataFrames. Similarly, if the left and right DataFrames have indices set, we can use the left_index and right_index arguments (a sketch of left_on/right_on follows the merged table below).
# Merging the income and expenditure DataFrames on state and year
final_df = pd.merge(income_long, expenditure, on = ["state", "year"])
final_df.head()
 | state | year | pinc | tot_revenue | taxes | tot_expnd | education | public_welfare | hospital | health |
---|---|---|---|---|---|---|---|---|---|---|
0 | Alabama | 1993 | 18129 | 11389335 | 4639784 | 10242374 | 1920765 | 2006829 | 887835 | 395901 |
1 | Alabama | 1994 | 18977 | 11599362 | 4767108 | 10815221 | 2101540 | 2157831 | 823194 | 470960 |
2 | Alabama | 1995 | 19892 | 12448670 | 5077827 | 11634629 | 2260473 | 2282961 | 811433 | 474424 |
3 | Alabama | 1996 | 20630 | 12741148 | 5257771 | 12126587 | 2240613 | 2325418 | 815698 | 500625 |
4 | Alabama | 1997 | 21516 | 14007883 | 5484161 | 12944867 | 5175279 | 2537627 | 882613 | 566651 |
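If the key columns were named differently in the two DataFrames, we could use the left_on and right_on arguments instead of on. A minimal sketch with hypothetical DataFrames and column names:

# Hypothetical example: the key column is named differently in each DataFrame
left = pd.DataFrame({'st': ['Alabama', 'Alaska'], 'pinc': [18129, 25036]})
right = pd.DataFrame({'state_name': ['Alabama', 'Alaska'], 'taxes': [4639784, 2227283]})
pd.merge(left, right, left_on='st', right_on='state_name')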
7.6 Further Readings
In this section, we introduce commonly used data structures and manipulation tools from the pandas module that are useful for econometric analysis. For further reading on pandas data structures and manipulation tools, we refer the reader to Sheppard (2021), McKinney (2022), and VanderPlas (2023).