import numpy as np
import pandas as pd
import os
7 Introduction to Pandas
7.1 Introduction
In this chapter, we introduce the pandas module for reading data from various formats, exporting data, and performing data analysis tasks such as cleaning, filtering, transforming, handling missing data, merging, and reshaping. This module provides two data structures, Series and DataFrame, for storing and manipulating data.
7.2 Importing and Exporting Data
All data-reading methods in pandas load data into a DataFrame. A DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). We introduce the DataFrame in more detail below. The pandas module provides several functions for reading data from various formats, including delimited files, fixed-width formatted files, CSV files, Excel files, and Stata files:
- read_table: Reads a general delimited file into a DataFrame.
- read_fwf: Reads a fixed-width formatted file into a DataFrame.
- read_csv: Reads a comma-separated values (CSV) file into a DataFrame.
- read_excel: Reads an Excel file into a DataFrame (supports both .xls and .xlsx formats).
- read_stata: Reads a Stata .dta file into a DataFrame.
- read_sql: Reads data from a SQL database into a DataFrame.
Each function offers a range of arguments to customize data importing. For example, the notable arguments for the read_csv function are listed below (a short sketch combining several of them follows the list):
- delimiter: The delimiter used to separate values. The default is ','.
- delim_whitespace: A boolean indicating whether the delimiter is whitespace (space or tab).
- header: An integer indicating the row number to use for the column names (default is 0).
- skiprows: An integer specifying the number of rows at the top of the file to skip before reading (default is 0).
- skipfooter: An integer specifying the number of rows at the bottom of the file to skip when reading (default is 0).
- index_col: An integer or column name indicating which column to use as the index. If not provided, a numeric index is generated.
- names: A list of column names to use in place of any found in the file. This requires header=0.
- nrows: An integer specifying the maximum number of rows to read. This is useful for reading a subset of the file.
- usecols: A list of integers or column names indicating which columns to retain.
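As a brief illustration (not from the original text), the following sketch combines several of these arguments to read only a subset of the caschool.csv file used in the next example:

# A sketch combining several read_csv arguments (illustrative choices)
sample = pd.read_csv('data/caschool.csv',
                     usecols=['county', 'district', 'str'],  # keep only three columns
                     nrows=10)                               # read only the first 10 rows
sample.shape  # (10, 3)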
We consider a dataset on the California school districts, provided in the caschool.csv file. The dataset was collected in 1999 on n=420 school districts and includes 18 district-level variables, for a total of 7560 observations. This dataset can be imported in the following way:
# Importing the dataset
CAschool = pd.read_csv('data/caschool.csv')
type(CAschool)
pandas.core.frame.DataFrame
The type of the CAschool object is DataFrame. We can see the first five observations in the dataset by calling the head method:
# Displaying the first five observations
CAschool.head()
 | Observation Number | dist_cod | county | district | gr_span | enrl_tot | teachers | calw_pct | meal_pct | computer | testscr | comp_stu | expn_stu | str | avginc | el_pct | read_scr | math_scr |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 75119 | Alameda | Sunol Glen Unified | KK-08 | 195 | 10.900000 | 0.510200 | 2.040800 | 67 | 690.799988 | 0.343590 | 6384.911133 | 17.889910 | 22.690001 | 0.000000 | 691.599976 | 690.000000 |
1 | 2 | 61499 | Butte | Manzanita Elementary | KK-08 | 240 | 11.150000 | 15.416700 | 47.916698 | 101 | 661.200012 | 0.420833 | 5099.380859 | 21.524664 | 9.824000 | 4.583333 | 660.500000 | 661.900024 |
2 | 3 | 61549 | Butte | Thermalito Union Elementary | KK-08 | 1550 | 82.900002 | 55.032299 | 76.322601 | 169 | 643.599976 | 0.109032 | 5501.954590 | 18.697226 | 8.978000 | 30.000002 | 636.299988 | 650.900024 |
3 | 4 | 61457 | Butte | Golden Feather Union Elementary | KK-08 | 243 | 14.000000 | 36.475399 | 77.049202 | 85 | 647.700012 | 0.349794 | 7101.831055 | 17.357143 | 8.978000 | 0.000000 | 651.900024 | 643.500000 |
4 | 5 | 61523 | Butte | Palermo Union Elementary | KK-08 | 1335 | 71.500000 | 33.108601 | 78.427002 | 171 | 640.849976 | 0.128090 | 5235.987793 | 18.671329 | 9.080333 | 13.857677 | 641.799988 | 639.900024 |
There are several methods corresponding to a variety of formats for exporting/saving a DataFrame. The to_excel method exports the DataFrame to an Excel file, the to_csv method to a CSV file, and the to_stata method to a Stata file, while the to_sql method writes the DataFrame to a SQL database table. Each method allows customization through several arguments. For example, notable arguments for the to_excel method include the following (a combined sketch follows the export example below):
- sheet_name: The name of the sheet that will contain the DataFrame.
- na_rep: A string representation for missing data (default is an empty string).
- float_format: A string format for floating-point numbers (default is None).
- columns: A list of column names to write to the file. If not provided, all columns are written.
- header: Boolean indicating whether to write out the column names (default is True).
- index: Boolean indicating whether to write out row names (default is True).
- startrow: Pythonic row index for the starting position to write data (default is 0).
- startcol: Pythonic column index for the starting position to write data (default is 0).
In the following example, we save the CAschool DataFrame to the California_school_data.xlsx file:
# Exporting the DataFrame to an Excel file
CAschool.to_excel('data/California_school_data.xlsx', header = True, index = False)
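The remaining arguments can be combined in a similar fashion. The following is a minimal sketch; the sheet name, missing-value placeholder, and float format below are our own illustrative choices, not from the original text:

# Exporting with a sheet name, a missing-value placeholder, and a float format
CAschool.to_excel('data/California_school_data.xlsx',
                  sheet_name='caschool',
                  na_rep='NA',
                  float_format='%.3f',
                  index=False)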
7.3 Series
A Series is equivalent to a one-dimensional array and can contain various data types, including floats, strings, and integers. In many ways, a Series behaves similarly to a NumPy array and can be considered a column in a DataFrame. We can initialize a Series object using array, list, tuple, or dictionary objects. In the following example, we create a Series from a NumPy array:
# From a NumPy array
a = np.array([0.1, 1.2, 2.3, 3.4, 4.5])
s1 = pd.Series(a)
type(s1)
s1
pandas.core.series.Series
0 0.1
1 1.2
2 2.3
3 3.4
4 4.5
dtype: float64
As shown in the above example, the type of s1 is Series. The index of the Series is automatically generated as integers starting from 0.
In the following examples, we create Series from a tuple and a dictionary, respectively. Notice that the index is automatically generated as integers starting from 0 in the first example, while in the second example, the keys of the dictionary are used as indices.
# Creating a series from a tuple
a = (1, 2, 3, 4, 'abs', np.nan)
s2 = pd.Series(a)
s2
0 1
1 2
2 3
3 4
4 abs
5 NaN
dtype: object
# Creating a series from a dictionary
s3 = pd.Series({'a': 0.1, 'b': 1.2, 'c': 2.3})
s3
a 0.1
b 1.2
c 2.3
dtype: float64
Finally, we can create a Series from a list. The index is automatically generated as integers starting from 0, similar to the first example with a NumPy array:
# Creating a series from a list
c = [0.1, 1.2, 2.3, 3.4, 4.5]
s4 = pd.Series(c)
s4
0 0.1
1 1.2
2 2.3
3 3.4
4 4.5
dtype: float64
The elements of a Series can be selected using the [] operator by specifying indices. We can use the index either explicitly or implicitly to slice a Series. Consider the following examples:
s = pd.Series([0.1, 1.2, 2.3, 3.4, 4.5], index=['a', 'b', 'c', 'd', 'e'])
s
a 0.1
b 1.2
c 2.3
d 3.4
e 4.5
dtype: float64
# Selecting elements by index
s['a']
0.1
# Selecting elements by multiple indices
s[['a', 'c']]
a 0.1
c 2.3
dtype: float64
# Slicing elements by explicit index
s['a':'c']
a 0.1
b 1.2
c 2.3
dtype: float64
# Slicing elements by implicit index
s[:2]
a 0.1
b 1.2
dtype: float64
Logical conditions can also be used to select elements of a Series. For example, we can select elements based on their values or indices. Consider the following examples:
# Selecting elements based on value
s[s > 2] # Elements that are greater than 2
c 2.3
d 3.4
e 4.5
dtype: float64
# Selecting elements based on value range
s[(s >= 1) & (s <= 4)] # Elements between 1 and 4 (inclusive)
b 1.2
c 2.3
d 3.4
dtype: float64
Mathematical operations on Series objects are performed element-wise. When the indices do not match, the result is assigned the value NaN. Consider the following examples:
# Element-wise operations on Series
s1 = pd.Series({'a': 1, 'b': 2, 'c': 3})
s2 = s1**3
s2
a 1
b 8
c 27
dtype: int64
s1 = pd.Series({'a': 0.1, 'b': 1.2, 'c': 2.3})
s2 = pd.Series({'c': 10, 'd': 1.2, 'e': 2.3})
# Element-wise multiplication
r1 = s1 * s2
r1
a NaN
b NaN
c 23.0
d NaN
e NaN
dtype: float64
# Element-wise addition
r2 = s1 + s2
r2
a NaN
b NaN
c 12.3
d NaN
e NaN
dtype: float64
In Table 7.1, we list some useful methods for Series objects.
Method | Description |
---|---|
values/index | Returns the Series as an array / returns the index |
head()/tail() | Displays the first/last 5 rows of a Series |
isnull()/notnull() | Returns a boolean same-sized object indicating whether the values are NA/not NA |
loc[]/iloc[] | Allows label-based/position-based access |
describe() | Returns a summary of basic statistics |
unique() | Returns the unique values of the Series object |
nunique() | Returns the number of unique elements in the Series object |
drop() | Returns a Series with the specified index labels removed |
dropna() | Returns a new Series with missing values removed |
fillna() | Fills all null values in a Series with a specified value |
append() | Appends one Series to another |
replace(list, values) | Replaces a list of values in a Series with new values |
update(series) | Replaces values in a Series with those in another Series, matching on the index |
We illustrate some of these methods in the following examples:
# Setting the index of a Series
s1 = pd.Series([1, 3, 5, 6, np.nan, 'cat', 'abc', 10, 12, 5])
s1.index = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'k']
# Displaying the first five observations
s1.head()
a 1
b 3
c 5
d 6
e NaN
dtype: object
# Display not-null values
s1.notnull()
a True
b True
c True
d True
e False
f True
g True
h True
i True
k True
dtype: bool
# Display null values
s1.isnull()
a False
b False
c False
d False
e True
f False
g False
h False
i False
k False
dtype: bool
# Accessing elements using loc
s1.loc['e'] # Element with index 'e'
nan
# Accessing elements using iloc
s1.iloc[4] # Element with implicit index 4
nan
# Remove missing values
s1.dropna() # Remove NaN values
a 1
b 3
c 5
d 6
f cat
g abc
h 10
i 12
k 5
dtype: object
# Fill missing values with a specified value
s1 = pd.Series([1, 3, 5, 6, np.nan, 'cat', 'abc', 10, 12, 5])
s1.index = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'k'] # Set index
s1.fillna(-99.0)
a 1
b 3
c 5
d 6
e -99.0
f cat
g abc
h 10
i 12
k 5
dtype: object
# Summary statistics
s1 = pd.Series(np.arange(10.0, 20.0))
s1.describe()
count 10.00000
mean 14.50000
std 3.02765
min 10.00000
25% 12.25000
50% 14.50000
75% 16.75000
max 19.00000
dtype: float64
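Table 7.1 also lists the unique, nunique, and replace methods, which are not used above. A minimal sketch on a small illustrative Series:

# Illustrating unique, nunique, and replace (illustrative values)
s = pd.Series([1, 2, 2, 3, 3, 3])
s.unique()                   # array([1, 2, 3])
s.nunique()                  # 3
s.replace([1, 2], [10, 20])  # Replaces 1 with 10 and 2 with 20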
7.4 DataFrames
A DataFrame is a two-dimensional tabular data structure composed of multiple Series. It is organized into rows and columns, similar to a spreadsheet. Each column in a DataFrame is a Series, and each row can be accessed by its index. A DataFrame can be created from various data structures, including lists, dictionaries, NumPy arrays, and Series.
In the following example, we create a DataFrame from a dictionary of Series objects. The keys of the dictionary become the column names, and the values are the Series that form the columns of the DataFrame:
# Creating DataFrame from Series
s1 = pd.Series(np.arange(0,5.0))
s2 = pd.Series(np.arange(1.0,6.0))
s3 = pd.Series(np.arange(0,3.0))
df = pd.DataFrame({'one': s1, 'two': s2, 'three': s3})
df
 | one | two | three |
---|---|---|---|
0 | 0.0 | 1.0 | 0.0 |
1 | 1.0 | 2.0 | 1.0 |
2 | 2.0 | 3.0 | 2.0 |
3 | 3.0 | 4.0 | NaN |
4 | 4.0 | 5.0 | NaN |
In the following example, we create a DataFrame from an array. The array is converted to a DataFrame, and we can set the column names using the columns attribute.
# Creating DataFrame from an array
df = pd.DataFrame(np.array([[1,2],[3,4]]))
df.columns = ['col1','col2'] # Set column names
df
 | col1 | col2 |
---|---|---|
0 | 1 | 2 |
1 | 3 | 4 |
# Return column names of CAschool DataFrame
CAschool.columns
Index(['Observation Number', 'dist_cod', 'county', 'district', 'gr_span',
'enrl_tot', 'teachers', 'calw_pct', 'meal_pct', 'computer', 'testscr',
'comp_stu', 'expn_stu', 'str', 'avginc', 'el_pct', 'read_scr',
'math_scr'],
dtype='object')
The student-teacher ratio is denoted by str in CAschool. This column can be selected via CAschool.str or CAschool['str']. To select multiple columns, such as str, read_scr, and math_scr, we can use CAschool[['str', 'read_scr', 'math_scr']]. Consider the following examples:
# Selecting multiple columns
CAschool[['str', 'read_scr', 'math_scr']].head(n=5)
 | str | read_scr | math_scr |
---|---|---|---|
0 | 17.889910 | 691.599976 | 690.000000 |
1 | 21.524664 | 660.500000 | 661.900024 |
2 | 18.697226 | 636.299988 | 650.900024 |
3 | 17.357143 | 651.900024 | 643.500000 |
4 | 18.671329 | 641.799988 | 639.900024 |
# Selecting multiple columns and displaying summary statistics
CAschool[['str', 'read_scr', 'math_scr']].describe().T # Summary statistics
 | count | mean | std | min | 25% | 50% | 75% | max |
---|---|---|---|---|---|---|---|---|
str | 420.0 | 19.640425 | 1.891812 | 14.000000 | 18.582360 | 19.723208 | 20.871815 | 25.799999 |
read_scr | 420.0 | 654.970477 | 20.107980 | 604.500000 | 640.400024 | 655.750000 | 668.725006 | 704.000000 |
math_scr | 420.0 | 653.342619 | 18.754202 | 605.400024 | 639.375015 | 652.449982 | 665.849991 | 709.500000 |
We can slice rows using standard numerical slices, as in CAschool[0:5], which returns the first five rows for all columns. However, it is not possible to use standard slicing to select both rows and columns. Instead, we can use loc[rowselector, columnselector], as shown in the following example:
# Selecting specific rows and columns using loc
CAschool.loc[0:5, ['str', 'read_scr', 'math_scr']]
 | str | read_scr | math_scr |
---|---|---|---|
0 | 17.889910 | 691.599976 | 690.000000 |
1 | 21.524664 | 660.500000 | 661.900024 |
2 | 18.697226 | 636.299988 | 650.900024 |
3 | 17.357143 | 651.900024 | 643.500000 |
4 | 18.671329 | 641.799988 | 639.900024 |
5 | 21.406250 | 605.700012 | 605.400024 |
Note that loc slices by label and includes the end label, which is why six rows are returned above. We can also select rows and columns using the iloc method, which allows for integer-location-based indexing. Consider the following example:
# Selecting specific rows and columns using iloc
CAschool.iloc[0:2,0:5] # Returns the first two rows and the first five columns
 | Observation Number | dist_cod | county | district | gr_span |
---|---|---|---|---|---|
0 | 1 | 75119 | Alameda | Sunol Glen Unified | KK-08 |
1 | 2 | 61499 | Butte | Manzanita Elementary | KK-08 |
Rows can also be selected using logical selection with a Boolean array. In the following example, we select districts with student-teacher ratios less than 15:
# Selecting rows based on a condition
subset = CAschool['str'] < 15
CAschool.loc[subset, ['county', 'district', 'str', 'read_scr', 'math_scr']]
 | county | district | str | read_scr | math_scr |
---|---|---|---|---|---|
79 | Siskiyou | McCloud Union Elementary | 14.000000 | 634.599976 | 636.599976 |
228 | Nevada | Chicago Park Elementary | 14.201763 | 657.799988 | 655.200012 |
315 | Siskiyou | Fort Jones Union Elementary | 14.705882 | 666.900024 | 666.799988 |
408 | San Mateo | Woodside Elementary | 14.542136 | 701.299988 | 689.299988 |
We can add columns to a DataFrame using two methods: (i) by assigning a new column name, and (ii) by using the insert method. Consider the following examples:
# Adding a new column by assignment
df = CAschool[['str', 'read_scr', 'math_scr']].copy()
df['logstr'] = np.log(df['str']) # Adding the logstr column
df.head()
 | str | read_scr | math_scr | logstr |
---|---|---|---|---|
0 | 17.889910 | 691.599976 | 690.000000 | 2.884237 |
1 | 21.524664 | 660.500000 | 661.900024 | 3.069199 |
2 | 18.697226 | 636.299988 | 650.900024 | 2.928375 |
3 | 17.357143 | 651.900024 | 643.500000 | 2.854004 |
4 | 18.671329 | 641.799988 | 639.900024 | 2.926989 |
# Adding a new column using the insert method
df = CAschool[['str', 'read_scr', 'math_scr']].copy()
df.insert(1, 'logstr', np.log(df['str'])) # Adding the logstr column
df.head()
 | str | logstr | read_scr | math_scr |
---|---|---|---|---|
0 | 17.889910 | 2.884237 | 691.599976 | 690.000000 |
1 | 21.524664 | 3.069199 | 660.500000 | 661.900024 |
2 | 18.697226 | 2.928375 | 636.299988 | 650.900024 |
3 | 17.357143 | 2.854004 | 651.900024 | 643.500000 |
4 | 18.671329 | 2.926989 | 641.799988 | 639.900024 |
In the case of the insert method, the syntax is insert(location, column_name, series), where location specifies the column position, column_name is the name of the column to be added, and series is the data to be inserted.
There are three ways to delete columns from a DataFrame: (i) using the drop(list of columns/rows, axis=1/0) method, (ii) using the pop(column) method, and (iii) using the del keyword. Consider the following examples; a short pop example is given after the del example below:
# Using the drop method to remove a column
df = CAschool[['str', 'read_scr', 'math_scr']].copy()
# Drop the str column
df.drop('str', axis=1).head()
 | read_scr | math_scr |
---|---|---|
0 | 691.599976 | 690.000000 |
1 | 660.500000 | 661.900024 |
2 | 636.299988 | 650.900024 |
3 | 651.900024 | 643.500000 |
4 | 641.799988 | 639.900024 |
df = CAschool[['str', 'read_scr', 'math_scr']].copy()
# Drop the str column
del df['str']
df.head()
 | read_scr | math_scr |
---|---|---|
0 | 691.599976 | 690.000000 |
1 | 660.500000 | 661.900024 |
2 | 636.299988 | 650.900024 |
3 | 651.900024 | 643.500000 |
4 | 641.799988 | 639.900024 |
Note that the drop method creates a new DataFrame unless we specify inplace=True, while del modifies the DataFrame in place.
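The pop method mentioned above works differently from both: it removes the column in place and returns it as a Series. A minimal sketch:

# Using the pop method to remove a column
df = CAschool[['str', 'read_scr', 'math_scr']].copy()
str_col = df.pop('str') # df now contains only read_scr and math_scr
type(str_col)           # pandas.core.series.Series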
Finally, in Table 7.2, we list some useful methods for DataFrame objects; a short sketch of two of them follows the table.
Method | Description |
---|---|
drop() | Drops specified labels from rows or columns |
dropna() | Removes missing values (NaN values) |
drop_duplicates() | Removes rows that are duplicates of other rows |
values | Retrieves the underlying data as a NumPy array |
index | Returns the index of the DataFrame |
fillna() | Fills NA/NaN or other null values with other values |
T/transpose() | Both swap the rows and columns of a DataFrame |
sort_values() | Sorts by the values along either axis |
sort_index() | Sorts a DataFrame by the values in the index |
count() | Counts non-NA cells for each column or row |
describe() | Generates descriptive statistics |
value_counts() | Returns a Series containing counts of unique values |
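As a brief sketch of sort_values and value_counts applied to the CAschool DataFrame (the choice of columns here is our own illustration):

# Sorting by the student-teacher ratio and counting districts per county
CAschool.sort_values(by='str').head()    # Districts with the smallest ratios first
CAschool['county'].value_counts().head() # Counties with the most districts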
7.5 Reshape and Merge DataFrames
Data are often stored in a wide form, i.e., in a wide-form tabulation. For example, consider the following state per capita income dataset provided by the Bureau of Economic Analysis.
# Importing the state per capita income dataset
income_wide = pd.read_csv('data/state_income.csv', header = 0)
income_wide.head()
 | state | pinc.1993 | pinc.1994 | pinc.1995 | pinc.1996 | pinc.1997 | pinc.1998 | pinc.1999 | pinc.2000 | pinc.2001 | ... | pinc.2006 | pinc.2007 | pinc.2008 | pinc.2009 | pinc.2010 | pinc.2011 | pinc.2012 | pinc.2013 | pinc.2014 | pinc.2015 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Alabama | 18129 | 18977 | 19892 | 20630 | 21516 | 22692 | 23352 | 24338 | 25104 | ... | 31362 | 32598 | 33353 | 32608 | 33752 | 34997 | 35884 | 36107 | 37266 | 38644 |
1 | Alaska | 25036 | 25713 | 26399 | 26953 | 28185 | 29220 | 29892 | 31974 | 33517 | ... | 41058 | 43861 | 47749 | 47069 | 49438 | 52390 | 53627 | 52723 | 55833 | 57583 |
2 | Arizona | 18950 | 19869 | 20753 | 21737 | 22804 | 24176 | 24813 | 26235 | 26803 | ... | 34686 | 35789 | 35563 | 33418 | 33635 | 34968 | 36123 | 36602 | 38226 | 39676 |
3 | Arkansas | 16956 | 17758 | 18602 | 19458 | 20187 | 21222 | 21861 | 22762 | 23840 | ... | 29438 | 31070 | 31940 | 31153 | 31927 | 33863 | 36240 | 36235 | 38260 | 39343 |
4 | California | 23013 | 23581 | 24629 | 25912 | 27171 | 29100 | 30640 | 33364 | 34043 | ... | 42088 | 43629 | 43890 | 42044 | 43634 | 46170 | 48798 | 49277 | 52324 | 55758 |
5 rows × 24 columns
However, the wide form is not compatible with modules used in econometric analysis. Therefore, we need to change it to the long form. To this end, we can use the wide_to_long function.
# Reshaping the DataFrame from wide to long format
income_long = pd.wide_to_long(income_wide, stubnames='pinc', i='state', j='year', sep='.')
income_long.head()
 |  | pinc |
---|---|---|
state | year |  |
Alabama | 1993 | 18129 |
Alaska | 1993 | 25036 |
Arizona | 1993 | 18950 |
Arkansas | 1993 | 16956 |
California | 1993 | 23013 |
Notice that the DataFrame has one column, named pinc. The state and year variables are used to generate a MultiIndex for the DataFrame. We can use the reset_index method to turn them back into data columns.
# Resetting the index to convert the MultiIndex to columns
income_long = income_long.reset_index()
income_long.head()
 | state | year | pinc |
---|---|---|---|
0 | Alabama | 1993 | 18129 |
1 | Alaska | 1993 | 25036 |
2 | Arizona | 1993 | 18950 |
3 | Arkansas | 1993 | 16956 |
4 | California | 1993 | 23013 |
The DataFrame is currently sorted by year and state (year is the slower index). Let us sort it by state and year (with state as the slower index). We can use the sort_values method to sort the DataFrame.
# Sorting the DataFrame by state and year
income_long = income_long.sort_values(by = ["state", "year"])
income_long.head()
 | state | year | pinc |
---|---|---|---|
0 | Alabama | 1993 | 18129 |
50 | Alabama | 1994 | 18977 |
100 | Alabama | 1995 | 19892 |
150 | Alabama | 1996 | 20630 |
200 | Alabama | 1997 | 21516 |
Notice that the index order has changed. We can reset the index again, which adds the current index as a data column named index; we then drop that column from the DataFrame.
# Resetting the index and dropping the old index column
income_long = income_long.reset_index().drop("index", axis = 1)
income_long.head()
 | state | year | pinc |
---|---|---|---|
0 | Alabama | 1993 | 18129 |
1 | Alabama | 1994 | 18977 |
2 | Alabama | 1995 | 19892 |
3 | Alabama | 1996 | 20630 |
4 | Alabama | 1997 | 21516 |
Finally, we will check the data type of each column.
# Checking the data types of the columns
income_long.dtypes
state object
year int64
pinc int64
dtype: object
Next, we will consider the state expenditure data provided by the Bureau of Economic Analysis. The data cover the period from 1993 to 2015, with one CSV file for each year. We will first generate a list of filenames and then sort the list.
# Generating a list of filenames for the expenditure data
file_list = os.listdir('data/expenditure/')
file_list.sort()
file_list
['expnd_1993.csv',
'expnd_1994.csv',
'expnd_1995.csv',
'expnd_1996.csv',
'expnd_1997.csv',
'expnd_1998.csv',
'expnd_1999.csv',
'expnd_2000.csv',
'expnd_2001.csv',
'expnd_2002.csv',
'expnd_2003.csv',
'expnd_2004.csv',
'expnd_2005.csv',
'expnd_2006.csv',
'expnd_2007.csv',
'expnd_2008.csv',
'expnd_2009.csv',
'expnd_2010.csv',
'expnd_2011.csv',
'expnd_2012.csv',
'expnd_2013.csv',
'expnd_2014.csv',
'expnd_2015.csv']
We can now stack these datasets in a for loop; however, we must be aware of the non-uniformity of column names across years. Indeed, although the order of the columns is uniform across years, the names are not. Hence, we rename the columns after importing them.
# Importing the expenditure data from multiple CSV files
tmpnames = ['state','year','tot_revenue','taxes','tot_expnd','education','public_welfare','hospital','health']
expenditure = pd.read_csv('data/expenditure/' + file_list[0], header=0) # Import the initial year
expenditure.columns = tmpnames

for i in range(1, len(file_list)):
    tmpframe = pd.read_csv('data/expenditure/' + file_list[i], header=0)
    tmpframe.columns = tmpnames
    expenditure = pd.concat([expenditure, tmpframe], axis=0)

expenditure.head()
 | state | year | tot_revenue | taxes | tot_expnd | education | public_welfare | hospital | health |
---|---|---|---|---|---|---|---|---|---|
0 | Alabama | 1993 | 11,389,335 | 4,639,784 | 10,242,374 | 1,920,765 | 2,006,829 | 887,835 | 395,901 |
1 | Alaska | 1993 | 7,358,035 | 2,227,283 | 5,423,008 | 604,677 | 454,430 | 35,105 | 103,541 |
2 | Arizona | 1993 | 10,843,407 | 5,281,816 | 9,782,725 | 1,279,055 | 1,900,329 | 71,652 | 357,130 |
3 | Arkansas | 1993 | 6,446,052 | 2,942,581 | 6,014,728 | 1,077,951 | 1,310,779 | 304,906 | 136,990 |
4 | California | 1993 | 108,597,722 | 49,418,058 | 104,567,486 | 9,825,944 | 10,909,104 | 2,818,232 | 1,744,769 |
The expenditure dataset is sorted by year and state (with year as the slower index). Let us sort it by state and year (with state as the slower index).
# Sorting the expenditure DataFrame by state and year
expenditure = expenditure.sort_values(by = ["state", "year"])
expenditure = expenditure.reset_index().drop("index", axis = 1)
expenditure.head()
 | state | year | tot_revenue | taxes | tot_expnd | education | public_welfare | hospital | health |
---|---|---|---|---|---|---|---|---|---|
0 | Alabama | 1993 | 11,389,335 | 4,639,784 | 10,242,374 | 1,920,765 | 2,006,829 | 887,835 | 395,901 |
1 | Alabama | 1994 | 11,599,362 | 4,767,108 | 10,815,221 | 2,101,540 | 2,157,831 | 823,194 | 470,960 |
2 | Alabama | 1995 | 12448670 | 5077827 | 11634629 | 2260473 | 2282961 | 811433 | 474424 |
3 | Alabama | 1996 | 12741148 | 5257771 | 12126587 | 2240613 | 2325418 | 815698 | 500625 |
4 | Alabama | 1997 | 14,007,883 | 5,484,161 | 12,944,867 | 5,175,279 | 2,537,627 | 882,613 | 566,651 |
We must check the data type of each column to make sure that the expenditure variables are numeric.
# Checking the data types of the columns in the expenditure DataFrame
expenditure.dtypes
state object
year int64
tot_revenue object
taxes object
tot_expnd object
education object
public_welfare object
hospital object
health object
dtype: object
Unfortunately, dtype is object for all expenditure variables. This is due to the commas (,) in those columns. We need to convert these columns to numeric type by dropping the commas and then resetting the dtype to numeric.
# Converting expenditure columns to numeric by removing commas
columns = ['tot_revenue','taxes','tot_expnd','education','public_welfare','hospital','health']
expenditure[columns] = expenditure[columns].replace(",", "", regex = True).astype('int')
expenditure.dtypes
state object
year int64
tot_revenue int32
taxes int32
tot_expnd int32
education int32
public_welfare int32
hospital int32
health int32
dtype: object
We can now merge the income and expenditure datasets on state and year. The pandas merge() and join() methods use a set of rules known as relational algebra to combine data. If the key columns have the same names in the left and right DataFrames, a value for the on argument can be provided. If the key columns are named differently, we can specify the key columns for the merge using the left_on and right_on arguments, corresponding to the left and right DataFrames. Similarly, if the left and right DataFrames have indices set, we can use the left_index and right_index arguments (a sketch of left_on/right_on follows the merged table below).
# Merging the income and expenditure DataFrames on state and year
final_df = pd.merge(income_long, expenditure, on = ["state", "year"])
final_df.head()
 | state | year | pinc | tot_revenue | taxes | tot_expnd | education | public_welfare | hospital | health |
---|---|---|---|---|---|---|---|---|---|---|
0 | Alabama | 1993 | 18129 | 11389335 | 4639784 | 10242374 | 1920765 | 2006829 | 887835 | 395901 |
1 | Alabama | 1994 | 18977 | 11599362 | 4767108 | 10815221 | 2101540 | 2157831 | 823194 | 470960 |
2 | Alabama | 1995 | 19892 | 12448670 | 5077827 | 11634629 | 2260473 | 2282961 | 811433 | 474424 |
3 | Alabama | 1996 | 20630 | 12741148 | 5257771 | 12126587 | 2240613 | 2325418 | 815698 | 500625 |
4 | Alabama | 1997 | 21516 | 14007883 | 5484161 | 12944867 | 5175279 | 2537627 | 882613 | 566651 |
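If the key columns were named differently in the two DataFrames, we could use the left_on and right_on arguments instead of on. A minimal sketch with hypothetical DataFrames and column names:

# Hypothetical example: the key column is named differently in each DataFrame
left = pd.DataFrame({'st': ['Alabama', 'Alaska'], 'pinc': [18129, 25036]})
right = pd.DataFrame({'state_name': ['Alabama', 'Alaska'], 'taxes': [4639784, 2227283]})
pd.merge(left, right, left_on='st', right_on='state_name')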
7.6 Further Readings
In this section, we introduce commonly used data structures and manipulation tools from the pandas module that are useful for econometric analysis. For further reading on pandas data structures and manipulation tools, we refer the reader to Sheppard (2021), McKinney (2022), and VanderPlas (2023).