In the previous session, we introduced Pandas' basic data structures and indexing methods. Building on that understanding of how Pandas organizes data and how to select the data we need, this session covers using Pandas for calculation and statistical analysis.
Pandas Calculations
Since Pandas is built on NumPy, its computational core aligns with NumPy's: operations on entire columns and rows are vectorized, so calculations run fast without explicit loops.
At the same time, Pandas also supports the vast majority of NumPy's built-in functions.
Below, we demonstrate Pandas’ computational capabilities by constructing a pseudo-meteorological element dataset.
```python
import pandas as pd
import numpy as np

# Create a pseudo-meteorological element dataset
df = pd.DataFrame({'temperature': [22, 23, 24, 25],
                   'humidity': [60, 65, 70, 75],
                   'pressure': [1013, 1015, 1017, 1019],
                   'wind': [5, 10, 15, 20],
                   'precipitation': [0.5, 0.7, 0.9, 1.1]},
                  index=['Moscow', 'Boston', 'Rome', 'Tokyo'])
print(df)
# Output:
#         temperature  humidity  pressure  wind  precipitation
# Moscow           22        60      1013     5            0.5
# Boston           23        65      1015    10            0.7
# Rome             24        70      1017    15            0.9
# Tokyo            25        75      1019    20            1.1
```
```python
# Simple arithmetic operations via indexing, used for unit conversions, etc.
print(df['temperature'] + 273.15, '\n')    # Celsius to Kelvin
print(df['temperature'] * 9/5 + 32, '\n')  # Celsius to Fahrenheit
print(df['precipitation'] / 86400, '\n')   # mm/day to mm/s (86400 s in a day)
print(df['wind'] * 1.60934, '\n')          # mph to km/h
print(df['pressure'] * 0.750062)           # hPa to mmHg
# Output:
# Moscow    295.15
# Boston    296.15
# Rome      297.15
# Tokyo     298.15
# Name: temperature, dtype: float64
#
# Moscow    71.6
# Boston    73.4
# Rome      75.2
# Tokyo     77.0
# Name: temperature, dtype: float64
#
# Moscow    0.000006
# Boston    0.000008
# Rome      0.000010
# Tokyo     0.000013
# Name: precipitation, dtype: float64
#
# Moscow     8.0467
# Boston    16.0934
# Rome      24.1401
# Tokyo     32.1868
# Name: wind, dtype: float64
#
# Moscow    759.812806
# Boston    761.312930
# Rome      762.813054
# Tokyo     764.313178
# Name: pressure, dtype: float64
```
```python
# Besides column calculations, row operations are also supported. However,
# operating across rows of this table has little physical meaning, so we
# first transpose the DataFrame and then operate on its rows.
df0 = df.T
print(df0, '\n')
print(df0.loc['temperature'] + 273.15, '\n')
print(df0.loc['precipitation', 'Tokyo'] * 30, '\n')  # Calculate a single cell
print(df0['Moscow'] - df0['Tokyo'], '\n')  # Difference between two cities (columns of the transposed table)
# We can also operate on arbitrary slices, though it is not meaningful here.
print(df0.loc['humidity':'wind', ['Boston', 'Tokyo']] + 10000)
# Output:
#                Moscow  Boston    Rome   Tokyo
# temperature      22.0    23.0    24.0    25.0
# humidity         60.0    65.0    70.0    75.0
# pressure       1013.0  1015.0  1017.0  1019.0
# wind              5.0    10.0    15.0    20.0
# precipitation     0.5     0.7     0.9     1.1
#
# Moscow    295.15
# Boston    296.15
# Rome      297.15
# Tokyo     298.15
# Name: temperature, dtype: float64
#
# 33.0
#
# temperature      -3.0
# humidity        -15.0
# pressure         -6.0
# wind            -15.0
# precipitation    -0.6
# dtype: float64
#
#            Boston    Tokyo
# humidity  10065.0  10075.0
# pressure  11015.0  11019.0
# wind      10010.0  10020.0
```
```python
# Similarly, commonly used functions from NumPy also work on Pandas objects.
df = pd.DataFrame({'lon': [-105, -90, 0, 10, 120], 'lat': [30, 40, 50, 60, 70]})
print(df, '\n')
# Convert longitude/latitude to radians
df['lon_rad'] = np.deg2rad(df['lon'])
df['lat_rad'] = np.deg2rad(df['lat'])
print(df, '\n')
# Circumference of each latitude circle (Earth radius ≈ 6371 km)
df['lat_len'] = 2 * np.pi * 6371 * np.cos(df['lat_rad'])
print(df)
# Output:
#    lon  lat
# 0 -105   30
# 1  -90   40
# 2    0   50
# 3   10   60
# 4  120   70
#
#    lon  lat   lon_rad   lat_rad
# 0 -105   30 -1.832596  0.523599
# 1  -90   40 -1.570796  0.698132
# 2    0   50  0.000000  0.872665
# 3   10   60  0.174533  1.047198
# 4  120   70  2.094395  1.221730
#
#    lon  lat   lon_rad   lat_rad       lat_len
# 0 -105   30 -1.832596  0.523599  34667.147249
# 1  -90   40 -1.570796  0.698132  30664.892037
# 2    0   50  0.000000  0.872665  25730.899599
# 3   10   60  0.174533  1.047198  20015.086796
# 4  120   70  2.094395  1.221730  13691.125709
```
By combining Pandas with various operators and functions, we can conveniently implement many operations similar to those in Excel.
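For instance, a derived quantity that would take several helper columns in Excel is a single expression here. The sketch below computes the wind chill index using the standard Environment Canada / NWS coefficients; the cities and values are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({'temperature': [-5, -10, -15],  # air temperature, °C
                   'wind': [10, 20, 30]},          # wind speed, km/h
                  index=['Oslo', 'Helsinki', 'Reykjavik'])

# Wind chill index: 13.12 + 0.6215*T - 11.37*V^0.16 + 0.3965*T*V^0.16
# (T in °C, V in km/h)
v16 = df['wind'] ** 0.16
df['wind_chill'] = (13.12 + 0.6215 * df['temperature']
                    - 11.37 * v16 + 0.3965 * df['temperature'] * v16)
print(df.round(1))
```

The whole column is computed in one vectorized pass, with no per-row loop.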
One point to note: since Pandas data objects carry row and column names, when two different objects are combined, their row and column labels are automatically aligned. Positions whose labels exist in only one of the operands are filled with null values (NaN).
```python
df0 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['a', 'b', 'c'])
df1 = pd.DataFrame({'A': [10, 20, 30], 'C': [40, 50, 60]}, index=['a', 'b', 'd'])
print(df0, '\n', df1, '\n', df0 + df1)
# Output:
#    A  B
# a  1  4
# b  2  5
# c  3  6
#     A   C
# a  10  40
# b  20  50
# d  30  60
#       A   B   C
# a  11.0 NaN NaN
# b  22.0 NaN NaN
# c   NaN NaN NaN
# d   NaN NaN NaN
```
```python
# However, when we extract data from one DataFrame and assign it to another,
# only the rows whose indices already exist in the target DataFrame are kept.
df0['C'] = df1['C']
print(df0, '\n')  # Row 'd' does not appear in the result
# Row assignment behaves the same way: columns that do not exist in the
# target DataFrame are not created.
df0 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['a', 'b', 'c'])
df1 = pd.DataFrame({'A': [10, 20, 30], 'C': [40, 50, 60]}, index=['a', 'b', 'd'])
df0.loc['d', :] = df1.loc['d', :]
print(df0)  # Column 'C' does not appear in the result
# Output:
#    A  B     C
# a  1  4  40.0
# b  2  5  50.0
# c  3  6   NaN
#
#       A    B
# a   1.0  4.0
# b   2.0  5.0
# c   3.0  6.0
# d  30.0  NaN
```
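When the NaN padding from automatic alignment is unwanted, the arithmetic methods accept a fill_value argument that stands in for a missing label on either side before computing. A minimal sketch with the same two small frames:

```python
import pandas as pd

df0 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['a', 'b', 'c'])
df1 = pd.DataFrame({'A': [10, 20, 30], 'C': [40, 50, 60]}, index=['a', 'b', 'd'])

# Labels missing from one operand are treated as fill_value (0), so only
# positions absent from *both* operands remain NaN.
res = df0.add(df1, fill_value=0)
print(res)
```

Here row 'c' keeps its values from df0 and row 'd' keeps its values from df1; only cells like ('c', 'C'), missing on both sides, stay NaN.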
Furthermore, Pandas' arithmetic methods (add, sub, mul, div, pow, and so on) take an axis parameter that specifies whether the other operand is matched against rows or columns, enabling a different value to be applied to each row or each column.
```python
df = pd.DataFrame({'temperature': [22, 23, 24, 25],
                   'humidity': [60, 65, 70, 75],
                   'pressure': [1013, 1015, 1017, 1019],
                   'wind': [5, 10, 15, 20],
                   'precipitation': [0.5, 0.7, 0.9, 1.1]},
                  index=['Moscow', 'Boston', 'Rome', 'Tokyo'])
print(df, '\n')
print(df.add([1000, 2000, 3000, 4000, 5000], axis=1), '\n')  # Add a different value to each column (axis=1: match on columns)
print(df.sub([1000, 2000, 3000, 4000, 5000]), '\n')  # Subtract; axis can be omitted because 'columns' (1) is the default for these methods
print(df.mul([10, 20, 30, 40], axis=0), '\n')        # Multiply each row by a different value (axis=0: match on rows)
print(df.div([1000, 2000, 3000, 4000], axis=0), '\n')  # Divide each row by a different value
print(df.pow([.2, .3, .4, .5], axis=0))              # Raise each row to a different power
# Output:
#         temperature  humidity  pressure  wind  precipitation
# Moscow           22        60      1013     5            0.5
# Boston           23        65      1015    10            0.7
# Rome             24        70      1017    15            0.9
# Tokyo            25        75      1019    20            1.1
#
#         temperature  humidity  pressure  wind  precipitation
# Moscow         1022      2060      4013  4005         5000.5
# Boston         1023      2065      4015  4010         5000.7
# Rome           1024      2070      4017  4015         5000.9
# Tokyo          1025      2075      4019  4020         5001.1
#
#         temperature  humidity  pressure  wind  precipitation
# Moscow         -978     -1940     -1987 -3995        -4999.5
# Boston         -977     -1935     -1985 -3990        -4999.3
# Rome           -976     -1930     -1983 -3985        -4999.1
# Tokyo          -975     -1925     -1981 -3980        -4998.9
#
#         temperature  humidity  pressure  wind  precipitation
# Moscow          220       600     10130    50            5.0
# Boston          460      1300     20300   200           14.0
# Rome            720      2100     30510   450           27.0
# Tokyo          1000      3000     40760   800           44.0
#
#         temperature  humidity  pressure   wind  precipitation
# Moscow      0.02200  0.060000   1.01300  0.005       0.000500
# Boston      0.01150  0.032500   0.50750  0.005       0.000350
# Rome        0.00800  0.023333   0.33900  0.005       0.000300
# Tokyo       0.00625  0.018750   0.25475  0.005       0.000275
#
#         temperature  humidity   pressure      wind  precipitation
# Moscow     1.855601  2.267933   3.991369  1.379730       0.870551
# Boston     2.561642  3.498437   7.978841  1.995262       0.898523
# Rome       3.565205  5.470654  15.956160  2.954177       0.958732
# Tokyo      5.000000  8.660254  31.921779  4.472136       1.048809
```
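The second operand does not have to be a plain list: a Series works too, and axis controls which labels it is matched against. A common use is subtracting each column's mean to get anomalies (a minimal sketch with two of the columns above):

```python
import pandas as pd

df = pd.DataFrame({'temperature': [22, 23, 24, 25],
                   'humidity': [60, 65, 70, 75]},
                  index=['Moscow', 'Boston', 'Rome', 'Tokyo'])

# df.mean() is a Series indexed by column name; axis='columns' matches it
# against df's columns, subtracting each column's own mean.
anomalies = df.sub(df.mean(), axis='columns')
print(anomalies)
```

Each resulting column sums to zero, since its mean has been removed.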
Pandas Statistics
Statistics are one of the main reasons tables exist. Computing metrics such as the mean, variance, and median over large amounts of data helps us understand the data more deeply and extract patterns from it.
Pandas provides rich statistical functions to help us quickly calculate statistical metrics for a dataset.
We’ll start again with a set of pseudo-meteorological data:
```python
df = pd.DataFrame({'temperature': [-12, 23, 34, 30],
                   'humidity': [90, 85, 60, 75],
                   'pressure': [998, 1005, 1017, 1019],
                   'wind': [10, 8, 2, 0.5],
                   'precipitation': [2, 1.7, 0, 0.1]},
                  index=['Stockholm', 'Vienna', 'Barcelona', 'San Francisco'])
# A single function gives a quick overview of the dataset.
print(df.describe(), '\n')
# Alternatively, we can ask for specific metrics of interest.
print(df['temperature'].mean())           # Mean
print(df['humidity'].median())            # Median
print(df['pressure'].max())               # Maximum
print(df['wind'].min())                   # Minimum
print(df['precipitation'].std())          # Standard deviation
print(df['temperature'].quantile([0.9]))  # 90th percentile
print(df['humidity'].sum())               # Sum (though not meaningful here)
print(df.corr())                          # Pairwise correlation coefficients
# Output:
#        temperature   humidity     pressure      wind  precipitation
# count     4.000000   4.000000     4.000000   4.00000       4.000000
# mean     18.750000  77.500000  1009.750000   5.12500       0.950000
# std      20.998016  13.228757     9.979145   4.58939       1.047219
# min     -12.000000  60.000000   998.000000   0.50000       0.000000
# 25%      14.250000  71.250000  1003.250000   1.62500       0.075000
# 50%      26.500000  80.000000  1011.000000   5.00000       0.900000
# 75%      31.000000  86.250000  1017.500000   8.50000       1.775000
# max      34.000000  90.000000  1019.000000  10.00000       2.000000
#
# 18.75
# 80.0
# 1019
# 0.5
# 1.0472185381603338
# 0.9    32.8
# Name: temperature, dtype: float64
# 310
#                temperature  humidity  pressure      wind  precipitation
# temperature       1.000000 -0.777000  0.884070 -0.821071      -0.805687
# humidity         -0.777000  1.000000 -0.839572  0.816698       0.902306
# pressure          0.884070 -0.839572  1.000000 -0.992579      -0.977639
# wind             -0.821071  0.816698 -0.992579  1.000000       0.983127
# precipitation    -0.805687  0.902306 -0.977639  0.983127       1.000000
```
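Several statistics can also be computed in one call with agg(), which returns one row per statistic and one column per variable (a minimal sketch with two of the columns above):

```python
import pandas as pd

df = pd.DataFrame({'temperature': [-12, 23, 34, 30],
                   'humidity': [90, 85, 60, 75]},
                  index=['Stockholm', 'Vienna', 'Barcelona', 'San Francisco'])

# Rows of the result are the requested statistics; columns are the variables.
stats = df.agg(['mean', 'std', 'min', 'max'])
print(stats)
```

This is a compact middle ground between describe() and calling each metric separately.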
When a table has a large number of entries, we can use the groupby() function to split the data into groups by label and aggregate each group separately.
```python
# Generate random population and GDP data for a few countries
country = ['United States', 'United Kingdom', 'Switzerland', 'Finland', 'Russia']
df = pd.DataFrame({'Country': [country[x] for x in np.random.randint(0, len(country), 2000)],
                   'Population': np.random.randint(1, 100, 2000),
                   'GDP': np.random.randint(1000, 10000, 2000),
                   'Season': np.random.choice(['Spring', 'Summer', 'Fall', 'Winter'], 2000)})
print(df.head())  # Show the first few rows
# Output (example):
#           Country  Population   GDP  Season
# 0     Switzerland           3  4883  Summer
# 1  United Kingdom          59  3654  Winter
# 2          Russia          14  3451  Summer
# 3         Finland          55  5472  Summer
# 4         Finland          85  8517  Summer
```
```python
# Select the numeric columns before aggregating; recent pandas versions raise
# an error when mean()/median() encounter the string 'Season' column.
print(df.groupby('Country')[['Population', 'GDP']].mean())    # Mean population and GDP per country, across all seasons
print(df.groupby('Country')[['Population', 'GDP']].median())  # Median
print(df.groupby('Country')[['Population', 'GDP']].min())     # Minimum
print(df.groupby('Country')[['Population', 'GDP']].max())     # Maximum
print(df.groupby('Season').count())               # Count per season
print(df.groupby(['Season', 'Country']).count())  # Grouping by multiple labels at once is also supported
# Output (example):
#                 Population          GDP
# Country
# Finland          50.799007  5500.885856
# Russia           50.808824  5407.105392
# Switzerland      49.598958  5706.822917
# United Kingdom   49.260759  5486.015190
# United States    51.509756  5458.958537
# ... etc.
```
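groupby() also combines with agg(), so different columns can receive different statistics in one pass. The sketch below uses named aggregation; the result column names (pop_mean, gdp_total) are just illustrative labels:

```python
import pandas as pd
import numpy as np

# Small reproducible dataset in the same spirit as above
rng = np.random.default_rng(0)
df = pd.DataFrame({'Country': rng.choice(['Finland', 'Russia'], 100),
                   'Population': rng.integers(1, 100, 100),
                   'GDP': rng.integers(1000, 10000, 100)})

# Named aggregation: result column = (source column, statistic)
summary = df.groupby('Country').agg(pop_mean=('Population', 'mean'),
                                    gdp_total=('GDP', 'sum'))
print(summary)
```

This avoids the awkward multi-level column headers that arise when passing a plain list of functions to agg().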
Postscript
The above covers the basic calculation and statistics operations in Pandas. Many of these functionalities are essential in everyday data processing, and with row and column operations, metrics such as Root Mean Square Error (RMSE) and Mean Absolute Error (MAE) also become straightforward to compute.
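For example, given observed and forecast columns, RMSE and MAE each reduce to a single expression (a minimal sketch with made-up numbers):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'observed': [22.0, 23.0, 24.0, 25.0],
                   'forecast': [21.5, 23.4, 24.8, 24.1]})

err = df['forecast'] - df['observed']
rmse = np.sqrt((err ** 2).mean())  # root of the mean squared error
mae = err.abs().mean()             # mean of the absolute errors
print(f'RMSE = {rmse:.3f}, MAE = {mae:.3f}')
```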
Moreover, mastering these functions frees us from tedious cell-dragging in Excel. A one-off calculation there is manageable, but a mistake made in a rush, or data that needs re-running, means repeating the manual labor; with code, we simply run the script again.