In the previous session, we introduced Pandas' basic data structures and indexing methods. Building on that understanding of how Pandas organizes data and how to select the data we need, this session covers using Pandas for calculation and statistical analysis.
Pandas Calculations
Since Pandas is built on NumPy, its computational core aligns with NumPy's: operations on entire columns and rows are vectorized, so calculations run fast without explicit loops.
At the same time, Pandas also supports the vast majority of NumPy's built-in functions.
Below, we demonstrate Pandas’ computational capabilities by constructing a pseudo-meteorological element dataset.
```python
import pandas as pd
import numpy as np

# Create a pseudo-meteorological element dataset
df = pd.DataFrame({'temperature': [22, 23, 24, 25],
                   'humidity': [60, 65, 70, 75],
                   'pressure': [1013, 1015, 1017, 1019],
                   'wind': [5, 10, 15, 20],
                   'precipitation': [0.5, 0.7, 0.9, 1.1]},
                  index=['Moscow', 'Boston', 'Rome', 'Tokyo'])
print(df)
# Output:
#         temperature  humidity  pressure  wind  precipitation
# Moscow           22        60      1013     5            0.5
# Boston           23        65      1015    10            0.7
# Rome             24        70      1017    15            0.9
# Tokyo            25        75      1019    20            1.1
```
```python
# Simple arithmetic operations via indexing, used for unit conversions, etc.
print(df['temperature'] + 273.15, '\n')    # Celsius to Kelvin
print(df['temperature'] * 9/5 + 32, '\n')  # Celsius to Fahrenheit
print(df['precipitation'] / 86400, '\n')   # mm/day to mm/s (86400 s in a day)
print(df['wind'] * 1.60934, '\n')          # mph to km/h
print(df['pressure'] * 0.750062)           # hPa to mmHg
# Output:
# Moscow    295.15
# Boston    296.15
# Rome      297.15
# Tokyo     298.15
# Name: temperature, dtype: float64
#
# Moscow    71.6
# Boston    73.4
# Rome      75.2
# Tokyo     77.0
# Name: temperature, dtype: float64
#
# Moscow    0.000006
# Boston    0.000008
# Rome      0.000010
# Tokyo     0.000013
# Name: precipitation, dtype: float64
#
# Moscow     8.0467
# Boston    16.0934
# Rome      24.1401
# Tokyo     32.1868
# Name: wind, dtype: float64
#
# Moscow    759.812806
# Boston    761.312930
# Rome      762.813054
# Tokyo     764.313178
# Name: pressure, dtype: float64
```
```python
# Besides column calculations, row operations are also supported. However,
# operating across rows of this table has little physical meaning, so we
# first transpose the DataFrame and then operate on its rows.
df0 = df.T
print(df0, '\n')
print(df0.loc['temperature'] + 273.15, '\n')
print(df0.loc['precipitation', 'Tokyo'] * 30, '\n')  # Calculate a single cell
print(df0['Moscow'] - df0['Tokyo'], '\n')  # Difference between two cities (columns of the transposed table)
# We can also operate on arbitrary slices, though it is not meaningful here.
print(df0.loc['humidity':'wind', ['Boston', 'Tokyo']] + 10000)
# Output:
#                Moscow  Boston    Rome   Tokyo
# temperature      22.0    23.0    24.0    25.0
# humidity         60.0    65.0    70.0    75.0
# pressure       1013.0  1015.0  1017.0  1019.0
# wind              5.0    10.0    15.0    20.0
# precipitation     0.5     0.7     0.9     1.1
#
# Moscow    295.15
# Boston    296.15
# Rome      297.15
# Tokyo     298.15
# Name: temperature, dtype: float64
#
# 33.0
#
# temperature      -3.0
# humidity        -15.0
# pressure         -6.0
# wind            -15.0
# precipitation    -0.6
# dtype: float64
#
#            Boston    Tokyo
# humidity  10065.0  10075.0
# pressure  11015.0  11019.0
# wind      10010.0  10020.0
```
```python
# Similarly, commonly used functions from NumPy also work on Pandas objects.
df = pd.DataFrame({'lon': [-105, -90, 0, 10, 120], 'lat': [30, 40, 50, 60, 70]})
print(df, '\n')
# Convert longitude/latitude to radians
df['lon_rad'] = np.deg2rad(df['lon'])
df['lat_rad'] = np.deg2rad(df['lat'])
print(df, '\n')
# Circumference of each latitude circle (Earth radius ≈ 6371 km)
df['lat_len'] = 2 * np.pi * 6371 * np.cos(df['lat_rad'])
print(df)
# Output:
#    lon  lat
# 0 -105   30
# 1  -90   40
# 2    0   50
# 3   10   60
# 4  120   70
#
#    lon  lat   lon_rad   lat_rad
# 0 -105   30 -1.832596  0.523599
# 1  -90   40 -1.570796  0.698132
# 2    0   50  0.000000  0.872665
# 3   10   60  0.174533  1.047198
# 4  120   70  2.094395  1.221730
#
#    lon  lat   lon_rad   lat_rad       lat_len
# 0 -105   30 -1.832596  0.523599  34667.147249
# 1  -90   40 -1.570796  0.698132  30664.892037
# 2    0   50  0.000000  0.872665  25730.899599
# 3   10   60  0.174533  1.047198  20015.086796
# 4  120   70  2.094395  1.221730  13691.125709
```
By combining Pandas with various operators and functions, we can conveniently implement many operations similar to those in Excel.
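For instance, a derived quantity that would take several helper columns in Excel is a single expression here. The sketch below computes the wind chill index using the standard Environment Canada / NWS coefficients; the cities and values are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({'temperature': [-5, -10, -15],  # air temperature, °C
                   'wind': [10, 20, 30]},          # wind speed, km/h
                  index=['Oslo', 'Helsinki', 'Reykjavik'])

# Wind chill index: 13.12 + 0.6215*T - 11.37*V^0.16 + 0.3965*T*V^0.16
# (T in °C, V in km/h)
v16 = df['wind'] ** 0.16
df['wind_chill'] = (13.12 + 0.6215 * df['temperature']
                    - 11.37 * v16 + 0.3965 * df['temperature'] * v16)
print(df.round(1))
```

The whole column is computed in one vectorized pass, with no per-row loop.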
One point to note: since Pandas data objects carry row and column names, when two different objects are combined, their row and column labels are automatically aligned. Positions whose labels exist in only one of the operands are filled with null values (NaN).
```python
df0 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['a', 'b', 'c'])
df1 = pd.DataFrame({'A': [10, 20, 30], 'C': [40, 50, 60]}, index=['a', 'b', 'd'])
print(df0, '\n', df1, '\n', df0 + df1)
# Output:
#    A  B
# a  1  4
# b  2  5
# c  3  6
#     A   C
# a  10  40
# b  20  50
# d  30  60
#       A   B   C
# a  11.0 NaN NaN
# b  22.0 NaN NaN
# c   NaN NaN NaN
# d   NaN NaN NaN
```
```python
# However, when we extract data from one DataFrame and assign it to another,
# only the rows whose indices already exist in the target DataFrame are kept.
df0['C'] = df1['C']
print(df0, '\n')  # Row 'd' does not appear in the result
# Row assignment behaves the same way: columns that do not exist in the
# target DataFrame are not created.
df0 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['a', 'b', 'c'])
df1 = pd.DataFrame({'A': [10, 20, 30], 'C': [40, 50, 60]}, index=['a', 'b', 'd'])
df0.loc['d', :] = df1.loc['d', :]
print(df0)  # Column 'C' does not appear in the result
# Output:
#    A  B     C
# a  1  4  40.0
# b  2  5  50.0
# c  3  6   NaN
#
#       A    B
# a   1.0  4.0
# b   2.0  5.0
# c   3.0  6.0
# d  30.0  NaN
```
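When the NaN padding from automatic alignment is unwanted, the arithmetic methods accept a fill_value argument that stands in for a missing label on either side before computing. A minimal sketch with the same two small frames:

```python
import pandas as pd

df0 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['a', 'b', 'c'])
df1 = pd.DataFrame({'A': [10, 20, 30], 'C': [40, 50, 60]}, index=['a', 'b', 'd'])

# Labels missing from one operand are treated as fill_value (0), so only
# positions absent from *both* operands remain NaN.
res = df0.add(df1, fill_value=0)
print(res)
```

Here row 'c' keeps its values from df0 and row 'd' keeps its values from df1; only cells like ('c', 'C'), missing on both sides, stay NaN.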
Furthermore, Pandas' arithmetic methods (add, sub, mul, div, pow, and so on) take an axis parameter that specifies whether the other operand is matched against rows or columns, enabling a different value to be applied to each row or each column.
```python
df = pd.DataFrame({'temperature': [22, 23, 24, 25],
                   'humidity': [60, 65, 70, 75],
                   'pressure': [1013, 1015, 1017, 1019],
                   'wind': [5, 10, 15, 20],
                   'precipitation': [0.5, 0.7, 0.9, 1.1]},
                  index=['Moscow', 'Boston', 'Rome', 'Tokyo'])
print(df, '\n')
print(df.add([1000, 2000, 3000, 4000, 5000], axis=1), '\n')  # Add a different value to each column (axis=1: match on columns)
print(df.sub([1000, 2000, 3000, 4000, 5000]), '\n')  # Subtract; axis can be omitted because 'columns' (1) is the default for these methods
print(df.mul([10, 20, 30, 40], axis=0), '\n')        # Multiply each row by a different value (axis=0: match on rows)
print(df.div([1000, 2000, 3000, 4000], axis=0), '\n')  # Divide each row by a different value
print(df.pow([.2, .3, .4, .5], axis=0))              # Raise each row to a different power
# Output:
#         temperature  humidity  pressure  wind  precipitation
# Moscow           22        60      1013     5            0.5
# Boston           23        65      1015    10            0.7
# Rome             24        70      1017    15            0.9
# Tokyo            25        75      1019    20            1.1
#
#         temperature  humidity  pressure  wind  precipitation
# Moscow         1022      2060      4013  4005         5000.5
# Boston         1023      2065      4015  4010         5000.7
# Rome           1024      2070      4017  4015         5000.9
# Tokyo          1025      2075      4019  4020         5001.1
#
#         temperature  humidity  pressure  wind  precipitation
# Moscow         -978     -1940     -1987 -3995        -4999.5
# Boston         -977     -1935     -1985 -3990        -4999.3
# Rome           -976     -1930     -1983 -3985        -4999.1
# Tokyo          -975     -1925     -1981 -3980        -4998.9
#
#         temperature  humidity  pressure  wind  precipitation
# Moscow          220       600     10130    50            5.0
# Boston          460      1300     20300   200           14.0
# Rome            720      2100     30510   450           27.0
# Tokyo          1000      3000     40760   800           44.0
#
#         temperature  humidity  pressure   wind  precipitation
# Moscow      0.02200  0.060000   1.01300  0.005       0.000500
# Boston      0.01150  0.032500   0.50750  0.005       0.000350
# Rome        0.00800  0.023333   0.33900  0.005       0.000300
# Tokyo       0.00625  0.018750   0.25475  0.005       0.000275
#
#         temperature  humidity   pressure      wind  precipitation
# Moscow     1.855601  2.267933   3.991369  1.379730       0.870551
# Boston     2.561642  3.498437   7.978841  1.995262       0.898523
# Rome       3.565205  5.470654  15.956160  2.954177       0.958732
# Tokyo      5.000000  8.660254  31.921779  4.472136       1.048809
```
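The second operand does not have to be a plain list: a Series works too, and axis controls which labels it is matched against. A common use is subtracting each column's mean to get anomalies (a minimal sketch with two of the columns above):

```python
import pandas as pd

df = pd.DataFrame({'temperature': [22, 23, 24, 25],
                   'humidity': [60, 65, 70, 75]},
                  index=['Moscow', 'Boston', 'Rome', 'Tokyo'])

# df.mean() is a Series indexed by column name; axis='columns' matches it
# against df's columns, subtracting each column's own mean.
anomalies = df.sub(df.mean(), axis='columns')
print(anomalies)
```

Each resulting column sums to zero, since its mean has been removed.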
Pandas Statistics
Statistics are one of the main reasons tables exist. Computing metrics such as the mean, variance, and median over large amounts of data helps us understand the data more deeply and extract patterns from it.
Pandas provides rich statistical functions to help us quickly calculate statistical metrics for a dataset.
We’ll start again with a set of pseudo-meteorological data:
```python
df = pd.DataFrame({'temperature': [-12, 23, 34, 30],
                   'humidity': [90, 85, 60, 75],
                   'pressure': [998, 1005, 1017, 1019],
                   'wind': [10, 8, 2, 0.5],
                   'precipitation': [2, 1.7, 0, 0.1]},
                  index=['Stockholm', 'Vienna', 'Barcelona', 'San Francisco'])
# A single function gives a quick overview of the dataset.
print(df.describe(), '\n')
# Alternatively, we can ask for specific metrics of interest.
print(df['temperature'].mean())           # Mean
print(df['humidity'].median())            # Median
print(df['pressure'].max())               # Maximum
print(df['wind'].min())                   # Minimum
print(df['precipitation'].std())          # Standard deviation
print(df['temperature'].quantile([0.9]))  # 90th percentile
print(df['humidity'].sum())               # Sum (though not meaningful here)
print(df.corr())                          # Pairwise correlation coefficients
# Output:
#        temperature   humidity     pressure      wind  precipitation
# count     4.000000   4.000000     4.000000   4.00000       4.000000
# mean     18.750000  77.500000  1009.750000   5.12500       0.950000
# std      20.998016  13.228757     9.979145   4.58939       1.047219
# min     -12.000000  60.000000   998.000000   0.50000       0.000000
# 25%      14.250000  71.250000  1003.250000   1.62500       0.075000
# 50%      26.500000  80.000000  1011.000000   5.00000       0.900000
# 75%      31.000000  86.250000  1017.500000   8.50000       1.775000
# max      34.000000  90.000000  1019.000000  10.00000       2.000000
#
# 18.75
# 80.0
# 1019
# 0.5
# 1.0472185381603338
# 0.9    32.8
# Name: temperature, dtype: float64
# 310
#                temperature  humidity  pressure      wind  precipitation
# temperature       1.000000 -0.777000  0.884070 -0.821071      -0.805687
# humidity         -0.777000  1.000000 -0.839572  0.816698       0.902306
# pressure          0.884070 -0.839572  1.000000 -0.992579      -0.977639
# wind             -0.821071  0.816698 -0.992579  1.000000       0.983127
# precipitation    -0.805687  0.902306 -0.977639  0.983127       1.000000
```
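Several statistics can also be computed in one call with agg(), which returns one row per statistic and one column per variable (a minimal sketch with two of the columns above):

```python
import pandas as pd

df = pd.DataFrame({'temperature': [-12, 23, 34, 30],
                   'humidity': [90, 85, 60, 75]},
                  index=['Stockholm', 'Vienna', 'Barcelona', 'San Francisco'])

# Rows of the result are the requested statistics; columns are the variables.
stats = df.agg(['mean', 'std', 'min', 'max'])
print(stats)
```

This is a compact middle ground between describe() and calling each metric separately.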
When a table has a large number of entries, we can use the groupby() function to split the data into groups by label and aggregate each group separately.
```python
# Generate random population and GDP data for a few countries
country = ['United States', 'United Kingdom', 'Switzerland', 'Finland', 'Russia']
df = pd.DataFrame({'Country': [country[x] for x in np.random.randint(0, len(country), 2000)],
                   'Population': np.random.randint(1, 100, 2000),
                   'GDP': np.random.randint(1000, 10000, 2000),
                   'Season': np.random.choice(['Spring', 'Summer', 'Fall', 'Winter'], 2000)})
print(df.head())  # Show the first few rows
# Output (example):
#           Country  Population   GDP  Season
# 0     Switzerland           3  4883  Summer
# 1  United Kingdom          59  3654  Winter
# 2          Russia          14  3451  Summer
# 3         Finland          55  5472  Summer
# 4         Finland          85  8517  Summer
```
```python
# Select the numeric columns before aggregating; recent pandas versions raise
# an error when mean()/median() encounter the string 'Season' column.
print(df.groupby('Country')[['Population', 'GDP']].mean())    # Mean population and GDP per country, across all seasons
print(df.groupby('Country')[['Population', 'GDP']].median())  # Median
print(df.groupby('Country')[['Population', 'GDP']].min())     # Minimum
print(df.groupby('Country')[['Population', 'GDP']].max())     # Maximum
print(df.groupby('Season').count())               # Count per season
print(df.groupby(['Season', 'Country']).count())  # Grouping by multiple labels at once is also supported
# Output (example):
#                 Population          GDP
# Country
# Finland          50.799007  5500.885856
# Russia           50.808824  5407.105392
# Switzerland      49.598958  5706.822917
# United Kingdom   49.260759  5486.015190
# United States    51.509756  5458.958537
# ... etc.
```
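groupby() also combines with agg(), so different columns can receive different statistics in one pass. The sketch below uses named aggregation; the result column names (pop_mean, gdp_total) are just illustrative labels:

```python
import pandas as pd
import numpy as np

# Small reproducible dataset in the same spirit as above
rng = np.random.default_rng(0)
df = pd.DataFrame({'Country': rng.choice(['Finland', 'Russia'], 100),
                   'Population': rng.integers(1, 100, 100),
                   'GDP': rng.integers(1000, 10000, 100)})

# Named aggregation: result column = (source column, statistic)
summary = df.groupby('Country').agg(pop_mean=('Population', 'mean'),
                                    gdp_total=('GDP', 'sum'))
print(summary)
```

This avoids the awkward multi-level column headers that arise when passing a plain list of functions to agg().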
Postscript
The above covers the basic calculation and statistics operations in Pandas. Many of these functionalities are essential in everyday data processing, and with row and column operations, metrics such as Root Mean Square Error (RMSE) and Mean Absolute Error (MAE) also become straightforward to compute.
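For example, given observed and forecast columns, RMSE and MAE each reduce to a single expression (a minimal sketch with made-up numbers):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'observed': [22.0, 23.0, 24.0, 25.0],
                   'forecast': [21.5, 23.4, 24.8, 24.1]})

err = df['forecast'] - df['observed']
rmse = np.sqrt((err ** 2).mean())  # root of the mean squared error
mae = err.abs().mean()             # mean of the absolute errors
print(f'RMSE = {rmse:.3f}, MAE = {mae:.3f}')
```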
Moreover, mastering these functions frees us from tedious cell-dragging in Excel. A one-off calculation there is manageable, but a mistake made in a rush, or data that needs re-running, means repeating the manual labor; with code, we simply run the script again.