NumPy 02 | What Are We Actually Calculating in Earth Science Computations?

david 02/12/2025

Introduction

All our earth science data processing is, in essence, a game of numbers. Therefore, the specific usage methods of NumPy are inherently identical to the geophysical methods we employ.

Alright, enough with the cryptic talk. Perhaps without concrete examples, it’s hard to grasp where this identity lies. So, this week we will continue using the “application example” approach from the last section of the previous article to understand the significance of NumPy for earth science applications.

This article’s core will revolve around array creation, operations, and indexing – the most frequent and foundational elements in our earth science computations. Whether it’s simply calculating a single geoscience metric or building large-scale numerical simulation frameworks, implementing them in Python will inevitably involve these concepts.

Enough of these abstract statements. Without further ado, let’s begin. Selling snacks up front: melon seeds, peanuts, cola. Let’s eat and chat.

Creating Arrays

In the previous article, we used arrays to introduce their basic characteristics and the differences and connections between them and Python Lists. However, we haven’t formally discussed how to create them. So, let’s start there.

Multidimensional Arrays

As mentioned before, the formal term for arrays in NumPy is “multidimensional arrays,” with the type name ndarray. This means they can be of any dimension from 0 to N (the universe itself could be considered an array if your memory could hold it).

Let’s give a practical example:

When receiving a package, we need a series of address information to determine its destination. The recipient’s name can be considered a 0-dimensional array.

Under the constraints of a specific address, the resident at that location can only be you; there isn’t another user with the same address and name (if such a person exists, that would be terrifying).

Now let’s increase the dimensionality: Suppose your residence is on the third floor of an apartment building, with a door number in the form (3, N), where N is your room number. When all rooms on one floor form a whole, a 1D array is created.

You need N to identify your room among the identically designed rooms on that floor, and the resident name corresponding to (3, N) is you. This is the meaning of indexing we will discuss later (and mentioned earlier).

Let’s generalize further: A building has many floors. Now the 3 in (3, N) becomes variable. We need to determine which floor’s room N you are in to avoid delivering to your upstairs or downstairs neighbor. So the indexing form becomes (F, N), where F is the floor number, and the array becomes 2D.

By analogy, when we need to pinpoint the unique you among 8 billion people in over 200 countries worldwide, an array with many dimensions is created.

Following our common shipping address format, we can roughly write its dimensions: (Province, City, District/County, Township/Town, Street, Community, Building, Unit, Floor, Room Number).

Thus, under China’s current postal system, locating a recipient with a relatively complete and clear address requires a 10-dimensional array.

Of course, this is just an ideal concept. Not all houses and streets look exactly the same. The key is understanding the multidimensional concept without being confined to spatial coordinate systems.

Understanding the basic meaning of multidimensional arrays, let’s see how NumPy represents arrays:

python

import numpy as np

# Create an array directly from a custom list
arr = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [11, 12, 13, 14, 15]])
print(arr)

# Output:
# [[ 1  2  3  4  5]
#  [ 6  7  8  9 10]
#  [11 12 13 14 15]]

# Create arrays of specified shapes using NumPy built-in functions
arr_rand = np.random.rand(5, 5)         # Create a 5x5 array with random numbers between 0-1
arr_ones = np.ones((5, 5))              # Create a 5x5 array filled with 1s
arr_zeros = np.zeros((5, 5))            # Create a 5x5 array filled with 0s
arr_full = np.full((5, 5), 10)          # Create a 5x5 array with all elements set to 10
arr_eye = np.eye(5)                     # Create a 5x5 identity array (1s on diagonal, 0s elsewhere)

print(arr_rand)
print(arr_ones)
print(arr_zeros)
print(arr_full)
print(arr_eye)

# Example outputs (abbreviated):
# [[0.35943306 0.52195247 0.81331466 0.83091224 0.97672133] ...]
# [[1. 1. 1. 1. 1.] ...]
# [[0. 0. 0. 0. 0.] ...]
# [[10 10 10 10 10] ...]
# [[1. 0. 0. 0. 0.]
#  [0. 1. 0. 0. 0.]
#  [0. 0. 1. 0. 0.]
#  [0. 0. 0. 1. 0.]
#  [0. 0. 0. 0. 1.]]

arr_ones_like = np.ones_like(arr_rand)              # Create an array of ones with the same shape as arr_rand
arr_zeros_like = np.zeros_like(arr_rand)            # Create an array of zeros with the same shape as arr_rand
arr_full_like = np.full_like(arr_rand, 10)          # Create an array with all elements set to 10, same shape as arr_rand

print(arr_ones_like)
print(arr_zeros_like)
print(arr_full_like)

# Outputs (abbreviated):
# [[1. 1. 1. 1. 1.] ...]
# [[0. 0. 0. 0. 0.] ...]
# [[10. 10. 10. 10. 10.] ...]

1D Sequences

In terms of inclusion, this section should be a subset of multidimensional arrays. The 1D sequences created by NumPy are simply 1D arrays, which we call vectors. However, their widespread use in practical scenarios necessitates giving them their own section.

Let’s see how they work and some possible application examples:

python

arr0 = np.arange(0, 100, 0.5)           # Create an array from 0 to 100 (exclusive) with step 0.5
arr1 = np.linspace(0, 10, 20)           # Create an array of 20 evenly spaced numbers between 0 and 10 (inclusive)

arr_demo0 = np.arange(4000, 8850, 50)       # Generate elevation bands from 4000m to 8800m with 50m intervals
arr_demo1 = np.linspace(200, 1200, 6)       # Assume regional precipitation ranges from 200-1200 mm, divided into 6 equal precipitation levels

print(arr0)
print(arr1)
print(arr_demo0)
print(arr_demo1)

# Outputs (abbreviated):
# [ 0.   0.5  1.   1.5  2.   2.5 ... 99.5]
# [ 0.          0.52631579  1.05263158 ... 10.        ]
# [4000 4050 4100 4150 ... 8800]
# [ 200.  400.  600.  800. 1000. 1200.]

Note: In the above examples, arange is left-closed, right-open, so we set the maximum altitude to the max value plus the step size. linspace, however, includes both endpoints, so the original range doesn’t need adjustment.

When learning related functions, pay attention to differences in default parameters and return values to avoid confusion. No teaching material can cover all details, so reading the official documentation is essential.

Array Computation

The term “array” literally means a combination of numbers. As numbers, they are naturally used to quantify and compute our objective world. So, let’s introduce array computation in NumPy.

Data Types

To achieve higher computational precision, NumPy defines different data types. Broadly including:

  • Boolean (bool, actually built into Python): Represents logical values, True or False.
  • Integer (int): Represents integers like -1, 0, 3, etc. NumPy supports various integer types with different bit sizes, e.g., int8int16int32int64.
  • Unsigned Integer (uint): Represents non-negative integers. NumPy also supports various unsigned integer types, e.g., uint8uint16uint32uint64.
  • Floating-point (float): Represents floating-point numbers like 1.2, 3.14. NumPy supports various floating-point types with different precisions, e.g., float32float64.
  • Complex (complex): Represents complex numbers with real and imaginary parts.

NumPy doesn’t define data type names with stark English differences like C (e.g., long, short). Instead, different integer data are distinguished as intN (where N is the bit size). Floating-point types are similar. Python doesn’t have a double-precision floating-point type called “Double”; it uses floatN of different lengths.

Simply put, the difference in data type lengths is the number of decimal places they retain, leading to different levels of precision loss.

If overall precision isn’t affected but memory usage explodes due to huge data volume, consider compressing data type precision.

You can observe memory usage for different data types after creating a very large array.

python

arr0 = np.random.rand(100, 7, 1000, 1000)     # Create a very large array
print(arr0.nbytes)                            # Check memory usage
arr0 = arr0.astype(np.float16)                # Compress data type precision
print(arr0.nbytes)                            # Check memory usage after compression

Since our main use of NumPy is for computation, we won’t discuss stuffing arrays with other strange data types.

Mathematical Operations and Functions

These operations, performed on entire arrays simultaneously, are typically vectorized, saving significant time (demonstrated in the previous article).

Here are some commonly used computations. As they are broadly applicable, specific earth science scenario examples aren’t enumerated here.

python

arr = np.random.rand(5, 5) * 100        # Create a 5x5 random array and multiply by 100

print(arr)
print(arr + 1)                    # Addition
print(arr - 1)                    # Subtraction
print(arr * 2)                    # Multiplication
print(arr / 2)                    # Division
print(arr ** 2)                   # Power
print(arr ** 0.5)                 # Square root
print(arr // 2)                   # Integer division (floor division)
print(arr % 2)                    # Modulo (remainder)

# Example outputs (abbreviated):
# [[85.32100518 67.67212275 86.05631003 ...] ...]
# [[86.32100518 68.67212275 87.05631003 ...] ...]
# ... (other operations)

arr = np.random.rand(5, 5)        # Create a random array

print(arr)
print(np.ceil(arr))               # Ceiling (round up)
print(np.floor(arr))              # Floor (round down)
print(np.round(arr))              # Round

# Example outputs (abbreviated):
# [[0.67410798 0.28093307 ...] ...]
# [[1. 1. 1. ...] ...]
# [[0. 0. 0. ...] ...]
# [[1. 0. 0. ...] ...]

Statistics

Statistical metrics are indispensable for describing results and characterizing data magnitude. NumPy provides rich statistical functions. Here are some common ones:

python

arr = np.random.rand(5, 5)        # Create a random array

print(arr)
print(np.mean(arr))               # Mean
print(np.median(arr))             # Median
print(np.std(arr))                # Standard deviation
print(np.var(arr))                # Variance
print(np.min(arr))                # Minimum
print(np.max(arr))                # Maximum
print(np.percentile(arr, 25))     # Percentile (25th)
print(np.sum(arr))                # Sum
print(np.cumsum(arr))             # Cumulative sum
print(np.diff(arr))               # Difference (discrete difference)
print(np.sort(arr))               # Sort
print(np.unique(arr))             # Unique values

# Example outputs (abbreviated):
# [[0.76109786 0.25870338 ...] ...]
# 0.500909350459388
# 0.5001225770420152
# 0.2964726511532032
# 0.08789603288180892
# 0.02445998151118267
# 0.9947154489951692
# 0.2587033828003008
# 12.5227337614847
# [ 0.76109786  1.01980124  1.67051796 ... 12.52273376] (cumulative sum)
# [[-0.50239448  0.39201333 ...] ...] (differences)
# [[0.03378749 0.20757916 ...] ...] (sorted)
# [0.02445998 0.03378749 0.06872452 ... 0.99471545] (unique)

The statistical methods exemplified here apply to the entire array. To perform statistics along a specific dimension, the axis parameter must be specified.

We listed some related earth science use cases at the end of the previous article, such as latitude/longitude circle averages for temperature. Feel free to revisit that section; we won’t repeat them here.

Constants

There are a few persistent and seemingly unreasonable numbers in mathematics that always surround us. In recognition of their special status, NumPy gives them dedicated symbols: the natural constant e (np.e), pi π (np.pi), infinity  (np.inf), and Not a Number (np.nan).

About NaN

nan needs special explanation. This value is not equal to 0, nor to infinity. It’s not equal to any number; it exists outside the three realms.

Is it powerful then?

Not necessarily. It has no magnitude relationship with any number. Its meaning simply indicates a missing value in an array. Calculating an array with missing values will only give you a headache.

An earth science example: missing data at a certain station for a specific time. Both time and station exist, but the data is empty.

We can use functions to handle missing values to avoid anomalies, such as nan_to_numisnannanmean.

python

print(np.e)                         # Natural constant e
print(np.pi)                        # Pi π
print(np.inf)                       # Infinity ∞
print(np.nan)                       # Not a Number nan

# Output:
# 2.718281828459045
# 3.141592653589793
# inf
# nan

arr0 = np.array([1, 2, np.nan, 4, 5])    # Create an array containing NaN

print(arr0)
print(arr0.mean())                      # Mean (affected by NaN)
print(arr0.max())                       # Max (affected by NaN)
print(arr0.min())                       # Min (affected by NaN)

arr1 = np.nan_to_num(arr0, nan=-9999)   # Fill NaN with -9999
print(arr1)

arr2 = np.isnan(arr0)                   # Check if values are NaN
print(arr2)

arr3 = np.nanmean(arr0)                 # Calculate mean ignoring NaN
print(arr3)

# Output:
# [ 1.  2. nan  4.  5.]
# nan
# nan
# nan
# [ 1.000e+00  2.000e+00 -9.999e+03  4.000e+00  5.000e+00]
# [False False  True False False]
# 3.0