Introduction
All of our earth science data processing is, in essence, a game of numbers. The way we use NumPy therefore maps naturally onto the geophysical methods we employ.
Alright, enough with the cryptic talk. Without concrete examples it is hard to see where this correspondence lies, so this week we will continue with the “application example” approach from the last section of the previous article to understand what NumPy means for earth science applications.
This article’s core will revolve around array creation, operations, and indexing – the most frequent and foundational elements in our earth science computations. Whether it’s simply calculating a single geoscience metric or building large-scale numerical simulation frameworks, implementing them in Python will inevitably involve these concepts.
Enough of these abstract statements. Without further ado, let’s begin. Selling snacks up front: melon seeds, peanuts, cola. Let’s eat and chat.
Creating Arrays
In the previous article, we introduced the basic characteristics of arrays and how they differ from, and connect to, Python lists. However, we haven’t formally discussed how to create them, so let’s start there.
Multidimensional Arrays
As mentioned before, the formal term for arrays in NumPy is “multidimensional arrays,” with the type name ndarray. This means they can be of any dimension from 0 to N (the universe itself could be considered an array if your memory could hold it).
Let’s give a practical example:
When receiving a package, we need a series of address information to determine its destination. The recipient’s name can be considered a 0-dimensional array.
Under the constraints of a specific address, the resident at that location can only be you; there isn’t another user with the same address and name (if such a person exists, that would be terrifying).
Now let’s increase the dimensionality: suppose your residence is on the third floor of an apartment building, with a door number of the form (3, N), where N is your room number. When all the rooms on one floor are taken as a whole, a 1D array is created. You need N to identify your room among the identically designed rooms on that floor, and the resident name corresponding to (3, N) is you. This is the meaning of indexing, which we will discuss later (and mentioned earlier).
Let’s generalize further: a building has many floors, so the 3 in (3, N) becomes variable. We need to determine which floor’s room N you are in to avoid delivering to your upstairs or downstairs neighbor. The indexing form becomes (F, N), where F is the floor number, and the array becomes 2D.
By analogy, when we need to pinpoint the unique you among 8 billion people in over 200 countries worldwide, an array with many dimensions is created.
Following our common shipping address format, we can roughly write its dimensions: (Province, City, District/County, Township/Town, Street, Community, Building, Unit, Floor, Room Number).
Thus, under China’s current postal system, locating a recipient with a relatively complete and clear address requires a 10-dimensional array.
Of course, this is just an ideal concept. Not all houses and streets look exactly the same. The key is understanding the multidimensional concept without being confined to spatial coordinate systems.
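To make the analogy concrete, here is a minimal sketch (the building, floor, and room numbers are purely hypothetical) of how an array’s ndim and shape express the “address” idea:

```python
import numpy as np

name = np.array("you")                      # 0-D array: just the recipient, no address needed
floor3 = np.array([301, 302, 303, 304])     # 1-D array: the rooms on one floor
building = np.array([[101, 102, 103, 104],  # 2-D array: (floor, room) for the whole building
                     [201, 202, 203, 204],
                     [301, 302, 303, 304]])

print(name.ndim, floor3.ndim, building.ndim)  # 0 1 2
print(building.shape)                         # (3, 4): 3 floors, 4 rooms per floor
print(building[2, 1])                         # 302 -- floor index 2, room index 1
```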
Now that we understand the basic meaning of multidimensional arrays, let’s see how NumPy represents them:
```python
import numpy as np

# Create an array directly from a custom list
arr = np.array([[1, 2, 3, 4, 5],
                [6, 7, 8, 9, 10],
                [11, 12, 13, 14, 15]])
print(arr)
# Output:
# [[ 1  2  3  4  5]
#  [ 6  7  8  9 10]
#  [11 12 13 14 15]]

# Create arrays of specified shapes using NumPy built-in functions
arr_rand = np.random.rand(5, 5)   # 5x5 array of random numbers between 0 and 1
arr_ones = np.ones((5, 5))        # 5x5 array filled with 1s
arr_zeros = np.zeros((5, 5))      # 5x5 array filled with 0s
arr_full = np.full((5, 5), 10)    # 5x5 array with all elements set to 10
arr_eye = np.eye(5)               # 5x5 identity array (1s on the diagonal, 0s elsewhere)
print(arr_rand)
print(arr_ones)
print(arr_zeros)
print(arr_full)
print(arr_eye)
# Example outputs (abbreviated):
# [[0.35943306 0.52195247 0.81331466 0.83091224 0.97672133] ...]
# [[1. 1. 1. 1. 1.] ...]
# [[0. 0. 0. 0. 0.] ...]
# [[10 10 10 10 10] ...]
# [[1. 0. 0. 0. 0.]
#  [0. 1. 0. 0. 0.]
#  [0. 0. 1. 0. 0.]
#  [0. 0. 0. 1. 0.]
#  [0. 0. 0. 0. 1.]]

# Create arrays matching the shape of an existing array
arr_ones_like = np.ones_like(arr_rand)      # array of 1s with the same shape as arr_rand
arr_zeros_like = np.zeros_like(arr_rand)    # array of 0s with the same shape as arr_rand
arr_full_like = np.full_like(arr_rand, 10)  # array of 10s with the same shape as arr_rand
print(arr_ones_like)
print(arr_zeros_like)
print(arr_full_like)
# Outputs (abbreviated):
# [[1. 1. 1. 1. 1.] ...]
# [[0. 0. 0. 0. 0.] ...]
# [[10. 10. 10. 10. 10.] ...]
```
1D Sequences
In terms of inclusion, this section should be a subset of multidimensional arrays. The 1D sequences created by NumPy are simply 1D arrays, which we call vectors. However, their widespread use in practical scenarios necessitates giving them their own section.
Let’s see how they work and some possible application examples:
```python
arr0 = np.arange(0, 100, 0.5)          # Values from 0 to 100 (exclusive) with step 0.5
arr1 = np.linspace(0, 10, 20)          # 20 evenly spaced numbers between 0 and 10 (inclusive)
arr_demo0 = np.arange(4000, 8850, 50)  # Elevation bands from 4000 m to 8800 m at 50 m intervals
arr_demo1 = np.linspace(200, 1200, 6)  # Regional precipitation of 200-1200 mm divided into 6 equal levels
print(arr0)
print(arr1)
print(arr_demo0)
print(arr_demo1)
# Outputs (abbreviated):
# [ 0.   0.5  1.   1.5  2.   2.5 ... 99.5]
# [ 0.          0.52631579  1.05263158 ... 10.        ]
# [4000 4050 4100 4150 ... 8800]
# [ 200.  400.  600.  800. 1000. 1200.]
```
Note: In the above examples, arange is left-closed, right-open, so we set the maximum altitude to the max value plus the step size. linspace, however, includes both endpoints, so the original range doesn’t need adjustment.
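As a quick sanity check on those endpoint rules, here is a minimal sketch (the numbers are arbitrary):

```python
import numpy as np

print(np.arange(0, 10, 2))                    # [0 2 4 6 8]               -- the stop value 10 is excluded
print(np.linspace(0, 10, 6))                  # [ 0.  2.  4.  6.  8. 10.] -- both endpoints included
print(np.linspace(0, 10, 5, endpoint=False))  # [0. 2. 4. 6. 8.]          -- endpoint=False mimics arange
```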
When learning related functions, pay attention to differences in default parameters and return values to avoid confusion. No teaching material can cover all details, so reading the official documentation is essential.
Array Computation
The term “array” literally means a combination of numbers. As numbers, they are naturally used to quantify and compute our objective world. So, let’s introduce array computation in NumPy.
Data Types
To balance computational precision against memory use, NumPy defines a variety of data types. Broadly, these include:
- Boolean (bool, actually built into Python): represents the logical values True or False.
- Integer (int): represents integers like -1, 0, 3. NumPy supports integer types of various bit widths, e.g., int8, int16, int32, int64.
- Unsigned integer (uint): represents non-negative integers, likewise in various bit widths, e.g., uint8, uint16, uint32, uint64.
- Floating-point (float): represents floating-point numbers like 1.2 and 3.14, in various precisions, e.g., float32, float64.
- Complex (complex): represents complex numbers with real and imaginary parts.
NumPy doesn’t give its data types contrasting English names the way C does (e.g., long, short). Instead, integer types are distinguished as intN, where N is the bit width, and floating-point types follow the same pattern: there is no double-precision type called “double”, just floatN of different lengths. Simply put, the bit width determines how much precision a value retains, and therefore how much precision is lost.
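For instance, the bit width can be chosen explicitly via the dtype argument when an array is created; a minimal sketch (the values are arbitrary):

```python
import numpy as np

a = np.array([1, 2, 3], dtype=np.int16)      # 16-bit integers
b = np.array([1.2, 3.14], dtype=np.float32)  # single-precision floats
print(a.dtype, b.dtype)                      # int16 float32
print(np.float16(np.pi), np.float64(np.pi))  # float16 keeps far fewer significant digits than float64
```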
If overall precision isn’t affected but memory usage explodes due to huge data volume, consider compressing data type precision.
You can observe memory usage for different data types after creating a very large array.
```python
arr0 = np.random.rand(100, 7, 1000, 1000)  # Create a very large array (700 million float64 values, ~5.6 GB)
print(arr0.nbytes)                         # Memory usage in bytes

arr0 = arr0.astype(np.float16)             # Compress the data type precision
print(arr0.nbytes)                         # Memory usage after compression (a quarter of the original)
```
Since our main use of NumPy is for computation, we won’t discuss stuffing arrays with other strange data types.
Mathematical Operations and Functions
These operations, performed on entire arrays simultaneously, are typically vectorized, saving significant time (demonstrated in the previous article).
Here are some commonly used computations. As they are broadly applicable, specific earth science scenario examples aren’t enumerated here.
```python
arr = np.random.rand(5, 5) * 100  # Create a 5x5 random array and scale it to 0-100
print(arr)
print(arr + 1)     # Addition
print(arr - 1)     # Subtraction
print(arr * 2)     # Multiplication
print(arr / 2)     # Division
print(arr ** 2)    # Power
print(arr ** 0.5)  # Square root
print(arr // 2)    # Integer (floor) division
print(arr % 2)     # Modulo (remainder)
# Example outputs (abbreviated):
# [[85.32100518 67.67212275 86.05631003 ...] ...]
# [[86.32100518 68.67212275 87.05631003 ...] ...]
# ... (other operations)

arr = np.random.rand(5, 5)  # Create a random array
print(arr)
print(np.ceil(arr))   # Ceiling (round up)
print(np.floor(arr))  # Floor (round down)
print(np.round(arr))  # Round to the nearest integer
# Example outputs (abbreviated):
# [[0.67410798 0.28093307 ...] ...]
# [[1. 1. 1. ...] ...]
# [[0. 0. 0. ...] ...]
# [[1. 0. 0. ...] ...]
```
Statistics
Statistical metrics are indispensable for describing results and characterizing data magnitude. NumPy provides rich statistical functions. Here are some common ones:
```python
arr = np.random.rand(5, 5)  # Create a random array
print(arr)
print(np.mean(arr))            # Mean
print(np.median(arr))          # Median
print(np.std(arr))             # Standard deviation
print(np.var(arr))             # Variance
print(np.min(arr))             # Minimum
print(np.max(arr))             # Maximum
print(np.percentile(arr, 25))  # 25th percentile
print(np.sum(arr))             # Sum
print(np.cumsum(arr))          # Cumulative sum
print(np.diff(arr))            # Discrete difference (along the last axis)
print(np.sort(arr))            # Sort (along the last axis)
print(np.unique(arr))          # Unique values
# Example outputs (abbreviated):
# [[0.76109786 0.25870338 ...] ...]
# 0.500909350459388
# 0.5001225770420152
# 0.2964726511532032
# 0.08789603288180892
# 0.02445998151118267
# 0.9947154489951692
# 0.2587033828003008
# 12.5227337614847
# [ 0.76109786  1.01980124  1.67051796 ... 12.52273376]  (cumulative sum)
# [[-0.50239448  0.39201333 ...] ...]                    (differences)
# [[0.03378749 0.20757916 ...] ...]                      (sorted)
# [0.02445998 0.03378749 0.06872452 ... 0.99471545]      (unique)
```
The statistical methods exemplified here apply to the entire array. To perform statistics along a specific dimension, the axis parameter must be specified.
We listed some related earth science use cases at the end of the previous article, such as latitude/longitude circle averages for temperature. Feel free to revisit that section; we won’t repeat them here.
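As a rough sketch of the axis idea (the array below is a made-up latitude × longitude temperature grid, not real data):

```python
import numpy as np

# Made-up grid: 4 latitude rows x 6 longitude columns of temperatures
temps = np.random.rand(4, 6) * 30

print(np.mean(temps))                # one mean over the whole grid
print(np.mean(temps, axis=1))        # average along each row -> a zonal (latitude-circle) mean per latitude
print(np.mean(temps, axis=0))        # average along each column -> a meridional mean per longitude
print(np.mean(temps, axis=1).shape)  # (4,) -- one value per latitude row
```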
Constants
There are a few persistent and seemingly unreasonable numbers in mathematics that always surround us. In recognition of their special status, NumPy gives them dedicated symbols: the natural constant e (np.e), pi π (np.pi), infinity ∞ (np.inf), and Not a Number (np.nan).
About NaN
nan needs special explanation. This value is not equal to 0, nor to infinity; it’s not equal to any number at all. It exists outside the three realms. Does that make it powerful?
Not necessarily. It has no magnitude relationship with any number. Its meaning simply indicates a missing value in an array. Calculating an array with missing values will only give you a headache.
An earth science example: missing data at a certain station for a specific time. Both time and station exist, but the data is empty.
We can use functions such as nan_to_num, isnan, and nanmean to handle missing values and avoid these anomalies.
```python
print(np.e)    # Natural constant e
print(np.pi)   # Pi
print(np.inf)  # Infinity
print(np.nan)  # Not a Number
# Output:
# 2.718281828459045
# 3.141592653589793
# inf
# nan

arr0 = np.array([1, 2, np.nan, 4, 5])  # Create an array containing NaN
print(arr0)
print(arr0.mean())  # Mean (affected by NaN)
print(arr0.max())   # Max (affected by NaN)
print(arr0.min())   # Min (affected by NaN)

arr1 = np.nan_to_num(arr0, nan=-9999)  # Replace NaN with -9999
print(arr1)
arr2 = np.isnan(arr0)                  # Boolean mask marking which values are NaN
print(arr2)
arr3 = np.nanmean(arr0)                # Mean ignoring NaN
print(arr3)
# Output:
# [ 1.  2. nan  4.  5.]
# nan
# nan
# nan
# [ 1.000e+00  2.000e+00 -9.999e+03  4.000e+00  5.000e+00]
# [False False  True False False]
# 3.0
```