NumPy 03 | The One Thousand and One Ways to Reconstruct Earth Science Data (½)

david 03/12/2025

Introduction

When we need to compute values at specific positions or regions within an array, how should we proceed? Translated to earth science: when we have global gridded data, we generally need to extract the portion corresponding to our study area.

Another issue: the arrays we created previously had their dimensions and sizes specified at creation. However, in reality, many arrays might be stored separately and need to be aggregated to achieve the functionality we desire. How should this be handled?

An example:

Our monthly temperature data are stored in separate arrays. To calculate the annual average temperature, we need to aggregate them into one array and compute the mean.

You might say, “It’s only 12 months, I can read them separately, add from item 1 to 12, then divide by 12.”

That’s possible. But what if there are 1,200 months? How would you handle that?

Then you say, “Then I’d use a loop.”

Sorry, we haven’t covered loops yet; I’ll assume we don’t know how.

Besides, even if a loop can simply compute the mean, we often have more complex parameters to calculate.

For example, finding the time trend for each grid cell, which is the slope k in a linear function. For a bunch of scattered arrays, I don’t think using a loop can easily solve this problem.

Therefore, array restructuring is particularly important. Of course, this isn’t the only simple example; they have broader applications, which can be explored gradually later.

Personally, I feel that extracting parts of arrays containing earth science data is also a change in data structure, so “indexing” doesn’t get a place in the title here.

And the title looks odd, ending with ½. That’s because while writing, I found I had written a messy bunch, and the first section’s length had already seriously exceeded expectations (underestimating my own ability to ramble, I guess).

So, let’s split it into two parts!

Indexing and Slicing

Indexing

We previously mentioned the concept of indexing: indexing refers to accessing an element at a specified position; list indexing is similar to array indexing.

Moreover, since Python follows C language’s syntax conventions, indexing starts from 0. This differs from more mathematics-oriented languages like MATLAB, R, and Julia, and requires attention. When an array has multiple dimensions, multiple subscripts are needed to access a specific element.

It’s worth mentioning that array indexing includes forward and reverse order. When indexing from start to finish, counting from 0 determines the element’s subscript. If it’s more convenient to determine an element’s position from the end, we can count from the last element backwards as -1, -2, -3, and so on.

Let’s demonstrate with examples:

python

import numpy as np

arr = np.array([1, 2, 3, 4, 5])

print(arr[0])   # First element
print(arr[4])   # Fifth element
print(arr[-1])  # Last element
print(arr[-3])  # Third element from the end

# Output:
# 1
# 5
# 5
# 3

When an array has multiple dimensions, we need multiple subscripts to access a specific element.

For example, for a 2D array, we can index using arr[i][j], where i is the row index and j is the column index.

python

arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

print(arr[0])        # First row

# The following two notations are ultimately equivalent
print(arr[0][0])     # Element at row 0, column 0
print(arr[0, 0])     # Same as above
print(arr[-1, -3])   # Last row, third element from the end of that row

# Output:
# [1 2 3]
# 1
# 1
# 7

Slicing

Clearly, indexing has a limitation: we can only get one value per position. Is there a more advanced way to cut out a large chunk of area to form a new array at once? Of course, that’s the higher-level indexing method – slicing.

Regular Slicing

Slicing modifies the index position to use a colon-specified range of elements.

The syntax is array[start:stop:step], meaning indexing from position start to stop (exclusive) with step size stepstep defaults to 1 and can be omitted.

Note: Slicing uses a left-closed, right-open interval; the element at the last index (stop) is not included.

python

arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

print(arr[1:5])          # Forward slice: elements at indices 1-4
print(arr[-5:-1])        # Reverse slice: elements at indices -5 to -2
print(arr[0:10:2])       # Forward slice with step 2

# Output:
# [2 3 4 5]
# [6 7 8 9]
# [1 3 5 7 9]

arr = np.random.rand(180, 360)      # Define 1°x1° gridded data
print(arr[30:61, 60:101])           # Extract grid data for lat range 30°N-60°N, lon range 60°E-100°E (since stop=101 is exclusive, 100°E is included)

# Output (example):
# [[0.62427043 0.82133876 0.50815278 ... 0.43005637 0.84763936 0.99592113]
#  [0.01961363 0.7792136  0.83044477 ... 0.08328295 0.24768328 0.02536089]
#  ...
#  [0.66423806 0.94671443 0.24224496 ... 0.46171495 0.13174959 0.21355304]]

Now a question arises: if we want to include the last element when slicing, filling stop with -1 is obviously wrong. 0? Even worse. Would we have to use the array’s size attribute and add 1 every time to ensure we get the last element? Of course not. If we leave start and stop empty when slicing, it defaults to including elements from both ends.

python

print(arr[::2])         # Forward slice with step 2
print(arr[::-1])        # Reverse the array
print(arr[1:8:3])       # Forward slice indices 1-7 with step 3

# Output:
# [1 3 5 7 9]
# [10  9  8  7  6  5  4  3  2  1]
# [2 5 8]

Of course, if you have an odd number of elements and slice with array[::2], the last element won’t appear. But it’s not that it wasn’t selected; the step size of 2 skipped over it.

python

arr = np.array([1, 2, 3, 4, 5, 6, 7, 8])
print(arr[::2])

arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
print(arr[::2])

# Output:
# [1 3 5 7]
# [1 3 5 7 9]

Boolean Slicing

You might say that needing to know fixed positions for slicing is too rigid. Is there a more advanced way, like filtering precipitation for grid cells where temperature is greater than 30°C?

Thus, Boolean indexing based on logical operations appears. When we evaluate an array against a condition, the array becomes a mask array with values True and False.

We can then use this mask for more complex extractions.

python

temperature = np.random.rand(180, 360) * 100             # Define 1°x1° temperature grid data
another_index = np.random.rand(180, 360)                # Define 1°x1° index grid data

temperature_mask = temperature > 30                     # Check if temperature is above 30°C
print(temperature_mask)                                 # The mask array

print(another_index[temperature_mask])                  # Index another array of the same shape using the mask array

# Output (mask example, index values):
# [[False  True  True ...  True False  True]
#  [ True  True  True ... False  True False]
#  ...
#  [ True False  True ...  True  True  True]]
# [0.23491348 0.56643524 0.1230861  ... 0.97229519 0.05542499 0.35975104]

Special Slicing

Finally, let’s mention a special type of indexing for specifying irregular positions, to handle some possible special scenarios.

python

arr = np.array([[1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                [11, 12, 13, 14, 15, 16, 17, 18, 19, 20],
                [21, 22, 23, 24, 25, 26, 27, 28, 29, 30]])

print(arr)
print(arr[[0, 1, 1], [0, 2, 8]])   # Fancy indexing: elements at (0,0), (1,2), (1,8)

# Output:
# [[ 1  2  3  4  5  6  7  8  9 10]
#  [11 12 13 14 15 16 17 18 19 20]
#  [21 22 23 24 25 26 27 28 29 30]]
# [ 1 13 19]   # Corresponding to arr[0,0], arr[1,2], arr[1,8]

Array Assignment

We’ve talked a lot about using indexing and slicing to get values at specified positions. In fact, they also provide endless possibilities for modifying array element values. Let’s give some practical examples:

Without indexing/slicing: We can only operate on the entire array, like converting Celsius to Kelvin.

python

temperature_celsius = np.random.rand(180, 360) * 100
temperature_kelvin = temperature_celsius + 273.15

print(temperature_celsius)
print(temperature_kelvin)

# Outputs (example values)

Using indexing: Knowing a specific point’s temperature is anomalously high, we can index it and subtract the anomaly.

python

anomaly = 1                    # Define anomaly value
print(temperature_celsius[12, 12])

temperature_celsius[12, 12] = temperature_celsius[12, 12] - anomaly
print(temperature_celsius[12, 12])

# Output (example):
# 25.76296894744049
# 24.76296894744049

Using slicing: Knowing that reanalysis data for a certain region is generally high, we can slice it out and uniformly subtract the anomaly.

python

anomaly = 10
print(temperature_celsius[20:30, 50:60])

temperature_celsius[20:30, 50:60] = temperature_celsius[20:30, 50:60] - anomaly
print(temperature_celsius[20:30, 50:60])

# Output (example values before and after subtraction)

More complex applications: Suppose an index exceeds a critical value when temperature is >50°C. We mask them out and apply boundary conditions.

python

another_index = np.random.rand(180, 360) * 10                           # Define a 1°x1° index grid data
print(another_index[temperature_celsius >= 50])

boundary_condition = 0.5
another_index[temperature_celsius >= 50] = boundary_condition           # Set values where temp >= 50°C to the boundary condition
print(another_index[temperature_celsius >= 50])

# Output (example):
# [3.31433926 7.22395752 6.10150125 ... 0.74224767 0.4194778  8.52261236]
# [0.5 0.5 0.5 ... 0.5 0.5 0.5]

Thus, the nan_to_num function for handling missing values mentioned last issue becomes somewhat unnecessary; we can implement it ourselves:

python

arr = np.array([1, 2, np.nan, 4, 5])

arr[np.isnan(arr)] = -1000
print(arr)

# Output:
# [    1.     2. -1000.     4.     5.]

Supplementary

In the array computation section last time, I think there were a few more practical functions not mentioned. Let’s supplement them here.

Matrix multiplication .dot:

python

arr0 = np.array([[1, 2], [3, 4]])
arr1 = np.array([[5, 6], [7, 8]])

print(arr0.dot(arr1))

# Output:
# [[19 22]
#  [43 50]]

In MATLAB, arrays default to matrix operations; element-wise operations require a dot to differentiate (e.g., .*./). In NumPy, the more commonly used element-wise operations are the default, while matrix operations are encapsulated in functions.

Degree-radian conversion deg2rad and rad2deg:

python

arr = np.array([0, 30, 45, 60, 90])         # Define angle array in degrees

print(np.sin(arr * np.pi / 180))            # Convert degrees to radians manually, then compute sine
print(np.deg2rad(arr))                      # Built-in function conversion
print(np.sin(np.deg2rad(arr)))
print(np.sin(np.deg2rad(np.rad2deg(np.deg2rad(arr)))))          # Russian nesting doll

# Output:
# [0.         0.5        0.70710678 0.8660254  1.        ]
# [0.         0.52359878 0.78539816 1.04719755 1.57079633]
# [0.         0.5        0.70710678 0.8660254  1.        ]
# [0.         0.5        0.70710678 0.8660254  1.        ]

Here, 2 is a homophone for “to” in English, consistent with the naming style of data type conversion functions in MATLAB. Those who know should understand.

Of course, we could calculate degree-radian conversion based on their formula definitions. But to avoid those moments when you just can’t remember what the formula looks like, let’s use the “syntactic sugar” NumPy has packaged for us.