Introduction
Last time we discussed two ways of reconstructing earth science data. One is indexing (including slicing and advanced indexing), which extracts the parts we need from a complete sequence. The other is recombination: how to integrate multiple discrete data sources into a unified whole when we cannot read or produce a complete sequence at once.
Last week we covered the indexing part. This time, let’s explore how to use NumPy to recombine earth science data.
Broadcasting
A common issue when processing meteorological data is: how to handle varying month lengths? For example, monthly ERA5 precipitation data we download typically has units of mm/day (average daily precipitation). How do we convert it to total precipitation for the corresponding month, considering the number of days?
Experiments Begin
Experiment.1: The simplest method: I’ll just pretend every month has 30 days. Multiply everything by 30. Done!
Comment.1: Feasible, but precision seems lacking.
Experiment.2: Then I’ll use a loop. Loop through the 12 months one by one, multiply by the corresponding number of days.
Comment.2: Feasible, but unnecessary. Also, non-vectorized computation along the time axis during looping will slow things down. (And we haven’t covered loops yet.)
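Purely for reference (loops are covered in a later installment), a minimal sketch of what Experiment.2 describes might look like this; the array shape is assumed:

```python
import numpy as np

pr_daily_arr = np.random.rand(12, 180, 360)  # Assumed: 12 months of mean daily precipitation (mm/day)
days_ls = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]

pr_mon_arr = np.empty_like(pr_daily_arr)
for i in range(12):
    pr_mon_arr[i] = pr_daily_arr[i] * days_ls[i]  # Multiply each month's field by its number of days
```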
Experiment.3: Want vectorization? I’ll use np.full_like() to create an array with exactly the same shape as the precipitation data, assign each month its corresponding number of days, and multiply the two arrays.

```python
import numpy as np

pr_daily_arr = np.random.rand(12, 180, 360)   # Suppose this is 12 months of monthly average daily precipitation data
day_arr = np.full_like(pr_daily_arr, np.nan)  # Will hold the number of days for the corresponding months
days_ls = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]
for i in range(day_arr.shape[0]):
    day_arr[i] = days_ls[i]                   # Assign the corresponding number of days to each month
pr_mon_arr = pr_daily_arr * day_arr           # Monthly total precipitation
print(pr_mon_arr)
```

Comment.3: Hmm, good grasp of previous concepts. But this uses a loop, right? Also, looping for the assignment and then multiplying seems a bit redundant?
After all these proposals, is there a good solution?
Yes, absolutely: broadcasting.
Simply put, broadcasting is a mechanism. What kind of mechanism? When two arrays are combined element-wise and their shapes are compatible but not identical (say both have three axes, axis=(0, 1, 2), and they differ only in the length of axis=2, where one has length 5 and the other length 1), the axis of length 1 is stretched: its single value is reused across the full length of the other array’s axis, so the operation proceeds as if the shapes matched.
What on earth is that? Is that “simply put”? Is this something humans can understand? (flips table)
Don’t panic. Text descriptions really test human imagination. Let’s look at a case to understand.
```python
import numpy as np

pr_daily_arr = np.random.rand(12, 180, 360)  # Suppose this is 12 months of monthly average daily precipitation data
day_arr = np.array([31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31])  # Number of days in each month
pr_mon_arr = pr_daily_arr * day_arr[:, np.newaxis, np.newaxis]  # Monthly total precipitation
print(pr_daily_arr[0], '\n')  # Print the first month's daily precipitation; '\n' adds a blank line
print(pr_mon_arr[0], '\n')    # Print the first month's total precipitation
# Output (example values):
# [[0.7457109  0.1999839  0.86517314 ... 0.9433673  0.44404254 0.83987611] ... ]
# [[23.11703799  6.19950096 26.82036733 ... 29.24438641 13.76531887 26.03615934] ... ]
```
You can see that broadcasting is not limited to a single differing dimension: as long as every axis on which the shapes differ has length 1 in one of the arrays, broadcasting is triggered along each of those axes.
Here we used NumPy’s built-in newaxis. Placing it in an indexing expression inserts a new axis of length 1 at that position.
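A quick way to see what newaxis does is to print the resulting shapes. A minimal sketch (the values are arbitrary; newaxis is simply an alias for None):

```python
import numpy as np

a = np.arange(3)                           # shape (3,)
print(a[:, np.newaxis].shape)              # (3, 1) - a new length-1 axis inserted at position 1
print(a[np.newaxis, :].shape)              # (1, 3) - a new length-1 axis inserted in front
print(a[:, np.newaxis, np.newaxis].shape)  # (3, 1, 1) - two new length-1 axes
```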
Why are today’s concepts so hard to describe?
Let’s look at another case:
```python
# pr_mon_arr1 = pr_daily_arr * day_arr  # Trying without newaxis raises:
# ValueError: operands could not be broadcast together with shapes (12,180,360) (12,)

print(pr_daily_arr.shape, '\n')
print(day_arr.shape, '\n')
print(day_arr[:, np.newaxis, np.newaxis].shape, '\n')
# Output:
# (12, 180, 360)  <- shape of pr_daily_arr
# (12,)           <- shape of day_arr
# (12, 1, 1)      <- day_arr after adding two new axes
```
Obviously, broadcasting fails without newaxis. As described in the initial concept, the two arrays’ dimensions need to be compatible, with one array allowed to have length 1 on the axes where they differ. day_arr has only one dimension, while day_arr[:, np.newaxis, np.newaxis] has three dimensions, matching pr_daily_arr; the two newly added dimensions simply have length 1.
Of course, broadcasting can be taken a step further and triggered even without newaxis, but that can easily become confusing and tangled. It is still recommended to use the newaxis approach, which makes explicit which axes are being broadcast.
```python
pr_daily_arr = np.random.rand(180, 360, 12)  # Different axis order: (lat, lon, time)
day_arr = np.array([31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31])
pr_mon_arr = pr_daily_arr * day_arr  # Works even without newaxis; equivalent to day_arr[np.newaxis, np.newaxis, :]
print(pr_daily_arr[0], '\n')
print(pr_mon_arr[0], '\n')
print(day_arr.shape, '\n')       # (12,)
print(pr_daily_arr.shape, '\n')  # (180, 360, 12)
```
You can see that even without newaxis, if the axis lengths of the two arrays match when aligned from the innermost axis outward (i.e., from right to left), the missing outer axes are broadcast automatically. Here day_arr has length 12, and pr_daily_arr’s axes from inner to outer are 12 → 360 → 180, so broadcasting is triggered. (Review the axis concepts from the first installment.)
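If you want to check whether two shapes are broadcast-compatible without building the arrays, np.broadcast_shapes (available in NumPy 1.20 and later) applies exactly this right-to-left alignment rule:

```python
import numpy as np

print(np.broadcast_shapes((180, 360, 12), (12,)))       # (180, 360, 12) - compatible
print(np.broadcast_shapes((12, 180, 360), (12, 1, 1)))  # (12, 180, 360) - compatible after adding axes
# np.broadcast_shapes((12, 180, 360), (12,)) would raise a ValueError, as in the earlier example
```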
Stacking, Concatenation
Now back to the initial topic: how to integrate multiple discrete data sources into a whole? This primarily uses array stacking (or concatenation). It’s defined as the operation of joining multiple arrays along a specified axis, typically used to combine multiple arrays with the same shape or dimensions into a larger array.
Core functions include: vstack, hstack, dstack, concatenate, stack.
vstack, hstack, and dstack can only concatenate along fixed array axes: the vertical direction (axis=0, what we often call the y-axis, concatenating row-wise; v for vertical), the horizontal direction (axis=1, the x-axis, column-wise; h for horizontal), and the depth direction (axis=2, the z-axis, stacking page-wise; d for depth).
Here are examples:
```python
arr0 = np.array([[1, 2, 3], [4, 5, 6]])
arr1 = np.array([[7, 8, 9], [10, 11, 12]])
arr2 = np.vstack((arr0, arr1))  # Vertical stack (axis=0)
arr3 = np.hstack((arr0, arr1))  # Horizontal stack (axis=1)
arr4 = np.dstack((arr0, arr1))  # Depth stack (axis=2)
print(arr0, '\n')
print(arr1, '\n')
print(arr2, '\n')
print(arr3, '\n')
print(arr4, '\n')
# Outputs:
# arr2: [[ 1  2  3] [ 4  5  6] [ 7  8  9] [10 11 12]]
# arr3: [[ 1  2  3  7  8  9] [ 4  5  6 10 11 12]]
# arr4: [[[ 1  7] [ 2  8] [ 3  9]] [[ 4 10] [ 5 11] [ 6 12]]]
```
Since the above three functions are so rigid, they definitely won’t be our favorites. A common scenario: how do I stack along the 4th axis? That’s where the next two functions come in.
I purposely placed concatenate before stack: logically, concatenate belongs with the three functions above, since all four join arrays along an existing axis, while stack behaves differently. A simple example will show why I wrote them in this order:
```python
# Create two 4-dimensional arrays
arr0 = np.array([[[[1, 2, 3], [4, 5, 6]],
                  [[7, 8, 9], [10, 11, 12]]],
                 [[[13, 14, 15], [16, 17, 18]],
                  [[19, 20, 21], [22, 23, 24]]]])
arr1 = np.array([[[[25, 26, 27], [28, 29, 30]],
                  [[31, 32, 33], [34, 35, 36]]],
                 [[[37, 38, 39], [40, 41, 42]],
                  [[43, 44, 45], [46, 47, 48]]]])
print(arr0.shape, '\n')  # (2, 2, 2, 3)
print(arr1.shape, '\n')  # (2, 2, 2, 3)

arr2 = np.concatenate((arr0, arr1), axis=3)  # Concatenate along the existing axis 3 (innermost)
arr3 = np.stack((arr0, arr1), axis=3)        # Stack along a new axis inserted at position 3
print(arr2.shape, '\n')  # (2, 2, 2, 6) - the length of axis 3 doubles
print(arr3.shape, '\n')  # (2, 2, 2, 2, 3) - a new axis is added, making the result 5-D
```
Clearly, concatenate joins along an existing axis, acting like a more flexible vstack, hstack, dstack. But stack adds a new axis at our specified position, moving previously existing axes inward.
Of course, in a sense, using dstack on two 2-D arrays achieves a similar effect, since it also creates a new axis, but only in that special case.
```python
arr0 = np.array([[1, 2, 3], [4, 5, 6]])
arr1 = np.array([[7, 8, 9], [10, 11, 12]])
arr2 = np.dstack((arr0, arr1))  # Depth stack of two 2-D arrays creates a new third axis
print(arr2, '\n')
print(arr2.shape, '\n')  # (2, 3, 2) - note the shape change
```
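To tie this back to the opening question of integrating multiple discrete data sources: here is a minimal sketch (the shapes are assumed, and random arrays stand in for data read from separate files) of combining monthly fields with stack and concatenate:

```python
import numpy as np

# Assume each monthly file yields one (lat, lon) field of shape (180, 360)
monthly_fields = [np.random.rand(180, 360) for _ in range(12)]
year_arr = np.stack(monthly_fields, axis=0)  # New time axis in front -> (12, 180, 360)
print(year_arr.shape)

# Assume two half-year arrays that already have a time axis
first_half = np.random.rand(6, 180, 360)
second_half = np.random.rand(6, 180, 360)
full_year = np.concatenate((first_half, second_half), axis=0)  # Join along the existing time axis -> (12, 180, 360)
print(full_year.shape)
```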
Shape Transformation
We’ve discussed many ways to recombine arrays, but they all involve multiple arrays. However, transforming the shape of a single array itself is also extremely important.
For example, in deep learning we often need dimensionality reduction, flattening 2D spatial data into 1D vectors. Or, two meteorological datasets have the same dimensions (latitude, longitude, time, pressure level), but their axis order differs. How to swap axes so they can be computed together?
Generally, the functions here are used for more advanced operations and data processing. It is hard to give truly universal examples; you will need to explore them based on your own use cases. Still, they are fairly common in lower-level data handling and modeling.
```python
arr0 = np.array([[1, 2], [4, 5]])
arr1 = arr0.T          # Transpose
arr2 = arr0.flatten()  # Flatten to 1-D
arr3 = np.repeat(arr0, 2, axis=1)  # Repeat N times along the given axis (here N=2); each element's copies sit next to it
arr4 = np.tile(arr0, (2, 2))       # Tile (M, N) times along the axes (here M=N=2); the whole array is copied as a block
print(arr0, '\n')
print(arr1, '\n')
print(arr2, '\n')
print(arr3, '\n')
print(arr4, '\n')
# Outputs:
# arr1: [[1 4] [2 5]]
# arr2: [1 2 4 5]
# arr3: [[1 1 2 2] [4 4 5 5]]
# arr4: [[1 2 1 2] [4 5 4 5] [1 2 1 2] [4 5 4 5]]
```
The above is just a 2-D example for tile. To repeat along more axes, simply add more repetition counts to the tuple. Here's a possible application scenario for tile:
```python
lat_arr = np.arange(-30, 60)          # Latitude values (90 of them)
lat_grid = np.tile(lat_arr, (90, 1))  # Repeat the latitude sequence once for each of 90 longitudes
print(lat_grid, '\n')                 # Shape (90, 90)
# Output: a 90 x 90 grid in which every row is the same latitude sequence
```
But NumPy provides a simpler grid generation method: meshgrid.
```python
lon = np.arange(10, 120)  # Longitude values (110 of them)
lat = np.arange(-30, 60)  # Latitude values (90 of them)
lon_grid, lat_grid = np.meshgrid(lon, lat)  # Create coordinate grids
print(lon_grid, '\n')
print(lat_grid, '\n')
# Outputs:
# lon_grid has shape (90, 110): each row is the full longitude sequence (longitude varies across columns)
# lat_grid has shape (90, 110): each column is the full latitude sequence (latitude varies across rows)
```
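One common use of such coordinate grids in earth science is building latitude-dependent area weights. A minimal, purely illustrative sketch using the lat_grid from above (the field is a random placeholder):

```python
weights = np.cos(np.deg2rad(lat_grid))  # One weight per grid point, shape (90, 110)
field = np.random.rand(90, 110)         # Assumed: some variable on the same grid
weighted_mean = (field * weights).sum() / weights.sum()  # Area-weighted mean
print(weighted_mean)
```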
When we need more complex array shape reshaping, we use the reshape function. It redefines the array’s shape by specifying the length of each axis.
```python
# Use reshape to change an array's shape
arr0 = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
arr1 = arr0.reshape(2, 5)
print(arr0, '\n')
print(arr1, '\n')
# If you don't want to work out one dimension's length yourself, use -1 for that dimension and NumPy computes it.
arr2 = arr0.reshape(2, -1)  # -1 is computed as 5
print(arr2, '\n')
# Outputs:
# arr1 and arr2: [[ 1  2  3  4  5] [ 6  7  8  9 10]]
```
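As an example of the "flatten for machine learning" use case mentioned earlier, here is a minimal sketch (the shape is assumed) of collapsing the spatial axes with reshape:

```python
data = np.random.rand(12, 180, 360)  # Assumed: (time, lat, lon)
flat = data.reshape(12, -1)          # -1 is computed as 180 * 360 = 64800
print(flat.shape)                    # (12, 64800) - one sample vector per time step
```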
We mentioned earlier that when two datasets have different axis orders, we need to swap axes to perform operations. This requires the transpose() function, which swaps array axes by specifying the new axis order.
```python
# Swap axes
arr0 = np.random.rand(3, 4, 5)  # Shape (3, 4, 5)
arr1 = np.random.rand(3, 5, 4)  # Shape (3, 5, 4): axes 1 and 2 swapped relative to arr0
# arr0 * arr1 would fail because the shapes are incompatible:
# ValueError: operands could not be broadcast together with shapes (3,4,5) (3,5,4)
print(arr0 * arr1.transpose(0, 2, 1))  # Swap arr1's axis 1 and axis 2
# arr1.transpose(0, 2, 1) has shape (3, 4, 5), matching arr0, so the multiplication works
```