NumPy 01 | Introduction, Concepts

Introduction

NumPy is the foundation and most important component of the Python scientific computing ecosystem (SC, meaning Scientific Calculation, which is also the acronym in this series title). It is also considered one of the “Three Musketeers” in the field of Python data analysis (NumPy serves as the core library for numerical computation, Pandas is a powerful tool for data analysis and manipulation, and Matplotlib provides visualization capabilities).

In practical use, SciPy also has widespread applicability in data analysis. Therefore, I personally prefer to include SciPy as one of the “Data Analysis Musketeers.” As we all know, the “Big Three” often has four members, just like the “Four Heavenly Kings” often has five. So, having four “Data Analysis Musketeers” is not strange at all. Q.E.D.

NumPy’s core lies in its support for large-scale N-dimensional arrays (ndarray in NumPy type, or simply referred to as array) and matrix operations, along with a vast array of APIs for array manipulation.

From the simplest arithmetic operations to building complex numerical models with Python, NumPy serves as the cornerstone of the castle. Therefore, our story begins with NumPy.

List vs. Array

List

Before discussing arrays, it’s necessary to first mention the List, the most common native Python container.

Simply put, a list is a container that holds a series of elements in order. These elements can be anything: a number, a string of words, the contents of an Excel spreadsheet, or even another list (from an organizational standpoint, I believe there’s no fundamental difference between a nested list and an array).

Lists do not require specifying their length at creation and can be easily modified (add/delete/search/update) using built-in functions like append.

In Python, lists are typically created using square brackets [], with elements separated by commas: [1, 'Los Angeles', True, [2, 'Sheffield', 'How are u?']].

Array

An array is a collection of elements of the same data type, stored in contiguous memory locations. Each array element is identified by one or a set of indices, allowing for efficient storage and manipulation of large-scale data.

For arrays, their dimensions must be specified upon creation, giving them a fixed length (we can use ndim, shape, size, and dtype to get the array’s dimensions, shape, size, and data type, respectively). Adding, deleting, or modifying array elements requires assigning a new array-type variable. Additionally, elements within the same array should be of the same type. If an array is already specified as numeric, assigning a character-type element to it will not work.

python

import numpy as np

# Create a 2000×3000 2D array using random numbers (unmentioned functions will be briefly explained, details covered later)
arr = np.random.rand(2000, 3000)

print(arr.ndim)     # Dimensions
print(arr.shape)    # Shape
print(arr.size)     # Size (number of elements)
print(arr.dtype)    # Data type

arr[100, 200] = 'Python'    # Assignment with mismatched type will fail

# Output:
# 2
# (2000, 3000)
# 6000000
# float64
# ValueError: could not convert string to float: 'Python'

ls = [1, 'Los Angeles', True, [2, 'Sheffield', 'How are u?']]
ls[0] = 'Zürich'    # Lists accept assignments of different types

print(ls)
# Output: ['Zürich', 'Los Angeles', True, [2, 'Sheffield', 'How are u?']]

So, what are the advantages of arrays over lists?

Arrays support broadcasting operations and vectorized computations, greatly optimizing their execution speed and reducing code volume. Simultaneously, NumPy provides a vast number of functions for performing advanced mathematical operations, enabling convenient tasks like matrix multiplication, Fourier transforms, and calculus.

We can run a simple test to compare the performance of NumPy arrays versus lists:

python

import time

t1 = time.time()    # Get current timestamp
x = np.arange(1000000) + 1    # Add 1 to all natural numbers within 1000000
t2 = time.time()

print('Runtime: %.2f ms' % ((t2 - t1) * 1000))  # Calculate runtime

# Output: Runtime: 2.00 ms

t1 = time.time()

ls = []
for i in range(1000000):
    ls.append(i + 1)

t2 = time.time()
print('Runtime: %.2f ms' % ((t2 - t1) * 1000))

# Output: Runtime: 96.00 ms

Clearly, arrays are significantly faster than lists.

Array Axes

Earlier we mentioned, “From an organizational standpoint, there’s no fundamental difference between a nested list and an array.” This involves the concept of dimensions (dim).

In NumPy, each dimension corresponds to an axis (axis) (where the first dimension is axis=0, the second is axis=1, and so on). This enables richer operations, such as calculating the mean/max/min separately for each row/column.

python

arr = np.random.rand(2000, 3000)

print(arr.mean())    # Calculate mean of all values
print(arr.sum(axis=0))      # Sum along rows (axis 0)
print(arr.mean(axis=1))     # Mean along columns (axis 1)

# Output (example values):
# 0.4999523616278683
# [999.26900252  994.98832343  994.0841875  ... 1025.45738118  994.97250643 992.98966004]
# [0.49987789 0.50101981 0.49383497 ... 0.50501385 0.49313256 0.50345248]

It’s important to note that the order of array axes is not fixed, which might seem counterintuitive but has its own consistent logic.

In NumPy, the most basic array unit is a row vector (a 1D array). When we create a natural number vector from 0-9, arr = array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), since all elements are in row 0, the concept of rows doesn’t exist yet. There’s only one axis, axis=0 (what we commonly call the x-axis), with index values corresponding to the column position of the element (e.g., arr[2] has the value 2).

By stacking multiple row vectors in the row direction to form a 2D array, the concept of rows emerges, and the axes generalize to 2 (axis=0 and axis=1, or y and x axes). However, since the row index is conceptually one level higher than the column index (because rows are created by stacking row vectors, and NumPy’s indexing is designed from the outside in), the axis corresponding to y takes axis=0, and the x-axis shifts accordingly.

The generation of multi-dimensional arrays follows this pattern.

python

arr1D = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
arr2D = np.array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9],
                  [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
                  [20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
                  [30, 31, 32, 33, 34, 35, 36, 37, 38, 39]])

print(arr1D[2])    # Index 2 along axis 0
print(arr2D[0, 2]) # Row 0, Column 2

# Output:
# 2
# 2

In a spatial coordinate system, we typically use (x, y, z) to represent an element’s position. In NumPy, when we know that in a 2D array, rows and columns correspond to the y and x axes (i.e., their axes are 0 and 1, respectively), and we form a 3D array by stacking multiple 2D arrays, the indexing for elements in the 3D array is typically array[z, y, x] (where the z-axis is axis=0, and rows and columns become axes 1 and 2).

This is the 3D generalization of the earlier 1D to 2D case. When stacking multiple 2D arrays to form a 3D array, indexing an element first requires determining which 2D array it’s in (the z-level), then indexing its row and column position within that array.

Indexing refers to accessing an element at a specified position. For example, the element at row 2, column 3 in a 2D array A can be indexed via A[2, 3] or A[2][3]. List indexing is similar to array indexing. This is not the core content of this section and will be explained further later.

Note: Indexing in Python, like in C and other traditional programming languages, starts from 0. The “2nd row, 3rd column” above refers to the value at the mathematical 3rd row, 4th column position.

Actually, there is a slight difference between the indexing methods A[2, 3] and A[2][3]. Think about what the difference might be after reading this article.

python

arr0 = np.random.rand(1000, 800)
arr1 = np.random.rand(1000, 800)
arr3d = np.stack((arr0, arr1))     # Stack along a new axis, default is axis=0

print(arr3d.shape)
# Output: (2, 1000, 800)

Returning to the statement at the beginning of this section: suppose we combine multiple ordinary, non-nested element lists into a new list. If this new list List2D doesn’t contain nested containers like lists or dictionaries that allow nested indexing, its data structure should be similar to a 2D array.

If we further perform second-order nesting of multiple nested lists, a list similar to a 3D array, List3D, is formed. In this case, to index the element at (x0, y0, z0), we can achieve it via a method similar to array indexing: List3D[z0][y0][x0].

python

list1d = [1, 2, 3, 4]
list2d = [list1d, list1d, list1d, list1d]
list3d = [list2d, list2d]

print(list3d)
print(list2d[1][2])    # Element at index 1, then index 2 within that inner list
print(list3d[0][1][2]) # Element at index 0, then index 1, then index 2

# Output:
# [[[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]], [[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]]]
# 3
# 3

Application Example

If the above content seems abstract and obscure from a purely mathematical perspective, let’s provide an example closer to Earth science.

Suppose we have a large temperature dataset with dimensions 100×8×180×360 (using NumPy generated random numbers here as a substitute), corresponding to time/pressure level/latitude/longitude (time/level/lat/lon).

python

temp_pres_level = np.random.rand(100, 8, 180, 360) * 10
print(temp_pres_level.shape)
# Output: (100, 8, 180, 360)

The most fundamental dimensions are longitude (lon, length 360) and latitude (lat, length 180). Any spatial slice of this array represents the temperature field state at a specific time and a specific pressure level height.

python

# Spatial distribution of temperature at time index 10, pressure level index 3 (indices start from 0)
print(temp_pres_level[10, 3, :, :])
print(temp_pres_level[10, 3, :, :].shape)

# Output (example values, shape):
# [[5.15967091 8.32741165 8.38750895 ... 5.13925573 6.31967397 3.6473191 ]
#  [2.43346555 1.47431224 3.64297239 ... 1.16419275 2.16354937 0.56699932]
#  [8.49879345 9.36592812 4.35082365 ... 7.25290327 7.51761726 1.49693121]
#  ...
#  [0.06623796 4.7796034  9.29730585 ... 8.81317218 3.70472341 0.65723977]
#  [8.36701033 5.70102812 8.31637686 ... 0.65968372 1.86492099 2.29193739]
#  [3.33517635 6.73889334 4.89346057 ... 2.65087397 7.88711282 0.84554989]]
# (180, 360)

At any time slice, there exists a 3D spatial distribution of temperature formed by stacking the temperature fields from the 8 pressure level heights for that time period.

python

# 3D temperature field at time index 10
print(temp_pres_level[10, :, :, :])
print(temp_pres_level[10, :, :, :].shape)

# Output (example values, shape):
# (8, 180, 360)

By specifying indices for each dimension, we can obtain the temperature at a specific time/pressure level/longitude/latitude.

python

# Temperature at time index 10, pressure level 0, latitude index 90, longitude index 180
print(temp_pres_level[10, 0, 90, 180])
# Output (example): 4.262264606486996

Thus, we can easily obtain distributions for various climate variables.

python

print(temp_pres_level.mean(axis=0))   # Time-averaged temperature (output omitted due to length)
print(temp_pres_level.mean(axis=1))   # Pressure-level averaged temperature
print(temp_pres_level.min(axis=3))    # Minimum temperature along latitude circles
print(temp_pres_level.max(axis=2))    # Maximum temperature along longitude circles
print(temp_pres_level.mean(axis=(2, 3)))  # Spatial (lat-lon) average temperature for each time-pressure level
print(temp_pres_level[10].mean(axis=1))   # Pressure level-longitude average temperature distribution for time index 10
print(temp_pres_level[10].mean(axis=2))   # Pressure level-latitude average temperature distribution for time index 10

Easy Python

NumPy 01 | Introduction, Concepts

Introduction

List vs. Array

List

Array

Array Axes

Application Example

New Article

Introduction

List vs. Array

List

Array

Array Axes

Application Example

Related articles