Introduction
Previously, we covered the basic concepts, creation, computation, and statistics of NumPy arrays, as well as more advanced operations like indexing, slicing, and concatenation, and suggested some potential applications in earth science along the way.
However, a question arises: all the example data we used so far were either randomly generated by NumPy or special arrays defined by hand, whereas real data sits in files on disk. Bridging this gap between local storage and runtime memory is crucial. (In plain terms: how do we read local files into NumPy arrays, and how do we save arrays from NumPy back to disk?)
Therefore, this article will explore several file Input/Output (IO) methods built into NumPy.
Note: In broader applications, our data often exists in vector or raster formats rather than the few types covered here (only .txt files are relatively common as a storage type, often seen for meteorological station data). More specialized libraries are needed for their IO; related content will be supplemented and expanded later. That said, the formats here remain decent options for temporary data storage.
Ultimately, how array data is organized after reading, and how it is then processed, rests mostly on NumPy and Pandas (which we will start on soon). In the upcoming Pandas content, we will detail file handling for common formats like .csv and Excel spreadsheets. Thus, understanding NumPy array principles and mechanisms is vital for what follows: in the foreseeable future, all our earth science data processing will revolve around NumPy.
Binary File IO
NumPy can use the save and load functions to write arrays to, and read them back from, binary files in the .npy format.
python
import numpy as np
arr = np.array([[1, 2], [3, 4]])
np.save('my_array.npy', arr)
loaded_arr = np.load('my_array.npy')
print(loaded_arr)
# Output:
# [[1 2]
# [3 4]]
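An aside worth knowing before we move on to text files: np.save preserves the array's shape and dtype exactly and works for arrays of any dimensionality, whereas the text functions below only accept 1D and 2D arrays. A minimal sketch (the filename is just for illustration):
python
arr3d = np.random.rand(2, 3, 4)    # a 3D array
np.save('my_3d_array.npy', arr3d)  # binary IO handles any number of dimensions
restored = np.load('my_3d_array.npy')
print(restored.shape, restored.dtype)  # (2, 3, 4) float64 -- shape and dtype round-trip exactly
# np.savetxt('my_3d_array.txt', arr3d) would raise an error: savetxt only accepts 1D/2D arrays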
Text File IO
Although binary files are faster to read, sometimes we want to inspect the stored values directly. Binary .npy files cannot be opened meaningfully in an ordinary text editor, which brings us to our most frequently encountered format: plain text files (.txt).
They can also be simply read/written using NumPy:
python
arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
np.savetxt('my_file.txt', arr)
loaded_arr = np.loadtxt('my_file.txt')
print(loaded_arr)
# Output:
# [[1. 2. 3.]
# [4. 5. 6.]
# [7. 8. 9.]]
Notice that the integers came back as floats (1. instead of 1): by default, savetxt writes values in a floating-point format, and loadtxt reads them back as float64. When saving files, besides specifying the filename, there are more customizable parameters. Here we introduce the most commonly used ones: the delimiter (common choices include space, tab, and comma) and the output format.
python
arr = np.random.rand(3, 4)
print(arr, '\n')
np.savetxt('random_array.txt', arr, fmt='%.2f', delimiter='\t') # Use tab as delimiter
loaded_arr = np.loadtxt('random_array.txt', delimiter='\t')
print(loaded_arr)
# Output (example):
# [[0.94374423 0.0944434 0.85795381 0.75437499] ... ]
# [[0.94 0.09 0.86 0.75] ... ]
You can see that because we specified the fmt (format) parameter when saving, the original float64 array was written with only two decimal places retained (%.2f means a floating-point value formatted to two decimals; the string-formatting machinery involved will be explained systematically later).
This is extremely effective when storage space is limited and trimming decimal places does not meaningfully affect computational precision, as it greatly reduces file size. Specifying the delimiter is more about readability, or interoperability with other applications that expect a particular data layout.
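To make the saving concrete, here is a small sketch comparing file sizes (the exact byte counts vary; savetxt's default format is %.18e, roughly 25 characters per value, versus about 5 with %.2f):
python
import os

arr = np.random.rand(100, 100)
np.savetxt('full_precision.txt', arr)            # default fmt='%.18e'
np.savetxt('two_decimals.txt', arr, fmt='%.2f')
print(os.path.getsize('full_precision.txt'))  # on the order of 250,000 bytes
print(os.path.getsize('two_decimals.txt'))    # on the order of 50,000 bytes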
If we try to open a file with a delimiter different from the one specified, data reading will fail:
python
arr = np.random.rand(3, 4)
np.savetxt('random_array.txt', arr, fmt='%.2f', delimiter=',') # Use comma as delimiter
loaded_arr = np.loadtxt('random_array.txt', delimiter='\t') # Try to read with tab delimiter -> Error
# Output:
# ValueError: could not convert string '0.25,0.85,0.28,0.77' to float64 at row 0, column 1.
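The fix is simply to read with the delimiter that was actually used when writing:
python
loaded_arr = np.loadtxt('random_array.txt', delimiter=',')  # matches the comma used above
print(loaded_arr)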
Therefore, when using NumPy to read third-party data, you first need to clarify the storage format of that data source.
We mentioned that when saving, we can write data in a reduced format to save storage space. Similarly, when facing an existing large dataset and constrained by our hardware, we might want to compress the data as it is read, by specifying the data type directly:
python
arr = np.random.rand(3, 4) * 100
print(arr, '\n')
np.savetxt('random_array.txt', arr, delimiter='\t')
loaded_arr = np.loadtxt('random_array.txt', delimiter='\t')
print(loaded_arr, '\n')
loaded_arr = np.loadtxt('random_array.txt', delimiter='\t', dtype=np.float16)
print(loaded_arr)
# Output (example):
# Original arr: [[72.52279016 27.60526614 73.44863686 21.84277912] ... ]
# Loaded as default (float64): same as above
# Loaded as float16: [[72.5 27.61 73.44 21.84 ] ... ] # Lower precision
You can see the array itself was saved unchanged; it is the dtype specified when reading that determines the precision of the loaded data.
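To see the memory saving directly, compare the nbytes attribute of the two loaded arrays (continuing the 3x4 example above, so the byte counts assume 12 values):
python
arr64 = np.loadtxt('random_array.txt', delimiter='\t')                   # float64 by default
arr16 = np.loadtxt('random_array.txt', delimiter='\t', dtype=np.float16)
print(arr64.nbytes)  # 96 -> 12 values * 8 bytes each
print(arr16.nbytes)  # 24 -> 12 values * 2 bytes each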
Another common situation is that the first few rows are invalid (header lines, for instance), or we only need data from specific columns. In such cases, further parameters let us restrict which parts of the file are read:
python
arr = np.random.rand(5, 5) * 100
print(arr, '\n')
np.savetxt('random_array.txt', arr, delimiter='\t') # Use tab as delimiter
loaded_arr = np.loadtxt('random_array.txt', delimiter='\t', skiprows=2, usecols=[0, 2, 4])
print(loaded_arr, '\n')
# Output (example):
# Original 5x5 array...
# Loaded array (skipped first 2 rows, kept columns 0, 2, 4):
# [[70.04846718 44.15204615 89.78729344]
# [17.31092689 25.05784428 31.42033019]
# [ 2.0699918 60.35116244 70.98181825]]
Here we used the skiprows parameter to skip the first two rows, and the usecols parameter to specify only reading columns [0, 2, 4].
Note two conventions here: column numbers in usecols start from 0, even though the data is not yet an array, whereas skiprows is an ordinary count of rows (skiprows=2 means skip two rows). Python is full of settings like these, and minor oversights can lead to serious errors.
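As a concrete illustration, here is a hypothetical station-style file (created on the spot just for this example) whose one-line text header would otherwise break the numeric parsing; skiprows=1 steps over it and usecols picks out the value columns:
python
# Write a small hypothetical station file with a one-line header
with open('stations.txt', 'w') as f:
    f.write('id\ttemp\tpressure\n')
    f.write('1\t25.3\t1013.2\n')
    f.write('2\t24.8\t1012.7\n')
data = np.loadtxt('stations.txt', delimiter='\t', skiprows=1, usecols=[1, 2])
print(data)
# Output:
# [[  25.3 1013.2]
#  [  24.8 1012.7]]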
More Flexible Array IO – NPZ
Whether with the invisible binary files or the visible text files above, we can ultimately store only one array per file.
This means, suppose we have long-term temperature data for a region, we can convert it into a 3D array of shape (t, x, y) for storage. But if we have multiple regions, and perhaps their spatial extents differ, leading to different array dimension lengths, squeezing them into a single array for storage becomes troublesome.
Of course, they can be stored as multiple different files. But for data of the same type, managing a single file is often more convenient than managing a bunch of files in many cases.
So, is there a method where, regardless of array shapes, we just need to remember their names, pack them all together, and then extract data with specific names when reading?
That brings us to the .npz format built into NumPy. It is somewhat similar to MATLAB's .mat or R's .rds, though since NumPy focuses only on arrays, the contents are not as rich (for richer structures, Pandas is worth considering).
python
arr0 = np.array([1, 2, 3, 4, 5])
arr1 = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
arr2 = np.random.rand(3, 3, 3)
np.savez('my_array.npz', a=arr0, b=arr1, c=arr2)
data = np.load('my_array.npz')
data0 = data['a']
data1 = data['b']
data2 = np.load('my_array.npz')['c'] # Or do it all in one step
print(data0, '\n')
print(data1, '\n')
print(data2)
# Outputs: the three arrays with their respective names.
That’s the simple usage of .npz. We stored three different types of arrays into the same .npz file, assigning each an alias. When reading, we just need to extract the needed content based on the corresponding name.
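Two related conveniences worth noting: np.savez_compressed writes the same kind of archive with zip compression (useful for large, repetitive arrays), and the loaded object's files attribute lists the stored names, so you don't need to remember them:
python
np.savez_compressed('my_array_compressed.npz', a=arr0, b=arr1, c=arr2)
data = np.load('my_array_compressed.npz')
print(data.files)  # ['a', 'b', 'c'] -- the names available for extraction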
Supplementary
Looking back over the previous content, there is a fairly key concept we have not yet mentioned, so let's supplement it here. Other omissions will likewise be filled in this way in future updates.
The main concept here is shallow copy vs. deep copy.
Shallow Copy vs. Deep Copy
Shallow Copy:
Shallow copy means that when copying an object, only the object's reference is copied, not the object itself. In other words, the copied object and the original share the same memory space; modifying one affects the other.
Deep Copy:
Deep copy creates a new array and copies all the data from the original array into it. The original and the copy occupy different memory spaces, so modifying one does not affect the other.
So, if we need to perform two different calculations on a piece of data, but our two arrays are connected via a shallow copy, both calculations end up operating on the same underlying values and their results will be identical, potentially leading to huge errors. Special attention is needed.
Let’s first give an example of a shallow copy:
python
a = np.array([1, 2, 3])
b = a  # Shallow copy via plain assignment
print(a)
print(b)
a[0] = 10
print(a)
print(b)
# Output:
# [1 2 3]
# [1 2 3]
# [10 2 3]
# [10 2 3]
You can see that the assignment operator = directly ties arrays a and b together. This connection is shallow: the two names share the same memory address rather than duplicating memory usage, so if a value changes through one name, it changes through the other as well.
python
a = np.array([1, 2, 3])
b = a.copy()  # Deep copy
print(a)
print(b)
a[0] = 10
b[1] = 20
print(a)
print(b)
# Output:
# [1 2 3]
# [1 2 3]
# [10 2 3]
# [ 1 20 3]
If we use copy to completely duplicate the array into another memory block, their storage becomes independent, and subsequent operations proceed separately without affecting each other.
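When in doubt about whether two arrays are linked, np.shares_memory gives a direct answer. A minimal sketch (note that slicing also produces a view that shares memory, which is easy to overlook):
python
a = np.array([1, 2, 3])
b = a          # plain assignment: same underlying data
c = a[1:]      # a slice is a view and also shares memory
d = a.copy()   # deep copy: independent data
print(np.shares_memory(a, b))  # True
print(np.shares_memory(a, c))  # True
print(np.shares_memory(a, d))  # False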