In the previous session, we discussed how Pandas can conveniently handle time series data (seems like a distant memory now). Without further ado, let’s continue by looking at reading and writing files with Pandas to enable interaction between local data and Python.
Similar to NumPy, Pandas also provides a series of I/O functions for reading and writing various file formats. Thanks to Pandas' tabular structure, the data it reads is much more user-friendly; after all, a raw array with data and row/column names squeezed together can be genuinely hard to read.
Reading csv/txt Files
First come the most familiar formats: txt and csv files. Pandas reads and writes them with read_csv and to_csv.
csv vs. txt
csv stands for Comma-Separated Values: different fields are separated by commas. A txt file is a plain text file with no specific format requirements; its content can be text in any format. Often, we encounter txt files where fields are separated by spaces, tabs, or commas. By specifying the sep parameter, we can read txt files with different separators.
Here are some examples:
python
import numpy as np
import pandas as pd
# Create a random dataset
data = pd.DataFrame(np.random.rand(10, 3) * 100, index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'], columns=['A', 'B', 'C'])
print(data)
# Export it in different formats
data.to_csv('data.csv')
data.to_csv('data.txt', sep=';') # sep defaults to a comma when not specified; here we use ';'.
# Output (example):
# A B C
# a 22.874398 43.825654 13.928709
# b 32.045215 29.105177 68.247118
# c 27.921033 3.979053 4.287473
# d 17.862866 98.013949 88.642451
# e 91.404210 33.967303 17.501599
# f 38.233862 78.224419 54.417324
# g 78.381364 61.520635 59.949401
# h 4.025050 14.812711 72.599593
# i 96.191718 35.276491 98.322036
# j 45.067653 8.914086 12.149836
We can then use the read_csv function to read the above data. By specifying parameters, we can customize the reading. Here are some commonly used parameters:
sep: Specifies the field separator; default is a comma.
header: Specifies the row to use as column names; default is 0, meaning the first row.
index_col: Specifies the column to use as the row index (row names); default is None, meaning no index column is used.
skiprows: Specifies the number of rows to skip from the beginning; default is 0.
nrows: Specifies the number of rows to read; default is None, meaning read all rows.
python
df0 = pd.read_csv('data.csv', index_col=0, header=None, skiprows=1, nrows=5)
print(df0)
# Output (example):
# 1 2 3
# 0
# a 22.874398 43.825654 13.928709
# b 32.045215 29.105177 68.247118
# c 27.921033 3.979053 4.287473
# d 17.862866 98.013949 88.642451
# e 91.404210 33.967303 17.501599
When reading a txt file without specifying a separator, Pandas defaults to a comma, which may cause issues. Therefore, extra attention is needed for txt file reading.
python
df1 = pd.read_csv('data.txt') # May read incorrectly.
print(df1.head())
df1 = pd.read_csv('data.txt', sep=';', index_col=0) # Correct reading.
print(df1)
# Output (example - first read likely incorrect):
# ;A;B;C
# 0 a;22.87439762439186;43.82565359446042;13.92870...
# 1 b;32.04521536203912;29.105177171104458;68.2471...
# ...
# Output (correct read):
# A B C
# a 22.874398 43.825654 13.928709
# b 32.045215 29.105177 68.247118
# c 27.921033 3.979053 4.287473
# d 17.862866 98.013949 88.642451
# e 91.404210 33.967303 17.501599
# f 38.233862 78.224419 54.417324
# g 78.381364 61.520635 59.949401
# h 4.025050 14.812711 72.599593
# i 96.191718 35.276491 98.322036
# j 45.067653 8.914086 12.149836
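As mentioned earlier, txt files may also use tabs as separators, and the same sep mechanism handles them. Here is a minimal sketch reading tab-separated text from an in-memory string (the data and variable names are made up for illustration; StringIO stands in for a real .txt file):

```python
import pandas as pd
from io import StringIO

# Hypothetical tab-separated text; StringIO stands in for a real file on disk.
raw = "A\tB\tC\n1\t2\t3\n4\t5\t6\n"
df_tab = pd.read_csv(StringIO(raw), sep='\t')  # sep='\t' for tab-separated fields
print(df_tab)
# Output (example):
#    A  B  C
# 0  1  2  3
# 1  4  5  6
```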
Similarly, when exporting data with Pandas, we can specify parameters to control whether to output row/column names and other information. Here are some common parameters:
index: Whether to write row names; default is True.
header: Whether to write column names; default is True.
sep: Specifies the separator; default is a comma.
encoding: Specifies the encoding format.
mode: Specifies the write mode, e.g., 'w' for write (overwrite) and 'a' for append, consistent with the basic Python syntax for txt file I/O explained earlier.
python
df0.to_csv('data0.csv', index=False, header=True, mode='w', encoding='utf-8')
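The append mode mentioned above can be sketched as follows. Note that when appending we usually suppress the header so it is not written a second time (the file name and data here are arbitrary):

```python
import pandas as pd

df_a = pd.DataFrame({'A': [1.0, 2.0], 'B': [3.0, 4.0]})
df_b = pd.DataFrame({'A': [5.0], 'B': [6.0]})

df_a.to_csv('appended.csv', index=False, mode='w')                # overwrite
df_b.to_csv('appended.csv', index=False, mode='a', header=False)  # append rows only

combined = pd.read_csv('appended.csv')
print(combined)
# Output (example):
#      A    B
# 0  1.0  3.0
# 1  2.0  4.0
# 2  5.0  6.0
```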
Reading Excel Spreadsheets
Speaking of tables, how could we leave out Excel? Beyond the rich formatting the application itself provides, the most significant feature of Excel spreadsheets compared to csv and txt files is that a single file can contain multiple sheets.
Everything else works much the same; the only extra step when reading an Excel file with multiple sheets is specifying the sheet name. First, let's create an Excel file containing multiple sheets using the data above.
python
# Writing to an Excel file overwrites the existing file. When we need to write multiple
# sheets, we must write all of them before the file is closed; otherwise only the last
# sheet's content will be saved.
with pd.ExcelWriter('data.xlsx') as writer:
    df0.to_excel(writer, index=False, sheet_name='df0')
    df1.to_excel(writer, index=False, sheet_name='df1')
df0 = pd.read_excel('data.xlsx', sheet_name='df0')
print(df0)
df1 = pd.read_excel('data.xlsx', sheet_name='df1')
print(df1)
# Output (example):
# (df0)
# 1 2 3
# 0 22.874398 43.825654 13.928709
# 1 32.045215 29.105177 68.247118
# 2 27.921033 3.979053 4.287473
# 3 17.862866 98.013949 88.642451
# 4 91.404210 33.967303 17.501599
# (df1)
# A B C
# 0 22.874398 43.825654 13.928709
# 1 32.045215 29.105177 68.247118
# 2 27.921033 3.979053 4.287473
# 3 17.862866 98.013949 88.642451
# 4 91.404210 33.967303 17.501599
# 5 38.233862 78.224419 54.417324
# 6 78.381364 61.520635 59.949401
# 7 4.025050 14.812711 72.599593
# 8 96.191718 35.276491 98.322036
# 9 45.067653 8.914086 12.149836
Additionally, Pandas provides functions for reading hdf, html, json, and other formats. As they are less used in the geoscience domain, they won’t be elaborated here; we’ll explain them when encountered later.
Among these, hdf is a relatively common format for storing geospatial data, though we usually read it with dedicated third-party libraries. The json format often appears when storing coordinates, polygon vertices, or attribute data.
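For the json case, here is a minimal sketch of read_json with made-up station coordinates (wrapping the literal string in StringIO, as recent pandas versions recommend):

```python
import pandas as pd
from io import StringIO

# Hypothetical JSON records, e.g. station coordinates.
raw = ('[{"station": "S1", "lon": 116.4, "lat": 39.9},'
       ' {"station": "S2", "lon": 121.5, "lat": 31.2}]')
df = pd.read_json(StringIO(raw))  # one record per row, keys become columns
print(df)
```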
Postscript
The above covers the basics of data I/O with Pandas, a short piece compared to previous installments. Past writing and proofreading have suggested that overly long articles burden both the author and the readers, who must set aside a concentrated and not-so-short block of time.