In the previous session, we discussed how Pandas can conveniently handle time series data (seems like a distant memory now). Without further ado, let’s continue by looking at reading and writing files with Pandas to enable interaction between local data and Python.
Similar to NumPy, Pandas also provides a series of I/O functions for reading and writing various file formats. Thanks to Pandas' tabular structure, the data it reads is much more user-friendly; after all, a raw array with data and row/column names squeezed together can be genuinely hard to read.
Reading csv/txt Files
First come the most familiar formats: txt and csv files. Pandas reads and writes them with read_csv and to_csv.
csv vs. txt
csv stands for Comma-Separated Values: different fields are separated by commas. A txt file is a plain text file with no specific format requirements; its content can be text in any format. Often, we encounter txt files where fields are separated by spaces, tabs, or commas. By specifying the sep parameter, we can read txt files with different separators.
Here are some examples:
python
import numpy as np
import pandas as pd
# Create a random dataset
data = pd.DataFrame(np.random.rand(10, 3) * 100, index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'], columns=['A', 'B', 'C'])
print(data)
# Export it in different formats
data.to_csv('data.csv')
data.to_csv('data.txt', sep=';') # sep defaults to a comma when not specified; here we use ';'.
# Output (example):
# A B C
# a 22.874398 43.825654 13.928709
# b 32.045215 29.105177 68.247118
# c 27.921033 3.979053 4.287473
# d 17.862866 98.013949 88.642451
# e 91.404210 33.967303 17.501599
# f 38.233862 78.224419 54.417324
# g 78.381364 61.520635 59.949401
# h 4.025050 14.812711 72.599593
# i 96.191718 35.276491 98.322036
# j 45.067653 8.914086 12.149836
We can then use the read_csv function to read the above data. By specifying parameters, we can customize the reading. Here are some commonly used parameters:
sep: Specifies the field separator; default is a comma.
header: Specifies the row to use as column names; default is 0, meaning the first row.
index_col: Specifies the column to use as the row index (row names); default is None, meaning no index column is used.
skiprows: Specifies the number of rows to skip from the beginning; default is 0.
nrows: Specifies the number of rows to read; default is None, meaning read all rows.
python
df0 = pd.read_csv('data.csv', index_col=0, header=None, skiprows=1, nrows=5)
print(df0)
# Output (example):
# 1 2 3
# 0
# a 22.874398 43.825654 13.928709
# b 32.045215 29.105177 68.247118
# c 27.921033 3.979053 4.287473
# d 17.862866 98.013949 88.642451
# e 91.404210 33.967303 17.501599
When reading a txt file without specifying a separator, Pandas defaults to a comma, which may cause issues. Therefore, extra attention is needed for txt file reading.
python
df1 = pd.read_csv('data.txt') # May read incorrectly.
print(df1.head())
df1 = pd.read_csv('data.txt', sep=';', index_col=0) # Correct reading.
print(df1)
# Output (example - first read likely incorrect):
# ;A;B;C
# 0 a;22.87439762439186;43.82565359446042;13.92870...
# 1 b;32.04521536203912;29.105177171104458;68.2471...
# ...
# Output (correct read):
# A B C
# a 22.874398 43.825654 13.928709
# b 32.045215 29.105177 68.247118
# c 27.921033 3.979053 4.287473
# d 17.862866 98.013949 88.642451
# e 91.404210 33.967303 17.501599
# f 38.233862 78.224419 54.417324
# g 78.381364 61.520635 59.949401
# h 4.025050 14.812711 72.599593
# i 96.191718 35.276491 98.322036
# j 45.067653 8.914086 12.149836
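As mentioned earlier, txt files may also use tabs as separators, and the same sep mechanism handles them. Here is a minimal sketch reading tab-separated text from an in-memory string (the data and variable names are made up for illustration; StringIO stands in for a real .txt file):

```python
import pandas as pd
from io import StringIO

# Hypothetical tab-separated text; StringIO stands in for a real file on disk.
raw = "A\tB\tC\n1\t2\t3\n4\t5\t6\n"
df_tab = pd.read_csv(StringIO(raw), sep='\t')  # sep='\t' for tab-separated fields
print(df_tab)
# Output (example):
#    A  B  C
# 0  1  2  3
# 1  4  5  6
```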
Similarly, when exporting data with Pandas, we can specify parameters to control whether to output row/column names and other information. Here are some common parameters:
index: Whether to write row names; default is True.
header: Whether to write column names; default is True.
sep: Specifies the separator; default is a comma.
encoding: Specifies the encoding format.
mode: Specifies the write mode, e.g., 'w' for write (overwrite) and 'a' for append, consistent with the basic Python syntax for txt file I/O explained earlier.
python
df0.to_csv('data0.csv', index=False, header=True, mode='w', encoding='utf-8')
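The append mode mentioned above can be sketched as follows. Note that when appending we usually suppress the header so it is not written a second time (the file name and data here are arbitrary):

```python
import pandas as pd

df_a = pd.DataFrame({'A': [1.0, 2.0], 'B': [3.0, 4.0]})
df_b = pd.DataFrame({'A': [5.0], 'B': [6.0]})

df_a.to_csv('appended.csv', index=False, mode='w')                # overwrite
df_b.to_csv('appended.csv', index=False, mode='a', header=False)  # append rows only

combined = pd.read_csv('appended.csv')
print(combined)
# Output (example):
#      A    B
# 0  1.0  3.0
# 1  2.0  4.0
# 2  5.0  6.0
```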
Reading Excel Spreadsheets
Speaking of tables, how could we leave out Excel? Beyond the rich formatting the application itself provides, the most significant feature of Excel spreadsheets compared to csv and txt files is that a single file can contain multiple sheets.
Everything else works much the same; the only extra step when reading an Excel file with multiple sheets is specifying the sheet name. First, let's create an Excel file containing multiple sheets using the data above.
python
# Writing to an Excel file overwrites the existing file. When we need to write multiple
# sheets, we must write all of them before the file is closed; otherwise only the last
# sheet's content will be saved.
with pd.ExcelWriter('data.xlsx') as writer:
    df0.to_excel(writer, index=False, sheet_name='df0')
    df1.to_excel(writer, index=False, sheet_name='df1')
df0 = pd.read_excel('data.xlsx', sheet_name='df0')
print(df0)
df1 = pd.read_excel('data.xlsx', sheet_name='df1')
print(df1)
# Output (example):
# (df0)
# 1 2 3
# 0 22.874398 43.825654 13.928709
# 1 32.045215 29.105177 68.247118
# 2 27.921033 3.979053 4.287473
# 3 17.862866 98.013949 88.642451
# 4 91.404210 33.967303 17.501599
# (df1)
# A B C
# 0 22.874398 43.825654 13.928709
# 1 32.045215 29.105177 68.247118
# 2 27.921033 3.979053 4.287473
# 3 17.862866 98.013949 88.642451
# 4 91.404210 33.967303 17.501599
# 5 38.233862 78.224419 54.417324
# 6 78.381364 61.520635 59.949401
# 7 4.025050 14.812711 72.599593
# 8 96.191718 35.276491 98.322036
# 9 45.067653 8.914086 12.149836
Additionally, Pandas provides functions for reading hdf, html, json, and other formats. As they are less used in the geoscience domain, they won’t be elaborated here; we’ll explain them when encountered later.
Among these, hdf is a relatively common format for storing geospatial data, though we usually read it with dedicated third-party libraries. The json format often appears when storing coordinates, polygon vertices, or attribute data.
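For the json case, here is a minimal sketch of read_json with made-up station coordinates (wrapping the literal string in StringIO, as recent pandas versions recommend):

```python
import pandas as pd
from io import StringIO

# Hypothetical JSON records, e.g. station coordinates.
raw = ('[{"station": "S1", "lon": 116.4, "lat": 39.9},'
       ' {"station": "S2", "lon": 121.5, "lat": 31.2}]')
df = pd.read_json(StringIO(raw))  # one record per row, keys become columns
print(df)
```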
Postscript
The above covers the basics of data I/O with Pandas, a short piece compared to previous installments. Past writing and proofreading have suggested that overly long articles burden both the author and the readers, who must set aside a concentrated and not-so-short block of time.