A 1000x Faster Method for Reading Excel Files in Python

As a Python user, I use Excel files to load and store data, because business people like to share data in Excel or CSV format. Unfortunately, Python is particularly slow when handling Excel files.
In this article, I will share with you 5 methods for loading data in Python.
In the end, we will achieve a speedup of 3 orders of magnitude. It will be lightning-fast.
Experimental Setup
Let’s assume we want to load 10 Excel files with 20,000 rows and 25 columns each (totaling about 70 MB). This is a typical case where you wish to load transactional data from an ERP (like SAP) into Python for analysis.
Let’s generate this dummy data and import the required libraries (we will discuss pickle and joblib later in the article).
```python
import pandas as pd
import numpy as np
from joblib import Parallel, delayed
import time

for file_number in range(10):
    values = np.random.uniform(size=(20000, 25))
    pd.DataFrame(values).to_csv(f"Dummy {file_number}.csv")
    pd.DataFrame(values).to_excel(f"Dummy {file_number}.xlsx")  # requires openpyxl
    pd.DataFrame(values).to_pickle(f"Dummy {file_number}.pickle")
```
5 Methods for Loading Data in Python
1. Loading Excel Files in Python
Let’s start with the simple way of loading these files: we read the first Excel file into a Pandas DataFrame, then concatenate each remaining file onto it.
```python
start = time.time()
df = pd.read_excel("Dummy 0.xlsx")
for file_number in range(1, 10):
    # Concatenate each new file onto the running DataFrame
    # (DataFrame.append, often seen here, was removed in pandas 2.0)
    df = pd.concat([df, pd.read_excel(f"Dummy {file_number}.xlsx")])
end = time.time()
print("Excel:", end - start)
```
Excel: 53.4
This takes about 50 seconds to run, which is slow.
2. Loading CSVs in Python
Now, let’s assume we save these files from the ERP/system/SAP as .csv (instead of .xlsx).
```python
start = time.time()
df = pd.read_csv("Dummy 0.csv")
for file_number in range(1, 10):
    df = pd.concat([df, pd.read_csv(f"Dummy {file_number}.csv")])
end = time.time()
print("CSV:", end - start)
```
CSV: 0.632
We can now load these files in 0.63 seconds, which is almost 100 times faster: as a rule of thumb, Python loads CSV files about 100 times faster than Excel files.
Drawback: CSV files are almost always larger than .xlsx files. In this example, the .csv files are 9.5 MB, while the .xlsx files are only 6.4 MB.
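You can verify the on-disk sizes yourself with a few lines of Python (a quick sketch, assuming the dummy files generated above):

```python
import os

# Total on-disk size of the ten dummy files, per format
for ext in ("csv", "xlsx", "pickle"):
    total = sum(os.path.getsize(f"Dummy {i}.{ext}") for i in range(10))
    print(f".{ext}: {total / 1e6:.1f} MB")
```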
3. Creating Smarter Pandas DataFrames
We can speed up our process by changing how we create Pandas DataFrames.
Instead of appending each file to an existing DataFrame, we load each DataFrame independently into a list. Then we concatenate the entire list into one DataFrame.
```python
start = time.time()
df = []
for file_number in range(10):
    temp = pd.read_csv(f"Dummy {file_number}.csv")
    df.append(temp)
df = pd.concat(df, ignore_index=True)
end = time.time()
print("CSV2:", end - start)
```
CSV2: 0.619
We reduced the time by a few more percent. Each concatenation onto an existing DataFrame copies all the data accumulated so far, while building a list and calling pd.concat once copies everything a single time. In my experience, this trick becomes useful when you are dealing with larger DataFrames (well over 100 MB).
4. Parallelizing CSV Import Using Joblib
We want to load 10 files in Python. Instead of loading each file one by one, why not load several in parallel at once? We can easily do this using joblib.
```python
start = time.time()

def loop(file_number):
    return pd.read_csv(f"Dummy {file_number}.csv")

df = Parallel(n_jobs=-1, verbose=10)(delayed(loop)(file_number) for file_number in range(10))
df = pd.concat(df, ignore_index=True)
end = time.time()
print("CSV//:", end - start)
```
CSV//: 0.386
This is almost twice as fast as the single-core version. However, as a general rule, don’t expect an 8x speedup just because you use 8 cores (here, I got a 2x speedup with 8 cores on a MacBook Air with an M1 chip).
Joblib
If you are not familiar with joblib, it’s a simple Python library that lets you run a function in parallel. In practice, joblib works like a list comprehension, except that each iteration is executed by a different worker process (joblib uses separate processes by default, though it can also be configured to use threads). Here is an example:
```python
def loop(file_number):
    return pd.read_csv(f"Dummy {file_number}.csv")

df = Parallel(n_jobs=-1, verbose=10)(delayed(loop)(file_number) for file_number in range(10))

# Equivalent to:
# df = [loop(file_number) for file_number in range(10)]
```
5. Storing Pickle Files for Speed
You can load data much faster by storing it in pickle files (Python’s native serialization format) instead of .csv files.
Drawback: You won’t be able to manually open a pickle file and see what’s inside.
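You can, however, still inspect a pickle from within Python itself:

```python
import pandas as pd

# Peek at the first rows of a pickled DataFrame
print(pd.read_pickle("Dummy 0.pickle").head())
```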
```python
start = time.time()

def loop(file_number):
    return pd.read_pickle(f"Dummy {file_number}.pickle")

df = Parallel(n_jobs=-1, verbose=10)(delayed(loop)(file_number) for file_number in range(10))
df = pd.concat(df, ignore_index=True)
end = time.time()
print("Pickle//:", end - start)
```
Pickle//: 0.072
We just reduced the runtime by 80%!
Reading pickle files is typically much faster than reading CSV files. On the other hand, pickle files usually take up more space on your drive (though not in this specific example), and, in practice, you won’t be able to extract data from a system directly into a pickle file.
I suggest using Pickles in the following two cases:
- You want to save data from a Python process (and you don’t plan to open it in Excel) to use it later/in another process. Save your DataFrame as a pickle instead of .csv.
- You need to reload the same files multiple times. The first time you open a file, save it as a pickle so you can load the pickle version directly next time.
Example: Suppose you work with monthly transactional data (loading a new month’s data each month). You could save all historical data as .pickle, and each time you receive a new file, you could load it as .csv once, then save it as .pickle for next time.
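Here is a minimal sketch of that caching pattern (the load_cached helper is my own name, not part of the code above): read the pickle if it already exists, otherwise read the CSV once and cache it.

```python
import os
import pandas as pd

def load_cached(file_number):
    """Return the cached pickle if it exists; otherwise read the CSV and cache it."""
    pickle_path = f"Dummy {file_number}.pickle"
    if os.path.exists(pickle_path):
        return pd.read_pickle(pickle_path)
    temp = pd.read_csv(f"Dummy {file_number}.csv")
    temp.to_pickle(pickle_path)  # cached for the next run
    return temp

df = pd.concat([load_cached(i) for i in range(10)], ignore_index=True)
```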
Bonus: Loading Excel Files in Parallel
Suppose you receive Excel files and have no choice but to load them as-is. You can also use joblib to parallelize this. Compared to the pickle code above, we only need to update the loop function.
```python
start = time.time()

def loop(file_number):
    return pd.read_excel(f"Dummy {file_number}.xlsx")

df = Parallel(n_jobs=-1, verbose=10)(delayed(loop)(file_number) for file_number in range(10))
df = pd.concat(df, ignore_index=True)
end = time.time()
print("Excel//:", end - start)
```
Excel//: 13.45
We reduce the loading time by about 75% (from 53 seconds down to 13).
You can also use this loop to dynamically create pickle files. Therefore, the next time you load these files, you’ll be able to achieve lightning-fast loading speeds.
```python
def loop(file_number):
    temp = pd.read_excel(f"Dummy {file_number}.xlsx")
    temp.to_pickle(f"Dummy {file_number}.pickle")  # cache a pickle copy for next time
    return temp
```
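The Parallel call itself stays exactly the same as before; the pickles are written as a side effect while the Excel files load:

```python
df = Parallel(n_jobs=-1, verbose=10)(delayed(loop)(file_number) for file_number in range(10))
df = pd.concat(df, ignore_index=True)
```

On the next run, point loop back at pd.read_pickle and you get the 0.07-second load time from method 5.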