
As a Python user, I use Excel files to load/store data because business people like to share data in Excel or CSV formats. Unfortunately, Python is particularly slow when processing Excel files.
In this article, I will share five methods to load data in Python.
By the end, we will achieve a roughly three-order-of-magnitude speedup. It will be lightning-fast.
Experimental setup
Suppose we need to load 10 Excel files with 20,000 rows and 25 columns each (roughly 70 MB in total). This is a typical scenario: loading transaction data from an ERP system (SAP, for example) into Python for analysis.
Let’s populate this dummy data and import the required libraries (we will discuss pickle and joblib later in this article).
import pandas as pd
import numpy as np
from joblib import Parallel, delayed
import time

for file_number in range(10):
    values = np.random.uniform(size=(20000, 25))
    pd.DataFrame(values).to_csv(f"Dummy {file_number}.csv")
    pd.DataFrame(values).to_excel(f"Dummy {file_number}.xlsx")
    pd.DataFrame(values).to_pickle(f"Dummy {file_number}.pickle")
5 Methods to Load Data in Python
1. Loading Excel files in Python
Let’s start with a simple way to load these files. We’ll create the first Pandas DataFrame and then append each Excel file to it.
start = time.time()
df = pd.read_excel("Dummy 0.xlsx")
for file_number in range(1, 10):
    # append returns a new DataFrame, so we must reassign
    df = df.append(pd.read_excel(f"Dummy {file_number}.xlsx"))
end = time.time()
print("Excel:", end - start)
>> Excel: 53.4
It takes about 50 seconds to run, which is very slow.
2. Loading CSV in Python
Now, let’s assume we save these files from ERP/System/SAP as .csv (instead of .xlsx).
start = time.time()
df = pd.read_csv("Dummy 0.csv")
for file_number in range(1, 10):
    df = df.append(pd.read_csv(f"Dummy {file_number}.csv"))
end = time.time()
print("CSV:", end - start)
>> CSV: 0.632
We can now load these files in 0.63 seconds, about 85 times faster!
Python loads CSV files 100 times faster than Excel files, so use CSV.
Disadvantage: CSV files are almost always larger than .xlsx files. In this example, each .csv file is 9.5 MB, while the .xlsx is 6.4 MB.
3. Create Smarter Pandas DataFrames
We can speed up our process by changing the way we create pandas DataFrames.
Instead of appending each file to an existing DataFrame:
- We load each file into its own DataFrame and collect them in a list.
- Then we concatenate the entire list into a single DataFrame.
start = time.time()
df = []
for file_number in range(10):
    temp = pd.read_csv(f"Dummy {file_number}.csv")
    df.append(temp)
df = pd.concat(df, ignore_index=True)
end = time.time()
print("CSV2:", end - start)
>> CSV2: 0.619
We only shaved off a few percent here. In my experience, this trick becomes valuable when you deal with larger DataFrames (df >> 100 MB).
4. Parallelize CSV import using Joblib
We want to load 10 files in Python. Instead of loading each file one by one, why not load them all in parallel at once?
We can easily do this using joblib.
start = time.time()
def loop(file_number):
    return pd.read_csv(f"Dummy {file_number}.csv")
df = Parallel(n_jobs=-1, verbose=10)(delayed(loop)(file_number) for file_number in range(10))
df = pd.concat(df, ignore_index=True)
end = time.time()
print("CSV//:", end - start)
>> CSV//: 0.386
This is about 1.6 times faster than the single-core version. However, as a general rule, don't expect your process to speed up 8 times just because you use 8 cores (here, the 8 cores of an M1 MacBook Air gave a 1.6x speedup).
Joblib
If you are not familiar with joblib: it is a simple Python library for running a function in parallel. In practice, joblib is used like a list comprehension, except that each iteration is executed by a different worker (a separate process, with the default backend). Here is an example.
def loop(file_number):
    return pd.read_csv(f"Dummy {file_number}.csv")

df = Parallel(n_jobs=-1, verbose=10)(delayed(loop)(file_number) for file_number in range(10))

# equivalent to
df = [loop(file_number) for file_number in range(10)]
5. Even faster with Pickle files
You can go faster still by storing data in pickle files, a Python-specific serialization format, instead of .csv files.
Disadvantage: You will not be able to manually open the pickle file and view its contents.
start = time.time()
def loop(file_number):
    return pd.read_pickle(f"Dummy {file_number}.pickle")
df = Parallel(n_jobs=-1, verbose=10)(delayed(loop)(file_number) for file_number in range(10))
df = pd.concat(df, ignore_index=True)
end = time.time()
print("Pickle//:", end - start)
>> Pickle//: 0.072
We just reduced the running time by 80%!
In general, reading pickle files is much faster than reading CSV files. On the other hand, pickle files usually take up more space on your drive (though not in this specific example).
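If you want to check this trade-off on your own data, you can compare the on-disk size of the two formats directly. Here is a minimal sketch (the size_demo.* file names are placeholders made up for illustration):

```python
import os

import numpy as np
import pandas as pd

# Write the same DataFrame in both formats and compare sizes on disk.
df = pd.DataFrame(np.random.uniform(size=(20000, 25)))
df.to_csv("size_demo.csv")
df.to_pickle("size_demo.pickle")

csv_size = os.path.getsize("size_demo.csv")
pickle_size = os.path.getsize("size_demo.pickle")
print(f"CSV:    {csv_size / 1e6:.1f} MB")
print(f"Pickle: {pickle_size / 1e6:.1f} MB")
```

For dense numeric data like this, pickle is actually the smaller file (each float costs 8 bytes, while CSV spells it out in roughly 18 characters); with mostly short text columns, the ranking can flip.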
And, in practice, you will not be able to extract data from a business system directly as pickle files.
I recommend using Pickles in the following two cases:
You want to save data from a Python process (and you don’t plan to open it in Excel) for use later or in another process. Save your dataframe as a pickle instead of a .csv
You need to reload the same file multiple times. When you open the file for the first time, save it as a pickle so that you can directly load the pickle version next time.
Example: suppose you work with monthly transaction data (a new file arrives each month). You can keep all historical data as .pickle, and each time you receive a new file, load it once as .csv, then save it as .pickle for the next use.
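This load-once-then-cache pattern can be sketched as a small helper. This is a sketch under the article's "Dummy {n}" naming convention; the function name load_cached is my own, not from the article:

```python
import os

import pandas as pd

def load_cached(file_number):
    """Read "Dummy {n}.csv" once, then reuse a pickle copy on later calls."""
    pickle_path = f"Dummy {file_number}.pickle"
    if os.path.exists(pickle_path):
        # Fast path: a cached pickle copy already exists.
        return pd.read_pickle(pickle_path)
    # Slow path: first load from CSV, then cache as pickle.
    df = pd.read_csv(f"Dummy {file_number}.csv")
    df.to_pickle(pickle_path)
    return df
```

Only the first call pays the CSV price; every later call hits the pickle and is an order of magnitude faster.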
Bonus: Parallel loading of Excel files
Suppose you receive Excel files and have no choice but to load them as they are. You can also use joblib to parallelize this process. Compared with the pickle code above, we only need to update the loop function.
start = time.time()
def loop(file_number):
    return pd.read_excel(f"Dummy {file_number}.xlsx")
df = Parallel(n_jobs=-1, verbose=10)(delayed(loop)(file_number) for file_number in range(10))
df = pd.concat(df, ignore_index=True)
end = time.time()
print("Excel//:", end - start)
>> Excel//: 13.45
We reduced the loading time by about 75% (from 53 seconds to 13 seconds).
You can also use this loop to create pickle files on the fly. That way, the next time you load these files, loading will be lightning-fast.
def loop(file_number):
    temp = pd.read_excel(f"Dummy {file_number}.xlsx")
    temp.to_pickle(f"Dummy {file_number}.pickle")
    return temp



