We have two Excel files: one is the main data sheet, and the other is a complete information sheet for students/employees/customers.
The goal is to match records by name and supplement the complete information into the main sheet.
It sounds simple, but several common pitfalls can occur in practice, such as:
- The main sheet is in XLS format, while the information sheet is in XLSX format.
- The headers in either sheet are not in the first row.
- The first column of the information sheet is an index/sequence number, with the real data starting from the second column.
- After merging, new fields need to be inserted at a specific position, not simply appended to the end.
This article shares a solution from a real project:
Using Python + Pandas + xlrd + openpyxl to implement:
- Reading multiple Excel formats
- Flexible header row handling
- Matching and merging by name
- Outputting results with a custom column order.
1. Requirement Breakdown
The final objectives are:
- Read both XLS and XLSX files.
- Support skipping header rows.
- Support skipping the first column (commonly used for sequence numbers).
- Perform matching based on the “Name” field.
- Insert the matched results after a specified column (e.g., Column I) in the main sheet.
- Output a new XLSX file.
To achieve this, we need to write a “Universal Excel Reading Function”.
2. Universal Excel Reading Function (Supports XLS / XLSX)
Core challenges:
- Pandas cannot directly read old .xls format files with merged cells or special formats.
- xlrd can only read .xls files (newer versions do not support .xlsx).
Therefore, the strategy is:
- For .xls files → use xlrd.
- For .xlsx files → use Pandas (with openpyxl engine).
Manually implement features like skipping rows and skipping the first column.
3. Overall Process for Cross-Sheet Matching
The overall data processing logic is as follows:
- Read the main sheet (supports both XLS/XLSX).
- Read the information sheet (e.g., where the real header starts from the second row).
- Set the “Name” column as the index for the information sheet.
- Iterate through each row in the main sheet, retrieving corresponding information based on the name.
- Combine the matched results into a new DataFrame.
- Insert this new DataFrame after the specified column in the main sheet.
- Output the final merged sheet.
This process is robust and adaptable to various Excel formats.
4. Complete Example Code (Desensitized)
Here is the universal example code, which you can reuse directly:
python
import pandas as pd
import xlrd
import openpyxl
def read_excel_any(path, sheet_name=None, skip_header_rows=0, skip_first_column=False):
"""
Universal function to read both XLS and XLSX files.
- skip_header_rows: Skip the first N rows.
- skip_first_column: Whether to skip the first column (e.g., index/sequence number).
"""
if path.lower().endswith(".xls"):
book = xlrd.open_workbook(path)
sheet = book.sheet_by_name(sheet_name) if sheet_name else book.sheet_by_index(0)
data = []
for r in range(sheet.nrows):
if r < skip_header_rows:
continue
row = sheet.row_values(r)
if skip_first_column:
row = row[1:]
data.append(row)
df = pd.DataFrame(data[1:], columns=data[0])
return df
else: # XLSX
df = pd.read_excel(
path,
sheet_name=sheet_name,
engine="openpyxl",
skiprows=skip_header_rows
)
if skip_first_column:
df = df.iloc[:, 1:]
return df
# ===== Configuration Area (Example) ===== #
main_file = "main_sheet.xls"
main_sheet = "DataSheet"
info_file = "info_sheet.xlsx"
info_sheet = "CompleteInfo"
info_skip_header = 1 # Skip 1 header row in info sheet
info_skip_first_col = True # Skip the first column in info sheet
# ======================================== #
# Read the main sheet
df_main = read_excel_any(main_file, sheet_name=main_sheet)
# Read the information sheet
df_info = read_excel_any(
info_file,
sheet_name=info_sheet,
skip_header_rows=info_skip_header,
skip_first_column=info_skip_first_col
)
# Ensure both sheets have a "Name" column
if "Name" not in df_main.columns or "Name" not in df_info.columns:
raise ValueError('Both sheets must contain a "Name" column!')
# Set the index of the information sheet to "Name"
df_info_index = df_info.set_index("Name")
# Perform matching by name
matched_data = []
for name in df_main["Name"]:
if name in df_info_index.index:
matched_data.append(df_info_index.loc[name].to_dict())
else:
# If the name doesn't exist, fill with None/blank
matched_data.append({col: None for col in df_info.columns if col != "Name"})
df_match = pd.DataFrame(matched_data)
# Insert matched content at the specified position (Example: after column 8, i.e., index 8)
insert_pos = 8
cols_main = df_main.columns.tolist()
new_cols = cols_main[:insert_pos + 1] + df_match.columns.tolist() + cols_main[insert_pos + 1:]
df_out = pd.concat([df_main, df_match], axis=1)[new_cols]
df_out.to_excel("output_result.xlsx", index=False)
5. Explanation of Key Technical Points
① Significance of Encapsulating the Universal Reading Function
In many projects, the main sheet and information sheet have different formats, structures, and column layouts.
After encapsulating the universal function:
- The script becomes highly reusable.
- Switching Excel files only requires modifying the filenames.
- The structure is clearer and easier to maintain.
② Using set_index for Matchingdf_info.set_index("Name") creates an index on the “Name” column.
This makes lookups extremely efficient, similar to dictionary queries.
③ Maintaining the Original Field Order
Many business sheets are “fixed templates” where column order cannot be arbitrarily changed.
By customizing new_cols, we ensure the final column order is fully controllable.
6. Example of the Final Output
The final output Excel file will:
- Retain the original column order of the main sheet.
- Insert the fields from the information sheet after the specified column (e.g., Column I).
- Perform row-by-row matching based on the “Name” field.
- Fill with blanks if a name is not found in the information sheet.
The entire process is automated, eliminating the need for manual filtering, copying, and pasting.