Python Office Automation with Word — How to use python-docx
1. Introduction
The previous two articles summarized how to read and write Word documents. This article supplements the Word automation series with several practical office scenarios:
- Header and footer processing
- Merging multiple documents
- Adding page numbers
- Batch conversion of .doc to .docx
- Comparing document differences
- Special content annotation
- Replacing text content
2. Headers and Footers
Every section of a document has its own header and footer. By default the same header and footer appear on every page of the section; the different_first_page_header_footer property of the section object controls whether the first page is treated differently.
- When set to True: the first page of the section displays its own header and footer (exposed as first_page_header and first_page_footer), distinct from the remaining pages
- When set to False: all pages of the section share the same header and footer
# 1. Get the header and footer of the first section
header = self.doc.sections[0].header
footer = self.doc.sections[0].footer
# True: the first page of this section displays a distinct header and footer
# False: all pages of this section share the same header and footer
self.doc.sections[0].different_first_page_header_footer = True
There are two types of header/footer additions: normal headers/footers and custom-styled headers/footers.
2.1 Normal Headers/Footers
from docx.enum.text import WD_PARAGRAPH_ALIGNMENT

def add_norm_header_and_footer(header, footer, header_content, footer_content):
    """
    Add a normal header and footer, center-aligned
    :param header: header object of a section
    :param footer: footer object of a section
    :param header_content: header text
    :param footer_content: footer text
    :return:
    """
    # Add/modify header and footer
    # Note: a header or footer typically contains only one paragraph
    header.paragraphs[0].text = header_content
    footer.paragraphs[0].text = footer_content
    # Center alignment
    header.paragraphs[0].alignment = WD_PARAGRAPH_ALIGNMENT.CENTER
    footer.paragraphs[0].alignment = WD_PARAGRAPH_ALIGNMENT.CENTER

# 2. Add header and footer
# 2.1 Normal headers/footers
add_norm_header_and_footer(header, footer, "I am a header", "I am a footer")
2.2 Custom-styled Headers/Footers
def add_custom_style_header_and_footer(header, footer, header_content, footer_content, style):
    """
    Add custom styled headers/footers
    :param header: header object of a section
    :param footer: footer object of a section
    :param header_content: header text
    :param footer_content: footer text
    :param style: character style applied to the added runs
    :return:
    """
    # Note: the style must be a character style (style_type=2), otherwise an error occurs
    header.paragraphs[0].add_run(header_content, style)
    footer.paragraphs[0].add_run(footer_content, style)

# 2.2 Headers/footers with custom styles
# Create a style
style_paragraph = create_style(document=self.doc, style_name="style5", style_type=2, font_size=30,
                               font_color=[0xff, 0x00, 0x00], align=WD_PARAGRAPH_ALIGNMENT.CENTER)
add_custom_style_header_and_footer(header, footer, "I am header 2", "I am footer 2", style_paragraph)
To remove all headers and footers from a document, follow these 2 steps:
- Iterate through all sections and set different_first_page_header_footer to False
- Set the is_linked_to_previous property of each section's header and footer to True
def remove_all_header_and_footer(doc):
    """
    Remove all headers and footers from document
    :param doc: Document object
    :return:
    """
    for section in doc.sections:
        section.different_first_page_header_footer = False
        # Setting is_linked_to_previous to True deletes the section's own header/footer definition
        section.header.is_linked_to_previous = True
        section.footer.is_linked_to_previous = True
3. Merging Multiple Documents
In daily work, we often need to merge multiple Word documents into one file. The third-party library docxcompose handles this:
# Dependency library for merging files
# pip3 install docxcompose
from docx import Document
from docxcompose.composer import Composer

def compose_files(self, files, output_file_path):
    """
    Merge multiple Word files into one
    :param files: List of files to merge
    :param output_file_path: New file path
    :return:
    """
    # Start from an empty document and append each file in order
    composer = Composer(Document())
    for file in files:
        composer.append(Document(file))
    # Save to new file
    composer.save(output_file_path)
4. Adding Page Numbers
We often need to add page numbers in document footers, but python-docx doesn’t provide a built-in method. However, we found an implementation on StackOverflow:
from docx.oxml import OxmlElement, ns
from docx.enum.text import WD_PARAGRAPH_ALIGNMENT

def create_element(self, name):
    return OxmlElement(name)

def create_attribute(self, element, name, value):
    element.set(ns.qn(name), value)

def add_page_number(self, run):
    """
    Add page number
    :param run: run that receives the PAGE field
    :return:
    """
    # A Word field consists of a begin marker, the instruction text, and an end marker
    fldChar1 = self.create_element('w:fldChar')
    self.create_attribute(fldChar1, 'w:fldCharType', 'begin')
    instrText = self.create_element('w:instrText')
    self.create_attribute(instrText, 'xml:space', 'preserve')
    instrText.text = "PAGE"
    fldChar2 = self.create_element('w:fldChar')
    self.create_attribute(fldChar2, 'w:fldCharType', 'end')
    # run._r is a docx.oxml.text.run.CT_R element
    run._r.append(fldChar1)
    run._r.append(instrText)
    run._r.append(fldChar2)
By default, page numbers appear in the bottom-left corner of the footer, which isn’t aesthetically pleasing. We can use methods from the first article to create a “text run style” and add it as a Run to the footer’s first paragraph.
# Note: To set header/footer alignment, must set it on the paragraph (cannot set alignment on text runs)
doc.sections[0].footer.paragraphs[0].alignment = WD_PARAGRAPH_ALIGNMENT.CENTER
# Create a text run style with specified font name, size, color
style = create_style(document=doc, style_name="style", style_type=2, font_size=10,
font_color=[0x00, 0x00, 0x00], font_name="Heiti")
self.add_page_number(doc.sections[0].footer.paragraphs[0].add_run("", style))
doc.save("./output.docx")
print('Page numbers added successfully!')
Note: If you need to set the alignment of page numbers, you must set it on the footer paragraph by modifying its alignment property.
5. .doc to .docx Conversion
python-docx only reads the .docx format and cannot open legacy .doc files. To handle such documents, we need to convert them to .docx format first.
For Windows systems, we can use the win32com module to call the Word application, open the source file, and save it as a .docx file.
import os
import shutil
from win32com import client

def doc_to_docx_in_win(path_raw, path_output):
    """
    Convert .doc to .docx (Windows)
    :param path_raw: path of the source file
    :param path_output: path of the generated .docx file
    :return:
    """
    # Get file extension
    file_suffix = os.path.splitext(path_raw)[1]
    if file_suffix == ".doc":
        word = client.Dispatch('Word.Application')
        # Source file
        doc = word.Documents.Open(path_raw)
        # Save as a new file; 16 is Word's wdFormatDocumentDefault (.docx)
        doc.SaveAs(path_output, 16)
        doc.Close()
        word.Quit()
    elif file_suffix == ".docx":
        # Already .docx: just copy it to the output path
        shutil.copy(path_raw, path_output)
For Mac/Linux, LibreOffice is recommended for document format conversion:
# Format conversion
./soffice --headless --convert-to docx source_file.doc --outdir /output/path/
LibreOffice is a free, open-source, cross-platform office suite developed by its community. It ships with the soffice command, which can convert files between formats.
Taking Mac OS as an example, follow these steps:
- Download and install LibreOffice from the official website
- Find the LibreOffice installation directory and add the soffice command directory to PATH
- Restart PyCharm
- Use the walk() function from the os module to traverse all source files and form soffice conversion commands
- Execute conversion commands
import os

source = "./doc/"
dest = "./docx/"
g = os.walk(source)
# Traverse directory
for root, dirs, files in g:
    for file in files:
        # Full path of source file
        file_path_raw = os.path.join(root, file)
        print(file_path_raw)
        # Quote the paths so that file names containing spaces still work
        os.system('soffice --headless --convert-to docx "{}" --outdir "{}"'.format(file_path_raw, dest))
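As an alternative to building a shell string, the conversion can be driven through subprocess with an argument list, which sidesteps shell quoting entirely. This is a sketch under assumptions: the helper names build_convert_command and convert_all are invented here, and soffice is assumed to be on PATH as described in the steps above.

```python
import os
import subprocess


def build_convert_command(source_file, out_dir, soffice="soffice"):
    """Return the soffice argument list for one .doc -> .docx conversion."""
    return [soffice, "--headless", "--convert-to", "docx",
            source_file, "--outdir", out_dir]


def convert_all(source_dir, out_dir):
    """Convert every .doc file under source_dir; returns the files handled."""
    converted = []
    for root, _dirs, files in os.walk(source_dir):
        for name in files:
            if not name.lower().endswith(".doc"):
                continue  # skip .docx and unrelated files
            path = os.path.join(root, name)
            # Passing a list avoids shell quoting problems with spaces
            subprocess.run(build_convert_command(path, out_dir), check=True)
            converted.append(path)
    return converted
```

Calling convert_all("./doc/", "./docx/") would then mirror the loop above; check=True raises an exception if soffice reports a failure.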
6. Comparing Document Differences
Comparing two Word documents is another common work requirement.
First, iterate through all paragraphs in the documents, filter out empty lines, and get all text content:
# file1 and file2 are the Document objects of the two files being compared
# Get paragraph content separately
content1 = ''
content2 = ''
for paragraph in file1.paragraphs:
    if "" == paragraph.text.strip():
        continue
    content1 += paragraph.text + '\n'
for paragraph in file2.paragraphs:
    if "" == paragraph.text.strip():
        continue
    content2 += paragraph.text + '\n'
# If parameter keepends is False, line breaks are not included; if True, line breaks are preserved.
print("First document data:\n", content1.splitlines(keepends=False))
print("Second document data:\n", content2.splitlines(keepends=False))
Then, use Python’s standard library difflib to compare text differences and generate an HTML difference report:
import codecs
from difflib import HtmlDiff
# Difference content
diff_html = HtmlDiff(wrapcolumn=100).make_file(content1.split("\n"), content2.split("\n"))
# Write to file
with codecs.open('./diff_result.html', 'w', encoding='utf-8') as f:
    f.write(diff_html)
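When an HTML report is more than is needed, difflib can also produce a plain-text diff for a quick look in the console. The sketch below uses unified_diff on two hand-written line lists that stand in for the extracted paragraph text; the sample lines and file labels are made up.

```python
from difflib import unified_diff

# Stand-ins for the paragraph lines extracted from the two documents
old_lines = ["Chapter 1", "Python is easy", "The end"]
new_lines = ["Chapter 1", "Python is powerful", "The end"]

diff_lines = list(unified_diff(old_lines, new_lines,
                               fromfile="file1.docx", tofile="file2.docx",
                               lineterm=""))
print("\n".join(diff_lines))
```

Lines prefixed with - exist only in the first document, and lines prefixed with + only in the second.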
7. Special Content Annotation
We often need to specially annotate important content in documents. For example, we might want to highlight text runs or cells containing “WeChat” in red and bold.
7.1 Paragraph Content
Simply iterate through all text runs in paragraphs and directly modify the Run’s Font properties:
from docx import Document
from docx.shared import RGBColor

doc = Document(file)
# Highlight keyword text runs or cells in red and bold
# 1. Modify styles of text runs containing keywords in paragraphs
for paragraph in doc.paragraphs:
    for run in paragraph.runs:
        if keyword in run.text:
            # Change color to red and display in bold
            run.font.bold = True
            run.font.color.rgb = RGBColor(255, 0, 0)
7.2 Table Content
Setting styles for cells that meet conditions is somewhat special and requires 4 steps:
- Get cell object, get cell text content, and temporarily save it
- Clear cell data
- Append a paragraph and a text run to the cell object, returning a text run object
- Set the text run object style to red and bold
for table in doc.tables:
    for row in table.rows:
        for cell in row.cells:
            if keyword in cell.text:
                # Original content
                content_raw = cell.text
                # Clear cell data
                cell.text = ""
                # Append data and set style
                run = cell.paragraphs[0].add_run(content_raw)
                run.font.color.rgb = RGBColor(255, 0, 0)
                run.font.bold = True
8. Replacing Text Content
Sometimes we need to replace all occurrences of a keyword in a document with new content. We can iterate through all paragraphs and tables, using the replace() function to replace paragraph text and cell content.
def replace_content(self, old_content, new_content):
    """
    Replace all content in document
    :param old_content: Old content
    :param new_content: New content
    :return:
    """
    # Replace paragraphs
    # Note: assigning paragraph.text replaces all runs with a single run,
    # so character formatting inside the paragraph is lost
    for paragraph in self.doc.paragraphs:
        if old_content in paragraph.text:
            # Replace content and reset
            paragraph.text = paragraph.text.replace(old_content, new_content)
    # Replace tables
    # document.tables[table_index].rows[row_index].cells[cell_column_index].text = "new data"
    for table in self.doc.tables:
        for row in table.rows:
            for cell in row.cells:
                if old_content in cell.text:
                    # Reset cell content
                    cell.text = cell.text.replace(old_content, new_content)
    # Save to new file
    self.doc.save('./new.docx')
9. Conclusion
This concludes the Python Word Automation series! If you encounter other business scenarios in actual work that aren’t covered in these articles, please leave comments below. Corresponding solutions may be provided in future Office Automation Practice articles!