Python Office Automation with Word — How to use python-docx

1. Introduction
The previous two articles covered reading and writing Word documents in detail. This article supplements the Word automation series with several practical office scenarios:

Header and footer processing
Merging multiple documents
Adding page numbers
Batch conversion of .doc to .docx
Comparing document differences
Special content annotation
Replacing text content

2. Headers and Footers
A document is divided into sections, and each section has its own header and footer. Whether the first page of a section gets a header and footer distinct from the remaining pages is controlled by the section object's different_first_page_header_footer property.

When set to True: the first page of the section displays its own header/footer, separate from the other pages
When set to False: all pages in the section share the same header/footer

# 1. Get the header and footer of the first section
header = self.doc.sections[0].header
footer = self.doc.sections[0].footer

# True: the first page of this section displays a distinct header and footer
# False: all pages of this section share the same header and footer
self.doc.sections[0].different_first_page_header_footer = True

There are two types of header/footer additions: normal headers/footers and custom-styled headers/footers.

2.1 Normal Headers/Footers

from docx.enum.text import WD_PARAGRAPH_ALIGNMENT

def add_norm_header_and_footer(header, footer, header_content, footer_content):
    """
    Add a plain header and footer, center-aligned
    :param header: Header object of a section
    :param footer: Footer object of a section
    :param header_content: Header text
    :param footer_content: Footer text
    :return:
    """
    # Set the header and footer text
    # Note: a header or footer typically contains only one paragraph
    header.paragraphs[0].text = header_content
    footer.paragraphs[0].text = footer_content

    # Center alignment
    header.paragraphs[0].alignment = WD_PARAGRAPH_ALIGNMENT.CENTER
    footer.paragraphs[0].alignment = WD_PARAGRAPH_ALIGNMENT.CENTER

# 2. Add header
# 2.1 Normal headers/footers
add_norm_header_and_footer(header, footer, "I am a header", "I am a footer")

2.2 Custom-styled Headers/Footers

def add_custom_style_header_and_footer(header, footer, header_content, footer_content, style):
    """
    Add header and footer text with a custom style
    :param header: Header object of a section
    :param footer: Footer object of a section
    :param header_content: Header text
    :param footer_content: Footer text
    :param style: Character style applied to the added runs
    :return:
    """
    # Note: add_run() needs a character style (style_type=2); any other style type raises an error
    header.paragraphs[0].add_run(header_content, style)
    footer.paragraphs[0].add_run(footer_content, style)

# 2.2 Headers/footers with custom styles
# Create a character style (style_type=2) with the create_style helper (see the first article in this series)
style_run = create_style(document=self.doc, style_name="style5", style_type=2, font_size=30,
                         font_color=[0xff, 0x00, 0x00], align=WD_PARAGRAPH_ALIGNMENT.CENTER)
add_custom_style_header_and_footer(header, footer, "I am header 2", "I am footer 2", style_run)

To remove all headers and footers from a document, follow these 2 steps:

Iterate through all page sections and set different_first_page_header_footer to False
Set the section header/footer’s is_linked_to_previous property to True

def remove_all_header_and_footer(doc):
    """
    Remove all headers and footers from document
    :param doc:
    :return:
    """
    for section in doc.sections:
        section.different_first_page_header_footer = False
        # Setting is_linked_to_previous to True removes this section's own header/footer definition
        section.header.is_linked_to_previous = True
        section.footer.is_linked_to_previous = True

3. Merging Multiple Documents
In daily work, we often need to merge multiple Word documents into one file. This can be done with a separate Python library, docxcompose, which builds on python-docx.

# Library for merging files
# pip3 install docxcompose

from docx import Document
from docxcompose.composer import Composer

def compose_files(self, files, output_file_path):
    """
    Merge multiple Word files into one
    :param files: List of files to merge
    :param output_file_path: New file path
    :return:
    """
    composer = Composer(Document())
    for file in files:
        composer.append(Document(file))

    # Save to new file
    composer.save(output_file_path)
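
The files argument is just a list of paths. One way to build it, sketched below with only the standard library, is to collect every .docx file in a folder in sorted order. The function name collect_docx_files and the rule of skipping names that start with "~$" (the lock files Word creates next to an open document) are assumptions made for this example, not part of docxcompose.

```python
from pathlib import Path

def collect_docx_files(folder):
    """Return the .docx files in `folder`, sorted by file name."""
    paths = []
    for path in sorted(Path(folder).iterdir()):
        # Skip sub-folders and anything that is not a .docx file
        if not path.is_file() or path.suffix.lower() != ".docx":
            continue
        # Skip Word's temporary lock files such as "~$report.docx"
        if path.name.startswith("~$"):
            continue
        paths.append(str(path))
    return paths
```

The returned list can then be passed as the files argument of compose_files.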

4. Adding Page Numbers
python-docx does not provide a built-in method for adding page numbers to a footer. The following implementation, adapted from a Stack Overflow answer, inserts a PAGE field into a run:

from docx.oxml import OxmlElement
from docx.enum.text import WD_PARAGRAPH_ALIGNMENT
from docx.oxml import ns

def create_element(self, name):
    return OxmlElement(name)

def create_attribute(self, element, name, value):
    element.set(ns.qn(name), value)

def add_page_number(self, run):
    """
    Add page number
    :param run:
    :return:
    """
    fldChar1 = self.create_element('w:fldChar')
    self.create_attribute(fldChar1, 'w:fldCharType', 'begin')

    instrText = self.create_element('w:instrText')
    self.create_attribute(instrText, 'xml:space', 'preserve')
    instrText.text = "PAGE"

    fldChar2 = self.create_element('w:fldChar')
    self.create_attribute(fldChar2, 'w:fldCharType', 'end')

    # run._r is the underlying docx.oxml.text.run.CT_R element
    run._r.append(fldChar1)
    run._r.append(instrText)
    run._r.append(fldChar2)

By default, the page number is left-aligned in the footer, which isn’t aesthetically pleasing. To center it, set the alignment on the footer paragraph. To control its font, use the create_style method from the first article to create a “text run style” and apply it to the Run that carries the page number in the footer’s first paragraph.

# Note: To set header/footer alignment, must set it on the paragraph (cannot set alignment on text runs)
doc.sections[0].footer.paragraphs[0].alignment = WD_PARAGRAPH_ALIGNMENT.CENTER

# Create a text run style with specified font name, size, color
style = create_style(document=doc, style_name="style", style_type=2, font_size=10,
                     font_color=[0x00, 0x00, 0x00], font_name="Heiti")
self.add_page_number(doc.sections[0].footer.paragraphs[0].add_run("", style))
doc.save("./output.docx")
print('Page numbers added successfully!')

Note: If you need to set the alignment of page numbers, you must set it on the footer paragraph by modifying its alignment property.

5. .doc to .docx Conversion
python-docx cannot open documents in the legacy .doc format. To handle such documents, we need to convert them to .docx format first.

On Windows, we can use the win32com module (part of the pywin32 package) to call the Word application, open the source file, and save it as a .docx file.

import os
import shutil

from win32com import client

def doc_to_docx_in_win(path_raw, path_output):
    """
    Convert .doc to .docx (Windows)
    :param path_raw: Path of the source file (an absolute path is safest)
    :param path_output: Path of the generated .docx file (an absolute path is safest)
    :return:
    """
    # Get file extension
    file_suffix = os.path.splitext(path_raw)[1]
    if file_suffix == ".doc":
        word = client.Dispatch('Word.Application')
        # Source file
        doc = word.Documents.Open(path_raw)
        # New generated file; 16 is wdFormatDocumentDefault, i.e. .docx
        doc.SaveAs(path_output, 16)
        doc.Close()
        word.Quit()
    elif file_suffix == ".docx":
        # Already a .docx file: just copy it
        shutil.copy(path_raw, path_output)

On macOS/Linux, LibreOffice is recommended for document format conversion:

# Format conversion
./soffice --headless --convert-to docx source_file.doc --outdir /output/path/

LibreOffice is a free, open-source, cross-platform office suite developed by the community. Its built-in soffice command can convert files between formats.

Taking macOS as an example, follow these steps:

Download and install LibreOffice from the official website
Find the LibreOffice installation directory and add the directory containing the soffice command to PATH
Restart PyCharm so that it picks up the updated PATH
Use the walk() function from the os module to traverse all source files and build the soffice conversion commands
Execute the conversion commands

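import os
import subprocess

source = "./doc/"
dest = "./docx/"

# Traverse directory
for root, dirs, files in os.walk(source):
    for file in files:
        # Only convert legacy .doc files
        if not file.lower().endswith(".doc"):
            continue

        # Full path of source file
        file_path_raw = os.path.join(root, file)
        print(file_path_raw)

        # Pass the arguments as a list so that paths containing spaces are handled correctly
        subprocess.run(["soffice", "--headless", "--convert-to", "docx",
                        file_path_raw, "--outdir", dest])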
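
Before running a batch like this, it can be worth confirming that the soffice command is actually visible on PATH, since every conversion command fails if it is not. The sketch below uses shutil.which from the standard library; the helper name find_converter and the fallback name libreoffice are assumptions for illustration (some Linux packages install the launcher under that name).

```python
import shutil

def find_converter(candidates=("soffice", "libreoffice")):
    """Return the full path of the first command found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None

converter = find_converter()
if converter is None:
    print("soffice was not found on PATH; check the LibreOffice installation directory")
else:
    print("Using converter:", converter)
```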

6. Comparing Document Differences
Comparing two Word documents is another common work requirement.

First, iterate through all paragraphs in the documents, filter out empty lines, and get all text content:

# file1 and file2 are Document objects, e.g. file1 = Document("first.docx")
# Collect the non-empty paragraph text of each document
content1 = ''
content2 = ''
for paragraph in file1.paragraphs:
    if "" == paragraph.text.strip():
        continue
    content1 += paragraph.text + '\n'

for paragraph in file2.paragraphs:
    if "" == paragraph.text.strip():
        continue
    content2 += paragraph.text + '\n'

# With keepends=False the line breaks are dropped; with keepends=True they are kept.
print("First document data:\n", content1.splitlines(keepends=False))
print("Second document data:\n", content2.splitlines(keepends=False))
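
Since both loops do the same work, they can be folded into one helper. The sketch below (the name extract_lines is an assumption) returns the non-empty paragraph texts as a list, which is also the form difflib expects. It only relies on the .paragraphs and .text attributes, so it accepts any python-docx Document.

```python
def extract_lines(document):
    """Return the text of every non-empty paragraph in a document.

    `document` is expected to be a python-docx Document (anything that
    exposes .paragraphs whose items have a .text attribute will work).
    """
    lines = []
    for paragraph in document.paragraphs:
        # Skip paragraphs that are empty or contain only whitespace
        if paragraph.text.strip() == "":
            continue
        lines.append(paragraph.text)
    return lines
```

With two loaded documents, extract_lines(file1) and extract_lines(file2) can stand in for the two loops.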

Then, use Python’s standard library difflib to compare text differences and generate an HTML difference report:

from difflib import HtmlDiff

# Difference content
diff_html = HtmlDiff(wrapcolumn=100).make_file(content1.splitlines(), content2.splitlines())

# Write to file
with open('./diff_result.html', 'w', encoding='utf-8') as f:
    f.write(diff_html)
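
When an HTML report is more than is needed, difflib can also produce a plain-text diff. The sketch below uses difflib.unified_diff on two short lists of lines; the sample lines and file names are invented for illustration and stand in for the paragraph texts extracted from the two documents.

```python
from difflib import unified_diff

def text_diff(lines_a, lines_b, name_a="first.docx", name_b="second.docx"):
    """Return a unified diff of two lists of lines as a list of strings."""
    # lineterm="" because the input lines carry no trailing newline
    return list(unified_diff(lines_a, lines_b,
                             fromfile=name_a, tofile=name_b, lineterm=""))

old = ["Title", "Price: 10", "End"]
new = ["Title", "Price: 12", "End"]

result = text_diff(old, new)
for line in result:
    print(line)
```

Lines starting with "-" exist only in the first document, and lines starting with "+" only in the second.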

7. Special Content Annotation
We often need to specially annotate important content in documents. For example, we might want to highlight text runs or cells containing “WeChat” in red and bold.

7.1 Paragraph Content
Simply iterate through all text runs in paragraphs and directly modify the Run’s Font properties:

from docx import Document
from docx.shared import RGBColor

doc = Document(file)

# Highlight keyword text runs or cells in red and bold
# 1. Modify styles of text runs containing keywords in paragraphs
for paragraph in doc.paragraphs:
    for run in paragraph.runs:
        if keyword in run.text:
            # Change color to red and display in bold
            run.font.bold = True
            run.font.color.rgb = RGBColor(255, 0, 0)
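
One caveat with the run-by-run check: Word may store a single word in several runs (for instance after part of it was edited), in which case the keyword appears in paragraph.text but in no individual run, and the loop above skips it. The sketch below (find_split_keyword_paragraphs is a hypothetical helper name) lists such paragraphs so they can be handled separately. It only reads the .text and .runs attributes of python-docx paragraphs.

```python
def find_split_keyword_paragraphs(paragraphs, keyword):
    """Return paragraphs that contain `keyword` but not inside any single run."""
    split_hits = []
    for paragraph in paragraphs:
        if keyword not in paragraph.text:
            continue
        # The keyword is in the paragraph; is it inside any single run?
        if not any(keyword in run.text for run in paragraph.runs):
            split_hits.append(paragraph)
    return split_hits
```

A typical call would be find_split_keyword_paragraphs(doc.paragraphs, keyword).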

7.2 Table Content
Setting the style of cells that meet a condition takes a little more work, because a cell has no font property of its own; the style has to be applied to a run inside the cell. This takes 4 steps:

Get the cell object, read its text content, and save it temporarily
Clear the cell data
Append a text run with the saved content to the cell's first paragraph, which returns a run object
Set the run object's style to red and bold

for table in doc.tables:
    for row in table.rows:
        for cell in row.cells:
            if keyword in cell.text:
                # Original content
                content_raw = cell.text
                # Clear cell data
                cell.text = ""
                # Append data and set style
                run = cell.paragraphs[0].add_run(content_raw)
                run.font.color.rgb = RGBColor(255, 0, 0)
                run.font.bold = True

8. Replacing Text Content
Sometimes we need to replace all occurrences of a keyword in a document with new content. We can iterate through all paragraphs and tables, using the replace() function to replace paragraph text and cell content.

def replace_content(self, old_content, new_content):
    """
    Replace all content in document
    :param old_content: Old content
    :param new_content: New content
    :return:
    """
    # Replace paragraphs
    for paragraph in self.doc.paragraphs:
        if old_content in paragraph.text:
            # Replace content and reset
            paragraph.text = paragraph.text.replace(old_content, new_content)

    # Replace tables
    # document.tables[table_index].rows[row_index].cells[cell_column_index].text = "new data"
    for table in self.doc.tables:
        for row in table.rows:
            for cell in row.cells:
                if old_content in cell.text:
                    # Reset cell content
                    cell.text = cell.text.replace(old_content, new_content)

    # Save to new file
    self.doc.save('./new.docx')
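
Assigning to paragraph.text rebuilds the paragraph as a single run, so character formatting such as bold or color is lost. If that matters, one alternative is to replace inside each run, because changing run.text keeps that run's font settings. The sketch below is a variation on the function above, not a drop-in equivalent: the name replace_in_runs is an assumption, and a keyword that Word has split across several runs will not be matched.

```python
def replace_in_runs(paragraphs, old_content, new_content):
    """Replace text run by run and return the number of runs changed."""
    changed = 0
    for paragraph in paragraphs:
        for run in paragraph.runs:
            if old_content in run.text:
                # Assigning run.text leaves the run's own formatting in place
                run.text = run.text.replace(old_content, new_content)
                changed += 1
    return changed
```

For body text this would be called as replace_in_runs(doc.paragraphs, old, new); for tables, the same call can be made with cell.paragraphs for each cell.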

9. Conclusion
This concludes the Python Word Automation series! If you encounter other business scenarios in actual work that aren’t covered in these articles, please leave comments below. Corresponding solutions may be provided in future Office Automation Practice articles!
