Working with Word in Python: How to Use python-docx

1. Introduction
The previous article, "How to read and write Word using Python: How to use python-docx", summarized common operations for writing data to Word documents.
Reading data is just as useful as writing it. This article covers how to read each kind of content from a Word document and highlights the points to watch out for.

2. Basic Information
We’ll continue using the python-docx library to read Word documents. Let’s start with the document’s basic information: sections, page margins, header/footer margins, page width/height, and page orientation.

First, create a document object using the document path:

python

from docx import Document

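# Source file path
# Note: the snippets in this article assume they run inside a class method,
# hence the self. prefix; in a standalone script, use plain variables instead
self.word_path = './output.docx'

# Open document and create a document object
self.doc = Document(self.word_path)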

2.1 Sections

python

# 1. Get section information
# Note: Sections can set page size, headers, footers
msg_sections = self.doc.sections
print("Section list:", msg_sections)
# Number of sections
print('Number of sections:', len(msg_sections))

2.2 Page Margins
Use the section object’s left_margin, top_margin, right_margin, and bottom_margin properties to get the left, top, right, and bottom margins:

python

def get_page_margin(section):
    """
    Get page margins for a section (in EMU)
    :param section:
    :return:
    """
    # Left, top, right, bottom margins
    left, top, right, bottom = section.left_margin, section.top_margin, section.right_margin, section.bottom_margin
    return left, top, right, bottom

# 2. Page margin information
first_section = msg_sections[0]
left, top, right, bottom = get_page_margin(first_section)
print('Left margin:', left, ", Top margin:", top, ", Right margin:", right, ", Bottom margin:", bottom)

The return values are in EMU (English Metric Units). One inch equals 914,400 EMU, and one centimeter equals 360,000 EMU.
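
As a rough sketch of that conversion, the helpers below turn an EMU count into inches, centimeters, and points with plain arithmetic. The function and constant names are illustrative and not part of python-docx. The values python-docx returns are Length objects, which should also expose .inches, .cm, and .pt properties for the same purpose.

```python
# Conversion constants for English Metric Units (EMU)
EMU_PER_INCH = 914400
EMU_PER_CM = 360000
EMU_PER_PT = 12700


def emu_to_inches(emu):
    """Convert an EMU count to inches."""
    return emu / EMU_PER_INCH


def emu_to_cm(emu):
    """Convert an EMU count to centimeters."""
    return emu / EMU_PER_CM


def emu_to_pt(emu):
    """Convert an EMU count to points (1 inch = 72 points)."""
    return emu / EMU_PER_PT
```

For example, emu_to_cm(left) converts the left margin read above, provided the margin is actually set (a margin missing from the document may come back as None).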

2.3 Header/Footer Margins
Header margin: header_distance
Footer margin: footer_distance

python

def get_header_footer_distance(section):
    """
    Get header and footer margins
    :param section:
    :return:
    """
    # Header margin, footer margin
    header_distance, footer_distance = section.header_distance, section.footer_distance
    return header_distance, footer_distance

# 3. Header/footer margins
header_distance, footer_distance = get_header_footer_distance(first_section)
print('Header margin:', header_distance, ", Footer margin:", footer_distance)

2.4 Page Width and Height
Page width: page_width
Page height: page_height

python

def get_page_size(section):
    """
    Get page width and height
    :param section:
    :return:
    """
    # Page width, height
    page_width, page_height = section.page_width, section.page_height
    return page_width, page_height

# 4. Page width and height
page_width, page_height = get_page_size(first_section)
print('Page width:', page_width, ", Page height:", page_height)

2.5 Page Orientation
Page orientation includes: portrait and landscape. Use the section object’s orientation property:

python

def get_page_orientation(section):
    """
    Get page orientation
    :param section:
    :return:
    """
    return section.orientation

# 5. Page orientation
# Type: <class 'docx.enum.base.EnumValue'>
# Includes: PORTRAIT (0), LANDSCAPE (1)
page_orientation = get_page_orientation(first_section)
print("Page orientation:", page_orientation)

You can also use this property to set section orientation:

python

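from docx.enum.section import WD_ORIENT

# Set page orientation (landscape, portrait)
# Note: setting orientation alone does not rotate the page,
# so the page width and height are swapped as well
width, height = first_section.page_width, first_section.page_height
# Set to landscape
first_section.orientation = WD_ORIENT.LANDSCAPE
first_section.page_width, first_section.page_height = max(width, height), min(width, height)
# Set to portrait
# first_section.orientation = WD_ORIENT.PORTRAIT
# first_section.page_width, first_section.page_height = min(width, height), max(width, height)
self.doc.save(self.word_path)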

3. Paragraphs
Use the document object’s paragraphs property to get all paragraphs in the document.
Note: This does not include paragraphs in headers, footers, or tables.

python

# Get all paragraphs in document object, excludes: headers, footers, table paragraphs
paragraphs = self.doc.paragraphs

# 1. Number of paragraphs
paragraphs_length = len(paragraphs)
print('Document contains: {} paragraphs'.format(paragraphs_length))

3.1 Paragraph Content
We can iterate through all paragraphs and use the paragraph object’s text property:

python

# 0. Read all paragraph data
contents = [paragraph.text for paragraph in self.doc.paragraphs]
print(contents)
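
Because doc.paragraphs skips text inside tables, a document that keeps most of its content in tables can look almost empty. The sketch below is one way to collect both; the function name is hypothetical, and it only relies on the paragraphs, tables, rows, and cells properties used elsewhere in this article. It assumes that body text first and table text second is acceptable (it does not preserve document order), it does not descend into nested tables, and merged cells may be visited more than once.

```python
def iter_all_paragraph_text(doc):
    """Yield the text of body paragraphs, then of paragraphs inside table cells."""
    # Top-level paragraphs in the document body
    for paragraph in doc.paragraphs:
        yield paragraph.text

    # Paragraphs that live inside table cells
    for table in doc.tables:
        for row in table.rows:
            for cell in row.cells:
                for paragraph in cell.paragraphs:
                    yield paragraph.text
```

Calling list(iter_all_paragraph_text(self.doc)) returns a flat list of strings.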

3.2 Paragraph Formatting
Use the paragraph_format property to get basic formatting information including: alignment, left/right indentation, line spacing, spacing before/after paragraph, etc.

python

# 2. Get formatting information for a specific paragraph
paragraph_someone = paragraphs[0]

# 2.1 Paragraph content
content = paragraph_someone.text
print('Paragraph content:', content)

# 2.2 Paragraph formatting
paragraph_format = paragraph_someone.paragraph_format

# 2.2.1 Alignment
# <class 'docx.enum.base.EnumValue'>
alignment = paragraph_format.alignment
print('Paragraph alignment:', alignment)

# 2.2.2 Left/right indentation
left_indent, right_indent = paragraph_format.left_indent, paragraph_format.right_indent
print('Paragraph left indent:', left_indent, ", right indent:", right_indent)

# 2.2.3 First line indent
first_line_indent = paragraph_format.first_line_indent
print('Paragraph first line indent:', first_line_indent)

# 2.2.4 Line spacing
line_spacing = paragraph_format.line_spacing
print('Paragraph line spacing:', line_spacing)

# 2.2.5 Spacing before/after paragraph
space_before, space_after = paragraph_format.space_before, paragraph_format.space_after
print('Spacing before/after paragraph:', space_before, ',', space_after)
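
Many of these properties can print as None. In python-docx this generally means the value is not set directly on the paragraph and is inherited from its style. A minimal sketch for looking up the inherited value follows; the function name is hypothetical, and it assumes the paragraph's style exposes paragraph_format and base_style. If the result is still None, the document or application default applies, which this sketch does not resolve.

```python
def resolve_paragraph_format(paragraph, attr_name):
    """Return a paragraph_format value, falling back to the style chain."""
    # Value set directly on the paragraph, if any
    value = getattr(paragraph.paragraph_format, attr_name)

    # Walk the paragraph style and its base styles until a value is found
    style = paragraph.style
    while value is None and style is not None:
        value = getattr(style.paragraph_format, attr_name)
        style = style.base_style
    return value
```

For example, resolve_paragraph_format(paragraph_someone, 'alignment') checks the paragraph first and then each style it is based on.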

4. Text Runs
Text Runs belong to paragraphs, so to get Run information, you must first obtain a paragraph instance object.

4.1 Basic Run Information
Use the paragraph object’s runs property to get all Run objects within a paragraph:

python

def get_runs(paragraph):
    """
    Get all Run information in paragraph, including: count, content list
    :param paragraph:
    :return:
    """
    # Run objects contained in paragraph
    runs = paragraph.runs

    # Count
    runs_length = len(runs)

    # Run content
    runs_contents = [run.text for run in runs]

    return runs, runs_length, runs_contents

4.2 Run Formatting Information
Runs are the smallest text units in a document. Use the Run object’s font property to get font attributes including: font name, size, color, bold, italic, etc.

python

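# 2. Run formatting information
# Includes: font name, size, color, bold, etc.
# Runs of the first paragraph, obtained with get_runs() from 4.1
runs, runs_length, runs_contents = get_runs(paragraphs[0])

# Font attributes of a specific Run
run_someone_font = runs[0].font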

# Font name
font_name = run_someone_font.name
print('Font name:', font_name)

# Font color (RGB)
# <class 'docx.shared.RGBColor'>, or None when no color is set directly on the Run
font_color = run_someone_font.color.rgb
print('Font color:', font_color)
print(type(font_color))

# Font size
font_size = run_someone_font.size
print('Font size:', font_size)

# Bold
# True: bold; None/False: not bold
font_bold = run_someone_font.bold
print('Bold:', font_bold)

# Italic
# True: italic; None/False: not italic
font_italic = run_someone_font.italic
print('Italic:', font_italic)

# Underline
# True: underlined; None/False: not underlined
font_underline = run_someone_font.underline
print('Underlined:', font_underline)

# Strikethrough/Double strikethrough
# True: strikethrough; None/False: no strikethrough
font_strike = run_someone_font.strike
font_double_strike = run_someone_font.double_strike
print('Strikethrough:', font_strike, "\nDouble strikethrough:", font_double_strike)
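
When processing many Runs, it can help to gather these attributes into a plain dictionary. The sketch below is one possible shape, with a hypothetical function name. It assumes the font size, when set, is a Length with a .pt property, and that color.rgb is None when no color is set directly on the Run. Values of None for bold and italic are kept as-is, because they mean "inherited", not "off".

```python
def describe_run(run):
    """Summarize a Run's text and directly-set font attributes as a dict."""
    font = run.font
    size = font.size
    rgb = font.color.rgb
    return {
        "text": run.text,
        "name": font.name,
        # Font size in points, or None when inherited
        "size_pt": size.pt if size is not None else None,
        # Hex color string via str(), or None when no color is set
        "color": str(rgb) if rgb is not None else None,
        "bold": font.bold,
        "italic": font.italic,
    }
```

For example, [describe_run(run) for run in paragraph.runs] gives one dictionary per Run in a paragraph.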

5. Tables
The document object’s tables property gets all table objects in the current document:

python

# All table objects in document
tables = self.doc.tables

# 1. Number of tables
table_num = len(tables)
print('Number of tables in document:', table_num)

5.1 Getting All Table Data
There are two methods to get all table data:

Method 1: Iterate through all tables, then through rows and cells, using cell’s text property:

python

# 2. Read all table data
# All table objects
# tables = [table for table in self.doc.tables]
print('Contents are:')
for table in tables:
    for row in table.rows:
        for cell in row.cells:
            print(cell.text, end='  ')
        print()
    print('\n')

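Method 2: Use the table object’s _cells property to get all cells, then iterate. Note that the leading underscore marks _cells as private, so it may change between python-docx versions: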

python

def get_table_cell_content(table):
    """
    Read content of all cells in table
    :param table:
    :return:
    """
    # All cells
    cells = table._cells
    cell_size = len(cells)

    # Content of all cells
    content = [cell.text for cell in cells]
    return content

5.2 Table Styles

python

# 3. Table style name
# Table Grid
table_someone = tables[0]
style = table_someone.style.name
print("Table style:", style)

5.3 Number of Rows and Columns
table.rows: row data iterator
table.columns: column data iterator

python

def get_table_size(table):
    """
    Get number of rows and columns in table
    :param table:
    :return:
    """
    # Number of rows, columns
    row_length, column_length = len(table.rows), len(table.columns)
    return row_length, column_length

5.4 Row Data and Column Data
Sometimes we need to get all data by row or column:

python

def get_table_row_datas(table):
    """
    Get row data from table
    :param table:
    :return:
    """
    rows = table.rows
    datas = []

    # Get cell data for each row as list, add to result list
    for row in rows:
        datas.append([cell.text for cell in row.cells])
    return datas

def get_table_column_datas(table):
    """
    Get column data from table
    :param table:
    :return:
    """
    columns = table.columns
    datas = []

    # Get cell data for each column as list, add to result list
    for column in columns:
        datas.append([cell.text for cell in column.cells])
    return datas
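
Once the row data is a list of lists, plain Python is enough to reshape it. The sketch below assumes the first row of the table is a header row, which is an assumption about the document and not something python-docx reports. Both function names are illustrative.

```python
def rows_to_records(rows):
    """Turn [[header...], [values...], ...] into a list of dicts keyed by header."""
    if not rows:
        return []
    header, *body = rows
    return [dict(zip(header, row)) for row in body]


def transpose(rows):
    """Turn row-oriented data into column-oriented data."""
    return [list(column) for column in zip(*rows)]
```

For example, rows_to_records(get_table_row_datas(table)) yields one dictionary per data row. For a table without merged cells, transpose(get_table_row_datas(table)) should match the output of get_table_column_datas(table).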

6. Images
Sometimes we need to download images from Word documents to local storage.
A Word document is essentially a compressed file. Using extraction tools reveals that document images are stored in the /word/media/ directory.

There are two methods to extract document images:

  1. Extract the document file and copy images from the corresponding directory
  2. Use python-docx built-in methods to extract images (recommended)

python

import os


def get_word_pics(doc, word_path, output_path):
    """
    Extract images from Word document
    :param doc: Document object
    :param word_path: Source file name
    :param output_path: Output directory
    :return:
    """
    # Source file name without directory or extension, used as a prefix
    word_name = os.path.splitext(os.path.basename(word_path))[0]

    for rel in doc.part.rels.values():
        # Skip external links and relationships that are not images
        if rel.is_external or "image" not in rel.target_ref:
            continue

        # Image save directory
        os.makedirs(output_path, exist_ok=True)

        # New name, e.g. "media/image1.png" -> "<word_name>_image1.png"
        img_name = f'{word_name}_{os.path.basename(rel.target_ref)}'

        # Write to file
        with open(os.path.join(output_path, img_name), "wb") as f:
            f.write(rel.target_part.blob)
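
For comparison, the first method (treating the document as a compressed file) can be sketched with the standard library's zipfile module alone. The function name is hypothetical, and the sketch assumes the images sit under word/media/ as described above.

```python
import os
import zipfile


def extract_media_with_zipfile(word_path, output_path):
    """Copy every file under word/media/ in the .docx archive to output_path."""
    os.makedirs(output_path, exist_ok=True)
    saved = []
    with zipfile.ZipFile(word_path) as archive:
        for name in archive.namelist():
            # Keep only files (not directory entries) in the media folder
            if not name.startswith("word/media/") or name.endswith("/"):
                continue
            target = os.path.join(output_path, os.path.basename(name))
            with archive.open(name) as src, open(target, "wb") as dst:
                dst.write(src.read())
            saved.append(target)
    return saved
```

This version needs no third-party library, but it copies everything in the media folder without checking which part of the document references each file.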

7. Headers and Footers
Headers and footers are section-based. Let’s use a section object as an example:

python

# Get a section
first_section = self.doc.sections[0]

Use the section object’s header and footer properties to get header and footer objects. Since headers/footers may contain multiple paragraphs, we can use the header/footer object’s paragraphs property to get all paragraphs, then iterate and concatenate their values to get the complete header/footer content.

python

# Note: Headers and footers may contain multiple paragraphs
# All paragraphs in header
header_content = " ".join([paragraph.text for paragraph in first_section.header.paragraphs])
print("Header content:", header_content)

# Footer
footer_content = " ".join([paragraph.text for paragraph in first_section.footer.paragraphs])
print("Footer content:", footer_content)
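
A document with several sections may define a different header for each one, or let a section reuse the header of the section before it. The sketch below loops over all sections and records both the text and whether the header is inherited. The function name is hypothetical, and it assumes the header object exposes is_linked_to_previous, as python-docx headers and footers should.

```python
def collect_headers(doc):
    """Return (section index, linked-to-previous flag, header text) per section."""
    results = []
    for index, section in enumerate(doc.sections):
        header = section.header
        # Join all header paragraphs into one string
        text = " ".join(paragraph.text for paragraph in header.paragraphs)
        results.append((index, header.is_linked_to_previous, text))
    return results
```

The same loop works for footers by reading section.footer in place of section.header.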