Python operation on Word — How to use python-docx
1. Introduction
The previous article summarized common operations for writing data to Word documents: "How to read and write Word using Python - How to use python-docx".
Reading data is just as useful as writing it. This article covers how to read data from a Word document in full and highlights the points to watch out for.
2. Basic Information
We’ll continue using the python-docx library to read Word documents, starting with the document’s basic information: sections, page margins, header/footer margins, page width/height, page orientation, etc.
First, create a document object using the document path:
python
from docx import Document

# Source file path
self.word_path = './output.docx'

# Open document and create a document object
self.doc = Document(self.word_path)
2.1 Sections
python
# 1. Get section information
# Note: Sections can set page size, headers, footers
msg_sections = self.doc.sections
print("Section list:", msg_sections)
# Number of sections
print('Number of sections:', len(msg_sections))
2.2 Page Margins
Use section object properties left_margin, top_margin, right_margin, bottom_margin to get left, top, right, and bottom margins:
python
def get_page_margin(section):
    """
    Get page margins for a section (in EMU)
    :param section:
    :return:
    """
    # Left, top, right, bottom margins
    left, top, right, bottom = section.left_margin, section.top_margin, section.right_margin, section.bottom_margin
    return left, top, right, bottom

# 2. Page margin information
first_section = msg_sections[0]
left, top, right, bottom = get_page_margin(first_section)
print('Left margin:', left, ", Top margin:", top, ", Right margin:", right, ", Bottom margin:", bottom)
The return values are in EMU (English Metric Units). The conversions to inches and centimeters are fixed: 1 inch = 914,400 EMU and 1 cm = 360,000 EMU.
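As a minimal sketch of these conversions, the helpers below rely only on the fixed ratios of 914,400 EMU per inch, 360,000 EMU per centimeter, and 12,700 EMU per point. The function names are illustrative and not part of python-docx; the length values returned by the library also expose similar conversions through properties such as cm, inches, and pt.

```python
# Fixed conversion factors for English Metric Units (EMU)
EMU_PER_INCH = 914400
EMU_PER_CM = 360000
EMU_PER_PT = 12700


def emu_to_inches(emu):
    """Convert an EMU value to inches."""
    return emu / EMU_PER_INCH


def emu_to_cm(emu):
    """Convert an EMU value to centimeters."""
    return emu / EMU_PER_CM


def emu_to_pt(emu):
    """Convert an EMU value to points."""
    return emu / EMU_PER_PT


# Hypothetical sample value: a one-inch margin expressed in EMU
margin = 914400
print(emu_to_inches(margin), emu_to_cm(margin), emu_to_pt(margin))
```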
2.3 Header/Footer Margins
Header margin: header_distance
Footer margin: footer_distance
python
def get_header_footer_distance(section):
    """
    Get header and footer margins
    :param section:
    :return:
    """
    # Header margin, footer margin
    header_distance, footer_distance = section.header_distance, section.footer_distance
    return header_distance, footer_distance

# 3. Header/footer margins
header_distance, footer_distance = get_header_footer_distance(first_section)
print('Header margin:', header_distance, ", Footer margin:", footer_distance)
2.4 Page Width and Height
Page width: page_width
Page height: page_height
python
def get_page_size(section):
    """
    Get page width and height
    :param section:
    :return:
    """
    # Page width, height
    page_width, page_height = section.page_width, section.page_height
    return page_width, page_height

# 4. Page width and height
page_width, page_height = get_page_size(first_section)
print('Page width:', page_width, ", Page height:", page_height)
2.5 Page Orientation
Page orientation includes: portrait and landscape. Use the section object’s orientation property:
python
def get_page_orientation(section):
    """
    Get page orientation
    :param section:
    :return:
    """
    return section.orientation

# 5. Page orientation
# Type: <class 'docx.enum.base.EnumValue'>
# Includes: PORTRAIT (0), LANDSCAPE (1)
page_orientation = get_page_orientation(first_section)
print("Page orientation:", page_orientation)
You can also use this property to set section orientation:
python
from docx.enum.section import WD_ORIENT

# Set page orientation (landscape, portrait)
# Set to landscape
first_section.orientation = WD_ORIENT.LANDSCAPE

# Set to portrait
# first_section.orientation = WD_ORIENT.PORTRAIT

self.doc.save(self.word_path)
3. Paragraphs
Use the document object’s paragraphs property to get all paragraphs in the document.
Note: This does not include paragraphs in headers, footers, or tables.
python
# Get all paragraphs in document object, excludes: headers, footers, table paragraphs
paragraphs = self.doc.paragraphs
# 1. Number of paragraphs
paragraphs_length = len(paragraphs)
print('Document contains: {} paragraphs'.format(paragraphs_length))
3.1 Paragraph Content
We can iterate through all paragraphs and use the paragraph object’s text property:
python
# 0. Read all paragraph data
contents = [paragraph.text for paragraph in self.doc.paragraphs]
print(contents)
3.2 Paragraph Formatting
Use the paragraph_format property to get basic formatting information including: alignment, left/right indentation, line spacing, spacing before/after paragraph, etc.
python
# 2. Get formatting information for a specific paragraph
paragraph_someone = paragraphs[0]
# 2.1 Paragraph content
content = paragraph_someone.text
print('Paragraph content:', content)
# 2.2 Paragraph formatting
paragraph_format = paragraph_someone.paragraph_format
# 2.2.1 Alignment
# <class 'docx.enum.base.EnumValue'>
alignment = paragraph_format.alignment
print('Paragraph alignment:', alignment)
# 2.2.2 Left/right indentation
left_indent, right_indent = paragraph_format.left_indent, paragraph_format.right_indent
print('Paragraph left indent:', left_indent, ", right indent:", right_indent)
# 2.2.3 First line indent
first_line_indent = paragraph_format.first_line_indent
print('Paragraph first line indent:', first_line_indent)
# 2.2.4 Line spacing
line_spacing = paragraph_format.line_spacing
print('Paragraph line spacing:', line_spacing)
# 2.2.5 Spacing before/after paragraph
space_before, space_after = paragraph_format.space_before, paragraph_format.space_after
print('Spacing before/after paragraph:', space_before, ',', space_after)
4. Text Runs
Text Runs belong to paragraphs, so to get Run information, you must first obtain a paragraph instance object.
4.1 Basic Run Information
Use the paragraph object’s runs property to get all Run objects within a paragraph:
python
def get_runs(paragraph):
    """
    Get all Run information in paragraph, including: count, content list
    :param paragraph:
    :return:
    """
    # Run objects contained in paragraph
    runs = paragraph.runs
    # Count
    runs_length = len(runs)
    # Run content
    runs_contents = [run.text for run in runs]
    return runs, runs_length, runs_contents
4.2 Run Formatting Information
Runs are the smallest text units in a document. Use the Run object’s font property to get font attributes including: font name, size, color, bold, italic, etc.
python
# 2. Run formatting information
# Includes: font name, size, color, bold, etc.
# Font attributes of a specific Run (here, the first Run of the first paragraph)
runs = paragraphs[0].runs
run_someone_font = runs[0].font
# Font name
font_name = run_someone_font.name
print('Font name:', font_name)
# Font color (RGB)
# <class 'docx.shared.RGBColor'>
font_color = run_someone_font.color.rgb
print('Font color:', font_color)
print(type(font_color))
# Font size
font_size = run_someone_font.size
print('Font size:', font_size)
# Bold
# True: bold; False: not bold; None: inherited from the style
font_bold = run_someone_font.bold
print('Bold:', font_bold)
# Italic
# True: italic; False: not italic; None: inherited from the style
font_italic = run_someone_font.italic
print('Italic:', font_italic)
# Underline
# True: underlined; False: not underlined; None: inherited from the style
font_underline = run_someone_font.underline
print('Underlined:', font_underline)
# Strikethrough/Double strikethrough
# True: strikethrough; False: no strikethrough; None: inherited from the style
font_strike = run_someone_font.strike
font_double_strike = run_someone_font.double_strike
print('Strikethrough:', font_strike, "\nDouble strikethrough:", font_double_strike)
5. Tables
The document object’s tables property gets all table objects in the current document:
python
# All table objects in document
tables = self.doc.tables
# 1. Number of tables
table_num = len(tables)
print('Number of tables in document:', table_num)
5.1 Getting All Table Data
There are two methods to get all table data:
Method 1: Iterate through all tables, then through rows and cells, using cell’s text property:
python
# 2. Read all table data
# All table objects
# tables = [table for table in self.doc.tables]
print('Contents are:')
for table in tables:
    for row in table.rows:
        for cell in row.cells:
            print(cell.text, end=' ')
        print()
    print('\n')
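Method 2: Use the table object’s _cells property to get all cells, then iterate. Note that the leading underscore marks _cells as private, so it may change between python-docx versions: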
python
def get_table_cell_content(table):
    """
    Read content of all cells in table
    :param table:
    :return:
    """
    # All cells
    cells = table._cells
    cell_size = len(cells)
    # Content of all cells
    content = [cell.text for cell in cells]
    return content
5.2 Table Styles
python
# 3. Table style name
# Table Grid
table_someone = tables[0]
style = table_someone.style.name
print("Table style:", style)
5.3 Number of Rows and Columns
table.rows: row data iterator
table.columns: column data iterator
python
def get_table_size(table):
    """
    Get number of rows and columns in table
    :param table:
    :return:
    """
    # Number of rows, columns
    row_length, column_length = len(table.rows), len(table.columns)
    return row_length, column_length
5.4 Row Data and Column Data
Sometimes we need to get all data by row or column:
python
def get_table_row_datas(table):
    """
    Get row data from table
    :param table:
    :return:
    """
    rows = table.rows
    datas = []
    # Get cell data for each row as list, add to result list
    for row in rows:
        datas.append([cell.text for cell in row.cells])
    return datas


def get_table_column_datas(table):
    """
    Get column data from table
    :param table:
    :return:
    """
    columns = table.columns
    datas = []
    # Get cell data for each column as list, add to result list
    for column in columns:
        datas.append([cell.text for cell in column.cells])
    return datas
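When the first row of a table holds the column headers, row data in the list-of-lists shape returned by get_table_row_datas can be reshaped into one dictionary per record. This is a sketch on plain lists, so it runs without a document; rows_to_records is a hypothetical helper and the sample data is made up.

```python
def rows_to_records(rows):
    """Turn [[header...], [values...], ...] into a list of dicts."""
    if not rows:
        return []
    # First row is treated as the header, the rest as data rows
    header, *body = rows
    return [dict(zip(header, row)) for row in body]


sample_rows = [["Name", "Age"], ["Alice", "30"], ["Bob", "25"]]
records = rows_to_records(sample_rows)
print(records)
```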
6. Images
Sometimes we need to download images from Word documents to local storage.
A .docx file is essentially a ZIP archive. Opening it with an extraction tool shows that the document’s images are stored in the word/media/ directory.
There are two methods to extract document images:
- Extract the document file and copy images from the corresponding directory
- Use python-docx built-in methods to extract images (recommended)
python
import os


def get_word_pics(doc, word_path, output_path):
    """
    Extract images from Word document
    :param doc: Document object
    :param word_path: Source file name
    :param output_path: Output directory
    :return:
    """
    # Image save directory
    os.makedirs(output_path, exist_ok=True)
    # Source file name without directory or extension
    word_name = os.path.splitext(os.path.basename(word_path))[0]
    for rel in doc.part.rels.values():
        # Skip external links; keep only image relationships
        if rel.is_external or "image" not in rel.reltype:
            continue
        # New name: <document name>_<image name>
        img_name = f'{word_name}_{os.path.basename(rel.target_ref)}'
        # Write to file
        with open(os.path.join(output_path, img_name), "wb") as f:
            f.write(rel.target_part.blob)
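The first method in the list above can be sketched with the standard library alone, since a .docx file can be opened with zipfile. extract_media is a hypothetical helper name; it copies every file stored under word/media/ into the output directory and returns the paths it wrote.

```python
import os
import zipfile


def extract_media(word_path, output_path):
    """Copy files stored under word/media/ in a .docx archive to output_path."""
    os.makedirs(output_path, exist_ok=True)
    saved = []
    with zipfile.ZipFile(word_path) as archive:
        for name in archive.namelist():
            # Skip everything that is not a file in the media folder
            if not name.startswith("word/media/") or name.endswith("/"):
                continue
            target = os.path.join(output_path, os.path.basename(name))
            with open(target, "wb") as f:
                f.write(archive.read(name))
            saved.append(target)
    return saved
```

A call such as extract_media('./output.docx', './images') would then write the images next to the script, assuming the source file exists.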
7. Headers and Footers
Headers and footers are section-based. Let’s use a section object as an example:
python
# Get a section
first_section = self.doc.sections[0]
Use the section object’s header and footer properties to get header and footer objects. Since headers/footers may contain multiple paragraphs, we can use the header/footer object’s paragraphs property to get all paragraphs, then iterate and concatenate their values to get the complete header/footer content.
python
# Note: Headers and footers may contain multiple paragraphs
# All paragraphs in header
header_content = " ".join([paragraph.text for paragraph in first_section.header.paragraphs])
print("Header content:", header_content)
# Footer
footer_content = " ".join([paragraph.text for paragraph in first_section.footer.paragraphs])
print("Footer content:", footer_content)