Reading Word Documents with Python: How to Use python-docx
1. Introduction
The previous article, “How to read and write Word using Python- How to use python-docx”, summarized common operations for writing data to Word documents.
Reading data is just as practical as writing it. This article covers how to read the full range of data in a Word document and highlights the points to watch for.
2. Basic Information
We’ll continue using the python-docx library to read Word documents. Let’s start with the document’s basic information: sections, page margins, header/footer margins, page width and height, and page orientation.
First, create a document object using the document path:
python
from docx import Document

# The snippets in this article run inside a class, so shared state lives on self
# Source file path
self.word_path = './output.docx'

# Open the document and create a document object
self.doc = Document(self.word_path)
2.1 Sections
python
# 1. Get section information
# Note: Sections can set page size, headers, footers
msg_sections = self.doc.sections
print("Section list:", msg_sections)

# Number of sections
print('Number of sections:', len(msg_sections))
2.2 Page Margins
Use the section object’s left_margin, top_margin, right_margin, and bottom_margin properties to get the left, top, right, and bottom margins:
python
def get_page_margin(section):
    """
    Get page margins for a section (in EMU)
    :param section:
    :return:
    """
    # Left, top, right, bottom margins
    left, top, right, bottom = section.left_margin, section.top_margin, section.right_margin, section.bottom_margin
    return left, top, right, bottom


# 2. Page margin information
first_section = msg_sections[0]
left, top, right, bottom = get_page_margin(first_section)
print('Left margin:', left, ", Top margin:", top, ", Right margin:", right, ", Bottom margin:", bottom)
The return values are in EMU (English Metric Units). One inch equals 914,400 EMU, one centimeter equals 360,000 EMU, and one point equals 12,700 EMU.
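To make the raw EMU numbers easier to read, a small conversion helper can be useful. The following is a minimal sketch in plain Python; the function and constant names are illustrative and not part of python-docx. It also passes None through unchanged, on the assumption that a margin which is not set in the document is reported as None.

```python
# EMU (English Metric Unit) conversion factors
EMU_PER_INCH = 914400
EMU_PER_CM = 360000
EMU_PER_PT = 12700


def emu_to_inches(emu):
    """Convert an EMU length to inches (None is passed through)."""
    return None if emu is None else emu / EMU_PER_INCH


def emu_to_cm(emu):
    """Convert an EMU length to centimeters (None is passed through)."""
    return None if emu is None else emu / EMU_PER_CM


def emu_to_pt(emu):
    """Convert an EMU length to points (None is passed through)."""
    return None if emu is None else emu / EMU_PER_PT


# Example: a 1-inch margin expressed in EMU
margin = 914400
print(emu_to_inches(margin))  # 1.0
print(emu_to_cm(margin))      # 2.54
```

The lengths returned by python-docx should also expose .inches, .cm, and .pt properties that perform the same conversion, so a helper like this is mainly useful when working with plain integers.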
2.3 Header/Footer Margins
Header margin: header_distance
Footer margin: footer_distance
python
def get_header_footer_distance(section):
    """
    Get header and footer margins
    :param section:
    :return:
    """
    # Header margin, footer margin
    header_distance, footer_distance = section.header_distance, section.footer_distance
    return header_distance, footer_distance


# 3. Header/footer margins
header_distance, footer_distance = get_header_footer_distance(first_section)
print('Header margin:', header_distance, ", Footer margin:", footer_distance)
2.4 Page Width and Height
Page width: page_width
Page height: page_height
python
def get_page_size(section):
    """
    Get page width and height
    :param section:
    :return:
    """
    # Page width, height
    page_width, page_height = section.page_width, section.page_height
    return page_width, page_height


# 4. Page width and height
page_width, page_height = get_page_size(first_section)
print('Page width:', page_width, ", Page height:", page_height)
2.5 Page Orientation
Page orientation is either portrait or landscape. Use the section object’s orientation property:
python
def get_page_orientation(section):
    """
    Get page orientation
    :param section:
    :return:
    """
    return section.orientation


# 5. Page orientation
# Type: <class 'docx.enum.base.EnumValue'>
# Includes: PORTRAIT (0), LANDSCAPE (1)
page_orientation = get_page_orientation(first_section)
print("Page orientation:", page_orientation)
You can also use this property to set section orientation:
python
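from docx.enum.section import WD_ORIENT

# Set page orientation (landscape, portrait)
# Set to landscape. Changing the orientation alone does not resize the page,
# so the width and height are swapped too (assumes the section is currently portrait).
new_width, new_height = first_section.page_height, first_section.page_width
first_section.orientation = WD_ORIENT.LANDSCAPE
first_section.page_width = new_width
first_section.page_height = new_height

# Set to portrait
# first_section.orientation = WD_ORIENT.PORTRAIT

self.doc.save(self.word_path)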
3. Paragraphs
Use the document object’s paragraphs property to get all paragraphs in the document.
Note: This does not include paragraphs in headers, footers, or tables.
python
# Get all paragraphs in document object
# Excludes: headers, footers, table paragraphs
paragraphs = self.doc.paragraphs

# 1. Number of paragraphs
paragraphs_length = len(paragraphs)
print('Document contains: {} paragraphs'.format(paragraphs_length))
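Because the paragraphs property skips text inside tables, it can help to walk the tables as well when the full body text is needed. The sketch below is one possible approach, not the only one. The helper name iter_all_paragraphs is made up for illustration. It relies only on the paragraphs, tables, rows, and cells attributes used in this article, plus the cell object's paragraphs property.

```python
def iter_all_paragraphs(doc):
    """Yield body paragraphs first, then paragraphs found inside table cells.

    Limitations of this sketch: document order between body text and tables
    is not preserved, nested tables are not visited, and merged cells may be
    visited more than once.
    """
    # Paragraphs that sit directly in the document body
    yield from doc.paragraphs

    # Paragraphs that sit inside top-level tables
    for table in doc.tables:
        for row in table.rows:
            for cell in row.cells:
                yield from cell.paragraphs


# Usage with a python-docx document object (doc is assumed to exist):
# texts = [paragraph.text for paragraph in iter_all_paragraphs(doc)]
```

Headers and footers are still excluded here, since they belong to sections rather than to the document body.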
3.1 Paragraph Content
We can iterate through all paragraphs and use the paragraph object’s text property:
python
# 0. Read all paragraph data
contents = [paragraph.text for paragraph in self.doc.paragraphs]
print(contents)
3.2 Paragraph Formatting
Use the paragraph_format property to get basic formatting information including: alignment, left/right indentation, line spacing, spacing before/after paragraph, etc.
python
# 2. Get formatting information for a specific paragraph
paragraph_someone = paragraphs[0]

# 2.1 Paragraph content
content = paragraph_someone.text
print('Paragraph content:', content)

# 2.2 Paragraph formatting
paragraph_format = paragraph_someone.paragraph_format

# 2.2.1 Alignment
# <class 'docx.enum.base.EnumValue'>
alignment = paragraph_format.alignment
print('Paragraph alignment:', alignment)

# 2.2.2 Left/right indentation
left_indent, right_indent = paragraph_format.left_indent, paragraph_format.right_indent
print('Paragraph left indent:', left_indent, ", right indent:", right_indent)

# 2.2.3 First line indent
first_line_indent = paragraph_format.first_line_indent
print('Paragraph first line indent:', first_line_indent)

# 2.2.4 Line spacing
line_spacing = paragraph_format.line_spacing
print('Paragraph line spacing:', line_spacing)

# 2.2.5 Spacing before/after paragraph
space_before, space_after = paragraph_format.space_before, paragraph_format.space_after
print('Spacing before/after paragraph:', space_before, ',', space_after)
4. Text Runs
Text Runs belong to paragraphs, so to get Run information, you must first obtain a paragraph instance object.
4.1 Basic Run Information
Use the paragraph object’s runs property to get all Run objects within a paragraph:
python
def get_runs(paragraph):
    """
    Get all Run information in paragraph, including: count, content list
    :param paragraph:
    :return:
    """
    # Run objects contained in paragraph
    runs = paragraph.runs

    # Count
    runs_length = len(runs)

    # Run content
    runs_contents = [run.text for run in runs]

    return runs, runs_length, runs_contents
4.2 Run Formatting Information
Runs are the smallest text units in a document. Use the Run object’s font property to get font attributes including: font name, size, color, bold, italic, etc.
python
# 2. Run formatting information
# Includes: font name, size, color, bold, etc.
# Runs of the paragraph read earlier
runs, runs_length, runs_contents = get_runs(paragraph_someone)

# Font attributes of a specific Run
run_someone_font = runs[0].font

# Font name
font_name = run_someone_font.name
print('Font name:', font_name)

# Font color (RGB)
# <class 'docx.shared.RGBColor'>
font_color = run_someone_font.color.rgb
print('Font color:', font_color)
print(type(font_color))

# Font size
font_size = run_someone_font.size
print('Font size:', font_size)

# Bold
# True: bold; False: not bold; None: not set on the run, inherited from the style
font_bold = run_someone_font.bold
print('Bold:', font_bold)

# Italic
# True: italic; False: not italic; None: inherited from the style
font_italic = run_someone_font.italic
print('Italic:', font_italic)

# Underline
# True: underlined; False: not underlined; None: inherited from the style
font_underline = run_someone_font.underline
print('Underlined:', font_underline)

# Strikethrough/Double strikethrough
# True: strikethrough; False: no strikethrough; None: inherited from the style
font_strike = run_someone_font.strike
font_double_strike = run_someone_font.double_strike
print('Strikethrough:', font_strike, "\nDouble strikethrough:", font_double_strike)
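Many of these font attributes come back as None, which means the value is not set on the run itself and is inherited from a style. The sketch below shows one way to look up an inherited font size by walking the paragraph style and its base styles. It is a simplification under stated assumptions: the helper name effective_font_size is hypothetical, and it ignores any character style applied to the run as well as the document-level defaults.

```python
def effective_font_size(run, paragraph):
    """Return the font size set on the run, or the nearest size found on the
    paragraph style chain, or None if nothing along the way defines one."""
    # 1. A size set directly on the run wins
    if run.font.size is not None:
        return run.font.size

    # 2. Otherwise walk the paragraph style and the styles it is based on
    style = paragraph.style
    while style is not None:
        if style.font.size is not None:
            return style.font.size
        style = style.base_style

    # 3. Nothing found; the document defaults would apply
    return None


# Usage with python-docx objects (paragraph is assumed to exist):
# size = effective_font_size(paragraph.runs[0], paragraph)
# print(size.pt if size is not None else 'document default')
```

The same pattern could be applied to bold, italic, or the font name by swapping the attribute being checked.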
5. Tables
The document object’s tables property gets all table objects in the current document:
python
# All table objects in document
tables = self.doc.tables

# 1. Number of tables
table_num = len(tables)
print('Number of tables in document:', table_num)
5.1 Getting All Table Data
There are two methods to get all table data:
Method 1: Iterate through all tables, then through rows and cells, using the cell’s text property:
python
# 2. Read all table data
# All table objects
# tables = [table for table in self.doc.tables]
print('Contents are:')
for table in tables:
    for row in table.rows:
        for cell in row.cells:
            print(cell.text, end=' ')
        print()
    print('\n')
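Method 2: Use the table object’s _cells property to get all cells, then iterate. Note that the leading underscore marks _cells as a private attribute, so it may change between library versions: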
python
def get_table_cell_content(table):
    """
    Read content of all cells in table
    :param table:
    :return:
    """
    # All cells
    cells = table._cells
    cell_size = len(cells)

    # Content of all cells
    content = [cell.text for cell in cells]
    return content
5.2 Table Styles
python
# 3. Table style name
# Example output: Table Grid
table_someone = tables[0]
style = table_someone.style.name
print("Table style:", style)
5.3 Number of Rows and Columns
table.rows: row data iterator
table.columns: column data iterator
python
def get_table_size(table):
    """
    Get number of rows and columns in table
    :param table:
    :return:
    """
    # Number of rows, columns
    row_length, column_length = len(table.rows), len(table.columns)
    return row_length, column_length
5.4 Row Data and Column Data
Sometimes we need to get all data by row or column:
python
def get_table_row_datas(table):
    """
    Get row data from table
    :param table:
    :return:
    """
    rows = table.rows
    datas = []

    # Get cell data for each row as list, add to result list
    for row in rows:
        datas.append([cell.text for cell in row.cells])
    return datas


def get_table_column_datas(table):
    """
    Get column data from table
    :param table:
    :return:
    """
    columns = table.columns
    datas = []

    # Get cell data for each column as list, add to result list
    for column in columns:
        datas.append([cell.text for cell in column.cells])
    return datas
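Row data is often easier to work with as records keyed by column name. As a minimal sketch, assuming the first row of the table is a header row, the list returned by get_table_row_datas can be converted as follows. The helper name rows_to_records and the sample values are made up for illustration.

```python
def rows_to_records(rows):
    """Turn a list of row lists into dicts keyed by the first (header) row."""
    if not rows:
        return []
    header, *body = rows
    return [dict(zip(header, row)) for row in body]


# Sample data shaped like the output of get_table_row_datas()
rows = [['Name', 'Age'], ['Alice', '30'], ['Bob', '25']]
print(rows_to_records(rows))
# [{'Name': 'Alice', 'Age': '30'}, {'Name': 'Bob', 'Age': '25'}]
```

All values stay as strings, because cell text is always read as text. Any numeric conversion has to be done separately.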
6. Images
Sometimes we need to download images from Word documents to local storage.
A .docx file is essentially a ZIP archive. Unpacking it shows that the document’s images are stored in the /word/media/ directory.
There are two methods to extract document images:
- Extract the document file and copy images from the corresponding directory
- Use python-docx built-in methods to extract images (recommended)
python
import os
import re


def get_word_pics(doc, word_path, output_path):
    """
    Extract images from Word document
    :param doc: Document object
    :param word_path: Source file name
    :param output_path: Output directory
    :return:
    """
    # Image save directory
    os.makedirs(output_path, exist_ok=True)

    # Source file name without directory or extension, used as a prefix
    newname = os.path.splitext(os.path.basename(word_path))[0]

    dict_rel = doc.part._rels
    for rel_id in dict_rel:
        rel = dict_rel[rel_id]
        if "image" in rel.target_ref:
            # New name: <document name>_<image name>
            img_name = re.findall("/(.*)", rel.target_ref)[0]
            img_name = f'{newname}_{img_name}'

            # Write to file
            with open(os.path.join(output_path, img_name), "wb") as f:
                f.write(rel.target_part.blob)
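For comparison, the first method listed above (unpacking the file and copying the images) can also be automated with the standard library alone. The following is a minimal sketch that treats the .docx file as a ZIP archive and copies everything under word/media/. The function name extract_media is made up for illustration.

```python
import os
import zipfile


def extract_media(docx_path, output_dir):
    """Copy every file under word/media/ in a .docx archive to output_dir.

    Returns the list of paths that were written.
    """
    os.makedirs(output_dir, exist_ok=True)
    saved = []
    with zipfile.ZipFile(docx_path) as archive:
        for name in archive.namelist():
            # Keep only media files; skip directory entries
            if name.startswith('word/media/') and not name.endswith('/'):
                target = os.path.join(output_dir, os.path.basename(name))
                with open(target, 'wb') as f:
                    f.write(archive.read(name))
                saved.append(target)
    return saved


# Usage (paths are placeholders):
# extract_media('./output.docx', './images')
```

This approach copies every media file in the archive, including any that belong to headers or footers, whereas the relationship-based function only sees images related to the main document part.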
7. Headers and Footers
Headers and footers are section-based. Let’s use a section object as an example:
python
# Get a section
first_section = self.doc.sections[0]
Use the section object’s header and footer properties to get the header and footer objects. Since a header or footer may contain multiple paragraphs, use its paragraphs property to get them all, then iterate and join their text to get the complete header/footer content.
python
# Note: Headers and footers may contain multiple paragraphs
# All paragraphs in header
header_content = " ".join([paragraph.text for paragraph in first_section.header.paragraphs])
print("Header content:", header_content)

# Footer
footer_content = " ".join([paragraph.text for paragraph in first_section.footer.paragraphs])
print("Footer content:", footer_content)
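A document with several sections may define a different header for each one, or let a section reuse the header of the section before it. The sketch below collects the header text per section and skips sections whose header is linked to the previous one. It relies on the header object's is_linked_to_previous property. The helper name collect_headers is hypothetical, and the same pattern should work for footers.

```python
def collect_headers(sections):
    """Return (section index, header text) pairs for sections that define
    their own header, skipping headers linked to the previous section."""
    result = []
    for index, section in enumerate(sections):
        header = section.header
        if header.is_linked_to_previous:
            # This section reuses the previous section's header
            # (for the first section this usually means no header is defined)
            continue
        text = " ".join(paragraph.text for paragraph in header.paragraphs)
        result.append((index, text))
    return result


# Usage with a python-docx document object (doc is assumed to exist):
# for index, text in collect_headers(doc.sections):
#     print('Section', index, 'header:', text)
```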