Markdown Document Format Analysis and Structured Parsing of .md Files Using Python

I. Markdown Documents

Essentially, a Markdown document has two layers: a tree of block-level elements, and the inline-level elements contained inside those blocks.

Block-level elements (structure):

heading_open → inline → heading_close
paragraph_open → inline → paragraph_close
bullet_list_open → list_item_open → paragraph_open → inline → paragraph_close → list_item_close → bullet_list_close
blockquote_open → …
fence (code block)

Inline-level elements (appear within a line):

text
image
link_open → text → link_close
strong_open / strong_close
em_open / em_close
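
As a rough illustration of the inline level, the sketch below parses a single sentence with the markdown-it-py package (covered in section II and assumed to be installed) and lists the child token types found inside the paragraph's `inline` token. The sample sentence, URL, and image name are made up for this example.

```python
from markdown_it import MarkdownIt

md = MarkdownIt()
# Hypothetical sample sentence; any Markdown with inline markup works.
sample = "Some **bold**, *italic*, a [link](https://example.com) and ![logo](logo.png) inline."

tokens = md.parse(sample)
# A lone paragraph yields paragraph_open, inline, paragraph_close.
inline = tokens[1]

# Inline-level tokens are stored in the children of the `inline` token.
child_types = [child.type for child in inline.children]
print(child_types)
```

The printed list should contain the types named above, with plain `text` tokens between the marked-up spans.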

Patterns in .md file structure:

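Container blocks come in open/close pairs (e.g., heading_open / heading_close); self-contained blocks such as fence are a single token.
content holds text: it is empty on structural open/close tokens and filled in on inline tokens, their children, and code blocks such as fence.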

Markdown Tokens

Structure	Token Flow
Heading	`heading_open` → `inline` → `heading_close`
Paragraph	`paragraph_open` → `inline` → `paragraph_close`
List Item	`list_item_open` → `paragraph_open` → `inline` → `paragraph_close` → `list_item_close`
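
To see these flows directly, the following minimal sketch prints every top-level token for a small document. It assumes markdown-it-py is installed; the sample text is invented, and it uses a `~~~` fence so that a single-token block appears next to the paired ones.

```python
from markdown_it import MarkdownIt

md = MarkdownIt()
# Invented sample: a heading, a paragraph, a two-item list and a fenced block.
source = "# Title\n\nA paragraph.\n\n- item one\n- item two\n\n~~~\ncode\n~~~\n"
tokens = md.parse(source)

# nesting is 1 for an opening token, -1 for a closing token, 0 for a single token.
labels = {1: "open", 0: "single", -1: "close"}
for token in tokens:
    print(f"{'  ' * token.level}{token.type} ({labels[token.nesting]})")

types = [token.type for token in tokens]
```

Each list item wraps its text in a paragraph_open / paragraph_close pair, and the fenced block shows up as one `fence` token with no closing partner.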

Token content usage:

`type`	`content`
`heading_open`	`""` (empty)
`inline`	The raw source text of the whole block (including Markdown syntax)
`text`	Plain text content
`image`	alt text (i.e., `![alt]`)
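
The table can be checked with a short sketch like the one below, which assumes markdown-it-py is installed. The sample text and the image path `pics/diagram.png` are hypothetical.

```python
from markdown_it import MarkdownIt

md = MarkdownIt()
# Hypothetical sample: a heading, a paragraph with bold text, and an image.
tokens = md.parse("### Syntax\n\nRun **docker images** now.\n\n![diagram](pics/diagram.png)\n")

# Top-level order: heading (0-2), first paragraph (3-5), second paragraph (6-8).
heading_open, heading_inline = tokens[0], tokens[1]
paragraph_inline = tokens[4]
image = tokens[7].children[0]

print(repr(heading_open.content))      # structural token: ''
print(repr(heading_inline.content))    # 'Syntax'
print(repr(paragraph_inline.content))  # raw source, ** markers included
print([(c.type, c.content) for c in paragraph_inline.children])
print(image.type, repr(image.content))  # image 'diagram'
```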

II. Python Package: markdown-it

1. How markdown-it Works

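markdown-it (installed for Python as the markdown-it-py package and imported as markdown_it) parses a Markdown document into a flat list of block-level Tokens; inline-level Tokens are stored in the children of each inline Token. Each Token has the following attributes: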

type – The syntactic element type (the key attribute). It determines which Markdown structure the Token represents. Common types include:

Syntax	`type` examples
`# Heading`	`heading_open`, `heading_close`
Paragraph	`paragraph_open`, `paragraph_close`
Inline content	`inline`
Image `![]()`	`image`
List `- item`	`bullet_list_open`, `list_item_open`, etc.

tag – The corresponding HTML tag name (e.g., h1, p, img).

`type`	`tag`
`heading_open` (`###`)	`h3`
`paragraph_open`	`p`
`image`	`img`

content – Text content (empty on structural open/close Tokens; filled in on inline Tokens and on child Tokens such as text and image).
- For inline → content is the raw text of the entire block (e.g., "docker images").
- For text → content is plain text (the actual text node).
- For image → content is the alt text (e.g., "image-2025...").
attrs – HTML attributes (e.g., an image’s src and title are stored here).
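
A minimal sketch of reading these attributes, assuming markdown-it-py is installed; it uses `Token.attrGet` to look up a single attribute by name, and the sample heading and image are illustrative only.

```python
from markdown_it import MarkdownIt

md = MarkdownIt()
tokens = md.parse('### Syntax\n\n![Image1](images/img1.png "Image1 Title")\n')

heading_open = tokens[0]
# The image is the only child of the second block's inline token.
image = tokens[4].children[0]

print(heading_open.type, heading_open.tag, heading_open.markup)  # heading_open h3 ###
print(image.tag)               # img
print(image.attrGet("src"))    # images/img1.png
print(image.attrGet("title"))  # Image1 Title
print(image.content)           # Image1
```

In the versions I am aware of, the `alt` entry in attrs is only filled in when the document is rendered, so `content` is the safer place to read the alt text straight after `parse`.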

2. Markdown → Token Mapping

Assume the original Markdown is:

markdown

### Syntax

docker images

![image-2025xxxx](docker-learning-use-images/image-2025xxxx.png)

Use the following code to parse the document:

python

from pathlib import Path
from markdown_it import MarkdownIt

md = MarkdownIt()

md_path = Path(r"./docker-learning.md")
md_text = md_path.read_text(encoding="utf-8")

tokens = md.parse(md_text)

for t in tokens:
    print(f"type: {t.type}, tag: {t.tag}, content: {t.content}, attrs: {t.attrs}")

The resulting semantic tree (conceptual):

text

heading_open  (tag h3)
   inline -> text("Syntax")
heading_close (tag h3)

paragraph_open
   inline -> text("docker images")
paragraph_close

paragraph_open
   inline -> image (alt="image-2025...", src="docker-learning-use-images/...")
paragraph_close

This structure is well-suited for programmatic document analysis.
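
The parsing loop shown earlier prints only top-level tokens, so the image nested under `inline` never appears in its output. One possible way to print something closer to the conceptual tree is sketched below. It assumes markdown-it-py is installed; `dump` is a hypothetical helper written for this example, not part of the library.

```python
from markdown_it import MarkdownIt


def dump(tokens, depth=0):
    """Return one indented line per token, descending into children."""
    lines = []
    for token in tokens:
        if token.nesting == -1:  # a closing token steps back out one level
            depth -= 1
        label = f"{token.type} <{token.tag}>" if token.tag else token.type
        if token.type in ("text", "image"):
            label += f" {token.content!r}"
        lines.append("  " * depth + label)
        if token.children:  # inline (and image) tokens carry child tokens
            lines.extend(dump(token.children, depth + 1))
        if token.nesting == 1:  # an opening token steps in one level
            depth += 1
    return lines


source = (
    "### Syntax\n\n"
    "docker images\n\n"
    "![image-2025xxxx](docker-learning-use-images/image-2025xxxx.png)\n"
)
lines = dump(MarkdownIt().parse(source))
print("\n".join(lines))
```

The image token also keeps its alt text as a child `text` token, so the output lists that text once more beneath the image line.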

3. Common Usage of markdown-it

Basic Examples:

python

from markdown_it import MarkdownIt
from pathlib import Path

# Create the parser (install the package first: pip install markdown-it-py)
md = MarkdownIt()

# Render Markdown text to an HTML string
text = """
### Title

This is a paragraph containing **bold** and *italic* text.

![Image1](images/img1.png "Image1 Title")
"""

html = md.render(text)
print(html)

# Parse text into a token list (requires first reading the .md file into a variable)
md_path = Path(r"./docker-learning.md")
md_text = md_path.read_text(encoding="utf-8")

tokens = md.parse(md_text)

for t in tokens:
    print(f"type: {t.type}, tag: {t.tag}, content: {t.content}, attrs: {t.attrs}")
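
Building on the token list, structured extraction might look like the following sketch, which collects a heading outline and the image references from the sample text used above. It assumes markdown-it-py is installed; `extract_structure` is a hypothetical helper name chosen for this example.

```python
from markdown_it import MarkdownIt


def extract_structure(md_text):
    """Collect (level, title) headings and image references from Markdown."""
    tokens = MarkdownIt().parse(md_text)
    headings, images = [], []
    for index, token in enumerate(tokens):
        if token.type == "heading_open":
            level = int(token.tag[1:])  # "h3" -> 3
            # The inline token right after heading_open holds the heading text.
            headings.append((level, tokens[index + 1].content))
        elif token.type == "inline":
            for child in token.children or []:
                if child.type == "image":
                    images.append({
                        "alt": child.content,
                        "src": child.attrGet("src"),
                        "title": child.attrGet("title"),
                    })
    return headings, images


sample = (
    "### Title\n\n"
    "This is a paragraph containing **bold** and *italic* text.\n\n"
    '![Image1](images/img1.png "Image1 Title")\n'
)
headings, images = extract_structure(sample)
print(headings)
print(images)
```

Heading text is taken from the `inline` token's raw content, so any inline markup in a heading (such as `**bold**`) is kept as written.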
