Markdown Document Format Analysis and Structured Parsing of .md Files Using Python

david 17/12/2025

I. Markdown Documents

Essentially, a Markdown document has two layers: a tree of block-level elements, plus inline-level structure inside each block.

Block-level elements (structure):

  • heading_open → inline → heading_close
  • paragraph_open → inline → paragraph_close
  • bullet_list_open → list_item_open → paragraph_open → inline → paragraph_close → list_item_close → bullet_list_close
  • blockquote_open → …
  • fence (code block)

Inline-level elements (appear within a line):

  • text
  • image
  • link_open → inline → link_close
  • strong_open / strong_close
  • em_open / em_close

Patterns in .md file structure:

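  • A container block always appears as an open/close pair (e.g., heading_open / heading_close); a leaf block such as fence is a single token.
  • content carries text on inline tokens and their children (and on code tokens such as fence); the paired structural _open / _close tokens have empty content.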

Markdown Token

Structure    Token Flow
Heading      heading_open → inline → heading_close
Paragraph    paragraph_open → inline → paragraph_close
List Item    list_item_open → paragraph_open → inline → paragraph_close → list_item_close

Token content usage:

type            content
heading_open    "" (empty)
inline          Raw text of the whole block (including Markdown syntax)
text            Plain text content
image           alt text (i.e., the alt in ![alt](src))
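The open/close pairing described above can be checked mechanically. The following is a minimal sketch in plain Python, assuming a hand-written token stream of (type, content) tuples rather than real parser output; the helper name `nest` and the sample data are hypothetical. It uses a stack to turn the flat open → inline → close sequence into a nested structure.

```python
# Hypothetical stand-in for a parser's flat token stream: (type, content) pairs.
FLAT = [
    ("heading_open", ""),
    ("inline", "Syntax"),
    ("heading_close", ""),
    ("paragraph_open", ""),
    ("inline", "docker images"),
    ("paragraph_close", ""),
]


def nest(flat):
    """Rebuild a nested tree from a flat open/close token sequence."""
    root = {"type": "root", "content": "", "children": []}
    stack = [root]
    for tok_type, content in flat:
        if tok_type.endswith("_open"):
            name = tok_type[: -len("_open")]
            node = {"type": name, "content": content, "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)
        elif tok_type.endswith("_close"):
            opened = stack.pop()
            # Every *_close must match the most recent *_open.
            if opened["type"] != tok_type[: -len("_close")]:
                raise ValueError(f"unbalanced token: {tok_type}")
        else:
            # Single tokens (inline, fence, ...) attach to the current container.
            leaf = {"type": tok_type, "content": content, "children": []}
            stack[-1]["children"].append(leaf)
    if len(stack) != 1:
        raise ValueError("unclosed block")
    return root


tree = nest(FLAT)
for block in tree["children"]:
    print(block["type"], "->", [child["content"] for child in block["children"]])
```

Only the inline entries carry text here; the structural entries exist purely to mark where a block starts and ends.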

II. Python Package: markdown-it (published on PyPI as markdown-it-py)

1. How markdown-it Works

markdown-it parses a Markdown document into a flat list of Tokens. Each Token has the following attributes:

  • type – the syntactic element type (the key attribute). It determines which Markdown structure the Token represents. Common types include:

      Syntax            type examples
      # Heading         heading_open, heading_close
      Paragraph         paragraph_open, paragraph_close
      Inline content    inline
      Image ![]()       image
      List - item       bullet_list_open, list_item_open, etc.

  • tag – the corresponding HTML tag name (e.g., h1, p, img).

      type                  tag
      heading_open (###)    h3
      paragraph_open        p
      image                 img

  • content – Text content (mainly present on inline Tokens and their child Tokens such as text and image).
    • For inline → content is the raw text of the whole block (e.g., "docker images").
    • For text → content is plain text (the actual text node).
    • For image → content is the alt text (e.g., "image-2025...").
  • attrs – HTML attributes (e.g., an image’s src, alt, and title are all here).
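To make these attributes concrete, the sketch below parses a two-line sample and reads each attribute in turn. It assumes the markdown-it-py package is installed; the heading text and image path in `SAMPLE` are invented for illustration. Single attributes are read with `Token.attrGet`.

```python
from markdown_it import MarkdownIt

# Illustrative sample; the heading text and image path are invented.
SAMPLE = '### Demo\n\n![logo](images/logo.png "Logo title")\n'

tokens = MarkdownIt().parse(SAMPLE)

# Structural token: it has a type and a tag, but its content is empty.
heading_open = tokens[0]
print(heading_open.type, heading_open.tag, repr(heading_open.content))

# The image is not a top-level token: it is a child of the second inline token
# (the first inline token belongs to the heading).
inline = [t for t in tokens if t.type == "inline"][1]
image = inline.children[0]
print(inline.type, repr(inline.content))
print(image.type, image.tag, repr(image.content))
print("src:", image.attrGet("src"), "| title:", image.attrGet("title"))
```

Note that the inline token keeps the raw Markdown source, while the image child exposes the parsed pieces.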

2. Markdown → Token Mapping

Assume the original Markdown is:

markdown

### Syntax

docker images

![image-2025xxxx](docker-learning-use-images/image-2025xxxx.png)

Use the following code to parse the document:

python

from pathlib import Path
from markdown_it import MarkdownIt

md = MarkdownIt()

md_path = Path(r"./docker-learning.md")
md_text = md_path.read_text(encoding="utf-8")

tokens = md.parse(md_text)

for t in tokens:
    print(f"type: {t.type}, tag: {t.tag}, content: {t.content}, attrs: {t.attrs}")

The resulting semantic tree (conceptual):

text

heading_open  (tag h3)
   inline -> text("Syntax")
heading_close (tag h3)

paragraph_open
   inline -> text("docker images")
paragraph_close

paragraph_open
   inline -> image (alt="image-2025...", src="docker-learning-use-images/...")
paragraph_close
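A tree like this can be printed from the real token list. The following is a minimal sketch, assuming markdown-it-py is installed; the helper name `dump` is hypothetical. It indents by each token's `nesting` value (1 for an opening token, 0 for a self-contained one, -1 for a closing one) and descends into `children`, which is where text and image tokens live. A loop over the top-level tokens alone never reaches them.

```python
from markdown_it import MarkdownIt

MD_TEXT = """### Syntax

docker images

![image-2025xxxx](docker-learning-use-images/image-2025xxxx.png)
"""


def dump(tokens):
    """Return one indented line per token, descending into inline children."""
    lines = []
    depth = 0
    for tok in tokens:
        if tok.nesting == -1:  # closing token: step back out first
            depth -= 1
        lines.append("   " * depth + tok.type)
        # Only inline tokens have children; for all others children is None.
        for child in tok.children or []:
            lines.append("   " * (depth + 1) + f"{child.type}: {child.content!r}")
        if tok.nesting == 1:  # opening token: following tokens are nested
            depth += 1
    return lines


print("\n".join(dump(MarkdownIt().parse(MD_TEXT))))
```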

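This structure is well suited to programmatic document analysis: headings, paragraphs, and images can be located by token type instead of by pattern-matching the raw text.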
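As one illustration of such analysis, the sketch below collects an outline of headings and a list of image paths from a token list. It assumes markdown-it-py is installed; the function name `summarize` and the sample text are hypothetical, introduced only for this example.

```python
from markdown_it import MarkdownIt

# Invented sample document for the example.
SAMPLE = """# Docker notes

## Images

docker images

![listing](img/listing.png)
"""


def summarize(md_text):
    """Collect (level, title) pairs for headings and the src of every image."""
    tokens = MarkdownIt().parse(md_text)
    headings, images = [], []
    for i, tok in enumerate(tokens):
        if tok.type == "heading_open":
            # The heading text lives in the inline token that follows heading_open.
            level = int(tok.tag[1:])  # "h2" -> 2
            headings.append((level, tokens[i + 1].content))
        elif tok.type == "inline":
            # Images are children of inline tokens, never top-level tokens.
            images += [c.attrGet("src") for c in tok.children or [] if c.type == "image"]
    return headings, images


headings, images = summarize(SAMPLE)
for level, title in headings:
    print("  " * (level - 1) + title)
print(images)
```

The heading level is taken from the tag (h1, h2, ...) rather than by counting "#" characters.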

3. Common Usage of markdown-it

Basic Examples:

python

from markdown_it import MarkdownIt
from pathlib import Path

# Create the parser (install the package first: pip install markdown-it-py)
md = MarkdownIt()

# Render Markdown text to an HTML string
text = """
### Title

This is a paragraph containing **bold** and *italic* text.

![Image1](images/img1.png "Image1 Title")
"""

html = md.render(text)
print(html)

# Parse a .md file into a token list (read the file into a string first)
md_path = Path(r"./docker-learning.md")
md_text = md_path.read_text(encoding="utf-8")

tokens = md.parse(md_text)

for t in tokens:
    print(f"type: {t.type}, tag: {t.tag}, content: {t.content}, attrs: {t.attrs}")
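When a nested view is more convenient than the flat list, markdown-it-py also provides a `SyntaxTreeNode` helper in `markdown_it.tree` that folds each open/close pair into a single node. The following is a minimal sketch, assuming a reasonably recent markdown-it-py release and reusing the sample text from the rendering example:

```python
from markdown_it import MarkdownIt
from markdown_it.tree import SyntaxTreeNode

text = """
### Title

This is a paragraph containing **bold** and *italic* text.

![Image1](images/img1.png "Image1 Title")
"""

root = SyntaxTreeNode(MarkdownIt().parse(text))

# Top-level nodes: each heading_open/heading_close pair becomes one "heading" node.
print([node.type for node in root.children])

# walk() visits the node itself and then all of its descendants.
for node in root.walk():
    if node.type == "image":
        print(node.attrGet("src"), node.attrGet("title"))
```

Whether to work with the flat list or the tree is mostly a matter of convenience: the flat list mirrors what the renderer consumes, while the tree is easier to query by structure.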