MinerU: Turns Any PDF Into LLM-Ready Markdown or JSON, Completely Free

If you’ve ever tried to scrape data from a scientific paper or a complex PDF, you know the pain. You copy text, and suddenly the page numbers are in the middle of sentences, the math equations look like gibberish, and the multi-column layout is completely scrambled.

What is MinerU?

MinerU isn't just another generic PDF converter; it's an open-source tool born out of necessity. It was originally developed during the pre-training of InternLM (a large language model), specifically to solve the headache of extracting clean training data from messy scientific literature.

What Does MinerU Do?

It Reads Like a Human: MinerU understands layout. It knows how to navigate multi-column papers and complex formatting, ensuring the output text follows the actual reading order rather than just scraping left-to-right.
It Cleans the Mess: It automatically strips away the “noise”—headers, footers, page numbers, and footnotes—so you get just the core content.
Math & Tables Are Safe: This is huge for researchers. It converts formulas directly into LaTeX and tables into HTML, preserving the semantic meaning that usually gets lost.
It Handles Scans: Got a scanned document or garbled text? MinerU kicks into OCR mode (supporting 109 languages!) to turn images back into text.
Run It Anywhere: Whether you are on a beastly GPU rig or a standard laptop (CPU only), MinerU works on Windows, Linux, and Mac.
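
Because tables come out as HTML rather than as flattened text, they are easy to pick apart downstream. Here is a minimal sketch using only Python's standard library; the `sample` string is made up for illustration and is not real MinerU output, and `TableRows` and `table_rows` are hypothetical helper names.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the cell text of every <tr> found in a Markdown/HTML string."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # row currently being filled, or None
        self._cell = None  # text pieces of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            if self._row is not None:
                self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def table_rows(markdown_text):
    """Return every table row in the text as a list of cell strings."""
    parser = TableRows()
    parser.feed(markdown_text)
    parser.close()
    return parser.rows


# Made-up sample: Markdown prose, one HTML table, one LaTeX formula.
sample = r"""Results are summarised below.

<table><tr><th>Model</th><th>Score</th></tr><tr><td>A</td><td>0.91</td></tr></table>

The loss is $L = \sum_i (y_i - \hat{y}_i)^2$.
"""

print(table_rows(sample))
```

The prose and the LaTeX formula are ignored because they sit outside any table cell, so the same function can be pointed at a whole converted document.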

Features

Remove headers, footers, footnotes, page numbers and other elements to ensure semantic coherence
Output text in human reading order, suitable for single-column, multi-column and complex layouts
Retain the original document structure, including titles, paragraphs, lists, etc.
Extract images, image descriptions, tables, table titles and footnotes
Automatically identify and convert formulas in documents to LaTeX format
Automatically identify and convert tables in documents to HTML format
Automatically detect scanned PDFs and garbled PDFs, and enable OCR functionality
OCR supports detection and recognition of 109 languages
Support multiple output formats, such as multimodal and NLP Markdown, reading-order-sorted JSON, and information-rich intermediate formats
Support multiple visualization outputs, including layout visualization and span visualization, for quickly checking output quality
Support CPU-only operation, with optional GPU (CUDA), NPU (CANN), or MPS acceleration
Compatible with Windows, Linux and Mac platforms
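
The reading-order-sorted JSON lends itself to simple filtering once it is loaded. The sketch below assumes the file is a list of blocks, each with `type`, `text`, and `page_idx` fields. Those field names and the sample data are assumptions made for illustration, so check a real output file before relying on them.

```python
import json

# Hypothetical sample standing in for a reading-order-sorted JSON export.
# The field names ("type", "text", "page_idx") are assumed, not confirmed.
sample_json = """
[
  {"type": "text", "text": "1 Introduction", "page_idx": 0},
  {"type": "text", "text": "PDF extraction is hard.", "page_idx": 0},
  {"type": "table", "text": "", "page_idx": 1},
  {"type": "text", "text": "We propose a pipeline.", "page_idx": 1}
]
"""


def text_by_page(blocks):
    """Group the text blocks by page, keeping the order they arrive in."""
    pages = {}
    for block in blocks:
        if block.get("type") != "text":
            continue  # skip tables, images, formulas, and so on
        page = block.get("page_idx", 0)
        pages.setdefault(page, []).append(block.get("text", ""))
    return pages


blocks = json.loads(sample_json)
for page, chunks in text_by_page(blocks).items():
    print(page, " ".join(chunks))
```

Since the blocks are already in reading order, no sorting is needed; the function only has to preserve the order it is given.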

License

AGPL-3.0 License.
