If you’ve ever tried to scrape data from a scientific paper or a complex PDF, you know the pain. You copy text, and suddenly the page numbers are in the middle of sentences, the math equations look like gibberish, and the multi-column layout is completely scrambled.
What is MinerU?
MinerU isn’t just another generic PDF converter; it’s an open-source tool born out of necessity. It was originally developed during the pre-training of InternLM, a large language model, to solve the headache of extracting clean training data from messy scientific literature.
What Does MinerU Do?
- It Reads Like a Human: MinerU understands layout. It knows how to navigate multi-column papers and complex formatting, ensuring the output text follows the actual reading order rather than just scraping left-to-right.
- It Cleans the Mess: It automatically strips away the “noise” (headers, footers, page numbers, and footnotes) so you get just the core content.
- Math & Tables Are Safe: This is huge for researchers. It converts formulas directly into LaTeX and tables into HTML, preserving the semantic meaning that usually gets lost.
- It Handles Scans: Got a scanned document or garbled text? MinerU kicks into OCR mode (supporting 109 languages!) to turn images back into text.
- Run It Anywhere: MinerU runs on a beastly GPU rig or on a standard CPU-only laptop, and it works on Windows, Linux, and Mac.
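Because tables come out as HTML, you can read them with standard tooling instead of guessing at column boundaries. Here is a minimal sketch using only Python's standard library. It assumes the table is a plain `<table>` made of `<tr>` rows with `<th>`/`<td>` cells; the `TableRows` class, the `table_rows` helper, and the sample string are illustrative and not part of MinerU.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of each cell, grouped by table row (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_rows(html):
    """Return the table as a list of rows, each a list of cell strings."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


# Made-up sample standing in for a table found in converted output.
sample = (
    "<table>"
    "<tr><th>Model</th><th>Score</th></tr>"
    "<tr><td>A</td><td>0.91</td></tr>"
    "</table>"
)
rows = table_rows(sample)
print(rows)
```

This sketch ignores `colspan`, `rowspan`, and nested tables, so treat it as a starting point for simple tables only.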
Features
- Remove headers, footers, footnotes, page numbers and other elements to ensure semantic coherence
- Output text in human reading order, suitable for single-column, multi-column and complex layouts
- Retain the original document structure, including titles, paragraphs, lists, etc.
- Extract images, image descriptions, tables, table titles and footnotes
- Automatically identify and convert formulas in documents to LaTeX format
- Automatically identify and convert tables in documents to HTML format
- Automatically detect scanned PDFs and garbled PDFs, and enable OCR functionality
- OCR supports detection and recognition of 109 languages
- Support multiple output formats, such as multimodal and NLP Markdown, JSON sorted by reading order, and information-rich intermediate formats
- Support multiple visualization results, including layout visualization and span visualization, to quickly check output quality
- Support running in a pure CPU environment, as well as GPU (CUDA), NPU (CANN), and MPS acceleration
- Compatible with Windows, Linux and Mac platforms
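Since formulas are converted to LaTeX, they are easy to pull out of the Markdown output for indexing or checking. The following is a minimal sketch, assuming inline formulas are wrapped in `$...$` and display formulas in `$$...$$`; that delimiter convention, the `extract_formulas` helper, and the sample text are assumptions made for illustration, not MinerU's documented behavior.

```python
import re

# Assumed delimiters: $$...$$ for display math, $...$ for inline math.
DISPLAY = re.compile(r"\$\$(.+?)\$\$", re.DOTALL)
INLINE = re.compile(r"\$(.+?)\$")


def extract_formulas(markdown):
    """Return (inline, display) lists of LaTeX strings found in the text."""
    display = [m.strip() for m in DISPLAY.findall(markdown)]
    # Remove display blocks first so their $$ markers are not read as inline math.
    without_display = DISPLAY.sub("", markdown)
    inline = [m.strip() for m in INLINE.findall(without_display)]
    return inline, display


# Made-up sample standing in for converted Markdown.
sample = r"""Energy is $E = mc^2$ in short.

$$
\int_0^1 x\,dx = \frac{1}{2}
$$
"""

inline, display = extract_formulas(sample)
print("inline:", inline)
print("display:", display)
```

The regular expressions do not handle escaped dollar signs (`\$`) or currency amounts in running text, so real documents may need stricter rules.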
License
AGPL-3.0 License.




