MinerU: Turns Any PDF Into LLM-Ready Markdown or JSON, Completely Free

amy 14/01/2026

If you’ve ever tried to scrape data from a scientific paper or a complex PDF, you know the pain. You copy text, and suddenly the page numbers are in the middle of sentences, the math equations look like gibberish, and the multi-column layout is completely scrambled.

What is MinerU?

MinerU isn’t just another generic PDF converter; it’s an open-source tool born out of necessity. It was originally developed during the pre-training of InternLM (a large language model) specifically to solve the headache of extracting clean training data from messy scientific literature.

What Does MinerU Do?

  • It Reads Like a Human: MinerU understands layout. It knows how to navigate multi-column papers and complex formatting, ensuring the output text follows the actual reading order rather than just scraping left-to-right.
  • It Cleans the Mess: It automatically strips away the “noise”—headers, footers, page numbers, and footnotes—so you get just the core content.
  • Math & Tables Are Safe: This is huge for researchers. It converts formulas directly into LaTeX and tables into HTML, preserving the semantic meaning that usually gets lost.
  • It Handles Scans: Got a scanned document or garbled text? MinerU kicks into OCR mode (supporting 109 languages!) to turn images back into text.
  • Run It Anywhere: Whether you are on a beastly GPU rig or a standard laptop (CPU only), MinerU works on Windows, Linux, and Mac.
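
Because tables arrive as HTML rather than flattened text, you can pull them back into rows with nothing but the Python standard library. The snippet below is a minimal sketch, not part of MinerU: the `TableRows` class, the `table_to_rows` helper, and the sample table are all made up for illustration, and real output may contain attributes or nested markup this sketch does not handle.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the cell text of a simple HTML table as a list of rows."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # row being built, or None outside <tr>
        self._cell = None  # text chunks of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def table_to_rows(html):
    """Hypothetical helper: parse one HTML table string into rows of cells."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical sample, not real MinerU output.
SAMPLE_TABLE = (
    "<table><tr><th>Model</th><th>Score</th></tr>"
    "<tr><td>A</td><td>0.91</td></tr></table>"
)
rows = table_to_rows(SAMPLE_TABLE)
```

From here, `rows` can go straight into `csv.writer` or a DataFrame, which is the practical payoff of keeping table structure instead of scraping loose text.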

Features

  • Remove headers, footers, footnotes, page numbers and other elements to ensure semantic coherence
  • Output text in human reading order, suitable for single-column, multi-column and complex layouts
  • Retain the original document structure, including titles, paragraphs, lists, etc.
  • Extract images, image descriptions, tables, table titles and footnotes
  • Automatically identify and convert formulas in documents to LaTeX format
  • Automatically identify and convert tables in documents to HTML format
  • Automatically detect scanned PDFs and garbled PDFs, and enable OCR functionality
  • OCR supports detection and recognition of 109 languages
  • Support multiple output formats, such as multimodal and NLP Markdown, reading-order-sorted JSON, and information-rich intermediate formats
  • Support multiple visualizations, including layout visualization, span visualization, etc., for quick checks of output quality
  • Run in CPU-only environments, with optional GPU (CUDA), NPU (CANN), or MPS acceleration
  • Compatible with Windows, Linux and Mac platforms
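
To give a feel for why reading-order-sorted JSON is handy, here is a small sketch that walks a list of blocks and stitches them into one text stream for an LLM. Treat everything in it as an assumption: the field names (`type`, `text`, `latex`, `html`), the sample data, and the `blocks_to_text` helper are illustrative and not MinerU's documented schema, so check the real output files before relying on any of them.

```python
import json

# Hypothetical sample in the spirit of a reading-order JSON export.
# Field names here are illustrative assumptions, not MinerU's schema.
SAMPLE_JSON = """
[
  {"type": "title", "text": "Results"},
  {"type": "text", "text": "Accuracy improves with depth."},
  {"type": "equation", "latex": "E = mc^2"},
  {"type": "table", "html": "<table><tr><td>1</td></tr></table>"}
]
"""


def blocks_to_text(blocks):
    """Flatten blocks, assumed to be in reading order, into one string."""
    parts = []
    for block in blocks:
        kind = block.get("type")
        if kind == "title":
            parts.append("# " + block["text"])
        elif kind == "text":
            parts.append(block["text"])
        elif kind == "equation":
            # Keep formulas as display-math LaTeX.
            parts.append("$$" + block["latex"] + "$$")
        elif kind == "table":
            # Keep tables as HTML so their structure survives.
            parts.append(block["html"])
        # Unknown block types are skipped rather than guessed at.
    return "\n\n".join(parts)


document = blocks_to_text(json.loads(SAMPLE_JSON))
```

Since the blocks are already sorted, the consumer never has to reason about columns or page geometry; it only decides how each block type should be rendered.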

License

AGPL-3.0 License.

Resources & Downloads