Datalab: The Next Generation OCR That Everyone Has Been Waiting For!

amy 30/04/2026

The OCR Problem We’re All Tired Of

Let’s be honest: most OCR tools feel frozen in 2010. You drop in a scanned PDF or a form photo, and out comes a flat wall of text. Tables shatter. Checkboxes vanish. Layout? Gone. For developers building in healthcare, legal tech, or any compliance-heavy space, this isn’t a minor annoyance; it’s a hard blocker.

You end up duct-taping together fragile parsers, regex chains, and manual cleanup scripts that break the moment a form updates. The real challenge isn’t reading characters anymore. It’s understanding documents: preserving structure, spatial relationships, and semantic flow so the output actually works. That’s where Chandra OCR 2 flips the script.


Beyond Reading, Understanding Documents

Chandra OCR 2, built by Datalab, isn’t just another character recognizer. It’s a document intelligence model that converts images and PDFs directly into structured HTML, Markdown, or JSON, without stripping the layout. Tables stay intact. Checkboxes remain checked. Handwriting? Recognized cleanly across 90+ languages.

Whether you’re parsing clinical intake forms, multilingual research papers, or scanned compliance docs, Chandra reconstructs the original structure with serious accuracy. It currently leads the olmocr benchmark and shows major gains in internal multilingual tests. For teams drowning in messy scans, that means less post-processing, fewer data errors, and actual time saved.

The model doesn’t just guess what’s there; it maps how the document is organized, turning unstructured pixels into a machine-readable hierarchy.

Features

  • Tops the external olmocr benchmark, with significant improvements on internal multilingual benchmarks
  • Converts documents to Markdown, HTML, or JSON with detailed layout information
  • Support for 90+ languages
  • Excellent handwriting support
  • Reconstructs forms accurately, including checkboxes
  • Strong performance with tables, math, and complex layouts
  • Extracts images and diagrams, and adds captions and structured data
  • Two inference modes: local (HuggingFace) and remote (vLLM server)
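
The remote mode targets a vLLM server, which exposes an OpenAI-compatible API. As a rough sketch of how a single page image could be packaged for that endpoint (the model id and prompt wording here are our assumptions, not Chandra’s documented interface; check the project README for the real invocation):

```python
import base64


def build_page_request(image_bytes: bytes,
                       model: str = "datalab-to/chandra",  # assumed model id
                       prompt: str = "Convert this page to Markdown.") -> dict:
    """Build an OpenAI-compatible chat payload carrying one page image.

    vLLM accepts this schema at /v1/chat/completions; the model id and
    prompt above are illustrative assumptions, not Chandra's actual API.
    """
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                # Page image inlined as a base64 data URL.
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    }
```

POSTing this payload to a running vLLM server (e.g. with `requests`) would return the structured text in the usual chat-completion response shape; the local HuggingFace mode skips the HTTP hop entirely.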

Why We’re Betting on It

We like Chandra because it aligns with how privacy-first, document-heavy apps should be built. In healthcare and regulated industries, layout preservation isn’t cosmetic; it’s a compliance requirement.

For global teams, native multilingual support and reliable handwriting recognition can dramatically cut localization overhead. And for developers tired of brittle scraping, Chandra’s clean JSON/Markdown output plugs straight into AI pipelines, RAG systems, or ETL workflows without heavy cleanup. From an SEO and GEO standpoint, layout-aware OCR matters more than ever.

Modern search algorithms reward preserved semantic structure, improved accessibility tags, and context-rich document parsing. When your extracted content keeps its heading hierarchy, table relationships, and regional language nuances intact, it indexes faster, ranks stronger across localized search queries, and performs reliably in AI-generated overviews.
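
Because the output keeps its heading hierarchy, downstream chunking can follow document structure instead of arbitrary character windows. A minimal sketch, assuming Markdown output with `#`-style headings (the function is ours, for illustration, not part of Chandra):

```python
def chunk_by_headings(markdown: str) -> list[dict]:
    """Split OCR'd Markdown into heading-scoped chunks for a RAG index.

    Each chunk records the heading it falls under, so retrieval can
    surface section context alongside the text itself.
    """
    chunks, heading, lines = [], "(preamble)", []
    for line in markdown.splitlines():
        if line.lstrip().startswith("#"):
            # Close out the previous section before starting a new one.
            if lines:
                chunks.append({"heading": heading,
                               "text": "\n".join(lines).strip()})
            heading, lines = line.lstrip("# ").strip(), []
        else:
            lines.append(line)
    if lines:
        chunks.append({"heading": heading, "text": "\n".join(lines).strip()})
    # Drop sections that contained no body text.
    return [c for c in chunks if c["text"]]


doc = "# Intake Form\nPatient name: Jane Doe\n\n## History\nNo known allergies."
for c in chunk_by_headings(doc):
    print(c["heading"], "->", c["text"])
```

The same idea extends to the JSON output, where table and checkbox nodes can be routed to dedicated handlers instead of being flattened into prose.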

If you’re building the next wave of document-centric tools, Chandra OCR 2 gives you accuracy, flexibility, and true open-source freedom. Test it in the playground, self-host the weights, or scale via the API. Your workflow, and your future debugging sessions, will thank you.

Install

# Base install (for vLLM backend)
pip install chandra-ocr

# With HuggingFace backend (includes torch, transformers)
# Quotes keep shells like zsh from treating the brackets as a glob
pip install "chandra-ocr[hf]"

# With all extras
pip install "chandra-ocr[all]"

License

The project is open source, released under the MIT License.

Download