MinerU Cheat Sheet

Overview

MinerU is an open-source tool developed by OpenDataLab for converting PDFs, web pages, and e-books into high-quality machine-readable formats (Markdown, JSON). It uses deep learning models for layout detection, text recognition (OCR), mathematical formula recognition (LaTeX), and table structure extraction. MinerU handles complex document layouts including multi-column text, mixed text-and-image regions, and scientific notation.

The tool is part of the OpenDataLab ecosystem and supports both CPU and GPU processing. It produces Markdown output with formatted tables, equations, and image references, which makes it well suited to building LLM training datasets and populating RAG knowledge bases.

Installation

# Basic install (the quotes keep shells such as zsh from expanding the brackets)
pip install "magic-pdf[full]"

# CPU only
pip install "magic-pdf[lite]"

# Download required models
pip install huggingface_hub
python -c "
from huggingface_hub import snapshot_download
snapshot_download('opendatalab/PDF-Extract-Kit', local_dir='./models')
"

# Set model directory
export MINERU_MODEL_DIR=./models

Configuration File

// ~/magic-pdf.json (path label only; leave this line out of the file, since JSON does not allow comments)
{
  "models-dir": "/path/to/models",
  "device-mode": "cuda",
  "layout-config": {
    "model": "doclayout_yolo"
  },
  "formula-config": {
    "enable": true,
    "model": "unimernet"
  },
  "table-config": {
    "enable": true,
    "model": "tablemaster"
  }
}
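
A typo in the config file is easy to miss until a run fails, so a quick sanity check can help. The following is a minimal sketch, assuming the key names shown in the example config; `check_config`, `load_config`, and the set of accepted device values are hypothetical helpers for illustration, not part of MinerU.

```python
import json
from pathlib import Path

# Key names are copied from the example config; the set of accepted
# device values is an assumption made for this sketch.
EXPECTED_KEYS = ("models-dir", "device-mode")
KNOWN_DEVICES = ("cpu", "cuda")


def check_config(config: dict) -> list[str]:
    """Return a list of human-readable problems; empty means none were found."""
    problems = []
    for key in EXPECTED_KEYS:
        if key not in config:
            problems.append(f"missing key: {key}")
    device = config.get("device-mode")
    if device is not None and device not in KNOWN_DEVICES:
        problems.append(f"unexpected device-mode: {device}")
    for section in ("formula-config", "table-config"):
        enable = config.get(section, {}).get("enable")
        if enable is not None and not isinstance(enable, bool):
            problems.append(f"{section}.enable should be true or false")
    return problems


def load_config(path: str) -> dict:
    """Parse the config file; it must be pure JSON, without // comment lines."""
    return json.loads(Path(path).expanduser().read_text(encoding="utf-8"))
```

Calling `check_config(load_config(path))` before a long batch run surfaces missing keys early.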

Core Usage

Command Line

# Convert single PDF to Markdown
magic-pdf -p paper.pdf -o output/ -m auto

# Specify processing method
magic-pdf -p paper.pdf -o output/ -m ocr    # Force OCR
magic-pdf -p paper.pdf -o output/ -m txt    # Text extraction only
magic-pdf -p paper.pdf -o output/ -m auto   # Auto-detect

# Process specific pages
magic-pdf -p paper.pdf -o output/ -m auto --start-page 0 --end-page 10

# Process directory of PDFs
magic-pdf -p /path/to/pdfs/ -o output/ -m auto

# Output as JSON instead of Markdown
magic-pdf -p paper.pdf -o output/ -m auto --output-format json
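
When the CLI is driven from a script, building the argument list in one place avoids quoting mistakes. This is a minimal sketch that uses only the `-p`, `-o`, and `-m` options shown above; `build_command` and `convert` are hypothetical wrapper names.

```python
import subprocess

VALID_METHODS = ("auto", "txt", "ocr")


def build_command(pdf_path: str, output_dir: str, method: str = "auto") -> list[str]:
    """Build the argument list for one magic-pdf run."""
    if method not in VALID_METHODS:
        raise ValueError(f"method must be one of {VALID_METHODS}, got {method!r}")
    return ["magic-pdf", "-p", pdf_path, "-o", output_dir, "-m", method]


def convert(pdf_path: str, output_dir: str, method: str = "auto") -> int:
    """Run the CLI and return its exit code (requires magic-pdf on PATH)."""
    completed = subprocess.run(build_command(pdf_path, output_dir, method), check=False)
    return completed.returncode
```

Passing a list to `subprocess.run` sidesteps shell interpretation, so paths with spaces need no extra quoting.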

Python API

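import json
import os

from magic_pdf.data.data_reader_writer import FileBasedDataWriter
from magic_pdf.pipe.UNIPipe import UNIPipe
# OCRPipe and TXTPipe force a single method instead of auto-detection
from magic_pdf.pipe.OCRPipe import OCRPipe
from magic_pdf.pipe.TXTPipe import TXTPipe

# Read PDF
pdf_path = "paper.pdf"
with open(pdf_path, "rb") as f:
    pdf_bytes = f.read()

# Auto pipeline (recommended)
output_dir = "./output"
os.makedirs(f"{output_dir}/images", exist_ok=True)
image_writer = FileBasedDataWriter(f"{output_dir}/images")

# UNIPipe reads "_pdf_type" and "model_list" from its second argument,
# so pass a dict rather than a bare list
jso_useful_key = {"_pdf_type": "", "model_list": []}
pipe = UNIPipe(pdf_bytes, jso_useful_key, image_writer)
pipe.pipe_classify()  # decide between text and OCR processing
pipe.pipe_analyze()   # run the layout, formula, and table models
pipe.pipe_parse()     # build the intermediate document structure

# Get Markdown output
markdown = pipe.pipe_mk_markdown(f"{output_dir}/images")
with open(f"{output_dir}/output.md", "w", encoding="utf-8") as f:
    f.write(markdown)

# Get structured JSON
content_list = pipe.pipe_mk_uni_format(f"{output_dir}/images")
with open(f"{output_dir}/output.json", "w", encoding="utf-8") as f:
    json.dump(content_list, f, ensure_ascii=False, indent=2)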

Processing Methods

| Method | Description | Best For |
|--------|-------------|----------|
| `auto` | Auto-detect text vs scanned PDF | General use |
| `txt` | Extract embedded text layer | Born-digital PDFs |
| `ocr` | Full OCR processing | Scanned documents, images |
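
To make the `txt` versus `ocr` distinction concrete, here is a rough heuristic for choosing a method up front. It is an illustrative sketch and not the classifier MinerU uses in `auto` mode; `choose_method` and both thresholds are assumptions. The input is the number of characters that any PDF text extractor returned for each page.

```python
def choose_method(chars_per_page: list[int], min_chars: int = 50,
                  min_text_ratio: float = 0.8) -> str:
    """Pick "txt" when most pages carry an embedded text layer, else "ocr".

    chars_per_page holds the character count extracted from each page.
    Both thresholds are illustrative defaults, not tuned values.
    """
    if not chars_per_page:
        return "ocr"  # nothing extractable, so treat the file as scanned
    text_pages = sum(1 for count in chars_per_page if count >= min_chars)
    ratio = text_pages / len(chars_per_page)
    return "txt" if ratio >= min_text_ratio else "ocr"
```

A born-digital paper yields hundreds of characters per page and lands on `txt`; a scan yields close to zero and lands on `ocr`.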

Output Structure

Markdown Output

# Paper Title

## Abstract

This paper presents a novel approach to document understanding...

## 1 Introduction

The equation $E = mc^2$ is fundamental. The full derivation:

$$\int_{0}^{\infty} e^{-x^2} dx = \frac{\sqrt{\pi}}{2}$$

| Method | Precision | Recall | F1 |
|--------|-----------|--------|-----|
| Baseline | 0.82 | 0.79 | 0.80 |
| Ours | 0.93 | 0.91 | 0.92 |

Figure 1: Architecture overview of the proposed system.
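
Downstream code often needs to pull equations and tables back out of the generated Markdown. The sketch below assumes display math is delimited by `$$` and tables use pipe syntax, as in the sample output; `extract_display_math` and `extract_table_rows` are hypothetical helpers, and cells containing escaped pipes are not handled.

```python
import re

DISPLAY_MATH = re.compile(r"\$\$(.+?)\$\$", re.DOTALL)


def extract_display_math(markdown: str) -> list[str]:
    """Return the LaTeX body of every $$...$$ block, stripped of whitespace."""
    return [match.strip() for match in DISPLAY_MATH.findall(markdown)]


def extract_table_rows(markdown: str) -> list[list[str]]:
    """Return the cells of every pipe-table row, skipping |---| separator rows."""
    rows = []
    for line in markdown.splitlines():
        line = line.strip()
        if not (line.startswith("|") and line.endswith("|")):
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if all(cell and set(cell) <= set("-:") for cell in cells):
            continue  # header separator such as |---|:---:|
        rows.append(cells)
    return rows
```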

JSON Output Structure

[
  {
    "type": "title",
    "text": "Paper Title",
    "page_no": 0,
    "bbox": [72, 50, 540, 80]
  },
  {
    "type": "text",
    "text": "This paper presents...",
    "page_no": 0,
    "bbox": [72, 100, 540, 200]
  },
  {
    "type": "table",
    "text": "| Method | Precision |...",
    "html": "<table>...</table>",
    "page_no": 1,
    "bbox": [72, 300, 540, 450]
  },
  {
    "type": "equation",
    "text": "$$E = mc^2$$",
    "latex": "E = mc^2",
    "page_no": 1,
    "bbox": [200, 500, 400, 530]
  },
  {
    "type": "image",
    "path": "images/figure_1.png",
    "caption": "Architecture overview",
    "page_no": 2,
    "bbox": [72, 100, 540, 400]
  }
]
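
A content list in this shape is straightforward to post-process. The following sketch assumes the field names shown in the sample (`type`, `page_no`, `bbox`) and that `bbox` is ordered `[x0, y0, x1, y1]`; the helper names are hypothetical.

```python
from collections import defaultdict


def group_by_page(content_list: list[dict]) -> dict[int, list[dict]]:
    """Map each page number to its blocks, preserving document order."""
    pages = defaultdict(list)
    for block in content_list:
        pages[block["page_no"]].append(block)
    return dict(pages)


def blocks_of_type(content_list: list[dict], block_type: str) -> list[dict]:
    """Return only the blocks whose "type" field matches block_type."""
    return [block for block in content_list if block.get("type") == block_type]


def bbox_area(block: dict) -> float:
    """Area of a block, assuming bbox is [x0, y0, x1, y1]."""
    x0, y0, x1, y1 = block["bbox"]
    return (x1 - x0) * (y1 - y0)
```

For example, `blocks_of_type(content_list, "table")` collects every table so its `html` field can be exported separately.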

Configuration

Model Options

| Component | Models | Description |
|-----------|--------|-------------|
| Layout | `doclayout_yolo`, `layoutlmv3` | Page layout detection |
| OCR | `paddleocr`, `rapidocr` | Text recognition |
| Formula | `unimernet`, `pix2tex` | Math formula to LaTeX |
| Table | `tablemaster`, `structeqtable` | Table structure recognition |

Performance Tuning

// ~/magic-pdf.json (path label only; leave this line out of the file, since JSON does not allow comments)
{
  "models-dir": "/path/to/models",
  "device-mode": "cuda",
  "layout-config": {
    "model": "doclayout_yolo",
    "batch_size": 8
  },
  "ocr-config": {
    "lang": "en",
    "use_gpu": true
  },
  "formula-config": {
    "enable": true,
    "max_batch_size": 32
  },
  "table-config": {
    "enable": true,
    "max_time": 60
  },
  "debug-mode": false
}
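
A common response to GPU memory pressure is to shrink batch sizes or fall back to CPU. The sketch below derives such variants from a config dictionary without editing the original; it assumes the key names used in the tuning example, and `low_memory_variant` and `cpu_variant` are hypothetical helpers.

```python
import copy


def low_memory_variant(config: dict, batch_divisor: int = 2) -> dict:
    """Return a copy of config with smaller batch sizes; the input is not modified."""
    tuned = copy.deepcopy(config)
    for section, key in (("layout-config", "batch_size"),
                         ("formula-config", "max_batch_size")):
        if key in tuned.get(section, {}):
            # Integer-divide but never drop below a batch of one
            tuned[section][key] = max(1, tuned[section][key] // batch_divisor)
    return tuned


def cpu_variant(config: dict) -> dict:
    """Return a copy of config that runs on CPU."""
    tuned = copy.deepcopy(config)
    tuned["device-mode"] = "cpu"
    if "ocr-config" in tuned:
        tuned["ocr-config"]["use_gpu"] = False
    return tuned
```

Keeping the base config untouched makes it easy to retry a failed document with a smaller variant and return to the fast settings afterwards.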

Advanced Usage

Batch Processing

import glob
import os

from magic_pdf.data.data_reader_writer import FileBasedDataWriter
from magic_pdf.pipe.UNIPipe import UNIPipe

pdf_dir = "./papers"
output_base = "./extracted"

for pdf_path in sorted(glob.glob(os.path.join(pdf_dir, "*.pdf"))):
    name = os.path.splitext(os.path.basename(pdf_path))[0]
    output_dir = os.path.join(output_base, name)
    image_dir = os.path.join(output_dir, "images")
    os.makedirs(image_dir, exist_ok=True)

    with open(pdf_path, "rb") as f:
        pdf_bytes = f.read()

    image_writer = FileBasedDataWriter(image_dir)
    # The second argument is a dict with "_pdf_type" and "model_list" keys
    pipe = UNIPipe(pdf_bytes, {"_pdf_type": "", "model_list": []}, image_writer)
    pipe.pipe_classify()
    pipe.pipe_analyze()
    pipe.pipe_parse()

    markdown = pipe.pipe_mk_markdown(image_dir)
    with open(os.path.join(output_dir, "output.md"), "w", encoding="utf-8") as f:
        f.write(markdown)

    print(f"Processed: {name}")
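
Long batch runs are often interrupted, so it helps to skip PDFs that already have output. This is a minimal sketch, assuming the one-directory-per-PDF layout with an `output.md` file used in the loop above; `pending_jobs` is a hypothetical helper.

```python
from pathlib import Path


def pending_jobs(pdf_dir: str, output_base: str) -> list[tuple[Path, Path]]:
    """Return (pdf_path, output_dir) pairs whose output.md does not exist yet."""
    jobs = []
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        output_dir = Path(output_base) / pdf_path.stem
        if not (output_dir / "output.md").exists():
            jobs.append((pdf_path, output_dir))
    return jobs
```

Iterating over `pending_jobs("./papers", "./extracted")` instead of the raw glob makes a rerun resume where the previous one stopped.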

Integration with Vector Database

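from langchain_text_splitters import MarkdownHeaderTextSplitter  # older releases: langchain.text_splitter
from magic_pdf.data.data_reader_writer import FileBasedDataWriter
from magic_pdf.pipe.UNIPipe import UNIPipe

# Extract markdown
pdf_name = "paper"
with open(f"{pdf_name}.pdf", "rb") as f:
    pdf_bytes = f.read()
image_writer = FileBasedDataWriter("./images")
pipe = UNIPipe(pdf_bytes, {"_pdf_type": "", "model_list": []}, image_writer)
pipe.pipe_classify()
pipe.pipe_analyze()
pipe.pipe_parse()
markdown = pipe.pipe_mk_markdown("./images")

# Split by headers for RAG
headers = [("#", "h1"), ("##", "h2"), ("###", "h3")]
splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers)
chunks = splitter.split_text(markdown)

# Index chunks. embed_model and vector_db are placeholders for your own
# embedding model and vector store client; adapt the calls to their APIs.
for i, chunk in enumerate(chunks):
    embedding = embed_model.encode(chunk.page_content)
    vector_db.upsert(
        id=f"{pdf_name}_{i}",  # positional index gives a short, unique ID
        values=embedding,
        metadata={"text": chunk.page_content, "section": str(chunk.metadata)}
    )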
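
Chunk IDs are easier to work with when they are short and deterministic, so that re-indexing the same document overwrites entries instead of duplicating them. The sketch below is one possible scheme, not something MinerU or a particular vector store prescribes; `chunk_id` and `section_path` are hypothetical helpers, and the `h1`/`h2`/`h3` keys match the header names used in the splitter configuration.

```python
import hashlib


def chunk_id(pdf_name: str, index: int, text: str) -> str:
    """Build a deterministic ID from the document name, chunk position, and content."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
    return f"{pdf_name}-{index:04d}-{digest}"


def section_path(metadata: dict, order: tuple[str, ...] = ("h1", "h2", "h3")) -> str:
    """Join header metadata such as {"h1": "Paper", "h2": "Intro"} into "Paper > Intro"."""
    return " > ".join(metadata[key] for key in order if key in metadata)
```

Storing `section_path(chunk.metadata)` as the section field gives a readable breadcrumb in search results.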

Web Page Extraction

# Extract from URL
magic-pdf -p "https://example.com/article" -o output/ -m auto --input-type html

Troubleshooting

| Issue | Solution |
|-------|----------|
| Model download fails | Download from HuggingFace manually, set `MINERU_MODEL_DIR` |
| CUDA out of memory | Use `device-mode: cpu` or reduce batch size |
| Poor OCR quality | Use `ocr` mode explicitly, check language setting |
| Tables not detected | Enable table config, increase `max_time` |
| Formulas garbled | Enable formula recognition in config |
| Slow processing | Use GPU, reduce pages with `--start-page`/`--end-page` |
| Missing images in output | Check image output directory permissions |
| JSON encoding errors | Use `ensure_ascii=False` when writing JSON |

# Verify GPU support
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"

# Check models are downloaded
ls -la "${MINERU_MODEL_DIR}/"

# Test with sample PDF
magic-pdf -p sample.pdf -o test_output/ -m auto
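
The same checks can be scripted so they run before every batch job. This is a minimal sketch using only the standard library; `preflight` is a hypothetical helper, and it only confirms that the CLI is on `PATH` and that the model directory has content, not that the models are complete or valid.

```python
import shutil
from pathlib import Path


def preflight(models_dir: str, cli_name: str = "magic-pdf") -> dict[str, bool]:
    """Report whether the CLI is on PATH and the model directory has content."""
    models = Path(models_dir)
    return {
        "cli_on_path": shutil.which(cli_name) is not None,
        "models_dir_exists": models.is_dir(),
        "models_dir_nonempty": models.is_dir() and any(models.iterdir()),
    }
```

Any `False` value in the returned dictionary points to the corresponding row in the troubleshooting table.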