MinerU Cheat Sheet

Overview

MinerU is an open-source tool developed by OpenDataLab for converting PDFs, web pages, and e-books into high-quality machine-readable formats (Markdown, JSON). It uses deep learning models for layout detection, text recognition (OCR), mathematical formula recognition (LaTeX), and table structure extraction. MinerU handles complex document layouts including multi-column text, mixed text-and-image regions, and scientific notation.

The tool is part of the OpenDataLab ecosystem and supports both CPU and GPU processing. It produces clean Markdown output with properly formatted tables, equations, and image references, which makes it well suited to building LLM training datasets and populating RAG knowledge bases.

Installation

# Basic install (quote the extras so shells such as zsh do not expand the brackets)
pip install "magic-pdf[full]"

# CPU only
pip install "magic-pdf[lite]"

# Download required models
pip install huggingface_hub
python -c "
from huggingface_hub import snapshot_download
snapshot_download('opendatalab/PDF-Extract-Kit', local_dir='./models')
"

# Set model directory
export MINERU_MODEL_DIR=./models

Configuration File

// ~/magic-pdf.json (path label only: delete this line in the real file, JSON has no comment syntax)
{
  "models-dir": "/path/to/models",
  "device-mode": "cuda",
  "layout-config": {
    "model": "doclayout_yolo"
  },
  "formula-config": {
    "enable": true,
    "model": "unimernet"
  },
  "table-config": {
    "enable": true,
    "model": "tablemaster"
  }
}
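
The same file can be generated from a script, which avoids hand-editing mistakes such as leaving a comment line in the JSON. A minimal sketch, assuming the key names shown above; `build_config` and `write_config` are illustrative helpers, not part of MinerU, and the target path is whatever your installation reads.

```python
import json
from pathlib import Path


def build_config(models_dir: str, use_gpu: bool) -> dict:
    """Return a config dict that mirrors the keys in the sample file above."""
    return {
        "models-dir": models_dir,
        "device-mode": "cuda" if use_gpu else "cpu",
        "layout-config": {"model": "doclayout_yolo"},
        "formula-config": {"enable": True, "model": "unimernet"},
        "table-config": {"enable": True, "model": "tablemaster"},
    }


def write_config(config: dict, path: Path) -> Path:
    """Write the config as plain JSON (no comments) and return the path."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return path


# Preview a CPU-only config without touching the home directory
print(json.dumps(build_config("/path/to/models", use_gpu=False), indent=2))
```

Call `write_config(cfg, some_path)` once the preview looks right.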

Core Usage

Command Line

# Convert single PDF to Markdown
magic-pdf -p paper.pdf -o output/ -m auto

# Specify processing method
magic-pdf -p paper.pdf -o output/ -m ocr    # Force OCR
magic-pdf -p paper.pdf -o output/ -m txt    # Text extraction only
magic-pdf -p paper.pdf -o output/ -m auto   # Auto-detect

# Process specific pages (0-based; the CLI's page-range options are -s/--start and -e/--end)
magic-pdf -p paper.pdf -o output/ -m auto -s 0 -e 10

# Process directory of PDFs
magic-pdf -p /path/to/pdfs/ -o output/ -m auto

# Output as JSON instead of Markdown
magic-pdf -p paper.pdf -o output/ -m auto --output-format json
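
When the CLI is driven from a script, building the argument list in one place catches a mistyped method before the process starts. A small sketch using only the `-p`, `-o`, and `-m` options shown above; `build_command` is a hypothetical helper name.

```python
import shlex
from typing import List

VALID_METHODS = ("auto", "txt", "ocr")


def build_command(pdf_path: str, output_dir: str, method: str = "auto") -> List[str]:
    """Return the magic-pdf argument list, rejecting unknown methods early."""
    if method not in VALID_METHODS:
        raise ValueError(f"method must be one of {VALID_METHODS}, got {method!r}")
    return ["magic-pdf", "-p", pdf_path, "-o", output_dir, "-m", method]


cmd = build_command("paper.pdf", "output/", "ocr")
print(shlex.join(cmd))
# To run it: subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting problems with paths that contain spaces.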

Python API

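import json
import os

from magic_pdf.data.data_reader_writer import FileBasedDataWriter
from magic_pdf.pipe.UNIPipe import UNIPipe
# OCRPipe and TXTPipe live in magic_pdf.pipe.OCRPipe and magic_pdf.pipe.TXTPipe

# Read PDF
pdf_path = "paper.pdf"
with open(pdf_path, "rb") as f:
    pdf_bytes = f.read()

# Auto pipeline (recommended)
output_dir = "./output"
os.makedirs(f"{output_dir}/images", exist_ok=True)
image_writer = FileBasedDataWriter(f"{output_dir}/images")

# UNIPipe takes a dict as its second argument; an empty model_list means
# pipe_analyze() runs the models. OCRPipe and TXTPipe take the bare list.
# The pipe classes are version-specific, so check your installed release.
jso_useful_key = {"_pdf_type": "", "model_list": []}
pipe = UNIPipe(pdf_bytes, jso_useful_key, image_writer)
pipe.pipe_classify()  # decide between text extraction and OCR
pipe.pipe_analyze()   # run the layout, formula, and table models
pipe.pipe_parse()     # build the structured document

# Get Markdown output. The argument is the prefix used for image links,
# so pass a path relative to where the .md file is saved.
markdown = pipe.pipe_mk_markdown("images")
with open(f"{output_dir}/output.md", "w", encoding="utf-8") as f:
    f.write(markdown)

# Get structured JSON
content_list = pipe.pipe_mk_uni_format("images")
with open(f"{output_dir}/output.json", "w", encoding="utf-8") as f:
    json.dump(content_list, f, ensure_ascii=False, indent=2)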

Processing Methods

| Method | Description | Best For |
|--------|-------------|----------|
| auto | Auto-detect text vs scanned PDF | General use |
| txt | Extract embedded text layer | Born-digital PDFs |
| ocr | Full OCR processing | Scanned documents, images |

Output Structure

Markdown Output

# Paper Title

## Abstract

This paper presents a novel approach to document understanding...

## 1 Introduction

The equation $E = mc^2$ is fundamental. The full derivation:

$$\int_{0}^{\infty} e^{-x^2} dx = \frac{\sqrt{\pi}}{2}$$

| Method | Precision | Recall | F1 |
|--------|-----------|--------|-----|
| Baseline | 0.82 | 0.79 | 0.80 |
| Ours | 0.93 | 0.91 | 0.92 |

Figure 1: Architecture overview of the proposed system.
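
Markdown in this shape can be post-processed with the standard library alone. The sketch below pulls out display equations and pipe tables; it assumes equations are wrapped in `$$ ... $$` and tables use a `|---|` separator row, as in the sample above. `extract_display_math` and `extract_tables` are illustrative names.

```python
import re
from typing import List

DISPLAY_MATH = re.compile(r"\$\$(.+?)\$\$", re.DOTALL)
# A separator row contains only pipes, dashes, colons, and spaces, with at least one dash
SEPARATOR_ROW = re.compile(r"^\|[\s:|-]*-[\s:|-]*\|$")


def extract_display_math(markdown: str) -> List[str]:
    """Return the LaTeX body of every $$...$$ block."""
    return [body.strip() for body in DISPLAY_MATH.findall(markdown)]


def extract_tables(markdown: str) -> List[List[List[str]]]:
    """Return each pipe table as a list of rows, each row a list of cell strings."""
    tables, current = [], []
    for line in markdown.splitlines():
        stripped = line.strip()
        if stripped.startswith("|") and stripped.endswith("|"):
            if SEPARATOR_ROW.match(stripped):
                continue  # skip the |---|---| row
            current.append([cell.strip() for cell in stripped.strip("|").split("|")])
        elif current:
            tables.append(current)  # a non-table line closes the current table
            current = []
    if current:
        tables.append(current)
    return tables
```

Cells that contain an escaped pipe (`\|`) are not handled; a full Markdown parser is the safer choice for such input.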

JSON Output Structure

[
  {
    "type": "title",
    "text": "Paper Title",
    "page_no": 0,
    "bbox": [72, 50, 540, 80]
  },
  {
    "type": "text",
    "text": "This paper presents...",
    "page_no": 0,
    "bbox": [72, 100, 540, 200]
  },
  {
    "type": "table",
    "text": "| Method | Precision |...",
    "html": "<table>...</table>",
    "page_no": 1,
    "bbox": [72, 300, 540, 450]
  },
  {
    "type": "equation",
    "text": "$$E = mc^2$$",
    "latex": "E = mc^2",
    "page_no": 1,
    "bbox": [200, 500, 400, 530]
  },
  {
    "type": "image",
    "path": "images/figure_1.png",
    "caption": "Architecture overview",
    "page_no": 2,
    "bbox": [72, 100, 540, 400]
  }
]
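
A content list like this is easy to slice by block type or page. A minimal sketch, assuming the field names in the sample above (`type`, `text`, `latex`, `page_no`); actual field names may differ between releases, and the helper names are illustrative.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_page(content_list: List[dict]) -> Dict[int, List[dict]]:
    """Map each page number to its blocks, preserving document order."""
    pages = defaultdict(list)
    for block in content_list:
        pages[block["page_no"]].append(block)
    return dict(pages)


def collect_latex(content_list: List[dict]) -> List[str]:
    """Return the LaTeX source of every equation block."""
    return [b["latex"] for b in content_list
            if b.get("type") == "equation" and "latex" in b]


def plain_text(content_list: List[dict]) -> str:
    """Join title and text blocks, separated by blank lines."""
    return "\n\n".join(b["text"] for b in content_list
                       if b.get("type") in ("title", "text"))


sample = [
    {"type": "title", "text": "Paper Title", "page_no": 0},
    {"type": "text", "text": "This paper presents...", "page_no": 0},
    {"type": "equation", "text": "$$E = mc^2$$", "latex": "E = mc^2", "page_no": 1},
]
print(collect_latex(sample))
print(sorted(group_by_page(sample)))
```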

Configuration

Model Options

| Component | Models | Description |
|-----------|--------|-------------|
| Layout | doclayout_yolo, layoutlmv3 | Page layout detection |
| OCR | paddleocr, rapidocr | Text recognition |
| Formula | unimernet, pix2tex | Math formula to LaTeX |
| Table | tablemaster, structeqtable | Table structure recognition |
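
A typo in a model name is easier to catch before a long run than after it. The sketch below checks a config dict against the model names listed in the table above; it assumes each section stores its choice under a `"model"` key, as in the sample config file, and `unknown_models` is a hypothetical helper.

```python
from typing import Dict, List, Tuple

# Model names copied from the table above; adjust if your release differs.
KNOWN_MODELS: Dict[str, Tuple[str, ...]] = {
    "layout-config": ("doclayout_yolo", "layoutlmv3"),
    "formula-config": ("unimernet", "pix2tex"),
    "table-config": ("tablemaster", "structeqtable"),
}


def unknown_models(config: dict) -> List[str]:
    """Return one message per section whose "model" value is not listed."""
    problems = []
    for section, allowed in KNOWN_MODELS.items():
        model = config.get(section, {}).get("model")
        if model is not None and model not in allowed:
            problems.append(
                f"{section}: unknown model {model!r}, expected one of {allowed}"
            )
    return problems


config = {
    "layout-config": {"model": "doclayout_yolo"},
    "formula-config": {"enable": True, "model": "unimernett"},  # deliberate typo
}
for message in unknown_models(config):
    print(message)
```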

Performance Tuning

// ~/magic-pdf.json (path label only: delete this line in the real file, JSON has no comment syntax)
{
  "models-dir": "/path/to/models",
  "device-mode": "cuda",
  "layout-config": {
    "model": "doclayout_yolo",
    "batch_size": 8
  },
  "ocr-config": {
    "lang": "en",
    "use_gpu": true
  },
  "formula-config": {
    "enable": true,
    "max_batch_size": 32
  },
  "table-config": {
    "enable": true,
    "max_time": 60
  },
  "debug-mode": false
}
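
If a run hits CUDA out-of-memory errors, a lower-memory variant of the config can be derived instead of edited by hand. A minimal sketch, assuming the `batch_size`, `max_batch_size`, `use_gpu`, and `device-mode` keys from the sample above; whether a given release honors each key should be checked against its documentation. `reduce_memory_footprint` is an illustrative name.

```python
import copy

# (section, key) pairs that hold batch sizes in the sample config above
BATCH_KEYS = (("layout-config", "batch_size"), ("formula-config", "max_batch_size"))


def reduce_memory_footprint(config: dict, factor: int = 2, force_cpu: bool = False) -> dict:
    """Return a copy with batch sizes divided by `factor` (never below 1).

    With force_cpu=True the copy also switches device-mode to "cpu" and
    turns off use_gpu in ocr-config when that section exists.
    """
    if factor < 1:
        raise ValueError("factor must be >= 1")
    tuned = copy.deepcopy(config)  # leave the caller's dict untouched
    for section, key in BATCH_KEYS:
        if key in tuned.get(section, {}):
            tuned[section][key] = max(1, tuned[section][key] // factor)
    if force_cpu:
        tuned["device-mode"] = "cpu"
        if "ocr-config" in tuned:
            tuned["ocr-config"]["use_gpu"] = False
    return tuned
```

A typical retry ladder halves the batch sizes once or twice and only then falls back to CPU.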

Advanced Usage

Batch Processing

import glob
import os

from magic_pdf.data.data_reader_writer import FileBasedDataWriter
from magic_pdf.pipe.UNIPipe import UNIPipe

pdf_dir = "./papers"
output_base = "./extracted"

for pdf_path in sorted(glob.glob(os.path.join(pdf_dir, "*.pdf"))):
    name = os.path.splitext(os.path.basename(pdf_path))[0]
    output_dir = os.path.join(output_base, name)
    os.makedirs(os.path.join(output_dir, "images"), exist_ok=True)

    with open(pdf_path, "rb") as f:
        pdf_bytes = f.read()

    image_writer = FileBasedDataWriter(os.path.join(output_dir, "images"))
    # UNIPipe takes a dict as its second argument (version-specific; check your release)
    pipe = UNIPipe(pdf_bytes, {"_pdf_type": "", "model_list": []}, image_writer)
    pipe.pipe_classify()
    pipe.pipe_analyze()
    pipe.pipe_parse()

    # "images" is the image-link prefix, relative to output.md
    markdown = pipe.pipe_mk_markdown("images")
    with open(os.path.join(output_dir, "output.md"), "w", encoding="utf-8") as f:
        f.write(markdown)

    print(f"Processed: {name}")
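
On large folders, one corrupt PDF should not stop the run, and finished files should not be redone after a restart. The sketch below wraps any per-file conversion in a skip-and-report loop; `run_batch` and the `convert` callable are illustrative (the callable would contain the pipeline calls from the loop above), and the `output.md` marker matches the file name used there.

```python
from pathlib import Path
from typing import Callable, Dict, List


def run_batch(
    pdf_dir: Path,
    output_base: Path,
    convert: Callable[[Path, Path], None],
) -> Dict[str, List[str]]:
    """Convert every PDF in pdf_dir, skipping finished ones and recording failures."""
    report: Dict[str, List[str]] = {"done": [], "skipped": [], "failed": []}
    for pdf_path in sorted(pdf_dir.glob("*.pdf")):
        out_dir = output_base / pdf_path.stem
        if (out_dir / "output.md").exists():
            report["skipped"].append(pdf_path.name)  # already converted earlier
            continue
        out_dir.mkdir(parents=True, exist_ok=True)
        try:
            convert(pdf_path, out_dir)
        except Exception as exc:  # keep going; report the failure at the end
            report["failed"].append(f"{pdf_path.name}: {exc}")
        else:
            report["done"].append(pdf_path.name)
    return report
```

Failed files are retried on the next run because no `output.md` was written for them.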

Integration with Vector Database

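from magic_pdf.data.data_reader_writer import FileBasedDataWriter
from magic_pdf.pipe.UNIPipe import UNIPipe
from langchain.text_splitter import MarkdownHeaderTextSplitter

# embed_model and vector_db are placeholders for your own embedding model
# (anything with encode()) and vector store client (anything with upsert()).
pdf_name = "paper"
with open(f"{pdf_name}.pdf", "rb") as f:
    pdf_bytes = f.read()
image_writer = FileBasedDataWriter("./images")

# Extract markdown
# UNIPipe takes a dict as its second argument (version-specific; check your release)
pipe = UNIPipe(pdf_bytes, {"_pdf_type": "", "model_list": []}, image_writer)
pipe.pipe_classify()
pipe.pipe_analyze()
pipe.pipe_parse()
markdown = pipe.pipe_mk_markdown("./images")

# Split by headers for RAG
headers = [("#", "h1"), ("##", "h2"), ("###", "h3")]
splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers)
chunks = splitter.split_text(markdown)

# Index chunks; a running index keeps IDs short and unique within the document
for i, chunk in enumerate(chunks):
    embedding = embed_model.encode(chunk.page_content)
    vector_db.upsert(
        id=f"{pdf_name}_{i}",
        values=embedding,
        metadata={"text": chunk.page_content, "section": str(chunk.metadata)}
    )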
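
When documents are re-indexed after an update, IDs derived from the chunk content let unchanged chunks keep the same key, so an upsert overwrites rather than duplicates. A minimal sketch using only the standard library; `chunk_id` is a hypothetical helper, and the 12-character digest length is an arbitrary choice.

```python
import hashlib
import json


def chunk_id(pdf_name: str, metadata: dict, text: str) -> str:
    """Build a deterministic ID from the document name, header metadata, and text."""
    # sort_keys makes the ID independent of dict insertion order
    payload = json.dumps({"meta": metadata, "text": text},
                         sort_keys=True, ensure_ascii=False)
    digest = hashlib.sha1(payload.encode("utf-8")).hexdigest()[:12]
    return f"{pdf_name}_{digest}"


first = chunk_id("paper", {"h1": "Paper Title", "h2": "Abstract"}, "This paper presents...")
again = chunk_id("paper", {"h2": "Abstract", "h1": "Paper Title"}, "This paper presents...")
print(first == again)
```

SHA-1 is used here only as a fingerprint, not for security.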

Web Page Extraction

# Extract from URL
magic-pdf -p "https://example.com/article" -o output/ -m auto --input-type html

Troubleshooting

| Issue | Solution |
|-------|----------|
| Model download fails | Download from HuggingFace manually, set MINERU_MODEL_DIR |
| CUDA out of memory | Use device-mode: cpu or reduce batch size |
| Poor OCR quality | Use ocr mode explicitly, check language setting |
| Tables not detected | Enable table config, increase max_time |
| Formulas garbled | Enable formula recognition in config |
| Slow processing | Use GPU, limit the page range with -s/-e |
| Missing images in output | Check image output directory permissions |
| JSON encoding errors | Use ensure_ascii=False when writing JSON |

# Verify GPU support
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"

# Check models are downloaded
ls -la ${MINERU_MODEL_DIR}/

# Test with sample PDF
magic-pdf -p sample.pdf -o test_output/ -m auto
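
The shell checks above can be complemented by a small preflight script that runs before a long job. A sketch under stated assumptions: the config is a plain JSON file and the models sit in a single directory; `preflight` is an illustrative name, and both paths are supplied by the caller.

```python
import json
from pathlib import Path
from typing import List


def preflight(config_path: Path, models_dir: Path) -> List[str]:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if not models_dir.is_dir() or not any(models_dir.iterdir()):
        problems.append(f"model directory missing or empty: {models_dir}")
    if not config_path.is_file():
        problems.append(f"config file not found: {config_path}")
    else:
        try:
            json.loads(config_path.read_text(encoding="utf-8"))
        except json.JSONDecodeError as exc:
            # A leftover "// ..." comment line is a common cause
            problems.append(f"config is not valid JSON: {exc}")
    return problems
```

For example, `preflight(Path.home() / "magic-pdf.json", Path("./models"))` returns an empty list when both checks pass.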