Overview
MinerU is an open-source tool developed by OpenDataLab for converting PDFs, web pages, and e-books into high-quality machine-readable formats (Markdown, JSON). It uses deep learning models for layout detection, text recognition (OCR), mathematical formula recognition (LaTeX), and table structure extraction. MinerU handles complex document layouts including multi-column text, mixed text-and-image regions, and scientific notation.
MinerU runs on either CPU or GPU. Its Markdown output preserves tables, equations, and image references, which makes it well suited to building LLM training datasets and populating RAG knowledge bases.
Installation
# Basic install (quote the extras so shells such as zsh do not expand the brackets)
pip install "magic-pdf[full]"
# CPU only
pip install "magic-pdf[lite]"
# Download required models
pip install huggingface_hub
python -c "
from huggingface_hub import snapshot_download
snapshot_download('opendatalab/PDF-Extract-Kit', local_dir='./models')
"
# Set model directory
export MINERU_MODEL_DIR=./models
Configuration File
// ~/magic-pdf.json
{
  "models-dir": "/path/to/models",
  "device-mode": "cuda",
  "layout-config": {
    "model": "doclayout_yolo"
  },
  "formula-config": {
    "enable": true,
    "model": "unimernet"
  },
  "table-config": {
    "enable": true,
    "model": "tablemaster"
  }
}
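A malformed configuration file is easy to miss until a conversion fails, so a quick sanity check can help. The sketch below is a minimal example, not part of MinerU: `check_config` is a hypothetical helper, and treating `models-dir` and `device-mode` as required (and `cpu`/`cuda` as the only device values) is an assumption based on the example above. Note that the leading `//` line in the example is a filename label and must not appear in the real file, since JSON has no comment syntax.

```python
import json

# Keys taken from the example configuration; treating them as required is an assumption.
EXPECTED_KEYS = ("models-dir", "device-mode")
# Assumption: only the two device values mentioned in this document are checked.
KNOWN_DEVICES = ("cpu", "cuda")


def check_config(text: str) -> list[str]:
    """Return a list of human-readable problems found in a config JSON string."""
    try:
        config = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg} (line {exc.lineno})"]
    if not isinstance(config, dict):
        return ["top-level value must be a JSON object"]

    problems = []
    for key in EXPECTED_KEYS:
        if key not in config:
            problems.append(f"missing key: {key}")
    device = config.get("device-mode")
    if device is not None and device not in KNOWN_DEVICES:
        problems.append(f"unrecognised device-mode: {device}")
    return problems
```

An empty list means the file passed these basic checks; it does not prove that the model paths or model names are valid.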
Core Usage
Command Line
# Convert single PDF to Markdown
magic-pdf -p paper.pdf -o output/ -m auto
# Specify processing method
magic-pdf -p paper.pdf -o output/ -m ocr # Force OCR
magic-pdf -p paper.pdf -o output/ -m txt # Text extraction only
magic-pdf -p paper.pdf -o output/ -m auto # Auto-detect
# Process specific pages (0-based; run magic-pdf --help to confirm the flag names for your version)
magic-pdf -p paper.pdf -o output/ -m auto -s 0 -e 10
# Process directory of PDFs
magic-pdf -p /path/to/pdfs/ -o output/ -m auto
# Output as JSON instead of Markdown
magic-pdf -p paper.pdf -o output/ -m auto --output-format json
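When the CLI is driven from a script, it can be convenient to build the argument list in one place. The following is a minimal sketch that uses only the `-p`, `-o`, and `-m` options shown above; `build_command` and `convert` are hypothetical helper names, and the sketch assumes `magic-pdf` is on the `PATH`.

```python
import subprocess

# The three methods listed in this document.
VALID_METHODS = ("auto", "ocr", "txt")


def build_command(pdf_path: str, output_dir: str, method: str = "auto") -> list[str]:
    """Build the argument list for a magic-pdf invocation."""
    if method not in VALID_METHODS:
        raise ValueError(f"method must be one of {VALID_METHODS}, got {method!r}")
    return ["magic-pdf", "-p", pdf_path, "-o", output_dir, "-m", method]


def convert(pdf_path: str, output_dir: str, method: str = "auto") -> None:
    """Run the CLI and raise CalledProcessError if it exits with a non-zero status."""
    subprocess.run(build_command(pdf_path, output_dir, method), check=True)
```

Passing a list rather than a shell string avoids quoting problems with paths that contain spaces.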
Python API
import json
import os
from magic_pdf.data.data_reader_writer import FileBasedDataWriter, FileBasedDataReader
from magic_pdf.pipe.UNIPipe import UNIPipe
from magic_pdf.pipe.OCRPipe import OCRPipe
from magic_pdf.pipe.TXTPipe import TXTPipe
# Read PDF
pdf_path = "paper.pdf"
with open(pdf_path, "rb") as f:
    pdf_bytes = f.read()
# Auto pipeline (recommended)
output_dir = "./output"
os.makedirs(f"{output_dir}/images", exist_ok=True)
image_writer = FileBasedDataWriter(f"{output_dir}/images")
# UNIPipe takes a dict as its second argument rather than a bare list;
# an empty model_list lets the pipeline run the models itself
# (verify against your installed version)
jso_useful_key = {"_pdf_type": "", "model_list": []}
pipe = UNIPipe(pdf_bytes, jso_useful_key, image_writer)
pipe.pipe_classify()
pipe.pipe_analyze()
pipe.pipe_parse()
# Get Markdown output
markdown = pipe.pipe_mk_markdown(f"{output_dir}/images")
with open(f"{output_dir}/output.md", "w", encoding="utf-8") as f:
    f.write(markdown)
# Get structured JSON
content_list = pipe.pipe_mk_uni_format(f"{output_dir}/images")
with open(f"{output_dir}/output.json", "w", encoding="utf-8") as f:
    json.dump(content_list, f, ensure_ascii=False, indent=2)
Processing Methods
| Method | Description | Best For |
|---|---|---|
| auto | Auto-detect text vs scanned PDF | General use |
| txt | Extract embedded text layer | Born-digital PDFs |
| ocr | Full OCR processing | Scanned documents, images |
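To make the difference between the methods concrete, here is one way a "text vs scanned" decision could be made. This is an illustrative heuristic only and not MinerU's actual classifier: `choose_method`, its input, and both thresholds are assumptions. It expects the number of characters that some PDF library extracted from each page's embedded text layer.

```python
def choose_method(chars_per_page: list[int], min_chars: int = 50,
                  min_text_ratio: float = 0.8) -> str:
    """Pick "txt" when most pages carry an embedded text layer, else "ocr".

    chars_per_page: characters extracted from each page's text layer.
    min_chars and min_text_ratio are illustrative guesses, not tuned values.
    """
    if not chars_per_page:
        raise ValueError("document has no pages")
    # Count pages that have enough embedded text to be considered born-digital.
    text_pages = sum(1 for n in chars_per_page if n >= min_chars)
    ratio = text_pages / len(chars_per_page)
    return "txt" if ratio >= min_text_ratio else "ocr"
```

A mostly scanned document with one digital cover page would fall below the ratio and be routed to OCR, which is the safer default when the text layer is incomplete.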
Output Structure
Markdown Output
# Paper Title
## Abstract
This paper presents a novel approach to document understanding...
## 1 Introduction
The equation $E = mc^2$ is fundamental. The full derivation:
$$\int_{0}^{\infty} e^{-x^2} dx = \frac{\sqrt{\pi}}{2}$$
| Method | Precision | Recall | F1 |
|--------|-----------|--------|-----|
| Baseline | 0.82 | 0.79 | 0.80 |
| Ours | 0.93 | 0.91 | 0.92 |
Figure 1: Architecture overview of the proposed system.
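Because formulas appear in the Markdown as `$...$` (inline) and `$$...$$` (display), they can be pulled out with regular expressions, for example to spot-check formula recognition. The sketch below is a minimal example with a hypothetical helper name, `extract_math`; it assumes dollar signs are used only as math delimiters, so currency amounts in the text would produce false matches.

```python
import re

# Display math: anything between a pair of double dollar signs, possibly spanning lines.
DISPLAY_MATH = re.compile(r"\$\$(.+?)\$\$", re.DOTALL)
# Inline math: a single dollar sign on each side, not adjacent to another dollar sign.
INLINE_MATH = re.compile(r"(?<!\$)\$(?!\$)(.+?)(?<!\$)\$(?!\$)")


def extract_math(markdown: str) -> dict[str, list[str]]:
    """Return the LaTeX bodies of display and inline formulas found in Markdown."""
    display = [m.strip() for m in DISPLAY_MATH.findall(markdown)]
    # Remove display blocks first so their delimiters cannot be read as inline math.
    remainder = DISPLAY_MATH.sub(" ", markdown)
    inline = [m.strip() for m in INLINE_MATH.findall(remainder)]
    return {"display": display, "inline": inline}
```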
JSON Output Structure
[
  {
    "type": "title",
    "text": "Paper Title",
    "page_no": 0,
    "bbox": [72, 50, 540, 80]
  },
  {
    "type": "text",
    "text": "This paper presents...",
    "page_no": 0,
    "bbox": [72, 100, 540, 200]
  },
  {
    "type": "table",
    "text": "| Method | Precision |...",
    "html": "<table>...</table>",
    "page_no": 1,
    "bbox": [72, 300, 540, 450]
  },
  {
    "type": "equation",
    "text": "$$E = mc^2$$",
    "latex": "E = mc^2",
    "page_no": 1,
    "bbox": [200, 500, 400, 530]
  },
  {
    "type": "image",
    "path": "images/figure_1.png",
    "caption": "Architecture overview",
    "page_no": 2,
    "bbox": [72, 100, 540, 400]
  }
]
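A list of typed blocks like this is straightforward to post-process. The sketch below shows two small helpers, assuming the field names from the example above (`type`, `page_no`, `bbox`); `group_by_type` and `blocks_on_page` are hypothetical names, and reading `bbox` as `[x0, y0, x1, y1]` with `y` increasing down the page is an assumption inferred from the sample values.

```python
from collections import defaultdict


def group_by_type(content_list: list[dict]) -> dict[str, list[dict]]:
    """Group blocks by their "type" field (title, text, table, equation, image, ...)."""
    grouped = defaultdict(list)
    for block in content_list:
        grouped[block.get("type", "unknown")].append(block)
    return dict(grouped)


def blocks_on_page(content_list: list[dict], page_no: int) -> list[dict]:
    """Return the blocks of one page ordered top to bottom.

    Assumes bbox is [x0, y0, x1, y1], so bbox[1] is the top edge.
    """
    page = [b for b in content_list if b.get("page_no") == page_no]
    return sorted(page, key=lambda b: b["bbox"][1])
```

For example, `group_by_type(content_list).get("table", [])` would give every table block, which can then be exported from its `html` field.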
Configuration
Model Options
| Component | Models | Description |
|---|---|---|
| Layout | doclayout_yolo, layoutlmv3 | Page layout detection |
| OCR | paddleocr, rapidocr | Text recognition |
| Formula | unimernet, pix2tex | Math formula to LaTeX |
| Table | tablemaster, structeqtable | Table structure recognition |
// ~/magic-pdf.json
{
  "models-dir": "/path/to/models",
  "device-mode": "cuda",
  "layout-config": {
    "model": "doclayout_yolo",
    "batch_size": 8
  },
  "ocr-config": {
    "lang": "en",
    "use_gpu": true
  },
  "formula-config": {
    "enable": true,
    "max_batch_size": 32
  },
  "table-config": {
    "enable": true,
    "max_time": 60
  },
  "debug-mode": false
}
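When a GPU runs out of memory, the usual remedy is to switch to CPU or shrink the batch sizes. The sketch below derives such a fallback configuration from an existing one. It is a minimal example: `cpu_fallback` is a hypothetical helper, and the keys it touches (`device-mode`, `batch_size`, `max_batch_size`, `use_gpu`) are simply the ones that appear in the example above.

```python
import copy


def cpu_fallback(config: dict, batch_divisor: int = 2) -> dict:
    """Return a copy of config switched to CPU with smaller batch sizes.

    The input dict is left untouched so the GPU settings can be restored later.
    """
    result = copy.deepcopy(config)
    result["device-mode"] = "cpu"
    for section in result.values():
        if not isinstance(section, dict):
            continue  # skip scalar settings such as "debug-mode"
        for key in ("batch_size", "max_batch_size"):
            if key in section:
                # Never go below a batch size of 1.
                section[key] = max(1, section[key] // batch_divisor)
        if "use_gpu" in section:
            section["use_gpu"] = False
    return result
```

The result can be written back with `json.dump(result, f, indent=2)` once the fallback settings look right.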
Advanced Usage
Batch Processing
import os
import glob
from magic_pdf.data.data_reader_writer import FileBasedDataWriter
from magic_pdf.pipe.UNIPipe import UNIPipe
pdf_dir = "./papers"
output_base = "./extracted"
for pdf_path in glob.glob(os.path.join(pdf_dir, "*.pdf")):
    name = os.path.splitext(os.path.basename(pdf_path))[0]
    output_dir = os.path.join(output_base, name)
    os.makedirs(f"{output_dir}/images", exist_ok=True)
    with open(pdf_path, "rb") as f:
        pdf_bytes = f.read()
    image_writer = FileBasedDataWriter(f"{output_dir}/images")
    # UNIPipe takes a dict as its second argument rather than a bare list
    # (verify against your installed version)
    pipe = UNIPipe(pdf_bytes, {"_pdf_type": "", "model_list": []}, image_writer)
    pipe.pipe_classify()
    pipe.pipe_analyze()
    pipe.pipe_parse()
    markdown = pipe.pipe_mk_markdown(f"{output_dir}/images")
    with open(f"{output_dir}/output.md", "w", encoding="utf-8") as f:
        f.write(markdown)
    print(f"Processed: {name}")
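In a large batch, one corrupt PDF should not abort the whole run. The sketch below wraps the per-file work in a small runner that records failures and keeps going. It is a minimal example under stated assumptions: `run_batch` is a hypothetical helper, and `convert_one` stands for whatever per-file conversion you use (for instance, the loop body above moved into a function).

```python
from pathlib import Path
from typing import Callable


def run_batch(pdf_dir: str, convert_one: Callable[[Path], None]) -> dict[str, list[str]]:
    """Apply convert_one to every *.pdf in pdf_dir and report successes and failures."""
    results: dict[str, list[str]] = {"ok": [], "failed": []}
    # Sort for a deterministic processing order.
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        try:
            convert_one(pdf_path)
        except Exception as exc:  # deliberately broad: one bad file must not stop the run
            results["failed"].append(f"{pdf_path.name}: {exc}")
        else:
            results["ok"].append(pdf_path.name)
    return results
```

The returned dictionary makes it easy to retry only the failed files, for example with the `ocr` method instead of `auto`.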
Integration with Vector Database
from magic_pdf.pipe.UNIPipe import UNIPipe
from langchain.text_splitter import MarkdownHeaderTextSplitter
# pdf_bytes and image_writer are prepared as in the Python API example;
# pdf_name, embed_model, and vector_db are placeholders for your own
# document identifier, embedding model, and vector store client
# Extract markdown
pipe = UNIPipe(pdf_bytes, {"_pdf_type": "", "model_list": []}, image_writer)
pipe.pipe_classify()
pipe.pipe_analyze()
pipe.pipe_parse()
markdown = pipe.pipe_mk_markdown("./images")
# Split by headers for RAG
headers = [("#", "h1"), ("##", "h2"), ("###", "h3")]
splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers)
chunks = splitter.split_text(markdown)
# Index chunks
for i, chunk in enumerate(chunks):
    embedding = embed_model.encode(chunk.page_content)
    vector_db.upsert(
        id=f"{pdf_name}_{i}",  # chunk.metadata is a dict, so the chunk index gives a cleaner ID
        values=embedding,
        metadata={"text": chunk.page_content, "section": str(chunk.metadata)}
    )
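To show what header-based splitting does conceptually, here is a standard-library sketch. It is not LangChain's implementation: `split_by_headers` is a hypothetical function that handles only `#`, `##`, and `###` headings, and it would mis-split Markdown whose code blocks contain lines starting with `#`.

```python
import re

# One to three leading hashes followed by whitespace and the heading text.
HEADER = re.compile(r"^(#{1,3})\s+(.*)$")


def split_by_headers(markdown: str) -> list[dict]:
    """Split Markdown into chunks, each tagged with its trail of enclosing headings."""
    chunks: list[dict] = []
    trail: dict[str, str] = {}  # e.g. {"h1": "Paper Title", "h2": "Abstract"}
    body: list[str] = []

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            chunks.append({"metadata": dict(trail), "text": text})
        body.clear()

    for line in markdown.splitlines():
        match = HEADER.match(line)
        if match is None:
            body.append(line)
            continue
        flush()  # close the chunk that belonged to the previous heading
        level = len(match.group(1))
        # A new heading invalidates headings at the same or a deeper level.
        for deeper in range(level, 4):
            trail.pop(f"h{deeper}", None)
        trail[f"h{level}"] = match.group(2).strip()
    flush()
    return chunks
```

Keeping the heading trail in each chunk's metadata lets a retrieval system show which section a passage came from.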
# Extract from URL
magic-pdf -p "https://example.com/article" -o output/ -m auto --input-type html
Troubleshooting
| Issue | Solution |
|---|---|
| Model download fails | Download from HuggingFace manually, set MINERU_MODEL_DIR |
| CUDA out of memory | Use device-mode: cpu or reduce batch size |
| Poor OCR quality | Use ocr mode explicitly, check language setting |
| Tables not detected | Enable table config, increase max_time |
| Formulas garbled | Enable formula recognition in config |
| Slow processing | Use GPU, or limit the page range (-s/-e; confirm with magic-pdf --help) |
| Missing images in output | Check image output directory permissions |
| JSON encoding errors | Use ensure_ascii=False when writing JSON |
# Verify GPU support
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"
# Check models are downloaded
ls -la ${MINERU_MODEL_DIR}/
# Test with sample PDF
magic-pdf -p sample.pdf -o test_output/ -m auto
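The same basic checks can be scripted so they run before a long batch job. The sketch below is a minimal example with a hypothetical helper, `diagnose`; it only checks that the `magic-pdf` executable is on the `PATH` and that the given models directory exists and is not empty. It does not verify that the models themselves are complete or loadable.

```python
import shutil
from pathlib import Path


def diagnose(models_dir: str) -> dict[str, bool]:
    """Run a few cheap environment checks and return their results."""
    models = Path(models_dir)
    return {
        # True when the magic-pdf executable can be found on PATH.
        "cli_on_path": shutil.which("magic-pdf") is not None,
        "models_dir_exists": models.is_dir(),
        # An empty directory usually means the model download did not finish.
        "models_dir_not_empty": models.is_dir() and any(models.iterdir()),
    }
```

Any `False` value points to the corresponding row in the troubleshooting table above.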