Marker Cheat Sheet
Overview
Marker is an open-source tool that converts PDF documents to clean Markdown with high accuracy. It uses a pipeline of deep learning models for page segmentation, layout analysis, OCR, table recognition, and LaTeX equation extraction. On most documents it is roughly 10x faster than alternatives such as Nougat, with comparable quality and lower GPU memory requirements.
The tool handles complex document layouts including multi-column text, embedded tables, mathematical formulas, code blocks, headers/footers, and figure captions. It outputs well-structured Markdown suitable for LLM training data, RAG knowledge bases, and content migration workflows. Marker supports batch processing and can be configured to optimize for speed or quality.
Installation
pip install marker-pdf
# With GPU support (recommended)
pip install "marker-pdf[gpu]"
# From source
git clone https://github.com/VikParuchuri/marker.git
cd marker
pip install -e ".[gpu]"
# Models are downloaded automatically on first run
System Dependencies
# Ubuntu/Debian
sudo apt-get install -y tesseract-ocr ghostscript
# macOS
brew install tesseract ghostscript
Core Usage
Command Line
# Convert single PDF
marker_single input.pdf output_dir/
# Convert single PDF with specific output format
marker_single input.pdf output_dir/ --output_format markdown
# Batch convert a directory of PDFs (input directory first, then output directory)
marker input_dir/ output_dir/ --workers 4
# Limit conversion to the first 100 pages
marker_single input.pdf output_dir/ --max_pages 100
# Convert 10 pages starting at page 5 (page numbering starts at 0)
marker_single input.pdf output_dir/ --start_page 5 --max_pages 10
# Force OCR on all pages
marker_single input.pdf output_dir/ --force_ocr
# Set language for OCR
marker_single input.pdf output_dir/ --langs "English,French"
Python API
import os
from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict
from marker.output import text_from_rendered
# Load models
model_dict = create_model_dict()
# Create converter
converter = PdfConverter(artifact_dict=model_dict)
# Convert PDF
rendered = converter("paper.pdf")
# text_from_rendered returns (text, file extension, images);
# document metadata is available on the rendered object itself
markdown_text, _, images = text_from_rendered(rendered)
print(markdown_text)
print(f"Pages: {len(rendered.metadata['page_stats'])}")
print(f"Images: {len(images)}")
# Save images (create the target directory first)
os.makedirs("output", exist_ok=True)
for name, img in images.items():
    img.save(f"output/{name}")
Batch Processing
import glob
import os
from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict
from marker.output import text_from_rendered
# Load the models once and reuse the converter for every file
model_dict = create_model_dict()
converter = PdfConverter(artifact_dict=model_dict)
input_dir = "./pdfs/"
output_dir = "./markdown/"
os.makedirs(output_dir, exist_ok=True)
for pdf_path in glob.glob(os.path.join(input_dir, "*.pdf")):
    name = os.path.splitext(os.path.basename(pdf_path))[0]
    rendered = converter(pdf_path)
    # The second return value is the output file extension, not metadata
    markdown, _, images = text_from_rendered(rendered)
    # Save markdown
    with open(os.path.join(output_dir, f"{name}.md"), "w", encoding="utf-8") as f:
        f.write(markdown)
    # Save images
    img_dir = os.path.join(output_dir, f"{name}_images")
    os.makedirs(img_dir, exist_ok=True)
    for img_name, img in images.items():
        img.save(os.path.join(img_dir, img_name))
    page_count = len(rendered.metadata["page_stats"])
    print(f"Converted: {name} ({page_count} pages)")
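For large batches it can help to skip PDFs that already have a Markdown file and to keep going when a single file fails. The sketch below covers only that bookkeeping and uses the standard library alone. `pending_pdfs` and `convert_all` are hypothetical helper names, and `convert_one` stands in for whatever function wraps the converter call.

```python
import os


def pending_pdfs(input_dir, output_dir):
    """Return sorted PDF paths in input_dir that have no matching .md in output_dir."""
    pending = []
    for entry in sorted(os.listdir(input_dir)):
        stem, ext = os.path.splitext(entry)
        if ext.lower() != ".pdf":
            continue
        if not os.path.exists(os.path.join(output_dir, f"{stem}.md")):
            pending.append(os.path.join(input_dir, entry))
    return pending


def convert_all(paths, convert_one):
    """Call convert_one(path) for each path and collect failures instead of aborting."""
    failed = {}
    for path in paths:
        try:
            convert_one(path)
        except Exception as exc:  # one bad PDF should not stop the whole batch
            failed[path] = str(exc)
    return failed
```

Passing the result of `pending_pdfs("./pdfs/", "./markdown/")` to `convert_all` makes a rerun after a crash pick up where the previous run stopped, and the returned dict lists the files that need a closer look.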
Output Format
Markdown Structure
# Document Title
**Authors:** John Smith, Jane Doe
## Abstract
This paper presents a novel approach to...
## 1 Introduction
The relationship is defined by $E = mc^2$ and the integral:
$$\int_{-\infty}^{\infty} e^{-x^2} dx = \sqrt{\pi}$$
## 2 Methods
### 2.1 Data Collection
| Dataset | Samples | Classes | Split |
|---------|---------|---------|-------|
| MNIST | 70,000 | 10 | 60k/10k |
| CIFAR-10 | 60,000 | 10 | 50k/10k |
### 2.2 Model Architecture
```python
class Transformer(nn.Module):
    def __init__(self, d_model=512):
        super().__init__()
```
Figure 1: Architecture diagram of the proposed system.
## References
- Vaswani, A. et al. “Attention Is All You Need.” NeurIPS 2017.
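Because tables arrive as plain pipe-delimited Markdown, they are easy to post-process. The following is a minimal sketch, not part of Marker, that turns each pipe table into a list of row dictionaries. `parse_markdown_tables` is a hypothetical helper, and it assumes simple tables with no escaped `|` characters inside cells.

```python
def parse_markdown_tables(markdown):
    """Return each pipe table in the text as a list of row dicts keyed by header."""
    tables, block = [], []
    for line in markdown.splitlines() + [""]:  # trailing "" flushes the last block
        stripped = line.strip()
        if len(stripped) > 1 and stripped.startswith("|") and stripped.endswith("|"):
            block.append([cell.strip() for cell in stripped[1:-1].split("|")])
            continue
        # A table needs a header row followed by a separator row such as |---|---|
        if len(block) >= 2 and all(c and set(c) <= set("-: ") for c in block[1]):
            header, rows = block[0], block[2:]
            tables.append([dict(zip(header, row)) for row in rows])
        block = []
    return tables


sample = """| Dataset | Samples | Classes | Split |
|---------|---------|---------|-------|
| MNIST | 70,000 | 10 | 60k/10k |
| CIFAR-10 | 60,000 | 10 | 50k/10k |"""

tables = parse_markdown_tables(sample)
print(tables[0][1]["Dataset"])  # prints: CIFAR-10
```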
Configuration
Processing Options
| Option | Description | Default |
|--------|-------------|---------|
| `--workers` | Number of parallel workers | 1 |
| `--max_pages` | Maximum pages to process | None (all) |
| `--start_page` | Starting page number | 0 |
| `--langs` | OCR languages (comma-separated) | English |
| `--force_ocr` | Force OCR on all pages | False |
| `--output_format` | Output format (markdown, json) | markdown |
| `--paginate_output` | Add page breaks between pages | False |
| `--disable_image_extraction` | Skip image extraction | False |
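When conversions are launched from a script, the options in the table can be assembled programmatically. This is a sketch under stated assumptions: `build_marker_single_command` is a hypothetical helper, it follows the positional `input output_dir` form used in this cheat sheet, and it does not check option names against the installed Marker version.

```python
import shlex


def build_marker_single_command(pdf_path, output_dir, **options):
    """Build an argv list for marker_single from keyword options.

    True becomes a bare flag, False or None is omitted, and any other
    value becomes '--name value'.
    """
    argv = ["marker_single", pdf_path, output_dir]
    for name, value in options.items():
        if value is None or value is False:
            continue
        argv.append(f"--{name}")
        if value is not True:
            argv.append(str(value))
    return argv


cmd = build_marker_single_command(
    "input.pdf", "output_dir/", force_ocr=True, max_pages=10, paginate_output=False
)
print(shlex.join(cmd))
# prints: marker_single input.pdf output_dir/ --force_ocr --max_pages 10
```

The list form can be handed directly to `subprocess.run(cmd, check=True)`, which avoids shell quoting problems with file names that contain spaces.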
Environment Variables
```bash
# Model cache directory
export MARKER_MODEL_DIR=~/.cache/marker
# Use CPU only
export TORCH_DEVICE=cpu
# Reduce GPU memory fragmentation (caps the block size the allocator will split)
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
```
Resource Requirements
| Mode | GPU VRAM | RAM | Speed |
|---|---|---|---|
| GPU (full) | ~4 GB | 8 GB | ~5 pages/sec |
| GPU (light) | ~2 GB | 4 GB | ~3 pages/sec |
| CPU | 0 | 8 GB | ~0.5 pages/sec |
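The throughput column translates directly into a rough time budget for a batch. The sketch below only does that arithmetic with the approximate figures from the table. The mode keys and `estimate_minutes` are hypothetical names, and model loading time is ignored.

```python
# Approximate throughput in pages per second, copied from the table above.
PAGES_PER_SECOND = {"gpu_full": 5.0, "gpu_light": 3.0, "cpu": 0.5}


def estimate_minutes(total_pages, mode):
    """Rough wall-clock estimate in minutes, ignoring model load time."""
    return total_pages / PAGES_PER_SECOND[mode] / 60


for mode in PAGES_PER_SECOND:
    print(f"{mode}: {estimate_minutes(3000, mode):.1f} min")
# prints:
# gpu_full: 10.0 min
# gpu_light: 16.7 min
# cpu: 100.0 min
```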
Advanced Usage
Custom Pipeline Configuration
from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict
model_dict = create_model_dict()
converter = PdfConverter(
    artifact_dict=model_dict,
    config={
        "force_ocr": False,
        "paginate_output": True,
        "disable_image_extraction": False,
        "extract_images": True,
    },
)
rendered = converter("paper.pdf")
Integration with LangChain
from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict
from marker.output import text_from_rendered
from langchain.text_splitter import MarkdownHeaderTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
# Convert PDF to Markdown
model_dict = create_model_dict()
converter = PdfConverter(artifact_dict=model_dict)
rendered = converter("paper.pdf")
markdown, _, _ = text_from_rendered(rendered)
# Split by headers
headers = [("#", "h1"), ("##", "h2"), ("###", "h3")]
splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers)
chunks = splitter.split_text(markdown)
# Index in vector store
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(chunks, embeddings)
# Query
results = vectorstore.similarity_search("main findings", k=5)
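To make the header-splitting step concrete, here is a dependency-free sketch of the idea. It is a rough stand-in for illustration, not LangChain's implementation: `split_by_headers` is a hypothetical function, and it would also treat `#` comment lines inside code blocks as headings, which a real splitter may handle differently.

```python
import re

HEADING = re.compile(r"^(#{1,3})\s+(.*)$")


def split_by_headers(markdown):
    """Split Markdown into chunks, each tagged with its h1/h2/h3 heading path."""
    chunks, path, lines = [], {}, []

    def flush():
        text = "\n".join(lines).strip()
        if text:
            chunks.append({"metadata": dict(path), "text": text})
        lines.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match is None:
            lines.append(line)
            continue
        flush()
        level = len(match.group(1))
        # A new heading replaces its own level and drops any deeper levels.
        path = {k: v for k, v in path.items() if int(k[1]) < level}
        path[f"h{level}"] = match.group(2).strip()
    flush()
    return chunks


doc = "# Title\nintro\n## A\ntext a\n### A1\ndeep\n## B\ntext b"
for chunk in split_by_headers(doc):
    print(chunk["metadata"], "->", chunk["text"])
```

Each chunk carries the headings it sits under, so a retrieved passage can be traced back to its section of the source PDF.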
API Server
import os
import tempfile
from fastapi import FastAPI, UploadFile
from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict
from marker.output import text_from_rendered
app = FastAPI()
# Load models once at startup, not per request
model_dict = create_model_dict()
converter = PdfConverter(artifact_dict=model_dict)
@app.post("/convert")
async def convert_pdf(file: UploadFile):
    # Write the upload to disk because the converter takes a file path
    with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
        tmp.write(await file.read())
        tmp_path = tmp.name
    try:
        rendered = converter(tmp_path)
    finally:
        os.remove(tmp_path)  # delete=False means we must clean up ourselves
    # The second return value is the output file extension, not metadata
    markdown, _, images = text_from_rendered(rendered)
    return {
        "markdown": markdown,
        "pages": len(rendered.metadata["page_stats"]),
        "images": list(images.keys()),
    }
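An endpoint like this will receive non-PDF uploads sooner or later, and rejecting them before conversion saves GPU time. The check below is a minimal sketch using only the standard library. `looks_like_pdf` is a hypothetical helper, and the 1024-byte search window is an assumption borrowed from how lenient PDF readers commonly locate the header, not something Marker specifies.

```python
PDF_MAGIC = b"%PDF-"


def looks_like_pdf(data, search_window=1024):
    """Return True if the PDF header marker appears near the start of the data."""
    return PDF_MAGIC in data[:search_window]


print(looks_like_pdf(b"%PDF-1.7\n..."))   # prints: True
print(looks_like_pdf(b"<html></html>"))   # prints: False
```

Inside the handler, the uploaded bytes could be passed through this check first, returning an HTTP 400 response when it fails.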
Comparing with Nougat
| Feature | Marker | Nougat |
|---|---|---|
| Speed | ~10x faster | Slower |
| GPU Memory | ~4 GB | ~6 GB |
| Non-academic docs | Good | Poor |
| Math extraction | Good | Excellent |
| Table extraction | Good | Moderate |
| Scanned PDFs | Good (OCR) | Limited |
| Multi-language | Yes | Limited |
Troubleshooting
| Issue | Solution |
|---|---|
| CUDA out of memory | Use TORCH_DEVICE=cpu or reduce batch size |
| Tesseract not found | Install: apt install tesseract-ocr |
| Ghostscript errors | Install: apt install ghostscript |
| Slow processing | Use GPU, increase --workers for batch |
| Poor table extraction | Tables with complex merges may need post-processing |
| Missing LaTeX formulas | Update models: re-download from HuggingFace |
| Image extraction fails | Check disk space, use --disable_image_extraction |
| Garbled text | Force OCR: --force_ocr, set correct --langs |
# Verify installation
python -c "from marker.converters.pdf import PdfConverter; print('Marker installed')"
# Check GPU availability
python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}')"
# Test conversion
marker_single test.pdf ./test_output/
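The same checks can be gathered into one script that reports everything at once. This is a sketch using only the standard library. `check_environment` is a hypothetical helper, and it assumes Ghostscript is installed under the executable name `gs`, as is usual on Linux and macOS.

```python
import importlib.util
import shutil


def check_environment():
    """Report which of the pieces named in this cheat sheet can be found."""
    return {
        "marker (Python package)": importlib.util.find_spec("marker") is not None,
        "torch (Python package)": importlib.util.find_spec("torch") is not None,
        "tesseract (on PATH)": shutil.which("tesseract") is not None,
        "ghostscript (on PATH)": shutil.which("gs") is not None,
    }


for name, found in check_environment().items():
    print(f"{'OK     ' if found else 'MISSING'}  {name}")
```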