Marker Cheat Sheet

Overview

Marker is an open-source tool that converts PDF documents to clean Markdown. It runs a pipeline of deep learning models for page segmentation, layout analysis, OCR, table recognition, and LaTeX equation extraction. Marker is substantially faster than alternatives such as Nougat at comparable quality: roughly 10x faster on most documents, with lower GPU memory requirements.

The tool handles complex document layouts including multi-column text, embedded tables, mathematical formulas, code blocks, headers/footers, and figure captions. It outputs well-structured Markdown suitable for LLM training data, RAG knowledge bases, and content migration workflows. Marker supports batch processing and can be configured to optimize for speed or quality.

Installation

pip install marker-pdf

# With GPU support (recommended); the quotes stop shells such as zsh from globbing the brackets
pip install "marker-pdf[gpu]"

# From source
git clone https://github.com/VikParuchuri/marker.git
cd marker
pip install -e ".[gpu]"

# Models are downloaded automatically on first run

System Dependencies

# Ubuntu/Debian
sudo apt-get install -y tesseract-ocr ghostscript

# macOS
brew install tesseract ghostscript

Core Usage

Command Line

# Convert single PDF
marker_single input.pdf output_dir/

# Convert single PDF with specific output format
marker_single input.pdf output_dir/ --output_format markdown

# Batch convert a directory of PDFs (input directory first, then output directory)
marker input_dir/ output_dir/ --workers 4

# Limit conversion to the first 100 pages
marker_single input.pdf output_dir/ --max_pages 100

# Convert a page range: 10 pages starting at page 5
marker_single input.pdf output_dir/ --start_page 5 --max_pages 10

# Force OCR on all pages
marker_single input.pdf output_dir/ --force_ocr

# Set language for OCR
marker_single input.pdf output_dir/ --langs "English,French"
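
For long PDFs, the `--start_page` and `--max_pages` flags can be combined to convert a document in fixed-size windows. The sketch below only builds the argument lists and runs nothing. It assumes that `--start_page` is zero-based and that `--max_pages` counts pages from the start page. The `part_NNNN` sub-directory naming is hypothetical, chosen so that successive runs do not overwrite each other.

```python
import os
from typing import Iterator, List, Tuple


def page_windows(total_pages: int, chunk_size: int) -> Iterator[Tuple[int, int]]:
    """Yield (start_page, max_pages) pairs that cover total_pages in order."""
    if total_pages < 0 or chunk_size <= 0:
        raise ValueError("total_pages must be >= 0 and chunk_size must be > 0")
    for start in range(0, total_pages, chunk_size):
        # The last window may be shorter than chunk_size.
        yield start, min(chunk_size, total_pages - start)


def chunk_commands(pdf: str, out_dir: str, total_pages: int, chunk_size: int) -> List[List[str]]:
    """Build one marker_single argument list per window (nothing is executed)."""
    commands = []
    for start, count in page_windows(total_pages, chunk_size):
        # Hypothetical layout: one sub-directory per window.
        part_dir = os.path.join(out_dir, f"part_{start:04d}")
        commands.append([
            "marker_single", pdf, part_dir,
            "--start_page", str(start),
            "--max_pages", str(count),
        ])
    return commands


if __name__ == "__main__":
    for cmd in chunk_commands("input.pdf", "output_dir", total_pages=25, chunk_size=10):
        print(" ".join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the flag semantics have been confirmed against the installed version.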

Python API

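import os

from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict
from marker.output import text_from_rendered

# Load models (downloaded on first run)
model_dict = create_model_dict()

# Create converter
converter = PdfConverter(artifact_dict=model_dict)

# Convert PDF
rendered = converter("paper.pdf")
# In Marker 1.x the second return value is the output file extension (e.g. "md"),
# not metadata; metadata is exposed on the rendered object instead.
markdown_text, _, images = text_from_rendered(rendered)
metadata = rendered.metadata

print(markdown_text)
# Assumes metadata carries one "page_stats" entry per page
print(f"Pages: {len(metadata.get('page_stats', []))}")
print(f"Images: {len(images)}")

# Save images (create the target directory first)
os.makedirs("output", exist_ok=True)
for name, img in images.items():
    img.save(os.path.join("output", name))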

Batch Processing

import glob
import os

from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict
from marker.output import text_from_rendered

# Load models once and reuse the converter for every file
model_dict = create_model_dict()
converter = PdfConverter(artifact_dict=model_dict)

input_dir = "./pdfs"
output_dir = "./markdown"
os.makedirs(output_dir, exist_ok=True)

for pdf_path in sorted(glob.glob(os.path.join(input_dir, "*.pdf"))):
    name = os.path.splitext(os.path.basename(pdf_path))[0]

    rendered = converter(pdf_path)
    # The second return value is the output file extension, not metadata
    markdown, _, images = text_from_rendered(rendered)

    # Save markdown (explicit encoding avoids platform-dependent defaults)
    with open(os.path.join(output_dir, f"{name}.md"), "w", encoding="utf-8") as f:
        f.write(markdown)

    # Save images
    img_dir = os.path.join(output_dir, f"{name}_images")
    os.makedirs(img_dir, exist_ok=True)
    for img_name, img in images.items():
        img.save(os.path.join(img_dir, img_name))

    # Assumes metadata carries one "page_stats" entry per page
    pages = len(rendered.metadata.get("page_stats", []))
    print(f"Converted: {name} ({pages} pages)")
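
Batch runs over large directories are often interrupted. A small helper can list only the PDFs that still lack a Markdown file, so a rerun skips finished work. This is a sketch that assumes the `<name>.md` naming used in the loop above. `pending_pdfs` is a hypothetical helper, not part of Marker.

```python
from pathlib import Path
from typing import List


def pending_pdfs(input_dir: str, output_dir: str) -> List[Path]:
    """Return PDFs in input_dir that have no matching .md file in output_dir."""
    out = Path(output_dir)
    return [
        pdf
        for pdf in sorted(Path(input_dir).glob("*.pdf"))
        # Same naming rule as the batch loop: <stem>.md in the output directory.
        if not (out / f"{pdf.stem}.md").exists()
    ]
```

Iterating over `pending_pdfs(input_dir, output_dir)` instead of the raw glob makes the loop safe to rerun. A partially written `.md` file would still count as done, so writing to a temporary name and renaming afterwards is the more careful variant.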

Output Format

Markdown Structure

# Document Title

**Authors:** John Smith, Jane Doe

## Abstract

This paper presents a novel approach to...

## 1 Introduction

The relationship is defined by $E = mc^2$ and the integral:

$$\int_{-\infty}^{\infty} e^{-x^2} dx = \sqrt{\pi}$$

## 2 Methods

### 2.1 Data Collection

| Dataset | Samples | Classes | Split |
|---------|---------|---------|-------|
| MNIST | 70,000 | 10 | 60k/10k |
| CIFAR-10 | 60,000 | 10 | 50k/10k |

### 2.2 Model Architecture

```python
class Transformer(nn.Module):
    def __init__(self, d_model=512):
        super().__init__()
```

Figure 1: Architecture diagram of the proposed system.

## References

Vaswani, A. et al. “Attention Is All You Need.” NeurIPS 2017.
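
Because tables arrive as pipe-delimited Markdown, they can be turned back into structured rows with a few lines of parsing. The sketch below is a simplified reader, not part of Marker. It assumes well-formed tables like the one in the sample, and it does not handle escaped pipes, pipes inside inline code, or tables inside fenced code blocks.

```python
import re
from typing import Dict, List

# A separator row looks like |---|:---:|---| (outer pipes optional).
_SEPARATOR = re.compile(r"^\s*\|?\s*:?-+:?\s*(\|\s*:?-+:?\s*)*\|?\s*$")


def _cells(line: str) -> List[str]:
    """Split one table row into trimmed cell strings."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]


def parse_pipe_tables(markdown: str) -> List[List[Dict[str, str]]]:
    """Return every pipe table as a list of {header: value} rows."""
    lines = markdown.splitlines()
    tables: List[List[Dict[str, str]]] = []
    i = 0
    while i < len(lines) - 1:
        # A table starts with a header row followed by a separator row.
        if "|" in lines[i] and _SEPARATOR.match(lines[i + 1]):
            header = _cells(lines[i])
            rows = []
            i += 2
            while i < len(lines) and "|" in lines[i]:
                rows.append(dict(zip(header, _cells(lines[i]))))
                i += 1
            tables.append(rows)
        else:
            i += 1
    return tables


SAMPLE = """\
| Dataset | Samples | Classes | Split |
|---------|---------|---------|-------|
| MNIST | 70,000 | 10 | 60k/10k |
| CIFAR-10 | 60,000 | 10 | 50k/10k |
"""

if __name__ == "__main__":
    for row in parse_pipe_tables(SAMPLE)[0]:
        print(row["Dataset"], row["Samples"])
```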


## Configuration

### Processing Options

| Option | Description | Default |
|--------|-------------|---------|
| `--workers` | Number of parallel workers | 1 |
| `--max_pages` | Maximum pages to process | None (all) |
| `--start_page` | Starting page number | 0 |
| `--langs` | OCR languages (comma-separated) | English |
| `--force_ocr` | Force OCR on all pages | False |
| `--output_format` | Output format (markdown, json) | markdown |
| `--paginate_output` | Add page breaks between pages | False |
| `--disable_image_extraction` | Skip image extraction | False |
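
When conversions are launched from scripts, it can help to keep the options in a dictionary and emit only the non-default flags. The sketch below copies the defaults from the table above. It assumes that the boolean options are on/off switches that take no value, as in the `--force_ocr` example earlier. `to_cli_args` is a hypothetical helper, not part of Marker.

```python
from typing import Any, Dict, List

# Defaults copied from the options table.
DEFAULTS: Dict[str, Any] = {
    "workers": 1,
    "max_pages": None,
    "start_page": 0,
    "langs": "English",
    "force_ocr": False,
    "output_format": "markdown",
    "paginate_output": False,
    "disable_image_extraction": False,
}


def to_cli_args(options: Dict[str, Any]) -> List[str]:
    """Turn non-default options into command-line arguments."""
    args: List[str] = []
    for name, value in options.items():
        if name not in DEFAULTS:
            raise KeyError(f"unknown option: {name}")
        if value == DEFAULTS[name]:
            continue  # leave defaults implicit
        if isinstance(DEFAULTS[name], bool):
            # Assumed to be a switch that takes no value.
            args.append(f"--{name}")
        else:
            args.extend([f"--{name}", str(value)])
    return args


if __name__ == "__main__":
    print(to_cli_args({"force_ocr": True, "max_pages": 10, "workers": 1}))
```

Rejecting unknown keys early catches typos such as `max_page` before a long conversion starts.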

### Environment Variables

```bash
# Model cache directory
export MARKER_MODEL_DIR=~/.cache/marker

# Use CPU only
export TORCH_DEVICE=cpu

# Reduce CUDA allocator fragmentation (can help with out-of-memory errors; does not cap total usage)
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
```

Resource Requirements

| Mode | GPU VRAM | RAM | Speed |
|------|----------|-----|-------|
| GPU (full) | ~4 GB | 8 GB | ~5 pages/sec |
| GPU (light) | ~2 GB | 4 GB | ~3 pages/sec |
| CPU | 0 | 8 GB | ~0.5 pages/sec |
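
The throughput figures above give a quick way to budget a conversion job: divide the page count by the pages-per-second rate. The sketch below does that arithmetic. The mode keys are hypothetical labels for the three rows, and the estimate ignores model loading time and per-document overhead.

```python
# Throughput figures copied from the resource table (pages per second).
PAGES_PER_SECOND = {
    "gpu_full": 5.0,
    "gpu_light": 3.0,
    "cpu": 0.5,
}


def estimate_seconds(pages: int, mode: str) -> float:
    """Rough wall-clock estimate for converting `pages` pages in `mode`."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    return pages / PAGES_PER_SECOND[mode]


if __name__ == "__main__":
    for mode in PAGES_PER_SECOND:
        print(f"{mode}: {estimate_seconds(300, mode):.0f} s for 300 pages")
```

For a 300-page document this works out to about 60 s on a full GPU, 100 s in the light GPU mode, and 600 s on CPU.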

Advanced Usage

Custom Pipeline Configuration

from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict

model_dict = create_model_dict()
converter = PdfConverter(
    artifact_dict=model_dict,
    config={
        "force_ocr": False,
        "paginate_output": True,
        "disable_image_extraction": False,
        "extract_images": True,
    }
)

rendered = converter("paper.pdf")

Integration with LangChain

from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict
from marker.output import text_from_rendered
# Recent LangChain releases ship the splitters in the langchain-text-splitters package
from langchain_text_splitters import MarkdownHeaderTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

# Convert PDF to Markdown
model_dict = create_model_dict()
converter = PdfConverter(artifact_dict=model_dict)
rendered = converter("paper.pdf")
markdown, _, _ = text_from_rendered(rendered)

# Split by headers
headers = [("#", "h1"), ("##", "h2"), ("###", "h3")]
splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers)
chunks = splitter.split_text(markdown)

# Index in vector store
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(chunks, embeddings)

# Query
results = vectorstore.similarity_search("main findings", k=5)
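
To see what header-based splitting does to Marker output without installing LangChain, the idea can be mimicked in plain Python. The sketch below is a simplified stand-in, not LangChain's implementation. It tracks the active `#`, `##`, and `###` headings (labeled `h1` to `h3` as in the example above) and attaches them to each text chunk.

```python
import re
from typing import Dict, List, Tuple

HEADING = re.compile(r"^(#{1,3})\s+(.*)$")


def split_by_headers(markdown: str) -> List[Tuple[Dict[str, str], str]]:
    """Return (header_metadata, text) pairs, one per heading-delimited section."""
    chunks: List[Tuple[Dict[str, str], str]] = []
    current: Dict[str, str] = {}  # active h1/h2/h3 titles
    body: List[str] = []
    in_fence = False

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            chunks.append((dict(current), text))
        body.clear()

    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # '#' inside code fences is not a heading
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # A new heading invalidates this level and anything deeper.
            for deeper in range(level, 4):
                current.pop(f"h{deeper}", None)
            current[f"h{level}"] = match.group(2).strip()
        else:
            body.append(line)
    flush()
    return chunks


if __name__ == "__main__":
    doc = "# Title\n\nIntro\n\n## Methods\n\nWe did X.\n\n### Data\n\nWe used Y."
    for meta, text in split_by_headers(doc):
        print(meta, "->", text)
```

Keeping the heading path as metadata lets a retrieval result be traced back to its section, which is the main reason to split on headers instead of fixed character counts.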

API Server

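import os
import tempfile

from fastapi import FastAPI, UploadFile
from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict
from marker.output import text_from_rendered

app = FastAPI()
# Load models once at startup; they are reused across requests
model_dict = create_model_dict()
converter = PdfConverter(artifact_dict=model_dict)

@app.post("/convert")
async def convert_pdf(file: UploadFile):
    # The converter takes a file path, so write the upload to a temporary file
    with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
        tmp.write(await file.read())
        tmp_path = tmp.name

    try:
        # Conversion is synchronous, so this handler blocks the event loop while it runs
        rendered = converter(tmp_path)
        # The second return value is the output file extension, not metadata
        markdown, _, images = text_from_rendered(rendered)
    finally:
        os.remove(tmp_path)  # delete=False leaves cleanup to us

    return {
        "markdown": markdown,
        # Assumes metadata carries one "page_stats" entry per page
        "pages": len(rendered.metadata.get("page_stats", [])),
        "images": list(images.keys())
    }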
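
An upload endpoint will receive files that are not PDFs, so a cheap check before conversion avoids loading them into the pipeline. The sketch below tests for the `%PDF-` header. Some producers prepend a few bytes and many readers tolerate that, so it searches the first 1024 bytes instead of only offset 0. `looks_like_pdf` is a hypothetical helper, not part of Marker or FastAPI.

```python
PDF_MAGIC = b"%PDF-"
SEARCH_WINDOW = 1024  # bytes to scan for the header


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the payload carries a PDF header near its start."""
    return PDF_MAGIC in data[:SEARCH_WINDOW]


if __name__ == "__main__":
    print(looks_like_pdf(b"%PDF-1.7\n..."))
    print(looks_like_pdf(b"<html></html>"))
```

In the endpoint, the bytes returned by `await file.read()` can be checked first, and a failed check turned into an HTTP 400 response with `fastapi.HTTPException`.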

Comparing with Nougat

| Feature | Marker | Nougat |
|---------|--------|--------|
| Speed | ~10x faster | Slower |
| GPU Memory | ~4 GB | ~6 GB |
| Non-academic docs | Good | Poor |
| Math extraction | Good | Excellent |
| Table extraction | Good | Moderate |
| Scanned PDFs | Good (OCR) | Limited |
| Multi-language | Yes | Limited |

Troubleshooting

| Issue | Solution |
|-------|----------|
| CUDA out of memory | Use `TORCH_DEVICE=cpu` or reduce batch size |
| Tesseract not found | Install: `apt install tesseract-ocr` |
| Ghostscript errors | Install: `apt install ghostscript` |
| Slow processing | Use GPU, increase `--workers` for batch |
| Poor table extraction | Tables with complex merges may need post-processing |
| Missing LaTeX formulas | Update models: re-download from HuggingFace |
| Image extraction fails | Check disk space, use `--disable_image_extraction` |
| Garbled text | Force OCR: `--force_ocr`, set correct `--langs` |

# Verify installation
python -c "from marker.converters.pdf import PdfConverter; print('Marker installed')"

# Check GPU availability
python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}')"

# Test conversion
marker_single test.pdf ./test_output/
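
The checks above can also be bundled into one preflight script that reports everything in a single pass. This is a minimal sketch. It assumes the Ghostscript executable is named `gs`, as on Linux and macOS, and it only tests whether the packages are importable, not whether they work.

```python
import importlib.util
import shutil
from typing import Dict


def preflight() -> Dict[str, bool]:
    """Report which dependencies named in this cheat sheet are discoverable."""
    return {
        # Executables are looked up on PATH, as a shell would do.
        "tesseract": shutil.which("tesseract") is not None,
        "ghostscript": shutil.which("gs") is not None,
        # find_spec checks importability without importing the package.
        "marker": importlib.util.find_spec("marker") is not None,
        "torch": importlib.util.find_spec("torch") is not None,
    }


if __name__ == "__main__":
    for name, ok in preflight().items():
        print(f"{name:12s} {'ok' if ok else 'MISSING'}")
```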