
Marker Cheat Sheet

Overview

Marker is an open-source tool that converts PDF documents to clean Markdown with high accuracy. It uses a pipeline of deep learning models for page segmentation, layout analysis, OCR, table recognition, and LaTeX equation extraction. Marker processes most documents roughly 10x faster than alternatives such as Nougat, with comparable quality and lower GPU memory requirements.

The tool handles complex document layouts including multi-column text, embedded tables, mathematical formulas, code blocks, headers/footers, and figure captions. It outputs well-structured Markdown suitable for LLM training data, RAG knowledge bases, and content migration workflows. Marker supports batch processing and can be configured to optimize for speed or quality.

Installation

pip install marker-pdf

# With GPU support (recommended)
pip install "marker-pdf[gpu]"

# From source
git clone https://github.com/VikParuchuri/marker.git
cd marker
pip install -e ".[gpu]"

# Models are downloaded automatically on first run

System Dependencies

# Ubuntu/Debian
sudo apt-get install -y tesseract-ocr ghostscript

# macOS
brew install tesseract ghostscript

Core Usage

Command Line

# Convert single PDF
marker_single input.pdf output_dir/

# Convert single PDF with specific output format
marker_single input.pdf output_dir/ --output_format markdown

# Batch convert directory of PDFs (input directory first, then output directory)
marker input_dir/ output_dir/ --workers 4

# Limit conversion to at most 100 pages
marker_single input.pdf output_dir/ --max_pages 100

# Convert specific pages
marker_single input.pdf output_dir/ --start_page 5 --max_pages 10

# Force OCR on all pages
marker_single input.pdf output_dir/ --force_ocr

# Set language for OCR
marker_single input.pdf output_dir/ --langs "English,French"
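
When conversions are launched from a script, the flags above can be assembled programmatically. The following is a minimal sketch, assuming `marker_single` is on `PATH` and accepts the flags exactly as listed in this cheat sheet. `build_marker_single_cmd` and `run_marker_single` are hypothetical helper names, not part of Marker.

```python
import subprocess
from typing import List, Optional


def build_marker_single_cmd(
    pdf_path: str,
    output_dir: str,
    start_page: Optional[int] = None,
    max_pages: Optional[int] = None,
    force_ocr: bool = False,
    langs: Optional[List[str]] = None,
) -> List[str]:
    """Assemble an argument list for marker_single (hypothetical helper).

    Flag names are copied from the command-line examples in this cheat
    sheet; check `marker_single --help` for your installed version.
    """
    cmd = ["marker_single", pdf_path, output_dir]
    if start_page is not None:
        cmd += ["--start_page", str(start_page)]
    if max_pages is not None:
        cmd += ["--max_pages", str(max_pages)]
    if force_ocr:
        cmd.append("--force_ocr")
    if langs:
        cmd += ["--langs", ",".join(langs)]
    return cmd


def run_marker_single(cmd: List[str]) -> int:
    """Run the command and return its exit code (requires Marker installed)."""
    return subprocess.run(cmd, check=False).returncode


if __name__ == "__main__":
    # Only print the command here; call run_marker_single(cmd) to execute it.
    print(build_marker_single_cmd("input.pdf", "output_dir/", start_page=5, max_pages=10))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.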

Python API

from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict
from marker.output import text_from_rendered

# Load models
model_dict = create_model_dict()

# Create converter
converter = PdfConverter(artifact_dict=model_dict)

# Convert PDF
rendered = converter("paper.pdf")
markdown_text, metadata, images = text_from_rendered(rendered)

print(markdown_text)
print(f"Pages: {metadata['pages']}")
print(f"Images: {len(images)}")

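# Save images (create the output directory first so img.save does not fail)
import os
os.makedirs("output", exist_ok=True)
for name, img in images.items():
    img.save(os.path.join("output", name))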

Batch Processing

from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict
from marker.output import text_from_rendered
import os, glob

model_dict = create_model_dict()
converter = PdfConverter(artifact_dict=model_dict)

input_dir = "./pdfs/"
output_dir = "./markdown/"
os.makedirs(output_dir, exist_ok=True)

for pdf_path in glob.glob(f"{input_dir}/*.pdf"):
    name = os.path.splitext(os.path.basename(pdf_path))[0]

    rendered = converter(pdf_path)
    markdown, metadata, images = text_from_rendered(rendered)

    # Save markdown
    with open(f"{output_dir}/{name}.md", "w", encoding="utf-8") as f:
        f.write(markdown)

    # Save images
    img_dir = f"{output_dir}/{name}_images"
    os.makedirs(img_dir, exist_ok=True)
    for img_name, img in images.items():
        img.save(f"{img_dir}/{img_name}")

    print(f"Converted: {name} ({metadata['pages']} pages)")
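
For long batches it can help to skip files that were already converted and to keep going when one PDF fails. The sketch below is one way to do that, assuming the output layout used above (one `<name>.md` per PDF). `pending_pdfs` and `convert_all` are hypothetical helpers, and `convert_one` stands in for whatever function wraps the `converter(...)` call.

```python
from pathlib import Path
from typing import Callable, Dict, List


def pending_pdfs(input_dir: str, output_dir: str) -> List[Path]:
    """Return PDFs in input_dir that have no matching .md in output_dir."""
    out = Path(output_dir)
    return [
        pdf
        for pdf in sorted(Path(input_dir).glob("*.pdf"))
        if not (out / f"{pdf.stem}.md").exists()
    ]


def convert_all(
    pdfs: List[Path], convert_one: Callable[[Path], None]
) -> Dict[str, str]:
    """Call convert_one on each PDF; return {filename: error} for failures."""
    failures: Dict[str, str] = {}
    for pdf in pdfs:
        try:
            convert_one(pdf)
        except Exception as exc:  # keep going so one bad PDF does not stop the batch
            failures[pdf.name] = str(exc)
    return failures
```

A rerun after an interruption then only touches the PDFs that still lack a Markdown file.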

Output Format

Markdown Structure

# Document Title

**Authors:** John Smith, Jane Doe

## Abstract

This paper presents a novel approach to...

## 1 Introduction

The relationship is defined by $E = mc^2$ and the integral:

$$\int_{-\infty}^{\infty} e^{-x^2} dx = \sqrt{\pi}$$

## 2 Methods

### 2.1 Data Collection

| Dataset | Samples | Classes | Split |
|---------|---------|---------|-------|
| MNIST | 70,000 | 10 | 60k/10k |
| CIFAR-10 | 60,000 | 10 | 50k/10k |

### 2.2 Model Architecture

```python
class Transformer(nn.Module):
    def __init__(self, d_model=512):
        super().__init__()
```

Figure 1: Architecture diagram of the proposed system.

## References

1. Vaswani, A. et al. “Attention Is All You Need.” NeurIPS 2017.
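
Because tables arrive as Markdown pipe tables like the one in the sample output above, they can be read back into structured rows with a few lines of code. This is a minimal sketch, assuming a simple table with one header row, one separator row, and no escaped pipes inside cells. `parse_pipe_table` is a hypothetical helper, not part of Marker.

```python
from typing import Dict, List


def parse_pipe_table(lines: List[str]) -> List[Dict[str, str]]:
    """Parse a simple Markdown pipe table into a list of row dicts."""

    def cells(line: str) -> List[str]:
        # Drop the outer pipes, then split on the inner ones.
        return [c.strip() for c in line.strip().strip("|").split("|")]

    header = cells(lines[0])
    rows = []
    for line in lines[2:]:  # lines[1] is the |---|---| separator row
        if line.strip():
            rows.append(dict(zip(header, cells(line))))
    return rows


table = """\
| Dataset | Samples | Classes | Split |
|---------|---------|---------|-------|
| MNIST | 70,000 | 10 | 60k/10k |
| CIFAR-10 | 60,000 | 10 | 50k/10k |
""".splitlines()

rows = parse_pipe_table(table)
print(rows[0]["Samples"])  # 70,000
```

Tables with merged cells or multi-line cells would need a more careful parser.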

Configuration

Processing Options

| Option | Description | Default |
|--------|-------------|---------|
| `--workers` | Number of parallel workers | 1 |
| `--max_pages` | Maximum pages to process | None (all) |
| `--start_page` | Starting page number | 0 |
| `--langs` | OCR languages (comma-separated) | English |
| `--force_ocr` | Force OCR on all pages | False |
| `--output_format` | Output format (markdown, json) | markdown |
| `--paginate_output` | Add page breaks between pages | False |
| `--disable_image_extraction` | Skip image extraction | False |

Environment Variables

```bash
# Model cache directory
export MARKER_MODEL_DIR=~/.cache/marker

# Use CPU only
export TORCH_DEVICE=cpu

# Limit GPU memory
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
```

Resource Requirements

| Mode | GPU VRAM | RAM | Speed |
|------|----------|-----|-------|
| GPU (full) | ~4 GB | 8 GB | ~5 pages/sec |
| GPU (light) | ~2 GB | 4 GB | ~3 pages/sec |
| CPU | 0 | 8 GB | ~0.5 pages/sec |
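
The throughput figures above translate directly into a rough time budget: total pages divided by pages per second. The sketch below does that arithmetic, treating the approximate numbers from the table as constants. Real throughput depends on hardware and document complexity, and `estimate_seconds` is a hypothetical helper.

```python
# Approximate throughput figures copied from the table above (pages per second).
PAGES_PER_SEC = {"gpu_full": 5.0, "gpu_light": 3.0, "cpu": 0.5}


def estimate_seconds(total_pages: int, mode: str) -> float:
    """Rough conversion-time estimate for a given processing mode."""
    return total_pages / PAGES_PER_SEC[mode]


print(estimate_seconds(1000, "gpu_full"))  # 200.0
print(estimate_seconds(1000, "cpu"))       # 2000.0
```

At these rates, a 1,000-page corpus takes a few minutes on a GPU and over half an hour on CPU.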

Advanced Usage

Custom Pipeline Configuration

from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict

model_dict = create_model_dict()
converter = PdfConverter(
    artifact_dict=model_dict,
    config={
        "force_ocr": False,
        "paginate_output": True,
        "disable_image_extraction": False,
        "extract_images": True,
    }
)

rendered = converter("paper.pdf")

Integration with LangChain

from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict
from marker.output import text_from_rendered
from langchain.text_splitter import MarkdownHeaderTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

# Convert PDF to Markdown
model_dict = create_model_dict()
converter = PdfConverter(artifact_dict=model_dict)
rendered = converter("paper.pdf")
markdown, _, _ = text_from_rendered(rendered)

# Split by headers
headers = [("#", "h1"), ("##", "h2"), ("###", "h3")]
splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers)
chunks = splitter.split_text(markdown)

# Index in vector store
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(chunks, embeddings)

# Query
results = vectorstore.similarity_search("main findings", k=5)
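
To make the header-splitting step concrete, here is a dependency-free sketch of the idea. It is not LangChain's implementation, only an illustration of how Markdown can be cut into one chunk per h1 to h3 section while ignoring `#` lines inside fenced code blocks. `split_by_headers` is a hypothetical name.

```python
import re
from typing import Dict, List

HEADING = re.compile(r"^(#{1,3})\s+(.*)$")
FENCE = "`" * 3  # marker that opens and closes a fenced code block


def split_by_headers(markdown: str) -> List[Dict[str, str]]:
    """Split Markdown into chunks, one per h1-h3 section."""
    chunks: List[Dict[str, str]] = []
    title, body, in_fence = "", [], False

    def flush() -> None:
        # Emit the section collected so far, skipping empty ones.
        text = "\n".join(body).strip()
        if text:
            chunks.append({"heading": title, "text": text})

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        # Comment lines inside code blocks must not be treated as headings.
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()
            title, body = match.group(2).strip(), []
        else:
            body.append(line)
    flush()
    return chunks


doc = "# Title\n\nIntro text.\n\n## Methods\n\nDetails here.\n"
for chunk in split_by_headers(doc):
    print(chunk["heading"], "->", chunk["text"])
```

Keeping the heading with each chunk gives the retriever some context about where a passage came from.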

API Server

from fastapi import FastAPI, UploadFile
from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict
from marker.output import text_from_rendered
import os
import tempfile

app = FastAPI()
model_dict = create_model_dict()
converter = PdfConverter(artifact_dict=model_dict)

@app.post("/convert")
async def convert_pdf(file: UploadFile):
    # Write the upload to a temporary file because the converter takes a path
    with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
        tmp.write(await file.read())
        tmp_path = tmp.name

    try:
        rendered = converter(tmp_path)
        markdown, metadata, images = text_from_rendered(rendered)
    finally:
        # delete=False leaves the file on disk, so remove it explicitly
        os.remove(tmp_path)

    return {
        "markdown": markdown,
        "pages": metadata["pages"],
        "images": list(images.keys())
    }
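
An upload endpoint like this will receive non-PDF files sooner or later, and it is cheaper to reject them before loading them into the converter. A minimal sketch of such a check follows, relying only on the fact that well-formed PDF files begin with the `%PDF-` signature. `looks_like_pdf` is a hypothetical helper and is not part of Marker or FastAPI.

```python
PDF_MAGIC = b"%PDF-"


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the payload starts with the PDF header signature."""
    return data.startswith(PDF_MAGIC)


print(looks_like_pdf(b"%PDF-1.7\n..."))   # True
print(looks_like_pdf(b"<html></html>"))   # False
```

Inside the handler, the uploaded bytes could be read once, passed through this check, and answered with FastAPI's `HTTPException(status_code=400)` when it fails.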

Comparing with Nougat

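| Feature | Marker | Nougat |
|---------|--------|--------|
| Speed | ~10x faster | Slower |
| GPU Memory | ~4 GB | ~6 GB |
| Non-academic docs | Good | Poor |
| Math extraction | Good | Excellent |
| Table extraction | Good | Moderate |
| Scanned PDFs | Good (OCR) | Limited |
| Multi-language | Yes | Limited |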

Troubleshooting

| Issue | Solution |
|-------|----------|
| CUDA out of memory | Use `TORCH_DEVICE=cpu` or reduce batch size |
| Tesseract not found | Install: `apt install tesseract-ocr` |
| Ghostscript errors | Install: `apt install ghostscript` |
| Slow processing | Use GPU, increase `--workers` for batch |
| Poor table extraction | Tables with complex merges may need post-processing |
| Missing LaTeX formulas | Update models: re-download from HuggingFace |
| Image extraction fails | Check disk space, use `--disable_image_extraction` |
| Garbled text | Force OCR: `--force_ocr`, set correct `--langs` |

# Verify installation
python -c "from marker.converters.pdf import PdfConverter; print('Marker installed')"

# Check GPU availability
python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}')"

# Test conversion
marker_single test.pdf ./test_output/
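
The individual checks above can be folded into one script that reports everything at once. This is a sketch under the assumptions of this cheat sheet: the Python package is importable as `marker`, and the Tesseract and Ghostscript executables are named `tesseract` and `gs`. `check_environment` is a hypothetical helper.

```python
import importlib.util
import shutil
from typing import Dict


def check_environment() -> Dict[str, bool]:
    """Report which of the tools mentioned in this cheat sheet are present."""
    return {
        "marker": importlib.util.find_spec("marker") is not None,
        "torch": importlib.util.find_spec("torch") is not None,
        "tesseract": shutil.which("tesseract") is not None,
        "ghostscript": shutil.which("gs") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name:12s} {'found' if ok else 'MISSING'}")
```

The check only confirms that the packages and executables can be located. It does not load any models or verify GPU access.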