Overview
Nougat (Neural Optical Understanding for Academic Documents) is a transformer-based model developed by Meta AI that converts academic PDF documents into structured Markdown text. Unlike traditional OCR, Nougat uses a visual transformer encoder-decoder architecture trained on scientific papers to accurately extract mathematical equations (as LaTeX), tables, section structures, and references from PDF documents without requiring an underlying text layer.
The model excels at handling the complex layouts found in academic papers: multi-column formats, inline and display math, chemical formulas, figures with captions, and bibliographic references. Nougat processes PDF pages as images and outputs clean Markdown with embedded LaTeX, making it ideal for building searchable academic knowledge bases and RAG systems over scientific literature.
Installation
pip install nougat-ocr
# GPU support: install a CUDA-enabled PyTorch build for your system first, then install nougat-ocr as above
# From source
git clone https://github.com/facebookresearch/nougat.git
cd nougat
pip install -e .
# Verify the CLI is installed (model weights are downloaded automatically on the first conversion)
nougat --help
Core Usage
Command Line
# Convert single PDF
nougat path/to/paper.pdf -o output_dir/
# Convert with specific model
nougat paper.pdf -o output/ -m 0.1.0-small
# Convert multiple PDFs
nougat paper1.pdf paper2.pdf paper3.pdf -o output/
# Process entire directory, reprocessing PDFs that already have output
nougat /path/to/pdfs/ -o output/ --recompute
# Specify pages (1-based; single PDF input only)
nougat paper.pdf -o output/ --pages 1-5
# Disable the failure-detection heuristic (existing outputs are skipped by default unless --recompute is given)
nougat paper.pdf -o output/ --no-skipping
# Use CPU (slower): hide GPUs from PyTorch; --full-precision can help on CPU
CUDA_VISIBLE_DEVICES="" nougat paper.pdf -o output/ --full-precision
# Batch processing with specific batch size
nougat paper.pdf -o output/ --batchsize 4
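The `--pages` option takes a compact page specification. As a rough illustration of how such a spec can be expanded, the sketch below turns a string like `1-4,7` into zero-based page indices. It assumes page numbers are 1-based, as in the CLI help text; `parse_page_spec` is a hypothetical helper written for this example, not part of the nougat package.

```python
from typing import List


def parse_page_spec(spec: str) -> List[int]:
    """Expand a 1-based page spec such as "1-4,7" into sorted 0-based indices."""
    indices = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = (int(x) for x in part.split("-", 1))
            if start < 1 or end < start:
                raise ValueError(f"invalid page range: {part!r}")
            # Pages start..end (inclusive, 1-based) map to indices start-1..end-1.
            indices.update(range(start - 1, end))
        else:
            page = int(part)
            if page < 1:
                raise ValueError(f"invalid page number: {part!r}")
            indices.add(page - 1)
    return sorted(indices)


print(parse_page_spec("1-4,7"))  # [0, 1, 2, 3, 6]
```

Validating the spec before launching a long conversion avoids discovering an off-by-one page range only after the model has run.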
Python API
from PIL import Image
from nougat import NougatModel
from nougat.utils.checkpoint import get_checkpoint
from nougat.utils.device import move_to_device
from nougat.postprocessing import markdown_compatible
from nougat.dataset.rasterize import rasterize_paper
# Load model (the checkpoint is downloaded on first use)
checkpoint = get_checkpoint(model_tag="0.1.0-base")
model = NougatModel.from_pretrained(checkpoint)
model = move_to_device(model)
model.eval()
# Convert PDF to images; with return_pil=True each page is an in-memory image buffer
pages = rasterize_paper("paper.pdf", return_pil=True)
# Process each page
for i, page in enumerate(pages):
    # Prepare input
    image = Image.open(page)
    sample = model.encoder.prepare_input(image)
    sample = sample.unsqueeze(0).to(model.device)
    # Generate markdown
    output = model.inference(image_tensors=sample)
    generated = output["predictions"][0]
    # Post-process
    markdown = markdown_compatible(generated)
    print(f"--- Page {i+1} ---")
    print(markdown)
Example output for a converted paper:
# Paper Title
## Abstract
This paper presents a novel approach to...
## 1 Introduction
We introduce a method that $\alpha + \beta = \gamma$ demonstrates...
### 1.1 Background
The equation governing the process is:
$$\mathcal{L} = \sum_{i=1}^{N} \log p(x_i | \theta)$$
## 2 Methodology
| Method | Accuracy | F1 Score |
|--------|----------|----------|
| Baseline | 0.85 | 0.83 |
| Ours | **0.92** | **0.91** |
## References
* [1] Author et al. "Title of Paper." Conference 2024.
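Because equations come out as LaTeX embedded in the Markdown, they can be pulled out for indexing or validation with a small amount of text processing. The sketch below assumes the `$...$` and `$$...$$` delimiters shown in the sample output above; if your output uses `\(...\)` and `\[...\]` instead, the patterns need adjusting. `extract_math` is a hypothetical helper for this example.

```python
import re

# Display math ($$...$$) may span lines; inline math ($...$) uses single dollars.
DISPLAY_MATH = re.compile(r"\$\$(.+?)\$\$", re.DOTALL)
INLINE_MATH = re.compile(r"(?<!\$)\$(?!\$)(.+?)(?<!\$)\$(?!\$)")


def extract_math(markdown: str) -> dict:
    """Return the display and inline LaTeX snippets found in a Markdown string."""
    display = [m.strip() for m in DISPLAY_MATH.findall(markdown)]
    # Remove display blocks so their contents are not matched again as inline math.
    remainder = DISPLAY_MATH.sub(" ", markdown)
    inline = [m.strip() for m in INLINE_MATH.findall(remainder)]
    return {"display": display, "inline": inline}


sample = (
    "We introduce a method that $\\alpha + \\beta = \\gamma$ demonstrates...\n"
    "$$\\mathcal{L} = \\sum_{i=1}^{N} \\log p(x_i | \\theta)$$\n"
)
found = extract_math(sample)
print(found["inline"])
print(found["display"])
```

This is a heuristic: a literal dollar sign in running text (for example a price) would be misread as a math delimiter.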
Models
| Model | Size | Performance | Speed |
|---|---|---|---|
| nougat-base | 350M params | Best quality | Slower |
| nougat-small | 250M params | Good quality | Faster |
# Download specific model
python -c "from nougat.utils.checkpoint import get_checkpoint; get_checkpoint(model_tag='0.1.0-base')"
python -c "from nougat.utils.checkpoint import get_checkpoint; get_checkpoint(model_tag='0.1.0-small')"
Configuration
Processing Options
| Parameter | Description | Default |
|---|---|---|
| --model / -m | Model tag (0.1.0-base, 0.1.0-small) | 0.1.0-small |
| --batchsize / -b | Batch size for processing | Chosen automatically from available memory |
| --pages / -p | Pages to process, 1-based (e.g., 1-5 or 1-4,7); single PDF only | All |
| --out / -o | Output directory | None (output is printed to the console) |
| --recompute | Reprocess PDFs that already have output | False |
| --full-precision | Use float32 instead of bfloat16 (can help on CPU) | False |
| --no-skipping | Disable the failure-detection heuristic that skips problematic pages | False |
| --markdown | Post-process to clean Markdown | True |
GPU Memory Requirements
| Model | Batch Size 1 | Batch Size 4 |
|---|---|---|
| nougat-base | ~6 GB VRAM | ~16 GB VRAM |
| nougat-small | ~4 GB VRAM | ~12 GB VRAM |
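One way to use the table above is to pick the largest listed batch size that fits in the memory you have free. The sketch below hard-codes the approximate figures from the table; `VRAM_GB` and `pick_batch_size` are hypothetical names for this example, and real memory use will vary with page size and precision.

```python
from typing import Optional

# Approximate VRAM needs in GB, copied from the table above: {model: {batch_size: GB}}
VRAM_GB = {
    "nougat-base": {1: 6, 4: 16},
    "nougat-small": {1: 4, 4: 12},
}


def pick_batch_size(model: str, free_vram_gb: float) -> Optional[int]:
    """Return the largest tabulated batch size that fits, or None if none fits."""
    fitting = [bs for bs, need in VRAM_GB[model].items() if need <= free_vram_gb]
    return max(fitting) if fitting else None


print(pick_batch_size("nougat-base", 8))    # 1
print(pick_batch_size("nougat-small", 12))  # 4
print(pick_batch_size("nougat-base", 4))    # None
```

A `None` result suggests switching to the smaller model or to CPU processing.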
Advanced Usage
Batch Processing Pipeline
from pathlib import Path
from PIL import Image
from nougat import NougatModel
from nougat.utils.checkpoint import get_checkpoint
from nougat.utils.device import move_to_device
from nougat.dataset.rasterize import rasterize_paper
from nougat.postprocessing import markdown_compatible
# Load the model once and reuse it for every PDF
model = NougatModel.from_pretrained(get_checkpoint(model_tag="0.1.0-base"))
model = move_to_device(model)
model.eval()
pdf_dir = Path("./papers")
output_dir = Path("./markdown")
output_dir.mkdir(parents=True, exist_ok=True)
for pdf_path in sorted(pdf_dir.glob("*.pdf")):
    output_path = output_dir / f"{pdf_path.stem}.md"
    # Skip PDFs that were already converted
    if output_path.exists():
        continue
    print(f"Processing: {pdf_path}")
    pages = []
    for page in rasterize_paper(pdf_path, return_pil=True):
        # Each rasterized page is an in-memory image buffer
        image = Image.open(page)
        sample = model.encoder.prepare_input(image).unsqueeze(0).to(model.device)
        output = model.inference(image_tensors=sample)
        pages.append(markdown_compatible(output["predictions"][0]))
    output_path.write_text("\n\n".join(pages), encoding="utf-8")
    print(f" -> {output_path}")
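The pipeline above is resumable because it skips PDFs whose output file already exists. That check can be factored out and tested without loading the model, which is handy for estimating how much work remains. The sketch below is a minimal version under that assumption; `pending_pdfs` is a hypothetical helper, and the `.md` suffix matches the pipeline above.

```python
import tempfile
from pathlib import Path


def pending_pdfs(pdf_dir, output_dir, suffix=".md"):
    """List PDFs in pdf_dir that have no matching output file in output_dir yet."""
    pdf_dir, output_dir = Path(pdf_dir), Path(output_dir)
    return sorted(
        pdf for pdf in pdf_dir.glob("*.pdf")
        if not (output_dir / f"{pdf.stem}{suffix}").exists()
    )


# Small self-check using throwaway directories.
with tempfile.TemporaryDirectory() as tmp:
    papers = Path(tmp, "papers")
    out = Path(tmp, "markdown")
    papers.mkdir()
    out.mkdir()
    (papers / "a.pdf").touch()
    (papers / "b.pdf").touch()
    (out / "a.md").touch()  # a.pdf counts as already converted
    todo = [p.name for p in pending_pdfs(papers, out)]
    print(todo)  # ['b.pdf']
```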
API Server
import os
import tempfile
from fastapi import FastAPI, UploadFile
from PIL import Image
from nougat import NougatModel
from nougat.utils.checkpoint import get_checkpoint
from nougat.utils.device import move_to_device
from nougat.dataset.rasterize import rasterize_paper
from nougat.postprocessing import markdown_compatible
app = FastAPI()
# Load the model once at startup
model = NougatModel.from_pretrained(get_checkpoint(model_tag="0.1.0-base"))
model = move_to_device(model)
model.eval()
@app.post("/convert")
async def convert_pdf(file: UploadFile):
    # Write the upload to a temporary file so it can be rasterized from disk
    with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
        tmp.write(await file.read())
        tmp_path = tmp.name
    try:
        pages = []
        for page in rasterize_paper(tmp_path, return_pil=True):
            image = Image.open(page)
            sample = model.encoder.prepare_input(image).unsqueeze(0).to(model.device)
            output = model.inference(image_tensors=sample)
            pages.append(markdown_compatible(output["predictions"][0]))
    finally:
        # Remove the temporary file even if conversion fails
        os.remove(tmp_path)
    return {"markdown": "\n\n".join(pages), "pages": len(pages)}
Integration with RAG
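import subprocess
from pathlib import Path
from langchain.text_splitter import MarkdownHeaderTextSplitter
def convert_pdf_to_markdown(pdf_path, out_dir="output"):
    # Run the nougat CLI, then read the .mmd file it writes for this PDF
    subprocess.run(["nougat", str(pdf_path), "-o", out_dir], check=True)
    return (Path(out_dir) / f"{Path(pdf_path).stem}.mmd").read_text(encoding="utf-8")
# Convert PDF to Markdown
markdown_text = convert_pdf_to_markdown("paper.pdf")
# Split by headers for RAG chunking
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]
splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
chunks = splitter.split_text(markdown_text)
# Each chunk has metadata with section headers
for chunk in chunks:
    print(f"Section: {chunk.metadata}")
    print(f"Content: {chunk.page_content[:100]}...")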
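To make the header-based chunking concrete, here is a dependency-free sketch of the same idea: walk the Markdown line by line, track the current `#`/`##`/`###` headers, and emit one chunk per section with those headers as metadata. It illustrates the technique only and is not LangChain's implementation; `split_by_headers` is a hypothetical name.

```python
import re

HEADING = re.compile(r"^(#{1,3})\s+(.*)$")


def split_by_headers(markdown: str) -> list:
    """Split Markdown into chunks, each tagged with its enclosing #/##/### headers."""
    chunks, current, lines = [], {}, []

    def flush():
        # Emit the text collected so far under the current header path.
        text = "\n".join(lines).strip()
        if text:
            chunks.append({"metadata": dict(current), "content": text})
        lines.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # A new header replaces same-level and deeper headers in the path.
            for deeper in range(level, 4):
                current.pop(f"Header {deeper}", None)
            current[f"Header {level}"] = match.group(2).strip()
        else:
            lines.append(line)
    flush()
    return chunks


doc = (
    "# Paper Title\n## Abstract\nThis paper presents...\n"
    "## 1 Introduction\nWe introduce...\n### 1.1 Background\nThe equation is..."
)
for chunk in split_by_headers(doc):
    print(chunk["metadata"], "->", chunk["content"])
```

Keeping the header path as metadata lets a retriever show which section an answer came from.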
Troubleshooting
| Issue | Solution |
|---|---|
| CUDA out of memory | Reduce batch size to 1, use nougat-small model |
| Poor LaTeX output | Use nougat-base model, check PDF is not a scan |
| Garbled text on scanned PDFs | Nougat works best on born-digital PDFs |
| Slow processing | Use GPU, increase batch size if VRAM allows |
| Missing pages in output | Check --pages range, use --no-skipping |
| Model download fails | Download the checkpoint manually from the project's GitHub releases and pass its directory with --checkpoint |
| Repetitive output | Known issue with some pages; post-process to detect loops |
| Tables misaligned | Use nougat-base for better table extraction |
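For the repetitive-output issue listed above, a simple post-processing check can flag pages that appear to be stuck in a loop. The sketch below is one possible heuristic, not the detection nougat itself uses: it reports whether any sequence of up to `max_period` words repeats back to back at least `min_repeats` times. The function name and thresholds are assumptions for this example.

```python
def has_repetition_loop(text: str, max_period: int = 20, min_repeats: int = 4) -> bool:
    """Return True if some run of 1..max_period words repeats back to back
    at least min_repeats times anywhere in the text."""
    words = text.split()
    n = len(words)
    for period in range(1, max_period + 1):
        run = 0  # consecutive positions where words[i] == words[i - period]
        for i in range(period, n):
            if words[i] == words[i - period]:
                run += 1
                # period * (min_repeats - 1) matching positions in a row means
                # a block of `period` words occurs min_repeats times in a row.
                if run >= period * (min_repeats - 1):
                    return True
            else:
                run = 0
    return False


print(has_repetition_loop("the cat " * 4))  # True
print(has_repetition_loop("the cat " * 3))  # False
```

Pages that trip the check can be re-run or reviewed by hand. Content that repeats legitimately, such as table rows with many empty cells, may produce false positives.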
# Test installation
python -c "import nougat; print('Nougat installed')"
# Check CUDA availability
python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}')"
# Verify model download
python -c "from nougat import NougatModel; from nougat.utils.checkpoint import get_checkpoint; m = NougatModel.from_pretrained(get_checkpoint(model_tag='0.1.0-base')); print('Model loaded')"