Docling - Document Parsing for RAG Cheatsheet
Docling is an open-source document-parsing toolkit (an LF AI & Data project) that converts PDF, DOCX, PPTX, XLSX, HTML, images, and more into a structured document model, which can be exported as clean Markdown or JSON that preserves layout, tables, headings, and reading order. Its hierarchy-aware chunkers attach structural metadata to each chunk, and it integrates directly with LangChain and LlamaIndex, which makes it a strong open-source choice for the ingestion stage of a RAG pipeline.
Installation
| Method | Command |
|---|---|
| pip | pip install docling |
| uv | uv add docling |
| With an OCR engine extra | pip install "docling[tesserocr]" (other engines have their own extras, e.g. "docling[rapidocr]") |
| Verify | docling --version |
CLI Usage
| Command | Description |
|---|---|
| docling document.pdf | Convert a file to Markdown (default) |
| docling --to json document.pdf | Output structured JSON |
| docling --to md --output out/ report.docx | Write Markdown output into the out/ directory |
| docling https://example.com/page.html | Convert from a URL |
| docling --force-ocr scanned.pdf | Force full-page OCR for scanned documents (plain --ocr is already on by default; --no-ocr disables it) |
| docling --help | Full option list |
Python: Basic Conversion
from docling.document_converter import DocumentConverter
converter = DocumentConverter()
result = converter.convert("report.pdf")
doc = result.document
print(doc.export_to_markdown()) # clean Markdown
print(doc.export_to_dict()) # structured JSON-able dict
| Method | Returns |
|---|---|
| export_to_markdown() | Markdown with headings/tables preserved |
| export_to_dict() | Structured document model (JSON-able) |
| export_to_doctags() | DocTags representation |
| result.document | The parsed DoclingDocument |
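A minimal sketch of persisting both exports to disk, assuming only the export_to_markdown() and export_to_dict() methods listed above. The helper name save_exports and the output file names are illustrative choices, not part of Docling; any object exposing those two methods works.

```python
import json
from pathlib import Path


def save_exports(doc, out_dir, stem="document"):
    """Write a parsed document as <stem>.md and <stem>.json under out_dir.

    `doc` is assumed to expose export_to_markdown() and export_to_dict(),
    as a DoclingDocument does. The function name and file layout are
    illustrative, not a Docling API.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Markdown export: convenient for inspection or simple text ingestion.
    md_path = out / f"{stem}.md"
    md_path.write_text(doc.export_to_markdown(), encoding="utf-8")

    # Dict export: the structured model, serialized as JSON.
    json_path = out / f"{stem}.json"
    json_path.write_text(
        json.dumps(doc.export_to_dict(), ensure_ascii=False, indent=2),
        encoding="utf-8",
    )
    return md_path, json_path
```

With a real conversion this would be called as save_exports(result.document, "out/", stem="report").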
| Format | Notes |
|---|---|
| PDF | Layout, tables, reading order; OCR for scans |
| DOCX / PPTX / XLSX | Office formats |
| HTML | Web pages and exports |
| Images | PNG/JPG/TIFF via OCR |
| Markdown / AsciiDoc | Re-structured into the document model |
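A hedged sketch of batch-converting a mixed folder. The suffix set is an assumption derived from the format table above rather than an authoritative list, and the helper names find_inputs and convert_folder are hypothetical. The conversion step relies on DocumentConverter.convert_all with raises_on_error=False so that one unreadable file does not abort the batch.

```python
from pathlib import Path

# Extensions taken from the format table above; this set is an assumption
# for illustration, not an exhaustive list of what Docling accepts.
SUPPORTED_SUFFIXES = {
    ".pdf", ".docx", ".pptx", ".xlsx", ".html",
    ".png", ".jpg", ".jpeg", ".tiff", ".md", ".adoc",
}


def find_inputs(root):
    """Return files under root whose suffix is in SUPPORTED_SUFFIXES, sorted."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() in SUPPORTED_SUFFIXES
    )


def convert_folder(root):
    """Yield (path, markdown) for each file that converts successfully."""
    # Imported here so find_inputs() stays usable without Docling installed.
    from docling.datamodel.base_models import ConversionStatus
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()
    # raises_on_error=False keeps one bad file from aborting the whole batch.
    for result in converter.convert_all(find_inputs(root), raises_on_error=False):
        if result.status == ConversionStatus.SUCCESS:
            yield result.input.file, result.document.export_to_markdown()
```

Reusing a single DocumentConverter across the batch avoids reloading the layout and table models for every file.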
Hierarchy-Aware Chunking
Docling’s chunkers split a parsed document for embedding while keeping structural context (section headings, table boundaries) in each chunk’s metadata.
from docling.document_converter import DocumentConverter
from docling.chunking import HybridChunker
doc = DocumentConverter().convert("report.pdf").document
chunker = HybridChunker(tokenizer="sentence-transformers/all-MiniLM-L6-v2")
for chunk in chunker.chunk(doc):
    print(chunk.text)
    print(chunk.meta) # headings, page, provenance
| Chunker | Behavior |
|---|---|
| HierarchicalChunker | Split on document structure (sections, items) |
| HybridChunker | Structure-aware + tokenizer-aware sizing/merging |
| chunk.meta | Carries headings/provenance for context expansion |
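One way to use the heading metadata for context expansion is to prepend each chunk's heading path to its text before embedding. The sketch below is a hand-rolled stand-in that assumes chunks expose .text and .meta.headings (a list of strings or None); the function names are hypothetical. Recent Docling releases may also offer a built-in contextualize() method on the chunker for the same purpose, so check your installed version before rolling your own.

```python
def with_heading_context(text, headings=None, sep="\n"):
    """Prefix chunk text with its heading path, outermost heading first."""
    parts = [h for h in (headings or []) if h]  # drop None/empty headings
    parts.append(text)
    return sep.join(parts)


def embedding_inputs(chunks):
    """Build embedding strings from objects exposing .text and .meta.headings."""
    return [
        with_heading_context(c.text, getattr(c.meta, "headings", None))
        for c in chunks
    ]
```

Embedding the heading path together with the body tends to help retrieval for short chunks whose text is ambiguous on its own, such as a single table row or list item.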
RAG Framework Integration
| Framework | How |
|---|---|
| LangChain | DoclingLoader returns documents/chunks |
| LlamaIndex | DoclingReader and DoclingNodeParser |
| Custom | Use export_to_markdown() or chunker output directly |
# LangChain example
from langchain_docling import DoclingLoader
docs = DoclingLoader(file_path="report.pdf").load()
| Option | Purpose |
|---|---|
| OCR engine | Choose/disable OCR (EasyOCR, Tesseract, etc.) |
| Table mode | Accurate vs fast table structure recovery |
| Device | Run models on CPU or GPU |
| Page range | Limit conversion to specific pages |
| Pipeline options | Tune the conversion pipeline per format |
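A minimal sketch of how the OCR and table-mode options above can be set for PDFs, assuming the PdfPipelineOptions, TableFormerMode, and PdfFormatOption classes from a recent Docling release. The wrapper function build_pdf_converter and its parameters are hypothetical, and module paths may differ between versions.

```python
def build_pdf_converter(use_ocr=True, accurate_tables=True):
    """Return a DocumentConverter with PDF pipeline options applied.

    The function name and parameters are illustrative; only the Docling
    classes imported below come from the library.
    """
    # Imported lazily so this module can be loaded without Docling installed.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import (
        PdfPipelineOptions,
        TableFormerMode,
    )
    from docling.document_converter import DocumentConverter, PdfFormatOption

    options = PdfPipelineOptions()
    options.do_ocr = use_ocr            # disable for born-digital PDFs
    options.do_table_structure = True   # keep table structure recovery on
    options.table_structure_options.mode = (
        TableFormerMode.ACCURATE if accurate_tables else TableFormerMode.FAST
    )

    # Pipeline options are registered per input format.
    return DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=options)}
    )
```

A typical call would be build_pdf_converter(use_ocr=False).convert("report.pdf"). Recent versions also appear to accept a page_range argument on convert() for limiting pages; verify that against your installed version.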
Common Workflows
# Convert a folder of PDFs to Markdown for ingestion
for f in docs/*.pdf; do docling --to md --output corpus/ "$f"; done
# Parse → chunk → embed, the RAG ingestion core
from docling.document_converter import DocumentConverter
from docling.chunking import HybridChunker
doc = DocumentConverter().convert("manual.pdf").document
chunks = list(HybridChunker().chunk(doc))
# embed each chunk.text with your model, store chunk.meta alongside
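The "store chunk.meta alongside" step can be sketched as turning chunks into plain records ready for a vector store. This assumes only that each chunk exposes .text and .meta.headings; the record layout, the chunk_records name, and the hash-based id scheme are illustrative choices, and the embedding call itself is left to whichever model you use.

```python
import hashlib


def chunk_records(chunks, source):
    """Turn chunks into dicts with a stable id, text, and heading metadata.

    `chunks` is any iterable of objects with .text and .meta.headings
    (as produced by Docling's chunkers); `source` identifies the document.
    """
    records = []
    for index, chunk in enumerate(chunks):
        headings = list(getattr(chunk.meta, "headings", None) or [])
        # Deterministic id: re-ingesting the same document yields the same
        # ids, which makes upserts into a vector store idempotent.
        digest = hashlib.sha1(
            f"{source}:{index}:{chunk.text}".encode("utf-8")
        ).hexdigest()
        records.append(
            {
                "id": digest[:16],
                "source": source,
                "index": index,
                "headings": headings,
                "text": chunk.text,
            }
        )
    return records
```

Each record's text field is what gets embedded; the remaining fields travel with the vector so a retrieved chunk can be traced back to its section and source file.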
Docling vs Other Parsers
| Aspect | Docling | Unstructured | Marker |
|---|---|---|---|
| Output | Markdown + structured model | Typed elements | Markdown |
| Chunking | Built-in, hierarchy-aware | Element-based | External |
| Speed | Good (CPU) | Good | Fastest with GPU |
| Best for | Self-hosted RAG ingestion | Typed element pipelines | GPU bulk Markdown |
Resources