Docling - Document Parsing for RAG Cheatsheet

Docling is an open-source document-parsing toolkit (an LF AI & Data project) that converts PDF, DOCX, PPTX, XLSX, HTML, images, and more into a structured representation: clean Markdown or JSON that preserves layout, tables, headings, and reading order. It includes hierarchy-aware chunking that enriches chunks with structural metadata, and it integrates directly with LangChain and LlamaIndex, which makes it a strong open-source choice for the ingestion stage of a RAG pipeline.

Installation

Method	Command
pip	`pip install docling`
uv	`uv add docling`
With an OCR engine extra	`pip install "docling[tesserocr]"` (other engines such as `rapidocr` and `ocrmac` have their own extras)
Verify	`docling --version`

CLI Usage

Command	Description
`docling document.pdf`	Convert a file to Markdown (default)
`docling --to json document.pdf`	Output structured JSON
`docling --to md --output out/ report.docx`	Write the Markdown output into a directory
`docling https://example.com/page.html`	Convert from a URL
`docling --force-ocr scanned.pdf`	Force full-page OCR for scanned documents (plain `--ocr` is already on by default)
`docling --help`	Full option list

Python: Basic Conversion

from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("report.pdf")

doc = result.document
print(doc.export_to_markdown())   # clean Markdown
print(doc.export_to_dict())       # structured JSON-able dict

Method	Returns
`export_to_markdown()`	Markdown with headings/tables preserved
`export_to_dict()`	Structured document model (JSON-able)
`export_to_doctags()`	DocTags representation
`result.document`	The parsed `DoclingDocument`
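
To persist the exports listed above, a small helper can write both forms to disk. This is a minimal sketch, not part of Docling: `save_exports` and its file-naming scheme are hypothetical, and it only assumes the object passed in exposes `export_to_markdown()` and `export_to_dict()` as shown in the table.

```python
import json
from pathlib import Path


def save_exports(doc, out_dir, stem):
    """Write Markdown and JSON exports of a parsed document into out_dir.

    `doc` is any object exposing export_to_markdown() and export_to_dict(),
    for example the `result.document` from a conversion.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    md_path = out / f"{stem}.md"
    json_path = out / f"{stem}.json"

    md_path.write_text(doc.export_to_markdown(), encoding="utf-8")
    # default=str keeps the dump from failing on values json cannot encode natively
    json_path.write_text(
        json.dumps(doc.export_to_dict(), ensure_ascii=False, indent=2, default=str),
        encoding="utf-8",
    )
    return md_path, json_path
```

Keeping the JSON next to the Markdown lets you re-chunk later without re-running the conversion.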

Supported Inputs

Format	Notes
PDF	Layout, tables, reading order; OCR for scans
DOCX / PPTX / XLSX	Office formats
HTML	Web pages and exports
Images	PNG/JPG/TIFF via OCR
Markdown / AsciiDoc	Re-structured into the document model
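
When ingesting a mixed folder, it can help to select only files in the formats listed above before converting. The sketch below is illustrative: `SUPPORTED_SUFFIXES` and `find_convertible` are hypothetical names, and the suffix set is an assumption derived from the table rather than an authoritative list from Docling.

```python
from pathlib import Path

# Assumed suffixes, read off the table above; adjust to the formats you ingest.
SUPPORTED_SUFFIXES = {
    ".pdf", ".docx", ".pptx", ".xlsx", ".html",
    ".png", ".jpg", ".jpeg", ".tiff", ".md", ".adoc",
}


def find_convertible(root):
    """Return sorted file paths under root whose suffix is in SUPPORTED_SUFFIXES."""
    return sorted(
        p
        for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() in SUPPORTED_SUFFIXES
    )
```

The resulting list can then be handed to the converter one file at a time, or in one call via `DocumentConverter().convert_all(paths)`.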

Hierarchy-Aware Chunking

Docling’s chunkers split a parsed document for embedding while keeping structural context (section headings, table boundaries) in each chunk’s metadata.

from docling.document_converter import DocumentConverter
from docling.chunking import HybridChunker

doc = DocumentConverter().convert("report.pdf").document
chunker = HybridChunker(tokenizer="sentence-transformers/all-MiniLM-L6-v2")

for chunk in chunker.chunk(doc):
    print(chunk.text)
    print(chunk.meta)   # headings, page, provenance

Chunker	Behavior
`HierarchicalChunker`	Split on document structure (sections, items)
`HybridChunker`	Structure-aware + tokenizer-aware sizing/merging
`chunk.meta`	Carries headings/provenance for context expansion
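
One common use of the heading metadata is to prepend the section trail to the chunk text before embedding, so the vector carries its context. The helper below is a minimal sketch of that idea, not Docling's own implementation: `add_heading_context` is a hypothetical name, and it assumes the headings arrive as a list of strings (or `None`), for example from `chunk.meta.headings`.

```python
def add_heading_context(text, headings, sep=" > "):
    """Prefix chunk text with its heading trail; return text unchanged if no headings."""
    # Drop empty or whitespace-only headings and trim the rest.
    trail = sep.join(h.strip() for h in (headings or []) if h and h.strip())
    return f"{trail}\n{text}" if trail else text
```

For example, a chunk under "Manual" then "Maintenance" would be embedded as `Manual > Maintenance` followed by the chunk text on the next line.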

RAG Framework Integration

Framework	How
LangChain	`DoclingLoader` returns documents/chunks
LlamaIndex	`DoclingReader` and `DoclingNodeParser`
Custom	Use `export_to_markdown()` or chunker output directly

# LangChain example
from langchain_docling import DoclingLoader
docs = DoclingLoader(file_path="report.pdf").load()

Performance & Options

Option	Purpose
OCR engine	Choose/disable OCR (EasyOCR, Tesseract, etc.)
Table mode	Accurate vs fast table structure recovery
Device	Run models on CPU or GPU
Page range	Limit conversion to specific pages
Pipeline options	Tune the conversion pipeline per format
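
The OCR and table-mode rows above map onto PDF pipeline options passed to the converter. The following is a sketch under assumptions: `build_pdf_converter` is a hypothetical wrapper, and the import paths and attribute names reflect recent Docling releases, so check them against your installed version. Device and page-range settings are left out because their location varies between versions.

```python
def build_pdf_converter(do_ocr=True, accurate_tables=True):
    """Return a DocumentConverter with explicit OCR and table-mode settings for PDFs."""
    # Imported lazily so this module still loads where docling is not installed.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode
    from docling.document_converter import DocumentConverter, PdfFormatOption

    opts = PdfPipelineOptions()
    opts.do_ocr = do_ocr                      # disable for born-digital PDFs
    opts.do_table_structure = True            # keep table structure recovery on
    opts.table_structure_options.mode = (
        TableFormerMode.ACCURATE if accurate_tables else TableFormerMode.FAST
    )
    return DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=opts)}
    )
```

Calling `build_pdf_converter(do_ocr=False, accurate_tables=False).convert("report.pdf")` would trade table fidelity for speed on text-based PDFs.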

Common Workflows

# Convert a folder of PDFs to Markdown for ingestion
for f in docs/*.pdf; do docling --to md --output corpus/ "$f"; done

# Parse → chunk → embed, the RAG ingestion core
from docling.document_converter import DocumentConverter
from docling.chunking import HybridChunker

doc = DocumentConverter().convert("manual.pdf").document
chunks = list(HybridChunker().chunk(doc))
# embed each chunk.text with your model, store chunk.meta alongside
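
The embed-and-store step in the comment above can be sketched as a function that turns chunks into records for a vector store. Everything here is illustrative: `build_records` is a hypothetical helper, `embed` stands in for whatever embedding model you use, and the chunks are assumed to be passed as `(text, meta)` pairs where `meta` is a JSON-serializable form of `chunk.meta`.

```python
import hashlib


def build_records(chunks, embed):
    """Turn (text, meta) pairs into records ready to upsert into a vector store."""
    records = []
    for text, meta in chunks:
        records.append({
            # Content-derived id: re-ingesting the same text yields the same id.
            "id": hashlib.sha1(text.encode("utf-8")).hexdigest()[:16],
            "text": text,
            "vector": embed(text),
            "meta": meta,
        })
    return records


if __name__ == "__main__":
    # Toy embedder for demonstration only; replace with a real model.
    demo = build_records([("alpha", {"headings": ["A"]})], lambda t: [float(len(t))])
    print(demo[0]["id"], demo[0]["vector"])
```

Deriving the id from the chunk text makes repeated ingestion idempotent, at the cost of collapsing chunks with identical text into one record.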

Docling vs Other Parsers

Aspect	Docling	Unstructured	Marker
Output	Markdown + structured model	Typed elements	Markdown
Chunking	Built-in, hierarchy-aware	Element-based	External
Speed	Good (CPU)	Good	Fastest with GPU
Best for	Self-hosted RAG ingestion	Typed element pipelines	GPU bulk Markdown
