Docling - Document Parsing for RAG Cheatsheet

Docling is an open-source document-parsing toolkit (an LF AI & Data project) that converts PDF, DOCX, PPTX, XLSX, HTML, images, and other formats into a structured document model, exportable as Markdown or JSON, that preserves layout, tables, headings, and reading order. It includes hierarchy-aware chunking that attaches structural metadata to each chunk, and it integrates with LangChain and LlamaIndex, which makes it a strong open-source option for the ingestion stage of a RAG pipeline.

Installation

| Method | Command |
| --- | --- |
| pip | pip install docling |
| uv | uv add docling |
| With the Tesseract OCR extra | pip install "docling[tesserocr]" |
| Verify | docling --version |

CLI Usage

| Command | Description |
| --- | --- |
| docling document.pdf | Convert a file to Markdown (default) |
| docling --to json document.pdf | Output structured JSON |
| docling --to md --output out/ report.docx | Convert and write the result to a directory |
| docling https://example.com/page.html | Convert from a URL |
| docling --force-ocr scanned.pdf | Force full-page OCR for scanned documents |
| docling --help | Full option list |
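
When the CLI needs to be driven from a script, one option is to build the argument list in Python and hand it to `subprocess`. The sketch below uses only the `--to` and `--output` flags listed above; the helper names `build_docling_cmd` and `convert_folder` are hypothetical, and the code assumes the `docling` executable is on `PATH`.

```python
import subprocess
from pathlib import Path


def build_docling_cmd(source, to="md", output=None):
    """Build an argv list for the docling CLI (hypothetical helper)."""
    cmd = ["docling", "--to", to]
    if output is not None:
        cmd += ["--output", str(output)]
    cmd.append(str(source))
    return cmd


def convert_folder(folder, output, to="md", pattern="*.pdf"):
    """Run docling once per matching file in `folder`.

    check=True makes a failed conversion raise CalledProcessError
    instead of being silently skipped.
    """
    for path in sorted(Path(folder).glob(pattern)):
        subprocess.run(build_docling_cmd(path, to=to, output=output), check=True)
```

Passing the arguments as a list rather than a shell string avoids quoting problems with file names that contain spaces.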

Python: Basic Conversion

from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("report.pdf")

doc = result.document
print(doc.export_to_markdown())   # clean Markdown
print(doc.export_to_dict())       # structured JSON-able dict

| Method / attribute | Returns |
| --- | --- |
| export_to_markdown() | Markdown with headings and tables preserved |
| export_to_dict() | Structured document model (JSON-able) |
| export_to_doctags() | DocTags representation |
| result.document | The parsed DoclingDocument |
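
As a minimal sketch of how the export methods above might be combined, the helper below writes both the Markdown and the JSON-able dict for one parsed document to disk. The function name `write_exports`, the output layout, and the file names are assumptions made for illustration; only `export_to_markdown()` and `export_to_dict()` come from the table.

```python
import json
from pathlib import Path


def write_exports(doc, out_dir, stem):
    """Write `doc` as <stem>.md and <stem>.json under out_dir (hypothetical helper).

    `doc` is expected to be a DoclingDocument, but any object exposing
    export_to_markdown() and export_to_dict() will work.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    md_path = out / f"{stem}.md"
    json_path = out / f"{stem}.json"

    md_path.write_text(doc.export_to_markdown(), encoding="utf-8")
    # export_to_dict() is described as JSON-able, so json.dumps should accept it.
    json_path.write_text(
        json.dumps(doc.export_to_dict(), ensure_ascii=False, indent=2),
        encoding="utf-8",
    )
    return md_path, json_path


# Usage (requires Docling and a real input file):
#   from docling.document_converter import DocumentConverter
#   doc = DocumentConverter().convert("report.pdf").document
#   write_exports(doc, "out", "report")
```

Keeping the JSON next to the Markdown means the structured model is still available later if the Markdown turns out to be too lossy for a given retrieval task.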

Supported Inputs

| Format | Notes |
| --- | --- |
| PDF | Layout, tables, reading order; OCR for scans |
| DOCX / PPTX / XLSX | Office formats |
| HTML | Web pages and exports |
| Images | PNG/JPG/TIFF via OCR |
| Markdown / AsciiDoc | Re-structured into the document model |

Hierarchy-Aware Chunking

Docling’s chunkers split a parsed document for embedding while keeping structural context (section headings, table boundaries) in each chunk’s metadata.

from docling.document_converter import DocumentConverter
from docling.chunking import HybridChunker

doc = DocumentConverter().convert("report.pdf").document
chunker = HybridChunker(tokenizer="sentence-transformers/all-MiniLM-L6-v2")

for chunk in chunker.chunk(doc):
    print(chunk.text)
    print(chunk.meta)   # headings, page, provenance

| Chunker / attribute | Behavior |
| --- | --- |
| HierarchicalChunker | Split on document structure (sections, items) |
| HybridChunker | Structure-aware + tokenizer-aware sizing/merging |
| chunk.meta | Carries headings/provenance for context expansion |
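
One way to use the heading metadata for context expansion is to prepend the section headings to the chunk text before embedding it. The sketch below is an illustration rather than Docling's own method: the helper name `with_heading_context` is hypothetical, and it assumes `chunk.meta.headings` is a list of heading strings (or `None`). A stand-in chunk is used so the example runs without Docling installed.

```python
from types import SimpleNamespace


def with_heading_context(chunk, sep="\n"):
    """Return the chunk text prefixed by its section headings, if any."""
    # Assumption: chunk.meta.headings is a list of heading strings or None.
    headings = getattr(chunk.meta, "headings", None) or []
    return sep.join([*headings, chunk.text])


# Stand-in for a real Docling chunk (illustrative values only).
demo = SimpleNamespace(
    text="Torque the bolts to 12 Nm.",
    meta=SimpleNamespace(headings=["Maintenance", "Wheel assembly"]),
)
print(with_heading_context(demo))
```

Embedding the heading-prefixed string tends to help short chunks whose text alone does not say which section they belong to.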

RAG Framework Integration

| Framework | How |
| --- | --- |
| LangChain | DoclingLoader returns documents/chunks |
| LlamaIndex | Docling reader/node parser |
| Custom | Use export_to_markdown() or chunker output directly |

# LangChain example
from langchain_docling import DoclingLoader
docs = DoclingLoader(file_path="report.pdf").load()

Performance & Options

| Option | Purpose |
| --- | --- |
| OCR engine | Choose/disable OCR (EasyOCR, Tesseract, etc.) |
| Table mode | Accurate vs fast table structure recovery |
| Device | Run models on CPU or GPU |
| Page range | Limit conversion to specific pages |
| Pipeline options | Tune the conversion pipeline per format |
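
As a rough sketch of how two of these options (OCR on/off and table mode) might be set for PDFs from Python: the function name `build_pdf_converter` and its parameters are hypothetical, and the import paths reflect Docling's Python API as commonly documented, so they may differ between versions. Device and page-range settings are left out because their exact names vary by release.

```python
def build_pdf_converter(use_ocr=True, fast_tables=False):
    """Return a DocumentConverter with PDF pipeline options applied (sketch)."""
    # Imported lazily so this module still loads where Docling is not installed.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode
    from docling.document_converter import DocumentConverter, PdfFormatOption

    options = PdfPipelineOptions()
    options.do_ocr = use_ocr              # turn OCR on or off
    options.do_table_structure = True     # keep table structure recovery enabled
    options.table_structure_options.mode = (
        TableFormerMode.FAST if fast_tables else TableFormerMode.ACCURATE
    )

    # Options are attached per input format; only PDF is configured here.
    return DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=options)}
    )


# Usage (requires Docling and a real input file):
#   converter = build_pdf_converter(use_ocr=False, fast_tables=True)
#   doc = converter.convert("report.pdf").document
```

Disabling OCR for born-digital PDFs and switching to the fast table mode are typical first steps when conversion time matters more than table fidelity.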

Common Workflows

# Convert a folder of PDFs to Markdown for ingestion
for f in docs/*.pdf; do docling --to md --output corpus/ "$f"; done

# Parse → chunk → embed, the RAG ingestion core
from docling.document_converter import DocumentConverter
from docling.chunking import HybridChunker

doc = DocumentConverter().convert("manual.pdf").document
chunks = list(HybridChunker().chunk(doc))
# embed each chunk.text with your model, store chunk.meta alongside
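
The embed-and-store step can be sketched with a tiny in-memory index. Everything here is illustrative: `toy_embed` is a hypothetical word-count stand-in for a real embedding model, and `build_index` and `search` are made-up helper names. The only thing taken from the workflow above is that each chunk exposes `.text` and `.meta`.

```python
import math
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def build_index(chunks, embed=toy_embed):
    """Store each chunk's vector alongside its text and metadata."""
    return [{"vector": embed(c.text), "text": c.text, "meta": c.meta} for c in chunks]


def search(index, query, k=3, embed=toy_embed):
    """Return the k records most similar to the query, best first."""
    q = embed(query)
    return sorted(index, key=lambda r: cosine(q, r["vector"]), reverse=True)[:k]
```

In a real pipeline the list would be replaced by a vector store and `toy_embed` by an actual model, but the record shape (vector, text, metadata) stays the same, which is what lets retrieved chunks be traced back to their headings and pages.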

Docling vs Other Parsers

| Aspect | Docling | Unstructured | Marker |
| --- | --- | --- | --- |
| Output | Markdown + structured model | Typed elements | Markdown |
| Chunking | Built-in, hierarchy-aware | Element-based | External |
| Speed | Good (CPU) | Good | Fastest with GPU |
| Best for | Self-hosted RAG ingestion | Typed element pipelines | GPU bulk Markdown |

Resources