Chunky - RAG Chunking Toolkit Cheatsheet

Chunky is an open-source toolkit for building reliable RAG ingestion pipelines, focused on the often-neglected chunking stage. It converts PDFs to Markdown, cleans documents, lets you inspect chunks and compare chunking strategies side by side, and enriches chunk metadata for LLM applications. Because retrieval quality is capped by how documents are split, Chunky’s value lies in making the chunking step visible and tunable rather than leaving it as a blind default.

Installation

| Method | Command |
| --- | --- |
| pip | pip install chunky |
| uv | uv add chunky |
| From source | git clone https://github.com/GiovanniPasq/chunky && cd chunky && pip install -e . |
| Verify | python -c "import chunky; print('ok')" |

Pipeline Stages

| Stage | Purpose |
| --- | --- |
| Convert | Turn PDFs/docs into clean Markdown |
| Clean | Remove boilerplate, fix artifacts |
| Chunk | Split text using a chosen strategy |
| Inspect | Visualize the resulting chunks |
| Compare | Run multiple strategies and compare |
| Enrich | Attach metadata (headings, source, position) |

Convert & Clean

import chunky

# Convert a PDF to clean Markdown
md = chunky.to_markdown("report.pdf")

# Clean common artifacts (headers/footers, hyphenation, noise)
md = chunky.clean(md)

| Function | Description |
| --- | --- |
| to_markdown(path) | Convert a document to Markdown |
| clean(text) | Strip boilerplate and normalize |
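
To make the cleaning step concrete, here is a minimal standard-library sketch of two of the artifacts mentioned above: repeated headers/footers and line-break hyphenation. It is independent of Chunky's own implementation, which this cheatsheet does not show; the function name `clean_markdown`, the per-page input, and the `min_repeat` threshold are all assumptions made for illustration.

```python
import re
from collections import Counter


def clean_markdown(pages: list[str], min_repeat: int = 3) -> str:
    """Hypothetical helper: drop repeated header/footer lines, then de-hyphenate."""
    # Count on how many pages each non-empty (stripped) line appears.
    line_pages = Counter()
    for page in pages:
        for line in {ln.strip() for ln in page.splitlines() if ln.strip()}:
            line_pages[line] += 1

    # Lines present on at least `min_repeat` pages are treated as boilerplate.
    boilerplate = {ln for ln, n in line_pages.items() if n >= min_repeat}

    kept = []
    for page in pages:
        kept.extend(ln for ln in page.splitlines() if ln.strip() not in boilerplate)
    text = "\n".join(kept)

    # Rejoin words split across a line break: "chun-\nking" -> "chunking".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of three or more newlines into one blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

One known limitation of this sketch: the de-hyphenation rule also joins genuine compounds that happen to break at the hyphen (for example "well-" / "known"), so a real cleaner would likely need a dictionary check or similar safeguard.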

Chunking Strategies

| Strategy | Splits on |
| --- | --- |
| Fixed-size | N tokens/characters with overlap |
| Recursive | Paragraphs → sentences → words as needed |
| Markdown / header-aware | Document structure (#, ##, sections) |
| Semantic | Embedding-similarity boundaries |
| Token-aware refinement | Merge undersized, split oversized chunks |

chunks = chunky.chunk(
    md,
    strategy="header_aware",
    max_tokens=512,
    overlap=64,
    repeat_headers=True,   # carry section headers across table splits
)
for c in chunks:
    print(len(c.tokens), c.metadata["heading"])
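
How `max_tokens` and `overlap` typically interact is easiest to see in the fixed-size case. The sketch below is a generic illustration, not Chunky's implementation: it assumes whitespace-separated words as a stand-in for real tokenizer output, and the function name `fixed_size_chunks` is hypothetical. Each window starts `max_tokens - overlap` tokens after the previous one, so consecutive chunks share `overlap` tokens.

```python
def fixed_size_chunks(
    tokens: list[str], max_tokens: int = 512, overlap: int = 64
) -> list[list[str]]:
    """Hypothetical helper: sliding windows of `max_tokens` sharing `overlap` tokens."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be >= 0 and smaller than max_tokens")

    stride = max_tokens - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + max_tokens])
        # Stop once a window has reached the end, to avoid a redundant tail chunk.
        if start + max_tokens >= len(tokens):
            break
    return chunks


if __name__ == "__main__":
    words = "a b c d e f g h i j".split()
    for chunk in fixed_size_chunks(words, max_tokens=4, overlap=1):
        print(chunk)
```

A larger overlap reduces the chance that a sentence is cut off from its context, at the cost of more chunks and more duplicated tokens in the index.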

Inspect & Compare

The differentiator: see and compare what each strategy produces before committing.

# Inspect chunk boundaries and sizes
chunky.inspect(chunks)            # sizes, overlaps, boundaries

# Compare strategies on the same document
report = chunky.compare(
    md,
    strategies=["fixed", "recursive", "header_aware", "semantic"],
    max_tokens=512,
)
print(report)   # per-strategy stats: count, size distribution, fragmentation

| Function | Shows |
| --- | --- |
| inspect(chunks) | Size distribution, overlap, boundaries |
| compare(text, strategies=[...]) | Side-by-side strategy metrics |
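
The kind of per-strategy statistics a comparison report contains can be approximated in a few lines. This is a rough sketch under stated assumptions, not the output format of `chunky.compare`: the names `chunk_stats` and `compare_strategies` are hypothetical, sizes are counted in whitespace-separated words, and the count of undersized chunks is used as a simple proxy for fragmentation.

```python
from typing import Callable


def chunk_stats(chunks: list[str], min_tokens: int = 20) -> dict:
    """Hypothetical helper: summarize chunk sizes measured in whitespace words."""
    sizes = [len(c.split()) for c in chunks]
    if not sizes:
        return {"count": 0, "mean": 0.0, "min": 0, "max": 0, "undersized": 0}
    return {
        "count": len(sizes),
        "mean": round(sum(sizes) / len(sizes), 1),
        "min": min(sizes),
        "max": max(sizes),
        # Chunks below `min_tokens` serve as a crude fragmentation signal.
        "undersized": sum(s < min_tokens for s in sizes),
    }


def compare_strategies(
    text: str,
    strategies: dict[str, Callable[[str], list[str]]],
    min_tokens: int = 20,
) -> dict[str, dict]:
    """Run every splitter on the same text and collect its statistics."""
    return {name: chunk_stats(split(text), min_tokens) for name, split in strategies.items()}


if __name__ == "__main__":
    sample = "one two three\n\nfour five\n\nsix"
    splitters = {
        "paragraph": lambda t: [p for p in t.split("\n\n") if p.strip()],
        "fixed_4": lambda t: [
            " ".join(t.split()[i:i + 4]) for i in range(0, len(t.split()), 4)
        ],
    }
    for name, stats in compare_strategies(sample, splitters, min_tokens=2).items():
        print(name, stats)
```

Running every splitter against the same document is the point: differences in count and in the number of undersized chunks then reflect the strategy, not the input.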

Metadata Enrichment

| Metadata | Use in retrieval |
| --- | --- |
| Heading path | Context expansion / filtering |
| Source + page | Citations |
| Position/index | Ordering and neighbor lookup |
| Token count | Budget management at prompt time |

enriched = chunky.enrich(chunks, source="report.pdf")
# each chunk.metadata now carries heading path, page, source, index
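
To show what a heading path looks like in practice, the following standard-library sketch splits Markdown at headings and attaches source, heading path, index, and a token count to each chunk. It is an illustration only: the `Chunk` dataclass, the function `split_with_metadata`, and the metadata key names are assumptions, and Chunky's actual keys may differ. Page numbers are omitted because plain Markdown does not carry them, and token counts are approximated by whitespace words.

```python
import re
from dataclasses import dataclass, field

# ATX-style Markdown heading: one to six '#' characters, a space, then the title.
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)


def split_with_metadata(md: str, source: str) -> list[Chunk]:
    """Hypothetical helper: one chunk per section, tagged with its heading path."""
    chunks: list[Chunk] = []
    path: list[tuple[int, str]] = []  # stack of (level, title) for open headings
    body: list[str] = []

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            chunks.append(Chunk(text, {
                "source": source,
                "heading_path": " > ".join(title for _, title in path),
                "index": len(chunks),
                "token_count": len(text.split()),
            }))
        body.clear()

    for line in md.splitlines():
        match = HEADING.match(line)
        if match:
            flush()  # close the previous section under its own heading path
            level = len(match.group(1))
            # Pop headings at the same or a deeper level before pushing the new one.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2)))
        else:
            body.append(line)
    flush()
    return chunks
```

This sketch does not special-case fenced code blocks, so a line such as a `#` comment inside a code sample would be misread as a heading; a production splitter would need to track fence state.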

Common Workflows

# End-to-end: PDF → clean Markdown → header-aware chunks → enriched
import chunky
md = chunky.clean(chunky.to_markdown("manual.pdf"))
chunks = chunky.chunk(md, strategy="header_aware", max_tokens=512, overlap=64)
chunks = chunky.enrich(chunks, source="manual.pdf")
# embed chunk.text, store chunk.metadata alongside in your vector DB

# Choose a strategy with evidence, not guesswork
print(chunky.compare(md, strategies=["recursive", "header_aware", "semantic"]))
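
Once position metadata is stored next to each chunk, a retrieval hit can be widened to include its neighbors before the text is placed in a prompt. The sketch below assumes each stored record is a plain dict with `source`, `index`, and `text` keys; those key names and the function `expand_with_neighbors` are hypothetical and not taken from Chunky or any particular vector DB.

```python
def expand_with_neighbors(
    records: list[dict], source: str, hit_index: int, window: int = 1
) -> str:
    """Hypothetical helper: return the hit chunk plus `window` neighbors per side."""
    # Key records by (source, index) so chunks from other documents never mix in.
    by_key = {(r["source"], r["index"]): r for r in records}
    picked = [
        by_key[(source, i)]
        for i in range(hit_index - window, hit_index + window + 1)
        if (source, i) in by_key
    ]
    # Neighbors are joined in document order, separated by a blank line.
    return "\n\n".join(r["text"] for r in picked)


if __name__ == "__main__":
    store = [{"source": "manual.pdf", "index": i, "text": f"part {i}"} for i in range(5)]
    print(expand_with_neighbors(store, "manual.pdf", hit_index=2))
```

The stored token count can then be used to cap how many neighbors fit within the prompt budget.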

Chunky vs Other Approaches

| Aspect | Chunky | Framework default splitters | Docling |
| --- | --- | --- | --- |
| Strategy comparison | First-class | Manual | Limited |
| Conversion + clean | Built-in | Separate | Built-in |
| Best for | Tuning the chunking stage | Quick start | Full parse + chunk |

Chunky pairs well with Docling for parsing and with your vector DB for storage; its own job is getting the split right.

Resources