Chunky is an open-source toolkit for building reliable RAG ingestion pipelines, focused on the often-neglected chunking stage. It converts PDFs to Markdown, cleans documents, lets you inspect chunks and compare chunking strategies side by side, and enriches chunk metadata for LLM applications. Because retrieval quality is capped by how documents are split, Chunky’s value lies in making the chunking step visible and tunable rather than leaving it as a blind default.
Installation
| Method | Command |
|---|---|
| pip | pip install chunky |
| uv | uv add chunky |
| From source | git clone https://github.com/GiovanniPasq/chunky && cd chunky && pip install -e . |
| Verify | python -c "import chunky; print('ok')" |
Pipeline Stages
| Stage | Purpose |
|---|---|
| Convert | Turn PDFs/docs into clean Markdown |
| Clean | Remove boilerplate, fix artifacts |
| Chunk | Split text using a chosen strategy |
| Inspect | Visualize the resulting chunks |
| Compare | Run multiple strategies and compare |
| Enrich | Attach metadata (headings, source, position) |
Convert & Clean
import chunky
# Convert a PDF to clean Markdown
md = chunky.to_markdown("report.pdf")
# Clean common artifacts (headers/footers, hyphenation, noise)
md = chunky.clean(md)
| Function | Description |
|---|---|
| to_markdown(path) | Convert a document to Markdown |
| clean(text) | Strip boilerplate and normalize |
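To make the cleaning step concrete, here is a minimal standalone sketch of two of the artifact fixes named above: rejoining words hyphenated across a line break and dropping lines that repeat on most pages (likely headers or footers). It does not use Chunky and is not its implementation. The helper names `dehyphenate` and `drop_repeated_lines`, the `min_fraction` threshold, and the sample pages are all assumptions made for illustration.

```python
import re
from collections import Counter


def dehyphenate(text: str) -> str:
    # Join a word split by a hyphen at a line break ("implemen-" + "tation").
    # Caveat: this also joins genuine compounds that happen to break at the hyphen.
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)


def drop_repeated_lines(pages: list[str], min_fraction: float = 0.6) -> list[str]:
    # Count on how many pages each distinct non-empty line appears.
    counts: Counter = Counter()
    for page in pages:
        counts.update({line.strip() for line in page.splitlines() if line.strip()})
    # A line seen on at least this many pages is treated as a header/footer.
    threshold = max(2, int(len(pages) * min_fraction))
    repeated = {line for line, n in counts.items() if n >= threshold}
    return [
        "\n".join(l for l in page.splitlines() if l.strip() not in repeated)
        for page in pages
    ]


# Hypothetical sample input: three pages sharing a running header.
pages = [
    "ACME Manual\nThe implemen-\ntation is simple.",
    "ACME Manual\nSecond page text.",
    "ACME Manual\nThird page text.",
]
cleaned = [dehyphenate(p) for p in drop_repeated_lines(pages)]
print(cleaned)
```

Per-page input is assumed here because repeated-line detection needs page boundaries; if you only have one merged string, that part of the sketch does not apply.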
Chunking Strategies
| Strategy | Splits on |
|---|---|
| Fixed-size | N tokens/characters with overlap |
| Recursive | Paragraphs → sentences → words as needed |
| Markdown / header-aware | Document structure (#, ##, sections) |
| Semantic | Embedding-similarity boundaries |
| Token-aware refinement | Merge undersized, split oversized chunks |
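As an illustration of the first row of the table, the following sketch shows what fixed-size chunking with overlap means mechanically. It is a simplified standalone example, not Chunky's code: it treats whitespace-separated words as tokens (a real tokenizer would count differently), and the function name `fixed_size_chunks` is hypothetical.

```python
def fixed_size_chunks(text: str, max_tokens: int = 8, overlap: int = 2) -> list[list[str]]:
    # Assumption: "tokens" are whitespace-separated words.
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be >= 0 and smaller than max_tokens")
    tokens = text.split()
    step = max_tokens - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # the window already reached the end of the text
    return chunks


# Hypothetical sample: 20 numbered words.
words = " ".join(f"w{i}" for i in range(20))
chunks = fixed_size_chunks(words, max_tokens=8, overlap=2)
for c in chunks:
    print(len(c), c[0], c[-1])
```

Each chunk starts `max_tokens - overlap` tokens after the previous one, so the last `overlap` tokens of one chunk reappear at the start of the next. That shared tail is what keeps a sentence that straddles a boundary retrievable from either side.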
chunks = chunky.chunk(
md,
strategy="header_aware",
max_tokens=512,
overlap=64,
repeat_headers=True, # carry section headers across table splits
)
for c in chunks:
    print(len(c.tokens), c.metadata["heading"])
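For readers who want to see what header-aware splitting involves, here is a small standalone sketch that splits Markdown on ATX headings (`#` to `######`) and records the heading path for each chunk. It is not Chunky's implementation and ignores `max_tokens`, overlap, and table handling. The function name `split_by_headers` and the dictionary layout are assumptions, and a `#` line inside a fenced code block would be misread as a heading.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headers(markdown: str) -> list[dict]:
    chunks: list[dict] = []
    path: list[tuple[int, str]] = []  # stack of (level, title) for open sections
    body: list[str] = []              # lines collected since the last heading

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            chunks.append({"heading_path": [t for _, t in path], "text": text})
        body.clear()

    for line in markdown.splitlines():
        m = HEADING.match(line)
        if m:
            flush()  # close the section that was being collected
            level = len(m.group(1))
            # Pop headings at the same or a deeper level before pushing this one.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, m.group(2).strip()))
        else:
            body.append(line)
    flush()
    return chunks


# Hypothetical sample document.
doc = "# Guide\nIntro text.\n## Setup\nInstall it.\n## Usage\nRun it.\n"
sections = split_by_headers(doc)
for s in sections:
    print(" > ".join(s["heading_path"]), "|", s["text"])
```

Keeping the full heading path, rather than only the nearest heading, lets a retriever filter by section or prepend the path to the chunk text for context.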
Inspect & Compare
The differentiator: see and compare what each strategy produces before committing.
# Inspect chunk boundaries and sizes
chunky.inspect(chunks) # sizes, overlaps, boundaries
# Compare strategies on the same document
report = chunky.compare(
md,
strategies=["fixed", "recursive", "header_aware", "semantic"],
max_tokens=512,
)
print(report) # per-strategy stats: count, size distribution, fragmentation
| Function | Shows |
|---|---|
| inspect(chunks) | Size distribution, overlap, boundaries |
| compare(text, strategies=[...]) | Side-by-side strategy metrics |
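To show the kind of numbers a side-by-side comparison can rest on, the sketch below computes simple per-strategy statistics for two toy splitters. It is a standalone illustration and makes no claim about what Chunky's `compare` actually reports. The names `chunk_stats` and `compare_strategies`, the `tiny_fraction` metric, the `min_tokens` cutoff, and the two splitters are all hypothetical, and tokens are approximated by whitespace-separated words.

```python
import re
from collections.abc import Callable
from statistics import mean, median


def chunk_stats(chunks: list[str], min_tokens: int = 5) -> dict:
    # Assumption: token count is approximated by word count.
    sizes = [len(c.split()) for c in chunks]
    if not sizes:
        return {"count": 0}
    return {
        "count": len(sizes),
        "mean": round(mean(sizes), 1),
        "median": median(sizes),
        "min": min(sizes),
        "max": max(sizes),
        # Share of chunks below min_tokens: a rough fragmentation signal.
        "tiny_fraction": sum(s < min_tokens for s in sizes) / len(sizes),
    }


def compare_strategies(text: str, strategies: dict[str, Callable[[str], list[str]]]) -> dict[str, dict]:
    # Run every splitter on the same text and collect its statistics.
    return {name: chunk_stats(split(text)) for name, split in strategies.items()}


def by_paragraph(text: str) -> list[str]:
    return [p for p in text.split("\n\n") if p.strip()]


def by_sentence(text: str) -> list[str]:
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


# Hypothetical sample text.
text = (
    "First paragraph has two sentences. Here is the second one.\n\n"
    "Second paragraph is short."
)
report = compare_strategies(text, {"paragraph": by_paragraph, "sentence": by_sentence})
for name, stats in report.items():
    print(name, stats)
```

A high chunk count combined with a high `tiny_fraction` usually signals over-fragmentation, while a very large `max` suggests chunks that may not fit the embedding model's input limit.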
Enrich Metadata
| Metadata | Use in retrieval |
|---|---|
| Heading path | Context expansion / filtering |
| Source + page | Citations |
| Position/index | Ordering and neighbor lookup |
| Token count | Budget management at prompt time |
enriched = chunky.enrich(chunks, source="report.pdf")
# each chunk.metadata now carries heading path, page, source, index
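The retrieval uses listed in the table can be sketched with a toy chunk type. This is a standalone illustration under stated assumptions, not Chunky's data model: the `Chunk` dataclass, the `enrich_sketch`, `neighbors`, and `fit_budget` helpers, and the word-count token estimate are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    # Hypothetical stand-in for a chunk object with text and metadata.
    text: str
    metadata: dict = field(default_factory=dict)


def enrich_sketch(chunks: list[Chunk], source: str) -> list[Chunk]:
    # Attach source, position, and an approximate token count to each chunk.
    for i, c in enumerate(chunks):
        c.metadata.update(source=source, index=i, token_count=len(c.text.split()))
    return chunks


def neighbors(chunks: list[Chunk], index: int, window: int = 1) -> list[Chunk]:
    # Neighbor lookup: the chunk at `index` plus up to `window` chunks on each side.
    lo = max(0, index - window)
    return chunks[lo:index + window + 1]


def fit_budget(chunks: list[Chunk], budget: int) -> list[Chunk]:
    # Budget management: take chunks in order until the token budget is used up.
    picked, used = [], 0
    for c in chunks:
        n = c.metadata["token_count"]
        if used + n > budget:
            break
        picked.append(c)
        used += n
    return picked


# Hypothetical sample chunks.
chunks = enrich_sketch(
    [Chunk("alpha beta gamma"), Chunk("delta epsilon"), Chunk("zeta eta theta iota")],
    source="report.pdf",
)
print([c.metadata for c in chunks])
print([c.text for c in fit_budget(chunks, budget=5)])
```

Storing the index at ingestion time is what makes neighbor expansion cheap later: the retriever can fetch the chunks on either side of a hit without re-reading the source document.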
Common Workflows
# End-to-end: PDF → clean Markdown → header-aware chunks → enriched
import chunky
md = chunky.clean(chunky.to_markdown("manual.pdf"))
chunks = chunky.chunk(md, strategy="header_aware", max_tokens=512, overlap=64)
chunks = chunky.enrich(chunks, source="manual.pdf")
# embed chunk.text, store chunk.metadata alongside in your vector DB
# Choose a strategy with evidence, not guesswork
print(chunky.compare(md, strategies=["recursive", "header_aware", "semantic"]))
Chunky vs Other Approaches
| Aspect | Chunky | Framework default splitters | Docling |
|---|---|---|---|
| Strategy comparison | First-class | Manual | Limited |
| Conversion + clean | Built-in | Separate | Built-in |
| Best for | Tuning the chunking stage | Quick start | Full parse + chunk |
Chunky pairs well with Docling for parsing and with your vector DB for storage; Chunky’s own job is getting the split right.
Resources