DataTrove - LLM Data Processing Pipelines Cheatsheet
DataTrove is an open-source library from Hugging Face for processing, filtering, and deduplicating large text datasets for LLM training. It provides composable pipeline blocks (readers, filters, dedup stages, writers) that run unchanged across execution backends: locally, on a Slurm cluster, or on Ray. It is the reference implementation behind the FineWeb and FineWeb-Edu datasets, which makes it the go-to tool when you want to reproduce or adapt a published, CPU-first curation pipeline.
Installation
| Method | Command |
|---|---|
| pip | pip install datatrove |
| With all extras | pip install "datatrove[all]" |
| Processing extras | pip install "datatrove[processing]" |
| From source | git clone https://github.com/huggingface/datatrove && cd datatrove && pip install -e . |
Core Concepts
| Concept | Meaning |
|---|---|
| Pipeline | An ordered list of blocks data flows through |
| Block | One step: reader, filter, dedup, writer, etc. |
| Document | The unit of data flowing through a pipeline: text, id, and metadata |
| Executor | Where the pipeline runs (Local / Slurm / Ray) |
| Tasks | Parallel shards of the workload |
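To make the vocabulary above concrete, here is a toy model in plain Python. It does not import DataTrove and is not its implementation: `Doc`, `read_docs`, `min_words`, and `run_task` are hypothetical names, and the round-robin assignment of files to tasks is an assumption of this sketch.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, Iterator

@dataclass
class Doc:
    """Toy stand-in for a document: text, id, and free-form metadata."""
    text: str
    id: str
    metadata: dict = field(default_factory=dict)

# A block takes a stream of documents and returns a stream of documents.
Block = Callable[[Iterable[Doc]], Iterator[Doc]]

def read_docs(files: dict[str, list[str]], names: list[str]) -> Iterator[Doc]:
    """Reader: turn each line of each assigned 'file' into a Doc."""
    for name in names:
        for i, line in enumerate(files[name]):
            yield Doc(text=line, id=f"{name}/{i}", metadata={"file": name})

def min_words(n: int) -> Block:
    """Filter: keep documents with at least n whitespace-separated words."""
    def block(docs: Iterable[Doc]) -> Iterator[Doc]:
        return (d for d in docs if len(d.text.split()) >= n)
    return block

def run_task(files: dict[str, list[str]], blocks: list[Block],
             rank: int, world_size: int) -> list[Doc]:
    """One task: take every world_size-th file starting at rank, run all blocks."""
    shard = sorted(files)[rank::world_size]
    stream: Iterable[Doc] = read_docs(files, shard)
    for block in blocks:
        stream = block(stream)
    return list(stream)

files = {
    "a.txt": ["short", "this line has five words"],
    "b.txt": ["another line that is long enough", "tiny"],
    "c.txt": ["one two three four"],
}
blocks = [min_words(4)]

# The "executor" here is just a loop over task ranks; a real one runs tasks in parallel.
results = [run_task(files, blocks, rank, world_size=2) for rank in range(2)]
kept_ids = sorted(d.id for task in results for d in task)
print(kept_ids)  # ['a.txt/1', 'b.txt/0', 'c.txt/0']
```

Because each task only touches its own shard of files, tasks never need to communicate, which is what lets the same block list run on one machine or on many.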
Pipeline Blocks
| Category | Examples |
|---|---|
| Readers | WarcReader, JsonlReader, ParquetReader, HuggingFaceDatasetReader |
| Extractors | Trafilatura (HTML → text) |
| Filters | LanguageFilter, GopherQualityFilter, GopherRepetitionFilter, C4QualityFilter, FineWebQualityFilter, URLFilter |
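| Dedup | MinHash stages (MinhashDedupSignature, MinhashDedupBuckets, MinhashDedupCluster, MinhashDedupFilter), sentence-level dedup stages, exact-substring dedup stages |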
| Writers | JsonlWriter, ParquetWriter |
| Tokenization | DocumentTokenizer |
A Basic Pipeline
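from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.filters import GopherQualityFilter, LanguageFilter
from datatrove.pipeline.readers import JsonlReader
from datatrove.pipeline.writers import JsonlWriter

# Blocks run in order; each one consumes the documents produced by the previous one.
pipeline = [
    JsonlReader("data/input/"),        # read documents from JSONL files
    LanguageFilter(languages=["en"]),  # keep English documents
    GopherQualityFilter(),             # drop documents failing Gopher-style quality rules
    JsonlWriter("data/output/"),       # write surviving documents as JSONL
]

# Guard the entry point: the local executor starts worker processes.
if __name__ == "__main__":
    executor = LocalPipelineExecutor(pipeline=pipeline, tasks=8)  # 8 parallel shards
    executor.run()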
Executors (Same Pipeline, Different Backends)
| Executor | Use |
|---|---|
| LocalPipelineExecutor | One machine, multiprocessing |
| SlurmPipelineExecutor | HPC clusters via Slurm jobs |
| RayPipelineExecutor | Ray clusters |
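from datatrove.executor import SlurmPipelineExecutor

SlurmPipelineExecutor(
    pipeline=pipeline,   # same block list as the local example
    tasks=1000,          # number of shards to process
    time="20:00:00",     # Slurm time limit
    partition="cpu",     # Slurm partition to submit to
    workers=200,         # at most 200 tasks running at once
).run()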
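In the call above, `tasks` is the number of shards and `workers` caps how many of them run at the same time. A rough way to reason about wall-clock time, assuming every task takes about as long as the others, is to count sequential "waves" of tasks. The `waves` helper below is hypothetical and not part of DataTrove; treating a non-positive `workers` as "no limit" is an assumption of this sketch.

```python
import math

def waves(tasks: int, workers: int) -> int:
    """Number of sequential waves when at most `workers` tasks run at once."""
    if tasks <= 0:
        return 0
    if workers <= 0 or workers >= tasks:
        # No effective cap: everything runs in a single wave.
        return 1
    return math.ceil(tasks / workers)

n_waves = waves(tasks=1000, workers=200)
print(n_waves)  # 5
```

With 1000 tasks and 200 workers, the job needs about five waves, so total time is roughly five times the duration of one task plus queueing delays.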
Deduplication
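Deduplication in DataTrove runs as several pipelines executed in sequence rather than as a single block. For MinHash the stages are signatures → buckets → clusters → filter, and each stage reads the output of the previous one.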
| Method | Block |
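|---|---|
| MinHash (fuzzy) | MinhashDedupSignature → MinhashDedupBuckets → MinhashDedupCluster → MinhashDedupFilter |
| Exact substring | ESDatasetToSequence → ESMergeSequences → ESRangeRemover |
| Sentence-level | SentenceDedupSignature → SentenceFindDedups → SentenceDedupFilter |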
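The four MinHash stages can be sketched in plain Python. This is a toy illustration of the general technique, not DataTrove's code: the function names, the word 3-gram shingles, the 12 hashes split into bands of 3 rows, and the use of `blake2b` are all assumptions chosen for brevity.

```python
import hashlib
from collections import defaultdict

def shingles(text: str, n: int = 3) -> set[str]:
    """Word n-grams of a lowercased text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def h(seed: int, s: str) -> int:
    """Deterministic 64-bit hash of s, salted by seed."""
    salt = seed.to_bytes(8, "little")
    return int.from_bytes(hashlib.blake2b(s.encode(), digest_size=8, salt=salt).digest(), "little")

def signature(text: str, num_hashes: int = 12) -> tuple[int, ...]:
    """Stage 1: per-seed minimum hash over all shingles."""
    sh = shingles(text)
    return tuple(min(h(seed, s) for s in sh) for seed in range(num_hashes))

def buckets(sigs: dict[str, tuple[int, ...]], rows: int = 3) -> list[list[str]]:
    """Stage 2: split signatures into bands; documents with an equal band share a bucket."""
    table = defaultdict(list)
    for doc_id, sig in sigs.items():
        for b in range(0, len(sig), rows):
            table[(b, sig[b:b + rows])].append(doc_id)
    return [ids for ids in table.values() if len(ids) > 1]

def clusters(matches: list[list[str]]) -> dict[str, str]:
    """Stage 3: union-find over bucket matches; maps each matched doc to its cluster root."""
    parent: dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for ids in matches:
        for other in ids[1:]:
            ra, rb = find(ids[0]), find(other)
            if ra != rb:
                parent[max(ra, rb)] = min(ra, rb)  # smallest id becomes the root
    return {x: find(x) for x in parent}

def dedup(docs: dict[str, str]) -> list[str]:
    """Stage 4: keep one document per cluster (its root) plus every unmatched document."""
    sigs = {doc_id: signature(text) for doc_id, text in docs.items()}
    roots = clusters(buckets(sigs))
    return [doc_id for doc_id in docs if roots.get(doc_id, doc_id) == doc_id]

docs = {
    "a": "the quick brown fox jumps over the lazy dog near the river bank",
    "b": "the quick brown fox jumps over the lazy dog near the river bank",  # exact copy of "a"
    "c": "datatrove pipelines read filter deduplicate and write large text corpora",
}
kept = dedup(docs)
print(kept)  # ['a', 'c']
```

Exact copies always collide in every band. Near-duplicates collide only with some probability, which rises with the number of bands and falls with the number of rows per band, so those two parameters set the effective similarity threshold.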
Quality Filters
| Filter | Heuristic |
|---|---|
| GopherQualityFilter | Length, symbol ratios, bullet/ellipsis limits |
| GopherRepetitionFilter | Excessive repetition |
| C4QualityFilter | C4-style rules (terminal punctuation, etc.) |
| FineWebQualityFilter | FineWeb recipe heuristics |
| LanguageFilter | fastText language ID threshold |
| URLFilter | Block/allow by URL/domain |
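To give a feel for what "length, symbol ratios, bullet/ellipsis limits" means in practice, here is a minimal sketch of such heuristics in plain Python. It is not the code of `GopherQualityFilter`: the function name `gopher_like_check` is hypothetical, and the thresholds are illustrative assumptions, not the library's defaults.

```python
def gopher_like_check(text: str,
                      min_words: int = 50,
                      max_words: int = 100_000,
                      max_symbol_ratio: float = 0.1,
                      max_bullet_lines: float = 0.9,
                      max_ellipsis_lines: float = 0.3) -> tuple[bool, str]:
    """Return (keep, reason). All thresholds are illustrative assumptions."""
    words = text.split()
    n = len(words)
    if n == 0:
        return False, "empty"
    if n < min_words:
        return False, "too_few_words"
    if n > max_words:
        return False, "too_many_words"

    # Ratio of hash signs and ellipses to words.
    symbols = text.count("#") + text.count("...") + text.count("…")
    if symbols / n > max_symbol_ratio:
        return False, "symbol_ratio"

    # Line-level checks on non-empty lines.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    bullets = sum(ln.startswith(("-", "*", "•")) for ln in lines)
    if bullets / len(lines) > max_bullet_lines:
        return False, "bullet_lines"
    ellipsis = sum(ln.endswith(("...", "…")) for ln in lines)
    if ellipsis / len(lines) > max_ellipsis_lines:
        return False, "ellipsis_lines"
    return True, "ok"

good = ("DataTrove pipelines read, filter, deduplicate, and write text.\n"
        "Each block streams documents to the next one.")
bullet_spam = "- buy now\n- click here\n- free offer today"
teaser = " ".join(["word"] * 20) + "...\n" + " ".join(["word"] * 10)

# min_words is lowered so the short samples reach the later checks.
print(gopher_like_check(good, min_words=5))         # (True, 'ok')
print(gopher_like_check(bullet_spam, min_words=5))  # (False, 'bullet_lines')
print(gopher_like_check(teaser, min_words=5))       # (False, 'ellipsis_lines')
print(gopher_like_check("too short", min_words=5))  # (False, 'too_few_words')
```

Returning a reason alongside the verdict makes it easy to count how many documents each rule removes, which helps when tuning thresholds for a new corpus.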
Common Workflows
# Reproduce a FineWeb-style web pipeline (conceptual order)
# WarcReader → URLFilter → Trafilatura → LanguageFilter →
# GopherQuality/Repetition → C4/FineWeb filters → MinhashDedup → JsonlWriter
# Scale the identical pipeline from laptop to cluster by swapping the executor:
# LocalPipelineExecutor → SlurmPipelineExecutor → RayPipelineExecutor
DataTrove vs NeMo Curator
| Aspect | DataTrove | NeMo Curator |
|---|---|---|
| Compute | CPU-first | GPU-native (RAPIDS) |
| Best for | Reproducing FineWeb-style datasets | Dedup-heavy 10T+ token runs |
| Backends | Local / Slurm / Ray | Ray |
| Origin | Hugging Face | NVIDIA |
Resources