DataTrove - LLM Data Processing Pipelines Cheatsheet

DataTrove is an open-source library from Hugging Face for processing, filtering, and deduplicating large text datasets for LLM training. It provides a set of composable pipeline blocks — readers, filters, dedup stages, writers — that run unchanged across execution backends: locally, on a Slurm cluster, or on Ray. It is the reference implementation behind the FineWeb and FineWeb-Edu datasets, which makes it the go-to tool when you want to reproduce or adapt a published, CPU-first curation pipeline.

Installation

| Method | Command |
| --- | --- |
| pip | pip install datatrove |
| With all extras | pip install "datatrove[all]" |
| Processing extras | pip install "datatrove[processing]" |
| From source | git clone https://github.com/huggingface/datatrove && cd datatrove && pip install -e . |
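
To confirm which install a given environment actually picked up, a short standard-library check is usually enough. This is a minimal sketch; `installed_version` is a hypothetical helper name, and it assumes the distribution is registered under the name `datatrove`, as in the pip commands above.

```python
from importlib import metadata


def installed_version(package: str = "datatrove"):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    print(f"datatrove {version}" if version else "datatrove is not installed")
```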

Core Concepts

| Concept | Meaning |
| --- | --- |
| Pipeline | An ordered list of blocks data flows through |
| Block | One step: reader, filter, dedup, writer, etc. |
| Document | The unit of data (text, id, metadata) |
| Executor | Where the pipeline runs (Local / Slurm / Ray) |
| Tasks | Parallel shards of the workload |
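
The concepts in the table can be pictured with a toy model in plain Python: each block is a function that takes a stream of documents and yields a stream of documents, and a pipeline simply chains them. This is a conceptual sketch only. `Doc`, `read_list`, `min_length_filter`, `collect`, and `run_pipeline` are hypothetical names, not DataTrove classes.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, Iterator, List


@dataclass
class Doc:
    """Toy stand-in for a document: text, id, and free-form metadata."""
    text: str
    id: str
    metadata: dict = field(default_factory=dict)


# A block maps a stream of documents to a stream of documents.
Block = Callable[[Iterable[Doc]], Iterator[Doc]]


def read_list(texts: List[str]) -> Block:
    """Reader: ignores its input and emits one Doc per string."""
    def block(_docs: Iterable[Doc]) -> Iterator[Doc]:
        for i, text in enumerate(texts):
            yield Doc(text=text, id=str(i))
    return block


def min_length_filter(min_chars: int) -> Block:
    """Filter: passes through only documents with at least `min_chars` characters."""
    def block(docs: Iterable[Doc]) -> Iterator[Doc]:
        for doc in docs:
            if len(doc.text) >= min_chars:
                yield doc
    return block


def collect(sink: List[Doc]) -> Block:
    """Writer: appends every document it sees to `sink`."""
    def block(docs: Iterable[Doc]) -> Iterator[Doc]:
        for doc in docs:
            sink.append(doc)
            yield doc
    return block


def run_pipeline(blocks: List[Block]) -> None:
    """Chain the blocks lazily, then drain the final stream."""
    stream: Iterable[Doc] = iter(())
    for block in blocks:
        stream = block(stream)
    for _ in stream:
        pass
```

Because every block is a generator, documents move through the chain one at a time and the whole dataset never has to fit in memory.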

Pipeline Blocks

| Category | Examples |
| --- | --- |
| Readers | WarcReader, JsonlReader, ParquetReader, HuggingFaceDatasetReader |
| Extractors | Trafilatura (HTML → text) |
| Filters | LanguageFilter, GopherQualityFilter, GopherRepetitionFilter, C4QualityFilter, FineWebQualityFilter, URLFilter |
| Dedup | MinHash, sentence-level, and exact-substring dedup, each a family of stage blocks (e.g. MinhashDedupSignature, MinhashDedupFilter) |
| Writers | JsonlWriter, ParquetWriter |
| Tokenization | DocumentTokenizer |

A Basic Pipeline

from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.readers import JsonlReader
from datatrove.pipeline.filters import LanguageFilter, GopherQualityFilter
from datatrove.pipeline.writers import JsonlWriter

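# Blocks run in order; each document flows from the reader to the writer.
pipeline = [
    JsonlReader("data/input/"),        # read JSONL files from this folder
    LanguageFilter(languages=["en"]),  # keep documents identified as English
    GopherQualityFilter(),             # Gopher quality heuristics, default thresholds
    JsonlWriter("data/output/"),       # write the surviving documents
]

# tasks=8 splits the work into 8 tasks that run in parallel on this machine.
executor = LocalPipelineExecutor(pipeline=pipeline, tasks=8)
executor.run()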
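
The `tasks` argument is easiest to reason about as file-level sharding: each task gets a rank and reads only its own slice of the input files. The sketch below assumes a round-robin assignment by rank; `shard_for_task` is a hypothetical helper, and DataTrove's actual assignment may differ in detail.

```python
def shard_for_task(files, rank, world_size):
    """Return the input files that task `rank` (0-based) out of `world_size` would read."""
    if not 0 <= rank < world_size:
        raise ValueError(f"rank must be in [0, {world_size}), got {rank}")
    # Sort first so that every task computes the same assignment independently.
    return sorted(files)[rank::world_size]


if __name__ == "__main__":
    files = [f"part-{i}.jsonl" for i in range(5)]
    for rank in range(8):
        print(rank, shard_for_task(files, rank, 8))
```

Under this assumption, running 8 tasks over 5 files leaves three tasks with nothing to do, so it is worth keeping the task count at or below the number of input files.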

Executors (Same Pipeline, Different Backends)

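| Executor | Use |
| --- | --- |
| LocalPipelineExecutor | One machine, multiprocessing |
| SlurmPipelineExecutor | HPC clusters via Slurm jobs |
| RayPipelineExecutor | Ray clusters |

from datatrove.executor import SlurmPipelineExecutor
SlurmPipelineExecutor(
    pipeline=pipeline,
    tasks=1000,       # total number of tasks (shards of the workload)
    time="20:00:00",  # Slurm time limit
    partition="cpu",  # Slurm partition to submit to
    workers=200,      # at most 200 tasks running at once
).run()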
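
One way to keep the backend a configuration choice is a small factory that picks the executor by name. This is a sketch, not part of DataTrove: `make_executor` is a hypothetical helper, it assumes all three executor classes are importable from `datatrove.executor` in the installed version, and it passes backend-specific options (such as `time` and `partition` for Slurm) through unchanged.

```python
def make_executor(backend, pipeline, tasks, **kwargs):
    """Build an executor for `backend` ("local", "slurm", or "ray").

    Imports are deferred so that choosing one backend does not require
    the optional dependencies of the others.
    """
    if backend == "local":
        from datatrove.executor import LocalPipelineExecutor
        return LocalPipelineExecutor(pipeline=pipeline, tasks=tasks, **kwargs)
    if backend == "slurm":
        from datatrove.executor import SlurmPipelineExecutor
        return SlurmPipelineExecutor(pipeline=pipeline, tasks=tasks, **kwargs)
    if backend == "ray":
        # Assumption: the installed DataTrove version ships a Ray executor.
        from datatrove.executor import RayPipelineExecutor
        return RayPipelineExecutor(pipeline=pipeline, tasks=tasks, **kwargs)
    raise ValueError(f"unknown backend: {backend!r}")


# Usage (requires datatrove and a pipeline list):
# make_executor("slurm", pipeline, tasks=1000,
#               time="20:00:00", partition="cpu", workers=200).run()
```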

Deduplication

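DataTrove’s dedup methods run as multi-step pipelines rather than as single blocks. MinHash, for example, runs four stages: signatures → buckets → clusters → filter. Sentence-level and exact-substring dedup follow their own multi-step sequences.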

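| Method | Blocks (in order) |
| --- | --- |
| MinHash (fuzzy) | MinhashDedupSignature → MinhashDedupBuckets → MinhashDedupCluster → MinhashDedupFilter |
| Exact substring | ESDatasetToSequence → ESMergeSequences → ESRangeRemover |
| Sentence-level | SentenceDedupSignature → SentenceFindDedups → SentenceDedupFilter |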
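
The four MinHash stages can be sketched in plain Python to show what each one contributes. This is an illustration of the general technique, not DataTrove's implementation: the function names are hypothetical, and the shingle size, signature length, and band count are arbitrary small values chosen for readability.

```python
import hashlib
from collections import defaultdict

NUM_HASHES = 16            # signature length (illustrative)
BANDS = 4                  # number of buckets each document falls into
ROWS = NUM_HASHES // BANDS


def shingles(text, n=3):
    """Set of word n-grams; very short texts yield a single shingle."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def _hash(seed, shingle):
    digest = hashlib.sha1(f"{seed}:{shingle}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def signature(text):
    """Stage 1: one minimum hash value per seeded hash function."""
    sh = shingles(text)
    return [min(_hash(seed, s) for s in sh) for seed in range(NUM_HASHES)]


def bucket_keys(sig):
    """Stage 2: split the signature into bands; each band is one bucket key."""
    return [(b, tuple(sig[b * ROWS:(b + 1) * ROWS])) for b in range(BANDS)]


def cluster(sigs):
    """Stage 3: union-find over documents that share at least one bucket."""
    parent = {doc_id: doc_id for doc_id in sigs}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    buckets = defaultdict(list)
    for doc_id, sig in sigs.items():
        for key in bucket_keys(sig):
            buckets[key].append(doc_id)
    for members in buckets.values():
        for other in members[1:]:
            a, b = find(members[0]), find(other)
            if a != b:
                parent[max(a, b)] = min(a, b)  # smallest id becomes the root
    return {doc_id: find(doc_id) for doc_id in sigs}


def minhash_dedup(docs):
    """Stage 4: keep one document per cluster (the one with the smallest id)."""
    sigs = {doc_id: signature(text) for doc_id, text in docs.items()}
    roots = cluster(sigs)
    return [doc_id for doc_id in docs if roots[doc_id] == doc_id]
```

More bands with fewer rows each make the match more permissive, while fewer bands with more rows make it stricter, which is the main trade-off when tuning fuzzy dedup.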

Quality Filters

| Filter | Heuristic |
| --- | --- |
| GopherQualityFilter | Length, symbol ratios, bullet/ellipsis limits |
| GopherRepetitionFilter | Excessive repetition |
| C4QualityFilter | C4-style rules (terminal punctuation, etc.) |
| FineWebQualityFilter | FineWeb recipe heuristics |
| LanguageFilter | fastText language ID threshold |
| URLFilter | Block/allow by URL/domain |
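
To make the Gopher-style heuristics concrete, here is a minimal sketch of such a check in plain Python. It is not the code behind `GopherQualityFilter`: `gopher_style_check` is a hypothetical function, and the thresholds follow the rules described in the Gopher paper as commonly summarized, so they may differ from DataTrove's defaults.

```python
STOP_WORDS = {"the", "be", "to", "of", "and", "that", "have", "with"}


def gopher_style_check(text, min_words=50, max_words=100_000):
    """Return (keep, reason); reason names the first rule that failed, or "ok"."""
    words = text.split()
    n = len(words)
    if n == 0 or n < min_words or n > max_words:
        return False, "word_count"

    mean_len = sum(len(w) for w in words) / n
    if not 3 <= mean_len <= 10:
        return False, "mean_word_length"

    ellipses = text.count("...") + text.count("…")
    if text.count("#") / n > 0.1 or ellipses / n > 0.1:
        return False, "symbol_ratio"

    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if lines:
        bullets = sum(ln.startswith(("-", "*", "•")) for ln in lines)
        if bullets / len(lines) > 0.9:
            return False, "bullet_lines"
        trailing = sum(ln.endswith(("...", "…")) for ln in lines)
        if trailing / len(lines) > 0.3:
            return False, "ellipsis_lines"

    alphabetic = sum(any(c.isalpha() for c in w) for w in words)
    if alphabetic / n < 0.8:
        return False, "alpha_words"

    if len(STOP_WORDS.intersection(w.lower() for w in words)) < 2:
        return False, "stop_words"

    return True, "ok"
```

Returning the failing rule alongside the decision makes it easy to count how many documents each heuristic removes before committing to a threshold.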

Common Workflows

# Reproduce a FineWeb-style web pipeline (conceptual order)
# WarcReader → URLFilter → Trafilatura → LanguageFilter →
# GopherQuality/Repetition → C4/FineWeb filters → MinhashDedup → JsonlWriter
# Scale the identical pipeline from laptop to cluster by swapping the executor:
#   LocalPipelineExecutor  →  SlurmPipelineExecutor  →  RayPipelineExecutor

DataTrove vs NeMo Curator

| Aspect | DataTrove | NeMo Curator |
| --- | --- | --- |
| Compute | CPU-first | GPU-native (RAPIDS) |
| Best for | Reproducing FineWeb-style datasets | Dedup-heavy 10T+ token runs |
| Backends | Local / Slurm / Ray | Ray |
| Origin | Hugging Face | NVIDIA |

Resources