NeMo Curator - GPU Data Curation for LLMs Cheatsheet
NVIDIA NeMo Curator is an open-source, GPU-accelerated data curation toolkit for preparing large-scale, high-quality datasets to pretrain or fine-tune LLMs. It builds repeatable pipelines that download and extract data, clean and normalize text, identify language, filter by quality, classify domain and toxicity, apply privacy filters, and deduplicate (exact, fuzzy, and semantic). Pipelines scale from a laptop to thousands of GPUs via RAPIDS/cuDF and Ray. As of the 26.x releases it uses a Ray-based pipeline architecture across text, image, video, and audio.
Curation quality drives model quality. The expensive, high-impact step is usually deduplication; NeMo Curator’s GPU dedup is its headline advantage.
Installation
| Method | Command |
|---|---|
| pip (CPU modules) | pip install nemo-curator |
| pip with CUDA extras | pip install "nemo-curator[cuda12x]" |
| Container | use NVIDIA’s NeMo/Curator container image |
| Requirements | NVIDIA GPU(s) + CUDA for accelerated modules; Ray for distribution |
Pipeline Concepts
| Concept | Meaning |
|---|---|
| DocumentDataset | The dataset abstraction (backed by Dask/cuDF) |
| Module | A curation step (filter, dedup, classifier, …) |
| Pipeline | Ordered modules forming a reproducible flow |
| Backend | CPU (pandas) or GPU (cuDF/RAPIDS) execution |
| Ray runtime | Distributes work across cores/GPUs/nodes |
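
To make the Module and Pipeline rows concrete, here is a minimal pure-Python sketch of the idea: a pipeline is an ordered list of steps, and each step takes a batch of documents and returns a batch. This is not NeMo Curator's API; `run_pipeline`, `strip_whitespace`, and `drop_empty` are hypothetical names chosen for illustration.

```python
from typing import Callable, Iterable

Doc = dict  # in this sketch a document is a plain dict with a "text" field
Module = Callable[[list], list]  # a module maps a list of docs to a list of docs


def run_pipeline(docs: list, modules: Iterable[Module]) -> list:
    """Apply each module in order; the output of one is the input of the next."""
    for module in modules:
        docs = module(docs)
    return docs


def strip_whitespace(docs: list) -> list:
    """Hypothetical cleaning module: trim leading/trailing whitespace."""
    return [{**d, "text": d["text"].strip()} for d in docs]


def drop_empty(docs: list) -> list:
    """Hypothetical filter module: drop documents with no text left."""
    return [d for d in docs if d["text"]]


docs = [{"id": 1, "text": "  hello world  "}, {"id": 2, "text": "   "}]
result = run_pipeline(docs, [strip_whitespace, drop_empty])
print(result)
```

Because the module order is explicit data, the same list can be rerun on new input, which is what makes the flow reproducible.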
Core Modules
| Stage | Module(s) |
|---|---|
| Download/extract | Common Crawl, arXiv, Wikipedia downloaders; text extraction |
| Language ID | fastText-based language identification |
| Cleaning | Unicode fixing, boilerplate/URL removal, reformatting |
| Quality filtering | Heuristic filters + classifier-based quality scoring |
| Classification | Domain and toxicity classifiers |
| PII / privacy | Detect and redact personal data |
| Deduplication | Exact (hash) and fuzzy (MinHash/LSH) dedup on GPU |
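
As an illustration of what the Cleaning stage does conceptually, the following standard-library sketch normalizes Unicode, strips URLs, removes invisible control characters, and collapses whitespace. It is an assumption-laden stand-in, not the toolkit's cleaning modules; `clean_text` and both regular expressions are hypothetical.

```python
import re
import unicodedata

URL_RE = re.compile(r"https?://\S+")  # deliberately simple URL pattern
SPACE_RE = re.compile(r"[ \t]+")      # runs of spaces or tabs


def clean_text(text: str) -> str:
    """Normalize Unicode, drop URLs and control characters, tidy whitespace."""
    # Compose characters (e.g. "e" + combining accent -> one code point).
    text = unicodedata.normalize("NFC", text)
    text = URL_RE.sub("", text)
    # Remove control/format characters (categories C*), keeping newlines and tabs.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or not unicodedata.category(ch).startswith("C")
    )
    # Collapse horizontal whitespace per line and drop lines left empty.
    lines = [SPACE_RE.sub(" ", line).strip() for line in text.split("\n")]
    return "\n".join(line for line in lines if line)


sample = "Cafe\u0301  menu: see https://example.com/x now\n\n\u200bend"
print(clean_text(sample))
```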
Quality Filtering (sketch)
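# Sketch against the older Dask-based DocumentDataset API; import paths may differ in Ray-based releases.
from nemo_curator import ScoreFilter
from nemo_curator.datasets import DocumentDataset
from nemo_curator.filters import WordCountFilter
# add_filename=True records each row's source file, which write_to_filename=True relies on when writing.
dataset = DocumentDataset.read_json("input/*.jsonl", backend="cudf", add_filename=True)
# Score each document by word count and keep those within the min/max bounds.
filter_step = ScoreFilter(
    WordCountFilter(min_words=50, max_words=100000),
    text_field="text",
)
clean = filter_step(dataset)
clean.to_json("filtered/", write_to_filename=True)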
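
The score-then-filter pattern in the sketch above can be reproduced in plain Python to show what happens to each document. The helper names below are hypothetical, whitespace tokenization is a simplification, and inclusive bounds are an assumption rather than a statement about the library's behavior.

```python
def word_count(text: str) -> int:
    """Score: number of whitespace-separated tokens."""
    return len(text.split())


def keep_document(text: str, min_words: int = 50, max_words: int = 100_000) -> bool:
    """Filter: keep documents whose score lies within the (assumed inclusive) bounds."""
    return min_words <= word_count(text) <= max_words


docs = [
    {"id": 1, "text": "too short"},
    {"id": 2, "text": "word " * 60},
]
kept = [d for d in docs if keep_document(d["text"])]
print([d["id"] for d in kept])
```

Separating the score from the keep/drop decision lets the score be stored as a column and the threshold tuned later without rescoring.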
Deduplication (sketch)
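# Sketch against the older Dask-based API; fuzzy dedup expects a cuDF-backed dataset with unique ids.
from nemo_curator import FuzzyDuplicates, FuzzyDuplicatesConfig
from nemo_curator.datasets import DocumentDataset
dataset = DocumentDataset.read_json("input/*.jsonl", backend="cudf")
config = FuzzyDuplicatesConfig(
    cache_dir="./cache",  # intermediate MinHash/LSH results are written here
    id_field="id",
    text_field="text",
    num_buckets=20,  # number of LSH bands
    hashes_per_bucket=13,  # MinHash values per band
)
fuzzy = FuzzyDuplicates(config=config)
duplicates = fuzzy(dataset)  # GPU-accelerated near-dup detection; rows of (id, group)
# Keep the first document of each group; dropping every listed id would delete all copies.
to_remove = duplicates.df.map_partitions(lambda part: part[part.group.duplicated(keep="first")])
deduped = dataset.df[~dataset.df["id"].isin(to_remove["id"].compute())]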
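
The two LSH parameters in the sketch above control which similarity levels are likely to be flagged. Under the standard banding analysis, two documents with Jaccard similarity s share at least one band with probability 1 - (1 - s^r)^b, where b is the number of bands and r the hashes per band. The short calculation below assumes `num_buckets` plays the role of b and `hashes_per_bucket` the role of r; the function names are hypothetical.

```python
def candidate_probability(similarity: float, num_buckets: int, hashes_per_bucket: int) -> float:
    """Probability that two documents collide in at least one LSH band."""
    return 1.0 - (1.0 - similarity ** hashes_per_bucket) ** num_buckets


def approximate_threshold(num_buckets: int, hashes_per_bucket: int) -> float:
    """Common rule of thumb for where the S-curve rises: (1/b)^(1/r)."""
    return (1.0 / num_buckets) ** (1.0 / hashes_per_bucket)


for s in (0.5, 0.7, 0.8, 0.9):
    p = candidate_probability(s, num_buckets=20, hashes_per_bucket=13)
    print(f"Jaccard {s:.1f} -> P(candidate) = {p:.3f}")

print(f"approximate threshold: {approximate_threshold(20, 13):.3f}")
```

With 20 bands of 13 hashes the rule-of-thumb threshold lands close to 0.8: pairs well below that are rarely compared, and pairs well above it are almost always caught. More bands lower the threshold; more hashes per band raise it.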
| Dedup type | Method |
|---|---|
| Exact | Document hashing |
| Fuzzy | MinHash + LSH (GPU via RAPIDS) |
| Semantic | Embedding-based near-duplicate removal |
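
The first two rows of the table can be illustrated with a small CPU-only sketch: exact dedup groups documents by a hash of their text, and fuzzy dedup groups documents whose MinHash signatures agree on at least one LSH band. This is a teaching sketch under stated assumptions (character 5-gram shingles, salted BLAKE2b as the hash family, union-find to merge buckets), not NeMo Curator's GPU implementation. All function names are hypothetical, and semantic dedup is not covered.

```python
import hashlib
from collections import defaultdict


def exact_groups(docs: dict) -> list:
    """Exact dedup: group ids whose text is byte-for-byte identical."""
    by_hash = defaultdict(set)
    for doc_id, text in docs.items():
        by_hash[hashlib.sha256(text.encode("utf-8")).hexdigest()].add(doc_id)
    return [ids for ids in by_hash.values() if len(ids) > 1]


def shingles(text: str, n: int = 5) -> set:
    """Character n-grams of the lowercased, whitespace-normalized text."""
    norm = " ".join(text.lower().split())
    return {norm[i:i + n] for i in range(max(1, len(norm) - n + 1))}


def seeded_hash(seed: int, token: str) -> int:
    """One member of the hash family, selected by a salt derived from the seed."""
    digest = hashlib.blake2b(
        token.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(text: str, num_hashes: int) -> list:
    """For each hash function, keep the minimum hash over all shingles."""
    grams = shingles(text)
    return [min(seeded_hash(seed, g) for g in grams) for seed in range(num_hashes)]


def fuzzy_groups(docs: dict, num_buckets: int = 8, hashes_per_bucket: int = 4) -> list:
    """Fuzzy dedup: documents sharing any full LSH band land in the same group."""
    buckets = defaultdict(set)
    for doc_id, text in docs.items():
        sig = minhash_signature(text, num_buckets * hashes_per_bucket)
        for b in range(num_buckets):
            band = tuple(sig[b * hashes_per_bucket:(b + 1) * hashes_per_bucket])
            buckets[(b, band)].add(doc_id)

    parent = {doc_id: doc_id for doc_id in docs}  # union-find over document ids

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for members in buckets.values():
        ordered = sorted(members)
        for other in ordered[1:]:
            parent[find(other)] = find(ordered[0])

    groups = defaultdict(set)
    for doc_id in docs:
        groups[find(doc_id)].add(doc_id)
    return [g for g in groups.values() if len(g) > 1]


def ids_to_remove(groups: list) -> set:
    """Keep one representative per group (the smallest id) and drop the rest."""
    return {doc_id for group in groups for doc_id in sorted(group)[1:]}


docs = {
    "a": "The quick brown fox jumps over the lazy dog",
    "b": "The quick brown fox jumps over the lazy dog",
    "c": "the quick  brown fox jumps over the LAZY dog",
    "d": "Completely unrelated sentence about GPU data curation",
}
print(exact_groups(docs))
print(fuzzy_groups(docs))
print(ids_to_remove(fuzzy_groups(docs)))
```

In this example document "c" differs from "a" only in case and spacing, so it normalizes to the same shingle set and is guaranteed to match. Genuine near-duplicates match only with some probability that depends on the band settings.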
Classifiers & Filtering
| Classifier | Flags |
|---|---|
| Domain | Topic/domain labels for mixing control |
| Quality | High/low quality scoring |
| Toxicity | Unsafe content for removal |
| Language | Keep/drop by language |
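
Once classifiers have annotated each document, the keep/drop decision and the domain mix reduce to simple predicates over the resulting fields. The sketch below assumes hypothetical field names (`language`, `domain`, `quality`, `toxicity`) and arbitrary thresholds; real classifier outputs and sensible cutoffs will differ.

```python
from collections import Counter

# Hypothetical per-document classifier outputs; field names are illustrative only.
records = [
    {"id": "a", "language": "en", "domain": "science", "quality": 0.91, "toxicity": 0.02},
    {"id": "b", "language": "en", "domain": "sports", "quality": 0.35, "toxicity": 0.01},
    {"id": "c", "language": "de", "domain": "science", "quality": 0.88, "toxicity": 0.03},
    {"id": "d", "language": "en", "domain": "news", "quality": 0.77, "toxicity": 0.64},
    {"id": "e", "language": "en", "domain": "science", "quality": 0.80, "toxicity": 0.05},
]


def keep_record(rec: dict, languages=frozenset({"en"}),
                min_quality: float = 0.5, max_toxicity: float = 0.2) -> bool:
    """Keep documents in an allowed language that are good enough and safe enough."""
    return (
        rec["language"] in languages
        and rec["quality"] >= min_quality
        and rec["toxicity"] <= max_toxicity
    )


kept = [r for r in records if keep_record(r)]
domain_mix = Counter(r["domain"] for r in kept)  # input for mixing control
print([r["id"] for r in kept], dict(domain_mix))
```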
Scaling
| Mechanism | Note |
|---|---|
| GPU (cuDF/RAPIDS) | Accelerates filtering and dedup |
| Ray runtime | Distributes across GPUs and nodes |
| Dask | Out-of-core processing for huge corpora |
| Checkpointing | Resume long curation runs |
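
One simple way to think about the Checkpointing row is stage-level caching: each stage writes its output to disk, and a rerun skips any stage whose output already exists. The sketch below is a generic illustration and not the toolkit's own resume mechanism; `run_stage` and the file layout are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def run_stage(name: str, func, docs: list, checkpoint_dir: Path) -> list:
    """Run func(docs) unless a finished checkpoint for this stage already exists."""
    out = checkpoint_dir / f"{name}.jsonl"
    if out.exists():
        # Resume: load the saved output instead of recomputing.
        with out.open(encoding="utf-8") as fh:
            return [json.loads(line) for line in fh]
    result = func(docs)
    # Write to a temporary file first so a crash never leaves a partial checkpoint.
    tmp = out.with_suffix(".tmp")
    with tmp.open("w", encoding="utf-8") as fh:
        for doc in result:
            fh.write(json.dumps(doc) + "\n")
    tmp.replace(out)
    return result


calls = []


def lowercase(docs: list) -> list:
    """Hypothetical stage; records each invocation so reruns are visible."""
    calls.append("lowercase")
    return [{**d, "text": d["text"].lower()} for d in docs]


docs = [{"id": 1, "text": "ABC"}]
with tempfile.TemporaryDirectory() as tmp_dir:
    ckpt = Path(tmp_dir)
    first = run_stage("lowercase", lowercase, docs, ckpt)
    second = run_stage("lowercase", lowercase, docs, ckpt)  # served from checkpoint

print(first == second, calls)
```

The write-then-rename step matters for long runs: a checkpoint file either exists in full or not at all, so an interrupted stage is simply recomputed.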
Common Workflows
# Reproducible pretraining curation: clean → quality filter → dedup
from nemo_curator.datasets import DocumentDataset
ds = DocumentDataset.read_json("raw/*.jsonl", backend="cudf")
# 1) language ID + cleaning 2) quality ScoreFilter 3) FuzzyDuplicates
# write the curated, deduplicated corpus for training
NeMo Curator vs DataTrove
| Aspect | NeMo Curator | DataTrove |
|---|---|---|
| Acceleration | GPU-native (RAPIDS) | CPU-first |
| Best for | Dedup-heavy, large new curation runs (10T+ tokens) | Reproducing FineWeb-style pipelines |
| Distribution | Ray | Local/Slurm/Ray executors |
| Modalities | Text, image, video, audio | Text-focused |
Resources