NeMo Curator - GPU Data Curation for LLMs Cheatsheet

NVIDIA NeMo Curator is an open-source, GPU-accelerated data curation toolkit for preparing large-scale, high-quality datasets to pretrain or fine-tune LLMs. It builds repeatable pipelines that download and extract data, clean and normalize text, identify language, quality-filter, classify domain/toxicity, apply privacy filters, and deduplicate (exact and fuzzy) — scaling from a laptop to thousands of GPUs via RAPIDS/cuDF and Ray. As of the 26.x releases it uses a Ray-based pipeline architecture across text, image, video, and audio.

Curation quality drives model quality. The expensive, high-impact step is usually deduplication; NeMo Curator’s GPU dedup is its headline advantage.

Installation

Method | Command
pip (CPU modules) | pip install nemo-curator
pip with CUDA extras | pip install "nemo-curator[cuda12x]"
Container | use NVIDIA’s NeMo/Curator container image
Requirements | NVIDIA GPU(s) + CUDA for accelerated modules; Ray for distribution

Pipeline Concepts

Concept | Meaning
DocumentDataset | The dataset abstraction (backed by Dask/cuDF)
Module | A curation step (filter, dedup, classifier, …)
Pipeline | Ordered modules forming a reproducible flow
Backend | CPU (pandas) or GPU (cuDF/RAPIDS) execution
Ray runtime | Distributes work across cores/GPUs/nodes
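
The "ordered modules" idea can be illustrated without any GPU or NeMo Curator install. The following is a plain-Python sketch, not NeMo Curator's implementation: `ToyPipeline`, `strip_whitespace`, and `drop_empty` are hypothetical names invented for this example, and documents are modeled as simple dicts.

```python
from dataclasses import dataclass, field
from typing import Callable

Doc = dict  # one record, e.g. {"id": 1, "text": "..."}
Module = Callable[[list], list]  # hypothetical contract: docs in, docs out


@dataclass
class ToyPipeline:
    """Holds an ordered list of modules and applies them in sequence."""
    modules: list = field(default_factory=list)

    def add(self, module: Module) -> "ToyPipeline":
        self.modules.append(module)
        return self  # returning self allows chained .add() calls

    def run(self, docs: list) -> list:
        for module in self.modules:
            docs = module(docs)  # each module sees the previous module's output
        return docs


def strip_whitespace(docs: list) -> list:
    """Cleaning step: trim leading and trailing whitespace from each text."""
    return [{**d, "text": d["text"].strip()} for d in docs]


def drop_empty(docs: list) -> list:
    """Filtering step: discard documents whose text is empty."""
    return [d for d in docs if d["text"]]


pipeline = ToyPipeline().add(strip_whitespace).add(drop_empty)
result = pipeline.run([{"id": 1, "text": "  hello  "}, {"id": 2, "text": "   "}])
print(result)
```

Because the module order is fixed in one place, rerunning the same pipeline on the same input yields the same output, which is what makes the flow reproducible.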

Core Modules

Stage | Module(s)
Download/extract | Common Crawl, arXiv, Wikipedia downloaders; text extraction
Language ID | fastText-based language identification
Cleaning | Unicode fixing, boilerplate/URL removal, reformatting
Quality filtering | Heuristic filters + classifier-based quality scoring
Classification | Domain and toxicity classifiers
PII / privacy | Detect and redact personal data
Deduplication | Exact (hash) and fuzzy (MinHash/LSH) dedup on GPU
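
To make the cleaning and PII rows concrete, here is a small standard-library sketch of the kind of per-document transform those stages perform. It is an illustration only: the regular expressions, the `<EMAIL>` placeholder, and the `clean_text` helper are assumptions made for this example and are far simpler than production cleaners or PII detectors.

```python
import re
import unicodedata

URL_RE = re.compile(r"https?://\S+")                      # naive URL matcher
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")    # naive email matcher
WS_RE = re.compile(r"\s+")


def clean_text(text: str) -> str:
    """Normalize Unicode, drop URLs, redact emails, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)   # e.g. fullwidth letters, no-break spaces
    text = URL_RE.sub("", text)                  # URL removal
    text = EMAIL_RE.sub("<EMAIL>", text)         # toy PII redaction
    return WS_RE.sub(" ", text).strip()


sample = (
    "\uff28ello\u00a0world! See https://example.com/page "
    "or mail jane.doe@example.org  now."
)
cleaned = clean_text(sample)
print(cleaned)
```

Redacting with a fixed placeholder rather than deleting the match keeps sentence structure intact, which tends to matter for downstream language-model training.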

Quality Filtering (sketch)

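# Sketch using the DocumentDataset-style API; import paths may differ in the Ray-based releases.
from nemo_curator import ScoreFilter
from nemo_curator.filters import WordCountFilter
from nemo_curator.datasets import DocumentDataset

# add_filename=True records each document's source file, which write_to_filename needs below
dataset = DocumentDataset.read_json("input/*.jsonl", backend="cudf", add_filename=True)

# Keep documents whose word count falls between 50 and 100000
filter_step = ScoreFilter(
    WordCountFilter(min_words=50, max_words=100000),
    text_field="text",
)
clean = filter_step(dataset)
# Write one output file per input file under filtered/
clean.to_json("filtered/", write_to_filename=True)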
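
Conceptually, a score-then-filter step computes a number per document and keeps the document only if that number passes a test. The plain-Python sketch below mirrors that split; `ToyWordCountFilter` and `score_filter` are hypothetical stand-ins written for this example, not the library classes, and the whitespace-based word count is an assumption.

```python
class ToyWordCountFilter:
    """Scores a document by its whitespace-separated word count."""

    def __init__(self, min_words: int = 50, max_words: int = 100_000):
        self.min_words = min_words
        self.max_words = max_words

    def score_document(self, text: str) -> int:
        return len(text.split())

    def keep_document(self, score: int) -> bool:
        return self.min_words <= score <= self.max_words


def score_filter(docs, doc_filter, text_field="text", score_field=None):
    """Score every document, keep those that pass, optionally store the score."""
    kept = []
    for doc in docs:
        score = doc_filter.score_document(doc[text_field])
        if doc_filter.keep_document(score):
            kept.append({**doc, score_field: score} if score_field else doc)
    return kept


docs = [
    {"id": "a", "text": "too short"},
    {"id": "b", "text": "this one has enough words"},
]
kept = score_filter(
    docs, ToyWordCountFilter(min_words=3, max_words=10), score_field="word_count"
)
print(kept)
```

Storing the score alongside the text makes it possible to re-threshold later without recomputing it.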

Deduplication (sketch)

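from nemo_curator import FuzzyDuplicates, FuzzyDuplicatesConfig

config = FuzzyDuplicatesConfig(
    cache_dir="./cache",    # intermediate MinHash/LSH results are written here
    num_buckets=20,         # number of LSH bands
    hashes_per_bucket=13,   # MinHash values per band (20 * 13 = 260 hashes in total)
)
fuzzy = FuzzyDuplicates(config=config)
duplicates = fuzzy(dataset)        # GPU-accelerated near-dup detection

# Assumes `duplicates` lists every member of each duplicate group with `id` and `group` columns.
# Keep the first document of each group and drop the rest, rather than dropping whole groups.
to_remove = duplicates.df.map_partitions(lambda part: part[part["group"].duplicated(keep="first")])
deduped = dataset.df[~dataset.df["id"].isin(to_remove["id"].compute())]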
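
The two LSH parameters in the config trade recall against false positives. Under the standard banding analysis, and assuming `num_buckets` is the number of bands and `hashes_per_bucket` the number of MinHash values per band, two documents with Jaccard similarity s become a candidate pair with probability 1 - (1 - s^r)^b. The helper names below are hypothetical; this is a worked calculation, not library code.

```python
def candidate_probability(similarity: float,
                          num_buckets: int = 20,
                          hashes_per_bucket: int = 13) -> float:
    """Probability that two documents share at least one LSH bucket."""
    same_band = similarity ** hashes_per_bucket      # all hashes in one band agree
    return 1.0 - (1.0 - same_band) ** num_buckets    # at least one band agrees


def approximate_threshold(num_buckets: int = 20,
                          hashes_per_bucket: int = 13) -> float:
    """Similarity near which the candidate probability rises most steeply."""
    return (1.0 / num_buckets) ** (1.0 / hashes_per_bucket)


for s in (0.5, 0.7, 0.8, 0.9):
    print(f"Jaccard {s:.1f} -> candidate probability {candidate_probability(s):.3f}")
print(f"approximate threshold: {approximate_threshold():.2f}")
```

With 20 buckets of 13 hashes the threshold lands at about 0.79, so pairs well below that similarity are rarely compared while pairs near 0.9 are almost always caught.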

Dedup type | Method
Exact | Document hashing
Fuzzy | MinHash + LSH (GPU via RAPIDS)
Semantic | Embedding-based near-duplicate removal
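
The exact and fuzzy rows can be sketched on the CPU with the standard library. This is a toy illustration of the two techniques, not how the GPU modules are implemented: the 5-character shingles, the salted BLAKE2b hashes standing in for a MinHash family, and the small 8 x 2 banding are all assumptions chosen to keep the example short.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def exact_dedup(docs):
    """Exact dedup: keep the first document seen for each text hash."""
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.md5(doc["text"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


def shingles(text, n=5):
    """Set of overlapping character n-grams."""
    return {text[i:i + n] for i in range(max(1, len(text) - n + 1))}


def minhash_signature(text, num_hashes):
    """One minimum per salted hash function over the document's shingles."""
    grams = [s.encode("utf-8") for s in shingles(text)]
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(hashlib.blake2b(g, digest_size=8, salt=salt).digest(), "little")
            for g in grams
        ))
    return signature


def lsh_candidate_pairs(docs, num_buckets=8, hashes_per_bucket=2):
    """Pair up documents whose signatures match in at least one band."""
    buckets = defaultdict(list)
    for doc in docs:
        sig = minhash_signature(doc["text"], num_buckets * hashes_per_bucket)
        for band in range(num_buckets):
            start = band * hashes_per_bucket
            buckets[(band, tuple(sig[start:start + hashes_per_bucket]))].append(doc["id"])
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


base = "the quick brown fox jumps over the lazy dog near the quiet river bank today"
docs = [
    {"id": "a", "text": base},
    {"id": "b", "text": base + "!"},                                # near-duplicate of a
    {"id": "c", "text": "GPU CLUSTERS PROCESS TOKENS IN PARALLEL"},  # unrelated
    {"id": "d", "text": base},                                      # exact copy of a
]
unique = exact_dedup(docs)
pairs = lsh_candidate_pairs(unique)
print([d["id"] for d in unique], pairs)
```

Running exact dedup first is cheap and shrinks the input to the more expensive fuzzy stage; candidate pairs from LSH would normally be verified or grouped before anything is removed.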

Classifiers & Filtering

Classifier | Flags
Domain | Topic/domain labels for mixing control
Quality | High/low quality scoring
Toxicity | Unsafe content for removal
Language | Keep/drop by language
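
Once classifiers have annotated each document, filtering reduces to rules over those annotations. The sketch below assumes hypothetical field names (`lang`, `domain`, `quality`, `toxicity`), made-up scores, and arbitrary thresholds; real classifier outputs and sensible cutoffs depend on the models used.

```python
from collections import Counter

# Hypothetical classifier outputs attached to each document
docs = [
    {"id": "a", "lang": "en", "domain": "science", "quality": 0.91, "toxicity": 0.02},
    {"id": "b", "lang": "en", "domain": "news", "quality": 0.35, "toxicity": 0.01},
    {"id": "c", "lang": "de", "domain": "news", "quality": 0.88, "toxicity": 0.03},
    {"id": "d", "lang": "en", "domain": "forums", "quality": 0.80, "toxicity": 0.75},
    {"id": "e", "lang": "en", "domain": "news", "quality": 0.77, "toxicity": 0.05},
]


def keep(doc, langs=frozenset({"en"}), min_quality=0.5, max_toxicity=0.5):
    """Keep a document only if it passes the language, quality, and toxicity rules."""
    return (
        doc["lang"] in langs
        and doc["quality"] >= min_quality
        and doc["toxicity"] <= max_toxicity
    )


kept = [d for d in docs if keep(d)]
domain_mix = Counter(d["domain"] for d in kept)  # input for mixing decisions
print([d["id"] for d in kept], dict(domain_mix))
```

The per-domain counts are what a data-mixing step would inspect before up- or down-sampling a domain.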

Scaling

Mechanism | Note
GPU (cuDF/RAPIDS) | Accelerates filtering and dedup
Ray runtime | Distributes across GPUs and nodes
Dask | Out-of-core processing for huge corpora
Checkpointing | Resume long curation runs
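
Resuming a long run usually comes down to recording which shards have finished and skipping them on restart. The following standard-library sketch shows one common pattern (a `.done` marker file written after each shard's output); it is an assumption-laden illustration with hypothetical names, not a description of how NeMo Curator checkpoints internally.

```python
import json
import tempfile
from pathlib import Path


def curate_shard(records):
    """Stand-in for real curation work: keep texts with at least three words."""
    return [r for r in records if len(r["text"].split()) >= 3]


def run_with_checkpoints(shards, out_dir: Path):
    """Process each shard once; shards with a .done marker are skipped."""
    out_dir.mkdir(parents=True, exist_ok=True)
    processed = []
    for name, records in shards.items():
        done_marker = out_dir / f"{name}.done"
        if done_marker.exists():
            continue  # finished in an earlier run
        with (out_dir / f"{name}.jsonl").open("w", encoding="utf-8") as fh:
            for rec in curate_shard(records):
                fh.write(json.dumps(rec) + "\n")
        done_marker.touch()  # written last, so a crash mid-shard triggers a redo
        processed.append(name)
    return processed


shards = {
    "shard-0": [{"text": "short"}, {"text": "long enough to keep"}],
    "shard-1": [{"text": "another document that passes"}],
}
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "curated"
    first_run = run_with_checkpoints(shards, out)
    second_run = run_with_checkpoints(shards, out)  # nothing left to do
    kept_in_shard0 = (out / "shard-0.jsonl").read_text(encoding="utf-8").splitlines()
print(first_run, second_run)
```

Writing the marker only after the output file is closed means an interrupted shard is simply reprocessed on the next run.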

Common Workflows

# Reproducible pretraining curation: clean → quality filter → dedup
from nemo_curator.datasets import DocumentDataset
ds = DocumentDataset.read_json("raw/*.jsonl", backend="cudf")
# 1) language ID + cleaning  2) quality ScoreFilter  3) FuzzyDuplicates
# write the curated, deduplicated corpus for training

NeMo Curator vs DataTrove

Aspect | NeMo Curator | DataTrove
Acceleration | GPU-native (RAPIDS) | CPU-first
Best for | Dedup-heavy, novel large runs (10T+ tokens) | Reproducing FineWeb-style pipelines
Distribution | Ray | Local/Slurm/Ray executors
Modalities | Text, image, video, audio | Text-focused

Resources