verl - Volcano Engine RL for LLMs Cheatsheet

verl (Volcano Engine Reinforcement Learning) is a high-performance open-source framework for reinforcement-learning post-training of large language models. It is the public implementation of the HybridFlow paper, built around a hybrid-controller programming model that lets complex RL dataflows (PPO, GRPO, and others) be expressed in a few lines by decoupling computation from data dependencies. It integrates with FSDP, Megatron-LM, vLLM, and SGLang, and is one of the most widely used RL frameworks for LLMs.

RL post-training is compute-intensive and sensitive to configuration. Validate your reward and a small run before scaling to multiple nodes.

Installation

Method | Command
pip | pip install verl
From source | git clone https://github.com/volcengine/verl && cd verl && pip install -e .
Docker | Use the official images referenced in the docs (CUDA + vLLM preinstalled)
Requirements | NVIDIA GPU(s), CUDA, PyTorch; vLLM or SGLang for rollout

Core Concepts

Term | Meaning
HybridFlow | The controller model: a single controller orchestrates distributed workers
Rollout | Generation phase that produces samples (served by vLLM/SGLang)
Actor / Critic | The policy being trained and (for PPO) the value model
Reward | Scalar signal; from a reward model or a custom function
Placement | How models are mapped onto GPU sets (colocated or split)
Recipe | A ready-made training configuration for an algorithm

Algorithms Supported

Algorithm | Notes
PPO | Classic actor-critic RLHF
GRPO | Group-relative, no critic model required
ReMax / RLOO | Lightweight baselines
DAPO / Dr.GRPO | GRPO variants
DPO-style | Preference optimization recipes
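
To make the "group-relative, no critic" idea concrete, here is a minimal sketch of how advantages can be derived from the rewards of several samples drawn for the same prompt. It is an illustration, not verl's implementation: the function names are hypothetical, and the use of the population standard deviation and the `eps` value are assumptions (implementations differ on these details). The second function shows a leave-one-out baseline in the style of RLOO.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style sketch: normalize each reward against its own group.

    `rewards` holds the scalar rewards of all samples generated for one
    prompt. No value model is involved; the group mean is the baseline.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; some code uses sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


def leave_one_out_advantages(rewards):
    """RLOO-style sketch: baseline is the mean reward of the *other* samples."""
    n = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]


# Four samples for one prompt, two of which were judged correct.
group = [1.0, 0.0, 0.0, 1.0]
grpo_adv = group_relative_advantages(group)
rloo_adv = leave_one_out_advantages(group)
```

Correct samples receive positive advantages and incorrect ones negative, and a group whose rewards are all equal yields zero advantage for every sample, so it contributes no learning signal.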

Running a Training Job

verl is driven by Hydra-style configuration overrides on the command line.

# GRPO example (conceptual): point at data, model, and rollout backend
python3 -m verl.trainer.main_ppo \
  algorithm.adv_estimator=grpo \
  data.train_files=$DATA/train.parquet \
  data.val_files=$DATA/val.parquet \
  actor_rollout_ref.model.path=Qwen/Qwen2.5-7B-Instruct \
  actor_rollout_ref.rollout.name=vllm \
  trainer.n_gpus_per_node=8 \
  trainer.nnodes=1

Config group | Controls
algorithm.* | Algorithm and advantage estimator (e.g. adv_estimator=grpo)
data.* | Train/val files, batch sizes, max lengths
actor_rollout_ref.model.* | Base model path and dtype
actor_rollout_ref.rollout.* | Rollout backend (vllm / sglang) and sampling
critic.* | Critic model settings (PPO)
trainer.* | GPUs per node, node count, logging, checkpoints
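
Because the overrides are plain dotted `key=value` strings, long launch commands can be generated from a nested dictionary instead of being typed by hand. The sketch below assumes nothing about verl beyond the entry point and keys already shown above; `to_overrides` is a hypothetical helper, not part of verl or Hydra.

```python
import shlex


def to_overrides(config, prefix=""):
    """Flatten a nested dict into dotted key=value override strings."""
    overrides = []
    for key, value in config.items():
        dotted = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse so {"trainer": {"nnodes": 1}} becomes "trainer.nnodes=1".
            overrides.extend(to_overrides(value, dotted))
        else:
            overrides.append(f"{dotted}={value}")
    return overrides


# Keys mirror the GRPO example above; values are illustrative.
config = {
    "algorithm": {"adv_estimator": "grpo"},
    "actor_rollout_ref": {"rollout": {"name": "vllm"}},
    "trainer": {"n_gpus_per_node": 8, "nnodes": 1},
}

command = ["python3", "-m", "verl.trainer.main_ppo", *to_overrides(config)]
print(shlex.join(command))  # shell-safe string for logging or copy-paste
```

Keeping the configuration as data makes it easy to run the small validation job first and then change only `trainer.nnodes` and batch-related keys when scaling up.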

Backends & Scaling

Component | Options
Training engine | FSDP, FSDP2, Megatron-LM
Rollout engine | vLLM, SGLang
Distribution | Ray-based placement across GPUs/nodes
Device mapping | Colocate actor+rollout, or split across GPU sets

Reward & Data

Task | How
Custom reward | Provide a reward function module via config
Reward model | Point reward_model.* at a scoring model
Dataset format | Parquet with prompt (and answer for verifiable rewards)
Verifiable rewards | Built-in support for math/code-style exact-match rewards
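
A custom reward module for a verifiable task might look like the sketch below: it extracts a final answer from the generated text and compares it with the reference by exact match. The function name `compute_score`, its parameters, and the `\boxed{...}` answer convention are assumptions made for illustration; check the verl documentation for the exact signature and the config keys used to register the module.

```python
import re


def extract_final_answer(solution_str):
    r"""Return the text inside the last \boxed{...}, or None if absent."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", solution_str)
    return matches[-1].strip() if matches else None


def compute_score(data_source, solution_str, ground_truth, extra_info=None):
    """Exact-match reward: 1.0 if the extracted answer equals the reference.

    Assumed (hypothetical) signature: `data_source` names the dataset,
    `solution_str` is the model's generated text, and `ground_truth` is the
    reference answer stored alongside the prompt.
    """
    answer = extract_final_answer(solution_str)
    if answer is None:
        return 0.0  # no parseable answer, no reward
    return 1.0 if answer == str(ground_truth).strip() else 0.0
```

Running a function like this over a handful of hand-written completions before launching training is a cheap way to validate the reward, as recommended at the top of this cheatsheet.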

Monitoring

Tool | Integration
Weights & Biases | trainer.logger=['console','wandb']
TensorBoard | Supported logger backend
Checkpoints | trainer.save_freq sets the save interval; trainer.resume_mode controls resuming

verl vs Other RL Frameworks

Aspect | verl | OpenRLHF | ART
Core model | HybridFlow controller | Ray actor split | Client/server
Training engines | FSDP + Megatron | DeepSpeed/FSDP | Unsloth
Rollout | vLLM / SGLang | vLLM | vLLM
Best for | Throughput + flexibility | Production RLHF at scale | Agents in your own code

Resources