verl - Volcano Engine RL per LLM Cheatsheet

verl (Volcano Engine Reinforcement Learning) è un framework open-source ad alte prestazioni per il post-training con apprendimento per rinforzo di grandi modelli di linguaggio. È l”implementazione pubblica dell”articolo HybridFlow, costruito attorno a un modello di programmazione a controller ibrido che consente di esprimere flussi di dati RL complessi — PPO, GRPO e altri — in poche righe disaccoppiando il calcolo dalle dipendenze di dati. Si integra con FSDP, Megatron-LM, vLLM e SGLang, ed è uno dei framework RL più utilizzati per gli LLM.

Il post-training RL è computazionalmente intensivo e sensibile alla configurazione. Convalida una reward e una piccola esecuzione prima di scalare a più nodi.

Installation

Method	Command
pip	`pip install verl`
From source	`git clone https://github.com/volcengine/verl && cd verl && pip install -e .`
Docker	use the official images referenced in the docs (CUDA + vLLM preinstalled)
Requirements	NVIDIA GPU(s), CUDA, PyTorch; vLLM or SGLang for rollout

Core Concepts

Term	Meaning
HybridFlow	The controller model: a single controller orchestrates distributed workers
Rollout	Generation phase that produces samples (served by vLLM/SGLang)
Actor / Critic	The policy being trained and (for PPO) the value model
Reward	Scalar signal; from a reward model or a custom function
Placement	How models are mapped onto GPU sets (colocated or split)
Recipe	A ready-made training configuration for an algorithm

Algorithms Supported

Algorithm	Notes
PPO	Classic actor-critic RLHF
GRPO	Group-relative, no critic model required
ReMax / RLOO	Lightweight baselines
DAPO / Dr.GRPO	GRPO variants
DPO-style	Preference optimization recipes

Running a Training Job

verl è guidato da override di configurazione in stile Hydra sulla riga di comando.

# GRPO example (conceptual): point at data, model, and rollout backend
python3 -m verl.trainer.main_ppo \
  algorithm.adv_estimator=grpo \
  data.train_files=$DATA/train.parquet \
  data.val_files=$DATA/val.parquet \
  actor_rollout_ref.model.path=Qwen/Qwen2.5-7B-Instruct \
  actor_rollout_ref.rollout.name=vllm \
  trainer.n_gpus_per_node=8 \
  trainer.nnodes=1

Config group	Controls
`algorithm.*`	Algorithm and advantage estimator (e.g. `adv_estimator=grpo`)
`data.*`	Train/val files, batch sizes, max lengths
`actor_rollout_ref.model.*`	Base model path and dtype
`actor_rollout_ref.rollout.*`	Rollout backend (`vllm` / `sglang`) and sampling
`critic.*`	Critic model settings (PPO)
`trainer.*`	GPUs per node, node count, logging, checkpoints

Backends & Scaling

Component	Options
Training engine	FSDP, FSDP2, Megatron-LM
Rollout engine	vLLM, SGLang
Distribution	Ray-based placement across GPUs/nodes
Device mapping	Colocate actor+rollout, or split across GPU sets

Reward & Data

Task	How
Custom reward	Provide a reward function module via config
Reward model	Point `reward_model.*` at a scoring model
Dataset format	Parquet with prompt (and answer for verifiable rewards)
Verifiable rewards	Built-in support for math/code-style exact-match rewards

Monitoring

Tool	Integration
Weights & Biases	`trainer.logger=['console','wandb']`
TensorBoard	Supported logger backend
Checkpoints	`trainer.save_freq`, resume from `trainer.resume_mode`

verl vs Other RL Frameworks

Aspect	verl	OpenRLHF	ART
Core model	HybridFlow controller	Ray actor split	Client/server
Training engines	FSDP + Megatron	DeepSpeed/FSDP	Unsloth
Rollout	vLLM / SGLang	vLLM	vLLM
Best for	Throughput + flexibility	Production RLHF at scale	Agents in your own code

verl - Volcano Engine RL per LLM Cheatsheet

verl - Volcano Engine RL per LLM Cheatsheet

Installation

Core Concepts

Algorithms Supported

Running a Training Job

Backends & Scaling

Reward & Data

Monitoring

verl vs Other RL Frameworks

Resources