verl - Volcano Engine RL for LLMs Cheatsheet

verl (Volcano Engine Reinforcement Learning) is an open-source, high-performance framework for reinforcement-learning post-training of large language models. It is the public implementation of the HybridFlow paper. Its hybrid-controller programming model lets you express complex RL dataflows such as PPO and GRPO in a few lines of code while decoupling computation from data dependencies. It integrates with FSDP, Megatron-LM, vLLM, and SGLang, and is one of the most widely used RL frameworks for LLMs.

RL post-training is compute-heavy and sensitive to configuration. Validate the reward function and complete a small single-node run before scaling to multiple nodes.

Installation

Method	Command
pip	`pip install verl`
From source	`git clone https://github.com/volcengine/verl && cd verl && pip install -e .`
Docker	use the official images referenced in the docs (CUDA + vLLM preinstalled)
Requirements	NVIDIA GPU(s), CUDA, PyTorch; vLLM or SGLang for rollout

Core Concepts

Term	Meaning
HybridFlow	The controller model: a single controller orchestrates distributed workers
Rollout	Generation phase that produces samples (served by vLLM/SGLang)
Actor / Critic	The policy being trained and (for PPO) the value model
Reward	Scalar signal; from a reward model or a custom function
Placement	How models are mapped onto GPU sets (colocated or split)
Recipe	A ready-made training configuration for an algorithm

Algorithms Supported

Algorithm	Notes
PPO	Classic actor-critic RLHF
GRPO	Group-relative, no critic model required
ReMax / RLOO	Lightweight baselines
DAPO / Dr.GRPO	GRPO variants
DPO-style	Preference optimization recipes
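
To make "group-relative, no critic" concrete, here is a minimal sketch of the idea behind GRPO's advantage estimate: each response's reward is normalized against the other responses sampled for the same prompt. The function name `group_relative_advantages`, the `eps` value, and the use of the population standard deviation are assumptions for illustration; verl's own implementation may differ in these details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of sampled responses.

    Hypothetical helper for illustration, not verl's API.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (an assumption)
    # eps keeps the division finite when every response gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four responses sampled for the same prompt, scored 0/1 by a verifier.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses above the group mean receive a positive advantage and those below receive a negative one, which is why no separate value model is needed as a baseline.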

Running a Training Job

verl is driven by Hydra-style config overrides on the command line.

# GRPO example (conceptual): point at data, model, and rollout backend
python3 -m verl.trainer.main_ppo \
  algorithm.adv_estimator=grpo \
  data.train_files=$DATA/train.parquet \
  data.val_files=$DATA/val.parquet \
  actor_rollout_ref.model.path=Qwen/Qwen2.5-7B-Instruct \
  actor_rollout_ref.rollout.name=vllm \
  trainer.n_gpus_per_node=8 \
  trainer.nnodes=1

Config group	Controls
`algorithm.*`	Algorithm and advantage estimator (e.g. `adv_estimator=grpo`)
`data.*`	Train/val files, batch sizes, max lengths
`actor_rollout_ref.model.*`	Base model path and dtype
`actor_rollout_ref.rollout.*`	Rollout backend (`vllm` / `sglang`) and sampling
`critic.*`	Critic model settings (PPO)
`trainer.*`	GPUs per node, node count, logging, checkpoints
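
When the override list grows long, it can help to keep the settings in a nested dict and flatten them into `key=value` arguments. The sketch below does this with a hypothetical helper, `to_overrides`, which is not part of verl; the config keys are the ones from the example command above.

```python
import shlex


def to_overrides(config, prefix=""):
    """Flatten a nested dict into Hydra-style key=value override strings.

    Hypothetical helper for illustration, not part of verl.
    """
    overrides = []
    for key, value in config.items():
        dotted = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse so nested groups become dotted paths
            overrides.extend(to_overrides(value, dotted))
        else:
            overrides.append(f"{dotted}={value}")
    return overrides


config = {
    "algorithm": {"adv_estimator": "grpo"},
    "actor_rollout_ref": {"rollout": {"name": "vllm"}},
    "trainer": {"n_gpus_per_node": 8, "nnodes": 1},
}
command = ["python3", "-m", "verl.trainer.main_ppo", *to_overrides(config)]
print(shlex.join(command))  # shell-ready command line
```

The resulting list can be passed to `subprocess.run(command)` directly, which avoids shell-quoting issues for values that contain spaces or brackets.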

Backends & Scaling

Component	Options
Training engine	FSDP, FSDP2, Megatron-LM
Rollout engine	vLLM, SGLang
Distribution	Ray-based placement across GPUs/nodes
Device mapping	Colocate actor+rollout, or split across GPU sets

Reward & Data

Task	How
Custom reward	Provide a reward function module via config
Reward model	Point `reward_model.*` at a scoring model
Dataset format	Parquet rows with a prompt (plus a reference answer for verifiable rewards)
Verifiable rewards	Built-in support for math/code-style exact-match rewards
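
A custom reward for a verifiable task can be as small as an exact-match check. The sketch below assumes the function is called with `(data_source, solution_str, ground_truth, extra_info)` and that the model ends its response with a `####` marker followed by the final answer; both the signature and the marker convention are assumptions to confirm against the verl docs for your version, as is the exact config key used to register the module.

```python
import re


def compute_score(data_source, solution_str, ground_truth, extra_info=None):
    """Return 1.0 if the final answer in the response matches the reference, else 0.0.

    The argument list is an assumed interface; check it against your verl version.
    """
    # Assumed convention: the final answer follows a "####" marker.
    match = re.search(r"####\s*(-?[\d,]*\.?\d+)", solution_str)
    if match is None:
        return 0.0  # no parsable answer, no reward
    predicted = match.group(1).replace(",", "")  # "1,000" -> "1000"
    return 1.0 if predicted == str(ground_truth).strip() else 0.0


# Quick local check before wiring the function into a training run.
score = compute_score("gsm8k", "6 * 12 = 72\n#### 72", "72")
```

Testing the function on a handful of hand-written responses like this is a cheap way to validate the reward before launching a run.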

Monitoring

Tool	Integration
Weights & Biases	`trainer.logger=['console','wandb']`
TensorBoard	Supported logger backend
Checkpoints	`trainer.save_freq` sets the save interval; `trainer.resume_mode` controls resuming

verl vs Other RL Frameworks

Aspect	verl	OpenRLHF	ART
Core model	HybridFlow controller	Ray actor split	Client/server
Training engines	FSDP + Megatron	DeepSpeed/FSDP	Unsloth
Rollout	vLLM / SGLang	vLLM	vLLM
Best for	Throughput + flexibility	Production RLHF at scale	Agents in your own code
