verl - Volcano Engine RL for LLMs Cheatsheet
verl (Volcano Engine Reinforcement Learning) is an open-source, high-performance framework for reinforcement-learning post-training of large language models. It is the public implementation of the HybridFlow paper, built around a hybrid-controller programming model that lets you express complex RL dataflows — PPO, GRPO, and others — in a few lines while decoupling computation from data dependencies. It integrates with FSDP, Megatron-LM, vLLM, and SGLang, and is one of the most widely used RL frameworks for LLMs.
RL post-training is compute-heavy and sensitive to configuration. Validate a reward and a small run before scaling to multi-node.
Installation
| Method | Command |
|---|
| pip | pip install verl |
| From source | git clone https://github.com/volcengine/verl && cd verl && pip install -e . |
| Docker | use the official images referenced in the docs (CUDA + vLLM preinstalled) |
| Requirements | NVIDIA GPU(s), CUDA, PyTorch; vLLM or SGLang for rollout |
Core Concepts
| Term | Meaning |
|---|
| HybridFlow | The controller model: a single controller orchestrates distributed workers |
| Rollout | Generation phase that produces samples (served by vLLM/SGLang) |
| Actor / Critic | The policy being trained and (for PPO) the value model |
| Reward | Scalar signal; from a reward model or a custom function |
| Placement | How models are mapped onto GPU sets (colocated or split) |
| Recipe | A ready-made training configuration for an algorithm |
Algorithms Supported
| Algorithm | Notes |
|---|
| PPO | Classic actor-critic RLHF |
| GRPO | Group-relative, no critic model required |
| ReMax / RLOO | Lightweight baselines |
| DAPO / Dr.GRPO | GRPO variants |
| DPO-style | Preference optimization recipes |
Running a Training Job
verl is driven by Hydra-style config overrides on the command line.
# GRPO example (conceptual): point at data, model, and rollout backend
python3 -m verl.trainer.main_ppo \
algorithm.adv_estimator=grpo \
data.train_files=$DATA/train.parquet \
data.val_files=$DATA/val.parquet \
actor_rollout_ref.model.path=Qwen/Qwen2.5-7B-Instruct \
actor_rollout_ref.rollout.name=vllm \
trainer.n_gpus_per_node=8 \
trainer.nnodes=1
| Config group | Controls |
|---|
algorithm.* | Algorithm and advantage estimator (e.g. adv_estimator=grpo) |
data.* | Train/val files, batch sizes, max lengths |
actor_rollout_ref.model.* | Base model path and dtype |
actor_rollout_ref.rollout.* | Rollout backend (vllm / sglang) and sampling |
critic.* | Critic model settings (PPO) |
trainer.* | GPUs per node, node count, logging, checkpoints |
Backends & Scaling
| Component | Options |
|---|
| Training engine | FSDP, FSDP2, Megatron-LM |
| Rollout engine | vLLM, SGLang |
| Distribution | Ray-based placement across GPUs/nodes |
| Device mapping | Colocate actor+rollout, or split across GPU sets |
Reward & Data
| Task | How |
|---|
| Custom reward | Provide a reward function module via config |
| Reward model | Point reward_model.* at a scoring model |
| Dataset format | Parquet with prompt (and answer for verifiable rewards) |
| Verifiable rewards | Built-in support for math/code-style exact-match rewards |
Monitoring
| Tool | Integration |
|---|
| Weights & Biases | trainer.logger=['console','wandb'] |
| TensorBoard | Supported logger backend |
| Checkpoints | trainer.save_freq, resume from trainer.resume_mode |
verl vs Other RL Frameworks
| Aspect | verl | OpenRLHF | ART |
|---|
| Core model | HybridFlow controller | Ray actor split | Client/server |
| Training engines | FSDP + Megatron | DeepSpeed/FSDP | Unsloth |
| Rollout | vLLM / SGLang | vLLM | vLLM |
| Best for | Throughput + flexibility | Production RLHF at scale | Agents in your own code |
Resources