verl - Volcano Engine RL for LLMs Cheatsheet

verl (Volcano Engine Reinforcement Learning) is an open-source, high-performance framework for reinforcement-learning post-training of large language models. It is the open-source implementation of the HybridFlow paper and is built around a hybrid-controller programming model that lets you express complex RL dataflows (PPO, GRPO, and others) in a few lines of code while decoupling computation from data dependencies. It integrates with FSDP, Megatron-LM, vLLM, and SGLang, and is one of the most widely used RL frameworks for LLMs.

RL post-training is compute-heavy and sensitive to configuration. Validate the reward function and a small single-node run before scaling to multi-node.

Installation

| Method | Command |
| --- | --- |
| pip | pip install verl |
| From source | git clone https://github.com/volcengine/verl && cd verl && pip install -e . |
| Docker | Use the official images referenced in the docs (CUDA + vLLM preinstalled) |
| Requirements | NVIDIA GPU(s), CUDA, PyTorch; vLLM or SGLang for rollout |

Core Concepts

| Term | Meaning |
| --- | --- |
| HybridFlow | The hybrid-controller model: a single controller orchestrates the RL dataflow across distributed workers |
| Rollout | Generation phase that produces samples (served by vLLM/SGLang) |
| Actor / Critic | The policy being trained and (for PPO) the value model |
| Reward | Scalar signal from a reward model or a custom function |
| Placement | How models are mapped onto GPU sets (colocated or split) |
| Recipe | A ready-made training configuration for an algorithm |

Algorithms Supported

| Algorithm | Notes |
| --- | --- |
| PPO | Classic actor-critic RLHF |
| GRPO | Group-relative, no critic model required |
| ReMax / RLOO | Lightweight baselines |
| DAPO / Dr. GRPO | GRPO variants |
| DPO-style | Preference optimization recipes |

Running a Training Job

verl is driven by Hydra-style config overrides on the command line.

```bash
# GRPO example (conceptual): point at data, model, and rollout backend
python3 -m verl.trainer.main_ppo \
  algorithm.adv_estimator=grpo \
  data.train_files=$DATA/train.parquet \
  data.val_files=$DATA/val.parquet \
  actor_rollout_ref.model.path=Qwen/Qwen2.5-7B-Instruct \
  actor_rollout_ref.rollout.name=vllm \
  trainer.n_gpus_per_node=8 \
  trainer.nnodes=1
```

| Config group | Controls |
| --- | --- |
| algorithm.* | Algorithm and advantage estimator (e.g. adv_estimator=grpo) |
| data.* | Train/val files, batch sizes, max lengths |
| actor_rollout_ref.model.* | Base model path and dtype |
| actor_rollout_ref.rollout.* | Rollout backend (vllm / sglang) and sampling |
| critic.* | Critic model settings (PPO) |
| trainer.* | GPUs per node, node count, logging, checkpoints |
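
Because every setting is a dotted key=value override, it can be convenient to keep the overrides in a dictionary and assemble the command programmatically, for example to switch between a small smoke test and a full run. The sketch below assumes a hypothetical helper, build_command, and placeholder data paths; the config keys are the ones used in the example above, and the single-GPU value is only an illustration of a small validation run.

```python
import shlex


def build_command(overrides, module="verl.trainer.main_ppo"):
    """Turn a dict of dotted config keys into a Hydra-style argument list.

    Hypothetical helper for illustration; it only formats key=value pairs.
    """
    args = ["python3", "-m", module]
    args += [f"{key}={value}" for key, value in overrides.items()]
    return args


# Placeholder paths; replace with your own Parquet files.
overrides = {
    "algorithm.adv_estimator": "grpo",
    "data.train_files": "data/train.parquet",
    "data.val_files": "data/val.parquet",
    "actor_rollout_ref.model.path": "Qwen/Qwen2.5-7B-Instruct",
    "actor_rollout_ref.rollout.name": "vllm",
    "trainer.n_gpus_per_node": 1,  # small single-node run before scaling up
    "trainer.nnodes": 1,
}

cmd = build_command(overrides)
# Print a copy-pasteable shell command; pass cmd to subprocess.run to launch it.
print(shlex.join(cmd))
```

Keeping the overrides as data makes it easy to diff two runs or to log the exact configuration alongside the results.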

Backends & Scaling

| Component | Options |
| --- | --- |
| Training engine | FSDP, FSDP2, Megatron-LM |
| Rollout engine | vLLM, SGLang |
| Distribution | Ray-based placement across GPUs/nodes |
| Device mapping | Colocate actor+rollout, or split across GPU sets |

Reward & Data

| Task | How |
| --- | --- |
| Custom reward | Provide a reward function module via config |
| Reward model | Point reward_model.* at a scoring model |
| Dataset format | Parquet with prompt (and answer for verifiable rewards) |
| Verifiable rewards | Built-in support for math/code-style exact-match rewards |
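
A custom exact-match reward can be sketched as a plain Python function that is tested on its own before any training run. The signature below (data_source, solution_str, ground_truth, extra_info) is an assumption modeled on the convention verl uses for custom reward functions; confirm the expected signature and the config keys that register the module against the documentation for your installed version. Taking the last non-empty line as the answer is likewise an illustrative choice, not a verl rule.

```python
def compute_score(data_source, solution_str, ground_truth, extra_info=None):
    """Return 1.0 if the last non-empty line of the response equals the answer.

    Assumed signature for illustration; data_source and extra_info are unused.
    """
    lines = [line.strip() for line in solution_str.splitlines() if line.strip()]
    predicted = lines[-1] if lines else ""
    return 1.0 if predicted == str(ground_truth).strip() else 0.0


# Quick offline checks of the reward before launching a run.
samples = [
    ("Let me add the numbers.\n42", "42"),  # correct final line
    ("The answer is probably 41.\n41", "42"),  # wrong final line
    ("", "42"),  # empty response
]
scores = [compute_score("demo", response, answer) for response, answer in samples]
print(scores)
```

Checking a handful of known-good and known-bad responses this way catches parsing mistakes that would otherwise show up only as a flat reward curve.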

Monitoring

| Tool | Integration |
| --- | --- |
| Weights & Biases | trainer.logger=['console','wandb'] |
| TensorBoard | Supported logger backend |
| Checkpoints | Save interval via trainer.save_freq; resume behavior via trainer.resume_mode |

verl vs Other RL Frameworks

| Aspect | verl | OpenRLHF | ART |
| --- | --- | --- | --- |
| Core model | HybridFlow controller | Ray actor split | Client/server |
| Training engines | FSDP + Megatron | DeepSpeed/FSDP | Unsloth |
| Rollout | vLLM / SGLang | vLLM | vLLM |
| Best for | Throughput + flexibility | Production RLHF at scale | Agents in your own code |

Resources