OpenRLHF Cheatsheet
OpenRLHF is a high-performance, open-source RLHF framework designed for scale. It builds on Ray for distributed scheduling, vLLM for fast generation, and DeepSpeed ZeRO for training, and it places the Actor, Critic, Reward, and Reference models on separate GPUs so that large models (70B+) can be trained efficiently. It implements PPO, GRPO, REINFORCE++, and RLOO, plus dynamic sampling and asynchronous agentic RL.
Reinforcement learning at scale is operationally demanding. Start on a single node with a small model, confirm the reward trends correctly, then scale out with Ray.
Installation
| Method | Command |
|---|---|
| pip | pip install openrlhf |
| With vLLM extras | pip install "openrlhf[vllm]" (quoted so shells such as zsh do not expand the brackets) |
| From source | git clone https://github.com/OpenRLHF/OpenRLHF && cd OpenRLHF && pip install -e . |
| Docker | use the project’s reference image (CUDA + deps preinstalled) |
| Requirements | NVIDIA GPU(s), CUDA, Ray, DeepSpeed, vLLM |
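After installing, it can help to confirm that the Python-side requirements are importable before launching a job. The following is a minimal sketch, not part of OpenRLHF: the helper name `missing_packages` is hypothetical, and the import names are assumed to match the package names listed in the table above.

```python
from importlib.util import find_spec

# Import names assumed to match the requirements table (hypothetical list).
REQUIRED = ["ray", "deepspeed", "vllm", "openrlhf"]


def missing_packages(names=REQUIRED):
    """Return the names from `names` that cannot be found by this interpreter."""
    return [name for name in names if find_spec(name) is None]
```

Calling `missing_packages()` in the target environment returns an empty list when everything is importable. It only checks that the packages can be located, not that CUDA or the GPUs are usable.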
Architecture
| Component | Role |
|---|---|
| Actor | The policy model being trained |
| Critic | Value model (PPO) |
| Reward model | Scores generated responses |
| Reference model | Frozen baseline for the KL penalty |
| Ray | Schedules and places these models across GPUs |
| vLLM engine | Accelerates the rollout/generation phase |
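To make the Reference model's role concrete, here is a simplified, sequence-level sketch of a KL-penalized reward. It is an illustration under stated assumptions, not OpenRLHF's implementation: the function name and the `kl_coef` default are hypothetical, the KL term is the simple log-ratio estimate summed over tokens, and real trainers typically apply the penalty per token.

```python
def kl_shaped_reward(reward, actor_logprobs, ref_logprobs, kl_coef=0.01):
    """Subtract a KL penalty (actor vs. frozen reference) from a scalar reward.

    actor_logprobs / ref_logprobs: per-token log-probabilities of the same
    sampled response under the actor and the reference model.
    """
    if len(actor_logprobs) != len(ref_logprobs):
        raise ValueError("log-prob sequences must have the same length")
    # Log-ratio estimate of KL(actor || reference), summed over the response.
    kl = sum(a - r for a, r in zip(actor_logprobs, ref_logprobs))
    return reward - kl_coef * kl
```

When the actor still matches the reference, the penalty is zero and the reward model's score passes through unchanged. As the actor assigns more probability to its samples than the reference does, the shaped reward shrinks.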
Common Training Entry Points
OpenRLHF ships one CLI module per training task; launch it with the deepspeed launcher or, for train_ppo_ray, on a Ray cluster.
| Command | Purpose |
|---|---|
| openrlhf.cli.train_sft | Supervised fine-tuning |
| openrlhf.cli.train_rm | Reward model training |
| openrlhf.cli.train_ppo | PPO (single-controller) |
| openrlhf.cli.train_ppo_ray | PPO/GRPO distributed with Ray + vLLM |
| openrlhf.cli.train_dpo | Direct Preference Optimization |
PPO/GRPO with Ray (sketch)
ray start --head --node-ip-address 0.0.0.0
python3 -m openrlhf.cli.train_ppo_ray \
  --pretrain Qwen/Qwen2.5-7B-Instruct \
  --reward_pretrain OpenRLHF/Llama-3-8b-rm-mixture \
  --advantage_estimator group_norm \
  --vllm_num_engines 2 --vllm_tensor_parallel_size 1 \
  --actor_num_gpus_per_node 4 \
  --ref_num_gpus_per_node 4 \
  --colocate_actor_ref \
  --prompt_data your/prompts --input_key prompt \
  --save_path ./ckpt --use_wandb $WANDB_API_KEY
| Flag | Controls |
|---|---|
| --pretrain | Base/actor model |
| --reward_pretrain | Reward model path |
| --advantage_estimator group_norm | Selects GRPO-style group normalization |
| --vllm_num_engines | Number of vLLM rollout engines |
| --vllm_tensor_parallel_size | GPUs per vLLM engine (tensor parallelism) |
| --actor_num_gpus_per_node | GPU allocation for the actor |
| --colocate_actor_ref | Place actor + reference on the same GPUs to save GPUs; their GPU counts must match |
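The flags above translate into a GPU budget. The following is a rough estimate only, with a hypothetical helper name. It assumes each vLLM engine occupies `vllm_tensor_parallel_size` dedicated GPUs, that colocating the actor and reference lets them share one set of GPUs, and that GPUs for the reward model or critic are passed separately as `other_gpus`.

```python
def gpus_needed(vllm_num_engines, vllm_tensor_parallel_size,
                actor_gpus, ref_gpus, colocate_actor_ref, other_gpus=0):
    """Estimate total GPUs for a Ray PPO/GRPO job (illustrative arithmetic only)."""
    # Rollout side: every engine spans `vllm_tensor_parallel_size` GPUs.
    rollout = vllm_num_engines * vllm_tensor_parallel_size
    if colocate_actor_ref:
        if actor_gpus != ref_gpus:
            raise ValueError("colocated actor and reference need equal GPU counts")
        training = actor_gpus  # both models share the same GPUs
    else:
        training = actor_gpus + ref_gpus
    return rollout + training + other_gpus
```

With the values from the sketch command (2 engines, tensor parallel size 1, 4 actor GPUs, colocated reference), the estimate is 6 GPUs before counting the reward model. Without colocation, the same settings come to 10.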
Algorithms
| Algorithm | Flag/Module |
|---|---|
| PPO | default in train_ppo* (--advantage_estimator gae) |
| GRPO | --advantage_estimator group_norm |
| REINFORCE++ | --advantage_estimator reinforce (baseline variant: reinforce_baseline) |
| RLOO | --advantage_estimator rloo |
| DPO / KTO | dedicated train_dpo / train_kto modules |
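The critic-free estimators differ mainly in the baseline they subtract from each reward. Below is a minimal pure-Python sketch of the group-normalized (GRPO-style) and leave-one-out (RLOO) advantages for the responses sampled from a single prompt. The function names are hypothetical, and the use of the population standard deviation with a small `eps` is an assumption; implementations differ on those details.

```python
from statistics import mean, pstdev


def group_norm_advantages(rewards, eps=1e-9):
    """GRPO-style: standardize each reward within its prompt's group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


def rloo_advantages(rewards):
    """RLOO: baseline for each sample is the mean reward of the other samples."""
    n = len(rewards)
    if n < 2:
        raise ValueError("RLOO needs at least two samples per prompt")
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]
```

Both estimators need several responses per prompt. When every response in a group gets the same reward, both return all-zero advantages, so that group contributes no learning signal.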
Performance and Memory Tuning
| Technique | Flag/Note |
|---|---|
| ZeRO stage | --zero_stage 3 for large models |
| Offload | --adam_offload to spill optimizer state to CPU |
| Flash attention | --flash_attn |
| Colocation | --colocate_actor_ref, --colocate_critic_reward |
| Dynamic sampling | Built-in to improve sample efficiency |
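A back-of-the-envelope calculation shows why ZeRO stage 3 and optimizer offload matter for large models. This sketch uses the standard mixed-precision Adam accounting from the ZeRO paper (2 bytes of weights, 2 bytes of gradients, and 12 bytes of optimizer state per parameter, all partitioned across GPUs at stage 3). The helper name is hypothetical, activations and buffers are ignored, and treating `--adam_offload` as moving exactly the 12 optimizer bytes off the GPU is an assumption.

```python
def zero3_state_gib_per_gpu(num_params, num_gpus, adam_offload=False):
    """Approximate per-GPU model-state memory (GiB) under ZeRO stage 3."""
    weights_and_grads = 4 * num_params  # 16-bit weights (2 B) + gradients (2 B)
    optimizer = 12 * num_params         # fp32 master weights, momentum, variance
    on_gpu = weights_and_grads + (0 if adam_offload else optimizer)
    # Stage 3 shards all model states evenly across the data-parallel GPUs.
    return on_gpu / num_gpus / 2**30
```

For a 70B-parameter actor on 64 GPUs this gives about 16.3 GiB of model state per GPU, or about 4.1 GiB with the optimizer state offloaded. Activations and rollout memory come on top of that.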
OpenRLHF vs verl vs ART
| Aspect | OpenRLHF | verl | ART |
|---|---|---|---|
| Foundation | Ray + DeepSpeed + vLLM | HybridFlow + FSDP/Megatron | Client/server + Unsloth |
| Strength | Production RLHF, broad algos | Throughput + flexibility | Agents in your own code |
| Multimodal | VLM RLHF (OpenRLHF-M) | Supported | Text-focused |
| Best for | Scalable RLHF pipelines | Research throughput | Single-agent on-the-job RL |
Resources