OpenRLHF Cheatsheet

OpenRLHF is a high-performance, open-source RLHF framework designed for scale. Built on Ray for distributed scheduling, vLLM for fast generation, and DeepSpeed/ZeRO for training, it separates the Actor, Critic, Reward, and Reference models across GPUs so you can train large models (70B+) efficiently. It implements a broad menu of algorithms — PPO, GRPO, REINFORCE++, RLOO — plus dynamic sampling and async agentic RL.

Reinforcement learning at scale is operationally demanding. Start on a single node with a small model, confirm the reward trends correctly, then scale out with Ray.

Installation

| Method | Command |
| --- | --- |
| pip | pip install openrlhf |
| With vLLM extras | pip install openrlhf[vllm] |
| From source | git clone https://github.com/OpenRLHF/OpenRLHF && cd OpenRLHF && pip install -e . |
| Docker | use the project’s reference image (CUDA + deps preinstalled) |
| Requirements | NVIDIA GPU(s), CUDA, Ray, DeepSpeed, vLLM |
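Before launching anything, it can help to confirm that the Python packages named in the requirements row are visible to the interpreter. The following is a minimal sketch, not part of OpenRLHF: the helper name `missing_modules` is hypothetical, and the check only covers importability, not CUDA or GPU availability.

```python
from importlib.util import find_spec

# Top-level import names for the Python packages listed in the table above.
REQUIRED_MODULES = ("openrlhf", "ray", "deepspeed", "vllm")


def missing_modules(names=REQUIRED_MODULES):
    """Return the module names that cannot be located in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All listed packages are importable.")
```

The check uses `find_spec` rather than a real import so that it stays fast and does not initialize CUDA or Ray as a side effect.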

Architecture

| Component | Role |
| --- | --- |
| Actor | The policy model being trained |
| Critic | Value model (PPO) |
| Reward model | Scores generated responses |
| Reference model | Frozen baseline for the KL penalty |
| Ray | Schedules and places these models across GPUs |
| vLLM engine | Accelerates the rollout/generation phase |
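To make the reference model's role concrete, here is a toy calculation of a KL-penalized reward. It is a sketch of a common RLHF formulation, not OpenRLHF's actual code: it assumes the simple per-token estimate log pi_actor minus log pi_ref, adds the reward-model score on the final token, and uses a hypothetical `kl_coef` parameter and function name.

```python
def shaped_rewards(actor_logprobs, ref_logprobs, sequence_reward, kl_coef=0.01):
    """Per-token rewards: a KL penalty on every token, plus the
    reward-model score added on the last token of the response."""
    if len(actor_logprobs) != len(ref_logprobs):
        raise ValueError("log-prob lists must have the same length")
    if not actor_logprobs:
        return []
    # Penalize tokens where the actor drifts away from the frozen reference.
    rewards = [-kl_coef * (a - r) for a, r in zip(actor_logprobs, ref_logprobs)]
    # The reward model scores the whole response, so credit it at the end.
    rewards[-1] += sequence_reward
    return rewards
```

When the actor and reference agree on every token the penalty vanishes and only the reward-model score remains, which is why the reference model can stay frozen for the whole run.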

Common Training Entry Points

OpenRLHF ships CLI modules per algorithm; launch with deepspeed or ray.

| Command | Purpose |
| --- | --- |
| openrlhf.cli.train_sft | Supervised fine-tuning |
| openrlhf.cli.train_rm | Reward model training |
| openrlhf.cli.train_ppo | PPO (single-controller) |
| openrlhf.cli.train_ppo_ray | PPO/GRPO distributed with Ray + vLLM |
| openrlhf.cli.train_dpo | Direct Preference Optimization |

PPO/GRPO with Ray (sketch)

ray start --head --node-ip-address 0.0.0.0

python3 -m openrlhf.cli.train_ppo_ray \
  --pretrain Qwen/Qwen2.5-7B-Instruct \
  --reward_pretrain OpenRLHF/Llama-3-8b-rm-mixture \
  --advantage_estimator group_norm \
  --vllm_num_engines 2 --vllm_tensor_parallel_size 1 \
  --actor_num_gpus_per_node 4 \
  --colocate_actor_ref \
  --prompt_data your/prompts --input_key prompt \
  --save_path ./ckpt --use_wandb $WANDB_API_KEY

| Flag | Controls |
| --- | --- |
| --pretrain | Base/actor model |
| --reward_pretrain | Reward model path |
| --advantage_estimator group_norm | Selects GRPO-style group normalization |
| --vllm_num_engines | Number of vLLM rollout engines |
| --actor_num_gpus_per_node | GPU allocation for the actor |
| --colocate_actor_ref | Place actor + reference together to save GPUs |
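The group normalization selected by `--advantage_estimator group_norm` can be illustrated with a few lines of plain Python. This is a sketch of the GRPO idea under stated assumptions (population standard deviation, a small `eps` for numerical safety, a hypothetical function name); it is not taken from OpenRLHF's implementation.

```python
from statistics import mean, pstdev


def group_normalized_advantages(rewards, eps=1e-6):
    """Normalize the rewards of several responses sampled for one prompt.

    Each response's advantage is its reward minus the group mean,
    divided by the group standard deviation (assumed: population std).
    """
    if not rewards:
        return []
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Because the baseline comes from the group itself, this estimator needs no critic model; a group whose responses all receive the same reward yields zero advantages and contributes no gradient signal.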

Algorithms

| Algorithm | Flag/Module |
| --- | --- |
| PPO | Default in train_ppo* |
| GRPO | --advantage_estimator group_norm |
| REINFORCE++ | --advantage_estimator reinforce (REINFORCE family) |
| RLOO | --advantage_estimator rloo |
| DPO / KTO | Dedicated train_dpo / train_kto modules |
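For comparison with group normalization, the leave-one-out baseline behind RLOO fits in a few lines. This is the textbook form of the estimator, written as a sketch with a hypothetical function name; OpenRLHF's own implementation may differ, for example in how it folds in the KL term.

```python
def rloo_advantages(rewards):
    """Advantage of each sample: its reward minus the mean reward of the
    other samples drawn for the same prompt (leave-one-out baseline)."""
    k = len(rewards)
    if k < 2:
        raise ValueError("RLOO needs at least two samples per prompt")
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]
```

For rewards `[1.0, 0.0, 0.0]` the first sample is compared against a baseline of 0.0 and the other two against 0.5, so no sample's baseline depends on its own reward.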

Scaling & Performance

| Technique | Flag/Note |
| --- | --- |
| ZeRO stage | --zero_stage 3 for large models |
| Offload | --adam_offload to spill optimizer state to CPU |
| Flash attention | --flash_attn |
| Colocation | --colocate_actor_ref, --colocate_critic_reward |
| Dynamic sampling | Built-in to improve sample efficiency |
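A back-of-the-envelope estimate shows why ZeRO stage 3 and optimizer offload matter for large models. The sketch below uses the standard mixed-precision Adam accounting from the ZeRO paper (2 bytes of fp16 weights, 2 bytes of fp16 gradients, and 12 bytes of fp32 optimizer state per parameter). The function name and its `adam_offload` argument are hypothetical, and the estimate ignores activations, buffers, and vLLM memory.

```python
def model_state_gb_per_gpu(num_params, num_gpus, zero_stage=3, adam_offload=False):
    """Rough per-GPU memory (GB) for weights, gradients, and Adam state."""
    if zero_stage not in (0, 1, 2, 3):
        raise ValueError("zero_stage must be 0, 1, 2, or 3")
    # Bytes per parameter under mixed-precision Adam.
    weights, grads, optim = 2.0, 2.0, 12.0
    if zero_stage >= 1:
        optim /= num_gpus    # stage 1 shards optimizer state
    if zero_stage >= 2:
        grads /= num_gpus    # stage 2 also shards gradients
    if zero_stage >= 3:
        weights /= num_gpus  # stage 3 also shards the weights
    if adam_offload:
        optim = 0.0          # assumed: optimizer state lives in CPU memory
    return num_params * (weights + grads + optim) / 1e9


if __name__ == "__main__":
    # A 70B-parameter model on 64 GPUs.
    print(model_state_gb_per_gpu(70e9, 64, zero_stage=0))  # unsharded
    print(model_state_gb_per_gpu(70e9, 64, zero_stage=3))  # fully sharded
```

Under these assumptions a 70B model needs about 1120 GB of model state unsharded, which no single GPU can hold, versus roughly 17.5 GB per GPU when fully sharded across 64 GPUs.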

OpenRLHF vs verl vs ART

| Aspect | OpenRLHF | verl | ART |
| --- | --- | --- | --- |
| Foundation | Ray + DeepSpeed + vLLM | HybridFlow + FSDP/Megatron | Client/server + Unsloth |
| Strength | Production RLHF, broad algos | Throughput + flexibility | Agents in your own code |
| Multimodal | VLM RLHF (OpenRLHF-M) | Supported | Text-focused |
| Best for | Scalable RLHF pipelines | Research throughput | Single-agent on-the-job RL |

Resources