OpenRLHF Cheatsheet

OpenRLHF is a high-performance, open-source RLHF framework designed for scale. Built on Ray for distributed scheduling, vLLM for fast generation, and DeepSpeed/ZeRO for training, it separates the Actor, Critic, Reward, and Reference models across GPUs so you can train large models (70B+) efficiently. It implements a broad menu of algorithms — PPO, GRPO, REINFORCE++, RLOO — plus dynamic sampling and async agentic RL.

Reinforcement learning at scale is operationally demanding. Start on a single node with a small model, confirm the reward trends correctly, then scale out with Ray.

Installation

Method	Command
pip	`pip install openrlhf`
With vLLM extras	`pip install "openrlhf[vllm]"` (quoted so shells such as zsh do not expand the brackets)
From source	`git clone https://github.com/OpenRLHF/OpenRLHF && cd OpenRLHF && pip install -e .`
Docker	use the project’s reference image (CUDA + deps preinstalled)
Requirements	NVIDIA GPU(s), CUDA, Ray, DeepSpeed, vLLM

Architecture

Component	Role
Actor	The policy model being trained
Critic	Value model (PPO)
Reward model	Scores generated responses
Reference model	Frozen baseline for the KL penalty
Ray	Schedules and places these models across GPUs
vLLM engine	Accelerates the rollout/generation phase
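
To make the division of labor concrete, here is a minimal sketch of how a reward-model score and the frozen reference model can combine into per-token rewards. It assumes the simple log-ratio estimate of the KL term and a hypothetical coefficient `kl_coef`; the function name and signature are illustrative and are not OpenRLHF's API.

```python
def kl_shaped_rewards(actor_logprobs, ref_logprobs, reward_score, kl_coef=0.01):
    """Return per-token rewards for one generated response.

    Every token is penalized by kl_coef * (log pi_actor - log pi_ref), and the
    scalar reward-model score is added on the final token.
    Assumption: log-ratio KL estimate; real frameworks may use other estimators.
    """
    if not actor_logprobs or len(actor_logprobs) != len(ref_logprobs):
        raise ValueError("need two non-empty log-prob sequences of equal length")
    rewards = [-kl_coef * (a - r) for a, r in zip(actor_logprobs, ref_logprobs)]
    rewards[-1] += reward_score  # sequence-level score lands on the last token
    return rewards


# Hypothetical log-probs for a two-token response.
example = kl_shaped_rewards([-1.0, -2.0], [-1.5, -2.0], reward_score=1.0, kl_coef=0.1)
print(example)
```

The actor supplies `actor_logprobs`, the reference model supplies `ref_logprobs`, and the reward model supplies `reward_score`; under PPO the critic would then estimate values from these shaped rewards.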

Common Training Entry Points

OpenRLHF ships one CLI module per training method; launch them with `deepspeed` or through Ray.

Command	Purpose
`openrlhf.cli.train_sft`	Supervised fine-tuning
`openrlhf.cli.train_rm`	Reward model training
`openrlhf.cli.train_ppo`	PPO (single-controller)
`openrlhf.cli.train_ppo_ray`	PPO/GRPO distributed with Ray + vLLM
`openrlhf.cli.train_dpo`	Direct Preference Optimization

PPO/GRPO with Ray (sketch)

ray start --head --node-ip-address 0.0.0.0

python3 -m openrlhf.cli.train_ppo_ray \
  --pretrain Qwen/Qwen2.5-7B-Instruct \
  --reward_pretrain OpenRLHF/Llama-3-8b-rm-mixture \
  --advantage_estimator group_norm \
  --vllm_num_engines 2 --vllm_tensor_parallel_size 1 \
  --actor_num_gpus_per_node 4 \
  --colocate_actor_ref \
  --prompt_data your/prompts --input_key prompt \
  --save_path ./ckpt --use_wandb $WANDB_API_KEY

Flag	Controls
`--pretrain`	Base/actor model
`--reward_pretrain`	Reward model path
`--advantage_estimator group_norm`	Selects GRPO-style group normalization
`--vllm_num_engines`	Number of vLLM rollout engines
`--actor_num_gpus_per_node`	GPU allocation for the actor
`--colocate_actor_ref`	Place actor + reference together to save GPUs
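
When the flag list grows, it can help to keep it in one dictionary and generate the command from it. The helper below is a hypothetical convenience, not part of OpenRLHF; it only assembles an argument list from the flags shown above and does not validate them.

```python
import shlex


def build_ppo_ray_command(flags):
    """Turn a {flag_name: value} mapping into an argv list.

    True  -> bare switch (e.g. --colocate_actor_ref)
    False/None -> flag omitted
    anything else -> "--name value"
    """
    cmd = ["python3", "-m", "openrlhf.cli.train_ppo_ray"]
    for name, value in flags.items():
        if value is False or value is None:
            continue
        cmd.append(f"--{name}")
        if value is not True:
            cmd.append(str(value))
    return cmd


flags = {
    "pretrain": "Qwen/Qwen2.5-7B-Instruct",
    "advantage_estimator": "group_norm",
    "vllm_num_engines": 2,
    "colocate_actor_ref": True,
}
# shlex.join quotes each argument so the line can be pasted into a shell.
print(shlex.join(build_ppo_ray_command(flags)))
```

The resulting list can be passed to `subprocess.run` directly, which avoids shell quoting problems with paths or tokens.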

Algorithms

Algorithm	Flag/Module
PPO	default in `train_ppo*`
GRPO	`--advantage_estimator group_norm`
REINFORCE++	`--advantage_estimator reinforce` family
RLOO	`--advantage_estimator rloo`
DPO / KTO	dedicated `train_dpo` / `train_kto` modules
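
The estimators differ mainly in the baseline subtracted from each sampled response's reward. The sketch below shows the textbook form of GRPO-style group normalization and the RLOO leave-one-out baseline for one prompt's group of samples. It is an illustration under stated assumptions (population standard deviation, a small `eps` for stability), not OpenRLHF's implementation.

```python
from statistics import mean, pstdev


def group_norm_advantages(rewards, eps=1e-8):
    """GRPO-style: standardize each reward within its prompt's sample group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; some code uses sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


def rloo_advantages(rewards):
    """RLOO: baseline for each sample is the mean reward of the other samples."""
    n = len(rewards)
    if n < 2:
        raise ValueError("RLOO needs at least two samples per prompt")
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]


# Four sampled responses to one prompt, scored 1 (good) or 0 (bad).
group = [1.0, 0.0, 0.0, 1.0]
print(group_norm_advantages(group))
print(rloo_advantages(group))
```

Neither estimator needs a critic, which is why these variants drop the value model that PPO requires.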

Scaling & Performance

Technique	Flag/Note
ZeRO stage	`--zero_stage 3` for large models
Offload	`--adam_offload` to spill optimizer state to CPU
Flash attention	`--flash_attn`
Colocation	`--colocate_actor_ref`, `--colocate_critic_reward`
Dynamic sampling	Built in; improves sample efficiency
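
For a rough sense of why ZeRO stage 3 matters for large models, the following back-of-the-envelope estimate uses the accounting from the ZeRO paper for mixed-precision Adam: 2 bytes per parameter for half-precision weights, 2 for gradients, and 12 for optimizer state. It covers model states only and ignores activations, buffers, and fragmentation, so treat the numbers as a lower bound.

```python
def zero_model_state_gb(num_params, num_gpus, stage):
    """Approximate per-GPU memory (GB) for model states under ZeRO.

    Stage 1 shards optimizer state, stage 2 also shards gradients,
    stage 3 also shards the parameters themselves.
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    params = 2.0 * num_params   # half-precision weights
    grads = 2.0 * num_params    # half-precision gradients
    optim = 12.0 * num_params   # fp32 weights + Adam momentum + variance
    if stage >= 1:
        optim /= num_gpus
    if stage >= 2:
        grads /= num_gpus
    if stage >= 3:
        params /= num_gpus
    return (params + grads + optim) / 1e9


# A 7B-parameter actor on 8 GPUs, comparing stages.
for stage in (0, 1, 2, 3):
    print(stage, round(zero_model_state_gb(7e9, 8, stage), 1), "GB per GPU")
```

Optimizer state is the largest of the three terms in this accounting, which is the part `--adam_offload` moves to CPU memory.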

OpenRLHF vs verl vs ART

Aspect	OpenRLHF	verl	ART
Foundation	Ray + DeepSpeed + vLLM	HybridFlow + FSDP/Megatron	Client/server + Unsloth
Strength	Production RLHF, broad algos	Throughput + flexibility	Agents in your own code
Multimodal	VLM RLHF (OpenRLHF-M)	Supported	Text-focused
Best for	Scalable RLHF pipelines	Research throughput	Single-agent on-the-job RL
