OpenRLHF Cheatsheet

OpenRLHF is a high-performance, open-source RLHF framework designed for scalability. Built on Ray for distributed scheduling, vLLM for fast generation, and DeepSpeed/ZeRO for training, it separates the Actor, Critic, Reward, and Reference models across GPUs so you can train large models (70B+) efficiently. It implements a broad menu of algorithms (PPO, GRPO, REINFORCE++, RLOO), plus dynamic sampling and asynchronous agentic RL.

Large-scale reinforcement learning is operationally demanding. Start on a single node with a small model, confirm that rewards trend in the right direction, then scale out with Ray.
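
One rough way to check the reward trend on a small run is to fit a straight line to the mean reward logged at each training step and look at the sign of the slope. The sketch below assumes you have already exported those per-step means as a list of floats; the function name `reward_slope` is hypothetical and not part of OpenRLHF.

```python
from typing import Sequence


def reward_slope(mean_rewards: Sequence[float]) -> float:
    """Least-squares slope of mean reward against training step index.

    A clearly positive slope suggests rewards are improving; a flat or
    negative slope is a hint to debug before scaling out.
    """
    n = len(mean_rewards)
    if n < 2:
        raise ValueError("need at least two logged steps to fit a trend")
    x_mean = (n - 1) / 2  # mean of the step indices 0..n-1
    y_mean = sum(mean_rewards) / n
    numerator = sum((i - x_mean) * (y - y_mean) for i, y in enumerate(mean_rewards))
    denominator = sum((i - x_mean) ** 2 for i in range(n))
    return numerator / denominator


if __name__ == "__main__":
    # Illustrative numbers only, not real training results.
    logged = [0.10, 0.12, 0.11, 0.18, 0.21, 0.25]
    print(f"slope per step: {reward_slope(logged):+.4f}")
```

A single slope hides noise and reward hacking, so treat it as a first sanity check alongside reading sampled responses.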

Installation

Method	Command
pip	`pip install openrlhf`
With vLLM extras	`pip install "openrlhf[vllm]"` (quoted so shells such as zsh do not expand the brackets)
From source	`git clone https://github.com/OpenRLHF/OpenRLHF && cd OpenRLHF && pip install -e .`
Docker	use the project’s reference image (CUDA + deps preinstalled)
Requirements	NVIDIA GPU(s), CUDA, Ray, DeepSpeed, vLLM

Architecture

Component	Role
Actor	The policy model being trained
Critic	Value model (PPO)
Reward model	Scores generated responses
Reference model	Frozen baseline for the KL penalty
Ray	Schedules and places these models across GPUs
vLLM engine	Accelerates the rollout/generation phase
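
How these roles fit together on a single sampled response can be sketched as follows. This is a minimal illustration of the KL-penalized reward commonly used in RLHF, not OpenRLHF's internal code: the function name `kl_shaped_rewards`, the coefficient `beta`, and the per-token log-probability lists are all assumptions made for the example.

```python
from typing import List


def kl_shaped_rewards(
    actor_logprobs: List[float],
    ref_logprobs: List[float],
    reward_score: float,
    beta: float = 0.01,
) -> List[float]:
    """Per-token training signal for one response (hypothetical helper).

    actor_logprobs: log-probs of the sampled tokens under the Actor.
    ref_logprobs:   log-probs of the same tokens under the frozen Reference.
    reward_score:   scalar score from the Reward model for the whole response.
    """
    if len(actor_logprobs) != len(ref_logprobs):
        raise ValueError("actor and reference must score the same tokens")
    if not actor_logprobs:
        return []
    # Simple per-token KL estimate: log pi_actor(token) - log pi_ref(token).
    rewards = [-beta * (a - r) for a, r in zip(actor_logprobs, ref_logprobs)]
    # The reward-model score is credited on the final token of the response.
    rewards[-1] += reward_score
    return rewards


if __name__ == "__main__":
    print(kl_shaped_rewards([-1.0, -2.0], [-1.0, -1.0], reward_score=1.0, beta=0.5))
```

In PPO, the Critic would then estimate values over these per-token rewards to compute advantages; critic-free estimators skip that step.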

Common Training Entry Points

OpenRLHF ships one CLI module per training stage or algorithm; launch each module with the `deepspeed` launcher or with Ray.

Command	Purpose
`openrlhf.cli.train_sft`	Supervised fine-tuning
`openrlhf.cli.train_rm`	Reward model training
`openrlhf.cli.train_ppo`	PPO (single-controller)
`openrlhf.cli.train_ppo_ray`	PPO/GRPO distributed with Ray + vLLM
`openrlhf.cli.train_dpo`	Direct Preference Optimization

PPO/GRPO with Ray (sketch)

ray start --head --node-ip-address 0.0.0.0

python3 -m openrlhf.cli.train_ppo_ray \
  --pretrain Qwen/Qwen2.5-7B-Instruct \
  --reward_pretrain OpenRLHF/Llama-3-8b-rm-mixture \
  --advantage_estimator group_norm \
  --vllm_num_engines 2 --vllm_tensor_parallel_size 1 \
  --actor_num_gpus_per_node 4 \
  --colocate_actor_ref \
  --prompt_data your/prompts --input_key prompt \
  --save_path ./ckpt --use_wandb $WANDB_API_KEY

Flag	Controls
`--pretrain`	Base/actor model
`--reward_pretrain`	Reward model path
`--advantage_estimator group_norm`	Selects GRPO-style group normalization
`--vllm_num_engines`	Number of vLLM rollout engines
`--actor_num_gpus_per_node`	GPU allocation for the actor
`--colocate_actor_ref`	Place actor + reference together to save GPUs
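
When you vary these flags across experiments, it can help to keep them in a dictionary and generate the command line from it. The sketch below is one possible way to do that; `build_command` is a hypothetical helper, and the flag names and values are only the ones shown in the command above.

```python
import shlex
from typing import Dict, List, Union

FlagValue = Union[str, int, bool]


def build_command(module: str, flags: Dict[str, FlagValue]) -> List[str]:
    """Turn a flag dictionary into an argv list for `python3 -m <module>`.

    True becomes a bare switch, False is dropped, anything else is
    emitted as `--name value`.
    """
    argv = ["python3", "-m", module]
    for name, value in flags.items():
        if value is True:
            argv.append(f"--{name}")
        elif value is False:
            continue
        else:
            argv.extend([f"--{name}", str(value)])
    return argv


# Values copied from the sketch above; adjust them for your own cluster.
grpo_flags: Dict[str, FlagValue] = {
    "pretrain": "Qwen/Qwen2.5-7B-Instruct",
    "reward_pretrain": "OpenRLHF/Llama-3-8b-rm-mixture",
    "advantage_estimator": "group_norm",
    "vllm_num_engines": 2,
    "vllm_tensor_parallel_size": 1,
    "actor_num_gpus_per_node": 4,
    "colocate_actor_ref": True,
}

command = build_command("openrlhf.cli.train_ppo_ray", grpo_flags)

if __name__ == "__main__":
    # shlex.join quotes arguments so the line can be pasted into a shell.
    print(shlex.join(command))
```

The resulting list can also be passed directly to `subprocess.run(command, check=True)` on a machine where OpenRLHF is installed.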

Algorithms

Algorithm	Flag/Module
PPO	default in `train_ppo*`
GRPO	`--advantage_estimator group_norm`
REINFORCE++	`--advantage_estimator reinforce` family
RLOO	`--advantage_estimator rloo`
DPO / KTO	dedicated `train_dpo` / `train_kto` modules
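
To make the difference between two of the critic-free estimators concrete, here is a minimal sketch of how group normalization (GRPO-style) and a leave-one-out baseline (RLOO-style) turn the rewards of several responses to the same prompt into advantages. It illustrates the textbook formulas rather than OpenRLHF's implementation; the function names, the `eps` constant, and the use of the population standard deviation are assumptions.

```python
from statistics import mean, pstdev
from typing import List


def group_norm_advantages(rewards: List[float], eps: float = 1e-9) -> List[float]:
    """GRPO-style: standardize each reward within its prompt group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


def rloo_advantages(rewards: List[float]) -> List[float]:
    """RLOO-style: baseline is the mean reward of the *other* samples."""
    k = len(rewards)
    if k < 2:
        raise ValueError("RLOO needs at least two samples per prompt")
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]


if __name__ == "__main__":
    # Four sampled responses to one prompt, scored 1.0 (good) or 0.0 (bad).
    group = [1.0, 0.0, 0.0, 1.0]
    print("group_norm:", group_norm_advantages(group))
    print("rloo:      ", rloo_advantages(group))
```

Both estimators need several responses per prompt, which is why they pair naturally with fast vLLM rollouts and do not require a Critic.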

Scaling & Performance

Technique	Flag/Note
ZeRO stage	`--zero_stage 3` for large models
Offload	`--adam_offload` to spill optimizer state to CPU
Flash attention	`--flash_attn`
Colocation	`--colocate_actor_ref`, `--colocate_critic_reward`
Dynamic sampling	Built in, to improve sample efficiency
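
A back-of-the-envelope estimate shows why ZeRO stage 3 and optimizer offload matter for large models. The sketch below uses the usual mixed-precision Adam accounting (2 bytes of half-precision weights, 2 bytes of gradients, and 12 bytes of fp32 optimizer state per parameter) and assumes ZeRO-3 shards all of it evenly across GPUs. It ignores activations, buffers, and fragmentation, and it assumes offload removes only the optimizer state from the GPU; the function name is hypothetical.

```python
PARAM_BYTES = 2   # half-precision weights
GRAD_BYTES = 2    # half-precision gradients
OPTIM_BYTES = 12  # fp32 master weights + Adam momentum + Adam variance


def model_state_gib_per_gpu(
    num_params: float, num_gpus: int, adam_offload: bool = False
) -> float:
    """Rough GPU memory (GiB) for model states under ZeRO stage 3."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    per_param = PARAM_BYTES + GRAD_BYTES
    if not adam_offload:
        per_param += OPTIM_BYTES  # optimizer state stays on the GPU
    return num_params * per_param / num_gpus / 2**30


if __name__ == "__main__":
    # A 70B-parameter actor sharded over 64 GPUs.
    print(f"no offload:   {model_state_gib_per_gpu(70e9, 64):.1f} GiB")        # about 16.3
    print(f"adam offload: {model_state_gib_per_gpu(70e9, 64, True):.1f} GiB")  # about 4.1
```

Real usage is higher once activations and the other models are counted, so treat the result as a lower bound when planning GPU counts.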

OpenRLHF vs verl vs ART

Aspect	OpenRLHF	verl	ART
Foundation	Ray + DeepSpeed + vLLM	HybridFlow + FSDP/Megatron	Client/server + Unsloth
Strength	Production RLHF, broad algos	Throughput + flexibility	Agents in your own code
Multimodal	VLM RLHF (OpenRLHF-M)	Supported	Text-focused
Best for	Scalable RLHF pipelines	Research throughput	Single-agent on-the-job RL
