OpenRLHF Cheatsheet

OpenRLHF is a high-performance, open-source RLHF framework designed for scalability. Built on Ray for distributed scheduling, vLLM for fast generation, and DeepSpeed/ZeRO for training, it places the Actor, Critic, Reward, and Reference models on separate GPUs so that large models (70B+) can be trained efficiently. It implements a broad set of algorithms (PPO, GRPO, REINFORCE++, RLOO) along with dynamic sampling and asynchronous agentic RL.

Reinforcement learning at scale is operationally demanding. Start on a single node with a small model, confirm that rewards trend in the expected direction, then scale out with Ray.
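One way to make the "rewards trend in the expected direction" check concrete is to fit a least-squares slope to the mean reward logged at each training step. The sketch below is framework-independent: `reward_slope` and the sample values are hypothetical and not part of OpenRLHF.

```python
from statistics import mean


def reward_slope(rewards):
    """Least-squares slope of reward against training step (0, 1, 2, ...)."""
    n = len(rewards)
    if n < 2:
        raise ValueError("need at least two logged rewards")
    steps = range(n)
    step_mean = mean(steps)
    reward_mean = mean(rewards)
    # slope = covariance(step, reward) / variance(step)
    cov = sum((s - step_mean) * (r - reward_mean) for s, r in zip(steps, rewards))
    var = sum((s - step_mean) ** 2 for s in steps)
    return cov / var


# Hypothetical mean rewards logged over six training steps.
logged = [0.10, 0.12, 0.11, 0.15, 0.18, 0.21]
slope = reward_slope(logged)
print(f"slope per step: {slope:.4f}")
print("trending up" if slope > 0 else "flat or trending down")
```

A positive slope over a few hundred steps on a small model is a reasonable signal before committing a multi-node run; a flat or negative slope usually points at the reward model, the prompt data, or the KL coefficient.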

Installation

| Method | Command |
| --- | --- |
| pip | `pip install openrlhf` |
| With vLLM extras | `pip install openrlhf[vllm]` |
| From source | `git clone https://github.com/OpenRLHF/OpenRLHF && cd OpenRLHF && pip install -e .` |
| Docker | Use the project’s reference image (CUDA + deps preinstalled) |
| Requirements | NVIDIA GPU(s), CUDA, Ray, DeepSpeed, vLLM |

Architecture

| Component | Role |
| --- | --- |
| Actor | The policy model being trained |
| Critic | Value model (PPO) |
| Reward model | Scores generated responses |
| Reference model | Frozen baseline for the KL penalty |
| Ray | Schedules and places these models across GPUs |
| vLLM engine | Accelerates the rollout/generation phase |
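To illustrate how the frozen reference model feeds the KL penalty, here is a minimal sketch of one common formulation: each token is penalized by `beta * (logp_actor - logp_ref)`, and the reward model's scalar score is added on the final token. This formulation, the function name `kl_shaped_rewards`, and the `beta` values are assumptions for illustration; OpenRLHF's actual implementation may differ.

```python
def kl_shaped_rewards(actor_logps, ref_logps, score, beta=0.01):
    """Per-token rewards: -beta * (log pi_actor - log pi_ref),
    with the reward-model score added on the last token.

    Assumed formulation for illustration, not OpenRLHF's exact code.
    """
    if len(actor_logps) != len(ref_logps) or not actor_logps:
        raise ValueError("log-prob lists must be non-empty and of equal length")
    # The penalty grows as the actor assigns more probability than the reference.
    rewards = [-beta * (a - r) for a, r in zip(actor_logps, ref_logps)]
    rewards[-1] += score
    return rewards


# Hypothetical log-probabilities for a three-token response.
actor = [-1.0, -0.5, -2.0]
ref = [-1.2, -0.5, -1.5]
print(kl_shaped_rewards(actor, ref, score=1.0, beta=0.1))
```

A larger `beta` keeps the actor closer to the reference model at the cost of slower reward improvement.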

Common Training Entry Points

OpenRLHF ships CLI modules per algorithm; launch with deepspeed or ray.

| Command | Purpose |
| --- | --- |
| `openrlhf.cli.train_sft` | Supervised fine-tuning |
| `openrlhf.cli.train_rm` | Reward model training |
| `openrlhf.cli.train_ppo` | PPO (single-controller) |
| `openrlhf.cli.train_ppo_ray` | PPO/GRPO distributed with Ray + vLLM |
| `openrlhf.cli.train_dpo` | Direct Preference Optimization |

PPO/GRPO with Ray (sketch)

```bash
# Start a Ray head node on this machine.
ray start --head --node-ip-address 0.0.0.0

# Launch GRPO-style training; replace your/prompts with your own prompt dataset.
python3 -m openrlhf.cli.train_ppo_ray \
  --pretrain Qwen/Qwen2.5-7B-Instruct \
  --reward_pretrain OpenRLHF/Llama-3-8b-rm-mixture \
  --advantage_estimator group_norm \
  --vllm_num_engines 2 --vllm_tensor_parallel_size 1 \
  --actor_num_gpus_per_node 4 \
  --colocate_actor_ref \
  --prompt_data your/prompts --input_key prompt \
  --save_path ./ckpt --use_wandb "$WANDB_API_KEY"
```

| Flag | Controls |
| --- | --- |
| `--pretrain` | Base/actor model |
| `--reward_pretrain` | Reward model path |
| `--advantage_estimator group_norm` | Selects GRPO-style group normalization |
| `--vllm_num_engines` | Number of vLLM rollout engines |
| `--actor_num_gpus_per_node` | GPU allocation for the actor |
| `--colocate_actor_ref` | Place actor + reference together to save GPUs |
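GRPO-style group normalization samples several responses per prompt and standardizes each reward against its own group's mean and standard deviation, so no critic is needed. The sketch below shows the arithmetic only; the function name, the epsilon, and the use of the population standard deviation are assumptions rather than details taken from OpenRLHF.

```python
from statistics import mean, pstdev


def group_norm_advantages(group_rewards, eps=1e-6):
    """Standardize rewards within one prompt's group of sampled responses.

    Assumption: population std plus a small eps to avoid division by zero.
    """
    mu = mean(group_rewards)
    sigma = pstdev(group_rewards)
    return [(r - mu) / (sigma + eps) for r in group_rewards]


# Hypothetical rewards for four sampled responses per prompt.
rewards_by_prompt = {
    "prompt_a": [1.0, 0.0, 0.0, 1.0],
    "prompt_b": [0.2, 0.4, 0.9, 0.5],
}
for prompt, rewards in rewards_by_prompt.items():
    advantages = group_norm_advantages(rewards)
    print(prompt, [round(a, 3) for a in advantages])
```

Note that a group whose rewards are all identical yields zero advantages and therefore no gradient signal for that prompt.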

Algorithms

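| Algorithm | Flag/Module |
| --- | --- |
| PPO | Default in the `train_ppo*` modules |
| GRPO | `--advantage_estimator group_norm` |
| REINFORCE++ | `--advantage_estimator` with a value from the reinforce family (e.g. `reinforce`, `reinforce_baseline`) |
| RLOO | `--advantage_estimator rloo` |
| DPO / KTO | Dedicated `train_dpo` / `train_kto` modules |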
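For comparison with group normalization, RLOO (REINFORCE Leave-One-Out) uses as each sample's baseline the mean reward of the other samples drawn for the same prompt. The following is a minimal sketch of that baseline; `rloo_advantages` is a hypothetical helper, not an OpenRLHF function.

```python
def rloo_advantages(group_rewards):
    """Advantage = reward minus the mean reward of the other k-1 samples."""
    k = len(group_rewards)
    if k < 2:
        raise ValueError("RLOO needs at least two samples per prompt")
    total = sum(group_rewards)
    # (total - r) / (k - 1) is the leave-one-out mean for the sample with reward r.
    return [r - (total - r) / (k - 1) for r in group_rewards]


# Hypothetical rewards for four responses to one prompt.
print(rloo_advantages([1.0, 0.0, 0.0, 1.0]))
```

Unlike group normalization, this baseline does not divide by the group's standard deviation, so advantage magnitudes stay on the reward model's scale.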

Scaling & Performance

| Technique | Flag/Note |
| --- | --- |
| ZeRO stage | `--zero_stage 3` for large models |
| Offload | `--adam_offload` to spill optimizer state to CPU |
| Flash attention | `--flash_attn` |
| Colocation | `--colocate_actor_ref`, `--colocate_critic_reward` |
| Dynamic sampling | Built-in to improve sample efficiency |
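To see why ZeRO stage 3 matters for large models, the sketch below applies the standard ZeRO accounting for mixed-precision Adam: 2 bytes per parameter for half-precision weights, 2 for gradients, and 12 for fp32 optimizer state, with each stage sharding one more of these across GPUs. It is a rough estimate under those assumptions; it ignores activations, buffers, and fragmentation, and `zero_model_state_gb` is a hypothetical helper.

```python
def zero_model_state_gb(num_params, num_gpus, stage):
    """Approximate per-GPU model-state memory in GB (mixed-precision Adam).

    Assumed byte counts: 2 (weights) + 2 (gradients) + 12 (optimizer state).
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    params = 2 * num_params
    grads = 2 * num_params
    optim = 12 * num_params
    if stage >= 1:
        optim /= num_gpus   # stage 1 shards optimizer state
    if stage >= 2:
        grads /= num_gpus   # stage 2 also shards gradients
    if stage >= 3:
        params /= num_gpus  # stage 3 also shards the weights
    return (params + grads + optim) / 1e9


# Hypothetical setup: a 7B-parameter actor on 8 GPUs.
for stage in (0, 1, 2, 3):
    print(f"stage {stage}: {zero_model_state_gb(7e9, 8, stage):.2f} GB per GPU")
```

Offloading optimizer state to CPU removes most of the 12-bytes-per-parameter term from GPU memory, at the cost of extra host-device traffic.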

OpenRLHF vs verl vs ART

| Aspect | OpenRLHF | verl | ART |
| --- | --- | --- | --- |
| Foundation | Ray + DeepSpeed + vLLM | HybridFlow + FSDP/Megatron | Client/server + Unsloth |
| Strength | Production RLHF, broad algos | Throughput + flexibility | Agents in your own code |
| Multimodal | VLM RLHF (OpenRLHF-M) | Supported | Text-focused |
| Best for | Scalable RLHF pipelines | Research throughput | Single-agent on-the-job RL |

Resources