ART (Agent Reinforcement Trainer) Cheatsheet
ART (Agent Reinforcement Trainer) by OpenPipe is an open-source framework for training LLM-based agents with reinforcement learning, primarily GRPO. Its defining idea is a split architecture: a lightweight client runs your agent’s rollouts in your own code through an OpenAI-compatible endpoint, while a server/backend handles inference (vLLM) and training (Unsloth-powered GRPO), optionally on a separate GPU machine. This makes it well suited to multi-turn, tool-using agents that need “on-the-job” training against a reward.
Reinforcement learning can be unstable and compute-hungry. Start small, log everything, and validate a reward function on a handful of rollouts before scaling.
Installation
| Step | Command |
|---|---|
| Install | pip install openpipe-art |
| With extras (backend) | pip install "openpipe-art[backend]" |
| uv | uv add openpipe-art |
| Requirements | A CUDA GPU for the backend (training/inference) |
Core Concepts
| Term | Meaning |
|---|---|
| Model | A trainable model handle (art.TrainableModel) registered with a backend |
| Backend | Where inference + training run (local GPU or remote) |
| Rollout | One agent episode that produces a trajectory and a reward |
| Trajectory | The messages, tool calls, and model choices recorded during a rollout; ART trains on these |
| GRPO | Group Relative Policy Optimization, the default RL algorithm |
| Reward | A scalar your code assigns to a trajectory (higher = better) |
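To make the "group relative" part of GRPO concrete, here is a minimal sketch of the commonly described advantage calculation: each reward is normalized against the mean and standard deviation of its own group. The function name `group_relative_advantages` and the `eps` term are illustrative assumptions, not ART's internal code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against its group's mean and standard deviation.

    Hypothetical helper for illustration; ART's trainer may differ in detail.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std dev over the group
    # eps avoids division by zero when every rollout earned the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts of the same scenario: two succeeded, two failed.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group where every rollout scored the same carries no learning signal.
flat = group_relative_advantages([0.5, 0.5, 0.5])
```

Under this formulation, successes get a positive advantage and failures a negative one, while a group with identical rewards yields all-zero advantages. That is one reason to pick scenarios and rewards that produce some spread within each group.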
Defining a Model
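import art

# The policy to train: a named, project-scoped handle on a base model.
model = art.TrainableModel(
    name="agent-001",
    project="my-agent",
    base_model="Qwen/Qwen2.5-7B-Instruct",
)

backend = art.LocalBackend()  # or a remote backend
# register() is a coroutine: run it in a notebook cell or inside an async function.
await model.register(backend)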
| Object | Purpose |
|---|---|
| art.TrainableModel(...) | The policy you are training |
| art.LocalBackend() | Run inference + training on the local GPU |
| model.register(backend) | Bind a model to a backend |
| model.openai_client() | OpenAI-compatible client for rollouts |
Writing a Rollout
import art
import weave

@weave.op
async def rollout(model: art.Model, scenario) -> art.Trajectory:
    client = model.openai_client()
    messages = [{"role": "user", "content": scenario.prompt}]
    # Seed the trajectory with the prompt so training sees the full conversation.
    traj = art.Trajectory(messages_and_choices=list(messages), reward=0.0)
    completion = await client.chat.completions.create(
        model=model.name, messages=messages,
    )
    choice = completion.choices[0]
    # Append the Choice object itself, not just its text.
    traj.messages_and_choices.append(choice)
    traj.reward = score(choice.message.content, scenario)  # your reward fn
    return traj
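The rollout calls a `score` function that you supply. As a minimal sketch, assuming each scenario carries a reference answer, an outcome-based reward might look like the following. The `Scenario` dataclass and its `expected_answer` field are hypothetical; ART does not prescribe a scenario type.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Scenario:
    prompt: str
    expected_answer: str  # hypothetical field, not part of ART


def score(output: Optional[str], scenario: Scenario) -> float:
    """Return a reward in [0, 1] based on the answer, not the phrasing."""
    if not output:
        return 0.0  # empty or missing completion
    normalized = output.strip().lower()
    target = scenario.expected_answer.strip().lower()
    if normalized == target:
        return 1.0  # exact match
    if target in normalized:
        return 0.5  # correct answer buried in extra text
    return 0.0


capital = Scenario(prompt="What is the capital of France?", expected_answer="Paris")
exact = score("paris", capital)
partial = score("The answer is Paris.", capital)
wrong = score("Lyon", capital)
```

Running it on three or four hand-written outputs like this is a cheap way to validate the reward before spending GPU time on rollouts.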
Training Loop
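# NUM_STEPS, GROUP_SIZE, and scenarios are defined by your training script.
for step in range(NUM_STEPS):
    # One group per scenario; each group holds GROUP_SIZE rollouts of that scenario.
    groups = await art.gather_trajectory_groups(
        (
            art.TrajectoryGroup(rollout(model, s) for _ in range(GROUP_SIZE))
            for s in scenarios
        )
    )
    await model.train(groups, config=art.TrainConfig(learning_rate=1e-5))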
| Call | Description |
|---|---|
| art.TrajectoryGroup(...) | A group of rollouts compared against each other (GRPO) |
| art.gather_trajectory_groups(...) | Run rollouts concurrently and collect groups |
| model.train(groups, config=...) | One GRPO update from the gathered groups |
| art.TrainConfig(...) | Learning rate and training hyperparameters |
| model.get_step() | Current training step (for checkpointing/resume) |
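The shape of the data flowing through the loop (several rollouts per scenario, all run concurrently) can be sketched with plain `asyncio`. This is a stand-in to illustrate the pattern, not ART's implementation: `fake_rollout` and `gather_groups` are hypothetical names, and the rollout returns a dict with a seeded random reward instead of calling a model.

```python
import asyncio
import random


async def fake_rollout(scenario: str, seed: int) -> dict:
    """Stand-in for a real rollout: yields control, then returns a random reward."""
    await asyncio.sleep(0)  # where a real rollout would await model inference
    rng = random.Random(seed)
    return {"scenario": scenario, "reward": rng.random()}


async def gather_groups(scenarios: list[str], group_size: int) -> list[list[dict]]:
    """Run group_size rollouts per scenario, with all rollouts in flight at once."""

    async def one_group(scenario: str) -> list[dict]:
        return await asyncio.gather(
            *(fake_rollout(scenario, seed) for seed in range(group_size))
        )

    return await asyncio.gather(*(one_group(s) for s in scenarios))


groups = asyncio.run(gather_groups(["scenario-a", "scenario-b"], group_size=4))
```

Each inner list corresponds to one group: the rollouts in it share a scenario, so their rewards can be compared against each other.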
Reward Design Tips
| Guideline | Why |
|---|---|
| Keep rewards bounded | Stabilizes GRPO advantage estimates |
| Reward the outcome, not the wording | Avoids reward hacking on phrasing |
| Add small shaping for tool success | Helps multi-step credit assignment |
| Use RULER for relative scoring | ART’s helper to rank trajectories in a group when no clean metric exists |
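As one way to combine the first three guidelines, the sketch below mixes an outcome reward with a small shaping term for tool success, and stays inside [0, 1] by construction. The function name, its parameters, and the default weight of 0.1 are illustrative assumptions rather than anything ART defines.

```python
def shaped_reward(
    task_solved: bool,
    tool_calls_ok: int,
    tool_calls_total: int,
    shaping_weight: float = 0.1,  # assumed value; tune for your task
) -> float:
    """Blend outcome and tool-success rate into a reward bounded in [0, 1]."""
    outcome = 1.0 if task_solved else 0.0
    # Fraction of tool calls that succeeded; 0.0 when no tools were called.
    tool_rate = tool_calls_ok / tool_calls_total if tool_calls_total else 0.0
    # Convex combination: the outcome dominates, shaping only nudges.
    return (1.0 - shaping_weight) * outcome + shaping_weight * tool_rate


solved_cleanly = shaped_reward(True, tool_calls_ok=2, tool_calls_total=2)
failed_but_tools_worked = shaped_reward(False, tool_calls_ok=1, tool_calls_total=2)
failed_no_tools = shaped_reward(False, tool_calls_ok=0, tool_calls_total=0)
```

Keeping the shaping weight small means a rollout that fails the task can never outscore one that solves it, which limits the room for reward hacking on tool usage alone.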
Observability
| Tool | Integration |
|---|---|
| Weights & Biases | Native logging of reward/loss curves |
| Weave | Decorate rollouts with @weave.op for trace capture |
| Langfuse tracing | Supported for trajectory inspection |
ART vs Other RL Trainers
| Aspect | ART | OpenRLHF | verl |
|---|---|---|---|
| Focus | Multi-step agents | Scalable RLHF | High-throughput RL |
| Architecture | Split client/server | Ray + vLLM | Ray + vLLM |
| Backend | vLLM + Unsloth | vLLM | vLLM |
| Best for | Agents trained in your own code | Large-scale RLHF pipelines | Research throughput |
Resources