ART (Agent Reinforcement Trainer) Cheatsheet

ART (Agent Reinforcement Trainer) by OpenPipe is an open-source framework for training LLM-based agents with reinforcement learning, primarily GRPO. Its defining idea is a split architecture: a lightweight client runs your agent’s rollouts in your own code through an OpenAI-compatible endpoint, while a server/backend handles inference (vLLM) and training (Unsloth-powered GRPO), optionally on a separate GPU machine. This makes it well suited to multi-turn, tool-using agents that need “on-the-job” training against a reward.

Reinforcement learning can be unstable and compute-hungry. Start small, log everything, and validate a reward function on a handful of rollouts before scaling.

Installation

Step | Command
Install | pip install openpipe-art
With extras (backend) | pip install "openpipe-art[backend]"
uv | uv add openpipe-art
Requirements | A CUDA GPU for the backend (training/inference)
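
A quick preflight check can confirm the install before any training code runs. This is a minimal sketch, not part of ART: the preflight helper is hypothetical, it assumes the package imports as art (as in the snippets in this cheatsheet), and it treats nvidia-smi on the PATH only as a hint that an NVIDIA driver is present, not as proof that CUDA is usable.

```python
import importlib.util
import shutil


def preflight() -> dict[str, bool]:
    """Report whether the art package and the nvidia-smi tool are visible."""
    return {
        # True if `import art` would find a module (does not import it).
        "art_installed": importlib.util.find_spec("art") is not None,
        # True if an nvidia-smi executable is on PATH (a hint, not a guarantee).
        "nvidia_smi_on_path": shutil.which("nvidia-smi") is not None,
    }


print(preflight())
```

If art_installed is False on the client machine, the install step did not land in the active environment. A missing nvidia-smi only matters on the machine that hosts the backend.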

Core Concepts

Term | Meaning
Model | A trainable model handle (art.TrainableModel) registered with a backend
Backend | Where inference + training run (local GPU or remote)
Rollout | One agent episode that produces a trajectory and a reward
Trajectory | The messages/tool-calls/choices ART scores and learns from
GRPO | Group Relative Policy Optimization, the default RL algorithm
Reward | A scalar your code assigns to a trajectory (higher = better)
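
To make "group relative" concrete, the sketch below normalizes each reward against the mean and standard deviation of its own group, which is the commonly described GRPO advantage. It is an illustration under that assumption, not ART's internal implementation; group_relative_advantages and eps are hypothetical names.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against its own group's mean and spread."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std dev of the group
    # eps keeps the division finite when every reward in the group is equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts of the same scenario with different rewards.
rewards = [0.0, 0.5, 1.0, 0.5]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Note the edge case: if every rollout in a group gets the same reward, all advantages are zero and that group contributes no learning signal. This is one reason to check reward variance on a few rollouts before scaling.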

Defining a Model

import art

model = art.TrainableModel(
    name="agent-001",
    project="my-agent",
    base_model="Qwen/Qwen2.5-7B-Instruct",
)

backend = art.LocalBackend()          # or a remote backend
await model.register(backend)         # needs an async context (notebook cell or async def)

Object | Purpose
art.TrainableModel(...) | The policy you are training
art.LocalBackend() | Run inference + training on the local GPU
model.register(backend) | Bind a model to a backend
model.openai_client() | OpenAI-compatible client for rollouts

Writing a Rollout

import art, weave

@weave.op
async def rollout(model: art.Model, scenario) -> art.Trajectory:
    client = model.openai_client()
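    messages = [{"role": "user", "content": scenario.prompt}]
    # Seed the trajectory with the prompt so it records the full conversation,
    # not just the model's reply.
    traj = art.Trajectory(messages_and_choices=list(messages), reward=0.0)
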
    completion = await client.chat.completions.create(
        model=model.name, messages=messages,
    )
    choice = completion.choices[0]
    traj.messages_and_choices.append(choice)

    traj.reward = score(choice.message.content, scenario)  # your reward fn
    return traj

Training Loop

# NUM_STEPS, GROUP_SIZE, and scenarios are defined by your own training script.
for step in range(NUM_STEPS):
    # Each scenario gets GROUP_SIZE rollouts; GRPO compares rewards within a group.
    groups = await art.gather_trajectory_groups(
        (art.TrajectoryGroup(rollout(model, s) for _ in range(GROUP_SIZE))
         for s in scenarios)
    )
    await model.train(groups, config=art.TrainConfig(learning_rate=1e-5))

Call | Description
art.TrajectoryGroup(...) | A group of rollouts compared against each other (GRPO)
art.gather_trajectory_groups(...) | Run rollouts concurrently and collect groups
model.train(groups, config=...) | One GRPO update from the gathered groups
art.TrainConfig(...) | Learning rate and training hyperparameters
model.get_step() | Current training step (for checkpointing/resume)
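
The "run concurrently, collect per scenario" shape of the loop can be sketched with plain asyncio. This is a stand-in for the pattern only, not ART's implementation: fake_rollout and gather_groups are hypothetical, and the rollout returns a random reward instead of calling a model.

```python
import asyncio
import random


async def fake_rollout(scenario: str) -> float:
    """Hypothetical stand-in for a rollout: returns only a reward."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return random.random()


async def gather_groups(scenarios: list[str], group_size: int) -> list[list[float]]:
    """Run group_size rollouts per scenario concurrently, grouped by scenario."""

    async def one_group(scenario: str) -> list[float]:
        rollouts = (fake_rollout(scenario) for _ in range(group_size))
        return list(await asyncio.gather(*rollouts))

    # Groups also run concurrently; results keep the order of `scenarios`.
    return list(await asyncio.gather(*(one_group(s) for s in scenarios)))


groups = asyncio.run(gather_groups(["a", "b", "c"], group_size=4))
print([len(g) for g in groups])
```

Keeping the results grouped by scenario matters because GRPO only compares rollouts that started from the same scenario.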

Reward Design Tips

Guideline | Why
Keep rewards bounded | Stabilizes GRPO advantage estimates
Reward the outcome, not the wording | Avoids reward hacking on phrasing
Add small shaping for tool success | Helps multi-step credit assignment
Use RULER for relative scoring | ART’s helper to rank trajectories in a group when no clean metric exists
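
The first three guidelines can be combined in one small reward function. The sketch below is an assumption-laden example, not an ART API: score_episode, its arguments, and the 0.1 shaping weight are all hypothetical, and a real reward would derive task_solved from your own task checker.

```python
def score_episode(
    task_solved: bool,
    tool_calls_ok: int,
    tool_calls_total: int,
    shaping_weight: float = 0.1,
) -> float:
    """Outcome-dominated reward in [0, 1] with a small tool-success bonus."""
    outcome = 1.0 if task_solved else 0.0
    # Fraction of tool calls that succeeded; 0.0 when no tools were called.
    tool_rate = tool_calls_ok / tool_calls_total if tool_calls_total else 0.0
    reward = (1.0 - shaping_weight) * outcome + shaping_weight * tool_rate
    return max(0.0, min(1.0, reward))  # clamp defensively to keep it bounded


# Solving the task with sloppy tool use still beats perfect tool use without solving it.
print(score_episode(True, 0, 4), score_episode(False, 4, 4))
```

Keeping the shaping weight small means the agent cannot earn more from clean tool calls than from actually solving the task.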

Observability

Tool | Integration
Weights & Biases | Native logging of reward/loss curves
Weave | Decorate rollouts with @weave.op for trace capture
Langfuse | Tracing supported for trajectory inspection

ART vs Other RL Trainers

Aspect | ART | OpenRLHF | verl
Focus | Multi-step agents | Scalable RLHF | High-throughput RL
Architecture | Split client/server | Ray + vLLM | Ray + vLLM
Backend | vLLM + Unsloth | vLLM | vLLM
Best for | Agents trained in your own code | Large-scale RLHF pipelines | Research throughput

Resources