For two years the story of customizing large language models was a story about supervised fine-tuning. You collected examples of good behavior, you ran LoRA or a full fine-tune, and the model learned to imitate them. That approach is mature, cheap, and well understood — and for a growing class of problems, it is not enough. When the thing you care about is an outcome rather than a style — did the agent solve the ticket, did the multi-step tool sequence actually retrieve the right answer, did the negotiation reach a deal — imitation hits a ceiling. You cannot collect supervised examples of the optimal action at every step of a long, branching interaction, because you do not know what the optimal action was. What you can do is let the agent act, score the result, and push it toward whatever produced the higher score. That is reinforcement learning, and in 2026 it has become a practical, accessible technique for training agents rather than an exotic research pursuit.
The shift has been driven largely by one algorithm and a wave of tooling around it. GRPO (Group Relative Policy Optimization) stripped away much of the machinery that made classic RLHF painful, and a set of open-source frameworks — ART, verl, and OpenRLHF — made it runnable without a research lab's infrastructure. This guide explains how reinforcement fine-tuning for agents actually works in 2026, compares the three frameworks most teams reach for, and offers concrete guidance on reward design and when RL is worth the trouble.
Why supervised fine-tuning runs out of road
Supervised fine-tuning (SFT) is, at its heart, next-token imitation. You show the model input-output pairs and it learns the conditional distribution of the outputs. For tasks where good behavior is well captured by examples — matching a tone, following a format, answering domain questions — this works beautifully and should remain your first move. It is cheaper, more stable, and easier to debug than anything involving RL.
The ceiling appears when good behavior is defined by a result that unfolds over many steps. Consider an agent that answers questions by searching internal documents: it issues a query, reads results, decides whether to search again, and finally composes an answer. The quality signal you actually have is whether the final answer was correct. You do not have a labeled "correct query to issue at step one given this partial context," because the right query depends on what comes back, which depends on the document store, which changes. SFT can teach the agent to imitate a few traces you happened to record, but it cannot teach it to optimize the end-to-end outcome across the enormous space of possible interactions. The agent overfits to the surface form of your examples instead of learning the underlying objective.
Reinforcement learning inverts the setup. Instead of demonstrating the right action, you let the agent take its own actions, observe the outcome, assign a reward, and adjust the policy to make high-reward behavior more likely. The agent explores, and the reward — not a fixed transcript — defines success. This is exactly the regime where multi-step, tool-using agents live, which is why RL has become the technique of choice for pushing agents past what SFT alone can reach.
GRPO: the algorithm that made this practical
The reason RL for LLMs felt out of reach for so long was PPO, the workhorse algorithm behind the original RLHF. PPO is powerful but operationally heavy: it requires training and serving a separate value (critic) model alongside the policy, roughly doubling memory and adding a second model to tune and keep stable. For most teams that overhead was prohibitive.
GRPO's key insight is that you can estimate how good an action was without a learned value function, by comparing several sampled responses to the same prompt against each other. You generate a group of completions, score them all, and use the group's average score as the baseline. A completion that beats the group average gets a positive advantage; one that falls below gets a negative one. The relative ranking within the group replaces the absolute value estimate that PPO's critic provided. No critic model, far less memory, and a training loop that is dramatically simpler to reason about.
This is why nearly every agent-RL framework in 2026 centers on GRPO. It made the difference between "you need a dedicated ML team and a cluster" and "you can run this on a single capable GPU with a sane amount of code." The frameworks below are, in large part, different opinions about how to wrap GRPO in usable infrastructure.
ART: reinforcement learning that lives in your code
ART (Agent Reinforcement Trainer) from OpenPipe takes the most agent-native stance of the three. Its defining design choice is a split between a client and a backend. The client runs your agent's rollouts — the actual episodes where the agent acts — inside your own application code, talking to the model through a standard OpenAI-compatible chat completions endpoint. The backend handles the heavy machinery: serving the model for inference with vLLM and running GRPO training with Unsloth-optimized kernels. The two halves can run on different machines, so your agent logic can stay on your laptop while training happens on a cloud GPU.
This architecture matters because it means you write rollouts the same way you already write agents. You call the model, let it use tools, capture the trajectory, and assign a reward with ordinary Python. ART then takes groups of those trajectories and performs GRPO updates. There is no need to reframe your agent as a special RL environment; the RL is wrapped around the code you would have written anyway. ART also ships a helper called RULER for relative scoring, which uses a model to rank trajectories within a group when you do not have a clean numeric metric — useful for the many real tasks where "better" is judgeable but not directly measurable.
ART is the right starting point when your goal is to improve a specific agent you have already built, especially a multi-turn, tool-using one, and you want to keep the rollout logic in your own environment. It targets best-in-class training efficiency for that single-agent, on-the-job-training use case rather than sprawling distributed pipelines.
verl: throughput and research flexibility
verl (Volcano Engine Reinforcement Learning) comes from a different direction: high-performance, large-scale RL for LLMs. Built around Ray for distribution and vLLM for fast generation, verl is engineered for throughput and for the flexibility researchers need to experiment with algorithms and reward schemes. It supports PPO, GRPO, and a growing family of variants, and it is designed to scale across many GPUs efficiently.
The trade-off is that verl exposes more of the RL machinery. You gain control over the training topology, the algorithm details, and the performance knobs, but you also take on more of the conceptual load. verl shines for teams doing serious, compute-intensive RL — training larger models, running many experiments, or pushing on algorithmic frontiers — where raw throughput and configurability justify the steeper setup. It is less of a "wrap my existing agent" tool and more of a research-and-scale platform.
OpenRLHF: production RLHF at scale
OpenRLHF bills itself as a high-performance, production-ready RLHF framework, also built on Ray and vLLM, with a unified agent-based design. It implements a broad menu of algorithms — PPO, GRPO, REINFORCE++, RLOO and more — with the optimization tricks that practical RLHF needs to stay stable at scale. Its lineage is the full RLHF pipeline: reward modeling, preference optimization, and policy training across distributed hardware.
OpenRLHF has kept pace with where the field is going. Its 2026 releases added multi-turn vision-language RL, letting teams train VLMs that reason over images across multiple steps end to end — a signal that agent RL is expanding beyond text into multimodal tool use. OpenRLHF is the natural choice when you need a mature, scalable RLHF stack with a wide algorithm selection and are comfortable operating a distributed system to get it.
Choosing among the three
The decision tracks the shape of your problem and your appetite for infrastructure. Reach for ART when you want to improve a specific agent you have already written, value keeping rollout logic in your own code, and prefer a split architecture that runs comfortably on modest hardware. Reach for verl when throughput and algorithmic flexibility dominate — large models, many experiments, a research bent — and you can absorb a more hands-on setup. Reach for OpenRLHF when you need a production-grade, broadly capable RLHF platform at scale, including multimodal RL, and you have the operational capacity to run a Ray-based distributed system.
All three converge on the same engine room — GRPO for the algorithm, vLLM for fast generation — so the choice is less about raw capability and more about the abstraction level you want to work at. A useful mental model: ART wraps RL around your agent, while verl and OpenRLHF ask you to bring your agent into their RL platform.
A concrete picture of the training loop
It helps to make the abstraction tangible. Imagine training a document-research agent — the kind that answers a question by searching an internal knowledge base, reading results, and composing an answer. Under GRPO the loop looks like this. For each training question you sample a group of complete agent episodes, say eight of them. Each episode is a full rollout: the agent issues searches, reads results, decides whether to keep searching, and produces a final answer. Because sampling is stochastic, the eight episodes differ — some find the right document quickly, some wander, some answer confidently but wrongly.
You then score each episode with your reward function, producing eight numbers. GRPO computes the group's average and assigns each episode an advantage equal to how far above or below average it landed. The two episodes that nailed the answer get positive advantages; the three that hallucinated get negative ones. The policy update nudges the model to make the high-advantage behavior more likely and the low-advantage behavior less likely — across every token of every episode in the group. Repeat over many questions and many steps, and the agent gradually shifts its whole strategy toward whatever earns reward: better queries, knowing when to stop searching, grounding answers in retrieved text.
What makes this powerful for agents specifically is that the reward only has to judge the final outcome. You never had to label the correct query at step one. The agent discovered, through comparison and reinforcement, that certain query patterns lead to higher-reward endings. That is the thing SFT cannot do, expressed as a loop you can actually run. ART structures this as trajectory groups gathered concurrently; verl and OpenRLHF express the same idea through their Ray-based rollout workers. The vocabulary differs, but the group-relative comparison at the heart of GRPO is identical across all three.
Hardware and cost expectations
Reinforcement fine-tuning is heavier than SFT, and it is worth setting expectations before you start. The dominant cost is generation: every training step requires sampling whole groups of multi-step rollouts, and for a tool-using agent each rollout may involve several model calls plus the latency of the tools themselves. This is why every serious framework leans on vLLM — fast batched inference is not a nicety here, it is the difference between a training run that finishes overnight and one that does not finish at all.
For a small model in the 3–8B range with LoRA-style adapters, a single modern data-center GPU is often enough to see real signal, especially with ART's Unsloth-optimized backend, which is tuned for exactly this single-GPU efficiency. Scaling to larger models or larger group sizes pushes you toward the multi-GPU, Ray-based topologies that verl and OpenRLHF are built for. A practical sequence is to prototype the reward and the rollout on the smallest viable model locally, confirm the reward curve trends upward on a tiny dataset, and only then commit cloud GPUs to a larger run. The split client/server design ART promotes is convenient precisely because it lets the prototype rollout code stay unchanged when you move the backend to bigger hardware.
Reward design is the real work
Whichever framework you choose, the framework is not where your project will succeed or fail. The reward function is. Reinforcement learning optimizes exactly what you reward, which means a sloppy reward gets you an agent that is excellent at the wrong thing — the phenomenon known as reward hacking. A few principles consistently help.
Keep rewards bounded and well scaled. GRPO works from relative advantages within a group, and wildly varying or unbounded rewards make those advantage estimates noisy and training unstable. Reward the outcome rather than the wording: if you score how an answer is phrased, the agent will learn to phrase rather than to solve. Where multi-step credit assignment is hard, small shaping rewards for intermediate successes — a tool call that returned useful data, a retrieval that hit the right document — can help the agent discover good strategies without dictating them. And validate your reward on a handful of hand-inspected rollouts before scaling: read what the agent actually did to earn a high score, and confirm it matches your intent. Almost every RL failure traces back to a reward that measured something subtly different from what the team meant.
Finally, respect the cost and instability that come with RL. It is more compute-hungry and more finicky than SFT. Start with the smallest model and dataset that can show signal, log reward and loss curves obsessively (all three frameworks integrate with Weights & Biases), and only scale once you trust the reward and the trend. RL is a powerful tool for the specific job of optimizing outcomes — and a frustrating one if reached for before SFT has been exhausted.
The bottom line
Reinforcement fine-tuning crossed into the mainstream in 2026 because GRPO removed the critic-model overhead that made RLHF impractical, and because ART, verl, and OpenRLHF turned the algorithm into usable infrastructure. Use SFT first; it remains the cheaper, more stable default. Turn to RL when success is an outcome that unfolds over many steps and cannot be captured by imitation. Pick ART to wrap RL around an agent you already have, verl for throughput and research flexibility, and OpenRLHF for scalable, multi-capable production RLHF. Then spend the bulk of your effort not on the framework but on the reward function — because in reinforcement learning, you get exactly what you ask for.
References and Resources
Frameworks
- ART (Agent Reinforcement Trainer) — GitHub and the launch post
- verl — GitHub
- OpenRLHF — GitHub
- vLLM and Unsloth (inference + training backends)
Algorithms and background
Related 1337skills cheatsheets
Further reading