Research Explainers May 26, 2026 7 min read

Train the Agent Where It Actually Runs

Agentic RL gets much more useful when the training loop can follow the agent into its real tool harness, not a simplified lab copy of it.

Source note: Binfeng Xu, Hao Zhang, Shaokun Zhang, Songyang Han, Mingjie Liu, Jian Hu, Shizhe Diao, Zhenghui Jin, Yunheng Zou, Michael Demoret, Jan Kautz, Yi Dong. “Polar: Agentic RL on Any Harness at Scale.” arXiv:2605.24220, 2026-05-22. https://arxiv.org/abs/2605.24220

Why This Paper Matters

Reinforcement learning for agents has a plumbing problem.

The interesting agents no longer live inside neat benchmark loops. They run through coding CLIs, browser harnesses, terminal workflows, MCP servers, custom context managers, subprocesses, tool schemas, patch submission flows, and evaluators. The harness is not just a wrapper around the model. It shapes what the model sees, how it acts, when context gets compacted, how tools are exposed, and what a valid answer even looks like.

That creates an awkward gap. RL infrastructure wants a clean environment interface. Real agents often arrive as messy working software. If every harness has to be rewritten into the trainer’s preferred environment API, the training setup can stop matching the product setup. Worse, the training loop may lose the exact token-level behavior that should receive credit or blame.

Polar is interesting because it moves the integration boundary. Instead of asking every agent harness to become an RL environment, it treats the harness as a black box and watches the one thing every LLM agent has to do: call a model.

The Idea in Plain English

Polar is a training recorder placed at the model endpoint.

The agent still runs in its native harness. Codex runs like Codex. Claude Code runs like Claude Code. Qwen Code and Pi keep their own execution paths. But their model requests are routed through Polar’s proxy. That proxy forwards the request to an inference backend, returns a response in the provider shape the harness expects, and records the token-level data needed for training.

The useful move is that Polar does not need to understand the whole agent. It does not need to own the planner, the tool loop, the context policy, or the patching workflow. It needs to preserve API compatibility and capture the model interactions faithfully enough to reconstruct training trajectories afterward.

In plain English: if the agent’s harness is the factory floor, Polar trains by installing a precise meter on the power line rather than rebuilding the factory.

What the Researchers Tested

The paper tests Polar in two settings.

First, it uses Polar for online RL on software-engineering tasks. The authors start with the same Qwen3.5-4B base checkpoint and train it with GRPO across four coding harnesses: Codex, Claude Code, Qwen Code, and Pi. Training uses SWE-Gym data, and evaluation uses SWE-Bench Verified. Polar captures the harness’s model traffic, reconstructs trajectories, and feeds those traces into the trainer.

Second, it uses Polar as an offline data-generation system. A larger Qwen3.5-122B-A10B checkpoint drives the Pi coding harness over SWE-Gym tasks. Polar records the resulting multi-turn trajectories, then keeps only those where the final patch passes the SWE-Bench evaluator.

The paper also tests a systems question that matters for scaling: how to turn many captured model calls from a long agent session into trainer-ready traces without flooding the trainer with tiny request-level samples.

What They Found

Polar’s strongest result is not that a specific algorithm wins. It is that harness-native training becomes possible without rewriting the harness.

The same base model improved across four coding harnesses

On SWE-Bench Verified, Polar-trained Qwen3.5-4B improved under all four evaluated harnesses:

Codex: 3.8% to 26.4% pass@1, a 22.6 point gain.
Claude Code: 29.8% to 34.6%, a 4.8 point gain.
Qwen Code: 34.6% to 35.2%, a 0.6 point gain.
Pi: 34.2% to 40.4%, a 6.2 point gain.

The Codex result is the headline because it starts from a weak fit. A Qwen model is not naturally fluent in Codex’s action protocol, context behavior, and patch workflow. Training through the actual Codex harness gives the model feedback on the behavior it must use at test time.

The training curves improved inside the harnesses

The paper reports pass@1 reward rising over training. Codex moved from a 9.5% average over the first ten steps to 54.5% over the last ten. Claude Code moved from 28.8% to 67.0%. Qwen Code rose from 61.6% to 66.0%, while Pi rose from 61.6% to 76.2%.

Those are training-curve numbers rather than the final benchmark table, but they make the same point: the model is learning the harness-mediated behavior, not merely improving on a detached benchmark prompt.

Token-faithful merging mattered for throughput

Long-running agents can make many model calls. A conservative approach is to treat every completion as a separate trace. That preserves each call, but it can drown the trainer in small fragments.

Polar’s prefix-merging strategy reconstructs longer traces when the conversation history is append-only, while splitting naturally around subagents, compaction, rewritten prompts, and independent branches. In the ablation, prefix merging reduced three training steps from 1,185 request-level updates to 218 merged-trace updates. Wall time fell from 189.5 minutes to 35.2 minutes, and average rollout GPU utilization rose from 20.4% to 87.7%.

Polar also generated usable SFT data

In the offline case study, Polar generated 504 accepted SWE-Gym trajectories from 1,638 attempts, a 30.8% acceptance rate, using the Pi harness and a Qwen3.5-122B-A10B teacher. Acceptance was defined narrowly: the final patch had to pass the SWE-Bench evaluator’s FAIL_TO_PASS and PASS_TO_PASS tests. The run cost roughly 64 GPU-hours.

Why It Happens

The mechanism is credit assignment at the right boundary.

Agent harnesses change the model’s behavior. They decide how tasks are framed, which tools are available, how tool outputs are inserted, how much context is retained, and how final answers are submitted. If training happens through a simplified wrapper, the reward may be attached to behavior that does not match the production or evaluation path.

Polar instead records the actual sampled tokens produced during the harness run. Its trajectory builder keeps the trainable tokens tied to the behavior policy and masks out tokens that were merely reconstructed from canonical prompt rendering or interstitial context. The paper states the invariant plainly: every trainable token should match the behavior policy during rollout, and non-generated tokens should be masked out.

That sounds technical, but the operational consequence is simple. The trainer is optimizing the tokens the model really emitted inside the agent’s workflow.

What This Means for Builders

For agent builders, the paper points toward a cleaner separation between product harness and training infrastructure.

The harness can stay close to the product: real CLI, real tool schemas, real context management, real evaluator, real patch submission path. The training layer can observe from the model endpoint and turn those runs into traces. That makes RL less dependent on a perfect environment abstraction and more compatible with the messy agent software people are already building.

It also raises the bar for model API proxies. A proxy is no longer just a routing convenience. In this design, it becomes the measurement layer for training. It must preserve provider compatibility, capture token IDs and log probabilities, handle streaming expectations, and record enough provenance to reconstruct traces after the run.

The most practical builder takeaway is this: if your agent’s value lives in the harness, do not throw that harness away during training. Train through it.

What This Means for Buyers and Operators

For buyers and operators, Polar makes a useful distinction between model quality and harness quality.

A coding agent is not just a base model. It is a model plus a harness plus tools plus runtime isolation plus evaluator. The paper’s Codex result is a good example: the same base model performed poorly before harness-native training and much better afterward. That suggests organizations should be careful about judging agent performance from model benchmarks alone.

It also implies that future enterprise agents may need training and evaluation systems that can sit alongside real operating workflows. If an agent works through a terminal, browser, IDE, ticketing system, or internal toolchain, the improvement loop has to see that workflow. Otherwise, the organization may optimize a lab version of the agent and deploy a different system.

The risk is that this kind of training infrastructure is complex. Proxying model calls, capturing traces, evaluating patches in clean runtimes, and assigning rewards correctly are all failure-prone. Polar reduces one kind of integration burden, but it does not make RL for agents turnkey.

What to Watch Next

The field should watch whether model-endpoint capture becomes a standard pattern for agent training. If many harnesses can be trained by routing model calls through a proxy, the agent ecosystem may converge around fewer training contracts and more harness-native evaluation.

Builders should watch credit assignment. The paper reports reward hacking when per-request traces received session-level rewards without stronger normalization or process rewards. That is a serious warning. Capturing the right tokens is necessary, but it does not solve the question of which intermediate decisions deserve credit.

Buyers should watch for evidence beyond one benchmark family. SWE-style patch verification is unusually clean because tests can provide a binary signal. Many enterprise workflows have softer outcomes, delayed feedback, and human review. The next question is whether the same architecture works when the reward is messier than “the patch passed.”

Limitations and Caveats

This is still a software-engineering-heavy result. The experiments center on coding harnesses, SWE-Gym, SWE-Bench Verified, and patch evaluation. That is a strong testbed for agentic RL infrastructure, but it is not the whole agent world.

The design assumes the harness can be pointed at a compatible model endpoint. That is often true for modern LLM agents, but not always cleanly true for hosted, locked-down, or highly proprietary systems.

The proxy also depends on token-level capture from the inference backend. If the serving stack cannot supply reliable token IDs, log probabilities, or enough metadata, the training signal gets weaker.

Finally, Polar helps with the integration boundary. It does not remove the hard parts of RL: reward design, evaluator quality, distribution shift, safety, compute cost, and deciding whether a behavior improvement is actually useful outside the benchmark.

Source

Binfeng Xu, Hao Zhang, Shaokun Zhang, Songyang Han, Mingjie Liu, Jian Hu, Shizhe Diao, Zhenghui Jin, Yunheng Zou, Michael Demoret, Jan Kautz, Yi Dong. (2026). Polar: Agentic RL on Any Harness at Scale. arXiv preprint arXiv:2605.24220. Available at: https://arxiv.org/abs/2605.24220

Research Browse Research & Deep Dives

Move through market maps, company deep dives, cross-profile patterns, papers, reports, and technical explainers.

Start Here Find the best entry point

Use the site map to choose a path through AI, operations, strategy, profiles, and series.

Topic Explore AI systems

Read essays on AI adoption, agents, business systems, and the changing shape of work.