A new paper separates the model that improves an agent’s harness from the model that has to use it. The result is a useful warning for agent builders: cheap models may write decent harness updates, but the task-solving agent still needs enough capability to load and follow them.

Source note: Minhua Lin, Juncheng Wu, Zijun Wang, Zhan Shi, Yisi Sang, Bing He, Zewen Liu, Tianxin Wei, Zongyu Wu, Zhiwei Zhang, Dakuo Wang, Xiang Zhang, Benoit Dumoulin, Cihang Xie, Yuyin Zhou, Suhang Wang, Hanqing Lu. “Harness Updating Is Not Harness Benefit: Disentangling Evolution Capabilities in Self-Evolving LLM Agents.” arXiv:2605.30621, 2026-05-28. https://arxiv.org/abs/2605.30621

Why This Paper Matters

Most agent systems are not just a model plus a prompt. They accumulate a harness around the model: instructions, skills, memory, tool wrappers, checklists, retrieval rules, and task-specific procedures. That harness is where a lot of operational learning lives.

The obvious next step is self-evolution. Let the agent look at what happened, update its own harness, and become better next time.

This paper asks a sharper question: when that works, which model capability actually mattered?

There are two jobs that often get collapsed into one. One model reads execution evidence and writes better harness artifacts. Another model later solves tasks while using those artifacts. The paper calls these two capabilities harness-updating and harness-benefit. That distinction matters because the expensive model may be needed in a different role than builders assume: less for writing the updates, more for using them.

The Idea in Plain English

Think of an agent harness as the operating manual around the model. It can include skills, prompts, memory, or tools that tell the agent how to approach recurring work.

Harness-updating is the ability to improve the manual after watching previous attempts.

Harness-benefit is the ability to actually use the improved manual during the next attempt.

Those are not the same skill. A junior analyst might write a surprisingly useful checklist after observing a process. A senior operator might still be needed to use the checklist under pressure, notice when it applies, and follow it without drifting. In agent terms, writing the harness update may be easier than making the task-solving agent reliably act on it.

What the Researchers Tested

The researchers studied self-evolving LLM agents across three benchmarks:

  • SWE-bench Verified, for software engineering tasks.
  • MCP-Atlas, for tool-rich agent tasks.
  • SkillsBench, for tasks where reusable skills matter.

They varied two roles separately.

The first role was the evolver: the model that reads execution evidence and produces harness updates. The paper tested frontier, strong, mid-tier, and smaller models in that role, including Claude Opus 4.6, Claude Sonnet 4.6, Claude Haiku 4.5, Qwen3-235B-A22B, Qwen3-32B, GPT-OSS-120B, and Qwen3.5-9B.

The second role was the task-solving agent: the model that later receives the updated harness and tries to solve the actual benchmark tasks.

That split lets the paper ask two separate questions. Do stronger models write much better harness updates? And do stronger models benefit more from receiving those updates?
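
The role split can be pictured as a two-factor grid in which every evolver is paired with every agent. The sketch below is a hypothetical illustration, not the paper's code: the `Evolver` and `Agent` callables, the string-valued harness, and `run_grid` are all assumptions made for the example.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-ins (not from the paper): an evolver turns execution
# traces into a harness, modeled here as a plain string, and an agent
# returns a score for a fixed task set when given that harness.
Evolver = Callable[[List[str]], str]
Agent = Callable[[str], float]


def run_grid(
    evolvers: Dict[str, Evolver],
    agents: Dict[str, Agent],
    traces: List[str],
) -> Dict[Tuple[str, str], float]:
    """Score every (evolver, agent) pairing so the two roles vary independently."""
    # Each evolver writes its harness once; every agent then reuses it.
    harnesses = {name: evolve(traces) for name, evolve in evolvers.items()}
    return {
        (e_name, a_name): agents[a_name](harnesses[e_name])
        for e_name, a_name in product(evolvers, agents)
    }
```

Reading along one axis of the resulting grid (fixed agent, varying evolver) speaks to harness-updating; reading along the other (fixed evolver, varying agent) speaks to harness-benefit.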

What They Found

The headline result is counterintuitive: harness-updating was fairly flat across model capability, while harness-benefit depended heavily on the task-solving agent.

1. Smaller Evolvers Often Wrote Useful Updates

Different evolver models produced surprisingly similar gains. The reported gap between the best and worst evolver was at most 3.1 percentage points on any benchmark.

That does not mean all evolvers were identical. It means the expected ordering did not cleanly hold. For example, Qwen3-235B led on one benchmark but ranked last on another, while the smaller Qwen3.5-9B produced the highest SkillsBench gain in the reported evolver-side comparison.

The practical read is that harness-updating may be less bottlenecked by raw model capability than many teams would expect.

2. The Task-Solving Agent Still Dominated Final Performance

When the task-solving agent changed, final performance moved a lot. When only the evolver changed, performance moved much less.

The paper reports that, for a fixed agent, the spread across seven evolvers was small compared with the base-capability gap between agents. In one comparison, the within-agent spread was at most 5.1 percentage points, while the base gap between strong and weaker task-solving agents was far larger.

That points to a simple budgeting rule: if you have to spend expensive inference somewhere, spend it first on the agent doing the work.
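
The comparison behind that rule can be reproduced on your own evaluation data. The following is a minimal sketch, assuming scores are stored per (evolver, agent) pair and that a separate baseline score exists for each agent without any evolved harness. The function names and data layout are assumptions, not the paper's.

```python
from typing import Dict, Tuple

# (evolver, agent) -> benchmark score in percent, after harness evolution.
Grid = Dict[Tuple[str, str], float]


def within_agent_spread(grid: Grid, agent: str) -> float:
    """Best-minus-worst score across evolvers for one fixed agent."""
    scores = [score for (_, a), score in grid.items() if a == agent]
    if not scores:
        raise ValueError(f"no scores recorded for agent {agent!r}")
    return max(scores) - min(scores)


def base_gap(baseline: Dict[str, float], strong: str, weak: str) -> float:
    """Score gap between two agents before any evolved harness is applied."""
    return baseline[strong] - baseline[weak]


def evolver_matters_less(
    grid: Grid, baseline: Dict[str, float], strong: str, weak: str
) -> bool:
    """True when swapping evolvers moves scores less than swapping agents."""
    worst_spread = max(within_agent_spread(grid, a) for a in (strong, weak))
    return worst_spread < base_gap(baseline, strong, weak)
```

If `evolver_matters_less` holds on your own workload, that is a hint, not a proof, that budget is better spent on the task-solving agent than on the evolver.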

3. Benefit From Harnesses Was Non-Monotonic

The strongest agents did not always gain the most from better harnesses. Mid-tier agents often gained more.

On SWE-bench, the paper reports a maximum gain of 19.3 percentage points for Qwen3-235B, compared with 4.4 points for Qwen3-32B and 2.6 points for Opus 4.6. On MCP-Atlas, GPT-OSS-120B had the largest reported gain at 7.0 points. On SkillsBench, several weaker or mid-tier models gained more than the already strong ones.

The authors’ interpretation is reasonable. Very weak agents often cannot use the harness well. Very strong agents may already know enough, so the harness has less room to help. The middle is where the harness can add the most.

4. Weak Agents Failed at Loading and Following Skills

The most useful diagnostic in the paper is not just that weaker agents gained less. It is why.

On SkillsBench, stronger models had near-ceiling skill-load rates, around 0.957 to 0.961. GPT-OSS-120B loaded the relevant skill much less often, at 0.446. Qwen3-32B was lower still, at 0.251.

Following the loaded skill was also hard. The paper reports a harness-following rate of 0.757 for Opus 4.6, but only 0.142 for Qwen3-32B. Qwen3-235B loaded skills at a high rate, yet its following rate was much lower than that of Opus 4.6.

So the problem was not only memory retrieval or skill discovery. Some agents saw the relevant harness artifact and still failed to execute it faithfully.

Why It Happens

The mechanism is familiar to anyone who has built agent workflows.

Writing a persistent instruction from a trace is often a summarization and abstraction problem. The evolver can inspect a failure, notice the missing step, and write a reusable procedure.

Using that procedure inside a live task is harder. The task-solving agent has to recognize when the artifact applies, load it in the correct format, keep it active across a long horizon, reconcile it with the current environment, and resist drifting back to its own default plan.

That second part exercises tool use, retrieval, instruction hierarchy, working memory, and self-monitoring. A harness update can be correct and still fail operationally if the agent treats it as optional advice instead of a binding procedure.

What This Means for Builders

The immediate implication is that self-evolving agent systems should not treat the evolver as the only intelligence center.

For many systems, a cheaper model may be good enough to propose harness updates, especially if those updates are validated before being admitted into the live harness. The harder product work is making sure the task-solving agent can use the resulting artifacts.

That means builders should invest in:

  • explicit harness invocation protocols
  • skill routers that make loading failures visible
  • tests that check whether a loaded skill was followed
  • smaller, more executable harness artifacts
  • validation before harness updates persist
  • traces that distinguish “did not retrieve” from “retrieved but ignored”
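
The last item in that list can be made concrete with a small trace classifier. This is a sketch under stated assumptions: the `Episode` fields are hypothetical, the follow signal is assumed to come from some checker or judge, and the follow rate is computed only over episodes where the skill was loaded, which may not match the paper's exact definition.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Iterable


class Outcome(Enum):
    NOT_RETRIEVED = "did not retrieve"
    RETRIEVED_IGNORED = "retrieved but ignored"
    FOLLOWED = "retrieved and followed"


@dataclass(frozen=True)
class Episode:
    skill_loaded: bool    # did the agent load the relevant artifact?
    skill_followed: bool  # did a checker or judge deem it followed?


def classify(ep: Episode) -> Outcome:
    """Separate loading failures from following failures."""
    if not ep.skill_loaded:
        return Outcome.NOT_RETRIEVED
    return Outcome.FOLLOWED if ep.skill_followed else Outcome.RETRIEVED_IGNORED


def rates(episodes: Iterable[Episode]) -> Dict[str, float]:
    """Load rate over all episodes; follow rate over loaded episodes only."""
    counts = Counter(classify(ep) for ep in episodes)
    total = sum(counts.values())
    loaded = total - counts[Outcome.NOT_RETRIEVED]
    return {
        "load_rate": loaded / total if total else 0.0,
        "follow_rate_given_load": counts[Outcome.FOLLOWED] / loaded if loaded else 0.0,
    }
```

Keeping the two rates separate matters because they call for different fixes: a low load rate points at routing and invocation, while a low follow rate points at the artifact's form or the executor's instruction following.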

The paper also argues against a naive “frontier model writes the harness, cheaper model executes” pattern. The cheaper executor may be exactly where the system loses the benefit.

What This Means for Buyers and Operators

For buyers, the paper is a reminder to ask where learning actually lands.

An agent vendor may claim that the system improves from experience because it updates prompts, memories, or skills. That is not enough. The useful question is whether the live agent reliably invokes and follows those updates in future work.

For operators, the right evaluation is not just before-and-after benchmark performance. It is also process evidence:

  • Did the agent load the relevant harness artifact?
  • Did it follow the artifact step by step?
  • Did following it improve the outcome?
  • Did new harness updates create conflicts with older ones?
  • Can the system roll back a bad harness change?
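
The last two questions imply versioned harnesses with an admission gate. One possible shape is sketched below. `HarnessStore`, its methods, and the caller-supplied `score` function (for example, a held-out task evaluation) are hypothetical names for illustration, not an API from the paper or any library.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class HarnessStore:
    """Versioned harness with an admission gate; all names are hypothetical."""

    # Version history, oldest first; starts with an empty baseline harness.
    versions: List[str] = field(default_factory=lambda: [""])

    @property
    def live(self) -> str:
        """The harness currently served to the task-solving agent."""
        return self.versions[-1]

    def propose(
        self,
        candidate: str,
        score: Callable[[str], float],
        min_gain: float = 0.0,
    ) -> bool:
        """Admit the candidate only if it beats the live harness by more than min_gain."""
        if score(candidate) - score(self.live) > min_gain:
            self.versions.append(candidate)
            return True
        return False

    def rollback(self) -> str:
        """Drop the newest version, always keeping the initial baseline."""
        if len(self.versions) > 1:
            self.versions.pop()
        return self.live
```

The important design choice is that `score` should run the actual task-solving agent with the candidate harness, since an update that looks good on paper may still go unused by the executor.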

In production, the danger is false learning. The harness gets bigger, the traces look more sophisticated, but the agent’s behavior does not improve because the executor cannot use the added structure.

What to Watch Next

The next useful research thread is harness usability, not just harness generation.

The field should watch for benchmarks that test whether agents retrieve, invoke, and follow external skills under realistic task pressure. Builders should also watch for training methods that improve long-horizon instruction following after retrieval, because that is where the harness becomes operational rather than decorative.

It would also be useful to see more work on harness compilers: systems that take messy learned procedures and convert them into tighter, testable, tool-callable routines. If the executor struggles with long free-text skills, the right answer may be a more structured harness, not a larger evolver.
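
As a rough picture of what a more structured harness could look like, the sketch below represents a skill as an ordered list of expected tool calls and checks a run against it. This is an assumption-laden toy, not a design from the paper: the `Skill` type, the `followed` function, and the tool names in the usage note are invented for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Skill:
    name: str
    required_tools: Sequence[str]  # tool calls the procedure expects, in order


def followed(skill: Skill, tool_calls: Sequence[str]) -> bool:
    """True if required_tools appear in tool_calls as an ordered subsequence.

    Extra calls in between are allowed; missing or reordered steps are not.
    """
    remaining = iter(tool_calls)
    # `step in remaining` advances the iterator, so each match must come
    # after the previous one.
    return all(step in remaining for step in skill.required_tools)
```

For example, a hypothetical skill requiring `run_tests`, then `apply_patch`, then `run_tests` again would count as followed for the call sequence `read_file, run_tests, apply_patch, run_tests`, but not for `apply_patch, run_tests`. A check like this is far cruder than a judge model, but it is deterministic and cheap enough to run on every trace.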

Limitations and Caveats

The paper studies a specific harness evolution setup across three benchmarks. The result should not be read as proof that small evolvers are always enough.

The harness artifacts also differ across settings. SWE-bench and SkillsBench emphasize skills, while MCP-Atlas uses a broader mix of prompts, skills, memory, and tools. A different harness format could change the balance between updating and benefit.

Some of the diagnostic measurements, including whether agents followed a harness artifact, depend on judge-based analysis. That is useful, but it should be treated as an operational signal rather than ground truth.

The model list is also time-bound. New models, better tool-use training, or stronger native skill invocation could shift the curve. The durable lesson is the decomposition: separate the model that writes the update from the model that has to use it.

Source

Minhua Lin, Juncheng Wu, Zijun Wang, Zhan Shi, Yisi Sang, Bing He, Zewen Liu, Tianxin Wei, Zongyu Wu, Zhiwei Zhang, Dakuo Wang, Xiang Zhang, Benoit Dumoulin, Cihang Xie, Yuyin Zhou, Suhang Wang, Hanqing Lu. (2026). Harness Updating Is Not Harness Benefit: Disentangling Evolution Capabilities in Self-Evolving LLM Agents. arXiv preprint arXiv:2605.30621. Available at: https://arxiv.org/abs/2605.30621