Agent Harnesses Can Learn From Their Own Failures

The paper’s practical point: agent performance is not just about the base model. It is also about the harness around the model, and that harness may be something the agent can improve itself.

Source note: Hangfan Zhang, Shao Zhang, Kangcong Li, Chen Zhang, Yang Chen, Yiqun Zhang, Lei Bai, Shuyue Hu. “Self-Harness: Harnesses That Improve Themselves.” arXiv:2606.09498, June 8, 2026. https://arxiv.org/abs/2606.09498

Why This Paper Matters

Most debates about AI agents still start with the model.

Which model is smarter? Which one reasons better? Which one writes better code? Which one can use tools without getting lost?

Those questions matter, but they are incomplete. An agent is not a naked model dropped into the world. It is a model inside a harness: the prompts, tools, runtime policies, memory, verification rules, parsers, permissions, and recovery procedures that let the model interact with an environment.

That surrounding layer can make the same base model look much better or much worse. A model may fail not because it lacks intelligence, but because it keeps retrying a broken command, forgets to create the required output file, loses an environment variable across shell calls, or does not know when exploration should turn into implementation.

Today, humans usually fix those problems. Engineers watch the traces, notice the pattern, rewrite the system prompt, add a rule, patch the tool wrapper, or change the runtime. That works, but it does not scale neatly. Every new model has different habits. Every agent environment creates different failure modes.

This paper asks a sharp question: can the agent use its own failures to improve the harness it runs inside?

The Idea in Plain English

The easiest way to understand the paper is to separate the model from the operating system around the model.

The model is the reasoning engine. The harness is the operating setup that tells the model what tools exist, how to call them, how to interpret results, when to stop, how to recover from failure, and how to prove that the task is done.
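One way to make the separation concrete is to picture the harness as plain data that sits next to the model. The sketch below is illustrative only: the `Harness` class, its fields, and the example rule are hypothetical names chosen for this article, not structures taken from the paper.

```python
from __future__ import annotations
from dataclasses import dataclass


@dataclass(frozen=True)
class Harness:
    """Hypothetical container for everything that surrounds the model."""
    version: int
    system_prompt: str
    tools: tuple[str, ...]        # names of the tools the model may call
    max_steps: int                # a simple "when to stop" policy
    rules: tuple[str, ...] = ()   # recovery and verification rules

    def with_rule(self, rule: str) -> Harness:
        """Return a new harness version with one extra rule.

        The original object is never mutated, so older versions stay available.
        """
        return Harness(
            version=self.version + 1,
            system_prompt=self.system_prompt,
            tools=self.tools,
            max_steps=self.max_steps,
            rules=self.rules + (rule,),
        )


base = Harness(
    version=0,
    system_prompt="You are a terminal agent.",
    tools=("read_file", "write_file", "shell"),
    max_steps=50,
)
patched = base.with_rule("Create the required output file before exploring further.")
```

The model is absent from this object on purpose. Everything in it can be edited, versioned, and tested without touching the model's weights.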

Earlier harness improvement usually fits one of two patterns.

The first is human harness engineering. A person studies agent failures and edits the harness by hand.

The second is external optimization. A stronger or separate agent improves the harness for a weaker target agent. The paper discusses this as a Meta-Harness-style paradigm: one system supervises and improves another.

Self-Harness is different. The target agent improves its own harness. It does not use a stronger external model as the designer. It does not ask a human to write the patches. The same fixed model, operating under its current harness, studies its own execution traces and proposes small edits to the harness that governs its future behavior.

The important word is small. This is not a model freely rewriting itself. It is a bounded loop for turning observed failures into targeted harness changes, then accepting only the changes that survive regression tests.

What the Researchers Tested

The authors tested Self-Harness on Terminal-Bench-2.0, a benchmark for command-line and containerized software tasks. That choice matters because the tasks have verifier-grounded outcomes. Either the agent produces the required artifact and passes the tests, or it fails.

They used a minimal initial harness with basic file and shell capabilities. Then they tested three base models from different families: MiniMax M2.5, Qwen3.5-35B-A3B, and GLM-5.

The Self-Harness loop has three stages.

First is Weakness Mining. The agent runs on held-in tasks and produces execution traces. The system studies failed traces and clusters recurring failure patterns. The goal is not merely to say “the task failed.” It is to identify the behavioral mechanism: repeated failed commands, missing output files, poor dependency checks, broken environment handling, endless exploration, or tool-call loops.
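A rough sketch of what mining recurring patterns from failed traces could look like. In the paper the agent itself studies the traces. The version below swaps in simple hand-written detectors so the idea is runnable, and the trace format, pattern names, and thresholds are all assumptions made for illustration.

```python
from collections import Counter


def detect_patterns(trace):
    """Label one failed trace with the failure mechanisms it shows.

    Assumed trace format: {"steps": [(command, exit_code), ...],
                           "output_created": bool}
    """
    patterns = set()
    failed_commands = Counter(cmd for cmd, code in trace["steps"] if code != 0)
    # The same command failing three or more times looks like a retry loop.
    if any(count >= 3 for count in failed_commands.values()):
        patterns.add("repeated_failed_command")
    if not trace["output_created"]:
        patterns.add("missing_output_file")
    return patterns


def mine_weaknesses(failed_traces, min_support=2):
    """Keep only patterns that recur in at least `min_support` traces."""
    tally = Counter()
    for trace in failed_traces:
        tally.update(detect_patterns(trace))
    return {name: count for name, count in tally.items() if count >= min_support}


failed_traces = [
    {"steps": [("make", 2), ("make", 2), ("make", 2)], "output_created": False},
    {"steps": [("pytest", 1), ("ls", 0)], "output_created": False},
    {"steps": [("pip install x", 1)] * 3, "output_created": True},
]
weaknesses = mine_weaknesses(failed_traces)
```

The `min_support` filter reflects the emphasis on recurring patterns: a failure seen once is an anecdote, while a failure seen across several traces is a candidate for a harness edit.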

Second is Harness Proposal. The agent proposes candidate edits to its own harness. These edits are supposed to be minimal, diverse, and tied to specific mined weaknesses. A good proposal is not “be better.” It is a concrete change to prompts, tool handling, middleware, decomposition, or runtime behavior.

Third is Proposal Validation. Candidate edits are tested. The system accepts a change only if it improves performance without causing measurable regressions. Accepted changes are merged into the next harness version, and the loop repeats.
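The acceptance rule can be sketched as a small function. This is one plausible reading of "improves performance without causing measurable regressions", not the paper's exact criterion: here a regression means any task that passed under the baseline harness and fails under the candidate. The function name and the per-task result format are assumptions.

```python
def accept_edit(baseline, candidate):
    """Decide whether a candidate harness edit should be promoted.

    Both arguments map task id -> bool (passed or not) over the same task set.
    """
    # Tasks that used to pass and no longer do.
    regressions = [
        task for task, passed in baseline.items()
        if passed and not candidate.get(task, False)
    ]
    gained = sum(candidate.values()) - sum(baseline.values())
    return gained > 0 and not regressions


baseline = {"a": True, "b": False, "c": False}

fixes_b = {"a": True, "b": True, "c": False}        # one new pass, nothing broken
breaks_a = {"a": False, "b": True, "c": True}       # net gain, but "a" regressed
no_change = {"a": True, "b": False, "c": False}     # no improvement

decisions = [accept_edit(baseline, c) for c in (fixes_b, breaks_a, no_change)]
```

The second candidate is the interesting case. It solves more tasks overall but is still rejected, because it breaks something that already worked.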

The authors evaluate both held-in and held-out tasks. Held-in tasks provide traces for mining and proposal. Held-out tasks are not used as inputs to the improvement loop, so they test whether the harness changes generalize beyond the failures that inspired them.
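A minimal sketch of keeping the two task sets apart. The split fraction, seed, and task names are placeholders chosen for illustration, not the paper's actual split.

```python
import random


def split_tasks(task_ids, held_out_fraction=0.3, seed=0):
    """Deterministically split tasks into (held_in, held_out) lists."""
    ids = sorted(task_ids)              # sort first so input order does not matter
    random.Random(seed).shuffle(ids)    # fixed seed keeps the split reproducible
    n_out = max(1, round(len(ids) * held_out_fraction))
    return ids[n_out:], ids[:n_out]


tasks = [f"task-{i:02d}" for i in range(10)]
held_in, held_out = split_tasks(tasks)

# Only held_in traces may feed mining and proposal.
# held_out is touched once, at the end, to measure generalization.
```

The design point is that the split is fixed before the loop starts. If held-out tasks leak into mining or validation, the held-out score stops measuring generalization.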

What They Found

The headline result is simple: Self-Harness improved all three models.

On held-in tasks, MiniMax M2.5 improved from 43.0% to 50.0%. Qwen3.5-35B-A3B improved from 15.1% to 36.0%. GLM-5 improved from 47.7% to 57.0%.

The held-out results are more important. MiniMax M2.5 improved from 40.5% to 61.9%. Qwen3.5-35B-A3B improved from 23.8% to 38.1%. GLM-5 improved from 42.9% to 57.1%.
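The gains are easier to compare as percentage-point differences. The snippet below only does arithmetic on the pass rates quoted above. The dictionary layout and function name are this article's own.

```python
# Pass rates in percent, copied from the figures quoted above: (before, after).
scores = {
    "MiniMax M2.5":    {"held_in": (43.0, 50.0), "held_out": (40.5, 61.9)},
    "Qwen3.5-35B-A3B": {"held_in": (15.1, 36.0), "held_out": (23.8, 38.1)},
    "GLM-5":           {"held_in": (47.7, 57.0), "held_out": (42.9, 57.1)},
}


def point_gain(before, after):
    """Absolute improvement in percentage points, rounded to one decimal."""
    return round(after - before, 1)


gains = {
    model: {split: point_gain(*pair) for split, pair in splits.items()}
    for model, splits in scores.items()
}

for model, g in gains.items():
    print(f"{model}: held-in +{g['held_in']} pts, held-out +{g['held_out']} pts")
```

On these numbers the held-out gains are 21.4, 14.3, and 14.2 points, and the largest held-in gain is 20.9 points for Qwen3.5-35B-A3B.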

Those held-out gains suggest the loop was not merely memorizing a benchmark split. The harness edits captured failure patterns that also mattered on unseen tasks.

The fixes were model-specific

The paper’s qualitative examples are the most useful part.

MiniMax M2.5 benefited from rules that pushed it to create required output files earlier, handle structured tool outputs more carefully, and stop unproductive tool loops before they burned the whole budget.

Qwen3.5-35B-A3B needed a different set of fixes. Its harness changes focused on checking dependencies before running commands, avoiding repeated failed edits, breaking exploration cycles, and making sure required artifacts still existed after tool errors.

GLM-5 had its own failure pattern. It needed better handling of environment settings across shell commands and stronger pressure to move from exploration into implementation and testing.

That is the point of the paper. The best harness is not generic. It depends on the model’s actual behavior under real task pressure.

The harness changed more than the prompt

Self-Harness did not only add longer instructions.

The accepted changes included structural mechanisms such as subagent-style decomposition and middleware for handling errors. That matters because many agent failures live below the level of prose instructions. The issue may be how tool outputs are formatted, how errors are surfaced, how tasks are decomposed, or how the runtime tells the model that a loop has gone stale.
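To show what a fix below the level of prose can look like, here is a small middleware sketch that wraps a tool and tells the model when a loop has gone stale. It illustrates the general idea and is not code from the paper. The `LoopGuard` class, the `(ok, output)` tool contract, and the warning text are all assumptions.

```python
class LoopGuard:
    """Wrap a tool and flag consecutive identical failing calls."""

    def __init__(self, tool, max_repeats=3):
        self.tool = tool                # callable: command -> (ok, output)
        self.max_repeats = max_repeats
        self.last_command = None
        self.repeats = 0                # current streak of identical failures

    def __call__(self, command):
        ok, output = self.tool(command)
        if not ok and command == self.last_command:
            self.repeats += 1
        else:
            # A success or a different command resets the streak.
            self.repeats = 0 if ok else 1
        self.last_command = command
        if self.repeats >= self.max_repeats:
            output += (
                f"\n[harness] This command has failed {self.repeats} times "
                "in a row. Try a different approach."
            )
        return ok, output


def always_fails(command):
    """Stand-in for a shell tool whose command keeps failing."""
    return False, f"error running {command!r}"


guard = LoopGuard(always_fails, max_repeats=3)
outputs = [guard("make")[1] for _ in range(3)]
```

The model's prompt never changes here. The runtime changes what the model sees after the third identical failure, which is the kind of structural edit a longer instruction cannot replace.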

This is where the paper is strongest. It treats harness design as software, not prompt decoration.

Regression testing was the control system

The paper is careful about validation. A self-improving loop without tests is just a way to automate drift.

Self-Harness promotes edits only after regression testing. That constraint is doing a lot of work. It means the agent can propose changes, but it does not get to declare victory. The verifier decides.

That is why Terminal-Bench-2.0 is a useful setting. The system can run the modified agent and compare outcomes against previous behavior. In fuzzier domains, this becomes much harder.

Why It Happens

Self-Harness works because it turns vague agent failure into a concrete engineering loop.

The agent does not simply read a failed final score. It gets execution traces. Those traces show how the failure happened: what command was repeated, what file was missing, where the agent lost state, or why it kept exploring instead of acting.

That evidence changes the nature of the problem. The agent is no longer guessing at better instructions. It is proposing a targeted patch against a recurring failure mechanism.

The proposal constraints also matter. The paper asks for diverse but minimal modifications. That reduces the chance that the system makes a sweeping change that helps one task and damages many others.

Finally, validation closes the loop. Weakness Mining supplies evidence. Harness Proposal supplies candidate edits. Proposal Validation decides whether the edit survives contact with tasks. The result is not open-ended self-transformation. It is test-driven harness repair.
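The three stages can be sketched as one loop. This is a toy model under heavy assumptions: the harness is just a set of rule names, each task passes only if the rule it needs is present, and the `run`, `mine`, and `propose` functions are hypothetical stand-ins for the real agent, trace analysis, and proposal step.

```python
def accept(baseline, candidate):
    """Promote only if more tasks pass and no passing task breaks."""
    regressed = any(baseline[t] and not candidate[t] for t in baseline)
    return sum(candidate.values()) > sum(baseline.values()) and not regressed


def self_harness_loop(harness, run, mine, propose, rounds=3):
    for _ in range(rounds):
        baseline = run(harness)                    # collect results and traces
        weaknesses = mine(baseline)                # stage 1: Weakness Mining
        if not weaknesses:
            break
        for candidate in propose(harness, weaknesses):   # stage 2: Harness Proposal
            result = run(candidate)
            if accept(baseline, result):           # stage 3: Proposal Validation
                harness, baseline = candidate, result
    return harness


# Toy environment: each task names the rule it needs (None means no rule needed).
TASKS = {"t1": "rule_a", "t2": "rule_b", "t3": None}


def run(harness):
    return {t: need is None or need in harness for t, need in TASKS.items()}


def mine(results):
    return sorted(TASKS[t] for t, passed in results.items() if not passed)


def propose(harness, weaknesses):
    # One minimal edit per mined weakness.
    return [harness | {w} for w in weaknesses]


final = self_harness_loop(frozenset(), run, mine, propose)
```

In this toy run the first round accepts `rule_a` and rejects the standalone `rule_b` candidate, because that candidate drops a task the newly accepted harness already passes. The second round proposes `rule_b` on top of the accepted harness, and that version is promoted.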

What This Means for Builders

For builders, the lesson is that agent improvement may move from prompt writing to harness infrastructure.

If an agent can help improve its own operating layer, then the human job shifts. The valuable work becomes building the evaluation environment: clean task definitions, reliable verifiers, representative traces, regression suites, and strict promotion rules.

That is a very different operating model from “keep tweaking the prompt until it feels better.” It is closer to software engineering. The harness is a versioned artifact. Failures create evidence. Candidate edits are proposed. Tests decide whether they ship.

It also suggests that model-specific harnesses will become normal. A company may not want one generic agent wrapper for every model. It may want a loop that adapts the wrapper to each model’s actual failure profile.

The practical question becomes: where do you have enough objective feedback to let the system safely improve itself?

Coding tasks, terminal tasks, data transformations, workflow automation, document extraction, and operational checks are good candidates. They produce artifacts that can be tested. Strategy, management advice, creative judgment, and customer communication are harder because the pass/fail signal is softer.

What This Means for Buyers and Operators

For operators, the paper points to a less obvious source of vendor lock-in.

Teams often think they are locked into a model because the model is uniquely capable. Sometimes that is true. But sometimes the lock-in is really harness lock-in. The workflow, prompts, tool adapters, and recovery logic have been tuned around one model’s habits.

If harnesses can adapt to a new model through verifier-grounded testing, switching models becomes more plausible. A cheaper or faster model might be viable if the harness can learn where that model needs more structure.

That does not mean buyers should expect automatic portability. The evaluation layer has to be strong. If the tests are weak, the self-improvement loop can optimize the wrong thing. If the domain has hidden quality requirements, the agent may improve measured scores while making the workflow worse in ways the verifier does not catch.

The procurement lesson is simple: ask vendors how their agents improve after failure. If the answer is “we manually tune prompts,” that may be fine today, but it is labor-intensive. If the answer is “the system mines traces, proposes bounded harness edits, and regression-tests them,” that is a more scalable operating model.

What to Watch Next

The next thing to watch is whether this works outside command-line software tasks.

Terminal-Bench-2.0 gives the loop clean signals. Many business workflows do not. A sales research agent, support agent, procurement agent, or analyst agent may produce work that is partly subjective. The system can still collect traces, but validation becomes more difficult.

The field should also watch whether self-harness loops become part of agent platforms. A mature platform may not only run tasks. It may maintain a harness changelog, mine recurring failures, propose patches, run regression tests, and roll back bad edits.
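A platform feature like that could be as simple as an append-only history with rollback. The sketch below is hypothetical: the `HarnessHistory` class and its method names are invented for illustration, and a harness is represented here as a plain dictionary.

```python
class HarnessHistory:
    """Append-only record of promoted harness versions."""

    def __init__(self, initial):
        self._versions = [(initial, "initial harness")]

    @property
    def current(self):
        return self._versions[-1][0]

    def promote(self, harness, note):
        """Record a harness that passed regression testing."""
        self._versions.append((harness, note))

    def rollback(self):
        """Drop the newest version, but never the initial one."""
        if len(self._versions) > 1:
            self._versions.pop()
        return self.current

    def changelog(self):
        return [f"v{i}: {note}" for i, (_, note) in enumerate(self._versions)]


history = HarnessHistory({"rules": []})
history.promote({"rules": ["create output file early"]}, "fix missing output files")
history.promote({"rules": ["create output file early", "stop stale loops"]},
                "add loop guard")
log = history.changelog()
restored = history.rollback()   # undo the loop-guard edit
```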

The other thing to watch is how much of the improvement comes from simple prompt changes versus deeper runtime changes. The more the gains come from middleware, decomposition, tool handling, and verification policy, the more agent engineering starts to look like a real systems discipline.

Limitations and Caveats

This is a controlled result, not proof of general recursive self-improvement.

The agent is not changing its neural weights. It is changing the non-parametric harness around the model. That is powerful, but bounded.

The benchmark is also narrow. Terminal-Bench-2.0 is useful because it is objective, but it is still a software and terminal-task environment. Results may not transfer cleanly to domains with ambiguous success criteria.

The study uses three base models. That is enough to show model-specific adaptation, but not enough to prove the method works across the full range of frontier, open, small, and specialized models.

There is also an overfitting risk. Held-out gains are a good sign, but any self-improvement loop can learn the shape of its evaluator. If the verifier misses important quality dimensions, the harness may improve the score while creating new risks.

The paper’s strongest claim is therefore practical rather than metaphysical: agents can use their own verified failures to improve the harness around them. That is not artificial general intelligence improving itself without bounds. It is still a big deal.

Source

Hangfan Zhang, Shao Zhang, Kangcong Li, Chen Zhang, Yang Chen, Yiqun Zhang, Lei Bai, Shuyue Hu. (2026). Self-Harness: Harnesses That Improve Themselves. arXiv preprint arXiv:2606.09498. Available at: https://arxiv.org/abs/2606.09498