The paper’s practical point is that fixing many agent failures does not require a new model. In deterministic workflows, the better move may be to fix the runtime harness: the layer that determines what the model sees, which tools exist, how actions are executed, and how the agent recovers from mistakes.

Source note: Tianshi Xu, Huifeng Wen, and Meng Li. “Adapting the Interface, Not the Model: Runtime Harness Adaptation for Deterministic LLM Agents.” arXiv:2605.22166, submitted May 21, 2026, revised May 27, 2026. https://arxiv.org/abs/2605.22166

Why This Paper Matters

An agent is not just a model with tools attached. It is a model inside an operating loop.

The environment sends observations. The runtime describes tools. The model proposes actions. An executor applies those actions. The environment returns feedback. The agent keeps going until it succeeds, fails, loops, or runs out of budget.
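
The loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the toy door-and-key environment, the two scripted policies, and names such as run_episode and StepResult are all invented here to show the four ways an episode can end.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class StepResult:
    observation: str      # feedback text returned by the environment
    done: bool = False    # True once the task is solved
    failed: bool = False  # True on an unrecoverable error


def run_episode(
    policy: Callable[[str], str],
    env_step: Callable[[str], StepResult],
    first_observation: str,
    max_steps: int = 10,
    max_repeats: int = 3,
) -> Tuple[str, List[str]]:
    """Run one episode and return (outcome, actions proposed by the policy)."""
    observation = first_observation
    actions: List[str] = []
    for _ in range(max_steps):
        action = policy(observation)          # the model proposes an action
        actions.append(action)
        # Loop check: the same action proposed max_repeats times in a row.
        if len(actions) >= max_repeats and len(set(actions[-max_repeats:])) == 1:
            return "loop", actions
        result = env_step(action)             # the executor applies it
        if result.done:
            return "success", actions
        if result.failed:
            return "failure", actions
        observation = result.observation      # feedback drives the next step
    return "budget_exhausted", actions


def make_toy_env() -> Callable[[str], StepResult]:
    """Hypothetical toy environment: the door opens only after the key is picked up."""
    state = {"has_key": False}

    def step(action: str) -> StepResult:
        if action == "pick key":
            state["has_key"] = True
            return StepResult("you hold the key")
        if action == "open door":
            if state["has_key"]:
                return StepResult("door open", done=True)
            return StepResult("door is locked")
        return StepResult("unknown action")

    return step


def careful_policy(observation: str) -> str:
    return "open door" if "hold the key" in observation else "pick key"


def stubborn_policy(observation: str) -> str:
    return "open door"  # ignores feedback, so it loops on the locked door


print(run_episode(careful_policy, make_toy_env(), "you are in a room"))
print(run_episode(stubborn_policy, make_toy_env(), "you are in a room"))
```

The careful policy ends in "success" after two actions, while the stubborn one is stopped with "loop" on its third identical proposal. Everything outside the policy function is runtime, which is the part the paper argues deserves more attention.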

Most improvement work still focuses on the model: scale it, fine-tune it, distill it, reinforce it, or teach it better tool use. That makes sense, but it misses many practical failures. In many rule-governed domains, the model already has enough latent ability. The problem is that the interface around it is confusing, underspecified, brittle, or poor at helping the model recover.

This paper proposes Life-Harness, a runtime harness that improves frozen LLM agents without changing model weights or evaluation environments. It learns from training trajectories, identifies recurring interface failures, converts them into reusable harness interventions, and then freezes those interventions for evaluation on unseen tasks.

The result is a useful shift in emphasis: do not only ask whether the model is smart enough. Ask whether the runtime is giving the model a usable operating surface.

The Idea in Plain English

Imagine a human employee who keeps making mistakes in a rigid internal system. They click the wrong button because the UI label is misleading. They retry the same failed path because the error message is vague. They miss a policy rule because it is buried in a manual. They type an action in a way the system rejects.

You could send that employee to training. Or you could fix the interface: clarify the labels, add a checklist, validate inputs before submission, and interrupt loops when the same mistake repeats.

Life-Harness takes the second approach, applied to LLM agents.

It does not retrain the model. It modifies the runtime interface around the model in four places. The Environment Contract layer clarifies tool descriptions and constraints before interaction. The Procedural Skill layer retrieves task-relevant procedures from prior trajectories. The Action Realization layer validates and canonicalizes model actions before execution. The Trajectory Regulation layer watches for repetition, stagnation, invalid retries, and budget exhaustion, then triggers recovery.
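
To make one of these layers concrete, here is a rough sketch of what an Action Realization step could look like. It is an assumption-laden illustration, not the paper's code: the JSON action format, the TOOL_SCHEMA table, and the realize_action function are hypothetical, and a real harness would derive its rules from observed failures.

```python
import json

# Hypothetical tool schema: tool name -> required argument names and types.
TOOL_SCHEMA = {
    "get_order": {"order_id": str},
    "cancel_order": {"order_id": str, "reason": str},
}


def realize_action(raw: str):
    """Validate and canonicalize a raw model action.

    Returns (action, error); exactly one of the two is None.
    """
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return None, "action is not valid JSON"
    if not isinstance(parsed, dict):
        return None, "action must be a JSON object"

    # Canonicalize the tool name: trim, lower-case, unify separators.
    name = str(parsed.get("tool", "")).strip().lower().replace("-", "_")
    if name not in TOOL_SCHEMA:
        return None, f"unknown tool '{name}'; available: {sorted(TOOL_SCHEMA)}"

    args = parsed.get("args", {})
    if not isinstance(args, dict):
        return None, "args must be a JSON object"

    clean = {}
    for key, expected_type in TOOL_SCHEMA[name].items():
        if key not in args:
            return None, f"missing argument '{key}' for tool '{name}'"
        value = args[key]
        # Canonicalize numeric IDs to strings instead of rejecting them.
        if expected_type is str and isinstance(value, (int, float)) and not isinstance(value, bool):
            value = str(value)
        if not isinstance(value, expected_type):
            return None, f"argument '{key}' must be {expected_type.__name__}"
        clean[key] = value.strip() if isinstance(value, str) else value

    # Arguments not named in the schema are dropped silently in this sketch.
    return {"tool": name, "args": clean}, None


print(realize_action('{"tool": " Get-Order ", "args": {"order_id": 1042}}'))
print(realize_action("cancel order 1042"))
```

The first call is repaired into a canonical get_order action with a string ID. The second is rejected with a specific message before it ever reaches the environment, so the model gets actionable feedback rather than an opaque failure.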

The key is that the harness is evolved from failures. Training trajectories reveal where the model and environment misunderstand each other. Life-Harness turns those repeated failures into fixed, auditable runtime interventions.

What the Researchers Tested

The authors evaluate Life-Harness on deterministic agent environments where rules, tools, and evaluation criteria are stable.

They use seven task scenarios across three benchmark suites: tau-bench, tau2-bench, and AgentBench. The seven scenarios are Airline, Retail, Telecom, ALFWorld, WebShop, OS interaction, and DBBench.

They evolve harnesses using only Qwen3-4B-Instruct trajectories. Then they freeze the resulting harnesses and evaluate them across 18 model backbones, including Qwen-family models, Llama-family models, and xLAM tool-use-trained models.

This setup matters because it tests whether the harness captures reusable environment-side structure. If the harness only worked for the source model, the result would look more like model-specific prompt tuning. If it transfers to 17 other models, it suggests the harness is fixing the interface.

What They Found

Runtime harnessing improved most model-environment settings

The headline result is broad. Life-Harness improves 116 out of 126 model-environment settings across 18 backbones, with an average relative improvement of 88.5%.

The aggregate benchmark numbers are strong. On ALFWorld, average Pass@1 rises from 41.1% to 75.7%, an 84% relative gain. WebShop moves from 31.4% to 44.0%. OS interaction moves from 34.7% to 41.2%. DBBench moves from 48.4% to 64.6%.
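
The relative gain quoted for ALFWorld is the absolute change divided by the starting score. As a quick check, the same arithmetic can be applied to all four averages reported above. The helper name relative_gain is ours, not the paper's, and the only inputs are the numbers already quoted.

```python
def relative_gain(before: float, after: float) -> float:
    """Relative improvement in percent: (after - before) / before * 100."""
    return (after - before) / before * 100.0


# Average Pass@1 before and after Life-Harness, as quoted in the text.
reported = {
    "ALFWorld": (41.1, 75.7),
    "WebShop": (31.4, 44.0),
    "OS interaction": (34.7, 41.2),
    "DBBench": (48.4, 64.6),
}

gains = {name: round(relative_gain(b, a), 1) for name, (b, a) in reported.items()}

for name, (before, after) in reported.items():
    points = round(after - before, 1)
    print(f"{name}: +{points} points, {gains[name]}% relative")
```

This gives roughly 84.2% for ALFWorld, 40.1% for WebShop, 18.7% for OS interaction, and 33.5% for DBBench. These are ratios of benchmark averages. The 88.5% headline figure is presumably an average of per-setting relative improvements, and a mean of ratios need not equal a ratio of means, so the two should not be expected to match.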

On tau-bench Airline, Pass@1 rises from 49.7% to 62.6%, and the stricter Pass^3 metric rises from 34.7% to 52.2%. On Retail, Pass@1 rises from 56.2% to 61.8%. On tau2-bench Telecom, Pass@1 rises from 55.3% to 69.0%, while Pass^3 rises from 41.5% to 52.6%.
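
The summary does not define Pass^3. Assuming it follows the tau-bench convention, where Pass^k is the chance that k independent trials of the same task all succeed, a common estimator averages C(c, k) / C(n, k) over tasks, with c successes out of n trials per task. The sketch below uses invented toy counts purely to show why Pass^3 is stricter than Pass@1; it does not reproduce any number from the paper.

```python
from math import comb
from typing import List


def pass_hat_k(successes_per_task: List[int], n_trials: int, k: int) -> float:
    """Estimate Pass^k: the average over tasks of C(c, k) / C(n, k).

    For k = 1 this reduces to the plain success rate.
    """
    total = sum(comb(c, k) / comb(n_trials, k) for c in successes_per_task)
    return total / len(successes_per_task)


# Invented example: 4 tasks, 3 trials each, with 3, 2, 3, and 0 successes.
toy_successes = [3, 2, 3, 0]

print(round(pass_hat_k(toy_successes, n_trials=3, k=1), 3))
print(round(pass_hat_k(toy_successes, n_trials=3, k=3), 3))
```

In this toy case the single-trial rate is about 0.667 while Pass^3 is 0.5, because the task solved in only two of three trials contributes nothing once all three must succeed. That is why a gain on Pass^3 says more about consistency than a gain on Pass@1 does.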

Those 116 improved settings amount to 92% of the 126 model-benchmark settings, even though the harnesses were evolved from only one source model.

Prompt evolution is not enough

The authors compare Life-Harness with prompt-only evolving, where the system iteratively optimizes the input prompt.

Prompt evolution helps, but Life-Harness does much better. The paper reports that Life-Harness adds an average relative improvement of 120% over prompt-only evolving.

That gap is the paper’s core argument in miniature. Agent performance depends on the full interaction loop, not just the first instruction. A better prompt can tell the model what to do. A better harness can change how the model sees tools, how invalid actions are repaired, how procedural knowledge is retrieved, and how failing trajectories are redirected.

For multi-step agents, the runtime is part of the policy.

All four harness layers matter

The ablation study removes each layer of Life-Harness and measures the damage.

The results vary by task, which is exactly the point. Some environments fail mainly because action realization is brittle. Others fail because trajectory loops are not interrupted. Others need clearer contracts or procedural skills.

Removing Action Realization causes large drops on Airline and OS, reported as 61.7% and 59.6% respectively relative to the full harness. Removing Trajectory Regulation causes an 86.5% drop on ALFWorld and a 36.2% drop on Telecom. Removing the Contract or Skill layer also hurts across multiple benchmarks.

The lesson is that agent failures are not one kind of problem. They occur at different lifecycle stages, so the runtime needs multiple intervention points.

Harnessing complements model training

The paper also asks whether specialized tool-use training makes harnessing unnecessary. Its answer is no.

The authors compare Qwen2.5-32B-Instruct with Life-Harness against xLAM-2-32B, a tool-use-trained derivative. They report that Qwen2.5-32B with Life-Harness beats xLAM-2-32B by 7.5 percentage points on the in-domain tau-bench setting.

They also apply Life-Harness to xLAM itself and see improvements across evaluated benchmark groups, ranging from 6.8 to 28.9 percentage points.

This is a clean conceptual split. Training changes model parameters. Harnessing changes the runtime interface. In deterministic domains, both can help, and they are not substitutes.

Why It Happens

Life-Harness works because deterministic agent tasks contain stable structure outside the model.

An airline workflow has policies. A telecom support task has tool constraints. A database task has executable actions and invalid actions. A web shopping task has repeated procedural patterns. An OS task has command formats. These structures do not need to live entirely inside model weights.

If the runtime can expose them clearly, validate them before execution, and recover when the trajectory goes bad, the same model can behave much better.
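
The "recover when the trajectory goes bad" part can be illustrated with a small monitor. This is a simplified sketch under our own assumptions: the TrajectoryRegulator class, its thresholds, and the hint strings are hypothetical, and the paper's Trajectory Regulation layer may use different signals.

```python
from collections import Counter
from typing import Optional


class TrajectoryRegulator:
    """Watch executed steps and emit a recovery hint when the run stalls."""

    def __init__(self, max_repeats: int = 3, max_steps: int = 30) -> None:
        self.max_repeats = max_repeats
        self.max_steps = max_steps
        self.steps = 0
        self.failed_actions: Counter = Counter()

    def observe(self, action: str, succeeded: bool) -> Optional[str]:
        """Record one executed step; return a hint for the model, or None."""
        self.steps += 1
        if self.steps >= self.max_steps:
            return "budget: step limit reached, give the best final answer now"
        if succeeded:
            return None
        self.failed_actions[action] += 1
        count = self.failed_actions[action]
        if count >= self.max_repeats:
            return f"loop: '{action}' failed {count} times, try a different action"
        return None


regulator = TrajectoryRegulator(max_repeats=2, max_steps=5)
for action, ok in [("cat /etc/missing", False), ("cat /etc/missing", False), ("ls /etc", True)]:
    print(action, "->", regulator.observe(action, ok))
```

The second failed retry of the same command triggers a "loop" hint, which the runtime could append to the next observation. The design choice here is that the regulator never edits the model's action; it only adds feedback, which keeps the intervention easy to audit.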

The model still matters. But the harness reduces wasted capability. It prevents predictable interface errors from consuming the agent’s budget. It turns recurring failures into reusable runtime knowledge.

That is why transfer matters. If a harness evolved from Qwen3-4B-Instruct helps many other models, it is probably not just teaching one model a trick. It is encoding something about the environment.

What This Means for Builders

Builders should stop treating the agent harness as plumbing.

The harness is where the model meets reality. It contains tool descriptions, action schemas, feedback handling, retry policies, memory, skill retrieval, stopping rules, and recovery behavior. If that layer is sloppy, stronger models will still fail in boring ways.

For deterministic workflows, builders should collect failure trajectories and classify them by lifecycle stage. Was the tool contract unclear? Was the needed procedure missing? Was the action malformed? Did the agent loop after feedback? Each class points to a different intervention.
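
A first pass at this triage can be a handful of rules over logged trajectories. The sketch below assumes a hypothetical log format (a dict with a list of steps, each carrying an action string and an optional error label) and hypothetical error names; the rule order is a judgment call, not something taken from the paper.

```python
from collections import Counter
from enum import Enum
from typing import Dict, List


class Stage(Enum):
    CONTRACT = "environment contract"
    SKILL = "procedural skill"
    ACTION = "action realization"
    TRAJECTORY = "trajectory regulation"


def classify_failure(trajectory: Dict) -> Stage:
    """Map one failed trajectory to the layer most likely to need a fix."""
    steps = trajectory["steps"]
    errors = [step.get("error") for step in steps]
    # Was the action malformed?
    if "malformed_action" in errors:
        return Stage.ACTION
    # Did the agent loop? Here: any action issued three or more times.
    action_counts = Counter(step["action"] for step in steps)
    if any(count >= 3 for count in action_counts.values()):
        return Stage.TRAJECTORY
    # Was the tool contract unclear? Here: a rule or constraint was violated.
    if "constraint_violation" in errors:
        return Stage.CONTRACT
    # Otherwise assume the needed procedure was missing.
    return Stage.SKILL


def summarize(failures: List[Dict]) -> Counter:
    """Count failures per stage so the largest bucket can be fixed first."""
    return Counter(classify_failure(t) for t in failures)


batch = [
    {"steps": [{"action": "get_order(", "error": "malformed_action"}]},
    {"steps": [{"action": "look", "error": None}] * 3},
    {"steps": [{"action": "cancel_order", "error": "constraint_violation"}]},
    {"steps": [{"action": "search", "error": None}]},
]
for stage, count in summarize(batch).items():
    print(stage.value, count)
```

Even a crude classifier like this turns a pile of failed runs into a prioritized list: the biggest bucket shows which layer to work on next.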

This also argues for harness versioning. A good harness is not a pile of ad hoc patches. It is a reusable interface artifact that should be tested, audited, and frozen for evaluation.
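
One lightweight way to treat a harness as a versioned artifact is to make it immutable and give it a content fingerprint. The following is a sketch of that idea only: the HarnessVersion fields and the example rule strings are hypothetical, and the paper does not prescribe a storage format.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Tuple


@dataclass(frozen=True)
class HarnessVersion:
    """Immutable snapshot of one environment's harness (fields are illustrative)."""

    environment: str
    contract_notes: Tuple[str, ...] = ()
    skills: Tuple[str, ...] = ()
    action_rules: Tuple[str, ...] = ()
    regulation_rules: Tuple[str, ...] = ()

    def fingerprint(self) -> str:
        """Short, stable hash of the content, suitable for logging with eval runs."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]


v1 = HarnessVersion(
    environment="airline",
    contract_notes=("cancellations need a reason",),
    action_rules=("lower-case tool names",),
)
v2 = HarnessVersion(
    environment="airline",
    contract_notes=("cancellations need a reason",),
    action_rules=("lower-case tool names", "cast numeric ids to strings"),
)

print(v1.fingerprint() != v2.fingerprint())
```

Because the dataclass is frozen, a harness cannot be patched in place during an evaluation run. Any change produces a new object with a new fingerprint, which can be recorded next to benchmark results.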

What This Means for Buyers and Operators

For buyers, the paper gives a better way to evaluate agent vendors.

Do not only ask what model they use. Ask how the runtime handles tool contracts, invalid actions, retries, feedback, and loops. Ask whether the system learns from failed trajectories. Ask whether improvements transfer across models or require retraining every time. Ask whether harness changes are testable and auditable.

Operators should also notice the domain boundary. Life-Harness is most relevant where the workflow is deterministic and rule-governed: support operations, internal tools, database tasks, shopping-like flows, OS actions, and policy-heavy business processes.

Open-ended work is different. If every task brings new tools, goals, resources, and success criteria, a fixed harness becomes harder to evolve and trust.

What to Watch Next

The field should watch whether harness adaptation becomes a standard complement to model training for agents.

Researchers should watch transfer. If environment-side harnesses can reliably transfer across models, organizations may be able to invest in workflow-specific runtime assets without locking themselves to one model provider.

Builders should watch observability. Harness evolution depends on good trajectories. If the system cannot show where failures happen, it cannot improve the right layer.

Buyers should watch for benchmark realism. The most useful evaluations will separate model intelligence from runtime mediation, especially in workflows where the rules are stable but the interaction loop is unforgiving.

Limitations and Caveats

The paper is intentionally scoped to deterministic, rule-governed environments. That is a strength for the experiments, but it limits the claim.

Open-ended agent tasks are harder. Goals vary. Tools change. External resources appear and disappear. Success criteria may be subjective. In that setting, it is less obvious how to define a stable runtime interface or evolve a harness that generalizes.

The approach also depends on the quality of failure diagnosis. If training trajectories are too narrow, the harness may overfit to visible failure modes. If interventions trigger too aggressively, they can harm otherwise correct behavior.

Finally, harnessing does not remove the need for good models. It changes the improvement target. For many practical agents, better weights and better interfaces will be complementary.

Source

Xu, Tianshi; Wen, Huifeng; Li, Meng. (2026). Adapting the Interface, Not the Model: Runtime Harness Adaptation for Deterministic LLM Agents. arXiv preprint arXiv:2605.22166. Available at: https://arxiv.org/abs/2605.22166