The most useful way to evaluate an AI agent is not model versus model. It is model plus harness versus model plus harness.
Source note: Junjie Li, Xi Xiao, Yunbei Zhang, Chen Liu, Lin Zhao, Xiaoying Liao, Yingrui Ji, Janet Wang, Jianyang Gu, Yingqiang Ge, Weijie Xu, Xi Fang, Xiang Xu, Tianchen Zhao, Youngeun Kim, Tianyang Wang, Jihun Hamm, Smita Krishnaswamy, Jun Huan, Chandan K. Reddy. “Agent Harness Engineering: A Survey.” Under review as a TMLR submission, 2026. https://picrew.github.io/LLM-Harness/main.pdf
Why This Paper Matters
AI-agent discussion still leans too heavily on model capability. When a model improves at coding, planning, or retrieval, the industry tends to treat that as the primary reason for better performance.
This framing misses much of what determines reliability in practice.
An agent does not act directly in the world. It runs inside a harness: the execution layer that determines which environment the agent can touch, which tools it can call, how context is assembled, how state survives, how actions are logged, how outputs are verified, and which policies constrain the run.
This survey matters because it gives that layer a name and a structure. The authors argue that the harness is more than plumbing beneath the “real” agent. It is an independent system layer, and for long-running tasks, it often becomes the binding constraint on reliability.
If an agent fails, the cause may not be a “dumb” model. It may be a leaky sandbox, a tool interface written for humans, a context window that lost a key constraint, a lifecycle loop with no recovery path, or a governance layer that allowed an unauthorized action.
The Idea in Plain English
The paper’s claim is simple: agent quality is partly an infrastructure problem.
Early LLM systems were essentially model plus prompt. Then came context engineering: retrieval, memory, and prompt assembly. The authors argue that production agents have entered a third phase: harness engineering.
Harness engineering is the discipline of designing the environment around the model. It includes the code runner, browser, filesystem, tool protocols, state machine, observers, evaluators, and human handoff points.
Real agents are more than chat windows with tools; they function as small operating systems built around language models.
The paper proposes a seven-layer taxonomy called ETCLOVG:
- Execution environment
- Tool interface
- Context management
- Lifecycle and orchestration
- Observability
- Verification
- Governance
The acronym is clunky, but the split is useful. It forces builders to ask whether an agent has every reliability layer it needs, rather than treating everything outside the prompt as a generic engineering detail.
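One way to put the taxonomy to work is as a coverage check. The sketch below is illustrative and not something the paper provides: the `Layer` enum mirrors the seven layer names above, while `missing_layers` and the `demo_agent` inventory are hypothetical names invented for this example.

```python
from enum import Enum


class Layer(Enum):
    """The seven ETCLOVG layers named in the survey."""
    EXECUTION = "Execution environment"
    TOOL = "Tool interface"
    CONTEXT = "Context management"
    LIFECYCLE = "Lifecycle and orchestration"
    OBSERVABILITY = "Observability"
    VERIFICATION = "Verification"
    GOVERNANCE = "Governance"


def missing_layers(components: dict[Layer, list[str]]) -> list[Layer]:
    """Return the layers that have no component, in ETCLOVG order."""
    return [layer for layer in Layer if not components.get(layer)]


# Hypothetical inventory for a demo agent; component names are made up.
demo_agent = {
    Layer.EXECUTION: ["container sandbox"],
    Layer.TOOL: ["shell tool", "file editor"],
    Layer.CONTEXT: ["rolling summary"],
    Layer.LIFECYCLE: ["plan-act loop"],
    Layer.VERIFICATION: [],  # listed, but nothing implemented yet
}

gaps = missing_layers(demo_agent)
print([layer.name for layer in gaps])
# prints ['OBSERVABILITY', 'VERIFICATION', 'GOVERNANCE']
```

An empty list counts as a gap on purpose: naming a layer in a design document is not the same as having a component that implements it.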
What the Researchers Tested
This is a survey paper that organizes the agent-harness stack. The authors do three things.
First, they argue for the “binding-constraint thesis”: for complex tasks evaluated across comparable models, much of the variance comes from the execution harness rather than the model alone.
Second, they use the ETCLOVG taxonomy to map the stack. Execution covers sandboxes and isolation. Tooling covers protocols and integration. Context covers memory and persistence. Lifecycle covers state and task loops. Observability covers traces and cost tracking. Verification covers evaluation and regression loops. Governance covers permissions and audit infrastructure.
Third, they map over 140 open-source projects onto this taxonomy to show where the ecosystem is mature and where it remains thin.
What They Found
The agent ecosystem is uneven.
Execution environments, tool interfaces, and lifecycle systems are relatively dense. Developers feel the pain of sandboxes and task loops early because they block basic demos.
Observability and governance are thinner. These layers are more likely to appear in commercial platforms than in open-source research artifacts. This is a warning: the layers that make agents operationally trustworthy are the ones least likely to be captured by toy benchmarks.
The paper argues that observability and governance deserve to be first-class architectural layers. Observability is more than logging LLM calls; it means reconstructing trajectories, connecting reasoning to actions, and diagnosing specific failure points in a run. Governance is more than a policy paragraph; it involves identity, delegation, and permission boundaries.
The survey identifies three recurring tensions:
- The cost-quality-speed trilemma: Richer observability and safer sandboxes improve quality but add cost and latency.
- The capability-control tradeoff: More powerful agents need broader tool access and autonomy, which creates more ways to drift or escalate privileges.
- The harness coupling problem: If an agent is optimized for one harness, its behavior may not transfer cleanly to another. The harness shapes the task.
Why It Happens
Harnesses define the agent’s actual action space.
A model can choose only the actions the harness exposes. It sees only the context the harness constructs, remembers only what the harness persists, and recovers only if the lifecycle system provides a path.
Agent capability is a joint property. A strong model inside a weak harness can still fail because of wrong tool schemas, stale state, or missing retries. A weaker model inside a disciplined harness can often be more reliable because the surrounding system constrains it and helps it recover.
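A toy harness makes the point concrete. This is a minimal sketch under stated assumptions, not a design taken from the paper: the `Harness` class, its three-entry context window, and the choice to treat `RuntimeError` as a transient failure are all hypothetical.

```python
from typing import Callable


class Harness:
    """Hypothetical minimal harness: tool whitelist, persisted state, bounded retries."""

    def __init__(self, tools: dict[str, Callable[[str], str]], max_retries: int = 2):
        self.tools = tools            # the only actions the agent can take
        self.state: list[str] = []    # the only memory that survives between steps
        self.max_retries = max_retries

    def context(self) -> str:
        """The agent sees only what the harness assembles here."""
        return "\n".join(self.state[-3:])  # assumption: keep the last three entries

    def act(self, tool: str, arg: str) -> str:
        if tool not in self.tools:
            result = f"rejected: unknown tool {tool!r}"
        else:
            result = self._call_with_retries(tool, arg)
        self.state.append(f"{tool}({arg}) -> {result}")
        return result

    def _call_with_retries(self, tool: str, arg: str) -> str:
        last_error = ""
        for _ in range(self.max_retries + 1):
            try:
                return self.tools[tool](arg)
            except RuntimeError as exc:  # assumption: RuntimeError means transient
                last_error = str(exc)
        return f"failed: {last_error}"


calls = {"n": 0}


def flaky_read(path: str) -> str:
    """Stand-in tool that times out on its first call only."""
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("timeout")
    return f"contents of {path}"


harness = Harness({"read": flaky_read})
print(harness.act("read", "notes.txt"))    # recovered by the retry path
print(harness.act("delete", "notes.txt"))  # never exposed, so never possible
print(harness.context())
```

The same model call produces a different outcome depending on `max_retries` and on which tools are registered, which is the joint-property argument in miniature.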
This is why single-number benchmarks can mislead. A score may reflect model reasoning, but it also reflects context injection, verifier design, or retry policy. Treating a benchmark as a pure model measurement hides the infrastructure.
The model is not the only thing that matters; in long-running work, the harness can dominate the outcome.
What This Means for Builders
For builders, this paper is a checklist.
If an agent is intended for real work, model reasoning is only one requirement. The question is whether the harness has been engineered as a product surface.
The execution layer should make dangerous actions hard and recoverable. The tool layer should be designed for agents, not copied from human APIs. The context layer should track task state rather than just stuffing tokens into the window. The observability layer should produce traces that explain failures, rather than basic API-call dashboards. The governance layer should track who authorized which action and under what permissions.
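The governance requirement in particular is easy to state and easy to skip. The following sketch shows one possible shape for it, assuming a prefix-based scope model; `Grant`, `AuditLog`, and `authorize` are hypothetical names, and a real system would need far more (identity verification, expiry, revocation).

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class Grant:
    """A permission delegated to the agent by a named principal."""
    principal: str  # who authorized it
    action: str     # e.g. "fs.write"
    scope: str      # resource prefix the grant covers


@dataclass
class AuditLog:
    entries: list[dict] = field(default_factory=list)

    def record(self, action: str, resource: str, grant: Optional[Grant]) -> None:
        self.entries.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "action": action,
            "resource": resource,
            "allowed": grant is not None,
            "authorized_by": grant.principal if grant else None,
            "scope": grant.scope if grant else None,
        })


def authorize(grants: list[Grant], action: str, resource: str, log: AuditLog) -> bool:
    """Allow the action only if some grant covers it; log the decision either way."""
    match = next(
        (g for g in grants if g.action == action and resource.startswith(g.scope)),
        None,
    )
    log.record(action, resource, match)
    return match is not None


grants = [Grant(principal="alice", action="fs.write", scope="/workspace/")]
log = AuditLog()
print(authorize(grants, "fs.write", "/workspace/report.md", log))  # True
print(authorize(grants, "fs.write", "/etc/passwd", log))           # False
print(log.entries[0]["authorized_by"])                             # alice
```

Denied attempts are logged as well as allowed ones, since an audit trail that records only successes cannot show what the agent tried to do.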
Do not hide the harness in miscellaneous infrastructure. Make it an explicit architecture.
When comparing agents, report the harness. Which tools were available? What was the context policy? How were errors retried? Without this information, model comparisons are under-specified.
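A harness report does not need to be elaborate to be useful. The sketch below is a hypothetical record covering only the questions asked above; the `HarnessReport` fields, the `same_harness` helper, and the placeholder values such as `model-a` are assumptions, not a format proposed by the paper.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class HarnessReport:
    """Hypothetical disclosure record to publish alongside an agent score."""
    model: str
    tools: list[str]
    context_policy: str
    retry_policy: str


def harness_fields(report: HarnessReport) -> dict:
    """Everything in the report except the model name."""
    return {k: v for k, v in asdict(report).items() if k != "model"}


def same_harness(a: HarnessReport, b: HarnessReport) -> bool:
    """True when two runs differ at most in the model."""
    return harness_fields(a) == harness_fields(b)


run_a = HarnessReport("model-a", ["shell", "search"], "rolling summary", "2 retries")
run_b = HarnessReport("model-b", ["shell", "search"], "rolling summary", "2 retries")
run_c = HarnessReport("model-b", ["shell"], "full history", "no retries")

print(same_harness(run_a, run_b))  # True: only the model changed
print(same_harness(run_a, run_c))  # False: the harness changed too
print(json.dumps(asdict(run_a), indent=2))
```

Only the first pair supports a model-versus-model conclusion. In the second pair, any score difference could come from the tools, the context policy, or the retries.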
What This Means for Buyers and Operators
For buyers, the survey is a defense against “model-name theater.”
A vendor may claim better performance because they use a stronger model, but that is an incomplete story. The operational question is: what harness surrounds that model?
A serious agent product should explain its execution environment, tool permissions, memory policy, and governance model. If the answer is just “we use a frontier model,” that is not enough.
This matters most for work that is long-running or high-consequence. A coding agent or an ops agent needs more than good completions. It needs scoped authority, state continuity, and review gates.
Evaluate the full loop. Test the agent in the actual operating environment. Look at traces, failure modes, and escalation behavior, not just final answers.
What to Watch Next
Watch whether harness reporting becomes standard in benchmarks. Model cards are insufficient for agents; we need harness cards covering runtime, tool contracts, and governance.
Watch trace-native evaluation. Final-score evaluation is too thin for harness engineering. A useful next step is turning traces into the primary object for scoring and failure attribution.
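As a rough illustration of what trace-first scoring could look like, the sketch below attributes each failed run to the layer of its first failing step. That attribution rule is a simplifying assumption made for this example, not a method from the survey, and `Step`, `first_failure`, and `attribute_failures` are hypothetical names.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Step:
    layer: str        # harness layer that handled the step, e.g. "tool"
    ok: bool
    detail: str = ""


def first_failure(trace: list[Step]) -> Optional[Step]:
    """Return the earliest failing step in a run, or None if the run was clean."""
    return next((step for step in trace if not step.ok), None)


def attribute_failures(traces: list[list[Step]]) -> Counter:
    """Count failed runs by the layer of their first failing step."""
    counts: Counter = Counter()
    for trace in traces:
        step = first_failure(trace)
        if step is not None:
            counts[step.layer] += 1
    return counts


# Made-up traces for illustration only.
traces = [
    [Step("tool", True), Step("context", False, "constraint dropped"), Step("tool", False)],
    [Step("tool", False, "schema mismatch")],
    [Step("execution", True), Step("verification", True)],
    [Step("context", False, "stale state")],
]

print(attribute_failures(traces).most_common())
# prints [('context', 2), ('tool', 1)]
```

A final-score view of these four runs would report only "one of four passed". The trace view points at the context layer as the first thing to inspect.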
Watch handoff standards. Agents need richer cross-layer contracts that transfer intent, constraints, budget state, and risk level during handoffs.
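To make the idea of a handoff contract tangible, here is a minimal sketch carrying the four items named above. The `Handoff` fields, the two-level `Risk` enum, the dollar-denominated budget, and the acceptance rules in `accept` are all assumptions chosen for illustration; no existing standard is implied.

```python
from dataclasses import dataclass
from enum import Enum


class Risk(Enum):
    LOW = 1
    HIGH = 2


@dataclass(frozen=True)
class Handoff:
    """Hypothetical contract passed from one agent or layer to the next."""
    intent: str
    constraints: tuple[str, ...]
    budget_remaining_usd: float
    risk: Risk


def accept(handoff: Handoff, estimated_cost_usd: float,
           human_approved: bool = False) -> tuple[bool, str]:
    """Decide whether the receiver may proceed, with a reason for the decision."""
    if estimated_cost_usd > handoff.budget_remaining_usd:
        return False, "over budget"
    if handoff.risk is Risk.HIGH and not human_approved:
        return False, "needs human approval"
    return True, "accepted"


task = Handoff("migrate the billing database", ("no downtime",), 5.0, Risk.HIGH)
print(accept(task, 2.0))                       # (False, 'needs human approval')
print(accept(task, 2.0, human_approved=True))  # (True, 'accepted')
print(accept(task, 9.0, human_approved=True))  # (False, 'over budget')
```

The contract is frozen so that a receiving agent cannot quietly loosen the constraints or raise the budget it was handed.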
Finally, watch for harness simplification. As models improve, some scaffolding will become counterproductive. The best harness will be the one that keeps capability and reliability in balance.
Limitations and Caveats
The paper is a synthesis rather than a controlled causal test. While the binding-constraint thesis is persuasive, the paper does not isolate every harness variable through a benchmark suite.
The open-source mapping has visibility bias. Observability and governance may look thinner because many of those systems live inside proprietary company stacks.
The argument is strongest for long-running work. For short, single-turn tasks, model capability likely still explains most of the outcome. Harness engineering becomes decisive as tasks grow longer and actions become harder to reverse.
Source
Junjie Li, Xi Xiao, Yunbei Zhang, Chen Liu, Lin Zhao, Xiaoying Liao, Yingrui Ji, Janet Wang, Jianyang Gu, Yingqiang Ge, Weijie Xu, Xi Fang, Xiang Xu, Tianchen Zhao, Youngeun Kim, Tianyang Wang, Jihun Hamm, Smita Krishnaswamy, Jun Huan, Chandan K. Reddy. (2026). Agent Harness Engineering: A Survey. Under review as a TMLR submission. Available at: https://picrew.github.io/LLM-Harness/main.pdf