The paper’s practical point is that an AI agent is a model running inside a runtime. That runtime increasingly determines whether the agent is useful, controlled, observable, and recoverable.

Source note: Qianyu Meng, Yanan Wang, Liyi Chen, Qimeng Wang, Chengqiang Lu, Wei Wu, Yan Gao, Yi Wu, and Yao Hu. “Agent Harness for Large Language Model Agents: A Survey.” Preprints.org, posted April 7, 2026. https://www.preprints.org/frontend/manuscript/c2effdadd32617ea6f65180463ac5f37/download_pub

Why This Paper Matters

Most agent discussion still starts with the model. Can it reason? Can it code? Can it plan? Can it use tools? Can it stay coherent across a long task?

Those questions matter, but they are incomplete. A capable model can still fail as a deployed agent if the surrounding runtime is weak. It may call the wrong tool, lose the task state, loop forever, forget a constraint, run in an unsafe environment, or produce behavior that cannot be reconstructed afterward.

This survey tries to make that surrounding layer the object of study. The authors call it the agent harness: the software system that governs execution, tool access, context, memory, policy hooks, and evaluation.

That framing is useful because agent reliability is becoming less like prompt craft and more like operating-system design. A production agent needs a controlled environment, typed interfaces, state persistence, recovery paths, logs, evaluation hooks, permission boundaries, and human handoff rules. Those are not decorative features around the model. They are the difference between a demo and an operating system for work.

The paper is also useful because it says the quiet part directly: benchmark scores and vendor claims are under-specified when they name the model but not the harness. A model running in one execution environment can behave very differently from the same model running in another.

The Idea in Plain English

The paper’s central claim is that model capability does not produce reliability by itself.

A model has latent capability. The harness decides how that capability becomes action.

The paper defines a full agent harness as six runtime governance functions:

  1. Execution loop
  2. Tool registry
  3. Context manager
  4. State store
  5. Lifecycle hooks
  6. Evaluation interface

In the paper’s notation, the harness is H = (E, T, C, S, L, V), with one symbol per component in the order listed above.

The execution loop manages the observe, think, act cycle. It decides how turns proceed, when a run stops, and how the system recovers after an error.
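A minimal sketch of such a loop, assuming the model is wrapped as a `think` function and the environment as an `act` function. Those names, the `FINISH` signal, and the step and error limits are all hypothetical and do not come from the paper.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class LoopResult:
    status: str  # "done", "step_limit", or "failed"
    steps: int
    history: List[Tuple[str, str]] = field(default_factory=list)


def run_loop(think: Callable[[str], str],
             act: Callable[[str], str],
             first_observation: str,
             max_steps: int = 10,
             max_consecutive_errors: int = 2) -> LoopResult:
    """Hypothetical observe-think-act loop with explicit stop and recovery rules."""
    observation = first_observation
    history: List[Tuple[str, str]] = []
    errors = 0
    for step in range(1, max_steps + 1):
        action = think(observation)        # the model proposes the next action
        if action == "FINISH":             # explicit termination signal
            return LoopResult("done", step, history)
        try:
            observation = act(action)      # the harness executes the action
            errors = 0
        except Exception as exc:
            errors += 1
            observation = f"error: {exc}"  # feed the error back so the model can adapt
        history.append((action, observation))
        if errors >= max_consecutive_errors:
            return LoopResult("failed", step, history)
    return LoopResult("step_limit", max_steps, history)  # hard cap prevents endless loops
```

The point of the sketch is that stopping and recovery are decided by the loop, not left to the model: a run ends on an explicit finish, a step cap, or repeated errors.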

The tool registry governs what the agent can do. It maintains typed, validated tool interfaces and monitors invocations.
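One way to picture a registry like this, as a sketch only: each tool declares its parameter types, and every call is validated and logged before it runs. The `ToolRegistry` class and its methods are hypothetical illustrations, not an API from the paper or from any real library.

```python
from typing import Any, Callable, Dict, List, Tuple


class ToolRegistry:
    """Hypothetical registry: tools declare parameter types; calls are validated and logged."""

    def __init__(self) -> None:
        self._tools: Dict[str, Tuple[Dict[str, type], Callable[..., Any]]] = {}
        self.invocations: List[dict] = []

    def register(self, name: str, params: Dict[str, type], fn: Callable[..., Any]) -> None:
        self._tools[name] = (params, fn)

    def call(self, name: str, **kwargs: Any) -> Any:
        record = {"tool": name, "args": kwargs, "ok": False}
        self.invocations.append(record)  # log every attempt, including rejected ones
        if name not in self._tools:
            raise KeyError(f"unknown tool: {name}")
        params, fn = self._tools[name]
        if set(kwargs) != set(params):
            raise TypeError(f"{name} expects arguments {sorted(params)}")
        for key, expected in params.items():
            if not isinstance(kwargs[key], expected):
                raise TypeError(f"{name}.{key} must be {expected.__name__}")
        result = fn(**kwargs)
        record["ok"] = True
        return result
```

For example, after `reg.register("add", {"a": int, "b": int}, lambda a, b: a + b)`, the call `reg.call("add", a="2", b=3)` is rejected with a `TypeError` and still appears in `reg.invocations`.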

The context manager governs what the model sees. It decides what enters the context window, what gets compacted, what gets retrieved, and what gets excluded.
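A small sketch of one possible context policy, under stated assumptions: private items are excluded, pinned items are always kept, and the remaining budget is filled with the newest items. The `Item` type, the policy itself, and the use of word counts as a stand-in for tokens are hypothetical simplifications.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Item:
    text: str
    pinned: bool = False   # e.g. a task constraint that must never be dropped
    private: bool = False  # must never be shown to the model


def build_context(items: List[Item], budget_words: int) -> List[str]:
    """Hypothetical policy: drop private items, keep pinned ones, fill the rest newest-first."""

    def cost(item: Item) -> int:
        return len(item.text.split())  # crude proxy for token count

    visible = [it for it in items if not it.private]
    keep = {id(it) for it in visible if it.pinned}
    used = sum(cost(it) for it in visible if it.pinned)  # pinned items may exceed the budget
    for it in reversed(visible):  # walk from newest to oldest
        if it.pinned:
            continue
        if used + cost(it) <= budget_words:
            keep.add(id(it))
            used += cost(it)
    return [it.text for it in visible if id(it) in keep]  # restore original order
```

A real context manager would also compact or summarize what it drops and retrieve older material on demand; this sketch covers only inclusion and exclusion.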

The state store lets the agent persist task-relevant state across turns or sessions. Without it, long work becomes fragile.
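As an illustration, a state store can be as simple as a checkpoint file that survives a process restart. The sketch below assumes a single JSON file per task; the `JsonStateStore` name and layout are hypothetical, and production systems would more likely use a database.

```python
import json
import os
from pathlib import Path
from typing import Optional


class JsonStateStore:
    """Hypothetical single-file checkpoint store for task state."""

    def __init__(self, path: str) -> None:
        self.path = Path(path)

    def save(self, state: dict) -> None:
        tmp = self.path.with_suffix(".tmp")
        tmp.write_text(json.dumps(state))
        # Rename over the old file so a crash mid-write does not corrupt the checkpoint.
        os.replace(tmp, self.path)

    def load(self, default: Optional[dict] = None) -> dict:
        if not self.path.exists():
            return {} if default is None else default
        return json.loads(self.path.read_text())
```

A resumed run would call `load()` first and continue from the recorded step instead of starting over.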

Lifecycle hooks intercept the run for authentication, logging, policy enforcement, and instrumentation.
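A sketch of the interception idea, assuming every tool call is routed through a runner that fires hooks before and after execution. The `HookedRunner`, `PolicyViolation`, and `allow_only` names are hypothetical and not drawn from the paper.

```python
from typing import Any, Callable, Dict, List, Set


class PolicyViolation(Exception):
    """Raised by a pre-hook to block a call."""


class HookedRunner:
    """Hypothetical runner: every tool call passes through pre- and post-hooks."""

    def __init__(self) -> None:
        self.pre_hooks: List[Callable[[str, Dict[str, Any]], None]] = []
        self.post_hooks: List[Callable[[str, Dict[str, Any], Any], None]] = []

    def run(self, name: str, fn: Callable[..., Any], args: Dict[str, Any]) -> Any:
        for hook in self.pre_hooks:
            hook(name, args)          # a pre-hook may raise to stop the call
        result = fn(**args)
        for hook in self.post_hooks:
            hook(name, args, result)  # e.g. audit logging or metrics
        return result


def allow_only(allowed: Set[str]) -> Callable[[str, Dict[str, Any]], None]:
    """Build a pre-hook that rejects any tool not on the allowlist."""
    def hook(name: str, args: Dict[str, Any]) -> None:
        if name not in allowed:
            raise PolicyViolation(f"tool {name!r} is not permitted")
    return hook
```

Because the policy check lives in a hook rather than in the prompt, it applies even when the model ignores or forgets an instruction.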

The evaluation interface captures trajectories, intermediate states, tool outcomes, and success signals in a way external evaluators can consume.
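To make "consumable by external evaluators" concrete, here is a sketch that records a trajectory and exports it as JSON lines. The field names and the JSONL layout are assumptions for illustration; the paper does not prescribe a trace schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Step:
    index: int
    action: str
    tool_outcome: str
    state: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Trajectory:
    """Hypothetical trace of one run: ordered steps plus a final success signal."""
    task_id: str
    steps: List[Step] = field(default_factory=list)
    success: Optional[bool] = None

    def record(self, action: str, tool_outcome: str, state: Dict[str, Any]) -> None:
        # Copy the state so later mutations do not rewrite history.
        self.steps.append(Step(len(self.steps), action, tool_outcome, dict(state)))

    def to_jsonl(self) -> str:
        lines = [json.dumps({"task_id": self.task_id, "type": "step", **asdict(s)})
                 for s in self.steps]
        lines.append(json.dumps({"task_id": self.task_id, "type": "result",
                                 "success": self.success}))
        return "\n".join(lines)
```

An evaluator that reads this output needs no access to the harness internals, which is the property the evaluation interface is meant to provide.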

The useful distinction is governance. A harness is more than a framework that helps developers build agent logic. It is the runtime layer that controls how capability is exercised.

What the Researchers Tested

This is a survey paper, not a new benchmark paper. The authors do five things.

First, they propose a formal definition of an agent harness. Their minimum threshold is E plus T: a system needs a multi-step execution loop and tool access before it qualifies as a harness. A full-stack harness implements all six components.
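The threshold rule can be written down in a few lines. In this sketch the letters follow the paper's notation, on the assumption that S, L, and V stand for the state store, lifecycle hooks, and evaluation interface in list order. The label strings are hypothetical wording, not the paper's terms.

```python
from typing import Set

ALL_COMPONENTS = frozenset({"E", "T", "C", "S", "L", "V"})


def classify(components: Set[str]) -> str:
    """Apply the E-plus-T minimum and the six-component full-stack rule."""
    unknown = set(components) - ALL_COMPONENTS
    if unknown:
        raise ValueError(f"unknown components: {sorted(unknown)}")
    if set(components) == ALL_COMPONENTS:
        return "full-stack harness"
    if {"E", "T"} <= set(components):
        return "partial harness"   # meets the minimum threshold only
    return "not a harness"         # e.g. a capability module or framework primitive
```

Under this rule, a system with tools and a context manager but no execution loop, `classify({"T", "C"})`, does not qualify as a harness at all.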

Second, they trace the history of harnesses through three lineages: software test harnesses, reinforcement learning environments, and early LLM agent frameworks.

Third, they map 22 representative systems against the six-component matrix. The systems include full-stack harnesses, specialized harnesses, frameworks, capability modules, and evaluation infrastructure.

Fourth, they analyze technical challenges across the stack: sandboxing, evaluation, protocol standardization, context management, tool governance, memory, planning, multi-agent coordination, cost, and long-running deployment.

Fifth, they propose research directions that are only visible when the harness itself is treated as the research object.

The evidence base is mixed. The paper cites peer-reviewed or formal benchmark work such as SWE-bench, HAL, and AgencyBench, but it also leans on preprints and practitioner reports. The authors flag many of those sources as not peer reviewed. That caveat matters, especially because some of the strongest quantitative claims are still early.

What They Found

A harness is a governance layer, not a prompt wrapper

The paper’s best contribution is the six-part vocabulary.

The six components map cleanly to common agent failure modes. Execution loops handle runaway or stuck work. Tool registries handle tool misuse. Context managers handle context blowout. State stores handle lost progress. Lifecycle hooks handle unmonitored side effects. Evaluation interfaces handle behavior that cannot be measured or compared.

That mapping is practical. It lets a builder look at an agent system and ask where reliability actually lives. If a system has good tool calling but no state store, it may work for short jobs but fail on long jobs. If it has logs but no structured evaluation interface, it may be inspectable by humans but hard to compare across runs. If it has a broad tool registry but weak lifecycle hooks, it may be powerful and unsafe at the same time.
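The same audit can be sketched as a lookup. The component and failure-mode pairs below restate the mapping described above; the `unmitigated` helper is a hypothetical convenience, not something the paper defines.

```python
from typing import Dict, List, Set

# Component -> the failure mode it is meant to handle, as described in the text.
FAILURE_MODES: Dict[str, str] = {
    "execution loop": "runaway or stuck work",
    "tool registry": "tool misuse",
    "context manager": "context blowout",
    "state store": "lost progress",
    "lifecycle hooks": "unmonitored side effects",
    "evaluation interface": "behavior that cannot be measured or compared",
}


def unmitigated(present: Set[str]) -> List[str]:
    """List the failure modes left uncovered by the components a system lacks."""
    return [mode for component, mode in FAILURE_MODES.items() if component not in present]
```

A system with every component except a state store would report a single gap, "lost progress", which matches the short-jobs-versus-long-jobs example above.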

The paper also separates frameworks from harnesses. LangGraph, AutoGen, LlamaIndex, and similar systems can provide useful construction primitives, but they do not automatically supply a governed runtime. The deployer still has to decide state, security, evaluation, permissions, and recovery.

That distinction is more than taxonomy. It prevents a common category error: mistaking the ability to assemble an agent for the ability to run one reliably.

The 22-system matrix shows an uneven ecosystem

The paper maps 22 systems across the six components.

The full-stack group includes systems such as DeepAgents, Claude Code, OpenClaw, DeerFlow, OpenHands, and AIOS. Specialized harnesses include SWE-agent, Browser-Use, RAI, PortiaAI, and TrustAgent. Frameworks and modules include LangGraph, LlamaIndex, AutoGen, Google ADK, CrewAI, MemGPT, MCP servers, and Voyager. Evaluation infrastructure includes HAL, AgencyBench, SkillsBench, and Harbor.

The pattern is familiar: some components have reusable infrastructure, while others are repeatedly rebuilt inside each serious harness.

Tool access has MCP (the Model Context Protocol). Context has strong module ecosystems, including retrieval and memory systems. Evaluation infrastructure is becoming more sophisticated because benchmarks force reproducibility.

But the paper argues that state stores, lifecycle hooks, and evaluation interfaces are still under-standardized. Full-stack harnesses often build these internally. That makes sense in the short term, but it creates portability problems. A tool, skill, memory strategy, or evaluation trace that works in one harness may not move cleanly to another.

This is the modularity gap. Agent infrastructure has modules, but the most important governance boundaries are not yet standardized.

MCP and A2A solve different boundaries

One useful part of the paper is its treatment of agent protocols.

The paper does not frame MCP and A2A as a simple winner-take-all fight. It treats them as operating at different boundaries.

MCP is mostly a harness-to-tool protocol. A harness discovers tools, validates schemas, and dispatches calls to MCP servers. It is strongest inside a harness, where the core problem is reliable tool access.

A2A (Agent2Agent) is mostly a harness-to-harness or agent-to-agent protocol. It is about delegation, agent identity, progress streaming, and task boundaries between autonomous systems.

That split is important. A serious agent stack may need both. MCP can govern “call this tool with these arguments.” A2A can govern “delegate this task to another agent with this authority and this progress protocol.”

The unresolved issue is the translation boundary. If one harness receives a delegated task and then calls local MCP tools to execute it, the system must translate task authority into tool permissions. The paper argues that deployers currently handle that boundary ad hoc.
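One possible shape for that translation, as a sketch under loud assumptions: the delegated task carries a set of granted scopes, each local tool declares the scopes it requires, and the harness exposes only the tools whose requirements are fully covered. The scope strings, tool names, and types here are invented for illustration. Neither MCP nor A2A defines them, and no real protocol API is used.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List


@dataclass(frozen=True)
class DelegatedTask:
    task_id: str
    granted_scopes: FrozenSet[str]  # authority passed along by the delegating agent


# Hypothetical mapping from local tool names to the scopes each one requires.
TOOL_SCOPES: Dict[str, FrozenSet[str]] = {
    "read_ticket": frozenset({"tickets:read"}),
    "close_ticket": frozenset({"tickets:read", "tickets:write"}),
    "refund": frozenset({"payments:write"}),
}


def allowed_tools(task: DelegatedTask,
                  tool_scopes: Dict[str, FrozenSet[str]] = TOOL_SCOPES) -> List[str]:
    """Expose a tool only if the task's authority covers every scope it requires."""
    return sorted(name for name, needed in tool_scopes.items()
                  if needed <= task.granted_scopes)
```

The design choice is default-deny: a tool with any uncovered scope is hidden from the run, so delegated authority can narrow what a harness does but never widen it.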

That is where a lot of future reliability and security work will live.

Long context does not remove context engineering

The paper is skeptical of the idea that huge context windows eliminate context management.

Ultra-long-context models change the C-component, but they do not remove it. A harness still has to decide ordering, salience, retrieval, compression, privacy, cost, and attack-surface management.

In short-context systems, the context manager is a gatekeeper because the window is scarce. In long-context systems, the context manager becomes an architect. It decides what structure the model sees, what gets emphasized, what should remain hidden, and how much cost the task is allowed to burn.

That is the right way to think about long context. More room does not remove the need for taste, policy, and memory design. It only changes the failure modes.

Evaluation is the hardest part because it touches everything

The paper argues that evaluation is the densest cross-component challenge.

To evaluate an agent properly, the harness needs environment isolation, tool traces, context records, state snapshots, lifecycle policy logs, and standardized success signals. If any part is missing, it becomes hard to know whether a result reflects the model, the task, the tool environment, or the harness implementation.

This is why agent benchmarks are fragile. A benchmark is a runtime environment, not a dataset alone. If the environment drifts, if tools differ, if retry policies differ, if state is not captured, or if the execution loop is underspecified, then the benchmark score is partly a harness score.

The practical implication is clear: agent papers and vendor reports should disclose the harness along with the model and benchmark.
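A disclosure of this kind could be a small structured record. The sketch below is hypothetical: the `HarnessCard` fields are guesses at what a useful disclosure might include, not a format proposed by the paper or used by any benchmark.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, Tuple


@dataclass(frozen=True)
class HarnessCard:
    """Hypothetical disclosure record reported alongside a benchmark score."""
    model: str
    benchmark: str
    harness: str
    max_steps: int
    tools: Tuple[str, ...]
    retry_policy: str
    context_policy: str


def runtime_differences(a: HarnessCard, b: HarnessCard) -> Dict[str, Tuple[Any, Any]]:
    """Return every non-model field on which two reported runs differ."""
    da, db = asdict(a), asdict(b)
    return {key: (da[key], db[key]) for key in da
            if key != "model" and da[key] != db[key]}
```

If two results for the same model come back with a non-empty difference, the gap between their scores cannot be attributed to the model alone.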

Why It Happens

The mechanism is mediation.

The model never acts on the world directly. The harness mediates every meaningful boundary between the model and reality.

It mediates time through the execution loop. It mediates action through the tool registry. It mediates attention through context management. It mediates continuity through state. It mediates risk through lifecycle hooks. It mediates scientific claims through evaluation.

That is why harness changes can look like capability changes. A better tool schema can make a model appear smarter. A better sandbox can make the same model safer. A better state store can make the same model more persistent. A better evaluation interface can reveal failures that were previously invisible.

The opposite is also true. A strong model inside a weak harness can look worse than it is. It may fail because tools are badly described, because context is stale, because error recovery is missing, or because the run terminates at the wrong time.

In deployed agents, intelligence is therefore a joint property of model and runtime.

What This Means for Builders

Builders should treat the harness as a product surface.

That means drawing the six-component map before adding more autonomy. What is the execution loop? What tools are exposed? What context policy is used? What state survives? Where are lifecycle hooks enforced? What evaluation trace is captured?

It also means designing for failure. Long-running agents will hit partial failures, ambiguous tool responses, stale context, bad state, and policy boundaries. If those cases are left to the model’s judgment alone, the harness is not doing its job.

The strongest practical move is to make the runtime explicit. Define termination conditions. Keep tool schemas narrow. Use scoped permissions. Preserve task state. Capture typed traces. Build review gates around risky actions. Make recovery a first-class path, not an afterthought.

For teams building agent platforms, the paper also suggests a standards agenda. Tool protocols are not enough. The field needs portable state interfaces, lifecycle hooks, trace schemas, and harness disclosure formats.

Without those, every serious agent platform becomes its own island.

What This Means for Buyers and Operators

For buyers, the paper gives a useful diligence question: what harness surrounds the model?

A vendor saying “we use a frontier model” is not enough. Ask what the runtime actually controls.

What tools can the agent call? How are those tools permissioned? What happens when the agent fails halfway through a task? Can the operator reconstruct the run? Are intermediate decisions logged? Can the system prove which context was visible at each step? Where does human approval happen? What is the evaluation interface?

Those questions matter most for high-consequence work: coding, infrastructure, finance, support automation, data operations, legal workflows, and any workflow where the agent can create side effects.

The paper also warns against comparing agents by headline scores alone. If the harness differs, the comparison is not purely model versus model. It is model-plus-runtime versus model-plus-runtime.

That is not a problem if disclosed. It is a problem when hidden.

What to Watch Next

Watch for harness cards or equivalent disclosure artifacts. Agent evaluations need a standard way to report runtime configuration.

Watch MCP and A2A integration. The important work is the permission translation layer between task delegation and tool execution.

Watch evaluation infrastructure. The more agents act in real environments, the more benchmarks will need to behave like controlled runtimes rather than static test sets.

Watch context management after long-context models. The winning systems will not simply stuff more information into the window. They will structure, prioritize, hide, compress, and audit context.

Watch runtime security. Standard containers are useful, but the paper’s larger point is that security belongs across lifecycle hooks, tool registries, memory, context, and state. The sandbox boundary is only one piece.

Limitations and Caveats

This is a preprint and explicitly not peer reviewed.

The paper is strongest as a framework and map of the field. Its six-component definition is useful, and its evaluation-disclosure argument is hard to disagree with.

The paper is weaker where it leans on fast-moving practitioner reports and not-yet-peer-reviewed benchmark claims. Some numbers may change as the cited work matures. The authors acknowledge this, but readers should keep the evidence hierarchy in mind.

The 22-system taxonomy is also limited by public visibility. Closed-source systems may have stronger internal implementations than their public documentation reveals, and enterprise-internal harnesses are mostly invisible.

There is also a category-risk issue. Once “harness” becomes a popular label, vendors will use it loosely. The paper’s six-component definition helps, but the field will need stronger empirical tests to separate real runtime governance from marketing vocabulary.

The best reading is practical rather than doctrinal: the paper does not prove that the harness is always more important than the model. It shows why agent reliability cannot be understood without the runtime.

Source

Meng, Qianyu, Wang, Yanan, Chen, Liyi, Wang, Qimeng, Lu, Chengqiang, Wu, Wei, Gao, Yan, Wu, Yi, and Hu, Yao. (2026). Agent Harness for Large Language Model Agents: A Survey. Preprints.org. DOI: 10.20944/preprints202604.0428.v1. Available at: https://www.preprints.org/frontend/manuscript/c2effdadd32617ea6f65180463ac5f37/download_pub