Research Explainers May 29, 2026 7 min read

Agents Should Compile Workflows, Not Replay Them

The paper’s practical point is that many agent workflows should stop being live agent sessions. Let the model design the workflow once, save it as a reusable blueprint, and run the repeatable execution path deterministically.

Source note: Abhinav Singh Parmar. “Separating Intelligence from Execution: A Workflow Engine for the Model Context Protocol.” arXiv:2605.00827, submitted March 13, 2026. https://arxiv.org/abs/2605.00827

Why This Paper Matters

Tool-using agents are powerful, but they waste effort when they solve the same operational task again and again.

A typical agent loop does not just call tools. It reads the task, inspects available tools, decides which call should come next, emits a structured invocation, reads the result, updates its plan, and repeats. That is useful when the path is genuinely uncertain. It is expensive when the job is a known workflow.

This paper focuses on that repeated-work problem in the context of the Model Context Protocol. MCP gives agents a standard way to discover and call external tools. The paper asks what should happen after an agent has already figured out a multi-step MCP task once.

The proposed answer is simple: do not keep paying for the agent to rediscover the same plan. Have the agent produce a workflow blueprint, then let an MCP-native engine execute that blueprint later.

That framing matters because a lot of agent work in companies is not open-ended reasoning. It is recurring operational choreography: sync this system to that system, inspect these resources, collect these logs, update this record, route this ticket, generate this report. The first run may need intelligence. The hundredth run usually needs reliability.

The Idea in Plain English

The paper separates two jobs that are often mixed together.

The first job is intelligence: understanding the goal, exploring tool schemas, deciding the sequence of steps, and turning a vague task into a concrete procedure.

The second job is execution: making the tool calls in the right order, passing data between steps, retrying transient failures, and recording the result.

Most agent systems keep the model in both jobs. The paper argues that this is the wrong default for repeated workflows. Once the agent has designed a good procedure, the procedure should become an artifact.

The artifact in this paper is a declarative JSON workflow blueprint. It defines inputs, step order, templates, loops, parallel branches, data piping, and error handling. Later, instead of asking the model to reason through the whole process again, the agent makes one call to run_workflow.
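
To make the idea concrete, here is a rough sketch of what such a blueprint might look like, written as a Python dictionary that serializes to JSON. The field names (id, type, tool, args, over, as, do, on_error), the {{...}} template syntax, the downstream tool names, and the shape of the run_workflow arguments are all illustrative assumptions, not the schema defined in the paper.

```python
import json

# Hypothetical blueprint: every field name and the template syntax are
# assumptions made for illustration, not the paper's actual schema.
blueprint = {
    "name": "sync-namespaces",
    "inputs": {"cluster": "string"},
    "steps": [
        {
            "id": "list_ns",
            "type": "call",
            "tool": "k8s.list_namespaces",  # hypothetical downstream tool
            "args": {"cluster": "{{inputs.cluster}}"},
        },
        {
            "id": "upsert_ns",
            "type": "loop",
            "over": "{{steps.list_ns.result}}",
            "as": "ns",
            "do": [
                {
                    "id": "upsert",
                    "type": "call",
                    "tool": "graph.upsert_node",  # hypothetical downstream tool
                    "args": {"label": "Namespace", "name": "{{ns}}"},
                    "on_error": {"retry": 3},
                }
            ],
        },
    ],
}

# The artifact is plain data, so it can be stored, diffed, and versioned.
serialized = json.dumps(blueprint, indent=2, sort_keys=True)
restored = json.loads(serialized)

# What the agent might send on later runs: one small tool call, not a plan.
# The argument shape here is assumed, not taken from the paper.
run_request = {
    "tool": "run_workflow",
    "arguments": {"name": "sync-namespaces", "inputs": {"cluster": "prod"}},
}
```

The point of the sketch is the size difference: the blueprint carries the whole procedure, while the repeat-run request carries only a name and a few inputs.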

The best analogy is compilation. The model acts like a compiler for an operational task. The workflow engine is the runtime.

What the Researchers Tested

The paper introduces the MCP Workflow Engine, a TypeScript implementation built against the MCP SDK.

Architecturally, it uses what the author calls the MCP Mediator pattern. The workflow engine is itself an MCP server exposed to the agent. At the same time, it acts as an MCP client to downstream servers such as Kubernetes and a graph database. The agent sees high-level tools like create_workflow, run_workflow, list_workflows, and validate_workflow; the engine handles the lower-level tool calls behind the scenes.
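
The two-sided role described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the paper's TypeScript code and not the MCP SDK: StubServer, WorkflowMediator, the "server.tool" naming convention, and the step fields are all hypothetical stand-ins. Only the four high-level tool names come from the article.

```python
class StubServer:
    """Stand-in for a connection to a downstream MCP server (not the real SDK)."""

    def __init__(self, tools):
        self.tools = tools
        self.calls = []

    def call_tool(self, name, arguments):
        self.calls.append(name)
        return self.tools[name](**arguments)


class WorkflowMediator:
    """Exposes four high-level tools upward and routes tool calls downward."""

    def __init__(self, downstream):
        self.downstream = downstream  # e.g. {"k8s": ..., "graph": ...}
        self.workflows = {}

    def validate_workflow(self, steps):
        problems = []
        for step in steps:
            server, _, tool = step["tool"].partition(".")
            if server not in self.downstream:
                problems.append(f"unknown server: {server}")
            elif tool not in self.downstream[server].tools:
                problems.append(f"unknown tool: {step['tool']}")
        return problems

    def create_workflow(self, name, steps):
        problems = self.validate_workflow(steps)
        if problems:
            raise ValueError("; ".join(problems))
        self.workflows[name] = steps
        return name

    def list_workflows(self):
        return sorted(self.workflows)

    def run_workflow(self, name, inputs):
        results = []
        for step in self.workflows[name]:
            # Route each step to the downstream server named in its prefix.
            server, _, tool = step["tool"].partition(".")
            args = dict(step.get("args", {}))
            args.update({key: inputs[key] for key in step.get("inputs", [])})
            results.append(self.downstream[server].call_tool(tool, args))
        return results


# Fake downstream servers with made-up tools and data.
k8s = StubServer({"list_nodes": lambda cluster: [f"{cluster}-node-1", f"{cluster}-node-2"]})
graph = StubServer({"ensure_label": lambda label: f"label ready: {label}"})
engine = WorkflowMediator({"k8s": k8s, "graph": graph})

engine.create_workflow("sync-nodes", [
    {"tool": "k8s.list_nodes", "inputs": ["cluster"]},
    {"tool": "graph.ensure_label", "args": {"label": "Node"}},
])
results = engine.run_workflow("sync-nodes", {"cluster": "prod"})
```

From the caller's side there is one run_workflow call; the two downstream calls happen inside the mediator, each routed to a different server.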

The workflow blueprint uses five step types.

call: invokes a downstream MCP tool.

loop: iterates over a collection, such as all namespaces in a Kubernetes cluster.

parallel: runs independent branches concurrently.

pipe: passes results through a sequence.

collect: batches mechanical results so an agent can review one aggregated output instead of reasoning over every item separately.
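
A toy interpreter shows how little machinery these five step types need. This is a sketch under stated assumptions: the step fields (id, tool, args, over, as, do, branches, input, stages, from), the whole-string {{name}} references, and the fake tools are invented for illustration and are not the paper's schema or implementation.

```python
from concurrent.futures import ThreadPoolExecutor


def resolve(value, ctx):
    """Swap a whole-string "{{name}}" reference for ctx[name]; recurse into dicts."""
    if isinstance(value, dict):
        return {key: resolve(item, ctx) for key, item in value.items()}
    if isinstance(value, str) and value.startswith("{{") and value.endswith("}}"):
        return ctx[value[2:-2].strip()]
    return value


def run_step(step, ctx, tools):
    kind = step["type"]
    if kind == "call":
        result = tools[step["tool"]](**resolve(step.get("args", {}), ctx))
    elif kind == "loop":
        result = []
        for item in resolve(step["over"], ctx):
            scope = {**ctx, step["as"]: item}  # loop variable visible only inside
            result.append(run_steps(step["do"], scope, tools))
    elif kind == "parallel":
        with ThreadPoolExecutor() as pool:
            futures = [pool.submit(run_steps, branch, dict(ctx), tools)
                       for branch in step["branches"]]
            result = [future.result() for future in futures]
    elif kind == "pipe":
        result = resolve(step.get("input"), ctx)
        for stage in step["stages"]:
            result = run_step(stage, {**ctx, "prev": result}, tools)
    elif kind == "collect":
        result = {name: ctx[name] for name in step["from"]}
    else:
        raise ValueError(f"unknown step type: {kind}")
    if "id" in step:
        ctx[step["id"]] = result  # make the result addressable by later steps
    return result


def run_steps(steps, ctx, tools):
    result = None
    for step in steps:
        result = run_step(step, ctx, tools)
    return result


# Fake tools: pod counts are just the length of the namespace name.
tools = {
    "list_namespaces": lambda: ["default", "kube-system"],
    "count_pods": lambda namespace: {"namespace": namespace, "pods": len(namespace)},
    "total_pods": lambda rows: sum(row["pods"] for row in rows),
    "cluster_version": lambda: "v1.30",
}
workflow = [
    {"id": "namespaces", "type": "call", "tool": "list_namespaces"},
    {"id": "counts", "type": "loop", "over": "{{namespaces}}", "as": "ns",
     "do": [{"type": "call", "tool": "count_pods", "args": {"namespace": "{{ns}}"}}]},
    {"id": "summary", "type": "parallel", "branches": [
        [{"type": "pipe", "input": "{{counts}}",
          "stages": [{"type": "call", "tool": "total_pods", "args": {"rows": "{{prev}}"}}]}],
        [{"type": "call", "tool": "cluster_version"}],
    ]},
    {"id": "report", "type": "collect", "from": ["counts", "summary"]},
]
report = run_steps(workflow, {}, tools)
```

Nothing in the interpreter makes a decision about what to do next; it only follows the order and references written into the blueprint, which is the property the paper relies on.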

The main evaluation is a Kubernetes CMDB synchronization task. The workflow syncs a production-scale cluster into a graph database. The reported setup includes 38 namespaces, 13 worker nodes, 22 resource types, 2 downstream MCP servers, and 67 top-level workflow steps.

What They Found

The engine turned a long agent loop into one trigger

The baseline task would require the agent to orchestrate many tool calls: enumerate resources, loop across namespaces, create graph nodes, create relationships, and handle errors. The paper estimates about 2,041 agent-level steps or tool-call cycles for the full sync.

With the workflow engine, the agent pays the design cost once. After that, each execution is triggered by a single run_workflow call. The workflow still performs many MCP invocations internally, but those invocations no longer require the model to reason at each step.

That is the central claim. The engine does not make the task disappear. It moves the repeated work out of the model loop.

The reported token savings are large

The paper estimates the baseline agent execution at about 1.25 million tokens per full run. The one-time workflow design cost is estimated at about 54,000 tokens, while each later execution costs around 150 tokens from the agent’s perspective.

For a single execution, the paper reports a 95.7% token reduction even after including the design cost. By five executions, the reported savings exceed 99%. At daily execution for a year, the paper estimates roughly 455 million baseline agent tokens versus about 109,000 engine tokens.

The specific numbers should be treated carefully because they are modeled, not tokenizer-measured. But the direction is hard to ignore. If a task is repeated and deterministic, keeping the LLM in every internal step is usually wasteful.
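
The arithmetic behind those percentages can be reproduced from the rounded figures quoted above. The sketch below assumes the simplest possible cost model (a fixed baseline cost per run, a one-time design cost, and a fixed trigger cost per run); the function names are ours, and the paper's own model may differ in detail.

```python
# Rounded estimates quoted in this article, not tokenizer measurements.
BASELINE_TOKENS_PER_RUN = 1_250_000  # agent loop, one full sync
DESIGN_TOKENS = 54_000               # one-time workflow design
TRIGGER_TOKENS_PER_RUN = 150         # agent-side cost of one workflow call


def baseline_tokens(runs):
    return BASELINE_TOKENS_PER_RUN * runs


def engine_tokens(runs):
    return DESIGN_TOKENS + TRIGGER_TOKENS_PER_RUN * runs


def reduction_pct(runs):
    return 100 * (1 - engine_tokens(runs) / baseline_tokens(runs))


for runs in (1, 5, 365):
    print(runs, baseline_tokens(runs), engine_tokens(runs),
          round(reduction_pct(runs), 2))
```

With these inputs, one run gives a reduction of about 95.7%, five runs give just over 99%, and a year of daily runs costs 108,750 engine-side tokens against 456.25 million baseline tokens. The small gap from the paper's roughly 455 million is consistent with the inputs here being rounded.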

Execution became faster and more predictable

The paper reports that the full workflow ran in about 42 seconds, completing all 67 top-level steps and expanding into roughly 2,000 MCP tool invocations through loops.

The resulting graph had 1,200+ nodes and 2,800+ relationships across 20 relationship types. The run completed with zero errors and zero agent tokens consumed during execution.

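The latency argument follows from removing inference from the inner loop. If an agent takes even a second or two to reason between tool calls, a long workflow becomes slow: at roughly 2,000 calls, that alone would add about 33 to 67 minutes of model time, compared with the reported 42 seconds for the whole run. The workflow engine is bounded mostly by downstream API calls and parallelism, not repeated model turns.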

Simplicity was part of the design

The paper deliberately avoids turning the workflow language into a full programming language. It excludes conditionals, variables, and string manipulation.

That is a useful design choice. The point is not to replace every kind of automation with a JSON DSL. The point is to capture the repeatable 80% where the path is known, the data movement is structured, and runtime branching is limited.

When a workflow needs judgment, the engine can collect results and hand a batch back to the agent. That hybrid pattern is more realistic than pretending every agent task should become a fully deterministic pipeline.
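
A small sketch of that handoff, with all names hypothetical: the deterministic part runs the same check for every item, the results are collected into one batch, and the model is called once on the batch. The fake agent below uses a simple threshold purely as a placeholder for model judgment.

```python
def run_checks(namespaces, check):
    """Deterministic part: the same mechanical check for every namespace."""
    return [{"namespace": ns, **check(ns)} for ns in namespaces]


def collect(results):
    """Aggregate per-item results into one batch for a single review."""
    return {"count": len(results), "items": results}


def review(batch, agent):
    """Judgment part: one agent call per batch, not one per item."""
    return agent(batch)


# Made-up data and a stand-in agent, for illustration only.
RESTARTS = {"default": 0, "payments": 7, "kube-system": 1}
agent_calls = []


def fake_check(ns):
    return {"restarts": RESTARTS[ns]}


def fake_agent(batch):
    agent_calls.append(batch["count"])
    return [item["namespace"] for item in batch["items"] if item["restarts"] > 3]


batch = collect(run_checks(list(RESTARTS), fake_check))
flagged = review(batch, fake_agent)
```

Three namespaces are checked, but the agent is invoked once, which is the cost profile the hybrid pattern is after.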

Why It Happens

The paper’s deeper argument is about where intelligence belongs in an operational system.

In a live agent loop, every tool call keeps the model in the control path. That means every run inherits the model’s strengths and weaknesses: flexible reasoning, but also latency, cost, stochastic behavior, context growth, malformed calls, and occasional drift.

For a new task, those tradeoffs can be acceptable. The model is exploring. It needs freedom.

For a repeated task, those tradeoffs become less attractive. Once the right procedure is known, the system benefits from making the procedure inspectable, versionable, and deterministic.

The MCP Mediator pattern is interesting because it keeps the agent interface intact. From the agent’s perspective, run_workflow is just another tool. But under that one tool call sits a whole workflow runtime that can route calls to multiple downstream MCP servers.

That is the shift: the agent stops being the runtime and becomes the designer of runtime artifacts.

What This Means for Builders

Builders should look for repeatable agent loops that can be compiled into workflows.

The obvious candidates are high-volume operational tasks: infrastructure syncs, security scans, recurring compliance checks, log collection, ticket enrichment, customer-support routing, data-lineage refreshes, and internal system reconciliation.

The practical test is straightforward. If the agent is making the same sequence of tool calls with different parameters, it probably should not be reasoning through the sequence every time.
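
One way to apply that test is to look at recorded agent traces and count how often the same ordered sequence of tool names recurs once the parameters are ignored. The trace format, the tool names, and the min_runs threshold below are assumptions for illustration, not something the paper specifies.

```python
from collections import Counter


def tool_signature(trace):
    """Reduce a trace to its ordered tool names, ignoring the parameters."""
    return tuple(call["tool"] for call in trace)


def compile_candidates(traces, min_runs=3):
    """Return (signature, count) pairs that recur at least min_runs times."""
    counts = Counter(tool_signature(trace) for trace in traces)
    return [(sig, n) for sig, n in counts.most_common() if n >= min_runs]


def triage_trace(namespace):
    # Same three tools every time; only the parameters change.
    return [
        {"tool": "list_pods", "args": {"namespace": namespace}},
        {"tool": "get_logs", "args": {"pod": f"{namespace}-api-0"}},
        {"tool": "create_ticket", "args": {"title": f"errors in {namespace}"}},
    ]


traces = [
    triage_trace("payments"),
    triage_trace("search"),
    triage_trace("checkout"),
    [{"tool": "list_pods", "args": {"namespace": "default"}},
     {"tool": "describe_node", "args": {"node": "node-1"}}],
]
candidates = compile_candidates(traces)
```

Here the three-step triage sequence shows up three times with different parameters, so it is flagged as a candidate for compilation; the one-off exploratory trace is not.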

This does not mean every agent product needs a custom workflow engine. But it does suggest a useful architecture:

1. Use the agent to discover tools and design the process.
2. Validate the generated workflow against live tool schemas.
3. Save the workflow as a versioned artifact.
4. Execute repeat runs through a deterministic runtime.
5. Bring the agent back only for exceptions, ambiguous results, or redesign.

That pattern also improves auditability and change control. A JSON workflow can be reviewed, diffed, tested, scheduled, and rolled back. A free-form agent conversation is harder to govern.
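
The validate-then-save part of that architecture might look something like the sketch below. It assumes a simplified tool schema with only a "required" list, standing in for the input schema an MCP server reports for each tool, and it uses a content hash as the version identifier. The function names, the store layout, and the tool names are hypothetical.

```python
import hashlib
import json


def validate(steps, tool_schemas):
    """Check each step against the known tool schemas; return a list of problems."""
    problems = []
    for index, step in enumerate(steps):
        schema = tool_schemas.get(step["tool"])
        if schema is None:
            problems.append(f"step {index}: unknown tool {step['tool']}")
            continue
        missing = set(schema.get("required", [])) - set(step.get("args", {}))
        if missing:
            problems.append(f"step {index}: missing args {sorted(missing)}")
    return problems


def version_id(workflow):
    """Content hash: identical blueprints share an id, any edit changes it."""
    canonical = json.dumps(workflow, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def save(store, name, workflow, tool_schemas):
    """Refuse to store a workflow that fails validation; keep every version."""
    problems = validate(workflow["steps"], tool_schemas)
    if problems:
        raise ValueError("; ".join(problems))
    vid = version_id(workflow)
    store.setdefault(name, {})[vid] = workflow
    return vid


schemas = {"graph.upsert_node": {"required": ["label", "name"]}}
good = {"name": "sync", "steps": [
    {"tool": "graph.upsert_node", "args": {"label": "Namespace", "name": "default"}},
]}
bad = {"name": "sync", "steps": [
    {"tool": "graph.upsert_node", "args": {"label": "Namespace"}},
    {"tool": "graph.drop_all", "args": {}},
]}

store = {}
vid = save(store, "sync", good, schemas)
problems = validate(bad["steps"], schemas)
```

Because the version id is derived from the content, two reviewers looking at the same id are looking at the same workflow, and a rollback is just a lookup of an earlier id.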

The engineering challenge is to keep the DSL small. If the workflow language becomes a second programming language, the system inherits the complexity it was meant to avoid.

What This Means for Buyers and Operators

For buyers, the paper gives a useful way to pressure-test agent automation claims.

Ask whether the vendor’s agent repeats reasoning on every run or turns successful workflows into reusable artifacts. Ask whether those artifacts can be inspected, versioned, tested, and executed without the model in the inner loop.

The difference matters for cost, latency, and reliability. A vendor demo can look fine when the agent runs one workflow once. The real question is what happens when the same workflow runs every hour across hundreds of accounts, clusters, tickets, repositories, or customers.

Operators should also care about where the model remains in control. For deterministic execution, the workflow runtime should be responsible for retries, parameter resolution, tool routing, and logs. The model should be brought back in only when the system hits a case that actually requires judgment.
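
As one example of what "the runtime owns retries and logs" can mean, here is a minimal retry wrapper with exponential backoff and a structured attempt log. TransientError, the default attempt count, and the delay schedule are assumptions of this sketch; the paper does not specify a retry policy at this level of detail.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, 503s)."""


def call_with_retry(tool, args, attempts=3, base_delay=0.5,
                    sleep=time.sleep, log=None):
    """Call a tool, retrying transient failures with exponential backoff."""
    log = log if log is not None else []
    for attempt in range(1, attempts + 1):
        try:
            result = tool(**args)
            log.append({"attempt": attempt, "status": "ok"})
            return result
        except TransientError as exc:
            final = attempt == attempts
            log.append({"attempt": attempt,
                        "status": "failed" if final else "retry",
                        "error": str(exc)})
            if final:
                raise  # the point where an agent or operator is brought back in
            sleep(base_delay * 2 ** (attempt - 1))


# A fake tool that times out twice and then succeeds.
seen = []


def flaky_sync(cluster):
    seen.append(cluster)
    if len(seen) < 3:
        raise TransientError("upstream timeout")
    return f"synced {cluster}"


delays, log = [], []
result = call_with_retry(flaky_sync, {"cluster": "prod"},
                         sleep=delays.append, log=log)
```

Passing sleep as a parameter keeps the example fast and testable; in a real runtime it would default to an actual wait. Non-transient exceptions are deliberately not caught, so they surface immediately.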

This is also a governance point. If an agent generates a workflow, that workflow should not silently become production automation. It should pass validation, review, and permission checks. The compiled artifact is powerful because it can run repeatedly. That makes approval and audit more important, not less.

What to Watch Next

The first thing to watch is whether MCP workflow engines become a common layer in agent stacks.

MCP standardizes tool access. A mediator-style workflow runtime standardizes repeated orchestration across tools. Those two pieces fit naturally together.

The second thing to watch is workflow validation. The paper includes structural validation and live-tool warnings, but production systems will need stronger controls: permissions, sandboxing, schema drift checks, simulation, dry runs, approval gates, and rollback.

The third thing to watch is hybrid execution. The most useful systems may not be pure agent or pure workflow. They may run deterministic sub-workflows, collect structured results, and then ask an agent to reason over the batch.

Finally, watch the boundary around conditionals. The paper intentionally leaves out runtime branching. That keeps the engine understandable, but many real workflows need bounded branching. The hard problem is adding enough control flow without recreating an ungoverned programming language.

Limitations and Caveats

The evaluation is narrow. It centers on one production-style Kubernetes CMDB synchronization task. That is a strong fit for the architecture because the workflow is repeated, structured, and largely deterministic, so the results say little about how the approach performs on less structured tasks.

The token savings are estimates based on per-step assumptions rather than exact measurements from a specific model, tokenizer, and agent framework. The estimates may be directionally right while still being imprecise.

The current implementation is single-process. Distributed execution, stronger scheduling semantics, and multi-engine coordination are listed as future work.

The DSL also does not support conditionals. That is a deliberate constraint, but it means tasks with substantial runtime branching still need an agent or another orchestration layer.

The broader lesson is not that agents should disappear from operations. It is that agents should not stay in places where the work has already become a known procedure.

Source

Parmar, Abhinav Singh. (2026). Separating Intelligence from Execution: A Workflow Engine for the Model Context Protocol. arXiv preprint arXiv:2605.00827. Available at: https://arxiv.org/abs/2605.00827
