Agent Coding Costs Hide in Review, Not Generation

The expensive part of agentic software engineering may not be initial code generation. It may be the repeated review, refinement, and verification loops around the code.

Source note: Mohamad Salim, Jasmine Latendresse, SayedHassan Khatoonabadi, Emad Shihab. “Tokenomics: Quantifying Where Tokens Are Used in Agentic Software Engineering.” arXiv:2601.14470, January 20, 2026. https://arxiv.org/abs/2601.14470

Why This Paper Matters

The software industry is actively exploring multi-agent systems built on large language models to automate complex tasks across the entire software development life cycle. The theoretical appeal is clear. Instead of relying on a single model prompt to write an entire application, organizations can deploy specialized autonomous agents that simulate a complete engineering team. You might have one agent acting as the product manager outlining requirements, another acting as the software architect designing the system, a third acting as the programmer writing the source code, and a fourth acting as the quality assurance tester writing unit tests. These agents collaborate, debate, and divide the work, promising more autonomy, more division of labor, and better scaling to problems beyond what a single prompt can usually handle.

However, the practical adoption of these multi-agent systems is currently bottlenecked by a fundamental lack of operational visibility. Specifically, their resource consumption remains poorly understood. When you deploy a team of autonomous agents to build a software feature, you are initiating a cascade of automated interactions with a large language model API. Every interaction consumes tokens. Tokens dictate the financial cost of the system, they dictate the latency of the workflow, and they drive the underlying environmental impact through compute cycles. If operators cannot accurately model or predict how many tokens an agentic workflow will consume, deploying these systems in a production environment becomes an unpredictable financial risk.

This research paper introduces the concept of “tokenomics” within the specific domain of agentic software engineering. It represents the first empirical attempt to trace exactly where tokens are consumed across distinct software engineering activities. By analyzing the execution traces of an agentic system, the researchers have created a preliminary cost map of the automated software development life cycle. This matters because it shifts the conversation from the theoretical capabilities of autonomous agents to the practical realities of their operational efficiency. If teams do not know where the tokens go, they cannot budget for agentic systems or optimize them intelligently.

The Idea in Plain English

Imagine you hire a team of human software engineers to build a new application. They spend a few hours planning the architecture, a few days writing the initial code, and then potentially weeks reviewing the code, finding bugs, passing the code back and forth, and verifying that the final product works exactly as intended. The communication and verification phases often take significantly more time and energy than the initial drafting phase.

The researchers wanted to know if autonomous artificial intelligence agents behave in the same way. When you tell a multi-agent framework to build a piece of software, it spins up virtual roles that talk to each other. The product manager agent sends a specification to the coder agent. The coder agent writes a script and sends it to the reviewer agent. The reviewer agent finds a flaw and sends it back to the coder agent for revisions.

Because large language models are stateless, every time these agents talk to each other, they cannot simply reference a shared memory. They must pass the entire context of the conversation and the entire current state of the codebase back and forth as input tokens. The researchers wanted to measure this exact mechanism. In this virtual assembly line, which specific station consumes the most resources? Does the system spend its tokens planning the application, writing the initial code, or arguing over the code review? By tracking the raw token consumption across every phase of development, the researchers aimed to quantify the exact cost of agentic collaboration.

What the Researchers Tested

To understand token consumption patterns, the research team designed an empirical study using an open-source large language model multi-agent framework called ChatDev. ChatDev is widely cited because it operates on a “chat chain” architecture that simulates a virtual software company. The agents follow a clear, sequential waterfall model, which makes the distinct phases of their work relatively easy to isolate and analyze.

The researchers executed ChatDev on 30 distinct software development tasks. The prompts for these tasks were sourced from the ProgramDev Dataset, which ensures a diversity of complexity. The tasks ranged from basic algorithmic problems, such as generating the Fibonacci sequence, to more complex applications, such as coding a functional chess game. This diversity in task complexity was reflected in the reasoning token consumption, which varied widely from 17,280 tokens to 40,000 tokens across the different runs.

For the underlying intelligence, every agent in the simulation was powered by the GPT-5 reasoning model, listed in the paper as gpt-5-2025-08-07. This model features a large context window of 400,000 tokens and a maximum output of 128,000 tokens. The temperature parameter, which controls the randomness of the output, is immutable on this model and remained at its default value of 1.0.

The core methodological innovation of the study was mapping ChatDev’s internal, framework-specific operations to universally understood software development life cycle stages. The researchers instrumented the framework to capture every single API call, logging the prompt, the response, and the exact count of input, output, and reasoning tokens. They then mapped these logs into six standardized development stages:

1. Design: Understanding requirements and making high-level technical decisions.
2. Coding: Writing the initial source code.
3. Code Completion: Finishing placeholder or incomplete files.
4. Code Review: The iterative dialogue between programmer and reviewer agents to review and modify or refine the code.
5. Testing: Dynamic system testing to locate and fix executability bugs.
6. Documentation: Generating user manuals and environment dependencies.

By aggregating the token counts across these six stages for all 30 tasks, the researchers built a quantitative model of agentic resource consumption.
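The aggregation step can be sketched as follows. This is a minimal illustration, not the authors' tooling: the `CallLog` schema, the `mean_stage_share` helper, and the numbers are all hypothetical, and it assumes each stage's share is computed per task and then averaged only over the tasks in which that stage ran.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class CallLog:
    """One logged API call (hypothetical schema, not ChatDev's own)."""
    task: str
    stage: str
    input_tokens: int
    output_tokens: int
    reasoning_tokens: int

    @property
    def total(self) -> int:
        return self.input_tokens + self.output_tokens + self.reasoning_tokens


def mean_stage_share(logs: list[CallLog]) -> dict[str, float]:
    """Average each stage's percent share of a task, over tasks where it ran."""
    # Sum tokens per (task, stage).
    per_task = defaultdict(lambda: defaultdict(int))
    for log in logs:
        per_task[log.task][log.stage] += log.total

    # Convert to per-task percentages, grouped by stage.
    shares = defaultdict(list)
    for stages in per_task.values():
        task_total = sum(stages.values())
        for stage, tokens in stages.items():
            shares[stage].append(100.0 * tokens / task_total)

    return {stage: sum(vals) / len(vals) for stage, vals in shares.items()}


# Made-up numbers for two tasks; only the second one triggers Testing.
logs = [
    CallLog("fibonacci", "Coding", 100, 700, 200),
    CallLog("fibonacci", "Code Review", 2000, 500, 500),
    CallLog("chess", "Coding", 200, 1200, 600),
    CallLog("chess", "Code Review", 3000, 600, 400),
    CallLog("chess", "Testing", 1500, 300, 200),
]
result = mean_stage_share(logs)
print(result)
```

Under this averaging scheme the per-stage means need not add up to 100%, because a stage that runs in only some tasks is averaged over a smaller set of runs than a stage that runs in all of them.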

What They Found

The researchers discovered an uneven distribution of token usage across the automated development process, revealing that different engineering activities possess distinct resource profiles.

The Code Review Stage Dominates Token Consumption

The most significant finding is that the largest share of tokens is spent not on creating software, but on reviewing it. The Code Review stage was the largest consumer of resources, accounting for an average of 59.4% of all tokens consumed across the 30 tasks. The Code Completion phase, which was triggered in 6 of the 30 tasks, was the second most expensive, averaging 26.8% of tokens in those specific runs. Documentation averaged 20.1% of tokens, and Testing averaged 10.3% across the 12 tasks where it occurred. Because some stages are averaged only over the runs in which they occurred, the stage averages sum to more than 100% and should not be read as slices of a single total.

In stark contrast, the initial generative phases were comparatively inexpensive. The Coding stage consumed an average of only 8.6% of the total tokens, and the Design stage consumed a mere 2.4%. The data suggests that the primary cost of agentic software engineering is concentrated in the iterative, conversational process of automated refinement and verification.

Token Consumption is Dominated by Input Tokens

When breaking down the types of tokens consumed, the researchers found a consistent pattern across almost all phases: input tokens far exceed both output and reasoning tokens. On average, the overall token usage for a single task consisted of 53.9% input tokens, 24.4% output tokens, and 21.6% reasoning tokens.

This approximate ratio of two input tokens for every one output token provides strong empirical evidence for what the researchers term a “communication tax.” Because the agents are collaborating through dialogue, they must repeatedly pass large contexts between one another to maintain shared understanding. The majority of the computational resources are spent communicating existing context, rather than generating novel output.
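The ratio can be checked directly from the reported averages. The snippet below only restates the arithmetic; the second ratio, which compares input tokens against all generated tokens (output plus reasoning), is an additional framing added here rather than a figure quoted from the paper.

```python
# Average per-task token mix reported in the paper (percent of all tokens).
mix = {"input": 53.9, "output": 24.4, "reasoning": 21.6}

# Input tokens per output token: the "two to one" ratio discussed above.
input_per_output = mix["input"] / mix["output"]

# Input tokens per generated token, counting reasoning tokens as generated.
input_per_generated = mix["input"] / (mix["output"] + mix["reasoning"])

print(f"input : output    = {input_per_output:.2f} : 1")
print(f"input : generated = {input_per_generated:.2f} : 1")
```

Even when reasoning tokens are counted alongside output, the tokens the model reads still outnumber the tokens it produces.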

Software Development Stages Exhibit Distinct Tokenomic Profiles

The analysis revealed that different stages of the software development life cycle exhibit unique tokenomic fingerprints. The Coding phase is a notable outlier; it is output-heavy, consisting of 58.0% output tokens versus only 6.9% input tokens. This aligns with intuition, as the coding agent takes a concise design specification and expands it into verbose source code.

Conversely, the verification and documentation phases are extremely input-heavy. The Code Review stage consists of 51.4% input tokens, and the Documentation stage consists of 80.2% input tokens. These phases require the agents to ingest and process large amounts of existing code as context merely to produce relatively small, analytical outputs, such as a localized bug fix or a concise user manual.

Why It Happens

The root cause of this high token consumption during the verification stages lies in the inherent conversational architecture of current large language model multi-agent systems. When a programmer agent and a code reviewer agent collaborate to fix a bug, they do not simply exchange a few lines of relevant text. Because the underlying large language models are stateless, the agents must iteratively pass the full code context back and forth during every turn of their dialogue.

If the reviewer agent identifies a flaw, it must pass the entire codebase along with its critique to the programmer agent. The programmer agent must then ingest that large input, reason through the problem, generate the corrected code, and pass the newly modified codebase back to the reviewer agent to verify the fix. If the system is struggling to resolve a complex bug or is trapped in a loop of step repetition, this large payload of input tokens is passed back and forth repeatedly.

The high token usage is essentially a symptom of the multi-agent system attempting to overcome coordination challenges and verification failures through brute-force dialogue. This is the “Cost of Conversation.” The agents are taxing the system not because the reasoning is particularly deep, but because the collaboration protocol relies on naive, full-context passing. They are reading the entire book every time they want to discuss a single sentence.
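A toy model makes the growth pattern concrete. It is a deliberate simplification with hypothetical names and numbers: it assumes the codebase size stays constant across rounds and ignores system prompts and accumulated chat history, both of which would only push the totals higher.

```python
def review_loop_input_tokens(
    codebase_tokens: int, critique_tokens: int, rounds: int
) -> int:
    """Input tokens consumed when every turn resends the full codebase."""
    total = 0
    for _ in range(rounds):
        # Reviewer turn: reads the whole codebase to produce a critique.
        total += codebase_tokens
        # Programmer turn: reads the whole codebase plus the critique.
        total += codebase_tokens + critique_tokens
    return total


# Hypothetical sizes: a 5,000-token codebase and 200-token critiques.
for rounds in (1, 5, 10):
    tokens = review_loop_input_tokens(5_000, 200, rounds)
    print(f"{rounds:>2} rounds -> {tokens:,} input tokens")
```

In this model the input cost scales with the number of rounds multiplied by the size of the codebase, regardless of how small each individual fix is.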

What This Means for Builders

For engineers and researchers building agentic frameworks, this paper serves as a clear signal to optimize collaboration protocols. The current reliance on conversational architectures that repeatedly pass full code contexts is expensive and inefficient.

Builders must focus on developing more token-efficient methods for agent coordination, particularly during the verification and refinement stages. This might involve moving away from chat-based interfaces toward more structured, differential protocols. For example, instead of passing the entire file, agents could be engineered to exchange localized patch files, syntax trees, or specific contextual differentials.
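As one illustration of a differential exchange, the sketch below uses Python's standard `difflib` module to produce a unified diff that a reviewer agent could send instead of the full file. The `make_patch` helper and the sample source are hypothetical, this is not how ChatDev communicates, and character counts are used only as a rough stand-in for token counts.

```python
import difflib


def make_patch(old: str, new: str, path: str = "main.py") -> str:
    """Return a unified diff between two versions of a file."""
    diff = difflib.unified_diff(
        old.splitlines(keepends=True),
        new.splitlines(keepends=True),
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
        n=1,  # keep one line of surrounding context per change
    )
    return "".join(diff)


# A synthetic 50-function file with a one-line fix applied to f7.
old_source = "\n".join(f"def f{i}(x):\n    return x + {i}\n" for i in range(50))
new_source = old_source.replace("return x + 7\n", "return x + 70\n")

patch = make_patch(old_source, new_source)
print(patch)
print(f"full file: {len(old_source)} chars, patch: {len(patch)} chars")
```

The receiving agent would still need enough context to apply and judge the patch, so the practical savings depend on how much surrounding code the protocol chooses to include.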

Furthermore, builders should explore architectures that do not rely strictly on hierarchical, conversational workflows. Evaluating whether assembly lines based on standard operating procedures are more token-efficient than open-ended agent debates will be critical. The mapping methodology introduced in this paper provides a standardized evaluation framework that builders can use to benchmark the efficiency of new architectures against existing baselines. The goal must be to reduce the communication tax without degrading the quality of the final software output.

What This Means for Buyers and Operators

For technology leaders, engineering managers, and operators looking to deploy agentic software engineering systems, this research changes how you must approach cost prediction and workflow design. You can no longer estimate the cost of an agentic project based solely on the size of the final codebase.

Because the tokenomic profiles vary sharply by activity, the cost of deployment will depend heavily on the type of engineering work being performed. Greenfield projects, which are weighted toward the inexpensive Coding stage, will have a different and likely more predictable cost structure than legacy modernization or refactoring projects. Projects that require modifying existing codebases will trigger the expensive, input-heavy Code Review cycle, where the system will consume large amounts of tokens simply reading the existing architecture.

Operators should leverage these distinct tokenomic profiles to create accurate cost maps before deployment. More importantly, operators should use this data to intervene in the workflow to maximize economic efficiency. For instance, the researchers suggest introducing a “human-in-the-loop” checkpoint immediately before the automated Code Review phase. By having a human engineer validate the initial output before allowing the agents to engage in their expensive, automated verification loops, organizations can prevent runaway token consumption and ensure that the costly communication tax is only paid when strictly necessary.
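The checkpoint idea can be sketched as a simple gate in front of the review stage. Everything here is hypothetical: `run_with_checkpoint` and its three callables are placeholders for whatever the agent framework and approval workflow actually provide, and the sketch assumes that a rejected draft is handed back to a human rather than sent into automated review.

```python
from typing import Callable


def run_with_checkpoint(
    generate: Callable[[], str],
    approve: Callable[[str], bool],
    review: Callable[[str], str],
) -> tuple[str, bool]:
    """Run generation, then enter the review loop only if a human approves.

    Returns the final artifact and whether the review loop ran.
    """
    draft = generate()          # cheap stages: Design and Coding
    if not approve(draft):      # human checkpoint before the expensive stage
        return draft, False     # hand the draft back without review spend
    return review(draft), True  # expensive stage: automated Code Review


# Stand-in callables; a real deployment would call the agent framework here.
final, reviewed = run_with_checkpoint(
    generate=lambda: "print('hello')",
    approve=lambda draft: len(draft) > 0,
    review=lambda draft: draft + "  # reviewed",
)
print(final, reviewed)
```

The design choice is that the gate sits between stages rather than inside the review loop, so a single human decision controls whether any review tokens are spent at all.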

What to Watch Next

This paper is a preliminary study that establishes a useful baseline for the emerging field of tokenomics in software engineering. The most immediate next step is for researchers to expand the dataset significantly beyond 30 tasks to ensure that these consumption patterns hold true across a much wider variety of software development scenarios and complexities.

We should also expect to see this analysis extended across different foundation models. While the GPT-5 reasoning model exhibits this specific tokenomic profile, it is important to understand whether models designed with different attention mechanisms, or those optimized specifically for code generation, incur the communication tax differently.

Furthermore, the research community must apply this standardized mapping framework to other popular multi-agent architectures to conduct comparative efficiency studies. Comparing the tokenomics of a conversational system like ChatDev against structured, pipeline-driven agent frameworks will reveal how architectural design choices directly impact operational costs. Finally, future research must investigate the direct correlation between these high token consumption patterns and actual system failure modes. Understanding whether a large spike in input tokens during the Code Review phase is a reliable indicator that the agents are stuck in an unproductive loop could lead to automated circuit breakers that halt inefficient execution before costs spiral out of control.
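A circuit breaker of this kind could be as simple as the sketch below. It is a hypothetical illustration with arbitrary thresholds: whether token spikes actually signal an unproductive loop is exactly the open question raised above, so this version only enforces a cumulative input-token budget and a round limit.

```python
from dataclasses import dataclass


@dataclass
class ReviewCircuitBreaker:
    """Stop a review loop once a token budget or round limit is exceeded."""
    max_input_tokens: int
    max_rounds: int
    spent: int = 0
    rounds: int = 0

    def allow(self, round_input_tokens: int) -> bool:
        """Record one review round; return False once a limit is exceeded."""
        self.spent += round_input_tokens
        self.rounds += 1
        within_budget = self.spent <= self.max_input_tokens
        within_rounds = self.rounds <= self.max_rounds
        return within_budget and within_rounds


# Hypothetical limits: 30,000 input tokens or 5 rounds, whichever comes first.
breaker = ReviewCircuitBreaker(max_input_tokens=30_000, max_rounds=5)
decisions = [breaker.allow(12_000) for _ in range(3)]
print(decisions)
```

A production version would likely also look at whether successive rounds are changing the code at all, but that signal is not described in the source and is left out here.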

Limitations and Caveats

While the findings present a compelling initial look at agentic resource consumption, the researchers are precise about the scoped and preliminary nature of the study. Several important limitations must be considered.

First, the empirical analysis is based entirely on a single multi-agent system architecture (ChatDev) and a single underlying large language model (the GPT-5 reasoning model). The token consumption patterns observed here are closely tied to the specific “chat chain” mechanics of ChatDev and the specific token efficiency of the GPT-5 model. Different multi-agent architectures or different foundation models may yield entirely different tokenomic profiles.

Second, the dataset of 30 software development tasks, while diverse in complexity, represents a relatively small sample size that does not capture the full spectrum of enterprise software development scenarios. The researchers note that this limitation is a direct result of the current lack of public, large-scale benchmarks for software engineering specific agent execution traces.

Third, the sample sizes for certain specific development stages are limited. Because the autonomous agents dynamically decide which phases are necessary for a given task, the Code Completion stage only occurred in 6 of the 30 tasks, and the Testing stage only occurred in 12 of the 30 tasks. The tokenomic profiles drawn for these specific activities are therefore based on a much smaller subset of data, which may limit their broader generalizability.

Finally, the core methodological step of mapping ChatDev’s internal phases to standard software development life cycle stages is an abstraction. While it provides a useful standardized framework for analysis, it remains one of several possible interpretations of how autonomous agents divide their labor.

Source

Mohamad Salim, Jasmine Latendresse, SayedHassan Khatoonabadi, Emad Shihab. “Tokenomics: Quantifying Where Tokens Are Used in Agentic Software Engineering.” arXiv:2601.14470, January 20, 2026. https://arxiv.org/abs/2601.14470