The paper’s practical point: an AI agent is more than a model with a longer prompt. It is a reasoning loop that plans, acts, checks feedback, updates state, and sometimes coordinates with other agents.
Source note: Tianxin Wei, Ting-Wei Li, Zhining Liu, Xuying Ning, Ze Yang, Jiaru Zou, Zhichen Zeng, Ruizhong Qiu, Xiao Lin, Dongqi Fu, Zihao Li, Mengting Ai, Duo Zhou, Wenxuan Bao, Yunzhe Li, Gaotang Li, Cheng Qian, Yu Wang, Xiangru Tang, Yin Xiao, Liri Fang, Hui Liu, Xianfeng Tang, Yuji Zhang, Chi Wang, Jiaxuan You, Heng Ji, Hanghang Tong, Jingrui He. “Agentic Reasoning for Large Language Models.” arXiv:2601.12538, 2026-01-18. https://arxiv.org/abs/2601.12538
Why This Paper Matters
“Agent” is one of those words that became useful and sloppy at the same time.
Sometimes it means a chatbot that can call a tool. Sometimes it means a browser automation script. Sometimes it means a coding system that plans, edits, tests, and retries. Sometimes it means a team of specialized model instances passing messages to each other.
This survey tries to put a more precise shape around the category.
Its argument is that the important shift is not simply a rename from “LLM” to “agent.” The shift is away from static reasoning and toward agentic reasoning. Static reasoning asks a model to produce an answer from a fixed prompt. Agentic reasoning asks a system to keep reasoning while it interacts with an environment.
That distinction matters for builders because it changes the unit of design. The unit is not the prompt. The unit is the loop.
The loop observes something, forms an intermediate plan, chooses an action, calls a tool or searches an environment, reads the result, updates state, and decides what to do next. If the task is long enough, the loop also needs memory. If the environment changes, it needs feedback and adaptation. If the work is too broad for one agent, it needs coordination across roles.
That is a different engineering problem from writing a good instruction.
It is closer to building a small operating system around model calls: state, tools, permissions, validators, retry logic, memory, traces, and evaluation.
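The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's method: LoopState, run_loop, toy_policy, and toy_tools are hypothetical names, and the policy is a hand-written function standing in for a model call.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class LoopState:
    """State carried between iterations of the loop."""
    goal: str
    observations: List[str] = field(default_factory=list)
    steps_taken: int = 0
    done: bool = False


# Hypothetical stand-in for a model call: maps state to (tool_name, argument).
Policy = Callable[[LoopState], Tuple[str, str]]
Tool = Callable[[str], str]


def run_loop(goal: str, policy: Policy, tools: Dict[str, Tool],
             max_steps: int = 5) -> LoopState:
    """Plan, act, read the result, update state, and stop on 'finish' or at the step cap."""
    state = LoopState(goal=goal)
    while state.steps_taken < max_steps:
        tool_name, argument = policy(state)       # choose the next action
        if tool_name == "finish":
            state.done = True
            break
        tool = tools.get(tool_name)
        if tool is None:
            result = f"error: unknown tool {tool_name}"
        else:
            result = tool(argument)               # act on the environment
        state.observations.append(result)         # read the result, update state
        state.steps_taken += 1
    return state


def toy_policy(state: LoopState) -> Tuple[str, str]:
    # Toy behavior: look something up once, then stop.
    if not state.observations:
        return ("lookup", state.goal)
    return ("finish", "")


toy_tools: Dict[str, Tool] = {"lookup": lambda query: f"note about {query}"}
final_state = run_loop("capital of France", toy_policy, toy_tools)
print(final_state.done, final_state.observations)
```

The step cap is the one design choice worth noting: without it, a policy that never returns "finish" would run forever, which is why the loop, not the prompt, is the thing being designed.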
The Idea in Plain English
The simplest version of the paper is this: reasoning becomes agentic when it leaves the single model call and starts living across actions.
A normal LLM reasoning task looks like this:
The user asks a question. The model thinks inside the context window. The model produces an answer.
An agentic reasoning task looks more like this:
The user gives a goal. The system breaks it into steps. It decides what it needs to know. It searches. It calls tools. It writes or reads memory. It checks whether the output worked. It revises the plan. It may ask another agent to critique, execute, or verify part of the work. The final answer is just the last visible part of a longer interaction.
The paper’s taxonomy has three layers.
The first layer is foundational agentic reasoning. This is the base layer: planning, tool use, and search. An agent needs to decompose goals, pick actions, use external functions, retrieve information, and verify results.
The second layer is self-evolving agentic reasoning. This is where the system improves across attempts. It can learn from feedback, maintain memory, revise procedures, reuse skills, or adapt its behavior after errors.
The third layer is collective multi-agent reasoning. This is where reasoning is distributed across multiple agents. One agent may plan, another may execute, another may critique, another may retrieve evidence, and another may arbitrate disagreements.
Across those layers, the paper draws another useful distinction.
In-context reasoning improves behavior at inference time. The model weights stay fixed, but the system uses orchestration, planning, search, examples, memory, and tool feedback to get better results during the run.
Post-training reasoning improves behavior by changing the model or policy. This includes supervised fine-tuning, reinforcement learning, and training methods that internalize better planning, tool use, search, or coordination.
That split is useful because it separates two questions teams often blur together: “Can we get this agent to behave better through better runtime design?” and “Do we need to train the underlying model or policy?”
What the Researchers Tested
This is a survey paper, not a new benchmark or model release.
The authors review a large body of work on LLM agents, reasoning, tool use, planning, search, memory, feedback, multi-agent coordination, applications, and benchmarks. Their main contribution is a map of the field.
The paper starts by contrasting traditional LLM reasoning with agentic reasoning. Traditional LLM reasoning is described as passive, static, single-pass, context-window-bound, offline-trained, and prompt-reactive. Agentic reasoning is interactive, dynamic, multi-step, stateful, capable of improvement, and goal-oriented.
The survey then organizes the field around mechanisms.
For foundational agents, it reviews planning methods, tool-use optimization, and agentic search. This covers systems that decompose tasks, invoke APIs, browse, retrieve, run code, and use external results as part of the reasoning process.
For self-evolving agents, it reviews feedback mechanisms, memory systems, and agents that improve their planning, tool use, or search over time. The important point is that an agent can evolve through more than model weights. It can also evolve through written memories, learned procedures, generated tools, workflow changes, or updated state.
For multi-agent systems, it reviews role taxonomies, division of labor, collaboration patterns, multi-agent memory, and attempts to train collaboration itself rather than hand-designing every role and communication pattern.
The paper also surveys applications and benchmarks across math, coding, scientific discovery, robotics, healthcare, autonomous web research, tool-use tasks, memory tasks, planning tasks, and multi-agent settings.
Its value is breadth and organization. It does not prove that one agent architecture is best. It explains why many seemingly separate techniques belong to the same systems problem.
What They Found
The paper’s most useful finding is not a number. It is a reframing.
Agentic reasoning is the connective tissue between planning, tools, search, memory, feedback, and collaboration.
Agentic Reasoning Is Interaction, Not Deeper Thought
The survey argues that reasoning is no longer only about producing better intermediate text before an answer.
Chain-of-thought, decomposition, and verifier-style methods helped models reason better inside a prompt. But agents push the reasoning process outside the prompt. The model’s reasoning can now affect the world through actions, and the world’s response can shape the next reasoning step.
That is a major shift.
It means the system can recover information it did not originally have. It can test assumptions. It can write and run code. It can inspect files. It can ask another component for review. It can use feedback from failed actions.
It also means failures are harder to reason about. A bad final answer may come from bad planning, wrong retrieval, stale memory, tool misuse, weak verification, bad role design, or the model itself.
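One way to make that attribution tractable is to record which stage each step belonged to. The sketch below is an assumption about how a trace could be structured, not something the survey prescribes; TraceEvent, first_failure, and the stage names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TraceEvent:
    step: int      # position in the run
    stage: str     # e.g. "plan", "retrieve", "tool", "verify"
    ok: bool       # whether the stage's own check passed
    detail: str    # short human-readable note


def first_failure(trace: List[TraceEvent]) -> Optional[TraceEvent]:
    """Return the earliest failed event, or None if every stage passed."""
    for event in sorted(trace, key=lambda e: e.step):
        if not event.ok:
            return event
    return None


example_trace = [
    TraceEvent(1, "plan", True, "split task into two steps"),
    TraceEvent(2, "retrieve", False, "returned a stale document"),
    TraceEvent(3, "tool", True, "ran calculator"),
    TraceEvent(4, "verify", False, "answer contradicted source"),
]
culprit = first_failure(example_trace)
print(culprit.stage if culprit else "no failure recorded")
```

Here the verifier failure at step 4 is a symptom; the earliest recorded failure points at retrieval instead of the model.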
Foundational Agents Need Three Base Mechanics
The paper treats planning, tool use, and search as the base mechanics of agentic reasoning.
Planning turns a goal into steps. Tool use turns reasoning into external action. Search lets the system explore information or solution spaces instead of relying only on what is already in the prompt.
These are not independent features. In real agents, they reinforce each other.
A coding agent plans a change, searches the codebase, edits files, runs tests, reads failures, and revises the plan. A research agent decomposes a question, searches sources, extracts evidence, checks contradictions, and synthesizes an answer. A workflow agent selects tools, executes actions, validates outputs, and decides whether to continue.
Calling a tool once is not the interesting part. The interesting part is the control loop around the tool.
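A minimal sketch of such a control loop, assuming a simple validate-and-retry policy: the tool call is wrapped so that errors and rejected results are logged and retried up to a cap. The names call_with_checks and flaky_tool are hypothetical, and the simulated timeout exists only to exercise the retry path.

```python
from typing import Callable, Dict, List, Optional, Tuple


def call_with_checks(tool: Callable[[str], str], argument: str,
                     validate: Callable[[str], bool],
                     max_attempts: int = 3) -> Tuple[Optional[str], List[str]]:
    """Call a tool, validate its result, and retry on errors or rejected output."""
    log: List[str] = []
    for attempt in range(1, max_attempts + 1):
        try:
            result = tool(argument)
        except Exception as exc:  # broad on purpose: any tool failure is logged and retried
            log.append(f"attempt {attempt}: error {exc}")
            continue
        if validate(result):
            log.append(f"attempt {attempt}: ok")
            return result, log
        log.append(f"attempt {attempt}: rejected by validator")
    return None, log


call_count: Dict[str, int] = {"n": 0}


def flaky_tool(argument: str) -> str:
    # Fails on the first call only, to simulate a transient tool error.
    call_count["n"] += 1
    if call_count["n"] == 1:
        raise TimeoutError("simulated timeout")
    return argument.upper()


result, log = call_with_checks(flaky_tool, "ok", validate=lambda r: r.isupper())
print(result, log)
```

Returning the log alongside the result keeps the retry history available for tracing, so a caller can tell a clean success from one that needed several attempts.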
Self-Evolving Agents Make Memory and Feedback First-Class
The survey’s self-evolving layer is important because it separates short-term task execution from longer-term improvement.
An agent that only reacts during one run may look capable, but it forgets the lesson as soon as the context disappears. A self-evolving agent tries to preserve useful experience.
That preservation can be simple: a textual reflection, a rule, a saved memory, or a cached solution. It can be more structured: a graph, a skill library, a tool inventory, a state store, or a learned retrieval policy. In more ambitious systems, the agent can modify procedures or train policies from feedback.
The paper’s practical message is that memory is part of the reasoning system, not a convenience feature.
Bad memory can make an agent worse. It can retrieve irrelevant lessons, preserve false beliefs, leak private context, or overfit to old user behavior. Useful memory needs selection, retrieval, updating, deletion, and evaluation.
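Those five operations can be made concrete with a small store. This is a sketch under stated assumptions: MemoryStore and MemoryItem are hypothetical, retrieval is naive word overlap in place of a real retrieval policy, and time is passed in explicitly so the behavior is easy to test.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class MemoryItem:
    text: str
    created_at: float
    ttl_seconds: float


class MemoryStore:
    def __init__(self) -> None:
        self._items: Dict[str, MemoryItem] = {}

    def write(self, key: str, text: str, now: float,
              ttl_seconds: float = 3600.0) -> bool:
        """Selection and updating: skip empty text; writing an existing key replaces it."""
        if not text.strip():
            return False
        self._items[key] = MemoryItem(text, now, ttl_seconds)
        return True

    def delete(self, key: str) -> bool:
        """Deletion: return True if something was actually removed."""
        return self._items.pop(key, None) is not None

    def retrieve(self, query: str, now: float) -> List[str]:
        """Retrieval: return unexpired items sharing at least one word with the query."""
        query_words = set(query.lower().split())
        hits: List[str] = []
        for item in self._items.values():
            if now - item.created_at > item.ttl_seconds:
                continue  # expired memories are never returned
            if query_words & set(item.text.lower().split()):
                hits.append(item.text)
        return hits


store = MemoryStore()
store.write("units", "user prefers metric units", now=0.0, ttl_seconds=100.0)
store.write("zone", "user timezone is UTC", now=0.0, ttl_seconds=10.0)
fresh = store.retrieve("which units does the user want", now=50.0)
print(fresh)
```

Evaluation is the piece this sketch leaves out: deciding whether a retrieved memory actually helped needs task-level measurement, not just a store API.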
Multi-Agent Systems Shift the Problem to Coordination
The paper treats multi-agent reasoning as more than “run several agents and combine the answers.”
Once multiple agents are involved, the design problem changes. The system needs roles, communication patterns, shared state, conflict resolution, and some way to decide when collaboration is worth the overhead.
Static role assignment can help. A planner, executor, critic, and verifier can produce better work than a single model call on complex tasks. But the paper also points toward a harder frontier: training or adapting the collaboration pattern itself.
That is where multi-agent systems become less like prompt templates and more like organizations.
Who owns the plan? Who can act? Who checks the work? What gets remembered? What happens when two agents disagree? How much communication is useful before it becomes waste?
Those are not abstract questions. They become product and infrastructure questions as soon as agents touch real workflows.
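A toy version of those questions, written as code, might look like the following. It is an illustrative sketch only: planner, executor, critic, and coordinate are hypothetical functions standing in for model-backed agents, and the shared dictionary stands in for whatever shared state a real system would use.

```python
from typing import Callable, Dict, List


def planner(task: str) -> List[str]:
    # Owns the plan: splits the task into ordered steps.
    return [f"draft {task}", f"review {task}"]


def executor(step: str) -> str:
    # The only role allowed to act.
    return f"done: {step}"


def critic(output: str) -> bool:
    # Checks the work before it is written to shared state.
    return output.startswith("done:")


def coordinate(task: str, execute: Callable[[str], str] = executor,
               max_attempts: int = 2) -> Dict[str, List[str]]:
    """Run plan -> execute -> critique, recording accepted and rejected steps."""
    shared: Dict[str, List[str]] = {"plan": planner(task), "results": [], "rejected": []}
    for step in shared["plan"]:
        for _ in range(max_attempts):
            output = execute(step)
            if critic(output):
                shared["results"].append(output)
                break
        else:
            # Every attempt was rejected: record the step, do not accept bad output.
            shared["rejected"].append(step)
    return shared


outcome = coordinate("report")
print(outcome["results"], outcome["rejected"])
```

Even at this size the ownership decisions are explicit: one role plans, one acts, one approves, and only approved output is remembered.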
Benchmarks Are Catching Up, But Still Lag Real Deployment
The survey reviews a large set of benchmarks: tool use, search, planning, memory, multi-agent systems, embodied agents, scientific discovery, web agents, medical agents, autonomous research agents, and general API use.
The direction is clear. Evaluation is moving away from static question answering and toward interactive tasks that require perception, planning, action, feedback, and persistence.
That is the right direction, but it also exposes the gap.
Real deployments are messy in ways benchmarks struggle to capture. Agents operate under permissions, privacy constraints, tool failures, stale data, ambiguous goals, changing user preferences, cost budgets, latency limits, and long-running state.
Benchmarks can isolate capabilities. Production systems still need operational evaluation.
Why It Happens
Agentic reasoning exists because model intelligence and workflow reliability are not the same thing.
A stronger model can improve many steps, but it does not remove the need for structure. In fact, stronger models often make the structure more important because they are trusted with broader tasks.
Once a model can write code, browse, call APIs, update records, coordinate with other agents, or remember user preferences, the system needs explicit boundaries. It needs to know what the agent can see, what it can do, how it decides, how it recovers, and how humans inspect the result.
The paper’s control-loop framing helps here.
An agent is a policy operating over observations, internal reasoning, actions, memory, and rewards or feedback. Some of that policy lives in the model. Some lives in orchestration code. Some lives in prompts. Some lives in tool schemas. Some lives in validators and permissions. Some lives in memory and state.
That is why agent quality is a systems property.
When agents fail, the failure often belongs to the loop as much as the model.
What This Means for Builders
The first builder implication is that “agentic” should be treated as an architecture claim, not a marketing adjective.
If a system is agentic, it should be possible to describe its loop. What does it observe? What state does it keep? What actions can it take? What tools can it call? What checks happen before and after actions? What memory can it write? What permissions constrain it? What does it do when an action fails?
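One way to test whether a loop can be described is to try writing it down as data. The sketch below assumes a very simple model: AgentSpec and authorize are hypothetical names, and the tool names are made up for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class AgentSpec:
    """A declarative description of what the agent sees and may do."""
    observes: FrozenSet[str]
    allowed_tools: FrozenSet[str]
    needs_approval: FrozenSet[str]
    writable_memory: FrozenSet[str]


def authorize(spec: AgentSpec, tool: str, approved: bool = False) -> str:
    """Check run before an action: returns 'allow', 'needs_approval', or 'deny'."""
    if tool not in spec.allowed_tools:
        return "deny"
    if tool in spec.needs_approval and not approved:
        return "needs_approval"
    return "allow"


support_agent = AgentSpec(
    observes=frozenset({"ticket_text", "order_history"}),
    allowed_tools=frozenset({"search_docs", "send_email"}),
    needs_approval=frozenset({"send_email"}),
    writable_memory=frozenset({"ticket_notes"}),
)

for tool_name in ("search_docs", "send_email", "issue_refund"):
    print(tool_name, authorize(support_agent, tool_name))
```

If a team cannot fill in a structure like this for its own system, that is a signal the loop has not been designed yet.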
The second implication is that runtime design and training design are different levers.
Many teams should start with in-context agentic reasoning: better orchestration, clearer state, better tool schemas, narrower permissions, retrieval quality, validation, and traceability. That is often faster and easier to audit than training a model.
Post-training makes sense when the desired behavior needs to become more reliable, faster, cheaper, or more natural than runtime scaffolding can provide. But it should not be used to hide unclear workflow design.
The third implication is that memory needs product management.
Memory is not a drawer where every interaction goes. It is an evolving part of the reasoning loop. Builders need policies for what gets stored, how it is retrieved, how it is corrected, when it expires, and how users inspect or override it.
The fourth implication is that multi-agent systems need an explicit coordination budget.
More agents can add diversity and specialization. They can also add latency, cost, duplicated work, unclear ownership, and new failure modes. Multi-agent design should earn its keep by improving quality, coverage, robustness, or speed on tasks where division of labor actually matters.
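A coordination budget can be as simple as a counter that refuses further messages once a cap is hit. This is a minimal sketch with hypothetical names (CoordinationBudget, charge) and arbitrary cost units; a real system would likely meter tokens, latency, or spend.

```python
from dataclasses import dataclass


@dataclass
class CoordinationBudget:
    max_messages: int
    max_cost: float
    messages: int = 0
    cost: float = 0.0

    def charge(self, message_cost: float) -> bool:
        """Record one inter-agent message; return False if it would exceed the budget."""
        over_messages = self.messages + 1 > self.max_messages
        over_cost = self.cost + message_cost > self.max_cost
        if over_messages or over_cost:
            return False
        self.messages += 1
        self.cost += message_cost
        return True


budget = CoordinationBudget(max_messages=3, max_cost=1.0)
sent = [budget.charge(0.4) for _ in range(4)]
print(sent, budget.messages)
```

In this run the cost cap binds before the message cap: the third message would bring total cost above 1.0, so it is refused and the orchestrator has to decide with what it already has.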
What This Means for Buyers and Operators
For buyers, the paper gives a better way to interrogate agent products.
Do not stop at “Which model do you use?” Ask how the reasoning loop works.
What tools can the agent call? What permissions does it have? Can it act without approval? Does it keep memory across sessions? Can that memory be inspected and deleted? How does it recover from tool errors? How are long-running tasks traced? How are multi-agent handoffs audited? What benchmarks resemble the actual work?
For operators, the main lesson is that agent deployment is operational change, not a narrow software purchase.
An agent that plans and acts inside a workflow becomes part of the workflow’s control system. It may decide what gets escalated, what gets ignored, what evidence gets surfaced, what action gets taken, and what state gets written back.
That means the deployment questions are governance questions too.
Who is accountable for the agent’s action? Which actions need human review? What logs are retained? What happens when memory conflicts with current policy? How are unsafe plans detected before tool calls? How are regressions caught when a model, tool, prompt, or memory policy changes?
The paper does not answer all of those questions. It makes clear why they belong at the center of agent design.
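The regression question, at least, has a simple starting point: rerun a fixed set of tasks whenever a model, tool, prompt, or memory policy changes. The sketch below is an assumption about how that could look, not a method from the paper; regression_report, golden_cases, and the two toy agents are hypothetical, and exact string matching is a deliberate simplification.

```python
from typing import Callable, Dict, List


def regression_report(agent: Callable[[str], str],
                      golden: Dict[str, str]) -> List[str]:
    """Return the tasks whose output no longer matches the recorded expectation."""
    return [task for task, expected in golden.items() if agent(task) != expected]


golden_cases: Dict[str, str] = {"2+2": "4", "capital of France": "Paris"}


def agent_v1(task: str) -> str:
    # Toy agent: a lookup table stands in for the full reasoning loop.
    return {"2+2": "4", "capital of France": "Paris"}.get(task, "unknown")


def agent_v2(task: str) -> str:
    # Same agent after a hypothetical change that dropped one capability.
    return {"2+2": "4"}.get(task, "unknown")


print(regression_report(agent_v1, golden_cases))
print(regression_report(agent_v2, golden_cases))
```

Real agent outputs are rarely comparable by exact match, so a production version would need task-specific checks, but the habit of gating changes on a fixed task set carries over.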
What to Watch Next
The field should watch whether agent evaluation catches up with real operating conditions.
Short tasks are useful, but the important frontier is long-horizon work where agents need to preserve state, recover from errors, update memory, coordinate with people or other agents, and operate under changing constraints.
Builders should also watch work on world models, the internal models an agent can use to simulate possible outcomes before acting. If those simulations are accurate, agents may become more reliable planners. But world models create their own evaluation problem: a confidently wrong internal simulation can be more dangerous than no simulation at all.
Multi-agent training is another frontier. Today, many multi-agent systems are hand-designed. Over time, more systems are likely to learn role assignment, communication topology, and collaboration policy instead. That could improve performance, but it will also make behavior harder to explain unless interpretability and auditability are designed in.
The final area to watch is governance.
As agents move beyond answering questions and start executing workflows, safety can no longer mean only filtering bad outputs. It has to cover planning-time failures, tool-use risk, memory risk, inter-agent coordination, permissions, rollback, and accountability.
Limitations and Caveats
This paper is a map, not a recipe.
It is useful because it organizes a fragmented field, but it does not tell a team which architecture will work for a specific product. A customer-support agent, a coding agent, a robotics agent, and a scientific-discovery agent will need very different loops.
The survey is also broad enough that the term “agentic reasoning” can become too elastic. If every system with a tool call is called agentic reasoning, the category loses bite. The practical test should be whether reasoning actually persists through planning, action, feedback, state, and adaptation.
Another caveat is that many of the most important deployment properties remain hard to benchmark. Reliability, controllability, auditability, memory quality, privacy, governance, and long-term user trust are not captured by a single success rate.
Finally, the paper summarizes progress up to 2025. That is useful, but this field is moving quickly. The taxonomy will likely remain helpful, while specific systems, benchmarks, and training methods will keep changing.
Source
Tianxin Wei, Ting-Wei Li, Zhining Liu, Xuying Ning, Ze Yang, Jiaru Zou, Zhichen Zeng, Ruizhong Qiu, Xiao Lin, Dongqi Fu, Zihao Li, Mengting Ai, Duo Zhou, Wenxuan Bao, Yunzhe Li, Gaotang Li, Cheng Qian, Yu Wang, Xiangru Tang, Yin Xiao, Liri Fang, Hui Liu, Xianfeng Tang, Yuji Zhang, Chi Wang, Jiaxuan You, Heng Ji, Hanghang Tong, Jingrui He. (2026). Agentic Reasoning for Large Language Models. arXiv preprint arXiv:2601.12538. Available at: https://arxiv.org/abs/2601.12538