The paper’s practical point is that long-context systems cannot keep buying bigger windows forever. For dense work, the prompt needs to become something the model can inspect, slice, loop over, and recursively delegate.
Source note: Alex L. Zhang, Tim Kraska, and Omar Khattab. “Recursive Language Models.” arXiv:2512.24601, submitted December 31, 2025, revised May 11, 2026. https://arxiv.org/abs/2512.24601
Why This Paper Matters
The industry answer to long context has mostly been more context. Push the window from thousands of tokens to hundreds of thousands, then to millions. Add summarization when the window fills. Add retrieval when the corpus gets too large. Add agent scaffolding when the model needs tools.
That helps, but it does not solve the deeper problem. Some tasks only need to find one fact in a large pile. Others require the model to read, classify, compare, aggregate, or reason across nearly everything in the prompt. The second class breaks much sooner. A million-token window can still behave badly if the work requires dense access to the whole input.
This paper proposes Recursive Language Models, or RLMs, as a different interface. Instead of feeding the full prompt directly into the model, an RLM puts the prompt inside an external environment. The model receives a symbolic handle to it, writes code to inspect it, decomposes the work into smaller pieces, and can recursively call language models over those pieces.
The important shift is simple: the prompt stops being a huge block of text inside the model’s attention window. It becomes an object the model can operate on.
The Idea in Plain English
A normal long-context model is like a person asked to hold an entire library in working memory and answer from it directly. Retrieval is like giving that person a search box. Compaction is like asking them to keep rewriting their notes as the library gets too large.
An RLM is closer to giving the person a programming environment with the library loaded as a variable. They can check its length, inspect sections, write loops, split it into pieces, call helpers on each piece, store intermediate results, and combine those results into a final answer.
In the paper’s implementation, the model gets a Python REPL. The user’s prompt is stored as a variable. The model can write code against that variable. It can call a sub-language-model or sub-RLM function on programmatically constructed snippets. It keeps working until it sets a Final variable.
That sounds like an agent scaffold, but the distinction matters. Many coding agents can execute code and call tools, but they still load the original prompt into the model’s context or rely on retrieval and summarization. An RLM makes the prompt itself part of the external environment from the start.
The model uses code and recursion to decide how much of the prompt to touch, in what order, and with which sub-calls.
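The loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: `run_rlm`, `scripted_root`, and `keyword_sub_model` are hypothetical names, the root model is replaced by a scripted stub, and `exec` runs with no sandbox, which a real system should not do.

```python
from typing import Callable, Dict, List


def run_rlm(prompt: str,
            root_model: Callable[[str], str],
            sub_model: Callable[[str], str],
            max_steps: int = 8) -> str:
    """Toy REPL loop: run model-written code until it sets `Final`."""
    # The prompt lives in the environment; the root model only sees metadata.
    env: Dict[str, object] = {"prompt": prompt, "llm": sub_model}
    history: List[str] = [f"`prompt` is a str with {len(prompt)} characters."]
    for _ in range(max_steps):
        code = root_model("\n".join(history))
        exec(code, env)  # assumption: illustration only, no sandboxing here
        if "Final" in env:
            return str(env["Final"])
        history.append(f"executed:\n{code}")
    raise RuntimeError("step budget exhausted before Final was set")


def scripted_root(_history: str) -> str:
    # Stand-in for a real root model: always emits the same small program.
    return (
        "chunks = prompt.split('\\n')\n"
        "hits = [c for c in chunks if llm(c) == 'yes']\n"
        "Final = len(hits)\n"
    )


def keyword_sub_model(snippet: str) -> str:
    # Stand-in for a sub-LM call that labels a single snippet.
    return "yes" if "error" in snippet else "no"


answer = run_rlm("ok\nerror: disk\nok\nerror: net",
                 scripted_root, keyword_sub_model)
print(answer)
```

The root stub never receives the prompt text itself, only a note about its length. The code it emits reads the `prompt` variable inside the environment and delegates each line to the sub-call.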
What the Researchers Tested
The authors test RLMs across several long-context and reasoning tasks with different kinds of scaling pressure.
They include simple needle-in-a-haystack style retrieval, where the amount of relevant information stays roughly constant as the input grows. They include BrowseComp-Plus with 1,000 documents, where the answer requires piecing together evidence from multiple documents. They include OOLONG, where the model must semantically label and aggregate many entries, making the work scale roughly linearly with the input. They introduce OOLONG-Pairs, where the answer depends on pairs of entries, making the work closer to quadratic.
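The three scaling shapes can be made concrete with a rough count of how many items a solver has to examine. The function below is a simplified sketch: `work_units` and its task labels are hypothetical, and treating needle retrieval as a single unit of work is an idealization rather than a figure from the paper.

```python
def work_units(n_entries: int, task: str) -> int:
    """Rough count of items a solver must examine (illustrative only)."""
    if task == "needle":
        return 1  # relevant information stays roughly constant
    if task == "aggregate":
        return n_entries  # every entry must be labeled once
    if task == "pairs":
        return n_entries * (n_entries - 1) // 2  # every unordered pair
    raise ValueError(f"unknown task shape: {task}")


for n in (100, 1_000, 10_000):
    print(n, [work_units(n, t) for t in ("needle", "aggregate", "pairs")])
```

Growing the input tenfold leaves the needle case unchanged, multiplies the aggregation case by ten, and multiplies the pairwise case by roughly a hundred.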
They also test LongBench-v2 CodeQA for repository understanding and LongCoT-mini for difficult long-horizon reasoning problems.
The main comparisons are against direct model calls, compaction agents, CodeAct-style agents with retrieval or sub-calls, OpenCode, Claude Code, and RLM variants with different recursion depths. The paper evaluates GPT-5, Qwen3-Coder-480B-A35B, Claude Opus 4.1 through Claude Code, and a small model fine-tuned to act as an RLM.
What They Found
Bigger windows still degrade on dense work
One of the paper’s best contributions is separating input length from task complexity.
Needle-in-a-haystack tasks can make long context look solved because the model only needs to find a small amount of relevant information. Dense aggregation tasks are different. If the answer depends on nearly every line, or on many pairwise comparisons, the effective context window shrinks dramatically.
In the paper’s scaling experiment, GPT-5 degrades as inputs get longer and as the task becomes more complex. The RLM version degrades much more slowly. For inputs that exceed GPT-5’s context window, the base model cannot run directly at all, while the RLM can continue by operating over the prompt externally.
That is the central lesson: context length is not a single number. It depends on the work the model must do over the context.
Recursion beats compaction when details matter
The paper’s headline numbers are strong. In the abstract, the authors report that on GPT-5, RLMs outperform compaction by a median of 26%, CodeAct with sub-calls by 130%, and Claude Code by 13% across the evaluated benchmarks, while keeping cost comparable.
The task-level results show why. On BrowseComp-Plus with 1,000 documents, RLM(GPT-5, depth=1) scores 91.3, compared with 70.5 for the compaction agent and 51.0 for CodeAct with BM25. On OOLONG-Pairs, base GPT-5 is effectively stuck at 0.1 F1. RLM(GPT-5, depth=1) reaches 58.0, and deeper recursion variants go higher.
Those gains are not magic. They come from preserving access to detail instead of compressing the input into lossy summaries. When the task requires dense evidence, compaction throws away exactly the material the answer may need.
The REPL is not a cosmetic detail
The paper argues that three design choices make RLMs different from ordinary scaffolds.
First, the model needs a symbolic handle to the prompt. It should be able to say, in effect, “inspect this range,” “split this corpus,” or “loop over these records” without copying the whole prompt into its own context.
Second, intermediate and final outputs should be able to live outside the model context. If the model must directly emit everything through a normal completion channel, output length becomes another context-window bottleneck.
Third, recursion has to be programmatic. The model should be able to write a loop that launches many sub-calls over constructed slices, rather than verbalizing a few hand-written subtasks.
This is why the REPL matters. It gives the model a place to store state, manipulate strings, launch calls, and combine results without turning the whole operation into one long chat transcript.
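Programmatic recursion of the kind described above might look like the following. It is a sketch under assumptions: `recursive_count` and `stub_sub_call` are hypothetical, and the stub uses a keyword match where a real system would call a language model.

```python
from typing import Callable, List

calls: List[int] = []  # chunk sizes sent to the stub, stored outside any model context


def stub_sub_call(chunk: str) -> int:
    """Stand-in for a sub-LM call: counts lines that mention a refund."""
    calls.append(len(chunk))
    return sum("refund" in line for line in chunk.splitlines())


def recursive_count(lines: List[str],
                    sub_call: Callable[[str], int],
                    leaf_size: int = 3) -> int:
    """Split until a slice is small enough, then delegate it to sub_call."""
    if len(lines) <= leaf_size:
        return sub_call("\n".join(lines))
    mid = len(lines) // 2
    return (recursive_count(lines[:mid], sub_call, leaf_size)
            + recursive_count(lines[mid:], sub_call, leaf_size))


tickets = [
    f"ticket {i}: refund request" if i % 3 == 0 else f"ticket {i}: login issue"
    for i in range(10)
]
total = recursive_count(tickets, stub_sub_call)
print(total, len(calls))
```

The slices are constructed by code rather than described in prose, and both the running total and the call log stay in program state instead of a chat transcript.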
Training models for this interface helps quickly
The authors also fine-tune a small model, RLM-Qwen3-8B, around the RLM interface. They use 1,000 filtered trajectories from a stronger RLM setup on unrelated tasks.
That small training pass improves the base Qwen3-8B model by a median of about 28% across the four main evaluation tasks. In the appendix, the paper also reports lower inference cost and more than 3x faster runs, which it attributes to better decisions and fewer mistakes inside the RLM scaffold.
The implication is important. RLM behavior can be trained. Models can learn to be better roots for recursive environments: how to probe the input, when to decompose, how to call helpers, and how to avoid wasting calls.
Why It Happens
RLMs work because they move long-context processing out of pure token space and into a computational environment.
Attention windows are expensive and fragile places to store everything. They are especially weak when the task requires repeated passes, structured aggregation, or comparisons across many parts of an input. A REPL gives the model a memory outside the context window. Recursion gives it a way to spend model calls only where needed. Code gives it loops, indexing, string operations, and structured intermediate state.
That combination lets the model trade one enormous brittle read for many smaller intentional reads.
It also changes the cost profile. A bad RLM trajectory can waste money by launching too many calls or choosing the wrong decomposition. A good trajectory can be cheaper than forcing a frontier model to ingest millions of tokens. The paper finds the median RLM run can be cheaper than the median base-model run, while averages are pulled up by outlier trajectories where the RLM struggles.
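The gap between median and average cost is easy to reproduce with made-up numbers. The per-query costs below are hypothetical and are not taken from the paper. They only show how one runaway trajectory can lift the mean while leaving the median low.

```python
from statistics import mean, median

# Hypothetical per-query costs in dollars; not figures from the paper.
base_costs = [0.30, 0.31, 0.29, 0.30, 0.30]
rlm_costs = [0.10, 0.12, 0.11, 0.09, 2.50]  # the last run is an outlier

print("median:", median(base_costs), median(rlm_costs))
print("mean:  ", round(mean(base_costs), 3), round(mean(rlm_costs), 3))
```

In this toy data the typical recursive run is cheaper than the typical direct call, yet the average is higher because a single trajectory spent far more than the rest.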
So the architecture creates leverage, but it also creates a new skill problem. The root model has to choose useful decompositions.
What This Means for Builders
The builder takeaway is modest: serious long-context systems need to distinguish between storage, access, and reasoning.
A huge window gives storage. Retrieval gives selective access. Compaction gives lossy compression. RLMs add a different capability: programmatic decomposition over the full prompt, with recursive model calls over chosen slices.
That is valuable for tasks where the model has to process many parts of the input, rather than find one fact. Codebase understanding, document review, diligence, scientific analysis, compliance checks, legal matter review, customer research synthesis, and large agent trajectories all have this shape.
Builders should also notice the interface lesson. If the model is expected to operate over large data, it should receive handles, tools, state, and a safe execution environment, not only a message. The prompt can become an object with methods, rather than a string.
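One way to picture a prompt as an object with methods is a thin wrapper that exposes length, ranges, chunks, and search without handing over the whole string. `PromptHandle` and its method names are hypothetical. This is a sketch of the interface idea, not an API from the paper.

```python
import re
from typing import Iterator, List


class PromptHandle:
    """Hypothetical wrapper that exposes a large prompt through small methods."""

    def __init__(self, text: str) -> None:
        self._text = text

    def __len__(self) -> int:
        return len(self._text)

    def peek(self, start: int, end: int) -> str:
        """Return one character range instead of the whole prompt."""
        return self._text[start:end]

    def chunks(self, size: int) -> Iterator[str]:
        """Yield consecutive fixed-size slices."""
        for i in range(0, len(self._text), size):
            yield self._text[i:i + size]

    def grep(self, pattern: str) -> List[str]:
        """Return the lines that match a regular expression."""
        rx = re.compile(pattern)
        return [line for line in self._text.splitlines() if rx.search(line)]


doc = PromptHandle("alpha: 1\nbeta: 22\ngamma: 333\n")
print(len(doc), doc.peek(0, 5), doc.grep(r"\d{2,}"))
```

Each method returns a small, bounded view, so the caller decides how much of the underlying text ever reaches a model's context.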
The hard engineering work is around control. Recursive calls need budgets. REPLs need sandboxing. Tool access needs permissions. Intermediate state needs observability. Failure modes need to be visible enough that a human can understand whether the system decomposed the problem well or wandered into an expensive loop.
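A budget on recursive calls can be as small as a counter that every call must charge before it runs. The sketch below is one possible shape under stated assumptions: `CallBudget`, `BudgetExceeded`, and `guarded_split` are hypothetical names, and the worker splits a string in half where a real system would launch model calls.

```python
from typing import List


class BudgetExceeded(RuntimeError):
    """Raised when a trajectory tries to spend past its limits."""


class CallBudget:
    def __init__(self, max_calls: int, max_depth: int) -> None:
        self.max_calls = max_calls
        self.max_depth = max_depth
        self.log: List[int] = []  # depth of every charged call, kept for review

    def charge(self, depth: int) -> None:
        if depth > self.max_depth:
            raise BudgetExceeded(f"depth {depth} exceeds limit {self.max_depth}")
        if len(self.log) >= self.max_calls:
            raise BudgetExceeded(f"call limit {self.max_calls} reached")
        self.log.append(depth)


def guarded_split(text: str, budget: CallBudget, depth: int = 0) -> int:
    """Toy recursive worker: charges the budget once per call."""
    budget.charge(depth)
    if len(text) <= 4:
        return 1  # one leaf processed
    mid = len(text) // 2
    return (guarded_split(text[:mid], budget, depth + 1)
            + guarded_split(text[mid:], budget, depth + 1))


budget = CallBudget(max_calls=5, max_depth=10)
try:
    guarded_split("x" * 64, budget)
    stopped = False
except BudgetExceeded:
    stopped = True
print(stopped, budget.log)
```

The log doubles as a minimal observability record: after a stop, it shows how deep the trajectory went and how many calls it made before hitting the ceiling.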
What This Means for Buyers and Operators
For buyers, the paper gives a sharper way to evaluate long-context claims.
Ask what kind of work the system can do across the tokens it ingests. Can it aggregate across all records? Can it compare pairs or clusters? Can it process a large corpus without summarizing away key details? Can it show the decomposition it used? Can it cap recursive calls and explain cost?
The distinction matters because many enterprise tasks are not needle-in-haystack tasks. They are dense judgment tasks. A contract review may require checking clauses across dozens of files. A support analysis may require finding patterns across thousands of tickets. A codebase question may require understanding interactions among distant files. A diligence memo may require reconciling evidence from many sources.
If a vendor’s answer is only “we have a big context window,” that is not enough. The stronger question is whether the system has an execution strategy for the work the context implies.
Operators should also be cautious. RLMs make the model more capable, but they also make it more active. A model that can write code, recurse, inspect data, and launch many calls needs governance. Cost ceilings, sandboxing, logging, and review are not optional details.
What to Watch Next
The field should watch whether recursive scaffolds become a standard interface for long-context work, the way reasoning models became a standard interface for harder problem solving.
Researchers should watch training. The small RLM-Qwen3-8B result suggests that models can learn the habits needed for recursive environments. Better root models may make decomposition more reliable and reduce wasteful trajectories.
Builders should watch sandbox design. A Python REPL is powerful, but production systems will need safer constrained environments, better debugging tools, and clearer policies for what recursive calls can do.
Buyers should watch benchmarks that scale task complexity, not only prompt length. The most useful evaluations will separate retrieval-like tasks from tasks that require linear or quadratic work over the input.
Limitations and Caveats
The paper is ambitious, but it does not remove the operational burden of long-context systems. It moves that burden into decomposition, recursion, execution safety, and cost control.
Recursive systems can fail expensively. If the root model chooses the wrong plan, it may launch many bad sub-calls. The authors’ trajectory analysis shows that first decomposition choices matter, and that weaker coding models can make syntax errors that propagate through deeper recursion.
The results are also scaffold-dependent. Prompt examples affect decomposition behavior. Runtime numbers depend on implementation details, API latency, and whether calls run sequentially or asynchronously. The authors are explicit that guardrails for RLMs remain underexplored.
Finally, RLMs are not a replacement for larger native context windows. Bigger windows still help. The stronger claim is that windows alone are not the right abstraction for dense, long-horizon work. When the prompt is too large and the task is too structured, the model needs an environment.
Source
Zhang, Alex L.; Kraska, Tim; Khattab, Omar. (2026). Recursive Language Models. arXiv preprint arXiv:2512.24601. Available at: https://arxiv.org/abs/2512.24601