Long-context AI systems may need a separate phase for turning recent context into usable memory, not just a larger place to store old tokens.

Source note: Sangyun Lee, Sean McLeish, Tom Goldstein, Giulia Fanti. “Language Models Need Sleep.” arXiv:2605.26099, 2026-05-25. https://arxiv.org/abs/2605.26099

Why This Paper Matters

Long context in language models is usually framed as a storage problem. The model needs to remember more tokens, so builders extend the context window, compress the context, add state-space layers, use sliding-window attention, or keep a bigger cache.

This paper argues that storage is only half the problem.

If the model sees an earlier chunk of text, loses direct attention to it, and later has to reason from it, the useful question is not just “was the information preserved?” It is “was the information transformed into a state that supports the later computation?”

That distinction matters for agents, research tools, coding systems, legal review, enterprise search, and any workflow where a model reads a long artifact and later has to answer a hard question about what it saw. A long memory whose contents are poorly organized for later reasoning is not much better than a short one.

The title is playful, but the idea is practical. The authors propose giving the model something like sleep: an offline consolidation phase where it performs extra recurrent computation over recent context before clearing the attention cache.

The Idea in Plain English

Think of the model as having two kinds of memory.

The first is attention memory, the familiar key-value cache. It is precise and useful, but it grows with the length of the context. Keeping everything in attention is expensive.

The second is fast-weight memory inside state-space model blocks. This memory has a fixed size. It is cheaper to carry forward, but it is lossy. Old context has to be folded into a compact internal state.

The paper’s central claim is that folding context into that state is itself a reasoning problem. A model may need several passes to convert “I saw these tokens” into “I now have the right internal structure to answer a later question.”

Sleep is the name for those extra passes. Before the model evicts a chunk from attention, it loops over the current context several times and updates the fast weights in its SSM blocks. Then it clears the attention cache and continues. At wake time, prediction is still a single forward pass. The extra work happened earlier, during consolidation.
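
The control flow described above (loop over the chunk several times, update a fixed-size state, then clear the cache) can be sketched with a toy example. Everything here is an assumption made for illustration: the matrix stands in for SSM fast weights, the delta-rule update and the names `consolidate`, `recall_error`, and `n_loops` are hypothetical, and the paper's models learn their update end to end rather than using this rule. The toy shows only that a shallow write leaves residual error that further passes reduce; it does not reproduce the reasoning-depth effect the paper measures.

```python
import numpy as np

def consolidate(fast_weights, keys, values, n_loops, lr=0.5):
    """Loop over the cached chunk n_loops times, nudging a fixed-size matrix
    so that fast_weights @ key moves toward value (illustrative delta rule)."""
    W = fast_weights.copy()
    for _ in range(n_loops):
        for k, v in zip(keys, values):
            error = v - W @ k             # what the memory still gets wrong
            W += lr * np.outer(error, k)  # partial correction written into the state
    return W

def recall_error(W, keys, values):
    """Mean distance between stored and true values, read from W alone."""
    return float(np.mean([np.linalg.norm(v - W @ k) for k, v in zip(keys, values)]))

dim = 8
rng = np.random.default_rng(0)
keys = np.eye(dim)[:4]               # one-hot keys keep the arithmetic checkable
values = rng.normal(size=(4, dim))   # hypothetical content of the chunk

empty = np.zeros((dim, dim))
one_loop = consolidate(empty, keys, values, n_loops=1)
four_loops = consolidate(empty, keys, values, n_loops=4)

kv_cache = list(zip(keys, values))
kv_cache.clear()                     # evict the chunk; only the matrix survives

print(recall_error(one_loop, keys, values), recall_error(four_loops, keys, values))
```

With one-hot keys and `lr=0.5`, each pass halves the remaining recall error, so one loop leaves half of the initial error and four loops leave one sixteenth. The matrix keeps the same shape however many passes are run, which is the fixed-size property the text attributes to fast-weight memory.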

This is different from asking the model to think longer at answer time. It is closer to asking the model to organize its notes before putting the notes away.

What the Researchers Tested

The paper tests whether extra sleep-time computation helps models answer questions about context that has already been evicted from attention.

The first task uses Rule 110, a one-dimensional cellular automaton. The model sees binary strings, the attention cache is forcibly cleared, and later the model must predict what happens after several rollout steps. The key control is that the amount of information stays fixed while the required reasoning depth increases.
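
For readers unfamiliar with Rule 110, a minimal rollout can be written in a few lines. The update table is the standard Rule 110 definition; the wrap-around boundary and the example string are assumptions of this sketch, since the summary above does not say how the paper handles string edges.

```python
RULE = 110  # binary 01101110: one output bit per 3-cell neighborhood

def step(cells):
    """Apply one Rule 110 update with wrap-around (periodic) boundaries.
    The boundary condition is an assumption of this sketch."""
    n = len(cells)
    out = []
    for i in range(n):
        left, center, right = cells[i - 1], cells[i], cells[(i + 1) % n]
        pattern = (left << 2) | (center << 1) | right  # neighborhood as 0..7
        out.append((RULE >> pattern) & 1)               # look up the rule bit
    return out

def rollout(cells, steps):
    """Apply `steps` successive updates; each step depends on the previous one."""
    for _ in range(steps):
        cells = step(cells)
    return cells

start = [0, 0, 0, 1, 0, 0, 1, 1]
print(rollout(start, 1))
print(rollout(start, 32))
```

The input string is the same length for every `steps` value, but the loop in `rollout` has to run once per step. That is the sense in which the task holds information fixed while increasing the depth of sequential computation.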

The second task is Depo, a multi-hop graph retrieval benchmark. The model sees a shuffled directed cycle split across cache windows. Later it must answer k-hop questions about the graph. Larger k means deeper traversal.
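
The shape of this task can be sketched as follows. The exact Depo data format is not given here, so the edge representation as `(source, target)` pairs, the window size, and all function names are assumptions for illustration only.

```python
import random

def make_cycle_edges(n_nodes, seed=0):
    """Build a directed cycle over shuffled node labels and return its edges
    in shuffled order, so consecutive hops are scattered through the list."""
    rng = random.Random(seed)
    nodes = list(range(n_nodes))
    rng.shuffle(nodes)
    edges = [(nodes[i], nodes[(i + 1) % n_nodes]) for i in range(n_nodes)]
    rng.shuffle(edges)
    return edges

def split_into_windows(edges, window_size):
    """Cut the edge list into consecutive chunks, one per cache window."""
    return [edges[i:i + window_size] for i in range(0, len(edges), window_size)]

def k_hop(edges, start, k):
    """Ground-truth answer: follow the successor pointer k times."""
    successor = dict(edges)
    node = start
    for _ in range(k):
        node = successor[node]
    return node

edges = make_cycle_edges(16)
windows = split_into_windows(edges, window_size=4)
print(len(windows), k_hop(edges, start=0, k=4))
```

Because the edges are shuffled before being split, a k-hop path usually crosses several windows, so no single window contains everything needed to answer the query.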

The third task is GSM-Infinite, a synthetic math-reasoning benchmark based on GSM8K-style word problems. The problems are long, include distractors, and require different numbers of arithmetic operations. The authors test pretrained Jet-Nemotron 2B and Ouro 1.4B variants, with context eviction forcing the model to rely on consolidated memory for much of the problem.

Across these settings, the comparison is not about spending more compute at answer time. The paper keeps prediction-phase latency constrained: the model has to answer with a single wake-time forward pass.

What They Found

The results support a narrow but important point: when old context has to support later reasoning, more consolidation passes can help.

Fast Weights Stored Context, but Did Not Reason Deeply Enough

In the Rule 110 setup, a standard transformer cannot use evicted context because its attention cache is gone. A vanilla SSM-attention hybrid can store some information in fast weights, but its performance drops as the number of automaton rollout steps increases.

That matters because sequence length is held fixed. The failure is not simply that the model has too much to remember. The difficulty comes from needing deeper sequential computation over what was remembered.

Sleep Helped on the Hard Cellular Automaton Case

On the harder Rule 110 setting with 32 rollout steps, the no-loop baseline stays close to random guessing, reaching only about 10% exact accuracy after nearly 5 billion training tokens.

Adding sleep loops improves performance under the same token budget. Two loops reach roughly 20% accuracy, while three and four loops rise above 30%. The context length, eviction rule, and wake-time prediction path stay fixed, so the improvement comes from extra consolidation-time computation.

Multi-Hop Retrieval Needed More Consolidation

Depo makes the problem harder because the graph is fragmented across several cache windows and the query comes later. The model cannot just remember one directly relevant thing. It has to organize a graph into a query-agnostic state that supports different hop counts.

The pattern is clean: extra offline loops help most as k increases. The one-loop model makes little progress on 4-hop and harder queries. The two-loop model still stalls on 8-hop and harder cases. Within the paper’s training budget, only the four-loop model starts improving on 16-hop retrieval.

The Pattern Carried Into Math Reasoning

On GSM-Infinite, the gains are clearest on harder problems with more arithmetic operations.

For Jet-Nemotron 2B, six loops improve final accuracy from 0.742 to 0.812 on six-operation problems and from 0.351 to 0.388 on eight-operation problems.

For Ouro 1.4B, four loops improve final accuracy from 0.419 to 0.615 on six-operation problems and from 0.210 to 0.272 on eight-operation problems.

The easy two- and four-operation problems often saturate, so extra sleep matters less there. The benefit appears when the model has to combine long-context processing with deeper computation.
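
The absolute and relative gains implied by the six- and eight-operation figures above can be worked out directly. The snippet only does arithmetic on the four accuracy pairs quoted in this section; the dictionary layout and the `gains` helper are illustrative names, not anything from the paper.

```python
# (model, operation count) -> (baseline accuracy, accuracy with sleep loops)
reported = {
    ("Jet-Nemotron 2B", 6): (0.742, 0.812),
    ("Jet-Nemotron 2B", 8): (0.351, 0.388),
    ("Ouro 1.4B", 6): (0.419, 0.615),
    ("Ouro 1.4B", 8): (0.210, 0.272),
}

def gains(before, after):
    """Return (absolute gain, relative gain), both rounded to 3 decimals."""
    absolute = round(after - before, 3)
    relative = round((after - before) / before, 3)
    return absolute, relative

for (model, ops), (before, after) in reported.items():
    absolute, relative = gains(before, after)
    print(f"{model}, {ops} ops: +{absolute:.3f} absolute, +{relative:.1%} relative")
```

On these numbers the gains are 0.070 and 0.037 for Jet-Nemotron 2B (about 9.4% and 10.5% relative) and 0.196 and 0.062 for Ouro 1.4B (about 46.8% and 29.5% relative). The two models use different loop counts, so the rows are not a like-for-like comparison between models.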

Sliding-Window Eviction Made the Retrieval Point Sharper

The paper also tests a sliding-window setup, where the active attention window remains capped but the newest tokens are retained while older ones are evicted.

With Ouro 1.4B and a 512-token window, four loops improve two-operation GSM-Infinite accuracy from 0.596 to 0.905. That is not mainly a hard-math result, since two-operation problems are relatively easy. It suggests that sleep also helps compress and retrieve relevant context when the active window is much smaller than the full sequence.
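
A capped window that keeps the newest tokens and evicts the oldest can be sketched with a double-ended queue. The class name, the `on_evict` callback, and the tiny window size are hypothetical; the callback marks the point where a system would hand evicted tokens to whatever consolidation step it uses, and it is not the paper's implementation.

```python
from collections import deque

class SlidingWindowCache:
    """Keep the newest `window` tokens; pass each evicted token to a
    callback before it is dropped. Names are hypothetical."""

    def __init__(self, window, on_evict):
        self.window = window
        self.tokens = deque()
        self.on_evict = on_evict

    def append(self, token):
        self.tokens.append(token)
        # Evict from the old end until the cap is respected again.
        while len(self.tokens) > self.window:
            self.on_evict(self.tokens.popleft())

evicted = []
cache = SlidingWindowCache(window=4, on_evict=evicted.append)
for token in range(10):
    cache.append(token)

print(list(cache.tokens), evicted)
```

After ten tokens with a window of four, the cache holds only the last four and the first six have passed through the eviction hook. Anything the model later needs from those six has to come from consolidated state.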

Why It Happens

The mechanism is extra computation at the memory boundary.

Attention lets the model look back at earlier tokens directly. Fast weights do not. Once context leaves attention, the model has only whatever state was written into the fixed-size memory. If that state was formed in one shallow pass, it may contain fragments of the past without the structure needed for later reasoning.

Sleep gives the model more chances to refine that state. During consolidation, recurrent passes update the SSM fast weights before the current chunk is evicted. The model is trained end to end so those updates become useful for the later answer.

The paper’s important constraint is that this does not add loops during answer generation. It moves computation from prediction time to consolidation time. That is useful for systems where users care about response latency but can tolerate background processing while reading, indexing, ingesting, or moving between context windows.

What This Means for Builders

For model builders, the paper is a warning against treating long context as a pure capacity benchmark.

A system can have enough memory to store old context and still lack enough computation to make that context useful. This is especially relevant for hybrid architectures that mix attention with SSM-style layers. The fixed-size memory may be efficient, but the write process into that memory needs to be powerful enough.

For agent builders, the practical pattern is familiar. Agents already have natural pauses: after reading a file, after finishing a browser pass, after processing a transcript, after closing a tool call, after ingesting a chunk of logs. Those moments could become consolidation boundaries. Thinking longer when a question arrives may not be enough; the agent may also need to organize memory while the workflow is unfolding.
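
One way to picture consolidation boundaries in an agent loop is a memory object that folds its working buffer whenever a boundary event arrives. This is a sketch under stated assumptions: the event names, the `AgentMemory` class, and the placeholder `consolidate` step (which only truncates text) are all hypothetical and stand in for whatever consolidation a real system would run.

```python
from dataclasses import dataclass, field

# Hypothetical event names marking natural pauses in an agent workflow.
BOUNDARY_EVENTS = {"file_read", "browser_pass_done", "tool_call_closed"}

@dataclass
class AgentMemory:
    working: list = field(default_factory=list)       # recent, detailed observations
    consolidated: list = field(default_factory=list)  # compact summaries

    def observe(self, event, payload):
        """Record an observation and consolidate if the event is a boundary."""
        self.working.append(payload)
        if event in BOUNDARY_EVENTS:
            self.consolidate()

    def consolidate(self):
        # Placeholder for real consolidation: record how many items were
        # folded and keep the first 40 characters of each one.
        summary = {
            "items": len(self.working),
            "gist": [str(p)[:40] for p in self.working],
        }
        self.consolidated.append(summary)
        self.working.clear()

memory = AgentMemory()
memory.observe("token_stream", "partial log line")
memory.observe("file_read", "contents of config.yaml")
print(memory.consolidated, memory.working)
```

The design point is that consolidation is triggered by workflow events rather than by an incoming question, so the extra work happens before anyone is waiting on an answer.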

For infrastructure teams, the design also suggests a new knob: spend more compute at memory write time to preserve single-pass answer latency later. That tradeoff is different from simply increasing context length or adding chain-of-thought tokens.

What This Means for Buyers and Operators

For buyers, this paper is a useful lens on long-context claims.

The question is not only how many tokens the model accepts. It is what the system can do after earlier context leaves the active attention window. Can it reason over old material? Can it combine scattered facts? Does accuracy fall as the number of reasoning steps increases? Does the vendor report results by operation count, hop count, or other difficulty measures rather than average score alone?
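
The difference between an average score and a per-difficulty breakdown is easy to show. The helper below is a generic sketch; the records are made up purely for illustration and are not results from the paper or from any vendor.

```python
from collections import defaultdict

def accuracy_by_difficulty(results):
    """results: iterable of (difficulty, is_correct) pairs, where difficulty
    is a hop count, operation count, or similar measure."""
    totals = defaultdict(lambda: [0, 0])  # difficulty -> [correct, seen]
    for difficulty, is_correct in results:
        totals[difficulty][0] += int(is_correct)
        totals[difficulty][1] += 1
    return {d: c / n for d, (c, n) in sorted(totals.items())}

# Made-up records purely for illustration, not results from the paper.
records = [(2, True), (2, True), (2, True), (2, True),
           (8, True), (8, False), (8, False), (8, False)]

overall = sum(ok for _, ok in records) / len(records)
per_bucket = accuracy_by_difficulty(records)
print(overall, per_bucket)
```

In this invented example the overall accuracy is 0.625, which hides the fact that the easy bucket is perfect and the hard bucket is at 0.25. A single average can conceal exactly the depth-related drop that buyers should be asking about.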

For operators, the strongest implication is workflow design. If a model is expected to read a long document, repository, deal room, medical record, or customer history, the system may need explicit consolidation phases. Those phases could be triggered after ingestion, between sections, or during idle time.

This does not mean every product needs biological metaphors. It means memory management should be an operating concern, not just a model-card claim.

What to Watch Next

The field should watch whether sleep-like consolidation becomes a standard long-context training pattern for hybrid models.

Builders should watch the boundary between context compression, fast-weight memory, and test-time training. These techniques are converging around the same problem: turn a large recent context into a compact state that can be reused later.

Buyers should watch for evidence on messier tasks. GSM-Infinite is more realistic than the synthetic automaton and graph tasks, but it is still procedurally generated. The harder question is whether consolidation helps with open-ended enterprise workflows where the right answer depends on judgment, retrieval, tool use, and delayed feedback.

Researchers should also watch cost. The paper reports that recurrent-depth cost grows roughly linearly with the number of sleep passes. If sleep improves quality but multiplies training expense, the next question is where the extra passes are worth it.
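
A back-of-the-envelope model makes the tradeoff concrete. It assumes cost is exactly linear in the number of passes, which is stronger than the "roughly linear" behavior reported; the function names, unit costs, and chunk count are hypothetical.

```python
def sleep_compute(n_chunks, n_loops, cost_per_pass=1.0):
    """Consolidation-time compute under a strictly linear assumption:
    every chunk is processed n_loops times before eviction."""
    return n_chunks * n_loops * cost_per_pass

def wake_compute(answer_cost=1.0):
    """Answer-time compute: one forward pass, independent of n_loops."""
    return answer_cost

baseline = sleep_compute(n_chunks=8, n_loops=1)
for loops in (1, 2, 4):
    ratio = sleep_compute(n_chunks=8, n_loops=loops) / baseline
    print(f"loops={loops}: consolidation x{ratio:.0f}, answer cost {wake_compute()}")
```

Under this assumption, going from one loop to four multiplies consolidation compute by four while answer-time cost stays flat. Whether that is worth paying depends on how much accuracy the extra passes buy on the tasks that matter.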

Limitations and Caveats

The evidence is strongest for controlled settings. Rule 110 and Depo are useful because they isolate reasoning depth, but they are not normal user tasks. GSM-Infinite is closer to language-model use, but it is still synthetic.

The models are modest by frontier standards. The paper tests small custom hybrids, Jet-Nemotron 2B, and Ouro 1.4B variants. It does not show the effect on the largest commercial models.

The method also makes training harder. Backpropagating through repeated consolidation passes adds cost and can create stability issues. The paper notes that training cost scales roughly with the number of recurrent steps.

Finally, sleep helps with one specific failure mode: organizing evicted context into fast weights. It does not solve all long-context problems. Retrieval quality, distractor resistance, factual grounding, tool reliability, and evaluation design still matter.

Source

Lee, S., McLeish, S., Goldstein, T., & Fanti, G. (2026). Language Models Need Sleep. arXiv preprint arXiv:2605.26099. Available at: https://arxiv.org/abs/2605.26099