Large models may be better not just because they are more expressive, but because they can preserve weak signals from rare tasks while common tasks keep updating the network.

Source note: Jing Huang, Daniel Wurgaft, Rachit Bansal, Laura Ruis, Naomi Saphra, David Alvarez-Melis, Andrew Kyle Lampinen, Christopher Potts, and Ekdeep Singh Lubana. “Why Larger Models Learn More: Effects of Capacity, Interference, and Rare-Task Retention.” arXiv:2605.29548, submitted May 28, 2026. https://arxiv.org/abs/2605.29548

Why This Paper Matters

The usual story about scaling is simple: larger models have more parameters, see more data, and get better. That story is useful, but it hides an important question. What exactly does the extra size buy?

This paper gives a sharper answer. It argues that larger models can learn parts of the training distribution that smaller models fail to learn, even when those smaller models are not obviously too small to represent the task. The bottleneck is not just expressivity. It is the interaction between task frequency, task complexity, and interference during training.

That matters for anyone building or buying AI systems because many valuable capabilities are rare in the training stream. Edge-case reasoning, unusual workflows, small-domain procedures, compliance details, strange data formats, and long-tail user intents are not the common examples. A model can be competent on the center of the distribution while still failing to retain the signals needed for these rarer tasks.

The paper’s practical claim is uncomfortable but useful: some capabilities may require more model capacity, better data mixture design, or training schemes that protect rare-task signal from being washed away.

The Idea in Plain English

Imagine a company where the loudest customers shape every product meeting. The team keeps hearing the same requests, so it allocates roadmap space to those requests first. A smaller team may technically be capable of serving niche customers too, but the common requests consume all of its attention. The rare requests appear, briefly matter, then disappear before the team can build durable memory around them.

The paper says something similar can happen inside models.

Training data is a mixture of tasks. Some tasks are common. Some are rare. Some are simple. Some require many different features to solve. Smaller models tend to allocate their limited representational resources to high-frequency or lower-complexity tasks first. Rare and complex tasks get weak gradients and long gaps between observations. By the time the model sees the rare task again, frequent-task updates may have overwritten the fragile signal from the previous rare-task example.

Larger models have more room. Once they have allocated enough resources to common tasks, updates from those tasks become weaker and less destructive. Rare-task features can survive between observations. Over time, those fragments accumulate into something the model can generalize.

The key phrase is rare-task retention. Larger models remember enough of the rare task for long enough to learn it.
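To put a rough number on the "long gaps between observations" described above, here is a minimal sketch. It assumes each training example is drawn independently and belongs to the rare task with a fixed probability, which is a simplification of my own and not a detail from the paper. Under that assumption the gap between two sightings of a task follows a geometric distribution with mean 1 / frequency. The function names are hypothetical.

```python
import random


def expected_gap(task_frequency: float) -> float:
    """Mean number of examples between two sightings of a task, assuming
    i.i.d. sampling with probability `task_frequency` per example."""
    if not 0.0 < task_frequency <= 1.0:
        raise ValueError("task_frequency must be in (0, 1]")
    return 1.0 / task_frequency


def simulated_mean_gap(task_frequency: float, num_examples: int = 200_000, seed: int = 0) -> float:
    """Empirical check of expected_gap using a seeded random stream."""
    rng = random.Random(seed)
    last_seen = None
    gaps = []
    for step in range(num_examples):
        if rng.random() < task_frequency:
            if last_seen is not None:
                gaps.append(step - last_seen)
            last_seen = step
    if not gaps:
        raise ValueError("task was seen fewer than twice; increase num_examples")
    return sum(gaps) / len(gaps)


if __name__ == "__main__":
    for frequency in (0.1, 0.01, 0.001):
        print(frequency, expected_gap(frequency), round(simulated_mean_gap(frequency), 1))
```

A task at 0.1% of the stream is seen about once every thousand examples on average, so roughly a thousand frequent-task updates land between consecutive rare-task updates.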

What the Researchers Tested

The paper tests the claim in two stages.

First, the authors build a synthetic multi-task regression setup. Each example comes from one task in a mixture. Tasks differ by frequency and complexity. The model has a shared encoder with limited width, so tasks compete for representational space. This lets the authors control exactly which task is rare, which task is complex, and how much capacity the model has.
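The shape of such a mixture can be sketched in a few lines. This is an illustration under my own assumptions, not the authors' setup: each task is a linear regression with its own random weights, the number of input features stands in for complexity, and the shared encoder is left out entirely. The task names, frequencies, and feature counts are made up.

```python
import random


def make_task(name, frequency, num_features, rng):
    """A linear regression task; `num_features` stands in for complexity."""
    return {
        "name": name,
        "frequency": frequency,
        "weights": [rng.gauss(0.0, 1.0) for _ in range(num_features)],
    }


def sample_example(tasks, rng):
    """Draw one (task name, inputs, target) triple from the mixture."""
    task = rng.choices(tasks, weights=[t["frequency"] for t in tasks], k=1)[0]
    inputs = [rng.gauss(0.0, 1.0) for _ in task["weights"]]
    target = sum(w * x for w, x in zip(task["weights"], inputs))
    return task["name"], inputs, target


rng = random.Random(0)
tasks = [
    make_task("common_simple", 0.90, 2, rng),  # frequent, few feature directions
    make_task("rare_complex", 0.10, 8, rng),   # infrequent, many feature directions
]

# Count how often each task shows up in a stream of 10,000 examples.
counts = {task["name"]: 0 for task in tasks}
for _ in range(10_000):
    name, inputs, target = sample_example(tasks, rng)
    counts[name] += 1
```

The point of a generator like this is control: frequency and complexity are separate knobs, so their effects can be varied independently.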

Second, they move to a more realistic language-model pretraining setting. They train OLMo models on Dolma v1.7 with injected tasks at controlled frequencies. The model sizes range from roughly 4M to 4B parameters. The injected tasks are deliberately unusual, so the authors can measure whether a model learned the task distribution rather than relying on preexisting patterns in the corpus.

They look at three kinds of evidence:

  • Behavioral evidence: does the model actually solve the rare task?
  • Representational evidence: does the model encode task-relevant features?
  • Gradient evidence: do frequent-task updates interfere with rare-task features?

That structure is the paper’s strength. It is not only a benchmark result. It tries to connect behavior to a mechanism.
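The gradient question in that list can be made concrete with a generic proxy. The sketch below is not the paper's metric, which this summary does not specify. It assumes gradients and feature directions are available as plain lists of floats, and measures how much of a frequent-task gradient points along a rare-task feature direction. All function names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    denom = norm(a) * norm(b)
    return dot(a, b) / denom if denom > 0 else 0.0


def interference(frequent_grad, rare_direction):
    """Length of the frequent-task gradient's component along a rare-task
    feature direction. A value near zero means the update leaves that
    direction mostly untouched."""
    length = norm(rare_direction)
    return dot(frequent_grad, rare_direction) / length if length > 0 else 0.0


if __name__ == "__main__":
    rare_direction = [0.0, 2.0]
    print(interference([1.0, 0.0], rare_direction))  # orthogonal update
    print(interference([1.0, 1.0], rare_direction))  # overlapping update
```

An orthogonal frequent-task update scores zero on this proxy; an overlapping one moves the rare-task direction and can erase what was stored there.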

What They Found

The first finding is that larger models learn lower-frequency tasks that smaller models miss. In the synthetic setup, increasing width lets the model retain lower-utility features. “Utility” here depends on both how often a task appears and how much signal each feature contributes. A rare task can lose the competition because its useful features show up too infrequently.
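A toy version of that competition, under a loudly stated assumption: utility is taken to be the product of task frequency and per-feature signal, and a model of a given width keeps only the top-ranked features. The paper's exact definition may differ, and the numbers below are invented for illustration.

```python
def rank_features_by_utility(features):
    """Sort features by assumed utility = task frequency * per-feature signal."""
    return sorted(features, key=lambda f: f["frequency"] * f["signal"], reverse=True)


def features_kept(features, width):
    """Names of the features a model with `width` slots would retain."""
    return [f["name"] for f in rank_features_by_utility(features)[:width]]


# Illustrative values only, not results from the paper.
features = [
    {"name": "common_a", "frequency": 0.90, "signal": 1.0},
    {"name": "common_b", "frequency": 0.90, "signal": 0.5},
    {"name": "rare_a", "frequency": 0.05, "signal": 2.0},
    {"name": "rare_b", "frequency": 0.05, "signal": 1.0},
]

narrow = features_kept(features, width=2)
wide = features_kept(features, width=4)
```

With two slots, both go to the common task even though `rare_a` carries the strongest per-example signal. With four slots, the rare-task features fit as well.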

The second finding is that frequency alone is not the whole story. Complexity matters too. If a task needs many feature directions, it can be harder to learn even if it is not the rarest task. The appendix experiments show that learning order depends on both prior frequency and feature complexity, not just “common before rare.”

The third finding is the retention mechanism. In matched-frequency injection experiments, the rare task is withheld for a number of steps and then reintroduced in batches while keeping the overall frequency constant. Smaller models briefly pick up the rare-task signal after injection, then lose it as frequent-task updates resume. Larger models retain more of that signal between injections, so each new rare-task observation builds on the previous one.
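One way to picture a matched-frequency schedule is sketched below. This is my reading of "withheld, then reintroduced in batches," not the authors' exact protocol. Both schedules contain the same number of rare-task steps; they differ only in how long the rare task stays absent. Function and parameter names are hypothetical.

```python
def uniform_schedule(total_steps, period):
    """True at steps where the rare task appears: once every `period` steps."""
    return [step % period == period - 1 for step in range(total_steps)]


def burst_schedule(total_steps, period, burst_size):
    """Withhold the rare task, then inject `burst_size` consecutive rare steps.

    The rare-step count matches uniform_schedule whenever total_steps is a
    multiple of period * burst_size.
    """
    cycle = period * burst_size
    return [step % cycle >= cycle - burst_size for step in range(total_steps)]


def longest_gap(schedule):
    """Longest run of consecutive steps without the rare task."""
    longest = current = 0
    for is_rare in schedule:
        current = 0 if is_rare else current + 1
        longest = max(longest, current)
    return longest


uniform = uniform_schedule(total_steps=1000, period=100)
bursty = burst_schedule(total_steps=1000, period=100, burst_size=5)
```

Here both schedules show the rare task 10 times in 1,000 steps, but the bursty one leaves it unseen for 495 steps at a stretch instead of 99. That longer gap is what stresses retention.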

The fourth finding is that the same pattern appears in the OLMo pretraining runs. Across the 4M-to-4B parameter range, larger models learn the injected low-frequency tasks better, represent more task-relevant features, and show less gradient interference.

The fifth finding is that memorization can support abstraction. The paper’s story is not “memorization bad, generalization good.” For rare tasks, the model may need to retain partial memories of sparse observations before those memories consolidate into a generalizable structure.

Why It Happens

The mechanism is resource competition.

In the toy setup, tasks compete for the model’s shared encoder. The model learns features in order of utility. Common and simple tasks have high-utility features, so they get represented first. Rare and complex tasks have weaker signals, so they need spare capacity and enough retention across time.

Once a model has enough capacity to explain the common tasks, the residual error on those tasks gets smaller. Their gradients become less disruptive. That creates a quieter training environment for rare-task features. A rare-task update can leave a trace, and that trace is less likely to be overwritten before the next rare-task example arrives.
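The link between residual error and update size is easy to see in the simplest case. The sketch below assumes a linear model with squared-error loss and plain SGD, which is far simpler than anything in the paper. There the gradient is the residual times the input, so a well-fit example moves the weights very little. The numbers are illustrative.

```python
import math


def update_norm(weights, inputs, target, learning_rate):
    """L2 norm of one SGD step for the loss 0.5 * (w . x - y)^2."""
    prediction = sum(w * x for w, x in zip(weights, inputs))
    residual = prediction - target
    step = [learning_rate * residual * x for x in inputs]
    return math.sqrt(sum(s * s for s in step))


inputs = [1.0, 2.0]
# Same example, two models: one that has not fit it, one that nearly has.
poorly_fit = update_norm([0.0, 0.0], inputs, target=5.0, learning_rate=0.1)
well_fit = update_norm([1.0, 1.99], inputs, target=5.0, learning_rate=0.1)
```

The nearly fit model takes a step about 250 times smaller on the same example. Any rare-task trace stored in shared weights is disturbed correspondingly less.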

This is why the paper is not just saying “more parameters can fit more functions.” It is saying that larger models change the learning dynamics. They reduce interference between frequent and rare tasks.

For builders, that is the interesting part. The result points to several possible levers: model size, task frequency, task ordering, replay, mixture weighting, curriculum design, and architectures or optimizers that reduce destructive interference.

What This Means for Builders

The immediate implication is that small-model evaluation needs to be careful around rare capabilities. A smaller model may look good on common workflows and still fail on the weird cases that matter in production. The failure may not be fixed by simply showing it more of the same broad data if the rare-task signal keeps getting overwritten.

The second implication is that data mixture design can matter more than aggregate token count. If a desired behavior is rare, increasing its frequency may be a cheaper intervention than scaling the model. The authors say this directly in the discussion: scaling up the frequency of a target task might be more efficient than scaling up model size.
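As a sketch of what "increasing its frequency" looks like mechanically, the helper below multiplies one task's sampling weight and renormalizes. The helper name, the mixture, and the boost factor are all hypothetical; the paper does not prescribe a specific reweighting recipe in this summary.

```python
def upweight_task(mixture, task, factor):
    """Multiply one task's sampling weight by `factor`, then renormalize so
    the weights sum to 1. `mixture` maps task name -> sampling weight."""
    if task not in mixture:
        raise KeyError(task)
    if factor <= 0:
        raise ValueError("factor must be positive")
    scaled = {
        name: weight * (factor if name == task else 1.0)
        for name, weight in mixture.items()
    }
    total = sum(scaled.values())
    return {name: weight / total for name, weight in scaled.items()}


# Illustrative mixture, not taken from the paper.
mixture = {"general_web": 0.98, "rare_procedure": 0.02}
boosted = upweight_task(mixture, "rare_procedure", factor=5.0)
```

A 5x boost moves the rare task from 2% to about 9.3% of samples, not 10%, because the other weights shrink during renormalization. The trade-off is that every upweighted task takes sampling share from the rest of the mixture.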

The third implication is that post-training and distillation need to respect retention. If a frontier model has learned rare capabilities through scale, a smaller distilled model may not automatically inherit the same long-tail competence. It might mimic common outputs while losing the mechanisms that make rare behaviors stable.

The fourth implication is that agent and enterprise workflows should separate common automation from rare judgment. A smaller model may handle routine cases well, while unusual cases need routing to a larger model, retrieval support, explicit examples, or a harness that keeps rare-task context alive.
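A routing rule along these lines might start as simply as the sketch below. Everything in it is an assumption for illustration: the category labels, the handler names, and the 100-example threshold. A real system would need an actual signal for how rare a request is.

```python
def choose_handler(category, historical_counts, min_examples=100):
    """Route well-represented request categories to the small model and
    everything else to a larger model with retrieval support.

    `historical_counts` maps category -> number of past examples seen.
    """
    count = historical_counts.get(category, 0)
    if count >= min_examples:
        return "small_model"
    return "large_model_with_retrieval"


# Hypothetical traffic history.
historical_counts = {"password_reset": 5000, "legacy_invoice_format": 12}

routine = choose_handler("password_reset", historical_counts)
unusual = choose_handler("legacy_invoice_format", historical_counts)
unseen = choose_handler("never_seen_before", historical_counts)
```

Categories with no history at all default to the larger path, which is the conservative choice when rare cases are the expensive ones to get wrong.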

The fifth implication is that evals should include gaps between rare examples. Testing a model immediately after exposure to a rare task is easier than testing whether it retains and reuses that signal later. Retention over time is closer to the production problem.
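Reporting for such an eval could look like the following sketch, which groups results by the delay since the last rare-task exposure instead of averaging everything together. The record format is an assumption and the sample results are placeholders, not measurements.

```python
from collections import defaultdict


def accuracy_by_delay(results):
    """Group (delay_steps, correct) pairs by delay and return
    {delay_steps: accuracy}, ordered by increasing delay."""
    totals = defaultdict(lambda: [0, 0])  # delay -> [hits, attempts]
    for delay, correct in results:
        totals[delay][0] += int(correct)
        totals[delay][1] += 1
    return {delay: hits / attempts for delay, (hits, attempts) in sorted(totals.items())}


# Placeholder records: (steps since last rare-task exposure, answered correctly).
results = [
    (0, True), (0, True),
    (1000, True), (1000, False),
    (10000, False), (10000, False),
]

report = accuracy_by_delay(results)
overall = sum(correct for _, correct in results) / len(results)
```

In this made-up example the overall accuracy is 50%, which hides the fact that the model is perfect immediately after exposure and fails entirely at the longest delay.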

What This Means for Buyers and Operators

For buyers, the paper is a warning against judging models only by average-case performance. The valuable question is not “does the cheaper model work on normal prompts?” It is “which rare but important cases does the cheaper model forget, flatten, or mishandle?”

That matters in domains with long tails: support, legal review, finance operations, healthcare operations, industrial workflows, internal tools, and enterprise search. These systems often look fine until an uncommon policy, edge-case account, legacy data pattern, or unusual exception appears.

Operators should ask vendors how they test rare-task retention. Do they evaluate long-tail cases separately? Do they know which capabilities are frequency-sensitive? Do they route rare cases to larger models? Do they use retrieval, examples, replay, or targeted fine-tuning to protect the behaviors they care about?

The paper also gives a better way to think about model cost. The cheapest model is not always the one with the lowest per-token price. It is the model and workflow combination that can handle the distribution of work without turning every rare case into manual cleanup.

What to Watch Next

The field should watch whether this rare-task retention story holds for naturally occurring capabilities in frontier-scale models, not only injected tasks in controlled OLMo runs.

Builders should watch for training methods that reduce interference without requiring brute-force scale. Mixture weighting, replay, curriculum design, targeted synthetic data, modular architectures, and routing systems all become more interesting if the bottleneck is retention under interference.

Buyers should watch for evals that report long-tail capability separately from headline benchmark performance. A model card that says “strong average performance” is less useful than one that says which rare tasks degrade, under what frequencies, and after what delays.

Researchers should also watch the relationship between memorization and generalization. This paper makes the case that some memorization is not a defect. It can be a bridge toward abstraction when the signal is sparse.

Limitations and Caveats

This is not a complete theory of scaling. The authors are explicit about that. Expressivity, sample efficiency, optimization, architecture, and data quality still matter. The paper adds a data-centric mechanism, not a replacement for every other explanation.

The OLMo validation uses injected tasks. That is a strength for control, but it is not the same as proving the mechanism for every natural capability in frontier models. Real tasks are messier, more entangled, and harder to define.

The tested language models range up to 4B parameters. That is large enough to be meaningful, but it is not the same as directly validating behavior in the largest production frontier systems.

The paper focuses on frequency and, in the synthetic setup, complexity. It does not fully resolve how task semantic structure, data quality, curriculum, retrieval, post-training, or tool use change the picture.

The practical conclusion should therefore be modest: scale can help models retain and learn rare tasks under interference, but scale is not the only lever. Sometimes the better move is to change the data mixture, task schedule, architecture, or workflow around the model.

Source

Jing Huang, Daniel Wurgaft, Rachit Bansal, Laura Ruis, Naomi Saphra, David Alvarez-Melis, Andrew Kyle Lampinen, Christopher Potts, and Ekdeep Singh Lubana. (2026). Why Larger Models Learn More: Effects of Capacity, Interference, and Rare-Task Retention. arXiv preprint arXiv:2605.29548. Available at: https://arxiv.org/abs/2605.29548