Research Explainers May 27, 2026 12 min read

Synthetic Data Needs Recipes, Not Bigger Generators

Model building is bottlenecked less by the raw quantity of tokens than by how efficiently good tokens can be manufactured. Success depends on knowing how to produce better tokens without wasting compute.

Source note: Joel Niklaus, Atsuki Yamaguchi, Michal Stefanik, Guilherme Penedo, Hynek Kydlicek, Elie Bakouch, Lewis Tunstall, Edward Emanuel Beeching, Thibaud Frere, Colin Raffel, Leandro von Werra, and Thomas Wolf. “How Can We Synthesize High-Quality Pretraining Data? A Systematic Study of Prompt Design, Generator Model, and Source Data.” arXiv:2604.13977, 2026-04-15. https://arxiv.org/abs/2604.13977

Companion source: Joel Niklaus, “The Synthetic Data Playbook: Generating Trillions of the Finest Tokens,” University of Zurich guest lecture deck, 2026. https://docs.google.com/presentation/d/1lY40grdoBdD3VsuaxmwVAx3x-X8hxZkU8ZXsvLVAmyw/mobilepresent?slide=id.g3e2ed693d5a_2_109

Why This Paper Matters

Synthetic data is no longer a side trick. It is a core part of how language models are built.

Frontier labs and open model teams use synthetic tokens for pretraining, mid-training, and post-training. But the recipes are usually hidden. A model release might mention synthetic data without answering the operational questions: which source data, which prompts, which generator, how much raw web data was mixed back in, how the pipeline avoided template collapse, and how the team verified that the data was actually better.

This paper and the companion deck turn alchemy into engineering. The authors didn’t simply release a dataset; they ran a systematic study across prompt design, generator models, source data, and mixtures to build FinePhrase, an open 486-billion-token synthetic pretraining dataset.

The punchline: bigger generator models aren’t the answer. The strongest gains come from the shape of the transformation, the quality of the mixed-in raw data, the diversity of outputs, and the infrastructure to run enough experiments to find a real recipe.

The Idea in Plain English

The core idea is rephrasing the web.

Take a web document and run it through a language model. Tell the model to preserve the information while changing the format. The design space is massive: the same document can become a table, a tutorial, a math word problem, an FAQ, a summary, a dialogue, or a list of facts.

Those formats are not cosmetic. They change what the next model learns from the text.

A flat web page may contain useful information, but the signal is often buried in prose, navigation junk, and repetition. A table makes relationships explicit. An FAQ surfaces implicit questions. A math prompt turns quantitative facts into worked reasoning. A tutorial exposes procedural steps.

The paper’s claim: synthetic pretraining data works best when it restructures information into forms that teach. Simple paraphrase is weak. Transformation into pedagogical structures is the real primitive.
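A minimal sketch of the rephrasing step, assuming one plain prompt template per target format. The instruction wording, the `FORMAT_INSTRUCTIONS` table, and the `build_prompt` helper are illustrative assumptions; they are not the prompts used in the paper.

```python
from string import Template

# Hypothetical instructions, one per structured format named in the article.
FORMAT_INSTRUCTIONS = {
    "table": "Rewrite the document as one or more tables that make relationships explicit.",
    "faq": "Rewrite the document as a list of questions and answers.",
    "math": "Turn the quantitative facts in the document into worked math word problems.",
    "tutorial": "Rewrite the document as a step-by-step tutorial.",
}

# Shared wrapper: the format changes, the information should not.
PROMPT = Template(
    "$instruction\n"
    "Preserve the information in the document. Do not add new facts.\n\n"
    "Document:\n$document\n"
)


def build_prompt(document: str, target_format: str) -> str:
    """Return the generator prompt for one document and one target format."""
    if target_format not in FORMAT_INSTRUCTIONS:
        raise ValueError(f"unknown format: {target_format!r}")
    return PROMPT.substitute(
        instruction=FORMAT_INSTRUCTIONS[target_format],
        document=document.strip(),
    )


if __name__ == "__main__":
    print(build_prompt("Water boils at 100 C at sea level.", "faq"))
```

The same source document can be passed through every key in `FORMAT_INSTRUCTIONS`, which is what makes the design space large: one corpus yields several candidate training sets.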

What the Researchers Tested

The authors test three axes.

First, prompt design. They compare existing synthetic-data prompts with four structured formats: math, FAQ, table, and tutorial.

Second, generator choice. They test models from 135 million to 27 billion parameters, including families like Gemma, Llama, Qwen, Granite, Falcon, and SmolLM2.

Third, data composition. They test which source corpora to rephrase and which raw web corpus to mix back into training. This is important because the system trains on a mixture of synthetic and original web tokens.

The experiments run at substantial scale: each configuration rephrases roughly 10 billion tokens, then trains a small language model on 21 billion tokens. Evaluation covers 12 benchmarks across reasoning, general knowledge, math, and table understanding.

The broader study generated over 1 trillion tokens for ablations, requiring more than 100,000 H100 hours. The final FinePhrase dataset contains 486 billion completion tokens across 1.35 billion samples.
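As a quick sanity check on those dataset figures, the average completion length follows directly from the two reported totals. This is plain arithmetic on the numbers quoted above, not a statistic taken from the paper.

```python
# Totals as quoted in the article for the final FinePhrase dataset.
completion_tokens = 486_000_000_000
samples = 1_350_000_000

# Mean completion length per sample.
avg_tokens_per_sample = completion_tokens / samples
print(f"{avg_tokens_per_sample:.0f} completion tokens per sample on average")
```

An average of a few hundred tokens per sample is consistent with rephrased outputs that are compact restructurings rather than long-form rewrites.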

What They Found

Prompt Design Was the Biggest Lever

Structured formats beat the raw DCLM baseline. The key isn’t that “synthetic data” wins as a broad category, but which synthetic data wins. The strongest formats restructure information into teachable forms: tables, math, FAQs, and tutorials.

For builders, this means that if a prompt merely asks for a cleaner version of a document, the pipeline wastes compute on nicer prose without improving the training signal. The better recipe forces the generator to expose latent structure.

Small Generator Models Were Often Enough

Generator scale saturates early. For structured prompts, 1B-class models work well, and 4B models help with more complex transformations. Jumping to 12B or 27B is usually a poor trade once cost is considered.

Synthetic data pipelines become uneconomical if every token requires a large model. FinePhrase suggests using a small model that is good at the specific transformation, then spending the saved compute on better prompts, more experiments, and higher throughput.
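A rough way to see the cost gap, assuming the common rule of thumb that a dense transformer forward pass costs about 2 FLOPs per parameter per token. The token counts are placeholders, and the estimate ignores attention cost, memory bandwidth, batching, and speculative decoding, so treat it as an order-of-magnitude sketch only.

```python
def generation_flops(params: float, prompt_tokens: int, output_tokens: int) -> float:
    """Approximate forward-pass FLOPs to read a prompt and write a completion."""
    return 2.0 * params * (prompt_tokens + output_tokens)


# Placeholder document and completion lengths (assumptions, not paper values).
PROMPT_TOKENS = 1000
OUTPUT_TOKENS = 400

small = generation_flops(1.7e9, PROMPT_TOKENS, OUTPUT_TOKENS)  # 1.7B generator
large = generation_flops(27e9, PROMPT_TOKENS, OUTPUT_TOKENS)   # 27B generator

ratio = large / small
print(f"27B generator costs about {ratio:.1f}x the compute of a 1.7B generator")
```

Under this approximation the ratio depends only on parameter count, so the larger generator has to deliver a correspondingly large quality gain to justify itself.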

SmolLM2 Was Better Than Cleaner-Looking Alternatives

SmolLM2 1.7B performed strongly, likely because its instruction tuning includes rewrite-heavy data.

The comparison with Qwen3 is instructive: in math rephrasing, Qwen3 produced polished outputs, but many started with the same template. SmolLM2 outputs looked messier and were sometimes short, but they avoided this template collapse.

For pretraining, polish is a trap. Human taste rewards consistency and formatting, but models need varied text to learn effectively.
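One cheap way to watch for template collapse is to measure how many outputs share the same opening words. The sketch below is an assumed diagnostic, not the paper's method; the function name and the choice of five opening words are illustrative.

```python
from collections import Counter


def opening_collapse_rate(outputs: list[str], n_words: int = 5) -> float:
    """Fraction of non-empty outputs that start with the most common opening.

    Openings are compared case-insensitively on their first n_words words.
    A value near 1.0 suggests most outputs follow one template.
    """
    openings = [
        " ".join(text.lower().split()[:n_words])
        for text in outputs
        if text.strip()
    ]
    if not openings:
        return 0.0
    _, count = Counter(openings).most_common(1)[0]
    return count / len(openings)


if __name__ == "__main__":
    sample = [
        "Here is a math problem: a train leaves at noon.",
        "Here is a math problem: a shop sells pens.",
        "Tom has three apples and gives one away.",
    ]
    print(opening_collapse_rate(sample, n_words=4))
```

A check like this can run on a small sample per generator before committing compute to a full rephrasing run.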

Synthetic Data Needed Raw Web Data Beside It

Synthetic-only training underperforms. Generated data should be mixed with strong non-synthetic data.

This is a signal-composition issue, not just a safety hedge. Structured synthetic data improves factual recall and reading comprehension. Raw web data performs better on commonsense signals like HellaSwag. If a model trains only on structured data, it loses the messy distribution that makes ordinary language useful.
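A minimal sketch of mixing the two streams, assuming documents are interleaved at random according to a target synthetic share. The `synthetic_fraction` value is a placeholder, since the article does not state the ratio the authors used, and this version mixes by document count rather than by token count.

```python
import random
from typing import Iterable, Iterator


def mix_streams(
    synthetic: Iterable[str],
    raw: Iterable[str],
    synthetic_fraction: float,
    seed: int = 0,
) -> Iterator[str]:
    """Yield documents from both streams until the chosen stream runs out.

    Each step picks the synthetic stream with probability synthetic_fraction,
    otherwise the raw web stream.
    """
    if not 0.0 <= synthetic_fraction <= 1.0:
        raise ValueError("synthetic_fraction must be between 0 and 1")
    rng = random.Random(seed)
    syn_it, raw_it = iter(synthetic), iter(raw)
    while True:
        source = syn_it if rng.random() < synthetic_fraction else raw_it
        try:
            yield next(source)
        except StopIteration:
            return


if __name__ == "__main__":
    mixed = mix_streams(["syn-1", "syn-2"], ["raw-1", "raw-2"], synthetic_fraction=0.5)
    print(list(mixed))
```

Treating the fraction as an explicit parameter makes it something to ablate, alongside the choice of raw corpus.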

Quality Classifiers Were Not Enough

The paper reports that cheap quality scores are unreliable predictors of performance. DCLM-score is moderately predictive, but Edu-score can be misleading. Some of the formats that train best receive lower quality scores after rephrasing.

Data pipelines often want a fast proxy classifier. But proxy quality is not the same as training value. A classifier may penalize a table or math transformation because it no longer looks like the educational prose the classifier was trained to reward. The lesson: there is no universal shortcut. Teams need downstream ablations.
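Whether a proxy score tracks training value can be checked by rank-correlating it against downstream results across recipes. The sketch below implements Spearman correlation with the standard library and assumes no tied values; the numbers in the usage example are toy values invented for illustration, not results from the paper.

```python
def _ranks(values: list[float]) -> list[int]:
    """Rank positions (0 = smallest). Assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, index in enumerate(order):
        ranks[index] = rank
    return ranks


def spearman(xs: list[float], ys: list[float]) -> float:
    """Spearman rank correlation for two equal-length lists without ties."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    d_squared = sum((a - b) ** 2 for a, b in zip(_ranks(xs), _ranks(ys)))
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))


if __name__ == "__main__":
    # Toy values: one proxy score and one downstream score per recipe.
    proxy_scores = [4, 3, 2, 1]
    downstream_scores = [2, 4, 1, 3]
    print(spearman(proxy_scores, downstream_scores))  # prints 0.0
```

A correlation near zero across recipes is the signal that a classifier score should not be used to pick the winning format.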

Why It Happens

A synthetic-data pipeline does two jobs: it preserves information from the source and chooses a new representation. The representation changes the learning surface.

Tables compress relationships. FAQs turn facts into retrieval questions. Tutorials impose sequence. These formats make specific capabilities easier to learn, which is why benchmark gains are uneven across categories.

This explains the tradeoff. Raw web text contains ordinary context and human patterns. Structured synthetic data distills facts but removes some of the everyday context. It’s like nutrition: different data formats provide different nutrients. A useful pipeline mixes them deliberately.
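The uneven gains across categories are easy to miss in a single averaged score. The toy example below, with numbers invented purely for illustration, shows two recipes that tie on the macro average while differing on every category.

```python
from statistics import mean

# Toy scores in percent; invented for illustration, not results from the paper.
recipe_a = {"knowledge": 60, "commonsense": 40}
recipe_b = {"knowledge": 50, "commonsense": 50}


def macro_average(scores: dict[str, float]) -> float:
    """Unweighted mean over benchmark categories."""
    return mean(list(scores.values()))


for name, scores in (("recipe_a", recipe_a), ("recipe_b", recipe_b)):
    print(name, macro_average(scores), scores)
```

Reporting the per-category scores next to the average is what exposes the tradeoff between structured and raw data.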

What This Means for Builders

Treat synthetic data as a data-engineering problem, not a prompting stunt.

A real recipe includes source selection, transformation prompts, generator choice, sampling settings, diversity checks, mix-in strategy, and downstream ablations. Infrastructure matters: the authors reached 33.1 million tokens per GPU-hour using DataTrove and speculative decoding.
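To make the throughput figure concrete, the sketch below converts it into GPU-hours for a single ablation. It assumes the reported 33.1 million tokens per GPU-hour applies uniformly, which may not hold for every generator or prompt, so the result is a best-case estimate.

```python
def generation_gpu_hours(tokens: float, tokens_per_gpu_hour: float) -> float:
    """GPU-hours needed to generate a token budget at a fixed throughput."""
    if tokens_per_gpu_hour <= 0:
        raise ValueError("throughput must be positive")
    return tokens / tokens_per_gpu_hour


REPORTED_THROUGHPUT = 33.1e6   # tokens per GPU-hour, as quoted in the article
TOKENS_PER_ABLATION = 10e9     # roughly 10 billion rephrased tokens per configuration

hours = generation_gpu_hours(TOKENS_PER_ABLATION, REPORTED_THROUGHPUT)
print(f"about {hours:.0f} GPU-hours of generation per configuration")
```

At lower throughput the same ablation costs proportionally more, which is why infrastructure determines how many recipes a team can afford to compare.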

Without throughput, teams cannot afford enough experiments to find good recipes. The takeaway: spend less time looking for a magic prompt and more time building the loop that compares prompts, models, and mixtures at scale.
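The comparison loop can be sketched as a grid over prompts, generators, and mixture ratios. The `train_and_eval` callable is a hypothetical hook standing in for the expensive part (rephrase, mix, train a small model, run benchmarks); the names and the stub scorer in the usage example are assumptions.

```python
from itertools import product
from typing import Callable


def run_ablations(
    prompts: list[str],
    generators: list[str],
    synthetic_fractions: list[float],
    train_and_eval: Callable[[str, str, float], float],
) -> list[dict]:
    """Evaluate every combination and return results sorted best-first."""
    results = []
    for prompt, generator, fraction in product(prompts, generators, synthetic_fractions):
        score = train_and_eval(prompt, generator, fraction)
        results.append(
            {
                "prompt": prompt,
                "generator": generator,
                "synthetic_fraction": fraction,
                "score": score,
            }
        )
    return sorted(results, key=lambda r: r["score"], reverse=True)


if __name__ == "__main__":
    # Stub scorer so the sketch runs; a real hook would train and benchmark a model.
    def fake_eval(prompt: str, generator: str, fraction: float) -> float:
        return len(prompt) + fraction

    table = run_ablations(["faq", "table"], ["gen-small"], [0.25, 0.5], fake_eval)
    print(table[0])
```

The grid grows multiplicatively with each axis, which ties the number of recipes a team can test directly to generation and training throughput.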

What This Means for Buyers and Operators

Buyers should discount vague synthetic-data claims. Ask what role the synthetic data played: was it pretraining, post-training, reasoning, or tool-use data? Was it mixed with raw data? How did the vendor prevent repetitive templates? Did they measure downstream gains or only classifier scores?

The commonsense-versus-knowledge tradeoff is essential. A synthetic-heavy recipe might improve factual recall but weaken ordinary-world behavior if not mixed carefully.

For operators building internal systems, the same lesson applies. If you generate training examples or knowledge-base rewrites, don’t optimize for clean-looking prose alone. Track diversity, coverage, and downstream task performance.

What to Watch Next

Watch whether synthetic pretraining becomes more like compiler optimization than content generation. The goal is choosing transformations that expose structure and managing the cost-to-performance ratio. The best teams will have build systems with regression tests and mixture experiments, not one-off notebooks.

Researchers should watch scale transfer. While ablations on small models are useful, it remains important to verify which findings hold at 100B-token and frontier-scale training.

Builders should watch prompt optimization and best-of-N filtering. If automatic prompt searches can find stronger transformations, the economics of synthetic data will shift again.

Limitations and Caveats

This is a preprint under review, so results should be treated as evidence rather than law.

The experiments focus on rephrased web text for pretraining. They do not answer every question about synthetic instruction data, code, or multimodal data.

The benchmark setup is constrained. Most core ablations train small models for 21B tokens. This is useful for comparison, but it is not the same as full-scale pretraining.

Finally, macro scores hide category-level differences. The right recipe depends on what capability the model needs. Synthetic data quality is a systems problem, not a model problem.

Source

Niklaus, J., Yamaguchi, A., Stefanik, M., Penedo, G., Kydlicek, H., Bakouch, E., Tunstall, L., Beeching, E. E., Frere, T., Raffel, C., von Werra, L., & Wolf, T. (2026). How Can We Synthesize High-Quality Pretraining Data? A Systematic Study of Prompt Design, Generator Model, and Source Data. arXiv preprint arXiv:2604.13977. Available at: https://arxiv.org/abs/2604.13977

Niklaus, J. (2026). The Synthetic Data Playbook: Generating Trillions of the Finest Tokens. Companion lecture deck. Available at: https://docs.google.com/presentation/d/1lY40grdoBdD3VsuaxmwVAx3x-X8hxZkU8ZXsvLVAmyw/mobilepresent?slide=id.g3e2ed693d5a_2_109

