# Synthetic Data Needs Recipes, Not Bigger Generators
Model building is bottlenecked by token manufacturing efficiency more than raw quantity. Success depends on knowing how to produce better tokens without wasting compute.
Source note: Joel Niklaus, Atsuki Yamaguchi, Michal Stefanik, Guilherme Penedo, Hynek Kydlicek, Elie Bakouch, Lewis Tunstall, Edward Emanuel Beeching, Thibaud Frere, Colin Raffel, Leandro von Werra, and Thomas Wolf. “How Can We Synthesize High-Quality Pretraining Data? A Systematic Study of Prompt Design, Generator Model, and Source Data.” arXiv:2604.13977, 2026-04-15. https://arxiv.org/abs/2604.13977
Companion source: Joel Niklaus, “The Synthetic Data Playbook: Generating Trillions of the Finest Tokens,” University of Zurich guest lecture deck, 2026. https://docs.google.com/presentation/d/1lY40grdoBdD3VsuaxmwVAx3x-X8hxZkU8ZXsvLVAmyw/mobilepresent?slide=id.g3e2ed693d5a_2_109
Why This Paper Matters
Synthetic data is no longer a side trick. It is a core part of how language models are built.
Frontier labs and open model teams use synthetic tokens for pretraining, mid-training, and post-training. But the recipes are usually hidden. A model release might mention synthetic data without answering the operational questions: which source data, which prompts, which generator, how much raw web data was mixed back in, how the pipeline avoided template collapse, and how the team verified that the data was actually better.
This paper and the companion deck turn alchemy into engineering. The authors didn’t simply release a dataset; they ran a systematic study across prompt design, generator models, source data, and mixtures to build FinePhrase, an open 486-billion-token synthetic pretraining dataset.
The punchline: bigger generator models aren’t the answer. The strongest gains come from the shape of the transformation, the quality of the mixed-in raw data, the diversity of outputs, and the infrastructure to run enough experiments to find a real recipe.
The Idea in Plain English
The core idea is rephrasing the web.
Take a web document and run it through a language model. Tell the model to preserve the information while changing the format. The design space is massive: the same document can become a table, a tutorial, a math word problem, an FAQ, a summary, a dialogue, or a list of facts.
Those formats are not cosmetic. They change what the next model learns from the text.
A flat web page may contain useful information, but the signal is often buried in prose, navigation junk, and repetition. A table makes relationships explicit. An FAQ surfaces implicit questions. A math prompt turns quantitative facts into worked reasoning. A tutorial exposes procedural steps.
The paper’s claim: synthetic pretraining data works best when it restructures information into forms that teach. Simple paraphrase is weak. Transformation into pedagogical structures is the real primitive.
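The rephrasing step described above can be sketched as a small prompt builder. This is a minimal illustration only: the instruction wording, the `FORMAT_INSTRUCTIONS` table, and the `build_rephrase_prompt` helper are hypothetical and are not the prompts used in the paper.

```python
# Hypothetical prompt templates. The wording is illustrative, not the paper's.
FORMAT_INSTRUCTIONS = {
    "table": "Rewrite the document as one or more tables that make the "
             "relationships between entities explicit.",
    "faq": "Rewrite the document as a list of questions and answers "
           "covering its key facts.",
    "math": "Turn the quantitative facts in the document into word problems "
            "with worked solutions.",
    "tutorial": "Rewrite the document as a step-by-step tutorial.",
}


def build_rephrase_prompt(document: str, fmt: str) -> str:
    """Return a prompt asking a generator to restructure `document` as `fmt`."""
    if fmt not in FORMAT_INSTRUCTIONS:
        raise ValueError(f"unknown format: {fmt!r}")
    return (
        f"{FORMAT_INSTRUCTIONS[fmt]}\n"
        "Preserve all information from the source. Do not add new facts.\n\n"
        f"Document:\n{document.strip()}\n"
    )


if __name__ == "__main__":
    doc = "The bridge opened in 1932 and spans 503 metres."
    for fmt in FORMAT_INSTRUCTIONS:
        print(f"--- {fmt} ---")
        print(build_rephrase_prompt(doc, fmt))
```

The same source document produces four different prompts, which is the point: the format instruction, not the document, is the variable under study.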
What the Researchers Tested
The authors test three axes.
First, prompt design. They compare existing synthetic-data prompts with four structured formats: math, FAQ, table, and tutorial.
Second, generator choice. They test models from 135 million to 27 billion parameters, including families like Gemma, Llama, Qwen, Granite, Falcon, and SmolLM2.
Third, data composition. They test which source corpora to rephrase and which raw web corpus to mix back into training. This matters because the model under evaluation trains on a mixture of synthetic and original web tokens.
The experiments ran at significant scale. Each configuration rephrases roughly 10 billion tokens, then trains a small language model on 21 billion tokens. Evaluation covers 12 benchmarks across reasoning, general knowledge, math, and table understanding.
The broader study generated over 1 trillion tokens for ablations, requiring more than 100,000 H100 hours. The final FinePhrase dataset contains 486 billion completion tokens across 1.35 billion samples.
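A quick back-of-envelope check helps put these figures in proportion. The sketch below uses only the numbers quoted above. The configuration estimate is a rough derived figure under the assumption that every ablation run used about 10 billion tokens; it is not a count reported in the source.

```python
# Figures quoted in the text above.
COMPLETION_TOKENS = 486e9    # FinePhrase completion tokens
SAMPLES = 1.35e9             # FinePhrase samples
ABLATION_TOKENS = 1e12       # "over 1 trillion" tokens generated for ablations
TOKENS_PER_CONFIG = 10e9     # "roughly 10 billion" tokens per configuration

# Average completion length in the final dataset.
avg_tokens_per_sample = COMPLETION_TOKENS / SAMPLES

# Assumption: every ablation run rephrased about TOKENS_PER_CONFIG tokens.
approx_config_runs = ABLATION_TOKENS / TOKENS_PER_CONFIG

print(f"average completion length: {avg_tokens_per_sample:.0f} tokens per sample")
print(f"configuration-sized runs: on the order of {approx_config_runs:.0f}")
```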
What They Found
Prompt Design Was the Biggest Lever
Structured formats beat the raw DCLM baseline. The key isn’t that “synthetic data” wins as a broad category, but which synthetic data wins. The strongest formats restructure information into teachable forms: tables, math, FAQs, and tutorials.
For builders, this means that if a prompt merely asks for a cleaner version of a document, the pipeline wastes compute on nicer prose without improving the training signal. The better recipe forces the generator to expose latent structure.
Small Generator Models Were Often Enough
Generator scale saturates early. For structured prompts, 1B-class models work well, and 4B models help when the transformation is more complex. Jumping to 12B or 27B is usually a poor trade once cost is taken into account.
Synthetic data pipelines become uneconomical if every token requires a large model. FinePhrase suggests using a small model that is good at the specific transformation, then spending the saved compute on better prompts, more experiments, and higher throughput.
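To see why generator size dominates the budget, here is a rough cost comparison. It assumes the common rule of thumb that a dense transformer forward pass costs about 2 FLOPs per parameter per generated token. That rule ignores attention cost, memory bandwidth, batching, and speculative decoding, so treat the ratios as order-of-magnitude guidance rather than measured numbers. The helper name and the list of sizes are illustrative.

```python
def relative_generation_cost(params_b: float, baseline_params_b: float) -> float:
    """Approximate per-token cost ratio between two dense generators.

    Assumes forward-pass FLOPs per token scale as 2 * parameter count,
    so the ratio reduces to the ratio of parameter counts.
    """
    if params_b <= 0 or baseline_params_b <= 0:
        raise ValueError("parameter counts must be positive")
    return (2 * params_b) / (2 * baseline_params_b)


if __name__ == "__main__":
    baseline = 1.7  # billions of parameters, e.g. a SmolLM2 1.7B-class model
    for size in (1.7, 4.0, 12.0, 27.0):
        ratio = relative_generation_cost(size, baseline)
        print(f"{size:>5.1f}B generator: ~{ratio:.1f}x the per-token cost")
```

Under this assumption, a 27B generator has to deliver a large downstream gain to justify spending more than ten times the compute per token.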
SmolLM2 Was Better Than Cleaner-Looking Alternatives
SmolLM2 1.7B performed strongly, likely because its instruction tuning includes rewrite-heavy data.
Part of the explanation is surprising: in math rephrasing, Qwen3 produced polished outputs, but many of them opened with the same template. SmolLM2 outputs looked messier and were sometimes short, but they avoided this template collapse.
For pretraining, polish is a trap. Human taste rewards consistency and formatting, but models need linguistic variety to learn effectively.
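Template collapse of this kind is cheap to detect. One simple check, sketched below, measures how many outputs share the single most common opening. The function name, the five-word window, and the example strings are assumptions for illustration; the source does not specify how the authors measured collapse.

```python
from collections import Counter


def opening_collapse_rate(outputs: list[str], n_words: int = 5) -> float:
    """Share of non-empty outputs that start with the most common opening.

    The opening is the first `n_words` whitespace-separated words, lowercased.
    A value near 1.0 means almost every output starts the same way.
    """
    openings = [
        " ".join(text.lower().split()[:n_words])
        for text in outputs
        if text.strip()
    ]
    if not openings:
        return 0.0
    _, top_count = Counter(openings).most_common(1)[0]
    return top_count / len(openings)


if __name__ == "__main__":
    sample = [
        "Here is a math problem about apples and oranges.",
        "Here is a math problem about two trains.",
        "Consider two trains leaving the same station.",
    ]
    print(f"collapse rate: {opening_collapse_rate(sample):.2f}")
```

A check like this only catches identical openings. Broader diversity measures would be needed to catch templates that vary in their first few words.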
Synthetic Data Needed Raw Web Data Beside It
Synthetic-only training underperforms. Generated data should be mixed with strong non-synthetic data.
This is a signal-composition issue, not just a safety hedge. Structured synthetic data improves factual recall and reading comprehension. Raw web data performs better on commonsense benchmarks such as HellaSwag. If a model trains only on structured data, it loses the messy distribution that makes ordinary language useful.
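Mixing can be sketched as a sampler that draws each training document from either the synthetic or the raw stream. This is a minimal sketch under stated assumptions: `mix_streams` and `synthetic_fraction` are hypothetical names, the ratio is a free parameter rather than a value from the paper, and the mix is by document count, whereas a production pipeline would more likely balance by tokens.

```python
import random
from typing import Iterable, Iterator


def mix_streams(
    synthetic: Iterable[str],
    raw: Iterable[str],
    synthetic_fraction: float,
    seed: int = 0,
) -> Iterator[str]:
    """Yield documents, picking the synthetic stream with the given probability.

    Stops as soon as the chosen stream is exhausted, so the realised ratio
    stays close to `synthetic_fraction` for the whole output.
    """
    if not 0.0 <= synthetic_fraction <= 1.0:
        raise ValueError("synthetic_fraction must be between 0 and 1")
    rng = random.Random(seed)
    synthetic_it, raw_it = iter(synthetic), iter(raw)
    while True:
        source = synthetic_it if rng.random() < synthetic_fraction else raw_it
        try:
            yield next(source)
        except StopIteration:
            return


if __name__ == "__main__":
    synthetic_docs = [f"synthetic-{i}" for i in range(5)]
    raw_docs = [f"raw-{i}" for i in range(5)]
    for doc in mix_streams(synthetic_docs, raw_docs, synthetic_fraction=0.5):
        print(doc)
```

Keeping the ratio as an explicit parameter makes it something an ablation can sweep, rather than a constant buried in the data loader.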
Quality Classifiers Were Not Enough
The paper reports that cheap quality scores are unreliable predictors of performance. DCLM-score is moderately predictive, but Edu-score can be misleading. Some of the formats that train best receive lower quality scores after rephrasing.
Data pipelines often want a fast proxy classifier. But proxy quality is not the same as training value. A classifier may penalize a table or math transformation because it no longer looks like the educational prose the classifier was trained to reward. The lesson: there is no universal shortcut. Teams need downstream ablations.
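One way to test whether a proxy score deserves trust is to rank-correlate it with downstream results across recipes. The sketch below implements Spearman rank correlation in plain Python. The numbers in the usage section are made-up placeholders for illustration only and are not results from the paper.

```python
def ranks(values: list[float]) -> list[float]:
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(xs: list[float], ys: list[float]) -> float:
    """Spearman rank correlation, computed as Pearson correlation of ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    rx, ry = ranks(xs), ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


if __name__ == "__main__":
    # Placeholder values, one entry per hypothetical recipe. Not paper results.
    proxy_quality_score = [0.9, 0.7, 0.5, 0.3]
    downstream_benchmark = [0.41, 0.44, 0.43, 0.40]
    rho = spearman(proxy_quality_score, downstream_benchmark)
    print(f"rank correlation between proxy and downstream: {rho:.2f}")
```

A correlation near zero or below would indicate that filtering on the proxy cannot stand in for a downstream ablation.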
Why It Happens
A synthetic-data pipeline does two jobs: it preserves information from the source and chooses a new representation. The representation changes the learning surface.
Tables compress relationships. FAQs turn facts into retrieval questions. Tutorials impose sequence. These formats make specific capabilities easier to learn, which is why benchmark gains are uneven across categories.
This explains the tradeoff. Raw web text contains ordinary context and human patterns. Structured synthetic data distills facts but removes some of the everyday context. It’s like nutrition: different data formats provide different nutrients. A useful pipeline mixes them deliberately.
What This Means for Builders
Treat synthetic data as a data-engineering problem, not a prompting stunt.
A real recipe includes source selection, transformation prompts, generator choice, sampling settings, diversity checks, mix-in strategy, and downstream ablations. Infrastructure matters: the authors reached 33.1 million tokens per GPU-hour using DataTrove and speculative decoding.
Without throughput, teams cannot afford enough experiments to find good recipes. The takeaway: spend less time looking for a magic prompt and more time building the loop that can compare prompts, models, and mixtures at scale.
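The throughput figure translates directly into an experiment budget. The sketch below estimates generation cost at the quoted 33.1 million tokens per GPU-hour. It assumes that rate holds for every run, which the source does not claim: real throughput depends on the generator, the prompt, and the hardware, and the estimate covers generation only, not training or evaluation.

```python
THROUGHPUT_TOKENS_PER_GPU_HOUR = 33.1e6  # figure quoted in the text above


def gpu_hours(
    tokens: float,
    tokens_per_gpu_hour: float = THROUGHPUT_TOKENS_PER_GPU_HOUR,
) -> float:
    """Estimate GPU-hours needed to generate `tokens` at a fixed throughput."""
    if tokens_per_gpu_hour <= 0:
        raise ValueError("throughput must be positive")
    return tokens / tokens_per_gpu_hour


if __name__ == "__main__":
    workloads = [
        ("one 10B-token ablation", 10e9),
        ("FinePhrase, 486B tokens", 486e9),
    ]
    for label, tokens in workloads:
        print(f"{label}: ~{gpu_hours(tokens):,.0f} GPU-hours of generation")
```

Halving the throughput doubles the cost of every ablation, which is why generation speed decides how many recipes a team can afford to compare.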
What This Means for Buyers and Operators
Buyers should ignore vague synthetic-data claims. Ask what role the synthetic data played: was it pretraining, post-training, reasoning, or tool-use data? Was it mixed with raw data? How did the team prevent repetitive templates? Did they measure downstream gains or only classifier scores?
The commonsense-versus-knowledge tradeoff is essential. A synthetic-heavy recipe might improve factual recall but weaken ordinary-world behavior if not mixed carefully.
For operators building internal systems, the same lesson applies. If you generate training examples or knowledge-base rewrites, don’t optimize for clean-looking prose alone. Track diversity, coverage, and downstream task performance.
What to Watch Next
Watch whether synthetic pretraining becomes more like compiler optimization than content generation. The goal is choosing transformations that expose structure and managing the cost-to-performance ratio. The best teams will have build systems with regression tests and mixture experiments, not one-off notebooks.
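A regression test for a data recipe could look like the sketch below: compare a candidate recipe against a baseline per benchmark category, and flag any category that drops even when the macro average improves. The function names, category labels, and scores are hypothetical placeholders and are not results from the paper.

```python
def macro_average(scores: dict[str, float]) -> float:
    """Unweighted mean of the per-category scores."""
    if not scores:
        raise ValueError("scores must not be empty")
    return sum(scores.values()) / len(scores)


def category_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.0,
) -> dict[str, float]:
    """Map each regressed category to its score change (candidate - baseline).

    A category counts as regressed when the candidate falls more than
    `tolerance` below the baseline.
    """
    missing = set(baseline) - set(candidate)
    if missing:
        raise KeyError(f"candidate is missing categories: {sorted(missing)}")
    return {
        name: candidate[name] - baseline[name]
        for name in baseline
        if candidate[name] < baseline[name] - tolerance
    }


if __name__ == "__main__":
    # Placeholder scores for two hypothetical recipes. Not paper results.
    baseline = {"knowledge": 0.40, "commonsense": 0.50}
    candidate = {"knowledge": 0.46, "commonsense": 0.46}
    print(f"macro: {macro_average(baseline):.2f} -> {macro_average(candidate):.2f}")
    print(f"regressions: {category_regressions(baseline, candidate)}")
```

In this toy case the macro score goes up while the commonsense category goes down, which is the failure a macro-only gate would miss.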
Researchers should watch scale transfer. Ablations on small models are useful, but it remains to be verified which findings hold at 100B-token and frontier-scale training.
Builders should watch prompt optimization and best-of-N filtering. If automatic prompt searches can find stronger transformations, the economics of synthetic data will shift again.
Limitations and Caveats
This is a preprint under review, so results should be treated as evidence rather than law.
The experiments focus on rephrased web text for pretraining. They do not answer every question about synthetic instruction data, code, or multimodal data.
The benchmark setup is constrained. Most core ablations train small models for 21B tokens. This is useful for comparison, but it is not the same as full-scale pretraining.
Finally, macro scores hide category-level differences. The right recipe depends on what capability the model needs. Synthetic data quality is a systems problem, not a model problem.
Source
Niklaus, J., Yamaguchi, A., Stefanik, M., Penedo, G., Kydlicek, H., Bakouch, E., Tunstall, L., Beeching, E. E., Frere, T., Raffel, C., von Werra, L., & Wolf, T. (2026). How Can We Synthesize High-Quality Pretraining Data? A Systematic Study of Prompt Design, Generator Model, and Source Data. arXiv preprint arXiv:2604.13977. Available at: https://arxiv.org/abs/2604.13977
Niklaus, J. (2026). The Synthetic Data Playbook: Generating Trillions of the Finest Tokens. Companion lecture deck. Available at: https://docs.google.com/presentation/d/1lY40grdoBdD3VsuaxmwVAx3x-X8hxZkU8ZXsvLVAmyw/mobilepresent?slide=id.g3e2ed693d5a_2_109