Lessons from Lukasz Kaiser

Lukasz Kaiser co-authored the 2017 "Attention Is All You Need" paper that introduced the Transformer architecture. He later moved to OpenAI to focus on process-based supervision, training models to reason step by step through math and logic problems. This profile gathers his views on deep learning from his academic papers, software frameworks, and public interviews.

Part 1: The Transformer Architecture

On Initial Expectations: "When we wrote the paper, we viewed attention primarily as an efficiency improvement for machine translation, not the foundation for a global generative AI revolution." — Source: [NVIDIA GTC Panel 2024]
On Sequence Context: "The core strength of the architecture is its ability to look at the entire past context simultaneously to generate the next word." — Source: [Maker Faire Rome 2020]
On Moving Beyond Convolutions: "Dispensing with recurrence and convolutions allowed us to rely entirely on attention mechanisms, which vastly improved parallelization." — Source: [Attention Is All You Need (2017)]
On Legacy: "While Transformers are the foundation of current AI, the field must continue searching for something better to reach new levels of performance." — Source: [NVIDIA GTC Panel 2024]
On Architectural Constraints: "The quadratic complexity of self-attention was a known limitation from day one, which is why making them more efficient has been an ongoing research goal." — Source: [Pi School Fireside Chat 2021]
On Discovering Generalization: "What surprised the team most over time was how well the architecture generalized from natural language to computer vision and audio." — Source: [ML in PL Conference 2021]
On Translation as the Catalyst: "Sequence-to-sequence learning in translation proved to be the perfect testing ground for deep learning to quietly revolutionize natural language processing." — Source: [AI Frontiers 2017]
On Simplification: "Our goal was to drop complex architectural baggage; we found that a simpler, purely attention-based mechanism performed significantly better." — Source: [Attention Is All You Need (2017)]
On Attention Weights: "Visualizing attention weights gave us early clues that the model was organically learning linguistic structure without explicit supervision." — Source: [Attention Is All You Need (2017)]
On Enduring Dominance: "Defending the Transformer against alternative designs today requires acknowledging its unmatched scaling properties, even as we look for post-Transformer architectures." — Source: [Transformer vs. Post-Transformer Debate]
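
The quadratic cost mentioned under "On Architectural Constraints" can be seen in a minimal sketch of scaled dot-product self-attention. This is an illustration in plain Python, not the paper's implementation; the function names (`softmax`, `self_attention`, `score_entries`) are hypothetical, and it omits learned projections, multiple heads, and masking.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors.

    Returns (outputs, weights). For self-attention over n tokens,
    `weights` is an n x n matrix: every token scores every other token.
    """
    d = len(keys[0])
    scores = [
        [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        for q in queries
    ]
    weights = [softmax(row) for row in scores]
    width = len(values[0])
    outputs = [
        [sum(w * v[j] for w, v in zip(row, values)) for j in range(width)]
        for row in weights
    ]
    return outputs, weights


def score_entries(n):
    """Number of pairwise scores computed for a sequence of n tokens."""
    return n * n


# Three toy token vectors (assumed values, for illustration only).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens, tokens, tokens)
```

Because the score matrix has one entry per pair of tokens, doubling the sequence length quadruples the number of scores, which is the scaling behavior the quote refers to.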

Part 2: The Role of Language in Intelligence

On the Key to Capability: "Language is the key to endowing intelligence with a special power, serving as the bridge between simple pattern matching and deeper cognition." — Source: [36Kr Interview 2025]
On Text as the Universal Interface: "Almost any task can be framed as language translation or text generation, which is why early NLP research had such broad downstream impact." — Source: [DeepLearning.AI NLP Specialization]
On Semantic Compression: "When a model learns language, it is fundamentally learning the structure of human thought and compressing reality into vector spaces." — Source: [Unsupervised Learning with Jacob Effron]
On Beyond Next-Token Prediction: "Predicting the next token in language was just the entry point; the real goal is getting the model to understand the underlying logic of what it's saying." — Source: [The MAD Podcast with Matt Turck]
On First-Principles Thinking: "Applying first-principles thinking to large models reveals that language itself forces a structured representation of abstract concepts." — Source: [36Kr Interview 2025]
On Human vs. Machine Generalization: "We must remain humble; while language models are powerful, it remains an open question if they can generalize in the physical world the way humans do." — Source: [AI Podcast with Lukasz Kaiser]
On the Limits of Vocabulary: "Intelligence is not strictly bound by vocabulary, but language provides the easiest dataset we have for training reasoning engines." — Source: [Unsupervised Learning with Jacob Effron]
On Emergent Capabilities: "The shift from specific problem-solving models to general capabilities was heavily driven by the density of information contained in human text." — Source: [This Is World interview with Lukasz Kaiser]
On Multimodality: "Language provides the scaffolding, but true understanding requires integrating vision, logic, and other modalities into that linguistic framework." — Source: [The MAD Podcast with Matt Turck]
On Translation Parallels: "Everything from writing code to solving math problems can be seen conceptually as translating human intent into formal language." — Source: [AI Frontiers 2017]

Part 3: Transition to System 2 Reasoning

On the Shift in Paradigm: "We are moving away from models that rely entirely on System 1 token prediction toward System 2 thinking that can pause, plan, and verify." — Source: [OpenAI Research Blog]
On Deliberate Computation: "Giving a model more time to think at test time unlocks capabilities that simply scaling training compute cannot achieve alone." — Source: [Training Verifiers (2021)]
On Human Parallels: "Just as humans don't solve complex math equations purely by reflex, we shouldn't expect AI models to generate correct complex answers in a single forward pass." — Source: [AI Podcast with Lukasz Kaiser]
On the Limitations of LLMs: "Traditional large language models are limited by their inability to self-correct during generation; System 2 architectures are designed to fix this." — Source: [The MAD Podcast with Matt Turck]
On Active Verification: "By training models to verify their own step-by-step reasoning, we bridge the gap between fast generation and reliable logic." — Source: [Training Verifiers (2021)]
On Long Context as Memory: "System 2 reasoning requires a functional working memory; the ability to handle long contexts allows models to maintain coherent reasoning over extended logical chains." — Source: [OpenAI Forum Talk 2026]
On Tool Usage: "A reasoning model should be able to recognize its own limitations and use external tools, like writing a Python script to execute a calculation, before returning an answer." — Source: [Unsupervised Learning with Jacob Effron]
On Disentangling Processes: "Future architectures may fully disentangle the fast, intuitive text generation from the slow, deliberative reasoning processes." — Source: [Transformer vs. Post-Transformer Debate]
On the Evolution of Prompts: "The need for complex prompt engineering is a symptom of System 1 limitations; as System 2 capabilities improve, models will do that structuring internally." — Source: [AI Podcast with Lukasz Kaiser]
On Safety and Interpretability: "Models that can reason about their own actions are inherently more interpretable and potentially safer than black-box pattern matchers." — Source: [OpenAI Forum Talk 2026]
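
The idea under "On Tool Usage" can be sketched as a small calculator tool that a reasoning system might call instead of guessing at arithmetic. This is a hypothetical illustration, not any product's actual tool interface; the name `calculator_tool` and the choice to support only basic arithmetic are assumptions. It uses Python's standard `ast` module so that only numeric expressions are evaluated.

```python
import ast
import operator

# Only these binary operators are allowed (an assumption for this sketch).
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def calculator_tool(expression):
    """Evaluate a basic arithmetic expression without using eval()."""

    def evaluate(node):
        if isinstance(node, ast.Constant):
            value = node.value
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                return value
        elif isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            left = evaluate(node.left)
            right = evaluate(node.right)
            return _BINARY_OPS[type(node.op)](left, right)
        elif isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -evaluate(node.operand)
        # Anything else (names, calls, attributes) is rejected.
        raise ValueError("unsupported expression")

    return evaluate(ast.parse(expression, mode="eval").body)


# Exact integer arithmetic that is easy to get wrong token by token.
product = calculator_tool("123456789 * 987654321")
```

The design choice here is to reject everything except numbers and a short list of operators, so the tool returns an exact result or fails loudly rather than executing arbitrary code.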

Part 4: Verifiers and Process-Based Supervision

On Process vs. Outcome: "Rewarding a model for the correct final answer is often insufficient; we must use process-based supervision to reward correct logical steps." — Source: [Training Verifiers (2021)]
On the Role of Verifiers: "A verifier acts as a critic, evaluating the generator's proposed solutions and rejecting those that contain logical flaws." — Source: [Training Verifiers (2021)]
On Reinforcement Learning: "Reinforcement learning is crucial for training the thinking process, particularly in domains where correctness can be objectively verified." — Source: [Unsupervised Learning with Jacob Effron]
On Data Efficiency: "Using verifiers and process-based supervision allows models to learn more efficiently from less data by extracting more signal from each training example." — Source: [OpenAI Research Blog]
On Test-Time Scaling: "The most exciting frontier is test-time compute: scaling the inference-time resources to allow a verifier to search through multiple reasoning paths." — Source: [The MAD Podcast with Matt Turck]
On Solving Math Word Problems: "Math word problems require precise multi-step logic, making them the perfect benchmark for proving the efficacy of verifier networks." — Source: [Training Verifiers (2021)]
On Self-Correction: "A robust AI must be able to realize when it has made a mistake halfway through a problem and backtrack, which requires strong internal supervision." — Source: [AI Podcast with Lukasz Kaiser]
On Hallucination Reduction: "Verifiers inherently reduce hallucinations by forcing the model to justify its claims logically before presenting them to the user." — Source: [This Is World interview with Lukasz Kaiser]
On Aligning AI Behavior: "Process-based supervision is an alignment technique that ensures the model's reasoning structurally aligns with human logic." — Source: [OpenAI Forum Talk 2026]
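
The distinction under "On Process vs. Outcome" and the selection role described under "On the Role of Verifiers" can be sketched with a toy example. In practice verifiers are learned models; here a hand-written arithmetic check stands in for one, and every name (`step_is_valid`, `outcome_reward`, `process_reward`, `best_of_n`) and the step format `(a, op, b, claimed)` are assumptions made for illustration.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def step_is_valid(step):
    """Check one reasoning step of the form (a, op, b, claimed_result)."""
    a, op, b, claimed = step
    return OPS[op](a, b) == claimed


def outcome_reward(solution, target):
    """Reward only the final answer, ignoring how it was reached."""
    return 1.0 if solution[-1][3] == target else 0.0


def process_reward(solution):
    """Reward the fraction of steps that are individually correct."""
    return sum(step_is_valid(step) for step in solution) / len(solution)


def best_of_n(candidates, score):
    """Pick the candidate solution that the scoring function ranks highest."""
    return max(candidates, key=score)


# Toy problem (assumed): 3 boxes of 4 apples, 2 eaten, how many remain?
target = 10
sound = [(3, "*", 4, 12), (12, "-", 2, 10)]
lucky = [(3, "+", 4, 8), (8, "+", 2, 10)]  # first step is wrong: 3 + 4 != 8

chosen = best_of_n([lucky, sound], process_reward)
```

Both candidates end at 10, so outcome reward cannot tell them apart, while the step-level score separates the sound derivation from the lucky one. Note that this stand-in checks arithmetic only; it cannot tell whether a step uses the right operation for the problem, which is the kind of judgment a trained verifier is meant to supply.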

Part 5: Logic, Mathematics, and Complexity

On Algorithmic Foundations: "Deep learning is fundamentally an extension of algorithmic model theory, requiring a rigorous mathematical approach to computational complexity." — Source: [VvL Logic at Large Lecture 2023]
On Automata Theory Parallels: "The way recurrent networks and transformers process sequences has deep mathematical parallels to formal automata theory." — Source: [RWTH Aachen Academic Publications]
On Bridging Logic and Neural Networks: "Early work in logic and combinatorial games provided the theoretical foundation for understanding how neural networks can be forced to adhere to strict rules." — Source: [VvL Logic at Large Lecture 2023]
On Objective Truth in AI: "Mathematics is the ideal testing ground for intelligence because it provides an objective standard of truth that is independent of human cultural bias." — Source: [Unsupervised Learning with Jacob Effron]
On the Limitations of Statistics: "Purely statistical pattern matching breaks down when faced with complex logic puzzles that require strict adherence to explicit rules." — Source: [36Kr Interview 2025]
On the Search for Intelligent Programs: "The ultimate goal of machine learning is to build systems capable of discovering algorithms and programs that exhibit intelligent behavior." — Source: [ML in PL Conference 2021]
On Mathematical Intuition: "We are trying to teach models the mathematical intuition required to know which rule to apply when navigating an abstract search space." — Source: [The MAD Podcast with Matt Turck]
On Formal Verification: "The future of reliable AI may require integrating deep learning with formal logic verification systems to guarantee strict safety properties." — Source: [VvL Logic at Large Lecture 2023]
On Computational Complexity: "The challenge of designing new architectures is often a battle against computational complexity; we need models that are expressive yet tractable." — Source: [Pi School Fireside Chat 2021]

Part 6: Tooling and Deep Learning Ecosystems

On the Philosophy Behind Tensor2Tensor: "Tensor2Tensor was created to make deep learning research more accessible by providing a library of standardized models and datasets." — Source: [Tensor2Tensor GitHub]
On Lowering Barriers to Entry: "The goal of good tooling is to abstract away the boilerplate so researchers can focus entirely on architectural innovation." — Source: [DeepLearning.AI NLP Specialization]
On the Evolution to Trax: "We built Trax as the successor to Tensor2Tensor, utilizing JAX to provide a faster, more efficient environment free from legacy constraints." — Source: [Trax GitHub]
On the Importance of Open Source: "Open-source frameworks are the engine of AI progress; they allow the global research community to iterate on ideas rapidly." — Source: [AI Frontiers 2017]
On Reproducibility: "A major motivation behind building standardized tooling was ensuring that state-of-the-art results could be easily reproduced by anyone with sufficient compute." — Source: [Tensor2Tensor GitHub]
On Hardware Abstraction: "Effective deep learning libraries must abstract the hardware layer so that code runs seamlessly across CPUs, GPUs, and TPUs without modification." — Source: [Trax GitHub]
On Iteration Speed: "Research productivity is directly tied to compilation and execution speed; tools that reduce iteration time naturally lead to faster breakthroughs." — Source: [Unsupervised Learning with Jacob Effron]
On Educational Value: "Frameworks should be designed with clear, readable code to serve as educational tools for students, not just as production engines." — Source: [DeepLearning.AI NLP Specialization]
On the Shift in Development Tools: "The tools we use shape the models we build; moving to JAX allowed us to explore mathematical transformations that were previously cumbersome to implement." — Source: [Trax GitHub]

Part 7: Infrastructure, Scaling, and Bottlenecks

On the Compute Bottleneck: "We are constantly constrained by hardware; the theoretical limits of our architectures often far exceed our physical ability to train them." — Source: [AI Podcast with Lukasz Kaiser]
On Scaling Laws: "Scaling laws have proven remarkably robust, but scaling alone becomes inefficient without parallel innovations in architecture and data quality." — Source: [Transformer vs. Post-Transformer Debate]
On Memory Constraints: "The memory requirements for processing long contexts are a massive bottleneck, necessitating constant innovation in efficient attention mechanisms." — Source: [Pi School Fireside Chat 2021]
On Energy Efficiency: "As models grow, the energy required to train and run them becomes a significant concern; we must prioritize computational efficiency alongside raw performance." — Source: [This Is World interview with Lukasz Kaiser]
On Inference Costs: "The shift toward System 2 reasoning means we are trading training compute for inference compute, completely changing the economics of deployment." — Source: [The MAD Podcast with Matt Turck]
On Hardware-Software Co-design: "The best performance gains occur when the model architecture is co-designed specifically around the strengths and limitations of the underlying silicon." — Source: [AI Frontiers 2017]
On Data Exhaustion: "We are rapidly approaching the limit of high-quality human text data, making synthetic data generation and reinforcement learning critical for future scaling." — Source: [OpenAI Forum Talk 2026]
On the Limits of Quadratic Attention: "The standard complexity of self-attention forces us to use approximations and sliding windows for truly massive document processing." — Source: [ML in PL Conference 2021]
On Distributed Training: "Training frontier models requires orchestrating tens of thousands of GPUs perfectly in sync, which is as much a networking challenge as an algorithmic one." — Source: [AI Podcast with Lukasz Kaiser]
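
The sliding windows mentioned under "On the Limits of Quadratic Attention" can be illustrated by counting how many token pairs are scored. This is a minimal sketch assuming a causal model in which each token attends to itself and at most `window - 1` preceding tokens; the function names are hypothetical and no particular library's attention variant is implied.

```python
def causal_pairs(n):
    """Token pairs scored by full causal attention over n tokens."""
    return n * (n + 1) // 2


def sliding_window_pairs(n, window):
    """Token pairs scored when each token sees at most `window` positions."""
    return sum(min(i + 1, window) for i in range(n))


def sliding_window_mask(n, window):
    """Boolean mask: entry [i][j] is True when token i may attend to token j."""
    return [
        [(j <= i) and (i - j < window) for j in range(n)]
        for i in range(n)
    ]


mask = sliding_window_mask(5, 2)
allowed = sum(cell for row in mask for cell in row)
```

With a fixed window, the number of scored pairs grows roughly linearly in sequence length rather than quadratically, at the price of each token no longer seeing the full past context directly.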

Part 8: The Future of AI Progress

On the Improbability of an AI Winter: "I don't think there is any AI winter coming; if anything, we may see a very sharp improvement in the next year or two that is almost scary." — Source: [Reddit Discussions 2025]
On Scientific Discovery: "The most profound impact of advanced reasoning models will not be writing emails, but accelerating scientific discovery in fields like biology and materials science." — Source: [This Is World interview with Lukasz Kaiser]
On Artificial Agency: "The next frontier is AI agency: models that can autonomously plan, navigate environments, and execute complex workflows over days or weeks." — Source: [AI Podcast with Lukasz Kaiser]
On Understanding vs. Scaling: "While scaling works predictably, understanding the fundamental mechanics of why AI works may matter more for achieving true artificial general intelligence." — Source: [36Kr Interview 2025]
On the Evolution of Coding: "Reasoning models have fundamentally altered research productivity; programming is becoming more about guiding AI intent than writing syntax." — Source: [Unsupervised Learning with Jacob Effron]
On Human-AI Collaboration: "The future is collaborative; human experts will act as conductors, steering teams of specialized reasoning models to solve complex problems." — Source: [The MAD Podcast with Matt Turck]
On Embodied AI: "For AI to truly understand the world, it may eventually need embodiment to interact with physics and physical constraints directly." — Source: [AI Podcast with Lukasz Kaiser]
On Continuous Learning: "A major unsolved challenge is creating models that can learn continuously from user interaction without needing a massive, expensive retraining run." — Source: [OpenAI Forum Talk 2026]
On the Definition of Intelligence: "As we build systems capable of deeper reasoning, we are forced to continually update our own definition of what constitutes genuine intelligence." — Source: [Transformer vs. Post-Transformer Debate]
