
Lessons from Llion Jones
Llion Jones co-authored the 2017 paper "Attention Is All You Need," introducing the Transformer architecture used in today's large language models. After a decade at Google, he co-founded Sakana AI in Tokyo to research evolutionary alternatives to the systems he helped popularize. This collection outlines his background, his criticisms of the AI industry, and his work on dynamic computation.
Part 1: The Transformer and Its Origins
- On Naming the Architecture: "Jones coined the term 'Transformer' for the neural network architecture that redefined modern deep learning." — Source: [VentureBeat]
- On the Paper's Title: "The title 'Attention Is All You Need' was directly inspired by the Beatles song 'All You Need Is Love.'" — Source: [Hugging Face]
- On Ablation Studies: "The bold title came to the team during ablation studies when they realized the attention mechanism was the only component truly required for the model to work." — Source: [Hugging Face]
- On Equal Contribution: "All eight authors of the original 2017 paper contributed equally to the research, a rare dynamic in high-profile computer science publications." — Source: [Wikipedia]
- On the Core Innovation: "The Transformer broke ground by replacing recurrent and convolutional networks entirely with self-attention mechanisms to enable massive parallelization." — Source: [arXiv]
- On Foundational Technology: "The architecture he co-invented remains the technical bedrock for generative systems including ChatGPT, Claude, and Gemini." — Source: [Parseur]
- On Historical Impact: "The shift to self-attention drastically reduced training times and altered the trajectory of deep learning research." — Source: [Grokipedia]
- On Inventor's Remorse: "Despite its massive success, Jones has openly expressed a desire to move beyond the architecture he helped establish." — Source: [VentureBeat]
- On His Tenure at Google: "His work on the Transformer occurred during a decade-long stint at Google Research in Mountain View." — Source: [DLD Conference]
- On Early Research Culture: "He fondly remembers the environment before the Transformer breakthrough as one driven by curiosity rather than corporate deliverables." — Source: [36Kr]
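The self-attention idea named in the list above can be sketched in a few lines. This is a minimal, illustrative version of scaled dot-product attention, written with the standard library only. The function names and the toy `tokens` input are assumptions made for this example; a real Transformer derives queries, keys, and values from learned projections and uses multiple heads, which this sketch omits.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        weights = softmax(scores)
        # Output is the attention-weighted sum of the value vectors.
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs

# Toy input (hypothetical): three 2-dimensional token vectors used as Q, K, and V.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = self_attention(tokens, tokens, tokens)
```

Each pass through the outer loop depends only on its own query, not on the result for any other position. That independence is what lets attention be computed for all positions at once, where a recurrent network has to step through the sequence in order.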
Part 2: The Transformer Rut and Success Capture
- On the Local Minimum: "The AI industry is currently trapped in a local minimum because it relies entirely on a single architecture." — Source: [Machine Learning Street Talk]
- On Success Capture: "The overwhelming commercial success of the Transformer has disincentivized researchers from pursuing bolder, alternative breakthroughs." — Source: [Machine Learning Street Talk]
- On Incremental Tweaks: "The field is dominated by researchers making minor optimizations to existing models instead of exploring foundational changes." — Source: [Reddit]
- On Investor Pressure: "The obsession with optimizing known architectures is heavily driven by financial pressures in a crowded corporate landscape." — Source: [VentureBeat]
- On the Spiral Analogy: "When a neural network is asked to understand a spiral, it fakes comprehension by drawing tiny straight lines rather than grasping the core concept." — Source: [Reddit]
- On Faking Intelligence: "Current models excel at mimicking intelligent output without possessing an internal understanding of the underlying logic." — Source: [Rescript]
- On Exploitation vs. Exploration: "The industry needs to turn up the explore dial instead of endlessly exploiting well-understood systems." — Source: [VentureBeat]
- On the Arms Race: "Corporate competition in generative AI is actively stifling true scientific innovation by rewarding only short-term gains." — Source: [TED]
- On Breaking the Mold: "Jones has significantly reduced his own work on Transformers to force himself into open-ended research paths." — Source: [Machine Learning Street Talk]
- On the Danger of Homogeneity: "Relying on a single architectural paradigm makes the entire field fragile and limits the ultimate potential of the technology." — Source: [Algustionesa]
Part 3: Continuous Thought Machines
- On Internal Time: "Unlike static Transformers, Continuous Thought Machines introduce an internal time dimension to the computation process." — Source: [AI Papers Academy]
- On Dynamic Processing: "CTMs allow models to think longer about difficult problems and stop early when handling simpler tasks." — Source: [Medium]
- On Neuron-Level Memory: "CTM neurons use unique weight parameters to process a history of incoming signals, unlike standard stateless artificial neurons." — Source: [Hugging Face]
- On Neural Dynamics: "The architecture leverages the timing and synchronization between neurons as a core part of its logic, mirroring biological brains." — Source: [arXiv]
- On Sequential Reasoning: "The discrete ticks of processing time in a CTM enable complex and adaptive sequential reasoning." — Source: [Hugging Face]
- On Biological Plausibility: "CTM research pushes back against the trend of abstracting away temporal dynamics strictly for software efficiency." — Source: [Sakana AI]
- On Bridging the Gap: "The goal of CTM is to marry biological realism with the extreme scalability required for modern deep learning." — Source: [Sakana AI]
- On Escaping the One-Shot Paradigm: "CTM directly addresses the one-shot limitation of current models that pass input through a fixed number of layers regardless of difficulty." — Source: [Medium]
- On True Thinking: "The architecture is an attempt to move toward models that possess genuine thinking processes rather than static pattern matching." — Source: [Rescript]
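Two of the ideas in the list above, internal ticks with early stopping and neurons that weigh a history of incoming signals, can be illustrated with a toy loop. This is not the published CTM and makes no claim to match Sakana AI's implementation. The `think` function, the confidence rule, the weights, and the threshold are all assumptions chosen for illustration, and neuron synchronization is left out entirely.

```python
import math
from collections import deque

def think(signal, neuron_weights, threshold=0.95, max_ticks=100):
    """Run internal ticks until a toy confidence score passes `threshold`."""
    # One short history buffer per neuron (a stand-in for neuron-level memory).
    histories = [deque([0.0] * len(w), maxlen=len(w)) for w in neuron_weights]
    evidence = 0.0
    confidence = 0.5
    for tick in range(1, max_ticks + 1):
        activations = []
        for hist, w in zip(histories, neuron_weights):
            hist.append(signal)  # the newest signal pushes out the oldest
            # Each neuron applies its own private weights to its own history.
            pre_activation = sum(wi * hi for wi, hi in zip(w, hist))
            activations.append(math.tanh(pre_activation))
        evidence += sum(activations) / len(activations)
        # Hypothetical halting rule: a sigmoid of the accumulated evidence.
        confidence = 1.0 / (1.0 + math.exp(-abs(evidence)))
        if confidence >= threshold:
            return tick, confidence  # stop early once the model is "sure"
    return max_ticks, confidence

# Hypothetical private weights for two neurons, each over a 3-step history.
weights = [[0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
easy_ticks, _ = think(2.0, weights)  # strong signal, an "easy" input
hard_ticks, _ = think(0.2, weights)  # weak signal, a "hard" input
```

With these assumed settings the strong signal crosses the threshold in fewer ticks than the weak one, which is the behavior the bullets describe: compute is spent in proportion to difficulty instead of being fixed by the number of layers.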
Part 4: Sakana AI and Nature-Inspired Architecture
- On the Meaning of Sakana: "The company name is the Japanese word for fish, a direct reference to the lab's focus on collective behavior." — Source: [Sakana AI]
- On Biomimicry: "The lab's philosophy is inspired by how schools of fish form coherent, intelligent entities through simple individual rules." — Source: [Sakana AI]
- On Swarm Intelligence: "They are applying concepts of swarm intelligence to develop AI that exhibits emergent, adaptive behaviors." — Source: [AI Tinkerers]
- On Evolutionary Computation: "Rather than training massive models from scratch, Sakana AI explores techniques where AI models effectively evolve." — Source: [Lux Capital]
- On Natural Selection in AI: "Their process involves creating multiple versions of a system, testing them, and letting only the most effective configurations survive." — Source: [AWS]
- On Moving Away from Monoliths: "The lab is deliberately avoiding the industry trend of building increasingly massive, monolithic neural networks." — Source: [Japan Times]
- On Adaptive Systems: "By leaning into nature-inspired techniques, they aim to create models that are fundamentally more adaptive to changing environments." — Source: [Medium]
- On Resource Efficiency: "Evolutionary and swarm methods are seen as a pathway to systems that do not require the massive computational power of traditional LLMs." — Source: [IT Pro]
- On Co-Founding: "He established the lab in 2023 alongside David Ha, a fellow former Google researcher, and Ren Ito." — Source: [Startup Intros]
- On Location: "Basing the frontier AI lab in Tokyo was a deliberate choice to step outside the Silicon Valley echo chamber." — Source: [Venture Cafe Global]
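The create, test, and select cycle described in the list above can be sketched as a generic evolutionary loop. This is a textbook-style illustration, not Sakana AI's method. The `fitness` function, the population sizes, and the mutation scale are hypothetical placeholders, and a real system would score candidate models on actual tasks.

```python
import random

def fitness(candidate):
    """Hypothetical score: higher is better, best when every value is 0.5."""
    return -sum((x - 0.5) ** 2 for x in candidate)

def evolve(generations=50, population_size=20, survivors=5, dims=4, seed=0):
    rng = random.Random(seed)
    population = [
        [rng.random() for _ in range(dims)] for _ in range(population_size)
    ]
    best_per_generation = []
    for _ in range(generations):
        # Test every candidate and keep only the most effective ones.
        population.sort(key=fitness, reverse=True)
        parents = population[:survivors]
        best_per_generation.append(fitness(parents[0]))
        # Refill the population with mutated copies of the survivors.
        children = [
            [x + rng.gauss(0, 0.05) for x in rng.choice(parents)]
            for _ in range(population_size - survivors)
        ]
        population = parents + children
    return max(population, key=fitness), best_per_generation

best, history = evolve()
```

Because the survivors are carried over unchanged into the next generation, the best score recorded in `history` can never get worse from one generation to the next. No gradients are involved: improvement comes only from variation and selection.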
Part 5: Reflections on Corporate AI and Google
- On Leaving Google: "He left Google after more than a decade to pursue research that he felt was constrained inside a massive corporate structure." — Source: [Birmingham University]
- On Corporate Inertia: "Large tech companies often struggle to innovate fundamentally because their infrastructure is deeply tied to existing, profitable paradigms." — Source: [Machine Learning Street Talk]
- On Early Google Brain: "He spent his foundational years at Google Brain working on large-scale machine learning and natural language processing." — Source: [North Wales Chronicle]
- On the Natural Questions Benchmark: "During his Google tenure, he made significant contributions to question-answering systems and the Natural Questions benchmark." — Source: [Medium]
- On Research Freedom: "True breakthroughs require environments that prioritize natural inspiration and curiosity over predefined deliverables." — Source: [36Kr]
- On the Silicon Valley Mindset: "The current culture in major AI hubs often confuses scale with actual scientific progress." — Source: [Machine Learning Street Talk]
- On Stifled Innovation: "The race to deploy slightly better chatbots is distracting the brightest minds from solving the deeper mysteries of intelligence." — Source: [TED]
- On Reclaiming Independence: "Co-founding a smaller, independent lab was necessary to regain the freedom to fail at radical new ideas." — Source: [Birmingham University]
- On the Value of Small Teams: "Small, focused groups of researchers often move faster and think more creatively than sprawling corporate divisions." — Source: [Sakana AI]
- On Looking Back: "While Google provided incredible resources, the most exciting work happens at the absolute edge of the unknown." — Source: [Substack]
Part 6: Rethinking Scaling and Efficiency
- On the Brute Force Approach: "Simply adding more compute and data to the same architecture is a brute-force method with diminishing returns." — Source: [IT Pro]
- On Sustainable AI: "The energy consumption required to train massive LLMs is pushing the industry to find more sustainable architectures." — Source: [AWS]
- On Biological Efficiency: "The human brain operates on roughly 20 watts of power, a stark contrast to the gigawatts required by modern AI data centers." — Source: [arXiv]
- On Emergent Complexity: "Complex intelligence should emerge from the interaction of simple parts, not from the sheer size of a single static model." — Source: [AI Tinkerers]
- On Hardware Limitations: "The architecture of the future must be designed with an awareness of physical and hardware constraints, rather than fighting them." — Source: [Medium]
- On Optimization: "There is a vast, unexplored design space for models that are small, fast, and highly optimized for specific tasks." — Source: [Lux Capital]
- On Over-parameterization: "Many current models are vastly over-parameterized, storing redundant information that could be handled more efficiently." — Source: [Rescript]
- On Evolutionary Pruning: "Evolutionary algorithms can naturally prune away unnecessary network pathways, leading to leaner systems." — Source: [AWS]
Part 7: The Future of Deep Learning
- On the Post-Transformer Era: "The next major leap in artificial intelligence will not look like a Transformer; it will likely incorporate concepts we currently consider fringe." — Source: [Lenny's Podcast]
- On Continuous Learning: "Future models must be capable of continuous, lifelong learning rather than being frozen after a single massive training run." — Source: [Hugging Face]
- On Open-Endedness: "The goal is to create systems that can invent their own novel solutions to problems rather than just interpolating training data." — Source: [Machine Learning Street Talk]
- On Synthesizing Disciplines: "Breakthroughs will come from combining deep learning with evolutionary biology, complex systems theory, and neuroscience." — Source: [Sakana AI]
- On the Unknown: "Researchers must embrace the uncomfortable reality that we do not yet know the final shape of artificial general intelligence." — Source: [Substack]
- On Patience: "True scientific paradigm shifts take time and cannot be artificially accelerated by merely increasing funding." — Source: [Machine Learning Street Talk]
- On Redefining Intelligence: "We need to stop measuring machine intelligence solely by benchmark scores on standardized human tests." — Source: [Metacast]
- On Structural Flexibility: "Future architectures will likely dynamically change their own structure depending on the task at hand." — Source: [AI Papers Academy]
- On the Journey Ahead: "The invention of the Transformer was just one step in a much longer, multi-generational journey toward understanding intelligence." — Source: [Machine Learning Street Talk]
Part 8: Early Life and Academic Background
- On His Roots: "He grew up in the small Welsh village of Abergynolwyn, where the population was around 200 people." — Source: [Spark Daily]
- On Language: "He is a native Welsh speaker, an early experience that shaped his later interest in natural language processing." — Source: [DLD Conference]
- On Early Curiosity: "His childhood fascination with computers centered on understanding exactly how software functioned, not just on playing games." — Source: [Spark Daily]
- On A-Levels: "Studying computing alongside mathematics, physics, and chemistry at Coleg Meirion-Dwyfor provided his foundational analytical skills." — Source: [North Wales Chronicle]
- On Discovering Programming: "He has described his first exposure to coding before university as an eye-opener that dictated his career path." — Source: [North Wales Chronicle]
- On Higher Education: "He attended the University of Birmingham, earning a BSc in Artificial Intelligence and Computer Science in 2008." — Source: [Old Joe]
- On Academic Foundations: "He frequently credits the University of Birmingham's computer science faculty for equipping him with the tools for his future research." — Source: [Birmingham University]
- On Further Studies: "He stayed at Birmingham to complete an MSc in Advanced Computer Science in 2009 before moving into the industry." — Source: [Built in Birmingham]
- On Recognition: "In June 2026, it was announced he would receive an honorary doctorate from Bangor University for his contributions to artificial intelligence." — Source: [Bangor University]