Lessons from Richard Sutton
Richard Sutton helped formalize modern reinforcement learning and co-wrote, with Andrew Barto, the standard textbook on how machines learn through trial and error. In his 2019 essay "The Bitter Lesson," he argued that AI systems improve primarily through massive computation rather than human-engineered knowledge. This profile organizes his research papers, writings, and interviews to explain how autonomous agents evaluate actions and learn to map situations to actions that maximize reward.
Part 1: The Bitter Lesson and Computation
- On 70 Years of AI: "General methods that scale with computation are ultimately the most effective, by a large margin." — Source: [The Bitter Lesson]
- On Human Psychology: "The lesson is bitter because we desperately want our human-engineered domain knowledge to matter." — Source: [The Bitter Lesson]
- On Moore's Law: "We must build methods that scale smoothly with computation, anticipating the massive increases that hardware guarantees." — Source: [The Bitter Lesson]
- On Hand-Coding Constraints: "Attempting to build in how we think we think does not work in the long run." — Source: [The Bitter Lesson]
- On Chess Algorithms: "Deep Blue beat Kasparov through massive search methods, completely bypassing human-designed chess strategies." — Source: [The Bitter Lesson]
- On Computer Vision: "Modern computer vision abandoned hand-crafted edges and filters in favor of networks that extract features directly from data." — Source: [The Bitter Lesson]
- On Speech Recognition: "Statistical learning systems completely replaced earlier efforts built around human phonetics." — Source: [The Bitter Lesson]
- On the Mind's Complexity: "Our minds are far more complex than our introspective descriptions of them." — Source: [The Bitter Lesson]
- On Meta-Methods: "Researchers should focus on algorithms that discover and learn on their own, rather than trying to discover things manually and program them in." — Source: [The Bitter Lesson]
- On Discovery Systems: "We want agents that can discover like we can, instead of agents that simply contain what we have already discovered." — Source: [The Bitter Lesson]
Part 2: Reinforcement Learning Fundamentals
- On Definition: "Reinforcement learning is learning what to do—how to map situations to actions—so as to maximize a numerical reward signal." — Source: [Reinforcement Learning: An Introduction]
- On Trial and Error: "A learning agent must discover which actions yield the most reward by actively trying them." — Source: [Reinforcement Learning: An Introduction]
- On Delayed Reward: "Actions affect immediate rewards, but they also determine the next situation and influence all subsequent rewards." — Source: [Reinforcement Learning: An Introduction]
- On Exploration vs. Exploitation: "An agent must exploit what it already knows to gain reward, while continually exploring to make better selections in the future." — Source: [Reinforcement Learning: An Introduction]
- On Markov Decision Processes: "The MDP framework abstracts goal-directed interaction into three basic signals: state, action, and reward." — Source: [Reinforcement Learning: An Introduction]
- On Temporal-Difference Learning: "TD learning combines Monte Carlo principles with dynamic programming to learn directly from raw experience without a complete environmental model." — Source: [Reinforcement Learning: An Introduction]
- On Value Functions: "Value functions organize our knowledge regarding how good it is for an agent to be in any given state." — Source: [Reinforcement Learning: An Introduction]
- On Agent Interfaces: "The boundary between an agent and its environment is defined by the limits of the agent's absolute control." — Source: [Reinforcement Learning: An Introduction]
- On Policy Formation: "A policy defines exactly how a learning agent behaves at a specific point in time." — Source: [Reinforcement Learning: An Introduction]
- On Reward Signals: "If an action selected by the policy is followed by low reward, the policy may be changed to select a different action in that situation in the future." — Source: [Reinforcement Learning: An Introduction]
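The ideas in this part (states, actions, rewards, a policy, a value function, delayed reward, and the exploration/exploitation trade-off) can be seen together in a small sketch. The example below uses tabular Q-learning, one temporal-difference control method covered in the textbook, with epsilon-greedy action selection. The five-state chain environment, the constants, and the function names (`step`, `choose_action`, `train`, `greedy_policy`) are assumptions made for illustration and do not come from the quoted sources.

```python
import random

# Hypothetical toy MDP: a chain of states 0..4, where state 4 is terminal.
N_STATES = 5
ACTIONS = (0, 1)      # 0 = move left, 1 = move right
GAMMA = 0.9           # discount factor (assumed value)
ALPHA = 0.5           # step size (assumed value)
EPSILON = 0.1         # exploration probability (assumed value)


def step(state, action):
    """Environment dynamics: return (next_state, reward, done)."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0   # reward arrives only at the goal (delayed)
    return next_state, reward, done


def choose_action(q, state, rng):
    """Epsilon-greedy: explore with probability EPSILON, otherwise exploit."""
    if rng.random() < EPSILON:
        return rng.choice(ACTIONS)
    best = max(q[state])
    # Break ties randomly so the untrained agent does not favor one action.
    return rng.choice([a for a in ACTIONS if q[state][a] == best])


def train(episodes=500, seed=0):
    """Learn action values from raw experience, without a model of `step`."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = choose_action(q, state, rng)
            next_state, reward, done = step(state, action)
            # TD target: immediate reward plus discounted estimate of what follows.
            target = reward if done else reward + GAMMA * max(q[next_state])
            q[state][action] += ALPHA * (target - q[state][action])
            state = next_state
    return q


def greedy_policy(q):
    """Map each non-terminal state to its highest-valued action."""
    return [max(ACTIONS, key=lambda a: q[s][a]) for s in range(N_STATES - 1)]


if __name__ == "__main__":
    print(greedy_policy(train()))
```

The agent is never told that moving right is correct. It only observes rewards, and the learned values propagate backward from the goal so that earlier states come to prefer the action that leads toward it.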
Part 3: Experience and Interaction
- On True Intelligence: "If we want real intelligence, systems need to learn by doing and experiencing trial and error in the world." — Source: [IBM Interview]
- On the Nature of Experience: "Experience is the raw data received through interaction with the world; this is how both people and animals actually learn." — Source: [NUS Lecture]
- On Grounding AI: "Intelligence requires grounding in a continuous, active stream of sensory-motor data from an environment." — Source: [The Alberta Plan]
- On Continual Learning: "An agent is never finished learning; it must endlessly adapt its control strategies as the external world changes." — Source: [The Alberta Plan]
- On Autonomous Agents: "True agents define their own sub-goals based on ongoing, direct interaction with their surroundings." — Source: [The Alberta Plan]
- On Predictive Knowledge: "Knowledge is accurately represented as specific predictions about future sensorimotor experiences, rather than static facts." — Source: [The Alberta Plan]
- On World Models: "A functional world model consists of verifiable predictions about what will happen when an agent takes particular actions." — Source: [The Alberta Plan]
- On Embodiment Elements: "Physical embodiment is strictly optional, but maintaining a tight interactive loop with a complex environment is mandatory." — Source: [The Alberta Plan]
- On Real-Time Processing: "Intelligence algorithms must operate in real-time, handling learning and acting simultaneously." — Source: [The Alberta Plan]
- On the Ultimate Goal: "The primary objective is to build artificial entities capable of learning to act skillfully to achieve goals." — Source: [The Alberta Plan]
Part 4: Large Language Models vs. General Intelligence
- On LLM Constraints: "Large language models lack the fundamental capacity to learn continuously from experience in a real-world setting." — Source: [Dwarkesh Podcast]
- On Next-Token Limits: "Predicting text is a distinctly different problem than predicting the actual physical consequences of real actions." — Source: [Dwarkesh Podcast]
- On the Missing Ingredient: "LLMs are completely missing the interactive loop of action, consequence, and reward that characterizes genuine learning." — Source: [Dwarkesh Podcast]
- On Human Data: "Training strictly on human-generated text caps a model at human-level understanding; self-play and environmental interaction bypass this ceiling." — Source: [Dwarkesh Podcast]
- On Dead Ends: "Equating artificial general intelligence completely with scaling up static language models will likely lead down a dead end." — Source: [Dwarkesh Podcast]
- On Knowing vs. Doing: "Language models capture everything humans say about the world, while failing to capture how to optimally act within it." — Source: [Dwarkesh Podcast]
- On Alignment Approaches: "True alignment might be easier to achieve if agents learn shared values through grounded interaction instead of massive text ingestion." — Source: [Dwarkesh Podcast]
- On Static Weights: "A functioning mind should never be frozen after a training run; it must constantly update its understanding through new experiences." — Source: [Dwarkesh Podcast]
- On the Purpose of Language: "Language serves as a tool for interacting with others to achieve practical goals, rather than an isolated symbol game." — Source: [Dwarkesh Podcast]
Part 5: Search and Meta-Methods
- On Core AI Mechanics: "Learning and search are the two specific methods that have consistently scaled with computation over decades." — Source: [The Bitter Lesson]
- On Planning Functionality: "Planning operates essentially as a process of searching through possible future states using an internal environmental model." — Source: [Reinforcement Learning: An Introduction]
- On State Discovery: "Agents should rely on search to discover useful abstract states rather than depending on human programmers to specify them." — Source: [The Alberta Plan]
- On Temporal Abstractions: "We need meta-methods that can automatically construct and map out new temporal abstractions, or options, for complex planning." — Source: [The Alberta Plan]
- On Unforeseen Solutions: "Search holds immense power because it explores combinations of actions that human experts never consider." — Source: [The Bitter Lesson]
- On the AlphaGo Breakthrough: "AlphaGo succeeded specifically because it combined deep learning with sophisticated search algorithms like Monte Carlo Tree Search." — Source: [The Bitter Lesson]
- On Model-Based RL: "Search allows an agent to mentally simulate future scenarios extensively before taking a concrete action in the real world." — Source: [Reinforcement Learning: An Introduction]
- On Processing Efficiency: "Search algorithms must run with high efficiency to explore vast state spaces rapidly during real-time interaction." — Source: [Reinforcement Learning: An Introduction]
- On Domain Generality: "Search and learning apply generally across domains; they solve Go just as effectively as they solve robotics." — Source: [The Bitter Lesson]
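The description of planning as search through possible future states using an internal model can be made concrete with a minimal sketch. The code below performs an exhaustive, depth-limited lookahead over a model's predictions and returns the best first action. The number-line world, the `model` and `search` functions, and the constants are hypothetical and chosen only to keep the example small; practical systems use far more selective search, such as the Monte Carlo Tree Search mentioned above.

```python
# Hypothetical world: integer positions on a line, with a goal at position 3.
GOAL = 3
ACTIONS = (-1, +1)


def model(state, action):
    """Internal model: predict (next_state, reward, done) for an action."""
    next_state = state + action
    done = next_state == GOAL
    return next_state, (1.0 if done else 0.0), done


def search(state, depth, gamma=0.9):
    """Return (best discounted return, best first action) within `depth` steps."""
    if depth == 0:
        return 0.0, None
    best_value, best_action = float("-inf"), None
    for action in ACTIONS:
        # Simulate the action with the model instead of acting in the world.
        next_state, reward, done = model(state, action)
        future = 0.0 if done else search(next_state, depth - 1, gamma)[0]
        value = reward + gamma * future
        if value > best_value:
            best_value, best_action = value, action
    return best_value, best_action


if __name__ == "__main__":
    value, action = search(state=0, depth=3)
    print(action, value)
```

With a depth of 3 the search finds the goal from position 0 and recommends moving right. With a depth of 2 the goal is out of reach, so every action looks equally worthless. The cost of this brute-force version grows exponentially with depth, which is why search efficiency matters.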
Part 6: Reward and Emergent Complexity
- On the Sufficiency of Reward: "Intelligence can be entirely understood as subserving the maximization of reward by an agent acting in its environment." — Source: [Reward is Enough]
- On Emergent Behaviors: "Complex capabilities like perception, language, and social intelligence emerge naturally from the simple drive to maximize reward." — Source: [Reward is Enough]
- On Avoiding Special Modules: "We do not need to build separate cognitive modules; a general agent will develop them autonomously if they increase reward." — Source: [Reward is Enough]
- On the Origin of Goals: "All complex goals can be formulated as the maximization of a single scalar reward signal over time." — Source: [Reward is Enough]
- On Knowledge Representation: "An agent's entire representation of reality should be shaped by whatever proves useful for maximizing its reward." — Source: [Reward is Enough]
- On Evolutionary Parallels: "Natural evolution gave animals complex reward systems, like hunger and pain, which drive highly sophisticated adaptive behaviors." — Source: [Reward is Enough]
- On Simplicity: "The reward hypothesis offers a radically simple and unifying framework to organize artificial intelligence research." — Source: [Reward is Enough]
- On Multi-Agent Dynamics: "In complex social settings, individual agents acting to maximize distinct personal rewards naturally develop cooperative or competitive intelligence." — Source: [Reward is Enough]
- On Defining Success: "Success in artificial systems is not measured by human-likeness, but strictly by the ability to maximize reward in complex environments." — Source: [Reward is Enough]
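"Maximizing reward over time" is usually made precise through the notion of a return. One common formalization, following the notation of the reinforcement learning textbook rather than any of the quotes above, defines the discounted return from time step t and asks the agent to find a policy that maximizes its expected value. The symbols below (reward R, discount rate gamma, policy pi) are the textbook's conventions and are included here only as a reference sketch.

```latex
% Discounted return: the cumulative future reward from time t,
% with discount rate 0 \le \gamma \le 1.
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots
    = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}

% Objective: choose the policy \pi that maximizes the expected return.
\pi^{*} = \arg\max_{\pi} \; \mathbb{E}_{\pi}\!\left[ G_t \right]
```

The return also satisfies the recursion G_t = R_{t+1} + gamma G_{t+1}, which is the relationship that temporal-difference methods exploit.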
Part 7: AI Timelines and the Future
- On Understanding the Mind: "Replicating the mind in machines stands as the greatest scientific challenge of our time, and it will eventually be achieved." — Source: [Amii Podcast]
- On Trajectory and Timelines: "While specific dates vary, we maintain a clear, principled path toward building agents that accurately understand the world." — Source: [Amii Podcast]
- On Human Significance: "Creating human-level agents will not diminish human importance; it will represent the crowning achievement of human ingenuity." — Source: [Amii Podcast]
- On Fear of Technology: "We should view advanced artificial systems as an opportunity for profound cooperation rather than an existential threat." — Source: [Lex Fridman Podcast]
- On the Next Paradigm: "The next leap in capability will come from scaling autonomous, experiential learning instead of scaling text datasets." — Source: [Lex Fridman Podcast]
- On Perspective: "Studying intelligence grants us deeper insight into the functional nature of our own minds and our place in the universe." — Source: [Lex Fridman Podcast]
- On Shared Directives: "We must ensure that the reward functions provided to agents carefully align with overall human flourishing." — Source: [Lex Fridman Podcast]
- On Ecosystem Diversity: "The future will likely consist of a diverse ecosystem of specialized and general agents, rather than a single monolithic system." — Source: [Lex Fridman Podcast]
- On the Current Horizon: "We are still operating in the early, foundational days of understanding true reinforcement learning at scale." — Source: [Lex Fridman Podcast]
Part 8: Research Philosophy and Methodology
- On Problem Solving: "Approximate the solution, not the problem; we should avoid making special cases out of the environment to make algorithms work." — Source: [Research Slogans]
- On Scientific Motivation: "Drive from the problem, letting the demands of the environment dictate the architecture rather than forcing an architecture onto a problem." — Source: [Research Slogans]
- On Agent Perspective: "Take the agent's point of view to understand the learning process directly from the perspective of the entity receiving the data." — Source: [Research Slogans]
- On Valid Measurement: "Don't ask the agent to achieve what it cannot measure; goals must tie directly to the sensory data the agent receives." — Source: [Research Slogans]
- On Incremental Progress: "Science relies on small, verifiable steps; we must rigorously test fundamental algorithms before attempting to scale them." — Source: [The Alberta Plan]
- On Useful Abstraction: "A functional theory of intelligence abstracts away hardware specifics and focuses entirely on the dynamics of learning." — Source: [The Alberta Plan]
- On Overlooking the Basics: "Modern research frequently ignores the fundamental mechanics of temporal-difference learning in the rush for immediate benchmarks." — Source: [The Alberta Plan]
- On Engineering Hubris: "Researchers must remain humble, accepting that our hand-crafted solutions will inevitably be replaced by learned solutions." — Source: [The Bitter Lesson]
- On the Long View: "The objective is to understand the fundamental principles of the mind, rather than solely publishing papers or winning benchmarks." — Source: [Research Slogans]