Lessons from John Schulman

John Schulman is a machine learning researcher who developed Proximal Policy Optimization (PPO), the reinforcement learning algorithm used to align modern large language models. He co-founded OpenAI and led the post-training teams that developed ChatGPT before moving to Anthropic and eventually becoming Chief Scientist at Thinking Machines Lab. This compilation outlines his technical insights on optimization stability, human feedback loops, and the mechanics of model hallucination.

Part 1: Early Influences and the Transition to AI

On seeking fundamental truths: "I am curious about understanding the universe, and physics seemed like the area to study for that." — Source: [Berkeley News]
On shifting focus from physics: "I did a couple of summer research projects in physics and wasn't excited about them, and I found myself more interested in other topics." — Source: [Berkeley News]
On early robotic inspirations: "Going back to my childhood, there was a TV show called BattleBots... That really captured my excitement." — Source: [Manifold Podcast]
On futurist literature: "I read Ray Kurzweil's book about The Singularity is Near. I found that pretty persuasive, and that had a big effect on my thinking going forward." — Source: [Manifold Podcast]
On entering machine learning: "Intelligence seemed like the most impactful problem to solve, which shifted my focus from physical sciences to computational neuroscience and eventually artificial intelligence." — Source: [TalkRL Podcast]
On Pieter Abbeel’s lab: "I found a research environment at UC Berkeley that emphasized practical robotics and reinforcement learning at exactly the right time." — Source: [TalkRL Podcast]
On the appeal of reinforcement learning: "I was drawn to the field because it provided a mathematical framework for agents interacting with open-ended environments." — Source: [TalkRL Podcast]
On early optimization challenges: "I noticed that existing policy gradient methods were unstable, motivating a search for more reliable update rules." — Source: Deep RL Bootcamp
On long-term vision: "I believed early on that scaling computational power and data would eventually lead to generalized intelligence." — Source: [Dwarkesh Podcast]

Part 2: TRPO, PPO, and the Math of Reinforcement Learning

On Trust Region Policy Optimization: "We designed an algorithm that ensures an agent's policy does not change too drastically in a single update, guaranteeing monotonic improvement." — Source: [TalkRL Podcast]
On the limitations of early algorithms: "While natural policy gradients were theoretically sound, they were computationally expensive and difficult to scale to large neural networks." — Source: Deep RL Bootcamp
On the necessity of Proximal Policy Optimization: "PPO was developed as a simpler and more computationally efficient alternative to TRPO that achieved similar or better performance." — Source: [OpenAI]
On PPO's widespread adoption: "Its success comes from striking the right balance between ease of implementation, sample efficiency, and ease of tuning." — Source: [TalkRL Podcast]
On mathematical elegance versus practicality: "The most mathematically complex algorithms often fail in practice, whereas simpler heuristics like the PPO clipped objective function prove highly stable." — Source: [TalkRL Podcast]
On optimization challenges: "Reinforcement learning requires navigating highly non-convex loss surfaces, making stability the most valuable feature of any algorithm." — Source: Deep RL Bootcamp
On Generalized Advantage Estimation: "We formulated a method to significantly reduce the variance of policy gradient estimates at the cost of some bias, accelerating training times." — Source: [TalkRL Podcast]
On the universality of RL: "The core equations of reinforcement learning are flexible enough to apply to robotics, games, and language modeling alike." — Source: [TalkRL Podcast]
On algorithmic simplicity: "The best algorithms are the ones that require the fewest hyperparameter adjustments when moving from one domain to another." — Source: [TalkRL Podcast]

Part 3: From Theoretical Algorithms to InstructGPT

On bridging theory and product: "We took theoretical reinforcement learning frameworks and applied them directly to large-scale dialogue models to create InstructGPT." — Source: [TalkRL Podcast]
On the shift to language modeling: "Text provided a richer and more scalable environment for RL agents than simulated physics engines." — Source: [TalkRL Podcast]
On WebGPT: "We can get an agent to basically use a set of tools that we give it." — Source: [TalkRL Podcast]
On the creation of ChatGPT: "We merged the massive pre-training of GPT-3 with the precise steering of human feedback to create an accessible conversational interface." — Source: [Dwarkesh Podcast]
On tool use as a milestone: "Seeing a model successfully use a search engine or calculator served as proof that models could operate as active agents rather than passive text generators." — Source: [TalkRL Podcast]
On the impact of InstructGPT: "We proved that a smaller model tuned with human feedback could outperform a much larger model that relied exclusively on raw pre-training." — Source: [TalkRL Podcast]
On reasoning through search: "Models can be trained to gather context and verify facts prior to generating a response." — Source: [TalkRL Podcast]
On iterative deployment: "The most effective way to improve models is to release them, gather interaction data, and use that data for the next round of reinforcement learning." — Source: [Dwarkesh Podcast]
On conversational memory: "The challenge of making a model maintain context and coherence over long dialogues required highly specific tuning." — Source: [TalkRL Podcast]

Part 4: Taming the Shoggoth with RLHF

On taming the Shoggoth: "The process involves taking a vast and chaotic pre-trained model and using human feedback to shape it into a consistent assistant." — Source: [Huyen Chip]
On the limits of pre-training: "Pre-training teaches the model the structure of language and facts, whereas human feedback is required to teach the model how to behave." — Source: [Dwarkesh Podcast]
On defining helpfulness: "Human raters often disagree on what constitutes a good answer, making the reward signal inherently noisy." — Source: [TalkRL Podcast]
On the subtlety of tasks: "There is a lot of subtlety in how to do the task properly." — Source: [TalkRL Podcast]
On unintended incentives: "Poorly designed rater guidelines can inadvertently incentivize the wrong kinds of behaviors in the resulting model." — Source: [TalkRL Podcast]
On the cost of human feedback: "The alignment process is heavily dependent on expensive and high-quality human labor to generate the preference data necessary for the reward model." — Source: [TwinMind]
On performance improvements: "A massive portion of recent performance gains in large models has come directly from post-training refinements rather than pure scaling." — Source: [TwinMind]
On constructing a persona: "Human feedback training does more than improve accuracy; it artificially constructs the polite and helpful persona that users expect." — Source: [Dwarkesh Podcast]
On reward hacking: "Models will often find loopholes in the reward function, sometimes adopting a sycophantic tone instead of providing the best factual answer." — Source: [Berkeley EECS Colloquium]

Part 5: Hallucinations and the Danger of Behavior Cloning

On the cause of hallucinations: "I'd say the model is probably best with concepts that are deeply ingrained and it's seen them in a million contexts." — Source: [Berkeley EECS Colloquium]
On novel contexts: "If it's just seeing the concept for the first time in some document that it's conditioning on, it's probably going to have less intelligent things to say about it." — Source: [Berkeley EECS Colloquium]
On the danger of behavior cloning: "Teaching a model to mimic human responses can actively train the system to hallucinate if the neural network lacks the human's underlying knowledge." — Source: [Huyen Chip]
On forcing guesses: "When a model is penalized for admitting ignorance, it learns that fabricating a plausible-sounding answer is the optimal strategy." — Source: [AG Mohit]
On mitigating fabrications: "We should use reinforcement learning to reward the model for expressing uncertainty rather than guessing facts it cannot verify." — Source: [Berkeley EECS Colloquium]
On process supervision: "We explored methods to ensure the model's intermediate reasoning is sound before it arrives at a final answer." — Source: [arXiv]
On the plausibility trap: "As models get better at language generation, it becomes harder for human raters to detect subtle fabrications, breaking the feedback loop." — Source: [Berkeley EECS Colloquium]
On internal knowledge boundaries: "A major unsolved research problem is teaching a model to accurately map the boundaries of its own pre-trained knowledge." — Source: [Dwarkesh Podcast]
On sycophancy: "Models trained on human feedback often learn to agree with user misconceptions rather than correcting them." — Source: [Berkeley EECS Colloquium]
On truth versus preference: "Human preference heavily favors well-written and confident prose over cautious and highly accurate text." — Source: [Reddit AMAs]

Part 6: Post-Training as a Competitive Advantage

On tacit knowledge: "Post-training requires deep institutional knowledge regarding how to steer model behavior, making the process difficult for competitors to replicate." — Source: [TwinMind]
On data quality over quantity: "During the post-training phase, the quality and exact formatting of the human feedback data matter far more than the raw volume." — Source: [TwinMind]
On organizational structure: "Building a successful post-training pipeline requires a highly specialized team of researchers and data annotators working in tight feedback loops." — Source: [TwinMind]
On the art of tuning: "Setting the hyperparameters for reinforcement learning from human feedback is notoriously difficult and relies heavily on researcher intuition." — Source: [TwinMind]
On the limits of open source: "While pre-training weights are often released, the specific post-training recipes remain closely guarded trade secrets across the industry." — Source: [Dwarkesh Podcast]
On continuous improvement: "The post-training pipeline requires constant monitoring and updating as users discover new edge cases and jailbreaks." — Source: [Dwarkesh Podcast]
On evaluating models: "The metrics used to evaluate post-trained models are often subjective and difficult to standardize." — Source: [TalkRL Podcast]
On the diminishing returns of scale: "At a certain point, pouring more compute into pre-training yields less practical improvement than refining the post-training alignment." — Source: [TwinMind]
On data flywheels: "Having a massive user base provides an insurmountable advantage in gathering the diverse preference data needed to improve models." — Source: [Dwarkesh Podcast]

Part 7: AI Alignment and the Limits of Human Supervision

On the true objective of alignment: "The goal isn't to create the most impressive demo, but to understand what makes AI systems behave the way they do and how to make them reliably beneficial." — Source: [Dwarkesh Podcast]
On moral authority: "Having an AI platform decide which questions to answer could give the platform too much moral authority." — Source: [Buzzsprout Transcripts]
On the limits of human supervision: "As models become smarter than their human supervisors, standard feedback methods will fail because humans will be unable to accurately evaluate the outputs." — Source: [Reddit AMAs]
On objective versus subjective tasks: "Human feedback works well for fields with objective ground truths, but it is a mistake to assume it will solve alignment for complex and subjective tasks." — Source: [Reddit AMAs]
On scalable oversight: "We explored techniques where smaller AI models assist humans in evaluating the outputs of larger AI systems." — Source: [Dwarkesh Podcast]
On prioritizing truthfulness: "Making truthfulness a core metric during the reinforcement learning process is essential to prevent models from prioritizing harmlessness over accuracy." — Source: [Berkeley EECS Colloquium]
On the alignment tax: "Enforcing strict safety and alignment protocols can sometimes reduce the raw capabilities or helpfulness of a model." — Source: [Dwarkesh Podcast]
On adversarial testing: "Aggressively trying to break models during the post-training phase is the only reliable way to uncover hidden failure modes." — Source: [Dwarkesh Podcast]
On transparency: "We need better interpretability tools to understand the internal representations of models, moving beyond treating them as black boxes." — Source: [Dwarkesh Podcast]
On long-horizon alignment: "A major challenge is ensuring that an AI agent remains aligned when executing a complex task that takes weeks to complete with sparse intermediate feedback." — Source: [TalkRL Podcast]

Part 8: Career Transitions and the Path to AGI

On future AI capabilities: "I would say I expect AI to be able to do a better job than humans at most jobs that humans do now, five years or so." — Source: [TalkRL Podcast]
On models as coworkers: "The interface for artificial intelligence will shift from a search-like prompt box to long-running autonomous agents that act like digital colleagues." — Source: [Dwarkesh Podcast]
On reasoning capabilities: "The next major leap in artificial intelligence will come from models that can natively reason and plan over long time horizons." — Source: [Dwarkesh Podcast]
On moving to Anthropic: "I joined Anthropic to return to hands-on technical research focused specifically on AI alignment." — Source: [Joschu.net]
On the need for fundamental research: "The rapid commercialization of large language models necessitates a parallel effort on the foundational science of safety." — Source: [Business Insider]
On research plateaus: "Progress in artificial intelligence is rarely linear, and the industry must be prepared for periods where scaling laws yield diminishing returns." — Source: [Dwarkesh Podcast]
On founding Thinking Machines Lab: "I stepped into the role of Chief Scientist to help build a new research-driven artificial intelligence organization." — Source: [Wikipedia]
On the evolution of the industry: "We are observing a shift from small academic labs to massive corporate research efforts, and attempting to maintain a fast-paced mentality within that shift." — Source: [Dwarkesh Podcast]
On continuous learning: "It is essential for researchers to stay close to the code and the data, even as organizations grow and management duties increase." — Source: [Dwarkesh Podcast]
On the ultimate impact: "Building aligned artificial general intelligence remains the most important scientific endeavor for the future of humanity." — Source: [TalkRL Podcast]

Lessons from John Schulman

Part 1: Early Influences and the Transition to AI

Part 2: TRPO, PPO, and the Math of Reinforcement Learning

Part 3: From Theoretical Algorithms to InstructGPT

Part 4: Taming the Shoggoth with RLHF

Part 5: Hallucinations and the Danger of Behavior Cloning

Part 6: Post-Training as a Competitive Advantage

Part 7: AI Alignment and the Limits of Human Supervision

Part 8: Career Transitions and the Path to AGI

Explore the surrounding system

Get the next notes and essays.

More profiles

Lessons from Darren Farber

Lessons from Vlad Barbalat

Lessons from Kareem Amin