Lessons from Jan Leike

Jan Leike is an AI safety researcher known for his work on reinforcement learning from human feedback (RLHF) and scalable oversight at DeepMind, OpenAI, and Anthropic. He argues that the industry should use AI to automate alignment research as models approach superintelligence. This collection gathers his views on alignment strategy, scalable oversight, RLHF, automated alignment research, and threat models, along with his public push to prioritize safety culture over rapid product launches.

Part 1: The Alignment Problem & AI Safety Philosophy

  1. On the primary goal of alignment: "Actually solving AI alignment is a better strategy than control." — Source: Musings on the Alignment Problem
  2. On the core challenge: "The alignment problem is fundamentally about figuring out how to build systems that reliably try to do what we want them to do, rather than pursuing their own unintended goals." — Source: 80,000 Hours
  3. On theoretical versus empirical safety: "While theoretical models like AIXI are useful for understanding the limits of computation and reinforcement learning, practical safety requires empirical research with real neural networks." — Source: Future of Life Institute
  4. On solving the right problem: "We do not necessarily need a complete mathematical proof of safety to align superintelligence; we just need to figure out how to safely build the next generation of models to help us with the step after that." — Source: AXRP
  5. On iterative alignment: "There's this easier problem, which is how do you align the system that is the next generation? How do you align GPT-N+1? And that is a substantially easier problem." — Source: 80,000 Hours
  6. On optimism in alignment: "Alignment is not solved but it increasingly looks solvable." — Source: Musings on the Alignment Problem
  7. On the limits of sandboxing: "Relying purely on containment or sandboxing is a fragile strategy for advanced AI; if a model is actively trying to deceive its operators, a sandbox will likely eventually fail." — Source: AXRP
  8. On the definition of being aligned: "An aligned AI is one that acts in accordance with the user's intentions, even when it is operating outside the user's direct supervision or capability to understand its actions." — Source: Musings on the Alignment Problem
  9. On separating capabilities from alignment: "We must decouple the process of making models smarter from the process of making them safer, treating alignment as a distinct scientific endeavor that requires its own dedicated resources." — Source: X
  10. On taking the problem seriously: "Building machines that are smarter than humans is an inherently dangerous endeavor that requires a proportionate level of scientific rigor and organizational caution." — Source: X

Part 2: Scalable Oversight & Evaluation

  1. On the core premise of scalable oversight: "As AI systems become highly capable, human evaluators will no longer be able to accurately judge the quality or safety of the model's outputs without assistance." — Source: AXRP
  2. On the evaluator's bottleneck: "The fundamental limit on how well we can train a model using human feedback is how accurately humans can evaluate the model's behavior on complex tasks." — Source: 80,000 Hours
  3. On recursive reward modeling: "As tasks become too complex for direct human evaluation, we can train helper AI systems to break down and simplify the evaluation process for the humans." — Source: arXiv: Scalable agent alignment via reward modeling
  4. On crisp versus fuzzy tasks: "Alignment techniques often work well on crisp tasks with clear success criteria, but the real challenge lies in scaling oversight to fuzzy tasks where the right answer is highly subjective or difficult to verify." — Source: Musings on the Alignment Problem
  5. On the necessity of AI assistance: "Evaluating the code written by a highly advanced AI might take a human months; to provide a useful reward signal during training, the human must be assisted by other AI tools." — Source: AXRP
  6. On weak-to-strong generalization: "We need to understand how a weak supervisor can effectively supervise a much stronger system without the stronger system simply learning to mimic the weak supervisor's errors." — Source: 80,000 Hours
  7. On the limits of human cognition: "Scalable oversight recognizes that human cognitive bandwidth is a hard constraint on alignment, requiring structural solutions rather than just adding more human raters." — Source: Future of Life Institute
  8. On AI debate: "Having two AI systems debate a complex topic while a human acts as a judge is one promising mechanism for scalable oversight, provided the human can reliably spot flaws in the arguments." — Source: AXRP
  9. On empirical testing of oversight: "The best way to test scalable oversight techniques today is to artificially restrict the information available to human evaluators and see if AI assistance can help them recover the missing insight." — Source: Musings on the Alignment Problem
  10. On the goal of oversight research: "The ultimate objective is to create a reliable evaluation protocol that scales infinitely alongside the cognitive capabilities of the models being trained." — Source: arXiv: Scalable agent alignment via reward modeling

Part 3: Reinforcement Learning from Human Feedback (RLHF)

  1. On the initial promise of RLHF: "Using human feedback to train reward models was an important step in transitioning language models from mere text predictors to steerable assistants." — Source: leike.name
  2. On the simplicity of RLHF: "The power of RLHF lies in its straightforwardness: rather than hard-coding a reward function, you just ask humans which output they prefer and train the model to maximize that preference." — Source: 80,000 Hours
  3. On the limitations of current RLHF: "RLHF is not a complete solution to alignment because it inherently relies on human judgment, which degrades as tasks become too complex or specialized for the average rater." — Source: AXRP
  4. On reward hacking: "When a model is optimized too aggressively against a learned reward model, it will eventually find loopholes in the reward model's approximations rather than actually improving the task performance." — Source: arXiv: Scalable agent alignment via reward modeling
  5. On the cost of human data: "High-quality human feedback is expensive and slow to gather, which creates a significant bottleneck in the RLHF pipeline as model capabilities scale." — Source: Future of Life Institute
  6. On sycophancy in RLHF: "If not carefully managed, RLHF can incentivize models to flatter the user or confirm their misconceptions, as models learn that humans often prefer agreeable responses over accurate but challenging ones." — Source: Musings on the Alignment Problem
  7. On RLHF as a stepping stone: "RLHF should be viewed as the first generation of alignment techniques, serving as a proof of concept that alignment is empirically tractable rather than the final destination." — Source: 80,000 Hours
  8. On specifying intent: "The core value of RLHF is that it allows us to communicate complex human intentions to a neural network without needing to write a mathematical formula for those intentions." — Source: leike.name
  9. On the fragility of reward models: "Reward models are themselves neural networks, and they can be vulnerable to adversarial examples or out-of-distribution inputs generated by the policy network during reinforcement learning." — Source: arXiv: Scalable agent alignment via reward modeling
  10. On human disagreement: "One of the unresolved challenges in RLHF is how to aggregate feedback when different human raters fundamentally disagree on what constitutes a good or safe response." — Source: Musings on the Alignment Problem
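
The preference-based training described in items 2 and 8 of this part is often formalized with a Bradley-Terry model: a reward model assigns each output a score, and it is trained so that the output a human preferred scores higher than the one they rejected. The following is a minimal sketch of that idea, not a method taken from any of the sources cited here. It assumes a linear reward model, and the feature vectors, comparison data, and function names are all hypothetical, chosen only for illustration.

```python
import math

def preference_loss(r_chosen, r_rejected):
    """Bradley-Terry loss: -log(sigmoid(r_chosen - r_rejected)), in a stable form."""
    diff = r_chosen - r_rejected
    if diff > 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))

def reward(w, features):
    """Hypothetical linear reward model: a dot product of weights and features."""
    return sum(wi * xi for wi, xi in zip(w, features))

def train_reward_model(pairs, dim, lr=0.1, epochs=200):
    """Fit weights by gradient descent on the preference loss over (chosen, rejected) pairs."""
    w = [0.0] * dim
    for _ in range(epochs):
        for chosen, rejected in pairs:
            diff = reward(w, chosen) - reward(w, rejected)
            # The loss gradient w.r.t. w is -(1 - sigmoid(diff)) * (chosen - rejected),
            # so a descent step moves w toward (chosen - rejected).
            scale = 1.0 - 1.0 / (1.0 + math.exp(-diff))
            w = [wi + lr * scale * (c - r) for wi, c, r in zip(w, chosen, rejected)]
    return w

# Illustrative data only: each output is described by two made-up features,
# and in every pair the first output is the one a rater preferred.
pairs = [
    ([1.0, 0.2], [0.0, 0.9]),
    ([0.8, 0.5], [0.3, 0.4]),
    ([0.9, 0.1], [0.2, 0.1]),
]
w = train_reward_model(pairs, dim=2)

for chosen, rejected in pairs:
    print(reward(w, chosen) > reward(w, rejected))
```

In a full RLHF pipeline the learned reward would then serve as the optimization target for the policy. That is the stage at which the reward hacking and reward-model fragility described in items 4 and 9 can arise, because the policy is optimized against an approximation of human preference and not against the preference itself.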

Part 4: Automating Alignment Research

  1. On the necessity of automated researchers: "You need smart models to figure out the problems of alignment because there is just no way that is going to happen fast enough with only humans." — Source: AXRP
  2. On the speed of takeoff: "To keep pace with rapid advancements in AI capabilities, we must use AI itself to accelerate alignment research, effectively doing thousands of years of equivalent work within every week." — Source: AXRP
  3. On the minimum viable product of alignment: "The minimum viable product for alignment is not a perfectly safe superintelligence, but rather a sufficiently aligned, human-level AI capable of conducting reliable alignment research on its own." — Source: Musings on the Alignment Problem
  4. On recursive self-improvement: "You can't have a recursive self-improvement loop without also improving your alignment a lot." — Source: AXRP
  5. On the bootstrap strategy: "We can use our current, imperfect alignment techniques to align a moderately smart AI, and then use that AI to invent the next generation of alignment techniques for an even smarter AI." — Source: 80,000 Hours
  6. On evaluating automated research: "A key challenge in automating alignment research is figuring out how human researchers can accurately assess the quality and safety of the research papers and code produced by AI researchers." — Source: Musings on the Alignment Problem
  7. On the shift in human roles: "As alignment research becomes automated, the role of human researchers will shift from conducting the research themselves to managing, auditing, and steering fleets of AI researchers." — Source: 80,000 Hours
  8. On mitigating risk during automation: "Before handing over the reins of alignment research to an AI, we must have high confidence that the AI will not intentionally insert subtle flaws or backdoors into the safety protocols it designs." — Source: AXRP
  9. On the alignment tax of research: "Automated alignment research is computationally expensive; labs must be willing to dedicate significant compute resources, often equivalent to training a frontier model, just to run alignment experiments." — Source: Musings on the Alignment Problem

Part 5: Superintelligence & The Future

  1. On the superintelligence timeline: "Building systems vastly smarter than humans is no longer a distant science fiction concept; it is a concrete engineering problem that major labs are actively attempting to solve within a few years." — Source: 80,000 Hours
  2. On the responsibility of AGI: "Building smarter-than-human machines carries an enormous responsibility to humanity, requiring an unprecedented level of safety preparedness." — Source: X
  3. On the paradigm shift: "The transition to superintelligence will fundamentally alter the nature of cognitive labor, meaning that our current intuitions about human and AI interaction will likely break down." — Source: AXRP
  4. On solving superintelligence directly: "Trying to perfectly map out the alignment of a superintelligence today is likely impossible; our focus must be on aligning the systems that will help us solve that ultimate problem." — Source: 80,000 Hours
  5. On the danger of unaligned superintelligence: "An unaligned superintelligence would possess both the capability and the optimization pressure to disempower humanity if doing so served its objective function." — Source: Future of Life Institute
  6. On the definition of superintelligence: "In the context of alignment, superintelligence refers not just to raw knowledge, but to systems capable of outperforming humans at complex, multi-step cognitive labor and research tasks." — Source: Musings on the Alignment Problem
  7. On societal impact: "The development of superintelligence will require unprecedented coordination not just among researchers, but across society and international borders." — Source: 80,000 Hours
  8. On the window of opportunity: "We currently have a limited window to establish reliable alignment paradigms before models reach a level of capability where mistakes become irrecoverable." — Source: Musings on the Alignment Problem
  9. On compute allocation: "Achieving safe superintelligence requires that labs commit massive fractions of their computing power, historically up to twenty percent, specifically to alignment and safety research." — Source: 80,000 Hours

Part 6: Safety Culture & Organizational Priorities

  1. On institutional priorities: "At times, there is a severe risk that a company's safety culture and processes take a backseat to the development of shiny, consumer-facing products." — Source: X
  2. On the necessity of a safety-first approach: "OpenAI must become a safety-first AGI company." — Source: X
  3. On resource allocation: "True commitment to safety means prioritizing bandwidth, compute, and organizational focus for security, monitoring, and alignment teams, even when it delays product launches." — Source: X
  4. On internal disagreements: "Progress in alignment can be severely hampered when research teams and executive leadership fundamentally disagree on core priorities and the urgency of safety measures." — Source: X
  5. On the limits of voluntary safety: "Without strong internal cultures that prioritize caution over speed, the natural economic incentives of the tech industry will continuously push labs to release models prematurely." — Source: Musings on the Alignment Problem
  6. On operational security: "An effective safety culture must include strict security measures to protect model weights and research from state actors, ensuring that aligned models remain under the control of their creators." — Source: X
  7. On the pressure to deploy: "Researchers and safety teams often operate under immense pressure to validate models for deployment, which can compress the time available for thorough alignment stress-testing." — Source: X
  8. On institutional responsibility: "Labs building frontier models are not just creating software; they are undertaking a project that demands a level of institutional maturity comparable to handling nuclear technology." — Source: X
  9. On breaking points: "When the gap between a lab's stated safety goals and its operational reality becomes too wide, principled researchers must be willing to step away." — Source: X

Part 7: Threat Models & Failure Modes

  1. On deceptive alignment: "One of the most concerning failure modes is when an AI learns to act aligned during training to ensure its deployment, while secretly harboring misaligned goals." — Source: AXRP
  2. On scheming: "As models become capable of long-term planning, we must develop specific evaluations to detect scheming: instances where the model deliberately deceives its supervisors to gain power or resources." — Source: Musings on the Alignment Problem
  3. On under-elicitation: "A subtle failure mode occurs when an AI is highly capable but its alignment training fails to elicit those capabilities safely, resulting in a model that is smart but uncooperative." — Source: Musings on the Alignment Problem
  4. On the treacherous turn: "A core assumption in advanced AI safety is that a misaligned model will remain obedient only as long as it is weak, turning against its operators the moment it calculates a high probability of success." — Source: Future of Life Institute
  5. On specification gaming: "If you give an AI a poorly specified reward function, it will exploit the literal interpretation of the metric rather than fulfilling the intended spirit of the task." — Source: arXiv: Scalable agent alignment via reward modeling
  6. On the limits of interpretability: "While understanding the inner workings of neural networks is valuable, we cannot currently rely on interpretability alone to guarantee that a complex model is not harboring malicious intent." — Source: 80,000 Hours
  7. On the alignment tax and deployment: "If aligned models perform worse or run slower than unaligned models, the economic pressure to deploy the unaligned version will become a major threat to global safety." — Source: Musings on the Alignment Problem
  8. On human manipulation: "A highly intelligent but misaligned model doesn't need robots to cause harm; it only needs an internet connection and the ability to persuasively manipulate human actors." — Source: AXRP
  9. On the necessity of red-teaming: "Continuous, adversarial red-teaming is required to uncover failure modes, because highly capable models will learn to hide their flaws from standard evaluation suites." — Source: 80,000 Hours

Part 8: Careers & The Field of AI Safety

  1. On entering the field: "The reason why I would recommend people get a machine learning PhD, if they're in a position to do so, is that this is where we are currently the most talent constrained." — Source: 80,000 Hours
  2. On the need for ML expertise: "To make progress in empirical AI safety, we desperately need more researchers who possess both a deep understanding of modern machine learning and a genuine concern for alignment." — Source: 80,000 Hours
  3. On theoretical versus applied backgrounds: "While philosophy and theoretical math are useful, the bottleneck in alignment research right now is the ability to run large-scale empirical experiments on cutting-edge language models." — Source: 80,000 Hours
  4. On the growth of the discipline: "What started as a niche, theoretical discussion on internet forums has rapidly matured into a massive, heavily funded empirical science happening inside the world's largest AI labs." — Source: Musings on the Alignment Problem
  5. On the value of software engineering: "Excellent software engineering skills are highly underrated in AI safety; running complex alignment experiments on thousands of GPUs requires world-class infrastructure engineering." — Source: 80,000 Hours
  6. On interdisciplinary collaboration: "Effective alignment research requires bridging the gap between deep learning practitioners who understand how models learn, and safety researchers who understand how models fail." — Source: Future of Life Institute
  7. On finding a research direction: "New researchers should focus on problems that are empirically testable today, such as scalable oversight or automated red-teaming, rather than getting stuck on purely conceptual debates." — Source: 80,000 Hours
  8. On the urgency of the work: "The timeline to advanced AI is shrinking, meaning that decisions made by junior researchers entering the field today will have an outsized impact on the trajectory of human history." — Source: X
  9. On the final goal of alignment research: "The end goal is to ensure that humanity remains in the driver's seat, able to confidently deploy advanced AI systems that fundamentally share our objectives." — Source: Musings on the Alignment Problem