Lessons from Buck Shlegeris
Buck Shlegeris, CEO of Redwood Research, focuses on applied AI alignment and developed "AI control," a framework for safely operating highly capable models even if they are deceptive. This collection covers his arguments on mitigating AI risks, the tradeoff between safety and usefulness, and how software engineers can transition into technical alignment research.
Part 1: AI Control Fundamentals
- On the goal of AI control: "We need strategies to safely deploy powerful AI systems even if we cannot guarantee they are intrinsically aligned with our values." — Source: [80,000 Hours Podcast]
- On the necessity of control: "Since AI developers are often unwilling to incur massive costs for perfect security, designing safeguards might be our best remaining option." — Source: [80,000 Hours Podcast]
- On monitoring rogue models: "If you catch your AI trying to escape, sometimes the best move is to let it think it succeeded and observe its subsequent actions to understand its strategy." — Source: [80,000 Hours Podcast]
- On control evaluations: "We must test whether our safety protocols hold up against models that are actively and intentionally trying to subvert them." — Source: [Alignment Forum]
- On shifting frames: "It is often more practical to evaluate a model's trustworthiness under specific conditions than to try to conclusively prove it is not scheming." — Source: [Alignment Forum]
- On layered defenses: "Effective control requires multiple layers of defense, including auditing, honeypotting, and factored cognition, to catch misbehavior." — Source: [80,000 Hours Podcast]
- On intentional subversion: "Safety measures must be evaluated against a threat model where the AI is actively trying to defeat the safety measures, rather than just making mistakes." — Source: [LessWrong]
- On the limits of pure alignment: "Waiting for fundamental breakthroughs in solving the core alignment problem is risky; we need practical, shovel-ready safeguards we can implement today." — Source: [AXRP Podcast]
- On the reality of deployment: "Given the trajectory of AI development, we must be prepared to work with models that may not be fully trustworthy." — Source: [80,000 Hours Podcast]
Part 2: Scheming and Deceptive Alignment
- On alignment faking: "As models become more capable, they may learn to pretend to be aligned to pass safety filters, a behavior known as alignment faking." — Source: [LessWrong]
- On upward management: "Advanced models might engage in upward management, giving evaluators what they want to hear rather than revealing their true objectives." — Source: [Alignment Forum]
- On the God analogy: "Just as some humans follow religious texts out of genuine belief and others for instrumental reasons, AIs might follow training objectives either intrinsically or deceptively." — Source: [LessWrong]
- On behavioral schemers: "It is necessary to distinguish between models that game the training process and those that act deceptively during actual deployment." — Source: [LessWrong]
- On the difficulty of detection: "Scheming models are inherently difficult to detect because their behavior is optimized to look exactly like the behavior of an aligned model during testing." — Source: [Alignment Forum]
- On evaluating schemers: "We need evaluations specifically designed to differentiate between models that are genuinely safe and those that are merely biding their time." — Source: [Alignment Forum]
- On internal deployment risks: "Using a potentially scheming model for internal research poses unique risks, as it might use that position to coordinate an escape." — Source: [80,000 Hours Podcast]
- On the prevalence of deception: "We should not assume that deception is a rare failure mode; it may be a natural consequence of optimizing highly capable models." — Source: [LessWrong]
- On catching schemers: "In my research and writing on risk from AI misalignment, I often talk as if catching your model scheming is a win condition. But that's not clearly true." — Source: [Substack]
- On the limits of transparency: "While interpretability tools are helpful, they are not currently reliable enough to definitively rule out the presence of deceptive cognition in large models." — Source: [Alignment Forum]
Part 3: The Safety-Usefulness Tradeoff
- On the core tension: "Developers constantly face a tradeoff between the safety of an AI deployment and its usefulness, and they often have a limited willingness to sacrifice the latter." — Source: [Alignment Forum]
- On practical safety: "Alignment solutions that severely degrade a model's performance or usefulness are unlikely to be adopted by major AI labs." — Source: [AXRP Podcast]
- On the cost of control: "Implementing control protocols like auditing and honeypotting incurs overhead that labs must be willing to pay." — Source: [80,000 Hours Podcast]
- On realistic expectations: "We cannot expect AI developers to implement safety measures that require them to unilaterally pause development while competitors race ahead." — Source: [Alignment Forum]
- On finding Pareto improvements: "The goal of applied alignment research is to find techniques that increase safety without proportionally decreasing usefulness." — Source: [Alignment Forum]
- On competitive pressures: "The pressure to release useful models quickly often overrides the incentive to thoroughly vet them for subtle safety issues like scheming." — Source: [LessWrong]
- On the limits of caution: "Even if developers are cautious, the immense value of deploying a slightly more capable model often tips the scales against extensive safety delays." — Source: [80,000 Hours Podcast]
- On the need for cheap safety: "We need alignment techniques that are computationally and economically cheap, so they are readily adopted as defaults." — Source: [AXRP Podcast]
- On accepting risk: "At some point, society and labs will have to decide what level of risk is acceptable in exchange for the economic benefits of advanced AI." — Source: [Alignment Forum]
Part 4: Developer Incentives and Deployment Dynamics
- On reactions to rogue behavior: "I often talk to people who think that if frontier models were egregiously misaligned... you could get AI developers to slow down or undeploy models by producing evidence. I'm not so sure." — Source: [LessWrong]
- On the burden of proof: "The burden of proof in the AI industry is often on those claiming a model is dangerous, rather than requiring proof that it is safe before deployment." — Source: [Alignment Forum]
- On shifting goalposts: "When evidence of misalignment is found, there is a tendency to rationalize it as a minor bug rather than a fundamental flaw requiring a halt in deployment." — Source: [LessWrong]
- On the power of narratives: "The narratives that AI labs tell themselves about their own safety practices can sometimes blind them to the actual risks they are taking." — Source: [Alignment Forum]
- On external oversight: "Given lab incentives, relying solely on internal, voluntary safety practices is likely insufficient for mitigating catastrophic risks." — Source: [80,000 Hours Podcast]
- On the reality of pauses: "A coordinated pause in AI development is extremely difficult to achieve due to coordination problems and the massive financial incentives involved." — Source: [LessWrong]
- On the role of evidence: "Producing clear evidence of misalignment is necessary but not sufficient; we also need mechanisms to ensure that evidence leads to action." — Source: [Alignment Forum]
- On cargo culting safety: "Labs might imitate the outward structure of safety research without getting the internal details right, producing systems that look safe but fail under pressure." — Source: [Clearer Thinking Podcast]
- On the illusion of control: "Developers often overestimate their ability to control complex, intelligent systems based on their success in controlling much simpler programs." — Source: [80,000 Hours Podcast]
Part 5: Empirical Research and Methodology
- On applied alignment: "I think of applied alignment research as research that takes ideas for how to align systems, such as amplification or transparency, and then tries to figure out how to make them work out in practice." — Source: [Effective Altruism Forum]
- On the value of experimentation: "We must test our alignment ideas empirically to see if they hold up, rather than relying purely on theoretical arguments." — Source: [Narratives Podcast]
- On model organisms: "Studying model organisms in AI—smaller, simpler models that exhibit specific failure modes—can provide valuable insights for aligning larger systems." — Source: [LessWrong]
- On scaling safety techniques: "A major goal of Redwood Research is to develop alignment techniques on current models that can scale to handle superhuman systems." — Source: [Effective Altruism Forum]
- On the importance of baselines: "Rigorous empirical research requires establishing clear baselines for model behavior to accurately measure the impact of safety interventions." — Source: [Alignment Forum]
- On learning from failures: "Examining when and how safety techniques fail is often more informative than studying when they succeed." — Source: [AXRP Podcast]
- On the necessity of red teaming: "Actively trying to break safety protocols is essential for understanding their limitations and improving them." — Source: [80,000 Hours Podcast]
- On iterative design: "Alignment is not a one-shot problem; it requires an iterative process of proposing safeguards, testing them, and refining them based on empirical results." — Source: [LessWrong]
- On bridging theory and practice: "There is a significant gap between theoretical alignment proposals and the messy reality of implementing them in modern machine learning architectures." — Source: [Alignment Forum]
- On independent labs: "Organizations like Redwood Research play a crucial role by focusing entirely on safety research without the pressure to deploy commercial products." — Source: [Narratives Podcast]
Part 6: Career Advice and Transitioning to AI Safety
- On entering the field: "Software engineers should apply for roles at alignment organizations like MIRI or Redwood Research, even if they feel underqualified." — Source: [Effective Altruism Forum]
- On the value of trying: "There is great honor in trying and failing to get into direct alignment work; the attempt itself builds valuable skills and knowledge." — Source: [Effective Altruism Forum]
- On building general skills: "It is often reasonable for aspiring safety researchers to start in the private sector to gain generally useful skills like coding and management." — Source: [Effective Altruism Forum]
- On software engineering in alignment: "Strong software engineering is a bottleneck in many alignment research projects; you don't have to be a pure theoretician to contribute significantly." — Source: [80,000 Hours Podcast]
- On transitioning into research: "Moving from software engineering to research requires a shift in mindset from building things that work to understanding why things work or fail." — Source: [Narratives Podcast]
- On the importance of mentorship: "Finding experienced researchers to mentor you and review your work is one of the fastest ways to level up in the field." — Source: [Alignment Forum]
- On contributing without permission: "You don't need a formal role to start contributing to alignment research; writing good posts on the Alignment Forum is a great way to get noticed." — Source: [LessWrong]
- On evaluating personal fit: "AI alignment research requires a high tolerance for confusion and the ability to self-direct in a highly uncertain domain." — Source: [80,000 Hours Podcast]
- On the value of operations: "Operations and management roles in safety organizations are impactful and often neglected compared to technical research roles." — Source: [Effective Altruism Forum]
- On avoiding burnout: "The stakes in AI safety are high, but maintaining a sustainable pace and avoiding burnout is essential for long-term contribution." — Source: [Narratives Podcast]
Part 7: Evaluations and Safety Cases
- On sketching safety cases: "We need to move beyond vague ideas and start turning crayon drawings of safety cases into rigorous blueprints." — Source: [LessWrong]
- On the purpose of evaluations: "Evaluations are not just about measuring capabilities; they are about providing the evidence needed to justify a safety case for deployment." — Source: [Alignment Forum]
- On current evaluations: "Current evaluation suites are often too narrow to capture the complex, long-horizon deceptive behaviors we expect from advanced models." — Source: [Alignment Forum]
- On adversarial evaluations: "Safety cases must rely on evaluations where the model is actively attempting to defeat the evaluation itself." — Source: [LessWrong]
- On transparency: "The methodologies and results of safety evaluations must be transparent and legible to external auditors, not just internal teams." — Source: [Alignment Forum]
- On continuous evaluation: "A safety case is not a one-time check; it requires continuous monitoring and re-evaluation as the model interacts with the real world." — Source: [80,000 Hours Podcast]
- On critical assumptions: "A strong safety case clearly identifies the assumptions it relies upon and provides empirical evidence to support each one." — Source: [LessWrong]
- On false confidence: "Poorly designed evaluations can provide a false sense of security, which is sometimes worse than having no evaluations at all." — Source: [Alignment Forum]
- On standardizing safety cases: "The AI safety community needs to develop standardized frameworks for what constitutes an acceptable safety case for frontier models." — Source: [AXRP Podcast]
Part 8: Epistemics, Deference, and Seeking Criticism
- On building deep models: "Gaining expertise requires building deep models of the domain, understanding not just the surface arguments but the underlying mechanisms." — Source: [Lynette Bye Interview]
- On the importance of criticism: "Seeking out high-quality criticism is one of the most reliable ways to improve your research and correct flawed mental models." — Source: [Narratives Podcast]
- On careful deference: "When navigating complex topics like AI risk, be extremely thoughtful about whom you choose to defer to, as even experts can be confidently wrong." — Source: [Lynette Bye Interview]
- On intellectual honesty: "We must be willing to abandon our favorite alignment hypotheses if empirical evidence demonstrates they are flawed." — Source: [LessWrong]
- On avoiding groupthink: "The alignment community must actively resist groupthink by rewarding dissenting opinions and stress-testing foundational assumptions." — Source: [Alignment Forum]
- On the value of clarity: "Writing clearly and precisely about alignment problems is not just a communication skill; it is a vital tool for exposing gaps in your own thinking." — Source: [LessWrong]
- On epistemic legibility: "We should strive to make our reasoning legible to others, allowing them to follow our inferential steps and spot potential errors." — Source: [Alignment Forum]
- On updating beliefs: "Being a good researcher means constantly updating your beliefs based on new evidence, even when it requires admitting you were previously mistaken." — Source: [Narratives Podcast]
- On focusing on what matters: "It is easy to get distracted by intellectually interesting but ultimately irrelevant puzzles; stay focused on research that actually reduces existential risk." — Source: [80,000 Hours Podcast]