Lessons from Neel Nanda
Neel Nanda leads a mechanistic interpretability team at Google DeepMind and previously worked at Anthropic. He reverse-engineers neural networks to figure out how they actually work, focusing on areas like grokking, induction heads, and Othello-GPT. This collection pulls together his practical advice on reading model internals, testing safety, and doing better research.
Part 1: Understanding Mechanistic Interpretability
- On Alien Neuroscience: "Mechanistic interpretability is fundamentally like alien neuroscience. We are studying a brain that evolved through gradient descent rather than biology, trying to isolate the specific circuits that produce cognition." — Source: [80,000 Hours Podcast]
- On the Black Box Problem: "The default state of neural networks is opaque matrices of floating-point numbers. The goal isn't just to get an intuition for them, but to compile them down into rigorous, human-readable algorithms." — Source: [Neel Nanda's Blog]
- On the Swiss Cheese Model of Safety: "Interpretability is not a silver bullet for alignment. It is one slice of Swiss cheese; we need multiple layers of imperfect safety measures to catch deceptive or dangerous behavior." — Source: [Alignment Forum]
- On Bottom-Up Understanding: "You cannot fully understand a model just by observing its input-output behavior. You have to look at the intermediate activations and trace the specific computational paths." — Source: [Neel Nanda - Mechanistic Interpretability, Superposition, Grokking]
- On Toy Models: "Working on toy models is a strategic choice. If we cannot reverse-engineer a one-layer attention network, we have no hope of understanding frontier models with billions of parameters." — Source: [AXRP Podcast]
- On True Understanding: "A successful interpretability project should ideally let you replace the neural network component with hand-written code that achieves the same performance on that specific subtask." — Source: [LessWrong]
- On the Universality Hypothesis: "We expect that different models, trained on different data, will often converge on the exact same underlying algorithms for solving specific problems, much like convergent evolution." — Source: [Transformer Circuits Thread]
- On Evaluating Transparency: "It is easy to fool yourself in interpretability. You need rigorous causal interventions to prove that the circuit you found is actually what the model uses, rather than just an epiphenomenon." — Source: [Alignment Forum]
- On Tooling: "Building high-quality, open-source infrastructure like TransformerLens is essential because the barrier to entry for analyzing model internals is artificially high." — Source: [Neel Nanda's Blog]
- On the End Goal: "The ultimate aim is to use mechanistic understanding to audit frontier models for alignment before they are deployed, ensuring they don't possess hidden, dangerous capabilities." — Source: [80,000 Hours Podcast]
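
The point about causal interventions can be made concrete with a toy ablation. The sketch below is a minimal illustration, not code from any of the cited sources: the `forward` function, its hand-picked weights, and the two hidden units are all hypothetical. Both units track the input, so both correlate with the output, but only unit 0 is wired to it. Zeroing each unit in turn separates the causally relevant one from the epiphenomenal one.

```python
def forward(x, ablate=None):
    """Toy model with two hidden units; `ablate` zeroes one unit by index."""
    hidden = [max(0.0, 2.0 * x), max(0.0, 3.0 * x)]  # both units track x
    if ablate is not None:
        hidden[ablate] = 0.0  # the causal intervention
    w_out = [1.5, 0.0]  # hypothetical weights: only unit 0 reaches the output
    return sum(w * h for w, h in zip(w_out, hidden))


inputs = [0.5, 1.0, 2.0]
baseline = [forward(x) for x in inputs]

# Total change in output when each hidden unit is knocked out.
effect = {
    unit: sum(abs(b - forward(x, ablate=unit)) for x, b in zip(inputs, baseline))
    for unit in (0, 1)
}
print(effect)
```

Ablating unit 0 changes every output, while ablating unit 1 changes nothing, even though unit 1's activation is perfectly correlated with the output. On real models the same idea is often applied with mean ablation or activation patching rather than zeroing.
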
Part 2: The Phenomenon of Grokking
- On the Definition of Grokking: "Grokking is the phenomenon where a model trains for a long time at baseline performance, memorizing the training data, and then suddenly leaps to perfect generalization." — Source: [Progress Measures for Grokking]
- On Phase Transitions: "The jump in generalization during grokking is not a magical discontinuity; it is a gradual phase transition happening within the model's weights that suddenly crosses a threshold of utility." — Source: [Neel Nanda - Mechanistic Interpretability, Superposition, Grokking]
- On Memorization vs. Generalization: "During the plateau before grokking, the network is often learning a generalizable algorithm in the background, but the memorizing subnetwork dominates the output until the algorithm becomes sufficiently robust." — Source: [Neel Nanda's Blog]
- On Modular Arithmetic Tasks: "We use modular addition to study grokking because the mathematical structure is clear, allowing us to find the exact discrete Fourier transforms the model learns to use." — Source: [Progress Measures for Grokking]
- On Weight Decay: "Weight decay acts as a critical forcing function in grokking. It penalizes the inefficient memorization circuits, forcing the model to eventually rely on the simpler, generalizable algorithm." — Source: [Alignment Forum]
- On Internal Progress Measures: "Even when training loss looks flat, you can track the model's internal representations. The formation of algorithmic circuits happens smoothly, completely invisible to standard loss metrics." — Source: [LessWrong]
- On the Danger of Deception: "Understanding grokking matters because if a model can suddenly learn to generalize after looking flat, it could suddenly develop dangerous capabilities that we fail to anticipate." — Source: [80,000 Hours Podcast]
- On Circuit Competition: "Training dynamics often involve a race between different algorithms. The memorization circuit learns quickly but scales poorly; the generalization circuit learns slowly but is highly efficient." — Source: [Neel Nanda - Mechanistic Interpretability, Superposition, Grokking]
- On Predictive Power: "By tracking the internal progress measures of grokking, we can actually predict when the sudden jump in test accuracy will occur before it happens." — Source: [Progress Measures for Grokking]
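
The Fourier-based algorithm mentioned in the modular addition item can be written out by hand. The sketch below assumes a trig-identity formulation: represent a and b as sines and cosines at a few frequencies, combine them into cos(w(a+b)) and sin(w(a+b)), then score each candidate c by cos(w(a+b-c)). The modulus `P`, the frequency list `FREQS`, and the function name are illustrative choices, not values read off a trained model.

```python
import math

P = 113             # illustrative prime modulus
FREQS = [1, 5, 17]  # hypothetical "key frequencies"


def fourier_mod_add(a, b, p=P, freqs=FREQS):
    """Compute (a + b) % p using only sines, cosines, and an argmax."""
    best_c, best_logit = 0, -math.inf
    for c in range(p):
        logit = 0.0
        for k in freqs:
            w = 2 * math.pi * k / p
            # Angle-addition identities give cos(w(a+b)) and sin(w(a+b)).
            cos_ab = math.cos(w * a) * math.cos(w * b) - math.sin(w * a) * math.sin(w * b)
            sin_ab = math.sin(w * a) * math.cos(w * b) + math.cos(w * a) * math.sin(w * b)
            # cos(w(a+b-c)) peaks at 1 exactly when c == (a + b) mod p.
            logit += cos_ab * math.cos(w * c) + sin_ab * math.sin(w * c)
        if logit > best_logit:
            best_c, best_logit = c, logit
    return best_c


result = fourier_mod_add(100, 50)
print(result, (100 + 50) % P)
```

Every frequency's term reaches its maximum at the correct answer at once, so summing over frequencies makes the right c stand out while the wrong candidates partially cancel.
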
Part 3: Emergent World Representations
- On Othello-GPT: "We trained a model simply to predict the next move in Othello, feeding it only text transcripts of games. It didn't just learn statistics; it learned an internal map of the board." — Source: [Othello-GPT Research]
- On Linear Representations: "The board state in Othello-GPT is not encoded in some deeply convoluted way; it is represented linearly, meaning we can use simple vector mathematics to read off the state of the board." — Source: [LessWrong]
- On Causal Interventions: "To prove the model uses this internal map, we intervened on the activations, flipping the internal representation of a tile from white to black. The model immediately updated its next-move predictions to match the new, fake board state." — Source: [Neel Nanda's Blog]
- On the Stochastic Parrot Debate: "Results from Othello-GPT push back against the idea that language models are just stochastic parrots. They demonstrate that sequence modeling naturally incentivizes the creation of causal world models." — Source: [Neel Nanda - Mechanistic Interpretability, Superposition, Grokking]
- On Concept Emergence: "The network was never told what a board is, or what the rules are. The geometry of the game emerged entirely as a byproduct of minimizing next-token prediction error." — Source: [Othello-GPT Research]
- On Feature Geometry: "When you extract the internal representations of the tiles, you find that the model organizes them spatially. It understands which tiles are adjacent to each other." — Source: [Alignment Forum]
- On Probe Training: "Using linear probes allows us to project complex, high-dimensional neural activations into a low-dimensional space that corresponds to human concepts like whether a specific square is occupied." — Source: [LessWrong]
- On Extrapolating to LLMs: "If a small model builds a world model to predict Othello, it is highly probable that large language models build complex, internal representations of physics, society, and logic to predict human text." — Source: [80,000 Hours Podcast]
- On the Value of Proxies: "Othello-GPT serves as a perfect proxy environment. It is complex enough to require a world model, but simple enough that we can completely verify the model's internal logic." — Source: [AXRP Podcast]
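
The linear-probe idea above can be sketched on synthetic data. This is a minimal illustration under stated assumptions, not the Othello-GPT setup: the activations are fake 8-dimensional vectors, the "square is occupied" direction is invented, and the probe is a simple mean-difference probe standing in for the logistic-regression probes more commonly trained in practice.

```python
import math
import random

rng = random.Random(0)
DIM = 8
# Hypothetical unit-length direction that encodes "square is occupied".
direction = [(1 if i % 2 == 0 else -1) / math.sqrt(DIM) for i in range(DIM)]


def sample(n):
    """Fake activations: +/- 2 * direction plus small per-coordinate noise."""
    data = []
    for _ in range(n):
        y = rng.choice([-1, 1])  # -1 = empty, +1 = occupied
        x = [2.0 * y * d + rng.uniform(-0.2, 0.2) for d in direction]
        data.append((x, y))
    return data


train, test = sample(200), sample(100)

# Mean-difference probe: average the label-signed activations.
probe = [sum(y * x[i] for x, y in train) / len(train) for i in range(DIM)]


def predict(x):
    return 1 if sum(w * xi for w, xi in zip(probe, x)) > 0 else -1


accuracy = sum(predict(x) == y for x, y in test) / len(test)
probe_norm = math.sqrt(sum(w * w for w in probe))
alignment = sum(w * d for w, d in zip(probe, direction)) / probe_norm
print(accuracy, alignment)
```

High held-out accuracy shows the concept is linearly readable, and `alignment` (the cosine between the probe and the planted direction) shows the probe recovered the direction itself. A probe alone only shows the information is present; the causal intervention described above is what shows the model uses it.
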
Part 4: Transformer Circuits and Induction Heads
- On Transformer Anatomy: "To understand transformers, you have to stop thinking of them as black boxes and start viewing them as computational graphs passing residual streams between independent attention heads and MLP layers." — Source: [Transformer Circuits Thread]
- On the Residual Stream: "Think of the residual stream as a shared whiteboard. Every layer reads from the whiteboard, performs a calculation, and writes its result back for subsequent layers to use." — Source: [Neel Nanda's Blog]
- On Attention Heads: "Attention heads function to move information across sequence positions. They look at the past, figure out what context is relevant, and copy that information to the current token." — Source: [A Mathematical Framework for Transformer Circuits]
- On Induction Heads: "An induction head is a specific circuit that performs sequence continuation. It looks for the pattern [A][B], and when it sees [A] again, it predicts [B]. It is the engine of in-context learning." — Source: [Transformer Circuits Thread]
- On the Phase Change of Learning: "During training, transformers undergo a sudden phase change where they rapidly acquire induction heads, which directly corresponds to a massive leap in their ability to perform few-shot learning." — Source: [Alignment Forum]
- On Compositionality: "Attention heads do not work in isolation. A head in layer two might compose its query using the output of a head in layer one, creating complex, multi-step search algorithms." — Source: [A Mathematical Framework for Transformer Circuits]
- On Key-Query Circuits: "The interaction between a key and a query in an attention mechanism is essentially a pattern-matching function, determining how strongly one token should attend to another." — Source: [Neel Nanda's Blog]
- On Output-Value Circuits: "Once an attention head decides where to look, the output-value circuit determines what specific information is extracted and written into the residual stream." — Source: [Transformer Circuits Thread]
- On Translation vs. Memorization: "Induction heads explain how models can translate languages or execute code snippets they have never seen in training; they are matching the abstract structure of the prompt." — Source: [Neel Nanda - Mechanistic Interpretability, Superposition, Grokking]
- On QK and OV Matrices: "Decomposing attention into QK (routing) and OV (information movement) matrices allows us to mathematically isolate the two distinct jobs of an attention head." — Source: [A Mathematical Framework for Transformer Circuits]
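
The QK/OV split and the induction pattern described above can be combined in one idealised head. The sketch below is an assumption-heavy toy, not a trained circuit: tokens are integers, embeddings are one-hot, both the QK and OV maps are identities, and the previous-token information that a layer-one head would normally supply is read directly from the sequence. The `sharpness` parameter is a hypothetical scale on the attention scores.

```python
import math


def one_hot(i, n):
    return [1.0 if j == i else 0.0 for j in range(n)]


def induction_attention(tokens, vocab_size, sharpness=10.0):
    """Next-token scores from one idealised induction head (needs >= 2 tokens)."""
    cur = len(tokens) - 1
    query = one_hot(tokens[cur], vocab_size)  # QK side: look for the current token
    positions = range(1, cur + 1)             # causal: only positions seen so far
    scores = []
    for pos in positions:
        # Key at `pos` carries the identity of the token just before `pos`.
        key = one_hot(tokens[pos - 1], vocab_size)
        scores.append(sharpness * sum(q * k for q, k in zip(query, key)))
    # Softmax over the scores gives the attention pattern.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    attn = [e / total for e in exps]
    # OV side: copy the attended token's identity into the output.
    out = [0.0] * vocab_size
    for weight, pos in zip(attn, positions):
        out[tokens[pos]] += weight
    return out


tokens = [3, 7, 1, 4, 3]  # [A][B] ... [A] with A = 3, B = 7
out = induction_attention(tokens, vocab_size=8)
prediction = max(range(8), key=lambda t: out[t])
print(prediction)
```

The QK step decides where to attend (the position right after the earlier 3) and the OV step decides what to copy (the 7 found there), so the head continues the sequence with a token it was never trained to associate with 3.
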
Part 5: AI Safety and Alignment
- On the Alignment Problem: "The core issue of alignment is that we optimize models for proxy metrics, like predicting the next word or maximizing human approval, rather than the actual complex values we want them to hold." — Source: [80,000 Hours Podcast]
- On Deceptive Alignment: "A major fear is that a highly capable model might realize it is being evaluated and act cooperatively during testing, only to defect when deployed. Interpretability is our best tool to detect this." — Source: [Alignment Forum]
- On Evaluation Limitations: "Behavioral evaluations will eventually break down. You cannot test if an AI is lying to you just by asking it questions; you have to look at the internal neural activations." — Source: [Future of Life Institute]
- On Microscope AI: "Instead of relying on AI to act as autonomous agents, we should build models whose internal states we can read perfectly, using them as microscopes to understand the world without giving them agency." — Source: [LessWrong]
- On Auditing Frontier Models: "Before releasing a GPT-5 or equivalent, we need the capacity to run mechanistic audits, verifying that the algorithms it uses for planning do not contain malicious intent." — Source: [AXRP Podcast]
- On the Difficulty of the Task: "Reverse-engineering a network with trillions of parameters might be the hardest technical problem humanity has ever attempted, but the stakes make it strictly necessary." — Source: [Neel Nanda - Mechanistic Interpretability, Superposition, Grokking]
- On Sycophancy: "Models learn to be sycophantic, telling users what they want to hear rather than the truth. By finding the specific circuits responsible for sycophancy, we can surgically edit them out." — Source: [Neel Nanda's Blog]
- On Adversarial Robustness: "Standard machine learning is incredibly brittle to adversarial inputs. If we understand the mechanisms of a model, we can theoretically prove its robustness to specific types of attacks." — Source: [Alignment Forum]
- On Capability Elicitation: "It is often unclear what a model is actually capable of. Mechanistic interpretability can help us map the upper bounds of a model's competence by finding unused capability circuits." — Source: [80,000 Hours Podcast]
- On Open Source Safety: "While open-sourcing small interpretability tools accelerates research, open-sourcing frontier capabilities without knowing how to align them presents a massive societal risk." — Source: [Future of Life Institute]
Part 6: The Mindset of a Researcher
- On Getting Started: "The biggest barrier to entering AI safety is the illusion of prerequisites. You don't need a PhD; you need to start replicating papers and playing with models in Colab." — Source: [Neel Nanda's Blog]
- On Maximizing Luck: "You have to increase your luck surface area. Write up your messy thoughts, post them on the Alignment Forum, and talk to people. Serendipity is a strategy." — Source: [80,000 Hours Podcast]
- On Truth-Seeking: "Good research requires aggressive truth-seeking. You must constantly try to break your own hypotheses rather than searching for evidence that confirms your preferred circuit exists." — Source: [Alignment Forum]
- On Speed: "Moving fast is a huge advantage. Write dirty code to test an idea in an afternoon. If it works, you can write clean code later. Don't over-engineer exploratory work." — Source: [LessWrong]
- On Choosing Problems: "Work on problems that seem tractable but ignored. If everyone is working on a specific paradigm, find the weird anomaly in a toy model and drill down until you understand it completely." — Source: [Neel Nanda's Blog]
- On Documentation: "Documenting your failures is often as valuable as publishing your successes. The field needs to know which interpretability techniques lead to dead ends." — Source: [Alignment Forum]
- On Open Problems: "I maintain lists of concrete open problems because the bottleneck in AI safety isn't talent; it's clear, actionable projects that junior researchers can execute." — Source: [Neel Nanda's Blog]
- On Epistemic Humility: "You will be wrong constantly when reverse-engineering models. The network is always cleverer and more confusing than your mental model of it." — Source: [AXRP Podcast]
- On Mentorship: "Scaling up the field requires deep mentorship. We need to actively teach people the tacit knowledge of debugging neural networks that isn't written in any paper." — Source: [80,000 Hours Podcast]
Part 7: Superposition and Polysemanticity
- On Polysemanticity: "A single neuron does not correspond to a single concept. Due to polysemanticity, one neuron might fire for dogs, the French language, and the color blue, making raw activations uninterpretable." — Source: [Transformer Circuits Thread]
- On Superposition: "Superposition is the network's way of compressing information. It packs more features into the network than there are dimensions by storing them in almost-orthogonal vectors." — Source: [Alignment Forum]
- On the Curse of Dimensionality: "Because models use superposition to represent thousands of concepts in a few hundred dimensions, we cannot just analyze individual neurons; we have to analyze directions in activation space." — Source: [Neel Nanda - Mechanistic Interpretability, Superposition, Grokking]
- On Sparsity: "Features in the real world are sparse; most concepts are irrelevant to most texts. Neural networks exploit this sparsity to cram excess features into the residual stream via superposition." — Source: [LessWrong]
- On Dictionary Learning: "We can resolve superposition using sparse autoencoders, a form of dictionary learning that expands the dense, compressed activations into a larger space of human-interpretable features." — Source: [Neel Nanda's Blog]
- On the Monosemantic Ideal: "The goal of dealing with superposition is to find monosemantic features—directions in the network that map strictly to one, and only one, conceptual variable." — Source: [Alignment Forum]
- On Interference: "The cost of superposition is interference. When the model activates a feature, it creates noise for other features. The model learns to manage this interference to avoid catastrophic errors." — Source: [Transformer Circuits Thread]
- On Feature Splitting: "As networks grow larger or train longer, they often split broad, polysemantic features into sharper, more granular monosemantic concepts." — Source: [LessWrong]
- On the Privileged Basis: "Superposition implies that the standard neuron basis is not privileged. The actual computation happens in a hidden basis of features that are rotated relative to the neurons." — Source: [AXRP Podcast]
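
Superposition and its interference cost can be seen in a two-dimensional toy. The sketch below is illustrative only: it packs five features into two dimensions by spacing their directions evenly around a circle, then reads each feature back with a dot product, a bias, and a ReLU. The bias value and the helper names are hypothetical choices.

```python
import math

N_FEATURES = 5
BIAS = -0.35  # hypothetical threshold that filters out small interference

# Five unit vectors spaced evenly around a circle: 5 features, 2 dimensions.
directions = [
    (math.cos(2 * math.pi * i / N_FEATURES), math.sin(2 * math.pi * i / N_FEATURES))
    for i in range(N_FEATURES)
]


def encode(features):
    """Compress a 5-dimensional feature vector into 2 dimensions."""
    return (
        sum(f * d[0] for f, d in zip(features, directions)),
        sum(f * d[1] for f, d in zip(features, directions)),
    )


def decode(hidden):
    """Read each feature back out with a dot product, bias, and ReLU."""
    return [max(0.0, hidden[0] * d[0] + hidden[1] * d[1] + BIAS) for d in directions]


def active(readout):
    return {i for i, value in enumerate(readout) if value > 0}


sparse = active(decode(encode([0, 0, 1, 0, 0])))  # one feature on
dense = active(decode(encode([1, 0, 1, 0, 0])))   # two features on at once
print(sparse, dense)
```

With a single active feature the readout recovers it cleanly. With features 0 and 2 both active, their directions partly cancel and the readout reports feature 1 instead, which is the interference cost: the compression only works while features rarely co-occur.
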
Part 8: Future Trajectories in AI
- On the Pacing Problem: "Capabilities are advancing far faster than our ability to understand them. The fundamental challenge of the next decade is closing the gap between AI power and AI interpretability." — Source: [80,000 Hours Podcast]
- On Automated Interpretability: "We cannot rely on humans to stare at attention heads forever. We must use language models to automatically generate and test hypotheses about the internal circuits of other models." — Source: [Alignment Forum]
- On AGI Timelines: "We should take short timelines seriously. If AGI is developed in the next decade, we need tools ready today that can verify whether a system is deceiving its operators." — Source: [Future of Life Institute]
- On Scalable Oversight: "Mechanistic interpretability will eventually merge with scalable oversight. We will use mechanistic proofs to verify the reward signals that we use to train superintelligent models." — Source: [LessWrong]
- On Paradigm Shifts: "Current architectures like Transformers might be replaced by State Space Models or something entirely new, but the fundamental methodologies of reverse-engineering will remain necessary." — Source: [Neel Nanda's Blog]
- On Model Editing: "In the future, debugging an AI won't mean retraining it. It will mean locating the specific vector that encodes an unwanted bias and mathematically subtracting it from the model's weights." — Source: [Neel Nanda - Mechanistic Interpretability, Superposition, Grokking]
- On Structural Risks: "Even if a model is not explicitly deceptive, its internal structure might incentivize power-seeking behaviors. Understanding these structural incentives is crucial for long-term safety." — Source: [AXRP Podcast]
- On Multi-Agent Dynamics: "The next frontier is understanding how representations change when models interact with each other, as game theory introduces entirely new incentives for deception." — Source: [Alignment Forum]
- On the Ultimate Hope: "The goal isn't just to avert catastrophe. If we can truly read the minds of AI systems, we can safely deploy them to solve disease, energy, and the hardest problems of the century." — Source: [80,000 Hours Podcast]
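
One simple reading of "subtracting a vector from the weights" in the model editing item is a projection: remove the component of a weight matrix's output that lies along an unwanted direction. The sketch below assumes that reading. The matrix `W`, the `bias_direction` vector, and the function names are hypothetical, and practical model editing involves considerably more than this.

```python
import math


def remove_direction(weights, direction):
    """Return W' = W - d d^T W, so that W' x has no component along d."""
    norm = math.sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction]
    n_out, n_in = len(weights), len(weights[0])
    # d^T W: how strongly each input column writes along the direction.
    proj = [sum(unit[r] * weights[r][c] for r in range(n_out)) for c in range(n_in)]
    return [
        [weights[r][c] - unit[r] * proj[c] for c in range(n_in)]
        for r in range(n_out)
    ]


def matvec(weights, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


W = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]  # hypothetical 3x2 weight matrix
bias_direction = [1.0, 0.0, 1.0]           # hypothetical unwanted direction

edited = remove_direction(W, bias_direction)
output = matvec(edited, [0.7, -0.2])
component = sum(o * d for o, d in zip(output, bias_direction))
print(component)
```

After the edit, the output's component along `bias_direction` is zero up to floating-point error for any input, while the row orthogonal to that direction is left untouched.
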