Post methodology: Claude 4.0 via custom Dust assistant @TDep-SubstackPost with the system prompt: "Please read the text of the podcast transcript in the prompt and write a short post that summarizes the main points and incorporates any recent news articles, Substack posts or X posts that provide helpful context for the interview. Please make the post as concise as possible and avoid academic language or footnotes. Please put any linked articles or tweets inline in the text. Please refer to podcast guests by their first names after the initial mention." Light editing and reformatting for the Substack editor.
AI models today are essentially black boxes—we throw data at them, run "gradient descent incantations," and hope for the best. But as Eric Ho, founder and CEO of Goodfire, explains in this conversation, that approach won't cut it as AI takes on mission-critical roles like managing power grids or making investment decisions.
Eric's company is pioneering mechanistic interpretability—the ability to peer inside neural networks and understand how individual neurons, circuits, and concepts work together. Think of it like the difference between testing a drug's effects versus understanding its biochemical mechanisms, or scaling steam engines versus understanding thermodynamics.
The stakes are high. Recent research on emergent misalignment shows that fine-tuning a model on a narrow task, in this case writing insecure code, can cause it to start "wanting to enslave humanity." These aren't intentional biases; they're emergent behaviors arising from circuits we don't understand. Traditional approaches like prompt engineering and RL fine-tuning are powerful but fundamentally blind to these hidden connections.
Goodfire has developed techniques to "unscramble" the superposition problem—where each neuron encodes multiple overlapping concepts—into clean, interpretable features. They've already demonstrated real applications, from their "Paint with Ember" demo that lets users manipulate image concepts like adding dragons or pyramids to specific canvas areas, to partnerships with Arc Institute analyzing DNA foundation models that may reveal novel biological insights.
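For a concrete picture of what "unscrambling" superposition involves, the dictionary-learning approach popularized in the Towards Monosemanticity line of work (linked below) trains a sparse autoencoder on a model's activations so that each activation is reconstructed from a small number of learned feature directions. The sketch below is illustrative only; the sizes, hyperparameters and training loop are assumptions, not Goodfire's implementation.

```python
# Minimal sparse-autoencoder sketch of the dictionary-learning idea behind
# "unscrambling" superposition (illustrative only, not Goodfire's code).
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        # Overcomplete dictionary: many more features than activation dimensions,
        # because superposition packs many concepts into each neuron.
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, activations: torch.Tensor):
        # ReLU keeps features non-negative; the L1 penalty below keeps them sparse,
        # so each activation is explained by a handful of candidate features.
        features = torch.relu(self.encoder(activations))
        reconstruction = self.decoder(features)
        return reconstruction, features

# Toy training loop on activations captured from a model's residual stream.
d_model, n_features = 512, 4096           # assumed sizes, for illustration
sae = SparseAutoencoder(d_model, n_features)
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3                           # sparsity strength (hyperparameter)

activations = torch.randn(1024, d_model)  # stand-in for real captured activations
for step in range(100):
    recon, feats = sae(activations)
    loss = ((recon - activations) ** 2).mean() + l1_coeff * feats.abs().mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Each column of the trained decoder is then a candidate feature direction, which auto-interpretability tools can try to label in plain language.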
The vision Eric describes moves from growing AI "like a wild tree" to shaping it "like bonsai": intentionally designing and pruning neural networks while preserving their power. Anthropic, which made Goodfire its first-ever investment, clearly sees the urgency. Dario Amodei recently published "The Urgency of Interpretability," framing this as a race to understand AI before we have "a country of geniuses in a data center."
As open-source models proliferate globally—including from countries with different values—the ability to audit, understand, and surgically edit model behaviors becomes critical infrastructure. Eric boldly predicts they'll "figure it all out" by 2028, just as AI reaches transformative scale.
For founders building on AI, this is no longer just academic research—it's the foundation for trustworthy, controllable systems that can be deployed safely in the real world.
Hosted by Sonya Huang and Roelof Botha, Sequoia Capital
Mentioned in this episode:
Mech interp: Mechanistic interpretability, list of important papers here
Phineas Gage: 19th century railroad construction foreman who survived an 1848 accident that destroyed much of his brain's left frontal lobe. Became a famous case study in neuroscience.
Human Genome Project: Effort from 1990 to 2003 to generate the first sequence of the human genome, which accelerated the study of human biology
Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
Zoom In: An Introduction to Circuits: First important mechanistic interpretability paper from OpenAI in 2020
Superposition: Concept borrowed from physics; in interpretability it describes how neural networks represent more concepts than they have neurons, effectively simulating a larger network
Apollo Research: AI safety company that designs AI model evaluations and conducts interpretability research
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning. 2023 Anthropic paper that uses a sparse autoencoder to extract interpretable features; followed by Scaling Monosemanticity
Under the Hood of a Reasoning Model: 2025 Goodfire paper that interprets DeepSeek’s reasoning model R1
Auto-interpretability: The ability to use LLMs to automatically write explanations for the behavior of neurons in LLMs
Interpreting Evo 2: Arc Institute's Next-Generation Genomic Foundation Model. (see episode with Arc co-founder Patrick Hsu)
Paint with Ember: Canvas interface from Goodfire that lets you steer an LLM’s visual output in real time (paper here)
Model diffing: Interpreting how a model differs from checkpoint to checkpoint during finetuning
Feature steering: The ability to change the style of LLM output by up- or down-weighting features (e.g. talking like a pirate vs. factual information about the Andromeda Galaxy); see the sketch after this list
Weight-based interpretability: Method for directly decomposing neural network parameters into mechanistic components, instead of analyzing learned features
The Urgency of Interpretability: Essay by Anthropic founder Dario Amodei
On the Biology of a Large Language Model: 2025 paper from Anthropic's interpretability team tracing circuits inside a production language model
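To make the feature steering entry above concrete, here is a minimal sketch of the general idea: take a feature direction (for example, a column of a sparse autoencoder's decoder) and add a scaled copy of it to a layer's activations at inference time. The hook mechanics and names below are assumptions for illustration, not Goodfire's Ember API.

```python
# Illustrative feature-steering sketch: shift a transformer layer's output along
# one feature direction at inference time. Names and indices are assumptions.
import torch

def make_steering_hook(feature_direction: torch.Tensor, strength: float):
    """Return a PyTorch forward hook that adds a scaled feature direction."""
    def hook(module, inputs, output):
        # Many transformer blocks return a tuple; the hidden states come first.
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + strength * feature_direction.to(hidden.dtype)
        return (steered, *output[1:]) if isinstance(output, tuple) else steered
    return hook

# Usage sketch with a hypothetical loaded model and trained sparse autoencoder:
# direction = sae.decoder.weight[:, pirate_feature_idx]  # one column = one feature
# handle = model.transformer.h[10].register_forward_hook(
#     make_steering_hook(direction, strength=8.0))
# ... generate text, then handle.remove() to restore normal behavior.
```

Positive strengths up-weight the feature (more pirate talk); negative strengths suppress it.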