Post methodology: Claude 4.0 via custom Dust assistant @TDep-SubstackPost with the system prompt: Please read the text of the podcast transcript in the prompt and write a short post that summarizes the main points and incorporates any recent news articles, substack posts or X posts that provide helpful context for the interview. Please make the post as concise as possible and avoid academic language or footnotes. please put any linked articles or tweets inline in the text. Please refer to Podcast guests by their first names after the initial mention. Light editing and reformatting for the Substack editor.
ElevenLabs is changing the game for voice tech. In this episode, co-founder Mati Staniszewski explained how a childhood annoyance with monotonous dubbed films in Poland sparked a quest for natural, expressive text-to-speech. By staying laser-focused on audio—capturing emotion, context, and nuance—the team outpaced many broader AI models that were trying to do everything at once.
Their breakthrough came from understanding that audio presents unique challenges. Unlike text models that predict the next token, voice AI must capture subtle elements like sarcasm, timing, and emotional context. "What a wonderful day" said sincerely sounds nothing like the same words said sarcastically: the entire delivery changes. This contextual understanding became their secret weapon.
The company's viral moments tell the story of their progress: from beta users copying entire books into their tiny text box to create audiobooks, to releasing the first AI that could laugh, to recent hits like Harry Potter by Balenciaga and Darth Vader's voice in Fortnite.
Mati envisions voice becoming the default interface for technology—transforming education with personalized tutors, enabling seamless cross-language communication, and powering everything from healthcare calls to customer support. The company is racing toward what he calls passing the “Turing test” for voice, where conversations with AI agents become indistinguishable from human interaction.
They're also tackling the dark side: impersonation risks. Every piece of content generated through ElevenLabs can be traced back to its creator, and they're developing detection models to identify AI-generated voice content across platforms.
Building from London with a fully remote team, ElevenLabs assembled top audio researchers globally—a smart move given the limited talent pool in voice AI. Their approach of keeping researchers close to deployment means innovations quickly reach users, creating a feedback loop that accelerates development.
The next frontier? Moving from their current “cascade” model (separate speech-to-text, language processing, and text-to-speech components) to a truly integrated duplex system that can handle real-time conversation with all the interruptions, emotions and nuances of human dialogue. Mati believes this could happen as early as this year.
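To make the "cascade" idea concrete, here is a minimal sketch in Python. It is not ElevenLabs' implementation or API: the class name, the three stage functions and the stand-in stubs are all hypothetical, chosen only to show three separate components chained one after another.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CascadePipeline:
    """Hypothetical cascade: three independent stages run in sequence."""
    speech_to_text: Callable[[bytes], str]   # audio in -> transcript
    generate_reply: Callable[[str], str]     # transcript -> reply text
    text_to_speech: Callable[[str], bytes]   # reply text -> audio out

    def respond(self, audio_in: bytes) -> bytes:
        # Each stage waits for the previous one to finish.
        transcript = self.speech_to_text(audio_in)
        reply_text = self.generate_reply(transcript)
        return self.text_to_speech(reply_text)


# Stand-in stages (assumptions for illustration only; a real system
# would call actual speech and language models here).
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = CascadePipeline(fake_stt, fake_llm, fake_tts)
reply_audio = pipeline.respond(b"hello")
```

In this sketch the only thing passed between stages is plain text, so anything carried by the speaker's tone has to be re-inferred downstream. That hand-off is one plausible reason an integrated duplex system is attractive for handling interruptions and emotion.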
As voice agents proliferate across industries, ElevenLabs is positioning itself not just as a voice provider, but as the infrastructure enabling a future where talking to technology feels as natural as talking to a friend.
Hosted by Pat Grady
Mentioned in this episode:
Attention Is All You Need: The original Transformers paper
Tortoise-tts: Open-source text-to-speech model that was a starting point for ElevenLabs (which now maintains a v2)
Harry Potter by Balenciaga: ElevenLabs’ first big viral moment from 2023
The first AI that can laugh: 2022 blog post backing up ElevenLabs’ claim of laughter (it got better in v3)
Darth Vader's voice in Fortnite: ElevenLabs used actual voice clips provided by James Earl Jones before he died
Lex Fridman interviews Prime Minister Modi: ElevenLabs enabled Fridman to speak in Hindi and Modi to speak in English.
Time Person of the Year 2024: ElevenLabs-powered experiment with “conversational journalism”
Iconic Voices: Richard Feynman, Deepak Chopra, Maya Angelou and more, available in the ElevenLabs reader app
SIP trunking: a method of delivering voice, video, and other unified communications over the internet using the Session Initiation Protocol (SIP)
Genesys: Leading enterprise CX platform for agentic AI
Hitchhiker’s Guide to the Galaxy: Comedy/science-fiction series by Douglas Adams that contains the concept of the Babel Fish instantaneous translator, cited by Mati
FYI: communication and productivity app for creatives that Mati uses, founded by will.i.am
Lovable: prototyping app that Mati loves