What is Text-to-Speech (TTS)?
Technology that converts written text into spoken audio. Modern TTS systems produce natural-sounding voices with emotion, pacing, and accent control. Puppetry offers 500+ AI voices across 65+ languages.
Related Terms
Neural Voice
A synthetic voice generated by deep neural networks (as opposed to older concatenative TTS). Neural voices sound significantly more natural, with proper intonation, breathing, and emotional range. Leading providers produce voices.
Voice Cloning
Creating a synthetic replica of a specific person's voice using AI. Users record a short sample (30 seconds to 5 minutes), and the AI learns to reproduce their speech patterns, tone, and accent. Used for personalized video content.
Speech-to-Text (STT)
The reverse of TTS: converting spoken audio into written text. Modern STT systems like OpenAI's Whisper handle accents, background noise, and many languages. Puppetry uses STT internally to align speech to visemes for accurate lip sync.
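As a rough illustration of aligning speech to visemes, the sketch below assumes an STT or forced-alignment step has already produced phonemes with start and end times. The phoneme table, the viseme names, and the `phonemes_to_visemes` helper are hypothetical and simplified for illustration; they are not Puppetry's actual pipeline or data.

```python
from dataclasses import dataclass

# Hypothetical, heavily simplified phoneme-to-viseme table
# (ARPAbet-style symbols; real systems use larger, tuned tables).
PHONEME_TO_VISEME = {
    "P": "closed", "B": "closed", "M": "closed",
    "F": "lip_teeth", "V": "lip_teeth",
    "AA": "open", "AE": "open", "AH": "open",
    "IY": "wide", "EH": "wide",
    "UW": "round", "OW": "round",
}


@dataclass
class TimedPhoneme:
    symbol: str   # phoneme label from the (assumed) alignment step
    start: float  # seconds
    end: float    # seconds


@dataclass
class VisemeFrame:
    viseme: str   # mouth shape to display
    start: float  # seconds
    end: float    # seconds


def phonemes_to_visemes(phonemes):
    """Map timed phonemes to mouth shapes, merging adjacent identical shapes."""
    frames = []
    for p in phonemes:
        # Unknown phonemes fall back to a neutral mouth shape.
        viseme = PHONEME_TO_VISEME.get(p.symbol, "neutral")
        last = frames[-1] if frames else None
        if last and last.viseme == viseme and abs(last.end - p.start) < 1e-6:
            # Same shape continues without a gap: extend the previous frame.
            last.end = p.end
        else:
            frames.append(VisemeFrame(viseme, p.start, p.end))
    return frames


# Example: the word "mama" as four timed phonemes.
mama = [
    TimedPhoneme("M", 0.0, 0.1),
    TimedPhoneme("AA", 0.1, 0.3),
    TimedPhoneme("M", 0.3, 0.4),
    TimedPhoneme("AA", 0.4, 0.6),
]
mama_frames = phonemes_to_visemes(mama)
```

Here `mama_frames` alternates between the "closed" and "open" mouth shapes, each carrying the timing of the phoneme it came from, which is the information a renderer would need to keep the lips in step with the audio.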
AI Dubbing
Automatically replacing the original audio in a video with a synthesized translation, while keeping mouth movement convincingly aligned. Puppetry supports AI dubbing across 65+ languages: paste a script in a new language and the lip sync re-renders to match the new audio.