What is Wav2Lip?
A neural network that generates accurate lip movements from audio. Given a face image or video and an audio waveform, it produces video frames whose mouth movements are synced to the speech; it achieves this by training the generator against a pre-trained lip-sync "expert" discriminator. It is known for robust sync accuracy across languages and accents.
Related Terms
Lip Sync / Lip Syncing
The process of matching mouth movements to audio speech. In AI video, lip-sync algorithms analyze audio waveforms and generate realistic mouth shapes frame by frame. Puppetry uses LivePortrait + Wav2Lip for production-quality lip sync across 65+ languages.
Viseme
The visual mouth shape that corresponds to a phoneme (a unit of sound). English uses roughly a dozen distinct visemes — for example, the "OO" lip-rounding shape covers several different phonemes that look the same on camera. AI lip-sync engines map audio to visemes frame by frame to produce believable mouth movement.
Phoneme
The smallest unit of sound that distinguishes one word from another in a language. Speech-recognition and lip-sync systems first decompose audio into phonemes, then map each phoneme to a viseme (mouth shape) for animation. English has roughly 44 phonemes.
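The phoneme-to-viseme step described above is essentially a many-to-one lookup: several phonemes collapse onto each mouth shape. A minimal Python sketch, using ARPAbet-style phoneme symbols and hypothetical viseme labels (real engines use larger, engine-specific tables):

```python
# Illustrative phoneme-to-viseme table. Phoneme symbols follow ARPAbet;
# the viseme labels ("MBP", "OO", ...) are made up for this sketch.
PHONEME_TO_VISEME = {
    # Bilabials: lips pressed together
    "P": "MBP", "B": "MBP", "M": "MBP",
    # Lip-rounding: the "OO" shape mentioned above
    "UW": "OO", "OW": "OO", "W": "OO",
    # Labiodentals: lower lip against upper teeth
    "F": "FV", "V": "FV",
    # Open vowels
    "AA": "AH", "AE": "AH", "AH": "AH",
}

def to_visemes(phonemes):
    """Map a phoneme sequence to mouth shapes, merging repeats."""
    shapes = [PHONEME_TO_VISEME.get(p, "REST") for p in phonemes]
    # Collapse consecutive duplicates: adjacent identical shapes
    # render as one held mouth pose.
    merged = [shapes[0]]
    for shape in shapes[1:]:
        if shape != merged[-1]:
            merged.append(shape)
    return merged

# "mom" → phonemes M AA M → three mouth poses
print(to_visemes(["M", "AA", "M"]))  # → ['MBP', 'AH', 'MBP']
```

Note how "P" and "B" map to the same pose: they are distinct phonemes but visually identical on camera, which is exactly why viseme sets are much smaller than phoneme sets.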
LivePortrait
An open-source AI model for portrait animation. It generates natural head movements, facial expressions, and eye blinks from a single photo. Combined with Wav2Lip for lip sync, it forms the core of Puppetry's animation pipeline.