VibeVoice is Microsoft’s newly open-sourced TTS model. It can synthesize up to 90 minutes of audio, clone up to 4 different speakers, and supports both English and Chinese.
🔗 Project link: https://github.com/microsoft/VibeVoice
Why It Stands Out
This TTS is truly impressive:
- Three models in total, with two already released:
  - 1.5B model: Generates up to 90 minutes of audio, supports a 64k context length.
  - 7B model: Generates up to 45 minutes of audio, supports a 32k context length.
  - The third, the streaming model, has not yet been released but will be available later.
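As a rough sanity check on the figures above, the context length and maximum duration can be turned into a token budget per second of audio. This is only a back-of-envelope sketch: it assumes "64k" means 64 × 1024 tokens and that the whole window is spent on audio, which overstates the budget because the script text shares the same window. The `MODELS` table and `tokens_per_second` helper are illustrative names, not part of VibeVoice.

```python
# Back-of-envelope: context tokens available per second of generated audio.
# Assumption: "64k" = 64 * 1024 tokens, and the full window is used for audio.
MODELS = {
    "1.5B": {"context_tokens": 64 * 1024, "max_minutes": 90},
    "7B": {"context_tokens": 32 * 1024, "max_minutes": 45},
}


def tokens_per_second(context_tokens: int, max_minutes: int) -> float:
    """Upper bound on context tokens per second of audio."""
    return context_tokens / (max_minutes * 60)


for name, spec in MODELS.items():
    rate = tokens_per_second(spec["context_tokens"], spec["max_minutes"])
    print(f"{name}: ~{rate:.1f} tokens per second of audio")
```

Both models work out to the same budget of roughly 12 tokens per second, so the 7B model's shorter maximum duration follows directly from its halved context length.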
With support for 90-minute audio and 4 distinct speakers, VibeVoice can fully power use cases such as AI podcasts.
Hidden Surprises
- When cloning audio, if the original sample includes background music, the generated audio may also include music.
- Certain words (e.g., “welcome”) can trigger special effects.
- Most unexpectedly, the model can sometimes make the speaker break into singing.
Project Overview
VibeVoice is a long-form conversational TTS system designed for expressive, extended, multi-speaker dialogues.
Its key innovations, such as a continuous speech tokenizer, address the challenges traditional TTS systems face in scalability, speaker consistency, and natural conversational flow.
Key capabilities:
- Generate up to 90 minutes of continuous speech
- Support for up to 4 unique speakers (compared to 1–2 in most models)
- Maintain smooth delivery with rich emotional expression
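When preparing a multi-speaker script, it can help to check the 4-speaker limit before sending anything to the model. The sketch below is a standalone pre-flight check, not VibeVoice's own API: the `Speaker <n>: <text>` line format, `parse_script`, and `MAX_SPEAKERS` are assumptions made for illustration, so consult the repository for the input format it actually expects.

```python
import re

MAX_SPEAKERS = 4  # limit stated in the article

# Hypothetical line format "Speaker <n>: <text>"; an assumption, not the official spec.
LINE_RE = re.compile(r"^Speaker (\d+):\s*(.+)$")


def parse_script(script: str) -> list[tuple[int, str]]:
    """Split a dialogue script into (speaker_id, text) turns and enforce the speaker limit."""
    turns = []
    for raw in script.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines between turns
        match = LINE_RE.match(line)
        if match is None:
            raise ValueError(f"Unrecognized line: {line!r}")
        turns.append((int(match.group(1)), match.group(2)))

    speakers = {speaker for speaker, _ in turns}
    if len(speakers) > MAX_SPEAKERS:
        raise ValueError(
            f"Script uses {len(speakers)} speakers; the limit is {MAX_SPEAKERS}"
        )
    return turns


demo = """
Speaker 1: Welcome to the show.
Speaker 2: Glad to be here.
"""
print(parse_script(demo))
```

A script with a fifth distinct speaker raises `ValueError`, which catches the problem before any audio is generated.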
Video Demo
- English
- Chinese
- Cross-Lingual
- Spontaneous Singing
- Long Conversation with 4 people