Microsoft Open-Sources VibeVoice: A Game-Changer in TTS

VibeVoice is Microsoft’s newly open-sourced text-to-speech (TTS) model. It can synthesize up to 90 minutes of audio, clone up to 4 different speakers, and handle both English and Chinese.

🔗 Project link: https://github.com/microsoft/VibeVoice

Why It Stands Out

This TTS is truly impressive:

There are three models in total, two of which have already been released:
- 1.5B model: generates up to 90 minutes of audio with a 64k-token context length.
- 7B model: generates up to 45 minutes of audio with a 32k-token context length.
The third, a streaming model, has not been released yet but is planned for later.
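
As a rough sanity check on these figures, the sketch below estimates how many speech frames each duration would occupy. It rests on assumptions that are not stated in this article: a tokenizer frame rate of 7.5 Hz (the figure given in the project README), one context position per speech frame, and "64k"/"32k" meaning 64 × 1024 and 32 × 1024 positions. The function name is illustrative only.

```python
# Assumption: speech is tokenized at 7.5 frames per second (README figure,
# not from this article), with one context position per frame.
FRAME_RATE_HZ = 7.5


def speech_frames(minutes: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Return the approximate number of speech frames for a given duration."""
    return round(minutes * 60 * frame_rate_hz)


# (model name, max audio minutes, assumed context size in positions)
models = [("1.5B", 90, 64 * 1024), ("7B", 45, 32 * 1024)]

for name, minutes, context in models:
    frames = speech_frames(minutes)
    share = frames / context
    print(f"{name}: {minutes} min ~ {frames} frames, "
          f"{share:.0%} of a {context}-position context")
```

Under these assumptions, the audio would fill roughly 62% of the context for both models, which would leave room for the script text and voice samples. That makes the pairing of 90 minutes with 64k and 45 minutes with 32k look consistent, though the exact accounting inside the model may differ.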

With up to 90 minutes of audio and 4 distinct speakers in a single generation, VibeVoice is well suited to use cases such as AI-generated podcasts.
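
To make the podcast use case concrete, here is a minimal sketch of preparing a multi-speaker script as plain text. The `Speaker N:` line format is an assumption modeled on the example scripts in the project repository, not a documented API, and `build_script` is a hypothetical helper. The only constraint taken from this article is the limit of 4 speakers.

```python
MAX_SPEAKERS = 4  # limit stated in this article


def build_script(turns: list[tuple[str, str]]) -> str:
    """Format (speaker_name, text) turns as 'Speaker N: text' lines.

    Hypothetical helper: speakers are numbered in order of first appearance,
    and a ValueError is raised if more than MAX_SPEAKERS distinct names occur.
    """
    speaker_ids: dict[str, int] = {}
    lines = []
    for name, text in turns:
        if name not in speaker_ids:
            if len(speaker_ids) == MAX_SPEAKERS:
                raise ValueError(f"more than {MAX_SPEAKERS} speakers: {name!r}")
            speaker_ids[name] = len(speaker_ids) + 1
        lines.append(f"Speaker {speaker_ids[name]}: {text.strip()}")
    return "\n".join(lines)


script = build_script([
    ("Alice", "Welcome to the show."),
    ("Bob", "Thanks for having me."),
    ("Alice", "Let's get started."),
])
print(script)
```

Checking the speaker count before generation is cheap and avoids discovering the limit only after a long synthesis run.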

Hidden Surprises

When cloning audio, if the original sample includes background music, the generated audio may also include music.
Certain words (e.g., “welcome”) can trigger unexpected sound effects.
Most unexpectedly, the model can sometimes make the speaker break into singing.

Project Overview

VibeVoice is a long-form conversational TTS system designed for expressive, extended, multi-speaker dialogues.
It introduces key innovations, such as a continuous speech tokenizer, to address the problems traditional TTS systems have with scalability, speaker consistency, and natural conversational flow.

Key capabilities: