Microsoft Open-Sources VibeVoice—A Game-Changer in TTS

david 02/09/2025

VibeVoice is Microsoft’s newly open-sourced text-to-speech (TTS) model. It can synthesize up to 90 minutes of audio, clone up to 4 different speakers, and supports both English and Chinese.

🔗 Project link: https://github.com/microsoft/VibeVoice

Why It Stands Out

VibeVoice comes in three models, two of which are already released:

  • 1.5B model: generates up to 90 minutes of audio and supports a 64k context length.
  • 7B model: generates up to 45 minutes of audio and supports a 32k context length.
  • Streaming model: not yet released, but will be available later.
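
The two released models share the same ratio of context length to maximum audio duration, which a quick calculation makes visible. The sketch below uses only the figures quoted above. Reading "k" as 1024 tokens is an assumption, and the result is an upper bound on the per-second budget, since the context window also has to hold the text script and any voice samples.

```python
def context_tokens_per_audio_second(context_tokens: int, max_minutes: float) -> float:
    """Context-window tokens available per second of generated audio."""
    return context_tokens / (max_minutes * 60)


# Figures quoted in the list above; "k" is read as 1024 tokens (assumption).
MODELS = {
    "1.5B": (64 * 1024, 90),  # (context length in tokens, max minutes of audio)
    "7B": (32 * 1024, 45),
}

budgets = {
    name: context_tokens_per_audio_second(context, minutes)
    for name, (context, minutes) in MODELS.items()
}

for name, budget in budgets.items():
    print(f"{name}: {budget:.2f} context tokens per second of audio")
```

Both models come out at roughly 12 context tokens per second of audio, so halving the context length from 64k to 32k lines up exactly with halving the maximum duration from 90 to 45 minutes.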

With up to 90 minutes of audio (on the 1.5B model) and 4 distinct speakers, VibeVoice is well suited to use cases such as AI-generated podcasts.
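
Before sending a podcast script to the model, it can help to check that it stays within these limits. The following is a minimal sketch, not part of VibeVoice itself. The `Speaker N:` line format, the `check_script` helper, and the 150 words-per-minute speaking rate are all assumptions made for illustration; only the 4-speaker and 90-minute limits come from the article.

```python
import re

MAX_SPEAKERS = 4         # limit quoted in the article
MAX_MINUTES = 90         # limit quoted for the 1.5B model
WORDS_PER_MINUTE = 150   # assumption: rough conversational speaking rate

# Hypothetical script format: one "Speaker N: text" turn per line.
LINE_RE = re.compile(r"^\s*(Speaker \d+)\s*:\s*(.+)$")


def check_script(script: str) -> dict:
    """Count distinct speakers and estimate spoken duration for a script."""
    speakers = set()
    words = 0
    for line in script.splitlines():
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip blank or unlabelled lines
        speakers.add(match.group(1))
        words += len(match.group(2).split())
    minutes = words / WORDS_PER_MINUTE
    return {
        "speakers": len(speakers),
        "estimated_minutes": minutes,
        "fits": len(speakers) <= MAX_SPEAKERS and minutes <= MAX_MINUTES,
    }


demo = """Speaker 1: Welcome to the show.
Speaker 2: Thanks for having me.
Speaker 1: Let's get started."""

result = check_script(demo)
print(result["speakers"], result["fits"])
```

The duration estimate is deliberately crude. Actual audio length depends on the voice, the pacing, and the language, so treat a result near the 90-minute limit as a warning rather than a guarantee.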

Hidden Surprises

  • When cloning audio, if the original sample includes background music, the generated audio may also include music.
  • Certain words (e.g., “welcome”) can trigger special effects.
  • Most unexpectedly, the model can sometimes make the speaker break into singing.

Project Overview

VibeVoice is a long-form conversational TTS system designed for expressive, extended, multi-speaker dialogues.
It introduces key innovations like a continuous speech tokenizer to overcome the challenges of traditional TTS systems in scalability, speaker consistency, and natural conversational flow.

Key capabilities:

  • Generate up to 90 minutes of continuous speech
  • Support for up to 4 unique speakers (compared to 1–2 in most models)
  • Maintain smooth delivery with rich emotional expression

Video Demo

  • English
  • Chinese
  • Cross-Lingual
  • Spontaneous Singing
  • Long Conversation with 4 People