coqui-ai-TTS Beginner’s Guide | A Deep Learning Toolkit for Text-to-Speech

terry 04/09/2025

1. Overview

coqui-ai-TTS is a maintained fork of the original Coqui TTS project, which is no longer developed upstream. Thanks to the Idiap Research Institute, the features of Coqui TTS remain available through this fork.

Key Features:

  • Pretrained models in 1100+ languages
  • Tools for training new models and fine-tuning existing ones
  • Utilities for dataset analysis and curation
  • Support for voice conversion (with OpenVoice integration)

It runs on Windows, macOS, and Linux; this guide uses a Conda environment throughout.
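
Before setting anything up, here is the shortest possible usage sketch (the model name tts_models/en/ljspeech/tacotron2-DDC is one of the pretrained English models; any name from tts --list_models in section 3 works):

from TTS.api import TTS

# Downloads the model on first use, then loads it
tts = TTS("tts_models/en/ljspeech/tacotron2-DDC")

# Synthesize speech and write it to a WAV file
tts.tts_to_file(text="Hello from coqui-ai-TTS!", file_path="hello.wav")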


2. Installation

Step 1: Create a Conda Environment

(Install Conda or Miniconda first; see the Anaconda documentation.)

conda create -n coqui-ai-TTS python=3.10
conda activate coqui-ai-TTS

Step 2: Install coqui-ai-TTS

  • If you only want to use pretrained models:
pip install coqui-tts
  • If you want to train or modify models:
git clone https://github.com/idiap/coqui-ai-TTS
cd coqui-ai-TTS
pip install -e .
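
A quick sanity check that both the Python package and the command-line tool are available:

# The package should import without errors
python -c "import TTS"

# The tts CLI is installed alongside the package
tts --help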

3. Managing Models

List available models:

tts --list_models
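
The list is long; on Linux/macOS you can narrow it with ordinary shell tools (plain grep, nothing toolkit-specific):

# Show only the XTTS family of models
tts --list_models | grep -i xtts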

Default download location:

~/.local/share/tts (e.g. /root/.local/share/tts when running as root)

Change model storage location:

Temporary environment variables (apply to the current shell session only; TTS_HOME takes precedence if both are set):

export XDG_DATA_HOME="/www/coqui/models"
export TTS_HOME="/www/coqui/models"
echo $XDG_DATA_HOME
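
To confirm the override took effect inside Python, you can call the helper the toolkit itself uses to resolve this directory (get_user_data_dir is an internal utility in TTS.utils.generic_utils; treat this as a diagnostic sketch, since internal helpers can move between releases):

from TTS.utils.generic_utils import get_user_data_dir

# Prints the directory used for model downloads, honoring
# TTS_HOME / XDG_DATA_HOME if they are set
print(get_user_data_dir("tts"))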

Permanent setup (Linux/macOS):

  1. Edit ~/.bashrc (for bash) or ~/.zshrc (for zsh, the macOS default).
  2. Add the export lines (see the snippet after this list).
  3. Reload the config: source ~/.bashrc or . ~/.zshrc
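
For example, the lines to append to ~/.bashrc (same variables as above; adjust the path to your setup):

# Store coqui-ai-TTS model downloads under /www/coqui/models
export XDG_DATA_HOME="/www/coqui/models"
export TTS_HOME="/www/coqui/models"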

4. Voice Cloning

Supported Models

  • YourTTS (and other d-vector models)
  • XTTS
  • Tortoise
  • Bark

Two Modes: Python API & Command-line

Python API Example:

import torch
from TTS.api import TTS

device = "cuda" if torch.cuda.is_available() else "cpu"
api = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)

# Step 1: Clone a voice from reference audio and register it as "MySpeaker1"
api.tts_to_file(
  text="Hello world",
  speaker_wav=["my/cloning/audio.wav", "my/cloning/audio2.wav"],
  speaker="MySpeaker1",
  language="en",
  file_path="step1.wav",
)

# Step 2: Reuse the cloned voice by name, with no reference audio needed
api.tts_to_file(
  text="Hello world",
  speaker="MySpeaker1",
  language="en",
  file_path="step2.wav",
)
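
If you want the raw audio in memory instead of a file (for playback or post-processing), tts() returns the waveform samples rather than writing them out:

# Returns the synthesized waveform as a list of samples
wav = api.tts(text="Hello world", speaker="MySpeaker1", language="en")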

CLI Example:

# Step 1: Clone voice (the sample text "你好世界" means "Hello world")
tts --model_name "tts_models/multilingual/multi-dataset/xtts_v2" \
    --text "你好世界" \
    --language_idx "zh" \
    --speaker_wav "my/cloning/audio.wav" "my/cloning/audio2.wav" \
    --speaker_idx "MySpeaker1"

# Step 2: Reuse cloned voice (output goes to tts_output.wav by default; override with --out_path)
tts --model_name "tts_models/multilingual/multi-dataset/xtts_v2" \
    --text "你好世界" \
    --language_idx "zh" \
    --speaker_idx "MySpeaker1"

⚠️ For Chinese voice cloning, install:

pip install pypinyin

Otherwise, you’ll see:

ImportError: Chinese requires: pypinyin

5. Voice Conversion

Convert a source voice into the style of a target voice.

Python API Example:

import torch
from TTS.api import TTS

# Fall back to CPU when no GPU is available
device = "cuda" if torch.cuda.is_available() else "cpu"

tts = TTS("voice_conversion_models/multilingual/vctk/freevc24").to(device)
tts.voice_conversion_to_file(
  source_wav="my/source.wav",   # speech content to keep
  target_wav="my/target.wav",   # voice to apply
  file_path="output.wav",
)
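
The API can also chain synthesis and conversion in one call: tts_with_vc_to_file() synthesizes the text with a TTS model and then converts the result toward the target voice. A sketch based on the upstream example (model name and file paths are placeholders):

from TTS.api import TTS

# Synthesize with an ordinary single-speaker model, then convert the
# result so it sounds like the speaker in target/speaker.wav
tts = TTS("tts_models/en/ljspeech/tacotron2-DDC")
tts.tts_with_vc_to_file(
  "This sentence comes out in the target speaker's voice.",
  speaker_wav="target/speaker.wav",
  file_path="output_vc.wav",
)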

CLI Example:

tts --model_name "voice_conversion_models/multilingual/multi-dataset/openvoice_v2" \
    --source_wav "source.wav" \
    --target_wav "target1.wav" "target2.wav" \
    --out_path "output.wav"

Project repository: https://github.com/idiap/coqui-ai-TTS