Amphion AI: Open-source Audio, Music and Speech Generation Toolkit

Amphion (/æmˈfaɪən/) is a toolkit for audio, music, and speech generation. It aims to support reproducible research and to help junior researchers and engineers get started with research and development in this field.

Amphion offers a unique feature: visualizations of classic models and architectures. We believe these visualizations help junior researchers and engineers gain a better understanding of how the models work.

The North-Star objective of Amphion is to offer a platform for studying the conversion of any input into audio. Amphion is designed to support individual generation tasks, including but not limited to:

TTS: Text to Speech (⛳ supported)
SVS: Singing Voice Synthesis (👨‍💻 developing)
VC: Voice Conversion (⛳ supported)
AC: Accent Conversion (⛳ supported)
SVC: Singing Voice Conversion (⛳ supported)
TTA: Text to Audio (⛳ supported)

Features

Text-to-Speech (TTS): Supports FastSpeech2, VITS, VALL-E, NaturalSpeech2, Jets, MaskGCT, Vevo-TTS, DualCodec-VALLE
Voice Conversion (VC): Vevo, FACodec, Noro (zero-shot, noise-robust)
Accent Conversion (AC): Zero-shot with Vevo-Style
Singing Voice Conversion (SVC): Uses WeNet, Whisper, ContentVec features; supports diffusion, transformer, VAE, flow models
Text-to-Audio (TTA): Latent diffusion model (official implementation of a NeurIPS 2023 paper)
Neural Audio Codecs: DualCodec (12.5Hz/25Hz, SSL-enhanced), FACodec (content/prosody/timbre separation)
Vocoders: MelGAN, HiFi-GAN, NSF-HiFiGAN, BigVGAN, APNet, WaveGlow, Diffwave, WaveNet, WaveRNN + Multi-Scale CQT Discriminator
Evaluation Metrics: F0 modeling, energy, intelligibility (CER/WER), spectrogram distortion (FAD, MCD, STOI, PESQ), speaker similarity (cosine, RawNet3, Resemblyzer, etc.)
Datasets: Emilia (101k+ hours in-the-wild speech), LibriTTS, LJSpeech, VCTK, M4Singer, Opencpop, OpenSinger, SVCC, AudioCaps, and more — with unified preprocessing via Emilia-Pipe
Visualization: SingVisio for interactive diffusion model insights
Framework: Built for reproducible research, education, and real-world audio generation applications
Installation: Easy to install using Docker
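
To make the evaluation metrics listed above more concrete, here is a minimal, dependency-free sketch of two of them: intelligibility as word/character error rate (WER/CER) and speaker similarity as the cosine similarity between two embedding vectors. This is an illustration only, not Amphion's own evaluation code; the function names (edit_distance, wer, cer, cosine_similarity) are hypothetical, and in practice the transcripts would come from an ASR model and the embeddings from a speaker encoder such as RawNet3 or Resemblyzer.

```python
import math
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # substitution (or match when cost is 0)
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate, ignoring spaces (a simplifying assumption)."""
    ref_chars = list(reference.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    hyp_chars = list(hypothesis.replace(" ", ""))
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("embeddings must be non-zero")
    return dot / norm


if __name__ == "__main__":
    # One deleted word out of six reference words.
    print(wer("the cat sat on the mat", "the cat sat on mat"))
    # Toy 3-dimensional "speaker embeddings", purely illustrative values.
    print(cosine_similarity([0.2, 0.5, 0.1], [0.25, 0.45, 0.05]))
```

Lower WER/CER means the generated speech is transcribed more faithfully, while a cosine similarity closer to 1 suggests the generated voice sits closer to the target speaker in the embedding space.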

License

Amphion is an open-source project released under the MIT License.

Citation

@article{amphion_v0.2,
  title        = {Overview of the Amphion Toolkit (v0.2)},
  author       = {Jiaqi Li and Xueyao Zhang and Yuancheng Wang and Haorui He and Chaoren Wang and Li Wang and Huan Liao and Junyi Ao and Zeyu Xie and Yiqiao Huang and Junan Zhang and Zhizheng Wu},
  year         = {2025},
  journal      = {arXiv preprint arXiv:2501.15442},
}
