Amphion AI: Open-source Audio, Music and Speech Generation Toolkit

amy 13/01/2026

Amphion (/æmˈfaɪən/) is a toolkit for Audio, Music, and Speech Generation. Its purpose is to support reproducible research and to help junior researchers and engineers get started in audio, music, and speech generation research and development.

Amphion offers a unique feature: visualizations of classic models and architectures. We believe that these visualizations are beneficial for junior researchers and engineers who wish to gain a better understanding of these models.

The north-star objective of Amphion is to offer a platform for studying the conversion of any input into audio. Amphion is designed to support individual generation tasks, including but not limited to:

    • TTS: Text to Speech (⛳ supported)
    • SVS: Singing Voice Synthesis (👨‍💻 developing)
    • VC: Voice Conversion (⛳ supported)
    • AC: Accent Conversion (⛳ supported)
    • SVC: Singing Voice Conversion (⛳ supported)
    • TTA: Text to Audio (⛳ supported)

Features

  • Text-to-Speech (TTS): Supports FastSpeech2, VITS, VALL-E, NaturalSpeech2, Jets, MaskGCT, Vevo-TTS, DualCodec-VALLE
  • Voice Conversion (VC): Vevo, FACodec, Noro (zero-shot, noise-robust)
  • Accent Conversion (AC): Zero-shot with Vevo-Style
  • Singing Voice Conversion (SVC): Uses WeNet, Whisper, ContentVec features; supports diffusion, transformer, VAE, flow models
  • Text-to-Audio (TTA): Latent diffusion model (official implementation of a NeurIPS 2023 paper)
  • Neural Audio Codecs: DualCodec (12.5Hz/25Hz, SSL-enhanced), FACodec (content/prosody/timbre separation)
  • Vocoders: MelGAN, HiFi-GAN, NSF-HiFiGAN, BigVGAN, APNet, WaveGlow, Diffwave, WaveNet, WaveRNN + Multi-Scale CQT Discriminator
  • Evaluation Metrics: F0 modeling, energy, intelligibility (CER/WER), spectrogram distortion (FAD, MCD, STOI, PESQ), speaker similarity (cosine, RawNet3, Resemblyzer, etc.)
  • Datasets: Emilia (101k+ hours of in-the-wild speech), LibriTTS, LJSpeech, VCTK, M4Singer, Opencpop, OpenSinger, SVCC, AudioCaps, and more, with unified preprocessing via Emilia-Pipe
  • Visualization: SingVisio for interactive diffusion model insights
  • Framework: Built for reproducible research, education, and real-world audio generation applications
  • Easy to install using Docker
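
To make two of the evaluation metrics listed above more concrete, here is a minimal sketch of how intelligibility (WER/CER) and cosine speaker similarity are commonly computed. It is not Amphion's implementation: the function names (edit_distance, error_rate, cosine_similarity) are hypothetical, it uses only the Python standard library, and the embedding vectors are made-up numbers standing in for the output of a speaker encoder.

```python
import math
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting i reference tokens to reach an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(reference: str, hypothesis: str, unit: str = "word") -> float:
    """WER (unit="word") or CER (unit="char"): edit distance / reference length."""
    if unit == "word":
        ref, hyp = reference.split(), hypothesis.split()
    elif unit == "char":
        # Assumption for this sketch: spaces are ignored when scoring characters.
        ref = list(reference.replace(" ", ""))
        hyp = list(hypothesis.replace(" ", ""))
    else:
        raise ValueError("unit must be 'word' or 'char'")
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / norm


if __name__ == "__main__":
    # One deleted word out of six reference words.
    print(error_rate("the cat sat on the mat", "the cat sat on mat"))
    # One substituted character out of three.
    print(error_rate("abc", "abd", unit="char"))
    # Made-up 3-dimensional "speaker embeddings" for illustration only.
    print(cosine_similarity([0.2, 0.5, 0.1], [0.4, 1.0, 0.2]))
```

In practice the hypothesis text would come from running an ASR model on the generated speech, and the vectors from a speaker encoder such as the ones named in the list above. Text normalization (casing, punctuation) before scoring also affects the result and is left out of this sketch.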

License

The project is open source and released under the MIT License.

Citation

@article{amphion_v0.2,
  title        = {Overview of the Amphion Toolkit (v0.2)},
  author       = {Jiaqi Li and Xueyao Zhang and Yuancheng Wang and Haorui He and Chaoren Wang and Li Wang and Huan Liao and Junyi Ao and Zeyu Xie and Yiqiao Huang and Junan Zhang and Zhizheng Wu},
  year         = {2025},
  journal      = {arXiv preprint arXiv:2501.15442},
}

Resources & Downloads