// summary
VoxCPM2 is a tokenizer-free, 2B-parameter text-to-speech system that uses a diffusion autoregressive architecture to generate expressive speech. It supports 30 languages and offers voice design, controllable voice cloning, and studio-quality 48 kHz output. It is open source under the Apache-2.0 license and can be served in production through vLLM-Omni or Nano-vLLM.
// technical analysis
VoxCPM2 is a tokenizer-free text-to-speech system built on a 2B-parameter diffusion autoregressive architecture. Instead of converting audio into discrete tokens, it generates speech directly in the latent space of AudioVAE V2 to produce 48 kHz, studio-quality output. The pipeline covers expressive multilingual synthesis, voice cloning, voice design from natural-language descriptions, and precise style control. The open-source release is also intended to be usable in commercial applications.
// key highlights
// use cases
// getting started
To begin, install the package with 'pip install voxcpm'. The Python API then handles text-to-speech, voice design, and voice cloning once the 'openbmb/VoxCPM2' model is loaded. For production environments, the project supports high-throughput serving through Nano-vLLM or through vLLM-Omni, the latter of which provides an OpenAI-compatible API.
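As a rough illustration of what calling an OpenAI-compatible speech endpoint could look like, the sketch below builds and sends a request using only the Python standard library. It assumes the server mirrors the OpenAI speech route ('/v1/audio/speech') and its 'model', 'input', 'voice', and 'response_format' fields. The host, port, and voice name are placeholders that do not come from the project, and the helper names are hypothetical, so check the serving documentation before relying on any of them.

```python
import json
import urllib.request

# Assumption: address of a locally running server; not specified by the project.
BASE_URL = "http://localhost:8000"


def build_speech_request(text, voice="default", base_url=BASE_URL):
    """Build (but do not send) a POST request in the OpenAI speech format.

    The route and field names follow the OpenAI speech API; whether the
    VoxCPM2 server accepts exactly these fields is an assumption.
    """
    payload = {
        "model": "openbmb/VoxCPM2",
        "input": text,
        "voice": voice,  # hypothetical voice name
        "response_format": "wav",
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text, out_path="output.wav", base_url=BASE_URL):
    """Send the request and write the returned audio bytes to out_path."""
    request = build_speech_request(text, base_url=base_url)
    with urllib.request.urlopen(request) as response:
        audio_bytes = response.read()
    with open(out_path, "wb") as audio_file:
        audio_file.write(audio_bytes)
    return out_path
```

With a server running, synthesize("Hello from VoxCPM2.") would save the response body to 'output.wav'. Keeping request construction separate from sending makes the payload easy to inspect or adjust if the server expects different fields.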