OpenBMB

VoxCPM

#AI #Text-to-Speech #Deep Learning #Generative AI #Audio Synthesis

// summary

VoxCPM2 is a tokenizer-free, 2B-parameter text-to-speech system that uses a diffusion-autoregressive architecture to generate expressive speech. It supports 30 languages and offers voice design, controllable voice cloning, and 48kHz studio-quality output. It is open-source under the Apache-2.0 license and can be deployed in production through vLLM-Omni and Nano-vLLM.

// technical analysis

VoxCPM2 is a tokenizer-free text-to-speech system built on a 2B-parameter diffusion-autoregressive architecture. Instead of predicting discrete audio tokens, it operates directly in the latent space of AudioVAE V2, which is how it reaches 48kHz studio-quality synthesis. The project targets expressive, multilingual voice generation and cloning, with a pipeline that supports voice design from natural-language descriptions and fine-grained style control. Its Apache-2.0 license keeps it open to commercial use across a range of speech synthesis applications.

// key highlights

Supports 30 languages natively without requiring language tags for input text.

Enables creative voice design by generating unique voices from natural language descriptions rather than reference audio.

Provides controllable voice cloning that allows users to adjust emotion, pace, and style while maintaining the original speaker's timbre.

Offers an "ultimate cloning" mode that reproduces fine vocal nuances by conditioning on both the reference audio and its corresponding transcript.

Outputs 48kHz studio-quality audio directly through an asymmetric AudioVAE V2 design with built-in super-resolution.

Supports low-latency, real-time streaming, with production serving through Nano-vLLM and vLLM-Omni integrations.

// use cases

Natural-language voice design without reference audio

Controllable voice cloning with style guidance for emotion and pace

High-throughput production speech synthesis via OpenAI-compatible APIs
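
As a rough illustration of the last use case, the sketch below builds (but does not send) a request shaped like OpenAI's speech endpoint. The '/v1/audio/speech' path, the 'voice' value, and the server address are assumptions modeled on the OpenAI API, not confirmed details of the VoxCPM serving stack; check the serving documentation for the fields it actually accepts.

```python
import json
import urllib.request


def build_speech_request(
    base_url: str,
    text: str,
    model: str = "openbmb/VoxCPM2",
    voice: str = "default",          # hypothetical voice name
    response_format: str = "wav",
) -> urllib.request.Request:
    """Build a POST request in the shape of OpenAI's speech endpoint."""
    payload = {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": response_format,
    }
    return urllib.request.Request(
        # Assumed path, copied from the OpenAI API layout.
        url=base_url.rstrip("/") + "/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to 'urllib.request.urlopen' against a running server and writing the response bytes to a file would complete the round trip.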

// getting started

To begin, install the package with 'pip install voxcpm'. Then load the 'openbmb/VoxCPM2' model through the provided Python API to run text-to-speech, voice design, or cloning. For production environments, the project supports high-throughput serving through Nano-vLLM or vLLM-Omni, which provides an OpenAI-compatible API.
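
A minimal sketch of the Python workflow might look like the following. It assumes that 'VoxCPM.from_pretrained', 'generate', and the 'prompt_wav_path' / 'prompt_text' parameter names from the earlier VoxCPM release carry over to VoxCPM2, and that the waveform is written at 48 kHz; none of this is confirmed by the summary above, so check the project README before relying on it. 'build_generate_kwargs' and 'synthesize' are hypothetical helpers introduced only for illustration.

```python
from typing import Optional


def build_generate_kwargs(
    text: str,
    prompt_wav_path: Optional[str] = None,
    prompt_text: Optional[str] = None,
) -> dict:
    """Collect keyword arguments for a generate() call (assumed names)."""
    if not text.strip():
        raise ValueError("text must not be empty")
    # A transcript only makes sense alongside the audio it transcribes.
    if prompt_text is not None and prompt_wav_path is None:
        raise ValueError("prompt_text requires prompt_wav_path")
    kwargs = {"text": text}
    if prompt_wav_path is not None:
        kwargs["prompt_wav_path"] = prompt_wav_path
    if prompt_text is not None:
        kwargs["prompt_text"] = prompt_text
    return kwargs


def synthesize(text, out_path, model_id="openbmb/VoxCPM2",
               sample_rate=48_000, **clone_args):
    """Load the model, generate speech, and write it to out_path."""
    # Imported lazily so the helper above works without the package installed.
    import soundfile as sf
    from voxcpm import VoxCPM

    model = VoxCPM.from_pretrained(model_id)   # assumed API
    wav = model.generate(**build_generate_kwargs(text, **clone_args))
    sf.write(out_path, wav, sample_rate)       # 48 kHz is an assumption
```

Under those assumptions, calling 'synthesize("Hello from VoxCPM2.", "hello.wav")' would produce plain text-to-speech, and adding 'prompt_wav_path' and 'prompt_text' would switch to transcript-guided cloning.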