microsoft

VibeVoice

AI #Speech Recognition #Text-to-Speech #Deep Learning #Generative AI

// summary

VibeVoice is a family of open-source voice AI models that use continuous speech tokenizers and next-token diffusion for high-fidelity speech recognition and synthesis. The family includes models for long-form speech recognition and real-time streaming text-to-speech. The models are released for research purposes, to support collaboration and innovation in the speech synthesis community.

// technical analysis

VibeVoice is an open-source research framework built on a unified architecture: continuous acoustic and semantic tokenizers operating at an ultra-low 7.5 Hz frame rate. In its next-token diffusion framework, a Large Language Model maintains semantic coherence across the sequence, while a diffusion head generates the high-fidelity acoustic detail. The low frame rate keeps sequences short enough for long-form speech processing, so the models can handle long recordings in a single pass (up to 90 minutes of synthesized speech, or 60 minutes of ASR input) while balancing computational efficiency with expressive output.
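The effect of the 7.5 Hz frame rate on sequence length can be checked with simple arithmetic. The sketch below counts tokenizer frames for the durations quoted on this page; the 50 Hz comparison rate is an illustrative assumption, not a figure from the project, and the count covers audio frames only (text tokens and any per-frame overhead are ignored).

```python
FRAME_RATE_HZ = 7.5        # tokenizer frame rate quoted above
COMPARISON_RATE_HZ = 50.0  # assumption: a higher-rate tokenizer, for contrast only


def frames_for(minutes: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Number of tokenizer frames needed to cover `minutes` of audio."""
    return round(minutes * 60 * frame_rate_hz)


if __name__ == "__main__":
    for label, minutes in [("60-minute recording", 60), ("90-minute recording", 90)]:
        low = frames_for(minutes)
        high = frames_for(minutes, COMPARISON_RATE_HZ)
        print(f"{label}: {low:,} frames at 7.5 Hz vs {high:,} at 50 Hz")
```

At 7.5 Hz, a 90-minute recording is 40,500 frames, which is a sequence length a long-context language model can plausibly attend over in one pass.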

// key highlights

Supports single-pass processing for up to 60 minutes of audio in ASR, ensuring consistent speaker tracking and semantic coherence.

Enables rich transcription by jointly performing ASR, diarization, and timestamping to identify who said what and when.

Features a lightweight 0.5B-parameter real-time TTS model that accepts streaming text input with approximately 300 ms latency.

Provides multi-speaker support for up to 4 distinct speakers in a single conversation, maintaining natural turn-taking dynamics.

Integrates with Hugging Face Transformers for model deployment and supports vLLM for accelerated inference.

Allows for user-customized hotwords to improve recognition accuracy for domain-specific terminology and names.
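To make "who said what and when" concrete, here is a minimal sketch of how a rich transcript could be represented and rendered downstream. The `Segment` type, its field names, and the helper functions are hypothetical illustrations; they are not VibeVoice's actual output schema, which should be taken from the repository documentation.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One diarized, timestamped utterance (hypothetical schema)."""
    speaker: str   # who
    start: float   # when it starts, in seconds
    end: float     # when it ends, in seconds
    text: str      # what was said


def format_timestamp(seconds: float) -> str:
    """Render seconds as HH:MM:SS, dropping the fractional part."""
    hours, remainder = divmod(int(seconds), 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def render(segments: list[Segment]) -> str:
    """Produce a readable transcript ordered by start time."""
    ordered = sorted(segments, key=lambda seg: seg.start)
    return "\n".join(
        f"[{format_timestamp(seg.start)} - {format_timestamp(seg.end)}] "
        f"{seg.speaker}: {seg.text}"
        for seg in ordered
    )


def speaking_time(segments: list[Segment]) -> dict[str, float]:
    """Total seconds spoken per speaker."""
    totals: dict[str, float] = {}
    for seg in segments:
        totals[seg.speaker] = totals.get(seg.speaker, 0.0) + (seg.end - seg.start)
    return totals


if __name__ == "__main__":
    demo = [
        Segment("Speaker 2", 4.5, 9.0, "Thanks, glad to be here."),
        Segment("Speaker 1", 0.0, 4.5, "Welcome to the show."),
        Segment("Speaker 1", 9.0, 12.0, "Let's get started."),
    ]
    print(render(demo))
    print(speaking_time(demo))
```

Keeping speaker, time span, and text in one record is what lets a single pass answer all three questions without aligning separate ASR and diarization outputs afterwards.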

// use cases

Long-form speech-to-text with speaker diarization and timestamping

Real-time streaming text-to-speech with low latency

Multi-speaker conversational audio synthesis

// getting started

To begin using VibeVoice, visit the official Hugging Face collection to download the model weights for the ASR, TTS, or real-time variants. Developers can try the provided Colab notebooks for immediate hands-on testing, or consult the repository documentation for setup instructions, fine-tuning code, and vLLM inference configurations.
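As a rough sketch of the first step, model weights hosted on Hugging Face can be fetched with `snapshot_download` from the `huggingface_hub` package. The `download_weights` helper and the `checkpoints` directory are hypothetical, and no repository ID is hard-coded here: copy the exact ID from the official collection. Loading and running the model afterwards depends on the repository's own instructions and is not shown.

```python
from pathlib import Path


def download_weights(repo_id: str, target_dir: str = "checkpoints") -> Path:
    """Download all files of a Hugging Face model repo into target_dir/<model-name>.

    `repo_id` must look like "<owner>/<model-name>"; take the real value
    from the official VibeVoice collection (none is assumed here).
    """
    parts = repo_id.split("/")
    if len(parts) != 2 or not all(parts):
        raise ValueError(f"expected '<owner>/<model-name>', got {repo_id!r}")

    # Imported lazily so the validation above works without the package installed.
    from huggingface_hub import snapshot_download

    local_dir = Path(target_dir) / parts[1]
    downloaded = snapshot_download(repo_id=repo_id, local_dir=str(local_dir))
    return Path(downloaded)


# Usage (requires `pip install huggingface_hub` and network access):
#   path = download_weights("<owner>/<model-name>")
#   print("weights stored in", path)
```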