neuphonic

neutts

AI #TTS #Voice Cloning #LLM #GGUF #On-device

// summary

NeuTTS is a collection of open-source, on-device text-to-speech models built for real-time, high-quality voice synthesis. The framework pairs lightweight LLM backbones with a neural audio codec to enable instant voice cloning from as little as three seconds of audio. The models are optimized for deployment on mobile and embedded devices and support English, Spanish, German, and French.

// technical analysis

NeuTTS is an open-source framework for running text-to-speech (TTS) entirely on local hardware rather than through a web-based API. It combines lightweight LLM backbones with a specialized neural audio codec to provide real-time, high-quality speech synthesis and instant voice cloning on resource-constrained devices such as mobile phones and Raspberry Pis. The key technical trade-off is quantization: the backbones are provided as GGUF-quantized models, which significantly reduces memory and compute requirements at some cost in numerical precision while keeping the output natural-sounding. This makes the framework well suited to embedded voice agents and privacy-conscious applications.
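The memory effect of quantization can be sketched with simple arithmetic. The snippet below is a rough estimate only: the parameter count is a hypothetical value chosen for illustration (not a published figure for any NeuTTS model), and the bit widths are nominal, since real GGUF quantization formats also store per-block scale metadata.

```python
def weight_memory_mib(n_params: int, bits_per_weight: float) -> float:
    """Approximate memory needed to hold the model weights, in MiB."""
    total_bytes = n_params * bits_per_weight / 8
    return total_bytes / (1024 ** 2)


# Hypothetical parameter count, for illustration only.
N_PARAMS = 500_000_000

# Nominal bit widths; actual GGUF formats carry some extra overhead.
for label, bits in [("fp16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>5}: {weight_memory_mib(N_PARAMS, bits):8.1f} MiB")
```

Under these assumptions, moving from 16-bit to 4-bit weights cuts the weight footprint to roughly a quarter, which is the kind of reduction that matters on phones and single-board computers.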

// key highlights

Delivers realistic, human-like voice synthesis while balancing speed, model size, and audio quality.

Supports instant voice cloning, allowing users to replicate a specific speaker's voice using as little as 3 seconds of reference audio.

Provides GGUF-quantized model backbones that are specifically engineered for efficient inference on mobile, laptop, and embedded hardware.

Uses the NeuCodec neural audio codec, which achieves high-fidelity audio output at low bitrates with a single-codebook architecture.

Watermarks all generated audio with a perceptual threshold watermark as a built-in safeguard.

Offers multilingual support, with specific models available for English, Spanish, German, and French.
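The link between a single-codebook codec and a low bitrate comes down to arithmetic: each audio frame is encoded as one token, and each token costs log2(codebook size) bits. The sketch below shows that calculation. The frame rate and codebook sizes are illustrative assumptions, not specifications quoted from NeuCodec, and the multi-codebook case is included only for comparison.

```python
import math


def codec_bitrate_bps(frames_per_second: float,
                      codebook_size: int,
                      n_codebooks: int = 1) -> float:
    """Bitrate of a token-based codec: tokens per second times bits per token."""
    bits_per_token = math.log2(codebook_size)
    return frames_per_second * n_codebooks * bits_per_token


# Illustrative values only (assumptions, not NeuCodec specifications).
single = codec_bitrate_bps(frames_per_second=50, codebook_size=65_536)
stacked = codec_bitrate_bps(frames_per_second=50, codebook_size=1_024, n_codebooks=8)

print(f"single codebook: {single / 1000:.1f} kbps")
print(f"eight codebooks: {stacked / 1000:.1f} kbps")
```

A single token stream per frame is also convenient for an LLM backbone, which then has only one sequence of discrete tokens to predict.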

// use cases

Real-time on-device speech synthesis for embedded voice agents and assistants

Instant voice cloning using short audio samples for personalized applications

Multilingual text-to-speech generation optimized for mobile and low-power hardware
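For the voice-cloning use case, it can help to confirm that a reference clip meets the three-second minimum before passing it to a model. The following is a minimal sketch using only the Python standard library. It assumes the reference is an uncompressed WAV file, and the helper names are hypothetical rather than part of NeuTTS.

```python
import wave

# Threshold taken from the "3 seconds of reference audio" figure.
MIN_REFERENCE_SECONDS = 3.0


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def is_long_enough(path: str, minimum: float = MIN_REFERENCE_SECONDS) -> bool:
    """Check whether a reference clip meets the minimum duration."""
    return wav_duration_seconds(path) >= minimum
```

Calling `is_long_enough("reference.wav")` (a placeholder path) returns `True` when the clip lasts at least three seconds.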

// getting started

To begin, install the library with 'pip install neutts[all]', which pulls in the required dependencies such as llama-cpp-python and onnxruntime. From there, either explore the example scripts in the repository, such as the basic streaming example, or use the NeuTTS class directly in your Python code to synthesize speech from text and a reference audio file. For best performance, compile the llama-cpp-python package from source with the hardware acceleration flags appropriate to your CPU or GPU.
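After installation, a quick check that the two dependencies named above are importable can save some debugging. This is a minimal sketch that relies only on the standard library. It assumes the usual import names (`llama_cpp` for llama-cpp-python, `onnxruntime` for onnxruntime) and does not touch the NeuTTS API itself.

```python
from importlib.util import find_spec

# Package name -> import name for the dependencies mentioned above.
BACKENDS = {
    "llama-cpp-python": "llama_cpp",
    "onnxruntime": "onnxruntime",
}


def available_backends() -> dict[str, bool]:
    """Report which inference dependencies can be imported in this environment."""
    return {pkg: find_spec(module) is not None for pkg, module in BACKENDS.items()}


if __name__ == "__main__":
    for pkg, ok in available_backends().items():
        print(f"{pkg}: {'installed' if ok else 'missing'}")
```

The check only shows that a package is present. It does not reveal whether llama-cpp-python was built with hardware acceleration, which still has to be verified separately.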