// archived 2026-04-21
Tencent

AngelSlim

AI · LLM · Quantization · Model Compression · Speculative Decoding · Deep Learning

// summary

AngelSlim is a highly integrated toolkit designed to provide efficient compression solutions for large language, vision, and diffusion models. It supports a wide range of techniques including advanced quantization, speculative decoding, and token pruning to optimize model performance. The framework offers developers a unified interface for training, deployment, and performance evaluation across various hardware environments.

// technical analysis

AngelSlim targets the compression of large-scale models, covering LLMs, VLMs, and diffusion models. By unifying diverse techniques such as quantization, speculative decoding, and sparse attention in a single framework, it tackles the complexity of deploying massive models on resource-constrained hardware. The project prioritizes ease of use through a modular API and configuration-driven workflows, while maintaining a strong focus on performance optimization so that state-of-the-art models can be served efficiently.
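
To make the quantization building block concrete, the sketch below shows a per-tensor symmetric INT8 round-trip in PyTorch. This is the generic math that INT8 post-training quantization builds on, not AngelSlim's implementation; the toolkit wraps such logic behind its API.

```python
import torch

def int8_quantize(x: torch.Tensor):
    # Per-tensor symmetric quantization: map [-max|x|, max|x|] onto [-127, 127].
    scale = x.abs().max() / 127.0
    q = torch.clamp(torch.round(x / scale), -127, 127).to(torch.int8)
    return q, scale

def int8_dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Recover an approximation of the original tensor from integers + scale.
    return q.to(torch.float32) * scale

w = torch.randn(4096, 4096)  # stand-in for a weight matrix
q, s = int8_quantize(w)
w_hat = int8_dequantize(q, s)
print("max abs error:", (w - w_hat).abs().max().item())
```

Lower-bit formats such as INT4 and FP8 follow the same quantize-dequantize pattern with different grids and scaling granularity (per-channel or per-group instead of per-tensor).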

// key highlights

01
Provides a unified, highly integrated framework that supports a wide range of compression algorithms for LLMs, VLMs, and diffusion models.
02
Features speculative decoding via Eagle3, with reported inference speedups of 1.4–1.9x (a toy sketch of the propose-and-verify loop follows this list).
03
Supports diverse quantization methods including FP8, INT8, INT4, and specialized techniques like NVFP4, Tequila, and Sherry.
04
Optimizes the end-to-end pipeline so that massive models such as Qwen3-235B can be quantized and deployed on limited GPU resources.
05
Includes built-in deployment through industry-standard inference engines such as vLLM and SGLang, exposing OpenAI-compatible API services.
06
Offers a metadata-driven framework for vision token pruning and merging, facilitating efficient processing in multimodal models.
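
As a companion to highlight 02, here is a toy greedy speculative decoding loop: a cheap draft model proposes a few tokens, and the target model keeps only the prefix it agrees with. Everything here is a stand-in for illustration; Eagle3's draft heads and batched verification are far more sophisticated, and a real implementation verifies all drafted positions in a single target forward pass rather than token by token.

```python
from typing import Callable, List

def speculative_step(prefix: List[int],
                     draft_next: Callable[[List[int]], int],
                     target_next: Callable[[List[int]], int],
                     k: int = 4) -> List[int]:
    # 1) The draft model proposes k tokens autoregressively.
    proposal, ctx = [], list(prefix)
    for _ in range(k):
        token = draft_next(ctx)
        proposal.append(token)
        ctx.append(token)
    # 2) The target model verifies: accept the agreeing prefix, then
    #    emit its own token, so at least one token is always produced.
    accepted, ctx = [], list(prefix)
    for token in proposal:
        expected = target_next(ctx)
        if expected != token:
            accepted.append(expected)  # reject the rest of the draft
            return accepted
        accepted.append(token)
        ctx.append(token)
    accepted.append(target_next(ctx))  # bonus token after a full accept
    return accepted

def draft(ctx):   # toy draft model: next token is last token + 1
    return ctx[-1] + 1

def target(ctx):  # toy target model: agrees except every 5th position
    return ctx[-1] + 1 if len(ctx) % 5 else ctx[-1] + 2

print(speculative_step([0], draft, target, k=4))  # -> [1, 2, 3, 4, 6]
```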

// use cases

01
Model quantization with formats such as FP8 and INT4 and specialized methods like Tequila and Sherry
02
Speculative decoding training and deployment for LLMs, VLMs, and audio models using Eagle3
03
Diffusion model optimization through advanced caching and quantization techniques (a generic caching sketch follows this list)
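
For use case 03, the loop below shows the general idea behind step caching in diffusion samplers, in the spirit of DeepCache-style methods: expensive deep features are refreshed only every few denoising steps and reused in between. All functions are stand-ins, and this is a generic illustration rather than AngelSlim's cache.

```python
import torch

def deep_block(x: torch.Tensor) -> torch.Tensor:
    # Stand-in for the expensive inner blocks of a denoising network.
    return torch.tanh(0.9 * x)

def shallow_block(x: torch.Tensor, deep: torch.Tensor) -> torch.Tensor:
    # Stand-in for the cheap outer blocks that consume the deep features.
    return x - 0.1 * deep

def sample(x: torch.Tensor, steps: int = 50, refresh_every: int = 5) -> torch.Tensor:
    cached_deep = None
    for step in range(steps):
        if cached_deep is None or step % refresh_every == 0:
            cached_deep = deep_block(x)    # full compute on refresh steps
        x = shallow_block(x, cached_deep)  # cheap path reuses cached features
    return x

print(sample(torch.randn(4)).shape)
```

The trade-off is accuracy for speed: the more steps reuse the cache, the fewer expensive passes are needed, at the cost of slightly stale features.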

// getting started

To begin, install the toolkit with 'pip install angelslim', or clone the repository for an editable source installation. Developers can then use the 'Engine' API for programmatic model compression (sketched below) or run the provided shell scripts for tasks like speculative decoding training and model quantization. Detailed documentation and quick-start guides cover specific model configurations and deployment workflows.
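
A minimal programmatic sketch follows. The Engine method names and arguments shown here (prepare_model, prepare_compressor, run, save) mirror the quick-start pattern but should be treated as assumptions; consult the official documentation for the current signatures.

```python
# Hedged sketch of a programmatic compression run; argument names are
# assumptions, not the authoritative AngelSlim API.
from angelslim.engine import Engine

engine = Engine()
engine.prepare_model(model_name="Qwen3-8B", model_path="./Qwen3-8B")  # hypothetical args
engine.prepare_compressor("PTQ", default_method="fp8_dynamic")        # hypothetical args
engine.run()             # calibrate and compress
engine.save("./output")  # write the compressed checkpoint
```

The compressed checkpoint can then be served through vLLM or SGLang, both of which expose OpenAI-compatible endpoints, as noted in the highlights.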