Side-by-side comparison of stars, features, and trends
DeepGEMM is a unified CUDA library providing high-performance tensor core kernels specifically optimized for modern large language models. It features a lightweight Just-In-Time compilation module that eliminates the need for CUDA compilation during installation. The library delivers expert-level performance across various matrix shapes while maintaining a clean and accessible codebase for kernel optimization.
FlashMLA is a library of high-performance attention kernels developed by DeepSeek to power their V3 and V3.2-Exp models. It provides specialized implementations for both sparse and dense attention mechanisms across prefill and decoding stages. The library is designed for NVIDIA GPU architectures and supports advanced features like FP8 KV caching to maximize computational efficiency.