HubLens › Deep Learning › deepseek-ai/DeepGEMM

DeepGEMM

AI · CUDA · GEMM · Deep Learning · GPU Optimization · PyTorch
View on GitHub · 6,348 stars (+340)

// summary

DeepGEMM is a lightweight CUDA library for efficient General Matrix Multiplications (GEMMs), supporting the FP8 and BF16 data formats. It uses a just-in-time compilation module to eliminate pre-installation kernel builds while maintaining performance comparable to expert-tuned libraries. The library provides specialized APIs for dense and MoE-grouped GEMMs, making it an approachable resource for learning GPU kernel optimization.

// technical analysis

DeepGEMM is a specialized CUDA library for high-performance General Matrix Multiplications (GEMMs), optimized for the FP8 and BF16 data formats in both dense and Mixture-of-Experts (MoE) architectures. Using a lightweight Just-In-Time (JIT) compilation module, it eliminates the need for pre-installation kernel compilation while maintaining performance that rivals expert-tuned libraries. The project prioritizes simplicity and accessibility by avoiding heavy template reliance, serving both as a production-ready tool for DeepSeek-style models and as an educational resource for NVIDIA GPU kernel optimization.
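To make the FP8 idea concrete: low-precision operands lose dynamic range, so fine-grained scaling factors accompany each block of values and are applied back during accumulation. The pure-Python sketch below is illustrative only (it is not DeepGEMM's kernel or API, and the block size of 4 is a toy value standing in for the 128-wide blocks typical of FP8 GEMM schemes):

```python
# Conceptual sketch (NOT DeepGEMM's kernel): a GEMM-style dot product where
# each block of the inputs carries its own scaling factor, so narrowed values
# are rescaled back to full precision on accumulation.

def quantize_blockwise(row, block=4):
    """Split a row into blocks; store each block's max-abs as its scale."""
    scales, quantized = [], []
    for i in range(0, len(row), block):
        chunk = row[i:i + block]
        s = max(abs(x) for x in chunk) or 1.0
        scales.append(s)
        quantized.append([x / s for x in chunk])  # values now in [-1, 1]
    return quantized, scales

def scaled_dot(q_a, s_a, q_b, s_b):
    """Dot product of two block-quantized rows, rescaling per block."""
    total = 0.0
    for blk_a, sa, blk_b, sb in zip(q_a, s_a, q_b, s_b):
        partial = sum(x * y for x, y in zip(blk_a, blk_b))
        total += partial * sa * sb  # undo both scales on accumulation
    return total
```

In a real FP8 kernel the per-block values would additionally be rounded to 8-bit floats; here the rescaling is lossless, so the result matches an exact dot product.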

// key highlights

01
Supports high-performance FP8 and BF16 GEMM operations for both dense and MoE model architectures.
02
Utilizes a lightweight JIT compilation module to compile kernels at runtime, removing the need for complex pre-installation builds.
03
Provides specialized grouped GEMM APIs for contiguous and masked layouts, optimized for MoE training and inference scenarios.
04
Includes dedicated MQA (Multi-Query Attention) scoring kernels designed for the lightning indexer used in DeepSeek v3.2.
05
Achieves high performance on modern NVIDIA architectures, reaching up to 1550 TFLOPS on H800 GPUs.
06
Offers a suite of utility functions for managing TMA alignment, tensor core utilization, and scaling factor transformations.
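The JIT highlight above can be sketched as a shape-keyed kernel cache: a kernel specialized for a given problem shape is generated and compiled the first time that shape is seen, then reused. This is a hypothetical Python illustration of the caching pattern, not DeepGEMM's actual JIT module:

```python
# Sketch of the runtime-compilation idea (hypothetical; not DeepGEMM's JIT):
# kernels are generated and cached per problem shape on first use, so no
# ahead-of-time build step is required.

_kernel_cache = {}

def get_kernel(m, n, k):
    """Return a callable specialized for (m, n, k), 'compiling' on first use."""
    key = (m, n, k)
    if key not in _kernel_cache:
        # Stand-in for real code generation + nvcc/nvrtc compilation.
        src = f"gemm<{m}, {n}, {k}>"
        _kernel_cache[key] = lambda: f"running {src}"
    return _kernel_cache[key]
```

Because the cache key is the problem shape, repeated calls with the same dimensions skip the (expensive) compilation step entirely.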

// use cases

01
High-performance FP8 and BF16 dense matrix multiplication for NVIDIA GPUs
02
Efficient MoE-grouped GEMM operations for both contiguous and masked layouts
03
Specialized MQA logit kernels for advanced model indexing and inference
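The masked grouped-GEMM use case can be pictured as follows: in MoE inference, each expert owns a fixed-capacity slice of the batch, and a per-expert count says how many rows in that slice are actually valid; the kernel computes only that valid prefix. The pure-Python sketch below illustrates the layout only (all names are illustrative; this is not DeepGEMM's API):

```python
# Pure-Python sketch of the masked grouped-GEMM idea (illustrative only):
# each expert owns a fixed-size slice of the LHS rows, and masked_m[e] says
# how many rows in expert e's slice are valid; padding rows are skipped.

def masked_grouped_gemm(lhs, rhs_per_expert, masked_m, expert_cap):
    """lhs: (num_experts * expert_cap) rows of length K;
    rhs_per_expert: one K x N matrix per expert."""
    out = []
    for e, rhs in enumerate(rhs_per_expert):
        base = e * expert_cap
        for r in range(expert_cap):
            if r < masked_m[e]:  # valid row for this expert
                row = lhs[base + r]
                out.append([sum(row[k] * rhs[k][j] for k in range(len(row)))
                            for j in range(len(rhs[0]))])
            else:
                out.append(None)  # padding row: no work performed
    return out
```

The contiguous-layout variant differs in that valid rows for each expert are packed back-to-back (with alignment padding) rather than sitting at fixed per-expert offsets.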

// getting started

To begin, clone the repository recursively to ensure all submodules are included. Run the provided 'develop.sh' script to link essential includes and build the C++ JIT module, then execute the test scripts in the 'tests/' directory to verify functionality. Finally, run 'install.sh' to finalize the setup before importing 'deep_gemm' into your Python projects.
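The steps above, written out as shell commands (the clone URL is assumed from the deepseek-ai GitHub organization; the script names follow the description above, and exact test file names may differ):

```shell
# Clone with submodules included
git clone --recursive https://github.com/deepseek-ai/DeepGEMM.git
cd DeepGEMM

# Link essential includes and build the C++ JIT module
./develop.sh

# Verify functionality via the test scripts in tests/
python tests/test_core.py

# Finalize the setup
./install.sh
```

After installation, `import deep_gemm` should succeed from any Python project in that environment.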