HubLens › Deep Learning › deepseek-ai/DeepGEMM

DeepGEMM

AI · CUDA · GEMM · Deep Learning · GPU Optimization · PyTorch
View on GitHub · 6,348 stars (+340)

// summary

DeepGEMM is a lightweight CUDA library for efficient General Matrix Multiplications (GEMMs), supporting the FP8 and BF16 data formats. It uses a just-in-time compilation module to eliminate pre-installation kernel builds while maintaining performance comparable to expert-tuned libraries. The library provides specialized APIs for dense and MoE-grouped GEMMs, making it an approachable resource for learning GPU kernel optimization.

// technical analysis

DeepGEMM is a specialized CUDA library for high-performance General Matrix Multiplications (GEMMs), optimized for the FP8 and BF16 data formats in both dense and Mixture-of-Experts (MoE) architectures. Using a lightweight Just-In-Time (JIT) compilation module, it eliminates the need for pre-installation kernel compilation while maintaining performance that rivals expert-tuned libraries. The project prioritizes simplicity and accessibility by avoiding heavy template reliance, serving both as a production-ready tool for DeepSeek-style models and as an educational resource for NVIDIA GPU kernel optimization.
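To make the FP8 idea concrete: low-precision operands lose dynamic range, so fine-grained scaling factors accompany each block of values and are applied back during accumulation. The pure-Python sketch below is illustrative only (it is not DeepGEMM's kernel or API, and the block size of 4 is a toy value standing in for the 128-wide blocks typical of FP8 GEMM schemes):

```python
# Conceptual sketch (NOT DeepGEMM's kernel): a GEMM-style dot product where
# each block of the inputs carries its own scaling factor, so narrowed values
# are rescaled back to full precision on accumulation.

def quantize_blockwise(row, block=4):
    """Split a row into blocks; store each block's max-abs as its scale."""
    scales, quantized = [], []
    for i in range(0, len(row), block):
        chunk = row[i:i + block]
        s = max(abs(x) for x in chunk) or 1.0
        scales.append(s)
        quantized.append([x / s for x in chunk])  # values now in [-1, 1]
    return quantized, scales

def scaled_dot(q_a, s_a, q_b, s_b):
    """Dot product of two block-quantized rows, rescaling per block."""
    total = 0.0
    for blk_a, sa, blk_b, sb in zip(q_a, s_a, q_b, s_b):
        partial = sum(x * y for x, y in zip(blk_a, blk_b))
        total += partial * sa * sb  # undo both scales on accumulation
    return total
```

In a real FP8 kernel the per-block values would additionally be rounded to 8-bit floats; here the rescaling is lossless, so the result matches an exact dot product.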

// key highlights

01
Supports high-performance FP8 and BF16 GEMM operations for both dense and MoE model architectures.
02
Utilizes a lightweight JIT compilation module to compile kernels at runtime, removing the need for complex pre-installation builds.
03
Provides specialized grouped GEMM APIs for contiguous and masked layouts, optimized for MoE training and inference scenarios.
04
Includes dedicated MQA (Multi-Query Attention) scoring kernels designed for the lightning indexer used in DeepSeek v3.2.
05
Achieves high performance on modern NVIDIA architectures, reaching up to 1550 TFLOPS on H800 GPUs.
06
Offers a suite of utility functions for managing TMA alignment, tensor core utilization, and scaling factor transformations.
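The JIT highlight above can be sketched as a shape-keyed kernel cache: a kernel specialized for a given problem shape is generated and compiled the first time that shape is seen, then reused. This is a hypothetical Python illustration of the caching pattern, not DeepGEMM's actual JIT module:

```python
# Sketch of the runtime-compilation idea (hypothetical; not DeepGEMM's JIT):
# kernels are generated and cached per problem shape on first use, so no
# ahead-of-time build step is required.

_kernel_cache = {}

def get_kernel(m, n, k):
    """Return a callable specialized for (m, n, k), 'compiling' on first use."""
    key = (m, n, k)
    if key not in _kernel_cache:
        # Stand-in for real code generation + nvcc/nvrtc compilation.
        src = f"gemm<{m}, {n}, {k}>"
        _kernel_cache[key] = lambda: f"running {src}"
    return _kernel_cache[key]
```

Because the cache key is the problem shape, repeated calls with the same dimensions skip the (expensive) compilation step entirely.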

// use cases

01
High-performance FP8 and BF16 dense matrix multiplication for NVIDIA GPUs
02
Efficient MoE-grouped GEMM operations for both contiguous and masked layouts
03
Specialized MQA logit kernels for advanced model indexing and inference
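The masked grouped-GEMM use case can be pictured as follows: in MoE inference, each expert owns a fixed-capacity slice of the batch, and a per-expert count says how many rows in that slice are actually valid; the kernel computes only that valid prefix. The pure-Python sketch below illustrates the layout only (all names are illustrative; this is not DeepGEMM's API):

```python
# Pure-Python sketch of the masked grouped-GEMM idea (illustrative only):
# each expert owns a fixed-size slice of the LHS rows, and masked_m[e] says
# how many rows in expert e's slice are valid; padding rows are skipped.

def masked_grouped_gemm(lhs, rhs_per_expert, masked_m, expert_cap):
    """lhs: (num_experts * expert_cap) rows of length K;
    rhs_per_expert: one K x N matrix per expert."""
    out = []
    for e, rhs in enumerate(rhs_per_expert):
        base = e * expert_cap
        for r in range(expert_cap):
            if r < masked_m[e]:  # valid row for this expert
                row = lhs[base + r]
                out.append([sum(row[k] * rhs[k][j] for k in range(len(row)))
                            for j in range(len(rhs[0]))])
            else:
                out.append(None)  # padding row: no work performed
    return out
```

The contiguous-layout variant differs in that valid rows for each expert are packed back-to-back (with alignment padding) rather than sitting at fixed per-expert offsets.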

// getting started

To begin, clone the repository recursively to ensure all submodules are included. Run the provided 'develop.sh' script to link essential includes and build the C++ JIT module, then execute the test scripts in the 'tests/' directory to verify functionality. Finally, run 'install.sh' to finalize the setup before importing 'deep_gemm' into your Python projects.
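The steps above, written out as shell commands (the clone URL is assumed from the deepseek-ai GitHub organization; the script names follow the description above, and exact test file names may differ):

```shell
# Clone with submodules included
git clone --recursive https://github.com/deepseek-ai/DeepGEMM.git
cd DeepGEMM

# Link essential includes and build the C++ JIT module
./develop.sh

# Verify functionality via the test scripts in tests/
python tests/test_core.py

# Finalize the setup
./install.sh
```

After installation, `import deep_gemm` should succeed from any Python project in that environment.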