// summary
DeepGEMM is a lightweight CUDA library designed for efficient General Matrix Multiplications, supporting FP8 and BF16 data formats. It utilizes a just-in-time compilation module to eliminate the need for pre-installation kernel compilation while maintaining performance comparable to expert-tuned libraries. The library provides specialized APIs for dense and MoE-grouped GEMMs, making it a clean resource for learning GPU kernel optimization.
// technical analysis
DeepGEMM is a specialized CUDA library designed for high-performance General Matrix Multiplications (GEMMs), specifically optimized for FP8 and BF16 data formats in both dense and Mixture-of-Experts (MoE) architectures. By utilizing a lightweight Just-In-Time (JIT) compilation module, it compiles kernels at runtime, eliminating the need for pre-installation kernel compilation while maintaining performance that rivals expert-tuned libraries. The project prioritizes simplicity and accessibility by avoiding heavy template reliance, serving as both a production-ready tool for DeepSeek-style models and an educational resource for NVIDIA GPU kernel optimization.
// key highlights
// use cases
// getting started
To begin, clone the repository recursively so that all submodules are included. Run the provided 'develop.sh' script to link the essential includes and build the C++ JIT module, then execute the test scripts in the 'tests/' directory to verify functionality. Finally, run 'install.sh' to finalize the setup, after which you can import 'deep_gemm' into your Python projects.
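The steps above can be sketched as a short shell session. This is a minimal sketch based on the description: the repository URL and the exact test script name ('tests/test_core.py') are assumptions, so adjust them to match the actual repository layout.

```shell
# Clone recursively so all submodules are included
# (URL assumed; substitute the actual repository location)
git clone --recursive https://github.com/deepseek-ai/DeepGEMM.git
cd DeepGEMM

# Link essential includes and build the C++ JIT module
./develop.sh

# Verify functionality via the test scripts in tests/
# (script name assumed for illustration)
python tests/test_core.py

# Finalize the setup so 'deep_gemm' can be imported in Python
./install.sh

# Quick check that the package is importable
python -c "import deep_gemm"
```

Since the JIT module compiles kernels on first use, running the tests after 'develop.sh' also serves as a smoke test for the runtime compilation path.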