HubLensTopicsPyTorch
// topic

PyTorch

10 trending in last 90 days · 10 all-time

// new this month

// ecosystem

LLM 5 · CUDA 3 · Deep Learning 3 · DeepSeek 2 · Attention 2 · PyTorch AI 10

// this week's top 5

01
deepseek-ai / FlashMLA
FlashMLA is a library of high-performance attention kernels developed by DeepSeek to power their V3 and V3.2-Exp models. It provides specialized implementations for both sparse and dense attention mechanisms across prefill and decoding stages. The library is designed for NVIDIA GPU architectures and supports advanced features like FP8 KV caching to maximize computational efficiency.
92 · 12,583
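To make the decoding-stage workload concrete, here is a plain-PyTorch reference for one dense attention step against a KV cache. This is only the math that FlashMLA's fused kernels accelerate, not the library's API; shapes and names are illustrative.

```python
import torch
import torch.nn.functional as F

def decode_step_attention(q, k_cache, v_cache):
    """One decoding step of dense attention against a KV cache.

    q:       (batch, heads, 1, head_dim)   -- the single new query token
    k_cache: (batch, heads, seq, head_dim) -- cached keys for prior tokens
    v_cache: (batch, heads, seq, head_dim) -- cached values for prior tokens
    """
    scale = q.shape[-1] ** -0.5
    scores = torch.matmul(q, k_cache.transpose(-2, -1)) * scale  # (b, h, 1, seq)
    probs = F.softmax(scores, dim=-1)
    return torch.matmul(probs, v_cache)                          # (b, h, 1, head_dim)

b, h, s, d = 2, 4, 16, 64
q = torch.randn(b, h, 1, d)
k = torch.randn(b, h, s, d)
v = torch.randn(b, h, s, d)
out = decode_step_attention(q, k, v)
print(out.shape)  # torch.Size([2, 4, 1, 64])
```

A fused kernel collapses these three launches into one and streams the cache through fast on-chip memory, which is where the throughput gain comes from.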
02
deepseek-ai / TileKernels
TileKernels provides a collection of high-performance GPU kernels specifically designed for large language model operations using the TileLang framework. The project includes specialized implementations for Mixture of Experts routing, advanced quantization techniques, and manifold hyper-connection operations. These kernels are built to maximize hardware performance and are currently utilized in internal training and inference workflows.
82 · 593
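The Mixture of Experts routing that TileKernels fuses can be sketched in a few lines of plain PyTorch; this is the standard top-k gating pattern, not TileKernels' own interface, and the gate weight here is a hypothetical stand-in.

```python
import torch
import torch.nn.functional as F

def topk_moe_route(x, gate_weight, k=2):
    """Top-k MoE routing: score every expert per token, keep the k best,
    and renormalize the kept scores into combine weights."""
    logits = x @ gate_weight                      # (tokens, n_experts)
    topk_vals, topk_idx = logits.topk(k, dim=-1)  # indices of the k best experts
    weights = F.softmax(topk_vals, dim=-1)        # renormalize over the chosen k
    return topk_idx, weights

torch.manual_seed(0)
x = torch.randn(8, 32)     # 8 tokens, hidden size 32
gate = torch.randn(32, 4)  # 4 experts (illustrative sizes)
idx, w = topk_moe_route(x, gate)
print(idx.shape, w.shape)  # torch.Size([8, 2]) torch.Size([8, 2])
```

A production kernel fuses the scoring, top-k selection, and the subsequent scatter of tokens to experts into one pass over the data.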
03
baidu / vLLM-Kunlun
vLLM Kunlun is a community-maintained hardware plugin that enables seamless execution of vLLM on Kunlun XPU devices. It uses vLLM's hardware-pluggable interface to keep the integration decoupled, and it is compatible with a range of Transformer, Mixture-of-Experts, and multimodal models. The plugin is the recommended path for deploying high-performance LLMs on Kunlun P800 hardware.
78 · 401
04
rohitg00 / ai-engineering-from-scratch
This comprehensive course provides a structured journey from fundamental linear algebra to building advanced autonomous agent swarms. It emphasizes an AI-native learning approach where students utilize AI coding agents to test their understanding and build reusable tools. Every lesson is designed to produce tangible outputs, including prompts, skills, and MCP servers, ensuring students gain practical professional experience.
78 · 50
05
alibaba / TorchEasyRec
TorchEasyRec is a PyTorch-based framework designed for building production-ready deep learning recommendation models. It supports a wide range of tasks including candidate generation, ranking, multi-task learning, and generative recommendation. The framework provides flexible configuration, distributed training capabilities, and seamless integration with various data sources and deployment environments.
78 · 364
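The candidate-generation task TorchEasyRec covers is typically a two-tower model; the minimal sketch below shows the idea in plain PyTorch and is not TorchEasyRec's configuration-driven API.

```python
import torch
import torch.nn as nn

class TwoTower(nn.Module):
    """Minimal two-tower candidate-generation model: user and item
    embeddings scored by dot product (illustrative, plain PyTorch)."""
    def __init__(self, n_users, n_items, dim=16):
        super().__init__()
        self.user = nn.Embedding(n_users, dim)
        self.item = nn.Embedding(n_items, dim)

    def forward(self, user_ids, item_ids):
        # Dot product of the two towers' embeddings -> relevance score
        return (self.user(user_ids) * self.item(item_ids)).sum(-1)

model = TwoTower(n_users=100, n_items=1000)
scores = model(torch.tensor([0, 1]), torch.tensor([42, 7]))
print(scores.shape)  # torch.Size([2])
```

In retrieval, the item tower is precomputed offline so that serving reduces to a nearest-neighbor search over item embeddings.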

// all-time featured (10)

deepseek-ai / FlashMLA
FlashMLA is a library of high-performance attention kernels developed by DeepSeek to power their V3 and V3.2-Exp models. It provides specialized implementations for both sparse and dense attention mechanisms across prefill and decoding stages. The library is designed for NVIDIA GPU architectures and supports advanced features like FP8 KV caching to maximize computational efficiency.
92
deepseek-ai / FlashMLA
FlashMLA is a library of high-performance attention kernels developed by DeepSeek to power their V3 and V3.2-Exp models. The repository provides specialized implementations for both sparse and dense attention mechanisms during prefill and decoding stages. These kernels are optimized for NVIDIA GPU architectures, including SM90 and SM100, to achieve significant computational throughput.
86
deepseek-ai / TileKernels
TileKernels provides a collection of high-performance GPU kernels specifically designed for large language model operations using the TileLang framework. The project includes specialized implementations for Mixture of Experts routing, advanced quantization techniques, and manifold hyper-connection operations. These kernels are built to maximize hardware performance and are currently utilized in internal training and inference workflows.
82
baidu / vLLM-Kunlun
vLLM Kunlun is a community-maintained hardware plugin that enables seamless execution of vLLM on Kunlun XPU devices. It uses vLLM's hardware-pluggable interface to keep the integration decoupled, and it is compatible with a range of Transformer, Mixture-of-Experts, and multimodal models. The plugin is the recommended path for deploying high-performance LLMs on Kunlun P800 hardware.
78
rohitg00 / ai-engineering-from-scratch
This comprehensive course provides a structured journey from fundamental linear algebra to building advanced autonomous agent swarms. It emphasizes an AI-native learning approach where students utilize AI coding agents to test their understanding and build reusable tools. Every lesson is designed to produce tangible outputs, including prompts, skills, and MCP servers, ensuring students gain practical professional experience.
78
alibaba / TorchEasyRec
TorchEasyRec is a PyTorch-based framework designed for building production-ready deep learning recommendation models. It supports a wide range of tasks including candidate generation, ranking, multi-task learning, and generative recommendation. The framework provides flexible configuration, distributed training capabilities, and seamless integration with various data sources and deployment environments.
78
google-research / timesfm
TimesFM is a decoder-only foundation model developed by Google Research specifically for time-series forecasting tasks. The latest 2.5 version features a 200M parameter architecture that supports up to 16k context length and continuous quantile forecasting. The repository provides comprehensive tools for inference, fine-tuning with LoRA, and integration with agentic workflows.
78
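Quantile forecasts of the kind TimesFM emits are conventionally trained with the pinball (quantile) loss; the sketch below shows that objective in PyTorch and is not TimesFM's internal code.

```python
import torch

def pinball_loss(pred, target, q):
    """Pinball (quantile) loss: minimized in expectation when pred is
    the q-th quantile of the target distribution."""
    err = target - pred
    return torch.maximum(q * err, (q - 1) * err).mean()

torch.manual_seed(0)
target = torch.randn(1000)
# Evaluate the loss at a fixed zero prediction for three quantile levels.
for q in (0.1, 0.5, 0.9):
    print(q, float(pinball_loss(torch.zeros(1000), target, q)))
```

At q = 0.5 the pinball loss reduces to half the mean absolute error, which is why the median forecast is the MAE-optimal point estimate.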
PaddlePaddle / PaConvert
This tool is officially maintained by Paddle and aims to achieve efficient automated migration from PyTorch code to PaddlePaddle code. It supports one-click conversion of over 1,600 PyTorch APIs and 200 torchvision APIs, maintaining an average conversion rate of over 95% in tests. The conversion process is operated via the command line, preserves the style and structure of the original code, and provides detailed conversion logs and summaries.
48
nikopueringer / CorridorKey
CorridorKey is a neural network-based tool designed to solve the complex problem of unmixing foreground subjects from green screen backgrounds. By predicting the true straight color and a clean linear alpha channel for every pixel, it preserves delicate details like motion blur and transparency that traditional keyers often destroy. The software supports high-fidelity VFX workflows by outputting 16-bit and 32-bit linear float EXR files compatible with industry-standard compositing applications.
38
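The unmixing problem CorridorKey solves is governed by the linear compositing equation observed = alpha * fg + (1 - alpha) * bg. With two known backgrounds it has a closed-form solution (Smith & Blinn triangulation matting); CorridorKey's contribution is estimating alpha and the straight color from a single plate with a neural network, so the code below is only the underlying math.

```python
def unmix_two_backgrounds(o1, o2, b1, b2):
    """Recover alpha and the straight (un-premultiplied) foreground color
    from the same pixel observed over two known backgrounds.
    All values are linear-light floats."""
    alpha = 1.0 - (o1 - o2) / (b1 - b2)
    fg = (o1 - (1.0 - alpha) * b1) / alpha  # straight foreground color
    return alpha, fg

# Synthesize a pixel: fg = 0.8 at alpha = 0.25 over backgrounds 0.0 and 1.0
a_true, fg_true = 0.25, 0.8
o1 = a_true * fg_true + (1 - a_true) * 0.0
o2 = a_true * fg_true + (1 - a_true) * 1.0
alpha, fg = unmix_two_backgrounds(o1, o2, 0.0, 1.0)
print(round(alpha, 6), round(fg, 6))  # 0.25 0.8
```

Keeping everything in linear float is what preserves motion blur and transparency: fractional alpha values carry real signal instead of being clipped to a hard matte.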
NVIDIA / personaplex
PersonaPlex is a real-time, full-duplex speech-to-speech model built on the Moshi architecture that enables precise persona control through text prompts and audio voice conditioning. The model is trained on a mix of synthetic and real-world conversational data to deliver natural, low-latency interactions. Users can deploy the model via a provided server interface or perform offline evaluations using specific voice embeddings and role-based prompts.
38

// use cases by project

FlashMLA
  • 01 Token-level sparse attention for efficient prefill and decoding stages
  • 02 Dense attention kernels for standard Multi-Head Attention (MHA) and high-throughput inference
  • 03 FP8 KV cache support to reduce memory footprint during decoding
TileKernels
  • 01 Mixture of Experts (MoE) routing and gating operations
  • 02 FP8, FP4, and E5M6 quantization with fused SwiGLU support
  • 03 High-level PyTorch autograd wrappers for trainable modeling layers
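An autograd wrapper like the one listed above follows a standard PyTorch pattern: a `torch.autograd.Function` whose forward calls the hand-written kernel and whose backward supplies the matching gradient. In this sketch a plain square stands in for a real fused GPU kernel; the class name is hypothetical.

```python
import torch

class FusedSquare(torch.autograd.Function):
    """Pattern for exposing a custom kernel to autograd."""
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x * x  # a real wrapper would launch the custom kernel here

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        return 2 * x * grad_out  # analytic gradient of x**2

x = torch.tensor(3.0, requires_grad=True)
y = FusedSquare.apply(x)
y.backward()
print(x.grad)  # tensor(6.)
```

Because the backward is supplied explicitly, the kernel stays a black box to autograd while still composing with ordinary trainable layers.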
vLLM-Kunlun
  • 01 Running Transformer-based and Mixture-of-Experts LLMs on Kunlun XPU
  • 02 Deploying multimodal language models with hardware-optimized performance
  • 03 Enabling LoRA fine-tuning and quantization support for efficient model inference
ai-engineering-from-scratch
  • 01 Building a portfolio of reusable AI tools, prompts, and agents
  • 02 Learning AI concepts through hands-on implementation in Python, TypeScript, Rust, and Julia
  • 03 Integrating AI-native development workflows using Claude Code and MCP servers

// comparisons

// related topics