HubLensTopicsPyTorch
// topic

PyTorch

10 trending in last 90 days · 10 all-time

// new this month

// ecosystem

LLM 5 · CUDA 3 · Deep Learning 3 · DeepSeek 2 · Attention 2 · PyTorch AI 10

// this week's top 5

01
deepseek-ai / FlashMLA
FlashMLA is a library of high-performance attention kernels developed by DeepSeek to power their V3 and V3.2-Exp models. It provides specialized implementations for both sparse and dense attention mechanisms across prefill and decoding stages. The library is designed for NVIDIA GPU architectures and supports advanced features like FP8 KV caching to maximize computational efficiency.
92 · 12,583
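To make the decoding-stage workload concrete, here is a plain-PyTorch reference for one dense attention step against a KV cache. This is only the math that FlashMLA's fused kernels accelerate, not the library's API; shapes and names are illustrative.

```python
import torch
import torch.nn.functional as F

def decode_step_attention(q, k_cache, v_cache):
    """One decoding step of dense attention against a KV cache.

    q:       (batch, heads, 1, head_dim)   -- the single new query token
    k_cache: (batch, heads, seq, head_dim) -- cached keys for prior tokens
    v_cache: (batch, heads, seq, head_dim) -- cached values for prior tokens
    """
    scale = q.shape[-1] ** -0.5
    scores = torch.matmul(q, k_cache.transpose(-2, -1)) * scale  # (b, h, 1, seq)
    probs = F.softmax(scores, dim=-1)
    return torch.matmul(probs, v_cache)                          # (b, h, 1, head_dim)

b, h, s, d = 2, 4, 16, 64
q = torch.randn(b, h, 1, d)
k = torch.randn(b, h, s, d)
v = torch.randn(b, h, s, d)
out = decode_step_attention(q, k, v)
print(out.shape)  # torch.Size([2, 4, 1, 64])
```

A fused kernel collapses these three launches into one and streams the cache through fast on-chip memory, which is where the throughput gain comes from.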
02
deepseek-ai / TileKernels
TileKernels provides a collection of high-performance GPU kernels specifically designed for large language model operations using the TileLang framework. The project includes specialized implementations for Mixture of Experts routing, advanced quantization techniques, and manifold hyper-connection operations. These kernels are built to maximize hardware performance and are currently utilized in internal training and inference workflows.
82 · 593
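The Mixture of Experts routing that TileKernels fuses can be sketched in a few lines of plain PyTorch; this is the standard top-k gating pattern, not TileKernels' own interface, and the gate weight here is a hypothetical stand-in.

```python
import torch
import torch.nn.functional as F

def topk_moe_route(x, gate_weight, k=2):
    """Top-k MoE routing: score every expert per token, keep the k best,
    and renormalize the kept scores into combine weights."""
    logits = x @ gate_weight                      # (tokens, n_experts)
    topk_vals, topk_idx = logits.topk(k, dim=-1)  # indices of the k best experts
    weights = F.softmax(topk_vals, dim=-1)        # renormalize over the chosen k
    return topk_idx, weights

torch.manual_seed(0)
x = torch.randn(8, 32)     # 8 tokens, hidden size 32
gate = torch.randn(32, 4)  # 4 experts (illustrative sizes)
idx, w = topk_moe_route(x, gate)
print(idx.shape, w.shape)  # torch.Size([8, 2]) torch.Size([8, 2])
```

A production kernel fuses the scoring, top-k selection, and the subsequent scatter of tokens to experts into one pass over the data.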
03
baidu / vLLM-Kunlun
vLLM Kunlun is a community-maintained hardware plugin that enables seamless execution of vLLM on Kunlun XPU devices. It uses vLLM's hardware-pluggable interface to keep the integration decoupled, and it is compatible with a range of Transformer, Mixture-of-Experts, and multimodal models. The plugin is the recommended path for deploying high-performance LLMs on Kunlun P800 hardware.
78 · 401
04
rohitg00 / ai-engineering-from-scratch
This comprehensive course provides a structured journey from fundamental linear algebra to building advanced autonomous agent swarms. It emphasizes an AI-native learning approach where students utilize AI coding agents to test their understanding and build reusable tools. Every lesson is designed to produce tangible outputs, including prompts, skills, and MCP servers, ensuring students gain practical professional experience.
78 · 50
05
alibaba / TorchEasyRec
TorchEasyRec is a PyTorch-based framework designed for building production-ready deep learning recommendation models. It supports a wide range of tasks including candidate generation, ranking, multi-task learning, and generative recommendation. The framework provides flexible configuration, distributed training capabilities, and seamless integration with various data sources and deployment environments.
78 · 364
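The candidate-generation task TorchEasyRec covers is typically a two-tower model; the minimal sketch below shows the idea in plain PyTorch and is not TorchEasyRec's configuration-driven API.

```python
import torch
import torch.nn as nn

class TwoTower(nn.Module):
    """Minimal two-tower candidate-generation model: user and item
    embeddings scored by dot product (illustrative, plain PyTorch)."""
    def __init__(self, n_users, n_items, dim=16):
        super().__init__()
        self.user = nn.Embedding(n_users, dim)
        self.item = nn.Embedding(n_items, dim)

    def forward(self, user_ids, item_ids):
        # Dot product of the two towers' embeddings -> relevance score
        return (self.user(user_ids) * self.item(item_ids)).sum(-1)

model = TwoTower(n_users=100, n_items=1000)
scores = model(torch.tensor([0, 1]), torch.tensor([42, 7]))
print(scores.shape)  # torch.Size([2])
```

In retrieval, the item tower is precomputed offline so that serving reduces to a nearest-neighbor search over item embeddings.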

// all-time featured (10)

deepseek-ai / FlashMLA
FlashMLA is a library of high-performance attention kernels developed by DeepSeek to power their V3 and V3.2-Exp models. It provides specialized implementations for both sparse and dense attention mechanisms across prefill and decoding stages. The library is designed for NVIDIA GPU architectures and supports advanced features like FP8 KV caching to maximize computational efficiency.
92
deepseek-ai / FlashMLA
FlashMLA is a library of high-performance attention kernels developed by DeepSeek to power their V3 and V3.2-Exp models. The repository provides specialized implementations for both sparse and dense attention mechanisms during prefill and decoding stages. These kernels are optimized for NVIDIA GPU architectures, including SM90 and SM100, to achieve significant computational throughput.
86
deepseek-ai / TileKernels
TileKernels provides a collection of high-performance GPU kernels specifically designed for large language model operations using the TileLang framework. The project includes specialized implementations for Mixture of Experts routing, advanced quantization techniques, and manifold hyper-connection operations. These kernels are built to maximize hardware performance and are currently utilized in internal training and inference workflows.
82
baidu / vLLM-Kunlun
vLLM Kunlun is a community-maintained hardware plugin that enables seamless execution of vLLM on Kunlun XPU devices. It uses vLLM's hardware-pluggable interface to keep the integration decoupled, and it is compatible with a range of Transformer, Mixture-of-Experts, and multimodal models. The plugin is the recommended path for deploying high-performance LLMs on Kunlun P800 hardware.
78
rohitg00 / ai-engineering-from-scratch
This comprehensive course provides a structured journey from fundamental linear algebra to building advanced autonomous agent swarms. It emphasizes an AI-native learning approach where students utilize AI coding agents to test their understanding and build reusable tools. Every lesson is designed to produce tangible outputs, including prompts, skills, and MCP servers, ensuring students gain practical professional experience.
78
alibaba / TorchEasyRec
TorchEasyRec is a PyTorch-based framework designed for building production-ready deep learning recommendation models. It supports a wide range of tasks including candidate generation, ranking, multi-task learning, and generative recommendation. The framework provides flexible configuration, distributed training capabilities, and seamless integration with various data sources and deployment environments.
78
google-research / timesfm
TimesFM is a decoder-only foundation model developed by Google Research specifically for time-series forecasting tasks. The latest 2.5 version features a 200M parameter architecture that supports up to 16k context length and continuous quantile forecasting. The repository provides comprehensive tools for inference, fine-tuning with LoRA, and integration with agentic workflows.
78
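Quantile forecasts of the kind TimesFM emits are conventionally trained with the pinball (quantile) loss; the sketch below shows that objective in PyTorch and is not TimesFM's internal code.

```python
import torch

def pinball_loss(pred, target, q):
    """Pinball (quantile) loss: minimized in expectation when pred is
    the q-th quantile of the target distribution."""
    err = target - pred
    return torch.maximum(q * err, (q - 1) * err).mean()

torch.manual_seed(0)
target = torch.randn(1000)
# Evaluate the loss at a fixed zero prediction for three quantile levels.
for q in (0.1, 0.5, 0.9):
    print(q, float(pinball_loss(torch.zeros(1000), target, q)))
```

At q = 0.5 the pinball loss reduces to half the mean absolute error, which is why the median forecast is the MAE-optimal point estimate.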
PaddlePaddle / PaConvert
This tool is officially maintained by Paddle and aims to achieve efficient automated migration from PyTorch code to PaddlePaddle code. It supports one-click conversion of over 1,600 PyTorch APIs and 200 torchvision APIs, maintaining an average conversion rate of over 95% in tests. The conversion process is operated via the command line, preserves the style and structure of the original code, and provides detailed conversion logs and summaries.
48
nikopueringer / CorridorKey
CorridorKey is a neural network-based tool designed to solve the complex problem of unmixing foreground subjects from green screen backgrounds. By predicting the true straight color and a clean linear alpha channel for every pixel, it preserves delicate details like motion blur and transparency that traditional keyers often destroy. The software supports high-fidelity VFX workflows by outputting 16-bit and 32-bit linear float EXR files compatible with industry-standard compositing applications.
38
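The unmixing problem CorridorKey solves is governed by the linear compositing equation observed = alpha * fg + (1 - alpha) * bg. With two known backgrounds it has a closed-form solution (Smith & Blinn triangulation matting); CorridorKey's contribution is estimating alpha and the straight color from a single plate with a neural network, so the code below is only the underlying math.

```python
def unmix_two_backgrounds(o1, o2, b1, b2):
    """Recover alpha and the straight (un-premultiplied) foreground color
    from the same pixel observed over two known backgrounds.
    All values are linear-light floats."""
    alpha = 1.0 - (o1 - o2) / (b1 - b2)
    fg = (o1 - (1.0 - alpha) * b1) / alpha  # straight foreground color
    return alpha, fg

# Synthesize a pixel: fg = 0.8 at alpha = 0.25 over backgrounds 0.0 and 1.0
a_true, fg_true = 0.25, 0.8
o1 = a_true * fg_true + (1 - a_true) * 0.0
o2 = a_true * fg_true + (1 - a_true) * 1.0
alpha, fg = unmix_two_backgrounds(o1, o2, 0.0, 1.0)
print(round(alpha, 6), round(fg, 6))  # 0.25 0.8
```

Keeping everything in linear float is what preserves motion blur and transparency: fractional alpha values carry real signal instead of being clipped to a hard matte.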
NVIDIA / personaplex
PersonaPlex is a real-time, full-duplex speech-to-speech model built on the Moshi architecture that enables precise persona control through text prompts and audio voice conditioning. The model is trained on a mix of synthetic and real-world conversational data to deliver natural, low-latency interactions. Users can deploy the model via a provided server interface or perform offline evaluations using specific voice embeddings and role-based prompts.
38

// use cases by project

FlashMLA
  • 01 Token-level sparse attention for efficient prefill and decoding stages
  • 02 Dense attention kernels for standard Multi-Head Attention (MHA) and high-throughput inference
  • 03 FP8 KV cache support to reduce memory footprint during decoding
TileKernels
  • 01 Mixture of Experts (MoE) routing and gating operations
  • 02 FP8, FP4, and E5M6 quantization with fused SwiGLU support
  • 03 High-level PyTorch autograd wrappers for trainable modeling layers
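An autograd wrapper like the one listed above follows a standard PyTorch pattern: a `torch.autograd.Function` whose forward calls the hand-written kernel and whose backward supplies the matching gradient. In this sketch a plain square stands in for a real fused GPU kernel; the class name is hypothetical.

```python
import torch

class FusedSquare(torch.autograd.Function):
    """Pattern for exposing a custom kernel to autograd."""
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x * x  # a real wrapper would launch the custom kernel here

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        return 2 * x * grad_out  # analytic gradient of x**2

x = torch.tensor(3.0, requires_grad=True)
y = FusedSquare.apply(x)
y.backward()
print(x.grad)  # tensor(6.)
```

Because the backward is supplied explicitly, the kernel stays a black box to autograd while still composing with ordinary trainable layers.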
vLLM-Kunlun
  • 01 Running Transformer-based and Mixture-of-Experts LLMs on Kunlun XPU
  • 02 Deploying multimodal language models with hardware-optimized performance
  • 03 Enabling LoRA fine-tuning and quantization support for efficient model inference
ai-engineering-from-scratch
  • 01 Building a portfolio of reusable AI tools, prompts, and agents
  • 02 Learning AI concepts through hands-on implementation in Python, TypeScript, Rust, and Julia
  • 03 Integrating AI-native development workflows using Claude Code and MCP servers

// comparisons

// related topics