// topic

Inference

11 trending in last 90 days · 11 all-time

// ecosystem

LLM 9 · Deep Learning 5 · Quantization 3 · CUDA 2 · PaddlePaddle 2 · Inference AI 11

// this week's top 6

01
toverainc / willow
The Willow Inference Server allows users to self-host high-speed speech and language inference for various applications. It supports speech-to-text, text-to-speech, and large language model processing. Users can consult the official documentation and community discussions to optimize their deployment.
88 · 3,010
02
Tencent / ncnn
ncnn is a high-performance neural network forward computation framework deeply optimized for mobile platforms. The framework has no third-party dependencies and features cross-platform capabilities, outperforming all known open-source frameworks on mobile CPUs. Developers can easily port deep learning models to mobile devices using ncnn to build various intelligent applications.
88 · 23,117
03
alibaba / rtp-llm
RTP-LLM is a high-performance large language model inference acceleration engine developed by the Alibaba Foundation Model Inference team. This engine has been widely applied in various Alibaba business scenarios such as Taobao and Tmall, and supports multiple mainstream model formats and hardware backends. By integrating advanced operator optimization, quantization techniques, and distributed inference capabilities, it provides developers with efficient and flexible deployment solutions.
82 · 1,097
04
mnfst / awesome-free-llm-apis
This repository provides a curated list of LLM API providers that offer permanent free tiers for text inference. It categorizes services into direct provider APIs and third-party inference platforms, detailing model capabilities, context windows, and rate limits for each. The collection serves as a resource for developers seeking cost-effective access to various language models without requiring credit card information.
78 · 141
05
PaddlePaddle / FastDeploy
FastDeploy is an inference deployment toolkit for large language models and vision-language models based on PaddlePaddle, designed to provide out-of-the-box production-grade deployment solutions. The tool supports various mainstream hardware platforms and integrates core technologies such as load-balanced PD separation, unified KV cache transmission, and full quantization format support. Developers can quickly achieve high-performance model inference and deployment through its OpenAI API-compatible service interface.
78 · 3,676
06
alibaba / tair-kvcache
Tair KVCache is an Alibaba Cloud system designed to accelerate Large Language Model inference through distributed memory pooling and dynamic multi-level caching. The project provides a centralized manager for unified metadata handling and a simulation tool for predicting performance metrics without requiring GPU resources. These components work together to improve inference efficiency while reducing overall infrastructure costs.
78 · 137
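
The core idea Tair KVCache describes, demoting attention KV blocks from a small fast tier into a larger pooled tier and promoting them back on reuse, can be sketched in plain Python. Everything below (class and method names, the two-tier LRU policy) is illustrative, not the project's actual API:

```python
from collections import OrderedDict

class TwoTierKVCache:
    """Illustrative two-tier KV cache: a small fast "GPU" tier backed by a
    larger pooled-memory tier. Blocks evicted from the fast tier are demoted,
    and promoted back on reuse (LRU order in both tiers)."""

    def __init__(self, fast_capacity, slow_capacity):
        self.fast = OrderedDict()   # block_id -> KV payload
        self.slow = OrderedDict()
        self.fast_capacity = fast_capacity
        self.slow_capacity = slow_capacity

    def put(self, block_id, kv):
        self.fast[block_id] = kv
        self.fast.move_to_end(block_id)
        self._evict()

    def get(self, block_id):
        if block_id in self.fast:
            self.fast.move_to_end(block_id)
            return self.fast[block_id]
        if block_id in self.slow:           # hit in the pooled tier:
            kv = self.slow.pop(block_id)    # promote back to the fast tier
            self.put(block_id, kv)
            return kv
        return None                         # full miss: caller must recompute prefill

    def _evict(self):
        while len(self.fast) > self.fast_capacity:
            block_id, kv = self.fast.popitem(last=False)   # demote LRU block
            self.slow[block_id] = kv
            while len(self.slow) > self.slow_capacity:
                self.slow.popitem(last=False)               # drop entirely

cache = TwoTierKVCache(fast_capacity=2, slow_capacity=4)
cache.put("prompt-a", [0.1, 0.2])
cache.put("prompt-b", [0.3, 0.4])
cache.put("prompt-c", [0.5, 0.6])            # "prompt-a" demoted to slow tier
assert "prompt-a" not in cache.fast and "prompt-a" in cache.slow
assert cache.get("prompt-a") == [0.1, 0.2]   # promoted back on reuse
```

A pooled tier like this trades transfer latency for avoided prefill recomputation, which is why a separate metadata manager (as tair-kvcache provides) matters at scale.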

// all-time featured (11)

toverainc / willow
The Willow Inference Server allows users to self-host high-speed speech and language inference for various applications. It supports speech-to-text, text-to-speech, and large language model processing. Users can consult the official documentation and community discussions to optimize their deployment.
88
Tencent / ncnn
ncnn is a high-performance neural network forward computation framework deeply optimized for mobile platforms. The framework has no third-party dependencies and features cross-platform capabilities, outperforming all known open-source frameworks on mobile CPUs. Developers can easily port deep learning models to mobile devices using ncnn to build various intelligent applications.
88
Tencent / ncnn
ncnn is a high-performance neural network forward computation framework specifically optimized for mobile platforms, designed to simplify the deployment of deep learning algorithms on mobile devices. The framework has no third-party dependencies and features cross-platform capabilities, with execution speeds on mobile CPUs that outperform all currently known open-source frameworks. Currently, ncnn is widely used in various mainstream applications under Tencent, helping developers easily build intelligent applications.
86
alibaba / rtp-llm
RTP-LLM is a high-performance large language model inference acceleration engine developed by the Alibaba Foundation Model Inference team. This engine has been widely applied in various Alibaba business scenarios such as Taobao and Tmall, and supports multiple mainstream model formats and hardware backends. By integrating advanced operator optimization, quantization techniques, and distributed inference capabilities, it provides developers with efficient and flexible deployment solutions.
82
mnfst / awesome-free-llm-apis
This repository provides a curated list of LLM API providers that offer permanent free tiers for text inference. It categorizes services into direct provider APIs and third-party inference platforms, detailing model capabilities, context windows, and rate limits for each. The collection serves as a resource for developers seeking cost-effective access to various language models without requiring credit card information.
78
PaddlePaddle / FastDeploy
FastDeploy is an inference deployment toolkit for large language models and vision-language models based on PaddlePaddle, designed to provide out-of-the-box production-grade deployment solutions. The tool supports various mainstream hardware platforms and integrates core technologies such as load-balanced PD separation, unified KV cache transmission, and full quantization format support. Developers can quickly achieve high-performance model inference and deployment through its OpenAI API-compatible service interface.
78
alibaba / tair-kvcache
Tair KVCache is an Alibaba Cloud system designed to accelerate Large Language Model inference through distributed memory pooling and dynamic multi-level caching. The project provides a centralized manager for unified metadata handling and a simulation tool for predicting performance metrics without requiring GPU resources. These components work together to improve inference efficiency while reducing overall infrastructure costs.
78
PaddlePaddle / FastDeploy
FastDeploy is an inference deployment toolkit for large language models and vision-language models based on PaddlePaddle, aiming to provide out-of-the-box production-grade deployment solutions. The toolkit supports various mainstream hardware platforms and integrates core technologies such as load-balanced PD separation, unified KV cache transmission, and full quantization format support. By being compatible with OpenAI API and vLLM interfaces, it helps developers efficiently implement model inference and online service deployment.
72
alibaba / rtp-llm
RTP-LLM is a high-performance large model inference acceleration engine developed by the Alibaba Foundation Model Inference Team, widely used in various business scenarios such as Taobao and Tmall. By integrating various advanced CUDA kernels and quantization techniques, the engine significantly improves model inference performance and efficiency. Furthermore, it possesses high flexibility, supporting multiple model formats, multimodal inputs, and LoRA service deployment.
68
baidu / vLLM-Kunlun
vLLM Kunlun is a community-maintained hardware plugin that enables the seamless execution of vLLM on Kunlun XPU devices. It functions as a hardware-pluggable interface, allowing users to run various large language and multimodal models without modifying the original vLLM source code. The project supports advanced features like quantization, LoRA fine-tuning, and hardware-accelerated graph optimization to ensure high-performance inference.
52
google-ai-edge / LiteRT-LM
LiteRT-LM is a high-performance, production-ready inference framework designed by Google for deploying Large Language Models on edge devices. It supports a wide range of platforms including Android, iOS, desktop, and IoT, while leveraging GPU and NPU hardware acceleration for optimal performance. The framework enables advanced capabilities such as multi-modality and function calling, powering on-device AI experiences in various Google products.
35
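
rtp-llm, FastDeploy, and ncnn all advertise quantization support; the symmetric per-tensor int8 scheme that underlies most such pipelines fits in a few lines. This is a textbook sketch, not any of these engines' actual kernels:

```python
def quantize_int8(weights):
    """Symmetric per-tensor int8 quantization: map floats in
    [-max|w|, +max|w|] onto the integer range [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    return [qi * scale for qi in q]

weights = [0.02, -0.51, 0.33, 1.27, -1.27]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
# Round-trip error is bounded by half a quantization step.
assert all(abs(w - r) <= scale / 2 + 1e-9 for w, r in zip(weights, restored))
assert q[3] == 127 and q[4] == -127   # extremes map to the int8 limits
```

Production engines add per-channel scales, calibration, and fused int8 kernels on top of this, but the storage and bandwidth savings (4x versus fp32) come from exactly this mapping.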

// use cases by project

willow
  • 01 Self-hosted speech-to-text and text-to-speech processing
  • 02 High-speed large language model inference
  • 03 Integration with WebRTC and other external applications
ncnn
  • 01 Efficiently deploy deep learning algorithm models on mobile devices
  • 02 Support mainstream CNN networks such as YOLO, MobileNet, and ResNet
  • 03 Achieve high-performance cross-platform neural network inference computation
ncnn
  • 01 Supports a variety of mainstream CNN models, including classification, detection, segmentation, and face recognition algorithms.
  • 02 Provides cross-platform deployment capabilities, supporting environments such as Android, iOS, Windows, Linux, macOS, and WebAssembly.
  • 03 Helps developers port deep learning algorithms to mobile devices through efficient implementation, enabling the rapid deployment of artificial intelligence applications.
rtp-llm
  • 01 Supports various quantization techniques and high-performance CUDA operators to achieve extreme inference acceleration.
  • 02 Provides multi-LoRA service deployment, multimodal input processing, and tensor parallelism capabilities.
  • 03 Features advanced acceleration characteristics such as context prefix caching and speculative sampling.
awesome-free-llm-apis
  • 01 Accessing high-performance LLMs for development and prototyping without upfront costs.
  • 02 Integrating diverse AI models into applications using OpenAI SDK-compatible endpoints.
  • 03 Comparing inference providers based on rate limits, context window sizes, and model modalities.
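
The use cases above work because these services expose OpenAI-compatible endpoints: one request shape works against FastDeploy, vLLM, or a hosted free tier once the base URL changes. A minimal standard-library sketch follows; the URL, key, and model name are placeholders, not a real provider:

```python
import json
import urllib.request

def build_chat_request(base_url, api_key, model, user_message):
    """Build an OpenAI-style /chat/completions request. Any
    OpenAI-compatible server accepts this same body; only the
    base URL and credentials differ between providers."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )

# Placeholder endpoint: a locally served model, e.g. via FastDeploy or vLLM.
req = build_chat_request("http://localhost:8000/v1", "EMPTY",
                         "example-model", "Hello")
assert req.full_url.endswith("/v1/chat/completions")
assert json.loads(req.data)["messages"][0]["role"] == "user"
```

Sending the request with `urllib.request.urlopen(req)` (or pointing the official OpenAI SDK's `base_url` at the same address) is all that changes when switching providers.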
