// archived 2026-04-27
Michael-A-Kuykendall

shimmy

AI, LLM, Rust, Inference, OpenAI API, GGUF

// summary

Shimmy is a lightweight, single-binary server that provides a 100% OpenAI-compatible API for running GGUF models locally. It features zero-configuration model discovery, automatic GPU backend detection, and CPU/GPU hybrid processing (Mixture-of-Experts CPU offloading) for large models. Designed for privacy and performance, it lets developers integrate local LLMs into existing tools without code changes.

// technical analysis

Shimmy is a high-performance, lightweight OpenAI API server written in Rust that enables local execution of GGUF models with zero dependencies. By providing a drop-in replacement for OpenAI endpoints, it allows developers to integrate local LLMs into existing tools like VSCode and Cursor without modifying code. The project prioritizes efficiency and ease of use, employing a single-binary architecture that automatically detects GPU backends and manages model discovery to minimize configuration overhead.
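To make the drop-in claim concrete, here is a minimal sketch using the official OpenAI Python SDK: only the base URL (and a dummy API key) differ from a hosted-OpenAI setup. The model name is a placeholder for whatever GGUF model Shimmy has discovered, and the port matches the default noted in the getting-started section below.

```python
# Minimal sketch: an existing OpenAI SDK client redirected to a local
# Shimmy server. Only base_url (and a dummy api_key) change; the rest
# is unmodified OpenAI SDK usage.
from openai import OpenAI

client = OpenAI(
    base_url="http://127.0.0.1:11435/v1",  # local Shimmy endpoint
    api_key="not-needed-locally",          # SDK requires a value; unused locally
)

# "your-local-model" is a placeholder for a model name Shimmy discovered.
response = client.chat.completions.create(
    model="your-local-model",
    messages=[{"role": "user", "content": "Say hello from a local LLM."}],
)
print(response.choices[0].message.content)
```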

// key highlights

01
Provides 100% OpenAI-compatible endpoints, allowing seamless integration with existing AI SDKs and development tools.
02
Features a single-binary distribution that includes all necessary GPU backends, eliminating complex compilation and dependency management.
03
Implements intelligent MoE (Mixture of Experts) CPU offloading to run large 70B+ models on consumer hardware with limited VRAM.
04
Automatically discovers models from Hugging Face, Ollama, and local directories, requiring zero manual configuration to get started (see the listing sketch after this list).
05
Delivers high performance, with sub-second startup times and a memory footprint far smaller than traditional local inference tools.
06
Includes response caching and real-time observability to speed up development workflows and improve inference reliability.
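Given the compatibility and auto-discovery claims above, the standard OpenAI /v1/models endpoint should enumerate whatever models Shimmy has found on disk. The sketch below assumes exactly that, reusing the local endpoint from the getting-started section.

```python
# Sketch: listing the models Shimmy auto-discovered, via the standard
# OpenAI /v1/models endpoint (assumed available given the 100%
# compatibility claim above).
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:11435/v1", api_key="unused")

for model in client.models.list():
    print(model.id)  # each discovered GGUF model appears as a model id
```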

// use cases

01
Drop-in replacement for OpenAI API in local development environments
02
Running large 70B+ models on consumer hardware via MoE CPU offloading
03
Private, cost-effective local inference for VSCode, Cursor, and Continue.dev

// getting started

To begin, download the pre-built binary for your operating system from the GitHub releases page and make it executable if necessary. Run the server using the './shimmy serve' command, which automatically detects your GPU and available models. Once running, point your OpenAI-compatible client or IDE extension to 'http://127.0.0.1:11435/v1' to start interacting with your local models.
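Editor integrations like VSCode, Cursor, and Continue.dev consume tokens as a stream, so a streaming request is the closest match to real usage. A hedged sketch, again with a placeholder model name:

```python
# Sketch: streaming tokens from a local Shimmy server, the access
# pattern IDE extensions typically use. The model name is a placeholder.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:11435/v1", api_key="unused")

stream = client.chat.completions.create(
    model="your-local-model",
    messages=[{"role": "user", "content": "Explain GGUF in one sentence."}],
    stream=True,  # tokens arrive chunk by chunk as they are generated
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```

Because the wire format is the OpenAI one, the same snippet works unchanged against a hosted endpoint, which is what makes switching between local and remote inference a one-line change.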