// archived 2026-04-27
Michael-A-Kuykendall

shimmy

AI, LLM, Rust, Inference, OpenAI API, GGUF

// summary

Shimmy is a lightweight, single-binary server that provides a 100% OpenAI-compatible API for running GGUF models locally. It features zero-configuration model discovery, automatic GPU backend detection, and CPU/GPU hybrid processing (Mixture-of-Experts CPU offloading) for large models. Designed for privacy and performance, it lets developers integrate local LLMs into existing tools without code changes.

// technical analysis

Shimmy is a high-performance, lightweight OpenAI API server written in Rust that enables local execution of GGUF models with zero dependencies. By providing a drop-in replacement for OpenAI endpoints, it allows developers to integrate local LLMs into existing tools like VSCode and Cursor without modifying code. The project prioritizes efficiency and ease of use, employing a single-binary architecture that automatically detects GPU backends and manages model discovery to minimize configuration overhead.
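To make the drop-in claim concrete, here is a minimal sketch using the official OpenAI Python SDK: only the base URL (and a dummy API key) differ from a hosted-OpenAI setup. The model name is a placeholder for whatever GGUF model Shimmy has discovered, and the port matches the default noted in the getting-started section below.

```python
# Minimal sketch: an existing OpenAI SDK client redirected to a local
# Shimmy server. Only base_url (and a dummy api_key) change; the rest
# is unmodified OpenAI SDK usage.
from openai import OpenAI

client = OpenAI(
    base_url="http://127.0.0.1:11435/v1",  # local Shimmy endpoint
    api_key="not-needed-locally",          # SDK requires a value; unused locally
)

# "your-local-model" is a placeholder for a model name Shimmy discovered.
response = client.chat.completions.create(
    model="your-local-model",
    messages=[{"role": "user", "content": "Say hello from a local LLM."}],
)
print(response.choices[0].message.content)
```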

// key highlights

01
Provides 100% OpenAI-compatible endpoints, allowing seamless integration with existing AI SDKs and development tools.
02
Features a single-binary distribution that includes all necessary GPU backends, eliminating complex compilation and dependency management.
03
Implements intelligent MoE (Mixture of Experts) CPU offloading to run large 70B+ models on consumer hardware with limited VRAM.
04
Automatically discovers models from Hugging Face, Ollama, and local directories, requiring zero manual configuration to get started (see the listing sketch after this list).
05
Delivers high performance, with sub-second startup times and a memory footprint far smaller than traditional local inference tools.
06
Includes response caching and real-time observability to speed up development workflows and improve inference reliability.
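Given the compatibility and auto-discovery claims above, the standard OpenAI /v1/models endpoint should enumerate whatever models Shimmy has found on disk. The sketch below assumes exactly that, reusing the local endpoint from the getting-started section.

```python
# Sketch: listing the models Shimmy auto-discovered, via the standard
# OpenAI /v1/models endpoint (assumed available given the 100%
# compatibility claim above).
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:11435/v1", api_key="unused")

for model in client.models.list():
    print(model.id)  # each discovered GGUF model appears as a model id
```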

// use cases

01
Drop-in replacement for OpenAI API in local development environments
02
Running large 70B+ models on consumer hardware via MoE CPU offloading
03
Private, cost-effective local inference for VSCode, Cursor, and Continue.dev

// getting started

To begin, download the pre-built binary for your operating system from the GitHub releases page and make it executable if necessary. Run the server using the './shimmy serve' command, which automatically detects your GPU and available models. Once running, point your OpenAI-compatible client or IDE extension to 'http://127.0.0.1:11435/v1' to start interacting with your local models.
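Editor integrations like VSCode, Cursor, and Continue.dev consume tokens as a stream, so a streaming request is the closest match to real usage. A hedged sketch, again with a placeholder model name:

```python
# Sketch: streaming tokens from a local Shimmy server, the access
# pattern IDE extensions typically use. The model name is a placeholder.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:11435/v1", api_key="unused")

stream = client.chat.completions.create(
    model="your-local-model",
    messages=[{"role": "user", "content": "Explain GGUF in one sentence."}],
    stream=True,  # tokens arrive chunk by chunk as they are generated
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```

Because the wire format is the OpenAI one, the same snippet works unchanged against a hosted endpoint, which is what makes switching between local and remote inference a one-line change.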