// summary
Shimmy is a lightweight, single-binary server that exposes an OpenAI-compatible API for running GGUF models locally. It offers zero-configuration model discovery, automatic GPU backend detection, and hybrid CPU/GPU processing for large models. Designed for privacy and performance, it lets developers point existing OpenAI-compatible tools at local LLMs without code changes.
// technical analysis
Shimmy is a lightweight, high-performance OpenAI-compatible API server written in Rust that runs GGUF models locally with no runtime dependencies. By acting as a drop-in replacement for OpenAI endpoints, it lets developers integrate local LLMs into existing tools such as VSCode and Cursor without modifying code. The project prioritizes efficiency and ease of use: a single-binary architecture automatically detects GPU backends and handles model discovery, minimizing configuration overhead.
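As a sketch of what "drop-in replacement" means in practice, the snippet below builds a standard OpenAI-style chat completion request aimed at a local Shimmy server, using only Python's standard library. The model id "my-local-model" is a placeholder (Shimmy discovers real model names itself), and actually sending the request requires a running server.

```python
import json
from urllib import request

# Default local endpoint described in the getting-started section.
BASE_URL = "http://127.0.0.1:11435/v1"

def build_chat_request(model: str, prompt: str) -> request.Request:
    """Build an OpenAI-style chat completion request for the local server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# "my-local-model" is a placeholder model id for illustration.
req = build_chat_request("my-local-model", "Hello!")
# Sending it requires a running Shimmy server:
#   with request.urlopen(req) as resp:
#       print(json.load(resp)["choices"][0]["message"]["content"])
```

Because the request shape matches the OpenAI API, any client or IDE extension that lets you override the base URL can talk to Shimmy unchanged.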
// key highlights
- Single binary written in Rust with no runtime dependencies
- OpenAI-compatible API that works as a drop-in replacement for hosted endpoints
- Zero-configuration discovery of locally stored GGUF models
- Automatic GPU backend detection, with CPU/GPU hybrid processing for large models
// use cases
- Pointing IDE assistants such as VSCode and Cursor extensions at a local model without code changes
- Privacy-sensitive development where prompts and data never leave the machine
- Running large GGUF models on constrained hardware via hybrid CPU/GPU processing
// getting started
To begin, download the pre-built binary for your operating system from the GitHub releases page and make it executable if necessary. Run the server using the './shimmy serve' command, which automatically detects your GPU and available models. Once running, point your OpenAI-compatible client or IDE extension to 'http://127.0.0.1:11435/v1' to start interacting with your local models.
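Once the server is up, the steps above can be verified from any OpenAI-compatible client. The sketch below (standard-library Python only, assuming the default address quoted above) queries the standard /v1/models listing endpoint to see which models Shimmy has discovered, and returns an empty list if the server is unreachable.

```python
import json
from urllib import request, error

BASE_URL = "http://127.0.0.1:11435/v1"  # default address from the steps above

def models_url(base: str = BASE_URL) -> str:
    """URL of the OpenAI-compatible model-listing endpoint."""
    return f"{base}/models"

def list_models(base: str = BASE_URL) -> list[str]:
    """Return the ids of models a running Shimmy server has discovered."""
    try:
        with request.urlopen(models_url(base), timeout=5) as resp:
            body = json.load(resp)
    except error.URLError:
        return []  # server not running or unreachable
    return [m["id"] for m in body.get("data", [])]
```

If the list comes back non-empty, any of the returned ids can be used as the "model" field in chat completion requests from your existing tools.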