// archived 2026-05-02
deepseek-ai

Thinking-with-Visual-Primitives

#Machine Learning #Multimodal #LLM #Computer Vision

// summary

Thinking with Visual Primitives introduces a novel approach to Multimodal Large Language Models by interleaving spatial markers directly into the reasoning process. This method addresses the reference gap in complex structural tasks by anchoring abstract language to concrete physical coordinates. The framework achieves frontier-competitive performance while maintaining high visual token efficiency through a compressed architecture.

// technical analysis

The project introduces a new paradigm for Multimodal Large Language Models that addresses the 'Reference Gap': natural language alone cannot precisely describe dense spatial layouts. By interleaving spatial markers such as points and bounding boxes directly into the reasoning trajectory, the model anchors abstract concepts to physical coordinates, much as a person gestures at an object while reasoning about it. The framework prioritizes structural reasoning and visual grounding, and pairs this with a compressed visual architecture that preserves performance while substantially reducing the overhead of image tokens.
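As a rough illustration of what an interleaved reasoning trajectory might look like, the sketch below parses spatial markers out of a model's text output. The `<point ...>` and `<box ...>` tag syntax here is hypothetical, chosen only to make the idea concrete; the project's actual marker serialization is not specified in this summary.

```python
import re

# Hypothetical marker syntax (the project's real format may differ):
#   <point x=.. y=..>                 a single pixel coordinate
#   <box x1=.. y1=.. x2=.. y2=..>     an axis-aligned bounding box
POINT = re.compile(r"<point x=(\d+) y=(\d+)>")
BOX = re.compile(r"<box x1=(\d+) y1=(\d+) x2=(\d+) y2=(\d+)>")

def extract_markers(trace: str):
    """Pull spatial markers out of an interleaved reasoning trace,
    returning them as plain coordinate tuples."""
    points = [(int(x), int(y)) for x, y in POINT.findall(trace)]
    boxes = [tuple(int(v) for v in b) for b in BOX.findall(trace)]
    return points, boxes

# Example trace: language anchored to concrete coordinates.
trace = ("The valve handle sits at <point x=412 y=188>, inside the "
         "control panel <box x1=350 y1=120 x2=520 y2=260>.")
points, boxes = extract_markers(trace)
```

Downstream tooling could then overlay `points` and `boxes` on the source image to audit whether each reasoning step is grounded where the model claims.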

// key highlights

01
Integrates spatial markers such as points and bounding boxes as minimal units of thought to bridge the gap between language and visual reasoning.
02
Utilizes the DeepSeek-V4-Flash architecture to compress visual tokens, achieving extreme efficiency in KV cache usage.
03
Enables grounded task reasoning by allowing the model to 'point' to specific locations while performing complex logical operations.
04
Maintains competitive performance against frontier models like GPT-5.4 and Claude-Sonnet-4.6 despite a more compact model scale.
05
Reduces the overall image-token budget, allowing for deeper cognitive processing without excessive computational costs.
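The efficiency claims in highlights 02 and 05 come down to simple token arithmetic: fewer visual tokens per image means fewer KV-cache entries per layer. The sketch below works through that arithmetic with illustrative numbers; the patch grid size and compression ratio are assumptions, not figures from the project.

```python
def visual_token_budget(h_patches: int, w_patches: int, compression_ratio: int):
    """Tokens one image contributes to the KV cache, before and after
    compression. All numbers here are illustrative; the summary does not
    specify the actual compression scheme or ratio."""
    raw = h_patches * w_patches              # one token per patch, uncompressed
    compressed = max(1, raw // compression_ratio)
    return raw, compressed

# e.g. a 32x32 patch grid compressed 16x: 1024 -> 64 tokens per image,
# freeing the rest of the context window for reasoning tokens.
raw, compressed = visual_token_budget(32, 32, 16)
```

Because KV-cache memory scales linearly with sequence length, a 16x reduction in per-image tokens translates directly into proportionally more room for the long, interleaved reasoning traces the framework relies on.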

// use cases

01
Grounded task reasoning using spatial markers
02
Complex topological reasoning in visual environments
03
Efficient visual processing with reduced token consumption

// getting started

Start with the technical report for a deep dive into the methodology and research findings. Model weights are slated for future integration into the foundation model; in the meantime, the project's documentation and research context are available through the GitHub repository. For inquiries or collaboration, contact the research team via the service email listed there.