meituan

EvoCUA

AI#LLM#Multimodal#Agent#Computer Use#vLLM

314

// summary

EvoCUA is a high-performance open-source multimodal model designed for end-to-end computer automation across various desktop applications. It currently holds the top ranking on the OSWorld benchmark and demonstrates superior cross-OS generalization capabilities. Additionally, the model is recognized for its robust safety profile, exhibiting the lowest unintended-behavior rate among leading computer-use agents.

// technical analysis

EvoCUA is a general-purpose multimodal agent designed for computer use, utilizing a novel data synthesis and training methodology to enhance performance across various desktop applications. By achieving state-of-the-art results on the OSWorld benchmark, it addresses the challenge of creating robust, open-source agents capable of executing complex, multi-turn tasks via natural language instructions. The project prioritizes both performance and safety, demonstrating superior robustness against unintended behaviors compared to other leading computer-use agents.

// key highlights

Ranks as the #1 open-source model on the OSWorld benchmark with a 56.7% task completion rate.

Demonstrates strong zero-shot cross-OS generalization, significantly outperforming base models on the WindowsAgentArena.

Features a novel training and data synthesis approach that improves computer-use capabilities without sacrificing general model performance.

Provides end-to-end multi-turn automation for common desktop software including Chrome, Excel, PowerPoint, and VSCode.

Validated as the safest computer-use agent in an independent study, exhibiting the lowest rate of unintended behaviors.

Offers high efficiency by achieving competitive performance with fewer parameters and fewer execution steps than larger models.

// use cases

End-to-end multi-turn automation for applications like Chrome, Excel, and VSCode

Zero-shot cross-OS control for diverse desktop environments

Scalable synthetic experience training for improved computer-use capabilities

// getting started

To begin, clone the repository and install the required dependencies using Python 3.12. Download the model weights from HuggingFace and deploy them using vLLM as an OpenAI-compatible inference server. Finally, configure your environment variables and use the provided evaluation scripts to run tasks within the OSWorld environment.