WeaveBench: Computer-Use Agents Fail at Hybrid GUI+CLI Tasks — 41% Pass Rate

Microsoft Research


WeaveBench introduces 114 real-world tasks that require AI agents to combine GUI observations and actions with CLI and code operations in a single trajectory. It is the first benchmark explicitly targeting this hybrid-interface setting. The best current frontier model achieves only a 41.2% pass rate on these long-horizon tasks. Published on arXiv (2606.09426) with 95 upvotes on HuggingFace Daily Papers.

Why it matters

Real computer workflows constantly switch between graphical interfaces and the terminal. WeaveBench is the first benchmark to require fluent hybrid operation within one trajectory, and it shows that even the best frontier agent fails more than half of these realistic computer-use tasks (a 41.2% pass rate).

Importance: 3/5

Novel hybrid-interface benchmark from Microsoft Research; 95 HF upvotes; strong practical relevance for computer-use agent research.

agents benchmark evaluation agentic-ai gui-agent paper research computer-use

Sources

official WeaveBench — arXiv

official WeaveBench — HuggingFace Papers