MMSkills: Reusable Multimodal Skills for General Visual Agents (105 HF upvotes)

Shanghai Jiao Tong University


MMSkills introduces a framework for equipping visual AI agents with reusable multimodal procedural knowledge. Each skill package combines a textual procedure with runtime state cards and multi-view keyframes. An agentic trajectory-to-skill generator transforms public interaction trajectories into reusable skills through workflow grouping, procedure induction, visual grounding, and meta-skill-guided auditing. At runtime, a branch-loaded multimodal skill agent inspects visual cards and keyframes, aligns them with the live environment, and distills structured guidance. Experiments on GUI and game-based benchmarks show consistent improvements for both frontier and smaller multimodal agents.
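To make the skill-package idea concrete, here is a minimal sketch of how a package (textual procedure plus state cards with keyframes) and a branch-loading step might be represented. Everything in it is an illustrative assumption rather than the paper's implementation: the names `StateCard`, `SkillPackage`, and `load_branch`, the `step_index` field, the file paths, and the example skill are all hypothetical. The string comparison is only a stand-in for the visual alignment that a multimodal model would perform against the live environment.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class StateCard:
    """Hypothetical runtime state card: one recognizable screen state."""
    name: str                 # label for the state, e.g. "export_dialog"
    visual_cue: str           # what the agent should see in this state
    step_index: int           # index of the procedure step to take from here
    keyframes: List[str] = field(default_factory=list)  # multi-view keyframe paths


@dataclass
class SkillPackage:
    """Hypothetical skill package: a textual procedure plus its state cards."""
    title: str
    procedure: List[str]
    state_cards: List[StateCard]


def load_branch(skill: SkillPackage, observed_state: str) -> Optional[dict]:
    """Load only the card matching the observed state and distill guidance.

    Assumption: a plain string match replaces the visual alignment step.
    Returns None when no card matches.
    """
    for card in skill.state_cards:
        if card.name == observed_state:
            return {
                "skill": skill.title,
                "state": card.name,
                "check": card.visual_cue,
                "keyframes": card.keyframes,
                "next_step": skill.procedure[card.step_index],
            }
    return None


# Made-up example skill for a GUI task.
skill = SkillPackage(
    title="export_chart_as_png",
    procedure=["Open the File menu", "Choose Export", "Select PNG and confirm"],
    state_cards=[
        StateCard("file_menu_open", "File menu is expanded", 1,
                  ["frames/file_menu.png"]),
        StateCard("export_dialog", "Export dialog is visible", 2,
                  ["frames/export_dialog.png"]),
    ],
)

guidance = load_branch(skill, "export_dialog")
print(guidance["next_step"])  # prints "Select PNG and confirm"
```

Returning only the matching card keeps the guidance passed to the agent small, which is one plausible reading of "branch-loaded"; the paper may define the mechanism differently.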

Why it matters

The paper received 105 upvotes on Hugging Face. By coupling text procedures with visual evidence, rather than relying on text-only or code-only skills, MMSkills addresses how agents reuse past experience in visually dynamic environments. That makes it a building block for more robust agent systems across GUI automation and interactive tasks.

Importance: 3/5

105 HF upvotes; a multimodal framework for reusing procedural skills, with demonstrated improvements on GUI and game benchmarks for both frontier and smaller agents.

Sources