May 19, 2026 · Research · Skills · Open Source

MMSkills Treats Visual Agent Knowledge as Multimodal, Not Just Text

Fresh arXiv yesterday from Shanghai Jiao Tong and collaborators. 2605.13527. Title: MMSkills, Towards Multimodal Skills for General Visual Agents. Top of HuggingFace Daily Papers at 99 upvotes. Code at github.com/DeepExperience/MMSkills.

The argument is sharper than the typical skills paper. Existing agent skills are text. A skill says do this, then this, then this. For coding agents that works because the world the agent acts in is text. For visual agents, it breaks. The agent has to recognize the current state from pixels, interpret visual evidence that the previous step worked or failed, and decide what to do next. None of that is reusable if you only store text instructions. MMSkills treats a skill as a multimodal record: state cards (the pixels of where you should be), keyframes (the pixels of how the world should change), plus the textual procedure that connects them.
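To make the record idea concrete, here is a minimal sketch of what such a skill could look like as a data structure. The class names, field names, and file paths are all hypothetical, chosen for illustration; the actual schema in the MMSkills repo may differ.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class StateCard:
    """Visual reference for a state the agent should be in (hypothetical schema)."""
    image_path: str
    description: str


@dataclass(frozen=True)
class Keyframe:
    """Visual reference for how the screen should change after a given step."""
    step_index: int
    image_path: str
    expected_change: str


@dataclass
class MultimodalSkill:
    """A skill as text procedure plus the pictures that ground it."""
    name: str
    procedure: list[str]
    state_cards: list[StateCard] = field(default_factory=list)
    keyframes: list[Keyframe] = field(default_factory=list)

    def keyframes_for_step(self, step_index: int) -> list[Keyframe]:
        # Return the visual evidence the agent can use to check this step.
        return [k for k in self.keyframes if k.step_index == step_index]


# Illustrative instance; the skill name and paths are made up.
skill = MultimodalSkill(
    name="export_chart_as_png",
    procedure=["Open the File menu", "Choose Export", "Select PNG and confirm"],
    state_cards=[StateCard("cards/chart_open.png", "Chart editor with a chart selected")],
    keyframes=[Keyframe(1, "frames/export_dialog.png", "Export dialog appears")],
)
```

A text-only skill would be just the `procedure` list. The other two fields are what let an agent check where it is and whether a step took effect.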

The pipeline that builds the skills is the actual contribution. An agentic generator turns public non-evaluation trajectories into reusable multimodal skills via four steps: workflow grouping, procedure induction, visual grounding, and meta-skill-guided auditing. At inference, a branch-loaded agent loads the relevant state cards and keyframes into a temporary branch, aligns them against the live environment, and distills structured guidance into the main agent's context. The audit step is the underrated piece. Other skills papers either trust the synthesizer or pay humans to clean up. MMSkills uses a meta-skill to validate.
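The inference-time flow can be sketched roughly as follows. This is a toy under loud assumptions: a set of visible UI labels stands in for pixels, Jaccard overlap stands in for whatever visual alignment the paper actually uses, and `Card`, `branch_guidance`, and the threshold are hypothetical names and values. The point it illustrates is that the matching happens in a side branch and only a short guidance string reaches the main agent's context.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Card:
    skill: str
    visible_elements: frozenset[str]  # toy stand-in for a state card image
    guidance: str


def jaccard(a: frozenset[str], b: frozenset[str]) -> float:
    """Set overlap in [0, 1]; a placeholder for real visual alignment."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def branch_guidance(cards, observation, threshold=0.5):
    """Score cards against the live observation in a temporary branch.

    Returns one short guidance string for the best match, or None when
    nothing aligns well enough. The cards themselves never leave the branch.
    """
    scored = [(jaccard(c.visible_elements, observation), c) for c in cards]
    best_score, best = max(scored, key=lambda pair: pair[0], default=(0.0, None))
    if best is None or best_score < threshold:
        return None
    return f"[{best.skill}] {best.guidance}"


cards = [
    Card("export_png", frozenset({"File", "Export", "Chart"}),
         "Export dialog is open; select PNG and confirm."),
    Card("rename_sheet", frozenset({"Sheet1", "Rename"}),
         "Type the new name and press Enter."),
]
observation = frozenset({"File", "Export", "Chart", "Help"})

main_context: list[str] = []
hint = branch_guidance(cards, observation)
if hint is not None:
    main_context.append(hint)  # only the distilled text enters the main context
```

Keeping the images in the branch means the main agent pays for one line of guidance rather than a stack of screenshots on every step.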

Results span GUI agent benchmarks and game-based visual-agent benchmarks. The skills consistently help both frontier and smaller multimodal models. The interpretation that matters: external multimodal procedural memory is complementary to whatever the base model already knows. You do not need to retrain the model. You need to give it the right pictures at the right time. For anyone building visual agents on top of GPT-5.5, Gemini, or Claude, this is a near-term capability boost without a finetune.

https://arxiv.org/abs/2605.13527
