arxiv:2606.01414

Agent Skills Should Go Beyond Text: The Case for Visual Skills

Published on May 31


Abstract

Multimodal skills that combine textual logic with visual support outperform text-only approaches in visual-centric tasks by incorporating spatial layout, visual grounding, and state-aware interactions.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Reusable skills are a key mechanism for extending agent capabilities, allowing agents to accumulate experience and solve increasingly complex tasks. Yet most existing skill-learning methods store reusable experience as text-only assets, such as instructions, reasoning traces, or summarized trajectories. We argue that this text-only paradigm creates a fundamental bottleneck for visual-centric tasks, where reusable knowledge often depends on spatial layout, visual grounding, fine-grained appearance, and localized state changes. To address this limitation, we propose \NAME, a multimodal skill paradigm that combines declarative textual logic with explicit visual support. We distinguish three reusable forms: static priors for stable spatial conventions, dynamic priors for in-situ visual working memory, and interleaved visual skills that bind ordered text steps to the source frames, screenshots, or page regions that justify them. Rather than only describing what to do, visual skills also encode where to look, how to inspect, and how to verify visual outcomes. To scale visual-skill construction, we introduce AutoVisualSkill, an automatic system that converts agent experience into reusable multimodal skills by preserving textual reasoning, spatial references, visual boundaries, and interaction patterns from task trajectories. Experiments on GUI and other visual-centric tasks show that visual skills consistently outperform text-only skills, particularly when success requires spatial correspondence, visual evidence, and state-aware interaction. These results support our central position: reusable agent skills should go beyond text and become multimodal assets for future multimodal agents.
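To make the three forms in the abstract concrete, here is a minimal Python sketch of how such a skill might be represented. It is not the paper's data format: every name in it (`VisualRef`, `Step`, `VisualSkill`, `ungrounded_steps`, the file names, and the pixel coordinates) is a hypothetical chosen for illustration. The only idea taken from the abstract is that a skill holds static priors, dynamic priors, and ordered text steps that are each bound to the visual evidence justifying them.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# All names below are hypothetical; they do not come from the paper.
BBox = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


@dataclass
class VisualRef:
    """Pointer to a frame, screenshot, or page region."""
    source: str                    # path or ID of the image
    region: Optional[BBox] = None  # None means the whole image


@dataclass
class Step:
    """One ordered text step bound to its supporting visual evidence."""
    text: str
    evidence: List[VisualRef] = field(default_factory=list)
    verify: str = ""               # what to check visually after acting


@dataclass
class VisualSkill:
    name: str
    # Stable spatial conventions, e.g. where a toolbar usually sits.
    static_priors: List[VisualRef] = field(default_factory=list)
    # In-situ visual working memory collected during the current task.
    dynamic_priors: List[VisualRef] = field(default_factory=list)
    # Interleaved text steps and the visuals that justify them.
    steps: List[Step] = field(default_factory=list)

    def ungrounded_steps(self) -> List[int]:
        """Return indices of steps that have no visual evidence attached."""
        return [i for i, s in enumerate(self.steps) if not s.evidence]


skill = VisualSkill(
    name="export-chart",
    static_priors=[VisualRef("toolbar_layout.png")],
    steps=[
        Step(
            "Open the File menu",
            [VisualRef("frame_003.png", (0, 0, 120, 24))],
            verify="menu is expanded",
        ),
        Step("Click Export"),  # text-only step, no evidence bound
    ],
)
print(skill.ungrounded_steps())  # [1]
```

A check like `ungrounded_steps` is one way a skill library could flag steps that fall back to text-only guidance, which is the case the abstract argues is a bottleneck for visual-centric tasks.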


Community

hhua2

Paper submitter about 5 hours ago

AutoVisualSkill augments multimodal agent skill libraries with reusable Visual Agent Skill artifacts. Given text prompts, images, accessible URLs, multimodal chat history, sampled video frames, and user interaction traces, it analyzes the task context, identifies visual and personalization bottlenecks, and authors structured skills that downstream agents can load, inspect, version, and reuse. Each generated skill includes task logic, visual priors, multimodal binding protocols, runtime constraints, provenance, and a machine-readable manifest.

Beyond task-level skillization, AutoVisualSkill can capture user-specific work habits, preferred procedures, decision patterns, visual judgment patterns, and interaction styles, enabling multimodal agents to act with stronger visual grounding, better workflow continuity, and more personalized task execution.
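The comment above lists what each generated skill contains: task logic, visual priors, multimodal binding protocols, runtime constraints, provenance, and a machine-readable manifest. The sketch below shows one way such a manifest could be written and validated as JSON. The schema, the field names, and the helpers `build_manifest` and `load_manifest` are assumptions made for illustration, not AutoVisualSkill's actual format or API.

```python
import json

# Section names mirror the components named in the comment; the schema
# itself is an assumption, not AutoVisualSkill's real manifest format.
REQUIRED_SECTIONS = (
    "task_logic",
    "visual_priors",
    "binding_protocols",
    "runtime_constraints",
    "provenance",
)


def _missing(sections: dict) -> list:
    """List required sections that are absent from a mapping."""
    return [name for name in REQUIRED_SECTIONS if name not in sections]


def build_manifest(name: str, version: str, **sections) -> str:
    """Serialize a skill manifest to JSON, rejecting incomplete skills."""
    missing = _missing(sections)
    if missing:
        raise ValueError("missing sections: " + ", ".join(missing))
    manifest = {"name": name, "version": version}
    manifest.update({key: sections[key] for key in REQUIRED_SECTIONS})
    return json.dumps(manifest, indent=2, sort_keys=True)


def load_manifest(text: str) -> dict:
    """Parse a manifest and confirm every required section is present."""
    manifest = json.loads(text)
    missing = _missing(manifest)
    if missing:
        raise ValueError("missing sections: " + ", ".join(missing))
    return manifest


text = build_manifest(
    "export-chart",
    "0.1.0",
    task_logic=["Open the File menu", "Click Export"],
    visual_priors=["toolbar_layout.png"],
    binding_protocols={"step_0": "frame_003.png"},
    runtime_constraints={"max_steps": 10},
    provenance={"source": "user interaction trace"},
)
loaded = load_manifest(text)
print(loaded["name"], loaded["version"])
```

Keeping an explicit `version` and `provenance` in the manifest is what would let downstream agents load, inspect, version, and reuse a skill, as the comment describes; the validation step simply refuses manifests that omit one of the listed components.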


