See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding
Abstract
SWIM is a training approach that aligns vision and language representations so that fine-grained object understanding works from textual prompts alone. It addresses cross-modal attention misalignment with mask supervision during training and a new dataset, NL-Refer.
We present SWIM (See What I Mean), a novel training strategy that aligns vision and language representations to enable fine-grained object understanding solely from textual prompts. Unlike existing approaches that require explicit visual prompts, such as masks or points, SWIM uses mask supervision only during training to guide cross-modal attention, allowing the model to attend automatically to the user-specified object at inference. Our cross-attention analysis of pretrained multimodal large language models (MLLMs) reveals a systematic discrepancy: attribute words produce sharp, localized activations in the visual modality, whereas object nouns yield diffuse and scattered patterns due to semantic reference bias and distributed high-level representations. To address this misalignment, we construct NL-Refer, an enriched dataset in which each object mask is paired with a precise natural language referring expression. SWIM extracts multi-layer cross-attention maps from object nouns and enforces spatial consistency with ground-truth masks. Experimental results demonstrate that SWIM substantially improves text-visual alignment and achieves superior performance over visual-prompt-based methods on fine-grained object understanding benchmarks. The code and data are available at https://github.com/HumanMLLM/SWIM.
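To make the idea of enforcing spatial consistency between attention maps and masks concrete, here is a minimal NumPy sketch. The abstract does not specify the loss SWIM uses, so everything here is an assumption for illustration: the function names `normalize_map`, `attention_entropy`, and `mask_alignment_loss` are hypothetical, the loss (negative log of the attention mass inside the mask, averaged over layers) is one plausible choice rather than the authors' formulation, and entropy is used only as a simple proxy for "sharp" versus "diffuse" attention.

```python
import numpy as np


def normalize_map(attn, eps=1e-8):
    """Clip negatives and rescale an attention map so its entries sum to ~1."""
    attn = np.clip(np.asarray(attn, dtype=np.float64), 0.0, None)
    return attn / (attn.sum() + eps)


def attention_entropy(attn, eps=1e-8):
    """Shannon entropy of the normalized map: low = sharp, high = diffuse."""
    p = normalize_map(attn, eps)
    return float(-(p * np.log(p + eps)).sum())


def mask_alignment_loss(layer_maps, mask, eps=1e-8):
    """Assumed loss: mean over layers of -log(attention mass inside the mask).

    layer_maps: iterable of (H, W) attention maps for one object-noun token,
                one map per selected layer (hypothetical input layout).
    mask:       (H, W) binary ground-truth object mask at the same resolution.
    """
    mask = np.asarray(mask, dtype=np.float64)
    losses = []
    for attn in layer_maps:
        inside = (normalize_map(attn, eps) * mask).sum()
        losses.append(-np.log(inside + eps))
    return float(np.mean(losses))


# Toy 4x4 visual grid with the object occupying the top-left 2x2 patch.
mask = np.zeros((4, 4))
mask[:2, :2] = 1.0

sharp = mask.copy()          # all attention falls on the object
diffuse = np.ones((4, 4))    # attention spread uniformly over the grid

loss_sharp = mask_alignment_loss([sharp, sharp], mask)
loss_diffuse = mask_alignment_loss([diffuse], mask)
```

In this toy setup the uniform map places only a quarter of its mass on the object, so it is penalized, while the map concentrated on the mask incurs almost no loss. In actual training the same computation would need to run inside an automatic-differentiation framework so that gradients reach the attention layers; NumPy is used here only to show the arithmetic.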
Community
A new paradigm for fine-grained video understanding.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- SteerSeg: Attention Steering for Reasoning Video Segmentation (2026)
- Enhancing Medical Visual Grounding via Knowledge-guided Spatial Prompts (2026)
- LMMs Meet Object-Centric Vision: Understanding, Segmentation, Editing and Generation (2026)
- Progressive Prompt-Guided Cross-Modal Reasoning for Referring Image Segmentation (2026)
- BoxTuning: Directly Injecting the Object Box for Multimodal Model Fine-Tuning (2026)
- Look Twice: Training-Free Evidence Highlighting in Multimodal Large Language Models (2026)
- CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning (2026)
Models citing this paper 1
Datasets citing this paper 1
BBBBCHAN/NL-Refer