- EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
  Paper • 2402.04252 • Published • 30
- Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
  Paper • 2402.03749 • Published • 15
- ScreenAI: A Vision-Language Model for UI and Infographics Understanding
  Paper • 2402.04615 • Published • 44
- EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss
  Paper • 2402.05008 • Published • 23
Collections
Discover the best community collections!
Collections including paper arxiv:2509.07979
- Apriel-1.5-15b-Thinker
  Paper • 2510.01141 • Published • 123
- MMR1: Enhancing Multimodal Reasoning with Variance-Aware Sampling and Open Resources
  Paper • 2509.21268 • Published • 104
- LLaVA-Critic-R1: Your Critic Model is Secretly a Strong Policy Model
  Paper • 2509.00676 • Published • 85
- Visual Representation Alignment for Multimodal Large Language Models
  Paper • 2509.07979 • Published • 84

- POINTS-Reader: Distillation-Free Adaptation of Vision-Language Models for Document Conversion
  Paper • 2509.01215 • Published • 51
- Visual Representation Alignment for Multimodal Large Language Models
  Paper • 2509.07979 • Published • 84
- Kling-Avatar: Grounding Multimodal Instructions for Cascaded Long-Duration Avatar Animation Synthesis
  Paper • 2509.09595 • Published • 48
- Reconstruction Alignment Improves Unified Multimodal Models
  Paper • 2509.07295 • Published • 40

- Visual Representation Alignment for Multimodal Large Language Models
  Paper • 2509.07979 • Published • 84
- Parallel-R1: Towards Parallel Thinking via Reinforcement Learning
  Paper • 2509.07980 • Published • 105
- Drivel-ology: Challenging LLMs with Interpreting Nonsense with Depth
  Paper • 2509.03867 • Published • 213
- Why Language Models Hallucinate
  Paper • 2509.04664 • Published • 199

- MM-HELIX: Boosting Multimodal Long-Chain Reflective Reasoning with Holistic Platform and Adaptive Hybrid Policy Optimization
  Paper • 2510.08540 • Published • 110
- Diffusion Transformers with Representation Autoencoders
  Paper • 2510.11690 • Published • 170
- Spotlight on Token Perception for Multimodal Reinforcement Learning
  Paper • 2510.09285 • Published • 37
- Towards Mixed-Modal Retrieval for Universal Retrieval-Augmented Generation
  Paper • 2510.17354 • Published • 35

- Visual Representation Alignment for Multimodal Large Language Models
  Paper • 2509.07979 • Published • 84
- Language Models Can Learn from Verbal Feedback Without Scalar Rewards
  Paper • 2509.22638 • Published • 70
- Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models
  Paper • 2510.05034 • Published • 51
- Think-Then-Generate: Reasoning-Aware Text-to-Image Diffusion with LLM Encoders
  Paper • 2601.10332 • Published • 31

- Visual Representation Alignment for Multimodal Large Language Models
  Paper • 2509.07979 • Published • 84
- LatticeWorld: A Multimodal Large Language Model-Empowered Framework for Interactive Complex World Generation
  Paper • 2509.05263 • Published • 11
- Symbolic Graphics Programming with Large Language Models
  Paper • 2509.05208 • Published • 47
- OmniWorld: A Multi-Domain and Multi-Modal Dataset for 4D World Modeling
  Paper • 2509.12201 • Published • 107