Trials, Errors, and Breakthroughs: Our Rocky Road to OVD SOTA with Reinforcement Learning
Multimodal AI, Agents
Which Pretraining Paradigm Better Serves Spatial Intelligence? An Empirical Comparison of Vision-Language and Video Generation Models
VLM-FO1: Bridging the Gap Between High-Level Reasoning and Fine-Grained Perception in VLMs
Om AI Lab is a passionate group building multimodal AI agents that reshape our work and life.
Open Agent Leaderboard
Mark regions in images based on text descriptions
Process and answer questions about webpage videos
VLM-R1 model for Open-Vocabulary Object Detection