LLaVA-CoT: Let Vision Language Models Reason Step-by-Step Paper • 2411.10440 • Published Jul 21, 2025 • 131
VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding Paper • 2403.11481 • Published Mar 18, 2024 • 13