SmolVLM is a family of compact vision–language models built for efficient multimodal understanding. It pairs lightweight visual encoders with small language models and targets edge deployment and low-latency inference.

Original paper: SmolVLM: Redefining small and efficient multimodal models

SmolVLM2-500M-Video-Instruct

SmolVLM2-500M-Video-Instruct is a ~500M-parameter variant optimized for a low memory footprint and fast multimodal inference. It suits applications such as visual question answering, image captioning, document understanding, and real-time multimodal assistants on edge devices and in other resource-constrained environments.
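
To make the memory footprint concrete, the sketch below estimates the storage needed for the weights alone at a few precisions. It assumes a round 500,000,000 parameters (the card only says ~500M), and the precision list is illustrative: this card does not state which precisions the Cooper toolchain uses. The function name `weight_memory_mib` is hypothetical.

```python
# Weight-only estimate: activations, KV cache, and runtime overhead are excluded.
# The precision table is illustrative, not a list of supported formats.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_memory_mib(num_params: int, dtype: str) -> float:
    """Return the storage needed for the weights alone, in MiB."""
    return num_params * BYTES_PER_PARAM[dtype] / (1024 ** 2)


if __name__ == "__main__":
    # 500_000_000 is a round-number assumption for "~500M parameters".
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype}: {weight_memory_mib(500_000_000, dtype):.0f} MiB")
```

Under these assumptions the weights take roughly 1.9 GiB in fp32 and under 1 GiB in fp16, which is why a model of this size can fit on memory-constrained devices.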

Model Configuration:

Reference implementation: smollm
Original weights: SmolVLM2-500M-Video-Instruct
Input resolution: 3x512x512 (3 channels, 512x512 pixels)
Supported Cooper versions:
- Cooper SDK: [2.5.4]
- Cooper Foundry: [2.3]
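
As a rough illustration of the 3x512x512 input shape, the sketch below converts an arbitrary HxWx3 image array into a tensor of that shape using NumPy. It assumes a channels-first layout, nearest-neighbour resizing, and scaling to [0, 1]; none of these are confirmed by this card, and the preprocessing in the reference implementation or the Cooper toolchain may differ. The function name `to_model_input` is hypothetical.

```python
import numpy as np

TARGET_H, TARGET_W = 512, 512  # from the 3x512x512 resolution entry


def to_model_input(image_hwc: np.ndarray) -> np.ndarray:
    """Resize an HxWx3 uint8 image to a 3x512x512 float32 array."""
    if image_hwc.ndim != 3 or image_hwc.shape[2] != 3:
        raise ValueError("expected an HxWx3 image")
    h, w, _ = image_hwc.shape
    # Nearest-neighbour source indices for each target row and column.
    rows = np.arange(TARGET_H) * h // TARGET_H
    cols = np.arange(TARGET_W) * w // TARGET_W
    resized = image_hwc[rows[:, None], cols[None, :]]  # shape: 512x512x3
    # Assumed layout and scaling: channels-first, values in [0, 1].
    return resized.transpose(2, 0, 1).astype(np.float32) / 255.0
```

For example, `to_model_input(np.zeros((480, 640, 3), dtype=np.uint8))` returns an array of shape `(3, 512, 512)`. Nearest-neighbour resizing is used here only to keep the sketch dependency-free; an image library with bilinear or bicubic interpolation would be the more usual choice.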

| Model | Device | Model Link |
| --- | --- | --- |
| SmolVLM2-500M-Video-Instruct | CV7 | Model_Link |
| SmolVLM2-500M-Video-Instruct | CV72 | Model_Link |
| SmolVLM2-500M-Video-Instruct | CV75 | Model_Link |

Paper: SmolVLM: Redefining small and efficient multimodal models (arXiv:2504.05299, published Apr 7, 2025)