SmolVLM: Redefining small and efficient multimodal models
Paper • 2504.05299 • Published • 210
SmolVLM is a family of compact vision–language models designed for efficient multimodal understanding by integrating lightweight visual encoders with small language models, with a focus on edge deployment and low-latency multimodal AI.
Original paper: SmolVLM: Redefining small and efficient multimodal models
SmolVLM2-500M-Video-Instruct is a highly efficient ~500M-parameter variant optimized for low-memory footprint and fast multimodal inference. It is well suited for applications such as visual question answering, image captioning, document understanding, and real-time multimodal assistants on edge devices or resource-constrained environments.
Model Configuration:
| Model | Device | Model Link |
|---|---|---|
| SmolVLM2-500M-Video-Instruct | CV7 | Model_Link |
| SmolVLM2-500M-Video-Instruct | CV72 | Model_Link |
| SmolVLM2-500M-Video-Instruct | CV75 | Model_Link |