

SmolVLM is a family of compact vision–language models that pair a lightweight visual encoder with a small language model. The family targets efficient multimodal understanding, with a focus on edge deployment and low-latency inference.

Original paper: SmolVLM: Redefining small and efficient multimodal models

SmolVLM2-500M-Video-Instruct

SmolVLM2-500M-Video-Instruct is a ~500M-parameter variant optimized for a low memory footprint and fast multimodal inference. It is well suited to visual question answering, image captioning, document understanding, and real-time multimodal assistants on edge devices or in other resource-constrained environments.
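To put the ~500M-parameter figure in context, the sketch below estimates the memory needed for the weights alone at a few common storage precisions. It is a back-of-the-envelope calculation, not a measurement: the parameter count is rounded to exactly 500 million, the helper name `weight_memory_mib` is made up for this illustration, and the precision actually used on each target device is not stated in this card. Activations, caches, and runtime overhead are not included.

```python
# Rough weight-memory estimate for a ~500M-parameter model.
# Assumption: exactly 500e6 parameters; the real count may differ slightly.
PARAM_COUNT = 500_000_000

# Bytes per parameter for common storage formats.
BYTES_PER_PARAM = {"float32": 4.0, "float16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_mib(param_count: int, bytes_per_param: float) -> float:
    """Return the memory needed for the weights alone, in MiB."""
    return param_count * bytes_per_param / (1024 ** 2)


if __name__ == "__main__":
    # Print one line per storage format (weights only, no runtime overhead).
    for dtype, nbytes in BYTES_PER_PARAM.items():
        print(f"{dtype:>8}: {weight_memory_mib(PARAM_COUNT, nbytes):8.1f} MiB")
```

Under these assumptions the weights take a little under 1 GiB at 16-bit precision, and halving the bytes per parameter halves that estimate.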

Model Configuration:

| Model | Device | Model Link |
|---|---|---|
| SmolVLM2-500M-Video-Instruct | CV7 | Model_Link |
| SmolVLM2-500M-Video-Instruct | CV72 | Model_Link |
| SmolVLM2-500M-Video-Instruct | CV75 | Model_Link |