Voxtral 4B TTS 2603

Voxtral TTS is a frontier, open-weights text-to-speech model that is fast, instantly adaptable, and produces lifelike speech for voice agents. The model is released with BF16 weights and a set of reference voices. These voices are licensed under CC BY-NC 4.0, which is the license that the model inherits.

Key Features

Voxtral TTS delivers enterprise-grade text-to-speech for production voice agents, with the following capabilities:

  • Realistic, expressive speech with natural prosody and emotional range across 9 major languages, with support for diverse dialects
  • Text-to-Speech generation with 20 preset voices and easy adaptation to new voices
  • Multilingual support: English, French, Spanish, German, Italian, Portuguese, Dutch, Arabic, and Hindi
  • Very low latency with fast time-to-first-audio, plus streaming and batch inference support
  • 24 kHz audio output in WAV, PCM, FLAC, MP3, AAC, and Opus formats
  • Production-ready performance for high-throughput, real-time voice agent workflows

For voice customization, visit our AI Studio.

Use Cases

  • Customer support and call center infrastructure.
  • Financial services, with a video demo on banking KYC voice agents.
  • Manufacturing and industrial operations.
  • Public services and government.
  • Compliance and risk.
  • Supply chain and logistics.
  • Automotive and in-vehicle systems.
  • Sales and marketing.
  • Real-time translation.

Responsible Use - You are responsible for complying with applicable laws and avoiding misuse.

Benchmark Results

Note: The RTF (real-time factor) reported by end2end.py uses an inverted formula (higher = better). The table below converts it back to the standard RTF convention (lower = better).

| Concurrency | Latency | RTF | Throughput (char/s/GPU) |
|---|---|---|---|
| 1 | 70 ms | 0.103 | 119.14 |
| 16 | 331 ms | 0.237 | 879.11 |
| 32 | 552 ms | 0.302 | 1430.78 |
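
The two RTF conventions mentioned in the note can be sketched as follows. Under the standard convention, RTF is processing time divided by the duration of the generated audio, so values below 1 mean faster than real time. This is a minimal sketch that assumes the inverted formula is simply the reciprocal (audio duration divided by processing time); the function names are illustrative and are not taken from end2end.py, and the numbers are made up rather than measured.

```python
def standard_rtf(processing_seconds: float, audio_seconds: float) -> float:
    """Standard real-time factor: processing time / audio duration (lower is better)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def from_inverted_rtf(inverted_rtf: float) -> float:
    """Convert an inverted RTF (assumed to be audio / processing) to the standard form."""
    if inverted_rtf <= 0:
        raise ValueError("inverted_rtf must be positive")
    return 1.0 / inverted_rtf


# Hypothetical numbers, not measurements: 10 s of audio generated in 1.03 s.
rtf = standard_rtf(processing_seconds=1.03, audio_seconds=10.0)
print(f"standard RTF: {rtf:.3f}")
print(f"from inverted form: {from_inverted_rtf(10.0 / 1.03):.3f}")
```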

Usage

The model can be deployed with the following libraries:

vLLM-Omni (recommended)

We've worked hand-in-hand with the vLLM-Omni team to bring production-grade support for Voxtral 4B TTS 2603 to vLLM-Omni. Special thanks go to Han Gao, Hongsheng Liu, Roger Wang, and Yueqian Lin from the vLLM-Omni team.

Installation

Install vllm from the latest PyPI package (>= 0.18.0). See the vLLM installation guide for full instructions.

uv pip install -U vllm

Next, install vllm-omni (>= 0.18.0).

uv pip install vllm-omni --upgrade  # make sure to have >= 0.18.0

Alternatively, you can use a ready-to-go Docker image from Docker Hub.

Installing vllm >= 0.18.0 should automatically install mistral_common >= 1.10.0, which you can verify by running:

python3 -c "import mistral_common; print(mistral_common.__version__)" # should print >= 1.10.0
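
The version requirements in this section can also be checked in one pass. The following is a minimal sketch using `importlib.metadata` from the standard library; it assumes the distributions are named `vllm`, `vllm-omni`, and `mistral_common`, and it compares only the leading numeric components of each version string, so pre-release or local suffixes are ignored. The helper names are hypothetical.

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Minimum versions stated in this section.
MINIMUMS = {"vllm": "0.18.0", "vllm-omni": "0.18.0", "mistral_common": "1.10.0"}


def parse_version(text: str) -> tuple[int, ...]:
    """Keep the leading dotted numeric part, e.g. '0.18.0.post1' -> (0, 18, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"cannot parse version: {text!r}")
    return tuple(int(part) for part in match.group().split("."))


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare numeric components, padding the shorter version with zeros."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_environment(minimums: dict[str, str] = MINIMUMS) -> dict[str, str]:
    """Return a short status string for each required distribution."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            report[name] = "not installed"
            continue
        ok = meets_minimum(installed, minimum)
        report[name] = f"{installed} ({'ok' if ok else 'need >= ' + minimum})"
    return report


if __name__ == "__main__":
    for name, status in check_environment().items():
        print(f"{name}: {status}")
```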

Serve

Given its size and the BF16 format of the weights, Voxtral-4B-TTS-2603 can run on a single GPU with >= 16 GB of memory.

vllm serve mistralai/Voxtral-4B-TTS-2603 --omni
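
The arithmetic behind the 16 GB figure can be sketched roughly as follows. This assumes about 4 billion parameters (inferred from the "4B" in the model name; the exact count is not stated here) and 2 bytes per BF16 value. The remaining memory is what is left for activations, caches, and runtime overhead, which this estimate does not model.

```python
BYTES_PER_BF16 = 2  # bfloat16 stores each value in 16 bits


def weight_memory_gb(num_params: float, bytes_per_param: int = BYTES_PER_BF16) -> float:
    """Approximate memory for the weights alone, in decimal gigabytes (1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


ASSUMED_PARAMS = 4e9  # assumption: taken from the "4B" in the model name
GPU_MEMORY_GB = 16    # minimum GPU memory stated above

weights_gb = weight_memory_gb(ASSUMED_PARAMS)
headroom_gb = GPU_MEMORY_GB - weights_gb
print(f"weights: ~{weights_gb:.1f} GB, headroom on a {GPU_MEMORY_GB} GB GPU: ~{headroom_gb:.1f} GB")
```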

Client

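import io

import httpx
import soundfile as sf

# Base URL of the server started with `vllm serve` (replace the placeholder).
BASE_URL = "http://<your-server-url>:8000/v1"

# Request body: text to synthesize, model name, output format, and preset voice.
payload = {
    "input": "Paris is a beautiful city!",
    "model": "mistralai/Voxtral-4B-TTS-2603",
    "response_format": "wav",
    "voice": "casual_male",
}

# Send the request and fail loudly on any HTTP error.
response = httpx.post(f"{BASE_URL}/audio/speech", json=payload, timeout=120.0)
response.raise_for_status()

# Decode the returned WAV bytes into a float32 array plus its sample rate.
audio_array, sr = sf.read(io.BytesIO(response.content), dtype="float32")
print(f"Got audio: {len(audio_array)} samples at {sr} Hz")

# You can play the audio with a library such as sounddevice,
# e.g. sounddevice.play(audio_array, sr).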
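
The request body above can be wrapped in a small helper so that the voice and output format vary per call. This is a minimal sketch, not part of the vLLM-Omni API: `build_speech_payload` and `output_filename` are hypothetical helpers, and the accepted `response_format` strings are assumed to be the lowercase names of the formats listed under Key Features (only `"wav"` is confirmed by the example above).

```python
# Assumption: lowercase names of the formats listed under Key Features.
SUPPORTED_FORMATS = ("wav", "pcm", "flac", "mp3", "aac", "opus")
MODEL_ID = "mistralai/Voxtral-4B-TTS-2603"


def build_speech_payload(
    text: str, voice: str = "casual_male", response_format: str = "wav"
) -> dict:
    """Build the JSON body for the /audio/speech endpoint, validating the inputs."""
    if not text.strip():
        raise ValueError("text must not be empty")
    fmt = response_format.lower()
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format {response_format!r}; expected one of {SUPPORTED_FORMATS}")
    return {"input": text, "model": MODEL_ID, "response_format": fmt, "voice": voice}


def output_filename(stem: str, response_format: str) -> str:
    """Derive a file name whose extension matches the requested format."""
    return f"{stem}.{response_format.lower()}"


payload = build_speech_payload("Paris is a beautiful city!", response_format="flac")
print(payload)
print(output_filename("paris", "flac"))
```

The resulting dictionary can be passed as `json=payload` to the `httpx.post` call shown above, and `response.content` can then be written to the derived file name in binary mode.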

Demo

To run the Gradio demo against your server:

git clone https://github.com/vllm-project/vllm-omni.git && \
cd vllm-omni && \
uv pip install gradio==5.50 && \
python examples/online_serving/voxtral_tts/gradio_demo.py \
  --host <your-server-url> \
  --port 8000

Alternatively, you can try it out live here ➡️ HF Space.

License

The voice references provided with this model are licensed under CC BY-NC 4.0; they come from datasets such as EARS, CML-TTS, IndicVoices-R, and Arabic Natural Audio. The model therefore inherits the same license.

You must not use this model in a manner that infringes, misappropriates, or otherwise violates any third party’s rights, including intellectual property rights.
