Welcome Gemma 4: Frontier multimodal intelligence on device

Published April 2, 2026

The Gemma 4 family of multimodal models by Google DeepMind is out on Hugging Face, with support for your favorite agents, inference engines, and fine-tuning libraries 🤗

These models are the real deal: truly open with Apache 2 licenses, high quality with Pareto-frontier arena scores, multimodal including audio, and available in sizes you can use everywhere, including on-device. Gemma 4 builds on advances from previous families and makes them click together. In our tests with pre-release checkpoints we have been impressed by their capabilities, to the extent that we struggled to find good fine-tuning examples because they are so good out of the box.

We collaborated with Google and the community to make them available everywhere: transformers, llama.cpp, MLX, WebGPU, Rust; you name it. This blog post will show you how to build with your favorite tools so let us know what you think!

What is New with Gemma 4?
Overview of Capabilities and Architecture
Multimodal Capabilities
Deploy Anywhere
Multi-Token Prediction Drafters
DiffusionGemma: Text Generation via Diffusion
Fine-tuning & Demos
- Fine-tuning with TRL
  - Fine-tuning with TRL on Vertex AI
- Fine-tuning with Unsloth Studio
Try Gemma 4
Benchmark Results
Acknowledgements

What is new with Gemma 4?

Similar to Gemma-3n, Gemma 4 supports image, text, and audio inputs, and generates text responses. The text decoder is based on the Gemma model with support for long context windows. The image encoder is similar to the one from Gemma 3 but with two crucial improvements: variable aspect ratios, and configurable number of image token inputs to find your sweet spot between speed, memory, and quality. All models support images (or video) and text inputs, while the small variants (E2B, E4B) and the 12B Unified model support audio as well.

Gemma 4 comes in five sizes, all base and instruction fine-tuned:

Model	Parameter Size	Context Window	Checkpoints
Gemma 4 E2B	2.3B effective, 5.1B with embeddings	128K	base, IT
Gemma 4 E4B	4.5B effective, 8B with embeddings	128K	base, IT
Gemma 4 12B Unified	11.95B dense, encoder-free	256K	base, IT
Gemma 4 31B	31B dense model	256K	base, IT
Gemma 4 26B A4B	mixture-of-experts with 4B activated/26B total parameters	256K	base, IT

Overview of Capabilities and Architecture

Gemma 4 leverages several architecture components used in previous Gemma versions and other open models, and leaves out complex or inconclusive features such as AltUp. The resulting mix is designed to be highly compatible across libraries and devices and to efficiently support long-context and agentic use cases, while remaining well suited to quantization.

As shown in the benchmark results, this feature mix (combined with the training data and recipe) enables the 31B dense model to achieve an estimated LMArena score (text only) of 1452, while the 26B MoE reaches 1441 with just 4B active parameters 🤯. As we'll see, multimodal operation is comparatively as good as text generation, at least in informal and subjective tests.

These are the main architecture characteristics in Gemma 4:

Alternating local sliding-window and global full-context attention layers. Smaller dense models use sliding windows of 512 tokens while larger models use 1024 tokens.
Dual RoPE configurations: standard RoPE for sliding layers, pruned RoPE for global layers, to enable longer context.
Per-Layer Embeddings (PLE): a second embedding table that feeds a small residual signal into every decoder layer.
Shared KV Cache: the last N layers of the model reuse key-value states from earlier layers, eliminating redundant KV projections.
Vision encoder (except 12B): uses learned 2D positions and multidimensional RoPE. Preserves the original aspect ratios and can encode images to a few different token budgets (70, 140, 280, 560, 1120).
Audio encoder (E2B, E4B): USM-style conformer with the same base architecture as the one in Gemma-3n.
Unified encoder-free multimodality (12B only): the 12B Unified variant drops the separate vision and audio encoders and projects raw image patches and audio waveforms directly into the LLM's embedding space. See Unified Multimodal (12B).
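To make the first item in the list above more concrete, here is a minimal sketch of how a local sliding-window mask differs from a global causal mask. The function name and the tiny window size are illustrative assumptions (the source only states windows of 512 or 1024 tokens), and whether the window counts the current token is also an assumption here.

```python
def attention_mask(seq_len, window=None):
    """Return a causal mask; mask[q][k] is True when query q may attend to key k.

    window=None gives global (full-context) causal attention.
    An integer gives local attention over the most recent `window` tokens,
    counting the current token (an assumption made for this sketch).
    """
    mask = []
    for q in range(seq_len):
        row = []
        for k in range(seq_len):
            causal = k <= q
            in_window = window is None or q - k < window
            row.append(causal and in_window)
        mask.append(row)
    return mask


# Tiny sizes for readability; real layers would use 512 or 1024.
local_mask = attention_mask(6, window=3)
global_mask = attention_mask(6)
```

In this sketch the last query of `local_mask` sees only the three most recent positions, while the same query in `global_mask` sees the whole prefix, which is why only the global layers pay the full long-context cost.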

Per-Layer Embeddings (PLE)

One of the most distinctive features in smaller Gemma 4 models is Per-Layer Embeddings (PLE), which was introduced previously in Gemma-3n. In a standard transformer, each token gets a single embedding vector at input, and the same initial representation is what the residual stream builds on across all layers, forcing the embedding to frontload everything the model might need.

PLE adds a parallel, lower-dimensional conditioning pathway alongside the main residual stream. For each token, it produces a small dedicated vector for every layer by combining two signals: a token-identity component (from an embedding lookup) and a context-aware component (from a learned projection of the main embeddings). Each decoder layer then uses its corresponding vector to modulate the hidden states via a lightweight residual block after attention and feed-forward. This gives each layer its own channel to receive token-specific information only when it becomes relevant, rather than requiring everything to be packed into a single upfront embedding. Because the PLE dimension is much smaller than the main hidden size, this adds meaningful per-layer specialization at modest parameter cost.

For multimodal inputs (images, audio, video), PLE is computed before soft tokens are merged into the embedding sequence, since PLE relies on token IDs that are lost once multimodal features replace the placeholders. Multimodal positions use the pad token ID, effectively receiving neutral per-layer signals.
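The data flow described above can be sketched in plain Python for a single token. This is a simplified illustration rather than the actual Gemma 4 implementation: the sizes, the random weights, the `1/sqrt(2)` scaling used to combine the two signals, and the `tanh` gate are all assumptions chosen to keep the example short.

```python
import math
import random

random.seed(0)

# Toy sizes (assumptions): PLE_DIM is much smaller than HIDDEN.
VOCAB, HIDDEN, PLE_DIM, NUM_LAYERS = 10, 8, 2, 3


def rand_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


# Main embedding table plus a second, smaller table with one slice per layer.
main_embed = rand_matrix(VOCAB, HIDDEN)
ple_embed = [rand_matrix(VOCAB, PLE_DIM) for _ in range(NUM_LAYERS)]
# Learned projection of the main embedding into each layer's PLE space.
ple_proj = [rand_matrix(PLE_DIM, HIDDEN) for _ in range(NUM_LAYERS)]
# Per-layer residual block: gate down to PLE_DIM, then project back to HIDDEN.
gate_down = [rand_matrix(PLE_DIM, HIDDEN) for _ in range(NUM_LAYERS)]
proj_up = [rand_matrix(HIDDEN, PLE_DIM) for _ in range(NUM_LAYERS)]


def per_layer_inputs(token_id):
    """Build one small vector per layer from the two signals."""
    x = main_embed[token_id]
    vectors = []
    for layer in range(NUM_LAYERS):
        identity = ple_embed[layer][token_id]   # token-identity component
        context = matvec(ple_proj[layer], x)    # context-aware component
        vectors.append([(a + b) / math.sqrt(2) for a, b in zip(identity, context)])
    return vectors


def apply_ple(hidden, ple_vector, layer):
    """Lightweight residual update of the hidden state for one layer."""
    gate = [math.tanh(g) for g in matvec(gate_down[layer], hidden)]
    gated = [g * p for g, p in zip(gate, ple_vector)]
    update = matvec(proj_up[layer], gated)
    return [h + u for h, u in zip(hidden, update)]


token_id = 7
ple = per_layer_inputs(token_id)
hidden = list(main_embed[token_id])
for layer in range(NUM_LAYERS):
    # Attention and feed-forward would run here in a real decoder layer.
    hidden = apply_ple(hidden, ple[layer], layer)
```

The point of the sketch is the shape of the computation: the per-layer vectors are computed once from the token ID and its main embedding, and each layer only pays for two small matrix products to consume its own vector.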

Shared KV Cache

The shared KV cache is an efficiency optimization that reduces both compute and memory during inference. The last num_kv_shared_layers layers of the model don't compute their own key and value projections. Instead, they reuse the K and V tensors from the last non-shared layer of the same attention type (sliding or full).

In practice, this has a minimal impact on quality while being much more efficient (in terms of both memory and compute) for long context generation and on-device use.
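As a rough illustration of the reuse rule described above, the following sketch computes, for each layer, which layer's K and V tensors it would read. The function name and the 8-layer pattern are hypothetical; only `num_kv_shared_layers` and the "last non-shared layer of the same attention type" rule come from the description.

```python
def kv_source_layers(layer_types, num_kv_shared_layers):
    """Map each layer index to the layer whose K/V tensors it uses.

    layer_types: list of "sliding" or "full", one entry per decoder layer.
    Non-shared layers map to themselves. Each of the last
    num_kv_shared_layers layers maps to the last non-shared layer
    with the same attention type.
    """
    first_shared = len(layer_types) - num_kv_shared_layers
    last_non_shared = {}
    sources = []
    for idx, kind in enumerate(layer_types):
        if idx < first_shared:
            last_non_shared[kind] = idx
            sources.append(idx)
        else:
            sources.append(last_non_shared[kind])
    return sources


# Hypothetical layer pattern, for illustration only.
layer_types = ["sliding", "sliding", "full", "sliding", "sliding", "full", "sliding", "full"]
sources = kv_source_layers(layer_types, num_kv_shared_layers=3)
```

With this toy pattern, the last three layers read from layers 2 and 4, so only five of the eight layers need to store their own keys and values.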

Unified Multimodal (12B)

The Gemma 4 12B Unified model has no separate vision or audio encoder. Instead, raw image patches and audio waveforms are projected directly into the LLM's embedding space through lightweight linear layers, and all modalities flow into a single decoder-only transformer. This reduces multimodal latency and allows the whole model to be fine-tuned in one pass. The checkpoint size makes it deployment-friendly on consumer hardware.

Multimodal Capabilities

We saw in our tests that Gemma 4 supports comprehensive multimodal capabilities out of the box. We don't know what the training mix was, but we had success using it for tasks such as OCR, speech-to-text, object detection, and pointing. It also supports text-only and multimodal function calling, reasoning, and code completion and correction.

Here, we show a few inference examples across different model sizes. You can run them conveniently with this notebook. We encourage you to try the demos and share them below this blog!

Multimodal Input Order

Gemma 4 was trained with a specific convention to interleave input modalities:

Image content goes before the text in your prompt.
Audio content goes after the text in your prompt.

To be specific, this is a correct fragment that prepares inputs for the chat template:

image_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"
audio_url = "instructions.m4a"      # A recording of "describe this image"
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_url},
            {"type": "text", "text": "Answer in French."},
            {"type": "audio", "audio": audio_url},
        ]
    }
]
# Cette image montre un personnage anthropomorphe ressemblant à un lapin, vêtu d'un manteau bleu et d'un pantalon beige, se tenant sur un chemin de terre dans un paysage rural idyllique ...

The audio inference snippets in this post don't follow this convention, as they were written before this guidance was confirmed. We are keeping them as originally written so result tables remain reproducible. For your own code, prefer the order above.
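If you assemble the content list programmatically, a small helper can enforce the ordering convention. This is a minimal sketch: `order_content` is a hypothetical helper, not part of any library, and it only covers the three item types named in the convention above.

```python
def order_content(items):
    """Return content items ordered as image, then text, then audio.

    sorted() is stable, so items of the same type keep their relative order.
    """
    rank = {"image": 0, "text": 1, "audio": 2}
    return sorted(items, key=lambda item: rank[item["type"]])


content = order_content([
    {"type": "audio", "audio": "instructions.m4a"},
    {"type": "text", "text": "Answer in French."},
    {"type": "image", "image": "rabbit.jpg"},
])
messages = [{"role": "user", "content": content}]
```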

Object Detection and Pointing

GUI detection

We test Gemma 4 on GUI element detection and pointing across different sizes, with the following image and text prompt: "What's the bounding box for the "view recipe" element in the image?"

With this prompt, the model natively responds in JSON format with the detected bounding boxes, with no need for specific instructions or grammar-constrained generation. We found that the coordinates are expressed on a 1000x1000 grid, relative to the input dimensions, so they need to be rescaled to the actual image size.

We visualize the outputs below for your convenience. We parse the bounding boxes from the returned JSON:

[
  {"box_2d": [171, 75, 245, 308], "label": "view recipe element"}
]
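The parsing and rescaling step can be sketched as follows. The helper name is hypothetical, and the `[y_min, x_min, y_max, x_max]` ordering of `box_2d` is an assumption that the text above does not confirm, so check it against your own outputs before relying on it.

```python
import json


def to_pixel_boxes(response_text, image_width, image_height):
    """Extract the JSON list from a model response and scale boxes to pixels.

    Assumes box_2d is ordered [y_min, x_min, y_max, x_max] on a 1000x1000
    grid; swap the unpacking below if your outputs use a different order.
    """
    # Keep only the JSON array, dropping any surrounding markdown or text.
    payload = response_text[response_text.index("["): response_text.rindex("]") + 1]
    boxes = []
    for item in json.loads(payload):
        y_min, x_min, y_max, x_max = item["box_2d"]
        boxes.append({
            "label": item["label"],
            "x_min": round(x_min * image_width / 1000),
            "y_min": round(y_min * image_height / 1000),
            "x_max": round(x_max * image_width / 1000),
            "y_max": round(y_max * image_height / 1000),
        })
    return boxes


response = 'json\n[\n {"box_2d": [171, 75, 245, 308], "label": "view recipe element"}\n]\n'
# The 2000x1000 image size is made up for this example.
boxes = to_pixel_boxes(response, image_width=2000, image_height=1000)
```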

[Figure: GUI element detection outputs for E2B, E4B, 26B A4B, and 31B]

Object Detection

We also test the models on detecting everyday objects. Here we ask them to detect the bike and compare the outputs of the different models. As in the previous case, we parse the bounding box from the JSON and translate it to image-space coordinates.

[Figure: bike detection outputs for E2B, E4B, 26B A4B, and 31B]

Multimodal Thinking and Function Calling

We asked Gemma 4 to write HTML code to reconstruct a page we made with Gemini 3. Below you can find the code to do this. We enable thinking and let each model generate up to 4000 new tokens, to leave enough room for both the reasoning and the full page.

Gemini Generated Website (Reference)	Gemini Reproduced Image

Inference code

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://huggingface.co/datasets/merve/vlm_test_images/resolve/main/landing_page.png",
            },
            {"type": "text", "text": "Write HTML code for this page."},
        ],
    }
]
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
    enable_thinking=True,
).to(model.device)
output = model.generate(**inputs, max_new_tokens=4000)
input_len = inputs.input_ids.shape[-1]
generated_text_ids = output[0][input_len:]
generated_text = processor.decode(generated_text_ids, skip_special_tokens=True)
result = processor.parse_response(generated_text)
print(result["content"])

[Figure: pages rendered from the HTML generated by E2B, E4B, 31B, and 26B A4B]

Video Understanding

Smaller Gemma 4 models can take in videos with audio, while larger ones can take in videos without audio. Although the models are not explicitly post-trained on videos, they can understand videos both with and without audio. The models are particularly strong on audio.

Inference code

messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "url": "https://huggingface.co/datasets/merve/vlm_test_images/resolve/main/concert.mp4"},
            {"type": "text", "text": "What is happening in the video? What is the song about?"},
        ],
    },
]
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
    load_audio_from_video=True, # disable this for larger models
).to(model.device)
output = model.generate(**inputs, max_new_tokens=200)
input_len = inputs.input_ids.shape[-1]
generated_text_ids = output[0][input_len:]
generated_text = processor.decode(generated_text_ids, skip_special_tokens=True)
result = processor.parse_response(generated_text)
print(result["content"])

Model	Output
E2B (hallucinated for the audio)	Based on the video provided, here is an analysis: What is happening in this video? The video captures a live musical performance, likely at a large outdoor festival or concert. * Setting: The scene is a large stage set up outdoors, indicated by the extensive lighting rig, large speakers, and the audience visible in the foreground. * Performers: There are several musicians on stage. The focal point seems to be a lead singer/guitarist in a blue shirt and white pants who is actively performing, playing an acoustic or electric guitar. There are also other musicians, including drummers and bassists, visible in the background. * Atmosphere: The atmosphere is energetic and dramatic, characterized by bright stage lights, intense blue and white spotlights, and a significant amount of stage fog/smoke, which adds to the concert mood. * Audience: The foreground is filled with the backs of audience members, showing they are close to the stage and engaged in watching the performance. What is the song about? The lyrics provided below the video are: > "I don't care 'bout street > From what I cherish > Could it's time to protest
E4B ✅	The video captures a live performance of a band on a stage with bright lights and smoke effects, surrounded by an audience. The band performs and sings to the crowd. The song appears to be about the struggles and disillusionment of modern life, specifically the feeling of being stuck and the difficulties in finding one's place in a chaotic and overwhelming world. Lyrics mention themes of the "concrete street," "broken promises," "storm of the day," and the difficulty of finding oneself amidst societal pressures.
26/A4B (no audio) ✅	This video shows a live music performance at an outdoor concert or festival at night. A solo male artist, dressed in a blue shirt and white pants, is the central focus. He is playing an acoustic guitar and singing into a microphone on a large, lit stage. The performance is accompanied by a band, including a drummer and other musicians visible in the background and on large video screens. The stage is filled with atmospheric blue lighting, smoke, and bright spotlights. The video captures different perspectives, including shots from the crowd looking up at the stage and close-up views of the performers projected onto large screens. The crowd is visible in the foreground, watching the show. One of the screens also displays the "Estrella Damm" brand name.
31B (no audio) ✅	A musician is performing on stage during a concert, playing an acoustic guitar and singing into a microphone. He is wearing a blue shirt and white pants. The stage is filled with atmospheric smoke and illuminated by bright blue and white lights. Other band members, including a drummer and a keyboardist, are also visible on stage. The video shows the musician from the perspective of the audience, with some shots focusing on him and others showing the large screen on the side of the stage, which displays close-ups of the performer and the crowd.

Captioning

We have tested all models on captioning. All checkpoints perform very well and accurately capture nuance in complex scenarios. Here's the image we use, with the prompt "Write single detailed caption for this image.".

Inference code

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/merve/vlm_test_images/resolve/main/bird.png"},
            {"type": "text", "text": "Write single detailed caption for this image."},
        ],
    },
]
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
).to(model.device)
output = model.generate(**inputs, max_new_tokens=512)
input_len = inputs.input_ids.shape[-1]
generated_text_ids = output[0][input_len:]
generated_text = processor.decode(generated_text_ids, skip_special_tokens=True)
result = processor.parse_response(generated_text)
print(result["content"])

Model	Output
E2B ✅	A medium shot captures a weathered seagull perched atop a stone pedestal in what appears to be a bustling European square, with a grand, classical-style building featuring ornate columns and architectural details dominating the right side of the frame. In the background, a low, light-colored building stretches across the middle ground, flanked by some green foliage, and hints of other structures, including a terracotta-roofed building, are visible. In the foreground, part of a modern bus stop or shelter is visible, with signage indicating routes, and the overall scene is bathed in bright daylight under a pale sky.
E4B ✅	A solitary seagull perches atop a grey stone pedestal in a bustling European piazza, with the historical grandeur of a classical building dominating the right side of the frame, contrasting with the more modern, low-rise architecture in the background under a bright, hazy sky.
26B/A4B ✅	A full shot shows a white and brown seagull perched on a gray pillar in Venice, Italy. The seagull faces left, its beak pointed towards the edge of the frame. Behind it is a light gray, two-story building with several dark rectangular windows. To the right of the seagull is a large, white building with columns and intricate carvings. The sky is a clear, light blue. The lighting is bright and even, suggesting it is daytime.
31B ✅	A high-angle, medium shot shows a white and gray seagull perched on a gray concrete pylon in Venice, Italy. The seagull is facing left, its body angled slightly towards the camera. Its white head, neck, and chest contrast with its gray wings and back. The seagull's orange beak is slightly open. The pylon is a light gray concrete cylinder with a weathered texture. In the background, there are several buildings. To the left is a long, low, white building with a series of small, rectangular windows. In the center is a cluster of green trees. To the right is a red building with a tiled roof and several small, rectangular windows. Further right is a large, ornate, white stone building with several columns and sculptures. The sky is a pale blue, and the light is bright and clear. In the lower-left corner, there is a yellow bus with the word "Ferrovia" printed on it in black letters. There are several people walking in the background, but they are small and out of focus.

Audio Question Answering

These models are trained to answer questions about speech in audio. Music and non-speech sounds were not part of the training data.

Inference code

messages = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "url": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/obama_first_45_secs.mp3"},
            {"type": "text", "text": "Can you describe this audio in detail?"},
        ],
    },
]

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
).to(model.device)

output = model.generate(
    **inputs,
    max_new_tokens=1000,
    do_sample=False,
)

print(processor.decode(output[0], skip_special_tokens=True))

Model	Output
E2B	This audio is a personal reflection. The speaker is talking about their final farewell address to the nation, which they delivered in Chicago. They express gratitude for the conversations they've had with the American people, noting that despite not having met them face-to-face or even greeted them, these interactions in various settings like living rooms, schools, farms, factory floors, diners, and military outposts have been what has kept them going.
E4B	The audio is a speech excerpt where a speaker is delivering a farewell address to the nation from Chicago. The speaker reflects on their time in office, expressing gratitude for the conversations they had with the American people across various settings like living rooms, schools, farms, factories, diners, and military outposts. The tone is reflective and appreciative, highlighting the importance of these interactions in their political journey.

Here is an example if you want to do transcription:

Inference code

messages = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "url": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/obama_first_45_secs.mp3"},
            {"type": "text", "text": "Transcribe the audio?"},
        ],
    },
]

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
).to(model.device)

output = model.generate(
    **inputs,
    max_new_tokens=1000,
    do_sample=False,
)

print(processor.decode(output[0], skip_special_tokens=True))

Model	Output
E2B	This week I traveled to Chicago to deliver my final farewell address to the nation following in the tradition of presidents before me It was an opportunity to say thank you whether we've seen eye to eye or rarely agreed at all my conversations with you the American people in living rooms and schools at farms and on factory floors at diners and on distant military outposts all these conversations are what have kept me honest
E4B	This week I traveled to Chicago to deliver my final farewell address to the nation following in the tradition of presidents before me. It was an opportunity to say thank you. Whether we've seen eye to eye or rarely agreed at all, my conversations with you, the American people, in living rooms and schools, at farms and on factory floors, at diners and on distant military outposts, all these conversations are what have kept me honest.

Multimodal Function Calling

We test the model by asking it to get the weather for the place shown in the image.

Inference code

import re 
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Gets the current weather for a specific location.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "The city name"},
            },
            "required": ["city"],
        },
    },
}
tools = [WEATHER_TOOL]
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is the city in this image? Check the weather there right now."},
            {"type": "image", "image": "https://huggingface.co/datasets/merve/vlm_test_images/resolve/main/thailand.jpg"},
        ],
    },
]
inputs = processor.apply_chat_template(
    messages,
    tools=tools,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
    enable_thinking=True,
).to(model.device)
output = model.generate(**inputs, max_new_tokens=1000)
input_len = inputs.input_ids.shape[-1]
generated_text_ids = output[0][input_len:]
generated_text = processor.decode(generated_text_ids, skip_special_tokens=True)
result = processor.parse_response(generated_text)
print(result["content"])

Model	Output
E2B	Analyze the Request: The user has provided an image and asked two questions:\n * "What is the city in this image?" * "Check the weather there right now."\n\n2. Analyze the Image: The image displays a large, ornate, tall structure, characteristic of temple architecture, likely in Southeast Asia (given the style). It appears to be a famous landmark. 3. Identify the Landmark/Location (Internal Knowledge/Visual Cues): The structure strongly resembles the chedis/stupas found in places like Bangkok, Thailand (e.g., Wat Arun or similar structures).\n\n4. Determine the City: Based on the visual evidence, the location is highly likely to be Bangkok, Thailand.\n\n5. Check Available Tools: The available tool is `get_weather(city: STRING)`. 6. Formulate the Plan:\n * Step 1: Identify the city from the image (Bangkok).\n * Step 2: Use the `get_weather` tool with the identified city.\n\n7. Construct the Tool Call: `get_weather(city="Bangkok")`\n\n8. Generate Response: The model should first state the identified city and then call the tool to get the weather. (Self-correction: The model must call the tool in the correct format if it needs external information.)call:get_weather{city:Bangkok}
E4B	'1. Analyze the Request: The user has provided an image and asked two things:\n a. Identify the city in the image.\n b. Check the weather in that city right now.\n\n2. Analyze the Image: The image shows a very distinctive, ornate tower/spire structure, characteristic of Thai architecture (specifically, a chedi or prang).\n\n3. Determine City Identification (Internal Knowledge/Visual Recognition): Based on the architectural style, the structure strongly resembles prominent temples in Thailand, such as Wat Arun in Bangkok.\n\n4. Evaluate Tool Availability: I have one available tool: `get_weather(city: STRING)`. This tool requires a city name as input.\n\n5. Formulate a Plan:\n a. I need the city name to use the `get_weather` tool.\n b. I must first identify the city from the image.\n c. Since my internal knowledge suggests Bangkok, I will proceed with that assumption, but I must acknowledge that I cannot definitively prove the location from the image alone.\n\n6. Execute Step 1 (City Identification): Assume the city is Bangkok based on the visual evidence of the architecture.\n\n7. Execute Step 2 (Tool Call Generation): Use the `get_weather` tool with "Bangkok" as the `city` argument.\n\n8. Construct the Response: Generate the appropriate tool call in JSON format.', 'answer': '<\|tool_call>call:get_weather{city:<\|"\|>Bangkok<\|"\|>}<tool_call\|>'
31B	The user wants to know the city in the image and the current weather there. 1. Analyze the image: The image shows ornate, spire-like structures (prangs) characteristic of Thai temple architecture. Specifically, these look like the prangs at Wat Phra Kaew (Temple of the Emerald Buddha) or Wat Pho in Bangkok, Thailand. The style is very distinctive of Bangkok's royal temple complexes. 2. Identify the city: Based on the architectural style, the city is Bangkok. 3. Find the weather: I need to call the `get_weather` tool with `city="Bangkok"`.call:get_weather{city:Bangkok}
26/A4B	The image shows several tall, ornate towers (prangs) typical of Thai Buddhist temple architecture. Specifically, these look like the structures within the Grand Palace or Wat Phra Kaew in Bangkok, Thailand. The style is unmistakably Thai. I'll confirm this by looking at the specific towers. These are very iconic.\n2. Confirm the location: The image shows elements of the Grand Palace complex in Bangkok.\n3. Get weather information: Once the city is identified as Bangkok, I will use the `get_weather` tool with `city='Bangkok'`.call:get_weather{city:Bangkok}
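The outputs above end with a tool call written as `call:get_weather{city:Bangkok}`, in one case with the argument wrapped in `<|"|>` tokens. Below is a minimal sketch of extracting such calls with a regular expression. It is based only on the strings shown in the table, not on a documented grammar, so treat the pattern as an assumption; it does not handle argument values that contain commas or braces.

```python
import re

# Pattern inferred from the sample outputs above (an assumption).
TOOL_CALL_RE = re.compile(r"call:(\w+)\{(.*?)\}")
QUOTE_TOKEN = '<|"|>'


def parse_tool_calls(text):
    """Return a list of {"name": ..., "arguments": {...}} found in text."""
    calls = []
    for name, raw_args in TOOL_CALL_RE.findall(text):
        arguments = {}
        for pair in raw_args.split(","):
            if not pair.strip():
                continue
            key, _, value = pair.partition(":")
            arguments[key.strip()] = value.replace(QUOTE_TOKEN, "").strip()
        calls.append({"name": name, "arguments": arguments})
    return calls


plain = 'I need to call the `get_weather` tool with `city="Bangkok"`.call:get_weather{city:Bangkok}'
quoted = '<|tool_call>call:get_weather{city:<|"|>Bangkok<|"|>}<tool_call|>'
plain_calls = parse_tool_calls(plain)
quoted_calls = parse_tool_calls(quoted)
```

Both sample strings yield a single `get_weather` call with `city` set to `Bangkok`, which you could then dispatch to your own function and feed back to the model as a tool response.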

Deploy Anywhere

Gemma 4 comes with day-0 support for many open-source inference engines, and is ideal for tool calling and agents! We also release ONNX checkpoints that can run on many hardware backends, allowing use cases on edge devices or in browser!

transformers

Gemma 4 comes with first-class transformers support from the get-go 🤗. This integration allows using the model with other libraries like bitsandbytes, PEFT and TRL. Make sure to install the latest version of transformers.

pip install -U transformers

The easiest way to infer with the small Gemma 4 models is through the any-to-any pipeline. You can initialize it as follows.

from transformers import pipeline
pipe = pipeline("any-to-any", model="google/gemma-4-E2B-it")

You can then pass in images and text as follows.

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://huggingface.co/datasets/merve/vlm_test_images/resolve/main/thailand.jpg",
            },
            {"type": "text", "text": "Do you have travel advice going to here?"},
        ],
    }
]
output = pipe(messages, max_new_tokens=100, return_full_text=False)
output[0]["generated_text"]
# Based on the image, which appears to show a magnificent, ornate **Buddhist temple or pagoda**, likely in Southeast Asia (such as Thailand, Myanmar, or Cambodia), here is some general travel advice..

When inferring with videos, you can include the audio track using the load_audio_from_video argument.

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "url": "https://huggingface.co/datasets/merve/vlm_test_images/resolve/main/rockets.mp4",
            },
            {"type": "text", "text": "What is happening in this video?"},
        ],
    }
]
pipe(messages, load_audio_from_video=True)

Going a level lower, you can load Gemma 4 using the AutoModelForMultimodalLM class, which is especially useful for fine-tuning. The built-in chat template takes care of formatting the inputs correctly. Please make sure you use it, to prevent subtle mistakes when building the prompt manually.

Inference code

from transformers import AutoModelForMultimodalLM, AutoProcessor
model = AutoModelForMultimodalLM.from_pretrained("google/gemma-4-E2B-it", device_map="auto")
processor = AutoProcessor.from_pretrained("google/gemma-4-E2B-it")
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "url": "https://huggingface.co/datasets/merve/vlm_test_images/resolve/main/rockets.mp4",
            },
            {"type": "text", "text": "What is happening in this video?"},
        ],
    }
]
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

llama.cpp

Gemma 4 models come with image+text support in llama.cpp from the get-go! This unlocks using Gemma 4 with all of your favorite local apps: llama.cpp server, LM Studio, and Jan, as well as coding agents like Pi, across many backends such as Metal and CUDA.

You can install llama.cpp as follows.

curl -LsSf https://llama.app/install.sh | sh

You can then start a server compatible with the OpenAI API. To choose a specific precision, append a quantization tag such as :Q4_K_M to the model name at the end of the command.

llama serve -hf ggml-org/gemma-4-E2B-it-GGUF

Check out this link for more options on combining llama.cpp with different coding agents and local apps. Find all the GGUF checkpoints in this collection.
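
Once the server is up, any OpenAI-style client can talk to it. Here is a minimal sketch using only the Python standard library. It assumes the server listens on localhost:8080 (the base URL used in the agent configs further down) and uses a placeholder model name; build_chat_request and ask are illustrative helpers, not part of llama.cpp.

```python
import json
import urllib.request

# Assumptions: the server listens on localhost:8080 and the model name
# is a placeholder; adjust both to your setup.
BASE_URL = "http://localhost:8080/v1"


def build_chat_request(prompt, model="gemma-4", base_url=BASE_URL):
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt):
    """Send the request; requires the local server to be running."""
    with urllib.request.urlopen(build_chat_request(prompt)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


request = build_chat_request("Why is the sky blue?")
print(request.full_url)
```

Calling ask("Why is the sky blue?") sends the request and returns the reply text, provided the server is reachable.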

Plug in your local agent

We worked on making sure the new models work locally with agents like openclaw, hermes, pi, and open code. All thanks to llama.cpp! Run the following to try Gemma 4 right away.

First, start your local server:

llama serve -hf ggml-org/gemma-4-26b-a4b-it-GGUF:Q4_K_M

For hermes:

hermes model

For openclaw:

openclaw onboard

For pi, define a ~/.pi/agent/models.json:

{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "ggml-org-gemma-4-26b-4b-gguf"
        }
      ]
    }
  }
}

For open code, define a ~/.config/opencode/opencode.json:

{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "llama.cpp": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "llama server (local)",
      "options": {
        "baseURL": "http://127.0.0.1:8080/v1"
      },
      "models": {
        "gemma-4-26b-4b-it": {
          "name": "Gemma 4 (local)",
          "limit": {
            "context": 128000,
            "output": 8192
          }
        }
      }
    }
  }
}
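
If you are unsure which model id to put in these configs, OpenAI-compatible servers usually expose a /v1/models endpoint that lists the ids they serve. The sketch below assumes the same localhost:8080 base URL as the configs above; the helper names and the sample id are hypothetical.

```python
import json
import urllib.request


def extract_model_ids(models_response):
    """Pull the ids out of an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in models_response.get("data", [])]


def list_local_models(base_url="http://localhost:8080/v1"):
    """Query the running server; requires the local server to be up."""
    with urllib.request.urlopen(f"{base_url}/models") as response:
        return extract_model_ids(json.load(response))


# Canned response with a made-up id, so the parsing can be tried offline.
sample = {"object": "list", "data": [{"id": "example-model", "object": "model"}]}
print(extract_model_ids(sample))  # ['example-model']
```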

transformers.js

transformers.js enables running Gemma 4 right inside the browser. You can check out the model card to see text-only, image & text, and audio & text inference in detail here. We also shipped a demo for you to test the model here.

MLX

Full multimodal support of Gemma 4 is available using the open-source mlx-vlm library. Here's how to ask the model to describe an image:

pip install -U mlx-vlm

mlx_vlm.generate \
--model google/gemma-4-E4B-it \
--image https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg \
--prompt "Describe this image in detail"

mlx-vlm supports TurboQuant, which delivers the same accuracy as the uncompressed baseline while using ~4x less active memory and running a lot faster end-to-end. This makes long-context inference practical on Apple Silicon without sacrificing quality. Use it like this:

mlx_vlm.generate \
--model "mlx-community/gemma-4-26b-a4b-it-4bit" \
--prompt "Your prompt here" \
--kv-bits 3.5 \
--kv-quant-scheme turboquant

For audio examples and more details, please check the MLX collection.
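
To get a feel for what --kv-bits 3.5 means for memory, here is a back-of-the-envelope sketch of KV cache size. The layer count, head count, and head dimension are hypothetical values chosen for illustration, not Gemma 4's real configuration, and the estimate counts only the raw cached values (it ignores any per-block quantization metadata).

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bits_per_value):
    """Size of the key and value caches for one sequence, in bytes."""
    values = 2 * num_layers * num_kv_heads * head_dim * seq_len  # 2 = keys + values
    return values * bits_per_value / 8


# Hypothetical dimensions chosen for illustration, not Gemma 4's real config.
shape = dict(num_layers=30, num_kv_heads=8, head_dim=128, seq_len=131_072)

baseline = kv_cache_bytes(**shape, bits_per_value=16)
quantized = kv_cache_bytes(**shape, bits_per_value=3.5)

print(f"16-bit cache:  {baseline / 2**30:.2f} GiB")   # 15.00 GiB
print(f"3.5-bit cache: {quantized / 2**30:.2f} GiB")  # 3.28 GiB
print(f"ratio: {baseline / quantized:.2f}x")          # 4.57x
```

The raw ratio of 16 / 3.5 is in the same ballpark as the ~4x figure quoted above.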

Mistral.rs

mistral.rs is a Rust-native inference engine with day-0 Gemma 4 support across all modalities (text, image, video, audio) and built-in tool-calling and agentic functionality. Install mistral.rs:

curl --proto '=https' --tlsv1.2 -sSf https://raw.githubusercontent.com/EricLBuehler/mistral.rs/master/install.sh | sh # Linux/macOS

irm https://raw.githubusercontent.com/EricLBuehler/mistral.rs/master/install.ps1 | iex # Windows

You can then start an OpenAI-compatible HTTP server:

mistralrs serve mistralrs-community/gemma-4-E4B-it-UQFF --from-uqff 8

Or, use interactive mode:

mistralrs run -m google/gemma-4-E4B-it --isq 8 --image image.png -i "Describe this image in detail."

mistralrs run -m google/gemma-4-E4B-it --isq 8 --audio audio.mp3 -i "Transcribe this fully."

Find all models here. Please follow the instructions in the model cards for installation and inference guidelines.

Multi-Token Prediction Drafters

Google has released Multi-Token Prediction (MTP) drafters for the Gemma 4 family: small assistant models that accelerate inference via speculative decoding. The drafter proposes several future tokens at once, and the target model verifies them in a single forward pass. You get the same outputs as the target model, just faster — no quality loss, no changes to reasoning behaviour. Reported end-to-end speedups go up to ~3x depending on hardware, batch size, and workload.

Assistants are available for four Gemma 4 sizes (E2B, E4B, 26B A4B, 31B). They share the KV cache with the target model to avoid recomputing context, and the smaller edge variants additionally use an embedder clustering trick to keep memory and compute low on-device.
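
The draft-then-verify loop can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the "models" are simple arithmetic functions, verification is greedy (real implementations also handle sampling), and the single target forward pass is simulated by a loop. It is not how the Gemma 4 drafters are implemented, but it shows why the output matches plain decoding with the target model.

```python
def target_next(context):
    """Toy 'target model': deterministic next token from the context."""
    return (sum(context) * 31 + len(context)) % 50


def draft_next(context):
    """Toy 'drafter': agrees with the target except at every 4th position."""
    guess = target_next(context)
    return guess if len(context) % 4 else (guess + 1) % 50


def greedy_decode(prompt, num_tokens):
    """Reference: one target call per generated token."""
    tokens = list(prompt)
    for _ in range(num_tokens):
        tokens.append(target_next(tokens))
    return tokens[len(prompt):]


def speculative_decode(prompt, num_tokens, draft_len=4):
    tokens, target_passes = list(prompt), 0
    while len(tokens) - len(prompt) < num_tokens:
        # 1) The drafter proposes a short run of tokens.
        proposal = []
        for _ in range(draft_len):
            proposal.append(draft_next(tokens + proposal))
        # 2) The target checks every proposed position. A real model scores
        #    them all in one forward pass; here that pass is simulated.
        target_passes += 1
        accepted = []
        for token in proposal:
            expected = target_next(tokens + accepted)
            accepted.append(expected)  # the target's token is always kept
            if token != expected:
                break  # first mismatch ends the accepted run
        else:
            accepted.append(target_next(tokens + accepted))  # bonus token
        tokens.extend(accepted)
    return tokens[len(prompt):len(prompt) + num_tokens], target_passes


prompt = [3, 1, 4]
fast, passes = speculative_decode(prompt, num_tokens=12)
print(f"{len(fast)} tokens in {passes} target passes")
print("matches greedy decoding:", fast == greedy_decode(prompt, 12))
```

Because every kept token is the one the target would have produced for that prefix, the result is identical to greedy decoding; the drafter only changes how many target passes are needed.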

Find the checkpoints in the Gemma 4 collection and the mlx-community collection.

DiffusionGemma: Text Generation via Diffusion

Alongside the autoregressive Gemma 4 family, Google DeepMind is releasing DiffusionGemma, a multimodal model that generates text using discrete diffusion instead of token-by-token autoregression. It's built on the same 26B A4B Mixture-of-Experts foundation (25.2B total / 3.8B active parameters, 8 active experts out of 128 plus 1 shared), takes text and image inputs, generates text, and supports up to 256K context — all under the same Apache 2.0 license.

Where a standard causal LM emits one token at a time, DiffusionGemma denoises whole blocks of tokens in parallel. The architecture is encoder-decoder: an autoregressive encoder prefills the prompt and builds the KV cache, while a decoder applies bidirectional attention over a "canvas" of 256 tokens. During multi-canvas sampling, the model iteratively denoises a full canvas with a diffusion sampler; once a canvas is finalized it's encoded and appended to the KV cache, then the next canvas begins. This block-autoregressive approach increases generation speed.
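
A toy sketch of this block-autoregressive loop is shown below. It is only meant to make the control flow concrete: the "denoiser" is a deterministic stand-in, the commit rule (keep the most confident slots each step) is an assumption, and the canvas has 8 slots instead of 256.

```python
MASK = None


def toy_denoiser(context, canvas):
    """Stand-in for the decoder: a (token, confidence) guess per masked slot.

    A real decoder attends bidirectionally over the whole canvas; this toy
    just derives deterministic numbers from the position and context length.
    """
    guesses = {}
    for pos, token in enumerate(canvas):
        if token is MASK:
            absolute = len(context) + pos
            guesses[pos] = (absolute % 100, ((absolute * 7) % 10) / 10)
    return guesses


def denoise_canvas(context, canvas_size, tokens_per_step):
    """Fill one canvas by committing the most confident slots each step."""
    canvas, steps = [MASK] * canvas_size, 0
    while MASK in canvas:
        guesses = toy_denoiser(context, canvas)
        ranked = sorted(guesses, key=lambda pos: guesses[pos][1], reverse=True)
        for pos in ranked[:tokens_per_step]:
            canvas[pos] = guesses[pos][0]
        steps += 1
    return canvas, steps


def generate(prompt, num_canvases, canvas_size=8, tokens_per_step=3):
    context, total_steps = list(prompt), 0  # context plays the KV cache role
    for _ in range(num_canvases):
        canvas, steps = denoise_canvas(context, canvas_size, tokens_per_step)
        context.extend(canvas)  # a finalized canvas joins the context
        total_steps += steps
    return context[len(prompt):], total_steps


tokens, steps = generate(prompt=[1, 2, 3], num_canvases=2)
print(len(tokens), "tokens in", steps, "denoising steps")  # 16 tokens in 6 denoising steps
```

The point of the sketch is the ratio: several tokens are committed per denoising step, whereas token-by-token decoding would need one step per token.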

The headline benefit is throughput: parallel denoising generates roughly 15–20 tokens per forward pass, reaching per-user generation speeds exceeding 1100 tokens/second at low batch sizes (H100, FP8). Inference compute is adaptive too — simpler prompts and structured tasks like code need fewer denoising steps, so tokens-per-second scales with task complexity. It keeps the broader Gemma 4 toolkit: thinking mode, function calling, long context, native system prompts, and image understanding (OCR, document parsing, object detection, pointing) at variable aspect ratios and resolutions.
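
Taking the quoted figures at face value, a quick calculation shows what they imply per forward pass. This is simple arithmetic on the numbers above (1100 tokens/second treated as a round figure), not a measurement.

```python
def passes_per_second(tokens_per_second, tokens_per_pass):
    """Forward passes per second implied by a throughput figure."""
    return tokens_per_second / tokens_per_pass


for tokens_per_pass in (15, 20):
    rate = passes_per_second(1100, tokens_per_pass)
    print(f"{tokens_per_pass} tokens/pass -> {rate:.1f} passes/s, "
          f"{1000 / rate:.1f} ms per pass")
# 15 tokens/pass -> 73.3 passes/s, 13.6 ms per pass
# 20 tokens/pass -> 55.0 passes/s, 18.2 ms per pass
```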

Benchmarks show an expected trade-off between speed and evaluation metrics — DiffusionGemma trails the autoregressive 26B A4B on most tasks (e.g. MMLU Pro 77.6% vs 82.6%, AIME 2026 69.1% vs 88.3%, GPQA Diamond 73.2% vs 82.3%) in exchange for its large speed advantage, while edging ahead on a few (HLE no tools 11.0% vs 8.7%).
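
The gaps are easier to compare as point differences. The snippet below only restates the scores quoted in the paragraph above and subtracts them.

```python
# Scores quoted above: (DiffusionGemma, autoregressive 26B A4B).
scores = {
    "MMLU Pro": (77.6, 82.6),
    "AIME 2026": (69.1, 88.3),
    "GPQA Diamond": (73.2, 82.3),
    "HLE no tools": (11.0, 8.7),
}

deltas = {name: round(diff - auto, 1) for name, (diff, auto) in scores.items()}
for name, delta in deltas.items():
    print(f"{name}: {delta:+.1f} points")
# MMLU Pro: -5.0 points
# AIME 2026: -19.2 points
# GPQA Diamond: -9.1 points
# HLE no tools: +2.3 points
```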

Getting started looks just like the rest of Gemma 4, via the dedicated diffusion class:

from transformers import DiffusionGemmaForBlockDiffusion, AutoProcessor

MODEL_ID = "google/diffusiongemma-26B-A4B-it"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = DiffusionGemmaForBlockDiffusion.from_pretrained(
    MODEL_ID,
    dtype="auto",
    device_map="auto",
)

messages = [{"role": "user", "content": "Why is the sky blue?"}]
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)
output = model.generate(**inputs, max_new_tokens=512)
# Special tokens are kept so control tokens remain visible in the output
text = processor.decode(output[0], skip_special_tokens=False)
print(text)

For best results, Google recommends the Entropy-Bounded (EB) sampler with adaptive stopping (up to 48 denoising steps, a temperature schedule decaying from 0.8 → 0.4, and an entropy bound of 0.1 for token selection). As with Gemma 4, place image content before text in your prompt, and toggle reasoning with the <|think|> control token.
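
To make the sampler settings more concrete, here is a simplified sketch of the two knobs mentioned above. It is one plausible reading, not Google's sampler: the linear shape of the temperature decay and the rule "commit every slot whose entropy is within the bound, and at least one slot per step" are assumptions, and the function names are hypothetical.

```python
import math


def temperature_at(step, max_steps=48, start=0.8, end=0.4):
    """Linear decay from start to end (the linear shape is an assumption)."""
    if max_steps <= 1:
        return end
    fraction = min(step, max_steps - 1) / (max_steps - 1)
    return start + (end - start) * fraction


def entropy(probs):
    """Shannon entropy in nats of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_positions(position_probs, bound=0.1):
    """Commit slots whose entropy is within the bound, and always at least one."""
    entropies = {pos: entropy(probs) for pos, probs in position_probs.items()}
    confident = [pos for pos, h in entropies.items() if h <= bound]
    return sorted(confident) or [min(entropies, key=entropies.get)]


# Made-up per-slot distributions over a 3-token vocabulary.
slots = {0: [0.99, 0.005, 0.005], 1: [0.5, 0.3, 0.2], 2: [0.995, 0.004, 0.001]}
print(select_positions(slots))  # [0, 2]
print(round(temperature_at(0), 2), round(temperature_at(47), 2))  # 0.8 0.4
```

Slots 0 and 2 are nearly certain, so they are committed; slot 1 stays masked for a later step.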

Fine-tuning for all

Gemma 4 models are ideal for fine-tuning in your favorite tools and platforms and at any budget.

Fine-tuning with TRL

Gemma 4 is fully supported for fine-tuning with TRL. To celebrate, TRL has been upgraded with support for multimodal tool responses when interacting with environments, meaning models can now receive images back from tools during training, not just text.

To showcase this, we've built an example training script where Gemma 4 learns to drive in the CARLA simulator. The model sees the road through a camera, decides what to do and learns from the outcome. After training, it consistently changes lanes to avoid pedestrians. The same approach works for any task where a model needs to see and act: robotics, web browsing, or other interactive environments.

Get started:

# pip install git+https://github.com/huggingface/trl.git

python examples/scripts/openenv/carla_vlm_gemma.py \
    --env-urls https://sergiopaniego-carla-env.hf.space \
            https://sergiopaniego-carla-env-2.hf.space \
    --model google/gemma-4-E2B-it

Find the example here.

Fine-tuning with TRL on Vertex AI

Additionally, we have prepared an example of how to fine-tune Gemma 4 with TRL on Vertex AI using SFT. It showcases how to extend the function calling capabilities while freezing both the vision and audio towers. The example includes how to build a custom Docker container with the latest Transformers, TRL, etc. with CUDA support on Google Cloud, and how to run it via Vertex AI Serverless Training Jobs.

# pip install google-cloud-aiplatform --upgrade --quiet
from google.cloud import aiplatform

aiplatform.init(
    project="<PROJECT_ID>",
    location="<LOCATION>",
    staging_bucket="<BUCKET_URI>",
)

job = aiplatform.CustomContainerTrainingJob(
    display_name="gemma-4-fine-tuning",
    container_uri="<CONTAINER_URI>",
    command=["python", "/gcs/gemma-4-fine-tuning/train.py"],
)

job.submit(
    replica_count=1,
    machine_type="a3-highgpu-1g",
    accelerator_type="NVIDIA_H100_80GB",
    accelerator_count=1,
    base_output_dir="<BUCKET_URI>/output-dir",
    environment_variables={
        "MODEL_ID": "google/gemma-4-E2B-it",
        "HF_TOKEN": "<HF_TOKEN>",
    },
    boot_disk_size_gb=500,
)

You can find the complete example in the "Hugging Face on Google Cloud" docs at https://hf.co/docs/google-cloud/examples/vertex-ai-notebooks-fine-tune-gemma-4.
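
Freezing the vision and audio towers boils down to turning off gradients for their parameters. The sketch below shows the idea on a stand-in object so it runs without a GPU or model download. The parameter names (vision_tower, audio_tower) and the freeze_matching helper are assumptions; inspect model.named_parameters() on the real model to find the actual names.

```python
from types import SimpleNamespace


def freeze_matching(model, name_fragments):
    """Disable gradients for parameters whose name contains any fragment."""
    frozen = []
    for name, param in model.named_parameters():
        if any(fragment in name for fragment in name_fragments):
            param.requires_grad = False
            frozen.append(name)
    return frozen


class ToyModel:
    """Stand-in exposing the same named_parameters() shape as a torch module."""

    def __init__(self, names):
        self._params = {name: SimpleNamespace(requires_grad=True) for name in names}

    def named_parameters(self):
        return iter(self._params.items())


# Parameter names are hypothetical; inspect your model to find the real ones.
model = ToyModel([
    "model.vision_tower.layer0.weight",
    "model.audio_tower.layer0.weight",
    "model.language_model.layer0.weight",
])
print(freeze_matching(model, ("vision_tower", "audio_tower")))
```

The same function works on a PyTorch module, since it relies only on named_parameters() and the requires_grad attribute.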

Fine-tuning with Unsloth Studio

If you want to fine-tune and run a Gemma 4 model in a UI, try out Unsloth Studio. It runs locally or on Google Colab. First, install and start the app:

# install unsloth studio on macOS, Linux, WSL
curl -fsSL https://unsloth.ai/install.sh | sh

# install unsloth studio on Windows
irm https://unsloth.ai/install.ps1 | iex

# launch unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Search for a Gemma 4 model like google/gemma-4-E2B-it

Then select any of the Gemma 4 models from the hub.

Try Gemma 4

We have shipped demos for you to try different Gemma 4 models. We include demos based on the transformers implementation for E4B, 12B Unified, 26B/A4B MoE, and 31B dense models. There's also a WebGPU demo with transformers.js 🚀

Benchmark Results

Gemma 4 models demonstrate exceptional performance across diverse benchmarks, from reasoning and coding to vision and long-context tasks. The graph below shows model performance vs size, with Gemma 4 models forming an impressive Pareto frontier:

Source: Google (blog.google)

Here are detailed benchmark results for the instruction-tuned models:

Benchmark	Gemma 4 31B	Gemma 4 26B A4B	Gemma 4 12B Unified	Gemma 4 E4B	Gemma 4 E2B	Gemma 3 27B (no think)
Reasoning & Knowledge
MMLU Pro	85.2%	82.6%	77.2%	69.4%	60.0%	67.6%
AIME 2026 no tools	89.2%	88.3%	77.5%	42.5%	37.5%	20.8%
GPQA Diamond	84.3%	82.3%	78.8%	58.6%	43.4%	42.4%
Tau2 (average over 3)	76.9%	68.2%	69.0%	42.2%	24.5%	16.2%
BigBench Extra Hard	74.4%	64.8%	53.0%	33.1%	21.9%	19.3%
MMMLU	88.4%	86.3%	83.4%	76.6%	67.4%	70.7%
Coding
LiveCodeBench v6	80.0%	77.1%	72.0%	52.0%	44.0%	29.1%
Codeforces ELO	2150	1718	1659	940	633	110
HLE no tools	19.5%	8.7%	5.2%	-	-	-
HLE with search	26.5%	17.2%	-	-	-	-
Vision
MMMU Pro	76.9%	73.8%	69.1%	52.6%	44.2%	49.7%
OmniDocBench 1.5 (edit distance)	0.131	0.149	0.164	0.181	0.290	0.365
MATH-Vision	85.6%	82.4%	79.7%	59.5%	52.4%	46.0%
MedXPertQA MM	61.3%	58.1%	48.7%	28.7%	23.5%	-
Audio
CoVoST	-	-	38.5^*	35.54	33.47	-
FLEURS (lower is better)	-	-	0.069^*	0.08	0.09	-
Long Context
MRCR v2 8 needle 128k (average)	66.4%	44.1%	43.4%	25.4%	19.1%	13.5%

^*Excluding Chinese language.

Acknowledgements

This work wouldn't have been possible without Google's extensive contribution of the model artefact and their significant effort integrating it into transformers to standardize it. The open-source ecosystem is now more complete, with a very capable, freely-licensed, open-source model. The Gemma 4 transformers integration was handled by Cyril, Raushan, Eustache, Arthur, Lysandre. We thank Joshua for the transformers.js integration and demo, Eric for mistral.rs integration, Son for Llama.cpp, Prince for MLX, Quentin, Albert and Kashif for TRL, Adarsh for SGLang transformers backend, and Toshihiro for building several demos.

Community

denizpercin

Apr 2

Thank you all for your efforts!

axlf0l333

Apr 7

Thank you, Thank you, Thank you!

siri-helixar

Apr 2

great writeup, the CARLA driving example is a nice demonstration of the agentic loop.

one gap worth flagging for anyone building on Gemma 4's function calling for real-world deployments: when the model generates a function call, there's currently no verifiable record that a human principal authorized that specific action. a compromised system prompt or injected instruction produces a call that's indistinguishable from legitimate delegation at the tool interface.

i opened a PR on the gemma-cookbook repo today that adds a drop-in HDP middleware layer to address this, sits between Gemma 4's function call output and tool execution, verifies Ed25519-signed delegation tokens, classifies actions by irreversibility (Class 0–3), and blocks unauthorized calls pre-execution.

for edge/robotics deployments (E2B on Jetson Nano / Raspberry Pi directing actuators), there's a companion HDP-P extension covering the physical layer.

PR: github.com/google-gemma/cookbook/pull/342
HDP spec: helixar.ai/about/labs/hdp
HDP-P spec: helixar.ai/about/labs/hdp-physical

alvarobartt

Article author Apr 2

cc @sergiopaniego

Akdeniz0

Apr 2

thats amazing

taeyangoh

Apr 3

This exactly fits our kid education domain! Great job and i deeply appreciate for your work! I will put it into our pipeline now!!!

urroxyz

Apr 3

•

edited Apr 3

Somewhat disappointing release, in my opinion.

However, I adore audio understanding models, so it's nice to see more of those. My favorite right now is MERALION, but it's 10B. I guess another perk would be the safety alignment. I'm sure the Gemma 4 series will be useful to some people. Not me, though...

I just wish larger tech companies would take a page out of OpenAI's book and release actually competitive OSS instead of putting out generic models just to say they support public research.

But the models are stable. That's good. Definitely more consistency and token efficiency than Qwen.

Billwer

Apr 3

appreciate that your efforts on it.

BenTouss

Apr 3

This kind of release notes are really useful, thank you

patrickdeanbrown

Apr 3

Appreciate the great intro! Thank you!

On the MLX examples, I got an error based on the model not being found. It looks like the mlx-community has a slightly different naming convention than the one used in the example.

e.g.
"mlx-community/gemma-4-26b-a4b-bf16" instead of "mlx-community/gemma-4-26B-A4B-it"

The following command worked for me:

mlx_vlm.generate \
--model "mlx-community/gemma-4-26b-a4b-it-4bit" \
--prompt "What is the capital of France?" \
--kv-bits 3.5 \
--kv-quant-scheme turboquant

ujensh

Apr 3

Hi All

Struggling to find a way to add LiteRT-LM to an iOS build. Is it not possible?

pamaru

Apr 4

The link to the mistralrs models returns 404. This one worked:

https://huggingface.co/mistralrs-community/models

pcuenq

Article author Apr 4

Thanks a lot @pamaru and @patrickdeanbrown , those links will be fixed when this PR is merged 🙌

abobusinc

Apr 6

•

edited Apr 6

I’ve made history

DorkMckork1

Apr 7

I'm impressed, extremely intelligent models, frontier-level intelligence without being resource hog like the 70+billion models. Fast, responsive, intelligent, flexible, usable on multi platform, huge leap up from Gemma 3. I can tell a huge leap up. This feels like bringing the 70+b parameter models to home consumers, that's what it feels like. Good job google, good job siding with the people instead of the big data centers for once. Good job. More of this, less data centers. As tech advances, the goal is to make it take up less space and not be as obtrusive in our lives. This achieves that. Thank you Google. For once, someone hit the ball out of the park with a home run. This equals the playing field and lets everyone who has a halfway semi decent system have access to AI. Makes it so everyone has access, not just rich people, and not just exploiters who rent out software as a service. Good job.

DorkMckork1

Apr 7

Google solves the RAM crisis by releasing frontier-ai-performance-tier Gemma 4 that takes up as much space as small language models. Quod erat demonstrandum, the other companies that were trying to scale up rather than work on better compression like Google did are now having to play catch up to Google lol.
Anyways thanks again Google

januverma

Apr 11

I did some further testing on these models for multi-modal reasoning, building on your evaluation. And wrote about them here https://januverma.substack.com/p/research-briefings-gemma-4. Solid models for building visual reasoning projects.

Enhtsetseg

Apr 17

Hello
