Qwen-Image-2512 β€” Pruned 50-Block Model with Weight Fusion

A depth-compressed version of Qwen/Qwen-Image-2512, a 20.4B-parameter text-to-image diffusion transformer. Ten of its 60 transformer blocks have been removed, and their learned knowledge fused into the surviving neighboring blocks via gradient-trained scalar coefficients.

What this model is

  • Base model: Qwen-Image-2512 (60 transformer blocks, 20.4B params)
  • This model: 50 transformer blocks (~17B params), ~1.2x inference speedup
  • Method: Sensitivity-guided block removal + FuseGPT-style weight fusion with backprop training
  • Quality: +0.15 dB PSNR over cold deletion, validated on 128 diverse prompts (74% win rate)

How it was made

Step 1: Block sensitivity analysis

We measured the PSNR impact of removing each of the 60 blocks individually, across 8 diverse prompts (portrait, landscape, food, abstract art, etc.). This produced a per-block importance ranking averaged across prompts β€” not overfit to any single image.

Blocks removed (the 10 least impactful): [5, 12, 15, 19, 50, 52, 53, 54, 55, 57]

These are scattered across the network (4 from early layers, 6 from late layers), NOT concentrated in the middle. The sensitivity analysis revealed that individually redundant blocks exist throughout the network, not just in the conventional "middle zone."
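The sweep above amounts to a simple loop: regenerate each prompt with one block skipped, score the result against the full model's output, and average across prompts. A minimal sketch, where `psnr_after_removal` is a hypothetical helper standing in for a full pipeline run with the given block bypassed (not code from this repo):

```python
def rank_blocks_by_sensitivity(blocks, prompts, psnr_after_removal):
    """Rank blocks from safest-to-remove to most important.

    `psnr_after_removal(block_index, prompt)` is a hypothetical helper that
    regenerates `prompt` with block `block_index` skipped and returns the
    PSNR of the result against the full model's output for a fixed seed.
    """
    scores = {}
    for i in range(len(blocks)):
        per_prompt = [psnr_after_removal(i, p) for p in prompts]
        scores[i] = sum(per_prompt) / len(per_prompt)  # mean over prompts

    # Higher PSNR after removal => removing the block hurts less => safer
    # to prune. Sort descending and take the first k as removal candidates.
    order = sorted(scores, key=scores.get, reverse=True)
    return order, scores
```

Taking `sorted(order[:10])` then yields the removal list, averaged over prompts rather than tuned to any single image.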

Step 2: Weight fusion (FuseGPT-style)

Instead of simply deleting blocks (cold deletion), we fused each removed block's weights into its 4 nearest surviving neighbors (2 before + 2 after, "merge radius 2"). Each neighbor block's linear layers received the removed block's corresponding weight matrix, scaled by a learned coefficient:

W_neighbor_final = W_neighbor_original + coef * W_removed_block

The coefficients (1,120 total β€” one per linear layer per fusion pair) were initialized at zero (equivalent to cold deletion) and trained via backpropagation.
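The per-layer fusion can be pictured as a wrapper around each surviving linear layer. A minimal sketch, assuming a `FuseLinear` module holding a frozen copy of the removed block's matching weight matrix plus one trainable scalar (class and attribute names are illustrative, not this repo's actual code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FuseLinear(nn.Module):
    """Linear layer computing x @ (W_base + coef * W_removed)^T + b_base."""

    def __init__(self, base: nn.Linear, removed: nn.Linear):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)  # base weights stay frozen
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        # Frozen copy of the removed block's corresponding weight matrix.
        self.register_buffer("w_removed", removed.weight.detach().clone())
        # Initialized at zero, so the first forward pass is exactly cold deletion.
        self.coef = nn.Parameter(torch.zeros(()))

    def forward(self, x):
        w = self.base.weight + self.coef * self.w_removed
        return F.linear(x, w, self.base.bias)
```

With `coef = 0` the layer is indistinguishable from plain deletion; training only has to move 1,120 scalars away from that starting point.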

Step 3: Coefficient training

Training used pre-computed teacher latent pairs from the full 60-block model:

  • Data: ~40,000 (noisy_latent, denoised_target) pairs across 30 diffusion timesteps
  • Loss: MSE between the fused model's denoised output and the full model's denoised output
  • Optimizer: Adam, lr=5e-4, 133 gradient steps (0.5 data passes)
  • Trainable parameters: Only the 1,120 fusion coefficients (all 17B base weights frozen)
  • Training time: ~2 hours on 1x NVIDIA B200
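The distillation loop above can be sketched as follows, assuming a `loader` that yields the pre-computed (noisy latent, timestep, teacher output) tuples and a `coefs` list holding the scalar fusion parameters; helper names are assumptions, not this repo's code:

```python
import torch

def train_fusion_coefs(model, coefs, loader, steps=133, lr=5e-4):
    """Train only the fusion coefficients against pre-computed teacher pairs.

    `model(noisy, t)` is the 50-block fused student; `loader` yields
    (noisy_latent, timestep, teacher_denoised) tuples dumped from the full
    60-block model. All base weights are assumed frozen; `coefs` are the
    only parameters the optimizer sees.
    """
    opt = torch.optim.Adam(coefs, lr=lr)
    step = 0
    while step < steps:
        for noisy, t, target in loader:
            pred = model(noisy, t)
            loss = torch.nn.functional.mse_loss(pred, target)
            opt.zero_grad()
            loss.backward()
            opt.step()
            step += 1
            if step >= steps:
                break
    return model
```

Because only ~1k scalars receive gradients, the memory and compute cost is a small fraction of full fine-tuning, which is why 133 steps on a single GPU suffice.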

Step 4: Weight baking

After training, the FuseLinear layers were collapsed back into standard nn.Linear by computing the final weight: W_final = W_base + coef * W_removed. The resulting model is architecturally identical to a standard 50-block Qwen-Image transformer β€” no custom layers or runtime overhead.
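The baking step can be sketched as below, assuming each fused layer exposes `base` (an `nn.Linear`), `w_removed` (the frozen copied weight), and `coef` (the trained scalar); attribute names are illustrative:

```python
import torch
import torch.nn as nn

@torch.no_grad()
def bake_to_linear(fused) -> nn.Linear:
    """Collapse a trained fused layer back into a standard nn.Linear."""
    base = fused.base
    out_f, in_f = base.weight.shape
    linear = nn.Linear(in_f, out_f, bias=base.bias is not None)
    # W_final = W_base + coef * W_removed
    linear.weight.copy_(base.weight + fused.coef * fused.w_removed)
    if base.bias is not None:
        linear.bias.copy_(base.bias)
    return linear
```

After baking every layer this way, the checkpoint contains only standard modules, so it loads with the stock pipeline and runs with no extra overhead.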

Results

Validation on 128 unseen prompts (Gustavosta/Stable-Diffusion-Prompts)

| Method | Mean PSNR | vs. Deletion | Win Rate |
| --- | --- | --- | --- |
| Full model (60 blocks) | baseline | — | — |
| Cold deletion (50 blocks) | 15.59 dB | — | — |
| This model (fusion-trained) | 15.74 dB | +0.15 dB | 74.2% |
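For reference, PSNR between two decoded outputs can be computed as below; the exact data range and normalization used for these tables is an assumption (here 2.0, i.e. images in [-1, 1]):

```python
import torch

def psnr(a: torch.Tensor, b: torch.Tensor, data_range: float = 2.0) -> float:
    """Peak signal-to-noise ratio in dB between two same-shaped tensors.

    `data_range` is the peak-to-peak signal range: 2.0 for images scaled to
    [-1, 1], 1.0 for [0, 1], 255.0 for uint8 pixels.
    """
    mse = torch.mean((a.float() - b.float()) ** 2)
    return float(10.0 * torch.log10(data_range ** 2 / mse))
```

Higher is better; a +0.15 dB mean gain over cold deletion at identical architecture and cost comes purely from the trained fusion coefficients.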

Per-prompt results on 8 diverse test prompts

| Prompt | Deletion (dB) | This model (dB) | Delta (dB) |
| --- | --- | --- | --- |
| interior | 16.27 | 16.64 | +0.37 |
| city | 16.16 | 16.42 | +0.26 |
| abstract | 16.25 | 16.43 | +0.18 |
| landscape | 14.89 | 15.04 | +0.14 |
| forest | 18.62 | 18.74 | +0.13 |
| food | 14.98 | 15.10 | +0.13 |
| animal | 16.92 | 16.99 | +0.07 |
| portrait | 18.22 | 17.97 | -0.25 |
| Mean | 16.54 | 16.67 | +0.13 |

Usage

from diffusers import DiffusionPipeline
import torch

# The baked checkpoint is architecturally a standard 50-block Qwen-Image
# transformer, so the stock pipeline loads it without custom code.
pipe = DiffusionPipeline.from_pretrained(
    "baseten-admin/Qwen-Image-2512-Pruned-50blocks",
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")

image = pipe(
    prompt="A photorealistic landscape of mountains at sunset",
    negative_prompt="low quality, blurry",
    num_inference_steps=30,
    true_cfg_scale=4.0,  # Qwen-Image uses true CFG with the negative prompt
    height=1024,
    width=1024,
).images[0]
image.save("output.png")

Method details

This work combines ideas from:

  • FuseGPT (github.com/JarvisPei/FuseGPT): Learnable fusion coefficients that start at zero and are trained to absorb removed layers' knowledge into neighbors
  • FlattenGPT (Xu et al., 2026, arXiv:2602.08858): Theoretical analysis of cross-layer redundancy in transformers and the insight that preserving removed block parameters outperforms discarding them
  • Multi-prompt sensitivity analysis: Data-driven block importance ranking across diverse prompts to avoid overfitting to a single test image

Key finding: The conventional approach of removing blocks from the "middle zone" of transformers is suboptimal. Per-block sensitivity analysis reveals that the safest blocks to remove are scattered throughout the network, with some early-layer and late-layer blocks being more redundant than any middle-layer block.

Limitations

  • The model was trained to match the full model's outputs, not to maximize perceptual quality directly
  • Portrait-style prompts showed slight regression (-0.25 dB), likely due to training data distribution
  • Further quality gains are possible with full fine-tuning (unfreezing all parameters) or LoRA on top of this checkpoint

Training infrastructure

  • Hardware: 8x NVIDIA B200 (183GB each)
  • Total experiment time: ~12 hours (sensitivity sweeps, ablations, fusion training, validation)
  • Total inference runs: ~1,500+ across all experiments