# Qwen-Image-2512: Pruned 50-Block Model with Weight Fusion
A depth-compressed version of Qwen/Qwen-Image-2512 (20.4B parameter text-to-image diffusion transformer). 10 of 60 transformer blocks have been removed and their learned knowledge fused into surviving neighbors via gradient-trained scalar coefficients.
## What this model is
- Base model: Qwen-Image-2512 (60 transformer blocks, 20.4B params)
- This model: 50 transformer blocks (~17B params), ~1.2x inference speedup
- Method: Sensitivity-guided block removal + FuseGPT-style weight fusion with backprop training
- Quality: +0.15 dB PSNR over cold deletion, validated on 128 diverse prompts (74% win rate)
## How it was made

### Step 1: Block sensitivity analysis
We measured the PSNR impact of removing each of the 60 blocks individually, across 8 diverse prompts (portrait, landscape, food, abstract art, etc.). This produced a per-block importance ranking averaged across prompts, not one overfit to any single image.
Blocks removed (the 10 least impactful): [5, 12, 15, 19, 50, 52, 53, 54, 55, 57]
These are scattered across the network (4 from early layers, 6 from late layers), NOT concentrated in the middle. The sensitivity analysis revealed that individually redundant blocks exist throughout the network, not just in the conventional "middle zone."
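The sweep logic above can be sketched as follows. The random matrix stands in for the real measurements (one full diffusion run per block/prompt pair), so the indices it selects are illustrative only, not the actual removal set:

```python
import numpy as np

NUM_BLOCKS, NUM_PROMPTS = 60, 8
rng = np.random.default_rng(0)

# Stand-in measurement: psnr_when_removed[b, p] = PSNR (dB) of the
# 59-block model with block b skipped, versus the full 60-block model,
# on prompt p. In practice each entry comes from a real inference run.
psnr_when_removed = rng.uniform(12.0, 20.0, size=(NUM_BLOCKS, NUM_PROMPTS))

# Average across prompts so the ranking is not overfit to a single image.
mean_psnr = psnr_when_removed.mean(axis=1)

# A higher PSNR after removal means the block mattered less, so remove
# the ten blocks whose removal hurts fidelity least.
blocks_to_remove = sorted(np.argsort(mean_psnr)[-10:].tolist())
print(blocks_to_remove)
```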
### Step 2: Weight fusion (FuseGPT-style)
Instead of simply deleting blocks (cold deletion), we fused each removed block's weights into its 4 nearest surviving neighbors (2 before + 2 after, "merge radius 2"). Each neighbor block's linear layers received the removed block's corresponding weight matrix, scaled by a learned coefficient:
```
W_neighbor_final = W_neighbor_original + coef * W_removed_block
```
The coefficients (1,120 total: one per linear layer per fusion pair) were initialized at zero (equivalent to cold deletion) and trained via backpropagation.
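A minimal sketch of such a fused layer, using hypothetical names (`FuseLinear`, `donor_weight`) rather than the actual implementation:

```python
import torch
import torch.nn as nn

class FuseLinear(nn.Module):
    """Linear layer with a frozen base weight, a frozen donor weight taken
    from a removed block, and one trainable scalar coefficient.

    coef starts at zero, so the layer is initially identical to cold
    deletion and only absorbs the donor as training moves coef away from
    zero.
    """
    def __init__(self, base: nn.Linear, donor_weight: torch.Tensor):
        super().__init__()
        self.weight = nn.Parameter(base.weight.detach().clone(),
                                   requires_grad=False)
        self.bias = (nn.Parameter(base.bias.detach().clone(),
                                  requires_grad=False)
                     if base.bias is not None else None)
        self.register_buffer("donor_weight", donor_weight.detach().clone())
        self.coef = nn.Parameter(torch.zeros(()))  # the only trainable scalar

    def forward(self, x):
        # W_neighbor_final = W_neighbor_original + coef * W_removed_block
        w = self.weight + self.coef * self.donor_weight
        return nn.functional.linear(x, w, self.bias)
```

With `coef == 0` the layer behaves exactly like the original neighbor, which is why zero initialization reproduces cold deletion as the starting point.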
### Step 3: Coefficient training
Training used pre-computed teacher latent pairs from the full 60-block model:
- Data: ~40,000 (noisy_latent, denoised_target) pairs across 30 diffusion timesteps
- Loss: MSE between the fused model's denoised output and the full model's denoised output
- Optimizer: Adam, lr=5e-4, 133 gradient steps (0.5 data passes)
- Trainable parameters: Only the 1,120 fusion coefficients (all 17B base weights frozen)
- Training time: ~2 hours on 1x NVIDIA B200
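The training loop reduces to a short distillation-style sketch. The function name `train_coefficients` and the convention that trainable scalars are named `coef` are assumptions for illustration, not the actual training script:

```python
import torch

def train_coefficients(model, batches, steps=133, lr=5e-4):
    """Train only the fusion coefficients against pre-computed teacher pairs.

    `batches` must yield at least `steps` (noisy_latent, denoised_target)
    pairs; every parameter except the scalar `coef`s stays frozen.
    """
    coefs = [p for n, p in model.named_parameters() if n.endswith("coef")]
    opt = torch.optim.Adam(coefs, lr=lr)
    it = iter(batches)
    for _ in range(steps):
        noisy, target = next(it)
        # MSE between the pruned model's output and the 60-block teacher's
        loss = torch.nn.functional.mse_loss(model(noisy), target)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model
```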
### Step 4: Weight baking

After training, the FuseLinear layers were collapsed back into standard nn.Linear by computing the final weight: `W_final = W_base + coef * W_removed`. The resulting model is architecturally identical to a standard 50-block Qwen-Image transformer, with no custom layers or runtime overhead.
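The baking step amounts to evaluating that formula once per layer. A sketch, assuming the fused layer exposes `weight` (base), `bias`, `donor_weight`, and the trained scalar `coef` (hypothetical attribute names):

```python
import torch
import torch.nn as nn

@torch.no_grad()
def bake(fused):
    """Collapse a fused layer back into a plain nn.Linear by applying
    W_final = W_base + coef * W_removed once, offline."""
    out_f, in_f = fused.weight.shape
    baked = nn.Linear(in_f, out_f, bias=fused.bias is not None)
    baked.weight.copy_(fused.weight + fused.coef * fused.donor_weight)
    if fused.bias is not None:
        baked.bias.copy_(fused.bias)
    return baked
```

Because the result is an ordinary `nn.Linear`, the baked checkpoint loads into a standard 50-block transformer with no custom code.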
## Results

### Validation on 128 unseen prompts (Gustavosta/Stable-Diffusion-Prompts)
| Method | Mean PSNR | vs Deletion | Win Rate |
|---|---|---|---|
| Full model (60 blocks) | baseline | – | – |
| Cold deletion (50 blocks) | 15.59 dB | – | – |
| This model (fusion-trained) | 15.74 dB | +0.15 dB | 74.2% |
### Per-prompt results on 8 diverse test prompts

| Prompt | Deletion (dB) | This Model (dB) | Delta (dB) |
|---|---|---|---|
| interior | 16.27 | 16.64 | +0.37 |
| city | 16.16 | 16.42 | +0.26 |
| abstract | 16.25 | 16.43 | +0.18 |
| landscape | 14.89 | 15.04 | +0.14 |
| forest | 18.62 | 18.74 | +0.13 |
| food | 14.98 | 15.10 | +0.13 |
| animal | 16.92 | 16.99 | +0.07 |
| portrait | 18.22 | 17.97 | -0.25 |
| Mean | 16.54 | 16.67 | +0.13 |
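The tables report PSNR of each pruned model's output against the full 60-block model's output for the same prompt and seed. The metric itself is standard; for 8-bit images it can be computed as:

```python
import numpy as np

def psnr(img_a: np.ndarray, img_b: np.ndarray, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two same-shape images."""
    mse = np.mean((img_a.astype(np.float64) - img_b.astype(np.float64)) ** 2)
    if mse == 0.0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)
```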
## Usage

```python
from diffusers import DiffusionPipeline
import torch

pipe = DiffusionPipeline.from_pretrained(
    "baseten-admin/Qwen-Image-2512-Pruned-50blocks",
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")

image = pipe(
    prompt="A photorealistic landscape of mountains at sunset",
    negative_prompt="low quality, blurry",
    num_inference_steps=30,
    true_cfg_scale=4.0,
    height=1024,
    width=1024,
).images[0]
image.save("output.png")
```
## Method details
This work combines ideas from:
- FuseGPT (github.com/JarvisPei/FuseGPT): Learnable fusion coefficients that start at zero and are trained to absorb removed layers' knowledge into neighbors
- FlattenGPT (Xu et al., 2026, arXiv:2602.08858): Theoretical analysis of cross-layer redundancy in transformers and the insight that preserving removed block parameters outperforms discarding them
- Multi-prompt sensitivity analysis: Data-driven block importance ranking across diverse prompts to avoid overfitting to a single test image
**Key finding:** The conventional approach of removing blocks from the "middle zone" of transformers is suboptimal. Per-block sensitivity analysis reveals that the safest blocks to remove are scattered throughout the network, with some early-layer and late-layer blocks being more redundant than any middle-layer block.
## Limitations
- The model was trained to match the full model's outputs, not to maximize perceptual quality directly
- Portrait-style prompts showed slight regression (-0.25 dB), likely due to training data distribution
- Further quality gains are possible with full fine-tuning (unfreezing all parameters) or LoRA on top of this checkpoint
## Training infrastructure
- Hardware: 8x NVIDIA B200 (183GB each)
- Total experiment time: ~12 hours (sensitivity sweeps, ablations, fusion training, validation)
- Total inference runs: ~1,500+ across all experiments