# Qwen-Image-2512: Pruned 50-Block Model with Weight Fusion
A depth-compressed version of Qwen/Qwen-Image-2512 (20.4B parameter text-to-image diffusion transformer). 10 of 60 transformer blocks have been removed and their learned knowledge fused into surviving neighbors via gradient-trained scalar coefficients.
## What this model is
- Base model: Qwen-Image-2512 (60 transformer blocks, 20.4B params)
- This model: 50 transformer blocks (~17B params), ~1.2x inference speedup
- Method: Sensitivity-guided block removal + FuseGPT-style weight fusion with backprop training
- Quality: +0.15 dB PSNR over cold deletion, validated on 128 diverse prompts (74% win rate)
## How it was made

### Step 1: Block sensitivity analysis
We measured the PSNR impact of removing each of the 60 blocks individually, across 8 diverse prompts (portrait, landscape, food, abstract art, etc.). This produced a per-block importance ranking averaged across prompts, not one overfit to any single image.
Blocks removed (the 10 least impactful): [5, 12, 15, 19, 50, 52, 53, 54, 55, 57]
These are scattered across the network (4 from early layers, 6 from late layers), NOT concentrated in the middle. The sensitivity analysis revealed that individually redundant blocks exist throughout the network, not just in the conventional "middle zone."
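The sweep logic above can be sketched as follows. The random matrix stands in for the real measurements (one full diffusion run per block/prompt pair), so the indices it selects are illustrative only, not the actual removal set:

```python
import numpy as np

NUM_BLOCKS, NUM_PROMPTS = 60, 8
rng = np.random.default_rng(0)

# Stand-in measurement: psnr_when_removed[b, p] = PSNR (dB) of the
# 59-block model with block b skipped, versus the full 60-block model,
# on prompt p. In practice each entry comes from a real inference run.
psnr_when_removed = rng.uniform(12.0, 20.0, size=(NUM_BLOCKS, NUM_PROMPTS))

# Average across prompts so the ranking is not overfit to a single image.
mean_psnr = psnr_when_removed.mean(axis=1)

# A higher PSNR after removal means the block mattered less, so remove
# the ten blocks whose removal hurts fidelity least.
blocks_to_remove = sorted(np.argsort(mean_psnr)[-10:].tolist())
print(blocks_to_remove)
```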
### Step 2: Weight fusion (FuseGPT-style)
Instead of simply deleting blocks (cold deletion), we fused each removed block's weights into its 4 nearest surviving neighbors (2 before + 2 after, "merge radius 2"). Each neighbor block's linear layers received the removed block's corresponding weight matrix, scaled by a learned coefficient:
```
W_neighbor_final = W_neighbor_original + coef * W_removed_block
```
The coefficients (1,120 total: one per linear layer per fusion pair) were initialized at zero (equivalent to cold deletion) and trained via backpropagation.
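A minimal sketch of such a fused layer, using hypothetical names (`FuseLinear`, `donor_weight`) rather than the actual implementation:

```python
import torch
import torch.nn as nn

class FuseLinear(nn.Module):
    """Linear layer with a frozen base weight, a frozen donor weight taken
    from a removed block, and one trainable scalar coefficient.

    coef starts at zero, so the layer is initially identical to cold
    deletion and only absorbs the donor as training moves coef away from
    zero.
    """
    def __init__(self, base: nn.Linear, donor_weight: torch.Tensor):
        super().__init__()
        self.weight = nn.Parameter(base.weight.detach().clone(),
                                   requires_grad=False)
        self.bias = (nn.Parameter(base.bias.detach().clone(),
                                  requires_grad=False)
                     if base.bias is not None else None)
        self.register_buffer("donor_weight", donor_weight.detach().clone())
        self.coef = nn.Parameter(torch.zeros(()))  # the only trainable scalar

    def forward(self, x):
        # W_neighbor_final = W_neighbor_original + coef * W_removed_block
        w = self.weight + self.coef * self.donor_weight
        return nn.functional.linear(x, w, self.bias)
```

With `coef == 0` the layer behaves exactly like the original neighbor, which is why zero initialization reproduces cold deletion as the starting point.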
### Step 3: Coefficient training
Training used pre-computed teacher latent pairs from the full 60-block model:
- Data: ~40,000 (noisy_latent, denoised_target) pairs across 30 diffusion timesteps
- Loss: MSE between the fused model's denoised output and the full model's denoised output
- Optimizer: Adam, lr=5e-4, 133 gradient steps (0.5 data passes)
- Trainable parameters: Only the 1,120 fusion coefficients (all 17B base weights frozen)
- Training time: ~2 hours on 1x NVIDIA B200
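The training loop reduces to a short distillation-style sketch. The function name `train_coefficients` and the convention that trainable scalars are named `coef` are assumptions for illustration, not the actual training script:

```python
import torch

def train_coefficients(model, batches, steps=133, lr=5e-4):
    """Train only the fusion coefficients against pre-computed teacher pairs.

    `batches` must yield at least `steps` (noisy_latent, denoised_target)
    pairs; every parameter except the scalar `coef`s stays frozen.
    """
    coefs = [p for n, p in model.named_parameters() if n.endswith("coef")]
    opt = torch.optim.Adam(coefs, lr=lr)
    it = iter(batches)
    for _ in range(steps):
        noisy, target = next(it)
        # MSE between the pruned model's output and the 60-block teacher's
        loss = torch.nn.functional.mse_loss(model(noisy), target)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model
```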
### Step 4: Weight baking

After training, the FuseLinear layers were collapsed back into standard nn.Linear by computing the final weight: `W_final = W_base + coef * W_removed`. The resulting model is architecturally identical to a standard 50-block Qwen-Image transformer, with no custom layers or runtime overhead.
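The baking step amounts to evaluating that formula once per layer. A sketch, assuming the fused layer exposes `weight` (base), `bias`, `donor_weight`, and the trained scalar `coef` (hypothetical attribute names):

```python
import torch
import torch.nn as nn

@torch.no_grad()
def bake(fused):
    """Collapse a fused layer back into a plain nn.Linear by applying
    W_final = W_base + coef * W_removed once, offline."""
    out_f, in_f = fused.weight.shape
    baked = nn.Linear(in_f, out_f, bias=fused.bias is not None)
    baked.weight.copy_(fused.weight + fused.coef * fused.donor_weight)
    if fused.bias is not None:
        baked.bias.copy_(fused.bias)
    return baked
```

Because the result is an ordinary `nn.Linear`, the baked checkpoint loads into a standard 50-block transformer with no custom code.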
## Results

### Validation on 128 unseen prompts (Gustavosta/Stable-Diffusion-Prompts)
| Method | Mean PSNR | vs Deletion | Win Rate |
|---|---|---|---|
| Full model (60 blocks) | baseline | – | – |
| Cold deletion (50 blocks) | 15.59 dB | – | – |
| This model (fusion-trained) | 15.74 dB | +0.15 dB | 74.2% |
### Per-prompt results on 8 diverse test prompts

| Prompt | Deletion (dB) | This Model (dB) | Delta (dB) |
|---|---|---|---|
| interior | 16.27 | 16.64 | +0.37 |
| city | 16.16 | 16.42 | +0.26 |
| abstract | 16.25 | 16.43 | +0.18 |
| landscape | 14.89 | 15.04 | +0.14 |
| forest | 18.62 | 18.74 | +0.13 |
| food | 14.98 | 15.10 | +0.13 |
| animal | 16.92 | 16.99 | +0.07 |
| portrait | 18.22 | 17.97 | -0.25 |
| Mean | 16.54 | 16.67 | +0.13 |
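The tables report PSNR of each pruned model's output against the full 60-block model's output for the same prompt and seed. The metric itself is standard; for 8-bit images it can be computed as:

```python
import numpy as np

def psnr(img_a: np.ndarray, img_b: np.ndarray, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two same-shape images."""
    mse = np.mean((img_a.astype(np.float64) - img_b.astype(np.float64)) ** 2)
    if mse == 0.0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)
```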
## Usage

```python
from diffusers import DiffusionPipeline
import torch

pipe = DiffusionPipeline.from_pretrained(
    "baseten-admin/Qwen-Image-2512-Pruned-50blocks",
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")

image = pipe(
    prompt="A photorealistic landscape of mountains at sunset",
    negative_prompt="low quality, blurry",
    num_inference_steps=30,
    true_cfg_scale=4.0,
    height=1024,
    width=1024,
).images[0]
image.save("output.png")
```
## Method details
This work combines ideas from:
- FuseGPT (github.com/JarvisPei/FuseGPT): Learnable fusion coefficients that start at zero and are trained to absorb removed layers' knowledge into neighbors
- FlattenGPT (Xu et al., 2026, arXiv:2602.08858): Theoretical analysis of cross-layer redundancy in transformers and the insight that preserving removed block parameters outperforms discarding them
- Multi-prompt sensitivity analysis: Data-driven block importance ranking across diverse prompts to avoid overfitting to a single test image
**Key finding:** The conventional approach of removing blocks from the "middle zone" of transformers is suboptimal. Per-block sensitivity analysis reveals that the safest blocks to remove are scattered throughout the network, with some early-layer and late-layer blocks being more redundant than any middle-layer block.
## Limitations
- The model was trained to match the full model's outputs, not to maximize perceptual quality directly
- Portrait-style prompts showed slight regression (-0.25 dB), likely due to training data distribution
- Further quality gains are possible with full fine-tuning (unfreezing all parameters) or LoRA on top of this checkpoint
## Training infrastructure
- Hardware: 8x NVIDIA B200 (183GB each)
- Total experiment time: ~12 hours (sensitivity sweeps, ablations, fusion training, validation)
- Total inference runs: ~1,500+ across all experiments