Gemma-4-E6B-IT-raw

⚠️ RAW / UNHEALED depth-upscale. This model degenerates (loops) under normal decoding and is not usable as-is. It is published as a base artifact for a healing train and for reproducibility of the E4B depth-upscale research. Do not deploy it raw.

What it is

A depth-upscaled (42 → 66 transformer layers, ~11.9B params) passthrough frankenmerge of google/gemma-4-E4B-it onto itself ("IT + IT"). It is the "E6B" depth target — a deeper E4B — in its pre-heal state. No other model is mixed in; every layer comes from gemma-4-E4B-it.

How it was made

Gemma-4 E4B uses Per-Layer Embeddings (PLE) — each layer is injected with a depth-specific, token-derived signal (embed_tokens_per_layer + per_layer_model_projection). Generic passthrough mergers leave those two global tensors at their original 42-layer width, which breaks a stacked model, so this was assembled with a custom PLE-aware stacker.

The 42 IT layers were re-sequenced into 66 output slots via four overlapping slices (the all-IT version of the project's "e6b_v1" topology — same slice structure, every slice sourced from IT):

output slots  0–17 : IT layers  0–17     (forward)
output slots 18–33 : IT layers 10–25     (rewind/overlap to IT layer 10)
output slots 34–49 : IT layers 18–33     (rewind/overlap to IT layer 18)
output slots 50–65 : IT layers 26–41     (rewind/overlap to IT layer 26)
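
The slice table can be expanded into an explicit slot-to-source map. This is a minimal sketch, not the project's stacker: SLICES and build_source_map are hypothetical names, and only the slice boundaries come from the table above.

```python
# Each entry: (first output slot, first IT source layer, slice length).
# Boundaries are taken from the slice table; the names are illustrative.
SLICES = [
    (0, 0, 18),    # slots  0-17 <- IT layers  0-17
    (18, 10, 16),  # slots 18-33 <- IT layers 10-25
    (34, 18, 16),  # slots 34-49 <- IT layers 18-33
    (50, 26, 16),  # slots 50-65 <- IT layers 26-41
]


def build_source_map(slices=SLICES):
    """Return a list where entry i is the IT layer that fills output slot i."""
    source = {}
    for out_start, src_start, length in slices:
        for k in range(length):
            slot = out_start + k
            if slot in source:
                raise ValueError(f"output slot {slot} assigned twice")
            source[slot] = src_start + k
    # Fail loudly if the slices leave a gap in the output slots.
    if sorted(source) != list(range(len(source))):
        raise ValueError("output slots are not contiguous from 0")
    return [source[i] for i in range(len(source))]


source_map = build_source_map()
```

Every IT layer appears at least once in source_map, and IT layers 10 through 33 appear twice, which accounts for the 24 added layers.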

For each output layer i (sourced from IT layer j) the stacker:

  • copies all model.language_model.layers.{j}.* weights (including the per-layer PLE riders);
  • remaps the global PLE tensors embed_tokens_per_layer[:, j·256:(j+1)·256] → [:, i·256:(i+1)·256] and per_layer_model_projection[j·256:(j+1)·256, :] → [i·256:(i+1)·256, :], widening the PLE table from 42×256 to 66×256;
  • copies the trunk (token embeddings, final norm, vision/audio towers) verbatim from IT;
  • sets num_hidden_layers = 66, num_kv_shared_layers = 26 (trailing shared region recomputed so each shared layer borrows KV from the last non-shared layer of its attention type), and rebuilds layer_types to keep E4B's 5-sliding : 1-full rhythm.
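
The PLE remap and the config rebuild described in the list above can be sketched as follows. This is an illustration under assumptions, not the actual stacker: NumPy arrays stand in for the checkpoint tensors, all function names are hypothetical, and the "sliding_attention" / "full_attention" strings and the placement of the full-attention layer at every sixth position are assumed rather than taken from this card.

```python
import numpy as np

PLE_DIM = 256  # per-layer embedding width, from the text


def remap_ple(embed_per_layer, projection, source_map, ple_dim=PLE_DIM):
    """Widen the global PLE tensors so output slot i reuses source layer j's block.

    embed_per_layer: array of shape (vocab, n_src_layers * ple_dim)
    projection:      array of shape (n_src_layers * ple_dim, hidden)
    source_map:      source_map[i] = j, the source layer feeding output slot i
    """
    idx = np.concatenate(
        [np.arange(j * ple_dim, (j + 1) * ple_dim) for j in source_map]
    )
    # Columns of the embedding table and rows of the projection are gathered
    # with the same index, so slot i's block in both tensors comes from layer j.
    return embed_per_layer[:, idx], projection[idx, :]


def build_layer_types(num_layers, period=6):
    """Assumed 5-sliding : 1-full rhythm, full attention at every sixth layer."""
    return [
        "full_attention" if (i + 1) % period == 0 else "sliding_attention"
        for i in range(num_layers)
    ]


def kv_donors(layer_types, num_kv_shared):
    """Map each trailing shared layer to the last non-shared layer of its type."""
    first_shared = len(layer_types) - num_kv_shared
    return {
        i: max(k for k in range(first_shared) if layer_types[k] == layer_types[i])
        for i in range(first_shared, len(layer_types))
    }
```

Under these assumptions, a 66-layer model with 26 shared layers has 40 non-shared layers (indices 0 to 39), and every shared layer borrows KV from one of two donors: the last sliding layer or the last full-attention layer in that range.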

Parameter breakdown (~11.9B total): transformer layers 6.65B (56%), Per-Layer-Embedding table 4.43B (37%), token embeddings 0.67B (6%), the rest PLE projection + vision/audio towers. Note that more than 40% of the mass sits outside the transformer layers. The token embeddings and towers scale with vocabulary and modality, not depth, but the PLE table scales with both vocabulary and layer count (it was widened from 42×256 to 66×256), so the depth-upscale's headline size comes from both the 24 added layers and the widened PLE table.
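
How the added mass splits between layers and PLE table can be estimated from the figures above. The sketch below assumes all 66 layers are the same size and that the PLE table grows linearly with layer count; both are assumptions for a rough estimate, and the variable names are illustrative.

```python
# Figures from the parameter breakdown, in billions of parameters.
TOTAL_B = 11.9
LAYERS_B = 6.65
PLE_TABLE_B = 4.43
TOKEN_EMBED_B = 0.67
SRC_LAYERS, OUT_LAYERS = 42, 66


def share(part_b, total_b=TOTAL_B):
    """Percentage of the total taken by one component."""
    return 100.0 * part_b / total_b


# Fraction of the depth-scaled components contributed by the 24 added slots.
added_fraction = (OUT_LAYERS - SRC_LAYERS) / OUT_LAYERS

added_layers_b = LAYERS_B * added_fraction  # growth from the duplicated layers
added_ple_b = PLE_TABLE_B * added_fraction  # growth from the widened PLE table
rest_b = TOTAL_B - (LAYERS_B + PLE_TABLE_B + TOKEN_EMBED_B)  # projection + towers

print(f"layers {share(LAYERS_B):.0f}%, PLE table {share(PLE_TABLE_B):.0f}%, "
      f"token embeddings {share(TOKEN_EMBED_B):.0f}%")
print(f"added by layers ~{added_layers_b:.2f}B, added by PLE ~{added_ple_b:.2f}B, "
      f"rest ~{rest_b:.2f}B")
```

Under these assumptions, roughly 2.4B of the growth comes from the added layers and roughly 1.6B from the wider PLE table.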

Why it is "raw" (what's wrong with it)

E4B's per-layer embeddings are welded to each layer's weights, and re-running layers at a different depth (the three rewind/overlap seams) pushes the residual stream off-distribution. The output is grammatical but loops under greedy decoding. A broad set of no-train fixes (PLE re-indexing by destination, zeroing duplicated PLE, residual-write scaling, brand-new interpolated layers, donor swaps) was tested, and none recovered coherence. The break is structural, not something a weight-shuffle can fix; new depth has to be learned. (This 66-layer build has more rewind seams than a minimal single-seam stack, so expect it to need a somewhat heavier heal.)
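
One simple way to quantify the looping is the fraction of repeated n-grams in a generated token sequence. The sketch below is a generic degeneration metric, not the evaluation used in this project; the function name and the choice of n = 4 are illustrative.

```python
def repeated_ngram_fraction(token_ids, n=4):
    """Fraction of n-grams that duplicate an earlier n-gram in the sequence.

    Returns 0.0 for a sequence with no repeated n-gram and approaches 1.0
    for a sequence that cycles through a short loop.
    """
    ngrams = [tuple(token_ids[i:i + n]) for i in range(len(token_ids) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)


looping = [7, 8, 9] * 20       # a three-token cycle, as a degenerate decode might emit
healthy = list(range(60))      # no repeated 4-gram at all

print(repeated_ngram_fraction(looping))
print(repeated_ngram_fraction(healthy))
```

Tracking this number on greedy decodes before and after a heal gives a cheap signal of whether the loops are going away.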

How to make it usable

A heal (LoRA SFT/CPT) re-knits it into a coherent deeper model. A Fisher/subspace-protected heal — protecting the high-importance instruction directions while re-knitting the off-distribution layers — is the intended next step, to recover coherence without sacrificing instruction-following.
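
As an illustration of what "subspace-protected" could mean, the sketch below removes from a weight update any component that lies inside a set of protected directions. This is a generic orthogonal-projection technique and an assumption about the approach, not the project's heal recipe; how the protected directions would be chosen (for example from Fisher information) is not covered here, and the function name is hypothetical.

```python
import numpy as np


def project_out(update, protected):
    """Remove from `update` its component inside span(columns of `protected`).

    update:    array of shape (d,) or (d, m), a proposed weight change
    protected: array of shape (d, k), directions the update must not touch
               (assumed to have linearly independent columns)
    """
    # Reduced QR gives an orthonormal basis q of shape (d, k) for the span.
    q, _ = np.linalg.qr(protected)
    return update - q @ (q.T @ update)


rng = np.random.default_rng(0)
protected = rng.standard_normal((8, 2))  # two protected directions in 8-D
update = rng.standard_normal(8)          # an arbitrary proposed update

safe_update = project_out(update, protected)
# The filtered update has no component along any protected direction.
print(np.allclose(protected.T @ safe_update, 0.0))
```

Applying such a filter to each training step would leave the protected directions unchanged while still letting the rest of the weights move.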

Intended use

  • Base for a healing train (the primary purpose).
  • Reproducibility / study of Gemma-4 depth-upscaling and its Per-Layer-Embedding constraints.

Built 2026-06-01 as part of the E4B → E6B depth-upscale research.
