How to use from SGLang
Install from pip and serve model
# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "rpDungeon/Gemma-4-E6B-IT-raw" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "rpDungeon/Gemma-4-E6B-IT-raw",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
Use Docker images
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "rpDungeon/Gemma-4-E6B-IT-raw" \
        --host 0.0.0.0 \
        --port 30000

Gemma-4-E6B-IT-raw

⚠️ RAW / UNHEALED depth-upscale. This model degenerates (loops) under normal decoding and is not usable as-is. It is published as a base artifact for a healing train and for reproducibility of the E4B depth-upscale research. Do not deploy it raw.

What it is

A depth-upscaled (42 → 66 transformer layers, ~11.9B params) passthrough frankenmerge of google/gemma-4-E4B-it onto itself ("IT + IT"). It is the "E6B" depth target — a deeper E4B — in its pre-heal state. No other model is mixed in; every layer comes from gemma-4-E4B-it.

How it was made

Gemma-4 E4B uses Per-Layer Embeddings (PLE) — each layer is injected with a depth-specific, token-derived signal (embed_tokens_per_layer + per_layer_model_projection). Generic passthrough mergers leave those two global tensors at their original 42-layer width, which breaks a stacked model, so this was assembled with a custom PLE-aware stacker.

The 42 IT layers were re-sequenced into 66 output slots via four overlapping slices (the all-IT version of the project's "e6b_v1" topology — same slice structure, every slice sourced from IT):

output slots  0–17 : IT layers  0–17     (forward)
output slots 18–33 : IT layers 10–25     (rewind/overlap to L10)
output slots 34–49 : IT layers 18–33     (rewind/overlap to L18)
output slots 50–65 : IT layers 26–41     (rewind/overlap to L26)
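The slice table above can be expanded into an explicit slot-to-source map. This is a minimal sketch rather than the project's stacker; the names `SLICES`, `build_slot_map`, and `SLOT_TO_SRC` are illustrative and do not come from the original code.

```python
# Slice table from the card: (first output slot, first IT layer, slice length).
SLICES = [
    (0, 0, 18),    # output slots  0-17 <- IT layers  0-17
    (18, 10, 16),  # output slots 18-33 <- IT layers 10-25
    (34, 18, 16),  # output slots 34-49 <- IT layers 18-33
    (50, 26, 16),  # output slots 50-65 <- IT layers 26-41
]


def build_slot_map(slices):
    """Return a list whose entry i is the IT source layer for output slot i."""
    slot_map = []
    for first_slot, first_src, length in slices:
        # Each slice must start exactly where the previous one ended.
        assert first_slot == len(slot_map), "slices are not contiguous"
        slot_map.extend(range(first_src, first_src + length))
    return slot_map


SLOT_TO_SRC = build_slot_map(SLICES)

print(len(SLOT_TO_SRC))         # number of output layers
print(SLOT_TO_SRC[17:19])       # source layers on either side of the first seam
print(len(set(SLOT_TO_SRC)))    # number of distinct IT layers used
```

Every one of the 42 IT layers appears at least once, and the 24 layers inside the overlaps (IT layers 10–33) appear twice, which accounts for the 42 → 66 growth.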

For each output layer i (sourced from IT layer j) the stacker:

  • copies all model.language_model.layers.{j}.* weights (including the per-layer PLE riders);
  • remaps the global PLE tensors by copying the source block for layer j into the destination block for slot i: embed_tokens_per_layer[:, j·256:(j+1)·256] → [:, i·256:(i+1)·256] and per_layer_model_projection[j·256:(j+1)·256, :] → [i·256:(i+1)·256, :], widening the PLE table from 42×256 to 66×256;
  • copies the trunk (token embeddings, final norm, vision/audio towers) verbatim from IT;
  • sets num_hidden_layers = 66, num_kv_shared_layers = 26 (trailing shared region recomputed so each shared layer borrows KV from the last non-shared layer of its attention type), and rebuilds layer_types to keep E4B's 5-sliding : 1-full rhythm.
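The PLE remap step in the list above can be illustrated on toy data. The sketch below uses plain Python lists and a block width of 2 instead of 256 so the result is easy to inspect; the function names are hypothetical and the real stacker presumably operates on tensors, so treat this as an illustration of the indexing only.

```python
def remap_ple_table(table, slot_to_src, width):
    """Widen a [vocab, n_src * width] table to [vocab, n_out * width].

    Column block i of the result is a copy of column block slot_to_src[i]
    of the input (the embed_tokens_per_layer case).
    """
    out = []
    for row in table:
        new_row = []
        for j in slot_to_src:
            new_row.extend(row[j * width:(j + 1) * width])
        out.append(new_row)
    return out


def remap_ple_projection(proj, slot_to_src, width):
    """Same remap along rows for a [n_src * width, hidden] matrix
    (the per_layer_model_projection case)."""
    out = []
    for j in slot_to_src:
        out.extend(list(r) for r in proj[j * width:(j + 1) * width])
    return out


# Toy setup (assumed sizes): 3 source layers, block width 2, vocab 2, hidden 1.
toy_map = [0, 1, 2, 1, 2]            # 3 source layers -> 5 output slots
toy_table = [
    [0, 1, 10, 11, 20, 21],          # token 0: blocks for layers 0, 1, 2
    [100, 101, 110, 111, 120, 121],  # token 1
]
toy_proj = [[0], [1], [10], [11], [20], [21]]

wide_table = remap_ple_table(toy_table, toy_map, width=2)
wide_proj = remap_ple_projection(toy_proj, toy_map, width=2)

print(wide_table[0])
print(len(wide_proj))
```

With the real sizes, the same indexing takes the table from 42 × 256 = 10752 columns to 66 × 256 = 16896 columns.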

Parameter breakdown (~11.9B total): transformer layers 6.65B (56%), Per-Layer-Embedding table 4.43B (37%), token embeddings 0.67B (6%); the remainder is the PLE projection plus the vision/audio towers. More than 40% of the mass is therefore the PLE table, embeddings, and towers, which are sized by vocabulary and modality rather than by the transformer blocks themselves. The PLE table does widen with layer count (42×256 to 66×256 columns), so the depth-upscale's headline size comes from both the 24 added layers and the widened PLE table.
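The percentages in the breakdown can be rechecked from the stated figures. The sketch below uses only numbers quoted in this card; the pre-widening PLE estimate assumes the table grows linearly with layer count at a fixed vocabulary, which is an assumption and not a measured value.

```python
TOTAL_B = 11.9  # total parameters, in billions (from the card)

parts_b = {
    "transformer layers": 6.65,
    "PLE table": 4.43,
    "token embeddings": 0.67,
}
# Whatever is left over: PLE projection plus vision/audio towers.
parts_b["rest"] = round(TOTAL_B - sum(parts_b.values()), 2)

shares = {name: round(100 * size / TOTAL_B) for name, size in parts_b.items()}

PLE_BLOCK = 256
ple_cols_before = 42 * PLE_BLOCK   # columns per token in the 42-layer table
ple_cols_after = 66 * PLE_BLOCK    # columns per token in the 66-layer table

# Assumption: the table scales linearly with layer count at a fixed vocabulary.
ple_before_b = round(parts_b["PLE table"] * 42 / 66, 2)

for name, pct in shares.items():
    print(f"{name:20s} {parts_b[name]:5.2f}B  {pct:3d}%")
print(ple_cols_before, ple_cols_after, ple_before_b)
```

Under that linearity assumption, roughly 1.6B of the added parameters sit in the widened PLE table rather than in the 24 extra transformer layers.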

Why it is "raw" (what's wrong with it)

E4B's per-layer embeddings are welded to each layer's weights, and re-running layers at a different depth (the three rewind/overlap seams) pushes the residual stream off-distribution. The result is grammatical but loops under greedy decoding. A broad set of no-train fixes (PLE re-indexing by destination, zeroing duplicated PLE, residual-write scaling, brand-new interpolated layers, donor swaps) was tested, and none recovered coherence. The break is structural, not something a weight-shuffle can fix; new depth has to be learned. (This 66-layer build has more rewind seams than a minimal single-seam stack, so expect it to need a somewhat heavier heal.)
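The three rewind/overlap seams can be located mechanically from the slice layout given earlier in this card. A small sketch, with `find_seams` as an illustrative helper name: a seam is any output slot whose source layer is not the direct successor of the previous slot's source layer.

```python
# Slot-to-source map reconstructed from the four slices listed in this card.
SLOT_TO_SRC = (
    list(range(0, 18))     # output slots  0-17
    + list(range(10, 26))  # output slots 18-33
    + list(range(18, 34))  # output slots 34-49
    + list(range(26, 42))  # output slots 50-65
)


def find_seams(slot_to_src):
    """Return (slot, previous_source, new_source) for every discontinuity."""
    seams = []
    for i in range(1, len(slot_to_src)):
        prev_src, cur_src = slot_to_src[i - 1], slot_to_src[i]
        if cur_src != prev_src + 1:
            seams.append((i, prev_src, cur_src))
    return seams


seams = find_seams(SLOT_TO_SRC)
for slot, prev_src, cur_src in seams:
    # Number of source layers that get re-run after the jump back.
    rewind = prev_src - cur_src + 1
    print(f"slot {slot}: IT layer {prev_src} -> {cur_src} (rewinds {rewind} layers)")
```

Each seam jumps back eight source layers, so at slots 18, 34, and 50 a layer receives a residual stream produced by layers deeper than the one it originally followed.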

How to make it usable

A heal (LoRA SFT/CPT) re-knits it into a coherent deeper model. A Fisher/subspace-protected heal — protecting the high-importance instruction directions while re-knitting the off-distribution layers — is the intended next step, to recover coherence without sacrificing instruction-following.

Intended use

  • Base for a healing train (the primary purpose).
  • Reproducibility / study of Gemma-4 depth-upscaling and its Per-Layer-Embedding constraints.

Built 2026-06-01 as part of the E4B → E6B depth-upscale research.

Safetensors · Model size: 12B params · Tensor type: BF16