HyperScale: Euclidean vs Hyperbolic Output Layer Scaling Laws

Scaling law experiments comparing Euclidean (standard dot-product) vs Hyperbolic (Lorentz model) output layers for Qwen3 language models on OpenWebText.

Project

  • Repository: github.com/ObliviateRickLin/HyperScale
  • Base model: Qwen3 architecture (custom sizes, untied embeddings)
  • Dataset: OpenWebText (8.39B tokens total)
  • Optimizer: NanochatMuon (Muon for 2D transformer matrices + per-group AdamW)
  • Training: DeepSpeed ZeRO-2, bf16 mixed precision, 4x H100 80GB

Key Differences

|                     | Euclidean                            | Hyperbolic                                        |
|---------------------|--------------------------------------|---------------------------------------------------|
| Output layer        | Standard linear (dot-product logits) | Lorentz hyperboloid (Minkowski inner-product logits) |
| lm_head init        | zeros                                | std=0.02                                          |
| embed_tokens init   | std=1.0                              | std=1.0                                           |
| tie_word_embeddings | false                                | false                                             |
| Logit computation   | hidden @ lm_head.T (fp32)            | <expmap(h), expmap(w)>_L * scale (fp32, mean-centered) |
| logit_scale         | N/A                                  | d_model / sinh(1) (learnable)                     |
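The hyperbolic logit path above can be sketched in NumPy. This is a minimal illustration assuming curvature -1 and an exponential map at the hyperboloid origin; the function names (`expmap0`, `lorentz_logits`) and the epsilon clamp are my assumptions, not the repository's exact implementation:

```python
import numpy as np

def expmap0(v):
    # Exponential map at the origin of the Lorentz model (curvature -1):
    # lifts a Euclidean vector v in R^d to a hyperboloid point in R^{d+1}.
    norm = np.maximum(np.linalg.norm(v, axis=-1, keepdims=True), 1e-7)
    return np.concatenate([np.cosh(norm), np.sinh(norm) * v / norm], axis=-1)

def lorentz_logits(h, W, scale):
    """<expmap(h), expmap(w)>_L * scale, mean-centered.
    h: (batch, d_model) hidden states; W: (vocab, d_model) lm_head rows."""
    x, w = expmap0(h.astype(np.float64)), expmap0(W.astype(np.float64))
    # Minkowski inner product <x, y>_L = -x0*y0 + sum_i xi*yi
    # (the time component carries a minus sign).
    inner = x[:, 1:] @ w[:, 1:].T - x[:, :1] * w[:, :1].T
    logits = scale * inner
    # Mean-center each row, matching the "(fp32, mean-centered)" note above.
    return logits - logits.mean(axis=-1, keepdims=True)
```

Per the table, `scale` would be initialized to `d_model / sinh(1)` and learned during training.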

Known Issues

c_proj zero init missing: Neither the Euclidean nor the Hyperbolic models zero-initialize self_attn.o_proj and mlp.down_proj (the Qwen3 equivalents of Karpathy's c_proj). In the nanochat/HypGPT reference, these ARE zeroed alongside lm_head. This is a known confound: the Euclidean model (lm_head init = zeros) is more affected than the Hyperbolic one (lm_head init = std=0.02).

Results

A token budget of t1_N means 1/N of the full 8.39B-token dataset. Delta = Hyp - Euc, so Delta < 0 means the hyperbolic model is better.

| Size   | Params | Tokens          | Hyp      | Euc     | Delta   |
|--------|--------|-----------------|----------|---------|---------|
| p020m  | 20M    | 65.6M (t1_128)  | 9.9145   | 10.0184 | -0.1039 |
| p020m  | 20M    | 131M (t1_64)    | 8.4028   | 8.4610  | -0.0582 |
| p020m  | 20M    | 262M (t1_32)    | 7.1117   | 7.4237  | -0.3120 |
| p020m  | 20M    | 524M (t1_16)    | 6.0795   | 6.4958  | -0.4163 |
| p047m  | 47M    | 65.6M (t1_128)  | 8.5956   | 8.6146  | -0.0190 |
| p047m  | 47M    | 262M (t1_32)    | 6.0944   | 6.3668  | -0.2724 |
| p047m  | 47M    | 524M (t1_16)    | 5.5383   | 5.7709  | -0.2326 |
| p109m  | 109M   | 131M (t1_64)    | 6.1340   | 6.3509  | -0.2169 |
| p109m  | 109M   | 262M (t1_32)    | 5.5677   | 5.7453  | -0.1776 |
| p109m  | 109M   | 524M (t1_16)    | 5.3841   | 5.5431  | -0.1590 |
| p223m  | 223M   | 131M (t1_64)    | 5.8612   | 6.0407  | -0.1795 |
| p407m  | 407M   | 65.6M (t1_128)  | 6.8377   | 7.1486  | -0.3109 |
| p407m  | 407M   | 262M (t1_32)    | 4.5280   | 5.1510  | -0.6230 |
| p407m  | 407M   | 524M (t1_16)    | 4.4119   | 4.5080  | -0.0961 |
| p407m  | 407M   | 1.05B (t1_8)    | 3.4614   | 3.9945  | -0.5331 |
| p407m  | 407M   | 2.10B (t1_4)    | 3.5738   | 3.6206  | -0.0468 |
| p407m  | 407M   | 4.20B (t1_2)    | 3.3230   | 3.3503  | -0.0273 |
| p407m  | 407M   | 8.39B (t1_1)    | 3.1236   | --      | --      |
| p686m  | 686M   | 131M (t1_64)    | 5.5675   | 5.7105  | -0.1430 |
| p686m  | 686M   | 262M (t1_32)    | 4.9169   | 5.0321  | -0.1152 |
| p686m  | 686M   | 524M (t1_16)    | 4.2684   | 4.3334  | -0.0650 |
| p686m  | 686M   | 1.05B (t1_8)    | 3.7796   | 3.8389  | -0.0593 |
| p686m  | 686M   | 4.20B (t1_2)    | 3.3219   | 3.2237  | +0.0982 |
| p686m  | 686M   | 65.6M (t1_128)  | 11.9171* | 6.5685  | --      |
| p1083m | 1.08B  | 65.6M (t1_128)  | 6.9223   | 7.9453  | -1.0230 |
| p1083m | 1.08B  | 262M (t1_32)    | 4.4833   | 7.0858* | --      |
| p1083m | 1.08B  | 524M (t1_16)    | 4.1417   | 4.2060  | -0.0643 |
| p1083m | 1.08B  | 1.05B (t1_8)    | 3.3901   | 3.7304  | -0.3403 |
| p1083m | 1.08B  | 4.20B (t1_2)    | 3.1223   | 3.2614  | -0.1391 |
| p1083m | 1.08B  | 8.39B (t1_1)    | 2.9269   | --      | --      |
| p1621m | 1.62B  | all             | NaN      | 3.30-6.54 | --    |
| p2324m | 2.32B  | 262M (t1_32)    | 4.6533   | 4.7401  | -0.0868 |
| p2324m | 2.32B  | 524M (t1_16)    | 4.0000   | 4.0594  | -0.0594 |
| p2324m | 2.32B  | 4.20B (t1_2)    | 3.0077   | 3.2524  | -0.2447 |
| p2324m | 2.32B  | 8.39B (t1_1)    | 3.1524   | --      | --      |

* Anomalous values due to training instability/divergence. p1621m hyp diverged entirely (all NaN).

Summary

  • Hyperbolic achieves lower eval loss in nearly all matched comparisons
  • Average improvement: roughly 5-13% relative reduction in eval loss
  • The improvement is largest at medium token budgets (t1_32, t1_8) and diminishes at high token budgets (t1_2, t1_1)
  • Caveat: the lm_head init difference (zeros vs std=0.02) confounds the comparison
  • Hyperbolic models show instability at larger sizes (p686m+ at t1_128; p1621m diverges entirely)
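The per-run relative reduction, (Euc - Hyp) / Euc, can be recomputed directly from a few rows of the Results table; the row selection here is mine, chosen only to illustrate the spread:

```python
# Relative loss reduction (Euc - Hyp) / Euc for a few matched runs,
# with Hyp and Euc losses taken verbatim from the Results table above.
rows = [
    ("p020m", "t1_16", 6.0795, 6.4958),
    ("p407m", "t1_32", 4.5280, 5.1510),
    ("p1083m", "t1_2", 3.1223, 3.2614),
]
for size, budget, hyp, euc in rows:
    rel = (euc - hyp) / euc * 100
    print(f"{size} {budget}: {rel:.1f}% relative reduction")
```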

Repository Structure

checkpoints/
  qwen3/                            # Euclidean models
    qwen3_p{SIZE}_t1_{N}_.../
      attempt{K}_{DATE}/
        checkpoint-{STEP}/          # Final checkpoint
  qwen3_hyp/                        # Hyperbolic models
    qwen3_hyp_p{SIZE}_t1_{N}_.../
      attempt{K}_{DATE}/
        checkpoint-{STEP}/

results/                            # Training logs (trainer_state.json)
  scaling_law/owt/
    qwen3/owt_scaling_v3/
    qwen3_hyp/owt_scaling_v3/
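Given the checkpoint layout above, a small helper can locate the final checkpoint of a run. This is my own sketch (the function name `latest_final_checkpoint` is hypothetical, not part of the repository), assuming the `attempt{K}_{DATE}/checkpoint-{STEP}` naming shown:

```python
from pathlib import Path

def latest_final_checkpoint(run_dir):
    # Pick the most recent attempt{K}_{DATE} directory (lexicographic sort
    # works because K and DATE are zero-padded in the layout above)...
    attempts = sorted(Path(run_dir).glob("attempt*"))
    # ...then the checkpoint-{STEP} directory with the highest step number.
    ckpts = sorted(attempts[-1].glob("checkpoint-*"),
                   key=lambda p: int(p.name.split("-")[1]))
    return ckpts[-1]
```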

Experiment Configuration

| Parameter      | Value                                                          |
|----------------|----------------------------------------------------------------|
| Architecture   | Qwen3 (custom sizes)                                           |
| Vocab size     | 151,936                                                        |
| Context length | 1,024                                                          |
| Dataset        | OpenWebText (8.39B tokens)                                     |
| Optimizer      | NanochatMuon (Muon + per-group AdamW)                          |
| Muon targets   | 2D transformer matrices                                        |
| AdamW groups   | embed (lr=0.2*scale), lm_head (lr=0.004*scale), misc (lr=0.004*scale) |
| LR scaling     | scale = (d_model/768)^(-0.5)                                   |
| Precision      | bf16 mixed precision                                           |
| Infrastructure | 4x NVIDIA H100 80GB, DeepSpeed ZeRO-2                          |
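The per-group learning rates follow from the width scale in the table above. A minimal sketch (the helper name `group_lrs` is mine; only the base rates and the (d_model/768)^(-0.5) rule come from the table):

```python
def group_lrs(d_model, base_width=768):
    # Width-based scale from the table above: (d_model / 768)^(-0.5).
    scale = (d_model / base_width) ** -0.5
    # Per-group AdamW base rates, each multiplied by the width scale.
    return {
        "embed": 0.2 * scale,
        "lm_head": 0.004 * scale,
        "misc": 0.004 * scale,
    }
```

At d_model = 3072, for example, the scale is (3072/768)^(-0.5) = 0.5, halving every group's rate relative to the 768-wide baseline.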

Citation

@misc{hyperscale2026,
  title={HyperScale: Scaling Laws for Hyperbolic Output Layers in Language Models},
  author={Jinrui Lin},
  year={2026},
  url={https://github.com/ObliviateRickLin/HyperScale}
}