HyperScale: Euclidean vs Hyperbolic Output Layer Scaling Laws

Scaling law experiments comparing Euclidean (standard dot-product) vs Hyperbolic (Lorentz model) output layers for Qwen3 language models on OpenWebText.

Project

  • Repository: github.com/ObliviateRickLin/HyperScale
  • Base model: Qwen3 architecture (custom sizes, untied embeddings)
  • Dataset: OpenWebText (8.39B tokens total)
  • Optimizer: NanochatMuon (Muon for 2D transformer matrices + per-group AdamW)
  • Training: DeepSpeed ZeRO-2, bf16 mixed precision, 4x H100 80GB

Key Differences

|                     | Euclidean                            | Hyperbolic                                        |
|---------------------|--------------------------------------|---------------------------------------------------|
| Output layer        | Standard linear (dot-product logits) | Lorentz hyperboloid (Minkowski inner-product logits) |
| lm_head init        | zeros                                | std=0.02                                          |
| embed_tokens init   | std=1.0                              | std=1.0                                           |
| tie_word_embeddings | false                                | false                                             |
| Logit computation   | hidden @ lm_head.T (fp32)            | <expmap(h), expmap(w)>_L * scale (fp32, mean-centered) |
| logit_scale         | N/A                                  | d_model / sinh(1) (learnable)                     |
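The hyperbolic logit path above can be sketched in NumPy. This is a minimal illustration assuming curvature -1 and an exponential map at the hyperboloid origin; the function names (`expmap0`, `lorentz_logits`) and the epsilon clamp are my assumptions, not the repository's exact implementation:

```python
import numpy as np

def expmap0(v):
    # Exponential map at the origin of the Lorentz model (curvature -1):
    # lifts a Euclidean vector v in R^d to a hyperboloid point in R^{d+1}.
    norm = np.maximum(np.linalg.norm(v, axis=-1, keepdims=True), 1e-7)
    return np.concatenate([np.cosh(norm), np.sinh(norm) * v / norm], axis=-1)

def lorentz_logits(h, W, scale):
    """<expmap(h), expmap(w)>_L * scale, mean-centered.
    h: (batch, d_model) hidden states; W: (vocab, d_model) lm_head rows."""
    x, w = expmap0(h.astype(np.float64)), expmap0(W.astype(np.float64))
    # Minkowski inner product <x, y>_L = -x0*y0 + sum_i xi*yi
    # (the time component carries a minus sign).
    inner = x[:, 1:] @ w[:, 1:].T - x[:, :1] * w[:, :1].T
    logits = scale * inner
    # Mean-center each row, matching the "(fp32, mean-centered)" note above.
    return logits - logits.mean(axis=-1, keepdims=True)
```

Per the table, `scale` would be initialized to `d_model / sinh(1)` and learned during training.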

Known Issues

c_proj zero init missing: Neither the Euclidean nor the Hyperbolic models zero-initialize self_attn.o_proj and mlp.down_proj (the Qwen3 equivalents of Karpathy's c_proj). In the nanochat/HypGPT reference, these ARE zeroed alongside lm_head. This is a known confound: the Euclidean model (lm_head init = zeros) is more affected than the Hyperbolic one (lm_head init = std=0.02).

Results

A token budget of t1_N means 1/N of the full 8.39B-token dataset. Delta = Hyp - Euc, so Delta < 0 means the hyperbolic model is better.

| Size   | Params | Tokens          | Hyp      | Euc     | Delta   |
|--------|--------|-----------------|----------|---------|---------|
| p020m  | 20M    | 65.6M (t1_128)  | 9.9145   | 10.0184 | -0.1039 |
| p020m  | 20M    | 131M (t1_64)    | 8.4028   | 8.4610  | -0.0582 |
| p020m  | 20M    | 262M (t1_32)    | 7.1117   | 7.4237  | -0.3120 |
| p020m  | 20M    | 524M (t1_16)    | 6.0795   | 6.4958  | -0.4163 |
| p047m  | 47M    | 65.6M (t1_128)  | 8.5956   | 8.6146  | -0.0190 |
| p047m  | 47M    | 262M (t1_32)    | 6.0944   | 6.3668  | -0.2724 |
| p047m  | 47M    | 524M (t1_16)    | 5.5383   | 5.7709  | -0.2326 |
| p109m  | 109M   | 131M (t1_64)    | 6.1340   | 6.3509  | -0.2169 |
| p109m  | 109M   | 262M (t1_32)    | 5.5677   | 5.7453  | -0.1776 |
| p109m  | 109M   | 524M (t1_16)    | 5.3841   | 5.5431  | -0.1590 |
| p223m  | 223M   | 131M (t1_64)    | 5.8612   | 6.0407  | -0.1795 |
| p407m  | 407M   | 65.6M (t1_128)  | 6.8377   | 7.1486  | -0.3109 |
| p407m  | 407M   | 262M (t1_32)    | 4.5280   | 5.1510  | -0.6230 |
| p407m  | 407M   | 524M (t1_16)    | 4.4119   | 4.5080  | -0.0961 |
| p407m  | 407M   | 1.05B (t1_8)    | 3.4614   | 3.9945  | -0.5331 |
| p407m  | 407M   | 2.10B (t1_4)    | 3.5738   | 3.6206  | -0.0468 |
| p407m  | 407M   | 4.20B (t1_2)    | 3.3230   | 3.3503  | -0.0273 |
| p407m  | 407M   | 8.39B (t1_1)    | 3.1236   | --      | --      |
| p686m  | 686M   | 131M (t1_64)    | 5.5675   | 5.7105  | -0.1430 |
| p686m  | 686M   | 262M (t1_32)    | 4.9169   | 5.0321  | -0.1152 |
| p686m  | 686M   | 524M (t1_16)    | 4.2684   | 4.3334  | -0.0650 |
| p686m  | 686M   | 1.05B (t1_8)    | 3.7796   | 3.8389  | -0.0593 |
| p686m  | 686M   | 4.20B (t1_2)    | 3.3219   | 3.2237  | +0.0982 |
| p686m  | 686M   | 65.6M (t1_128)  | 11.9171* | 6.5685  | --      |
| p1083m | 1.08B  | 65.6M (t1_128)  | 6.9223   | 7.9453  | -1.0230 |
| p1083m | 1.08B  | 262M (t1_32)    | 4.4833   | 7.0858* | --      |
| p1083m | 1.08B  | 524M (t1_16)    | 4.1417   | 4.2060  | -0.0643 |
| p1083m | 1.08B  | 1.05B (t1_8)    | 3.3901   | 3.7304  | -0.3403 |
| p1083m | 1.08B  | 4.20B (t1_2)    | 3.1223   | 3.2614  | -0.1391 |
| p1083m | 1.08B  | 8.39B (t1_1)    | 2.9269   | --      | --      |
| p1621m | 1.62B  | all             | NaN      | 3.30-6.54 | --    |
| p2324m | 2.32B  | 262M (t1_32)    | 4.6533   | 4.7401  | -0.0868 |
| p2324m | 2.32B  | 524M (t1_16)    | 4.0000   | 4.0594  | -0.0594 |
| p2324m | 2.32B  | 4.20B (t1_2)    | 3.0077   | 3.2524  | -0.2447 |
| p2324m | 2.32B  | 8.39B (t1_1)    | 3.1524   | --      | --      |

* Anomalous values due to training instability/divergence. p1621m hyp diverged entirely (all NaN).

Summary

  • Hyperbolic achieves lower eval loss in nearly all matched comparisons
  • Average improvement: roughly 5-13% relative reduction in eval loss
  • The improvement is largest at medium token budgets (t1_32, t1_8) and diminishes at high token budgets (t1_2, t1_1)
  • Caveat: the lm_head init difference (zeros vs std=0.02) confounds the comparison
  • Hyperbolic models show instability at larger sizes (p686m+ at t1_128; p1621m diverges entirely)
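The per-run relative reduction, (Euc - Hyp) / Euc, can be recomputed directly from a few rows of the Results table; the row selection here is mine, chosen only to illustrate the spread:

```python
# Relative loss reduction (Euc - Hyp) / Euc for a few matched runs,
# with Hyp and Euc losses taken verbatim from the Results table above.
rows = [
    ("p020m", "t1_16", 6.0795, 6.4958),
    ("p407m", "t1_32", 4.5280, 5.1510),
    ("p1083m", "t1_2", 3.1223, 3.2614),
]
for size, budget, hyp, euc in rows:
    rel = (euc - hyp) / euc * 100
    print(f"{size} {budget}: {rel:.1f}% relative reduction")
```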

Repository Structure

checkpoints/
  qwen3/                            # Euclidean models
    qwen3_p{SIZE}_t1_{N}_.../
      attempt{K}_{DATE}/
        checkpoint-{STEP}/          # Final checkpoint
  qwen3_hyp/                        # Hyperbolic models
    qwen3_hyp_p{SIZE}_t1_{N}_.../
      attempt{K}_{DATE}/
        checkpoint-{STEP}/

results/                            # Training logs (trainer_state.json)
  scaling_law/owt/
    qwen3/owt_scaling_v3/
    qwen3_hyp/owt_scaling_v3/
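Given the checkpoint layout above, a small helper can locate the final checkpoint of a run. This is my own sketch (the function name `latest_final_checkpoint` is hypothetical, not part of the repository), assuming the `attempt{K}_{DATE}/checkpoint-{STEP}` naming shown:

```python
from pathlib import Path

def latest_final_checkpoint(run_dir):
    # Pick the most recent attempt{K}_{DATE} directory (lexicographic sort
    # works because K and DATE are zero-padded in the layout above)...
    attempts = sorted(Path(run_dir).glob("attempt*"))
    # ...then the checkpoint-{STEP} directory with the highest step number.
    ckpts = sorted(attempts[-1].glob("checkpoint-*"),
                   key=lambda p: int(p.name.split("-")[1]))
    return ckpts[-1]
```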

Experiment Configuration

| Parameter      | Value                                                          |
|----------------|----------------------------------------------------------------|
| Architecture   | Qwen3 (custom sizes)                                           |
| Vocab size     | 151,936                                                        |
| Context length | 1,024                                                          |
| Dataset        | OpenWebText (8.39B tokens)                                     |
| Optimizer      | NanochatMuon (Muon + per-group AdamW)                          |
| Muon targets   | 2D transformer matrices                                        |
| AdamW groups   | embed (lr=0.2*scale), lm_head (lr=0.004*scale), misc (lr=0.004*scale) |
| LR scaling     | scale = (d_model/768)^(-0.5)                                   |
| Precision      | bf16 mixed precision                                           |
| Infrastructure | 4x NVIDIA H100 80GB, DeepSpeed ZeRO-2                          |
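The per-group learning rates follow from the width scale in the table above. A minimal sketch (the helper name `group_lrs` is mine; only the base rates and the (d_model/768)^(-0.5) rule come from the table):

```python
def group_lrs(d_model, base_width=768):
    # Width-based scale from the table above: (d_model / 768)^(-0.5).
    scale = (d_model / base_width) ** -0.5
    # Per-group AdamW base rates, each multiplied by the width scale.
    return {
        "embed": 0.2 * scale,
        "lm_head": 0.004 * scale,
        "misc": 0.004 * scale,
    }
```

At d_model = 3072, for example, the scale is (3072/768)^(-0.5) = 0.5, halving every group's rate relative to the 768-wide baseline.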

Citation

@misc{hyperscale2026,
  title={HyperScale: Scaling Laws for Hyperbolic Output Layers in Language Models},
  author={Jinrui Lin},
  year={2026},
  url={https://github.com/ObliviateRickLin/HyperScale}
}