Per-instance allocation for max_n, max_batch (B):
WORKING STORAGE:
A_work : [B, max_n, max_n] # working copy (destroyed)
V_accum : [B, max_n, max_n] # eigenvector accumulator
householder : [max_n-2, B, max_n] # stored reflectors (padded)
d : [B, max_n] # tridiagonal diagonal
e : [B, max_n-1] # tridiagonal off-diagonal
Subtotal: ~3 × max_n² × B floats
D&C TREE (depth = ⌈log₂(max_n)⌉ levels):
FOR each level l (0 to depth-1):
num_sub = 2^l
sub_size = max_n // 2^l (padded up to power of 2)
delta : [B, num_sub, sub_size] # merged eigenvalues
z_vec : [B, num_sub, sub_size] # merge vectors
rho : [B, num_sub] # coupling strengths
mask : [B, num_sub, sub_size] # valid element mask
# Newton state (per root):
lam : [B, num_sub, sub_size] # current root estimates
lo : [B, num_sub, sub_size] # bracket lower
hi : [B, num_sub, sub_size] # bracket upper
f_val : [B, num_sub, sub_size] # secular function value
converge: [B, num_sub, sub_size] # convergence mask
# Eigenvector fragments:
V_frag : [B, num_sub, sub_size, sub_size]
Subtotal per level: ~(9 × sub_size + sub_size²) × num_sub × B
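Total across levels: num_sub × sub_size = max_n at every level, so the
vector buffers cost ~9 × max_n × B per level. Ignoring padding, V_frag costs
num_sub × sub_size² = max_n × sub_size ≤ max_n² × B per level, so
≤ (9 × max_n + max_n²) × depth × B
≈ max_n² × depth × B (conservative upper bound; the V_frags dominate)
Because sub_size halves at each level, the exact V_frag sum without padding
stays below 2 × max_n² × B.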
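The per-level listing above can be summed directly and compared with the closed-form bound. The following is a minimal sketch, not the actual allocator: the function names are hypothetical, the vector count (eight sub_size-length buffers plus rho) is read off the listing, and padding is assumed to round sub_size up to the next power of two.

```python
def next_pow2(x: int) -> int:
    """Smallest power of two >= x, for x >= 1."""
    return 1 << (x - 1).bit_length()


def tree_floats(max_n: int, batch: int) -> tuple[int, int]:
    """Return (exact, bound) element counts for the D&C tree buffers.

    exact: sum of the buffers listed per level, sized level by level.
    bound: (9 * max_n + max_n^2) * depth * batch, the estimate from the text.
    """
    depth = (max_n - 1).bit_length()  # integer ceil(log2(max_n))
    exact = 0
    for level in range(depth):
        num_sub = 2 ** level
        # Assumption: sub_size is padded up to the next power of two.
        sub_size = next_pow2(max(1, max_n // num_sub))
        # delta, z_vec, mask, lam, lo, hi, f_val, converge (8 vectors) + rho
        vectors = 8 * sub_size + 1
        v_frag = sub_size * sub_size
        exact += (vectors + v_frag) * num_sub * batch
    bound = (9 * max_n + max_n * max_n) * depth * batch
    return exact, bound


if __name__ == "__main__":
    for n in (8, 64, 256):
        exact, bound = tree_floats(n, batch=1)
        print(f"max_n={n}: exact={exact} bound={bound}")
```

For power-of-two max_n the exact count falls well under the bound, since the V_frag term shrinks geometrically with depth while the bound charges max_n² at every level.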
CONCRETE NUMBERS (fp32, 4 bytes each):
max_n=8, B=4096: ~8² × 3 × 3 × 4096 × 4 ≈ 9 MB
max_n=32, B=1024: ~32² × 5 × 3 × 1024 × 4 ≈ 60 MB
max_n=64, B=512: ~64² × 6 × 3 × 512 × 4 ≈ 144 MB
max_n=128, B=256: ~128² × 7 × 3 × 256 × 4 ≈ 336 MB
max_n=256, B=128: ~256² × 8 × 3 × 128 × 4 ≈ 768 MB
max_n=6, B=8192: ~6² × 3 × 3 × 8192 × 4 ≈ 10 MB ← your CM case
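The rows above all follow the pattern max_n² × depth × 3 × B × 4 bytes, so the same estimate can be produced for any other configuration. This is a small sketch under that assumption: `estimate_mib` is a hypothetical helper, depth is taken as ⌈log₂(max_n)⌉, the factor of 3 is carried over from the rows as a parameter, and the result is reported in units of 2²⁰ bytes.

```python
def estimate_mib(max_n: int, batch: int,
                 bytes_per_elem: int = 4, factor: int = 3) -> float:
    """Rough per-instance footprint, following the row formula in the table.

    depth is ceil(log2(max_n)); factor is the multiplier of 3 that appears
    in each row (kept as a parameter because the text does not derive it).
    """
    depth = (max_n - 1).bit_length()  # integer ceil(log2(max_n))
    total_bytes = max_n ** 2 * depth * factor * batch * bytes_per_elem
    return total_bytes / 2 ** 20


if __name__ == "__main__":
    for max_n, batch in [(32, 1024), (64, 512), (256, 128)]:
        print(f"max_n={max_n}, B={batch}: ~{estimate_mib(max_n, batch):.0f} MB")
```

For example, `estimate_mib(32, 1024)` gives 60.0, matching the max_n=32 row. Passing `bytes_per_elem=8` gives the fp64 footprint under the same assumptions.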