Can I see your llama.cpp settings?

#4
by Nuke1229 - opened

Can I see your llama.cpp settings? And roughly how many tokens per second do you get?

I get 10-11 tokens per second with the APEX quant on an RTX 3060, and 18-20 with APEX Compact. I am using LM Studio with these settings: GPU Offload: 15, Number of layers to force on CPU: 25.

Thank you. Are you planning to make a version for Gemma 4 26B?

Gemma 4 26B is already healthy; it was calibrated by Google before release. It doesn't need any of my fixes.

For MoE models, I’d recommend starting with both GPU Offload and the number of layers forced onto the CPU set to their maximum, then gradually reducing the CPU layers until your VRAM usage reaches around 85%.

On my end (RTX 5070 Ti Laptop 12GB + 48GB DDR5 5200MHz RAM), I tested this with the same APEX model:
• GPU: 40 / CPU: 40 → ~6GB VRAM, ~30 t/s
• GPU: 40 / CPU: 30 → ~10.8GB VRAM, ~36 t/s
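The tuning procedure above can be sketched in a few lines of Python. This is only an illustration: `tune_cpu_layers` and its `measure_vram_gb` callback are hypothetical names, not part of LM Studio or llama.cpp. In practice, "measuring" means reloading the model with the new CPU-layer setting and reading VRAM usage from your GPU monitor. The 12 GB total and the two data points are the ones reported above.

```python
def vram_utilization(used_gb, total_gb):
    """Fraction of total VRAM in use."""
    return used_gb / total_gb


def tune_cpu_layers(measure_vram_gb, total_vram_gb, max_cpu_layers,
                    target=0.85, step=5):
    """Lower the CPU-layer count from its maximum until VRAM use reaches target.

    measure_vram_gb is a hypothetical callback: given a CPU-layer count, it
    returns the VRAM (in GB) observed after loading the model with that setting.
    Returns the last (cpu_layers, used_gb) pair that was tried.
    """
    cpu_layers = max_cpu_layers
    used = measure_vram_gb(cpu_layers)
    # Keep moving layers off the CPU while there is still VRAM headroom.
    while cpu_layers - step >= 0 and used / total_vram_gb < target:
        cpu_layers -= step
        used = measure_vram_gb(cpu_layers)
    return cpu_layers, used


# The two measurements reported above, on a 12 GB card.
reported = {40: 6.0, 30: 10.8}
for cpu_layers, used in reported.items():
    print(f"CPU layers {cpu_layers}: {vram_utilization(used, 12.0):.0%} of VRAM")

# Replay the reported measurements in steps of 10 layers.
print(tune_cpu_layers(lambda n: reported[n], 12.0, 40, step=10))
```

With these numbers, 40 CPU layers uses about half the VRAM and 30 CPU layers about 90%, slightly past the 85% target. The loop stops at the first setting that reaches the target, so a smaller step (or stepping back up by a few layers) may land closer to it.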
