Daniel Fox PRO
FlameF0X
AI & ML interests
Pre-training text generator.
(Brother, I'm 18)
Please don't try to contact me.
Recent Activity
liked a model about 1 hour ago
Jackrong/Qwopus3.5-4B-Coder
reacted to their post with 🔥 about 1 hour ago
MiniMax-M3 coming soon.
https://github.com/MiniMax-AI/MiniMax-M3
posted an update about 4 hours ago
MiniMax-M3 coming soon.
https://github.com/MiniMax-AI/MiniMax-M3
Organizations
reacted to wenhuach's post with 🔥 5 days ago
Post
4460
We provide **free** hardware to quantize models at the [Intel Low Bit Open LLM Leaderboard](Intel/low_bit_open_llm_leaderboard), currently supporting Pure RTN mode powered by AutoRound.
⭐ If you find it useful, please consider starring the AutoRound project on [GitHub](https://github.com/intel/auto-round)!
reacted to pankajpandey-dev's post with 🔥 7 days ago
Post
2682
🧬 Just uploaded K-quants of Carbon-3B for llama.cpp users!
@HuggingFaceBio released the original GGUF in bf16 only, so I added the full quant ladder for CPU/edge inference:
• Q2_K - 1.4 GB
• Q3_K_M - 1.8 GB
• Q4_K_M - 2.1 GB ⭐
• Q5_K_M - 2.4 GB
• Q6_K - 2.7 GB
• Q8_0 - 3.5 GB
pankajpandey-dev/Carbon-3B-GGUF
Now you can generate DNA sequences on your laptop. Needs a llama.cpp build with PR #23410 (HybridDNATokenizer support).
Huge thanks to the HuggingFaceBio team for the original model.
#GGUF #llamacpp #genomics #DNA
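As a rough sanity check on the quant ladder above, the file sizes can be turned into approximate bits per weight. This is only a sketch and not part of the original post: it assumes exactly 3 billion parameters (inferred from the "3B" in the model name, nothing more) and decimal gigabytes, and `bits_per_weight` is a hypothetical helper.

```python
# File sizes as listed in the post, in GB (taken here as 10**9 bytes).
SIZES_GB = {
    "Q2_K": 1.4,
    "Q3_K_M": 1.8,
    "Q4_K_M": 2.1,
    "Q5_K_M": 2.4,
    "Q6_K": 2.7,
    "Q8_0": 3.5,
}

# ASSUMPTION: the real parameter count is not stated in the post.
ASSUMED_PARAMS = 3_000_000_000


def bits_per_weight(size_gb: float, n_params: int = ASSUMED_PARAMS) -> float:
    """Approximate average bits stored per parameter for a file of size_gb."""
    return size_gb * 1e9 * 8 / n_params


if __name__ == "__main__":
    for name, size in SIZES_GB.items():
        print(f"{name:7s} ~{bits_per_weight(size):.2f} bits/weight")
```

The result includes file metadata and any tensors kept at higher precision, so it will sit somewhat above the nominal bit width of each quant type.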
reacted to Crownelius's post with 🔥🔥🔥 12 days ago
Post
4619
Howdy,
CompactAI-O is launching a tiny Model Golf, and the winner walks away with $50 in RunPod credits. Monthly. Every month. Show up, build, somebody wins.
What it is
Build the best language model you can under 100 million parameters, with at least a 1028-token context window. That's it. Any architecture, any tokenizer, any training scheme you can dream up at 3am. The only catch is it's gotta be open source (MIT, GPL, Apache, AGPL), take your pick.
It scratches the same itch as a Kaggle comp without the dataset/leaderboard nonsense. No fixed benchmark to game. No llama.cpp compatibility hoops. If you wanna train a 50M-param MoE with five experts and a tokenizer built on cookbooks, you can do that. Nothing stopping you.
The rules are listed in the discord and on the organization page if you're interested.
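If you want to check a design against the 100M budget before training, a quick estimate goes a long way. The sketch below is not from the contest rules: it assumes a plain decoder-only transformer (four attention projections, a two-matrix MLP, two norms per layer, no biases), and `estimate_params` plus the example config are hypothetical.

```python
PARAM_BUDGET = 100_000_000  # "under 100 million parameters" from the post


def estimate_params(vocab_size: int, d_model: int, n_layers: int,
                    d_ff: int, tie_embeddings: bool = True) -> int:
    """Rough parameter count for an ASSUMED vanilla decoder-only transformer."""
    embed = vocab_size * d_model          # token embedding table
    attn = 4 * d_model * d_model          # Q, K, V and output projections
    mlp = 2 * d_model * d_ff              # up and down projections
    norms = 2 * d_model                   # two norm scale vectors per layer
    total = embed + n_layers * (attn + mlp + norms) + d_model  # + final norm
    if not tie_embeddings:
        total += vocab_size * d_model     # separate output head
    return total


if __name__ == "__main__":
    n = estimate_params(vocab_size=32_000, d_model=512, n_layers=12, d_ff=2048)
    print(f"{n:,} params, under budget: {n < PARAM_BUDGET}")
```

Gated MLPs, MoE layers, or untied heads change the count, so treat this as a starting point and confirm against the real model.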
Why $50????
It's symbolic. It ain't gonna make anyone rich. But it's enough to cover a weekend of GPU time, enough to keep enthusiasts coming back, and not so much that it pulls in people who are just there for the money. Enthusiasts build interesting things. Interesting things move the field forward. A little incentive. I'd do it for $50 lol.
How to join
First round opens soon. Landing page is here:
→ CompactAI-O/Tiny-model-golf
For questions or to swap ideas, the Discord's open:
→ https://discord.gg/y2jTct6Cxv
Excited to see what y'all come up with. ♥
- Shane
reacted to alvarobartt's post 14 days ago
Post
3276
Latest hf-mem release added a breakdown of Mixture-of-Experts (MoE) memory usage!
TL;DR: MoEs can be misleading to reason about from active parameters alone, since each token only activates a subset of experts, while the serving setup still needs to account for the full resident memory footprint.
hf-mem now splits MoE memory into base model weights, routed experts, and KV cache
Dense models usually load and use most weights every forward pass, while MoEs load many experts but only route each token to a few of them
⚡ Active params aren't the same as memory footprint, especially for sparse architectures
📦 Runtime memory is about what is used per request/token, while loading memory also includes the expert weights that need to be resident
KV cache can still dominate depending on context length, batch size, and concurrency
Expert Parallelism (EP) helps shard experts across accelerators when expert weights dominate
Data Parallelism (DP) + EP is often a good fit for throughput-oriented MoE serving
Check the repository at https://github.com/alvarobartt/hf-mem
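The gap between active parameters and resident memory can be illustrated with a few lines of arithmetic. This is a sketch under stated assumptions and not hf-mem's implementation or API: `MoEConfig`, its fields, and all the numbers are hypothetical, and the KV cache formula is the standard one for a model that stores keys and values per layer and KV head.

```python
from dataclasses import dataclass


@dataclass
class MoEConfig:
    """Hypothetical MoE description; every field is illustrative."""
    base_params: int            # attention, embeddings, router, shared weights
    params_per_expert: int      # one expert in one layer
    n_experts: int              # experts per layer
    experts_per_token: int      # experts routed to per token
    n_layers: int
    n_kv_heads: int
    head_dim: int


def resident_weight_bytes(cfg: MoEConfig, bytes_per_param: int = 2) -> int:
    """All experts must be loaded, whether or not a token uses them."""
    experts = cfg.n_layers * cfg.n_experts * cfg.params_per_expert
    return (cfg.base_params + experts) * bytes_per_param


def active_params(cfg: MoEConfig) -> int:
    """Parameters actually touched by a single token."""
    routed = cfg.n_layers * cfg.experts_per_token * cfg.params_per_expert
    return cfg.base_params + routed


def kv_cache_bytes(cfg: MoEConfig, seq_len: int, batch: int,
                   bytes_per_elem: int = 2) -> int:
    """Keys and values (factor 2) for every layer, KV head, and position."""
    return (2 * cfg.n_layers * cfg.n_kv_heads * cfg.head_dim
            * seq_len * batch * bytes_per_elem)


if __name__ == "__main__":
    cfg = MoEConfig(1_000_000_000, 50_000_000, 8, 2, 10, 8, 128)
    print("resident weights (GB):", resident_weight_bytes(cfg) / 1e9)
    print("active params (B):", active_params(cfg) / 1e9)
    print("KV cache (GB):", kv_cache_bytes(cfg, seq_len=4096, batch=4) / 1e9)
```

In this made-up configuration only 2B parameters are active per token, yet 5B parameters (10 GB at 2 bytes each) have to stay resident.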
posted an update 17 days ago
Post
220
I did some testing on the scalability of FWKV. It hits a speed bottleneck at 1B due to the T4's bandwidth limitations. Theoretically, it should match RWKV's inference speed on a GPU with more bandwidth, so the 1B speed measurement is not accurate.
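The bandwidth argument can be made concrete with a back-of-the-envelope bound: during autoregressive decoding every weight is read roughly once per token, so memory bandwidth caps tokens per second. A minimal sketch, assuming fp16 weights and 320 GB/s for the T4 (the commonly quoted spec figure, not a number measured here); `max_decode_tokens_per_s` is a hypothetical helper that ignores KV cache and activation traffic.

```python
# ASSUMPTION: commonly quoted T4 memory bandwidth, in bytes per second.
T4_BANDWIDTH = 320e9


def max_decode_tokens_per_s(n_params: float, bytes_per_param: float = 2.0,
                            bandwidth: float = T4_BANDWIDTH) -> float:
    """Upper bound on decode speed when weight reads dominate."""
    return bandwidth / (n_params * bytes_per_param)


if __name__ == "__main__":
    for n in (29e6, 1e9):
        print(f"{n:.0e} params: at most {max_decode_tokens_per_s(n):.0f} tok/s")
```

Small models sit far below this ceiling, so other overheads decide their speed, while a 1B model gets much closer to it.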
replied to their post 18 days ago
Not yet. I'm still experimenting.
Once I get something that I'm pleased with I'm going to write a blog.
posted an update 18 days ago
Post
272
Greetings Hugging Face!
I started a new project called **FWKV** (Feed-forward Weighted Key Value, or Floored Weighted Key Value), an RWKV-style LM that uses FFNNs (Feed-Forward Neural Networks) instead of RNN and floor(W·K·V). I'm hoping to make it much more efficient and scalable than RWKV.
So far I have:
- FlameF0X/FWKV-29M: this one is undertrained and doesn't have a Space yet. In the attached image you can see its speed on a T4 compared to models with the same configuration.
The only model that's fully working right now is:
- FlameF0X/FWKV-TinyStories: trained on TinyStories for one epoch. The demo Space is FlameF0X/FWKV-demo.
reacted to ArtelTaleb's post with 🔥 25 days ago
Post
2538
✈️ World Flight Arcade - Can you land in 60 seconds?
I just dropped a new browser game built entirely with Three.js: World Flight Arcade
The concept is brutally simple:
- 60 seconds of flight above a neon wireframe city
- One single attempt to land on the runway
- No second chances. No respawn. Just you, the controls, and the clock.
The camera system is fully dynamic - it stays locked behind the plane within a ±45° pitch/yaw envelope, giving you that cinematic flight feel while keeping full spatial awareness.
Can you nail the landing on your first try?
Play here: ArtelTaleb/world-flight-arcade
Built by Artel3D - handcrafted in Three.js, zero dependencies, runs directly in your browser.
Drop your score in the comments.
#gamedev #threejs #browserGame #webgl #artel3d #indiegame
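The ±45° camera envelope described in the post amounts to clamping the camera's pitch and yaw offsets relative to the plane. The snippet below is only a sketch of that idea in Python, not the game's Three.js code; `clamp_camera_offset` and its parameter names are hypothetical.

```python
def clamp_camera_offset(pitch_deg: float, yaw_deg: float,
                        limit_deg: float = 45.0) -> tuple:
    """Keep the camera's pitch/yaw offset from the plane within +/- limit_deg."""
    def clamp(value: float) -> float:
        return max(-limit_deg, min(limit_deg, value))

    return clamp(pitch_deg), clamp(yaw_deg)


if __name__ == "__main__":
    print(clamp_camera_offset(60.0, -90.0))  # both offsets exceed the envelope
    print(clamp_camera_offset(10.0, 20.0))   # already inside, left unchanged
```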
reacted to HannesVonEssen's post with ❤️ 28 days ago
Post
231
📣 I made a visualizer for Hugging Face models: https://hfviewer.com
✨ Simply paste a Hugging Face URL to get an interactive visualization of the architecture!
The recent Qwen3.6-27B model as an example: https://hfviewer.com/Qwen/Qwen3.6-27B
Feel free to try it out and give me feedback on how it can be improved! ❤️
reacted to Crownelius's post with 🔥 about 1 month ago
Post
3824
[DAY ONE] PROJECT CROWFEATHER 4/30/2026
...The day I forgot to attach wandb.ai
Just dropped Crowfeather-50m, the first checkpoint in a series, and yeah, no graphs.
Crowfeather/Crowfeather-50m
54.5M params. Pretrain only. 17,500 steps banked on FineWeb-edu before Thunder credits ran dry. About 2.3B tokens, no SFT yet.
Architecture: Gemma-4 alternating sliding/global attention (1024 window, last layer always global) plus DeepSeek-V4 Muon optimizer plus WSD scheduler plus Gemma-2 logit soft-cap plus PaLM z-loss. Recipe in the model card.
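Three of the ingredients named above have compact textbook forms: a tanh logit soft-cap, a z-loss penalty on the log partition function, and a warmup-stable-decay (WSD) learning-rate schedule. The sketch below shows those generic forms in plain Python; it is not the Crowfeather recipe, and the cap value, z-loss coefficient, linear decay shape, and function names are all assumptions.

```python
import math


def soft_cap(logits, cap=30.0):
    """Squash logits smoothly into (-cap, cap) via cap * tanh(x / cap)."""
    return [cap * math.tanh(x / cap) for x in logits]


def z_loss(logits, coeff=1e-4):
    """Penalty coeff * log(Z)^2, with Z the softmax normalizer."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return coeff * log_z ** 2


def wsd_lr(step, total_steps, peak_lr, warmup_steps, decay_steps):
    """Linear warmup, flat plateau, then linear decay to zero (ASSUMED shape)."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    decay_start = total_steps - decay_steps
    if step < decay_start:
        return peak_lr
    return peak_lr * max(total_steps - step, 0) / decay_steps


if __name__ == "__main__":
    print(soft_cap([-100.0, 0.0, 100.0]))
    print(z_loss([0.0, 0.0]))
    print([wsd_lr(s, 100, 1.0, 10, 20) for s in (0, 50, 90, 100)])
```

One practical appeal of WSD is that the plateau can be extended and the decay applied later, which fits a run that is paused and resumed across compute providers.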
What it can do: writes grammatical English. Knows that France has Rhine-adjacent monasteries (it picked Rouen instead of Paris but the vocabulary is in there). Tells stories about Mr. Fabien.
What it can't do yet: facts, code, math. Base LM, no SFT, no instruction tuning.
The series:
Every additional training run becomes another model card here
Every model card gets a matching post on this profile
Continuation goes to Colab next, picking up from step 17500 out of 100k
Limited to one post a day on Hugging Face, so updates will trickle out at that pace. Follow @Crownelius and @Crowfeather if you want to watch this thing learn in public. Next drop will either come with the finished pre-train or whatever step I land on before the bank takes my credit card away.
Graphs will be available on my NEXT model lol
-Shane
reacted to anakin87's post with ❤️ about 1 month ago
Post
3339
A small model that struggled against a random opponent now beats GPT-5-mini at tic-tac-toe
I took LiquidAI/LFM2-2.6B and trained it through play.
🧑‍🍳 Here's how:
1️⃣ Build a solid RL env with Verifiers (Prime Intellect)
2️⃣ Generate synthetic data: <200 games sampled from GPT-5-mini playing in the env
3️⃣ SFT warm-up to teach format
4️⃣ Group-based RL (CISPO) against opponents making 20-70% random moves
5️⃣ RL again with stronger opponents (0-25% random moves) + 1.25 temperature to push exploration and shake off suboptimal strategies
Done! Beats GPT-5-mini
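The "opponents making 20-70% random moves" idea in steps 4 and 5 can be sketched as an opponent that plays uniformly at random with some probability and optimally otherwise. This is an illustration only, not the environment from the post or the Verifiers API: the board encoding, the minimax helper, and `opponent_move` are all hypothetical.

```python
import random

WINS = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
        (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]


def winner(board: str):
    """Return 'X' or 'O' if someone has three in a row, else None."""
    for a, b, c in WINS:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None


def minimax(board: str, player: str, me: str):
    """Return (score, move) from me's point of view: 1 win, 0 draw, -1 loss."""
    w = winner(board)
    if w is not None:
        return (1 if w == me else -1), None
    moves = [i for i, cell in enumerate(board) if cell == " "]
    if not moves:
        return 0, None
    other = "O" if player == "X" else "X"
    best_score, best_move = None, None
    for m in moves:
        score, _ = minimax(board[:m] + player + board[m + 1:], other, me)
        if best_score is None or (score > best_score if player == me
                                  else score < best_score):
            best_score, best_move = score, m
    return best_score, best_move


def opponent_move(board: str, player: str, random_rate: float,
                  rng: random.Random) -> int:
    """Play a random legal move with probability random_rate, else minimax."""
    moves = [i for i, cell in enumerate(board) if cell == " "]
    if rng.random() < random_rate:
        return rng.choice(moves)
    return minimax(board, player, player)[1]


if __name__ == "__main__":
    rng = random.Random(0)
    # Curriculum from the post: weaker opponents first, then stronger ones.
    for rate in (0.7, 0.2, 0.0):
        print(rate, opponent_move("XX OO    ", "X", rate, rng))
```

Lowering `random_rate` across training stages gives the kind of easy-to-hard opponent schedule the post describes.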
---
🎮 Play against the model: anakin87/LFM2-2.6B-mr-tictactoe
Model: anakin87/LFM2-2.6B-mr-tictactoe
Walkthrough/course: https://github.com/anakin87/llm-rl-environments-lil-course
Dataset and checkpoints: https://huggingface.co/collections/anakin87/lfm2-26b-mr-tic-tac-toe
reacted to SeaWolf-AI's post with 🔥 3 months ago
Post
5080
ALL Bench: Global AI Model Unified Leaderboard
FINAL-Bench/all-bench-leaderboard
If you've ever tried to compare GPT-5.2 and Claude Opus 4.6 side by side, you've probably hit the same wall: the official Hugging Face leaderboard only tracks open-source models, so the most widely used AI systems simply aren't there. ALL Bench fixes that by bringing closed-source models, open-weight models, and, uniquely, all four teams under South Korea's national sovereign AI program into a single leaderboard. Thirty-one frontier models, one consistent scoring scale.
Scoring works differently here too. Most leaderboards skip benchmarks a model hasn't submitted, which lets models game their ranking by withholding results. ALL Bench treats every missing entry as zero and divides by ten, so there's no advantage in hiding your weak spots.
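The zero-fill rule described above is easy to express in code. The following is a minimal sketch of that rule as stated in the post, not the leaderboard's actual implementation; `composite_score`, `mean_of_submitted`, and the example scores are hypothetical.

```python
N_BENCHMARKS = 10  # the post says the total is always divided by ten


def composite_score(scores: dict, n_benchmarks: int = N_BENCHMARKS) -> float:
    """Missing benchmarks count as zero: sum what exists, divide by the full count."""
    if len(scores) > n_benchmarks:
        raise ValueError("more scores than benchmarks")
    return sum(scores.values()) / n_benchmarks


def mean_of_submitted(scores: dict) -> float:
    """The skip-missing average that the post says can be gamed."""
    return sum(scores.values()) / len(scores)


if __name__ == "__main__":
    partial = {"GPQA Diamond": 80.0, "IFEval": 90.0}  # made-up numbers
    print("zero-filled:", composite_score(partial))
    print("skip-missing:", mean_of_submitted(partial))
```

A model reporting only two strong results averages 85 under skip-missing scoring but only 17 once the eight missing entries count as zero.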
The ten core benchmarks span reasoning (GPQA Diamond, AIME 2025, HLE, ARC-AGI-2), coding (SWE-bench Verified, LiveCodeBench), and instruction-following (IFEval, BFCL). The standout is FINAL Bench, the world's only benchmark measuring whether a model can catch and correct its own mistakes. It reached rank five in global dataset popularity on Hugging Face in February 2026 and has been covered by Seoul Shinmun, Asia Economy, IT Chosun, and Behind.
Nine interactive charts let you explore everything from composite score rankings and a full heatmap to an open-vs-closed scatter plot. Operational metrics like context window, output speed, and pricing are included alongside benchmark scores.
All data is sourced from Artificial Analysis Intelligence Index v4.0, arXiv technical reports, Chatbot Arena ELO ratings, and the Korean Ministry of Science and ICT's official evaluation results. Updates monthly.
reacted to marksverdhei's post 4 months ago
Post
2702
Dear Hugging Face team, can we please have a way to archive HF repositories / Spaces? I have a bunch of Spaces that used to work but don't anymore because the HF Space implementations changed, and I think it would be good if I could archive those like on GitHub.
React to this post if you want to see this feature! 💡
reacted to IlyasMoutawwakil's post with 🔥 4 months ago
Post
2474
After 2 months of refinement, I'm happy to announce that a lot of Transformers' modeling code is now significantly more torch-compile & export-friendly 🔥
Why it had to be done
PyTorch's Dynamo compiler is increasingly becoming the default interoperability layer for ML systems. Anything that relies on torch.export or torch.compile, from model optimization to cross-framework integrations, benefits directly when models can be captured as a single dynamo-traced graph!
Transformers models are now easier to:
⚙️ Compile end-to-end with torch.compile backends
📦 Export reliably via torch.export and torch.onnx.export
Deploy to ONNX / ONNX Runtime, Intel Corporation's OpenVINO, NVIDIA AutoDeploy (TRT-LLM), AMD's Quark, Meta's Executorch and more hardware-specific runtimes.
This work aims at unblocking entire TorchDynamo-based toolchains that rely on exporting Transformers across runtimes and accelerators.
We are doubling down on Transformers' commitment to be a first-class citizen of the PyTorch ecosystem: more exportable, more optimizable, and easier to deploy everywhere.
There are definitely some edge cases that we still haven't addressed, so don't hesitate to try compiling / exporting your favorite transformers and to open issues / PRs.
PR in the comments! More updates coming soon!
reacted to danielhanchen's post with 🔥 5 months ago
Post
2920
You can now do reinforcement learning training with 7× longer context and no accuracy loss, via our new batching algorithms.
Long reasoning chains in RL are costly, but now we enable you to train gpt-oss with GRPO & reach 380K context on a 192GB GPU.
Blog: https://unsloth.ai/docs/new/grpo-long-context