AI & ML interests

NLP for African languages: machine translation (MT), named entity recognition (NER), part-of-speech tagging (POS), question answering (QA), ...

Recent Activity

israel updated a collection 5 days ago: AfrIFact
israel updated a dataset 9 days ago: masakhane/AfrIFact
israel updated a collection 15 days ago: AfrIFact

omarkamali posted an update 6 days ago
🌐 LID Benchmark update:

• 10 Regional Leaderboards
• 17 LID models (+7 new, including non-fastText-based models)
• 449 languages in total (200+ additional)
• Fixed: F1 macro reporting error
• Normalized language codes for more accurate results

The dataset is also updated, now with individual model predictions to reproduce and validate our findings.
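For reference, macro F1 (the metric whose reporting was fixed) averages per-language F1 scores uniformly, so every language counts equally regardless of how many test sentences it has. Here is a minimal sketch, not the benchmark's actual code, including a toy code-normalization step; the mapping below is illustrative, not the benchmark's real table:

```python
# Illustrative macro-F1 over language labels, with language-code
# normalization so that e.g. "en" and "eng" count as the same language.
from collections import defaultdict

# Hypothetical normalization table: ISO 639-1 -> ISO 639-3.
NORMALIZE = {"en": "eng", "fr": "fra", "ar": "ara"}

def norm(code):
    """Map a language code to a canonical form."""
    return NORMALIZE.get(code, code)

def macro_f1(gold, pred):
    """Per-language F1, averaged uniformly over languages.

    Macro averaging weights every language equally, so low-resource
    languages count as much as English; a micro average would hide them.
    """
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for g, p in zip(gold, pred):
        g, p = norm(g), norm(p)
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1
            fn[g] += 1
    langs = set(tp) | set(fp) | set(fn)
    f1s = []
    for lang in langs:
        prec = tp[lang] / (tp[lang] + fp[lang]) if tp[lang] + fp[lang] else 0.0
        rec = tp[lang] / (tp[lang] + fn[lang]) if tp[lang] + fn[lang] else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

gold = ["en", "fra", "ara", "eng"]
pred = ["eng", "fra", "en", "fra"]
print(round(macro_f1(gold, pred), 3))  # 0.389
```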

omneity-labs/lid-benchmark
omarkamali posted an update 17 days ago
Omneity Labs LID Benchmark is live 🔥

- 8 Evals
- 10 Models (GlotLID, OpenLID, our own Gherbal and others)
- 200+ Languages
- One Leaderboard To Rule Them All!

Come find your language and see which LID model supports it best in this Space 👇

omneity-labs/lid-benchmark
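The "which model supports your language best" lookup boils down to an argmax over per-model, per-language scores. A toy sketch, with placeholder numbers that are not real benchmark results:

```python
# Hypothetical per-model, per-language F1 scores (placeholder values only).
SCORES = {
    "GlotLID": {"eng": 0.99, "ary": 0.86, "yor": 0.91},
    "OpenLID": {"eng": 0.99, "ary": 0.81},
    "Gherbal": {"eng": 0.98, "ary": 0.95, "yor": 0.88},
}

def best_model(lang):
    """Return (model, score) with the highest score for `lang`,
    or None if no model supports the language at all."""
    candidates = [(m, langs[lang]) for m, langs in SCORES.items() if lang in langs]
    if not candidates:
        return None
    return max(candidates, key=lambda t: t[1])

print(best_model("ary"))  # ('Gherbal', 0.95)
```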
omarkamali posted an update 18 days ago
I just might have cracked tokenizer-free LLMs. No vocab, no softmax.

I'm training a 22M-parameter LLM right now to test this "thing", and it's already able to formulate coherent sentences 🤯

Bear in mind, this is a completely new, tokenizer-free LLM architecture with built-in language universality.

Check the explainer video to understand what's happening. Feedback welcome on this approach!

omarkamali posted an update about 1 month ago
You're probably training on outdated Wikipedia data right now and don't know it. 💡

In June last year, a friend from the Moroccan Wikipedia community slid into my DMs: "Are you using the current version? The official dataset is severely outdated. We added so many articles that are nowhere to be found on Hugging Face."

He was right. I was running a 2023 snapshot. In 2025. The official Wikipedia dataset, the one hundreds of labs and researchers grab by default without a second thought, was frozen in time.
• For English, that's 700,000 missing articles.
• For Moroccan Arabic, 30% of the language's entire Wikipedia.
• For 31 other languages, there was literally no text corpus at all until recently.

I could've shrugged and moved on. Instead I spent the next months building a monthly automated pipeline for 340+ languages, on my personal laptop, nearly killing it several times in the process (100% disk, frozen screen, the works).

Nous Research trained Hermes 4 on it. INRIA cited it. It's now three years ahead of what most people are training on.

Here's the full story of how I built Wikipedia Monthly 👇

https://omarkamali.com/blog/wikipedia-monthly-pipeline
Tonic posted an update about 2 months ago
🤔 Who would win?

- a fully subsidized AI lab
OR
- 3 random students named kurakurai?

demo : Tonic/fr-on-device

If you like it, give the demo a little star and send a shoutout to @MaxLSB, @jddqd, and @GAD-cell for absolutely obliterating the Pareto frontier of French language understanding.
Tonic posted an update about 2 months ago
🙋🏻‍♂️ Hello my lovelies,

It is with great pleasure that I present to you my working one-click-deploy, 16 GB RAM, completely free Hugging Face Spaces deployment.

repo : Tonic/hugging-claw (use git clone to inspect)
literally the one-click link : Tonic/hugging-claw

You can also run it locally and see for yourself:

docker run -it -p 7860:7860 --platform=linux/amd64 \
-e HF_TOKEN="YOUR_VALUE_HERE" \
-e OPENCLAW_GATEWAY_TRUSTED_PROXIES="YOUR_VALUE_HERE" \
-e OPENCLAW_GATEWAY_PASSWORD="YOUR_VALUE_HERE" \
-e OPENCLAW_CONTROL_UI_ALLOWED_ORIGINS="YOUR_VALUE_HERE" \
registry.hf.space/tonic-hugging-claw:latest


Just a few minor details left for me to take care of, but I wanted to share here first.