• 10 regional leaderboards
• 17 LID models (+7 new, including non-fastText-based ones)
• 449 languages in total (200+ additional)
• Fixed: macro-F1 reporting error
• Normalized language codes for more accurate results
The dataset is also updated, and now includes per-model predictions so you can reproduce and validate our findings.
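The macro-F1 fix matters because macro averaging weights every language equally, so a reporting bug there skews rankings for low-resource languages the most. Here's a minimal plain-Python sketch of macro F1 on toy labels (illustrative only, not the leaderboard's actual evaluation code):

```python
from collections import defaultdict

def macro_f1(y_true, y_pred):
    """Macro F1: compute F1 per class, then average with equal weight.
    Unlike accuracy/micro F1, a rare language counts as much as English."""
    labels = set(y_true) | set(y_pred)
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted p, but it was really t
            fn[t] += 1  # missed a true t
    f1s = []
    for c in labels:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

# Toy example: 'eng' dominates, 'ary' (Moroccan Arabic) is rare.
y_true = ["eng"] * 8 + ["ary"] * 2
y_pred = ["eng"] * 8 + ["eng", "ary"]  # one 'ary' misclassified as 'eng'
```

On this toy split, accuracy is 0.90 but macro F1 is ~0.80: the single missed Moroccan Arabic sample drags its class F1 down to 0.67, and macro averaging makes that visible instead of burying it under English.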
You're probably training on outdated Wikipedia data right now and don't know it. 💡
In June last year, a friend from the Moroccan Wikipedia community slid into my DMs: "Are you using the current version? The official dataset is severely outdated. We've added so many articles that are nowhere to be found on HuggingFace."
He was right. I was running a 2023 snapshot. In 2025. The official Wikipedia dataset, the one hundreds of labs and researchers grab by default without a second thought, was frozen in time.

• For English, that meant 700,000 missing articles.
• For Moroccan Arabic, 30% of the language's entire Wikipedia.
• For 31 other languages, there was no text corpus at all until recently.
I could've shrugged and moved on. Instead, I spent the following months building a monthly automated pipeline covering 340+ languages, on my personal laptop, nearly killing it several times in the process (100% disk usage, frozen screen, the works).
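For the curious, the scheduling core of a monthly pipeline like this boils down to one question per wiki: which is the newest dump that actually finished? A hypothetical sketch (the function and its inputs are illustrative, not the actual Wikipedia Monthly code), assuming dump directories are named `YYYYMMDD` as on dumps.wikimedia.org and that you've already parsed which runs completed:

```python
from datetime import datetime

def latest_complete_dump(dump_dirs, completed):
    """Pick the newest finished dump for one wiki.

    dump_dirs: directory names like '20250601' listed for the wiki.
    completed: set of those dates whose dump run finished successfully
               (e.g. parsed from the per-dump status files).
    Returns the newest completed date string, or None if nothing finished.
    """
    done = [d for d in dump_dirs if d in completed]
    if not done:
        return None  # skip this wiki this month; retry on the next run
    return max(done, key=lambda d: datetime.strptime(d, "%Y%m%d"))

# e.g. the June run is still in progress, so May is the one to ingest:
picked = latest_complete_dump(
    ["20250501", "20250601"], completed={"20250501"}
)
```

Running this check per language, then downloading and parsing only wikis whose latest dump is newer than the last ingested snapshot, is what keeps 340+ languages tractable on a single laptop.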
Nous Research trained Hermes 4 on it. INRIA cited it. The dataset is now three years ahead of what most people are training on.
Here's the full story of how I built Wikipedia Monthly 👇
If you like it, give the demo a little star and send a shoutout to @MaxLSB, @jddqd, and @GAD-cell for absolutely obliterating the Pareto frontier of French language understanding.