• 10 regional leaderboards
• 17 LID models (+7 new, including non-fastText-based ones)
• 449 languages in total (200+ additional)
• Fixed: macro-F1 reporting error
• Normalized language codes for more accurate results
The dataset is also updated, and now includes per-model predictions so you can reproduce and validate our findings.
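The macro-F1 fix matters because macro averaging weights every language equally, so a reporting bug there skews rankings for low-resource languages the most. Here's a minimal plain-Python sketch of macro F1 on toy labels (illustrative only, not the leaderboard's actual evaluation code):

```python
from collections import defaultdict

def macro_f1(y_true, y_pred):
    """Macro F1: compute F1 per class, then average with equal weight.
    Unlike accuracy/micro F1, a rare language counts as much as English."""
    labels = set(y_true) | set(y_pred)
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted p, but it was really t
            fn[t] += 1  # missed a true t
    f1s = []
    for c in labels:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

# Toy example: 'eng' dominates, 'ary' (Moroccan Arabic) is rare.
y_true = ["eng"] * 8 + ["ary"] * 2
y_pred = ["eng"] * 8 + ["eng", "ary"]  # one 'ary' misclassified as 'eng'
```

On this toy split, accuracy is 0.90 but macro F1 is ~0.80: the single missed Moroccan Arabic sample drags its class F1 down to 0.67, and macro averaging makes that visible instead of burying it under English.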
You're probably training on outdated Wikipedia data right now and don't know it. 💡
In June last year, a friend from the Moroccan Wikipedia community slid into my DMs: "Are you using the current version? The official dataset is severely outdated. We've added so many articles that are nowhere to be found on HuggingFace."
He was right. I was running a 2023 snapshot. In 2025. The official Wikipedia dataset, the one hundreds of labs and researchers grab by default without a second thought, was frozen in time.

• For English, that meant 700,000 missing articles.
• For Moroccan Arabic, 30% of the language's entire Wikipedia.
• For 31 other languages, there was no text corpus at all until recently.
I could've shrugged and moved on. Instead, I spent the following months building a monthly automated pipeline covering 340+ languages, on my personal laptop, nearly killing it several times in the process (100% disk usage, frozen screen, the works).
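For the curious, the scheduling core of a monthly pipeline like this boils down to one question per wiki: which is the newest dump that actually finished? A hypothetical sketch (the function and its inputs are illustrative, not the actual Wikipedia Monthly code), assuming dump directories are named `YYYYMMDD` as on dumps.wikimedia.org and that you've already parsed which runs completed:

```python
from datetime import datetime

def latest_complete_dump(dump_dirs, completed):
    """Pick the newest finished dump for one wiki.

    dump_dirs: directory names like '20250601' listed for the wiki.
    completed: set of those dates whose dump run finished successfully
               (e.g. parsed from the per-dump status files).
    Returns the newest completed date string, or None if nothing finished.
    """
    done = [d for d in dump_dirs if d in completed]
    if not done:
        return None  # skip this wiki this month; retry on the next run
    return max(done, key=lambda d: datetime.strptime(d, "%Y%m%d"))

# e.g. the June run is still in progress, so May is the one to ingest:
picked = latest_complete_dump(
    ["20250501", "20250601"], completed={"20250501"}
)
```

Running this check per language, then downloading and parsing only wikis whose latest dump is newer than the last ingested snapshot, is what keeps 340+ languages tractable on a single laptop.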
Nous Research trained Hermes 4 on it. INRIA cited it. The dataset is now three years ahead of what most people are training on.
Here's the full story of how I built Wikipedia Monthly 👇
If you like it, give the demo a little star and send a shoutout to @MaxLSB, @jddqd, and @GAD-cell for absolutely obliterating the Pareto frontier of French language understanding.