omarkamali (Omar Kamali)

replied to their post about 1 month ago

Hey @MarcusLammers, this is a great idea! I will try to include it in a future iteration based on Wikipedia categories.

posted an update about 1 month ago

Another month, another Wikipedia Monthly release! 🎃

Highlights of October's edition:
· 🗣️ 341 languages
· 📚 64.7M articles (+2.5%)
· 📦 89.4GB of data (+3.3%)

We now use reservoir sampling to draw a random subset of each language, producing splits named 1000, 5000, and 10000 in addition to the existing train split, which contains all the data.
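
For readers unfamiliar with reservoir sampling, here is a minimal sketch of the classic version (Algorithm R). It is only an illustration of the technique and not necessarily the code used to build these splits; the function name `reservoir_sample` and the fake article stream are hypothetical.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Return up to k items drawn uniformly at random from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Pick a slot in [0, i]; the new item is kept with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


# Hypothetical stand-in for one language's articles, consumed in a single pass.
articles = (f"article-{n}" for n in range(100_000))
sample = reservoir_sample(articles, k=1000, seed=0)
print(len(sample))
```

The appeal of this approach is that the full list of articles never has to sit in memory: one pass over the data is enough, whatever the size of the language.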

Now you can load the English (or your favorite language) subset in seconds:
from datasets import load_dataset
dataset = load_dataset("omarkamali/wikipedia-monthly", "latest.en", split="10000")
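
If you want the smallest split that covers a given number of rows, a small helper along these lines may be handy. This is a sketch under assumptions: `pick_split` and `load_wikipedia_sample` are hypothetical names, and the `latest.<lang>` config pattern is generalized from the `latest.en` example above.

```python
# Sampled split sizes mentioned in this post; "train" holds all the data.
SAMPLED_SPLITS = (1000, 5000, 10000)


def pick_split(n_rows: int) -> str:
    """Return the smallest sampled split with at least n_rows rows, else "train"."""
    for size in SAMPLED_SPLITS:
        if n_rows <= size:
            return str(size)
    return "train"


def load_wikipedia_sample(lang: str, n_rows: int):
    """Load the chosen split for one language (downloads data when called)."""
    # Imported lazily so pick_split can be used without the datasets package.
    from datasets import load_dataset

    # Assumption: every language follows the "latest.<lang>" config naming.
    return load_dataset(
        "omarkamali/wikipedia-monthly",
        f"latest.{lang}",
        split=pick_split(n_rows),
    )
```

For example, `load_wikipedia_sample("en", 3000)` would request the 5000 split.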

Happy data engineering! 🧰

omarkamali/wikipedia-monthly

replied to their post about 1 month ago

May God protect you, brother @ayymen!

posted an update about 2 months ago

**Wikipedia Monthly's September edition is now live 🎉**

Highlights of this edition:
· 🗣️ 341 languages
· 📚 63.1M articles
· 📦 86.5GB of data

This update also fixes the upload issues in the August edition, where some languages had missing parts. Happy data engineering!

omarkamali/wikipedia-monthly

replied to alielfilali01's post about 1 year ago

Appreciate the shoutout @alielfilali01 🙌

I hope you will find Datapluck to be useful!

reacted to alielfilali01's post with ❤️ about 1 year ago

Datapluck: Portability Tool for Huggingface Datasets

"I found myself recently whipping up notebooks just to pull huggingface datasets locally, annotate or operate changes and update them again. This happened often enough that I made a cli tool out of it, which I've been using successfully for the last few months.

While huggingface uses open formats, I found the official toolchain relatively low-level and not adapted to quick operations such as what I am doing."
~ @omarkamali

Link: https://omarkama.li/blog/datapluck

Omar Kamali

AI & ML interests

Organizations

omarkamali's activity