Another month, another Wikipedia Monthly release!
Highlights of October's edition: 341 languages · 64.7M articles (+2.5%) · 89.4 GB of data (+3.3%)
We now also draw a random subset of each language using reservoir sampling, producing the splits 1000, 5000, and 10000 in addition to the existing train split, which contains all the data.
You can now load the English (or your favorite language) subset in seconds: dataset = load_dataset("omarkamali/wikipedia-monthly", "latest.en", split="10000")
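For readers unfamiliar with reservoir sampling, here is a minimal sketch of the classic single-pass variant (often called Algorithm R). It is an illustration only, not the actual Wikipedia Monthly pipeline: the function name reservoir_sample, the seed parameter, and the fake article stream are all assumptions made for this example.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(items: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Return up to k items chosen uniformly at random from a stream.

    Hypothetical helper for illustration; the real pipeline may differ.
    The stream is read once and only k items are kept in memory.
    """
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, item in enumerate(items):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Pick a slot in [0, i]; the new item survives with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


# Stand-in for a stream of articles too large to hold in memory (made-up data).
articles = (f"article-{n}" for n in range(100_000))
sample = reservoir_sample(articles, k=1000, seed=0)
print(len(sample))
```

The appeal of this approach for a dump-sized corpus is that the total number of articles does not need to be known in advance, and a language with fewer than k articles simply yields all of them.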
Datapluck: Portability Tool for Hugging Face Datasets
"I recently found myself whipping up notebooks just to pull Hugging Face datasets locally, annotate or modify them, and upload them again. This happened often enough that I made a CLI tool out of it, which I've been using successfully for the last few months.
While Hugging Face uses open formats, I found the official toolchain relatively low-level and not well suited to quick operations like the ones I am doing." ~ @omarkamali