Post 317
What are currently the best multilingual models with at most 72B parameters? Are Llama 3.3 70B and Qwen 2.5 72B still king?
Post 791
Thanks to popular request, I've just added two subsets to the CommonCrawl-Creative Commons Corpus (C5; BramVanroy/CommonCrawl-CreativeCommons) so that you do not have to do the filtering manually:

- C5f (BramVanroy/CommonCrawl-CreativeCommons-fine): retains only high-quality samples that are also present in FineWeb or FineWeb-2;
- C5r (https://huggingface.co/datasets/BramVanroy/CommonCrawl-CreativeCommons-recommended): applies additional strict filtering that removes samples with license disagreement, non-commercial licenses, and Wikipedia samples. The latter because you should probably get Wikipedia from a more reliable source that provides better-parsed content.

It goes without saying that these filters lead to a massive reduction in quantity. Document and token counts are given on the dataset pages.
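If you do want to reproduce the C5r-style filtering yourself rather than use the prebuilt subset, the three criteria above can be sketched as a simple predicate. This is a minimal illustration on toy records; the field names (`license_type`, `license_disagreement`, `domain`) are assumptions for the sketch, not necessarily the dataset's actual schema.

```python
# Sketch of the C5r-style filtering described above, applied to toy records.
# Field names are illustrative assumptions, not the dataset's real schema.

def keep_for_c5r(sample: dict) -> bool:
    """Keep a sample only if it passes all three C5r-style filters."""
    # 1. Drop samples where detected licenses disagree.
    if sample["license_disagreement"]:
        return False
    # 2. Drop non-commercial licenses (e.g. CC BY-NC variants).
    if "nc" in sample["license_type"].split("-"):
        return False
    # 3. Drop Wikipedia samples (better obtained from a dedicated dump).
    if "wikipedia.org" in sample["domain"]:
        return False
    return True

samples = [
    {"license_type": "by",    "license_disagreement": False, "domain": "example.com"},
    {"license_type": "by-nc", "license_disagreement": False, "domain": "blog.example.org"},
    {"license_type": "by-sa", "license_disagreement": True,  "domain": "example.net"},
    {"license_type": "by-sa", "license_disagreement": False, "domain": "en.wikipedia.org"},
]

kept = [s for s in samples if keep_for_c5r(s)]
print(len(kept))  # → 1: only the first record survives all three filters
```

In practice you could pass such a predicate to `datasets.Dataset.filter` after loading the base C5 corpus, but the prebuilt C5f/C5r subsets save you that compute.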
Stance-aware Definition Generation for Argumentative Texts Collection Models and datasets for the paper: "Stance-aware Definition Generation for Argumentative Texts" • 7 items • Updated Aug 6