Language and Translation Technology Team

university

https://lt3.ugent.be/

lt3ugent

lt3

AI & ML interests

terminology; sentiment analysis; natural language processing; emotion detection; machine translation

Recent Activity

Amala3 published a dataset 4 days ago

LT3/EmotioNL_Tweets

clark-12 updated a dataset 3 months ago

LT3/UniC

natalievgrafova updated a dataset 4 months ago

LT3/abortion_definitions_annotations

Amala3

published a dataset 4 days ago

LT3/EmotioNL_Tweets

Viewer • Updated Dec 5, 2024 • 1k • 4

BramVanroy

posted an update about 2 months ago

What are currently the best multilingual models with at most 72B parameters? Are Llama 3.3 70B and Qwen 2.5 72B still king?

clark-12

updated a dataset 3 months ago

LT3/UniC

Viewer • Updated Aug 21 • 964 • 16

BramVanroy

posted an update 4 months ago

By popular request, I've just added two subsets to the CommonCrawl-Creative Commons Corpus (C5; BramVanroy/CommonCrawl-CreativeCommons) so that you do not have to filter manually:

- C5f (BramVanroy/CommonCrawl-CreativeCommons-fine): only retains high-quality samples that are also present in FineWeb or FineWeb-2;
- C5r (https://huggingface.co/datasets/BramVanroy/CommonCrawl-CreativeCommons-recommended): additional strict filtering that removes samples with license disagreement, samples with non-commercial licenses, and Wikipedia samples. Wikipedia samples are removed because you should probably get them from a more reliable source that provides better-parsed content.

It goes without saying that these filters lead to a massive reduction in quantity. Doc and token counts are given on the dataset pages.
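
For readers who want to see what the three C5r criteria amount to, here is a minimal sketch of such a filter applied to in-memory records. It is not the actual filtering code behind the subsets, and it does not cover the FineWeb overlap check used for C5f. The field names (`url`, `license`, `license_disagreement`), the license string format, and the sample records are all hypothetical; check the dataset pages for the real schema.

```python
# Illustrative only: field names and license strings below are assumptions,
# not the actual schema of the C5 datasets.

def is_recommended(sample: dict) -> bool:
    """Return True if a sample passes the three C5r-style checks."""
    # 1. Drop samples flagged as having disagreeing license signals.
    if sample["license_disagreement"]:
        return False
    # 2. Drop non-commercial licenses; assumes hyphenated names such as "cc-by-nc".
    if "nc" in sample["license"].lower().split("-"):
        return False
    # 3. Drop Wikipedia pages, which are better obtained from a dedicated source.
    if "wikipedia.org" in sample["url"]:
        return False
    return True


# Hypothetical records for demonstration.
samples = [
    {"url": "https://example.com/a", "license": "cc-by", "license_disagreement": False},
    {"url": "https://example.com/b", "license": "cc-by-nc", "license_disagreement": False},
    {"url": "https://example.com/c", "license": "cc-by-sa", "license_disagreement": True},
    {"url": "https://en.wikipedia.org/wiki/X", "license": "cc-by-sa", "license_disagreement": False},
]

kept = [s for s in samples if is_recommended(s)]
print([s["url"] for s in kept])
```

Only one of the four toy records survives, which mirrors the point above: each extra criterion cuts the data further.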