Occiglot

community

https://occiglot.eu/

Activity Feed

AI & ML interests

Open Source Language Models for Europe

Recent Activity

stefan-it

authored a paper 23 days ago

SindBERT, the Sailor: Charting the Seas of Turkish NLP

Paper • 2510.21364 • Published 26 days ago • 1

stefan-it

authored a paper about 1 month ago

The German Commons - 154 Billion Tokens of Openly Licensed Text for German Language Models

Paper • 2510.13996 • Published Oct 15 • 7

mbrack

authored a paper about 1 month ago

UniFusion: Vision-Language Model as Unified Encoder in Image Generation

Paper • 2510.12789 • Published Oct 14 • 17

BramVanroy

posted an update about 1 month ago

Post

287

What are currently the best multilingual models with at most 72B parameters? Are Llama 3.3 70B and Qwen 2.5 72B still king?

1 reply

stefan-it

authored a paper 2 months ago

Llama-GENBA-10B: A Trilingual Large Language Model for German, English and Bavarian

Paper • 2509.05668 • Published Sep 6 • 5

BramVanroy

posted an update 3 months ago

Post

764

By popular request, I've just added two subsets to the CommonCrawl-Creative Commons Corpus (C5; BramVanroy/CommonCrawl-CreativeCommons) so that you do not have to do the filtering manually:

- C5f (BramVanroy/CommonCrawl-CreativeCommons-fine): only retains high-quality samples that are also present in FineWeb or FineWeb-2;
- C5r (https://huggingface.co/datasets/BramVanroy/CommonCrawl-CreativeCommons-recommended): applies additional strict filtering that removes samples with license disagreement, samples with non-commercial licenses, and Wikipedia samples. Wikipedia samples are removed because you should probably get those from a more reliable source that provides better-parsed content.

These filters lead to a massive reduction in quantity. Document and token counts are given on the dataset pages.
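To make the three C5r criteria concrete, here is a minimal sketch of what such a filter could look like. It is an illustration only, not the code used to build the dataset: the record fields `licenses` and `url`, the license identifier format, and the helper names `is_noncommercial` and `keep_sample` are all assumptions made for this example and may not match the actual dataset schema.

```python
from urllib.parse import urlparse


def is_noncommercial(license_id: str) -> bool:
    """Return True for identifiers with an 'nc' part, e.g. 'cc-by-nc-4.0'.

    The hyphen-separated identifier format is an assumption of this sketch.
    """
    return "nc" in license_id.lower().split("-")


def keep_sample(sample: dict) -> bool:
    """Apply the three criteria described in the post to one record.

    `sample` is a hypothetical record with a `licenses` list (every license
    detected for the page) and a `url` string.
    """
    licenses = set(sample["licenses"])

    # License disagreement: the page does not resolve to exactly one license.
    if len(licenses) != 1:
        return False

    # Non-commercial licenses are dropped.
    if any(is_noncommercial(lic) for lic in licenses):
        return False

    # Wikipedia pages are dropped; a dedicated dump is a better source.
    host = urlparse(sample["url"]).hostname or ""
    if host == "wikipedia.org" or host.endswith(".wikipedia.org"):
        return False

    return True


samples = [
    {"url": "https://example.org/a", "licenses": ["cc-by-4.0"]},
    {"url": "https://example.org/b", "licenses": ["cc-by-4.0", "cc-by-sa-4.0"]},
    {"url": "https://example.org/c", "licenses": ["cc-by-nc-4.0"]},
    {"url": "https://en.wikipedia.org/wiki/X", "licenses": ["cc-by-sa-4.0"]},
]
recommended = [s for s in samples if keep_sample(s)]
```

In this toy input only the first record survives, which mirrors the point that strict filtering sharply reduces the amount of data.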

s-conia

authored a paper 5 months ago

Right Answer, Wrong Score: Uncovering the Inconsistencies of LLM Evaluation in Multiple-Choice Question Answering

Paper • 2503.14996 • Published Mar 19 • 3

mbrack

authored a paper 5 months ago

How to Train your Text-to-Image Model: Evaluating Design Choices for Synthetic Training Captions

Paper • 2506.16679 • Published Jun 20 • 1

eliaswendt

authored a paper 6 months ago

Judging Quality Across Languages: A Multilingual Approach to Pretraining Data Filtering with Language Models

Paper • 2505.22232 • Published May 28 • 18

mkrausio

authored 2 papers 6 months ago

Right on Time: Revising Time Series Models by Constraining their Explanations

Paper • 2402.12921 • Published Feb 20, 2024

Judging Quality Across Languages: A Multilingual Approach to Pretraining Data Filtering with Language Models

Paper • 2505.22232 • Published May 28 • 18

mbrack

authored 9 papers 6 months ago

Class Attribute Inference Attacks: Inferring Sensitive Class Information by Diffusion-Based Attribute Manipulations

Paper • 2303.09289 • Published Mar 16, 2023 • 2

Distilling Adversarial Prompts from Safety Benchmarks: Report for the Adversarial Nibbler Challenge

Paper • 2309.11575 • Published Sep 20, 2023

MultiFusion: Fusing Pre-Trained Models for Multi-Lingual, Multi-Modal Image Generation

Paper • 2305.15296 • Published May 24, 2023 • 1

Mitigating Inappropriateness in Image Generation: Can there be Value in Reflecting the World's Ugliness?

Paper • 2305.18398 • Published May 28, 2023 • 2

LLavaGuard: VLM-based Safeguards for Vision Dataset Curation and Safety Assessment

Paper • 2406.05113 • Published Jun 7, 2024 • 3

AtMan: Understanding Transformer Predictions Through Memory Efficient Attention Manipulation

Paper • 2301.08110 • Published Jan 19, 2023 • 1

Team members 15
