MathematicianNLPer committed · verified
Commit 98897e8 · 1 parent: 2aab9ba

Copy from GemMaroc/Qwen2.5-14B-Instruct-darija

Files changed (1): README.md added (+160 lines)

---
library_name: transformers
tags:
- MoroccanArabic
- Darija
- GemMaroc
- conversational
- qwen
pipeline_tag: text-generation
datasets:
- GemMaroc/TULU-3-50k-darija-english
language:
- ar
- ary
- en
base_model:
- Qwen/Qwen2.5-14B-Instruct
---

# Qwen2.5-14B-Instruct-darija

Unlocking **Moroccan Darija** proficiency in a powerful large language model, trained with a _minimal-data, green-AI_ recipe that preserves Qwen2.5-14B-Instruct's strong reasoning abilities while adding fluent Darija generation.

---

## Model at a glance

|                     | Details                                                                                                        |
| ------------------- | -------------------------------------------------------------------------------------------------------------- |
| **Model ID**        | `AbderrahmanSkiredj1/Qwen2.5-14B-Instruct-darija`                                                               |
| **Base model**      | [`Qwen/Qwen2.5-14B-Instruct`](https://huggingface.co/Qwen/Qwen2.5-14B-Instruct)                                 |
| **Architecture**    | Decoder-only Transformer (Qwen2.5)                                                                              |
| **Parameters**      | 14 billion                                                                                                      |
| **Context length**  | 32,768 tokens                                                                                                   |
| **Training regime** | Supervised fine-tuning (LoRA, then merged) on a 50 K high-quality Darija/English instruction slice of TULU 3   |
| **Compute budget**  | 32 GPU·h (8 × H100-80GB × 4 h), ≈ 18 kWh / 7 kg CO₂e                                                            |
| **License**         | Apache 2.0                                                                                                      |

---

## Why another Darija model?

- **Inclusive AI:** More than 36 million speakers of Moroccan Arabic remain underserved by open LLMs.
- **Quality over quantity:** A carefully curated 50 K instruction set surfaces Darija competence without sacrificing cross-lingual reasoning.
- **Green AI:** Qwen2.5-14B-Instruct-darija reaches competitive Darija scores on roughly 18 kWh of training energy.
- **Balanced performance:** 14 B parameters strike a practical balance between capability and resource requirements.

---

## Benchmark summary

### Darija Benchmarks

| Model                           | Darija MMLU | Darija HellaSwag | Sentiment Analysis | GSM8K Darija | Summarization (chrF) | ROUGE-1 | ROUGE-L | BERTScore |
| ------------------------------- | ----------- | ---------------- | ------------------ | ------------ | -------------------- | ------- | ------- | --------- |
| Qwen2.5-14B-Instruct            | 57.5 %      | 45.9 %           | 63.3 %             | 72.3 %       | 26.3                 | 9.1     | 8.8     | 36.2      |
| **Qwen2.5-14B-Instruct-darija** | **57.6 %**  | **51.3 %**       | **64.1 %**         | **83.3 %**   | **27.1**             | 7.3     | 7.1     | **38.8**  |

### English Benchmarks

| Model                           | MMLU       | TruthfulQA | HellaSwag  | GSM8K @5 | GSM8K Gen |
| ------------------------------- | ---------- | ---------- | ---------- | -------- | --------- |
| Qwen2.5-14B-Instruct            | 76.8 %     | 70.8 %     | 75.6 %     | 80.2 %   | 94.2 %    |
| **Qwen2.5-14B-Instruct-darija** | **75.3 %** | 61.7 %     | **79.1 %** | 75.6 %   | 92.4 %    |

<sub>Zero-shot accuracy; full table in the paper.</sub>

---

## Quick start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "AbderrahmanSkiredj1/Qwen2.5-14B-Instruct-darija"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",  # dispatches the 14 B weights across available GPUs
)

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=1024,
    do_sample=True,          # sampling must be enabled for temperature to take effect
    temperature=0.7,
    repetition_penalty=1.2,
    no_repeat_ngram_size=3,
)

messages = [
    # "What is the 'butterfly effect' theory? Explain it in Darija and give a simple example."
    {"role": "user", "content": "شنو هي نظرية 'butterfly effect'؟ فسّرها بدارجة ونقّط مثال بسيط."}
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(pipe(prompt)[0]["generated_text"][len(prompt):])
```
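
Recent `transformers` releases also let the text-generation pipeline accept the chat messages directly and apply the template itself. Treat this as an optional convenience; the exact minimum version is not pinned here:

```python
# Optional: pass the chat messages straight to the pipeline (recent transformers only).
# The assistant's reply comes back as the last message of the returned conversation.
result = pipe(messages)
print(result[0]["generated_text"][-1]["content"])
```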

### Chat template (Qwen2.5 format)

The tokenizer ships with a built-in Jinja chat template in the Qwen2.5 (ChatML) format: each turn is wrapped in `<|im_start|>` … `<|im_end|>` markers, with the role name on the first line. When you set `add_generation_prompt=True`, the rendered prompt ends right after the opening assistant tag so the model can continue from there:

```
<|im_start|>user
{user message}<|im_end|>
<|im_start|>assistant
```

The assistant keeps generating tokens until it emits `<|im_end|>`.

```python
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
```

No manual token handling is required; the call above inserts the turn delimiters and newlines automatically.
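
For multi-turn chats, append each assistant reply to `messages` and re-apply the template; the whole history is re-rendered on every call. A minimal sketch reusing `tokenizer` and `pipe` from the Quick start (the follow-up question is only illustrative):

```python
# Carry the whole history forward; the template is re-rendered on every call.
messages = [
    # "What is the 'butterfly effect' theory?"
    {"role": "user", "content": "شنو هي نظرية 'butterfly effect'؟"},
    # The model's first reply goes here verbatim.
    {"role": "assistant", "content": "..."},
    # "Give me an example from everyday life." (illustrative follow-up)
    {"role": "user", "content": "عطيني مثال من الحياة اليومية."},
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(pipe(prompt)[0]["generated_text"][len(prompt):])
```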

---

Pre-quantised checkpoints will be published under the same repo tags (`qwen2.5-14b-darija-awq-int4`, `qwen2.5-14b-darija-gguf-q4_k_m`).

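Once the AWQ build is up, it should load through the same `from_pretrained` call, provided `autoawq` is installed. The sketch below assumes the tag above is exposed as a Git revision of this repo; the final layout may differ:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical until the quantised checkpoints are published; requires `pip install autoawq`.
model_id = "AbderrahmanSkiredj1/Qwen2.5-14B-Instruct-darija"
revision = "qwen2.5-14b-darija-awq-int4"  # assumed tag name, see the note above

tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision)
model = AutoModelForCausalLM.from_pretrained(model_id, revision=revision, device_map="auto")
```
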
---

## Training recipe (one-paragraph recap)

1. **Data:** Translate a 44 K reasoning slice of TULU 50K into Darija, keeping 20 % English for cross-lingual robustness.
2. **LoRA SFT:** Rank 16, α = 32, 3 epochs, bf16, context 32,768.
3. **Merge & push:** Merge the LoRA adapter into the base weights (`merge_and_unload()` from PEFT), convert to safetensors, upload. A minimal configuration sketch follows.

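The sketch below mirrors the hyper-parameters listed above using PEFT; the `target_modules` list is an assumption for illustration, not a value taken from the paper, and the actual SFT loop is elided:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

BASE = "Qwen/Qwen2.5-14B-Instruct"

# 1) Attach LoRA adapters before SFT (rank 16, alpha 32, as in the recipe above).
base = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype="auto", device_map="auto")
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed, not from the paper
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
# ... run supervised fine-tuning on the Darija/English instruction mix here ...

# 2) After training, fold the adapter back into the base weights and save as safetensors.
merged = model.merge_and_unload()
merged.save_pretrained("Qwen2.5-14B-Instruct-darija", safe_serialization=True)
```
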
---

## Limitations & ethical considerations

- Sentiment analysis and abstractive summarisation still trail the state of the art.
- The tokeniser is unchanged, so rare Darija spellings may fragment into many sub-word tokens.
- The model may inherit societal biases present in the pre-training data.
- No RLHF / RLAIF safety alignment yet; apply a moderation layer in production.

---

## Citation

If you use Qwen2.5-14B-Instruct-darija in your work, please cite:

```bibtex
@misc{skiredj2025gemmarocunlockingdarijaproficiency,
  title={GemMaroc: Unlocking Darija Proficiency in LLMs with Minimal Data},
  author={Abderrahman Skiredj and Ferdaous Azhari and Houdaifa Atou and Nouamane Tazi and Ismail Berrada},
  year={2025},
  eprint={2505.17082},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2505.17082},
}
```