---
library_name: transformers
tags:
- MoroccanArabic
- Darija
- GemMaroc
- conversational
- qwen
pipeline_tag: text-generation
datasets:
- GemMaroc/TULU-3-50k-darija-english
language:
- ar
- ary
- en
base_model:
- Qwen/Qwen2.5-14B-Instruct
---

# Qwen2.5-14B-Instruct-darija

Unlocking **Moroccan Darija** proficiency in a powerful large language model, trained with a _minimal-data, green-AI_ recipe that preserves Qwen2.5-14B-Instruct's strong reasoning abilities while adding fluent Darija generation.

---

## Model at a glance

|                     | Details                                                                                                    |
| ------------------- | ---------------------------------------------------------------------------------------------------------- |
| **Model ID**        | `AbderrahmanSkiredj1/Qwen2.5-14B-Instruct-darija`                                                          |
| **Base model**      | [`Qwen/Qwen2.5-14B-Instruct`](https://huggingface.co/Qwen/Qwen2.5-14B-Instruct)                            |
| **Architecture**    | Decoder-only Transformer (Qwen2.5)                                                                         |
| **Parameters**      | 14 billion                                                                                                 |
| **Context length**  | 32,768 tokens                                                                                              |
| **Training regime** | Supervised fine-tuning (LoRA → merged) on a 50 K slice of high-quality Darija/English instructions (TULU-3-50k) |
| **Compute budget**  | 32 GPU·h (8 × H100-80GB × 4 h), ≈ 18 kWh / 7 kg CO₂e                                                       |
| **License**         | Apache 2.0                                                                                                 |

---

## Why another Darija model?

- **Inclusive AI** More than 36 million speakers of Moroccan Arabic remain underserved by open LLMs.
- **Quality over quantity** A carefully curated 50 K instruction set surfaces Darija competence without sacrificing cross-lingual reasoning.
- **Green AI** Qwen2.5-14B-Instruct-darija reaches competitive Darija scores with only ≈ 32 GPU·h of fine-tuning (≈ 18 kWh).
- **Balanced performance** 14 B parameters offer a practical balance between capability and hardware requirements.

---

## Benchmark summary

### Darija Benchmarks

| Model                           | Darija MMLU | Darija HellaSwag | Sentiment Analysis | GSM8K Darija | Summarization (chrF) | ROUGE-1 | ROUGE-L | BERTScore |
| ------------------------------- | ----------- | ---------------- | ------------------ | ------------ | -------------------- | ------- | ------- | --------- |
| Qwen2.5-14B-Instruct            | 57.5 %      | 45.9 %           | 63.3 %             | 72.3 %       | 26.3                 | 9.1     | 8.8     | 36.2      |
| **Qwen2.5-14B-Instruct-darija** | **57.6 %**  | **51.3 %**       | **64.1 %**         | **83.3 %**   | **27.1**             | 7.3     | 7.1     | **38.8**  |

### English Benchmarks

| Model                           | MMLU       | TruthfulQA | HellaSwag  | GSM8K @5 | GSM8K Gen |
| ------------------------------- | ---------- | ---------- | ---------- | -------- | --------- |
| Qwen2.5-14B-Instruct            | 76.8 %     | 70.8 %     | 75.6 %     | 80.2 %   | 94.2 %    |
| **Qwen2.5-14B-Instruct-darija** | **75.3 %** | 61.7 %     | **79.1 %** | 75.6 %   | 92.4 %    |

<sub>Zero-shot accuracy; full table in the paper.</sub>

---

## Quick start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "AbderrahmanSkiredj1/Qwen2.5-14B-Instruct-darija"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)

# The model is already dispatched across devices above, so the pipeline needs no device_map of its own.
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=1024,
    do_sample=True,          # enable sampling so the temperature below takes effect
    temperature=0.7,
    repetition_penalty=1.2,
    no_repeat_ngram_size=3,
)

messages = [
    # "What is the 'butterfly effect' theory? Explain it in Darija and give a simple example."
    {"role": "user", "content": "شنو هي نظرية 'butterfly effect'؟ فسّرها بدارجة ونقّط مثال بسيط."}
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(pipe(prompt)[0]["generated_text"][len(prompt):])
```

### Chat template (Qwen2.5 format)

The tokenizer ships with a baked-in Jinja chat template that wraps every turn (system, user, assistant) in `<|im_start|>` … `<|im_end|>` markers; if you do not supply a system message, a default system turn is prepended. When you set `add_generation_prompt=True`, the rendered string ends with the opening assistant tag so the model can continue:

```
<|im_start|>user
{user message}<|im_end|>
<|im_start|>assistant
```

The assistant will keep generating tokens until it decides to emit `<|im_end|>`.

```python
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
```

No manual token juggling is required; the call above handles the turn delimiters and newline placement automatically.
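
If you prefer calling `model.generate` directly rather than going through the `pipeline` helper, the same chat template can be applied with tokenisation enabled. The sketch below assumes the `model`, `tokenizer`, and `messages` objects from the Quick start above; the generation settings are illustrative, not prescribed by this card.

```python
# Minimal sketch: same chat template, driving model.generate directly.
# Assumes `model`, `tokenizer`, and `messages` are defined as in the Quick start.
import torch

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output_ids = model.generate(
        inputs,
        max_new_tokens=1024,
        do_sample=True,
        temperature=0.7,
    )

# Drop the prompt tokens and decode only the newly generated answer.
answer = tokenizer.decode(output_ids[0][inputs.shape[-1]:], skip_special_tokens=True)
print(answer)
```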

---

Pre-quantised checkpoints will be published under the same repo tags (`qwen2.5-14b-darija-awq-int4`, `qwen2.5-14b-darija-gguf-q4_k_m`).
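
Until those checkpoints are live, the sketch below shows how a tagged revision could be loaded. It is hypothetical: the tag names come from the note above, and exposing them through the `revision` argument (plus an installed `autoawq` for the AWQ variant) is an assumption, not a confirmed release layout.

```python
# Hypothetical sketch: load the AWQ INT4 checkpoint once it is published.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "AbderrahmanSkiredj1/Qwen2.5-14B-Instruct-darija"
revision = "qwen2.5-14b-darija-awq-int4"  # tag name from the note above (assumed to be a repo revision)

tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision)
model = AutoModelForCausalLM.from_pretrained(model_id, revision=revision, device_map="auto")
```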

---

## Training recipe (one-paragraph recap)

1. **Data** Translate a 44 K reasoning-focused slice of the TULU-3 50 K set into Darija, keeping 20 % of the samples in English for cross-lingual robustness.
2. **LoRA SFT** Rank 16, α = 32, 3 epochs, bf16, context 32,768.
3. **Merge & push** Merge the LoRA adapters into the base weights (`peft.merge_and_unload`), convert to safetensors, and upload (a minimal sketch follows this list).
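
As a rough illustration of steps 2 and 3, the sketch below outlines the LoRA configuration and the merge using the `peft` library. Only the rank and α come from the recipe above; the target modules, dtype handling, and the omitted training loop are assumptions, not the exact training script.

```python
# Minimal LoRA SFT + merge sketch (illustrative, not the exact training script).
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_id = "Qwen/Qwen2.5-14B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype="bfloat16", device_map="auto")

lora_config = LoraConfig(
    r=16,                 # rank from the recipe
    lora_alpha=32,        # alpha from the recipe
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed projection set
)
peft_model = get_peft_model(model, lora_config)

# ... run supervised fine-tuning for 3 epochs on the Darija/English instruction mix ...

# Step 3: merge the adapters back into the base weights and save as safetensors.
merged = peft_model.merge_and_unload()
merged.save_pretrained("Qwen2.5-14B-Instruct-darija", safe_serialization=True)
tokenizer.save_pretrained("Qwen2.5-14B-Instruct-darija")
```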

---

## Limitations & ethical considerations

- Sentiment analysis and abstractive summarisation still trail the state of the art.
- The tokenizer is unchanged, so rare Darija spellings may fragment into many sub-tokens.
- The model may inherit societal biases present in its pre-training data.
- No RLHF / RLAIF safety alignment has been applied yet; use a moderation layer in production.

---

## Citation

If you use Qwen2.5-14B-Instruct-darija in your work, please cite:

```bibtex
@misc{skiredj2025gemmarocunlockingdarijaproficiency,
      title={GemMaroc: Unlocking Darija Proficiency in LLMs with Minimal Data},
      author={Abderrahman Skiredj and Ferdaous Azhari and Houdaifa Atou and Nouamane Tazi and Ismail Berrada},
      year={2025},
      eprint={2505.17082},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.17082},
}
```