|
|
--- |
|
|
license: cc-by-nc-4.0 |
|
|
language: |
|
|
- tuk |
|
|
- eng |
|
|
library_name: transformers |
|
|
datasets: |
|
|
- XSkills/turkmen_english_s500 |
|
|
tags: |
|
|
- translation |
|
|
- nllb |
|
|
- lora |
|
|
- peft |
|
|
- turkmen |
|
|
model_name: nllb-200-turkmen-english-lora |
|
|
pipeline_tag: translation |
|
|
base_model: |
|
|
- facebook/nllb-200-distilled-600M |
|
|
--- |
|
|
|
|
|
# NLLB-200 (600M) – LoRA fine-tuned for Turkmen ↔ English
|
|
|
|
|
**Author**: Merdan Durdyyev

**Base model**: [`facebook/nllb-200-distilled-600M`](https://huggingface.co/facebook/nllb-200-distilled-600M)

**Tuning method**: Low-Rank Adaptation (LoRA) applied only to the `q_proj` and `v_proj` matrices (≈2.4M trainable parameters, 0.38% of the total).
|
|
|
|
|
> I built this checkpoint as the final project for my Deep-Learning class and as a small contribution to the Turkmen AI community, where open-source resources are scarce. |
|
|
|
|
|
--- |
|
|
|
|
|
## TL;DR & Quick results |
|
|
|
|
|
Try it in the [Space demo](https://huggingface.co/spaces/XSkills/nllb-turkmen-english). An article covering the full technical journey is available on [Medium](https://medium.com/@meinnps/fine-tuning-nllb-200-with-lora-on-a-650-sentence-turkmen-english-corpus-082f68bdec71).
|
|
|
|
|
### Model Comparison (Fine-tuned vs Original) |
|
|
|
|
|
#### English to Turkmen |
|
|
|
|
|
| Metric | Fine-tuned | Original | Difference | |
|
|
|---------------------------|-----------:|---------:|-----------:| |
|
|
| **BLEU** | 8.24 | 8.12 | +0.12 | |
|
|
| **chrF** | 39.55 | 39.46 | +0.09 | |
|
|
| **TER (lower is better)** | 87.20 | 87.30 | -0.10 | |
|
|
|
|
|
#### Turkmen to English |
|
|
|
|
|
| Metric | Fine-tuned | Original | Difference | |
|
|
|---------------------------|-----------:|---------:|-----------:| |
|
|
| **BLEU** | 25.88 | 26.48 | -0.60 | |
|
|
| **chrF** | 52.71 | 52.91 | -0.20 | |
|
|
| **TER (lower is better)** | 67.70 | 69.70 | -2.00 | |
|
|
|
|
|
*Scores computed with sacreBLEU 2.5 (BLEU, chrF, TER) on the official `test` split.
A separate spreadsheet with **human adequacy/fluency ratings** is available in the article.*
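
For reproducibility, here is a minimal sketch of how such corpus-level scores can be computed with the sacreBLEU Python API; `hyps` and `refs` are hypothetical variable names for the system outputs and references from the test split:

```python
import sacrebleu

# hyps: list[str] of model translations; refs: list[str] of references
# (hypothetical names; fill in from the test split)
bleu = sacrebleu.corpus_bleu(hyps, [refs])
chrf = sacrebleu.corpus_chrf(hyps, [refs])
ter = sacrebleu.corpus_ter(hyps, [refs])
print(f"BLEU {bleu.score:.2f}  chrF {chrf.score:.2f}  TER {ter.score:.2f}")
```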
|
|
|
|
|
--- |
|
|
|
|
|
## Intended use & scope |
|
|
|
|
|
* **Good for**: research prototypes, student projects, quick experiments on Turkmen text. |
|
|
* **Not for**: commercial MT systems (license is **CC-BY-NC 4.0**), critical medical/legal translation, or production workloads without further validation. |
|
|
|
|
|
--- |
|
|
|
|
|
## How to use |
|
|
|
|
|
*(To inspect the LoRA adapter weights on their own, visit [nllb-200-turkmen-english-lora-adapter](https://huggingface.co/XSkills/nllb-200-turkmen-english-lora-adapter/tree/main).)*
|
|
|
|
|
Using the pipeline API
|
|
```python |
|
|
from transformers import pipeline |
|
|
|
|
|
# Create the translation pipeline |
|
|
pipe = pipeline("translation", model="XSkills/nllb-200-turkmen-english-lora") |
|
|
|
|
|
# Translate from English to Turkmen |
|
|
# You need to specify the source and target languages using their FLORES-200 codes |
|
|
text = "Hello, how are you today?" |
|
|
translated = pipe(text, src_lang="eng_Latn", tgt_lang="tuk_Latn") |
|
|
print(translated) |
|
|
``` |
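
The pipeline returns a list of dictionaries; the translated text is in each entry's `translation_text` field.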
|
|
|
|
|
Using the tokenizer and model directly
|
|
```python |
|
|
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "XSkills/nllb-200-turkmen-english-lora"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

def tr(text, src="tuk_Latn", tgt="eng_Latn"):
    # Tell the tokenizer which language the input is in
    tok.src_lang = src
    ids = tok(text, return_tensors="pt", truncation=True, max_length=128)
    out = model.generate(
        **ids,
        # Force decoding to start with the target-language token
        forced_bos_token_id=tok.convert_tokens_to_ids(tgt),
        max_length=128,
        num_beams=5,
    )
    return tok.decode(out[0], skip_special_tokens=True)

print(tr("Men kitaby okaýaryn."))
|
|
``` |
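
Alternatively, if you want to attach the adapter to the base model yourself, here is a minimal sketch with PEFT, assuming the standalone adapter repository linked above:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from peft import PeftModel

base_id = "facebook/nllb-200-distilled-600M"
tok = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForSeq2SeqLM.from_pretrained(base_id)

# Load the LoRA weights on top of the frozen base model
model = PeftModel.from_pretrained(base, "XSkills/nllb-200-turkmen-english-lora-adapter")
model = model.merge_and_unload()  # optional: merge the adapter for plain inference
```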
|
|
|
|
|
## Training data |
|
|
- Dataset: [XSkills/turkmen_english_s500](https://huggingface.co/datasets/XSkills/turkmen_english_s500), 619 parallel sentences (495 train / 62 validation / 62 test) of news and official communiqués; see the loading sketch below.

- Collecting even this small corpus proved challenging because publicly available Turkmen data are limited.
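
A minimal sketch for loading the corpus with 🤗 Datasets (the split names are assumed to follow the standard train/validation/test layout):

```python
from datasets import load_dataset

ds = load_dataset("XSkills/turkmen_english_s500")
print(ds)  # expected: 495 train / 62 validation / 62 test examples
```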
|
|
|
|
|
## Training procedure |
|
|
|
|
|
| Item | Value | |
|
|
|------|-------| |
|
|
| GPU | 1 × NVIDIA A100 40 GB (Google Colab) | |
|
|
| Wall-time | ~ 3 minutes | |
|
|
| Optimiser | AdamW | |
|
|
| Learning rate | 1 × 10⁻⁵, cosine schedule, warm-up 10% | |
|
|
| Epochs | 5 | |
|
|
| Batch size | 4 (train) / 8 (eval) | |
|
|
| Weight-decay | 0.005 | |
|
|
| FP16 | Yes | |
|
|
| LoRA config | `r=16`, `alpha=32`, `dropout=0.05`, modules = `["q_proj","v_proj"]` | |
|
|
|
|
|
### LoRA Config |
|
|
|
|
|
```python |
|
|
from peft import LoraConfig, TaskType

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM,
)
|
|
``` |
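
To reproduce the trainable-parameter count quoted above, the config can be applied to the base model with `get_peft_model` (a sketch, not the exact training script):

```python
from transformers import AutoModelForSeq2SeqLM
from peft import get_peft_model

model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# expected roughly: ≈2.4M trainable params of ~615M total (≈0.38%)
```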
|
|
|
|
|
### Training Configuration |
|
|
|
|
|
```python |
|
|
import torch
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir=FINETUNED_DIR,  # output path defined earlier in the training notebook
    per_device_train_batch_size=4,
    per_device_eval_batch_size=8,
    weight_decay=0.005,
    save_total_limit=3,
    learning_rate=1e-5,
    num_train_epochs=5,
    lr_scheduler_type="cosine",
    predict_with_generate=True,
    fp16=torch.cuda.is_available(),
    logging_dir="./logs",
    logging_steps=50,
    eval_steps=50,
    save_steps=100,
    eval_accumulation_steps=2,
    report_to="tensorboard",
    warmup_ratio=0.1,
    metric_for_best_model="eval_bleu",  # use BLEU for model selection
    greater_is_better=True,
)
|
|
``` |
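
For context, these arguments would typically be passed to a `Seq2SeqTrainer` along the following lines; this is a sketch, with `tokenized_train`, `tokenized_val`, and `compute_metrics` as placeholders for the actual training script's objects:

```python
from transformers import Seq2SeqTrainer, DataCollatorForSeq2Seq

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,    # placeholder: tokenized train split
    eval_dataset=tokenized_val,       # placeholder: tokenized validation split
    data_collator=DataCollatorForSeq2Seq(tok, model=model),
    compute_metrics=compute_metrics,  # placeholder: should return {"bleu": ...}
)
trainer.train()
```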
|
|
## Evaluation |
|
|
|
|
|
Automatic metrics are given in the TL;DR section above.

A manual review of 50 random test sentences showed:
|
|
- Adequacy: 36 / 50 translations judged “Good” or better. |
|
|
- Fluency: 38 / 50 sounded natural to a native speaker.
|
|
*(Full spreadsheet available — ask via contact below.)* |
|
|
|
|
|
|
|
|
## Limitations & bias |
|
|
- Only ~600 parallel sentences of training data → limited vocabulary and domain coverage.
|
|
- May hallucinate proper nouns or numbers on longer inputs. |
|
|
- Gender and politeness nuances are not guaranteed.
|
|
- The CC-BY-NC licence forbids commercial use; also respect Meta’s original NLLB terms.
|
|
|
|
|
## How to Contribute |
|
|
|
|
|
We welcome contributions to improve Turkmen-English translation capabilities! Here's how you can help: |
|
|
|
|
|
### Data Contributions |
|
|
- **Dataset contributions**: instructions are in the [dataset README](https://huggingface.co/datasets/XSkills/turkmen_english_s500/blob/main/README.md)
|
|
|
|
|
### Code Contributions |
|
|
- **Hyperparameter experiments**: Try different LoRA configurations and document your results |
|
|
- **Evaluation**: Help with human evaluation of translation quality and fluency |
|
|
- **Bug fixes**: Report issues or submit fixes for the model implementation |
|
|
|
|
|
### Use Cases & Documentation |
|
|
- **Example applications**: Share how you're using the model for research or projects |
|
|
- **Domain-specific guides**: Create guides for using the model in specific domains |
|
|
- **Translation examples**: Share interesting or challenging translation examples |
|
|
|
|
|
### Getting Started |
|
|
1. Fork the repository |
|
|
2. Make your changes |
|
|
3. Submit a pull request with clear documentation of your contribution |
|
|
4. For data contributions, contact the maintainer directly |
|
|
|
|
|
All contributors will be acknowledged in the model documentation. Contact [meinnps@gmail.com](mailto:meinnps@gmail.com) with any questions or to discuss potential contributions. |
|
|
|
|
|
--- |
|
|
|
|
|
*Note: This model is licensed under CC-BY-NC-4.0, so all contributions must be compatible with non-commercial use only.* |
|
|
|
|
|
## Citation |
|
|
```bibtex |
|
|
@misc{durdyyev2025turkmenNLLBLoRA, |
|
|
title = {LoRA Fine-tuning of NLLB-200 for Turkmen–English Translation},
|
|
author = {Durdyyev, Merdan}, |
|
|
year = {2025}, |
|
|
url = {https://huggingface.co/XSkills/nllb-200-turkmen-english-lora} |
|
|
} |
|
|
``` |
|
|
|
|
|
## Contact |
|
|
If you have questions or suggestions, or want to collaborate, please reach out via [e-mail](mailto:meinnps@gmail.com), [LinkedIn](https://linkedin.com/in/merdandt), or [Telegram](https://t.me/merdandt).
|
|
|
|
|
## Future Work |
|
|
- Fine-tune on a larger dataset.

- Experiment with alternative hyperparameters and LoRA configurations.

- Use the [sacreBLEU](https://github.com/mjpost/sacrebleu) toolkit for evaluation.
|
|
|