|
|
--- |
|
|
license: cc-by-nc-4.0 |
|
|
language: |
|
|
- tuk |
|
|
- eng |
|
|
library_name: transformers |
|
|
datasets: |
|
|
- XSkills/turkmen_english_s500 |
|
|
tags: |
|
|
- translation |
|
|
- nllb |
|
|
- lora |
|
|
- peft |
|
|
- turkmen |
|
|
model_name: nllb-200-turkmen-english-lora |
|
|
pipeline_tag: translation |
|
|
base_model: |
|
|
- facebook/nllb-200-distilled-600M |
|
|
--- |
|
|
|
|
|
# NLLB-200 (600M) – LoRA fine-tuned for Turkmen ↔ English
|
|
|
|
|
**Author**: Merdan Durdyyev

**Base model**: [`facebook/nllb-200-distilled-600M`](https://huggingface.co/facebook/nllb-200-distilled-600M)

**Tuning method**: Low-Rank Adaptation (LoRA) applied only to the `q_proj` and `v_proj` matrices (≈2.4M trainable parameters, 0.38% of the total).
|
|
|
|
|
> I built this checkpoint as the final project for my Deep-Learning class and as a small contribution to the Turkmen AI community, where open-source resources are scarce. |
|
|
|
|
|
--- |
|
|
|
|
|
## TL;DR & Quick results |
|
|
|
|
|
Try it in the [Space demo](https://huggingface.co/spaces/XSkills/nllb-turkmen-english). An article covering the full technical journey is available on [Medium](https://medium.com/@meinnps/fine-tuning-nllb-200-with-lora-on-a-650-sentence-turkmen-english-corpus-082f68bdec71).
|
|
|
|
|
### Model Comparison (Fine-tuned vs Original) |
|
|
|
|
|
#### English to Turkmen |
|
|
|
|
|
| Metric | Fine-tuned | Original | Difference | |
|
|
|---------------------------|-----------:|---------:|-----------:| |
|
|
| **BLEU** | 8.24 | 8.12 | +0.12 | |
|
|
| **chrF** | 39.55 | 39.46 | +0.09 | |
|
|
| **TER (lower is better)** | 87.20 | 87.30 | -0.10 | |
|
|
|
|
|
#### Turkmen to English |
|
|
|
|
|
| Metric | Fine-tuned | Original | Difference | |
|
|
|---------------------------|-----------:|---------:|-----------:| |
|
|
| **BLEU** | 25.88 | 26.48 | -0.60 | |
|
|
| **chrF** | 52.71 | 52.91 | -0.20 | |
|
|
| **TER (lower is better)** | 67.70 | 69.70 | -2.00 | |
|
|
|
|
|
*Scores computed with sacreBLEU 2.5 (BLEU, chrF, TER) on the official `test` split.
A separate spreadsheet with **human adequacy/fluency ratings** is available in the article.*
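
For reproducibility, here is a minimal sketch of how such corpus-level scores can be computed with the sacreBLEU Python API; `hyps` and `refs` are hypothetical variable names for the system outputs and references from the test split:

```python
import sacrebleu

# hyps: list[str] of model translations; refs: list[str] of references
# (hypothetical names; fill in from the test split)
bleu = sacrebleu.corpus_bleu(hyps, [refs])
chrf = sacrebleu.corpus_chrf(hyps, [refs])
ter = sacrebleu.corpus_ter(hyps, [refs])
print(f"BLEU {bleu.score:.2f}  chrF {chrf.score:.2f}  TER {ter.score:.2f}")
```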
|
|
|
|
|
--- |
|
|
|
|
|
## Intended use & scope |
|
|
|
|
|
* **Good for**: research prototypes, student projects, quick experiments on Turkmen text. |
|
|
* **Not for**: commercial MT systems (license is **CC-BY-NC 4.0**), critical medical/legal translation, or production workloads without further validation. |
|
|
|
|
|
--- |
|
|
|
|
|
## How to use |
|
|
|
|
|
*(To inspect the LoRA adapter weights on their own, visit [nllb-200-turkmen-english-lora-adapter](https://huggingface.co/XSkills/nllb-200-turkmen-english-lora-adapter/tree/main).)*
|
|
|
|
|
Using the pipeline API
|
|
```python |
|
|
from transformers import pipeline |
|
|
|
|
|
# Create the translation pipeline |
|
|
pipe = pipeline("translation", model="XSkills/nllb-200-turkmen-english-lora") |
|
|
|
|
|
# Translate from English to Turkmen |
|
|
# You need to specify the source and target languages using their FLORES-200 codes |
|
|
text = "Hello, how are you today?" |
|
|
translated = pipe(text, src_lang="eng_Latn", tgt_lang="tuk_Latn") |
|
|
print(translated) |
|
|
``` |
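
The pipeline returns a list of dictionaries; the translated text is in each entry's `translation_text` field.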
|
|
|
|
|
Using the tokenizer and model directly
|
|
```python |
|
|
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "XSkills/nllb-200-turkmen-english-lora"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

def tr(text, src="tuk_Latn", tgt="eng_Latn"):
    # Tell the tokenizer which language the input is in
    tok.src_lang = src
    ids = tok(text, return_tensors="pt", truncation=True, max_length=128)
    out = model.generate(
        **ids,
        # Force decoding to start with the target-language token
        forced_bos_token_id=tok.convert_tokens_to_ids(tgt),
        max_length=128,
        num_beams=5,
    )
    return tok.decode(out[0], skip_special_tokens=True)

print(tr("Men kitaby okaýaryn."))
|
|
``` |
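
Alternatively, if you want to attach the adapter to the base model yourself, here is a minimal sketch with PEFT, assuming the standalone adapter repository linked above:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from peft import PeftModel

base_id = "facebook/nllb-200-distilled-600M"
tok = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForSeq2SeqLM.from_pretrained(base_id)

# Load the LoRA weights on top of the frozen base model
model = PeftModel.from_pretrained(base, "XSkills/nllb-200-turkmen-english-lora-adapter")
model = model.merge_and_unload()  # optional: merge the adapter for plain inference
```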
|
|
|
|
|
## Training data |
|
|
- Dataset: [XSkills/turkmen_english_s500](https://huggingface.co/datasets/XSkills/turkmen_english_s500), 619 parallel sentences (495 train / 62 validation / 62 test) of news and official communiqués; see the loading sketch below.

- Collecting even this small corpus proved challenging because publicly available Turkmen data are limited.
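
A minimal sketch for loading the corpus with 🤗 Datasets (the split names are assumed to follow the standard train/validation/test layout):

```python
from datasets import load_dataset

ds = load_dataset("XSkills/turkmen_english_s500")
print(ds)  # expected: 495 train / 62 validation / 62 test examples
```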
|
|
|
|
|
## Training procedure |
|
|
|
|
|
| Item | Value | |
|
|
|------|-------| |
|
|
| GPU | 1 × NVIDIA A100 40 GB (Google Colab) | |
|
|
| Wall-time | ~ 3 minutes | |
|
|
| Optimiser | AdamW | |
|
|
| Learning rate | 1 × 10⁻⁵, cosine schedule, warm-up 10% | |
|
|
| Epochs | 5 | |
|
|
| Batch size | 4 (train) / 8 (eval) | |
|
|
| Weight-decay | 0.005 | |
|
|
| FP16 | Yes | |
|
|
| LoRA config | `r=16`, `alpha=32`, `dropout=0.05`, modules = `["q_proj","v_proj"]` | |
|
|
|
|
|
### LoRA Config |
|
|
|
|
|
```python |
|
|
from peft import LoraConfig, TaskType

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM,
)
|
|
``` |
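
To reproduce the trainable-parameter count quoted above, the config can be applied to the base model with `get_peft_model` (a sketch, not the exact training script):

```python
from transformers import AutoModelForSeq2SeqLM
from peft import get_peft_model

model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# expected roughly: ≈2.4M trainable params of ~615M total (≈0.38%)
```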
|
|
|
|
|
### Training Configuration |
|
|
|
|
|
```python |
|
|
import torch
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir=FINETUNED_DIR,  # output path defined earlier in the training notebook
    per_device_train_batch_size=4,
    per_device_eval_batch_size=8,
    weight_decay=0.005,
    save_total_limit=3,
    learning_rate=1e-5,
    num_train_epochs=5,
    lr_scheduler_type="cosine",
    predict_with_generate=True,
    fp16=torch.cuda.is_available(),
    logging_dir="./logs",
    logging_steps=50,
    eval_steps=50,
    save_steps=100,
    eval_accumulation_steps=2,
    report_to="tensorboard",
    warmup_ratio=0.1,
    metric_for_best_model="eval_bleu",  # use BLEU for model selection
    greater_is_better=True,
)
|
|
``` |
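
For context, these arguments would typically be passed to a `Seq2SeqTrainer` along the following lines; this is a sketch, with `tokenized_train`, `tokenized_val`, and `compute_metrics` as placeholders for the actual training script's objects:

```python
from transformers import Seq2SeqTrainer, DataCollatorForSeq2Seq

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,    # placeholder: tokenized train split
    eval_dataset=tokenized_val,       # placeholder: tokenized validation split
    data_collator=DataCollatorForSeq2Seq(tok, model=model),
    compute_metrics=compute_metrics,  # placeholder: should return {"bleu": ...}
)
trainer.train()
```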
|
|
## Evaluation |
|
|
|
|
|
Automatic metrics are given in the TL;DR section above.

A manual review of 50 random test sentences showed:
|
|
- Adequacy: 36 / 50 translations judged “Good” or better. |
|
|
- Fluency: 38 / 50 sounded natural to a native speaker.
|
|
*(Full spreadsheet available — ask via contact below.)* |
|
|
|
|
|
|
|
|
## Limitations & bias |
|
|
- Only ~600 parallel sentences of training data → limited vocabulary and domain coverage.
|
|
- May hallucinate proper nouns or numbers on longer inputs. |
|
|
- Gender and politeness nuances are not guaranteed.
|
|
- The CC-BY-NC licence forbids commercial use; also respect Meta’s original NLLB terms.
|
|
|
|
|
## How to Contribute |
|
|
|
|
|
We welcome contributions to improve Turkmen-English translation capabilities! Here's how you can help: |
|
|
|
|
|
### Data Contributions |
|
|
- **Dataset contributions**: instructions are in the [dataset README](https://huggingface.co/datasets/XSkills/turkmen_english_s500/blob/main/README.md)
|
|
|
|
|
### Code Contributions |
|
|
- **Hyperparameter experiments**: Try different LoRA configurations and document your results |
|
|
- **Evaluation**: Help with human evaluation of translation quality and fluency |
|
|
- **Bug fixes**: Report issues or submit fixes for the model implementation |
|
|
|
|
|
### Use Cases & Documentation |
|
|
- **Example applications**: Share how you're using the model for research or projects |
|
|
- **Domain-specific guides**: Create guides for using the model in specific domains |
|
|
- **Translation examples**: Share interesting or challenging translation examples |
|
|
|
|
|
### Getting Started |
|
|
1. Fork the repository |
|
|
2. Make your changes |
|
|
3. Submit a pull request with clear documentation of your contribution |
|
|
4. For data contributions, contact the maintainer directly |
|
|
|
|
|
All contributors will be acknowledged in the model documentation. Contact [meinnps@gmail.com](mailto:meinnps@gmail.com) with any questions or to discuss potential contributions. |
|
|
|
|
|
--- |
|
|
|
|
|
*Note: This model is licensed under CC-BY-NC-4.0, so all contributions must be compatible with non-commercial use only.* |
|
|
|
|
|
## Citation |
|
|
```bibtex |
|
|
@misc{durdyyev2025turkmenNLLBLoRA, |
|
|
title = {LoRA Fine-tuning of NLLB-200 for Turkmen–English Translation},
|
|
author = {Durdyyev, Merdan}, |
|
|
year = {2025}, |
|
|
url = {https://huggingface.co/XSkills/nllb-200-turkmen-english-lora} |
|
|
} |
|
|
``` |
|
|
|
|
|
## Contact |
|
|
If you have questions or suggestions, or want to collaborate, please reach out via [e-mail](mailto:meinnps@gmail.com), [LinkedIn](https://linkedin.com/in/merdandt), or [Telegram](https://t.me/merdandt).
|
|
|
|
|
## Future Work |
|
|
- Fine-tune on a larger dataset.

- Experiment with alternative hyperparameters and LoRA configurations.

- Use the [sacreBLEU](https://github.com/mjpost/sacrebleu) toolkit for evaluation.
|
|
|