Vietnamese–English–Japanese Neural Machine Translation
A fine-tuned version of facebook/nllb-200-distilled-600M (600M parameters) for trilingual translation among Vietnamese, English, and Japanese.
Model Details
- Architecture: NLLB-200 (encoder-decoder transformer, distilled)
- Parameters: 600M
- Base Model: facebook/nllb-200-distilled-600M
- Task: Neural Machine Translation (Vi ↔ En ↔ Ja)
- Framework: PyTorch + HuggingFace Transformers
Supported Translation Pairs
| Source | Target | Direction |
|---|---|---|
| Vietnamese | English | vi → en |
| English | Vietnamese | en → vi |
| Vietnamese | Japanese | vi → ja |
| Japanese | Vietnamese | ja → vi |
| English | Japanese | en → ja |
| Japanese | English | ja → en |
Training
- Dataset: OPUS parallel corpora (vi-en, vi-ja pairs)
- Optimizer: AdamW with linear warmup scheduling
- Decoding: Beam search with configurable beam width (1–10) and a tunable length penalty
- Evaluation: SacreBLEU scores per language pair
- Max Sequence Length: 128 tokens
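As an illustration of the linear warmup scheduling listed above, the sketch below computes the learning rate at a given step. The names `base_lr`, `warmup_steps`, and `total_steps` are hypothetical, and the linear decay to zero after warmup is an assumption (it matches the shape produced by `get_linear_schedule_with_warmup` in Transformers); the values actually used in training are not stated in this card.

```python
def linear_warmup_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate at `step`: linear ramp up to base_lr, then linear decay to 0.

    All arguments are illustrative; the decay phase is an assumption,
    not a documented detail of this model's training run.
    """
    if warmup_steps > 0 and step < warmup_steps:
        # Warmup phase: scale linearly from 0 to base_lr.
        return base_lr * step / warmup_steps
    # Decay phase: scale linearly from base_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


if __name__ == "__main__":
    # Hypothetical settings, chosen only to show the shape of the schedule.
    for step in (0, 50, 100, 550, 1000):
        print(step, linear_warmup_lr(step, 1e-4, 100, 1000))
```

In a PyTorch training loop the same shape would typically be applied by a scheduler that is stepped once per optimizer update, rather than by calling a function like this by hand.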
Language Auto-Detection
The project includes a character-based language detection module that checks Unicode ranges:
- Vietnamese: Letters with Vietnamese diacritics (ă, â, ê, ô, ơ, ư, đ) and tone marks
- Japanese: Hiragana (U+3040–U+309F), Katakana (U+30A0–U+30FF), CJK Unified Ideographs (U+4E00–U+9FFF)
- English: Fallback for ASCII Latin text
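The rules above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the module from the repository: the function names, the set of tone marks, and the order of the checks (Japanese first, then Vietnamese, then English) are all assumptions.

```python
import unicodedata

# Vietnamese-specific base letters from the list above (both cases).
VIETNAMESE_LETTERS = set("ăâêôơưđĂÂÊÔƠƯĐ")
# Combining marks for the five written tones (assumed set):
# grave, acute, tilde, hook above, dot below.
TONE_MARKS = {"\u0300", "\u0301", "\u0303", "\u0309", "\u0323"}


def is_japanese_char(ch):
    """True for Hiragana, Katakana, or CJK Unified Ideographs."""
    cp = ord(ch)
    return (0x3040 <= cp <= 0x309F
            or 0x30A0 <= cp <= 0x30FF
            or 0x4E00 <= cp <= 0x9FFF)


def is_vietnamese_char(ch):
    """True for Vietnamese-specific letters or letters carrying a tone mark."""
    if ch in VIETNAMESE_LETTERS:
        return True
    # NFD splits precomposed letters (e.g. "à" -> "a" + U+0300),
    # which exposes the combining tone mark.
    return any(c in TONE_MARKS for c in unicodedata.normalize("NFD", ch))


def detect_language(text):
    """Return "ja", "vi", or "en" (hypothetical short codes)."""
    if any(is_japanese_char(ch) for ch in text):
        return "ja"
    if any(is_vietnamese_char(ch) for ch in text):
        return "vi"
    return "en"  # fallback for plain Latin text


if __name__ == "__main__":
    for sample in ("Xin chào, bạn khỏe không?", "こんにちは", "Hello, how are you?"):
        print(detect_language(sample), sample)
```

Because the check is purely character-based, this sketch has blind spots: kanji-only input is indistinguishable from Chinese, accented text in other Latin-script languages (for example "é") is classified as Vietnamese, and Vietnamese written without diacritics falls through to English.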
Usage
```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Fine-tuned weights; the tokenizer is loaded from the base NLLB checkpoint.
model = AutoModelForSeq2SeqLM.from_pretrained("sanvo/vietnamese-nmt")
tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")

# Vietnamese → English
tokenizer.src_lang = "vie_Latn"
inputs = tokenizer("Xin chào, bạn khỏe không?", return_tensors="pt")
# lang_code_to_id is deprecated and absent from recent Transformers releases; convert_tokens_to_ids works across versions.
eng_id = tokenizer.convert_tokens_to_ids("eng_Latn")
translated = model.generate(**inputs, forced_bos_token_id=eng_id)
print(tokenizer.decode(translated[0], skip_special_tokens=True))

# Vietnamese → Japanese (same inputs, different target language token)
jpn_id = tokenizer.convert_tokens_to_ids("jpn_Jpan")
translated = model.generate(**inputs, forced_bos_token_id=jpn_id)
print(tokenizer.decode(translated[0], skip_special_tokens=True))
```
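To switch directions or adjust decoding without repeating the boilerplate, two small helpers along the following lines may be useful. This is a sketch under stated assumptions: the helper names, the short codes `vi`/`en`/`ja`, and the default values for `num_beams` and `length_penalty` are hypothetical, and reusing the 128-token training limit as `max_length` at inference time is an assumption. The NLLB codes come from the snippet above, and `num_beams`, `length_penalty`, and `max_length` are standard `generate()` arguments.

```python
# Short codes (hypothetical) mapped to the NLLB language codes used above.
NLLB_CODES = {"vi": "vie_Latn", "en": "eng_Latn", "ja": "jpn_Jpan"}


def nllb_code(lang):
    """Map a short language code to its NLLB code, rejecting unsupported ones."""
    try:
        return NLLB_CODES[lang]
    except KeyError:
        raise ValueError(f"unsupported language: {lang!r}") from None


def generation_kwargs(num_beams=4, length_penalty=1.0, max_length=128):
    """Build keyword arguments for model.generate().

    Defaults are illustrative. The 1-10 bound mirrors the configurable
    beam width mentioned in the Training section.
    """
    if not 1 <= num_beams <= 10:
        raise ValueError("num_beams must be between 1 and 10")
    return {
        "num_beams": num_beams,
        "length_penalty": length_penalty,
        "max_length": max_length,
    }


if __name__ == "__main__":
    print(nllb_code("ja"))
    print(generation_kwargs(num_beams=5, length_penalty=1.2))
```

With `model`, `tokenizer`, and `inputs` defined as in the snippet above, a Vietnamese → Japanese call would then look like `model.generate(**inputs, forced_bos_token_id=tokenizer.convert_tokens_to_ids(nllb_code("ja")), **generation_kwargs(num_beams=5))`.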
Features
- Trilingual Vietnamese ↔ English ↔ Japanese translation
- Configurable beam search with length penalty
- Unicode-based language auto-detection
- Interactive trilingual Gradio interface
Links
- GitHub: svn05/vietnamese-nmt