Vietnamese–English–Japanese Neural Machine Translation

A fine-tuned facebook/nllb-200-distilled-600M model (600M parameters) that translates among Vietnamese, English, and Japanese in all six directions.

Model Details

Architecture: NLLB-200 (encoder-decoder transformer, distilled)
Parameters: 600M
Base Model: facebook/nllb-200-distilled-600M
Task: Neural Machine Translation (Vi ↔ En ↔ Ja)
Framework: PyTorch + HuggingFace Transformers

Supported Translation Pairs

Source	Target	Direction
Vietnamese	English	vi → en
English	Vietnamese	en → vi
Vietnamese	Japanese	vi → ja
Japanese	Vietnamese	ja → vi
English	Japanese	en → ja
Japanese	English	ja → en

Training

Dataset: OPUS parallel corpora (vi-en, vi-ja pairs)
Optimizer: AdamW with linear warmup scheduling
Decoding: Beam search (configurable width 1–10, length penalty tuning)
Evaluation: SacreBLEU scores per language pair
Max Sequence Length: 128 tokens
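
The card does not list hyperparameters, so the following is only a sketch of what linear warmup scheduling usually looks like, assuming the warmup is followed by a linear decay to zero (the shape produced by get_linear_schedule_with_warmup in Transformers). The function name linear_warmup_factor, the step counts, and the base learning rate are illustrative assumptions, not values used to train this model.

```python
def linear_warmup_factor(step: int, warmup_steps: int, total_steps: int) -> float:
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < warmup_steps:
        # Warmup phase: ramp linearly from 0 up to 1.
        return step / max(1, warmup_steps)
    # Decay phase (assumed): fall linearly from 1 at the end of warmup to 0 at total_steps.
    remaining = total_steps - step
    return max(0.0, remaining / max(1, total_steps - warmup_steps))


# Hypothetical schedule: 100 warmup steps out of 1,000 total, base LR 2e-5.
base_lr = 2e-5
for step in (0, 50, 100, 550, 1000):
    print(step, base_lr * linear_warmup_factor(step, 100, 1000))
```

With PyTorch, a function of this shape can be passed to torch.optim.lr_scheduler.LambdaLR (wrapped in a lambda that fixes warmup_steps and total_steps) alongside a torch.optim.AdamW optimizer.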

Language Auto-Detection

Includes a character-based language detection module using Unicode range analysis:

Vietnamese: Diacritical marks (ă, â, ê, ô, ơ, ư, đ) and tonal marks
Japanese: Hiragana (U+3040–309F), Katakana (U+30A0–30FF), CJK (U+4E00–9FFF)
English: ASCII Latin fallback
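
The rules above can be sketched as a small heuristic. This is not the module shipped with the model: the function name detect_language, the check order (Japanese first, then Vietnamese, then English), and the exact set of combining marks are assumptions made for illustration.

```python
import unicodedata

# Code point ranges quoted in the list above: Hiragana, Katakana, CJK ideographs.
JAPANESE_RANGES = ((0x3040, 0x309F), (0x30A0, 0x30FF), (0x4E00, 0x9FFF))

# Combining marks seen after NFD decomposition of Vietnamese letters (assumed set):
# tones (grave, acute, tilde, hook above, dot below) plus circumflex, breve, horn.
VIETNAMESE_MARKS = {
    "\u0300", "\u0301", "\u0303", "\u0309", "\u0323",
    "\u0302", "\u0306", "\u031B",
}


def detect_language(text: str) -> str:
    """Return "ja", "vi", or "en" using character-level checks only."""
    # Any kana or CJK ideograph marks the text as Japanese.
    for ch in text:
        code = ord(ch)
        if any(lo <= code <= hi for lo, hi in JAPANESE_RANGES):
            return "ja"
    # Decompose so that letters such as "ạ" become a base letter plus a combining mark.
    for ch in unicodedata.normalize("NFD", text):
        if ch in VIETNAMESE_MARKS or ch in "đĐ":
            return "vi"
    # Nothing distinctive found: fall back to English.
    return "en"


print(detect_language("Xin chào, bạn khỏe không?"))
print(detect_language("こんにちは"))
print(detect_language("Hello, how are you?"))
```

A character-level heuristic like this has known blind spots: Vietnamese text written without diacritics falls through to English, and CJK ideographs are shared with Chinese, so input consisting only of ideographs is labelled Japanese.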

Usage

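from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Fine-tuned weights come from this repository; the tokenizer is loaded from the base model.
model = AutoModelForSeq2SeqLM.from_pretrained("sanvo/vietnamese-nmt")
tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")

# Vietnamese → English
tokenizer.src_lang = "vie_Latn"  # NLLB tag for the source language
inputs = tokenizer("Xin chào, bạn khỏe không?", return_tensors="pt")
# forced_bos_token_id selects the target language. convert_tokens_to_ids is used here
# because tokenizer.lang_code_to_id is deprecated and unavailable in recent Transformers releases.
translated = model.generate(**inputs, forced_bos_token_id=tokenizer.convert_tokens_to_ids("eng_Latn"))
print(tokenizer.decode(translated[0], skip_special_tokens=True))

# Vietnamese → Japanese (same encoded input, different target tag)
translated = model.generate(**inputs, forced_bos_token_id=tokenizer.convert_tokens_to_ids("jpn_Jpan"))
print(tokenizer.decode(translated[0], skip_special_tokens=True))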
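
To cover all six directions with configurable beam search, the calls above can be wrapped in one helper. This is a minimal sketch rather than code from the repository: the names NLLB_CODES and translate, the default num_beams=4, and reusing the 128-token training limit at inference time are assumptions. The model and tokenizer are passed in, so the helper works with the objects loaded in the Usage snippet.

```python
# Short codes from the Supported Translation Pairs table mapped to NLLB-200 language tags.
NLLB_CODES = {"vi": "vie_Latn", "en": "eng_Latn", "ja": "jpn_Jpan"}


def translate(text, src, tgt, model, tokenizer, num_beams=4, length_penalty=1.0):
    """Translate text between two of vi/en/ja using the given model and tokenizer."""
    if src not in NLLB_CODES or tgt not in NLLB_CODES or src == tgt:
        raise ValueError(f"unsupported pair: {src} -> {tgt}")
    if not 1 <= num_beams <= 10:
        raise ValueError("num_beams must be between 1 and 10")

    tokenizer.src_lang = NLLB_CODES[src]  # source-language tag for the encoder input
    # Assumption: truncate inputs to the 128-token limit stated under Training.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    output_ids = model.generate(
        **inputs,
        # The first generated token is forced to the target-language tag.
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(NLLB_CODES[tgt]),
        num_beams=num_beams,            # 1 disables beam search (greedy decoding)
        length_penalty=length_penalty,  # only has an effect when num_beams > 1
        max_new_tokens=128,
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

For example, translate("Xin chào, bạn khỏe không?", "vi", "ja", model, tokenizer, num_beams=5) requests a Vietnamese → Japanese translation with a beam width of 5. With beam search, a length_penalty above 1.0 favors longer outputs and a value below 1.0 favors shorter ones.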

Features

Trilingual Vietnamese ↔ English ↔ Japanese translation
Configurable beam search with length penalty
Unicode-based language auto-detection
Interactive trilingual Gradio interface
