Vietnamese–English–Japanese Neural Machine Translation

A fine-tuned version of facebook/nllb-200-distilled-600M (600M parameters) for translation between Vietnamese, English, and Japanese in all six directions.

Model Details

  • Architecture: NLLB-200 (encoder-decoder transformer, distilled)
  • Parameters: 600M
  • Base Model: facebook/nllb-200-distilled-600M
  • Task: Neural Machine Translation (Vi ↔ En ↔ Ja)
  • Framework: PyTorch + HuggingFace Transformers

Supported Translation Pairs

Source       Target       Direction
Vietnamese   English      vi → en
English      Vietnamese   en → vi
Vietnamese   Japanese     vi → ja
Japanese     Vietnamese   ja → vi
English      Japanese     en → ja
Japanese     English      ja → en
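
The short codes in the table are not what the NLLB tokenizer takes as input; it expects codes such as vie_Latn, eng_Latn, and jpn_Jpan, the three used in the Usage section. The following is a minimal sketch of a lookup between the two. NLLB_CODES and resolve_direction are hypothetical names introduced for illustration and are not part of the released model.

```python
# Hypothetical helper: maps the short codes from the table to NLLB language codes.
NLLB_CODES = {"vi": "vie_Latn", "en": "eng_Latn", "ja": "jpn_Jpan"}


def resolve_direction(direction: str) -> tuple[str, str]:
    """Turn a string such as "vi → en" into (src_lang, tgt_lang) NLLB codes."""
    # Accept either the arrow used in the table or a plain ASCII "->".
    normalized = direction.replace("->", "→")
    parts = [p.strip().lower() for p in normalized.split("→")]
    if len(parts) != 2:
        raise ValueError(f"expected 'src → tgt', got {direction!r}")
    src, tgt = parts
    if src not in NLLB_CODES or tgt not in NLLB_CODES:
        raise ValueError(f"unsupported language code in {direction!r}")
    if src == tgt:
        raise ValueError("source and target languages must differ")
    return NLLB_CODES[src], NLLB_CODES[tgt]
```

The first element of the returned pair would go into tokenizer.src_lang, and the second would be converted to a token id and passed as forced_bos_token_id.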

Training

  • Dataset: OPUS parallel corpora (vi-en, vi-ja pairs)
  • Optimizer: AdamW with linear warmup scheduling
  • Decoding: Beam search (configurable width 1–10, length penalty tuning)
  • Evaluation: SacreBLEU scores per language pair
  • Max Sequence Length: 128 tokens
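
The card names AdamW with linear warmup but does not give the learning rate, warmup length, or total step count. The sketch below shows the multiplier that a linear warmup schedule applies to the base learning rate, assuming warmup is followed by linear decay to zero (the shape produced by get_linear_schedule_with_warmup in Transformers). The step counts are placeholders, not the values used to train this model.

```python
def linear_warmup_factor(step: int, warmup_steps: int, total_steps: int) -> float:
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < warmup_steps:
        # Ramp linearly from 0 up to 1 over the warmup phase.
        return step / max(1, warmup_steps)
    # Assumption: linear decay from 1 down to 0 over the remaining steps.
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


# Placeholder values for illustration only.
WARMUP_STEPS, TOTAL_STEPS = 100, 1000
factors = [
    linear_warmup_factor(s, WARMUP_STEPS, TOTAL_STEPS)
    for s in (0, 50, 100, 550, 1000)
]
```

In PyTorch, a function of this shape (with the two step counts bound) can be passed as lr_lambda to torch.optim.lr_scheduler.LambdaLR, wrapped around a torch.optim.AdamW optimizer.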

Language Auto-Detection

Includes a character-based language detection module using Unicode range analysis:

  • Vietnamese: Diacritical marks (ă, â, ê, ô, ơ, ư, đ) and tonal marks
  • Japanese: Hiragana (U+3040–309F), Katakana (U+30A0–30FF), CJK (U+4E00–9FFF)
  • English: ASCII Latin fallback
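
The rules above can be sketched as a small standalone function. This is an illustration of the described approach, not the module shipped with the model: the function name, the order of the checks (Japanese first), and the exact set of tone marks are assumptions.

```python
import unicodedata

# Vietnamese base letters from the list above (normalized to precomposed form).
VI_LETTERS = set(unicodedata.normalize("NFC", "ăâêôơưđ"))
# Combining marks that carry Vietnamese tones after NFD decomposition:
# grave, acute, tilde, hook above, dot below.
VI_TONE_MARKS = {"\u0300", "\u0301", "\u0303", "\u0309", "\u0323"}
# Japanese ranges from the list above: Hiragana, Katakana, CJK ideographs.
JA_RANGES = ((0x3040, 0x309F), (0x30A0, 0x30FF), (0x4E00, 0x9FFF))


def detect_language(text: str) -> str:
    """Return "ja", "vi", or "en" (the fallback) for the given text."""
    if any(lo <= ord(ch) <= hi for ch in text for lo, hi in JA_RANGES):
        return "ja"
    composed = unicodedata.normalize("NFC", text).lower()
    if any(ch in VI_LETTERS for ch in composed):
        return "vi"
    # Decompose letters such as "à" so the tone mark becomes its own code point.
    decomposed = unicodedata.normalize("NFD", composed)
    if any(ch in VI_TONE_MARKS for ch in decomposed):
        return "vi"
    return "en"
```

A detector of this kind only separates the three supported languages. Chinese text would be reported as Japanese because of the shared CJK range, and accented text in other Latin-script languages would be reported as Vietnamese.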

Usage

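from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Fine-tuned weights come from this repository; the tokenizer is the base model's.
model = AutoModelForSeq2SeqLM.from_pretrained("sanvo/vietnamese-nmt")
tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")

# Vietnamese → English
tokenizer.src_lang = "vie_Latn"  # NLLB code of the source language
inputs = tokenizer("Xin chào, bạn khỏe không?", return_tensors="pt")
# forced_bos_token_id selects the target language. convert_tokens_to_ids is used
# because recent Transformers releases no longer provide tokenizer.lang_code_to_id.
translated = model.generate(**inputs, forced_bos_token_id=tokenizer.convert_tokens_to_ids("eng_Latn"))
print(tokenizer.decode(translated[0], skip_special_tokens=True))

# Vietnamese → Japanese (same inputs, different target language token)
translated = model.generate(**inputs, forced_bos_token_id=tokenizer.convert_tokens_to_ids("jpn_Jpan"))
print(tokenizer.decode(translated[0], skip_special_tokens=True))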
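
The calls above use the default decoding settings. To apply the configurable beam search described under Training, the beam width and length penalty can be passed to generate as num_beams and length_penalty. The helper below is a hypothetical sketch: build_generation_kwargs is not part of the model, the default of 4 beams is arbitrary, and reusing the 128-token limit as the output cap is an assumption.

```python
def build_generation_kwargs(num_beams: int = 4,
                            length_penalty: float = 1.0,
                            max_new_tokens: int = 128) -> dict:
    """Validate decoding settings and return keyword arguments for model.generate."""
    # The card states a configurable beam width of 1 to 10.
    if not 1 <= num_beams <= 10:
        raise ValueError("num_beams must be between 1 and 10")
    if max_new_tokens < 1:
        raise ValueError("max_new_tokens must be positive")
    kwargs = {"num_beams": num_beams, "max_new_tokens": max_new_tokens}
    if num_beams > 1:
        # length_penalty only applies to beam-based decoding;
        # larger values favour longer outputs.
        kwargs["length_penalty"] = length_penalty
    return kwargs
```

The returned dictionary would be unpacked into the generate call next to forced_bos_token_id, for example with build_generation_kwargs(num_beams=5, length_penalty=1.2). A width of 1 falls back to greedy decoding.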

Features

  • Trilingual Vietnamese ↔ English ↔ Japanese translation
  • Configurable beam search with length penalty
  • Unicode-based language auto-detection
  • Interactive trilingual Gradio interface
