# T5-Small Vietnamese

A T5-small model adapted to Vietnamese through continual pretraining with the ViT5 tokenizer.
## Model Description
This model combines:
- Architecture: google-t5/t5-small (~60M parameters)
- Tokenizer: VietAI/vit5-base tokenizer (Vietnamese-optimized)
- Pretraining: Span corruption denoising objective on Vietnamese text
The model was created by the following steps (a minimal sketch follows the list):
1. Loading the T5-small architecture
2. Replacing its tokenizer with ViT5's Vietnamese tokenizer
3. Resizing the embedding layer to match the new vocabulary
4. Pretraining on a Vietnamese corpus
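The actual training script is not published; below is a minimal sketch of these adaptation steps using the stock `transformers` APIs:

```python
from transformers import T5ForConditionalGeneration, AutoTokenizer

# Start from the original English T5-small checkpoint.
model = T5ForConditionalGeneration.from_pretrained("google-t5/t5-small")

# Swap in ViT5's Vietnamese tokenizer.
tokenizer = AutoTokenizer.from_pretrained("VietAI/vit5-base")

# Resize the shared embeddings (and the tied lm_head) to the ViT5 vocabulary.
model.resize_token_embeddings(len(tokenizer))
```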
## Training Details
### Training Data
- Dataset: VTSNLP/vietnamese_curated_dataset (a loading sketch follows this list)
- Samples: 2,500,000 text samples out of the dataset's 12,169,131 (7.5 GB)
- Max Length: 4,056 tokens
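The exact data preparation is not published; the following is a hedged sketch of drawing such a subset with the `datasets` library (the split name and shuffling are assumptions):

```python
from datasets import load_dataset

# Assumes the dataset exposes a "train" split; adjust if needed.
ds = load_dataset("VTSNLP/vietnamese_curated_dataset", split="train")
subset = ds.shuffle(seed=42).select(range(2_500_000))  # 2.5M-sample subset
```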
### Pretraining Objective
- Method: Span Corruption (T5-style denoising); a minimal sketch follows this list
- Noise Density: 15%
- Mean Span Length: 3.0 tokens
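To illustrate what span corruption with these hyperparameters produces, here is a self-contained sketch (the `span_corrupt` helper is hypothetical and simplified; the real preprocessing operates on token IDs):

```python
import random

def span_corrupt(tokens, noise_density=0.15, mean_span_length=3.0, seed=0):
    """Mask random spans of `tokens`; return (corrupted_input, target)."""
    rng = random.Random(seed)
    n_to_mask = max(1, round(len(tokens) * noise_density))
    n_spans = max(1, round(n_to_mask / mean_span_length))
    span_len = max(1, round(mean_span_length))
    starts = sorted(rng.sample(range(len(tokens)), n_spans))
    corrupted, target, pos, sid = [], [], 0, 0
    for start in starts:
        if start < pos:  # skip spans that would overlap the previous one
            continue
        corrupted += tokens[pos:start] + [f"<extra_id_{sid}>"]
        target += [f"<extra_id_{sid}>"] + tokens[start:start + span_len]
        pos, sid = start + span_len, sid + 1
    corrupted += tokens[pos:]
    return " ".join(corrupted), " ".join(target) + " </s>"

tokens = "Hà Nội là thủ đô của Việt Nam".split()
src, tgt = span_corrupt(tokens)
print(src)  # e.g. "Hà Nội là <extra_id_0> Việt Nam"
print(tgt)  # e.g. "<extra_id_0> thủ đô của </s>"
```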
### Compute Resources
| Resource | Details |
|---|---|
| Hardware | NVIDIA A100 80GB |
| Platform | Google Colab |
| Training Time | ~100 hours |
## Usage
### Basic Usage (Fill-mask / Denoising)
```python
from transformers import T5ForConditionalGeneration, AutoTokenizer

model = T5ForConditionalGeneration.from_pretrained("nbdaaa/t5-small-vietnamese")
tokenizer = AutoTokenizer.from_pretrained("nbdaaa/t5-small-vietnamese")

# Mask a span with a sentinel token and let the model fill it in.
text = "Bến Tre là <extra_id_0> của Việt Nam."
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))
```
Expected output:
```
<extra_id_0> một trong những tỉnh </s>
```
### Additional Examples
```python
test_cases = [
    "Hà Nội là <extra_id_0> của Việt Nam.",
    "Phở là món <extra_id_0> nổi tiếng của Việt Nam.",
    "Tôi <extra_id_0> học.",
    "Tiếng Việt là ngôn ngữ <extra_id_0> của người Việt.",
    "Con mèo đang <extra_id_0> trên ghế.",
    "Việt Nam là một <extra_id_0> nằm ở <extra_id_1> Á."
]

for text in test_cases:
    inputs = tokenizer(text, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        max_length=50,
        num_beams=4,
        early_stopping=True
    )
    print(tokenizer.decode(outputs[0], skip_special_tokens=False))
```
### Zero-shot Downstream Task Examples
Although the model is not fine-tuned for specific downstream tasks, it can perform several tasks in a zero-shot manner by leveraging T5’s text-to-text formulation.
#### Zero-shot Named Entity Recognition (NER)
```python
text = (
    "Ông Phạm Nhật Vượng là chủ tịch của tập đoàn Vingroup. "
    "Tên ông là <extra_id_0>, thực thể tổ chức là <extra_id_1>."
)
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))
```
Expected output:
```
<extra_id_0> ông Phạm Nhật Vượng <extra_id_1> Vingroup </s>
```
#### Zero-shot Contextual Question Answering (QA)
```python
text = (
    "Bối cảnh: Chiến thắng Điện Biên Phủ năm 1954 là một mốc son chói lọi "
    "trong lịch sử dân tộc Việt Nam. Dưới sự chỉ huy của Đại tướng Võ Nguyên Giáp, "
    "quân và dân ta đã đập tan tập đoàn cứ điểm mạnh nhất Đông Dương của thực dân Pháp "
    "sau 56 ngày đêm chiến đấu gian khổ. "
    "Câu hỏi: Ai là người chỉ huy quân đội Việt Nam trong chiến dịch Điện Biên Phủ? "
    "Trả lời: <extra_id_0>"
)
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))
```
Expected output:
```
<extra_id_0> Đại tướng Võ Nguyên Giáp </s>
```
## Intended Uses
This model can be used as:
- A base model for fine-tuning on Vietnamese NLP tasks:
  - Text summarization
  - Question answering
  - Text classification
  - Named Entity Recognition
  - Machine translation
- A fill-in-the-blank (denoising) text completion model
- A foundation for Vietnamese language understanding tasks
### Fine-tuning Example
```python
from transformers import T5ForConditionalGeneration, AutoTokenizer, Trainer, TrainingArguments

model = T5ForConditionalGeneration.from_pretrained("nbdaaa/t5-small-vietnamese")
tokenizer = AutoTokenizer.from_pretrained("nbdaaa/t5-small-vietnamese")

# Fine-tune on your downstream task
training_args = TrainingArguments(
    output_dir="./my-finetuned-model",
    per_device_train_batch_size=8,
    learning_rate=3e-5,
    num_train_epochs=3,
    # ... other arguments
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=your_dataset,
    # ...
)

trainer.train()
```
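Note that `your_dataset` above is a placeholder: for sequence-to-sequence fine-tuning you will typically tokenize both inputs and labels and pass a data collator such as `DataCollatorForSeq2Seq` (or use `Seq2SeqTrainer` with `Seq2SeqTrainingArguments`).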
## Model Architecture
```
T5ForConditionalGeneration(
  (shared): Embedding(36334, 512)  # Resized for the ViT5 tokenizer
  (encoder): T5Stack(
    (embed_tokens): Embedding(36334, 512)
    (block): ModuleList(6 layers)
    (final_layer_norm): T5LayerNorm()
  )
  (decoder): T5Stack(
    (embed_tokens): Embedding(36334, 512)
    (block): ModuleList(6 layers)
    (final_layer_norm): T5LayerNorm()
  )
  (lm_head): Linear(512, 36334)
)
```
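To sanity-check the ~60M parameter figure, you can count parameters after loading the checkpoint (reusing the `model` object from the usage snippet above):

```python
# Total parameter count; expected to be roughly 60M for this configuration.
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f}M parameters")
```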
## Citation
If you use this model, please cite:
```bibtex
@misc{t5-small-vietnamese,
  author = {nbdaaa},
  title = {T5-Small Vietnamese: A Vietnamese-adapted T5 model},
  year = {2024},
  publisher = {Hugging Face},
  url = {https://huggingface.co/nbdaaa/t5-small-vietnamese}
}
```