# mBERT Fine-tuned for Kyrgyz Punctuation Restoration
This model restores punctuation in Kyrgyz text. It is based on bert-base-multilingual-cased (mBERT) fine-tuned on a Kyrgyz punctuation dataset as a token classification task.
## Labels
| Label | Description |
|---|---|
| O | No punctuation |
| COMMA | Comma (,) |
| PERIOD | Period (.) |
| QUESTION | Question mark (?) |
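As a quick illustration of how these word-level labels map back to punctuated text (the labels here are hand-assigned for illustration, not model output):

```python
# Illustrative word-level labels for the example sentence from the Usage
# section; PUNCT_MAP mirrors the label table above.
sentence = "бүгүн аба ырайы жакшы болду биз сейилдөөгө чыктык"
labels = ['O', 'O', 'O', 'O', 'PERIOD', 'O', 'O', 'PERIOD']

PUNCT_MAP = {'O': '', 'COMMA': ',', 'PERIOD': '.', 'QUESTION': '?'}
restored = ' '.join(w + PUNCT_MAP[l] for w, l in zip(sentence.split(), labels))
print(restored)
# бүгүн аба ырайы жакшы болду. биз сейилдөөгө чыктык.
```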
## Evaluation Results
| Class | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| O | 0.959 | 0.967 | 0.963 | 16344 |
| COMMA | 0.742 | 0.692 | 0.717 | 2169 |
| PERIOD | 0.987 | 0.993 | 0.990 | 1984 |
| QUESTION | 0.638 | 0.648 | 0.643 | 125 |
| Weighted avg | 0.937 | 0.939 | 0.938 | 20622 |
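The weighted average row is the support-weighted mean of the per-class scores, which can be sanity-checked directly from the table (shown here for F1):

```python
# Per-class F1 and support, copied from the evaluation table above.
f1 = {'O': 0.963, 'COMMA': 0.717, 'PERIOD': 0.990, 'QUESTION': 0.643}
support = {'O': 16344, 'COMMA': 2169, 'PERIOD': 1984, 'QUESTION': 125}

total = sum(support.values())                              # 20622
weighted_f1 = sum(f1[c] * support[c] for c in f1) / total
print(round(weighted_f1, 3))  # 0.938
```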
## Comparison with XLM-RoBERTa
| Model | Weighted F1 | COMMA F1 | QUESTION F1 |
|---|---|---|---|
| mBERT (this model) | 0.938 | 0.717 | 0.643 |
| XLM-RoBERTa | 0.949 | 0.775 | 0.704 |
## Usage

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_name = "YOUR_USERNAME/mbert-kyrgyz-punctuation"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
model.eval()

ID2LABEL = {0: 'O', 1: 'COMMA', 2: 'PERIOD', 3: 'QUESTION'}
PUNCT_MAP = {'O': '', 'COMMA': ',', 'PERIOD': '.', 'QUESTION': '?'}

def restore_punctuation(text):
    words = text.split()
    encoding = tokenizer(
        words,
        is_split_into_words=True,
        return_tensors='pt',
        truncation=True,
        max_length=256,
    )
    with torch.no_grad():
        outputs = model(**encoding)
    preds = torch.argmax(outputs.logits, dim=-1)[0].tolist()
    word_ids = encoding.word_ids(batch_index=0)

    # The model was trained on the last subtoken of each word, so take the
    # prediction at each position whose word id differs from the next one.
    word_puncts = {}
    for i, wid in enumerate(word_ids):
        if wid is None:
            continue  # special tokens ([CLS], [SEP])
        if i == len(word_ids) - 1 or word_ids[i + 1] != wid:
            word_puncts[wid] = PUNCT_MAP[ID2LABEL[preds[i]]]

    return ' '.join(word + word_puncts.get(idx, '')
                    for idx, word in enumerate(words))

text = "бүгүн аба ырайы жакшы болду биз сейилдөөгө чыктык"
print(restore_punctuation(text))
```
## Training Details
- Base model: bert-base-multilingual-cased (177M parameters)
- Dataset: 14,028 Kyrgyz sentences (141,626 tokens)
- Epochs: 5
- Batch size: 16
- Learning rate: 5e-5
- Max sequence length: 256
- Optimizer: AdamW (weight decay 0.01, warmup ratio 0.1)
- FP16: enabled
- Hardware: Google Colab T4 GPU
- Label strategy: Last subtoken of each word
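The last-subtoken label strategy can be sketched as follows. This is a minimal, self-contained illustration with hard-coded word ids and a hypothetical `align_labels` helper; during actual training the word ids would come from the tokenizer's `encoding.word_ids()`:

```python
IGNORE_INDEX = -100  # positions the cross-entropy loss should skip

def align_labels(word_ids, word_labels):
    """Assign each word's label id to its last subtoken; all other
    positions (special tokens, non-final subtokens) get IGNORE_INDEX."""
    aligned = []
    for i, wid in enumerate(word_ids):
        if wid is None:
            aligned.append(IGNORE_INDEX)          # [CLS], [SEP], padding
        elif i + 1 == len(word_ids) or word_ids[i + 1] != wid:
            aligned.append(word_labels[wid])      # last subtoken of word wid
        else:
            aligned.append(IGNORE_INDEX)          # non-final subtoken
    return aligned

# Hypothetical word ids for a 2-word input split into 3 subtokens,
# with word labels 0 = O and 2 = PERIOD.
word_ids = [None, 0, 0, 1, None]
print(align_labels(word_ids, [0, 2]))  # [-100, -100, 0, 2, -100]
```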
## Citation

```bibtex
@article{uvalieva2025kyrgyz,
  title   = {Punctuation Restoration for Kyrgyz Language: A Comparative Study of Multilingual Transformer Models},
  author  = {Uvalieva, Zarina},
  journal = {ACM Transactions on Asian and Low-Resource Language Information Processing},
  year    = {2025}
}
```

GitHub: https://github.com/Zarina33/kyrgyz-punctuation-restoration