# mBERT Fine-tuned for Kyrgyz Punctuation Restoration
This model restores punctuation in Kyrgyz text. It is based on bert-base-multilingual-cased (mBERT) fine-tuned on a Kyrgyz punctuation dataset as a token classification task.
## Labels
| Label | Description |
|---|---|
| O | No punctuation |
| COMMA | Comma (,) |
| PERIOD | Period (.) |
| QUESTION | Question mark (?) |
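As a quick illustration of how these word-level labels map back to punctuated text (the labels here are hand-assigned for illustration, not model output):

```python
# Illustrative word-level labels for the example sentence from the Usage
# section; PUNCT_MAP mirrors the label table above.
sentence = "бүгүн аба ырайы жакшы болду биз сейилдөөгө чыктык"
labels = ['O', 'O', 'O', 'O', 'PERIOD', 'O', 'O', 'PERIOD']

PUNCT_MAP = {'O': '', 'COMMA': ',', 'PERIOD': '.', 'QUESTION': '?'}
restored = ' '.join(w + PUNCT_MAP[l] for w, l in zip(sentence.split(), labels))
print(restored)
# бүгүн аба ырайы жакшы болду. биз сейилдөөгө чыктык.
```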
## Evaluation Results
| Class | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| O | 0.959 | 0.967 | 0.963 | 16344 |
| COMMA | 0.742 | 0.692 | 0.717 | 2169 |
| PERIOD | 0.987 | 0.993 | 0.990 | 1984 |
| QUESTION | 0.638 | 0.648 | 0.643 | 125 |
| Weighted avg | 0.937 | 0.939 | 0.938 | 20622 |
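The weighted average row is the support-weighted mean of the per-class scores, which can be sanity-checked directly from the table (shown here for F1):

```python
# Per-class F1 and support, copied from the evaluation table above.
f1 = {'O': 0.963, 'COMMA': 0.717, 'PERIOD': 0.990, 'QUESTION': 0.643}
support = {'O': 16344, 'COMMA': 2169, 'PERIOD': 1984, 'QUESTION': 125}

total = sum(support.values())                              # 20622
weighted_f1 = sum(f1[c] * support[c] for c in f1) / total
print(round(weighted_f1, 3))  # 0.938
```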
## Comparison with XLM-RoBERTa
| Model | Weighted F1 | COMMA F1 | QUESTION F1 |
|---|---|---|---|
| mBERT (this model) | 0.938 | 0.717 | 0.643 |
| XLM-RoBERTa | 0.949 | 0.775 | 0.704 |
## Usage

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_name = "YOUR_USERNAME/mbert-kyrgyz-punctuation"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
model.eval()

ID2LABEL = {0: 'O', 1: 'COMMA', 2: 'PERIOD', 3: 'QUESTION'}
PUNCT_MAP = {'O': '', 'COMMA': ',', 'PERIOD': '.', 'QUESTION': '?'}

def restore_punctuation(text):
    words = text.split()
    encoding = tokenizer(
        words,
        is_split_into_words=True,
        return_tensors='pt',
        truncation=True,
        max_length=256,
    )
    with torch.no_grad():
        outputs = model(**encoding)
    preds = torch.argmax(outputs.logits, dim=-1)[0].tolist()
    word_ids = encoding.word_ids(batch_index=0)

    # The model was trained on the last subtoken of each word, so take the
    # prediction at each position whose word id differs from the next one.
    word_puncts = {}
    for i, wid in enumerate(word_ids):
        if wid is None:
            continue  # special tokens ([CLS], [SEP])
        if i == len(word_ids) - 1 or word_ids[i + 1] != wid:
            word_puncts[wid] = PUNCT_MAP[ID2LABEL[preds[i]]]

    return ' '.join(word + word_puncts.get(idx, '')
                    for idx, word in enumerate(words))

text = "бүгүн аба ырайы жакшы болду биз сейилдөөгө чыктык"
print(restore_punctuation(text))
```
## Training Details
- Base model: bert-base-multilingual-cased (177M parameters)
- Dataset: 14,028 Kyrgyz sentences (141,626 tokens)
- Epochs: 5
- Batch size: 16
- Learning rate: 5e-5
- Max sequence length: 256
- Optimizer: AdamW (weight decay 0.01, warmup ratio 0.1)
- FP16: enabled
- Hardware: Google Colab T4 GPU
- Label strategy: Last subtoken of each word
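The last-subtoken label strategy can be sketched as follows. This is a minimal, self-contained illustration with hard-coded word ids and a hypothetical `align_labels` helper; during actual training the word ids would come from the tokenizer's `encoding.word_ids()`:

```python
IGNORE_INDEX = -100  # positions the cross-entropy loss should skip

def align_labels(word_ids, word_labels):
    """Assign each word's label id to its last subtoken; all other
    positions (special tokens, non-final subtokens) get IGNORE_INDEX."""
    aligned = []
    for i, wid in enumerate(word_ids):
        if wid is None:
            aligned.append(IGNORE_INDEX)          # [CLS], [SEP], padding
        elif i + 1 == len(word_ids) or word_ids[i + 1] != wid:
            aligned.append(word_labels[wid])      # last subtoken of word wid
        else:
            aligned.append(IGNORE_INDEX)          # non-final subtoken
    return aligned

# Hypothetical word ids for a 2-word input split into 3 subtokens,
# with word labels 0 = O and 2 = PERIOD.
word_ids = [None, 0, 0, 1, None]
print(align_labels(word_ids, [0, 2]))  # [-100, -100, 0, 2, -100]
```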
## Citation

```bibtex
@article{uvalieva2025kyrgyz,
  title   = {Punctuation Restoration for Kyrgyz Language: A Comparative Study of Multilingual Transformer Models},
  author  = {Uvalieva, Zarina},
  journal = {ACM Transactions on Asian and Low-Resource Language Information Processing},
  year    = {2025}
}
```

GitHub: https://github.com/Zarina33/kyrgyz-punctuation-restoration