mBERT Fine-tuned for Kyrgyz Punctuation Restoration

This model restores punctuation in Kyrgyz text. It is bert-base-multilingual-cased (mBERT) fine-tuned on a Kyrgyz punctuation dataset, framed as a token-classification task: for each word, the model predicts which punctuation mark (if any) should follow it.

Labels

| Label    | Description       |
|----------|-------------------|
| O        | No punctuation    |
| COMMA    | Comma (,)         |
| PERIOD   | Period (.)        |
| QUESTION | Question mark (?) |
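As a quick, model-free illustration of how these per-word labels turn back into text: each word is emitted followed by the character its label maps to (this mirrors the `PUNCT_MAP` used in the usage example below).

```python
# Minimal sketch (no model involved): attach the punctuation predicted
# for each word and rejoin the sentence.
PUNCT_MAP = {'O': '', 'COMMA': ',', 'PERIOD': '.', 'QUESTION': '?'}

def apply_labels(words, labels):
    """Join each word with the punctuation mark its label maps to."""
    return ' '.join(w + PUNCT_MAP[l] for w, l in zip(words, labels))

words = ['бүгүн', 'аба', 'ырайы', 'жакшы']
labels = ['O', 'O', 'O', 'PERIOD']
print(apply_labels(words, labels))  # бүгүн аба ырайы жакшы.
```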

Evaluation Results

| Class        | Precision | Recall | F1-score | Support |
|--------------|-----------|--------|----------|---------|
| O            | 0.959     | 0.967  | 0.963    | 16344   |
| COMMA        | 0.742     | 0.692  | 0.717    | 2169    |
| PERIOD       | 0.987     | 0.993  | 0.990    | 1984    |
| QUESTION     | 0.638     | 0.648  | 0.643    | 125     |
| Weighted avg | 0.937     | 0.939  | 0.938    | 20622   |

Comparison with XLM-RoBERTa

| Model              | Weighted F1 | COMMA F1 | QUESTION F1 |
|--------------------|-------------|----------|-------------|
| mBERT (this model) | 0.938       | 0.717    | 0.643       |
| XLM-RoBERTa        | 0.949       | 0.775    | 0.704       |

Usage

from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_name = "Zarinaaa/mbert-kyrgyz-punctuation"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
model.eval()

ID2LABEL = {0: 'O', 1: 'COMMA', 2: 'PERIOD', 3: 'QUESTION'}
PUNCT_MAP = {'O': '', 'COMMA': ',', 'PERIOD': '.', 'QUESTION': '?'}

def restore_punctuation(text):
    words = text.split()
    encoding = tokenizer(
        words,
        is_split_into_words=True,
        return_tensors='pt',
        truncation=True,
        max_length=256,
    )
    with torch.no_grad():
        outputs = model(**encoding)
    preds = torch.argmax(outputs.logits, dim=-1)[0].tolist()
    word_ids = encoding.word_ids(batch_index=0)

    # The model was trained with the label on the last subtoken of each
    # word, so read the prediction from the last subtoken only.
    word_puncts = {}
    for i, wid in enumerate(word_ids):
        if wid is None:  # special tokens ([CLS], [SEP])
            continue
        if i == len(word_ids) - 1 or word_ids[i + 1] != wid:
            word_puncts[wid] = PUNCT_MAP[ID2LABEL[preds[i]]]

    # Words truncated away at max_length get no punctuation appended.
    return ' '.join(word + word_puncts.get(idx, '')
                    for idx, word in enumerate(words))

text = "бүгүн аба ырайы жакшы болду биз сейилдөөгө чыктык"
print(restore_punctuation(text))

Training Details

  • Base model: bert-base-multilingual-cased (177M parameters)
  • Dataset: 14,028 Kyrgyz sentences (141,626 tokens)
  • Epochs: 5
  • Batch size: 16
  • Learning rate: 5e-5
  • Max sequence length: 256
  • Optimizer: AdamW (weight decay 0.01, warmup ratio 0.1)
  • FP16: enabled
  • Hardware: Google Colab T4 GPU
  • Label strategy: Last subtoken of each word
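The hyperparameters above correspond roughly to a `transformers` `TrainingArguments` configuration like the following. This is a sketch, not the actual training script; `output_dir` is a placeholder, and the dataset wiring is omitted.

```python
from transformers import TrainingArguments

# Sketch of the listed hyperparameters; "mbert-kyrgyz-punct" is a
# placeholder output directory, not a value from the original script.
training_args = TrainingArguments(
    output_dir="mbert-kyrgyz-punct",
    num_train_epochs=5,
    per_device_train_batch_size=16,
    learning_rate=5e-5,
    weight_decay=0.01,
    warmup_ratio=0.1,
    fp16=True,  # mixed precision on the Colab T4 GPU
)
```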

Citation

@article{uvalieva2025kyrgyz,
  title   = {Punctuation Restoration for Kyrgyz Language: A Comparative Study of Multilingual Transformer Models},
  author  = {Uvalieva, Zarina},
  journal = {ACM Transactions on Asian and Low-Resource Language Information Processing},
  year    = {2025}
}

GitHub : https://github.com/Zarina33/kyrgyz-punctuation-restoration
