CKTN-EKECTRA

CKTN-EKECTRA is a vocabulary-extended version of google/rembert adapted for three low-resource Southeast Asian languages: Khmer, Cham, and Tày-Nùng.

The model was produced by injecting 4,213 new subword pieces directly into the SentencePiece Unigram model via Protobuf surgery, followed by Fast Vocabulary Transfer (FVT) embedding initialization. It is released as a Stage 1 checkpoint — vocabulary has been extended and embeddings initialized, but continued pre-training (MLM) on the target corpora is still required for the new tokens to reach full representational quality.
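The FVT initialization described above can be sketched roughly as follows. This is a minimal illustration, not the release script: the `fvt_init` helper, its tokenizer-call interface, and the use of NumPy in place of the model's actual embedding tensors are all assumptions.

```python
import numpy as np

def fvt_init(new_token, old_tokenizer, old_embeddings):
    """FVT: initialize a new token's embedding as the mean of the embeddings
    of the pieces the *old* tokenizer splits it into."""
    ids = old_tokenizer(new_token, add_special_tokens=False)["input_ids"]
    if ids:
        return old_embeddings[ids].mean(axis=0)
    # Character-level fallback: average over the pieces of each character
    char_ids = [i for ch in new_token
                for i in old_tokenizer(ch, add_special_tokens=False)["input_ids"]]
    if char_ids:
        return old_embeddings[char_ids].mean(axis=0)
    return old_embeddings.mean(axis=0)  # last resort: global mean vector
```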


Model Details

| Property | Value |
|---|---|
| Base model | google/rembert |
| Architecture | RemBERT (32 layers, 1152 hidden, 18 heads) |
| Original vocab size | 250,300 |
| Extended vocab size | 254,513 |
| New tokens injected | +4,213 |
| Injection method | SPM Protobuf surgery (`sentencepiece_model_pb2`) |
| Embedding init | FVT sub-piece averaging + character-level fallback |
| Stage | 1 / 2 — vocab extension only (pre-training pending) |

Target Languages

| Language | Code | Script |
|---|---|---|
| Khmer | km | ខ្មែរ (Khmer) |
| Cham | cjm | ꨀꩃ (Cham) |
| Tày-Nùng | tay | Latin Extended |

Why SPM Protobuf Surgery Instead of add_tokens()

HuggingFace's tokenizer.add_tokens() stores new pieces in added_tokens_encoder — a separate dictionary that is applied as a greedy pre-pass before SentencePiece sees the text. This breaks Viterbi decoding in three ways:

  1. Greedy fragmentation — SPM Unigram's Viterbi finds the globally optimal segmentation. Greedily pre-matching a substring destroys that optimality for the surrounding context, causing leftover fragments to fall back to character-level tokenization and spiking fertility.

  2. ▁ prefix mismatch — add_tokens() stores pieces like ▁word as literal strings. The greedy matcher then searches raw text for the ▁ (U+2581) character, but raw text contains only ASCII spaces — so new word-start tokens never fire.

  3. Continuation ratio inflation — pieces stored without the ▁ prefix fail the startswith("▁") check in _build_word_start_ids(), causing every match to be miscounted as a continuation token.

By injecting new pieces directly into the ModelProto and reloading the tokenizer from the patched .model file, all three issues are eliminated: Viterbi sees the full extended vocabulary, prefixes are consistent, and log-probability scores are estimated from sub-piece decomposition so the decoder treats new tokens as first-class pieces.


Intrinsic Metrics

Evaluated on held-out lines from each corpus using the methodology of Rust et al. (ACL 2021).

| Language | Metric | PRE | POST | Δ |
|---|---|---|---|---|
| Khmer | Fertility ↓ | | | |
| Khmer | Coverage % ↑ | | | |
| Khmer | Continuation Ratio % ↓ | | | |
| Cham | Fertility ↓ | | | |
| Cham | Coverage % ↑ | | | |
| Tày-Nùng | Fertility ↓ | | | |
| Tày-Nùng | Coverage % ↑ | | | |

Fill in the values from RemBERT/intrinsic_metrics.json after Stage 1 completes.
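The three metrics can be computed along these lines (a sketch following the definitions in Rust et al.; the `intrinsic_metrics` helper and its tokenizer interface are hypothetical):

```python
def intrinsic_metrics(lines, tokenize, unk_token="<unk>"):
    """Fertility = subwords per whitespace word (lower is better);
    coverage = % of tokens that are not UNK (higher is better);
    continuation ratio = % of tokens lacking the ▁ word-start prefix (lower is better)."""
    n_words = n_tokens = n_unk = n_cont = 0
    for line in lines:
        tokens = tokenize(line)
        n_words += len(line.split())
        n_tokens += len(tokens)
        n_unk += sum(t == unk_token for t in tokens)
        n_cont += sum(not t.startswith("\u2581") for t in tokens)
    return {
        "fertility": n_tokens / max(n_words, 1),
        "coverage_pct": 100.0 * (1 - n_unk / max(n_tokens, 1)),
        "continuation_pct": 100.0 * n_cont / max(n_tokens, 1),
    }
```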


Usage

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("ducanhdinh/CKTN-EKECTRA", use_fast=False)
model = AutoModel.from_pretrained("ducanhdinh/CKTN-EKECTRA")

# Khmer example
text = "សួស្តី"  # "Hello"
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # [1, seq_len, 1152]
```

Important: always pass use_fast=False when loading the tokenizer. The fast tokenizer reads its vocabulary from tokenizer.json (which caches the original 250,300-token vocab) rather than from the patched .model file, silently ignoring all new pieces. The slow tokenizer reads directly from the SPM binary and always reflects the true extended vocabulary.
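A quick guard against accidentally loading the stale fast tokenizer (a hypothetical helper, not part of the release; the expected size is the extended vocabulary from the table above):

```python
def check_vocab_extended(tokenizer, expected_size=254_513):
    """Raise if the tokenizer is missing the injected pieces, e.g. a fast
    tokenizer that silently loaded the cached 250,300-piece tokenizer.json."""
    actual = len(tokenizer)
    if actual != expected_size:
        raise ValueError(
            f"Tokenizer has {actual} pieces, expected {expected_size}; "
            "did you forget use_fast=False?"
        )
    return actual
```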


Training Data

New tokens were selected from the following corpora (minimum frequency threshold: 50):

  • Khmer — news articles and web text; word-segmented with khmer-nltk
  • Cham — web-collected Cham-script text
  • Tày-Nùng — Latin-script documents in the Tày and Nùng languages of northern Vietnam

A SentencePiece Unigram model (vocab size 8,000, character coverage 0.9999) was trained on all three corpora jointly. Candidate tokens were filtered to remove punctuation, control characters, and single-byte non-script characters before frequency thresholding.
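The filtering step might look roughly like this. It is illustrative only: `keep_candidate`, `select_tokens`, and the exact Unicode-category rules are assumptions beyond what the card states.

```python
import unicodedata

def keep_candidate(piece):
    """Drop punctuation-only, control-character, and single-byte non-script pieces."""
    surface = piece.lstrip("\u2581")  # strip the SentencePiece word-start marker
    if not surface:
        return False
    categories = {unicodedata.category(ch) for ch in surface}
    if all(cat.startswith(("P", "C")) for cat in categories):
        return False  # pure punctuation (P*) or control/format (C*) characters
    if len(surface.encode("utf-8")) == 1 and not surface.isalpha():
        return False  # single-byte non-script character
    return True

def select_tokens(candidates, freq, min_freq=50):
    """Apply the character filters, then the corpus frequency threshold."""
    return [p for p in candidates if keep_candidate(p) and freq.get(p, 0) >= min_freq]
```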


Limitations

  • This is a Stage 1 checkpoint only. The 4,213 new embedding rows have been initialized via FVT sub-piece averaging but have not been updated by any gradient steps. Downstream performance on Khmer/Cham/Tày-Nùng tasks will be limited until Stage 2 (continued MLM pre-training) is complete.
  • tie_word_embeddings is false (inherited from RemBERT). Both input and output embedding matrices must be covered by the optimizer during continued pre-training.
  • The model has not been fine-tuned on any downstream task.
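Because the embeddings are untied, a Stage 2 setup should verify that both matrices actually reach the optimizer. A PyTorch sketch, assuming the standard `get_input_embeddings`/`get_output_embeddings` accessors (the `stage2_optimizer` helper itself is hypothetical):

```python
import torch

def stage2_optimizer(model, lr=1e-5):
    """Build a Stage 2 optimizer, checking both untied embedding matrices are covered."""
    in_emb = model.get_input_embeddings().weight
    out_emb = model.get_output_embeddings().weight
    # tie_word_embeddings=false: these must be two distinct tensors
    assert in_emb.data_ptr() != out_emb.data_ptr(), "embeddings unexpectedly tied"
    params = [p for p in model.parameters() if p.requires_grad]
    assert any(p is in_emb for p in params), "input embeddings not trainable"
    assert any(p is out_emb for p in params), "output embeddings not trainable"
    return torch.optim.AdamW(params, lr=lr)
```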

Citation

If you use this model, please cite the following works:

@inproceedings{gee-etal-2022-fast,
  title     = {Fast Vocabulary Transfer for Language Model Compression},
  author    = {Gee, Leonidas and Zugarini, Andrea and Rigutini, Leonardo and Torroni, Paolo},
  booktitle = {Proceedings of EMNLP 2022},
  year      = {2022},
  url       = {https://aclanthology.org/2022.emnlp-main.802/},
}

@inproceedings{rust-etal-2021-good,
  title     = {How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models},
  author    = {Rust, Phillip and Pfeiffer, Jonas and Vulić, Ivan and Ruder, Sebastian and Gurevych, Iryna},
  booktitle = {Proceedings of ACL 2021},
  year      = {2021},
  url       = {https://aclanthology.org/2021.acl-long.571/},
}

@article{chung-etal-2020-rembert,
  title   = {Rethinking Embedding Coupling in Pre-trained Language Models},
  author  = {Chung, Hyung Won and Févry, Thibault and Tsai, Henry and Johnson, Melvin and Ruder, Sebastian},
  journal = {arXiv preprint arXiv:2010.12821},
  year    = {2020},
}