# CKTN-EKECTRA
CKTN-EKECTRA is a vocabulary-extended version of google/rembert adapted for three low-resource Southeast Asian languages: Khmer, Cham, and Tày-Nùng.
The model was produced by injecting 4,213 new subword pieces directly into the SentencePiece Unigram model via Protobuf surgery, followed by Fast Vocabulary Transfer (FVT) embedding initialization. It is released as a Stage 1 checkpoint — vocabulary has been extended and embeddings initialized, but continued pre-training (MLM) on the target corpora is still required for the new tokens to reach full representational quality.
## Model Details
| Property | Value |
|---|---|
| Base model | google/rembert |
| Architecture | RemBERT (32 layers, 1152 hidden, 18 heads) |
| Original vocab size | 250,300 |
| Extended vocab size | 254,513 |
| New tokens injected | +4,213 |
| Injection method | SPM Protobuf Surgery (sentencepiece_model_pb2) |
| Embedding init | FVT sub-piece averaging + character-level fallback |
| Stage | 1 / 2 — vocab extension only (pre-training pending) |
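The FVT initialization used in Stage 1 can be illustrated with a minimal, dependency-free sketch (the function name and interface here are illustrative, not the release scripts): each new token is decomposed by the *original* tokenizer, and its embedding row is set to the mean of the resulting sub-piece embeddings, with a character-level fallback when the direct decomposition yields only UNKs.

```python
def fvt_init(new_tokens, tokenize, unk_id, old_emb):
    """Fast Vocabulary Transfer (Gee et al., 2022), sketched: initialize
    each new token's embedding as the mean of its sub-piece embeddings
    under the OLD tokenizer. `tokenize` maps a string to old-vocab ids;
    `old_emb` is the old embedding matrix as a list of rows."""
    def mean_rows(ids):
        dim = len(old_emb[0])
        return [sum(old_emb[i][d] for i in ids) / len(ids) for d in range(dim)]

    rows = []
    for tok in new_tokens:
        surface = tok.lstrip("▁")               # drop the SPM word-start marker
        ids = [i for i in tokenize(surface) if i != unk_id]
        if not ids:                             # fallback: character-level pieces
            ids = [i for ch in surface for i in tokenize(ch) if i != unk_id]
        if not ids:                             # last resort: mean of full vocab
            ids = list(range(len(old_emb)))
        rows.append(mean_rows(ids))
    return rows
```

In the actual checkpoint the averaged rows are written into both the input and output embedding matrices (RemBERT's embeddings are untied; see Limitations).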
## Target Languages

| Language | Code | Script |
|---|---|---|
| Khmer | km | ខ្មែរ (Khmer) |
| Cham | cjm | ꨀꩃ (Cham) |
| Tày-Nùng | tay | Latin Extended |
## Why SPM Protobuf Surgery Instead of `add_tokens()`
HuggingFace's `tokenizer.add_tokens()` stores new pieces in `added_tokens_encoder`, a separate dictionary that is applied as a greedy pre-pass before SentencePiece sees the text. This breaks Viterbi decoding in three ways:

1. **Greedy fragmentation.** SPM Unigram's Viterbi search finds the globally optimal segmentation. Greedily pre-matching a substring destroys that optimality for the surrounding context, causing leftover fragments to fall back to character-level tokenization and spiking fertility.
2. **`▁` prefix mismatch.** `add_tokens()` stores pieces like `▁word` as literal strings. The greedy matcher then looks for the Unicode `▁` character in raw text, but raw text only contains ASCII spaces, so new word-start tokens never fire.
3. **Continuation ratio inflation.** Pieces stored without `▁` fail the `startswith("▁")` check in `_build_word_start_ids()`, so every match is miscounted as a continuation token.
By injecting new pieces directly into the ModelProto and reloading the tokenizer from the
patched .model file, all three issues are eliminated: Viterbi sees the full extended
vocabulary, ▁ prefixes are consistent, and log-probability scores are estimated from
sub-piece decomposition so the decoder treats new tokens as first-class pieces.
## Intrinsic Metrics
Evaluated on held-out lines from each corpus using the methodology of Rust et al. (ACL 2021).
| Language | Metric | PRE | POST | Δ |
|---|---|---|---|---|
| Khmer | Fertility ↓ | — | — | — |
| Khmer | Coverage % ↑ | — | — | — |
| Khmer | Continuation Ratio % ↓ | — | — | — |
| Cham | Fertility ↓ | — | — | — |
| Cham | Coverage % ↑ | — | — | — |
| Tày-Nùng | Fertility ↓ | — | — | — |
| Tày-Nùng | Coverage % ↑ | — | — | — |
*Fill in the values from `RemBERT/intrinsic_metrics.json` after Stage 1 completes.*
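The three metrics can be recomputed with a short script in the spirit of Rust et al. (2021). The definitions below (pieces per whitespace word; share of pieces that do not start with `▁`; share of non-UNK pieces) are a plain-vanilla reading of the metric names, not the exact evaluation code of this release:

```python
def intrinsic_metrics(lines, tokenize, unk="<unk>"):
    """Tokenizer-intrinsic metrics over held-out lines.
    `tokenize` maps a string to a list of surface pieces (with ▁ markers).
    fertility: mean pieces per whitespace-delimited word (lower is better);
    continuation ratio: % of pieces not starting a word (lower is better);
    coverage: % of pieces that are not UNK (higher is better)."""
    n_words = n_pieces = n_cont = n_known = 0
    for line in lines:
        pieces = tokenize(line)
        n_words += len(line.split())
        n_pieces += len(pieces)
        n_cont += sum(1 for p in pieces if not p.startswith("▁"))
        n_known += sum(1 for p in pieces if p != unk)
    return {
        "fertility": n_pieces / max(n_words, 1),
        "continuation_ratio_pct": 100 * n_cont / max(n_pieces, 1),
        "coverage_pct": 100 * n_known / max(n_pieces, 1),
    }
```

Note that whitespace word counts are only meaningful for Tày-Nùng; for Khmer the release segments words with `khmer-nltk` first (see Training Data).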
## Usage
```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("ducanhdinh/CKTN-EKECTRA", use_fast=False)
model = AutoModel.from_pretrained("ducanhdinh/CKTN-EKECTRA")

# Khmer example
text = "សួស្តី"
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # [1, seq_len, 1152]
```
**Important:** always pass `use_fast=False` when loading the tokenizer. The fast tokenizer reads its vocabulary from `tokenizer.json` (which caches the original 250,300-token vocab) rather than from the patched `.model` file, silently ignoring all new pieces. The slow tokenizer reads directly from the SPM binary and always reflects the true extended vocabulary.
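A defensive check such as the following (an illustrative helper, not part of the release) catches the fast-tokenizer pitfall immediately after loading:

```python
def assert_extended_vocab(tokenizer, expected_size=254513):
    """Fail fast if the cached tokenizer.json was loaded instead of the
    patched SPM model, in which case the vocabulary silently stays at
    the original 250,300 entries."""
    actual = len(tokenizer)
    if actual != expected_size:
        raise RuntimeError(
            f"expected {expected_size} pieces but got {actual}; "
            "reload with use_fast=False so the patched .model file is used"
        )
```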
## Training Data
New tokens were selected from the following corpora (minimum frequency threshold: 50):
- Khmer — news articles and web text; word-segmented with `khmer-nltk`
- Cham — web-collected Cham-script text
- Tày-Nùng — Latin-script documents in the Tày and Nùng languages of northern Vietnam
A SentencePiece Unigram model (vocab size 8,000, character coverage 0.9999) was trained on all three corpora jointly. Candidate tokens were filtered to remove punctuation, control characters, and single-byte non-script characters before frequency thresholding.
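The candidate filtering step described above might look like the following sketch (illustrative; the exact Unicode-category rules of the release scripts may differ):

```python
import unicodedata

def filter_candidates(counts, min_freq=50):
    """Filter candidate pieces before injection: drop punctuation-only
    pieces, control/format characters, and single-character non-script
    pieces, then apply the minimum frequency threshold (50 here).
    `counts` maps piece (with optional ▁ marker) -> corpus frequency."""
    def keep(piece):
        surface = piece.lstrip("▁")
        if not surface:
            return False
        cats = {unicodedata.category(ch) for ch in surface}
        if cats & {"Cc", "Cf"}:                    # control / format chars
            return False
        if all(c.startswith("P") for c in cats):   # pure punctuation
            return False
        if len(surface) == 1 and not surface.isalpha():
            return False
        return True

    return {p: n for p, n in counts.items() if n >= min_freq and keep(p)}
```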
## Limitations
- This is a Stage 1 checkpoint only. The 4,213 new embedding rows have been initialized via FVT sub-piece averaging but have not been updated by any gradient steps. Downstream performance on Khmer/Cham/Tày-Nùng tasks will be limited until Stage 2 (continued MLM pre-training) is complete.
- `tie_word_embeddings` is `false` (inherited from RemBERT). Both input and output embedding matrices must be covered by the optimizer during continued pre-training.
- The model has not been fine-tuned on any downstream task.
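When setting up the Stage 2 optimizer, the untied matrices can be collected explicitly. This helper is an illustrative sketch built on the standard `get_input_embeddings`/`get_output_embeddings` accessors:

```python
def embedding_params(model):
    """With tie_word_embeddings=false, the input and output embedding
    matrices are distinct tensors. Continued MLM pre-training must
    optimize both; otherwise the output head keeps scoring the 4,213
    new tokens with their frozen FVT initialization."""
    inp = model.get_input_embeddings().weight
    out = model.get_output_embeddings().weight
    assert inp is not out, "embeddings are unexpectedly tied"
    return [inp, out]
```

Pass the returned list (alongside the other model parameters) to the optimizer's parameter groups.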
## Citation
If you use this model, please cite the following works:
@inproceedings{gee-etal-2022-fast,
title = {Fast Vocabulary Transfer for Language Model Compression},
author = {Gee, Leonidas and Zugarini, Andrea and Rigutini, Leonardo and Torroni, Paolo},
booktitle = {Proceedings of EMNLP 2022},
year = {2022},
url = {https://aclanthology.org/2022.emnlp-main.802/},
}
@inproceedings{rust-etal-2021-good,
title = {How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models},
author = {Rust, Phillip and Pfeiffer, Jonas and Vulic, Ivan and Ruder, Sebastian and Gurevych, Iryna},
booktitle = {Proceedings of ACL 2021},
year = {2021},
url = {https://aclanthology.org/2021.acl-long.571/},
}
@article{chung-etal-2020-rembert,
title = {Rethinking Embedding Coupling in Pre-trained Language Models},
author = {Chung, Hyung Won and Févry, Thibault and Tsai, Henry and Johnson, Melvin and Ruder, Sebastian},
journal = {arXiv preprint arXiv:2010.12821},
year = {2020},
}