CKTN-EKECTRA

CKTN-EKECTRA is a vocabulary-extended version of google/rembert adapted for three low-resource Southeast Asian languages: Khmer, Cham, and Tày-Nùng.

The model was produced by injecting 4,213 new subword pieces directly into the SentencePiece Unigram model via Protobuf surgery, followed by Fast Vocabulary Transfer (FVT) embedding initialization. It is released as a Stage 1 checkpoint — vocabulary has been extended and embeddings initialized, but continued pre-training (MLM) on the target corpora is still required for the new tokens to reach full representational quality.
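The FVT initialization described above can be sketched roughly as follows. This is a minimal illustration, not the release script: the `fvt_init` helper, its tokenizer-call interface, and the use of NumPy in place of the model's actual embedding tensors are all assumptions.

```python
import numpy as np

def fvt_init(new_token, old_tokenizer, old_embeddings):
    """FVT: initialize a new token's embedding as the mean of the embeddings
    of the pieces the *old* tokenizer splits it into."""
    ids = old_tokenizer(new_token, add_special_tokens=False)["input_ids"]
    if ids:
        return old_embeddings[ids].mean(axis=0)
    # Character-level fallback: average over the pieces of each character
    char_ids = [i for ch in new_token
                for i in old_tokenizer(ch, add_special_tokens=False)["input_ids"]]
    if char_ids:
        return old_embeddings[char_ids].mean(axis=0)
    return old_embeddings.mean(axis=0)  # last resort: global mean vector
```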


Model Details

| Property | Value |
|---|---|
| Base model | google/rembert |
| Architecture | RemBERT (32 layers, 1152 hidden, 18 heads) |
| Original vocab size | 250,300 |
| Extended vocab size | 254,513 |
| New tokens injected | +4,213 |
| Injection method | SPM Protobuf surgery (`sentencepiece_model_pb2`) |
| Embedding init | FVT sub-piece averaging + character-level fallback |
| Stage | 1 / 2 — vocab extension only (pre-training pending) |

Target Languages

| Language | Code | Script |
|---|---|---|
| Khmer | km | ខ្មែរ (Khmer) |
| Cham | cjm | ꨀꩃ (Cham) |
| Tày-Nùng | tay | Latin Extended |

Why SPM Protobuf Surgery Instead of add_tokens()

HuggingFace's tokenizer.add_tokens() stores new pieces in added_tokens_encoder — a separate dictionary that is applied as a greedy pre-pass before SentencePiece sees the text. This breaks Viterbi decoding in three ways:

  1. Greedy fragmentation — SPM Unigram's Viterbi finds the globally optimal segmentation. Greedily pre-matching a substring destroys that optimality for the surrounding context, causing leftover fragments to fall back to character-level tokenization and spiking fertility.

  2. ▁ prefix mismatch — add_tokens() stores pieces like ▁word as literal strings. The greedy matcher then searches raw text for the ▁ (U+2581) character, but raw text contains only ASCII spaces — so new word-start tokens never fire.

  3. Continuation ratio inflation — pieces stored without the ▁ prefix fail the startswith("▁") check in _build_word_start_ids(), causing every match to be miscounted as a continuation token.

By injecting new pieces directly into the ModelProto and reloading the tokenizer from the patched .model file, all three issues are eliminated: Viterbi sees the full extended vocabulary, prefixes are consistent, and log-probability scores are estimated from sub-piece decomposition so the decoder treats new tokens as first-class pieces.


Intrinsic Metrics

Evaluated on held-out lines from each corpus using the methodology of Rust et al. (ACL 2021).

| Language | Metric | PRE | POST | Δ |
|---|---|---|---|---|
| Khmer | Fertility ↓ | | | |
| Khmer | Coverage % ↑ | | | |
| Khmer | Continuation Ratio % ↓ | | | |
| Cham | Fertility ↓ | | | |
| Cham | Coverage % ↑ | | | |
| Tày-Nùng | Fertility ↓ | | | |
| Tày-Nùng | Coverage % ↑ | | | |

Fill in the values from RemBERT/intrinsic_metrics.json after Stage 1 completes.
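The three metrics can be computed along these lines (a sketch following the definitions in Rust et al.; the `intrinsic_metrics` helper and its tokenizer interface are hypothetical):

```python
def intrinsic_metrics(lines, tokenize, unk_token="<unk>"):
    """Fertility = subwords per whitespace word (lower is better);
    coverage = % of tokens that are not UNK (higher is better);
    continuation ratio = % of tokens lacking the ▁ word-start prefix (lower is better)."""
    n_words = n_tokens = n_unk = n_cont = 0
    for line in lines:
        tokens = tokenize(line)
        n_words += len(line.split())
        n_tokens += len(tokens)
        n_unk += sum(t == unk_token for t in tokens)
        n_cont += sum(not t.startswith("\u2581") for t in tokens)
    return {
        "fertility": n_tokens / max(n_words, 1),
        "coverage_pct": 100.0 * (1 - n_unk / max(n_tokens, 1)),
        "continuation_pct": 100.0 * n_cont / max(n_tokens, 1),
    }
```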


Usage

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("ducanhdinh/CKTN-EKECTRA", use_fast=False)
model = AutoModel.from_pretrained("ducanhdinh/CKTN-EKECTRA")

# Khmer example
text = "សួស្តី"  # "Hello"
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # [1, seq_len, 1152]
```

Important: always pass use_fast=False when loading the tokenizer. The fast tokenizer reads its vocabulary from tokenizer.json (which caches the original 250,300-token vocab) rather than from the patched .model file, silently ignoring all new pieces. The slow tokenizer reads directly from the SPM binary and always reflects the true extended vocabulary.
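A quick guard against accidentally loading the stale fast tokenizer (a hypothetical helper, not part of the release; the expected size is the extended vocabulary from the table above):

```python
def check_vocab_extended(tokenizer, expected_size=254_513):
    """Raise if the tokenizer is missing the injected pieces, e.g. a fast
    tokenizer that silently loaded the cached 250,300-piece tokenizer.json."""
    actual = len(tokenizer)
    if actual != expected_size:
        raise ValueError(
            f"Tokenizer has {actual} pieces, expected {expected_size}; "
            "did you forget use_fast=False?"
        )
    return actual
```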


Training Data

New tokens were selected from the following corpora (minimum frequency threshold: 50):

  • Khmer — news articles and web text; word-segmented with khmer-nltk
  • Cham — web-collected Cham-script text
  • Tày-Nùng — Latin-script documents in the Tày and Nùng languages of northern Vietnam

A SentencePiece Unigram model (vocab size 8,000, character coverage 0.9999) was trained on all three corpora jointly. Candidate tokens were filtered to remove punctuation, control characters, and single-byte non-script characters before frequency thresholding.
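The filtering step might look roughly like this. It is illustrative only: `keep_candidate`, `select_tokens`, and the exact Unicode-category rules are assumptions beyond what the card states.

```python
import unicodedata

def keep_candidate(piece):
    """Drop punctuation-only, control-character, and single-byte non-script pieces."""
    surface = piece.lstrip("\u2581")  # strip the SentencePiece word-start marker
    if not surface:
        return False
    categories = {unicodedata.category(ch) for ch in surface}
    if all(cat.startswith(("P", "C")) for cat in categories):
        return False  # pure punctuation (P*) or control/format (C*) characters
    if len(surface.encode("utf-8")) == 1 and not surface.isalpha():
        return False  # single-byte non-script character
    return True

def select_tokens(candidates, freq, min_freq=50):
    """Apply the character filters, then the corpus frequency threshold."""
    return [p for p in candidates if keep_candidate(p) and freq.get(p, 0) >= min_freq]
```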


Limitations

  • This is a Stage 1 checkpoint only. The 4,213 new embedding rows have been initialized via FVT sub-piece averaging but have not been updated by any gradient steps. Downstream performance on Khmer/Cham/Tày-Nùng tasks will be limited until Stage 2 (continued MLM pre-training) is complete.
  • tie_word_embeddings is false (inherited from RemBERT). Both input and output embedding matrices must be covered by the optimizer during continued pre-training.
  • The model has not been fine-tuned on any downstream task.
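Because the embeddings are untied, a Stage 2 setup should verify that both matrices actually reach the optimizer. A PyTorch sketch, assuming the standard `get_input_embeddings`/`get_output_embeddings` accessors (the `stage2_optimizer` helper itself is hypothetical):

```python
import torch

def stage2_optimizer(model, lr=1e-5):
    """Build a Stage 2 optimizer, checking both untied embedding matrices are covered."""
    in_emb = model.get_input_embeddings().weight
    out_emb = model.get_output_embeddings().weight
    # tie_word_embeddings=false: these must be two distinct tensors
    assert in_emb.data_ptr() != out_emb.data_ptr(), "embeddings unexpectedly tied"
    params = [p for p in model.parameters() if p.requires_grad]
    assert any(p is in_emb for p in params), "input embeddings not trainable"
    assert any(p is out_emb for p in params), "output embeddings not trainable"
    return torch.optim.AdamW(params, lr=lr)
```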

Citation

If you use this model, please cite the following works:

@inproceedings{gee-etal-2022-fast,
  title     = {Fast Vocabulary Transfer for Language Model Compression},
  author    = {Gee, Leonidas and Zugarini, Andrea and Rigutini, Leonardo and Torroni, Paolo},
  booktitle = {Proceedings of EMNLP 2022},
  year      = {2022},
  url       = {https://aclanthology.org/2022.emnlp-main.802/},
}

@inproceedings{rust-etal-2021-good,
  title     = {How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models},
  author    = {Rust, Phillip and Pfeiffer, Jonas and Vulić, Ivan and Ruder, Sebastian and Gurevych, Iryna},
  booktitle = {Proceedings of ACL 2021},
  year      = {2021},
  url       = {https://aclanthology.org/2021.acl-long.571/},
}

@article{chung-etal-2020-rembert,
  title   = {Rethinking Embedding Coupling in Pre-trained Language Models},
  author  = {Chung, Hyung Won and Févry, Thibault and Tsai, Henry and Johnson, Melvin and Ruder, Sebastian},
  journal = {arXiv preprint arXiv:2010.12821},
  year    = {2020},
}