Khmer Graph‑Regularized Tokenizer (v2.2‑RC)

A tokenizer + lexeme embedding layer that preserves production tokenization and induces a structured Khmer semantic space.

✅ Compatibility: Identical segmentation to production SPM‑8K

✅ Structure: Graph Regularization aligns embeddings to a curated morpho/synonymy graph

✅ Quality: Coherence@10 = 43.25% (+110% vs baseline 20.62%)

✅ Ready for tasks: NER, QA, RAG, semantic search (no retraining of the tokenizer needed)

Why this is different

This is not "just a tokenizer." It's a lexeme‑level encoder whose vectors are shaped by a linguistic graph (morphology + synonymy). The tokenizer output stays the same; the embedding space becomes clustered and interpretable.

Results (v2.2‑RC)

Primary metric: Coherence@10 — overlap between top‑K embedding neighbors and graph neighbors.

Metric                                   | v2.2‑RC | Baseline | Δ
-----------------------------------------|---------|----------|------
Coherence@10 (morpho/synonymy, pruned)   | 43.25%  | 20.62%   | +110%
Coherence@10 (distributional, enriched)  | 23.64%  | 11.58%   | +104%

Interpretation: Embedding neighborhoods now reflect true linguistic relations (morphology/synonymy), not only raw co‑occurrence.

Example clusters (top cosine neighbors)

Administrative titles — អគ្គលេខាធិការ (Secretary General)

អគ្គលេខាធិការរង (Deputy SG), អគ្គលេខាធិការព្រឹទ្ធសភា (Senate SG), អគ្គលេខាធិការដ្ឋាន … (~93% sim)

Geographic / national — កម្ពុជា (Cambodia)

រដ្ឋកម្ពុជា (State of Cambodia), ព្រះរាជាណាចក្រកម្ពុជា (Kingdom of Cambodia) … (~90%+ sim)

Kinship — ឪពុក / ម្តាយ (father / mother)

ឪពុកចិញ្ចឹម (adoptive father), ម្តាយឪពុក (parents) … (~88–92% sim)

Grammatical particles — បាន / នឹង (past / future)

បានទៅ, បានតែ, បានហើយ / នឹងទៅ, នឹងធ្វើ, នឹងមក … (~85–88% sim)

How it works (Variant B)

We train a separate lexeme embedding table and add two graph‑aware losses on top of LM training:

Loss = L_LM + λ(t) · ( L_Laplacian + L_Consistency )

L_Laplacian   = Σ_{(i,j)∈E} w_ij · ||e_i − e_j||²    (pulls graph neighbors together; symmetric normalization, anti‑hub)
L_Consistency = ||pool(subword_embs) − e_lexeme||²   (aligns each lexeme with its pooled subwords)
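
In code, the two regularizers look roughly like the minimal PyTorch sketch below. Tensor names (lex_emb, edges, weights, piece_embs) and the mean-pooling choice are illustrative assumptions, not the exact training code:

import torch

def laplacian_loss(lex_emb, edges, weights):
    # Σ w_ij · ||e_i − e_j||² over the curated edge list.
    # lex_emb: [N_lex, d]; edges: [E, 2] (i, j) node-id pairs; weights: [E] normalized w_ij
    diff = lex_emb[edges[:, 0]] - lex_emb[edges[:, 1]]
    return (weights * diff.pow(2).sum(dim=-1)).sum()

def consistency_loss(lex_emb, lexeme_ids, piece_embs, piece_mask):
    # ||pool(subword_embs) − e_lexeme||², averaged over the batch.
    # piece_embs: [B, P, d] padded subword embeddings; piece_mask: [B, P] 0/1
    pooled = (piece_embs * piece_mask.unsqueeze(-1)).sum(1) / piece_mask.sum(1, keepdim=True).clamp(min=1)
    return (pooled - lex_emb[lexeme_ids]).pow(2).sum(-1).mean()

# total = lm_loss + lam_lap(step) * laplacian_loss(...) + lam_lex(step) * consistency_loss(...)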

Winning config (v2.2‑RC):

  • SPM: production Khmer 8K (unchanged tokenization)
  • Graph: 4,245 curated morpho/synonymy edges (pruned from 15K)
  • Lexemes: 12,579 (max_pieces=24)
  • Lambdas: lap=2.5e-4, lex=1.0e-4 (2.5:1 to emphasize graph)
  • Dropout: edge_dropout=0.01 (clean graph ⇒ minimal noise)
  • Schedule: warmup 2k → plateau 4k → anneal 1.76k
  • Stability: max_grad_norm=1.0
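
The λ(t) schedule above reads as a piecewise ramp. A minimal sketch, assuming linear warmup and anneal (the exact curve shape is not specified in this card):

def lambda_schedule(step, peak, warmup=2_000, plateau=4_000, anneal=1_760):
    # Piecewise-linear λ(t): ramp up over `warmup`, hold for `plateau`, decay over `anneal`.
    if step < warmup:
        return peak * step / warmup
    if step < warmup + plateau:
        return peak
    return peak * max(0.0, 1.0 - (step - warmup - plateau) / anneal)

# e.g. lam_lap = lambda_schedule(step, peak=2.5e-4); lam_lex = lambda_schedule(step, peak=1.0e-4)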

Key design calls:

  • Quality > quantity (curated 4.2k edges outperform 10k noisy)
  • Graph signal > subword consistency (lambda ratio inverted relative to v2.1)
  • Minimal dropout when graph is clean

What's in the release

v2.2-RC/
├── spm_km_8k_prod.model                  # SentencePiece 8k (production-compatible)
├── lexeme_embeddings.pt                  # [12,654 x 768] graph-regularized lexeme vectors
├── edges_pruned.tsv                      # 4,245 morpho/synonymy edges (curated)
├── lexeme_subwords_prod8k_v22.tsv        # 12,579 lexeme → subword mappings (max_pieces=24)
├── metrics_corrected.yaml                # full evaluation dump (Coherence@k, etc.)
└── (optional) nodes.tsv                  # lexeme ids/names for neighbor lookup

Tokenization remains identical to production 8k. You can swap the embeddings into downstream code without touching your tokenization pipelines.
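
For example, one way to drop the released vectors into a downstream PyTorch model (illustrative; any framework with an embedding lookup works):

import torch
import torch.nn as nn

lex = torch.load("lexeme_embeddings.pt", map_location="cpu")        # [N_lex, 768] tensor
lexeme_embedding = nn.Embedding.from_pretrained(lex, freeze=False)  # fine-tunable downstream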

Quickstart

1) Tokenization (unchanged)

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("spm_km_8k_prod.model")

txt = "ព្រះរាជាណាចក្រកម្ពុជា"
print(sp.encode_as_pieces(txt))
# ['▁', 'ព្រះរាជាណាចក្រ', 'កម្ពុជា']

2) Semantic neighbors (graph‑regularized embeddings)

import csv

import numpy as np
import torch

# Load on CPU regardless of where the checkpoint was saved.
LEX = torch.load("lexeme_embeddings.pt", map_location="cpu").numpy()  # [N_lex, 768]
LEX_UNIT = LEX / np.linalg.norm(LEX, axis=1, keepdims=True)           # normalize once

# Optional: map words → lexeme ids
word2id = {}
try:
    with open("nodes.tsv") as f:  # TSV with columns: id, word
        for row in csv.DictReader(f, delimiter="\t"):
            word2id[row["word"]] = int(row["id"])
except FileNotFoundError:
    print("nodes.tsv not provided – neighbor demo will be limited.")

def neighbors(word, k=10):
    """Top-k cosine neighbors of `word` among the lexemes in nodes.tsv."""
    if word not in word2id:
        return []
    sims = LEX_UNIT @ LEX_UNIT[word2id[word]]  # cosine similarity vs. every lexeme
    ranked = sorted(word2id.items(), key=lambda kv: sims[kv[1]], reverse=True)
    return [(w, float(sims[i])) for w, i in ranked[1:k + 1]]  # skip self

print(neighbors("ព្រះសង្ឃ", k=5))

Reproduce key metrics

Coherence@10 (pruned graph):

python scripts/03_validation/coherence_at_k.py \
  --lexeme_ckpt lexeme_embeddings.pt \
  --edges edges_pruned.tsv \
  --k 10
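
If you want to sanity-check the number without the repo, the metric itself is simple. A minimal sketch (the script's exact tie-breaking and denominator may differ):

import numpy as np

def coherence_at_k(emb, graph_neighbors, k=10):
    # Mean overlap between each node's top-k cosine neighbors and its graph neighbors.
    # emb: [N, d]; graph_neighbors: {node_id: set of neighbor node ids}
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = unit @ unit.T
    np.fill_diagonal(sims, -np.inf)  # never count a node as its own neighbor
    scores = []
    for i, nbrs in graph_neighbors.items():
        if not nbrs:
            continue
        topk = set(np.argpartition(-sims[i], k)[:k])
        scores.append(len(topk & nbrs) / min(k, len(nbrs)))
    return float(np.mean(scores))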

TPC note: TPC (tokens per character) depends on your corpus and preprocessing. Using production SPM‑8K, aim for ≤0.15 on a representative Khmer corpus with proper normalization.
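
A quick way to measure it on your own data (TPC computed here as tokens per character):

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("spm_km_8k_prod.model")

def tpc(lines):
    # Tokens per character over a corpus sample; lower means denser tokenization.
    n_tokens = sum(len(sp.encode_as_pieces(line)) for line in lines)
    n_chars = sum(len(line) for line in lines)
    return n_tokens / max(n_chars, 1)

print(tpc(["ព្រះរាជាណាចក្រកម្ពុជា"]))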

Practical impact

  • NER: clustered lexeme space → clearer boundaries & types (target +5–10 F1)
  • QA / RAG: semantically coherent neighbors → better retrieval (target +5 EM / +10% MRR)
  • Few‑shot: morpho/derivational families share geometry → faster domain adaptation

Limitations & cautions

  • TPC is corpus‑dependent. Validate on your data (normalization, Khmer ratio, ASCII filtering).
  • Protected symbols: for the 8k vocab, keep curated lists small (roughly ≤200 critical items).
  • Break‑Rate target for production polish: <1% (research runs may be higher).

Roadmap

  • v2.3: piece‑count–weighted consistency loss; optional relation‑type weights
  • Benchmarks: Khmer NER / QA micro‑sets and extrinsic eval scripts
  • Viz: t‑SNE / UMAP plots + cosine heatmaps for cluster inspection

Citation

@misc{khmer-tokenizer-v2.2-rc,
  author       = {Niko (khopilot)},
  title        = {Graph-Regularized Lexeme Embeddings for Khmer Tokenization},
  year         = {2025},
  howpublished = {Hugging Face},
  note         = {Coherence@10 = 43.25\% (+110\% vs baseline) with morpho/synonymy graph}
}

License: MIT • Author: @khopilot

If you use this in production, please share issues/benchmarks—especially NER/QA scores. It helps the community converge on robust Khmer NLP baselines.
