modernbert-base-chinese

A ModernBERT model pretrained on a corpus of Simplified Chinese, Traditional Chinese, and Cantonese text.

Note: This model is undertrained due to budget constraints.

The tokenizer is a character-based BertTokenizer: each Chinese character is a separate token, a design decision made to facilitate sequence tagging tasks. Mixed Chinese and English text is also supported.
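As a quick illustration of the character-level tokenization, the sketch below loads the tokenizer and prints the tokens for a mixed Chinese/English sentence (the sentence is our own example; the English part is split according to the tokenizer's WordPiece vocabulary):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ming030890/modernbert-base-chinese")

# Each Chinese character comes back as its own token; English words follow WordPiece rules.
print(tokenizer.tokenize("我用Python寫code。"))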

How to use

You can use this model directly with a pipeline for masked language modeling. Since the tokenizer is character-based, each [MASK] token stands for exactly one character, so you should only mask single characters.

from transformers import pipeline
fill_mask = pipeline(
    "fill-mask",
    model="ming030890/modernbert-base-chinese",
    tokenizer="ming030890/modernbert-base-chinese"
)

# Simplified Chinese (Mandarin): "The weather today is really nice."
result = fill_mask("今天天[MASK]真好。")
print(result)

# Traditional Chinese: "This bowl of beef noodles is delicious."
result = fill_mask("這碗牛[MASK]麵好吃。")
print(result)

# Cantonese: roughly "It's really awful of you to do that."
result = fill_mask("你咁樣做真係[MASK]衰。")
print(result)

# Mixed Chinese and English (code-switching): "I just bought a [MASK] new laptop."
result = fill_mask("我啱啱買咗[MASK]新laptop。")
print(result)
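The fill-mask pipeline also accepts top_k and targets arguments, which can be useful with a character-level vocabulary; the sketch below reuses the Simplified Chinese example above, and the candidate characters passed to targets are our own illustration:

# Return only the 3 most likely characters for the masked position
result = fill_mask("今天天[MASK]真好。", top_k=3)
print(result)

# Score specific candidate characters instead of ranking the whole vocabulary
result = fill_mask("今天天[MASK]真好。", targets=["氣", "气"])
print(result)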

You can also load the model and tokenizer directly:

from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("ming030890/modernbert-base-chinese")
model = AutoModelForMaskedLM.from_pretrained("ming030890/modernbert-base-chinese")
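If you prefer not to use the pipeline, the following is a minimal sketch of masked-character prediction with PyTorch, assuming the tokenizer and model loaded above; it locates the [MASK] position and reads the top candidate characters from the logits:

import torch

inputs = tokenizer("今天天[MASK]真好。", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Find the [MASK] position and take the 5 highest-scoring candidate characters
mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top_ids = logits[0, mask_index].topk(5).indices[0]
print(tokenizer.convert_ids_to_tokens(top_ids.tolist()))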
Model details

Model size: 0.1B parameters
Tensor type: BF16 (Safetensors)