# modernbert-base-chinese
A ModernBERT model pretrained on a corpus of Simplified Chinese, Traditional Chinese, and Cantonese text.
Note: This model is undertrained due to budget constraints.
The tokenizer is a character-level BertTokenizer: each Chinese character is a separate token. This was a deliberate design choice, since it keeps labels aligned one-to-one with characters for sequence tagging tasks. Mixed Chinese and English text is also supported.
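As a quick sketch of that behaviour, the snippet below tokenizes a mixed-language sentence; Chinese characters come out one token each, while English words fall back to the usual BERT WordPiece splitting (the exact subword splits shown in the comment are illustrative, not guaranteed).

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ming030890/modernbert-base-chinese")

# Each Chinese character is its own token; English words may be split
# into WordPiece subwords.
print(tokenizer.tokenize("我用laptop寫code。"))
# e.g. ['我', '用', 'lap', '##top', '寫', 'code', '。']  (illustrative)
```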
## How to use
You can use this model directly with a pipeline for masked language modeling. Since the tokenizer is character-based, you should only mask single characters.
```python
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="ming030890/modernbert-base-chinese",
    tokenizer="ming030890/modernbert-base-chinese",
)

# Mainland Mandarin (Simplified Chinese)
result = fill_mask("今天天[MASK]真好。")
print(result)

# Traditional Chinese
result = fill_mask("這碗牛[MASK]麵好吃。")
print(result)

# Cantonese
result = fill_mask("你咁樣做真係[MASK]衰。")
print(result)

# Mixed Chinese and English (code switching)
result = fill_mask("我啱啱買咗[MASK]新laptop。")
print(result)
```
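Each pipeline call returns a list of candidate fills, each a dict with `token_str`, `score`, and `sequence`. A small sketch for inspecting the top predictions for one of the examples above:

```python
# Print the candidate characters for the masked position, best first.
for candidate in fill_mask("今天天[MASK]真好。"):
    print(f"{candidate['token_str']}\t{candidate['score']:.4f}")
```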
You can also load the model and tokenizer directly:
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("ming030890/modernbert-base-chinese")
model = AutoModelForMaskedLM.from_pretrained("ming030890/modernbert-base-chinese")
```
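From there, masked prediction can also be done with a plain PyTorch forward pass. The sketch below reuses the `tokenizer` and `model` loaded above; the example sentence is only an illustration.

```python
import torch

text = "這碗牛[MASK]麵好吃。"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the [MASK] position and take the highest-scoring token there.
mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_index].argmax(dim=-1)
print(tokenizer.decode(predicted_id))
```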