persian_roberta_opt_tokenizer

A compact RoBERTa-style Masked Language Model (MLM) for Persian (Farsi). We trained a Persian BPE tokenizer on a mixed corpus combining formal text with social-media and chat data. The model is pre-trained with this tokenizer, optimized for Persian script, and evaluated on two downstream tasks:

  • NER on a merged ARMAN + PEYMA corpus
  • Relation Extraction on PERLEX

Model size and training hyperparameters were kept identical to the baselines to ensure fair comparisons.


1) Model Description

  • Architecture: RoBERTa-style Transformer for Masked LM
  • Intended use: Persian text understanding, masked token prediction, and as a backbone for NER/RE fine-tuning
  • Vocabulary: BPE with Persian-aware preprocessing (supports ZWNJ and Persian punctuation)
  • Max sequence length: 256

The model is published on the Hub as selfms/persian_roberta_opt_tokenizer.


2) Architecture and Training Setup

Backbone (example config):

  • hidden size: 256
  • layers: 6
  • attention heads: 4
  • intermediate size: 1024
  • activation: GELU
  • dropout: 0.1
  • positional embeddings: 514

If these numbers differ from the released config.json, config.json is authoritative (see the Model Config Summary at the end of this card). All baselines used the same parameter budget.

Pretraining objective: Masked Language Modeling

Fine-tuning hyperparameters (shared across all compared models):

epochs = 3
batch_size = 8
learning_rate = 3e-5
weight_decay = 0.01
max_tokens = 128
optimizer = AdamW
scheduler = linear with warmup (recommended 10% warmup)
seed = 42
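
To make the shared setup concrete, here is a minimal sketch of how these hyperparameters map onto transformers TrainingArguments. The output directory is a placeholder, and warmup_ratio=0.1 follows the recommended 10% warmup above; AdamW is the Trainer's default optimizer.

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="finetune-out",      # placeholder path
    num_train_epochs=3,             # epochs = 3
    per_device_train_batch_size=8,  # batch_size = 8
    learning_rate=3e-5,             # learning_rate = 3e-5
    weight_decay=0.01,              # weight_decay = 0.01
    lr_scheduler_type="linear",     # linear schedule
    warmup_ratio=0.1,               # recommended 10% warmup
    seed=42,                        # seed = 42
)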

3) Data and Tasks

NER

  • Datasets: ARMAN + PEYMA, merged and standardized to a unified BIO tag set (see Metrics & Evaluation Notes)
  • Preprocessing: Persian normalization (digits, punctuation, ZWNJ), sentence segmentation, max length 128, label alignment with wordpieces
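
A minimal sketch of the label-alignment step, assuming a fast tokenizer; the example words and BIO label ids are illustrative only:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("selfms/persian_roberta_opt_tokenizer")

# Illustrative pre-tokenized sentence with hypothetical per-word label ids (1 = B-LOC, 0 = O).
words = ["تهران", "پایتخت", "ایران", "است"]
word_labels = [1, 0, 1, 0]

enc = tok(words, is_split_into_words=True, truncation=True, max_length=128)

labels, previous = [], None
for word_id in enc.word_ids():
    if word_id is None:
        labels.append(-100)                  # special tokens: ignored by the loss
    elif word_id != previous:
        labels.append(word_labels[word_id])  # label only the first sub-token of each word
    else:
        labels.append(-100)                  # ignore continuation sub-tokens
    previous = word_id

print(tok.convert_ids_to_tokens(enc["input_ids"]))
print(labels)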

Relation Extraction

  • Dataset: PERLEX (Persian Relation Extraction)
  • Entity marking: special entity markers in the text (recommended) or span pooling; we used a simple [CLS]-style pooling baseline (the <s> token in RoBERTa; see the sketch below)
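
A minimal sketch of that pooling baseline using AutoModelForSequenceClassification, which classifies from the pooled <s> representation. The marker tokens, num_labels, and example sentence are illustrative assumptions, not the exact PERLEX pipeline.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

path = "selfms/persian_roberta_opt_tokenizer"
tok = AutoTokenizer.from_pretrained(path)

# Hypothetical marker tokens around the two entities; set num_labels to the PERLEX relation-label count.
tok.add_special_tokens({"additional_special_tokens": ["[E1]", "[/E1]", "[E2]", "[/E2]"]})
model = AutoModelForSequenceClassification.from_pretrained(path, num_labels=8)
model.resize_token_embeddings(len(tok))

text = "[E1] تهران [/E1] پایتخت [E2] ایران [/E2] است"
x = tok(text, return_tensors="pt", truncation=True, max_length=128)
with torch.no_grad():
    logits = model(**x).logits  # classification head over the pooled <s> token
print(logits.argmax(-1).item())  # the head is newly initialized; fine-tune on PERLEX before use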

4) Quantitative Results

4.1 NER (ARMAN + PEYMA, merged)

Model                    Precision   Recall   F1-Score
Proposed (this model)    93.4        94.8     94.08
TooKaBERT-base           94.9        96.2     95.5
FABERT                   94.1        95.3     94.7

4.2 Relation Extraction (PERLEX)

Model                    F1-score (%)
Proposed (this model)    90
TooKaBERT-base           91
FABERT                   88

All three models used identical hyperparameters, token length, and parameter budgets to isolate architecture/tokenizer effects.


5) Usage

5.1 Fill-Mask Inference (simple)

from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

path = "selfms/persian_roberta_opt_tokenizer"

tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForMaskedLM.from_pretrained(path)
model.eval()

fill = pipeline("fill-mask", model=model, tokenizer=tokenizer, top_k=10)
# Chat-style example, roughly: "Hi, does anyone have a detailed analysis of this <mask>? When is it going to move?"
print(fill(" سلام کسی تحلیل دقیقی ازاین <mask> داره کی میخواد حرکت کنه"))

5.2 Text-Embedding Inference (simple)

import torch
from transformers import AutoTokenizer, AutoModel

path = "selfms/persian_roberta_opt_tokenizer"
tok = AutoTokenizer.from_pretrained(path)
mdl = AutoModel.from_pretrained(path).eval()

def embed(text):
    with torch.no_grad():
        x = tok(text, return_tensors="pt", truncation=True, max_length=256)
        h = mdl(**x).last_hidden_state                        # (1, seq_len, hidden)
        a = x["attention_mask"].unsqueeze(-1)                 # zero out padding positions
        v = (h * a).sum(1) / a.sum(1).clamp(min=1)            # mean pooling over real tokens
        return (v / v.norm(dim=1, keepdim=True)).squeeze(0)   # L2-normalized 1D vector

text = "متن فارسی به بردار 768 بعدی تبدیل میشه"
vec = embed(text)
print(len(vec))

5.3 Tokenizer Inference (simple)

from transformers import AutoTokenizer

path = "selfms/persian_roberta_opt_tokenizer"
tok = AutoTokenizer.from_pretrained(path)

text = "برای tokenizer از پیش پردازش معنایی روی دیتاست ها مختلف خبری و شبکه های اجتماعی استفاده شده"

enc = tok(text, return_tensors="pt")
tokens = tok.convert_ids_to_tokens(enc["input_ids"][0])

print("Tokens:", tokens)
print("IDs   :", enc["input_ids"][0].tolist())

6) Comparison with Other Models

Under identical parameter budgets and training settings:

  • NER (ARMAN + PEYMA): TooKaBERT-base achieves the highest F1 (95.5); our model is competitive (94.08), close to but slightly below FABERT (94.7).
  • Relation Extraction (PERLEX): Our model (F1=90) surpasses FABERT (88) and is slightly below TooKaBERT (91).

These results suggest the tokenizer/backbone choices here are strong for RE and competitive for NER, especially considering the compact backbone.


7) Limitations, Bias, and Ethical Considerations

  • Domain bias: Training corpora and NER/RE datasets are news/formal-text heavy; performance may drop on slang, dialects, or domain-specific jargon.
  • Tokenization quirks: ZWNJ handling and Persian punctuation are supported, but mixed Persian/English code-switching can degrade quality.
  • Sequence length: Experiments reported at max_tokens=128. Longer contexts may require re-tuning and more memory.
  • Stereotypes/Bias: As with all language models, learned correlations may reflect societal biases. Avoid using outputs as ground truth for sensitive decisions.

8) How to Reproduce

  1. Pretrain or load the MLM checkpoint:
from transformers import AutoModelForMaskedLM, AutoTokenizer
tok = AutoTokenizer.from_pretrained("selfms/persian_roberta_opt_tokenizer")
mdl = AutoModelForMaskedLM.from_pretrained("selfms/persian_roberta_opt_tokenizer")
  2. Fine-tune for NER/RE with the shared hyperparameters:
epochs=3, batch_size=8, lr=3e-5, weight_decay=0.01, max_tokens=128
  3. Evaluate (a scoring sketch follows below):
  • NER: entity-level Precision/Recall/F1 (micro-F1 under the BIO scheme; see Metrics & Evaluation Notes)
  • RE: relation-level micro-F1 on PERLEX
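
A minimal sketch of the NER scoring step with the seqeval package (an assumption; any entity-level scorer for BIO tags works). The gold and predicted tag sequences below are illustrative only.

from seqeval.metrics import precision_score, recall_score, f1_score

# Illustrative gold and predicted BIO tag sequences (one inner list per sentence).
y_true = [["B-LOC", "O", "B-PER", "I-PER", "O"]]
y_pred = [["B-LOC", "O", "B-PER", "O", "O"]]

print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))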

9) Files in the Repository

  • config.json
  • model.safetensors or pytorch_model.bin
  • tokenizer_config.json, special_tokens_map.json, tokenizer.json
  • vocab.json, merges.txt (BPE)
  • README.md, LICENSE, .gitattributes

Ensure mask_token is set to <mask> and pipeline_tag: fill-mask is present in the README metadata so the Hub fill-mask widget works out of the box.
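
A quick sanity check (a sketch, assuming the tokenizer files listed above are in place):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("selfms/persian_roberta_opt_tokenizer")
assert tok.mask_token == "<mask>", "fill-mask widget expects <mask> as the mask token"
print(tok.mask_token, tok.mask_token_id)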


10) Citation

If you use this model, please cite:

@misc{persian_roberta_opt_tokenizer_2025,
  title        = {persian\_roberta\_opt\_tokenizer: A compact RoBERTa-style Persian Masked LM},
  author       = {selfms},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/selfms/persian_roberta_opt_tokenizer}},
  note         = {Pretrained on Persian text; evaluated on ARMAN+PEYMA (NER) and PERLEX (RE).}
}

11) License

Apache-2.0 (recommended). Please verify dataset licenses (ARMAN, PEYMA, PERLEX) before redistribution.

Metrics & Evaluation Notes

  • NER: entity-level micro-F1 under the BIO tagging scheme.
  • Relation Extraction (RE): micro-F1 at relation level.
  • Sequence length: the model accepts inputs of up to 512 tokens (the RoBERTa position-embedding table has 514 entries because indices are offset past the padding index). The inference examples in this card cap inputs at 256 tokens for efficiency; downstream fine-tuning experiments used max_tokens=128.

Model Config Summary

  • Architecture: RoBERTa-base (12 layers, 12 heads, hidden size 768, FFN 3072).
  • Max positions: 514 (effective input up to 512 tokens).
  • Dropout: hidden 0.1, attention 0.1.
  • Vocab size: 48,000 (BPE).
  • Special tokens: <s>=0, <pad>=1, </s>=2, <mask> as mask token.
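
For reference, a minimal sketch of a RobertaConfig matching this summary (assuming the summary reflects the released config.json; all other fields are left at their transformers defaults):

from transformers import RobertaConfig

config = RobertaConfig(
    vocab_size=48000,               # BPE vocabulary
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
    max_position_embeddings=514,    # effective input up to 512 tokens
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
    bos_token_id=0,                 # <s>
    pad_token_id=1,                 # <pad>
    eos_token_id=2,                 # </s>
)
print(config)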