# OGBert-110M-Base

A 110M-parameter ModernBERT-based masked language model trained on glossary and domain-specific text.
Related models:
- mjbommar/ogbert-110m-sentence - Sentence embedding version with mean pooling + L2 normalization
## Model Details
| Property | Value |
|---|---|
| Architecture | ModernBERT |
| Parameters | 110M |
| Hidden size | 768 |
| Layers | 12 |
| Attention heads | 12 |
| Vocab size | 32,768 |
| Max sequence | 1,024 tokens |
## Training
- Task: Masked Language Modeling (MLM)
- Dataset: mjbommar/ogbert-v1-mlm - derived from OpenGloss, a synthetic encyclopedic dictionary with 537K senses across 150K lexemes
- Masking: Standard 15% token masking (see the collator sketch after this list)
- Training steps: 8,000 (selected for optimal downstream performance)
- Tokens processed: ~4.5B
- Batch size: 1,024
- Peak learning rate: 3e-4
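The exact training pipeline is not published in this card, but the 15% masking objective corresponds to the standard Hugging Face MLM collator. A minimal, illustrative sketch of that setup:

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained('mjbommar/ogbert-110m-base')

# Standard MLM collator: dynamically masks 15% of tokens per sequence,
# matching the masking rate listed above.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,
)
```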
## Performance

### Word Similarity (SimLex-999)
SimLex-999 measures the Spearman correlation between model cosine similarities and human judgments on 999 word pairs; higher values indicate closer alignment with human perception of word similarity.
| Model | Params | SimLex-999 (ρ) |
|---|---|---|
| OGBert-110M-Base | 110M | 0.345 |
| BERT-base | 110M | 0.070 |
| RoBERTa-base | 125M | -0.061 |
OGBert-110M-Base achieves roughly 5x the SimLex-999 correlation of BERT-base (0.345 vs. 0.070) at the same parameter count.
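The evaluation protocol is not detailed in this card; the sketch below shows one common way such a score is computed, assuming mean-pooled last-hidden-layer states as word vectors and cosine similarity (both are illustrative choices, not the documented setup):

```python
import torch
from scipy.stats import spearmanr
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('mjbommar/ogbert-110m-base')
model = AutoModel.from_pretrained('mjbommar/ogbert-110m-base')

def word_vector(word):
    # Illustrative choice: mean-pool the last hidden layer over the word's tokens.
    inputs = tokenizer(word, return_tensors='pt')
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state
    return hidden.mean(dim=1).squeeze(0)

def simlex_spearman(pairs):
    # pairs: iterable of (word1, word2, human_score) tuples from SimLex-999.
    model_scores, human_scores = [], []
    for w1, w2, gold in pairs:
        sim = torch.cosine_similarity(word_vector(w1), word_vector(w2), dim=0).item()
        model_scores.append(sim)
        human_scores.append(gold)
    rho, _ = spearmanr(model_scores, human_scores)
    return rho
```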
### Document Clustering
Evaluated on 80 domain-specific documents across 10 categories using KMeans.
| Model | Params | ARI | Cluster Acc |
|---|---|---|---|
| OGBert-110M-Base | 110M | 0.941 | 0.975 |
| BERT-base | 110M | 0.896 | 0.950 |
| RoBERTa-base | 125M | 0.941 | 0.975 |
OGBert-110M-Base matches RoBERTa-base and exceeds BERT-base on this clustering benchmark.
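The embedding and scoring details are not published here; the sketch below shows one standard way to compute ARI and cluster accuracy from document embeddings (the `embeddings` and `labels` arrays are placeholders you would supply):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def clustering_scores(embeddings, labels, n_clusters=10):
    # Cluster the document embeddings and compare against true category labels.
    preds = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(embeddings)
    ari = adjusted_rand_score(labels, preds)

    # Cluster accuracy: find the best one-to-one cluster-to-label mapping
    # (Hungarian assignment), then score the fraction of correctly placed docs.
    cm = np.zeros((n_clusters, n_clusters), dtype=int)
    for p, t in zip(preds, labels):
        cm[p, t] += 1
    rows, cols = linear_sum_assignment(-cm)
    acc = cm[rows, cols].sum() / len(labels)
    return ari, acc
```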
## Usage

### Fill-Mask Pipeline
```python
from transformers import pipeline

fill_mask = pipeline('fill-mask', model='mjbommar/ogbert-110m-base')

# The model's mask token is <|mask|>.
result = fill_mask('The financial <|mask|> was approved.')
print(result)
```
### Direct Model Usage
```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('mjbommar/ogbert-110m-base')
model = AutoModelForMaskedLM.from_pretrained('mjbommar/ogbert-110m-base')

inputs = tokenizer('The <|mask|> definition is clear.', return_tensors='pt')
outputs = model(**inputs)  # outputs.logits holds per-token vocabulary scores
```
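To turn the raw logits into predictions, read off the highest-scoring tokens at the masked position. This sketch assumes the tokenizer exposes the `<|mask|>` token via `mask_token_id`:

```python
import torch

# Locate the masked position and decode the top-5 candidate tokens (illustrative).
mask_positions = (inputs['input_ids'] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
logits = outputs.logits[0, mask_positions]
top_ids = torch.topk(logits, k=5, dim=-1).indices[0]
print(tokenizer.convert_ids_to_tokens(top_ids))
```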
### For Sentence Embeddings

Use mjbommar/ogbert-110m-sentence instead; it adds mean pooling and L2 normalization for similarity search (see the sketch below).
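A minimal sketch, assuming the sentence variant is packaged for the `sentence-transformers` library (if not, apply mean pooling and L2 normalization over the base model's outputs yourself):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('mjbommar/ogbert-110m-sentence')
embeddings = model.encode(
    ['The financial statement was approved.', 'The glossary definition is clear.'],
    normalize_embeddings=True,  # L2-normalize so dot product equals cosine similarity
)
print(embeddings.shape)
```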
## Citation
If you use this model, please cite the OpenGloss dataset:
```bibtex
@article{bommarito2025opengloss,
  title={OpenGloss: A Synthetic Encyclopedic Dictionary and Semantic Knowledge Graph},
  author={Bommarito II, Michael J.},
  journal={arXiv preprint arXiv:2511.18622},
  year={2025}
}
```
## License
Apache 2.0