SentenceTransformer: ModernBERT-small-v2

ModernBERT-small-v2 is a compact, efficient, and accurate dense vector encoder. It combines a small ModernBERT architecture, straightforward MLM pre-training, and distillation from a larger, high-performing teacher model to achieve strong retrieval quality at a fraction of the computational cost of standard large models.

Key Features & Training Methodology

This model was created using a specialized four-stage pipeline:

  1. Deep & Narrow Architecture: Unlike typical small models (e.g., 6 layers), this student model features 12 Transformer layers but operates within a narrow 384-dimensional embedding space. This depth allows for complex multi-hop reasoning crucial for high-accuracy retrieval tasks, while the narrow dimension ensures extremely fast encoding and small index sizes.

  2. Guided Initialization (GUIDE): The model did not start from random weights. It inherited structural and semantic knowledge from a larger teacher model (answerdotai/ModernBERT-base) via Principal Component Analysis (PCA) projection, which compressed the teacher's 768-dimensional representations into the student's 384-dimensional space and gave the student a substantial head start (a rough sketch of this projection follows the list below).

  3. Extensive MLM Pre-training: Following initialization, the model underwent comprehensive Masked Language Modeling (MLM) pre-training on a highly diverse corpus combining:

    • Search Data (MS MARCO)
    • Academic Texts (Stanford Philosophy)
    • General Knowledge (NPR, FineWiki)
  4. Knowledge Distillation (STS Tuning): The final, critical stage optimized the model for semantic similarity. It was trained to mimic the output embeddings of a powerful Retrieval Teacher (Alibaba-NLP/gte-modernbert-base) using Mean Squared Error (MSE) loss. This specialized tuning ensures its 384-dimensional vectors excel at similarity and retrieval tasks.
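
For a rough sense of the PCA-based initialization in step 2, the snippet below compresses a toy 768-dimensional teacher weight matrix into the student's 384 dimensions with scikit-learn's PCA. It is only an illustration with made-up shapes, not the exact GUIDE procedure (see the GUIDE citation at the bottom of this card).

# Illustrative sketch of PCA projection from teacher width (768) to student width (384).
# The matrix and vocabulary size below are toy placeholders, not the real model weights.
import numpy as np
from sklearn.decomposition import PCA

teacher_dim, student_dim = 768, 384
toy_vocab_size = 1000  # placeholder; the real vocabulary is much larger

# Stand-in for a teacher weight matrix, e.g. its token-embedding table: (vocab, 768)
teacher_matrix = np.random.randn(toy_vocab_size, teacher_dim).astype(np.float32)

# Keep the top 384 principal components of the teacher rows, so each row is
# re-expressed in a 384-dimensional basis that preserves as much variance as possible.
pca = PCA(n_components=student_dim)
student_matrix = pca.fit_transform(teacher_matrix)

print(student_matrix.shape)                  # (1000, 384)
print(pca.explained_variance_ratio_.sum())   # fraction of variance retained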

Training

The final model, ModernBERT-small-v2, was trained on a curated combination of four distinct datasets, spanning the MLM pre-training and distillation phases, to ensure broad general-knowledge acquisition before the final distillation tuning.

The following datasets were integrated and processed:

  1. MS MARCO Triplets (sentence-transformers/msmarco-msmarco-MiniLM-L6-v3, "triplet" split)
    • Source Focus: Query/Document ranking (Search Relevance).
  2. Stanford Encyclopedia of Philosophy Triplets (johnnyboycurtis/Philosophical-Triplets-Retrieval)
    • Source Focus: Deep, technical, and abstract academic reasoning.
  3. NPR Articles (sentence-transformers/npr)
    • Source Focus: Modern news, journalistic style, and general current events.
  4. FineWiki (English) (HuggingFaceFW/finewiki, "en" split)
    • Source Focus: Encyclopedic, factual knowledge spanning a wide range of topics.
    • Only used in distillation training; not used in MLM.

Note: during the final knowledge distillation phase, the training targets were embeddings produced by the teacher model (Alibaba-NLP/gte-modernbert-base) over the combined text content of this merged corpus.
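
For orientation, the corpus merge could be sketched with the 🤗 Datasets library as below. The repository IDs follow the list above, but the split names and column handling are assumptions rather than the exact preprocessing used for this model; FineWiki is omitted here since it was not part of the MLM corpus.

# Hypothetical sketch of assembling the MLM corpus with 🤗 Datasets.
# Repo IDs come from the list above; split names and column handling are assumptions.
from datasets import load_dataset, concatenate_datasets

msmarco = load_dataset("sentence-transformers/msmarco-msmarco-MiniLM-L6-v3", split="triplet")
philosophy = load_dataset("johnnyboycurtis/Philosophical-Triplets-Retrieval", split="train")
npr = load_dataset("sentence-transformers/npr", split="train")

def to_text(example):
    # Placeholder: collapse whatever columns a source has into a single "text" field.
    return {"text": " ".join(str(v) for v in example.values())}

parts = [ds.map(to_text, remove_columns=ds.column_names) for ds in (msmarco, philosophy, npr)]
mlm_corpus = concatenate_datasets(parts).shuffle(seed=42)
print(mlm_corpus)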

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Number of Parameters: ~37M (F32 weights)
  • Maximum Sequence Length: 1024 tokens
  • Output Dimensionality: 384 dimensions
  • Similarity Function: Cosine Similarity
  • Training Dataset:
    • parquet

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 1024, 'do_lower_case': False, 'architecture': 'ModernBertModel'})
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
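
The Pooling module above uses mean pooling (pooling_mode_mean_tokens=True). The following self-contained PyTorch snippet illustrates what that step computes: an attention-mask-aware average of the token embeddings, with toy tensors standing in for real model outputs.

# Illustration of mask-aware mean pooling, the operation configured in the Pooling module above.
import torch

batch, seq_len, dim = 2, 8, 384                      # toy shapes; the model's hidden size is 384
token_embeddings = torch.randn(batch, seq_len, dim)  # stand-in for the Transformer output
attention_mask = torch.tensor([[1, 1, 1, 1, 1, 1, 0, 0],   # 6 real tokens, 2 padding
                               [1, 1, 1, 1, 1, 1, 1, 1]])  # no padding

mask = attention_mask.unsqueeze(-1).float()          # (batch, seq_len, 1)
summed = (token_embeddings * mask).sum(dim=1)        # sum over non-padding positions
counts = mask.sum(dim=1).clamp(min=1e-9)             # number of real tokens per sequence
sentence_embeddings = summed / counts                # (batch, 384)
print(sentence_embeddings.shape)                     # torch.Size([2, 384])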

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

import torch
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
# "flash_attention_2" requires the flash-attn package; "sdpa" works without extra dependencies
model = SentenceTransformer(
    "johnnyboycurtis/ModernBERT-small-v2",
    model_kwargs={"attn_implementation": "flash_attention_2", "dtype": torch.bfloat16},
)

# Run inference
sentences = [
    '# Breda Holmes\nBreda Holmes is a former camogie player, winner of the B+I Star of the Year award in 1987 and seven All Ireland medals in succession between 1984 and 1991, celebrating the seventh by scoring the match-turning goal from Ann Downey’s sideline ball against Cork in the 1991 final.\n\n## Career\nShe captained Carysfort Training College in their 1984 Purcell Cup campaign and won six All Ireland club medals with St Paul’s camogie club, based in Kilkenny city.\n',
    'What is Intellectual Property? Intellectual property (IP) refers to creations of the mind, such as inventions; literary and artistic works; designs; and symbols, names and images used in commerce. IP is protected in law by, for example, patents, copyright and trademarks, which enable people to earn recognition or financial benefit from what they invent or create.',
    '10 Most Famous Soccer Stadiums in the World. The Camp Nou with its capacity of 99,354 is the largest stadium in Europe and also the fourth largest soccer stadium in the world. It is situated in Barcelona, Catalonia, Spain, and is the home of Spanish club Barcelona since 1957.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 384)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[1.0000, 0.2616, 0.5490],
#         [0.2616, 1.0000, 0.3196],
#         [0.5490, 0.3196, 1.0000]])
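
For retrieval-style usage, the same embeddings can be matched against a small corpus with the library's semantic_search utility. The queries and documents below are made up for illustration.

# Minimal retrieval example (toy corpus and query).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("johnnyboycurtis/ModernBERT-small-v2")

corpus = [
    "ModernBERT is a bidirectional encoder with long-context support.",
    "The Camp Nou is the largest football stadium in Europe.",
    "Intellectual property law covers patents, copyright and trademarks.",
]
queries = ["Which stadium in Europe holds the most spectators?"]

corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embeddings = model.encode(queries, convert_to_tensor=True)

# For each query, return the top-k corpus entries ranked by cosine similarity.
hits = util.semantic_search(query_embeddings, corpus_embeddings, top_k=2)
for hit in hits[0]:
    print(f"{hit['score']:.3f}  {corpus[hit['corpus_id']]}")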

Evaluation

Metrics

Knowledge Distillation

Metric Value
negative_mse -77.74

Information Retrieval

Metric NanoMSMARCO NanoHotpotQA
cosine_accuracy@1 0.32 0.52
cosine_accuracy@3 0.52 0.76
cosine_accuracy@5 0.6 0.78
cosine_accuracy@10 0.76 0.84
cosine_precision@1 0.32 0.52
cosine_precision@3 0.1733 0.3333
cosine_precision@5 0.12 0.22
cosine_precision@10 0.076 0.122
cosine_recall@1 0.32 0.26
cosine_recall@3 0.52 0.5
cosine_recall@5 0.6 0.55
cosine_recall@10 0.76 0.61
cosine_ndcg@10 0.5251 0.5457
cosine_mrr@10 0.4523 0.6494
cosine_map@100 0.4624 0.4736

Nano BEIR

  • Dataset: NanoBEIR_mean
  • Evaluated with NanoBEIREvaluator with these parameters:
    {
        "dataset_names": [
            "MSMARCO",
            "HotpotQA"
        ],
        "dataset_id": "sentence-transformers/NanoBEIR-en"
    }
    
Metric Value
cosine_accuracy@1 0.42
cosine_accuracy@3 0.64
cosine_accuracy@5 0.69
cosine_accuracy@10 0.8
cosine_precision@1 0.42
cosine_precision@3 0.2533
cosine_precision@5 0.17
cosine_precision@10 0.099
cosine_recall@1 0.29
cosine_recall@3 0.51
cosine_recall@5 0.575
cosine_recall@10 0.685
cosine_ndcg@10 0.5354
cosine_mrr@10 0.5509
cosine_map@100 0.468
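
The NanoBEIR numbers above can in principle be reproduced with the library's NanoBEIREvaluator, using the dataset names from the configuration shown earlier; this is a sketch, and exact scores may differ slightly across library versions.

# Sketch of re-running the NanoBEIR evaluation with the dataset names from the config above.
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import NanoBEIREvaluator

model = SentenceTransformer("johnnyboycurtis/ModernBERT-small-v2")
evaluator = NanoBEIREvaluator(dataset_names=["MSMARCO", "HotpotQA"])
results = evaluator(model)

# Mean nDCG@10 across the two datasets (the "NanoBEIR_mean_cosine_ndcg@10" column in the logs below).
print(results["NanoBEIR_mean_cosine_ndcg@10"])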

Training Details

Training Dataset

parquet

  • Dataset: parquet
  • Size: 3,375,201 training samples
  • Columns: text and label
  • Approximate statistics based on the first 1000 samples:
    • text: string; min 5 tokens, mean 280.41 tokens, max 1024 tokens
    • label: list of 384 elements
  • Samples (each text passage paired with its 384-dimensional teacher-embedding label; vectors truncated for display):
    # Scientists Link Diamonds To Earth's Quick Cooling

    Scientists say they have evidence the Earth was bombarded by meteors about 13,000 years ago, triggering a 1,000-year cold spell. Researchers write in the journal Science that they have found a layer of microscopic diamonds scattered across North America. An abrupt cooling may have caused many large mammals to become extinct.
    [4.6171875, 2.515625, 2.439453125, -1.4853515625, -6.328125, ...]
    # Brad Giffen
    Brad Giffen is a retired Canadian news anchor who has worked on television in both Canada and the United States.
    Over his broadcasting career he has also worked as a radio personality, disc jockey, VJ, television reporter, television producer and voice-over artist.

    ## Broadcasting career
    Giffen studied at the Poynter Institute for Advanced Journalism Study. In the late 1980s he was a broadcaster on CHUM-FM radio station in Toronto, Ontario, Canada. He previously was John Majhor's successor veejay on CITY-TV's music video program Toronto Rocks. and he hosted the CBC Television battle of the bands competition Rock Wars.
    In 1990, Giffen pivoted to news journalism and became a reporter for CFTO's nightly news program World Beat News (later rebranded as CFTO News in early 1998, and CTV News in 2005).
    In 1993, Giffen moved to the United States and became co-anchor of the nightly news on the Fox affiliate KSTU, in Salt Lake City, Utah. Giffen left that post in 1995 to accept ...
    [-1.693359375, 13.3828125, 4.50390625, 0.41064453125, -2.884765625, ...]
    # How Trump Won, According To The Exit Polls

    Donald Trump will be the next president of the United States. That's remarkable for all sorts of reasons: He has no governmental experience, for example. And many times during his campaign, Trump's words inflamed large swaths of Americans, whether it was his comments from years ago talking about grabbing women's genitals or calling Mexican immigrants in the U.S. illegally "rapists" and playing up crimes committed by immigrants, including drug crimes and murders. But right now, it's also remarkable because almost no one saw it coming. All major forecasters predicted a Hillary Clinton win, whether moderately or by a landslide. So what happened? We don't know just yet why pollsters and forecasters got it wrong, but here's what made this electorate so different from the one that elected Barack Obama by 4 points in 2012. To be clear, it's impossible to break any election results out into fully discrete demographic groups or trends — race, gend...
    [3.4296875, 12.828125, 2.8203125, -5.47265625, -5.390625, ...]
  • Loss: MSELoss
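
To make the text/label layout concrete, the sketch below builds a tiny dataset of texts with teacher embeddings as labels and pairs it with MSELoss. The card does not state how the teacher's 768-dimensional vectors were reduced to the 384-dimensional labels, so the truncation here is purely a placeholder, and the released checkpoint stands in for the freshly initialized student.

# Sketch of a distillation dataset (text -> teacher embedding) with MSELoss.
# The reduction of 768-d teacher vectors to 384-d labels is a placeholder (simple truncation).
from datasets import Dataset
from sentence_transformers import SentenceTransformer, losses

teacher = SentenceTransformer("Alibaba-NLP/gte-modernbert-base")
student = SentenceTransformer("johnnyboycurtis/ModernBERT-small-v2")  # stand-in for the initialized student

texts = [
    "Scientists say a layer of microscopic diamonds points to an ancient meteor bombardment.",
    "Brad Giffen is a retired Canadian news anchor who worked in Canada and the United States.",
]
labels = teacher.encode(texts)[:, :384]  # placeholder reduction to 384 dimensions

train_dataset = Dataset.from_dict({"text": texts, "label": labels.tolist()})
loss = losses.MSELoss(model=student)  # mean squared error between student and target embeddings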

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 64
  • per_device_eval_batch_size: 64
  • learning_rate: 0.0001
  • num_train_epochs: 2
  • warmup_steps: 0.1
  • fp16: True
  • load_best_model_at_end: True

All Hyperparameters

Click to expand
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 64
  • per_device_eval_batch_size: 64
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 0.0001
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 2
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: None
  • warmup_ratio: None
  • warmup_steps: 0.1
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • enable_jit_checkpoint: False
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • use_cpu: False
  • seed: 42
  • data_seed: None
  • bf16: False
  • fp16: True
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: -1
  • ddp_backend: None
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: True
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • parallelism_config: None
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • group_by_length: False
  • length_column_name: length
  • project: huggingface
  • trackio_space_id: trackio
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • hub_revision: None
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • auto_find_batch_size: False
  • full_determinism: False
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • include_num_input_tokens_seen: no
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • liger_kernel_config: None
  • eval_use_gather_object: False
  • average_tokens_across_devices: True
  • use_cache: False
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: proportional
  • router_mapping: {}
  • learning_rate_mapping: {}

Training Logs

Click to expand
Epoch Step Training Loss mse-dev_negative_mse NanoMSMARCO_cosine_ndcg@10 NanoHotpotQA_cosine_ndcg@10 NanoBEIR_mean_cosine_ndcg@10
0.0019 100 4.2698 - - - -
0.0038 200 4.2304 - - - -
0.0057 300 4.1280 - - - -
0.0076 400 3.8576 - - - -
0.0095 500 3.1561 - - - -
0.0114 600 2.5527 - - - -
0.0133 700 2.3275 - - - -
0.0152 800 2.2656 - - - -
0.0171 900 2.2401 - - - -
0.0190 1000 2.2256 -221.2144 0.0514 0.0577 0.0545
0.0209 1100 2.2140 - - - -
0.0228 1200 2.1920 - - - -
0.0247 1300 2.1840 - - - -
0.0265 1400 2.1662 - - - -
0.0284 1500 2.1598 - - - -
0.0303 1600 2.1452 - - - -
0.0322 1700 2.1226 - - - -
0.0341 1800 2.1068 - - - -
0.0360 1900 2.0941 - - - -
0.0379 2000 2.0796 -206.8865 0.1481 0.0672 0.1077
0.0398 2100 2.0621 - - - -
0.0417 2200 2.0545 - - - -
0.0436 2300 2.0382 - - - -
0.0455 2400 2.0267 - - - -
0.0474 2500 2.0167 - - - -
0.0493 2600 2.0041 - - - -
0.0512 2700 1.9902 - - - -
0.0531 2800 1.9746 - - - -
0.0550 2900 1.9650 - - - -
0.0569 3000 1.9539 -194.5440 0.1243 0.1242 0.1243
0.0588 3100 1.9401 - - - -
0.0607 3200 1.9317 - - - -
0.0626 3300 1.9181 - - - -
0.0645 3400 1.9098 - - - -
0.0664 3500 1.8983 - - - -
0.0683 3600 1.8924 - - - -
0.0702 3700 1.8806 - - - -
0.0721 3800 1.8717 - - - -
0.0740 3900 1.8591 - - - -
0.0758 4000 1.8525 -184.2026 0.1647 0.1745 0.1696
0.0777 4100 1.8416 - - - -
0.0796 4200 1.8359 - - - -
0.0815 4300 1.8256 - - - -
0.0834 4400 1.8131 - - - -
0.0853 4500 1.8063 - - - -
0.0872 4600 1.7950 - - - -
0.0891 4700 1.7846 - - - -
0.0910 4800 1.7762 - - - -
0.0929 4900 1.7620 - - - -
0.0948 5000 1.7605 -175.1685 0.1960 0.2024 0.1992
0.0967 5100 1.7481 - - - -
0.0986 5200 1.7419 - - - -
0.1005 5300 1.7301 - - - -
0.1024 5400 1.7280 - - - -
0.1043 5500 1.7131 - - - -
0.1062 5600 1.7063 - - - -
0.1081 5700 1.6959 - - - -
0.1100 5800 1.6884 - - - -
0.1119 5900 1.6801 - - - -
0.1138 6000 1.6700 -166.4924 0.2493 0.2150 0.2321
0.1157 6100 1.6637 - - - -
0.1176 6200 1.6543 - - - -
0.1195 6300 1.6451 - - - -
0.1214 6400 1.6382 - - - -
0.1233 6500 1.6278 - - - -
0.1251 6600 1.6235 - - - -
0.1270 6700 1.6150 - - - -
0.1289 6800 1.6054 - - - -
0.1308 6900 1.6007 - - - -
0.1327 7000 1.5874 -158.1013 0.2809 0.2349 0.2579
0.1346 7100 1.5824 - - - -
0.1365 7200 1.5724 - - - -
0.1384 7300 1.5669 - - - -
0.1403 7400 1.5535 - - - -
0.1422 7500 1.5450 - - - -
0.1441 7600 1.5345 - - - -
0.1460 7700 1.5340 - - - -
0.1479 7800 1.5242 - - - -
0.1498 7900 1.5181 - - - -
0.1517 8000 1.5086 -150.1032 0.2957 0.2454 0.2705
0.1536 8100 1.5007 - - - -
0.1555 8200 1.4950 - - - -
0.1574 8300 1.4829 - - - -
0.1593 8400 1.4780 - - - -
0.1612 8500 1.4737 - - - -
0.1631 8600 1.4603 - - - -
0.1650 8700 1.4510 - - - -
0.1669 8800 1.4500 - - - -
0.1688 8900 1.4408 - - - -
0.1707 9000 1.4372 -142.8462 0.3033 0.2824 0.2929
0.1726 9100 1.4270 - - - -
0.1744 9200 1.4233 - - - -
0.1763 9300 1.4135 - - - -
0.1782 9400 1.4074 - - - -
0.1801 9500 1.3981 - - - -
0.1820 9600 1.3919 - - - -
0.1839 9700 1.3844 - - - -
0.1858 9800 1.3741 - - - -
0.1877 9900 1.3685 - - - -
0.1896 10000 1.3668 -135.7081 0.3194 0.3059 0.3127
0.1915 10100 1.3568 - - - -
0.1934 10200 1.3505 - - - -
0.1953 10300 1.3433 - - - -
0.1972 10400 1.3338 - - - -
0.1991 10500 1.3295 - - - -
0.2010 10600 1.3275 - - - -
0.2029 10700 1.3149 - - - -
0.2048 10800 1.3119 - - - -
0.2067 10900 1.3055 - - - -
0.2086 11000 1.2952 -129.2064 0.3109 0.3434 0.3272
0.2105 11100 1.2920 - - - -
0.2124 11200 1.2851 - - - -
0.2143 11300 1.2769 - - - -
0.2162 11400 1.2747 - - - -
0.2181 11500 1.2686 - - - -
0.2200 11600 1.2684 - - - -
0.2219 11700 1.2582 - - - -
0.2237 11800 1.2582 - - - -
0.2256 11900 1.2479 - - - -
0.2275 12000 1.2418 -123.6261 0.3439 0.3547 0.3493
0.2294 12100 1.2400 - - - -
0.2313 12200 1.2330 - - - -
0.2332 12300 1.2288 - - - -
0.2351 12400 1.2230 - - - -
0.2370 12500 1.2164 - - - -
0.2389 12600 1.2157 - - - -
0.2408 12700 1.2166 - - - -
0.2427 12800 1.2045 - - - -
0.2446 12900 1.2035 - - - -
0.2465 13000 1.1968 -118.8691 0.3282 0.3329 0.3306
0.2484 13100 1.1942 - - - -
0.2503 13200 1.1895 - - - -
0.2522 13300 1.1843 - - - -
0.2541 13400 1.1755 - - - -
0.2560 13500 1.1756 - - - -
0.2579 13600 1.1707 - - - -
0.2598 13700 1.1637 - - - -
0.2617 13800 1.1684 - - - -
0.2636 13900 1.1628 - - - -
0.2655 14000 1.1585 -115.4122 0.3779 0.3579 0.3679
0.2674 14100 1.1602 - - - -
0.2693 14200 1.1504 - - - -
0.2712 14300 1.1483 - - - -
0.2730 14400 1.1488 - - - -
0.2749 14500 1.1392 - - - -
0.2768 14600 1.1343 - - - -
0.2787 14700 1.1363 - - - -
0.2806 14800 1.1342 - - - -
0.2825 14900 1.1327 - - - -
0.2844 15000 1.1219 -111.9139 0.3794 0.3791 0.3793
0.2863 15100 1.1246 - - - -
0.2882 15200 1.1152 - - - -
0.2901 15300 1.1196 - - - -
0.2920 15400 1.1097 - - - -
0.2939 15500 1.1067 - - - -
0.2958 15600 1.0994 - - - -
0.2977 15700 1.1077 - - - -
0.2996 15800 1.1057 - - - -
0.3015 15900 1.0949 - - - -
0.3034 16000 1.0981 -109.2994 0.3867 0.3855 0.3861
0.3053 16100 1.0933 - - - -
0.3072 16200 1.0873 - - - -
0.3091 16300 1.0851 - - - -
0.3110 16400 1.0840 - - - -
0.3129 16500 1.0831 - - - -
0.3148 16600 1.0755 - - - -
0.3167 16700 1.0733 - - - -
0.3186 16800 1.0724 - - - -
0.3205 16900 1.0698 - - - -
0.3223 17000 1.0710 -106.3769 0.4092 0.4066 0.4079
0.3242 17100 1.0699 - - - -
0.3261 17200 1.0642 - - - -
0.3280 17300 1.0576 - - - -
0.3299 17400 1.0597 - - - -
0.3318 17500 1.0572 - - - -
0.3337 17600 1.0547 - - - -
0.3356 17700 1.0502 - - - -
0.3375 17800 1.0467 - - - -
0.3394 17900 1.0485 - - - -
0.3413 18000 1.0455 -103.7698 0.4510 0.4237 0.4374
0.3432 18100 1.0433 - - - -
0.3451 18200 1.0404 - - - -
0.3470 18300 1.0397 - - - -
0.3489 18400 1.0352 - - - -
0.3508 18500 1.0318 - - - -
0.3527 18600 1.0302 - - - -
0.3546 18700 1.0330 - - - -
0.3565 18800 1.0220 - - - -
0.3584 18900 1.0223 - - - -
0.3603 19000 1.0254 -101.5743 0.4439 0.4265 0.4352
0.3622 19100 1.0186 - - - -
0.3641 19200 1.0216 - - - -
0.3660 19300 1.0152 - - - -
0.3679 19400 1.0139 - - - -
0.3698 19500 1.0125 - - - -
0.3716 19600 1.0087 - - - -
0.3735 19700 1.0045 - - - -
0.3754 19800 1.0032 - - - -
0.3773 19900 1.0013 - - - -
0.3792 20000 1.0017 -99.6613 0.4554 0.4374 0.4464
0.3811 20100 1.0007 - - - -
0.3830 20200 0.9959 - - - -
0.3849 20300 0.9965 - - - -
0.3868 20400 0.9909 - - - -
0.3887 20500 0.9902 - - - -
0.3906 20600 0.9903 - - - -
0.3925 20700 0.9927 - - - -
0.3944 20800 0.9865 - - - -
0.3963 20900 0.9843 - - - -
0.3982 21000 0.9809 -97.4922 0.4689 0.4462 0.4575
0.4001 21100 0.9801 - - - -
0.4020 21200 0.9785 - - - -
0.4039 21300 0.9718 - - - -
0.4058 21400 0.9725 - - - -
0.4077 21500 0.9705 - - - -
0.4096 21600 0.9729 - - - -
0.4115 21700 0.9714 - - - -
0.4134 21800 0.9647 - - - -
0.4153 21900 0.9623 - - - -
0.4172 22000 0.9579 -95.7813 0.4642 0.4549 0.4595
0.4191 22100 0.9553 - - - -
0.4209 22200 0.9558 - - - -
0.4228 22300 0.9584 - - - -
0.4247 22400 0.9544 - - - -
0.4266 22500 0.9520 - - - -
0.4285 22600 0.9516 - - - -
0.4304 22700 0.9543 - - - -
0.4323 22800 0.9502 - - - -
0.4342 22900 0.9477 - - - -
0.4361 23000 0.9405 -93.9238 0.4856 0.4521 0.4688
0.4380 23100 0.9448 - - - -
0.4399 23200 0.9424 - - - -
0.4418 23300 0.9369 - - - -
0.4437 23400 0.9318 - - - -
0.4456 23500 0.9342 - - - -
0.4475 23600 0.9392 - - - -
0.4494 23700 0.9358 - - - -
0.4513 23800 0.9303 - - - -
0.4532 23900 0.9306 - - - -
0.4551 24000 0.9277 -92.2427 0.4946 0.4798 0.4872
0.4570 24100 0.9267 - - - -
0.4589 24200 0.9228 - - - -
0.4608 24300 0.9239 - - - -
0.4627 24400 0.9225 - - - -
0.4646 24500 0.9169 - - - -
0.4665 24600 0.9170 - - - -
0.4684 24700 0.9195 - - - -
0.4702 24800 0.9153 - - - -
0.4721 24900 0.9138 - - - -
0.4740 25000 0.9108 -90.7635 0.4622 0.4812 0.4717
0.4759 25100 0.9133 - - - -
0.4778 25200 0.9076 - - - -
0.4797 25300 0.9081 - - - -
0.4816 25400 0.9093 - - - -
0.4835 25500 0.9037 - - - -
0.4854 25600 0.9025 - - - -
0.4873 25700 0.9058 - - - -
0.4892 25800 0.9018 - - - -
0.4911 25900 0.9014 - - - -
0.4930 26000 0.8946 -89.2562 0.4745 0.4957 0.4851
0.4949 26100 0.8982 - - - -
0.4968 26200 0.8946 - - - -
0.4987 26300 0.8941 - - - -
0.5006 26400 0.8925 - - - -
0.5025 26500 0.8947 - - - -
0.5044 26600 0.8906 - - - -
0.5063 26700 0.8895 - - - -
0.5082 26800 0.8866 - - - -
0.5101 26900 0.8840 - - - -
0.5120 27000 0.8764 -87.8039 0.5011 0.5173 0.5092
0.5139 27100 0.8859 - - - -
0.5158 27200 0.8839 - - - -
0.5177 27300 0.8794 - - - -
0.5195 27400 0.8790 - - - -
0.5214 27500 0.8788 - - - -
0.5233 27600 0.8780 - - - -
0.5252 27700 0.8749 - - - -
0.5271 27800 0.8742 - - - -
0.5290 27900 0.8700 - - - -
0.5309 28000 0.8691 -86.4419 0.4936 0.4776 0.4856
0.5328 28100 0.8747 - - - -
0.5347 28200 0.8644 - - - -
0.5366 28300 0.8673 - - - -
0.5385 28400 0.8670 - - - -
0.5404 28500 0.8638 - - - -
0.5423 28600 0.8649 - - - -
0.5442 28700 0.8629 - - - -
0.5461 28800 0.8629 - - - -
0.5480 28900 0.8591 - - - -
0.5499 29000 0.8566 -85.0408 0.4792 0.4918 0.4855
0.5518 29100 0.8588 - - - -
0.5537 29200 0.8545 - - - -
0.5556 29300 0.8534 - - - -
0.5575 29400 0.8543 - - - -
0.5594 29500 0.8534 - - - -
0.5613 29600 0.8519 - - - -
0.5632 29700 0.8486 - - - -
0.5651 29800 0.8530 - - - -
0.5670 29900 0.8477 - - - -
0.5688 30000 0.8465 -83.9435 0.4986 0.5097 0.5042
0.5707 30100 0.8425 - - - -
0.5726 30200 0.8437 - - - -
0.5745 30300 0.8430 - - - -
0.5764 30400 0.8431 - - - -
0.5783 30500 0.8424 - - - -
0.5802 30600 0.8403 - - - -
0.5821 30700 0.8347 - - - -
0.5840 30800 0.8344 - - - -
0.5859 30900 0.8348 - - - -
0.5878 31000 0.8351 -82.8113 0.4999 0.5088 0.5043
0.5897 31100 0.8362 - - - -
0.5916 31200 0.8307 - - - -
0.5935 31300 0.8315 - - - -
0.5954 31400 0.8311 - - - -
0.5973 31500 0.8305 - - - -
0.5992 31600 0.8304 - - - -
0.6011 31700 0.8277 - - - -
0.6030 31800 0.8249 - - - -
0.6049 31900 0.8262 - - - -
0.6068 32000 0.8236 -81.7389 0.4811 0.5256 0.5034
0.6087 32100 0.8209 - - - -
0.6106 32200 0.8226 - - - -
0.6125 32300 0.8207 - - - -
0.6144 32400 0.8224 - - - -
0.6163 32500 0.8163 - - - -
0.6182 32600 0.8181 - - - -
0.6200 32700 0.8147 - - - -
0.6219 32800 0.8170 - - - -
0.6238 32900 0.8156 - - - -
0.6257 33000 0.8141 -80.4979 0.5042 0.5085 0.5064
0.6276 33100 0.8088 - - - -
0.6295 33200 0.8098 - - - -
0.6314 33300 0.8133 - - - -
0.6333 33400 0.8087 - - - -
0.6352 33500 0.8086 - - - -
0.6371 33600 0.8094 - - - -
0.6390 33700 0.8054 - - - -
0.6409 33800 0.8043 - - - -
0.6428 33900 0.8035 - - - -
0.6447 34000 0.7990 -79.5726 0.4990 0.5166 0.5078
0.6466 34100 0.8035 - - - -
0.6485 34200 0.7990 - - - -
0.6504 34300 0.7996 - - - -
0.6523 34400 0.8005 - - - -
0.6542 34500 0.8000 - - - -
0.6561 34600 0.7975 - - - -
0.6580 34700 0.7959 - - - -
0.6599 34800 0.7921 - - - -
0.6618 34900 0.7916 - - - -
0.6637 35000 0.7933 -78.7884 0.5104 0.5139 0.5122
0.6656 35100 0.7908 - - - -
0.6675 35200 0.7913 - - - -
0.6693 35300 0.7921 - - - -
0.6712 35400 0.7929 - - - -
0.6731 35500 0.7915 - - - -
0.6750 35600 0.7871 - - - -
0.6769 35700 0.7836 - - - -
0.6788 35800 0.7805 - - - -
0.6807 35900 0.7870 - - - -
0.6826 36000 0.7797 -77.7400 0.5251 0.5457 0.5354

Framework Versions

  • Python: 3.11.13
  • Sentence Transformers: 5.2.2
  • Transformers: 5.1.0
  • PyTorch: 2.7.1+cu128
  • Accelerate: 1.9.0
  • Datasets: 4.0.0
  • Tokenizers: 0.22.2

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MSELoss

@inproceedings{reimers-2020-multilingual-sentence-bert,
    title = "Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2020",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/2004.09813",
}

ModernBERT Model Architecture

@misc{warner2024smarterbetterfasterlonger,
      title={Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference}, 
      author={Benjamin Warner and Antoine Chaffin and Benjamin Clavié and Orion Weller and Oskar Hallström and Said Taghadouini and Alexis Gallagher and Raja Biswas and Faisal Ladhak and Tom Aarsen and Nathan Cooper and Griffin Adams and Jeremy Howard and Iacopo Poli},
      year={2024},
      eprint={2412.13663},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2412.13663}, 
}

Model Weight Initialization

@misc{trinh2025guideguidedinitializationdistillation,
      title={GUIDE: Guided Initialization and Distillation of Embeddings}, 
      author={Khoa Trinh and Gaurav Menghani and Erik Vee},
      year={2025},
      eprint={2510.06502},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2510.06502}, 
}