SentenceTransformer: ModernBERT-small-v2

ModernBERT-small-v2 is a compact, efficient, and accurate dense vector encoder. It combines a small ModernBERT architecture, straightforward MLM pre-training, and distillation from a larger, high-performing teacher model to achieve strong retrieval quality at a fraction of the computational cost of standard large models.

Key Features & Training Methodology

This model was created using a specialized four-stage pipeline:

  1. Deep & Narrow Architecture: Unlike typical small models (e.g., 6 layers), this student model features 12 Transformer layers but operates within a narrow 384-dimensional embedding space. This depth allows for complex multi-hop reasoning crucial for high-accuracy retrieval tasks, while the narrow dimension ensures extremely fast encoding and small index sizes.

  2. Guided Initialization (GUIDE): The model did not start from random weights. It inherited structural and semantic knowledge from a larger teacher model (answerdotai/ModernBERT-base) via Principal Component Analysis (PCA) projection, which compressed the teacher's 768-dimensional representations into the student's 384-dimensional space and gave the student a substantial head start (a rough sketch of this projection follows the list below).

  3. Extensive MLM Pre-training: Following initialization, the model underwent comprehensive Masked Language Modeling (MLM) pre-training on a highly diverse corpus combining:

    • Search Data (MS MARCO)
    • Academic Texts (Stanford Philosophy)
    • General Knowledge (NPR, FineWiki)
  4. Knowledge Distillation (STS Tuning): The final, critical stage optimized the model for semantic similarity. It was trained to mimic the output embeddings of a powerful Retrieval Teacher (Alibaba-NLP/gte-modernbert-base) using Mean Squared Error (MSE) loss. This specialized tuning ensures its 384-dimensional vectors excel at similarity and retrieval tasks.
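
For a rough sense of the PCA-based initialization in step 2, the snippet below compresses a toy 768-dimensional teacher weight matrix into the student's 384 dimensions with scikit-learn's PCA. It is only an illustration with made-up shapes, not the exact GUIDE procedure (see the GUIDE citation at the bottom of this card).

# Illustrative sketch of PCA projection from teacher width (768) to student width (384).
# The matrix and vocabulary size below are toy placeholders, not the real model weights.
import numpy as np
from sklearn.decomposition import PCA

teacher_dim, student_dim = 768, 384
toy_vocab_size = 1000  # placeholder; the real vocabulary is much larger

# Stand-in for a teacher weight matrix, e.g. its token-embedding table: (vocab, 768)
teacher_matrix = np.random.randn(toy_vocab_size, teacher_dim).astype(np.float32)

# Keep the top 384 principal components of the teacher rows, so each row is
# re-expressed in a 384-dimensional basis that preserves as much variance as possible.
pca = PCA(n_components=student_dim)
student_matrix = pca.fit_transform(teacher_matrix)

print(student_matrix.shape)                  # (1000, 384)
print(pca.explained_variance_ratio_.sum())   # fraction of variance retained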

Training

The final model, ModernBERT-small-v2, was trained on a curated combination of four distinct datasets, spanning the MLM pre-training and distillation phases, to ensure broad general-knowledge acquisition before the final distillation tuning.

The following datasets were integrated and processed:

  1. MS MARCO Triplets (sentence-transformers/msmarco-msmarco-MiniLM-L6-v3, "triplet" split)
    • Source Focus: Query/Document ranking (Search Relevance).
  2. Stanford Encyclopedia of Philosophy Triplets (johnnyboycurtis/Philosophical-Triplets-Retrieval)
    • Source Focus: Deep, technical, and abstract academic reasoning.
  3. NPR Articles (sentence-transformers/npr)
    • Source Focus: Modern news, journalistic style, and general current events.
  4. FineWiki (English) (HuggingFaceFW/finewiki, "en" split)
    • Source Focus: Encyclopedic, factual knowledge spanning a wide range of topics.
    • Only used in distillation training; not used in MLM.

Note: during the final knowledge distillation phase, the training targets were embeddings produced by the teacher model (Alibaba-NLP/gte-modernbert-base) over the combined text content of this merged corpus.
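
For orientation, the corpus merge could be sketched with the 🤗 Datasets library as below. The repository IDs follow the list above, but the split names and column handling are assumptions rather than the exact preprocessing used for this model; FineWiki is omitted here since it was not part of the MLM corpus.

# Hypothetical sketch of assembling the MLM corpus with 🤗 Datasets.
# Repo IDs come from the list above; split names and column handling are assumptions.
from datasets import load_dataset, concatenate_datasets

msmarco = load_dataset("sentence-transformers/msmarco-msmarco-MiniLM-L6-v3", split="triplet")
philosophy = load_dataset("johnnyboycurtis/Philosophical-Triplets-Retrieval", split="train")
npr = load_dataset("sentence-transformers/npr", split="train")

def to_text(example):
    # Placeholder: collapse whatever columns a source has into a single "text" field.
    return {"text": " ".join(str(v) for v in example.values())}

parts = [ds.map(to_text, remove_columns=ds.column_names) for ds in (msmarco, philosophy, npr)]
mlm_corpus = concatenate_datasets(parts).shuffle(seed=42)
print(mlm_corpus)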

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Number of Parameters: ~37M (F32 weights)
  • Maximum Sequence Length: 1024 tokens
  • Output Dimensionality: 384 dimensions
  • Similarity Function: Cosine Similarity
  • Training Dataset:
    • parquet

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 1024, 'do_lower_case': False, 'architecture': 'ModernBertModel'})
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
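
The Pooling module above uses mean pooling (pooling_mode_mean_tokens=True). The following self-contained PyTorch snippet illustrates what that step computes: an attention-mask-aware average of the token embeddings, with toy tensors standing in for real model outputs.

# Illustration of mask-aware mean pooling, the operation configured in the Pooling module above.
import torch

batch, seq_len, dim = 2, 8, 384                      # toy shapes; the model's hidden size is 384
token_embeddings = torch.randn(batch, seq_len, dim)  # stand-in for the Transformer output
attention_mask = torch.tensor([[1, 1, 1, 1, 1, 1, 0, 0],   # 6 real tokens, 2 padding
                               [1, 1, 1, 1, 1, 1, 1, 1]])  # no padding

mask = attention_mask.unsqueeze(-1).float()          # (batch, seq_len, 1)
summed = (token_embeddings * mask).sum(dim=1)        # sum over non-padding positions
counts = mask.sum(dim=1).clamp(min=1e-9)             # number of real tokens per sequence
sentence_embeddings = summed / counts                # (batch, 384)
print(sentence_embeddings.shape)                     # torch.Size([2, 384])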

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

import torch
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
# "flash_attention_2" requires the flash-attn package; "sdpa" works without extra dependencies
model = SentenceTransformer(
    "johnnyboycurtis/ModernBERT-small-v2",
    model_kwargs={"attn_implementation": "flash_attention_2", "dtype": torch.bfloat16},
)

# Run inference
sentences = [
    '# Breda Holmes\nBreda Holmes is a former camogie player, winner of the B+I Star of the Year award in 1987 and seven All Ireland medals in succession between 1984 and 1991, celebrating the seventh by scoring the match-turning goal from Ann Downey’s sideline ball against Cork in the 1991 final.\n\n## Career\nShe captained Carysfort Training College in their 1984 Purcell Cup campaign and won six All Ireland club medals with St Paul’s camogie club, based in Kilkenny city.\n',
    'What is Intellectual Property? Intellectual property (IP) refers to creations of the mind, such as inventions; literary and artistic works; designs; and symbols, names and images used in commerce. IP is protected in law by, for example, patents, copyright and trademarks, which enable people to earn recognition or financial benefit from what they invent or create.',
    '10 Most Famous Soccer Stadiums in the World. The Camp Nou with its capacity of 99,354 is the largest stadium in Europe and also the fourth largest soccer stadium in the world. It is situated in Barcelona, Catalonia, Spain, and is the home of Spanish club Barcelona since 1957.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 384)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[1.0000, 0.2616, 0.5490],
#         [0.2616, 1.0000, 0.3196],
#         [0.5490, 0.3196, 1.0000]])
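
For retrieval-style usage, the same embeddings can be matched against a small corpus with the library's semantic_search utility. The queries and documents below are made up for illustration.

# Minimal retrieval example (toy corpus and query).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("johnnyboycurtis/ModernBERT-small-v2")

corpus = [
    "ModernBERT is a bidirectional encoder with long-context support.",
    "The Camp Nou is the largest football stadium in Europe.",
    "Intellectual property law covers patents, copyright and trademarks.",
]
queries = ["Which stadium in Europe holds the most spectators?"]

corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embeddings = model.encode(queries, convert_to_tensor=True)

# For each query, return the top-k corpus entries ranked by cosine similarity.
hits = util.semantic_search(query_embeddings, corpus_embeddings, top_k=2)
for hit in hits[0]:
    print(f"{hit['score']:.3f}  {corpus[hit['corpus_id']]}")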

Evaluation

Metrics

Knowledge Distillation

Metric Value
negative_mse -77.74

Information Retrieval

Metric NanoMSMARCO NanoHotpotQA
cosine_accuracy@1 0.32 0.52
cosine_accuracy@3 0.52 0.76
cosine_accuracy@5 0.6 0.78
cosine_accuracy@10 0.76 0.84
cosine_precision@1 0.32 0.52
cosine_precision@3 0.1733 0.3333
cosine_precision@5 0.12 0.22
cosine_precision@10 0.076 0.122
cosine_recall@1 0.32 0.26
cosine_recall@3 0.52 0.5
cosine_recall@5 0.6 0.55
cosine_recall@10 0.76 0.61
cosine_ndcg@10 0.5251 0.5457
cosine_mrr@10 0.4523 0.6494
cosine_map@100 0.4624 0.4736

Nano BEIR

  • Dataset: NanoBEIR_mean
  • Evaluated with NanoBEIREvaluator with these parameters:
    {
        "dataset_names": [
            "MSMARCO",
            "HotpotQA"
        ],
        "dataset_id": "sentence-transformers/NanoBEIR-en"
    }
    
Metric Value
cosine_accuracy@1 0.42
cosine_accuracy@3 0.64
cosine_accuracy@5 0.69
cosine_accuracy@10 0.8
cosine_precision@1 0.42
cosine_precision@3 0.2533
cosine_precision@5 0.17
cosine_precision@10 0.099
cosine_recall@1 0.29
cosine_recall@3 0.51
cosine_recall@5 0.575
cosine_recall@10 0.685
cosine_ndcg@10 0.5354
cosine_mrr@10 0.5509
cosine_map@100 0.468
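
The NanoBEIR numbers above can in principle be reproduced with the library's NanoBEIREvaluator, using the dataset names from the configuration shown earlier; this is a sketch, and exact scores may differ slightly across library versions.

# Sketch of re-running the NanoBEIR evaluation with the dataset names from the config above.
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import NanoBEIREvaluator

model = SentenceTransformer("johnnyboycurtis/ModernBERT-small-v2")
evaluator = NanoBEIREvaluator(dataset_names=["MSMARCO", "HotpotQA"])
results = evaluator(model)

# Mean nDCG@10 across the two datasets (the "NanoBEIR_mean_cosine_ndcg@10" column in the logs below).
print(results["NanoBEIR_mean_cosine_ndcg@10"])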

Training Details

Training Dataset

parquet

  • Dataset: parquet
  • Size: 3,375,201 training samples
  • Columns: text and label
  • Approximate statistics based on the first 1000 samples:
    • text: string; min 5 tokens, mean 280.41 tokens, max 1024 tokens
    • label: list of 384 elements
  • Samples (each text passage paired with its 384-dimensional teacher-embedding label; vectors truncated for display):
    # Scientists Link Diamonds To Earth's Quick Cooling

    Scientists say they have evidence the Earth was bombarded by meteors about 13,000 years ago, triggering a 1,000-year cold spell. Researchers write in the journal Science that they have found a layer of microscopic diamonds scattered across North America. An abrupt cooling may have caused many large mammals to become extinct.
    [4.6171875, 2.515625, 2.439453125, -1.4853515625, -6.328125, ...]
    # Brad Giffen
    Brad Giffen is a retired Canadian news anchor who has worked on television in both Canada and the United States.
    Over his broadcasting career he has also worked as a radio personality, disc jockey, VJ, television reporter, television producer and voice-over artist.

    ## Broadcasting career
    Giffen studied at the Poynter Institute for Advanced Journalism Study. In the late 1980s he was a broadcaster on CHUM-FM radio station in Toronto, Ontario, Canada. He previously was John Majhor's successor veejay on CITY-TV's music video program Toronto Rocks. and he hosted the CBC Television battle of the bands competition Rock Wars.
    In 1990, Giffen pivoted to news journalism and became a reporter for CFTO's nightly news program World Beat News (later rebranded as CFTO News in early 1998, and CTV News in 2005).
    In 1993, Giffen moved to the United States and became co-anchor of the nightly news on the Fox affiliate KSTU, in Salt Lake City, Utah. Giffen left that post in 1995 to accept ...
    [-1.693359375, 13.3828125, 4.50390625, 0.41064453125, -2.884765625, ...]
    # How Trump Won, According To The Exit Polls

    Donald Trump will be the next president of the United States. That's remarkable for all sorts of reasons: He has no governmental experience, for example. And many times during his campaign, Trump's words inflamed large swaths of Americans, whether it was his comments from years ago talking about grabbing women's genitals or calling Mexican immigrants in the U.S. illegally "rapists" and playing up crimes committed by immigrants, including drug crimes and murders. But right now, it's also remarkable because almost no one saw it coming. All major forecasters predicted a Hillary Clinton win, whether moderately or by a landslide. So what happened? We don't know just yet why pollsters and forecasters got it wrong, but here's what made this electorate so different from the one that elected Barack Obama by 4 points in 2012. To be clear, it's impossible to break any election results out into fully discrete demographic groups or trends — race, gend...
    [3.4296875, 12.828125, 2.8203125, -5.47265625, -5.390625, ...]
  • Loss: MSELoss
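
To make the text/label layout concrete, the sketch below builds a tiny dataset of texts with teacher embeddings as labels and pairs it with MSELoss. The card does not state how the teacher's 768-dimensional vectors were reduced to the 384-dimensional labels, so the truncation here is purely a placeholder, and the released checkpoint stands in for the freshly initialized student.

# Sketch of a distillation dataset (text -> teacher embedding) with MSELoss.
# The reduction of 768-d teacher vectors to 384-d labels is a placeholder (simple truncation).
from datasets import Dataset
from sentence_transformers import SentenceTransformer, losses

teacher = SentenceTransformer("Alibaba-NLP/gte-modernbert-base")
student = SentenceTransformer("johnnyboycurtis/ModernBERT-small-v2")  # stand-in for the initialized student

texts = [
    "Scientists say a layer of microscopic diamonds points to an ancient meteor bombardment.",
    "Brad Giffen is a retired Canadian news anchor who worked in Canada and the United States.",
]
labels = teacher.encode(texts)[:, :384]  # placeholder reduction to 384 dimensions

train_dataset = Dataset.from_dict({"text": texts, "label": labels.tolist()})
loss = losses.MSELoss(model=student)  # mean squared error between student and target embeddings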

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 64
  • per_device_eval_batch_size: 64
  • learning_rate: 0.0001
  • num_train_epochs: 2
  • warmup_steps: 0.1
  • fp16: True
  • load_best_model_at_end: True

All Hyperparameters

Click to expand
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 64
  • per_device_eval_batch_size: 64
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 0.0001
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 2
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: None
  • warmup_ratio: None
  • warmup_steps: 0.1
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • enable_jit_checkpoint: False
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • use_cpu: False
  • seed: 42
  • data_seed: None
  • bf16: False
  • fp16: True
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: -1
  • ddp_backend: None
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: True
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • parallelism_config: None
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • group_by_length: False
  • length_column_name: length
  • project: huggingface
  • trackio_space_id: trackio
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • hub_revision: None
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • auto_find_batch_size: False
  • full_determinism: False
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • include_num_input_tokens_seen: no
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • liger_kernel_config: None
  • eval_use_gather_object: False
  • average_tokens_across_devices: True
  • use_cache: False
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: proportional
  • router_mapping: {}
  • learning_rate_mapping: {}

Training Logs

Click to expand
Epoch Step Training Loss mse-dev_negative_mse NanoMSMARCO_cosine_ndcg@10 NanoHotpotQA_cosine_ndcg@10 NanoBEIR_mean_cosine_ndcg@10
0.0019 100 4.2698 - - - -
0.0038 200 4.2304 - - - -
0.0057 300 4.1280 - - - -
0.0076 400 3.8576 - - - -
0.0095 500 3.1561 - - - -
0.0114 600 2.5527 - - - -
0.0133 700 2.3275 - - - -
0.0152 800 2.2656 - - - -
0.0171 900 2.2401 - - - -
0.0190 1000 2.2256 -221.2144 0.0514 0.0577 0.0545
0.0209 1100 2.2140 - - - -
0.0228 1200 2.1920 - - - -
0.0247 1300 2.1840 - - - -
0.0265 1400 2.1662 - - - -
0.0284 1500 2.1598 - - - -
0.0303 1600 2.1452 - - - -
0.0322 1700 2.1226 - - - -
0.0341 1800 2.1068 - - - -
0.0360 1900 2.0941 - - - -
0.0379 2000 2.0796 -206.8865 0.1481 0.0672 0.1077
0.0398 2100 2.0621 - - - -
0.0417 2200 2.0545 - - - -
0.0436 2300 2.0382 - - - -
0.0455 2400 2.0267 - - - -
0.0474 2500 2.0167 - - - -
0.0493 2600 2.0041 - - - -
0.0512 2700 1.9902 - - - -
0.0531 2800 1.9746 - - - -
0.0550 2900 1.9650 - - - -
0.0569 3000 1.9539 -194.5440 0.1243 0.1242 0.1243
0.0588 3100 1.9401 - - - -
0.0607 3200 1.9317 - - - -
0.0626 3300 1.9181 - - - -
0.0645 3400 1.9098 - - - -
0.0664 3500 1.8983 - - - -
0.0683 3600 1.8924 - - - -
0.0702 3700 1.8806 - - - -
0.0721 3800 1.8717 - - - -
0.0740 3900 1.8591 - - - -
0.0758 4000 1.8525 -184.2026 0.1647 0.1745 0.1696
0.0777 4100 1.8416 - - - -
0.0796 4200 1.8359 - - - -
0.0815 4300 1.8256 - - - -
0.0834 4400 1.8131 - - - -
0.0853 4500 1.8063 - - - -
0.0872 4600 1.7950 - - - -
0.0891 4700 1.7846 - - - -
0.0910 4800 1.7762 - - - -
0.0929 4900 1.7620 - - - -
0.0948 5000 1.7605 -175.1685 0.1960 0.2024 0.1992
0.0967 5100 1.7481 - - - -
0.0986 5200 1.7419 - - - -
0.1005 5300 1.7301 - - - -
0.1024 5400 1.7280 - - - -
0.1043 5500 1.7131 - - - -
0.1062 5600 1.7063 - - - -
0.1081 5700 1.6959 - - - -
0.1100 5800 1.6884 - - - -
0.1119 5900 1.6801 - - - -
0.1138 6000 1.6700 -166.4924 0.2493 0.2150 0.2321
0.1157 6100 1.6637 - - - -
0.1176 6200 1.6543 - - - -
0.1195 6300 1.6451 - - - -
0.1214 6400 1.6382 - - - -
0.1233 6500 1.6278 - - - -
0.1251 6600 1.6235 - - - -
0.1270 6700 1.6150 - - - -
0.1289 6800 1.6054 - - - -
0.1308 6900 1.6007 - - - -
0.1327 7000 1.5874 -158.1013 0.2809 0.2349 0.2579
0.1346 7100 1.5824 - - - -
0.1365 7200 1.5724 - - - -
0.1384 7300 1.5669 - - - -
0.1403 7400 1.5535 - - - -
0.1422 7500 1.5450 - - - -
0.1441 7600 1.5345 - - - -
0.1460 7700 1.5340 - - - -
0.1479 7800 1.5242 - - - -
0.1498 7900 1.5181 - - - -
0.1517 8000 1.5086 -150.1032 0.2957 0.2454 0.2705
0.1536 8100 1.5007 - - - -
0.1555 8200 1.4950 - - - -
0.1574 8300 1.4829 - - - -
0.1593 8400 1.4780 - - - -
0.1612 8500 1.4737 - - - -
0.1631 8600 1.4603 - - - -
0.1650 8700 1.4510 - - - -
0.1669 8800 1.4500 - - - -
0.1688 8900 1.4408 - - - -
0.1707 9000 1.4372 -142.8462 0.3033 0.2824 0.2929
0.1726 9100 1.4270 - - - -
0.1744 9200 1.4233 - - - -
0.1763 9300 1.4135 - - - -
0.1782 9400 1.4074 - - - -
0.1801 9500 1.3981 - - - -
0.1820 9600 1.3919 - - - -
0.1839 9700 1.3844 - - - -
0.1858 9800 1.3741 - - - -
0.1877 9900 1.3685 - - - -
0.1896 10000 1.3668 -135.7081 0.3194 0.3059 0.3127
0.1915 10100 1.3568 - - - -
0.1934 10200 1.3505 - - - -
0.1953 10300 1.3433 - - - -
0.1972 10400 1.3338 - - - -
0.1991 10500 1.3295 - - - -
0.2010 10600 1.3275 - - - -
0.2029 10700 1.3149 - - - -
0.2048 10800 1.3119 - - - -
0.2067 10900 1.3055 - - - -
0.2086 11000 1.2952 -129.2064 0.3109 0.3434 0.3272
0.2105 11100 1.2920 - - - -
0.2124 11200 1.2851 - - - -
0.2143 11300 1.2769 - - - -
0.2162 11400 1.2747 - - - -
0.2181 11500 1.2686 - - - -
0.2200 11600 1.2684 - - - -
0.2219 11700 1.2582 - - - -
0.2237 11800 1.2582 - - - -
0.2256 11900 1.2479 - - - -
0.2275 12000 1.2418 -123.6261 0.3439 0.3547 0.3493
0.2294 12100 1.2400 - - - -
0.2313 12200 1.2330 - - - -
0.2332 12300 1.2288 - - - -
0.2351 12400 1.2230 - - - -
0.2370 12500 1.2164 - - - -
0.2389 12600 1.2157 - - - -
0.2408 12700 1.2166 - - - -
0.2427 12800 1.2045 - - - -
0.2446 12900 1.2035 - - - -
0.2465 13000 1.1968 -118.8691 0.3282 0.3329 0.3306
0.2484 13100 1.1942 - - - -
0.2503 13200 1.1895 - - - -
0.2522 13300 1.1843 - - - -
0.2541 13400 1.1755 - - - -
0.2560 13500 1.1756 - - - -
0.2579 13600 1.1707 - - - -
0.2598 13700 1.1637 - - - -
0.2617 13800 1.1684 - - - -
0.2636 13900 1.1628 - - - -
0.2655 14000 1.1585 -115.4122 0.3779 0.3579 0.3679
0.2674 14100 1.1602 - - - -
0.2693 14200 1.1504 - - - -
0.2712 14300 1.1483 - - - -
0.2730 14400 1.1488 - - - -
0.2749 14500 1.1392 - - - -
0.2768 14600 1.1343 - - - -
0.2787 14700 1.1363 - - - -
0.2806 14800 1.1342 - - - -
0.2825 14900 1.1327 - - - -
0.2844 15000 1.1219 -111.9139 0.3794 0.3791 0.3793
0.2863 15100 1.1246 - - - -
0.2882 15200 1.1152 - - - -
0.2901 15300 1.1196 - - - -
0.2920 15400 1.1097 - - - -
0.2939 15500 1.1067 - - - -
0.2958 15600 1.0994 - - - -
0.2977 15700 1.1077 - - - -
0.2996 15800 1.1057 - - - -
0.3015 15900 1.0949 - - - -
0.3034 16000 1.0981 -109.2994 0.3867 0.3855 0.3861
0.3053 16100 1.0933 - - - -
0.3072 16200 1.0873 - - - -
0.3091 16300 1.0851 - - - -
0.3110 16400 1.0840 - - - -
0.3129 16500 1.0831 - - - -
0.3148 16600 1.0755 - - - -
0.3167 16700 1.0733 - - - -
0.3186 16800 1.0724 - - - -
0.3205 16900 1.0698 - - - -
0.3223 17000 1.0710 -106.3769 0.4092 0.4066 0.4079
0.3242 17100 1.0699 - - - -
0.3261 17200 1.0642 - - - -
0.3280 17300 1.0576 - - - -
0.3299 17400 1.0597 - - - -
0.3318 17500 1.0572 - - - -
0.3337 17600 1.0547 - - - -
0.3356 17700 1.0502 - - - -
0.3375 17800 1.0467 - - - -
0.3394 17900 1.0485 - - - -
0.3413 18000 1.0455 -103.7698 0.4510 0.4237 0.4374
0.3432 18100 1.0433 - - - -
0.3451 18200 1.0404 - - - -
0.3470 18300 1.0397 - - - -
0.3489 18400 1.0352 - - - -
0.3508 18500 1.0318 - - - -
0.3527 18600 1.0302 - - - -
0.3546 18700 1.0330 - - - -
0.3565 18800 1.0220 - - - -
0.3584 18900 1.0223 - - - -
0.3603 19000 1.0254 -101.5743 0.4439 0.4265 0.4352
0.3622 19100 1.0186 - - - -
0.3641 19200 1.0216 - - - -
0.3660 19300 1.0152 - - - -
0.3679 19400 1.0139 - - - -
0.3698 19500 1.0125 - - - -
0.3716 19600 1.0087 - - - -
0.3735 19700 1.0045 - - - -
0.3754 19800 1.0032 - - - -
0.3773 19900 1.0013 - - - -
0.3792 20000 1.0017 -99.6613 0.4554 0.4374 0.4464
0.3811 20100 1.0007 - - - -
0.3830 20200 0.9959 - - - -
0.3849 20300 0.9965 - - - -
0.3868 20400 0.9909 - - - -
0.3887 20500 0.9902 - - - -
0.3906 20600 0.9903 - - - -
0.3925 20700 0.9927 - - - -
0.3944 20800 0.9865 - - - -
0.3963 20900 0.9843 - - - -
0.3982 21000 0.9809 -97.4922 0.4689 0.4462 0.4575
0.4001 21100 0.9801 - - - -
0.4020 21200 0.9785 - - - -
0.4039 21300 0.9718 - - - -
0.4058 21400 0.9725 - - - -
0.4077 21500 0.9705 - - - -
0.4096 21600 0.9729 - - - -
0.4115 21700 0.9714 - - - -
0.4134 21800 0.9647 - - - -
0.4153 21900 0.9623 - - - -
0.4172 22000 0.9579 -95.7813 0.4642 0.4549 0.4595
0.4191 22100 0.9553 - - - -
0.4209 22200 0.9558 - - - -
0.4228 22300 0.9584 - - - -
0.4247 22400 0.9544 - - - -
0.4266 22500 0.9520 - - - -
0.4285 22600 0.9516 - - - -
0.4304 22700 0.9543 - - - -
0.4323 22800 0.9502 - - - -
0.4342 22900 0.9477 - - - -
0.4361 23000 0.9405 -93.9238 0.4856 0.4521 0.4688
0.4380 23100 0.9448 - - - -
0.4399 23200 0.9424 - - - -
0.4418 23300 0.9369 - - - -
0.4437 23400 0.9318 - - - -
0.4456 23500 0.9342 - - - -
0.4475 23600 0.9392 - - - -
0.4494 23700 0.9358 - - - -
0.4513 23800 0.9303 - - - -
0.4532 23900 0.9306 - - - -
0.4551 24000 0.9277 -92.2427 0.4946 0.4798 0.4872
0.4570 24100 0.9267 - - - -
0.4589 24200 0.9228 - - - -
0.4608 24300 0.9239 - - - -
0.4627 24400 0.9225 - - - -
0.4646 24500 0.9169 - - - -
0.4665 24600 0.9170 - - - -
0.4684 24700 0.9195 - - - -
0.4702 24800 0.9153 - - - -
0.4721 24900 0.9138 - - - -
0.4740 25000 0.9108 -90.7635 0.4622 0.4812 0.4717
0.4759 25100 0.9133 - - - -
0.4778 25200 0.9076 - - - -
0.4797 25300 0.9081 - - - -
0.4816 25400 0.9093 - - - -
0.4835 25500 0.9037 - - - -
0.4854 25600 0.9025 - - - -
0.4873 25700 0.9058 - - - -
0.4892 25800 0.9018 - - - -
0.4911 25900 0.9014 - - - -
0.4930 26000 0.8946 -89.2562 0.4745 0.4957 0.4851
0.4949 26100 0.8982 - - - -
0.4968 26200 0.8946 - - - -
0.4987 26300 0.8941 - - - -
0.5006 26400 0.8925 - - - -
0.5025 26500 0.8947 - - - -
0.5044 26600 0.8906 - - - -
0.5063 26700 0.8895 - - - -
0.5082 26800 0.8866 - - - -
0.5101 26900 0.8840 - - - -
0.5120 27000 0.8764 -87.8039 0.5011 0.5173 0.5092
0.5139 27100 0.8859 - - - -
0.5158 27200 0.8839 - - - -
0.5177 27300 0.8794 - - - -
0.5195 27400 0.8790 - - - -
0.5214 27500 0.8788 - - - -
0.5233 27600 0.8780 - - - -
0.5252 27700 0.8749 - - - -
0.5271 27800 0.8742 - - - -
0.5290 27900 0.8700 - - - -
0.5309 28000 0.8691 -86.4419 0.4936 0.4776 0.4856
0.5328 28100 0.8747 - - - -
0.5347 28200 0.8644 - - - -
0.5366 28300 0.8673 - - - -
0.5385 28400 0.8670 - - - -
0.5404 28500 0.8638 - - - -
0.5423 28600 0.8649 - - - -
0.5442 28700 0.8629 - - - -
0.5461 28800 0.8629 - - - -
0.5480 28900 0.8591 - - - -
0.5499 29000 0.8566 -85.0408 0.4792 0.4918 0.4855
0.5518 29100 0.8588 - - - -
0.5537 29200 0.8545 - - - -
0.5556 29300 0.8534 - - - -
0.5575 29400 0.8543 - - - -
0.5594 29500 0.8534 - - - -
0.5613 29600 0.8519 - - - -
0.5632 29700 0.8486 - - - -
0.5651 29800 0.8530 - - - -
0.5670 29900 0.8477 - - - -
0.5688 30000 0.8465 -83.9435 0.4986 0.5097 0.5042
0.5707 30100 0.8425 - - - -
0.5726 30200 0.8437 - - - -
0.5745 30300 0.8430 - - - -
0.5764 30400 0.8431 - - - -
0.5783 30500 0.8424 - - - -
0.5802 30600 0.8403 - - - -
0.5821 30700 0.8347 - - - -
0.5840 30800 0.8344 - - - -
0.5859 30900 0.8348 - - - -
0.5878 31000 0.8351 -82.8113 0.4999 0.5088 0.5043
0.5897 31100 0.8362 - - - -
0.5916 31200 0.8307 - - - -
0.5935 31300 0.8315 - - - -
0.5954 31400 0.8311 - - - -
0.5973 31500 0.8305 - - - -
0.5992 31600 0.8304 - - - -
0.6011 31700 0.8277 - - - -
0.6030 31800 0.8249 - - - -
0.6049 31900 0.8262 - - - -
0.6068 32000 0.8236 -81.7389 0.4811 0.5256 0.5034
0.6087 32100 0.8209 - - - -
0.6106 32200 0.8226 - - - -
0.6125 32300 0.8207 - - - -
0.6144 32400 0.8224 - - - -
0.6163 32500 0.8163 - - - -
0.6182 32600 0.8181 - - - -
0.6200 32700 0.8147 - - - -
0.6219 32800 0.8170 - - - -
0.6238 32900 0.8156 - - - -
0.6257 33000 0.8141 -80.4979 0.5042 0.5085 0.5064
0.6276 33100 0.8088 - - - -
0.6295 33200 0.8098 - - - -
0.6314 33300 0.8133 - - - -
0.6333 33400 0.8087 - - - -
0.6352 33500 0.8086 - - - -
0.6371 33600 0.8094 - - - -
0.6390 33700 0.8054 - - - -
0.6409 33800 0.8043 - - - -
0.6428 33900 0.8035 - - - -
0.6447 34000 0.7990 -79.5726 0.4990 0.5166 0.5078
0.6466 34100 0.8035 - - - -
0.6485 34200 0.7990 - - - -
0.6504 34300 0.7996 - - - -
0.6523 34400 0.8005 - - - -
0.6542 34500 0.8000 - - - -
0.6561 34600 0.7975 - - - -
0.6580 34700 0.7959 - - - -
0.6599 34800 0.7921 - - - -
0.6618 34900 0.7916 - - - -
0.6637 35000 0.7933 -78.7884 0.5104 0.5139 0.5122
0.6656 35100 0.7908 - - - -
0.6675 35200 0.7913 - - - -
0.6693 35300 0.7921 - - - -
0.6712 35400 0.7929 - - - -
0.6731 35500 0.7915 - - - -
0.6750 35600 0.7871 - - - -
0.6769 35700 0.7836 - - - -
0.6788 35800 0.7805 - - - -
0.6807 35900 0.7870 - - - -
0.6826 36000 0.7797 -77.7400 0.5251 0.5457 0.5354

Framework Versions

  • Python: 3.11.13
  • Sentence Transformers: 5.2.2
  • Transformers: 5.1.0
  • PyTorch: 2.7.1+cu128
  • Accelerate: 1.9.0
  • Datasets: 4.0.0
  • Tokenizers: 0.22.2

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MSELoss

@inproceedings{reimers-2020-multilingual-sentence-bert,
    title = "Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2020",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/2004.09813",
}

ModernBERT Model Architecture

@misc{warner2024smarterbetterfasterlonger,
      title={Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference}, 
      author={Benjamin Warner and Antoine Chaffin and Benjamin Clavié and Orion Weller and Oskar Hallström and Said Taghadouini and Alexis Gallagher and Raja Biswas and Faisal Ladhak and Tom Aarsen and Nathan Cooper and Griffin Adams and Jeremy Howard and Iacopo Poli},
      year={2024},
      eprint={2412.13663},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2412.13663}, 
}

Model Weight Initialization

@misc{trinh2025guideguidedinitializationdistillation,
      title={GUIDE: Guided Initialization and Distillation of Embeddings}, 
      author={Khoa Trinh and Gaurav Menghani and Erik Vee},
      year={2025},
      eprint={2510.06502},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2510.06502}, 
}