---
datasets:
- newmindai/stsb-deepl-tr
base_model:
- BAAI/bge-m3
pipeline_tag: sentence-similarity
library_name: sentence-transformers
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- semantic-textual-similarity
- turkish
- multilingual
- single-task-training
license: apache-2.0
language:
- tr
metrics:
- pearson_cosine
- spearman_cosine
model-index:
- name: BGE-M3 Turkish STS-B (AnglELoss)
  results:
  - task:
      type: semantic-similarity
      name: Semantic Similarity
    dataset:
      name: stsb-eval
      type: stsb-eval
    metrics:
    - type: pearson_cosine
      value: 0.8575361568991451
      name: Pearson Cosine
    - type: spearman_cosine
      value: 0.8629008775002103
      name: Spearman Cosine
---

# Turkish Semantic Similarity Model - BGE-M3 (STS-B Fine-tuned)

This is a Turkish semantic textual similarity model fine-tuned from BAAI/bge-m3 on the Turkish STS-B dataset using **AnglELoss** (Angle-optimized Embeddings). The model measures the semantic similarity of Turkish sentence pairs, reaching a Spearman correlation of 86.29% on the Turkish STS-B benchmark.

## Overview

* **Base Model**: BAAI/bge-m3 (1024-dimensional embeddings)
* **Training Task**: Semantic Textual Similarity (STS)
* **Framework**: Sentence Transformers (v5.1.1)
* **Language**: Turkish (multilingual base model)
* **Dataset**: Turkish STS-B (stsb-deepl-tr) - 5,749 training samples
* **Loss Function**: AnglELoss (angle-optimized with pairwise angle similarity)
* **Training Status**: Completed (5 epochs)
* **Best Checkpoint**: Epoch 1.0 (Step 45) - Validation Loss: 5.682
* **Final Spearman Correlation**: 86.29%
* **Final Pearson Correlation**: 85.75%
* **Context Length**: 1024 tokens
* **Training Time**: ~8 minutes (single task)

## Performance Metrics

### Final Evaluation Results

**Best Model: Epoch 1.0 (Step 45)**

| Metric | Score |
|--------|-------|
| **Spearman Correlation** | **0.8629** (86.29%) |
| **Pearson Correlation** | **0.8575** (85.75%) |
| **Validation Loss** | **5.682** |

*Best checkpoint saved at step 45 (epoch 1.0) based on validation loss*

### Training Progression

| Step | Epoch | Training Loss | Validation Loss | Spearman | Pearson |
|------|-------|---------------|-----------------|----------|---------|
| 10 | 0.22 | 7.2492 | - | - | - |
| 15 | 0.33 | - | 6.8784 | 0.8359 | 0.8322 |
| 30 | 0.67 | 6.0701 | 5.8729 | 0.8340 | 0.8355 |
| **45** | **1.0** | **-** | **5.682** | **0.8535** | **0.8430** |
| 60 | 1.33 | 5.5751 | 5.7641 | 0.8572 | 0.8524 |
| 105 | 2.33 | 5.3594 | 6.0607 | 0.8629 | 0.8551 |
| 150 | 3.33 | 5.1111 | 6.1735 | 0.8634 | 0.8586 |
| 165 | 3.67 | - | 6.2597 | 0.8636 | 0.8571 |
| 225 | 5.0 | - | 6.5089 | 0.8629 | 0.8575 |

*Bold row indicates the best checkpoint selected by early stopping*
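
The `pearson_cosine` and `spearman_cosine` metrics above are simply the Pearson and Spearman correlations between the model's cosine similarities and the gold scores. A minimal illustration with made-up similarity values (not taken from this model's outputs):

```python
from scipy.stats import pearsonr, spearmanr

# Hypothetical model cosine similarities and gold scores in [0, 1]
predicted = [0.95, 0.80, 0.30, 0.10]
gold = [1.00, 0.76, 0.40, 0.05]

pearson_cosine = pearsonr(predicted, gold)[0]    # linear correlation
spearman_cosine = spearmanr(predicted, gold)[0]  # rank correlation
print(f"pearson={pearson_cosine:.4f}, spearman={spearman_cosine:.4f}")
```

Spearman only cares about ranking, so a model that orders pairs correctly scores 1.0 even when its raw similarity values are miscalibrated.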

## Training Infrastructure

### Hardware Configuration

* **Nodes**: 1
* **Node Name**: as07r1b16
* **GPUs per Node**: 4
* **Total GPUs**: 4
* **CPUs**: Not specified
* **Node Hours**: ~0.13 hours (8 minutes)
* **GPU Type**: NVIDIA (MareNostrum 5 ACC Partition)
* **Training Type**: Multi-GPU with DataParallel (DP)

### Training Statistics

* **Total Training Steps**: 225
* **Training Samples**: 5,749 (Turkish STS-B pairs)
* **Evaluation Samples**: 1,379 (Turkish STS-B pairs)
* **Final Average Loss**: 5.463
* **Training Time**: ~6.5 minutes (390 seconds)
* **Samples/Second**: 73.68
* **Steps/Second**: 0.577
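
As a sanity check, the throughput figure follows from the counts above (5 epochs over 5,749 training pairs in roughly 390 seconds):

```python
train_samples = 5_749
epochs = 5
train_seconds = 390

samples_per_second = train_samples * epochs / train_seconds
print(f"{samples_per_second:.1f} samples/s")  # 73.7 samples/s, close to the reported 73.68
```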

## Training Configuration

### Batch Configuration

* **Per-Device Batch Size**: 8 (per GPU)
* **Number of GPUs**: 4
* **Physical Batch Size**: 32 (4 GPUs × 8 per-device)
* **Gradient Accumulation Steps**: 4
* **Effective Batch Size**: 128 (32 physical × 4 accumulation)
* **Samples per Step**: 32
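
The batch arithmetic above can be verified directly; note that 5,749 samples at an effective batch of 128 also yields the 45 optimizer steps per epoch seen in the training log:

```python
import math

per_device_batch = 8   # per GPU
num_gpus = 4
grad_accum = 4
train_samples = 5_749

physical_batch = per_device_batch * num_gpus   # samples per forward/backward pass
effective_batch = physical_batch * grad_accum  # samples per optimizer update
steps_per_epoch = math.ceil(train_samples / effective_batch)

print(physical_batch, effective_batch, steps_per_epoch)  # 32 128 45
```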

### Loss Function

* **Type**: AnglELoss (Angle-optimized Embeddings)
* **Implementation**: CoSENT-style loss using pairwise angle similarity instead of plain cosine similarity
* **Scale**: 20.0 (temperature parameter)
* **Similarity Function**: pairwise_angle_sim
* **Task**: Regression (predicting similarity scores 0.0-1.0)

**AnglELoss Advantages**:
1. **Angle Optimization**: Optimizes the angle between embeddings rather than raw cosine similarity, mitigating the vanishing gradients of cosine near saturation
2. **Better Geometric Properties**: Encourages a more uniform distribution of embeddings on the unit hypersphere
3. **Improved Discrimination**: Better separation between similar and dissimilar pairs
4. **Temperature Scaling**: The scale parameter (20.0) controls the sharpness of the similarity distribution
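
sentence-transformers ships this loss as `losses.AnglELoss`. A minimal sketch of how the objective described above is configured (this is not the exact training script used for this model):

```python
from sentence_transformers import SentenceTransformer, losses

model = SentenceTransformer("BAAI/bge-m3")

# AnglELoss is the CoSENT objective computed with pairwise_angle_sim;
# it expects (sentence1, sentence2, score) triples with scores in [0, 1].
loss = losses.AnglELoss(model=model, scale=20.0)
```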

### Optimization

* **Optimizer**: AdamW (fused)
* **Base Learning Rate**: 5e-05
* **Learning Rate Scheduler**: Linear with warmup
* **Warmup Steps**: 89
* **Weight Decay**: 0.01
* **Max Gradient Norm**: 1.0
* **Mixed Precision**: Disabled

### Checkpointing & Evaluation

* **Save Strategy**: Every 45 steps
* **Evaluation Strategy**: Every 15 steps
* **Logging Steps**: 10
* **Save Total Limit**: 3 checkpoints
* **Best Model Selection**: Based on validation loss (lower is better)
* **Load Best Model at End**: True

## Job Details

| JobID | JobName | Account | Partition | State | Start | End | Node | GPUs | Duration |
|-------|---------|---------|-----------|-------|-------|-----|------|------|----------|
| 31478447 | bgem3-base-stsb | ehpc317 | acc | COMPLETED | Nov 3 13:59:58 | Nov 3 14:07:37 | as07r1b16 | 4 | 0.13h |

## Model Architecture

```
SentenceTransformer(
  (0): Transformer({
    'max_seq_length': 1024,
    'do_lower_case': False,
    'architecture': 'XLMRobertaModel'
  })
  (1): Pooling({
    'word_embedding_dimension': 1024,
    'pooling_mode_mean_tokens': True,
    'pooling_mode_cls_token': False,
    'pooling_mode_max_tokens': False,
    'pooling_mode_mean_sqrt_len_tokens': False,
    'pooling_mode_weightedmean_tokens': False,
    'pooling_mode_lasttoken': False,
    'include_prompt': True
  })
  (2): Normalize()
)
```
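
The Pooling and Normalize stages are simple to state: mask-aware mean pooling over token embeddings, then L2 normalization so that dot products equal cosine similarities. A numpy sketch with random stand-in token embeddings:

```python
import numpy as np

# Stand-in token embeddings for one sentence: 4 tokens, 1024 dims
rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(4, 1024))
attention_mask = np.array([1, 1, 1, 0])  # last position is padding

# (1) Pooling: mean over non-padding tokens only
mask = attention_mask[:, None]
sentence_embedding = (token_embeddings * mask).sum(axis=0) / mask.sum()

# (2) Normalize: rescale to unit L2 norm
sentence_embedding /= np.linalg.norm(sentence_embedding)
print(sentence_embedding.shape, round(float(np.linalg.norm(sentence_embedding)), 6))  # (1024,) 1.0
```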

## Training Dataset

### stsb-deepl-tr

* **Dataset**: [stsb-deepl-tr](https://huggingface.co/datasets/newmindai/stsb-deepl-tr)
* **Training Size**: 5,749 sentence pairs
* **Evaluation Size**: 1,379 sentence pairs
* **Task**: Semantic Textual Similarity (regression)
* **Score Range**: 0.0 (completely dissimilar) to 5.0 (semantically equivalent)
* **Normalized Range**: 0.0 to 1.0 (divided by 5.0 during preprocessing)
* **Average Sentence Length**: ~10-15 tokens per sentence
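
The preprocessing step is a plain rescaling; a one-line helper for reference:

```python
def normalize_score(raw: float, max_score: float = 5.0) -> float:
    """Map a raw STS-B score (0-5) onto the [0, 1] range used for training."""
    return raw / max_score

print(normalize_score(5.0))           # 1.0
print(f"{normalize_score(3.8):.2f}")  # 0.76
```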

### Data Format

Each training example consists of:
- **Sentence 1**: Turkish sentence (6-30 tokens)
- **Sentence 2**: Turkish sentence (6-26 tokens)
- **Similarity Score**: Float value 0.0-1.0 (normalized from the 0-5 scale)

### Sample Data

| Sentence 1 | Sentence 2 | Score |
|:-----------|:-----------|:------|
| Bir uçak kalkıyor. | Bir uçak havalanıyor. | 0.2 |
| Bir adam büyük bir flüt çalıyor. | Bir adam flüt çalıyor. | 0.152 |
| Bir adam pizzanın üzerine rendelenmiş peynir serpiyor. | Bir adam pişmemiş bir pizzanın üzerine rendelenmiş peynir serpiyor. | 0.152 |

## Capabilities

This model is specifically optimized for:

- **Semantic Similarity Scoring**: Predicting similarity scores between Turkish sentence pairs
- **Paraphrase Detection**: Identifying paraphrases and semantically equivalent sentences
- **Duplicate Detection**: Finding duplicate or near-duplicate Turkish content
- **Question-Answer Matching**: Matching questions with semantically similar answers
- **Document Similarity**: Comparing semantic similarity of Turkish documents
- **Sentence Clustering**: Grouping semantically similar Turkish sentences
- **Textual Entailment**: Understanding semantic relationships between sentences

## Usage

### Installation

```bash
pip install -U sentence-transformers
```

### Semantic Similarity Scoring

```python
from sentence_transformers import SentenceTransformer, util

# Load the model
model = SentenceTransformer("newmindai/bge-m3-stsb-turkish", trust_remote_code=True)

# Turkish sentence pairs
sentence_pairs = [
    ["Bir uçak kalkıyor.", "Bir uçak havalanıyor."],
    ["Bir adam flüt çalıyor.", "Bir kadın zencefil dilimliyor."],
    ["Bir çocuk sahilde oynuyor.", "Küçük bir çocuk kumda oynuyor."]
]

# Compute similarity scores
for sent1, sent2 in sentence_pairs:
    emb1 = model.encode(sent1, convert_to_tensor=True)
    emb2 = model.encode(sent2, convert_to_tensor=True)

    similarity = util.cos_sim(emb1, emb2).item()
    print(f"Similarity: {similarity:.4f}")
    print(f"  - '{sent1}'")
    print(f"  - '{sent2}'")
    print()
```

### Batch Encoding

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("newmindai/bge-m3-stsb-turkish", trust_remote_code=True)

# Turkish sentences
sentences = [
    "Bir adam çiftliğinde çalışıyor.",
    "Yaşlı bir adam çiftliğinde çalışırken bir inek onu tekmeler.",
    "Bir kedi yavrusu yürüyor.",
    "İki Hintli kadın sahilde duruyor."
]

# Encode sentences
embeddings = model.encode(sentences)
print(f"Embeddings shape: {embeddings.shape}")
# Output: (4, 1024)

# Compute similarity matrix
similarities = model.similarity(embeddings, embeddings)
print("Similarity matrix:")
print(similarities)
```

### Finding Most Similar Sentences

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("newmindai/bge-m3-stsb-turkish", trust_remote_code=True)

# Query and corpus
query = "Bir adam çiftlikte çalışıyor."
corpus = [
    "Yaşlı bir adam çiftliğinde çalışırken bir inek onu tekmeler.",
    "Bir kedi yavrusu yürüyor.",
    "Bir kadın kumu kazıyor.",
    "Kayalık bir deniz kıyısında bir adam ve köpek.",
    "İki Hintli kadın sahilde iki Hintli kızla birlikte duruyor."
]

# Encode
query_emb = model.encode(query, convert_to_tensor=True)
corpus_emb = model.encode(corpus, convert_to_tensor=True)

# Find most similar
hits = util.semantic_search(query_emb, corpus_emb, top_k=3)[0]

print(f"Query: {query}\n")
print("Top 3 most similar sentences:")
for hit in hits:
    print(f"{hit['score']:.4f}: {corpus[hit['corpus_id']]}")
```

## Training Details

### Complete Hyperparameters

| Parameter | Value |
|-----------|-------|
| Per-device train batch size | 8 |
| Number of GPUs | 4 |
| Physical batch size | 32 |
| Gradient accumulation steps | 4 |
| Effective batch size | 128 |
| Learning rate | 5e-05 |
| Weight decay | 0.01 |
| Warmup steps | 89 |
| LR scheduler | linear |
| Max gradient norm | 1.0 |
| Num train epochs | 5 |
| Save steps | 45 |
| Eval steps | 15 |
| Logging steps | 10 |
| AnglELoss scale | 20.0 |
| Batch sampler | batch_sampler |
| Load best model at end | True |
| Optimizer | adamw_torch_fused |
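
For reference, these hyperparameters map onto a `SentenceTransformerTrainingArguments` configuration roughly as follows. This is a hedged sketch, not the exact training script; the dataset split names are assumptions:

```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses
from sentence_transformers.training_args import SentenceTransformerTrainingArguments

model = SentenceTransformer("BAAI/bge-m3")
dataset = load_dataset("newmindai/stsb-deepl-tr")  # split names assumed below

args = SentenceTransformerTrainingArguments(
    output_dir="bge-m3-stsb-turkish",
    num_train_epochs=5,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    learning_rate=5e-5,
    weight_decay=0.01,
    warmup_steps=89,
    lr_scheduler_type="linear",
    max_grad_norm=1.0,
    eval_strategy="steps",
    eval_steps=15,
    save_steps=45,
    logging_steps=10,
    save_total_limit=3,
    load_best_model_at_end=True,
    optim="adamw_torch_fused",
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    loss=losses.AnglELoss(model, scale=20.0),
)
trainer.train()
```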

### Framework Versions

* **Python**: 3.10.12
* **Sentence Transformers**: 5.1.1
* **PyTorch**: 2.8.0+cu128
* **Transformers**: 4.57.0
* **CUDA**: 12.8
* **Accelerate**: 1.10.1
* **Datasets**: 4.2.0
* **Tokenizers**: 0.22.1

## Use Cases

- **Chatbot Response Matching**: Find the most semantically similar pre-defined response for user queries
- **FAQ Search**: Match user questions to the most relevant FAQ entries
- **Content Recommendation**: Recommend articles or documents with similar semantic content
- **Plagiarism Detection**: Identify semantically similar text for academic integrity checks
- **Customer Support**: Match support tickets to similar previously resolved issues
- **Document Clustering**: Group documents by semantic similarity for organization
- **Paraphrase Mining**: Automatically detect paraphrases in large Turkish text corpora
- **Semantic Search**: Build semantic search engines for Turkish content
- **Question Answering**: Match questions to semantically relevant answer candidates
- **Text Summarization**: Identify redundant sentences for summary generation

## Citation

### AnglELoss

```bibtex
@inproceedings{li-li-2024-aoe,
    title = "{A}o{E}: Angle-optimized Embeddings for Semantic Textual Similarity",
    author = "Li, Xianming and Li, Jing",
    booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    year = "2024",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.acl-long.101/",
    doi = "10.18653/v1/2024.acl-long.101"
}
```

### Sentence Transformers

```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084"
}
```

### Base Model (BGE-M3)

```bibtex
@misc{bge-m3,
    title = {BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation},
    author = {Jianlv Chen and Shitao Xiao and Peitian Zhang and Kun Luo and Defu Lian and Zheng Liu},
    year = {2024},
    eprint = {2402.03216},
    archivePrefix = {arXiv},
    primaryClass = {cs.CL}
}
```

### Dataset

```bibtex
@misc{stsb-deepl-tr,
    title = {Turkish STS-B Dataset (DeepL Translation)},
    author = {NewMind AI},
    year = {2024},
    url = {https://huggingface.co/datasets/newmindai/stsb-deepl-tr}
}
```

## License

This model is licensed under the Apache License 2.0.

## Acknowledgments

* **Base Model**: BAAI/bge-m3
* **Training Infrastructure**: MareNostrum 5 supercomputer (Barcelona Supercomputing Center)
* **Framework**: Sentence Transformers by UKP Lab
* **Dataset**: [newmindai/stsb-deepl-tr](https://huggingface.co/datasets/newmindai/stsb-deepl-tr)
* **Loss Function**: AnglELoss (Angle-optimized Embeddings)
* **Training Approach**: Single-task fine-tuning on Turkish STS-B