---
datasets:
- newmindai/stsb-deepl-tr
base_model:
- BAAI/bge-m3
pipeline_tag: sentence-similarity
library_name: sentence-transformers
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- semantic-textual-similarity
- turkish
- multilingual
- single-task-training
license: apache-2.0
language:
- tr
metrics:
- pearson_cosine
- spearman_cosine
model-index:
- name: BGE-M3 Turkish STS-B (AnglELoss)
  results:
  - task:
      type: semantic-similarity
      name: Semantic Similarity
    dataset:
      name: stsb-eval
      type: stsb-eval
    metrics:
    - type: pearson_cosine
      value: 0.8575361568991451
      name: Pearson Cosine
    - type: spearman_cosine
      value: 0.8629008775002103
      name: Spearman Cosine
---

# Turkish Semantic Similarity Model - BGE-M3 (STS-B Fine-tuned)

This is a Turkish semantic textual similarity model fine-tuned from BAAI/bge-m3 on the Turkish STS-B dataset using **AnglELoss** (Angle-optimized Embeddings). The model measures the semantic similarity of Turkish sentence pairs, reaching a Spearman correlation of 86.29% on the Turkish STS-B benchmark.

## Overview

* **Base Model**: BAAI/bge-m3 (1024-dimensional embeddings)
* **Training Task**: Semantic Textual Similarity (STS)
* **Framework**: Sentence Transformers (v5.1.1)
* **Language**: Turkish (multilingual base model)
* **Dataset**: Turkish STS-B (stsb-deepl-tr) - 5,749 training samples
* **Loss Function**: AnglELoss (angle-optimized with pairwise angle similarity)
* **Training Status**: Completed (5 epochs)
* **Best Checkpoint**: Epoch 1.0 (Step 45) - Validation Loss: 5.682
* **Final Spearman Correlation**: 86.29%
* **Final Pearson Correlation**: 85.75%
* **Context Length**: 1024 tokens
* **Training Time**: ~8 minutes (single task)

## Performance Metrics

### Final Evaluation Results

**Best Model: Epoch 1.0 (Step 45)**

| Metric | Score |
|--------|-------|
| **Spearman Correlation** | **0.8629** (86.29%) |
| **Pearson Correlation** | **0.8575** (85.75%) |
| **Validation Loss** | **5.682** |

*Best checkpoint saved at step 45 (epoch 1.0) based on validation loss*

### Training Progression

| Step | Epoch | Training Loss | Validation Loss | Spearman | Pearson |
|------|-------|---------------|-----------------|----------|---------|
| 10 | 0.22 | 7.2492 | - | - | - |
| 15 | 0.33 | - | 6.8784 | 0.8359 | 0.8322 |
| 30 | 0.67 | 6.0701 | 5.8729 | 0.8340 | 0.8355 |
| **45** | **1.0** | **-** | **5.682** | **0.8535** | **0.8430** |
| 60 | 1.33 | 5.5751 | 5.7641 | 0.8572 | 0.8524 |
| 105 | 2.33 | 5.3594 | 6.0607 | 0.8629 | 0.8551 |
| 150 | 3.33 | 5.1111 | 6.1735 | 0.8634 | 0.8586 |
| 165 | 3.67 | - | 6.2597 | 0.8636 | 0.8571 |
| 225 | 5.0 | - | 6.5089 | 0.8629 | 0.8575 |

*Bold row indicates the best checkpoint selected by early stopping*
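
The `pearson_cosine` and `spearman_cosine` metrics above are simply the Pearson and Spearman correlations between the model's cosine similarities and the gold scores. A minimal illustration with made-up similarity values (not taken from this model's outputs):

```python
from scipy.stats import pearsonr, spearmanr

# Hypothetical model cosine similarities and gold scores in [0, 1]
predicted = [0.95, 0.80, 0.30, 0.10]
gold = [1.00, 0.76, 0.40, 0.05]

pearson_cosine = pearsonr(predicted, gold)[0]    # linear correlation
spearman_cosine = spearmanr(predicted, gold)[0]  # rank correlation
print(f"pearson={pearson_cosine:.4f}, spearman={spearman_cosine:.4f}")
```

Spearman only cares about ranking, so a model that orders pairs correctly scores 1.0 even when its raw similarity values are miscalibrated.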

## Training Infrastructure

### Hardware Configuration

* **Nodes**: 1
* **Node Name**: as07r1b16
* **GPUs per Node**: 4
* **Total GPUs**: 4
* **CPUs**: Not specified
* **Node Hours**: ~0.13 hours (8 minutes)
* **GPU Type**: NVIDIA (MareNostrum 5 ACC Partition)
* **Training Type**: Multi-GPU with DataParallel (DP)

### Training Statistics

* **Total Training Steps**: 225
* **Training Samples**: 5,749 (Turkish STS-B pairs)
* **Evaluation Samples**: 1,379 (Turkish STS-B pairs)
* **Final Average Loss**: 5.463
* **Training Time**: ~6.5 minutes (390 seconds)
* **Samples/Second**: 73.68
* **Steps/Second**: 0.577
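
As a sanity check, the throughput figure follows from the counts above (5 epochs over 5,749 training pairs in roughly 390 seconds):

```python
train_samples = 5_749
epochs = 5
train_seconds = 390

samples_per_second = train_samples * epochs / train_seconds
print(f"{samples_per_second:.1f} samples/s")  # 73.7 samples/s, close to the reported 73.68
```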

## Training Configuration

### Batch Configuration

* **Per-Device Batch Size**: 8 (per GPU)
* **Number of GPUs**: 4
* **Physical Batch Size**: 32 (4 GPUs × 8 per-device)
* **Gradient Accumulation Steps**: 4
* **Effective Batch Size**: 128 (32 physical × 4 accumulation)
* **Samples per Step**: 32
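
The batch arithmetic above can be verified directly; note that 5,749 samples at an effective batch of 128 also yields the 45 optimizer steps per epoch seen in the training log:

```python
import math

per_device_batch = 8   # per GPU
num_gpus = 4
grad_accum = 4
train_samples = 5_749

physical_batch = per_device_batch * num_gpus   # samples per forward/backward pass
effective_batch = physical_batch * grad_accum  # samples per optimizer update
steps_per_epoch = math.ceil(train_samples / effective_batch)

print(physical_batch, effective_batch, steps_per_epoch)  # 32 128 45
```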

### Loss Function

* **Type**: AnglELoss (Angle-optimized Embeddings)
* **Implementation**: CoSENT-style loss using pairwise angle similarity instead of plain cosine similarity
* **Scale**: 20.0 (temperature parameter)
* **Similarity Function**: pairwise_angle_sim
* **Task**: Regression (predicting similarity scores 0.0-1.0)

**AnglELoss Advantages**:
1. **Angle Optimization**: Optimizes the angle between embeddings rather than raw cosine similarity, mitigating the vanishing gradients of cosine near saturation
2. **Better Geometric Properties**: Encourages a more uniform distribution of embeddings on the unit hypersphere
3. **Improved Discrimination**: Better separation between similar and dissimilar pairs
4. **Temperature Scaling**: The scale parameter (20.0) controls the sharpness of the similarity distribution
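
sentence-transformers ships this loss as `losses.AnglELoss`. A minimal sketch of how the objective described above is configured (this is not the exact training script used for this model):

```python
from sentence_transformers import SentenceTransformer, losses

model = SentenceTransformer("BAAI/bge-m3")

# AnglELoss is the CoSENT objective computed with pairwise_angle_sim;
# it expects (sentence1, sentence2, score) triples with scores in [0, 1].
loss = losses.AnglELoss(model=model, scale=20.0)
```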

### Optimization

* **Optimizer**: AdamW (fused)
* **Base Learning Rate**: 5e-05
* **Learning Rate Scheduler**: Linear with warmup
* **Warmup Steps**: 89
* **Weight Decay**: 0.01
* **Max Gradient Norm**: 1.0
* **Mixed Precision**: Disabled

### Checkpointing & Evaluation

* **Save Strategy**: Every 45 steps
* **Evaluation Strategy**: Every 15 steps
* **Logging Steps**: 10
* **Save Total Limit**: 3 checkpoints
* **Best Model Selection**: Based on validation loss (lower is better)
* **Load Best Model at End**: True

## Job Details

| JobID | JobName | Account | Partition | State | Start | End | Node | GPUs | Duration |
|-------|---------|---------|-----------|-------|-------|-----|------|------|----------|
| 31478447 | bgem3-base-stsb | ehpc317 | acc | COMPLETED | Nov 3 13:59:58 | Nov 3 14:07:37 | as07r1b16 | 4 | 0.13h |

## Model Architecture

```
SentenceTransformer(
  (0): Transformer({
    'max_seq_length': 1024,
    'do_lower_case': False,
    'architecture': 'XLMRobertaModel'
  })
  (1): Pooling({
    'word_embedding_dimension': 1024,
    'pooling_mode_mean_tokens': True,
    'pooling_mode_cls_token': False,
    'pooling_mode_max_tokens': False,
    'pooling_mode_mean_sqrt_len_tokens': False,
    'pooling_mode_weightedmean_tokens': False,
    'pooling_mode_lasttoken': False,
    'include_prompt': True
  })
  (2): Normalize()
)
```
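
The Pooling and Normalize stages are simple to state: mask-aware mean pooling over token embeddings, then L2 normalization so that dot products equal cosine similarities. A numpy sketch with random stand-in token embeddings:

```python
import numpy as np

# Stand-in token embeddings for one sentence: 4 tokens, 1024 dims
rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(4, 1024))
attention_mask = np.array([1, 1, 1, 0])  # last position is padding

# (1) Pooling: mean over non-padding tokens only
mask = attention_mask[:, None]
sentence_embedding = (token_embeddings * mask).sum(axis=0) / mask.sum()

# (2) Normalize: rescale to unit L2 norm
sentence_embedding /= np.linalg.norm(sentence_embedding)
print(sentence_embedding.shape, round(float(np.linalg.norm(sentence_embedding)), 6))  # (1024,) 1.0
```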

## Training Dataset

### stsb-deepl-tr

* **Dataset**: [stsb-deepl-tr](https://huggingface.co/datasets/newmindai/stsb-deepl-tr)
* **Training Size**: 5,749 sentence pairs
* **Evaluation Size**: 1,379 sentence pairs
* **Task**: Semantic Textual Similarity (regression)
* **Score Range**: 0.0 (completely dissimilar) to 5.0 (semantically equivalent)
* **Normalized Range**: 0.0 to 1.0 (divided by 5.0 during preprocessing)
* **Average Sentence Length**: ~10-15 tokens per sentence
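
The preprocessing step is a plain rescaling; a one-line helper for reference:

```python
def normalize_score(raw: float, max_score: float = 5.0) -> float:
    """Map a raw STS-B score (0-5) onto the [0, 1] range used for training."""
    return raw / max_score

print(normalize_score(5.0))           # 1.0
print(f"{normalize_score(3.8):.2f}")  # 0.76
```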

### Data Format

Each training example consists of:
- **Sentence 1**: Turkish sentence (6-30 tokens)
- **Sentence 2**: Turkish sentence (6-26 tokens)
- **Similarity Score**: Float value 0.0-1.0 (normalized from the 0-5 scale)

### Sample Data

| Sentence 1 | Sentence 2 | Score |
|:-----------|:-----------|:------|
| Bir uçak kalkıyor. | Bir uçak havalanıyor. | 0.2 |
| Bir adam büyük bir flüt çalıyor. | Bir adam flüt çalıyor. | 0.152 |
| Bir adam pizzanın üzerine rendelenmiş peynir serpiyor. | Bir adam pişmemiş bir pizzanın üzerine rendelenmiş peynir serpiyor. | 0.152 |

## Capabilities

This model is specifically optimized for:

- **Semantic Similarity Scoring**: Predicting similarity scores between Turkish sentence pairs
- **Paraphrase Detection**: Identifying paraphrases and semantically equivalent sentences
- **Duplicate Detection**: Finding duplicate or near-duplicate Turkish content
- **Question-Answer Matching**: Matching questions with semantically similar answers
- **Document Similarity**: Comparing semantic similarity of Turkish documents
- **Sentence Clustering**: Grouping semantically similar Turkish sentences
- **Textual Entailment**: Understanding semantic relationships between sentences

## Usage

### Installation

```bash
pip install -U sentence-transformers
```

### Semantic Similarity Scoring

```python
from sentence_transformers import SentenceTransformer, util

# Load the model
model = SentenceTransformer("newmindai/bge-m3-stsb-turkish", trust_remote_code=True)

# Turkish sentence pairs
sentence_pairs = [
    ["Bir uçak kalkıyor.", "Bir uçak havalanıyor."],
    ["Bir adam flüt çalıyor.", "Bir kadın zencefil dilimliyor."],
    ["Bir çocuk sahilde oynuyor.", "Küçük bir çocuk kumda oynuyor."]
]

# Compute similarity scores
for sent1, sent2 in sentence_pairs:
    emb1 = model.encode(sent1, convert_to_tensor=True)
    emb2 = model.encode(sent2, convert_to_tensor=True)

    similarity = util.cos_sim(emb1, emb2).item()
    print(f"Similarity: {similarity:.4f}")
    print(f"  - '{sent1}'")
    print(f"  - '{sent2}'")
    print()
```

### Batch Encoding

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("newmindai/bge-m3-stsb-turkish", trust_remote_code=True)

# Turkish sentences
sentences = [
    "Bir adam çiftliğinde çalışıyor.",
    "Yaşlı bir adam çiftliğinde çalışırken bir inek onu tekmeler.",
    "Bir kedi yavrusu yürüyor.",
    "İki Hintli kadın sahilde duruyor."
]

# Encode sentences
embeddings = model.encode(sentences)
print(f"Embeddings shape: {embeddings.shape}")
# Output: (4, 1024)

# Compute similarity matrix
similarities = model.similarity(embeddings, embeddings)
print("Similarity matrix:")
print(similarities)
```

### Finding Most Similar Sentences

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("newmindai/bge-m3-stsb-turkish", trust_remote_code=True)

# Query and corpus
query = "Bir adam çiftlikte çalışıyor."
corpus = [
    "Yaşlı bir adam çiftliğinde çalışırken bir inek onu tekmeler.",
    "Bir kedi yavrusu yürüyor.",
    "Bir kadın kumu kazıyor.",
    "Kayalık bir deniz kıyısında bir adam ve köpek.",
    "İki Hintli kadın sahilde iki Hintli kızla birlikte duruyor."
]

# Encode
query_emb = model.encode(query, convert_to_tensor=True)
corpus_emb = model.encode(corpus, convert_to_tensor=True)

# Find most similar
hits = util.semantic_search(query_emb, corpus_emb, top_k=3)[0]

print(f"Query: {query}\n")
print("Top 3 most similar sentences:")
for hit in hits:
    print(f"{hit['score']:.4f}: {corpus[hit['corpus_id']]}")
```

## Training Details

### Complete Hyperparameters

| Parameter | Value |
|-----------|-------|
| Per-device train batch size | 8 |
| Number of GPUs | 4 |
| Physical batch size | 32 |
| Gradient accumulation steps | 4 |
| Effective batch size | 128 |
| Learning rate | 5e-05 |
| Weight decay | 0.01 |
| Warmup steps | 89 |
| LR scheduler | linear |
| Max gradient norm | 1.0 |
| Num train epochs | 5 |
| Save steps | 45 |
| Eval steps | 15 |
| Logging steps | 10 |
| AnglELoss scale | 20.0 |
| Batch sampler | batch_sampler |
| Load best model at end | True |
| Optimizer | adamw_torch_fused |
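
For reference, these hyperparameters map onto a `SentenceTransformerTrainingArguments` configuration roughly as follows. This is a hedged sketch, not the exact training script; the dataset split names are assumptions:

```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses
from sentence_transformers.training_args import SentenceTransformerTrainingArguments

model = SentenceTransformer("BAAI/bge-m3")
dataset = load_dataset("newmindai/stsb-deepl-tr")  # split names assumed below

args = SentenceTransformerTrainingArguments(
    output_dir="bge-m3-stsb-turkish",
    num_train_epochs=5,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    learning_rate=5e-5,
    weight_decay=0.01,
    warmup_steps=89,
    lr_scheduler_type="linear",
    max_grad_norm=1.0,
    eval_strategy="steps",
    eval_steps=15,
    save_steps=45,
    logging_steps=10,
    save_total_limit=3,
    load_best_model_at_end=True,
    optim="adamw_torch_fused",
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    loss=losses.AnglELoss(model, scale=20.0),
)
trainer.train()
```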

### Framework Versions

* **Python**: 3.10.12
* **Sentence Transformers**: 5.1.1
* **PyTorch**: 2.8.0+cu128
* **Transformers**: 4.57.0
* **CUDA**: 12.8
* **Accelerate**: 1.10.1
* **Datasets**: 4.2.0
* **Tokenizers**: 0.22.1

## Use Cases

- **Chatbot Response Matching**: Find the most semantically similar pre-defined response for user queries
- **FAQ Search**: Match user questions to the most relevant FAQ entries
- **Content Recommendation**: Recommend articles or documents with similar semantic content
- **Plagiarism Detection**: Identify semantically similar text for academic integrity checks
- **Customer Support**: Match support tickets to similar previously resolved issues
- **Document Clustering**: Group documents by semantic similarity for organization
- **Paraphrase Mining**: Automatically detect paraphrases in large Turkish text corpora
- **Semantic Search**: Build semantic search engines for Turkish content
- **Question Answering**: Match questions to semantically relevant answer candidates
- **Text Summarization**: Identify redundant sentences for summary generation

## Citation

### AnglELoss

```bibtex
@inproceedings{li-li-2024-aoe,
    title = "{A}o{E}: Angle-optimized Embeddings for Semantic Textual Similarity",
    author = "Li, Xianming and Li, Jing",
    booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    year = "2024",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.acl-long.101/",
    doi = "10.18653/v1/2024.acl-long.101"
}
```

### Sentence Transformers

```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084"
}
```

### Base Model (BGE-M3)

```bibtex
@misc{bge-m3,
    title = {BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation},
    author = {Jianlv Chen and Shitao Xiao and Peitian Zhang and Kun Luo and Defu Lian and Zheng Liu},
    year = {2024},
    eprint = {2402.03216},
    archivePrefix = {arXiv},
    primaryClass = {cs.CL}
}
```

### Dataset

```bibtex
@misc{stsb-deepl-tr,
    title = {Turkish STS-B Dataset (DeepL Translation)},
    author = {NewMind AI},
    year = {2024},
    url = {https://huggingface.co/datasets/newmindai/stsb-deepl-tr}
}
```

## License

This model is licensed under the Apache License 2.0.

## Acknowledgments

* **Base Model**: BAAI/bge-m3
* **Training Infrastructure**: MareNostrum 5 supercomputer (Barcelona Supercomputing Center)
* **Framework**: Sentence Transformers by UKP Lab
* **Dataset**: [newmindai/stsb-deepl-tr](https://huggingface.co/datasets/newmindai/stsb-deepl-tr)
* **Loss Function**: AnglELoss (Angle-optimized Embeddings)
* **Training Approach**: Single-task fine-tuning on Turkish STS-B