Commit 73b690a (verified) committed by nmmursit · Parent: 785cdee

Upload complete model with all files

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+tokenizer.json filter=lfs diff=lfs merge=lfs -text
1_Pooling/config.json ADDED
{
  "word_embedding_dimension": 1024,
  "pooling_mode_cls_token": true,
  "pooling_mode_mean_tokens": false,
  "pooling_mode_max_tokens": false,
  "pooling_mode_mean_sqrt_len_tokens": false,
  "pooling_mode_weightedmean_tokens": false,
  "pooling_mode_lasttoken": false,
  "include_prompt": true
}
README.md ADDED

---
datasets:
- newmindai/stsb-deepl-tr
base_model:
- BAAI/bge-m3
pipeline_tag: sentence-similarity
library_name: sentence-transformers
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- semantic-textual-similarity
- turkish
- multilingual
- single-task-training
license: apache-2.0
language:
- tr
metrics:
- pearson_cosine
- spearman_cosine
model-index:
- name: BGE-M3 Turkish STS-B (AnglELoss)
  results:
  - task:
      type: semantic-similarity
      name: Semantic Similarity
    dataset:
      name: stsb-eval
      type: stsb-eval
    metrics:
    - type: pearson_cosine
      value: 0.8575361568991451
      name: Pearson Cosine
    - type: spearman_cosine
      value: 0.8629008775002103
      name: Spearman Cosine
---

# Turkish Semantic Similarity Model - BGE-M3 (STS-B Fine-tuned)

This is a Turkish semantic textual similarity model fine-tuned from BAAI/bge-m3 on the Turkish STS-B dataset using **AnglELoss** (Angle-optimized Embeddings). The model measures the semantic similarity of Turkish sentence pairs, reaching a Spearman correlation of 0.863 on the Turkish STS-B benchmark.

## Overview

* **Base Model**: BAAI/bge-m3 (1024-dimensional embeddings)
* **Training Task**: Semantic Textual Similarity (STS)
* **Framework**: Sentence Transformers (v5.1.1)
* **Language**: Turkish (multilingual base model)
* **Dataset**: Turkish STS-B (stsb-deepl-tr) - 5,749 training samples
* **Loss Function**: AnglELoss (angle-optimized, with pairwise angle similarity)
* **Training Status**: Completed (5 epochs)
* **Best Checkpoint**: Epoch 1.0 (Step 45) - Validation Loss: 5.682
* **Final Spearman Correlation**: 86.29%
* **Final Pearson Correlation**: 85.75%
* **Context Length**: 1024 tokens
* **Training Time**: ~8 minutes (single task)

## Performance Metrics

### Final Evaluation Results

**Final evaluation (best checkpoint saved at epoch 1.0, step 45)**

| Metric | Score |
|--------|-------|
| **Spearman Correlation** | **0.8629** (86.29%) |
| **Pearson Correlation** | **0.8575** (85.75%) |
| **Validation Loss** | **5.682** |

*Best checkpoint saved at step 45 (epoch 1.0), selected by validation loss*

### Training Progression

| Step | Epoch | Training Loss | Validation Loss | Spearman | Pearson |
|------|-------|---------------|-----------------|----------|---------|
| 10 | 0.22 | 7.2492 | - | - | - |
| 15 | 0.33 | - | 6.8784 | 0.8359 | 0.8322 |
| 30 | 0.67 | 6.0701 | 5.8729 | 0.8340 | 0.8355 |
| **45** | **1.0** | **-** | **5.682** | **0.8535** | **0.8430** |
| 60 | 1.33 | 5.5751 | 5.7641 | 0.8572 | 0.8524 |
| 105 | 2.33 | 5.3594 | 6.0607 | 0.8629 | 0.8551 |
| 150 | 3.33 | 5.1111 | 6.1735 | 0.8634 | 0.8586 |
| 165 | 3.67 | - | 6.2597 | 0.8636 | 0.8571 |
| 225 | 5.0 | - | 6.5089 | 0.8629 | 0.8575 |

*The bold row marks the checkpoint selected as best (lowest validation loss).*

## Training Infrastructure

### Hardware Configuration

* **Nodes**: 1
* **Node Name**: as07r1b16
* **GPUs per Node**: 4
* **Total GPUs**: 4
* **CPUs**: Not specified
* **Node Hours**: ~0.13 hours (8 minutes)
* **GPU Type**: NVIDIA (MareNostrum 5 ACC Partition)
* **Training Type**: Multi-GPU with DataParallel (DP)

### Training Statistics

* **Total Training Steps**: 225
* **Training Samples**: 5,749 (Turkish STS-B pairs)
* **Evaluation Samples**: 1,379 (Turkish STS-B pairs)
* **Final Average Loss**: 5.463
* **Training Time**: ~6.5 minutes (390 seconds)
* **Samples/Second**: 73.68
* **Steps/Second**: 0.577

## Training Configuration

### Batch Configuration

* **Per-Device Batch Size**: 8 (per GPU)
* **Number of GPUs**: 4
* **Physical Batch Size**: 32 (4 GPUs × 8 per-device)
* **Gradient Accumulation Steps**: 4
* **Effective Batch Size**: 128 (32 physical × 4 accumulation)
* **Samples per Step**: 32
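The batch and step counts reported in this card are mutually consistent; a quick arithmetic check using only the numbers above:

```python
import math

per_device = 8    # per-GPU batch size
num_gpus = 4
grad_accum = 4    # gradient accumulation steps

physical = per_device * num_gpus   # samples per forward/backward pass
effective = physical * grad_accum  # samples per optimizer step

train_samples = 5749
steps_per_epoch = math.ceil(train_samples / effective)
total_steps = steps_per_epoch * 5  # 5 epochs

print(physical, effective, steps_per_epoch, total_steps)  # -> 32 128 45 225
```

This matches the best checkpoint landing at step 45 (exactly epoch 1.0) and the 225 total training steps listed under Training Statistics.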
### Loss Function

* **Type**: AnglELoss (Angle-optimized Embeddings)
* **Implementation**: Cosine similarity loss with angle optimization
* **Scale**: 20.0 (temperature parameter)
* **Similarity Function**: pairwise_angle_sim
* **Task**: Regression (predicting similarity scores 0.0-1.0)

**AnglELoss advantages**:

1. **Angle Optimization**: Optimizes the angle between embeddings rather than raw cosine similarity
2. **Better Geometric Properties**: Encourages a uniform distribution on the unit hypersphere
3. **Improved Discrimination**: Better separation between similar and dissimilar pairs
4. **Temperature Scaling**: The scale parameter (20.0) controls the sharpness of the similarity distribution

### Optimization

* **Optimizer**: AdamW (fused)
* **Base Learning Rate**: 5e-05
* **Learning Rate Scheduler**: Linear with warmup
* **Warmup Steps**: 89
* **Weight Decay**: 0.01
* **Max Gradient Norm**: 1.0
* **Mixed Precision**: Disabled

### Checkpointing & Evaluation

* **Save Strategy**: Every 45 steps
* **Evaluation Strategy**: Every 15 steps
* **Logging Steps**: 10
* **Save Total Limit**: 3 checkpoints
* **Best Model Selection**: Based on validation loss (lower is better)
* **Load Best Model at End**: True

## Job Details

| JobID | JobName | Account | Partition | State | Start | End | Node | GPUs | Duration |
|-------|---------|---------|-----------|-------|-------|-----|------|------|----------|
| 31478447 | bgem3-base-stsb | ehpc317 | acc | COMPLETED | Nov 3 13:59:58 | Nov 3 14:07:37 | as07r1b16 | 4 | 0.13h |

## Model Architecture

```
SentenceTransformer(
  (0): Transformer({
    'max_seq_length': 1024,
    'do_lower_case': False,
    'architecture': 'XLMRobertaModel'
  })
  (1): Pooling({
    'word_embedding_dimension': 1024,
    'pooling_mode_cls_token': True,
    'pooling_mode_mean_tokens': False,
    'pooling_mode_max_tokens': False,
    'pooling_mode_mean_sqrt_len_tokens': False,
    'pooling_mode_weightedmean_tokens': False,
    'pooling_mode_lasttoken': False,
    'include_prompt': True
  })
  (2): Normalize()
)
```
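Because the pipeline ends with a Normalize() module, every embedding is unit-length, so cosine similarity between two embeddings reduces to a plain dot product. A toy sketch with 3-dimensional vectors standing in for the model's 1024-dimensional embeddings (illustrative values only):

```python
import math

def normalize(v):
    """L2-normalize a vector, as the model's final Normalize() module does."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy vectors standing in for real sentence embeddings.
a = normalize([1.0, 2.0, 2.0])
b = normalize([2.0, 1.0, 2.0])

# On unit vectors, the dot product equals cosine similarity.
print(round(dot(a, b), 4))  # -> 0.8889
```

This is why the usage examples below can score pairs with plain cosine similarity without re-normalizing.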
## Training Dataset

### stsb-deepl-tr

* **Dataset**: [stsb-deepl-tr](https://huggingface.co/datasets/newmindai/stsb-deepl-tr)
* **Training Size**: 5,749 sentence pairs
* **Evaluation Size**: 1,379 sentence pairs
* **Task**: Semantic Textual Similarity (regression)
* **Score Range**: 0.0 (completely dissimilar) to 5.0 (semantically equivalent)
* **Normalized Range**: 0.0 to 1.0 (divided by 5.0 during preprocessing)
* **Average Sentence Length**: ~10-15 tokens per sentence

### Data Format

Each training example consists of:
- **Sentence 1**: Turkish sentence (6-30 tokens)
- **Sentence 2**: Turkish sentence (6-26 tokens)
- **Similarity Score**: Float value 0.0-1.0 (normalized from the 0-5 scale)
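A minimal sketch of the score normalization described above (the helper name is illustrative, not part of the dataset's API):

```python
def normalize_score(raw_score: float) -> float:
    """Map an STS-B gold score from the 0-5 scale to the 0-1 range."""
    return raw_score / 5.0

print(normalize_score(4.0))  # -> 0.8
print(normalize_score(1.0))  # -> 0.2
```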
### Sample Data

| Sentence 1 | Sentence 2 | Score |
|:-----------|:-----------|:------|
| Bir uçak kalkıyor. | Bir uçak havalanıyor. | 0.2 |
| Bir adam büyük bir flüt çalıyor. | Bir adam flüt çalıyor. | 0.152 |
| Bir adam pizzanın üzerine rendelenmiş peynir serpiyor. | Bir adam pişmemiş bir pizzanın üzerine rendelenmiş peynir serpiyor. | 0.152 |

## Capabilities

This model is specifically optimized for:

- **Semantic Similarity Scoring**: Predicting similarity scores between Turkish sentence pairs
- **Paraphrase Detection**: Identifying paraphrases and semantically equivalent sentences
- **Duplicate Detection**: Finding duplicate or near-duplicate Turkish content
- **Question-Answer Matching**: Matching questions with semantically similar answers
- **Document Similarity**: Comparing the semantic similarity of Turkish documents
- **Sentence Clustering**: Grouping semantically similar Turkish sentences
- **Textual Entailment**: Understanding semantic relationships between sentences

## Usage

### Installation

```bash
pip install -U sentence-transformers
```

### Semantic Similarity Scoring

```python
from sentence_transformers import SentenceTransformer, util

# Load the model
model = SentenceTransformer("newmindai/bge-m3-stsb-turkish", trust_remote_code=True)

# Turkish sentence pairs
sentence_pairs = [
    ["Bir uçak kalkıyor.", "Bir uçak havalanıyor."],
    ["Bir adam flüt çalıyor.", "Bir kadın zencefil dilimliyor."],
    ["Bir çocuk sahilde oynuyor.", "Küçük bir çocuk kumda oynuyor."]
]

# Compute similarity scores
for sent1, sent2 in sentence_pairs:
    emb1 = model.encode(sent1, convert_to_tensor=True)
    emb2 = model.encode(sent2, convert_to_tensor=True)

    similarity = util.cos_sim(emb1, emb2).item()
    print(f"Similarity: {similarity:.4f}")
    print(f"  - '{sent1}'")
    print(f"  - '{sent2}'")
    print()
```

### Batch Encoding

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("newmindai/bge-m3-stsb-turkish", trust_remote_code=True)

# Turkish sentences
sentences = [
    "Bir adam çiftliğinde çalışıyor.",
    "Yaşlı bir adam çiftliğinde çalışırken bir inek onu tekmeler.",
    "Bir kedi yavrusu yürüyor.",
    "İki Hintli kadın sahilde duruyor."
]

# Encode sentences
embeddings = model.encode(sentences)
print(f"Embeddings shape: {embeddings.shape}")
# Output: (4, 1024)

# Compute similarity matrix
similarities = model.similarity(embeddings, embeddings)
print("Similarity matrix:")
print(similarities)
```

### Finding Most Similar Sentences

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("newmindai/bge-m3-stsb-turkish", trust_remote_code=True)

# Query and corpus
query = "Bir adam çiftlikte çalışıyor."
corpus = [
    "Yaşlı bir adam çiftliğinde çalışırken bir inek onu tekmeler.",
    "Bir kedi yavrusu yürüyor.",
    "Bir kadın kumu kazıyor.",
    "Kayalık bir deniz kıyısında bir adam ve köpek.",
    "İki Hintli kadın sahilde iki Hintli kızla birlikte duruyor."
]

# Encode
query_emb = model.encode(query, convert_to_tensor=True)
corpus_emb = model.encode(corpus, convert_to_tensor=True)

# Find most similar
hits = util.semantic_search(query_emb, corpus_emb, top_k=3)[0]

print(f"Query: {query}\n")
print("Top 3 most similar sentences:")
for hit in hits:
    print(f"{hit['score']:.4f}: {corpus[hit['corpus_id']]}")
```

## Training Details

### Complete Hyperparameters

| Parameter | Value |
|-----------|-------|
| Per-device train batch size | 8 |
| Number of GPUs | 4 |
| Physical batch size | 32 |
| Gradient accumulation steps | 4 |
| Effective batch size | 128 |
| Learning rate | 5e-05 |
| Weight decay | 0.01 |
| Warmup steps | 89 |
| LR scheduler | linear |
| Max gradient norm | 1.0 |
| Num train epochs | 5 |
| Save steps | 45 |
| Eval steps | 15 |
| Logging steps | 10 |
| AnglELoss scale | 20.0 |
| Batch sampler | batch_sampler |
| Load best model at end | True |
| Optimizer | adamw_torch_fused |

### Framework Versions

* **Python**: 3.10.12
* **Sentence Transformers**: 5.1.1
* **PyTorch**: 2.8.0+cu128
* **Transformers**: 4.57.0
* **CUDA**: 12.8
* **Accelerate**: 1.10.1
* **Datasets**: 4.2.0
* **Tokenizers**: 0.22.1

## Use Cases

- **Chatbot Response Matching**: Find the most semantically similar pre-defined response for user queries
- **FAQ Search**: Match user questions to the most relevant FAQ entries
- **Content Recommendation**: Recommend articles or documents with similar semantic content
- **Plagiarism Detection**: Identify semantically similar text for academic integrity checks
- **Customer Support**: Match support tickets to similar previously resolved issues
- **Document Clustering**: Group documents by semantic similarity for organization
- **Paraphrase Mining**: Automatically detect paraphrases in large Turkish text corpora
- **Semantic Search**: Build semantic search engines for Turkish content
- **Question Answering**: Match questions to semantically relevant answer candidates
- **Text Summarization**: Identify redundant sentences for summary generation

## Citation

### AnglELoss

```bibtex
@inproceedings{li-li-2024-aoe,
    title = "{A}o{E}: Angle-optimized Embeddings for Semantic Textual Similarity",
    author = "Li, Xianming and Li, Jing",
    booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    year = "2024",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.acl-long.101/",
    doi = "10.18653/v1/2024.acl-long.101"
}
```

### Sentence Transformers

```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084"
}
```

### Base Model (BGE-M3)

```bibtex
@misc{bge-m3,
  title={BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation},
  author={Jianlv Chen and Shitao Xiao and Peitian Zhang and Kun Luo and Defu Lian and Zheng Liu},
  year={2024},
  eprint={2402.03216},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```

### Dataset

```bibtex
@misc{stsb-deepl-tr,
  title={Turkish STS-B Dataset (DeepL Translation)},
  author={NewMind AI},
  year={2024},
  url={https://huggingface.co/datasets/newmindai/stsb-deepl-tr}
}
```

## License

This model is licensed under the Apache 2.0 License.

## Acknowledgments

* **Base Model**: BAAI/bge-m3
* **Training Infrastructure**: MareNostrum 5 supercomputer (Barcelona Supercomputing Center)
* **Framework**: Sentence Transformers by UKP Lab
* **Dataset**: [newmindai/stsb-deepl-tr](https://huggingface.co/datasets/newmindai/stsb-deepl-tr)
* **Loss Function**: AnglELoss (Angle-optimized Embeddings)
* **Training Approach**: Single-task fine-tuning on Turkish STS-B
config.json ADDED
{
  "architectures": [
    "XLMRobertaModel"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "dtype": "float32",
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 8194,
  "model_type": "xlm-roberta",
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "output_past": true,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_version": "4.57.0",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 250002
}
config_sentence_transformers.json ADDED
{
  "__version__": {
    "sentence_transformers": "5.1.1",
    "transformers": "4.57.0",
    "pytorch": "2.8.0+cu128"
  },
  "model_type": "SentenceTransformer",
  "prompts": {
    "query": "",
    "document": ""
  },
  "default_prompt_name": null,
  "similarity_fn_name": "cosine"
}
model.safetensors ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:1fbf7c95f0da3a18ffd8b960041f9f9a95babb13bcd86e995ce3a6e7ad3a61e7
size 2271064456
modules.json ADDED
[
  {
    "idx": 0,
    "name": "0",
    "path": "",
    "type": "sentence_transformers.models.Transformer"
  },
  {
    "idx": 1,
    "name": "1",
    "path": "1_Pooling",
    "type": "sentence_transformers.models.Pooling"
  },
  {
    "idx": 2,
    "name": "2",
    "path": "2_Normalize",
    "type": "sentence_transformers.models.Normalize"
  }
]
sentence_bert_config.json ADDED
{
  "max_seq_length": 1024,
  "do_lower_case": false
}
special_tokens_map.json ADDED
{
  "bos_token": {
    "content": "<s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "cls_token": {
    "content": "<s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": {
    "content": "</s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "mask_token": {
    "content": "<mask>",
    "lstrip": true,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": {
    "content": "<pad>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "sep_token": {
    "content": "</s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "unk_token": {
    "content": "<unk>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}
tokenizer.json ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:6e3b8957de04e3a4ed42b1a11381556f9adad8d0d502b9dd071c75f626b28f40
size 17083053
tokenizer_config.json ADDED
{
  "added_tokens_decoder": {
    "0": {
      "content": "<s>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "1": {
      "content": "<pad>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "2": {
      "content": "</s>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "3": {
      "content": "<unk>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "250001": {
      "content": "<mask>",
      "lstrip": true,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "bos_token": "<s>",
  "clean_up_tokenization_spaces": true,
  "cls_token": "<s>",
  "eos_token": "</s>",
  "extra_special_tokens": {},
  "mask_token": "<mask>",
  "model_max_length": 8192,
  "pad_token": "<pad>",
  "sep_token": "</s>",
  "sp_model_kwargs": {},
  "tokenizer_class": "XLMRobertaTokenizer",
  "unk_token": "<unk>"
}