SentenceTransformer based on sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2

This is a sentence-transformers model finetuned from sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 on the statictable-pair-class dataset. It maps sentences and paragraphs to a 384-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

Model Sources

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
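The Pooling module above has pooling_mode_mean_tokens enabled, so the sentence embedding is the attention-mask-aware mean of the token embeddings. A minimal, library-independent sketch of that pooling step (toy 3-dimensional values for readability; the real embeddings are 384-dimensional):

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token embeddings, ignoring padding positions."""
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid divide-by-zero
    return summed / counts

# One sentence with 4 token slots; the last slot is padding.
tokens = np.array([[[1., 2., 3.], [3., 2., 1.], [2., 2., 2.], [9., 9., 9.]]])
mask = np.array([[1, 1, 1, 0]])
print(mean_pool(tokens, mask))  # [[2. 2. 2.]]
```

Note that the padded position is excluded from both the sum and the count, which is what `pooling_mode_mean_tokens` does inside the library.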

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("yahyaabd/paraphrase-multilingual-miniLM-L12-V2-ocl")
# Run inference
sentences = [
    'DATA HARIAN DEBIT, KETINGGIAN, DAN VOLUME AIR SUNGAI DENGAN DAERAH ALIRAN SUNGAI DI ATAS 100 KM2, TAHUN 2015',
    'Ringkasan Neraca Arus Dana, Triwulan IV, 2009, (Miliar Rupiah)',
    'Ringkasan Neraca Arus Dana, Triwulan III, 2014**), (Miliar Rupiah)',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 384]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
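Beyond the self-similarity matrix above, the same embeddings support semantic search: `model.similarity` defaults to cosine similarity for this model, so ranking documents against a query reduces to cosine scores over `model.encode` output. A minimal sketch of that ranking logic, with random vectors standing in for real model embeddings (illustrative only, not actual model output):

```python
import numpy as np

def cosine_rank(query_emb, doc_embs):
    """Rank documents by cosine similarity to a query embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    scores = d @ q
    order = np.argsort(-scores)  # best match first
    return order, scores[order]

rng = np.random.default_rng(0)
query = rng.normal(size=384)
docs = np.stack([
    query + 0.1 * rng.normal(size=384),  # near-duplicate of the query
    rng.normal(size=384),
    rng.normal(size=384),
])
order, scores = cosine_rank(query, docs)
print(order[0])  # the near-duplicate ranks first: 0
```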

Evaluation

Metrics

Information Retrieval

Metric Value
cosine_accuracy@1 0.7948
cosine_accuracy@3 0.9381
cosine_accuracy@5 0.9739
cosine_accuracy@10 0.9935
cosine_precision@1 0.7948
cosine_precision@3 0.3518
cosine_precision@5 0.2404
cosine_precision@10 0.1485
cosine_recall@1 0.6249
cosine_recall@3 0.7494
cosine_recall@5 0.7882
cosine_recall@10 0.8352
cosine_ndcg@1 0.7948
cosine_ndcg@3 0.7878
cosine_ndcg@5 0.7923
cosine_ndcg@10 0.7999
cosine_mrr@1 0.7948
cosine_mrr@3 0.861
cosine_mrr@5 0.8692
cosine_mrr@10 0.8716
cosine_map@1 0.7948
cosine_map@3 0.7436
cosine_map@5 0.7386
cosine_map@10 0.7372
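The @k metrics above follow the standard IR definitions: accuracy@k is 1 when at least one relevant document appears in the top k, precision@k is the share of the top k that is relevant, and recall@k is the share of all relevant documents retrieved within the top k. A minimal sketch for a single query (toy document ids, not the evaluation data):

```python
def metrics_at_k(ranked_ids, relevant_ids, k):
    """Accuracy@k, precision@k, recall@k for one query."""
    top_k = ranked_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    accuracy = 1.0 if hits > 0 else 0.0
    precision = hits / k
    recall = hits / len(relevant_ids)
    return accuracy, precision, recall

# Ranked doc ids for one query; docs 3 and 7 are relevant.
ranked = [3, 5, 7, 1, 2]
relevant = {3, 7}
print(metrics_at_k(ranked, relevant, 3))  # accuracy=1.0, precision=2/3, recall=1.0
```

The reported numbers are these per-query values averaged over all evaluation queries.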

Training Details

Training Dataset

statictable-pair-class

  • Dataset: statictable-pair-class at 62bf40d
  • Size: 37,302 training samples
  • Columns: query, doc, and label
  • Approximate statistics based on the first 1000 samples:
    • query: string; min 5, mean 20.23, max 50 tokens
    • doc: string; min 5, mean 25.61, max 50 tokens
    • label: int; 0: ~78.10%, 1: ~21.90%
  • Samples:
    • query: BERAPA JUMLAH PENDUDUK 15 TAHUN KE ATAS YANG BEKERJA DI TIAP PROVINSI, MENURUT STATUS PEKERJAAN (2022)?
      doc: Penduduk Berumur 15 Tahun Ke Atas yang Bekerja Menurut Provinsi dan Status Pekerjaan Utama, 2022
      label: 1
    • query: Budget kesehatan Kemenkeu Ditjen Anggaran
      doc: Persentase Rumah Tangga yang Menempati Rumah dengan Dinding Terluas Bukan Bambu/lainnya, 1993-2021
      label: 0
    • query: Cek pengeluaran makanan mingguan rata-rata warga Kalsel (2000-2021), bedakan per kelompok pengeluaran
      doc: Persentase Rumah Tangga Menurut Provinsi dan Fasilitas Tempat Buang Air Besar, 2000-2021
      label: 0
  • Loss: OnlineContrastiveLoss

Evaluation Dataset

statictable-pair-class

  • Dataset: statictable-pair-class at 62bf40d
  • Size: 37,302 evaluation samples
  • Columns: query, doc, and label
  • Approximate statistics based on the first 1000 samples:
    • query: string; min 8, mean 20.43, max 48 tokens
    • doc: string; min 5, mean 25.59, max 58 tokens
    • label: int; 0: ~75.40%, 1: ~24.60%
  • Samples:
    • query: Negara tujuan ekspor jewelry 2012-2023
      doc: Ekspor Barang Perhiasan dan Barang Berharga Menurut Negara Tujuan Utama, 2012-2023
      label: 1
    • query: Jumlah Pns Indonesia Per Masa Kerja Dan Gender 2005
      doc: Jumlah Pegawai Negeri Sipil Menurut Masa Kerja dan Jenis Kelamin, 2004 - 2023
      label: 1
    • query: Berapa rata-rata pendapatan per orang (setelah pajak) menurut golongan rumah tangga pada tahun 2000?
      doc: Angka Kematian Bayi/AKB (Infant Mortality Rate/IMR) Hasil Long Form SP2020 Menurut Provinsi/Kabupaten/Kota, 2020
      label: 0
  • Loss: OnlineContrastiveLoss

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 64
  • per_device_eval_batch_size: 64
  • warmup_ratio: 0.1
  • save_on_each_node: True
  • fp16: True
  • dataloader_num_workers: 2
  • load_best_model_at_end: True
  • eval_on_start: True
  • batch_sampler: no_duplicates
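
A rough sketch of how the non-default values above map onto `SentenceTransformerTrainingArguments` in the Sentence Transformers v3 training API (a configuration fragment; the output directory is illustrative, and all remaining arguments keep their defaults):

```python
from sentence_transformers import SentenceTransformerTrainingArguments
from sentence_transformers.training_args import BatchSamplers

args = SentenceTransformerTrainingArguments(
    output_dir="output/minilm-ocl",  # illustrative path
    eval_strategy="steps",
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    warmup_ratio=0.1,
    save_on_each_node=True,
    fp16=True,
    dataloader_num_workers=2,
    load_best_model_at_end=True,
    eval_on_start=True,
    batch_sampler=BatchSamplers.NO_DUPLICATES,
)
```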

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 64
  • per_device_eval_batch_size: 64
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 3
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: True
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: True
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 2
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: True
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: True
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: no_duplicates
  • multi_dataset_batch_sampler: proportional

Training Logs

Epoch Step Training Loss Validation Loss bps-statictable-ir_cosine_ndcg@10
0 0 - 1.5101 0.4643
0.1905 100 0.7674 0.2281 0.7461
0.3810 200 0.2224 0.1618 0.7612
0.5714 300 0.1341 0.0648 0.7620
0.7619 400 0.0865 0.0533 0.7732
0.9524 500 0.0804 0.0303 0.7596
1.1410 600 0.0398 0.0115 0.7931
1.3314 700 0.0173 0.0163 0.7919
1.5219 800 0.0204 0.0376 0.7876
1.7124 900 0.0231 0.0111 0.7887
1.9029 1000 0.0055 0.0085 0.7841
2.0914 1100 0.0136 0.0115 0.7931
2.2819 1200 0.0091 0.0041 0.7923
2.4724 1300 0.0086 0.0045 0.7977
2.6629 1400 0.0 0.0045 0.7980
2.8533 1500 0.0074 0.0051 0.7999
  • The row with the best validation bps-statictable-ir_cosine_ndcg@10 (0.7999, at epoch 2.8533, step 1500) denotes the saved checkpoint.

Framework Versions

  • Python: 3.11.11
  • Sentence Transformers: 3.4.1
  • Transformers: 4.48.3
  • PyTorch: 2.6.0+cu124
  • Accelerate: 1.3.0
  • Datasets: 3.4.1
  • Tokenizers: 0.21.1

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
  • Model size: 0.1B params (Safetensors)
  • Tensor type: F32
