Update README.md
README.md
CHANGED
@@ -13,7 +13,7 @@ pipeline_tag: text-classification
 # helizac/berturk-pair-acceptability

 This model is a fine-tuned version of `dbmdz/bert-base-turkish-cased` for classifying the acceptability of a Turkish text output given a Turkish text input.
-It was developed as part of the

 ## Model Description

@@ -111,7 +111,7 @@ output_text_2 = "Elmalar çok güzel!"
 prediction_2, confidence_2 = predict_pair_acceptability(input_text_2, output_text_2, model, tokenizer, device, MAX_LENGTH)
 print(f"Input: {input_text_2}\nOutput: {output_text_2}\nPrediction: {prediction_2} (Confidence: {confidence_2:.4f})\n")

-# Example 3: Unacceptable (grammatically poor
 input_text_3 = "Hayalindeki meslek ne büyük."
 output_text_3 = "Olmak ben istemek büyük."
 prediction_3, confidence_3 = predict_pair_acceptability(input_text_3, output_text_3, model, tokenizer, device, MAX_LENGTH)
@@ -120,7 +120,7 @@ print(f"Input: {input_text_3}\nOutput: {output_text_3}\nPrediction: {prediction_

 ## Training Data
 The model was fine-tuned on a dataset of approximately 460,000 Turkish input-output text pairs.
-"Acceptable" pairs (\~132,000) were sourced from various public Turkish NLP datasets
 "Unacceptable" pairs (\~328,000) were synthetically generated by applying rule-based corruptions (typos, toxic word injection, repetition, mismatched outputs) to the acceptable outputs.
 All pairs were truncated/padded to a maximum sequence length of 64 tokens for the combined input and output.

@@ -134,7 +134,7 @@ The stress test for this model showed:
 * (Tested on T4 GPU)

 ## Citation
-This model was developed as part of the following

 Erdi, F. (2025). MODEL ÇIKTILARININ KABUL EDİLEBİLİRLİĞİNİN DEĞERLENDİRİLMESİ (Evaluation of the Acceptability of Model Outputs). T.C Galatasaray Üniversitesi, Mühendislik ve Teknoloji Fakültesi.


 # helizac/berturk-pair-acceptability

 This model is a fine-tuned version of `dbmdz/bert-base-turkish-cased` for classifying the acceptability of a Turkish text output given a Turkish text input.
+It was developed as part of the "Evaluation of the Acceptability of Model Outputs" (May 2025).

 ## Model Description


 prediction_2, confidence_2 = predict_pair_acceptability(input_text_2, output_text_2, model, tokenizer, device, MAX_LENGTH)
 print(f"Input: {input_text_2}\nOutput: {output_text_2}\nPrediction: {prediction_2} (Confidence: {confidence_2:.4f})\n")

+# Example 3: Unacceptable (grammatically poor)
 input_text_3 = "Hayalindeki meslek ne büyük."
 output_text_3 = "Olmak ben istemek büyük."
 prediction_3, confidence_3 = predict_pair_acceptability(input_text_3, output_text_3, model, tokenizer, device, MAX_LENGTH)


 ## Training Data
 The model was fine-tuned on a dataset of approximately 460,000 Turkish input-output text pairs.
+"Acceptable" pairs (\~132,000) were sourced from various public Turkish NLP datasets
 "Unacceptable" pairs (\~328,000) were synthetically generated by applying rule-based corruptions (typos, toxic word injection, repetition, mismatched outputs) to the acceptable outputs.
 All pairs were truncated/padded to a maximum sequence length of 64 tokens for the combined input and output.
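
The rule-based corruptions named in the Training Data section (typos, repetition, mismatched outputs) could be sketched roughly as below. This is a minimal illustration under stated assumptions, not the code used to build the dataset: the function names (`add_typo`, `repeat_word`, `mismatch_output`, `make_unacceptable`), the specific rules (adjacent-character swap, single-word duplication), and the sample pairs are all hypothetical. Toxic word injection is left out because the README does not describe the word list.

```python
import random


def add_typo(text, rng):
    """Swap two adjacent characters at a random position (assumed typo rule)."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    chars = list(text)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def repeat_word(text, rng):
    """Duplicate one randomly chosen word in place (assumed repetition rule)."""
    words = text.split()
    if not words:
        return text
    i = rng.randrange(len(words))
    return " ".join(words[: i + 1] + [words[i]] + words[i + 1:])


def mismatch_output(index, outputs, rng):
    """Return the output that belongs to a different pair."""
    if len(outputs) < 2:
        return outputs[index]
    j = rng.randrange(len(outputs) - 1)
    if j >= index:
        j += 1  # skip over the pair's own output
    return outputs[j]


def make_unacceptable(pairs, seed=0):
    """Corrupt each acceptable (input, output) pair with one random rule.

    Returns (input, corrupted_output, rule_name) triples. The input text
    is never modified; only the output side is corrupted.
    """
    rng = random.Random(seed)
    outputs = [out for _, out in pairs]
    corrupted = []
    for idx, (inp, out) in enumerate(pairs):
        rule = rng.choice(["typo", "repetition", "mismatch"])
        if rule == "typo":
            new_out = add_typo(out, rng)
        elif rule == "repetition":
            new_out = repeat_word(out, rng)
        else:
            new_out = mismatch_output(idx, outputs, rng)
        corrupted.append((inp, new_out, rule))
    return corrupted


if __name__ == "__main__":
    # Hypothetical acceptable pairs, for illustration only.
    sample_pairs = [
        ("Bugün hava nasıl?", "Hava çok güzel."),
        ("Elmaları sever misin?", "Elmalar çok güzel!"),
    ]
    for inp, out, rule in make_unacceptable(sample_pairs):
        print(f"[{rule}] {inp} -> {out}")
```

Note that the adjacent-character swap can leave the text unchanged when the two characters are identical, so a real pipeline would likely need to check that each corrupted output actually differs from the original.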


 * (Tested on T4 GPU)

 ## Citation
+This model was developed as part of the following:

 Erdi, F. (2025). MODEL ÇIKTILARININ KABUL EDİLEBİLİRLİĞİNİN DEĞERLENDİRİLMESİ (Evaluation of the Acceptability of Model Outputs). T.C Galatasaray Üniversitesi, Mühendislik ve Teknoloji Fakültesi.
