Text Classification
Transformers
PyTorch
Safetensors
xlm-roberta
genre
text-genre
TajaKuzmanPungersek committed on
Commit c9cb98d · verified · 1 Parent(s): c09e570

Update README.md

  1. README.md +5 -3
README.md CHANGED
@@ -196,9 +196,11 @@ For language-specific results, see [the AGILE benchmark](https://github.com/Taja
  An example of preparing data for genre identification and post-processing of the results can be found [here](https://github.com/TajaKuzman/Applying-GENRE-on-MaCoCu-bilingual) where we applied X-GENRE classifier to the English part of [MaCoCu](https://macocu.eu/) parallel corpora.
 
  For reliable results, genre classifier should be applied to documents of sufficient length (the rule of thumb is at least 75 words).
- It is advised that the predictions, predicted with confidence higher than 0.9, are not used. Furthermore, the label "Other" can be used as another indicator of low confidence of the predictions, as it often indicates that the text does not have enough features of any genre, and these predictions can be discarded as well.
-
- After proposed post-processing (removal of low-confidence predictions, labels "Other" and in this specific case also label "Forum"), the performance on the MaCoCu data based on manual inspection reached macro and micro F1 of 0.92.
+ Predictions with confidence scores below 0.8 should not be used.
+ In our experience annotating large web corpora in various European languages, this occurs in approximately 4% of cases.
+ We label these instances as "Mix."
+ Furthermore, the label "Other" can be used as another indicator of low confidence of the predictions,
+ as it often indicates that the text does not have enough features of any genre, and these predictions can be discarded as well.
 
 
  ### Use examples
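
The post-processing guidance added in this commit (at least 75 words per document, predictions below 0.8 confidence relabeled as "Mix", "Other" optionally discarded) could be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `postprocess`, the constants, and the choice to return `None` for discarded or too-short texts are all assumptions made for the example. Word counting by whitespace splitting is also an assumption.

```python
from typing import Optional

# Thresholds taken from the README text; the constant names are hypothetical.
MIN_WORDS = 75        # rule of thumb for sufficient document length
MIN_CONFIDENCE = 0.8  # predictions below this score should not be used


def postprocess(text: str, label: str, score: float) -> Optional[str]:
    """Return the genre label to keep, "Mix" for a low-confidence
    prediction, or None when the prediction should be discarded."""
    # Assumption: words are counted by splitting on whitespace.
    if len(text.split()) < MIN_WORDS:
        return None  # too short for a reliable prediction
    if score < MIN_CONFIDENCE:
        return "Mix"  # low-confidence predictions are labeled "Mix"
    if label == "Other":
        return None  # "Other" is treated as a further low-confidence signal
    return label
```

Here `label` and `score` are assumed to come from whatever classifier wrapper is in use, for example the top prediction for one document. Whether to drop "Other" or keep it depends on the downstream task, since the README only says these predictions can be discarded.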