An experimental text classification model fine-tuned from microsoft/deberta-v3-base for Cockatoo.
This model is licensed under the Apache-2.0 license.
Available Labels:
"id2label": {
"0": "scam",
"1": "violence",
"2": "harassment",
"3": "hate_speech",
"4": "toxicity",
"5": "obscenity"
}
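As a rough illustration of how this mapping might be used, the sketch below turns one row of raw model outputs (logits) into per-label scores. It assumes the model is multi-label, so each logit is passed through an independent sigmoid; that is an inference from the per-label thresholds in this card, not something the card states. The function name `logits_to_scores` and the example logits are hypothetical.

```python
import math

# Label mapping copied from the model card, with integer keys for indexing.
ID2LABEL = {
    0: "scam",
    1: "violence",
    2: "harassment",
    3: "hate_speech",
    4: "toxicity",
    5: "obscenity",
}


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def logits_to_scores(logits: list[float]) -> dict[str, float]:
    """Map one row of logits to {label: score}.

    Assumption: multi-label head, so each label gets its own sigmoid
    instead of a softmax across labels.
    """
    if len(logits) != len(ID2LABEL):
        raise ValueError(f"expected {len(ID2LABEL)} logits, got {len(logits)}")
    return {ID2LABEL[i]: sigmoid(x) for i, x in enumerate(logits)}


# Hypothetical logits, for illustration only (not real model output).
scores = logits_to_scores([2.0, -3.0, 0.0, -1.5, 0.5, -4.0])
```

The resulting dictionary is keyed by label name, so it can be compared directly against a per-label threshold table.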
Constellation One achieves near-SOTA performance within its weight class and is strongest at detecting scams and harassment.
With the default thresholds, recall is very high (~0.9) in all categories. After tuning the thresholds, recall drops to ~0.81, but F1 rises to ~0.74.
Default thresholds:
LABEL_THRESHOLDS = {
'scam': 0.5,
'violence': 0.5,
'harassment': 0.5,
'hate_speech': 0.5,
'toxicity': 0.5,
'obscenity': 0.5
}
Tuned thresholds:
LABEL_THRESHOLDS = {
'scam': 0.60,
'violence': 0.73,
'harassment': 0.70,
'hate_speech': 0.80,
'toxicity': 0.75,
'obscenity': 0.85
}
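A minimal sketch of how per-label thresholds like these could be applied to model scores. The function `flag_labels`, the example scores, and the choice of `>=` (rather than `>`) for the comparison are assumptions made for illustration; the actual inference server may behave differently.

```python
# Second (higher) set of thresholds from the model card.
LABEL_THRESHOLDS = {
    "scam": 0.60,
    "violence": 0.73,
    "harassment": 0.70,
    "hate_speech": 0.80,
    "toxicity": 0.75,
    "obscenity": 0.85,
}

# First set from the model card: 0.5 for every label.
DEFAULT_THRESHOLDS = {label: 0.5 for label in LABEL_THRESHOLDS}


def flag_labels(scores: dict[str, float],
                thresholds: dict[str, float] = LABEL_THRESHOLDS) -> list[str]:
    """Return the labels whose score meets or exceeds its threshold.

    Labels missing from `scores` are treated as 0.0 and never flagged.
    """
    return [
        label
        for label, cutoff in thresholds.items()
        if scores.get(label, 0.0) >= cutoff
    ]


# Hypothetical per-label scores in [0, 1], for illustration only.
example_scores = {
    "scam": 0.91,
    "violence": 0.10,
    "harassment": 0.72,
    "hate_speech": 0.79,
    "toxicity": 0.75,
    "obscenity": 0.40,
}

flagged = flag_labels(example_scores)
flagged_default = flag_labels(example_scores, DEFAULT_THRESHOLDS)
```

In this example the higher thresholds drop `hate_speech` (0.79 is below 0.80) while the 0.5 thresholds keep it, which mirrors the recall-for-precision trade described in the card.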
Training/inference server: https://github.com/DominicTWHV/Cockatoo_ML_Training/
Training Metrics: https://cockatoo.dev/ml-training.html
| Dataset | License | Link |
|---|---|---|
| Phishing Dataset | MIT | Hugging Face |
| Measuring Hate Speech | CC-BY-4.0 | Hugging Face |
| Tweet Eval (SemEval-2019) | [See Citation]* | Hugging Face |
| Toxic Chat | CC-BY-NC-4.0 | Hugging Face |
| Jigsaw Toxicity | Apache-2.0 | Hugging Face |
| Text Moderation Multilingual | Apache-2.0 | Hugging Face |
@article{kennedy2020constructing,
title={Constructing interval variables via faceted Rasch measurement and multitask deep learning: a hate speech application},
author={Kennedy, Chris J and Bacon, Geoff and Sahn, Alexander and von Vacano, Claudia},
journal={arXiv preprint arXiv:2009.10277},
year={2020}
}
@inproceedings{basile-etal-2019-semeval,
title = "{S}em{E}val-2019 Task 5: Multilingual Detection of Hate Speech Against Immigrants and Women in {T}witter",
author = "Basile, Valerio and Bosco, Cristina and Fersini, Elisabetta and Nozza, Debora and Patti, Viviana and Rangel Pardo, Francisco Manuel and Rosso, Paolo and Sanguinetti, Manuela",
booktitle = "Proceedings of the 13th International Workshop on Semantic Evaluation",
year = "2019",
address = "Minneapolis, Minnesota, USA",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/S19-2007",
doi = "10.18653/v1/S19-2007",
pages = "54--63"
}
@misc{lin2023toxicchat,
title={ToxicChat: Unveiling Hidden Challenges of Toxicity Detection in Real-World User-AI Conversation},
author={Zi Lin and Zihan Wang and Yongqi Tong and Yangkun Wang and Yuxin Guo and Yujia Wang and Jingbo Shang},
year={2023},
eprint={2310.17389},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
@misc{text-moderation-large,
title={Text-Moderation-Multilingual: A Multilingual Text Moderation Dataset},
author={[KoalaAI]},
year={2025},
note={Aggregated from ifmain's and OpenAI's moderation datasets}
}
Base model
microsoft/deberta-v3-base