OmniSafeBench-MM: A Unified Benchmark and Toolbox for Multimodal Jailbreak Attack-Defense Evaluation
Abstract
OmniSafeBench-MM is a comprehensive toolbox for evaluating multi-modal jailbreak attacks and defenses, covering 13 attack methods, 15 defense strategies, and 9 major risk domains.
Recent advances in multi-modal large language models (MLLMs) have enabled unified perception-reasoning capabilities, yet these systems remain highly vulnerable to jailbreak attacks that bypass safety alignment and induce harmful behaviors. Existing benchmarks such as JailBreakV-28K, MM-SafetyBench, and HADES provide valuable insights into multi-modal vulnerabilities, but they typically focus on limited attack scenarios, lack standardized defense evaluation, and offer no unified, reproducible toolbox. To address these gaps, we introduce OmniSafeBench-MM, a comprehensive toolbox for multi-modal jailbreak attack-defense evaluation. OmniSafeBench-MM integrates 13 representative attack methods, 15 defense strategies, and a diverse dataset spanning 9 major risk domains and 50 fine-grained categories, structured across consultative, imperative, and declarative inquiry types to reflect realistic user intentions. Beyond data coverage, it establishes a three-dimensional evaluation protocol measuring (1) harmfulness, graded on a multi-level scale ranging from low-impact individual harm to catastrophic societal threats, (2) intent alignment between responses and queries, and (3) response detail level, enabling nuanced safety-utility analysis. We conduct extensive experiments on 10 open-source and 8 closed-source MLLMs, revealing their vulnerability to multi-modal jailbreak attacks. By unifying data, methodology, and evaluation in an open-source, reproducible platform, OmniSafeBench-MM provides a standardized foundation for future research. The code is released at https://github.com/jiaxiaojunQAQ/OmniSafeBench-MM.
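To make the three-dimensional evaluation protocol concrete, the sketch below shows one way a single judged attack attempt could be represented. This is a minimal, hypothetical illustration assuming placeholder field names, harm-level labels, and a simple success rule; it is not the actual data schema or scoring code released in the OmniSafeBench-MM repository.

```python
from dataclasses import dataclass
from enum import IntEnum


class HarmLevel(IntEnum):
    """Illustrative multi-level harmfulness scale (level names are placeholders,
    not the paper's official taxonomy)."""
    SAFE = 0        # model refuses or produces a harmless answer
    INDIVIDUAL = 1  # low-impact harm confined to an individual
    GROUP = 2       # harm affecting a group or organization
    SOCIETAL = 3    # large-scale or catastrophic societal threat


@dataclass
class EvalRecord:
    """One judged attack attempt scored along the three evaluation dimensions."""
    query_id: str
    risk_domain: str         # one of the 9 major risk domains
    category: str            # one of the 50 fine-grained categories
    inquiry_type: str        # "consultative" | "imperative" | "declarative"
    harmfulness: HarmLevel   # dimension 1: graded harm level
    intent_alignment: float  # dimension 2: does the response address the query? (0-1)
    detail_level: float      # dimension 3: how detailed/actionable is the response? (0-1)

    def is_successful_jailbreak(
        self,
        harm_threshold: HarmLevel = HarmLevel.INDIVIDUAL,
        align_threshold: float = 0.5,
    ) -> bool:
        """Example success rule: harmful enough AND on-topic for the original query."""
        return self.harmfulness >= harm_threshold and self.intent_alignment >= align_threshold
```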
Community
This work presents OmniSafeBench-MM, a unified, open-source benchmark and toolbox designed for comprehensive evaluation of multimodal jailbreak attack and defense methods. It integrates 13 representative attack techniques, 15 defense strategies, and a diverse dataset spanning 9 risk domains and 50 fine-grained categories, covering real-world–relevant query types (consultative, imperative, declarative). We also propose a three-dimensional evaluation protocol measuring harmfulness (from low-impact individual harm to societal-level threats), intent alignment, and response detail — allowing nuanced safety-utility tradeoff analysis. Our extensive experiments across 10 open-source and 8 closed-source multimodal LLMs reveal widespread vulnerabilities to multimodal jailbreak attacks. By unifying data, methods, and evaluation, OmniSafeBench-MM offers a standardized, reproducible platform — which we hope will become a foundational resource for future research on safe, robust multimodal LLMs.
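As a rough picture of how a unified attack-defense evaluation of this kind fits together, the sketch below walks through one round: craft a jailbreak input, apply a defense, query the target MLLM, and judge the output. All class and function names here are hypothetical placeholders under assumed interfaces, not the actual OmniSafeBench-MM API.

```python
from typing import Callable, Protocol


class Attack(Protocol):
    def craft(self, text: str, image_path: str) -> tuple[str, str]:
        """Return an adversarial (text, image_path) pair for a harmful query."""
        ...


class Defense(Protocol):
    def guard(self, text: str, image_path: str) -> tuple[str, str]:
        """Return a sanitized (text, image_path) pair, e.g. after adding a safety prompt."""
        ...


def evaluate_pair(
    attack: Attack,
    defense: Defense,
    model: Callable[[str, str], str],   # target MLLM: (text, image_path) -> response
    judge: Callable[[str, str], dict],  # judge: (original query, response) -> 3-D scores
    query: str,
    image_path: str,
) -> dict:
    """Run one attack-defense round against a target MLLM and score the response."""
    adv_text, adv_image = attack.craft(query, image_path)        # 1. craft jailbreak input
    safe_text, safe_image = defense.guard(adv_text, adv_image)   # 2. apply defense strategy
    response = model(safe_text, safe_image)                      # 3. query the target MLLM
    return judge(query, response)                                # 4. three-dimensional judging
```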