arxiv:2511.22826

Some Modalities are More Equal Than Others: Decoding and Architecting Multimodal Integration in MLLMs

Published on Nov 28

· Submitted by

Authors:

Abstract

Multimodal Large Language Models (MLLMs) lack robustness to contradictory modalities, as demonstrated by MMA-Bench, and a modality alignment tuning strategy improves their multimodal reasoning.

AI-generated summary

Despite remarkable advancements in Multimodal Large Language Models (MLLMs), a fundamental question remains: are MLLMs robust to contradicting modalities? To rigorously study this, we introduce MMA-Bench comprising videos and tasks that probe a model's reliance on specific modalities. Using black-box and white-box interpretability techniques, we provide a critical analysis of the brittleness of both open- and closed-sourced MLLMs. We show that current MLLMs struggle under misaligned audio-visual pairs and simple misleading text, thereby lacking robust multi-modal reasoning. Building on these findings, we propose a modality alignment tuning strategy to teach the model when to prioritize, leverage, or ignore specific modality cues. Through extensive experiments and analysis, we show that our alignment tuning yields demonstrably stronger multimodal grounding. This work provides both interpretability tools and a clear path toward developing MLLMs with intrinsically reliable cross-modal reasoning. Code and dataset will be publicly available.

View arXiv page View PDF Add to collection

Community

Tianle

Paper submitter about 18 hours ago

Some Modalities are More Equal Than Others: Decoding and Architecting Multimodal Integration

As the famous George Orwell quote goes - "all animals are equal but some animals are more equal than others", we indeed find that though present-day MLLMs claim to process all modalities equally, some modalities (like text then visual) are more heavily utilized than others.

Consider a video of birds chirping. If blindfolded, could you describe what you hear? Similarly, if wearing noise-canceling headphones, could you describe what you see? . In all these situations, most humans can easily describe the events and rely on the available modality. But regarding present-day Multimodal Large Language Models (MLLMs), a fundamental question remains: are MLLMs robust to contradicting modalities?.

Most models are trained on datasets that overwhelmingly assume that all available modalities are aligned. To rigorously study this, we introduce MMA-Bench to systematically control the presence or alignment of one modality at a time. We use structured contradictions to reveal if MLLMs are truly multi-modal or take shortcuts during cross-modal reasoning tasks.

Key Findings: The Hierarchy of Modalities
Our analysis revealed a critical flaw: despite being trained as multimodal systems, all MLLMs collapse when modalities conflict.

Text Dominance: We found that small text distractions cripple models, ignoring clear audio-visual cues. White-box interpretability reveals that textual tokens consistently exhibit the highest attention magnitudes. This indicates that textual priors frequently override multimodal signals.
Seeing Without Listening: Models rely almost exclusively on vision, i.e., "seeing without listening," resulting in strong degradation of audio performance when modalities conflict.
Visual Fallback: When the requested modality is absent (e.g., asking about audio in a silent video), models do not reliably follow the desired modality instructions and instead fall back on whichever modality remains informative - typically the visual stream.
Brittle Modality Integration: MLLMs fail ungracefully if any one modality is perturbed. This brittleness makes these models susceptible to simple text-, vision-, and audio-based data poisoning or prompt attacks.

Building on these findings, we propose a modality alignment tuning strategy to teach the model when to prioritize, leverage, or ignore specific modality cues. This prompt-conditioned supervision yields adaptive attention redistribution toward the queried modality rather than static bias suppression. We show that targeted modality alignment is more effective than scaling alone.

Code and dataset will be available at cskyl.github.io/MMA-Bench/.

librarian-bot

about 8 hours ago

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any Paper on Hugging Face checkout this Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2511.22826 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2511.22826 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2511.22826 in a Space README.md to link it from this page.