Abstract
Vision transformers are enhanced for segmentation tasks through a Gaussian kernel modulation that improves local attention while maintaining classification performance.
Vision transformers have demonstrated remarkable success in classification by leveraging global self-attention to capture long-range dependencies. However, this same mechanism can obscure fine-grained spatial details crucial for tasks such as segmentation. In this work, we seek to enhance the segmentation performance of vision transformers after standard image-level classification training. More specifically, we present a simple yet effective add-on that improves performance on segmentation tasks while retaining vision transformers' image-level recognition capabilities. In our approach, we modulate the self-attention with a learnable Gaussian kernel that biases the attention toward neighboring patches. We further refine the patch representations to learn better embeddings at patch positions. These modifications encourage tokens to focus on their local surroundings and ensure meaningful representations at spatial positions, while still preserving the model's ability to incorporate global information. Experiments demonstrate the effectiveness of our modifications, evidenced by substantial segmentation gains on three benchmarks (e.g., over 6% for ViT-Tiny and over 4% for ViT-Base on ADE20K), without changing the training regime or sacrificing classification performance. The code is available at https://github.com/sinahmr/LocAtViT/.
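The core idea of the attention modification can be illustrated with a minimal sketch: add a Gaussian log-space bias, -d²/(2σ²) over patch-grid distances d, to the attention logits before the softmax, so that nearby patches receive more weight while distant ones remain reachable. This is an illustrative single-head, batch-free numpy formulation with names of our own choosing, not the authors' implementation (see the linked repository for that); in the paper the kernel width is learnable, whereas here σ is passed as a fixed parameter.

```python
import numpy as np

def gaussian_attention_bias(grid_h, grid_w, sigma):
    """Additive log-space bias -dist^2 / (2 * sigma^2) between all patch positions."""
    ys, xs = np.meshgrid(np.arange(grid_h), np.arange(grid_w), indexing="ij")
    coords = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(float)   # (N, 2)
    d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)       # (N, N)
    return -d2 / (2.0 * sigma ** 2)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def local_attention(q, k, v, grid_h, grid_w, sigma):
    """Scaled dot-product self-attention with a Gaussian locality bias on the logits."""
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)                       # (N, N) attention logits
    logits += gaussian_attention_bias(grid_h, grid_w, sigma)
    return softmax(logits) @ v                          # (N, d) attended values
```

With a small σ the bias sharply down-weights distant patches; as σ grows the bias flattens toward zero and the mechanism recovers ordinary global attention, which matches the abstract's claim that global information is preserved.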
Community
LocAtViT is a method to pretrain vision transformers so that their patch representations transfer better to dense prediction (e.g., segmentation), without changing the pretraining objective.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- ViT-5: Vision Transformers for The Mid-2020s (2026)
- Revisiting [CLS] and Patch Token Interaction in Vision Transformers (2026)
- SoLA-Vision: Fine-grained Layer-wise Linear Softmax Hybrid Attention (2026)
- Adapting Vision Transformers to Ultra-High Resolution Semantic Segmentation with Relay Tokens (2026)
- Beyond the final layer: Attentive multilayer fusion for vision transformers (2026)
- Vision Transformers Need More Than Registers (2026)
- CAViT -- Channel-Aware Vision Transformer for Dynamic Feature Fusion (2026)