NAF: Zero-Shot Feature Upsampling via Neighborhood Attention Filtering
Abstract
Neighborhood Attention Filtering (NAF) upsamples features from Vision Foundation Models without retraining, achieving state-of-the-art performance across tasks with high efficiency.
Vision Foundation Models (VFMs) extract spatially downsampled representations, posing challenges for pixel-level tasks. Existing upsampling approaches face a fundamental trade-off: classical filters are fast and broadly applicable but rely on fixed forms, while modern upsamplers achieve superior accuracy through learnable, VFM-specific forms at the cost of retraining for each VFM. We introduce Neighborhood Attention Filtering (NAF), which bridges this gap by learning adaptive spatial-and-content weights through Cross-Scale Neighborhood Attention and Rotary Position Embeddings (RoPE), guided solely by the high-resolution input image. NAF operates zero-shot: it upsamples features from any VFM without retraining, making it the first VFM-agnostic architecture to outperform VFM-specific upsamplers and achieve state-of-the-art performance across multiple downstream tasks. It maintains high efficiency, scaling to 2K feature maps and reconstructing intermediate-resolution maps at 18 FPS. Beyond feature upsampling, NAF demonstrates strong performance on image restoration, highlighting its versatility. Code and checkpoints are available at https://github.com/valeoai/NAF.
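To make the idea of image-guided, neighborhood-restricted weighting concrete, here is a minimal NumPy sketch. It is not the NAF implementation: it uses a fixed, joint-bilateral-style weight (the kind of classical filter the abstract contrasts NAF with), written as a softmax over a local low-resolution neighborhood. NAF instead learns its weights with Cross-Scale Neighborhood Attention and RoPE. The function name, the `kernel`, `sigma_color`, and `sigma_space` parameters, and the average-pooled low-resolution guidance are all assumptions made for illustration.

```python
import numpy as np

def neighborhood_filter_upsample(feats_lr, image_hr, kernel=3,
                                 sigma_color=0.1, sigma_space=1.0):
    """Upsample (h, w, C) features to the (H, W) grid of a float image_hr.

    Hypothetical helper: a fixed-form, joint-bilateral-style filter written
    as softmax attention over a local low-resolution neighborhood.
    """
    h, w, _ = feats_lr.shape
    H, W, _ = image_hr.shape
    assert H % h == 0 and W % w == 0, "sketch assumes an integer upsampling factor"
    sy, sx = H // h, W // w

    # Low-res guidance (assumption): average-pool the image to the feature grid.
    image_lr = image_hr.reshape(h, sy, w, sx, -1).mean(axis=(1, 3))

    # Continuous low-res coordinates of every high-res pixel centre.
    ys = (np.arange(H) + 0.5) / sy - 0.5
    xs = (np.arange(W) + 0.5) / sx - 0.5
    cy = np.clip(np.round(ys).astype(int), 0, h - 1)
    cx = np.clip(np.round(xs).astype(int), 0, w - 1)

    r = kernel // 2
    logits, values = [], []
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            # Neighbor indices, clamped at the border (edge cells repeat).
            qy = np.clip(cy + dy, 0, h - 1)
            qx = np.clip(cx + dx, 0, w - 1)
            key = image_lr[qy][:, qx]   # (H, W, 3) low-res guidance colours
            val = feats_lr[qy][:, qx]   # (H, W, C) low-res features
            # Content term: colour distance between query pixel and neighbor.
            color = ((image_hr - key) ** 2).sum(-1) / (2 * sigma_color ** 2)
            # Spatial term: distance in low-res coordinates.
            space = ((ys - qy)[:, None] ** 2 + (xs - qx)[None, :] ** 2) \
                / (2 * sigma_space ** 2)
            logits.append(-(color + space))
            values.append(val)

    logits = np.stack(logits, axis=0)   # (K, H, W)
    values = np.stack(values, axis=0)   # (K, H, W, C)
    # Numerically stable softmax over the K neighbors.
    logits -= logits.max(axis=0, keepdims=True)
    weights = np.exp(logits)
    weights /= weights.sum(axis=0, keepdims=True)
    # Each output feature is a convex combination of nearby low-res features.
    return (weights[..., None] * values).sum(axis=0)
```

Because the weights depend only on the image and on pixel positions, the same filter applies unchanged to features from any backbone. That is the property that makes a VFM-agnostic upsampler possible in principle; the learned weighting in NAF replaces the hand-set Gaussian terms used here.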
Community
Hi,
I am excited to share my latest work. Do not hesitate to reach out to me if you have any questions.
We have open-sourced everything (training, evaluation, weights, and demos) in the GitHub repository: https://github.com/valeoai/NAF?tab=readme-ov-file
cool paper!
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Upsample Anything: A Simple and Hard to Beat Baseline for Feature Upsampling (2025)
- AnyUp: Universal Feature Upsampling (2025)
- Learned Adaptive Kernels for High-Fidelity Image Downscaling (2025)
- FlowFeat: Pixel-Dense Embedding of Motion Profiles (2025)
- RADSeg: Unleashing Parameter and Compute Efficient Zero-Shot Open-Vocabulary Segmentation Using Agglomerative Models (2025)
- Attentive Convolution: Unifying the Expressivity of Self-Attention with Convolutional Efficiency (2025)
- MSLoRA: Multi-Scale Low-Rank Adaptation via Attention Reweighting (2025)