The Heterogeneous Feature of RoPE-based Attention in Long-Context LLMs
This post is based on an oral presentation delivered on July 13 and 14, 2025, at the Joint Academic Workshop of HIT, THU, and FDU (held at Fudan University) and at a workshop of the Shanghai Innovation Institute.
TL;DR: In this talk, we introduce the Heterogeneous Feature of Attention in Long-Context LLMs: the phenomenon that attention components along different qk dimensions play different roles in long-context LLMs. We uncover and explain this heterogeneity from the RoPE perspective, then leverage it for length extrapolation, cache optimization, and long-video modeling.
Introduction
Good afternoon, professors and fellow students. It is my great honor to present here today. I am Xiaoran Liu, a Ph.D. student from FNLP. The topic of my talk today is the Heterogeneous Feature of Attention in Long-Context LLMs.
There are two keywords in this title. The first is long context. Long-context processing has long been an important topic in NLP. From a historical perspective, the pursuit of longer context length has driven the evolution of model architectures: from Bag-of-Words without context, to CNNs, RNNs, and LSTMs with limited context, then to today’s Transformers, and even recent challengers such as RWKV and Mamba. The pursuit of longer context continually inspires new architectures. In the era of LLMs, long context has always been a core competitive advantage, and LLMs have scaled their context windows from the initial 2K tokens to millions of tokens today.
The second keyword is attention, especially the attention score. The attention score has been a key insight behind many influential works in long-context research. One well-known example is StreamingLLM (ICLR’24), which found that LLMs’ attention scores show unusually strong peaks around the initial tokens and the recent tokens. By preserving attention to these two parts, an LLM can maintain stable performance when streaming long inputs. This work was accepted to ICLR last year and has been highly influential. Building on it, the authors also proposed a cache optimization method, DuoAttention (ICLR'25), to maintain retrieval performance in long contexts. Beyond cache optimization, attention scores can also be used for dynamic sparsification to accelerate long-context inference, as in MInference (NeurIPS’24 Spotlight).
However, these studies treat the attention score as an atomic whole and lack analysis of how different dimensions of q and k contribute differently to the overall score. This gap has been touched upon only in some very recent studies, which leads us to today’s focus: heterogeneous features. This talk will center on the discovery and utilization of heterogeneous features, and it will also introduce related long-context research from our lab.
Definition
So, what are heterogeneous features? Heterogeneous features refer to the phenomenon that attention components along different qk dimensions play different roles in long-context LLMs. Here are two examples:
Observation
First, from the perspective of long-context retrieval, we know that most of the attention score is allocated to the initial tokens and recent tokens, as observed in StreamingLLM, and each attention score is the sum of 128 elementwise qk products across head dimensions. If we split this sum, for instance into the first 70 dimensions versus the last 58, we find that the lower dimensions account for the high attention scores on recent tokens, while the upper dimensions account for those on the initial tokens. Based on this, if we add noise to the first 70 dimensions, the NIAH (Needle-In-A-Haystack) performance of LLMs barely changes; but if we add the same noise to the last 58 dimensions, even though they are fewer, NIAH performance degrades significantly. This phenomenon is consistently observed in both LLaMA and Qwen models.
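To make this concrete, here is a minimal PyTorch sketch of the decomposition (an illustration of the idea, not our actual analysis code; the random tensors and the 70/58 split for a 128-dimensional head are assumptions):

```python
# Decompose a single head's attention logits into contributions from the
# "lower" and "upper" qk dimensions. The two parts sum to the full logits.
import torch

def split_attention_logits(q, k, split=70):
    """q, k: [T, d] queries/keys of one head after RoPE is applied."""
    d = q.shape[-1]
    scale = d ** -0.5
    full = (q @ k.T) * scale                          # [T, T] standard dot-product logits
    lower = (q[:, :split] @ k[:, :split].T) * scale   # contribution of dims [0, split)
    upper = (q[:, split:] @ k[:, split:].T) * scale   # contribution of dims [split, d)
    return full, lower, upper

# Toy usage: the two parts add up to the full logits; in a real LLM one would
# inspect lower/upper along the last query row to see which group concentrates
# on recent versus initial tokens.
T, d = 16, 128
q, k = torch.randn(T, d), torch.randn(T, d)
full, lower, upper = split_attention_logits(q, k)
assert torch.allclose(full, lower + upper, atol=1e-4)
```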
From the perspective of length extrapolation, we examine how the attention score components from lower and upper dimensions fluctuate within and beyond the model’s trained context length. We find that lower dimensions remain stable whether extrapolating or not, while upper dimensions show abnormal fluctuations once the token index exceeds the maximum supported context length, and the position of these fluctuations strongly correlates with where perplexity spikes occur. Thus, we observe that the lower and the upper qk dimensions exhibit heterogeneous features.
Explanation
So where do these heterogeneous features come from? We believe the source is the Rotary Position Embedding (RoPE). Why does RoPE cause heterogeneous features? As is well known, RoPE encodes positional information using sine and cosine functions with different rotary angles, namely frequencies, across qk dimensions. This structure inherits two mathematical properties of sinusoidal functions: periodicity and monotonicity.
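For reference, here is a minimal RoPE sketch for a single head (LLaMA/HF-style half-split pairing; the defaults d=128 and base=10000 are illustrative assumptions), which makes the per-pair rotary angle explicit:

```python
import torch

def rope_angles(d=128, base=10000.0):
    # Rotary angle (frequency) of dimension pair i: theta_i = base ** (-2i / d).
    # Lower pairs (small i) spin fast (short period); upper pairs spin slowly.
    i = torch.arange(d // 2, dtype=torch.float32)
    return base ** (-2.0 * i / d)

def apply_rope(x, positions, base=10000.0):
    """x: [T, d] query or key of one head; positions: [T] integer token indices."""
    d = x.shape[-1]
    theta = rope_angles(d, base)                       # [d/2]
    ang = positions[:, None].float() * theta[None, :]  # [T, d/2] rotation per pair
    cos, sin = ang.cos(), ang.sin()
    x1, x2 = x[:, : d // 2], x[:, d // 2 :]            # half-split pairing (LLaMA/HF style)
    return torch.cat([x1 * cos - x2 * sin, x2 * cos + x1 * sin], dim=-1)

# Toy usage: rotate a random query at positions 0..7.
q_rot = apply_rope(torch.randn(8, 128), torch.arange(8))
```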
Lower dimensions correspond to short periods, or high frequencies, and observe complete (even multiple) periods during pre-training. Upper dimensions correspond to long periods, or low frequencies, and see only a portion of the period (e.g., only the positive half) during pre-training. Additionally, lower dimensions have short monotonic intervals, so different relative positions can collapse into the same embedding, similar to hash collisions, while upper dimensions have long monotonic intervals, allowing them to preserve a good partial order over long contexts and thereby capture long-range semantic dependencies. Therefore, we reach a seemingly strange yet actually reasonable conclusion: periodicity limits the extrapolation ability of upper dimensions, while monotonicity makes upper dimensions responsible for modeling long-context semantics.
To replace the vague notion of lower and upper, we provide a mathematical definition: the critical dimension. The critical dimension is the number of qk dimensions for which RoPE completes a full period within the pre-training context window. Dimensions before and after the critical dimension correspond precisely to the heterogeneous behaviors described earlier.
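In code, the definition amounts to counting the dimension pairs whose rotary period fits inside the training window. A hedged sketch (my reconstruction of the definition, not the paper's code), using LLaMA2-style numbers (d=128, base 10000, 4K training length) as assumptions:

```python
import math

def critical_dimension(d=128, base=10000.0, train_len=4096):
    # Count the pairs whose full period 2*pi*base**(2i/d) fits in the training window.
    pairs_done = sum(
        1 for i in range(d // 2)
        if 2 * math.pi * base ** (2 * i / d) <= train_len
    )
    return 2 * pairs_done  # each pair spans two qk dimensions

print(critical_dimension())  # 92 under these assumptions
```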
Application
Then comes a question: how can we utilize heterogeneous features?
Length Extrapolation
Most context-extension methods (e.g., NTK-based scaling) modify the rotary base of RoPE. Using the critical dimension, we can estimate the maximal extrapolatable context length by computing how far the period of the critical dimension extends after scaling. This estimated limit matches the actual maximum supported context length well, giving us a scaling law for RoPE-based extrapolation (arXiv). Inverting the formula tells us how much we need to scale the rotary base to reach a desired context length, thus enabling million-token contexts. This work was accepted to ICLR'24.
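The period argument can be sketched in a few lines (a hedged reconstruction of the idea, not the paper's exact scaling-law formula; `d_crit = 92` reuses the LLaMA2-style value computed above):

```python
import math

def supported_length(new_base, d=128, d_crit=92):
    # Period, under the new base, of the last dimension pair that completed a
    # full rotary period during pre-training (pair index d_crit/2 - 1).
    return 2 * math.pi * new_base ** ((d_crit - 2) / d)

def base_for_length(target_len, d=128, d_crit=92):
    # Invert the expression above: smallest base whose critical-dimension period
    # covers the target context length.
    return (target_len / (2 * math.pi)) ** (d / (d_crit - 2))

print(round(supported_length(10000)))        # ~4K: roughly recovers the original window
print(f"{base_for_length(1_000_000):.2e}")   # rough base needed for ~1M tokens
```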
Cache Optimization
Since only the relatively few dimensions after the critical dimension are crucial for long contexts, the others can be compressed. Inspired by the HiPPO framework, we propose FourierAttention (arXiv, GitHub), which expands the long-context-insensitive dimensions over a fixed-order basis (here, we choose Fourier basis functions) and stores only fixed-size expansion coefficients to represent arbitrarily long KV caches. Among the candidate bases, the Fourier basis performs best and yields NIAH results closest to those of the pre-trained model.
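Here is a hedged sketch of the compression idea (an illustration only, not the FourierAttention implementation; `num_coeffs` and the toy signal are assumptions):

```python
import torch

def fourier_compress(x, num_coeffs=64):
    """x: [T, d_insensitive] KV slice along the sequence axis.
    Returns [num_coeffs, d_insensitive] complex coefficients (fixed size)."""
    coeffs = torch.fft.rfft(x, dim=0)          # [T//2 + 1, d]
    return coeffs[:num_coeffs]                 # keep only the lowest frequencies

def fourier_decompress(coeffs, seq_len):
    """Reconstruct an approximation of the original [seq_len, d] slice."""
    full = torch.zeros(seq_len // 2 + 1, coeffs.shape[-1], dtype=coeffs.dtype)
    full[: coeffs.shape[0]] = coeffs
    return torch.fft.irfft(full, n=seq_len, dim=0)

# Toy usage: a smooth sequence is reconstructed well from a few coefficients.
T, d = 4096, 70
t = torch.linspace(0, 1, T).unsqueeze(-1)
x = torch.sin(2 * torch.pi * 3 * t) + 0.1 * torch.randn(T, d)
x_hat = fourier_decompress(fourier_compress(x, 64), T)
print((x - x_hat).abs().mean())               # small residual error (mostly the removed noise)
```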
More importantly, Fourier transforms allow efficient parallel compression and decompression. Using Triton, we rewrite the FlashAttention operator to fuse the inverse Fourier transform into FlashDecoding, eliminating the need to materialize the full KV cache during inference. As a result, on a single GPU we support longer contexts than competing methods, reduce memory substantially, and keep similar latency.
Multi-Modality
Heterogeneous features also help with multimodal positional embeddings. In our ICML'25 Oral paper VideoRoPE (arXiv, GitHub), we systematically analyze the design principles for video positional embeddings: upper dimensions (low frequency) capture long-range temporal dependencies, while lower dimensions (high frequency) capture local spatial semantics. This greatly improves long-video modeling and retrieval. We further propose VideoRoPE++, which extends VideoRoPE with the extrapolation method YaRN-V, along with the discriminative evaluation suite V-RULER.
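To illustrate the allocation principle stated above (my sketch, not VideoRoPE's actual layout; the split point `n_temporal` is an assumption):

```python
import torch

def allocate_video_rope_pairs(d=128, base=10000.0, n_temporal=16):
    # Pair frequencies, ordered from high (lower dims) to low (upper dims).
    theta = base ** (-2.0 * torch.arange(d // 2) / d)
    n_spatial = (d // 2 - n_temporal) // 2
    spatial_x = theta[:n_spatial]                        # highest frequencies -> x axis
    spatial_y = theta[n_spatial : d // 2 - n_temporal]   # next band -> y axis
    temporal = theta[d // 2 - n_temporal :]              # lowest frequencies -> time axis
    return temporal, spatial_x, spatial_y

temporal, sx, sy = allocate_video_rope_pairs()
print(len(temporal), len(sx), len(sy))   # 16 + 24 + 24 pairs = 64
```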
Diffusion LM
Then comes another question: do heterogeneous features extend beyond autoregressive attention?
Yes. This is demonstrated by our recent work LongLLaDA (later accepted to AAAI'26; arXiv, GitHub), the first work on length extrapolation for diffusion-based language models (dLLMs). Unlike autoregressive models with unidirectional attention, dLLMs use bidirectional attention, so all dimensions necessarily encode both positive and negative relative positions. Thus, dLLMs naturally extrapolate, but they only maintain effective local awareness, similar to a sliding window, with retrieval restricted to the neighborhood of the pre-trained context length.
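A tiny illustration of this point (my sketch, not from the paper): under a causal mask, RoPE only ever sees one sign of the relative position, whereas bidirectional attention covers both signs, i.e., rotations in both directions.

```python
import torch

T = 8
pos = torch.arange(T)
rel = pos[:, None] - pos[None, :]                        # relative position q_idx - k_idx
causal_rel = rel[torch.tril(torch.ones(T, T, dtype=torch.bool))]
print(rel.min().item(), rel.max().item())                # -7 7  (bidirectional: both signs)
print(causal_rel.min().item(), causal_rel.max().item())  # 0 7   (causal: one sign only)
```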
Despite this difference, dLLMs also have dimensions that have not seen full positional-embedding periods during pre-training, so they likewise exhibit critical dimensions and heterogeneous features. This yields a scaling law for extrapolation in dLLMs. Using it, we extended the context window of LLaDA by 6x in a plug-and-play manner.
Conclusion
From the above studies, we see that exploring long-context processing involves far more than simply increasing context length. It spans efficiency, multimodal extension, and many other aspects, including architecture, infrastructure, training, and evaluation. Over the past two years, our team at FNLP has extensively explored these directions and made progress on all fronts, summarized in our survey Thus Spake Long-Context Large Language Models (arXiv, GitHub), whose title pays homage to the symphonic poem Thus Spoke Zarathustra.
I thank all my co-authors for their collaboration, and thank you all for listening.
Citation
@article{liu2023scaling,
title={Scaling Laws of RoPE-based Extrapolation},
author={Liu, Xiaoran and Yan, Hang and Zhang, Shuo and An, Chenxin and Qiu, Xipeng and Lin, Dahua},
journal={arXiv preprint arXiv:2310.05209},
year={2023}
}
@article{liu2025beyond,
title={Beyond Homogeneous Attention: Memory-Efficient LLMs via Fourier-Approximated KV Cache},
author={Liu, Xiaoran and He, Siyang and Wang, Qiqi and Li, Ruixiao and Song, Yuerong and Liu, Zhigeng and Huang, Mianqiu and Huang, Zengfeng and Guo, Qipeng and He, Ziwei and Qiu, Xipeng},
journal={arXiv preprint arXiv:2506.11886},
year={2025}
}
@article{wei2025videorope,
title={VideoRoPE: What Makes for Good Video Rotary Position Embedding?},
author={Wei, Xilin and Liu, Xiaoran and Zang, Yuhang and Dong, Xiaoyi and Zhang, Pan and Cao, Yuhang and Tong, Jian and Duan, Haodong and Guo, Qipeng and Wang, Jiaqi and Qiu, Xipeng and Lin, Dahua},
journal={arXiv preprint arXiv:2502.05173},
year={2025}
}
@article{liu2025longllada,
title={LongLLaDA: Unlocking Long Context Capabilities in Diffusion LLMs},
author={Liu, Xiaoran and Liu, Zhigeng and Huang, Zengfeng and Guo, Qipeng and He, Ziwei and Qiu, Xipeng},
journal={arXiv preprint arXiv:2506.14429},
year={2025}
}
@article{liu2025thus,
title={Thus Spake Long-Context Large Language Model},
author={Liu, Xiaoran and Li, Ruixiao and Huang, Mianqiu and Liu, Zhigeng and Song, Yuerong and Guo, Qipeng and He, Siyang and Wang, Qiqi and Li, Linlin and Liu, Qun and He, Ziwei and Zhou, Yaqian and Huang, Xuanjing and Qiu, Xipeng},
journal={arXiv preprint arXiv:2502.17129},
year={2025}
}









