Sliding Window Attention Adaptation (SWAA) LoRA Adapters
This repository contains LoRA adapter weights for models fine-tuned as part of the paper Sliding Window Attention Adaptation.
The self-attention mechanism in Transformer-based Large Language Models (LLMs) scales quadratically with input length, making long-context inference expensive. Sliding window attention (SWA) reduces this cost to linear complexity, but naively enabling complete SWA at inference time for models pretrained with full attention (FA) causes severe long-context performance degradation due to the training-inference mismatch. The paper investigates this problem and proposes Sliding Window Attention Adaptation (SWAA), a set of practical recipes that combine five methods for better adaptation: (1) applying SWA only during prefilling; (2) preserving "sink" tokens; (3) interleaving FA/SWA layers; (4) chain-of-thought (CoT); and (5) fine-tuning. Experiments show that SWA adaptation is feasible but non-trivial: no single method suffices, yet specific synergistic combinations effectively recover the original long-context performance.
Code: https://github.com/yuyijiong/sliding-window-attention-adaptation
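To make methods (1)-(3) concrete, the sketch below builds the boolean attention mask implied by sliding window attention with preserved sink tokens. This is an illustrative construction only, not code from the repository; the function name and shapes are ours.

import torch

def swa_mask(seq_len: int, window: int, num_sink: int) -> torch.Tensor:
    # True marks the key positions each query position may attend to.
    q = torch.arange(seq_len).unsqueeze(1)  # query positions (column)
    k = torch.arange(seq_len).unsqueeze(0)  # key positions (row)
    causal = k <= q                         # standard causal constraint
    in_window = (q - k) < window            # keys inside the sliding window
    is_sink = k < num_sink                  # first tokens stay visible ("sink" tokens)
    return causal & (in_window | is_sink)

# Example: 8 tokens, window of 3, one sink token
print(swa_mask(seq_len=8, window=3, num_sink=1).int())

Layers configured as non-sliding (method 3) would instead keep the plain causal mask, and method (1) corresponds to applying the windowed mask only while prefilling the prompt, then decoding with full attention.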
Usage
This repository provides LoRA adapter weights designed to be applied on top of a compatible base model (e.g., Qwen/Qwen3-30B-A3B-Instruct-2507). To fully enable and customize the SWAA features, you need the custom swaa_patch module from the official GitHub repository, which monkey-patches transformers and vLLM.
Installation Steps:
- Clone the official sliding-window-attention-adaptation repository:
  git clone https://github.com/yuyijiong/sliding-window-attention-adaptation.git
- Install the custom flash-attention package as described in the GitHub repository:
  cd sliding-window-attention-adaptation/flash-attention-SWAA
  bash install.sh
  # Note: CUDA >= 12.8 is recommended. This may overwrite an existing flash-attn installation.
  # Consider creating a new Python environment.
- Ensure the swaa_patch folder is accessible in your Python environment's path. If you cloned the repository to your current working directory, you might need to add it to sys.path (see the verification sketch after these steps):
  import sys
  sys.path.append("./sliding-window-attention-adaptation")
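After installation, a quick sanity check that the custom packages are importable (a minimal sketch; adjust the clone path to your setup):

import sys
sys.path.append("./sliding-window-attention-adaptation")  # path to your local clone

import flash_attn                                # custom build installed by install.sh
from swaa_patch import SWAAConfig, hack_hf_swaa  # SWAA monkey-patching utilities

print("flash-attn version:", flash_attn.__version__)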
Inference Example with Hugging Face transformers and peft:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
# Import and run the SWAA patching function before loading the model
# (Ensure 'swaa_patch' is accessible in your PYTHONPATH)
from swaa_patch import SWAAConfig, hack_hf_swaa
hack_hf_swaa(training=False)
# Define paths for the base model and this LoRA adapter
base_model_id = "Qwen/Qwen3-30B-A3B-Instruct-2507" # Example base model (from metadata)
lora_adapter_id = "yuyijiong/Qwen3-SWA-adaptation-30B" # Repo path of this LoRA adapter (replace if using a different adapter)
# Load tokenizer for the base model
tokenizer = AutoTokenizer.from_pretrained(base_model_id, trust_remote_code=True)
if tokenizer.pad_token is None:
tokenizer.pad_token = tokenizer.eos_token # Common fallback if pad_token is not explicitly set
# Load the base model
base_model = AutoModelForCausalLM.from_pretrained(
base_model_id,
device_map="auto",
torch_dtype=torch.bfloat16, # Use torch.float16 or torch.bfloat16 based on your GPU
trust_remote_code=True,
attn_implementation="flash_attention_2", # Ensure custom flash-attention is installed
).eval()
# Load this LoRA adapter weights on top of the base model
model = PeftModel.from_pretrained(base_model, lora_adapter_id)
# Merge LoRA weights into the base model for efficient inference if desired
# model = model.merge_and_unload()
# Define and attach the SWAA configuration to the model's config
# Adjust `sliding_window_size`, `keep_first`, and `non_sliding_layers`
# based on your specific requirements and model architecture as discussed in the paper.
swaa_config = SWAAConfig(
    sliding_window_size=2048,  # attention window size used by the SWA layers
    keep_first=100,            # keep the first 100 "sink" tokens visible to all queries
    force_fa_decode=True,      # decode with full attention, i.e. apply SWA only during prefilling
    non_sliding_layers=[l for l in range(model.config.num_hidden_layers) if l % 2 == 0],  # even layers keep FA (interleaved FA/SWA)
)
model.config.swaa_config = swaa_config # Attach SWAA config to model config
# Prepare your input prompt (example with a long context)
prompt = "The quick brown fox jumps over the lazy dog. " * 100 + "What is the main subject of this story?"
inputs = tokenizer([prompt], return_tensors="pt", padding=True).to(model.device)
# Generate text
output_ids = model.generate(
**inputs,
max_new_tokens=100, # Max tokens to generate
do_sample=True, # Set to True for sampling, False for greedy decoding
temperature=0.7, # Adjust for creativity (when do_sample=True)
top_p=0.9, # Adjust for sampling (when do_sample=True)
pad_token_id=tokenizer.pad_token_id,
eos_token_id=tokenizer.eos_token_id,
)
# Decode and print the generated text
decoded_output = tokenizer.decode(output_ids[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(decoded_output)
Datasets
The training and evaluation datasets used in this work are available on the Hugging Face Hub:
- Training Dataset: yuyijiong/fusang-v1-filtered (the training dataset for long-context SFT).
- Evaluation Datasets: yuyijiong/LongMemEval_24k (includes benchmark data for evaluation, such as longmemeval_24k.parquet and longbenchv2_qa.parquet); a loading sketch follows this list.
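Both datasets can be pulled with the Hugging Face datasets library. Below is a minimal loading sketch; the split names and per-file layout are assumptions, so check each dataset card before relying on them.

from datasets import load_dataset

# Training data for long-context SFT
train_ds = load_dataset("yuyijiong/fusang-v1-filtered", split="train")

# Evaluation data: load one specific parquet file from the evaluation repo
eval_ds = load_dataset(
    "yuyijiong/LongMemEval_24k",
    data_files="longmemeval_24k.parquet",
    split="train",
)

print(train_ds)
print(eval_ds)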
Citation
If you find our work helpful or inspiring, please feel free to cite it:
@article{yu2025sliding,
title={Sliding Window Attention Adaptation},
author={Yu, Yijiong and Liu, Jiale and Wu, Qingyun and Wang, Huazheng and Pei, Ji},
journal={arXiv preprint arXiv:2512.10411},
year={2025}
}