Sliding Window Attention Adaptation (SWAA) LoRA Adapters

This repository contains LoRA adapter weights for models fine-tuned as part of the paper Sliding Window Attention Adaptation.

The self-attention mechanism in Transformer-based Large Language Models (LLMs) scales quadratically with input length, making long-context inference expensive. Sliding window attention (SWA) reduces this cost to linear complexity, but naively enabling complete SWA at inference time for models pretrained with full attention (FA) causes severe long-context performance degradation due to the training-inference mismatch. The paper addresses this by proposing Sliding Window Attention Adaptation (SWAA), a set of practical recipes that combine five methods for better adaptation: (1) applying SWA only during prefilling; (2) preserving "sink" tokens; (3) interleaving FA/SWA layers; (4) chain-of-thought (CoT) reasoning; and (5) fine-tuning. Experiments show that SWA adaptation is feasible but non-trivial: no single method suffices, yet specific synergistic combinations effectively recover the original long-context performance.

Code: https://github.com/yuyijiong/sliding-window-attention-adaptation

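For intuition, here is a minimal toy sketch (not the paper's implementation) of an attention mask that combines a sliding window with preserved "sink" tokens, two of the ingredients listed above; the window size and sink count are illustrative only:

import torch

def swa_mask_with_sinks(seq_len: int, window: int, keep_first: int) -> torch.Tensor:
    """Toy causal mask: each query attends to the previous `window` tokens
    plus the first `keep_first` "sink" tokens. True means attention is allowed."""
    q = torch.arange(seq_len).unsqueeze(1)  # query positions (rows)
    k = torch.arange(seq_len).unsqueeze(0)  # key positions (columns)
    causal = k <= q
    in_window = (q - k) < window
    is_sink = k < keep_first
    return causal & (in_window | is_sink)

# Small example: 8 tokens, a window of 3, and 1 sink token kept
print(swa_mask_with_sinks(seq_len=8, window=3, keep_first=1).int())

In SWAA, this sparsity pattern is applied selectively, e.g. only during prefilling or only in some layers, which is what the SWAAConfig options in the usage example below control.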

Usage

This model provides LoRA adapter weights designed to be applied on top of compatible base models (e.g., Qwen/Qwen3-30B-A3B-Instruct-2507). To fully enable and customize Sliding Window Attention Adaptation (SWAA), you need the custom swaa_patch module from the official GitHub repository, which monkey-patches transformers and vLLM.

Installation Steps:

  1. Clone the official sliding-window-attention-adaptation repository:
    git clone https://github.com/yuyijiong/sliding-window-attention-adaptation.git
    
  2. Install the custom flash-attention package as described in the GitHub repository:
    cd sliding-window-attention-adaptation/flash-attention-SWAA
    bash install.sh
    # Note: CUDA >= 12.8 is recommended. This may overwrite an existing flash-attn installation,
    # so consider creating a new Python environment.
    
  3. Ensure the swaa_patch folder is on your Python path. If you cloned the repository into your current working directory, you may need to add it to sys.path (a quick import check follows these steps):
    import sys
    sys.path.append("./sliding-window-attention-adaptation")
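
To sanity-check the setup (assuming the steps above succeeded), confirm that the custom flash-attention build and the patch module import cleanly:

import sys
sys.path.append("./sliding-window-attention-adaptation")  # path from step 3

import flash_attn
from swaa_patch import SWAAConfig, hack_hf_swaa

print("flash-attn version:", flash_attn.__version__)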
    

Inference Example with Hugging Face transformers and peft:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

# Import and run the SWAA patching function before loading the model
# (Ensure 'swaa_patch' is accessible in your PYTHONPATH)
from swaa_patch import SWAAConfig, hack_hf_swaa
hack_hf_swaa(training=False)

# Define paths for the base model and this LoRA adapter
base_model_id = "Qwen/Qwen3-30B-A3B-Instruct-2507" # Example compatible base model
lora_adapter_id = "yuyijiong/Qwen3-SWA-adaptation-30B" # This LoRA adapter repository (adjust if using a different variant)

# Load tokenizer for the base model
tokenizer = AutoTokenizer.from_pretrained(base_model_id, trust_remote_code=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token # Common fallback if pad_token is not explicitly set

# Load the base model
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16, # Use torch.float16 or torch.bfloat16 based on your GPU
    trust_remote_code=True,
    attn_implementation="flash_attention_2", # Ensure custom flash-attention is installed
).eval()

# Load this LoRA adapter weights on top of the base model
model = PeftModel.from_pretrained(base_model, lora_adapter_id)
# Merge LoRA weights into the base model for efficient inference if desired
# model = model.merge_and_unload() 

# Define and attach the SWAA configuration to the model's config
# Adjust `sliding_window_size`, `keep_first`, and `non_sliding_layers`
# based on your specific requirements and model architecture as discussed in the paper.
swaa_config = SWAAConfig(
    sliding_window_size=2048,
    keep_first=100,
    force_fa_decode=True,
    non_sliding_layers=[l for l in range(model.config.num_hidden_layers) if l % 2 == 0],
)
model.config.swaa_config = swaa_config # Attach SWAA config to model config

# Prepare your input prompt (example with a long context)
prompt = "The quick brown fox jumps over the lazy dog. " * 100 + "What is the main subject of this story?"
inputs = tokenizer([prompt], return_tensors="pt", padding=True).to(model.device)

# Generate text
output_ids = model.generate(
    **inputs,
    max_new_tokens=100,      # Max tokens to generate
    do_sample=True,          # Set to True for sampling, False for greedy decoding
    temperature=0.7,         # Adjust for creativity (when do_sample=True)
    top_p=0.9,               # Adjust for sampling (when do_sample=True)
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id,
)

# Decode and print the generated text
decoded_output = tokenizer.decode(output_ids[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(decoded_output)
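
Since the example base model is an instruction-tuned chat model, you may get better results by formatting the prompt with the tokenizer's chat template before tokenizing (a minimal variation of the input preparation above):

messages = [{"role": "user", "content": prompt}]
chat_text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer([chat_text], return_tensors="pt").to(model.device)
# Then call model.generate(**inputs, ...) exactly as above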

Datasets

The training and evaluation datasets used in this work are available on the Hugging Face Hub:

  1. Training Dataset: yuyijiong/fusang-v1-filtered (the training dataset used for long-context SFT).
  2. Evaluation Datasets: yuyijiong/LongMemEval_24k (includes benchmark files such as longmemeval_24k.parquet and longbenchv2_qa.parquet).
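
For example, they can be loaded with the Hugging Face datasets library (the split names and file layout below are assumptions; check each dataset card for the exact structure):

from datasets import load_dataset

# Training data for long-context SFT
train_data = load_dataset("yuyijiong/fusang-v1-filtered")

# Evaluation data; the repository contains parquet files such as longmemeval_24k.parquet
eval_data = load_dataset(
    "yuyijiong/LongMemEval_24k",
    data_files="longmemeval_24k.parquet",
    split="train",
)

print(train_data)
print(eval_data)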

Citation

If you find our work helpful or inspiring, please feel free to cite it:

@article{yu2025sliding,
    title={Sliding Window Attention Adaptation},
    author={Yu, Yijiong and Liu, Jiale and Wu, Qingyun and Wang, Huazheng and Pei, Ji},
    journal={arXiv preprint arXiv:2512.10411},
    year={2025}
}