Qwen3.5: Nobody Agrees on Attention Anymore
On February 16th, 2026, Alibaba's Qwen team released Qwen3.5-397B-A17B, their next-generation foundation model. If you're getting a sense of déjà vu from the holiday timing, you should be. GLM-5 shipped on February 11th. MiniMax M2.5 landed the same day. Kimi K2.5 arrived on January 27th. And now Qwen3.5 closes out the pre-holiday window. Let’s see how it compares with the other releases.
What Qwen3.5 Actually Is
Qwen3.5-397B-A17B is a 397 billion parameter Mixture-of-Experts model with only 17 billion active parameters per token. The hosted API version is called Qwen3.5-Plus, and it ships with a 1M context window, built-in tools, and adaptive tool use out of the box.
The key innovations stack up like this:
Hybrid Attention Architecture. This is the big architectural bet. Qwen3.5 builds on the Qwen3-Next lineage, combining Gated Delta Networks (a linear attention variant) with sparse Mixture-of-Experts. The model alternates between Gated DeltaNet layers (linear attention) and full attention layers in roughly a 3:1 ratio. Sebastian Raschka has an excellent write-up on how this works, but the short version is: three out of every four transformer blocks use linear attention (which scales near-linearly with sequence length), and every fourth block uses standard full attention to preserve global token-to-token interactions. The result is a model that can process long contexts far more cheaply than a pure full-attention design.
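To make the layout concrete, here is a toy sketch of a 3:1 schedule; the layer count and labels are illustrative, not taken from the released config:

```python
# Toy sketch of a 3:1 hybrid schedule: three linear-attention (Gated DeltaNet)
# blocks for every full-attention block. Layer count and labels are
# illustrative, not the actual Qwen3.5 configuration.
def hybrid_schedule(num_layers: int) -> list[str]:
    return [
        "full_attention" if (i + 1) % 4 == 0 else "gated_deltanet"
        for i in range(num_layers)
    ]

print(hybrid_schedule(8))
# ['gated_deltanet', 'gated_deltanet', 'gated_deltanet', 'full_attention',
#  'gated_deltanet', 'gated_deltanet', 'gated_deltanet', 'full_attention']
```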
The Gated DeltaNet mechanism itself draws from the “Gated Delta Networks: Improving Mamba2 with Delta Rule” paper. It combines Mamba2’s gated decay mechanism with a delta rule for updating a fixed-size hidden state. On top of that, the output gating used in the attention layers helps eliminate attention sinks and massive activations, improving training stability at scale.
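The recurrence itself is compact. Below is a naive per-head sketch of the gated delta rule from the paper, written as an explicit loop rather than the chunked kernel a real implementation would use; q, k, v, alpha, and beta are per-token projections of the layer input (keys are typically L2-normalized).

```python
import torch

def gated_delta_rule(q, k, v, alpha, beta):
    """Naive per-head recurrence of the gated delta rule.
    q, k: (T, d_k); v: (T, d_v); alpha, beta: (T,), values in (0, 1).
    A readable sketch of the update from the Gated DeltaNet paper,
    not the fused kernel an actual model would ship."""
    d_k, d_v = k.shape[-1], v.shape[-1]
    S = torch.zeros(d_v, d_k)                # fixed-size memory state
    outs = []
    for t in range(k.shape[0]):
        # S_t = alpha_t * (S_{t-1} - beta_t * (S_{t-1} k_t) k_t^T) + beta_t * v_t k_t^T
        erase = torch.outer(S @ k[t], k[t])  # delta-rule "erase" of the old value stored under k_t
        S = alpha[t] * (S - beta[t] * erase) + beta[t] * torch.outer(v[t], k[t])
        outs.append(S @ q[t])                # read out with the query
    return torch.stack(outs)                 # (T, d_v)
```

The gated decay (alpha) is the Mamba2 ingredient, the erase-then-write step is the delta rule, and because the state S has a fixed size, cost grows linearly with sequence length instead of quadratically.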
Scalable RL at Agent Scale. Qwen3.5 was trained with reinforcement learning scaled across what the team describes as "million-agent environments with progressively complex task distributions." This follows the trend we've seen from MiniMax's Forge and Zhipu's Slime: asynchronous RL infrastructure designed to handle the long-horizon, multi-step nature of agentic tasks. The details here are sparse in the initial release, but the emphasis on "robust real-world adaptability" suggests they've invested heavily in environment diversity during RL post-training.
Unified Vision-Language Foundation. Unlike the Qwen3 generation, which split text and vision into separate model lines (Qwen3 and Qwen3-VL), Qwen3.5 is natively multimodal from the ground up. Early-fusion training on multimodal tokens means the model doesn’t need a separate vision adapter. The team claims cross-generational parity with Qwen3 on text tasks while outperforming Qwen3-VL on visual understanding.
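A minimal sketch of the early-fusion idea, with illustrative shapes and standard PyTorch modules standing in for the real architecture: image patches are projected into the same embedding space as text tokens, and the combined sequence flows through one shared stack, no bolted-on adapter.

```python
import torch
import torch.nn as nn

# Illustrative sizes only -- not the actual Qwen3.5 dimensions.
D = 256                                     # shared model width
text_embed  = nn.Embedding(32_000, D)       # text vocabulary -> D
patch_embed = nn.Linear(3 * 16 * 16, D)     # flattened 16x16 RGB patches -> D

text_ids = torch.randint(0, 32_000, (1, 32))        # 32 text tokens
patches  = torch.randn(1, 64, 3 * 16 * 16)          # 64 image patches

# Early fusion: one interleaved token sequence for both modalities.
tokens = torch.cat([patch_embed(patches), text_embed(text_ids)], dim=1)

backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True),
    num_layers=2,
)
out = backbone(tokens)                      # (1, 96, D): both modalities, one stack
```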
201 Languages. Expanded from Qwen3’s 119 to 201 languages and dialects. This is the broadest language coverage of any open model I’m aware of. Note that Qwen tends to be very generous with its definition of “language support,” and quality is not guaranteed for low-resource languages.
Attention & Sparsity
Efficient attention mechanisms and increased sparsity are two common threads across recent releases. DeepSeek pioneered much of this direction, but every major Chinese lab now has its own take on how to handle attention.
Qwen3.5, inheriting the layout from Qwen3-Next, uses the 3:1 hybrid attention design in which most layers use Gated DeltaNet (linear attention) and the remaining layers use Gated Attention (full/softmax-style attention). Kimi K2.5 and GLM-5 both use Multi-head Latent Attention (MLA), but GLM-5 also integrates DeepSeek Sparse Attention (DSA) to induce token-level sparsity on top of it. Finally, MiniMax M2.5 is the only model sticking with plain full attention (see this article), using Multi-Head Attention (MHA) for reliability reasons.
The active parameter count is also worth zooming in on. At 17B active out of 397B total, Qwen3.5 runs a much leaner activation ratio than Qwen3-235B-A22B, but one in line with other recent releases.
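Back-of-the-envelope, using the parameter counts quoted in this post (the MiniMax M2.5 total is an assumption inferred from the activation-ratio claim, not an official figure):

```python
# Activation ratios from the figures quoted in this post (parameters in billions).
# The MiniMax M2.5 total is inferred from the "same activation ratio as Qwen3.5"
# claim below, not an official number.
models = {
    "Qwen3-235B-A22B":   (235, 22),
    "Qwen3.5-397B-A17B": (397, 17),
    "MiniMax M2.5":      (230, 10),   # total is an assumption
}
for name, (total, active) in models.items():
    print(f"{name:<20} {active}B / {total}B active ≈ {active / total:.1%}")
# Qwen3-235B-A22B      22B / 235B active ≈ 9.4%
# Qwen3.5-397B-A17B    17B / 397B active ≈ 4.3%
# MiniMax M2.5         10B / 230B active ≈ 4.3%
```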
Notably, MiniMax M2.5 has the smallest active parameter count at 10B and the same activation ratio as Qwen3.5. Kimi K2.5 is even sparser but also significantly bigger with 1T total parameters.
Benchmarks
Reasoning and Math. Qwen3.5 scores 91.3 on AIME 2026 and 94.8 on HMMT Feb 25, which is competitive but below the best-performing models (GPT-5.2 hits 96.7 on AIME 2026, Claude 93.3). Math capabilities are solid but far from dominant.
Knowledge and Instruction Following. On IFBench, it scores 76.5, beating every model in the comparison, including GPT-5.2 (75.4), and blowing past Claude (58.0). MultiChallenge tells the same story: 67.6 vs. GPT-5.2’s 57.9 and Claude’s 54.2. The model seems exceptionally good at following complex instructions, though that remains to be confirmed with real-world testing.
Agents. The agentic benchmarks paint an interesting picture. Qwen3.5 scores 86.7 on Tau2-Bench, second only to Claude (91.6). On MCPMark, it hits 46.1 vs. GPT-5.2’s 57.5 and Claude’s 42.3. On BrowseComp, the team reports two numbers depending on the agent strategy: 69.0 with simple context-folding, and 78.6 using the same discard-all strategy as DeepSeek-V3.2 and K2.5. The split is worth noting because it highlights how much agentic benchmark scores depend on scaffolding choices, not just raw model capability.
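To make the scaffolding point concrete, here is a toy sketch of how different context-management policies change what the model actually sees between steps. It is a generic illustration only, not the harness behind either BrowseComp number; the message format and policy names are hypothetical.

```python
# Toy illustration of context-management policies in an agent loop.
# Hypothetical message format and policy names; not the actual scaffold
# behind either BrowseComp score.
def manage_context(history: list[dict], new_obs: str, policy: str) -> list[dict]:
    if policy == "keep_all":          # context grows with every raw tool output
        kept = list(history)
    elif policy == "fold":            # crude stand-in for folding: shrink old observations
        kept = [
            {**m, "content": m["content"][:200] + "…"} if m["role"] == "observation" else m
            for m in history
        ]
    elif policy == "discard_all":     # drop earlier raw observations entirely
        kept = [m for m in history if m["role"] != "observation"]
    else:
        raise ValueError(f"unknown policy: {policy}")
    return kept + [{"role": "observation", "content": new_obs}]
```

Same model, different prompt contents at every step, which is exactly why a single benchmark number can move by almost ten points.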
Coding. Qwen3.5 scores 76.4 on SWE-bench Verified, essentially level with K2.5 (76.8) and Gemini 3 Pro (76.2), but behind GPT-5.2 (80.0) and Claude (80.9). On SWE-bench Multilingual, it does better at 72.0, matching GPT-5.2. SecCodeBench is a strong suit: 68.3, tied with GPT-5.2 (68.7) and Claude (68.6).
Vision. As a natively multimodal model, Qwen3.5 excels here. It scores 85.0 on MMMU (up from Qwen3-VL’s 80.6!), 88.6 on MathVision (ahead of Gemini 3 Pro’s 86.6), and 90.8 on OmniDocBench. The visual agent results are solid too: 62.2 on OSWorld-Verified and 66.8 on AndroidWorld. The ZEROBench result of 12 (vs. 10 for Gemini and 9 for GPT-5.2) is notable given the extreme difficulty of this benchmark.
Qwen3.5 is not the best in any single category, but it's remarkably well-rounded and leads on instruction following. It also significantly outperforms its own Qwen3-Max-Thinking across the board despite being much smaller (397B vs. 1T+).
The Bigger Picture
The attention mechanism is the new battleground. A year ago, the question was “MoE or dense?” That’s settled (and we might thank Llama 4 for this). Now the divergence is in how you handle attention: hybrid linear-full (Qwen3.5), MLA’s latent compression (K2.5, GLM-5), token-level sparse selection on top of MLA (GLM-5), and plain full attention (MiniMax). DeepSeek’s fingerprints are everywhere (MLA in K2.5 and GLM-5, DSA in GLM-5), but the Gated DeltaNet hybrid in Qwen3.5 (first introduced with Qwen3-Next) offers a new direction.
The benchmark landscape has shifted to match agentic workloads. All four releases target agentic tasks. The models are evaluated on SWE-bench, BrowseComp, HLE with tools, Tau2-Bench, and MCPMark. The era of chatbot benchmarks as the primary evaluation axis is over. Qwen3.5’s BrowseComp split (69.0 vs. 78.6 depending on strategy) is a reminder that agentic scores are increasingly a function of scaffolding and context management, not just raw intelligence.
What’s Next
The fact that Qwen3.5 ships only the 397B-A17B size on day one (“more sizes are coming”) suggests we’ll see a family rollout similar to Qwen3. It’ll be interesting to see whether the smaller variants also adopt the hybrid DeltaNet architecture. In hindsight, Qwen3-Next (back in September) previewed this direction; Qwen3.5 is the production-scale validation.
Quick links:
- Model weights: https://huggingface.co/Qwen/Qwen3.5-397B-A17B
- GitHub: https://github.com/QwenLM/Qwen3.5
- Blog: https://qwen.ai/blog?id=qwen3.5