Update README.md (#1), opened by bzheng
---
library_name: transformers
license: apache-2.0
license_link: https://huggingface.co/Qwen/Qwen3.5-9B-Base/blob/main/LICENSE
pipeline_tag: image-text-to-text
---

# Qwen3.5-9B-Base

<img width="400px" src="https://qianwen-res.oss-accelerate.aliyuncs.com/logo_qwen3.5.png">

[Qwen Chat](https://chat.qwen.ai)

> [!NOTE]
> This repository contains the model weights and configuration files for the pre-trained-only (base) model in the Hugging Face Transformers format.
>
> These artifacts are compatible with Hugging Face Transformers, vLLM, SGLang, etc.
>
> The intended use cases are fine-tuning, in-context learning experiments, and other research or development purposes, not direct interaction.
> However, the control tokens, e.g., `<|im_start|>` and `<|im_end|>`, were trained to allow efficient LoRA-style PEFT with the official chat template, mitigating the need to fine-tune the embeddings, a significant optimization given Qwen3.5's larger vocabulary.

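As a concrete illustration of the control tokens mentioned in the note, the sketch below expands the ChatML-style layout those tokens come from. The role names (`system`, `user`, `assistant`) and the exact layout are assumptions based on the standard ChatML convention, not verified against this checkpoint's template; in practice, prefer `tokenizer.apply_chat_template` from Transformers over hand-rolling the prompt.

```python
# Minimal sketch of a ChatML-style prompt built from the control tokens
# mentioned above (<|im_start|>, <|im_end|>). Illustrative only; use
# tokenizer.apply_chat_template() for the official template.
def format_chatml(messages):
    """Render a list of {role, content} dicts into a ChatML prompt string."""
    parts = []
    for m in messages:
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    # Leave the assistant turn open so the model continues from here.
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)

prompt = format_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
])
print(prompt)
```

Because the control tokens were trained during pre-training, a LoRA adapter on the attention and FFN projections can reuse this template without adding the embedding matrices to the trainable parameters.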
Over recent months, we have intensified our focus on developing foundation models that deliver exceptional utility and performance. Qwen3.5 represents a significant leap forward, integrating breakthroughs in multimodal learning, architectural efficiency, reinforcement learning scale, and global accessibility to empower developers and enterprises with unprecedented capability and efficiency.

## Qwen3.5 Highlights

Qwen3.5 features the following enhancements:

- **Unified Vision-Language Foundation**: Early fusion training on multimodal tokens achieves cross-generational parity with Qwen3 and outperforms Qwen3-VL models across reasoning, coding, agents, and visual understanding benchmarks.

- **Efficient Hybrid Architecture**: Gated Delta Networks combined with sparse Mixture-of-Experts deliver high-throughput inference with minimal latency and cost overhead.

- **Scalable RL Generalization**: Reinforcement learning scaled across million-agent environments with progressively complex task distributions for robust real-world adaptability.

- **Global Linguistic Coverage**: Expanded support to 201 languages and dialects, enabling inclusive, worldwide deployment with nuanced cultural and regional understanding.

- **Next-Generation Training Infrastructure**: Near-100% multimodal training efficiency compared to text-only training, plus asynchronous RL frameworks supporting massive-scale agent scaffolds and environment orchestration.

For more details, please refer to our blog post [Qwen3.5](https://qwen.ai/blog?id=qwen3.5).

## Model Overview

- Type: Causal Language Model with Vision Encoder
- Training Stage: Pre-training & Post-training
- Language Model:
  - Number of Parameters: 9B
  - Hidden Dimension: 4096
  - Token Embedding: 248,320 (padded)
  - Number of Layers: 32
  - Hidden Layout: 8 × (3 × (Gated DeltaNet → FFN) → 1 × (Gated Attention → FFN))
  - Gated DeltaNet:
    - Number of Linear Attention Heads: 32 for V and 16 for QK
    - Head Dimension: 128
  - Gated Attention:
    - Number of Attention Heads: 16 for Q and 4 for KV
    - Head Dimension: 256
    - Rotary Position Embedding Dimension: 64
  - Feed-Forward Network:
    - Intermediate Dimension: 12288
  - LM Output: 248,320 (padded)
  - MTP (Multi-Token Prediction): trained with multiple steps
- Context Length: 262,144 tokens natively, extensible up to 1,010,000 tokens

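The hidden layout above can be sanity-checked with a few lines of arithmetic: 8 blocks of 3 Gated DeltaNet layers followed by 1 Gated Attention layer give the stated 32 layers. The sketch below just expands that pattern; the layer-type labels are illustrative, not configuration keys from the checkpoint.

```python
# Expand the 8 × (3 × Gated DeltaNet → 1 × Gated Attention) pattern and
# confirm it matches the stated totals (every layer is paired with an FFN).
layout = []
for _ in range(8):
    layout += ["gated_deltanet"] * 3 + ["gated_attention"]

assert len(layout) == 32                     # matches "Number of Layers: 32"
assert layout.count("gated_deltanet") == 24  # 3 of every 4 layers use linear attention
assert layout.count("gated_attention") == 8  # 1 of every 4 layers uses full attention
```

Keeping full attention in only one of every four layers is what makes the hybrid design cheap at long context: the linear-attention (DeltaNet) layers carry constant-size state instead of a growing KV cache.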
### Citation

If you find our work helpful, feel free to cite us.

```bibtex
@misc{qwen3.5,
    title  = {{Qwen3.5}: Towards Native Multimodal Agents},
    author = {{Qwen Team}},
    month  = {February},
    year   = {2026},
    url    = {https://qwen.ai/blog?id=qwen3.5}
}
```