Update README.md

README.md CHANGED
@@ -1,6 +1,6 @@
 ---
 license: other
-license_name:
+license_name: qwen
 license_link: https://huggingface.co/Qwen/Qwen2.5-72B-Instruct/blob/main/LICENSE
 language:
 - en
@@ -61,7 +61,7 @@ tokenizer = AutoTokenizer.from_pretrained(model_name)
 
 prompt = "Give me a short introduction to large language model."
 messages = [
-    {"role": "system", "content": "You are a helpful assistant."},
+    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
     {"role": "user", "content": prompt}
 ]
 text = tokenizer.apply_chat_template(
@@ -84,11 +84,25 @@ response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
 
 ### Processing Long Texts
 
-
-
-
+The current `config.json` is set for context length up to 32,768 tokens.
+To handle extensive inputs exceeding 32,768 tokens, we utilize [YaRN](https://arxiv.org/abs/2309.00071), a technique for enhancing model length extrapolation, ensuring optimal performance on lengthy texts.
+
+For supported frameworks, you can add the following to `config.json` to enable YaRN:
+```json
+{
+  ...,
+  "rope_scaling": {
+    "factor": 4.0,
+    "original_max_position_embeddings": 32768,
+    "type": "yarn"
+  }
+}
+```
 
-
+For deployment, we recommend using vLLM.
+Please refer to our [Documentation](https://qwen.readthedocs.io/en/latest/deployment/vllm.html) for usage if you are not familiar with vLLM.
+Presently, vLLM only supports static YaRN, which means the scaling factor remains constant regardless of input length, **potentially impacting performance on shorter texts**.
+We advise adding the `rope_scaling` configuration only when processing long contexts is required.
 
 ## Evaluation & Performance
 
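The second hunk only swaps the default system message in the README's quickstart snippet. For context, here is a minimal sketch of the full flow that snippet belongs to, with the new prompt in place; the model-loading line and the `max_new_tokens` value are assumptions based on the usual `transformers` quickstart rather than part of this diff:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-72B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Loading kwargs are assumptions, not shown in this diff
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

prompt = "Give me a short introduction to large language model."
messages = [
    # New default system prompt introduced by this commit
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": prompt},
]
# Render the chat into a single string using the model's chat template
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(**model_inputs, max_new_tokens=512)  # 512 is illustrative
# Strip the prompt tokens so only the completion is decoded
generated_ids = [out[len(inp):] for inp, out in zip(model_inputs.input_ids, generated_ids)]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)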
```
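The YaRN hunk edits `config.json` on disk. The same `rope_scaling` block can also be applied at load time through a config override, which keeps the shipped file untouched; this is a sketch of one way to do it with `transformers` (the `AutoConfig` override route is our suggestion, not something the diff prescribes):

```python
from transformers import AutoConfig, AutoModelForCausalLM

model_name = "Qwen/Qwen2.5-72B-Instruct"

# Load the shipped config and graft in the YaRN settings from the diff above
config = AutoConfig.from_pretrained(model_name)
config.rope_scaling = {
    "factor": 4.0,                              # 32768 * 4.0 = 131072-token window
    "original_max_position_embeddings": 32768,
    "type": "yarn",
}

# Enable this only when long inputs are expected; static scaling is always on
# once configured and can degrade quality on short texts, as the README warns.
model = AutoModelForCausalLM.from_pretrained(
    model_name, config=config, torch_dtype="auto", device_map="auto"
)
```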
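For the vLLM deployment path mentioned in the last added lines, a minimal offline-inference sketch. Passing `rope_scaling` directly to the `LLM` constructor is an assumption about the installed vLLM version; if it is rejected, fall back to editing `config.json` as shown in the diff, and consult the linked documentation:

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_name = "Qwen/Qwen2.5-72B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)

messages = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": "Summarize this very long document: ..."},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Mirrors the config.json block above. vLLM applies static YaRN: the factor
# holds for every request, so enable it only for long-context workloads.
# Whether the constructor accepts this dict depends on the vLLM version (assumption).
llm = LLM(
    model=model_name,
    rope_scaling={"factor": 4.0, "original_max_position_embeddings": 32768, "type": "yarn"},
)
outputs = llm.generate([text], SamplingParams(temperature=0.7, max_tokens=512))
print(outputs[0].outputs[0].text)
```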