## 1. Model Introduction
Kimi K2 Thinking is the latest and most capable open-source thinking model in the Kimi K2 series. Built on Kimi K2, it is a thinking agent that reasons step-by-step while dynamically invoking tools. It sets a new state of the art on Humanity's Last Exam (HLE), BrowseComp, and other benchmarks by dramatically scaling multi-step reasoning depth and maintaining stable tool use across 200–300 sequential calls. At the same time, K2 Thinking is a natively INT4-quantized model with a 256k context window, achieving lossless reductions in inference latency and GPU memory usage.
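As a rough, back-of-the-envelope illustration of what INT4 weights buy over BF16 (the parameter count below is a placeholder for illustration, not an official figure):

```python
def weight_memory_gib(n_params: float, bits_per_param: int) -> float:
    """Approximate weight storage in GiB at a given numeric precision."""
    return n_params * bits_per_param / 8 / 2**30

# Placeholder parameter count, for illustration only.
N = 1e12
print(f"BF16: {weight_memory_gib(N, 16):.0f} GiB")
print(f"INT4: {weight_memory_gib(N, 4):.0f} GiB")  # 4x smaller than BF16
```

Activation memory and the KV cache are not included; the point is only that INT4 cuts weight storage to a quarter of BF16.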
### Key Features
- **Deep Thinking & Tool Orchestration**: End-to-end trained to interleave chain-of-thought reasoning with function calls, enabling autonomous research, coding, and writing workflows that last hundreds of steps without drift.
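The interleaved think → call → observe loop described above can be sketched as follows; `model` and the tool registry here are hypothetical stand-ins, not the actual K2 Thinking API:

```python
from typing import Callable

def agent_loop(model: Callable[[list], dict], tools: dict[str, Callable],
               task: str, max_steps: int = 300) -> str:
    """Interleave reasoning with tool calls until the model stops calling tools."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        out = model(messages)  # hypothetical: returns reasoning + optional tool call
        messages.append({"role": "assistant", "content": out["reasoning"]})
        call = out.get("tool_call")
        if call is None:  # no tool requested: the reasoning is the final answer
            return out["reasoning"]
        result = tools[call["name"]](**call["args"])  # execute the requested tool
        messages.append({"role": "tool", "content": str(result)})
    return messages[-1]["content"]  # step budget exhausted
```

Real deployments layer schema validation, retries, and context management on top of a loop like this.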
<details>
<summary><b>Footnotes</b></summary>
1. To ensure a fast, lightweight experience, we selectively employ a subset of tools and reduce the number of tool-call steps in chat mode on kimi.com. As a result, chatting on kimi.com may not reproduce our benchmark scores. Our agentic mode will be updated soon to reflect the full capabilities of K2 Thinking.
2. **Testing Details**:
&nbsp;&nbsp;2.1. All benchmarks were evaluated at temperature = 1.0 and 256k context length for K2 Thinking, except for SciCode, for which we followed the official temperature setting of 0.0.
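For concreteness, the sampling settings in 2.1 can be written as small config dicts (hypothetical names, not the official evaluation harness):

```python
# Hypothetical config objects mirroring footnote 2.1; not the official harness.
K2_THINKING_EVAL = {"temperature": 1.0, "context_length": 256_000}
SCICODE_EVAL = {**K2_THINKING_EVAL, "temperature": 0.0}  # official SciCode setting

print(K2_THINKING_EVAL)
print(SCICODE_EVAL)
```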
&nbsp;&nbsp;4.1. K2 Thinking was equipped with search, code-interpreter, and web-browsing tools.
&nbsp;&nbsp;4.2. BrowseComp-ZH, Seal-0, and FinSearchComp-T3 were run 4 times independently and the average is reported (avg@4).
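The avg@4 protocol in 4.2 is simply the mean over four independent runs; a minimal sketch:

```python
def avg_at_k(run_scores: list[float]) -> float:
    """avg@k: average score over k independent runs of the same benchmark."""
    return sum(run_scores) / len(run_scores)

# Hypothetical per-run scores, for illustration only.
print(avg_at_k([60.0, 62.0, 58.0, 61.0]))  # prints 60.25
```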
&nbsp;&nbsp;4.3. The evaluation used o3-mini as judge, configured identically to the official HLE setting; judge prompts were taken verbatim from the official repository.
&nbsp;&nbsp;4.4. On HLE, the maximum step limit was 120, with a 48k-token reasoning budget per step; on agentic-search tasks, the limit was 300 steps with a 24k-token reasoning budget per step.
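The caps in 4.4 amount to an outer step limit plus a per-step token budget; a hedged sketch with a stand-in `generate_step` function:

```python
def run_with_budget(generate_step, max_steps: int, step_token_budget: int) -> list:
    """Run up to max_steps, hard-capping each step's reasoning tokens."""
    transcript = []
    for _ in range(max_steps):
        tokens, done = generate_step(budget=step_token_budget)
        transcript.append(tokens[:step_token_budget])  # enforce per-step budget
        if done:
            break
    return transcript

# HLE-style settings would be max_steps=120, step_token_budget=48_000;
# agentic search would use max_steps=300, step_token_budget=24_000.
```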
&nbsp;&nbsp;4.5. When tool execution results cause the accumulated input to exceed the model's context limit (256k), we employ a simple context-management strategy that hides all previous tool outputs.
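A minimal sketch of that hide-old-tool-outputs strategy (character counts stand in for real token counts):

```python
PLACEHOLDER = "[tool output hidden]"

def manage_context(messages: list[dict], limit: int = 256_000) -> list[dict]:
    """If the history exceeds `limit`, hide all previous tool outputs."""
    total = sum(len(m["content"]) for m in messages)
    if total <= limit:
        return messages
    hidden = [{**m, "content": PLACEHOLDER} if m["role"] == "tool" else m
              for m in messages[:-1]]
    return hidden + [messages[-1]]  # the most recent message stays visible
```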
&nbsp;&nbsp;4.6. Web access to Hugging Face may lead to data leakage on certain benchmarks, such as HLE. K2 Thinking can achieve a score of 51.3 on HLE without blocking Hugging Face; to ensure a fair and rigorous comparison, we blocked access to Hugging Face during testing.
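The Hugging Face block in 4.6 can be approximated with a simple domain blocklist in the browsing tool (a hypothetical helper, not the actual evaluation harness):

```python
from urllib.parse import urlparse

BLOCKED_DOMAINS = {"huggingface.co"}

def is_blocked(url: str) -> bool:
    """True if the URL's host is a blocked domain or one of its subdomains."""
    host = urlparse(url).hostname or ""
    return host in BLOCKED_DOMAINS or any(
        host.endswith("." + d) for d in BLOCKED_DOMAINS)
```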