## 1. Model Introduction
Kimi K2 Thinking is the latest and most capable open-source thinking model in the Kimi K2 series. Built on Kimi K2, it is a thinking agent that reasons step-by-step while dynamically invoking tools. It sets a new state of the art on Humanity's Last Exam (HLE), BrowseComp, and other benchmarks by dramatically scaling multi-step reasoning depth and maintaining stable tool use across 200–300 sequential calls. At the same time, K2 Thinking is a natively INT4-quantized model with a 256k context window, achieving lossless reductions in inference latency and GPU memory usage.
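As a rough, back-of-the-envelope illustration of what INT4 weights buy over BF16 (the parameter count below is a placeholder for illustration, not an official figure):

```python
def weight_memory_gib(n_params: float, bits_per_param: int) -> float:
    """Approximate weight storage in GiB at a given numeric precision."""
    return n_params * bits_per_param / 8 / 2**30

# Placeholder parameter count, for illustration only.
N = 1e12
print(f"BF16: {weight_memory_gib(N, 16):.0f} GiB")
print(f"INT4: {weight_memory_gib(N, 4):.0f} GiB")  # 4x smaller than BF16
```

Activation memory and the KV cache are not included; the point is only that INT4 cuts weight storage to a quarter of BF16.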
### Key Features
- **Deep Thinking & Tool Orchestration**: End-to-end trained to interleave chain-of-thought reasoning with function calls, enabling autonomous research, coding, and writing workflows that last hundreds of steps without drift.
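The interleaved think → call → observe loop described above can be sketched as follows; `model` and the tool registry here are hypothetical stand-ins, not the actual K2 Thinking API:

```python
from typing import Callable

def agent_loop(model: Callable[[list], dict], tools: dict[str, Callable],
               task: str, max_steps: int = 300) -> str:
    """Interleave reasoning with tool calls until the model stops calling tools."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        out = model(messages)  # hypothetical: returns reasoning + optional tool call
        messages.append({"role": "assistant", "content": out["reasoning"]})
        call = out.get("tool_call")
        if call is None:  # no tool requested: the reasoning is the final answer
            return out["reasoning"]
        result = tools[call["name"]](**call["args"])  # execute the requested tool
        messages.append({"role": "tool", "content": str(result)})
    return messages[-1]["content"]  # step budget exhausted
```

Real deployments layer schema validation, retries, and context management on top of a loop like this.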
<details>
<summary><b>Footnotes</b></summary>
1. To ensure a fast, lightweight experience, we selectively employ a subset of tools and reduce the number of tool-call steps in chat mode on kimi.com. As a result, chatting on kimi.com may not reproduce our benchmark scores. Our agentic mode will be updated soon to reflect the full capabilities of K2 Thinking.
2. **Testing Details**:
&nbsp;&nbsp;2.1. All benchmarks were evaluated at temperature = 1.0 and 256k context length for K2 Thinking, except for SciCode, for which we followed the official temperature setting of 0.0.
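For concreteness, the sampling settings in 2.1 can be written as small config dicts (hypothetical names, not the official evaluation harness):

```python
# Hypothetical config objects mirroring footnote 2.1; not the official harness.
K2_THINKING_EVAL = {"temperature": 1.0, "context_length": 256_000}
SCICODE_EVAL = {**K2_THINKING_EVAL, "temperature": 0.0}  # official SciCode setting

print(K2_THINKING_EVAL)
print(SCICODE_EVAL)
```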
&nbsp;&nbsp;4.1. K2 Thinking was equipped with search, code-interpreter, and web-browsing tools.
&nbsp;&nbsp;4.2. BrowseComp-ZH, Seal-0, and FinSearchComp-T3 were run 4 times independently and the average is reported (avg@4).
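The avg@4 protocol in 4.2 is simply the mean over four independent runs; a minimal sketch:

```python
def avg_at_k(run_scores: list[float]) -> float:
    """avg@k: average score over k independent runs of the same benchmark."""
    return sum(run_scores) / len(run_scores)

# Hypothetical per-run scores, for illustration only.
print(avg_at_k([60.0, 62.0, 58.0, 61.0]))  # prints 60.25
```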
&nbsp;&nbsp;4.3. The evaluation used o3-mini as judge, configured identically to the official HLE setting; judge prompts were taken verbatim from the official repository.
&nbsp;&nbsp;4.4. On HLE, the maximum step limit was 120, with a 48k-token reasoning budget per step; on agentic-search tasks, the limit was 300 steps with a 24k-token reasoning budget per step.
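The caps in 4.4 amount to an outer step limit plus a per-step token budget; a hedged sketch with a stand-in `generate_step` function:

```python
def run_with_budget(generate_step, max_steps: int, step_token_budget: int) -> list:
    """Run up to max_steps, hard-capping each step's reasoning tokens."""
    transcript = []
    for _ in range(max_steps):
        tokens, done = generate_step(budget=step_token_budget)
        transcript.append(tokens[:step_token_budget])  # enforce per-step budget
        if done:
            break
    return transcript

# HLE-style settings would be max_steps=120, step_token_budget=48_000;
# agentic search would use max_steps=300, step_token_budget=24_000.
```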
&nbsp;&nbsp;4.5. When tool execution results cause the accumulated input to exceed the model's context limit (256k), we employ a simple context-management strategy that hides all previous tool outputs.
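A minimal sketch of that hide-old-tool-outputs strategy (character counts stand in for real token counts):

```python
PLACEHOLDER = "[tool output hidden]"

def manage_context(messages: list[dict], limit: int = 256_000) -> list[dict]:
    """If the history exceeds `limit`, hide all previous tool outputs."""
    total = sum(len(m["content"]) for m in messages)
    if total <= limit:
        return messages
    hidden = [{**m, "content": PLACEHOLDER} if m["role"] == "tool" else m
              for m in messages[:-1]]
    return hidden + [messages[-1]]  # the most recent message stays visible
```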
&nbsp;&nbsp;4.6. Web access to Hugging Face may lead to data leakage on certain benchmarks, such as HLE. K2 Thinking can achieve a score of 51.3 on HLE without blocking Hugging Face; to ensure a fair and rigorous comparison, we blocked access to Hugging Face during testing.
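The Hugging Face block in 4.6 can be approximated with a simple domain blocklist in the browsing tool (a hypothetical helper, not the actual evaluation harness):

```python
from urllib.parse import urlparse

BLOCKED_DOMAINS = {"huggingface.co"}

def is_blocked(url: str) -> bool:
    """True if the URL's host is a blocked domain or one of its subdomains."""
    host = urlparse(url).hostname or ""
    return host in BLOCKED_DOMAINS or any(
        host.endswith("." + d) for d in BLOCKED_DOMAINS)
```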