Add metadata (pipeline tag, library, license) and project page link
#1, opened by nielsr (HF Staff)

README.md CHANGED
---
pipeline_tag: text-generation
library_name: transformers
license: mit
---
# Native Parallel Reasoner (NPR) — Model Card

**Model name:** `NPR-4B non-thinking` (Native Parallel Reasoner)

**Code / Repo:** [https://github.com/bigai-nlco/Native-Parallel-Reasoner](https://github.com/bigai-nlco/Native-Parallel-Reasoner)

**Project Page:** [https://bigai-nlco.github.io/Native-Parallel-Reasoner/](https://bigai-nlco.github.io/Native-Parallel-Reasoner/)

**Hub page:** [https://huggingface.co/bigai-NPR](https://huggingface.co/bigai-NPR)

---

NPR is a teacher-free framework that enables a language model to learn *native parallel reasoning*.

**Key components**

* **Three-stage training curriculum:** (1) format discovery via RL (NPR-ZERO), (2) supervised parallel warmup on self-distilled trajectories (NPR-BETA), (3) native-parallel RL (PAPO) that directly optimizes branching policies.
* **PAPO (Parallel-Aware Policy Optimization):** an RL objective with practical modifications (batch-level advantage normalization, preserved gradients on special tokens, strictly on-policy optimization) tailored to parallel decoding and stable training.
* **NPR-Engine:** engineering fixes (budget-aware KV-cache reclamation, branch-aware token accounting, pre-branch validators, mild repetition penalties, etc.) that address memory, determinism, and correctness issues in parallel rollouts.
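
The batch-level advantage normalization mentioned above can be sketched in a few lines. This is an illustrative NumPy sketch under our own assumptions, not the authors' PAPO implementation: advantages are standardized over the whole batch of rollouts rather than within each prompt group.

```python
import numpy as np

def batch_normalized_advantages(rewards):
    """Standardize rewards into advantages across the entire batch,
    rather than per prompt-group (illustrative sketch of batch-level
    advantage normalization; not the reference PAPO code)."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)

adv = batch_normalized_advantages([1.0, 0.0, 1.0, 1.0])
```

Normalizing over the batch (instead of per group) keeps a single shared reward scale across all parallel rollouts, which is one of the stability tweaks the paper describes.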

---

# Intended uses

* Research on improving the reasoning capabilities of LLMs via parallel decoding and RL.
* Benchmarks and experiments in symbolic/math/programming reasoning where outputs are verifiable and can be used as reward signals.
* Building systems that need diverse candidate solutions quickly (e.g., best-of-k verification pipelines over several parallel branches).
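
The best-of-k use case above can be sketched as a minimal selection loop. The `verifier` callable is a hypothetical stand-in for a task-specific checker; real pipelines would score NPR's parallel branches.

```python
def best_of_k(candidates, verifier):
    """Return the first candidate accepted by a task-specific verifier,
    falling back to the first candidate if none pass (illustrative)."""
    for cand in candidates:
        if verifier(cand):
            return cand
    return candidates[0]

# Toy usage: candidate answers from k parallel branches, checked
# against a known ground truth.
answer = best_of_k(["3", "2", "5"], lambda a: a == "2")
```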

---

# Out-of-scope / Not recommended uses

* Generating unverified factual claims for high-stakes decision making without additional verification; NPR focuses on verifiable reasoning tasks and does not guarantee correctness on open-ended generative tasks.
* Use without appropriate safety/verification layers in domains requiring legal, medical, or regulatory compliance.
* Relying on NPR to produce human-level judgment where subjective evaluation or human values are needed.

---

# Training data & setup (summary)

* **Base models:** Qwen3-4B (and Qwen3-4B-Instruct variants) used as backbones for experiments.
* **Data source:** experiments build on the ORZ dataset (57k problem–answer pairs); the pipeline uses a fixed subset of 8k examples across the three training stages (Stage 1 → Stage 2 → Stage 3). Self-distilled trajectories are filtered by outcome correctness and format compliance to produce the distilled training corpus.
* **Optimization / hyperparameters (high-level):**
  * Stage 1 (DAPO / format RL): large generation budget (max length up to 30,000 tokens during training).
  * Stage 2 (parallel SFT warmup): LR starts at ≈1e-6 and decays to 5e-7; weight decay 0.1.
  * Stage 3 (PAPO + NPR-Engine): LR ≈1e-7; PAPO uses batch-level advantage normalization and strict on-policy updates.
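
The trajectory filtering described above can be sketched as a simple two-predicate filter. This is a hedged sketch: the predicate names `is_correct` and `is_well_formatted` are hypothetical stand-ins for the outcome verifier and format-compliance check.

```python
def filter_trajectories(trajectories, is_correct, is_well_formatted):
    """Keep only self-distilled trajectories that pass both the outcome
    verifier and the format-compliance check (illustrative sketch)."""
    return [t for t in trajectories if is_correct(t) and is_well_formatted(t)]

# Toy usage: trajectories as (text, correct?, formatted?) records.
trajs = [("a", True, True), ("b", True, False), ("c", False, True)]
kept = filter_trajectories(trajs, lambda t: t[1], lambda t: t[2])
```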

---

**Selected results (reported in paper)**

* NPR trained on Qwen3-4B achieves **performance gains of up to ~24.5%** over baselines on aggregate metrics and **inference speedups of up to 4.6×** compared with autoregressive decoding.
* Example numbers: NPR-4B (finetuned from Qwen3-4B-Instruct) reports **AIME25: 50.4%** and **AIME24: 63.3%**, outperforming the Multiverse baselines (Multiverse-4B and Multiverse-32B) by noticeable margins on many benchmarks.
* **Genuine parallelism:** NPR exhibits near **100% genuine parallel execution** on evaluated tasks (no hidden autoregressive fallback observed), in contrast to the >30% AR fallback found in some prior baselines.

> See the paper for full tables (per-benchmark avg@k, best@k, and ablation studies).

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

tokenizer = AutoTokenizer.from_pretrained("bigai-NPR")
model = AutoModelForCausalLM.from_pretrained("bigai-NPR")

gen = pipeline("text-generation", model=model, tokenizer=tokenizer, device=0)  # adjust device

prompt = """You must write your answer strictly following the XML-like format defined below. Failure to comply with this format will result in an invalid response.

**Definitions and Rules:**

* `<guideline>`: A container for one or more `<plan>` tags. It sets the objective for the current stage of reasoning.
* `<plan>i:</plan>`: A single, specific, and actionable task or hypothesis to be executed. Multiple plans within a guideline represent parallel exploration.
* `<step>i:</step>`: The detailed execution of the corresponding `<plan>i`. The number of `<step>` tags must exactly match the number of `<plan>` tags in the preceding `<guideline>`. **Crucially, the content of this step must be generated *as if* you have no knowledge of the content of its sibling steps.**
* `<takeaway>`: Use the `<takeaway>` tag to analyze steps and generate a *concise* summary. Compare the outcomes of the different steps, identify the most promising path, or consolidate the findings. The takeaway determines the next action: either proceeding to the next `<guideline>` for deeper analysis or moving to the final answer. **Only analyze the executed steps, NO additional computation or reasoning is allowed here.**
* After analysis, add the final, user-facing conclusion that summarizes the entire logical journey from all preceding steps and takeaways into a clear, final response for the user. For questions with a definitive, short answer, you must include `\\boxed{...}` containing only the final result.

**Strict Requirements:**

1. **Execute Independently:** For each `<plan>`, generate a corresponding `<step>`.
   * Each of the plans and steps must be a *self-contained, complete strategy* for solving the task or subtask.
   * You must treat each `<step>` as an independent execution unit. The reasoning within `<step>i:` must only be based on `<plan>i:`, not on the content of any other `<step>`.
   * The number of `<step>` tags must always equal the number of `<plan>` tags in the directly preceding `<guideline>`.
   * Avoid words implying sequence or dependency (e.g. “then”, “after”, “next”).
2. **Explore in Parallel:** When a problem or previous analysis involves multiple hypotheses, alternative methods, or independent sub-tasks, your next `<guideline>` should contain multiple `<plan>` tags.
   * Each `<plan>` represents a parallel line of reasoning.
   * A `<guideline>` with a single `<plan>` is allowed if only one plan is needed.
   * Multiple alternative plans are recommended and will be rewarded.
3. **Meaningful content:** All tags must contain meaningful content. Do not add any text or explanation between the tags.
4. No other tags or text outside the defined structure is allowed. Directly generate output. Do not wrap it in triple backticks or any other code block formatting.

**Example Output Format:**

<guideline>
<plan>1: [A concise, one-sentence, independent high-level plan.]</plan>
...
</guideline>
<step>
1: [Detailed analysis trajectory of plan 1. Must be entirely self-contained.]
</step>
...
<takeaway>
[Compare the results from the steps above. Synthesize the findings and determine the next action.]
</takeaway>

<guideline>
<plan>1: [A one-sentence, high-level strategy]</plan>
<plan>2: [A one-sentence, high-level strategy]</plan>
...
</guideline>
<step>
1: [Detailed analysis trajectory of plan 1. Must be entirely self-contained.]
</step>
<step>
2: [Detailed analysis trajectory of plan 2. Must be entirely self-contained.]
</step>
...
<takeaway>
[Compare the results from the steps above. Synthesize the findings and determine the next action.]
</takeaway>

... [more guidelines, steps and takeaways]

[The final, summarized conclusion based on all takeaways. Include definitive answers in \\boxed{...} format.]

How many positive two-digit integers are factors of both 100 and 150?
"""

outputs = gen(prompt, max_new_tokens=256, num_return_sequences=8)
for i, out in enumerate(outputs):
    print(f"Candidate {i+1}:\n", out["generated_text"])
```
> Practical note: NPR is designed to run with a parallel decoding engine (the NPR-Engine) to realize genuine parallelism and speedups mentioned in the paper. Running naive autoregressive decoding over the same checkpoint will not reproduce the parallel inference acceleration. See repo for engine/run scripts.
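
To post-process generations that follow the XML-like format in the prompt above, a minimal regex-based parser like the following can help. This is a hedged sketch; the official repo may ship its own parsing utilities.

```python
import re

def parse_npr_output(text):
    """Extract plans, steps, and takeaways from NPR's XML-like output
    format (illustrative sketch, not the project's official parser)."""
    return {
        "plans": re.findall(r"<plan>(.*?)</plan>", text, re.DOTALL),
        "steps": re.findall(r"<step>(.*?)</step>", text, re.DOTALL),
        "takeaways": re.findall(r"<takeaway>(.*?)</takeaway>", text, re.DOTALL),
    }

demo = (
    "<guideline><plan>1: try factoring</plan><plan>2: list divisors</plan></guideline>"
    "<step>1: gcd(100, 150) = 50 ...</step><step>2: common divisors ...</step>"
    "<takeaway>both agree</takeaway>"
)
parsed = parse_npr_output(demo)
```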

# Limitations & risks

* **Task specialization:** NPR is trained and evaluated primarily on *verifiable reasoning tasks* (math, programming, factual verification). Its parallel-reasoning gains may not transfer to unconstrained or open-ended generation tasks.
* **Verification dependence:** the pipeline relies on verifiable outcomes (used for self-distillation and rewards); in domains lacking reliable verifiers, the approach is difficult to apply.
* **Compute & engineering complexity:** achieving the reported parallel-RL stability required substantial engine-level fixes (KV bookkeeping, token-budget accounting, format validators). Reproducing the results needs similar engineering effort and careful resource management.
* **Potential failure modes:** as with other learned planners/searchers, NPR can produce plausible but incorrect reasoning branches; downstream verification and human oversight are recommended for critical uses.

---

# Ethical considerations

* Avoid using NPR outputs as the sole authority in high-stakes domains (legal, medical, financial) without human verification.
* The self-distillation pipeline and large-scale RL can propagate dataset biases present in ORZ or other training subsets; evaluate fairness and bias for your target application.

---

# License & citation

* **License:** see the repository and the Hugging Face model page for the specific license attached to the code and model artifacts.
* **If you use NPR in research or products, please cite:**

  Wu, T., Liu, Y., Bai, J., Jia, Z., Zhang, S., Lin, Z., Wang, Y., Zhu, S.-C., & Zheng, Z. *Native Parallel Reasoner: Reasoning in Parallelism via Self-Distilled Reinforcement Learning* (2025).

---

# Where to find more

* Paper (arXiv & PDF): referenced above.
* Repository & instructions for reproducing NPR's training pipeline and the NPR-Engine: [https://github.com/bigai-nlco/Native-Parallel-Reasoner](https://github.com/bigai-nlco/Native-Parallel-Reasoner).

---

*Prepared from the NPR paper and repository materials. For full technical details, exact per-benchmark tables, ablations, and reproduction instructions, consult the original paper and the project repo.*