
Native Parallel Reasoner (NPR) — Model Card

Model name: NPR-4B non-thinking (Native Parallel Reasoner)

Paper: Native Parallel Reasoner: Reasoning in Parallelism via Self-Distilled Reinforcement Learning.

Code / Repo: https://github.com/bigai-nlco/Native-Parallel-Reasoner.

Hub page: https://huggingface.co/bigai-NPR.


Model overview

NPR is a teacher-free framework that enables a language model to learn native parallel reasoning (i.e., generate and evaluate multiple reasoning branches concurrently) through a three-stage, self-distilled training pipeline and a parallel-aware reinforcement learning algorithm (PAPO). The project also provides an engineered rollout backend (NPR-Engine) that makes large-scale parallel RL training stable and practical.

Key components

  • Three-stage training curriculum: (1) format-discovery via RL (NPR-ZERO), (2) supervised parallel warmup on self-distilled trajectories (NPR-BETA), (3) native-parallel RL (PAPO) to directly optimize branching policies.
  • PAPO (Parallel-Aware Policy Optimization): an RL objective with practical modifications (batch-level advantage normalization, preserved gradients on special tokens, strictly on-policy updates) tailored to parallel decoding and stable optimization; a minimal sketch of the normalization step follows this list.
  • NPR-Engine: engineering fixes (budget-aware KV reclamation, branch-aware token accounting, pre-branch validators, mild repetition penalties, etc.) that address memory, determinism and correctness issues in parallel rollouts.
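
The batch-level advantage normalization can be pictured roughly as follows (a minimal sketch, not the full PAPO objective; the tensor shapes and the simple REINFORCE-style loss are assumptions for illustration only):

import torch

def papo_style_loss(logprobs, rewards, response_mask, eps=1e-6):
    # logprobs:      [batch, seq] log-probabilities of the sampled tokens under
    #                the current policy (strictly on-policy, so no importance ratio)
    # rewards:       [batch] scalar outcome reward per rollout
    # response_mask: [batch, seq] 1 for generated tokens, including the special
    #                branching tokens whose gradients are kept
    # Batch-level advantage normalization: center and scale rewards over the
    # whole batch rather than per prompt group.
    adv = (rewards - rewards.mean()) / (rewards.std() + eps)
    # Broadcast the sequence-level advantage to every generated token.
    per_token_loss = -(adv.unsqueeze(1) * logprobs) * response_mask
    return per_token_loss.sum() / response_mask.sum().clamp(min=1)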

Intended uses

  • Research on improving reasoning capabilities of LLMs via parallel decoding and RL.
  • Benchmarks and experiments in symbolic/math/programming reasoning where outputs are verifiable and can be used as reward signals.
  • Building systems that need diverse candidate solutions quickly (e.g., best-of-k style verification pipelines over several parallel branches); a minimal sketch of such a pipeline follows this list.
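
A minimal sketch of a best-of-k pipeline, assuming a hypothetical `generate_candidates` sampler (e.g., one NPR call that decodes several branches) and a task-specific `verify` checker; neither is part of the NPR codebase:

def best_of_k(problem, k, generate_candidates, verify):
    # Sample k candidate solutions and return the first one the verifier accepts.
    # generate_candidates(problem, k) and verify(problem, solution) are
    # placeholder hooks supplied by the caller.
    candidates = generate_candidates(problem, k)
    for solution in candidates:
        if verify(problem, solution):
            return solution
    return candidates[0] if candidates else None  # no candidate verified; fall back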

Out-of-scope / Not recommended uses

  • Generating unverified factual claims for high-stakes decision making without extra verification — NPR focuses on verifiable reasoning tasks and does not guarantee correctness for open generative tasks.
  • Use without appropriate safety/verification layers in domains requiring legal/medical/regulatory compliance.
  • Relying on NPR to produce human-level judgment where subjective evaluation or human values are needed.

Training data & setup (summary)

  • Base models: Qwen3-4B (and Qwen3-4B-Instruct variants) used as backbones for experiments.

  • Data source: experiments build on the ORZ dataset (57k problem–answer pairs); the pipeline uses a fixed subset of 8k examples across the three training stages (Stage 1 → Stage 2 → Stage 3). Self-distilled trajectories are filtered for outcome correctness and format compliance to produce the distilled training corpus (a minimal filtering sketch follows the hyperparameter list below).

  • Optimization / hyperparams (high-level):

    • Stage 1 (DAPO / format RL): large generation budget (max length up to 30,000 tokens in training).
    • Stage 2 (Parallel SFT warmup): LR start ≈ 1e-6 decayed to 5e-7; weight decay 0.1.
    • Stage 3 (PAPO + NPR-Engine): LR ≈ 1e-7; PAPO uses batch-level advantage normalization and strict on-policy updates.
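
A minimal sketch of the trajectory filter mentioned under "Data source" above, assuming the tag layout from the prompt in "How to use" and a caller-supplied `answers_match` checker (both assumptions, not the project's actual implementation):

import re

def keep_trajectory(traj, gold_answer, answers_match):
    # Format compliance: balanced guideline/takeaway tags, matching plan/step
    # counts, and a final \boxed{...} answer.
    n_plans = len(re.findall(r"<plan>", traj))
    n_steps = len(re.findall(r"<step>", traj))
    has_tags = "<guideline>" in traj and "</guideline>" in traj and "<takeaway>" in traj
    boxed = re.search(r"\\boxed\{([^}]*)\}", traj)
    format_ok = has_tags and n_plans > 0 and n_plans == n_steps and boxed is not None
    # Outcome correctness: keep only trajectories whose final answer verifies.
    return format_ok and answers_match(boxed.group(1), gold_answer)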

Evaluation

Benchmarks used: AIME25, AIME24, HMMT25, AMC23, OlympiadBench, Minerva-Math, ZebraLogic, MATH500. Metrics include avg@k (avg@8 for smaller datasets where multiple solutions are sampled; avg@1 for larger datasets).
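
Assuming avg@k denotes the mean fraction of correct solutions over k samples per problem, averaged across the benchmark (the common reading; see the paper for the exact protocol), the metric can be computed as:

def avg_at_k(correctness_per_problem):
    # correctness_per_problem: list of per-problem lists, each holding the
    # correctness (True/False) of that problem's k sampled solutions.
    per_problem = [sum(c) / len(c) for c in correctness_per_problem]
    return sum(per_problem) / len(per_problem)

# Two problems, k = 4 samples each -> (0.75 + 0.25) / 2 = 0.5
print(avg_at_k([[True, True, False, True], [False, False, True, False]]))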

Selected results (reported in paper)

  • NPR trained on Qwen3-4B achieves performance gains up to ~24.5% over baselines on aggregate metrics and inference speedups up to 4.6× compared to autoregressive decoding.
  • Example numbers: NPR-4B (finetuned on Qwen3-4B-Instruct) reported AIME25: 50.4%, AIME24: 63.3%, outperforming Multiverse baselines (Multiverse-4B and Multiverse-32B) by noticeable margins in many benchmarks.
  • Genuine parallelism: NPR exhibits near 100% genuine parallel execution on evaluated tasks (no hidden autoregressive fallback observed), in contrast to >30% AR fallback found in some prior baselines.

See the paper for full tables (per-benchmark avg@k, best@k, and ablation studies).


How to use

Basic example (Hugging Face Transformers) — adapt depending on the model artifact type and the framework you use:

from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

model_id = "bigai-NPR/NPR-4B-non-thinking"  # see https://huggingface.co/bigai-NPR for other checkpoints
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

gen = pipeline("text-generation", model=model, tokenizer=tokenizer, device=0)  # adjust device as needed
prompt = "You must write your answer strictly following the XML-like format defined below. Failure to comply with this format will result in an invalid response.\n\n**Definitions and Rules:**\n\n* `<guideline>`: A container for one or more `<plan>` tags. It sets the objective for the current stage of reasoning.\n* `<plan>i:</plan>`: A single, specific, and actionable task or hypothesis to be executed. Multiple plans within a guideline represent parallel exploration.\n* `<step>i:</step>`: The detailed execution of the corresponding `<plan>i`. The number of `<step>` tags must exactly match the number of `<plan>` tags in the preceding `<guideline>`. **Crucially, the content of this step must be generated *as if* you have no knowledge of the content of its sibling steps.**\n* `<takeaway>`: Use the `<takeaway>` tag to analyze steps and generate a *concise* summary. Compare the outcomes of the different steps, identify the most promising path, or consolidate the findings. The takeaway determines the next action: either proceeding to the next `<guideline>` for deeper analysis or moving to the final answer. **Only analyze the executed steps, NO additional computation or reasoning is allowed here.**\n* After analysis, add the final, user-facing conclusion that summarizes the entire logical journey from all preceding steps and takeaways into a clear, final response for the user. For questions with a definitive, short answer, you must include `\\\\boxed{...}` containing only the final result.\n\n**Strict Requirements:**\n\n1. **Execute Independently:** For each `<plan>`, generate a corresponding `<step>`.\n    * Each of the plans and steps must be a *self-contained, complete strategy* for solving the task or subtask.\n    * You must treat each `<step>` as an independent execution unit. The reasoning within `<step>i:` must only be based on `<plan>i:`, not on the content of any other `<step>`.\n    * The number of `<step>` tags must always equal the number of `<plan>` tags in the directly preceding `<guideline>`.\n    * Avoid words implying sequence or dependency (e.g. “then”, “after”, “next”).\n2. **Explore in Parallel:** When a problem or previous analysis involves multiple hypotheses, alternative methods, or independent sub-tasks, your next `<guideline>` should contain multiple `<plan>` tags.\n    * Each `<plan>` represents a parallel line of reasoning.\n    * `<guideline>` with a single `<plan>` is allowed if one plan is needed.\n    * Multiple alternative plans are recommended and will be awarded.\n3. **Meaningful content:** All tags must contain meaningful content. Do not add any text or explanation between the tags.\n4. No other tags or text outside the defined structure is allowed. Directly generate output. Do not wrap it in triple backticks or any other code block formatting.\n\n\n**Example Output Format:**\n\n<guideline>\n<plan>1: [A concise one-sentence, indepedent high-level plan.]</plan>\n...\n</guideline>\n<step>\n1: [Detailed analysis trajectory of plan 1. Must be entirely self-contained.]\n</step>\n...\n<takeaway>\n[Compare the results from the steps above. Synthesize the findings and determine the next action.]\n</takeaway>\n\n<guideline>\n<plan>1: [A one-sentence, high-level strategy]</plan>\n<plan>2: [A one-sentence, high-level strategy]</plan>\n...\n</guideline>\n<step>\n1: [Detailed analysis trajectory of plan 1. Must be entirely self-contained.]\n</step>\n<step>\n2: [Detailed analysis trajectory of plan 2. Must be entirely self-contained.]\n</step>\n...\n<takeaway>\n[Compare the results from the steps above. Synthesize the findings and determine the next action.]\n</takeaway>\n\n... [more guidelines, steps and takeaways]\n\n[The final, summarized conclusion based on all takeaways. Include definitive answers in \\\\boxed{...} format.]\n\nHow many positive two-digit integers are factors of both 100 and 150?\n\n"
outputs = gen(
    prompt,
    max_new_tokens=256,       # increase for full structured outputs; training budgets go up to 30,000 tokens
    do_sample=True,           # sampling is required when num_return_sequences > 1
    num_return_sequences=8,
)
for i, out in enumerate(outputs):
    print(f"Candidate {i+1}:\n", out["generated_text"])
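
To post-process the candidates, the final answer can be pulled out of each generation; a minimal helper, assuming each generation ends with a \boxed{...} answer as instructed by the prompt above (it reuses the `outputs` list from the snippet):

import re

def extract_boxed_answer(text):
    # Return the content of the last \boxed{...} in a generation, or None.
    matches = re.findall(r"\\boxed\{([^}]*)\}", text)
    return matches[-1] if matches else None

answers = [extract_boxed_answer(out["generated_text"]) for out in outputs]
print(answers)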

Practical note: NPR is designed to run with a parallel decoding engine (NPR-Engine) to realize the genuine parallelism and speedups reported in the paper; naive autoregressive decoding over the same checkpoint, as in the Transformers example above, will not reproduce the parallel inference acceleration. See the repo for engine and run scripts.


Limitations & risks

  • Task specialization: NPR is trained and evaluated primarily on verifiable reasoning tasks (math/programming/factual verification). Its parallel reasoning gains may not translate to unconstrained or open-ended generation tasks.
  • Verification dependence: The pipeline relies on verifiable outcomes (used for self-distillation and rewards). In domains lacking reliable verifiers, the approach will be difficult to apply.
  • Compute & engineering complexity: Achieving the reported parallel RL stability required substantial engine-level fixes (KV bookkeeping, token budget accounting, format validators). Reproducing results needs similar engineering effort and careful resource management.
  • Potential failure modes: as with other learned planners/searchers, NPR can produce plausible but incorrect reasoning branches; downstream verification and human oversight are recommended for critical uses.

Ethical considerations

  • Avoid using NPR outputs as sole authority in high-stakes domains (legal, medical, financial) without human verification.
  • The self-distillation pipeline and large-scale RL could propagate dataset biases present in ORZ or other training subsets; evaluate fairness and bias for your target application.

License & citation

  • License: See the repository and the Hugging Face model page for the specific license attached to the code and model artifact.
  • If you use NPR in research or products, please cite: Wu, T., Liu, Y., Bai, J., Jia, Z., Zhang, S., Lin, Z., Wang, Y., Zhu, S.-C., & Zheng, Z. Native Parallel Reasoner: Reasoning in Parallelism via Self-Distilled Reinforcement Learning (2025).

Where to find more

  • Paper: Native Parallel Reasoner: Reasoning in Parallelism via Self-Distilled Reinforcement Learning.
  • Code, training scripts, and NPR-Engine: https://github.com/bigai-nlco/Native-Parallel-Reasoner
  • Model checkpoints and hub page: https://huggingface.co/bigai-NPR

Prepared from the NPR paper and repository materials. For full technical details, exact per-benchmark tables, ablations and reproduction instructions, consult the original paper and the project repo.
