ReplicaLab Scientist — GRPO LoRA Adapter

A LoRA adapter fine-tuned from unsloth/Qwen3.5-0.8B with Group Relative Policy Optimization (GRPO) for multi-agent scientific negotiation.

What is ReplicaLab?

ReplicaLab is a multi-agent constraint-aware planning environment that trains an AI Scientist agent to negotiate feasible scientific replication plans under realistic resource constraints. A Lab Manager enforces budgets, schedules, and equipment limits while a deterministic Judge scores every plan on rigor, feasibility, and fidelity.

Live demo: ayushozha-replicalab.hf.space

Training Details

  • Method: GRPO (Group Relative Policy Optimization) via TRL
  • Base model: unsloth/Qwen3.5-0.8B
  • LoRA config: rank=16, alpha=32, dropout=0.0
  • Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
  • Hardware: NVIDIA H100 80GB HBM3 (Northflank)
  • Steps: 200 (checkpoints at 100, 150, 200)
  • Training framework: Unsloth + TRL 0.24.0 + PEFT 0.18.1
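GRPO scores each sampled completion relative to the other completions drawn for the same prompt, using the group mean as the baseline instead of a learned value function. A minimal sketch of that group-relative advantage (illustrative only; TRL's implementation additionally handles KL regularization and policy clipping):

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its sampling group.

    GRPO replaces a learned value baseline with the group mean:
    completions that beat their siblings get positive advantage,
    those that fall below it get negative advantage.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four plans sampled for one scenario, scored by the Judge:
advantages = group_relative_advantages([7.1, 4.2, 6.0, 3.5])
# The highest-reward plan receives the largest positive advantage.
```

Because the baseline comes from the group itself, no separate critic model is trained, which keeps memory usage low enough for LoRA fine-tuning on a single GPU.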

Reward Formula

total_reward = 10 × rigor × feasibility × fidelity × parsimony
             + efficiency_bonus + communication_bonus − penalties

The multiplicative core prevents fake wins: because any near-zero factor collapses the whole product, a theoretically strong but infeasible plan still scores low.
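As a minimal sketch (the component scales, bonus values, and penalty accounting here are illustrative, not the environment's exact implementation), the formula can be written as:

```python
def total_reward(rigor, feasibility, fidelity, parsimony,
                 efficiency_bonus=0.0, communication_bonus=0.0,
                 penalties=0.0):
    """Multiplicative core plus additive bonuses minus penalties.

    The product structure means no additive bonus can rescue a plan
    whose core factors fail: the weakest factor dominates.
    """
    core = 10 * rigor * feasibility * fidelity * parsimony
    return core + efficiency_bonus + communication_bonus - penalties

# A rigorous but infeasible plan is dominated by its weakest factor:
# total_reward(0.9, 0.05, 0.9, 0.9)  ->  ~0.36
```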

Training Curves

[Figures: training overview, reward over training, training loss, KL divergence, and completion length]

Evaluation Results

[Figures: improvement over baseline and side-by-side evaluation comparison]

Metric                 | Baseline Scientist | Trained Scientist | Change
Average reward         | 4.25               | 7.10              | +67%
Rounds to agreement    | 4.1                | 2.8               | −32%
Invalid action rate    | 15%                | 4%                | −73%
Agreement rate         | 50%                | 80%               | +60%
Avg rigor score        | 0.55               | 0.72              | +31%
Avg feasibility score  | 0.52               | 0.78              | +50%
Avg fidelity score     | 0.58               | 0.71              | +22%

Scenario Families

Template         | Domain           | Example Task
math_reasoning   | Mathematics      | Proof planning under deadline and review constraints
ml_benchmark     | Machine Learning | Model replication with compute and time budgets
finance_trading  | Finance          | Backtest design under capital and risk limits

Quick Start

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model, then attach the GRPO-trained LoRA adapter
base_model = AutoModelForCausalLM.from_pretrained("unsloth/Qwen3.5-0.8B")
model = PeftModel.from_pretrained(base_model, "openenv-community/replicalab-scientist-grpo-lora")
tokenizer = AutoTokenizer.from_pretrained("openenv-community/replicalab-scientist-grpo-lora")

# Use within the ReplicaLab environment for scientific negotiation

Framework Versions

  • PEFT: 0.18.1
  • TRL: 0.24.0
  • Transformers: 5.2.0
  • PyTorch: 2.8.0+cu128
  • Datasets: 4.3.0
  • Tokenizers: 0.22.2

Citation

@misc{replicalab2026,
    title        = {ReplicaLab: Multi-Agent Constraint-Aware Planning for Scientific Replication},
    author       = {Ayush Ojha and Kian and Peixi and Kush},
    year         = 2026,
    url          = {https://github.com/Ayush10/replicalab-ai}
}

License

MIT
