# ReplicaLab Scientist — GRPO LoRA Adapter

A LoRA adapter fine-tuned on `unsloth/Qwen3.5-0.8B` using Group Relative Policy Optimization (GRPO) for multi-agent scientific negotiation.
## What is ReplicaLab?

ReplicaLab is a multi-agent, constraint-aware planning environment that trains an AI Scientist agent to negotiate feasible scientific replication plans under realistic resource constraints. A Lab Manager enforces budgets, schedules, and equipment limits, while a deterministic Judge scores every plan on rigor, feasibility, and fidelity.

Live demo: ayushozha-replicalab.hf.space
## Training Details

- Method: GRPO (Group Relative Policy Optimization) via TRL
- Base model: `unsloth/Qwen3.5-0.8B`
- LoRA config: rank=16, alpha=32, dropout=0.0
- Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
- Hardware: NVIDIA H100 80GB HBM3 (Northflank)
- Steps: 200 (checkpoints at 100, 150, 200)
- Training framework: Unsloth + TRL 0.24.0 + PEFT 0.18.1
## Reward Formula

    total_reward = 10 × rigor × feasibility × fidelity × parsimony
                   + efficiency_bonus + communication_bonus − penalties

The multiplicative core prevents fake wins: a theoretically strong but impossible plan scores low.
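The effect of the multiplicative core can be sketched as follows. This is an illustrative implementation of the formula above, not the Judge's actual code; component scores are assumed to lie in [0, 1], and the exact bonus and penalty definitions are not specified in the card.

```python
def total_reward(rigor, feasibility, fidelity, parsimony,
                 efficiency_bonus=0.0, communication_bonus=0.0, penalties=0.0):
    """Sketch of the ReplicaLab reward: multiplicative core plus additive terms.

    Assumes component scores in [0, 1]; bonus/penalty terms are placeholders.
    """
    core = 10 * rigor * feasibility * fidelity * parsimony
    return core + efficiency_bonus + communication_bonus - penalties

# A strong-but-infeasible plan is crushed by the multiplicative core,
# while a balanced plan scores higher overall:
print(total_reward(0.9, 0.1, 0.9, 0.9))  # low despite high rigor
print(total_reward(0.7, 0.7, 0.7, 0.7))  # balanced, higher reward
```

Because any near-zero factor drags the whole product down, the agent cannot trade feasibility away for rigor.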
## Training Curves

Plots (see the repository for images): reward over training, training loss, KL divergence, and completion length.
## Evaluation Results

### Side-by-Side Comparison
| Metric | Baseline Scientist | Trained Scientist | Change |
|---|---|---|---|
| Average reward | 4.25 | 7.10 | +67% |
| Rounds to agreement | 4.1 | 2.8 | −32% |
| Invalid action rate | 15% | 4% | −73% |
| Agreement rate | 50% | 80% | +60% |
| Avg rigor score | 0.55 | 0.72 | +31% |
| Avg feasibility score | 0.52 | 0.78 | +50% |
| Avg fidelity score | 0.58 | 0.71 | +22% |
## Scenario Families

| Template | Domain | Example Task |
|---|---|---|
| `math_reasoning` | Mathematics | Proof planning under deadline and review constraints |
| `ml_benchmark` | Machine Learning | Model replication with compute and time budgets |
| `finance_trading` | Finance | Backtest design under capital and risk limits |
## Quick Start

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = AutoModelForCausalLM.from_pretrained("unsloth/Qwen3.5-0.8B")
model = PeftModel.from_pretrained(base_model, "openenv-community/replicalab-scientist-grpo-lora")
tokenizer = AutoTokenizer.from_pretrained("openenv-community/replicalab-scientist-grpo-lora")

# Use within the ReplicaLab environment for scientific negotiation
```
## Framework Versions

- PEFT: 0.18.1
- TRL: 0.24.0
- Transformers: 5.2.0
- PyTorch: 2.8.0+cu128
- Datasets: 4.3.0
- Tokenizers: 0.22.2
## Citation

```bibtex
@misc{replicalab2026,
  title  = {ReplicaLab: Multi-Agent Constraint-Aware Planning for Scientific Replication},
  author = {Ayush Ojha and Kian and Peixi and Kush},
  year   = {2026},
  url    = {https://github.com/Ayush10/replicalab-ai}
}
```
## License

MIT