
Future Improvements

Source of truth: ReplicaLab_Comprehensive_Task_Division.md

This document tracks post-MVP architectural improvements. Work here begins only after the core logic is complete and the hackathon deliverables are stable.


1. Domain-Agnostic Normalized Scenario Layer

Priority: highest future feature

Problem

The current models in replicalab/models.py use domain-biased field names:

  • paper_title, paper_hypothesis, paper_method, paper_key_finding
  • equipment_available, reagents_in_stock, staff_count
  • sample_size, controls, technique

These work for the three MVP scenario families (cell biology, ML benchmark, behavioral psychology) because all three map onto a lab-style replication frame. But if the environment needs to support domains outside scientific replication (e.g., engineering design, clinical trial planning, supply chain optimization), the field names stop making sense.

The turn protocol itself (propose, revise, request_info, accept) is already generic. The gap is in the observation and protocol content layer.

Solution: normalized scenario representation

Introduce a structured internal representation that any domain adapter can emit:

from typing import Any, Optional

from pydantic import BaseModel


# Constraint, Resource, and Substitution (listed under "Where:" below) must be
# declared before this class in the actual module.
class NormalizedScenarioPack(BaseModel):
    domain_id: str                          # "cell_biology", "ml_benchmark", etc.
    task_summary: str                       # what the agent is trying to achieve
    success_criteria: list[str]             # measurable conditions for success
    constraints: list[Constraint]           # budget, time, equipment, policy, etc.
    resources: list[Resource]               # what is available to work with
    allowed_substitutions: list[Substitution]  # valid swaps the agent can propose
    hidden_reference_spec: dict             # ground truth the judge scores against
    difficulty: str                         # "easy", "medium", "hard"
    metadata: dict                          # domain-specific extras

Where:

class Constraint(BaseModel):
    dimension: str          # "budget", "time", "equipment", "personnel", "safety"
    label: str              # human-readable name
    value: Any              # the constraint value (numeric, list, etc.)
    hard: bool = True       # hard constraint vs soft preference

class Resource(BaseModel):
    category: str           # "equipment", "reagent", "compute", "personnel"
    name: str               # resource identifier
    available: bool         # currently available
    quantity: Optional[int] = None  # count if applicable, None otherwise
    notes: str = ""         # booking conflicts, expiry, etc.

class Substitution(BaseModel):
    original: str           # what the reference spec uses
    replacement: str        # what the agent can use instead
    quality_impact: float   # 0.0 to 1.0, how much fidelity is lost
    cost_delta: float       # cost difference
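
To make the helper models concrete, here is a minimal sketch of how one scenario's constraints, resources, and substitutions might be expressed. It uses standard-library dataclasses as stand-ins for the Pydantic models so it runs without dependencies. The ML-benchmark values (GPU budget, node names, impact numbers) are invented for illustration and do not come from the real templates.

```python
from dataclasses import dataclass
from typing import Any, Optional


# Dataclass stand-ins that mirror the field names of the models above.
@dataclass
class Constraint:
    dimension: str
    label: str
    value: Any
    hard: bool = True


@dataclass
class Resource:
    category: str
    name: str
    available: bool
    quantity: Optional[int] = None
    notes: str = ""


@dataclass
class Substitution:
    original: str
    replacement: str
    quality_impact: float
    cost_delta: float


# Hypothetical ML-benchmark scenario; every value below is illustrative.
constraints = [
    Constraint("budget", "GPU budget (USD)", 500),
    Constraint("time", "Preferred turnaround (days)", 7, hard=False),
]
resources = [
    Resource("compute", "A100 node", available=False, notes="booked until Friday"),
    Resource("compute", "T4 node", available=True, quantity=4),
]
substitutions = [
    Substitution("A100 node", "T4 node", quality_impact=0.2, cost_delta=-300.0),
]

# Hard constraints and usable resources fall out of simple filters.
hard_labels = [c.label for c in constraints if c.hard]
usable = [r.name for r in resources if r.available]
print(hard_labels)
print(usable)
```

The same three lists would describe a cell-biology or psychology scenario with different strings, which is the point of the normalized layer.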

Architecture principle

Domain template
    -> Scenario adapter (thin mapper, <50 lines per domain)
        -> NormalizedScenarioPack
            -> Observation mapper (fills ScientistObservation / LabManagerObservation)
            -> Prompt assembler (data-driven, not hard-coded)
            -> Validator (checks action against constraints)
            -> Scorer (compares final protocol against hidden_reference_spec)

The external contract (ScientistAction, LabManagerAction, ScientistObservation, LabManagerObservation, StepResult) stays unchanged. The normalization lives below those models as an internal implementation layer.

LLMs reason and negotiate. They never own truth. Truth lives in the normalized scenario pack and the deterministic scorer.

How this affects the future core logic

| Current component | Impact | Severity |
| --- | --- | --- |
| replicalab/models.py | External contract unchanged. Add NormalizedScenarioPack and helper models as new classes | Low |
| replicalab/scenarios/templates.py (SCN 02) | Must define the normalized schema. generate_scenario() returns a pack instead of raw dicts | High |
| replicalab/scenarios/*.py (SCN 03-05) | Each domain file becomes a thin scenario adapter that emits a normalized pack | Medium |
| replicalab/scenarios/templates.py (SCN 06) | Difficulty scaling becomes mechanical: add/remove constraints, tighten resource limits | Medium, but simpler |
| replicalab/scenarios/templates.py (SCN 07) | Constraint generator emits Constraint objects instead of ad hoc lab fields | High |
| replicalab/scenarios/templates.py (SCN 08) | hidden_reference_spec is part of the pack, not a separate hidden structure | Medium |
| replicalab/utils/validation.py (MOD 05-06) | Validators read constraints[] and resources[] from the pack instead of checking lab-specific fields | High |
| replicalab/scoring/*.py (JDG 01-04) | Scorers compare the final protocol against hidden_reference_spec on normalized dimensions | High |
| replicalab/env/replicalab_env.py (ENV 01-07) | EpisodeState gains a scenario_pack field. Reset populates it from the adapter | Medium |
| replicalab/agents/scientist_policy.py (AGT 01-02) | Prompts assembled from scenario pack data, not hard-coded domain text | Medium |
| replicalab/agents/lab_manager_policy.py (AGT 05-07) | Feasibility checker reads normalized constraints instead of lab-specific fields | Medium |
| frontend/ (UI 01+) | Render "constraint cards" and "resource cards" instead of lab-specific panels | Low (future) |

What stays the same

  • The turn protocol (propose, revise, request_info, accept)
  • The reward formula (10 * rigor * feasibility * fidelity + bonuses - penalties)
  • The external API contract (REST + WebSocket payloads)
  • The training loop and RL pipeline
  • The deterministic reward principle
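
As a worked example of the reward formula listed above, the sketch below evaluates it for a few inputs. The function name `total_reward` and the assumption that each component score lies in [0, 1] are illustrative and not confirmed by this document.

```python
def total_reward(rigor: float, feasibility: float, fidelity: float,
                 bonuses: float = 0.0, penalties: float = 0.0) -> float:
    """Evaluate 10 * rigor * feasibility * fidelity + bonuses - penalties."""
    for name, score in (("rigor", rigor), ("feasibility", feasibility),
                        ("fidelity", fidelity)):
        # Assumption: component scores are normalized to the range [0, 1].
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {score}")
    return 10.0 * rigor * feasibility * fidelity + bonuses - penalties


base = total_reward(0.8, 0.5, 0.5)                       # 10 * 0.2 = 2.0
adjusted = total_reward(0.8, 0.5, 0.5, bonuses=1.0, penalties=0.5)
infeasible = total_reward(1.0, 0.0, 1.0)                 # one zero factor zeroes the base term
print(base, adjusted, infeasible)
```

Because the three scores are multiplied, a protocol that scores zero on any one dimension earns no base reward regardless of the other two.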

2. Planned work items for the normalized scenario layer

Item 1: Define the normalized scenario schema

What: Add NormalizedScenarioPack, Constraint, Resource, and Substitution as Pydantic models in a new file replicalab/scenarios/schema.py.

Why: This is the foundation. Every other item depends on having a stable schema that all adapters, validators, and scorers agree on.

Depends on: Core MVP scenario work (SCN 02-09) being complete so we know what fields the adapters actually need.

Scope: ~80 lines of model definitions, no business logic.


Item 2: Convert existing scenario templates into adapters

What: Refactor cell_biology.py, ml_benchmark.py, and behavioral_psych.py so each one returns a NormalizedScenarioPack instead of raw domain-specific dicts.

Why: Proves the schema works for all three MVP domains. If a field cannot be cleanly mapped, the schema needs revision before adding new domains.

Depends on: Item 1 (schema exists), SCN 03-05 (domain templates exist).

Scope: ~50 lines per adapter. Adapters should be thin mappers; if one exceeds 50 lines, treat that as a sign the schema needs revision.

Constraint: The existing observation fields (paper_title, equipment_available, etc.) must still be populated. The adapter fills both the normalized pack and the legacy observation slots until the observation models are generalized.
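
A thin adapter that satisfies this dual-population constraint could look roughly like the sketch below. `PackSketch` is a reduced stand-in for NormalizedScenarioPack, `adapt_cell_biology` is a hypothetical function name, and the shape of the raw template dict (lists of strings, an integer staff count) is assumed rather than taken from the real cell_biology.py.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Constraint:
    dimension: str
    label: str
    value: Any
    hard: bool = True


@dataclass
class Resource:
    category: str
    name: str
    available: bool


@dataclass
class PackSketch:
    """Reduced stand-in for NormalizedScenarioPack (only the fields used here)."""
    domain_id: str
    task_summary: str
    constraints: list = field(default_factory=list)
    resources: list = field(default_factory=list)


def adapt_cell_biology(raw: dict) -> tuple:
    """Map a raw template dict to (normalized pack, legacy observation fields)."""
    resources = [Resource("equipment", name, True) for name in raw["equipment_available"]]
    resources += [Resource("reagent", name, True) for name in raw["reagents_in_stock"]]
    pack = PackSketch(
        domain_id="cell_biology",
        task_summary=f"Replicate: {raw['paper_title']}",
        constraints=[Constraint("personnel", "Staff available", raw["staff_count"])],
        resources=resources,
    )
    # Legacy slots stay populated until the observation models are generalized.
    legacy = {key: raw[key] for key in
              ("paper_title", "equipment_available", "reagents_in_stock", "staff_count")}
    return pack, legacy


# Illustrative input only; real templates carry more fields.
raw_template = {
    "paper_title": "Example paper",
    "equipment_available": ["centrifuge", "incubator"],
    "reagents_in_stock": ["trypsin"],
    "staff_count": 2,
}
pack, legacy = adapt_cell_biology(raw_template)
print([r.name for r in pack.resources])
print(legacy["staff_count"])
```

Returning both structures from one function keeps the legacy observation fields and the normalized pack derived from the same raw data, so they cannot drift apart.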


Item 3: Build data-driven prompt assembly

What: Replace hard-coded prompt text with a template that assembles from the scenario pack:

You are a {role} working on: {task_summary}

Success criteria:
{success_criteria[]}

You must work within these constraints:
{constraints[].label}: {constraints[].value}

Available resources:
{resources[].name} ({resources[].category}): {available/unavailable}

Why: Makes AGT 01 (Scientist prompt) and AGT 07 (Lab Manager templates) domain-neutral. Adding a new domain requires only a new adapter, not new prompts.

Depends on: Item 2 (adapters produce normalized packs), AGT 01 and AGT 07 existing in their MVP form.

Scope: One prompt template function per role. ~40 lines each.
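
One way the template above could be filled from pack data is sketched here. The function name `assemble_prompt`, the `- ` bullet prefix, and the reduced dataclass stand-ins are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class Constraint:
    label: str
    value: Any


@dataclass
class Resource:
    category: str
    name: str
    available: bool


def assemble_prompt(role, task_summary, success_criteria, constraints, resources) -> str:
    """Build a role prompt purely from scenario data, with no domain text."""
    lines = [f"You are a {role} working on: {task_summary}", "", "Success criteria:"]
    lines += [f"- {criterion}" for criterion in success_criteria]
    lines += ["", "You must work within these constraints:"]
    lines += [f"- {c.label}: {c.value}" for c in constraints]
    lines += ["", "Available resources:"]
    for r in resources:
        status = "available" if r.available else "unavailable"
        lines.append(f"- {r.name} ({r.category}): {status}")
    return "\n".join(lines)


# Illustrative scenario values.
prompt = assemble_prompt(
    role="Scientist",
    task_summary="Replicate the benchmark result",
    success_criteria=["Accuracy within 1 point of the reference"],
    constraints=[Constraint("GPU budget (USD)", 500)],
    resources=[Resource("compute", "T4 node", True),
               Resource("compute", "A100 node", False)],
)
print(prompt)
```

Swapping in a different domain changes only the arguments, not the function, which is what makes the Scientist and Lab Manager prompts domain-neutral.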


Item 4: Hybrid LLM Lab Manager with deterministic post-checking

What: Replace the rule-based Lab Manager with a hybrid architecture:

  1. LLM receives the LabManagerObservation and generates negotiation text plus alternative suggestions in natural language
  2. Deterministic constraint checker computes the real feasibility flags by reading the normalized scenario pack's constraints[] and resources[]
  3. A composer merges the LLM output with the checker output into a valid LabManagerAction
  4. The model_validator on LabManagerAction catches any inconsistency

Why: Gives the Lab Manager realistic negotiation language and creative suggestions (the LLM's strength) while keeping feasibility flags truthful (the checker's strength). Training reward stays deterministic because the reward engine only reads the validated action, not the LLM's raw text.

Depends on: Item 2 (checker needs normalized constraints), AGT 05 (feasibility checker exists), MOD 02 (LabManagerAction validators exist).

Scope: ~120 lines covering the LLM call, the checker, and the composer. The Lab Manager uses the same base model as the Scientist (Qwen3-4B) with a separate role adapter.

Risk: Episode variance increases because the same seed may produce different negotiation paths. Mitigate by keeping the deterministic checker as the authority on all boolean flags. The LLM only controls explanation text and suggestion ideas, never the truth flags.
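
The division of authority described in this item can be sketched as follows. The LLM is replaced by a stub function, and `LabManagerActionSketch` with its flag names is a hypothetical stand-in for the real LabManagerAction; its `__post_init__` check plays the role of the model_validator.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class LabManagerActionSketch:
    budget_ok: bool
    equipment_ok: bool
    feasible: bool
    explanation: str

    def __post_init__(self):
        # Stand-in for the model_validator: the summary flag must agree
        # with the detail flags.
        if self.feasible != (self.budget_ok and self.equipment_ok):
            raise ValueError("feasible flag contradicts detail flags")


def check_feasibility(protocol: dict, budget_limit: float, available_equipment: set) -> dict:
    """Deterministic checker: the only source of truth for boolean flags."""
    return {
        "budget_ok": protocol["cost"] <= budget_limit,
        "equipment_ok": set(protocol["equipment"]) <= available_equipment,
    }


def compose(llm: Callable[[dict], dict], protocol: dict,
            budget_limit: float, available_equipment: set) -> LabManagerActionSketch:
    draft = llm(protocol)  # may carry its own flags; they are ignored
    flags = check_feasibility(protocol, budget_limit, available_equipment)
    return LabManagerActionSketch(
        budget_ok=flags["budget_ok"],
        equipment_ok=flags["equipment_ok"],
        feasible=all(flags.values()),
        explanation=draft.get("explanation", ""),  # text is the LLM's only contribution
    )


def fake_llm(protocol: dict) -> dict:
    """Stub for the LLM call; it wrongly claims the protocol is feasible."""
    return {"explanation": "Looks fine to me.", "feasible": True}


action = compose(fake_llm,
                 {"cost": 1200.0, "equipment": ["confocal microscope"]},
                 budget_limit=1000.0,
                 available_equipment={"confocal microscope"})
print(action.feasible, action.budget_ok, action.equipment_ok)
```

Even though the stub LLM asserts feasibility, the composed action reports the over-budget protocol as infeasible, because the composer copies flags only from the checker.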


Item 5: Normalized scoring against hidden reference spec

What: Refactor the scoring engine so score_rigor(), score_feasibility(), and score_fidelity() compare the final protocol against hidden_reference_spec from the normalized scenario pack instead of using domain-specific scoring logic.

Scoring dimensions become:

  • Rigor: Does the protocol preserve the success criteria? Compare protocol.controls against hidden_reference_spec.required_controls, check sample size ratio, verify statistical validity markers.
  • Feasibility: Does the protocol satisfy all hard constraints? Walk constraints[] and check each one against the protocol.
  • Fidelity: How close is the protocol to the reference spec? Compare technique, duration, equipment, reagents against hidden_reference_spec and compute a similarity score, discounting each allowed substitution by its quality_impact from allowed_substitutions[].

Why: Makes scoring work for any domain without per-domain scorer code. The domain-specific knowledge lives in the scenario adapter (which defines what the reference spec and constraints are), not in the scoring engine.

Depends on: Item 1 (schema with hidden_reference_spec), Item 2 (adapters populate it), JDG 01-04 (MVP scorers exist to refactor from).

Scope: Refactor of existing scorer files. ~150 lines total across rigor.py, feasibility.py, fidelity.py.
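
A minimal sketch of the feasibility and fidelity dimensions under the normalized schema follows. Several choices here are assumptions, not the project's scoring rules: numeric constraints are treated as upper bounds keyed by dimension, feasibility is the fraction of hard constraints met, and fidelity is the mean per-field match with substitutions scored as 1 - quality_impact.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class Constraint:
    dimension: str
    label: str
    value: Any
    hard: bool = True


@dataclass
class Substitution:
    original: str
    replacement: str
    quality_impact: float


def score_feasibility(protocol: dict, constraints: list) -> float:
    """Fraction of hard constraints met (assumed: value is an upper bound)."""
    hard = [c for c in constraints if c.hard]
    if not hard:
        return 1.0
    met = sum(1 for c in hard if protocol.get(c.dimension, 0) <= c.value)
    return met / len(hard)


def score_fidelity(protocol: dict, reference: dict, substitutions: list) -> float:
    """Mean per-field match against the reference spec."""
    impact = {(s.original, s.replacement): s.quality_impact for s in substitutions}
    scores = []
    for key, ref_value in reference.items():
        used = protocol.get(key)
        if used == ref_value:
            scores.append(1.0)                        # exact match
        elif (ref_value, used) in impact:
            scores.append(1.0 - impact[(ref_value, used)])  # allowed swap
        else:
            scores.append(0.0)                        # unapproved deviation
    return sum(scores) / len(scores) if scores else 1.0


# Illustrative data only.
constraints = [Constraint("budget", "Budget (USD)", 1000),
               Constraint("time", "Deadline (days)", 14),
               Constraint("personnel", "Preferred staff", 2, hard=False)]
reference = {"technique": "western blot", "equipment": "confocal microscope"}
subs = [Substitution("confocal microscope", "widefield microscope", 0.25)]
protocol = {"budget": 800, "time": 21,
            "technique": "western blot", "equipment": "widefield microscope"}

feasibility = score_feasibility(protocol, constraints)
fidelity = score_fidelity(protocol, reference, subs)
print(feasibility, fidelity)
```

Neither function mentions a domain; everything domain-specific arrives through the constraints, reference spec, and substitution list that the adapter supplies.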


Item 6: Lab Manager orchestrator with specialist subagents

What: Decompose the hybrid Lab Manager into a coordinator that delegates to specialist subagents:

| Subagent | Responsibility |
| --- | --- |
| Budget agent | Checks cost against remaining budget |
| Scheduling agent | Checks timeline and booking conflicts |
| Equipment agent | Checks equipment availability and substitutions |
| Safety agent | Checks policy and compliance constraints |
| Coordinator | Aggregates subagent outputs into one LabManagerAction |

Externally, the contract is unchanged: one LabManagerAction per turn. The orchestration is internal.

Why: Strengthens the multi-agent story and its alignment with the hackathon track. It demonstrates that the Lab Manager is not a monolithic policy but a team of constraint specialists, each of which can be individually tested, improved, or replaced.

Depends on: Item 4 (hybrid Lab Manager works first), Item 2 (normalized constraints are available for each subagent to read).

Scope: Orchestration layer ~200 lines. Each subagent ~40 lines. Total ~400 lines.

Risk: Adds latency (multiple LLM calls or multiple checker passes per turn), orchestration failure handling, and logging complexity. Only pursue after the single hybrid Lab Manager is stable and training is producing results.

Phasing: This is the lowest priority item. Build it only if the MVP is complete, training shows improvement, and there is time remaining before submission.
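
The coordinator pattern described in this item can be sketched with purely deterministic subagents. The `Finding` tuple, the protocol and scenario dict keys, and the output dict (a stand-in for LabManagerAction) are all hypothetical names chosen for this example.

```python
from typing import NamedTuple


class Finding(NamedTuple):
    agent: str
    ok: bool
    note: str


def budget_agent(protocol: dict, scenario: dict) -> Finding:
    ok = protocol["cost"] <= scenario["budget_remaining"]
    return Finding("budget", ok, "" if ok else "cost exceeds remaining budget")


def scheduling_agent(protocol: dict, scenario: dict) -> Finding:
    ok = protocol["days"] <= scenario["days_available"]
    return Finding("scheduling", ok, "" if ok else "timeline too long")


def equipment_agent(protocol: dict, scenario: dict) -> Finding:
    missing = sorted(set(protocol["equipment"]) - set(scenario["equipment"]))
    return Finding("equipment", not missing,
                   f"missing: {', '.join(missing)}" if missing else "")


def safety_agent(protocol: dict, scenario: dict) -> Finding:
    banned = sorted(set(protocol["reagents"]) & set(scenario["restricted_reagents"]))
    return Finding("safety", not banned,
                   f"restricted: {', '.join(banned)}" if banned else "")


def coordinate(protocol: dict, scenario: dict, agents: list) -> dict:
    """Aggregate subagent findings into one action-shaped result per turn."""
    findings = [agent(protocol, scenario) for agent in agents]
    return {
        "feasible": all(f.ok for f in findings),
        "flags": {f.agent: f.ok for f in findings},
        "notes": [f.note for f in findings if not f.ok],
    }


# Illustrative inputs only.
scenario = {"budget_remaining": 1000, "days_available": 10,
            "equipment": ["centrifuge"], "restricted_reagents": ["reagent X"]}
protocol = {"cost": 900, "days": 12,
            "equipment": ["centrifuge", "sequencer"], "reagents": ["trypsin"]}
result = coordinate(protocol, scenario,
                    [budget_agent, scheduling_agent, equipment_agent, safety_agent])
print(result)
```

Each subagent is an ordinary function with the same signature, so one can be unit-tested or swapped out without touching the coordinator.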


3. Recommended order

| Order | Item | Gate |
| --- | --- | --- |
| 1 | Define normalized scenario schema | After SCN 02-09 complete |
| 2 | Convert templates to adapters | After Item 1 |
| 3 | Data-driven prompt assembly | After Item 2 + AGT 01/07 |
| 4 | Hybrid LLM Lab Manager | After Item 2 + AGT 05 |
| 5 | Normalized scoring | After Item 2 + JDG 01-04 |
| 6 | Lab Manager orchestrator with subagents | After Item 4 stable |
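
The gates above form a small dependency graph, and the recommended order can be checked against it mechanically. The sketch below transcribes only the item-to-item dependencies stated in the work items (MVP gates such as SCN and AGT tasks are left out); `respects_gates` is a hypothetical helper name.

```python
# Item -> items it depends on, transcribed from the "Depends on" lines.
deps = {1: set(), 2: {1}, 3: {2}, 4: {2}, 5: {1, 2}, 6: {2, 4}}
recommended = [1, 2, 3, 4, 5, 6]


def respects_gates(order: list, deps: dict) -> bool:
    """Return True if every item appears after all of its dependencies."""
    done = set()
    for item in order:
        if not deps[item] <= done:
            return False
        done.add(item)
    return True


print(respects_gates(recommended, deps))
print(respects_gates([1, 2, 6, 4, 3, 5], deps))  # orchestrator before hybrid manager
```

Items 3, 4, and 5 depend only on earlier items, not on each other, so they could be reordered or worked in parallel without violating any gate.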

4. Key principle

The external contract stays stable. Internal policy can evolve. LLMs reason and negotiate. They never own truth. Truth lives in the normalized scenario pack and the deterministic scorer.