# Future Improvements

Source of truth: `ReplicaLab_Comprehensive_Task_Division.md`

This document tracks post-MVP architectural improvements. Work here begins only after the core logic is complete and the hackathon deliverables are stable.

---

## 1. Domain-Agnostic Normalized Scenario Layer

### Priority: highest future feature

### Problem

The current models in `replicalab/models.py` use domain-biased field names:

- `paper_title`, `paper_hypothesis`, `paper_method`, `paper_key_finding`
- `equipment_available`, `reagents_in_stock`, `staff_count`
- `sample_size`, `controls`, `technique`

These work for the three MVP scenario families (cell biology, ML benchmark, behavioral psychology) because all three map onto a lab-style replication frame. But if the environment needs to support domains outside scientific replication (e.g., engineering design, clinical trial planning, supply chain optimization), the field names stop making sense.

The turn protocol itself (`propose`, `revise`, `request_info`, `accept`) is already generic. The gap is in the observation and protocol content layer.

### Solution: normalized scenario representation

Introduce a structured internal representation that any domain adapter can emit:

```python
from __future__ import annotations  # lets the pack reference Constraint et al. before they are defined

from typing import Any, Optional

from pydantic import BaseModel


class NormalizedScenarioPack(BaseModel):
    domain_id: str                             # "cell_biology", "ml_benchmark", etc.
    task_summary: str                          # what the agent is trying to achieve
    success_criteria: list[str]                # measurable conditions for success
    constraints: list[Constraint]              # budget, time, equipment, policy, etc.
    resources: list[Resource]                  # what is available to work with
    allowed_substitutions: list[Substitution]  # valid swaps the agent can propose
    hidden_reference_spec: dict                # ground truth the judge scores against
    difficulty: str                            # "easy", "medium", "hard"
    metadata: dict                             # domain-specific extras
```

Where:

```python
class Constraint(BaseModel):
    dimension: str           # "budget", "time", "equipment", "personnel", "safety"
    label: str               # human-readable name
    value: Any               # the constraint value (numeric, list, etc.)
    hard: bool = True        # hard constraint vs soft preference


class Resource(BaseModel):
    category: str            # "equipment", "reagent", "compute", "personnel"
    name: str                # resource identifier
    available: bool          # currently available
    quantity: Optional[int]  # count if applicable
    notes: str = ""          # booking conflicts, expiry, etc.


class Substitution(BaseModel):
    original: str            # what the reference spec uses
    replacement: str         # what the agent can use instead
    quality_impact: float    # 0.0 to 1.0, how much fidelity is lost
    cost_delta: float        # cost difference
```

### Architecture principle

```
Domain template
  -> Scenario adapter (thin mapper, <50 lines per domain)
  -> NormalizedScenarioPack
  -> Observation mapper (fills ScientistObservation / LabManagerObservation)
  -> Prompt assembler (data-driven, not hard-coded)
  -> Validator (checks action against constraints)
  -> Scorer (compares final protocol against hidden_reference_spec)
```

The external contract (`ScientistAction`, `LabManagerAction`, `ScientistObservation`, `LabManagerObservation`, `StepResult`) stays unchanged. The normalization lives below those models as an internal implementation layer.

LLMs reason and negotiate. They never own truth. Truth lives in the normalized scenario pack and the deterministic scorer.

### How this affects the future core logic

| Current component | Impact | Severity |
|---|---|---|
| `replicalab/models.py` | External contract unchanged. Add `NormalizedScenarioPack` and helper models as new classes | Low |
| `replicalab/scenarios/templates.py` (SCN 02) | Must define the normalized schema. `generate_scenario()` returns a pack instead of raw dicts | High |
| `replicalab/scenarios/*.py` (SCN 03-05) | Each domain file becomes a thin scenario adapter that emits a normalized pack | Medium |
| `replicalab/scenarios/templates.py` (SCN 06) | Difficulty scaling becomes mechanical: add/remove constraints, tighten resource limits | Medium, but simpler |
| `replicalab/scenarios/templates.py` (SCN 07) | Constraint generator emits `Constraint` objects instead of ad hoc lab fields | High |
| `replicalab/scenarios/templates.py` (SCN 08) | `hidden_reference_spec` is part of the pack, not a separate hidden structure | Medium |
| `replicalab/utils/validation.py` (MOD 05-06) | Validators read `constraints[]` and `resources[]` from the pack instead of checking lab-specific fields | High |
| `replicalab/scoring/*.py` (JDG 01-04) | Scorers compare the final protocol against `hidden_reference_spec` on normalized dimensions | High |
| `replicalab/env/replicalab_env.py` (ENV 01-07) | `EpisodeState` gains a `scenario_pack` field. Reset populates it from the adapter | Medium |
| `replicalab/agents/scientist_policy.py` (AGT 01-02) | Prompts assembled from scenario pack data, not hard-coded domain text | Medium |
| `replicalab/agents/lab_manager_policy.py` (AGT 05-07) | Feasibility checker reads normalized constraints instead of lab-specific fields | Medium |
| `frontend/` (UI 01+) | Render "constraint cards" and "resource cards" instead of lab-specific panels | Low (future) |

### What stays the same

- The turn protocol (`propose`, `revise`, `request_info`, `accept`)
- The reward formula (`10 * rigor * feasibility * fidelity + bonuses - penalties`)
- The external API contract (REST + WebSocket payloads)
- The training loop and RL pipeline
- The deterministic reward principle

---

## 2. Planned work items for the normalized scenario layer

### Item 1: Define the normalized scenario schema

**What:** Add `NormalizedScenarioPack`, `Constraint`, `Resource`, and `Substitution` as Pydantic models in a new file `replicalab/scenarios/schema.py`.

**Why:** This is the foundation. Every other item depends on having a stable schema that all adapters, validators, and scorers agree on.

**Depends on:** Core MVP scenario work (SCN 02-09) being complete so we know what fields the adapters actually need.

**Scope:** ~80 lines of model definitions, no business logic.

---

### Item 2: Convert existing scenario templates into adapters

**What:** Refactor `cell_biology.py`, `ml_benchmark.py`, and `behavioral_psych.py` so each one returns a `NormalizedScenarioPack` instead of raw domain-specific dicts.

**Why:** Proves the schema works for all three MVP domains. If a field cannot be cleanly mapped, the schema needs revision before adding new domains.

**Depends on:** Item 1 (schema exists), SCN 03-05 (domain templates exist).

**Scope:** ~50 lines per adapter. Should be thin mappers. If an adapter exceeds 50 lines, the schema is wrong.

**Constraint:** The existing observation fields (`paper_title`, `equipment_available`, etc.) must still be populated. The adapter fills both the normalized pack and the legacy observation slots until the observation models are generalized.

---

### Item 3: Build data-driven prompt assembly

**What:** Replace hard-coded prompt text with a template that assembles from the scenario pack:

```
You are a {role} working on: {task_summary}

Success criteria:
{success_criteria[]}

You must work within these constraints:
{constraints[].label}: {constraints[].value}

Available resources:
{resources[].name} ({resources[].category}): {available/unavailable}
```

**Why:** Makes AGT 01 (Scientist prompt) and AGT 07 (Lab Manager templates) domain-neutral. Adding a new domain requires only a new adapter, not new prompts.
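As a sketch of what this assembler could look like under the Item 1 schema: the dataclasses below stand in for the Pydantic models, and `assemble_prompt` plus its exact wording are illustrative assumptions, not the AGT 01 implementation.

```python
from dataclasses import dataclass


# Minimal stand-ins for the Item 1 schema; the real code would import the
# Pydantic models from replicalab/scenarios/schema.py.
@dataclass
class Constraint:
    dimension: str
    label: str
    value: object
    hard: bool = True


@dataclass
class Resource:
    category: str
    name: str
    available: bool


def assemble_prompt(role: str, task_summary: str,
                    success_criteria: list[str],
                    constraints: list[Constraint],
                    resources: list[Resource]) -> str:
    """Fill the role prompt purely from scenario-pack data; no domain text."""
    lines = [f"You are a {role} working on: {task_summary}",
             "", "Success criteria:"]
    lines += [f"- {c}" for c in success_criteria]
    lines += ["", "You must work within these constraints:"]
    lines += [f"- {c.label}: {c.value}" + ("" if c.hard else " (soft preference)")
              for c in constraints]
    lines += ["", "Available resources:"]
    lines += [f"- {r.name} ({r.category}): "
              + ("available" if r.available else "unavailable")
              for r in resources]
    return "\n".join(lines)
```

Because the function reads only pack fields, a new domain changes the prompt by changing the adapter's data, never this code.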
**Depends on:** Item 2 (adapters produce normalized packs), AGT 01 and AGT 07 existing in their MVP form.

**Scope:** One prompt template function per role. ~40 lines each.

---

### Item 4: Hybrid LLM Lab Manager with deterministic post-checking

**What:** Replace the rule-based Lab Manager with a hybrid architecture:

1. LLM receives the `LabManagerObservation` and generates negotiation text plus alternative suggestions in natural language
2. Deterministic constraint checker computes the real feasibility flags by reading the normalized scenario pack's `constraints[]` and `resources[]`
3. A composer merges the LLM output with the checker output into a valid `LabManagerAction`
4. The `model_validator` on `LabManagerAction` catches any inconsistency

**Why:** Gives the Lab Manager realistic negotiation language and creative suggestions (the LLM's strength) while keeping feasibility flags truthful (the checker's strength). Training reward stays deterministic because the reward engine only reads the validated action, not the LLM's raw text.

**Depends on:** Item 2 (checker needs normalized constraints), AGT 05 (feasibility checker exists), MOD 02 (`LabManagerAction` validators exist).

**Scope:** ~120 lines: the LLM call, the checker, the composer. Uses the same base model as the Scientist (Qwen3-4B) with a separate role adapter.

**Risk:** Episode variance increases because the same seed may produce different negotiation paths. Mitigate by keeping the deterministic checker as the authority on all boolean flags. The LLM only controls `explanation` text and suggestion ideas, never the truth flags.

---

### Item 5: Normalized scoring against hidden reference spec

**What:** Refactor the scoring engine so `score_rigor()`, `score_feasibility()`, and `score_fidelity()` compare the final protocol against `hidden_reference_spec` from the normalized scenario pack instead of using domain-specific scoring logic.
Scoring dimensions become:

- **Rigor:** Does the protocol preserve the success criteria? Compare `protocol.controls` against `hidden_reference_spec.required_controls`, check sample size ratio, verify statistical validity markers.
- **Feasibility:** Does the protocol satisfy all hard constraints? Walk `constraints[]` and check each one against the protocol.
- **Fidelity:** How close is the protocol to the reference spec? Compare technique, duration, equipment, reagents against `hidden_reference_spec` and compute a similarity score using `allowed_substitutions[]` quality impact.

**Why:** Makes scoring work for any domain without per-domain scorer code. The domain-specific knowledge lives in the scenario adapter (which defines what the reference spec and constraints are), not in the scoring engine.

**Depends on:** Item 1 (schema with `hidden_reference_spec`), Item 2 (adapters populate it), JDG 01-04 (MVP scorers exist to refactor from).

**Scope:** Refactor of existing scorer files. ~150 lines total across `rigor.py`, `feasibility.py`, `fidelity.py`.

---

### Item 6: Lab Manager orchestrator with specialist subagents

**What:** Decompose the hybrid Lab Manager into a coordinator that delegates to specialist subagents:

| Subagent | Responsibility |
|---|---|
| Budget agent | Checks cost against remaining budget |
| Scheduling agent | Checks timeline and booking conflicts |
| Equipment agent | Checks equipment availability and substitutions |
| Safety agent | Checks policy and compliance constraints |
| Coordinator | Aggregates subagent outputs into one `LabManagerAction` |

Externally, the contract is unchanged: one `LabManagerAction` per turn. The orchestration is internal.

**Why:** Stronger multi-agent story for the hackathon track alignment. Demonstrates that the Lab Manager is not a monolithic policy but a team of constraint specialists. Each subagent can be individually tested, improved, or replaced.
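One way this coordinator pattern could be sketched, with plain functions standing in for the subagents. The function names, their signatures, and the dict-shaped output are illustrative assumptions; the real composer would emit a `LabManagerAction`.

```python
from dataclasses import dataclass


@dataclass
class SubagentReport:
    dimension: str    # which constraint family was checked
    feasible: bool    # truth flag, deterministically computed
    explanation: str  # text the coordinator may surface to the Scientist


# Illustrative specialist subagents; each reads only its slice of the
# normalized scenario pack.
def check_budget(cost: float, remaining_budget: float) -> SubagentReport:
    ok = cost <= remaining_budget
    return SubagentReport("budget", ok,
                          f"cost {cost} vs remaining {remaining_budget}")


def check_equipment(needed: set[str], available: set[str]) -> SubagentReport:
    missing = needed - available
    return SubagentReport("equipment", not missing,
                          f"missing: {sorted(missing)}" if missing else "all available")


def coordinate(reports: list[SubagentReport]) -> dict:
    """Aggregate subagent outputs into one action-shaped result.

    Externally this would be a single LabManagerAction per turn; the
    orchestration stays internal."""
    feasible = all(r.feasible for r in reports)
    return {
        "feasible": feasible,
        "blocking": [r.dimension for r in reports if not r.feasible],
        "explanation": "; ".join(r.explanation for r in reports),
    }
```

Because each subagent returns the same report shape, any one of them can be tested or replaced without touching the coordinator.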
**Depends on:** Item 4 (hybrid Lab Manager works first), Item 2 (normalized constraints are available for each subagent to read).

**Scope:** Orchestration layer ~200 lines. Each subagent ~40 lines. Total ~400 lines.

**Risk:** Adds latency (multiple LLM calls or multiple checker passes per turn), orchestration failure handling, and logging complexity. Only pursue after the single hybrid Lab Manager is stable and training is producing results.

**Phasing:** This is the lowest priority item. Build it only if the MVP is complete, training shows improvement, and there is time remaining before submission.

---

## 3. Recommended order

| Order | Item | Gate |
|---|---|---|
| 1 | Define normalized scenario schema | After SCN 02-09 complete |
| 2 | Convert templates to adapters | After Item 1 |
| 3 | Data-driven prompt assembly | After Item 2 + AGT 01/07 |
| 4 | Hybrid LLM Lab Manager | After Item 2 + AGT 05 |
| 5 | Normalized scoring | After Item 2 + JDG 01-04 |
| 6 | Lab Manager orchestrator with subagents | After Item 4 stable |

---

## 4. Key principle

The external contract stays stable. Internal policy can evolve.

LLMs reason and negotiate. They never own truth. Truth lives in the normalized scenario pack and the deterministic scorer.
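As one concrete reading of that principle, the "walk `constraints[]`" feasibility check from Item 5 can be a few lines of deterministic code. This is a sketch under assumed conventions (the protocol reports its usage per constraint dimension, numeric constraint values are upper bounds, and the 0.2 soft penalty is arbitrary), not the JDG scorer itself.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class Constraint:  # stand-in for the Item 1 schema model
    dimension: str
    label: str
    value: Any
    hard: bool = True


def score_feasibility(protocol: dict, constraints: list[Constraint]) -> float:
    """Deterministic feasibility: any violated hard constraint zeroes the
    score; each missed soft preference shaves a fixed fraction."""
    soft_penalty = 0.0
    for c in constraints:
        used = protocol.get(c.dimension)
        if used is None or not isinstance(c.value, (int, float)):
            continue  # nothing checkable for this dimension in this sketch
        if used > c.value:
            if c.hard:
                return 0.0       # hard constraint violated
            soft_penalty += 0.2  # soft preference missed
    return max(0.0, 1.0 - soft_penalty)
```

No LLM output enters this function: the pack supplies the constraints, the scorer supplies the truth, and the reward stays reproducible for the same seed.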