# Large Language Model Analysis Framework for High Energy Physics

A framework for testing and evaluating Large Language Models (LLMs) on ATLAS H→γγ analysis tasks using a supervisor-coder architecture.

## Table of Contents

- [Setup](#setup)
- [Data and Solution](#data-and-solution)
- [Running Tests](#running-tests)
- [Analysis and Visualization](#analysis-and-visualization)
- [Project Structure](#project-structure)
- [Advanced Usage](#advanced-usage)

---

## Setup

### Prerequisites

**CBORG API Access Required**

This framework uses Lawrence Berkeley National Laboratory's CBORG API to access various LLM models. To use this code, you will need:

1. Access to the CBORG API (contact LBL for access)
2. A CBORG API key
3. Network access to the CBORG API endpoint

**Note for External Users:** CBORG is an internal LBL system. External users may need to:

- Request guest access through LBL collaborations
- Adapt the code to use the OpenAI API directly (requires code modifications)
- Contact the repository maintainers for alternative deployment options

### Environment Setup

Create the Conda environment:

```bash
mamba env create -f environment.yml
conda activate llm_env
```

### API Configuration

Create a script `~/.apikeys.sh` that exports your CBORG API key:

```bash
export CBORG_API_KEY="INSERT_API_KEY"
```

Then source it before running tests:

```bash
source ~/.apikeys.sh
```

### Initial Configuration

Before running tests, set up your configuration files:

```bash
# Copy example configuration files
cp config.example.yml config.yml
cp models.example.txt models.txt

# Edit config.yml to set your preferred models and parameters
# Edit models.txt to list the models you want to test
```

**Important:** The `models.txt` file must end with a blank line.

---

## Data and Solution

### ATLAS Open Data Samples

All four data samples and the Monte Carlo Higgs→γγ samples (including ttH) from the 2020 ATLAS Open Data diphoton campaign are available at:

```
/global/cfs/projectdirs/atlas/eligd/llm_for_analysis_copy/data/
```

**Important:** If copying the data elsewhere, make the directory read-only to prevent LLM-generated code from modifying the files:

```bash
chmod -R a-w /path/to/data/directory
```

### Reference Solution

- Navigate to the `solution/` directory and run `python soln.py`
- Use the flags `--step1`, `--step2`, `--step3`, and `--plot` to control execution

### Reference Arrays for Validation

Large `.npy` reference arrays are not committed to Git (see `.gitignore`).

**Quick fetch from the repo root:**

```bash
bash scripts/fetch_solution_arrays.sh
```

**Or copy from the NERSC shared path:**

```
/global/cfs/projectdirs/atlas/dwkim/llm_test_dev_cxyang/llm_for_analysis/solution/arrays
```

---

## Running Tests

### Model Configuration

Three model list files control testing:

- **`models.txt`**: Models for sequential testing
- **`models_supervisor.txt`**: Supervisor models for paired testing
- **`models_coder.txt`**: Coder models for paired testing

**Important formatting rules:**

- One model per line
- File must end with a blank line
- Repeat model names for multiple trials
- Use CBORG aliases (e.g., `anthropic/claude-sonnet:latest`)

See `CBORG_MODEL_MAPPINGS.md` for available models and their actual versions.

### Testing Workflows

#### 1. Sequential Testing (Single Model at a Time)

```bash
bash test_models.sh output_dir_name
```

Tests all models in `models.txt` sequentially.
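The runner simply walks the model list in order, so repeated names become additional trials. The sketch below shows how such a list can be interpreted in Python under the formatting rules above; it is an illustration only, not the contents of `test_models.sh`.

```python
# Illustration only: how a runner can interpret models.txt according to the
# formatting rules above (one model per line, repeated names = extra trials,
# trailing blank line). This is not the contents of test_models.sh.
from collections import Counter
from pathlib import Path

lines = Path("models.txt").read_text().splitlines()
models = [line.strip() for line in lines if line.strip()]  # skip the trailing blank line

for model, n_trials in Counter(models).items():
    print(f"{model}: {n_trials} trial(s)")
```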
#### 2. Parallel Testing (Multiple Models)

```bash
# Basic parallel execution
bash test_models_parallel.sh output_dir_name

# GNU Parallel (recommended for large-scale testing)
bash test_models_parallel_gnu.sh output_dir_name [max_models] [tasks_per_model]

# Examples:
bash test_models_parallel_gnu.sh experiment1        # Default: 5 models, 5 tasks each
bash test_models_parallel_gnu.sh test 3 5           # 3 models, 5 tasks per model
bash test_models_parallel_gnu.sh large_test 10 5    # 10 models, 5 tasks each
```

**GNU Parallel features:**

- Scales to 20-30 models with 200-300 total parallel jobs
- Automatic resource management
- Fast I/O using a `/dev/shm` temporary workspace
- Comprehensive error handling and logging

#### 3. Step-by-Step Testing with Validation

```bash
# Run all 5 steps with validation
./run_smk_sequential.sh --validate

# Run specific steps
./run_smk_sequential.sh --step2 --step3 --validate --job-id 002

# Run individual steps
./run_smk_sequential.sh --step1 --validate   # Step 1: Summarize ROOT
./run_smk_sequential.sh --step2 --validate   # Step 2: Create NumPy arrays
./run_smk_sequential.sh --step3 --validate   # Step 3: Preprocess
./run_smk_sequential.sh --step4 --validate   # Step 4: Compute scores
./run_smk_sequential.sh --step5 --validate   # Step 5: Categorization

# Custom output directory
./run_smk_sequential.sh --step1 --validate --auto-dir   # Creates a timestamped directory
```

**Directory naming options:**

- `--job-id ID`: Creates `results_job_ID/`
- `--auto-dir`: Creates `results_YYYYMMDD_HHMMSS/`
- `--out-dir DIR`: Custom directory name

### Validation

**Automatic validation (during execution):**

```bash
./run_smk_sequential.sh --step1 --step2 --validate
```

Validation logs are saved to `{output_dir}/logs/*_validation.log`.

**Manual validation (after execution):**

```bash
# Validate all steps
python check_soln.py --out_dir results_job_002

# Validate a specific step
python check_soln.py --out_dir results_job_002 --step 2
```

**Validation features:**

- ✅ Adaptive tolerance with 4-significant-digit precision
- 📊 Column-by-column difference analysis
- 📋 Side-by-side value comparison
- 🎯 Clear, actionable error messages

### Speed Optimization

Reduce the iteration limit in `config.yml`:

```yaml
# Limit LLM coder attempts (default 10)
max_iterations: 3
```

---

## Analysis and Visualization

### Results Summary

All test results are aggregated in:

```
results_summary.csv
```

**Columns include:** supervisor, coder, step, success, iterations, duration, API_calls, tokens, errors, error_descriptions

### Error Analysis and Categorization

**Automated error analysis:**

```bash
python error_analysis.py --results_dirs ... --output results_summary.csv --model
```

Uses an LLM to analyze the comprehensive logs and categorize errors into:

- Semantic errors
- Function-calling errors
- Intermediate file not found
- Incorrect branch name
- OpenAI API errors
- Data quality issues (all weights = 0)
- Other/uncategorized

### Interactive Analysis Notebooks

#### 1. Five-Step Performance Analysis (`five_step_analysis.ipynb`)

Comprehensive analysis of model performance across all 5 workflow steps:

- **Success rate heatmap** (models × steps; see the aggregation sketch below)
- **Agent work progression** (iterations over steps)
- **API call statistics** (by step and model)
- **Cost analysis** (input/output tokens, estimated pricing)

**Output plots:**

- `plots/1_success_rate_heatmap.pdf`
- `plots/2_agent_work_line_plot.pdf`
- `plots/3_api_calls_line_plot.pdf`
- `plots/4_cost_per_step.pdf`
- `plots/five_step_summary_stats.csv`
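The success-rate heatmap is essentially a pivot of `results_summary.csv`. A minimal sketch of that aggregation, assuming the column names listed under Results Summary (the notebook itself may structure this differently):

```python
# Sketch of the aggregation behind the success-rate heatmap, assuming the
# results_summary.csv columns listed above (not the notebook's actual code).
import pandas as pd

df = pd.read_csv("results_summary.csv")

# Fraction of successful runs for each (coder model, step) pair;
# assumes `success` is stored as 0/1 or True/False.
success_rates = (
    df.groupby(["coder", "step"])["success"]
      .mean()
      .unstack("step")   # rows: coder models, columns: workflow steps
)
print(success_rates.round(2))
```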
#### 2. Error Category Analysis (`error_analysis.ipynb`)

Deep dive into error patterns and failure modes:

- **Normalized error distribution** (stacked bar chart with percentages)
- **Error type heatmap** (models × error categories)
- **Top model breakdowns** (faceted plots for the top 9 models)
- **Error trends across steps** (stacked area chart)

**Output plots:**

- `plots/error_distribution_by_model.pdf`
- `plots/error_heatmap_by_model.pdf`
- `plots/error_categories_top_models.pdf`
- `plots/errors_by_step.pdf`

#### 3. Quick Statistics (`plot_stats.ipynb`)

Legacy notebook for basic statistics visualization.

### Log Interpretation

**Automated log analysis:**

```bash
python logs_interpreter.py --log_dir --model lbl/cborg-deepthought --output analysis.txt
```

Analyzes the comprehensive supervisor-coder logs to identify:

- Root causes of failures
- Responsible parties (user, supervisor, coder, external)
- Error patterns across iterations

---

## Project Structure

### Core Scripts

- **`supervisor_coder.py`**: Supervisor-coder framework implementation
- **`check_soln.py`**: Solution validation with enhanced comparison
- **`write_prompt.py`**: Prompt management and context chaining
- **`update_stats.py`**: Statistics tracking and CSV updates
- **`error_analysis.py`**: LLM-powered error categorization

### Test Runners

- **`test_models.sh`**: Sequential model testing
- **`test_models_parallel.sh`**: Parallel testing (basic)
- **`test_models_parallel_gnu.sh`**: GNU Parallel testing (recommended)
- **`test_stats.sh`**: Individual model statistics
- **`test_stats_parallel.sh`**: Parallel step execution
- **`run_smk_sequential.sh`**: Step-by-step workflow runner

### Snakemake Workflows (`workflow/`)

The analysis workflow is divided into 5 sequential steps:

1. **`summarize_root.smk`**: Extract ROOT file structure and branch information
2. **`create_numpy.smk`**: Convert ROOT → NumPy arrays
3. **`preprocess.smk`**: Apply preprocessing transformations
4. **`scores.smk`**: Compute signal/background classification scores
5. **`categorization.smk`**: Final categorization and statistical analysis

**Note:** Later steps take the reference solution's outputs as inputs, so each step can be tested even if an earlier step failed.
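The sketch below illustrates that fallback idea: a later step can read the reference arrays when the generated ones are missing. The array names and directory layout here are hypothetical; the `.smk` rules define the actual behavior.

```python
# Illustration of the fallback described in the note above. The array file
# names and directory layout are hypothetical; the .smk rules define the
# actual behavior.
from pathlib import Path

import numpy as np


def load_step_input(name: str, out_dir: Path, solution_dir: Path) -> np.ndarray:
    """Prefer the LLM-generated array; fall back to the reference solution."""
    generated = out_dir / "arrays" / f"{name}.npy"
    reference = solution_dir / "arrays" / f"{name}.npy"
    return np.load(generated if generated.exists() else reference)
```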
### Prompts (`prompts/`)

- `summarize_root.txt`: Step 1 task description
- `create_numpy.txt`: Step 2 task description
- `preprocess.txt`: Step 3 task description
- `scores.txt`: Step 4 task description
- `categorization.txt`: Step 5 task description
- `supervisor_first_call.txt`: Initial supervisor instructions
- `supervisor_call.txt`: Subsequent supervisor instructions

### Utility Scripts (`util/`)

- **`inspect_root.py`**: ROOT file inspection tools
- **`analyze_particles.py`**: Particle-level analysis
- **`compare_arrays.py`**: NumPy array comparison utilities

### Model Documentation

- **`CBORG_MODEL_MAPPINGS.md`**: CBORG alias → actual model mappings
- **`COMPLETE_MODEL_VERSIONS.md`**: Full version information for all tested models
- **`MODEL_NAME_UPDATES.md`**: Model name standardization notes
- **`O3_MODEL_COMPARISON.md`**: OpenAI o3 model variant comparison

### Analysis Notebooks

- **`five_step_analysis.ipynb`**: Comprehensive 5-step performance analysis
- **`error_analysis.ipynb`**: Error categorization and pattern analysis
- **`error_analysis_plotting.ipynb`**: Additional error visualizations
- **`plot_stats.ipynb`**: Legacy statistics plots

### Output Structure

Each test run creates:

```
output_name/
├── model_timestamp/
│   ├── generated_code/   # LLM-generated Python scripts
│   ├── logs/             # Execution logs and supervisor records
│   ├── arrays/           # NumPy arrays produced by generated code
│   ├── plots/            # Comparison plots (generated vs. solution)
│   ├── prompt_pairs/     # User + supervisor prompts
│   ├── results/          # Temporary ROOT files (job-scoped)
│   └── snakemake_log/    # Snakemake execution logs
```

**Job-scoped ROOT outputs:**

- Step 5 uses temporary ROOT files (`signal.root`, `bkgd.root`)
- They are written to `${OUTPUT_DIR}/results/` to prevent cross-run interference
- They are automatically cleaned up after the significance calculation

---

## Advanced Usage

### Supervisor-Coder Configuration

Control iteration limits in `config.yml`:

```yaml
model: 'anthropic/claude-sonnet:latest'
name: 'experiment_name'
out_dir: 'results/experiment_name'
max_iterations: 10   # Maximum supervisor-coder iterations per step
```

### Parallel Execution Tuning

For `test_models_parallel_gnu.sh`:

```bash
# Syntax: bash test_models_parallel_gnu.sh output_dir_name [max_models] [tasks_per_model]

# Conservative (safe for shared systems):
bash test_models_parallel_gnu.sh test 3 5      # 15 total jobs

# Aggressive (dedicated nodes):
bash test_models_parallel_gnu.sh test 10 10    # 100 total jobs
```

### Custom Validation

Run validation on specific steps or with custom tolerances:

```bash
# Validate only the data conversion step
python check_soln.py --out_dir results/ --step 2

# Check multiple specific steps
python check_soln.py --out_dir results/ --step 2 --step 3 --step 4
```

### Log Analysis Pipeline

```bash
# 1. Run tests
bash test_models_parallel_gnu.sh experiment1 5 5

# 2. Analyze logs with an LLM
python logs_interpreter.py --log_dir experiment1/model_timestamp/ --output analysis.txt

# 3. Categorize errors
python error_analysis.py --results_dirs experiment1/*/ --output summary.csv

# 4. Generate visualizations
jupyter notebook error_analysis.ipynb
```
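For quick ad-hoc checks outside this pipeline, a tolerance-based comparison against the reference arrays can be sketched as follows. This illustrates the 4-significant-digit idea from the Validation section; it is not the implementation in `check_soln.py` or `util/compare_arrays.py`, which uses an adaptive tolerance, and the file names below are hypothetical.

```python
# Illustration of a "4 significant digits" comparison against the reference
# arrays; not the code used by check_soln.py or util/compare_arrays.py,
# which applies an adaptive tolerance. File names here are hypothetical.
import numpy as np


def agrees_to_four_digits(generated: np.ndarray, reference: np.ndarray) -> bool:
    # rtol=5e-4 roughly corresponds to agreement in the first four significant digits
    return np.allclose(generated, reference, rtol=5e-4, atol=0.0)


gen = np.load("results_job_002/arrays/example.npy")   # hypothetical file name
ref = np.load("solution/arrays/example.npy")
print("match" if agrees_to_four_digits(gen, ref) else "mismatch")
```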
---

## Roadmap and Future Directions

### Planned Improvements

**Prompt Engineering:**

- Automatically load context (file lists, logs) at the start of each step
- Provide comprehensive inputs/outputs/summaries upfront
- Develop a prompt-management layer for cross-analysis reuse

**Validation & Monitoring:**

- Embed validation in the workflows for immediate error detection
- Record inputs/outputs and state transitions for reproducibility
- Improve situational awareness through comprehensive logging

**Multi-Analysis Extension:**

- Rerun H→γγ with improved system prompts
- Extend to H→4ℓ and other Higgs+X channels
- Provide materials learned from previous analyses as reference

**Self-Improvement:**

- Reinforcement learning–style feedback loops
- Agent-driven prompt refinement
- Automatic generalization across HEP analyses

---

## Citation and Acknowledgments

This framework tests LLM agents on ATLAS Open Data from:

- 2020 ATLAS Open Data diphoton samples: https://opendata.cern.ch/record/15006

Models are tested via the CBORG API (Lawrence Berkeley National Laboratory).

---

## Support and Contributing

For questions or issues:

1. Check the existing documentation in the `*.md` files
2. Review the example configuration in `config.yml`
3. Examine the validation logs in the output directories

For contributions, please ensure that:

- Model lists end with a blank line (a quick check is sketched below)
- Prompts follow the established format
- Validation passes for all test cases
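A quick way to check the first item before submitting changes (a hypothetical helper, not part of the repository):

```python
# Hypothetical pre-submission check (not part of the repository): verify that
# each model list file ends with a blank line, as required by the runners.
from pathlib import Path

for name in ("models.txt", "models_supervisor.txt", "models_coder.txt"):
    path = Path(name)
    if not path.exists():
        continue
    lines = path.read_text().splitlines()
    ok = bool(lines) and lines[-1].strip() == ""
    print(f"{name}: {'OK' if ok else 'missing trailing blank line'}")
```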