# Large Language Model Analysis Framework for High Energy Physics

A framework for testing and evaluating Large Language Models (LLMs) on ATLAS H→γγ analysis tasks using a supervisor-coder architecture.

## Table of Contents

- [Setup](#setup)
- [Data and Solution](#data-and-solution)
- [Running Tests](#running-tests)
- [Analysis and Visualization](#analysis-and-visualization)
- [Project Structure](#project-structure)
- [Advanced Usage](#advanced-usage)

---

## Setup

### Prerequisites

**CBORG API Access Required**

This framework uses Lawrence Berkeley National Laboratory's CBORG API to access various LLM models. To use this code, you will need:

1. Access to the CBORG API (contact LBL for access)
2. A CBORG API key
3. Network access to the CBORG API endpoint

**Note for External Users:** CBORG is an internal LBL system. External users may need to:

- Request guest access through LBL collaborations
- Adapt the code to use the OpenAI API directly (requires code modifications)
- Contact the repository maintainers for alternative deployment options

### Environment Setup

Create the Conda environment:

```bash
mamba env create -f environment.yml
conda activate llm_env
```

### API Configuration

Create a script `~/.apikeys.sh` that exports your CBORG API key:

```bash
export CBORG_API_KEY="INSERT_API_KEY"
```

Then source it before running tests:

```bash
source ~/.apikeys.sh
```

### Initial Configuration

Before running tests, set up your configuration files:

```bash
# Copy example configuration files
cp config.example.yml config.yml
cp models.example.txt models.txt

# Edit config.yml to set your preferred models and parameters
# Edit models.txt to list the models you want to test
```

**Important:** The `models.txt` file must end with a blank line.

---

## Data and Solution

### ATLAS Open Data Samples

All four data samples and the Monte Carlo Higgs→γγ samples (including ttH) from the 2020 ATLAS Open Data diphoton campaign are available at:

```
/global/cfs/projectdirs/atlas/eligd/llm_for_analysis_copy/data/
```

**Important:** If copying the data elsewhere, make the directory read-only to prevent LLM-generated code from modifying the files:

```bash
chmod -R a-w /path/to/data/directory
```

### Reference Solution

- Navigate to the `solution/` directory and run `python soln.py`
- Use the flags `--step1`, `--step2`, `--step3`, and `--plot` to control execution

### Reference Arrays for Validation

Large `.npy` reference arrays are not committed to Git (see `.gitignore`).

**Quick fetch from the repo root:**

```bash
bash scripts/fetch_solution_arrays.sh
```

**Or copy from the NERSC shared path:**

```
/global/cfs/projectdirs/atlas/dwkim/llm_test_dev_cxyang/llm_for_analysis/solution/arrays
```

---

## Running Tests

### Model Configuration

Three model list files control testing:

- **`models.txt`**: Models for sequential testing
- **`models_supervisor.txt`**: Supervisor models for paired testing
- **`models_coder.txt`**: Coder models for paired testing

**Important formatting rules:**

- One model per line
- File must end with a blank line
- Repeat model names for multiple trials
- Use CBORG aliases (e.g., `anthropic/claude-sonnet:latest`)

See `CBORG_MODEL_MAPPINGS.md` for available models and their actual versions.

### Testing Workflows

#### 1. Sequential Testing (Single Model at a Time)

```bash
bash test_models.sh output_dir_name
```

Tests all models in `models.txt` sequentially.
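The runner simply walks the model list in order, so repeated names become additional trials. The sketch below shows how such a list can be interpreted in Python under the formatting rules above; it is an illustration only, not the contents of `test_models.sh`.

```python
# Illustration only: how a runner can interpret models.txt according to the
# formatting rules above (one model per line, repeated names = extra trials,
# trailing blank line). This is not the contents of test_models.sh.
from collections import Counter
from pathlib import Path

lines = Path("models.txt").read_text().splitlines()
models = [line.strip() for line in lines if line.strip()]  # skip the trailing blank line

for model, n_trials in Counter(models).items():
    print(f"{model}: {n_trials} trial(s)")
```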
#### 2. Parallel Testing (Multiple Models)

```bash
# Basic parallel execution
bash test_models_parallel.sh output_dir_name

# GNU Parallel (recommended for large-scale testing)
bash test_models_parallel_gnu.sh output_dir_name [max_models] [tasks_per_model]

# Examples:
bash test_models_parallel_gnu.sh experiment1        # Default: 5 models, 5 tasks each
bash test_models_parallel_gnu.sh test 3 5           # 3 models, 5 tasks per model
bash test_models_parallel_gnu.sh large_test 10 5    # 10 models, 5 tasks each
```

**GNU Parallel features:**

- Scales to 20-30 models with 200-300 total parallel jobs
- Automatic resource management
- Fast I/O using a `/dev/shm` temporary workspace
- Comprehensive error handling and logging

#### 3. Step-by-Step Testing with Validation

```bash
# Run all 5 steps with validation
./run_smk_sequential.sh --validate

# Run specific steps
./run_smk_sequential.sh --step2 --step3 --validate --job-id 002

# Run individual steps
./run_smk_sequential.sh --step1 --validate   # Step 1: Summarize ROOT
./run_smk_sequential.sh --step2 --validate   # Step 2: Create NumPy arrays
./run_smk_sequential.sh --step3 --validate   # Step 3: Preprocess
./run_smk_sequential.sh --step4 --validate   # Step 4: Compute scores
./run_smk_sequential.sh --step5 --validate   # Step 5: Categorization

# Custom output directory
./run_smk_sequential.sh --step1 --validate --auto-dir   # Creates a timestamped directory
```

**Directory naming options:**

- `--job-id ID`: Creates `results_job_ID/`
- `--auto-dir`: Creates `results_YYYYMMDD_HHMMSS/`
- `--out-dir DIR`: Custom directory name

### Validation

**Automatic validation (during execution):**

```bash
./run_smk_sequential.sh --step1 --step2 --validate
```

Validation logs are saved to `{output_dir}/logs/*_validation.log`.

**Manual validation (after execution):**

```bash
# Validate all steps
python check_soln.py --out_dir results_job_002

# Validate a specific step
python check_soln.py --out_dir results_job_002 --step 2
```

**Validation features:**

- ✅ Adaptive tolerance with 4-significant-digit precision
- 📊 Column-by-column difference analysis
- 📋 Side-by-side value comparison
- 🎯 Clear, actionable error messages

### Speed Optimization

Reduce the iteration limit in `config.yml`:

```yaml
# Limit LLM coder attempts (default 10)
max_iterations: 3
```

---

## Analysis and Visualization

### Results Summary

All test results are aggregated in:

```
results_summary.csv
```

**Columns include:** supervisor, coder, step, success, iterations, duration, API_calls, tokens, errors, error_descriptions

### Error Analysis and Categorization

**Automated error analysis:**

```bash
python error_analysis.py --results_dirs ... --output results_summary.csv --model
```

Uses an LLM to analyze the comprehensive logs and categorize errors into:

- Semantic errors
- Function-calling errors
- Intermediate file not found
- Incorrect branch name
- OpenAI API errors
- Data quality issues (all weights = 0)
- Other/uncategorized

### Interactive Analysis Notebooks

#### 1. Five-Step Performance Analysis (`five_step_analysis.ipynb`)

Comprehensive analysis of model performance across all 5 workflow steps:

- **Success rate heatmap** (models × steps; see the aggregation sketch below)
- **Agent work progression** (iterations over steps)
- **API call statistics** (by step and model)
- **Cost analysis** (input/output tokens, estimated pricing)

**Output plots:**

- `plots/1_success_rate_heatmap.pdf`
- `plots/2_agent_work_line_plot.pdf`
- `plots/3_api_calls_line_plot.pdf`
- `plots/4_cost_per_step.pdf`
- `plots/five_step_summary_stats.csv`
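The success-rate heatmap is essentially a pivot of `results_summary.csv`. A minimal sketch of that aggregation, assuming the column names listed under Results Summary (the notebook itself may structure this differently):

```python
# Sketch of the aggregation behind the success-rate heatmap, assuming the
# results_summary.csv columns listed above (not the notebook's actual code).
import pandas as pd

df = pd.read_csv("results_summary.csv")

# Fraction of successful runs for each (coder model, step) pair;
# assumes `success` is stored as 0/1 or True/False.
success_rates = (
    df.groupby(["coder", "step"])["success"]
      .mean()
      .unstack("step")   # rows: coder models, columns: workflow steps
)
print(success_rates.round(2))
```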
#### 2. Error Category Analysis (`error_analysis.ipynb`)

Deep dive into error patterns and failure modes:

- **Normalized error distribution** (stacked bar chart with percentages)
- **Error type heatmap** (models × error categories)
- **Top model breakdowns** (faceted plots for the top 9 models)
- **Error trends across steps** (stacked area chart)

**Output plots:**

- `plots/error_distribution_by_model.pdf`
- `plots/error_heatmap_by_model.pdf`
- `plots/error_categories_top_models.pdf`
- `plots/errors_by_step.pdf`

#### 3. Quick Statistics (`plot_stats.ipynb`)

Legacy notebook for basic statistics visualization.

### Log Interpretation

**Automated log analysis:**

```bash
python logs_interpreter.py --log_dir --model lbl/cborg-deepthought --output analysis.txt
```

Analyzes the comprehensive supervisor-coder logs to identify:

- Root causes of failures
- Responsible parties (user, supervisor, coder, external)
- Error patterns across iterations

---

## Project Structure

### Core Scripts

- **`supervisor_coder.py`**: Supervisor-coder framework implementation
- **`check_soln.py`**: Solution validation with enhanced comparison
- **`write_prompt.py`**: Prompt management and context chaining
- **`update_stats.py`**: Statistics tracking and CSV updates
- **`error_analysis.py`**: LLM-powered error categorization

### Test Runners

- **`test_models.sh`**: Sequential model testing
- **`test_models_parallel.sh`**: Parallel testing (basic)
- **`test_models_parallel_gnu.sh`**: GNU Parallel testing (recommended)
- **`test_stats.sh`**: Individual model statistics
- **`test_stats_parallel.sh`**: Parallel step execution
- **`run_smk_sequential.sh`**: Step-by-step workflow runner

### Snakemake Workflows (`workflow/`)

The analysis workflow is divided into 5 sequential steps:

1. **`summarize_root.smk`**: Extract ROOT file structure and branch information
2. **`create_numpy.smk`**: Convert ROOT → NumPy arrays
3. **`preprocess.smk`**: Apply preprocessing transformations
4. **`scores.smk`**: Compute signal/background classification scores
5. **`categorization.smk`**: Final categorization and statistical analysis

**Note:** Later steps take the reference solution's outputs as inputs, so each step can be tested even if an earlier step failed.
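The sketch below illustrates that fallback idea: a later step can read the reference arrays when the generated ones are missing. The array names and directory layout here are hypothetical; the `.smk` rules define the actual behavior.

```python
# Illustration of the fallback described in the note above. The array file
# names and directory layout are hypothetical; the .smk rules define the
# actual behavior.
from pathlib import Path

import numpy as np


def load_step_input(name: str, out_dir: Path, solution_dir: Path) -> np.ndarray:
    """Prefer the LLM-generated array; fall back to the reference solution."""
    generated = out_dir / "arrays" / f"{name}.npy"
    reference = solution_dir / "arrays" / f"{name}.npy"
    return np.load(generated if generated.exists() else reference)
```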
### Prompts (`prompts/`)

- `summarize_root.txt`: Step 1 task description
- `create_numpy.txt`: Step 2 task description
- `preprocess.txt`: Step 3 task description
- `scores.txt`: Step 4 task description
- `categorization.txt`: Step 5 task description
- `supervisor_first_call.txt`: Initial supervisor instructions
- `supervisor_call.txt`: Subsequent supervisor instructions

### Utility Scripts (`util/`)

- **`inspect_root.py`**: ROOT file inspection tools
- **`analyze_particles.py`**: Particle-level analysis
- **`compare_arrays.py`**: NumPy array comparison utilities

### Model Documentation

- **`CBORG_MODEL_MAPPINGS.md`**: CBORG alias → actual model mappings
- **`COMPLETE_MODEL_VERSIONS.md`**: Full version information for all tested models
- **`MODEL_NAME_UPDATES.md`**: Model name standardization notes
- **`O3_MODEL_COMPARISON.md`**: OpenAI o3 model variant comparison

### Analysis Notebooks

- **`five_step_analysis.ipynb`**: Comprehensive 5-step performance analysis
- **`error_analysis.ipynb`**: Error categorization and pattern analysis
- **`error_analysis_plotting.ipynb`**: Additional error visualizations
- **`plot_stats.ipynb`**: Legacy statistics plots

### Output Structure

Each test run creates:

```
output_name/
├── model_timestamp/
│   ├── generated_code/   # LLM-generated Python scripts
│   ├── logs/             # Execution logs and supervisor records
│   ├── arrays/           # NumPy arrays produced by generated code
│   ├── plots/            # Comparison plots (generated vs. solution)
│   ├── prompt_pairs/     # User + supervisor prompts
│   ├── results/          # Temporary ROOT files (job-scoped)
│   └── snakemake_log/    # Snakemake execution logs
```

**Job-scoped ROOT outputs:**

- Step 5 uses temporary ROOT files (`signal.root`, `bkgd.root`)
- They are written to `${OUTPUT_DIR}/results/` to prevent cross-run interference
- They are automatically cleaned up after the significance calculation

---

## Advanced Usage

### Supervisor-Coder Configuration

Control iteration limits in `config.yml`:

```yaml
model: 'anthropic/claude-sonnet:latest'
name: 'experiment_name'
out_dir: 'results/experiment_name'
max_iterations: 10   # Maximum supervisor-coder iterations per step
```

### Parallel Execution Tuning

For `test_models_parallel_gnu.sh`:

```bash
# Syntax: bash test_models_parallel_gnu.sh output_dir_name [max_models] [tasks_per_model]

# Conservative (safe for shared systems):
bash test_models_parallel_gnu.sh test 3 5      # 15 total jobs

# Aggressive (dedicated nodes):
bash test_models_parallel_gnu.sh test 10 10    # 100 total jobs
```

### Custom Validation

Run validation on specific steps or with custom tolerances:

```bash
# Validate only the data conversion step
python check_soln.py --out_dir results/ --step 2

# Check multiple specific steps
python check_soln.py --out_dir results/ --step 2 --step 3 --step 4
```

### Log Analysis Pipeline

```bash
# 1. Run tests
bash test_models_parallel_gnu.sh experiment1 5 5

# 2. Analyze logs with an LLM
python logs_interpreter.py --log_dir experiment1/model_timestamp/ --output analysis.txt

# 3. Categorize errors
python error_analysis.py --results_dirs experiment1/*/ --output summary.csv

# 4. Generate visualizations
jupyter notebook error_analysis.ipynb
```
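For quick ad-hoc checks outside this pipeline, a tolerance-based comparison against the reference arrays can be sketched as follows. This illustrates the 4-significant-digit idea from the Validation section; it is not the implementation in `check_soln.py` or `util/compare_arrays.py`, which uses an adaptive tolerance, and the file names below are hypothetical.

```python
# Illustration of a "4 significant digits" comparison against the reference
# arrays; not the code used by check_soln.py or util/compare_arrays.py,
# which applies an adaptive tolerance. File names here are hypothetical.
import numpy as np


def agrees_to_four_digits(generated: np.ndarray, reference: np.ndarray) -> bool:
    # rtol=5e-4 roughly corresponds to agreement in the first four significant digits
    return np.allclose(generated, reference, rtol=5e-4, atol=0.0)


gen = np.load("results_job_002/arrays/example.npy")   # hypothetical file name
ref = np.load("solution/arrays/example.npy")
print("match" if agrees_to_four_digits(gen, ref) else "mismatch")
```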
---

## Roadmap and Future Directions

### Planned Improvements

**Prompt Engineering:**

- Automatically load context (file lists, logs) at the start of each step
- Provide comprehensive inputs/outputs/summaries upfront
- Develop a prompt-management layer for cross-analysis reuse

**Validation & Monitoring:**

- Embed validation in the workflows for immediate error detection
- Record inputs/outputs and state transitions for reproducibility
- Improve situational awareness through comprehensive logging

**Multi-Analysis Extension:**

- Rerun H→γγ with improved system prompts
- Extend to H→4ℓ and other Higgs+X channels
- Provide materials learned from previous analyses as reference

**Self-Improvement:**

- Reinforcement learning–style feedback loops
- Agent-driven prompt refinement
- Automatic generalization across HEP analyses

---

## Citation and Acknowledgments

This framework tests LLM agents on ATLAS Open Data from:

- 2020 ATLAS Open Data diphoton samples: https://opendata.cern.ch/record/15006

Models are tested via the CBORG API (Lawrence Berkeley National Laboratory).

---

## Support and Contributing

For questions or issues:

1. Check the existing documentation in the `*.md` files
2. Review the example configuration in `config.yml`
3. Examine the validation logs in the output directories

For contributions, please ensure that:

- Model lists end with a blank line (a quick check is sketched below)
- Prompts follow the established format
- Validation passes for all test cases
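A quick way to check the first item before submitting changes (a hypothetical helper, not part of the repository):

```python
# Hypothetical pre-submission check (not part of the repository): verify that
# each model list file ends with a blank line, as required by the runners.
from pathlib import Path

for name in ("models.txt", "models_supervisor.txt", "models_coder.txt"):
    path = Path(name)
    if not path.exists():
        continue
    lines = path.read_text().splitlines()
    ok = bool(lines) and lines[-1].strip() == ""
    print(f"{name}: {'OK' if ok else 'missing trailing blank line'}")
```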