
Job Submissions

A series of Perlmutter jobs can be submitted via the submit.sh shell script. This is a one-button method of launching parallel tests for a given list of models.

submit.sh

This script reads either ../models.txt, or ../models_supervisor.txt together with ../models_coder.txt, and extracts the list of supervisor and coder models to test. The configuration mode is selected with the --mode command-line option.

  • --mode identical: the default option. This mode reads from ../models.txt and uses the same model as both supervisor and coder.
  • --mode pairwise: This mode reads from ../models_supervisor.txt + ../models_coder.txt and constructs all pairwise combinations of supervisor/coder setups.
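The difference between the two modes can be sketched in a few lines of Python. This is only an illustration of the pairing logic, not the code in submit.sh; the function name `build_configs` and its parameters are hypothetical.

```python
from itertools import product


def build_configs(mode, models=None, supervisors=None, coders=None):
    """Return a list of (supervisor, coder) pairs for the given mode.

    Hypothetical helper: the real submit.sh is a shell script and may
    build its pairs differently.
    """
    if mode == "identical":
        # Same model plays both roles.
        return [(m, m) for m in (models or [])]
    if mode == "pairwise":
        # Every supervisor is combined with every coder.
        return list(product(supervisors or [], coders or []))
    raise ValueError(f"unknown mode: {mode}")
```

With N supervisor models and M coder models, pairwise mode yields N x M configurations, so the number of submitted jobs grows quickly as the lists get longer.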

Each supervisor/coder configuration is then submitted as a separate job, so every pairing runs its tests in parallel via the run_tests.sh script. To change the number of trials per test (how many times each test is run), modify the NUM_TESTS variable. The OUTDIR variable sets the output directory for your tests.
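As a rough sketch of the fan-out, the snippet below builds one `sbatch` command per supervisor/coder pair without running anything. It assumes that run_tests.sh takes its parameters positionally in the order supervisor, coder, NUM_TESTS, OUTDIR; that order, and the helper name `build_sbatch_commands`, are assumptions made for illustration.

```python
def build_sbatch_commands(pairs, num_tests, outdir=None, script="run_tests.sh"):
    """Build one sbatch command (as an argument list) per (supervisor, coder) pair.

    Assumption: the script accepts supervisor, coder, NUM_TESTS and an
    optional OUTDIR as positional arguments, in that order.
    """
    commands = []
    for supervisor, coder in pairs:
        cmd = ["sbatch", script, supervisor, coder, str(num_tests)]
        if outdir is not None:
            # OUTDIR is optional, so it is only appended when given.
            cmd.append(outdir)
        commands.append(cmd)
    return commands
```

Each returned list could be passed to `subprocess.run` on a machine where Slurm is available; printing the lists first is a cheap way to check what would be submitted.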

run_tests.sh

This script has four input parameters, the last of which is optional:

  • supervisor: the model to be used as supervisor
  • coder: the model to be used as coder
  • NUM_TESTS: the number of trials to run
  • OUTDIR: the output directory for your tests (optional)

This script simply loads the conda environment and calls the final script in the chain, test_models.py. To adjust the Slurm options (job time, account, QOS, Slurm output directory, etc.), modify the header of this file.

test_models.py

This script parallelizes the testing for a given supervisor/coder setup. Each trial is broken down into 5 steps (summarize root, create_numpy, preprocess, scores, and categorization), and the steps are run in parallel, taking advantage of the fact that each step is independent of all the others. Additional parallelization is performed across the total number of trials to be conducted. In the current configuration, 2 tests are run in parallel. You can change the number of parallel tests by changing the max_workers argument of the ProcessPoolExecutor.
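The pattern described above can be sketched with `concurrent.futures`. This is a minimal illustration, assuming one flat pool over every (trial, step) task; test_models.py may organize its pools differently. The names `run_step` and `run_trials`, the `use_threads` switch, and the step identifiers are placeholders, not the script's own.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor, as_completed

# The five independent steps named above (identifiers are illustrative).
STEPS = ["summarize_root", "create_numpy", "preprocess", "scores", "categorization"]


def run_step(trial, step):
    """Placeholder for the real work done by one step of one trial."""
    return (trial, step, "ok")


def run_trials(num_tests, max_workers=2, use_threads=False):
    """Run every step of every trial in a worker pool and collect the results."""
    # Threads are convenient for a quick dry run; processes give real parallelism.
    executor_cls = ThreadPoolExecutor if use_threads else ProcessPoolExecutor
    results = []
    with executor_cls(max_workers=max_workers) as pool:
        futures = [
            pool.submit(run_step, trial, step)
            for trial in range(num_tests)
            for step in STEPS
        ]
        for future in as_completed(futures):
            results.append(future.result())
    # Completion order is not deterministic, so sort for stable output.
    return sorted(results)
```

Raising `max_workers` allows more tasks to run at once, at the cost of more memory and more simultaneous model requests per job.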