Questions for training / fine-tuning

#5
by KirillShmilovich - opened

Thanks for open-sourcing this work! I'm potentially interested in some fine-tuning and have a few questions:

  1. pIC50 conversion
    Is this the correct conversion from pIC50 to your training scale (e.g., the regression target values in data/training_data.csv)?
    binding_affinity = pIC50 - 6
    (Based on the output being -log10(IC50_μM), i.e. -log10 of the IC50 expressed in micromolar.)

  2. Default assay_batch_size=1 behavior
    With the default assay_batch_size=1, the CliffLoss relative component (90% weight) computes 0 pairs since you need ≥2 samples from the same assay for pairwise comparisons. Is this intentional, or should users increase assay_batch_size for training?

  3. Recommended training hyperparameters
    What settings were used to train the released checkpoint? The OpenFold3 defaults (epoch_len=4, batch_size=1) seem like placeholders. Any guidance on:

  • assay_batch_size
  • epoch_len
  • batch_size
  • Number of epochs
  4. Data source for query_ids
    Do the numeric query_id values map to BindingDB entry IDs? And do binding_affinity_dataset_* / PUBCHEM_* assay IDs correspond to BindingDB/PubChem assay IDs?

  5. Pre-computed embeddings
    Are there plans to release the cached OpenFold3 embeddings, or is the expectation that users generate their own from scratch?
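To make the concern in question 2 concrete, here is a minimal sketch of how many within-assay pairs a batch can yield. It assumes the relative loss component only compares samples that share an assay ID; the helper name count_within_assay_pairs and the example assay IDs are hypothetical and not taken from the repository, and the actual CliffLoss pairing logic may differ.

```python
from collections import Counter
from math import comb


def count_within_assay_pairs(assay_ids):
    """Count unordered pairs of samples that share the same assay ID.

    Assumption: pairwise comparisons are only formed between samples
    drawn from the same assay (hypothetical stand-in for the real loss).
    """
    per_assay = Counter(assay_ids)
    # n samples from one assay give n * (n - 1) / 2 unordered pairs.
    return sum(comb(n, 2) for n in per_assay.values())


# One sample per assay, as with assay_batch_size=1: no pairs can be formed.
single = count_within_assay_pairs(["assay_A", "assay_B"])

# Five samples from one assay, as with assay_batch_size=5.
five = count_within_assay_pairs(["assay_A"] * 5)

print(single, five)
```

Under this assumption, any batch with at most one sample per assay contributes nothing to the pairwise term, which is the behavior described in question 2.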

Thanks!

KirillShmilovich changed discussion title from Questions for reproducing training / fine-tuning on custom data to Questions for training / fine-tuning

@sb-sb wondering if you're able to help with these questions?

SandboxAQ org
edited 6 days ago

Hi @KirillShmilovich, here are answers to your questions. I am also about to push some training and inference examples, which may help as well:

  1. That is correct: we train on pIC50 in micromolar units (-log10 of the IC50 in μM), so the conversion you've written should work.
  2. You almost never want to use assay_batch_size=1 (we will change the default). This parameter sets how many datapoints are sampled from a given assay, and, as you have identified, a value of 1 isn't compatible with the relative loss.
  3. We expect you may need to modify the following to work with your compute configuration, but we used:
    • assay_batch_size: 5
    • epoch_len: 428000
    • batch_size: 2
    • num_epochs: -1 (we trained until validation metrics stagnated)
  4. I do not believe the query_id values map to BindingDB entry IDs; the assay IDs should correspond to the database assay IDs, though. I will be releasing an updated CSV that includes more information to make the connection clearer.
  5. The cached embeddings are not released at present.
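The conversion confirmed in point 1 can be sketched as follows. This is a minimal illustration, assuming the training target is -log10 of the IC50 in micromolar as discussed above; the function names are hypothetical and are not part of the released code.

```python
import math


def pic50_to_training_target(pic50: float) -> float:
    """Convert a molar-scale pIC50 to the micromolar-scale target.

    pIC50 = -log10(IC50 in M) and IC50_uM = IC50_M * 1e6, so
    -log10(IC50_uM) = pIC50 - 6.
    """
    return pic50 - 6.0


def ic50_um_to_training_target(ic50_um: float) -> float:
    """Compute the same target directly from an IC50 given in micromolar."""
    if ic50_um <= 0:
        raise ValueError("IC50 must be positive")
    return -math.log10(ic50_um)


# A 10 nM compound: pIC50 = 8, IC50 = 0.01 uM.
from_pic50 = pic50_to_training_target(8.0)
from_ic50 = ic50_um_to_training_target(0.01)
print(from_pic50, from_ic50)
```

Both routes should agree for the same compound, which gives a quick sanity check when preparing custom fine-tuning data.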

Excellent! Thank you @sb-sb very much appreciated!

Looking forward to your updates!
