Questions for training / fine-tuning
Thanks for open-sourcing this work! I'm potentially interested in some fine-tuning and have a few questions:
pIC50 conversion
Is this the correct conversion from pIC50 to your training scale (i.e., the regression target values in `data/training_data.csv`):
binding_affinity = pIC50 - 6
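For concreteness, a minimal sketch of this conversion (function names here are illustrative, not from the repo):

```python
import math

def pic50_to_target(pic50: float) -> float:
    """Convert pIC50 (-log10 of IC50 in M) to a target on the
    -log10(IC50 in uM) scale. Subtracting 6 accounts for the
    M -> uM unit change (1 M = 1e6 uM)."""
    return pic50 - 6.0

def ic50_um_to_target(ic50_um: float) -> float:
    """Equivalent conversion starting from a raw IC50 in uM."""
    return -math.log10(ic50_um)

# Sanity check: a 1 uM IC50 corresponds to pIC50 = 6, i.e. target 0.
assert math.isclose(pic50_to_target(6.0), ic50_um_to_target(1.0))
```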
(Based on the output being -log10(IC50_μM).)

Default assay_batch_size=1 behavior
With the default assay_batch_size=1, the relative component of CliffLoss (90% weight) computes zero pairs, since pairwise comparisons require ≥2 samples from the same assay. Is this intentional, or should users increase assay_batch_size for training?

Recommended training hyperparameters
What settings were used to train the released checkpoint? The OpenFold3 defaults (epoch_len=4, batch_size=1) seem like placeholders. Any guidance on:
- assay_batch_size
- epoch_len
- batch_size
- Number of epochs
Data source for query_ids
Do the numeric query_id values map to BindingDB entry IDs? And do the binding_affinity_dataset_* / PUBCHEM_* assay IDs correspond to BindingDB/PubChem assay IDs?

Pre-computed embeddings
Are there plans to release the cached OpenFold3 embeddings, or is the expectation that users generate their own from scratch?
Thanks!
Hi @KirillShmilovich, here are answers to your questions. I am also about to push some training and inference examples, which may help as well:
- That is correct: we train on pIC50 referenced to micromolar IC50, so the conversion you've written should work.
- You almost never want to use assay_batch_size=1 (we will change the default). This parameter sets how many datapoints to sample from a given assay, which, as you have identified, makes a value of 1 incompatible with the relative loss.
- We expect you may need to modify the following to fit your compute configuration, but we used:
- assay_batch_size: 5
- epoch_len: 428000
- batch_size: 2
- num_epochs: -1 (we trained until validation metrics stagnated)
- I do not believe the query_ids map to BindingDB entry IDs, but the assay_ids should correspond to the source database assay IDs. I will be releasing an updated CSV with more information to make the connection clearer.
- The cached embeddings are not released at present.
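For reference, the settings above collected into a single sketch. The key names are illustrative and may not match the repo's actual config schema; the comments spell out why assay_batch_size must be ≥2 for the relative loss:

```python
# Illustrative config dict; the real training config keys may differ.
training_config = {
    # Datapoints sampled per assay. Must be >= 2 for the relative
    # CliffLoss term: n samples from one assay yield n*(n-1)/2 pairs,
    # so the old default of assay_batch_size=1 produces zero pairs.
    "assay_batch_size": 5,
    "epoch_len": 428_000,
    "batch_size": 2,
    # -1 = no fixed epoch budget; train until validation metrics stagnate.
    "num_epochs": -1,
}

def pairs_per_assay(n: int) -> int:
    """Pairs available to a pairwise (relative) loss from n samples."""
    return n * (n - 1) // 2

print(pairs_per_assay(training_config["assay_batch_size"]))  # 5 samples -> 10 pairs
print(pairs_per_assay(1))                                    # 1 sample  -> 0 pairs
```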