Questions for training / fine-tuning

by KirillShmilovich - opened 18 days ago

Discussion

KirillShmilovich

18 days ago

Thanks for open-sourcing this work! I'm potentially interested in some fine-tuning and have a few questions:

1. pIC50 conversion
Is the correct conversion from pIC50 to your training scale (e.g., the regression target values in data/training_data.csv):
binding_affinity = pIC50 - 6
(Based on output being -log10(IC50_μM))
2. Default assay_batch_size=1 behavior
With the default assay_batch_size=1, the CliffLoss relative component (90% weight) computes 0 pairs since you need ≥2 samples from the same assay for pairwise comparisons. Is this intentional, or should users increase assay_batch_size for training?
3. Recommended training hyperparameters
What settings were used to train the released checkpoint? The OpenFold3 defaults (epoch_len=4, batch_size=1) seem like placeholders. Any guidance on:

- assay_batch_size
- epoch_len
- batch_size
- Number of epochs

4. Data source for query_ids
Do the numeric query_id values map to BindingDB entry IDs? And do binding_affinity_dataset_* / PUBCHEM_* assay IDs correspond to BindingDB/PubChem assay IDs?
5. Pre-computed embeddings
Are there plans to release the cached OpenFold3 embeddings, or is the expectation that users generate their own from scratch?
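
For concreteness, the conversion asked about in the pIC50 question could be sketched as follows. The function names are hypothetical and not part of the repository; the only assumption is that the training target is -log10(IC50 in μM), so that it differs from the usual molar pIC50 by exactly 6.

```python
import math


def pic50_to_training_scale(pic50: float) -> float:
    """Convert molar pIC50 (-log10 of IC50 in M) to -log10 of IC50 in uM.

    Hypothetical helper: 1 M = 1e6 uM, so the two log scales differ by 6.
    """
    return pic50 - 6.0


def ic50_um_to_training_scale(ic50_um: float) -> float:
    """Convert an IC50 given in micromolar directly to the same scale."""
    if ic50_um <= 0:
        raise ValueError("IC50 must be positive")
    return -math.log10(ic50_um)


# A 10 nM compound has pIC50 = 8 and IC50 = 0.01 uM; both routes should agree.
from_pic50 = pic50_to_training_scale(8.0)
from_ic50 = ic50_um_to_training_scale(0.01)
```

Under this assumption a 1 μM compound maps to 0, and more potent compounds map to positive values.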

Thanks!

KirillShmilovich changed discussion title from Questions for reproducing training / fine-tuning on custom data to Questions for training / fine-tuning 18 days ago

KirillShmilovich

6 days ago

@sb-sb wondering if you're able to help with these questions?

sb-sb

SandboxAQ org 6 days ago • edited 6 days ago

Hi @KirillShmilovich, I am about to push some training and inference examples that may help as well. In the meantime, here are answers to your questions, in order:

1. pIC50 conversion: That is correct. We train on pIC50 expressed in micromolar, so the conversion you've written should work.
2. assay_batch_size: You almost never want to use assay_batch_size=1 (we will change the default). This parameter sets how many datapoints to sample from a given assay, and, as you identified, a value of 1 is incompatible with the relative loss.
3. Hyperparameters: You may need to modify the following to work with your compute configuration, but we used:
- assay_batch_size: 5
- epoch_len: 428000
- batch_size: 2
- num_epochs: -1 (we trained until validation metrics stagnated)
4. query_ids: I do not believe the query_id values map to BindingDB entry IDs, but the assay IDs should correspond to the source database's assay IDs. I will be releasing an updated CSV that includes more information to make the connection clearer.
5. Embeddings: The cached embeddings are not released at present.
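
To illustrate why assay_batch_size=1 leaves the relative loss with nothing to compare, here is a minimal sketch that counts the within-assay pairs available in a batch. This is not the CliffLoss implementation; the function name and the representation of a batch as a flat list of assay IDs are assumptions made purely for illustration.

```python
from collections import Counter
from math import comb


def count_within_assay_pairs(assay_ids: list[str]) -> int:
    """Count unordered pairs of samples that share an assay ID.

    Hypothetical helper: a pairwise (relative) loss can only use pairs
    drawn from the same assay, so n samples from one assay give C(n, 2) pairs.
    """
    per_assay = Counter(assay_ids)
    return sum(comb(n, 2) for n in per_assay.values())


# One sample per assay: no pairs, so a relative term has nothing to compare.
pairs_with_one = count_within_assay_pairs(["assay_A"])

# Five samples from the same assay: C(5, 2) = 10 pairs.
pairs_with_five = count_within_assay_pairs(["assay_A"] * 5)
```

With one sample per assay the count is 0, while five samples from one assay give 10 pairs, which is consistent with the advice to raise assay_batch_size above 1.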

KirillShmilovich

4 days ago

Excellent! Thank you, @sb-sb, very much appreciated!

Looking forward to your updates!
