---
license: apache-2.0
datasets:
  - Derify/augmented_canonical_druglike_QED_43M
  - Derify/druglike
metrics:
  - roc_auc
  - rmse
library_name: transformers
tags:
  - ChemBERTa
  - cheminformatics
pipeline_tag: fill-mask
model-index:
  - name: Derify/ChemBERTa-druglike
    results:
      - task:
          type: text-classification
          name: Classification (ROC AUC)
        dataset:
          name: BACE
          type: Derify/druglike
        metrics:
          - type: roc_auc
            value: 0.8114
      - task:
          type: text-classification
          name: Classification (ROC AUC)
        dataset:
          name: BBBP
          type: Derify/druglike
        metrics:
          - type: roc_auc
            value: 0.7399
      - task:
          type: text-classification
          name: Classification (ROC AUC)
        dataset:
          name: TOX21
          type: Derify/druglike
        metrics:
          - type: roc_auc
            value: 0.7522
      - task:
          type: text-classification
          name: Classification (ROC AUC)
        dataset:
          name: HIV
          type: Derify/druglike
        metrics:
          - type: roc_auc
            value: 0.7527
      - task:
          type: text-classification
          name: Classification (ROC AUC)
        dataset:
          name: SIDER
          type: Derify/druglike
        metrics:
          - type: roc_auc
            value: 0.6577
      - task:
          type: text-classification
          name: Classification (ROC AUC)
        dataset:
          name: CLINTOX
          type: Derify/druglike
        metrics:
          - type: roc_auc
            value: 0.9660
      - task:
          type: regression
          name: Regression (RMSE)
        dataset:
          name: ESOL
          type: Derify/druglike
        metrics:
          - type: rmse
            value: 0.8241
      - task:
          type: regression
          name: Regression (RMSE)
        dataset:
          name: FREESOLV
          type: Derify/druglike
        metrics:
          - type: rmse
            value: 0.5350
      - task:
          type: regression
          name: Regression (RMSE)
        dataset:
          name: LIPO
          type: Derify/druglike
        metrics:
          - type: rmse
            value: 0.6663
      - task:
          type: regression
          name: Regression (RMSE)
        dataset:
          name: BACE
          type: Derify/druglike
        metrics:
          - type: rmse
            value: 1.0105
      - task:
          type: regression
          name: Regression (RMSE)
        dataset:
          name: CLEARANCE
          type: Derify/druglike
        metrics:
          - type: rmse
            value: 43.4499
---

# ChemBERTa-druglike: Two-phase MLM Pretraining for Drug-like SMILES

## Model Description

This is a ChemBERTa model specifically designed for downstream molecular property prediction and
embedding-based similarity tasks on drug-like molecules.

## Training Procedure

The model was pretrained with a two-phase curriculum learning strategy that gradually increases the difficulty of the pretraining task. The first phase uses a simpler dataset with a lower masking probability, while the second phase uses a more complex dataset with a higher masking probability. This approach lets the model learn robust representations of drug-like molecules while progressively adapting to more challenging objectives.

### Phase 1 – “easy” pretraining

- Dataset: [augmented_canonical_druglike_QED_43M](https://huggingface.co/datasets/Derify/augmented_canonical_druglike_QED_43M)
- Masking probability: 15%
- Training duration: 9 epochs (stopped once the loss plateaued)
- Training procedure: follows the established ChemBERTa and ChemBERTa-2 methodologies

### Phase 2 – “advanced” pretraining

- Dataset: [druglike dataset](https://huggingface.co/datasets/Derify/druglike)
- Masking probability: 40%
- Training duration: until the early-stopping callback triggered (best validation loss at ~18,000 steps); further training degraded the Chem-MRL evaluation score

### Training Configuration

- Optimizer: NVIDIA Apex's FusedAdam
- Scheduler: constant with warmup (10% of steps)
- Batch size: 144 sequences
- Precision: mixed precision (fp16) with tf32 enabled

## Model Objective

This model serves as a specialized backbone for drug-like molecular representation learning, optimized for:

- Molecular similarity tasks
- Drug-like compound analysis
- Chemical space exploration in pharmaceutical contexts

## Evaluation

The model's effectiveness was validated through downstream Chem-MRL training on the [pubchem_10m_genmol_similarity](https://huggingface.co/datasets/Derify/pubchem_10m_genmol_similarity) dataset, measuring the Spearman correlation between transformer embedding similarities and 2048-bit Morgan fingerprint Tanimoto similarities.
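The evaluation criterion above can be sketched in isolation. The snippet below computes Tanimoto similarity on toy fingerprint bit sets and the Spearman correlation between two similarity rankings; the bit sets and "embedding" similarity values are made-up stand-ins for real RDKit Morgan fingerprints and the actual embedding pipeline, which are not shown here.

```python
from itertools import combinations

def tanimoto(a: set, b: set) -> float:
    """Tanimoto similarity of two fingerprint bit sets: |a ∩ b| / |a ∪ b|."""
    return len(a & b) / len(a | b)

def spearman(x: list, y: list) -> float:
    """Spearman correlation: Pearson correlation computed on (tie-averaged) ranks."""
    def ranks(v):
        order = sorted(range(len(v)), key=v.__getitem__)
        r = [0.0] * len(v)
        i = 0
        while i < len(order):
            j = i
            # group tied values and assign them their average rank
            while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
                j += 1
            for k in range(i, j + 1):
                r[order[k]] = (i + j) / 2.0
            i = j + 1
        return r

    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Toy fingerprints (sets of "on" bits) and hypothetical embedding similarities
fps = [{1, 2, 3}, {1, 2, 4}, {5, 6, 7}]
pairs = list(combinations(range(len(fps)), 2))
fp_sims = [tanimoto(fps[i], fps[j]) for i, j in pairs]   # [0.5, 0.0, 0.0]
emb_sims = [0.9, 0.1, 0.2]                               # made-up cosine similarities
print(spearman(fp_sims, emb_sims))                       # ≈ 0.866
```

A high Spearman coefficient here indicates the embedding space preserves the similarity ordering induced by the fingerprints, which is what the Chem-MRL evaluation measures at scale.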
W&B report: [ChemBERTa-druglike evaluation](https://api.wandb.ai/links/ecortes/afh508m3).

## Benchmarks

### Classification Datasets (ROC AUC - higher is better)

| Model                     | BACE↑  | BBBP↑  | TOX21↑ | HIV↑   | SIDER↑ | CLINTOX↑ |
| ------------------------- | ------ | ------ | ------ | ------ | ------ | -------- |
| **Tasks**                 | 1      | 1      | 12     | 1      | 27     | 2        |
| Derify/ChemBERTa-druglike | 0.8114 | 0.7399 | 0.7522 | 0.7527 | 0.6577 | 0.9660   |

### Regression Datasets (RMSE - lower is better)

| Model                     | ESOL↓  | FREESOLV↓ | LIPO↓  | BACE↓  | CLEARANCE↓ |
| ------------------------- | ------ | --------- | ------ | ------ | ---------- |
| **Tasks**                 | 1      | 1         | 1      | 1      | 1          |
| Derify/ChemBERTa-druglike | 0.8241 | 0.5350    | 0.6663 | 1.0105 | 43.4499    |

Benchmarks were conducted using the [chemberta3](https://github.com/deepforestsci/chemberta3) framework. Datasets were split with DeepChem’s scaffold splits and filtered to molecules with SMILES length ≤128, matching the model’s maximum input length. The ChemBERTa-druglike model was fine-tuned for 100 epochs with a learning rate of 3e-5 and a batch size of 32. Each task was run with 3 different random seeds, and the mean performance is reported.
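For reference, the two reported metrics are straightforward to compute. The sketch below is a generic pure-Python illustration (the actual benchmarks use the chemberta3/DeepChem tooling); the toy labels and scores are made up.

```python
import math

def rmse(y_true, y_pred):
    """Root-mean-squared error for the regression tasks."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def roc_auc(labels, scores):
    """ROC AUC via the Mann-Whitney formulation: the probability that a
    randomly chosen positive is scored above a randomly chosen negative
    (ties count as half a win)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: 2 positives, 2 negatives
print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.2]))  # 0.75
print(rmse([3.0, 1.0], [2.0, 2.0]))                  # 1.0
```

For multi-task datasets such as TOX21 (12 tasks) or SIDER (27 tasks), the reported value is aggregated across the per-task scores.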
## Use Cases

- Molecular property prediction
- Drug discovery and development
- Chemical similarity analysis

## Limitations

- Optimized specifically for drug-like molecules
- Performance may vary on non-drug-like chemical compounds

## References

### ChemBERTa Series

```
@misc{chithrananda2020chembertalargescaleselfsupervisedpretraining,
      title={ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction},
      author={Seyone Chithrananda and Gabriel Grand and Bharath Ramsundar},
      year={2020},
      eprint={2010.09885},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2010.09885},
}
```

```
@misc{ahmad2022chemberta2chemicalfoundationmodels,
      title={ChemBERTa-2: Towards Chemical Foundation Models},
      author={Walid Ahmad and Elana Simon and Seyone Chithrananda and Gabriel Grand and Bharath Ramsundar},
      year={2022},
      eprint={2209.01712},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2209.01712},
}
```

```
@misc{singh2025chemberta3opensource,
      title={ChemBERTa-3: An Open Source Training Framework for Chemical Foundation Models},
      author={Singh, R. and Barsainyan, A. A. and Irfan, R. and Amorin, C. J. and He, S. and Davis, T. and others},
      year={2025},
      howpublished={ChemRxiv},
      doi={10.26434/chemrxiv-2025-4glrl-v2},
      note={This content is a preprint and has not been peer-reviewed},
      url={https://doi.org/10.26434/chemrxiv-2025-4glrl-v2}
}
```