---
license: mit
datasets:
- ZINC-22
language:
- en
tags:
- molecular-generation
- drug-discovery
- llama
- flash-attention
pipeline_tag: text-generation
---

# NovoMolGen

NovoMolGen is a family of molecular foundation models trained on 1.5 billion ZINC-22 molecules with Llama architectures and FlashAttention. It achieves state-of-the-art performance on both unconstrained and goal-directed molecule generation tasks.

## How to load

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("chandar-lab/NovoMolGen_32M_SMILES_BPE", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("chandar-lab/NovoMolGen_32M_SMILES_BPE", trust_remote_code=True)
```

## Quick-start (FlashAttention + bf16)

```python
from accelerate import Accelerator

acc = Accelerator(mixed_precision='bf16')
model = acc.prepare(model)

outputs = model.sample(tokenizer=tokenizer, batch_size=4)
print(outputs['SMILES'])
```

## Citation

```bibtex
@misc{chitsaz2025novomolgenrethinkingmolecularlanguage,
      title={NovoMolGen: Rethinking Molecular Language Model Pretraining},
      author={Kamran Chitsaz and Roshan Balaji and Quentin Fournier and Nirav Pravinbhai Bhatt and Sarath Chandar},
      year={2025},
      eprint={2508.13408},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2508.13408},
}
```
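
## Validating generated SMILES (optional)

If you want to sanity-check the sampler output, the sketch below filters the generated strings with RDKit. This is not part of the NovoMolGen API; it assumes RDKit is installed and that `outputs['SMILES']` from the quick-start above is a list of SMILES strings.

```python
from rdkit import Chem

# Keep only strings RDKit can parse into a molecule, and canonicalize
# them so duplicates can be removed by simple string comparison.
valid = []
for smi in outputs["SMILES"]:
    mol = Chem.MolFromSmiles(smi)
    if mol is not None:
        valid.append(Chem.MolToSmiles(mol))

print(f"{len(valid)}/{len(outputs['SMILES'])} generated SMILES are valid")
```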