---
license: mit
datasets:
- ZINC-22
language:
- en
tags:
- molecular-generation
- drug-discovery
- llama
- flash-attention
pipeline_tag: text-generation
---

# NovoMolGen

NovoMolGen is a family of molecular foundation models trained on 1.5 billion ZINC-22 molecules with Llama architectures and FlashAttention. It achieves state-of-the-art performance on both unconstrained and goal-directed molecule generation tasks.

## How to load

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("chandar-lab/NovoMolGen_32M_SMILES_BPE", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("chandar-lab/NovoMolGen_32M_SMILES_BPE", trust_remote_code=True)
```

## Quick-start (FlashAttention + bf16)

```python
from accelerate import Accelerator

acc = Accelerator(mixed_precision='bf16')
model = acc.prepare(model)

outputs = model.sample(tokenizer=tokenizer, batch_size=4)
print(outputs['SMILES'])
```

## Citation

```bibtex
@misc{chitsaz2025novomolgenrethinkingmolecularlanguage,
      title={NovoMolGen: Rethinking Molecular Language Model Pretraining},
      author={Kamran Chitsaz and Roshan Balaji and Quentin Fournier and Nirav Pravinbhai Bhatt and Sarath Chandar},
      year={2025},
      eprint={2508.13408},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2508.13408},
}
```
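
## Validating generated SMILES (optional)

If you want to sanity-check the sampler output, the sketch below filters the generated strings with RDKit. This is not part of the NovoMolGen API; it assumes RDKit is installed and that `outputs['SMILES']` from the quick-start above is a list of SMILES strings.

```python
from rdkit import Chem

# Keep only strings RDKit can parse into a molecule, and canonicalize
# them so duplicates can be removed by simple string comparison.
valid = []
for smi in outputs["SMILES"]:
    mol = Chem.MolFromSmiles(smi)
    if mol is not None:
        valid.append(Chem.MolToSmiles(mol))

print(f"{len(valid)}/{len(outputs['SMILES'])} generated SMILES are valid")
```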