---
license: mit
datasets:
- ZINC-22
language:
- en
tags:
- molecular-generation
- drug-discovery
- llama
- flash-attention
pipeline_tag: text-generation
---

# NovoMolGen

NovoMolGen is a family of molecular foundation models trained on 1.5 billion molecules from ZINC-22 using Llama architectures and FlashAttention. It achieves state-of-the-art performance on both unconstrained and goal-directed molecule generation tasks.

## How to load

```python
>>> from transformers import AutoTokenizer, AutoModelForCausalLM
>>> tokenizer = AutoTokenizer.from_pretrained("chandar-lab/NovoMolGen_32M_SAFE_BPE", trust_remote_code=True)
>>> model = AutoModelForCausalLM.from_pretrained("chandar-lab/NovoMolGen_32M_SAFE_BPE", trust_remote_code=True)
```
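
The checkpoint name encodes the molecular representation (SAFE) and the tokenizer (BPE). As a quick sanity check that the pair loaded correctly, you can round-trip an arbitrary string through the tokenizer; this is only a sketch, assuming the loaded tokenizer exposes the standard Hugging Face `__call__`/`decode` interface (the benzene string below is just an illustration, not taken from the model card).

```python
>>> # Minimal sanity check of the SAFE/BPE tokenizer (assumes a standard HF tokenizer interface).
>>> enc = tokenizer("c1ccccc1", return_tensors="pt")
>>> enc["input_ids"].shape
>>> tokenizer.decode(enc["input_ids"][0], skip_special_tokens=True)
```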

## Quick-start (FlashAttention + bf16)

```python
>>> from accelerate import Accelerator

>>> # Run in bf16 mixed precision (FlashAttention kernels require fp16 or bf16).
>>> acc = Accelerator(mixed_precision='bf16')
>>> model = acc.prepare(model)

>>> # Sample a small batch of molecules in the model's native SAFE representation.
>>> outputs = model.sample(tokenizer=tokenizer, batch_size=4)
>>> print(outputs['SAFE'])
```
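
`model.sample` returns molecules as SAFE strings. If you want plain SMILES, one option is the `safe` package from datamol-io; the snippet below is only a sketch, assuming `safe.decode` returns a SMILES string and may raise on strings it cannot parse (the package is not a dependency of this checkpoint).

```python
>>> # Sketch: decode sampled SAFE strings into SMILES with the `safe` package (pip install safe-mol).
>>> import safe as sf

>>> smiles = []
>>> for s in outputs['SAFE']:
...     try:
...         smiles.append(sf.decode(s))
...     except Exception:
...         smiles.append(None)  # keep the list aligned with the sampled batch
>>> print(smiles)
```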

## Transformers-native HF checkpoint (`revision="hf-checkpoint"`)

We also publish a Transformers-native checkpoint on the `hf-checkpoint` revision. This version loads directly with `AutoModelForCausalLM` and works out of the box with `.generate(...)`.

```python
>>> import torch
>>> from transformers import AutoTokenizer, AutoModelForCausalLM

>>> model = AutoModelForCausalLM.from_pretrained("chandar-lab/NovoMolGen_32M_SAFE_BPE", revision='hf-checkpoint', device_map='auto')
>>> tokenizer = AutoTokenizer.from_pretrained("chandar-lab/NovoMolGen_32M_SAFE_BPE", revision='hf-checkpoint')

>>> # Start every sequence from the BOS token and sample 4 molecules.
>>> input_ids = torch.tensor([[tokenizer.bos_token_id]]).expand(4, -1).contiguous().to(model.device)
>>> outs = model.generate(input_ids=input_ids, do_sample=True, temperature=1.0, top_k=0, top_p=1.0, max_length=64, pad_token_id=tokenizer.eos_token_id)

>>> # Decoding inserts spaces between tokens; strip them to recover the molecule strings.
>>> molecules = [t.replace(" ", "") for t in tokenizer.batch_decode(outs, skip_special_tokens=True)]
>>> molecules
['CCO[C@H](CNC(=O)N(CC(=O)OC(C)(C)C)c1cccc(Br)n1)C(F)(F)F',
 'CCn1nnnc1CNc1ncnc(N[C@H]2CCO[C@@H](C)C2)c1C',
 'CC(C)(O)CNC(=O)CC[C@H]1C[C@@H](NC(=O)COCC(F)F)C1',
 'Cc1ncc(C(=O)N2C[C@H]3[C@H](CNC(=O)c4cnn[nH]4)CCC[C@H]3C2)n1C']
```
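
Since sampling is stochastic, it is often useful to check how many of the generated strings are chemically valid. The snippet below is a small sketch using RDKit (not a dependency of this checkpoint) to parse, canonicalize, and count the valid generations.

```python
>>> # Sketch: validate and canonicalize generated molecules with RDKit (pip install rdkit).
>>> from rdkit import Chem

>>> mols = [Chem.MolFromSmiles(s) for s in molecules]
>>> valid = [Chem.MolToSmiles(m) for m in mols if m is not None]
>>> print(f"{len(valid)}/{len(molecules)} valid, {len(set(valid))} unique")
```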

## Citation

```bibtex
@misc{chitsaz2025novomolgenrethinkingmolecularlanguage,
      title={NovoMolGen: Rethinking Molecular Language Model Pretraining},
      author={Kamran Chitsaz and Roshan Balaji and Quentin Fournier and Nirav Pravinbhai Bhatt and Sarath Chandar},
      year={2025},
      eprint={2508.13408},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2508.13408},
}
```