---
license: apache-2.0
pipeline_tag: translation
tags:
- chemistry
- biology
---

# **Contributors**

- Sebastian Lindner (GitHub [@Bienenwolf655](https://github.com/Bienenwolf655); Twitter @)
- Michael Heinzinger (GitHub @mheinzinger; Twitter @)
- Noelia Ferruz (GitHub @noeliaferruz; Twitter @ferruz_noelia; Webpage: www.aiproteindesign.com)

# **REXyme: A Translation Machine for the Generation of New-to-Nature Enzymes**
**Work in Progress**

REXyme (Reaction to Enzyme) (manuscript in preparation) is a translation machine for the generation of enzymes that catalyze user-defined reactions.
It is possible to provide fine-grained input at the substrate level.
Akin to how translation machines have learned to translate between complex language pairs with great success,
even pairs that diverge in their representation at the character level (e.g., Japanese - English), we posit that an advanced architecture will
be able to translate between the chemical and sequence spaces. REXyme was trained on a set of xx reactions and yy enzyme pairs, and it produces
sequences that putatively perform their intended reactions.

To run it, you will need to provide a reaction in the SMILES format (simplified molecular-input line-entry system),
which you can do online here: xxxx
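For readers unfamiliar with the notation, a reaction SMILES string lists reactants, agents, and products separated by `>` characters. The sketch below uses an illustrative textbook esterification that is not taken from the REXyme training set:

```python
# A reaction SMILES encodes reactants > agents > products, with "." separating
# molecules within each group. Illustrative example (not from the training set):
# Fischer esterification of acetic acid with ethanol.
reaction = "CC(=O)O.CCO>>CC(=O)OCC.O"

reactants, agents, products = reaction.split(">")
print(reactants.split("."))  # ['CC(=O)O', 'CCO']   acetic acid + ethanol
print(products.split("."))   # ['CC(=O)OCC', 'O']   ethyl acetate + water
```

The empty middle field (the `>>`) simply means no agents are specified.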

We are still working on the analysis of the model for different tasks, including experimental testing.
See below for information about the model's performance on different in-silico tasks and for how to generate your own enzymes.

## **Model description**
REXyme is based on the [Efficient T5 Transformer](xx) architecture (which in turn is very similar to the current version of Google Translate)
and contains xx layers
with a model dimensionality of xx, totaling xx million parameters.

REXyme is a translation machine trained on the xx database, containing xx reaction-enzyme pairs.
The pre-training was done on pairs of SMILES and ... (FASTA headers?).

REXyme was trained with a sequence-to-sequence objective, i.e., the decoder autoregressively predicts each token
of the enzyme sequence conditioned on the encoder's representation of the reaction. Hence,
the model learns the dependencies among protein sequence features that enable a specific enzymatic reaction.

Sebastian, check if this applies: There are stark differences in the number of members among EC classes, and for this reason, we also tokenized the EC numbers.
In this manner, the EC numbers '2.7.1.1' and '2.7.1.2' share their first three tokens (six, including separators), and hence the model can infer that
the two classes are related.
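The tokenization described above can be sketched as follows; this is a hypothetical illustration of the scheme, not the model's actual tokenizer:

```python
# Hypothetical sketch of EC-number tokenization: split each EC number into its
# four digit groups plus the "." separators, so related classes share a prefix.
def tokenize_ec(ec_number):
    tokens = []
    for i, part in enumerate(ec_number.split(".")):
        if i > 0:
            tokens.append(".")
        tokens.append(part)
    return tokens

a = tokenize_ec("2.7.1.1")  # ['2', '.', '7', '.', '1', '.', '1']
b = tokenize_ec("2.7.1.2")  # ['2', '.', '7', '.', '1', '.', '2']

# The two classes share their first three digit groups, i.e. six tokens
# including separators.
shared = 0
for x, y in zip(a, b):
    if x != y:
        break
    shared += 1
print(shared)  # 6
```

Because the shared prefix maps to identical tokens, the model sees related EC classes as overlapping inputs rather than unrelated symbols.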

The figure below summarizes the training process: (add figure)

## **Model Performance**

- explain dataset curation
- general descriptors (ESMFold, IUPred, ...)
- second pgp
- mmseqs (average?)

## **How to generate from REXyme**
REXyme can be used with the Hugging Face `transformers` Python package.
Detailed installation instructions can be found here: https://huggingface.co/docs/transformers/installation

Since REXyme has been trained on a machine-translation objective, users must provide the chemical reaction as input, specified in the SMILES format.

[please seb include snippet to generate sequences]
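Until the official snippet is added, here is a minimal sketch using the standard `transformers` seq2seq API. The checkpoint name, the input formatting, and the generation parameters are all assumptions and will likely change once the model is released:

```python
# Hypothetical sketch: the checkpoint name and input formatting are assumptions,
# pending the official example. REXyme is a T5-style encoder-decoder, so the
# standard sequence-to-sequence generation API applies.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

checkpoint = "REXyme"  # placeholder: replace with the released checkpoint name
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# Reaction SMILES for the desired transformation (illustrative example).
reaction = "CC(=O)O.CCO>>CC(=O)OCC.O"

inputs = tokenizer(reaction, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=512,      # enzyme sequences are long; adjust as needed
    do_sample=True,          # sampling yields diverse candidate sequences
    top_p=0.95,
    num_return_sequences=5,
)
for seq in tokenizer.batch_decode(outputs, skip_special_tokens=True):
    print(seq)
```

Sampling parameters such as `top_p` and `num_return_sequences` control the diversity and number of candidate enzymes per reaction.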


## **A word of caution**

- We have not yet fully tested the ability of the model to generate new-to-nature enzymes, i.e., enzymes
for chemical reactions that do not appear in nature (and hence neither in the training set). While this is the intended objective of our work,
it is very much a work in progress. We will update the model and documentation shortly.