# CARMANIA

## Overview
CARMANIA is a self-supervised genomic language model that augments standard next-token (NT) prediction with a Transition-Matrix (TM) loss. This auxiliary loss aligns the model's predicted token transitions with the empirical bigram statistics of each input sequence, encouraging the model to capture higher-order dependencies and learn organism-specific sequence structure.
The model is designed for DNA sequence modeling and has shown strong performance on both in-domain and out-of-domain genomic tasks.
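The exact formulation of the TM loss is not given here; as an illustration only, the sketch below (function name and the KL-divergence choice are assumptions, not the released implementation) compares a sequence's empirical bigram matrix against the transition matrix implied by the model's per-position next-token distributions:

```python
import numpy as np

def tm_loss(pred_probs: np.ndarray, token_ids: np.ndarray, eps: float = 1e-8) -> float:
    """Illustrative transition-matrix auxiliary loss (assumed form, not the paper's exact one).

    pred_probs: (L, 4) next-nucleotide probabilities at each position.
    token_ids:  (L,) nucleotide indices (0..3) of the input sequence.
    """
    # Model-implied transition matrix: average predicted next-token
    # distribution, conditioned on the current nucleotide.
    model_tm = np.zeros((4, 4))
    for a in range(4):
        mask = token_ids[:-1] == a
        if mask.any():
            model_tm[a] = pred_probs[:-1][mask].mean(axis=0)

    # Empirical bigram matrix from the same sequence, row-normalized.
    emp = np.zeros((4, 4))
    for a, b in zip(token_ids, token_ids[1:]):
        emp[a, b] += 1
    emp = emp / np.clip(emp.sum(axis=1, keepdims=True), 1, None)

    # KL(empirical || model) over rows with observed transitions.
    rows = emp.sum(axis=1) > 0
    return float((emp[rows] * (np.log(emp[rows] + eps) - np.log(model_tm[rows] + eps))).sum())
```

When the model's predictions exactly reproduce the sequence's bigram statistics, this loss is zero, so minimizing it pulls predicted transitions toward the per-sequence ground truth.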
## Pretraining Dataset

### 🔹 Basic Genome Dataset
| Dataset | # Samples | # Bases | Fragment Length (kbp) |
|---|---|---|---|
| Basic Genome | 1,010,237 | 10B | 10 |
- 4,634 genomes across bacteria, archaea, viruses, and eukaryotes
- Extracted as 10 kbp segments
## 🧬 Tokenization & Transition Matrix
- Tokenizer: Single-nucleotide (A, T, C, G) level tokenization to retain fine-grained features such as SNPs.
- Transition Matrix: For each input, we compute a normalized 4Γ4 bigram transition matrix, where each row represents a probability distribution over the next nucleotide. This matrix serves as ground truth for the TM loss and guides the model to learn biologically meaningful dependencies.
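The empirical matrix described above can be computed directly from bigram counts. A minimal NumPy sketch (the function name is illustrative, not part of the released code):

```python
import numpy as np

def transition_matrix(seq: str) -> np.ndarray:
    """Row-normalized 4x4 bigram transition matrix for a DNA sequence.

    Row i is the probability distribution over the next nucleotide,
    given that the current nucleotide is i (order: A, C, G, T).
    """
    idx = {"A": 0, "C": 1, "G": 2, "T": 3}
    counts = np.zeros((4, 4), dtype=np.float64)
    for a, b in zip(seq, seq[1:]):
        counts[idx[a], idx[b]] += 1.0
    row_sums = counts.sum(axis=1, keepdims=True)
    # Leave a row at zero if its nucleotide never precedes another base.
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)
```

Each non-empty row sums to 1, so the matrix can serve directly as the ground-truth target for the TM loss.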
## ⚠️ Requirement: FlashAttention

This release is optimized to run with FlashAttention on Ampere/Ada/Hopper GPUs (e.g., A100, RTX 3090, H100).
If you have a supported GPU, install FlashAttention:

```bash
pip install flash_attn --no-build-isolation
```
## 🔧 Usage

```python
from transformers import AutoModel, AutoTokenizer
import torch

model = AutoModel.from_pretrained(
    "MsAlEhR/carmania-big-10k-prok-genome",
    trust_remote_code=True,
    torch_dtype=torch.float16,  # fixed dtype (or use autocast)
).to("cuda")

tokenizer = AutoTokenizer.from_pretrained(
    "MsAlEhR/carmania-big-10k-prok-genome",
    trust_remote_code=True,
    model_max_length=10000,
)

inputs = tokenizer("ACGTAGGCTA", return_tensors="pt").to("cuda")
outputs = model(**inputs)
```
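The output schema depends on the model's remote code; assuming it exposes per-token hidden states (e.g. `outputs.last_hidden_state`), a sequence-level embedding can be obtained by mean pooling over valid positions. A minimal NumPy illustration of the pooling step alone:

```python
import numpy as np

def mean_pool(hidden_states: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Mean-pool per-token embeddings into one vector per sequence,
    ignoring padded positions.

    hidden_states:  (batch, seq_len, dim)
    attention_mask: (batch, seq_len), 1 for real tokens, 0 for padding
    """
    mask = attention_mask[..., None].astype(hidden_states.dtype)
    return (hidden_states * mask).sum(axis=1) / np.clip(mask.sum(axis=1), 1e-9, None)
```

With a real model output, the same computation would use the returned hidden states and the tokenizer's `attention_mask`.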