H3BERTa: A CDR-H3-specific Language Model for Antibody Repertoire Analysis

Model ID: Chrode/H3BERTa
Architecture: RoBERTa-base (encoder-only, Masked Language Model)
Sequence type: Heavy chain CDR-H3 regions
Training: Pretrained on >17M curated CDR-H3 sequences from healthy donor repertoires (OAS, IgG/IgA sources)
Max sequence length: 100 amino acids
Vocabulary: 25 tokens (20 standard amino acids + special tokens)
Mask token: [MASK]
Parameters: 85.7M (F32, safetensors)


The official GitHub repository is available here.

Model Overview

H3BERTa is a transformer-based language model trained specifically on the Complementarity-Determining Region 3 of the heavy chain (CDR-H3), the most diverse and functionally critical region of antibodies.
It captures the statistical regularities and biophysical constraints underlying natural antibody repertoires, enabling embedding extraction, variant scoring, and context-aware mutation predictions.


Intended Use

  • Embedding extraction for CDR-H3 repertoire analysis
  • Mutation impact scoring (pseudo-likelihood estimation)
  • Downstream fine-tuning (e.g., broadly neutralizing antibody (bnAb) identification; see the sketch after this list)
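For the fine-tuning use case, a minimal sketch of attaching a classification head to the pretrained encoder. The binary num_labels=2 setup and the bnAb label semantics are illustrative assumptions, not part of the released model; the head weights are newly initialized and must be trained on labeled data.

from transformers import AutoModelForSequenceClassification

# Load the pretrained encoder with a freshly initialized classification head.
# num_labels=2 is an illustrative choice (e.g., bnAb vs. non-bnAb).
clf = AutoModelForSequenceClassification.from_pretrained("Chrode/H3BERTa", num_labels=2)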

How to Use

Input format: CDR-H3 sequences must be provided as plain amino acid strings (e.g., "ARDRSTGGYFDY"), without the initial "C" or terminal "W" residues and without whitespace or separators between amino acids.
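If your sequences come as full IMGT-style junctions (starting with the conserved C and ending with W), a minimal illustrative snippet for trimming them into the expected format; the junction variable and trimming rule are assumptions based on the format note above.

# Illustrative: trim an IMGT-style junction "C...W" to the expected CDR-H3 input
junction = "CARDRSTGGYFDYW"
cdr_h3 = junction[1:-1] if junction.startswith("C") and junction.endswith("W") else junction
print(cdr_h3)  # ARDRSTGGYFDY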

from transformers import AutoTokenizer, AutoModel

model_id = "Chrode/H3BERTa"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)
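As a quick sanity check of the loaded model and tokenizer (assuming the tokenizer splits sequences one token per residue, consistent with the 25-token vocabulary):

import torch

# Encode a single CDR-H3 and inspect the hidden states
enc = tokenizer("ARDRSTGGYFDY", return_tensors="pt")
with torch.no_grad():
    out = model(**enc)
print(enc["input_ids"].shape)       # (1, sequence length + special tokens)
print(out.last_hidden_state.shape)  # (1, sequence length + special tokens, hidden size)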

Example #1: Embedding extraction

Extract per-sequence embeddings useful for clustering, similarity search, or downstream ML models.

from transformers import pipeline
import torch
import numpy as np

feat = pipeline(
    task="feature-extraction",
    model="Chrode/H3BERTa",
    tokenizer="Chrode/H3BERTa",
    device=0 if torch.cuda.is_available() else -1
)

seqs = [
    "ARMGAAREWDFQY",
    "ARDGLGEVAPDYRYGIDV"
]

with torch.no_grad():
    outs = feat(seqs)

# Each pipeline output has shape (1, seq_len, hidden); drop the batch dim,
# then mean-pool across tokens -> one embedding per sequence
embs = [np.array(o)[0].mean(axis=0) for o in outs]
print(len(embs), embs[0].shape)
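With per-sequence embeddings in hand, similarity search reduces to vector comparisons. A minimal cosine-similarity sketch over the two example embeddings above:

# Cosine similarity between the two example CDR-H3 embeddings
a, b = embs
cos = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
print(f"cosine similarity: {cos:.3f}")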

Example #2: Masked Language Modeling (Mutation Scoring)

Predict likely amino acids for masked positions or evaluate single-site mutations.

from transformers import pipeline, AutoTokenizer
import torch

model_id = "Chrode/H3BERTa"
tok = AutoTokenizer.from_pretrained(model_id)

mlm = pipeline(
    task="fill-mask",
    model=model_id,
    tokenizer=tok,
    device=0 if torch.cuda.is_available() else -1
)

# Example: predict the masked residue (wild type "ARDRSTGGYFDY", position 5 masked);
# note: no leading C / trailing W, per the input format above
seq = "ARDRS[MASK]GGYFDY".replace("[MASK]", tok.mask_token)
preds = mlm(seq, top_k=10)

for p in preds:
    print(p["token_str"], round(p["score"], 4))

# Score a specific point mutation
AMINO = list("ACDEFGHIKLMNPQRSTVWY")

def score_point_mutation(seq, idx, mutant_aa):
    """Return the model probability of `mutant_aa` at 0-based position `idx`."""
    masked = seq[:idx] + tok.mask_token + seq[idx+1:]
    # Restrict scoring to the 20 standard amino acids so the mutant token
    # is always among the returned predictions
    preds = mlm(masked, targets=AMINO, top_k=len(AMINO))
    for p in preds:
        if p["token_str"] == mutant_aa:
            return p["score"]
    return 0.0

wt = "ARDRSTGGYFDY"
print("Rโ†’A @ pos 3:", score_point_mutation(wt, 3, "A"))

Citation

If you use this model, please cite:

Rodella C. et al. H3BERTa: A CDR-H3-specific language model for antibody repertoire analysis. Manuscript under review.

License

The model and tokenizer are released under the MIT License. For commercial or large-scale applications, please contact the authors to discuss licensing or collaboration.
