H3BERTa: A CDR-H3-specific Language Model for Antibody Repertoire Analysis

Model ID: Chrode/H3BERTa
Architecture: RoBERTa-base (encoder-only, Masked Language Model)
Sequence type: Heavy chain CDR-H3 regions
Training: Pretrained on >17M curated CDR-H3 sequences from healthy donor repertoires (OAS, IgG/IgA sources)
Max sequence length: 100 amino acids
Vocabulary: 25 tokens (20 standard amino acids + special tokens)
Mask token: [MASK]
Parameters: 85.7M (F32, safetensors)


The official GitHub repository is available here.

Model Overview

H3BERTa is a transformer-based language model trained specifically on the Complementarity-Determining Region 3 of the heavy chain (CDR-H3), the most diverse and functionally critical region of antibodies.
It captures the statistical regularities and biophysical constraints underlying natural antibody repertoires, enabling embedding extraction, variant scoring, and context-aware mutation predictions.


Intended Use

  • Embedding extraction for CDR-H3 repertoire analysis
  • Mutation impact scoring (pseudo-likelihood estimation)
  • Downstream fine-tuning (e.g., broadly neutralizing antibody (bnAb) identification; see the sketch after this list)
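For the fine-tuning use case, a minimal sketch of attaching a classification head to the pretrained encoder. The binary num_labels=2 setup and the bnAb label semantics are illustrative assumptions, not part of the released model; the head weights are newly initialized and must be trained on labeled data.

from transformers import AutoModelForSequenceClassification

# Load the pretrained encoder with a freshly initialized classification head.
# num_labels=2 is an illustrative choice (e.g., bnAb vs. non-bnAb).
clf = AutoModelForSequenceClassification.from_pretrained("Chrode/H3BERTa", num_labels=2)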

How to Use

Input format: CDR-H3 sequences must be provided as plain amino acid strings (e.g., "ARDRSTGGYFDY"), without the initial "C" or terminal "W" residues and without whitespace or separators between amino acids.
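If your sequences come as full IMGT-style junctions (starting with the conserved C and ending with W), a minimal illustrative snippet for trimming them into the expected format; the junction variable and trimming rule are assumptions based on the format note above.

# Illustrative: trim an IMGT-style junction "C...W" to the expected CDR-H3 input
junction = "CARDRSTGGYFDYW"
cdr_h3 = junction[1:-1] if junction.startswith("C") and junction.endswith("W") else junction
print(cdr_h3)  # ARDRSTGGYFDY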

from transformers import AutoTokenizer, AutoModel

model_id = "Chrode/H3BERTa"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)
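As a quick sanity check of the loaded model and tokenizer (assuming the tokenizer splits sequences one token per residue, consistent with the 25-token vocabulary):

import torch

# Encode a single CDR-H3 and inspect the hidden states
enc = tokenizer("ARDRSTGGYFDY", return_tensors="pt")
with torch.no_grad():
    out = model(**enc)
print(enc["input_ids"].shape)       # (1, sequence length + special tokens)
print(out.last_hidden_state.shape)  # (1, sequence length + special tokens, hidden size)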

Example #1: Embedding extraction

Extract per-sequence embeddings useful for clustering, similarity search, or downstream ML models.

from transformers import pipeline
import torch
import numpy as np

feat = pipeline(
    task="feature-extraction",
    model="Chrode/H3BERTa",
    tokenizer="Chrode/H3BERTa",
    device=0 if torch.cuda.is_available() else -1
)

seqs = [
    "ARMGAAREWDFQY",
    "ARDGLGEVAPDYRYGIDV"
]

with torch.no_grad():
    outs = feat(seqs)

# Each pipeline output has shape (1, seq_len, hidden); drop the batch dim,
# then mean-pool across tokens -> one embedding per sequence
embs = [np.array(o)[0].mean(axis=0) for o in outs]
print(len(embs), embs[0].shape)
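With per-sequence embeddings in hand, similarity search reduces to vector comparisons. A minimal cosine-similarity sketch over the two example embeddings above:

# Cosine similarity between the two example CDR-H3 embeddings
a, b = embs
cos = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
print(f"cosine similarity: {cos:.3f}")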

Example #2: Masked Language Modeling (Mutation Scoring)

Predict likely amino acids for masked positions or evaluate single-site mutations.

from transformers import pipeline, AutoTokenizer
import torch

model_id = "Chrode/H3BERTa"
tok = AutoTokenizer.from_pretrained(model_id)

mlm = pipeline(
    task="fill-mask",
    model=model_id,
    tokenizer=tok,
    device=0 if torch.cuda.is_available() else -1
)

# Example: predict the masked residue (wild type "ARDRSTGGYFDY", position 5 masked);
# note: no leading C / trailing W, per the input format above
seq = "ARDRS[MASK]GGYFDY".replace("[MASK]", tok.mask_token)
preds = mlm(seq, top_k=10)

for p in preds:
    print(p["token_str"], round(p["score"], 4))

# Score a specific point mutation
AMINO = list("ACDEFGHIKLMNPQRSTVWY")

def score_point_mutation(seq, idx, mutant_aa):
    """Return the model probability of `mutant_aa` at 0-based position `idx`."""
    masked = seq[:idx] + tok.mask_token + seq[idx+1:]
    # Restrict scoring to the 20 standard amino acids so the mutant token
    # is always among the returned predictions
    preds = mlm(masked, targets=AMINO, top_k=len(AMINO))
    for p in preds:
        if p["token_str"] == mutant_aa:
            return p["score"]
    return 0.0

wt = "ARDRSTGGYFDY"
print("Rโ†’A @ pos 3:", score_point_mutation(wt, 3, "A"))

Citation

If you use this model, please cite:

Rodella C. et al. H3BERTa: A CDR-H3-specific language model for antibody repertoire analysis. Manuscript under review.

License

The model and tokenizer are released under the MIT License. For commercial or large-scale applications, please contact the authors to discuss licensing or collaboration.
