Add support for transformers 4.44 through 5.0+

#16

Add support for broader set of transformers versions

This PR updates llama_bidirectional_model.py to support transformers versions 4.44 through 5.0+, replacing the previous requirement of exactly 4.47.1.

Why this change was needed

The previous implementation relied on overriding _update_causal_mask() to create bidirectional attention masks. This approach broke in several ways:

  1. transformers 4.48: The attention refactor (#35235) caused our _attn_implementation = "eager" assignment to take effect, forcing eager attention instead of SDPA
  2. transformers 4.53: The _update_causal_mask method was removed entirely, with masking logic moved to masking_utils
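The fragility of hooking a private method can be illustrated with a minimal sketch (plain Python stand-ins, no transformers import; class names here are illustrative, not from the library). Once the base class stops calling _update_causal_mask, the subclass override is silently ignored:

```python
# Stand-in for transformers <= 4.52: the model's forward still calls
# _update_causal_mask, so overriding it changes the mask.
class BaseModelOld:
    def forward(self, mask):
        return self._update_causal_mask(mask)

    def _update_causal_mask(self, mask):
        return "causal:" + mask

# Stand-in for transformers >= 4.53: _update_causal_mask was removed and
# masking moved elsewhere (masking_utils), so forward never calls it.
class BaseModelNew:
    def forward(self, mask):
        return "causal:" + mask

class Bidirectional(BaseModelOld):
    def _update_causal_mask(self, mask):
        return "bidirectional:" + mask

class BrokenBidirectional(BaseModelNew):
    def _update_causal_mask(self, mask):  # defined, but never called
        return "bidirectional:" + mask

print(Bidirectional().forward("m"))        # bidirectional:m
print(BrokenBidirectional().forward("m"))  # causal:m -- override silently ignored
```

This is why the PR moves the override up to forward(), the one entry point that exists across all supported versions.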

What changed

  • Unified forward() override instead of _update_causal_mask override
  • Introspection-based API detection using inspect.signature() rather than hardcoded version checks
  • Automatic fallback for mask creation: uses create_bidirectional_mask (5.0+) or _prepare_4d_attention_mask (older)
  • Handles API differences across versions:
    • Decoder layer return type (tuple in <4.54, tensor in ≥4.54)
    • Cache parameter name (past_key_value vs past_key_values)
    • DynamicCache constructor signature
  • Removed _attn_implementation = "eager"; users should instead pass an attention implementation via model_kwargs when loading
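The introspection-based approach can be sketched as follows (plain Python stand-ins rather than real transformers layers; the helper names and stand-in signatures are illustrative, modelled on the differences listed above, not copied from the PR):

```python
import inspect

def cache_kwarg(layer_forward):
    """Pick whichever cache keyword the decoder layer's forward accepts,
    instead of branching on a hardcoded transformers version string."""
    params = inspect.signature(layer_forward).parameters
    return "past_key_values" if "past_key_values" in params else "past_key_value"

def layer_output_to_tensor(out):
    """Normalize the decoder layer return type: tuple (<4.54) vs bare tensor (>=4.54)."""
    return out[0] if isinstance(out, tuple) else out

# Stand-in decoder layers exercising both API shapes:
def legacy_layer(hidden_states, past_key_value=None):
    return (hidden_states, None)   # tuple return, old cache keyword

def modern_layer(hidden_states, past_key_values=None):
    return hidden_states           # bare tensor return, new cache keyword

for layer in (legacy_layer, modern_layer):
    kw = cache_kwarg(layer)
    out = layer("h", **{kw: None})
    print(kw, layer_output_to_tensor(out))
```

The same inspect.signature() pattern extends to the other differences listed above (the DynamicCache constructor, and choosing between create_bidirectional_mask and _prepare_4d_attention_mask): probe what the installed version actually accepts, then call it accordingly.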

Testing

Tested with transformers versions: 4.44, 4.47.1, 4.48, 4.53, 4.54, 4.56, 4.57, 5.0.0

Embeddings were verified to be consistent across versions, with expected minor floating-point differences (~1e-4) in 5.0+ due to different mask-creation internals.
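A tolerance-based comparison along these lines captures the check described above (the embedding values below are made up for illustration; the ~1e-3 threshold comfortably absorbs the ~1e-4 drift while still catching real divergence):

```python
def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two embeddings."""
    return max(abs(x - y) for x, y in zip(a, b))

# Illustrative embedding slices from two transformers versions,
# differing only at the ~1e-4 level:
emb_447 = [0.12345, -0.67890, 0.42420]
emb_500 = [0.12351, -0.67895, 0.42414]

assert max_abs_diff(emb_447, emb_500) < 1e-3  # consistent within tolerance
```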

nvidia-oliver-holworthy changed pull request status to open
nvidia-oliver-holworthy changed pull request status to merged