Add support for transformers 4.44 through 5.0+
#16 opened by nvidia-oliver-holworthy
Add support for a broader set of transformers versions
This PR updates llama_bidirectional_model.py to support transformers versions 4.44 through 5.0+, replacing the previous requirement of exactly 4.47.1.
Why this change was needed
The previous implementation relied on overriding _update_causal_mask() to create bidirectional attention masks. This approach broke in several ways:
- transformers 4.48: the attention refactor (#35235) activated our `_attn_implementation = "eager"` line, forcing eager attention instead of SDPA
- transformers 4.53: the `_update_causal_mask` method was removed entirely, with the masking logic moved to `masking_utils`
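To illustrate what the removed override was responsible for: a causal mask lets position i attend only to positions ≤ i, while a bidirectional (encoder-style) mask lets every non-padding position attend everywhere. A minimal, library-free sketch of the two mask shapes (the real code builds 4D float masks via transformers utilities):

```python
def causal_mask(seq_len):
    """1 where attention is allowed: position i may only see positions <= i."""
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]

def bidirectional_mask(seq_len, padding=None):
    """Every non-padding position may see every non-padding position."""
    padding = padding or [1] * seq_len  # 1 = real token, 0 = pad
    return [[1 if padding[i] and padding[j] else 0 for j in range(seq_len)]
            for i in range(seq_len)]

print(causal_mask(3))         # [[1, 0, 0], [1, 1, 0], [1, 1, 1]]
print(bidirectional_mask(3))  # [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
```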
What changed
- Unified `forward()` override instead of a `_update_causal_mask` override
- Introspection-based API detection using `inspect.signature()` rather than hardcoded version checks
- Automatic fallback for mask creation: uses `create_bidirectional_mask` (5.0+) or `_prepare_4d_attention_mask` (older)
- Handles API differences across versions:
  - Decoder layer return type (tuple in <4.54, tensor in ≥4.54)
  - Cache parameter name (`past_key_value` vs `past_key_values`)
  - `DynamicCache` constructor signature
- Removed `_attn_implementation = "eager"`; users should pass the attention implementation via `model_kwargs` when loading
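The introspection approach can be sketched with the stdlib alone. Here `layer_forward_old` and `layer_forward_new` are hypothetical stand-ins for the decoder-layer forward signatures before and after the rename, not actual transformers code; the pattern of picking the keyword via `inspect.signature()` and normalizing the return type is what the PR describes:

```python
import inspect

# Hypothetical stand-ins for the two decoder-layer forward signatures.
def layer_forward_old(hidden_states, attention_mask=None, past_key_value=None):
    return (hidden_states,)   # older versions return a tuple

def layer_forward_new(hidden_states, attention_mask=None, past_key_values=None):
    return hidden_states      # newer versions return the tensor directly

def cache_kwarg_name(fn):
    """Pick whichever cache keyword the callable actually accepts."""
    params = inspect.signature(fn).parameters
    return "past_key_values" if "past_key_values" in params else "past_key_value"

def run_layer(fn, hidden_states, cache=None):
    out = fn(hidden_states, **{cache_kwarg_name(fn): cache})
    # Normalize the return type: tuple in <4.54, bare tensor in >=4.54.
    return out[0] if isinstance(out, tuple) else out

print(cache_kwarg_name(layer_forward_old))  # past_key_value
print(cache_kwarg_name(layer_forward_new))  # past_key_values
print(run_layer(layer_forward_old, "h"))    # h
print(run_layer(layer_forward_new, "h"))    # h
```

Detecting what a callable accepts, rather than comparing version strings, means the same code keeps working when a future release renames a parameter back or adds new ones.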
Testing
Tested with transformers versions: 4.44, 4.47.1, 4.48, 4.53, 4.54, 4.56, 4.57, 5.0.0
Embeddings were verified to be consistent across versions, with expected minor floating-point differences (~1e-4) in 5.0+ due to different mask-creation internals.
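A consistency check like the one described amounts to an element-wise tolerance comparison. A minimal sketch: the 1e-4 threshold mirrors the drift reported above, and the vectors are made-up stand-ins for embeddings of the same input under two transformers versions:

```python
def embeddings_close(a, b, atol=1e-4):
    """True if the vectors have equal length and differ by at most atol per element."""
    return len(a) == len(b) and all(abs(x - y) <= atol for x, y in zip(a, b))

emb_old = [0.12345, -0.54321, 0.99999]  # embedding from one version (made up)
emb_new = [0.12349, -0.54318, 0.99993]  # same input on 5.0+ (made up)

print(embeddings_close(emb_old, emb_new))           # True
print(embeddings_close(emb_old, [0.2, -0.5, 1.0]))  # False
```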
nvidia-oliver-holworthy changed pull request status to open
nvidia-oliver-holworthy changed pull request status to merged