Corrupted weights?
I have been writing an inference engine for CoDA in Swift/MLX, but it only generates gibberish. I then checked the weights, e.g.:
Layer 23 kNorm loaded: shape=[128], std=1.2598647
RAW kNorm stats: mean=2.0254, min=-0.0121, max=9.0000
I then did the same check via Google Colab and PyTorch, e.g.:
--------------------
Layer: model.layers.23.self_attn.q_proj.weight
Stats: Mean=-0.0001, Std=0.0584, Min=-0.4121, Max=0.4141
--------------------
Layer: model.layers.23.self_attn.k_proj.weight
Stats: Mean=-0.0000, Std=0.0542, Min=-0.3867, Max=0.4062
--------------------
Layer: model.layers.23.self_attn.v_proj.weight
Stats: Mean=0.0001, Std=0.0614, Min=-0.3945, Max=0.3496
--------------------
Layer: model.layers.23.self_attn.o_proj.weight
Stats: Mean=0.0000, Std=0.0566, Min=-0.4785, Max=0.4375
--------------------
Layer: model.layers.23.self_attn.q_norm.weight
Stats: Mean=1.4233, Std=0.5039, Min=-0.0302, Max=2.6094
--------------------
Layer: model.layers.23.self_attn.k_norm.weight
Stats: Mean=2.0254, Std=1.2648, Min=-0.0121, Max=9.0000
--------------------
Layer: model.layers.23.mlp.gate_proj.weight
Stats: Mean=-0.0002, Std=0.0608, Min=-1.3203, Max=0.8750
--------------------
Layer: model.layers.23.mlp.up_proj.weight
Stats: Mean=0.0000, Std=0.0683, Min=-0.7930, Max=0.7422
--------------------
Layer: model.layers.23.mlp.down_proj.weight
Stats: Mean=-0.0000, Std=0.0622, Min=-1.0391, Max=1.1094
--------------------
Layer: model.layers.23.input_layernorm.weight
Stats: Mean=10.5016, Std=5.4567, Min=0.0001, Max=74.5000
--------------------
Layer: model.layers.23.post_attention_layernorm.weight
Stats: Mean=2.0072, Std=0.3130, Min=-0.0005, Max=5.1875
--------------------
I'll gladly provide all the values if needed.
But the question is: Are the weights corrupted?
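For reference, the per-layer numbers above came from a small loop over the model's parameters. A minimal sketch of that kind of dump (the helper name is mine, not from the model repo):

```python
import torch

def tensor_stats(t: torch.Tensor) -> str:
    """Format mean/std/min/max of a weight tensor, computed in float32."""
    t = t.detach().float()
    return (f"Mean={t.mean().item():.4f}, Std={t.std().item():.4f}, "
            f"Min={t.min().item():.4f}, Max={t.max().item():.4f}")

# Usage against the loaded model (parameter names as printed above):
# for name, p in model.named_parameters():
#     if "layers.23" in name:
#         print(f"Layer: {name}\nStats: {tensor_stats(p)}\n" + "-" * 20)
```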
Hi Muzel, thanks for reaching out. Could you share the environment versions you are working with, especially the transformers version?
- transformers: 4.57.1
- torch: 2.8.0+cu126
- Python 3.12
Could you try an older version, say 4.47.1?
With 4.47.1:
Layer: model.layers.23.self_attn.q_proj.weight
Stats: Mean=-0.0001, Std=0.0584, Min=-0.4121, Max=0.4141
--------------------
Layer: model.layers.23.self_attn.k_proj.weight
Stats: Mean=-0.0000, Std=0.0542, Min=-0.3867, Max=0.4062
--------------------
Layer: model.layers.23.self_attn.v_proj.weight
Stats: Mean=0.0001, Std=0.0614, Min=-0.3945, Max=0.3496
--------------------
Layer: model.layers.23.self_attn.o_proj.weight
Stats: Mean=0.0000, Std=0.0566, Min=-0.4785, Max=0.4375
--------------------
Layer: model.layers.23.self_attn.q_norm.weight
Stats: Mean=1.4233, Std=0.5039, Min=-0.0302, Max=2.6094
--------------------
Layer: model.layers.23.self_attn.k_norm.weight
Stats: Mean=2.0254, Std=1.2648, Min=-0.0121, Max=9.0000
--------------------
Layer: model.layers.23.mlp.gate_proj.weight
Stats: Mean=-0.0002, Std=0.0608, Min=-1.3203, Max=0.8750
--------------------
Layer: model.layers.23.mlp.up_proj.weight
Stats: Mean=0.0000, Std=0.0683, Min=-0.7930, Max=0.7422
--------------------
Layer: model.layers.23.mlp.down_proj.weight
Stats: Mean=-0.0000, Std=0.0622, Min=-1.0391, Max=1.1094
--------------------
Layer: model.layers.23.input_layernorm.weight
Stats: Mean=10.5016, Std=5.4567, Min=0.0001, Max=74.5000
--------------------
Layer: model.layers.23.post_attention_layernorm.weight
Stats: Mean=2.0072, Std=0.3130, Min=-0.0005, Max=5.1875
Do you also see similar behavior? We did our post-training and evaluation under transformers 4.47.1 and bfloat16 precision:
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "Salesforce/CoDA-v0-Instruct"
model = AutoModel.from_pretrained(model_name, torch_dtype=torch.bfloat16, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
Sorry, I don't understand what you mean by 'did you experience similar behavior'. I haven't tried running inference with transformers, as I'm using MLX.
I reran the weights logging with exactly your configuration:
Layer: model.layers.23.self_attn.q_proj.weight
Stats: Mean=-0.0001, Std=0.0583, Min=-0.4121, Max=0.4141
--------------------
Layer: model.layers.23.self_attn.k_proj.weight
Stats: Mean=-0.0000, Std=0.0542, Min=-0.3867, Max=0.4062
--------------------
Layer: model.layers.23.self_attn.v_proj.weight
Stats: Mean=0.0001, Std=0.0615, Min=-0.3945, Max=0.3496
--------------------
Layer: model.layers.23.self_attn.o_proj.weight
Stats: Mean=0.0000, Std=0.0566, Min=-0.4785, Max=0.4375
--------------------
Layer: model.layers.23.self_attn.q_norm.weight
Stats: Mean=1.4219, Std=0.5039, Min=-0.0302, Max=2.6094
--------------------
Layer: model.layers.23.self_attn.k_norm.weight
Stats: Mean=2.0312, Std=1.2656, Min=-0.0121, Max=9.0000
--------------------
Layer: model.layers.23.mlp.gate_proj.weight
Stats: Mean=-0.0002, Std=0.0608, Min=-1.3203, Max=0.8750
--------------------
Layer: model.layers.23.mlp.up_proj.weight
Stats: Mean=0.0000, Std=0.0684, Min=-0.7930, Max=0.7422
--------------------
Layer: model.layers.23.mlp.down_proj.weight
Stats: Mean=-0.0000, Std=0.0623, Min=-1.0391, Max=1.1094
--------------------
Layer: model.layers.23.input_layernorm.weight
Stats: Mean=10.5000, Std=5.4688, Min=0.0001, Max=74.5000
--------------------
Layer: model.layers.23.post_attention_layernorm.weight
Stats: Mean=2.0000, Std=0.3125, Min=-0.0005, Max=5.1875
Hi Muzel, sorry for the unclear context. Could you try to replicate the undesired behavior (gibberish output / suspicious weights) by loading the model with transformers 4.47.1 and running inference? I am not an expert in MLX, so I am not sure what happens in your environment.
No, I can't; I don't have the capacity to rewrite the whole framework just to test that. But thanks for helping anyway!