SARM: Interpretable Reward Model via Sparse Autoencoder

Authors (* indicates equal contribution): Shuyi Zhang*, Wei Shi*, Sihang Li*, Jiayi Liao, Tao Liang, Hengxing Cai, Xiang Wang

We release Llama-SARM-4B together with its SAE weights. The score head is left untrained for reproducibility, and its weights are initialized to all zeros for interpretability.
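As a rough illustration of what the zero-initialized score head looks like (the layer shape, SAE feature width, and lack of a bias term below are assumptions for the sketch, not the released configuration):

```python
import torch.nn as nn

# Hypothetical score head: a single linear layer mapping SAE features to a
# scalar reward. The feature width (65536) is illustrative only.
score_head = nn.Linear(in_features=65536, out_features=1, bias=False)
nn.init.zeros_(score_head.weight)  # zero init: every input scores 0 before training
```

With an all-zero head, the initial reward is exactly zero for every input, so any nonzero score after training is attributable to the learned weights over individual SAE features.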
Model: Schrieffer/SARM-4B
- Finetuned from model: Llama-3.1-8B-Instruct
Code Repository: https://github.com/schrieffer-z/sarm
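A minimal loading sketch, assuming the checkpoint can be used as a transformers sequence-classification reward model; the exact loading path and any custom classes are defined in the code repository above:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "Schrieffer/SARM-4B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,  # assumption: custom SARM/SAE code ships with the checkpoint
)

# Score a single conversation; the reward is the scalar classification logit.
conversation = [
    {"role": "user", "content": "What is a sparse autoencoder?"},
    {"role": "assistant", "content": "It learns a sparse, overcomplete feature basis for activations."},
]
input_ids = tokenizer.apply_chat_template(conversation, return_tensors="pt")
with torch.no_grad():
    reward = model(input_ids).logits[0].item()
print(reward)
```

`trust_remote_code=True` is assumed here because SARM inserts an SAE between the backbone and the score head; if loading fails, consult the repository for the intended API.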