Overcoming Sparsity Artifacts in Crosscoders to Interpret Chat-Tuning Paper • 2504.02922 • Published Apr 3
Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning Paper • 2507.16795 • Published Jul 22
Automatically Interpreting Millions of Features in Large Language Models Paper • 2410.13928 • Published Oct 17, 2024