Abstract
RAPTOR, a ridge-adaptive logistic probe, achieves accurate and stable concept vector estimation for activation steering in frozen LLMs with reduced training costs, supported by theoretical analysis of ridge logistic regression in high-dimensional settings.
Probing studies what information is encoded in a frozen LLM's layer representations by training a lightweight predictor on top of them. Beyond analysis, probes are often used operationally in probe-then-steer pipelines: a learned concept vector is extracted from a probe and injected via additive activation steering by adding it to a layer representation during the forward pass. The effectiveness of this pipeline hinges on estimating concept vectors that are accurate, directionally stable under ablation, and inexpensive to obtain. Motivated by these desiderata, we propose RAPTOR (Ridge-Adaptive Logistic Probe), a simple L2-regularized logistic probe whose validation-tuned ridge strength yields concept vectors from normalized weights. Across extensive experiments on instruction-tuned LLMs and human-written concept datasets, RAPTOR matches or exceeds strong baselines in accuracy while achieving competitive directional stability and substantially lower training cost; these quantitative results are supported by qualitative downstream steering demonstrations. Finally, using the Convex Gaussian Min-max Theorem (CGMT), we provide a mechanistic characterization of ridge logistic regression in an idealized Gaussian teacher-student model in the high-dimensional few-shot regime, explaining how penalty strength mediates probe accuracy and concept-vector stability and yielding structural predictions that qualitatively align with trends observed on real LLM embeddings.
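The probe-then-steer pipeline the abstract describes can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names and the steering coefficient `alpha` are assumptions, and real usage would add the vector to a hidden state inside a transformer's forward pass rather than to a standalone array.

```python
import numpy as np

def concept_vector(w):
    """Turn probe weights into a unit-norm concept vector (assumed recipe: w / ||w||)."""
    return w / np.linalg.norm(w)

def steer(h, v, alpha=2.0):
    """Additive activation steering: inject alpha * v into a layer representation h.

    h: (d,) hidden state at the intervention layer
    v: (d,) unit-norm concept vector extracted from a probe
    alpha: steering strength (illustrative value; tuned per task in practice)
    """
    return h + alpha * v

# Toy demonstration on random vectors standing in for LLM activations.
rng = np.random.default_rng(0)
w = rng.normal(size=8)            # fitted probe weights
v = concept_vector(w)             # unit-norm steering direction
h = rng.normal(size=8)            # a "hidden state"
h_steered = steer(h, v, alpha=2.0)
```

In a real pipeline the same `steer` logic would typically live in a forward hook registered on the chosen layer.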
Community
We propose RAPTOR (Ridge-Adaptive Logistic Probe), a simple L2-regularized logistic probe for better additive steering.
Part 1 [Paper Review]
Summary
RAPTOR introduces a ridge-adaptive approach to logistic probing, designed to improve the extraction and interpretation of features from the internal representations of Large Language Models (LLMs). By dynamically adjusting the regularization strength (the "ridge" parameter) based on the specific statistical properties of activation layers, the method aims to provide more robust and reliable probes than traditional fixed-regularization or heuristically-tuned linear classifiers.
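To make the summary concrete, here is a minimal sketch of the core recipe as described: fit an L2-regularized logistic probe for each penalty on a grid, pick the penalty by validation accuracy, and normalize the winning weights into a concept vector. The penalty grid, plain gradient-descent optimizer, and synthetic Gaussian data are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def fit_ridge_logistic(X, y, lam, steps=500, lr=0.1):
    """Minimize logistic loss + (lam/2)||w||^2 by gradient descent."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))        # sigmoid predictions
        grad = X.T @ (p - y) / n + lam * w      # logistic gradient + ridge term
        w -= lr * grad
    return w

def ridge_adaptive_probe(X_tr, y_tr, X_val, y_val, lams=(1e-3, 1e-2, 1e-1, 1.0)):
    """Select the ridge strength on held-out data; return a unit-norm concept vector."""
    best_acc, best_w = -1.0, None
    for lam in lams:
        w = fit_ridge_logistic(X_tr, y_tr, lam)
        acc = np.mean((X_val @ w > 0) == y_val)
        if acc > best_acc:
            best_acc, best_w = acc, w
    return best_w / np.linalg.norm(best_w), best_acc

# Synthetic "activations": two Gaussian clusters separated along a planted direction mu.
rng = np.random.default_rng(0)
d, n = 32, 200
mu = rng.normal(size=d); mu /= np.linalg.norm(mu)
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, d)) + 1.5 * np.outer(2 * y - 1, mu)
v, acc = ridge_adaptive_probe(X[:150], y[:150], X[150:], y[150:])
```

On this toy teacher-student setup the recovered direction `v` aligns with the planted `mu`, which is the regime the paper's CGMT analysis formalizes.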
Key Contributions
- Adaptive Regularization Framework: The primary strength of the work lies in moving beyond the "one-size-fits-all" approach to probe regularization. By automating the selection of the $\lambda$ parameter, RAPTOR accounts for the varying scales and noise levels found across different transformer layers and model depths.
- Enhanced Interpretability Stability: The paper demonstrates that adaptive probing leads to more consistent feature identification. This reduces the risk of "over-probing," where a classifier might pick up on spurious correlations or noise in high-dimensional activation spaces rather than the intended semantic features.
- Computational Efficiency: The authors provide a computationally feasible pathway for applying ridge-adaptive logic across massive datasets and large-scale models, addressing a common bottleneck in interpretability research where cross-validation for every individual probe is prohibitively expensive.
Limitations and Concerns
- The Linearity Assumption: Like all linear probing methods, RAPTOR is constrained by the assumption that the information of interest is linearly separable within the activation space. While this is a standard paradigm in interpretability, the paper could more deeply discuss scenarios where linear probes (even adaptive ones) fundamentally fail to capture "computationally latent" features that require non-linear readout.
- Baseline Comparisons: While the method shows improvement over standard logistic regression with fixed defaults, the margin of improvement over a rigorous (though more expensive) grid-search cross-validation approach remains a point of scrutiny. The paper would benefit from a clearer breakdown of when the "adaptivity" provides a statistical advantage versus merely a speed advantage.
- Sensitivity to Activation Scaling: Neural network activations often exhibit significant outliers or specific distributions (e.g., "outlier dimensions" in LLMs). It is not entirely clear how RAPTOR handles these extreme values and whether the ridge penalty is sufficient to prevent these dimensions from dominating the probe's weight matrix.
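One simple mitigation for the outlier-dimension concern raised above (my suggestion, not something the paper proposes) is to winsorize each activation dimension before fitting, since the ridge penalty shrinks all coordinates at the same rate and cannot by itself stop an extreme-scale dimension from dominating the weights:

```python
import numpy as np

def winsorize_columns(X, lo=1.0, hi=99.0):
    """Clip each activation dimension to its [lo, hi] percentile range.

    A crude guard against LLM 'outlier dimensions': extreme values are
    pulled back to the bulk of the distribution before the probe is fit.
    """
    lower = np.percentile(X, lo, axis=0)
    upper = np.percentile(X, hi, axis=0)
    return np.clip(X, lower, upper)

# Toy example: one dimension carries extreme outliers.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
X[:5, 2] *= 100.0                      # inject outliers in dimension 2
X_clipped = winsorize_columns(X)
```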
Overall Assessment
RAPTOR is a timely and technically sound contribution to the field of mechanistic interpretability. As the community shifts from simple performance benchmarking to understanding how models represent knowledge, the reliability of the tools used for that "microscopic" look becomes paramount. By professionalizing the probing process through adaptive regularization, this work provides a more principled foundation for future interpretability studies. It is a solid, incremental improvement that addresses a practical pain point for researchers working with high-dimensional model internals.
Rating (Informal)
7/10
Part 2 [Constructive Suggestions]
Clarification Suggestions
- Regularization Rationale: It would be helpful to clarify why Ridge ($L_2$) was chosen over Lasso ($L_1$) or Elastic Net. In many interpretability contexts, researchers prefer the sparsity of $L_1$ to identify a "minimal set" of neurons. Explaining why a dense, ridge-based distribution of weights is more appropriate for LLM feature extraction would strengthen the theoretical framing.
- Activation Pre-processing: Provide more detail on whether activations were centered or normalized before probing. Since the ridge penalty is scale-sensitive, the impact of layer normalization within the transformer on RAPTOR's adaptive mechanism deserves an explicit discussion.
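To make the pre-processing suggestion concrete: a standard step (assumed here, not confirmed by the paper) is to center and unit-scale each dimension using training-split statistics only, reusing those statistics on the validation split so no information leaks into the penalty selection:

```python
import numpy as np

def standardize(X_train, X_val, eps=1e-8):
    """Center and scale per dimension using training statistics only.

    Because the ridge penalty shrinks all coordinates at the same rate,
    high-variance dimensions are effectively penalized less; standardizing
    puts every activation dimension on an equal footing.
    """
    mu = X_train.mean(axis=0)
    sd = X_train.std(axis=0) + eps
    return (X_train - mu) / sd, (X_val - mu) / sd

# Toy activations with a shifted, inflated scale.
rng = np.random.default_rng(0)
X_tr = rng.normal(loc=3.0, scale=5.0, size=(64, 8))
X_va = rng.normal(loc=3.0, scale=5.0, size=(16, 8))
Z_tr, Z_va = standardize(X_tr, X_va)
```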
Experimental Extensions
- Out-of-Distribution (OOD) Robustness: A compelling way to prove the value of adaptive regularization is to show that RAPTOR probes generalize better to OOD data or different datasets than standard probes. If RAPTOR truly captures a more "natural" feature boundary, it should be less prone to overfitting on the training distribution.
- Scaling Laws for Probing: Investigating how the optimal $\lambda$ found by RAPTOR changes as model size increases (e.g., comparing Llama-3 8B vs. 70B) could provide interesting insights into how "dense" or "noisy" representations become as models scale.
Positioning Advice
- Tooling vs. Discovery: The authors should decide if RAPTOR is being marketed primarily as a diagnostic tool (to help researchers build better probes faster) or a scientific discovery (revealing something new about how LLMs represent data). If it is the former, focusing more on the "ease of use" and "reduction in researcher degrees of freedom" would be a very persuasive angle for the interpretability community.
- Comparison to "Top-K" and Sparse Autoencoders (SAEs): Recent trends have shifted toward SAEs for feature extraction. Positioning RAPTOR as a complementary, more lightweight alternative for supervised feature identification—rather than a competitor to unsupervised methods—would help readers understand exactly where this fits in the current research landscape.