Arcade-3B: SLM Optimization via Orthogonal Decoupling of Latent State Spaces

Community Article Published March 15, 2026

In parameter-constrained Small Language Models (SLMs), it is often difficult for the model to effectively distinguish between "task state representation" and "underlying logical constraints" within high-dimensional search spaces. Traditional fine-tuning methods frequently lead to coupling conflicts between these two in the Latent Space, which limits the model's convergence ceiling.

Arcade-3B introduces the SC-OrthFine architecture, with the core objective of achieving decoupling of state-space search: it forcibly projects the model's search behavior into mutually orthogonal State Vectors and Constraint Vectors.

1. The Coupling Dilemma in State-Space Search

In a 3B-scale model, the hidden state output $H$ carries extremely high information density. During gradient backpropagation, the traditional cross-entropy loss $L_{ce}$ adjusts weights indiscriminately to fit the target distribution. However, when handling logical reasoning (e.g., GSM8K) or code generation (e.g., HumanEval), the model must simultaneously process:

  1. Semantic State ($S$): generating the contextual representation of the current token.
  2. Logical Constraints ($C$): adhering to syntax, mathematical rules, and long-range structural dependencies.

When these two overlap on the same manifold, the search behavior suffers significant interference.


2. SC-Orthogonal: Orthogonal Projection Decoupling Mechanism

To address the issues mentioned above, we designed the SC-Orthogonal Optimization Loop. Its core logic involves splitting the hidden state $H \in \mathbb{R}^{B \times L \times D}$ along the feature dimension to define two independent subspaces:

  • State Projection Half (State Half, $S$): focuses on the feature representation for instantaneous prediction.
  • Constraint Projection Half (Constraint Half, $C$): carries global logical boundaries and structural constraints.
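The split described above can be sketched in a few lines of PyTorch. This is an illustrative reconstruction, not the released Arcade-3B code; the tensor shapes follow the $H \in \mathbb{R}^{B \times L \times D}$ definition, and the toy sizes are arbitrary:

```python
import torch

# Toy dimensions: batch, sequence length, hidden size (D must be even).
B, L, D = 2, 8, 16
H = torch.randn(B, L, D)  # stand-in for the model's hidden state output

# Split H along the feature dimension into the State half S
# and the Constraint half C, each of width D/2.
S, C = H.split(D // 2, dim=-1)

print(S.shape, C.shape)  # torch.Size([2, 8, 8]) torch.Size([2, 8, 8])
```

Concatenating `S` and `C` back along the last dimension recovers `H` exactly, so the split itself adds no parameters; the decoupling comes entirely from the loss term defined next.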

Mathematical Definition and Loss Function

To ensure the decoupling of search behavior, we introduce an orthogonality constraint. By minimizing the inner product of $S$ and $C$, we force them to maintain $90^\circ$ orthogonality in a geometric sense:

$$\text{Dot} = S \cdot C = \sum_{i=1}^{D/2} S_i C_i$$

To implement this constraint during the training process, we define the orthogonality loss function $L_{orth}$:

$$L_{orth} = \frac{1}{B \cdot L} \sum_{b,l} \left(S_{b,l} \cdot C_{b,l}\right)^2$$

The final joint optimization objective function is:

$$L_{total} = L_{ce} + \lambda \cdot L_{orth}$$

By introducing the orthogonal penalty term regulated by $\lambda$, the model is forced to perform parameter searches within mutually independent subspaces, thereby avoiding feature collapse.
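The two loss terms above can be combined in a short PyTorch sketch. This is a minimal illustration of the formulas, assuming a standard token-level cross-entropy over logits; the function and variable names (`orthogonality_loss`, `lam`) are my own, not from the Arcade-3B release:

```python
import torch
import torch.nn.functional as F

def orthogonality_loss(H: torch.Tensor) -> torch.Tensor:
    """L_orth = mean over (b, l) of (S_{b,l} . C_{b,l})^2,
    where S and C are the two halves of H along the feature dim."""
    S, C = H.split(H.size(-1) // 2, dim=-1)
    dot = (S * C).sum(dim=-1)   # inner product at each (b, l) position
    return dot.pow(2).mean()    # average over batch and sequence

# Toy tensors standing in for model outputs (V = vocab size).
B, L, D, V = 2, 8, 16, 100
H = torch.randn(B, L, D, requires_grad=True)
logits = torch.randn(B, L, V, requires_grad=True)
targets = torch.randint(0, V, (B, L))

lam = 0.1  # illustrative value for the penalty weight lambda
l_ce = F.cross_entropy(logits.reshape(-1, V), targets.reshape(-1))
l_total = l_ce + lam * orthogonality_loss(H)
l_total.backward()  # gradients flow through both loss terms
```

Since $L_{orth}$ is a squared quantity, it is non-negative and vanishes exactly when every per-position inner product $S_{b,l} \cdot C_{b,l}$ is zero, i.e., when the two halves are pointwise orthogonal.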


3. Experimental Analysis: Performance Gains from Decoupling

Experimental results indicate that this state-space decoupling is particularly prominent in logic-intensive tasks:

  • Robustness in Logical Reasoning: On the GSM8K benchmark, Arcade-3B achieves an accuracy of 62.9%, indicating that the orthogonal constraint helps the model isolate mathematical logical constraints from language-generation states, reducing "hallucination" interference during reasoning.
  • Coding Efficiency: On HumanEval, the score of 41.5% significantly outperforms same-scale models that do not employ orthogonal decoupling (e.g., Qwen1.5-1.8B at 27.4%), suggesting that orthogonal subspaces offer higher search efficiency for complex structured data.

| Benchmark | Arcade-3B | Gemma-2-2B | Llama-2-7B |
|-----------|-----------|------------|------------|
| MMLU      | 52.9%     | 52.4%      | 45.3%      |
| GSM8K     | 62.9%     | 50.9%      | 14.6%      |
| HumanEval | 41.5%     | 32.3%      | 12.8%      |

Conclusion

The technical path of Arcade-3B demonstrates that for small-parameter models, simply increasing data volume or distilling logits from a larger teacher is insufficient. Through the underlying mathematical constraints of SC-OrthFine, decoupling the state-space search from a geometric perspective is an effective means of enhancing a model's "logical density."
