neon213-Muon: The SLM SOTA (3.44 Loss)

neon213-Muon is the best-performing model of the NeonBench series. By combining a Progressive Growth strategy ($k=1 \to 21$) with the Muon Optimizer, this model achieves a project-wide SOTA on FineWeb-Edu, outperforming the AdamW baseline by a noticeable margin.

| Property | Value |
|---|---|
| Parameters (Total) | 26.49M |
| Parameters (Active) | 20.20M (Non-Embedding) |
| Architecture | Growable SwiGLU-Conv (neon213-Muon) |
| Optimizer | Muon V4 (Orthogonalized) |
| Tokenizer | tok6 (16,384 Vocab) |
| Dimensions | $d_{model}=384$, $n_{head}=6$, $n_{layers}=8$, $d_{ff}=1536$ |
| Context | Growable ($k=1 \to 21$) |
| Status | Project SOTA (3.44 Val Loss) |

πŸ—οΈ Architecture

The architecture is based on neon185 (SwiGLU-Conv), which features:

  1. SwiGLU MLP: w2(SiLU(w1(x)) * w3(x)) gating.
  2. Hydra Convolution: Depthwise convolutions on MLP gates to provide local context.
  3. Conv-Attention: Depthwise convolutions on Q/K/V/I projections.
  4. Sigmoid Attention Gate: Learned sigmoid(Intent) gate on attention output.

Growable Kernels

Unlike previous static models, neon213 features configurable kernel sizes (conv_k, mlp_k). This allows the model to start with pointwise operations ($k=1$) and grow its receptive field during training.

```python
# Conv-Attention layer: depthwise convolutions on the Q/K/V/I projections
self.conv_q = nn.Conv1d(d, d, kernel_size=k, groups=d)  # k grows during training (1 -> 9 for the baseline, up to 21 for Muon)
self.conv_k = nn.Conv1d(d, d, kernel_size=k, groups=d)
self.conv_v = nn.Conv1d(d, d, kernel_size=k, groups=d)
self.conv_i = nn.Conv1d(d, d, kernel_size=k, groups=d)

# SwiGLU MLP layer: depthwise convolution on the gate
self.conv_gate = nn.Conv1d(d, d, kernel_size=k, groups=d)
```
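For completeness, a minimal causal wrapper around one of these depthwise layers might look like the sketch below. The left-padding scheme and the `CausalDepthwiseConv` name are assumptions for illustration, not the repository's exact code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalDepthwiseConv(nn.Module):
    """Depthwise 1-D convolution with left-padding, so position t only
    sees positions <= t. A sketch; the repo's padding scheme is assumed."""
    def __init__(self, d: int, k: int):
        super().__init__()
        self.k = k
        self.conv = nn.Conv1d(d, d, kernel_size=k, groups=d)

    def forward(self, x):                    # x: (batch, seq, d)
        x = x.transpose(1, 2)                # (batch, d, seq)
        x = F.pad(x, (self.k - 1, 0))        # causal left-pad
        return self.conv(x).transpose(1, 2)  # (batch, seq, d)

conv = CausalDepthwiseConv(d=384, k=1)       # pointwise at the start of training
y = conv(torch.randn(2, 16, 384))
print(y.shape)  # torch.Size([2, 16, 384])
```

With $k=1$ the padding is zero and the layer reduces to a per-dimension pointwise scaling, which is what makes the cold-start stage cheap.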

💡 Key Innovations

1. The Muon PR Breakthrough (Diversity Recovery)

In our AdamW baseline tests, we observed severe Dimensional Collapse. Despite a 384-dimensional latent space, the Participation Ratio (PR) was often as low as ~12.0, meaning less than 4% of the representational power was being utilized.

By switching to the Muon Optimizer (Orthogonalized Momentum), whose updates push the weight matrices toward a flatter singular-value spectrum, we kept the representations diverse.

  • V-Projection PR: 12.2 (AdamW) βž” 25.9 (Muon).
  • Result: The model is "wider" internally, enabling valid learning long after AdamW would have plateaued.

2. Learned Intent Gating

Standard Gated Scaled Dot-Product Attention (as seen in architectures like Qwen 3.5) derives its output gate from existing projections, typically the Query. The gate is calculated, not independently learned:

$$\text{Gated-SDPA}: \quad y = \sigma(W_g \cdot Q) \odot \text{Attn}(Q, K, V)$$

In neon213, the gate is a fully independent learned projection called Intent ($I$). Intent has its own dedicated weights (c_attn slice) and its own dedicated convolution (conv_i), giving it a completely separate representational capacity from Q, K, and V:

$$\text{Intent-Gated}: \quad y = \sigma(\text{Conv}(I)) \odot \text{Attn}(Q, K, V)$$

This means the model can learn what information to keep (Intent) independently from what information to search for (Query) and what information to retrieve (Value).
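A minimal sketch of this attention block follows. The `c_attn` slicing and `conv_i` names follow the text above; the head layout, causal flag, and other details are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IntentGatedAttention(nn.Module):
    """Sketch of intent-gated attention: y = sigmoid(Conv(I)) * Attn(Q, K, V).
    Exact shapes and projection details are assumptions, not the repo's code."""
    def __init__(self, d: int, n_head: int, k: int = 1):
        super().__init__()
        self.n_head = n_head
        self.c_attn = nn.Linear(d, 4 * d)  # Q, K, V, and Intent slices
        self.conv_i = nn.Conv1d(d, d, kernel_size=k, groups=d, padding=k - 1)
        self.proj = nn.Linear(d, d)

    def forward(self, x):                  # x: (B, T, d)
        B, T, d = x.shape
        q, k_, v, i = self.c_attn(x).split(d, dim=2)
        # Intent gets its own depthwise convolution, truncated back to length T
        i = self.conv_i(i.transpose(1, 2))[..., :T].transpose(1, 2)
        def heads(t):                      # (B, T, d) -> (B, h, T, d/h)
            return t.view(B, T, self.n_head, d // self.n_head).transpose(1, 2)
        y = F.scaled_dot_product_attention(heads(q), heads(k_), heads(v), is_causal=True)
        y = y.transpose(1, 2).reshape(B, T, d)
        return self.proj(torch.sigmoid(i) * y)  # gate the attention output

attn = IntentGatedAttention(d=384, n_head=6)
out = attn(torch.randn(2, 16, 384))
print(out.shape)  # torch.Size([2, 16, 384])
```

Because Intent has its own slice of `c_attn` and its own `conv_i`, its gradients never flow through the Q/K/V pathways, which is what gives it independent representational capacity.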

3. Depthwise Convolutions as Communication Channels

The depthwise convolutions applied to Q, K, V, and I shouldn't be interpreted as simple blurs. Each convolution kernel is a set of fully learned, unconstrained weights, including negative values. This means each dimension can independently decide:

  • How much of a neighboring token's signal to incorporate (weight magnitude).
  • Whether to amplify or inhibit that signal (positive vs. negative weights).

In practice, this creates an additional token-to-token communication pathway that operates before the attention mechanism. While attention allows tokens to selectively read from any position, the convolutions provide a fixed, local, per-dimension channel for adjacent tokens to share information: a form of inductive bias that complements the global, content-based routing of attention.
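This can be made concrete with a toy depthwise $k=3$ causal convolution. The weights below are hand-picked for illustration, not trained values: one channel passes the current token through unchanged, another computes a signed local difference with its predecessor:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d, k = 4, 3
conv = nn.Conv1d(d, d, kernel_size=k, groups=d, bias=False)
with torch.no_grad():
    conv.weight.zero_()
    conv.weight[0, 0, -1] = 1.0   # channel 0: copy the current token
    conv.weight[1, 0, -1] = 1.0   # channel 1: current token ...
    conv.weight[1, 0, -2] = -1.0  # ... minus the previous token (inhibition)
    x = torch.arange(6.).repeat(1, d, 1)   # (1, d, 6): tokens 0..5 on every channel
    y = conv(F.pad(x, (k - 1, 0)))         # causal left-pad, output length 6

print(y[0, 0])  # tensor([0., 1., 2., 3., 4., 5.])  identity pass-through
print(y[0, 1])  # tensor([0., 1., 1., 1., 1., 1.])  signed local difference
```

Each channel chose a different mixing rule independently, which is exactly the per-dimension freedom the text describes.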

4. Progressive Kernel Growth

Convolutions at large kernel sizes ($k=9$) are powerful but difficult to train from scratch: the model must simultaneously learn what to convolve and how far to look. neon213 solves this with Progressive Kernel Growth:

  1. Training begins with pointwise kernels ($k=1$), which are equivalent to no convolution at all. The model first learns the fundamentals of attention and MLP gating without any local context.
  2. Kernels are then gradually expanded ($k=1 \to 3 \to 5 \to 7 \to 9$) using zero-padding initialization: the new kernel positions are filled with zeros, so the model's behavior is perfectly preserved at the moment of expansion.
  3. The model then learns to use the new context during the subsequent training steps, gradually discovering how to exploit wider local neighborhoods.

This approach is analogous to curriculum learning: the model masters simple patterns first, then progressively gains the capacity to leverage richer local context.
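The zero-padding expansion in step 2 can be sketched as follows. The `grow_kernel` helper name is hypothetical; only the zero-padding mechanism itself is from the text:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def grow_kernel(conv: nn.Conv1d, new_k: int) -> nn.Conv1d:
    """Expand a depthwise Conv1d from kernel size k to new_k, zero-padding
    the new (older-context) taps so that, with causal left-padding, the
    output is unchanged at the moment of expansion."""
    old_k = conv.kernel_size[0]
    assert new_k > old_k
    d = conv.in_channels
    grown = nn.Conv1d(d, d, kernel_size=new_k, groups=d, bias=conv.bias is not None)
    with torch.no_grad():
        grown.weight.zero_()
        grown.weight[:, :, -old_k:] = conv.weight  # keep old taps at the causal end
        if conv.bias is not None:
            grown.bias.copy_(conv.bias)
    return grown

# With causal left-padding, the k=1 and zero-padded k=3 convs agree exactly.
small = nn.Conv1d(8, 8, kernel_size=1, groups=8)
big = grow_kernel(small, 3)
x = torch.randn(1, 8, 16)
y1 = small(x)
y3 = big(F.pad(x, (2, 0)))
print(torch.allclose(y1, y3))  # True
```

Placing the old taps at the causal end of the kernel is what preserves behavior: the new leading taps only see older tokens, and they start at zero.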


📈 Progressive Growth Training (Muon SOTA)

The Muon SOTA model followed an accelerated and expanded growth curriculum. While the AdamW baseline struggled past $k=9$, the Muon-backed heads remained stable up to $k=21$.

| Stage | Kernel ($k$) | Steps | Description |
|---|---|---|---|
| 1 | 1 | 5,000 | Cold Start: Muon bootstraps attention stability. |
| 2–10 | 3 → 19 | 27,000 | Hybrid Growth: step-wise expansion ($k$ + 2 every 3,000 steps). |
| 11 | 21 | 3,000 | Target Depth: reached full $k=21$ context. |
| 12 | 21 | 30,000 | The Floor: final long-tail convergence with Cosine decay. |

Total Steps: ~65,000.

Growth Mechanics

  1. Muon recovery: After each kernel expansion "shock," the orthogonalized weights recovered informational throughput (PR) 2x faster than AdamW.
  2. Kernel Growth: New kernel weights were zero-padded, maintaining exact parity at the moment of transition.

💾 Checkpoint & Quantization

The final model checkpoint exceeded the GitHub 100MB file limit (101 MB). To resolve this, the checkpoint was converted to Float16 (Half Precision).

  • Original Size: 101.16 MB
  • FP16 Size: 62.60 MB
  • Format: Standard PyTorch state_dict.
  • Compatibility: NeonModelEngine automatically handles the fp16 $\to$ fp32 cast during loading.

📊 Performance

| Metric | Value | Notes |
|---|---|---|
| Best Val Loss | 3.44 | FineWeb-Edu (Project SOTA). |
| Participation Ratio | 25.9 | High dimensional utilization. |

Sample Generation:

"The meaning of life is deeply intertwined with the social and cultural constructs of the era. It is not a static definition but a dynamic process of engagement with the environment and the community, reflecting"

"The meaning of life is captured in the intricate patterns of human interaction and the shared pursuit of knowledge. It is the ability to adapt, to learn, and to contribute to the collective wisdom of the species"
