# JuliaFluxGPT-fused-v2
Cross-species symbiogenesis: Pythia-14m-deduped (300B tokens, d=128) weights projected through a tokenizer bridge into JuliaFluxGPT-fused (23M, d=512), alpha-blended with existing fused weights, and fine-tuned for 1500 steps on classical philosophy texts.
This is the second fusion experiment — v1 fused JuliaSLM (5M) into the architecture; v2 additionally incorporates knowledge from EleutherAI's Pythia-14m, which was trained on 1000x more data (300B tokens of The Pile vs 266M tokens).
## Architecture

| Property | Value |
|---|---|
| Parameters | ~23M |
| d_model | 512 |
| Layers | 8 |
| Attention | 8Q / 2KV Grouped Query Attention |
| Head dim | 64 |
| FFN | SwiGLU (inner dim 1344) |
| Normalization | RMSNorm (pre-norm) |
| Position encoding | RoPE (full head_dim) |
| Context length | 256 tokens |
| Vocab size | 2,000 BPE tokens |
| Weight tying | Yes (shared embed/output) |
| Framework | PyTorch |
## Fusion Method
- Tokenizer bridge: Each BPE-2000 token decoded to text, re-encoded with Pythia's 50K GPT-NeoX tokenizer, Pythia embeddings averaged → (2000, 128) bridged embedding matrix
- Weight projection: Pythia d=128 weights projected into the d=512 target space; query heads duplicated (4→8), KV heads remapped (4→2), GELU FFN converted to SwiGLU, LayerNorm→RMSNorm
- Alpha blending: Layers 0-5 blended α=0.5 (existing v1 + Pythia), Layers 6-7 α=0.0 (Pythia only, since these were randomly initialized in v1)
- Fine-tune: 1500 steps, cosine LR 3e-4→1e-5, batch 64×256 tokens
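The bridge and blend steps above can be sketched in a few lines. This is a simplified illustration, not the actual fusion code: `target_decode`, `donor_encode`, and the matrix shapes are placeholders standing in for the BPE-2000 vocab, Pythia's GPT-NeoX tokenizer, and the real weight tensors.

```python
import numpy as np

def bridge_embeddings(target_decode, donor_encode, donor_embed, target_vocab_size):
    """Tokenizer bridge: decode each target token to text, re-encode it with
    the donor tokenizer, and average the donor embedding rows it maps to.
    Returns a (target_vocab_size, donor_dim) bridged embedding matrix."""
    bridged = np.zeros((target_vocab_size, donor_embed.shape[1]),
                       dtype=donor_embed.dtype)
    for tok_id in range(target_vocab_size):
        donor_ids = donor_encode(target_decode(tok_id))
        if donor_ids:  # leave an all-zero row if the donor cannot encode it
            bridged[tok_id] = donor_embed[donor_ids].mean(axis=0)
    return bridged

def alpha_blend(existing, donor, alpha):
    """Per-tensor interpolation: alpha=0.5 mixes equally (layers 0-5),
    alpha=0.0 keeps only the projected donor weights (layers 6-7)."""
    return alpha * existing + (1.0 - alpha) * donor
```

With a 50K-entry donor embedding table this loop runs 2000 decode/encode round trips, which is cheap enough to do once at fusion time.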
## Training

| Detail | Value |
|---|---|
| Source model 1 | JuliaFluxGPT-fused (v1, val_loss=3.698) |
| Source model 2 | EleutherAI/pythia-14m-deduped (300B tokens) |
| Training data | 115M tokens, classical philosophy (BPE-2000) |
| Fine-tune steps | 1,500 (early stopped; val loss plateaued) |
| LR schedule | Cosine, 3e-4 → 1e-5, 300 step warmup |
| Batch size | 64 × 256 = 16,384 tokens/step |
| Hardware | NVIDIA RTX 3060 12GB |
| Precision | f32 |
| W&B run | juliafluxgpt-pythia-fusion |
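The LR schedule in the table (warmup into cosine decay) can be reproduced with a small helper. The linear warmup shape is an assumption; the endpoint values match the table above.

```python
import math

def lr_at(step, max_lr=3e-4, min_lr=1e-5, warmup=300, total=1500):
    """Linear warmup to max_lr over `warmup` steps, then cosine decay
    to min_lr by step `total`."""
    if step < warmup:
        return max_lr * (step + 1) / warmup
    t = (step - warmup) / (total - warmup)  # 0 -> 1 over the decay phase
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * t))
```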
## Scaling Context

| Model | Params | Val Loss | Perplexity | Notes |
|---|---|---|---|---|
| MicroJulia | 0.5M | 4.32 | 75.2 | Baseline GPT-2 style |
| MonarchSLM | 5M | 3.72 | 41.3 | Monarch mixer |
| SymbioSLM | 5M | 3.60 | 36.6 | Multi-organelle gated |
| JuliaSLM | 5M | 3.54 | 34.5 | Standard MHA, best 5M |
| JuliaFluxGPT-fused (v1) | 23M | 3.698 | 40.4 | JuliaSLM→JuliaFluxGPT fusion |
| JuliaFluxGPT-fused-v2 | 23M | 3.873 | 48.1 | + Pythia cross-species fusion |
Note: v2 val_loss (3.873) is higher than v1 (3.698). The Pythia cross-species fusion disrupted domain-specific features learned during v1 fine-tuning. The 23M model is also 4.5x over the Chinchilla-optimal size for 115M training tokens, limiting how much fine-tuning can recover. Linguistically, the larger fused models produce more coherent text despite higher loss; see eval_linguistics.py for computational metrics.
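For context, the over-parameterization figure follows from the common Chinchilla rule of thumb of roughly 20 tokens per parameter; the exact multiplier depends on the constant used (≈4x with c=20, ≈4.5x with c≈22.5):

```python
params = 23e6   # model size
tokens = 115e6  # fine-tuning corpus

c = 20                            # assumed tokens-per-parameter constant
optimal_params = tokens / c       # ~5.75M params would be compute-optimal
ratio = params / optimal_params   # how far over-parameterized the 23M model is
print(f"{ratio:.1f}x over compute-optimal")
```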
## Files

| File | Description |
|---|---|
| `juliaflux_fused_v2_best.pt` | Best checkpoint (step 1500, raw state_dict, 88MB) |
| `juliaflux_model.py` | Model definition (JuliaFluxGPT class) |
| `vocab.json` | BPE-2000 vocabulary (GPT-2 format) |
| `merges.txt` | BPE merge rules |
## Usage

```python
import torch
from juliaflux_model import JuliaFluxConfig, JuliaFluxGPT

config = JuliaFluxConfig()
model = JuliaFluxGPT(config)

state_dict = torch.load("juliaflux_fused_v2_best.pt", map_location="cpu", weights_only=True)
model.load_state_dict(state_dict)
model.eval()
```
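A minimal sampling loop might look like the sketch below. It assumes the model's forward pass returns logits of shape `(batch, seq, vocab)`; check `juliaflux_model.py` for the actual signature before using it.

```python
import torch

@torch.no_grad()
def generate(model, idx, max_new_tokens=50, temperature=0.8, context_len=256):
    """Autoregressive sampling, cropping inputs to the 256-token window."""
    for _ in range(max_new_tokens):
        logits = model(idx[:, -context_len:])     # (B, T, vocab) assumed
        probs = torch.softmax(logits[:, -1, :] / temperature, dim=-1)
        next_tok = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, next_tok], dim=1)   # append sampled token
    return idx
```

Decoding the resulting ids requires the BPE-2000 tokenizer built from `vocab.json` and `merges.txt`.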
## Links