# JuliaFluxGPT-fused-v2

Cross-species symbiogenesis: Pythia-14m-deduped (300B tokens, d=128) weights projected through a tokenizer bridge into JuliaFluxGPT-fused (23M, d=512), alpha-blended with existing fused weights, and fine-tuned for 1500 steps on classical philosophy texts.

This is the second fusion experiment: v1 fused JuliaSLM (5M) into the architecture; v2 additionally incorporates knowledge from EleutherAI's Pythia-14m, which was trained on roughly 1,000x more data (300B tokens of The Pile vs. 266M tokens).

## Architecture

| Property | Value |
|---|---|
| Parameters | ~23M |
| d_model | 512 |
| Layers | 8 |
| Attention | 8Q / 2KV grouped-query attention |
| Head dim | 64 |
| FFN | SwiGLU (inner dim 1344) |
| Normalization | RMSNorm (pre-norm) |
| Position encoding | RoPE (full head_dim) |
| Context length | 256 tokens |
| Vocab size | 2,000 BPE tokens |
| Weight tying | Yes (shared embed/output) |
| Framework | PyTorch |
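A quick back-of-the-envelope parameter count from the table above is consistent with the ~23M figure. This sketch omits norm weights and any biases, and the layer names in `juliaflux_model.py` may differ; it is a sanity check, not the repo's code:

```python
# Rough parameter count derived from the architecture table.
d_model, n_layers, head_dim = 512, 8, 64
n_q, n_kv = 8, 2
ffn_inner, vocab = 1344, 2000

attn = (d_model * n_q * head_dim          # Q projection
        + 2 * d_model * n_kv * head_dim   # K and V projections (GQA: 2 KV heads)
        + n_q * head_dim * d_model)       # output projection
ffn = 3 * d_model * ffn_inner             # SwiGLU: gate, up, down matrices
embed = vocab * d_model                   # tied with the output head

total = n_layers * (attn + ffn) + embed
print(f"{total / 1e6:.1f}M")              # → 22.8M, matching the ~23M figure
```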

## Fusion Method

  1. Tokenizer bridge: Each BPE-2000 token decoded to text, re-encoded with Pythia's 50K GPT-NeoX tokenizer, Pythia embeddings averaged → (2000, 128) bridged embedding matrix
  2. Weight projection: Pythia d=128 weights projected into the d=512 target space; attention heads remapped (4→8 query by duplication, 4→2 KV), GELU FFN converted to SwiGLU, LayerNorm→RMSNorm
  3. Alpha blending: Layers 0-5 blended α=0.5 (existing v1 + Pythia), Layers 6-7 α=0.0 (Pythia only, since these were randomly initialized in v1)
  4. Fine-tune: 1500 steps, cosine LR 3e-4→1e-5, batch 64×256 tokens
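Steps 1 and 3 can be sketched as follows. This is an illustrative pure-Python version (the real code would operate on torch tensors), and all function and argument names here are assumptions, not the repo's API:

```python
def bridge_embeddings(target_tokens, source_encode, source_embed, d_source=128):
    """Step 1 (tokenizer bridge): decode each target BPE token to text,
    re-encode with the source (Pythia) tokenizer, and average the source
    embedding rows of the resulting pieces."""
    bridged = []
    for text in target_tokens:
        ids = source_encode(text)
        if not ids:                               # unmappable token: zero row
            bridged.append([0.0] * d_source)
            continue
        rows = [source_embed[i] for i in ids]
        bridged.append([sum(col) / len(rows) for col in zip(*rows)])
    return bridged                                # shape (len(target_tokens), d_source)

def alpha_blend(existing, projected, alpha):
    """Step 3 (alpha blending): alpha=0.5 mixes the existing v1 weights with
    the projected Pythia weights; alpha=0.0 keeps only the projected weights,
    as done for layers 6-7."""
    return [alpha * a + (1 - alpha) * b for a, b in zip(existing, projected)]
```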

## Training

| Detail | Value |
|---|---|
| Source model 1 | JuliaFluxGPT-fused (v1, val_loss = 3.698) |
| Source model 2 | EleutherAI/pythia-14m-deduped (300B tokens) |
| Training data | 115M tokens, classical philosophy (BPE-2000) |
| Fine-tune steps | 1,500 (early stopped; val loss plateaued) |
| LR schedule | Cosine, 3e-4 → 1e-5, 300-step warmup |
| Batch size | 64 × 256 = 16,384 tokens/step |
| Hardware | NVIDIA RTX 3060 12GB |
| Precision | fp32 |
| W&B run | juliafluxgpt-pythia-fusion |
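The LR schedule from the table can be written as a small function. This is a minimal sketch assuming linear warmup followed by cosine decay; the exact warmup shape used in training is an assumption:

```python
import math

def lr_at(step, max_lr=3e-4, min_lr=1e-5, warmup=300, total=1500):
    # Linear warmup to max_lr over the first `warmup` steps,
    # then cosine decay from max_lr down to min_lr at `total` steps.
    if step < warmup:
        return max_lr * (step + 1) / warmup
    t = (step - warmup) / max(1, total - warmup)  # 0 → 1 over the decay phase
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * t))
```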

## Scaling Context

| Model | Params | Val Loss | Perplexity | Notes |
|---|---|---|---|---|
| MicroJulia | 0.5M | 4.32 | 75.2 | Baseline GPT-2 style |
| MonarchSLM | 5M | 3.72 | 41.3 | Monarch mixer |
| SymbioSLM | 5M | 3.60 | 36.6 | Multi-organelle gated |
| JuliaSLM | 5M | 3.54 | 34.5 | Standard MHA, best 5M |
| JuliaFluxGPT-fused (v1) | 23M | 3.698 | 40.4 | JuliaSLM → JuliaFluxGPT fusion |
| JuliaFluxGPT-fused-v2 | 23M | 3.873 | 48.1 | + Pythia cross-species fusion |
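The perplexity column follows directly from the validation loss as PPL = exp(loss):

```python
import math

# Each perplexity entry is the exponential of the cross-entropy val loss:
for loss, ppl in [(3.698, 40.4), (3.873, 48.1), (3.54, 34.5)]:
    assert round(math.exp(loss), 1) == ppl
```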

**Note:** v2's val loss (3.873) is higher than v1's (3.698): the Pythia cross-species fusion disrupted domain-specific features learned during v1 fine-tuning, and at 23M parameters the model is 4.5x over Chinchilla-optimal for a 115M-token corpus, which limits how much fine-tuning can recover. Linguistically, the larger fused models nonetheless produce more coherent text despite the higher loss; see eval_linguistics.py for computational metrics.

## Files

| File | Description |
|---|---|
| `juliaflux_fused_v2_best.pt` | Best checkpoint (step 1500, raw state_dict, 88MB) |
| `juliaflux_model.py` | Model definition (`JuliaFluxGPT` class) |
| `vocab.json` | BPE-2000 vocabulary (GPT-2 format) |
| `merges.txt` | BPE merge rules |

## Usage

```python
import torch
from juliaflux_model import JuliaFluxConfig, JuliaFluxGPT

config = JuliaFluxConfig()
model = JuliaFluxGPT(config)
state_dict = torch.load("juliaflux_fused_v2_best.pt", map_location="cpu", weights_only=True)
model.load_state_dict(state_dict)
model.eval()
```
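Since the checkpoint is a raw state_dict, sampling has to be written by hand. A minimal greedy-decoding sketch, assuming the model's forward returns logits of shape `(batch, seq, vocab)` — this signature is an assumption about `JuliaFluxGPT`, not documented here:

```python
import torch

@torch.no_grad()
def greedy_generate(model, ids, max_new=50, context_len=256):
    # ids: LongTensor of shape (batch, seq). Assumes model(ids) -> logits of
    # shape (batch, seq, vocab_size). Inputs are truncated to the 256-token
    # context window before each forward pass.
    for _ in range(max_new):
        logits = model(ids[:, -context_len:])
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=1)
    return ids
```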
