# JuliaFluxGPT-fused-v2
Cross-species symbiogenesis: Pythia-14m-deduped (300B tokens, d=128) weights projected through a tokenizer bridge into JuliaFluxGPT-fused (23M, d=512), alpha-blended with existing fused weights, and fine-tuned for 1500 steps on classical philosophy texts.
This is the second fusion experiment — v1 fused JuliaSLM (5M) into the architecture; v2 additionally incorporates knowledge from EleutherAI's Pythia-14m, which was trained on 1000x more data (300B tokens of The Pile vs 266M tokens).
## Architecture

| Property | Value |
|---|---|
| Parameters | ~23M |
| d_model | 512 |
| Layers | 8 |
| Attention | 8Q / 2KV Grouped Query Attention |
| Head dim | 64 |
| FFN | SwiGLU (inner dim 1344) |
| Normalization | RMSNorm (pre-norm) |
| Position encoding | RoPE (full head_dim) |
| Context length | 256 tokens |
| Vocab size | 2,000 BPE tokens |
| Weight tying | Yes (shared embed/output) |
| Framework | PyTorch |
## Fusion Method
- Tokenizer bridge: Each BPE-2000 token decoded to text, re-encoded with Pythia's 50K GPT-NeoX tokenizer, Pythia embeddings averaged → (2000, 128) bridged embedding matrix
- Weight projection: Pythia d=128 weights projected into the d=512 target space; query heads duplicated (4→8), KV heads remapped (4→2), GELU FFN converted to SwiGLU, LayerNorm→RMSNorm
- Alpha blending: Layers 0-5 blended α=0.5 (existing v1 + Pythia), Layers 6-7 α=0.0 (Pythia only, since these were randomly initialized in v1)
- Fine-tune: 1500 steps, cosine LR 3e-4→1e-5, batch 64×256 tokens
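The bridge and blend steps above can be sketched in a few lines. This is a simplified illustration, not the actual fusion code: `target_decode`, `donor_encode`, and the matrix shapes are placeholders standing in for the BPE-2000 vocab, Pythia's GPT-NeoX tokenizer, and the real weight tensors.

```python
import numpy as np

def bridge_embeddings(target_decode, donor_encode, donor_embed, target_vocab_size):
    """Tokenizer bridge: decode each target token to text, re-encode it with
    the donor tokenizer, and average the donor embedding rows it maps to.
    Returns a (target_vocab_size, donor_dim) bridged embedding matrix."""
    bridged = np.zeros((target_vocab_size, donor_embed.shape[1]),
                       dtype=donor_embed.dtype)
    for tok_id in range(target_vocab_size):
        donor_ids = donor_encode(target_decode(tok_id))
        if donor_ids:  # leave an all-zero row if the donor cannot encode it
            bridged[tok_id] = donor_embed[donor_ids].mean(axis=0)
    return bridged

def alpha_blend(existing, donor, alpha):
    """Per-tensor interpolation: alpha=0.5 mixes equally (layers 0-5),
    alpha=0.0 keeps only the projected donor weights (layers 6-7)."""
    return alpha * existing + (1.0 - alpha) * donor
```

With a 50K-entry donor embedding table this loop runs 2000 decode/encode round trips, which is cheap enough to do once at fusion time.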
## Training

| Detail | Value |
|---|---|
| Source model 1 | JuliaFluxGPT-fused (v1, val_loss=3.698) |
| Source model 2 | EleutherAI/pythia-14m-deduped (300B tokens) |
| Training data | 115M tokens, classical philosophy (BPE-2000) |
| Fine-tune steps | 1,500 (early stopped; val loss plateaued) |
| LR schedule | Cosine, 3e-4 → 1e-5, 300 step warmup |
| Batch size | 64 × 256 = 16,384 tokens/step |
| Hardware | NVIDIA RTX 3060 12GB |
| Precision | f32 |
| W&B run | juliafluxgpt-pythia-fusion |
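The LR schedule in the table (warmup into cosine decay) can be reproduced with a small helper. The linear warmup shape is an assumption; the endpoint values match the table above.

```python
import math

def lr_at(step, max_lr=3e-4, min_lr=1e-5, warmup=300, total=1500):
    """Linear warmup to max_lr over `warmup` steps, then cosine decay
    to min_lr by step `total`."""
    if step < warmup:
        return max_lr * (step + 1) / warmup
    t = (step - warmup) / (total - warmup)  # 0 -> 1 over the decay phase
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * t))
```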
## Scaling Context

| Model | Params | Val Loss | Perplexity | Notes |
|---|---|---|---|---|
| MicroJulia | 0.5M | 4.32 | 75.2 | Baseline GPT-2 style |
| MonarchSLM | 5M | 3.72 | 41.3 | Monarch mixer |
| SymbioSLM | 5M | 3.60 | 36.6 | Multi-organelle gated |
| JuliaSLM | 5M | 3.54 | 34.5 | Standard MHA, best 5M |
| JuliaFluxGPT-fused (v1) | 23M | 3.698 | 40.4 | JuliaSLM→JuliaFluxGPT fusion |
| JuliaFluxGPT-fused-v2 | 23M | 3.873 | 48.1 | + Pythia cross-species fusion |
Note: v2 val_loss (3.873) is higher than v1 (3.698). The Pythia cross-species fusion disrupted domain-specific features learned during v1 fine-tuning. The 23M model is also 4.5x over the Chinchilla-optimal size for 115M training tokens, limiting how much fine-tuning can recover. Linguistically, the larger fused models produce more coherent text despite higher loss; see eval_linguistics.py for computational metrics.
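For context, the over-parameterization figure follows from the common Chinchilla rule of thumb of roughly 20 tokens per parameter; the exact multiplier depends on the constant used (≈4x with c=20, ≈4.5x with c≈22.5):

```python
params = 23e6   # model size
tokens = 115e6  # fine-tuning corpus

c = 20                            # assumed tokens-per-parameter constant
optimal_params = tokens / c       # ~5.75M params would be compute-optimal
ratio = params / optimal_params   # how far over-parameterized the 23M model is
print(f"{ratio:.1f}x over compute-optimal")
```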
## Files

| File | Description |
|---|---|
| `juliaflux_fused_v2_best.pt` | Best checkpoint (step 1500, raw state_dict, 88MB) |
| `juliaflux_model.py` | Model definition (JuliaFluxGPT class) |
| `vocab.json` | BPE-2000 vocabulary (GPT-2 format) |
| `merges.txt` | BPE merge rules |
## Usage

```python
import torch
from juliaflux_model import JuliaFluxConfig, JuliaFluxGPT

config = JuliaFluxConfig()
model = JuliaFluxGPT(config)

state_dict = torch.load("juliaflux_fused_v2_best.pt", map_location="cpu", weights_only=True)
model.load_state_dict(state_dict)
model.eval()
```
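A minimal sampling loop might look like the sketch below. It assumes the model's forward pass returns logits of shape `(batch, seq, vocab)`; check `juliaflux_model.py` for the actual signature before using it.

```python
import torch

@torch.no_grad()
def generate(model, idx, max_new_tokens=50, temperature=0.8, context_len=256):
    """Autoregressive sampling, cropping inputs to the 256-token window."""
    for _ in range(max_new_tokens):
        logits = model(idx[:, -context_len:])     # (B, T, vocab) assumed
        probs = torch.softmax(logits[:, -1, :] / temperature, dim=-1)
        next_tok = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, next_tok], dim=1)   # append sampled token
    return idx
```

Decoding the resulting ids requires the BPE-2000 tokenizer built from `vocab.json` and `merges.txt`.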
## Links