# VocabFusion-SymbioGPT-10M
Cross-Scale Vocabulary Fusion: a frozen Pythia-14m embedding (50K vocab, d=128) fused into a trained SymbioGPT-10M host (2K vocab, d=320) via a learned junction layer.
The donor provides vocabulary knowledge (50K BPE tokens from The Pile); the host provides structural/syntactic knowledge (multi-organelle architecture trained on curated philosophy text).
## Architecture
```
input_ids → Pythia Embedding (frozen, 50304×128)
          → Junction (Linear 128→320 + LayerNorm + SiLU)
          → 8× SymbioBlocks (CausalConv + Monarch + LongConv + Attention, d=320)
          → RMSNorm
          → Output Projection (Linear 320→128)
          → Weight-tied logits (F.linear with frozen Pythia embedding)
          → logits (50304 vocab)
```
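The fusion path can be sketched as a small PyTorch module. This is a minimal illustration, not the actual `vocab_fusion_experiment` implementation: the class name `JunctionForward` is hypothetical, the host blocks are omitted, and the bias-free linears are inferred from the parameter counts below.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class JunctionForward(nn.Module):
    """Sketch of the donor→junction→output→tied-logits path (host blocks omitted)."""

    def __init__(self, donor_embed: nn.Embedding, d_host: int = 320):
        super().__init__()
        d_donor = donor_embed.embedding_dim      # 128 for Pythia-14m
        self.embed = donor_embed                 # donor embedding, kept frozen
        self.embed.weight.requires_grad_(False)
        self.junction = nn.Sequential(
            nn.Linear(d_donor, d_host, bias=False),  # 128 → 320
            nn.LayerNorm(d_host),
            nn.SiLU(),
        )
        self.out_proj = nn.Linear(d_host, d_donor, bias=False)  # 320 → 128

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        x = self.embed(input_ids)   # (B, T, 128)
        x = self.junction(x)        # (B, T, 320) — 8 SymbioBlocks would run here
        x = self.out_proj(x)        # (B, T, 128)
        # weight-tied logits against the frozen donor embedding
        return F.linear(x, self.embed.weight)  # (B, T, 50304)
```

With this shape, the junction plus output projection account for the 82K trainable parameters of Phase 1.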
### SymbioBlock Organelles
Each of the 8 host blocks contains 4 parallel organelles with learned gating:
| Organelle | Description |
|---|---|
| CausalConv | Left-padded 1D convolution (kernel=4) for local patterns |
| Monarch | Monarch Mixer with learned causal mask for efficient global mixing |
| LongConv | Exponential-decay convolution for medium-range dependencies |
| Attention | Multi-head attention with RoPE (5 heads, head_dim=64) |
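The gated combination of parallel organelles can be sketched as follows. This is an assumption-laden illustration: the real block's gating may be per-channel or per-token rather than a scalar softmax mixture, and the four linear layers here are stand-ins for the actual CausalConv / Monarch / LongConv / Attention organelles.

```python
import torch
import torch.nn as nn


class GatedOrganelles(nn.Module):
    """Sketch of one block's learned gating over 4 parallel organelles."""

    def __init__(self, d_model: int = 320, n_organelles: int = 4):
        super().__init__()
        # stand-ins for CausalConv / Monarch / LongConv / Attention
        self.organelles = nn.ModuleList(
            nn.Linear(d_model, d_model) for _ in range(n_organelles)
        )
        self.gate_logits = nn.Parameter(torch.zeros(n_organelles))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gates = torch.softmax(self.gate_logits, dim=0)                # learned mixture
        branches = torch.stack([org(x) for org in self.organelles])   # (4, B, T, D)
        return x + torch.einsum("o,obtd->btd", gates, branches)       # residual add
```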
## Training
Progressive unfreezing across 3 phases (Chinchilla-optimal schedule):
| Phase | Steps | Trainable | LR Scale | Params |
|---|---|---|---|---|
| P1 | 0–10,399 | Junction + Output Projection | 1.0x | 82K |
| P2 | 10,400–18,199 | + 2 near blocks | 0.5x | 2.7M |
| P3 | 18,200–25,999 | + 6 far blocks + RMSNorm | 0.25x | 10.5M |
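The phase schedule above can be expressed as optimizer parameter groups. A minimal sketch, assuming model attributes named `junction`, `out_proj`, `blocks`, and `norm` (the helper `configure_phase` and those attribute names are hypothetical, not the project's API):

```python
import torch


def configure_phase(model, phase: int, base_lr: float = 6e-4):
    """P1: junction + output projection at 1.0x LR.
    P2 adds near blocks 0-1 at 0.5x; P3 adds far blocks 2-7 and RMSNorm at 0.25x."""
    for p in model.parameters():
        p.requires_grad_(False)

    groups = [{"params": list(model.junction.parameters())
                         + list(model.out_proj.parameters()),
               "lr": base_lr}]
    if phase >= 2:
        near = [p for b in model.blocks[:2] for p in b.parameters()]
        groups.append({"params": near, "lr": 0.5 * base_lr})
    if phase >= 3:
        far = [p for b in model.blocks[2:] for p in b.parameters()]
        groups.append({"params": far + list(model.norm.parameters()),
                       "lr": 0.25 * base_lr})

    for g in groups:                       # unfreeze only this phase's groups
        for p in g["params"]:
            p.requires_grad_(True)
    return torch.optim.AdamW(groups, weight_decay=0.1)
```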
### Training Config
| Parameter | Value |
|---|---|
| Training tokens | 502.7M |
| Validation tokens | 55.7M (subsampled to 5K seqs for eval) |
| Context length | 256 |
| Batch size | 16 × 2 grad accum = 32 effective |
| Peak LR | 6e-4 (cosine decay) |
| Warmup | 10% of steps |
| Weight decay | 0.1 |
| Precision | FP16 mixed |
| Hardware | NVIDIA RTX 3060 12GB |
| Wall time | 6,872s (~115 min) |
| Corpus | Curated philosophy text |
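The learning-rate schedule implied by the config (linear warmup over 10% of the 26,000 steps, then cosine decay from the 6e-4 peak) can be sketched as below; the decay-to-zero floor is an assumption.

```python
import math


def lr_at(step: int, total_steps: int = 26_000, peak_lr: float = 6e-4,
          warmup_frac: float = 0.10) -> float:
    """Linear warmup for the first 10% of steps, then cosine decay to zero."""
    warmup = int(total_steps * warmup_frac)   # 2,600 steps
    if step < warmup:
        return peak_lr * step / warmup
    progress = (step - warmup) / (total_steps - warmup)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```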
## Parameter Counts
| Component | Parameters |
|---|---|
| Junction | 41,600 |
| Output Projection | 40,960 |
| Near Blocks (0–1) | 2,603,270 |
| Far Blocks (2–7) | 7,809,810 |
| RMSNorm | 320 |
| Donor Embedding (frozen) | 6,438,912 |
| Trainable | 10,495,960 |
| Total | 16,934,872 |
## Results
| Metric | Value |
|---|---|
| Final val loss (step 26,000) | 5.502 |
| Final val PPL | 245.2 |
| Best-checkpoint val loss (step 25,000) | 5.507 |
| Best-checkpoint val PPL | 246.4 |
Val loss decreased monotonically across all 3 phases with no overfitting:
| Checkpoint | Val Loss | Val PPL | Phase |
|---|---|---|---|
| Step 2,000 | 6.501 | 666.0 | P1 |
| Step 11,500 | 6.041 | 420.4 | P2 |
| Step 15,000 | 5.788 | 326.5 | P2 |
| Step 25,000 | 5.507 | 246.4 | P3 |
| Step 26,000 | 5.502 | 245.2 | P3 |
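As a sanity check, each reported perplexity is exp(val loss) to within rounding:

```python
import math

# (val_loss, val_ppl) pairs from the tables above
pairs = [(6.501, 666.0), (6.041, 420.4), (5.788, 326.5),
         (5.507, 246.4), (5.502, 245.2)]
for loss, ppl in pairs:
    assert abs(math.exp(loss) - ppl) / ppl < 0.005  # agree to < 0.5%
```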
## Files
| File | Description |
|---|---|
| `vocab_fusion_best.pt` | Best checkpoint (step 25,000, val_loss=5.507) |
| `vocab_fusion_final.pt` | Final checkpoint (step 26,000, val_loss=5.502) |
| `tokenizer/` | Pythia tokenizer (GPT-NeoX, 50K BPE) |
## Usage
```python
import torch
from transformers import AutoTokenizer, GPTNeoXForCausalLM

# Requires the symbiogenesis library
from vocab_fusion_experiment.model import VocabFusionConfig, VocabFusionModel

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    "LisaMegaWatts/VocabFusion-SymbioGPT-10M", subfolder="tokenizer"
)

# Load model: the donor supplies the frozen embedding the host ties its logits to
config = VocabFusionConfig()
donor = GPTNeoXForCausalLM.from_pretrained("EleutherAI/pythia-14m")
model = VocabFusionModel(config, donor.gpt_neox.embed_in)
state = torch.load("vocab_fusion_best.pt", map_location="cpu", weights_only=True)
model.load_state_dict(state)
model.eval()

# Greedy next-token prediction
input_ids = tokenizer.encode("The nature of consciousness", return_tensors="pt")
with torch.no_grad():
    logits = model(input_ids)
next_token = logits[0, -1].argmax()
print(tokenizer.decode([next_token.item()]))
```
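The usage snippet predicts a single token; a minimal greedy-decoding loop extends it. This is a sketch assuming the model returns raw logits of shape (batch, seq, vocab), as above; `generate` is a hypothetical helper, not part of the project's API.

```python
import torch


@torch.no_grad()
def generate(model, tokenizer, prompt: str, max_new_tokens: int = 32) -> str:
    """Greedy decoding; truncates input to the 256-token training context."""
    ids = tokenizer.encode(prompt, return_tensors="pt")
    for _ in range(max_new_tokens):
        logits = model(ids[:, -256:])              # (B, T, vocab)
        next_id = logits[0, -1].argmax().view(1, 1)
        ids = torch.cat([ids, next_id], dim=1)
    return tokenizer.decode(ids[0])
```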
## Part of Symbiogenesis
This model is part of the Symbiogenesis research project, exploring biological fusion strategies for neural network composition.
W&B Run: vocab-fusion-pythia14m