# VocabFusion-SymbioGPT-10M

Cross-Scale Vocabulary Fusion: a frozen Pythia-14m embedding (50K vocab, d=128) fused into a trained SymbioGPT-10M host (2K vocab, d=320) via a learned junction layer.

The donor provides vocabulary knowledge (50K BPE tokens from The Pile); the host provides structural/syntactic knowledge (multi-organelle architecture trained on curated philosophy text).

## Architecture

```
input_ids → Pythia Embedding (frozen, 50304×128)
          → Junction (Linear 128→320 + LayerNorm + SiLU)
          → 8× SymbioBlocks (CausalConv + Monarch + LongConv + Attention, d=320)
          → RMSNorm
          → Output Projection (Linear 320→128)
          → Weight-tied logits (F.linear with frozen Pythia embedding)
          → logits (50304 vocab)
```
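The junction and the weight-tied output head can be sketched in a few lines of PyTorch. This is a minimal illustration of the pipeline above, not the actual `vocab_fusion_experiment` code; the `Junction` module name and the bias/normalization details are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Junction(nn.Module):
    """Maps frozen 128-d donor embeddings into the host's 320-d stream."""
    def __init__(self, d_donor=128, d_host=320):
        super().__init__()
        self.proj = nn.Linear(d_donor, d_host)
        self.norm = nn.LayerNorm(d_host)

    def forward(self, x):
        return F.silu(self.norm(self.proj(x)))

# Weight tying at the output: project 320-d hidden states back to 128-d,
# then score them against the frozen donor embedding matrix.
embed = nn.Embedding(50304, 128)       # stands in for the frozen Pythia table
out_proj = nn.Linear(320, 128, bias=False)

h = torch.randn(1, 7, 320)                     # host hidden states [B, T, d_host]
logits = F.linear(out_proj(h), embed.weight)   # [1, 7, 50304]
```

Because the logit head reuses the frozen embedding, the model gains a 50K-token output vocabulary without adding a 50K-row trainable softmax matrix.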

## SymbioBlock Organelles

Each of the 8 host blocks contains 4 parallel organelles with learned gating:

| Organelle | Description |
|---|---|
| CausalConv | Left-padded 1D convolution (kernel=4) for local patterns |
| Monarch | Monarch Mixer with learned causal mask for efficient global mixing |
| LongConv | Exponential-decay convolution for medium-range dependencies |
| Attention | Multi-head attention with RoPE (5 heads, head_dim=64) |
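The parallel-organelles-with-learned-gating pattern can be sketched as follows. The sub-modules here are plain linear layers standing in for CausalConv/Monarch/LongConv/Attention, and the softmax gating is an assumption about how the learned gates combine outputs, not the actual SymbioBlock implementation.

```python
import torch
import torch.nn as nn

class GatedOrganelles(nn.Module):
    """Run parallel sub-modules ('organelles') on the same input and blend
    their outputs with learned per-organelle gate weights."""
    def __init__(self, d=320, n_organelles=4):
        super().__init__()
        # Stand-ins for the four organelle types.
        self.organelles = nn.ModuleList(nn.Linear(d, d) for _ in range(n_organelles))
        self.gates = nn.Parameter(torch.zeros(n_organelles))

    def forward(self, x):
        w = torch.softmax(self.gates, dim=0)   # gate weights sum to 1
        return sum(w[i] * org(x) for i, org in enumerate(self.organelles))
```

The gates let training discover which mixing primitive matters at each depth, rather than fixing the blend by hand.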

## Training

Progressive unfreezing across 3 phases (Chinchilla-optimal schedule):

| Phase | Steps | Trainable | LR Scale | Params |
|---|---|---|---|---|
| P1 | 0–10,399 | Junction + Output Projection | 1.0x | 82K |
| P2 | 10,400–18,199 | + 2 near blocks | 0.5x | 2.7M |
| P3 | 18,200–25,999 | + 6 far blocks + RMSNorm | 0.25x | 10.5M |
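Progressive unfreezing amounts to toggling `requires_grad` on successive parameter groups at phase boundaries. A toy sketch under stated assumptions: the modules below are dummy stand-ins (their parameter counts will not match the real model), and the per-phase LR scaling would in practice be applied via optimizer parameter groups.

```python
import torch.nn as nn

def set_trainable(modules, flag):
    for m in modules:
        for p in m.parameters():
            p.requires_grad = flag

def n_trainable(*modules):
    return sum(p.numel() for m in modules for p in m.parameters() if p.requires_grad)

# Toy stand-ins for the host components.
junction = nn.Sequential(nn.Linear(128, 320), nn.LayerNorm(320))
out_proj = nn.Linear(320, 128, bias=False)
blocks = nn.ModuleList(nn.Linear(320, 320) for _ in range(8))
norm = nn.LayerNorm(320)

# P1: freeze everything, then unfreeze only junction + output projection.
set_trainable([junction, out_proj, blocks, norm], False)
set_trainable([junction, out_proj], True)

# P2: additionally unfreeze the 2 blocks nearest the junction (0.5x LR).
set_trainable(blocks[:2], True)

# P3: unfreeze the remaining 6 blocks and the final norm (0.25x LR).
set_trainable(list(blocks[2:]) + [norm], True)
```

Reducing the LR scale as deeper (already useful) parameters unfreeze limits how far each new phase can perturb what the earlier phases learned.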

### Training Config

| Parameter | Value |
|---|---|
| Training tokens | 502.7M |
| Validation tokens | 55.7M (subsampled to 5K seqs for eval) |
| Context length | 256 |
| Batch size | 16 × 2 grad accum = 32 effective |
| Peak LR | 6e-4 (cosine decay) |
| Warmup | 10% of steps |
| Weight decay | 0.1 |
| Precision | FP16 mixed |
| Hardware | NVIDIA RTX 3060 12GB |
| Wall time | 6,872 s (~115 min) |
| Corpus | Curated philosophy text |
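The LR schedule (peak 6e-4, 10% linear warmup, cosine decay) can be written as a small pure-Python function. The decay-to-zero floor is an assumption; the card only states "cosine decay".

```python
import math

PEAK_LR, TOTAL_STEPS = 6e-4, 26_000
WARMUP = int(0.10 * TOTAL_STEPS)  # 2,600 steps

def lr_at(step):
    """Linear warmup to the peak, then cosine decay to zero."""
    if step < WARMUP:
        return PEAK_LR * step / WARMUP
    progress = (step - WARMUP) / (TOTAL_STEPS - WARMUP)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))
```

At the stated context length and effective batch size, each step consumes 32 × 256 = 8,192 tokens.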

## Parameter Counts

| Component | Parameters |
|---|---|
| Junction | 41,600 |
| Output Projection | 40,960 |
| Near Blocks (0–1) | 2,603,270 |
| Far Blocks (2–7) | 7,809,810 |
| RMSNorm | 320 |
| Donor Embedding (frozen) | 6,438,912 |
| **Trainable** | **10,495,960** |
| **Total** | **16,934,872** |
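The table's totals are internally consistent and the shape-derivable entries follow directly from the stated dimensions, which a few lines of arithmetic confirm:

```python
d_donor, d_host, vocab = 128, 320, 50304

donor_embedding = vocab * d_donor      # 50304 x 128 = 6,438,912 (frozen)
output_projection = d_host * d_donor   # 320 x 128 = 40,960 (bias-free tied head)

trainable_components = {
    "junction": 41_600,                # taken from the table
    "output_projection": output_projection,
    "near_blocks": 2_603_270,          # taken from the table
    "far_blocks": 7_809_810,           # taken from the table
    "rmsnorm": d_host,                 # one scale per channel
}
trainable = sum(trainable_components.values())  # 10,495,960
total = trainable + donor_embedding             # 16,934,872
```

Note that the P1 "82K" figure in the training table is junction + output projection: 41,600 + 40,960 = 82,560.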

## Results

| Metric | Value |
|---|---|
| Final val loss | 5.502 |
| Final val PPL | 245.2 |
| Best val loss | 5.507 (step 25,000) |
| Best val PPL | 246.4 |

Val loss decreased monotonically across all 3 phases with no overfitting:

| Checkpoint | Val Loss | Val PPL | Phase |
|---|---|---|---|
| Step 2,000 | 6.501 | 666.0 | P1 |
| Step 11,500 | 6.041 | 420.4 | P2 |
| Step 15,000 | 5.788 | 326.5 | P2 |
| Step 25,000 | 5.507 | 246.4 | P3 |
| Step 26,000 | 5.502 | 245.2 | P3 |
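The PPL column is just the exponential of the cross-entropy loss (in nats), so the pairs can be checked directly:

```python
import math

# Perplexity = exp(cross-entropy loss in nats).
for loss in (5.507, 5.502):
    print(f"loss {loss:.3f} -> ppl {math.exp(loss):.1f}")
# loss 5.507 -> ppl 246.4
# loss 5.502 -> ppl 245.2
```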

## Files

| File | Description |
|---|---|
| `vocab_fusion_best.pt` | Best checkpoint (step 25,000, val_loss=5.507) |
| `vocab_fusion_final.pt` | Final checkpoint (step 26,000, val_loss=5.502) |
| `tokenizer/` | Pythia tokenizer (GPT-NeoX, 50K BPE) |

## Usage

```python
import torch
from transformers import AutoTokenizer, GPTNeoXForCausalLM

# Load model classes (requires the symbiogenesis library)
from vocab_fusion_experiment.model import VocabFusionConfig, VocabFusionModel

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    "LisaMegaWatts/VocabFusion-SymbioGPT-10M", subfolder="tokenizer"
)

# Build the fused model around the frozen Pythia embedding
config = VocabFusionConfig()
donor = GPTNeoXForCausalLM.from_pretrained("EleutherAI/pythia-14m")
model = VocabFusionModel(config, donor.gpt_neox.embed_in)

state = torch.load("vocab_fusion_best.pt", map_location="cpu", weights_only=True)
model.load_state_dict(state)
model.eval()

# Greedy next-token prediction
input_ids = tokenizer.encode("The nature of consciousness", return_tensors="pt")
with torch.no_grad():
    logits = model(input_ids)                    # [1, seq_len, 50304]
    next_token = logits[0, -1].argmax().item()
    print(tokenizer.decode([next_token]))
```
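The single-step prediction above extends naturally to a decoding loop. A minimal greedy sketch, assuming (as in the snippet above) that `model(input_ids)` returns a `[batch, seq, vocab]` logits tensor; `greedy_generate` is a hypothetical helper, not part of the library.

```python
import torch

def greedy_generate(model, tokenizer, prompt, max_new_tokens=20):
    """Append the argmax token one step at a time (hypothetical helper)."""
    ids = tokenizer.encode(prompt, return_tensors="pt")
    with torch.no_grad():
        for _ in range(max_new_tokens):
            logits = model(ids)                      # [1, T, vocab]
            next_id = logits[0, -1].argmax().item()
            ids = torch.cat([ids, torch.tensor([[next_id]])], dim=1)
    return tokenizer.decode(ids[0].tolist())
```

Sampling (temperature, top-k) would replace the `argmax` line; the loop structure is otherwise the same.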

## Part of Symbiogenesis

This model is part of the Symbiogenesis research project, exploring biological fusion strategies for neural network composition.

W&B Run: vocab-fusion-pythia14m

GitHub: MonumentalSystems/symbiogenesis-experiments
