# VocabFusion-SymbioGPT-10M
Cross-Scale Vocabulary Fusion: a frozen Pythia-14m embedding (50K vocab, d=128) fused into a trained SymbioGPT-10M host (2K vocab, d=320) via a learned junction layer.
The donor provides vocabulary knowledge (50K BPE tokens from The Pile); the host provides structural/syntactic knowledge (multi-organelle architecture trained on curated philosophy text).
## Architecture
```
input_ids → Pythia Embedding (frozen, 50304×128)
          → Junction (Linear 128→320 + LayerNorm + SiLU)
          → 8× SymbioBlocks (CausalConv + Monarch + LongConv + Attention, d=320)
          → RMSNorm
          → Output Projection (Linear 320→128)
          → Weight-tied logits (F.linear with frozen Pythia embedding)
          → logits (50304 vocab)
```
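The fusion path can be sketched as a small PyTorch module. This is a minimal illustration, not the actual `vocab_fusion_experiment` implementation: the class name `JunctionForward` is hypothetical, the host blocks are omitted, and the bias-free linears are inferred from the parameter counts below.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class JunctionForward(nn.Module):
    """Sketch of the donor→junction→output→tied-logits path (host blocks omitted)."""

    def __init__(self, donor_embed: nn.Embedding, d_host: int = 320):
        super().__init__()
        d_donor = donor_embed.embedding_dim      # 128 for Pythia-14m
        self.embed = donor_embed                 # donor embedding, kept frozen
        self.embed.weight.requires_grad_(False)
        self.junction = nn.Sequential(
            nn.Linear(d_donor, d_host, bias=False),  # 128 → 320
            nn.LayerNorm(d_host),
            nn.SiLU(),
        )
        self.out_proj = nn.Linear(d_host, d_donor, bias=False)  # 320 → 128

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        x = self.embed(input_ids)   # (B, T, 128)
        x = self.junction(x)        # (B, T, 320) — 8 SymbioBlocks would run here
        x = self.out_proj(x)        # (B, T, 128)
        # weight-tied logits against the frozen donor embedding
        return F.linear(x, self.embed.weight)  # (B, T, 50304)
```

With this shape, the junction plus output projection account for the 82K trainable parameters of Phase 1.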
### SymbioBlock Organelles
Each of the 8 host blocks contains 4 parallel organelles with learned gating:
| Organelle | Description |
|---|---|
| CausalConv | Left-padded 1D convolution (kernel=4) for local patterns |
| Monarch | Monarch Mixer with learned causal mask for efficient global mixing |
| LongConv | Exponential-decay convolution for medium-range dependencies |
| Attention | Multi-head attention with RoPE (5 heads, head_dim=64) |
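The gated combination of parallel organelles can be sketched as follows. This is an assumption-laden illustration: the real block's gating may be per-channel or per-token rather than a scalar softmax mixture, and the four linear layers here are stand-ins for the actual CausalConv / Monarch / LongConv / Attention organelles.

```python
import torch
import torch.nn as nn


class GatedOrganelles(nn.Module):
    """Sketch of one block's learned gating over 4 parallel organelles."""

    def __init__(self, d_model: int = 320, n_organelles: int = 4):
        super().__init__()
        # stand-ins for CausalConv / Monarch / LongConv / Attention
        self.organelles = nn.ModuleList(
            nn.Linear(d_model, d_model) for _ in range(n_organelles)
        )
        self.gate_logits = nn.Parameter(torch.zeros(n_organelles))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gates = torch.softmax(self.gate_logits, dim=0)                # learned mixture
        branches = torch.stack([org(x) for org in self.organelles])   # (4, B, T, D)
        return x + torch.einsum("o,obtd->btd", gates, branches)       # residual add
```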
## Training
Progressive unfreezing across 3 phases (Chinchilla-optimal schedule):
| Phase | Steps | Trainable | LR Scale | Params |
|---|---|---|---|---|
| P1 | 0–10,399 | Junction + Output Projection | 1.0x | 82K |
| P2 | 10,400–18,199 | + 2 near blocks | 0.5x | 2.7M |
| P3 | 18,200–25,999 | + 6 far blocks + RMSNorm | 0.25x | 10.5M |
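The phase schedule above can be expressed as optimizer parameter groups. A minimal sketch, assuming model attributes named `junction`, `out_proj`, `blocks`, and `norm` (the helper `configure_phase` and those attribute names are hypothetical, not the project's API):

```python
import torch


def configure_phase(model, phase: int, base_lr: float = 6e-4):
    """P1: junction + output projection at 1.0x LR.
    P2 adds near blocks 0-1 at 0.5x; P3 adds far blocks 2-7 and RMSNorm at 0.25x."""
    for p in model.parameters():
        p.requires_grad_(False)

    groups = [{"params": list(model.junction.parameters())
                         + list(model.out_proj.parameters()),
               "lr": base_lr}]
    if phase >= 2:
        near = [p for b in model.blocks[:2] for p in b.parameters()]
        groups.append({"params": near, "lr": 0.5 * base_lr})
    if phase >= 3:
        far = [p for b in model.blocks[2:] for p in b.parameters()]
        groups.append({"params": far + list(model.norm.parameters()),
                       "lr": 0.25 * base_lr})

    for g in groups:                       # unfreeze only this phase's groups
        for p in g["params"]:
            p.requires_grad_(True)
    return torch.optim.AdamW(groups, weight_decay=0.1)
```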
### Training Config
| Parameter | Value |
|---|---|
| Training tokens | 502.7M |
| Validation tokens | 55.7M (subsampled to 5K seqs for eval) |
| Context length | 256 |
| Batch size | 16 × 2 grad accum = 32 effective |
| Peak LR | 6e-4 (cosine decay) |
| Warmup | 10% of steps |
| Weight decay | 0.1 |
| Precision | FP16 mixed |
| Hardware | NVIDIA RTX 3060 12GB |
| Wall time | 6,872s (~115 min) |
| Corpus | Curated philosophy text |
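The learning-rate schedule implied by the config (linear warmup over 10% of the 26,000 steps, then cosine decay from the 6e-4 peak) can be sketched as below; the decay-to-zero floor is an assumption.

```python
import math


def lr_at(step: int, total_steps: int = 26_000, peak_lr: float = 6e-4,
          warmup_frac: float = 0.10) -> float:
    """Linear warmup for the first 10% of steps, then cosine decay to zero."""
    warmup = int(total_steps * warmup_frac)   # 2,600 steps
    if step < warmup:
        return peak_lr * step / warmup
    progress = (step - warmup) / (total_steps - warmup)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```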
## Parameter Counts
| Component | Parameters |
|---|---|
| Junction | 41,600 |
| Output Projection | 40,960 |
| Near Blocks (0–1) | 2,603,270 |
| Far Blocks (2–7) | 7,809,810 |
| RMSNorm | 320 |
| Donor Embedding (frozen) | 6,438,912 |
| Trainable | 10,495,960 |
| Total | 16,934,872 |
## Results
| Metric | Value |
|---|---|
| Final val loss (step 26,000) | 5.502 |
| Final val PPL | 245.2 |
| Best-checkpoint val loss (step 25,000) | 5.507 |
| Best-checkpoint val PPL | 246.4 |
Val loss decreased monotonically across all 3 phases with no overfitting:
| Checkpoint | Val Loss | Val PPL | Phase |
|---|---|---|---|
| Step 2,000 | 6.501 | 666.0 | P1 |
| Step 11,500 | 6.041 | 420.4 | P2 |
| Step 15,000 | 5.788 | 326.5 | P2 |
| Step 25,000 | 5.507 | 246.4 | P3 |
| Step 26,000 | 5.502 | 245.2 | P3 |
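As a sanity check, each reported perplexity is exp(val loss) to within rounding:

```python
import math

# (val_loss, val_ppl) pairs from the tables above
pairs = [(6.501, 666.0), (6.041, 420.4), (5.788, 326.5),
         (5.507, 246.4), (5.502, 245.2)]
for loss, ppl in pairs:
    assert abs(math.exp(loss) - ppl) / ppl < 0.005  # agree to < 0.5%
```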
## Files
| File | Description |
|---|---|
| `vocab_fusion_best.pt` | Best checkpoint (step 25,000, val_loss=5.507) |
| `vocab_fusion_final.pt` | Final checkpoint (step 26,000, val_loss=5.502) |
| `tokenizer/` | Pythia tokenizer (GPT-NeoX, 50K BPE) |
## Usage
```python
import torch
from transformers import AutoTokenizer, GPTNeoXForCausalLM

# Requires the symbiogenesis library
from vocab_fusion_experiment.model import VocabFusionConfig, VocabFusionModel

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    "LisaMegaWatts/VocabFusion-SymbioGPT-10M", subfolder="tokenizer"
)

# Load model: the donor supplies the frozen embedding the host ties its logits to
config = VocabFusionConfig()
donor = GPTNeoXForCausalLM.from_pretrained("EleutherAI/pythia-14m")
model = VocabFusionModel(config, donor.gpt_neox.embed_in)
state = torch.load("vocab_fusion_best.pt", map_location="cpu", weights_only=True)
model.load_state_dict(state)
model.eval()

# Greedy next-token prediction
input_ids = tokenizer.encode("The nature of consciousness", return_tensors="pt")
with torch.no_grad():
    logits = model(input_ids)
next_token = logits[0, -1].argmax()
print(tokenizer.decode([next_token.item()]))
```
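The usage snippet predicts a single token; a minimal greedy-decoding loop extends it. This is a sketch assuming the model returns raw logits of shape (batch, seq, vocab), as above; `generate` is a hypothetical helper, not part of the project's API.

```python
import torch


@torch.no_grad()
def generate(model, tokenizer, prompt: str, max_new_tokens: int = 32) -> str:
    """Greedy decoding; truncates input to the 256-token training context."""
    ids = tokenizer.encode(prompt, return_tensors="pt")
    for _ in range(max_new_tokens):
        logits = model(ids[:, -256:])              # (B, T, vocab)
        next_id = logits[0, -1].argmax().view(1, 1)
        ids = torch.cat([ids, next_id], dim=1)
    return tokenizer.decode(ids[0])
```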
## Part of Symbiogenesis
This model is part of the Symbiogenesis research project, exploring biological fusion strategies for neural network composition.
W&B Run: vocab-fusion-pythia14m