# aarav-gpt-zg2-instruct

Adaptive Autoregressive Reasoning Architecture for Vocabulary (aarav) is a 336M-parameter decoder-only language model trained from scratch using a modern Llama-style architecture.
## Model Details
| Property | Value |
|---|---|
| Architecture | Llama-style (Pre-RMSNorm + RoPE + SwiGLU + GQA) |
| Parameters | 336.1M |
| Layers | 24 |
| Hidden dim | 1024 |
| Attention heads | 16 (query) / 4 (KV) |
| Context length | 1024 tokens |
| Vocab size | 32,000 |
| Tokenizer | SentencePiece (32K vocab) |
| Training tokens | 0.00B |
| Training steps | 11,500 |
| Validation loss | 1.2686 |
| Validation perplexity | 3.6 |
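The reported perplexity is simply the exponential of the validation cross-entropy loss; a quick sanity check using the values from the table above:

```python
import math

val_loss = 1.2686                 # validation loss from the table (nats/token)
perplexity = math.exp(val_loss)   # perplexity = exp(cross-entropy loss)
print(round(perplexity, 2))       # prints 3.56, reported as 3.6 in the table
```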
## Architecture
This model uses the modern 2023-2025 consensus architecture:
- RMSNorm (pre-normalization) for training stability
- Rotary Position Embeddings (RoPE) instead of learned position embeddings
- SwiGLU activation in feed-forward layers (~8/3 expansion ratio)
- Grouped Query Attention (GQA) with 4:1 query-to-KV head ratio
- QK-normalization for attention stability
- No bias terms throughout the model
- Z-loss regularization during training
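A minimal NumPy sketch of two of the components above, pre-RMSNorm and SwiGLU, using the hidden dim from the table. This is purely illustrative and not the model's actual implementation; the `2816` feed-forward width is an assumption (~8/3 × 1024, rounded up to a multiple of 256):

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # Pre-RMSNorm: divide by the root-mean-square of the features,
    # then apply a learned gain (no bias term, matching the list above).
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * weight

def swiglu(x, w_gate, w_up, w_down):
    # SwiGLU: SiLU(x @ W_gate) elementwise-gates (x @ W_up), then project down.
    gate = x @ w_gate
    silu = gate / (1.0 + np.exp(-gate))   # SiLU(z) = z * sigmoid(z)
    return (silu * (x @ w_up)) @ w_down

rng = np.random.default_rng(0)
d, d_ff = 1024, 2816                      # hidden dim from the table; d_ff is assumed
x = rng.standard_normal((2, d))
h = rms_norm(x, np.ones(d))
y = swiglu(h,
           rng.standard_normal((d, d_ff)) * 0.02,
           rng.standard_normal((d, d_ff)) * 0.02,
           rng.standard_normal((d_ff, d)) * 0.02)
print(y.shape)  # (2, 1024): the FF block preserves the hidden dimension
```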
## Training Data
Trained on a diverse mix of:
- 70% C4 (Common Crawl, cleaned)
- 30% Wikipedia (English, November 2023)
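A 70/30 mix like the one above can be produced with a simple weighted sampler over document streams; a sketch (the stand-in streams are placeholders, not the actual data-loading code):

```python
import itertools
import random

def mixed_stream(sources, weights, seed=0):
    # Draw each document from a source chosen according to the mix weights.
    rng = random.Random(seed)
    names = list(sources)
    while True:
        name = rng.choices(names, weights=weights, k=1)[0]
        yield next(sources[name])

# Stand-in streams; real training would yield tokenized documents.
c4 = itertools.cycle(["c4 doc"])
wiki = itertools.cycle(["wiki doc"])
stream = mixed_stream({"c4": c4, "wikipedia": wiki}, weights=[0.7, 0.3])

sample = [next(stream) for _ in range(1000)]
print(sample.count("c4 doc") / 1000)  # close to 0.7
```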
## Usage

### With PyTorch (custom code)
```python
import torch
import sentencepiece as spm
from modern_llm_model import ModernGPT, ModelConfig

# Load model
checkpoint = torch.load("pytorch_model.pt", map_location="cuda")
config = ModelConfig(**checkpoint["config"])
model = ModernGPT(config).cuda()
model.load_state_dict(checkpoint["model"])
model.eval()

# Generate
sp = spm.SentencePieceProcessor()
sp.load("tokenizer.model")
tokens = sp.encode("The future of AI is")
x = torch.tensor([tokens], dtype=torch.long, device="cuda")
output = model.generate(x, max_new_tokens=100, temperature=0.7, top_k=40)
print(sp.decode(output[0].tolist()))
```
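The `temperature=0.7, top_k=40` arguments above implement top-k sampling. A minimal NumPy sketch of one such sampling step, operating on a logits vector (this is an illustration of the technique, not the model's own `generate` code):

```python
import numpy as np

def sample_top_k(logits, k=40, temperature=0.7, rng=None):
    # Keep only the k highest logits, temperature-scale them,
    # softmax into probabilities, then sample one token id.
    rng = rng or np.random.default_rng()
    top = np.argpartition(logits, -k)[-k:]   # indices of the k largest logits
    scaled = logits[top] / temperature
    probs = np.exp(scaled - scaled.max())    # subtract max for stability
    probs /= probs.sum()
    return int(rng.choice(top, p=probs))

logits = np.zeros(32000)   # vocab size from the table
logits[123] = 50.0         # make one token overwhelmingly likely
print(sample_top_k(logits))  # prints 123
```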
## Training Configuration
```json
{
  "base_checkpoint": "modern_checkpoints/best_model.pt",
  "tokenizer_path": "wiki.model",
  "max_steps": 15000,
  "batch_size": 4,
  "grad_accum_steps": 8,
  "block_size": 1024,
  "lr": 2e-05,
  "min_lr": 2e-06,
  "warmup_steps": 500,
  "weight_decay": 0.01,
  "beta1": 0.9,
  "beta2": 0.95,
  "grad_clip": 1.0,
  "eval_interval": 500,
  "eval_iters": 30,
  "log_interval": 25,
  "checkpoint_interval": 2000,
  "checkpoint_dir": "sft_v2_checkpoints",
  "patience": 15,
  "min_delta": 0.001,
  "use_amp": true,
  "compile_model": true,
  "datasets": "alpaca,slimorca",
  "val_split": 0.05,
  "max_samples": 0,
  "wandb_project": "llm-training",
  "wandb_run_name": "aarav-gpt-zg2-sft-v2",
  "wandb_enabled": true
}
```
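The effective batch size implied by this config follows from `batch_size`, `grad_accum_steps`, and `block_size`:

```python
# Values from the config above
batch_size = 4          # sequences per micro-batch
grad_accum_steps = 8    # micro-batches accumulated per optimizer step
block_size = 1024       # tokens per sequence
max_steps = 15000

tokens_per_step = batch_size * grad_accum_steps * block_size
print(tokens_per_step)               # prints 32768 tokens per optimizer step
print(tokens_per_step * max_steps)   # prints 491520000 (~0.49B tokens at max_steps)
```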
## Limitations

- This is a base model (partially instruction-tuned): it does text completion and limited conversation. Do not expect ChatGPT-like responses.
- Trained on English data only
- 336M parameters, far smaller than production LLMs; intended for research and education
- May produce factually incorrect, biased, or nonsensical text
## License
Apache 2.0