aarav-gpt-zg2-instruct

Adaptive Autoregressive Reasoning Architecture for Vocabulary (aarav) is a 336M-parameter decoder-only language model trained from scratch using a modern Llama-style architecture.

Model Details

| Property | Value |
|----------|-------|
| Architecture | Llama-style (Pre-RMSNorm + RoPE + SwiGLU + GQA) |
| Parameters | 336.1M |
| Layers | 24 |
| Hidden dim | 1024 |
| Attention heads | 16 (query) / 4 (KV) |
| Context length | 1024 tokens |
| Vocab size | 32,000 |
| Tokenizer | SentencePiece (32K vocab) |
| Training tokens | 0.00B |
| Training steps | 11,500 |
| Validation loss | 1.2686 |
| Validation perplexity | 3.6 |
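The reported perplexity follows directly from the validation loss: for a cross-entropy loss measured in nats, perplexity is exp(loss). A quick check:

```python
import math

# Perplexity is the exponential of the cross-entropy loss (in nats).
val_loss = 1.2686  # validation loss from the table above
perplexity = math.exp(val_loss)
print(f"{perplexity:.2f}")  # ~3.56, matching the reported 3.6
```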

Architecture

This model uses the architectural choices that became the 2023-2025 consensus for decoder-only LLMs:

  • RMSNorm (pre-normalization) for training stability
  • Rotary Position Embeddings (RoPE) instead of learned position embeddings
  • SwiGLU activation in feed-forward layers (~8/3 expansion ratio)
  • Grouped Query Attention (GQA) with 4:1 query-to-KV head ratio
  • QK-normalization for attention stability
  • No bias terms throughout the model
  • Z-loss regularization during training
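Two of the components named above can be sketched in a few lines of PyTorch. This is an illustration of the techniques, not the model's actual source; the class names, `eps` value, and exact hidden width are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Pre-normalization without mean-centering and without a bias term."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        # Scale by the reciprocal root-mean-square of the features.
        rms = torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return x * rms * self.weight

class SwiGLU(nn.Module):
    """Gated feed-forward: SiLU gate, ~8/3 expansion, no bias terms."""
    def __init__(self, dim: int):
        super().__init__()
        hidden = int(8 * dim / 3)
        self.w_gate = nn.Linear(dim, hidden, bias=False)
        self.w_up = nn.Linear(dim, hidden, bias=False)
        self.w_down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

# Shapes are preserved end to end, as in a transformer block.
x = torch.randn(2, 4, 1024)
y = SwiGLU(1024)(RMSNorm(1024)(x))
print(y.shape)  # torch.Size([2, 4, 1024])
```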

Training Data

Trained on a diverse mix of:

  • 70% C4 (Common Crawl, cleaned)
  • 30% Wikipedia (English, November 2023)
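One simple way to realize a 70/30 mix is to draw each training document's source with those weights. The card does not describe the actual data pipeline, so this is purely illustrative:

```python
import random

# Hypothetical sketch of the 70/30 source mix: sample each document's
# corpus with the stated weights. The real pipeline is not described here.
random.seed(0)
sources = ["c4", "wikipedia"]
weights = [0.7, 0.3]

picks = random.choices(sources, weights=weights, k=10_000)
c4_frac = picks.count("c4") / len(picks)
print(f"C4 fraction over 10k draws: {c4_frac:.2f}")  # close to 0.70
```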

Usage

With PyTorch (custom code)

```python
import torch
import sentencepiece as spm
from modern_llm_model import ModernGPT, ModelConfig

# Load the checkpoint and rebuild the model from its saved config
checkpoint = torch.load("pytorch_model.pt", map_location="cuda")
config = ModelConfig(**checkpoint["config"])
model = ModernGPT(config).cuda()
model.load_state_dict(checkpoint["model"])
model.eval()

# Load the SentencePiece tokenizer
sp = spm.SentencePieceProcessor()
sp.load("tokenizer.model")

# Encode a prompt and generate a continuation
tokens = sp.encode("The future of AI is")
x = torch.tensor([tokens], dtype=torch.long, device="cuda")
with torch.no_grad():
    output = model.generate(x, max_new_tokens=100, temperature=0.7, top_k=40)
print(sp.decode(output[0].tolist()))
```
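If you load the weights into your own reimplementation of the architecture, a `generate` method may not be available. A minimal top-k sampling loop with the same parameters could look like the following; the assumption that calling the model returns raw logits of shape `(batch, seq, vocab)` is mine, not the repo's:

```python
import torch

@torch.no_grad()
def sample(model, x, max_new_tokens=100, temperature=0.7, top_k=40,
           block_size=1024):
    # Minimal top-k sampling loop approximating the generate() call above.
    # Assumes model(x) returns logits of shape (batch, seq, vocab).
    model.eval()
    for _ in range(max_new_tokens):
        x_cond = x[:, -block_size:]           # stay within the context window
        logits = model(x_cond)[:, -1, :]      # logits at the last position
        logits = logits / temperature
        v, _ = torch.topk(logits, top_k)
        logits[logits < v[:, [-1]]] = -float("inf")  # mask non-top-k tokens
        probs = torch.softmax(logits, dim=-1)
        next_tok = torch.multinomial(probs, num_samples=1)
        x = torch.cat([x, next_tok], dim=1)
    return x
```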

Training Configuration

```json
{
  "base_checkpoint": "modern_checkpoints/best_model.pt",
  "tokenizer_path": "wiki.model",
  "max_steps": 15000,
  "batch_size": 4,
  "grad_accum_steps": 8,
  "block_size": 1024,
  "lr": 2e-05,
  "min_lr": 2e-06,
  "warmup_steps": 500,
  "weight_decay": 0.01,
  "beta1": 0.9,
  "beta2": 0.95,
  "grad_clip": 1.0,
  "eval_interval": 500,
  "eval_iters": 30,
  "log_interval": 25,
  "checkpoint_interval": 2000,
  "checkpoint_dir": "sft_v2_checkpoints",
  "patience": 15,
  "min_delta": 0.001,
  "use_amp": true,
  "compile_model": true,
  "datasets": "alpaca,slimorca",
  "val_split": 0.05,
  "max_samples": 0,
  "wandb_project": "llm-training",
  "wandb_run_name": "aarav-gpt-zg2-sft-v2",
  "wandb_enabled": true
}
```
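As a sanity check on the throughput implied by this config, the effective batch is `batch_size * grad_accum_steps` sequences per optimizer step (assuming every micro-batch holds full 1024-token blocks, which the card does not state explicitly):

```python
# Effective-batch arithmetic from the config above (illustrative; the
# training script's exact token accounting may differ).
batch_size = 4
grad_accum_steps = 8
block_size = 1024
steps_run = 11_500  # steps reported in Model Details (max_steps was 15,000)

sequences_per_step = batch_size * grad_accum_steps      # 32 sequences
tokens_per_step = sequences_per_step * block_size       # 32,768 tokens
total_tokens = tokens_per_step * steps_run              # ~0.38B tokens
print(sequences_per_step, tokens_per_step, total_tokens)
```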

Limitations

  • This is a base model (partially instruction-tuned): it performs text completion and limited conversation. Do not expect ChatGPT-like responses.
  • Trained on English data only
  • 336M parameters, far smaller than production LLMs; intended for research and education
  • May produce factually incorrect, biased, or nonsensical text

License

Apache 2.0
