Qwen3-Micro-Chat

A custom, ultra-lightweight implementation of the Qwen3 architecture designed for rapid prototyping, experimentation, and "memorization-style" training. This project demonstrates a full pipeline from random initialization to a chat-ready model.


🚀 Model Architecture

The model is configured to be exceptionally small to allow for high-speed training on consumer hardware while maintaining the core Qwen3 logic. By tying word embeddings and reducing the hidden dimensions, the model remains functional yet highly efficient.

| Parameter | Value |
|---|---|
| Hidden Size | 64 |
| Intermediate Size | 256 |
| Hidden Layers | 8 |
| Attention Heads | 4 |
| KV Heads | 2 (Grouped Query Attention) |
| Vocab Size | 151,936 |
| Max Context | 512 tokens |
| Tie Word Embeddings | True (avoids a separate 151,936 × 64 ≈ 9.7M-parameter LM head) |
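
These values map one-to-one onto a transformers configuration object. The snippet below is an illustrative sketch rather than the project's own code, and assumes a recent transformers release that ships the Qwen3 classes:

```python
from transformers import Qwen3Config, Qwen3ForCausalLM

# Mirrors the table above; every other Qwen3 default (RoPE theta, RMSNorm eps, ...)
# is left as shipped by the library.
config = Qwen3Config(
    vocab_size=151_936,
    hidden_size=64,
    intermediate_size=256,
    num_hidden_layers=8,
    num_attention_heads=4,
    num_key_value_heads=2,        # grouped-query attention: 2 KV heads serve 4 query heads
    max_position_embeddings=512,
    tie_word_embeddings=True,     # LM head reuses the input embedding matrix
)

# Random initialization -- no pretrained weights are loaded.
model = Qwen3ForCausalLM(config)
print(f"{model.num_parameters():,} parameters")
```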

🛠 Training Pipeline

The project follows a two-step training process, moving from plain language modeling on a base corpus to conversational interaction.

1. Pre-training (pretrain.py)

This script initializes the model with random weights and trains it on a base corpus.

  • Goal: Establish basic token relationships and "memorize" core facts (e.g., the zonetwelve identity).
  • Optimization: Uses an aggressive learning rate with a cosine scheduler and zero weight decay to encourage fast convergence on the provided micro-dataset (see the sketch after this list).
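
A minimal sketch of what this stage could look like with the Hugging Face Trainer; the tokenizer checkpoint, learning rate, epoch count, and stand-in corpus are placeholders, not values taken from pretrain.py:

```python
from datasets import Dataset
from transformers import (AutoTokenizer, DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

# Assumption: the project reuses an off-the-shelf Qwen tokenizer; the tiny corpus
# below is a stand-in for the real base corpus.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
corpus = Dataset.from_dict({"text": ["zonetwelve builds tiny language models."] * 64})
corpus = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

args = TrainingArguments(
    output_dir="out/pretrain",
    learning_rate=1e-3,             # placeholder for the "aggressive" rate
    lr_scheduler_type="cosine",     # cosine decay over the run
    weight_decay=0.0,               # zero weight decay, as noted above
    num_train_epochs=30,
    per_device_train_batch_size=8,
    report_to="none",
)

trainer = Trainer(
    model=model,                    # the randomly initialized Qwen3ForCausalLM from the sketch above
    args=args,
    train_dataset=corpus,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```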

2. Chat Fine-tuning (train_chat.py)

This script takes the output from the pre-training stage and applies ChatML-style formatting.

  • Mechanism: Uses tokenizer.apply_chat_template to structure data into user and assistant roles (as shown in the sketch after this list).
  • Dataset: Focused on identity and basic Q&A, repeated across 800 samples to ensure the model "locks in" the persona.
  • Persona: Trained to identify itself as an LLM built by zonetwelve and to point users to the zonetwelve blog.
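
A sketch of the formatting step with a hypothetical identity sample; the tokenizer checkpoint and the example messages are assumptions, but any Qwen chat tokenizer renders them through its built-in ChatML-style template:

```python
from transformers import AutoTokenizer

# Assumed tokenizer; Qwen chat tokenizers ship a ChatML-style chat template
# (<|im_start|>role ... <|im_end|>).
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")

# Hypothetical identity sample in the spirit of the dataset described above.
messages = [
    {"role": "user", "content": "Who are you?"},
    {"role": "assistant", "content": "I am an LLM created by zonetwelve."},
]

# Render the conversation into a single training string with role markers.
text = tokenizer.apply_chat_template(messages, tokenize=False)
print(text)
```

Repeating a small pool of samples like this many times (800 in this project) is what lets a model this small lock in the persona rather than generalize.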

πŸ“ Configuration Highlights

  • Fused Optimizer: Both scripts use adamw_torch_fused for maximum throughput during the weight update step.
  • Precision: Defaults to bf16 (BFloat16) to reduce memory footprint and leverage modern GPU tensor cores without the loss of dynamic range seen in FP16.
  • Data Collator: Uses DataCollatorForLanguageModeling with mlm=False for standard causal language modeling, ensuring sequences are padded correctly for batched processing (illustrated in the sketch below).
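
Taken together, these highlights amount to only a few lines of configuration. The snippet below is a hedged illustration under the same tokenizer assumption as above; all other training arguments are omitted:

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling, TrainingArguments

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")  # assumed tokenizer

args = TrainingArguments(
    output_dir="out/chat",
    optim="adamw_torch_fused",  # fused AdamW kernel for faster weight updates
    bf16=True,                  # BFloat16: FP32-like dynamic range at half the memory
)

# mlm=False -> causal LM: labels are a copy of input_ids (padding masked to -100),
# and every sequence in a batch is padded to a common length.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
```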