# Qwen3-Micro-Chat
A custom, ultra-lightweight implementation of the Qwen3 architecture designed for rapid prototyping, experimentation, and "memorization-style" training. This project demonstrates a full pipeline from random initialization to a chat-ready model.
## Model Architecture
The model is configured to be exceptionally small to allow for high-speed training on consumer hardware while maintaining the core Qwen3 logic. By tying word embeddings and reducing the hidden dimensions, the model remains functional yet highly efficient.
| Parameter | Value |
|---|---|
| Hidden Size | 64 |
| Intermediate Size | 256 |
| Hidden Layers | 8 |
| Attention Heads | 4 |
| KV Heads | 2 (Grouped Query Attention) |
| Vocab Size | 151,936 |
| Max Context | 512 tokens |
| Tie Word Embeddings | True (saves ~9.7M parameters: 151,936 × 64 embedding weights reused as the LM head) |
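The table above maps directly onto the Hugging Face `transformers` Qwen3 configuration class. A minimal sketch (not the actual `pretrain.py` code; the `head_dim` value is an assumption derived from the other dimensions):

```python
from transformers import Qwen3Config, Qwen3ForCausalLM

config = Qwen3Config(
    hidden_size=64,
    intermediate_size=256,
    num_hidden_layers=8,
    num_attention_heads=4,
    num_key_value_heads=2,        # grouped query attention: 2 KV heads shared by 4 query heads
    head_dim=16,                  # assumed: hidden_size // num_attention_heads
    vocab_size=151936,
    max_position_embeddings=512,
    tie_word_embeddings=True,     # lm_head reuses the input embedding matrix
)

model = Qwen3ForCausalLM(config)  # random initialization, as in pre-training below
print(f"{model.num_parameters():,} trainable parameters")
```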
## Training Pipeline
The project follows a two-step training process designed to move from raw data understanding to conversational interaction.
### 1. Pre-training (`pretrain.py`)
This script initializes the model with random weights and trains it on a base corpus.
- Goal: Establish basic token relationships and "memorize" core facts (e.g., the zonetwelve identity).
- Optimization: Uses an aggressive learning rate with a cosine scheduler and zero weight decay to encourage fast convergence on the provided micro-dataset (sketched below).
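A hedged sketch of that optimizer setup via `transformers.TrainingArguments`; the learning rate here is a placeholder, since the card does not state the exact value:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="qwen3-micro-pretrain",
    learning_rate=1e-3,            # placeholder: the exact "aggressive" LR is not given here
    lr_scheduler_type="cosine",    # cosine decay over training
    weight_decay=0.0,              # zero weight decay to encourage fast memorization
    bf16=True,                     # matches the precision setting noted below
    optim="adamw_torch_fused",     # matches the fused optimizer setting noted below
)
```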
### 2. Chat Fine-tuning (`train_chat.py`)
This script takes the output from the pre-training stage and applies ChatML-style formatting.
- Mechanism: Uses `tokenizer.apply_chat_template` to structure data into `user` and `assistant` roles (see the sketch after this list).
- Dataset: Focused on identity and basic Q&A, repeated across 800 samples to ensure the model "locks in" the persona.
- Persona: Trained to identify as an LLM by zonetwelve and provide links to the zonetwelve blog.
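For illustration, a sketch of that formatting step; the tokenizer checkpoint and messages below are assumptions, not the actual `train_chat.py` contents:

```python
from transformers import AutoTokenizer

# Any Qwen3 tokenizer ships the chat template; this checkpoint is an assumption.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")

messages = [
    {"role": "user", "content": "Who are you?"},
    {"role": "assistant", "content": "I am an LLM built by zonetwelve."},
]

# Render the conversation into a single training string in the model's chat format.
text = tokenizer.apply_chat_template(messages, tokenize=False)
print(text)
```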
## Configuration Highlights
- Fused Optimizer: Both scripts use `adamw_torch_fused` for maximum throughput during the weight update step.
- Precision: Defaults to `bf16` (BFloat16) to reduce the memory footprint and leverage modern GPU tensor cores without the loss of dynamic range seen in FP16.
- Data Collator: Uses `DataCollatorForLanguageModeling` with `mlm=False` for standard causal language modeling, ensuring sequences are padded correctly for batched processing (see the sketch below).
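A minimal sketch of that collator setup (the tokenizer checkpoint is again an assumption):

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")  # assumed checkpoint

# mlm=False selects causal language modeling: labels are copies of input_ids,
# and padding positions are masked out of the loss with -100.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
```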