# Qwen3-Micro-Chat
A custom, ultra-lightweight implementation of the Qwen3 architecture designed for rapid prototyping, experimentation, and "memorization-style" training. This project demonstrates a full pipeline from random initialization to a chat-ready model.
## Model Architecture
The model is configured to be exceptionally small to allow for high-speed training on consumer hardware while maintaining the core Qwen3 logic. By tying word embeddings and reducing the hidden dimensions, the model remains functional yet highly efficient.
| Parameter | Value |
|---|---|
| Hidden Size | 64 |
| Intermediate Size | 256 |
| Hidden Layers | 8 |
| Attention Heads | 4 |
| KV Heads | 2 (Grouped Query Attention) |
| Vocab Size | 151,936 |
| Max Context | 512 tokens |
| Tie Word Embeddings | True (saves ~9.7M parameters: 151,936 × 64 embedding weights reused as the LM head) |
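The table above maps directly onto the Hugging Face `transformers` Qwen3 configuration class. A minimal sketch (not the actual `pretrain.py` code; the `head_dim` value is an assumption derived from the other dimensions):

```python
from transformers import Qwen3Config, Qwen3ForCausalLM

config = Qwen3Config(
    hidden_size=64,
    intermediate_size=256,
    num_hidden_layers=8,
    num_attention_heads=4,
    num_key_value_heads=2,        # grouped query attention: 2 KV heads shared by 4 query heads
    head_dim=16,                  # assumed: hidden_size // num_attention_heads
    vocab_size=151936,
    max_position_embeddings=512,
    tie_word_embeddings=True,     # lm_head reuses the input embedding matrix
)

model = Qwen3ForCausalLM(config)  # random initialization, as in pre-training below
print(f"{model.num_parameters():,} trainable parameters")
```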
## Training Pipeline
The project follows a two-step training process designed to move from raw data understanding to conversational interaction.
### 1. Pre-training (`pretrain.py`)
This script initializes the model with random weights and trains it on a base corpus.
- Goal: Establish basic token relationships and "memorize" core facts (e.g., the zonetwelve identity).
- Optimization: Uses an aggressive learning rate with a cosine scheduler and zero weight decay to encourage fast convergence on the provided micro-dataset (sketched below).
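A hedged sketch of that optimizer setup via `transformers.TrainingArguments`; the learning rate here is a placeholder, since the card does not state the exact value:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="qwen3-micro-pretrain",
    learning_rate=1e-3,            # placeholder: the exact "aggressive" LR is not given here
    lr_scheduler_type="cosine",    # cosine decay over training
    weight_decay=0.0,              # zero weight decay to encourage fast memorization
    bf16=True,                     # matches the precision setting noted below
    optim="adamw_torch_fused",     # matches the fused optimizer setting noted below
)
```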
### 2. Chat Fine-tuning (`train_chat.py`)
This script takes the output from the pre-training stage and applies ChatML-style formatting.
- Mechanism: Uses `tokenizer.apply_chat_template` to structure data into `user` and `assistant` roles (see the sketch after this list).
- Dataset: Focused on identity and basic Q&A, repeated across 800 samples to ensure the model "locks in" the persona.
- Persona: Trained to identify as an LLM by zonetwelve and provide links to the zonetwelve blog.
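For illustration, a sketch of that formatting step; the tokenizer checkpoint and messages below are assumptions, not the actual `train_chat.py` contents:

```python
from transformers import AutoTokenizer

# Any Qwen3 tokenizer ships the chat template; this checkpoint is an assumption.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")

messages = [
    {"role": "user", "content": "Who are you?"},
    {"role": "assistant", "content": "I am an LLM built by zonetwelve."},
]

# Render the conversation into a single training string in the model's chat format.
text = tokenizer.apply_chat_template(messages, tokenize=False)
print(text)
```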
## Configuration Highlights
- Fused Optimizer: Both scripts use `adamw_torch_fused` for maximum throughput during the weight update step.
- Precision: Defaults to `bf16` (BFloat16) to reduce the memory footprint and leverage modern GPU tensor cores without the loss of dynamic range seen in FP16.
- Data Collator: Uses `DataCollatorForLanguageModeling` with `mlm=False` for standard causal language modeling, ensuring sequences are padded correctly for batched processing (see the sketch below).
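A minimal sketch of that collator setup (the tokenizer checkpoint is again an assumption):

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")  # assumed checkpoint

# mlm=False selects causal language modeling: labels are copies of input_ids,
# and padding positions are masked out of the loss with -100.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
```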