---
language:
- en
license: mit
pipeline_tag: text-generation
tags:
- pytorch
- transformer
- language-model
- muon-optimizer
- small-model
- llm-training
- educational
datasets:
- HuggingFaceTB/smollm-corpus
metrics:
- perplexity
- accuracy
library_name: pytorch
---

# 🫐 Train Your Own Small Language Model

A minimal toolkit for training and using small language models with the Muon optimizer.

## 🚀 Quick Start

### Option 1: Google Colab (No Setup Required)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1m9wXIkMlSVW3whSHZiOywMdUClj0amnZ?usp=sharing)

Click the badge above to run everything in your browser with free GPU access!

### Option 2: Local Setup

```bash
# Clone and setup
git clone https://github.com/vukrosic/build-and-release-your-own-llm
cd build-and-release-your-own-llm
python setup.py  # Installs requirements and creates .env file
```

## 🎯 Three Ways to Use This Project

### 1. 🚀 Quick Start - Use My Pre-trained Model

Want to try text generation immediately?

```bash
# Install dependencies
pip install -r requirements.txt

# Run inference with my pre-trained model
python inference.py
```

The script will:
- Show available checkpoints from `vukrosic/blueberry-1`
- Download the model automatically
- Let you generate text interactively

**No setup required!** The model downloads automatically.

### 2. 🏗️ Train Your Own Model

Want to train from scratch?

```bash
# Install dependencies
pip install -r requirements.txt

# Start training (takes ~20 minutes on GPU)
python train_llm.py

# Use your trained model
python inference.py
```

Your model will be saved in `checkpoints/` and you can resume training anytime.

### 3. 📤 Train and Share Your Model

Want to share your model on Hugging Face?

```bash
# 1. Copy environment template
cp .env.example .env

# 2. Edit .env file:
#    HF_REPO_NAME=your-username/your-model-name
#    HF_TOKEN=hf_your_token_here
#    PUSH_TO_HUB=true

# 3. Train (uploads automatically)
python train_llm.py
```

Get your HF token from: https://huggingface.co/settings/tokens
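Curious what the automatic upload boils down to? The sketch below shows one way the `.env`-driven flow could be wired up with `python-dotenv` and `huggingface_hub`. It is a minimal illustration, not the project's actual code: the variable names match `.env.example` above, the checkpoint path is a hypothetical example, and the real logic lives in `train_llm.py` and `upload_to_hf.py`.

```python
# Minimal sketch of an .env-driven upload flow (illustrative only --
# see train_llm.py / upload_to_hf.py for the project's real logic).
import os

from dotenv import load_dotenv       # pip install python-dotenv
from huggingface_hub import HfApi    # pip install huggingface_hub

load_dotenv()  # reads HF_REPO_NAME, HF_TOKEN, PUSH_TO_HUB from .env

repo_name = os.getenv("HF_REPO_NAME")   # e.g. your-username/your-model-name
token = os.getenv("HF_TOKEN")
push_to_hub = os.getenv("PUSH_TO_HUB", "false").lower() == "true"

if push_to_hub:
    api = HfApi(token=token)
    # Create the repo if it doesn't exist yet, then upload a checkpoint folder.
    api.create_repo(repo_id=repo_name, exist_ok=True)
    api.upload_folder(
        repo_id=repo_name,
        folder_path="checkpoints/checkpoint_step_5000",  # hypothetical checkpoint path
        commit_message="Upload trained checkpoint",
    )
```

Whichever route you take, the token needs "Write" permission (see Common Issues below).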
## 📁 Project Structure

```
├── train_llm.py          # Training script with Muon optimizer
├── inference.py          # Text generation and model loading
├── upload_to_hf.py       # Upload checkpoints to Hugging Face
├── example_usage.py      # Example workflow script
├── setup.py              # Easy setup script
├── requirements.txt      # Python dependencies
├── .env.example          # Environment variables template
└── README.md             # This file
```

## 🎯 What You Get

- **21M parameter transformer model** (384d, 6 layers, 8 heads)
- **Muon optimizer** for efficient training
- **Automatic checkpointing** every 5000 steps
- **Resume training** from any checkpoint
- **Interactive text generation**
- **Hugging Face integration** (optional)

## 📊 Expected Results

- **Training time**: ~16-20 minutes on a modern GPU
- **Final perplexity**: ~1.06
- **Model size**: ~21M parameters
- **Memory usage**: ~4-6GB GPU

## 🔧 Customization

### Change Model Size

Edit `train_llm.py`:

```python
@dataclass
class ModelConfig:
    d_model: int = 512     # Bigger model (was 384)
    n_layers: int = 8      # More layers (was 6)
    max_steps: int = 5000  # Train longer for better results (e.g. 20000)
```

### Use Your Own Data

Edit the dataset loading in `train_llm.py`:

```python
# Replace this line:
dataset = load_dataset("HuggingFaceTB/smollm-corpus", "cosmopedia-v2", split="train", streaming=True)

# With your dataset:
dataset = load_dataset("your-dataset-name", split="train", streaming=True)
```

### Adjust Training Speed

```python
batch_size: int = 16                  # Smaller = less memory
gradient_accumulation_steps: int = 8  # Increase to keep the same effective batch size
```

## 📊 Understanding the Output

### During Training

```
Training: 67%|██████▋  | 20000/30000 [12:34<06:15, 26.6it/s, loss=1.234, acc=0.876, ppl=3.4, lr=8.5e-03]
```

- **loss**: Lower is better (target: ~1.0)
- **acc**: Accuracy (target: ~98%)
- **ppl**: Perplexity (target: ~1.1)
- **lr**: Learning rate (automatically scheduled)

### During Inference

```
Prompt: The future of AI is
Generated text: The future of AI is bright and full of possibilities. Machine learning algorithms continue to evolve...
```

## 🚨 Common Issues

### "CUDA out of memory"

```python
# In train_llm.py, reduce batch size:
batch_size: int = 12  # or even 8
```

### "No checkpoints found"

Make sure you've run training first:

```bash
python train_llm.py   # Wait for it to complete
python inference.py   # Now this will work
```

### "HF upload failed"

Check your token permissions:
1. Go to https://huggingface.co/settings/tokens
2. Make sure the token has "Write" permission
3. Update your `.env` file

## 🎉 What's Next?

1. **Experiment with prompts** - Try different starting texts
2. **Adjust generation parameters** - Change temperature and top_k in `inference.py` (see the sampling sketch after this list)
3. **Train on your data** - Replace the dataset with your own text
4. **Scale up** - Increase model size for better performance
5. **Share your model** - Upload to Hugging Face for others to use
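To build intuition for what those generation parameters do before you tweak them, here is a self-contained sketch of temperature plus top-k sampling. It is illustrative only: the function name and shapes are assumptions, not the actual code in `inference.py`.

```python
import torch
import torch.nn.functional as F

def sample_next_token(logits: torch.Tensor, temperature: float = 0.8, top_k: int = 50) -> int:
    """Pick the next token id from a [vocab_size] logits vector.

    Sketch only: lower temperature sharpens the distribution,
    top_k keeps only the k most likely tokens before sampling.
    """
    logits = logits / max(temperature, 1e-8)  # temperature scaling
    if top_k > 0:
        top_values, top_indices = torch.topk(logits, k=min(top_k, logits.size(-1)))
        probs = F.softmax(top_values, dim=-1)  # renormalize over the top-k candidates
        choice = torch.multinomial(probs, num_samples=1)
        return int(top_indices[choice])
    probs = F.softmax(logits, dim=-1)
    return int(torch.multinomial(probs, num_samples=1))

# Example with a fake 10-token vocabulary:
fake_logits = torch.randn(10)
print(sample_next_token(fake_logits, temperature=0.7, top_k=5))
```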
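The checkpoint layout described in the next section stores model weights and optimizer state in `model.pt`. As a rough illustration of how resuming could look, here is a hedged sketch; the dictionary keys are assumptions, so check `train_llm.py` for the exact format it writes.

```python
import torch

def resume_from_checkpoint(model: torch.nn.Module,
                           optimizer: torch.optim.Optimizer,
                           path: str = "checkpoints/checkpoint_step_5000/model.pt") -> int:
    """Restore model weights and optimizer state from a saved checkpoint.

    Sketch only: the keys "model_state_dict", "optimizer_state_dict", and "step"
    are assumptions for illustration, not the confirmed format of model.pt.
    """
    checkpoint = torch.load(path, map_location="cpu")
    model.load_state_dict(checkpoint["model_state_dict"])          # restore weights
    optimizer.load_state_dict(checkpoint["optimizer_state_dict"])  # restore optimizer state
    return checkpoint.get("step", 0)                               # step to resume from
```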
## 📦 Checkpoint Management

### Automatic Checkpointing

The training script saves checkpoints every 5000 steps in the `checkpoints/` directory:

```
checkpoints/
├── checkpoint_step_5000/
│   ├── model.pt           # Model weights and optimizer state
│   ├── config.json        # Model configuration
│   └── tokenizer files    # Tokenizer configuration
├── checkpoint_step_10000/
└── checkpoint_step_15000/
```

### Upload to Hugging Face

Share your trained models with the community:

```bash
# Set your Hugging Face token
export HF_TOKEN="hf_your_token_here"

# List available checkpoints
python upload_to_hf.py --list

# Upload latest checkpoint
python upload_to_hf.py --repo-name username/my-awesome-model

# Upload specific checkpoint
python upload_to_hf.py --repo-name username/my-model --checkpoint checkpoints/checkpoint_step_10000

# Create private repository
python upload_to_hf.py --repo-name username/my-model --private
```

Get your token from: https://huggingface.co/settings/tokens

### Example Workflow

```bash
# Run the complete example
python example_usage.py

# Or step by step:
python train_llm.py                                 # Train model (saves checkpoints)
python upload_to_hf.py --list                       # See available checkpoints
python upload_to_hf.py --repo-name username/model   # Upload to HF
```

## 💡 Pro Tips

- **Resume training**: The script automatically detects checkpoints
- **Monitor GPU usage**: Use `nvidia-smi` to check memory usage
- **Save compute**: Use smaller models for experimentation
- **Better results**: More training steps = better model (usually)
- **Checkpoint frequency**: Adjust `save_every` in ModelConfig for different intervals
- **Share early**: Upload intermediate checkpoints to track training progress

Happy training! 🚀