---
license: mit
datasets:
  - roneneldan/TinyStories
language:
  - en
library_name: pytorch
tags:
  - text-generation-inference
  - gemma3
metrics:
  - perplexity
pipeline_tag: text-generation
---

# Gemma3 270M - TinyStories - PyTorch From-Scratch Implementation

A PyTorch implementation of Google DeepMind's Gemma3 270M model built entirely from scratch, featuring a compact transformer architecture.

## Model Overview

This is a from-scratch implementation of the Gemma3 270M architecture that demonstrates modern transformer techniques, including sliding window attention, RoPE positional encoding, and mixed precision training. The model maintains the core architectural principles of the official Gemma3 270M while making practical choices for training efficiency.

## Training Data

### Dataset

- **Source**: TinyStories dataset (~600M tokens)
- **Tokenizer**: GPT-2 tokenizer, chosen for faster data processing than the Gemma3 270M tokenizer
- **Format**: Memory-mapped binary files for efficient loading (see the batch-loading sketch under Training Procedure)

### Model Details

- This is the base model, trained solely on the TinyStories dataset for 10 hours on an A6000 GPU.
- Task: text-generation
- Language: en
- Dataset: https://huggingface.co/datasets/roneneldan/TinyStories

## Training Procedure

### Training Hyperparameters

- **learning_rate:** 1e-4
- **max_iters:** 150000
- **warmup_steps:** 1000
- **min_lr:** 5e-4
- **eval_iters:** 500
- **batch_size:** 32
- **block_size:** 128
- **gradient_accumulation_steps:** 32
- **device:** cuda
- **dtype:** bfloat16
- **ptdtype:** float32
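The data pipeline is not spelled out in this card beyond the bullets above, but the memory-mapped binary format together with the `block_size`/`batch_size` settings suggests a loader along these lines. This is a minimal sketch only: the file names (`train.bin`, `val.bin`) and the `uint16` token dtype are assumptions, not confirmed details.

```python
import numpy as np
import torch

block_size = 128   # context length, as listed above
batch_size = 32    # as listed above
device = "cuda" if torch.cuda.is_available() else "cpu"

def get_batch(split: str):
    # Re-open the memory-mapped token file on each call; "train.bin" / "val.bin"
    # and uint16 storage are illustrative assumptions.
    data = np.memmap(f"{split}.bin", dtype=np.uint16, mode="r")
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([torch.from_numpy(data[i:i + block_size].astype(np.int64)) for i in ix])
    y = torch.stack([torch.from_numpy(data[i + 1:i + 1 + block_size].astype(np.int64)) for i in ix])
    return x.to(device), y.to(device)
```

The binary files themselves would be produced once, offline, by encoding TinyStories with the GPT-2 tokenizer (`tiktoken.get_encoding("gpt2")`) and writing the resulting token IDs to disk.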
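Similarly, the `dtype: bfloat16` and `gradient_accumulation_steps: 32` settings suggest a mixed-precision training step with gradient accumulation. The sketch below is illustrative, not the actual training loop: the optimizer choice and the model's forward interface (returning logits) are assumptions rather than details stated in this card.

```python
import torch
import torch.nn.functional as F
from architecture import model_config, Gemma3Model  # repo-local module, as in the Usage snippet

device = "cuda"
model = Gemma3Model(model_config).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # optimizer choice is an assumption
grad_accum_steps = 32                                       # as listed above

def train_step():
    """One optimizer update using bfloat16 autocast and gradient accumulation."""
    optimizer.zero_grad(set_to_none=True)
    for _ in range(grad_accum_steps):
        xb, yb = get_batch("train")  # the batch loader sketched in this section
        with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
            logits = model(xb)  # assumes the forward pass returns logits of shape (B, T, vocab)
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), yb.view(-1))
        # Scale so the accumulated gradients average over the effective batch
        (loss / grad_accum_steps).backward()
    optimizer.step()
    return loss.item()
```

With bfloat16 (unlike float16) no gradient scaler is needed, which keeps the loop simple; the effective batch size is `batch_size × gradient_accumulation_steps = 1024` sequences per update.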
## Evaluation Results

Detailed training analysis and model evaluation can be found in [`results/results_interpertation.md`](results/results_interpertation.md), which includes:

- **📊 Loss Analysis**: Training and validation loss curves showing smooth convergence without overfitting
- **📝 Qualitative Evaluation**: Story generation examples demonstrating coherent narrative abilities
- **📈 Training Dynamics**: Gradient norm analysis and learning rate schedule evaluation
- **🎯 Model Performance**: Final perplexity metrics and generation quality assessment

**Key Results:**

- Final train loss: 1.8 (perplexity ~6.0)
- Final validation loss: 2.0 (perplexity ~7.4)
- Train and validation losses track closely, with no overfitting observed
- Coherent story generation with proper grammar and age-appropriate content

## Usage

**Code Snippet**

```python
# Import necessary libraries
import torch
import tiktoken

from architecture import model_config, Gemma3Model

# Tokenizer
enc = tiktoken.get_encoding("gpt2")

# Load the model
model_config["dtype"] = torch.bfloat16
model = Gemma3Model(model_config)  # re-create the model with the same config
device = "cuda" if torch.cuda.is_available() else "cpu"
best_model_params_path = "best_model_params.pt"
model.load_state_dict(torch.load(best_model_params_path, map_location=torch.device(device)))  # load the best checkpoint
model.eval()

# Inference
sentence = "Dad was telling the kids an adventure tale about a pirate ship"
context = torch.tensor(enc.encode_ordinary(sentence)).unsqueeze(dim=0)
y = model.generate(context, 200)  # generate 200 new tokens
print(enc.decode(y.squeeze().tolist()))
```

**Result**

```text
Dad was telling the kids an adventure tale about a pirate ship coming to the shore. Suddenly, Dad showed John many pictures and showed him what to do. She chose a film for them to watch. John was excited. He had never seen one before and was intrigued. When they arrived, Dad handed John bookshelf safely. "What have you got, John?", asked Dad. John eagerly answered back to Dad.
Dad explained that the businessman was a dinosaur that had been guarded by the sea. John thought about this for a reason and knew he was too happy with this movie. He said to Dad, "Life is a really fun experience". His Dad nodded and said, "Yes, you can accept anything special. It was a very comfortable motorcycle."Once upon a time, there was a nice friendly little boy named John. Every day he would have endless their conversation and encouragement. He was so full of joy and excitement taking action. Today, John was playing in the backyard when
```

## Limitations and Biases

- This model is intended only for understanding the architecture of a transformer-based model built from scratch and for building intuition.
- Inference is slow because KV caching is not implemented.
- TinyStories is synthetic data generated by GPT-3.5/4:
  - May have inherited biases or patterns from the generating model
  - Limited diversity compared to real human-written content
  - Repetitive narrative structures typical of children's literature
- 270M parameters is relatively small by modern standards:
  - Limited reasoning capabilities compared to larger models

## Training Infrastructure

For a complete guide covering the entire process, from data tokenization to inference, please refer to the [GitHub repository](https://github.com/di37/gemma3-270M-tinystories-pytorch).

## Last Update

2025-09-06

## Citation

```bibtex
@misc{gemma3-270m-pytorch,
  title={Gemma3 270M - TinyStories - PyTorch From-Scratch Implementation},
  author={Doula Isham Rashik Hasan},
  year={2025},
  howpublished={\url{https://github.com/di37/gemma3-270M-tinystories-pytorch}},
  note={Implementation of Google DeepMind's Gemma3 270M from scratch, pre-trained on TinyStories}
}
```