🔬 Why Small Turkish GPTs Hallucinate Facts
An experimental 85M model trained from scratch
tl;dr: This model demonstrates a critical lesson in language modeling: loss ↓ ≠ factual accuracy ↑. Despite achieving PPL 42.7, it confidently generates wrong facts. This repo documents why.
🎯 The Core Problem
After 9,000 training steps on 500K Turkish documents:
| Metric | Start | End | Improvement |
|---|---|---|---|
| Validation Loss | 6.0 | 3.75 | 37% better ✅ |
| Validation PPL | 397 | 42.7 | 90% better ✅ |
| Factual Accuracy | ❌ | ❌ | Still inconsistent |
📉 Loss vs Factuality Divergence
Training Progression for prompt "Türkiye'nin başkenti"
| Step | Val Loss | Val PPL | Generated Capital | Correct? |
|---|---|---|---|---|
| 1000 | 5.98 | 397.3 | Ankara | ✅ |
| 3000 | 3.94 | 51.7 | Ankara | ✅ |
| 5000 | 4.02 | 56.2 | Random city | ❌ |
| 6500 | 3.90 | 49.6 | Bolu | ❌ |
| 7500 | 3.83 | 46.1 | Konya | ❌ |
| 8000 | 3.80 | 44.8 | Bursa | ❌ |
| 9000 | 3.75 | 42.7 | Ankara (sometimes) | ⚠️ |
Key observation: Loss steadily decreases, but capital city prediction remains unstable.
🧪 Concrete Examples
Prompt: "Türkiye'nin başkenti"
Step 6500 output:
"Türkiye'nin başkenti Bolu'da 24 Mayıs 2012 tarihinde toplanan ve 23. dönem TBMM'de kabul edilen 6360 sayılı kanun..."
- ❌ Wrong: Bolu is not the capital
- ✅ Right: Date format, legal language, formal tone, grammar
Step 7500 output:
"Türkiye'nin başkenti Konya, Türkiye'nin en büyük kenti Konya'dan Avrupa'ya yayılan ilk yatırımını 25-30 Mayıs 2020..."
- ❌ Wrong: Konya is not the capital
- ✅ Right: Geographic context, date ranges, economic terminology
Step 9000 output:
"Türkiye'nin başkenti Ankara'da düzenlenen Dünya Kadınlar Basketbol Şampiyonası'nda..."
- ✅ Finally correct!
🤔 Why This Happens
What the Model Actually Learns
Cross-entropy loss optimizes for: "What token is likely in this context?"
In the training data distribution:
- "Türkiye'nin başkenti Ankara..." appears in roughly 60% of the relevant patterns
- Phrases pairing "başkent" with Bursa/Konya/İzmir account for the remaining ~40% (drawn from various other contexts)
The model learns distributional probabilities, not factual truth.
From the model's perspective:
- Sometimes generate "Ankara" (most frequent)
- Sometimes generate other cities (contextually plausible)
- Either choice reduces the loss, because both continuations appear in the training data (a toy calculation below makes this concrete)
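A toy calculation shows why (pure Python; the 60/40 split is the illustrative estimate from above, not a measured statistic): the cross-entropy-minimizing model mirrors the training distribution instead of committing to the single true answer.

import math

# Hypothetical next-token distribution in the training data after the
# context "Türkiye'nin başkenti ..." (numbers are illustrative only).
data_dist = {"Ankara": 0.60, "Konya": 0.15, "Bursa": 0.15, "Bolu": 0.10}

def expected_cross_entropy(model_dist, data_dist):
    # Average -log p_model(token) over tokens drawn from the data distribution.
    return -sum(p * math.log(model_dist[tok]) for tok, p in data_dist.items())

# A "truthful" model that almost always says Ankara is punished every time
# the training text happens to continue with another city.
truthful = {"Ankara": 0.97, "Konya": 0.01, "Bursa": 0.01, "Bolu": 0.01}

# A model that simply mirrors the data distribution achieves lower loss,
# yet samples a wrong "capital" roughly 40% of the time.
mirror = dict(data_dist)

print(expected_cross_entropy(truthful, data_dist))  # ≈ 1.86
print(expected_cross_entropy(mirror, data_dist))    # ≈ 1.11 (lower loss, less factual)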
Why Loss Still Decreases
Even with wrong facts, the model improves at:
- ✅ Grammar (Turkish morphology)
- ✅ Syntax (sentence structure)
- ✅ Style (formal/informal tone matching)
- ✅ Context coherence (topic consistency)
- ✅ Pattern matching (Wikipedia-style text)
Loss measures linguistic fluency, NOT factual correctness.
📊 What 85M Parameters Can vs Cannot Do
✅ Successfully Learned
- Linguistic patterns: Grammar, morphology, syntax
- Contextual coherence: Topic-appropriate vocabulary
- Format mimicry: News articles, formal documents
- Statistical associations: Common word pairings
❌ Failed to Learn
- Factual grounding: "Ankara = capital" as deterministic rule
- Logical consistency: Same prompt should give same fact
- Knowledge retrieval: Reliable information recall
- Fact vs pattern: Distinguishing truth from plausibility
Model Size Comparison
| Model | Parameters | Factual Reliability |
|---|---|---|
| Kayra (this) | 85M | Poor - hallucinations common |
| GPT-2 Small | 124M | Poor - similar issues |
| GPT-2 Medium | 355M | Better but still unreliable |
| GPT-3 | 175B | Good consistency |
| GPT-4 | ~1.7T (unconfirmed estimate) | Reliable (with RLHF alignment) |
Conclusion: 85M learns language patterns, not a knowledge base.
🔬 Technical Details
Architecture
- Type: Transformer Decoder (GPT-style)
- Layers: 10
- Hidden size: 640
- Attention heads: 10
- FFN size: 2560
- Vocabulary: 32,000 BPE tokens
- Context: 512 tokens
- Total: ~85M parameters
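A rough back-of-envelope check of the parameter count (a sketch only; the exact total depends on implementation details such as weight tying, biases, and the positional-embedding scheme, which are not specified here, so the quoted ~85M plausibly lands between the two bounds below):

# Rough parameter estimate for the configuration above (norms/biases ignored).
vocab, d_model, n_layers, d_ffn, ctx = 32_000, 640, 10, 2_560, 512

token_emb = vocab * d_model            # ~20.5M
pos_emb   = ctx * d_model              # ~0.3M, assuming learned positions
attn      = 4 * d_model * d_model      # Q, K, V and output projections
ffn       = 2 * d_model * d_ffn        # up- and down-projections
per_layer = attn + ffn                 # ~4.9M per transformer block

tied   = token_emb + pos_emb + n_layers * per_layer   # ~70M with a tied LM head
untied = tied + vocab * d_model                       # ~90M with a separate LM head
print(f"tied ≈ {tied/1e6:.1f}M, untied ≈ {untied/1e6:.1f}M")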
Training Data
- Wikipedia TR: 170K articles
- mC4 Turkish: 330K web documents
- Total: 500K deduplicated documents
- Deduplication: MinHash LSH (85% threshold)
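For illustration, a minimal deduplication sketch using the datasketch library with the 85% similarity threshold mentioned above; the character-shingle size, num_perm, and the `corpus` iterable are assumptions, not the exact pipeline used here.

from datasketch import MinHash, MinHashLSH

def minhash_signature(text, num_perm=128, shingle=5):
    # Build a MinHash signature from overlapping character shingles.
    sig = MinHash(num_perm=num_perm)
    for i in range(max(len(text) - shingle + 1, 1)):
        sig.update(text[i:i + shingle].encode("utf-8"))
    return sig

lsh = MinHashLSH(threshold=0.85, num_perm=128)   # ~85% Jaccard similarity
kept = []
for doc_id, text in enumerate(corpus):           # corpus: any iterable of raw documents
    sig = minhash_signature(text)
    if not lsh.query(sig):                       # no near-duplicate indexed yet
        lsh.insert(str(doc_id), sig)
        kept.append(text)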
Training Setup
- Effective batch: 64 (4 × 16 gradient accumulation)
- Learning rate: 1e-4 → 3e-4 (cosine with 2K warmup)
- Optimizer: AdamW (β1=0.9, β2=0.95)
- Hardware: NVIDIA T4 GPU (16GB)
- Time: ~9 hours
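A minimal sketch of how the setup above could be wired together in PyTorch with the Transformers scheduler helper; `model` and `loader` are assumed to exist, the 9,000 total steps come from the progression table, and the helper warms up linearly from 0 rather than from the reported 1e-4 floor, so treat it as an approximation.

import torch
from transformers import get_cosine_schedule_with_warmup

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, betas=(0.9, 0.95))
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=2_000,    # 2K warmup steps
    num_training_steps=9_000,  # assumed total, matching the table above
)

accum_steps = 16               # micro-batch 4 × 16 = effective batch 64
for step, batch in enumerate(loader):
    loss = model(**batch).loss / accum_steps   # assumes batches carry labels
    loss.backward()
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()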
📈 Evaluation Summary
Fluency: ✅ Good
| Metric | Score |
|---|---|
| Grammatical Turkish | 95%+ |
| Natural sentence flow | 90%+ |
| Coherent paragraphs | 85%+ |
Factuality: ❌ Poor
| Metric | Score |
|---|---|
| Correct capital city | ~50% (random) |
| Correct historical dates | ~40% |
| Consistent facts across runs | ~30% |
💡 Key Learnings
1. Pretraining ≠ Knowledge Encoding (at this scale)
85M parameters learn how to speak Turkish, not what is true about Turkey.
2. Solutions Require Additional Steps
- Option A: Bigger model (1B+) - more parameters improve fact retention, but instruction tuning is still needed
- Option B: Instruction tuning - explicit "correct answer" supervision with contrastive examples
- Option C: Retrieval augmentation (RAG) - an external knowledge base for fact verification
3. Validation Loss is Misleading
Low perplexity ≠ factual correctness. Always test manually (a minimal sketch follows this list):
- Same prompt → consistent facts?
- Known facts → correct retrieval?
- Hallucination rate → human evaluation
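Here is a minimal sketch of the first check; it assumes `model` and `tokenizer` are loaded as in the Usage section below, and the prompt and expected answer are just the example used throughout this card.

prompt = "Türkiye'nin başkenti"
expected = "Ankara"
n_runs = 20

inputs = tokenizer(prompt, return_tensors="pt")
hits = 0
for _ in range(n_runs):
    out = model.generate(
        inputs.input_ids,
        max_new_tokens=30,
        do_sample=True,
        temperature=0.8,
        top_k=50,
    )
    text = tokenizer.decode(out[0], skip_special_tokens=True)
    hits += expected in text   # count generations mentioning the correct capital

print(f"{hits}/{n_runs} generations mention '{expected}'")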
🎯 Appropriate Use Cases
✅ Recommended
- Research on Turkish NLP limitations
- Pretraining baseline comparisons
- Hallucination pattern studies
- Educational demonstrations
- Understanding LLM failure modes
❌ Not Recommended
- Production applications
- Factual question answering
- Information retrieval systems
- Educational content generation
- Any task requiring accuracy
🚀 Future: Kayra-v2
Planned improvements:
- Larger model: 350M-750M parameters
- Better tokenizer: NFC Unicode normalization
- Instruction tuning: 10K QA pairs with verified answers
- Alignment: RLHF or DPO for factual accuracy
- Evaluation: Proper fact-checking benchmarks
🔧 Usage
⚠️ Requires trust_remote_code=True (custom architecture)
Load the model with:
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained(
    "sixfingerdev/kayra-1-exp",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("sixfingerdev/kayra-1-exp")
Generate with repetition penalty to reduce loops:
prompt = "Türkiye'nin başkenti"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    inputs.input_ids,
    max_new_tokens=100,
    temperature=0.8,
    top_k=50,
    repetition_penalty=1.2,
    do_sample=True
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Expected behavior: Fluent Turkish, possibly wrong facts.
📚 Citation
@misc{kayra2024hallucination,
  title={Why Small Turkish GPTs Hallucinate Facts: An Experimental 85M Model},
  author={sixfingerdev},
  year={2024},
  publisher={HuggingFace},
  howpublished={\url{https://huggingface.co/sixfingerdev/kayra-1-exp}},
  note={Research on loss-factuality divergence in low-resource language models}
}
🙏 Acknowledgments
- Inspiration: EleutherAI's research on small-model limitations
- Data: Wikimedia Foundation, Common Crawl (mC4)
- Framework: PyTorch, HuggingFace Transformers
📜 License
MIT License - Use freely for research and education.
Disclaimer: This model is intentionally shared with its flaws documented. It serves as a learning resource demonstrating why small LMs hallucinate, not as a production tool.
Kayra-1-exp - Teaching us what 85M parameters cannot do 🔬
Discussion: Found interesting hallucination patterns? Share your findings in the community discussions tab. Let's learn together why small LMs hallucinate. 🇹🇷
🌙 Kayra-1-exp
Kayra - the first experimental GPT model trained from scratch on Turkish.
📊 Model Details
- Model type: Decoder-only Transformer (GPT-style)
- Parameters: ~85 million
- Validation PPL: 42.7
- Validation Loss: 3.75
- Language: Turkish only
- License: MIT
🏗️ Architecture
- Layers: 10
- Hidden size: 640
- Attention heads: 10
- FFN size: 2560
- Vocabulary: 32,000
- Context length: 512
📚 Training Data
- Wikipedia TR: ~170K articles
- mC4 Turkish: ~330K documents
- Total: ~500K deduplicated documents (MinHash LSH)
🚀 Usage
Example code:
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained(
    "sixfingerdev/kayra-1-exp",
    trust_remote_code=True  # ← IMPORTANT!
)
tokenizer = AutoTokenizer.from_pretrained("sixfingerdev/kayra-1-exp")
prompt = "Türkiye'nin başkenti"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    inputs.input_ids,
    max_new_tokens=100,
    temperature=0.2,
    top_k=50,
    do_sample=True
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
⚠️ Limitations (Experimental)
This is an experimental prototype:
- ❌ Very rare Unicode artifacts occur (NFD normalization)
- ❌ It can sometimes generate incorrect information
- ❌ Not recommended for production use
Example:
- "stadyumu" → "stad yumu" (fragmented by Unicode handling)
🔮 Future (Kayra-1-stable)
In the corrected version:
- ✅ NFC Unicode normalization
- ✅ Instruction fine-tuning
- ✅ Production-ready
📈 Training Details
- Optimizer: AdamW (lr: 1e-4 → 3e-4, warmup: 2000 steps)
- Batch size: 4 × 16 (gradient accumulation)
- Precision: Mixed FP16
- Hardware: Tesla T4 GPU
- Training time: ~9 hours
📜 License
MIT License - free for commercial and academic use.
🙏 Acknowledgments
- Data: Wikimedia, Common Crawl (mC4)
- Inspiration: GPT-1, Kumru
Kayra - "Türkçe'yi Yaratan Zeka" (The Intelligence That Creates Turkish) 🌙
Model: sixfingerdev/kayra-1-exp