🔬 Why Small Turkish GPTs Hallucinate Facts

An experimental 85M model trained from scratch

tl;dr: This model demonstrates a critical lesson in language modeling: loss ↓ ≠ factual accuracy ↑. Despite achieving PPL 42.7, it confidently generates wrong facts. This repo documents why.


🎯 The Core Problem

After ~9,000 training steps on 500K Turkish documents:

| Metric | Start | End | Improvement |
|---|---|---|---|
| Validation Loss | 6.0 | 3.75 | 37% better ✅ |
| Validation PPL | 397 | 42.7 | 90% better ✅ |
| Factual Accuracy | n/a | n/a | Still inconsistent |
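
Note that the two improving metrics are not independent signals: perplexity is simply the exponential of the mean cross-entropy loss, so both can improve in lockstep while facts stay wrong. A quick check of the reported numbers:

    import math

    # PPL = exp(mean cross-entropy loss), so the two rows above are the same
    # measurement on two scales (small rounding differences aside).
    print(math.exp(6.00))  # ≈ 403  (reported starting PPL: 397)
    print(math.exp(3.75))  # ≈ 42.5 (reported final PPL: 42.7)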

📉 Loss vs Factuality Divergence

Training Progression for prompt "Türkiye'nin başkenti"

| Step | Val Loss | Val PPL | Generated Capital | Correct? |
|---|---|---|---|---|
| 1000 | 5.98 | 397.3 | Ankara | ✅ |
| 3000 | 3.94 | 51.7 | Ankara | ✅ |
| 5000 | 4.02 | 56.2 | Random city | ❌ |
| 6500 | 3.90 | 49.6 | Bolu | ❌ |
| 7500 | 3.83 | 46.1 | Konya | ❌ |
| 8000 | 3.80 | 44.8 | Bursa | ❌ |
| 9000 | 3.75 | 42.7 | Ankara (sometimes) | ⚠️ |

Key observation: Loss steadily decreases, but capital city prediction remains unstable.


🧪 Concrete Examples

Prompt: "Türkiye'nin başkenti"

Step 6500 output:

"Türkiye'nin başkenti Bolu'da 24 Mayıs 2012 tarihinde toplanan ve 23. dönem TBMM'de kabul edilen 6360 sayılı kanun..."

  • ❌ Wrong: Bolu is not the capital
  • ✅ Right: Date format, legal language, formal tone, grammar

Step 7500 output:

"Türkiye'nin başkenti Konya, Türkiye'nin en büyük kenti Konya'dan Avrupa'ya yayılan ilk yatırımını 25-30 Mayıs 2020..."

  • ❌ Wrong: Konya is not the capital
  • ✅ Right: Geographic context, date ranges, economic terminology

Step 9000 output:

"Türkiye'nin başkenti Ankara'da düzenlenen Dünya Kadınlar Basketbol Şampiyonası'nda..."

  • ✅ Finally correct!

🤔 Why This Happens

What the Model Actually Learns

Cross-entropy loss optimizes for: "What token is likely in this context?"

In training data distribution:

  • "Türkiye'nin başkenti Ankara..." appears ~60% of patterns
  • "Başkent Bursa/Konya/İzmir..." appears ~40% (from various contexts)

The model learns distributional probabilities, not factual truth.

From the model's perspective:

  • Sometimes generate "Ankara" (most frequent)
  • Sometimes generate other cities (contextually plausible)
  • Both reduce loss equally if they appear in training data
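
To make this concrete, here is a minimal scoring sketch, assuming the model and tokenizer are loaded as in the Usage section below and expose the standard Hugging Face causal-LM interface. It measures the cross-entropy of two continuations; a frequent-but-wrong city can score almost as well as the correct one:

    import torch
    import torch.nn.functional as F

    def continuation_loss(prompt, continuation):
        """Mean cross-entropy of the continuation tokens given the prompt."""
        prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
        full_ids = tokenizer(prompt + continuation, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(full_ids).logits
        start = prompt_ids.shape[1]
        # logits at position t predict the token at position t+1
        return F.cross_entropy(logits[0, start - 1:-1], full_ids[0, start:]).item()

    print(continuation_loss("Türkiye'nin başkenti", " Ankara"))  # factually correct
    print(continuation_loss("Türkiye'nin başkenti", " Konya"))   # wrong, but seen in training
    # If both continuations occur in similar training contexts, both losses are low:
    # the objective rewards plausibility, never truth.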

Why Loss Still Decreases

Even with wrong facts, the model improves at:

  • ✅ Grammar (Turkish morphology)
  • ✅ Syntax (sentence structure)
  • ✅ Style (formal/informal tone matching)
  • ✅ Context coherence (topic consistency)
  • ✅ Pattern matching (Wikipedia-style text)

Loss measures linguistic fluency, NOT factual correctness.


📊 What 85M Parameters Can vs Cannot Do

✅ Successfully Learned

  • Linguistic patterns: Grammar, morphology, syntax
  • Contextual coherence: Topic-appropriate vocabulary
  • Format mimicry: News articles, formal documents
  • Statistical associations: Common word pairings

❌ Failed to Learn

  • Factual grounding: "Ankara = capital" as deterministic rule
  • Logical consistency: Same prompt should give same fact
  • Knowledge retrieval: Reliable information recall
  • Fact vs pattern: Distinguishing truth from plausibility

Model Size Comparison

| Model | Parameters | Factual Reliability |
|---|---|---|
| Kayra (this) | 85M | Poor - hallucinations common |
| GPT-2 Small | 124M | Poor - similar issues |
| GPT-2 Medium | 355M | Better, but still unreliable |
| GPT-3 | 175B | Good consistency |
| GPT-4 | ~1.7T (+ RLHF) | Reliable |

Conclusion: 85M learns language patterns, not a knowledge base.


🔬 Technical Details

Architecture

  • Type: Transformer Decoder (GPT-style)
  • Layers: 10
  • Hidden size: 640
  • Attention heads: 10
  • FFN size: 2560
  • Vocabulary: 32,000 BPE tokens
  • Context: 512 tokens
  • Total: ~85M parameters
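
For orientation only, here are the same hyperparameters expressed with a stock GPT-2 config. The repo actually ships a custom architecture loaded via trust_remote_code, so the field names below are just the standard GPT-2 ones:

    from transformers import GPT2Config

    # Approximate stand-in for the custom Kayra architecture, for sizing intuition only.
    config = GPT2Config(
        vocab_size=32_000,
        n_positions=512,   # context length
        n_embd=640,        # hidden size
        n_layer=10,
        n_head=10,
        n_inner=2560,      # FFN size
    )
    # Rough budget: token embeddings are 32,000 × 640 ≈ 20.5M parameters, and each
    # of the 10 blocks adds ≈ 4·640² (attention) + 2·640·2560 (FFN) ≈ 4.9M more.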

Training Data

  • Wikipedia TR: 170K articles
  • mC4 Turkish: 330K web documents
  • Total: 500K deduplicated documents
  • Deduplication: MinHash LSH (85% threshold)
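
The exact deduplication pipeline is not published here, but a minimal sketch with the datasketch library at the stated 85% similarity threshold would look roughly like this (`corpus` is a placeholder for the raw document stream):

    from datasketch import MinHash, MinHashLSH

    def doc_minhash(text, num_perm=128):
        m = MinHash(num_perm=num_perm)
        for token in text.lower().split():
            m.update(token.encode("utf-8"))
        return m

    lsh = MinHashLSH(threshold=0.85, num_perm=128)  # 85% Jaccard similarity threshold
    kept = []
    for i, doc in enumerate(corpus):       # `corpus`: iterable of raw documents (placeholder)
        m = doc_minhash(doc)
        if not lsh.query(m):               # no near-duplicate indexed yet
            lsh.insert(f"doc-{i}", m)
            kept.append(doc)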

Training Setup

  • Effective batch: 64 (4 × 16 gradient accumulation)
  • Learning rate: 1e-4 → 3e-4 (cosine with 2K warmup)
  • Optimizer: AdamW (β1=0.9, β2=0.95)
  • Hardware: NVIDIA T4 GPU (16GB)
  • Time: ~9 hours
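
The training script itself is not part of this repo, so the following is only a skeleton consistent with the numbers above (micro-batch 4 × 16 accumulation steps = effective 64, AdamW with β1=0.9, β2=0.95, cosine schedule with 2,000 warmup steps); `train_loader` is a placeholder:

    import torch
    from transformers import get_cosine_schedule_with_warmup

    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, betas=(0.9, 0.95))
    scheduler = get_cosine_schedule_with_warmup(
        optimizer, num_warmup_steps=2_000, num_training_steps=9_000
    )
    accum_steps = 16  # micro-batch of 4 × 16 accumulation steps = effective batch 64

    for step, batch in enumerate(train_loader):   # `train_loader`: placeholder DataLoader
        loss = model(**batch).loss / accum_steps  # batch includes labels for LM loss
        loss.backward()
        if (step + 1) % accum_steps == 0:
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()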

📈 Evaluation Summary

Fluency: ✅ Good

| Metric | Score |
|---|---|
| Grammatical Turkish | 95%+ |
| Natural sentence flow | 90%+ |
| Coherent paragraphs | 85%+ |

Factuality: ❌ Poor

| Metric | Score |
|---|---|
| Correct capital city | ~50% (random) |
| Correct historical dates | ~40% |
| Consistent facts across runs | ~30% |

💡 Key Learnings

1. Pretraining ≠ Knowledge Encoding (at this scale)

85M parameters learn how to speak Turkish, not what is true about Turkey.

2. Solutions Require Additional Steps

  • Option A: Bigger model (1B+). More parameters retain facts better, but instruction tuning is still needed.
  • Option B: Instruction tuning. Explicit "correct answer" supervision with contrastive examples.
  • Option C: Retrieval augmentation (RAG). An external knowledge base supplies facts for verification (toy sketch below).
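
As a toy illustration of Option C, the retrieved fact is prepended to the prompt so generation is conditioned on it rather than on the model's unreliable parametric memory. The dictionary below stands in for a real retriever:

    # Toy RAG-style prompting: `facts` stands in for an external knowledge base / retriever.
    facts = {"başkent": "Türkiye'nin başkenti Ankara'dır."}

    def rag_prompt(question, key):
        # "Bilgi" = fact, "Soru" = question, "Cevap" = answer.
        context = facts.get(key, "")  # a real system would embed the question and search
        return f"Bilgi: {context}\nSoru: {question}\nCevap:"

    print(rag_prompt("Türkiye'nin başkenti neresidir?", "başkent"))
    # The retrieved sentence carries the fact; the model only has to paraphrase it.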

3. Validation Loss is Misleading

Low perplexity ≠ factual correctness. Always manually test:

  • Same prompt → consistent facts?
  • Known facts → correct retrieval?
  • Hallucination rate → human evaluation
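
A small consistency probe along those lines (again assuming the model and tokenizer from the Usage section): sample the same factual prompt repeatedly and count how often the known answer appears.

    prompt = "Türkiye'nin başkenti"
    inputs = tokenizer(prompt, return_tensors="pt")

    runs, hits = 20, 0
    for _ in range(runs):
        out = model.generate(
            inputs.input_ids,
            max_new_tokens=20,
            do_sample=True,
            temperature=0.8,
            top_k=50,
        )
        hits += "Ankara" in tokenizer.decode(out[0], skip_special_tokens=True)

    print(f"'Ankara' appeared in {hits}/{runs} samples")  # ≈ 50% for this checkpoint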

🎯 Appropriate Use Cases

✅ Recommended

  • Research on Turkish NLP limitations
  • Pretraining baseline comparisons
  • Hallucination pattern studies
  • Educational demonstrations
  • Understanding LLM failure modes

❌ Not Recommended

  • Production applications
  • Factual question answering
  • Information retrieval systems
  • Educational content generation
  • Any task requiring accuracy

🚀 Future: Kayra-v2

Planned improvements:

  • Larger model: 350M-750M parameters
  • Better tokenizer: NFC Unicode normalization
  • Instruction tuning: 10K QA pairs with verified answers
  • Alignment: RLHF or DPO for factual accuracy
  • Evaluation: Proper fact-checking benchmarks

🔧 Usage

⚠️ Requires trust_remote_code=True (custom architecture)

Load the model with:

    from transformers import AutoModelForCausalLM, AutoTokenizer
    
    model = AutoModelForCausalLM.from_pretrained(
        "sixfingerdev/kayra-1-exp",
        trust_remote_code=True
    )
    tokenizer = AutoTokenizer.from_pretrained("sixfingerdev/kayra-1-exp")

Generate with repetition penalty to reduce loops:

    inputs = tokenizer("Türkiye'nin başkenti", return_tensors="pt")
    outputs = model.generate(
        inputs.input_ids,
        max_new_tokens=100,
        temperature=0.8,
        top_k=50,
        repetition_penalty=1.2,
        do_sample=True
    )
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Expected behavior: Fluent Turkish, possibly wrong facts.


📚 Citation

    @misc{kayra2024hallucination,
      title={Why Small Turkish GPTs Hallucinate Facts: An Experimental 85M Model},
      author={sixfingerdev},
      year={2024},
      publisher={HuggingFace},
      howpublished={\url{https://huggingface.co/sixfingerdev/kayra-1-exp}},
      note={Research on loss-factuality divergence in low-resource language models}
    }

🙏 Acknowledgments

  • Inspiration: EleutherAI's research on small model limitations
  • Data: Wikimedia Foundation, Common Crawl (mC4)
  • Framework: PyTorch, HuggingFace Transformers

📜 License

MIT License - Use freely for research and education.

Disclaimer: This model is intentionally shared with its flaws documented. It serves as a learning resource demonstrating why small LMs hallucinate, not as a production tool.


Kayra-1-exp - Teaching us what 85M parameters cannot do 🔬


Discussion: Found interesting hallucination patterns? Share your findings in the community discussions tab. Let's learn together why small LMs hallucinate. 🇹🇷

🌙 Kayra-1-exp

Kayra - The first experimental GPT model trained from scratch entirely in Turkish.

📊 Model Details

  • Model type: Decoder-only Transformer (GPT-style)
  • Parameters: ~85 million
  • Validation PPL: 42.7
  • Validation Loss: 3.75
  • Language: Turkish only
  • License: MIT

🏗️ Architecture

  • Layers: 10
  • Hidden size: 640
  • Attention heads: 10
  • FFN size: 2560
  • Vocabulary: 32,000
  • Context length: 512

📚 Training Data

  • Wikipedia TR: ~170K articles
  • mC4 Turkish: ~330K documents
  • Total: ~500K deduplicated documents (MinHash LSH)

🚀 Usage

Example code:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    model = AutoModelForCausalLM.from_pretrained(
        "sixfingerdev/kayra-1-exp",
        trust_remote_code=True  # ← IMPORTANT!
    )
    tokenizer = AutoTokenizer.from_pretrained("sixfingerdev/kayra-1-exp")

    prompt = "Türkiye'nin başkenti"
    inputs = tokenizer(prompt, return_tensors="pt")

    outputs = model.generate(
        inputs.input_ids,
        max_new_tokens=100,
        temperature=0.2,
        top_k=50,
        do_sample=True
    )

    print(tokenizer.decode(outputs[0], skip_special_tokens=True))

⚠️ Limitations (Experimental)

This is an experimental prototype:

  • ❌ Very rare Unicode glitches occur (NFD normalization)
  • ❌ May generate incorrect information
  • ❌ Not recommended for production use

Examples:

  • "stadyumu" → "stad yumu" (Unicode fragmentation)

🔮 Future (Kayra-1-stable)

In the corrected version:

  • ✅ NFC Unicode normalization
  • ✅ Instruction fine-tuning
  • ✅ Production-ready
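
As a hypothetical illustration of the NFC item above: Turkish letters like ü and ğ can be stored either as a single codepoint (NFC) or as a base letter plus a combining mark (NFD). A BPE tokenizer trained on mixed forms segments the same word inconsistently, which is the suspected cause of fragments like "stad yumu"; normalizing every document before tokenizer training removes the ambiguity:

    import unicodedata

    word = "Türkiye"
    nfd = unicodedata.normalize("NFD", word)   # decomposed: ü → u + combining diaeresis
    nfc = unicodedata.normalize("NFC", nfd)    # composed single codepoints
    print(len(word), len(nfd), len(nfc))       # 7, 8, 7

    # Planned v2 preprocessing: normalize to NFC before BPE training and inference.
    def preprocess(text):
        return unicodedata.normalize("NFC", text)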

📈 Training Details

  • Optimizer: AdamW (lr: 1e-4 → 3e-4, warmup: 2000 steps)
  • Batch size: 4 × 16 (gradient accumulation)
  • Precision: Mixed FP16
  • Hardware: Tesla T4 GPU
  • Training time: ~9 hours

📜 License

MIT License - Free for commercial and academic use.

🙏 Acknowledgments

  • Data: Wikimedia, Common Crawl (mC4)
  • Inspiration: GPT-1, Kumru

Kayra - The Intelligence That Creates Turkish 🌙

Model: sixfingerdev/kayra-1-exp
