🔬 Why Small Turkish GPTs Hallucinate Facts

An experimental 85M model trained from scratch

tl;dr: This model demonstrates a critical lesson in language modeling: loss ↓ ≠ factual accuracy ↑. Despite achieving PPL 42.7, it confidently generates wrong facts. This repo documents why.


🎯 The Core Problem

After ~9,000 training steps on 500K Turkish documents:

| Metric | Start | End | Improvement |
|---|---|---|---|
| Validation Loss | 6.0 | 3.75 | 37% better ✅ |
| Validation PPL | 397 | 42.7 | 90% better ✅ |
| Factual Accuracy | n/a | n/a | Still inconsistent |
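
Note that the two improving metrics are not independent signals: perplexity is simply the exponential of the mean cross-entropy loss, so both can improve in lockstep while facts stay wrong. A quick check of the reported numbers:

    import math

    # PPL = exp(mean cross-entropy loss), so the two rows above are the same
    # measurement on two scales (small rounding differences aside).
    print(math.exp(6.00))  # ≈ 403  (reported starting PPL: 397)
    print(math.exp(3.75))  # ≈ 42.5 (reported final PPL: 42.7)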

📉 Loss vs Factuality Divergence

Training Progression for prompt "Türkiye'nin başkenti"

| Step | Val Loss | Val PPL | Generated Capital | Correct? |
|---|---|---|---|---|
| 1000 | 5.98 | 397.3 | Ankara | ✅ |
| 3000 | 3.94 | 51.7 | Ankara | ✅ |
| 5000 | 4.02 | 56.2 | Random city | ❌ |
| 6500 | 3.90 | 49.6 | Bolu | ❌ |
| 7500 | 3.83 | 46.1 | Konya | ❌ |
| 8000 | 3.80 | 44.8 | Bursa | ❌ |
| 9000 | 3.75 | 42.7 | Ankara (sometimes) | ⚠️ |

Key observation: Loss steadily decreases, but capital city prediction remains unstable.


🧪 Concrete Examples

Prompt: "Türkiye'nin başkenti"

Step 6500 output:

"Türkiye'nin başkenti Bolu'da 24 Mayıs 2012 tarihinde toplanan ve 23. dönem TBMM'de kabul edilen 6360 sayılı kanun..."

  • ❌ Wrong: Bolu is not the capital
  • ✅ Right: Date format, legal language, formal tone, grammar

Step 7500 output:

"Türkiye'nin başkenti Konya, Türkiye'nin en büyük kenti Konya'dan Avrupa'ya yayılan ilk yatırımını 25-30 Mayıs 2020..."

  • ❌ Wrong: Konya is not the capital
  • ✅ Right: Geographic context, date ranges, economic terminology

Step 9000 output:

"Türkiye'nin başkenti Ankara'da düzenlenen Dünya Kadınlar Basketbol Şampiyonası'nda..."

  • ✅ Finally correct!

🤔 Why This Happens

What the Model Actually Learns

Cross-entropy loss optimizes for: "What token is likely in this context?"

In training data distribution:

  • "Türkiye'nin başkenti Ankara..." appears ~60% of patterns
  • "Başkent Bursa/Konya/İzmir..." appears ~40% (from various contexts)

The model learns distributional probabilities, not factual truth.

From the model's perspective:

  • Sometimes generate "Ankara" (most frequent)
  • Sometimes generate other cities (contextually plausible)
  • Both reduce loss equally if they appear in training data
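
To make this concrete, here is a minimal scoring sketch, assuming the model and tokenizer are loaded as in the Usage section below and expose the standard Hugging Face causal-LM interface. It measures the cross-entropy of two continuations; a frequent-but-wrong city can score almost as well as the correct one:

    import torch
    import torch.nn.functional as F

    def continuation_loss(prompt, continuation):
        """Mean cross-entropy of the continuation tokens given the prompt."""
        prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
        full_ids = tokenizer(prompt + continuation, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(full_ids).logits
        start = prompt_ids.shape[1]
        # logits at position t predict the token at position t+1
        return F.cross_entropy(logits[0, start - 1:-1], full_ids[0, start:]).item()

    print(continuation_loss("Türkiye'nin başkenti", " Ankara"))  # factually correct
    print(continuation_loss("Türkiye'nin başkenti", " Konya"))   # wrong, but seen in training
    # If both continuations occur in similar training contexts, both losses are low:
    # the objective rewards plausibility, never truth.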

Why Loss Still Decreases

Even with wrong facts, the model improves at:

  • ✅ Grammar (Turkish morphology)
  • ✅ Syntax (sentence structure)
  • ✅ Style (formal/informal tone matching)
  • ✅ Context coherence (topic consistency)
  • ✅ Pattern matching (Wikipedia-style text)

Loss measures linguistic fluency, NOT factual correctness.


📊 What 85M Parameters Can vs Cannot Do

✅ Successfully Learned

  • Linguistic patterns: Grammar, morphology, syntax
  • Contextual coherence: Topic-appropriate vocabulary
  • Format mimicry: News articles, formal documents
  • Statistical associations: Common word pairings

❌ Failed to Learn

  • Factual grounding: "Ankara = capital" as deterministic rule
  • Logical consistency: Same prompt should give same fact
  • Knowledge retrieval: Reliable information recall
  • Fact vs pattern: Distinguishing truth from plausibility

Model Size Comparison

| Model | Parameters | Factual Reliability |
|---|---|---|
| Kayra (this) | 85M | Poor - hallucinations common |
| GPT-2 Small | 124M | Poor - similar issues |
| GPT-2 Medium | 355M | Better, but still unreliable |
| GPT-3 | 175B | Good consistency |
| GPT-4 | ~1.7T (+ RLHF) | Reliable |

Conclusion: 85M learns language patterns, not a knowledge base.


🔬 Technical Details

Architecture

  • Type: Transformer Decoder (GPT-style)
  • Layers: 10
  • Hidden size: 640
  • Attention heads: 10
  • FFN size: 2560
  • Vocabulary: 32,000 BPE tokens
  • Context: 512 tokens
  • Total: ~85M parameters
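
For orientation only, here are the same hyperparameters expressed with a stock GPT-2 config. The repo actually ships a custom architecture loaded via trust_remote_code, so the field names below are just the standard GPT-2 ones:

    from transformers import GPT2Config

    # Approximate stand-in for the custom Kayra architecture, for sizing intuition only.
    config = GPT2Config(
        vocab_size=32_000,
        n_positions=512,   # context length
        n_embd=640,        # hidden size
        n_layer=10,
        n_head=10,
        n_inner=2560,      # FFN size
    )
    # Rough budget: token embeddings are 32,000 × 640 ≈ 20.5M parameters, and each
    # of the 10 blocks adds ≈ 4·640² (attention) + 2·640·2560 (FFN) ≈ 4.9M more.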

Training Data

  • Wikipedia TR: 170K articles
  • mC4 Turkish: 330K web documents
  • Total: 500K deduplicated documents
  • Deduplication: MinHash LSH (85% threshold)
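
The exact deduplication pipeline is not published here, but a minimal sketch with the datasketch library at the stated 85% similarity threshold would look roughly like this (`corpus` is a placeholder for the raw document stream):

    from datasketch import MinHash, MinHashLSH

    def doc_minhash(text, num_perm=128):
        m = MinHash(num_perm=num_perm)
        for token in text.lower().split():
            m.update(token.encode("utf-8"))
        return m

    lsh = MinHashLSH(threshold=0.85, num_perm=128)  # 85% Jaccard similarity threshold
    kept = []
    for i, doc in enumerate(corpus):       # `corpus`: iterable of raw documents (placeholder)
        m = doc_minhash(doc)
        if not lsh.query(m):               # no near-duplicate indexed yet
            lsh.insert(f"doc-{i}", m)
            kept.append(doc)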

Training Setup

  • Effective batch: 64 (4 × 16 gradient accumulation)
  • Learning rate: 1e-4 → 3e-4 (cosine with 2K warmup)
  • Optimizer: AdamW (β1=0.9, β2=0.95)
  • Hardware: NVIDIA T4 GPU (16GB)
  • Time: ~9 hours
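
The training script itself is not part of this repo, so the following is only a skeleton consistent with the numbers above (micro-batch 4 × 16 accumulation steps = effective 64, AdamW with β1=0.9, β2=0.95, cosine schedule with 2,000 warmup steps); `train_loader` is a placeholder:

    import torch
    from transformers import get_cosine_schedule_with_warmup

    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, betas=(0.9, 0.95))
    scheduler = get_cosine_schedule_with_warmup(
        optimizer, num_warmup_steps=2_000, num_training_steps=9_000
    )
    accum_steps = 16  # micro-batch of 4 × 16 accumulation steps = effective batch 64

    for step, batch in enumerate(train_loader):   # `train_loader`: placeholder DataLoader
        loss = model(**batch).loss / accum_steps  # batch includes labels for LM loss
        loss.backward()
        if (step + 1) % accum_steps == 0:
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()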

📈 Evaluation Summary

Fluency: ✅ Good

| Metric | Score |
|---|---|
| Grammatical Turkish | 95%+ |
| Natural sentence flow | 90%+ |
| Coherent paragraphs | 85%+ |

Factuality: ❌ Poor

| Metric | Score |
|---|---|
| Correct capital city | ~50% (random) |
| Correct historical dates | ~40% |
| Consistent facts across runs | ~30% |

💡 Key Learnings

1. Pretraining ≠ Knowledge Encoding (at this scale)

85M parameters learn how to speak Turkish, not what is true about Turkey.

2. Solutions Require Additional Steps

  • Option A: Bigger model (1B+). More parameters retain facts better, but instruction tuning is still needed.
  • Option B: Instruction tuning. Explicit "correct answer" supervision with contrastive examples.
  • Option C: Retrieval augmentation (RAG). An external knowledge base supplies facts for verification (toy sketch below).
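
As a toy illustration of Option C, the retrieved fact is prepended to the prompt so generation is conditioned on it rather than on the model's unreliable parametric memory. The dictionary below stands in for a real retriever:

    # Toy RAG-style prompting: `facts` stands in for an external knowledge base / retriever.
    facts = {"başkent": "Türkiye'nin başkenti Ankara'dır."}

    def rag_prompt(question, key):
        # "Bilgi" = fact, "Soru" = question, "Cevap" = answer.
        context = facts.get(key, "")  # a real system would embed the question and search
        return f"Bilgi: {context}\nSoru: {question}\nCevap:"

    print(rag_prompt("Türkiye'nin başkenti neresidir?", "başkent"))
    # The retrieved sentence carries the fact; the model only has to paraphrase it.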

3. Validation Loss is Misleading

Low perplexity ≠ factual correctness. Always manually test:

  • Same prompt → consistent facts?
  • Known facts → correct retrieval?
  • Hallucination rate → human evaluation
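
A small consistency probe along those lines (again assuming the model and tokenizer from the Usage section): sample the same factual prompt repeatedly and count how often the known answer appears.

    prompt = "Türkiye'nin başkenti"
    inputs = tokenizer(prompt, return_tensors="pt")

    runs, hits = 20, 0
    for _ in range(runs):
        out = model.generate(
            inputs.input_ids,
            max_new_tokens=20,
            do_sample=True,
            temperature=0.8,
            top_k=50,
        )
        hits += "Ankara" in tokenizer.decode(out[0], skip_special_tokens=True)

    print(f"'Ankara' appeared in {hits}/{runs} samples")  # ≈ 50% for this checkpoint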

🎯 Appropriate Use Cases

✅ Recommended

  • Research on Turkish NLP limitations
  • Pretraining baseline comparisons
  • Hallucination pattern studies
  • Educational demonstrations
  • Understanding LLM failure modes

❌ Not Recommended

  • Production applications
  • Factual question answering
  • Information retrieval systems
  • Educational content generation
  • Any task requiring accuracy

🚀 Future: Kayra-v2

Planned improvements:

  • Larger model: 350M-750M parameters
  • Better tokenizer: NFC Unicode normalization
  • Instruction tuning: 10K QA pairs with verified answers
  • Alignment: RLHF or DPO for factual accuracy
  • Evaluation: Proper fact-checking benchmarks

🔧 Usage

⚠️ Requires trust_remote_code=True (custom architecture)

Load the model with:

    from transformers import AutoModelForCausalLM, AutoTokenizer
    
    model = AutoModelForCausalLM.from_pretrained(
        "sixfingerdev/kayra-1-exp",
        trust_remote_code=True
    )
    tokenizer = AutoTokenizer.from_pretrained("sixfingerdev/kayra-1-exp")

Generate with repetition penalty to reduce loops:

    inputs = tokenizer("Türkiye'nin başkenti", return_tensors="pt")
    outputs = model.generate(
        inputs.input_ids,
        max_new_tokens=100,
        temperature=0.8,
        top_k=50,
        repetition_penalty=1.2,
        do_sample=True
    )
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Expected behavior: Fluent Turkish, possibly wrong facts.


📚 Citation

    @misc{kayra2024hallucination,
      title={Why Small Turkish GPTs Hallucinate Facts: An Experimental 85M Model},
      author={sixfingerdev},
      year={2024},
      publisher={HuggingFace},
      howpublished={\url{https://huggingface.co/sixfingerdev/kayra-1-exp}},
      note={Research on loss-factuality divergence in low-resource language models}
    }

🙏 Acknowledgments

  • Inspiration: EleutherAI's research on small model limitations
  • Data: Wikimedia Foundation, Common Crawl (mC4)
  • Framework: PyTorch, HuggingFace Transformers

📜 License

MIT License - Use freely for research and education.

Disclaimer: This model is intentionally shared with its flaws documented. It serves as a learning resource demonstrating why small LMs hallucinate, not as a production tool.


Kayra-1-exp - Teaching us what 85M parameters cannot do 🔬


Discussion: Found interesting hallucination patterns? Share your findings in the community discussions tab. Let's learn together why small LMs hallucinate. 🇹🇷

🌙 Kayra-1-exp

Kayra - The first experimental GPT model trained from scratch entirely in Turkish.

📊 Model Details

  • Model type: Decoder-only Transformer (GPT-style)
  • Parameters: ~85 million
  • Validation PPL: 42.7
  • Validation Loss: 3.75
  • Language: Turkish only
  • License: MIT

🏗️ Architecture

  • Layers: 10
  • Hidden size: 640
  • Attention heads: 10
  • FFN size: 2560
  • Vocabulary: 32,000
  • Context length: 512

📚 Training Data

  • Wikipedia TR: ~170K articles
  • mC4 Turkish: ~330K documents
  • Total: ~500K deduplicated documents (MinHash LSH)

🚀 Usage

Example code:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    model = AutoModelForCausalLM.from_pretrained(
        "sixfingerdev/kayra-1-exp",
        trust_remote_code=True  # ← IMPORTANT!
    )
    tokenizer = AutoTokenizer.from_pretrained("sixfingerdev/kayra-1-exp")

    prompt = "Türkiye'nin başkenti"
    inputs = tokenizer(prompt, return_tensors="pt")

    outputs = model.generate(
        inputs.input_ids,
        max_new_tokens=100,
        temperature=0.2,
        top_k=50,
        do_sample=True
    )

    print(tokenizer.decode(outputs[0], skip_special_tokens=True))

⚠️ Limitations (Experimental)

This is an experimental prototype:

  • ❌ Very rare Unicode glitches occur (NFD normalization)
  • ❌ May generate incorrect information
  • ❌ Not recommended for production use

Examples:

  • "stadyumu" → "stad yumu" (Unicode fragmentation)

🔮 Future (Kayra-1-stable)

In the corrected version:

  • ✅ NFC Unicode normalization
  • ✅ Instruction fine-tuning
  • ✅ Production-ready
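
As a hypothetical illustration of the NFC item above: Turkish letters like ü and ğ can be stored either as a single codepoint (NFC) or as a base letter plus a combining mark (NFD). A BPE tokenizer trained on mixed forms segments the same word inconsistently, which is the suspected cause of fragments like "stad yumu"; normalizing every document before tokenizer training removes the ambiguity:

    import unicodedata

    word = "Türkiye"
    nfd = unicodedata.normalize("NFD", word)   # decomposed: ü → u + combining diaeresis
    nfc = unicodedata.normalize("NFC", nfd)    # composed single codepoints
    print(len(word), len(nfd), len(nfc))       # 7, 8, 7

    # Planned v2 preprocessing: normalize to NFC before BPE training and inference.
    def preprocess(text):
        return unicodedata.normalize("NFC", text)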

📈 Training Details

  • Optimizer: AdamW (lr: 1e-4 → 3e-4, warmup: 2000 steps)
  • Batch size: 4 × 16 (gradient accumulation)
  • Precision: Mixed FP16
  • Hardware: Tesla T4 GPU
  • Training time: ~9 hours

📜 License

MIT License - Free for commercial and academic use.

🙏 Acknowledgments

  • Data: Wikimedia, Common Crawl (mC4)
  • Inspiration: GPT-1, Kumru

Kayra - The Intelligence That Creates Turkish 🌙

Model: sixfingerdev/kayra-1-exp
