Lyon28 committed · verified · commit 75b684e · 1 parent: a963c81

Upload README.md with huggingface_hub

Files changed (1): README.md (+641, −102)

README.md:
---
license: apache-2.0
language:
- id
- en
tags:
- text-generation
- pytorch
- causal-lm
- transformer
- untrained
- gqa
- rope
- swiglu
- rmsnorm
- flash-attention
- indonesian
library_name: transformers
pipeline_tag: text-generation
widget:
- text: "Jakarta adalah ibu kota"
  example_title: "🇮🇩 Text Completion (ID)"
- text: |
    Pertanyaan: Apa itu kecerdasan buatan?
    Jawaban:
  example_title: "🇮🇩 Question Answering (ID)"
- text: |
    Tulis cerita pendek tentang robot yang belajar mencintai.
  example_title: "🇮🇩 Creative Writing (ID)"
- text: "The capital of Indonesia is"
  example_title: "🇬🇧 Text Completion (EN)"
- text: |
    Question: What is artificial intelligence?
    Answer:
  example_title: "🇬🇧 Question Answering (EN)"
- text: |
    def fibonacci(n):
        """Hitung bilangan fibonacci ke-n"""
  example_title: "💻 Code Completion"
- text: |
    def reverse_string(s):
  example_title: "💻 Code Generation"
- text: |
    User: Halo! Siapa kamu?
    Assistant:
  example_title: "💬 Chat Format (ID)"
- text: |
    User: Jelaskan tentang machine learning dalam 2 kalimat.
    Assistant:
  example_title: "💬 Conversational (ID)"
inference:
  parameters:
    max_new_tokens: 100
    temperature: 0.7
    top_p: 0.9
    top_k: 50
    do_sample: true
    repetition_penalty: 1.1
    num_beams: 1
datasets: []
metrics:
- perplexity
model-index:
- name: caca-5M
  results: []
---

<div align="center">

<img src="https://i.postimg.cc/MTSj073X/logo.png" width="400" alt="caca-5M"/>

# 🚀 CACA-5M

### A Modern Transformer Model with an Advanced Architecture

[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![PyTorch](https://img.shields.io/badge/PyTorch-2.0+-red.svg)](https://pytorch.org/)
[![Transformers](https://img.shields.io/badge/🤗%20Transformers-4.35+-yellow.svg)](https://github.com/huggingface/transformers)

**24,253,696** parameters • **24.25M** • **8 layers**

[📖 Documentation](#documentation) • [🚀 Quick Start](#quick-start) • [💡 Features](#key-features) • [🔧 Training](#training-guide) • [📊 Specifications](#technical-specifications)

---

</div>

## ⚠️ IMPORTANT: Untrained Model

> **WARNING**: This model has **not been trained**. The weights are still randomly initialized, so any output it produces will be **meaningless and random**.

**Model status:**
- 🔴 **Untrained** - weights are still random
- 🟡 **Research only** - for architecture & training experiments
- 🟢 **Ready to train** - the architecture itself has been exercised

The widgets above only illustrate the **expected input format**. Once the model has been trained on a suitable dataset, the same formats will produce meaningful output.

---

## 📋 Description

**Caca** is a modern Large Language Model (LLM) architecture that combines a range of state-of-the-art deep-learning techniques. The model is designed with a focus on **efficiency**, **scalability**, and **high performance**.

<blockquote style="border-left: 4px solid #4A90E2; padding-left: 16px; margin: 16px 0; color: #555;">
<p><strong>Caca</strong> is an open-source Indonesian LLM experiment, built from scratch by one person, step by step. It is not trying to compete with anyone; it is simply an exploration of what can be done with a limited budget, unlimited passion, and a collaborative mindset. If it turns out to be useful to others, wonderful. If not, it is still fun.</p>
<p>This is an exploratory project, so if it fails, that is part of the learning process. If it succeeds, that is a bonus.</p>
</blockquote>

### 🎯 Key Strengths

- **🇮🇩 Bilingual Support**: optimized for Indonesian & English
- **⚡ Fast Inference**: Flash Attention 2 for up to 3x faster inference
- **💾 Memory Efficient**: Grouped Query Attention halves the KV cache
- **🎯 Long Context**: supports up to 2,048 tokens
- **🔧 Modular**: flexible architecture with many configuration options

---

## ✨ Key Features

### 🎯 Core Features

- ✅ **Grouped Query Attention (GQA)** - superior memory and compute efficiency (see the sketch after this list)
  - Query heads: 4
  - KV heads: 2
  - Ratio: 2:1 (cuts the KV cache by 50%)

- ✅ **Rotary Position Embeddings (RoPE)** - better generalization to long contexts
  - Theta: 10000
  - Supports extrapolation beyond the training length

- ✅ **RMSNorm** - more stable normalization, roughly 50% faster than LayerNorm
  - Epsilon: 1e-06

- ✅ **SwiGLU Activation** - around 10-15% better than ReLU/GELU
  - Intermediate size: 1,024

- ✅ **Flash Attention 2** - up to 3x speedup with better memory efficiency
  - Enabled automatically when CUDA is available
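The 50% figure quoted for GQA can be checked with a quick back-of-the-envelope calculation using the head counts, head dimension, and 2,048-token context listed in this card (BF16, 2 bytes per value):

```python
# KV cache holds one K and one V vector per layer, per KV head, per cached token.
num_layers, head_dim, seq_len, bytes_bf16 = 8, 64, 2048, 2
num_query_heads, num_kv_heads = 4, 2

def kv_cache_bytes(n_heads: int) -> int:
    return 2 * num_layers * n_heads * head_dim * seq_len * bytes_bf16  # 2 = K and V

mha_cache = kv_cache_bytes(num_query_heads)  # if every query head kept its own K/V
gqa_cache = kv_cache_bytes(num_kv_heads)     # this model: 2 shared KV heads
print(f"MHA: {mha_cache / 2**20:.1f} MiB, GQA: {gqa_cache / 2**20:.1f} MiB, "
      f"saving: {1 - gqa_cache / mha_cache:.0%}")  # -> saving: 50%
```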
 
### 🔥 Advanced Features

### 🎯 Attention Mechanisms
- ⚡ **Flash Attention v2** - up to 3x faster with an IO-aware algorithm
- 🔑 **Grouped Query Attention (GQA)** - 4 query heads : 2 KV heads
- 🚀 **xFormers Support** - memory-efficient attention fallback
- 🎯 **PyTorch SDPA** - native scaled dot product attention

### 📏 Position Encodings
- 🔄 **RoPE** - rotary embeddings (θ=10000); sketched below
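To make the rotary idea concrete, here is a small, self-contained sketch of one common RoPE formulation (rotating interleaved even/odd channel pairs). It is illustrative only; the pairing convention used by this repository's modeling code may differ:

```python
import torch

def rope_inv_freq(head_dim: int = 64, theta: float = 10000.0) -> torch.Tensor:
    # One inverse frequency per channel pair, as in the RoFormer paper.
    return 1.0 / (theta ** (torch.arange(0, head_dim, 2).float() / head_dim))

def apply_rope(x: torch.Tensor, positions: torch.Tensor) -> torch.Tensor:
    # x: (..., seq_len, head_dim); rotate each (even, odd) channel pair by position * freq.
    freqs = torch.outer(positions.float(), rope_inv_freq(x.shape[-1]))  # (seq_len, head_dim/2)
    cos, sin = freqs.cos(), freqs.sin()
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out

q = torch.randn(1, 4, 16, 64)            # (batch, heads, seq, head_dim)
q_rot = apply_rope(q, torch.arange(16))  # positions 0..15
```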
 
### 🪟 Long Context Features
- 📖 Context window of up to 2,048 tokens

### 🎓 Training Optimizations
- 💾 **Gradient Checkpointing** - memory-efficient training
- 🎯 **Mixed Precision** - BF16 & FP16 support

### 📦 Quantization Support
- 4️⃣ **4-bit Quantization** - NF4, FP4 via bitsandbytes (example below)
- 8️⃣ **8-bit Quantization** - LLM.int8() support
- 🔄 **Double Quantization** - further compression
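Loading the model in 4-bit NF4 with double quantization via bitsandbytes might look like the following sketch (it assumes a CUDA GPU and that `bitsandbytes` is installed):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4 data type
    bnb_4bit_use_double_quant=True,         # double quantization for extra compression
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_4bit = AutoModelForCausalLM.from_pretrained(
    "Lyon28/caca-5M-untrained",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
```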
 
### 🛠️ Optimization Features

- 💾 **KV Cache** - 5-10x faster autoregressive generation
- 🔧 **Gradient Checkpointing** - train larger models within limited memory
- 📦 **Quantization Ready** - 4-bit & 8-bit quantization support
- 🎯 **Mixed Precision Training** - BF16 & FP16 support

---

## 📊 Technical Specifications

<div align="center">

| Specification | Detail |
|---------------|--------|
| **💎 Total Parameters** | **24,253,696** (24.25M) |
| **📏 Hidden Size** | 256 |
| **🔢 Intermediate Size** | 1,024 |
| **🏗️ Num Layers** | 8 |
| **🎯 Attention Heads** | 4 |
| **🔑 KV Heads** | 2 (GQA) |
| **📏 Head Dimension** | 64 |
| **📚 Vocab Size** | 32,000 tokens |
| **📖 Max Context** | 2,048 tokens |
| **🏛️ Architecture** | Decoder-only Transformer |
| **🎨 Model Type** | Causal Language Model |

</div>

### 📐 Architecture Details

<details>
<summary><b>🔍 Click to see the full structure</b></summary>

```
CacaForCausalLM (24.25M)
│
├─ Embedding Layer
│   └─ Token Embeddings: 32,000 × 256
│       └─ Parameters: 8,192,000
│
├─ Transformer Layers (8x)
│   │
│   ├─ Layer {i} (repeated 8 times)
│   │   │
│   │   ├─ Input LayerNorm (RMSNorm)
│   │   │   └─ Params: 256
│   │   │
│   │   ├─ Self-Attention (Grouped Query Attention)
│   │   │   ├─ Q Projection: 256 → 256
│   │   │   ├─ K Projection: 256 → 128
│   │   │   ├─ V Projection: 256 → 128
│   │   │   ├─ O Projection: 256 → 256
│   │   │   ├─ RoPE Embeddings: θ=10000
│   │   │   └─ Flash Attention 2 (if available)
│   │   │
│   │   ├─ Post-Attention LayerNorm (RMSNorm)
│   │   │   └─ Params: 256
│   │   │
│   │   ├─ MLP (SwiGLU)
│   │   │   ├─ Gate: 256 → 1,024
│   │   │   ├─ Up: 256 → 1,024
│   │   │   ├─ Activation: SiLU (Swish)
│   │   │   └─ Down: 1,024 → 256
│   │   │
│   │   └─ Residual Connections (2x per layer)
│   │
│   └─ Total Layer Params: ~0.98M per layer
│
├─ Final LayerNorm (RMSNorm)
│   └─ Params: 256
│
└─ LM Head (Output Projection)
    └─ Linear: 256 → 32,000
        └─ Parameters: 8,192,000
```

**Parameter breakdown (approximate):**
- Token embeddings: `32,000 × 256 = 8,192,000`
- Transformer layers: `8 layers × ~0.98M ≈ 7.87M`
- LM head: `256 × 32,000 = 8,192,000`
- **Reported total: 24,253,696 parameters (24.25M)**

</details>
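The breakdown above can be re-derived from the dimensions in the specification table. The sketch below assumes bias-free linear layers; the small remainder relative to the reported 24,253,696 comes from implementation details not listed in the table:

```python
hidden, intermediate, vocab, layers = 256, 1024, 32000, 8
head_dim, n_q_heads, n_kv_heads = 64, 4, 2

attn = (hidden * n_q_heads * head_dim            # Q projection
        + 2 * hidden * n_kv_heads * head_dim     # K and V projections (GQA: fewer heads)
        + n_q_heads * head_dim * hidden)         # O projection
mlp = 3 * hidden * intermediate                  # gate, up, down (SwiGLU)
norms = 2 * hidden                               # two RMSNorms per layer
per_layer = attn + mlp + norms                   # ≈ 983,552

total = vocab * hidden + layers * per_layer + hidden + hidden * vocab  # + final RMSNorm
print(f"per layer: {per_layer:,}  total: {total:,}")                   # ≈ 24.25M
```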
 
---

## 🚀 Quick Start

### 📦 Installation

```bash
# Core dependencies
pip install "torch>=2.0.0" "transformers>=4.35.0" accelerate safetensors

# Optional: for maximum performance
pip install flash-attn --no-build-isolation  # Flash Attention 2
pip install xformers                         # memory-efficient attention
pip install bitsandbytes                     # quantization support
```

### 💻 Basic Usage

#### 1️⃣ Load the Model

```python
from transformers import AutoModelForCausalLM, AutoConfig
import torch

# Load the configuration
config = AutoConfig.from_pretrained(
    "Lyon28/caca-5M-untrained",
    trust_remote_code=True
)

print(f"Model: {config.model_type}")
print(f"Parameters: 24,253,696")
print(f"Hidden size: {config.hidden_size}")
print(f"Layers: {config.num_hidden_layers}")

# Load the model
model = AutoModelForCausalLM.from_pretrained(
    "Lyon28/caca-5M-untrained",
    config=config,
    torch_dtype=torch.bfloat16,  # use BF16 for efficiency
    device_map="auto",           # place on GPU(s) automatically
    trust_remote_code=True
)

print(f"Model loaded! Device: {model.device}")
```
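The card states that Flash Attention 2 is picked up automatically when CUDA is available. Transformers also lets you request it explicitly via `attn_implementation`; whether that flag is honored here depends on the repository's custom modeling code, so treat this as an optional variant of the load above:

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Lyon28/caca-5M-untrained",
    torch_dtype=torch.bfloat16,                 # FA2 requires fp16/bf16 weights
    attn_implementation="flash_attention_2",    # needs a CUDA GPU and flash-attn installed
    device_map="auto",
    trust_remote_code=True,
)
```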
 
#### 2️⃣ Verify the Model

```python
# Count parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"Total parameters: {total_params:,}")
print(f"Trainable parameters: {trainable_params:,}")
print(f"Model size: {total_params * 2 / 1e9:.2f} GB (BF16)")

# Test a forward pass
batch_size, seq_len = 2, 10
input_ids = torch.randint(0, config.vocab_size, (batch_size, seq_len))
input_ids = input_ids.to(model.device)

with torch.no_grad():
    outputs = model(input_ids)

print(f"Output shape: {outputs.logits.shape}")
print("✅ Model works as expected!")
```
 
#### 3️⃣ Generate Text (After Training)

```python
from transformers import AutoTokenizer

# Load a tokenizer (use whichever tokenizer you pair with the model)
tokenizer = AutoTokenizer.from_pretrained("your-tokenizer-here")

# Prepare the input
text = "Jelaskan tentang kecerdasan buatan"  # "Explain artificial intelligence"
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Generate
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    temperature=0.7,
    top_p=0.9,
    top_k=50,
    do_sample=True,
    repetition_penalty=1.1,
    pad_token_id=tokenizer.eos_token_id
)

# Decode
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```
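`"your-tokenizer-here"` above is a placeholder: this repository ships no tokenizer (see Limitations & Bias below). One way to produce a matching 32,000-token vocabulary is to train a byte-level BPE tokenizer with the `tokenizers` library; a minimal sketch, where the corpus file names are purely hypothetical:

```python
import os
from tokenizers import ByteLevelBPETokenizer

# Train a 32,000-token vocab to match the model's embedding size.
bpe = ByteLevelBPETokenizer()
bpe.train(
    files=["corpus_id.txt", "corpus_en.txt"],   # your own Indonesian/English text files
    vocab_size=32000,
    min_frequency=2,
    special_tokens=["<s>", "</s>", "<unk>", "<pad>"],
)

os.makedirs("./caca-tokenizer", exist_ok=True)
bpe.save_model("./caca-tokenizer")              # writes vocab.json + merges.txt
```

The resulting `vocab.json`/`merges.txt` can then be wrapped with a fast tokenizer class (for example `GPT2TokenizerFast`) so it loads through `AutoTokenizer`.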
 
---

## 🔧 Training Guide

### 📚 Dataset Preparation

```python
from datasets import load_dataset

# Load a dataset (example)
dataset = load_dataset("indonesian-nlp/id-wikipedia")

# Or load from a local file
from datasets import Dataset
import pandas as pd

df = pd.read_csv("your_data.csv")
dataset = Dataset.from_pandas(df)

print(f"Dataset size: {len(dataset)}")
```
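The `Trainer` below expects tokenized examples; `DataCollatorForLanguageModeling(mlm=False)` then builds the causal-LM labels from `input_ids`. A minimal tokenization step, assuming the dataset has a `text` column and `tokenizer` is the one you prepared earlier:

```python
def tokenize(batch):
    # Truncate to the model's 2,048-token context window.
    return tokenizer(batch["text"], truncation=True, max_length=2048)

dataset = dataset.map(tokenize, batched=True, remove_columns=["text"])
```

For large-scale pre-training you would typically also pack or group documents into fixed-length blocks; this sketch keeps one document per example.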
### 🎯 Training Configuration

```python
from transformers import Trainer, TrainingArguments
from transformers import DataCollatorForLanguageModeling

# Training arguments
training_args = TrainingArguments(
    # Output
    output_dir="./caca-5M-trained",
    run_name="caca-5M-v1",

    # Training
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,    # effective batch size = 32
    learning_rate=2e-4,
    weight_decay=0.1,
    warmup_steps=2000,

    # Optimization
    bf16=True,                        # mixed-precision training
    gradient_checkpointing=True,      # save memory
    optim="adamw_torch_fused",        # fast fused optimizer
    max_grad_norm=1.0,

    # Logging & evaluation
    logging_steps=10,
    logging_first_step=True,
    eval_strategy="steps",
    eval_steps=500,
    save_steps=1000,
    save_total_limit=3,

    # Hub integration
    push_to_hub=True,
    hub_model_id="your-username/caca-5M-trained",
    hub_strategy="every_save",

    # Distributed training
    ddp_find_unused_parameters=False,
    dataloader_num_workers=4,
)

# Data collator
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False  # causal LM, not masked LM
)

# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    data_collator=data_collator,
)

# Train!
print("🚀 Starting training...")
trainer.train()

# Save the final model
print("💾 Saving model...")
trainer.save_model("./caca-5M-final")
trainer.push_to_hub()

print("✅ Training complete!")
```
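The front matter lists perplexity as the tracked metric. Once training has produced an evaluation loss, perplexity follows directly from the `Trainer` output; a short sketch continuing from the code above:

```python
import math

# Perplexity is exp(cross-entropy loss per token) on the evaluation split.
eval_metrics = trainer.evaluate()
perplexity = math.exp(eval_metrics["eval_loss"])
print(f"eval_loss: {eval_metrics['eval_loss']:.3f} | perplexity: {perplexity:.1f}")
```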
### 📊 Resource Estimates

<details>
<summary><b>💰 Click for estimated training cost & time</b></summary>

**Hardware requirements:**

| GPU | Memory | Batch Size | Speed | Est. Time (100B tokens) |
|-----|--------|------------|-------|-------------------------|
| RTX 3090 (24GB) | 24GB | 1-2 | ~1K tok/s | ~30 days |
| A100 (40GB) | 40GB | 4-8 | ~5K tok/s | ~6 days |
| A100 (80GB) | 80GB | 8-16 | ~8K tok/s | ~4 days |
| 8×A100 (80GB) | 640GB | 64+ | ~50K tok/s | ~14 hours |

**Cloud costs (approximate):**
- AWS p4d.24xlarge (8×A100): ~$32/hour × 24 hours = **~$768/day**
- GCP a2-ultragpu-8g: ~$30/hour × 24 hours = **~$720/day**
- Lambda Labs (8×A100): ~$15/hour × 24 hours = **~$360/day**

**Tips for keeping costs down:**
- Use spot instances (60-70% cheaper)
- Use gradient accumulation for larger effective batch sizes
- Mixed precision (BF16) for up to 2x speedup
- Gradient checkpointing to save memory

</details>
 
---

## 💬 Chat Format

The model supports a standard chat format:

```python
# Single turn
messages = [
    {"role": "user", "content": "Halo! Siapa kamu?"},  # "Hello! Who are you?"
]

# Multi-turn conversation
messages = [
    {"role": "system", "content": "Kamu adalah asisten AI yang membantu."},
    {"role": "user", "content": "Jelaskan tentang fotosintesis"},
    {"role": "assistant", "content": "Fotosintesis adalah proses..."},
    {"role": "user", "content": "Apa manfaatnya bagi manusia?"},
]

# Apply the chat template
formatted = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

print(formatted)
# Output:
# System: Kamu adalah asisten AI yang membantu.
#
# User: Jelaskan tentang fotosintesis
# Assistant: Fotosintesis adalah proses...
# User: Apa manfaatnya bagi manusia?
# Assistant:
```
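Because no tokenizer (and therefore no chat template) ships with this repository, `apply_chat_template` will only produce the output shown above once you attach a template to your own tokenizer. A minimal Jinja template that reproduces the `System:`/`User:`/`Assistant:` layout, offered purely as an illustration:

```python
tokenizer.chat_template = (
    "{% for message in messages %}"
    "{% if message['role'] == 'system' %}System: {{ message['content'] }}\n\n"
    "{% elif message['role'] == 'user' %}User: {{ message['content'] }}\n"
    "{% else %}Assistant: {{ message['content'] }}\n{% endif %}"
    "{% endfor %}"
    "{% if add_generation_prompt %}Assistant:{% endif %}"
)
```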
 
---

## 🎯 Use Cases

### ✅ Good For:

- 🔬 **Research**: experiments with modern LLM architectures
- 📚 **Education**: learning about transformers & training
- 🎓 **Academia**: papers, theses, course projects
- 🚀 **Base Model**: fine-tuning for specific tasks (see the LoRA sketch below)
- 💡 **Proof of Concept**: test ideas before scaling up

### ❌ Not Suitable For:

- 🚫 **Production**: the model has not been trained
- 🚫 **Real-world apps**: output is still random
- 🚫 **Safety-critical uses**: no safety alignment yet
- 🚫 **Direct deployment**: it needs training first
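For the base-model use case, a parameter-efficient LoRA setup via the PEFT library (`pip install peft`) is a common starting point. This is only a sketch: the `target_modules` names are assumptions based on the projection names in the architecture tree above, so check the actual module names in the repository's modeling code first.

```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed names; verify in the modeling code
    task_type="CAUSAL_LM",
)

peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()  # only the LoRA adapters are trainable
```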
 
---

## 📖 Documentation

### 🔗 Important Links

- 📚 **Hugging Face Docs**: [huggingface.co/docs/transformers](https://huggingface.co/docs/transformers)
- 💻 **GitHub**: [Lyon-28/caca-transformers](https://github.com/Lyon-28/caca-transformers)
- 💬 **Discussions**: [Model discussions](https://huggingface.co/Lyon28/caca-5M-untrained/discussions)
- 🐛 **Issues**: [Report bugs](https://huggingface.co/Lyon28/caca-5M-untrained/discussions)

### 📝 Related Models

<div align="center">

| Model Size | Parameters | Link |
|------------|------------|------|
| 🐣 Tiny | 1M - 50M | [caca-1M](../caca-1M-untrained) to [caca-50M](../caca-50M-untrained) |
| 🐥 Small | 75M - 500M | [caca-75M](../caca-75M-untrained) to [caca-500M](../caca-500M-untrained) |
| 🦅 Medium | 600M - 1B | [caca-600M](../caca-600M-untrained) to [caca-1B](../caca-1B-untrained) |
| 🦁 Large | 1.5B - 5B | [caca-1.5B](../caca-1.5B-untrained) to [caca-5B](../caca-5B-untrained) |
| 🐉 XL | 6B - 10B | [caca-6B](../caca-6B-untrained) to [caca-10B](../caca-10B-untrained) |
| 🦖 XXL | 12B+ | [caca-12B](../caca-12B-untrained) to [caca-70B](../caca-70B-untrained) |

</div>

---

## 🤝 Contributing

Contributions are very welcome! A few ways to contribute:

- 🐛 **Report bugs**: found a bug? [Open a discussion](https://huggingface.co/Lyon28/caca-5M-untrained/discussions)
- 💡 **Suggest features**: have an idea? Share it in the discussions
- 📝 **Improve docs**: documentation PRs are welcome
- 🎓 **Share results**: trained the model? Share your results on the model card
- ⭐ **Star & share**: help the project grow

---

## 📜 License & Citation

### 📄 License

This model is released under the **Apache License 2.0**:
- ✅ Free for commercial use
- ✅ Free for research use
- ✅ Modification & redistribution allowed
- ✅ No warranty provided

### 📚 Citation

If you use this model in research or a project, please cite:

```bibtex
@misc{caca5M2025,
  author = {Lyon},
  title = {Caca-5M: Modern Transformer Architecture with GQA and Advanced Features},
  year = {2025},
  publisher = {Hugging Face},
  journal = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/Lyon28/caca-5M-untrained}},
}
```

### 🙏 Acknowledgments

This model is inspired by and implements ideas from a range of recent research:

#### 🏗️ **Core Architecture**
- **LLaMA** (Meta AI, 2023) - base decoder-only architecture, RMSNorm, SwiGLU
  - Paper: [LLaMA: Open and Efficient Foundation Language Models](https://arxiv.org/abs/2302.13971)
- **GPT-3** (OpenAI, 2020) - transformer language modeling paradigm
- **PaLM** (Google, 2022) - SwiGLU activation function

#### 🎯 **Attention Mechanisms**
- **Flash Attention v2** (Tri Dao et al., 2023) - efficient attention with IO-awareness
  - Paper: [FlashAttention-2: Faster Attention with Better Parallelism](https://arxiv.org/abs/2307.08691)
- **Grouped Query Attention (GQA)** (Ainslie et al., Google, 2023) - memory-efficient attention
  - Paper: [GQA: Training Generalized Multi-Query Transformer](https://arxiv.org/abs/2305.13245)
- **Multi-Query Attention (MQA)** (Shazeer, Google, 2019) - fast decoding
- **xFormers** (Meta AI, 2022) - memory-efficient attention implementations
- **PyTorch SDPA** (PyTorch Team, 2023) - built-in scaled dot product attention

#### 📏 **Position Encodings**
- **RoPE** (Su et al., 2021) - Rotary Position Embeddings
  - Paper: [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/abs/2104.09864)
- **ALiBi** (Press et al., 2022) - Attention with Linear Biases for extrapolation
  - Paper: [Train Short, Test Long: Attention with Linear Biases](https://arxiv.org/abs/2108.12409)
- **YaRN** (Peng et al., 2023) - Yet another RoPE extensioN for long context
  - Paper: [YaRN: Efficient Context Window Extension](https://arxiv.org/abs/2309.00071)

#### 🪟 **Long Context & Efficiency**
- **Sliding Window Attention** (Mistral AI, 2023) - local attention patterns
  - Paper: [Mistral 7B](https://arxiv.org/abs/2310.06825)
- **StreamingLLM / Attention Sink** (Xiao et al., MIT, 2023) - streaming over effectively unbounded sequence lengths
  - Paper: [Efficient Streaming Language Models with Attention Sinks](https://arxiv.org/abs/2309.17453)
- **Logit Softcapping** (Google Gemma, 2024) - prevents attention-logit overflow
  - Paper: [Gemma: Open Models Based on Gemini](https://arxiv.org/abs/2403.08295)

#### 🧠 **Mixture of Experts (MoE)**
- **Mixtral 8x7B** (Mistral AI, 2024) - sparse MoE architecture
  - Paper: [Mixtral of Experts](https://arxiv.org/abs/2401.04088)
- **Switch Transformers** (Fedus et al., Google, 2021) - scaling with expert routing
  - Paper: [Switch Transformers: Scaling to Trillion Parameter Models](https://arxiv.org/abs/2101.03961)
- **GLaM** (Du et al., Google, 2021) - Generalist Language Model with MoE
- **Expert Choice Routing** (Zhou et al., Google, 2022) - improved load balancing

#### 🎓 **Training Optimizations**
- **Layer Scale** (Touvron et al., Meta, 2021) - training stability for deep networks
  - Paper: [Going Deeper with Image Transformers (CaiT)](https://arxiv.org/abs/2103.17239)
- **Stochastic Depth** (Huang et al., 2016) - regularization via random layer dropping
  - Paper: [Deep Networks with Stochastic Depth](https://arxiv.org/abs/1603.09382)
- **Mixture of Depths (MoD)** (Raposo et al., Google DeepMind, 2024) - dynamic compute allocation
  - Paper: [Mixture-of-Depths: Dynamically allocating compute in transformer-based models](https://arxiv.org/abs/2404.02258)
- **Gradient Checkpointing** (Chen et al., 2016) - memory-efficient training

#### 📦 **Quantization**
- **LLM.int8()** (Dettmers et al., 2022) - 8-bit matrix multiplication
  - Paper: [LLM.int8(): 8-bit Matrix Multiplication for Transformers](https://arxiv.org/abs/2208.07339)
- **QLoRA** (Dettmers et al., 2023) - 4-bit quantized LoRA fine-tuning
  - Paper: [QLoRA: Efficient Finetuning of Quantized LLMs](https://arxiv.org/abs/2305.14314)
- **GPTQ** (Frantar et al., 2022) - post-training quantization
- **bitsandbytes** (Dettmers) - efficient quantization library

#### 🎨 **Multimodal Components**
- **Vision Transformer (ViT)** (Dosovitskiy et al., Google, 2020) - image encoding
  - Paper: [An Image is Worth 16x16 Words](https://arxiv.org/abs/2010.11929)
- **Perceiver Resampler** (Alayrac et al., DeepMind, 2022) - multimodal fusion
  - Paper: [Flamingo: a Visual Language Model](https://arxiv.org/abs/2204.14198)
- **Q-Former** (Li et al., Salesforce, 2023) - query-based multimodal alignment
  - Paper: [BLIP-2: Bootstrapping Language-Image Pre-training](https://arxiv.org/abs/2301.12597)
- **Whisper** (Radford et al., OpenAI, 2022) - audio-encoding inspiration

#### 🛠️ **Normalization & Activations**
- **RMSNorm** (Zhang & Sennrich, 2019) - Root Mean Square Layer Normalization
  - Paper: [Root Mean Square Layer Normalization](https://arxiv.org/abs/1910.07467)
- **SwiGLU** (Shazeer, Google, 2020) - GLU activation variant
  - Paper: [GLU Variants Improve Transformer](https://arxiv.org/abs/2002.05202)

#### 🔧 **Implementation & Tools**
- **Hugging Face Transformers** - model implementation framework
- **PyTorch** - deep learning framework
- **Safetensors** - secure tensor serialization format
- **Accelerate** - distributed training utilities

---

**Special thanks to:**
- 🇮🇩 The Indonesian NLP community
- 🤗 The Hugging Face team
- 🔬 The open-source AI research community

## ⚠️ Limitations & Bias

### Limitations

- 🔴 **Untrained**: the model has not been trained; output is random
- 🟡 **No tokenizer**: you need to prepare your own tokenizer
- 🟡 **No safety tuning**: no content filtering or alignment yet
- 🟠 **Memory intensive**: training requires substantial GPU resources

### Potential Biases

The model will inherit biases from whatever training data is used. Keep in mind:

- **Language**: bias toward the majority language(s) in the dataset
- **Culture**: bias toward particular cultural perspectives
- **Gender & demographics**: potential stereotypes
- **Factuality**: it can generate inaccurate information

**Recommendation**: evaluate and filter before any deployment.

---

## 📞 Support & Contact

### 💬 Community

- **Discussions**: [HF Discussions](https://huggingface.co/Lyon28/caca-5M-untrained/discussions)

### 📧 Contact

For questions or collaboration:
- Email: cacatransformers@gmail.com
- HF Profile: [@Lyon28](https://huggingface.co/Lyon28)

---

<div align="center">

## 🌟 Star History

[![Star History Chart](https://api.star-history.com/svg?repos=Lyon-28/caca-transformers&type=Date)](https://star-history.com/#Lyon-28/caca-transformers&Date)

---

### 💝 Built with ❤️ for the Indonesian AI community

**Thank you for using Caca!**

If this project is useful to you, please consider:
- ⭐ Starring this repository
- 🔗 Sharing it with friends
- 💬 Joining the discussions
- 🤝 Contributing to the project

---

</div>

### A Quote from Caca
<div align="center">
<img src="https://quotes-caca.vercel.app/api/SsQuote" alt="Daily Quote" width="100%" />
</div>