---
license: apache-2.0
language:
- id
- en
tags:
- text-generation
- pytorch
- causal-lm
- transformer
- untrained
- gqa
- rope
- swiglu
- rmsnorm
- flash-attention
- indonesian
library_name: transformers
pipeline_tag: text-generation
widget:
- text: "Jakarta adalah ibu kota"
  example_title: "🇮🇩 Text Completion (ID)"
- text: |
    Pertanyaan: Apa itu kecerdasan buatan?
    Jawaban:
  example_title: "🇮🇩 Question Answering (ID)"
- text: |
    Tulis cerita pendek tentang robot yang belajar mencintai.
  example_title: "🇮🇩 Creative Writing (ID)"
- text: "The capital of Indonesia is"
  example_title: "🇬🇧 Text Completion (EN)"
- text: |
    Question: What is artificial intelligence?
    Answer:
  example_title: "🇬🇧 Question Answering (EN)"
- text: |
    def fibonacci(n):
        """Hitung bilangan fibonacci ke-n"""
  example_title: "💻 Code Completion"
- text: |
    def reverse_string(s):
  example_title: "💻 Code Generation"
- text: |
    User: Halo! Siapa kamu?
    Assistant:
  example_title: "💬 Chat Format (ID)"
- text: |
    User: Jelaskan tentang machine learning dalam 2 kalimat.
    Assistant:
  example_title: "💬 Conversational (ID)"
inference:
  parameters:
    max_new_tokens: 100
    temperature: 0.7
    top_p: 0.9
    top_k: 50
    do_sample: true
    repetition_penalty: 1.1
    num_beams: 1
datasets: []
metrics:
- perplexity
model-index:
- name: caca-250M
results: []
---

# 🚀 CACA-250M
### A Modern Transformer Model with an Advanced Architecture
[Apache 2.0 License](https://opensource.org/licenses/Apache-2.0) • [Python](https://www.python.org/downloads/) • [PyTorch](https://pytorch.org/) • [Transformers](https://github.com/huggingface/transformers)

**430,493,696** parameters (430.49M) • **24 layers**

[📖 Documentation](#documentation) • [🚀 Quick Start](#quick-start) • [💡 Features](#key-features) • [🔧 Training](#training-guide) • [📊 Specifications](#technical-specifications)
---
## ⚠️ IMPORTANT: This Model Is Untrained
> **WARNING**: This model has **not been trained**. Its weights are still randomly initialized, so any output it produces will be **meaningless and random**.

**Model status:**
- 🔴 **Untrained** - weights are still random
- 🟡 **Research only** - for architecture & training experiments
- 🟢 **Ready to train** - the architecture itself has been tested

The widgets above only illustrate the **expected input formats**. Once the model has been trained on a suitable dataset, the same formats will produce meaningful output.
---
## 📋 Description
**Caca** is a next-generation Large Language Model (LLM) architecture that combines a range of state-of-the-art deep learning techniques. It is designed with a focus on **efficiency**, **scalability**, and **high performance**.

Caca is an open-source Indonesian LLM experiment, built from scratch by one person, step by step. It is not trying to compete with anyone; it is simply an exploration of what can be done with a limited budget, unlimited passion, and a collaborative mindset. If it turns out to be useful to others, alhamdulillah. If not, it is still fun.

This is an exploratory project, so if it fails, that is part of the learning process. If it succeeds, that is a bonus.
### 🎯 Key Advantages
- **🇮🇩 Bilingual Support**: Optimized for Indonesian & English
- **⚡ Ultra Fast**: Flash Attention 2 for up to 3x faster inference
- **💾 Memory Efficient**: Grouped Query Attention cuts the KV cache by 75%
- **🎯 Long Context**: Supports up to 8,192 tokens
- **🔧 Modular**: Flexible architecture with a range of configuration options
---
## ✨ Key Features
### 🎯 Core Features
- ✅ **Grouped Query Attention (GQA)** - superior memory and compute efficiency
  - Query heads: 16
  - KV heads: 4
  - Ratio: 4:1 (75% smaller KV cache)
- ✅ **Rotary Position Embeddings (RoPE)** - better generalization to long contexts
  - Theta: 10000
  - Supports extrapolation to contexts longer than the training length
- ✅ **RMSNorm** - more stable normalization, up to ~50% faster than LayerNorm (see the reference sketch after this list)
  - Epsilon: 1e-06
- ✅ **SwiGLU Activation** - 10-15% better performance than ReLU/GELU
  - Intermediate size: 4,096
- ✅ **Flash Attention 2** - up to 3x acceleration with better memory efficiency
  - Automatically enabled when CUDA is available
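To make the normalization and activation choices above concrete, here is a minimal, self-contained reference sketch of RMSNorm and the SwiGLU MLP, using the dimensions from the spec table (hidden size 1,024, intermediate size 4,096). It illustrates the techniques; it is not this repository's actual implementation.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    def __init__(self, dim: int = 1024, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        # Normalize by the root mean square only (no mean centering, unlike LayerNorm).
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)

class SwiGLUMLP(nn.Module):
    def __init__(self, hidden: int = 1024, intermediate: int = 4096):
        super().__init__()
        self.gate_proj = nn.Linear(hidden, intermediate, bias=False)
        self.up_proj = nn.Linear(hidden, intermediate, bias=False)
        self.down_proj = nn.Linear(intermediate, hidden, bias=False)

    def forward(self, x):
        # SwiGLU: SiLU(gate(x)) * up(x), projected back to the hidden size.
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

x = torch.randn(2, 10, 1024)
print(SwiGLUMLP()(RMSNorm()(x)).shape)  # torch.Size([2, 10, 1024])
```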
### 🔥 Advanced Features
#### 🎯 Attention Mechanisms
- ⚡ **Flash Attention v2** - up to 3x faster with an IO-aware algorithm
- 🔑 **Grouped Query Attention (GQA)** - 16 query heads : 4 KV heads
- 🚀 **xFormers Support** - memory-efficient attention fallback
- 🎯 **PyTorch SDPA** - native scaled dot product attention
#### 📍 Position Encodings
- 🔄 **RoPE** - rotary embeddings (θ=10000)
#### 🎓 Training Optimizations
- 💾 **Gradient Checkpointing** - memory-efficient training
- 🎯 **Mixed Precision** - BF16 & FP16 support
#### 📦 Quantization Support
- 4️⃣ **4-bit Quantization** - NF4, FP4 via bitsandbytes (see the loading sketch after this list)
- 8️⃣ **8-bit Quantization** - LLM.int8() support
- 🔄 **Double Quantization** - further compression
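As referenced in the bullets above, a hedged sketch of loading the checkpoint in 4-bit NF4 with double quantization via bitsandbytes. `BitsAndBytesConfig` is the standard `transformers` API; whether this custom `trust_remote_code` architecture quantizes cleanly out of the box is an assumption you should verify.
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # NF4 quantization
    bnb_4bit_use_double_quant=True,      # double quantization for extra compression
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_4bit = AutoModelForCausalLM.from_pretrained(
    "Lyon28/caca-250M-untrained",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
```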
### 🛠️ Optimization Features
- 💾 **KV Cache** - 5-10x faster autoregressive generation
- 🔧 **Gradient Checkpointing** - train large models with limited memory
- 📦 **Quantization Ready** - 4-bit & 8-bit quantization support
- 🎯 **Mixed Precision Training** - BF16 & FP16 support
---
## 📊 Technical Specifications
| Specification | Detail |
|---------------|--------|
| **💎 Total Parameters** | **430,493,696** (430.49M) |
| **📐 Hidden Size** | 1,024 |
| **🔢 Intermediate Size** | 4,096 |
| **🏗️ Num Layers** | 24 |
| **🎯 Attention Heads** | 16 |
| **🔑 KV Heads** | 4 (GQA) |
| **📏 Head Dimension** | 64 |
| **📚 Vocab Size** | 32,000 tokens |
| **📖 Max Context** | 8,192 tokens |
| **🏛️ Architecture** | Decoder-only Transformer |
| **🎨 Model Type** | Causal Language Model |
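A quick back-of-envelope check of the GQA memory claim, using only the numbers from the table above (assumptions: batch size 1, BF16 cache, full 8,192-token context):
```python
# KV-cache size at full context: K and V tensors of shape [ctx, kv_heads, head_dim] per layer.
layers, q_heads, kv_heads, head_dim, ctx, bytes_per = 24, 16, 4, 64, 8192, 2

def kv_cache_bytes(num_kv_heads):
    return 2 * layers * ctx * num_kv_heads * head_dim * bytes_per

gqa = kv_cache_bytes(kv_heads)   # 4 KV heads (this model)
mha = kv_cache_bytes(q_heads)    # 16 KV heads (no GQA)
print(f"GQA: {gqa / 2**30:.2f} GiB, MHA: {mha / 2**30:.2f} GiB, saving: {1 - gqa / mha:.0%}")
```
With 4 KV heads the full-context cache is roughly 0.19 GiB per sequence, versus about 0.75 GiB if all 16 heads kept their own K/V, which is where the 75% figure comes from.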
### 📐 Architecture Detail
🔍 Click to see the full structure
```
CacaForCausalLM (430.49M)
│
├─ Embedding Layer
│ └─ Token Embeddings: 32,000 × 1024
│ └─ Parameters: 32,768,000
│
├─ Transformer Layers (24x)
│ │
│ ├─ Layer {i} (repeated 24 times)
│ │ │
│ │ ├─ Input LayerNorm (RMSNorm)
│ │ │ └─ Params: 1,024
│ │ │
│ │ ├─ Self-Attention (Grouped Query Attention)
│ │ │ ├─ Q Projection: 1,024 → 1,024
│ │ │ ├─ K Projection: 1,024 → 256
│ │ │ ├─ V Projection: 1,024 → 256
│ │ │ ├─ O Projection: 1,024 → 1,024
│ │ │ ├─ RoPE Embeddings: θ=10000
│ │ │ └─ Flash Attention 2 (if available)
│ │ │
│ │ ├─ Post-Attention LayerNorm (RMSNorm)
│ │ │ └─ Params: 1,024
│ │ │
│ │ ├─ MLP (SwiGLU)
│ │ │ ├─ Gate: 1,024 → 4,096
│ │ │ ├─ Up: 1,024 → 4,096
│ │ │ ├─ Activation: SiLU (Swish)
│ │ │ └─ Down: 4,096 → 1,024
│ │ │
│ │ └─ Residual Connections (2x per layer)
│ │
│ └─ Total Layer Params: ~15.2M per layer
│
├─ Final LayerNorm (RMSNorm)
│ └─ Params: 1,024
│
└─ LM Head (Output Projection)
└─ Linear: 1,024 → 32,000
└─ Parameters: 32,768,000
```
**Parameter breakdown:**
- Embeddings: `32,000 × 1,024 = 32,768,000`
- Transformer layers: `24 layers × ~15.2M ≈ 365M`
- LM head: `1,024 × 32,000 = 32,768,000`
- **Total: 430,493,696 parameters**
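The same breakdown can be reproduced directly from the configuration values; the small gap of a few thousand parameters to the reported total comes from implementation details not shown in the tree.
```python
# Reproduce the parameter breakdown from the config values above.
# Assumes untied input embeddings and LM head, as shown in the tree.
vocab, hidden, inter = 32_000, 1_024, 4_096
layers, heads, kv_heads, head_dim = 24, 16, 4, 64

embed = vocab * hidden                                   # token embeddings
attn = hidden * heads * head_dim * 2 \
     + hidden * kv_heads * head_dim * 2                  # Q/O and K/V projections
mlp = 3 * hidden * inter                                 # gate, up, down
norms = 2 * hidden                                       # two RMSNorms per layer
per_layer = attn + mlp + norms
lm_head = hidden * vocab
total = embed + layers * per_layer + hidden + lm_head    # + final RMSNorm

print(f"Per layer: {per_layer:,}")   # 15,206,400  (~15.2M)
print(f"Total:     {total:,}")       # ~430.49M
```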
---
## 🚀 Quick Start
### 📦 Installation
```bash
# Core dependencies
pip install "torch>=2.0.0" "transformers>=4.35.0" accelerate safetensors

# Optional: for maximum performance
pip install flash-attn --no-build-isolation  # Flash Attention 2
pip install xformers                         # Memory-efficient attention
pip install bitsandbytes                     # Quantization support
```
### 💻 Basic Usage
#### 1️⃣ Load Model
```python
from transformers import AutoModelForCausalLM, AutoConfig
import torch

# Load the configuration
config = AutoConfig.from_pretrained(
    "Lyon28/caca-250M-untrained",
    trust_remote_code=True
)
print(f"Model: {config.model_type}")
print(f"Parameters: 430,493,696")
print(f"Hidden size: {config.hidden_size}")
print(f"Layers: {config.num_hidden_layers}")

# Load the model
model = AutoModelForCausalLM.from_pretrained(
    "Lyon28/caca-250M-untrained",
    config=config,
    torch_dtype=torch.bfloat16,  # Use BF16 for efficiency
    device_map="auto",           # Automatically place weights on available devices
    trust_remote_code=True
)
print(f"Model loaded! Device: {model.device}")
```
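Optionally, you can ask for a specific attention backend at load time. The `attn_implementation` argument is part of recent `transformers` releases, but whether this custom (`trust_remote_code`) architecture honors it is an assumption, so the sketch below falls back to the default load if the request is rejected.
```python
import torch
from transformers import AutoModelForCausalLM

# Prefer Flash Attention 2 on CUDA, otherwise PyTorch SDPA.
attn = "flash_attention_2" if torch.cuda.is_available() else "sdpa"
try:
    model = AutoModelForCausalLM.from_pretrained(
        "Lyon28/caca-250M-untrained",
        torch_dtype=torch.bfloat16,
        attn_implementation=attn,
        device_map="auto",
        trust_remote_code=True,
    )
except (ValueError, ImportError):
    # The requested backend is unavailable or unsupported by this architecture.
    model = AutoModelForCausalLM.from_pretrained(
        "Lyon28/caca-250M-untrained",
        torch_dtype=torch.bfloat16,
        device_map="auto",
        trust_remote_code=True,
    )
```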
#### 2️⃣ Verify the Model
```python
# Count parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total parameters: {total_params:,}")
print(f"Trainable parameters: {trainable_params:,}")
print(f"Model size: {total_params * 2 / 1e9:.2f} GB (BF16)")

# Test a forward pass
batch_size, seq_len = 2, 10
input_ids = torch.randint(0, config.vocab_size, (batch_size, seq_len))
input_ids = input_ids.to(model.device)
with torch.no_grad():
    outputs = model(input_ids)
print(f"Output shape: {outputs.logits.shape}")
print("✅ Model works as expected!")
```
#### 3️⃣ Generate Text (After Training)
```python
from transformers import AutoTokenizer

# Load a tokenizer (use one that matches your training setup)
tokenizer = AutoTokenizer.from_pretrained("your-tokenizer-here")

# Prepare the input
text = "Jelaskan tentang kecerdasan buatan"
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Generate
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    temperature=0.7,
    top_p=0.9,
    top_k=50,
    do_sample=True,
    repetition_penalty=1.1,
    pad_token_id=tokenizer.eos_token_id
)

# Decode
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```
---
## 🔧 Training Guide
### 📚 Dataset Preparation
```python
from datasets import load_dataset

# Load a dataset (example)
dataset = load_dataset("indonesian-nlp/id-wikipedia")

# Or load from a local file
from datasets import Dataset
import pandas as pd

df = pd.read_csv("your_data.csv")
dataset = Dataset.from_pandas(df)

print(f"Dataset size: {len(dataset)}")
```
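The `Trainer` below expects tokenized examples rather than raw text, so a tokenization and packing step belongs between dataset loading and training. A minimal sketch, assuming the dataset exposes a `text` column and that a 2,048-token block size is acceptable (both are assumptions to adjust for your data):
```python
from itertools import chain

block_size = 2048  # assumed packing length; anything up to the 8,192 max context works

def tokenize_fn(batch):
    return tokenizer(batch["text"])

def group_texts(batch):
    # Concatenate all token ids, then split them into fixed-size blocks.
    # Labels are added later by DataCollatorForLanguageModeling.
    concat = {k: list(chain(*batch[k])) for k in batch}
    total = (len(concat["input_ids"]) // block_size) * block_size
    return {
        k: [v[i:i + block_size] for i in range(0, total, block_size)]
        for k, v in concat.items()
    }

tokenized = dataset.map(tokenize_fn, batched=True, remove_columns=["text"])
lm_dataset = tokenized.map(group_texts, batched=True)
```
Pass the resulting splits (for example `lm_dataset["train"]`) to the `Trainer` below in place of the raw dataset.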
### 🎯 Training Configuration
```python
from transformers import Trainer, TrainingArguments
from transformers import DataCollatorForLanguageModeling

# Training arguments
training_args = TrainingArguments(
    # Output
    output_dir="./caca-250M-trained",
    run_name="caca-250M-v1",
    # Training
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,  # Effective batch size = 32
    learning_rate=2e-4,
    weight_decay=0.1,
    warmup_steps=2000,
    # Optimization
    bf16=True,                      # Mixed precision training
    gradient_checkpointing=True,    # Save memory
    optim="adamw_torch_fused",      # Fused AdamW optimizer
    max_grad_norm=1.0,
    # Logging & evaluation
    logging_steps=10,
    logging_first_step=True,
    eval_strategy="steps",
    eval_steps=500,
    save_steps=1000,
    save_total_limit=3,
    # Hub integration
    push_to_hub=True,
    hub_model_id="your-username/caca-250M-trained",
    hub_strategy="every_save",
    # Distributed training
    ddp_find_unused_parameters=False,
    dataloader_num_workers=4,
)

# Data collator
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False  # Causal LM, not masked LM
)

# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    data_collator=data_collator,
)

# Train!
print("🚀 Starting training...")
trainer.train()

# Save the final model
print("💾 Saving model...")
trainer.save_model("./caca-250M-final")
trainer.push_to_hub()
print("✅ Training complete!")
```
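The card lists perplexity as its evaluation metric; it follows directly from the cross-entropy loss the `Trainer` reports, as in this short sketch:
```python
import math

# Perplexity is exp(mean cross-entropy loss) on the evaluation split.
eval_metrics = trainer.evaluate()
print(f"Eval loss:  {eval_metrics['eval_loss']:.4f}")
print(f"Perplexity: {math.exp(eval_metrics['eval_loss']):.2f}")
```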
### 📊 Resource Estimates
💰 Click to see training cost & time estimates

**Hardware requirements:**

| GPU | Memory | Batch Size | Speed | Est. Time (~2.5B tokens) |
|-----|--------|------------|-------|--------------------------|
| RTX 3090 (24GB) | 24GB | 1-2 | ~1K tok/s | ~30 days |
| A100 (40GB) | 40GB | 4-8 | ~5K tok/s | ~6 days |
| A100 (80GB) | 80GB | 8-16 | ~8K tok/s | ~4 days |
| 8×A100 (80GB) | 640GB | 64+ | ~50K tok/s | ~14 hours |

**Cloud costs (approximate):**
- AWS p4d.24xlarge (8×A100): ~$32/hour × 24 hours = **~$768/day**
- GCP a2-ultragpu-8g: ~$30/hour × 24 hours = **~$720/day**
- Lambda Labs (8×A100): ~$15/hour × 24 hours = **~$360/day**

**Tips for cutting costs** (a back-of-envelope memory estimate follows below):
- Use spot instances (60-70% cheaper)
- Use gradient accumulation for a larger effective batch size
- Use mixed precision (BF16) for up to 2x speedup
- Use gradient checkpointing to save memory
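A rough way to see why the GPUs in the table need so much memory: full training of the 430.49M parameters with AdamW stores weights, gradients, and two optimizer moments. The sketch below assumes BF16 weights/gradients and FP32 optimizer states and ignores activations, so real usage is noticeably higher.
```python
# Back-of-envelope training memory for full pre-training with AdamW.
params = 430_493_696
weights_bytes = params * 2        # BF16 weights
grads_bytes = params * 2          # BF16 gradients
optimizer_bytes = params * 4 * 2  # AdamW: two FP32 moment estimates
total_gb = (weights_bytes + grads_bytes + optimizer_bytes) / 1e9
print(f"~{total_gb:.1f} GB before activations and batch data")  # ~5.2 GB
```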
---
## 💬 Chat Format
The model supports a standard chat format:
```python
# Single-turn
messages = [
    {"role": "user", "content": "Halo! Siapa kamu?"},
]

# Multi-turn conversation
messages = [
    {"role": "system", "content": "Kamu adalah asisten AI yang membantu."},
    {"role": "user", "content": "Jelaskan tentang fotosintesis"},
    {"role": "assistant", "content": "Fotosintesis adalah proses..."},
    {"role": "user", "content": "Apa manfaatnya bagi manusia?"},
]

# Apply chat template
formatted = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
print(formatted)
# Output:
# System: Kamu adalah asisten AI yang membantu.
#
# User: Jelaskan tentang fotosintesis
# Assistant: Fotosintesis adalah proses...
# User: Apa manfaatnya bagi manusia?
# Assistant:
```
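Since the untrained checkpoint does not ship a tokenizer, `apply_chat_template` only works once you attach a template yourself. Below is a minimal Jinja template that reproduces the `System:`/`User:`/`Assistant:` layout shown above; it is an assumed example, not an official template for this model.
```python
# Hypothetical chat template matching the format printed above.
tokenizer.chat_template = (
    "{% for message in messages %}"
    "{% if message['role'] == 'system' %}System: {{ message['content'] }}\n\n"
    "{% elif message['role'] == 'user' %}User: {{ message['content'] }}\n"
    "{% elif message['role'] == 'assistant' %}Assistant: {{ message['content'] }}\n"
    "{% endif %}"
    "{% endfor %}"
    "{% if add_generation_prompt %}Assistant:{% endif %}"
)
```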
---
## 🎯 Use Cases
### ✅ Well Suited For:
- 🔬 **Research**: Experiments with modern LLM architectures
- 📚 **Education**: Learning about transformers & training
- 🎓 **Academia**: Papers, theses, course projects
- 🚀 **Base Model**: Fine-tuning for specific tasks (see the LoRA sketch at the end of this section)
- 💡 **Proof of Concept**: Testing ideas before scaling up
### ❌ Not Suited For:
- 🚫 **Production**: The model has not been trained
- 🚫 **Real-world apps**: Output is still random
- 🚫 **Safety-critical use**: No safety alignment has been done
- 🚫 **Direct deployment**: Training is required first
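For the "base model for fine-tuning" use case above, a hedged PEFT/LoRA sketch. The `target_modules` names are an assumption about this architecture's projection layers; inspect `model.named_modules()` and adjust before relying on it.
```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed module names
    task_type="CAUSAL_LM",
)

peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()  # only the adapter weights are trainable
```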
---
## 📖 Documentation
### 🔗 Important Links
- 📚 **Hugging Face Docs**: [huggingface.co/docs/transformers](https://huggingface.co/docs/transformers)
- 💻 **GitHub**: [Lyon-28/caca-transformers](https://github.com/Lyon-28/caca-transformers)
- 💬 **Discussions**: [Model discussions](https://huggingface.co/Lyon28/caca-250M-untrained/discussions)
- 🐛 **Issues**: [Report bugs](https://huggingface.co/Lyon28/caca-250M-untrained/discussions)
### 📝 Related Models
| Model Size | Parameters | Link |
|------------|------------|------|
| 🐣 Tiny | 1M - 50M | [caca-1M](../caca-1M-untrained) to [caca-50M](../caca-50M-untrained) |
| 🐥 Small | 75M - 500M | [caca-75M](../caca-75M-untrained) to [caca-500M](../caca-500M-untrained) |
| 🦅 Medium | 600M - 1B | [caca-600M](../caca-600M-untrained) to [caca-1B](../caca-1B-untrained) |
| 🦁 Large | 1.5B - 5B | [caca-1.5B](../caca-1.5B-untrained) to [caca-5B](../caca-5B-untrained) |
| 🐉 XL | 6B - 10B | [caca-6B](../caca-6B-untrained) to [caca-10B](../caca-10B-untrained) |
| 🦖 XXL | 12B+ | [caca-12B](../caca-12B-untrained) to [caca-70B](../caca-70B-untrained) |
---
## 🤝 Contributing
Contributions are very welcome! A few ways to contribute:
- 🐛 **Report bugs**: Found a bug? [Open an issue](https://huggingface.co/Lyon28/caca-250M-untrained/discussions)
- 💡 **Suggest features**: Have an idea? Share it in the discussions
- 📝 **Improve docs**: PRs for documentation are welcome
- 🎓 **Share results**: Trained the model? Share your results on the model card
- ⭐ **Star & Share**: Help this project grow
---
## 📜 License & Citation
### 📄 License
This model is released under the **Apache License 2.0**:
- ✅ Free for commercial use
- ✅ Free for research use
- ✅ Modification & redistribution allowed
- ✅ Provided without warranty
### 📚 Citation
If you use this model in your research or project, please cite:
```bibtex
@misc{caca250m2025,
  author = {Lyon},
  title = {Caca-250M: Modern Transformer Architecture with GQA and Advanced Features},
  year = {2025},
  publisher = {Hugging Face},
  journal = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/Lyon28/caca-250M-untrained}},
}
```
### 🙏 Acknowledgments
This model is inspired by, and implements ideas from, a range of recent research:
#### 🏗️ **Core Architecture**
- **LLaMA** (Meta AI, 2023) - Base decoder-only architecture, RMSNorm, SwiGLU
- Paper: [LLaMA: Open and Efficient Foundation Language Models](https://arxiv.org/abs/2302.13971)
- **GPT-3** (OpenAI, 2020) - Transformer language modeling paradigm
- **PaLM** (Google, 2022) - SwiGLU activation function
#### 🎯 **Attention Mechanisms**
- **Flash Attention v2** (Tri Dao et al., 2023) - Efficient attention with IO-awareness
- Paper: [FlashAttention-2: Faster Attention with Better Parallelism](https://arxiv.org/abs/2307.08691)
- **Grouped Query Attention (GQA)** (Ainslie et al., Google, 2023) - Memory-efficient attention
- Paper: [GQA: Training Generalized Multi-Query Transformer](https://arxiv.org/abs/2305.13245)
- **Multi-Query Attention (MQA)** (Shazeer, Google, 2019) - Fast decoding
- **xFormers** (Meta AI, 2022) - Memory efficient attention implementations
- **PyTorch SDPA** (PyTorch Team, 2023) - Built-in scaled dot product attention
#### 📍 **Position Encodings**
- **RoPE** (Su et al., 2021) - Rotary Position Embeddings
- Paper: [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/abs/2104.09864)
- **ALiBi** (Press et al., 2022) - Attention with Linear Biases for extrapolation
- Paper: [Train Short, Test Long: Attention with Linear Biases](https://arxiv.org/abs/2108.12409)
- **YaRN** (Peng et al., 2023) - Yet another RoPE extensioN for long context
- Paper: [YaRN: Efficient Context Window Extension](https://arxiv.org/abs/2309.00071)
#### 🪟 **Long Context & Efficiency**
- **Sliding Window Attention** (Mistral AI, 2023) - Local attention patterns
- Paper: [Mistral 7B](https://arxiv.org/abs/2310.06825)
- **StreamingLLM / Attention Sink** (Xiao et al., MIT, 2023) - Infinite sequence lengths
- Paper: [Efficient Streaming Language Models with Attention Sinks](https://arxiv.org/abs/2309.17453)
- **Logit Softcapping** (Google Gemma, 2024) - Prevent attention overflow
- Paper: [Gemma: Open Models Based on Gemini](https://arxiv.org/abs/2403.08295)
#### 🧠 **Mixture of Experts (MoE)**
- **Mixtral 8x7B** (Mistral AI, 2024) - Sparse MoE architecture
- Paper: [Mixtral of Experts](https://arxiv.org/abs/2401.04088)
- **Switch Transformers** (Fedus et al., Google, 2021) - Scaling with expert choice
- Paper: [Switch Transformers: Scaling to Trillion Parameter Models](https://arxiv.org/abs/2101.03961)
- **GLaM** (Du et al., Google, 2021) - Generalist Language Model with MoE
- **Expert Choice Routing** (Zhou et al., Google, 2022) - Improved load balancing
#### 🎓 **Training Optimizations**
- **Layer Scale** (Touvron et al., Meta, 2021) - Training stability for deep networks
- Paper: [Going Deeper with Image Transformers (CaiT)](https://arxiv.org/abs/2103.17239)
- **Stochastic Depth** (Huang et al., 2016) - Regularization via random layer dropping
- Paper: [Deep Networks with Stochastic Depth](https://arxiv.org/abs/1603.09382)
- **Mixture of Depths (MoD)** (Raposo et al., Google DeepMind, 2024) - Dynamic compute allocation
- Paper: [Mixture-of-Depths: Dynamically allocating compute in transformer-based models](https://arxiv.org/abs/2404.02258)
- **Gradient Checkpointing** (Chen et al., 2016) - Memory-efficient training
#### 📦 **Quantization**
- **LLM.int8()** (Dettmers et al., 2022) - 8-bit matrix multiplication
- Paper: [LLM.int8(): 8-bit Matrix Multiplication for Transformers](https://arxiv.org/abs/2208.07339)
- **QLoRA** (Dettmers et al., 2023) - 4-bit quantized LoRA fine-tuning
- Paper: [QLoRA: Efficient Finetuning of Quantized LLMs](https://arxiv.org/abs/2305.14314)
- **GPTQ** (Frantar et al., 2022) - Post-training quantization
- **bitsandbytes** (Dettmers) - Efficient quantization library
#### 🎨 **Multimodal Components**
- **Vision Transformer (ViT)** (Dosovitskiy et al., Google, 2020) - Image encoding
- Paper: [An Image is Worth 16x16 Words](https://arxiv.org/abs/2010.11929)
- **Perceiver Resampler** (Alayrac et al., DeepMind, 2022) - Multimodal fusion
- Paper: [Flamingo: a Visual Language Model](https://arxiv.org/abs/2204.14198)
- **Q-Former** (Li et al., Salesforce, 2023) - Query-based multimodal alignment
- Paper: [BLIP-2: Bootstrapping Language-Image Pre-training](https://arxiv.org/abs/2301.12597)
- **Whisper** (Radford et al., OpenAI, 2022) - Audio encoding inspiration
#### 🛠️ **Normalization & Activations**
- **RMSNorm** (Zhang & Sennrich, 2019) - Root Mean Square Layer Normalization
- Paper: [Root Mean Square Layer Normalization](https://arxiv.org/abs/1910.07467)
- **SwiGLU** (Shazeer, Google, 2020) - GLU activation variant
- Paper: [GLU Variants Improve Transformer](https://arxiv.org/abs/2002.05202)
#### 🔧 **Implementation & Tools**
- **Hugging Face Transformers** - Model implementation framework
- **PyTorch** - Deep learning framework
- **Safetensors** - Secure tensor serialization format
- **Accelerate** - Distributed training utilities
---
**Special Thanks to:**
- 🇮🇩 Indonesian NLP Community
- 🤗 Hugging Face Team
- 🔬 Open source AI research community
## ⚠️ Limitations & Bias
### Limitations
- 🔴 **Untrained**: The model has not been trained; output is random
- 🟡 **No Tokenizer**: You need to prepare your own tokenizer
- 🟡 **No Safety**: No content filtering or alignment has been applied
- 🟠 **Memory Intensive**: Training requires large GPUs
### Potential Biases
The model will inherit biases from whatever training data is used. Please keep in mind:
- **Language**: Bias toward the majority language in the dataset
- **Culture**: Bias toward particular cultural perspectives
- **Gender & demographics**: Potential stereotypes
- **Factuality**: It can generate inaccurate information
**Recommendation**: Evaluate and filter outputs before any deployment.
---
## 📞 Support & Contact
### 💬 Community
- **Discussions**: [HF Discussions](https://huggingface.co/Lyon28/caca-250M-untrained/discussions)
### 📧 Contact
For questions or collaboration:
- Email: cacatransformers@gmail.com
- HF Profile: [@Lyon28](https://huggingface.co/Lyon28)
---
## 🌟 Star History
[Star History Chart](https://star-history.com/#Lyon-28/caca-transformers&Date)
---
### 💝 Made with ❤️ for the Indonesian AI community
**Thank you for using Caca!**
If this project has been useful, consider:
- ⭐ Starring this repository
- 🔗 Sharing it with friends
- 💬 Joining the discussions
- 🤝 Contributing to the project
---
### A Quote from Caca