---
license: apache-2.0
language:
- id
- en
tags:
- text-generation
- pytorch
- causal-lm
- transformer
- untrained
- gqa
- rope
- swiglu
- rmsnorm
- flash-attention
- indonesian
library_name: transformers
pipeline_tag: text-generation
widget:
- text: "Jakarta adalah ibu kota"
example_title: "๐Ÿ‡ฎ๐Ÿ‡ฉ Text Completion (ID)"
- text: |
Pertanyaan: Apa itu kecerdasan buatan?
Jawaban:
example_title: "๐Ÿ‡ฎ๐Ÿ‡ฉ Question Answering (ID)"
- text: |
Tulis cerita pendek tentang robot yang belajar mencintai.
example_title: "๐Ÿ‡ฎ๐Ÿ‡ฉ Creative Writing (ID)"
- text: "The capital of Indonesia is"
example_title: "๐Ÿ‡ฌ๐Ÿ‡ง Text Completion (EN)"
- text: |
Question: What is artificial intelligence?
Answer:
example_title: "๐Ÿ‡ฌ๐Ÿ‡ง Question Answering (EN)"
- text: |
def fibonacci(n):
"""Hitung bilangan fibonacci ke-n"""
example_title: "๐Ÿ’ป Code Completion"
- text: |
def reverse_string(s):
example_title: "๐Ÿ’ป Code Generation"
- text: |
User: Halo! Siapa kamu?
Assistant:
example_title: "๐Ÿ’ฌ Chat Format (ID)"
- text: |
User: Jelaskan tentang machine learning dalam 2 kalimat.
Assistant:
example_title: "๐Ÿ’ฌ Conversational (ID)"
inference:
parameters:
max_new_tokens: 100
temperature: 0.7
top_p: 0.9
top_k: 50
do_sample: true
repetition_penalty: 1.1
num_beams: 1
datasets: []
metrics:
- perplexity
model-index:
- name: caca-250M
results: []
---
<div align="center">
<img src="https://i.postimg.cc/MTSj073X/logo.png" width="400" alt="caca-250M"/>
# ๐Ÿš€ CACA-250M
### A Modern Transformer Model with an Advanced Architecture
[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![PyTorch](https://img.shields.io/badge/PyTorch-2.0+-red.svg)](https://pytorch.org/)
[![Transformers](https://img.shields.io/badge/๐Ÿค—%20Transformers-4.35+-yellow.svg)](https://github.com/huggingface/transformers)
**430,493,696** parameters โ€ข **430.49M** โ€ข **24 layers**
[๐Ÿ“– Documentation](#documentation) โ€ข [๐Ÿš€ Quick Start](#quick-start) โ€ข [๐Ÿ’ก Features](#key-features) โ€ข [๐Ÿ”ง Training](#training-guide) โ€ข [๐Ÿ“Š Specifications](#technical-specifications)
---
</div>
## โš ๏ธ PENTING: Model Belum Dilatih (Untrained)
> **PERHATIAN**: Ini adalah model yang **belum melalui proses training**. Bobot model masih dalam kondisi random initialization. Output yang dihasilkan akan **tidak bermakna dan acak**.
**Status Model:**
- ๐Ÿ”ด **Belum dilatih** - Bobot masih random
- ๐ŸŸก **Hanya untuk riset** - Eksperimen arsitektur & training
- ๐ŸŸข **Siap dilatih** - Arsitektur sudah teruji
Widget di atas hanya menunjukkan **format input yang diharapkan**. Setelah model dilatih dengan dataset yang tepat, format yang sama akan menghasilkan output berkualitas.
---
## ๐Ÿ“‹ Description
**Caca** is a modern Large Language Model (LLM) architecture that combines several state-of-the-art deep-learning techniques. It is designed with a focus on **efficiency**, **scalability**, and **high performance**.
<blockquote style="border-left: 4px solid #4A90E2; padding-left: 16px; margin: 16px 0; color: #555;">
<p><strong>Caca</strong> is an open-source Indonesian LLM experiment, built from scratch by one person, step by step. It is not trying to compete with anyone; it is simply an exploration of what can be done with a limited budget, unlimited passion, and a collaborative mindset. If it turns out to be useful to others, wonderful. If not, it is still fun.</p>
<p>This is an exploratory project, so if it fails, that is part of the learning process. If it succeeds, that is a bonus.</p>
</blockquote>
### ๐ŸŽฏ Key Advantages
- **๐Ÿ‡ฎ๐Ÿ‡ฉ Bilingual Support**: optimized for Indonesian & English
- **โšก Ultra Fast**: Flash Attention 2 for up to 3x faster inference
- **๐Ÿ’พ Memory Efficient**: Grouped Query Attention cuts the KV cache by 75%
- **๐ŸŽฏ Long Context**: supports up to 8,192 tokens
- **๐Ÿ”ง Modular**: flexible architecture with many configuration options
---
## โœจ Key Features
### ๐ŸŽฏ Core Features
- โœ… **Grouped Query Attention (GQA)** - superior memory and compute efficiency (see the sketch after this list)
  - Query heads: 16
  - KV heads: 4
  - Ratio: 4:1 (saves 75% of the KV cache)
- โœ… **Rotary Position Embeddings (RoPE)** - better generalization to long contexts
  - Theta: 10000
  - Supports extrapolation beyond the training length
- โœ… **RMSNorm** - more stable normalization, roughly 50% faster than LayerNorm
  - Epsilon: 1e-06
- โœ… **SwiGLU Activation** - typically 10-15% better than ReLU/GELU
  - Intermediate size: 4,096
- โœ… **Flash Attention 2** - up to 3x speedup with better memory efficiency
  - Enabled automatically when CUDA is available
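To make the 75% figure concrete, here is a quick back-of-the-envelope sketch using the dimensions from the specification table below and assuming a BF16 KV cache (the byte count is an illustrative estimate, not a benchmark):
```python
# Rough KV-cache size comparison at the full 8,192-token context:
# standard multi-head attention (16 KV heads) vs. this model's GQA setup (4 KV heads).
num_layers, head_dim, seq_len, bytes_per_val = 24, 64, 8192, 2  # BF16 = 2 bytes/value

def kv_cache_bytes(num_kv_heads: int) -> int:
    # Factor of 2 accounts for both keys and values
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_val

mha = kv_cache_bytes(16)  # full multi-head attention
gqa = kv_cache_bytes(4)   # grouped query attention (16Q : 4KV)
print(f"MHA KV cache: {mha / 1e6:.0f} MB, GQA KV cache: {gqa / 1e6:.0f} MB")
print(f"Savings: {1 - gqa / mha:.0%}")  # -> 75%
```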
### ๐Ÿ”ฅ Advanced Features
### ๐ŸŽฏ Attention Mechanisms
- โšก **Flash Attention v2** - 3x faster with IO-aware algorithm
- ๐Ÿ”‘ **Grouped Query Attention (GQA)** - 16Q : 4KV heads
- ๐Ÿš€ **xFormers Support** - Memory efficient attention fallback
- ๐ŸŽฏ **PyTorch SDPA** - Native scaled dot product attention
### ๐Ÿ“ Position Encodings
- ๐Ÿ”„ **RoPE** - Rotary embeddings (ฮธ=10000)
### ๐ŸŽ“ Training Optimizations
- ๐Ÿ’พ **Gradient Checkpointing** - Memory efficient training
- ๐ŸŽฏ **Mixed Precision** - BF16 & FP16 support
### ๐Ÿ“ฆ Quantization Support
- 4๏ธโƒฃ **4-bit Quantization** - NF4, FP4 via bitsandbytes
- 8๏ธโƒฃ **8-bit Quantization** - LLM.int8() support
- ๐Ÿ”„ **Double Quantization** - Further compression
### ๐Ÿ› ๏ธ Optimization Features
- ๐Ÿ’พ **KV Cache** - 5-10x faster autoregressive generation
- ๐Ÿ”ง **Gradient Checkpointing** - train large models within limited memory
- ๐Ÿ“ฆ **Quantization Ready** - 4-bit & 8-bit quantization support (see the loading sketch below)
- ๐ŸŽฏ **Mixed Precision Training** - BF16 & FP16 support
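The quantization support listed above can be exercised through `bitsandbytes`. A minimal 4-bit loading sketch, assuming `bitsandbytes` is installed and the model's remote code is compatible with quantized loading:
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 4-bit quantization with double quantization, matching the features listed above
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "Lyon28/caca-250M-untrained",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
```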
---
## ๐Ÿ“Š Technical Specifications
<div align="center">
| Specification | Detail |
|-------------|--------|
| **๐Ÿ’Ž Total Parameters** | **430,493,696** (430.49M) |
| **๐Ÿ“ Hidden Size** | 1,024 |
| **๐Ÿ”ข Intermediate Size** | 4,096 |
| **๐Ÿ—๏ธ Num Layers** | 24 |
| **๐ŸŽฏ Attention Heads** | 16 |
| **๐Ÿ”‘ KV Heads** | 4 (GQA) |
| **๐Ÿ“ Head Dimension** | 64 |
| **๐Ÿ“š Vocab Size** | 32,000 tokens |
| **๐Ÿ“– Max Context** | 8,192 tokens |
| **๐Ÿ›๏ธ Architecture** | Decoder-only Transformer |
| **๐ŸŽจ Model Type** | Causal Language Model |
</div>
### ๐Ÿ“ Arsitektur Detail
<details>
<summary><b>๐Ÿ” Klik untuk lihat struktur lengkap</b></summary>
```
CacaForCausalLM (430.49M)
โ”‚
โ”œโ”€ Embedding Layer
โ”‚ โ””โ”€ Token Embeddings: 32,000 ร— 1024
โ”‚ โ””โ”€ Parameters: 32,768,000
โ”‚
โ”œโ”€ Transformer Layers (24x)
โ”‚ โ”‚
โ”‚ โ”œโ”€ Layer {i} (repeated 24 times)
โ”‚ โ”‚ โ”‚
โ”‚ โ”‚ โ”œโ”€ Input LayerNorm (RMSNorm)
โ”‚ โ”‚ โ”‚ โ””โ”€ Params: 1,024
โ”‚ โ”‚ โ”‚
โ”‚ โ”‚ โ”œโ”€ Self-Attention (Grouped Query Attention)
โ”‚ โ”‚ โ”‚ โ”œโ”€ Q Projection: 1,024 โ†’ 1,024
โ”‚ โ”‚ โ”‚ โ”œโ”€ K Projection: 1,024 โ†’ 256
โ”‚ โ”‚ โ”‚ โ”œโ”€ V Projection: 1,024 โ†’ 256
โ”‚ โ”‚ โ”‚ โ”œโ”€ O Projection: 1,024 โ†’ 1,024
โ”‚ โ”‚ โ”‚ โ”œโ”€ RoPE Embeddings: ฮธ=10000
โ”‚ โ”‚ โ”‚ โ””โ”€ Flash Attention 2 (if available)
โ”‚ โ”‚ โ”‚
โ”‚ โ”‚ โ”œโ”€ Post-Attention LayerNorm (RMSNorm)
โ”‚ โ”‚ โ”‚ โ””โ”€ Params: 1,024
โ”‚ โ”‚ โ”‚
โ”‚ โ”‚ โ”œโ”€ MLP (SwiGLU)
โ”‚ โ”‚ โ”‚ โ”œโ”€ Gate: 1,024 โ†’ 4,096
โ”‚ โ”‚ โ”‚ โ”œโ”€ Up: 1,024 โ†’ 4,096
โ”‚ โ”‚ โ”‚ โ”œโ”€ Activation: SiLU (Swish)
โ”‚ โ”‚ โ”‚ โ””โ”€ Down: 4,096 โ†’ 1,024
โ”‚ โ”‚ โ”‚
โ”‚ โ”‚ โ””โ”€ Residual Connections (2x per layer)
โ”‚ โ”‚
โ”‚ โ””โ”€ Total Layer Params: ~15.2M per layer
โ”‚
โ”œโ”€ Final LayerNorm (RMSNorm)
โ”‚ โ””โ”€ Params: 1,024
โ”‚
โ””โ”€ LM Head (Output Projection)
โ””โ”€ Linear: 1,024 โ†’ 32,000
โ””โ”€ Parameters: 32,768,000
```
**Parameter breakdown:**
- Embeddings: `32,000 ร— 1,024 = 32,768,000`
- LM head: `1,024 ร— 32,000 = 32,768,000`
- Transformer layers: `24 layers ร— ~15.2M โ‰ˆ ~365M`
- **Total: 430,493,696 parameters**
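The same breakdown can be recomputed from the configuration values alone. A small analytical sketch (it lands within a few thousand parameters of the reported total; the remainder comes from implementation details not shown in the tree):
```python
# Analytical parameter count from the architecture above (untied embeddings assumed)
hidden, inter, vocab, layers, kv_heads, heads = 1024, 4096, 32000, 24, 4, 16
head_dim = hidden // heads      # 64
kv_dim = kv_heads * head_dim    # 256

attn = hidden * hidden * 2 + hidden * kv_dim * 2  # Q and O + K and V projections
mlp = hidden * inter * 3                          # gate, up, down
norms = 2 * hidden                                # two RMSNorms per layer
per_layer = attn + mlp + norms

total = vocab * hidden * 2 + layers * per_layer + hidden  # embeddings + LM head + layers + final norm
print(f"Per layer: {per_layer:,}")  # ~15.2M
print(f"Total: {total:,}")          # ~430.5M
```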
</details>
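If you prefer code to tree diagrams, here is a standalone PyTorch sketch of the SwiGLU MLP block described above (the class name and standalone form are illustrative, not the actual `CacaForCausalLM` implementation):
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUMLP(nn.Module):
    """Gate/Up/Down MLP with SiLU gating, matching the 1024 -> 4096 -> 1024 shape above."""
    def __init__(self, hidden_size: int = 1024, intermediate_size: int = 4096):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU: down( SiLU(gate(x)) * up(x) )
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

mlp = SwiGLUMLP()
print(mlp(torch.randn(2, 10, 1024)).shape)  # torch.Size([2, 10, 1024])
```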
---
## ๐Ÿš€ Quick Start
### ๐Ÿ“ฆ Installation
```bash
# Core dependencies
pip install torch>=2.0.0 transformers>=4.35.0 accelerate safetensors
# Optional: for maximum performance
pip install flash-attn --no-build-isolation # Flash Attention 2
pip install xformers # Memory efficient attention
pip install bitsandbytes # Quantization support
```
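Since the accelerators above are optional, it can help to confirm what is actually visible in your environment before loading the model. A small convenience check (nothing here is required by the model):
```python
import importlib.util

import torch

# Quick sanity check for CUDA and the optional acceleration libraries
print("CUDA available:", torch.cuda.is_available())
for pkg in ("flash_attn", "xformers", "bitsandbytes"):
    print(f"{pkg} installed:", importlib.util.find_spec(pkg) is not None)
```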
### ๐Ÿ’ป Basic Usage
#### 1๏ธโƒฃ Load Model
```python
from transformers import AutoModelForCausalLM, AutoConfig
import torch
# Load configuration
config = AutoConfig.from_pretrained(
"Lyon28/caca-250M-untrained",
trust_remote_code=True
)
print(f"Model: {config.model_type}")
print(f"Parameters: 430,493,696")
print(f"Hidden size: {config.hidden_size}")
print(f"Layers: {config.num_hidden_layers}")
# Load model
model = AutoModelForCausalLM.from_pretrained(
"Lyon28/caca-250M-untrained",
config=config,
    torch_dtype=torch.bfloat16,  # use BF16 for efficiency
    device_map="auto",           # distribute across available GPUs automatically
trust_remote_code=True
)
print(f"Model loaded! Device: {model.device}")
```
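Flash Attention 2 is picked up automatically when available. If you want to request a specific backend explicitly, the standard `attn_implementation` argument of `from_pretrained` can be used, assuming the custom modeling code honours it (a sketch, not a guarantee for this checkpoint):
```python
import torch
from transformers import AutoModelForCausalLM

# Explicitly request Flash Attention 2; use "sdpa" for PyTorch's built-in
# scaled dot product attention if flash-attn is not installed.
model = AutoModelForCausalLM.from_pretrained(
    "Lyon28/caca-250M-untrained",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # or "sdpa"
    device_map="auto",
    trust_remote_code=True,
)
```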
#### 2๏ธโƒฃ Verifikasi Model
```python
# Count total parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total parameters: {total_params:,}")
print(f"Trainable parameters: {trainable_params:,}")
print(f"Model size: {total_params * 2 / 1e9:.2f} GB (BF16)")
# Test forward pass
batch_size, seq_len = 2, 10
input_ids = torch.randint(0, config.vocab_size, (batch_size, seq_len))
input_ids = input_ids.to(model.device)
with torch.no_grad():
outputs = model(input_ids)
print(f"Output shape: {outputs.logits.shape}")
print("โœ… Model berfungsi dengan baik!")
```
#### 3๏ธโƒฃ Generate Text (Setelah Training)
```python
from transformers import AutoTokenizer
# Load a tokenizer (use one that matches your training setup)
tokenizer = AutoTokenizer.from_pretrained("your-tokenizer-here")
# Prepare input
text = "Jelaskan tentang kecerdasan buatan"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
# Generate
outputs = model.generate(
**inputs,
max_new_tokens=100,
temperature=0.7,
top_p=0.9,
top_k=50,
do_sample=True,
repetition_penalty=1.1,
pad_token_id=tokenizer.eos_token_id
)
# Decode
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```
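For interactive use it is often nicer to stream tokens as they are produced. A small variation that reuses `model`, `tokenizer`, and `inputs` from the block above (the sampling parameters mirror the call above):
```python
from transformers import TextStreamer

# Stream decoded tokens to stdout as they are generated instead of waiting for the full output
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    streamer=streamer,
    pad_token_id=tokenizer.eos_token_id,
)
```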
---
## ๐Ÿ”ง Training Guide
### ๐Ÿ“š Dataset Preparation
```python
from datasets import load_dataset
# Load a dataset (example)
dataset = load_dataset("indonesian-nlp/id-wikipedia")
# Or load from a local file
from datasets import Dataset
import pandas as pd
df = pd.read_csv("your_data.csv")
dataset = Dataset.from_pandas(df)
print(f"Dataset size: {len(dataset)}")
```
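The `Trainer` below expects a tokenized dataset with `train` and `validation` splits. A minimal sketch, assuming `dataset` is a single split with a `text` column and that you reuse the (hypothetical) tokenizer from the generation example; the 2,048-token cap is an arbitrary memory-friendly choice, not a model limit:
```python
def tokenize(batch):
    # Truncate well below the 8,192-token context to keep memory in check;
    # DataCollatorForLanguageModeling adds labels and padding later.
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)
split = tokenized.train_test_split(test_size=0.01)
dataset = {"train": split["train"], "validation": split["test"]}
```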
### ๐ŸŽฏ Training Configuration
```python
from transformers import Trainer, TrainingArguments
from transformers import DataCollatorForLanguageModeling
# Training arguments
training_args = TrainingArguments(
# Output
output_dir="./caca-caca-250M-trained",
run_name="caca-caca-250M-v1",
# Training
num_train_epochs=3,
per_device_train_batch_size=4,
gradient_accumulation_steps=8, # Effective batch size = 32
learning_rate=2e-4,
weight_decay=0.1,
warmup_steps=2000,
# Optimization
bf16=True, # Mixed precision training
    gradient_checkpointing=True,     # save memory
    optim="adamw_torch_fused",       # fused AdamW (fast on recent NVIDIA GPUs)
max_grad_norm=1.0,
# Logging & Evaluation
logging_steps=10,
logging_first_step=True,
eval_strategy="steps",
eval_steps=500,
save_steps=1000,
save_total_limit=3,
# Hub integration
push_to_hub=True,
hub_model_id="your-username/caca-caca-250M-trained",
hub_strategy="every_save",
# Distributed training
ddp_find_unused_parameters=False,
dataloader_num_workers=4,
)
# Data collator
data_collator = DataCollatorForLanguageModeling(
tokenizer=tokenizer,
    mlm=False  # causal LM, not masked LM
)
# Trainer
trainer = Trainer(
model=model,
args=training_args,
train_dataset=dataset["train"],
eval_dataset=dataset["validation"],
data_collator=data_collator,
)
# Train!
print("๐Ÿš€ Starting training...")
trainer.train()
# Save final model
print("๐Ÿ’พ Saving model...")
trainer.save_model("./caca-250M-final")
trainer.push_to_hub()
print("โœ… Training complete!")
```
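Perplexity (the metric listed in the card header) follows directly from the evaluation loss. A short sketch using the `trainer` defined above:
```python
import math

# Cross-entropy loss on the validation split -> perplexity = exp(loss)
metrics = trainer.evaluate()
eval_loss = metrics["eval_loss"]
print(f"Eval loss: {eval_loss:.4f}")
print(f"Perplexity: {math.exp(eval_loss):.2f}")
```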
### ๐Ÿ“Š Resource Estimates
<details>
<summary><b>๐Ÿ’ฐ Click to see estimated training cost & time</b></summary>
**Hardware Requirements:**
| GPU | Memory | Batch Size | Speed | Est. Time (100B tokens) |
|-----|--------|------------|-------|-------------------------|
| RTX 3090 (24GB) | 24GB | 1-2 | ~1K tok/s | ~30 days |
| A100 (40GB) | 40GB | 4-8 | ~5K tok/s | ~6 days |
| A100 (80GB) | 80GB | 8-16 | ~8K tok/s | ~4 days |
| 8ร—A100 (80GB) | 640GB | 64+ | ~50K tok/s | ~14 hours |
**Cloud Costs (approximate):**
- AWS p4d.24xlarge (8ร—A100): ~$32/hour ร— 24 hours = **~$768/day**
- GCP a2-ultragpu-8g: ~$30/hour ร— 24 hours = **~$720/day**
- Lambda Labs (8ร—A100): ~$15/hour ร— 24 hours = **~$360/day**
**Cost-saving tips:**
- Use spot instances (60-70% cheaper)
- Use gradient accumulation for a larger effective batch size
- Use mixed precision (BF16) for up to 2x speedup
- Use gradient checkpointing to save memory
</details>
---
## ๐Ÿ’ฌ Format Chat
The model supports a standard chat message format:
```python
# Single-turn
messages = [
{"role": "user", "content": "Halo! Siapa kamu?"},
]
# Multi-turn conversation
messages = [
{"role": "system", "content": "Kamu adalah asisten AI yang membantu."},
{"role": "user", "content": "Jelaskan tentang fotosintesis"},
{"role": "assistant", "content": "Fotosintesis adalah proses..."},
{"role": "user", "content": "Apa manfaatnya bagi manusia?"},
]
# Apply chat template
formatted = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True
)
print(formatted)
# Output:
# System: Kamu adalah asisten AI yang membantu.
#
# User: Jelaskan tentang fotosintesis
# Assistant: Fotosintesis adalah proses...
# User: Apa manfaatnya bagi manusia?
# Assistant:
```
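No tokenizer (and therefore no chat template) ships with this checkpoint, so `apply_chat_template` only works once you attach one. A minimal Jinja template that reproduces the `System:/User:/Assistant:` layout shown in the comments above (the template string itself is an illustrative assumption, not the official format):
```python
# Attach a simple chat template to your tokenizer (Jinja2 syntax)
tokenizer.chat_template = (
    "{% for message in messages %}"
    "{% if message['role'] == 'system' %}System: {{ message['content'] }}\n\n"
    "{% elif message['role'] == 'user' %}User: {{ message['content'] }}\n"
    "{% elif message['role'] == 'assistant' %}Assistant: {{ message['content'] }}\n"
    "{% endif %}{% endfor %}"
    "{% if add_generation_prompt %}Assistant:{% endif %}"
)
```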
---
## ๐ŸŽฏ Use Cases
### โœ… Good For:
- ๐Ÿ”ฌ **Research**: experiments with modern LLM architectures
- ๐Ÿ“š **Education**: learning about transformers & training
- ๐ŸŽ“ **Academic work**: papers, theses, course projects
- ๐Ÿš€ **Base model**: fine-tuning for specific tasks
- ๐Ÿ’ก **Proof of concept**: testing ideas before scaling up
### โŒ Not Suitable For:
- ๐Ÿšซ **Production**: the model has not been trained
- ๐Ÿšซ **Real-world apps**: output is still random
- ๐Ÿšซ **Safety-critical use**: no safety alignment yet
- ๐Ÿšซ **Direct deployment**: it needs training first
---
## ๐Ÿ“– Documentation
### ๐Ÿ”— Important Links
- ๐Ÿ“š **Hugging Face Docs**: [huggingface.co/docs/transformers](https://huggingface.co/docs/transformers)
- ๐Ÿ’ป **GitHub**: [Lyon-28/caca-transformers](https://github.com/Lyon-28/caca-transformers)
- ๐Ÿ’ฌ **Discussions**: [Model discussions](https://huggingface.co/Lyon28/caca-250M-untrained/discussions)
- ๐Ÿ› **Issues**: [Report bugs](https://huggingface.co/Lyon28/caca-250M-untrained/discussions)
### ๐Ÿ“ Related Models
<div align="center">
| Model Size | Parameters | Link |
|------------|------------|------|
| ๐Ÿฃ Tiny | 1M - 50M | [caca-1M](../caca-1M-untrained) to [caca-50M](../caca-50M-untrained) |
| ๐Ÿฅ Small | 75M - 500M | [caca-75M](../caca-75M-untrained) to [caca-500M](../caca-500M-untrained) |
| ๐Ÿฆ… Medium | 600M - 1B | [caca-600M](../caca-600M-untrained) to [caca-1B](../caca-1B-untrained) |
| ๐Ÿฆ Large | 1.5B - 5B | [caca-1.5B](../caca-1.5B-untrained) to [caca-5B](../caca-5B-untrained) |
| ๐Ÿ‰ XL | 6B - 10B | [caca-6B](../caca-6B-untrained) to [caca-10B](../caca-10B-untrained) |
| ๐Ÿฆ– XXL | 12B+ | [caca-12B](../caca-12B-untrained) to [caca-70B](../caca-70B-untrained) |
</div>
---
## ๐Ÿค Contributing
Contributions are very welcome! Some ways to help:
- ๐Ÿ› **Report bugs**: Found a bug? [Open an issue](https://huggingface.co/Lyon28/caca-250M-untrained/discussions)
- ๐Ÿ’ก **Suggest features**: Have an idea? Share it in the discussions
- ๐Ÿ“ **Improve docs**: documentation PRs are welcome
- ๐ŸŽ“ **Share results**: Trained the model? Share your results on the model card
- โญ **Star & share**: help this project grow
---
## ๐Ÿ“œ License & Citation
### ๐Ÿ“„ License
This model is released under the **Apache License 2.0**:
- โœ… Free for commercial use
- โœ… Free for research use
- โœ… Modification & redistribution allowed
- โœ… Provided without warranty
### ๐Ÿ“š Citation
If you use this model in your research or a project, please cite:
```bibtex
@misc{caca250M2025,
author = {Lyon},
  title = {Caca-250M: Modern Transformer Architecture with GQA and Advanced Features},
year = {2025},
publisher = {Hugging Face},
journal = {Hugging Face Model Hub},
howpublished = {\url{https://huggingface.co/Lyon28/caca-250M-untrained}},
}
```
### ๐Ÿ™ Acknowledgments
This model is inspired by and implements a range of recent research:
#### ๐Ÿ—๏ธ **Core Architecture**
- **LLaMA** (Meta AI, 2023) - Base decoder-only architecture, RMSNorm, SwiGLU
- Paper: [LLaMA: Open and Efficient Foundation Language Models](https://arxiv.org/abs/2302.13971)
- **GPT-3** (OpenAI, 2020) - Transformer language modeling paradigm
- **PaLM** (Google, 2022) - SwiGLU activation function
#### ๐ŸŽฏ **Attention Mechanisms**
- **Flash Attention v2** (Tri Dao et al., 2023) - Efficient attention with IO-awareness
- Paper: [FlashAttention-2: Faster Attention with Better Parallelism](https://arxiv.org/abs/2307.08691)
- **Grouped Query Attention (GQA)** (Ainslie et al., Google, 2023) - Memory-efficient attention
  - Paper: [GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints](https://arxiv.org/abs/2305.13245)
- **Multi-Query Attention (MQA)** (Shazeer, Google, 2019) - Fast decoding
- **xFormers** (Meta AI, 2022) - Memory efficient attention implementations
- **PyTorch SDPA** (PyTorch Team, 2023) - Built-in scaled dot product attention
#### ๐Ÿ“ **Position Encodings**
- **RoPE** (Su et al., 2021) - Rotary Position Embeddings
- Paper: [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/abs/2104.09864)
- **ALiBi** (Press et al., 2022) - Attention with Linear Biases for extrapolation
- Paper: [Train Short, Test Long: Attention with Linear Biases](https://arxiv.org/abs/2108.12409)
- **YaRN** (Peng et al., 2023) - Yet another RoPE extensioN for long context
- Paper: [YaRN: Efficient Context Window Extension](https://arxiv.org/abs/2309.00071)
#### ๐ŸชŸ **Long Context & Efficiency**
- **Sliding Window Attention** (Mistral AI, 2023) - Local attention patterns
- Paper: [Mistral 7B](https://arxiv.org/abs/2310.06825)
- **StreamingLLM / Attention Sink** (Xiao et al., MIT, 2023) - Infinite sequence lengths
- Paper: [Efficient Streaming Language Models with Attention Sinks](https://arxiv.org/abs/2309.17453)
- **Logit Softcapping** (Google Gemma, 2024) - Prevent attention overflow
- Paper: [Gemma: Open Models Based on Gemini](https://arxiv.org/abs/2403.08295)
#### ๐Ÿง  **Mixture of Experts (MoE)**
- **Mixtral 8x7B** (Mistral AI, 2024) - Sparse MoE architecture
- Paper: [Mixtral of Experts](https://arxiv.org/abs/2401.04088)
- **Switch Transformers** (Fedus et al., Google, 2021) - Scaling with expert choice
- Paper: [Switch Transformers: Scaling to Trillion Parameter Models](https://arxiv.org/abs/2101.03961)
- **GLaM** (Du et al., Google, 2021) - Generalist Language Model with MoE
- **Expert Choice Routing** (Zhou et al., Google, 2022) - Improved load balancing
#### ๐ŸŽ“ **Training Optimizations**
- **Layer Scale** (Touvron et al., Meta, 2021) - Training stability for deep networks
- Paper: [Going Deeper with Image Transformers (CaiT)](https://arxiv.org/abs/2103.17239)
- **Stochastic Depth** (Huang et al., 2016) - Regularization via random layer dropping
- Paper: [Deep Networks with Stochastic Depth](https://arxiv.org/abs/1603.09382)
- **Mixture of Depths (MoD)** (Raposo et al., Google DeepMind, 2024) - Dynamic compute allocation
- Paper: [Mixture-of-Depths: Dynamically allocating compute in transformer-based models](https://arxiv.org/abs/2404.02258)
- **Gradient Checkpointing** (Chen et al., 2016) - Memory-efficient training
#### ๐Ÿ“ฆ **Quantization**
- **LLM.int8()** (Dettmers et al., 2022) - 8-bit matrix multiplication
- Paper: [LLM.int8(): 8-bit Matrix Multiplication for Transformers](https://arxiv.org/abs/2208.07339)
- **QLoRA** (Dettmers et al., 2023) - 4-bit quantized LoRA fine-tuning
- Paper: [QLoRA: Efficient Finetuning of Quantized LLMs](https://arxiv.org/abs/2305.14314)
- **GPTQ** (Frantar et al., 2022) - Post-training quantization
- **bitsandbytes** (Dettmers) - Efficient quantization library
#### ๐ŸŽจ **Multimodal Components**
- **Vision Transformer (ViT)** (Dosovitskiy et al., Google, 2020) - Image encoding
- Paper: [An Image is Worth 16x16 Words](https://arxiv.org/abs/2010.11929)
- **Perceiver Resampler** (Alayrac et al., DeepMind, 2022) - Multimodal fusion
- Paper: [Flamingo: a Visual Language Model](https://arxiv.org/abs/2204.14198)
- **Q-Former** (Li et al., Salesforce, 2023) - Query-based multimodal alignment
- Paper: [BLIP-2: Bootstrapping Language-Image Pre-training](https://arxiv.org/abs/2301.12597)
- **Whisper** (Radford et al., OpenAI, 2022) - Audio encoding inspiration
#### ๐Ÿ› ๏ธ **Normalization & Activations**
- **RMSNorm** (Zhang & Sennrich, 2019) - Root Mean Square Layer Normalization
- Paper: [Root Mean Square Layer Normalization](https://arxiv.org/abs/1910.07467)
- **SwiGLU** (Shazeer, Google, 2020) - GLU activation variant
- Paper: [GLU Variants Improve Transformer](https://arxiv.org/abs/2002.05202)
#### ๐Ÿ”ง **Implementation & Tools**
- **Hugging Face Transformers** - Model implementation framework
- **PyTorch** - Deep learning framework
- **Safetensors** - Secure tensor serialization format
- **Accelerate** - Distributed training utilities
---
**Special Thanks to:**
- ๐Ÿ‡ฎ๐Ÿ‡ฉ Indonesian NLP Community
- ๐Ÿค— Hugging Face Team
- ๐Ÿ”ฌ Open source AI research community
## โš ๏ธ Limitations & Bias
### Limitations
- ๐Ÿ”ด **Untrained**: the model has not been trained; output is random
- ๐ŸŸก **No tokenizer**: you need to prepare your own tokenizer
- ๐ŸŸก **No safety**: no content filtering or alignment yet
- ๐ŸŸ  **Memory intensive**: training requires substantial GPU resources
### Potential Biases
The model will inherit biases from whatever training data is used. Keep in mind:
- **Language**: bias toward the majority language in the dataset
- **Culture**: bias toward particular cultural perspectives
- **Gender & demographics**: potential stereotypes
- **Factuality**: it can generate inaccurate information
**Recommendation**: run evaluations and apply filtering before any deployment.
---
## ๐Ÿ“ž Support & Contact
### ๐Ÿ’ฌ Community
- **Discussions**: [HF Discussions](https://huggingface.co/Lyon28/caca-250M-untrained/discussions)
### ๐Ÿ“ง Contact
For questions or collaboration:
- Email: cacatransformers@gmail.com
- HF Profile: [@Lyon28](https://huggingface.co/Lyon28)
---
<div align="center">
## ๐ŸŒŸ Star History
[![Star History Chart](https://api.star-history.com/svg?repos=Lyon-28/caca-transformers&type=Date)](https://star-history.com/#Lyon-28/caca-transformers&Date)
---
### ๐Ÿ’ Dibuat dengan โค๏ธ untuk komunitas AI Indonesia
**Terima kasih telah menggunakan Caca!**
Jika project ini bermanfaat, consider untuk:
- โญ Star repository ini
- ๐Ÿ”— Share ke teman-teman
- ๐Ÿ’ฌ Join discussions
- ๐Ÿค Contribute ke project
---
</div>
### A Quote from Caca
<div align="center">
<img src="https://quotes-caca.vercel.app/api/SsQuote" alt="Daily Quote" width="100%" />
</div>