CACA-10M
A modern Transformer model with an advanced architecture
33,924,224 parameters • 33.9M • 10 layers
Documentation • Quick Start • Features • Training • Specifications
IMPORTANT: Untrained Model
WARNING: This model has not been trained. Its weights are still randomly initialized, so any output it produces will be random and meaningless.
Model status:
- Not trained - weights are still random
- Research only - for architecture & training experiments
- Ready to train - the architecture has been tested
The widget above only illustrates the expected input format. Once the model has been trained on a suitable dataset, the same format will produce meaningful output.
Description
Caca is a latest-generation Large Language Model (LLM) architecture that combines several state-of-the-art deep learning techniques. It is designed with a focus on efficiency, scalability, and high performance.
Key Strengths
- Bilingual support: optimized for Indonesian & English
- Fast inference: Flash Attention 2 for up to 3x faster inference
- Memory efficient: Grouped Query Attention cuts the KV cache to roughly a third (~67% smaller)
- Long context: supports up to 2,048 tokens
- Modular: flexible architecture with a range of configuration options
Features
Core Features
Grouped Query Attention (GQA) - superior memory and compute efficiency
- Query heads: 6
- KV heads: 2
- Ratio: 3:1 (cuts the KV cache to roughly a third, ~67% savings; see the sketch after this list)
Rotary Position Embeddings (RoPE) - better generalization to long contexts
- Theta: 10000
- Supports extrapolation to contexts longer than the training length
RMSNorm - more stable normalization, roughly 50% faster than LayerNorm
- Epsilon: 1e-06
SwiGLU Activation - 10-15% better performance than ReLU/GELU
- Intermediate size: 1,536
Flash Attention 2 - up to 3x speedup with better memory efficiency
- Enabled automatically when CUDA is available
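As a rough illustration of what the 6:2 head ratio buys, the back-of-the-envelope comparison below estimates KV-cache size at the full 2,048-token context using only the dimensions listed above; it is illustrative, not a measurement of the actual implementation.
# KV-cache size: standard multi-head attention (6 KV heads) vs GQA (2 KV heads)
layers, head_dim, seq_len, batch = 10, 64, 2048, 1
bytes_per_elem = 2  # BF16

def kv_cache_bytes(n_kv_heads):
    # K and V tensors per layer: [batch, n_kv_heads, seq_len, head_dim]
    return 2 * layers * batch * n_kv_heads * seq_len * head_dim * bytes_per_elem

mha = kv_cache_bytes(6)  # every query head keeps its own K/V
gqa = kv_cache_bytes(2)  # 2 shared KV-head groups
print(f"MHA: {mha / 1e6:.1f} MB, GQA: {gqa / 1e6:.1f} MB, saved: {1 - gqa / mha:.0%}")
# MHA: 31.5 MB, GQA: 10.5 MB, saved: 67%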
Advanced Features
Attention Mechanisms
- Flash Attention v2 - up to 3x faster with an IO-aware algorithm
- Grouped Query Attention (GQA) - 6 query heads : 2 KV heads
- xFormers support - memory-efficient attention fallback
- PyTorch SDPA - native scaled dot-product attention
Position Encodings
- RoPE - rotary embeddings (θ=10000)
Long Context Features
Training Optimizations
- Gradient Checkpointing - memory-efficient training
- Mixed Precision - BF16 & FP16 support
Quantization Support
- 4-bit Quantization - NF4, FP4 via bitsandbytes
- 8-bit Quantization - LLM.int8() support
- Double Quantization - further compression
Optimization Features
- KV Cache - 5-10x faster autoregressive generation
- Gradient Checkpointing - train larger models with limited memory
- Quantization Ready - 4-bit & 8-bit support (see the loading sketch after this list)
- Mixed Precision Training - BF16 & FP16 support
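As a hedged example, this is what 4-bit loading could look like with the standard transformers + bitsandbytes API (NF4 with double quantization). Whether it works end-to-end depends on the custom CacaForCausalLM code, so treat it as a sketch rather than a tested recipe.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # NF4 4-bit data type
    bnb_4bit_use_double_quant=True,       # double quantization for extra compression
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_4bit = AutoModelForCausalLM.from_pretrained(
    "Lyon28/caca-10M-untrained",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)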
Technical Specifications
| Specification | Detail |
|---|---|
| Total Parameters | 33,924,224 (33.9M) |
| Hidden Size | 384 |
| Intermediate Size | 1,536 |
| Num Layers | 10 |
| Attention Heads | 6 |
| KV Heads | 2 (GQA) |
| Head Dimension | 64 |
| Vocab Size | 32,000 tokens |
| Max Context | 2,048 tokens |
| Architecture | Decoder-only Transformer |
| Model Type | Causal Language Model |
Detailed Architecture
Click to expand the full structure
CacaForCausalLM (33.9M)
│
├─ Embedding Layer
│  ├─ Token Embeddings: 32,000 × 384
│  └─ Parameters: 12,288,000
│
├─ Transformer Layers (10x)
│  │
│  ├─ Layer {i} (repeated 10 times)
│  │  │
│  │  ├─ Input LayerNorm (RMSNorm)
│  │  │  └─ Params: 384
│  │  │
│  │  ├─ Self-Attention (Grouped Query Attention)
│  │  │  ├─ Q Projection: 384 → 384
│  │  │  ├─ K Projection: 384 → 128
│  │  │  ├─ V Projection: 384 → 128
│  │  │  ├─ O Projection: 384 → 384
│  │  │  ├─ RoPE Embeddings: θ=10000
│  │  │  └─ Flash Attention 2 (if available)
│  │  │
│  │  ├─ Post-Attention LayerNorm (RMSNorm)
│  │  │  └─ Params: 384
│  │  │
│  │  ├─ MLP (SwiGLU)
│  │  │  ├─ Gate: 384 → 1,536
│  │  │  ├─ Up: 384 → 1,536
│  │  │  ├─ Activation: SiLU (Swish)
│  │  │  └─ Down: 1,536 → 384
│  │  │
│  │  └─ Residual Connections (2x per layer)
│  │
│  └─ Total Layer Params: ~2.2M per layer
│
├─ Final LayerNorm (RMSNorm)
│  └─ Params: 384
│
└─ LM Head (Output Projection)
   ├─ Linear: 384 → 32,000
   └─ Parameters: 12,288,000
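For readers who want the MLP and normalization blocks spelled out, here is a minimal PyTorch sketch of SwiGLU and RMSNorm matching the dimensions in the diagram above. It illustrates the technique only; module and argument names are assumptions, not the model's actual source code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUMLP(nn.Module):
    # down( SiLU(gate(x)) * up(x) ): 384 -> 1,536 -> 384
    def __init__(self, hidden_size=384, intermediate_size=1536):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x):
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

class RMSNorm(nn.Module):
    # One learnable scale per feature (384 params per norm layer)
    def __init__(self, hidden_size=384, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.eps = eps

    def forward(self, x):
        variance = x.pow(2).mean(-1, keepdim=True)
        return self.weight * x * torch.rsqrt(variance + self.eps)

# Example: SwiGLUMLP()(torch.randn(1, 8, 384)).shape -> torch.Size([1, 8, 384])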
Parameter breakdown:
- Embeddings: 32,000 × 384 = 12,288,000
- Transformer layers: 10 layers × ~2.16M ≈ 21.6M
- Final RMSNorm: 384
- Total: 33,924,224 parameters (consistent with the LM head sharing its weights with the token embeddings, i.e. weight tying)
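A quick sanity check of the breakdown above, computed only from the dimensions in the spec table. The last line assumes the LM head is tied to the token embeddings, which is the only way the reported total adds up; the small residual likely comes from buffers or minor extras.
# Back-of-the-envelope parameter count from the published dimensions
vocab, hidden, inter, layers = 32_000, 384, 1_536, 10
kv_dim = 2 * 64                                          # 2 KV heads x head_dim 64

embed     = vocab * hidden                               # 12,288,000
attn      = 2 * hidden * hidden + 2 * hidden * kv_dim    # Q + O + K + V projections
mlp       = 3 * hidden * inter                           # gate + up + down
per_layer = attn + mlp + 2 * hidden                      # + two RMSNorms -> ~2.16M
total     = embed + layers * per_layer + hidden          # + final norm; LM head assumed tied
print(f"{total:,}")  # 33,922,944 - within ~1.3k of the reported 33,924,224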
Quick Start
Installation
# Core dependencies
pip install "torch>=2.0.0" "transformers>=4.35.0" accelerate safetensors
# Optional: for maximum performance
pip install flash-attn --no-build-isolation  # Flash Attention 2
pip install xformers  # Memory efficient attention
pip install bitsandbytes  # Quantization support
Basic Usage
1. Load the Model
from transformers import AutoModelForCausalLM, AutoConfig
import torch
# Load the configuration
config = AutoConfig.from_pretrained(
    "Lyon28/caca-10M-untrained",
    trust_remote_code=True
)
print(f"Model: {config.model_type}")
print("Parameters: 33,924,224")
print(f"Hidden size: {config.hidden_size}")
print(f"Layers: {config.num_hidden_layers}")
# Load the model
model = AutoModelForCausalLM.from_pretrained(
    "Lyon28/caca-10M-untrained",
    config=config,
    torch_dtype=torch.bfloat16,   # use BF16 for efficiency
    device_map="auto",            # place the model on available GPU(s) automatically
    trust_remote_code=True
)
print(f"Model loaded! Device: {model.device}")
2. Verify the Model
# Count parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total parameters: {total_params:,}")
print(f"Trainable parameters: {trainable_params:,}")
print(f"Model size: {total_params * 2 / 1e9:.2f} GB (BF16)")
# Test a forward pass
batch_size, seq_len = 2, 10
input_ids = torch.randint(0, config.vocab_size, (batch_size, seq_len))
input_ids = input_ids.to(model.device)
with torch.no_grad():
    outputs = model(input_ids)
print(f"Output shape: {outputs.logits.shape}")
print("Forward pass OK - the model works!")
3. Generate Text (After Training)
from transformers import AutoTokenizer
# Load a tokenizer (use one that matches your training setup)
tokenizer = AutoTokenizer.from_pretrained("your-tokenizer-here")
# Prepare input
text = "Jelaskan tentang kecerdasan buatan"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
# Generate
outputs = model.generate(
**inputs,
max_new_tokens=100,
temperature=0.7,
top_p=0.9,
top_k=50,
do_sample=True,
repetition_penalty=1.1,
pad_token_id=tokenizer.eos_token_id
)
# Decode
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
Training Guide
Dataset Preparation
from datasets import load_dataset
# Load a dataset (example)
dataset = load_dataset("indonesian-nlp/id-wikipedia")
# Or load from a local file
from datasets import Dataset
import pandas as pd
df = pd.read_csv("your_data.csv")
dataset = Dataset.from_pandas(df)
print(f"Dataset size: {len(dataset)}")
Training Configuration
from transformers import Trainer, TrainingArguments
from transformers import DataCollatorForLanguageModeling
# Training arguments
training_args = TrainingArguments(
    # Output
    output_dir="./caca-10M-trained",
    run_name="caca-10M-v1",
    # Training
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,   # effective batch size = 32
    learning_rate=2e-4,
    weight_decay=0.1,
    warmup_steps=2000,
    # Optimization
    bf16=True,                       # mixed precision training
    gradient_checkpointing=True,     # saves memory
    optim="adamw_torch_fused",       # fused AdamW optimizer
    max_grad_norm=1.0,
    # Logging & Evaluation
    logging_steps=10,
    logging_first_step=True,
    eval_strategy="steps",
    eval_steps=500,
    save_steps=1000,
    save_total_limit=3,
    # Hub integration
    push_to_hub=True,
    hub_model_id="your-username/caca-10M-trained",
    hub_strategy="every_save",
    # Distributed training
    ddp_find_unused_parameters=False,
    dataloader_num_workers=4,
)
# Data collator
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False  # causal LM, not masked LM
)
# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    data_collator=data_collator,
)
# Train!
print("Starting training...")
trainer.train()
# Save the final model
print("Saving model...")
trainer.save_model("./caca-10M-final")
trainer.push_to_hub()
print("Training complete!")
Resource Estimates
Click to see estimated training cost & time
Hardware requirements:
| GPU | Memory | Batch Size | Speed | Est. Time (~2.5B tokens) |
|---|---|---|---|---|
| RTX 3090 (24GB) | 24GB | 1-2 | ~1K tok/s | ~30 days |
| A100 (40GB) | 40GB | 4-8 | ~5K tok/s | ~6 days |
| A100 (80GB) | 80GB | 8-16 | ~8K tok/s | ~4 days |
| 8×A100 (80GB) | 640GB | 64+ | ~50K tok/s | ~14 hours |
Cloud costs (approximate):
- AWS p4d.24xlarge (8×A100): $32/hour × 24 hours = **$768/day**
- GCP a2-ultragpu-8g: $30/hour × 24 hours = **$720/day**
- Lambda Labs (8×A100): $15/hour × 24 hours = **$360/day**
Tips for reducing cost:
- Use spot instances (60-70% cheaper; see the resume sketch after this list)
- Use gradient accumulation for a larger effective batch size
- Use mixed precision (BF16) for up to 2x speedup
- Use gradient checkpointing to save memory
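A practical companion to the spot-instance tip: the Trainer configuration above already writes periodic checkpoints (save_steps=1000), so a preempted run can be resumed instead of restarted. A minimal sketch, reusing the `trainer` object defined in the Training Configuration section.
# Resume an interrupted (e.g. spot/preemptible) run from the newest checkpoint in output_dir
trainer.train(resume_from_checkpoint=True)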
Chat Format
The model supports the standard chat message format:
# Single-turn
messages = [
{"role": "user", "content": "Halo! Siapa kamu?"},
]
# Multi-turn conversation
messages = [
{"role": "system", "content": "Kamu adalah asisten AI yang membantu."},
{"role": "user", "content": "Jelaskan tentang fotosintesis"},
{"role": "assistant", "content": "Fotosintesis adalah proses..."},
{"role": "user", "content": "Apa manfaatnya bagi manusia?"},
]
# Apply chat template
formatted = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True
)
print(formatted)
# Output:
# System: Kamu adalah asisten AI yang membantu.
#
# User: Jelaskan tentang fotosintesis
# Assistant: Fotosintesis adalah proses...
# User: Apa manfaatnya bagi manusia?
# Assistant:
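Because no tokenizer ships with this model, apply_chat_template only works once the tokenizer has a chat template attached. Below is a minimal sketch that reproduces the System/User/Assistant layout shown above; the Jinja template string is an illustrative assumption, not the model's official format.
# Attach an illustrative chat template to a tokenizer that has none
tokenizer.chat_template = (
    "{% for message in messages %}"
    "{% if message['role'] == 'system' %}System: {{ message['content'] }}\n\n"
    "{% elif message['role'] == 'user' %}User: {{ message['content'] }}\n"
    "{% else %}Assistant: {{ message['content'] }}\n{% endif %}"
    "{% endfor %}"
    "{% if add_generation_prompt %}Assistant:{% endif %}"
)
formatted = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(formatted)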
Use Cases
Suitable for:
- Research: experiments with modern LLM architectures
- Education: learning about transformers & training
- Academia: papers, theses, course projects
- Base model: fine-tuning for specific tasks
- Proof of concept: testing ideas before scaling up
Not suitable for:
- Production: the model has not been trained
- Real-world applications: outputs are still random
- Safety-critical use: no safety alignment has been done
- Direct deployment: training is required first
Documentation
Important Links
- Hugging Face Docs: transformers.github.io
- GitHub: Lyon-28/caca-transformers
- Discussions: Model discussions
- Issues: Report bugs
Related Models
Contributing
Contributions are very welcome! A few ways to contribute:
- Report bugs: found a bug? Open an issue
- Suggest features: have an idea? Share it in the discussions
- Improve docs: documentation PRs are welcome
- Share results: trained the model? Share your results on the model card
- Star & share: help this project grow
License & Citation
License
This model is released under the Apache License 2.0:
- Free for commercial use
- Free for research use
- Modification & redistribution allowed
- No warranty
Citation
If you use this model in research or a project, please cite:
@misc{caca10M2025,
  author = {Lyon},
  title = {Caca-10M: Modern Transformer Architecture with GQA and Advanced Features},
  year = {2025},
  publisher = {Hugging Face},
  journal = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/Lyon28/caca-10M-untrained}},
}
Acknowledgments
This model is inspired by, and implements ideas from, a range of recent research:
Core Architecture
- LLaMA (Meta AI, 2023) - Base decoder-only architecture, RMSNorm, SwiGLU
- GPT-3 (OpenAI, 2020) - Transformer language modeling paradigm
- PaLM (Google, 2022) - SwiGLU activation function
Attention Mechanisms
- Flash Attention v2 (Tri Dao et al., 2023) - Efficient attention with IO-awareness
- Grouped Query Attention (GQA) (Ainslie et al., Google, 2023) - Memory-efficient attention
- Multi-Query Attention (MQA) (Shazeer, Google, 2019) - Fast decoding
- xFormers (Meta AI, 2022) - Memory efficient attention implementations
- PyTorch SDPA (PyTorch Team, 2023) - Built-in scaled dot product attention
Position Encodings
- RoPE (Su et al., 2021) - Rotary Position Embeddings (RoFormer)
- ALiBi (Press et al., 2022) - Attention with Linear Biases for extrapolation
- YaRN (Peng et al., 2023) - Yet another RoPE extensioN for long context
Long Context & Efficiency
- Sliding Window Attention (Mistral AI, 2023) - Local attention patterns
- Paper: Mistral 7B
- StreamingLLM / Attention Sink (Xiao et al., MIT, 2023) - Infinite sequence lengths
- Logit Softcapping (Google Gemma, 2024) - Caps attention/output logits to keep them in a stable range
Mixture of Experts (MoE)
- Mixtral 8x7B (Mistral AI, 2024) - Sparse MoE architecture
- Paper: Mixtral of Experts
- Switch Transformers (Fedus et al., Google, 2021) - Scaling sparse MoE models with simplified routing
- GLaM (Du et al., Google, 2021) - Generalist Language Model with MoE
- Expert Choice Routing (Zhou et al., Google, 2022) - Improved load balancing
Training Optimizations
- Layer Scale (Touvron et al., Meta, 2021) - Training stability for deep networks
- Stochastic Depth (Huang et al., 2016) - Regularization via random layer dropping
- Mixture of Depths (MoD) (Raposo et al., Google DeepMind, 2024) - Dynamic compute allocation
- Gradient Checkpointing (Chen et al., 2016) - Memory-efficient training
Quantization
- LLM.int8() (Dettmers et al., 2022) - 8-bit matrix multiplication
- QLoRA (Dettmers et al., 2023) - 4-bit quantized LoRA fine-tuning
- GPTQ (Frantar et al., 2022) - Post-training quantization
- bitsandbytes (Dettmers) - Efficient quantization library
Multimodal Components
- Vision Transformer (ViT) (Dosovitskiy et al., Google, 2020) - Image encoding
- Perceiver Resampler (Alayrac et al., DeepMind, 2022) - Multimodal fusion
- Q-Former (Li et al., Salesforce, 2023) - Query-based multimodal alignment
- Whisper (Radford et al., OpenAI, 2022) - Audio encoding inspiration
Normalization & Activations
- RMSNorm (Zhang & Sennrich, 2019) - Root Mean Square Layer Normalization
- SwiGLU (Shazeer, Google, 2020) - GLU activation variant
Implementation & Tools
- Hugging Face Transformers - Model implementation framework
- PyTorch - Deep learning framework
- Safetensors - Secure tensor serialization format
- Accelerate - Distributed training utilities
Special Thanks to:
- Indonesian NLP Community
- Hugging Face Team
- The open-source AI research community
Limitations & Bias
Limitations
- Untrained: the model has not been trained, so outputs are random
- No tokenizer: you need to prepare your own tokenizer (a tokenizer-training sketch follows this list)
- No safety: no content filtering or alignment has been applied
- Memory intensive: training requires substantial GPU resources
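Since no tokenizer is included, one possible route is to train a 32k byte-level BPE tokenizer with the `tokenizers` library and wrap it for transformers. This is a sketch under assumed special tokens and a placeholder corpus path, not an official recipe.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers
from transformers import PreTrainedTokenizerFast

# Train a byte-level BPE tokenizer (vocab_size matches the model's 32,000)
tok = Tokenizer(models.BPE(unk_token="<unk>"))
tok.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
bpe_trainer = trainers.BpeTrainer(
    vocab_size=32_000,
    special_tokens=["<unk>", "<s>", "</s>", "<pad>"],  # assumed special tokens
)
tok.train(files=["corpus.txt"], trainer=bpe_trainer)  # placeholder corpus path

# Wrap it so it works with AutoTokenizer / Trainer
tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tok,
    unk_token="<unk>", bos_token="<s>", eos_token="</s>", pad_token="<pad>",
)
tokenizer.save_pretrained("./caca-tokenizer")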
Potential Biases
The model will inherit biases from whatever training data is used. Keep in mind:
- Language: bias toward the majority language in the dataset
- Culture: bias toward particular cultural perspectives
- Gender & demographics: potential stereotypes
- Factuality: it can generate inaccurate information
Recommendation: evaluate and filter outputs before any deployment.
Support & Contact
Community
- Discussions: HF Discussions
Contact
For questions or collaboration:
- Email: cacatransformers@gmail.com
- HF Profile: @Lyon28
Made with ❤️ for the Indonesian AI community
Thank you for using Caca!
If this project is useful to you, please consider:
- Starring this repository
- Sharing it with friends
- Joining the discussions
- Contributing to the project