---
license: apache-2.0
language:
- id
- en
tags:
- text-generation
- pytorch
- causal-lm
- transformer
- untrained
- gqa
- rope
- swiglu
- rmsnorm
- flash-attention
- indonesian
library_name: transformers
pipeline_tag: text-generation
widget:
- text: "Jakarta adalah ibu kota"
  example_title: "Text Completion (ID)"
- text: |
    Pertanyaan: Apa itu kecerdasan buatan?
    Jawaban:
  example_title: "Question Answering (ID)"
- text: |
    Tulis cerita pendek tentang robot yang belajar mencintai.
  example_title: "Creative Writing (ID)"
- text: "The capital of Indonesia is"
  example_title: "Text Completion (EN)"
- text: |
    Question: What is artificial intelligence?
    Answer:
  example_title: "Question Answering (EN)"
- text: |
    def fibonacci(n):
        """Hitung bilangan fibonacci ke-n"""
  example_title: "Code Completion"
- text: |
    def reverse_string(s):
  example_title: "Code Generation"
- text: |
    User: Halo! Siapa kamu?
    Assistant:
  example_title: "Chat Format (ID)"
- text: |
    User: Jelaskan tentang machine learning dalam 2 kalimat.
    Assistant:
  example_title: "Conversational (ID)"
inference:
  parameters:
    max_new_tokens: 100
    temperature: 0.7
    top_p: 0.9
    top_k: 50
    do_sample: true
    repetition_penalty: 1.1
    num_beams: 1
datasets: []
metrics:
- perplexity
model-index:
- name: caca-250M
  results: []
---

<div align="center">

<img src="https://i.postimg.cc/MTSj073X/logo.png" width="400" alt="caca-250M"/>

# CACA-250M

### A Modern Transformer Model with an Advanced Architecture

[License: Apache 2.0](https://opensource.org/licenses/Apache-2.0) •
[Python](https://www.python.org/downloads/) •
[PyTorch](https://pytorch.org/) •
[Transformers](https://github.com/huggingface/transformers)

**430,493,696** parameters • **430.49M** • **24 layers**

[Documentation](#documentation) • [Quick Start](#quick-start) • [Features](#key-features) • [Training](#training-guide) • [Specifications](#technical-specifications)

---

</div>

## IMPORTANT: This Model Is Untrained

> **WARNING**: This model has **not** been trained. Its weights are still randomly initialized, so any generated output will be **random and meaningless**.

**Model status:**

- **Untrained** - the weights are still random
- **Research only** - intended for architecture & training experiments
- **Ready to train** - the architecture itself has been tested

The widgets above only illustrate the **expected input formats**. Once the model has been trained on a suitable dataset, the same formats will produce meaningful output.

---

## Description

**Caca** is a modern Large Language Model (LLM) architecture that combines a range of state-of-the-art deep learning techniques. The model is designed with a focus on **efficiency**, **scalability**, and **high performance**.

<blockquote style="border-left: 4px solid #4A90E2; padding-left: 16px; margin: 16px 0; color: #555;">
<p><strong>Caca</strong> is an open-source Indonesian LLM experiment, built from scratch by one person, step by step. It is not trying to compete with anyone; it is simply an exploration of what can be done with a limited budget, unlimited passion, and a collaborative mindset. If it turns out to be useful to others, wonderful. If not, it was still fun.</p>
<p>This is an exploratory project, so if it fails, that is part of the learning process. If it succeeds, that is a bonus.</p>
</blockquote>

### Key Advantages

- **Bilingual Support**: Optimized for Indonesian & English
- **Fast Inference**: Flash Attention 2 for up to 3x faster inference
- **Memory Efficient**: Grouped Query Attention cuts the KV cache by 75%
- **Long Context**: Supports up to 8,192 tokens
- **Modular**: Flexible architecture with a range of configuration options

---

## Key Features

### Core Features

- **Grouped Query Attention (GQA)** - better memory and compute efficiency
  - Query heads: 16
  - KV heads: 4
  - Ratio: 4:1 (75% smaller KV cache)

- **Rotary Position Embeddings (RoPE)** - better generalization to long contexts
  - Theta: 10000
  - Supports extrapolation beyond the training length

- **RMSNorm** - more stable normalization, roughly 50% faster than LayerNorm
  - Epsilon: 1e-06

- **SwiGLU Activation** - typically 10-15% better than ReLU/GELU
  - Intermediate size: 4,096

- **Flash Attention 2** - up to 3x faster attention with better memory efficiency (see the sketch after this list)
  - Enabled automatically when CUDA is available

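As a rough illustration, this is how an attention backend is usually selected through the Transformers API, together with the arithmetic behind the 75% KV-cache figure. Whether the custom `CacaForCausalLM` code honours the `attn_implementation` argument is an assumption, so treat this as a hedged sketch rather than the model's confirmed interface:

```python
import torch
from transformers import AutoModelForCausalLM

# Sketch only: `attn_implementation` is the standard Transformers argument for
# picking an attention backend; "flash_attention_2" requires the flash-attn
# package and a CUDA GPU, otherwise use "sdpa" or "eager".
model = AutoModelForCausalLM.from_pretrained(
    "Lyon28/caca-250M-untrained",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2",
    trust_remote_code=True,
)

# Back-of-the-envelope GQA saving: per token and per layer, the KV cache stores
# 2 (K and V) * num_kv_heads * head_dim values instead of 2 * num_heads * head_dim.
num_heads, num_kv_heads, head_dim = 16, 4, 64
mha_cache = 2 * num_heads * head_dim     # 2048 values per token per layer (full MHA)
gqa_cache = 2 * num_kv_heads * head_dim  # 512 values per token per layer (GQA)
print(f"KV cache reduction: {1 - gqa_cache / mha_cache:.0%}")  # 75%
```
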
### Advanced Features

### Attention Mechanisms

- **Flash Attention v2** - up to 3x faster attention with an IO-aware algorithm
- **Grouped Query Attention (GQA)** - 16 query heads : 4 KV heads
- **xFormers Support** - memory-efficient attention fallback
- **PyTorch SDPA** - native scaled dot-product attention

### Position Encodings

- **RoPE** - rotary embeddings (θ=10000)

### Long Context Features

### Training Optimizations

- **Gradient Checkpointing** - memory-efficient training
- **Mixed Precision** - BF16 & FP16 support

### Quantization Support

- **4-bit Quantization** - NF4, FP4 via bitsandbytes
- **8-bit Quantization** - LLM.int8() support
- **Double Quantization** - further compression (see the sketch after this list)

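A minimal sketch of how 4-bit NF4 loading with double quantization is usually wired up through bitsandbytes and Transformers. Since this checkpoint is untrained and relies on custom code, treat it as illustrative rather than a verified configuration for this repository:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Standard bitsandbytes 4-bit setup: NF4 quantization, double quantization,
# and BF16 compute for the dequantized matmuls.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "Lyon28/caca-250M-untrained",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
```
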
### Optimization Features

- **KV Cache** - 5-10x faster autoregressive generation
- **Gradient Checkpointing** - train large models with limited memory (see the sketch after this list)
- **Quantization Ready** - 4-bit & 8-bit quantization support
- **Mixed Precision Training** - BF16 & FP16 support

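A short sketch of how gradient checkpointing is typically enabled on a Transformers model before training. Disabling `use_cache` alongside it is a common convention (the KV cache only helps at generation time and conflicts with checkpointing); whether the Caca implementation requires this is an assumption:

```python
# Trade compute for memory during training: activations are recomputed in the
# backward pass instead of being stored.
model.gradient_checkpointing_enable()

# The KV cache is only useful for generation and is usually switched off while
# training with checkpointing enabled.
model.config.use_cache = False

# After training, re-enable the cache for fast autoregressive generation:
# model.config.use_cache = True
```
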
---

## Technical Specifications

<div align="center">

| Specification | Detail |
|---------------|--------|
| **Total Parameters** | **430,493,696** (430.49M) |
| **Hidden Size** | 1,024 |
| **Intermediate Size** | 4,096 |
| **Num Layers** | 24 |
| **Attention Heads** | 16 |
| **KV Heads** | 4 (GQA) |
| **Head Dimension** | 64 |
| **Vocab Size** | 32,000 tokens |
| **Max Context** | 8,192 tokens |
| **Architecture** | Decoder-only Transformer |
| **Model Type** | Causal Language Model |

</div>

### Architecture Details

<details>
<summary><b>Click to see the full structure</b></summary>

```
CacaForCausalLM (430.49M)
│
├── Embedding Layer
│   ├── Token Embeddings: 32,000 × 1,024
│   └── Parameters: 32,768,000
│
├── Transformer Layers (24x)
│   │
│   ├── Layer {i} (repeated 24 times)
│   │   │
│   │   ├── Input LayerNorm (RMSNorm)
│   │   │   └── Params: 1,024
│   │   │
│   │   ├── Self-Attention (Grouped Query Attention)
│   │   │   ├── Q Projection: 1,024 → 1,024
│   │   │   ├── K Projection: 1,024 → 256
│   │   │   ├── V Projection: 1,024 → 256
│   │   │   ├── O Projection: 1,024 → 1,024
│   │   │   ├── RoPE Embeddings: θ=10000
│   │   │   └── Flash Attention 2 (if available)
│   │   │
│   │   ├── Post-Attention LayerNorm (RMSNorm)
│   │   │   └── Params: 1,024
│   │   │
│   │   ├── MLP (SwiGLU)
│   │   │   ├── Gate: 1,024 → 4,096
│   │   │   ├── Up: 1,024 → 4,096
│   │   │   ├── Activation: SiLU (Swish)
│   │   │   └── Down: 4,096 → 1,024
│   │   │
│   │   └── Residual Connections (2x per layer)
│   │
│   └── Total Layer Params: ~15.2M per layer
│
├── Final LayerNorm (RMSNorm)
│   └── Params: 1,024
│
└── LM Head (Output Projection)
    ├── Linear: 1,024 → 32,000
    └── Parameters: 32,768,000
```

**Parameter breakdown:**

- Token embeddings: `32,000 × 1,024 = 32,768,000`
- Transformer layers: `24 layers × ~15.2M ≈ 365M`
- LM head: `1,024 × 32,000 = 32,768,000`
- **Total: 430,493,696 parameters**

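The same arithmetic as a small self-contained check, in plain Python using the dimensions from the table above. The exact total can differ by a few thousand parameters depending on implementation details (for example extra norm weights), so treat the result as approximate:

```python
vocab, hidden, inter, layers = 32_000, 1_024, 4_096, 24
kv_dim = 4 * 64  # num_kv_heads * head_dim = 256

embeddings = vocab * hidden                             # 32,768,000
attention  = 2 * hidden * hidden + 2 * hidden * kv_dim  # Q + O, K + V projections
mlp        = 3 * hidden * inter                         # gate, up, down
norms      = 2 * hidden                                 # input + post-attention RMSNorm
per_layer  = attention + mlp + norms                    # ~15.2M
lm_head    = hidden * vocab                             # 32,768,000
final_norm = hidden

total = embeddings + layers * per_layer + final_norm + lm_head
print(f"{total:,}")  # ~430.5M, close to the reported 430,493,696
```
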
</details>

---

## Quick Start

### Installation

```bash
# Core dependencies (quote the version specifiers so the shell does not treat ">" as a redirect)
pip install "torch>=2.0.0" "transformers>=4.35.0" accelerate safetensors

# Optional: for maximum performance
pip install flash-attn --no-build-isolation  # Flash Attention 2
pip install xformers                         # Memory-efficient attention
pip install bitsandbytes                     # Quantization support
```

### Basic Usage

#### 1. Load the Model

```python
from transformers import AutoModelForCausalLM, AutoConfig
import torch

# Load the configuration
config = AutoConfig.from_pretrained(
    "Lyon28/caca-250M-untrained",
    trust_remote_code=True
)

print(f"Model: {config.model_type}")
print(f"Parameters: 430,493,696")
print(f"Hidden size: {config.hidden_size}")
print(f"Layers: {config.num_hidden_layers}")

# Load the model
model = AutoModelForCausalLM.from_pretrained(
    "Lyon28/caca-250M-untrained",
    config=config,
    torch_dtype=torch.bfloat16,  # Use BF16 for efficiency
    device_map="auto",           # Automatically place weights on available GPUs
    trust_remote_code=True
)

print(f"Model loaded! Device: {model.device}")
```

#### 2. Verify the Model

```python
# Count the parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"Total parameters: {total_params:,}")
print(f"Trainable parameters: {trainable_params:,}")
print(f"Model size: {total_params * 2 / 1e9:.2f} GB (BF16)")

# Test a forward pass
batch_size, seq_len = 2, 10
input_ids = torch.randint(0, config.vocab_size, (batch_size, seq_len))
input_ids = input_ids.to(model.device)

with torch.no_grad():
    outputs = model(input_ids)

print(f"Output shape: {outputs.logits.shape}")
print("Model is working correctly!")
```

#### 3. Generate Text (After Training)

```python
from transformers import AutoTokenizer

# Load a tokenizer (use one that matches your training setup)
tokenizer = AutoTokenizer.from_pretrained("your-tokenizer-here")

# Prepare the input
text = "Jelaskan tentang kecerdasan buatan"
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Generate
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    temperature=0.7,
    top_p=0.9,
    top_k=50,
    do_sample=True,
    repetition_penalty=1.1,
    pad_token_id=tokenizer.eos_token_id
)

# Decode
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```

---

## Training Guide

### Dataset Preparation

```python
from datasets import load_dataset

# Load a dataset (example)
dataset = load_dataset("indonesian-nlp/id-wikipedia")

# Or load from a local file
from datasets import Dataset
import pandas as pd

df = pd.read_csv("your_data.csv")
dataset = Dataset.from_pandas(df)

print(f"Dataset size: {len(dataset)}")
```

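Before the raw text can be fed to the Trainer below, it has to be tokenized and split. A minimal sketch, assuming a `tokenizer` has already been loaded and the dataset has a `text` column (both assumptions, since this repository ships no tokenizer):

```python
# Tokenize the raw text column; truncate to a modest length for early experiments.
def tokenize_fn(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = dataset.map(
    tokenize_fn,
    batched=True,
    remove_columns=dataset.column_names,  # keep only input_ids / attention_mask
)

# Hold out a small validation split so the Trainer's eval_strategy has data to use.
splits = tokenized.train_test_split(test_size=0.01, seed=42)
dataset = {"train": splits["train"], "validation": splits["test"]}
```
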
### Training Configuration

```python
from transformers import Trainer, TrainingArguments
from transformers import DataCollatorForLanguageModeling

# Training arguments
training_args = TrainingArguments(
    # Output
    output_dir="./caca-250M-trained",
    run_name="caca-250M-v1",

    # Training
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,  # Effective batch size = 32
    learning_rate=2e-4,
    weight_decay=0.1,
    warmup_steps=2000,

    # Optimization
    bf16=True,                    # Mixed-precision training
    gradient_checkpointing=True,  # Save memory
    optim="adamw_torch_fused",    # Fast fused AdamW
    max_grad_norm=1.0,

    # Logging & evaluation
    logging_steps=10,
    logging_first_step=True,
    eval_strategy="steps",
    eval_steps=500,
    save_steps=1000,
    save_total_limit=3,

    # Hub integration
    push_to_hub=True,
    hub_model_id="your-username/caca-250M-trained",
    hub_strategy="every_save",

    # Distributed training
    ddp_find_unused_parameters=False,
    dataloader_num_workers=4,
)

# Data collator
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False  # Causal LM, not masked LM
)

# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    data_collator=data_collator,
)

# Train!
print("Starting training...")
trainer.train()

# Save the final model
print("Saving model...")
trainer.save_model("./caca-250M-final")
trainer.push_to_hub()

print("Training complete!")
```

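Perplexity is the metric listed for this model; here is a small sketch of how it is commonly derived from the Trainer's evaluation loss (an assumption about your evaluation setup, not an official script for this repository):

```python
import math

# The Trainer reports the mean cross-entropy loss over the validation set;
# perplexity is simply exp(loss).
metrics = trainer.evaluate()
perplexity = math.exp(metrics["eval_loss"])
print(f"Validation perplexity: {perplexity:.2f}")
```
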
### Resource Estimates

<details>
<summary><b>Click for estimated training cost & time</b></summary>

**Hardware Requirements:**

| GPU | Memory | Batch Size | Speed | Est. Time (100B tokens) |
|-----|--------|------------|-------|-------------------------|
| RTX 3090 (24GB) | 24GB | 1-2 | ~1K tok/s | ~30 days |
| A100 (40GB) | 40GB | 4-8 | ~5K tok/s | ~6 days |
| A100 (80GB) | 80GB | 8-16 | ~8K tok/s | ~4 days |
| 8×A100 (80GB) | 640GB | 64+ | ~50K tok/s | ~14 hours |

**Cloud Costs (approximate):**

- AWS p4d.24xlarge (8×A100): ~$32/hour × 24 hours = **~$768/day**
- GCP a2-ultragpu-8g: ~$30/hour × 24 hours = **~$720/day**
- Lambda Labs (8×A100): ~$15/hour × 24 hours = **~$360/day**

**Tips for reducing cost:**

- Use spot instances (60-70% cheaper)
- Use gradient accumulation to reach larger effective batch sizes
- Use mixed precision (BF16) for roughly 2x speedup
- Use gradient checkpointing to save memory

</details>

---

## Chat Format

The model is intended to support a standard chat format:

```python
# Single-turn
messages = [
    {"role": "user", "content": "Halo! Siapa kamu?"},
]

# Multi-turn conversation
messages = [
    {"role": "system", "content": "Kamu adalah asisten AI yang membantu."},
    {"role": "user", "content": "Jelaskan tentang fotosintesis"},
    {"role": "assistant", "content": "Fotosintesis adalah proses..."},
    {"role": "user", "content": "Apa manfaatnya bagi manusia?"},
]

# Apply the chat template
formatted = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

print(formatted)
# Output:
# System: Kamu adalah asisten AI yang membantu.
#
# User: Jelaskan tentang fotosintesis
# Assistant: Fotosintesis adalah proses...
# User: Apa manfaatnya bagi manusia?
# Assistant:
```

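Because no tokenizer (and therefore no chat template) ships with this repository, `apply_chat_template` only works once a template has been attached to whichever tokenizer you use. A hedged sketch of a Jinja template that reproduces the `System:` / `User:` / `Assistant:` layout shown above:

```python
# Illustrative only: a minimal Jinja chat template matching the format above.
tokenizer.chat_template = (
    "{% for message in messages %}"
    "{% if message['role'] == 'system' %}System: {{ message['content'] }}\n\n"
    "{% elif message['role'] == 'user' %}User: {{ message['content'] }}\n"
    "{% elif message['role'] == 'assistant' %}Assistant: {{ message['content'] }}\n"
    "{% endif %}"
    "{% endfor %}"
    "{% if add_generation_prompt %}Assistant:{% endif %}"
)
```
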
---

## Use Cases

### Suitable For:

- **Research**: Experiments with modern LLM architectures
- **Education**: Learning about transformers & training
- **Academic work**: Papers, theses, projects
- **Base model**: Fine-tuning for specific tasks
- **Proof of concept**: Testing ideas before scaling up

### Not Suitable For:

- **Production**: The model has not been trained
- **Real-world apps**: Output is still random
- **Safety-critical use**: No safety alignment yet
- **Direct deployment**: Training is required first

---

## Documentation

### Important Links

- **Hugging Face Docs**: [huggingface.co/docs/transformers](https://huggingface.co/docs/transformers)
- **GitHub**: [Lyon-28/caca-transformers](https://github.com/Lyon-28/caca-transformers)
- **Discussions**: [Model discussions](https://huggingface.co/Lyon28/caca-250M-untrained/discussions)
- **Issues**: [Report bugs](https://huggingface.co/Lyon28/caca-250M-untrained/discussions)

### Related Models

<div align="center">

| Model Size | Parameters | Link |
|------------|------------|------|
| Tiny | 1M - 50M | [caca-1M](../caca-1M-untrained) to [caca-50M](../caca-50M-untrained) |
| Small | 75M - 500M | [caca-75M](../caca-75M-untrained) to [caca-500M](../caca-500M-untrained) |
| Medium | 600M - 1B | [caca-600M](../caca-600M-untrained) to [caca-1B](../caca-1B-untrained) |
| Large | 1.5B - 5B | [caca-1.5B](../caca-1.5B-untrained) to [caca-5B](../caca-5B-untrained) |
| XL | 6B - 10B | [caca-6B](../caca-6B-untrained) to [caca-10B](../caca-10B-untrained) |
| XXL | 12B+ | [caca-12B](../caca-12B-untrained) to [caca-70B](../caca-70B-untrained) |

</div>

---

## Contributing

Contributions are very welcome! Some ways to contribute:

- **Report bugs**: Found a bug? [Open an issue](https://huggingface.co/Lyon28/caca-250M-untrained/discussions)
- **Suggest features**: Have an idea? Share it in the discussions
- **Improve docs**: PRs for documentation are welcome
- **Share results**: Trained the model? Share your results on the model card
- **Star & Share**: Help this project grow

---

## License & Citation

### License

This model is released under the **Apache License 2.0**:

- Free for commercial use
- Free for research use
- Modification & distribution allowed
- No warranty

### Citation

If you use this model in your research or projects, please cite:

```bibtex
@misc{caca250M2025,
  author = {Lyon},
  title = {Caca-250M: Modern Transformer Architecture with GQA and Advanced Features},
  year = {2025},
  publisher = {Hugging Face},
  journal = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/Lyon28/caca-250M-untrained}},
}
```

### Acknowledgments

This model is inspired by, and implements ideas from, a range of recent research:

#### **Core Architecture**
- **LLaMA** (Meta AI, 2023) - Base decoder-only architecture, RMSNorm, SwiGLU
  - Paper: [LLaMA: Open and Efficient Foundation Language Models](https://arxiv.org/abs/2302.13971)
- **GPT-3** (OpenAI, 2020) - Transformer language modeling paradigm
- **PaLM** (Google, 2022) - SwiGLU activation function

#### **Attention Mechanisms**
- **Flash Attention v2** (Tri Dao et al., 2023) - Efficient attention with IO-awareness
  - Paper: [FlashAttention-2: Faster Attention with Better Parallelism](https://arxiv.org/abs/2307.08691)
- **Grouped Query Attention (GQA)** (Ainslie et al., Google, 2023) - Memory-efficient attention
  - Paper: [GQA: Training Generalized Multi-Query Transformer](https://arxiv.org/abs/2305.13245)
- **Multi-Query Attention (MQA)** (Shazeer, Google, 2019) - Fast decoding
- **xFormers** (Meta AI, 2022) - Memory-efficient attention implementations
- **PyTorch SDPA** (PyTorch Team, 2023) - Built-in scaled dot-product attention

#### **Position Encodings**
- **RoPE** (Su et al., 2021) - Rotary Position Embeddings
  - Paper: [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/abs/2104.09864)
- **ALiBi** (Press et al., 2022) - Attention with Linear Biases for extrapolation
  - Paper: [Train Short, Test Long: Attention with Linear Biases](https://arxiv.org/abs/2108.12409)
- **YaRN** (Peng et al., 2023) - Yet another RoPE extensioN for long context
  - Paper: [YaRN: Efficient Context Window Extension](https://arxiv.org/abs/2309.00071)

#### **Long Context & Efficiency**
- **Sliding Window Attention** (Mistral AI, 2023) - Local attention patterns
  - Paper: [Mistral 7B](https://arxiv.org/abs/2310.06825)
- **StreamingLLM / Attention Sink** (Xiao et al., MIT, 2023) - Streaming over very long sequences
  - Paper: [Efficient Streaming Language Models with Attention Sinks](https://arxiv.org/abs/2309.17453)
- **Logit Softcapping** (Google Gemma, 2024) - Prevents attention logits from overflowing
  - Paper: [Gemma: Open Models Based on Gemini](https://arxiv.org/abs/2403.08295)

#### **Mixture of Experts (MoE)**
- **Mixtral 8x7B** (Mistral AI, 2024) - Sparse MoE architecture
  - Paper: [Mixtral of Experts](https://arxiv.org/abs/2401.04088)
- **Switch Transformers** (Fedus et al., Google, 2021) - Sparse scaling with simplified routing
  - Paper: [Switch Transformers: Scaling to Trillion Parameter Models](https://arxiv.org/abs/2101.03961)
- **GLaM** (Du et al., Google, 2021) - Generalist Language Model with MoE
- **Expert Choice Routing** (Zhou et al., Google, 2022) - Improved load balancing

#### **Training Optimizations**
- **Layer Scale** (Touvron et al., Meta, 2021) - Training stability for deep networks
  - Paper: [Going Deeper with Image Transformers (CaiT)](https://arxiv.org/abs/2103.17239)
- **Stochastic Depth** (Huang et al., 2016) - Regularization via random layer dropping
  - Paper: [Deep Networks with Stochastic Depth](https://arxiv.org/abs/1603.09382)
- **Mixture of Depths (MoD)** (Raposo et al., Google DeepMind, 2024) - Dynamic compute allocation
  - Paper: [Mixture-of-Depths: Dynamically allocating compute in transformer-based models](https://arxiv.org/abs/2404.02258)
- **Gradient Checkpointing** (Chen et al., 2016) - Memory-efficient training

#### **Quantization**
- **LLM.int8()** (Dettmers et al., 2022) - 8-bit matrix multiplication
  - Paper: [LLM.int8(): 8-bit Matrix Multiplication for Transformers](https://arxiv.org/abs/2208.07339)
- **QLoRA** (Dettmers et al., 2023) - 4-bit quantized LoRA fine-tuning
  - Paper: [QLoRA: Efficient Finetuning of Quantized LLMs](https://arxiv.org/abs/2305.14314)
- **GPTQ** (Frantar et al., 2022) - Post-training quantization
- **bitsandbytes** (Dettmers) - Efficient quantization library

#### **Multimodal Components**
- **Vision Transformer (ViT)** (Dosovitskiy et al., Google, 2020) - Image encoding
  - Paper: [An Image is Worth 16x16 Words](https://arxiv.org/abs/2010.11929)
- **Perceiver Resampler** (Alayrac et al., DeepMind, 2022) - Multimodal fusion
  - Paper: [Flamingo: a Visual Language Model](https://arxiv.org/abs/2204.14198)
- **Q-Former** (Li et al., Salesforce, 2023) - Query-based multimodal alignment
  - Paper: [BLIP-2: Bootstrapping Language-Image Pre-training](https://arxiv.org/abs/2301.12597)
- **Whisper** (Radford et al., OpenAI, 2022) - Audio encoding inspiration

#### **Normalization & Activations**
- **RMSNorm** (Zhang & Sennrich, 2019) - Root Mean Square Layer Normalization
  - Paper: [Root Mean Square Layer Normalization](https://arxiv.org/abs/1910.07467)
- **SwiGLU** (Shazeer, Google, 2020) - GLU activation variant
  - Paper: [GLU Variants Improve Transformer](https://arxiv.org/abs/2002.05202)

#### **Implementation & Tools**
- **Hugging Face Transformers** - Model implementation framework
- **PyTorch** - Deep learning framework
- **Safetensors** - Secure tensor serialization format
- **Accelerate** - Distributed training utilities

---

**Special Thanks to:**

- The Indonesian NLP community
- The Hugging Face team
- The open-source AI research community

## Limitations & Bias

### Limitations

- **Untrained**: The model has not been trained; its output is random
- **No tokenizer**: You need to prepare your own tokenizer (see the sketch after this list)
- **No safety**: No content filtering or alignment yet
- **Memory intensive**: Training requires large GPUs

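One way to prepare a tokenizer is to train a new vocabulary on your own corpus, matching the 32,000-token vocab size the architecture expects. A hedged sketch using `train_new_from_iterator` on an existing fast tokenizer; the base tokenizer and corpus path are placeholders, not part of this repository:

```python
from transformers import AutoTokenizer

# Placeholder base: any LLaMA-style fast tokenizer can serve as the starting point.
base = AutoTokenizer.from_pretrained("hf-internal-testing/llama-tokenizer")

def corpus_iterator(path="corpus.txt", batch_size=1000):
    # Stream the training corpus in batches of lines.
    with open(path, encoding="utf-8") as f:
        batch = []
        for line in f:
            batch.append(line)
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:
            yield batch

# Train a new vocabulary matching the model's vocab_size of 32,000.
tokenizer = base.train_new_from_iterator(corpus_iterator(), vocab_size=32_000)
tokenizer.save_pretrained("./caca-tokenizer")
```
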
### Potential Biases

The model will inherit biases from whatever training data is used. Keep in mind:

- **Language**: Bias toward the majority language in the dataset
- **Culture**: Bias toward particular cultural perspectives
- **Gender & demographics**: Potential stereotypes
- **Factuality**: It can generate inaccurate information

**Recommendation**: Evaluate and filter outputs before any deployment.

---

## Support & Contact

### Community

- **Discussions**: [HF Discussions](https://huggingface.co/Lyon28/caca-250M-untrained/discussions)

### Contact

For questions or collaboration:

- Email: cacatransformers@gmail.com
- HF Profile: [@Lyon28](https://huggingface.co/Lyon28)

---

<div align="center">

## Star History

[Star History Chart](https://star-history.com/#Lyon-28/caca-transformers&Date)

---

### Built with love for Indonesia's AI community

**Thank you for using Caca!**

If this project is useful to you, please consider:

- Starring the repository
- Sharing it with friends
- Joining the discussions
- Contributing to the project

---

</div>

### A Quote from Caca

<div align="center">
<img src="https://quotes-caca.vercel.app/api/SsQuote" alt="Daily Quote" width="100%" />
</div>