---
license: apache-2.0
language:
- id
- en
tags:
- text-generation
- pytorch
- causal-lm
- transformer
- untrained
- gqa
- rope
- swiglu
- rmsnorm
- flash-attention
- indonesian
library_name: transformers
pipeline_tag: text-generation
widget:
- text: "Jakarta adalah ibu kota"
example_title: "๐ฎ๐ฉ Text Completion (ID)"
- text: |
Pertanyaan: Apa itu kecerdasan buatan?
Jawaban:
example_title: "๐ฎ๐ฉ Question Answering (ID)"
- text: |
Tulis cerita pendek tentang robot yang belajar mencintai.
example_title: "๐ฎ๐ฉ Creative Writing (ID)"
- text: "The capital of Indonesia is"
example_title: "๐ฌ๐ง Text Completion (EN)"
- text: |
Question: What is artificial intelligence?
Answer:
example_title: "๐ฌ๐ง Question Answering (EN)"
- text: |
def fibonacci(n):
"""Hitung bilangan fibonacci ke-n"""
example_title: "๐ป Code Completion"
- text: |
def reverse_string(s):
example_title: "๐ป Code Generation"
- text: |
User: Halo! Siapa kamu?
Assistant:
example_title: "๐ฌ Chat Format (ID)"
- text: |
User: Jelaskan tentang machine learning dalam 2 kalimat.
Assistant:
example_title: "๐ฌ Conversational (ID)"
inference:
parameters:
max_new_tokens: 100
temperature: 0.7
top_p: 0.9
top_k: 50
do_sample: true
repetition_penalty: 1.1
num_beams: 1
datasets: []
metrics:
- perplexity
model-index:
- name: caca-5M
results: []
---
<div align="center">
<img src="https://i.postimg.cc/MTSj073X/logo.png" width="400" alt="caca-5M"/>

# CACA-5M

### A Modern Transformer Model with an Advanced Architecture

[License: Apache 2.0](https://opensource.org/licenses/Apache-2.0) ·
[Python](https://www.python.org/downloads/) ·
[PyTorch](https://pytorch.org/) ·
[Transformers](https://github.com/huggingface/transformers)

**24,253,696** parameters · **24.25M** · **8 layers**

[Documentation](#documentation) · [Quick Start](#quick-start) · [Features](#key-features) · [Training](#training-guide) · [Specifications](#technical-specifications)

---
</div>
## IMPORTANT: Untrained Model

> **WARNING**: This model has **not been trained**. Its weights are still randomly initialized, so any generated output will be **meaningless and random**.

**Model status:**
- **Untrained** - weights are still random
- **Research only** - intended for architecture and training experiments
- **Ready to train** - the architecture itself is in working order

The widgets above only illustrate the **expected input format**. Once the model has been trained on a suitable dataset, the same formats should produce useful output.
---
## Description

**Caca** is a modern Large Language Model (LLM) architecture that combines several state-of-the-art deep learning techniques. The model is designed with a focus on **efficiency**, **scalability**, and **strong performance**.

<blockquote style="border-left: 4px solid #4A90E2; padding-left: 16px; margin: 16px 0; color: #555;">
<p><strong>Caca</strong> is an open-source Indonesian LLM experiment, built from scratch by one person, step by step. It is not trying to compete with anyone; it is simply an exploration of what can be done with a limited budget, unlimited passion, and a collaborative mindset. If it turns out to be useful to others, alhamdulillah. If not, it was still fun to build.</p>
<p>This is an exploratory project: if it fails, that is part of the learning process. If it succeeds, that is a bonus.</p>
</blockquote>

### Key Strengths

- **Bilingual Support**: optimized for Indonesian and English
- **Fast Inference**: Flash Attention 2 for up to ~3x faster inference
- **Memory Efficient**: Grouped Query Attention roughly halves the KV cache
- **Long Context**: supports up to 2,048 tokens
- **Modular**: a flexible architecture with many configuration options
---
## Key Features

### Core Features

- **Grouped Query Attention (GQA)** - better memory and compute efficiency (see the sketch after this list)
  - Query heads: 4
  - KV heads: 2
  - Ratio: 2:1 (roughly 50% KV-cache savings vs. full multi-head attention)
- **Rotary Position Embeddings (RoPE)** - better generalization to long contexts
  - Theta: 10000
  - Supports extrapolation to contexts longer than the training length
- **RMSNorm** - more stable normalization, typically faster than LayerNorm
  - Epsilon: 1e-06
- **SwiGLU Activation** - reported to perform ~10-15% better than ReLU/GELU in some settings
  - Intermediate size: 1,024
- **Flash Attention 2** - up to ~3x speedup with better memory efficiency
  - Enabled automatically when CUDA is available
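
The KV-cache saving follows directly from the head counts above. The following is a minimal sketch of grouped query attention with the caca-5M shapes (hidden=256, 4 query heads, 2 KV heads, head_dim=64); it is illustrative only and not the repository's actual implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden, n_q, n_kv, head_dim = 256, 4, 2, 64
x = torch.randn(1, 10, hidden)                                # (batch, seq, hidden)

q_proj = nn.Linear(hidden, n_q * head_dim, bias=False)
k_proj = nn.Linear(hidden, n_kv * head_dim, bias=False)
v_proj = nn.Linear(hidden, n_kv * head_dim, bias=False)

q = q_proj(x).view(1, -1, n_q, head_dim).transpose(1, 2)      # (1, 4, seq, 64)
k = k_proj(x).view(1, -1, n_kv, head_dim).transpose(1, 2)     # (1, 2, seq, 64)
v = v_proj(x).view(1, -1, n_kv, head_dim).transpose(1, 2)

# Each KV head is shared by n_q // n_kv = 2 query heads, so the KV cache only
# stores 2 heads instead of 4: a ~50% reduction versus full multi-head attention.
k = k.repeat_interleave(n_q // n_kv, dim=1)                   # (1, 4, seq, 64)
v = v.repeat_interleave(n_q // n_kv, dim=1)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 4, 10, 64])
```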
### Advanced Features

### Attention Mechanisms
- **Flash Attention v2** - up to ~3x faster with an IO-aware algorithm
- **Grouped Query Attention (GQA)** - 4 query heads : 2 KV heads
- **xFormers Support** - memory-efficient attention fallback
- **PyTorch SDPA** - native scaled dot product attention

### Position Encodings
- **RoPE** - rotary embeddings (θ=10000); see the sketch below
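
A minimal, self-contained sketch of how RoPE rotates query/key features by a position-dependent angle (θ=10000 and head_dim=64, as above). This is illustrative and not the repository's implementation:

```python
import torch

def rope(x, theta=10000.0):
    # x: (batch, heads, seq, head_dim); rotate pairs of dims by position-dependent angles
    b, h, s, d = x.shape
    inv_freq = 1.0 / (theta ** (torch.arange(0, d, 2, dtype=torch.float32) / d))
    angles = torch.arange(s, dtype=torch.float32)[:, None] * inv_freq[None, :]  # (seq, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

q = torch.randn(1, 4, 10, 64)
print(rope(q).shape)  # torch.Size([1, 4, 10, 64])
```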
### Long Context Features

### Training Optimizations
- **Gradient Checkpointing** - memory-efficient training
- **Mixed Precision** - BF16 & FP16 support

### Quantization Support
- **4-bit Quantization** - NF4, FP4 via bitsandbytes (see the loading sketch below)
- **8-bit Quantization** - LLM.int8() support
- **Double Quantization** - further compression
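
As a hedged example, 4-bit loading with double quantization might look like the following (it assumes a CUDA GPU with the `bitsandbytes` package installed; the exact settings are illustrative, not a recommendation):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # NF4 quantization
    bnb_4bit_use_double_quant=True,      # double quantization for extra compression
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_4bit = AutoModelForCausalLM.from_pretrained(
    "Lyon28/caca-5M-untrained",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
```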
### Optimization Features
- **KV Cache** - roughly 5-10x faster autoregressive generation
- **Gradient Checkpointing** - train larger models with limited memory
- **Quantization Ready** - 4-bit & 8-bit quantization supported
- **Mixed Precision Training** - BF16 & FP16 support
---
## Technical Specifications

<div align="center">

| Specification | Detail |
|---------------|--------|
| **Total Parameters** | **24,253,696** (24.25M) |
| **Hidden Size** | 256 |
| **Intermediate Size** | 1,024 |
| **Num Layers** | 8 |
| **Attention Heads** | 4 |
| **KV Heads** | 2 (GQA) |
| **Head Dimension** | 64 |
| **Vocab Size** | 32,000 tokens |
| **Max Context** | 2,048 tokens |
| **Architecture** | Decoder-only Transformer |
| **Model Type** | Causal Language Model |

</div>
### Architecture Details

<details>
<summary><b>Click to see the full structure</b></summary>

```
CacaForCausalLM (24.25M)
│
├─ Embedding Layer
│  ├─ Token Embeddings: 32,000 × 256
│  └─ Parameters: 8,192,000
│
├─ Transformer Layers (8x)
│  │
│  ├─ Layer {i} (repeated 8 times)
│  │  │
│  │  ├─ Input LayerNorm (RMSNorm)
│  │  │  └─ Params: 256
│  │  │
│  │  ├─ Self-Attention (Grouped Query Attention)
│  │  │  ├─ Q Projection: 256 → 256
│  │  │  ├─ K Projection: 256 → 128
│  │  │  ├─ V Projection: 256 → 128
│  │  │  ├─ O Projection: 256 → 256
│  │  │  ├─ RoPE Embeddings: θ=10000
│  │  │  └─ Flash Attention 2 (if available)
│  │  │
│  │  ├─ Post-Attention LayerNorm (RMSNorm)
│  │  │  └─ Params: 256
│  │  │
│  │  ├─ MLP (SwiGLU)
│  │  │  ├─ Gate: 256 → 1,024
│  │  │  ├─ Up: 256 → 1,024
│  │  │  ├─ Activation: SiLU (Swish)
│  │  │  └─ Down: 1,024 → 256
│  │  │
│  │  └─ Residual Connections (2x per layer)
│  │
│  └─ Total Layer Params: ~0.98M per layer
│
├─ Final LayerNorm (RMSNorm)
│  └─ Params: 256
│
└─ LM Head (Output Projection)
   ├─ Linear: 256 → 32,000
   └─ Parameters: 8,192,000
```

**Parameter breakdown:**
- Embeddings: `32,000 × 256 = 8,192,000`
- Transformer layers: `8 × ~0.98M ≈ 7.87M`
- LM head: `256 × 32,000 = 8,192,000`
- **Total: 24,253,696 parameters (~24.25M)**
</details>
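
As a sanity check, the breakdown can be reproduced from the configuration values. This simple formula lands within about a thousand parameters of the reported 24,253,696; the small remainder presumably comes from implementation details not shown in the tree:

```python
# Rough parameter count from the config (hidden=256, intermediate=1024,
# vocab=32000, 8 layers, 4 query heads / 2 KV heads, head_dim=64).
hidden, inter, vocab, layers, head_dim, n_kv = 256, 1024, 32000, 8, 64, 2

attn = hidden * hidden * 2 + hidden * (n_kv * head_dim) * 2   # Q, O + K, V projections
mlp = hidden * inter * 2 + inter * hidden                     # gate, up + down
norms = 2 * hidden                                            # two RMSNorms per layer
per_layer = attn + mlp + norms                                # ≈ 983,552

embeddings = vocab * hidden                                   # 8,192,000
lm_head = vocab * hidden                                      # 8,192,000 (untied)
total = embeddings + layers * per_layer + hidden + lm_head    # + final RMSNorm

print(f"{per_layer:,} per layer, {total:,} total")            # ≈ 24.25M
```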
---
## Quick Start

### Installation

```bash
# Core dependencies
pip install torch>=2.0.0 transformers>=4.35.0 accelerate safetensors

# Optional: for maximum performance
pip install flash-attn --no-build-isolation  # Flash Attention 2
pip install xformers                         # Memory-efficient attention
pip install bitsandbytes                     # Quantization support
```
### Basic Usage

#### 1. Load the Model

```python
from transformers import AutoModelForCausalLM, AutoConfig
import torch

# Load the configuration
config = AutoConfig.from_pretrained(
    "Lyon28/caca-5M-untrained",
    trust_remote_code=True
)
print(f"Model: {config.model_type}")
print(f"Parameters: 24,253,696")
print(f"Hidden size: {config.hidden_size}")
print(f"Layers: {config.num_hidden_layers}")

# Load the model
model = AutoModelForCausalLM.from_pretrained(
    "Lyon28/caca-5M-untrained",
    config=config,
    torch_dtype=torch.bfloat16,  # Use BF16 for efficiency
    device_map="auto",           # Automatically place on available GPUs
    trust_remote_code=True
)
print(f"Model loaded! Device: {model.device}")
```
#### 2. Verify the Model

```python
# Count parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total parameters: {total_params:,}")
print(f"Trainable parameters: {trainable_params:,}")
print(f"Model size: {total_params * 2 / 1e9:.2f} GB (BF16)")

# Test a forward pass
batch_size, seq_len = 2, 10
input_ids = torch.randint(0, config.vocab_size, (batch_size, seq_len))
input_ids = input_ids.to(model.device)
with torch.no_grad():
    outputs = model(input_ids)
print(f"Output shape: {outputs.logits.shape}")
print("Model works as expected!")
```
#### 3. Generate Text (After Training)

```python
from transformers import AutoTokenizer

# Load a tokenizer (use one that matches your training setup)
tokenizer = AutoTokenizer.from_pretrained("your-tokenizer-here")

# Prepare the input
text = "Jelaskan tentang kecerdasan buatan"
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Generate
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    temperature=0.7,
    top_p=0.9,
    top_k=50,
    do_sample=True,
    repetition_penalty=1.1,
    pad_token_id=tokenizer.eos_token_id
)

# Decode
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```
---
## Training Guide

### Dataset Preparation

```python
from datasets import load_dataset

# Load a dataset (example)
dataset = load_dataset("indonesian-nlp/id-wikipedia")

# Or load from a local file
from datasets import Dataset
import pandas as pd

df = pd.read_csv("your_data.csv")
dataset = Dataset.from_pandas(df)
print(f"Dataset size: {len(dataset)}")
```
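
Before handing the data to the `Trainer` below, it still has to be tokenized and packed into fixed-length blocks. A hedged sketch, assuming the tokenizer loaded earlier and a `text` column (the column name and block size are assumptions):

```python
block_size = 2048  # matches the model's maximum context length

def tokenize(batch):
    return tokenizer(batch["text"])

def group_texts(batch):
    # Concatenate all token lists, then cut them into block_size chunks for causal LM training
    concatenated = {k: sum(batch[k], []) for k in batch.keys()}
    total_len = (len(concatenated["input_ids"]) // block_size) * block_size
    return {
        k: [v[i:i + block_size] for i in range(0, total_len, block_size)]
        for k, v in concatenated.items()
    }

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])
lm_dataset = tokenized.map(group_texts, batched=True)
```

Pass the resulting `lm_dataset` splits to the `Trainer` in place of the raw `dataset`.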
### Training Configuration

```python
from transformers import Trainer, TrainingArguments
from transformers import DataCollatorForLanguageModeling

# Training arguments
training_args = TrainingArguments(
    # Output
    output_dir="./caca-5M-trained",
    run_name="caca-5M-v1",
    # Training
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,   # Effective batch size = 32
    learning_rate=2e-4,
    weight_decay=0.1,
    warmup_steps=2000,
    # Optimization
    bf16=True,                       # Mixed-precision training
    gradient_checkpointing=True,     # Save memory
    optim="adamw_torch_fused",       # Fast fused optimizer
    max_grad_norm=1.0,
    # Logging & evaluation
    logging_steps=10,
    logging_first_step=True,
    eval_strategy="steps",
    eval_steps=500,
    save_steps=1000,
    save_total_limit=3,
    # Hub integration
    push_to_hub=True,
    hub_model_id="your-username/caca-5M-trained",
    hub_strategy="every_save",
    # Distributed training
    ddp_find_unused_parameters=False,
    dataloader_num_workers=4,
)

# Data collator
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False  # Causal LM, not masked LM
)

# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    data_collator=data_collator,
)

# Train!
print("Starting training...")
trainer.train()

# Save the final model
print("Saving model...")
trainer.save_model("./caca-5M-final")
trainer.push_to_hub()
print("Training complete!")
```
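
Since the model card lists perplexity as its metric, a quick way to report it is to exponentiate the Trainer's evaluation loss. A short sketch, assuming the evaluation split configured above:

```python
import math

# Run evaluation and convert the cross-entropy loss into perplexity
eval_metrics = trainer.evaluate()
perplexity = math.exp(eval_metrics["eval_loss"])
print(f"Eval loss: {eval_metrics['eval_loss']:.4f} | Perplexity: {perplexity:.2f}")
```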
### Resource Estimates

<details>
<summary><b>Click for estimated training cost & time</b></summary>

**Hardware requirements:**

| GPU | Memory | Batch Size | Speed | Est. Time (100B tokens) |
|-----|--------|------------|-------|-------------------------|
| RTX 3090 (24GB) | 24GB | 1-2 | ~1K tok/s | ~1,160 days |
| A100 (40GB) | 40GB | 4-8 | ~5K tok/s | ~230 days |
| A100 (80GB) | 80GB | 8-16 | ~8K tok/s | ~145 days |
| 8×A100 (80GB) | 640GB | 64+ | ~50K tok/s | ~23 days |

Times follow directly from the listed throughput and scale linearly with the token budget (a 2.5B-token run, for example, is roughly 40x shorter).

**Cloud costs (approximate):**
- AWS p4d.24xlarge (8×A100): ~$32/hour × 24 hours = **~$768/day**
- GCP a2-ultragpu-8g: ~$30/hour × 24 hours = **~$720/day**
- Lambda Labs (8×A100): ~$15/hour × 24 hours = **~$360/day**

**Cost-saving tips:**
- Use spot instances (60-70% cheaper)
- Use gradient accumulation for larger effective batch sizes
- Use mixed precision (BF16) for up to ~2x speedup
- Use gradient checkpointing to save memory
</details>
---
## Chat Format

The model supports a standard chat message format:

```python
# Single-turn
messages = [
    {"role": "user", "content": "Halo! Siapa kamu?"},
]

# Multi-turn conversation
messages = [
    {"role": "system", "content": "Kamu adalah asisten AI yang membantu."},
    {"role": "user", "content": "Jelaskan tentang fotosintesis"},
    {"role": "assistant", "content": "Fotosintesis adalah proses..."},
    {"role": "user", "content": "Apa manfaatnya bagi manusia?"},
]

# Apply the chat template
formatted = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
print(formatted)
# Output:
# System: Kamu adalah asisten AI yang membantu.
#
# User: Jelaskan tentang fotosintesis
# Assistant: Fotosintesis adalah proses...
# User: Apa manfaatnya bagi manusia?
# Assistant:
```
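
The repository does not ship a tokenizer, so `apply_chat_template` only works once a chat template is attached to whichever tokenizer you use. A minimal sketch that reproduces the `System:`/`User:`/`Assistant:` layout shown above (the template itself is an assumption, not part of this repo):

```python
# Hypothetical template; adjust it before training a chat model on this format.
tokenizer.chat_template = (
    "{% for message in messages %}"
    "{% if message['role'] == 'system' %}System: {{ message['content'] }}\n\n"
    "{% elif message['role'] == 'user' %}User: {{ message['content'] }}\n"
    "{% else %}Assistant: {{ message['content'] }}\n"
    "{% endif %}{% endfor %}"
    "{% if add_generation_prompt %}Assistant:{% endif %}"
)
```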
---
## Use Cases

### Good Fit:
- **Research**: experiments with modern LLM architectures
- **Education**: learning about transformers & training
- **Academia**: papers, theses, course projects
- **Base model**: fine-tuning for specific tasks
- **Proof of concept**: testing ideas before scaling up

### Not a Good Fit:
- **Production**: the model has not been trained
- **Real-world apps**: output is still random
- **Safety-critical uses**: no safety alignment has been done
- **Direct deployment**: training is required first
---
## Documentation

### Key Links

- **Hugging Face Docs**: [huggingface.co/docs/transformers](https://huggingface.co/docs/transformers)
- **GitHub**: [Lyon-28/caca-transformers](https://github.com/Lyon-28/caca-transformers)
- **Discussions**: [Model discussions](https://huggingface.co/Lyon28/caca-5M-untrained/discussions)
- **Issues**: [Report bugs](https://huggingface.co/Lyon28/caca-5M-untrained/discussions)

### Related Models

<div align="center">

| Model Size | Parameters | Link |
|------------|------------|------|
| Tiny | 1M - 50M | [caca-1M](../caca-1M-untrained) to [caca-50M](../caca-50M-untrained) |
| Small | 75M - 500M | [caca-75M](../caca-75M-untrained) to [caca-500M](../caca-500M-untrained) |
| Medium | 600M - 1B | [caca-600M](../caca-600M-untrained) to [caca-1B](../caca-1B-untrained) |
| Large | 1.5B - 5B | [caca-1.5B](../caca-1.5B-untrained) to [caca-5B](../caca-5B-untrained) |
| XL | 6B - 10B | [caca-6B](../caca-6B-untrained) to [caca-10B](../caca-10B-untrained) |
| XXL | 12B+ | [caca-12B](../caca-12B-untrained) to [caca-70B](../caca-70B-untrained) |

</div>
---
## Contributing

Contributions are very welcome! Some ways to contribute:
- **Report bugs**: found a bug? [Open an issue](https://huggingface.co/Lyon28/caca-5M-untrained/discussions)
- **Suggest features**: have an idea? Share it in the discussions
- **Improve docs**: PRs for documentation are welcome
- **Share results**: trained the model? Share your results on the model card
- **Star & share**: help the project grow
---
## License & Citation

### License

This model is released under the **Apache License 2.0**:
- Free for commercial use
- Free for research use
- Modification & redistribution allowed
- Provided without warranty

### Citation

If you use this model in research or a project, please cite:
```bibtex
@misc{caca5M2025,
  author = {Lyon},
  title = {Caca-5M: Modern Transformer Architecture with GQA and Advanced Features},
  year = {2025},
  publisher = {Hugging Face},
  journal = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/Lyon28/caca-5M-untrained}},
}
```
### Acknowledgments

This model is inspired by, and implements ideas from, a wide range of recent research:
#### **Core Architecture**
- **LLaMA** (Meta AI, 2023) - Base decoder-only architecture, RMSNorm, SwiGLU
- Paper: [LLaMA: Open and Efficient Foundation Language Models](https://arxiv.org/abs/2302.13971)
- **GPT-3** (OpenAI, 2020) - Transformer language modeling paradigm
- **PaLM** (Google, 2022) - SwiGLU activation function
#### **Attention Mechanisms**
- **Flash Attention v2** (Tri Dao et al., 2023) - Efficient attention with IO-awareness
- Paper: [FlashAttention-2: Faster Attention with Better Parallelism](https://arxiv.org/abs/2307.08691)
- **Grouped Query Attention (GQA)** (Ainslie et al., Google, 2023) - Memory-efficient attention
- Paper: [GQA: Training Generalized Multi-Query Transformer](https://arxiv.org/abs/2305.13245)
- **Multi-Query Attention (MQA)** (Shazeer, Google, 2019) - Fast decoding
- **xFormers** (Meta AI, 2022) - Memory efficient attention implementations
- **PyTorch SDPA** (PyTorch Team, 2023) - Built-in scaled dot product attention
#### **Position Encodings**
- **RoPE** (Su et al., 2021) - Rotary Position Embeddings
- Paper: [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/abs/2104.09864)
- **ALiBi** (Press et al., 2022) - Attention with Linear Biases for extrapolation
- Paper: [Train Short, Test Long: Attention with Linear Biases](https://arxiv.org/abs/2108.12409)
- **YaRN** (Peng et al., 2023) - Yet another RoPE extensioN for long context
- Paper: [YaRN: Efficient Context Window Extension](https://arxiv.org/abs/2309.00071)
#### **Long Context & Efficiency**
- **Sliding Window Attention** (Mistral AI, 2023) - Local attention patterns
- Paper: [Mistral 7B](https://arxiv.org/abs/2310.06825)
- **StreamingLLM / Attention Sink** (Xiao et al., MIT, 2023) - Infinite sequence lengths
- Paper: [Efficient Streaming Language Models with Attention Sinks](https://arxiv.org/abs/2309.17453)
- **Logit Softcapping** (Google Gemma, 2024) - Prevent attention overflow
- Paper: [Gemma: Open Models Based on Gemini](https://arxiv.org/abs/2403.08295)
#### **Mixture of Experts (MoE)**
- **Mixtral 8x7B** (Mistral AI, 2024) - Sparse MoE architecture
- Paper: [Mixtral of Experts](https://arxiv.org/abs/2401.04088)
- **Switch Transformers** (Fedus et al., Google, 2021) - Scaling with expert choice
- Paper: [Switch Transformers: Scaling to Trillion Parameter Models](https://arxiv.org/abs/2101.03961)
- **GLaM** (Du et al., Google, 2021) - Generalist Language Model with MoE
- **Expert Choice Routing** (Zhou et al., Google, 2022) - Improved load balancing
#### **Training Optimizations**
- **Layer Scale** (Touvron et al., Meta, 2021) - Training stability for deep networks
- Paper: [Going Deeper with Image Transformers (CaiT)](https://arxiv.org/abs/2103.17239)
- **Stochastic Depth** (Huang et al., 2016) - Regularization via random layer dropping
- Paper: [Deep Networks with Stochastic Depth](https://arxiv.org/abs/1603.09382)
- **Mixture of Depths (MoD)** (Raposo et al., Google DeepMind, 2024) - Dynamic compute allocation
- Paper: [Mixture-of-Depths: Dynamically allocating compute in transformer-based models](https://arxiv.org/abs/2404.02258)
- **Gradient Checkpointing** (Chen et al., 2016) - Memory-efficient training
#### **Quantization**
- **LLM.int8()** (Dettmers et al., 2022) - 8-bit matrix multiplication
- Paper: [LLM.int8(): 8-bit Matrix Multiplication for Transformers](https://arxiv.org/abs/2208.07339)
- **QLoRA** (Dettmers et al., 2023) - 4-bit quantized LoRA fine-tuning
- Paper: [QLoRA: Efficient Finetuning of Quantized LLMs](https://arxiv.org/abs/2305.14314)
- **GPTQ** (Frantar et al., 2022) - Post-training quantization
- **bitsandbytes** (Dettmers) - Efficient quantization library
#### **Multimodal Components**
- **Vision Transformer (ViT)** (Dosovitskiy et al., Google, 2020) - Image encoding
- Paper: [An Image is Worth 16x16 Words](https://arxiv.org/abs/2010.11929)
- **Perceiver Resampler** (Alayrac et al., DeepMind, 2022) - Multimodal fusion
- Paper: [Flamingo: a Visual Language Model](https://arxiv.org/abs/2204.14198)
- **Q-Former** (Li et al., Salesforce, 2023) - Query-based multimodal alignment
- Paper: [BLIP-2: Bootstrapping Language-Image Pre-training](https://arxiv.org/abs/2301.12597)
- **Whisper** (Radford et al., OpenAI, 2022) - Audio encoding inspiration
#### **Normalization & Activations**
- **RMSNorm** (Zhang & Sennrich, 2019) - Root Mean Square Layer Normalization
- Paper: [Root Mean Square Layer Normalization](https://arxiv.org/abs/1910.07467)
- **SwiGLU** (Shazeer, Google, 2020) - GLU activation variant
- Paper: [GLU Variants Improve Transformer](https://arxiv.org/abs/2002.05202)
#### **Implementation & Tools**
- **Hugging Face Transformers** - Model implementation framework
- **PyTorch** - Deep learning framework
- **Safetensors** - Secure tensor serialization format
- **Accelerate** - Distributed training utilities
---
**Special Thanks to:**
- The Indonesian NLP community
- The Hugging Face team
- The open-source AI research community
## Limitations & Bias

### Limitations
- **Untrained**: the model has not been trained; output is random
- **No tokenizer**: you need to prepare your own tokenizer
- **No safety tuning**: no content filtering or alignment has been done
- **Memory intensive**: training requires substantial GPU resources

### Potential Biases

The model will inherit biases from whatever training data is used. Keep in mind:
- **Language**: bias toward the majority language in the dataset
- **Culture**: bias toward particular cultural perspectives
- **Gender & demographics**: potential stereotypes
- **Factuality**: it can generate inaccurate information

**Recommendation**: evaluate and filter before any deployment.
---
## Support & Contact

### Community
- **Discussions**: [HF Discussions](https://huggingface.co/Lyon28/caca-5M-untrained/discussions)

### Contact
For questions or collaboration:
- Email: cacatransformers@gmail.com
- HF Profile: [@Lyon28](https://huggingface.co/Lyon28)
---
<div align="center">

## Star History

[Star History Chart](https://star-history.com/#Lyon-28/caca-transformers&Date)

---

### Built with love for the Indonesian AI community

**Thank you for using Caca!**

If this project is useful to you, please consider:
- Starring the repository
- Sharing it with friends
- Joining the discussions
- Contributing to the project

---
</div>
### A Quote from Caca
<div align="center">
<img src="https://quotes-caca.vercel.app/api/SsQuote" alt="Daily Quote" width="100%" />
</div>