---
license: apache-2.0
language:
- id
- en
tags:
- text-generation
- pytorch
- causal-lm
- transformer
- untrained
- gqa
- rope
- swiglu
- rmsnorm
- flash-attention
- indonesian
library_name: transformers
pipeline_tag: text-generation
widget:
- text: "Jakarta adalah ibu kota"
  example_title: "Text Completion (ID)"
- text: |
    Pertanyaan: Apa itu kecerdasan buatan?
    Jawaban:
  example_title: "Question Answering (ID)"
- text: |
    Tulis cerita pendek tentang robot yang belajar mencintai.
  example_title: "Creative Writing (ID)"
- text: "The capital of Indonesia is"
  example_title: "Text Completion (EN)"
- text: |
    Question: What is artificial intelligence?
    Answer:
  example_title: "Question Answering (EN)"
- text: |
    def fibonacci(n):
        """Hitung bilangan fibonacci ke-n"""
  example_title: "Code Completion"
- text: |
    def reverse_string(s):
  example_title: "Code Generation"
- text: |
    User: Halo! Siapa kamu?
    Assistant:
  example_title: "Chat Format (ID)"
- text: |
    User: Jelaskan tentang machine learning dalam 2 kalimat.
    Assistant:
  example_title: "Conversational (ID)"
inference:
  parameters:
    max_new_tokens: 100
    temperature: 0.7
    top_p: 0.9
    top_k: 50
    do_sample: true
    repetition_penalty: 1.1
    num_beams: 1
datasets: []
metrics:
- perplexity
model-index:
- name: caca-250M
  results: []
---

<div align="center">

<img src="https://i.postimg.cc/MTSj073X/logo.png" width="400" alt="caca-250M"/>

# CACA-250M

### A Modern Transformer Model with an Advanced Architecture

[License: Apache 2.0](https://opensource.org/licenses/Apache-2.0) •
[Python](https://www.python.org/downloads/) •
[PyTorch](https://pytorch.org/) •
[Transformers](https://github.com/huggingface/transformers)

**430,493,696** parameters • **430.49M** • **24 layers**

[Documentation](#documentation) • [Quick Start](#quick-start) • [Features](#key-features) • [Training](#training-guide) • [Specifications](#technical-specifications)

---

</div>

## IMPORTANT: This Model Is Untrained

> **WARNING**: This model has **not** been trained. Its weights are still randomly initialized, so any generated output will be **random and meaningless**.

**Model status:**

- **Untrained** - the weights are still random
- **Research only** - intended for architecture & training experiments
- **Ready to train** - the architecture itself has been tested

The widgets above only illustrate the **expected input formats**. Once the model has been trained on a suitable dataset, the same formats will produce meaningful output.

---

## Description

**Caca** is a modern Large Language Model (LLM) architecture that combines a range of state-of-the-art deep learning techniques. The model is designed with a focus on **efficiency**, **scalability**, and **high performance**.

<blockquote style="border-left: 4px solid #4A90E2; padding-left: 16px; margin: 16px 0; color: #555;">
<p><strong>Caca</strong> is an open-source Indonesian LLM experiment, built from scratch by one person, step by step. It is not trying to compete with anyone; it is simply an exploration of what can be done with a limited budget, unlimited passion, and a collaborative mindset. If it turns out to be useful to others, wonderful. If not, it was still fun.</p>
<p>This is an exploratory project, so if it fails, that is part of the learning process. If it succeeds, that is a bonus.</p>
</blockquote>

### Key Advantages

- **Bilingual Support**: Optimized for Indonesian & English
- **Fast Inference**: Flash Attention 2 for up to 3x faster inference
- **Memory Efficient**: Grouped Query Attention cuts the KV cache by 75%
- **Long Context**: Supports up to 8,192 tokens
- **Modular**: Flexible architecture with a range of configuration options

---

## Key Features

### Core Features

- **Grouped Query Attention (GQA)** - better memory and compute efficiency
  - Query heads: 16
  - KV heads: 4
  - Ratio: 4:1 (75% smaller KV cache)

- **Rotary Position Embeddings (RoPE)** - better generalization to long contexts
  - Theta: 10000
  - Supports extrapolation beyond the training length

- **RMSNorm** - more stable normalization, roughly 50% faster than LayerNorm
  - Epsilon: 1e-06

- **SwiGLU Activation** - typically 10-15% better than ReLU/GELU
  - Intermediate size: 4,096

- **Flash Attention 2** - up to 3x faster attention with better memory efficiency (see the sketch after this list)
  - Enabled automatically when CUDA is available

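As a rough illustration, this is how an attention backend is usually selected through the Transformers API, together with the arithmetic behind the 75% KV-cache figure. Whether the custom `CacaForCausalLM` code honours the `attn_implementation` argument is an assumption, so treat this as a hedged sketch rather than the model's confirmed interface:

```python
import torch
from transformers import AutoModelForCausalLM

# Sketch only: `attn_implementation` is the standard Transformers argument for
# picking an attention backend; "flash_attention_2" requires the flash-attn
# package and a CUDA GPU, otherwise use "sdpa" or "eager".
model = AutoModelForCausalLM.from_pretrained(
    "Lyon28/caca-250M-untrained",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2",
    trust_remote_code=True,
)

# Back-of-the-envelope GQA saving: per token and per layer, the KV cache stores
# 2 (K and V) * num_kv_heads * head_dim values instead of 2 * num_heads * head_dim.
num_heads, num_kv_heads, head_dim = 16, 4, 64
mha_cache = 2 * num_heads * head_dim     # 2048 values per token per layer (full MHA)
gqa_cache = 2 * num_kv_heads * head_dim  # 512 values per token per layer (GQA)
print(f"KV cache reduction: {1 - gqa_cache / mha_cache:.0%}")  # 75%
```
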
### Advanced Features

### Attention Mechanisms

- **Flash Attention v2** - up to 3x faster attention with an IO-aware algorithm
- **Grouped Query Attention (GQA)** - 16 query heads : 4 KV heads
- **xFormers Support** - memory-efficient attention fallback
- **PyTorch SDPA** - native scaled dot-product attention

### Position Encodings

- **RoPE** - rotary embeddings (θ=10000)

### Long Context Features

### Training Optimizations

- **Gradient Checkpointing** - memory-efficient training
- **Mixed Precision** - BF16 & FP16 support

### Quantization Support

- **4-bit Quantization** - NF4, FP4 via bitsandbytes
- **8-bit Quantization** - LLM.int8() support
- **Double Quantization** - further compression (see the sketch after this list)

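A minimal sketch of how 4-bit NF4 loading with double quantization is usually wired up through bitsandbytes and Transformers. Since this checkpoint is untrained and relies on custom code, treat it as illustrative rather than a verified configuration for this repository:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Standard bitsandbytes 4-bit setup: NF4 quantization, double quantization,
# and BF16 compute for the dequantized matmuls.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "Lyon28/caca-250M-untrained",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
```
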
### Optimization Features

- **KV Cache** - 5-10x faster autoregressive generation
- **Gradient Checkpointing** - train large models with limited memory (see the sketch after this list)
- **Quantization Ready** - 4-bit & 8-bit quantization support
- **Mixed Precision Training** - BF16 & FP16 support

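A short sketch of how gradient checkpointing is typically enabled on a Transformers model before training. Disabling `use_cache` alongside it is a common convention (the KV cache only helps at generation time and conflicts with checkpointing); whether the Caca implementation requires this is an assumption:

```python
# Trade compute for memory during training: activations are recomputed in the
# backward pass instead of being stored.
model.gradient_checkpointing_enable()

# The KV cache is only useful for generation and is usually switched off while
# training with checkpointing enabled.
model.config.use_cache = False

# After training, re-enable the cache for fast autoregressive generation:
# model.config.use_cache = True
```
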
---

## Technical Specifications

<div align="center">

| Specification | Detail |
|---------------|--------|
| **Total Parameters** | **430,493,696** (430.49M) |
| **Hidden Size** | 1,024 |
| **Intermediate Size** | 4,096 |
| **Num Layers** | 24 |
| **Attention Heads** | 16 |
| **KV Heads** | 4 (GQA) |
| **Head Dimension** | 64 |
| **Vocab Size** | 32,000 tokens |
| **Max Context** | 8,192 tokens |
| **Architecture** | Decoder-only Transformer |
| **Model Type** | Causal Language Model |

</div>

### Architecture Details

<details>
<summary><b>Click to see the full structure</b></summary>

```
CacaForCausalLM (430.49M)
│
├── Embedding Layer
│   ├── Token Embeddings: 32,000 × 1,024
│   └── Parameters: 32,768,000
│
├── Transformer Layers (24x)
│   │
│   ├── Layer {i} (repeated 24 times)
│   │   │
│   │   ├── Input LayerNorm (RMSNorm)
│   │   │   └── Params: 1,024
│   │   │
│   │   ├── Self-Attention (Grouped Query Attention)
│   │   │   ├── Q Projection: 1,024 → 1,024
│   │   │   ├── K Projection: 1,024 → 256
│   │   │   ├── V Projection: 1,024 → 256
│   │   │   ├── O Projection: 1,024 → 1,024
│   │   │   ├── RoPE Embeddings: θ=10000
│   │   │   └── Flash Attention 2 (if available)
│   │   │
│   │   ├── Post-Attention LayerNorm (RMSNorm)
│   │   │   └── Params: 1,024
│   │   │
│   │   ├── MLP (SwiGLU)
│   │   │   ├── Gate: 1,024 → 4,096
│   │   │   ├── Up: 1,024 → 4,096
│   │   │   ├── Activation: SiLU (Swish)
│   │   │   └── Down: 4,096 → 1,024
│   │   │
│   │   └── Residual Connections (2x per layer)
│   │
│   └── Total Layer Params: ~15.2M per layer
│
├── Final LayerNorm (RMSNorm)
│   └── Params: 1,024
│
└── LM Head (Output Projection)
    ├── Linear: 1,024 → 32,000
    └── Parameters: 32,768,000
```

**Parameter breakdown:**

- Token embeddings: `32,000 × 1,024 = 32,768,000`
- Transformer layers: `24 layers × ~15.2M ≈ 365M`
- LM head: `1,024 × 32,000 = 32,768,000`
- **Total: 430,493,696 parameters**

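The same arithmetic as a small self-contained check, in plain Python using the dimensions from the table above. The exact total can differ by a few thousand parameters depending on implementation details (for example extra norm weights), so treat the result as approximate:

```python
vocab, hidden, inter, layers = 32_000, 1_024, 4_096, 24
kv_dim = 4 * 64  # num_kv_heads * head_dim = 256

embeddings = vocab * hidden                             # 32,768,000
attention  = 2 * hidden * hidden + 2 * hidden * kv_dim  # Q + O, K + V projections
mlp        = 3 * hidden * inter                         # gate, up, down
norms      = 2 * hidden                                 # input + post-attention RMSNorm
per_layer  = attention + mlp + norms                    # ~15.2M
lm_head    = hidden * vocab                             # 32,768,000
final_norm = hidden

total = embeddings + layers * per_layer + final_norm + lm_head
print(f"{total:,}")  # ~430.5M, close to the reported 430,493,696
```
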
</details>

---

## Quick Start

### Installation

```bash
# Core dependencies (quote the version specifiers so the shell does not treat ">" as a redirect)
pip install "torch>=2.0.0" "transformers>=4.35.0" accelerate safetensors

# Optional: for maximum performance
pip install flash-attn --no-build-isolation  # Flash Attention 2
pip install xformers                         # Memory-efficient attention
pip install bitsandbytes                     # Quantization support
```

### Basic Usage

#### 1. Load the Model

```python
from transformers import AutoModelForCausalLM, AutoConfig
import torch

# Load the configuration
config = AutoConfig.from_pretrained(
    "Lyon28/caca-250M-untrained",
    trust_remote_code=True
)

print(f"Model: {config.model_type}")
print(f"Parameters: 430,493,696")
print(f"Hidden size: {config.hidden_size}")
print(f"Layers: {config.num_hidden_layers}")

# Load the model
model = AutoModelForCausalLM.from_pretrained(
    "Lyon28/caca-250M-untrained",
    config=config,
    torch_dtype=torch.bfloat16,  # Use BF16 for efficiency
    device_map="auto",           # Automatically place weights on available GPUs
    trust_remote_code=True
)

print(f"Model loaded! Device: {model.device}")
```

#### 2. Verify the Model

```python
# Count the parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"Total parameters: {total_params:,}")
print(f"Trainable parameters: {trainable_params:,}")
print(f"Model size: {total_params * 2 / 1e9:.2f} GB (BF16)")

# Test a forward pass
batch_size, seq_len = 2, 10
input_ids = torch.randint(0, config.vocab_size, (batch_size, seq_len))
input_ids = input_ids.to(model.device)

with torch.no_grad():
    outputs = model(input_ids)

print(f"Output shape: {outputs.logits.shape}")
print("Model is working correctly!")
```

#### 3. Generate Text (After Training)

```python
from transformers import AutoTokenizer

# Load a tokenizer (use one that matches your training setup)
tokenizer = AutoTokenizer.from_pretrained("your-tokenizer-here")

# Prepare the input
text = "Jelaskan tentang kecerdasan buatan"
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Generate
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    temperature=0.7,
    top_p=0.9,
    top_k=50,
    do_sample=True,
    repetition_penalty=1.1,
    pad_token_id=tokenizer.eos_token_id
)

# Decode
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```

---

## Training Guide

### Dataset Preparation

```python
from datasets import load_dataset

# Load a dataset (example)
dataset = load_dataset("indonesian-nlp/id-wikipedia")

# Or load from a local file
from datasets import Dataset
import pandas as pd

df = pd.read_csv("your_data.csv")
dataset = Dataset.from_pandas(df)

print(f"Dataset size: {len(dataset)}")
```

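Before the raw text can be fed to the Trainer below, it has to be tokenized and split. A minimal sketch, assuming a `tokenizer` has already been loaded and the dataset has a `text` column (both assumptions, since this repository ships no tokenizer):

```python
# Tokenize the raw text column; truncate to a modest length for early experiments.
def tokenize_fn(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = dataset.map(
    tokenize_fn,
    batched=True,
    remove_columns=dataset.column_names,  # keep only input_ids / attention_mask
)

# Hold out a small validation split so the Trainer's eval_strategy has data to use.
splits = tokenized.train_test_split(test_size=0.01, seed=42)
dataset = {"train": splits["train"], "validation": splits["test"]}
```
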
### Training Configuration

```python
from transformers import Trainer, TrainingArguments
from transformers import DataCollatorForLanguageModeling

# Training arguments
training_args = TrainingArguments(
    # Output
    output_dir="./caca-250M-trained",
    run_name="caca-250M-v1",

    # Training
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,  # Effective batch size = 32
    learning_rate=2e-4,
    weight_decay=0.1,
    warmup_steps=2000,

    # Optimization
    bf16=True,                    # Mixed-precision training
    gradient_checkpointing=True,  # Save memory
    optim="adamw_torch_fused",    # Fast fused AdamW
    max_grad_norm=1.0,

    # Logging & evaluation
    logging_steps=10,
    logging_first_step=True,
    eval_strategy="steps",
    eval_steps=500,
    save_steps=1000,
    save_total_limit=3,

    # Hub integration
    push_to_hub=True,
    hub_model_id="your-username/caca-250M-trained",
    hub_strategy="every_save",

    # Distributed training
    ddp_find_unused_parameters=False,
    dataloader_num_workers=4,
)

# Data collator
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False  # Causal LM, not masked LM
)

# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    data_collator=data_collator,
)

# Train!
print("Starting training...")
trainer.train()

# Save the final model
print("Saving model...")
trainer.save_model("./caca-250M-final")
trainer.push_to_hub()

print("Training complete!")
```

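Perplexity is the metric listed for this model; here is a small sketch of how it is commonly derived from the Trainer's evaluation loss (an assumption about your evaluation setup, not an official script for this repository):

```python
import math

# The Trainer reports the mean cross-entropy loss over the validation set;
# perplexity is simply exp(loss).
metrics = trainer.evaluate()
perplexity = math.exp(metrics["eval_loss"])
print(f"Validation perplexity: {perplexity:.2f}")
```
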
### Resource Estimates

<details>
<summary><b>Click for estimated training cost & time</b></summary>

**Hardware Requirements:**

| GPU | Memory | Batch Size | Speed | Est. Time (100B tokens) |
|-----|--------|------------|-------|-------------------------|
| RTX 3090 (24GB) | 24GB | 1-2 | ~1K tok/s | ~30 days |
| A100 (40GB) | 40GB | 4-8 | ~5K tok/s | ~6 days |
| A100 (80GB) | 80GB | 8-16 | ~8K tok/s | ~4 days |
| 8×A100 (80GB) | 640GB | 64+ | ~50K tok/s | ~14 hours |

**Cloud Costs (approximate):**

- AWS p4d.24xlarge (8×A100): ~$32/hour × 24 hours = **~$768/day**
- GCP a2-ultragpu-8g: ~$30/hour × 24 hours = **~$720/day**
- Lambda Labs (8×A100): ~$15/hour × 24 hours = **~$360/day**

**Tips for reducing cost:**

- Use spot instances (60-70% cheaper)
- Use gradient accumulation to reach larger effective batch sizes
- Use mixed precision (BF16) for roughly 2x speedup
- Use gradient checkpointing to save memory

</details>

---

## Chat Format

The model is intended to support a standard chat format:

```python
# Single-turn
messages = [
    {"role": "user", "content": "Halo! Siapa kamu?"},
]

# Multi-turn conversation
messages = [
    {"role": "system", "content": "Kamu adalah asisten AI yang membantu."},
    {"role": "user", "content": "Jelaskan tentang fotosintesis"},
    {"role": "assistant", "content": "Fotosintesis adalah proses..."},
    {"role": "user", "content": "Apa manfaatnya bagi manusia?"},
]

# Apply the chat template
formatted = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

print(formatted)
# Output:
# System: Kamu adalah asisten AI yang membantu.
#
# User: Jelaskan tentang fotosintesis
# Assistant: Fotosintesis adalah proses...
# User: Apa manfaatnya bagi manusia?
# Assistant:
```

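Because no tokenizer (and therefore no chat template) ships with this repository, `apply_chat_template` only works once a template has been attached to whichever tokenizer you use. A hedged sketch of a Jinja template that reproduces the `System:` / `User:` / `Assistant:` layout shown above:

```python
# Illustrative only: a minimal Jinja chat template matching the format above.
tokenizer.chat_template = (
    "{% for message in messages %}"
    "{% if message['role'] == 'system' %}System: {{ message['content'] }}\n\n"
    "{% elif message['role'] == 'user' %}User: {{ message['content'] }}\n"
    "{% elif message['role'] == 'assistant' %}Assistant: {{ message['content'] }}\n"
    "{% endif %}"
    "{% endfor %}"
    "{% if add_generation_prompt %}Assistant:{% endif %}"
)
```
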
---

## Use Cases

### Suitable For:

- **Research**: Experiments with modern LLM architectures
- **Education**: Learning about transformers & training
- **Academic work**: Papers, theses, projects
- **Base model**: Fine-tuning for specific tasks
- **Proof of concept**: Testing ideas before scaling up

### Not Suitable For:

- **Production**: The model has not been trained
- **Real-world apps**: Output is still random
- **Safety-critical use**: No safety alignment yet
- **Direct deployment**: Training is required first

---

## Documentation

### Important Links

- **Hugging Face Docs**: [huggingface.co/docs/transformers](https://huggingface.co/docs/transformers)
- **GitHub**: [Lyon-28/caca-transformers](https://github.com/Lyon-28/caca-transformers)
- **Discussions**: [Model discussions](https://huggingface.co/Lyon28/caca-250M-untrained/discussions)
- **Issues**: [Report bugs](https://huggingface.co/Lyon28/caca-250M-untrained/discussions)

### Related Models

<div align="center">

| Model Size | Parameters | Link |
|------------|------------|------|
| Tiny | 1M - 50M | [caca-1M](../caca-1M-untrained) to [caca-50M](../caca-50M-untrained) |
| Small | 75M - 500M | [caca-75M](../caca-75M-untrained) to [caca-500M](../caca-500M-untrained) |
| Medium | 600M - 1B | [caca-600M](../caca-600M-untrained) to [caca-1B](../caca-1B-untrained) |
| Large | 1.5B - 5B | [caca-1.5B](../caca-1.5B-untrained) to [caca-5B](../caca-5B-untrained) |
| XL | 6B - 10B | [caca-6B](../caca-6B-untrained) to [caca-10B](../caca-10B-untrained) |
| XXL | 12B+ | [caca-12B](../caca-12B-untrained) to [caca-70B](../caca-70B-untrained) |

</div>

---

## Contributing

Contributions are very welcome! Some ways to contribute:

- **Report bugs**: Found a bug? [Open an issue](https://huggingface.co/Lyon28/caca-250M-untrained/discussions)
- **Suggest features**: Have an idea? Share it in the discussions
- **Improve docs**: PRs for documentation are welcome
- **Share results**: Trained the model? Share your results on the model card
- **Star & Share**: Help this project grow

---

## License & Citation

### License

This model is released under the **Apache License 2.0**:

- Free for commercial use
- Free for research use
- Modification & distribution allowed
- No warranty

### Citation

If you use this model in your research or projects, please cite:

```bibtex
@misc{caca250M2025,
  author = {Lyon},
  title = {Caca-250M: Modern Transformer Architecture with GQA and Advanced Features},
  year = {2025},
  publisher = {Hugging Face},
  journal = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/Lyon28/caca-250M-untrained}},
}
```

### Acknowledgments

This model is inspired by, and implements ideas from, a range of recent research:

#### **Core Architecture**
- **LLaMA** (Meta AI, 2023) - Base decoder-only architecture, RMSNorm, SwiGLU
  - Paper: [LLaMA: Open and Efficient Foundation Language Models](https://arxiv.org/abs/2302.13971)
- **GPT-3** (OpenAI, 2020) - Transformer language modeling paradigm
- **PaLM** (Google, 2022) - SwiGLU activation function

#### **Attention Mechanisms**
- **Flash Attention v2** (Tri Dao et al., 2023) - Efficient attention with IO-awareness
  - Paper: [FlashAttention-2: Faster Attention with Better Parallelism](https://arxiv.org/abs/2307.08691)
- **Grouped Query Attention (GQA)** (Ainslie et al., Google, 2023) - Memory-efficient attention
  - Paper: [GQA: Training Generalized Multi-Query Transformer](https://arxiv.org/abs/2305.13245)
- **Multi-Query Attention (MQA)** (Shazeer, Google, 2019) - Fast decoding
- **xFormers** (Meta AI, 2022) - Memory-efficient attention implementations
- **PyTorch SDPA** (PyTorch Team, 2023) - Built-in scaled dot-product attention

#### **Position Encodings**
- **RoPE** (Su et al., 2021) - Rotary Position Embeddings
  - Paper: [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/abs/2104.09864)
- **ALiBi** (Press et al., 2022) - Attention with Linear Biases for extrapolation
  - Paper: [Train Short, Test Long: Attention with Linear Biases](https://arxiv.org/abs/2108.12409)
- **YaRN** (Peng et al., 2023) - Yet another RoPE extensioN for long context
  - Paper: [YaRN: Efficient Context Window Extension](https://arxiv.org/abs/2309.00071)

#### **Long Context & Efficiency**
- **Sliding Window Attention** (Mistral AI, 2023) - Local attention patterns
  - Paper: [Mistral 7B](https://arxiv.org/abs/2310.06825)
- **StreamingLLM / Attention Sink** (Xiao et al., MIT, 2023) - Streaming over very long sequences
  - Paper: [Efficient Streaming Language Models with Attention Sinks](https://arxiv.org/abs/2309.17453)
- **Logit Softcapping** (Google Gemma, 2024) - Prevents attention logits from overflowing
  - Paper: [Gemma: Open Models Based on Gemini](https://arxiv.org/abs/2403.08295)

#### **Mixture of Experts (MoE)**
- **Mixtral 8x7B** (Mistral AI, 2024) - Sparse MoE architecture
  - Paper: [Mixtral of Experts](https://arxiv.org/abs/2401.04088)
- **Switch Transformers** (Fedus et al., Google, 2021) - Sparse scaling with simplified routing
  - Paper: [Switch Transformers: Scaling to Trillion Parameter Models](https://arxiv.org/abs/2101.03961)
- **GLaM** (Du et al., Google, 2021) - Generalist Language Model with MoE
- **Expert Choice Routing** (Zhou et al., Google, 2022) - Improved load balancing

#### **Training Optimizations**
- **Layer Scale** (Touvron et al., Meta, 2021) - Training stability for deep networks
  - Paper: [Going Deeper with Image Transformers (CaiT)](https://arxiv.org/abs/2103.17239)
- **Stochastic Depth** (Huang et al., 2016) - Regularization via random layer dropping
  - Paper: [Deep Networks with Stochastic Depth](https://arxiv.org/abs/1603.09382)
- **Mixture of Depths (MoD)** (Raposo et al., Google DeepMind, 2024) - Dynamic compute allocation
  - Paper: [Mixture-of-Depths: Dynamically allocating compute in transformer-based models](https://arxiv.org/abs/2404.02258)
- **Gradient Checkpointing** (Chen et al., 2016) - Memory-efficient training

#### **Quantization**
- **LLM.int8()** (Dettmers et al., 2022) - 8-bit matrix multiplication
  - Paper: [LLM.int8(): 8-bit Matrix Multiplication for Transformers](https://arxiv.org/abs/2208.07339)
- **QLoRA** (Dettmers et al., 2023) - 4-bit quantized LoRA fine-tuning
  - Paper: [QLoRA: Efficient Finetuning of Quantized LLMs](https://arxiv.org/abs/2305.14314)
- **GPTQ** (Frantar et al., 2022) - Post-training quantization
- **bitsandbytes** (Dettmers) - Efficient quantization library

#### **Multimodal Components**
- **Vision Transformer (ViT)** (Dosovitskiy et al., Google, 2020) - Image encoding
  - Paper: [An Image is Worth 16x16 Words](https://arxiv.org/abs/2010.11929)
- **Perceiver Resampler** (Alayrac et al., DeepMind, 2022) - Multimodal fusion
  - Paper: [Flamingo: a Visual Language Model](https://arxiv.org/abs/2204.14198)
- **Q-Former** (Li et al., Salesforce, 2023) - Query-based multimodal alignment
  - Paper: [BLIP-2: Bootstrapping Language-Image Pre-training](https://arxiv.org/abs/2301.12597)
- **Whisper** (Radford et al., OpenAI, 2022) - Audio encoding inspiration

#### **Normalization & Activations**
- **RMSNorm** (Zhang & Sennrich, 2019) - Root Mean Square Layer Normalization
  - Paper: [Root Mean Square Layer Normalization](https://arxiv.org/abs/1910.07467)
- **SwiGLU** (Shazeer, Google, 2020) - GLU activation variant
  - Paper: [GLU Variants Improve Transformer](https://arxiv.org/abs/2002.05202)

#### **Implementation & Tools**
- **Hugging Face Transformers** - Model implementation framework
- **PyTorch** - Deep learning framework
- **Safetensors** - Secure tensor serialization format
- **Accelerate** - Distributed training utilities

---

**Special Thanks to:**

- The Indonesian NLP community
- The Hugging Face team
- The open-source AI research community

## Limitations & Bias

### Limitations

- **Untrained**: The model has not been trained; its output is random
- **No tokenizer**: You need to prepare your own tokenizer (see the sketch after this list)
- **No safety**: No content filtering or alignment yet
- **Memory intensive**: Training requires large GPUs

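One way to prepare a tokenizer is to train a new vocabulary on your own corpus, matching the 32,000-token vocab size the architecture expects. A hedged sketch using `train_new_from_iterator` on an existing fast tokenizer; the base tokenizer and corpus path are placeholders, not part of this repository:

```python
from transformers import AutoTokenizer

# Placeholder base: any LLaMA-style fast tokenizer can serve as the starting point.
base = AutoTokenizer.from_pretrained("hf-internal-testing/llama-tokenizer")

def corpus_iterator(path="corpus.txt", batch_size=1000):
    # Stream the training corpus in batches of lines.
    with open(path, encoding="utf-8") as f:
        batch = []
        for line in f:
            batch.append(line)
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:
            yield batch

# Train a new vocabulary matching the model's vocab_size of 32,000.
tokenizer = base.train_new_from_iterator(corpus_iterator(), vocab_size=32_000)
tokenizer.save_pretrained("./caca-tokenizer")
```
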
### Potential Biases

The model will inherit biases from whatever training data is used. Keep in mind:

- **Language**: Bias toward the majority language in the dataset
- **Culture**: Bias toward particular cultural perspectives
- **Gender & demographics**: Potential stereotypes
- **Factuality**: It can generate inaccurate information

**Recommendation**: Evaluate and filter outputs before any deployment.

---

## Support & Contact

### Community

- **Discussions**: [HF Discussions](https://huggingface.co/Lyon28/caca-250M-untrained/discussions)

### Contact

For questions or collaboration:

- Email: cacatransformers@gmail.com
- HF Profile: [@Lyon28](https://huggingface.co/Lyon28)

---

<div align="center">

## Star History

[Star History Chart](https://star-history.com/#Lyon-28/caca-transformers&Date)

---

### Built with love for Indonesia's AI community

**Thank you for using Caca!**

If this project is useful to you, please consider:

- Starring the repository
- Sharing it with friends
- Joining the discussions
- Contributing to the project

---

</div>

### A Quote from Caca

<div align="center">
<img src="https://quotes-caca.vercel.app/api/SsQuote" alt="Daily Quote" width="100%" />
</div>