---
license: apache-2.0
language:
- id
- en
tags:
- text-generation
- pytorch
- causal-lm
- transformer
- untrained
- gqa
- rope
- swiglu
- rmsnorm
- flash-attention
- indonesian
library_name: transformers
pipeline_tag: text-generation
widget:
  - text: "Jakarta adalah ibu kota"
    example_title: "๐Ÿ‡ฎ๐Ÿ‡ฉ Text Completion (ID)"
  - text: |
      Pertanyaan: Apa itu kecerdasan buatan?
      Jawaban:
    example_title: "๐Ÿ‡ฎ๐Ÿ‡ฉ Question Answering (ID)"
  - text: |
      Tulis cerita pendek tentang robot yang belajar mencintai.
    example_title: "๐Ÿ‡ฎ๐Ÿ‡ฉ Creative Writing (ID)"
  - text: "The capital of Indonesia is"
    example_title: "๐Ÿ‡ฌ๐Ÿ‡ง Text Completion (EN)"
  - text: |
      Question: What is artificial intelligence?
      Answer:
    example_title: "๐Ÿ‡ฌ๐Ÿ‡ง Question Answering (EN)"
  - text: |
      def fibonacci(n):
          """Hitung bilangan fibonacci ke-n"""
    example_title: "๐Ÿ’ป Code Completion"
  - text: |
      def reverse_string(s):
    example_title: "๐Ÿ’ป Code Generation"
  - text: |
      User: Halo! Siapa kamu?
      Assistant:
    example_title: "๐Ÿ’ฌ Chat Format (ID)"
  - text: |
      User: Jelaskan tentang machine learning dalam 2 kalimat.
      Assistant:
    example_title: "๐Ÿ’ฌ Conversational (ID)"
inference:
  parameters:
    max_new_tokens: 100
    temperature: 0.7
    top_p: 0.9
    top_k: 50
    do_sample: true
    repetition_penalty: 1.1
    num_beams: 1
datasets: []
metrics:
- perplexity
model-index:
- name: caca-5M
  results: []
---

<div align="center">

<img src="https://i.postimg.cc/MTSj073X/logo.png" width="400" alt="caca-5M"/>

# 🚀 CACA-5M

### A Modern Transformer Model with an Advanced Architecture

[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![PyTorch](https://img.shields.io/badge/PyTorch-2.0+-red.svg)](https://pytorch.org/)
[![Transformers](https://img.shields.io/badge/🤗%20Transformers-4.35+-yellow.svg)](https://github.com/huggingface/transformers)

**24,253,696** parameters • **24.25M** • **8 layers**

[📖 Documentation](#documentation) • [🚀 Quick Start](#quick-start) • [💡 Features](#key-features) • [🔧 Training](#training-guide) • [📊 Specifications](#technical-specifications)

---

</div>

## โš ๏ธ PENTING: Model Belum Dilatih (Untrained)

> **PERHATIAN**: Ini adalah model yang **belum melalui proses training**. Bobot model masih dalam kondisi random initialization. Output yang dihasilkan akan **tidak bermakna dan acak**.

**Status Model:**
- ๐Ÿ”ด **Belum dilatih** - Bobot masih random
- ๐ŸŸก **Hanya untuk riset** - Eksperimen arsitektur & training
- ๐ŸŸข **Siap dilatih** - Arsitektur sudah teruji

Widget di atas hanya menunjukkan **format input yang diharapkan**. Setelah model dilatih dengan dataset yang tepat, format yang sama akan menghasilkan output berkualitas.

---

## 📋 Description

**Caca** is a Large Language Model (LLM) architecture that combines several state-of-the-art deep-learning techniques. The model is designed with a focus on **efficiency**, **scalability**, and **high performance**.

<blockquote style="border-left: 4px solid #4A90E2; padding-left: 16px; margin: 16px 0; color: #555;">
<p><strong>Caca</strong> is an open-source Indonesian LLM experiment, built from scratch by one person, step by step. It is not trying to compete with anyone; it is simply an exploration of what can be done with a limited budget, unlimited passion, and a collaborative mindset. If it turns out to be useful to others, wonderful. If not, it is still fun.</p>
<p>This is an exploratory project, so if it fails, that is part of the learning process. If it succeeds, that is a bonus.</p>
</blockquote>

### 🎯 Key Strengths

- **🇮🇩 Bilingual Support**: Optimized for Indonesian and English
- **⚡ Ultra Fast**: Flash Attention 2 for up to 3x faster inference
- **💾 Memory Efficient**: Grouped Query Attention roughly halves the KV cache
- **🎯 Long Context**: Supports up to 2,048 tokens
- **🔧 Modular**: Flexible architecture with many configuration options

---

## ✨ Key Features

### 🎯 Core Features

- ✅ **Grouped Query Attention (GQA)** - Superior memory and compute efficiency (see the KV-cache sketch below)
  - Query heads: 4
  - KV heads: 2
  - Ratio: 2:1 (roughly 50% smaller KV cache)

- ✅ **Rotary Position Embeddings (RoPE)** - Better generalization to long contexts
  - Theta: 10000
  - Supports extrapolation beyond the training length

- ✅ **RMSNorm** - More stable normalization, up to 50% faster than LayerNorm
  - Epsilon: 1e-06

- ✅ **SwiGLU Activation** - 10-15% better performance than ReLU/GELU
  - Intermediate size: 1,024

- ✅ **Flash Attention 2** - Up to 3x speedup with better memory efficiency
  - Enabled automatically when CUDA is available

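As a quick sanity check on the GQA savings, the sketch below (illustrative only, using the numbers from this card) compares the KV-cache footprint of standard multi-head attention with the 4-query / 2-KV-head configuration used here:

```python
def kv_cache_bytes(num_kv_heads: int, head_dim: int = 64, num_layers: int = 8,
                   seq_len: int = 2048, batch_size: int = 1, bytes_per_elem: int = 2):
    # K and V are each (batch, num_kv_heads, seq_len, head_dim) per layer, stored in BF16.
    return 2 * batch_size * num_kv_heads * seq_len * head_dim * num_layers * bytes_per_elem

mha = kv_cache_bytes(num_kv_heads=4)  # standard MHA: one KV head per query head
gqa = kv_cache_bytes(num_kv_heads=2)  # this model: 4 query heads share 2 KV heads

print(f"MHA KV cache: {mha / 1e6:.1f} MB")  # ~16.8 MB at 2,048 tokens
print(f"GQA KV cache: {gqa / 1e6:.1f} MB")  # ~8.4 MB, i.e. about a 50% reduction
```

Larger savings appear only with more aggressive query-to-KV ratios than the 2:1 used here.
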
### 🔥 Advanced Features

### 🎯 Attention Mechanisms
- ⚡ **Flash Attention v2** - 3x faster with an IO-aware algorithm
- 🔑 **Grouped Query Attention (GQA)** - 4 query heads : 2 KV heads
- 🚀 **xFormers Support** - Memory-efficient attention fallback
- 🎯 **PyTorch SDPA** - Native scaled dot product attention

### 📏 Position Encodings
- 🔄 **RoPE** - Rotary embeddings (θ=10000); see the sketch below

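For readers new to rotary embeddings, here is a minimal, self-contained sketch of how RoPE rotates query/key channel pairs by position. It uses the interleaved-pair formulation from the RoFormer paper; the actual implementation in this repository may use a different (but equivalent) layout:

```python
import torch

def rope_angles(seq_len: int, head_dim: int = 64, theta: float = 10000.0):
    # One rotation frequency per pair of channels, as in the RoFormer paper.
    inv_freq = 1.0 / (theta ** (torch.arange(0, head_dim, 2).float() / head_dim))
    angles = torch.outer(torch.arange(seq_len).float(), inv_freq)  # (seq_len, head_dim // 2)
    return angles.cos(), angles.sin()

def apply_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    # x: (batch, num_heads, seq_len, head_dim); rotate each (even, odd) channel pair.
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out

cos, sin = rope_angles(seq_len=16)
q = torch.randn(1, 4, 16, 64)         # (batch, query heads, seq_len, head_dim)
print(apply_rope(q, cos, sin).shape)  # torch.Size([1, 4, 16, 64])
```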

### 🎓 Training Optimizations
- 💾 **Gradient Checkpointing** - Memory-efficient training
- 🎯 **Mixed Precision** - BF16 & FP16 support

### 📦 Quantization Support
- 4️⃣ **4-bit Quantization** - NF4, FP4 via bitsandbytes (see the loading sketch below)
- 8️⃣ **8-bit Quantization** - LLM.int8() support
- 🔄 **Double Quantization** - Further compression

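The quantization bullets above map directly onto `bitsandbytes` options in 🤗 Transformers. A minimal 4-bit (NF4 + double quantization) loading sketch, assuming `bitsandbytes` is installed and a CUDA GPU is available:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit weights (NF4)
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,         # double quantization for extra compression
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in BF16
)

model = AutoModelForCausalLM.from_pretrained(
    "Lyon28/caca-5M-untrained",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
```

Quantization matters little for a 24M-parameter model, but the same configuration carries over to the larger Caca variants listed under Related Models.
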
### ๐Ÿ› ๏ธ Optimization Features

- ๐Ÿ’พ **KV Cache** - Generasi autoregressive 5-10x lebih cepat
- ๐Ÿ”ง **Gradient Checkpointing** - Training model besar dengan memory terbatas
- ๐Ÿ“ฆ **Quantization Ready** - Support 4-bit & 8-bit quantization
- ๐ŸŽฏ **Mixed Precision Training** - BF16 & FP16 support

---

## 📊 Technical Specifications

<div align="center">

| Specification | Detail |
|---------------|--------|
| **💎 Total Parameters** | **24,253,696** (24.25M) |
| **📏 Hidden Size** | 256 |
| **🔢 Intermediate Size** | 1,024 |
| **🏗️ Num Layers** | 8 |
| **🎯 Attention Heads** | 4 |
| **🔑 KV Heads** | 2 (GQA) |
| **📐 Head Dimension** | 64 |
| **📚 Vocab Size** | 32,000 tokens |
| **📖 Max Context** | 2,048 tokens |
| **🏛️ Architecture** | Decoder-only Transformer |
| **🎨 Model Type** | Causal Language Model |

</div>

### ๐Ÿ“ Arsitektur Detail

<details>
<summary><b>๐Ÿ” Klik untuk lihat struktur lengkap</b></summary>

```
CacaForCausalLM (24.25M)
│
├─ Embedding Layer
│  └─ Token Embeddings: 32,000 × 256
│     └─ Parameters: 8,192,000
│
├─ Transformer Layers (8x)
│  │
│  ├─ Layer {i} (repeated 8 times)
│  │  │
│  │  ├─ Input LayerNorm (RMSNorm)
│  │  │  └─ Params: 256
│  │  │
│  │  ├─ Self-Attention (Grouped Query Attention)
│  │  │  ├─ Q Projection: 256 → 256
│  │  │  ├─ K Projection: 256 → 128
│  │  │  ├─ V Projection: 256 → 128
│  │  │  ├─ O Projection: 256 → 256
│  │  │  ├─ RoPE Embeddings: θ=10000
│  │  │  └─ Flash Attention 2 (if available)
│  │  │
│  │  ├─ Post-Attention LayerNorm (RMSNorm)
│  │  │  └─ Params: 256
│  │  │
│  │  ├─ MLP (SwiGLU)
│  │  │  ├─ Gate: 256 → 1,024
│  │  │  ├─ Up: 256 → 1,024
│  │  │  ├─ Activation: SiLU (Swish)
│  │  │  └─ Down: 1,024 → 256
│  │  │
│  │  └─ Residual Connections (2x per layer)
│  │
│  └─ Total Layer Params: ~0.98M per layer
│
├─ Final LayerNorm (RMSNorm)
│  └─ Params: 256
│
└─ LM Head (Output Projection)
   └─ Linear: 256 → 32,000
      └─ Parameters: 8,192,000
```

**Parameter breakdown:**
- Token embeddings: `32,000 × 256 = 8,192,000`
- Transformer layers: `8 layers × ~0.98M ≈ 7.87M`
- LM head: `256 × 32,000 = 8,192,000`
- **Total: 24,253,696 parameters**
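
The breakdown can be reproduced from the configuration alone. A small sanity-check script (assuming bias-free projections and untied input/output embeddings, which is what the reported total implies):

```python
hidden, intermediate, layers, vocab = 256, 1024, 8, 32000
n_q_heads, n_kv_heads, head_dim = 4, 2, 64

attn = hidden * (n_q_heads * head_dim)         # Q projection: 256 -> 256
attn += 2 * hidden * (n_kv_heads * head_dim)   # K and V projections: 256 -> 128 each
attn += (n_q_heads * head_dim) * hidden        # O projection: 256 -> 256
mlp = 3 * hidden * intermediate                # gate, up, down
norms = 2 * hidden                             # two RMSNorms per layer

per_layer = attn + mlp + norms
total = vocab * hidden + layers * per_layer + hidden + vocab * hidden
print(f"{per_layer:,} per layer, {total:,} total")  # 983,552 per layer, ~24.25M total
```

The script lands within about a thousand parameters of the reported 24,253,696; the small residual presumably comes from components not shown in the tree above.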

</details>
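
To make the per-layer blocks concrete, here is a minimal PyTorch re-implementation of the RMSNorm and SwiGLU MLP sub-modules with the dimensions above. This is an illustrative sketch, not the repository's actual code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    def __init__(self, dim: int = 256, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        # Normalize by the root mean square only: no mean subtraction, no bias.
        rms = torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)

class SwiGLUMLP(nn.Module):
    def __init__(self, hidden: int = 256, intermediate: int = 1024):
        super().__init__()
        self.gate = nn.Linear(hidden, intermediate, bias=False)
        self.up = nn.Linear(hidden, intermediate, bias=False)
        self.down = nn.Linear(intermediate, hidden, bias=False)

    def forward(self, x):
        # SwiGLU: SiLU(gate(x)) * up(x), projected back down to the hidden size.
        return self.down(F.silu(self.gate(x)) * self.up(x))

x = torch.randn(2, 10, 256)
print(SwiGLUMLP()(RMSNorm()(x)).shape)  # torch.Size([2, 10, 256])
```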

---

## 🚀 Quick Start

### 📦 Installation

```bash
# Base dependencies (quote the version specifiers so the shell does not treat ">" as a redirect)
pip install "torch>=2.0.0" "transformers>=4.35.0" accelerate safetensors

# Optional: for maximum performance
pip install flash-attn --no-build-isolation  # Flash Attention 2
pip install xformers                          # Memory efficient attention
pip install bitsandbytes                      # Quantization support
```

### 💻 Basic Usage

#### 1️⃣ Load the Model

```python
from transformers import AutoModelForCausalLM, AutoConfig
import torch

# Load configuration
config = AutoConfig.from_pretrained(
    "Lyon28/caca-5M-untrained",
    trust_remote_code=True
)

print(f"Model: {config.model_type}")
print(f"Parameters: 24,253,696")
print(f"Hidden size: {config.hidden_size}")
print(f"Layers: {config.num_hidden_layers}")

# Load model
model = AutoModelForCausalLM.from_pretrained(
    "Lyon28/caca-5M-untrained",
    config=config,
    torch_dtype=torch.bfloat16,  # use BF16 for efficiency
    device_map="auto",            # automatically place the model on available GPUs
    trust_remote_code=True
)

print(f"Model loaded! Device: {model.device}")
```

#### 2๏ธโƒฃ Verifikasi Model

```python
# Count total and trainable parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"Total parameters: {total_params:,}")
print(f"Trainable parameters: {trainable_params:,}")
print(f"Model size: {total_params * 2 / 1e9:.2f} GB (BF16)")

# Test forward pass
batch_size, seq_len = 2, 10
input_ids = torch.randint(0, config.vocab_size, (batch_size, seq_len))
input_ids = input_ids.to(model.device)

with torch.no_grad():
    outputs = model(input_ids)

print(f"Output shape: {outputs.logits.shape}")
print("โœ… Model berfungsi dengan baik!")
```

#### 3๏ธโƒฃ Generate Text (Setelah Training)

```python
from transformers import AutoTokenizer

# Load a tokenizer (use one that matches your training setup)
tokenizer = AutoTokenizer.from_pretrained("your-tokenizer-here")

# Prepare input
text = "Jelaskan tentang kecerdasan buatan"
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Generate
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    temperature=0.7,
    top_p=0.9,
    top_k=50,
    do_sample=True,
    repetition_penalty=1.1,
    pad_token_id=tokenizer.eos_token_id
)

# Decode
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```

---

## 🔧 Training Guide

### 📚 Dataset Preparation

```python
from datasets import load_dataset

# Load a dataset (example)
dataset = load_dataset("indonesian-nlp/id-wikipedia")

# Or load from a local file
from datasets import Dataset
import pandas as pd

df = pd.read_csv("your_data.csv")
dataset = Dataset.from_pandas(df)

print(f"Dataset size: {len(dataset)}")
```
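
The `Trainer` below expects token IDs rather than raw text, so tokenize the dataset first. A minimal sketch, assuming the dataset has a `text` column and that you have already prepared a tokenizer whose vocabulary matches the model's 32,000-token `vocab_size` (this repository does not ship one; `"your-tokenizer-here"` is a placeholder):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("your-tokenizer-here")

def tokenize_fn(batch):
    # Truncate to the model's 2,048-token context window.
    return tokenizer(batch["text"], truncation=True, max_length=2048)

dataset = dataset.map(tokenize_fn, batched=True, remove_columns=["text"])
```

The `DataCollatorForLanguageModeling` used in the next step then handles padding and label creation for causal LM training.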

### 🎯 Training Configuration

```python
from transformers import Trainer, TrainingArguments
from transformers import DataCollatorForLanguageModeling

# Training arguments
training_args = TrainingArguments(
    # Output
    output_dir="./caca-caca-5M-trained",
    run_name="caca-caca-5M-v1",
    
    # Training
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,  # Effective batch size = 32
    learning_rate=2e-4,
    weight_decay=0.1,
    warmup_steps=2000,
    
    # Optimization
    bf16=True,                      # Mixed precision training
    gradient_checkpointing=True,     # saves memory
    optim="adamw_torch_fused",      # fused AdamW optimizer
    max_grad_norm=1.0,
    
    # Logging & Evaluation
    logging_steps=10,
    logging_first_step=True,
    eval_strategy="steps",
    eval_steps=500,
    save_steps=1000,
    save_total_limit=3,
    
    # Hub integration
    push_to_hub=True,
    hub_model_id="your-username/caca-caca-5M-trained",
    hub_strategy="every_save",
    
    # Distributed training
    ddp_find_unused_parameters=False,
    dataloader_num_workers=4,
)

# Data collator
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False  # causal LM, not masked LM
)

# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    data_collator=data_collator,
)

# Train!
print("๐Ÿš€ Starting training...")
trainer.train()

# Save final model
print("๐Ÿ’พ Saving model...")
trainer.save_model("./caca-caca-5M-final")
trainer.push_to_hub()

print("โœ… Training complete!")
```
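
Perplexity is the metric listed for this model card; a short sketch of how to report it from the Trainer's evaluation loss once an eval set is available:

```python
import math

eval_metrics = trainer.evaluate()
perplexity = math.exp(eval_metrics["eval_loss"])
print(f"Eval loss: {eval_metrics['eval_loss']:.4f} | Perplexity: {perplexity:.2f}")
```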

### 📊 Resource Estimates

<details>
<summary><b>💰 Click to see training cost & time estimates</b></summary>

**Hardware Requirements:**

| GPU | Memory | Batch Size | Speed | Est. Time (100B tokens) |
|-----|--------|------------|-------|-------------------------|
| RTX 3090 (24GB) | 24GB | 1-2 | ~1K tok/s | ~30 days |
| A100 (40GB) | 40GB | 4-8 | ~5K tok/s | ~6 days |
| A100 (80GB) | 80GB | 8-16 | ~8K tok/s | ~4 days |
| 8×A100 (80GB) | 640GB | 64+ | ~50K tok/s | ~14 hours |

**Cloud Costs (approximate):**
- AWS p4d.24xlarge (8×A100): ~$32/hour × 24 hours = **~$768/day**
- GCP a2-ultragpu-8g: ~$30/hour × 24 hours = **~$720/day**
- Lambda Labs (8×A100): ~$15/hour × 24 hours = **~$360/day**

**Tips for reducing cost:**
- Use spot instances (60-70% cheaper)
- Use gradient accumulation to reach larger effective batch sizes
- Use mixed precision (BF16) for roughly 2x speedup
- Use gradient checkpointing to save memory

</details>

---

## 💬 Chat Format

This model supports a standard chat format:

```python
# Single-turn
messages = [
    {"role": "user", "content": "Halo! Siapa kamu?"},
]

# Multi-turn conversation
messages = [
    {"role": "system", "content": "Kamu adalah asisten AI yang membantu."},
    {"role": "user", "content": "Jelaskan tentang fotosintesis"},
    {"role": "assistant", "content": "Fotosintesis adalah proses..."},
    {"role": "user", "content": "Apa manfaatnya bagi manusia?"},
]

# Apply chat template
formatted = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

print(formatted)
# Output:
# System: Kamu adalah asisten AI yang membantu.
#
# User: Jelaskan tentang fotosintesis
# Assistant: Fotosintesis adalah proses...
# User: Apa manfaatnya bagi manusia?
# Assistant:
```
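
Note that `apply_chat_template` only works once the tokenizer carries a chat template, and this repository ships no tokenizer. The Jinja template below is only a sketch of one that would reproduce the `System:` / `User:` / `Assistant:` layout shown above:

```python
tokenizer.chat_template = (
    "{% for message in messages %}"
    "{% if message['role'] == 'system' %}System: {{ message['content'] }}\n\n"
    "{% elif message['role'] == 'user' %}User: {{ message['content'] }}\n"
    "{% else %}Assistant: {{ message['content'] }}\n"
    "{% endif %}"
    "{% endfor %}"
    "{% if add_generation_prompt %}Assistant:{% endif %}"
)
```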

---

## 🎯 Use Cases

### ✅ Good For:

- 🔬 **Research**: Experimenting with modern LLM architectures
- 📚 **Education**: Learning about transformers and training
- 🎓 **Academia**: Papers, theses, course projects
- 🚀 **Base Model**: Fine-tuning for specific tasks
- 💡 **Proof of Concept**: Testing ideas before scaling up

### ❌ Not Suitable For:

- 🚫 **Production**: The model has not been trained
- 🚫 **Real-world apps**: Output is still random
- 🚫 **Safety-critical use**: No safety alignment yet
- 🚫 **Direct deployment**: Training is required first

---

## 📖 Documentation

### 🔗 Important Links

- 📚 **Hugging Face Docs**: [huggingface.co/docs/transformers](https://huggingface.co/docs/transformers)
- 💻 **GitHub**: [Lyon-28/caca-transformers](https://github.com/Lyon-28/caca-transformers)
- 💬 **Discussions**: [Model discussions](https://huggingface.co/Lyon28/caca-5M-untrained/discussions)
- 🐛 **Issues**: [Report bugs](https://huggingface.co/Lyon28/caca-5M-untrained/discussions)

### 📝 Related Models

<div align="center">

| Model Size | Parameters | Link |
|------------|------------|------|
| ๐Ÿฃ Tiny | 1M - 50M | [caca-1M](../caca-1M-untrained) to [caca-50M](../caca-50M-untrained) |
| ๐Ÿฅ Small | 75M - 500M | [caca-75M](../caca-75M-untrained) to [caca-500M](../caca-500M-untrained) |
| ๐Ÿฆ… Medium | 600M - 1B | [caca-600M](../caca-600M-untrained) to [caca-1B](../caca-1B-untrained) |
| ๐Ÿฆ Large | 1.5B - 5B | [caca-1.5B](../caca-1.5B-untrained) to [caca-5B](../caca-5B-untrained) |
| ๐Ÿ‰ XL | 6B - 10B | [caca-6B](../caca-6B-untrained) to [caca-10B](../caca-10B-untrained) |
| ๐Ÿฆ– XXL | 12B+ | [caca-12B](../caca-12B-untrained) to [caca-70B](../caca-70B-untrained) |

</div>

---

## ๐Ÿค Contributing

Kami sangat terbuka untuk kontribusi! Beberapa cara untuk berkontribusi:

- ๐Ÿ› **Report bugs**: Temukan bug? [Buka issue](https://huggingface.co/Lyon28/caca-5M-untrained/discussions)
- ๐Ÿ’ก **Suggest features**: Punya ide? Share di discussions
- ๐Ÿ“ **Improve docs**: PR welcome untuk dokumentasi
- ๐ŸŽ“ **Share results**: Training hasil? Share di model card
- โญ **Star & Share**: Bantu project ini berkembang

---

## 📜 License & Citation

### 📄 License

This model is released under the **Apache License 2.0**:
- ✅ Free for commercial use
- ✅ Free for research use
- ✅ Modification & redistribution allowed
- ✅ No warranty

### 📚 Citation

If you use this model in research or a project, please cite:

```bibtex
@misc{caca5M2025,
  author = {Lyon},
  title = {Caca-5M: Modern Transformer Architecture with GQA and Advanced Features},
  year = {2025},
  publisher = {Hugging Face},
  journal = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/Lyon28/caca-5M-untrained}},
}
```

### ๐Ÿ™ Acknowledgments

Model ini terinspirasi dan mengimplementasikan berbagai penelitian terkini:

#### ๐Ÿ—๏ธ **Core Architecture**
- **LLaMA** (Meta AI, 2023) - Base decoder-only architecture, RMSNorm, SwiGLU
  - Paper: [LLaMA: Open and Efficient Foundation Language Models](https://arxiv.org/abs/2302.13971)
- **GPT-3** (OpenAI, 2020) - Transformer language modeling paradigm
- **PaLM** (Google, 2022) - SwiGLU activation function

#### 🎯 **Attention Mechanisms**
- **Flash Attention v2** (Tri Dao et al., 2023) - Efficient attention with IO-awareness
  - Paper: [FlashAttention-2: Faster Attention with Better Parallelism](https://arxiv.org/abs/2307.08691)
- **Grouped Query Attention (GQA)** (Ainslie et al., Google, 2023) - Memory-efficient attention
  - Paper: [GQA: Training Generalized Multi-Query Transformer](https://arxiv.org/abs/2305.13245)
- **Multi-Query Attention (MQA)** (Shazeer, Google, 2019) - Fast decoding
- **xFormers** (Meta AI, 2022) - Memory efficient attention implementations
- **PyTorch SDPA** (PyTorch Team, 2023) - Built-in scaled dot product attention

#### ๐Ÿ“ **Position Encodings**
- **RoPE** (Su et al., EleutherAI, 2021) - Rotary Position Embeddings
  - Paper: [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/abs/2104.09864)
- **ALiBi** (Press et al., 2022) - Attention with Linear Biases for extrapolation
  - Paper: [Train Short, Test Long: Attention with Linear Biases](https://arxiv.org/abs/2108.12409)
- **YaRN** (Peng et al., 2023) - Yet another RoPE extensioN for long context
  - Paper: [YaRN: Efficient Context Window Extension](https://arxiv.org/abs/2309.00071)

#### 🪟 **Long Context & Efficiency**
- **Sliding Window Attention** (Mistral AI, 2023) - Local attention patterns
  - Paper: [Mistral 7B](https://arxiv.org/abs/2310.06825)
- **StreamingLLM / Attention Sink** (Xiao et al., MIT, 2023) - Infinite sequence lengths
  - Paper: [Efficient Streaming Language Models with Attention Sinks](https://arxiv.org/abs/2309.17453)
- **Logit Softcapping** (Google Gemma, 2024) - Prevent attention overflow
  - Paper: [Gemma: Open Models Based on Gemini](https://arxiv.org/abs/2403.08295)

#### 🧠 **Mixture of Experts (MoE)**
- **Mixtral 8x7B** (Mistral AI, 2024) - Sparse MoE architecture
  - Paper: [Mixtral of Experts](https://arxiv.org/abs/2401.04088)
- **Switch Transformers** (Fedus et al., Google, 2021) - Scaling with expert choice
  - Paper: [Switch Transformers: Scaling to Trillion Parameter Models](https://arxiv.org/abs/2101.03961)
- **GLaM** (Du et al., Google, 2021) - Generalist Language Model with MoE
- **Expert Choice Routing** (Zhou et al., Google, 2022) - Improved load balancing

#### 🎓 **Training Optimizations**
- **Layer Scale** (Touvron et al., Meta, 2021) - Training stability for deep networks
  - Paper: [Going Deeper with Image Transformers (CaiT)](https://arxiv.org/abs/2103.17239)
- **Stochastic Depth** (Huang et al., 2016) - Regularization via random layer dropping
  - Paper: [Deep Networks with Stochastic Depth](https://arxiv.org/abs/1603.09382)
- **Mixture of Depths (MoD)** (Raposo et al., Google DeepMind, 2024) - Dynamic compute allocation
  - Paper: [Mixture-of-Depths: Dynamically allocating compute in transformer-based models](https://arxiv.org/abs/2404.02258)
- **Gradient Checkpointing** (Chen et al., 2016) - Memory-efficient training

#### 📦 **Quantization**
- **LLM.int8()** (Dettmers et al., 2022) - 8-bit matrix multiplication
  - Paper: [LLM.int8(): 8-bit Matrix Multiplication for Transformers](https://arxiv.org/abs/2208.07339)
- **QLoRA** (Dettmers et al., 2023) - 4-bit quantized LoRA fine-tuning
  - Paper: [QLoRA: Efficient Finetuning of Quantized LLMs](https://arxiv.org/abs/2305.14314)
- **GPTQ** (Frantar et al., 2022) - Post-training quantization
- **bitsandbytes** (Dettmers) - Efficient quantization library

#### 🎨 **Multimodal Components**
- **Vision Transformer (ViT)** (Dosovitskiy et al., Google, 2020) - Image encoding
  - Paper: [An Image is Worth 16x16 Words](https://arxiv.org/abs/2010.11929)
- **Perceiver Resampler** (Alayrac et al., DeepMind, 2022) - Multimodal fusion
  - Paper: [Flamingo: a Visual Language Model](https://arxiv.org/abs/2204.14198)
- **Q-Former** (Li et al., Salesforce, 2023) - Query-based multimodal alignment
  - Paper: [BLIP-2: Bootstrapping Language-Image Pre-training](https://arxiv.org/abs/2301.12597)
- **Whisper** (Radford et al., OpenAI, 2022) - Audio encoding inspiration

#### 🛠️ **Normalization & Activations**
- **RMSNorm** (Zhang & Sennrich, 2019) - Root Mean Square Layer Normalization
  - Paper: [Root Mean Square Layer Normalization](https://arxiv.org/abs/1910.07467)
- **SwiGLU** (Shazeer, Google, 2020) - GLU activation variant
  - Paper: [GLU Variants Improve Transformer](https://arxiv.org/abs/2002.05202)

#### 🔧 **Implementation & Tools**
- **Hugging Face Transformers** - Model implementation framework
- **PyTorch** - Deep learning framework
- **Safetensors** - Secure tensor serialization format
- **Accelerate** - Distributed training utilities

---

**Special Thanks to:**
- 🇮🇩 Indonesian NLP Community
- 🤗 Hugging Face Team
- 🔬 Open source AI research community

## โš ๏ธ Limitations & Bias

### Keterbatasan

- ๐Ÿ”ด **Untrained**: Model belum dilatih, output random
- ๐ŸŸก **No Tokenizer**: Perlu prepare tokenizer sendiri
- ๐ŸŸก **No Safety**: Belum ada content filtering/alignment
- ๐ŸŸ  **Memory Intensive**: Training butuh GPU besar

### Potential Biases

Model ini akan mewarisi bias dari data training yang digunakan. Mohon perhatikan:

- **Bahasa**: Bias terhadap bahasa mayoritas di dataset
- **Kultur**: Bias terhadap perspektif kultur tertentu
- **Gender & Demografis**: Potential stereotypes
- **Faktual**: Bisa generate informasi tidak akurat

**Rekomendasi**: Lakukan evaluation & filtering sebelum deployment.

---

## 📞 Support & Contact

### 💬 Community

- **Discussions**: [HF Discussions](https://huggingface.co/Lyon28/caca-5M-untrained/discussions)

### 📧 Contact

For questions or collaboration:
- Email: cacatransformers@gmail.com
- HF Profile: [@Lyon28](https://huggingface.co/Lyon28)

---

<div align="center">

## 🌟 Star History

[![Star History Chart](https://api.star-history.com/svg?repos=Lyon-28/caca-transformers&type=Date)](https://star-history.com/#Lyon-28/caca-transformers&Date)

---

### ๐Ÿ’ Dibuat dengan โค๏ธ untuk komunitas AI Indonesia

**Terima kasih telah menggunakan Caca!**

Jika project ini bermanfaat, consider untuk:
- โญ Star repository ini
- ๐Ÿ”— Share ke teman-teman
- ๐Ÿ’ฌ Join discussions
- ๐Ÿค Contribute ke project

---

</div>

### A Quote from Caca
<div align="center">
  <img src="https://quotes-caca.vercel.app/api/SsQuote" alt="Daily Quote" width="100%" />
</div>