Lyon28 committed · verified · commit 75b684e · 1 parent: a963c81

Upload README.md with huggingface_hub

Files changed (1): README.md (+641, −102)

README.md:
---
license: apache-2.0
language:
- id
- en
tags:
- text-generation
- pytorch
- causal-lm
- transformer
- untrained
- gqa
- rope
- swiglu
- rmsnorm
- flash-attention
- indonesian
library_name: transformers
pipeline_tag: text-generation
widget:
- text: "Jakarta adalah ibu kota"
  example_title: "🇮🇩 Text Completion (ID)"
- text: |
    Pertanyaan: Apa itu kecerdasan buatan?
    Jawaban:
  example_title: "🇮🇩 Question Answering (ID)"
- text: |
    Tulis cerita pendek tentang robot yang belajar mencintai.
  example_title: "🇮🇩 Creative Writing (ID)"
- text: "The capital of Indonesia is"
  example_title: "🇬🇧 Text Completion (EN)"
- text: |
    Question: What is artificial intelligence?
    Answer:
  example_title: "🇬🇧 Question Answering (EN)"
- text: |
    def fibonacci(n):
        """Hitung bilangan fibonacci ke-n"""
  example_title: "💻 Code Completion"
- text: |
    def reverse_string(s):
  example_title: "💻 Code Generation"
- text: |
    User: Halo! Siapa kamu?
    Assistant:
  example_title: "💬 Chat Format (ID)"
- text: |
    User: Jelaskan tentang machine learning dalam 2 kalimat.
    Assistant:
  example_title: "💬 Conversational (ID)"
inference:
  parameters:
    max_new_tokens: 100
    temperature: 0.7
    top_p: 0.9
    top_k: 50
    do_sample: true
    repetition_penalty: 1.1
    num_beams: 1
datasets: []
metrics:
- perplexity
model-index:
- name: caca-5M
  results: []
---

<div align="center">

<img src="https://i.postimg.cc/MTSj073X/logo.png" width="400" alt="caca-5M"/>

# 🚀 CACA-5M

### A Modern Transformer Model with an Advanced Architecture

[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![PyTorch](https://img.shields.io/badge/PyTorch-2.0+-red.svg)](https://pytorch.org/)
[![Transformers](https://img.shields.io/badge/🤗%20Transformers-4.35+-yellow.svg)](https://github.com/huggingface/transformers)

**24,253,696** parameters • **24.25M** • **8 layers**

[📖 Documentation](#documentation) • [🚀 Quick Start](#quick-start) • [💡 Features](#key-features) • [🔧 Training](#training-guide) • [📊 Specifications](#technical-specifications)

---

</div>

## ⚠️ IMPORTANT: Untrained Model

> **WARNING**: This model has **not been trained**. The weights are still randomly initialized, so any output it produces will be **meaningless and random**.

**Model status:**
- 🔴 **Untrained** - weights are still random
- 🟡 **Research only** - for architecture & training experiments
- 🟢 **Ready to train** - the architecture itself has been exercised

The widgets above only illustrate the **expected input format**. Once the model has been trained on a suitable dataset, the same formats will produce meaningful output.

---

## 📋 Description

**Caca** is a modern Large Language Model (LLM) architecture that combines a range of state-of-the-art deep-learning techniques. The model is designed with a focus on **efficiency**, **scalability**, and **high performance**.

<blockquote style="border-left: 4px solid #4A90E2; padding-left: 16px; margin: 16px 0; color: #555;">
<p><strong>Caca</strong> is an open-source Indonesian LLM experiment, built from scratch by one person, step by step. It is not trying to compete with anyone; it is simply an exploration of what can be done with a limited budget, unlimited passion, and a collaborative mindset. If it turns out to be useful to others, wonderful. If not, it is still fun.</p>
<p>This is an exploratory project, so if it fails, that is part of the learning process. If it succeeds, that is a bonus.</p>
</blockquote>

### 🎯 Key Strengths

- **🇮🇩 Bilingual Support**: optimized for Indonesian & English
- **⚡ Fast Inference**: Flash Attention 2 for up to 3x faster inference
- **💾 Memory Efficient**: Grouped Query Attention halves the KV cache
- **🎯 Long Context**: supports up to 2,048 tokens
- **🔧 Modular**: flexible architecture with many configuration options

---

## ✨ Key Features

### 🎯 Core Features

- ✅ **Grouped Query Attention (GQA)** - superior memory and compute efficiency (see the sketch after this list)
  - Query heads: 4
  - KV heads: 2
  - Ratio: 2:1 (cuts the KV cache by 50%)

- ✅ **Rotary Position Embeddings (RoPE)** - better generalization to long contexts
  - Theta: 10000
  - Supports extrapolation beyond the training length

- ✅ **RMSNorm** - more stable normalization, roughly 50% faster than LayerNorm
  - Epsilon: 1e-06

- ✅ **SwiGLU Activation** - around 10-15% better than ReLU/GELU
  - Intermediate size: 1,024

- ✅ **Flash Attention 2** - up to 3x speedup with better memory efficiency
  - Enabled automatically when CUDA is available
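The 50% figure quoted for GQA can be checked with a quick back-of-the-envelope calculation using the head counts, head dimension, and 2,048-token context listed in this card (BF16, 2 bytes per value):

```python
# KV cache holds one K and one V vector per layer, per KV head, per cached token.
num_layers, head_dim, seq_len, bytes_bf16 = 8, 64, 2048, 2
num_query_heads, num_kv_heads = 4, 2

def kv_cache_bytes(n_heads: int) -> int:
    return 2 * num_layers * n_heads * head_dim * seq_len * bytes_bf16  # 2 = K and V

mha_cache = kv_cache_bytes(num_query_heads)  # if every query head kept its own K/V
gqa_cache = kv_cache_bytes(num_kv_heads)     # this model: 2 shared KV heads
print(f"MHA: {mha_cache / 2**20:.1f} MiB, GQA: {gqa_cache / 2**20:.1f} MiB, "
      f"saving: {1 - gqa_cache / mha_cache:.0%}")  # -> saving: 50%
```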
 
### 🔥 Advanced Features

### 🎯 Attention Mechanisms
- ⚡ **Flash Attention v2** - up to 3x faster with an IO-aware algorithm
- 🔑 **Grouped Query Attention (GQA)** - 4 query heads : 2 KV heads
- 🚀 **xFormers Support** - memory-efficient attention fallback
- 🎯 **PyTorch SDPA** - native scaled dot product attention

### 📏 Position Encodings
- 🔄 **RoPE** - rotary embeddings (θ=10000); sketched below
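To make the rotary idea concrete, here is a small, self-contained sketch of one common RoPE formulation (rotating interleaved even/odd channel pairs). It is illustrative only; the pairing convention used by this repository's modeling code may differ:

```python
import torch

def rope_inv_freq(head_dim: int = 64, theta: float = 10000.0) -> torch.Tensor:
    # One inverse frequency per channel pair, as in the RoFormer paper.
    return 1.0 / (theta ** (torch.arange(0, head_dim, 2).float() / head_dim))

def apply_rope(x: torch.Tensor, positions: torch.Tensor) -> torch.Tensor:
    # x: (..., seq_len, head_dim); rotate each (even, odd) channel pair by position * freq.
    freqs = torch.outer(positions.float(), rope_inv_freq(x.shape[-1]))  # (seq_len, head_dim/2)
    cos, sin = freqs.cos(), freqs.sin()
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out

q = torch.randn(1, 4, 16, 64)            # (batch, heads, seq, head_dim)
q_rot = apply_rope(q, torch.arange(16))  # positions 0..15
```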
 
### 🪟 Long Context Features
- 📖 Context window of up to 2,048 tokens

### 🎓 Training Optimizations
- 💾 **Gradient Checkpointing** - memory-efficient training
- 🎯 **Mixed Precision** - BF16 & FP16 support

### 📦 Quantization Support
- 4️⃣ **4-bit Quantization** - NF4, FP4 via bitsandbytes (example below)
- 8️⃣ **8-bit Quantization** - LLM.int8() support
- 🔄 **Double Quantization** - further compression
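Loading the model in 4-bit NF4 with double quantization via bitsandbytes might look like the following sketch (it assumes a CUDA GPU and that `bitsandbytes` is installed):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4 data type
    bnb_4bit_use_double_quant=True,         # double quantization for extra compression
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_4bit = AutoModelForCausalLM.from_pretrained(
    "Lyon28/caca-5M-untrained",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
```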
 
### 🛠️ Optimization Features

- 💾 **KV Cache** - 5-10x faster autoregressive generation
- 🔧 **Gradient Checkpointing** - train larger models within limited memory
- 📦 **Quantization Ready** - 4-bit & 8-bit quantization support
- 🎯 **Mixed Precision Training** - BF16 & FP16 support

---

## 📊 Technical Specifications

<div align="center">

| Specification | Detail |
|---------------|--------|
| **💎 Total Parameters** | **24,253,696** (24.25M) |
| **📏 Hidden Size** | 256 |
| **🔢 Intermediate Size** | 1,024 |
| **🏗️ Num Layers** | 8 |
| **🎯 Attention Heads** | 4 |
| **🔑 KV Heads** | 2 (GQA) |
| **📏 Head Dimension** | 64 |
| **📚 Vocab Size** | 32,000 tokens |
| **📖 Max Context** | 2,048 tokens |
| **🏛️ Architecture** | Decoder-only Transformer |
| **🎨 Model Type** | Causal Language Model |

</div>

### 📐 Architecture Details

<details>
<summary><b>🔍 Click to see the full structure</b></summary>

```
CacaForCausalLM (24.25M)
│
├─ Embedding Layer
│   └─ Token Embeddings: 32,000 × 256
│       └─ Parameters: 8,192,000
│
├─ Transformer Layers (8x)
│   │
│   ├─ Layer {i} (repeated 8 times)
│   │   │
│   │   ├─ Input LayerNorm (RMSNorm)
│   │   │   └─ Params: 256
│   │   │
│   │   ├─ Self-Attention (Grouped Query Attention)
│   │   │   ├─ Q Projection: 256 → 256
│   │   │   ├─ K Projection: 256 → 128
│   │   │   ├─ V Projection: 256 → 128
│   │   │   ├─ O Projection: 256 → 256
│   │   │   ├─ RoPE Embeddings: θ=10000
│   │   │   └─ Flash Attention 2 (if available)
│   │   │
│   │   ├─ Post-Attention LayerNorm (RMSNorm)
│   │   │   └─ Params: 256
│   │   │
│   │   ├─ MLP (SwiGLU)
│   │   │   ├─ Gate: 256 → 1,024
│   │   │   ├─ Up: 256 → 1,024
│   │   │   ├─ Activation: SiLU (Swish)
│   │   │   └─ Down: 1,024 → 256
│   │   │
│   │   └─ Residual Connections (2x per layer)
│   │
│   └─ Total Layer Params: ~0.98M per layer
│
├─ Final LayerNorm (RMSNorm)
│   └─ Params: 256
│
└─ LM Head (Output Projection)
    └─ Linear: 256 → 32,000
        └─ Parameters: 8,192,000
```

**Parameter breakdown (approximate):**
- Token embeddings: `32,000 × 256 = 8,192,000`
- Transformer layers: `8 layers × ~0.98M ≈ 7.87M`
- LM head: `256 × 32,000 = 8,192,000`
- **Reported total: 24,253,696 parameters (24.25M)**

</details>
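The breakdown above can be re-derived from the dimensions in the specification table. The sketch below assumes bias-free linear layers; the small remainder relative to the reported 24,253,696 comes from implementation details not listed in the table:

```python
hidden, intermediate, vocab, layers = 256, 1024, 32000, 8
head_dim, n_q_heads, n_kv_heads = 64, 4, 2

attn = (hidden * n_q_heads * head_dim            # Q projection
        + 2 * hidden * n_kv_heads * head_dim     # K and V projections (GQA: fewer heads)
        + n_q_heads * head_dim * hidden)         # O projection
mlp = 3 * hidden * intermediate                  # gate, up, down (SwiGLU)
norms = 2 * hidden                               # two RMSNorms per layer
per_layer = attn + mlp + norms                   # ≈ 983,552

total = vocab * hidden + layers * per_layer + hidden + hidden * vocab  # + final RMSNorm
print(f"per layer: {per_layer:,}  total: {total:,}")                   # ≈ 24.25M
```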
 
---

## 🚀 Quick Start

### 📦 Installation

```bash
# Core dependencies
pip install "torch>=2.0.0" "transformers>=4.35.0" accelerate safetensors

# Optional: for maximum performance
pip install flash-attn --no-build-isolation  # Flash Attention 2
pip install xformers                         # memory-efficient attention
pip install bitsandbytes                     # quantization support
```

### 💻 Basic Usage

#### 1️⃣ Load the Model

```python
from transformers import AutoModelForCausalLM, AutoConfig
import torch

# Load the configuration
config = AutoConfig.from_pretrained(
    "Lyon28/caca-5M-untrained",
    trust_remote_code=True
)

print(f"Model: {config.model_type}")
print(f"Parameters: 24,253,696")
print(f"Hidden size: {config.hidden_size}")
print(f"Layers: {config.num_hidden_layers}")

# Load the model
model = AutoModelForCausalLM.from_pretrained(
    "Lyon28/caca-5M-untrained",
    config=config,
    torch_dtype=torch.bfloat16,  # use BF16 for efficiency
    device_map="auto",           # place on GPU(s) automatically
    trust_remote_code=True
)

print(f"Model loaded! Device: {model.device}")
```
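The card states that Flash Attention 2 is picked up automatically when CUDA is available. Transformers also lets you request it explicitly via `attn_implementation`; whether that flag is honored here depends on the repository's custom modeling code, so treat this as an optional variant of the load above:

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Lyon28/caca-5M-untrained",
    torch_dtype=torch.bfloat16,                 # FA2 requires fp16/bf16 weights
    attn_implementation="flash_attention_2",    # needs a CUDA GPU and flash-attn installed
    device_map="auto",
    trust_remote_code=True,
)
```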
 
#### 2️⃣ Verify the Model

```python
# Count parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"Total parameters: {total_params:,}")
print(f"Trainable parameters: {trainable_params:,}")
print(f"Model size: {total_params * 2 / 1e9:.2f} GB (BF16)")

# Test a forward pass
batch_size, seq_len = 2, 10
input_ids = torch.randint(0, config.vocab_size, (batch_size, seq_len))
input_ids = input_ids.to(model.device)

with torch.no_grad():
    outputs = model(input_ids)

print(f"Output shape: {outputs.logits.shape}")
print("✅ Model works as expected!")
```
 
#### 3️⃣ Generate Text (After Training)

```python
from transformers import AutoTokenizer

# Load a tokenizer (use whichever tokenizer you pair with the model)
tokenizer = AutoTokenizer.from_pretrained("your-tokenizer-here")

# Prepare the input
text = "Jelaskan tentang kecerdasan buatan"  # "Explain artificial intelligence"
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Generate
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    temperature=0.7,
    top_p=0.9,
    top_k=50,
    do_sample=True,
    repetition_penalty=1.1,
    pad_token_id=tokenizer.eos_token_id
)

# Decode
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```
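`"your-tokenizer-here"` above is a placeholder: this repository ships no tokenizer (see Limitations & Bias below). One way to produce a matching 32,000-token vocabulary is to train a byte-level BPE tokenizer with the `tokenizers` library; a minimal sketch, where the corpus file names are purely hypothetical:

```python
import os
from tokenizers import ByteLevelBPETokenizer

# Train a 32,000-token vocab to match the model's embedding size.
bpe = ByteLevelBPETokenizer()
bpe.train(
    files=["corpus_id.txt", "corpus_en.txt"],   # your own Indonesian/English text files
    vocab_size=32000,
    min_frequency=2,
    special_tokens=["<s>", "</s>", "<unk>", "<pad>"],
)

os.makedirs("./caca-tokenizer", exist_ok=True)
bpe.save_model("./caca-tokenizer")              # writes vocab.json + merges.txt
```

The resulting `vocab.json`/`merges.txt` can then be wrapped with a fast tokenizer class (for example `GPT2TokenizerFast`) so it loads through `AutoTokenizer`.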
 
---

## 🔧 Training Guide

### 📚 Dataset Preparation

```python
from datasets import load_dataset

# Load a dataset (example)
dataset = load_dataset("indonesian-nlp/id-wikipedia")

# Or load from a local file
from datasets import Dataset
import pandas as pd

df = pd.read_csv("your_data.csv")
dataset = Dataset.from_pandas(df)

print(f"Dataset size: {len(dataset)}")
```
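The `Trainer` below expects tokenized examples; `DataCollatorForLanguageModeling(mlm=False)` then builds the causal-LM labels from `input_ids`. A minimal tokenization step, assuming the dataset has a `text` column and `tokenizer` is the one you prepared earlier:

```python
def tokenize(batch):
    # Truncate to the model's 2,048-token context window.
    return tokenizer(batch["text"], truncation=True, max_length=2048)

dataset = dataset.map(tokenize, batched=True, remove_columns=["text"])
```

For large-scale pre-training you would typically also pack or group documents into fixed-length blocks; this sketch keeps one document per example.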
### 🎯 Training Configuration

```python
from transformers import Trainer, TrainingArguments
from transformers import DataCollatorForLanguageModeling

# Training arguments
training_args = TrainingArguments(
    # Output
    output_dir="./caca-5M-trained",
    run_name="caca-5M-v1",

    # Training
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,    # effective batch size = 32
    learning_rate=2e-4,
    weight_decay=0.1,
    warmup_steps=2000,

    # Optimization
    bf16=True,                        # mixed-precision training
    gradient_checkpointing=True,      # save memory
    optim="adamw_torch_fused",        # fast fused optimizer
    max_grad_norm=1.0,

    # Logging & evaluation
    logging_steps=10,
    logging_first_step=True,
    eval_strategy="steps",
    eval_steps=500,
    save_steps=1000,
    save_total_limit=3,

    # Hub integration
    push_to_hub=True,
    hub_model_id="your-username/caca-5M-trained",
    hub_strategy="every_save",

    # Distributed training
    ddp_find_unused_parameters=False,
    dataloader_num_workers=4,
)

# Data collator
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False  # causal LM, not masked LM
)

# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    data_collator=data_collator,
)

# Train!
print("🚀 Starting training...")
trainer.train()

# Save the final model
print("💾 Saving model...")
trainer.save_model("./caca-5M-final")
trainer.push_to_hub()

print("✅ Training complete!")
```
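The front matter lists perplexity as the tracked metric. Once training has produced an evaluation loss, perplexity follows directly from the `Trainer` output; a short sketch continuing from the code above:

```python
import math

# Perplexity is exp(cross-entropy loss per token) on the evaluation split.
eval_metrics = trainer.evaluate()
perplexity = math.exp(eval_metrics["eval_loss"])
print(f"eval_loss: {eval_metrics['eval_loss']:.3f} | perplexity: {perplexity:.1f}")
```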
### 📊 Resource Estimates

<details>
<summary><b>💰 Click for estimated training cost & time</b></summary>

**Hardware requirements:**

| GPU | Memory | Batch Size | Speed | Est. Time (100B tokens) |
|-----|--------|------------|-------|-------------------------|
| RTX 3090 (24GB) | 24GB | 1-2 | ~1K tok/s | ~30 days |
| A100 (40GB) | 40GB | 4-8 | ~5K tok/s | ~6 days |
| A100 (80GB) | 80GB | 8-16 | ~8K tok/s | ~4 days |
| 8×A100 (80GB) | 640GB | 64+ | ~50K tok/s | ~14 hours |

**Cloud costs (approximate):**
- AWS p4d.24xlarge (8×A100): ~$32/hour × 24 hours = **~$768/day**
- GCP a2-ultragpu-8g: ~$30/hour × 24 hours = **~$720/day**
- Lambda Labs (8×A100): ~$15/hour × 24 hours = **~$360/day**

**Tips for keeping costs down:**
- Use spot instances (60-70% cheaper)
- Use gradient accumulation for larger effective batch sizes
- Mixed precision (BF16) for up to 2x speedup
- Gradient checkpointing to save memory

</details>
 
---

## 💬 Chat Format

The model supports a standard chat format:

```python
# Single turn
messages = [
    {"role": "user", "content": "Halo! Siapa kamu?"},  # "Hello! Who are you?"
]

# Multi-turn conversation
messages = [
    {"role": "system", "content": "Kamu adalah asisten AI yang membantu."},
    {"role": "user", "content": "Jelaskan tentang fotosintesis"},
    {"role": "assistant", "content": "Fotosintesis adalah proses..."},
    {"role": "user", "content": "Apa manfaatnya bagi manusia?"},
]

# Apply the chat template
formatted = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

print(formatted)
# Output:
# System: Kamu adalah asisten AI yang membantu.
#
# User: Jelaskan tentang fotosintesis
# Assistant: Fotosintesis adalah proses...
# User: Apa manfaatnya bagi manusia?
# Assistant:
```
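Because no tokenizer (and therefore no chat template) ships with this repository, `apply_chat_template` will only produce the output shown above once you attach a template to your own tokenizer. A minimal Jinja template that reproduces the `System:`/`User:`/`Assistant:` layout, offered purely as an illustration:

```python
tokenizer.chat_template = (
    "{% for message in messages %}"
    "{% if message['role'] == 'system' %}System: {{ message['content'] }}\n\n"
    "{% elif message['role'] == 'user' %}User: {{ message['content'] }}\n"
    "{% else %}Assistant: {{ message['content'] }}\n{% endif %}"
    "{% endfor %}"
    "{% if add_generation_prompt %}Assistant:{% endif %}"
)
```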
 
---

## 🎯 Use Cases

### ✅ Good For:

- 🔬 **Research**: experiments with modern LLM architectures
- 📚 **Education**: learning about transformers & training
- 🎓 **Academia**: papers, theses, course projects
- 🚀 **Base Model**: fine-tuning for specific tasks (see the LoRA sketch below)
- 💡 **Proof of Concept**: test ideas before scaling up

### ❌ Not Suitable For:

- 🚫 **Production**: the model has not been trained
- 🚫 **Real-world apps**: output is still random
- 🚫 **Safety-critical uses**: no safety alignment yet
- 🚫 **Direct deployment**: it needs training first
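For the base-model use case, a parameter-efficient LoRA setup via the PEFT library (`pip install peft`) is a common starting point. This is only a sketch: the `target_modules` names are assumptions based on the projection names in the architecture tree above, so check the actual module names in the repository's modeling code first.

```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed names; verify in the modeling code
    task_type="CAUSAL_LM",
)

peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()  # only the LoRA adapters are trainable
```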
 
---

## 📖 Documentation

### 🔗 Important Links

- 📚 **Hugging Face Docs**: [huggingface.co/docs/transformers](https://huggingface.co/docs/transformers)
- 💻 **GitHub**: [Lyon-28/caca-transformers](https://github.com/Lyon-28/caca-transformers)
- 💬 **Discussions**: [Model discussions](https://huggingface.co/Lyon28/caca-5M-untrained/discussions)
- 🐛 **Issues**: [Report bugs](https://huggingface.co/Lyon28/caca-5M-untrained/discussions)

### 📝 Related Models

<div align="center">

| Model Size | Parameters | Link |
|------------|------------|------|
| 🐣 Tiny | 1M - 50M | [caca-1M](../caca-1M-untrained) to [caca-50M](../caca-50M-untrained) |
| 🐥 Small | 75M - 500M | [caca-75M](../caca-75M-untrained) to [caca-500M](../caca-500M-untrained) |
| 🦅 Medium | 600M - 1B | [caca-600M](../caca-600M-untrained) to [caca-1B](../caca-1B-untrained) |
| 🦁 Large | 1.5B - 5B | [caca-1.5B](../caca-1.5B-untrained) to [caca-5B](../caca-5B-untrained) |
| 🐉 XL | 6B - 10B | [caca-6B](../caca-6B-untrained) to [caca-10B](../caca-10B-untrained) |
| 🦖 XXL | 12B+ | [caca-12B](../caca-12B-untrained) to [caca-70B](../caca-70B-untrained) |

</div>

---

## 🤝 Contributing

Contributions are very welcome! A few ways to contribute:

- 🐛 **Report bugs**: found a bug? [Open a discussion](https://huggingface.co/Lyon28/caca-5M-untrained/discussions)
- 💡 **Suggest features**: have an idea? Share it in the discussions
- 📝 **Improve docs**: documentation PRs are welcome
- 🎓 **Share results**: trained the model? Share your results on the model card
- ⭐ **Star & share**: help the project grow

---

## 📜 License & Citation

### 📄 License

This model is released under the **Apache License 2.0**:
- ✅ Free for commercial use
- ✅ Free for research use
- ✅ Modification & redistribution allowed
- ✅ No warranty provided

### 📚 Citation

If you use this model in research or a project, please cite:

```bibtex
@misc{caca5M2025,
  author = {Lyon},
  title = {Caca-5M: Modern Transformer Architecture with GQA and Advanced Features},
  year = {2025},
  publisher = {Hugging Face},
  journal = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/Lyon28/caca-5M-untrained}},
}
```

### 🙏 Acknowledgments

This model is inspired by and implements ideas from a range of recent research:

#### 🏗️ **Core Architecture**
- **LLaMA** (Meta AI, 2023) - base decoder-only architecture, RMSNorm, SwiGLU
  - Paper: [LLaMA: Open and Efficient Foundation Language Models](https://arxiv.org/abs/2302.13971)
- **GPT-3** (OpenAI, 2020) - transformer language modeling paradigm
- **PaLM** (Google, 2022) - SwiGLU activation function

#### 🎯 **Attention Mechanisms**
- **Flash Attention v2** (Tri Dao et al., 2023) - efficient attention with IO-awareness
  - Paper: [FlashAttention-2: Faster Attention with Better Parallelism](https://arxiv.org/abs/2307.08691)
- **Grouped Query Attention (GQA)** (Ainslie et al., Google, 2023) - memory-efficient attention
  - Paper: [GQA: Training Generalized Multi-Query Transformer](https://arxiv.org/abs/2305.13245)
- **Multi-Query Attention (MQA)** (Shazeer, Google, 2019) - fast decoding
- **xFormers** (Meta AI, 2022) - memory-efficient attention implementations
- **PyTorch SDPA** (PyTorch Team, 2023) - built-in scaled dot product attention

#### 📏 **Position Encodings**
- **RoPE** (Su et al., 2021) - Rotary Position Embeddings
  - Paper: [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/abs/2104.09864)
- **ALiBi** (Press et al., 2022) - Attention with Linear Biases for extrapolation
  - Paper: [Train Short, Test Long: Attention with Linear Biases](https://arxiv.org/abs/2108.12409)
- **YaRN** (Peng et al., 2023) - Yet another RoPE extensioN for long context
  - Paper: [YaRN: Efficient Context Window Extension](https://arxiv.org/abs/2309.00071)

#### 🪟 **Long Context & Efficiency**
- **Sliding Window Attention** (Mistral AI, 2023) - local attention patterns
  - Paper: [Mistral 7B](https://arxiv.org/abs/2310.06825)
- **StreamingLLM / Attention Sink** (Xiao et al., MIT, 2023) - streaming over effectively unbounded sequence lengths
  - Paper: [Efficient Streaming Language Models with Attention Sinks](https://arxiv.org/abs/2309.17453)
- **Logit Softcapping** (Google Gemma, 2024) - prevents attention-logit overflow
  - Paper: [Gemma: Open Models Based on Gemini](https://arxiv.org/abs/2403.08295)

#### 🧠 **Mixture of Experts (MoE)**
- **Mixtral 8x7B** (Mistral AI, 2024) - sparse MoE architecture
  - Paper: [Mixtral of Experts](https://arxiv.org/abs/2401.04088)
- **Switch Transformers** (Fedus et al., Google, 2021) - scaling with expert routing
  - Paper: [Switch Transformers: Scaling to Trillion Parameter Models](https://arxiv.org/abs/2101.03961)
- **GLaM** (Du et al., Google, 2021) - Generalist Language Model with MoE
- **Expert Choice Routing** (Zhou et al., Google, 2022) - improved load balancing

#### 🎓 **Training Optimizations**
- **Layer Scale** (Touvron et al., Meta, 2021) - training stability for deep networks
  - Paper: [Going Deeper with Image Transformers (CaiT)](https://arxiv.org/abs/2103.17239)
- **Stochastic Depth** (Huang et al., 2016) - regularization via random layer dropping
  - Paper: [Deep Networks with Stochastic Depth](https://arxiv.org/abs/1603.09382)
- **Mixture of Depths (MoD)** (Raposo et al., Google DeepMind, 2024) - dynamic compute allocation
  - Paper: [Mixture-of-Depths: Dynamically allocating compute in transformer-based models](https://arxiv.org/abs/2404.02258)
- **Gradient Checkpointing** (Chen et al., 2016) - memory-efficient training

#### 📦 **Quantization**
- **LLM.int8()** (Dettmers et al., 2022) - 8-bit matrix multiplication
  - Paper: [LLM.int8(): 8-bit Matrix Multiplication for Transformers](https://arxiv.org/abs/2208.07339)
- **QLoRA** (Dettmers et al., 2023) - 4-bit quantized LoRA fine-tuning
  - Paper: [QLoRA: Efficient Finetuning of Quantized LLMs](https://arxiv.org/abs/2305.14314)
- **GPTQ** (Frantar et al., 2022) - post-training quantization
- **bitsandbytes** (Dettmers) - efficient quantization library

#### 🎨 **Multimodal Components**
- **Vision Transformer (ViT)** (Dosovitskiy et al., Google, 2020) - image encoding
  - Paper: [An Image is Worth 16x16 Words](https://arxiv.org/abs/2010.11929)
- **Perceiver Resampler** (Alayrac et al., DeepMind, 2022) - multimodal fusion
  - Paper: [Flamingo: a Visual Language Model](https://arxiv.org/abs/2204.14198)
- **Q-Former** (Li et al., Salesforce, 2023) - query-based multimodal alignment
  - Paper: [BLIP-2: Bootstrapping Language-Image Pre-training](https://arxiv.org/abs/2301.12597)
- **Whisper** (Radford et al., OpenAI, 2022) - audio-encoding inspiration

#### 🛠️ **Normalization & Activations**
- **RMSNorm** (Zhang & Sennrich, 2019) - Root Mean Square Layer Normalization
  - Paper: [Root Mean Square Layer Normalization](https://arxiv.org/abs/1910.07467)
- **SwiGLU** (Shazeer, Google, 2020) - GLU activation variant
  - Paper: [GLU Variants Improve Transformer](https://arxiv.org/abs/2002.05202)

#### 🔧 **Implementation & Tools**
- **Hugging Face Transformers** - model implementation framework
- **PyTorch** - deep learning framework
- **Safetensors** - secure tensor serialization format
- **Accelerate** - distributed training utilities

---

**Special thanks to:**
- 🇮🇩 The Indonesian NLP community
- 🤗 The Hugging Face team
- 🔬 The open-source AI research community

## ⚠️ Limitations & Bias

### Limitations

- 🔴 **Untrained**: the model has not been trained; output is random
- 🟡 **No tokenizer**: you need to prepare your own tokenizer
- 🟡 **No safety tuning**: no content filtering or alignment yet
- 🟠 **Memory intensive**: training requires substantial GPU resources

### Potential Biases

The model will inherit biases from whatever training data is used. Keep in mind:

- **Language**: bias toward the majority language(s) in the dataset
- **Culture**: bias toward particular cultural perspectives
- **Gender & demographics**: potential stereotypes
- **Factuality**: it can generate inaccurate information

**Recommendation**: evaluate and filter before any deployment.

---

## 📞 Support & Contact

### 💬 Community

- **Discussions**: [HF Discussions](https://huggingface.co/Lyon28/caca-5M-untrained/discussions)

### 📧 Contact

For questions or collaboration:
- Email: cacatransformers@gmail.com
- HF Profile: [@Lyon28](https://huggingface.co/Lyon28)

---

<div align="center">

## 🌟 Star History

[![Star History Chart](https://api.star-history.com/svg?repos=Lyon-28/caca-transformers&type=Date)](https://star-history.com/#Lyon-28/caca-transformers&Date)

---

### 💝 Built with ❤️ for the Indonesian AI community

**Thank you for using Caca!**

If this project is useful to you, please consider:
- ⭐ Starring this repository
- 🔗 Sharing it with friends
- 💬 Joining the discussions
- 🤝 Contributing to the project

---

</div>

### A Quote from Caca
<div align="center">
<img src="https://quotes-caca.vercel.app/api/SsQuote" alt="Daily Quote" width="100%" />
</div>