rbelanec committed
Commit e53937c (verified)
Parent(s): 15a35c1

End of training
README.md CHANGED
@@ -4,6 +4,7 @@ license: llama3
  base_model: meta-llama/Meta-Llama-3-8B-Instruct
  tags:
  - llama-factory
+ - prefix-tuning
  - generated_from_trainer
  model-index:
  - name: train_wsc_42_1760466772
@@ -15,7 +16,7 @@ should probably proofread and complete it, then remove this comment. -->
 
  # train_wsc_42_1760466772
 
- This model is a fine-tuned version of [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) on an unknown dataset.
+ This model is a fine-tuned version of [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) on the wsc dataset.
  It achieves the following results on the evaluation set:
  - Loss: 0.3524
  - Num Input Tokens Seen: 1468632
all_results.json ADDED
@@ -0,0 +1,13 @@
+ {
+ "epoch": 30.0,
+ "eval_loss": 0.3523610234260559,
+ "eval_runtime": 1.1287,
+ "eval_samples_per_second": 98.345,
+ "eval_steps_per_second": 12.404,
+ "num_input_tokens_seen": 1468632,
+ "total_flos": 6.613183518533222e+16,
+ "train_loss": 0.47382661629290806,
+ "train_runtime": 311.6966,
+ "train_samples_per_second": 42.638,
+ "train_steps_per_second": 2.695
+ }
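The aggregate metrics above are internally consistent: multiplying each reported runtime by its rate recovers the underlying counts. A quick sketch in Python (the eval-sample and step counts below are derived from the rounded rates, not stated explicitly in the file; the step count matches `"global_step": 840` in trainer_state.json):

```python
# Sanity-check the aggregate metrics reported in all_results.json.
eval_runtime = 1.1287
eval_samples_per_second = 98.345
train_runtime = 311.6966
train_steps_per_second = 2.695

# runtime x rate recovers the underlying counts (approximately,
# since the reported rates are rounded).
eval_samples = eval_runtime * eval_samples_per_second  # ~111 eval examples
train_steps = train_runtime * train_steps_per_second   # ~840 optimizer steps
print(round(eval_samples), round(train_steps))
```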
eval_results.json ADDED
@@ -0,0 +1,8 @@
+ {
+ "epoch": 30.0,
+ "eval_loss": 0.3523610234260559,
+ "eval_runtime": 1.1287,
+ "eval_samples_per_second": 98.345,
+ "eval_steps_per_second": 12.404,
+ "num_input_tokens_seen": 1468632
+ }
train_results.json ADDED
@@ -0,0 +1,9 @@
+ {
+ "epoch": 30.0,
+ "num_input_tokens_seen": 1468632,
+ "total_flos": 6.613183518533222e+16,
+ "train_loss": 0.47382661629290806,
+ "train_runtime": 311.6966,
+ "train_samples_per_second": 42.638,
+ "train_steps_per_second": 2.695
+ }
trainer_state.json ADDED
@@ -0,0 +1,1568 @@
+ {
+ "best_global_step": 210,
+ "best_metric": 0.34583956003189087,
+ "best_model_checkpoint": "saves/prefix-tuning/llama-3-8b-instruct/train_wsc_42_1760466772/checkpoint-210",
+ "epoch": 30.0,
+ "eval_steps": 42,
+ "global_step": 840,
+ "is_hyper_param_search": false,
+ "is_local_process_zero": true,
+ "is_world_process_zero": true,
+ "log_history": [
+ {
+ "epoch": 0.17857142857142858,
+ "grad_norm": 175.3666534423828,
+ "learning_rate": 3.1746031746031746e-06,
+ "loss": 11.3214,
+ "num_input_tokens_seen": 9216,
+ "step": 5
+ },
+ {
+ "epoch": 0.35714285714285715,
+ "grad_norm": 145.5986785888672,
+ "learning_rate": 7.142857142857143e-06,
+ "loss": 6.8093,
+ "num_input_tokens_seen": 18112,
+ "step": 10
+ },
+ {
+ "epoch": 0.5357142857142857,
+ "grad_norm": 58.49702453613281,
+ "learning_rate": 1.1111111111111112e-05,
+ "loss": 2.1384,
+ "num_input_tokens_seen": 26624,
+ "step": 15
+ },
+ {
+ "epoch": 0.7142857142857143,
+ "grad_norm": 57.37192153930664,
+ "learning_rate": 1.5079365079365079e-05,
+ "loss": 0.7391,
+ "num_input_tokens_seen": 35456,
+ "step": 20
+ },
+ {
+ "epoch": 0.8928571428571429,
+ "grad_norm": 20.545907974243164,
+ "learning_rate": 1.9047619047619046e-05,
+ "loss": 0.5556,
+ "num_input_tokens_seen": 44096,
+ "step": 25
+ },
+ {
+ "epoch": 1.0714285714285714,
+ "grad_norm": 25.21103286743164,
+ "learning_rate": 2.3015873015873015e-05,
+ "loss": 0.5487,
+ "num_input_tokens_seen": 52192,
+ "step": 30
+ },
+ {
+ "epoch": 1.25,
+ "grad_norm": 8.62804126739502,
+ "learning_rate": 2.6984126984126984e-05,
+ "loss": 0.4509,
+ "num_input_tokens_seen": 60896,
+ "step": 35
+ },
+ {
+ "epoch": 1.4285714285714286,
+ "grad_norm": 23.560665130615234,
+ "learning_rate": 3.095238095238095e-05,
+ "loss": 0.4818,
+ "num_input_tokens_seen": 69920,
+ "step": 40
+ },
+ {
+ "epoch": 1.5,
+ "eval_loss": 0.3624400794506073,
+ "eval_runtime": 1.118,
+ "eval_samples_per_second": 99.287,
+ "eval_steps_per_second": 12.523,
+ "num_input_tokens_seen": 73824,
+ "step": 42
+ },
+ {
+ "epoch": 1.6071428571428572,
+ "grad_norm": 13.207117080688477,
+ "learning_rate": 3.492063492063492e-05,
+ "loss": 0.4103,
+ "num_input_tokens_seen": 78560,
+ "step": 45
+ },
+ {
+ "epoch": 1.7857142857142856,
+ "grad_norm": 11.927769660949707,
+ "learning_rate": 3.888888888888889e-05,
+ "loss": 0.4405,
+ "num_input_tokens_seen": 87136,
+ "step": 50
+ },
+ {
+ "epoch": 1.9642857142857144,
+ "grad_norm": 7.089166641235352,
+ "learning_rate": 4.2857142857142856e-05,
+ "loss": 0.408,
+ "num_input_tokens_seen": 96480,
+ "step": 55
+ },
+ {
+ "epoch": 2.142857142857143,
+ "grad_norm": 2.959946393966675,
+ "learning_rate": 4.682539682539683e-05,
+ "loss": 0.3548,
+ "num_input_tokens_seen": 104248,
+ "step": 60
+ },
+ {
+ "epoch": 2.3214285714285716,
+ "grad_norm": 1.298177719116211,
+ "learning_rate": 5.0793650793650794e-05,
+ "loss": 0.377,
+ "num_input_tokens_seen": 113272,
+ "step": 65
+ },
+ {
+ "epoch": 2.5,
+ "grad_norm": 8.108621597290039,
+ "learning_rate": 5.4761904761904766e-05,
+ "loss": 0.3922,
+ "num_input_tokens_seen": 122168,
+ "step": 70
+ },
+ {
+ "epoch": 2.678571428571429,
+ "grad_norm": 2.4724233150482178,
+ "learning_rate": 5.873015873015873e-05,
+ "loss": 0.4379,
+ "num_input_tokens_seen": 131576,
+ "step": 75
+ },
+ {
+ "epoch": 2.857142857142857,
+ "grad_norm": 4.108138084411621,
+ "learning_rate": 6.26984126984127e-05,
+ "loss": 0.3847,
+ "num_input_tokens_seen": 139832,
+ "step": 80
+ },
+ {
+ "epoch": 3.0,
+ "eval_loss": 0.3848850131034851,
+ "eval_runtime": 1.1397,
+ "eval_samples_per_second": 97.398,
+ "eval_steps_per_second": 12.284,
+ "num_input_tokens_seen": 146552,
+ "step": 84
+ },
+ {
+ "epoch": 3.0357142857142856,
+ "grad_norm": 0.5216990113258362,
+ "learning_rate": 6.666666666666667e-05,
+ "loss": 0.3722,
+ "num_input_tokens_seen": 147832,
+ "step": 85
+ },
+ {
+ "epoch": 3.2142857142857144,
+ "grad_norm": 0.9133898615837097,
+ "learning_rate": 7.063492063492065e-05,
+ "loss": 0.4223,
+ "num_input_tokens_seen": 157176,
+ "step": 90
+ },
+ {
+ "epoch": 3.392857142857143,
+ "grad_norm": 4.493597030639648,
+ "learning_rate": 7.460317460317461e-05,
+ "loss": 0.388,
+ "num_input_tokens_seen": 166392,
+ "step": 95
+ },
+ {
+ "epoch": 3.571428571428571,
+ "grad_norm": 0.2193046361207962,
+ "learning_rate": 7.857142857142858e-05,
+ "loss": 0.3655,
+ "num_input_tokens_seen": 174904,
+ "step": 100
+ },
+ {
+ "epoch": 3.75,
+ "grad_norm": 3.4654486179351807,
+ "learning_rate": 8.253968253968255e-05,
+ "loss": 0.4285,
+ "num_input_tokens_seen": 184312,
+ "step": 105
+ },
+ {
+ "epoch": 3.928571428571429,
+ "grad_norm": 0.2327776700258255,
+ "learning_rate": 8.650793650793651e-05,
+ "loss": 0.3877,
+ "num_input_tokens_seen": 193464,
+ "step": 110
+ },
+ {
+ "epoch": 4.107142857142857,
+ "grad_norm": 0.6309562921524048,
+ "learning_rate": 9.047619047619048e-05,
+ "loss": 0.3744,
+ "num_input_tokens_seen": 201168,
+ "step": 115
+ },
+ {
+ "epoch": 4.285714285714286,
+ "grad_norm": 0.7549311518669128,
+ "learning_rate": 9.444444444444444e-05,
+ "loss": 0.4027,
+ "num_input_tokens_seen": 210384,
+ "step": 120
+ },
+ {
+ "epoch": 4.464285714285714,
+ "grad_norm": 0.4532535672187805,
+ "learning_rate": 9.841269841269841e-05,
+ "loss": 0.3464,
+ "num_input_tokens_seen": 219536,
+ "step": 125
+ },
+ {
+ "epoch": 4.5,
+ "eval_loss": 0.34812429547309875,
+ "eval_runtime": 1.0994,
+ "eval_samples_per_second": 100.966,
+ "eval_steps_per_second": 12.735,
+ "num_input_tokens_seen": 221264,
+ "step": 126
+ },
+ {
+ "epoch": 4.642857142857143,
+ "grad_norm": 0.14779126644134521,
+ "learning_rate": 9.999564408362054e-05,
+ "loss": 0.333,
+ "num_input_tokens_seen": 228432,
+ "step": 130
+ },
+ {
+ "epoch": 4.821428571428571,
+ "grad_norm": 0.35166576504707336,
+ "learning_rate": 9.996902734308346e-05,
+ "loss": 0.3557,
+ "num_input_tokens_seen": 238032,
+ "step": 135
+ },
+ {
+ "epoch": 5.0,
+ "grad_norm": 0.6182653307914734,
+ "learning_rate": 9.991822668185927e-05,
+ "loss": 0.3749,
+ "num_input_tokens_seen": 245376,
+ "step": 140
+ },
+ {
+ "epoch": 5.178571428571429,
+ "grad_norm": 0.5397220253944397,
+ "learning_rate": 9.984326668636131e-05,
+ "loss": 0.3563,
+ "num_input_tokens_seen": 253952,
+ "step": 145
+ },
+ {
+ "epoch": 5.357142857142857,
+ "grad_norm": 0.11434465646743774,
+ "learning_rate": 9.974418363559444e-05,
+ "loss": 0.3447,
+ "num_input_tokens_seen": 263296,
+ "step": 150
+ },
+ {
+ "epoch": 5.535714285714286,
+ "grad_norm": 0.9336232542991638,
+ "learning_rate": 9.96210254835968e-05,
+ "loss": 0.3606,
+ "num_input_tokens_seen": 271808,
+ "step": 155
+ },
+ {
+ "epoch": 5.714285714285714,
+ "grad_norm": 0.10893812775611877,
+ "learning_rate": 9.947385183623098e-05,
+ "loss": 0.3507,
+ "num_input_tokens_seen": 280704,
+ "step": 160
+ },
+ {
+ "epoch": 5.892857142857143,
+ "grad_norm": 0.15622828900814056,
+ "learning_rate": 9.930273392233624e-05,
+ "loss": 0.3578,
+ "num_input_tokens_seen": 289600,
+ "step": 165
+ },
+ {
+ "epoch": 6.0,
+ "eval_loss": 0.37425312399864197,
+ "eval_runtime": 1.2693,
+ "eval_samples_per_second": 87.447,
+ "eval_steps_per_second": 11.029,
+ "num_input_tokens_seen": 294256,
+ "step": 168
+ },
+ {
+ "epoch": 6.071428571428571,
+ "grad_norm": 0.267703652381897,
+ "learning_rate": 9.910775455925518e-05,
+ "loss": 0.3546,
+ "num_input_tokens_seen": 297520,
+ "step": 170
+ },
+ {
+ "epoch": 6.25,
+ "grad_norm": 0.12959004938602448,
+ "learning_rate": 9.888900811275204e-05,
+ "loss": 0.3507,
+ "num_input_tokens_seen": 306672,
+ "step": 175
+ },
+ {
+ "epoch": 6.428571428571429,
+ "grad_norm": 0.25042709708213806,
+ "learning_rate": 9.864660045134165e-05,
+ "loss": 0.3929,
+ "num_input_tokens_seen": 315824,
+ "step": 180
+ },
+ {
+ "epoch": 6.607142857142857,
+ "grad_norm": 0.1262078881263733,
+ "learning_rate": 9.838064889505141e-05,
+ "loss": 0.35,
+ "num_input_tokens_seen": 324464,
+ "step": 185
+ },
+ {
+ "epoch": 6.785714285714286,
+ "grad_norm": 0.2343859225511551,
+ "learning_rate": 9.809128215864097e-05,
+ "loss": 0.36,
+ "num_input_tokens_seen": 332976,
+ "step": 190
+ },
+ {
+ "epoch": 6.964285714285714,
+ "grad_norm": 0.2336055040359497,
+ "learning_rate": 9.777864028930705e-05,
+ "loss": 0.3521,
+ "num_input_tokens_seen": 341552,
+ "step": 195
+ },
+ {
+ "epoch": 7.142857142857143,
+ "grad_norm": 0.34513550996780396,
+ "learning_rate": 9.744287459890368e-05,
+ "loss": 0.352,
+ "num_input_tokens_seen": 349584,
+ "step": 200
+ },
+ {
+ "epoch": 7.321428571428571,
+ "grad_norm": 0.5254360437393188,
+ "learning_rate": 9.708414759071059e-05,
+ "loss": 0.3669,
+ "num_input_tokens_seen": 358672,
+ "step": 205
+ },
+ {
+ "epoch": 7.5,
+ "grad_norm": 0.04271647334098816,
+ "learning_rate": 9.670263288078502e-05,
+ "loss": 0.3492,
+ "num_input_tokens_seen": 368144,
+ "step": 210
+ },
+ {
+ "epoch": 7.5,
+ "eval_loss": 0.34583956003189087,
+ "eval_runtime": 1.1811,
+ "eval_samples_per_second": 93.983,
+ "eval_steps_per_second": 11.854,
+ "num_input_tokens_seen": 368144,
+ "step": 210
+ },
+ {
+ "epoch": 7.678571428571429,
+ "grad_norm": 0.2870471179485321,
+ "learning_rate": 9.629851511393555e-05,
+ "loss": 0.3575,
+ "num_input_tokens_seen": 376464,
+ "step": 215
+ },
+ {
+ "epoch": 7.857142857142857,
+ "grad_norm": 0.16233104467391968,
+ "learning_rate": 9.587198987435782e-05,
+ "loss": 0.347,
+ "num_input_tokens_seen": 385616,
+ "step": 220
+ },
+ {
+ "epoch": 8.035714285714286,
+ "grad_norm": 0.24467554688453674,
+ "learning_rate": 9.542326359097619e-05,
+ "loss": 0.3491,
+ "num_input_tokens_seen": 393384,
+ "step": 225
+ },
+ {
+ "epoch": 8.214285714285714,
+ "grad_norm": 0.2640959918498993,
+ "learning_rate": 9.495255343753657e-05,
+ "loss": 0.3562,
+ "num_input_tokens_seen": 402280,
+ "step": 230
+ },
+ {
+ "epoch": 8.392857142857142,
+ "grad_norm": 0.18736229836940765,
+ "learning_rate": 9.446008722749905e-05,
+ "loss": 0.3491,
+ "num_input_tokens_seen": 410280,
+ "step": 235
+ },
+ {
+ "epoch": 8.571428571428571,
+ "grad_norm": 0.039574023336172104,
+ "learning_rate": 9.394610330378124e-05,
+ "loss": 0.3404,
+ "num_input_tokens_seen": 418856,
+ "step": 240
+ },
+ {
+ "epoch": 8.75,
+ "grad_norm": 0.12487329542636871,
+ "learning_rate": 9.341085042340532e-05,
+ "loss": 0.3528,
+ "num_input_tokens_seen": 427816,
+ "step": 245
+ },
+ {
+ "epoch": 8.928571428571429,
+ "grad_norm": 0.08902257680892944,
+ "learning_rate": 9.285458763710524e-05,
+ "loss": 0.3542,
+ "num_input_tokens_seen": 437352,
+ "step": 250
+ },
+ {
+ "epoch": 9.0,
+ "eval_loss": 0.34945669770240784,
+ "eval_runtime": 1.1316,
+ "eval_samples_per_second": 98.092,
+ "eval_steps_per_second": 12.372,
+ "num_input_tokens_seen": 439768,
+ "step": 252
+ },
+ {
+ "epoch": 9.107142857142858,
+ "grad_norm": 0.12130332738161087,
+ "learning_rate": 9.227758416395169e-05,
+ "loss": 0.3435,
+ "num_input_tokens_seen": 444504,
+ "step": 255
+ },
+ {
+ "epoch": 9.285714285714286,
+ "grad_norm": 0.14609932899475098,
+ "learning_rate": 9.168011926105598e-05,
+ "loss": 0.3565,
+ "num_input_tokens_seen": 453144,
+ "step": 260
+ },
+ {
+ "epoch": 9.464285714285714,
+ "grad_norm": 0.025746047496795654,
+ "learning_rate": 9.106248208841569e-05,
+ "loss": 0.3499,
+ "num_input_tokens_seen": 462040,
+ "step": 265
+ },
+ {
+ "epoch": 9.642857142857142,
+ "grad_norm": 0.10566066205501556,
+ "learning_rate": 9.042497156896748e-05,
+ "loss": 0.349,
+ "num_input_tokens_seen": 471576,
+ "step": 270
+ },
+ {
+ "epoch": 9.821428571428571,
+ "grad_norm": 0.2756125032901764,
+ "learning_rate": 8.976789624391498e-05,
+ "loss": 0.3532,
+ "num_input_tokens_seen": 480280,
+ "step": 275
+ },
+ {
+ "epoch": 10.0,
+ "grad_norm": 0.05616496875882149,
+ "learning_rate": 8.90915741234015e-05,
+ "loss": 0.3506,
+ "num_input_tokens_seen": 489096,
+ "step": 280
+ },
+ {
+ "epoch": 10.178571428571429,
+ "grad_norm": 0.23147229850292206,
+ "learning_rate": 8.839633253260006e-05,
+ "loss": 0.3475,
+ "num_input_tokens_seen": 498952,
+ "step": 285
+ },
+ {
+ "epoch": 10.357142857142858,
+ "grad_norm": 0.17449620366096497,
+ "learning_rate": 8.768250795329518e-05,
+ "loss": 0.34,
+ "num_input_tokens_seen": 507976,
+ "step": 290
+ },
+ {
+ "epoch": 10.5,
+ "eval_loss": 0.35035043954849243,
+ "eval_runtime": 1.1011,
+ "eval_samples_per_second": 100.804,
+ "eval_steps_per_second": 12.714,
+ "num_input_tokens_seen": 514888,
+ "step": 294
+ },
+ {
+ "epoch": 10.535714285714286,
+ "grad_norm": 0.09387495368719101,
+ "learning_rate": 8.695044586103296e-05,
+ "loss": 0.355,
+ "num_input_tokens_seen": 516744,
+ "step": 295
+ },
+ {
+ "epoch": 10.714285714285714,
+ "grad_norm": 0.269589364528656,
+ "learning_rate": 8.620050055791851e-05,
+ "loss": 0.3472,
+ "num_input_tokens_seen": 525960,
+ "step": 300
+ },
+ {
+ "epoch": 10.892857142857142,
+ "grad_norm": 0.045772165060043335,
+ "learning_rate": 8.543303500114141e-05,
+ "loss": 0.3496,
+ "num_input_tokens_seen": 534344,
+ "step": 305
+ },
+ {
+ "epoch": 11.071428571428571,
+ "grad_norm": 0.23182272911071777,
+ "learning_rate": 8.464842062731235e-05,
+ "loss": 0.3481,
+ "num_input_tokens_seen": 541856,
+ "step": 310
+ },
+ {
+ "epoch": 11.25,
+ "grad_norm": 0.34456300735473633,
+ "learning_rate": 8.384703717269584e-05,
+ "loss": 0.3448,
+ "num_input_tokens_seen": 550176,
+ "step": 315
+ },
+ {
+ "epoch": 11.428571428571429,
+ "grad_norm": 0.04894077032804489,
+ "learning_rate": 8.302927248942627e-05,
+ "loss": 0.3405,
+ "num_input_tokens_seen": 558368,
+ "step": 320
+ },
+ {
+ "epoch": 11.607142857142858,
+ "grad_norm": 0.33567559719085693,
+ "learning_rate": 8.219552235779578e-05,
+ "loss": 0.3466,
+ "num_input_tokens_seen": 567392,
+ "step": 325
+ },
+ {
+ "epoch": 11.785714285714286,
+ "grad_norm": 0.2454521507024765,
+ "learning_rate": 8.134619029470534e-05,
+ "loss": 0.3594,
+ "num_input_tokens_seen": 576864,
+ "step": 330
+ },
+ {
+ "epoch": 11.964285714285714,
+ "grad_norm": 0.41574931144714355,
+ "learning_rate": 8.048168735837121e-05,
+ "loss": 0.3578,
+ "num_input_tokens_seen": 585376,
+ "step": 335
+ },
+ {
+ "epoch": 12.0,
+ "eval_loss": 0.3650008738040924,
+ "eval_runtime": 1.1362,
+ "eval_samples_per_second": 97.697,
+ "eval_steps_per_second": 12.322,
+ "num_input_tokens_seen": 586448,
+ "step": 336
+ },
+ {
+ "epoch": 12.142857142857142,
+ "grad_norm": 0.13403870165348053,
+ "learning_rate": 7.960243194938192e-05,
+ "loss": 0.3703,
+ "num_input_tokens_seen": 593296,
+ "step": 340
+ },
+ {
+ "epoch": 12.321428571428571,
+ "grad_norm": 0.03592513129115105,
+ "learning_rate": 7.87088496082013e-05,
+ "loss": 0.3383,
+ "num_input_tokens_seen": 602000,
+ "step": 345
+ },
+ {
+ "epoch": 12.5,
+ "grad_norm": 0.30938151478767395,
+ "learning_rate": 7.780137280921636e-05,
+ "loss": 0.3849,
+ "num_input_tokens_seen": 610960,
+ "step": 350
+ },
+ {
+ "epoch": 12.678571428571429,
+ "grad_norm": 0.047836095094680786,
+ "learning_rate": 7.688044075142887e-05,
+ "loss": 0.3439,
+ "num_input_tokens_seen": 619984,
+ "step": 355
+ },
+ {
+ "epoch": 12.857142857142858,
+ "grad_norm": 0.08763613551855087,
+ "learning_rate": 7.594649914589287e-05,
+ "loss": 0.3448,
+ "num_input_tokens_seen": 629776,
+ "step": 360
+ },
+ {
+ "epoch": 13.035714285714286,
+ "grad_norm": 0.0415462851524353,
+ "learning_rate": 7.500000000000001e-05,
+ "loss": 0.348,
+ "num_input_tokens_seen": 637888,
+ "step": 365
+ },
+ {
+ "epoch": 13.214285714285714,
+ "grad_norm": 0.12538637220859528,
+ "learning_rate": 7.404140139871797e-05,
+ "loss": 0.345,
+ "num_input_tokens_seen": 648128,
+ "step": 370
+ },
+ {
+ "epoch": 13.392857142857142,
+ "grad_norm": 0.17965561151504517,
+ "learning_rate": 7.307116728288727e-05,
+ "loss": 0.3462,
+ "num_input_tokens_seen": 656768,
+ "step": 375
+ },
+ {
+ "epoch": 13.5,
+ "eval_loss": 0.3538319170475006,
+ "eval_runtime": 1.1141,
+ "eval_samples_per_second": 99.628,
+ "eval_steps_per_second": 12.566,
+ "num_input_tokens_seen": 662016,
+ "step": 378
+ },
+ {
+ "epoch": 13.571428571428571,
+ "grad_norm": 0.22999395430088043,
+ "learning_rate": 7.208976722468392e-05,
+ "loss": 0.3606,
+ "num_input_tokens_seen": 665472,
+ "step": 380
+ },
+ {
+ "epoch": 13.75,
+ "grad_norm": 0.0893191397190094,
+ "learning_rate": 7.109767620035689e-05,
+ "loss": 0.3501,
+ "num_input_tokens_seen": 673664,
+ "step": 385
+ },
+ {
+ "epoch": 13.928571428571429,
+ "grad_norm": 0.09730294346809387,
+ "learning_rate": 7.00953743603498e-05,
+ "loss": 0.3461,
+ "num_input_tokens_seen": 682688,
+ "step": 390
+ },
+ {
+ "epoch": 14.107142857142858,
+ "grad_norm": 0.05350238084793091,
+ "learning_rate": 6.908334679691863e-05,
+ "loss": 0.3522,
+ "num_input_tokens_seen": 690936,
+ "step": 395
+ },
+ {
+ "epoch": 14.285714285714286,
+ "grad_norm": 0.04386021941900253,
+ "learning_rate": 6.806208330935766e-05,
+ "loss": 0.3553,
+ "num_input_tokens_seen": 700536,
+ "step": 400
+ },
+ {
+ "epoch": 14.464285714285714,
+ "grad_norm": 0.0879250317811966,
+ "learning_rate": 6.703207816694719e-05,
+ "loss": 0.3382,
+ "num_input_tokens_seen": 709560,
+ "step": 405
+ },
+ {
+ "epoch": 14.642857142857142,
+ "grad_norm": 0.31405237317085266,
+ "learning_rate": 6.599382986973808e-05,
+ "loss": 0.348,
+ "num_input_tokens_seen": 718584,
+ "step": 410
+ },
+ {
+ "epoch": 14.821428571428571,
+ "grad_norm": 0.3567558228969574,
+ "learning_rate": 6.494784090728852e-05,
+ "loss": 0.3544,
+ "num_input_tokens_seen": 727416,
+ "step": 415
+ },
+ {
+ "epoch": 15.0,
+ "grad_norm": 0.061139896512031555,
+ "learning_rate": 6.389461751547008e-05,
+ "loss": 0.3506,
+ "num_input_tokens_seen": 735680,
+ "step": 420
+ },
+ {
+ "epoch": 15.0,
+ "eval_loss": 0.35569673776626587,
+ "eval_runtime": 1.1923,
+ "eval_samples_per_second": 93.099,
+ "eval_steps_per_second": 11.742,
+ "num_input_tokens_seen": 735680,
+ "step": 420
+ },
+ {
+ "epoch": 15.178571428571429,
+ "grad_norm": 0.19610817730426788,
+ "learning_rate": 6.283466943146053e-05,
+ "loss": 0.3486,
+ "num_input_tokens_seen": 744832,
+ "step": 425
+ },
+ {
+ "epoch": 15.357142857142858,
+ "grad_norm": 0.05017664283514023,
+ "learning_rate": 6.176850964704213e-05,
+ "loss": 0.3573,
+ "num_input_tokens_seen": 753728,
+ "step": 430
+ },
+ {
+ "epoch": 15.535714285714286,
+ "grad_norm": 0.047355543822050095,
+ "learning_rate": 6.069665416032487e-05,
+ "loss": 0.3564,
+ "num_input_tokens_seen": 762752,
+ "step": 435
+ },
+ {
+ "epoch": 15.714285714285714,
+ "grad_norm": 0.16390874981880188,
+ "learning_rate": 5.961962172601458e-05,
+ "loss": 0.3461,
+ "num_input_tokens_seen": 771200,
+ "step": 440
+ },
+ {
+ "epoch": 15.892857142857142,
+ "grad_norm": 0.19481535255908966,
+ "learning_rate": 5.853793360434687e-05,
+ "loss": 0.3564,
+ "num_input_tokens_seen": 779776,
+ "step": 445
+ },
+ {
+ "epoch": 16.071428571428573,
+ "grad_norm": 0.04299188032746315,
+ "learning_rate": 5.745211330880872e-05,
+ "loss": 0.3214,
+ "num_input_tokens_seen": 788984,
+ "step": 450
+ },
+ {
+ "epoch": 16.25,
+ "grad_norm": 0.07021050900220871,
+ "learning_rate": 5.636268635276918e-05,
+ "loss": 0.355,
+ "num_input_tokens_seen": 798456,
+ "step": 455
+ },
+ {
+ "epoch": 16.428571428571427,
+ "grad_norm": 0.2501037120819092,
+ "learning_rate": 5.527017999514239e-05,
+ "loss": 0.3489,
+ "num_input_tokens_seen": 807352,
+ "step": 460
+ },
+ {
+ "epoch": 16.5,
+ "eval_loss": 0.35193607211112976,
+ "eval_runtime": 1.1402,
+ "eval_samples_per_second": 97.356,
+ "eval_steps_per_second": 12.279,
+ "num_input_tokens_seen": 810232,
+ "step": 462
+ },
+ {
+ "epoch": 16.607142857142858,
+ "grad_norm": 0.08860552310943604,
+ "learning_rate": 5.417512298520585e-05,
+ "loss": 0.3506,
+ "num_input_tokens_seen": 814712,
+ "step": 465
+ },
+ {
+ "epoch": 16.785714285714285,
+ "grad_norm": 0.08465074002742767,
+ "learning_rate": 5.307804530669716e-05,
+ "loss": 0.346,
+ "num_input_tokens_seen": 823608,
+ "step": 470
+ },
+ {
+ "epoch": 16.964285714285715,
+ "grad_norm": 0.14993856847286224,
+ "learning_rate": 5.197947792131348e-05,
+ "loss": 0.3467,
+ "num_input_tokens_seen": 832888,
+ "step": 475
+ },
+ {
+ "epoch": 17.142857142857142,
+ "grad_norm": 0.17248603701591492,
+ "learning_rate": 5.0879952511737696e-05,
+ "loss": 0.3452,
+ "num_input_tokens_seen": 840464,
+ "step": 480
+ },
+ {
+ "epoch": 17.321428571428573,
+ "grad_norm": 0.10642991960048676,
+ "learning_rate": 4.97800012243155e-05,
+ "loss": 0.3402,
+ "num_input_tokens_seen": 849488,
+ "step": 485
+ },
+ {
+ "epoch": 17.5,
+ "grad_norm": 0.07554838806390762,
+ "learning_rate": 4.86801564115082e-05,
+ "loss": 0.3428,
+ "num_input_tokens_seen": 858448,
+ "step": 490
+ },
+ {
+ "epoch": 17.678571428571427,
+ "grad_norm": 0.1515018343925476,
+ "learning_rate": 4.758095037424567e-05,
+ "loss": 0.352,
+ "num_input_tokens_seen": 867280,
+ "step": 495
+ },
+ {
+ "epoch": 17.857142857142858,
+ "grad_norm": 0.2787249684333801,
+ "learning_rate": 4.648291510430438e-05,
+ "loss": 0.3528,
+ "num_input_tokens_seen": 876880,
+ "step": 500
+ },
+ {
+ "epoch": 18.0,
+ "eval_loss": 0.3557737469673157,
+ "eval_runtime": 1.1986,
+ "eval_samples_per_second": 92.609,
+ "eval_steps_per_second": 11.68,
+ "num_input_tokens_seen": 882920,
+ "step": 504
+ },
+ {
+ "epoch": 18.035714285714285,
+ "grad_norm": 0.21712534129619598,
+ "learning_rate": 4.5386582026834906e-05,
+ "loss": 0.3418,
+ "num_input_tokens_seen": 885480,
+ "step": 505
+ },
+ {
+ "epoch": 18.214285714285715,
+ "grad_norm": 0.05514955148100853,
+ "learning_rate": 4.4292481743163755e-05,
+ "loss": 0.3441,
+ "num_input_tokens_seen": 893864,
+ "step": 510
+ },
+ {
+ "epoch": 18.392857142857142,
+ "grad_norm": 0.06789080053567886,
+ "learning_rate": 4.3201143773993865e-05,
+ "loss": 0.3524,
+ "num_input_tokens_seen": 901928,
+ "step": 515
+ },
+ {
+ "epoch": 18.571428571428573,
+ "grad_norm": 0.11630789190530777,
+ "learning_rate": 4.2113096303128125e-05,
+ "loss": 0.3495,
+ "num_input_tokens_seen": 910696,
+ "step": 520
+ },
+ {
+ "epoch": 18.75,
+ "grad_norm": 0.13940396904945374,
+ "learning_rate": 4.102886592183996e-05,
+ "loss": 0.3378,
+ "num_input_tokens_seen": 920104,
+ "step": 525
+ },
+ {
+ "epoch": 18.928571428571427,
+ "grad_norm": 0.051015470176935196,
+ "learning_rate": 3.9948977374014544e-05,
+ "loss": 0.3511,
+ "num_input_tokens_seen": 928936,
+ "step": 530
+ },
+ {
+ "epoch": 19.107142857142858,
+ "grad_norm": 0.09734626859426498,
+ "learning_rate": 3.887395330218429e-05,
+ "loss": 0.3465,
+ "num_input_tokens_seen": 937840,
+ "step": 535
+ },
+ {
+ "epoch": 19.285714285714285,
+ "grad_norm": 0.11513973772525787,
+ "learning_rate": 3.780431399458114e-05,
+ "loss": 0.3478,
+ "num_input_tokens_seen": 947248,
+ "step": 540
+ },
+ {
+ "epoch": 19.464285714285715,
+ "grad_norm": 0.11863286793231964,
+ "learning_rate": 3.6740577133328524e-05,
+ "loss": 0.3408,
+ "num_input_tokens_seen": 955568,
+ "step": 545
+ },
+ {
+ "epoch": 19.5,
+ "eval_loss": 0.3516767919063568,
+ "eval_runtime": 1.1222,
+ "eval_samples_per_second": 98.916,
+ "eval_steps_per_second": 12.476,
+ "num_input_tokens_seen": 957488,
+ "step": 546
+ },
+ {
+ "epoch": 19.642857142857142,
+ "grad_norm": 0.04675190523266792,
+ "learning_rate": 3.568325754389438e-05,
+ "loss": 0.3459,
+ "num_input_tokens_seen": 964400,
+ "step": 550
+ },
+ {
+ "epoch": 19.821428571428573,
+ "grad_norm": 0.13177751004695892,
+ "learning_rate": 3.4632866945926855e-05,
+ "loss": 0.3473,
+ "num_input_tokens_seen": 972720,
+ "step": 555
+ },
+ {
+ "epoch": 20.0,
+ "grad_norm": 0.05624629184603691,
+ "learning_rate": 3.3589913705593235e-05,
+ "loss": 0.3508,
+ "num_input_tokens_seen": 980760,
+ "step": 560
+ },
+ {
+ "epoch": 20.178571428571427,
+ "grad_norm": 0.04487062245607376,
+ "learning_rate": 3.255490258954167e-05,
+ "loss": 0.3414,
+ "num_input_tokens_seen": 989464,
+ "step": 565
+ },
+ {
+ "epoch": 20.357142857142858,
+ "grad_norm": 0.3037759065628052,
+ "learning_rate": 3.152833452060522e-05,
+ "loss": 0.3518,
+ "num_input_tokens_seen": 998680,
+ "step": 570
+ },
+ {
+ "epoch": 20.535714285714285,
+ "grad_norm": 0.34507808089256287,
+ "learning_rate": 3.0510706335366035e-05,
+ "loss": 0.3479,
+ "num_input_tokens_seen": 1007768,
+ "step": 575
+ },
+ {
+ "epoch": 20.714285714285715,
+ "grad_norm": 0.32024624943733215,
+ "learning_rate": 2.9502510543697325e-05,
+ "loss": 0.3549,
+ "num_input_tokens_seen": 1016536,
+ "step": 580
+ },
+ {
+ "epoch": 20.892857142857142,
+ "grad_norm": 0.0999528020620346,
+ "learning_rate": 2.850423509039928e-05,
+ "loss": 0.3469,
+ "num_input_tokens_seen": 1025048,
1063
+ "step": 585
1064
+ },
1065
+ {
1066
+ "epoch": 21.0,
1067
+ "eval_loss": 0.35418227314949036,
1068
+ "eval_runtime": 1.282,
1069
+ "eval_samples_per_second": 86.583,
1070
+ "eval_steps_per_second": 10.92,
1071
+ "num_input_tokens_seen": 1029792,
1072
+ "step": 588
1073
+ },
1074
+ {
+ "epoch": 21.071428571428573,
+ "grad_norm": 0.10501652210950851,
+ "learning_rate": 2.751636311904444e-05,
+ "loss": 0.343,
+ "num_input_tokens_seen": 1032864,
+ "step": 590
+ },
+ {
+ "epoch": 21.25,
+ "grad_norm": 0.04894215986132622,
+ "learning_rate": 2.6539372738146695e-05,
+ "loss": 0.3426,
+ "num_input_tokens_seen": 1041248,
+ "step": 595
+ },
+ {
+ "epoch": 21.428571428571427,
+ "grad_norm": 0.11162778735160828,
+ "learning_rate": 2.5573736789767232e-05,
+ "loss": 0.3358,
+ "num_input_tokens_seen": 1050720,
+ "step": 600
+ },
+ {
+ "epoch": 21.607142857142858,
+ "grad_norm": 0.11263741552829742,
+ "learning_rate": 2.4619922620669218e-05,
+ "loss": 0.3549,
+ "num_input_tokens_seen": 1060064,
+ "step": 605
+ },
+ {
+ "epoch": 21.785714285714285,
+ "grad_norm": 0.2273244857788086,
+ "learning_rate": 2.3678391856132204e-05,
+ "loss": 0.3524,
+ "num_input_tokens_seen": 1068256,
+ "step": 610
+ },
+ {
+ "epoch": 21.964285714285715,
+ "grad_norm": 0.23367320001125336,
+ "learning_rate": 2.2749600176535534e-05,
+ "loss": 0.3474,
+ "num_input_tokens_seen": 1077024,
+ "step": 615
+ },
+ {
+ "epoch": 22.142857142857142,
+ "grad_norm": 0.17969422042369843,
+ "learning_rate": 2.1833997096818898e-05,
+ "loss": 0.3425,
+ "num_input_tokens_seen": 1086328,
+ "step": 620
+ },
+ {
+ "epoch": 22.321428571428573,
+ "grad_norm": 0.20163026452064514,
+ "learning_rate": 2.0932025748927013e-05,
+ "loss": 0.3438,
+ "num_input_tokens_seen": 1094520,
+ "step": 625
+ },
+ {
+ "epoch": 22.5,
+ "grad_norm": 0.23486794531345367,
+ "learning_rate": 2.0044122667343297e-05,
+ "loss": 0.3488,
+ "num_input_tokens_seen": 1103160,
+ "step": 630
+ },
+ {
+ "epoch": 22.5,
+ "eval_loss": 0.35537073016166687,
+ "eval_runtime": 1.1326,
+ "eval_samples_per_second": 98.004,
+ "eval_steps_per_second": 12.361,
+ "num_input_tokens_seen": 1103160,
+ "step": 630
+ },
+ {
+ "epoch": 22.678571428571427,
+ "grad_norm": 0.04255451634526253,
+ "learning_rate": 1.917071757781679e-05,
+ "loss": 0.3465,
+ "num_input_tokens_seen": 1112376,
+ "step": 635
+ },
+ {
+ "epoch": 22.857142857142858,
+ "grad_norm": 0.07782665640115738,
+ "learning_rate": 1.831223318938419e-05,
+ "loss": 0.3449,
+ "num_input_tokens_seen": 1121464,
+ "step": 640
+ },
+ {
+ "epoch": 23.035714285714285,
+ "grad_norm": 0.11463826894760132,
+ "learning_rate": 1.746908498978791e-05,
+ "loss": 0.3526,
+ "num_input_tokens_seen": 1130152,
+ "step": 645
+ },
+ {
+ "epoch": 23.214285714285715,
+ "grad_norm": 0.07137856632471085,
+ "learning_rate": 1.6641681044389014e-05,
+ "loss": 0.3495,
+ "num_input_tokens_seen": 1138664,
+ "step": 650
+ },
+ {
+ "epoch": 23.392857142857142,
+ "grad_norm": 0.08257535845041275,
+ "learning_rate": 1.5830421798672568e-05,
+ "loss": 0.3447,
+ "num_input_tokens_seen": 1146216,
+ "step": 655
+ },
+ {
+ "epoch": 23.571428571428573,
+ "grad_norm": 0.0766848772764206,
+ "learning_rate": 1.5035699884440697e-05,
+ "loss": 0.3509,
+ "num_input_tokens_seen": 1155496,
+ "step": 660
+ },
+ {
+ "epoch": 23.75,
+ "grad_norm": 0.08535895496606827,
+ "learning_rate": 1.4257899929787294e-05,
+ "loss": 0.3515,
+ "num_input_tokens_seen": 1164840,
+ "step": 665
+ },
+ {
+ "epoch": 23.928571428571427,
+ "grad_norm": 0.09435199946165085,
+ "learning_rate": 1.3497398372946501e-05,
+ "loss": 0.3402,
+ "num_input_tokens_seen": 1173736,
+ "step": 670
+ },
+ {
+ "epoch": 24.0,
+ "eval_loss": 0.35293668508529663,
+ "eval_runtime": 1.2061,
+ "eval_samples_per_second": 92.034,
+ "eval_steps_per_second": 11.608,
+ "num_input_tokens_seen": 1176968,
+ "step": 672
+ },
+ {
+ "epoch": 24.107142857142858,
+ "grad_norm": 0.05331406742334366,
+ "learning_rate": 1.2754563280104714e-05,
+ "loss": 0.3404,
+ "num_input_tokens_seen": 1182344,
+ "step": 675
+ },
+ {
+ "epoch": 24.285714285714285,
+ "grad_norm": 0.14441683888435364,
+ "learning_rate": 1.202975416726464e-05,
+ "loss": 0.357,
+ "num_input_tokens_seen": 1191944,
+ "step": 680
+ },
+ {
+ "epoch": 24.464285714285715,
+ "grad_norm": 0.06084274500608444,
+ "learning_rate": 1.1323321826247346e-05,
+ "loss": 0.3355,
+ "num_input_tokens_seen": 1200520,
+ "step": 685
+ },
+ {
+ "epoch": 24.642857142857142,
+ "grad_norm": 0.13699252903461456,
+ "learning_rate": 1.0635608154916648e-05,
+ "loss": 0.3433,
+ "num_input_tokens_seen": 1209288,
+ "step": 690
+ },
+ {
+ "epoch": 24.821428571428573,
+ "grad_norm": 0.14677266776561737,
+ "learning_rate": 9.966945991708005e-06,
+ "loss": 0.3554,
+ "num_input_tokens_seen": 1217544,
+ "step": 695
+ },
+ {
+ "epoch": 25.0,
+ "grad_norm": 0.23254753649234772,
+ "learning_rate": 9.317658954541992e-06,
+ "loss": 0.3439,
+ "num_input_tokens_seen": 1224848,
+ "step": 700
+ },
+ {
+ "epoch": 25.178571428571427,
+ "grad_norm": 0.1202574372291565,
+ "learning_rate": 8.688061284200266e-06,
+ "loss": 0.3371,
+ "num_input_tokens_seen": 1234064,
+ "step": 705
+ },
+ {
+ "epoch": 25.357142857142858,
+ "grad_norm": 0.055635951459407806,
+ "learning_rate": 8.07845769223981e-06,
+ "loss": 0.3422,
+ "num_input_tokens_seen": 1242448,
+ "step": 710
+ },
+ {
+ "epoch": 25.5,
+ "eval_loss": 0.35582345724105835,
+ "eval_runtime": 1.0722,
+ "eval_samples_per_second": 103.524,
+ "eval_steps_per_second": 13.057,
+ "num_input_tokens_seen": 1250064,
+ "step": 714
+ },
+ {
+ "epoch": 25.535714285714285,
+ "grad_norm": 0.29359593987464905,
+ "learning_rate": 7.489143213519301e-06,
+ "loss": 0.352,
+ "num_input_tokens_seen": 1251536,
+ "step": 715
+ },
+ {
+ "epoch": 25.714285714285715,
+ "grad_norm": 0.08035387843847275,
+ "learning_rate": 6.920403063408526e-06,
+ "loss": 0.3426,
+ "num_input_tokens_seen": 1259792,
+ "step": 720
+ },
+ {
+ "epoch": 25.892857142857142,
+ "grad_norm": 0.21466954052448273,
+ "learning_rate": 6.372512499750471e-06,
+ "loss": 0.3496,
+ "num_input_tokens_seen": 1268560,
+ "step": 725
+ },
+ {
+ "epoch": 26.071428571428573,
+ "grad_norm": 0.15136410295963287,
+ "learning_rate": 5.845736689642472e-06,
+ "loss": 0.3447,
+ "num_input_tokens_seen": 1276648,
+ "step": 730
+ },
+ {
+ "epoch": 26.25,
+ "grad_norm": 0.13561971485614777,
+ "learning_rate": 5.3403305811010885e-06,
+ "loss": 0.3406,
+ "num_input_tokens_seen": 1287144,
+ "step": 735
+ },
+ {
+ "epoch": 26.428571428571427,
+ "grad_norm": 0.07142584770917892,
+ "learning_rate": 4.8565387796728865e-06,
+ "loss": 0.3451,
+ "num_input_tokens_seen": 1294504,
+ "step": 740
+ },
+ {
+ "epoch": 26.607142857142858,
+ "grad_norm": 0.2093130499124527,
+ "learning_rate": 4.394595430050613e-06,
+ "loss": 0.3485,
+ "num_input_tokens_seen": 1303336,
+ "step": 745
+ },
+ {
+ "epoch": 26.785714285714285,
+ "grad_norm": 0.07084821909666061,
+ "learning_rate": 3.954724102752316e-06,
+ "loss": 0.3451,
+ "num_input_tokens_seen": 1312296,
+ "step": 750
+ },
+ {
+ "epoch": 26.964285714285715,
+ "grad_norm": 0.1166943907737732,
+ "learning_rate": 3.537137685918074e-06,
+ "loss": 0.3474,
+ "num_input_tokens_seen": 1319912,
+ "step": 755
+ },
+ {
+ "epoch": 27.0,
+ "eval_loss": 0.3538145124912262,
+ "eval_runtime": 1.1107,
+ "eval_samples_per_second": 99.936,
+ "eval_steps_per_second": 12.605,
+ "num_input_tokens_seen": 1321408,
+ "step": 756
+ },
+ {
+ "epoch": 27.142857142857142,
+ "grad_norm": 0.21413478255271912,
+ "learning_rate": 3.1420382822767323e-06,
+ "loss": 0.3382,
+ "num_input_tokens_seen": 1329024,
+ "step": 760
+ },
+ {
+ "epoch": 27.321428571428573,
+ "grad_norm": 0.08071761578321457,
+ "learning_rate": 2.7696171113326396e-06,
+ "loss": 0.3531,
+ "num_input_tokens_seen": 1337984,
+ "step": 765
+ },
+ {
+ "epoch": 27.5,
+ "grad_norm": 0.20051950216293335,
+ "learning_rate": 2.420054416819556e-06,
+ "loss": 0.3338,
+ "num_input_tokens_seen": 1346112,
+ "step": 770
+ },
+ {
+ "epoch": 27.678571428571427,
+ "grad_norm": 0.05100846663117409,
+ "learning_rate": 2.093519379466602e-06,
+ "loss": 0.342,
+ "num_input_tokens_seen": 1355584,
+ "step": 775
+ },
+ {
+ "epoch": 27.857142857142858,
+ "grad_norm": 0.11974883079528809,
+ "learning_rate": 1.7901700351184659e-06,
+ "loss": 0.3414,
+ "num_input_tokens_seen": 1364864,
+ "step": 780
+ },
+ {
+ "epoch": 28.035714285714285,
+ "grad_norm": 0.0698491781949997,
+ "learning_rate": 1.5101531982495308e-06,
+ "loss": 0.3553,
+ "num_input_tokens_seen": 1373400,
+ "step": 785
+ },
+ {
+ "epoch": 28.214285714285715,
+ "grad_norm": 0.18634414672851562,
+ "learning_rate": 1.2536043909088191e-06,
+ "loss": 0.349,
+ "num_input_tokens_seen": 1382744,
+ "step": 790
+ },
+ {
+ "epoch": 28.392857142857142,
+ "grad_norm": 0.05743042752146721,
+ "learning_rate": 1.0206477771303236e-06,
+ "loss": 0.3409,
+ "num_input_tokens_seen": 1390424,
+ "step": 795
+ },
+ {
+ "epoch": 28.5,
+ "eval_loss": 0.35238173604011536,
+ "eval_runtime": 1.1779,
+ "eval_samples_per_second": 94.233,
+ "eval_steps_per_second": 11.885,
+ "num_input_tokens_seen": 1394904,
+ "step": 798
+ },
+ {
+ "epoch": 28.571428571428573,
+ "grad_norm": 0.0811547264456749,
+ "learning_rate": 8.113961028402894e-07,
+ "loss": 0.3513,
+ "num_input_tokens_seen": 1399256,
+ "step": 800
+ },
+ {
+ "epoch": 28.75,
+ "grad_norm": 0.06173446401953697,
+ "learning_rate": 6.259506412906402e-07,
+ "loss": 0.35,
+ "num_input_tokens_seen": 1407896,
+ "step": 805
+ },
+ {
+ "epoch": 28.928571428571427,
+ "grad_norm": 0.05312683433294296,
+ "learning_rate": 4.6440114404492363e-07,
+ "loss": 0.338,
+ "num_input_tokens_seen": 1416472,
+ "step": 810
+ },
+ {
+ "epoch": 29.107142857142858,
+ "grad_norm": 0.07188601791858673,
+ "learning_rate": 3.268257975405697e-07,
+ "loss": 0.3438,
+ "num_input_tokens_seen": 1425120,
+ "step": 815
+ },
+ {
+ "epoch": 29.285714285714285,
+ "grad_norm": 0.07120722532272339,
+ "learning_rate": 2.1329118524827662e-07,
+ "loss": 0.3419,
+ "num_input_tokens_seen": 1433248,
+ "step": 820
+ },
+ {
+ "epoch": 29.464285714285715,
+ "grad_norm": 0.12804925441741943,
+ "learning_rate": 1.238522554470989e-07,
+ "loss": 0.3417,
+ "num_input_tokens_seen": 1442592,
+ "step": 825
+ },
+ {
+ "epoch": 29.642857142857142,
+ "grad_norm": 0.26469874382019043,
+ "learning_rate": 5.855229463068712e-08,
+ "loss": 0.3451,
+ "num_input_tokens_seen": 1451360,
+ "step": 830
+ },
+ {
+ "epoch": 29.821428571428573,
+ "grad_norm": 0.21112701296806335,
+ "learning_rate": 1.742290655755707e-08,
+ "loss": 0.3525,
+ "num_input_tokens_seen": 1459808,
+ "step": 835
+ },
+ {
+ "epoch": 30.0,
+ "grad_norm": 0.2481113076210022,
+ "learning_rate": 4.839969555581192e-10,
+ "loss": 0.3344,
+ "num_input_tokens_seen": 1468632,
+ "step": 840
+ },
+ {
+ "epoch": 30.0,
+ "eval_loss": 0.3523610234260559,
+ "eval_runtime": 1.2257,
+ "eval_samples_per_second": 90.557,
+ "eval_steps_per_second": 11.422,
+ "num_input_tokens_seen": 1468632,
+ "step": 840
+ },
+ {
+ "epoch": 30.0,
+ "num_input_tokens_seen": 1468632,
+ "step": 840,
+ "total_flos": 6.613183518533222e+16,
+ "train_loss": 0.47382661629290806,
+ "train_runtime": 311.6966,
+ "train_samples_per_second": 42.638,
+ "train_steps_per_second": 2.695
+ }
+ ],
+ "logging_steps": 5,
+ "max_steps": 840,
+ "num_input_tokens_seen": 1468632,
+ "num_train_epochs": 30,
+ "save_steps": 42,
+ "stateful_callbacks": {
+ "TrainerControl": {
+ "args": {
+ "should_epoch_stop": false,
+ "should_evaluate": false,
+ "should_log": false,
+ "should_save": true,
+ "should_training_stop": true
+ },
+ "attributes": {}
+ }
+ },
+ "total_flos": 6.613183518533222e+16,
+ "train_batch_size": 8,
+ "trial_name": null,
+ "trial_params": null
+ }
training_eval_loss.png ADDED
training_loss.png ADDED