htxu91 committed on
Commit c8140ad · verified · 1 Parent(s): 578dc2e

Upload folder using huggingface_hub

my_evaluation.py ADDED
@@ -0,0 +1,557 @@
1
+
2
+
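+ # Prompt templates keyed by --prompt_type; the literal '{input}' placeholder is
+ # replaced with the problem text when prompts are built.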
3
+ TEMPLATE = {
4
+ 'orz_tir':
5
+ """A conversation between User and Assistant. The User asks a question, and the Assistant solves it. The Assistant first thinks about the reasoning process in the mind and then provides the User with the answer. The reasoning process is enclosed within <think> </think> and answer is enclosed within <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think> <answer> answer here </answer>. In your reasoning-process, You can use python-code to solve your problem. User: You must put your answer inside <answer> </answer> tags, i.e., <answer> answer here </answer>. And your final answer will be extracted automatically by the \\boxed{} tag.\nThis is the problem:{input}\nAssistant: <think>""",
6
+ "orz_tir_xinji": """A conversation between User and Assistant. The User asks a question, and the Assistant solves it. The Assistant first thinks about the reasoning process in the mind and then provides the User with the answer. The reasoning process is enclosed within <think> </think> and answer is enclosed within <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think> <answer> answer here </answer>. User: You can use Python code during the solution process, and the code will be executed immediately and the result will be returned. You must put your answer inside <answer> </answer> tags, i.e., <answer> answer here </answer>. And your final answer will be extracted automatically by the \\boxed{} tag.\nThis is the problem:{input}\nAssistant: <think>""",
7
+ "orz_xinji": """A conversation between User and Assistant. The User asks a question, and the Assistant solves it. The Assistant first thinks about the reasoning process in the mind and then provides the User with the answer. The reasoning process is enclosed within <think> </think> and answer is enclosed within <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think> <answer> answer here </answer>. User: You must put your answer inside <answer> </answer> tags, i.e., <answer> answer here </answer>. And your final answer will be extracted automatically by the \\boxed{} tag.\nThis is the problem:{input}\nAssistant: <think>""",
8
+ "orz_ch": """A conversation between User and Assistant. The User asks a question, and the Assistant solves it. The Assistant first thinks about the reasoning process in the mind and then provides the User with the answer. The reasoning process is enclosed within <think> </think> and answer is enclosed within <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think> <answer> answer here </answer>. User: You must put your answer inside <answer> </answer> tags, i.e., <answer> answer here </answer>. And your final answer will be extracted automatically by the \\boxed{} tag.\nThis is the problem:{input}\nAssistant: <think>""",
9
+ "qwen25-math-cot-tora": """<|im_start|>system\nPlease integrate natural language reasoning with programs to solve the problem above, and put your final answer within \\boxed{}.<|im_end|>\n<|im_start|>user\n{input}<|im_end|>\n<|im_start|>assistant\n""",
10
+ "deepseek_r1_distill": """<|begin▁of▁sentence|>You are Qwen, created by Alibaba Cloud. You are a helpful assistant. You should think step-by-step.<|User|>{input}\nPlease reason step by step, and put your final answer within \\boxed{{}}.<|Assistant|>"""
11
+ }
12
+
13
+ import random
14
+ import os, sys
15
+ from timeout_decorator import timeout
16
+ import argparse
17
+ import time
18
+ from vllm import LLM, SamplingParams
19
+ from datetime import datetime
20
+ from tqdm import tqdm
21
+ from collections import OrderedDict
22
+ import openai
23
+ import numpy as np
24
+
25
+ from transformers import AutoTokenizer, AutoModelForCausalLM
26
+
27
+ from latex2sympy2_extended import NormalizationConfig
28
+ from math_verify import LatexExtractionConfig, parse, verify
29
+ import json, os
30
+ import os, sys, uuid
31
+
32
+ ENV_ITER_NUM = int(os.getenv('ENV_ITER_NUM', '2'))
33
+ VLLM_VERSION = os.getenv('VLLM_VERSION', 'vllm_083')
34
+ USE_ID = os.getenv('USE_ID', 'NONE')
35
+
36
+ sys.path.append(os.getenv('OPENRLHF_PATH', '/cpfs/user/chenhao/debug/OpenRLHF_082'))
37
+ # from env.math.math_tir import math_tir_generate
38
+ from env.math.math_tir_process_single_request import math_tir_generate_async
39
+ from openrlhf.async_pipline.process_request import GenerateRequest, default_generate, process_batch_requests
40
+ from passk_eval import estimate_pass_at_k
41
+ from tabulate import tabulate
42
+ import uuid
43
+
44
+ import asyncio
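+ # Async vLLM wrapper exposing a blocking generate() over batched requests;
+ # used for the tool-integrated (TIR) path when --use_seperate is set.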
45
+ class AsyncLLM(object):
46
+ def __init__(self, args):
47
+ import vllm
48
+ available_gpus = os.environ["CUDA_VISIBLE_DEVICES"].split(",")
49
+ engine_args = vllm.AsyncEngineArgs(
50
+ model=args.model_name_or_path,
51
+ tensor_parallel_size=len(available_gpus) // args.pipeline_parallel_size,
52
+ pipeline_parallel_size=args.pipeline_parallel_size,
53
+ trust_remote_code=True,
54
+ gpu_memory_utilization=0.98,
55
+ dtype="bfloat16",
56
+ disable_log_requests=True,
57
+ seed=args.seed)
58
+ self.llm = vllm.AsyncLLMEngine.from_engine_args(engine_args)
59
+ self.semaphore = asyncio.Semaphore(512) # shared at the instance level
60
+ from transformers import AutoTokenizer
61
+ self.tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path)
62
+ self.args = args
63
+ self.batch_size = 4
64
+
65
+ def shutdown(self):
66
+ self.llm.shutdown() # free GPU memory
67
+
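+ # Submit a single request to the async engine. USE_ID selects whether the
+ # prompt is passed as pre-tokenized ids (TokensPrompt) or as raw text, and the
+ # output stream is drained until the final result arrives.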
68
+ async def generate_async_server(self, request: GenerateRequest, sampling_params, request_id):
69
+ # Send the request to the LLM engine.
70
+ from vllm.inputs import TokensPrompt
71
+ async with self.semaphore: # use the shared semaphore
72
+ # async with asyncio.Semaphore(MAX_CONCURRENT): # shared at the instance level
73
+ # stream = self.llm.generate(
74
+ # request_id=str(request_id),
75
+ # prompt=request.prompts[0],
76
+ # sampling_params=sampling_params,
77
+ # )
78
+
79
+ if USE_ID == 'USE_ID':
80
+ stream = self.llm.generate(
81
+ request_id=str(request_id),
82
+ prompt=TokensPrompt(prompt_token_ids=request.prompt_token_ids),
83
+ # prompt=request.prompts[0],
84
+ sampling_params=sampling_params,
85
+ )
86
+
87
+ else:
88
+ stream = self.llm.generate(
89
+ request_id=str(request_id),
90
+ # prompt=TokensPrompt(prompt_token_ids=request.prompt_token_ids),
91
+ prompt=request.prompts[0],
92
+ sampling_params=sampling_params,
93
+ )
94
+
95
+ # Consume the stream until the request is finished.
96
+ # moved inside the loop to keep each request's state isolated
97
+ final_output = None
98
+ async for request_output in stream:
99
+ final_output = request_output
100
+ if final_output is None:
101
+ raise RuntimeError(f"Empty stream for request_id: {request_id}")
102
+
103
+ assert final_output.request_id == request_id
104
+ output = [{
105
+ 'outputs':[
106
+ {
107
+ "text": final_output.outputs[0].text,
108
+ "token_ids": final_output.outputs[0].token_ids,
109
+ "stop_reason": final_output.outputs[0].stop_reason,
110
+ "finish_reason": final_output.outputs[0].finish_reason,
111
+ "log_probs": final_output.outputs[0].logprobs
112
+ }
113
+ ],
114
+ "prompt_token_ids": final_output.prompt_token_ids,
115
+ "request_id": final_output.request_id
116
+ }]
117
+ return output
118
+
119
+ async def async_llm_generate(self, request: GenerateRequest):
120
+ # actual generation logic
121
+ from vllm import SamplingParams
122
+ sampling_params = SamplingParams(
123
+ n=request.n,
124
+ repetition_penalty=1.0,
125
+ temperature=request.temperature,
126
+ top_p=request.top_p,
127
+ top_k=request.top_k,
128
+ min_p=request.min_p,
129
+ max_tokens=request.max_tokens,
130
+ include_stop_str_in_output=request.include_stop_str_in_output,
131
+ stop=request.stop,
132
+ skip_special_tokens=False,
133
+ logprobs=None
134
+ )
135
+
136
+ # request_id = str(uuid.uuid4())+request.uuids
137
+ request_id = f"{time.time_ns()}-{uuid.uuid4()}"
138
+ response = await self.generate_async_server(request, sampling_params, request_id)
139
+ return response
140
+
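+ # Wrap each prompt into a GenerateRequest; the '####idx:{idx}' suffix appended
+ # to the uuid is what batch_generate later sorts on to restore prompt order.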
141
+ def build_requests(self, prompts, uuids, sampling_params, infer_type='math_tir_async'):
142
+ request_list = []
143
+ for idx, (prompt, uuid_str) in enumerate(zip(prompts, uuids)):
144
+ request = GenerateRequest(
145
+ prompts=[prompt],
146
+ prompt_token_ids=self.tokenizer(prompt)['input_ids'],
147
+ max_tokens=sampling_params.max_tokens,
148
+ temperature=sampling_params.temperature,
149
+ stop=sampling_params.stop,
150
+ uuids=uuid_str+f'####idx:{idx}',
151
+ env_func=infer_type,
152
+ label=json.dumps({}, ensure_ascii=False),
153
+ request_rank=0,
154
+ max_length=sampling_params.max_tokens+1024,
155
+ enable_vllm_is_correction=False
156
+ )
157
+ request_list.append(request)
158
+ print(len(request_list), '==request_list==')
159
+ return request_list
160
+
161
+ def _create_batches(self, data_list):
162
+ """Split the data into batches; returns [(start_idx, batch), ...]"""
163
+ batches = []
164
+ if isinstance(data_list, list):
165
+ for i in range(0, len(data_list), self.batch_size):
166
+ batch = data_list[i:i + self.batch_size]
167
+ batches.append((i, batch))
168
+ # note: the remaining tail is not appended again here; doing so would enqueue
169
+ # the same items twice and break the len(outputs) == len(prompts) assertion downstream
170
+ elif isinstance(data_list, dict):
171
+ for env_func in data_list:
172
+ for i in range(0, len(data_list[env_func]), self.batch_size):
173
+ batch = data_list[env_func][i:i + self.batch_size]
174
+ batches.append((i, batch))
175
+ # same as above: emit each batch exactly once so that downstream responses
176
+ # line up one-to-one with the input prompts
177
+ else:
178
+ raise ValueError("data_list must be a list or dict")
179
+ return batches
180
+
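+ # Fan the request batches out concurrently via process_batch_requests, then
+ # flatten the results and re-sort responses by their '####idx:' suffix.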
181
+ async def batch_generate(self, prompts, uuids, sampling_params):
182
+ request_list = self.build_requests(prompts, uuids, sampling_params)
183
+ batches = self._create_batches(request_list)
184
+ response_tasks = []
185
+ for start_idx, batch in batches:
186
+ env_func = batch[0].env_func
187
+ response_tasks.append(process_batch_requests(self.async_llm_generate, start_idx, batch, env_func=env_func, tokenizer=self.tokenizer, use_reward=False))
188
+
189
+ results_raw = await asyncio.gather(*response_tasks)
190
+
191
+ flat_results = []
192
+ for result_raw in results_raw:
193
+ successful_results, failed_results = result_raw
194
+ for item in successful_results:
195
+ flat_results.append(item)
196
+ responses = [result[1][1] for result in flat_results]
197
+ responses.sort(key=lambda x: int(x.request_id.split('####idx:')[-1]))
198
+ return responses
199
+
200
+ def generate(self, prompts, uuids, sampling_params):
201
+ responses = asyncio.run(self.batch_generate(prompts, uuids, sampling_params))
202
+ return responses
203
+
204
+
205
+
206
+ def seed_everything(seed: int):
207
+ import random, os
208
+ import numpy as np
209
+ import torch
210
+
211
+ random.seed(seed)
212
+ os.environ['PYTHONHASHSEED'] = str(seed)
213
+ np.random.seed(seed)
214
+ torch.manual_seed(seed)
215
+ torch.cuda.manual_seed(seed)
216
+ torch.backends.cudnn.deterministic = True
217
+ torch.backends.cudnn.benchmark = False # benchmark=True would undermine determinism
218
+
219
+ def save_jsonl(samples, save_path):
220
+ # ensure path
221
+ folder = os.path.dirname(save_path)
222
+ os.makedirs(folder, exist_ok=True)
223
+
224
+ with open(save_path, "w", encoding="utf-8") as f:
225
+ for sample in samples:
226
+ f.write(json.dumps(sample, ensure_ascii=False) + "\n")
227
+ print("Saved to", save_path)
228
+
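+ # Evaluate one dataset: read test.jsonl, render prompts from TEMPLATE, generate
+ # with vLLM (plain or async TIR), extract boxed answers, grade them, and compute
+ # pass@k / avg@k statistics when n_sampling > 1.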
229
+ def evaluation(args, data_name, llm, tokenizer):
230
+ print(f"### begin to evaluate {data_name} ###")
231
+ data_list = []
232
+ with open(os.path.join(args.data_dir, data_name, 'test.jsonl')) as frobj:
233
+ for line in tqdm(frobj):
234
+ d = json.loads(line.strip())
235
+ for ans_key in args.answer_key.split(','):
236
+ if ans_key in d:
237
+ d['answer'] = d[ans_key]
238
+ break
239
+ assert 'answer' in d
240
+ data_list.append(d)
241
+
242
+ print(data_list[0].keys())
243
+
244
+ stop_words = ["<|im_end|>", "<|endoftext|>", "</answer>", "</answer>\n"]
245
+ sampling_params = SamplingParams(
246
+ temperature=float(args.temperature),
247
+ top_p=args.top_p,
248
+ top_k=args.top_k,
249
+ max_tokens=args.max_tokens_per_call,
250
+ n=1,
251
+ seed=args.seed,
252
+ stop=stop_words,
253
+ skip_special_tokens=False,
254
+ include_stop_str_in_output=True,
255
+ )
256
+
257
+ print('==sampling_params==', sampling_params)
258
+
259
+ input_prompts = []
260
+ for d in data_list:
261
+ for q_key in args.input_key.split(','):
262
+ if q_key in d:
263
+ input_prompts.append(d[q_key])
264
+ break
265
+
266
+ assert len(input_prompts) == len(data_list)
267
+
268
+ # repeat n times
269
+ prompts = [
270
+ TEMPLATE[args.prompt_type].replace('{input}', prompt) for prompt in input_prompts for _ in range(args.n_sampling)
271
+ ]
272
+
273
+ prompts_idx = [
274
+ idx for (idx, prompt) in enumerate(input_prompts) for _ in range(args.n_sampling)
275
+ ]
276
+
277
+ uuids = []
278
+ for (idx, prompt) in enumerate(input_prompts):
279
+ for _ in range(args.n_sampling):
280
+ uuid_str = str(uuid.uuid4())
281
+ uuids.append(uuid_str)
282
+
283
+ if args.use_vllm:
284
+ outputs = llm.generate(
285
+ prompts,
286
+ sampling_params
287
+ )
288
+ # elif args.use_vllm_tir:
289
+ # outputs = math_tir_generate(llm, sampling_params, None, tokenizer, prompts=prompts)
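+ # (the in-process TIR path above is currently disabled; with --use_vllm_tir the
+ # separate AsyncLLM path below is expected to be used together with --use_seperate)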
290
+
291
+ if args.use_vllm_tir and args.use_seperate:
292
+ outputs = llm.generate(prompts, uuids, sampling_params)
293
+
294
+ assert len(outputs) == len(prompts)
295
+
296
+ for idx in range(len(prompts)):
297
+ d_idx = prompts_idx[idx]
298
+ d = data_list[d_idx]
299
+ if 'pred_response' not in d:
300
+ d['pred_response'] = []
301
+ output = outputs[idx]
302
+ d['pred_response'].append(output.outputs[0].text)
303
+
304
+ model_name = "/".join(args.model_name_or_path.split("/")[-2:])
305
+ if args.use_vllm_tir:
306
+ out_file_prefix = f"{args.split}_{args.prompt_type}_{args.num_test_sample}_seed{args.seed}_t{args.temperature}_nsample{args.n_sampling}_enviter{ENV_ITER_NUM}_vllm{VLLM_VERSION}"
307
+ else:
308
+ out_file_prefix = f"{args.split}_{args.prompt_type}_{args.num_test_sample}_seed{args.seed}_t{args.temperature}_nsample{args.n_sampling}_vllm{VLLM_VERSION}"
309
+ output_dir = args.output_dir
310
+ if not os.path.exists(output_dir):
311
+ output_dir = f"outputs/{output_dir}"
312
+ out_file = f"{output_dir}/{data_name}/{out_file_prefix}_s{args.start}_e{args.end}.jsonl"
313
+ os.makedirs(f"{output_dir}/{data_name}", exist_ok=True)
314
+
315
+ # Calculate pass@k.
316
+ total, correct = [], []
317
+ for d in data_list:
318
+ d['pred_score'] = []
319
+ d['pred_answer'] = []
320
+ for resp in d['pred_response']:
321
+ pred_ans = extract_answer(resp)
322
+ if pred_ans:
323
+ d['pred_answer'].append(pred_ans)
324
+ else:
325
+ d['pred_answer'].append('')
326
+ score = answer_grader(str(d['answer']), pred_ans)
327
+ d['pred_score'].append(score)
328
+
329
+ if args.n_sampling > 1:
330
+ # valid_answer = [pred_ans for pred_ans in d['pred_answer'] if pred_ans]
331
+ # d['pred_maj_answer'] = max(set(valid_answer),
332
+ # key=valid_answer.count)
333
+ # d['pred_max_score'] = max(d['pred_score'])
334
+ # d['pred_maj_score'] = answer_grader(str(d['answer']), d['pred_maj_answer'])
335
+
336
+ total.append(len(d['pred_score']))
337
+ correct.append(sum(d['pred_score']))
338
+
339
+ if args.n_sampling > 1:
340
+
341
+ total = np.array(total)
342
+ correct = np.array(correct)
343
+
344
+ ks = [int(args.pass_at_k)]
345
+ pass_at_k = {f"pass@{k}": estimate_pass_at_k(total, correct, k).mean()
346
+ for k in ks if (total >= k).all()}
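+ # estimate_pass_at_k comes from the local passk_eval module; it presumably
+ # implements the standard unbiased estimator pass@k = 1 - C(n-c, k) / C(n, k)
+ # for n samples with c correct, averaged over problems.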
347
+
348
+ avg_at_k = {}
349
+ score_at_k = [[] for _ in range(args.n_sampling)]
350
+ for d in data_list:
351
+ assert len(d['pred_score']) == args.n_sampling
352
+ for idx, score in enumerate(d['pred_score']):
353
+ score_at_k[idx].append(score)
354
+
355
+ avg_score = []
356
+ for sampling_idx in range(args.n_sampling):
357
+ score = 100 / len(data_list) * sum(score_at_k[sampling_idx])
358
+ avg_score.append(score)
359
+ pass_at_k[f'avg@{args.n_sampling}'] = sum(avg_score) / args.n_sampling
360
+
361
+ else:
362
+ pass_at_k = {}
363
+
364
+ return data_list, out_file, pass_at_k
365
+
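+ # Extract the final answer from the last \boxed{...} in the response using
+ # brace matching; returns None when no boxed answer is present.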
366
+ def extract_answer(pred_str):
367
+ if "boxed" in pred_str:
368
+ ans = pred_str.split("boxed")[-1]
369
+ if len(ans) == 0:
370
+ return ""
371
+ elif ans[0] == "{":
372
+ stack = 1
373
+ a = ""
374
+ for c in ans[1:]:
375
+ if c == "{":
376
+ stack += 1
377
+ a += c
378
+ elif c == "}":
379
+ stack -= 1
380
+ if stack == 0:
381
+ break
382
+ a += c
383
+ else:
384
+ a += c
385
+ else:
386
+ a = ans.split("$")[0].strip()
387
+ pred = a
388
+ return pred
389
+ else:
390
+ return None
391
+
392
+ @timeout(10, use_signals=False)
393
+ def my_verify(gold, pred):
394
+ return float(verify(gold, pred))
395
+
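+ # Grade a prediction against the gold answer with math_verify: both sides are
+ # wrapped in \boxed{} and parsed, verification is capped at 10 seconds by
+ # my_verify, and any parse failure or exception scores 0.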
396
+ def answer_grader(gold_ans, pred_ans):
397
+
398
+ if pred_ans is None:
399
+ return 0
400
+
401
+ gold_parsed = parse('\\boxed{'+gold_ans+'}',
402
+ extraction_mode="first_match",
403
+ extraction_config=[LatexExtractionConfig()])
404
+
405
+ pred_parsed = parse(
406
+ "\\boxed{"+pred_ans+"}",
407
+ extraction_config=[
408
+ LatexExtractionConfig(
409
+ normalization_config=NormalizationConfig(
410
+ nits=False,
411
+ malformed_operators=False,
412
+ basic_latex=True,
413
+ equations=True,
414
+ boxed=True,
415
+ units=True,
416
+ ),
417
+ # Ensures that boxed is tried first
418
+ boxed_match_priority=0,
419
+ try_extract_without_anchor=False,
420
+ )
421
+ ],
422
+ extraction_mode="first_match",
423
+ )
424
+
425
+ if len(gold_parsed) != 0 and len(pred_parsed) != 0:
426
+ try:
427
+ score = my_verify(gold_parsed,
428
+ pred_parsed)
429
+ except Exception as e:
430
+ score = 0
431
+ else:
432
+ score = 0
433
+
434
+ return score
435
+
436
+
437
+ def evaluation_main(args):
438
+
439
+ available_gpus = os.environ["CUDA_VISIBLE_DEVICES"].split(",")
440
+ enforce_eager = os.getenv('ENFORCE_EAGER', 'FALSE')
441
+
442
+ print(available_gpus, '==available_gpus==')
443
+
444
+ if args.use_seperate:
445
+ print('==using async-llm==')
446
+ llm = None
447
+ else:
448
+ llm = LLM(
449
+ model=args.model_name_or_path,
450
+ tensor_parallel_size=len(available_gpus) // args.pipeline_parallel_size,
451
+ pipeline_parallel_size=args.pipeline_parallel_size,
452
+ trust_remote_code=True,gpu_memory_utilization=0.98,
453
+ dtype="bfloat16",
454
+ enforce_eager=True if enforce_eager == 'TRUE' else False,
455
+ seed=args.seed
456
+ )
457
+
458
+ tokenizer = AutoTokenizer.from_pretrained(
459
+ args.model_name_or_path, trust_remote_code=True, use_fast=True
460
+ )
461
+
462
+ avg_score = 0.0
463
+ score_dict = OrderedDict()
464
+ for data_name in args.data_names.split(','):
465
+ score_dict[data_name] = {}
466
+ if args.use_seperate:
467
+ if llm is not None:
468
+ llm.shutdown()
469
+ del llm
470
+ llm = AsyncLLM(args)
471
+ data_list, out_file, pass_at_k = evaluation(args, data_name, llm, tokenizer)
472
+ if args.n_sampling == 1:
473
+ data_score = sum([d['pred_score'][0] for d in data_list])
474
+ final_score = 100 / len(data_list) * data_score
475
+ score_dict[data_name]['final_score'] = final_score
476
+ # else:
477
+ # data_max_score = sum([d['pred_max_score'] for d in data_list])
478
+ # final_max_score = 100 / len(data_list) * data_max_score
479
+ # score_dict[data_name]['final_max_score'] = final_max_score
480
+
481
+ # data_maj_score = sum([d['pred_maj_score'] for d in data_list])
482
+ # final_maj_score = 100 / len(data_list) * data_maj_score
483
+ # score_dict[data_name]['final_maj_score'] = final_maj_score
484
+
485
+ score_dict[data_name].update(pass_at_k)
486
+
487
+ with open(out_file, 'w') as fwobj:
488
+ for d in data_list:
489
+ fwobj.write(json.dumps(d, ensure_ascii=False)+'\n')
490
+
491
+ print(data_name, '===', score_dict[data_name])
492
+ print(data_name, '====', out_file, '==out_file==')
493
+
494
+
495
+ data = []
496
+ headers = []
497
+ for name in score_dict:
498
+ item = [name]
499
+ headers = ['dataset']
500
+ for score_key in score_dict[name]:
501
+ item.append(score_dict[name][score_key])
502
+ headers.append(score_key)
503
+ data.append(item)
504
+
505
+ table = tabulate(data, headers=headers, tablefmt="pipe")
506
+
507
+ print(f'### {out_file} evaluation ###')
508
+ print(table)
509
+
510
+ metric_path = out_file.replace(".jsonl", f"_{args.prompt_type}_metrics.json")
511
+ with open(metric_path, "w") as f:
512
+ json.dump({
513
+ 'value': score_dict,
514
+ }, f, indent=4)
515
+ print(f'### {metric_path} ###')
516
+
517
+
518
+
519
+ def parse_args():
520
+ parser = argparse.ArgumentParser()
521
+ parser.add_argument("--data_names", default="gsm8k,math", type=str)
522
+ parser.add_argument("--data_dir", default="./data", type=str)
523
+ parser.add_argument("--model_name_or_path", default="gpt-4", type=str)
524
+ parser.add_argument("--output_dir", default="./output", type=str)
525
+ parser.add_argument("--prompt_type", default="tool-integrated", type=str)
526
+ parser.add_argument("--input_key", default="problem,question", type=str)
527
+ parser.add_argument("--answer_key", default="answer,final_answer", type=str)
528
+ parser.add_argument("--split", default="test", type=str)
529
+ parser.add_argument("--num_test_sample", default=-1, type=int) # -1 for full data
530
+ parser.add_argument("--seed", default=0, type=int)
531
+ parser.add_argument("--start", default=0, type=int)
532
+ parser.add_argument("--pass_at_k", default=1, type=int)
533
+ parser.add_argument("--end", default=-1, type=int)
534
+ parser.add_argument("--temperature", default=0, type=float)
535
+ parser.add_argument("--n_sampling", default=1, type=int)
536
+ parser.add_argument("--top_p", default=1, type=float)
537
+ parser.add_argument("--top_k", default=-1, type=int)
538
+ parser.add_argument("--max_tokens_per_call", default=16384, type=int)
539
+ parser.add_argument("--shuffle", action="store_true")
540
+ parser.add_argument("--use_vllm", action="store_true")
541
+ parser.add_argument("--use_vllm_tir", action="store_true")
542
+ parser.add_argument("--use_seperate", action="store_true")
543
+ parser.add_argument("--save_outputs", action="store_true")
544
+ parser.add_argument("--overwrite", action="store_true")
545
+ parser.add_argument("--num_shots", type=int, default=0)
546
+ parser.add_argument("--pipeline_parallel_size", type=int, default=1)
547
+ args = parser.parse_args()
548
+ args.top_p = (
549
+ 1 if args.temperature == 0 else args.top_p
550
+ ) # top_p must be 1 when using greedy sampling (vllm)
551
+ return args
552
+
553
+
554
+ if __name__ == "__main__":
555
+ args = parse_args()
556
+ seed_everything(args.seed)
557
+ evaluation_main(args)
my_evaluation_tir.sh ADDED
@@ -0,0 +1,25 @@
1
+ set -ex
2
+
3
+ export TOKENIZERS_PARALLELISM=false
4
+ top_p=${TOP_P:-1}
5
+
6
+ echo "TOP_P $top_p"
7
+
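+ # MODEL_NAME_OR_PATH, DATA_NAME, OUTPUT_DIR, PROMPT_TYPE, INPUT_KEY, ANSWER_KEY,
+ # TEMPERATURE and N_SAMPLING are expected to be exported by the caller
+ # (see run_evaluation.sh / run_script_evaluation.sh).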
8
+ python3 -u my_evaluation.py \
9
+ --model_name_or_path ${MODEL_NAME_OR_PATH} \
10
+ --data_names ${DATA_NAME} \
11
+ --output_dir ${OUTPUT_DIR} \
12
+ --prompt_type ${PROMPT_TYPE} \
13
+ --input_key ${INPUT_KEY} \
14
+ --answer_key ${ANSWER_KEY} \
15
+ --seed 42 \
16
+ --temperature ${TEMPERATURE} \
17
+ --n_sampling ${N_SAMPLING} \
18
+ --top_p ${top_p} \
19
+ --start 0 \
20
+ --end -1 \
21
+ --use_vllm_tir \
22
+ --save_outputs \
23
+ # --use_seperate \
24
+ # --overwrite \
25
+
run_evaluation.sh ADDED
@@ -0,0 +1,27 @@
1
+
2
+
3
+ # export step=600
4
+ # export MODEL_NAME_OR_PATH=/cpfs/user/chenhao/outputs/qwen25_7B_reinforce_baseline_zero_tir_fix_boxed_lr1e-6_warmup0.0_kl0.0_zero_tir_0426_nginx_prefetch_fix_env_mask_vllm083_xverify_dapo_async_iternum2/_actor/global_step250/ckpt/pytorch_model.bin/
5
+ # export MODEL_NAME_OR_PATH=/cpfs/user/chenhao/outputs/qwen25_7B_reinforce_baseline_zero_tir_fix_boxed_lr1e-6_warmup0.0_kl0.0_zero_tir_0426_nginx_prefetch_fix_env_mask_vllm083_xverify_deepmath_async_iternum2/_actor/global_step350/ckpt/pytorch_model.bin/
6
+
7
+ # export CUDA_VISIBLE_DEVICES="0"
8
+ # export INPUT_KEY='problem,question'
9
+ # export ANSWER_KEY='answer,final_answer'
10
+ # export PROMPT_TYPE='orz_tir'
11
+ # export DATA_NAME="aime25"
12
+ # export OUTPUT_DIR=${MODEL_NAME_OR_PATH}/math_eval
13
+ # export N_SAMPLING=2
14
+ # export TEMPERATURE=0.0
15
+ # export VLLM_USE_V1=0
16
+ # export USE_TIR='yes'
17
+ # export USE_SEPERATE='no'
18
+
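+ # Pick the evaluation entry script based on the USE_TIR / USE_SEPERATE flags.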
19
+ if [ "$USE_TIR" = "yes" ] && [ "$USE_SEPERATE" = "yes" ]; then
20
+ echo "USING TIR and SEPERATE"
21
+ bash my_evaluation_tir_seperate.sh
22
+ elif [ "$USE_TIR" = "yes" ]; then
23
+ echo "USING TIR"
24
+ bash my_evaluation_tir.sh
25
+ else
26
+ bash my_evaluation.sh
27
+ fi
run_script_evaluation.sh ADDED
@@ -0,0 +1,97 @@
1
+ # apt-get update && \
2
+ # apt-get install -y gosu && \
3
+ # rm -rf /var/lib/apt/lists/*
4
+
5
+ # apt-get update && apt-get -y install sudo
6
+
7
+ echo "Number of GPUS: $N_GPUS"
8
+ echo "Number of process: $NUM_PROCESSES"
9
+ echo "WORLD_SIZE: $WORLD_SIZE"
10
+ echo "RANK: $RANK"
11
+ echo "MASTER_ADDR: $MASTER_ADDR"
12
+ echo "MASTER_PORT: $MASTER_PORT"
13
+
14
+ # export VLLM_PATH=/cpfs/user/chenhao/vllm
15
+ # export PYTHONPATH=$VLLM_PATH:$PYTHONPATH
16
+
17
+ export RANK=${RANK}
18
+ export MY_RANK=2
19
+ export NUM_PROCESSES=$(expr $RANK \* $MY_RANK)
20
+ echo "MY_RANK: $MY_RANK"
21
+ echo "RANK: $RANK"
22
+ echo "NUM_PROCESSES: $NUM_PROCESSES"
23
+ # export VLLM_USE_V1=0
24
+
25
+ # pip3 install deepspeed==0.16.0
26
+
27
+ # cd /cpfs/user/chenhao/debug/
28
+ # cp nccl.conf /etc/nccl.conf
29
+ # echo "COPY nccl.conf to etc"
30
+ # cp parameter_offload.py /usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/parameter_offload.py
31
+ # echo "COPY parameter_offload to deepspeed"
32
+ # cp partitioned_param_coordinator.py /usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py
33
+ # echo "COPY partitioned_param_coordinator to deepspeed"
34
+
35
+ pip3 install math-verify tabulate markdown pysbd jsonlines coloredlogs func_timeout timeout-decorator word2number Pebble -i https://mirrors.cloud.aliyuncs.com/pypi/simple --trusted-host mirrors.cloud.aliyuncs.com
36
+
37
+ pip3 install loguru fastapi uvicorn httpx python-multipart aiohttp aiolimiter pysbd jsonlines coloredlogs pebble aiolimiter -i https://mirrors.cloud.aliyuncs.com/pypi/simple --trusted-host mirrors.cloud.aliyuncs.com
38
+ pip3 install func_timeout sentencex requests_futures timeout_decorator flashtext pygments -i https://mirrors.cloud.aliyuncs.com/pypi/simple --trusted-host mirrors.cloud.aliyuncs.com
39
+
40
+ pip3 install math-verify loguru fastapi uvicorn httpx python-multipart aiohttp aiolimiter pysbd jsonlines coloredlogs pebble aiolimiter -i https://mirrors.cloud.aliyuncs.com/pypi/simple --trusted-host mirrors.cloud.aliyuncs.com
41
+ pip3 install func_timeout sentencex requests_futures timeout_decorator flashtext pygments -i https://mirrors.cloud.aliyuncs.com/pypi/simple --trusted-host mirrors.cloud.aliyuncs.com
42
+
43
+ # export ROOT_PATH=/cpfs/user/chenhao/outputs/qwen25_7B_reinforce_baseline_zero_tir_fix_boxed_lr1e-6_warmup0.0_kl0.0_zero_tir_0426_nginx_prefetch_fix_env_mask_vllm083_xverify_dapo_async_iternum2/
44
+
45
+ # export ROOT_PATH=/cpfs/user/chenhao/outputs/qwen25_7B_reinforce_baseline_zero_tir_fix_boxed_lr1e-6_warmup0.0_kl0.0_zero_tir_0502_nginx_prefetch_fix_env_mask_vllm083_xverify_orz_async_pipline_iternum2/
46
+
47
+ # export ROOT_PATH=/cpfs/user/chenhao/outputs/qwen25_7B_reinforce_baseline_zero_tir_fix_boxed_lr1e-6_warmup0.0_kl0.0_zero_tir_0504_nginx_prefetch_fix_env_mask_vllm083_xverify_deepmath_async_pipline_iternum2/
48
+
49
+ export ROOT_PATH=/newcpfs/user/chenhao/outputs/qwen25_32B_reinforce_baseline_zero_tir_lr1e-6_warmup0.0_kl0.0_zero_0812_agent_tir_iternum8_queue_size1_rolloutn16_orz_dapo_seqbalance_raw_adamw_before_select_dualclip_lossmask_dynamicbs_globaltoken_correction_latest/
50
+
51
+ # for step in 250 200 150 100 50
52
+ # do
53
+ # cd ${ROOT_PATH}_actor/
54
+ # mkdir ./global_step${step}/ckpt/
55
+ # rm -r ./global_step${step}/ckpt/
56
+ # python /cpfs/user/chenhao/debug/zero_to_fp32.py . ./global_step${step}/ckpt/pytorch_model.bin -t global_step${step}
57
+ # cp -r /cpfs/user/chenhao/pretrained_models/Qwen/Qwen2.5-7B-local/*.json ./global_step${step}/ckpt/pytorch_model.bin/
58
+ # done
59
+
60
+ cd /cpfs/user/chenhao/Qwen2-Math/evaluation
61
+ # export VLLM_ENABLE_V1_MULTIPROCESSING='0'
62
+
63
+ export NGINX_IP_FILE=/cpfs/user/chenhao/hf_datasets/qwen25_qwq/nginx_conf/nginx_ip.txt
64
+ export COMPILE_SERVER_PORT='10003'
65
+ export MATH_VERIFY_SERVER_PORT='10008'
66
+ export XVERIFY_MATH_MODEL_SERVER_PORT='10005'
67
+ export REMOTE_RM_URL='http://10.39.2.54:10007'
68
+ export OPENRLHF_PATH=/cpfs/user/chenhao/debug/OpenRLHF_082/
69
+ export PRETRAIN=/newcpfs/user/chenhao/pretrained_models/Qwen/Qwen2.5-7B-local/
70
+
71
+ export DEBUG_FLAG='yes'
72
+ export CUDA_VISIBLE_DEVICES="0,1"
73
+ export INPUT_KEY='problem,question'
74
+ export ANSWER_KEY='answer,final_answer'
75
+ export DATA_NAME="aime25,aime24,hmmt_feb_2025,hmmt_feb_2024,cmimc"
76
+ export N_SAMPLING=32
77
+ export TEMPERATURE=1.0
78
+ # export VLLM_USE_V1='0'
79
+ export USE_TIR='yes'
80
+ export TASK_MAX_CONCURRENT=32
81
+
82
+ export VLLM_VERSION='vllm085'
83
+ export USE_SEPERATE='no'
84
+ export USE_ID='USE_ID'
85
+
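+ # Sweep checkpoints (step) and tool-call iteration budgets (ENV_ITER_NUM),
+ # running the full evaluation pipeline for each combination.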
86
+ for step in 100 150
87
+ do
88
+ for iter in 1 2 4 8 16 18 20
89
+ do
90
+ export ENV_ITER_NUM=${iter}
91
+ export MODEL_NAME_OR_PATH=${ROOT_PATH}global_step${step}_hf_actor/
92
+ export OUTPUT_DIR=${MODEL_NAME_OR_PATH}/math_eval_useid
93
+ export PROMPT_TYPE='orz_tir'
94
+ export USE_SEPERATE='yes'
95
+ bash run_evaluation.sh
96
+ done
97
+ done