LongCat0830 xiaoyan001 committed
Commit 120cb1f · verified · 1 Parent(s): bf1cf99

Update README.md (#1)


- Update README.md (1bd50f6ee4e0ddd675a53c3a000ecda37268d805)


Co-authored-by: Chen Chen <xiaoyan001@users.noreply.huggingface.co>

Files changed (1)
  1. README.md +170 -29
README.md CHANGED
@@ -27,13 +27,19 @@ pipeline_tag: text-generation

  ## 📖 Introduction

- **UNO-Scorer** is a lightweight yet high-precision general scoring model developed as part of **UNO-Bench**. It is designed to efficiently automate the evaluation of Large Multimodal Models (LMMs) with minimal computational overhead.

  Built upon the powerful **Qwen3-14B** backbone, UNO-Scorer is fine-tuned on 13K high-quality in-house data. It overcomes the limitations of traditional Overall Reward Models (ORMs) by supporting **6 distinct question types**, with particular excellence in **Multi-Step Open-Ended Questions (MO)**.

  ## 📊 Performance

- UNO-Scorer demonstrates superior performance in automated evaluation, particularly in handling complex **Multi-Step Open-Ended Questions**. We compared the accuracy of our scorer against other advanced evaluators:

  | Model | Accuracy |
  | :--- | :--- |
@@ -43,57 +49,180 @@ UNO-Scorer demonstrates superior performance in automated evaluation, particular

  Experiments show that UNO-Scorer surpasses even proprietary frontier models like GPT-4.1 in this specific evaluation domain with lower cost.

-
-
  ## 💻 Usage

- ### 0. Quick Start

  ```bash
  pip install -U transformers
- python3 test_scorer_hf.py --model-name /path/to/your/model
  ```

- We recommend using vLLM for inference as it offers significantly better efficiency compared to the standard HuggingFace approach. Please follow the steps below to set up the environment and run the inference script provided in our official repository [UNO-Bench](https://github.com/meituan-longcat/UNO-Bench).

- ### 1. Clone the Repository
- First, clone the UNO-Bench repository:

- ```bash
- git clone https://github.com/meituan-longcat/UNO-Bench.git
- cd UNO-Bench/uno-eval
  ```

- ### 2. Install Dependencies
- Install the necessary Python libraries:

  ```bash
- pip install -r requirement.txt
  ```

- ### 3. Run Inference
- We provide an example script based on **vLLM** for efficient model inference. You can run the following command to test the scorer:

  ```bash
  bash examples/test_scorer_vllm.sh
  ```

- ### 4. Adapt Your Reference Answer
- The most critical aspect of utilizing the UNO-Scorer lies in the proper formatting of the Reference Answer. Specifically, it is required to:

- 1. Assign point values to the answer components. The total points for the question should typically sum to 10 points.
- 2. You may customize detailed scoring criteria for each reference answer to suit your needs (e.g., clarifying how to judge cases where the final choice is correct but the reasoning is flawed).

- Note: Since the model is primarily trained on Chinese corpora, it adheres more accurately to instructions when these specific descriptions are written in Chinese.

- You can structure the Reference Answer as follows:

- | Question Type | Scenario | **Reference Answer** | Example |
- | :--- | :--- | :--- | :--- |
- | **Single Question** | The model only needs to check if the final result matches. | Format as a single sub-question (Sub-question 1) worth exactly 10 points.<br><br>Template:<br>`小问1:{Answer},总分10分,无需关注推理过程,最终答案正确即可` | **Raw Answer:** "C"<br>**Input Answer:** `小问1:C,总分10分,无需关注推理过程,最终答案正确即可` |
- | **Multiple Question** | The model needs to grade specific checkpoints. | Break down the answer into numbered sub-steps with assigned points (summing to exactly 10).<br><br>Template:<br>`1. {Sub-Answer A} ({X} points); 2. {Sub-Answer B} ({Y} points).` | **Raw Answer:** "5 apples, 6 bananas"<br>**Input Answer:** `1. 5 apples (4 points); 2. 6 bananas (6 points).` |

  ## 📜 Citation

@@ -111,6 +240,18 @@ If you find this model or the UNO-Bench useful for your research, please cite ou
  }
  ```

- ---

- **Disclaimer:** This model is based on Qwen3-14B. Please strictly follow the license and usage policy of the original Qwen model series.


  ## 📖 Introduction

+ **UNO-Scorer** is a lightweight yet high-precision **LLM-based evaluation model** designed to efficiently automate the evaluation of Large Multimodal Models (LMMs) with minimal computational overhead.
+
+ **Core Functionality:**
+ - **Input**: Question + Reference Answer + Model Response
+ - **Processing**: Analyzes correctness by **comparing each sub-question** against the reference answer
+ - **Output**: Numerical score + Detailed evaluation reasoning for each sub-question

  Built upon the powerful **Qwen3-14B** backbone, UNO-Scorer is fine-tuned on 13K high-quality in-house data. It overcomes the limitations of traditional Overall Reward Models (ORMs) by supporting **6 distinct question types**, with particular excellence in **Multi-Step Open-Ended Questions (MO)**.

+
  ## 📊 Performance

+ UNO-Scorer demonstrates superior performance in automated evaluation, particularly in handling complex **Multi-Step Open-Ended Questions**. We compared the accuracy of our scorer against other advanced evaluators on our test set:

  | Model | Accuracy |
  | :--- | :--- |

  Experiments show that UNO-Scorer surpasses even proprietary frontier models like GPT-4.1 in this specific evaluation domain with lower cost.

  ## 💻 Usage

+ ### Quick Start (HuggingFace Transformers)
+
+ Get started with UNO-Scorer in just a few lines of code:

  ```bash
  pip install -U transformers
+ python3 test_scorer_hf.py --model-name /path/to/UNO-Scorer
+ ```
+
+ **Minimal Example:**
+ > ⚠️ **Critical**: The prompt template below is simplified for illustration. **Only the complete prompt template in `test_scorer_hf.py` will properly activate the model's fine-tuned scoring capabilities.** Custom or simplified prompts will not achieve optimal results.
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ import re
+
+ def extract_score(text):
+     matches = re.findall(r'<score>([\d.]+)</score>', text)
+     return float(matches[-1]) if matches else 0.0
+
+ tokenizer = AutoTokenizer.from_pretrained("meituan-longcat/UNO-Scorer-Qwen3-14B")
+ model = AutoModelForCausalLM.from_pretrained(
+     "meituan-longcat/UNO-Scorer-Qwen3-14B",
+     torch_dtype="auto",
+     device_map="auto"
+ )
+
+ # Prepare scoring prompt
+ question = "Which animal appears in the image?"
+ reference = "Sub-question 1: Elephant, total score 10 points"
+ response = "I see an elephant in the image."
+
+ prompt = f"""Please score the model's response based on the reference answer.
+
+ Question: {question}
+ Reference Answer: {reference}
+ Model Response: {response}
+
+ Provide a step-by-step analysis and output the total score in <score></score> tags."""
+
+ # Generate score
+ messages = [{"role": "user", "content": prompt}]
+ text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ inputs = tokenizer([text], return_tensors="pt").to(model.device)
+ outputs = model.generate(**inputs, max_new_tokens=2048)
+ result = tokenizer.decode(outputs[0], skip_special_tokens=True)
+
+ score = extract_score(result)
+ print(f"Score: {score}/10")
  ```

+ ### 🔄 How It Works
+
+ UNO-Scorer evaluates model responses through a structured process:
+
+ 1. **Information Organization**: Extracts question content, reference answer, model response, and scoring criteria
+ 2. **Question Type Classification**: Identifies the question type (multiple-choice, numerical, enumeration, yes/no, short-answer, or essay)
+ 3. **Detailed Comparison**: Compares model response against reference answer using type-specific criteria
+ 4. **Score Extraction**: Outputs final score in `<score>X</score>` format (where X is 0-10)
+
+ ### 📥 Input Format Requirements
+
+ The model expects three key inputs:
+
+ | Component | Description | Example |
+ | :--- | :--- | :--- |
+ | **Question** | The original question posed to the model | "Which animals appear in the image?" |
+ | **Reference Answer** | Ground truth answer with point allocation (sum to 10) | `Sub-question 1: Elephant, total score 10 points` |
+ | **Model Response** | The response from the model being evaluated | "I see an elephant in the image." |
+
+ #### Reference Answer Formatting (Critical!)
+
+ Since the model is trained primarily on Chinese corpora, **formatting reference answers in Chinese yields significantly better results**. However, English formatting is also supported.
+
+ **For Single-Answer Questions:**
+ ```
+ 1. {Answer}, total score 10 points, focus only on final answer correctness
+ 1. {答案},总分10分,无需关注推理过程,最终答案正确即可
+ ```
+
+ **For Multi-Part Questions:**
+ ```
+ 1. {Sub-Answer A} ({X} points); 2. {Sub-Answer B} ({Y} points)
+ 1. {子答案A}({X}分); 2. {子答案B}({Y}分)
+ ```
+
+ **With Custom Scoring Criteria:**
+
+ ```
+ 1. {Answer}, total score 10 points, scoring criteria: {detailed criteria}
+ 1. {答案},总分10分,评分标准:{详细标准}
  ```
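+
+ The helper below is a small illustrative sketch (not part of the official UNO-Bench scripts; the function name is our own) showing how a multi-part reference answer in the above format can be assembled while enforcing the 10-point total:
+
+ ```python
+ # Illustrative only: compose a multi-part reference answer following the
+ # documented English template and check that the points sum to 10.
+ def build_reference_answer(parts):
+     """parts: list of (sub_answer, points) tuples."""
+     total = sum(points for _, points in parts)
+     if total != 10:
+         raise ValueError(f"Points must sum to 10, got {total}")
+     body = " ".join(
+         f"{i}. {answer} ({points} points);"
+         for i, (answer, points) in enumerate(parts, start=1)
+     )
+     return body.rstrip(";") + "."
+
+ # Example: two sub-answers worth 4 and 6 points.
+ print(build_reference_answer([("5 apples", 4), ("6 bananas", 6)]))
+ # -> 1. 5 apples (4 points); 2. 6 bananas (6 points).
+ ```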

+ ### 📤 Output Format
+
+ The model returns:
+
+ - **Detailed Evaluation**: Step-by-step analysis for each sub-question
+ - **Score Tag**: `<score>X</score>` where X ranges from 0 to 10
+
+ Example output:
+ ```
+ Sub-question 1:
+
+ Question Content: How many apples are in the image?
+ Reference Answer: 2
+ Model Response: There are two apples.
+ Points: 10 points
+ Question Type: Numerical
+ Comparison Process: The reference answer is "2" and the model response is "two". The numerical values are completely identical, with only the expression format differing. This meets the scoring standard for numerical questions.
+
+ Scoring Explanation: Completely correct, awarded 10 points.
+
+ <score>10</score>
+ ```
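+
+ If you need the reasoning text as well as the score (for spot checks or error analysis), a minimal post-processing sketch based only on the `<score>X</score>` convention shown above could look like this (the function name is our own, not part of the released scripts):
+
+ ```python
+ import re
+
+ # Minimal sketch: split a scorer output into (reasoning, score). The score is
+ # taken from the last <score>X</score> tag; None means the tag was missing
+ # and the item should be flagged for manual review.
+ def parse_scorer_output(text):
+     matches = list(re.finditer(r"<score>([\d.]+)</score>", text))
+     if not matches:
+         return text.strip(), None
+     last = matches[-1]
+     return text[:last.start()].strip(), float(last.group(1))
+
+ sample = "Scoring Explanation: Completely correct, awarded 10 points.\n\n<score>10</score>"
+ reasoning, score = parse_scorer_output(sample)
+ print(score)      # 10.0
+ print(reasoning)  # Scoring Explanation: Completely correct, awarded 10 points.
+ ```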
+
+ ### 📋 Complete Evaluation Example
+
+ See `test_scorer_hf.py` for a full working example with multiple question types:
+
+ - Multiple-choice questions
+ - Yes/No questions
+ - Open-ended questions
+ - Multi-part questions
+
+ Run the example:
  ```bash
+ python3 test_scorer_hf.py --model-name /path/to/UNO-Scorer
  ```

+ ### 🚀 Optimized Inference with vLLM (Recommended for Production)
+
+ For large-scale evaluation tasks, we strongly recommend using **vLLM** for significant performance improvements:

  ```bash
+ # 1. Clone the repository
+ git clone https://github.com/meituan-longcat/UNO-Bench.git
+ cd UNO-Bench/uno-eval
+
+ # 2. Install dependencies
+ pip install -r requirements.txt
+
+ # 3. Run vLLM-based inference
  bash examples/test_scorer_vllm.sh
  ```
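+
+ If you prefer driving vLLM directly from Python rather than through the shell script, the sketch below shows one way to batch-score several responses with vLLM's offline `LLM` API. It reuses the simplified prompt from the Quick Start, so treat it as an illustration; the official prompt template and pipeline live in the UNO-Bench repository.
+
+ ```python
+ # Illustrative batch scoring with vLLM's offline API (simplified prompt).
+ import re
+ from transformers import AutoTokenizer
+ from vllm import LLM, SamplingParams
+
+ MODEL = "meituan-longcat/UNO-Scorer-Qwen3-14B"
+ tokenizer = AutoTokenizer.from_pretrained(MODEL)
+ llm = LLM(model=MODEL)
+ params = SamplingParams(temperature=0.0, max_tokens=2048)
+
+ items = [
+     {"question": "Which animal appears in the image?",
+      "reference": "Sub-question 1: Elephant, total score 10 points",
+      "response": "I see an elephant in the image."},
+     # ... more items can be scored in the same batch
+ ]
+
+ prompts = []
+ for item in items:
+     content = (
+         "Please score the model's response based on the reference answer.\n\n"
+         f"Question: {item['question']}\n"
+         f"Reference Answer: {item['reference']}\n"
+         f"Model Response: {item['response']}\n\n"
+         "Provide a step-by-step analysis and output the total score in <score></score> tags."
+     )
+     prompts.append(tokenizer.apply_chat_template(
+         [{"role": "user", "content": content}],
+         tokenize=False, add_generation_prompt=True,
+     ))
+
+ # vLLM processes the whole prompt list in one batched call.
+ for item, out in zip(items, llm.generate(prompts, params)):
+     text = out.outputs[0].text
+     scores = re.findall(r"<score>([\d.]+)</score>", text)
+     print(item["question"], "->", float(scores[-1]) if scores else None)
+ ```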

+ **Why vLLM?**
+
+ - **10-20x faster** inference compared to standard HuggingFace Transformers generation
+ - Better batching support for multiple evaluation tasks
+ - Lower memory footprint
+ - Optimized for production deployments

+ ### ⚠️ Important Notes
+
+ 1. **Language**: Chinese formatting in reference answers produces significantly better results due to the model's training data composition
+ 2. **Point Allocation**: Reference answers must have total points equal to 10 (see the sanity-check sketch below)
+ 3. **Score Extraction**: Always look for `<score>X</score>` in the output
+ 4. **Batch Processing**: Use vLLM for evaluating multiple responses efficiently
+ 5. **Question Type Awareness**: Ensure reference answers clearly specify the question type for optimal scoring
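+
+ A small, hypothetical sanity check covering notes 2 and 3 (the helpers are our own and only handle the English reference-answer templates shown earlier):
+
+ ```python
+ import re
+
+ # Check that a reference answer allocates exactly 10 points, either via
+ # per-part "(X points)" allocations or a "total score 10 points" statement.
+ def points_sum(reference_answer):
+     parts = [int(p) for p in re.findall(r"\((\d+)\s*points?\)", reference_answer)]
+     if parts:
+         return sum(parts)
+     total = re.search(r"total score\s*(\d+)\s*points?", reference_answer)
+     return int(total.group(1)) if total else 0
+
+ # Flag items whose reference answer or scorer output violates the notes above.
+ def check_item(reference_answer, scorer_output):
+     issues = []
+     if points_sum(reference_answer) != 10:
+         issues.append("reference answer points do not sum to 10")
+     if not re.search(r"<score>[\d.]+</score>", scorer_output):
+         issues.append("no <score> tag in scorer output")
+     return issues
+
+ print(check_item("1. 5 apples (4 points); 2. 6 bananas (6 points).", "<score>7</score>"))  # []
+ ```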

+ ## 🎯 Supported Question Types

+ UNO-Scorer supports evaluation across 6 distinct question types:

+ | Question Type | Description | Scoring Rule |
+ | :--- | :--- | :--- |
+ | **Multiple-Choice** | Select correct option from given choices | Response must match the correct option exactly |
+ | **Numerical** | Provide specific numerical values | No tolerance for numerical errors |
+ | **Enumeration** | List all required items | Must include all items, no omissions or errors |
+ | **Yes/No** | Binary judgment questions | Response judgment must match reference answer |
+ | **Short-Answer** | Brief factual answers | Semantic equivalence acceptable, expression flexibility allowed |
+ | **Essay** | Longer analytical responses | Must contain core viewpoints from reference answer |
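+
+ For illustration only, reference answers for several of these types could be written as follows using the templates above (the concrete answers are made up):
+
+ ```python
+ # Purely illustrative reference answers; each allocates the required 10 points.
+ reference_examples = {
+     "multiple_choice": "1. C, total score 10 points, focus only on final answer correctness",
+     "numerical":       "1. 42, total score 10 points, focus only on final answer correctness",
+     "yes_no":          "1. Yes, total score 10 points, focus only on final answer correctness",
+     "enumeration":     "1. Apple, banana, pear, total score 10 points, scoring criteria: all three items must be listed",
+     "multi_part":      "1. 5 apples (4 points); 2. 6 bananas (6 points).",
+ }
+ ```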

  ## 📜 Citation

  }
  ```

+ ## ⚖️ License & Disclaimer
+
+ This model is released under the **Apache 2.0 License**. It is based on Qwen3-14B. Please strictly follow the license and usage policy of the original Qwen model series.
+
+ **Disclaimer**: This model is designed for research and evaluation purposes. Users are responsible for ensuring their use complies with applicable laws and regulations.
+
+ ## 🤝 Contributing
+
+ We welcome contributions and feedback! Please feel free to:
+ - Report issues or bugs
+ - Suggest improvements
+ - Share your evaluation results
+ - Contribute enhancements

+ For more information, visit our [GitHub repository](https://github.com/meituan-longcat/UNO-Bench).