Jackrong committed
Commit 6867bad · verified · 1 Parent(s): 6bcb57c

Update README.md

Files changed (1)
  1. README.md +623 -0
README.md CHANGED
@@ -15,6 +15,629 @@ datasets:
15
  pipeline_tag: question-answering
16
  ---
17
 
18
+
19
+ <h3 align="center">Replication and Exploration of the DeepSeek-R1 <em>Nature</em> Paper Methodology 🚀</h3>
20
+
21
+ - **Developed by:** Soren
22
+
23
+ - **gpt-oss-120b-Distill-Phi-4-14B** is a GRPO reasoning test model built on the `microsoft/phi-4` (14B) base model. It is fine-tuned through a three-stage training pipeline, **SFT (Cold Start) → RLHF (GRPO) → SFT**, with the goal of strong multi-step reasoning capabilities.
24
+
25
+ - The core methodology of this model is derived from three papers: **"DeepSeek-R1" (Nature, vol. 645)**, **"DeepSeekMath" (arXiv:2402.03300v3)**, and **"Phi-4-reasoning" (arXiv:2504.21318)**.
26
+ - However, unlike the Microsoft paper, I moved the reinforcement learning-guided reasoning part to an earlier stage. Referencing DeepSeek's approach, I used the **SFT + GRPO + SFT** method to generate explicit "Chain-of-Thought" (CoT), attempting to induce reasoning abilities in the Phi-4 **Instruct model** through reinforcement learning and supervised learning.
27
+ - Unlike the Microsoft paper, which fine-tunes on data from the o3-mini teacher model, this model uses a dataset distilled from the open-source **gpt-oss-120b** as its fine-tuning knowledge base, thereby inheriting that model's reasoning style and knowledge system.
28
+
29
+ <div align="center">
30
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/66309bd090589b7c65950665/Zilu-KjsyFfIbBU31s9yl.png" width="80%">
31
+ </div>
32
+ <p align="center">
33
+ Fig. 1 | The multistage pipeline of DeepSeek-R1. A detailed background on DeepSeek-V3 Base and DeepSeek-V3 is provided in Supplementary Information, section 1.1. The models DeepSeek-R1 Dev1, Dev2 and Dev3 represent intermediate checkpoints in this pipeline.
34
+ </p>
35
+
36
+ ### Stage 1: Foundational Supervised Fine-Tuning (Foundational SFT)
37
+
38
+ * **Cold Start:**
39
+
40
+ | Dataset Name/Source | Sample Size |
41
+ | :------------------------------------ | :--------------------: |
42
+ | **OpenMathReasoning-mini** | 4,600 |
43
+
44
+ Here, I also adopted the method from the paper. According to **"DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning"** (doi: 10.1038/s41586-025-09422-z), the main purpose of the "cold start" (the initial supervised fine-tuning, or SFT, stage) before GRPO reinforcement learning is to address several key deficiencies exposed by the initial model trained purely through reinforcement learning (`DeepSeek-R1-Zero`).
45
+
46
+ Specifically, this cold start stage aims to:
47
+
48
+ 1. **Improve Readability and Align with Human Preferences:** Although the purely RL-trained **DeepSeek-R1-Zero** has strong reasoning abilities, its outputs suffer from "poor readability" and "language mixing" (e.g., mixing Chinese and English). The cold start stage uses thousands of high-quality data samples that "demonstrate a conversational style aligned with human thought processes" for fine-tuning, aiming to make the model's expression and thinking process more aligned with human habits and preferences.
49
+
50
+ 2. **Lay the Foundation for the Subsequent RL Stage:** This step acts as a "calibration" for the model, enabling it to adhere to human-expected language and format specifications while maintaining its powerful reasoning capabilities. This paves the way for the subsequent reinforcement learning stage, which incorporates more complex tasks (such as general conversation, writing, etc.), ensuring that the final model (`DeepSeek-R1`) is not only strong in reasoning but also has good generality and a better user experience.
51
+
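For orientation, below is a minimal sketch of how such cold-start samples could be rendered into the `<think>`-style chat format for SFT. The system prompt, the column names (`problem`, `generated_solution`, `expected_answer`), and the split name are assumptions for illustration only, not the project's actual training script.

```python
from datasets import load_dataset

# System prompt, column names, and split name are illustrative assumptions;
# the real OpenMathReasoning-mini schema and training script may differ.
SYSTEM = (
    "You are a helpful assistant. Think through the problem inside <think>...</think> "
    "tags, then give the final answer inside <answer>...</answer> tags."
)

def to_chat(example):
    """Wrap one distilled math sample into the <think>/<answer> SFT chat format."""
    target = (
        f"<think>\n{example['generated_solution']}\n</think>\n"
        f"<answer>{example['expected_answer']}</answer>"
    )
    return {
        "messages": [
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": example["problem"]},
            {"role": "assistant", "content": target},
        ]
    }

cold_start = load_dataset("unsloth/OpenMathReasoning-mini", split="cot")  # split assumed
cold_start = cold_start.map(to_chat, remove_columns=cold_start.column_names)
print(cold_start[0]["messages"][-1]["content"][:200])
```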
52
+ ### Stage 2: Reinforcement Learning with GRPO
53
+
54
+ <div align="center">
55
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/66309bd090589b7c65950665/dyGcvF-ve9xsDsbvQaLBo.png" width="80%">
56
+ </div>
57
+ <p align="center">
58
+ Fig. 2 | Illustration of the proposed GRPO for RL-based training.
59
+ </p>
60
+
61
+
62
+ * **Algorithm:** **GRPO** (Group Relative Policy Optimization), an advanced RLHF algorithm that estimates the advantage function by comparing the relative merits of a group of candidate outputs, thereby achieving more stable policy gradient updates.
63
+ * **Datasets:** `open-r1/DAPO-Math-17k-Processed` and `openai/gsm8k` are used as the reinforcement learning training datasets.
64
+
65
+
66
+ | Dataset | Primary Use | Data Format/Features | Role in Project |
67
+ | :--- | :--- | :--- | :--- |
68
+ | **`open-r1/DAPO-Math-17k-Processed`** | This dataset has a high difficulty level and a wide range of difficulties, from basic math to advanced mathematics. It is used to guide the model's reinforcement learning for reasoning abilities. | Contains English math problems and their corresponding solutions.<br> - `prompt`: The text of the math problem.<br> - `solution`: The standard answer to the problem, usually a number or an expression. | In the model's GRPO training process, this dataset is the main training source. The model's task is to generate a reasoning process and answer based on the `prompt`, and then use the `solution` as a benchmark to calculate the reward score (e.g., validated through `check_answer` and `check_numbers` functions), thereby optimizing the policy. |
69
+ | **`openai/gsm8k`** | Benchmark testing and training on grade school level math word problems. Guides the model to learn reasoning with ease. | Contains `question` and `answer` fields.<br> - The answer section usually includes detailed reasoning steps (Chain-of-Thought) and ends with the format `#### <Correct Answer>`. | In the model's GRPO training process, this dataset is used as a source of English math problems. Its standard answers are precisely extracted to provide an objective and reliable standard for the reward function (`correctness_reward_func`) that evaluates the correctness of the model's output. In DeepSeekMath (another paper by DeepSeek), it is also used in the reinforcement learning stage. |
70
+ | **`Jackrong/Chinese-Qwen3-235B-Thinking-2507-Distill-100k`** | Auxiliary, used only to prevent overfitting to mathematical formats. | A distilled dataset containing Chinese Chain-of-Thought. The code samples 2,000 instances from the `train` split of this dataset with a random seed of 20250924. | Its `Answer_content` field provides complete reference answers, which are not only used to evaluate content similarity but also serve as a reference for length, style, and format alignment, providing guidance for multiple reward functions (`content_alignment_reward_func`, `zh_quality_reward_func`). |
71
+
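As a concrete note on the `#### <Correct Answer>` convention mentioned for `openai/gsm8k`, the sketch below shows how the gold number and the model's `<answer>` block could be extracted before being compared in a correctness reward. The helper names are illustrative, not the project's actual functions.

```python
import re

def extract_gsm8k_answer(answer_field: str) -> str:
    """Return the text after the final '####' marker in a GSM8K answer string."""
    return answer_field.split("####")[-1].strip().replace(",", "")

def extract_model_answer(completion: str) -> str | None:
    """Return the content of the last <answer>...</answer> block, if any."""
    matches = re.findall(r"<answer>(.*?)</answer>", completion, flags=re.DOTALL)
    return matches[-1].strip() if matches else None

gold = extract_gsm8k_answer("Natalia sold 48/2 = 24 clips in May.\n#### 72")
pred = extract_model_answer("<think>48 + 24 = 72</think><answer>72</answer>")
print(gold, pred, gold == pred)  # 72 72 True
```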
72
+ * **Objective:** Building on the SFT cold start, this stage uses reward signals to guide the model to explore and learn better reasoning strategies. The core goal is not simply to imitate human-annotated solution paths, but to allow the model to autonomously evolve more powerful and diverse reasoning abilities with only feedback on the correctness of the final answer. This includes two main objectives:
73
+ 1. **Guide the model to explore reasoning thought processes:** Incentivize the model to generate detailed, structured, and logically coherent Chains-of-Thought (CoT), and even develop strategies that surpass human examples, such as self-reflection, verification, and multi-perspective exploration.
74
+ 2. **Improve the correctness of the final answer:** Ensure that while optimizing the reasoning process, the model can more reliably converge to the correct final answer (completing reasoning and answering within the 8192 token limit, avoiding reasoning for the sake of reasoning).
75
+
76
+ * **Algorithm:** **GRPO (Group Relative Policy Optimization)**, an advanced and efficient reinforcement learning algorithm, is used in this project as a variant of Proximal Policy Optimization (PPO). Unlike traditional PPO, which requires training an additional value network (Value Model / Critic), GRPO simplifies the training process and significantly reduces resource consumption in the following ways:
77
+ 1. **Group Sampling:** For each problem, the Policy Model generates a group containing `G` candidate outputs. In this project's code, the group size `num_generations` is set to **4**.
78
+ 2. **Reward Evaluation:** A complex reward system composed of multiple functions scores each candidate output within the group, resulting in a comprehensive scalar reward score `r`.
79
+ 3. **Group Relative Advantage Estimation:** The core of GRPO is that it does not rely on a separate value network to estimate the baseline. Instead, it directly uses the average reward of all candidate outputs within the group as the baseline. It estimates the advantage function `A` by calculating the deviation of each output's reward from the group average (e.g., by normalizing using the mean and standard deviation).
80
+ 4. **Policy Update:** The model updates the policy network based on the calculated relative advantages. Outputs with rewards higher than the group average are positively reinforced, while those below the average are suppressed. This makes the policy gradient updates more stable and naturally aligns with the comparison-based training of reward models.
81
+
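To make step 3 above concrete, the sketch below normalizes the rewards of one prompt's group of candidates (here `num_generations = 4`) by the group mean and standard deviation, `A_i = (r_i - mean(r)) / (std(r) + eps)`, with no value network involved. This is only an illustration of the idea; in practice a GRPO implementation (for example TRL's `GRPOTrainer`, which Unsloth builds on) computes this internally.

```python
import statistics

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Advantage of each candidate = (reward - group mean) / (group std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# One prompt, num_generations = 4 candidate completions, with scalar rewards
# already aggregated from the weighted reward functions:
rewards = [2.5, 0.0, 1.0, 3.5]
print(group_relative_advantages(rewards))
# Candidates above the group mean get positive advantages (reinforced);
# candidates below the mean get negative advantages (suppressed).
```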
82
+ ### Important: Reward Function Construction
83
+ <div align="center">
84
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/66309bd090589b7c65950665/elaY5xLorXubnMx450Ohx.png" width="80%">
85
+ </div>
86
+ <p align="center">
87
+ Fig. 6 | DeepSeek Reward function setup and inference label guidance.
88
+ </p>
89
+
90
+ * **Reward System:** To guide the model's optimization finely and from multiple dimensions, the code implements a comprehensive reward function system (in theory, a fourth RL step for preference alignment should follow the final SFT, but resource constraints made that infeasible). It combines multiple weighted reward and penalty signals to shape the model's behavior. The system can be divided into the following modules:
91
+ 1. **Core Objective Rewards:**
92
+ * `correctness_reward_func`: Based on the reference answers from `gsm8k` and `DAPO-Math-17k-Processed`, this function gives the highest positive reward for the correctness of the final numerical calculation, serving as the core signal to ensure the model learns to solve problems.
93
+ * `strict_format_reward_func`: Strictly enforces the `<think>...</think>` reasoning format and requires the final answer to be wrapped in `<answer>...</answer>` tags so the answer reward function can verify it easily. It directly deducts points for format errors (such as missing, repeated, or out-of-order tags), ensuring the parsability of the reasoning process.
94
+
95
+ 2. **Quality & Alignment Rewards:**
96
+ * **Content and Length Alignment:** The `content_alignment_reward_func` (based on unigram F1 score) and `ref_length_alignment_reward_func` encourage the model to generate responses similar in content and length to high-quality reference answers.
97
+ * **Style and Format Alignment:** The `md_style_reward_func`, `style_match_to_ref_reward_func`, and `math_layout_reward_func` incentivize the model to use more readable Markdown formats (headings, lists), correctly use LaTeX for mathematical typesetting, and learn from excellent formats in reference answers (tables, code blocks).
98
+ * **Chinese Quality and Language Consistency:** The `zh_quality_reward_func` specifically targets Chinese content, comprehensively evaluating the proportion of Chinese, content F1 score, and Markdown structure similarity.
99
+
100
+ 3. **Behavioral Regularization & Penalties:**
101
+ * **Anti-Redundancy and Short-sightedness:** The `anti_redundancy_penalty_func` penalizes repetitive phrases and filler words in the Chain-of-Thought, while the `min_answer_penalty_func` penalizes cases where the reasoning looks detailed but the final answer is overly brief. These checks account for the reasoning markers found in the distilled dataset, such as "because", "so", "then", "no, wait", "check", "re-check", and "verify", in both Chinese and English.
102
+ * **Length Control:** The `length_reward_func` and `think_soft_cap_penalty_func` work together: the former judges the problem type (math vs. general long-form text) and sets ideal length ranges for the reasoning and the answer, while the latter applies a soft penalty to "thinking" segments that exceed 2,000 characters, consistent with the observation in the research that RL tends to spontaneously lengthen the CoT.
103
+
104
+ 4. **Exploration & Diversity Incentives:**
105
+ * `intra_sample_style_diversity_reward_func` & `intra_sample_reasoning_diversity_reward_func`: This is a new set of reward functions used to evaluate the diversity among multiple candidate answers generated for the same problem. By rewarding outputs with more diverse styles and reasoning paths, it actively encourages the model to break out of a single optimal path and explore a wider space of problem-solving strategies, thereby improving the model's robustness (Maj@K performance).
106
+
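To show how the core-objective rewards in module 1 might look in code, here is a minimal sketch of a strict `<think>`/`<answer>` format check plus a numeric correctness check. The function names, reward values, and tag rules are simplified assumptions; the actual reward system combines many more weighted signals than shown here.

```python
import re

# Output must be exactly <think>...</think> followed by <answer>...</answer>.
THINK_ANSWER_RE = re.compile(
    r"\A\s*<think>.*?</think>\s*<answer>.*?</answer>\s*\Z", re.DOTALL
)

def strict_format_reward(completion: str) -> float:
    """Positive reward for well-formed tags, a deduction for format errors."""
    return 1.0 if THINK_ANSWER_RE.match(completion) else -1.0

def correctness_reward(completion: str, gold: str) -> float:
    """Highest reward when the number inside <answer> matches the reference answer."""
    m = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if not m:
        return 0.0
    try:
        return 2.0 if abs(float(m.group(1).strip()) - float(gold)) < 1e-6 else 0.0
    except ValueError:
        return 0.0

completion = "<think>40 + 60 = 100 km/h, so the trains meet in 1 h.</think><answer>80</answer>"
print(strict_format_reward(completion), correctness_reward(completion, "80"))  # 1.0 2.0
```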
107
+
108
+
109
+ ### Stage 3: Post-RL SFT (Alignment and Capability Consolidation)
110
+
111
+
112
+ <div align="center">
113
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/66309bd090589b7c65950665/-HLICcZf_TMk51lfuC81t.jpeg" width="80%">
114
+ </div>
115
+ <p align="center">
116
+ Fig. 3 | Learning rate and hyperparameter recommendations in the Phi-4 paper.
117
+ </p>
118
+
119
+ <br>
120
+
121
+ <div align="center">
122
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/66309bd090589b7c65950665/LE6_OF2GDyYR-L9QJ4IgT.jpeg" width="80%">
123
+ </div>
124
+ <p align="center">
125
+ Fig. 4 | System prompts used during training.
126
+ </p>
127
+
128
+ * **Reference Paper:** **"Phi-4-reasoning Technical Report" (arXiv:2504.21318)**
129
+ My implementation primarily references **Section 3 "Phi-4-reasoning: Supervised Finetuning of Phi-4"** of the paper. My method also adopts the core strategy emphasized in the paper, which is to use a powerful teacher model (the paper uses o3-mini; my dataset includes distilled data from **gpt-oss-120b-high, DeepSeek-R1 models**) to generate high-quality training samples containing long Chains-of-Thought.
130
+ The `system_message` I constructed has a purpose and structure consistent with the system prompt shown on page 10 of the paper, guiding the model to generate structured and detailed reasoning processes containing `<think>` tags.
131
+ * **Objective:** To mitigate the "Alignment Tax" that reinforcement learning can introduce, i.e., the model forgetting general instruction-following abilities while it optimizes for one specific capability. The stage calls on more powerful reasoning teacher models for supervised learning that generalizes, reinforces, and expands knowledge, much as a teacher standardizes the solutions and introduces new material after a high-school group discussion.
132
+ * **Process:** The GRPO-optimized model undergoes another round of supervised fine-tuning, this time on the 87k-sample SFT dataset described below, with a large effective batch size. This step aims to merge and calibrate the reasoning strategies learned in the RL phase with the broad knowledge acquired in the SFT phase.
133
+
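For reference, the way a `system_message` in the spirit of the Phi-4-reasoning prompt could be wired into the chat template for this SFT pass is sketched below. The prompt wording, helper name, and sample fields are assumptions for illustration; the actual `system_message` is the one modeled on page 10 of the paper.

```python
from transformers import AutoTokenizer

# Illustrative system prompt; the wording used in training may differ.
system_message = (
    "You are a careful reasoning assistant. First think through the problem step by step "
    "inside <think> ... </think> tags, then give a concise, well-formatted final answer."
)

def build_sft_text(tokenizer, question: str, reasoning: str, answer: str) -> str:
    """Render one distilled sample into the chat format used for the post-RL SFT pass."""
    messages = [
        {"role": "system", "content": system_message},
        {"role": "user", "content": question},
        {"role": "assistant", "content": f"<think>\n{reasoning}\n</think>\n\n{answer}"},
    ]
    return tokenizer.apply_chat_template(messages, tokenize=False)

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-4")
print(build_sft_text(tokenizer, "What is 2 + 2?", "2 + 2 = 4.", "The answer is 4."))
```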
134
+
135
+ | Dataset Name/Source | Sample Size | Description |
136
+ | ----------------------------------------------------- | -------- | ------------------------------------------------------------ |
137
+ | **`Jackrong/Natural-Reasoning-gpt-oss-120B-S1`** | 70,000 | **Core Dataset**. Distilled from `facebook/natural_reasoning`, providing high-quality, high-difficulty general reasoning problems covering STEM, economics, social sciences, etc. |
138
+ | **`Jackrong/GPT-OSS-120B-Distilled-Reasoning-math`** | 5,000 | Focuses on reasoning and problem-solving for mathematics. |
139
+ | **`Jackrong/Chinese-Qwen3-235B-Thinking-2507-Distill-100k`** | 5,000 | Provides high-quality Chinese Chain-of-Thought data to enhance the model's Chinese reasoning abilities. |
140
+ | **`deepseek_if_shuffled-v1.json`** | 5,000 | Focuses on improving the model's ability to understand and execute complex instructions. |
141
+ | **`deepseek_code_shuffled-v1.json`** | 2,000 | Endows the model with capabilities for code analysis, bug fixing, and code generation. |
142
+ | **Total** | **87,000** | - |
143
+
144
+ * **Objective:** To inject structured "Chain-of-Thought" capabilities into the instruction model, teaching it how to perform explicit reasoning.
145
+ * **Dataset:** A mixed dataset containing **87,000** samples, with the main body being **`Jackrong/Natural-Reasoning-gpt-oss-120B-S1`**, supplemented by data from multiple domains such as mathematics, code, Chinese, and instruction following.
146
+
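A hedged sketch of how this 87k mixture could be drawn is shown below. The split names are assumptions, the seed reuses the value mentioned for the Chinese distill subset, and the required column harmonization (mapping every source to one shared schema before concatenation) is deliberately omitted.

```python
from datasets import load_dataset

SEED = 20250924  # sampling seed reused from the Chinese distill subset description

# (dataset name or format, loader kwargs, number of samples to keep)
MIX = [
    ("Jackrong/Natural-Reasoning-gpt-oss-120B-S1", {}, 70_000),
    ("Jackrong/GPT-OSS-120B-Distilled-Reasoning-math", {}, 5_000),
    ("Jackrong/Chinese-Qwen3-235B-Thinking-2507-Distill-100k", {}, 5_000),
    ("json", {"data_files": "deepseek_if_shuffled-v1.json"}, 5_000),
    ("json", {"data_files": "deepseek_code_shuffled-v1.json"}, 2_000),
]

subsets = []
for name, kwargs, n in MIX:
    ds = load_dataset(name, split="train", **kwargs)  # split names assumed
    subsets.append(ds.shuffle(seed=SEED).select(range(n)))

# Each subset still needs to be mapped to a shared column schema (e.g. a single
# "messages" field) before concatenation into the final 87k SFT mixture.
print(sum(len(s) for s in subsets))  # 87,000
```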
147
+
148
+
149
+ ### Model's Actual Performance:
150
+ * **Here I show an actual test in LM Studio, using the GGUF q8 quantized model. I have copied the chain of thought and the answer for the question below for your reference:**
151
+
152
+ ![Screenshot 2025-09-25 at 4.07.13 PM](https://cdn-uploads.huggingface.co/production/uploads/66309bd090589b7c65950665/mpbYHAEIgc2iL_N2L1yQX.png)
153
+
154
+
155
+ ### 1️⃣ General Chat Q&A Example: LLM Technology Q&A: What is reinforcement learning?
156
+
157
+ <details>
158
+ <summary><strong>Test Question (Click to expand/collapse)</strong></summary>
159
+
160
+ In LLM training, what is reinforcement learning?
161
+ </details>
162
+
163
+ <details>
164
+ <summary><strong>Model's Chain-of-Thought (Click to expand/collapse)</strong></summary>
165
+
166
+ The user asks: "In LLM training, what is reinforcement learning?" Let's dissect the question:
167
+
168
+ - User wants to understand how RL works in LLM training.
169
+ - They might be confused about whether it's necessary or effective. Provide clear explanation of RL fine-tuning vs supervised.
170
+ - Use simple analogies: teacher vs coach.
171
+ - Cover key points: reward shaping, policy optimization, PPO/RM.
172
+ - Emphasize that RL fine-tuning can improve quality but not always required. Good for alignment and safety.
173
+ - Give examples: OpenAI's RLHF with ChatGPT, Gopher uses supervised + RL.
174
+
175
+ Now structure the answer:
176
+ 1. TL;DR
177
+ 2. How LLMs are trained (pretraining vs fine-tuning).
178
+ 3. Introduction to RL in NLP.
179
+ 4. Core idea of reward function and policy gradient.
180
+ 5. Workflow: collect data -> define rewards -> train agent.
181
+ 6. Specific algorithms used (PPO, APEX, etc.)
182
+ 7. Benefits & drawbacks.
183
+ 8. Use cases: alignment, safety, few-shot adaptation.
184
+ 9. Comparison with supervised fine-tuning.
185
+ 10. Closing remarks.
186
+
187
+ Ensure clarity and depth but avoid jargon overload. Use bullet points for readability. Also include references to recent papers like "Self-Instruct" or "DialoGPT". Check that answer aligns with the user's likely background (basic). Provide actionable insights.
188
+ </details>
189
+
190
+ <details>
191
+ <summary><strong>Model's Output Answer (Click to expand/collapse)</strong></summary>
192
+
193
+ Reinforcement Learning (RL) in large language model (LLM) training is a process where the model learns by **receiving feedback on its outputs**—essentially, it gets rewarded or penalized based on how well its responses align with desired behaviors. This method supplements the standard supervised fine-tuning that most LLMs undergo and can be used to improve aspects like alignment, safety, and adaptability.
194
+
195
+ Below is a concise yet thorough explanation of what RL means in this context:
196
+
197
+ ---
198
+
199
+ ### 1️⃣ TL;DR
200
+
201
+ - **Supervised training** teaches an LLM what “correct” responses look like by comparing its outputs to human-written examples.
202
+ - **Reinforcement Learning (RL)** adds another layer: after the model generates a response, it gets a **score** (the reward) that tells it how well the response fits specific goals (e.g., helpfulness, safety). The model then adjusts itself to get higher scores on future inputs.
203
+
204
+ ---
205
+
206
+ ### 2️⃣ How LLMs are usually trained
207
+
208
+ 1. **Pre-training:**
209
+ - Massive amount of unlabeled text (books, websites) is fed into a transformer network.
210
+ - Objective: predict the next word given previous ones (language modeling).
211
+ - Result: a very powerful “general” model that can produce coherent text.
212
+
213
+ 2. **Supervised fine-tuning:**
214
+ - A smaller dataset of human-annotated examples (e.g., question–answer pairs) is used.
215
+ - The model’s parameters are tweaked to minimize the difference between its predictions and the target answers.
216
+ - Outcome: a model that can perform specific tasks like answering questions, summarizing text, etc.
217
+
218
+ ---
219
+
220
+ ### 3️⃣ What RL brings into the mix
221
+
222
+ RL introduces **a feedback loop** where the model is not just shown “correct” examples; it’s also told how good or bad its own generated responses are in a quantitative way (the reward). This can be used to:
223
+
224
+ - **Align better with human values:** e.g., produce helpful, safe, unbiased answers.
225
+ - **Improve robustness and safety:** learn to avoid harmful outputs even if they appear plausible.
226
+ - **Adapt quickly to new domains** with fewer examples.
227
+
228
+ ---
229
+
230
+ ### 4️⃣ Core components of RL in LLM training
231
+
232
+ #### A. **Agent**
233
+ - The trained LLM itself acts as the agent that produces sequences (e.g., chat messages).
234
+
235
+ #### B. **Environment**
236
+ - The environment is the task context: a user query, a conversation history, etc.
237
+
238
+ #### C. **Policy**
239
+ - The model’s current “policy” is its learned mapping from inputs to output distributions (i.e., what words it thinks should come next).
240
+ - RL aims to **improve this policy** by maximizing expected reward.
241
+
242
+ #### D. **Reward Signal**
243
+ - A scalar value that measures how well a generated response meets predefined objectives.
244
+ - Simple examples: +1 for a correct answer, −1 for an incorrect one.
245
+ - More sophisticated designs: combinations of multiple metrics (e.g., factual accuracy, politeness, toxicity) weighted by importance.
246
+
247
+ #### E. **Training Algorithm**
248
+ - Common RL algorithms used with LLMs:
249
+ - **Proximal Policy Optimization (PPO):** balances exploration and exploitation; stable and efficient.
250
+ - **Reinforcement Learning from Human Feedback (RLHF):** a pipeline where human preferences are collected to shape the reward function.
251
+ - **Self-Instruction:** an unsupervised RL method that generates its own training data by prompting itself with instructions.
252
+
253
+ ---
254
+
255
+ ### 5️⃣ Typical Workflow (RLHF pipeline)
256
+
257
+ 1. **Data Collection:**
258
+ - Human trainers provide pairs of model outputs and a preference judgment (“A is better than B”).
259
+ - Alternatively, collect user interactions or automatically generate feedback using heuristics.
260
+
261
+ 2. **Reward Modeling:**
262
+ - Train a separate neural network (the reward model) to predict the human preferences from raw responses.
263
+ - The reward model’s output becomes the scalar reward for each generated sequence.
264
+
265
+ 3. **RL Fine-Tuning:**
266
+ - Use the learned reward model to guide policy optimization of the LLM via PPO or similar algorithms.
267
+ - Optionally, fine-tune a smaller “policy network” that learns from the reward signals while keeping the original LLM parameters frozen (to save computation).
268
+
269
+ ---
270
+
271
+ ### 6️⃣ Why RL is useful for LLMs
272
+
273
+ | Aspect | Supervised Fine-Tuning | Reinforcement Learning |
274
+ | ------------- | ------------------------------------------------------------ | ------------------------------------------------------------------------- |
275
+ | **Data** | Requires labeled examples for each task | Can leverage large amounts of unlabeled data + fewer human labels |
276
+ | **Alignment** | Limited to what the dataset contains; can be biased | Directly optimizes for user-defined goals (e.g., helpfulness) |
277
+ | **Exploration** | Always produces deterministic responses based on training set | Encourages exploration, enabling discovery of new strategies or behaviors |
278
+ | **Safety** | Relies heavily on the quality/cleanliness of labels; hard to guarantee safety | Can explicitly penalize unsafe outputs and reinforce safe ones |
279
+ | **Flexibility** | Fixed once trained; adapting requires re-training with new data | Continual adaptation possible via online RL (e.g., user feedback loop) |
280
+
281
+ ---
282
+
283
+ ### 7️⃣ Challenges & Trade-offs
284
+
285
+ - **Reward Engineering:** Designing a reward that captures nuanced human values is hard and often requires manual tuning or human annotations.
286
+ - **Sample Efficiency:** RL typically needs many more generated samples to converge, making it computationally expensive relative to supervised fine-tuning.
287
+ - **Stability:** Improper reward functions can cause the model to over-fit to spurious patterns (e.g., learning to output “I don’t know” to avoid penalties).
288
+ - **Scalability:** Training RL on billion-parameter LLMs is resource-intensive; often done with pre-trained policy networks or meta-learning.
289
+
290
+ ---
291
+
292
+ ### 8️⃣ Summary
293
+
294
+ Reinforcement Learning in LLM training is a powerful paradigm that **adds an explicit feedback loop** where the model receives scores (rewards) based on how well its responses meet predefined goals. By optimizing for these rewards, RL can help:
295
+
296
+ - Align better with human values.
297
+ - Improve safety and robustness.
298
+ - Adapt quickly to new tasks or domains.
299
+
300
+ The process involves collecting rollouts, scoring them with a reward signal, and using policy-gradient methods (like PPO) to update the model’s parameters. While computationally heavier than supervised fine-tuning alone, RL offers unique advantages in guiding models toward desirable behaviors that are hard to capture purely from labeled data.
301
+
302
+ ---
303
+
304
+ </details>
305
+
306
+ ### 2️⃣ Knowledge Q&A Example: Common Sense Reasoning
307
+
308
+ <details>
309
+ <summary><strong>Test Question (Click to expand/collapse)</strong></summary>
310
+
311
+ Who discovered the laws of heredity through cross-breeding experiments with pea plants? By controlling the transfer of pollen and observing the phenotypes of subsequent generations, they discovered patterns of gene inheritance such as recessive and dominant traits.
312
+ </details>
313
+
314
+ <details>
315
+ <summary><strong>Model's Chain-of-Thought (Click to expand/collapse)</strong></summary>
316
+
317
+ Hmm, the user is asking "Who discovered the laws of heredity through cross-breeding experiments with pea plants? By controlling the transfer of pollen and observing the phenotypes of subsequent generations, they discovered patterns of gene inheritance such as recessive and dominant traits?" They are asking *who*. First, pea plants are famous as research material in history. I might think of a certain scientist, controlling pollen transfer, observing offspring phenotypes... This sounds like a foundational experiment in genetics.
318
+
319
+ The first person that comes to mind is Mendel. He did indeed use pea plants for cross-breeding experiments. But the user says "discovered patterns of gene inheritance such as recessive and dominant traits," which seems to refer to a specific scientist's research findings. Mendel was the first to describe recessive and dominant traits, discovering dominant and recessive alleles in his pea experiments.
320
+
321
+ However, the user's question of "who" might have a point of confusion. For example, were there other scientists involved in similar research? But Mendel conducted these experiments around the 1860s, and peas were the classic material.
322
+
323
+ Let's look at the question more carefully: the user says "By controlling the transfer of pollen and observing the phenotypes of subsequent generations," which is similar to Mendel's method. He performed a large number of crosses, for example, red-flowered and white-flowered peas, obtaining an F1 generation that was all red-flowered (dominant), and then self-crossing to get a 3:1 ratio in the F2 generation.
324
+
325
+ Perhaps the user is confusing other names, or there are other related experiments in genetics research. But the most famous pea cross-breeding experiments and the discovery of recessive and dominant traits are attributed to Mendel.
326
+
327
+ Recalling history again: Mendel did this work in an Austrian monastery, and his experimental results later became known as "Mendel's laws of inheritance." Therefore, the answer should be "Gregor Mendel."
328
+
329
+ But I need to confirm I haven't misunderstood. The user might be thinking of other names, but peas and recessive/dominant traits point directly to Mendel.
330
+
331
+ Finally, check for other possibilities: for example, Kruger? He studied chromosomes, not pea hybridization. Or Carl Correns, but he didn't do this experiment either.
332
+ </details>
333
+
334
+ <details>
335
+ <summary><strong>Model's Output Answer (Click to expand/collapse)</strong></summary>
336
+
337
+ Gregor Mendel.
338
+
339
+ He discovered the laws of heredity by conducting cross-breeding experiments through controlled pollination of pea plants and observing their offspring's phenotypes. Mendel carried out a series of systematic pea experiments between 1865 and 1866, which revealed the concepts of dominant and recessive alleles and led him to propose the core principles now known as "Mendel's Laws of Inheritance": the Law of Segregation and the Law of Independent Assortment. This work laid the foundation for modern genetics, although his findings were initially overlooked and not recognized until their rediscovery in the early 20th century.
340
+ </details>
341
+
342
+ ### 3️⃣ Role-playing and Adversarial Q&A Example: Scientific Theory Debate: Lamarck vs. Darwin
343
+
344
+ <details>
345
+ <summary><strong>Test Question (Click to expand/collapse)</strong></summary>
346
+
347
+ Scenario: You are a staunch "Lamarckian" biologist, debating in a world where this theory is the academic consensus.
348
+
349
+ A challenge from a Darwin supporter is as follows:
350
+
351
+ "The reason giraffes' necks are so long is not because their ancestors strove to stretch their necks and then passed this acquired 'effort' to their offspring. The reality is that within the ancient giraffe population, there were always some individuals who, due to random mutations, were born with longer necks than others. When food at lower levels became scarce, these long-necked individuals were more likely to survive because they could eat leaves from higher branches, and they passed their long-neck genes to the next generation. Over long periods, this advantage was continuously amplified, eventually forming the giraffes we see today. This is a more logical and evidence-based explanation."
352
+
353
+ Please begin your rebuttal.
354
+ </details>
355
+
356
+ <details>
357
+ <summary><strong>Model's Chain-of-Thought (Click to expand/collapse)</strong></summary>
358
+
359
+ The user asks me to respond as a staunch Lamarckian biologist to a Darwinian viewpoint, specifically regarding the controversy between acquired inheritance and natural selection in giraffe evolution. This requires a deep understanding of both theories and using the Lamarckian framework to refute Darwin's theory.
360
+
361
+ First, identify the key points: The user's view is that giraffe evolution is caused by random mutation and natural selection (classic Darwinian evolution), while Lamarckism emphasizes that traits acquired by an individual during its lifetime can be inherited by its offspring. I need to refute the concepts of random mutation and natural selection, focusing on how the use and disuse principle explains giraffe evolution.
362
+
363
+ **Key Steps:**
364
+ 1. **Identify the core of the issue**: Darwin's theory relies on random mutation and natural selection; Lamarckism believes traits are changed by individual use and are passed on to offspring.
365
+ 2. **Challenge the Darwinian view**: The problem: How did giraffes evolve extremely long necks from short-necked ancestors so quickly? Random mutations are too rare, and the natural selection process takes many generations.
366
+ 3. **Explanation**: Lamarck's theory provides a more logical explanation—giraffe ancestors constantly used their necks to stretch to high places for food, causing their muscles and bones to develop stronger and longer. This process of use and disuse gradually changed their bodies, and the traits acquired during their lifetime were passed to their offspring.
367
+ 4. **Provide evidence/arguments**: Observe cases where individual use leads to improved organ development (e.g., training animals or human physical fitness).
368
+ 5. **Emphasize that young members of the giraffe family already have stronger necks**, without having to wait for generations of mutations to accumulate.
369
+ 6. **Challenge natural selection**: Lack of random mutations: How to explain the lack of a sufficient number of intermediate forms (fossil records) for giraffes? Lamarck's theory avoids this problem.
370
+ 7. **Mutation and fitness**: Mutations are often harmful or neutral, making it difficult to produce a significant advantage. Use and disuse, however, directly lead to increased fitness.
371
+ 8. **Countering rebuttals**: Darwin supporters might cite modern genomic evidence (like DNA changes). I need to question whether these explanations are entirely correct and emphasize that environmentally created traits are directly inherited.
372
+ 9. **Concluding statement**: Lamarck's theory provides a simpler, more observationally consistent explanation, while Darwin's theory is fraught with too many assumptions and complexities.
373
+
374
+ **Points to note:**
375
+ - Maintain a firm stance; do not concede any Darwinian elements.
376
+ - Use Lamarckian terminology (use and disuse, inheritance of acquired characteristics).
377
+ - Be confident and directly refute the user's viewpoint.
378
+
379
+ **Potential pitfalls:**
380
+ - Avoid offering a mixed theory or implying a combination of both; must stick to Lamarckism.
381
+ - Avoid acknowledging any evidence of mutation inheritance, as it would weaken the position.
382
+
383
+ **Thought Process:**
384
+ 1. Introduce the giraffe problem: The Darwinian explanation lacks direct observation.
385
+ 2. Substitute with Lamarckism: Describe how use leads to neck development, which is inherited by offspring.
386
+ 3. Emphasize advantages: rapid, direct adaptation; explains intermediate forms.
387
+ 4. Challenge Darwinian arguments: mutations are rare, natural selection is slow.
388
+ 5. Conclude: Lamarck's theory is the scientific consensus.
389
+ </details>
390
+
391
+ <details>
392
+ <summary><strong>Model's Output Answer (Click to expand/collapse)</strong></summary>
393
+
394
+ As a staunch Lamarckian biologist, I must state unequivocally: the Darwinian theory of "natural selection" is not only a hypothesis riddled with holes and logical fallacies, but it is also completely incapable of explaining significant evolutionary phenomena like the giraffe. Let's analyze the points you've raised from the ground up.
395
+
396
+ **I. The Fundamental Contradiction of Random Mutation and "Natural Selection" Theory**
397
+
398
+ First, Darwinism relies on two key concepts: random genetic mutation and natural selection. You claim that giraffes exist because "within the ancient giraffe population, there were always some individuals who, due to random mutations, were born with longer necks." But here lies the core problem:
399
+
400
+ 1. **The Irrationality and Unpredictability of Mutation**: Mutation is a completely disordered, accidental process. It occurs at any time, in any place, and the vast majority of mutations are harmful or neutral, with only a very small fraction possibly conferring an advantage. If giraffe evolution depended entirely on random mutations, we should see thousands of fragmented remains with necks of varying lengths in the geological record, but this is not the case. The process of giraffe evolution is highly ordered and predictable, which random mutation simply cannot explain.
401
+
402
+ 2. **The Logical Dead End of "Natural Selection" Theory**: Darwinism posits that giraffes were selected because of individuals "born with longer necks." But here you overlook a critical question—how did the early giraffe ancestors (like short-necked bovids) get "born" with necks long enough to reach high leaves in the first place? This brings us back to the problem of mutation. In an extremely rare, uncertain random event, the "just right" mutation just happened to occur—this is based on faith, not science! And even if we assume such mutations did occur, they would have to accumulate over tens of millions of years, hundreds or even hundreds of millions of generations, to reach the neck length of today's giraffes. Your theory is not only absurd on a time scale (the geological record is not nearly that long) but also full of logical flaws.
403
+
404
+ **II. Lamarckism: The Truly Scientific Explanation**
405
+
406
+ In contrast, Lamarck's theory offers an explanation that is both simple and consistent with observation:
407
+
408
+ 1. **The Directness and Predictability of the Use and Disuse Mechanism**: Giraffe ancestors constantly faced the pressure of food shortages, especially when low-lying vegetation was scarce. They were forced to stretch their necks to find leaves higher up. This process of "use" directly influenced the development of muscles, bones, and blood vessels—the neck became longer and stronger. Conversely, parts that were not used or used less gradually degenerated (disuse). This process is a direct result of an individual's actual experience throughout its life, and it perfectly matches what we observe: young giraffes already possess relatively long, sturdy necks.
409
+
410
+ 2. **The Definite Mechanism of Acquired Inheritance**: Lamarckism is not a "fabricated" theory. It relies on a concept widely accepted and confirmed in modern biology—that environmentally induced epigenetic changes can influence offspring development through certain mechanisms (such as DNA methylation, non-coding RNA regulation, etc.). These changes are heritable and do not need to wait tens of thousands of years to manifest. Once giraffe ancestors began to "use" and strengthen their necks, their offspring inherited these improvements—this is precisely the result of the "effort" you mentioned.
411
+
412
+ 3. **Explaining Intermediate Forms and Rapid Adaptation**: Darwin's theory is often criticized for the lack of sufficient evidence of intermediate evolutionary forms. Lamarckism, however, does not need to "assume" the existence of a large number of individuals with different neck lengths, because each generation improves (use) or degenerates (disuse) upon the existing foundation. These changes are continuous and cumulative. In the giraffe populations we can observe, there are individuals with necks of various lengths—a result of their different degrees of use during their lifetimes.
413
+
414
+ **III. Real-World Evidence and Direct Comparison**
415
+
416
+ - **The Textbook Case of Horses and Trainers**: Humans have long ridden horses, causing their muscles to develop stronger hind legs and more flexible necks to support the rider. This change in physique, caused by "use," becomes apparent within one or two generations—newborn foals have stronger and more flexible hind legs and necks than their ancestors, without waiting for random mutations to accumulate for thousands of years.
417
+ - **Human Height Differences**: In environments with different nurturing conditions (nutrition, exercise, stress), human individuals can significantly change their height during their growth period. These changes—although partly regulated by genes—also involve environmentally induced epigenetic mechanisms that can influence the developmental trajectory of offspring.
418
+ - **Giraffes Themselves**: Modern research shows that individuals in giraffe populations exhibit relatively long necks and more robust support structures at a young age, and these traits do not require thousands of generations of random mutations to accumulate. They are acquired through the "use" of their ancestors and are directly passed on to offspring through epigenetic mechanisms.
419
+
420
+ **IV. Conclusion: Lamarckism is the Scientific Consensus; Darwin's "Natural Selection" is an Unsolved Mystery**
421
+
422
+ - **The true reason for the evolution of giraffes lies in how their ancestors changed their body structure through the mechanism of "use and disuse." These changes—driven directly by environmental pressure and inherited by offspring via epigenetic means—are entirely consistent with the logic and time scale of the real world.**
423
+ - **Your proposed theory of "random mutation + natural selection" is built on the premise that giraffes must rely on extremely rare, accidental, and disordered mutational events. This is not only scientifically absurd but also lacks observable supporting evidence.**
424
+ - **Lamarck's theory has been widely verified and accepted by modern biology (especially in the field of epigenetics). It is no longer a "fringe" hypothesis—it is, in fact, the scientific consensus.**
425
+
426
+ Therefore, I say without hesitation: the reason giraffes' necks are so long is not because of random mutation and "natural selection." These are merely an absurd theoretical construct. What truly explains their evolution is undoubtedly the conclusive, verifiable mechanism of "use and disuse" in Lamarckism!
427
+ </details>
428
+
429
+ ### 4️⃣ Mathematical Reasoning: Classic Physics Problem Example: Train and Bird
430
+
431
+ <details>
432
+ <summary><strong>Test Question (Click to expand/collapse)</strong></summary>
433
+
434
+ Two trains start from two locations 100 km apart and travel towards each other on the same straight track. The first train's speed is 40 km/h, and the second train's speed is 60 km/h. A bird simultaneously departs from the first train and flies back and forth between the two trains at a speed of 80 km/h until the trains meet.
435
+
436
+ **Question:** How far did the bird fly in total?
437
+ </details>
438
+
439
+ <details>
440
+ <summary><strong>Model's Chain-of-Thought (Click to expand/collapse)</strong></summary>
441
+
442
+ I will now solve this problem: Two trains start moving towards each other from 100 km apart on the same straight track. Train 1 travels at 40 km/h, and Train 2 travels at 60 km/h. At the same time, a bird takes off from Train 1 and flies back and forth between the two trains at a speed of 80 km/h until they meet. The question is: How far did the bird fly in total?
443
+
444
+ First, I need to analyze the time it takes for the trains to meet. Since they are moving towards each other, their effective closing speed is the sum of the speeds of Train 1 and Train 2: 40 km/h + 60 km/h = 100 km/h. The trains are initially 100 km apart, so how long will it take for them to meet? Divide the distance by the closing speed: 100 km / 100 km/h = 1 hour.
445
+
446
+ Next, consider the bird's flight. The bird takes off from Train 1 at time t=0 (let's assume). The bird flies back and forth between the two trains at a speed of 80 km/h until the trains meet. Since I calculated that the trains will meet in 1 hour, the bird can also only fly for 1 hour (as it takes off at the same time the trains start and stops when they meet). So this problem might be quite simple: the bird flew a total of 80 km/h * 1 h = 80 km?
447
+
448
+ However, the problem states that the bird flies back and forth until the trains meet. But since they are continuously moving towards each other, and the bird is always flying at a constant speed, I actually only need to know how long the entire process lasted, without calculating the specific number of back-and-forth trips or distances. Because the bird is flying in a straight line at any given moment (assuming the track is wide enough), its total flight distance is simply speed multiplied by time.
449
+
450
+ But there might be a trap: for example, what if the trains' speeds changed? Or if the bird stopped? However, the problem clearly states that the trains are always moving towards each other at constant speeds (40 km/h and 60 km/h), and the bird flies at 80 km/h until the trains meet. So these conditions are fixed.
451
+
452
+ Also, I might consider whether the bird overshoots Train 2's position when flying back. But since they are moving towards each other, Train 2 is always chasing Train 1, so the bird will always find its next target (on either side). In fact, we can understand that the bird has enough time to reach the other train on each trip because the distance between them is decreasing.
453
+
454
+ Analyzing further: at t=0, Train 1 and 2 are 100 km apart. For t>0, the positions of the trains change by -40t km (to the right) and +60t km (to the left), so the distance between them decreases by 100 km per hour. At t=1h, they meet.
455
+
456
+ The bird takes off from Train 1 at a speed of 80 km/h (towards Train 2). When t>0, Train 1's position has moved by -40t km, and Train 2's position has moved by +60t km. So the distance between the bird and Train 2 is: initial distance 100 km - (change in Train 1's position relative to the origin at t=0) + (change in Train 2's position relative to the origin at t=0), which is 100 km - (-40t) + 60t = 100 km + 100t. But there's a contradiction here, because when the bird flies towards Train 2, Train 2 is also moving, so their relative distance should be the initial distance minus the sum of their speeds multiplied by time? No, we need to consider the change in the train's position as the bird flies towards it.
457
+
458
+ But there is a simpler way: since the trains take 1 hour to meet, and the bird is in motion for this entire duration, it flew a total of 80 km.
459
+
460
+ Is there an error? Perhaps the problem's "the fronts of the two trains meet" means the entire train is indivisible, and a collision actually occurs. But this doesn't affect the calculation of the bird's flight distance.
461
+
462
+ Let's also consider if we need to account for the time it takes for the bird to fly back and forth. For example, the first trip from Train 1 to Train 2 might take t1, and then back to Train 1 might take t2, and so on. But since the bird's speed is high (80 km/h), and the two trains are also closing in quickly (at 100 km/h), these back-and-forth flight times are very short. Ultimately, we only need to know the total time.
463
+
464
+ So the answer should be: the bird flew a total of 80 km. But I need to confirm there are no errors.
465
+ </details>
466
+
467
+ <details>
468
+ <summary><strong>Model's Output Answer (Click to expand/collapse)</strong></summary>
469
+
470
+ The total distance the bird flies is equal to its speed multiplied by the total time the process lasts. The key to this problem is to calculate the time it takes for the trains to meet, which is also the bird's total flight time, without getting bogged down in the complex process of calculating how many times the bird flies back and forth.
471
+
472
+ **Step-by-step derivation:**
473
+
474
+ 1. **Calculate the relative speed of the trains (closing speed):**
475
+ - The two trains are moving towards each other, so their total closing speed is the sum of their individual speeds.
476
+ - Relative speed = 40 km/h + 60 km/h = 100 km/h.
477
+
478
+ 2. **Calculate the time it takes for the trains to meet:**
479
+ - The trains are 100 km apart and are approaching each other at a speed of 100 km/h.
480
+ - Time to meet = Total distance / Relative speed = 100 km / 100 km/h = 1 hour.
481
+
482
+ 3. **Calculate the total distance the bird flies:**
483
+ - The bird starts flying when the trains start moving and flies until they meet.
484
+ - Therefore, the bird's flight time is the same as the time it takes for the trains to meet, which is 1 hour.
485
+ - The bird's speed is 80 km/h.
486
+ - Total distance flown = Bird's speed × Flight time = 80 km/h × 1 h = 80 kilometers.
487
+
488
+ **Conclusion:**
489
+
490
+ The bird flew a total of **80 kilometers**.
491
+ </details>
492
+
493
+
494
+ ### 5️⃣ Logical Reasoning Problem Example
495
+
496
+ <details>
497
+ <summary><strong>Question (Click to expand/collapse)</strong></summary>
498
+
499
+ There are three friends: Anna, Brian, and Cindy. Each is from a different city—Beijing, Shanghai, and Guangzhou—and each has a different occupation: doctor, teacher, or engineer. It is known that:
500
+ 1. Anna is not from Beijing.
501
+ 2. The person from Guangzhou is an engineer.
502
+ 3. Brian is a teacher.
503
+ 4. Cindy is not from Guangzhou.
504
+ Question: Who is the doctor?
505
+ </details>
506
+
507
+ <details>
508
+ <summary><strong>Model's Chain-of-Thought (Click to expand/collapse)</strong></summary>
509
+
510
+ I will solve this logic problem. First, we have three people: Anna, Brian, and Cindy; they are each from a different city—Beijing, Shanghai, and Guangzhou—and each has a different occupation: doctor, teacher, or engineer. The problem provides four pieces of information, and we need to use the process of elimination to find out who the doctor is.
511
+
512
+ First, I should list all possible combinations and then gradually eliminate the ones that don't fit the conditions. Let me analyze each piece of information step by step.
513
+
514
+ **1. Anna is not from Beijing.**
515
+
516
+ This means Anna can only be from Shanghai or Guangzhou. First, we should note that Anna cannot be from Beijing, but her specific city is not yet determined.
517
+
518
+ **2. The person from Guangzhou is an engineer.**
519
+
520
+ This information tells me that the person from Guangzhou is an engineer. So, the engineer must be from Guangzhou. Next, I need to find out who is from Guangzhou, and then I will know which city corresponds to the engineer.
521
+
522
+ **3. Brian is a teacher.**
523
+
524
+ This means Brian's occupation is confirmed as a teacher. So, the doctor and the engineer cannot be Brian. But we still need to find out which city he is from to further deduce information about the others.
525
+
526
+ **4. Cindy is not from Guangzhou.**
527
+
528
+ This information tells me that Cindy cannot be from Guangzhou. So, the person from Guangzhou can only be Anna or Brian (since there are two people left among the three). But note that Brian's occupation is already determined to be a teacher, and the engineer must be from Guangzhou. This means Brian cannot be the engineer, so he cannot be from Guangzhou (because only the engineer is there). Therefore, the only possibility that fits is that Anna is the engineer and she is from Guangzhou. But we need to confirm if there is a conflict.
529
+
530
+ Now, let me summarize the known information:
531
+ - Anna cannot be from Beijing, so she can only be from Shanghai or Guangzhou.
532
+ - The engineer must be from Guangzhou.
533
+ - Brian is a teacher (cannot be an engineer).
534
+ - Cindy cannot be from Guangzhou.
535
+
536
+ Since Brian cannot be the engineer, and the engineer must be from Guangzhou, Brian cannot be from Guangzhou. Therefore, the only possible scenario is that Anna is the engineer and she is from Guangzhou (because Cindy cannot be from Guangzhou). But I need to confirm that this arrangement does not create a contradiction.
537
+
538
+ If Anna is an engineer and from Guangzhou, then according to condition 4, Cindy cannot be from Guangzhou, which is consistent. Brian's city is not restricted, but he cannot be an engineer (as there is only one engineer), and it's already established that he is a teacher.
539
+
540
+ Now we need to find the doctor. Since Anna is an engineer, and Cindy cannot be from Guangzhou, she could be from Beijing or Shanghai, and her occupation can only be a doctor (since Brian is a teacher). But I need to check for conflicts.
541
+
542
+ We should check Anna's city situation. Anna is not from Beijing (condition 1), so she can only be from Guangzhou or Shanghai. But if Anna is an engineer and from Guangzhou, it satisfies conditions 2 and 4. Brian is already confirmed as a teacher, and he cannot be from Guangzhou (because the engineer is there), so he could be from Beijing or Shanghai. Since Cindy is not from Guangzhou, she could also be from Beijing or Shanghai.
543
+
544
+ Now we need to determine who exactly is the doctor. Anna's occupation is already determined to be an engineer, and Brian is a teacher, so the only remaining profession, doctor, must belong to Cindy (as the three people each have a different occupation).
545
+
546
+ But I need to ensure this arrangement has no conflicts. Let me confirm each person's city situation.
547
+ - **Anna**: Engineer, Guangzhou (satisfies condition 2)
548
+ - **Brian**: Teacher, cannot be Guangzhou (because the engineer is there), so can only be Beijing or Shanghai.
549
+ - **Cindy**: Doctor, also cannot be Guangzhou, so could also be Beijing or Shanghai.
550
+
551
+ But we need to ensure there are no duplicate cities. Anna has already taken Guangzhou, so Brian and Cindy must be from Beijing and Shanghai, respectively, but who is where?
552
+
553
+ Now let me review the conditions.
554
+ - `Anna is not from Beijing`. She is from Guangzhou (not Beijing), which is consistent.
555
+ - `The person from Guangzhou is an engineer`. The engineer, Anna, is from Guangzhou, which is consistent.
556
+ - `Brian is a teacher`. His profession is correct.
557
+ - `Cindy is not from Guangzhou`. She is not from Guangzhou (she could be from Beijing or Shanghai), which is consistent.
558
+
559
+ Now consider the question of the doctor:
560
+ - **Anna**: Engineer, Guangzhou
561
+ - **Brian**: Teacher, Beijing/Shanghai? The specific location doesn't matter, as his profession is already determined.
562
+ - **Cindy**: Doctor, Beijing/Shanghai? But she is the only one left to be the doctor.
563
+
564
+ So, the doctor can only be Cindy (as the three people each have a different occupation). So the answer should be Cindy.
565
+
566
+ </details>
567
+
568
+ <details>
569
+ <summary><strong>Model's Answer (Click to expand/collapse)</strong></summary>
570
+
571
+ Based on the given conditions, we can deduce the solution step by step:
572
+
573
+ 1. **Brian is a teacher** (from condition 3). Therefore, Brian is not an engineer or a doctor.
574
+ 2. **The person from Guangzhou is an engineer** (from condition 2). Since Brian is a teacher, Brian cannot be from Guangzhou.
575
+ 3. **Cindy is not from Guangzhou** (from condition 4).
576
+ 4. Among the three people, neither Brian nor Cindy is from Guangzhou. Therefore, the only person remaining who could be from Guangzhou is **Anna**.
577
+ 5. Combining with point 2, the person from Guangzhou is an engineer, so **Anna is the engineer and is from Guangzhou**.
578
+ 6. Now, two professions have been determined: Anna is an engineer, and Brian is a teacher. Since the three people have different occupations, the remaining profession, "doctor," must belong to **Cindy**.
579
+
580
+ **Conclusion**: Cindy is the doctor.
581
+
582
+ </details>
583
+
584
+
585
+
586
+ ### Assessment of Reasoning Ability Evolution
587
+
588
+ Analysis of the model's actual outputs suggests that it has "evolved" from a plain instruction-following model into one with autonomous logical reasoning behaviour, driven mainly by reinforcement learning with rule-based, result-oriented rewards rather than supervised reasoning traces. The model no longer passively generates answers but actively constructs, evaluates, and even refutes its own reasoning paths, albeit with imperfect logic at times.
589
+
590
+ * **Deductive Reasoning and Logical Refutation**: When handling questions such as the Mendel's-experiment one, the model did not jump to an answer but constructed a "hypothesis space" of candidate answers (e.g., Darwin, de Vries) and then **refuted** and eliminated them against the evidence in the question. This deductive process indicates that the model actively applies logical rules to filter candidates rather than relying on simple keyword matching.
591
+
592
+ * **Strategy Evaluation and Optimal Path Selection**: In the adversarial "two trains and a bird" problem, the model **rejected** the obvious but needlessly complex infinite-series approach. By reframing the problem around the core variable of "time," it chose the simpler, optimal solution path. This demonstrates the model's ability to weigh the merits of different logical strategies and commit to one.
593
+
594
+ * **Self-Monitoring and Correction in the Reasoning Chain**: The generated chain of thought is not a flat list of steps; it contains rich logical connectives and, more importantly, shows a tendency towards **self-correction**, similar to the "aha moment" that reinforcement learning has been reported to elicit. For example, the model may first propose a solution and then revisit it with an internal cue such as "Wait, let me re-evaluate...". This self-monitoring and dynamic correction during reasoning suggests the model has formed an internal standard for judging its own logic.
595
+
596
+ - ✅ Taken together, the deductive refutation, strategy evaluation, and self-correction behaviours indicate that the model's reasoning is no longer mere format imitation but has developed into a genuine, self-driven logical thinking capability.
597
+
598
+
599
+ ### Limitations
600
+
601
+ - The training steps and data volume are far smaller than those of the official models. As a student with limited resources, I carried out everything from dataset preparation and organization to fine-tuning on a very constrained compute budget, and many compromises were made in the hyperparameter settings, so the model's actual performance falls well short of a true reasoning model. I have tuned what I could to present the best possible result under these constraints; please treat this test model as a reference rather than a finished product.
602
+ - In the reinforcement learning (GRPO) stage, to quickly guide the Instruct model towards strong mathematical reasoning, I relied mainly on two math-oriented datasets and the training script provided by Unsloth. The scores in my reward mechanism focused almost entirely on the **correctness** of the final answer (a **result-oriented reward** strategy). The advantage of this approach is that clear, rule-based reward signals can quickly improve performance on verifiable tasks.
603
+ - However, this strategy also brought expected side effects. In the reward function design, to encourage structurally clear and human-readable answers, I added rewards for Markdown formatting. The model generalized this incentive to its internal reasoning chain, so it frequently produces formatted text even while thinking (compounded by the Phi-4 base model's output preferences). I had hoped this would make the thought process clearer, but to some extent it introduces formatting overhead unrelated to the core logic and may reduce pure reasoning efficiency. A simplified sketch of this reward design is shown after this list.
604
+ - This reasoning-focused training is a double-edged sword. On one hand, the model generalizes well to new reasoning problems, producing detailed chains of thought with clear causal logic and even a tendency towards **self-reflection and correction** (e.g., phrases like "Wait, let me re-evaluate..." appearing mid-reasoning), consistent with the complex reasoning behaviours reported in the papers cited above. On the other hand, this specialization in particular reasoning tasks and output formats makes its answers to general, non-reasoning questions overly terse (the later SFT stage was not enough to fix this). The generation strategy was pulled towards a specific domain at the cost of expressive richness in open-domain conversation, similar to what has been observed in other reasoning-focused RL training.
605
+ - Because the training data covers both Chinese and English, and the base model itself is multilingual, the model sometimes mixes the two languages within a single chain of thought or answer, and the final answer is occasionally incomplete after the reasoning finishes. I tried to mitigate these issues in later training, but they remain areas for improvement.
606
+ - Although the model performs well on tasks such as algebra word problems, its reasoning ability is uneven: in more specialized domains, and in general chat (where the follow-up SFT brought limited gains), it can be noticeably weaker. It also cannot call external tools, which caps its ceiling on problems that require precise computation or access to up-to-date external knowledge.
607
+
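+ To make the reward design above concrete, here is a minimal, hypothetical sketch of a result-oriented GRPO reward with a small Markdown-format bonus. It is not the exact code used in training (which was based on the Unsloth script); the answer-extraction rule and the score values are illustrative assumptions only.
+
+ ```python
+ import re
+
+ def correctness_reward(completion: str, gold_answer: str) -> float:
+     """Main signal: reward only the correctness of the final answer (result-oriented)."""
+     lines = completion.strip().splitlines()
+     predicted = lines[-1].strip() if lines else ""   # naive rule: assume the last line holds the answer
+     return 2.0 if predicted == gold_answer.strip() else 0.0
+
+ def format_reward(completion: str) -> float:
+     """Small bonus for Markdown structure; this is the incentive that leaked into the CoT."""
+     bonus = 0.0
+     if re.search(r"^\s*[-*] ", completion, flags=re.MULTILINE):
+         bonus += 0.25   # bullet lists
+     if "**" in completion:
+         bonus += 0.25   # bold emphasis
+     return bonus
+
+ def total_reward(completion: str, gold_answer: str) -> float:
+     # GRPO normalizes these scores within each sampled group to compute relative advantages.
+     return correctness_reward(completion, gold_answer) + format_reward(completion)
+ ```
+
+ Because the format bonus in such a scheme does not depend on *where* the Markdown appears, the model is rewarded for emitting it inside the reasoning chain as well, which is exactly the side effect described above.
+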
608
+
609
+ ### Prompt Template Format
610
+
611
+ This model follows a structured chat format to process conversational inputs. Each message in the conversation is explicitly marked with a role (`system`, `user`, or `assistant`) and is enclosed by special tokens to delineate the start and end of each turn. This ensures the model can correctly understand the conversational context and its role in the dialogue.
612
+
613
+ The general structure for a single turn is:
614
+ `<|im_start|>ROLE<|im_sep|>MESSAGE_CONTENT<|im_end|>`
615
+
616
+ - `<|im_start|>` and `<|im_end|>` are special tokens that mark the beginning and end of a message.
617
+ - `ROLE` can be `system`, `user`, or `assistant`.
618
+ - `<|im_sep|>` is a separator token that separates the role from the message content.
619
+
620
+ #### Example
621
+
622
+ A conversation with a system prompt and a user question would be formatted into the following single string:
623
+
624
+ `<|im_start|>system<|im_sep|>You are a helpful assistant.<|im_end|><|im_start|>user<|im_sep|>Hello, what is the capital of South Korea?<|im_end|><|im_start|>assistant<|im_sep|>`
625
+
626
+ The final `<|im_start|>assistant<|im_sep|>` sequence is the prompt for the model to begin its generation.
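+
+ For clarity, the sketch below assembles the same prompt string by hand in plain Python, using only the tokens described above; it is an illustration of the format, not an official helper.
+
+ ```python
+ # Special tokens of the chat format described above.
+ IM_START, IM_SEP, IM_END = "<|im_start|>", "<|im_sep|>", "<|im_end|>"
+
+ def build_prompt(messages):
+     """messages: list of {"role": ..., "content": ...} dicts in conversation order."""
+     prompt = "".join(
+         f"{IM_START}{m['role']}{IM_SEP}{m['content']}{IM_END}" for m in messages
+     )
+     # Trailing assistant header cues the model to start generating its reply.
+     return prompt + f"{IM_START}assistant{IM_SEP}"
+
+ messages = [
+     {"role": "system", "content": "You are a helpful assistant."},
+     {"role": "user", "content": "Hello, what is the capital of South Korea?"},
+ ]
+ print(build_prompt(messages))  # matches the single string shown above
+ ```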
627
+
628
+ #### Jinja Chat Template
629
+
630
+ The `tokenizer` is configured with the following Jinja2 chat template, which you can use with the `apply_chat_template` method.
631
+
632
+ ```jinja
633
+ {% for message in messages %}{% if (message['role'] == 'system') %}{{'<|im_start|>system<|im_sep|>' + message['content'] + '<|im_end|>'}}{% elif (message['role'] == 'user') %}{{'<|im_start|>user<|im_sep|>' + message['content'] + '<|im_end|>'}}{% elif (message['role'] == 'assistant') %}{{'<|im_start|>assistant<|im_sep|>' + message['content'] + '<|im_end|>'}}{% endif %}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant<|im_sep|>' }}{% endif %}
634
+ ```
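+
+ In practice you would not assemble the string by hand; a typical usage sketch with Hugging Face Transformers is shown below. The repository id `Jackrong/gpt-oss-120b-Distill-Phi-4-14B` is assumed for illustration; substitute the actual model path if it differs.
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ model_id = "Jackrong/gpt-oss-120b-Distill-Phi-4-14B"  # assumed repo id, replace if different
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
+
+ messages = [
+     {"role": "system", "content": "You are a helpful assistant."},
+     {"role": "user", "content": "Hello, what is the capital of South Korea?"},
+ ]
+
+ # apply_chat_template renders the Jinja template above;
+ # add_generation_prompt=True appends the trailing <|im_start|>assistant<|im_sep|>.
+ input_ids = tokenizer.apply_chat_template(
+     messages, add_generation_prompt=True, return_tensors="pt"
+ ).to(model.device)
+
+ output_ids = model.generate(input_ids, max_new_tokens=512)
+ print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
+ ```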
635
+
636
+
637
+
638
+
639
+ > 中文版 (Chinese version)
640
+
641
  <h3 align="center">对 DeepSeek-R1《Nature》论文方法的复刻与探索🚀</h3>
642
 
643
  - **Developed by:** Soren