🚀 Soren-Logos-3B: Supervised Fine-Tuning (SFT) + GRPO Reinforcement Learning
Soren-Logos-3B is a deeply evolved version built upon the Soren-Oracle-Chat-3B model, created to address specific challenges. The core motivation of this project is to improve the relatively weak mathematical and logical capabilities of the base Llama-3.2-3B model.
To achieve this, I introduced Group Relative Policy Optimization (GRPO), an advanced reinforcement learning technique. Building on the capabilities of the Oracle model, a series of carefully designed reward functions were used to guide the model in generating a transparent "engine of thought." The core of this engine is to teach the model to think step-by-step logically and to externalize its complete thought process using a specific format—the <think>...</think> tags—before arriving at a conclusion. (The reward for this was kept small to prevent it from becoming a model that always explicitly shows its chain of thought). The goal is to create a 3B-level model that not only answers but also presents its reasoning clearly and methodically.
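To make the target behavior concrete, here is a minimal inference sketch using the Hugging Face `transformers` chat-template API. The model ID comes from this repository, while the prompt and generation settings are illustrative assumptions rather than the author's documented usage:

```python
# Minimal inference sketch (assumed usage; adjust the model ID and settings as needed).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Jackrong/Soren-Logos-3B"  # repository name from this model card
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "A father is 4 times as old as his son; in 5 years he will be 3 times as old. How old is the son?"}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
# Because the format reward was kept small, the reply may (but is not guaranteed to)
# open with a <think>...</think> block containing the step-by-step reasoning.
```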
Core Upgrade: GRPO and the Engine of Thought
✨ The core upgrade of Soren-Logos focuses on improving the base model's correctness and reasoning process on complex problems, as well as the overall quality of its responses. The goal is no longer just a correct answer, but an answer whose derivation is verifiable and traceable, achieved through GRPO reinforcement learning.
- Guiding Logical Chain Generation: Utilizes reward functions to incentivize the model to first generate a step-by-step, coherent chain of logical reasoning when faced with a problem, rather than directly outputting the answer. (This reward is intentionally small, so the model doesn't behave like a pure reasoning model).
- Externalizing the Thought Process: We specifically guided the model to use the `<think>...</think>` block to enclose its complete thinking and reasoning steps. This makes the model's "mind" no longer a black box, allowing users to clearly see how it analyzes a problem and reaches a conclusion.
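As one concrete illustration of how such an incentive can be scored, below is a minimal sketch of a format-reward function in the style accepted by GRPO trainers; the function name, regex, and the small weight of 0.2 are assumptions, not the author's actual reward code:

```python
import re

# Look for a <think>...</think> block followed by a final answer outside the tags.
THINK_PATTERN = re.compile(r"<think>(.+?)</think>\s*\S", re.DOTALL)

def think_format_reward(completions, weight=0.2, **kwargs):
    """Small bonus when reasoning is wrapped in <think> tags; 0 otherwise.

    The weight is deliberately tiny so the model is nudged toward, not locked
    into, always emitting an explicit chain of thought.
    """
    rewards = []
    for completion in completions:
        # GRPO-style trainers may pass plain strings or chat-format message lists.
        text = completion[0]["content"] if isinstance(completion, list) else completion
        rewards.append(weight if THINK_PATTERN.search(text) else 0.0)
    return rewards
```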
Multi-dimensional Reward Function Optimization
In addition to the core logical reasoning training, this GRPO optimization includes a series of reward functions targeting response quality, achieving multi-dimensional alignment and improvement:
- Dialogue Alignment: Rewards more natural and human-preferred conversational styles, making communication smoother.
- Structured Output Tasks: For tasks like code generation and text formatting, it rewards outputs that strictly adhere to format requirements, enhancing the model's practicality in these scenarios.
- Penalizing Redundancy and Repetition: Penalizes verbose, repetitive, and meaningless answers, guiding the model to produce more concise and information-dense content.
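To illustrate the redundancy penalty in particular, here is a minimal sketch based on repeated trigrams; the scoring rule and scaling are illustrative assumptions, not the author's implementation:

```python
def redundancy_penalty_reward(completions, max_penalty=-1.0, **kwargs):
    """Penalize heavy trigram repetition; concise, varied answers score close to 0."""
    rewards = []
    for completion in completions:
        text = completion[0]["content"] if isinstance(completion, list) else completion
        tokens = text.split()
        if len(tokens) < 3:
            rewards.append(0.0)
            continue
        trigrams = [tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)]
        repeat_ratio = 1.0 - len(set(trigrams)) / len(trigrams)  # 0.0 = no repetition
        rewards.append(max_penalty * repeat_ratio)
    return rewards
```

The dialogue-alignment reward would more likely come from a learned preference or scoring model, so it is not sketched here.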
Training Method
🔥The training of Soren-Logos-3B is divided into two core phases:
Phase 1: Building "Oracle" Foundational Capabilities
The model first inherits all the training achievements of Soren-Oracle-Chat-3B, completing foundational fine-tuning on a high-quality mixed dataset of 86,448 samples to improve response depth, professionalism, and formatting.
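The Phase 1 run is only described at a high level here; the sketch below shows what such an SFT pass could look like following Unsloth's documented recipe. The LoRA rank, sequence length, dataset file, and TRL argument names are assumptions and vary across library versions:

```python
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

# Load the base model in 4-bit and attach LoRA adapters (hyperparameters are illustrative).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="meta-llama/Llama-3.2-3B-Instruct",  # full Hub ID of the base model
    max_seq_length=4096,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
)

# Hypothetical local file standing in for the 86,448-sample mixed SFT dataset.
dataset = load_dataset("json", data_files="oracle_mixed_sft.jsonl", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",  # argument names differ across TRL versions
    max_seq_length=4096,
    args=TrainingArguments(per_device_train_batch_size=2, num_train_epochs=1, output_dir="oracle-sft"),
)
trainer.train()
```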
Phase 2: "Logos" Reasoning Engine Reinforcement (GRPO)
Building on the first phase, the model entered a targeted reinforcement learning stage:
- Core Technology: Adopted Group Relative Policy Optimization (GRPO), an efficient RLHF method.
- Core Dataset: Used the openai/gsm8k dataset, focusing on training the model's step-by-step logical reasoning abilities through its math word problems.
- Composite Reward Functions: Designed and implemented multi-dimensional reward functions, including:
  - Logical Format Reward: Strongly guides the model to generate the `<think>...</think>` structure.
  - Answer Accuracy Reward: Scores the correctness of the final answer.
  - Dialogue Alignment Reward: Encourages the generation of more fluent, human-preferred responses.
  - Conciseness Penalty: Reduces the weight of redundant and repetitive content.
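Putting the pieces together, a minimal sketch of how such a composite GRPO run could be wired up with TRL's `GRPOTrainer` on `openai/gsm8k` follows; the reward weights, answer-extraction regex, and hyperparameters are assumptions, not the author's exact training script:

```python
import re
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# GSM8K gold answers end with "#### <number>"; keep that number as the label.
def extract_gold(answer_text):
    return answer_text.split("####")[-1].strip().replace(",", "")

dataset = load_dataset("openai/gsm8k", "main", split="train")
dataset = dataset.map(lambda x: {"prompt": x["question"], "gold": extract_gold(x["answer"])})

NUMBER = re.compile(r"-?\d+(?:\.\d+)?")

def accuracy_reward(completions, gold, **kwargs):
    """Reward 1.0 when the last number in the completion matches the gold answer."""
    rewards = []
    for completion, target in zip(completions, gold):
        numbers = NUMBER.findall(completion.replace(",", ""))
        rewards.append(1.0 if numbers and numbers[-1] == target else 0.0)
    return rewards

def think_format_reward(completions, **kwargs):
    """Small bonus for wrapping reasoning in <think>...</think> (see the sketch above)."""
    return [0.2 if "<think>" in c and "</think>" in c else 0.0 for c in completions]

trainer = GRPOTrainer(
    model="Jackrong/Soren-Oracle-Chat-3B",  # Phase 1 checkpoint as the starting policy
    reward_funcs=[accuracy_reward, think_format_reward],
    train_dataset=dataset,
    args=GRPOConfig(output_dir="logos-grpo", num_generations=8, max_completion_length=512),
)
trainer.train()
```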
Below is a comparison of responses from the fine-tuned model and the base model on several questions, illustrating the model's improvements in formatting, logic, and professionalism.
Model Capability Comparison
| Question | Soren-Logos-3B (After Fine-tuning) | Official Llama-3.2-3B-Instruct |
|---|---|---|
| 1. Can machines think? | Answer: Machines do not possess the ability to "think" or have "thoughts." 💡 Core Distinction: Machine vs. Machine Thinking: - Machine (Device): Lacks consciousness, subjective experience, etc. - Machine Thinking (or AI): Highly simulated, algorithm-based. 🧠 Why can't machines "think"?: - 1️⃣ Data-Driven: Models rely on training data, not subjective consciousness. - 2️⃣ Lack of Self-Awareness: Cannot generate subjective experiences or self-awareness. Commentary: Uses paragraphs, lists, and emojis for a clear structure and greater depth. | Answer: Currently, machines do not possess true thinking ability. However, Machine Learning and Deep Learning can already mimic human thinking in some aspects, making predictions and decisions, but they lack critical thinking and creativity. Commentary: The answer is relatively plain and lacks a hierarchical structure. |
| 2. Age Calculation Problem | Setup: Let's say Xiaoming is x years old, and his father is 4x years old. In 5 years: Xiaoming will be x + 5, and his father will be 4x + 5. Set up the equation: 4x + 5 = 3(x + 5) → Solving for x gives x = 10. Answer: ✅ Xiaoming is currently 10 years old, and his father is 40 years old. Commentary: The reasoning steps are clear, and the result is correct. | Explanation: The original answer attempted to set up an equation but incorrectly incorporated the "in 5 years" condition into the initial equation, leading to a final calculation error. For example, it produced incorrect forms like d = 4m + 5. Commentary: Logical error, incorrect answer. |
| 3. The history of Apple Inc. | Answer: 🍎 Apple Inc. is a globally renowned innovation company. Early Years (1976–1980): Founding and First Products: Introduction to its establishment and initial products; Entry into Education: Early strategies for the education market. Notable Milestones: Lists major time points and products. Commentary: Uses subheadings and bullet points for a clearer and more organized structure. | Answer: Apple Inc. has a rich history. Early Years (1976–1980): Introduction to the company's founding and first products; The Macintosh Era (1980–1996): Basic entries like the Macintosh release in 1984. Commentary: Presented as a simple list with more generalized information. |
| 4. Explain what a "neural network" is. | Answer: Introduction to Neural Networks and Core Concepts. Definition and Historical Background: Introduces the origin and development. Core Components: Neurons and Network Structure: Neurons, Layers, Connection Weights, etc. Working Principle and Architecture Types: Feedforward Networks, Convolutional Neural Networks, Recurrent Neural Networks, etc., along with application scenarios. Commentary: A comprehensive and in-depth answer covering history, components, principles, architecture, and applications with a high degree of structure. | Answer: A neural network is a machine learning model that mimics the brain's neural system. Basic Components: Composed of multiple "neurons" and a hierarchical structure. Working Principle: Data input, weights and activation functions, etc. Commentary: The explanation is relatively basic and general. |
Intended Use and Limitations
Intended Use
This model is designed for various tasks that require clear, reliable, and transparent reasoning processes, with improvements in several capabilities:
- General-purpose conversation in Chinese and English, instruction following, and logical reasoning.
- Educational, learning, and tutoring scenarios that require showing problem-solving steps.
- Debugging and verifying the reliability of AI outputs.
- Content creation tasks that require high-quality structured output.
Limitations
- Potential Inaccuracies: Like all language models, it may produce inaccurate or biased content. Always verify critical information.
- Knowledge Cutoff: The model's knowledge base ends in December 2023.
- Reasoning Capability Limits: Although its reasoning ability has been significantly enhanced, as a 3B-level model, it may still make mistakes when handling extremely complex reasoning problems.
Acknowledgments
The creation of Soren-Logos-3B would not have been possible without the many outstanding prior works and the collective wisdom of the open-source community. We express our sincerest gratitude to all contributors who provided the foundation, tools, and inspiration for this project.
Core Contributors
- Base LLM: We used `meta/Llama-3.2-3B-Instruct`, developed by Meta, as the starting point for our model. Its excellent architecture and powerful foundational capabilities were key to the project's success.
- Fine-tuning Framework: We used the `Unsloth` library for efficient training and optimization.
I encourage users of this model to also acknowledge and cite the original contributors of the base models and datasets mentioned above. The open-source community thrives on sharing, and we hope Soren-Logos-3B can also be a part of this force.
⚠️ Disclaimer
Please note that the core objective of this fine-tuning was to optimize and enhance the model's capabilities in specific areas, but it was not a complete retraining from scratch.
The final performance, knowledge boundary, and capability limits of this model are strictly constrained by the inherent framework of its base model. Fine-tuning can improve performance in certain aspects but cannot overcome the fundamental limitations of the base model itself.
Therefore, the improvements brought by fine-tuning are incremental and do not represent a qualitative leap. I advise users to independently cross-verify all critical information and to evaluate the model's output with caution.