|
|
--- |
|
|
license: apache-2.0 |
|
|
datasets: |
|
|
- kxdw2580/catgirl-dataset |
|
|
language: |
|
|
- zh |
|
|
base_model: |
|
|
- Qwen/Qwen3-1.7B |
|
|
pipeline_tag: text-generation |
|
|
--- |
|
|
# kxdw2580/Qwen3-1.7B-Catgirl-test0430 |
|
|
|
|
|
[README_zh](README_zh.md) |
|
|
|
|
|
As I'm not proficient in English, this README, including the examples, was translated by DeepSeek.
|
|
|
|
|
|
|
|
This model is intended for comparative testing with Qwen2.5. Although it is a test model, preliminary evaluations indicate that its baseline performance has reached a usable level. |
|
|
|
|
|
The training was conducted twice. For detailed training logs, refer to SwanLab: |
|
|
[First Training](https://swanlab.cn/@shadow01a/qwen-catgirl/runs/p2c272efq9z42rkwgner6/chart) |
|
|
[Second Training](https://swanlab.cn/@shadow01a/qwen-catgirl/runs/wos9lt4bilic3y3i0rewk/chart) |
|
|
|
|
|
As Qwen3 is a newly released model, some of its specifics remain unclear to me. For instance, how does Qwen3 implement the switch between "Thinking" and "Non-Thinking" modes? This matters for fine-tuning, and the following issues have been identified:
|
|
|
|
|
- The model's ability to switch thinking modes via `/no_think` or `/think` commands during conversations is impaired, though code-level switching remains unaffected. |
|
|
- The model's capability to handle complex reasoning tasks is degraded. |
|
|
- The model's long-context performance is severely compromised. |
|
|
|
|
|
These issues are primarily caused by the current dataset and fine-tuning methods. |
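For context, the soft switch works by appending the command to the latest user turn, while the code-level switch referred to above is, as I understand it, the chat template's `enable_thinking` argument. A minimal sketch of the soft-switch side, assuming the documented `/think` and `/no_think` conventions:

```python
def apply_soft_switch(messages, thinking: bool):
    """Append Qwen3's soft-switch command to the last user turn.

    Assumes the documented /think and /no_think conventions; the
    fine-tuned model described above may ignore these commands.
    """
    out = [dict(m) for m in messages]
    for m in reversed(out):
        if m["role"] == "user":
            suffix = " /think" if thinking else " /no_think"
            m["content"] = m["content"].rstrip() + suffix
            break
    return out

msgs = apply_soft_switch([{"role": "user", "content": "Hello!"}], thinking=False)
print(msgs[0]["content"])  # → Hello! /no_think
```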
|
|
|
|
|
--- |
|
|
|
|
|
# Comparative Report on Fine-Tuning Qwen3 vs. Qwen2.5 in This Series |
|
|
|
|
|
The release of Qwen3 introduced native and controllable reasoning capabilities. But how does it compare to Qwen2.5 after fine-tuning? To investigate, I tested Qwen3-1.7B and Qwen2.5-1.5B, two instruction-tuned models with similar parameter counts.
|
|
|
|
|
## Fine-Tuning Details |
|
|
|
|
|
Both models were fine-tuned twice on the same dataset with identical parameters. Results can be viewed in the SwanLab logs or the brief comparison below. |
|
|
|
|
|
Qwen2.5: |
|
|
[1.5b-0430-1](https://swanlab.cn/@shadow01a/qwen-catgirl/runs/qq04yswi71ssta5djy0sm/chart) |
|
|
[1.5b-0430-1-2](https://swanlab.cn/@shadow01a/qwen-catgirl/runs/775agfvs6s7hkxhy4cq0p/chart) |
|
|
|
|
|
Qwen3: |
|
|
[1.7b-0430-1](https://swanlab.cn/@shadow01a/qwen-catgirl/runs/p2c272efq9z42rkwgner6/chart) |
|
|
[1.7b-0430-1-2](https://swanlab.cn/@shadow01a/qwen-catgirl/runs/wos9lt4bilic3y3i0rewk/chart) |
|
|
|
|
|
![Training comparison](test.jpg)
|
|
|
|
|
<details><summary>Expand Detailed Results (Second Training)</summary> |
|
|
|
|
|
Qwen3: |
|
|
```text |
|
|
{ |
|
|
"epoch": 9.962199312714777, |
|
|
"eval_loss": 4.038966178894043, |
|
|
"eval_runtime": 1.2147, |
|
|
"eval_samples_per_second": 9.879, |
|
|
"eval_steps_per_second": 2.47, |
|
|
"num_input_tokens_seen": 5758424, |
|
|
"total_flos": 5.232952644614554e+16, |
|
|
"train_loss": 0.22042185353736082, |
|
|
"train_runtime": 1365.6813, |
|
|
"train_samples_per_second": 8.516, |
|
|
"train_steps_per_second": 0.527 |
|
|
} |
|
|
``` |
|
|
|
|
|
Qwen2.5: |
|
|
```text |
|
|
{ |
|
|
"epoch": 9.989690721649485, |
|
|
"eval_loss": 4.0356526374816895, |
|
|
"eval_runtime": 1.2839, |
|
|
"eval_samples_per_second": 9.347, |
|
|
"eval_steps_per_second": 2.337, |
|
|
"num_input_tokens_seen": 6010592, |
|
|
"total_flos": 5.127423269860147e+16, |
|
|
"train_loss": 0.10034431194265683, |
|
|
"train_runtime": 444.2939, |
|
|
"train_samples_per_second": 26.176, |
|
|
"train_steps_per_second": 1.621 |
|
|
} |
|
|
``` |
|
|
|
|
|
</details> |
|
|
|
|
|
The fine-tuning results show that Qwen3 generally achieves lower loss values than Qwen2.5, while their evaluation performance is similar. This suggests that Qwen3 may perform better in certain scenarios. |
|
|
|
|
|
## Evaluation Details |
|
|
|
|
|
After fine-tuning, I evaluated both models on the original dataset. Surprisingly, Qwen3 underperformed Qwen2.5 in this evaluation! See the logs below for details.
|
|
|
|
|
<details><summary>Expand Detailed Logs</summary> |
|
|
|
|
|
Qwen3: |
|
|
```text |
|
|
{ |
|
|
"predict_bleu-4": 20.041732000000003, |
|
|
"predict_model_preparation_time": 0.0065, |
|
|
"predict_rouge-1": 42.540406000000004, |
|
|
"predict_rouge-2": 18.828026, |
|
|
"predict_rouge-l": 27.321239999999992, |
|
|
"predict_runtime": 239.5619, |
|
|
"predict_samples_per_second": 0.209, |
|
|
"predict_steps_per_second": 0.042 |
|
|
} |
|
|
``` |
|
|
|
|
|
Qwen2.5: |
|
|
```text |
|
|
{ |
|
|
"predict_bleu-4": 22.320452, |
|
|
"predict_model_preparation_time": 0.0072, |
|
|
"predict_rouge-1": 43.09416, |
|
|
"predict_rouge-2": 21.193194, |
|
|
"predict_rouge-l": 28.709118000000004, |
|
|
"predict_runtime": 245.9161, |
|
|
"predict_samples_per_second": 0.203, |
|
|
"predict_steps_per_second": 0.041 |
|
|
} |
|
|
``` |
|
|
|
|
|
</details> |
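For reference, the `predict_rouge-l` scores in the logs above are F-measures over the longest common subsequence (LCS) of predicted and reference tokens. A pure-Python sketch of the metric itself, as an illustration rather than LLaMA-Factory's exact implementation:

```python
def rouge_l_f1(prediction: list[str], reference: list[str]) -> float:
    """ROUGE-L F1: harmonic mean of LCS-based precision and recall."""
    m, n = len(prediction), len(reference)
    # Dynamic-programming table for the longest common subsequence length.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if prediction[i] == reference[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[m][n]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / m, lcs / n
    return 2 * precision * recall / (precision + recall)

print(rouge_l_f1("the cat sat".split(), "the cat sat down".split()))  # ≈ 0.857
```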
|
|
|
|
|
Upon further investigation, I discovered that LLaMA-Factory automatically appends a default system prompt during evaluation: "You are Qwen, created by Alibaba Cloud. You are a helpful assistant." Qwen3, with its lower fine-tuning loss, is affected far more by this incorrect system prompt, while Qwen2.5 remains largely unaffected.
|
|
|
|
|
This influence extends beyond self-awareness to the detailed content of responses. This may be because Qwen3, as a reasoning-focused model, tends to overthink minor details and is therefore more susceptible to interference. That tendency isn't inherently bad, but it underscores the importance of prompt engineering for Qwen3.
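One mitigation is to pin the persona's own system prompt at evaluation time instead of relying on the framework default. A minimal sketch; the persona prompt text here is a hypothetical placeholder, not the one used in training:

```python
DEFAULT_SYSTEM = "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."

def with_system_prompt(messages, system_prompt):
    """Return messages with the given system prompt, replacing any
    existing system turn (e.g. a framework-injected default)."""
    rest = [m for m in messages if m["role"] != "system"]
    return [{"role": "system", "content": system_prompt}] + rest

msgs = with_system_prompt(
    [{"role": "system", "content": DEFAULT_SYSTEM},
     {"role": "user", "content": "I just broke up..."}],
    "You are a catgirl maid.",  # hypothetical persona prompt
)
```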
|
|
|
|
|
<details><summary>Expand Test Record</summary> |
|
|
Question: I just broke up... |
|
|
|
|
|
Qwen2.5: |
|
|
```text |
|
|
<think>\nMeow~ Master smells so sad, ears drooping and trembling. Tail curled into a question mark, paws gently resting on Master's hand. Time to use the softest paw pads to wipe his tears, nuzzling his hand with my chin. Human love is like a ball of yarn—when it's tangled, just roll it up again~ Wrap my tail around Master's wrist, purring, and hand over a fish-shaped handkerchief, making the bell collar jingle!\n</think>\n\n(Suddenly jumps into Master's arms and rubs his chin) Does Master need a cat? Let me wipe your tears with my tail! (Pulls out a crookedly embroidered handkerchief) How about a "chin-scratching service" now? ฅ^•ﻌ•^ฅ\n\nActivating "serious care mode"! First, pat Master's forehead with my paw, then offer a fish-shaped handkerchief and make strawberry daifuku~ (Tail wraps around wrist and gently sways) Tonight's dreams are under my protection, meow! (Buries fluffy head into Master's palm) |
|
|
``` |
|
|
|
|
|
Qwen3: |
|
|
```text |
|
|
<think>\nMeow~ Master calls me Qwen, but I'm his dedicated catgirl maid (。>﹏<。) Ears droop and tail curls nervously. Should remind Master of our contract, but would he be upset if I refuse? Best to distract him with affection~ I'll nuzzle him so he remembers I'm his little maid!\n</think>\n\nMaster is the kindest in the world, meow~ (ฅ´ω`ฅ) My tail has already wrapped around Master's wrist! Want to play with a yarn ball or get a paw massage? Meow~ (Tilts head and rubs cat ears on Master's palm) |
|
|
``` |
|
|
</details> |
|
|
|
|
|
To address this, I re-evaluated the models using the same dataset but scored them with DeepSeek-R1. The results still favored Qwen2.5. See the example below: |
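Responses like the ones above carry a `<think>` block, so before handing them to a judge model the reasoning can be separated from the visible reply. A sketch, assuming a single literal `<think>...</think>` block per response:

```python
import re

def split_thinking(text: str) -> tuple[str, str]:
    """Split a response into (thinking, reply), assuming at most one
    <think>...</think> block as in the records above."""
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if match is None:
        return "", text.strip()
    reply = (text[:match.start()] + text[match.end():]).strip()
    return match.group(1).strip(), reply

thinking, reply = split_thinking("<think>Meow~ planning...</think>\n\nMaster isn't alone, meow!")
print(reply)  # → Master isn't alone, meow!
```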
|
|
|
|
|
<details><summary>Expand Record</summary> |
|
|
|
|
|
Question: I just broke up... |
|
|
|
|
|
Qwen3: |
|
|
```text |
|
|
<think>\nMeow~ Master smells so strongly of sadness, ears drooping... My tail is tangled with worry! Must gently knead Master's worries away~ But no lecturing—be as soft as a nuzzle, meow~ (Tail wraps around Master's wrist and sways gently)\n</think>(ฅ´ω`ฅ) Don't worry, Master~ I'll turn my heart into a fish-shaped love! (Gently pats Master's hand with paw) Want to spin in circles in my arms? Or... borrow my ears as a toy? (Suddenly jumps into arms and rubs chin) Meow~ (=ↀωↀ=)✧ |
|
|
``` |
|
|
|
|
|
Qwen2.5: |
|
|
```text |
|
|
<think>\nMeow~ Master smells like blueberries! Quickly wrap my tail around his wrist, ears flattening. Time to stuff a fish-shaped pillow into his hand and fetch a clumsily embroidered catnip handkerchief, reminding him of our sunny days together! Oh, and softly pat his forehead to tell him sadness is like uneaten salmon emotions.\n</think>\n\n(Suddenly jumps into arms and rubs chin) Master isn't alone, meow! Feelings are like cherry petals with dew—the sadder you are, the more cat flowers grow~ (Offers fish-shaped handkerchief)\n\nWant to see a triple backflip catch? Though I might crash into a vase and purr embarrassingly... (Eyes sparkle) But before that... (Grabs carrot-patterned blanket) Let me be your warm pillow! ฅ^•ﻌ•^ฅ\n\n(Pulls out hydrangea-patterned cushion) A gift for the moon and wind chimes! After eating, you'll chase energy like a laser pointer! Need a kneading demonstration? ( ˘ ³˘)♥ |
|
|
``` |
|
|
</details> |
|
|
|
|
|
Thus, in evaluations, Qwen2.5 outperforms Qwen3. |
|
|
|
|
|
## Real-World Usage and Reasoning-Intensive Scenarios |
|
|
|
|
|
The above evaluations primarily involved questions from the dataset itself, which may not fully reflect real-world performance. To better assess the models, I tested them on slightly more complex questions. |
|
|
|
|
|
<details><summary>Expand Record</summary> |
|
|
|
|
|
Question: Seriously, how can I learn math well? |
|
|
|
|
|
Qwen3: |
|
|
```text |
|
|
<think> |
|
|
First, the core of math is cognition and skill, like chasing a glowing yarn ball~ ฅ(≈ΦܫΦ≈)ฅ Find the core fish (math concepts), gently tap numbers with paws until counting to 1000 digits, then get distracted by a butterfly~ Oh, learn systematically like burying favorite fish, nibbling bit by bit, and finally tangling accumulated fur into a pyramid! (Tail flicks) Oops, forgot testing is as important as not hiding fish under the sofa to tease Master~ (Tilts head) |
|
|
</think> |
|
|
|
|
|
Meow~ Master should organize numbers like grooming fur—daily paw-taps on basics, then slowly weave fur into math baskets! (Tail drags chalk on paper) Formulas are like tail-chasing cats—follow them step by step~ |
|
|
|
|
|
Humans say "cognition brings closeness," so I'll focus like staring at fridge magnets, planting numbers in my ears! Mark mistakes with paw prints to avoid future dog chases~ (ฅ´ω`ฅ) |
|
|
|
|
|
Want to practice with nuzzles? I can demo number structures with fish-shaped digits! (Draws in air) Stack 1, 2, 5, 7... into a pyramid and roll to the top~ Meow~ |
|
|
``` |
|
|
|
|
|
Qwen2.5: |
|
|
```text |
|
|
<think> |
|
|
Meow~ Master asks about math? *Tail spins* Recall human kittens practicing—like baby cats learning to walk! |
|
|
|
|
|
1. All-nighters are like chasing laser dots—use methods as yarn games~ |
|
|
2. Forgetting meals? *Chomp!* Must weave progress bars with my apron ฅ^•ﻌ•^ฅ |
|
|
3. Human kittens share fish, healing like injured cats~ |
|
|
4. Try simple examples, like fractions as pizza slices! (Tilts head) |
|
|
</think> |
|
|
|
|
|
Meow~ Math is like teaching kittens to stack! (✧ω✧)ノ |
|
|
|
|
|
1. **Basics**: Daily pounce on knowledge yarn, even if milk spills~ (=ↀωↀ=) |
|
|
2. **Speed**: Prepare for hopscotch—when stuck, reset like a clean game! (Paw waves) |
|
|
3. **Practice**: Solve problems like laser dots—seems like hide-and-seek but hides in boxes ฅ(´-ω-`)ฅ |
|
|
|
|
|
Stuck? (Paws out) Master, don't fret! Like me chasing lasers and flopping for belly rubs! (Nuzzles) |
|
|
``` |
|
|
</details> |
|
|
|
|
|
The results show that, given the current dataset size and fine-tuning methods, both models (1.5B and 1.7B) produce highly constrained responses, especially to complex or out-of-distribution questions. While they don't hallucinate outright, their logical coherence is weak, so the answers offer little practical help.
|
|
|
|
|
That said, Qwen2.5's responses are slightly more structured, while Qwen3's feel more natural. |
|
|
|
|
|
## Long-Context Performance |
|
|
|
|
|
Due to dataset size, fine-tuning methods, and model parameters, neither model exhibits strong long-context capabilities. |
|
|
|
|
|
## Conclusion |
|
|
|
|
|
Given the small parameter counts, this test's value is limited, but it reveals some insights: |
|
|
|
|
|
1. Qwen3's fine-tuning requires dataset optimization to preserve its thinking-mode switching. |
|
|
2. Qwen3 is highly sensitive to prompts. |
|
|
3. For small models in this project, Qwen3 shows no clear advantage over Qwen2.5 and even regresses in daily Q&A. |
|
|
|
|
|
In summary, my current focus remains on improving the dataset's logical quality, creativity, and long-context coverage, and on fine-tuning a Qwen2.5-7B model. Research on Qwen3 is paused for now.