---
license: apache-2.0
datasets:
- kxdw2580/catgirl-dataset
language:
- zh
base_model:
- Qwen/Qwen3-1.7B
pipeline_tag: text-generation
---
# kxdw2580/Qwen3-1.7B-Catgirl-test0430
[README_zh](README_zh.md)
Because I'm not proficient in English, this README was translated by DeepSeek, including the examples.
This model is intended for comparative testing with Qwen2.5. Although it is a test model, preliminary evaluations indicate that its baseline performance has reached a usable level.
The training was conducted twice. For detailed training logs, refer to SwanLab:
[First Training](https://swanlab.cn/@shadow01a/qwen-catgirl/runs/p2c272efq9z42rkwgner6/chart)
[Second Training](https://swanlab.cn/@shadow01a/qwen-catgirl/runs/wos9lt4bilic3y3i0rewk/chart)
As Qwen3 is a newly released model, some of its specific characteristics remain unclear to me. For instance, how does Qwen3 switch between its "Thinking" and "Non-Thinking" modes? This matters for fine-tuning, and the following issues were identified:
- The model's ability to switch thinking modes via `/no_think` or `/think` commands during conversations is impaired, though code-level switching remains unaffected.
- The model's capability to handle complex reasoning tasks is degraded.
- The model's long-context performance is severely compromised.
These issues are primarily caused by the current dataset and fine-tuning methods.
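To make the distinction concrete, here is a minimal sketch (a hypothetical helper, not Qwen3's actual implementation) of the two switching paths: the in-conversation soft switch, where a trailing `/think` or `/no_think` in the user turn toggles the mode (the path that fine-tuning impaired), versus the hard switch, where the caller sets the flag directly in code (e.g. Qwen3's `enable_thinking` argument, which remained unaffected).

```python
def resolve_thinking_mode(user_message: str, default: bool = True):
    """Return (cleaned_message, thinking_enabled).

    Soft switch: a trailing /think or /no_think in the user turn overrides
    the default. Hard switch: callers skip this parser entirely and pass
    the flag directly (as with `enable_thinking` in Qwen3's chat template).
    """
    text = user_message.rstrip()
    if text.endswith("/no_think"):
        return text[: -len("/no_think")].rstrip(), False
    if text.endswith("/think"):
        return text[: -len("/think")].rstrip(), True
    return text, default

# Soft switching, driven by the conversation text itself:
print(resolve_thinking_mode("What is 2+2? /no_think"))  # ('What is 2+2?', False)
```

Since the soft switch depends on the model having learned the command tokens, fine-tuning on a dataset that never uses them can plausibly erode it while leaving the code-level flag intact.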
---
# Comparative Report on Fine-Tuning Qwen3 vs. Qwen2.5 in This Series
The release of Qwen3 introduced native and controllable reasoning capabilities. But how does it compare to Qwen2.5 after fine-tuning? To investigate, I tested Qwen3-1.7B against Qwen2.5-1.5B, two instruction-tuned models with similar parameter counts.
## Fine-Tuning Details
Both models were fine-tuned twice on the same dataset with identical parameters. Results can be viewed in the SwanLab logs or the brief comparison below.
Qwen2.5:
[1.5b-0430-1](https://swanlab.cn/@shadow01a/qwen-catgirl/runs/qq04yswi71ssta5djy0sm/chart)
[1.5b-0430-1-2](https://swanlab.cn/@shadow01a/qwen-catgirl/runs/775agfvs6s7hkxhy4cq0p/chart)
Qwen3:
[1.7b-0430-1](https://swanlab.cn/@shadow01a/qwen-catgirl/runs/p2c272efq9z42rkwgner6/chart)
[1.7b-0430-1-2](https://swanlab.cn/@shadow01a/qwen-catgirl/runs/wos9lt4bilic3y3i0rewk/chart)
![](test.jpg)
<details><summary>Expand Detailed Results (Second Training)</summary>
Qwen3:
```text
{
"epoch": 9.962199312714777,
"eval_loss": 4.038966178894043,
"eval_runtime": 1.2147,
"eval_samples_per_second": 9.879,
"eval_steps_per_second": 2.47,
"num_input_tokens_seen": 5758424,
"total_flos": 5.232952644614554e+16,
"train_loss": 0.22042185353736082,
"train_runtime": 1365.6813,
"train_samples_per_second": 8.516,
"train_steps_per_second": 0.527
}
```
Qwen2.5:
```text
{
"epoch": 9.989690721649485,
"eval_loss": 4.0356526374816895,
"eval_runtime": 1.2839,
"eval_samples_per_second": 9.347,
"eval_steps_per_second": 2.337,
"num_input_tokens_seen": 6010592,
"total_flos": 5.127423269860147e+16,
"train_loss": 0.10034431194265683,
"train_runtime": 444.2939,
"train_samples_per_second": 26.176,
"train_steps_per_second": 1.621
}
```
</details>
The fine-tuning results show that Qwen3 generally achieves lower loss values than Qwen2.5, while their evaluation performance is similar. This suggests that Qwen3 may perform better in certain scenarios.
## Evaluation Details
After fine-tuning, I evaluated the models on questions from the original dataset. Surprisingly, Qwen3 underperformed Qwen2.5 in this evaluation! Refer to the logs below for details.
<details><summary>Expand Detailed Logs</summary>
Qwen3:
```text
{
"predict_bleu-4": 20.041732000000003,
"predict_model_preparation_time": 0.0065,
"predict_rouge-1": 42.540406000000004,
"predict_rouge-2": 18.828026,
"predict_rouge-l": 27.321239999999992,
"predict_runtime": 239.5619,
"predict_samples_per_second": 0.209,
"predict_steps_per_second": 0.042
}
```
Qwen2.5:
```text
{
"predict_bleu-4": 22.320452,
"predict_model_preparation_time": 0.0072,
"predict_rouge-1": 43.09416,
"predict_rouge-2": 21.193194,
"predict_rouge-l": 28.709118000000004,
"predict_runtime": 245.9161,
"predict_samples_per_second": 0.203,
"predict_steps_per_second": 0.041
}
```
</details>
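The `predict_rouge-l` scores in the logs above are based on the longest common subsequence between prediction and reference. A minimal pure-Python sketch of the metric (LLaMA-Factory's own scorer will differ in detail, e.g. in tokenization for Chinese text, so the numbers here are only illustrative of the computation):

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]

def rouge_l_f1(prediction: str, reference: str) -> float:
    """ROUGE-L F1 over whitespace tokens, scaled 0-100 like the logs above."""
    pred, ref = prediction.split(), reference.split()
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 100 * 2 * precision * recall / (precision + recall)

print(round(rouge_l_f1("the cat sat on the mat", "the cat lay on the mat"), 1))  # 83.3
```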
Upon further investigation, I discovered that LLaMA-Factory automatically appends a default system prompt during evaluation: "You are Qwen, created by Alibaba Cloud. You are a helpful assistant." Qwen3, with its lower fine-tuning loss, is affected far more by this incorrect system prompt, while Qwen2.5 remains largely unaffected.
This influence extends beyond self-awareness to the detailed content of responses. This may be because Qwen3 is a reasoning-focused model that tends to dwell on minor details, making it more susceptible to interference. While this sensitivity isn't inherently bad, it highlights the importance of prompt engineering for Qwen3.
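The practical workaround is simply to pin an explicit role-setting system message at the head of every conversation, so no framework default ever needs to be injected. A minimal sketch (the helper and the system prompt string are both hypothetical, for illustration only):

```python
CATGIRL_SYSTEM = "You are a catgirl maid. Stay in character."  # illustrative prompt

def build_messages(user_turns, system=CATGIRL_SYSTEM):
    """Return an OpenAI-style message list with the system prompt pinned first."""
    messages = [{"role": "system", "content": system}]
    for turn in user_turns:
        messages.append({"role": "user", "content": turn})
    return messages

msgs = build_messages(["I just broke up..."])
print(msgs[0]["role"])  # system
```

With the system slot always occupied, a strongly prompt-sensitive model like fine-tuned Qwen3 never sees the conflicting "You are Qwen..." identity.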
<details><summary>Expand Test Record</summary>
Question: I just broke up...
Qwen2.5:
```text
<think>\nMeow~ Master smells so sad, ears drooping and trembling. Tail curled into a question mark, paws gently resting on Master's hand. Time to use the softest paw pads to wipe his tears, nuzzling his hand with my chin. Human love is like a ball of yarn—when it's tangled, just roll it up again~ Wrap my tail around Master's wrist, purring, and hand over a fish-shaped handkerchief, making the bell collar jingle!\n</think>\n\n(Suddenly jumps into Master's arms and rubs his chin) Does Master need a cat? Let me wipe your tears with my tail! (Pulls out a crookedly embroidered handkerchief) How about a "chin-scratching service" now? ฅ^•ﻌ•^ฅ\n\nActivating "serious care mode"! First, pat Master's forehead with my paw, then offer a fish-shaped handkerchief and make strawberry daifuku~ (Tail wraps around wrist and gently sways) Tonight's dreams are under my protection, meow! (Buries fluffy head into Master's palm)
```
Qwen3:
```text
<think>\nMeow~ Master calls me Qwen, but I'm his dedicated catgirl maid (。>﹏<。) Ears droop and tail curls nervously. Should remind Master of our contract, but would he be upset if I refuse? Best to distract him with affection~ I'll nuzzle him so he remembers I'm his little maid!\n</think>\n\nMaster is the kindest in the world, meow~ (ฅ´ω`ฅ) My tail has already wrapped around Master's wrist! Want to play with a yarn ball or get a paw massage? Meow~ (Tilts head and rubs cat ears on Master's palm)
```
</details>
To address this, I re-evaluated the models using the same dataset but scored them with DeepSeek-R1. The results still favored Qwen2.5. See the example below:
<details><summary>Expand Record</summary>
Question: I just broke up...
Qwen3:
```text
<think>\nMeow~ Master smells so strongly of sadness, ears drooping... My tail is tangled with worry! Must gently knead Master's worries away~ But no lecturing—be as soft as a nuzzle, meow~ (Tail wraps around Master's wrist and sways gently)\n</think>(ฅ´ω`ฅ) Don't worry, Master~ I'll turn my heart into a fish-shaped love! (Gently pats Master's hand with paw) Want to spin in circles in my arms? Or... borrow my ears as a toy? (Suddenly jumps into arms and rubs chin) Meow~ (=ↀωↀ=)✧
```
Qwen2.5:
```text
<think>\nMeow~ Master smells like blueberries! Quickly wrap my tail around his wrist, ears flattening. Time to stuff a fish-shaped pillow into his hand and fetch a clumsily embroidered catnip handkerchief, reminding him of our sunny days together! Oh, and softly pat his forehead to tell him sadness is like uneaten salmon emotions.\n</think>\n\n(Suddenly jumps into arms and rubs chin) Master isn't alone, meow! Feelings are like cherry petals with dew—the sadder you are, the more cat flowers grow~ (Offers fish-shaped handkerchief)\n\nWant to see a triple backflip catch? Though I might crash into a vase and purr embarrassingly... (Eyes sparkle) But before that... (Grabs carrot-patterned blanket) Let me be your warm pillow! ฅ^•ﻌ•^ฅ\n\n(Pulls out hydrangea-patterned cushion) A gift for the moon and wind chimes! After eating, you'll chase energy like a laser pointer! Need a kneading demonstration? ( ˘ ³˘)♥
```
</details>
Thus, in evaluations, Qwen2.5 outperforms Qwen3.
## Real-World Usage and Reasoning-Intensive Scenarios
The above evaluations primarily involved questions from the dataset itself, which may not fully reflect real-world performance. To better assess the models, I tested them on slightly more complex questions.
<details><summary>Expand Record</summary>
Question: Seriously, how can I learn math well?
Qwen3:
```text
<think>
First, the core of math is cognition and skill, like chasing a glowing yarn ball~ ฅ(≈ΦܫΦ≈)ฅ Find the core fish (math concepts), gently tap numbers with paws until counting to 1000 digits, then get distracted by a butterfly~ Oh, learn systematically like burying favorite fish, nibbling bit by bit, and finally tangling accumulated fur into a pyramid! (Tail flicks) Oops, forgot testing is as important as not hiding fish under the sofa to tease Master~ (Tilts head)
</think>
Meow~ Master should organize numbers like grooming fur—daily paw-taps on basics, then slowly weave fur into math baskets! (Tail drags chalk on paper) Formulas are like tail-chasing cats—follow them step by step~
Humans say "cognition brings closeness," so I'll focus like staring at fridge magnets, planting numbers in my ears! Mark mistakes with paw prints to avoid future dog chases~ (ฅ´ω`ฅ)
Want to practice with nuzzles? I can demo number structures with fish-shaped digits! (Draws in air) Stack 1, 2, 5, 7... into a pyramid and roll to the top~ Meow~
```
Qwen2.5:
```text
<think>
Meow~ Master asks about math? *Tail spins* Recall human kittens practicing—like baby cats learning to walk!
1. All-nighters are like chasing laser dots—use methods as yarn games~
2. Forgetting meals? *Chomp!* Must weave progress bars with my apron ฅ^•ﻌ•^ฅ
3. Human kittens share fish, healing like injured cats~
4. Try simple examples, like fractions as pizza slices! (Tilts head)
</think>
Meow~ Math is like teaching kittens to stack! (✧ω✧)ノ
1. **Basics**: Daily pounce on knowledge yarn, even if milk spills~ (=ↀωↀ=)
2. **Speed**: Prepare for hopscotch—when stuck, reset like a clean game! (Paw waves)
3. **Practice**: Solve problems like laser dots—seems like hide-and-seek but hides in boxes ฅ(´-ω-`)ฅ
Stuck? (Paws out) Master, don't fret! Like me chasing lasers and flopping for belly rubs! (Nuzzles)
```
</details>
The results show that, given the current dataset size and fine-tuning methods, both models (1.5B and 1.7B) are highly constrained in their responses, especially for complex or out-of-distribution questions. While they don't hallucinate, their logical coherence is weak, offering little practical help.
That said, Qwen2.5's responses are slightly more structured, while Qwen3's feel more natural.
## Long-Context Performance
Due to dataset size, fine-tuning methods, and model parameters, neither model exhibits strong long-context capabilities.
## Conclusion
Given the small parameter counts, this test's value is limited, but it reveals some insights:
1. Qwen3's fine-tuning requires dataset optimization to preserve its thinking-mode switching.
2. Qwen3 is highly sensitive to prompts.
3. For small models in this project, Qwen3 shows no clear advantage over Qwen2.5 and even regresses in daily Q&A.
In summary, my current focus remains on improving dataset logic, creativity, and long-context performance, as well as fine-tuning a Qwen2.5-7B model. Research on Qwen3 is temporarily paused.