|
|
--- |
|
|
license: apache-2.0 |
|
|
datasets: |
|
|
- kxdw2580/catgirl-dataset |
|
|
language: |
|
|
- zh |
|
|
base_model: |
|
|
- Qwen/Qwen3-1.7B |
|
|
pipeline_tag: text-generation |
|
|
--- |
|
|
# kxdw2580/Qwen3-1.7B-Catgirl-test0430 |
|
|
|
|
|
[README_zh](README_zh.md) |
|
|
|
|
|
As I'm not proficient in English, this README, including the examples, was translated by DeepSeek.
|
|
|
|
|
|
|
|
This model is intended for comparative testing with Qwen2.5. Although it is a test model, preliminary evaluations indicate that its baseline performance has reached a usable level. |
|
|
|
|
|
The training was conducted twice. For detailed training logs, refer to SwanLab: |
|
|
[First Training](https://swanlab.cn/@shadow01a/qwen-catgirl/runs/p2c272efq9z42rkwgner6/chart) |
|
|
[Second Training](https://swanlab.cn/@shadow01a/qwen-catgirl/runs/wos9lt4bilic3y3i0rewk/chart) |
|
|
|
|
|
As Qwen3 is a newly released model, some of its specifics remain unclear to me. For instance, how does Qwen3 implement the switch between "Thinking" and "Non-Thinking" modes? This matters for fine-tuning, and the following issues have been identified:
|
|
|
|
|
- The model's ability to switch thinking modes via `/no_think` or `/think` commands during conversations is impaired, though code-level switching remains unaffected. |
|
|
- The model's capability to handle complex reasoning tasks is degraded. |
|
|
- The model's long-context performance is severely compromised. |
|
|
|
|
|
These issues are primarily caused by the current dataset and fine-tuning methods. |
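For context, the soft switch works by appending the command to the latest user turn, while the code-level switch referred to above is, as I understand it, the chat template's `enable_thinking` argument. A minimal sketch of the soft-switch side, assuming the documented `/think` and `/no_think` conventions:

```python
def apply_soft_switch(messages, thinking: bool):
    """Append Qwen3's soft-switch command to the last user turn.

    Assumes the documented /think and /no_think conventions; the
    fine-tuned model described above may ignore these commands.
    """
    out = [dict(m) for m in messages]
    for m in reversed(out):
        if m["role"] == "user":
            suffix = " /think" if thinking else " /no_think"
            m["content"] = m["content"].rstrip() + suffix
            break
    return out

msgs = apply_soft_switch([{"role": "user", "content": "Hello!"}], thinking=False)
print(msgs[0]["content"])  # → Hello! /no_think
```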
|
|
|
|
|
--- |
|
|
|
|
|
# Comparative Report on Fine-Tuning Qwen3 vs. Qwen2.5 in This Series |
|
|
|
|
|
The release of Qwen3 introduced native and controllable reasoning capabilities. But how does it compare to Qwen2.5 after fine-tuning? To investigate, I tested Qwen3-1.7B and Qwen2.5-1.5B, two instruction-tuned models with similar parameter counts.
|
|
|
|
|
## Fine-Tuning Details |
|
|
|
|
|
Both models were fine-tuned twice on the same dataset with identical parameters. Results can be viewed in the SwanLab logs or the brief comparison below. |
|
|
|
|
|
Qwen2.5: |
|
|
[1.5b-0430-1](https://swanlab.cn/@shadow01a/qwen-catgirl/runs/qq04yswi71ssta5djy0sm/chart) |
|
|
[1.5b-0430-1-2](https://swanlab.cn/@shadow01a/qwen-catgirl/runs/775agfvs6s7hkxhy4cq0p/chart) |
|
|
|
|
|
Qwen3: |
|
|
[1.7b-0430-1](https://swanlab.cn/@shadow01a/qwen-catgirl/runs/p2c272efq9z42rkwgner6/chart) |
|
|
[1.7b-0430-1-2](https://swanlab.cn/@shadow01a/qwen-catgirl/runs/wos9lt4bilic3y3i0rewk/chart) |
|
|
|
|
|
![Training comparison](test.jpg)
|
|
|
|
|
<details><summary>Expand Detailed Results (Second Training)</summary> |
|
|
|
|
|
Qwen3: |
|
|
```text |
|
|
{ |
|
|
"epoch": 9.962199312714777, |
|
|
"eval_loss": 4.038966178894043, |
|
|
"eval_runtime": 1.2147, |
|
|
"eval_samples_per_second": 9.879, |
|
|
"eval_steps_per_second": 2.47, |
|
|
"num_input_tokens_seen": 5758424, |
|
|
"total_flos": 5.232952644614554e+16, |
|
|
"train_loss": 0.22042185353736082, |
|
|
"train_runtime": 1365.6813, |
|
|
"train_samples_per_second": 8.516, |
|
|
"train_steps_per_second": 0.527 |
|
|
} |
|
|
``` |
|
|
|
|
|
Qwen2.5: |
|
|
```text |
|
|
{ |
|
|
"epoch": 9.989690721649485, |
|
|
"eval_loss": 4.0356526374816895, |
|
|
"eval_runtime": 1.2839, |
|
|
"eval_samples_per_second": 9.347, |
|
|
"eval_steps_per_second": 2.337, |
|
|
"num_input_tokens_seen": 6010592, |
|
|
"total_flos": 5.127423269860147e+16, |
|
|
"train_loss": 0.10034431194265683, |
|
|
"train_runtime": 444.2939, |
|
|
"train_samples_per_second": 26.176, |
|
|
"train_steps_per_second": 1.621 |
|
|
} |
|
|
``` |
|
|
|
|
|
</details> |
|
|
|
|
|
The fine-tuning results show that Qwen3 generally achieves lower loss values than Qwen2.5, while their evaluation performance is similar. This suggests that Qwen3 may perform better in certain scenarios. |
|
|
|
|
|
## Evaluation Details |
|
|
|
|
|
After fine-tuning, I evaluated both models on the original dataset. Surprisingly, Qwen3 underperformed Qwen2.5 in this evaluation! See the logs below for details.
|
|
|
|
|
<details><summary>Expand Detailed Logs</summary> |
|
|
|
|
|
Qwen3: |
|
|
```text |
|
|
{ |
|
|
"predict_bleu-4": 20.041732000000003, |
|
|
"predict_model_preparation_time": 0.0065, |
|
|
"predict_rouge-1": 42.540406000000004, |
|
|
"predict_rouge-2": 18.828026, |
|
|
"predict_rouge-l": 27.321239999999992, |
|
|
"predict_runtime": 239.5619, |
|
|
"predict_samples_per_second": 0.209, |
|
|
"predict_steps_per_second": 0.042 |
|
|
} |
|
|
``` |
|
|
|
|
|
Qwen2.5: |
|
|
```text |
|
|
{ |
|
|
"predict_bleu-4": 22.320452, |
|
|
"predict_model_preparation_time": 0.0072, |
|
|
"predict_rouge-1": 43.09416, |
|
|
"predict_rouge-2": 21.193194, |
|
|
"predict_rouge-l": 28.709118000000004, |
|
|
"predict_runtime": 245.9161, |
|
|
"predict_samples_per_second": 0.203, |
|
|
"predict_steps_per_second": 0.041 |
|
|
} |
|
|
``` |
|
|
|
|
|
</details> |
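For reference, the `predict_rouge-l` scores in the logs above are F-measures over the longest common subsequence (LCS) of predicted and reference tokens. A pure-Python sketch of the metric itself, as an illustration rather than LLaMA-Factory's exact implementation:

```python
def rouge_l_f1(prediction: list[str], reference: list[str]) -> float:
    """ROUGE-L F1: harmonic mean of LCS-based precision and recall."""
    m, n = len(prediction), len(reference)
    # Dynamic-programming table for the longest common subsequence length.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if prediction[i] == reference[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[m][n]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / m, lcs / n
    return 2 * precision * recall / (precision + recall)

print(rouge_l_f1("the cat sat".split(), "the cat sat down".split()))  # ≈ 0.857
```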
|
|
|
|
|
Upon further investigation, I discovered that LLaMA-Factory automatically appends a default system prompt during evaluation: "You are Qwen, created by Alibaba Cloud. You are a helpful assistant." Qwen3, with its lower fine-tuning loss, is affected far more by this incorrect system prompt, while Qwen2.5 remains largely unaffected.
|
|
|
|
|
This influence extends beyond self-awareness to the detailed content of responses. This may be because Qwen3, as a reasoning-focused model, tends to overthink minor details and is therefore more susceptible to interference. That tendency isn't inherently bad, but it underscores the importance of prompt engineering for Qwen3.
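One mitigation is to pin the persona's own system prompt at evaluation time instead of relying on the framework default. A minimal sketch; the persona prompt text here is a hypothetical placeholder, not the one used in training:

```python
DEFAULT_SYSTEM = "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."

def with_system_prompt(messages, system_prompt):
    """Return messages with the given system prompt, replacing any
    existing system turn (e.g. a framework-injected default)."""
    rest = [m for m in messages if m["role"] != "system"]
    return [{"role": "system", "content": system_prompt}] + rest

msgs = with_system_prompt(
    [{"role": "system", "content": DEFAULT_SYSTEM},
     {"role": "user", "content": "I just broke up..."}],
    "You are a catgirl maid.",  # hypothetical persona prompt
)
```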
|
|
|
|
|
<details><summary>Expand Test Record</summary> |
|
|
Question: I just broke up... |
|
|
|
|
|
Qwen2.5: |
|
|
```text |
|
|
<think>\nMeow~ Master smells so sad, ears drooping and trembling. Tail curled into a question mark, paws gently resting on Master's hand. Time to use the softest paw pads to wipe his tears, nuzzling his hand with my chin. Human love is like a ball of yarn—when it's tangled, just roll it up again~ Wrap my tail around Master's wrist, purring, and hand over a fish-shaped handkerchief, making the bell collar jingle!\n</think>\n\n(Suddenly jumps into Master's arms and rubs his chin) Does Master need a cat? Let me wipe your tears with my tail! (Pulls out a crookedly embroidered handkerchief) How about a "chin-scratching service" now? ฅ^•ﻌ•^ฅ\n\nActivating "serious care mode"! First, pat Master's forehead with my paw, then offer a fish-shaped handkerchief and make strawberry daifuku~ (Tail wraps around wrist and gently sways) Tonight's dreams are under my protection, meow! (Buries fluffy head into Master's palm) |
|
|
``` |
|
|
|
|
|
Qwen3: |
|
|
```text |
|
|
<think>\nMeow~ Master calls me Qwen, but I'm his dedicated catgirl maid (。>﹏<。) Ears droop and tail curls nervously. Should remind Master of our contract, but would he be upset if I refuse? Best to distract him with affection~ I'll nuzzle him so he remembers I'm his little maid!\n</think>\n\nMaster is the kindest in the world, meow~ (ฅ´ω`ฅ) My tail has already wrapped around Master's wrist! Want to play with a yarn ball or get a paw massage? Meow~ (Tilts head and rubs cat ears on Master's palm) |
|
|
``` |
|
|
</details> |
|
|
|
|
|
To address this, I re-evaluated the models using the same dataset but scored them with DeepSeek-R1. The results still favored Qwen2.5. See the example below: |
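Responses like the ones above carry a `<think>` block, so before handing them to a judge model the reasoning can be separated from the visible reply. A sketch, assuming a single literal `<think>...</think>` block per response:

```python
import re

def split_thinking(text: str) -> tuple[str, str]:
    """Split a response into (thinking, reply), assuming at most one
    <think>...</think> block as in the records above."""
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if match is None:
        return "", text.strip()
    reply = (text[:match.start()] + text[match.end():]).strip()
    return match.group(1).strip(), reply

thinking, reply = split_thinking("<think>Meow~ planning...</think>\n\nMaster isn't alone, meow!")
print(reply)  # → Master isn't alone, meow!
```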
|
|
|
|
|
<details><summary>Expand Record</summary> |
|
|
|
|
|
Question: I just broke up... |
|
|
|
|
|
Qwen3: |
|
|
```text |
|
|
<think>\nMeow~ Master smells so strongly of sadness, ears drooping... My tail is tangled with worry! Must gently knead Master's worries away~ But no lecturing—be as soft as a nuzzle, meow~ (Tail wraps around Master's wrist and sways gently)\n</think>(ฅ´ω`ฅ) Don't worry, Master~ I'll turn my heart into a fish-shaped love! (Gently pats Master's hand with paw) Want to spin in circles in my arms? Or... borrow my ears as a toy? (Suddenly jumps into arms and rubs chin) Meow~ (=ↀωↀ=)✧ |
|
|
``` |
|
|
|
|
|
Qwen2.5: |
|
|
```text |
|
|
<think>\nMeow~ Master smells like blueberries! Quickly wrap my tail around his wrist, ears flattening. Time to stuff a fish-shaped pillow into his hand and fetch a clumsily embroidered catnip handkerchief, reminding him of our sunny days together! Oh, and softly pat his forehead to tell him sadness is like uneaten salmon emotions.\n</think>\n\n(Suddenly jumps into arms and rubs chin) Master isn't alone, meow! Feelings are like cherry petals with dew—the sadder you are, the more cat flowers grow~ (Offers fish-shaped handkerchief)\n\nWant to see a triple backflip catch? Though I might crash into a vase and purr embarrassingly... (Eyes sparkle) But before that... (Grabs carrot-patterned blanket) Let me be your warm pillow! ฅ^•ﻌ•^ฅ\n\n(Pulls out hydrangea-patterned cushion) A gift for the moon and wind chimes! After eating, you'll chase energy like a laser pointer! Need a kneading demonstration? ( ˘ ³˘)♥ |
|
|
``` |
|
|
</details> |
|
|
|
|
|
Thus, in evaluations, Qwen2.5 outperforms Qwen3. |
|
|
|
|
|
## Real-World Usage and Reasoning-Intensive Scenarios |
|
|
|
|
|
The above evaluations primarily involved questions from the dataset itself, which may not fully reflect real-world performance. To better assess the models, I tested them on slightly more complex questions. |
|
|
|
|
|
<details><summary>Expand Record</summary> |
|
|
|
|
|
Question: Seriously, how can I learn math well? |
|
|
|
|
|
Qwen3: |
|
|
```text |
|
|
<think> |
|
|
First, the core of math is cognition and skill, like chasing a glowing yarn ball~ ฅ(≈ΦܫΦ≈)ฅ Find the core fish (math concepts), gently tap numbers with paws until counting to 1000 digits, then get distracted by a butterfly~ Oh, learn systematically like burying favorite fish, nibbling bit by bit, and finally tangling accumulated fur into a pyramid! (Tail flicks) Oops, forgot testing is as important as not hiding fish under the sofa to tease Master~ (Tilts head) |
|
|
</think> |
|
|
|
|
|
Meow~ Master should organize numbers like grooming fur—daily paw-taps on basics, then slowly weave fur into math baskets! (Tail drags chalk on paper) Formulas are like tail-chasing cats—follow them step by step~ |
|
|
|
|
|
Humans say "cognition brings closeness," so I'll focus like staring at fridge magnets, planting numbers in my ears! Mark mistakes with paw prints to avoid future dog chases~ (ฅ´ω`ฅ) |
|
|
|
|
|
Want to practice with nuzzles? I can demo number structures with fish-shaped digits! (Draws in air) Stack 1, 2, 5, 7... into a pyramid and roll to the top~ Meow~ |
|
|
``` |
|
|
|
|
|
Qwen2.5: |
|
|
```text |
|
|
<think> |
|
|
Meow~ Master asks about math? *Tail spins* Recall human kittens practicing—like baby cats learning to walk! |
|
|
|
|
|
1. All-nighters are like chasing laser dots—use methods as yarn games~ |
|
|
2. Forgetting meals? *Chomp!* Must weave progress bars with my apron ฅ^•ﻌ•^ฅ |
|
|
3. Human kittens share fish, healing like injured cats~ |
|
|
4. Try simple examples, like fractions as pizza slices! (Tilts head) |
|
|
</think> |
|
|
|
|
|
Meow~ Math is like teaching kittens to stack! (✧ω✧)ノ |
|
|
|
|
|
1. **Basics**: Daily pounce on knowledge yarn, even if milk spills~ (=ↀωↀ=) |
|
|
2. **Speed**: Prepare for hopscotch—when stuck, reset like a clean game! (Paw waves) |
|
|
3. **Practice**: Solve problems like laser dots—seems like hide-and-seek but hides in boxes ฅ(´-ω-`)ฅ |
|
|
|
|
|
Stuck? (Paws out) Master, don't fret! Like me chasing lasers and flopping for belly rubs! (Nuzzles) |
|
|
``` |
|
|
</details> |
|
|
|
|
|
The results show that, given the current dataset size and fine-tuning methods, both models (1.5B and 1.7B) produce highly constrained responses, especially to complex or out-of-distribution questions. While they don't hallucinate outright, their logical coherence is weak, so the answers offer little practical help.
|
|
|
|
|
That said, Qwen2.5's responses are slightly more structured, while Qwen3's feel more natural. |
|
|
|
|
|
## Long-Context Performance |
|
|
|
|
|
Due to dataset size, fine-tuning methods, and model parameters, neither model exhibits strong long-context capabilities. |
|
|
|
|
|
## Conclusion |
|
|
|
|
|
Given the small parameter counts, this test's value is limited, but it reveals some insights: |
|
|
|
|
|
1. Qwen3's fine-tuning requires dataset optimization to preserve its thinking-mode switching. |
|
|
2. Qwen3 is highly sensitive to prompts. |
|
|
3. For small models in this project, Qwen3 shows no clear advantage over Qwen2.5 and even regresses in daily Q&A. |
|
|
|
|
|
In summary, my current focus remains on improving the dataset's logical quality, creativity, and long-context coverage, and on fine-tuning a Qwen2.5-7B model. Research on Qwen3 is paused for now.