Introduction
We introduce TianJiangZhuGe_3B, an advanced multimodal large language model (MLLM) that demonstrates strong overall performance. We compare TianJiangZhuGe_3B with Qwen2.5-VL-3B-Instruct, whose pre-trained base model is used to initialize the language component of TianJiangZhuGe. Benefiting from Native Multimodal Pre-Training, TianJiangZhuGe_3B achieves even better overall text performance than Qwen2.5-VL-3B-Instruct.
Key Enhancements:
Meticulous Construction of High-Quality Chain-of-Thought (CoT) Datasets
Scale and Coverage: We have systematically built thousands of high-quality Chinese and English reasoning samples across multiple domains, such as mathematical applications, logical reasoning, and symbolic operations. This ensures the model's generalization ability in diverse scenarios.
Data Generation Method: Starting from selected image-text question-answer pairs and using the "Super Chain-of-Thought Model", we automatically generate Chain-of-Thought-annotated data containing detailed reasoning paths (a sketch of such a pipeline is given below). This method effectively enhances the model's step-by-step reasoning and logical coherence.
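The generation pipeline itself is not released with this card. Purely as an illustration, a minimal sketch of this kind of CoT distillation might look like the following, where `query_cot_model` is a hypothetical stand-in for the "Super Chain-of-Thought Model" interface and the answer-consistency filter is one plausible quality check, not the documented one:

```python
# Illustrative sketch only: `query_cot_model` is a hypothetical stand-in
# for the "Super Chain-of-Thought Model"; the filtering rule is assumed.
def annotate_with_cot(qa_pairs, query_cot_model):
    annotated = []
    for sample in qa_pairs:  # each sample: {"image", "question", "answer"}
        prompt = (
            "Answer the question about the image step by step, "
            "then state the final answer.\n"
            f"Question: {sample['question']}"
        )
        trace = query_cot_model(image=sample["image"], prompt=prompt)
        # Keep only traces whose final answer matches the reference answer,
        # so distilled reasoning paths stay consistent with ground truth.
        if sample["answer"].strip().lower() in trace.lower():
            annotated.append({**sample, "cot": trace})
    return annotated
```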
Multi-Stage GRPO Training Algorithm
Progressive Learning Mechanism: We propose a multi-stage GRPO (Group Relative Policy Optimization) training process. Through task design that progresses from shallow to deep and from simple to complex, we guide the model through stepwise capability evolution (a sketch follows the stage list below):
- Primary Stage: focus on judgment and classification tasks to strengthen the model's understanding of problem structure and basic logic.
- Intermediate Stage: introduce multiple-choice and matching questions to improve the model's ability to identify key information among distractors.
- Advanced Stage: expand to open-ended generation tasks that encourage free deduction and complete logical expression.

Algorithm Advantages: This training strategy reduces the difficulty of learning complex tasks, improves training stability and policy-convergence efficiency, and significantly enhances the model's adaptability across different task types.
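The card does not include training code. As an illustration only, a minimal sketch of the group-relative advantage computation at the core of GRPO, plus a hypothetical three-stage curriculum driver, could look like this; `run_stage`, the stage names, and the task-pool identifiers are assumptions, not the released implementation:

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages: for each prompt, a group of sampled
    responses is scored, and each reward is normalized by the group's
    mean and std. `rewards` has shape [num_prompts, group_size]."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Hypothetical curriculum mirroring the three stages described above;
# the task-pool names are illustrative, not from the model card.
CURRICULUM = [
    ("primary", "judgment_and_classification"),
    ("intermediate", "multiple_choice_and_matching"),
    ("advanced", "open_ended_generation"),
]

def train_multistage(run_stage):
    """`run_stage(task_pool)` is assumed to run GRPO updates on that pool."""
    for stage_name, task_pool in CURRICULUM:
        print(f"[GRPO] entering stage: {stage_name}")
        run_stage(task_pool)
```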
Evaluation:
| Benchmark | Qwen2.5-VL-3B | TianJiangZhuGe-3B |
|---|---|---|
| POPE | 0.7676 | 0.8000 |
| AI2D | 0.6343 | 0.6833 |
| VizWiz (val) | 0.6099 | 0.6062 |
| MathVision | 22.86 | 22.14 |
| OCRBench | 68.1 | 71.4 |
| MathVista | 40.8 | 44.4 |
Using Transformers to Chat:
```python
import torch
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the model and its processor
model_path = '/nfs4/models/Tianjiangzhuge'
model = Qwen2VLForConditionalGeneration.from_pretrained(model_path, torch_dtype=torch.float16, trust_remote_code=True).to(device)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

# Single-image chat message
messages = [{"role": "user", "content": [{"type": "image", "image": "file:///path/to/image1.jpg"}, {"type": "text", "text": "Describe this image."}]}]

# Build the prompt, collect the vision inputs, and generate a reply
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt").to(device)
generated_ids = model.generate(**inputs, max_new_tokens=512)
print(processor.batch_decode(generated_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0])
```
Multi-image inference:
```python
# Messages containing multiple images and a text query
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "Describe the difference between these images."},
        ],
    }
]

# Reuses the model and processor loaded above
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt").to(device)
generated_ids = model.generate(**inputs, max_new_tokens=512)
print(processor.batch_decode(generated_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0])
```
Model Details:
- License: MIT
- Languages: English (en), Chinese (zh)
- Base model: Qwen/Qwen2.5-VL-3B-Instruct