Introduction
We introduce TianJiangZhuGe_3B, an advanced multimodal large language model (MLLM) that demonstrates strong overall performance. We compare TianJiangZhuGe_3B with Qwen2.5-VL-3B-Instruct, whose pre-trained base model is used to initialize the language component of TianJiangZhuGe. Benefiting from Native Multimodal Pre-Training, TianJiangZhuGe_3B achieves even better overall text performance than Qwen2.5-VL-3B-Instruct.
Key Enhancements:
Meticulous Construction of High-Quality Chain-of-Thought (CoT) Datasets
Scale and Coverage: We have systematically built thousands of high-quality Chinese and English reasoning samples across multiple domains, such as mathematical applications, logical reasoning, and symbolic operations. This ensures the model's generalization ability in diverse scenarios.
Data Generation Method: Starting from selected image-text question-answer pairs and using the "Super Chain-of-Thought Model", we automatically generate Chain-of-Thought-annotated data containing detailed reasoning paths (a sketch of such a pipeline is given below). This method effectively enhances the model's step-by-step reasoning and logical coherence.
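The generation pipeline itself is not released with this card. Purely as an illustration, a minimal sketch of this kind of CoT distillation might look like the following, where `query_cot_model` is a hypothetical stand-in for the "Super Chain-of-Thought Model" interface and the answer-consistency filter is one plausible quality check, not the documented one:

```python
# Illustrative sketch only: `query_cot_model` is a hypothetical stand-in
# for the "Super Chain-of-Thought Model"; the filtering rule is assumed.
def annotate_with_cot(qa_pairs, query_cot_model):
    annotated = []
    for sample in qa_pairs:  # each sample: {"image", "question", "answer"}
        prompt = (
            "Answer the question about the image step by step, "
            "then state the final answer.\n"
            f"Question: {sample['question']}"
        )
        trace = query_cot_model(image=sample["image"], prompt=prompt)
        # Keep only traces whose final answer matches the reference answer,
        # so distilled reasoning paths stay consistent with ground truth.
        if sample["answer"].strip().lower() in trace.lower():
            annotated.append({**sample, "cot": trace})
    return annotated
```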
Multi-Stage GRPO Training Algorithm
Progressive Learning Mechanism: We propose a multi-stage GRPO (Group Relative Policy Optimization) training process. Through task design that progresses from shallow to deep and from simple to complex, we guide the model through stepwise capability evolution (a sketch follows the stage list below):
- Primary Stage: focus on judgment and classification tasks to strengthen the model's understanding of problem structure and basic logic.
- Intermediate Stage: introduce multiple-choice and matching questions to improve the model's ability to identify key information among distractors.
- Advanced Stage: expand to open-ended generation tasks that encourage free deduction and complete logical expression.

Algorithm Advantages: This training strategy reduces the difficulty of learning complex tasks, improves training stability and policy-convergence efficiency, and significantly enhances the model's adaptability across different task types.
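The card does not include training code. As an illustration only, a minimal sketch of the group-relative advantage computation at the core of GRPO, plus a hypothetical three-stage curriculum driver, could look like this; `run_stage`, the stage names, and the task-pool identifiers are assumptions, not the released implementation:

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages: for each prompt, a group of sampled
    responses is scored, and each reward is normalized by the group's
    mean and std. `rewards` has shape [num_prompts, group_size]."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Hypothetical curriculum mirroring the three stages described above;
# the task-pool names are illustrative, not from the model card.
CURRICULUM = [
    ("primary", "judgment_and_classification"),
    ("intermediate", "multiple_choice_and_matching"),
    ("advanced", "open_ended_generation"),
]

def train_multistage(run_stage):
    """`run_stage(task_pool)` is assumed to run GRPO updates on that pool."""
    for stage_name, task_pool in CURRICULUM:
        print(f"[GRPO] entering stage: {stage_name}")
        run_stage(task_pool)
```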
Evaluation:
| Benchmark | Qwen2.5-VL-3B | TianJiangZhuGe-3B |
|---|---|---|
| POPE | 0.7676 | 0.8000 |
| AI2D | 0.6343 | 0.6833 |
| VizWiz (val) | 0.6099 | 0.6062 |
| MathVision | 22.86 | 22.14 |
| OCRBench | 68.1 | 71.4 |
| MathVista | 40.8 | 44.4 |
Using Transformers to Chat:
```python
import torch
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the model and its processor
model_path = '/nfs4/models/Tianjiangzhuge'
model = Qwen2VLForConditionalGeneration.from_pretrained(model_path, torch_dtype=torch.float16, trust_remote_code=True).to(device)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

# Single-image chat message
messages = [{"role": "user", "content": [{"type": "image", "image": "file:///path/to/image1.jpg"}, {"type": "text", "text": "Describe this image."}]}]

# Build the prompt, collect the vision inputs, and generate a reply
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt").to(device)
generated_ids = model.generate(**inputs, max_new_tokens=512)
print(processor.batch_decode(generated_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0])
```
Multi-image inference:
```python
# Messages containing multiple images and a text query
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "Describe the difference between these images."},
        ],
    }
]

# Reuses the model and processor loaded above
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt").to(device)
generated_ids = model.generate(**inputs, max_new_tokens=512)
print(processor.batch_decode(generated_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0])
```
Model Details:
- License: MIT
- Languages: English (en), Chinese (zh)
- Base model: Qwen/Qwen2.5-VL-3B-Instruct