Introduction

We introduce TianJiangZhuGe_3B, an advanced multimodal large language model (MLLM) that demonstrates strong overall performance. We compare TianJiangZhuGe_3B with the Qwen2.5-VL-3B-Instruct model, whose pre-trained base model is used to initialize the language component of TianJiangZhuGe. Benefiting from Native Multimodal Pre-Training, TianJiangZhuGe_3B achieves even better overall text performance than Qwen2.5-VL-3B-Instruct.

Key Enhancements:

  1. Meticulous Construction of High-Quality Chain-of-Thought (CoT) Datasets

    Scale and Coverage: We have systematically built thousands of high-quality Chinese and English reasoning samples spanning domains such as mathematical applications, logical reasoning, and symbolic operations, ensuring the model’s generalization ability across diverse scenarios.

    Data Generation Method: Starting from curated image-text question-answer pairs and using the "Super Chain-of-Thought Model" as a teacher, we automatically generate Chain-of-Thought annotated data containing detailed reasoning paths. This method effectively enhances the model’s step-by-step reasoning and logical coherence (see the data-generation sketch after this list).

  2. Multi-Stage GRPO Training Algorithm

    Progressive Learning Mechanism: We propose a multi-stage GRPO (Group Relative Policy Optimization) training process. Through task design that progresses from shallow to deep and from simple to complex, we guide the model through a stepwise evolution of capability:

    Primary Stage: Focus on judgment and classification tasks to strengthen the model’s understanding of problem structures and basic logic.
    
    Intermediate Stage: Introduce multiple-choice and matching questions to improve the model’s ability to identify key information among distractors.
    
    Advanced Stage: Expand to open-ended generation tasks to encourage the model to conduct free deduction and complete logical expression.
    

    Algorithm Advantages: This training strategy reduces the model’s learning difficulty on complex tasks, improves training stability and convergence efficiency, and significantly enhances the model’s adaptability across different task types (see the staged-reward sketch below).
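
The CoT data-generation step in Key Enhancement 1 can be summarized with a minimal sketch. Everything here is illustrative: `teacher_chat` is a stand-in for however the "Super Chain-of-Thought Model" is actually served, and the prompt wording, consistency filter, and record schema are assumptions rather than the exact pipeline behind TianJiangZhuGe_3B.

```python
import json

def teacher_chat(image_path: str, prompt: str) -> str:
    # Stub: replace with a real call to the "Super Chain-of-Thought Model" teacher.
    return "Step 1: count the cats on the sofa. Step 2: count the cats on the floor.\nFinal answer: 2"

def build_cot_record(sample: dict) -> dict | None:
    """Turn one curated image-text QA pair into a CoT-annotated training record."""
    prompt = (
        "Answer the question about the image. Think step by step and end with "
        f"'Final answer: <answer>'.\nQuestion: {sample['question']}"
    )
    reasoning = teacher_chat(sample["image"], prompt)
    # Consistency filter: keep only traces whose final answer matches the reference label.
    if sample["answer"].lower() not in reasoning.splitlines()[-1].lower():
        return None
    return {
        "image": sample["image"],
        "question": sample["question"],
        "cot": reasoning,            # detailed reasoning path
        "answer": sample["answer"],
    }

if __name__ == "__main__":
    qa_pairs = [{"image": "demo.jpg", "question": "How many cats are shown?", "answer": "2"}]
    records = [r for r in map(build_cot_record, qa_pairs) if r is not None]
    print(json.dumps(records, ensure_ascii=False, indent=2))
```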
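
As a rough illustration of the multi-stage GRPO curriculum, the sketch below pairs each stage with a task-appropriate reward and reuses the same optimization loop over progressively harder data. The file paths, reward heuristics, and the `run_grpo_stage` placeholder are hypothetical; they show the shape of the curriculum rather than the actual training code.

```python
import re

def reward_judgment(sample: dict, completion: str) -> float:
    # Stage 1: judgment / classification — exact match on the reference label.
    return 1.0 if completion.strip().lower() == sample["answer"].lower() else 0.0

def reward_choice(sample: dict, completion: str) -> float:
    # Stage 2: multiple-choice / matching — credit the selected option letter.
    m = re.search(r"\b([A-D])\b", completion)
    return 1.0 if m and m.group(1) == sample["answer"] else 0.0

def reward_open_ended(sample: dict, completion: str) -> float:
    # Stage 3: open-ended generation — partial credit for a correct final answer
    # plus a small bonus for an explicit multi-step derivation.
    correct = sample["answer"].lower() in completion.lower()
    multi_step = completion.count("\n") >= 2
    return (0.8 if correct else 0.0) + (0.2 if multi_step else 0.0)

CURRICULUM = [
    ("judgment",   "data/stage1_judgment.jsonl",   reward_judgment),
    ("choice",     "data/stage2_choice.jsonl",     reward_choice),
    ("open_ended", "data/stage3_open_ended.jsonl", reward_open_ended),
]

def run_grpo_stage(policy, dataset_path: str, reward_fn) -> None:
    """Placeholder for one GRPO stage: sample a group of completions per prompt,
    score them with reward_fn, and update the policy on group-relative advantages."""
    ...

def train(policy) -> None:
    # Shallow-to-deep progression: each stage starts from the previous stage's policy.
    for name, path, reward_fn in CURRICULUM:
        print(f"GRPO stage: {name}")
        run_grpo_stage(policy, path, reward_fn)
```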

Evaluation:


| Benchmark | Qwen2.5-VL-3B | TianJiangZhuGe-3B |
|---|---|---|
| POPE | 0.7676 | 0.8 |
| ai2d | 0.6343 | 0.6833 |
| vizwiz_val | 0.6099 | 0.6062 |
| MathVision | 22.86 | 22.14 |
| OCRBench | 68.1 | 71.4 |
| MathVista | 40.8 | 44.4 |

Using Transformers to Chat:

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor
from qwen_vl_utils import process_vision_info

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model_path = '/nfs4/models/Tianjiangzhuge'
model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True, torch_dtype=torch.float16).to(device)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

# Single-image chat message
messages = [{"role": "user", "content": [{"type": "image", "image": "file:///path/to/image1.jpg"}, {"type": "text", "text": "Describe this image."}]}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt").to(device)

generated_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```

Multi-image inference:

```python
# Messages containing multiple images and a text query
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "Describe the difference between these images."},
        ],
    }
]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt").to(device)

generated_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```

License: MIT
Languages: English (en), Chinese (zh)
Base model: Qwen/Qwen2.5-VL-3B-Instruct
