---
base_model:
- OpenGVLab/InternVL3-38B
language:
- en
library_name: transformers
license: mit
pipeline_tag: image-text-to-text
tags:
- Skywork R1V
---
Skywork-R1V3
======================================

[📖 R1V3 Report](https://huggingface.co/papers/2507.06167) | [💻 GitHub](https://github.com/SkyworkAI/Skywork-R1V)


## 1. Model Introduction

Skywork-R1V3-38B is the latest and most powerful open-source multimodal reasoning model in the Skywork-R1V series. Built on InternVL3-38B, it significantly pushes the boundaries of multimodal and cross-disciplinary intelligence. **Mainly through reinforcement learning (RL) in post-training**, R1V3 delivers enhanced reasoning ability and achieves open-source state-of-the-art (SOTA) performance across numerous multimodal reasoning benchmarks.

## 2. Technical Highlights

Skywork-R1V3 is an advanced, open-source Vision-Language Model (VLM) built on several core innovations:

- **Refined Post-Training RL**: Instead of relying on reasoning pre-training, our fine-grained cold-start finetuning effectively primes the model for reinforcement learning (RL), which dramatically enhances its reasoning ability.
- **Essential Connector Module**: We uncovered the critical role of the connector module in achieving robust cross-modal alignment for strong multimodal reasoning. Moreover, connector-only finetuning can further boost the model's performance after RL.
- **Entropy of Critical Reasoning Tokens**: This indicator effectively gauges reasoning capability and guides checkpoint selection during RL training.

These innovations lead to broad reasoning generalization, allowing our RL-based approach to extend mathematical reasoning to diverse subject areas. Additionally, our work explores RL-specific topics such as curriculum learning and learning-rate strategies, alongside a broader discussion of multimodal reasoning. For more details, refer to our [📖 R1V3 Report](https://huggingface.co/papers/2507.06167).

## 3. Evaluation

### 🌟 Key Results

- **MMMU:** 76.0
- **EMMA-Mini (CoT):** 40.3
- **MMK12:** 78.5
- **Physics Reasoning:** PhyX-MC-TM (52.8), SeePhys (31.5)
- **Logic Reasoning:** MME-Reasoning (42.8), VisuLogic (28.5)
- **Math Benchmarks:** MathVista (77.1), MathVerse (59.6), MathVision (52.6)

### Vision-Language Models Benchmark Comparison

| Category | Benchmark | Metric | Skywork-38B | QVQ-72B | InternVL-78B | QwenVL-72B | Claude 3.7 | GPT-4o |
|----------------|-------------------------|--------|------------:|--------:|-------------:|-----------:|-----------:|-------:|
| **General** | MMMU (val) | Acc. | 🏆 **76.0** | 70.3 | 72.2 | 70.3 | 75.0 | 70.7 |
| | EMMA (mini-cot) | Acc. | 40.3 | 32.0 | 38.3 | 39.3 | **56.5** | 36.0 |
| | MMMU-pro | Acc. | 🏆 **55.4** | 46.9* | 48.6 | 51.1 | 50.0 | 54.5 |
| | MMK12 | Acc. | 🏆 **78.5** | 62.7* | 67.4* | 70.5* | 55.3 | 49.9 |
| | MMstar | Acc. | 70.6 | 60.8 | **72.5** | 70.8 | 68.8 | 65.1 |
| | MMBench-en-1.1 | Acc. | 85.7 | 72.6* | 87.7 | **88.0** | 82.0 | 84.3 |
| | HallusionBench | Acc. | 🏆 **61.3** | 55.3* | 59.1 | 55.2 | 58.3 | 56.2 |
| **Mathematics** | MathVista (mini) | Acc. | 🏆 **77.1** | 71.4 | 72.2 | 74.8 | 66.8 | 62.9 |
| | MathVerse (vision-only) | Acc. | 🏆 **59.6** | 45.1 | 51.0 | 57.6 | 49.9* | 49.9 |
| | MathVision | Acc. | 52.6 | 35.9 | 43.1 | 38.1 | 58.6 | 31.2 |
| | WeMath (strict) | Acc. | 🏆 **56.5** | 37.7 | 46.1 | 50.6 | 48.9* | 50.6 |
| **Logic** | VisuLogic | Acc. | 🏆 **28.5** | 23.5* | 27.7 | 26.2 | 25.9 | 26.3 |
| | LogicVista | Acc. | 59.7 | 53.8 | 55.9 | 57.1 | 60.6* | **64.4** |
| | MME-Reasoning | Acc. | 🏆 **42.8** | 35.2 | 32.1 | 34.1 | 34.1 | 30.2 |
| **Physics** | PhyX (mc-text-minimal) | Acc. | 🏆 **52.8** | 35.2* | 40.5 | 44.8 | 41.6 | 43.8 |
| | SeePhys | Acc. | 31.5 | 22.5 | 19.0* | 24.2 | **34.6** | 21.9 |

🏆 marks benchmarks where Skywork-R1V3 is the top performer.
[*] indicates results from our evaluation framework.

## 4. Usage

For the detailed inference code and evaluation scripts, please refer to our [GitHub](https://github.com/SkyworkAI/Skywork-R1V).

### Run the Inference Script

**Hugging Face inference**

```python
import argparse

import torch
from transformers import AutoModel, AutoTokenizer

from utils import load_image, split_model  # load_image / split_model helpers (see the Skywork-R1V GitHub repo)


def main():
    parser = argparse.ArgumentParser(description="Run inference with the Skywork-R1V model.")
    parser.add_argument('--model_path', type=str, default='Skywork/Skywork-R1V3-38B', help="Path to the model.")
    parser.add_argument('--image_paths', type=str, nargs='+', required=True, help="Path(s) to the image(s).")
    parser.add_argument('--question', type=str, required=True, help="Question to ask the model.")
    args = parser.parse_args()

    # Split the model across available GPUs and load it in bfloat16.
    device_map = split_model(args.model_path)
    model = AutoModel.from_pretrained(
        args.model_path,
        torch_dtype=torch.bfloat16,
        load_in_8bit=False,
        low_cpu_mem_usage=True,
        use_flash_attn=True,
        trust_remote_code=True,
        device_map=device_map
    ).eval()
    tokenizer = AutoTokenizer.from_pretrained(args.model_path, trust_remote_code=True, use_fast=False)

    # Preprocess each image into tiled pixel values (at most 12 tiles per image).
    pixel_values = [load_image(img_path, max_num=12).to(torch.bfloat16).cuda() for img_path in args.image_paths]
    if len(pixel_values) > 1:
        num_patches_list = [img.size(0) for img in pixel_values]
        pixel_values = torch.cat(pixel_values, dim=0)
    else:
        pixel_values = pixel_values[0]
        num_patches_list = None

    # One <image> placeholder per input image, followed by the question.
    prompt = "<image>\n" * len(args.image_paths) + args.question
    generation_config = dict(max_new_tokens=64000, do_sample=True, temperature=0.6, top_p=0.95, repetition_penalty=1.05)
    response = model.chat(tokenizer, pixel_values, prompt, generation_config, num_patches_list=num_patches_list)
    print(f'User: {args.question}\nAssistant: {response}')


if __name__ == '__main__':
    main()
```

**vLLM inference**

```shell
python -m vllm.entrypoints.openai.api_server \
    --model $MODEL_PATH \
    --max_model_len 32768 \
    --limit-mm-per-prompt "image=20" \
    --tensor-parallel-size $N_GPU \
    --dtype auto \
    --trust-remote-code
```
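Once the server is up, it exposes an OpenAI-compatible API (by default at `http://localhost:8000/v1`). The snippet below is a minimal client sketch rather than part of the official repo: it assumes the `openai` Python package is installed, the endpoint, image URL, and `max_tokens` value are placeholders, and the `model` field must match the path passed to `--model`. Sampling parameters follow the generation config used in the script above (temperature 0.6, top-p 0.95).

```python
from openai import OpenAI

# Placeholder endpoint and key: adjust host/port to match your vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Skywork/Skywork-R1V3-38B",  # must match the --model value used to launch the server
    messages=[
        {
            "role": "user",
            "content": [
                # Placeholder image URL; local files can be sent as base64 data URLs instead.
                {"type": "image_url", "image_url": {"url": "https://example.com/sample.jpg"}},
                {"type": "text", "text": "Describe the physical process shown in this figure."},
            ],
        }
    ],
    temperature=0.6,
    top_p=0.95,
    max_tokens=8192,  # placeholder budget; long reasoning chains may need more
)
print(response.choices[0].message.content)
```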
---

## 5. Citation

If you use Skywork-R1V in your research, please cite:

```
@misc{shen2025skyworkr1v3technicalreport,
      title={Skywork-R1V3 Technical Report},
      author={Wei Shen and Jiangbo Pei and Yi Peng and Xuchen Song and Yang Liu and Jian Peng and Haofeng Sun and Yunzhuo Hao and Peiyu Wang and Jianhao Zhang and Yahui Zhou},
      year={2025},
      eprint={2507.06167},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2507.06167},
}
```

## 6. License

This project is released under the MIT License.

This project uses [InternVL3-38B](https://huggingface.co/OpenGVLab/InternVL3-38B) as its base model, which is also licensed under the MIT License.