nielsr (HF Staff) committed · verified
Commit a47a1ba · 1 Parent(s): 0410d2c

Improve model card for Extract+Think: Add metadata, links, description, evaluation, and citation


This PR significantly enhances the model card for the `Extract-0.6B` model, which is part of the **Extract+Think** framework.

Key improvements include:
* Adding essential metadata tags: `pipeline_tag: image-text-to-text`, `library_name: transformers`, and `license: cc-by-nc-4.0`. These tags improve discoverability on the Hugging Face Hub and enable the automated "How to use with Transformers" widget.
* Including direct links to the associated paper, project page, and GitHub repository for comprehensive information access.
* Expanding the model description based on the paper's abstract, clarifying the model's role as a perception module for visual extraction.
* Integrating the detailed evaluation table from the GitHub README to provide immediate performance context.
* Adding the BibTeX citation and acknowledgments for proper attribution.
* Removing the irrelevant `# File information` section and its contents from the README.

Please review and merge this PR.

Files changed (1):
  1. README.md +71 -3
README.md CHANGED
@@ -1,9 +1,77 @@
  ---
- {}
  ---
 
- # Extract+Think Model Card
 
  ## Model details
 
- Extract-0.6B is used as the perception module for the two-stage Extract+Think framework.
  ---
+ license: cc-by-nc-4.0
+ library_name: transformers
+ pipeline_tag: image-text-to-text
  ---
 
+ # Extract+Think Model Card for markendo/llava-extract-qwen3-0.6B
+
+ This repository hosts the **Extract-0.6B** model, which serves as the perception module for the two-stage **Extract+Think** framework. The model was presented in the paper [Downscaling Intelligence: Exploring Perception and Reasoning Bottlenecks in Small Multimodal Models](https://huggingface.co/papers/2511.17487).
+
+ Extract+Think is an approach designed to address perception and reasoning bottlenecks in small multimodal models. It centers on visual extraction tuning, which explicitly trains the model to consistently extract instruction-relevant visual details across tasks; the extracted details are then passed to a separate reasoning stage.
+
+ * 📖 **Paper:** [Downscaling Intelligence: Exploring Perception and Reasoning Bottlenecks in Small Multimodal Models](https://huggingface.co/papers/2511.17487)
+ * 🌐 **Project Page:** https://web.stanford.edu/~markendo/projects/downscaling_intelligence
+ * 💻 **Code:** https://github.com/markendo/downscaling_intelligence
 
  ## Model details
 
+ Extract-0.6B is used as the perception module in the two-stage Extract+Think framework. For the reasoning stage, the authors primarily use Qwen3 models ([1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) and [4B](https://huggingface.co/Qwen/Qwen3-4B)). By decoupling visual perception from linguistic reasoning, this setup aims to improve both efficiency and performance in multimodal understanding (see the sketch after the usage example below for the perception-to-reasoning hand-off).
+
+ ## Usage
+
+ To use this model, particularly for evaluation, the authors rely on the `lmms-eval` framework. Setup and evaluation instructions are detailed in the [GitHub repository](https://github.com/markendo/downscaling_intelligence): clone the repository, install the dependencies, and integrate the custom evaluation files with `lmms-eval`.
+
+ For generating extracted visual information, the following command is provided (example with `markendo/llava-extract-qwen3-1.7B`):
+ ```bash
+ # Stage 1 (perception): extract instruction-relevant visual details for MMStar
+ cd lmms-eval
+ model_name=markendo/llava-extract-qwen3-1.7B
+ python -m lmms_eval \
+     --model=llava_onevision \
+     --model_args=pretrained=$model_name,conv_template=qwen_1_5,device_map=auto \
+     --tasks=mmstar_prism_stage_1 \
+     --batch_size=1 \
+     --output_path results \
+     --log_samples
+ ```
+ Please refer to the [GitHub repository](https://github.com/markendo/downscaling_intelligence) for full setup instructions, including the second stage of reasoning.
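+
+ As an illustration of how the two stages connect, below is a minimal, hypothetical sketch of the reasoning stage. It assumes the stage-1 command above has already produced the extracted visual details as plain text, and it uses a standard `transformers` chat call with the paired [Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) reasoner; the prompt wording is illustrative and is not taken from the paper or repository.
+ ```python
+ # Hypothetical sketch of stage 2 (reasoning). `extracted_details` stands in for
+ # the stage-1 output of the Extract perception model; the prompt format below is
+ # an assumption, not the one used in the paper's evaluation pipeline.
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ reasoner_id = "Qwen/Qwen3-1.7B"  # reasoning model paired with Extract-0.6B
+ tokenizer = AutoTokenizer.from_pretrained(reasoner_id)
+ model = AutoModelForCausalLM.from_pretrained(reasoner_id, torch_dtype="auto", device_map="auto")
+
+ # Placeholder stage-1 output: instruction-relevant visual details extracted from the image.
+ extracted_details = "The chart shows three bars labeled 2021, 2022, and 2023; the 2023 bar is the tallest."
+ question = "Which year had the highest value? Answer with the year only."
+
+ messages = [{
+     "role": "user",
+     "content": f"Visual details extracted from the image:\n{extracted_details}\n\nQuestion: {question}",
+ }]
+ input_ids = tokenizer.apply_chat_template(
+     messages, add_generation_prompt=True, return_tensors="pt"
+ ).to(model.device)
+
+ output = model.generate(input_ids, max_new_tokens=512)
+ print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
+ ```
+ The actual second-stage prompts and evaluation flow are defined in the repository's `lmms-eval` integration; use those for reproducing the reported numbers.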
+
+ ## Evaluation
+
+ The Extract+Think approach is evaluated with `lmms-eval` and shows competitive performance on multimodal benchmarks. Below is a summary of results, reporting in-domain and MMStar averages.
+
+ | Model | LLM Size | # Vis. Data | In-Domain Avg. | MMStar Avg. |
+ |---|---|---|---|---|
+ | **End-to-End** | | | | |
+ | LLaVA-OneVision | 0.5B | 8.8M | 71.1 | 39.0 |
+ | InternVL2.5 | 0.5B | 64M | 83.2 | 48.2 |
+ | SmolVLM | 1.7B | unk. | 75.9 | 41.3 |
+ | Our Baseline | 0.6B | 1.0M | 65.9 | 37.2 |
+ | Our Baseline | 1.7B | 1.0M | 76.8 | 40.9 |
+ | **Decoupled Models** | P / R | | | |
+ | PrismCaptioner | 1.8B / 70B | 1.9M | 75.4 | 41.9 |
+ | PrismCaptioner | 7.0B / 70B | 1.9M | 78.3 | 45.7 |
+ | Our Baseline | 0.6B / 4.0B | 1.0M | 64.6 | 34.0 |
+ | Our Baseline | 1.7B / 4.0B | 1.0M | 69.4 | 39.4 |
+ | <span style="font-variant: small-caps;">Caption+Think</span> | 0.6B / 1.7B | 2.0M | 75.0 | 43.0 |
+ | <span style="font-variant: small-caps;">Caption+Think</span> | 1.7B / 4.0B | 2.0M | 80.0 | 49.0 |
+ | [<span style="font-variant: small-caps;">Extract+Think</span><sup>†</sup>](https://huggingface.co/markendo/llava-extract-from-scratch-qwen3-0.6B) | 0.6B / 1.7B | 0.4M | 78.0 | 42.6 |
+ | [<span style="font-variant: small-caps;">Extract+Think</span><sup>†</sup>](https://huggingface.co/markendo/llava-extract-from-scratch-qwen3-1.7B) | 1.7B / 4.0B | 0.4M | 82.7 | 48.1 |
+ | [<span style="font-variant: small-caps;">Extract+Think</span>](https://huggingface.co/markendo/llava-extract-qwen3-0.6B) | 0.6B / 1.7B | 2.4M | 80.3 | 46.6 |
+ | [<span style="font-variant: small-caps;">Extract+Think</span>](https://huggingface.co/markendo/llava-extract-qwen3-1.7B) | 1.7B / 4.0B | 2.4M | 85.3 | 52.6 |
+
+ *P / R denotes the sizes of the perception and reasoning models; rows marked <sup>†</sup> link to the extraction models trained from scratch. For the full table, please refer to our paper.*
+
+ ## Acknowledgments
+
+ This repository is built on top of [LLaVA-OneVision](https://github.com/LLaVA-VL/LLaVA-NeXT) and [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval).
+
+ ## Citation
+ ```bibtex
+ @article{endo2025downscalingintelligence,
+   author  = {Endo, Mark and Yeung-Levy, Serena},
+   title   = {Downscaling Intelligence: Exploring Perception and Reasoning Bottlenecks in Small Multimodal Models},
+   journal = {arXiv preprint},
+   year    = {2025},
+ }
+ ```