Improve model card for Extract+Think: Add metadata, links, description, evaluation, and citation
This PR enhances the model card for the `Extract-0.6B` model, which is part of the **Extract+Think** framework.
Key improvements include:
* Adding essential metadata tags: `pipeline_tag: image-text-to-text`, `library_name: transformers`, and `license: cc-by-nc-4.0`. These tags improve discoverability on the Hugging Face Hub and enable the automated "How to use with Transformers" widget.
* Including direct links to the associated paper, project page, and GitHub repository for comprehensive information access.
* Expanding the model description based on the paper's abstract, clarifying the model's role as a perception module for visual extraction.
* Integrating the detailed evaluation table from the GitHub README to provide immediate performance context.
* Adding the BibTeX citation and acknowledgments for proper attribution.
* Removing the irrelevant `# File information` section and its content that was previously present in the README.
Please review and merge this PR.

Below is the updated `README.md`:

---
license: cc-by-nc-4.0
library_name: transformers
pipeline_tag: image-text-to-text
---

# Extract+Think Model Card for markendo/llava-extract-qwen3-0.6B

This repository hosts the **Extract-0.6B** model, which serves as the perception module of the two-stage **Extract+Think** framework presented in the paper [Downscaling Intelligence: Exploring Perception and Reasoning Bottlenecks in Small Multimodal Models](https://huggingface.co/papers/2511.17487).

Extract+Think addresses perception and reasoning bottlenecks in small multimodal models. It centers on visual extraction tuning: the model is explicitly trained to consistently extract instruction-relevant visual details across tasks, and those extracted details are then passed to a separate reasoning stage.

* 📖 **Paper:** [Downscaling Intelligence: Exploring Perception and Reasoning Bottlenecks in Small Multimodal Models](https://huggingface.co/papers/2511.17487)
* 🌐 **Project Page:** https://web.stanford.edu/~markendo/projects/downscaling_intelligence
* 💻 **Code:** https://github.com/markendo/downscaling_intelligence

## Model details

Extract-0.6B serves as the perception module in the two-stage Extract+Think framework. For the reasoning stage, the authors primarily use Qwen3 models ([1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) and [4B](https://huggingface.co/Qwen/Qwen3-4B)). By decoupling visual perception from linguistic reasoning, the framework aims to improve both efficiency and performance in multimodal understanding.

## Usage

To use this model, particularly for evaluation, the authors rely on the `lmms-eval` framework. Setup and evaluation instructions are detailed in the [GitHub repository](https://github.com/markendo/downscaling_intelligence); the workflow involves cloning the repository, installing dependencies, and integrating the repository's custom evaluation files with `lmms-eval`.
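
As a rough sketch of those steps (the clone URLs come from this card; the directory layout, install command, and file-copy step are assumptions, so follow the GitHub README for the authoritative instructions):

```bash
# Sketch only; exact steps and paths are documented in the GitHub README.
git clone https://github.com/markendo/downscaling_intelligence
cd downscaling_intelligence

# lmms-eval is the evaluation harness the repository's custom files plug into.
git clone https://github.com/EvolvingLMMs-Lab/lmms-eval
pip install -e lmms-eval  # assumed editable install; check the repo for pinned dependencies

# Copy the repository's custom evaluation files into lmms-eval
# (source and destination paths are listed in the GitHub README).
```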

To generate the extracted visual information (stage 1), the following command is provided (shown with `markendo/llava-extract-qwen3-1.7B`):
```bash
cd lmms-eval
model_name=markendo/llava-extract-qwen3-1.7B
python -m lmms_eval \
    --model=llava_onevision \
    --model_args=pretrained=$model_name,conv_template=qwen_1_5,device_map=auto \
    --tasks=mmstar_prism_stage_1 \
    --batch_size=1 \
    --output_path results \
    --log_samples
```

Please refer to the [GitHub repository](https://github.com/markendo/downscaling_intelligence) for full setup instructions, including the second stage of reasoning.
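
The command above uses the 1.7B extractor. For the 0.6B checkpoint hosted in this repository, the same stage-1 invocation should work with only the model name swapped; keeping `conv_template=qwen_1_5` and the remaining arguments unchanged is an assumption, so verify against the GitHub README.

```bash
cd lmms-eval
model_name=markendo/llava-extract-qwen3-0.6B  # this repository's checkpoint
python -m lmms_eval \
    --model=llava_onevision \
    --model_args=pretrained=$model_name,conv_template=qwen_1_5,device_map=auto \
    --tasks=mmstar_prism_stage_1 \
    --batch_size=1 \
    --output_path results \
    --log_samples
```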

## Evaluation

The Extract+Think approach is evaluated with `lmms-eval` and shows competitive performance across multimodal benchmarks. A summary of the results is given below.

| Model | LLM Size | # Vis. Data | In-Domain Avg. | MMStar Avg. |
|---|---|---|---|---|
| **End-to-End** | | | | |
| LLaVA-OneVision | 0.5B | 8.8M | 71.1 | 39.0 |
| InternVL2.5 | 0.5B | 64M | 83.2 | 48.2 |
| SmolVLM | 1.7B | unk. | 75.9 | 41.3 |
| Our Baseline | 0.6B | 1.0M | 65.9 | 37.2 |
| Our Baseline | 1.7B | 1.0M | 76.8 | 40.9 |
| **Decoupled Models** | P / R | | | |
| PrismCaptioner | 1.8B / 70B | 1.9M | 75.4 | 41.9 |
| PrismCaptioner | 7.0B / 70B | 1.9M | 78.3 | 45.7 |
| Our Baseline | 0.6B / 4.0B | 1.0M | 64.6 | 34.0 |
| Our Baseline | 1.7B / 4.0B | 1.0M | 69.4 | 39.4 |
| <span style="font-variant: small-caps;">Caption+Think</span> | 0.6B / 1.7B | 2.0M | 75.0 | 43.0 |
| <span style="font-variant: small-caps;">Caption+Think</span> | 1.7B / 4.0B | 2.0M | 80.0 | 49.0 |
| [<span style="font-variant: small-caps;">Extract+Think</span><sup>†</sup>](https://huggingface.co/markendo/llava-extract-from-scratch-qwen3-0.6B) | 0.6B / 1.7B | 0.4M | 78.0 | 42.6 |
| [<span style="font-variant: small-caps;">Extract+Think</span><sup>†</sup>](https://huggingface.co/markendo/llava-extract-from-scratch-qwen3-1.7B) | 1.7B / 4.0B | 0.4M | 82.7 | 48.1 |
| [<span style="font-variant: small-caps;">Extract+Think</span>](https://huggingface.co/markendo/llava-extract-qwen3-0.6B) | 0.6B / 1.7B | 2.4M | 80.3 | 46.6 |
| [<span style="font-variant: small-caps;">Extract+Think</span>](https://huggingface.co/markendo/llava-extract-qwen3-1.7B) | 1.7B / 4.0B | 2.4M | 85.3 | 52.6 |

*For decoupled models, the LLM size column lists perception / reasoning (P / R) model sizes. For the full table and details, please refer to our paper.*

## Acknowledgments

This repository is built on top of [LLaVA-OneVision](https://github.com/LLaVA-VL/LLaVA-NeXT) and [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval).

## Citation

```bib
@article{endo2025downscalingintelligence,
  author  = {Endo, Mark and Yeung-Levy, Serena},
  title   = {Downscaling Intelligence: Exploring Perception and Reasoning Bottlenecks in Small Multimodal Models},
  journal = {arXiv preprint},
  year    = {2025},
}
```
|