Update README.md
README.md
CHANGED
---
library_name: transformers
pipeline_tag: image-text-to-text
tags:
- multimodal
- vision-language-model
- small-language-model
base_model:
- google/siglip-so400m-patch14-384
- Qwen/Qwen3-1.7B
---

# Extract+Think Model Card for markendo/llava-extract-from-scratch-qwen3-1.7B

This repository hosts the **Extract-1.7B<sup>†</sup>** model, the perception module of the two-stage **Extract+Think<sup>†</sup>** framework. The model was presented in the paper [Downscaling Intelligence: Exploring Perception and Reasoning Bottlenecks in Small Multimodal Models](https://huggingface.co/papers/2511.17487).

Extract+Think addresses perception and reasoning bottlenecks in small multimodal models. It centers on visual extraction tuning, which explicitly trains the model to consistently extract instruction-relevant visual details across tasks; the extracted information then feeds into a separate reasoning stage.

In this variant, we train from scratch under the visual extraction tuning paradigm, without prior visual instruction tuning or captioning.

* 📖 **Paper:** [Downscaling Intelligence: Exploring Perception and Reasoning Bottlenecks in Small Multimodal Models](https://huggingface.co/papers/2511.17487)
* 🌐 **Project Page:** https://web.stanford.edu/~markendo/projects/downscaling_intelligence
* 💻 **Code:** https://github.com/markendo/downscaling_intelligence

<p align="center">
  <img src="https://github.com/markendo/downscaling_intelligence/raw/main/assets/downscaling_intelligence.png" width="500" height="auto">
</p>

## Model details

Extract-1.7B<sup>†</sup> is used as the perception module for the two-stage Extract+Think<sup>†</sup> framework. For the reasoning stage, the authors primarily use Qwen3 models ([1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) and [4B](https://huggingface.co/Qwen/Qwen3-4B)).
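
For quick, standalone inspection of what the model extracts, the checkpoint can also be loaded directly through the LLaVA-OneVision codebase. The sketch below is illustrative only: it assumes the standard LLaVA-OneVision loading path and the `qwen_1_5` conversation template carry over to this checkpoint, and the model-name string, image path, and question are placeholders rather than the exact prompts used in the paper.

```python
# Minimal sketch of stage-1 visual extraction with the LLaVA-OneVision codebase.
# Assumes the standard loading conventions apply to this checkpoint; the
# model-name string, image path, and question are placeholders.
import copy

import torch
from PIL import Image

from llava.constants import DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX
from llava.conversation import conv_templates
from llava.mm_utils import process_images, tokenizer_image_token
from llava.model.builder import load_pretrained_model

pretrained = "markendo/llava-extract-from-scratch-qwen3-1.7B"
tokenizer, model, image_processor, _ = load_pretrained_model(
    pretrained, None, "llava_qwen", device_map="auto"  # model-name string is an assumption
)
model.eval()

image = Image.open("example.jpg")  # placeholder image
image_tensor = process_images([image], image_processor, model.config)
image_tensor = [t.to(dtype=torch.float16, device="cuda") for t in image_tensor]

# Build a qwen_1_5-style prompt asking for instruction-relevant visual details.
question = DEFAULT_IMAGE_TOKEN + "\nDescribe the visual details relevant to: What is the person holding?"
conv = copy.deepcopy(conv_templates["qwen_1_5"])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

input_ids = tokenizer_image_token(
    prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt"
).unsqueeze(0).to("cuda")

with torch.no_grad():
    output = model.generate(
        input_ids,
        images=image_tensor,
        image_sizes=[image.size],
        do_sample=False,
        max_new_tokens=512,
    )
extracted = tokenizer.batch_decode(output, skip_special_tokens=True)[0]
print(extracted)
```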

## Usage

To use this model, particularly for evaluation, the authors rely on the `lmms-eval` framework. Setup and evaluation instructions are detailed in the [GitHub repository](https://github.com/markendo/downscaling_intelligence): clone the repository, install the dependencies, and integrate the provided custom evaluation files with `lmms-eval`.

For generating the extracted visual information (the first stage), the following command is provided:

```bash
cd lmms-eval
model_name=markendo/llava-extract-from-scratch-qwen3-1.7B
python -m lmms_eval \
    --model=llava_onevision \
    --model_args=pretrained=$model_name,conv_template=qwen_1_5,device_map=auto \
    --tasks=mmstar_prism_stage_1 \
    --batch_size=1 \
    --output_path results \
    --log_samples
```
Please refer to the [GitHub repository](https://github.com/markendo/downscaling_intelligence) for full setup instructions, including the second stage of reasoning.
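
The second stage hands the extracted visual information to a text-only reasoning model. The sketch below shows the shape of that hand-off, assuming a Qwen3 reasoning model loaded through `transformers`; the prompt wording, question, and extracted text are placeholders, not the exact templates used in the paper's evaluation pipeline.

```python
# Minimal sketch of the stage-2 hand-off: stage-1 extractions are passed, as
# plain text, to a Qwen3 reasoning model. Prompt wording is illustrative only.
from transformers import AutoModelForCausalLM, AutoTokenizer

reasoner_id = "Qwen/Qwen3-1.7B"  # or Qwen/Qwen3-4B
tokenizer = AutoTokenizer.from_pretrained(reasoner_id)
reasoner = AutoModelForCausalLM.from_pretrained(
    reasoner_id, torch_dtype="auto", device_map="auto"
)

question = "What is the person holding?"                # placeholder question
extracted = "The person is holding a red umbrella ..."  # placeholder stage-1 output

messages = [{
    "role": "user",
    "content": (
        "Visual details extracted from the image:\n"
        f"{extracted}\n\n"
        f"Question: {question}"
    ),
}]
input_ids = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
).to(reasoner.device)

# Qwen3 emits its chain of thought in a <think> block by default.
output = reasoner.generate(input_ids, max_new_tokens=1024)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```

In the released pipeline this hand-off is handled by the `lmms-eval` tasks described in the repository; the snippet is only meant to make the two-stage interface concrete.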

## Acknowledgments

This repository is built on top of [LLaVA-OneVision](https://github.com/LLaVA-VL/LLaVA-NeXT) and [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval).

## Citation

```bib