markendo committed
Commit b28e48e · verified · 1 Parent(s): 4b938cd

Update README.md

Files changed (1)
  1. README.md +41 -7
README.md CHANGED
@@ -1,21 +1,55 @@
  ---
- license: cc-by-nc-4.0
  library_name: transformers
  pipeline_tag: image-text-to-text
  ---

- # Extract+Think Model Card

- This repository contains the `Extract-from-scratch-1.7B` model, a key component of the **Extract+Think** framework, as presented in the paper [Downscaling Intelligence: Exploring Perception and Reasoning Bottlenecks in Small Multimodal Models](https://huggingface.co/papers/2511.17487).

- The **Extract+Think** approach is designed to analyze how reduced Large Language Model (LLM) capacity affects multimodal capabilities. It introduces visual extraction tuning, which explicitly trains the model to extract instruction-relevant visual details, followed by step-by-step reasoning to generate answers. This method aims to set a new standard for efficiency and performance in small multimodal models.

- Project Page: https://web.stanford.edu/~markendo/projects/downscaling_intelligence
- Code: https://github.com/markendo/downscaling_intelligence

  ## Model details

- Extract-from-scratch-1.7B is used as the perception module for the two-stage Extract+Think<sup>†</sup> framework. This setup trains from scratch under the visual extraction tuning paradigm (after connector pre-training).

  ## Citation
  ```bib
 
  ---
  library_name: transformers
  pipeline_tag: image-text-to-text
+ tags:
+ - multimodal
+ - vision-language-model
+ - small-language-model
+ base_model:
+ - google/siglip-so400m-patch14-384
+ - Qwen/Qwen3-1.7B
  ---

+ # Extract+Think Model Card for markendo/llava-extract-from-scratch-qwen3-1.7B

+ This repository hosts the **Extract-1.7B<sup>†</sup>** model, which serves as the perception module for the two-stage **Extract+Think<sup>†</sup>** framework. The model was introduced in the paper [Downscaling Intelligence: Exploring Perception and Reasoning Bottlenecks in Small Multimodal Models](https://huggingface.co/papers/2511.17487).

+ Extract+Think is an approach designed to address perception and reasoning bottlenecks in small multimodal models. It centers on visual extraction tuning, which explicitly trains the model to extract instruction-relevant visual details consistently across tasks; these details then feed into a separate reasoning stage.
+ In this variant, we train from scratch under the visual extraction tuning paradigm, without prior visual instruction tuning or captioning.

+ * 📖 **Paper:** [Downscaling Intelligence: Exploring Perception and Reasoning Bottlenecks in Small Multimodal Models](https://huggingface.co/papers/2511.17487)
+ * 🌐 **Project Page:** https://web.stanford.edu/~markendo/projects/downscaling_intelligence
+ * 💻 **Code:** https://github.com/markendo/downscaling_intelligence
+
+ <p align="center">
+ <img src="https://github.com/markendo/downscaling_intelligence/raw/main/assets/downscaling_intelligence.png" width="500" height="auto">
+ </p>

  ## Model details

+ Extract-1.7B<sup>†</sup> is used as the perception module for the two-stage Extract+Think<sup>†</sup> framework. For the reasoning stage, the authors primarily use Qwen3 models ([1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) and [4B](https://huggingface.co/Qwen/Qwen3-4B)).
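+
+ As an illustrative sketch (not code from the paper or the repository), the hand-off between the two stages can look like the following: the extraction produced by Extract-1.7B<sup>†</sup> is inserted into a text prompt for the Qwen3 reasoner, which then reasons step by step toward an answer. The prompt wording and the `extracted_details` / `question` strings below are assumptions made for illustration, not the framework's exact template.
+
+ ```python
+ # Hedged stage-2 sketch: pass stage-1 visual extractions to a Qwen3 reasoner via transformers.
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ reasoner_id = "Qwen/Qwen3-1.7B"  # the authors also use Qwen3-4B for the reasoning stage
+ tokenizer = AutoTokenizer.from_pretrained(reasoner_id)
+ model = AutoModelForCausalLM.from_pretrained(reasoner_id, torch_dtype="auto", device_map="auto")
+
+ # Placeholder for the output of the perception stage (Extract-1.7B).
+ extracted_details = "A red octagonal stop sign partially covered by snow stands next to a parked blue car."
+ question = "What color is the sign in the image?"
+
+ messages = [{
+     "role": "user",
+     "content": f"Visual details: {extracted_details}\n\nQuestion: {question}\n"
+                "Reason step by step, then state the final answer.",
+ }]
+ input_ids = tokenizer.apply_chat_template(
+     messages, add_generation_prompt=True, tokenize=True, return_tensors="pt"
+ ).to(model.device)
+
+ output_ids = model.generate(input_ids, max_new_tokens=512)
+ print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
+ ```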
+
+ ## Usage
+
+ The authors use the `lmms-eval` framework to run this model, particularly for evaluation. Setup and evaluation instructions are detailed in the [GitHub repository](https://github.com/markendo/downscaling_intelligence): clone the repository, install the dependencies, and integrate the custom evaluation files with `lmms-eval`.
+
+ To generate extracted visual information (the perception stage), the following command is provided:
+ ```bash
+ cd lmms-eval
+ model_name=markendo/llava-extract-from-scratch-qwen3-1.7B
+ python -m lmms_eval \
+     --model=llava_onevision \
+     --model_args=pretrained=$model_name,conv_template=qwen_1_5,device_map=auto \
+     --tasks=mmstar_prism_stage_1 \
+     --batch_size=1 \
+     --output_path results \
+     --log_samples
+ ```
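+
+ With `--log_samples`, `lmms-eval` writes the per-sample model outputs (here, the extracted visual information) under the `--output_path` directory. Below is a minimal, hedged sketch for inspecting those logs before running the reasoning stage; the file pattern (`*samples*.json*`) and the record keys (`filtered_resps`, `resps`) are assumptions about the log format, so check the actual files under `results/`.
+
+ ```python
+ # Hedged sketch: print the stage-1 extractions logged by lmms-eval.
+ # File layout and key names vary across lmms-eval versions; adjust to what you find in results/.
+ import glob
+ import json
+
+ def iter_records(path):
+     """Yield sample records from either a JSON or JSONL log file."""
+     with open(path) as f:
+         text = f.read().strip()
+     if not text:
+         return
+     try:
+         data = json.loads(text)  # plain JSON: a list of records or a single dict
+         yield from (data if isinstance(data, list) else [data])
+     except json.JSONDecodeError:
+         for line in text.splitlines():  # JSONL fallback: one record per line
+             if line.strip():
+                 yield json.loads(line)
+
+ for path in sorted(glob.glob("results/**/*samples*.json*", recursive=True)):
+     print(f"== {path} ==")
+     for record in iter_records(path):
+         if isinstance(record, dict):
+             # "filtered_resps"/"resps" are assumed to hold the model's extraction.
+             print(record.get("filtered_resps") or record.get("resps") or record)
+ ```
+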
+ Please refer to the [GitHub repository](https://github.com/markendo/downscaling_intelligence) for full setup instructions, including the second stage of reasoning.
+
+ ## Acknowledgments
+
+ This repository is built on top of [LLaVA-OneVision](https://github.com/LLaVA-VL/LLaVA-NeXT) and [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval).

  ## Citation
  ```bib