markendo committed
Commit b28e48e · verified · 1 Parent(s): 4b938cd

Update README.md

Files changed (1)
  1. README.md +41 -7
README.md CHANGED
@@ -1,21 +1,55 @@
  ---
- license: cc-by-nc-4.0
  library_name: transformers
  pipeline_tag: image-text-to-text
  ---

- # Extract+Think Model Card

- This repository contains the `Extract-from-scratch-1.7B` model, a key component of the **Extract+Think** framework, as presented in the paper [Downscaling Intelligence: Exploring Perception and Reasoning Bottlenecks in Small Multimodal Models](https://huggingface.co/papers/2511.17487).

- The **Extract+Think** approach is designed to analyze how reduced Large Language Model (LLM) capacity affects multimodal capabilities. It introduces visual extraction tuning, which explicitly trains the model to extract instruction-relevant visual details, followed by step-by-step reasoning to generate answers. This method aims to set a new standard for efficiency and performance in small multimodal models.

- Project Page: https://web.stanford.edu/~markendo/projects/downscaling_intelligence
- Code: https://github.com/markendo/downscaling_intelligence

  ## Model details

- Extract-from-scratch-1.7B is used as the perception module for the two-stage Extract+Think<sup>†</sup> framework. This setup trains from scratch under the visual extraction tuning paradigm (after connector pre-training).

  ## Citation
  ```bib
 
  ---
  library_name: transformers
  pipeline_tag: image-text-to-text
+ tags:
+ - multimodal
+ - vision-language-model
+ - small-language-model
+ base_model:
+ - google/siglip-so400m-patch14-384
+ - Qwen/Qwen3-1.7B
  ---

+ # Extract+Think Model Card for markendo/llava-extract-from-scratch-qwen3-1.7B

+ This repository hosts the **Extract-1.7B<sup>†</sup>** model, which serves as the perception module for the two-stage **Extract+Think<sup>†</sup>** framework. The model was introduced in the paper [Downscaling Intelligence: Exploring Perception and Reasoning Bottlenecks in Small Multimodal Models](https://huggingface.co/papers/2511.17487).

+ Extract+Think is an approach designed to address perception and reasoning bottlenecks in small multimodal models. It centers on visual extraction tuning, which explicitly trains the model to extract instruction-relevant visual details consistently across tasks; these details then feed into a separate reasoning stage.
+ In this variant, we train from scratch under the visual extraction tuning paradigm, without prior visual instruction tuning or captioning.

+ * 📖 **Paper:** [Downscaling Intelligence: Exploring Perception and Reasoning Bottlenecks in Small Multimodal Models](https://huggingface.co/papers/2511.17487)
+ * 🌐 **Project Page:** https://web.stanford.edu/~markendo/projects/downscaling_intelligence
+ * 💻 **Code:** https://github.com/markendo/downscaling_intelligence
+
+ <p align="center">
+ <img src="https://github.com/markendo/downscaling_intelligence/raw/main/assets/downscaling_intelligence.png" width="500" height="auto">
+ </p>

  ## Model details

+ Extract-1.7B<sup>†</sup> is used as the perception module for the two-stage Extract+Think<sup>†</sup> framework. For the reasoning stage, the authors primarily use Qwen3 models ([1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) and [4B](https://huggingface.co/Qwen/Qwen3-4B)).
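+
+ As an illustrative sketch (not code from the paper or the repository), the hand-off between the two stages can look like the following: the extraction produced by Extract-1.7B<sup>†</sup> is inserted into a text prompt for the Qwen3 reasoner, which then reasons step by step toward an answer. The prompt wording and the `extracted_details` / `question` strings below are assumptions made for illustration, not the framework's exact template.
+
+ ```python
+ # Hedged stage-2 sketch: pass stage-1 visual extractions to a Qwen3 reasoner via transformers.
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ reasoner_id = "Qwen/Qwen3-1.7B"  # the authors also use Qwen3-4B for the reasoning stage
+ tokenizer = AutoTokenizer.from_pretrained(reasoner_id)
+ model = AutoModelForCausalLM.from_pretrained(reasoner_id, torch_dtype="auto", device_map="auto")
+
+ # Placeholder for the output of the perception stage (Extract-1.7B).
+ extracted_details = "A red octagonal stop sign partially covered by snow stands next to a parked blue car."
+ question = "What color is the sign in the image?"
+
+ messages = [{
+     "role": "user",
+     "content": f"Visual details: {extracted_details}\n\nQuestion: {question}\n"
+                "Reason step by step, then state the final answer.",
+ }]
+ input_ids = tokenizer.apply_chat_template(
+     messages, add_generation_prompt=True, tokenize=True, return_tensors="pt"
+ ).to(model.device)
+
+ output_ids = model.generate(input_ids, max_new_tokens=512)
+ print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
+ ```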
+
+ ## Usage
+
+ The authors use the `lmms-eval` framework to run this model, particularly for evaluation. Setup and evaluation instructions are detailed in the [GitHub repository](https://github.com/markendo/downscaling_intelligence): clone the repository, install the dependencies, and integrate the custom evaluation files with `lmms-eval`.
+
+ To generate extracted visual information (the perception stage), the following command is provided:
+ ```bash
+ cd lmms-eval
+ model_name=markendo/llava-extract-from-scratch-qwen3-1.7B
+ python -m lmms_eval \
+     --model=llava_onevision \
+     --model_args=pretrained=$model_name,conv_template=qwen_1_5,device_map=auto \
+     --tasks=mmstar_prism_stage_1 \
+     --batch_size=1 \
+     --output_path results \
+     --log_samples
+ ```
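+
+ With `--log_samples`, `lmms-eval` writes the per-sample model outputs (here, the extracted visual information) under the `--output_path` directory. Below is a minimal, hedged sketch for inspecting those logs before running the reasoning stage; the file pattern (`*samples*.json*`) and the record keys (`filtered_resps`, `resps`) are assumptions about the log format, so check the actual files under `results/`.
+
+ ```python
+ # Hedged sketch: print the stage-1 extractions logged by lmms-eval.
+ # File layout and key names vary across lmms-eval versions; adjust to what you find in results/.
+ import glob
+ import json
+
+ def iter_records(path):
+     """Yield sample records from either a JSON or JSONL log file."""
+     with open(path) as f:
+         text = f.read().strip()
+     if not text:
+         return
+     try:
+         data = json.loads(text)  # plain JSON: a list of records or a single dict
+         yield from (data if isinstance(data, list) else [data])
+     except json.JSONDecodeError:
+         for line in text.splitlines():  # JSONL fallback: one record per line
+             if line.strip():
+                 yield json.loads(line)
+
+ for path in sorted(glob.glob("results/**/*samples*.json*", recursive=True)):
+     print(f"== {path} ==")
+     for record in iter_records(path):
+         if isinstance(record, dict):
+             # "filtered_resps"/"resps" are assumed to hold the model's extraction.
+             print(record.get("filtered_resps") or record.get("resps") or record)
+ ```
+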
+ Please refer to the [GitHub repository](https://github.com/markendo/downscaling_intelligence) for full setup instructions, including the second stage of reasoning.
+
+ ## Acknowledgments
+
+ This repository is built on top of [LLaVA-OneVision](https://github.com/LLaVA-VL/LLaVA-NeXT) and [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval).

  ## Citation
  ```bib