nielsr (HF Staff) committed · verified
Commit a47a1ba · 1 Parent(s): 0410d2c

Improve model card for Extract+Think: Add metadata, links, description, evaluation, and citation


This PR significantly enhances the model card for the `Extract-0.6B` model, which is part of the **Extract+Think** framework.

Key improvements include:
* Adding essential metadata tags: `pipeline_tag: image-text-to-text`, `library_name: transformers`, and `license: cc-by-nc-4.0`. These tags improve discoverability on the Hugging Face Hub and enable the automated "How to use with Transformers" widget.
* Including direct links to the associated paper, project page, and GitHub repository for comprehensive information access.
* Expanding the model description based on the paper's abstract, clarifying the model's role as a perception module for visual extraction.
* Integrating the detailed evaluation table from the GitHub README to provide immediate performance context.
* Adding the BibTeX citation and acknowledgments for proper attribution.
* Removing the irrelevant `# File information` section and its contents from the README.

Please review and merge this PR.

Files changed (1):
  1. README.md +71 -3
README.md CHANGED
@@ -1,9 +1,77 @@
  ---
- {}
  ---
 
- # Extract+Think Model Card
 
  ## Model details
 
- Extract-0.6B is used as the perception module for the two-stage Extract+Think framework.
  ---
+ license: cc-by-nc-4.0
+ library_name: transformers
+ pipeline_tag: image-text-to-text
  ---
 
+ # Extract+Think Model Card for markendo/llava-extract-qwen3-0.6B
+
+ This repository hosts the **Extract-0.6B** model, which serves as the perception module for the two-stage **Extract+Think** framework. The model was presented in the paper [Downscaling Intelligence: Exploring Perception and Reasoning Bottlenecks in Small Multimodal Models](https://huggingface.co/papers/2511.17487).
+
+ Extract+Think is an approach designed to address perception and reasoning bottlenecks in small multimodal models. It centers on visual extraction tuning, which explicitly trains the model to consistently extract instruction-relevant visual details across tasks; the extracted details are then passed to a separate reasoning stage.
+
+ * 📖 **Paper:** [Downscaling Intelligence: Exploring Perception and Reasoning Bottlenecks in Small Multimodal Models](https://huggingface.co/papers/2511.17487)
+ * 🌐 **Project Page:** https://web.stanford.edu/~markendo/projects/downscaling_intelligence
+ * 💻 **Code:** https://github.com/markendo/downscaling_intelligence
 
  ## Model details
 
+ Extract-0.6B is used as the perception module in the two-stage Extract+Think framework. For the reasoning stage, the authors primarily use Qwen3 models ([1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) and [4B](https://huggingface.co/Qwen/Qwen3-4B)). By decoupling visual perception from linguistic reasoning, this setup aims to improve both efficiency and performance in multimodal understanding (see the sketch after the usage example below for the perception-to-reasoning hand-off).
+
+ ## Usage
+
+ To use this model, particularly for evaluation, the authors rely on the `lmms-eval` framework. Setup and evaluation instructions are detailed in the [GitHub repository](https://github.com/markendo/downscaling_intelligence): clone the repository, install the dependencies, and integrate the custom evaluation files with `lmms-eval`.
+
+ For generating extracted visual information, the following command is provided (example with `markendo/llava-extract-qwen3-1.7B`):
+ ```bash
+ # Stage 1 (perception): extract instruction-relevant visual details for MMStar
+ cd lmms-eval
+ model_name=markendo/llava-extract-qwen3-1.7B
+ python -m lmms_eval \
+     --model=llava_onevision \
+     --model_args=pretrained=$model_name,conv_template=qwen_1_5,device_map=auto \
+     --tasks=mmstar_prism_stage_1 \
+     --batch_size=1 \
+     --output_path results \
+     --log_samples
+ ```
+ Please refer to the [GitHub repository](https://github.com/markendo/downscaling_intelligence) for full setup instructions, including the second stage of reasoning.
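+
+ As an illustration of how the two stages connect, below is a minimal, hypothetical sketch of the reasoning stage. It assumes the stage-1 command above has already produced the extracted visual details as plain text, and it uses a standard `transformers` chat call with the paired [Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) reasoner; the prompt wording is illustrative and is not taken from the paper or repository.
+ ```python
+ # Hypothetical sketch of stage 2 (reasoning). `extracted_details` stands in for
+ # the stage-1 output of the Extract perception model; the prompt format below is
+ # an assumption, not the one used in the paper's evaluation pipeline.
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ reasoner_id = "Qwen/Qwen3-1.7B"  # reasoning model paired with Extract-0.6B
+ tokenizer = AutoTokenizer.from_pretrained(reasoner_id)
+ model = AutoModelForCausalLM.from_pretrained(reasoner_id, torch_dtype="auto", device_map="auto")
+
+ # Placeholder stage-1 output: instruction-relevant visual details extracted from the image.
+ extracted_details = "The chart shows three bars labeled 2021, 2022, and 2023; the 2023 bar is the tallest."
+ question = "Which year had the highest value? Answer with the year only."
+
+ messages = [{
+     "role": "user",
+     "content": f"Visual details extracted from the image:\n{extracted_details}\n\nQuestion: {question}",
+ }]
+ input_ids = tokenizer.apply_chat_template(
+     messages, add_generation_prompt=True, return_tensors="pt"
+ ).to(model.device)
+
+ output = model.generate(input_ids, max_new_tokens=512)
+ print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
+ ```
+ The actual second-stage prompts and evaluation flow are defined in the repository's `lmms-eval` integration; use those for reproducing the reported numbers.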
+
+ ## Evaluation
+
+ The Extract+Think approach is evaluated with `lmms-eval` and shows competitive performance on multimodal benchmarks. Below is a summary of results, reporting in-domain and MMStar averages.
+
+ | Model | LLM Size | # Vis. Data | In-Domain Avg. | MMStar Avg. |
+ |---|---|---|---|---|
+ | **End-to-End** | | | | |
+ | LLaVA-OneVision | 0.5B | 8.8M | 71.1 | 39.0 |
+ | InternVL2.5 | 0.5B | 64M | 83.2 | 48.2 |
+ | SmolVLM | 1.7B | unk. | 75.9 | 41.3 |
+ | Our Baseline | 0.6B | 1.0M | 65.9 | 37.2 |
+ | Our Baseline | 1.7B | 1.0M | 76.8 | 40.9 |
+ | **Decoupled Models** | P / R | | | |
+ | PrismCaptioner | 1.8B / 70B | 1.9M | 75.4 | 41.9 |
+ | PrismCaptioner | 7.0B / 70B | 1.9M | 78.3 | 45.7 |
+ | Our Baseline | 0.6B / 4.0B | 1.0M | 64.6 | 34.0 |
+ | Our Baseline | 1.7B / 4.0B | 1.0M | 69.4 | 39.4 |
+ | <span style="font-variant: small-caps;">Caption+Think</span> | 0.6B / 1.7B | 2.0M | 75.0 | 43.0 |
+ | <span style="font-variant: small-caps;">Caption+Think</span> | 1.7B / 4.0B | 2.0M | 80.0 | 49.0 |
+ | [<span style="font-variant: small-caps;">Extract+Think</span><sup>†</sup>](https://huggingface.co/markendo/llava-extract-from-scratch-qwen3-0.6B) | 0.6B / 1.7B | 0.4M | 78.0 | 42.6 |
+ | [<span style="font-variant: small-caps;">Extract+Think</span><sup>†</sup>](https://huggingface.co/markendo/llava-extract-from-scratch-qwen3-1.7B) | 1.7B / 4.0B | 0.4M | 82.7 | 48.1 |
+ | [<span style="font-variant: small-caps;">Extract+Think</span>](https://huggingface.co/markendo/llava-extract-qwen3-0.6B) | 0.6B / 1.7B | 2.4M | 80.3 | 46.6 |
+ | [<span style="font-variant: small-caps;">Extract+Think</span>](https://huggingface.co/markendo/llava-extract-qwen3-1.7B) | 1.7B / 4.0B | 2.4M | 85.3 | 52.6 |
+
+ *P / R denotes the sizes of the perception and reasoning models; rows marked <sup>†</sup> link to the extraction models trained from scratch. For the full table, please refer to our paper.*
+
+ ## Acknowledgments
+
+ This repository is built on top of [LLaVA-OneVision](https://github.com/LLaVA-VL/LLaVA-NeXT) and [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval).
+
+ ## Citation
+ ```bibtex
+ @article{endo2025downscalingintelligence,
+   author  = {Endo, Mark and Yeung-Levy, Serena},
+   title   = {Downscaling Intelligence: Exploring Perception and Reasoning Bottlenecks in Small Multimodal Models},
+   journal = {arXiv preprint},
+   year    = {2025},
+ }
+ ```