[BUG] Missing grounding of first element in the image when the element is of type Text
#75 · opened by MatteoOmenetti
When you ask the model to perform OCR + layout, if the first element on the page is a "text" element, then no location information is emitted for it.
Code:
```python
from vllm import LLM, SamplingParams
from PIL import Image
from vllm.model_executor.models.deepseek_ocr import NGramPerReqLogitsProcessor

llm = LLM(
    model="deepseek-ai/DeepSeek-OCR",
    limit_mm_per_prompt={"image": 1},
    seed=42,
    enable_prefix_caching=False,
    mm_processor_cache_gb=0,
    logits_processors=[NGramPerReqLogitsProcessor],
)

# Prompt as given in the model card -- note the trailing space after "markdown."
prompt = "<image>\n<|grounding|>Convert the document to markdown. "

sampling_param = SamplingParams(
    temperature=0.0,
    max_tokens=8192,
    # ngram logit processor args
    extra_args=dict(
        ngram_size=30,
        window_size=90,
        whitelist_token_ids={128821, 128822},  # whitelist: <td>, </td>
    ),
    skip_special_tokens=False,
)

image = Image.open(
    "/gpfs/ZuFS1/proj/deep-search/mao/repos/test_repo/eval/end_to_end_docling_eval/t.png"
).convert("RGB")
print(image)

llm_inputs = [{"prompt": prompt, "multi_modal_data": {"image": image}}]
outputs = llm.generate(llm_inputs, sampling_params=sampling_param)
output_text = outputs[0].outputs[0].text
print(output_text)
```
Output:
```
a feature of the forgetting mechanism. When compressing tokens by nearly \(20x\) , we find that precision can still approach \(60\%\) . These results indicate that optical contexts compression is a very promising and worthwhile research direction, and this approach does not bring any overhead because it can leverage VLM infrastructure, as multimodal systems inherently require an additional vision encoder. <-- !!! MISSING GROUNDING HERE !!!
<|ref|>table<|/ref|><|det|>[[125, 250, 872, 380]]<|/det|>
<|ref|>table_caption<|/ref|><|det|>[[115, 190, 881, 240]]<|/det|>
Table 4 | Edit distances for different categories of documents in OmniDocBench. The results show that some types of documents can achieve good performance with just 64 or 100 vision tokens, while others require Gundam mode.
<table><tr><td>Type Mode</td><td>Book Slides</td><td>Financial Report</td><td>Textbook</td><td>Exam Paper</td><td>Magazine</td><td>Academic Papers</td><td>Notes</td><td>Newspaper Overall</td></tr><tr><td>Tiny</td><td>0.147</td><td>0.116</td><td>0.207</td><td>0.173</td><td>0.294</td><td>0.201</td><td>0.395</td><td>0.297</td><td>0.94</td></tr><tr><td>Small</td><td>0.085</td><td>0.111</td><td>0.079</td><td>0.147</td><td>0.171</td><td>0.107</td><td>0.131</td><td>0.187</td><td>0.744</td></tr><tr><td>Base</td><td>0.037</td><td>0.08</td><td>0.027</td><td>0.1</td><td>0.13</td><td>0.073</td><td>0.052</td><td>0.176</td><td>0.645</td></tr><tr><td>Large</td><td>0.038</td><td>0.108</td><td>0.022</td><td>0.084</td><td>0.109</td><td>0.06</td><td>0.053</td><td>0.155</td><td>0.353</td></tr><tr><td>Gundam</td><td>0.035</td><td>0.085</td><td>0.289</td><td>0.095</td><td>0.094</td><td>0.059</td><td>0.039</td><td>0.153</td><td>0.122</td></tr><tr><td>Gundam-M</td><td>0.052</td><td>0.09</td><td>0.034</td><td>0.091</td><td>0.079</td><td>0.079</td><td>0.048</td><td>0.1</td><td>0.099</td></tr></table>
<|ref|>sub_title<|/ref|><|det|>[[116, 411, 383, 428]]<|/det|>
### 4.2. OCR Practical Performance
<|ref|>text<|/ref|><|det|>[[115, 437, 882, 585]]<|/det|>
DeepSeek- OCR is not only an experimental model; it has strong practical capabilities and can construct data for LLM/VLM pretraining. To quantify OCR performance, we test DeepSeek- OCR on OmniDocBench [27], with results shown in Table 3. Requiring only 100 vision tokens (640x640 resolution), DeepSeek- OCR surpasses GOT- OCR2.0 [38] which uses 256 tokens; with 400 tokens (285 valid tokens, 1280x1280 resolution), it achieves on- par performance with state- of- the- arts on this benchmark. Using fewer than 800 tokens (Gundam mode), DeepSeek- OCR outperforms MinerU2.0 [34] which needs nearly 7,000 vision tokens. These results demonstrate that our DeepSeek- OCR model is powerful in practical applications, and because the higher tokens compression, it enjoys a higher research ceiling.
<|ref|>text<|/ref|><|det|>[[115, 590, 882, 753]]<|/det|>
As shown in Table 4, some categories of documents require very few tokens to achieve satisfactory performance, such as slides which only need 64 vision tokens. For book and report documents, DeepSeek- OCR can achieve good performance with only 100 vision tokens. Combined with the analysis from Section 4.1, this may be because most text tokens in these document categories are within 1,000, meaning the vision- token compression ratio does not exceed \(10x\) . For newspapers, Gundam or even Gundam- master mode is required to achieve acceptable edit distances, because the text tokens in newspapers are 4- 5,000, far exceeding the \(10x\) compression of other modes. These experimental results further demonstrate the boundaries of contexts optical compression, which may provide effective references for researches on the vision token optimization in VLMs and context compression, forgetting mechanisms in LLMs.
<|ref|>sub_title<|/ref|><|det|>[[116, 776, 303, 792]]<|/det|>
### 4.3. Qualitative Study
<|ref|>title<|/ref|><|det|>[[116, 803, 276, 819]]<|/det|>
#### 4.3.1. Deep parsing
<|ref|>text<|/ref|><|det|>[[116, 828, 882, 894]]<|/det|>
DeepSeek- OCR possesses both layout and OCR 2.0 capabilities, enabling it to further parse images within documents through secondary model calls, a feature we refer to as "deep parsing". As shown in Figures 7,8,9,10, our model can perform deep parsing on charts, geometry, chemical formulas, and even natural images, requiring only a unified prompt.
```
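To make the failure easy to spot programmatically, here is a minimal, hypothetical check (a plain regex helper of my own, not part of DeepSeek-OCR or vLLM, reusing `output_text` from the script above) that verifies whether the generated text opens with a grounding prefix:

```python
import re

# Hypothetical helper: does the output start with a
# <|ref|>...<|/ref|><|det|>[[...]]<|/det|> grounding prefix?
GROUNDING_RE = re.compile(
    r"<\|ref\|>.+?<\|/ref\|><\|det\|>\[\[.*?\]\]<\|/det\|>"
)

def first_block_is_grounded(text: str) -> bool:
    """True if the output opens with a grounding tag (ignoring leading whitespace)."""
    return GROUNDING_RE.match(text.lstrip()) is not None

# With the output above this prints False: the first "text" block has no grounding prefix.
print(first_block_is_grounded(output_text))
```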
MatteoOmenetti changed discussion status to closed
This happens if you use the prompt `"<image>\n<|grounding|>Convert the document to markdown. "` (with a trailing space) as reported in the model card and in the official paper. If you remove the trailing space from the prompt, everything works fine.
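For reference, a minimal sketch of the fix based on the comment above: only the prompt line of the repro script changes, everything else stays the same.

```python
# Same prompt as the model card, but without the trailing space after "markdown."
prompt = "<image>\n<|grounding|>Convert the document to markdown."
```

With this prompt the first text element should come back with its `<|ref|>text<|/ref|><|det|>...<|/det|>` prefix like the other blocks.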
