support MPS / CPU inference
Files changed:
- .gitattributes +1 -0
- README.md +53 -90
- deepencoder.py +1 -1
- demo/image.png +3 -0
- demo/requirements.txt +12 -0
- demo/run_dpsk_ocr.py +51 -0
- modeling_deepseekocr.py +32 -20
.gitattributes
CHANGED
@@ -38,3 +38,4 @@ assets/show1.jpg filter=lfs diff=lfs merge=lfs -text
 assets/show2.jpg filter=lfs diff=lfs merge=lfs -text
 assets/show3.jpg filter=lfs diff=lfs merge=lfs -text
 assets/show4.jpg filter=lfs diff=lfs merge=lfs -text
+demo/image.png filter=lfs diff=lfs merge=lfs -text
README.md
CHANGED
@@ -10,96 +10,76 @@ tags:
 license: mit
 library_name: transformers
 ---
-
-
-
-
-
-
-
-
-  <a href="https://huggingface.co/deepseek-ai/DeepSeek-OCR" target="_blank">
-    <img alt="Hugging Face" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-DeepSeek%20AI-ffc107?color=ffc107&logoColor=white" />
-  </a>
-
-</div>
-
-<div align="center">
-
-  <a href="https://discord.gg/Tc7c45Zzu5" target="_blank">
-    <img alt="Discord" src="https://img.shields.io/badge/Discord-DeepSeek%20AI-7289da?logo=discord&logoColor=white&color=7289da" />
-  </a>
-  <a href="https://twitter.com/deepseek_ai" target="_blank">
-    <img alt="Twitter Follow" src="https://img.shields.io/badge/Twitter-deepseek_ai-white?logo=x&logoColor=white" />
-  </a>
-
-</div>
-
-
-
-<p align="center">
-  <a href="https://github.com/deepseek-ai/DeepSeek-OCR"><b>🌟 Github</b></a> |
-  <a href="https://huggingface.co/deepseek-ai/DeepSeek-OCR"><b>📥 Model Download</b></a> |
-  <a href="https://github.com/deepseek-ai/DeepSeek-OCR/blob/main/DeepSeek_OCR_paper.pdf"><b>📄 Paper Link</b></a> |
-  <a href="https://arxiv.org/abs/2510.18234"><b>📄 Arxiv Paper Link</b></a> |
-</p>
-<h2>
-<p align="center">
-  <a href="https://huggingface.co/papers/2510.18234">DeepSeek-OCR: Contexts Optical Compression</a>
-</p>
-</h2>
-<p align="center">
-  <img src="assets/fig1.png" style="width: 1000px" align=center>
-</p>
-<p align="center">
-  <a href="https://huggingface.co/papers/2510.18234">Explore the boundaries of visual-text compression.</a>
-</p>
+
+# DeepSeek-OCR – Apple Metal Performance Shaders (MPS) & CPU Support
+
+This repository uses the weights from the original DeepSeek-OCR and modifies the model to support MPS and CPU inference.
+
+- [Link to the original DeepSeek-OCR model](https://huggingface.co/deepseek-ai/DeepSeek-OCR)
+
+- [Link to the original DeepSeek-OCR repository](https://github.com/deepseek-ai/DeepSeek-OCR)
 
 ## Usage
-Inference using Huggingface transformers on NVIDIA GPUs. Requirements tested on python 3.12.9 + cuda11.8:
+Inference using Huggingface transformers on Metal Performance Shaders (MPS) and CPU. Requirements tested on python 3.12.9:
 
-```
-
-
-
-
-
-
-pip install
+```shell
+git clone git@hf.co:Dogacel/DeepSeek-OCR-Metal-MPS
+cd DeepSeek-OCR-Metal-MPS/demo
+
+# Use mamba or conda
+mamba create -n deepseek-ocr python=3.12.9 -y
+mamba activate deepseek-ocr
+pip install -r requirements.txt
+
+python run_dpsk_ocr.py
 ```
 
 ```python
 from transformers import AutoModel, AutoTokenizer
 import torch
-import os
-os.environ["CUDA_VISIBLE_DEVICES"] = '0'
-model_name = 'deepseek-ai/DeepSeek-OCR'
 
-tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
-model = AutoModel.from_pretrained(model_name, _attn_implementation='flash_attention_2', trust_remote_code=True, use_safetensors=True)
-model = model.eval().cuda().to(torch.bfloat16)
+model_name = 'Dogacel/DeepSeek-OCR-Metal-MPS'
 
-# prompt = "<image>\nFree OCR. "
-prompt = "<image>\n<|grounding|>Convert the document to markdown. "
-image_file = 'your_image.jpg'
-output_path = 'your/output/dir'
+tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
+tokenizer.pad_token = tokenizer.eos_token
 
-# infer(self, tokenizer, prompt='', image_file='', output_path = ' ', base_size = 1024, image_size = 640, crop_mode = True, test_compress = False, save_results = False):
+model = AutoModel.from_pretrained(
+    model_name,
+    _attn_implementation='eager',
+    trust_remote_code=True,
+    use_safetensors=True,
+)
 
-# Tiny: base_size = 512, image_size = 512, crop_mode = False
-# Small: base_size = 640, image_size = 640, crop_mode = False
-# Base: base_size = 1024, image_size = 1024, crop_mode = False
-# Large: base_size = 1280, image_size = 1280, crop_mode = False
+device = torch.device("mps")
+dtype = torch.float16
 
-# Gundam: base_size = 1024, image_size = 640, crop_mode = True
+model = model.eval().to(device).to(dtype)
 
-
+prompt = "<image>\n<|grounding|>Convert the document to markdown. "
+image_file = 'image.png'
+output_path = 'results4'
+
+res = model.infer(
+    tokenizer,
+    device=device,
+    dtype=dtype,
+    prompt=prompt,
+    image_file=image_file,
+    output_path = output_path,
+    base_size=1024,
+    image_size=640,
+    crop_mode=False,
+    save_results = True,
+    test_compress = True,
+)
 ```
 
 ## vLLM
+
+> vLLM integration hasn't been tested yet.
+
 Refer to [🌟GitHub](https://github.com/deepseek-ai/DeepSeek-OCR/) for guidance on model inference acceleration and PDF processing, etc.
 
-[2025/10/23] 🚀🚀🚀 DeepSeek-OCR is now officially supported in upstream [vLLM](https://docs.vllm.ai/projects/recipes/en/latest/DeepSeek/DeepSeek-OCR.html#installing-vllm).
 ```shell
 uv venv
 source .venv/bin/activate
@@ -114,7 +94,7 @@ from PIL import Image
 
 # Create model instance
 llm = LLM(
-    model="deepseek-ai/DeepSeek-OCR",
+    model="Dogacel/DeepSeek-OCR-Metal-MPS",
     enable_prefix_caching=False,
     mm_processor_cache_gb=0,
     logits_processors=[NGramPerReqLogitsProcessor]
@@ -166,21 +146,4 @@ for output in model_outputs:
     <td><img src="assets/show3.jpg" style="width: 500px"></td>
     <td><img src="assets/show4.jpg" style="width: 500px"></td>
   </tr>
-</table>
-
-
-## Acknowledgement
-
-We would like to thank [Vary](https://github.com/Ucas-HaoranWei/Vary/), [GOT-OCR2.0](https://github.com/Ucas-HaoranWei/GOT-OCR2.0/), [MinerU](https://github.com/opendatalab/MinerU), [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR), [OneChart](https://github.com/LingyvKong/OneChart), [Slow Perception](https://github.com/Ucas-HaoranWei/Slow-Perception) for their valuable models and ideas.
-
-We also appreciate the benchmarks: [Fox](https://github.com/ucaslcl/Fox), [OminiDocBench](https://github.com/opendatalab/OmniDocBench).
-
-
-## Citation
-```bibtex
-@article{wei2025deepseek,
-  title={DeepSeek-OCR: Contexts Optical Compression},
-  author={Wei, Haoran and Sun, Yaofeng and Li, Yukun},
-  journal={arXiv preprint arXiv:2510.18234},
-  year={2025}
-}
+</table>
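The usage example above pins `torch.device("mps")`. For the CPU half of "support MPS / CPU inference", a minimal device/dtype selection sketch is shown below; it is illustrative only (not part of this commit) and assumes the `model.infer(..., device=..., dtype=...)` signature from the README diff:

```python
import torch

# Prefer the MPS backend when it is available, otherwise fall back to CPU.
# float16 is the conservative half-precision choice on MPS; on CPU, float32 is
# assumed here because half-precision CPU inference is typically slow.
if torch.backends.mps.is_available():
    device, dtype = torch.device("mps"), torch.float16
else:
    device, dtype = torch.device("cpu"), torch.float32

# Then reuse the README snippet unchanged:
# model = model.eval().to(device).to(dtype)
# res = model.infer(tokenizer, device=device, dtype=dtype, ...)
```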
deepencoder.py
CHANGED
@@ -1011,7 +1011,7 @@ def build_sam_vit_b(checkpoint=None):
         checkpoint=checkpoint,
     )
 
-def build_sam_fast_vit_b(checkpoint=None, compile_mode='max-autotune', dtype=torch.bfloat16):
+def build_sam_fast_vit_b(checkpoint=None, compile_mode='max-autotune', dtype=torch.float16):
     image_encoder = build_sam_vit_b(checkpoint).eval().to(dtype)
     # sam = _apply_eval_dtype_sam(sam, dtype)
     image_encoder = torch.compile(image_encoder, mode=compile_mode)
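The change above makes `torch.float16` the default dtype for the fast SAM encoder. If one code path must serve CUDA, MPS, and CPU, a hypothetical helper like the sketch below could choose the dtype per backend; `pick_dtype` and its policy are assumptions, not part of this repository:

```python
import torch

def pick_dtype(device: torch.device) -> torch.dtype:
    # Assumed policy: bfloat16 on CUDA GPUs that support it, float16 on MPS,
    # float32 on CPU. The commit itself simply defaults to float16.
    if device.type == "cuda":
        return torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
    if device.type == "mps":
        return torch.float16
    return torch.float32
```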
demo/image.png
ADDED
Binary image added via Git LFS.
demo/requirements.txt
ADDED
@@ -0,0 +1,12 @@
+transformers==4.46.3
+tokenizers==0.20.3
+torch==2.9.1
+torchvision==0.24.1
+torchaudio==2.9.1
+PyMuPDF
+img2pdf
+einops
+easydict
+addict
+Pillow
+numpy
demo/run_dpsk_ocr.py
ADDED
@@ -0,0 +1,51 @@
+from transformers import AutoModel, AutoTokenizer
+import torch
+
+model_name = '../'
+
+tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
+tokenizer.pad_token = tokenizer.eos_token
+
+model = AutoModel.from_pretrained(
+    model_name,
+    _attn_implementation='eager',
+    trust_remote_code=True,
+    use_safetensors=True,
+)
+
+device = torch.device("mps")
+dtype = torch.float16
+
+model = model.eval().to(device).to(dtype)
+
+
+
+# prompt = "<image>\nFree OCR. "
+prompt = "<image>\n<|grounding|>Convert the document to markdown. "
+image_file = 'image.png'
+output_path = 'results'
+
+
+
+# infer(self, tokenizer, prompt='', image_file='', output_path = ' ', base_size = 1024, image_size = 640, crop_mode = True, test_compress = False, save_results = False):
+
+# Tiny: base_size = 512, image_size = 512, crop_mode = False
+# Small: base_size = 640, image_size = 640, crop_mode = False
+# Base: base_size = 1024, image_size = 1024, crop_mode = False
+# Large: base_size = 1280, image_size = 1280, crop_mode = False
+
+# Gundam: base_size = 1024, image_size = 640, crop_mode = True
+
+res = model.infer(
+    tokenizer,
+    device=device,
+    dtype=dtype,
+    prompt=prompt,
+    image_file=image_file,
+    output_path = output_path,
+    base_size=1024,
+    image_size=640,
+    crop_mode=False,
+    save_results = True,
+    test_compress = True,
+)
modeling_deepseekocr.py
CHANGED
@@ -501,8 +501,11 @@ class DeepseekOCRModel(DeepseekV2Model):
 if images_in_this_batch:
     images_in_this_batch = torch.cat(images_in_this_batch, dim=0)
     # exit()
-
-
+    mask_indices = images_seq_mask[idx].nonzero(as_tuple=True)[0]
+    if len(mask_indices) == images_in_this_batch.shape[0]:
+        inputs_embeds[idx, mask_indices] = images_in_this_batch
+    else:
+        print(f"Size mismatch: mask has {len(mask_indices)} positions, but got {images_in_this_batch.shape[0]} features")
 
 idx += 1
 
@@ -652,7 +655,12 @@ class DeepseekOCRForCausalLM(DeepseekV2ForCausalLM):
 if attention_mask is not None and position_ids is None:
     # create position_ids on the fly for batch generation
     position_ids = attention_mask.long().cumsum(-1) - 1
-    position_ids.masked_fill_(attention_mask == 0, 1)
+
+    # position_ids.masked_fill_(attention_mask == 0, 1)
+    position_ids = torch.where(attention_mask == 0,
+                               torch.ones_like(position_ids),
+                               position_ids)
+
     if past_key_values:
         position_ids = position_ids[:, -input_ids.shape[1] :]
 
@@ -700,9 +708,12 @@ class DeepseekOCRForCausalLM(DeepseekV2ForCausalLM):
 
 
 
-def infer(self, tokenizer, prompt='', image_file='', output_path = '', base_size=1024, image_size=640, crop_mode=True, test_compress=False, save_results=False, eval_mode=False):
+def infer(self, tokenizer, device=torch.device("mps"), dtype=torch.float16, prompt='', image_file='', output_path = '', base_size=1024, image_size=640, crop_mode=True, test_compress=False, save_results=False, eval_mode=False):
     self.disable_torch_init()
 
+    self.target_device = device
+    self.target_dtype = dtype
+
     os.makedirs(output_path, exist_ok=True)
     os.makedirs(f'{output_path}/images', exist_ok=True)
 
@@ -799,9 +810,9 @@ class DeepseekOCRForCausalLM(DeepseekV2ForCausalLM):
 
 
 
-images_list.append(image_transform(global_view).to(torch.bfloat16))
+images_list.append(image_transform(global_view).to(self.target_dtype))
 
-# global_view_tensor = image_transform(global_view).to(torch.bfloat16)
+# global_view_tensor = image_transform(global_view).to(self.dtype)
 
 width_crop_num, height_crop_num = crop_ratio
 
@@ -812,7 +823,7 @@ class DeepseekOCRForCausalLM(DeepseekV2ForCausalLM):
 """process the local views"""
 
 for i in range(len(images_crop_raw)):
-    images_crop_list.append(image_transform(images_crop_raw[i]).to(torch.bfloat16))
+    images_crop_list.append(image_transform(images_crop_raw[i]).to(self.target_dtype))
 
 if image_size == 640:
     valid_img_tokens += len(images_crop_list) * 100
@@ -846,7 +857,7 @@ class DeepseekOCRForCausalLM(DeepseekV2ForCausalLM):
 # else:
 global_view = ImageOps.pad(image, (image_size, image_size),
                            color=tuple(int(x * 255) for x in image_transform.mean))
-images_list.append(image_transform(global_view).to(torch.bfloat16))
+images_list.append(image_transform(global_view).to(self.target_dtype))
 
 if base_size == 1024:
     valid_img_tokens += int(256 * ratio)
@@ -905,18 +916,19 @@ class DeepseekOCRForCausalLM(DeepseekV2ForCausalLM):
 if images_crop_list:
     images_crop = torch.stack(images_crop_list, dim=0)
 else:
-
+    images_crop = torch.zeros((1, 3, base_size, base_size))
 
 
+input_ids_tensor = input_ids.unsqueeze(0).to(self.target_device)
 
 if not eval_mode:
     streamer = NoEOSTextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=False)
-    with torch.autocast("cuda", dtype=torch.bfloat16):
+    with torch.autocast(self.target_device.type, dtype=self.target_dtype):
         with torch.no_grad():
             output_ids = self.generate(
-                input_ids.unsqueeze(0).cuda(),
-                images=[(images_crop.cuda(), images_ori.cuda())],
-                images_seq_mask = images_seq_mask.unsqueeze(0).cuda(),
+                input_ids_tensor,
+                images=[(images_crop.to(self.target_device), images_ori.to(self.target_device))],
+                images_seq_mask = images_seq_mask.unsqueeze(0).to(self.target_device),
                 images_spatial_crop = images_spatial_crop,
                 # do_sample=False,
                 # num_beams = 1,
@@ -929,12 +941,12 @@ class DeepseekOCRForCausalLM(DeepseekV2ForCausalLM):
             )
 
 else:
-    with torch.autocast("cuda", dtype=torch.bfloat16):
+    with torch.autocast(self.target_device.type, dtype=self.target_dtype):
         with torch.no_grad():
             output_ids = self.generate(
-                input_ids.unsqueeze(0).cuda(),
-                images=[(images_crop.cuda(), images_ori.cuda())],
-                images_seq_mask = images_seq_mask.unsqueeze(0).cuda(),
+                input_ids_tensor,
+                images=[(images_crop.to(self.target_device), images_ori.to(self.target_device))],
+                images_seq_mask = images_seq_mask.unsqueeze(0).to(self.target_device),
                 images_spatial_crop = images_spatial_crop,
                 # do_sample=False,
                 # num_beams = 1,
@@ -947,7 +959,7 @@ class DeepseekOCRForCausalLM(DeepseekV2ForCausalLM):
 
 
 if '<image>' in conversation[0]['content'] and eval_mode:
-    outputs = tokenizer.decode(output_ids[0, input_ids.unsqueeze(0).cuda().shape[1]:])
+    outputs = tokenizer.decode(output_ids[0, input_ids.unsqueeze(0).to(self.device).shape[1]:])
     stop_str = '<|end▁of▁sentence|>'
     if outputs.endswith(stop_str):
         outputs = outputs[:-len(stop_str)]
@@ -957,7 +969,7 @@ class DeepseekOCRForCausalLM(DeepseekV2ForCausalLM):
     return outputs
 
 if '<image>' in conversation[0]['content'] and test_compress:
-    outputs = tokenizer.decode(output_ids[0, input_ids.unsqueeze(0).cuda().shape[1]:])
+    outputs = tokenizer.decode(output_ids[0, input_ids.unsqueeze(0).to(self.device).shape[1]:])
     pure_texts_outputs_token_length = len(text_encode(tokenizer, outputs, bos=False, eos=False))
     print('='*50)
     print('image size: ', (w, h))
@@ -968,7 +980,7 @@ class DeepseekOCRForCausalLM(DeepseekV2ForCausalLM):
 
 
 if '<image>' in conversation[0]['content'] and save_results:
-    outputs = tokenizer.decode(output_ids[0, input_ids.unsqueeze(0).cuda().shape[1]:])
+    outputs = tokenizer.decode(output_ids[0, input_ids.unsqueeze(0).to(self.device).shape[1]:])
     stop_str = '<|end▁of▁sentence|>'
 
     print('='*15 + 'save results:' + '='*15)
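The diff above replaces two in-place masked operations (a masked scatter into `inputs_embeds` and `masked_fill_` on `position_ids`) with an explicit index assignment and an out-of-place `torch.where`, presumably because the in-place masked ops are less reliable on the MPS backend. A small self-contained illustration of why the substitutions are equivalent, using toy shapes rather than the model's real tensors:

```python
import torch

# Toy stand-ins for inputs_embeds, images_seq_mask and the projected image features.
inputs_embeds = torch.zeros(4, 8)
images_seq_mask = torch.tensor([False, True, True, False])
image_features = torch.ones(2, 8)  # one row per True position in the mask

# Index assignment, as in the DeepseekOCRModel change above:
mask_indices = images_seq_mask.nonzero(as_tuple=True)[0]
if len(mask_indices) == image_features.shape[0]:
    inputs_embeds[mask_indices] = image_features  # same result as a masked scatter

# Out-of-place torch.where, as in the position_ids change above:
attention_mask = torch.tensor([[1, 1, 1, 0]])
position_ids = attention_mask.long().cumsum(-1) - 1
position_ids = torch.where(attention_mask == 0,
                           torch.ones_like(position_ids),
                           position_ids)  # padding positions pinned to 1
```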
|