Multimodal OCR2
nanonets ocr / smoldocling / monkey ocr / typhoon ocr
Comprehensive Demo of Multimodal VLMs on the Hub
Multimodal OCR model for complex document understanding.
Smart Any-Horizon Agents for Long Video Reasoning. [SAGE]
object detection, visual grounding, keypoint detection
Molmo2 - Image, Video (QA, Pointing & Tracking)
demo of a collection of qwen3-vl models
Text-to-Image → 3D or Image-to-3D
demo of a collection of impressive ocr models on the hub
demo of a collection of impressive ocr vl models on hf
cosmos reason1 / docscopeocr / visionocr / captioner relaxed
camel doc ocr / core ocr / docscope ocr / monkey ocr
Florence-2-large / Florence-2-base
Ultra-compact Computer-Use Agent [GUI Localization]
Image-Text to Voice (en)
Testing for the latest transformers (DeepSeek-OCR).
demo of a collection of multimodal vlms on hf [ocr / others]
Florence-2 vision models demo. (transformers)
for document parsing tasks
OCR, VQA, Thinking and Object Detection.
understand document semantics, extract text and tables.
Vision-Language Models for Document Conversion
Experiment with small super OCR models here.
thinking / ocr / reasoning