AI & ML interests

Multi-Modal Collective Intelligence

prithivMLmods
posted an update 1 day ago
Try the all-new trending Qwen-Image-Edit specialized adapter demos, including Photo-to-Anime, Light Restoration, Multi-Angle Edits, Relighting, and more, all in a single Hugging Face Space. Below is the demo link. 🤗🌠

⮞ Demo-Space: prithivMLmods/Qwen-Image-Edit-2509-LoRAs-Fast
⮞ How-to-Use: prithivMLmods/Qwen-Image-Edit-2509-LoRAs-Fast#2
⮞ Collection: https://huggingface.co/collections/prithivMLmods/image-generation-apps-collection

To know more about it, visit the app page or the respective model page!
prithivMLmods
posted an update 5 days ago
Introducing Photo-Mate-v2, based on FLUX.1-Kontext-dev, for advanced image manipulation tasks. It supports transforming scenes into top-down/bottom-up perspectives, CAM-right/left views and their reverses, as well as general kontext-specified object removal. Below is the list of demos and adapters. 🔥🤗

➤ Spaces [Demo]: prithivMLmods/Kontext-Photo-Mate-v2

Kontext-Adapters:
✦ Kontext-Bottom-Up-View: prithivMLmods/Kontext-Bottom-Up-View
✦ Kontext-CAM-Right-View: prithivMLmods/Kontext-CAM-Right-View
✦ Kontext-Top-Down-View: prithivMLmods/Kontext-Top-Down-View
✦ Kontext-CAM-Left-View: prithivMLmods/Kontext-CAM-Left-View
✦ Kontext-Unblur-Upscale: prithivMLmods/Kontext-Unblur-Upscale
✦ Kontext-0811-exp: prithivMLmods/Kontext-0811-exp
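As a rough, unofficial sketch of how these adapters slot into a pipeline, the helper below maps a camera-view task to its adapter repo from the list above; the commented-out pipeline calls assume diffusers' standard LoRA-loading API and are not taken from the Space's actual code:

```python
# Hypothetical helper: map a camera-view edit task to its adapter repo,
# mirroring the adapter list above.
VIEW_ADAPTERS = {
    "top-down": "prithivMLmods/Kontext-Top-Down-View",
    "bottom-up": "prithivMLmods/Kontext-Bottom-Up-View",
    "cam-left": "prithivMLmods/Kontext-CAM-Left-View",
    "cam-right": "prithivMLmods/Kontext-CAM-Right-View",
    "unblur-upscale": "prithivMLmods/Kontext-Unblur-Upscale",
}

def adapter_for(task: str) -> str:
    """Return the adapter repo id for a view-edit task."""
    if task not in VIEW_ADAPTERS:
        raise ValueError(f"unknown task {task!r}; choose from {sorted(VIEW_ADAPTERS)}")
    return VIEW_ADAPTERS[task]

# The actual application would follow the usual diffusers LoRA pattern
# (assumed API, not executed here):
# from diffusers import FluxKontextPipeline
# pipe = FluxKontextPipeline.from_pretrained("black-forest-labs/FLUX.1-Kontext-dev")
# pipe.load_lora_weights(adapter_for("top-down"))
```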

Photo-Mate Collection:
✦ Kontext CAM Angles: https://huggingface.co/collections/prithivMLmods/kontext-cam-angles
✦ i2i - Kontext (exp): https://huggingface.co/collections/prithivMLmods/i2i-kontext-exp
✦ LZO-1 (Lossless Zoom Operator): https://huggingface.co/collections/prithivMLmods/lzo-1-lossless-zoom-operator

Related-Apps:
✦ Photo-Mate [Version 1.0]: prithivMLmods/Photo-Mate-i2i
✦ Image Generation Apps [Collection]: https://huggingface.co/collections/prithivMLmods/image-generation-apps-collection

To know more about it, visit the app page or the respective model page!
prithivMLmods
posted an update 9 days ago
A week ago, I shared a post about the latest transformers test implementation of DeepSeek-OCR compatibility (https://tinyurl.com/ykc4mm66). Now, I'm dropping the most compatible version of it to support the model with the latest transformers. 🤗🔥

➠ DeepSeek-OCR-Latest-BF16.I64: prithivMLmods/DeepSeek-OCR-Latest-BF16.I64
➠ DeepSeek OCR [exp]: prithivMLmods/DeepSeek-OCR-experimental

✅ Supports the latest transformers v4.57.1
✅ torch: 2.6.0+cu124 or the latest version (i.e., torch 2.9.0)
✅ CUDA version: 12.4
✅ Users can also opt out of specific attention implementations if desired.
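For reference, the checklist above translates into a from_pretrained call roughly like the one sketched below. This is a hedged sketch using the standard transformers argument names (trust_remote_code, torch_dtype, attn_implementation), not a verbatim snippet from the model card:

```python
# Hedged sketch: assembling from_pretrained arguments matching the checklist
# above (transformers v4.57.1, BF16 weights, optional attention opt-out).
# Argument names follow the standard transformers API.
def build_load_kwargs(attn_impl=None):
    kwargs = {
        "trust_remote_code": True,   # DeepSeek-OCR ships custom modeling code
        "torch_dtype": "bfloat16",   # matches the BF16 checkpoint
    }
    if attn_impl is not None:
        # e.g. "eager" or "sdpa"; omit entirely to use the model default
        kwargs["attn_implementation"] = attn_impl
    return kwargs

# The actual load would then be (not executed here):
# from transformers import AutoModel
# model = AutoModel.from_pretrained(
#     "prithivMLmods/DeepSeek-OCR-Latest-BF16.I64", **build_load_kwargs("eager"))
```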

✨ Previous version: strangervisionhf/deepseek-ocr-latest-transformers
↗️ Related Blog: https://huggingface.co/blog/prithivMLmods/multimodal-ocr-vlms
✨ Community Page: strangervisionhf
✨ Original Model Page: deepseek-ai/DeepSeek-OCR

To know more about it, visit the app page or the respective model page!
prithivMLmods
posted an update 13 days ago
A small blog post titled "Hall of Multimodal OCR VLMs and Demonstrations" has been published at ↗️ https://huggingface.co/blog/prithivMLmods/multimodal-ocr-vlms on behalf of strangervisionhf.

It discusses the latest trends in OCR models, the multilingual support offered by modern OCR systems, their unique capabilities, OCR benchmark model comparisons, transformer-based implementations, and strategies for streamlining transformers compatibility.
prithivMLmods
posted an update 15 days ago
Implemented DeepSeek-OCR to support the latest transformers on the strangervisionhf page. The page includes the model weights and a corrected configuration, which fixes the issues and allows transformers inference to run smoothly. 🤗🔥

> Model: strangervisionhf/deepseek-ocr-latest-transformers
> Demo Space: prithivMLmods/DeepSeek-OCR-experimental

✅ Supports the latest transformers
✅ You can also opt out of the attention implementation if needed.
✅ Supports torch version 2.6.0 or higher
✅ CUDA version: 12.4

If you are interested in experimenting with new things and streamlining compatibility, the strangervisionhf organization is open for you, and you can join the community.

> Multimodal Collection: prithivMLmods/multimodal-implementations-67c9982ea04b39f0608badb0, https://huggingface.co/collections/strangervisionhf/october-2025-models

> Thank you, @merve, for assigning the blazing-fast Zero GPU support!

> Notebook : https://github.com/PRITHIVSAKTHIUR/Multimodal-Outpost-Notebooks/blob/main/DeepSeek-OCR-Demo/deepseek_ocr_demo.ipynb

To know more about it, visit the app page or the respective model page!
prithivMLmods
posted an update 16 days ago
Introducing Gliese-OCR-7B-Post2.0-final, a document content-structure retrieval VLM designed for content extraction (OCR), summarization, and document visual question answering. This is the fourth and final model in the Camel Doc OCR VLM series, following Gliese-OCR-7B-Post1.0. The model delivers superior accuracy across a wide range of document types, including scanned PDFs, handwritten pages, structured forms, and analytical reports. 🚀🤗

> Gliese-OCR-7B-Post2.0-final : prithivMLmods/Gliese-OCR-7B-Post2.0-final
> Gliese-OCR-7B-Post1.0 (previous) : prithivMLmods/Gliese-OCR-7B-Post1.0
> Gliese OCR Post-x.0 (collection) : https://huggingface.co/collections/prithivMLmods/gliese-ocr-post-x0
> Multimodal Implementations (collection) : https://huggingface.co/collections/prithivMLmods/multimodal-implementations
> Qwen VL Captions (other-collection) : https://huggingface.co/collections/prithivMLmods/qwen-vl-captions
> Run Demo Here : prithivMLmods/Gliese-OCR-7B-Post2.0-final
> GitHub (4bit) : https://github.com/PRITHIVSAKTHIUR/Multimodal-Outpost-Notebooks/blob/main/Gliese-OCR-7B-Post2.0-final(4bit)/Gliese_OCR_7B_Post2_0_final.ipynb
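The GitHub notebook runs the model in 4-bit; a common way to reproduce that is transformers' bitsandbytes integration. The sketch below only assembles typical 4-bit settings as a plain dict (the notebook's exact values may differ), with the real BitsAndBytesConfig load shown in comments:

```python
# Illustrative 4-bit settings in the style of the linked notebook, using the
# common transformers + bitsandbytes pattern (exact notebook values may differ).
def build_4bit_kwargs():
    return {
        "load_in_4bit": True,
        "bnb_4bit_quant_type": "nf4",       # NormalFloat4, a common default
        "bnb_4bit_compute_dtype": "bfloat16",
        "bnb_4bit_use_double_quant": True,  # nested quantization saves extra VRAM
    }

# The real load would route these through BitsAndBytesConfig (not run here):
# import torch
# from transformers import AutoModelForImageTextToText, BitsAndBytesConfig
# quant = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
#                            bnb_4bit_compute_dtype=torch.bfloat16)
# model = AutoModelForImageTextToText.from_pretrained(
#     "prithivMLmods/Gliese-OCR-7B-Post2.0-final", quantization_config=quant)
```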

> To know more about it, visit the app page or the respective model page!!
prithivMLmods
posted an update 17 days ago
Here is the official Florence-2 Transformers-converted demo for the following vision models: florence-community/Florence-2-large, florence-community/Florence-2-large-ft, florence-community/Florence-2-base, and florence-community/Florence-2-base-ft. These models support tasks such as object detection, captioning, detailed captioning, more detailed captioning, dense region captioning, region proposal, OCR, and OCR with region. Try the official demo at the link below:

> Space: prithivMLmods/florence2-vision-models
> Collection: prithivMLmods/multimodal-implementations-67c9982ea04b39f0608badb0
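Florence-2 selects its task through a special prompt token rather than a free-form instruction. The mapping below covers the tasks listed above, using the token strings commonly documented for Florence-2; verify the exact strings against the model cards before relying on them:

```python
# Florence-2 is steered by a task-specific prompt token. This mapping covers
# the tasks named above; token strings follow the Florence-2 model cards.
TASK_PROMPTS = {
    "object detection": "<OD>",
    "captioning": "<CAPTION>",
    "detailed captioning": "<DETAILED_CAPTION>",
    "more detailed captioning": "<MORE_DETAILED_CAPTION>",
    "dense region captioning": "<DENSE_REGION_CAPTION>",
    "region proposal": "<REGION_PROPOSAL>",
    "ocr": "<OCR>",
    "ocr with region": "<OCR_WITH_REGION>",
}

def prompt_for(task: str) -> str:
    """Look up the task token, case-insensitively."""
    return TASK_PROMPTS[task.lower()]

# Typical use (not run here):
# inputs = processor(text=prompt_for("ocr"), images=image, return_tensors="pt")
```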

> To know more about it, visit the app page or the respective model page!!
prithivMLmods
posted an update 28 days ago
Now you can try all the latest state-of-the-art multimodal vision-language models from the Qwen3-VL series demo on Hugging Face Spaces, including the 4B, 8B, and 30B (Instruct, 4B-Thinking) variants. I've also uploaded the weights for the Abliterated variants of these models, up to 30B parameters. Check out the Spaces and model links below! 🤗🔥

✨ Qwen3-VL[4B,8B]: prithivMLmods/Qwen3-VL-Outpost
✨ Qwen3-VL-30B-A3B-Demo: prithivMLmods/Qwen3-VL-HF-Demo
✨ Collection: prithivMLmods/multimodal-implementations-67c9982ea04b39f0608badb0

Qwen3-VL Abliterated Model Collection [ Version 1.0 ]

✨ Qwen3-VL-8B-Instruct-abliterated: https://huggingface.co/prithivMLmods/Qwen3-VL-8B-Instruct-abliterated
✨ Qwen3-VL-4B-Instruct-abliterated: https://huggingface.co/prithivMLmods/Qwen3-VL-4B-Instruct-abliterated
✨ Qwen3-VL-8B-Thinking-abliterated: https://huggingface.co/prithivMLmods/Qwen3-VL-8B-Thinking-abliterated
✨ Qwen3-VL-4B-Thinking-abliterated: https://huggingface.co/prithivMLmods/Qwen3-VL-4B-Thinking-abliterated
✨ Qwen3-VL-30B-A3B-Instruct-abliterated: https://huggingface.co/prithivMLmods/Qwen3-VL-30B-A3B-Instruct-abliterated
✨ Qwen3-VL-30B-A3B-Thinking-abliterated: https://huggingface.co/prithivMLmods/Qwen3-VL-30B-A3B-Thinking-abliterated

⚡ Collection: https://huggingface.co/collections/prithivMLmods/qwen3-vl-abliteration-oct-1625-68f0e3e567ef076594605fac

Note: This is version 1.0 of the Abliteration of the Qwen3-VL series of models. It may perform sub-optimally in some cases. If you encounter any issues, please open a discussion.
prithivMLmods
posted an update about 1 month ago
Introducing Image-Guard-2.0, an experimental, lightweight vision-language encoder model with a size of 0.1B (<100M parameters), trained on SigLIP2 (siglip2-base-patch16-224). Designed for multi-label image classification tasks, this model functions as an image safety system, serving as an image guard or moderator across a wide range of categories, from anime to realistic imagery.

⚡ Blog article: https://huggingface.co/blog/prithivMLmods/image-guard-models

It also performs strict moderation and filtering of artificially synthesized content, demonstrating strong detection and handling of explicit images. Image-Guard-2.0 delivers robust performance in streamlined scenarios, ensuring reliable and effective classification across diverse visual inputs.
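Since Image-Guard-2.0 is a multi-label classifier, each category is scored independently (a sigmoid per label) instead of competing in a softmax. The sketch below illustrates that kind of post-processing with made-up category names and an illustrative 0.5 threshold; it is not the model's actual label set or code:

```python
import math

# Illustrative post-processing for a multi-label image-safety classifier:
# every category gets an independent sigmoid score, and any category at or
# above the threshold is flagged. Category names here are made up.
def flag_categories(logits, threshold=0.5):
    sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))
    return sorted(name for name, z in logits.items() if sigmoid(z) >= threshold)

flags = flag_categories({"explicit": 2.1, "violence": -1.3, "synthetic": 0.4})
# positive logits land above the 0.5 mark, so "explicit" and "synthetic" are flagged
```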
prithivMLmods
posted an update about 1 month ago
The demo of Qwen3-VL-30B-A3B-Instruct, the next-generation and powerful vision-language model in the Qwen series, delivers comprehensive upgrades across the board, including superior text understanding and generation, deeper visual perception and reasoning, extended context length, enhanced spatial and video-dynamics comprehension, and stronger agent interaction capabilities. 🤗🔥

⚡ Space / App: prithivMLmods/Qwen3-VL-HF-Demo

The model's demo supports a wide range of tasks, including Image Inference, Video Inference, PDF Inference, Image Captioning (VLA), and GIF Inference.

⚡ Collection: prithivMLmods/multimodal-implementations-67c9982ea04b39f0608badb0

Thanks for granting the blazing-fast Zero GPU access, @merve 🙏

⚡ Other Pages

> Github: https://github.com/prithivsakthiur/qwen3-vl-hf-demo
> Multimodal VLMs July'25 : prithivMLmods/multimodal-vlms-until-july25-688312e6b840e1e156f13027
> VL caption - < Sep 15 '25 : prithivMLmods/vl-caption-sep-15-25-68c7f6d737985c63c13e2391
> Multimodal VLMs - Aug'25 : prithivMLmods/multimodal-vlms-aug25-68a56aac39fe8084f3c168bd

To know more about it, visit the app page or the respective model page!!
prithivMLmods
posted an update about 1 month ago
Introducing the next-gen version of DeepCaption-VLA (v2.0), an advanced multimodal model based on Qwen2.5-VL, specialized for Image Captioning and Vision Language Attribution (VLA). This enhanced release focuses on generating precise, attribute-rich captions that capture visual properties, object attributes, and scene details across diverse image types and aspect ratios. Version 2.0 introduces significant improvements in multilingual inference, delivering higher captioning quality and attribution accuracy in languages including Chinese (Zh), Thai (Th), and more.

🤗 DeepCaption-VLA (v2.0) : prithivMLmods/DeepCaption-VLA-V2.0-7B
🫱 Collection : prithivMLmods/vlm-20-oct-0825-68e606aa6e3993be8a3b1d51
⭐ GitHub (notebook) : https://github.com/PRITHIVSAKTHIUR/Multimodal-Outpost-Notebooks/blob/main/DeepCaption_VLA_V2_0_7B/DeepCaption_VLA_V2_0_7Bipynb.ipynb

Other Pages ⚡

➥ Multimodal VLMs July'25 : prithivMLmods/multimodal-vlms-until-july25-688312e6b840e1e156f13027
➥ VL caption - < Sep 15 '25 : prithivMLmods/vl-caption-sep-15-25-68c7f6d737985c63c13e2391
➥ Multimodal VLMs - Aug'25 : prithivMLmods/multimodal-vlms-aug25-68a56aac39fe8084f3c168bd

To know more about it, visit the app page or the respective model page!!
prithivMLmods
posted an update about 1 month ago
I've built the new Image Studio with the Gemini image-generation models for multiple tasks: the imagen-4.0-fast-generate-001 model for Image Generation (Text-to-Image) and Multi-Image Editing (Image-to-Image), and Draw-to-Image powered by gemini-2.5-flash-image (aka Nano Banana).

โญ Gemini-Image-Studio: prithivMLmods/Gemini-Image-Studio (Latest)
๐Ÿคž Old-App: prithivMLmods/Nano-Banana-AIO
๐ŸฅŠ GitHub: https://github.com/prithivsakthiur/gemini-image-studio-hf

To proceed, you need to add your Gemini API key. Your API key is stored only for the duration of your session and will be lost when you reload or exit the page. It will not be shared or exposed anywhere.
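The session-scoped key handling described above can be sketched as an in-memory per-session store; everything below (class and method names) is a hypothetical illustration, not the app's actual implementation:

```python
# Hypothetical session-scoped key store matching the behavior described above:
# the key exists only in memory for the session, and ending the session
# (reload/exit) discards it. Names are illustrative, not the app's code.
class SessionKeyStore:
    def __init__(self):
        self._keys = {}                    # session_id -> api_key, memory only

    def set_key(self, session_id, api_key):
        self._keys[session_id] = api_key

    def get_key(self, session_id):
        return self._keys.get(session_id)  # None once the session is gone

    def end_session(self, session_id):
        self._keys.pop(session_id, None)   # key is irrecoverable afterwards

store = SessionKeyStore()
store.set_key("sess-1", "example-key")
store.end_session("sess-1")                # simulates a page reload/exit
```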
prithivMLmods
posted an update about 1 month ago
Try the Hugging Face Space demo for Logics-MLLM/Logics-Parsing, the latest multimodal VLM from the Logics Team at Alibaba Group. It enables end-to-end document parsing with precise content extraction in Markdown format, and it also generates a clean HTML representation of the document while preserving its logical structure. 🤗🔥

Additionally, I've integrated one of my recent works, prithivMLmods/Gliese-OCR-7B-Post1.0, which also excels at document comprehension.

โญ Space / App : prithivMLmods/VLM-Parsing
๐Ÿ“„ Technical Report by the Logics Team, Alibaba Group : Logics-Parsing Technical Report (2509.19760)
๐Ÿ–– MM: VLM-Parsing: prithivMLmods/mm-vlm-parsing-68e33e52bfb9ae60b50602dc
โšก Collections : prithivMLmods/multimodal-implementations-67c9982ea04b39f0608badb0

Other Pages:

➔ Multimodal VLMs - July'25 : prithivMLmods/multimodal-vlms-until-july25-688312e6b840e1e156f13027
➔ Multimodal VLMs - Aug'25 : prithivMLmods/multimodal-vlms-aug25-68a56aac39fe8084f3c168bd
➔ VL caption - < Sep 15 '25 : prithivMLmods/vl-caption-sep-15-25-68c7f6d737985c63c13e2391

To know more about it, visit the app page or the respective model page!!
prithivMLmods
posted an update about 2 months ago
Try Banana Zoom, an advanced image-enhancement web app that lets users select regions of an image for AI-powered upscaling and detail refinement. Using Google's Nano Banana model, it analyzes selections, generates context-aware enhancements, and produces high-resolution outputs. Simply drag and drop or upload images, make precise or fixed-size selections, and watch improvements in real time with smooth zoom and pixel-dissolve effects.

Space / App: prithivMLmods/Banana-Zoom
Collection: https://huggingface.co/collections/prithivMLmods/image-gen-apps-diffusion-lastupdated-09-23-68a2f4c5ef3e5e394eacc20a
GitHub: https://github.com/prithivsakthiur/banana-zoom
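At its core, the select-then-enhance flow maps a selection box to a clamped crop and a target output size. The arithmetic below is an illustrative sketch under that assumption, not code from the app:

```python
# Illustrative arithmetic for the "select a region, enhance it" flow:
# clamp the selection to the image bounds, then compute the output size
# for a given upscale factor. Numbers are examples, not the app's code.
def crop_and_target(img_w, img_h, box, upscale=4):
    x0, y0, x1, y1 = box
    # clamp each corner to the image and normalize corner order
    x0, x1 = sorted((max(0, min(x0, img_w)), max(0, min(x1, img_w))))
    y0, y1 = sorted((max(0, min(y0, img_h)), max(0, min(y1, img_h))))
    crop = (x0, y0, x1, y1)
    target = ((x1 - x0) * upscale, (y1 - y0) * upscale)
    return crop, target

crop, target = crop_and_target(1024, 768, (200, 100, 456, 300), upscale=4)
# the 256x200 selection maps to a 1024x800 enhanced patch at 4x
```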

Your API key is automatically discarded once you refresh or exit the app, so each user's key is cycled in this way.
prithivMLmods
posted an update about 2 months ago
Photo-Mate-i2i: a space for experimenting with Kontext adapters for image manipulation, including Photo-Restore-i2i, PhotoCleanser-i2i, Polaroid-Warm-i2i, Yarn-Photo-i2i, Monochrome-Pencil, and more. Try out the demo, and to learn more, visit the app page or the respective model pages!

⚡ Demo: prithivMLmods/Photo-Mate-i2i
⚙️ How to Use: prithivMLmods/Photo-Mate-i2i#2
👨‍🔧 i2i-Kontext (Experimental LoRAs): prithivMLmods/i2i-kontext-exp-68ce573b5c0623476b636ec7

prithivMLmods
posted an update about 2 months ago
Dropping some experimental adapters for FLUX.1-Kontext-dev, including Photo-Restore-i2i, PhotoCleanser-i2i, Polaroid-Warm-i2i, Yarn-Photo-i2i, and Monochrome-Pencil. These were trained under various settings with minimal image pairs to achieve optimal results. The end-result image pairs of the dataset were synthesized using Gemini-2.5-Flash-Image-Preview and other models. 🤗✨

prithivMLmods/PhotoCleanser-i2i: Remove objects while preserving the rest of the image.
prithivMLmods/Photo-Restore-i2i: Restore old photos into moderately colorized, detailed images.
prithivMLmods/Polaroid-Warm-i2i: Seamless vintage Polaroid-style images with warm, faded tones.
prithivMLmods/Yarn-Photo-i2i: Convert images into yarn-stitched artwork while retaining key details.
prithivMLmods/Monochrome-Pencil: Turn images into monochrome pencil sketches while keeping original features.

✨ Note: All the above models share the same auto-labeling multimodal VLM captioning model, prithivMLmods/DeepCaption-VLA-7B, which is used for refining edit instructions and accurately understanding attributions for the generations.

✨ Collection: prithivMLmods/i2i-kontext-exp-68ce573b5c0623476b636ec7

To know more about it, visit the app page or the respective model page!!
prithivMLmods
posted an update about 2 months ago
Many of you pinged me asking to make nano-banana-aio available on hf.co/spaces, so I've ported the app's tech stack to make it deployable on Spaces. (It can be accessed with your own Gemini API key.) 🤗⭐️

✦ Yes, it is now available on Spaces: prithivMLmods/Nano-Banana-AIO

The Nano Banana AIO (All-in-One) app offers seamless image-manipulation features, including single/multiple-image adaptation, a canvas for free-style drawing to creative image generation, and standard text-to-image generation.

All in One Banana for you! 😉
prithivMLmods
posted an update about 2 months ago
I'm a Hugging Face Fellow now, guys! 🤗❤️

With the same passion, trust, and momentum to contribute to the community, I'm excited to do some amazing things to wrap up Q3 and Q4 of 2025. And importantly, I've been lucky enough to receive some knowledge and guidance from @merve to build open-source demos and more. Thank you for the belief.

Thank you, much love.
Long live open source!

โ€” Prithiv
prithivMLmods
posted an update 2 months ago
Introducing Gliese-OCR-7B-Post1.0, a document content-structure retrieval VLM designed for content extraction (OCR) and summarization. This is the third model in the Camel Doc OCR VLM series, following Camel-Doc-OCR-062825. The new version fixes formal table-reconstruction issues in both English and Chinese, achieving optimal performance for long-context inference. This model also shows significant improvements in LaTeX and Markdown rendering for OCR tasks.

🤗 Gliese-OCR-7B-Post1.0 : prithivMLmods/Gliese-OCR-7B-Post1.0
📌 Gliese-Post1.0 Collection : prithivMLmods/gliese-post10-68c52c4a6ca4935f5259a6d7
⬅️ Previous Version : prithivMLmods/Camel-Doc-OCR-062825
🧨 Gliese-OCR-7B-Post1.0 (4-bit) Notebook Demo on T4 : prithivMLmods/Gliese-OCR-7B-Post1.0
📖 GitHub [Gliese-OCR-7B-Post1.0(4-bit)-reportlab] : https://tinyurl.com/ys7zuerc

Other Collections:

➔ Multimodal Implementations : prithivMLmods/multimodal-implementations-67c9982ea04b39f0608badb0
➔ Multimodal VLMs - Aug'25 : prithivMLmods/multimodal-vlms-aug25-68a56aac39fe8084f3c168bd
➔ Multimodal VLMs - July'25 : prithivMLmods/multimodal-vlms-until-july25-688312e6b840e1e156f13027

To know more about it, visit the app page or the respective model page!!