CLIP ViT-B/16 Vision Encoder (Chinese)

Source: iic/multi-modal_clip-vit-base-patch16_zh (Alibaba DAMO Academy Chinese CLIP)

Only the Vision Encoder is retained; the Text Encoder and all other unrelated weights have been removed. The weights are stored in float16 precision in Hugging Face CLIPVisionModel-compatible safetensors format.
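For reference, the final casting-and-saving step of such an export might look like the sketch below. It assumes the vision weights have already been mapped into a standard CLIPVisionModel; the directory names are placeholders, not the actual paths used for this repository.

```python
from transformers import CLIPVisionModel, CLIPImageProcessor

# Hypothetical input: the Chinese CLIP vision weights already mapped into a
# standard CLIPVisionModel checkpoint (directory names are placeholders).
model = CLIPVisionModel.from_pretrained("chinese-clip-vision-fp32")
processor = CLIPImageProcessor.from_pretrained("chinese-clip-vision-fp32")

model = model.half()  # cast to float16: 85.8M params x 2 bytes ≈ 164 MB
model.save_pretrained("clip-base-vision-encoder", safe_serialization=True)  # writes model.safetensors
processor.save_pretrained("clip-base-vision-encoder")
```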


Model Info

| Item | Value |
| --- | --- |
| Architecture | ViT-B/16 |
| Parameters | 85.8M |
| Hidden Size | 768 |
| Layers | 12 |
| Patch Size | 16 |
| Input Resolution | 224×224 |
| Output Tokens | 196 (14×14, CLS removed) |
| Weight Precision | float16 |
| File Size | ~164 MB |

Usage

```python
from transformers import CLIPVisionModel, CLIPImageProcessor

# "path_to/clip-base-vision-encoder" is the local directory holding this checkpoint
model = CLIPVisionModel.from_pretrained("path_to/clip-base-vision-encoder")
processor = CLIPImageProcessor.from_pretrained("path_to/clip-base-vision-encoder")
```
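
As a quick check of the numbers in the Model Info table, the sketch below runs a dummy image through the encoder and drops the CLS token to recover the 196 patch embeddings; the local path is the same placeholder as above.

```python
import torch
from PIL import Image
from transformers import CLIPVisionModel, CLIPImageProcessor

model = CLIPVisionModel.from_pretrained("path_to/clip-base-vision-encoder")
processor = CLIPImageProcessor.from_pretrained("path_to/clip-base-vision-encoder")

image = Image.new("RGB", (640, 480))   # any RGB image; the processor resizes/crops to 224×224
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

hidden = outputs.last_hidden_state     # (1, 197, 768): CLS token + 14×14 = 196 patch tokens
patch_tokens = hidden[:, 1:, :]        # (1, 196, 768): CLS removed, matching "Output Tokens 196"
print(patch_tokens.shape)              # torch.Size([1, 196, 768])
```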