MiniMind-V
Source: iic/multi-modal_clip-vit-base-patch16_zh (Alibaba DAMO Academy Chinese CLIP). Only the Vision Encoder is retained; the Text Encoder and all other unrelated weights have been removed. The weights are saved in float16 precision as safetensors compatible with the HuggingFace CLIPVisionModel format.
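As a rough illustration, an extraction of this kind can be reproduced with the standard `transformers` API. The sketch below is an assumption, not the exact script used here: it presumes the source checkpoint is already in a HuggingFace CLIP layout, and the paths are placeholders.

```python
import torch
from transformers import CLIPVisionModel

# Load only the vision tower from a full CLIP checkpoint (text-encoder weights are ignored),
# cast to float16, and save as safetensors. "path_to/..." paths are placeholders.
vision = CLIPVisionModel.from_pretrained("path_to/full-chinese-clip")
vision = vision.to(torch.float16)
vision.save_pretrained("path_to/clip-base-vision-encoder", safe_serialization=True)
```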
| Item | Value |
|---|---|
| Architecture | ViT-B/16 |
| Parameters | 85.8M |
| Hidden Size | 768 |
| Layers | 12 |
| Patch Size | 16 |
| Input Resolution | 224×224 |
| Output Tokens | 196 (14×14, CLS removed) |
| Weight Precision | float16 |
| File Size | ~164MB |
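The table values map directly onto the fields of a `CLIPVisionConfig`. A quick sanity-check sketch (the path is a placeholder):

```python
from transformers import CLIPVisionConfig

# Verify the saved config matches the table above.
config = CLIPVisionConfig.from_pretrained("path_to/clip-base-vision-encoder")
assert config.hidden_size == 768
assert config.num_hidden_layers == 12
assert config.patch_size == 16
assert config.image_size == 224
# Patch tokens per image: (224 // 16) ** 2 == 196, plus one CLS token before removal.
```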
```python
from transformers import CLIPVisionModel, CLIPImageProcessor

# Load the pruned vision encoder and its image processor (replace the path as needed).
model = CLIPVisionModel.from_pretrained("path_to/clip-base-vision-encoder")
processor = CLIPImageProcessor.from_pretrained("path_to/clip-base-vision-encoder")
```
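A minimal end-to-end sketch of feature extraction, assuming a local image file (`example.jpg` is a placeholder) and that the CLS token is dropped to obtain the 196 patch tokens listed above:

```python
import torch
from PIL import Image
from transformers import CLIPVisionModel, CLIPImageProcessor

model = CLIPVisionModel.from_pretrained("path_to/clip-base-vision-encoder", torch_dtype=torch.float16)
processor = CLIPImageProcessor.from_pretrained("path_to/clip-base-vision-encoder")

image = Image.open("example.jpg")                       # placeholder image path
inputs = processor(images=image, return_tensors="pt")   # resizes / normalizes to 224x224
with torch.no_grad():
    outputs = model(pixel_values=inputs["pixel_values"].to(torch.float16))

hidden = outputs.last_hidden_state   # (1, 197, 768): CLS token + 14x14 patch tokens
patch_tokens = hidden[:, 1:, :]      # drop CLS -> (1, 196, 768)
```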
Base model: openai/clip-vit-base-patch16