Luke-Bergen committed on
Commit 7e32238 · verified · 1 Parent(s): 13ca277

Update README.md

Files changed (1): README.md (+127, -31)

README.md CHANGED
@@ -1,18 +1,21 @@
- # Mineral Nano 1

- Mineral Nano 1 is a compact, efficient language model designed for fast inference and low-resource environments.

  ## Model Details

  - **Model Name:** mineral-nano-1
- - **Model Type:** Causal Language Model
- - **Parameters:** ~85M parameters
  - **Context Length:** 2048 tokens
- - **Architecture:** Transformer-based decoder with 12 layers
  - **Precision:** BFloat16

  ## Architecture

  - **Hidden Size:** 768
  - **Intermediate Size:** 3072
  - **Attention Heads:** 12
@@ -20,59 +23,148 @@ Mineral Nano 1 is a compact, efficient language model designed for fast inferenc
  - **Vocabulary Size:** 32,000 tokens
  - **Positional Encoding:** RoPE (Rotary Position Embeddings)

  ## Usage

- ### Basic Text Generation

  ```python
- from transformers import AutoModelForCausalLM, AutoTokenizer

  model_name = "your-username/mineral-nano-1"
- tokenizer = AutoTokenizer.from_pretrained(model_name)
- model = AutoModelForCausalLM.from_pretrained(model_name)

- prompt = "Once upon a time"
- inputs = tokenizer(prompt, return_tensors="pt")
  outputs = model.generate(**inputs, max_new_tokens=100)
- print(tokenizer.decode(outputs[0], skip_special_tokens=True))
  ```

- ### Chat Format

  ```python
  messages = [
-     {"role": "system", "content": "You are a helpful assistant."},
-     {"role": "user", "content": "What is machine learning?"}
  ]

- input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
- inputs = tokenizer(input_text, return_tensors="pt")
- outputs = model.generate(**inputs, max_new_tokens=256, temperature=0.7)
- print(tokenizer.decode(outputs[0], skip_special_tokens=True))
  ```

  ## Training Details

  - **Framework:** PyTorch with Transformers
- - **Training Data:** [Specify your training dataset]
  - **Training Duration:** [Specify training time]
  - **Hardware:** [Specify GPUs used]

  ## Limitations

- - Limited context window (2048 tokens)
- - May produce inconsistent outputs on complex reasoning tasks
- - Best suited for short-form text generation
- - Compact size means reduced capabilities compared to larger models

  ## Intended Use

  This model is designed for:
- - Educational purposes
- - Prototyping and experimentation
  - Low-resource deployment scenarios
- - Fast inference applications
- - Personal projects

  ## License

@@ -81,9 +173,9 @@ This model is designed for:
  ## Citation

  ```bibtex
- @misc{mineral-nano-1,
    author = {Your Name},
-   title = {Mineral Nano 1: A Compact Language Model},
    year = {2025},
    publisher = {HuggingFace},
    url = {https://huggingface.co/your-username/mineral-nano-1}
@@ -92,4 +184,8 @@ This model is designed for:

  ## Contact

- For questions or issues, please open an issue on the model repository.

+ # Mineral Nano 1 Vision

+ Mineral Nano 1 Vision is a compact, efficient vision-language model with multimodal (text + image) capabilities, designed for fast inference in low-resource environments.

  ## Model Details

  - **Model Name:** mineral-nano-1
+ - **Model Type:** Vision-Language Model (VLM)
+ - **Parameters:** ~110M
  - **Context Length:** 2048 tokens
+ - **Architecture:** Transformer-based decoder with vision encoder (12 layers)
  - **Precision:** BFloat16
+ - **Image Resolution:** 224x224
+ - **Modalities:** Text + Images

  ## Architecture

+ ### Language Model
  - **Hidden Size:** 768
  - **Intermediate Size:** 3072
  - **Attention Heads:** 12
  - **Vocabulary Size:** 32,000 tokens
  - **Positional Encoding:** RoPE (Rotary Position Embeddings)

+ ### Vision Encoder
+ - **Image Size:** 224x224
+ - **Patch Size:** 16x16
+ - **Hidden Size:** 768
+ - **Layers:** 12
+ - **Image Tokens:** 196 per image (derivation sketched below)
+ - **Architecture:** ViT-style encoder
+
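The 196 image tokens quoted above follow directly from the patch geometry: a 224x224 input cut into 16x16 patches gives a 14x14 grid. A minimal sketch of that arithmetic (illustrative only, not model code):

```python
# Patch-count arithmetic for the vision encoder (illustrative only).
image_size = 224   # input resolution (224x224)
patch_size = 16    # each patch covers 16x16 pixels

patches_per_side = image_size // patch_size  # 224 // 16 = 14
image_tokens = patches_per_side ** 2         # 14 * 14 = 196
print(patches_per_side, image_tokens)        # 14 196
```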

  ## Usage

+ ### Installation
+
+ ```bash
+ pip install transformers pillow torch
+ ```
+
+ ### Basic Image Understanding

  ```python
+ from transformers import AutoProcessor, AutoModelForVision2Seq
+ from PIL import Image
+ import requests

  model_name = "your-username/mineral-nano-1"
+ processor = AutoProcessor.from_pretrained(model_name)
+ model = AutoModelForVision2Seq.from_pretrained(model_name)
+
+ # Load an image
+ url = "https://example.com/image.jpg"
+ image = Image.open(requests.get(url, stream=True).raw)
+
+ # Prepare inputs
+ prompt = "<image>What is in this image?"
+ inputs = processor(text=prompt, images=image, return_tensors="pt")

+ # Generate response
  outputs = model.generate(**inputs, max_new_tokens=100)
+ response = processor.decode(outputs[0], skip_special_tokens=True)
+ print(response)
+ ```
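Since the card lists BFloat16 precision, loading the weights in that dtype can roughly halve memory use. A hedged variant of the load step (`torch_dtype` is a standard `from_pretrained` argument; whether your hardware handles bf16 efficiently is an assumption):

```python
# Optional: load in bfloat16 to reduce memory, matching the card's stated precision.
import torch
from transformers import AutoModelForVision2Seq

model = AutoModelForVision2Seq.from_pretrained(
    "your-username/mineral-nano-1",
    torch_dtype=torch.bfloat16,  # assumes the runtime/hardware supports bf16
)
```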

+ ### Multiple Images
+
+ ```python
+ from PIL import Image
+
+ images = [
+     Image.open("image1.jpg"),
+     Image.open("image2.jpg")
+ ]
+
+ prompt = "<image>Describe the first image. <image>Now describe the second image."
+ inputs = processor(text=prompt, images=images, return_tensors="pt")
+ outputs = model.generate(**inputs, max_new_tokens=200)
+ print(processor.decode(outputs[0], skip_special_tokens=True))
  ```

+ ### Chat with Images

  ```python
  messages = [
+     {
+         "role": "user",
+         "content": [
+             {"type": "image"},
+             {"type": "text", "text": "What objects are in this image?"}
+         ]
+     }
  ]

+ # Apply chat template
+ text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ inputs = processor(text=text, images=image, return_tensors="pt")
+ outputs = model.generate(**inputs, max_new_tokens=150, do_sample=True, temperature=0.7)
+ print(processor.decode(outputs[0], skip_special_tokens=True))
+ ```
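Extending the chat to a second turn is a natural follow-up. A hedged sketch, reusing the objects from the block above and assuming the chat template accepts assistant turns in the same `{"type": "text"}` content format (not confirmed by the card):

```python
# Multi-turn follow-up (sketch; the assistant-turn content format is an assumption).
# For a decoder-only model, outputs[0] also contains the prompt tokens, so strip them.
reply = processor.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

messages += [
    {"role": "assistant", "content": [{"type": "text", "text": reply}]},
    {"role": "user", "content": [{"type": "text", "text": "Which of those objects is largest?"}]},
]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=text, images=image, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=150, do_sample=True, temperature=0.7)
print(processor.decode(outputs[0], skip_special_tokens=True))
```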

+ ### Local Images
+
+ ```python
+ from PIL import Image
+
+ # Load local image
+ image = Image.open("path/to/your/image.jpg")
+
+ prompt = "<image>Describe what you see in detail."
+ inputs = processor(text=prompt, images=image, return_tensors="pt")
+ outputs = model.generate(**inputs, max_new_tokens=256)
+ print(processor.decode(outputs[0], skip_special_tokens=True))
  ```

  ## Training Details

  - **Framework:** PyTorch with Transformers
+ - **Training Data:** Text + Image pairs
  - **Training Duration:** [Specify training time]
  - **Hardware:** [Specify GPUs used]
+ - **Vision Encoder:** Pretrained ViT encoder, fine-tuned jointly with the language model
+
+ ## Capabilities
+
+ ✅ Image description and captioning
+ ✅ Visual question answering
+ ✅ Object detection and recognition
+ ✅ Scene understanding
+ ✅ Multi-image reasoning
+ ✅ OCR and text extraction from images (see the sketch below)
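The OCR capability listed above is exercised through the same API as the other examples. A hedged sketch reusing the `processor` and `model` objects from the Usage section; the prompt wording and file name are illustrative assumptions, not a documented interface:

```python
# OCR-style prompting (sketch; prompt wording and file name are illustrative).
from PIL import Image

document = Image.open("receipt.jpg")  # any image that contains text
prompt = "<image>Read out all of the text visible in this image."
inputs = processor(text=prompt, images=document, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=200)
print(processor.decode(outputs[0], skip_special_tokens=True))
```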

  ## Limitations

+ - Limited to 224x224 resolution images
+ - Context window of 2048 tokens, including image tokens (see the budget sketch below)
+ - May struggle with fine-grained details
+ - Best for general image understanding tasks
+ - Compact size means reduced capabilities compared to larger VLMs
+ - Limited multilingual vision capabilities
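Because the 2048-token window includes the 196 tokens each image consumes, the budget left for text shrinks with every image. A minimal bookkeeping sketch (illustrative only, using the numbers stated on this card):

```python
# Rough token-budget bookkeeping (illustrative only).
CONTEXT_LENGTH = 2048     # total window, per the model card
TOKENS_PER_IMAGE = 196    # image tokens per 224x224 input

def remaining_text_budget(num_images: int) -> int:
    """Tokens left for text (prompt + generation) after image tokens."""
    return CONTEXT_LENGTH - num_images * TOKENS_PER_IMAGE

print(remaining_text_budget(1))  # 1852
print(remaining_text_budget(2))  # 1656
```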

  ## Intended Use

  This model is designed for:
+ - Educational purposes and learning VLM architectures
+ - Prototyping multimodal applications
  - Low-resource deployment scenarios
+ - Fast inference with vision capabilities
+ - Mobile and edge device applications
+ - Personal projects requiring image understanding
+
+ ## Image Preprocessing
+
+ Images are automatically:
+ - Resized to 224x224
+ - Normalized with CLIP-style statistics
+ - Converted to RGB
+ - Split into 16x16 patches (196 total patches), as sketched below
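For readers who want to see those steps concretely, here is a minimal manual-preprocessing sketch. The bundled processor does all of this automatically; the CLIP normalization constants below are the commonly published ones and are an assumption about this processor's exact statistics:

```python
# Manual preprocessing sketch; the model's processor does this automatically.
import numpy as np
from PIL import Image

# Commonly published CLIP stats (assumed, not confirmed for this model).
CLIP_MEAN = np.array([0.48145466, 0.4578275, 0.40821073])
CLIP_STD = np.array([0.26862954, 0.26130258, 0.27577711])

image = Image.open("example.jpg").convert("RGB").resize((224, 224))
pixels = np.asarray(image, dtype=np.float32) / 255.0  # scale to [0, 1]
pixels = (pixels - CLIP_MEAN) / CLIP_STD               # CLIP-style normalization

# Split into 16x16 patches: (224, 224, 3) -> (196, 16, 16, 3).
patches = pixels.reshape(14, 16, 14, 16, 3).transpose(0, 2, 1, 3, 4).reshape(196, 16, 16, 3)
print(patches.shape)  # (196, 16, 16, 3)
```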
+
+ ## Performance Tips
+
+ - Use square images when possible for best results
+ - Ensure images are clear and well-lit
+ - Keep prompts concise and specific
+ - Use batch processing for multiple images (see the sketch below)
+ - Enable `use_cache=True` for faster generation
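To illustrate the last two tips, a hedged batching sketch reusing `processor` and `model` from the Usage section. That the processor accepts lists with `padding=True` is an assumption about its underlying tokenizer, not documented behavior of this card:

```python
# Batch several image+prompt pairs in one generate() call (sketch;
# list inputs and padding=True are assumptions about this processor).
from PIL import Image

images = [Image.open("a.jpg"), Image.open("b.jpg")]
prompts = ["<image>Caption this image.", "<image>Caption this image."]

inputs = processor(text=prompts, images=images, padding=True, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, use_cache=True)
for seq in outputs:
    print(processor.decode(seq, skip_special_tokens=True))
```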

  ## License

  ## Citation

  ```bibtex
+ @misc{mineral-nano-1-vision,
    author = {Your Name},
+   title = {Mineral Nano 1 Vision: A Compact Vision-Language Model},
    year = {2025},
    publisher = {HuggingFace},
    url = {https://huggingface.co/your-username/mineral-nano-1}
  }
  ```

  ## Contact

+ For questions or issues, please open an issue on the model repository.
+
+ ## Acknowledgments
+
+ This model builds upon research in vision transformers and multimodal learning.