takumi123xxx committed

Commit 3cc029a · verified · 1 Parent(s): 82f16c2

Upload README.md with huggingface_hub

Files changed (1):

  1. README.md +216 -127
README.md CHANGED
@@ -1,207 +1,296 @@
  ---
- base_model: Qwen/Qwen3-VL-32B-Instruct
- library_name: peft
- pipeline_tag: text-generation
  tags:
- - base_model:adapter:Qwen/Qwen3-VL-32B-Instruct
- - lora
- - transformers
  ---

- # Model Card for Model ID
-
- <!-- Provide a quick summary of what the model is/does. -->
-
- ## Model Details
-
- ### Model Description
-
- <!-- Provide a longer summary of what this model is. -->
-
- - **Developed by:** [More Information Needed]
- - **Funded by [optional]:** [More Information Needed]
- - **Shared by [optional]:** [More Information Needed]
- - **Model type:** [More Information Needed]
- - **Language(s) (NLP):** [More Information Needed]
- - **License:** [More Information Needed]
- - **Finetuned from model [optional]:** [More Information Needed]
-
- ### Model Sources [optional]
-
- <!-- Provide the basic links for the model. -->
-
- - **Repository:** [More Information Needed]
- - **Paper [optional]:** [More Information Needed]
- - **Demo [optional]:** [More Information Needed]
-
- ## Uses
-
- <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
-
- ### Direct Use
-
- <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
-
- [More Information Needed]
-
- ### Downstream Use [optional]
-
- <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
-
- [More Information Needed]
-
- ### Out-of-Scope Use
-
- <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
-
- [More Information Needed]
-
- ## Bias, Risks, and Limitations
-
- <!-- This section is meant to convey both technical and sociotechnical limitations. -->
-
- [More Information Needed]
-
- ### Recommendations
-
- <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
-
- Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
-
- ## How to Get Started with the Model
-
- Use the code below to get started with the model.
-
- [More Information Needed]
-
- ## Training Details
-
- ### Training Data
-
- <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
-
- [More Information Needed]
-
- ### Training Procedure
-
- <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
-
- #### Preprocessing [optional]
-
- [More Information Needed]
-
- #### Training Hyperparameters
-
- - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
-
- #### Speeds, Sizes, Times [optional]
-
- <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
-
- [More Information Needed]
-
- ## Evaluation
-
- <!-- This section describes the evaluation protocols and provides the results. -->
-
- ### Testing Data, Factors & Metrics
-
- #### Testing Data
-
- <!-- This should link to a Dataset Card if possible. -->
-
- [More Information Needed]
-
- #### Factors
-
- <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
-
- [More Information Needed]
-
- #### Metrics
-
- <!-- These are the evaluation metrics being used, ideally with a description of why. -->
-
- [More Information Needed]
-
- ### Results
-
- [More Information Needed]
-
- #### Summary
-
- ## Model Examination [optional]
-
- <!-- Relevant interpretability work for the model goes here -->
-
- [More Information Needed]
-
- ## Environmental Impact
-
- <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
-
- Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
-
- - **Hardware Type:** [More Information Needed]
- - **Hours used:** [More Information Needed]
- - **Cloud Provider:** [More Information Needed]
- - **Compute Region:** [More Information Needed]
- - **Carbon Emitted:** [More Information Needed]
-
- ## Technical Specifications [optional]
-
- ### Model Architecture and Objective
-
- [More Information Needed]
-
- ### Compute Infrastructure
-
- [More Information Needed]
-
- #### Hardware
-
- [More Information Needed]
-
- #### Software
-
- [More Information Needed]
-
- ## Citation [optional]
-
- <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
-
- **BibTeX:**
-
- [More Information Needed]
-
- **APA:**
-
- [More Information Needed]
-
- ## Glossary [optional]
-
- <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
-
- [More Information Needed]
-
- ## More Information [optional]
-
- [More Information Needed]
-
- ## Model Card Authors [optional]
-
- [More Information Needed]
-
- ## Model Card Contact
-
- [More Information Needed]
-
- ### Framework versions
-
- - PEFT 0.18.0
  ---
+ license: apache-2.0
+ language:
+ - ja
+ - en
+ library_name: transformers
+ pipeline_tag: image-text-to-text
  tags:
+ - vision
+ - vlm
+ - qwen
+ - lora
+ - document-understanding
+ - form-detection
+ - japanese
+ base_model: Qwen/Qwen3-VL-32B-Instruct
+ datasets:
+ - hand-dot/pdfme-form-field-dataset
  ---

+ # PDFme Form Field Detector (32B)
+
+ **Detects the form fields that applicants need to fill in on Japanese documents.**
+
+ This model is fine-tuned from [Qwen3-VL-32B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-32B-Instruct) using QLoRA to detect input fields in Japanese application forms, registration documents, and other official paperwork.
+
+ ## What This Model Does
+
+ Given an image of a Japanese document, this model identifies the bounding boxes of form fields that **applicants/customers** should fill in, while **excluding fields meant for staff/officials**.
+
+ ### Example Use Cases
+
+ - Automating form digitization
+ - Building PDF form generators
+ - Creating accessibility tools for document processing
+
+ ## Model Details
+
+ | Item | Value |
+ |------|-------|
+ | Base Model | Qwen/Qwen3-VL-32B-Instruct |
+ | Fine-tuning Method | QLoRA (4-bit quantization + LoRA) |
+ | Training Data | [hand-dot/pdfme-form-field-dataset](https://huggingface.co/datasets/hand-dot/pdfme-form-field-dataset) (90 samples, augmented) |
+ | Output Format | JSON with normalized bbox coordinates (0-1000) |
+
+ ## Performance
+
+ ### Evaluation Results (IoU ≥ 0.5)
+
+ | Metric | 32B Model | 8B Model | Description |
+ |--------|-----------|----------|-------------|
+ | **Recall** | **13.56%** | 18.08% | Share of ground-truth fields detected |
+ | **Precision** | **5.24%** | 7.90% | Share of predictions that are correct |
+ | **Average IoU** | **0.2163** | 0.2209 | Mean overlap between predicted and ground-truth boxes |
+ | Matches | 24/177 | 32/177 | Matched predictions |
+ | Predictions | 458 | 405 | Total predictions |
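+
+ The sketch below illustrates how these numbers are defined: a prediction counts as a match when its IoU with an unmatched ground-truth box is at least 0.5, and recall/precision follow from the match count. It is a minimal illustration of the metric, assuming `[x1, y1, x2, y2]` boxes on the 0-1000 scale; it is not the exact evaluation script used for this card.
+
+ ```python
+ def iou(a, b):
+     # Boxes are [x1, y1, x2, y2] on the 0-1000 normalized scale.
+     ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
+     ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
+     inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
+     union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
+     return inter / union if union else 0.0
+
+ def match_at_threshold(preds, gts, thr=0.5):
+     # Greedy one-to-one matching of predictions to ground-truth boxes at IoU >= thr.
+     matched, used = 0, set()
+     for p in preds:
+         best, best_iou = None, 0.0
+         for i, g in enumerate(gts):
+             if i not in used and iou(p, g) > best_iou:
+                 best, best_iou = i, iou(p, g)
+         if best is not None and best_iou >= thr:
+             used.add(best)
+             matched += 1
+     recall = matched / len(gts) if gts else 0.0
+     precision = matched / len(preds) if preds else 0.0
+     return matched, recall, precision
+ ```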
+
+ ### Per-Sample Results (Best performers)
+
+ | Sample | Recall | Precision | IoU | Evaluation |
+ |--------|--------|-----------|-----|------------|
+ | **#2** | **60.00%** | **69.23%** | **0.507** | ⭐ Excellent |
+ | **#7** | 33.33% | 25.00% | 0.380 | Good |
+ | **#9** | 18.18% | 7.69% | 0.313 | Improved |
+
+ ### Training Progress
+
+ | Epoch | Loss | Notes |
+ |-------|------|-------|
+ | Start | 18.74 | - |
+ | 0.5 | 11.13 | Rapid decrease |
+ | 1.0 | 6.72 | Stabilizing |
+ | 2.0 | 5.75 | Converging |
+ | 3.0 | **5.59** | Final |
+
+ **Loss improved: 18.74 → 5.59 (70% reduction)**
+
+ ### Key Finding
+
+ Despite being 4x larger than the 8B model, the 32B model achieved similar accuracy. **The dataset (10 original samples) is the bottleneck**, not model capacity.
+
+ ### Current Limitations
+
+ 1. **Small training dataset** - 10 original samples, augmented to 90
+ 2. **Over-detection tendency** - 458 predictions vs. 177 ground-truth fields (2.6x)
+ 3. **Localization precision** - an average IoU of 0.22 leaves room for improvement
+
+ ## Quick Start
+
+ ### Installation
+
+ ```bash
+ pip install transformers peft torch accelerate bitsandbytes
+ ```
+
+ ### Inference
+
+ ```python
+ import torch
+ from PIL import Image
+ from transformers import AutoProcessor, AutoModelForImageTextToText
+ from peft import PeftModel
+
+ # Load the base model (32B) and attach the LoRA adapter
+ base_model = "Qwen/Qwen3-VL-32B-Instruct"
+ model = AutoModelForImageTextToText.from_pretrained(
+     base_model,
+     torch_dtype=torch.bfloat16,
+     device_map="auto",
+     trust_remote_code=True,
+ )
+ model = PeftModel.from_pretrained(model, "takumi123xxx/pdfme-form-field-detector-lora-32b")
+ processor = AutoProcessor.from_pretrained(base_model, trust_remote_code=True)
+
+ # Prepare prompts
+ system_prompt = """You are an expert at analyzing Japanese documents.
+ There are two types of input fields:
+ 1. Fields for applicants/customers to fill → Target for detection
+ 2. Fields for staff/officials to fill → Exclude from detection"""
+
+ user_prompt = """Detect all input fields that applicants should fill in this image.
+ Exclude fields for staff.
+ Return JSON with bbox coordinates (0-1000 normalized)."""
+
+ # Load the document image
+ image = Image.open("your_document.png").convert("RGB")
+
+ messages = [
+     {"role": "system", "content": system_prompt},
+     {"role": "user", "content": [
+         {"type": "image", "image": image},
+         {"type": "text", "text": user_prompt},
+     ]},
+ ]
+
+ text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ inputs = processor(text=text, images=image, return_tensors="pt").to(model.device)
+
+ output = model.generate(**inputs, max_new_tokens=2048)
+ result = processor.decode(output[0], skip_special_tokens=True)
+ print(result)
+ ```
+
+ ### Output Format
+
+ ```json
+ {
+   "applicant_fields": [
+     {"bbox": [100, 200, 500, 250]},
+     {"bbox": [100, 300, 500, 350]}
+   ],
+   "count": 2
+ }
+ ```
+
+ - `bbox`: `[x1, y1, x2, y2]` normalized to a 0-1000 scale
+ - To convert to pixels: `pixel_x = bbox_x / 1000 * image_width` (see the sketch below)
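+
+ A minimal sketch of turning the generated text into pixel coordinates, assuming the output contains a JSON object like the one above; the regex-based extraction and the reuse of `result` and `image` from the inference snippet are illustrative, not part of the released code:
+
+ ```python
+ import json
+ import re
+
+ def to_pixels(bbox, image_width, image_height):
+     # bbox is [x1, y1, x2, y2] on the 0-1000 normalized scale.
+     x1, y1, x2, y2 = bbox
+     return [
+         x1 / 1000 * image_width,
+         y1 / 1000 * image_height,
+         x2 / 1000 * image_width,
+         y2 / 1000 * image_height,
+     ]
+
+ # Pull the JSON object out of the generated text and convert each detected field.
+ match = re.search(r"\{.*\}", result, re.DOTALL)  # `result` comes from the inference snippet above
+ fields = json.loads(match.group(0))["applicant_fields"] if match else []
+ w, h = image.size  # PIL image loaded in the inference snippet
+ pixel_boxes = [to_pixels(f["bbox"], w, h) for f in fields]
+ print(pixel_boxes)
+ ```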
+
+ ## Demo
+
+ Try the model on Hugging Face Spaces:
+ [takumi123xxx/pdfme-form-field-detector](https://huggingface.co/spaces/takumi123xxx/pdfme-form-field-detector)
+
+ ## Deployment (Inference Endpoints)
+
+ ### ⚠️ Important: Instance Selection
+
+ This model is a **32B parameter** Vision-Language Model. Please note the following when deploying:
+
+ | Condition | Recommended Instance | VRAM | Notes |
+ |-----------|---------------------|------|-------|
+ | **With 4-bit quantization** | `nvidia-a100` | 40GB+ | ⭐ Recommended |
+ | **Without 4-bit quantization** | `nvidia-a100-80g` | 80GB | Requires more VRAM |
+
+ ### Environment Variables
+
+ | Variable | Default | Description |
+ |----------|---------|-------------|
+ | `BASE_MODEL` | `Qwen/Qwen3-VL-32B-Instruct` | Base model |
+ | `USE_LORA` | `true` | Use LoRA adapter |
+ | `USE_4BIT` | `true` | Use 4-bit quantization (recommended) |
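+
+ For reference, loading the base model in 4-bit NF4 (what the `USE_4BIT=true` default implies) can be done with `BitsAndBytesConfig`. This is a sketch of one possible setup under those assumptions, not the exact endpoint handler code:
+
+ ```python
+ import torch
+ from transformers import AutoModelForImageTextToText, AutoProcessor, BitsAndBytesConfig
+ from peft import PeftModel
+
+ # 4-bit NF4 quantization, matching the quantization used during training.
+ bnb_config = BitsAndBytesConfig(
+     load_in_4bit=True,
+     bnb_4bit_quant_type="nf4",
+     bnb_4bit_compute_dtype=torch.bfloat16,
+ )
+
+ base_model = "Qwen/Qwen3-VL-32B-Instruct"
+ model = AutoModelForImageTextToText.from_pretrained(
+     base_model,
+     quantization_config=bnb_config,
+     device_map="auto",
+     trust_remote_code=True,
+ )
+ model = PeftModel.from_pretrained(model, "takumi123xxx/pdfme-form-field-detector-lora-32b")
+ processor = AutoProcessor.from_pretrained(base_model, trust_remote_code=True)
+ ```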
+
+ ## Training Details
+
+ - **Base Model**: Qwen/Qwen3-VL-32B-Instruct
+ - **Epochs**: 3
+ - **Batch Size**: 1 (with gradient accumulation of 8)
+ - **Learning Rate**: 2e-4
+ - **LoRA Rank**: 16
+ - **LoRA Alpha**: 32
+ - **Quantization**: 4-bit NF4
+ - **Training Time**: ~2 hours on an RTX PRO 6000 (95GB VRAM)
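+
+ These hyperparameters map roughly onto a `peft` LoRA configuration like the sketch below. The `target_modules` and `lora_dropout` values are assumptions for illustration; the exact values used in training are not published in this card.
+
+ ```python
+ from peft import LoraConfig
+ from transformers import TrainingArguments
+
+ # LoRA settings matching the values listed above; target_modules and dropout are assumed.
+ lora_config = LoraConfig(
+     r=16,
+     lora_alpha=32,
+     lora_dropout=0.05,  # assumed, not stated in the card
+     target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
+     task_type="CAUSAL_LM",
+ )
+
+ # Optimizer and schedule settings matching the values listed above.
+ training_args = TrainingArguments(
+     output_dir="pdfme-form-field-detector-lora-32b",
+     num_train_epochs=3,
+     per_device_train_batch_size=1,
+     gradient_accumulation_steps=8,
+     learning_rate=2e-4,
+     bf16=True,
+ )
+ ```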
+
+ ## Comparison: 8B vs 32B
+
+ | Aspect | 8B Model | 32B Model |
+ |--------|----------|-----------|
+ | Parameters | 8B | 32B (4x larger) |
+ | Final Loss | 5.60 | 5.59 |
+ | Recall | 18.08% | 13.56% |
+ | VRAM (4-bit) | ~20GB | ~40GB |
+ | Inference Speed | Faster | Slower |
+
+ **Conclusion**: With only 90 training samples, both models perform similarly. **Data quantity and diversity are the bottleneck**, not model size.
+
+ ## Future Improvements
+
+ ### Short-term
+
+ 1. **Expand original dataset** - 100+ diverse document samples
+ 2. **Reduce epochs** - 1-2 epochs may be sufficient for 32B
+ 3. **Separate test set** - Evaluate on unseen documents
+
+ ### Mid-term
+
+ 4. **Field type classification** - Identify field types (name, address, date, etc.)
+ 5. **Multi-turn dialogue** - Support conditional detection ("only detect name fields")
+
+ ### Long-term
+
+ 6. **Large-scale dataset** - 1000+ annotated samples across document types
+ 7. **Active learning** - Human review → feedback → continuous improvement
+
+ ## License
+
+ Apache 2.0
+
+ ---
+
+ # PDFme Form Field Detection Model (32B)
+
+ **A model that automatically detects the form fields applicants should fill in on Japanese documents.**
+
+ Fine-tuned from [Qwen3-VL-32B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-32B-Instruct) with QLoRA to detect input fields in application forms, notification forms, and similar documents.
+
+ ## What This Model Can Do
+
+ Given a document image, it detects the positions (bboxes) of the **fields that the applicant (customer) should fill in**.
+ **Fields filled in by staff** (reception number, processing date, etc.) are excluded.
+
+ ## Model Information
+
+ | Item | Value |
+ |------|------|
+ | Base model | Qwen/Qwen3-VL-32B-Instruct |
+ | Training method | QLoRA (4-bit quantization + LoRA) |
+ | Training data | 90 samples (augmented) |
+ | Output format | JSON (bbox coordinates normalized to 0-1000) |
+
+ ## Performance Evaluation
+
+ ### Evaluation Results (IoU ≥ 0.5)
+
+ | Metric | 32B Model | 8B Model | Description |
+ |------|-----------|----------|------|
+ | **Recall** | **13.56%** | 18.08% | Share of ground-truth fields detected |
+ | **Precision** | **5.24%** | 7.90% | Share of predictions that are correct |
+ | **Average IoU** | **0.2163** | 0.2209 | Overlap between predictions and ground truth |
+ | Matches | 24/177 | 32/177 | Number of matched predictions |
+ | Predictions | 458 | 405 | Total predictions |
+
+ ### Training Curve
+
+ | Epoch | Loss | Notes |
+ |-------|------|------|
+ | Start | 18.74 | - |
+ | 0.5 | 11.13 | Rapid decrease |
+ | 1.0 | 6.72 | Stabilizing |
+ | 2.0 | 5.75 | Converging |
+ | 3.0 | **5.59** | Final |
+
+ **Loss improved: 18.74 → 5.59 (70% reduction)**
+
+ ### Key Finding
+
+ The 32B model reached accuracy comparable to the 8B model. **The dataset (10 original samples) is the bottleneck**, not model size.
+
+ ## Demo
+
+ Try it on Hugging Face Spaces:
+ [takumi123xxx/pdfme-form-field-detector](https://huggingface.co/spaces/takumi123xxx/pdfme-form-field-detector)
+
+ ## Training Details
+
+ - **Base model**: Qwen/Qwen3-VL-32B-Instruct
+ - **Epochs**: 3
+ - **Batch size**: 1 (gradient accumulation: 8)
+ - **Learning rate**: 2e-4
+ - **LoRA rank**: 16
+ - **LoRA alpha**: 32
+ - **Quantization**: 4-bit NF4
+ - **Training time**: about 2 hours on an RTX PRO 6000 (95GB VRAM)
+
+ ## License
+
+ Apache 2.0