takumi123xxx committed

Commit 3cc029a · verified · 1 Parent(s): 82f16c2

Upload README.md with huggingface_hub

Files changed (1):

  1. README.md +216 -127
README.md CHANGED
@@ -1,207 +1,296 @@
  ---
- base_model: Qwen/Qwen3-VL-32B-Instruct
- library_name: peft
- pipeline_tag: text-generation
  tags:
- - base_model:adapter:Qwen/Qwen3-VL-32B-Instruct
- - lora
- - transformers
  ---

- # Model Card for Model ID
-
- <!-- Provide a quick summary of what the model is/does. -->
-
- ## Model Details
-
- ### Model Description
-
- <!-- Provide a longer summary of what this model is. -->
-
- - **Developed by:** [More Information Needed]
- - **Funded by [optional]:** [More Information Needed]
- - **Shared by [optional]:** [More Information Needed]
- - **Model type:** [More Information Needed]
- - **Language(s) (NLP):** [More Information Needed]
- - **License:** [More Information Needed]
- - **Finetuned from model [optional]:** [More Information Needed]
-
- ### Model Sources [optional]
-
- <!-- Provide the basic links for the model. -->
-
- - **Repository:** [More Information Needed]
- - **Paper [optional]:** [More Information Needed]
- - **Demo [optional]:** [More Information Needed]
-
- ## Uses
-
- <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
-
- ### Direct Use
-
- <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
-
- [More Information Needed]
-
- ### Downstream Use [optional]
-
- <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
-
- [More Information Needed]
-
- ### Out-of-Scope Use
-
- <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
-
- [More Information Needed]
-
- ## Bias, Risks, and Limitations
-
- <!-- This section is meant to convey both technical and sociotechnical limitations. -->
-
- [More Information Needed]
-
- ### Recommendations
-
- <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
-
- Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
-
- ## How to Get Started with the Model
-
- Use the code below to get started with the model.
-
- [More Information Needed]
-
- ## Training Details
-
- ### Training Data
-
- <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
-
- [More Information Needed]
-
- ### Training Procedure
-
- <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
-
- #### Preprocessing [optional]
-
- [More Information Needed]
-
- #### Training Hyperparameters
-
- - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
-
- #### Speeds, Sizes, Times [optional]
-
- <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
-
- [More Information Needed]
-
- ## Evaluation
-
- <!-- This section describes the evaluation protocols and provides the results. -->
-
- ### Testing Data, Factors & Metrics
-
- #### Testing Data
-
- <!-- This should link to a Dataset Card if possible. -->
-
- [More Information Needed]
-
- #### Factors
-
- <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
-
- [More Information Needed]
-
- #### Metrics
-
- <!-- These are the evaluation metrics being used, ideally with a description of why. -->
-
- [More Information Needed]
-
- ### Results
-
- [More Information Needed]
-
- #### Summary
-
- ## Model Examination [optional]
-
- <!-- Relevant interpretability work for the model goes here -->
-
- [More Information Needed]
-
- ## Environmental Impact
-
- <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
-
- Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
-
- - **Hardware Type:** [More Information Needed]
- - **Hours used:** [More Information Needed]
- - **Cloud Provider:** [More Information Needed]
- - **Compute Region:** [More Information Needed]
- - **Carbon Emitted:** [More Information Needed]
-
- ## Technical Specifications [optional]
-
- ### Model Architecture and Objective
-
- [More Information Needed]
-
- ### Compute Infrastructure
-
- [More Information Needed]
-
- #### Hardware
-
- [More Information Needed]
-
- #### Software
-
- [More Information Needed]
-
- ## Citation [optional]
-
- <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
-
- **BibTeX:**
-
- [More Information Needed]
-
- **APA:**
-
- [More Information Needed]
-
- ## Glossary [optional]
-
- <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
-
- [More Information Needed]
-
- ## More Information [optional]
-
- [More Information Needed]
-
- ## Model Card Authors [optional]
-
- [More Information Needed]
-
- ## Model Card Contact
-
- [More Information Needed]
-
- ### Framework versions
-
- - PEFT 0.18.0
  ---
+ license: apache-2.0
+ language:
+ - ja
+ - en
+ library_name: transformers
+ pipeline_tag: image-text-to-text
  tags:
+ - vision
+ - vlm
+ - qwen
+ - lora
+ - document-understanding
+ - form-detection
+ - japanese
+ base_model: Qwen/Qwen3-VL-32B-Instruct
+ datasets:
+ - hand-dot/pdfme-form-field-dataset
  ---

+ # PDFme Form Field Detector (32B)
+
+ **Detects the form fields that applicants need to fill in on Japanese documents.**
+
+ This model is fine-tuned from [Qwen3-VL-32B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-32B-Instruct) using QLoRA to detect input fields in Japanese application forms, registration documents, and other official paperwork.
+
+ ## What This Model Does
+
+ Given an image of a Japanese document, this model identifies the bounding boxes of form fields that **applicants/customers** should fill in, while **excluding fields meant for staff/officials**.
+
+ ### Example Use Cases
+
+ - Automating form digitization
+ - Building PDF form generators
+ - Creating accessibility tools for document processing
+
+ ## Model Details
+
+ | Item | Value |
+ |------|-------|
+ | Base Model | Qwen/Qwen3-VL-32B-Instruct |
+ | Fine-tuning Method | QLoRA (4-bit quantization + LoRA) |
+ | Training Data | [hand-dot/pdfme-form-field-dataset](https://huggingface.co/datasets/hand-dot/pdfme-form-field-dataset) (90 samples, augmented) |
+ | Output Format | JSON with normalized bbox coordinates (0-1000) |
+
+ ## Performance
+
+ ### Evaluation Results (IoU ≥ 0.5)
+
+ | Metric | 32B Model | 8B Model | Description |
+ |--------|-----------|----------|-------------|
+ | **Recall** | **13.56%** | 18.08% | Share of ground-truth fields detected |
+ | **Precision** | **5.24%** | 7.90% | Share of predictions that are correct |
+ | **Average IoU** | **0.2163** | 0.2209 | Mean overlap between predicted and ground-truth boxes |
+ | Matches | 24/177 | 32/177 | Matched predictions |
+ | Predictions | 458 | 405 | Total predictions |
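+
+ The sketch below illustrates how these numbers are defined: a prediction counts as a match when its IoU with an unmatched ground-truth box is at least 0.5, and recall/precision follow from the match count. It is a minimal illustration of the metric, assuming `[x1, y1, x2, y2]` boxes on the 0-1000 scale; it is not the exact evaluation script used for this card.
+
+ ```python
+ def iou(a, b):
+     # Boxes are [x1, y1, x2, y2] on the 0-1000 normalized scale.
+     ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
+     ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
+     inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
+     union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
+     return inter / union if union else 0.0
+
+ def match_at_threshold(preds, gts, thr=0.5):
+     # Greedy one-to-one matching of predictions to ground-truth boxes at IoU >= thr.
+     matched, used = 0, set()
+     for p in preds:
+         best, best_iou = None, 0.0
+         for i, g in enumerate(gts):
+             if i not in used and iou(p, g) > best_iou:
+                 best, best_iou = i, iou(p, g)
+         if best is not None and best_iou >= thr:
+             used.add(best)
+             matched += 1
+     recall = matched / len(gts) if gts else 0.0
+     precision = matched / len(preds) if preds else 0.0
+     return matched, recall, precision
+ ```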
+
+ ### Per-Sample Results (Best performers)
+
+ | Sample | Recall | Precision | IoU | Evaluation |
+ |--------|--------|-----------|-----|------------|
+ | **#2** | **60.00%** | **69.23%** | **0.507** | ⭐ Excellent |
+ | **#7** | 33.33% | 25.00% | 0.380 | Good |
+ | **#9** | 18.18% | 7.69% | 0.313 | Improved |
+
+ ### Training Progress
+
+ | Epoch | Loss | Notes |
+ |-------|------|-------|
+ | Start | 18.74 | - |
+ | 0.5 | 11.13 | Rapid decrease |
+ | 1.0 | 6.72 | Stabilizing |
+ | 2.0 | 5.75 | Converging |
+ | 3.0 | **5.59** | Final |
+
+ **Loss improved: 18.74 → 5.59 (70% reduction)**
+
+ ### Key Finding
+
+ Despite being 4x larger than the 8B model, the 32B model achieved similar accuracy. **The dataset (10 original samples) is the bottleneck**, not model capacity.
+
+ ### Current Limitations
+
+ 1. **Small training dataset** - 10 original samples, augmented to 90
+ 2. **Over-detection tendency** - 458 predictions vs. 177 ground-truth fields (2.6x)
+ 3. **Localization precision** - an average IoU of 0.22 leaves room for improvement
+
+ ## Quick Start
+
+ ### Installation
+
+ ```bash
+ pip install transformers peft torch accelerate bitsandbytes
+ ```
+
+ ### Inference
+
+ ```python
+ import torch
+ from PIL import Image
+ from transformers import AutoProcessor, AutoModelForImageTextToText
+ from peft import PeftModel
+
+ # Load the base model (32B) and attach the LoRA adapter
+ base_model = "Qwen/Qwen3-VL-32B-Instruct"
+ model = AutoModelForImageTextToText.from_pretrained(
+     base_model,
+     torch_dtype=torch.bfloat16,
+     device_map="auto",
+     trust_remote_code=True,
+ )
+ model = PeftModel.from_pretrained(model, "takumi123xxx/pdfme-form-field-detector-lora-32b")
+ processor = AutoProcessor.from_pretrained(base_model, trust_remote_code=True)
+
+ # Prepare prompts
+ system_prompt = """You are an expert at analyzing Japanese documents.
+ There are two types of input fields:
+ 1. Fields for applicants/customers to fill → Target for detection
+ 2. Fields for staff/officials to fill → Exclude from detection"""
+
+ user_prompt = """Detect all input fields that applicants should fill in this image.
+ Exclude fields for staff.
+ Return JSON with bbox coordinates (0-1000 normalized)."""
+
+ # Load the document image
+ image = Image.open("your_document.png").convert("RGB")
+
+ messages = [
+     {"role": "system", "content": system_prompt},
+     {"role": "user", "content": [
+         {"type": "image", "image": image},
+         {"type": "text", "text": user_prompt},
+     ]},
+ ]
+
+ text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ inputs = processor(text=text, images=image, return_tensors="pt").to(model.device)
+
+ output = model.generate(**inputs, max_new_tokens=2048)
+ result = processor.decode(output[0], skip_special_tokens=True)
+ print(result)
+ ```
+
+ ### Output Format
+
+ ```json
+ {
+   "applicant_fields": [
+     {"bbox": [100, 200, 500, 250]},
+     {"bbox": [100, 300, 500, 350]}
+   ],
+   "count": 2
+ }
+ ```
+
+ - `bbox`: `[x1, y1, x2, y2]` normalized to a 0-1000 scale
+ - To convert to pixels: `pixel_x = bbox_x / 1000 * image_width` (see the sketch below)
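+
+ A minimal sketch of turning the generated text into pixel coordinates, assuming the output contains a JSON object like the one above; the regex-based extraction and the reuse of `result` and `image` from the inference snippet are illustrative, not part of the released code:
+
+ ```python
+ import json
+ import re
+
+ def to_pixels(bbox, image_width, image_height):
+     # bbox is [x1, y1, x2, y2] on the 0-1000 normalized scale.
+     x1, y1, x2, y2 = bbox
+     return [
+         x1 / 1000 * image_width,
+         y1 / 1000 * image_height,
+         x2 / 1000 * image_width,
+         y2 / 1000 * image_height,
+     ]
+
+ # Pull the JSON object out of the generated text and convert each detected field.
+ match = re.search(r"\{.*\}", result, re.DOTALL)  # `result` comes from the inference snippet above
+ fields = json.loads(match.group(0))["applicant_fields"] if match else []
+ w, h = image.size  # PIL image loaded in the inference snippet
+ pixel_boxes = [to_pixels(f["bbox"], w, h) for f in fields]
+ print(pixel_boxes)
+ ```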
+
+ ## Demo
+
+ Try the model on Hugging Face Spaces:
+ [takumi123xxx/pdfme-form-field-detector](https://huggingface.co/spaces/takumi123xxx/pdfme-form-field-detector)
+
+ ## Deployment (Inference Endpoints)
+
+ ### ⚠️ Important: Instance Selection
+
+ This model is a **32B parameter** Vision-Language Model. Please note the following when deploying:
+
+ | Condition | Recommended Instance | VRAM | Notes |
+ |-----------|---------------------|------|-------|
+ | **With 4-bit quantization** | `nvidia-a100` | 40GB+ | ⭐ Recommended |
+ | **Without 4-bit quantization** | `nvidia-a100-80g` | 80GB | Requires more VRAM |
+
+ ### Environment Variables
+
+ | Variable | Default | Description |
+ |----------|---------|-------------|
+ | `BASE_MODEL` | `Qwen/Qwen3-VL-32B-Instruct` | Base model |
+ | `USE_LORA` | `true` | Use LoRA adapter |
+ | `USE_4BIT` | `true` | Use 4-bit quantization (recommended) |
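+
+ For reference, loading the base model in 4-bit NF4 (what the `USE_4BIT=true` default implies) can be done with `BitsAndBytesConfig`. This is a sketch of one possible setup under those assumptions, not the exact endpoint handler code:
+
+ ```python
+ import torch
+ from transformers import AutoModelForImageTextToText, AutoProcessor, BitsAndBytesConfig
+ from peft import PeftModel
+
+ # 4-bit NF4 quantization, matching the quantization used during training.
+ bnb_config = BitsAndBytesConfig(
+     load_in_4bit=True,
+     bnb_4bit_quant_type="nf4",
+     bnb_4bit_compute_dtype=torch.bfloat16,
+ )
+
+ base_model = "Qwen/Qwen3-VL-32B-Instruct"
+ model = AutoModelForImageTextToText.from_pretrained(
+     base_model,
+     quantization_config=bnb_config,
+     device_map="auto",
+     trust_remote_code=True,
+ )
+ model = PeftModel.from_pretrained(model, "takumi123xxx/pdfme-form-field-detector-lora-32b")
+ processor = AutoProcessor.from_pretrained(base_model, trust_remote_code=True)
+ ```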
+
+ ## Training Details
+
+ - **Base Model**: Qwen/Qwen3-VL-32B-Instruct
+ - **Epochs**: 3
+ - **Batch Size**: 1 (with gradient accumulation of 8)
+ - **Learning Rate**: 2e-4
+ - **LoRA Rank**: 16
+ - **LoRA Alpha**: 32
+ - **Quantization**: 4-bit NF4
+ - **Training Time**: ~2 hours on an RTX PRO 6000 (95GB VRAM)
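+
+ These hyperparameters map roughly onto a `peft` LoRA configuration like the sketch below. The `target_modules` and `lora_dropout` values are assumptions for illustration; the exact values used in training are not published in this card.
+
+ ```python
+ from peft import LoraConfig
+ from transformers import TrainingArguments
+
+ # LoRA settings matching the values listed above; target_modules and dropout are assumed.
+ lora_config = LoraConfig(
+     r=16,
+     lora_alpha=32,
+     lora_dropout=0.05,  # assumed, not stated in the card
+     target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
+     task_type="CAUSAL_LM",
+ )
+
+ # Optimizer and schedule settings matching the values listed above.
+ training_args = TrainingArguments(
+     output_dir="pdfme-form-field-detector-lora-32b",
+     num_train_epochs=3,
+     per_device_train_batch_size=1,
+     gradient_accumulation_steps=8,
+     learning_rate=2e-4,
+     bf16=True,
+ )
+ ```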
+
+ ## Comparison: 8B vs 32B
+
+ | Aspect | 8B Model | 32B Model |
+ |--------|----------|-----------|
+ | Parameters | 8B | 32B (4x larger) |
+ | Final Loss | 5.60 | 5.59 |
+ | Recall | 18.08% | 13.56% |
+ | VRAM (4-bit) | ~20GB | ~40GB |
+ | Inference Speed | Faster | Slower |
+
+ **Conclusion**: With only 90 training samples, both models perform similarly. **Data quantity and diversity are the bottleneck**, not model size.
+
+ ## Future Improvements
+
+ ### Short-term
+
+ 1. **Expand original dataset** - 100+ diverse document samples
+ 2. **Reduce epochs** - 1-2 epochs may be sufficient for 32B
+ 3. **Separate test set** - Evaluate on unseen documents
+
+ ### Mid-term
+
+ 4. **Field type classification** - Identify field types (name, address, date, etc.)
+ 5. **Multi-turn dialogue** - Support conditional detection ("only detect name fields")
+
+ ### Long-term
+
+ 6. **Large-scale dataset** - 1000+ annotated samples across document types
+ 7. **Active learning** - Human review → feedback → continuous improvement
+
+ ## License
+
+ Apache 2.0
+
+ ---
+
+ # PDFme Form Field Detection Model (32B)
+
+ **A model that automatically detects the form fields applicants should fill in on Japanese documents.**
+
+ Fine-tuned from [Qwen3-VL-32B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-32B-Instruct) with QLoRA to detect input fields in application forms, notification forms, and similar documents.
+
+ ## What This Model Can Do
+
+ Given a document image, it detects the positions (bboxes) of the **fields that the applicant (customer) should fill in**.
+ **Fields filled in by staff** (reception number, processing date, etc.) are excluded.
+
+ ## Model Information
+
+ | Item | Value |
+ |------|------|
+ | Base model | Qwen/Qwen3-VL-32B-Instruct |
+ | Training method | QLoRA (4-bit quantization + LoRA) |
+ | Training data | 90 samples (augmented) |
+ | Output format | JSON (bbox coordinates normalized to 0-1000) |
+
+ ## Performance Evaluation
+
+ ### Evaluation Results (IoU ≥ 0.5)
+
+ | Metric | 32B Model | 8B Model | Description |
+ |------|-----------|----------|------|
+ | **Recall** | **13.56%** | 18.08% | Share of ground-truth fields detected |
+ | **Precision** | **5.24%** | 7.90% | Share of predictions that are correct |
+ | **Average IoU** | **0.2163** | 0.2209 | Overlap between predictions and ground truth |
+ | Matches | 24/177 | 32/177 | Number of matched predictions |
+ | Predictions | 458 | 405 | Total predictions |
+
+ ### Training Curve
+
+ | Epoch | Loss | Notes |
+ |-------|------|------|
+ | Start | 18.74 | - |
+ | 0.5 | 11.13 | Rapid decrease |
+ | 1.0 | 6.72 | Stabilizing |
+ | 2.0 | 5.75 | Converging |
+ | 3.0 | **5.59** | Final |
+
+ **Loss improved: 18.74 → 5.59 (70% reduction)**
+
+ ### Key Finding
+
+ The 32B model reached accuracy comparable to the 8B model. **The dataset (10 original samples) is the bottleneck**, not model size.
+
+ ## Demo
+
+ Try it on Hugging Face Spaces:
+ [takumi123xxx/pdfme-form-field-detector](https://huggingface.co/spaces/takumi123xxx/pdfme-form-field-detector)
+
+ ## Training Details
+
+ - **Base model**: Qwen/Qwen3-VL-32B-Instruct
+ - **Epochs**: 3
+ - **Batch size**: 1 (gradient accumulation: 8)
+ - **Learning rate**: 2e-4
+ - **LoRA rank**: 16
+ - **LoRA alpha**: 32
+ - **Quantization**: 4-bit NF4
+ - **Training time**: about 2 hours on an RTX PRO 6000 (95GB VRAM)
+
+ ## License
+
+ Apache 2.0