---
license: apache-2.0
language:
- ja
- en
library_name: transformers
pipeline_tag: image-text-to-text
tags:
- vision
- vlm
- qwen
- lora
- document-understanding
- form-detection
- japanese
base_model: Qwen/Qwen3-VL-32B-Instruct
datasets:
- hand-dot/pdfme-form-field-dataset
---

# PDFme Form Field Detector (32B)

**Detects the form fields that applicants need to fill in on Japanese documents.**

This model is fine-tuned from [Qwen3-VL-32B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-32B-Instruct) using QLoRA to detect input fields in Japanese application forms, registration documents, and other official paperwork.

## What This Model Does

Given an image of a Japanese document, this model identifies the bounding boxes of form fields that **applicants/customers** should fill in, while **excluding fields meant for staff/officials**.

### Example Use Cases

- Automating form digitization
- Building PDF form generators
- Creating accessibility tools for document processing

## Model Details

| Item | Value |
|------|-------|
| Base Model | Qwen/Qwen3-VL-32B-Instruct |
| Fine-tuning Method | QLoRA (4-bit quantization + LoRA) |
| Training Data | [hand-dot/pdfme-form-field-dataset](https://huggingface.co/datasets/hand-dot/pdfme-form-field-dataset) (90 samples, augmented) |
| Output Format | JSON with normalized bbox coordinates (0-1000) |

## Performance

### Evaluation Results (IoU ≥ 0.5)

| Metric | 32B Model | 8B Model | Description |
|--------|-----------|----------|-------------|
| **Recall** | **13.56%** | 18.08% | Share of ground-truth fields detected |
| **Precision** | **5.24%** | 7.90% | Share of predictions that are correct |
| **Average IoU** | **0.2163** | 0.2209 | Overlap between predicted and ground-truth boxes |
| Matches | 24/177 | 32/177 | Matched predictions |
| Predictions | 458 | 405 | Total predictions |

### Per-Sample Results (Best performers)

| Sample | Recall | Precision | IoU | Evaluation |
|--------|--------|-----------|-----|------------|
| **#2** | **60.00%** | **69.23%** | **0.507** | ⭐ Excellent |
| **#7** | 33.33% | 25.00% | 0.380 | Good |
| **#9** | 18.18% | 7.69% | 0.313 | Improved |

### Training Progress

| Epoch | Loss | Notes |
|-------|------|-------|
| Start | 18.74 | - |
| 0.5 | 11.13 | Rapid decrease |
| 1.0 | 6.72 | Stabilizing |
| 2.0 | 5.75 | Converging |
| 3.0 | **5.59** | Final |

**Loss improved: 18.74 → 5.59 (70% reduction)**

### Key Finding

Despite being 4x larger than the 8B model, the 32B model achieved only comparable (in fact slightly lower) accuracy. **The dataset (10 original samples) is the bottleneck**, not model capacity.

### Current Limitations

1. **Small training dataset** - 10 original samples, augmented to 90
2. **Over-detection tendency** - 458 predictions vs 177 ground-truth fields (2.6x)
3. **Localization precision** - Average IoU of 0.22 indicates room for improvement

## Quick Start

### Installation

```bash
pip install transformers peft torch accelerate bitsandbytes
```

### Inference

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForImageTextToText
from peft import PeftModel

# Load model (32B)
base_model = "Qwen/Qwen3-VL-32B-Instruct"
model = AutoModelForImageTextToText.from_pretrained(
    base_model,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
model = PeftModel.from_pretrained(model, "takumi123xxx/pdfme-form-field-detector-lora-32b")
processor = AutoProcessor.from_pretrained(base_model, trust_remote_code=True)

# Prepare prompt
system_prompt = """You are an expert at analyzing Japanese documents.
There are two types of input fields:
1. Fields for applicants/customers to fill → Target for detection
2. Fields for staff/officials to fill → Exclude from detection"""

user_prompt = """Detect all input fields that applicants should fill in this image. Exclude fields for staff. Return JSON with bbox coordinates (0-1000 normalized)."""

# Load image
image = Image.open("your_document.png").convert("RGB")

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": user_prompt},
    ]},
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=text, images=image, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=2048)
result = processor.decode(output[0], skip_special_tokens=True)
print(result)
```

### Output Format

```json
{
  "applicant_fields": [
    {"bbox": [100, 200, 500, 250]},
    {"bbox": [100, 300, 500, 350]}
  ],
  "count": 2
}
```

- `bbox`: `[x1, y1, x2, y2]` normalized to a 0-1000 scale
- To convert to pixels: `pixel_x = bbox_x / 1000 * image_width`
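As a quick illustration of that conversion, here is a minimal sketch in Python. The `bbox_to_pixels` helper and the A4 page size are illustrative only (not part of this repository), and it assumes the model output has already been parsed into the JSON structure shown above.

```python
def bbox_to_pixels(bbox, image_width, image_height):
    """Convert a [x1, y1, x2, y2] box from the 0-1000 scale to pixel coordinates."""
    x1, y1, x2, y2 = bbox
    return [
        x1 / 1000 * image_width,
        y1 / 1000 * image_height,
        x2 / 1000 * image_width,
        y2 / 1000 * image_height,
    ]

# Example with the JSON above and a hypothetical 2480x3508 px page (A4 at 300 dpi)
prediction = {
    "applicant_fields": [{"bbox": [100, 200, 500, 250]}, {"bbox": [100, 300, 500, 350]}],
    "count": 2,
}
for field in prediction["applicant_fields"]:
    print(bbox_to_pixels(field["bbox"], image_width=2480, image_height=3508))
```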
## Demo

Try the model on Hugging Face Spaces: [takumi123xxx/pdfme-form-field-detector](https://huggingface.co/spaces/takumi123xxx/pdfme-form-field-detector)

## Cloud Deployment

### AWS SageMaker

```python
import boto3
import json
import base64

runtime = boto3.client("sagemaker-runtime", region_name="ap-northeast-1")

with open("document.png", "rb") as f:
    image_base64 = base64.b64encode(f.read()).decode()

response = runtime.invoke_endpoint(
    EndpointName="pdfme-form-detector-xxxxx",
    ContentType="application/json",
    Body=json.dumps({"inputs": image_base64})
)
result = json.loads(response["Body"].read().decode())
print(result)
```

### GCP Vertex AI

```python
from google.cloud import aiplatform
import base64

aiplatform.init(project="your-project-id", location="asia-northeast1")
endpoint = aiplatform.Endpoint("projects/xxx/locations/xxx/endpoints/xxx")

with open("document.png", "rb") as f:
    image_base64 = base64.b64encode(f.read()).decode()

response = endpoint.predict(instances=[{"image_base64": image_base64}])
print(response.predictions)
```

### Azure AI Foundry

```python
import requests
import base64

endpoint_url = "https://pdfme-detector-xxxxx.japaneast.inference.ml.azure.com/score"
api_key = "your-api-key"

with open("document.png", "rb") as f:
    image_base64 = base64.b64encode(f.read()).decode()

headers = {
    "Authorization": f"Bearer {api_key}",
    "Content-Type": "application/json",
}
response = requests.post(
    endpoint_url,
    headers=headers,
    json={"image_base64": image_base64}
)
print(response.json())
```

### Recommended Instances

| Service | Instance | GPU | VRAM | Cost/hour |
|---------|----------|-----|------|-----------|
| **AWS SageMaker** | ml.g5.xlarge | A10G | 24GB | ~$1.20 |
| **GCP Vertex AI** | n1-standard-8 + L4 | L4 | 24GB | ~$1.20 |
| **Azure AI Foundry** | Standard_NC4as_T4_v3 | T4 | 16GB | ~$1.10 |

For detailed deployment instructions, see the [GitHub repository](https://github.com/JapanMarketing-Dev/pdfme-fineturning/tree/main/deploy).

## Training Details

- **Base Model**: Qwen/Qwen3-VL-32B-Instruct
- **Epochs**: 3
- **Batch Size**: 1 (with gradient accumulation of 8)
- **Learning Rate**: 2e-4
- **LoRA Rank**: 16
- **LoRA Alpha**: 32
- **Quantization**: 4-bit NF4
- **Training Time**: ~2 hours on RTX PRO 6000 (95GB VRAM)
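To make these hyperparameters concrete, below is a minimal sketch of what the quantization and LoRA setup might look like with `transformers`, `bitsandbytes`, and `peft`. This is an assumption-based illustration, not the actual training script (which is not published here); in particular, the `target_modules` list is a guess.

```python
import torch
from transformers import AutoModelForImageTextToText, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization, matching the "Quantization: 4-bit NF4" entry above
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen3-VL-32B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

# LoRA rank 16 / alpha 32 from the list above; target_modules is an assumption
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```

Training itself (learning rate 2e-4, batch size 1 with gradient accumulation of 8, 3 epochs) would then run on top of this PEFT model, for example with the standard `transformers` `Trainer`.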
## Comparison: 8B vs 32B

| Aspect | 8B Model | 32B Model |
|--------|----------|-----------|
| Parameters | 8B | 32B (4x larger) |
| Final Loss | 5.60 | 5.59 |
| Recall | 18.08% | 13.56% |
| VRAM (4-bit) | ~20GB | ~40GB |
| Inference Speed | Faster | Slower |

**Conclusion**: With only 90 training samples, both models perform similarly. **Data quantity and diversity are the bottleneck**, not model size.

## Future Improvements

### Short-term

1. **Expand original dataset** - 100+ diverse document samples
2. **Reduce epochs** - 1-2 epochs may be sufficient for 32B
3. **Separate test set** - Evaluate on unseen documents

### Mid-term

4. **Field type classification** - Identify field types (name, address, date, etc.)
5. **Multi-turn dialogue** - Support conditional detection ("only detect name fields")

### Long-term

6. **Large-scale dataset** - 1000+ annotated samples across document types
7. **Active learning** - Human review → feedback → continuous improvement

## License

Apache 2.0

---

# PDFme Form Field Detection Model (32B)

**A model that automatically detects the form fields applicants should fill in on Japanese documents.**

Fine-tuned from [Qwen3-VL-32B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-32B-Instruct) with QLoRA to detect input fields in application forms, notification forms, and similar documents.

## What This Model Can Do

Given a document image, the model detects the positions (bboxes) of **fields that the applicant (customer) should fill in**.
**Fields filled in by staff** (reception number, processing date, etc.) are excluded.

## Model Information

| Item | Value |
|------|------|
| Base Model | Qwen/Qwen3-VL-32B-Instruct |
| Training Method | QLoRA (4-bit quantization + LoRA) |
| Training Data | 90 samples (augmented) |
| Output Format | JSON (bbox coordinates normalized to 0-1000) |

## Performance Evaluation

### Evaluation Results (IoU ≥ 0.5)

| Metric | 32B Model | 8B Model | Description |
|------|-----------|----------|------|
| **Recall** | **13.56%** | 18.08% | Share of ground-truth fields detected |
| **Precision** | **5.24%** | 7.90% | Share of predictions that are correct |
| **Average IoU** | **0.2163** | 0.2209 | Overlap between predictions and ground truth |
| Matches | 24/177 | 32/177 | Number of matched predictions |
| Predictions | 458 | 405 | Total predictions |

### Training Curve

| Epoch | Loss | Notes |
|-------|------|------|
| Start | 18.74 | - |
| 0.5 | 11.13 | Rapid decrease |
| 1.0 | 6.72 | Stabilizing |
| 2.0 | 5.75 | Converging |
| 3.0 | **5.59** | Final |

**Loss improved: 18.74 → 5.59 (70% reduction)**

### Key Finding

The 32B model achieved accuracy comparable to the 8B model. **The dataset (10 original samples) is the bottleneck**, not model size.

## Demo

Try it on Hugging Face Spaces: [takumi123xxx/pdfme-form-field-detector](https://huggingface.co/spaces/takumi123xxx/pdfme-form-field-detector)

## Cloud Deployment

### Recommended Instances

| Service | Instance | GPU | VRAM | Cost/hour |
|----------|-------------|-----|------|----------|
| **AWS SageMaker** | ml.g5.xlarge | A10G | 24GB | ~$1.20 |
| **GCP Vertex AI** | n1-standard-8 + L4 | L4 | 24GB | ~$1.20 |
| **Azure AI Foundry** | Standard_NC4as_T4_v3 | T4 | 16GB | ~$1.10 |

For detailed deployment instructions, see the [GitHub repository](https://github.com/JapanMarketing-Dev/pdfme-fineturning/tree/main/deploy).

## Training Details

- **Base Model**: Qwen/Qwen3-VL-32B-Instruct
- **Epochs**: 3
- **Batch Size**: 1 (gradient accumulation: 8)
- **Learning Rate**: 2e-4
- **LoRA Rank**: 16
- **LoRA Alpha**: 32
- **Quantization**: 4-bit NF4
- **Training Time**: ~2 hours on an RTX PRO 6000 (95GB VRAM)

## License

Apache 2.0