|
|
---
base_model: numind/NuExtract-2.0-4B
library_name: transformers
model_name: invoices-donut-finetuned-Lora-merged
tags:
- generated_from_trainer
- sft
- trl
- vision
- document-understanding
- invoice-processing
- donut
- qwen
license: apache-2.0
---
|
|
|
|
|
## Overview
|
|
`invoices-donut-finetuned-Lora-merged` is the **LoRA adapter merged back into the base weights** of [`numind/NuExtract-2.0-4B`](https://huggingface.co/numind/NuExtract-2.0-4B). It behaves like a fully fine-tuned model while having been trained with parameter-efficient LoRA adapters. This makes it **production-ready**: there is no need to load the base model and the adapters separately.
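For reference, a merge of this kind is typically produced with PEFT's `merge_and_unload`. The snippet below is only an illustrative sketch: the adapter repository path is a placeholder, not a released artifact.

```python
import torch
from transformers import AutoModelForVision2Seq
from peft import PeftModel

# Load the base model, attach the trained LoRA adapter, and merge the weights
base = AutoModelForVision2Seq.from_pretrained(
    "numind/NuExtract-2.0-4B",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)
merged = PeftModel.from_pretrained(base, "path/to/invoices-donut-finetuned-Lora").merge_and_unload()

# Save the merged weights as a standalone checkpoint
merged.save_pretrained("invoices-donut-finetuned-Lora-merged")
```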
|
|
|
|
|
--- |
|
|
|
|
|
## Intended Use |
|
|
|
|
|
- Extracting structured JSON fields from invoice images:
  - Invoice number, date
  - Seller/client details
  - Tax IDs, IBAN
  - Item descriptions, prices, VAT
  - Totals (net, VAT, gross)
- Not intended for general document OCR outside invoices.
|
|
|
|
|
## Training Details |
|
|
|
|
|
- **Base model**: [`numind/NuExtract-2.0-4B`](https://huggingface.co/numind/NuExtract-2.0-4B) (built on Qwen/Qwen2.5-VL-3B-Instruct)
- **Framework**: Hugging Face TRL (`SFTTrainer`) with PEFT/LoRA (see the configuration sketch after this list)
- **LoRA config**:
  - **Rank (r)**: 8
  - **Alpha**: 32
  - **Target modules**: q_proj, v_proj
  - **Dropout**: 0.1
- **Epochs**: 10
- **Batch size**: 2
- **Learning rate**: 1e-5
- **Precision**: bfloat16
- **Gradient accumulation**: 4
- **Scheduler**: Constant LR
- **Max sequence length**: 1024
- **Gradient checkpointing**: Enabled
- **Trainable parameters**: ~1.8M (0.05% of 3.75B total)
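The training script itself is not included in this card, but the hyperparameters above map onto TRL/PEFT roughly as in the sketch below. It is a minimal illustration only: `model` and `train_dataset` are assumed to be defined elsewhere, the `output_dir` name is hypothetical, and some argument names vary slightly between TRL versions.

```python
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# LoRA configuration matching the values listed above
peft_config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.1,
    task_type="CAUSAL_LM",
)

# Training arguments matching the values listed above
training_args = SFTConfig(
    output_dir="invoices-donut-finetuned-Lora",  # hypothetical output directory
    num_train_epochs=10,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=1e-5,
    lr_scheduler_type="constant",
    bf16=True,
    max_seq_length=1024,
    gradient_checkpointing=True,
)

trainer = SFTTrainer(
    model=model,                    # base model loaded as in the Usage section
    args=training_args,
    train_dataset=train_dataset,    # invoice dataset prepared for SFT
    peft_config=peft_config,
)
trainer.train()
```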
|
|
|
|
|
|
|
|
## Usage |
|
|
|
|
|
### Installation |
|
|
|
|
|
```bash
# accelerate is needed for device_map="auto"; qwen-vl-utils is used in the examples below
pip install transformers torch datasets pillow accelerate qwen-vl-utils
```
|
|
|
|
|
### Load Model and Processor |
|
|
|
|
|
```python
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq

model_name = "aliRafik/invoices-donut-finetuned-Lora-merged"

model = AutoModelForVision2Seq.from_pretrained(
    model_name,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,               # Optional: use float32 if bfloat16 causes issues
    attn_implementation="flash_attention_2",  # Requires the flash-attn package, an Ampere+ GPU, and torch >= 2.0
    device_map="auto"
)

processor = AutoProcessor.from_pretrained(
    model_name,
    trust_remote_code=True,
    padding_side='left',
    use_fast=True
)
```
|
|
|
|
|
|
|
|
### Define Extraction Template |
|
|
|
|
|
```python
template = """
{
  "header": {
    "invoice_no": "string",
    "invoice_date": "date-time",
    "seller": "string",
    "client": "string",
    "seller_tax_id": "string",
    "client_tax_id": "string",
    "iban": "string"
  },
  "items": [
    {
      "item_desc": "string",
      "item_qty": "number",
      "item_net_price": "number",
      "item_net_worth": "number",
      "item_vat": "number",
      "item_gross_worth": "number"
    }
  ],
  "summary": {
    "total_net_worth": "number",
    "total_vat": "number",
    "total_gross_worth": "number"
  }
}
"""
```
|
|
### Test on Sample from Dataset |
|
|
|
|
|
```python
from datasets import load_dataset
import json
from qwen_vl_utils import process_vision_info

# Load the dataset
dataset = load_dataset("katanaml-org/invoices-donut-data-v1")

# Select a sample (e.g., index 0)
sample = dataset['train'][0]
image = sample['image']
ground_truth = sample['ground_truth']

print(json.loads(ground_truth))

# Prepare message
messages = [
    {"role": "user", "content": [{"type": "image", "image": image}]}
]

# Process vision info
image_inputs, _ = process_vision_info(messages)

# Apply chat template
text = processor.tokenizer.apply_chat_template(
    messages,
    template=template,
    tokenize=False,
    add_generation_prompt=True
)

# Prepare inputs
inputs = processor(
    text=[text],
    images=image_inputs,
    padding=True,
    return_tensors="pt"
).to(model.device)

# Generation config
generation_config = {
    "do_sample": False,
    "num_beams": 1,
    "max_new_tokens": 2048
}

# Generate
generated_ids = model.generate(**inputs, **generation_config)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]

output_text = processor.batch_decode(
    generated_ids_trimmed,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)

# Parse and print
try:
    extracted_data = json.loads(output_text[0])
    print("Extracted Data:", extracted_data)
except json.JSONDecodeError:
    print("Raw Output:", output_text[0])

# Compare with ground truth
gt_parsed = json.loads(ground_truth)['gt_parse']
print("Ground Truth:", gt_parsed)
```
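As a quick sanity check, the extracted header can be compared field by field against the parsed ground truth. The helper below is a rough, illustrative comparison (the `header_match_rate` name is not part of the released code) and assumes `extracted_data` and `gt_parsed` from the snippet above.

```python
def header_match_rate(pred: dict, gt: dict) -> float:
    """Fraction of ground-truth header fields that the prediction matches exactly."""
    gt_header = gt.get("header", {})
    pred_header = pred.get("header", {})
    if not gt_header:
        return 0.0
    hits = sum(
        str(pred_header.get(key, "")).strip() == str(value).strip()
        for key, value in gt_header.items()
    )
    return hits / len(gt_header)

print("Header exact-match rate:", header_match_rate(extracted_data, gt_parsed))
```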
|
|
### Test on Unseen Data (Custom Image) |
|
|
```python
from PIL import Image
from io import BytesIO
import requests

# Load from local path
image_path = "/content/image.jpg"  # Replace with your path
image = Image.open(image_path)

# Or load from URL
# image_url = "https://example.com/your_invoice.jpg"
# response = requests.get(image_url)
# image = Image.open(BytesIO(response.content))

# Use the same inference code as above
```
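To reuse the same inference steps on arbitrary images, they can be wrapped into a small helper. This is a convenience sketch (the `extract_invoice` name is not part of the released code) and assumes `model`, `processor`, and `template` are already defined as in the sections above.

```python
import json
from qwen_vl_utils import process_vision_info

def extract_invoice(image):
    """Run the inference steps from the previous section on a single PIL image."""
    messages = [{"role": "user", "content": [{"type": "image", "image": image}]}]
    image_inputs, _ = process_vision_info(messages)
    text = processor.tokenizer.apply_chat_template(
        messages, template=template, tokenize=False, add_generation_prompt=True
    )
    inputs = processor(
        text=[text], images=image_inputs, padding=True, return_tensors="pt"
    ).to(model.device)
    generated_ids = model.generate(**inputs, do_sample=False, num_beams=1, max_new_tokens=2048)
    trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
    output = processor.batch_decode(
        trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )[0]
    try:
        return json.loads(output)   # parsed JSON if the model emitted valid JSON
    except json.JSONDecodeError:
        return output               # otherwise fall back to the raw text

# extracted = extract_invoice(image)
# print(extracted)
```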
|
|
|
|
|
## Example Results |
|
|
|
|
|
### Input Image
|
|
|
|
|
 |
|
|
|
|
|
### Extracted Data
|
|
|
|
|
```json
{
  "header": {
    "invoice_no": "49565075",
    "invoice_date": "2019-10-28",
    "seller": "Kane-Morgan 968 Carr Mission Apt. 320 Bernardville, VA 28211",
    "client": "Garcia Inc 445 Haas Viaduct Suite 454 Michaelhaven, LA 32852",
    "seller_tax_id": "964-95-3813",
    "client_tax_id": "909-75-5482",
    "iban": "GB73WCJ55232646970614"
  },
  "items": [
    {
      "item_desc": "Anthropologie Gold Elegant Swan Decorative Metal Bottle Stopper Wine Saver",
      "item_qty": 3.0,
      "item_net_price": 19.98,
      "item_net_worth": 59.94,
      "item_vat": 10.0,
      "item_gross_worth": 65.93
    },
    {
      "item_desc": "Lolita Happy Retirement Wine Glass 15 Ounce GLS11-5534H",
      "item_qty": 1.0,
      "item_net_price": 8.0,
      "item_net_worth": 8.0,
      "item_vat": 10.0,
      "item_gross_worth": 8.8
    },
    {
      "item_desc": "Lolita \"Congratulations\" Hand Painted and Decorated Wine Glass NIB",
      "item_qty": 1.0,
      "item_net_price": 20.0,
      "item_net_worth": 20.0,
      "item_vat": 10.0,
      "item_gross_worth": 22.0
    }
  ],
  "summary": {
    "total_net_worth": 87.94,
    "total_vat": 8.79,
    "total_gross_worth": 96.73
  }
}
```
|
|
## License |
|
|
Apache-2.0 (the model tags, including `vision`, `document-understanding`, `invoice-processing`, `donut`, and `qwen`, are listed in the metadata header above).
|
|
|
|
|
|
|
|
## Citations |
|
|
|
|
|
Cite TRL as: |
|
|
|
|
|
```bibtex
@misc{vonwerra2022trl,
    title        = {{TRL: Transformer Reinforcement Learning}},
    author       = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou{\'e}dec},
    year         = 2020,
    journal      = {GitHub repository},
    publisher    = {GitHub},
    howpublished = {\url{https://github.com/huggingface/trl}}
}
```