Invoice Processing Baseline (Tesseract on RVL-CDIP)
This project establishes the performance "floor" for automated invoice processing using the RVL-CDIP Invoice dataset and Tesseract OCR.
Performance Summary
- Dataset Sample Size: 100 images
- Mean OCR Confidence: 62.73%
- Critical Failures (confidence <50%): 24%
- High Reliability (confidence >80%): 19%
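As a rough illustration of how figures like these can be derived, the sketch below aggregates a list of per-image confidence scores (0 to 100). The function name `summarize` and the example scores are hypothetical, and it assumes the 50% and 80% cut-offs are applied to each image's mean confidence, which the summary above does not state explicitly.

```python
from statistics import mean


def summarize(confidences, low=50.0, high=80.0):
    """Summarize per-image OCR confidences (0-100) into headline metrics."""
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one confidence score")
    return {
        "n": n,
        "mean": mean(confidences),
        # Share of images strictly below the 'critical failure' cut-off.
        "critical_pct": 100 * sum(c < low for c in confidences) / n,
        # Share of images strictly above the 'high reliability' cut-off.
        "reliable_pct": 100 * sum(c > high for c in confidences) / n,
    }


# Made-up scores for illustration only, not values from the notebook.
stats = summarize([20, 60, 90, 40])
```

Both thresholds are parameters, so the same helper can be reused if the cut-offs are tuned later.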
Key Findings (Phase 1)
Traditional OCR struggles with complex layouts and low-quality scans. We identified specific "Hard Samples" (e.g., indices 15, 4, and 88) whose confidence dropped below 30%.
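One way such hard samples could be flagged is to filter the per-image confidences by the 30% threshold and sort the result worst-first. This is a minimal sketch: `hard_samples` is a hypothetical helper, and the scores in the example are made up rather than taken from the notebook.

```python
def hard_samples(confidences, threshold=30.0):
    """Return (index, confidence) pairs below `threshold`, worst first."""
    flagged = [(i, c) for i, c in enumerate(confidences) if c < threshold]
    return sorted(flagged, key=lambda pair: pair[1])


# Illustrative scores only; the indices refer to positions in this list.
worst = hard_samples([62.0, 12.5, 80.0, 29.9, 30.0])
```

Returning the dataset index alongside the score makes it easy to pull the corresponding images back up for manual inspection.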
How to Reproduce
The attached 01_data_exploration.ipynb contains the full pipeline:
- Load dataset via datasets library.
- Pre-processing using Otsu Thresholding.
- OCR via pytesseract.
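The three steps above might look roughly like the following. This is a sketch under stated assumptions, not the notebook's actual code: the function names, the `dataset_id` argument (the Hub repository id is not given here), the `image` column name, and averaging word-level confidences into a per-image score are all assumptions. It relies on `opencv-python`, `numpy`, `pytesseract` and `datasets` being installed, plus a local Tesseract binary.

```python
from statistics import mean


def mean_word_confidence(ocr_data):
    """Average Tesseract word confidence, skipping non-word boxes (conf == -1)."""
    # Depending on the pytesseract version, 'conf' entries may be str, int or float.
    confs = [float(c) for c in ocr_data["conf"]]
    words = [c for c in confs if c >= 0]
    return mean(words) if words else 0.0


def ocr_confidence(pil_image):
    """Otsu-binarize one page, run Tesseract, and return its mean word confidence."""
    # Imported lazily so the helper above can be used without the OCR stack.
    import cv2
    import numpy as np
    import pytesseract

    gray = np.array(pil_image.convert("L"))
    # With THRESH_OTSU the threshold argument (0) is ignored and chosen automatically.
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    data = pytesseract.image_to_data(binary, output_type=pytesseract.Output.DICT)
    return mean_word_confidence(data)


def run_baseline(dataset_id, n_samples=100, split="train", image_column="image"):
    """Score the first `n_samples` images of a Hugging Face dataset.

    `dataset_id`, `split` and `image_column` are placeholders (assumptions);
    adjust them to match the dataset actually used.
    """
    from datasets import load_dataset

    ds = load_dataset(dataset_id, split=split)
    n = min(n_samples, len(ds))
    return [ocr_confidence(ds[i][image_column]) for i in range(n)]
```

Keeping the confidence aggregation in its own function means it can be checked on a hand-written dictionary without running Tesseract at all.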
Next Phase: I am currently investigating Donut (an OCR-free Transformer) to handle the 24% of invoices on which Tesseract's confidence fell below 50%.