Invoice Processing Baseline (Tesseract on RVL-CDIP)

This project establishes the performance "floor" for automated invoice processing using the RVL-CDIP Invoice dataset and Tesseract OCR.

📊 Performance Summary

Dataset Sample Size: 100 images
Mean OCR Confidence: 62.73%
Critical Failures (confidence <50%): 24% of images
High Reliability (confidence >80%): 19% of images
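
As a rough illustration of how figures like these can be derived, the sketch below aggregates a list of per-image confidence scores (0 to 100). The function name `summarize` and the exact definition of each metric are assumptions for illustration; the notebook may compute them differently.

```python
from statistics import mean


def summarize(confidences, low=50.0, high=80.0):
    """Aggregate per-image OCR confidences (0-100) into summary metrics.

    `low` and `high` mirror the <50% and >80% cut-offs quoted above;
    the helper itself is hypothetical, not taken from the notebook.
    """
    n = len(confidences)
    return {
        "sample_size": n,
        "mean_confidence": mean(confidences),
        # Share of images whose confidence falls strictly below `low`.
        "critical_failure_rate": sum(c < low for c in confidences) / n,
        # Share of images whose confidence is strictly above `high`.
        "high_reliability_rate": sum(c > high for c in confidences) / n,
    }


if __name__ == "__main__":
    # Made-up scores, only to show the call shape.
    print(summarize([20.0, 60.0, 90.0, 45.0]))
```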

🔍 Key Findings (Phase 1)

Traditional OCR struggles with complex layouts and low-quality scans. We identified specific "Hard Samples" (e.g., indices 15, 4, and 88) where OCR confidence dropped below 30%.
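
One simple way to surface such hard samples is to filter the per-image confidences by a threshold and sort the worst first. This is a minimal sketch; `hard_samples` is a hypothetical helper, and the 30% threshold is taken from the text above.

```python
def hard_samples(confidences, threshold=30.0):
    """Return (index, confidence) pairs below `threshold`, worst first."""
    flagged = [(i, c) for i, c in enumerate(confidences) if c < threshold]
    return sorted(flagged, key=lambda pair: pair[1])


if __name__ == "__main__":
    # Made-up scores, only to show the call shape.
    for index, conf in hard_samples([62.0, 12.5, 85.0, 29.9, 30.0]):
        print(f"index {index}: {conf:.1f}%")
```

Inspecting the flagged images by hand is usually the quickest way to tell whether the failures come from layout, scan quality, or pre-processing.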

🛠️ How to Reproduce

The attached 01_data_exploration.ipynb contains the full pipeline:

Load the dataset via the datasets library.
Pre-process each image with Otsu thresholding.
Run OCR via pytesseract.
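
The three steps can be sketched roughly as follows. This is not the notebook's code: the dataset identifier passed to `run_baseline`, the `image` column name, the `train` split, and the choice to average word-level confidences are all assumptions for illustration. Heavy dependencies are imported inside the functions so the pure helper can be used on its own.

```python
def word_confidence_mean(ocr_data):
    """Mean confidence over recognized words in a pytesseract DICT result."""
    scores = []
    for text, conf in zip(ocr_data["text"], ocr_data["conf"]):
        conf = float(conf)
        # Tesseract reports -1 for layout boxes that contain no word.
        if conf >= 0 and text.strip():
            scores.append(conf)
    return sum(scores) / len(scores) if scores else 0.0


def ocr_confidence(pil_image):
    """Binarize a PIL image with Otsu's method, then OCR it."""
    import cv2
    import numpy as np
    import pytesseract

    gray = np.array(pil_image.convert("L"))
    # With THRESH_OTSU the threshold argument (0) is ignored and chosen automatically.
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    data = pytesseract.image_to_data(binary, output_type=pytesseract.Output.DICT)
    return word_confidence_mean(data)


def run_baseline(dataset_id, split="train", n=100):
    """Score the first `n` images; `dataset_id` is a placeholder you must supply."""
    from datasets import load_dataset

    ds = load_dataset(dataset_id, split=split)
    # Assumes the dataset exposes a PIL image under the "image" column.
    return [ocr_confidence(ds[i]["image"]) for i in range(min(n, len(ds)))]
```

Running this requires the Tesseract binary to be installed and on the PATH, in addition to the Python packages.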

Next Phase: I am currently investigating Donut (an OCR-free Transformer) to handle the 24% of invoices where Tesseract's confidence fell below 50%.

Navneetkumar11/rvl-cdip-invoice-baseline-tesseract