Invoice Processing Baseline (Tesseract on RVL-CDIP)

This project establishes the performance "floor" for automated invoice processing using the RVL-CDIP Invoice dataset and Tesseract OCR.

📊 Performance Summary

  • Dataset Sample Size: 100 images
  • Mean OCR Confidence: 62.73%
  • Critical Failures (confidence <50%): 24% of images
  • High Reliability (confidence >80%): 19% of images
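
For readers who want to recompute these figures, here is a minimal sketch of how the summary could be derived from a list of per-image confidence scores (0 to 100). The function name `summarize` and the assumption that each image is reduced to a single score are illustrative and not taken from the notebook; the 50 and 80 cut-offs mirror the thresholds listed above.

```python
from statistics import mean


def summarize(confidences, low=50.0, high=80.0):
    """Summarize per-image OCR confidence scores (each in the range 0-100).

    `low` and `high` mirror the <50% and >80% cut-offs in the summary;
    the function itself is a hypothetical helper, not the notebook's code.
    """
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one confidence score")
    return {
        "sample_size": n,
        "mean_confidence": mean(confidences),
        # Share of images strictly below the low cut-off.
        "critical_failure_rate": sum(c < low for c in confidences) / n,
        # Share of images strictly above the high cut-off.
        "high_reliability_rate": sum(c > high for c in confidences) / n,
    }


if __name__ == "__main__":
    # Made-up scores for illustration only, not results from this project.
    print(summarize([20, 60, 90, 45]))
```

Both rates use strict inequalities, so an image scoring exactly 50 or 80 falls into neither bucket.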

πŸ” Key Findings (Phase 1)

Traditional OCR struggles with complex layouts and low-quality scans. We identified specific "Hard Samples" (e.g., indices 4, 15, and 88) where the OCR confidence dropped below 30%.
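
A sketch of how such hard samples could be surfaced, assuming the per-image confidences are kept in a list ordered by dataset index. The helper name `hard_samples` and the example values are hypothetical; only the 30% threshold comes from the text above.

```python
def hard_samples(confidences, threshold=30.0):
    """Return (index, confidence) pairs below `threshold`, worst first.

    Hypothetical helper: `confidences[i]` is assumed to be the OCR
    confidence (0-100) of the image at dataset index i.
    """
    flagged = [(i, c) for i, c in enumerate(confidences) if c < threshold]
    return sorted(flagged, key=lambda pair: pair[1])


if __name__ == "__main__":
    # Made-up scores for illustration only.
    print(hard_samples([55.0, 12.5, 80.0, 29.9, 30.0]))
```

Sorting by confidence puts the worst images first, which is convenient when inspecting them by hand.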

πŸ› οΈ How to Reproduce

The attached 01_data_exploration.ipynb contains the full pipeline:

  1. Load the dataset via the datasets library.
  2. Pre-process each image with Otsu thresholding.
  3. Run OCR via pytesseract.

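Steps 2 and 3 can be sketched roughly as follows. This is not the notebook's code: it assumes each sample is a PIL image (as the datasets library typically yields), that OpenCV, NumPy, pytesseract, and the Tesseract binary are installed, and that the per-image score is the mean of Tesseract's word-level confidences. The function names are hypothetical, and dataset loading is left out because the exact dataset identifier is not stated here.

```python
def mean_word_confidence(ocr_data):
    """Average the word-level confidences in a pytesseract result dict.

    Tesseract reports -1 for boxes that contain no recognized word, so
    those entries are skipped. Values may be ints, floats, or strings
    depending on the pytesseract version, hence the float() conversion.
    """
    confs = [float(c) for c in ocr_data["conf"]]
    words = [c for c in confs if c >= 0]
    return sum(words) / len(words) if words else 0.0


def ocr_confidence(image):
    """Binarize a PIL image with Otsu's method, then OCR it (assumed flow)."""
    # Imported here so mean_word_confidence works without these packages.
    import cv2
    import numpy as np
    import pytesseract

    # Otsu thresholding needs an 8-bit single-channel image.
    gray = np.array(image.convert("L"))
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    data = pytesseract.image_to_data(binary, output_type=pytesseract.Output.DICT)
    return mean_word_confidence(data)
```

Calling `ocr_confidence` on every image in the sample yields the list of scores from which the summary statistics can be computed.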
Next Phase: I am currently investigating Donut (an OCR-free Transformer) to handle the 24% of invoices where Tesseract's confidence fell below 50%.
