---
language: en
license: apache-2.0
tags:
  - text2text-generation
  - ocr
  - error-correction
  - bart
  - historical-text
datasets:
  - custom
metrics:
  - cer
  - wer
model-index:
  - name: bart-synthetic-data-vampyre-ocr-correction
    results:
      - task:
          type: text2text-generation
          name: OCR Error Correction
        dataset:
          type: custom
          name: The Vampyre (Synthetic + Real)
        metrics:
          - type: cer
            value: 14.49
            name: Character Error Rate
          - type: wer
            value: 37.99
            name: Word Error Rate
---

# BART-Base OCR Error Correction (Synthetic Data + Real Vampyre Text)

This model is a fine-tuned version of [facebook/bart-base](https://huggingface.co/facebook/bart-base) for correcting OCR errors in historical texts. It was trained on synthetic OCR corruptions and evaluated on real OCR output of "The Vampyre".

## 🎯 Model Description

- **Base Model:** facebook/bart-base
- **Task:** OCR error correction
- **Training Strategy:**
  - Train/Val: synthetic OCR data (1,020 samples with GPT-4-generated errors)
  - Test: real OCR data from "The Vampyre" (300 samples)
- **Best Checkpoint:** epoch 2
- **Validation CER:** 14.49%
- **Validation WER:** 37.99%

## 📊 Performance

Evaluated on real historical OCR text from "The Vampyre":

| Metric | Score |
|---|---|
| Character Error Rate (CER) | 14.49% |
| Word Error Rate (WER) | 37.99% |
| Exact Match | 0.0% |
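
These scores can be reproduced with the `jiwer` library, a common choice for CER/WER. A minimal sketch, assuming model outputs are paired with ground-truth references; the original evaluation script is not published here, so this mirrors rather than reproduces it:

```python
# Minimal CER/WER computation with jiwer (assumption: this mirrors the
# evaluation protocol; the actual script is not part of this repository).
import jiwer

references  = ["The ancient trees swayed gently."]   # ground-truth transcriptions
predictions = ["The ancient trees swayed gent1y."]   # model outputs

print(f"CER: {jiwer.cer(references, predictions):.2%}")
print(f"WER: {jiwer.wer(references, predictions):.2%}")
```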

## 🚀 Usage

### Quick Start

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("ejaz111/bart-synthetic-data-vampyre-ocr-correction")
model = AutoModelForSeq2SeqLM.from_pretrained("ejaz111/bart-synthetic-data-vampyre-ocr-correction")

# Correct OCR errors
ocr_text = "Th1s 1s an 0CR err0r w1th m1stakes in the anc1ent text."
input_ids = tokenizer(ocr_text, return_tensors="pt", max_length=512, truncation=True).input_ids
outputs = model.generate(input_ids, max_length=512)
corrected_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(f"Original:  {ocr_text}")
print(f"Corrected: {corrected_text}")
```

### Using Pipeline

```python
from transformers import pipeline

corrector = pipeline("text2text-generation", model="ejaz111/bart-synthetic-data-vampyre-ocr-correction")
result = corrector("The breeze wh15pered so7tly through the anci3nt tre55")[0]["generated_text"]
print(result)
# Output: "The breeze whispered softly through the ancient trees"
```

## 🎓 Training Details

### Training Data

- **Synthetic Data (Train/Val):** 1,020 samples
  - 85% training (~867 samples)
  - 15% validation (~153 samples)
  - Generated using GPT-4 with 20 corruption strategies
- **Real Data (Test):** 300 samples from "The Vampyre" OCR text
- **No data leakage:** the test set contains only real OCR data, never seen during training
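
A minimal sketch of producing the 85/15 split described above, assuming the synthetic pairs live in a 🤗 `datasets` Dataset with hypothetical `ocr`/`clean` columns; the actual preprocessing code is not published:

```python
# Illustrative train/validation split; column names are placeholders.
from datasets import Dataset

pairs = {
    "ocr":   ["Th1s 1s an 0CR err0r", "The anci3nt tre55"],
    "clean": ["This is an OCR error", "The ancient trees"],
}  # 1,020 samples in the real dataset
ds = Dataset.from_dict(pairs)

split = ds.train_test_split(test_size=0.15, seed=42)  # 85% train / 15% validation
train_ds, val_ds = split["train"], split["test"]
```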

### Training Configuration

- **Epochs:** 20 (best model at epoch 2)
- **Batch Size:** 16
- **Learning Rate:** 1e-4
- **Optimizer:** AdamW with weight decay 0.01
- **Scheduler:** linear with warmup (10% of steps)
- **Max Sequence Length:** 512 tokens
- **Architecture:** BART encoder-decoder with 139M parameters
- **Training Time:** ~30 minutes on GPU
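
For readers who want a comparable run, the settings above translate roughly to 🤗 `Seq2SeqTrainingArguments` as sketched below. This is a reconstruction, not the original training script; `output_dir` is a placeholder, and `metric_for_best_model="cer"` assumes a `compute_metrics` function that reports a `cer` key:

```python
# Sketch of the configuration above (a reconstruction; not the original script).
from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir="bart-vampyre-ocr",   # placeholder path
    num_train_epochs=20,
    per_device_train_batch_size=16,
    learning_rate=1e-4,
    weight_decay=0.01,               # AdamW weight decay
    lr_scheduler_type="linear",
    warmup_ratio=0.1,                # 10% warmup steps
    eval_strategy="epoch",           # `evaluation_strategy` in older transformers
    save_strategy="epoch",
    load_best_model_at_end=True,     # retains the best (here, epoch-2) checkpoint
    metric_for_best_model="cer",     # assumes compute_metrics returns "cer"
    greater_is_better=False,         # lower CER is better
    predict_with_generate=True,
)
```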

### Corruption Strategies (Training Data)

The synthetic training data included these OCR error types (an illustrative corruption function follows the list):

- Character substitutions (visual similarity)
- Missing/extra characters
- Word boundary errors
- Case errors
- Punctuation errors
- Long s (ſ) substitutions
- Historical typography errors
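
As a rough stand-in, a hypothetical rule-based corruptor covering a few of these strategies might look like this; the real training data was generated by GPT-4, not by code like this:

```python
# Hypothetical rule-based corruptor; the actual data came from GPT-4.
import random

LOOKALIKES = {"i": "1", "o": "0", "l": "1", "e": "3"}  # visually similar characters

def corrupt(text: str, p: float = 0.15, seed: int = 0) -> str:
    rng = random.Random(seed)
    out = []
    for ch in text:
        r = rng.random()
        if r < p and ch.lower() in LOOKALIKES:
            out.append(LOOKALIKES[ch.lower()])  # character substitution
        elif r < p and ch == "s":
            out.append("\u017f")                # long s (ſ) substitution
        elif r < p / 3:
            continue                            # missing character
        else:
            out.append(ch)
    return "".join(out)

print(corrupt("The breeze whispered softly through the ancient trees"))
```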

## 📈 Training Progress

The model showed rapid improvement in early epochs:

- Epoch 1: CER 16.62%
- Epoch 2: CER 14.49% ⭐ (best)
- Epoch 3: CER 15.86%
- Later epochs showed overfitting, with CER rising to ~20%

The epoch-2 checkpoint was saved and is the version published in this repository.

## 💡 Use Cases

This model is particularly effective for:

- Correcting OCR errors in historical documents
- Post-processing digitized manuscripts
- Cleaning text from scanned historical books
- Literary text restoration
- Academic research on historical texts

## ⚠️ Limitations

- Optimized for English historical texts
- Best performance on texts similar to 19th-century literature
- May struggle with extremely degraded or non-standard OCR
- Maximum input length: 512 tokens (longer texts must be processed in chunks; see the sketch below)
- Higher WER than the T5 baseline (37.99% vs 22.52%)
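
Because inputs are capped at 512 tokens, longer documents need piecewise processing. One simple approach is sketched below; sentence-level splitting is an assumption, as the model card prescribes no chunking scheme:

```python
# Naive chunked correction for texts longer than the 512-token limit.
import re

def correct_long_text(text: str, tokenizer, model, max_length: int = 512) -> str:
    # Split on sentence-ish boundaries (an assumption, not a documented scheme).
    sentences = re.split(r"(?<=[.!?])\s+", text)
    corrected = []
    for sentence in sentences:
        ids = tokenizer(sentence, return_tensors="pt",
                        max_length=max_length, truncation=True).input_ids
        out = model.generate(ids, max_length=max_length)
        corrected.append(tokenizer.decode(out[0], skip_special_tokens=True))
    return " ".join(corrected)
```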

## 🔬 Model Comparison

| Model | CER | WER | Parameters |
|---|---|---|---|
| BART-base | 14.49% | 37.99% | 139M |
| T5-base | 13.93% | 22.52% | 220M |

T5-base edges out BART-base slightly on character-level accuracy and is markedly stronger at the word level, though BART-base achieves its results with roughly 37% fewer parameters.

## 🔬 Evaluation Examples

| Original OCR | Corrected Output |
|---|---|
| "Th1s 1s an 0CR err0r" | "This is an OCR error" |
| "The anci3nt tre55" | "The ancient trees" |
| "bl0omiNg floweRs" | "blooming flowers" |

## 📚 Citation

If you use this model in your research, please cite:

```bibtex
@misc{bart-vampyre-ocr,
  author = {Ejaz},
  title = {BART Base OCR Error Correction for Historical Texts},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/ejaz111/bart-synthetic-data-vampyre-ocr-correction}}
}
```

## 👤 Author

**Ejaz** - Master's Student in AI and Robotics

## 📄 License

This model is released under the Apache 2.0 license.

πŸ™ Acknowledgments