---
language: en
license: apache-2.0
tags:
  - text2text-generation
  - ocr
  - error-correction
  - bart
  - historical-text
datasets:
  - custom
metrics:
  - cer
  - wer
model-index:
  - name: bart-synthetic-data-vampyre-ocr-correction
    results:
      - task:
          type: text2text-generation
          name: OCR Error Correction
        dataset:
          type: custom
          name: The Vampyre (Synthetic + Real)
        metrics:
          - type: cer
            value: 14.49
            name: Character Error Rate
          - type: wer
            value: 37.99
            name: Word Error Rate
---

# BART-Base OCR Error Correction (Synthetic Data + Real Vampyre Text)

This model is a fine-tuned version of [facebook/bart-base](https://huggingface.co/facebook/bart-base) for correcting OCR errors in historical texts. It was trained on synthetic OCR corruptions and evaluated on real OCR output of "The Vampyre".

## 🎯 Model Description

- **Base Model:** facebook/bart-base
- **Task:** OCR error correction
- **Training Strategy:**
  - Train/Val: synthetic OCR data (1,020 samples with GPT-4-generated errors)
  - Test: real OCR data from "The Vampyre" (300 samples)
- **Best Checkpoint:** epoch 2
- **Validation CER:** 14.49%
- **Validation WER:** 37.99%

## 📊 Performance

Evaluated on real historical OCR text from "The Vampyre":

| Metric | Score |
|---|---|
| Character Error Rate (CER) | 14.49% |
| Word Error Rate (WER) | 37.99% |
| Exact Match | 0.0% |
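
These scores can be reproduced with the `jiwer` library, a common choice for CER/WER. A minimal sketch, assuming model outputs are paired with ground-truth references; the original evaluation script is not published here, so this mirrors rather than reproduces it:

```python
# Minimal CER/WER computation with jiwer (assumption: this mirrors the
# evaluation protocol; the actual script is not part of this repository).
import jiwer

references  = ["The ancient trees swayed gently."]   # ground-truth transcriptions
predictions = ["The ancient trees swayed gent1y."]   # model outputs

print(f"CER: {jiwer.cer(references, predictions):.2%}")
print(f"WER: {jiwer.wer(references, predictions):.2%}")
```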

## 🚀 Usage

### Quick Start

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("ejaz111/bart-synthetic-data-vampyre-ocr-correction")
model = AutoModelForSeq2SeqLM.from_pretrained("ejaz111/bart-synthetic-data-vampyre-ocr-correction")

# Correct OCR errors
ocr_text = "Th1s 1s an 0CR err0r w1th m1stakes in the anc1ent text."
input_ids = tokenizer(ocr_text, return_tensors="pt", max_length=512, truncation=True).input_ids
outputs = model.generate(input_ids, max_length=512)
corrected_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(f"Original:  {ocr_text}")
print(f"Corrected: {corrected_text}")
```

### Using Pipeline

```python
from transformers import pipeline

corrector = pipeline("text2text-generation", model="ejaz111/bart-synthetic-data-vampyre-ocr-correction")
result = corrector("The breeze wh15pered so7tly through the anci3nt tre55")[0]["generated_text"]
print(result)
# Output: "The breeze whispered softly through the ancient trees"
```

## 🎓 Training Details

### Training Data

- **Synthetic Data (Train/Val):** 1,020 samples
  - 85% training (~867 samples)
  - 15% validation (~153 samples)
  - Generated using GPT-4 with 20 corruption strategies
- **Real Data (Test):** 300 samples from "The Vampyre" OCR text
- **No data leakage:** the test set contains only real OCR data, never seen during training
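
A minimal sketch of producing the 85/15 split described above, assuming the synthetic pairs live in a 🤗 `datasets` Dataset with hypothetical `ocr`/`clean` columns; the actual preprocessing code is not published:

```python
# Illustrative train/validation split; column names are placeholders.
from datasets import Dataset

pairs = {
    "ocr":   ["Th1s 1s an 0CR err0r", "The anci3nt tre55"],
    "clean": ["This is an OCR error", "The ancient trees"],
}  # 1,020 samples in the real dataset
ds = Dataset.from_dict(pairs)

split = ds.train_test_split(test_size=0.15, seed=42)  # 85% train / 15% validation
train_ds, val_ds = split["train"], split["test"]
```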

### Training Configuration

- **Epochs:** 20 (best model at epoch 2)
- **Batch Size:** 16
- **Learning Rate:** 1e-4
- **Optimizer:** AdamW with weight decay 0.01
- **Scheduler:** linear with warmup (10% of steps)
- **Max Sequence Length:** 512 tokens
- **Architecture:** BART encoder-decoder with 139M parameters
- **Training Time:** ~30 minutes on GPU
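
For readers who want a comparable run, the settings above translate roughly to 🤗 `Seq2SeqTrainingArguments` as sketched below. This is a reconstruction, not the original training script; `output_dir` is a placeholder, and `metric_for_best_model="cer"` assumes a `compute_metrics` function that reports a `cer` key:

```python
# Sketch of the configuration above (a reconstruction; not the original script).
from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir="bart-vampyre-ocr",   # placeholder path
    num_train_epochs=20,
    per_device_train_batch_size=16,
    learning_rate=1e-4,
    weight_decay=0.01,               # AdamW weight decay
    lr_scheduler_type="linear",
    warmup_ratio=0.1,                # 10% warmup steps
    eval_strategy="epoch",           # `evaluation_strategy` in older transformers
    save_strategy="epoch",
    load_best_model_at_end=True,     # retains the best (here, epoch-2) checkpoint
    metric_for_best_model="cer",     # assumes compute_metrics returns "cer"
    greater_is_better=False,         # lower CER is better
    predict_with_generate=True,
)
```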

### Corruption Strategies (Training Data)

The synthetic training data included these OCR error types (an illustrative corruption function follows the list):

- Character substitutions (visual similarity)
- Missing/extra characters
- Word boundary errors
- Case errors
- Punctuation errors
- Long s (ſ) substitutions
- Historical typography errors
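
As a rough stand-in, a hypothetical rule-based corruptor covering a few of these strategies might look like this; the real training data was generated by GPT-4, not by code like this:

```python
# Hypothetical rule-based corruptor; the actual data came from GPT-4.
import random

LOOKALIKES = {"i": "1", "o": "0", "l": "1", "e": "3"}  # visually similar characters

def corrupt(text: str, p: float = 0.15, seed: int = 0) -> str:
    rng = random.Random(seed)
    out = []
    for ch in text:
        r = rng.random()
        if r < p and ch.lower() in LOOKALIKES:
            out.append(LOOKALIKES[ch.lower()])  # character substitution
        elif r < p and ch == "s":
            out.append("\u017f")                # long s (ſ) substitution
        elif r < p / 3:
            continue                            # missing character
        else:
            out.append(ch)
    return "".join(out)

print(corrupt("The breeze whispered softly through the ancient trees"))
```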

## 📈 Training Progress

The model showed rapid improvement in early epochs:

- Epoch 1: CER 16.62%
- Epoch 2: CER 14.49% ⭐ (best)
- Epoch 3: CER 15.86%
- Later epochs showed overfitting, with CER rising to ~20%

The epoch-2 checkpoint was saved and is the version published in this repository.

## 💡 Use Cases

This model is particularly effective for:

- Correcting OCR errors in historical documents
- Post-processing digitized manuscripts
- Cleaning text from scanned historical books
- Literary text restoration
- Academic research on historical texts

## ⚠️ Limitations

- Optimized for English historical texts
- Best performance on texts similar to 19th-century literature
- May struggle with extremely degraded or non-standard OCR
- Maximum input length: 512 tokens (longer texts must be processed in chunks; see the sketch below)
- Higher WER than the T5 baseline (37.99% vs 22.52%)
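
Because inputs are capped at 512 tokens, longer documents need piecewise processing. One simple approach is sketched below; sentence-level splitting is an assumption, as the model card prescribes no chunking scheme:

```python
# Naive chunked correction for texts longer than the 512-token limit.
import re

def correct_long_text(text: str, tokenizer, model, max_length: int = 512) -> str:
    # Split on sentence-ish boundaries (an assumption, not a documented scheme).
    sentences = re.split(r"(?<=[.!?])\s+", text)
    corrected = []
    for sentence in sentences:
        ids = tokenizer(sentence, return_tensors="pt",
                        max_length=max_length, truncation=True).input_ids
        out = model.generate(ids, max_length=max_length)
        corrected.append(tokenizer.decode(out[0], skip_special_tokens=True))
    return " ".join(corrected)
```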

## 🔬 Model Comparison

| Model | CER | WER | Parameters |
|---|---|---|---|
| BART-base | 14.49% | 37.99% | 139M |
| T5-base | 13.93% | 22.52% | 220M |

T5-base edges out BART-base slightly on character-level accuracy and is markedly stronger at the word level, though BART-base achieves its results with roughly 37% fewer parameters.

## 🔬 Evaluation Examples

| Original OCR | Corrected Output |
|---|---|
| "Th1s 1s an 0CR err0r" | "This is an OCR error" |
| "The anci3nt tre55" | "The ancient trees" |
| "bl0omiNg floweRs" | "blooming flowers" |

## 📚 Citation

If you use this model in your research, please cite:

```bibtex
@misc{bart-vampyre-ocr,
  author = {Ejaz},
  title = {BART Base OCR Error Correction for Historical Texts},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/ejaz111/bart-synthetic-data-vampyre-ocr-correction}}
}
```

## 👤 Author

**Ejaz** - Master's Student in AI and Robotics

## 📄 License

This model is released under the Apache 2.0 license.

πŸ™ Acknowledgments