metadata
license: cc-by-4.0
language:
  - lus
tags:
  - ocr
  - mizo
  - northeast-india
  - trocr
  - image-to-text
  - low-resource
model_name: mizo-ocr
base_model: microsoft/trocr-base-printed

MizoOCR

The first OCR model for the Mizo language, developed by MWire Labs.

Model Description

MizoOCR is a fine-tuned TrOCR model for recognizing printed Mizo text, including its diacritical characters (â, ê, î, ô, û). It is built on microsoft/trocr-base-printed and trained on a deduplicated mix of 70,000 curated and synthetic image-text pairs, drawn from a 200k-sample dataset generated by MWire Labs.
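Each of the circumflexed vowels above can be encoded in Unicode either as a single precomposed code point or as a base letter followed by a combining circumflex (U+0302). The model card does not say which form the training labels use, so the following is only a sketch, assuming you want to normalize both model output and reference text to NFC before comparing or storing them. The helper name `normalize_mizo` is hypothetical and not part of the model's API.

```python
import unicodedata


def normalize_mizo(text: str) -> str:
    """Return text in NFC form so each circumflexed vowel is a single code point."""
    return unicodedata.normalize("NFC", text)


# "a" followed by COMBINING CIRCUMFLEX ACCENT (U+0302): two code points
decomposed = "a\u0302"
# LATIN SMALL LETTER A WITH CIRCUMFLEX (U+00E2): one code point
precomposed = "\u00e2"

print(decomposed == precomposed)                  # False: the raw strings differ
print(normalize_mizo(decomposed) == precomposed)  # True: equal after NFC
```

Applying the same normalization to predictions and ground truth avoids counting visually identical characters as recognition errors.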

Performance

| Split | Character Accuracy |
|---|---|
| Validation | 89.61% |
| Test | 90.68% |
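The card reports character accuracy without defining it. A common convention, assumed here and not confirmed by the source, is 1 minus the character error rate, where the error rate is the Levenshtein edit distance between prediction and reference divided by the number of reference characters. The function names below are hypothetical, and the figures above may have been computed differently.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_accuracy(predictions, references) -> float:
    """Corpus-level accuracy: 1 - (total edit distance / total reference characters)."""
    total_errors = sum(edit_distance(p, r) for p, r in zip(predictions, references))
    total_chars = sum(len(r) for r in references)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    return max(0.0, 1.0 - total_errors / total_chars)


# Toy example: a dropped circumflex is one substitution in four characters
print(character_accuracy(["lawm"], ["l\u00e2wm"]))  # 0.75
```

Under this definition a missing diacritic counts as a full character error, which is one reason accuracy on Mizo text can be sensitive to how the circumflexed vowels are handled.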

Training Data

  • Total unique samples after deduplication: 102,171
  • Training samples: 70,000
  • Validation samples: 5,000
  • Test samples: 5,000
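The listed splits sum to 80,000 samples, which leaves 22,171 of the 102,171 unique samples not assigned to any split in this card. How the splits were drawn is not described. The following is a minimal sketch, assuming exact-match deduplication on (image path, text) pairs followed by a seeded random shuffle. The function name `dedup_and_split` and the toy data are hypothetical.

```python
import random


def dedup_and_split(pairs, n_train, n_val, n_test, seed=0):
    """Drop exact duplicate pairs, shuffle, and cut three disjoint splits."""
    unique = list(dict.fromkeys(pairs))  # keeps the first occurrence of each pair
    if n_train + n_val + n_test > len(unique):
        raise ValueError("not enough unique samples for the requested splits")
    rng = random.Random(seed)            # fixed seed keeps the split reproducible
    rng.shuffle(unique)
    train = unique[:n_train]
    val = unique[n_train:n_train + n_val]
    test = unique[n_train + n_val:n_train + n_val + n_test]
    return train, val, test


# Toy data standing in for the real dataset: 12 pairs, 2 of them duplicates
pairs = [(f"img_{i}.png", f"text {i}") for i in range(10)]
pairs += [("img_0.png", "text 0"), ("img_1.png", "text 1")]

train, val, test = dedup_and_split(pairs, n_train=6, n_val=2, n_test=2)
print(len(train), len(val), len(test))  # 6 2 2
```

With the sizes from this card, the call would be `dedup_and_split(pairs, 70_000, 5_000, 5_000)`.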

Usage

from transformers import TrOCRProcessor, VisionEncoderDecoderModel
from PIL import Image

# Load the processor (image preprocessing + tokenizer) and the fine-tuned model
processor = TrOCRProcessor.from_pretrained("MWirelabs/mizo-ocr")
model = VisionEncoderDecoderModel.from_pretrained("MWirelabs/mizo-ocr")

# Convert to RGB so grayscale or RGBA inputs are handled consistently
image = Image.open("mizo_text.jpg").convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values

# Generate token IDs, then decode them to text
generated_ids = model.generate(pixel_values)
text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(text)

Limitations

  • Trained primarily on synthetic data, supplemented by a small curated dataset; accuracy on real scanned documents may vary
  • Optimized for printed text, not handwriting
  • Accuracy may drop on heavily degraded or low-quality images

Citation

If you use this model, please cite:

@misc{mwirelabs2026mizoocr,
  title={MizoOCR: First OCR Model for the Mizo Language},
  author={MWire Labs},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/MWirelabs/mizo-ocr}
}

About MWire Labs

MWire Labs is an AI research organization based in Shillong, Meghalaya, India, specializing in language technology for Northeast India's indigenous languages.