---
license: cc-by-4.0
language:
- lus
tags:
- ocr
- mizo
- northeast-india
- trocr
- image-to-text
- low-resource
model_name: mizo-ocr
base_model: microsoft/trocr-base-printed
---

# MizoOCR

The first OCR model for the Mizo language, developed by [MWire Labs](https://huggingface.co/MWirelabs).

## Model Description

MizoOCR is a fine-tuned TrOCR model for recognizing printed Mizo text, including its diacritical characters (â, ê, î, ô, û). It is built on `microsoft/trocr-base-printed` and trained on 70,000 image-text pairs, a deduplicated mix of curated and synthetic samples drawn from a 200k dataset generated by MWire Labs.

## Performance

| Split | Character Accuracy |
|-------|--------------------|
| Validation | 89.61% |
| Test | 90.68% |

## Training Data

- **Total unique samples after deduplication:** 102,171
- **Training samples:** 70,000
- **Validation samples:** 5,000
- **Test samples:** 5,000

## Usage

```python
from transformers import TrOCRProcessor, VisionEncoderDecoderModel
from PIL import Image

# Load the processor (image preprocessing + tokenizer) and the fine-tuned model
processor = TrOCRProcessor.from_pretrained("MWirelabs/mizo-ocr")
model = VisionEncoderDecoderModel.from_pretrained("MWirelabs/mizo-ocr")

# Convert to RGB so the image has the three channels the processor expects
image = Image.open("mizo_text.jpg").convert("RGB")
pixel_values = processor(image, return_tensors="pt").pixel_values

# Generate token IDs, then decode them to a string
generated = model.generate(pixel_values)
text = processor.tokenizer.decode(generated[0], skip_special_tokens=True)
print(text)
```

## Limitations

- Trained primarily on synthetic data with a small curated dataset; accuracy on real scanned documents may vary
- Optimized for printed text, not handwritten text
- Performance may vary on heavily degraded or low-quality images

## Citation

If you use this model, please cite:

```
@misc{mwirelabs2026mizoocr,
  title={MizoOCR: First OCR Model for the Mizo Language},
  author={MWire Labs},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/MWirelabs/mizo-ocr}
}
```

## About MWire Labs

MWire Labs is an AI research organization based in Shillong, Meghalaya, India, specializing in language technology for Northeast India's indigenous languages.
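
## Computing Character Accuracy

The Performance table reports character accuracy, but this card does not state how the metric was computed. The sketch below assumes one common definition: one minus the character error rate, where the error rate is the total Levenshtein edit distance divided by the total number of reference characters. The function names `levenshtein` and `character_accuracy` are illustrative, and the NFC normalization step is also an assumption, added so that precomposed and decomposed forms of characters such as â compare as equal. Results from this sketch may not match the reported figures exactly.

```python
import unicodedata


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit cost for insert, delete, and substitute."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_accuracy(predictions, references):
    """Assumed definition: 1 - (total edits / total reference characters)."""
    total_edits = 0
    total_chars = 0
    for pred, ref in zip(predictions, references):
        # Assumption: normalize to NFC so "a" + combining circumflex equals "â"
        pred = unicodedata.normalize("NFC", pred)
        ref = unicodedata.normalize("NFC", ref)
        total_edits += levenshtein(pred, ref)
        total_chars += len(ref)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    # Clamp at zero: very long wrong predictions can push the error rate above 1
    return max(0.0, 1.0 - total_edits / total_chars)
```

To use it, pass the decoded model outputs and the ground-truth strings as two parallel lists, for example `character_accuracy([text], [expected_text])`, where `expected_text` is a reference transcription you supply.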