LightOnOCR-2-1B for Samaritan Hebrew/Aramaic

This model is a fine-tuned version of lightonai/LightOnOCR-2-1B-base specifically trained for page-level OCR of Samaritan manuscripts.

Model Description

Base Model: lightonai/LightOnOCR-2-1B-base
Training Data: samaritan-ai/samaritan_hebrew_LightOnOcr
Task: Page-level text transcription from manuscript images
Language: Samaritan Hebrew/Aramaic (smp/sam)
Architecture: Vision-Language Model (1B parameters)

This is a page-level model: it accepts full pages, paragraphs, or crops of individual lines.

Evaluation Results

Test Set Performance

Metric	Base Model	Fine-tuned Model	Improvement
CER (Character Error Rate)	475.89%	7.68%	+468.22% (+98.4%)
WER (Word Error Rate)	341.22%	15.37%	+325.85% (+95.5%)
Perfect Matches	0/50 (0.00%)	37/50 (74.00%)	+74.00%
Character Accuracy	382.84%	59.31%	-323.53%
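
Error rates above 100% are possible because CER and WER divide an edit distance by the length of the reference, and the base model's long repetitive outputs add far more insertions than the reference has characters. Below is a minimal sketch of how such metrics are commonly computed. It is an assumption about the definitions, not the evaluation script behind the table, and the function names are illustrative. The last two lines check that the parenthesised figures in the Improvement column match the relative reduction of each error rate.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: character edits divided by reference length."""
    return edit_distance(ref, hyp) / len(ref)


def wer(ref, hyp):
    """Word error rate: word edits divided by the number of reference words."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)


def relative_reduction(before, after):
    """Relative reduction of an error rate, in percent."""
    return (before - after) / before * 100


print(cer("abc", "abc"))           # 0.0
print(cer("abc", "abcabcabcabc"))  # 3.0, i.e. 300%
print(round(relative_reduction(475.89, 7.68), 1))   # 98.4
print(round(relative_reduction(341.22, 15.37), 1))  # 95.5
```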

Model Details

Base Model: lightonai/LightOnOCR-2-1B-base
Fine-tuned Model: LightOnOcr-2_samaritan
Test Samples: 50
Evaluation Date: 2026-01-23 14:19:32

Sample Predictions

Base Model Examples

❌ Sample 1:

Ground Truth: הלא כל גברה דאזל בתר בעל פעור שיציה יהוה אלהך מבגבך ואתון מתקרבים ביהוה אלהכון קעימים כלכון יומה חזו אלפת יתכון גזרים ודינים
Prediction: (empty output)

❌ Sample 2:

Ground Truth: כעת מחר ברד כבד מאד אשר לא היה כמהו
Prediction: הַנְּהָאָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָהָ

❌ Sample 3:

Ground Truth: ויאמר משה לבני ישראל לא תערצון ולא תיראון מהם יהוה אלהיכם ההלך לפניכם הוא ילחם לכם ככל אשר עשה אתכם במצרים לעיניכם ובמדבר אשר ראית אשר נשאך יהוה אלהיך כאשר ישא איש את בנו בכל הדרך אשר הלכתם עד באכם עד המקום הזה ובדבר הזה אינכם מאמנים ביהוה אלהיכם ההלך לפניכם בדרך לתור לכם מקום
Prediction: 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.

❌ Sample 4:

Ground Truth: היה יהוה עמך ונאמר תהיה נא
Prediction: ٣٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠٠

❌ Sample 5:

Ground Truth: לו עלה נעלה וירשנו אתה כי יכל
Prediction: 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25. 25.

❌ Sample 6:

Ground Truth: שקר על אחת מכל אשר יעשה האדם
Prediction: 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44. 44.

❌ Sample 7:

Ground Truth: הגר בתוכם תורה אחת יהיה לכם לעשה בשגגה והנפש
Prediction: `$\text{مَحَمَّد}$

$\text{بَعْدَ مَحَمَّدٍ}$

$\text{مَحَمَّدٍ}$

$\text{مَحَمَّدٍ}$`

❌ Sample 8:

Ground Truth: פרעה לשאת אתו ויקחו את מקניהם ואת רכושם
Prediction: స్థానం నుండి అందుబాటులో నుండి అందుబాటులో నుండి అందుబాటులో నుండి అందుబాటులో నుండి అందుబాటులో నుండి అందుబాటులో నుండి అందుబాటులో నుండి అందుబాటులో నుండి అందుబాటులో నుండి అందుబాటులో నుండి అందుబాటులో నుండి అందుబాటులో నుండి అందుబాటులో నుండి అందుబాటులో నుండి అందుబాటులో నుండి అందుబాటులో నుండి అందుబాటులో నుండి అందుబాటులో నుండి అందుబాటులో నుండ

❌ Sample 9:

Ground Truth: הענן שם יחנו בני ישראל על פי יהוה
Prediction: `$\text{2. مسحی کیا کریں}$

$\text{۱. مسحی کیا کریں}$`

❌ Sample 10:

Ground Truth: אוי לך מואב אבדת עם כמוש נתן
Prediction: ٥٠٠٠: سَمْعَةٌ مُحَمَّدٌ وَمُحَمَّدٌ وَمُحَمَّدٌ وَمُحَمَّدٌ وَمُحَمَّدٌ وَمُحَمَّدٌ وَمُحَمَّدٌ وَمُحَمَّدٌ وَمُحَمَّدٌ وَمُحَمَّدٌ وَمُحَمَّدٌ وَمُحَمَّدٌ وَمُحَمَّدٌ وَمُحَمَّدٌ وَمُحَمَّدٌ وَمُحَمَّدٌ وَمُحَمَّدٌ وَمُحَمَّدٌ وَمُحَمَّدٌ وَمُحَمَّدٌ وَمُحَمَّدٌ وَمُحَمَّدٌ وَمُحَمَّدٌ وَمُحَمَّدٌ وَمُحَمَّدٌ وَمُحَمَّدٌ وَمُحَمَّدٌ وَمُحَمَّدٌ وَمُحَمَّدٌ وَمُحَمَّدٌ وَمُحَمَّدٌ وَمُحَمَّدٌ وَمُحَمَّدٌ وَمُحَمَّدٌ وَمُحَمَّدٌ وَمُحَمَّدٌ وَمُحَمَّدٌ وَمُحَمَّدٌ وَمُحَمَّدٌ وَمُحَمَّدٌ وَمُحَمَّدٌ وَمُ

Fine-tuned Model Examples

✅ Sample 1:

Ground Truth: הלא כל גברה דאזל בתר בעל פעור שיציה יהוה אלהך מבגבך ואתון מתקרבים ביהוה אלהכון קעימים כלכון יומה חזו אלפת יתכון גזרים ודינים
Prediction: הלא כל גברה דאזל בתר בעל פעור שיציה יהוה אלהך מבגבך ואתון מתקרבים ביהוה אלהכון קעימים כלכון יומה חזו אלפת יתכון גזרים ודינים

✅ Sample 2:

Ground Truth: כעת מחר ברד כבד מאד אשר לא היה כמהו
Prediction: כעת מחר ברד כבד מאד אשר לא היה כמהו

❌ Sample 3:

Ground Truth: ויאמר משה לבני ישראל לא תערצון ולא תיראון מהם יהוה אלהיכם ההלך לפניכם הוא ילחם לכם ככל אשר עשה אתכם במצרים לעיניכם ובמדבר אשר ראית אשר נשאך יהוה אלהיך כאשר ישא איש את בנו בכל הדרך אשר הלכתם עד באכם עד המקום הזה ובדבר הזה אינכם מאמנים ביהוה אלהיכם ההלך לפניכם בדרך לתור לכם מקום
Prediction: ויאמר משה לבני ישראל לא תערצון ולא תיראון מהם יהוה אלהיכם ההלך לפניכם הוא ילחם לכם ככל אשר עשה אתכם במצרים לעיניכם ובמדבר אשר ראית אשר נשאך יהוה אלהיך כאשר ישא איש את בנו בכל הדרך אשר הלכתם עד באכם עד המקום הזה ובדבר הזה אינכם מאמינים ביהוה אלהיכם ההלך לפניכם בדרך לתר לכם מקום

✅ Sample 4:

Ground Truth: היה יהוה עמך ונאמר תהיה נא
Prediction: היה יהוה עמך ונאמר תהיה נא

✅ Sample 5:

Ground Truth: לו עלה נעלה וירשנו אתה כי יכל
Prediction: לו עלה נעלה וירשנו אתה כי יכל

✅ Sample 6:

Ground Truth: שקר על אחת מכל אשר יעשה האדם
Prediction: שקר על אחת מכל אשר יעשה האדם

❌ Sample 7:

Ground Truth: הגר בתוכם תורה אחת יהיה לכם לעשה בשגגה והנפש
Prediction: הגר בתוכם תורה אחת יהיה לכם לעשות בשגגה והנפש

✅ Sample 8:

Ground Truth: פרעה לשאת אתו ויקחו את מקניהם ואת רכושם
Prediction: פרעה לשאת אתו ויקחו את מקניהם ואת רכושם

✅ Sample 9:

Ground Truth: הענן שם יחנו בני ישראל על פי יהוה
Prediction: הענן שם יחנו בני ישראל על פי יהוה

✅ Sample 10:

Ground Truth: אוי לך מואב אבדת עם כמוש נתן
Prediction: אוי לך מואב אבדת עם כמוש נתן

Usage

Installation

# Requires transformers from source
pip install git+https://github.com/huggingface/transformers
# datasets is only needed for the batch inference example
pip install pillow torch datasets

Python Usage

import torch
from transformers import LightOnOcrForConditionalGeneration, LightOnOcrProcessor
from PIL import Image

# Load model and processor
model_id = "johnlockejrr/LightOnOCR-2-1B-base-samaritan"
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16 if device == "cuda" else torch.float32

processor = LightOnOcrProcessor.from_pretrained(model_id)
model = LightOnOcrForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=dtype,
).to(device)

# Load your line image
image = Image.open("your_line_image.jpg").convert("RGB")

# Prepare input
messages = [{"role": "user", "content": [{"type": "image"}]}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = processor(
    text=[text],
    images=[[image]],
    return_tensors="pt",
    padding=True,
    size={"longest_edge": 1024},
).to(device)
inputs["pixel_values"] = inputs["pixel_values"].to(dtype)

# Generate transcription
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=4028, do_sample=False)

# Decode output
input_length = inputs["input_ids"].shape[1]
generated_ids = outputs[0, input_length:]
transcription = processor.decode(generated_ids, skip_special_tokens=True)

print(transcription)
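
The size={"longest_edge": 1024} argument caps the longer side of the input image. As a rough illustration of the arithmetic only, the sketch below computes the dimensions an image would have after such a resize, assuming the aspect ratio is preserved and smaller images are left untouched. Both assumptions and the helper name fit_longest_edge are hypothetical; the processor's actual rounding and upscaling behaviour may differ.

```python
def fit_longest_edge(width, height, longest_edge=1024):
    """Return (width, height) scaled so the longer side is at most longest_edge.

    Assumption: aspect ratio is preserved and images are never upscaled.
    """
    longest = max(width, height)
    if longest <= longest_edge:
        return width, height
    scale = longest_edge / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


print(fit_longest_edge(3000, 2000))  # (1024, 683)
print(fit_longest_edge(800, 200))    # (800, 200), already small enough
```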

Batch Inference

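from datasets import load_dataset

# Reuses `processor`, `model`, `device` and `dtype` from the example above.
# Load a few samples from the training dataset listed under Model Description
dataset = load_dataset("samaritan-ai/samaritan_hebrew_LightOnOcr", split="train[:10]")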

# Process batch
images = [[img.convert("RGB")] for img in dataset["image"]]
messages = [{"role": "user", "content": [{"type": "image"}]}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
texts = [text] * len(images)

inputs = processor(
    text=texts,
    images=images,
    return_tensors="pt",
    padding=True,
    size={"longest_edge": 1024},
).to(device)
inputs["pixel_values"] = inputs["pixel_values"].to(dtype)

# Disable gradient tracking during inference, as in the single-image example
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=4028, do_sample=False)
predictions = processor.batch_decode(outputs[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)

for pred, gt in zip(predictions, dataset["text"]):
    print(f"Prediction: {pred}")
    print(f"Ground Truth: {gt}")
    print()
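
For larger evaluation sets it can help to run generation in small batches and to count exact matches afterwards. The helpers below are a minimal sketch with hypothetical names (chunked, exact_match_rate). They assume that a "perfect match" means string equality after trimming surrounding whitespace, which may not be the exact criterion used for the table above.

```python
def chunked(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def exact_match_rate(predictions, references):
    """Return (number of exact matches, fraction of exact matches)."""
    matches = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return matches, matches / len(references)


# Example with placeholder strings instead of real model output
for batch in chunked(["img0", "img1", "img2", "img3", "img4"], batch_size=2):
    print(batch)

print(exact_match_rate(["a", "b "], ["a", "c"]))  # (1, 0.5)
```

Each batch yielded by chunked can be passed through the processor and model.generate calls shown above, with the decoded predictions collected into one list before scoring.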

Training Details

Base Model: lightonai/LightOnOCR-2-1B-base
Training Method: Fine-tuning with frozen language model backbone
Optimizer: AdamW (fused)
Learning Rate: 6e-5 with linear decay
Precision: bfloat16
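
To make "6e-5 with linear decay" concrete, the sketch below computes the learning rate at a given step. It assumes no warmup and a decay to zero at the final step; neither detail is stated above, and total_steps is a hypothetical value chosen only for illustration.

```python
def linear_decay_lr(step, total_steps, peak_lr=6e-5):
    """Learning rate that falls linearly from peak_lr to 0 over total_steps.

    Assumption: no warmup phase and a final learning rate of zero.
    """
    remaining = max(0.0, 1.0 - step / total_steps)
    return peak_lr * remaining


for step in (0, 250, 500, 1000):
    print(step, linear_decay_lr(step, total_steps=1000))
```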

Limitations

This model is trained on line-level images only. For full-page transcription, you need to first segment the page into individual lines.
Performance may vary on manuscript styles not represented in the training data.
Samaritan manuscripts may contain abbreviations and special characters that require domain-specific post-processing.
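
As one example of post-processing, the ground-truth transcriptions shown above use unpointed letters, whereas the base model sometimes emitted vowel points. The sketch below removes combining marks and normalises whitespace. It assumes that target transcriptions never contain combining marks, which should be checked against your own data; the function name clean_transcription is illustrative.

```python
import unicodedata


def clean_transcription(text):
    """Strip combining marks (e.g. vowel points) and collapse whitespace."""
    decomposed = unicodedata.normalize("NFD", text)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_marks.split())


# He + patah, nun + dagesh + sheva, then extra spaces and an unpointed word
pointed = "\u05d4\u05b7\u05e0\u05bc\u05b0   \u05d9\u05d4\u05d5\u05d4 "
print(clean_transcription(pointed) == "\u05d4\u05e0 \u05d9\u05d4\u05d5\u05d4")  # True
```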

Citation

If you use this model, please cite:

@misc{lightonocr2_smp_2026,
  title = {LightOnOCR Fine-tuned for Samaritan Hebrew/Aramaic},
  author = {John Locke},
  year = {2026},
  howpublished = {\url{https://huggingface.co/johnlockejrr/LightOnOCR-2-1B-base-samaritan-pre}}
}

And the original LightOnOCR paper:

@misc{lightonocr2_2026,
  title = {LightOnOCR: A 1B End-to-End Multilingual Vision-Language Model for State-of-the-Art OCR},
  author = {Said Taghadouini and Adrien Cavaill\`{e}s and Baptiste Aubertin},
  year = {2026},
  howpublished = {\url{https://arxiv.org/pdf/2601.14251}}
}

Acknowledgments

LightOn AI for the excellent LightOnOCR base model
