
LEGATO-Small

LEGATO: Large-Scale End-to-End Generalizable Approach to Typeset OMR. This is the small variant of LEGATO, primarily for baseline comparisons and research purposes.

🔗 Try it: Interactive Demo | Leaderboard

⚠️ Important: This model must be used with the legato codebase. It cannot be loaded with standard Transformers pipelines alone due to the custom LegatoModel architecture.

Note: For production use, we recommend the full legato model instead. This small variant is less efficient and provided mainly for baseline comparisons.

Model Details

  • Developed by: Guang Yang, Victoria Ebert, Nazif Tamer, Brian Siyuan Zheng, Luiza Pozzobon, Noah A. Smith
  • Model type: Vision-language model for end-to-end OMR
  • Architecture: Based on Llama 3.2 11B Vision (Mllama). Uses a frozen vision encoder and a smaller trained text decoder that outputs ABC notation.
  • License: MIT
  • Paper: LEGATO: Large-Scale End-to-End Generalizable Approach to Typeset OMR

How to Use

Installation

  1. Clone the repository and install dependencies:
git clone https://github.com/guang-yng/legato.git
cd legato
pip install -r requirements.txt

Tested with Python 3.12 and CUDA 12.4.

  2. Access requirements: This model loads the vision encoder from meta-llama/Llama-3.2-11B-Vision. Ensure you have accepted the Llama 3.2 license on Hugging Face.
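
If your environment is not already authenticated, log in once so the gated Llama 3.2 weights can be downloaded. A minimal sketch (the token value is a placeholder; running `huggingface-cli login` in a shell works equally well):

from huggingface_hub import login

# Authenticate to Hugging Face so the gated meta-llama/Llama-3.2-11B-Vision vision encoder can be fetched
login(token="hf_...")  # replace with your own access token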

Inference

import torch
from PIL import Image
from transformers import AutoProcessor, GenerationConfig
from legato.models import LegatoModel

# Load model and processor
model = LegatoModel.from_pretrained("guangyangmusic/legato-small")
processor = AutoProcessor.from_pretrained("guangyangmusic/legato-small")

# Move to GPU
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# Load and process image
image = Image.open("path/to/sheet_music.png").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
inputs = {k: v.to(device) for k, v in inputs.items()}

# Generate ABC notation
generation_config = GenerationConfig(
    max_length=2048,
    num_beams=10,
    repetition_penalty=1.1
)

with torch.no_grad():
    outputs = model.generate(**inputs, generation_config=generation_config)

# Decode output
abc_notation = processor.batch_decode(outputs, skip_special_tokens=True)[0]
print(abc_notation)

Half-Precision Inference (Reduced Memory)

model = LegatoModel.from_pretrained("guangyangmusic/legato-small")
model = model.to("cuda").half()  # Use FP16
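
Depending on how the processor returns tensors, you may also need to cast the floating-point inputs (the image pixel values) to FP16 so they match the model's dtype; a minimal sketch reusing `processor` and `image` from the example above:

inputs = processor(images=image, return_tensors="pt")
# Cast floating-point tensors to FP16 to match the half-precision model; leave integer tensors unchanged
inputs = {k: (v.half() if v.is_floating_point() else v).to("cuda") for k, v in inputs.items()}

with torch.no_grad():
    outputs = model.generate(**inputs, generation_config=generation_config)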

Batch Inference

images = [Image.open(p).convert("RGB") for p in image_paths]
inputs = processor(images=images, return_tensors="pt")
inputs = {k: v.to(device) for k, v in inputs.items()}

with torch.no_grad():
    outputs = model.generate(**inputs, generation_config=generation_config)

abc_outputs = processor.batch_decode(outputs, skip_special_tokens=True)
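
A small usage sketch for the decoded batch: write each transcription next to its source image with an .abc extension (the naming scheme here is just illustrative):

import os

for path, abc in zip(image_paths, abc_outputs):
    out_path = os.path.splitext(path)[0] + ".abc"
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(abc)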

Output

The model outputs ABC notation transcriptions. ABC can be converted to MusicXML using the conversion utilities in the legato codebase.
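
The exact conversion utilities ship with the legato repository; purely as an illustration (not the legato tooling), the third-party music21 library can perform a similar ABC-to-MusicXML conversion:

from music21 import converter

# Parse the generated ABC string and export it as MusicXML (illustrative; see the legato repo for the official utilities)
score = converter.parse(abc_notation, format="abc")
score.write("musicxml", fp="sheet_music.musicxml")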

Intended Use

  • Primary use: Research and baseline comparisons for Optical Music Recognition
  • Audience: Researchers in music information retrieval and machine learning
  • Recommended alternative: Use guangyangmusic/legato for better performance and efficiency
  • Out-of-scope: Handwritten notation, audio-to-score transcription, production deployments

Limitations

  • Less efficient than the full legato model despite smaller size
  • Trained primarily on synthetic typeset data; performance may degrade on handwritten scores, low-quality scans, or unusual layouts
  • Requires significant GPU memory (~15GB+ for full precision; use --fp16 for lower memory)
  • Depends on access to meta-llama/Llama-3.2-11B-Vision for the vision encoder
  • Maximum generation length: 2048 tokens (default)

Training

Trained on PDMX-Synth with DeepSpeed ZeRO-2. For training and validation instructions, see the legato repository.
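
For orientation, ZeRO stage 2 shards optimizer state and gradients across data-parallel workers. The values below are illustrative placeholders, not the configuration used to train LEGATO; see the repository for the real settings. With the Transformers Trainer, such a config can be passed as a dict or a JSON file:

# Illustrative ZeRO-2 config only; the actual LEGATO training configuration lives in the legato repository
ds_config = {
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,                 # ZeRO-2: shard optimizer state and gradients
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
}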

Evaluation

The codebase provides evaluation scripts for:

  • TEDn – Tree Edit Distance on MusicXML
  • OMR-NED – Normalized Edit Distance via musicdiff

See the README for evaluation commands.
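
For intuition only, a normalized edit distance divides the number of edit operations by the length of the longer sequence; the toy sketch below operates on plain token lists, whereas OMR-NED itself is computed over scores via musicdiff:

def normalized_edit_distance(ref, hyp):
    """Levenshtein distance divided by the longer sequence length (toy illustration, not OMR-NED)."""
    m, n = len(ref), len(hyp)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1, dp[i - 1][j - 1] + cost)
    return dp[m][n] / max(m, n, 1)

print(normalized_edit_distance("C D E F".split(), "C D F F".split()))  # 0.25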

Citation

@misc{yang2025legatolargescaleendtoendgeneralizable,
      title={LEGATO: Large-scale End-to-end Generalizable Approach to Typeset OMR}, 
      author={Guang Yang and Victoria Ebert and Nazif Tamer and Brian Siyuan Zheng and Luiza Pozzobon and Noah A. Smith},
      year={2025},
      eprint={2506.19065},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2506.19065}, 
}

Related Models

  • legato (guangyangmusic/legato): recommended for production use
  • legato-small (guangyangmusic/legato-small, this model): for baseline comparisons