
LEGATO-Small

LEGATO: Large-Scale End-to-End Generalizable Approach to Typeset OMR. This is the small variant of LEGATO, primarily for baseline comparisons and research purposes.

🔗 Try it: Interactive Demo | Leaderboard

⚠️ Important: This model must be used with the legato codebase. It cannot be loaded with standard Transformers pipelines alone due to the custom LegatoModel architecture.

Note: For production use, we recommend the full legato model instead. This small variant is less efficient and provided mainly for baseline comparisons.

Model Details

  • Developed by: Guang Yang, Victoria Ebert, Nazif Tamer, Brian Siyuan Zheng, Luiza Pozzobon, Noah A. Smith
  • Model type: Vision-language model for end-to-end OMR
  • Architecture: Based on Llama 3.2 11B Vision (Mllama). Uses a frozen vision encoder and a smaller trained text decoder that outputs ABC notation.
  • License: MIT
  • Paper: LEGATO: Large-Scale End-to-End Generalizable Approach to Typeset OMR

How to Use

Installation

  1. Clone the repository and install dependencies:
git clone https://github.com/guang-yng/legato.git
cd legato
pip install -r requirements.txt

Tested with Python 3.12 and CUDA 12.4.

  2. Access requirements: This model loads the vision encoder from meta-llama/Llama-3.2-11B-Vision. Ensure you have accepted the Llama 3.2 license on Hugging Face.
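
If your environment is not already authenticated, log in once so the gated Llama 3.2 weights can be downloaded. A minimal sketch (the token value is a placeholder; running `huggingface-cli login` in a shell works equally well):

from huggingface_hub import login

# Authenticate to Hugging Face so the gated meta-llama/Llama-3.2-11B-Vision vision encoder can be fetched
login(token="hf_...")  # replace with your own access token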

Inference

import torch
from PIL import Image
from transformers import AutoProcessor, GenerationConfig
from legato.models import LegatoModel

# Load model and processor
model = LegatoModel.from_pretrained("guangyangmusic/legato-small")
processor = AutoProcessor.from_pretrained("guangyangmusic/legato-small")

# Move to GPU
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# Load and process image
image = Image.open("path/to/sheet_music.png").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
inputs = {k: v.to(device) for k, v in inputs.items()}

# Generate ABC notation
generation_config = GenerationConfig(
    max_length=2048,
    num_beams=10,
    repetition_penalty=1.1
)

with torch.no_grad():
    outputs = model.generate(**inputs, generation_config=generation_config)

# Decode output
abc_notation = processor.batch_decode(outputs, skip_special_tokens=True)[0]
print(abc_notation)

Half-Precision Inference (Reduced Memory)

model = LegatoModel.from_pretrained("guangyangmusic/legato-small")
model = model.to("cuda").half()  # Use FP16
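
Depending on how the processor returns tensors, you may also need to cast the floating-point inputs (the image pixel values) to FP16 so they match the model's dtype; a minimal sketch reusing `processor` and `image` from the example above:

inputs = processor(images=image, return_tensors="pt")
# Cast floating-point tensors to FP16 to match the half-precision model; leave integer tensors unchanged
inputs = {k: (v.half() if v.is_floating_point() else v).to("cuda") for k, v in inputs.items()}

with torch.no_grad():
    outputs = model.generate(**inputs, generation_config=generation_config)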

Batch Inference

images = [Image.open(p).convert("RGB") for p in image_paths]
inputs = processor(images=images, return_tensors="pt")
inputs = {k: v.to(device) for k, v in inputs.items()}

with torch.no_grad():
    outputs = model.generate(**inputs, generation_config=generation_config)

abc_outputs = processor.batch_decode(outputs, skip_special_tokens=True)
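
A small usage sketch for the decoded batch: write each transcription next to its source image with an .abc extension (the naming scheme here is just illustrative):

import os

for path, abc in zip(image_paths, abc_outputs):
    out_path = os.path.splitext(path)[0] + ".abc"
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(abc)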

Output

The model outputs ABC notation transcriptions. ABC can be converted to MusicXML using the conversion utilities in the legato codebase.
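
The exact conversion utilities ship with the legato repository; purely as an illustration (not the legato tooling), the third-party music21 library can perform a similar ABC-to-MusicXML conversion:

from music21 import converter

# Parse the generated ABC string and export it as MusicXML (illustrative; see the legato repo for the official utilities)
score = converter.parse(abc_notation, format="abc")
score.write("musicxml", fp="sheet_music.musicxml")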

Intended Use

  • Primary use: Research and baseline comparisons for Optical Music Recognition
  • Audience: Researchers in music information retrieval and machine learning
  • Recommended alternative: Use guangyangmusic/legato for better performance and efficiency
  • Out-of-scope: Handwritten notation, audio-to-score transcription, production deployments

Limitations

  • Less efficient than the full legato model despite smaller size
  • Trained primarily on synthetic typeset data; performance may degrade on handwritten scores, low-quality scans, or unusual layouts
  • Requires significant GPU memory (~15GB+ for full precision; use --fp16 for lower memory)
  • Depends on access to meta-llama/Llama-3.2-11B-Vision for the vision encoder
  • Maximum generation length: 2048 tokens (default)

Training

Trained on PDMX-Synth with DeepSpeed ZeRO-2. For training and validation instructions, see the legato repository.
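
For orientation, ZeRO stage 2 shards optimizer state and gradients across data-parallel workers. The values below are illustrative placeholders, not the configuration used to train LEGATO; see the repository for the real settings. With the Transformers Trainer, such a config can be passed as a dict or a JSON file:

# Illustrative ZeRO-2 config only; the actual LEGATO training configuration lives in the legato repository
ds_config = {
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,                 # ZeRO-2: shard optimizer state and gradients
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
}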

Evaluation

The codebase provides evaluation scripts for:

  • TEDn – Tree Edit Distance on MusicXML
  • OMR-NED – Normalized Edit Distance via musicdiff

See the README for evaluation commands.
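
For intuition only, a normalized edit distance divides the number of edit operations by the length of the longer sequence; the toy sketch below operates on plain token lists, whereas OMR-NED itself is computed over scores via musicdiff:

def normalized_edit_distance(ref, hyp):
    """Levenshtein distance divided by the longer sequence length (toy illustration, not OMR-NED)."""
    m, n = len(ref), len(hyp)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1, dp[i - 1][j - 1] + cost)
    return dp[m][n] / max(m, n, 1)

print(normalized_edit_distance("C D E F".split(), "C D F F".split()))  # 0.25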

Citation

@misc{yang2025legatolargescaleendtoendgeneralizable,
      title={LEGATO: Large-scale End-to-end Generalizable Approach to Typeset OMR}, 
      author={Guang Yang and Victoria Ebert and Nazif Tamer and Brian Siyuan Zheng and Luiza Pozzobon and Noah A. Smith},
      year={2025},
      eprint={2506.19065},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2506.19065}, 
}

Related Models

  • legato (guangyangmusic/legato): recommended for production use
  • legato-small (guangyangmusic/legato-small, this model): for baseline comparisons