arxiv:2502.00937

ModServe: Modality- and Stage-Aware Resource Disaggregation for Scalable Multimodal Model Serving

Published on Feb 2

Authors:

Abstract

ModServe optimizes and scales large multimodal models in production by decoupling inference stages and using modality-aware scheduling to meet latency SLAs efficiently.

AI-generated summary

Large multimodal models (LMMs) demonstrate impressive capabilities in understanding images, videos, and audio beyond text. However, efficiently serving LMMs in production environments poses significant challenges due to their complex architectures and heterogeneous characteristics across their multi-stage inference pipelines. We present the first comprehensive systems analysis of two prominent LMM architectures, decoder-only and cross-attention, across six representative open-source models, revealing key systems design implications. We also present an in-depth analysis of production LMM inference traces, uncovering unique workload characteristics, including variable, heavy-tailed request distributions and bursty traffic patterns. Based on these insights, we propose ModServe, a modular LMM serving system that decouples stages for independent optimization and adaptive scaling. ModServe dynamically reconfigures stages and handles bursty traffic with modality-aware scheduling and autoscaling to meet tail latency SLOs while minimizing costs. ModServe achieves 3.3-5.5x higher throughput (leading to 25-41.3% cost saving) while meeting SLOs on a 128-GPU cluster with production traces.

View arXiv page View PDF Add to collection

Community

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2502.00937 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2502.00937 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2502.00937 in a Space README.md to link it from this page.