arxiv:2511.16618

SAM2S: Segment Anything in Surgical Videos via Semantic Long-term Tracking

Published on Nov 20 · Submitted by Haofeng Liu on Nov 21

Abstract

AI-generated summary: SAM2S, a surgical video segmentation model, enhances interactive video object segmentation through robust memory, temporal learning, and ambiguity handling, achieving high performance and real-time inference on a comprehensive surgical benchmark.

Surgical video segmentation is crucial for computer-assisted surgery, enabling precise localization and tracking of instruments and tissues. Interactive Video Object Segmentation (iVOS) models such as Segment Anything Model 2 (SAM2) provide prompt-based flexibility beyond methods with predefined categories, but face challenges in surgical scenarios due to the domain gap and limited long-term tracking. To address these limitations, we construct SA-SV, the largest surgical iVOS benchmark with instance-level spatio-temporal annotations (masklets), spanning eight procedure types (61k frames, 1.6k masklets) and enabling comprehensive development and evaluation of long-term tracking and zero-shot generalization. Building on SA-SV, we propose SAM2S, a foundation model enhancing SAM2 for Surgical iVOS through: (1) DiveMem, a trainable diverse memory mechanism for robust long-term tracking; (2) temporal semantic learning for instrument understanding; and (3) ambiguity-resilient learning to mitigate annotation inconsistencies across multi-source datasets. Extensive experiments demonstrate that fine-tuning on SA-SV enables substantial performance gains, with fine-tuned SAM2 improving by 12.99 average J&F over the vanilla model. SAM2S further advances performance to 80.42 average J&F, surpassing vanilla and fine-tuned SAM2 by 17.10 and 4.11 points respectively, while maintaining 68 FPS real-time inference and strong zero-shot generalization. Code and dataset will be released at https://jinlab-imvr.github.io/SAM2S.
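
The results above are reported in average J&F, the standard VOS metric that averages region similarity (J, mask IoU) with contour accuracy (F, boundary F-measure). The snippet below is a minimal, simplified sketch of how a per-masklet J&F can be computed with NumPy/SciPy; it is illustrative only and is cruder than the official DAVIS-style evaluation code used in papers (for example, it uses a fixed pixel tolerance for boundary matching rather than one relative to the image diagonal).

```python
import numpy as np
from scipy.ndimage import binary_dilation, binary_erosion

def jaccard(pred: np.ndarray, gt: np.ndarray) -> float:
    """Region similarity J: intersection-over-union of two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:                      # both masks empty: count as a perfect match
        return 1.0
    return float(np.logical_and(pred, gt).sum() / union)

def boundary_f(pred: np.ndarray, gt: np.ndarray, tol: int = 2) -> float:
    """Contour accuracy F: F-measure between mask boundaries within a small
    pixel tolerance (simplified relative to the official DAVIS evaluation)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    pb = pred & ~binary_erosion(pred)   # one-pixel-wide boundary of prediction
    gb = gt & ~binary_erosion(gt)       # one-pixel-wide boundary of ground truth
    if pb.sum() == 0 and gb.sum() == 0:
        return 1.0
    struct = np.ones((2 * tol + 1, 2 * tol + 1), dtype=bool)
    precision = (pb & binary_dilation(gb, struct)).sum() / max(pb.sum(), 1)
    recall = (gb & binary_dilation(pb, struct)).sum() / max(gb.sum(), 1)
    if precision + recall == 0:
        return 0.0
    return float(2 * precision * recall / (precision + recall))

def average_j_and_f(pred_masklet, gt_masklet) -> float:
    """Mean of per-frame (J + F) / 2 over one tracked object (a masklet)."""
    scores = [(jaccard(p, g) + boundary_f(p, g)) / 2.0
              for p, g in zip(pred_masklet, gt_masklet)]
    return float(np.mean(scores))
```

A benchmark-level score such as the 80.42 reported here is then the mean of these per-masklet values across all annotated objects and videos.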

Community

Haofeng Liu (paper author, paper submitter)

We are excited to share our latest work: "SAM2S: Segment Anything in Surgical Videos via Semantic Long-term Tracking". https://arxiv.org/abs/2511.16618

We introduce SAM2S, enhancing SAM2 for surgical video segmentation through semantic long-term tracking and domain-specific adaptations.

  • achieves 80.42 J&F (+17.10 over vanilla SAM2) at 68 FPS for real-time surgical applications;
  • presents SA-SV benchmark: 61K frames, 1.6K masklets across 8 surgical procedure types;
  • enables robust long-term tracking in extended surgical videos (up to 30 minutes).

#SurgicalAI #SurgicalDataScience #SAM #SAM2 #VideoObjectSegmentation
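
Since SAM2S extends SAM2's interactive video predictor and its code has not been released yet, the sketch below only illustrates what prompt-based masklet tracking on a surgical clip could look like using the public SAM2 video-predictor API from the sam2 repository. The SAM2S checkpoint path, config name, frame directory, and click coordinates are hypothetical placeholders, not the authors' released interface.

```python
import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor  # public SAM2 package

# Hypothetical paths: SAM2S weights are not yet released, so these are placeholders.
checkpoint = "./checkpoints/sam2s_hiera_large.pt"
model_cfg = "configs/sam2.1/sam2.1_hiera_l.yaml"
predictor = build_sam2_video_predictor(model_cfg, checkpoint)

# Requires a CUDA device; bfloat16 autocast mirrors the SAM2 reference usage.
with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    # Index a surgical clip stored as a directory of JPEG frames (SAM2 convention).
    state = predictor.init_state(video_path="./surgical_clip_frames")

    # Prompt one instrument with a single positive click on the first frame.
    predictor.add_new_points_or_box(
        inference_state=state,
        frame_idx=0,
        obj_id=1,
        points=np.array([[480, 270]], dtype=np.float32),  # (x, y) on the instrument
        labels=np.array([1], dtype=np.int32),              # 1 = positive click
    )

    # Propagate the prompt through the video to obtain a long-term masklet.
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        masks = (mask_logits > 0.0).cpu().numpy()
        # ... overlay or save per-frame instrument masks here ...
```

If the released SAM2S keeps this interface, the main inference-time difference would be the improved long-term memory (DiveMem) and semantic handling operating behind the same prompt-and-propagate loop.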

