SAM2S: Segment Anything in Surgical Videos via Semantic Long-term Tracking
Abstract
SAM2S, a surgical video segmentation model, enhances interactive video object segmentation through robust long-term memory, temporal semantic learning, and ambiguity handling, achieving strong performance and real-time inference on a comprehensive surgical benchmark.
Surgical video segmentation is crucial for computer-assisted surgery, enabling precise localization and tracking of instruments and tissues. Interactive Video Object Segmentation (iVOS) models such as Segment Anything Model 2 (SAM2) provide prompt-based flexibility beyond methods with predefined categories, but face challenges in surgical scenarios due to the domain gap and limited long-term tracking. To address these limitations, we construct SA-SV, the largest surgical iVOS benchmark with instance-level spatio-temporal annotations (masklets) spanning eight procedure types (61k frames, 1.6k masklets), enabling comprehensive development and evaluation for long-term tracking and zero-shot generalization. Building on SA-SV, we propose SAM2S, a foundation model enhancing SAM2 for surgical iVOS through: (1) DiveMem, a trainable diverse memory mechanism for robust long-term tracking; (2) temporal semantic learning for instrument understanding; and (3) ambiguity-resilient learning to mitigate annotation inconsistencies across multi-source datasets. Extensive experiments demonstrate that fine-tuning on SA-SV yields substantial gains, with fine-tuned SAM2 improving by 12.99 average J&F over its vanilla counterpart. SAM2S further advances performance to 80.42 average J&F, surpassing vanilla and fine-tuned SAM2 by 17.10 and 4.11 points, respectively, while maintaining 68 FPS real-time inference and strong zero-shot generalization. Code and dataset will be released at https://jinlab-imvr.github.io/SAM2S.
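To make the diverse-memory idea concrete, below is a minimal, hypothetical Python sketch of a diversity-preserving memory bank for long-term tracking. It is not the paper's DiveMem (which is trainable and integrated into SAM2's memory attention); the class name `DiverseMemoryBank`, the cosine-similarity admission/eviction criterion, and all parameter values are illustrative assumptions, intended only to show why diverse memories help on long videos where a recency-only buffer forgets early appearances.

```python
# Hypothetical sketch of a "diverse memory" selection policy for long-term
# video object tracking. NOT the paper's DiveMem implementation; a toy
# illustration of the idea: instead of keeping only the N most recent
# frames (a FIFO, as in vanilla SAM2), keep a fixed-size bank whose
# entries stay mutually dissimilar, so a long video is covered by a few
# representative appearances of the target.

import torch
import torch.nn.functional as F


class DiverseMemoryBank:
    """Keeps at most `capacity` per-frame embeddings, preferring diversity."""

    def __init__(self, capacity: int = 7):
        self.capacity = capacity
        self.features: list[torch.Tensor] = []  # one (D,) embedding per frame
        self.frame_ids: list[int] = []

    def _max_similarity(self, feat: torch.Tensor) -> torch.Tensor:
        # Highest cosine similarity between the candidate and any stored entry.
        bank = torch.stack(self.features)               # (N, D)
        return F.cosine_similarity(bank, feat.unsqueeze(0)).max()

    def update(self, frame_id: int, feat: torch.Tensor, sim_thresh: float = 0.9):
        if len(self.features) < self.capacity:
            self.features.append(feat)
            self.frame_ids.append(frame_id)
            return
        # Only admit frames that add new appearance information ...
        if self._max_similarity(feat) > sim_thresh:
            return
        # ... and evict the stored entry most redundant with the rest.
        bank = torch.stack(self.features)               # (N, D)
        sims = F.cosine_similarity(bank.unsqueeze(1), bank.unsqueeze(0), dim=-1)
        sims.fill_diagonal_(-1.0)                       # ignore self-similarity
        evict = int(sims.max(dim=1).values.argmax())
        self.features[evict] = feat
        self.frame_ids[evict] = frame_id


# Usage: feed per-frame embeddings from any encoder (random here for demo).
bank = DiverseMemoryBank(capacity=7)
for t in range(1000):
    bank.update(t, torch.randn(256))
print(bank.frame_ids)  # a non-redundant set of frames spread across the video
```

Compared with a recency-only FIFO, a policy like this retains appearance-distinct snapshots from across the whole sequence, which is the property that makes minutes-long surgical videos trackable from a small fixed-size memory.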
Community
We are excited to share our latest work: "SAM2S: Segment Anything in Surgical Videos via Semantic Long-term Tracking". https://arxiv.org/abs/2511.16618
We introduce SAM2S, enhancing SAM2 for surgical video segmentation through semantic long-term tracking and domain-specific adaptations.
- achieves 80.42 J&F (+17.10 over vanilla SAM2) at 68 FPS for real-time surgical applications;
- presents SA-SV benchmark: 61K frames, 1.6K masklets across 8 surgical procedure types;
- enables robust long-term tracking in extended surgical videos (up to 30 minutes).
#SurgicalAI #SurgicalDataScience #SAM #SAM2 #VideoObjectSegmentation
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- LSVOS 2025 Challenge Report: Recent Advances in Complex Video Object Segmentation (2025)
- The 1st Solution for MOSEv1 Challenge on LSVOS 2025: CGFSeg (2025)
- 2nd Place Report of MOSEv2 Challenge 2025: Concept Guided Video Object Segmentation via SeC (2025)
- Class-agnostic 3D Segmentation by Granularity-Consistent Automatic 2D Mask Tracking (2025)
- SVAC: Scaling Is All You Need For Referring Video Object Segmentation (2025)
- SAM 2++: Tracking Anything at Any Granularity (2025)
- Temporal-Guided Visual Foundation Models for Event-Based Vision (2025)