CascadeFormer: A Family of Two-stage Cascading Transformers for Skeleton-based Human Action Recognition
Abstract
CascadeFormer, a two-stage cascading transformer framework, achieves competitive performance in skeleton-based human action recognition through masked pretraining and cascading fine-tuning.
Skeleton-based human action recognition leverages sequences of human joint coordinates to identify actions performed in videos. Owing to the intrinsic spatiotemporal structure of skeleton data, Graph Convolutional Networks (GCNs) have been the dominant architecture in this field. However, recent advances in transformer models and masked pretraining frameworks open new avenues for representation learning. In this work, we propose CascadeFormer, a family of two-stage cascading transformers for skeleton-based human action recognition. Our framework consists of a masked pretraining stage to learn generalizable skeleton representations, followed by a cascading fine-tuning stage tailored for discriminative action classification. We evaluate CascadeFormer across three benchmark datasets (Penn Action, N-UCLA, and NTU RGB+D 60), achieving competitive performance on all tasks. To promote reproducibility, we release our code and model checkpoints.
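To make the two-stage recipe described in the abstract concrete, the sketch below illustrates the general pattern in PyTorch: a transformer encoder is first pretrained by reconstructing masked skeleton frames, and the same encoder is then reused with a classification head for fine-tuning. This is a minimal illustration under stated assumptions, not the authors' implementation: frame-level masking, the mask ratio, the model dimensions, and the simple mean-pooled classifier are all illustrative choices, and the actual cascading fine-tuning in CascadeFormer presumably combines features across stages in a way this placeholder does not capture.

```python
import torch
import torch.nn as nn

class SkeletonTransformer(nn.Module):
    """Transformer encoder over a skeleton sequence of T frames,
    each frame flattened to J joints x 3 coordinates.
    (Hypothetical module for illustration; dimensions are assumptions.)"""
    def __init__(self, num_joints=25, coord_dim=3, d_model=128, nhead=8, num_layers=4):
        super().__init__()
        self.embed = nn.Linear(num_joints * coord_dim, d_model)  # per-frame token embedding
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.recon_head = nn.Linear(d_model, num_joints * coord_dim)  # pretraining head

    def forward(self, x):                    # x: (B, T, J*3)
        return self.encoder(self.embed(x))   # (B, T, d_model)

def masked_pretrain_step(model, x, mask_ratio=0.4):
    """Stage 1 (sketch): zero out random frames, regress the original coordinates.
    mask_ratio is an assumed hyperparameter, not taken from the paper."""
    B, T, _ = x.shape
    mask = torch.rand(B, T, device=x.device) < mask_ratio      # (B, T) boolean frame mask
    x_masked = x.masked_fill(mask.unsqueeze(-1), 0.0)          # hide masked frames
    recon = model.recon_head(model(x_masked))                  # (B, T, J*3)
    return ((recon - x) ** 2)[mask].mean()                     # loss on masked frames only

class ActionClassifier(nn.Module):
    """Stage 2 (sketch): reuse the pretrained encoder, pool over time, classify.
    A single pooled head is a placeholder for the paper's cascading fine-tuning."""
    def __init__(self, backbone, d_model=128, num_classes=60):
        super().__init__()
        self.backbone = backbone
        self.cls_head = nn.Linear(d_model, num_classes)

    def forward(self, x):
        feats = self.backbone(x)                   # (B, T, d_model)
        return self.cls_head(feats.mean(dim=1))    # temporal average pooling -> logits
```

As a usage example, one would run `masked_pretrain_step` over unlabeled skeleton sequences until the reconstruction loss converges, then wrap the encoder in `ActionClassifier` and train with cross-entropy on labeled clips, optionally with a lower learning rate on the backbone.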