Open Access

ARTICLE


Masked Autoencoders as Single Object Tracking Learners

by Chunjuan Bo1,*, Xin Chen2, Junxing Zhang1

1 School of Information and Communication Engineering, Dalian Minzu University, Dalian, 116600, China
2 School of Information and Communication Engineering, Dalian University of Technology, Dalian, 116024, China

* Corresponding Author: Chunjuan Bo.

(This article belongs to the Special Issue: Recognition Tasks with Transformers)

Computers, Materials & Continua 2024, 80(1), 1105-1122. https://doi.org/10.32604/cmc.2024.052329

Abstract

Visual tracking has seen significant advances in recent years, largely owing to the formidable modeling capability of the Vision Transformer (ViT). However, the strong performance of such trackers relies heavily on ViT models pretrained for long periods, which limits more flexible model designs for tracking tasks. To address this issue, we propose TrackMAE, an efficient unsupervised ViT pretraining method for tracking based on masked autoencoders. During pretraining, we employ two shared-parameter ViTs that serve as an appearance encoder and a motion encoder, respectively. The appearance encoder encodes randomly masked image data, while the motion encoder encodes randomly masked pairs of video frames. An appearance decoder and a motion decoder then separately reconstruct the original image data and video-frame data at the pixel level. In this way, the ViT learns to model both image appearance and inter-frame motion simultaneously. Experimental results demonstrate that ViT-Base and ViT-Large models pretrained with TrackMAE and combined with a simple tracking head achieve state-of-the-art (SOTA) performance without additional design. Moreover, compared with the currently popular MAE pretraining method, TrackMAE consumes only one fifth of the training time, which facilitates customizing diverse models for tracking. For instance, we additionally customize a lightweight ViT-XS, which achieves SOTA efficient-tracking performance.
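To make the scheme described above concrete, the following is a minimal PyTorch-style sketch of two-branch masked-autoencoder pretraining in the spirit of the abstract: one shared-parameter Transformer encoder processes randomly masked patches of a single image (appearance branch) and of a concatenated video-frame pair (motion branch), and two separate lightweight decoders reconstruct pixels, with the loss computed on masked patches as in MAE. All module names, layer sizes, the mask ratio, and the one-layer decoders are illustrative assumptions based only on the abstract; positional embeddings and other details of the authors' implementation are omitted.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TrackMAESketch(nn.Module):
    """Hypothetical sketch: shared ViT-style encoder + two pixel decoders."""

    def __init__(self, patch_dim=768, dim=512, depth=4, heads=8, mask_ratio=0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.embed = nn.Linear(patch_dim, dim)
        enc_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        # One shared-parameter encoder acts as both appearance and motion encoder.
        self.encoder = nn.TransformerEncoder(enc_layer, depth)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        # Separate lightweight decoders reconstruct pixels for each branch
        # (depth and width are assumptions, not the paper's configuration).
        dec_a = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        dec_m = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.appearance_decoder = nn.Sequential(
            nn.TransformerEncoder(dec_a, 1), nn.Linear(dim, patch_dim))
        self.motion_decoder = nn.Sequential(
            nn.TransformerEncoder(dec_m, 1), nn.Linear(dim, patch_dim))

    def _branch(self, patches, decoder):
        b, n, d = patches.shape
        n_keep = max(1, int(n * (1.0 - self.mask_ratio)))
        # Random per-sample permutation of patch indices; first n_keep stay visible.
        perm = torch.rand(b, n, device=patches.device).argsort(dim=1)
        visible_ids = perm[:, :n_keep]
        visible = torch.gather(patches, 1, visible_ids.unsqueeze(-1).expand(-1, -1, d))
        # Encode only the visible patches (positional embeddings omitted for brevity).
        latent = self.encoder(self.embed(visible))
        # Scatter encoded tokens back; fill masked slots with a learned mask token.
        full = self.mask_token.expand(b, n, -1).clone()
        full.scatter_(1, visible_ids.unsqueeze(-1).expand(-1, -1, latent.size(-1)), latent)
        recon = decoder(full)
        # Pixel-level reconstruction loss on the masked patches only, as in MAE.
        mask = torch.ones(b, n, device=patches.device)
        mask.scatter_(1, visible_ids, 0.0)
        per_patch = F.mse_loss(recon, patches, reduction="none").mean(-1)
        return (per_patch * mask).sum() / mask.sum()

    def forward(self, image_patches, frame_pair_patches):
        # Appearance branch: a randomly masked single image.
        # Motion branch: a randomly masked pair of video frames
        # (both frames' patches concatenated along the token axis).
        return (self._branch(image_patches, self.appearance_decoder)
                + self._branch(frame_pair_patches, self.motion_decoder))

# Toy usage: 196 patches per image, 16x16x3 = 768 pixels per flattened patch.
model = TrackMAESketch()
img = torch.randn(2, 196, 768)
frames = torch.randn(2, 392, 768)  # two frames -> 2 * 196 patch tokens
print(model(img, frames))  # scalar reconstruction loss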

Keywords


Cite This Article

APA Style
Bo, C., Chen, X., & Zhang, J. (2024). Masked autoencoders as single object tracking learners. Computers, Materials & Continua, 80(1), 1105-1122. https://doi.org/10.32604/cmc.2024.052329
Vancouver Style
Bo C, Chen X, Zhang J. Masked autoencoders as single object tracking learners. Comput Mater Contin. 2024;80(1):1105-1122. https://doi.org/10.32604/cmc.2024.052329
IEEE Style
C. Bo, X. Chen, and J. Zhang, “Masked Autoencoders as Single Object Tracking Learners,” Comput. Mater. Contin., vol. 80, no. 1, pp. 1105-1122, 2024. https://doi.org/10.32604/cmc.2024.052329



Copyright © 2024 The Author(s). Published by Tech Science Press.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.