Chunjuan Bo¹·*, Xin Chen², Junxing Zhang¹
CMC-Computers, Materials & Continua, Vol. 80, No. 1, pp. 1105-1122, 2024. DOI: 10.32604/cmc.2024.052329. Published 18 July 2024.
Abstract: Significant advancements have been witnessed in visual tracking applications in recent years, mainly due to the formidable modeling capabilities of the Vision Transformer (ViT). However, the strong performance of such trackers heavily relies on ViT models pretrained for long periods, limiting more flexible model designs for tracking tasks. To address this issue, we propose an efficient unsupervised ViT pretraining method for the tracking task based on masked autoencoders, called TrackMAE. During pretraining, we employ two shared-parameter ViTs, serving as the appearance encoder and motion encoder, respectively. The appearance encoder encodes randomly masked image data, …
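The abstract's core architectural idea is pairing two ViT encoders that share parameters, with the appearance branch operating on randomly masked patches in MAE style. The sketch below illustrates that idea only under generic assumptions; the class and function names (TinyViTEncoder, random_mask), the dimensions, and the frame-difference "motion" input are hypothetical and not taken from the paper.

```python
# Minimal sketch of MAE-style pretraining with two parameter-sharing
# encoders. All names, shapes, and the frame-difference motion signal
# are illustrative assumptions, not details from TrackMAE itself.
import torch
import torch.nn as nn

class TinyViTEncoder(nn.Module):
    """A toy ViT encoder operating on pre-patchified tokens."""
    def __init__(self, patch_dim=768, dim=256, depth=4, heads=4):
        super().__init__()
        self.proj = nn.Linear(patch_dim, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)

    def forward(self, tokens):
        return self.blocks(self.proj(tokens))

def random_mask(tokens, mask_ratio=0.75):
    """Keep a random subset of patch tokens (MAE-style random masking)."""
    B, N, D = tokens.shape
    keep = max(1, int(N * (1 - mask_ratio)))
    idx = torch.rand(B, N).argsort(dim=1)[:, :keep]  # indices of kept patches
    return tokens.gather(1, idx.unsqueeze(-1).expand(-1, -1, D)), idx

# One encoder instance reused for both streams = shared parameters.
encoder = TinyViTEncoder()
appearance_encoder = encoder   # encodes randomly masked image patches
motion_encoder = encoder       # same weights, applied to the motion stream

frames = torch.randn(2, 196, 768)  # (batch, patches, patch_dim), frame t
prev = torch.randn(2, 196, 768)    # frame t-1
visible, _ = random_mask(frames)             # appearance stream input
app_feat = appearance_encoder(visible)
motion_feat = motion_encoder(frames - prev)  # toy motion signal
print(app_feat.shape, motion_feat.shape)
```

Reusing a single module instance for both branches is the simplest way to realize parameter sharing: gradients from the appearance and motion losses both flow into the same weights during pretraining.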