
Open Access

ARTICLE

Masked Autoencoders as Single Object Tracking Learners

Chunjuan Bo1,*, Xin Chen2, Junxing Zhang1
1 School of Information and Communication Engineering, Dalian Minzu University, Dalian, 116600, China
2 School of Information and Communication Engineering, Dalian University of Technology, Dalian, 116024, China
* Corresponding Author: Chunjuan Bo
(This article belongs to the Special Issue: Recognition Tasks with Transformers)

Computers, Materials & Continua https://doi.org/10.32604/cmc.2024.052329

Received 30 March 2024; Accepted 30 May 2024; Published online 26 June 2024

Abstract

Significant advancements have been witnessed in visual tracking in recent years, mainly due to the formidable modeling capabilities of the Vision Transformer (ViT). However, the strong performance of such trackers relies heavily on ViT models pretrained for long periods, which limits more flexible model designs for tracking tasks. To address this issue, we propose TrackMAE, an efficient unsupervised ViT pretraining method for tracking based on masked autoencoders. During pretraining, we employ two parameter-sharing ViTs that serve as an appearance encoder and a motion encoder, respectively. The appearance encoder encodes randomly masked images, while the motion encoder encodes randomly masked pairs of video frames. An appearance decoder and a motion decoder then reconstruct the original image data and video-frame data at the pixel level. In this way, the ViT simultaneously learns both the appearance of images and the motion between video frames. Experimental results demonstrate that ViT-Base and ViT-Large models pretrained with TrackMAE and combined with a simple tracking head achieve state-of-the-art (SOTA) performance without additional design. Moreover, compared with the currently popular MAE pretraining method, TrackMAE consumes only one fifth of the training time, which facilitates customizing diverse models for tracking. For instance, we additionally customize a lightweight ViT-XS, which achieves SOTA efficient-tracking performance.
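The dual-branch pretraining described above can be illustrated with a minimal sketch. The snippet below is not the authors' implementation: it uses identity stand-ins for the shared-parameter encoder and the two decoders (which are learned networks in TrackMAE), and hypothetical helper names (`random_mask`, `pixel_recon_loss`). It only shows the data flow: random patch masking, one appearance branch on a single image, one motion branch on a frame pair through the same encoder, and a pixel-level reconstruction loss on the masked patches of each branch.

```python
import numpy as np

def random_mask(patches, mask_ratio, rng):
    """Randomly split patch indices into kept and masked sets (MAE-style)."""
    n = patches.shape[0]
    n_keep = int(n * (1 - mask_ratio))
    idx = rng.permutation(n)
    return idx[:n_keep], idx[n_keep:]

def pixel_recon_loss(pred, target):
    """Mean-squared error on masked patches (pixel-level reconstruction)."""
    return float(np.mean((pred - target) ** 2))

# Toy stand-ins: the real appearance/motion encoder is one shared ViT,
# and the decoders are small transformer decoders.
def encode(visible_patches):
    return visible_patches

def decode(latent, n_total, keep_idx):
    out = np.zeros((n_total,) + latent.shape[1:])
    out[keep_idx] = latent  # masked positions stay as placeholders
    return out

rng = np.random.default_rng(0)
# 196 flattened 16x16x3 patches, i.e., a 224x224 image (assumed sizes)
image_patches = rng.standard_normal((196, 768))
frame_pair = rng.standard_normal((2 * 196, 768))  # two video frames

# Appearance branch: randomly masked single image
keep_a, mask_a = random_mask(image_patches, 0.75, rng)
rec_a = decode(encode(image_patches[keep_a]), 196, keep_a)
loss_app = pixel_recon_loss(rec_a[mask_a], image_patches[mask_a])

# Motion branch: randomly masked frame pair through the SAME encoder
keep_m, mask_m = random_mask(frame_pair, 0.75, rng)
rec_m = decode(encode(frame_pair[keep_m]), 2 * 196, keep_m)
loss_mot = pixel_recon_loss(rec_m[mask_m], frame_pair[mask_m])

# Joint objective: both reconstructions train the shared encoder
total_loss = loss_app + loss_mot
```

With a 0.75 mask ratio, the encoder sees only a quarter of the patches in either branch, which is what keeps MAE-style pretraining cheap; TrackMAE extends this economy to frame pairs so the same ViT also learns inter-frame motion.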

Keywords

Visual object tracking; vision transformer; masked autoencoder; visual representation learning