Open Access
REVIEW
A Comprehensive Survey of Recent Transformers in Image, Video and Diffusion Models
1 College of Computer Science and Electronic Engineering, Hunan University, Changsha, 410082, China
2 Faculty of Information Technology, Yersin University of Da Lat, Da Lat, 66100, Vietnam
3 Faculty of Information Technology, Ho Chi Minh City Open University, Ho Chi Minh City, 722000, Vietnam
* Corresponding Author: Viet-Tuan Le. Email:
Computers, Materials & Continua 2024, 80(1), 37-60. https://doi.org/10.32604/cmc.2024.050790
Received 17 February 2024; Accepted 17 May 2024; Issue published 18 July 2024
Abstract
Transformer models have emerged as the dominant networks for many computer vision tasks, surpassing Convolutional Neural Networks (CNNs). Transformers model long-range dependencies through a self-attention mechanism. This study aims to provide a comprehensive survey of recent transformer-based approaches in image and video applications, as well as diffusion models. We begin by discussing existing surveys of vision transformers and comparing them to this work. We then review the main components of a vanilla transformer network, including the self-attention mechanism, the feed-forward network, and position encoding. In the main part of this survey, we review recent transformer-based models in three categories: transformers for downstream tasks, vision transformers for generation, and vision transformers for segmentation. We also provide a comprehensive overview of recent transformer models for video tasks and of diffusion models. We compare the performance of various hierarchical transformer networks on multiple tasks over popular benchmark datasets. Finally, we explore future research directions to further advance the field.
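To make the self-attention mechanism mentioned above concrete, the sketch below implements single-head scaled dot-product self-attention, the core operation of the vanilla transformer reviewed in this survey. This is a minimal illustration, not code from any surveyed model; the PyTorch framing, the function name self_attention, and all tensor sizes are assumptions chosen for clarity.

```python
# Minimal sketch of single-head scaled dot-product self-attention.
# Names and shapes (batch, num_tokens, embed_dim) are illustrative assumptions.
import torch
import torch.nn.functional as F

def self_attention(x: torch.Tensor, w_q: torch.Tensor,
                   w_k: torch.Tensor, w_v: torch.Tensor) -> torch.Tensor:
    """x: (batch, num_tokens, embed_dim); w_*: (embed_dim, embed_dim) projections."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v             # project tokens to queries/keys/values
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # pairwise token affinities, scaled
    weights = F.softmax(scores, dim=-1)             # attention distribution over tokens
    return weights @ v                              # weighted sum: every token attends to all others

# Usage with hypothetical sizes: 4 images, 16 patch tokens, 64-dim embeddings.
x = torch.randn(4, 16, 64)
w_q, w_k, w_v = (torch.randn(64, 64) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)  # shape: (4, 16, 64)
```

Because every token attends to every other token in one step, this operation captures the long-range dependencies that the abstract contrasts with the local receptive fields of CNNs.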
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.