Open Access

ARTICLE


Convolution-Transformer for Image Feature Extraction

Lirong Yin1, Lei Wang1, Siyu Lu2,*, Ruiyang Wang2, Youshuai Yang2, Bo Yang2, Shan Liu2, Ahmed AlSanad3, Salman A. AlQahtani3, Zhengtong Yin4, Xiaolu Li5, Xiaobing Chen6, Wenfeng Zheng3,*

1 Department of Geography and Anthropology, Louisiana State University, Baton Rouge, LA, 70803, USA
2 School of Automation, University of Electronic Science and Technology of China, Chengdu, 610054, China
3 College of Computer and Information Sciences, King Saud University, Riyadh, 11574, Saudi Arabia
4 College of Resources and Environmental Engineering, Key Laboratory of Karst Georesources and Environment (Guizhou University), Ministry of Education, Guiyang, 550025, China
5 School of Geographical Sciences, Southwest University, Chongqing, 400715, China
6 School of Electrical and Computer Engineering, Louisiana State University, Baton Rouge, LA, 70803, USA

* Corresponding Authors: Siyu Lu. Email: email; Wenfeng Zheng. Email: email

Computer Modeling in Engineering & Sciences 2024, 141(1), 87-106. https://doi.org/10.32604/cmes.2024.051083

Abstract

This study addresses the limitations of Transformer models in image feature extraction, particularly their lack of inductive bias for visual structures. Compared to Convolutional Neural Networks (CNNs), Transformers are more sensitive to optimizer hyperparameters, which leads to instability and slow convergence. To tackle these challenges, we propose the Convolution-based Efficient Transformer Image Feature Extraction Network (CEFormer) as an enhancement of the Transformer architecture. Our model incorporates E-Attention, depthwise separable convolution, and dilated convolution to introduce crucial inductive biases, such as translation invariance, locality, and scale invariance, into the Transformer framework. Additionally, we implement a lightweight convolution module to process the input images, yielding faster convergence and improved stability. The result is an efficient image feature extraction network that combines convolution with the Transformer. Experimental results on the ImageNet-1k dataset demonstrate that the proposed network achieves better Top-1 accuracy while maintaining high computational speed, reaching up to 85.0% accuracy on image classification across various model sizes and outperforming various baseline models. When integrated into the Mask Region-based Convolutional Neural Network (Mask R-CNN) framework as a backbone network, CEFormer outperforms other models and achieves the highest mean Average Precision (mAP) scores. This research presents a significant advancement in Transformer-based image feature extraction, balancing performance and computational efficiency.
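Two of the convolutional ingredients named in the abstract, depthwise separable convolution and dilated convolution, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation; the function and parameter names are hypothetical, and the sketch uses "valid" padding for simplicity. A depthwise separable convolution factors a full convolution into a per-channel spatial step plus a 1x1 channel-mixing step (C·k² + C_out·C weights instead of C_out·C·k²), while the dilation parameter spaces out the kernel taps to enlarge the receptive field at no extra cost:

```python
import numpy as np

def depthwise_separable_conv(x, depth_k, point_w, dilation=1):
    """Depthwise separable convolution on a (C, H, W) input, valid padding.

    x        : (C, H, W) input feature map
    depth_k  : (C, k, k) one spatial kernel per input channel (depthwise step)
    point_w  : (C_out, C) 1x1 kernels mixing channels (pointwise step)
    dilation : spacing between kernel taps (dilated convolution)
    """
    c, h, w = x.shape
    k = depth_k.shape[1]
    span = (k - 1) * dilation          # receptive-field extent minus one
    oh, ow = h - span, w - span        # output spatial size (valid padding)
    mid = np.zeros((c, oh, ow))
    # Depthwise step: each channel is convolved only with its own kernel.
    for ch in range(c):
        for i in range(oh):
            for j in range(ow):
                patch = x[ch, i:i + span + 1:dilation, j:j + span + 1:dilation]
                mid[ch, i, j] = np.sum(patch * depth_k[ch])
    # Pointwise step: a 1x1 convolution mixes information across channels.
    return np.einsum('oc,chw->ohw', point_w, mid)

x = np.random.rand(3, 8, 8)            # 3 channels, 8x8 spatial
dk = np.random.rand(3, 3, 3)           # one 3x3 kernel per channel
pw = np.random.rand(8, 3)              # project 3 channels to 8
out = depthwise_separable_conv(x, dk, pw, dilation=2)
```

With `dilation=2`, a 3x3 kernel covers a 5x5 window, so the output here is shaped (8, 4, 4); with `dilation=1` it would be (8, 6, 6).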

Keywords


Cite This Article

APA Style
Yin, L., Wang, L., Lu, S., Wang, R., Yang, Y. et al. (2024). Convolution-transformer for image feature extraction. Computer Modeling in Engineering & Sciences, 141(1), 87-106. https://doi.org/10.32604/cmes.2024.051083
Vancouver Style
Yin L, Wang L, Lu S, Wang R, Yang Y, Yang B, et al. Convolution-transformer for image feature extraction. Comput Model Eng Sci. 2024;141(1):87-106. https://doi.org/10.32604/cmes.2024.051083
IEEE Style
L. Yin et al., “Convolution-Transformer for Image Feature Extraction,” Comput. Model. Eng. Sci., vol. 141, no. 1, pp. 87-106, 2024. https://doi.org/10.32604/cmes.2024.051083



Copyright © 2024 The Author(s). Published by Tech Science Press.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.