
Open Access

ARTICLE

Convolution-Transformer for Image Feature Extraction

Lirong Yin1, Lei Wang1, Siyu Lu2,*, Ruiyang Wang2, Youshuai Yang2, Bo Yang2, Shan Liu2, Ahmed AlSanad3, Salman A. AlQahtani3, Zhengtong Yin4, Xiaolu Li5, Xiaobing Chen6, Wenfeng Zheng3,*
1 Department of Geography and Anthropology, Louisiana State University, Baton Rouge, LA, 70803, USA
2 School of Automation, University of Electronic Science and Technology of China, Chengdu, 610054, China
3 College of Computer and Information Sciences, King Saud University, Riyadh, 11574, Saudi Arabia
4 College of Resources and Environmental Engineering, Key Laboratory of Karst Georesources and Environment (Guizhou University), Ministry of Education, Guiyang, 550025, China
5 School of Geographical Sciences, Southwest University, Chongqing, 400715, China
6 School of Electrical and Computer Engineering, Louisiana State University, Baton Rouge, LA, 70803, USA
* Corresponding Author: Siyu Lu. Email: email; Wenfeng Zheng. Email: email

Computer Modeling in Engineering & Sciences https://doi.org/10.32604/cmes.2024.051083

Received 27 February 2024; Accepted 16 May 2024; Published online 14 June 2024

Abstract

This study addresses the limitations of Transformer models in image feature extraction, particularly their lack of inductive bias for visual structures. Compared to Convolutional Neural Networks (CNNs), Transformers are more sensitive to optimizer hyperparameters, which leads to instability and slow convergence. To tackle these challenges, we propose the Convolution-based Efficient Transformer Image Feature Extraction Network (CEFormer) as an enhancement of the Transformer architecture. Our model incorporates E-Attention, depthwise separable convolution, and dilated convolution to introduce crucial inductive biases, such as translation invariance, locality, and scale invariance, into the Transformer framework. Additionally, we implement a lightweight convolution module to process the input images, resulting in faster convergence and improved stability. The result is an efficient image feature extraction network that combines convolution with the Transformer. Experimental results on the ImageNet-1k dataset demonstrate that the proposed network achieves better accuracy while maintaining high computational speed. It achieves up to 85.0% Top-1 accuracy across various model sizes on image classification, outperforming various baseline models. When integrated into the Mask Region-Convolutional Neural Network (R-CNN) framework as a backbone network, CEFormer outperforms other models and achieves the highest mean Average Precision (mAP) scores. This research presents a significant advancement in Transformer-based image feature extraction, balancing performance and computational efficiency.
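The efficiency argument behind the two convolution variants named in the abstract can be made concrete with standard parameter-count and receptive-field formulas. The sketch below is generic, not the paper's exact CEFormer configuration: the channel sizes (64, 128) and dilation rate (2) are illustrative assumptions, and bias terms are omitted.

```python
def conv_params(c_in: int, c_out: int, k: int) -> int:
    """Parameter count of a standard k x k convolution (bias omitted)."""
    return c_in * c_out * k * k

def depthwise_separable_params(c_in: int, c_out: int, k: int) -> int:
    """Depthwise k x k conv (one filter per input channel)
    followed by a 1x1 pointwise conv mixing channels."""
    return c_in * k * k + c_in * c_out

def dilated_kernel_extent(k: int, d: int) -> int:
    """Effective spatial extent of a k x k kernel with dilation rate d:
    gaps of (d - 1) zeros are inserted between kernel taps, enlarging
    the receptive field without adding parameters."""
    return k + (k - 1) * (d - 1)

if __name__ == "__main__":
    std = conv_params(64, 128, 3)
    sep = depthwise_separable_params(64, 128, 3)
    print(f"standard 3x3 conv: {std} params")          # 73728
    print(f"depthwise separable: {sep} params")        # 8768
    print(f"reduction: {std / sep:.1f}x")
    e = dilated_kernel_extent(3, 2)
    print(f"3x3 kernel at dilation 2 spans a {e}x{e} window")  # 5x5
```

For these illustrative sizes, the separable factorization uses roughly 8x fewer parameters than a dense 3x3 convolution, while dilation widens the receptive field (a scale-related bias) at zero parameter cost, which is consistent with the abstract's claim of balancing accuracy and computational speed.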

Keywords

Transformer; E-Attention; depthwise separable convolution; dilated convolution; CEFormer