Jieren Cheng1,3, Hua Li2,*, Dengbo Li3, Shuai Hua2, Victor S. Sheng4
1 School of Computer Science and Technology, Hainan University, Haikou, 570228, China
2 School of Cyberspace Security (School of Cryptology), Hainan University, Haikou, 570228, China
3 Hainan Blockchain Technology Engineering Research Center, Hainan University, Haikou, 570228, China
4 Department of Computer Science, Texas Tech University, TX, 79409, USA
* Corresponding Author: Hua Li. Email:
Computers, Materials & Continua 2023, 74(1), 1941-1957. https://doi.org/10.32604/cmc.2023.032757
Received 28 May 2022; Accepted 12 July 2022; Issue published 22 September 2022
Image semantic segmentation is a basic task in computer vision. It can be regarded as a pixel-level classification task that achieves fine-grained reasoning by densely predicting a label for every pixel, so that each pixel is assigned to a specific category. Semantic segmentation provides not only category predictions but also the spatial locations of those classes. In recent years it has been applied ever more widely, and it plays an important role in medical image analysis, autonomous driving, virtual/augmented reality, video surveillance, and three-dimensional reconstruction.
In the development of semantic segmentation, early methods were mostly based on mathematical techniques such as thresholding, k-means clustering, and conditional random fields. Then, following the great success of deep learning in various fields, researchers applied deep learning to semantic segmentation and designed the fully convolutional network (FCN). Since then, convolutional neural networks (CNNs) have swept the field and become the mainstream approach. In the past two years, transformers have become popular in computer vision, and the application of multi-layer perceptron (MLP) techniques in this field has inspired researchers to explore further possibilities for semantic segmentation.
With the rapid emergence of new deep learning-based semantic segmentation methods in recent years, many earlier reviews now show shortcomings. Although previous surveys [9,10] introduced common datasets and the technical details of some classical methods, they did not cover newer technologies (e.g., transformer- and MLP-based methods). To the best of our knowledge, no extensive survey yet covers CNN-based, transformer-based, and MLP-based semantic segmentation methods together.
The goal of this paper is to summarize and classify current deep learning methods for semantic segmentation and to provide a comprehensive reference for scholars and practitioners. Inspired by the work of Zhao et al., this paper compares and analyzes image segmentation work built on the three main neural network architectures in deep learning, and proposes a new classification based on network architecture, shown in Fig. 1. Existing semantic segmentation methods are divided into four categories according to their network architecture: CNN-based architectures, transformer-based architectures, MLP-based architectures, and others.
The key contributions of this paper are as follows. It gives a systematic review of image semantic segmentation methods that covers the latest literature in the field. It describes the deep learning algorithms used in image segmentation and divides them into four categories according to network architecture. It compares and analyzes the advantages and limitations of existing segmentation methods on popular benchmarks. Finally, it identifies trends in deep learning-based semantic segmentation, together with open challenges and future research directions.
The remainder of this survey is organized as follows. Section 2 reviews some of the most popular image segmentation datasets and their characteristics. Section 3 is the main body of the survey. Section 4 summarizes common metrics used to evaluate segmentation models, and then evaluates and analyzes model performance. Section 5 discusses the main future research directions and challenges in image segmentation. Finally, Section 6 concludes the survey.
Many datasets can be used for semantic segmentation. This paper introduces ten representative general image segmentation datasets: PASCAL visual object classes (VOC), Cityscapes, Microsoft common objects in context (COCO), ADE20K, Cambridge-driving labeled video database (CamVid), COCO-Stuff, Indian driving dataset (IDD), Dark Zurich, adverse conditions dataset with correspondences (ACDC), and PartImageNet. According to their purpose, these datasets can be grouped into generic, urban/driving, generic-part, and so on. Although related works [9,10] have described datasets in detail, part of their content is now out of date and they lack recent datasets. Therefore, this section briefly summarizes several image semantic segmentation datasets and provides the characteristics of each one (purpose, number of classes, training/validation/testing splits, and access hyperlink). Tab. 1 gives a summarized view of these datasets; the first five are the more popular datasets and the last five are the most recent. In addition, some segmentation models are trained only on the classes of interest rather than on all classes, so the number in brackets in the class column indicates the number of frequently used classes. This summary is intended to give readers a basic understanding of commonly used semantic segmentation datasets. Readers can follow the corresponding links for a detailed description of each dataset.
In recent years, with the rapid development of neural networks, image semantic segmentation methods based on deep learning have entered a new stage of development. Model architectures for semantic segmentation are becoming more and more diverse, and the basic modules used to build the different architectures are shown in Fig. 2, where PE denotes positional encoding. This section divides segmentation methods into four categories according to their architecture (CNN-based, transformer-based, MLP-based, and others) and introduces the typical segmentation methods built on each architecture in detail.
After a long evolution, the backbone networks used in image semantic segmentation have diversified into different types, such as large-scale classical backbones and their variants, lightweight backbones, encoder-decoder backbones, and multi-branch backbones. The backbone networks are introduced below according to these four types.
Classical backbone and its variants: Some deep networks (e.g., VGGNet, ResNet) have made great contributions to semantic segmentation and laid a solid foundation for subsequent development. Later backbones [24,25] mostly build on the design of earlier ones, making full use of residual connections and grouping to extract more spatial information and improve performance without adding many parameters.
Lightweight backbone: Lightweight backbones are suited to terminal devices that lack computing resources. They have few parameters and fast inference, which makes them suitable for real-time semantic segmentation. The two best-known backbone families are MobileNet and ShuffleNet. MobileNet proposed the depthwise separable convolution as a replacement for the standard convolution. ShuffleNet designed a ShuffleNet unit, which replaces the original pointwise convolution with pointwise group convolution and adds channel shuffle to strengthen the connection between groups. Both greatly reduce computational overhead while maintaining accuracy.
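The savings described above can be made concrete with a minimal sketch. It counts the weights of a standard convolution against a depthwise convolution followed by a pointwise convolution, and shows channel shuffle on a flat list of channel labels. The function names and the example sizes (64 input channels, 128 output channels, 3 x 3 kernels) are illustrative assumptions, not values taken from the MobileNet or ShuffleNet papers, and biases are ignored.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias ignored)."""
    return c_in * c_out * k * k


def depthwise_separable_params(c_in, c_out, k):
    """Weights in a depthwise k x k convolution plus a 1 x 1 pointwise convolution."""
    depthwise = c_in * k * k   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


def channel_shuffle(channels, groups):
    """Interleave channels across groups (reshape to groups x n, transpose, flatten)."""
    n = len(channels) // groups
    assert n * groups == len(channels), "channel count must be divisible by groups"
    return [channels[g * n + i] for i in range(n) for g in range(groups)]


if __name__ == "__main__":
    print(standard_conv_params(64, 128, 3))        # weights of the standard layer
    print(depthwise_separable_params(64, 128, 3))  # weights of the separable layer
    print(channel_shuffle([0, 1, 2, 3, 4, 5], 2))  # channels of the two groups interleaved
```

After the shuffle, each group of the next grouped convolution receives channels that originated in every previous group, which is how information flows between groups.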
Encoder-decoder backbone: An encoder-decoder network architecture [28,29] mainly consists of an encoder and a decoder. The encoder generates downsampled feature maps, and the decoder upsamples the feature maps to match the input resolution. Usually, the input of each encoder layer is also passed through a skip connection to the decoder layer at the same feature map scale to help recover the lost spatial information.
Multi-branch backbone: Dual-branch networks [30–32] generally divide the network into a spatial detail branch and a deep feature branch, and then fuse the information of the two branches to reduce the loss of detail. In addition, Sun et al. proposed the high-resolution network (HRNet), which has a multi-branch structure. They considered that the downsampling in both branches loses part of the spatial information and reduces network performance. Therefore, HRNet starts from a high-resolution subnet and gradually downsamples to form subnets from high to low resolution. The subnets are connected in parallel and continuously exchange information, and all branches are aggregated to directly affect the output.
In CNN-based architectures, deep layers have a strong ability to represent semantic information, while shallow layers contain rich spatial detail. Fully integrating deep semantic information with shallow spatial information can effectively improve network accuracy. In recent years, work on multi-scale information fusion has emerged continually. Some works [34,35] constructed a long skip-connection branch, which connects shallow features to deep features to reduce the loss of spatial information. Huang et al. proposed the feature-aligned pyramid network (FaPN) for dense image prediction. FaPN improved the feature pyramid network with a feature alignment module and a feature selection module, emphasized low-level features with rich spatial detail, and addressed the prediction and classification errors caused by misaligned context information during feature fusion.
Besides fusing deep and shallow information, global semantic information can provide clues about the distribution of segmentation categories, and fusing global and local features can improve the robustness of the model. Zhao et al. developed the pyramid scene parsing network (PSPNet) to better learn the multi-scale context representation of a scene. PSPNet proposed a pyramid pooling module (PPM), which extracts the global information of different sub-regions through four pooling branches with different downsampling scales, then upsamples each branch to restore the resolution and concatenates the results with the original feature map to fully integrate local and global information. Zhang et al. proposed the EncNet model, which uses a context encoding module to capture global semantic information and computes scaling factors for the feature maps from the encoded information to highlight the categories that need emphasis. Finally, EncNet used a semantic encoding loss (SE loss) to force the network to understand the global information.
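The pool, upsample, and concatenate pattern of the pyramid pooling module can be illustrated on a single-channel feature map. This is a simplified sketch under stated assumptions: the bin sizes (1, 2, 3, 6) follow the common PSPNet configuration, nearest-neighbour upsampling stands in for the bilinear interpolation normally used, the per-branch 1 x 1 convolutions are omitted, and the map size must be divisible by every bin size. All function names are hypothetical.

```python
def avg_pool_to_bins(fmap, bins):
    """Average-pool a 2D map (list of rows) into a bins x bins grid."""
    h, w = len(fmap), len(fmap[0])
    assert h % bins == 0 and w % bins == 0, "sketch assumes divisible sizes"
    bh, bw = h // bins, w // bins
    pooled = []
    for by in range(bins):
        row = []
        for bx in range(bins):
            cells = [fmap[y][x]
                     for y in range(by * bh, (by + 1) * bh)
                     for x in range(bx * bw, (bx + 1) * bw)]
            row.append(sum(cells) / len(cells))
        pooled.append(row)
    return pooled


def upsample_nearest(pooled, h, w):
    """Restore an h x w resolution by nearest-neighbour lookup."""
    bins = len(pooled)
    return [[pooled[y * bins // h][x * bins // w] for x in range(w)]
            for y in range(h)]


def pyramid_pooling(fmap, bin_sizes=(1, 2, 3, 6)):
    """Return the original map plus one upsampled pooled map per bin size.

    The returned list plays the role of channel-wise concatenation.
    """
    h, w = len(fmap), len(fmap[0])
    branches = [fmap]
    for b in bin_sizes:
        branches.append(upsample_nearest(avg_pool_to_bins(fmap, b), h, w))
    return branches


if __name__ == "__main__":
    feature_map = [[y * 6 + x for x in range(6)] for y in range(6)]
    stacked = pyramid_pooling(feature_map)
    print(len(stacked))      # original map plus four pooled branches
    print(stacked[1][0][0])  # the 1 x 1 branch holds the global average everywhere
```

The coarsest branch carries one global statistic to every pixel, while the finer branches carry statistics of progressively smaller sub-regions.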
To expand the receptive field and obtain rich semantic information, networks repeatedly apply max pooling and downsampling, which lowers the feature map resolution and loses spatial information. Dilated convolutions enlarge the receptive field without additional computational cost, so they have been very popular in real-time segmentation, and many works build on this technique to improve model performance. Some of the most important works include the DeepLab family proposed by Chen et al. [39,40] and the densely connected atrous spatial pyramid pooling (DenseASPP) proposed by Yang et al. They all use dilated convolution in place of the original downsampling and expand the receptive field to obtain more context information without increasing the number of parameters or the amount of computation.
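A one-dimensional sketch shows why dilation widens the receptive field with the same number of weights. A kernel of size k with dilation d covers k + (k - 1)(d - 1) input positions. The function names below are illustrative, and the convolution is an unpadded ("valid") cross-correlation chosen for brevity rather than the exact layer used in any cited model.

```python
def effective_kernel_size(k, dilation):
    """Input span covered by a kernel of size k with the given dilation."""
    return k + (k - 1) * (dilation - 1)


def dilated_conv1d(signal, kernel, dilation=1):
    """Valid 1D cross-correlation whose taps are spaced `dilation` apart."""
    span = (len(kernel) - 1) * dilation
    return [
        sum(kernel[j] * signal[i + j * dilation] for j in range(len(kernel)))
        for i in range(len(signal) - span)
    ]


if __name__ == "__main__":
    x = [1, 2, 3, 4, 5, 6, 7]
    w = [1, 1, 1]                          # three weights in both cases
    print(effective_kernel_size(3, 1))     # span without dilation
    print(effective_kernel_size(3, 2))     # wider span, same weight count
    print(dilated_conv1d(x, w, dilation=1))
    print(dilated_conv1d(x, w, dilation=2))
```

With dilation 2 each output mixes inputs two positions apart, so the three weights see a window of five samples instead of three.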
The transformer is a deep neural network based on self-attention. It was first used in natural language processing; the model was then continuously improved and achieved excellent performance in a wide range of language tasks such as machine translation, text classification, and question answering. These achievements have greatly encouraged researchers to explore the role of the transformer in computer vision. In the last two years, the transformer structure and its variants have been successfully applied to visual tasks such as image classification, image captioning, object detection, and segmentation. The Google team also analyzed the training of vision transformers and provided effective guidance for future research on visual transformers. This section mainly introduces self-attention mechanisms and transformer networks (a specific form of self-attention) in the field of image segmentation.
The self-attention mechanism has played a key role in delivering efficient and accurate computer vision systems. The essence of the visual attention mechanism is to imitate human visual signal processing: quickly scan the whole image, pick out the important information, devote more attention resources to it, and suppress other useless information, thereby greatly improving the efficiency and accuracy of visual information processing. According to their function, attention mechanisms can be divided into channel attention, spatial attention, and mixed attention.
Channel attention is exemplified by squeeze-and-excitation networks (SENet). SENet generates a weight for each feature channel to represent the importance of that channel, and then multiplies the weights with the original feature map to enhance the beneficial feature channels and suppress the useless ones.
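The channel reweighting described above can be sketched in a few lines. The sketch assumes the usual squeeze (global average pooling) and excitation (two fully connected layers with ReLU and sigmoid) steps; the weight matrices `w1` and `w2` are hypothetical fixed values standing in for learned parameters, and biases are omitted.

```python
import math


def squeeze(fmaps):
    """Global average pooling: one scalar per channel."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in fmaps]


def linear(vec, weights):
    """Matrix-vector product; `weights` is a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def excite(z, w1, w2):
    """FC -> ReLU -> FC -> sigmoid, giving one weight in (0, 1) per channel."""
    hidden = [max(0.0, h) for h in linear(z, w1)]
    return [1.0 / (1.0 + math.exp(-s)) for s in linear(hidden, w2)]


def se_block(fmaps, w1, w2):
    """Scale every channel of `fmaps` by its excitation weight."""
    scales = excite(squeeze(fmaps), w1, w2)
    return [[[v * s for v in row] for row in ch] for ch, s in zip(fmaps, scales)]


if __name__ == "__main__":
    features = [[[1.0, 3.0]], [[2.0, 2.0]]]  # 2 channels, each 1 x 2
    w1 = [[0.0, 0.0]]                        # hypothetical: 2 channels -> 1 hidden unit
    w2 = [[1.0], [1.0]]                      # hypothetical: 1 hidden unit -> 2 channels
    print(se_block(features, w1, w2))        # zero logits give a weight of 0.5 per channel
```

In a trained network the weights differ per channel, so informative channels are scaled close to 1 and uninformative ones close to 0.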
Spatial attention emphasizes the correlation between pixels, as in Non-local Neural Networks and the criss-cross network (CCNet). These methods capture long-range dependencies through non-local operations, calculate the correlation between pixels through a variety of encoding methods, and make full use of non-local information to enhance the representation ability of pixels. Wang et al. proposed a differentiable non-local operation, which computes the response at a position as a weighted sum of the features at all positions in the feature map to capture long-range dependencies in both space and time in a feed-forward fashion. To reduce computational overhead, Huang et al. proposed the criss-cross attention module, which for each pixel position generates a sparse attention map only on its criss-cross path.
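A minimal sketch of the two ideas follows. `non_local` computes each response as a softmax-weighted sum over the features at all positions, using plain dot products as the similarity; the learned embedding layers and the residual connection of the published non-local block are deliberately left out. `criss_cross_positions` lists the positions a pixel attends to on its row and column, which shows why the criss-cross map is sparse. Both function names are hypothetical.

```python
import math


def non_local(features):
    """features: list of N position vectors; returns N attended vectors."""
    out = []
    for q in features:
        scores = [sum(a * b for a, b in zip(q, k)) for k in features]
        m = max(scores)                              # subtract max for stability
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]          # softmax over all positions
        out.append([sum(w * k[d] for w, k in zip(weights, features))
                    for d in range(len(q))])
    return out


def criss_cross_positions(y, x, h, w):
    """Positions on the row and column through (y, x) in an h x w map."""
    row = [(y, j) for j in range(w)]
    col = [(i, x) for i in range(h) if i != y]
    return row + col


if __name__ == "__main__":
    feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(non_local(feats))
    # A pixel in a 4 x 5 map attends to h + w - 1 positions instead of h * w.
    print(len(criss_cross_positions(1, 1, 4, 5)))
```

The dense operation compares every position with every other one, whereas the criss-cross path needs only h + w - 1 comparisons per pixel.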
Mixed attention enhances feature expression from both perspectives at the same time. For example, the dual attention network (DANet) and the dual relation-aware attention network (DRANet) emphasized both the information expression between channels and the importance of local information within channels, fully integrating the advantages of spatial attention and channel attention. Similarly, Sagar et al. proposed a dual multi-scale attention network (DMSANet), which first divided the input features into multiple groups to capture features at different scales, then fused the multi-scale features and performed channel and spatial enhancement in parallel. Furthermore, because complex attention mechanisms are not suitable for lightweight models, many works approach attention from a lightweight perspective. For example, Channelized Axial Attention (CAA), Axial-DeepLab, and Coordinate Attention proposed position-sensitive axial attention, which reformulates 2D self-attention as two 1D axial attentions to capture full contextual information with high computational efficiency.
Vision transformer networks study how transformers can replace standard convolutions “altogether” in deep neural networks on large-scale computer vision datasets. They first split the original image into a sequence of image “patches”, which are then fed into the original transformer model for training. Next, we introduce the transformer networks specifically designed for image semantic segmentation.
Zheng et al. were the first to perform semantic segmentation with a transformer, constructing the segmentation transformer (SETR) to extract global semantic information. Because the input and output of a transformer must be sequences, SETR first divided the two-dimensional image into patches and converted them into a one-dimensional sequence, and then learned a specific encoding for each patch through positional encoding to retain spatial information. The sequence was then fed into a transformer encoder composed of multi-head self-attention (MSA) and multi-layer perceptron (MLP) modules to learn features. In addition, to evaluate the encoder effectively, SETR designed three different decoders: naive up-sampling (Naive), progressive up-sampling (PUP), and multi-level feature aggregation (MLA). Inspired by SETR, Strudel et al. designed a pure transformer model named Segmenter for semantic segmentation. Each encoder layer in Segmenter models global context information; the mask transformer then decodes the encoder output together with the class embeddings, turning the encoded sequence features into a three-dimensional segmentation feature map, which is up-sampled to obtain the final segmentation map. Building on the transformer's ability to model global dependencies in images, the segmentation transformer combined the transformer with object-contextual representations (OCR), re-expressing the OCR pipeline to enhance feature expression.
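The serialization step can be sketched for a single-channel image. The sketch splits the image into non-overlapping square patches and flattens each one into a token, so an H x W image with patch size P yields (H / P) x (W / P) tokens. The function name, the toy 4 x 4 image, and the patch size of 2 are illustrative assumptions; the linear projection and positional encoding applied to each token in a real model are omitted, and the position of a token in the returned list is what a positional encoding would index.

```python
def image_to_patch_sequence(image, patch):
    """Flatten non-overlapping patch x patch blocks into a 1D token sequence."""
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0, "sketch assumes divisible sizes"
    sequence = []
    for py in range(0, h, patch):          # patches are read row by row
        for px in range(0, w, patch):
            token = [image[y][x]
                     for y in range(py, py + patch)
                     for x in range(px, px + patch)]
            sequence.append(token)
    return sequence


if __name__ == "__main__":
    img = [[y * 4 + x for x in range(4)] for y in range(4)]
    tokens = image_to_patch_sequence(img, patch=2)
    print(len(tokens))   # (4 / 2) * (4 / 2) tokens
    print(tokens[0])     # pixels of the top-left patch
```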
Considering the key role of multi-scale features in CNN-based segmentation models, researchers began to explore ways to combine the advantages of the transformer with multi-scale information encoding. For example, Xie et al. proposed the SegFormer model, which has a hierarchical transformer encoder and a lightweight all-MLP decoder. The hierarchical transformer encoder generates multi-level features whose size decreases layer by layer: large-scale feature maps provide coarse-grained information, and small-scale feature maps provide fine-grained information. The all-MLP decoder aggregates features from the different levels and combines global and local attention to obtain a more powerful representation. The model achieved excellent segmentation performance. Liu et al. proposed a shifted window scheme, which limits self-attention computation to non-overlapping local windows, and constructed a hierarchical network architecture named Swin Transformer, which merges patches layer by layer to expand the perception range and encodes information at different scales to adapt to multi-scale vision tasks. Chen et al. proposed a vision transformer adapter (ViT-Adapter), which first uses a spatial prior module to model the local spatial contexts of the input image, then injects the captured prior features into the patches encoded by the vision transformer (ViT) backbone, and finally extracts hierarchical features from the output of the blocks through a multi-scale feature extractor.
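A small calculation illustrates why restricting self-attention to non-overlapping local windows lowers the cost. Global attention compares every token with every other token, while windowed attention compares each token only with the tokens in its own window. The helper names and the example sizes (a 56 x 56 token grid with 7 x 7 windows) are assumptions for illustration, and the window shifting that lets information cross window borders between layers is not modelled.

```python
def window_index(y, x, window, width):
    """Index of the non-overlapping window that contains token (y, x)."""
    windows_per_row = width // window
    return (y // window) * windows_per_row + (x // window)


def global_attention_pairs(h, w):
    """Query-key pairs when every token attends to every token."""
    return (h * w) ** 2


def windowed_attention_pairs(h, w, window):
    """Query-key pairs when each token attends only within its window."""
    return (h * w) * (window * window)


if __name__ == "__main__":
    print(window_index(7, 7, window=7, width=56))
    print(global_attention_pairs(56, 56))
    print(windowed_attention_pairs(56, 56, window=7))
```

The global count grows quadratically with the number of tokens, whereas the windowed count grows linearly once the window size is fixed.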
Besides using multi-scale information, enhancing features with attention over different distances can also improve model performance. For example, the cross-scale transformer (CrossFormer) proposed long short distance attention (LSDA), which attends not only to the dependencies between adjacent embeddings but also to the dependencies between embeddings far from each other, and retains both small-scale and large-scale embedding features while reducing cost. Yang et al. proposed focal self-attention and built a transformer architecture named Focal Transformer. It applies fine-grained attention locally and coarse-grained attention globally, so it captures both short-range and long-range visual dependencies effectively while reducing computation and improving model performance.
The multi-layer perceptron (MLP) is a feedforward neural network. It has a simple structure and strong adaptive ability, and it is widely used in natural language processing and computer vision. Although CNN-based and transformer-based networks are the mainstream choices in computer vision, researchers still try to build network architectures entirely from MLPs to explore more possibilities for visual architectures. Recent studies have shown that MLP-based architectures, which abandon convolution and self-attention, are simple in design and perform comparably to CNN-based and transformer-based architectures on many visual tasks. This section introduces the specific application of MLPs to semantic segmentation.
Chen et al. proposed an MLP-like architecture called CycleMLP for dense prediction. Compared with previous MLP-based architectures, CycleMLP can flexibly process input images of various scales and is easy to transfer to downstream tasks. Compared with transformer-based architectures, CycleMLP has comparable performance and fewer parameters. It designed a Cycle Fully-Connected Layer (Cycle FC) and used it to replace the spatial MLP in the MLP-Mixer architecture. It also defined the concept of a pseudo-kernel, which changes the original fixed sampling point into a pseudo-kernel window centered on that point with a certain receptive field. By combining the advantages of channel FC and spatial FC, it maintains computational complexity linear in the image size while expanding the receptive field and fully aggregating context information.
To capture the correlation of local features, Lian et al. proposed an axial shift strategy and designed the axial shifted MLP (AS-MLP), which facilitates the interaction of spatial information. The AS-MLP block extracts features along a single spatial direction at a time through a horizontal shift and a vertical shift, maps the features through a linear layer by channel projection, and fuses the features from both directions to extract local information effectively. With this simple axial shift strategy, AS-MLP improved model performance and reduced computational complexity.
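The axial shift idea can be sketched along one axis. Channels are split into groups and each group is shifted by a different offset, so that a subsequent per-position channel projection mixes information from neighbouring positions. The grouping rule, the zero padding, and the default shift size of 3 are assumptions made for this illustration and may differ from the published AS-MLP implementation; the function names are hypothetical.

```python
def shift_1d(row, offset):
    """Shift a list by `offset` positions, filling vacated positions with 0."""
    n = len(row)
    return [row[i - offset] if 0 <= i - offset < n else 0 for i in range(n)]


def axial_shift(channels, shift_size=3):
    """channels: list of C rows (one per channel) along a single spatial axis.

    Channel groups receive the offsets -(shift_size // 2) .. +(shift_size // 2).
    """
    offsets = range(-(shift_size // 2), shift_size // 2 + 1)
    per_group = -(-len(channels) // len(offsets))   # ceiling division
    return [shift_1d(ch, offsets[c // per_group])
            for c, ch in enumerate(channels)]


if __name__ == "__main__":
    feats = [[1, 2, 3], [1, 2, 3], [1, 2, 3]]   # 3 channels, 3 positions
    for row in axial_shift(feats):
        print(row)
```

After the shift, the three channels at any position hold the values of the right neighbour, the position itself, and the left neighbour, so a plain channel-wise linear layer now sees a local window. Applying the same operation along the other axis gives the vertical counterpart.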
A unified framework called SPACH has been developed to fairly compare the performance of CNN, transformer, and MLP architectures. The experimental results show that, under the same pre-training conditions, all three architectures can perform the classification task well. CNN and transformer are complementary: the CNN-based structure has the best generalization ability, and the transformer-based structure has the largest model capacity.
Each type of architecture clearly has its own advantages, so future research does not need to stick to a single architecture. Researchers can integrate the advantages of multiple architectures to achieve better performance according to the requirements of the task. This section mainly introduces research on hybrid architectures.
The convolution operation in CNN-based architectures has a low computational cost, but it cannot model long-range dependencies. In contrast, the transformer has a global attention mechanism but captures low-level details poorly. Therefore, some works combine CNN and transformer to exploit the advantages of both and achieve a more efficient architecture, as in the following methods. The nnFormer built an interleaved architecture based on self-attention and convolution to better combine transformer and CNN. Zhang et al. proposed an architecture called TransFuse, which integrates transformer and CNN. TransFuse consists of parallel transformer and CNN branches: the transformer branches capture global information and build long-range dependencies, while the CNN branches capture rich local information. Guo et al. proposed a method to segment objects with transformers (SOTR), which extracts shallow features through a feature pyramid and captures long-range context dependencies with a parallel twin-branch transformer.
In addition to combining CNN and transformer, researchers have also tried to combine CNN and MLP. MLP-based architectures can encode feature information well, but most of them require fixed-dimension input, which makes them difficult to adapt to downstream tasks (such as object detection and semantic segmentation), and they have a large amount of computation and limited performance. A convolutional neural network can greatly reduce the amount of computation and flexibly adapt to different inputs. Therefore, integrating CNN and MLP can yield a lighter, staged, high-performance architecture. Li et al. proposed hierarchical convolutional MLPs (ConvMLP) for vision, which mainly include a convolution stage and a Conv-MLP stage. The tokenizer applies convolution, normalization, an activation function, and max pooling to extract initial features. The convolution stage is responsible for enhancing spatial connections. In the Conv-MLP stage, convolution is used to increase the interaction between adjacent information during patch merging and downsampling, and a depthwise convolution layer is embedded between the two MLP blocks to further promote the blending of adjacent information and improve model performance.
This section first introduces several common metrics for model performance evaluation and then analyzes the performance of the segmentation models based on these metrics.
The performance evaluation of segmentation models needs unified evaluation metrics. Some common metrics (speed, accuracy, memory footprint) used to evaluate segmentation models are outlined below.
Speed is a very important metric. The inference speed (IS) reflects the forward-pass time, in milliseconds (ms), that a network takes to process an image, and it is usually reported in frames per second (FPS). A fast inference speed helps image segmentation methods to be deployed in practical applications. Therefore, it is very meaningful to understand the time required for model inference.
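The relation between per-image latency and FPS can be sketched as follows. `measure_latency_ms` is a hypothetical timing helper built on the standard-library timer `time.perf_counter`; the warm-up and run counts are arbitrary assumptions, and `dummy_model` merely stands in for a real network's forward pass.

```python
import time


def fps_from_latency_ms(latency_ms):
    """Frames per second for a given per-image latency in milliseconds."""
    return 1000.0 / latency_ms


def measure_latency_ms(forward, sample, warmup=3, runs=10):
    """Average wall-clock time of forward(sample) in milliseconds."""
    for _ in range(warmup):              # warm-up runs are not timed
        forward(sample)
    start = time.perf_counter()
    for _ in range(runs):
        forward(sample)
    return (time.perf_counter() - start) * 1000.0 / runs


if __name__ == "__main__":
    def dummy_model(image):
        # Placeholder workload, not a real segmentation network.
        return [[v * 2 for v in row] for row in image]

    image = [[0.5] * 64 for _ in range(64)]
    latency = measure_latency_ms(dummy_model, image)
    print(f"{latency:.3f} ms per image, about {fps_from_latency_ms(latency):.1f} FPS")
```

Because latency depends on the hardware and the input resolution, both should be reported alongside any FPS figure.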
Accuracy in this paper mainly refers to the accuracy of the model. The mean intersection over union (MIoU) is usually used as the accuracy measure of semantic segmentation. Assuming $k$ is the number of classes, MIoU can be defined as Eq. (1):
$$\mathrm{MIoU} = \frac{1}{k} \sum_{i=1}^{k} \frac{p_{ii}}{\sum_{j=1}^{k} p_{ij} + \sum_{j=1}^{k} p_{ji} - p_{ii}} \tag{1}$$
where $p_{ij}$ refers to the number of pixels of category $i$ inferred as category $j$, and $p_{ii}$ refers to the true positives. $p_{ij}$ and $p_{ji}$ refer to the false positives and false negatives, respectively.
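The MIoU definition can be checked with a short sketch that works from a confusion matrix, where entry [i][j] counts the pixels of true class i predicted as class j. The function name is illustrative, and one choice here is an assumption rather than part of the definition: classes that appear in neither the ground truth nor the prediction are skipped instead of being counted with an IoU of zero.

```python
def mean_iou(confusion):
    """Mean intersection over union from a k x k confusion matrix.

    confusion[i][j] = number of pixels of true class i predicted as class j.
    """
    k = len(confusion)
    ious = []
    for i in range(k):
        tp = confusion[i][i]
        truth = sum(confusion[i])                            # pixels whose label is i
        predicted = sum(confusion[j][i] for j in range(k))   # pixels predicted as i
        union = truth + predicted - tp
        if union > 0:                                        # skip absent classes
            ious.append(tp / union)
    return sum(ious) / len(ious) if ious else 0.0


if __name__ == "__main__":
    cm = [[3, 1],
          [1, 5]]
    # Class 0: 3 / (4 + 4 - 3); class 1: 5 / (6 + 6 - 5).
    print(mean_iou(cm))
```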
Memory footprint is another important factor for segmentation methods. Models with many parameters and complex computation may not be applicable to edge devices (such as unmanned aerial vehicles (UAVs), self-driving cars, and robots), which have less memory than high-performance servers. Therefore, memory constraints need to be considered in model design, and it can be very useful to report both the peak and the average memory usage of a model during operation.
For devices with limited computing power and storage space in industrial applications, accuracy is no longer the only concern; fast and lightweight models are also very important. A comprehensive evaluation of model performance therefore helps readers understand the advantages of different segmentation methods. This section evaluates the performance of the methods described in Section 3.
Tab. 2 summarizes the performance of the main deep learning-based segmentation models on different datasets (e.g., Pascal VOC 2012 test, Cityscapes test, ADE20K Val). In the table, params refers to the number of network parameters. IS is measured on an NVIDIA 1080Ti GPU; methods marked with “*” report IS measured on other GPUs. The last letter of each value in the IS column (a, b, c or d) indicates the input resolution used. A “-” in the backbone column means that the prior work did not use a backbone network, and a “-” in an evaluation metric column means that the corresponding value was not published.
In terms of segmentation accuracy, transformer-based models perform well on multiple benchmarks (e.g., Cityscapes test, ADE20K Val) compared with CNN-based and MLP-based models. If the goal is to improve accuracy regardless of model size and computational cost, designing the segmentation network around a transformer is a good choice. MLP-based networks are comparable in accuracy to large CNN-based segmentation networks on the ADE20K Val dataset. In terms of network size and inference speed, CNN-based methods have a clear advantage.
The performance of these methods stems from the distinct advantages of their architectural designs. The convolution operation extracts image features through a fixed-size kernel and only needs to learn the parameters of that window, without encoding global information. Therefore, CNN-based networks are more lightweight and have faster inference. However, convolution cannot capture long-range dependencies, such as the relationship between arbitrary pixels in an image. Furthermore, convolution filter weights remain fixed after training and thus cannot adapt dynamically to variations in the input. Currently, the main ways to capture global information are to expand the receptive field and to embed a non-local self-attention mechanism in the network. The self-attention mechanism is an integral part of the transformer; unlike convolution, its weights are computed dynamically and it can capture long-range dependencies. Transformer-based networks have therefore achieved state-of-the-art (SOTA) results on multiple datasets, owing to their ability to encode global information and their general modeling capability. However, transformer-based models have complex structures, modeling global information requires huge computational overhead, and training them is costly (it needs ample computing resources and large-scale datasets). MLP-based segmentation methods are a new attempt at semantic segmentation architecture. Their existence indicates that convolution and self-attention are not necessary conditions for good performance, and they offer more ideas for future development.
In summary, since the emergence of FCN, deep learning-based semantic segmentation methods have made significant progress in both accuracy and speed, and their MIoU scores have increased by nearly 30% on multiple datasets. Among all segmentation models, CNN-based models still make up the majority, while transformer-based methods achieve SOTA accuracy. At present, as new networks from the different camps continue to improve accuracy on image segmentation benchmarks, no conclusion can be drawn as to which structure among CNN, transformer, and MLP performs best or is most suitable for semantic segmentation. In general, each architecture has its own advantages. As deep learning develops, future semantic segmentation models are likely to integrate the advantages of multiple architectures to achieve better performance.
According to the research we reviewed, there is no doubt that image segmentation based on deep learning has developed rapidly and made great achievements. Image segmentation is not only the foundation of scene understanding, but also the focus of future research. Next, some major challenges in this field are summarized and future research directions are given.
Image semantic segmentation tasks rely on high-quality labeled data. However, labeling data is time-consuming and labor-intensive. To alleviate the dependence of semantic segmentation on labeled data, self-supervised and unsupervised learning can be utilized. These techniques have great research value in image segmentation and can effectively alleviate the problem of poor image analysis results in fields that lack image segmentation datasets. Following the idea of transfer learning, a general image segmentation model can be trained on upstream tasks and then fine-tuned on the target dataset to perform new tasks in the target domain. With self-supervised learning, a large amount of unlabeled data is used to capture detailed information in images and guide model training, which reduces the dependence on labeled data during training.
Much image segmentation work focuses on two-dimensional images and cannot be directly applied to video sequences. However, visual recognition and scene understanding in the real world are usually based on video sequences. Point-cloud segmentation has a wide range of applications in fields such as autonomous driving, robotics, 3D reconstruction, and building modeling. At present, point-cloud segmentation faces many challenges, such as how to deal with unordered and unstructured 3D point-cloud data, how to construct a segmentation model that operates on the original point cloud to reduce the loss of useful information during processing, and how to use deep learning methods to segment different individuals in complex scenes. These problems are worthy of research and exploration.
In many applications, high accuracy is no longer the only goal; it is also critical that the model can run inference at near real-time speed (at least 25 frames per second for general-purpose cameras). This matters for computer vision systems deployed in devices such as self-driving cars, robot dogs, and UAVs. Although large-scale transformer networks and MLP-based networks can achieve high accuracy, they have heavy power and computational requirements, and these high resource costs hinder their deployment on devices. Currently, most models either have high segmentation accuracy but long inference time, or fast inference but low segmentation accuracy, and do not achieve a good balance between speed and accuracy. Therefore, future work needs to pay more attention to real-time constraints and efficient hardware designs, continuously improve model performance, simplify networks, and find a balance between accuracy and running time to promote the deployment and application of relevant technologies.
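A threshold of 25 frames per second corresponds to a budget of 1/25 s = 40 ms per frame. As a rough illustration of how such a constraint might be checked, the sketch below times an arbitrary inference callable over a list of frames. The names `measure_fps`, `is_real_time`, and `infer` are hypothetical, the warm-up count is an arbitrary choice, and wall-clock timing on a development machine is only an approximation of on-device behavior.

```python
import time


def measure_fps(infer, frames, warmup=2):
    """Estimate throughput of `infer` in frames per second.

    A few warm-up calls are made first so that one-off start-up costs
    do not distort the measurement.
    """
    for frame in frames[:warmup]:
        infer(frame)
    start = time.perf_counter()
    for frame in frames:
        infer(frame)
    elapsed = time.perf_counter() - start
    return len(frames) / elapsed if elapsed > 0 else float("inf")


def is_real_time(fps, target_fps=25.0):
    """True if the measured throughput meets the target frame rate."""
    return fps >= target_fps


if __name__ == "__main__":
    # Placeholder "model": sleeps 10 ms per frame instead of segmenting it.
    def fake_model(frame):
        time.sleep(0.01)
        return frame

    fps = measure_fps(fake_model, list(range(20)))
    print(f"{fps:.1f} FPS, real-time: {is_real_time(fps)}")
```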
Existing methods are often limited to a single network architecture and cannot fully integrate the advantages of different architectures. Each architecture type has its advantages and disadvantages. CNN-based architectures have low computational overhead, but they cannot capture long-range dependencies. In the future, this limitation can be compensated for by expanding the receptive field with large convolution kernels, atrous convolution, and attention mechanisms. Transformer-based and MLP-based architectures can encode global dependencies, but they have high parameter complexity, which leads to high training and inference costs. Future work can build efficient visual networks by reasonably compressing the network architecture or by using neural architecture search methods. In addition to optimizing different architectures separately, researchers can also integrate the advantages of multiple architectures to build effective hybrid architectures. Although there has been some related research on hybrid architectures, there is still a lot of room for improvement. Moreover, some useful experience from the development of CNN-based architectures (e.g., constructing multi-scale information and fusing shallow information) can be applied to transformer-based and MLP-based architectures to optimize feature extraction.
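To make the effect of atrous (dilated) convolution on the receptive field concrete, the following sketch applies the standard recurrence for a stack of convolution layers along one spatial axis. The function name and the (kernel size, dilation, stride) tuple layout are assumptions chosen for illustration; pooling layers and padding effects are not modeled.

```python
def receptive_field(layers):
    """Receptive field (in input pixels, along one axis) of stacked convolutions.

    layers: list of (kernel_size, dilation, stride) tuples, ordered from
    the input to the output of the network.
    """
    rf = 1    # a single output pixel initially sees one input pixel
    jump = 1  # distance in input pixels between adjacent feature positions
    for kernel_size, dilation, stride in layers:
        rf += (kernel_size - 1) * dilation * jump
        jump *= stride
    return rf


if __name__ == "__main__":
    plain = [(3, 1, 1)] * 3                      # three ordinary 3x3 layers
    atrous = [(3, 1, 1), (3, 2, 1), (3, 4, 1)]   # same weights, growing dilation
    print(receptive_field(plain), receptive_field(atrous))
```

With the same number of 3 × 3 layers and parameters, increasing the dilation from 1 to 2 to 4 widens the receptive field from 7 to 15 pixels in this calculation, which is the motivation for using atrous convolution to enlarge context without a larger kernel.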
Research on deep learning-based semantic segmentation methods has made great progress in recent years. This survey first presents some commonly used image segmentation datasets and then reviews pioneering methods in the field of general image semantic segmentation. These methods are divided into four types according to their architectures and highlights: CNN-based, transformer-based, MLP-based, and others. To discover and utilize the power of different types of architectures, existing methods are compared and analyzed based on evaluation metrics (such as model size, inference speed, and segmentation accuracy), and the key strengths and limitations of each type of architecture are reported. In general, CNN-based methods have lighter models and faster inference speed, transformer-based methods can encode global information, and MLP-based models are simple in design and require neither convolution nor self-attention. At present, no conclusion can be drawn as to which architecture among CNN, Transformer, and MLP performs best or is most suitable for semantic segmentation tasks. Finally, possible research directions and challenges are elaborated. We believe that combining the advantages of multiple deep learning architectures to design high-precision and high-efficiency networks is a key direction for overcoming the current bottleneck, and it is also the future scope of our present work.
Funding Statement: This work was supported by the Major Science and Technology Project of Hainan Province (Grant No. ZDKJ2020012), the National Natural Science Foundation of China (Grant Nos. 62162024 and 62162022), the Key Projects in Hainan Province (Grant Nos. ZDYF2021GXJS003 and ZDYF2020040), and the Graduate Innovation Project (Grant No. Qhys2021-187).
Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.