COVID-19 is a growing problem worldwide with a high mortality rate. As a result, the World Health Organization (WHO) declared it a pandemic. To limit the spread of the disease, a fast and accurate diagnosis is required. A reverse transcription polymerase chain reaction (RT-PCR) test is often used to detect the disease. However, since this test is time-consuming, a chest computed tomography (CT) scan or a plain chest X-ray (CXR) is sometimes indicated. Automated diagnosis saves time and money by minimizing human effort. Our research makes three significant contributions. First, it tests the behavior and efficiency of a variety of vision models, ranging from Inception to Neural Architecture Search (NAS) networks, using an essential finetuning methodology. Second, the behavior of these models is visually analyzed by plotting class activation maps (CAMs) for individual networks and assessing classification efficiency with AUC-ROC curves. Finally, the stacked ensembles technique was used to provide greater generalization by combining the finetuned models into six ensemble neural networks. Using stacked ensembles, the generalization of the models improved. Furthermore, the ensemble model created by combining all of the finetuned networks obtained a state-of-the-art COVID-19 detection accuracy of 99.17%. The precision and recall rates were 99.99% and 89.79%, respectively, highlighting the robustness of stacked ensembles. According to the experimental results, the proposed ensemble approach performed well in the classification of COVID-19 lesions on CXR.
The coronavirus (COVID-19) was first noted in December 2019 in Wuhan City (Hubei, China). The viral infection quickly spread worldwide, eventually causing a global pandemic. Following a detailed study of its biological properties, the virus was found to be of zoonotic origin and to consist of a single-stranded ribonucleic acid (RNA) genome with a strong capsid. Based on this survey, it was concluded that the virus belongs to the Coronaviridae family, and it was subsequently named the 2019 novel coronavirus (2019-nCoV). A person infected with 2019-nCoV may have no symptoms or may develop mild symptoms, including sore throat, dry cough, and fever. If the human body hosts the 2019-nCoV for a long period, the virus can cause severe respiratory illness and, in the worst case, death. Four stages are used to assess the virus's virulence in the human body. During the first four days of the infection, the patient is often asymptomatic. The second stage is the progressive stage, which generally occurs between the fifth and eighth day following the infection, whereby the patient may develop mild symptoms. Stage three is known as the peak stage, which occurs between nine and thirteen days. The final stage is the absorption stage, whereby the viral load gradually subsides [
Due to the rapid surge in cases, healthcare systems are finding it increasingly difficult to cope with the demand and to provide timely vaccination [ ]. The contributions of this study are to: (1) analyze the behavior and performance of various vision models, ranging from Inception to Neural Architecture Search (NAS) networks, followed by appropriate model finetuning; (2) visually assess the behavior of these models by plotting class activation maps (CAMs) for individual networks; (3) determine the classification performance of the models by calculating the area under the curve (AUC) of a receiver operating characteristic (ROC) curve; and (4) improve the generalization of the models by combining the finetuned deep learning models using the stacked ensembles technique.
Numerous studies have evaluated the use of deep learning methods for the automatic detection, classification, feature extraction, and segmentation of COVID-19 from CXR and CT images. This study discusses the relevant applications of pre-trained deep neural networks that prompt the key aspects that impact COVID-19 detection and classification. Fan et al. [
A total of 2905 CXRs were obtained from various databases, including the Italian Society of Medical Radiology (SIRM), ScienceDirect, The New England Journal of Medicine (NEJM), the Radiological Society of North America (RSNA), Radiopaedia, Springer, Wiley, medRxiv, and other sources.
Convolutional neural networks (CNNs) are increasingly being used in computer vision to detect, classify, localize, and segment normal and pathological features from medical images [
The Inception architecture is designed around a novel module ideology. The network is trained by widening layers to increase the depth of the network with few computational parameters. There are two versions of the module: a naive version and a dimensionality-reduced version. The Inception module consists of three levels. The bottom level feeds into four different layers stacked by width. The intermediate layers extract spatial information individually and correlate with each layer. The top layer concatenates all the intermediate layers' feature maps to maintain a hierarchy of features, improving the perceptive performance of the network [
After Inception, VGG networks were developed by sequentially stacking convolutional layers with pooling layers. The sequential depth of the models ranges from 11 to 19 layers. The appropriate use of max-pooling layers in the 16- and 19-layer VGG-Nets is essential for spatial sub-sampling and for extracting generic features at the rearmost layers. VGG-Nets use small receptive fields of 3x3 (and 1x1) to capture fine features, eventually improving their detection precision. The generalizability of the model for highly correlated inputs can be further improved by finetuning the learning-rate schedule to decrease the learning rate [
The Res-Nets were developed to address the problem of vanishing gradients by imparting identity mappings in large-scale networks. They reformulate deep layers by aggregating learned activations from a prior layer to form a residual connection. This residual learning minimizes the problem of degrading accuracy and exploding gradients in deeper networks. The residual connections propagate learned activations from preceding layers, maintain a constant information flow throughout the network, and eventually reduce the computational cost [
This network was inspired by the Inception network modules and identity mappings from ResNets. This method integrates dimensionality-reduced Inception modules with sequential residual connections hence increasing the learning capability of the network while reducing its computational cost. This provides better generalization ability when compared to various versions of the ResNet and Inception Networks [
This network was proposed to compete with the Inception network and to reduce its flaws. The simultaneous mapping of spatial and cross-channel correlations allows for improved learning with small receptive fields and improves perceptive ability. The depth-wise separable convolutional layers enhance learning through detailed feature extraction. These networks are computationally less expensive and perform better than the Inception network [
These densely connected CNNs are motivated by the residual connections of Res-Nets and impose long-chained residual connections to form dense blocks. In Dense-Nets, for N layers, there are N(N+1)/2 connections (including residual connections), which enhance the network's capability for extracting detailed features while reducing image degradation. The sequential dense and transition blocks provide a collection of knowledge and a bottleneck receptive field of 3x3, eventually improving computational efficiency. The finetuning of larger weights improves generalization in deeper networks, with depths ranging from 121 to 201 layers [
Mobile-Nets were designed for mobile applications under constrained environments. The main advantage of this network is the combination of inverted residual layers with linear bottlenecks. The network takes a low-dimensional input and expands it into a higher-dimensional space. These elevated features are filtered via depth-wise separable CNNs and then projected back onto a low-dimensional space using linear CNNs. This design reduces the need to access main mobile-application memory, thus providing faster execution through the use of cache memory [
Nas-Nets make use of convolution cells by learning from distinct classification tasks. The design of this network is based on a reduced depth-wise stacking of normal cells, hence providing an appropriate search space by decoupling a sophisticated architectural design. This adaptability of Nas-Nets enables it to perform well even on mobile applications. The computational cost is significantly reduced, and its performance can be improved by enhancing the depth [
This deep-stacked ensemble method was evaluated by classifying COVID-19 database inputs into a tri-class and a binary class, as shown in
Samples from the COVID-19 dataset were first pre-processed to a resolution of 224×224×3. These pre-processed images were then fed into a variety of deep networks that use different paradigms to extract features from latent dimensions. The extracted feature vectors are then evaluated, and the two best-performing models are selected to form a stacked ensemble. The COVID-19 class is given more weight in this ensemble, which was assessed by classifying the inputs into a tri-class and a binary class.
Deep learning algorithms can detect pathology from biomedical imaging with human-level precision. CNNs provide numerous advantages for feature detection in medical imaging. There are two methods that can be used to design neural architectures for medical imaging. The first method involves designing a novel architecture, by overhauling blocks of existing architectures, and training it end-to-end. The second involves model finetuning, either by transferring the weights of a pre-trained model (transfer of weights) or by retraining an existing pre-trained architecture.
Training an end-to-end CNN requires proper initialization, which can be computationally expensive. On the other hand, transferring weights from models pre-trained on a similar problem can reduce the computational cost. However, such models may not extract the relevant invariances if samples of the target classes were never seen during training. For example, a network pre-trained on ImageNet may not be able to extract the invariances in CXRs if such samples were never seen or trained on. This means that the model may end up capturing unwanted features on the CXR, leading to inaccurate classification. To overcome this problem, the model is fine-tuned to obtain the appropriate features. Fine-tuning is extremely important in medical imaging, where small sample sizes lead to class imbalance [
The Dtrain and Dtest samples were inserted into each model to capture latent feature vectors. A feedforward neural network was built to classify the extracted feature vectors, and all models were fine-tuned using Algorithm 1. The final extracted feature vector consisted of different three-dimensional shapes according to the model. These latent representations were then classified by attaching a dense layer consisting of 256 neurons followed by dropout [
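The dense classification head described above can be sketched as follows. This is a minimal numpy forward pass, not the authors' implementation: the 128-dimensional features, weight initialization, and layer sizes other than the stated 256 hidden neurons and three classes are illustrative assumptions (actual latent shapes depend on the backbone).

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Hypothetical latent feature vectors from a finetuned backbone,
# reduced to 128 dimensions for illustration.
features = rng.normal(size=(8, 128))          # 8 samples

W1 = rng.normal(scale=0.05, size=(128, 256))  # dense layer with 256 neurons
b1 = np.zeros(256)
W2 = rng.normal(scale=0.05, size=(256, 3))    # 3 classes (e.g. COVID-19 / normal / pneumonia)
b2 = np.zeros(3)

def head(x, train=False, drop_rate=0.5):
    h = relu(x @ W1 + b1)
    if train:  # inverted dropout, active only during training
        mask = rng.random(h.shape) >= drop_rate
        h = h * mask / (1.0 - drop_rate)
    return softmax(h @ W2 + b2)

probs = head(features)
print(probs.shape)  # (8, 3): one probability vector per sample
```

In practice the head would be trained jointly with (or on top of) the frozen backbone; this sketch only shows the shape of the computation.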
All models were carefully finetuned, and their performance was evaluated using various performance metrics. The generalizations provided by the finetuned models are summarized in
Variables | Description |
---|---|
Dtrain | Training data |
Dtest | Test data for generalization |
 | Training feature set; testing feature set |
 | Training label set; testing label set |
 | Initial number of iterations |
 | Initial number of batches considered to fit into a NN |
 | Adam optimizer with the learning rate as a parameter |
 | Initial learning rate |
 | Loss function considered (squared hinge) |
 | Predicted labels during training or testing |
 | Ground-truth labels |
 | Altered VGG-16 and VGG-19 network architectures designed for classification |
 | Trained network architectures (VGG-16 and VGG-19) |
 | A dense feedforward network with two parameters: the input and the number of hidden neurons |
 | Softmax activation function |
Models | Accuracy (%) | Precision C-0 (%) | Precision C-1 (%) | Precision C-2 (%) | Recall C-0 (%) | Recall C-1 (%) | Recall C-2 (%) | F1 C-0 (%) | F1 C-1 (%) | F1 C-2 (%) |
---|---|---|---|---|---|---|---|---|---|---|
VGG-16 | 93.94 | 90.9 | 91.86 | 96.82 | 81.63 | 96.58 | 92.96 | 86.02 | 94.16 | 94.85 |
VGG-19 | 89.95 | 57.14 | 97.00 | 91.81 | 97.95 | 83.19 | 96.02 | 72.18 | 89.57 | 93.87 |
Xception | 97.11 | 97.91 | 97.86 | 100.0 | 95.91 | 99.71 | 96.02 | 96.90 | 93.85 | 94.19 |
InceptionV3 | 93.53 | 74.46 | 93.50 | 96.31 | 71.42 | 94.30 | 96.02 | 72.91 | 93.90 | 96.17 |
InceptionResNet | 94.77 | 92.30 | 94.19 | 95.71 | 73.46 | 97.15 | 95.41 | 81.81 | 95.65 | 95.55 |
ResNet-50 | 93.12 | 98.26 | 87.26 | 95.92 | 81.63 | 96.58 | 92.96 | 87.02 | 94.16 | 93.85 |
ResNet-101 | 90.09 | 75.41 | 97.32 | 86.68 | 93.87 | 82.62 | 97.55 | 83.64 | 89.37 | 91.79 |
ResNet-152 | 91.74 | 65.71 | 94.10 | 94.96 | 93.87 | 90.88 | 92.35 | 77.31 | 92.46 | 93.64 |
MobileNet | 97.24 | 93.75 | 96.91 | 98.14 | 91.83 | 98.29 | 96.94 | 92.78 | 97.59 | 97.53 |
NASNetMobile | 94.22 | 94.17 | 92.70 | 95.97 | 65.30 | 97.72 | 94.80 | 77.10 | 95.15 | 95.38 |
DenseNet-121 | 96.56 | 88.46 | 99.10 | 95.29 | 93.87 | 94.58 | 99.08 | 91.08 | 96.79 | 97.15 |
DenseNet-169 | 93.94 | 79.66 | 95.39 | 94.44 | 95.91 | 94.01 | 93.57 | 87.03 | 94.96 | 94.01 |
DenseNet-201 | 74.55 | 100.0 | 65.48 | 99.99 | 61.22 | 99.99 | 49.23 | 75.94 | 79.14 | 65.98 |
In the design of medical diagnostic prediction models, receiver operating characteristic (ROC) analysis is essential for analyzing model performance. The area under the curve (AUC) of a classifier's ROC determines the diagnostic stability of the model. The AUC-ROC curve is insensitive to alterations in the individual class distributions [
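The class-distribution insensitivity of AUC-ROC can be illustrated with a small sketch; the one-vs-rest scores for the COVID-19 class below are assumed values, not results from this study:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical one-vs-rest scores for the COVID-19 class
y_true  = np.array([0, 0, 0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.1, 0.3, 0.2, 0.4, 0.8, 0.7, 0.35, 0.9, 0.15, 0.6])

auc = roc_auc_score(y_true, y_score)
print(auc)  # 1.0: every positive outscores every negative

# Duplicating the negatives skews the class distribution
# but leaves the AUC unchanged.
neg = y_true == 0
auc_skewed = roc_auc_score(np.concatenate([y_true, y_true[neg]]),
                           np.concatenate([y_score, y_score[neg]]))
print(auc_skewed)  # still 1.0
```

This is why AUC is a more stable summary than accuracy when, as here, the COVID-19 class is much smaller than the others.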
A prediction model for medical imaging needs to have a high sensitivity and specificity. A clinically useful COVID-19 model based on CXR needs to be able to differentiate COVID-19 from other infections. However, distinguishing CXR lesions caused by COVID-19 from those caused by other infections can be quite challenging. CAMs were therefore applied to all CXR input images [
The CAMs analysis shows that some of the models extract the peripheral and bilateral ground-glass opacities while some of the other models also extracted the rounded morphology typical of COVID-19 lesions [
Model averaging is the process of averaging the outcomes of a group of networks trained on the same task, or of the same model trained with different parameters. Model averaging improves the generalization of the models by aggregating their predictions. The generalization of the model was obtained by minimizing the loss during stochastic optimization using equation
Similarly, weights can be assigned to individual models based on their prediction performance. These weights are then applied to the corresponding models to obtain an aggregated generalization; this is known as weighted model averaging. In plain model averaging, all models are treated equally, regardless of their individual performance. Weighted model averaging, by contrast, gives importance to the required models and discounts the poorly performing ones.
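The difference between plain and weighted model averaging can be sketched with numpy. The softmax outputs of three hypothetical networks and the validation-derived weights below are assumed values for illustration only:

```python
import numpy as np

# Hypothetical softmax outputs of three finetuned networks (4 samples, 3 classes)
p1 = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.2, 0.6], [0.5, 0.3, 0.2]])
p2 = np.array([[0.6, 0.3, 0.1], [0.2, 0.7, 0.1], [0.1, 0.3, 0.6], [0.4, 0.4, 0.2]])
p3 = np.array([[0.9, 0.05, 0.05], [0.3, 0.6, 0.1], [0.3, 0.3, 0.4], [0.2, 0.5, 0.3]])

# Plain model averaging: every network contributes equally
avg = (p1 + p2 + p3) / 3

# Weighted averaging: weights assumed to come from validation performance
w = np.array([0.5, 0.3, 0.2])
wavg = w[0] * p1 + w[1] * p2 + w[2] * p3

# Equal and performance-based weighting can disagree (here on the last sample)
print(avg.argmax(axis=1))   # [0 1 2 1]
print(wavg.argmax(axis=1))  # [0 1 2 0]
```

The disagreement on the final sample shows how weighting lets a trusted model override the majority.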
A committee of neural models provides better generalization than either model averaging or weighted model averaging. Hence, the models were stacked to improve the generalization ability of the overall model.
The stacked ensemble integrates or groups different models to provide aggregated generalization by mapping the output predictions onto a logit function. Instead of averaging the weights of the grouped models, logistic regression or a multi-class logit is applied to map the predictions. The predictions were therefore gathered, and either a logistic regression was applied to them or an end-to-end neural model was built that applies a softmax non-linearity as the final activation [
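A minimal sketch of this stacking step, assuming hypothetical out-of-fold probability outputs from two base networks; a multi-class logistic regression serves as the meta-learner, as the text describes. The synthetic labels and noise level are assumptions, not the study's data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Hypothetical out-of-fold predictions of two base networks (3 classes each)
n = 300
y = rng.integers(0, 3, size=n)                       # ground-truth labels
base1 = np.eye(3)[y] + rng.normal(scale=0.3, size=(n, 3))  # noisy one-hot-like outputs
base2 = np.eye(3)[y] + rng.normal(scale=0.3, size=(n, 3))

# Meta-learner input: concatenated base-model predictions
meta_X = np.hstack([base1, base2])

# Multi-class logit maps the stacked predictions onto final labels
meta = LogisticRegression(max_iter=1000).fit(meta_X, y)
acc = meta.score(meta_X, y)
print(round(acc, 2))  # high accuracy: the meta-learner denoises the base outputs
```

In a real pipeline the base predictions fed to the meta-learner must come from held-out folds, otherwise the meta-learner overfits to the base models' training-set behavior.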
Our network was first considered to be a function y_m(x) that predicts an output for a certain input x, where the true function is f(x), so that

y_m(x) = f(x) + ε_m(x),

with ε_m(x) denoting the error of the m-th network. So, the average individual error of the M networks can be estimated as follows:

E_avg = (1/M) Σ_{m=1}^{M} E[ε_m(x)²].

The ensemble learning of the grouped variant networks is presented in the following equation:

y_ens(x) = (1/M) Σ_{m=1}^{M} y_m(x).

The estimated error obtained by stacking these ensembles is:

E_ens = E[((1/M) Σ_{m=1}^{M} ε_m(x))²].

From these equations, if the errors ε_m(x) have zero mean and are uncorrelated, then E_ens = (1/M) E_avg. To understand this scenario, note that the committee error is reduced relative to the average individual error by a constant factor. This constant '1/M' reduction assumes uncorrelated errors; in practice the errors are typically correlated, so the reduction is smaller, but by convexity E_ens ≤ E_avg always holds.
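A small numeric sketch illustrates the committee-error reduction, assuming zero-mean, uncorrelated Gaussian errors (an idealization; real model errors are correlated):

```python
import numpy as np

rng = np.random.default_rng(42)
M, n = 5, 100_000

# Each network m predicts f(x) + eps_m(x); we model the errors eps_m directly
eps = rng.normal(scale=1.0, size=(M, n))   # zero-mean, (nearly) uncorrelated

e_avg = np.mean(eps**2)                    # average individual squared error
e_ens = np.mean(eps.mean(axis=0)**2)       # squared error of the averaged committee

print(round(e_avg, 3), round(e_ens, 3))    # e_ens is close to e_avg / M
```

With M = 5 the committee error comes out near one fifth of the average individual error, matching the 1/M factor in the derivation above.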
With this knowledge, it is clear that stacked ensembles can outperform single networks in terms of generalization. As a result, six different neural network committees were formed by combining the finetuned neural networks described in
Ensemble model | Stacked finetuned networks |
---|---|
Ensemble-1 | VGG-16 + Xception |
Ensemble-2 | VGG-19 + Xception |
Ensemble-3 | VGG-16 + VGG-19 + Xception |
Ensemble-4 | Ensemble-3 + DenseNet169 + DenseNet201 |
Ensemble-5 | Ensemble-4 + MobileNet |
Ensemble-6 | Stacking all the models in Table 1 |
As mentioned, the generalization error obtained by a committee of neural networks is always less than or equal to that of a single neural network. Six variant committees of networks were selected and combined as described in
Models | Accuracy (%) | Precision C-0 (%) | Precision C-1 (%) | Precision C-2 (%) | Recall C-0 (%) | Recall C-1 (%) | Recall C-2 (%) | F1 C-0 (%) | F1 C-1 (%) | F1 C-2 (%) |
---|---|---|---|---|---|---|---|---|---|---|
Ensemble-1 | 98.48 | 97.67 | 99.42 | 97.61 | 85.71 | 99.14 | 99.69 | 91.30 | 99.28 | 98.64 |
Ensemble-2 | 98.76 | 99.99 | 98.86 | 98.49 | 83.67 | 99.71 | 99.99 | 91.11 | 99.28 | 99.24 |
Ensemble-3 | 97.94 | 99.99 | 97.76 | 97.91 | 73.46 | 99.43 | 99.99 | 84.70 | 98.58 | 98.94 |
Ensemble-4 | 97.66 | 99.99 | 99.42 | 95.62 | 69.38 | 99.43 | 99.99 | 81.92 | 99.42 | 97.76 |
Ensemble-5 | 98.76 | 99.99 | 99.42 | 97.91 | 85.71 | 99.71 | 99.99 | 92.31 | 99.42 | 98.94 |
A comparative study was then performed to compare the performance of the proposed network with other existing models described in the literature
Our generic training algorithm facilitates the training process by achieving faster convergence with fewer computations (iterations). During training, the batch size and learning rate are increased cautiously at each iteration to obtain a balanced criterion, as explained by Smith et al. [
Literature | Image-Kind | Model (generalization-test) | Binary Acc. (%) | 3-class Acc. (%) | 4-class Acc. (%) |
---|---|---|---|---|---|
Yujin et al. [ | CXR | Segmentation + ResNet-18 (Train-Test) | - | - | 88.9 |
Mohammad et al. [ | CXR | Xception + ResNet50V2 | 91.4 | - | - |
Ozturk et al. [ | CXR | DarkCovidNet | 98.3 | 87.2 | - |
Apostolopoulos et al. [ | CXR | Transfer-Learning (Train-Test) | 98.75 | 94.7 | - |
Lin Li et al. [ | CT | COVNET (Train-Test) | - | 96.0 | - |
A. Iqbal Khan et al. [ | CXR | CORONET | 99.0 | 95.0 | 89.6 |
Wang et al. [ | - | COPLE-NET | 80.72 ± 9.96 (Dice Score) | - | - |
Proposed | CXR | Stacked Ensemble (Train-Test) | - | 99.17 | - |
The noise due to training is theoretically represented as g = ε(N/B − 1), where ε is the learning rate, N is the training-set size, and B is the batch size.
Here, a constant momentum was assumed. A training algorithm was developed to conceptualize this noise constraint. Although a decaying learning rate can decrease the noise, it gradually increases the computational time for training. On the other hand, increasing the batch size can also reduce the noise but comes at the cost of lowering the generalizing capacity of the model. These problems were overcome by developing an algorithm that increases the batch size at specified iterations while cautiously increasing the learning rate, as follows. The algorithm was first iterated for 16278 steps (iteration 1), with the learning rate set to 10^-4 and batches of 15 samples. In the next iteration (iteration 2), the batch size was increased by 50%, and the learning rate was increased tenfold. To maintain a consistent trade-off between generalization and faster convergence, the batch size was then increased by a factor of 150% (relative to the initial size), and the learning rate was tuned as in the preceding iteration. During experimentation, it was found that the proposed training procedure led to faster convergence, training in only a few steps (approximately 20 epochs).
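The schedule just described can be sketched as a small plan-builder. This is a reading of the text, not the authors' code: the third iteration's learning rate ("tuned as per the preceding iteration") is assumed to stay at the tenfold value, and "increased by a factor of 150%" is read as adding 150% of the initial batch size.

```python
# Grow the batch size each iteration (reducing gradient noise) while raising
# the learning rate, instead of decaying the learning rate.

def training_schedule(initial_lr=1e-4, initial_batch=15):
    return [
        # iteration 1: 16278 steps at the initial settings
        {"iteration": 1, "lr": initial_lr, "batch": initial_batch},
        # iteration 2: batch size +50%, learning rate x10
        {"iteration": 2, "lr": initial_lr * 10, "batch": int(initial_batch * 1.5)},
        # iteration 3: batch size +150% of the initial size; lr assumed held at x10
        {"iteration": 3, "lr": initial_lr * 10,
         "batch": initial_batch + int(initial_batch * 1.5)},
    ]

for step in training_schedule():
    print(step)  # batches grow 15 -> 22 -> 37 while lr rises once
```

Each plan entry would drive one finetuning phase; the actual optimizer state (Adam with constant momentum, per the text) carries over between phases.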
Appropriate training with fine-tuning of the ensemble was therefore critical to obtain these insightful outcomes. The final Ensemble-6 model had the highest performance of all the methods compared, with an accuracy score of 99.175%. Ensemble-1 and Ensemble-2 attained accuracies of 98.487% and 98.762%, respectively. When considering only the COVID-19 class, the precision rate was at least 97.674%, but the recall rate was lower. The highest and lowest recall rates, 89.795% and 69.387%, were obtained by Ensemble-6 and Ensemble-4, respectively. However, due to the small sample of the COVID-19 class in our study, it was difficult to extract additional invariant features to improve the performance of the model further.
In this study, we observed that the stacked ensemble was slightly inefficient when a poor-performing model was included. The DenseNet-201 model was not always finetuned correctly, and its network depth was not always appropriate, leading to a high generalization error. No single model consistently captured the COVID-19 class from the features it derived on its own. The ensemble method offers more generalization, but combining multiple models increases the computational cost, which is impractical for small-scale computational systems (such as Ensemble-6). As a result, in real-world scenarios, small, quick, and efficient models such as Ensemble-1 and Ensemble-2 are advantageous. The progression of the virus can be visualized better on axial chest CT images; there is a chance of missing disease progression on CXR [
In this study, various COVID-19 classification models were evaluated and compared using different classification metrics. Furthermore, a learning framework for finetuning these models was proposed, and their bottleneck activations were visualized using CAMs. The AUC-ROC curves were closely examined, and the output of each class was illustrated visually. These fine-tuned models were then stacked to outperform previous models and cover a broad range of generalizations. The ensemble models achieved an accuracy score of 97.66% in the worst-case scenario. Even after finetuning for class imbalance, the models were found to have a high generalization ability. The lowest error rate, 0.83%, was obtained by the best-performing model, built by stacking all the finetuned models. The stacked ensembles method improved the performance of the model and could therefore be used to improve the prediction accuracy of diagnostic models in medical imaging.
The authors extend their appreciation to King Saud University for funding this work through Researchers Supporting Project number RSP-2021/305, King Saud University, Riyadh, Saudi Arabia.