Computers, Materials & Continua
Prediction of COVID-19 Cases Using Machine Learning for Effective Public Health Management
1Department of Computer Sciences, Kinnaird College for Women, Lahore, 54000, Pakistan
2Department of Information Systems, College of Computer and Information Sciences, Jouf University, Sakaka, Aljouf, 72341, Saudi Arabia
3Division of Computer Science & Information Technology, University of Education, Lahore, 54000, Pakistan
4School of Computer Science, National College of Business Administration & Economics, Lahore, 54000, Pakistan
5Department of Clinical Laboratory Sciences, College of Applied Medical Sciences, Jouf University, Sakaka, Aljouf, 72341, Saudi Arabia
*Corresponding Author: Fahad Ahmad. Email: firstname.lastname@example.org; email@example.com
Received: 24 July 2020; Accepted: 30 September 2020
Abstract: COVID-19 is a pandemic that has affected nearly every country in the world. At present, sustainable development in the area of public health is considered vital to securing a promising and prosperous future for humans. However, widespread diseases, such as COVID-19, create numerous challenges to this goal, and some of those challenges are not yet defined. In this study, a Shallow Single-Layer Perceptron Neural Network (SSLPNN) and a Gaussian Process Regression (GPR) model were used for the classification and prediction of confirmed COVID-19 cases in five geographically distributed regions of Asia with diverse settings and environmental conditions: namely, China, South Korea, Japan, Saudi Arabia, and Pakistan. Significant environmental and non-environmental features were taken as the input dataset, and confirmed COVID-19 cases were taken as the output dataset. A correlation analysis was done to identify patterns in the cases related to fluctuations in the associated variables. The results of this study established that the population and air quality index of a region had a statistically significant influence on the cases. However, average age and the human development index had a negative influence on the cases. The proposed SSLPNN-based classification model performed well when predicting the classes of confirmed cases. During training, the binary classification model was highly accurate, with a Root Mean Square Error (RMSE) of 0.91. Likewise, the results of the regression analysis using the GPR technique with Matern 5/2 were highly accurate (RMSE = 0.95239) when predicting the number of confirmed COVID-19 cases in an area. Dynamic management occupies a core place in studies on the sustainable development of public health, but it depends on proactive strategies based on statistically verified approaches, such as Artificial Intelligence (AI).
In this study, an SSLPNN model was trained to fit public health-associated data into an appropriate class, allowing GPR to predict the number of confirmed COVID-19 cases in an area based on the given values of selected parameters. Therefore, this tool can help authorities in different ecological settings effectively manage COVID-19.
Keywords: Public health; sustainable development; artificial intelligence; SARS-CoV-2; shallow single-layer perceptron neural network; binary classification; Gaussian process regression
1 Introduction
In December 2019, an outbreak of pneumonia of unknown etiology was noticed in Wuhan City, China, which later spread across the globe. In January 2020, the cause of this pneumonia-like disease was confirmed to be a novel coronavirus known as SARS-CoV-2. This virus belongs to Coronaviridae, a large family of enveloped single-stranded RNA viruses. Coronaviruses are well known to cause a variety of diseases, from the common cold to significant epidemics, like severe acute respiratory syndrome (SARS) and Middle East respiratory syndrome (MERS) [1–3].
In March 2020, the World Health Organization (WHO) classified COVID-19 as a pandemic that could threaten millions of people all over the world. Ever since, the number of confirmed cases has increased, partially because this new viral disease is highly contagious during the incubation period. Asymptomatic individuals infected with COVID-19 can spread the disease throughout their communities. Asymptomatic carriers also play a significant role in related viral infections, such as those caused by rhinovirus and the influenza virus [6,7]. In addition, there is no antiviral drug or vaccine available against this virus. Therefore, molecular testing is the most reliable diagnostic test for COVID-19.
COVID-19 poses a significant challenge for governments. Though stakeholders have dedicated many resources to fight it, the epidemic has nevertheless caused a social and economic crisis in both developed and developing countries. During the present crisis, it is important to understand how to maintain sustainability practices with limited resources so that long-term public health outcomes can still be achieved. Sustainable development depends on the cooperation of stakeholders across social, ecological, cultural, and political domains. The current challenges of COVID-19 have caused mortality and morbidity on a massive scale, directly or indirectly influencing all these domains. After the emergency declaration from the WHO, all trade and travel were banned, which led to social unrest and devastating economic consequences. In the past, the Ebola, influenza, SARS, and H1N1 epidemics caused almost US $10 billion in losses. The current crisis is similar in nature to what occurred during the SARS epidemic and may have worse consequences; if the spread of the epidemic continues as it has, the worldwide losses are projected to exceed US $150 billion.
As the situation worsens, relevant tools based on artificial intelligence (AI) need to be studied; a machine learning process uses big data for pattern recognition, explanation, and prediction based on input data [9,10]. Therefore, AI has the potential to support the design of tools to fight COVID-19. In this study, we utilized SSLPNN and GPR to predict the classes to which specific case studies belonged and the number of confirmed COVID-19 cases in specific geographical areas. Climatic and socio-economic conditions have a strong relationship with the incidence and spread of infectious diseases [11,12]. This analysis can therefore help in designing public health policies that support sustainable development.
2 Materials and Methods
This study was designed to predict the number of COVID-19 cases based on environmental and non-environmental factors. We used two different approaches. First, we analyzed the correlations between the confirmed cases (from February 1, 2020 to April 20, 2020) and several environmental factors (temperature, humidity, wind speed, ultraviolet (UV) index, elevation, air quality index, and pollution level) and non-environmental factors (population, population density, gender ratio, average age, and human development index). Second, we built a binary classification model to predict and classify COVID-19 cases using an SSLPNN algorithm based on critical factors related to sustainable development in the area of public health. These factors were divided into two modules: the first was the non-environmental module, and the second was the environmental module. Both modules were used as the inputs, with the number of confirmed COVID-19 cases designated as the output. The study design is presented in Fig. 1.
2.1 Conditions for Analysis
In the analysis, specific conditions were applied. These conditions included the following:
• In addition to the number of COVID-19 cases, 14 different environmental and non-environmental variables were used, including temperature (minimum, maximum, and average), humidity, wind speed, air quality index, UV index, pollution level, population, population density, gender ratio, average age, and human development index levels.
• To enhance the precision of the estimates and to reduce bias, different countries were considered due to their different topographical, monetary, and ecological situations.
• The environmental data used in this study was based on the capital cities of the selected regions, as these regions generally had larger populaces.
• The non-environmental data used in this study was also taken from the regions of the respective countries.
• The analysis period was from February 1, 2020 to April 20, 2020.
• Different countries of the Asian continent were selected for this study.
2.2 Data Collection
In this study, data was collected from the various official and independent websites of the selected countries, which were China, South Korea, Japan, Pakistan, and Saudi Arabia [13–23]. These countries were selected due to their diverse climatic conditions. The details of this dataset are provided in the supplementary file.
2.3 Correlation Analysis
A correlation analysis was performed to examine the relationships between the total confirmed COVID-19 cases in the 54 provinces of the five countries included in this study and the 14 environmental and non-environmental variables.
2.4 Spearman’s Rank Correlation
Before building the model, it was necessary to evaluate the correlations among the independent variables. For this purpose, a Spearman’s correlation analysis was performed on the non-parametric dataset. In a non-parametric dataset, the population data usually does not have a normal distribution and is randomly distributed vertically and horizontally. For this test, the selected parameters (environmental and non-environmental) and the total number of confirmed COVID-19 cases were included. The test reveals the association between two variables without assuming any particular distribution of the data and is recommended for data measured on at least an ordinal scale. The relationships among the non-parametric variables are represented by a parallel plot in Fig. 2. The mathematical formulation of Spearman’s rank correlation can be represented by the following equation:
$$r_s = 1 - \frac{6\sum_{i=1}^{m} s_i^{2}}{m\left(m^{2} - 1\right)}$$
where
$r_s$ = Spearman’s rank correlation
$s_i$ = the difference between the ranks of corresponding parameters
$m$ = the number of values
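The rank-based computation can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names and the sample values are hypothetical, and tied values are assumed to receive the average of their ranks, in which case the coefficient is computed as the Pearson correlation of the ranks (this coincides with the rank-difference formula when there are no ties).

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rank correlation as the Pearson correlation of the ranks.

    Assumes both inputs have the same length and are not constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


def spearman_no_ties(x, y):
    """Rank-difference formula 1 - 6*sum(s_i^2) / (m*(m^2 - 1)); valid without ties."""
    rx, ry = average_ranks(x), average_ranks(y)
    m = len(x)
    return 1 - 6 * sum((a - b) ** 2 for a, b in zip(rx, ry)) / (m * (m * m - 1))


# Hypothetical values for five regions (illustration only, not study data).
air_quality_index = [40, 55, 90, 120, 150]
confirmed_cases = [300, 120, 900, 700, 2500]
print(spearman(air_quality_index, confirmed_cases))
```

Using the Pearson-of-ranks form keeps the result well defined when several regions share the same value for a variable, which the shortcut formula does not handle exactly.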
2.5 Shallow Single-Layer Perceptron Neural Networks (SSLPNN)
AI embraces a wide variety of approaches and algorithms based on machine intelligence. It has numerous applications across many areas of science, encompassing fuzzy logic theory, machine learning techniques, risk valuations and hazard detection, meta-heuristic algorithms and classification, and clustering techniques. SSLPNN is a type of neural network that has a small number of hidden layers and can be used for pattern recognition. Studies have shown that a shallow network can fit any function in the identification of patterns and prediction of problems, and it is also considered a less complex artificial neural network. Though the use of deep learning is rapidly increasing in different fields of science, SSLPNN is still widely used in regression problems.
2.6 Mathematical Modeling of the Shallow Single-Layer Neural Network
In this study, we used a neural network architecture with 2,352 inputs for each selected parameter, one output neuron with a linear output function, and a single layer. Through forward propagation, the network calculated the dot product between the nth sample x(n) and the weight vector w and then added the bias b. This calculation produced the weighted sum of the inputs with bias correction, to which the activation function was applied:
$$\hat{y}^{(n)} = g\left(\mathbf{w}^{\top}\mathbf{x}^{(n)} + b\right)$$
where
x(n) = nth input sample
w = weight vector
b = bias
g = activation function
ŷ(n) = network output
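As an illustration of the forward propagation step described here, the following Python sketch computes the weighted sum of one sample plus the bias and applies the activation function. The function name and the numeric values are hypothetical; the identity activation is used as the default because the text specifies a linear output function.

```python
def forward(x, w, b, g=lambda z: z):
    """Single-neuron forward pass: g(w . x + b).

    x: input sample (list of floats)
    w: weight vector of the same length as x
    b: scalar bias
    g: activation function (identity by default, i.e. a linear output)
    """
    if len(x) != len(w):
        raise ValueError("x and w must have the same length")
    z = sum(wi * xi for wi, xi in zip(w, x)) + b  # weighted sum plus bias
    return g(z)


# Hypothetical two-feature sample (illustration only).
print(forward([1.0, 2.0], [0.5, -0.25], 0.1))
```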
2.7 Objective Function
The mean square error function assesses the credibility of the algorithm on a distinct trial by comparing the network output ŷ(n) with the target y(n), where y(n) is 2 if the nth trial fits category 2, 1 if the nth trial fits category 1, and 0 if the nth trial fits category 0. A cost function with L2 regularization of the weights is used to assess the global performance of the classifier. The regularization term is added to the cost function to penalize large weights and to reduce the search space, shrinking uninformative weights toward zero and thus delivering more straightforward representations. In the regularization term,
λ = regularization parameter
‖w‖2 = L2 norm of the weight vector
For large values of λ, the regularization is strong, increasing the penalty associated with the weights. Consequently, the weights that do not help to lessen the Mean Square Error (MSE) decrease to zero. However, for small values of λ, the regularization effect is weak. Here, the regression results are converted into class labels by using a Heaviside step function to deliver a numerical measure of the network performance.
Of note, the accuracy is computed as if the task were a classification problem.
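The objective described in this subsection can be sketched in Python as follows. This is an illustrative sketch rather than the authors' code, and several details are assumptions: the cost is taken as the mean of the squared errors plus the regularization parameter times the squared L2 norm of the weights, with no additional scaling factor, and the Heaviside step is applied at a threshold of 0.5 to map a continuous output onto the binary labels 0 and 1. All function names are hypothetical.

```python
def mse(outputs, targets):
    """Mean of squared differences between network outputs and targets."""
    return sum((o - t) ** 2 for o, t in zip(outputs, targets)) / len(targets)


def regularized_cost(outputs, targets, w, lam):
    """MSE plus an L2 penalty lam * ||w||^2 (assumed form of the cost)."""
    return mse(outputs, targets) + lam * sum(wi * wi for wi in w)


def heaviside(z, threshold=0.5):
    """Step function: 1 at or above the threshold, otherwise 0.

    The threshold of 0.5 is an assumption made for this sketch.
    """
    return 1 if z >= threshold else 0


def accuracy(outputs, targets, threshold=0.5):
    """Fraction of trials whose thresholded output equals the class label."""
    hits = sum(heaviside(o, threshold) == t for o, t in zip(outputs, targets))
    return hits / len(targets)


# Hypothetical outputs and binary labels (illustration only).
outputs = [0.9, 0.2, 0.6, 0.4]
labels = [1, 0, 1, 1]
print(regularized_cost(outputs, labels, w=[1.0, -2.0], lam=0.01))
print(accuracy(outputs, labels))
```

Increasing `lam` makes the penalty dominate the cost, which pushes the optimizer toward smaller weights; with `lam` near zero the cost reduces to the plain MSE.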
2.8 Parameter Optimization
The cost function computes the errors in the current predictions, so the learning problem is equivalent to minimizing the cost function. Because the training samples are fixed, the cost function depends only on the network parameters (the weights and bias). Thus, minimizing the cost function is also equivalent to optimizing the network parameters. The whole process proceeds as follows.
The objective function to be minimized is the cost function Kn(θ), where n denotes the nth epoch, θ is a label for w and b, and gn represents the gradient of the cost function with respect to θ.
This gradient is then used to update two exponential moving averages: mn, of the gradient, and vn, of the squared gradient.
The two hyper-parameters β1, β2 ∈ [0, 1) regulate the exponential decay rates of these moving averages.
Finally, the network parameters are updated by a gradient descent step computed from mn and vn.
A small constant ε in the denominator of this step ensures that the denominator is always non-zero, avoiding numerical difficulties.
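The update described here, with moving averages of the gradient and of the squared gradient, decay rates β1 and β2, and a small constant in the denominator, matches the general shape of the Adam optimizer. The Python sketch below shows one such step under that assumption. The bias-correction terms, the step size `alpha`, and all default values are the commonly used Adam settings and are not stated in the text; the function name is hypothetical.

```python
import math


def adam_step(theta, grad, m, v, n, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam-style update for a list of parameters.

    theta: current parameters (weights and bias flattened into one list)
    grad:  gradient of the cost with respect to each parameter
    m, v:  moving averages of the gradient and of the squared gradient
    n:     1-based step counter, used for bias correction (assumed here)
    Returns the updated (theta, m, v).
    """
    new_theta, new_m, new_v = [], [], []
    for t, g, m_i, v_i in zip(theta, grad, m, v):
        m_i = beta1 * m_i + (1 - beta1) * g        # moving average of gradient
        v_i = beta2 * v_i + (1 - beta2) * g * g    # moving average of squared gradient
        m_hat = m_i / (1 - beta1 ** n)             # bias-corrected estimates
        v_hat = v_i / (1 - beta2 ** n)
        # eps keeps the denominator away from zero.
        new_theta.append(t - alpha * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_theta, new_m, new_v


# Hypothetical single-parameter example (illustration only).
theta, m, v = adam_step([1.0], [2.0], [0.0], [0.0], n=1)
print(theta, m, v)
```

On the first step the bias-corrected ratio equals the sign of the gradient, so each parameter moves by roughly `alpha` regardless of the gradient's magnitude.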
The SSLPNN algorithm can forecast the value of an estimated function for every input vector. Fig. 3 shows the flow of the input and output variables in the SSLPNN algorithm. The inputs are brought into the input layer. Then, the evaluation and optimization procedures are conducted, ending when the algorithm obtains better results. The SSLPNN algorithm can be used as a powerful tool to deal with unexpected and ill-defined problems. Thus, in the present study, a binary classification analysis was done by the SSLPNN algorithm.
To refine the precision of the model and to reduce the learning errors so as to obtain optimized outcomes, different models were created by trial and error to find the appropriate number of layers and neurons for each layer. The input variables were the previously mentioned 14 notable factors: namely, the population, population density, gender ratio, average age, human development index, elevation, temperatures (maximum, minimum, and average), relative humidity, wind speed, air quality index, pollution level, and UV index of each region. The number of confirmed COVID-19 cases was used as the output dataset. Two classes (labels) were assigned to the number of confirmed cases. Specifically, case counts of 800 or fewer were labeled as “0,” and case counts above 800 were labeled as “1.” Cases from five countries were included in the study. For modeling, 70% of the cases were used for training, 15% were used for validation, and 15% were reserved for testing.
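The labeling rule and the 70/15/15 partition can be sketched in Python as follows. This is a minimal illustration, not the authors' procedure: the function names are hypothetical, and the split is assumed to be a seeded random shuffle of sample indices, since the text does not state how samples were assigned to each subset.

```python
import random


def label_cases(confirmed, threshold=800):
    """Label 0 for counts at or below the threshold, 1 for counts above it."""
    return [1 if c > threshold else 0 for c in confirmed]


def split_indices(n, train=0.70, val=0.15, seed=0):
    """Shuffle indices 0..n-1 and cut them into train/validation/test parts.

    The test part receives whatever remains after train and validation.
    """
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # seeded for reproducibility
    n_train = round(train * n)
    n_val = round(val * n)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]


# Hypothetical case counts (illustration only).
print(label_cases([0, 800, 801, 5000]))
train_idx, val_idx, test_idx = split_indices(100)
print(len(train_idx), len(val_idx), len(test_idx))
```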
2.9 Regression Learning through a Gaussian Process Regression (GPR) with the Matern 5/2 Preset
Determining the regulating parameters of an algorithm is important, as it aids the quick convergence of the algorithm. There were no explicit associations among most of the parameters in this study. Thus, these parameters were considered independent and were identified with the assistance of recent studies, experts, and trial-and-error methods. It is also important to identify the relationships between parameters through regression analysis, which supports predictions with the least learning error, as measured by the Root Mean Square Error (RMSE). Therefore, the process of selection was dimensionless and influenced the sensitivity of the modeling error. The RMSE was used as the measure of accuracy for the regression learner model built with the Matern 5/2 GPR preset.
When the dimensionality of the data is high, parameter identification becomes essential for learning algorithms, as high-dimensional data tends to degrade the efficacy of most learning algorithms. Parameter identification is an effective dimensionality reduction procedure that chooses an ideal subset of the original parameters, delivering strong predictive power when modeling the data. These features can then be used to separate samples into different classes. In this study, the Principal Component Analysis (PCA) procedure was used to select the optimal parameters.
In regression analysis, a GPR algorithm with variable models can adapt to numerous types of pattern recognition data for prediction through classification. The excellent experimental results demonstrate that GPR models provide a very promising feature selection solution to numerous pattern recognition problems through PCA. The algorithm can acquire patterns from the global distribution, therefore improving the precision of its pattern recognition capabilities.
GPR models are non-parametric, kernel-based probabilistic models in which any finite collection of random variables has a multivariate normal distribution, so that every linear combination of them is normally distributed. The Gaussian process is named after Carl Friedrich Gauss, as it can be regarded as an infinite-dimensional generalization of the multivariate normal distribution. In this study, a Gaussian process was used in the statistical modeling, regression to multiple target values, and analyses of mapping in higher dimensions. In addition, a GPR model with the Matern 5/2 GPR preset was used to plot the behavior of the algorithm; calculate the RMSE, R-Squared Value, MSE, Mean Absolute Error (MAE), prediction speed, and training time; and analyze the results of the GPR to see the similarities and differences in the data. The Matern 5/2 kernel does not suffer from concentration-of-measure problems in high-dimensional spaces. In its standard form, the Matern 5/2 kernel is given by:
$$k(x_i, x_j) = \sigma_f^{2}\left(1 + \frac{\sqrt{5}\,r}{\sigma_l} + \frac{5r^{2}}{3\sigma_l^{2}}\right)\exp\left(-\frac{\sqrt{5}\,r}{\sigma_l}\right)$$
where $r = \lVert x_i - x_j \rVert$ is the Euclidean distance between two input vectors, $\sigma_f$ is the signal standard deviation, and $\sigma_l$ is the characteristic length scale.
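For illustration, the Matern 5/2 covariance between two input vectors can be evaluated with a few lines of Python. The sketch assumes the standard parameterization with a signal standard deviation and a single characteristic length scale; the function and parameter names (`matern52`, `sigma_f`, `sigma_l`) and their default values are hypothetical and are not taken from the study.

```python
import math


def matern52(x1, x2, sigma_f=1.0, sigma_l=1.0):
    """Matern 5/2 covariance between two points.

    k = sigma_f^2 * (1 + s + s^2 / 3) * exp(-s), with s = sqrt(5) * r / sigma_l
    and r the Euclidean distance between x1 and x2.
    """
    r = math.dist(x1, x2)              # Euclidean distance (Python 3.8+)
    s = math.sqrt(5.0) * r / sigma_l
    return sigma_f ** 2 * (1.0 + s + s * s / 3.0) * math.exp(-s)


# Covariance falls from sigma_f^2 at zero distance toward 0 as points separate.
print(matern52([0.0, 0.0], [0.0, 0.0]))
print(matern52([0.0, 0.0], [1.0, 0.0]))
print(matern52([0.0, 0.0], [3.0, 4.0]))
```

A larger `sigma_l` makes the covariance decay more slowly with distance, so predictions at a new point borrow information from training points that are farther away.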
3 Results
3.1 Relationships between Environmental and Non-Environmental Parameters and COVID-19 Cases
The number of cases showed a significant correlation with the population and air quality index of a region. A statistically significant inverse relationship was observed between the number of cases and the average age and human development index levels. The results of the correlation analysis are presented in Tab. 1.
3.2 Independent Association of Environmental and Non-Environmental Parameters
Furthermore, all included variables were analyzed to assess their correlations. An independent association was observed between each of the parameters. The results of this analysis are presented in Tab. 2. The correlation coefficient (R) indicated that the relationships were either negatively correlated or positively correlated among the independent parameters. Thus, the obtained results showed that SSLPNN could be used to further develop a pattern recognition model based on the selected parameters.
3.3 Results of the Pattern Recognition Model for Binary Classification Using SSLPNN
Before applying the binary classification through the pattern recognition model using the SSLPNN algorithm, a correlation analysis was conducted for the 54 case studies in the five countries, which included China, South Korea, Japan, Saudi Arabia, and Pakistan. This analysis showed a reasonable correlation coefficient (R) among the non-parametric variables (Tab. 3). Thus, it was decided that the 54 case studies could be evaluated in a cluster with the binary classification through the pattern recognition model. The results of our analysis indicate that the SSLPNN algorithm performed excellently, predicting the classes of the number of COVID-19 cases with an accuracy of 99.09% during training and an accuracy of 99.04% during testing, as shown in Tab. 3. These results demonstrated the high accuracy of the system, as presented in Fig. 4. The MSE for testing was almost 0 (MSE testing 9.11804e−01).
3.4 Prediction of COVID-19 Cases by Regression Analysis
The results of the regression analysis using the GPR technique with Matern 5/2 were reasonably accurate, with an RMSE of 0.95239 in the prediction of confirmed COVID-19 cases. The PCA technique was used for the removal of noise and redundant parameters in order to reduce the dimensionality of the dataset. The information and results for the models are presented in Tab. 4. In addition, the response plot and the plot of predicted vs. actual values for the whole scenario are provided in Fig. 5. These results were promising: with the lowest RMSE value (0.952) and an R-value of 1, the model predicted the values more accurately than all other competing models, as presented in Tab. 4. The overall training time required for this model was 134.04 sec, and the prediction speed was 11,000 obs/sec.
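The error measures reported for the regression models (RMSE, MSE, MAE, and R-squared) follow standard definitions, which the Python sketch below computes for a pair of actual and predicted series. The function name and the sample values are hypothetical and do not reproduce the figures in Tab. 4.

```python
import math


def regression_metrics(actual, predicted):
    """Return RMSE, MSE, MAE and R-squared for two equal-length sequences.

    Assumes the actual values are not all identical (otherwise R-squared
    is undefined because the total sum of squares is zero).
    """
    n = len(actual)
    errors = [p - a for a, p in zip(actual, predicted)]
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    mse = ss_res / n
    return {
        "rmse": math.sqrt(mse),
        "mse": mse,
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1 - ss_res / ss_tot,
    }


# Hypothetical values (illustration only).
print(regression_metrics([1, 2, 3, 4], [1, 2, 3, 6]))
```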
Finally, the predicted number of COVID-19 cases was compared with the actual observed cases; the results were close. The overall observed cases were 1,271.00, and our model predicted 1,118.2 with an 87.96% accuracy. The results are presented in Tab. 5.
4 Discussion
This paper examined the relationship between COVID-19 cases and different environmental, ecological, and socio-economic factors and established a model system based on these variables to classify and predict rates of infection. COVID-19 has created a panic among the public. Scientific approaches must be identified and developed to predict the impact of these factors and to help policymakers take appropriate actions in the future.
Weather conditions, such as temperature, humidity, wind speed, and air quality, can affect the viability of viruses. Studies suggest that temperature and humidity have a strong influence on the transmission of COVID-19; researchers have also found that temperature and humidity may affect COVID-19 mortality. In this study, we reported a statistically significant positive correlation between pollution, air quality index, and the number of positive COVID-19 cases in an area. Poor air quality is associated with the incidence of many diseases, such as asthma, bronchitis, lung and heart diseases, and many respiratory allergies. China, where the epidemic started, is also severely affected by air pollution, indicating a relationship between poor air quality and COVID-19.
The results of this study indicate that population density and human development index levels can also be associated with the number of COVID-19 cases in an area. Socio-economic factors like population size and low human development index levels are significant drivers of emerging infectious diseases and their subsequent effects on public health [30,31]. According to the Spearman’s correlation coefficient, both direct and inverse relationships exist among the independent parameters of different case studies, due to the policies and restrictions adopted in different countries. This finding supports the predictive power of our study, indicating we may be able to generalize it for other countries and extend its scope.
Despite significant advancements in medical science, infectious diseases are a leading cause of mortality. For a novel disease like COVID-19 that does not have any standard guidelines for treatment and vaccination, the short-term response from medical science will be limited. However, we can utilize mathematical tools to better understand and forecast the impacts of such diseases. In the last few years, AI has been widely adopted to better understand infectious diseases and to predict epidemics.
Studies have reported on the use of neural networks to predict the outbreaks of many diseases, such as foot and mouth disease, influenza, epidemic diarrhea, Ebola virus, Rift Valley fever virus, Nipah virus, and SARS [34–36]. A recent report utilized neural network models to identify the risk of COVID-19 cases in a specific country based on weather conditions, and promising results were reported.
In this paper, we used the SSLPNN algorithm, which performed excellently, predicting the classes of COVID-19 cases for both the training and testing datasets with an accuracy of 99.09% and 99.04%, respectively.
The results of the binary classification modeling using SSLPNN with Scaled Conjugate Gradient Backpropagation (SCGB) showed high accuracy, with an MSE of 0.0114858 in five selected countries. Moreover, the results of the regression analysis using the GPR technique with Matern 5/2 for 54 case studies in five countries also showed high accuracy in the prediction of COVID-19 confirmed cases, with an RMSE of 0.952. This study established some previously unexplored patterns in the relationships between COVID-19 infections and the environmental and non-environmental conditions of select countries. Based on this analysis, we propose that both SCGB and GPR may be applicable to classifying and predicting patterns of COVID-19 cases. The results show that AI techniques can provide reasonable estimates about upcoming events based on specific inputs by learning the hidden structures of a scenario [38,39]. These outcomes can support governments in policy-making decisions, particularly those regarding public health, to ensure a sustainable development process. Our comparative analysis of daily weather parameters and trends of confirmed cases also demonstrates the role of these variables in the rate of COVID-19 cases.
Our findings are consistent with previous studies into the effects of climatic conditions on epidemic diseases and public health [40,41]. A recent analysis of confirmed COVID-19 cases through a binary classification using artificial intelligence and regression analysis also showed the impact of weather conditions in the COVID-19 epidemic. Overall, machine learning is an innovative technique that is helpful when predicting upcoming trends in COVID-19 cases in relation to specific ecological and socio-economic factors.
Funding Statement: The authors received no specific funding for this study.
Conflict of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.