The coronavirus disease 2019 (COVID-19) has infected more than 50 million people in more than 100 countries, resulting in a major global impact. Many studies on the potential roles of environmental factors in the transmission of COVID-19 have been published. However, the impact of environmental factors on COVID-19 remains controversial. Machine learning techniques have been used effectively in combating the COVID-19 epidemic. However, research applying machine learning to the role of weather conditions in spreading COVID-19 is generally lacking. Therefore, in this study, three machine learning models (Convolution Neural Network (CNN), ADtree Classifier and BayesNet) based on the confirmed cases and weather variables such as temperature, humidity, wind and precipitation are developed. This study aims to identify the best classification model to classify COVID-19 by using significant weather features chosen by the Principal Component Analysis (PCA) feature selection method. The DS4C COVID-19 dataset is used to train and validate each machine learning model. Several data pre-processing tasks such as data cleaning and feature selection have been conducted on the raw dataset to ensure the quality of the training data. The performance of these machine learning algorithms is further refined based on the feature set selected by PCA. Each classifier is then optimized using different tuning parameters before the outputs of the three classifiers are compared against each other. The experimental results show that the optimized CNN classifier with seven weather variables selected by PCA achieved the highest performance among all the techniques. The results also show that the weather variables are more relevant in predicting the confirmed cases than the other variables. Thus, the results indicate that temperature, humidity, wind and precipitation are important features for predicting COVID-19 confirmed cases.
Machine learning is the most dominant way of processing massive data accurately and rapidly. Machine learning is a data analysis tool that automates the creation of analysis models. It is an artificial intelligence branch focused on the concept that systems can learn from data, detect patterns, and make choices with minimal human intervention [
The novel coronavirus disease (COVID-19) was first found in Wuhan, China, in December 2019, and has since rapidly spread across the world. The world has been forced to handle the outbreak. Big data has been generated from the global outbreak statistics provided by the World Health Organization (WHO). Large-scale case studies have thoroughly demonstrated the clinical features of patients with pneumonia symptoms [
To address the problem posed in this research, the DS4C dataset was used for experimental analysis. This dataset, provided by the Korea Centers for Disease Control and Prevention (KCDC), is better suited to this analysis than other openly available datasets. Training and testing samples of reasonable size are specified in this dataset. The experimental analysis was carried out on the entire dataset, and not on randomly selected samples. The advantages of the dataset over the other available datasets are as follows. The training group does not include duplicate records, which improves the results. The dataset is updated daily, and the parameters selected for the dataset are useful for the comparison-based classification method. A comparative verification of results is possible due to the availability of training and testing samples of reasonable size. The dataset does not contain outliers, which means the data available for the experiment are clean and have been verified.
A growing volume of datasets has been studied and analyzed using machine learning computation techniques such as data cleansing, pre-processing, feature extraction, and classification approaches. A precise experimental model was developed for the analysis based on deep learning-based multilayer perceptron, BayesNet and random tree classification approaches.
The remainder of the paper is structured into four sections. The related works are described in Section 2. The methodology is illustrated in Section 3. The experimental analysis of the proposed objective is explained in Section 4. The concluding remarks are discussed in Section 5.
In this technological era, computational technology plays an important role in analyzing and classifying disease deterrence in the health industry. The world is currently facing a pandemic and is forced to prevent the spread of COVID-19. Researchers have highlighted the possible applications of computational technologies like machine learning, artificial intelligence, deep learning, and the Internet of Things to develop strategies for monitoring, detecting, and preventing COVID-19. The use of computational technology has had a significant impact on health care monitoring and disease deterrence [
Sethi and Mittal used machine learning techniques to track the impact of the COVID-19 lockdowns on numerous air contaminants and to identify those associated with COVID-19 fatalities, in order to inform emission control initiatives. They found that ozone and toluene levels rose during the lockdowns. They also deduced that the pollutants that may affect COVID-19-related mortality include ozone, NH3, NO2 and PM10. The lockdown also led to environmental restoration [
A large number of scientific works that analyze COVID-19 and other medical fields have recently been published. Many researchers focus on analysis, classification and prediction using machine learning algorithms. Some of the prediction-based studies used texture- and numerical-based observations, while others utilized observations as well as image-based results [
Furthermore, a machine learning logical model was used for the analysis of COVID-19 transmission using various databases to understand the outbreak inside and outside China. Dey et al. [
Various studies have been published about the potential factors in the transmission of COVID-19. However, few researchers have examined environmental conditions as important factors that could affect the spread of COVID-19 [ Data pre-processing was carried out to handle outliers and missing values in the dataset. Feature selection and extraction were performed using PCA with the Ranker search method. CNN, ADtree and BayesNet classification models were trained and tested on the weather variables of the COVID-19 dataset. A comparative analysis was performed to identify which of the CNN, ADtree and BayesNet models classifies best on the weather variables of the COVID-19 dataset.
This section describes the dataset, data cleansing, pre-processing, feature extraction and classification methods used. The WEKA machine learning tool was used for data cleansing, pre-processing, feature extraction and classification of the DS4C COVID-19 dataset.
The flow chart of the step-by-step implementation of the research using machine learning algorithms is shown in
The research experiment involves the design and development of classification models for the DS4C COVID-19 dataset. The dataset was retrieved from the official repository of the Korea Centers for Disease Control and Prevention (KCDC). This database consists of daily case reports for 3,520 patients. The dataset summary is in .CSV format with 12 tables for
Feature No. | Selected Feature Name | Description of Features |
---|---|---|
1 | Code | The code of the region |
2 | Province | Region of country or empire. |
3 | Date | Date of test |
4 | Ave_temp | Average of temperature |
5 | Min_temp | Lowest temperature |
6 | Max_temp | Highest temperature |
7 | Avg_relative_humidity | The average relative humidity |
8 | Max_wind_speed | The maximum wind speed |
9 | Most_wind_direction | The most frequent wind direction |
10 | Precipitation | The daily precipitation |
This study used machine learning to classify COVID-19 cases. The class label was assigned based on average temperature and average relative humidity. The class label used for the classification is described in
Class | Mathematical Condition | Class Label |
---|---|---|
Avg_relative_humidity | Avg_relative_humidity > 70% | 1 (COVID-19) |
Avg_relative_humidity | Avg_relative_humidity <= 70% | 0 (No COVID-19) |
Ave_temp | Ave_temp > 30°C | 0 (No COVID-19) |
Ave_temp | Ave_temp <= 30°C | 1 (COVID-19) |
There is a significant potential for spread under unfavorable environmental conditions such as the monsoon, post-monsoon, and winter seasons. Similarly, an increase in relative humidity tends to increase the growth rate, while an increase in temperature leads to a decrease in COVID-19 spread, and vice versa. The range of confirmed cases encountered when the temperature is less than 30°C and the relative humidity is higher than 70% shows that there is an environmental impact on the number of COVID-19 cases [
This study analyzed confirmed cases of COVID-19 using the classification model. Two class labels were used to classify the data: 1 for confirmed COVID-19 cases and 0 for non-COVID-19 cases. If the relative humidity is high (>70%) and the average temperature is less than or equal to 30°C, the data is labelled 1, indicating COVID-19 confirmed cases. However, if the relative humidity is less than or equal to 70% and the average temperature is greater than 30°C, they are labelled class 0 to indicate non-COVID-19 cases.
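The labelling rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the WEKA procedure used in the study: the function name `label_record` and the constant names are hypothetical, and because the text does not say how mixed conditions (for example, high humidity with a temperature above 30°C) are labelled, the sketch returns `None` for them rather than guessing.

```python
from typing import Optional

HUMIDITY_THRESHOLD = 70.0  # percent, taken from the class-label conditions
TEMP_THRESHOLD = 30.0      # degrees Celsius, taken from the class-label conditions


def label_record(avg_relative_humidity: float, ave_temp: float) -> Optional[int]:
    """Return 1 (COVID-19), 0 (no COVID-19), or None when the text gives no rule."""
    humid = avg_relative_humidity > HUMIDITY_THRESHOLD
    cool = ave_temp <= TEMP_THRESHOLD
    if humid and cool:
        return 1  # humidity > 70% and temperature <= 30 degrees C
    if not humid and not cool:
        return 0  # humidity <= 70% and temperature > 30 degrees C
    return None   # mixed case: not specified in the text (assumption)


print(label_record(80.0, 25.0))
print(label_record(60.0, 35.0))
print(label_record(80.0, 35.0))
```

Returning `None` keeps the unspecified combinations visible so they can be handled explicitly in a later step.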
The DS4C dataset was prepared in a suitable format before starting the experiments. The pre-processing included the following steps:
Data Collection (The DS4C COVID Dataset).
Data Cleansing (‘Training’ 399,021, ‘Testing’ 216,029): Missing data handling. Removing or estimating missing values in the data. Database balancing. Correcting imbalances in the target field. Removing duplicate records.
Data Pre-processing (‘Training’ 399,021, ‘Testing’ 216,029): Data entry. Converting data from ‘type’ to ‘others’ (single-valued attributes).
Development of the classification models, including the Convolution Neural Network (CNN), ADtree, and BayesNet, using 39,858 training records and 21,468 test records.
Interpretation and Analysis: Algorithm performance was measured using the accuracy metric.
A total of 399,021 DS4C COVID-19 records were analyzed. After filtering out duplicate records, the training dataset consisted of about 10% of these records (39,858), and a total of 21,468 records were given as the test set on which the machine learning models were applied. The details of the data filtering process are described in
Dataset | Before Filtering Duplicate Records | After Filtering Duplicate Records |
---|---|---|
Training Dataset | 399,021 | 39,858 |
Test Dataset | 216,029 | 21,468 |
Total | 615,050 | 61,326 |
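Duplicate filtering of the kind summarized in the table above can be sketched as follows. This is an illustrative stand-in for the WEKA filtering step, assuming that a duplicate means a record identical in every field; the function name `drop_duplicates` and the sample rows are hypothetical and are not taken from the DS4C files.

```python
from typing import Iterable, List, Tuple

Record = Tuple[str, ...]


def drop_duplicates(records: Iterable[Record]) -> List[Record]:
    """Keep the first occurrence of each record and preserve the input order."""
    seen = set()
    unique: List[Record] = []
    for record in records:
        if record not in seen:
            seen.add(record)
            unique.append(record)
    return unique


# Hypothetical rows in (code, province, date, ave_temp) form.
rows = [
    ("10000", "Seoul", "2020-03-01", "5.1"),
    ("10000", "Seoul", "2020-03-01", "5.1"),  # exact duplicate of the first row
    ("11000", "Busan", "2020-03-01", "7.4"),
]
unique_rows = drop_duplicates(rows)
print(len(rows), len(unique_rows))
```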
A detailed description of the pre-processing operations on the dataset is shown in
Feature extraction was carried out using the PCA method. PCA generates new features, each of which is a linear combination of the initial features. The first component is directed along the maximum variance of the dataset, and each subsequent component captures the maximum remaining variance, excluding the variance that has already been accounted for. Consequently, each successive component explains less variance than the one before it.
PCA is one of the most prominent feature extraction techniques in machine learning. Its sensitivity reduces the dimension and record duplication in a specific layer of a selected feature [
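To make the idea of a variance-maximizing linear combination concrete, the sketch below computes the first principal component of a small dataset with power iteration. It uses only the Python standard library and is not the WEKA PrincipalComponents implementation used in the study; the function names and the toy data are hypothetical.

```python
import math
from typing import List, Sequence


def covariance_matrix(rows: Sequence[Sequence[float]]) -> List[List[float]]:
    """Sample covariance matrix of the columns of `rows`."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def first_principal_component(cov: List[List[float]], iterations: int = 200) -> List[float]:
    """Power iteration towards the unit eigenvector of the largest eigenvalue.

    Assumes the start vector is not orthogonal to that eigenvector.
    """
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def explained_variance_ratio(cov: List[List[float]], v: List[float]) -> float:
    """Share of the total variance captured along direction `v`."""
    d = len(cov)
    along_v = sum(v[i] * cov[i][j] * v[j] for i in range(d) for j in range(d))
    total = sum(cov[i][i] for i in range(d))
    return along_v / total


# Hypothetical two-feature data lying close to the line y = x.
data = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.9)]
cov = covariance_matrix(data)
pc1 = first_principal_component(cov)
ratio = explained_variance_ratio(cov, pc1)
print(pc1, ratio)
```

Because the two toy features are almost perfectly correlated, the first component points roughly along the diagonal and captures nearly all of the variance, which is the situation in which PCA can reduce dimensionality with little loss.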
The selected feature and processed form of data classification were carried out using deep learning, BayesNet and random tree techniques.
Deep learning classification is a branch of machine learning based on neural networks and representation learning. Deep learning architectures are implemented using deep neural networks, such as the Convolution Neural Network and the Recurrent Neural Network, for large databases. These deep neural networks perform better than simple neural networks. They have multiple layers which show linear perception. The Convolution Neural Network (CNN) is one of the classification algorithms in the deep learning group. The CNN is a feed-forward artificial neural network.
Recently, the deep learning approach has been used for prediction-based research in medical sciences. Classification, prediction and recognition have been carried out using text and observation-based data. Deep learning methods have been used for the prediction and classification of COVID-19 patients. The outcome of COVID-19 research is utilized in saving thousands of lives and produces massive data that can be used for machine learning model training. Deep learning is used for accurate prediction of COVID-19 in patients. These types of diagnosis save time and are cost-effective for patients.
The author developed the COVID-Net based on the Convolution Neural Network used to diagnose COVID-19 positive cases from chest radiography images [
The model uses multiple layers which gradually extract higher-level features from the raw input [
The BayesNet classification works on the basis of Bayes theorem. This network is generated using the conditional probability of each node as a directed acyclic graph. In this technique, attributes are nominal, and no missing value parameter is used as they are replaced globally. The output of this classification algorithm can be represented by a graph [
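For reference, the two standard identities behind this description can be written as follows. They are textbook definitions rather than details of this study's WEKA configuration, and the symbols are notation introduced here: $C$ is the class, $x$ the observed attribute values, $X_i$ a node of the network, and $\mathrm{Pa}(X_i)$ the parents of $X_i$ in the directed acyclic graph.

```latex
P(C \mid x) = \frac{P(x \mid C)\, P(C)}{P(x)},
\qquad
P(X_1, \dots, X_n) = \prod_{i=1}^{n} P\bigl(X_i \mid \mathrm{Pa}(X_i)\bigr)
```

The first identity is Bayes' theorem, which turns class-conditional probabilities into a posterior over classes. The second is the factorization that a Bayesian network encodes: each node contributes one conditional probability table given its parents.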
De Freitas Barbosa et al. used the Bayes Network for COVID-19 diagnostic blood testing. The performance was evaluated using precision and specificity, which were 0.938 and 0.936, respectively. The Bayes Network proved to be the best classifier in terms of low computational cost [
The alternating decision trees (ADtree) classification technique provides a mechanism for combining weak hypotheses generated during boosting into a single interpretable representation. To implement this technique, inequality conditions that compare a single feature with a constant were generated during each boosting iteration. In the omission, the conditions for this algorithm are difficult to implement. This approach offers an exciting decision tree by applying an improved logistic algorithm. The network is generated on the basis of average value of cases [
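The way an alternating decision tree combines its boosted conditions into one prediction can be sketched as below. In the usual rule-based view, each boosting iteration adds a rule made of a precondition, an inequality test on a single feature, and two prediction values; the final score is the root value plus the contribution of every rule whose precondition holds, and the sign of the score gives the class. The thresholds and prediction values here are hypothetical, not those learned in the study.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Instance = Dict[str, float]


@dataclass
class Rule:
    """One boosting iteration: a precondition, a test, and two prediction values."""
    precondition: Callable[[Instance], bool]
    condition: Callable[[Instance], bool]
    value_if_true: float
    value_if_false: float


def adtree_score(root_value: float, rules: List[Rule], x: Instance) -> float:
    """Add up the prediction values of every rule whose precondition holds."""
    score = root_value
    for rule in rules:
        if rule.precondition(x):
            score += rule.value_if_true if rule.condition(x) else rule.value_if_false
    return score


def adtree_classify(root_value: float, rules: List[Rule], x: Instance) -> int:
    """Class 1 if the summed score is positive, otherwise class 0."""
    return 1 if adtree_score(root_value, rules, x) > 0 else 0


# Hypothetical two-rule tree; the numbers are illustrative only.
ROOT = 0.1
RULES = [
    Rule(lambda x: True, lambda x: x["Avg_relative_humidity"] > 70.0, 0.8, -0.6),
    Rule(lambda x: x["Avg_relative_humidity"] > 70.0,
         lambda x: x["Ave_temp"] <= 30.0, 0.5, -1.0),
]

print(adtree_classify(ROOT, RULES, {"Avg_relative_humidity": 80.0, "Ave_temp": 25.0}))
print(adtree_classify(ROOT, RULES, {"Avg_relative_humidity": 60.0, "Ave_temp": 35.0}))
```

Because the score is a plain sum of per-rule contributions, the influence of each condition on a prediction can be read off directly, which is what makes the representation interpretable.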
Tree-based classification is a prominent approach in medical sciences. ADtree, a decision tree, achieves higher accuracy than NbTree, Random Forest and REPTree. The researcher used the prediction and classification approach of the ADtree to identify COVID-19 positive cases in various structured datasets. This classifier approach is superior in terms of speed and accuracy [
This research experimented with three classifiers based on the deep learning-based Convolution Neural Network, BayesNet and ADtree methodologies. The performance evaluation was carried out based on accuracy [
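Accuracy, the metric used throughout the evaluation, is the share of instances whose predicted label matches the true label. A minimal sketch follows; the function name and the sample labels are hypothetical, and the result is expressed as a percentage to match the tables.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Percentage of predictions that equal the true labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label sequences must be non-empty and of equal length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return 100.0 * correct / len(y_true)


print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # 3 of 4 correct -> 75.0
```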
This research involves the experimental analysis of the DS4C COVID-19 dataset: data pre-processing and cleansing, feature extraction, and classification. The data pre-processing was carried out using numeric cleaning, estimation of missing and repeated values, and missing value replacement to eliminate the outliers in the data. The experiment was run on data in .csv form. Feature extraction was carried out using principal component analysis (PCA) to reduce dimensionality and select features. For the feature extraction, PCA with the Ranker search method was used. PCA was employed to identify significant features from the dataset before running the classification models. A total of 61,326 instances were searched using 10 features. The selected features cover 0.95 of the variance. The ranked features using PCA techniques are described in
Features | Ranked Coefficients | Rank |
---|---|---|
Code | 0.0155 | 8 |
Province | –0.0719 | 9 |
Date | –0.1145 | 10 |
Ave_temp | 0.0952 | 5 |
Avg_relative_humidity | 0.9868 | 3 |
Min_temp | 0.9870 | 2 |
Max_temp | 0.9892 | 1 |
Max_wind_speed | 0.0967 | 4 |
Most_wind_direction | 0.0899 | 6 |
Precipitation | 0.0887 | 7 |
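The ranks in the table above follow from sorting the features by their ranked coefficients in descending order, which can be checked with a short script. The coefficients are copied from the table; keeping the top seven features is an assumption based on the seven weather variables reported in the abstract, and the variable names are hypothetical.

```python
# Ranked coefficients copied from the table above.
coefficients = {
    "Code": 0.0155,
    "Province": -0.0719,
    "Date": -0.1145,
    "Ave_temp": 0.0952,
    "Avg_relative_humidity": 0.9868,
    "Min_temp": 0.9870,
    "Max_temp": 0.9892,
    "Max_wind_speed": 0.0967,
    "Most_wind_direction": 0.0899,
    "Precipitation": 0.0887,
}
TOP_K = 7  # assumption: seven retained features, as stated in the abstract

# Rank 1 is the largest coefficient.
ranked = sorted(coefficients, key=coefficients.get, reverse=True)
selected = ranked[:TOP_K]

for rank, name in enumerate(ranked, start=1):
    print(rank, name, coefficients[name])
```

Under this assumption the retained features are exactly the temperature, humidity, wind and precipitation variables.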
According to the PCA, the significant features were
After identifying the significant features, three different classification models were evaluated: Convolution Neural Network (CNN), ADtree Classifier and BayesNet. The performance of these models was calculated using the accuracy metric.
A parameter-tuning experiment was conducted to determine the best parameters of several available options [
A CNN architecture is formed by a stack of distinct layers that transform the input into output through a differentiable function. A few different layers are commonly used, including the convolution layer, sub-sampling layer, dense layer, and output layer.
Type of Layer | Layer Description |
---|---|
CNN 3-Layer | Convolution Layer, Sub-sampling Layer, Output Layer |
CNN 4-Layer | Convolution Layer, Sub-sampling Layer, Dense Layer, Output Layer |
CNN 5-Layer | Convolution Layer, Sub-sampling Layer, Convolution Layer, Sub-sampling Layer, Output Layer |
CNN 6-Layer | Convolution Layer, Sub-sampling Layer, Convolution Layer, Sub-sampling Layer, Dense Layer, Output Layer |
CNN 7-Layer | Convolution Layer, Sub-sampling Layer, Convolution Layer, Sub-sampling Layer, Dense Layer, Dense Layer, Output Layer |
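What the first two layer types in these stacks compute can be illustrated with a one-dimensional example. The sketch below applies a convolution followed by a sub-sampling step to a vector of seven feature values. It rests on several assumptions that the text does not confirm: the input is treated as a 1-D sequence, sub-sampling is implemented as non-overlapping max pooling, and the kernel weights and feature values are hypothetical.

```python
from typing import List, Sequence


def conv1d(signal: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """Valid-mode sliding dot product, as used in CNN convolution layers."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def max_pool1d(values: Sequence[float], size: int = 2) -> List[float]:
    """Sub-sampling: maximum of each non-overlapping window; a short tail is dropped."""
    return [max(values[i:i + size]) for i in range(0, len(values) - size + 1, size)]


# Seven hypothetical, already-scaled weather feature values.
features = [0.2, -0.1, 0.5, 0.7, 0.1, 0.3, 0.0]
kernel = [1.0, 0.0, -1.0]  # hypothetical weights; in training these are learned

feature_map = conv1d(features, kernel)  # length 7 - 3 + 1 = 5
pooled = max_pool1d(feature_map)        # length 2
print(len(feature_map), len(pooled))
```

Stacking a second convolution and sub-sampling pair, as in the 5-layer configuration, repeats the same two operations on the pooled output before the output layer.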
The performance of CNN is calculated based on the type of layer. The comparative parameter setup based on CNN layer is described in
Metric | CNN 3-Layer | CNN 4-Layer | CNN 5-Layer | CNN 6-Layer | CNN 7-Layer |
---|---|---|---|---|---|
Accuracy (%) | 92.48 | 98.52 | 99.56 | 98.56 | 98.6 |
From the above table, it is observed that the 5-layer CNN has the highest accuracy (99.56%), while the lowest accuracy is obtained with the 3-layer CNN. Choosing a suitable number of layers for the dataset increases the accuracy. This is because the CNN architecture allows more training inside the hidden layers, including the convolutional, dense and sub-sampling layers. In conclusion, the results indicate that choosing the right CNN layer configuration during parameter tuning gives the best result.
The performance of a classifier is affected by its tuning parameters. The classifier was tested using different tuning parameters. The ADtree classifier was tested for 10, 15, 20 and 25 boosting iterations. The
Iterations of Boosting | Expand all paths (Default) | Expand the heaviest path | Expand the best z-pure path | Expand a random path |
---|---|---|---|---|
10 | 96.71 | 96.78 | 96.65 | 97.25 |
15 | 96.16 | 96.24 | 96.11 | 96.71 |
20 | 96.94 | 97.89 | 97.32 | 97.31 |
25 | 96.91 | 97.09 | 96.97 | 96.97 |
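Selecting the best ADtree configuration from this grid amounts to taking the maximum over all combinations of boosting iterations and search type. The sketch below does this with the accuracies copied from the table above; the dictionary layout and variable names are hypothetical.

```python
# Accuracy (%) for each (boosting iterations, search type), copied from the table.
accuracy_by_config = {
    (10, "Expand all paths"): 96.71,
    (10, "Expand the heaviest path"): 96.78,
    (10, "Expand the best z-pure path"): 96.65,
    (10, "Expand a random path"): 97.25,
    (15, "Expand all paths"): 96.16,
    (15, "Expand the heaviest path"): 96.24,
    (15, "Expand the best z-pure path"): 96.11,
    (15, "Expand a random path"): 96.71,
    (20, "Expand all paths"): 96.94,
    (20, "Expand the heaviest path"): 97.89,
    (20, "Expand the best z-pure path"): 97.32,
    (20, "Expand a random path"): 97.31,
    (25, "Expand all paths"): 96.91,
    (25, "Expand the heaviest path"): 97.09,
    (25, "Expand the best z-pure path"): 96.97,
    (25, "Expand a random path"): 96.97,
}

best_config = max(accuracy_by_config, key=accuracy_by_config.get)
best_accuracy = accuracy_by_config[best_config]
print(best_config, best_accuracy)
```

The maximum is reached at 20 boosting iterations with the heaviest-path search, at 97.89%.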
The BayesNet classification was tuned using the search algorithm, the score type and the logical value of random order as the tuning parameters. The comparative performance of the BayesNet classification based on search type and random order values is shown in
Random Order | Search Algorithm: Genetic Search | Score Type: Entropy |
---|---|---|
True | 96.03 | 97.01 |
False | 95.77 | 96.89 |
The CNN, ADtree and BayesNet classification models have been described in general in the previous section. On the basis of the CNN experiments, all results presented here were obtained with the 5-layer CNN.
For the ADtree, all results presented in Section 4.3 were obtained with the following parameters, where the value used is listed in brackets: Iterations of Boosting (20); the type of search to perform when building the tree (expand the heaviest path, which follows the path with the most heavily weighted instances).
Based on the BayesNet experiments, all results described in Section 4.3 were obtained with the following parameter setup, where the chosen value is listed in brackets: Random Order (True); Search Algorithm (Genetic Search); Score Type (Entropy).
The experimental analysis was carried out using the CNN, ADtree and BayesNet classifiers. Each classifier achieved good performance after the parameter tuning procedure. The classification models were validated using a 70% training and 30% testing split of the dataset. It is common practice to split the data into 70% for training and 30% for testing [
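A 70/30 split of this kind can be sketched as follows. The text does not say whether the records were shuffled before splitting or which seed was used, so the seeded shuffle here is an assumption, and the function name `train_test_split` is a hypothetical helper rather than a WEKA option.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def train_test_split(
    records: Sequence[T], train_fraction: float = 0.7, seed: int = 42
) -> Tuple[List[T], List[T]]:
    """Shuffle a copy of the records and cut it into training and testing parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


train, test = train_test_split(range(10))
print(len(train), len(test))
```

Fixing the seed keeps the same records in each part across runs, so that differences in accuracy between classifiers are not caused by different splits.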
The experiment evaluated classification performance in both the training and testing phases. The comparative performance of the Convolution Neural Network, ADtree and BayesNet in the training and testing phases, measured by accuracy, is shown in
Phase | Classifier | Accuracy (%) |
---|---|---|
Training | CNN Classifier | 99.30 |
Training | ADtree Classifier | 98.41 |
Training | BayesNet Classifier | 97.02 |
Testing | CNN Classifier | 99.56 |
Testing | ADtree Classifier | 97.89 |
Testing | BayesNet Classifier | 97.01 |
The comparison of the testing performance of our proposed classifiers with the results of other classifiers based on the DS4C COVID-19 dataset is shown in
Reference | Classifier | Accuracy (%) |
---|---|---|
Alafif et al. [ | Logistic Regression | 95.00 |
Result achieved from our research | CNN Classifier | 99.56 |
Result achieved from our research | ADtree Classifier | 97.89 |
Result achieved from our research | BayesNet Classifier | 97.01 |
This study provides a comparison of performance between three classification methods, Convolution Neural Network (CNN), ADtree and BayesNet, in classifying COVID-19 cases using the DS4C COVID-19 dataset. The classification accuracies of the three methods are compared with each other. Prior to the performance comparison, several pre-processing techniques such as data cleaning, feature selection, and parameter tuning were conducted. The data pre-processing was carried out to remove the missing and outlier values of the dataset. The feature selection and extraction were carried out using principal component analysis (PCA) with the Ranker search method. Using PCA with the Ranker search method, the ten features are ranked based on the Ranked Coefficients. We removed the lowest four features (i.e.,
Future work can leverage other factors such as policies, herd immunity, population density, migration patterns, and other aspects that might directly influence how COVID-19 spreads. Thus, the development of machine learning models based on weather conditions in association with health policies would provide knowledge of great value for the benefit of mankind in this critical period.