COVID-19 is a contagious disease, and its several variants have put all walks of life, including the economy, under stress. Early diagnosis is crucial for preventing the spread of the virus, which is a threat to life worldwide. With the advancement of technology, the Internet of Things (IoT) and the social IoT (SIoT), the versatile data produced by smart devices has helped considerably in combating this lethal disease. Data mining is a technique for extracting useful information from massive data. In this study, we used five supervised machine learning (ML) strategies to create a model that analyzes and forecasts the presence of COVID-19 using the Kaggle dataset “COVID-19 Symptoms and Presence.” RapidMiner Studio ML software was used to apply the Decision Tree (DT), Random Forest (RF), K-Nearest Neighbors (K-NNs), Naive Bayes (NB) and Iterative Dichotomiser 3 (ID3) algorithms. The performance of each model was tested using 10-fold cross-validation and compared on the major accuracy measures, Cohen’s kappa statistic, correctly and incorrectly classified cases and root mean square error. The results demonstrate that DT outperforms the other methods, with an accuracy of 98.42% and a root mean square error of 0.11. In the future, the devised model could support early prediction and diagnosis of the disease when provided with different data sets.
The COVID-19 pandemic badly affected all spheres of life, including quality of life, the psychology of sustainability and the global economy [
ML is a branch of Artificial Intelligence (AI) in which a model is trained on historical data and then used to predict new outcomes with different algorithms. “Machine Learning is defined as the study of computer programs that leverage algorithms and statistical models to learn through inference and patterns without being explicitly programmed. ML field has undergone significant developments in the last decade.” It has three types: Supervised Machine Learning (SML), Unsupervised Machine Learning (Un-SML) and Reinforcement Learning (RL). In SML, algorithms are built so that machines can infer general patterns and hypotheses from externally supplied instances in order to predict the outcome of future instances. SML classification algorithms also serve to categorize data by using prior information [
Reinforcement learning is the next type of machine learning. In this sort of learning, the computer employs appropriate actions with a trial-and-error technique to identify the most likely outcome of future instances in the given environment [
Using the provided information, a machine could predict the risk a disease poses to life, diagnose the disease and suggest appropriate treatments as required. Along with these benefits, machine learning presents several challenges in the health-care delivery system. These include data pre-processing, model training and fine-tuning of the system to actual clinical problems, which are made harder by the limited availability of sufficient data, as well as ethical considerations such as medico-legal ramifications, doctors’ expertise in ML tools and data privacy and security [
The disease prediction model was developed using multiple supervised ML techniques, since different algorithms produce varied results depending on the dataset. Several research studies have already used machine learning algorithms to predict this novel disease. A Convolutional Neural Network (CNN) was applied as a COVID-19 predictor based on laboratory findings; the dataset was taken from a hospital in São Paulo, Brazil, and the model achieved 76% accuracy [
To forecast cases of COVID-19, ML models were developed by using mathematical expressions along with Stochastic Fractal Search algorithms for the prediction of symptomatic and asymptomatic patients, mortality or recovery from the COVID-19 virus [ The repository in RapidMiner is used for storing data, processes and results. Operators carry out the essential workflow of a project. Ports connect Operators, so that the output of one operator serves as the input of the next. A process is a collection of Operators that work together to alter and analyze data. Parameters modify the behavior of an operator. The help option can be used to understand a particular Operator.
The remainder of this paper, which addresses the problem of predicting the presence of COVID-19 in a person, is organized as follows: Section 2 describes the materials and methods, Section 3 presents the results, Section 4 discusses the results given by the different classifiers and Section 5 concludes.
The work flow diagram is shown in
For data collection, the researcher used the dataset “COVID-19 Symptoms and Presence”, which was available at
To process the dataset, the author used the RapidMiner machine learning software. RapidMiner Studio supports the following file types: ACCDB (Microsoft Access database), ARFF (Weka file format), CSV (comma-separated values), DBF (dBASE database file format, read-only), DTA (Stata file format, read-only), HYPER (Tableau file format), MDB (Microsoft Access database), QVX (QlikView data exchange, write-only), SAS (SAS file format up to v9.2, read-only), SAV (IBM SPSS file format, read-only), TDE (Tableau file format), XLS/XLSX (Microsoft Excel spreadsheet), XML (Extensible Markup Language) and XRFF (Weka file format). The COVID-19 Symptoms and Presence dataset is in CSV format, making it simple to import and study in the tool. Data preprocessing begins by selecting the Import Data option from the Repository menu and browsing to the dataset’s location. Once the dataset is imported into RapidMiner’s process window, data cleaning is performed to remove undefined values, which can be done by filling the missing values with the average, minimum or maximum of the attribute.
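The missing-value replacement described above can be illustrated outside RapidMiner. The following is a minimal Python sketch, not the RapidMiner operator itself; the function name `impute`, the `strategy` argument and the example column are assumptions made for illustration only.

```python
from statistics import mean


def impute(values, strategy="mean"):
    """Replace None entries with the mean, minimum or maximum of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("column has no observed values to impute from")
    # Map the strategy name to the aggregate used as the fill value.
    fill = {"mean": mean, "min": min, "max": max}[strategy](observed)
    return [fill if v is None else v for v in values]


# Hypothetical numeric column with two undefined entries.
column = [1.0, None, 3.0, None, 5.0]
print(impute(column, "mean"))
print(impute(column, "max"))
```

Which fill value is appropriate depends on the attribute; for the Yes/No attributes of a symptoms dataset, a most-frequent-value replacement may be a more natural choice than an average.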
After data processing, different SML algorithms, namely DT, RF, KNN, ID3 and NB, were applied. In the Operators tab of RapidMiner, the researchers chose each classifier by name and selected 10-fold cross-validation. Many optimization techniques are available for determining the best configuration, such as feature scaling and batch normalization, mini-batch gradient descent, gradient descent with momentum, RMSProp optimization, Adam optimization, hyperparameter optimization and learning rate decay. This research preferred hyperparameter optimization, which determines the best configuration for each algorithm by training the model several times. For the Decision Tree (DT), a maximal depth of ten was used along with pruning (confidence = 0.1) and pre-pruning (minimal gain = 0.01); after optimization, DT achieved 98.42% accuracy and a kappa of 0.95. For Random Forest (RF), a hundred iterations were used with the maximum depth set to zero (unlimited depth), where the number of iterations is the total number of trees in the forest; after optimization, RF achieved 98.37% accuracy with a kappa of 0.948. For KNN, K = 10 was used; after optimization, the classifier achieved 97.58% accuracy and a kappa of 0.922. For the ID3 classifier, a minimal leaf size of 2 and a minimal size for split of 4 were used; after optimization, the classifier achieved 98.26% accuracy and a kappa of 0.945. Lastly, NB with the Kernel Estimator algorithm and Supervised Discretization, rather than a normal distribution for numeric attributes, gave an accuracy of 96.87% and a kappa of 0.897 after optimization. The mathematical descriptions of these supervised machine learning algorithms are given as follows:
The Complete Process of Classifiers Using RapidMiner is shown in
The Gini Index measures the reduction in class impurity from partitioning the feature space see
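As an illustration of the impurity measure, the Gini index of a set of class labels is commonly computed as one minus the sum of the squared class proportions, and the quality of a split as the reduction in that impurity. The sketch below assumes this standard definition; the function names and the toy labels are hypothetical and are not taken from the study.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_gain(parent, children):
    """Reduction in impurity obtained by splitting `parent` into `children`."""
    n = len(parent)
    weighted = sum(len(child) / n * gini(child) for child in children)
    return gini(parent) - weighted


# Toy example: a perfectly mixed node split into two pure children.
parent = ["Yes", "Yes", "No", "No"]
print(gini(parent))
print(gini_gain(parent, [["Yes", "Yes"], ["No", "No"]]))
```

A decision tree learner would evaluate this gain for every candidate attribute and split on the one with the largest reduction.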
The RF algorithm also creates trees, but it builds multiple trees from random samples of the dataset, with the final result determined by the majority of the trees generated. By constructing a group of trees that each produce an individual outcome, aggregating those outcomes and selecting the class with the most votes, RF can considerably increase the classification accuracy of a given model [
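The two ingredients described here, random sampling of the data and majority voting over the trees, can be sketched as follows. This is only an illustration under stated assumptions: the "trees" are hand-written stand-in rules rather than learned decision trees, and the feature names `fever` and `dry_cough` are hypothetical.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement; each tree trains on such a sample."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(predictions).most_common(1)[0][0]


def forest_predict(trees, sample):
    """Aggregate the individual tree outcomes into one forest prediction."""
    return majority_vote([tree(sample) for tree in trees])


# Hypothetical stand-ins for three fitted trees (not learned from data).
trees = [
    lambda s: "Yes" if s["fever"] else "No",
    lambda s: "Yes" if s["dry_cough"] else "No",
    lambda s: "Yes" if s["fever"] and s["dry_cough"] else "No",
]
print(forest_predict(trees, {"fever": True, "dry_cough": False}))
```

In a real forest each tree would be fitted on its own bootstrap sample, typically also with a random subset of attributes considered at each split.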
The K-NN algorithm finds the k nearest neighbors of a data point in a data set. The k-nearest-neighbor classifier works by computing the Euclidean distance between the specified training samples and a test sample. The Euclidean distance between sample qi and pi is defined as
The test sample is assigned to the class that is most common among the k training samples with the smallest Euclidean distance to it. In practice, to avoid ties, k is preferably chosen to be odd. The k = 1 rule is generally called the nearest-neighbor classification rule [
Iterative Dichotomiser 3 (ID3) is a supervised learning technique that uses a fixed set of instances to build a DT, which is then used to classify future samples. The ID3 method generates the tree based on the information gain computed from the training examples. When utilizing the ID3 technique, there are no missing data because nominal attributes are used for classification [
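The k-nearest-neighbor rule can be sketched in a few lines of Python, assuming the standard Euclidean distance (the square root of the summed squared coordinate differences) and a majority vote among the k closest training samples. The training points below are hypothetical binary symptom vectors, not rows from the dataset.

```python
import math
from collections import Counter


def euclidean(p, q):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training samples."""
    nearest = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Hypothetical training set: (feature vector, class label).
train = [
    ((1, 1, 0), "Yes"),
    ((1, 0, 1), "Yes"),
    ((0, 0, 0), "No"),
    ((0, 1, 0), "No"),
    ((1, 1, 1), "Yes"),
]
print(knn_predict(train, (1, 1, 1), k=3))
print(knn_predict(train, (0, 0, 0), k=3))
```

With two classes, an odd k guarantees that the vote cannot end in a tie.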
The Gain Information G Info (p, T): In
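Information gain, as used by ID3 to choose the splitting attribute, is conventionally the entropy of the class labels minus the weighted entropy of the partitions induced by an attribute. The sketch below assumes that standard definition; the attribute name `fever` and the toy rows are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Entropy of `labels` minus the weighted entropy after splitting on `attribute`."""
    n = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute], []).append(label)
    remainder = sum(len(part) / n * entropy(part) for part in partitions.values())
    return entropy(labels) - remainder


# Toy data in which the attribute separates the classes perfectly.
rows = [{"fever": "Yes"}, {"fever": "Yes"}, {"fever": "No"}, {"fever": "No"}]
labels = ["Yes", "Yes", "No", "No"]
print(information_gain(rows, labels, "fever"))
```

ID3 would compute this gain for each nominal attribute and place the attribute with the highest gain at the current node.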
The Naive Bayesian classifier is a statistical supervised ML algorithm that forecasts the likelihood of an instance belonging to a particular class. When applied to a big dataset, NB produces excellent accuracy [ By Bayes’ theorem, P(C/F) = P(F/C) P(C) / P(F), where P(C/F) is the posterior probability, P(C) is the class prior probability, P(F/C) is the likelihood and P(F) is the predictor prior probability; ‘C’ stands for the class and ‘F’ stands for the features.
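To make the roles of the prior, likelihood and posterior concrete, the following sketch implements a small categorical Naive Bayes classifier. It is not the Kernel Estimator variant used in RapidMiner: it assumes binary features and add-one smoothing, and the function names and toy data are hypothetical.

```python
from collections import Counter, defaultdict


def train_nb(rows, labels):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    feature_counts = defaultdict(Counter)  # (class, feature index) -> value counts
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            feature_counts[(label, i)][value] += 1
    return class_counts, feature_counts


def posterior(model, row):
    """Return P(C | F) for every class, assuming conditionally independent features."""
    class_counts, feature_counts = model
    total = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / total  # class prior P(C)
        for i, value in enumerate(row):
            # Likelihood P(F_i | C); add-one smoothing assumes two possible values.
            score *= (feature_counts[(c, i)][value] + 1) / (n_c + 2)
        scores[c] = score
    evidence = sum(scores.values())  # predictor prior P(F)
    return {c: s / evidence for c, s in scores.items()}


# Hypothetical binary symptom vectors and class labels.
rows = [(1, 1), (1, 0), (0, 0), (0, 1)]
labels = ["Yes", "Yes", "No", "No"]
print(posterior(train_nb(rows, labels), (1, 1)))
```

The predicted class is simply the one with the largest posterior probability.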
The author employed 10-fold cross-validation to compare the following settings across the SML algorithms. The results of the comparative analysis are displayed in
Each algorithm performs differently with respect to the correctly classified instances. Accuracy is the number of correctly predicted instances divided by the total number of predictions generated by the model; that is, the ratio of correctly classified true positives (TP) and true negatives (TN) over the total number of cases. Accuracy can be calculated using the following formula in
Furthermore, precision is a crucial aspect in determining the optimal model; it is calculated by dividing the TP by the sum of TP and false positives (FP), so it measures the fraction of predicted positives that are truly positive. Precision refers to how many of the patients classified as COVID-19-positive are genuinely COVID-19-positive in a given dataset and it can be calculated using the following
Recall measures the fraction of actual positive cases in the dataset that are correctly predicted, that is, TP divided by the sum of TP and false negatives (FN). The following
These values are considered in the comparative analysis of the machine learning algorithms. TP and TN predictions show the correctly classified instances; conversely, FP and FN predictions show incorrectly classified instances in the model.
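The three measures follow directly from the four confusion-matrix counts. The sketch below uses the standard definitions given in the text; the counts are hypothetical and are not results from this study.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def precision(tp, fp):
    """Fraction of predicted positives that are truly positive."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Fraction of actual positives that are predicted positive."""
    return tp / (tp + fn)


# Hypothetical confusion-matrix counts for illustration only.
tp, tn, fp, fn = 90, 80, 10, 20
print(accuracy(tp, tn, fp, fn), precision(tp, fp), recall(tp, fn))
```

Reporting precision and recall alongside accuracy matters when the classes are imbalanced, because a model can reach high accuracy by favoring the majority class.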
Cohen’s kappa statistic, often known simply as the kappa statistic, evaluates the reliability of a prediction or outcome between two raters of the same sample; it shows how closely the raters agree beyond what would be expected by chance. A score of zero indicates that the agreement between the two raters is no better than chance, and the score can also be less than zero. A score of 1 indicates that the two raters are in complete agreement [
Accuracy alone cannot determine the best model for detecting the presence of COVID-19 in a person. The researcher therefore considered the following contributing factors when determining the best model for the dataset under discussion:
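When kappa is used to evaluate a classifier, the two "raters" are the true labels and the model's predictions. A minimal sketch, assuming the usual definition kappa = (p_o - p_e) / (1 - p_e) with p_o the observed agreement and p_e the agreement expected by chance; the label lists are hypothetical.

```python
from collections import Counter


def cohen_kappa(actual, predicted):
    """Cohen's kappa between true labels and predicted labels."""
    n = len(actual)
    # Observed agreement: fraction of instances where both raters agree.
    observed = sum(a == p for a, p in zip(actual, predicted)) / n
    actual_counts = Counter(actual)
    predicted_counts = Counter(predicted)
    # Chance agreement: product of the marginal class proportions, summed over classes.
    expected = sum(actual_counts[c] * predicted_counts[c] for c in actual_counts) / (n * n)
    return (observed - expected) / (1 - expected)


actual = ["Yes", "Yes", "No", "No"]
print(cohen_kappa(actual, ["Yes", "Yes", "No", "No"]))  # complete agreement
print(cohen_kappa(actual, ["Yes", "No", "Yes", "No"]))  # chance-level agreement
```

Note that the expression is undefined when the chance agreement equals 1, which happens if both raters assign every instance to the same single class.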
- Highest accuracy, precision and recall
- Highest number of correctly classified instances
- Lowest number of incorrectly classified instances
- Highest kappa statistic score
- Lowest Root Mean Square Error
The researchers used the “COVID-19 Symptoms and Presence” dataset from Kaggle. There are 20 attributes in this dataset, plus one target/class attribute. There are 5434 instances in the dataset.
The class value Yes accounts for 4383 (81%) of the instances, indicating that COVID-19 is present in the person, while No accounts for the remaining 1051 instances (19%). In the modeling phase, the best performance of an algorithm can be attained by using the best configuration. As a result, the researchers used the COVID-19 Symptoms and Presence dataset to undertake hyperparameter optimization to determine the values at which each algorithm works best. All of the experiments employed 10-fold cross-validation and a batch size of 100 to assess the model’s performance.
Hyperparameter optimization was used to obtain the best results for each algorithm; after analyzing the results, the researchers were able to decide which model performed best after optimization. The DT algorithm predicted the presence of COVID-19 in a person with the best results. The established models were evaluated using the 10-fold cross-validation technique and the results of the model’s accuracy performance along with other measures are displayed in
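The mechanics of 10-fold cross-validation can be sketched as follows: the instance indices are shuffled and divided into ten folds, and each fold serves once as the test set while the other nine form the training set. This is a plain, non-stratified illustration and does not reproduce RapidMiner's own sampling; the function name and the seed are assumptions.

```python
import random


def k_fold_indices(n, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation over n instances."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 5434 is the number of instances reported for the dataset.
for train, test in k_fold_indices(5434, k=10):
    print(len(train), len(test))
```

The reported score is then the average of the per-fold scores. With an 81%/19% class split, a stratified variant that preserves the class ratio in every fold would usually be preferable.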
Algorithms | Accuracy | Kappa | RMSE |
---|---|---|---|
DT | 98.42% | 0.95 | 0.11 |
RF | 98.37% | 0.948 | 0.142 |
KNN | 97.58% | 0.922 | 0.136 |
ID3 | 98.26% | 0.945 | 0.106 |
NB | 96.87% | 0.897 | 0.159 |
The authors sought a model with a low RMSE because they consider it more effective than models with a greater RMSE. The kappa statistic was also used as a criterion for each model’s effectiveness with respect to the true labels present in the dataset. Owing to hyperparameter tuning, all algorithms performed well in the training/testing process, with DT and RF showing the best performance in predicting the presence of COVID-19 in a person.
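For a binary classifier, RMSE can be computed between the true labels encoded as 0/1 and the predicted confidence for the positive class. The sketch below assumes that encoding, which may differ from how RapidMiner derives its value; the numbers are hypothetical.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between true values and predictions."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(actual))


# Hypothetical 0/1 labels and predicted confidences for the positive class.
actual = [1, 0, 1, 0]
confidences = [0.9, 0.2, 0.8, 0.1]
print(rmse(actual, confidences))
```

A value close to zero means the predicted confidences lie close to the true labels, which is why a lower RMSE is treated as better.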
This study sought to create a COVID-19 presence prediction model by using five supervised ML algorithms: DT, RF, K-NNs, ID3 and NB. The models’ performance was assessed with 10-fold cross-validation in RapidMiner Studio machine learning software. DT was found to be the most accurate ML algorithm, with 98.42% accuracy and a root mean square error (RMSE) of 0.11. In terms of accuracy, recall, precision, correctly and incorrectly classified instances and kappa statistic score, the DT method surpassed the other algorithms. The results also reveal that Random Forest is the second-best model for developing a COVID-19 presence predictor, as its accuracy measures differ only slightly from those of DT. ID3 is the third-best model for predicting the presence of COVID-19 in a person, K-NNs is the fourth and NB is the fifth. This research could help with medical decision-making by providing a technologically enhanced model that assists in diagnosing the presence of COVID-19 in a person based on symptoms, since the symptoms experienced by people were used to determine the chance of being COVID-19 positive or negative. The model created in this study can be used to create an application with the following advantages:
- The presence of the COVID-19 virus in individuals could easily be monitored by using symptoms as input features.
- Medical practitioners can utilize this study as a preliminary patient assessment.
- The business community could be assisted in restricting physical contact with customers who may possibly have COVID-19.
- This study could be used at quarantine facilities as an additional self-management tool for monitoring COVID-19 symptoms while individuals are isolated.
- This study may be useful to the community and government as a tool for containing the virus’s spread by detecting COVID-19 in a timely manner.
In addition to this research, A multifeatured learning model with enhanced local attention for vehicle re-Identification [
Tahir Sher performed all the experiments and wrote the manuscript. Dr. Abdul Rehman revised the manuscript over several meetings and refined the idea. Dr. Dongsun Kim supervised this research work.