Novel Coronavirus Disease (COVID-19) is a communicable disease that emerged in December 2019, when China officially informed the World Health Organization (WHO) of a cluster of cases in the city of Wuhan. The disease subsequently spread to the rest of the world. At the time of writing, no specific vaccine or medicine is available for the prevention or cure of the disease. Research is under way in the medicinal and pharmaceutical sciences, aided by data analytics and machine learning, toward the treatment and early detection of this viral disease. The present report describes the use of machine learning algorithms [Linear and Logistic Regression, Decision Tree (DT), K-Nearest Neighbor (KNN), Support Vector Machine (SVM), and SVM with Grid Search] for prediction and classification in relation to COVID-19. The data used for experimentation was the COVID-19 dataset acquired from the Center for Systems Science and Engineering (CSSE), Johns Hopkins University (JHU). The results indicated that the risk period for patients is 12–14 days, beyond which the probability of survival may increase. The probability of death in COVID-19 cases was also found to increase with age and to be higher in males than in females. SVM with Grid Search demonstrated the highest accuracy of approximately 95%, followed by the decision tree algorithm with an accuracy of approximately 94%. The present study and analysis pave the way for attribute correlation, estimation of survival days, and prediction of death probability. The findings indicate that machine learning algorithms have strong capabilities for prediction and classification in relation to COVID-19.
The World Health Organization (WHO) declared novel Coronavirus Disease (COVID-19) a pandemic on March 11, 2020, and recommended immediate action, predominantly for the early detection and treatment of the disease. The name COVID-19 was suggested by WHO for the novel coronavirus that was reported to affect the lower respiratory system of individuals in Wuhan, China [
The main objectives of the present study were as follows: Study and analysis of the role of machine learning algorithms in relation to the COVID-19 dataset. Analysis and prediction of the impact of attributes and their correlation. Prediction of the survival status and the probability of death rates.
The dataset for experimentation was obtained from the data repository for the COVID-19 visual dashboard operated by the Center for Systems Science and Engineering (CSSE), Johns Hopkins University (JHU) [
S.No. | Attribute name | Discussion |
---|---|---|
1 | Id | Unique case number. |
2 | case_in_country | An integer value for the country-wise case number. |
3 | Reporting date | The date on which the case was actually reported. |
4 | Summary | It depicts the overall case summary. |
5 | Location | City in which the case has been reported. |
6 | Country | Country in which the case has been reported. |
7 | Gender | Male or female. |
8 | Age | Age of the patient. |
9 | Symptom_onset | The date on which the patient started feeling the symptoms. |
10 | If_onset_approximated | Whether the symptom onset date is an approximation. |
11 | Hosp_visit_date | The date when the patient visited the hospital. |
12 | International_traveler | Whether the patient has travelled internationally. |
13 | Domestic traveler | Whether the patient has travelled within their own country. |
14 | Exposure_start | The date when the patient was exposed to a COVID suspected area. |
15 | Exposure_end | The date when the patient left the COVID suspected area. |
16 | Visiting Wuhan | Whether the patient has visited Wuhan recently. |
17 | From Wuhan | Whether the patient belongs to Wuhan or not. |
18 | Death | Whether the patient has died or not. |
19 | Recovered | Whether the patient recovered from COVID or not. |
20 | Symptom and source | Description of the symptoms and source of the case information. |
The following algorithms were utilized for developing a machine learning model for accurate prediction and classification from the COVID-19 dataset. Correlation analysis was performed in order to determine the dependent and independent attributes along with their strength of relationships.
Linear regression is nothing more than a representation of a linear model [
Y: The predicted value
θ0: The intercept
θ1,…,θn: Model parameters
x1, x2,…,xn: Feature values(independent variables)
The simplified model may also be presented as in
When there is only one independent variable, the representation of Y prediction may be presented as the following (
Here, the values for θ0 and θ1 are selected such that the error is minimized.
In the case of only one predictor, the intercept may be calculated as presented in
This implies the following: If θ1 > 0, then x and y have a positive relationship, i.e., y increases as x increases. If θ1 < 0, then x and y have a negative relationship, i.e., y decreases as x increases. If x = 0 (not included), the If x = 0 (included), it implies that θ0 produced the average of the predicted values. If θ0 is not included, the prediction and the regression coefficient may be biased.
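As a minimal sketch of the single-predictor case, the least-squares estimates can be computed in closed form: the slope θ1 is the covariance of x and y divided by the variance of x, and the intercept θ0 is the mean of y minus θ1 times the mean of x. The function name `fit_simple_linear_regression` and the sample data are illustrative assumptions and are not taken from the study.

```python
from statistics import mean


def fit_simple_linear_regression(xs, ys):
    """Return (theta0, theta1) minimizing the squared error of y = theta0 + theta1 * x."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two paired observations")
    x_bar, y_bar = mean(xs), mean(ys)
    # Sum of squared deviations of x (proportional to its variance)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    # Sum of cross-deviations (proportional to the covariance of x and y)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    theta1 = sxy / sxx
    theta0 = y_bar - theta1 * x_bar
    return theta0, theta1


# Hypothetical data lying exactly on the line y = 1 + 2x
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
theta0, theta1 = fit_simple_linear_regression(xs, ys)
print(theta0, theta1)
```

A positive `theta1`, as in this toy data, corresponds to the positive relationship between x and y described above.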
The goal is to find optimal values of all model parameters (θ0,…,θn) so that the model fits the data. With a single independent variable or feature, the model is always a straight line, whereas with more than one feature it is a hyperplane. Often, the dataset follows a curve rather than a straight line (when x in the
Logistic regression is used when the target variable is categorical. Therefore, logistic regression is the machine learning algorithm for classification, while linear regression is the regression algorithm for prediction.
The logistic regression model data are based on the logistic or sigmoid function:
If it is considered in terms of Y, as in
A threshold is applied to the predicted probability to map it to a discrete class. The threshold stated below may be used for the mapping.
p ≥ 0.5, class = 1; p < 0.5, class = 0
This implies that the observation is classified as positive if its predicted probability is greater than or equal to 0.5. On the basis of the above-stated decision boundary and logistic function, a prediction function can be defined and is presented in
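The mapping from a linear score to a class label can be sketched as follows. This is an illustrative sketch only: the names `sigmoid` and `predict_class`, and the coefficient values used in the example call, are assumptions and do not come from the fitted model in the study.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real z to a probability between 0 and 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # Algebraically equivalent form that avoids overflow for large negative z
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_class(features, thetas, intercept, threshold=0.5):
    """Return (class, probability) using the rule p >= threshold -> class 1."""
    z = intercept + sum(t * x for t, x in zip(thetas, features))
    p = sigmoid(z)
    return (1 if p >= threshold else 0), p


# Hypothetical coefficients for a single feature
label, probability = predict_class([2.0], thetas=[1.5], intercept=-1.0)
print(label, round(probability, 3))
```

A score of exactly zero gives p = 0.5, which the rule above assigns to class 1.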
A Decision Tree (DT) is a tree-like structure constructed for decision modeling and consequences. In a DT, the test of an attribute is denoted by the internal nodes, while the branch denotes the outcome [
K-nearest neighbor (KNN) is a classification algorithm based on the majority vote of the nearest neighbors [
Steps in the KNN algorithm:
Step 1: Data loading and initialization of K. K denotes the number of neighbors.
Step 2: Calculation of the distance between the query example and each training example, and addition of each distance to a collection sorted in ascending order.
Step 3: Selection of the first K entries from the sorted collection of Step 2.
Step 4: Retrieval of the labels of the selected K entries.
Step 5: In the case of regression, the mean of the K labels is returned, while the mode (majority) of the K labels is returned in the case of classification.
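A compact sketch of the KNN procedure follows, returning the mean of the neighbor labels for regression and the majority label for classification. Euclidean distance, the function name `knn_predict`, and the toy points are assumptions made for illustration; the study does not state which distance measure was used.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3, task="classification"):
    """Predict the label of `query` from its k nearest training examples."""
    if not 1 <= k <= len(train_X):
        raise ValueError("k must be between 1 and the number of training examples")
    # Compute (distance, label) pairs and sort them by ascending distance
    distances = sorted(
        ((math.dist(x, query), label) for x, label in zip(train_X, train_y)),
        key=lambda pair: pair[0],
    )
    # Keep the labels of the first k entries
    nearest_labels = [label for _, label in distances[:k]]
    if task == "regression":
        return sum(nearest_labels) / k
    # Classification: majority vote among the k labels
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical two-cluster data
train_X = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
train_y = [0, 0, 0, 1, 1, 1]
print(knn_predict(train_X, train_y, (1.5, 1.5), k=3))
print(knn_predict(train_X, train_y, (6.5, 6.5), k=3))
```

An odd K is a common choice for two-class problems because it avoids tied votes.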
Support Vector Machine (SVM) is a classification and regression algorithm which is basically a supervised machine-learning model. SVM has been used for two-group classification problems [
w.x + b ≥ 1, for all x of class 1
w.x + b ≤ −1 for all x of class 2
In terms of reproducing kernel Hilbert space [
where ||f||_k^2 represents the squared norm of f in the reproducing kernel Hilbert space, k denotes the kernel, and A is a constant.
There are other hyper-parameters, for example gamma, that can be fine-tuned to improve model performance by identifying the best combination of parameter values. However, identifying the optimal hyper-parameters is complex. One approach is to create a grid of hyper-parameter values and try every possible combination. This is referred to as the grid search method. This method may be useful in the over-fitting problem [
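The exhaustive nature of grid search can be sketched in a few lines. In the sketch below, `grid_search` is a hypothetical helper, and `toy_validation_accuracy` is a made-up stand-in for "train an SVM with these values and return its validation accuracy"; neither the function nor the numbers reflect the study's actual tuning. In practice, scikit-learn's `GridSearchCV` combines this kind of loop with cross-validation.

```python
import math
from itertools import product


def grid_search(param_grid, evaluate):
    """Try every combination in param_grid and return (best_params, best_score)."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_validation_accuracy(C, gamma):
    """Assumed stand-in objective that peaks at C = 10 and gamma = 0.01."""
    return 0.95 - 0.01 * abs(math.log10(C) - 1) - 0.02 * abs(math.log10(gamma) + 2)


# Hypothetical grid of candidate values (4 x 4 = 16 combinations)
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [1, 0.1, 0.01, 0.001]}
best_params, best_score = grid_search(param_grid, toy_validation_accuracy)
print(best_params, round(best_score, 3))
```

The cost grows multiplicatively with each added hyper-parameter, which is why grids are usually kept coarse.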
The following steps were performed in the above-discussed approaches.
Step 1: Data cleaning: The dataset included null values in certain columns. These were handled by dropping the affected rows, as the missing values could not be replaced with any meaningful substitute.
Step 2: Feature selection: Since the dataset contained multiple columns, this step comprised the selection of the significant columns to be used later for classification and regression analysis.
Step 3: Train–test split: The dataset was randomly divided into train data and test data. The proportion of data held out for testing and verification in the present study ranged from 15% to 40%.
Step 4: Fitting/Train the model: A particular classification algorithm was selected with its optimal parameters, and was used to train the model.
Step 5: Testing the model: The model was applied to the test data, and the accuracy of the model was determined by analyzing the produced confusion matrix.
Step 6: Model evaluation: The model was evaluated by calculating the error metrics.
Step 7: Model prediction: The model was applied to sample data, and the predicted value was verified.
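The seven steps above can be sketched end to end using only the Python standard library. Everything in this sketch is an assumption made for illustration: the records are synthetic, the labelling rule is invented and carries no clinical meaning, and a nearest-centroid classifier stands in for the algorithms actually used in the study.

```python
import random


def make_records(n=200, seed=0):
    """Generate hypothetical patient records; None marks a missing value."""
    rng = random.Random(seed)
    records = []
    for _ in range(n):
        age = rng.randint(20, 90)
        prev_illness = int(rng.random() < 0.3)
        # Synthetic labelling rule for illustration only, not a clinical finding
        death = int(age > 65 and prev_illness == 1)
        if rng.random() < 0.05:
            age = None  # simulate a missing entry
        records.append({"age": age, "prev_illness": prev_illness, "death": death})
    return records


def drop_missing(records):
    """Step 1: drop every row that contains a null value."""
    return [r for r in records if None not in r.values()]


def select_columns(records, features, target):
    """Step 2: keep only the chosen feature columns and the target."""
    X = [[r[f] for f in features] for r in records]
    y = [r[target] for r in records]
    return X, y


def split_train_test(X, y, test_fraction, seed=0):
    """Step 3: random split into train and test portions."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(indices) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return ([X[i] for i in train_idx], [y[i] for i in train_idx],
            [X[i] for i in test_idx], [y[i] for i in test_idx])


def fit_nearest_centroid(X, y):
    """Step 4 (stand-in model): store the mean feature vector of each class."""
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict(centroids, x):
    """Assign x to the class whose centroid is closest (squared Euclidean)."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def confusion_counts(y_true, y_pred):
    """Step 5: tally the four cells of the binary confusion matrix."""
    pairs = list(zip(y_true, y_pred))
    return {
        "TP": sum(t == 1 and p == 1 for t, p in pairs),
        "TN": sum(t == 0 and p == 0 for t, p in pairs),
        "FP": sum(t == 0 and p == 1 for t, p in pairs),
        "FN": sum(t == 1 and p == 0 for t, p in pairs),
    }


records = drop_missing(make_records())
X, y = select_columns(records, ["age", "prev_illness"], "death")
X_train, y_train, X_test, y_test = split_train_test(X, y, test_fraction=0.30)
model = fit_nearest_centroid(X_train, y_train)
y_pred = [predict(model, x) for x in X_test]
counts = confusion_counts(y_test, y_pred)
accuracy = (counts["TP"] + counts["TN"]) / len(y_test)  # Step 6
sample_prediction = predict(model, [80, 1])             # Step 7
print(counts, round(accuracy, 2), sample_prediction)
```

Changing `test_fraction` between 0.15 and 0.40 reproduces the range of splits described in Step 3.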
This section discusses the results of linear and logistic regression, DT, KNN, SVM, and SVM with Grid search applied to the above-stated dataset.
Positive and negative correlation findings were deduced and are presented in
Attribute | Age | PIH | Death | Female | Male | Days |
---|---|---|---|---|---|---|
Age | 1 | 0.31 | 0.38 | 0.006 | 0.05 | 0.14 |
PIH | 0.31 | 1 | 0.79 | −0.03 | 0.11 | 0.25 |
Death | 0.38 | 0.79 | 1 | −0.02 | 0.09 | 0.38 |
Female | 0.006 | −0.03 | −0.02 | 1 | −0.64 | −0.006 |
Male | 0.05 | 0.11 | 0.09 | −0.64 | 1 | −0.05 |
Days | 0.14 | 0.25 | 0.38 | −0.006 | −0.05 | 1 |
Size of correlation (positive) | Interpretation | Size of correlation (negative) | Interpretation |
---|---|---|---|
0.90 to 1.00 | Very high positive | −0.90 to −1.00 | Very high negative |
0.70 to 0.90 | High positive | −0.70 to −0.90 | High negative |
0.50 to 0.70 | Moderate positive | −0.50 to −0.70 | Moderate negative |
0.30 to 0.50 | Low positive | −0.30 to −0.50 | Low negative |
0.00 to 0.30 | Negligible correlation | 0.00 to −0.30 | Negligible correlation |
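As an illustration of how a coefficient is computed and then read against the interpretation table, the sketch below implements the Pearson correlation and a lookup over the table's ranges. The helper names `pearson` and `interpret` are hypothetical, and the study does not state which correlation coefficient was used, so Pearson is an assumption. Because adjacent ranges in the table share their endpoints, the sketch also assumes that a boundary value belongs to the stronger category.

```python
import math
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def interpret(r):
    """Map a correlation coefficient to the wording of the interpretation table."""
    size = abs(r)
    if size < 0.30:
        return "Negligible correlation"
    if size >= 0.90:
        level = "Very high"
    elif size >= 0.70:
        level = "High"
    elif size >= 0.50:
        level = "Moderate"
    else:
        level = "Low"
    return f"{level} {'positive' if r > 0 else 'negative'}"


# Coefficients taken from the correlation matrix above
print(interpret(0.79))   # PIH vs. Death
print(interpret(-0.64))  # Female vs. Male
print(interpret(0.38))   # Age vs. Death
```

Read this way, the PIH and Death pair is the only one in the matrix that reaches the high-positive band.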
On the basis of these findings, a model for a bed allocation system was developed for hospitals. This model was based on the parameters x1 (prev_illness), x2 (gender_female), x3 (gender_male), and x4 (age). The regression model is presented below.
The implications are presented in
The performance of the model was evaluated on the basis of the following measures:
Accuracy: The ratio of correctly predicted observations to the total number of observations.
where TP denotes true positive, TN denotes true negative, FP denotes false positive, and FN denotes false negative.
Precision: The ratio of correctly predicted positive observations to all predicted positive observations.
Recall (Sensitivity): The ratio of correctly predicted positive observations to all observations in the actual positive class.
F1-Score: The harmonic mean of precision and recall.
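These measures follow directly from the four confusion-matrix counts, using the standard definitions: accuracy = (TP + TN) / (TP + TN + FP + FN), precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 = 2 · precision · recall / (precision + recall). The sketch below applies them to hypothetical counts; the function name and the numbers are illustrative and are not taken from the study's results.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts for the positive class
metrics = classification_metrics(tp=30, tn=150, fp=5, fn=15)
print({name: round(value, 3) for name, value in metrics.items()})
```

Per-class figures are obtained by treating each class in turn as the positive class and recomputing the counts.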
Here, class 0 denotes a surviving patient and class 1 a deceased patient.
Test data (%) | Accuracy | Precision (class 0) | Precision (class 1) | Recall (class 0) | Recall (class 1) | F-score (class 0) | F-score (class 1) |
---|---|---|---|---|---|---|---|
40 | 0.94 | 0.93 | 0.99 | 1.00 | 0.73 | 0.96 | 0.84 |
35 | 0.93 | 0.92 | 0.99 | 1.00 | 0.70 | 0.96 | 0.82 |
30 | 0.94 | 0.93 | 0.98 | 1.00 | 0.73 | 0.96 | 0.84 |
25 | 0.93 | 0.92 | 0.98 | 1.00 | 0.70 | 0.96 | 0.81 |
20 | 0.93 | 0.92 | 0.97 | 1.00 | 0.67 | 0.95 | 0.80 |
15 | 0.94 | 0.93 | 0.97 | 0.99 | 0.74 | 0.96 | 0.84 |
Test data (%) | Accuracy | Precision (class 0) | Precision (class 1) | Recall (class 0) | Recall (class 1) | F-score (class 0) | F-score (class 1) |
---|---|---|---|---|---|---|---|
40 | 0.94 | 0.94 | 0.97 | 0.99 | 0.77 | 0.96 | 0.86 |
35 | 0.94 | 0.93 | 0.96 | 0.99 | 0.75 | 0.96 | 0.84 |
30 | 0.95 | 0.95 | 0.96 | 0.99 | 0.80 | 0.97 | 0.87 |
25 | 0.94 | 0.94 | 0.96 | 0.99 | 0.77 | 0.97 | 0.85 |
20 | 0.95 | 0.95 | 0.94 | 0.99 | 0.80 | 0.97 | 0.86 |
15 | 0.94 | 0.94 | 0.98 | 0.98 | 0.79 | 0.96 | 0.85 |
Test data (%) | Accuracy | Precision (class 0) | Precision (class 1) | Recall (class 0) | Recall (class 1) | F-score (class 0) | F-score (class 1) |
---|---|---|---|---|---|---|---|
40 | 0.88 | 0.87 | 0.96 | 1.00 | 0.45 | 0.93 | 0.61 |
35 | 0.88 | 0.87 | 0.92 | 0.99 | 0.48 | 0.93 | 0.63 |
30 | 0.89 | 0.89 | 0.90 | 0.98 | 0.55 | 0.93 | 0.69 |
25 | 0.90 | 0.90 | 0.91 | 0.98 | 0.59 | 0.94 | 0.72 |
20 | 0.91 | 0.90 | 0.92 | 0.99 | 0.62 | 0.94 | 0.74 |
15 | 0.90 | 0.90 | 0.90 | 0.98 | 0.62 | 0.94 | 0.73 |
Test data (%) | Accuracy | Precision (class 0) | Precision (class 1) | Recall (class 0) | Recall (class 1) | F-score (class 0) | F-score (class 1) |
---|---|---|---|---|---|---|---|
40 | 0.88 | 0.87 | 0.96 | 1.00 | 0.45 | 0.93 | 0.61 |
35 | 0.88 | 0.87 | 0.92 | 0.99 | 0.48 | 0.93 | 0.63 |
30 | 0.89 | 0.89 | 0.90 | 0.98 | 0.55 | 0.93 | 0.69 |
25 | 0.90 | 0.90 | 0.91 | 0.98 | 0.59 | 0.94 | 0.72 |
20 | 0.91 | 0.90 | 0.92 | 0.99 | 0.62 | 0.94 | 0.74 |
15 | 0.90 | 0.90 | 0.90 | 0.98 | 0.62 | 0.94 | 0.73 |
Test data (%) | Accuracy | Precision (class 0) | Precision (class 1) | Recall (class 0) | Recall (class 1) | F-score (class 0) | F-score (class 1) |
---|---|---|---|---|---|---|---|
40 | 0.94 | 0.93 | 0.99 | 1.00 | 0.74 | 0.96 | 0.85 |
35 | 0.94 | 0.93 | 0.99 | 1.00 | 0.73 | 0.96 | 0.84 |
30 | 0.95 | 0.94 | 0.98 | 1.00 | 0.76 | 0.97 | 0.86 |
25 | 0.94 | 0.93 | 0.98 | 1.00 | 0.72 | 0.96 | 0.83 |
20 | 0.93 | 0.93 | 0.97 | 1.00 | 0.71 | 0.96 | 0.82 |
15 | 0.95 | 0.94 | 0.97 | 0.99 | 0.79 | 0.97 | 0.87 |
Root Mean-Squared Error (RMSE) and Mean Absolute Scaled Error (MASE) were used to evaluate the model error rate. RMSE penalizes underestimation and overestimation alike, while MASE is useful for assessing relative accuracy. A lower error value implies a better fit.
yi = Observed values
n = Number of observations
ai = Actual time series
fi = Forecast results
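Both error measures can be sketched in a few lines. RMSE is the square root of the mean squared difference between observed and predicted values. For MASE, the sketch scales the mean absolute forecast error by the mean absolute error of a one-step naive forecast computed on the same series; this is a simplifying assumption, as the usual definition takes the scaling term from the training series, and the study's exact formula is not reproduced here. The function names and sample values are hypothetical.

```python
import math


def rmse(observed, predicted):
    """Root mean-squared error between observed and predicted values."""
    n = len(observed)
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(observed, predicted)) / n)


def mase(actual, forecast):
    """Mean absolute forecast error scaled by the naive one-step forecast error."""
    n = len(actual)
    mae = sum(abs(a - f) for a, f in zip(actual, forecast)) / n
    # Naive forecast: each value is predicted by the value that precedes it
    naive_mae = sum(abs(actual[i] - actual[i - 1]) for i in range(1, n)) / (n - 1)
    if naive_mae == 0:
        raise ValueError("series is constant; MASE is undefined")
    return mae / naive_mae


# Hypothetical series
actual = [10.0, 12.0, 14.0, 16.0]
forecast = [11.0, 12.0, 13.0, 16.0]
print(round(rmse(actual, forecast), 4), mase(actual, forecast))
```

A MASE below 1 indicates that the forecast performs better than the naive baseline on this scale.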
The error rate presented in
The key findings of the present study are as follows: On average, the risk period for patients was 12–14 days, beyond which the chances of survival may increase. The main attributes affecting prediction and classification were PIH, gender, and the number of days. The chances of survival for a female patient were high after approximately 16 days, while for a male patient they were high after approximately 18 days. Patients with COVID-19 required the highest number of treatment days when there was no PIH. In addition, female patients required more treatment days than male patients. It was inferred that although the virus mostly affected males, probably because males were more exposed to it at places of work or outside their homes while earning their livelihood, the immunity level of male patients was nonetheless higher than that of female patients. Finally, machine learning algorithms were shown to be capable of prediction and classification in relation to COVID-19. SVM with Grid search was observed to be the most efficient in this regard, followed by DT.
Despite promising results, the present study has certain limitations. The first is the limited availability of data on COVID-19 patients. The information regarding the attributes changes continually, and the dataset is likely to be updated in the future, so some results may vary accordingly. However, this would not critically alter the impact of the important attributes, such as age, PIH, and gender, as these have already been considered individually as well as in association with other attributes. In addition, if the number of countries affected by COVID-19 increases, the results may vary further. Currently, the impact is limited, as the country association is limited. Finally, further research is required to establish how different computational methods could be combined into a hybrid approach that provides the most effective outcomes.
The present report discusses the current scenario of the COVID-19 pandemic in terms of the current statistics, the major symptoms exhibited by patients, and its impact on survival and death rates. Machine learning algorithms, namely linear and non-linear regression, logistic regression, DT, KNN, SVM, and SVM with Grid search, were considered for classification and prediction in relation to COVID-19. The report focuses on three main attributes: PIH, gender, and the number of days. The dataset considered in the present study was obtained from CSSE, JHU. The experimental results indicated that the chances of death are higher for patients with PIH and for older patients. On the basis of these attributes, the average number of treatment days after which the chances of survival may increase could be calculated. In addition, females were indicated to have a lower chance of death than males. SVM with Grid search outperformed all the other algorithms studied in terms of classification accuracy.