The COVID-19 disease has already spread to more than 213 countries and territories, with more than 27 million confirmed cases worldwide so far, and the numbers keep increasing. In India, this deadly disease was first detected on January 30, 2020, in a student from Kerala who had returned from Wuhan. Because of India’s high population density, varied cultures, and diversity, a separate analysis of each state is warranted. Hence, this paper focuses on a comprehensive analysis of the effect of COVID-19 on Indian states and Union Territories and on the development of a regression model to predict the number of discharged patients and deaths in each state. The performance of the proposed prediction framework is determined using three machine learning regression algorithms, namely Polynomial Regression (PR), Decision Tree Regression, and Random Forest (RF) Regression. The results present a comparative analysis of the states and Union Territories having more than 1000 cases, and the trained model is validated by testing it on later dates. Performance is evaluated using the RMSE metric. The results show that Polynomial Regression, with an RMSE of 0.08, performs best in predicting discharged patients. In contrast, for the prediction of deaths, Random Forest, with an RMSE of 0.14, performs better than the other techniques.

India witnessed its first case of COVID-19 on January 30, 2020, and the number of cases had increased further by February 3, 2020 [

India is among the most populous and densely populated countries in the world. In addition, its technology and medical facilities are not very advanced. For these reasons, containing the virus was a major challenge for the country. Many steps have been taken to stop the rise in the number of COVID-19 cases in India. Thermal screening of every passenger arriving from abroad, especially from China, was started, and those found to have a high fever were sent to isolation [

Most research and articles focus on the total deaths and infected patients for India as a whole. However, because of India’s high population density, varied cultures, and diversity, a separate analysis of each state is worthwhile. Hence, in this paper, we carry out a comparative analysis of the predicted increase in recovered and death cases in different states and union territories of India in the near future, as predicting these cases would help in estimating and arranging beds, ventilators, and other healthcare equipment on time and in saving many lives with proper facilities. This work uses three machine learning regression algorithms: Polynomial Regression, Decision Tree Regression, and Random Forest Regression. The data is taken from the

The rest of the paper is organized as follows: Section 2 describes related research on the coronavirus from around the world. Section 3 discusses the methodology, which includes data processing and the different algorithms used. Section 4 reports the experimental results and analysis. Section 5 concludes the paper and suggests future areas of study.

Today, the world is grappling with Corona (COVID-19) Virus [

Poon et al. [

Qin et al. studied the dysregulation of the immune response, mainly of T lymphocytes, which might have played a significant role in the spread of COVID-19. Data from all the examinations, including peripheral lymphocyte subsets, were studied, and severe cases were compared with other patients [

Caruso et al. analyzed patients with suspected COVID-19 and respiratory symptoms, and the criterion used for exclusion was chest CT scans [

A limitation that can be inferred from the previous works is that inconsistent kinds of evidence are used for prediction: some studies use CT scan images, while others use viral RNA or immune responses, which ultimately causes inconsistency in the results obtained through supervised learning models. In light of this prior research, this study focuses on the analysis and prediction of total discharges and total deaths from COVID-19 in different states of India. Different regression algorithms have been applied for the study. This work is undertaken with people’s health in mind and aims to make them aware of how precautions can save lives when the enemy is in front of them but hidden.

To analyze the impact of COVID-19 in different parts of India, the dataset is collected from trusted sources; different regression algorithms, discussed below, are then applied.

The data used for this research is taken from two websites COVID19 India [

State/UT | Total cases | Discharges | Total deaths |
---|---|---|---|
Andaman & Nicobar Islands | 33 | 33 | 0 |
Andhra Pradesh | 3171 | 2009 | 57 |
Arunachal Pradesh | 2 | 1 | 0 |
Assam | 616 | 62 | 4 |
Bihar | 2983 | 900 | 13 |
Chandigarh | 266 | 187 | 4 |
Chhattisgarh | 361 | 79 | 0 |
Dadra and Nagar Haveli | 2 | 0 | 0 |
Delhi | 14465 | 7223 | 288 |
Goa | 67 | 28 | 0 |
Gujarat | 14821 | 7139 | 915 |

In this paper, three regression models, namely, Polynomial Regression, Decision Tree Regression, and Random Forest Regression, are used to analyze the effect of COVID-19 on various Indian states and Union Territories, along with the prediction of the number of discharged patients and death cases in these states.

Polynomial regression is a special case of linear regression in which the data are fitted with a curvilinear relationship between the target variable and the independent variables, expressed as a polynomial equation. This type of regression gives better results when the association in the data is not linear and helps plot the best-fit curve, which gives the minimum squared error [

The steps to build a polynomial regressor are shown below in

I. Forming the hypothesis function

where,

II. Minimizing the cost function

where, J (^{(i)}) is the actual value, and y^{(i)} is the predicted value. Here,
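
The two steps can be sketched in code. The sketch below assumes the standard textbook form of the hypothesis, h(x) = theta_0 + theta_1 x + ... + theta_d x^d, and minimizes the sum of squared errors in closed form through the normal equations; this may differ from the notation and optimizer used by the authors. The function names (`fit_polynomial`, `predict`) and the synthetic data are hypothetical and serve only as an illustration.

```python
def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations (X^T X) theta = X^T y."""
    n = degree + 1
    # Build the normal-equation system from sums of powers of x.
    A = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum((x ** i) * y for x, y in zip(xs, ys)) for i in range(n)]

    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n):
            factor = A[row][col] / A[col][col]
            for k in range(col, n):
                A[row][k] -= factor * A[col][k]
            b[row] -= factor * b[col]

    # Back substitution to recover the coefficients.
    theta = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(A[i][j] * theta[j] for j in range(i + 1, n))
        theta[i] = (b[i] - s) / A[i][i]
    return theta


def predict(theta, x):
    """Evaluate theta_0 + theta_1 x + ... + theta_d x^d."""
    return sum(t * x ** k for k, t in enumerate(theta))


# Synthetic data generated from y = 2 + 3x + 0.5x^2 (illustrative only, not the paper's data).
days = [0, 1, 2, 3, 4, 5]
counts = [2 + 3 * d + 0.5 * d ** 2 for d in days]
theta = fit_polynomial(days, counts, degree=2)
```

Because the synthetic data is exactly quadratic, the fitted coefficients recover 2, 3, and 0.5 up to floating-point error. On real case counts the degree would have to be chosen by validation, since a high degree can overfit.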

A Decision Tree (DT) regressor is a supervised machine learning algorithm. A decision tree resembles a flow chart: each internal node denotes a test on an attribute, each branch depicts the outcome of that test, and each terminal node carries a class label (or, for regression, a predicted value). Based on the input data, the model learns a set of questions that lead to these outputs. Because regression problems involve continuously varying data, the mean squared error is used to determine the splits. The model is depicted in

Decision trees perform well on tabular data with numerical features, or with categorical features that have far fewer than a hundred categories. Unlike linear models, they can capture non-linear interactions between the features and the target.
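
How a regression tree chooses a split with the mean squared error can be illustrated with a minimal sketch. It assumes a single numerical feature and searches the midpoints between consecutive sorted values for the threshold with the lowest weighted MSE of the two resulting groups. The helper names (`mse`, `best_split`) and the toy data are hypothetical; a full tree would apply the same search recursively to each side.

```python
def mse(values):
    """Mean squared error of values around their own mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def best_split(xs, ys):
    """Return (threshold, weighted_mse) of the best binary split on one feature."""
    pairs = sorted(zip(xs, ys))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate equal feature values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        # Weight each side's MSE by the share of samples it contains.
        score = (len(left) * mse(left) + len(right) * mse(right)) / len(pairs)
        if score < best[1]:
            best = (threshold, score)
    return best


# Toy data with two clearly separated groups (illustrative only).
xs = [1, 2, 3, 10, 11, 12]
ys = [5.0, 6.0, 5.5, 20.0, 21.0, 19.5]
threshold, score = best_split(xs, ys)  # threshold is 6.5, midway between x = 3 and x = 10
```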

A Random Forest is a model that uses an ensemble approach to give good prediction results. It handles both regression and classification tasks by combining multiple decision trees through a technique known as bootstrap aggregation, or bagging [

where h(x) is the sum of the base models; the output is an ensemble of these models, each of which is simply a decision tree. In a decision tree, the tree is formed by choosing the important variables as nodes, whereas in a Random Forest, randomness is added to the model as each tree grows. This model also saves time, as it needs very little hyper-parameter tuning.
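
The bagging idea can be sketched as follows. The sketch assumes the common regression form in which the ensemble output is the average of the base models' predictions, and it uses depth-1 trees (stumps) on a single feature as base models to keep the code short. A real Random Forest also samples a random subset of features at each split, which is omitted here. All names (`fit_stump`, `fit_bagged_ensemble`) and the toy data are hypothetical.

```python
import random


def fit_stump(xs, ys):
    """Fit a depth-1 regression tree: one threshold and one mean per side."""
    pairs = sorted(zip(xs, ys))
    overall = sum(ys) / len(ys)
    best = (float("inf"), None, overall, overall)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # equal feature values cannot be separated
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if sse < best[0]:
            best = (sse, (pairs[i - 1][0] + pairs[i][0]) / 2, lm, rm)
    _, threshold, lm, rm = best
    if threshold is None:
        return lambda x: overall  # degenerate sample: predict the mean
    return lambda x: lm if x <= threshold else rm


def fit_bagged_ensemble(xs, ys, n_models=25, seed=0):
    """Train each base model on a bootstrap sample and average their predictions."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: sum(m(x) for m in models) / len(models)


# Toy data (illustrative only, not the paper's data).
xs = [1, 2, 3, 10, 11, 12]
ys = [5.0, 6.0, 5.5, 20.0, 21.0, 19.5]
ensemble = fit_bagged_ensemble(xs, ys)
```

Averaging over bootstrap samples reduces the variance of any single tree, which is the main reason the ensemble tends to be more stable than one decision tree.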

The proposed model is illustrated in

In the model, 33% of the dataset is used for testing and the remaining 67% for training. Different regression techniques are then applied, their results are compared using the RMSE values, and the best algorithm for future prediction is identified.
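
A 67/33 split of this kind might look like the sketch below. The text does not state whether the split is random or chronological, so the random shuffle, the fixed seed, and the function name `train_test_split` are assumptions made for illustration. For time-ordered case counts, holding out the most recent dates instead may be more appropriate.

```python
import random


def train_test_split(rows, test_fraction=0.33, seed=42):
    """Shuffle the rows and hold out `test_fraction` of them for testing."""
    rng = random.Random(seed)
    shuffled = list(rows)  # copy so the caller's data is not reordered
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(100))  # stand-in for 100 daily records
train, test = train_test_split(rows)
```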

This work presents a comprehensive analysis of the COVID-19 outbreak across various Indian states and Union Territories. As COVID-19 cases are surging in India, it is the need of the hour to analyze the deaths, infected, and recovered cases in various states and to predict them using machine learning models. Different regression models are applied in this study and are discussed further in the subsections, along with the evaluation metrics.

The evaluation metrics used in this work are Mean Squared Error (MSE) and Root Mean Squared Error (RMSE). These are discussed in detail below:

It is obtained by calculating the mean of the squared differences between the observed sample values and the predicted values [

where N is the total number of observations, _{i}

The Root Mean Square Error (RMSE) is the standard deviation of the residuals (prediction errors). Residuals measure how far the data points are from the regression line, and the RMSE measures how spread out those residuals are [

where N is the total number of observations, _{i}
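
Both metrics can be computed in a few lines. The sketch assumes the standard definitions: MSE is the mean over the N observations of the squared difference between the actual and predicted values, and RMSE is its square root. The function names and sample values are hypothetical.

```python
import math


def mean_squared_error(actual, predicted):
    """Mean of the squared differences between actual and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_error(actual, predicted):
    """Square root of the MSE, expressed in the same units as the target."""
    return math.sqrt(mean_squared_error(actual, predicted))


actual = [10.0, 12.0, 15.0, 20.0]
predicted = [11.0, 12.0, 13.0, 22.0]
mse_value = mean_squared_error(actual, predicted)        # (1 + 0 + 4 + 4) / 4 = 2.25
rmse_value = root_mean_squared_error(actual, predicted)  # sqrt(2.25) = 1.5
```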

The virus spreads from other creatures to humans. The COVID-19 and the human coronaviruses are categorized in the family of Coronaviridae [

S. No. | Age group (years) | Percentage |
---|---|---|
1 | 0–9 | 3.18 |
2 | 10–19 | 3.90 |
3 | 20–29 | 24.86 |
4 | 30–39 | 21.10 |
5 | 40–49 | 16.18 |
6 | 50–59 | 11.13 |
7 | 60–69 | 12.86 |
8 | 70–79 | 4.05 |
9 | ≥ 80 | 1.45 |
10 | Missing | 1.30 |

Hence, because of differing population densities, government rules and regulations, and cultures, treating India as a single unit is not appropriate. Therefore, we analyze India’s states and union territories having more than 1000 cases of COVID-19 as of May 28, 2020. The analysis of total infected, deaths, and discharged patients, along with the mortality and recovery rates of all the Indian states and Union Territories, is carried out as follows:

After that, Rajasthan, Madhya Pradesh, and Uttar Pradesh have recorded 7536, 7024, and 6548 cases, respectively. The COVID-19 situation is not yet under control in these states, and they are expected to show tremendous growth in cases in the latter part of the lockdown.

Although the discharge-to-infected ratio is similar in most states, some states show a high discharge-to-infected ratio, which indicates that they are moving in the right direction to control the spread of this deadly pandemic. These are mostly the southern states of the country, including Tamil Nadu, Kerala, and Telangana.

The mortality rates of various Indian states and union territories are represented in

The recovery rates of various Indian states and union territories are represented in

It is evident that initially the mortality rate and the recovery rate were at the same level, and this remained the situation in India until March 1, 2020. After this, the recovery rate started increasing rapidly while the mortality rate remained low. This increase in the recovery rate can be attributed to a higher testing rate and greater awareness of the pandemic among people. Social distancing, the guidelines issued by the Government, and isolation helped in achieving such a high recovery rate in India. The recovery rate is increasing each day, and the mortality rate is tending towards a constant value, which is a good sign for the country.
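
As a worked example, the two rates can be computed from the state-wise counts listed in the table earlier in the paper. The sketch assumes that the recovery rate is discharges divided by total cases and the mortality rate is deaths divided by total cases, both as percentages; the text does not spell out these definitions, so they are an assumption, as is the helper name `rate_percent`.

```python
def rate_percent(part, total):
    """Share of `part` in `total`, as a percentage (0.0 when there are no cases)."""
    return 100.0 * part / total if total else 0.0


# (total cases, discharges, deaths) as listed in the state-wise table
states = {
    "Delhi": (14465, 7223, 288),
    "Gujarat": (14821, 7139, 915),
    "Andhra Pradesh": (3171, 2009, 57),
}

rates = {
    name: {
        "recovery": rate_percent(discharged, total),
        "mortality": rate_percent(deaths, total),
    }
    for name, (total, discharged, deaths) in states.items()
}

for name, r in rates.items():
    print(f"{name}: recovery {r['recovery']:.2f}%, mortality {r['mortality']:.2f}%")
```

Under these definitions, Delhi and Gujarat have similar recovery rates of just under 50%, but Gujarat's mortality rate is roughly three times Delhi's.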

After the analysis of the total infected, discharges, and deaths due to COVID-19 in the states and Union Territories having more than 1000 cases in India, the performance of the trained regression models is evaluated using the root mean square error metric. The RMSE is calculated by validating the predicted numbers of discharged patients and deaths on later dates and computing the corresponding error for each regression algorithm. For predicting discharged cases, the RMSE is 0.08 for PR, 0.66 for Decision Tree Regression, and 0.74 for RF Regression; for predicting death cases, the corresponding RMSE values are 0.97, 0.15, and 0.14.

Machine learning approaches are valuable for detection and classification tasks [

Algorithm used | RMSE (discharges) |
---|---|
Polynomial regression | 0.08 |
Decision tree | 0.66 |
Random forest | 0.74 |

The RMSE values of the three algorithms, namely Polynomial Regression, Decision Tree, and Random Forest, for the prediction of total deaths, are provided in

Algorithm used | RMSE (deaths) |
---|---|
Polynomial regression | 0.97 |
Decision tree | 0.15 |
Random forest | 0.14 |
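
The comparison in the two RMSE tables reduces to picking, for each target, the algorithm with the lowest error. A minimal sketch using the reported values (the dictionary layout and variable names are hypothetical):

```python
# RMSE values as reported in the two tables.
rmse = {
    "discharges": {
        "Polynomial regression": 0.08,
        "Decision tree": 0.66,
        "Random forest": 0.74,
    },
    "deaths": {
        "Polynomial regression": 0.97,
        "Decision tree": 0.15,
        "Random forest": 0.14,
    },
}

# A lower RMSE means a better fit, so take the minimum per target.
best = {target: min(scores, key=scores.get) for target, scores in rmse.items()}
```

This selects Polynomial regression for discharges and Random forest for deaths. For deaths, the margin between Random forest (0.14) and Decision tree (0.15) is small.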

This work carries out a comparative analysis of the predicted increase in the number of recovered and death cases in different states and union territories of India in the near future, as predicting these cases would help in estimating and arranging beds, ventilators, and other healthcare equipment on time and in saving many lives with proper facilities. The analysis covers the rising cases and deaths, mortality and recovery rates, gender and age-group based analysis, and case distribution. The research also focuses on developing the best regression model to predict the number of discharged patients and deaths in each state and union territory. In this work, the performance of the proposed framework is determined using three machine learning regression algorithms, namely Polynomial Regression, Decision Tree Regression, and Random Forest Regression. The results present a comparative analysis of the states and union territories having more than 1000 cases, and the trained model is validated by testing it on later dates. Performance is evaluated using the RMSE metric. The results show that Polynomial Regression, with an RMSE of 0.08, performs best in predicting discharged patients. In contrast, for the prediction of deaths, Random Forest, with an RMSE of 0.14, performs better than the other techniques. Overall, this study provides a detailed analysis and prediction and helps in understanding the overall COVID-19 scenario across India.