Open Access
ARTICLE
Chi-Square and PCA Based Feature Selection for Diabetes Detection with Ensemble Classifier
1 School of Computing and Information Sciences, Florida International University, USA
2 Department of Computer Science, Khwaja Fareed University of Engineering and Information Technology, Rahim Yar Khan, 64200, Pakistan
3 Department of Computer Science, Broward College, Broward County, Florida, USA
4 Department of Information and Communication Engineering, Yeungnam University, Gyeongsan-si, 38541, Korea
* Corresponding Author: Imran Ashraf. Email:
Intelligent Automation & Soft Computing 2023, 36(2), 1931-1949. https://doi.org/10.32604/iasc.2023.028257
Received 06 February 2022; Accepted 20 June 2022; Issue published 05 January 2023
Abstract
Diabetes mellitus is a metabolic disease that the World Health Organization ranks among the top 10 causes of death. An alarming increase has been observed worldwide in recent years, with a 70% rise in the disease since 2000 and an 80% rise in male deaths. If untreated, it leads to complications in many vital organs of the human body, which may prove fatal. Early detection of diabetes is therefore important for starting timely treatment. This study introduces a methodology for classifying diabetic and normal people using an ensemble machine learning model and a feature fusion of Chi-square and principal component analysis. The proposed ensemble model, logistic tree classifier (LTC), combines logistic regression and extra tree classifier through a soft voting mechanism. For comparison, experiments are also performed with several well-known machine learning algorithms: logistic regression, extra tree classifier, AdaBoost, Gaussian naive Bayes, decision tree, random forest, and k-nearest neighbor. In addition, several experiments are carried out using principal component analysis (PCA) and Chi-square (Chi-2) features to analyze the influence of feature selection on the performance of machine learning classifiers. Results indicate that Chi-2 features yield higher performance than both PCA features and the original features. However, the highest accuracy is obtained when the proposed ensemble model LTC is used with the proposed feature fusion framework, which achieves an accuracy score of 0.85, the highest among the available approaches for diabetes prediction. In addition, a statistical t-test confirms the statistical significance of the proposed approach over other approaches.
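The pipeline summarized in the abstract (Chi-2 and PCA feature fusion feeding a soft-voting ensemble of logistic regression and extra trees) can be illustrated with a minimal scikit-learn sketch. This is not the authors' implementation: the synthetic data, the number of selected Chi-2 features (`k=4`), the number of principal components (`n_components=3`), the min-max scaling step, and all hyperparameters are assumptions chosen only for illustration, and the resulting accuracy says nothing about the 0.85 reported in the study.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import ExtraTreesClassifier, VotingClassifier
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (assumption): 8 numeric features, binary label.
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Feature fusion: concatenate the top-k Chi-2 features with PCA components.
# k and n_components are illustrative values, not taken from the paper.
fusion = FeatureUnion([
    ("chi2", SelectKBest(chi2, k=4)),
    ("pca", PCA(n_components=3, random_state=0)),
])

# Soft voting averages the class probabilities of the two base models.
ltc = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("et", ExtraTreesClassifier(n_estimators=200, random_state=0)),
    ],
    voting="soft",
)

# Chi-2 scoring requires non-negative inputs, hence the min-max scaling
# (an assumption of this sketch).
model = Pipeline([
    ("scale", MinMaxScaler()),
    ("fusion", fusion),
    ("ltc", ltc),
])

model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
proba = model.predict_proba(X_test)
print(f"Test accuracy on synthetic data: {accuracy:.2f}")
```

Wrapping the scaler, the fusion step, and the ensemble in a single `Pipeline` keeps the feature selection fitted on the training split only, which avoids leaking test data into the Chi-2 scores or the PCA projection.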
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.