Vol.44, No.2, 2023, pp.1009-1024, doi:10.32604/csse.2023.024277
OPEN ACCESS
ARTICLE
An Imbalanced Dataset and Class Overlapping Classification Model for Big Data
Mini Prince1,*, P. M. Joe Prathap2
1 Department of Information Technology, St. Peter’s College of Engineering and Technology, Chennai, 600054, Tamilnadu, India
2 Department of Information Technology, R.M.D Engineering College, Chennai, 601206, Tamilnadu, India
* Corresponding Author: Mini Prince. Email:
Received 12 October 2021; Accepted 17 January 2022; Issue published 15 June 2022
Abstract
Most modern technologies, such as social media, smart cities, and the internet of things (IoT), rely on big data. When big data is used in real-world applications, two data challenges arise: class overlap and class imbalance. When dealing with large datasets, most traditional classifiers become stuck in local optima. As a result, it is necessary to look into new methods for dealing with large data collections. Several solutions have been proposed for overcoming this issue, but the rapid growth of the available data threatens to limit the usefulness of many traditional methods. Methods such as oversampling and undersampling have shown great promise in addressing class imbalance. Among these techniques, the Synthetic Minority Oversampling TechniquE (SMOTE) has produced the best results by generating synthetic samples for the minority class to create a balanced dataset. The issue is that the practical applicability of these techniques is restricted to problems involving tens of thousands of instances or fewer. In this paper, we propose a parallel method that combines SMOTE with a MapReduce strategy and distributes the operation of the algorithm among a group of computational nodes to address the aforementioned problem. Our proposed solution is divided into three stages. The first stage splits the data into different blocks using a mapping function, followed by a pre-processing step for each mapping block that employs a hybrid SMOTE algorithm to solve the class imbalance problem. On each map block, a decision tree model is then constructed. Finally, the decision tree blocks are combined to create a classification model. We used numerous datasets with up to 4 million instances in our experiments to test the proposed scheme's capabilities. The results suggest that the hybrid SMOTE scales well within the proposed framework and also cuts down the processing time.
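The three stages described in the abstract can be illustrated with a minimal single-machine sketch. This is not the authors' implementation: the abstract does not specify the hybrid SMOTE variant, the tree learner, or the combination rule, so the sketch assumes plain SMOTE (interpolation between nearest minority neighbours), a depth-1 decision tree (a stump) as a stand-in for the per-block tree, and majority voting to combine the block models. All function names, the toy data, and the block count are hypothetical, and the "map" step is simulated with list slicing rather than a real MapReduce cluster.

```python
import random
from collections import Counter

def split_into_blocks(rows, n_blocks):
    """Map step (simulated): deal rows round-robin into n_blocks partitions."""
    return [rows[i::n_blocks] for i in range(n_blocks)]

def smote(minority, n_new, k=3, rng=random):
    """Plain SMOTE (assumed variant): each synthetic point lies on the segment
    between a minority point and one of its k nearest minority neighbours."""
    if len(minority) < 2:
        return []  # nothing to interpolate with
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x = minority[i]
        others = minority[:i] + minority[i + 1:]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, neighbour)))
    return synthetic

def balance_block(block, rng):
    """Pre-processing step: oversample every smaller class up to the largest one."""
    by_class = {}
    for features, label in block:
        by_class.setdefault(label, []).append(features)
    target = max(len(points) for points in by_class.values())
    balanced = list(block)
    for label, points in by_class.items():
        for s in smote(points, target - len(points), rng=rng):
            balanced.append((s, label))
    return balanced

def train_stump(block):
    """Depth-1 decision tree (stand-in for a full tree): choose the feature and
    threshold with the fewest training errors."""
    best = None
    for f in range(len(block[0][0])):
        for threshold in sorted({x[f] for x, _ in block}):
            left = Counter(y for x, y in block if x[f] <= threshold)
            right = Counter(y for x, y in block if x[f] > threshold)
            left_label = left.most_common(1)[0][0]
            right_label = right.most_common(1)[0][0] if right else left_label
            errors = (sum(left.values()) - left[left_label]
                      + sum(right.values()) - right[right_label])
            if best is None or errors < best[0]:
                best = (errors, f, threshold, left_label, right_label)
    _, f, threshold, left_label, right_label = best
    return lambda x: left_label if x[f] <= threshold else right_label

def combine(models):
    """Reduce step (assumed rule): majority vote over the per-block models."""
    def predict(x):
        return Counter(m(x) for m in models).most_common(1)[0][0]
    return predict

# Toy imbalanced data (hypothetical): 380 majority rows, 20 minority rows.
rng = random.Random(0)
data = [((rng.gauss(0, 1), rng.gauss(0, 1)), 0) for _ in range(380)]
data += [((rng.gauss(4, 1), rng.gauss(4, 1)), 1) for _ in range(20)]
rng.shuffle(data)

blocks = split_into_blocks(data, 4)                   # stage 1: map into blocks
balanced = [balance_block(b, rng) for b in blocks]    # stage 1: SMOTE per block
models = [train_stump(b) for b in balanced]           # stage 2: one model per block
classify = combine(models)                            # stage 3: combined classifier
print(classify((0.0, 0.0)), classify((5.0, 5.0)))
```

Because each block is balanced and modelled independently, the per-block work could in principle be handed to separate computational nodes; only the final vote needs all models together. In a real deployment the slicing and list comprehensions would be replaced by the map and reduce phases of the chosen framework.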
Keywords
Imbalanced dataset; class overlapping; SMOTE; MapReduce; parallel programming; oversampling
Cite This Article
M. Prince and P. M. Joe Prathap, "An imbalanced dataset and class overlapping classification model for big data," Computer Systems Science and Engineering, vol. 44, no. 2, pp. 1009–1024, 2023.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.