Open Access
ARTICLE
Dealing with the Class Imbalance Problem in the Detection of Fake Job Descriptions
1 Faculty of Information Technology, Ho Chi Minh City University of Technology (HUTECH), Ho Chi Minh City, Vietnam
2 Faculty of Information Technology, Ton Duc Thang University, Ho Chi Minh City, Vietnam
3 Faculty of Information Technology, Ho Chi Minh City Open University, Ho Chi Minh City, Vietnam
4 Department of Electronics & Communication Engineering, SRM Institute of Science and Technology, NCR Campus, Ghaziabad, India
5 Informetrics Research Group, Ton Duc Thang University, Ho Chi Minh City, Vietnam
* Corresponding Author: Tuong Le. Email:
Computers, Materials & Continua 2021, 68(1), 521-535. https://doi.org/10.32604/cmc.2021.015645
Received 01 December 2020; Accepted 02 February 2021; Issue published 22 March 2021
Abstract
In recent years, the detection of fake job descriptions has become increasingly necessary because social networking has changed the way people access the rapidly growing volume of information in the internet age. Identifying fraud in job descriptions can help jobseekers avoid many of the risks of job hunting. However, the detection of fake job descriptions faces a class imbalance problem, since the number of genuine jobs far exceeds the number of fake ones, which reduces the predictability and performance of traditional machine learning models. We therefore present an efficient framework that uses an oversampling technique, called FJD-OT (Fake Job Description Detection Using Oversampling Techniques), to improve the predictability of fake job description detection. In the first module of the proposed framework, we preprocess the text data using several techniques, including stop-word removal and tokenization. In the second module, we use a bag-of-words representation combined with term frequency-inverse document frequency (TF-IDF) weighting to extract features from the text and create the feature dataset. Next, the framework applies k-fold cross-validation, a commonly used technique for testing the effectiveness of machine learning models, which splits the experimental dataset [the Employment Scam Aegean (ESA) dataset in our study] into training and test sets for evaluation. The training set is passed through the third module, an oversampling module in which the SVMSMOTE method is used to balance the data before training the classifiers in the last module. The experimental results indicate that the proposed approach significantly improves the predictability of fake job description detection on the ESA dataset according to several popular performance metrics.
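The pipeline described in the abstract maps naturally onto common Python libraries. The sketch below is a minimal illustration rather than the authors' implementation: it assumes the ESA data is available as a CSV with "description" and "fraudulent" columns (the file and column names are assumptions), uses scikit-learn's TfidfVectorizer for stop-word removal, tokenization, and TF-IDF feature extraction, imbalanced-learn's SVMSMOTE for oversampling the training folds, and a random forest as a stand-in classifier.

```python
# Minimal sketch of the FJD-OT-style pipeline, assuming the ESA dataset is a CSV
# with "description" and "fraudulent" columns (names are assumptions).
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from imblearn.over_sampling import SVMSMOTE

# Modules 1-2: preprocess the text (stop-word removal, tokenization) and build
# bag-of-words features weighted by TF-IDF.
df = pd.read_csv("fake_job_postings.csv")  # hypothetical file name
texts = df["description"].fillna("")
y = df["fraudulent"].values
X = TfidfVectorizer(stop_words="english", max_features=5000).fit_transform(texts)

# k-fold cross-validation: split the feature dataset into training and test sets.
scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5, shuffle=True,
                                           random_state=42).split(X, y):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]

    # Module 3: oversample only the training fold with SVMSMOTE to balance classes.
    X_res, y_res = SVMSMOTE(random_state=42).fit_resample(X_train, y_train)

    # Module 4: train a classifier (random forest used here as a placeholder).
    clf = RandomForestClassifier(n_estimators=200, random_state=42)
    clf.fit(X_res, y_res)
    scores.append(f1_score(y_test, clf.predict(X_test)))

print("Mean F1 over folds:", sum(scores) / len(scores))
```

Note that the oversampling is applied only to each training fold, never to the test fold, so the reported metrics reflect the original (imbalanced) class distribution.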
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.