Benchmarking Performance of Document Level Classification and Topic Modeling

Muhammad Bhatti; Azmat Ullah; Rohaya Latip; Abid Sohail; Anum Riaz; Rohail Hassan

doi:10.32604/cmc.2022.020083

Open Access

ARTICLE

Muhammad Shahid Bhatti¹,*, Azmat Ullah¹, Rohaya Latip², Abid Sohail¹, Anum Riaz¹, Rohail Hassan³

1 Department of Computer Science, COMSATS University Islamabad, Lahore, Pakistan
2 Department of Communication Technology and Network, Faculty of Computer Science, Universiti Putra Malaysia, Selangor, 43400, Malaysia
3 Othman Yeop Abdullah Graduate School of Business (OYAGSB), Universiti Utara Malaysia (UUM), Kuala Lumpur, 50300, Malaysia

* Corresponding Author: Muhammad Shahid Bhatti.

Computers, Materials & Continua 2022, 71(1), 125-141. https://doi.org/10.32604/cmc.2022.020083

Received 08 May 2021; Accepted 18 June 2021; Issue published 03 November 2021

Abstract

Text classification for low-resource languages is a non-trivial and challenging problem. This paper discusses Urdu news classification and Urdu document similarity. Urdu is one of the most widely spoken languages in Asia. The use of computational methods for text classification has increased over time. However, Urdu has received little research attention and lacks readily available datasets, which is the primary reason why research on Urdu is limited and the latest methods are rarely applied to it. To overcome these obstacles, a medium-sized dataset with six categories was collected from authentic Pakistani news sources. Urdu is a rich but complex language, and its complex features make text processing more challenging than for other languages. A term frequency-inverse document frequency (TFIDF) term weighting scheme is used for feature extraction, chi-2 for selecting essential features, and linear discriminant analysis (LDA) for dimensionality reduction. The TFIDF matrix and the cosine similarity measure are used to identify similar documents in a collection, and the FastText model is applied to find the semantic meaning of words in a document. A training-test split evaluation methodology is used, with 70% of the data for training and 30% for testing. State-of-the-art machine learning and deep dense neural network approaches are applied to Urdu news classification: Multinomial Naïve Bayes, XGBoost, Bagging, and a deep dense neural network were trained. Bagging and the deep dense neural network outperformed the other algorithms. The experimental results show that the deep dense neural network achieves a 92.0% mean F1 score and Bagging a 95.0% F1 score.
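To make the document-similarity step concrete, the following is a minimal Python sketch of TFIDF weighting followed by cosine similarity. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the TFIDF variant, so relative term frequency with a plain logarithmic inverse document frequency is assumed here, and the function names (`tfidf_vectors`, `cosine_similarity`, `most_similar`) and the toy English corpus are hypothetical placeholders rather than the paper's Urdu data.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            # Assumed variant: relative term frequency times log(N / df).
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # A zero vector is treated as similar to nothing.
    return dot / (norm_a * norm_b)


def most_similar(query_index, vectors):
    """Index of the document closest to vectors[query_index], excluding itself."""
    scores = [
        (cosine_similarity(vectors[query_index], vec), i)
        for i, vec in enumerate(vectors)
        if i != query_index
    ]
    return max(scores)[1]


# Toy, hypothetical corpus (placeholder English tokens, not the paper's data).
docs = [
    "cricket team wins match".split(),
    "cricket match ends in draw".split(),
    "stock market falls sharply".split(),
    "market shares rise".split(),
]
vectors = tfidf_vectors(docs)
```

Calling `most_similar(0, vectors)` on this toy corpus selects the other cricket document, because the two share the weighted terms "cricket" and "match", while documents with no terms in common score exactly zero. A real pipeline would also need Urdu-specific tokenization and preprocessing, which this sketch leaves out.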

Keywords

Deep neural network; machine learning; natural language processing; TFIDF; sparse matrix; cosine similarity; classification; linear discriminant analysis; gradient boosting

Cite This Article

APA Style

Bhatti, M.S., Ullah, A., Latip, R., Sohail, A., Riaz, A. et al. (2022). Benchmarking Performance of Document Level Classification and Topic Modeling. Computers, Materials & Continua, 71(1), 125–141. https://doi.org/10.32604/cmc.2022.020083

Vancouver Style

Bhatti MS, Ullah A, Latip R, Sohail A, Riaz A, Hassan R. Benchmarking Performance of Document Level Classification and Topic Modeling. Comput Mater Contin. 2022;71(1):125–141. https://doi.org/10.32604/cmc.2022.020083

IEEE Style

M. S. Bhatti, A. Ullah, R. Latip, A. Sohail, A. Riaz, and R. Hassan, “Benchmarking Performance of Document Level Classification and Topic Modeling,” Comput. Mater. Contin., vol. 71, no. 1, pp. 125–141, 2022. https://doi.org/10.32604/cmc.2022.020083

Copyright © 2022 The Author(s). Published by Tech Science Press.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
