Roman Urdu News Headline Classification Empowered with Machine Learning

Rizwan Naqvi; Muhammad Khan; Nauman Malik; Shazia Saqib; Tahir Alyas; Dildar Hussain

doi:10.32604/cmc.2020.011686

Open Access

ARTICLE

Rizwan Ali Naqvi¹, Muhammad Adnan Khan²,*, Nauman Malik², Shazia Saqib², Tahir Alyas², Dildar Hussain³

1 Department of Unmanned Vehicle Engineering, Sejong University, Seoul, 05006, Korea.
2 Department of Computer Science, Lahore Garrison University, Lahore, 54000, Pakistan.
3 School of Computational Sciences, Korea Institute for Advanced Study, Seoul, 02455, Korea.

* Corresponding Author: Muhammad Adnan Khan. Email: email .

Computers, Materials & Continua 2020, 65(2), 1221-1236. https://doi.org/10.32604/cmc.2020.011686

Received 23 May 2020; Accepted 29 June 2020; Issue published 20 August 2020

Abstract

Roman Urdu has been used for text messaging over the Internet for years, especially in the Indo-Pak subcontinent. People from the subcontinent may speak the same Urdu language yet write it in different scripts. Writing Urdu in Roman characters on social media is now regarded as the most common mode of communication in the Indian subcontinent, which makes it a rich source of information. English text classification is often regarded as a solved problem, but only a few efforts have examined the rich information source that Roman Urdu offers. This is due to the numerous complexities involved in processing Roman Urdu data, including the unavailability of a tagged corpus, the lack of a standard set of rules, and the lack of standardized spellings. A large amount of Roman Urdu news data is available on mainstream news websites and on social media sites such as Facebook and Twitter, but meaningful information can only be extracted if the data is in a structured format. We have developed a Roman Urdu news headline classifier that assigns news to relevant categories, on which further analysis and modeling can be done. The classifier sorts news into five categories: health, business, technology, sports, and international. First, we build the news dataset using scraping tools; then, after preprocessing, we compare the results of different machine learning algorithms such as Logistic Regression (LR), Multinomial Naïve Bayes (MNB), Long Short-Term Memory (LSTM), and Convolutional Neural Network (CNN). After this, we use a phonetic algorithm to control lexical variation and test on news from different websites. The preliminary results suggest that more accurate classification can be achieved by monitoring noise in the data when classifying the news. After applying the machine learning algorithms listed above, the results show that the Multinomial Naïve Bayes classifier gives the best accuracy, 90.17%, which is attributed to the noise caused by lexical variation.
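The idea of controlling lexical variation with a phonetic key before classification can be sketched in a few lines. The block below is only an illustration under stated assumptions, not the authors' pipeline: the function `phonetic_key`, its vowel-dropping rules, the class `TinyMultinomialNB`, and the six toy headlines are all hypothetical and were invented for this example. Only the category names come from the abstract.

```python
import math
import re
from collections import Counter, defaultdict


def phonetic_key(word):
    """Map spelling variants to one key (hypothetical rules, not the paper's algorithm)."""
    w = re.sub(r"[^a-z]", "", word.lower())      # keep ASCII letters only
    if not w:
        return ""
    # Keep the first letter, drop later vowels (and y), then squeeze repeated letters.
    key = w[0] + re.sub(r"[aeiouy]", "", w[1:])
    return re.sub(r"(.)\1+", r"\1", key)


def tokenize(headline):
    """Split on whitespace and replace every word by its phonetic key."""
    keys = (phonetic_key(w) for w in headline.split())
    return [k for k in keys if k]


class TinyMultinomialNB:
    """Multinomial Naive Bayes with additive (Laplace) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, headlines, labels):
        self.class_counts = Counter(labels)
        self.token_counts = defaultdict(Counter)
        for headline, label in zip(headlines, labels):
            self.token_counts[label].update(tokenize(headline))
        self.vocab = {t for c in self.token_counts.values() for t in c}
        return self

    def predict(self, headline):
        # Tokens never seen in training carry no evidence and are skipped.
        tokens = [t for t in tokenize(headline) if t in self.vocab]
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(n_docs / total_docs)          # log prior
            score += sum(math.log((counts[t] + self.alpha) / denom) for t in tokens)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data, invented for illustration only.
train = [
    ("cricket team ne match jeet liya", "sports"),
    ("hockey khiladi zakhmi ho gaya", "sports"),
    ("sehat ke liye warzish zaroori hai", "health"),
    ("dengue bukhar ke mareez barh gaye", "health"),
    ("stock market mein tezi", "business"),
    ("rupay ki qeemat mein kami", "business"),
]
model = TinyMultinomialNB().fit([h for h, _ in train], [c for _, c in train])

# Headlines with spellings that differ from the training data.
for query in ["sehhat ke liay warzish", "hocky team jeet gayi", "stock markit mein tezi"]:
    print(query, "->", model.predict(query))
```

A key this aggressive also merges unrelated short words (for example, "ke" and "ki" share a key), so a real system would need more careful rules and a far larger dataset than this toy set.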

Keywords

Roman Urdu, news headline classification, long short-term memory, recurrent neural network, logistic regression, multinomial naïve Bayes, random forest, k neighbor, gradient boosting classifier.

Copyright © 2020 The Author(s). Published by Tech Science Press.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.