Talha Farooq Khan1, Majid Hussain1, Muhammad Arslan2, Muhammad Saeed1, Lal Khan3,*, Hsien-Tsung Chang4,5,6,*
CMES-Computer Modeling in Engineering & Sciences, Vol.145, No.3, pp. 3883-3911, 2025, DOI:10.32604/cmes.2025.073021
- 23 December 2025
Abstract Text clustering is an important task because of its vital role in NLP-related tasks. However, existing research on clustering is mainly based on the English language, with limited work on low-resource languages, such as Urdu. Low-resource language text clustering has many drawbacks in the form of limited annotated collections and strong linguistic diversity. The primary aim of this paper is twofold: (1) By introducing a clustering dataset named UNC-2025 comprises 100k Urdu news documents, and (2) a detailed empirical standard of Large Language Model (LLM) improved clustering methods for Urdu text. We explicitly evaluate the… More >