Open Access
ARTICLE
News Text Topic Clustering Optimized Method Based on TF-IDF Algorithm on Spark
Zhuo Zhou1, Jiaohua Qin1,*, Xuyu Xiang1, Yun Tan1, Qiang Liu1, Neal N. Xiong2
1 College of Computer Science and Information Technology, Central South University of Forestry &
Technology, Changsha, 410114, China.
2 Department of Mathematics and Computer Science, Northeastern State University, OK, 74464, USA.
* Corresponding Author: Jiaohua Qin. Email:
Computers, Materials & Continua 2020, 62(1), 217-231. https://doi.org/10.32604/cmc.2020.06431
Abstract
Text topic clustering is slow on a stand-alone architecture when applied to big data.
This paper takes news text as its research object and proposes an LDA text topic
clustering algorithm based on the Spark big data platform. Because the TF-IDF (term
frequency-inverse document frequency) algorithm under Spark maps words irreversibly,
the mapped word indexes cannot be traced back to the original words. This paper
therefore proposes an optimized TF-IDF method under Spark that ensures the text words
can be restored. First, text features are extracted by the proposed TF-IDF algorithm
combined with CountVectorizer, and the features are then input to the LDA (Latent
Dirichlet Allocation) topic model for training. Finally, the text topic clusters are
obtained. Experimental results show that, for large data samples, running on Spark
improves the processing speed of LDA topic model clustering. In addition, compared
with an LDA topic model that takes word frequency as input, the proposed model
achieves lower perplexity.
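The difference between an irreversible word mapping and a restorable one can be illustrated with a minimal, framework-free sketch. It is not the authors' Spark implementation: the toy corpus, the function names, and the use of `zlib.crc32` as a stand-in hash are all assumptions made for illustration, and the LDA training step is omitted. The smoothed IDF form log((N + 1) / (df + 1)) is assumed here as well.

```python
import math
import zlib
from collections import Counter

# Hypothetical toy corpus of tokenized news snippets (illustration only).
docs = [
    ["spark", "cluster", "news", "topic"],
    ["news", "topic", "model", "lda"],
    ["spark", "lda", "spark", "speed"],
]


def hashed_tf(tokens, num_features=4):
    """Hashing-style term frequency: index = hash(word) mod num_features.

    Only the index is stored, so it cannot be traced back to a word,
    and distinct words may collide on the same index.
    crc32 is a stand-in hash chosen for this sketch.
    """
    counts = Counter()
    for word in tokens:
        counts[zlib.crc32(word.encode("utf-8")) % num_features] += 1
    return dict(counts)


def build_vocabulary(corpus):
    """Vocabulary-style mapping: one index per distinct word, list retained."""
    return sorted({word for tokens in corpus for word in tokens})


def count_vector(tokens, vocabulary):
    """Sparse term counts keyed by vocabulary index."""
    index_of = {word: i for i, word in enumerate(vocabulary)}
    return dict(Counter(index_of[word] for word in tokens))


def idf_weights(count_vectors, vocab_size):
    """Smoothed IDF, log((N + 1) / (df + 1)), assumed for this sketch."""
    n_docs = len(count_vectors)
    doc_freq = Counter(i for vec in count_vectors for i in vec)
    return [math.log((n_docs + 1) / (doc_freq[i] + 1)) for i in range(vocab_size)]


def tfidf(vec, idf):
    """Weight each term count by the IDF of its index."""
    return {i: tf * idf[i] for i, tf in vec.items()}


vocabulary = build_vocabulary(docs)
count_vectors = [count_vector(d, vocabulary) for d in docs]
idf = idf_weights(count_vectors, len(vocabulary))
tfidf_vectors = [tfidf(v, idf) for v in count_vectors]

# Because the vocabulary is kept, a weighted index maps back to its word.
top_index = max(tfidf_vectors[2], key=tfidf_vectors[2].get)
top_word = vocabulary[top_index]
print(top_word)
```

With the vocabulary retained, the highest-weighted index of any document (or, later, of any topic) can be looked up directly as a word. With the hashed variant, seven distinct words are squeezed into four indexes, so at least two words necessarily share an index and the original word cannot be recovered.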
Cite This Article
Z. Zhou, J. Qin, X. Xiang, Y. Tan, Q. Liu et al., "News text topic clustering optimized
method based on tf-idf algorithm on spark," Computers, Materials & Continua, vol. 62,
no. 1, pp. 217–231, 2020.