News Text Topic Clustering Optimized Method Based on TF-IDF Algorithm on Spark

Zhuo Zhou; Jiaohua Qin; Xuyu Xiang; Yun Tan; Qiang Liu; Neal Xiong

doi:10.32604/cmc.2020.06431

Open Access

ARTICLE

Zhuo Zhou¹, Jiaohua Qin¹,*, Xuyu Xiang¹, Yun Tan¹, Qiang Liu¹, Neal N. Xiong²

1 College of Computer Science and Information Technology, Central South University of Forestry & Technology, Changsha, 410114, China.
2 Department of Mathematics and Computer Science, Northeastern State University, OK, 74464, USA.

* Corresponding Author: Jiaohua Qin.

Computers, Materials & Continua 2020, 62(1), 217-231. https://doi.org/10.32604/cmc.2020.06431

Abstract

Text topic clustering on a stand-alone architecture is slow at big-data scale. Taking news text as the research object, this paper proposes an LDA (Latent Dirichlet Allocation) text topic clustering algorithm built on the Spark big data platform. The TF-IDF (term frequency-inverse document frequency) algorithm under Spark maps words irreversibly, so the mapped word indexes cannot be traced back to the original words. This paper therefore proposes an optimized TF-IDF method under Spark that ensures the text words can be restored. First, text features are extracted by the proposed method, which combines the TF-IDF algorithm with CountVectorizer; the features are then input to the LDA topic model for training; finally, the text topic clusters are obtained. Experimental results show that, for large data samples, running on Spark improves the processing speed of LDA topic model clustering. In addition, compared with an LDA topic model that takes word frequency as input, the proposed model achieves lower perplexity.
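The reversibility issue described in the abstract can be illustrated with a small plain-Python sketch. It does not use Spark and is not the authors' implementation: the toy documents, the bucket count `NUM_BUCKETS`, and the helper names `hashed_index`, `tfidf_vectors`, and `top_terms` are all hypothetical, and the smoothed IDF form log((N + 1) / (df + 1)) is an assumption chosen for illustration. The sketch contrasts a hash-based word index, which keeps no way back to the word, with an explicit vocabulary of the kind a count-vectorizer step records, which lets weighted indexes be mapped back to words.

```python
import math
import zlib
from collections import Counter

# Toy tokenized "news" documents (illustrative data, not from the paper).
docs = [
    ["stock", "market", "rises", "stock"],
    ["team", "wins", "match"],
    ["market", "team", "news"],
]

# Hash-style mapping: word -> bucket. No inverse table is kept, and with
# fewer buckets than distinct words some words must share an index.
NUM_BUCKETS = 4  # hypothetical size, kept small to force collisions


def hashed_index(word):
    return zlib.crc32(word.encode("utf-8")) % NUM_BUCKETS


# Vocabulary-style mapping: an explicit, invertible word <-> index table.
vocab = sorted({word for doc in docs for word in doc})


def tfidf_vectors(documents, vocabulary):
    """Return one sparse {index: weight} dict per document."""
    index = {word: i for i, word in enumerate(vocabulary)}
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # count each word once per document
    n_docs = len(documents)
    vectors = []
    for doc in documents:
        term_freq = Counter(doc)
        vectors.append({
            index[word]: count * math.log((n_docs + 1) / (doc_freq[word] + 1))
            for word, count in term_freq.items()
        })
    return vectors


def top_terms(vector, vocabulary, k=2):
    """Map the k highest-weighted indexes back to their words."""
    ranked = sorted(vector.items(), key=lambda item: (-item[1], item[0]))
    return [vocabulary[i] for i, _ in ranked[:k]]


vectors = tfidf_vectors(docs, vocab)
print(top_terms(vectors[0], vocab))  # → ['stock', 'rises']
```

Because the vocabulary list is stored alongside the vectors, any index that a downstream topic model reports as important can be turned back into a readable word, which is not possible from the hashed indexes alone.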

Keywords

News text topic clustering, Spark platform, CountVectorizer algorithm, TF-IDF algorithm, latent Dirichlet allocation model.

Cite This Article

APA Style

Zhou, Z., Qin, J., Xiang, X., Tan, Y., Liu, Q. et al. (2020). News Text Topic Clustering Optimized Method Based on TF-IDF Algorithm on Spark. Computers, Materials & Continua, 62(1), 217–231. https://doi.org/10.32604/cmc.2020.06431

Vancouver Style

Zhou Z, Qin J, Xiang X, Tan Y, Liu Q, Xiong NN. News Text Topic Clustering Optimized Method Based on TF-IDF Algorithm on Spark. Comput Mater Contin. 2020;62(1):217–231. https://doi.org/10.32604/cmc.2020.06431

IEEE Style

Z. Zhou, J. Qin, X. Xiang, Y. Tan, Q. Liu, and N. N. Xiong, “News Text Topic Clustering Optimized Method Based on TF-IDF Algorithm on Spark,” Comput. Mater. Contin., vol. 62, no. 1, pp. 217–231, 2020. https://doi.org/10.32604/cmc.2020.06431

Citations: 20

Copyright © 2020 The Author(s). Published by Tech Science Press.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
