Open Access

ARTICLE


Research on Performance Optimization of Spark Distributed Computing Platform

Qinlu He1,*, Fan Zhang1, Genqing Bian1, Weiqi Zhang1, Zhen Li2

1 School of Information and Control Engineering, Xi’an University of Architecture and Technology, Xi’an, 710054, China
2 Shaanxi Institute of Metrology Science, Xi’an, 710043, China

* Corresponding Author: Qinlu He. Email: email

Computers, Materials & Continua 2024, 79(2), 2833-2850. https://doi.org/10.32604/cmc.2024.046807

Abstract

Spark, a distributed computing platform, has developed rapidly in the field of big data. Its in-memory computing model reduces disk-read overhead and shortens data processing time, giving it broad application prospects in large-scale computing tasks such as machine learning and image processing. However, the performance of the Spark platform still leaves room for improvement: when a large number of tasks are processed simultaneously, Spark's cache replacement mechanism cannot identify high-value data partitions, so memory resources are not fully utilized and platform performance suffers. To address the inability of Spark's default cache replacement algorithm to accurately evaluate high-value data partitions, the factors influencing a data partition's weight are first modeled and evaluated. Based on this weighted model, a cache replacement algorithm driven by dynamically weighted data value is then proposed; it accounts for both hit rate and differences among partitions, and integrates with the LRU (Least Recently Used) strategy. The weight update algorithm refreshes a partition's weight whenever its information changes, accurately measuring the partition's importance to the current job; the cache removal algorithm clears partitions that no longer hold useful value, releasing memory resources; and the weight replacement algorithm combines partition weights with partition information to replace RDD partitions when the remaining memory is insufficient. Finally, the proposed algorithm is verified experimentally on a Spark cluster. Experiments show that it effectively improves the cache hit rate, enhances platform performance, and reduces job execution time by 7.61% compared with existing improved algorithms.
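The abstract describes three cooperating mechanisms: weight updates on partition access, removal of worthless partitions, and weight-based replacement when memory runs out. As a rough illustration only (the paper's actual weight formula and Spark integration are not given here), the sketch below models a cache whose eviction victim is the lowest-weight partition rather than the plain LRU choice. The class name, the weight formula combining hit count, recomputation cost, size, and recency, and all parameters are hypothetical.

```python
import time

class WeightedPartitionCache:
    """Toy dynamic-weight cache replacement policy (illustrative, not the
    paper's exact algorithm). Each cached RDD partition carries a weight
    combining usage frequency, recomputation cost, size, and recency; when
    memory is insufficient, the lowest-weight partition is evicted."""

    def __init__(self, capacity_mb):
        self.capacity = capacity_mb
        self.used = 0
        # partition_id -> {size, cost, hits, last_used}
        self.entries = {}

    def _weight(self, e):
        # Hypothetical weight: frequently used, expensive-to-recompute,
        # small, recently used partitions are worth keeping.
        recency = 1.0 / (1.0 + time.time() - e["last_used"])
        return e["hits"] * e["cost"] * recency / e["size"]

    def access(self, pid):
        # Weight update step: a hit refreshes the partition's weight inputs.
        e = self.entries.get(pid)
        if e is None:
            return False
        e["hits"] += 1
        e["last_used"] = time.time()
        return True

    def put(self, pid, size_mb, recompute_cost):
        # Weight replacement step: evict lowest-weight partitions until
        # the new partition fits.
        while self.used + size_mb > self.capacity and self.entries:
            victim = min(self.entries,
                         key=lambda k: self._weight(self.entries[k]))
            self.used -= self.entries.pop(victim)["size"]
        if self.used + size_mb <= self.capacity:
            self.entries[pid] = {"size": size_mb, "cost": recompute_cost,
                                 "hits": 1, "last_used": time.time()}
            self.used += size_mb
```

For example, with a 100 MB cache holding a frequently hit, costly 60 MB partition and a rarely hit, cheap 30 MB partition, inserting a new 40 MB partition evicts the cheap one, whereas plain LRU would consider only recency.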

Keywords


Cite This Article

APA Style
He, Q., Zhang, F., Bian, G., Zhang, W., Li, Z. (2024). Research on performance optimization of spark distributed computing platform. Computers, Materials & Continua, 79(2), 2833-2850. https://doi.org/10.32604/cmc.2024.046807
Vancouver Style
He Q, Zhang F, Bian G, Zhang W, Li Z. Research on performance optimization of spark distributed computing platform. Comput Mater Contin. 2024;79(2):2833-2850. https://doi.org/10.32604/cmc.2024.046807
IEEE Style
Q. He, F. Zhang, G. Bian, W. Zhang, and Z. Li, "Research on Performance Optimization of Spark Distributed Computing Platform," Comput. Mater. Contin., vol. 79, no. 2, pp. 2833-2850, 2024. https://doi.org/10.32604/cmc.2024.046807



This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.