Open Access
ARTICLE
An Improved Memory Cache Management Study Based on Spark
Hebei University of Economics and Business, Shijiazhuang, Hebei, 050061, China.
University College Dublin, Belfield, Dublin 4, Ireland.
The Australian e-Health Research Centre, ICT Centre, CSIRO, Australia.
* Corresponding Author: Ning Cao. Email: .
Computers, Materials & Continua 2018, 56(3), 415-431. https://doi.org/10.3970/cmc.2018.03716
Abstract
Spark is a fast, unified analytics engine for big data and machine learning, in which memory is a crucial resource. Resilient Distributed Datasets (RDDs) are parallel data structures that allow users to explicitly persist intermediate results in memory or on disk, and each RDD can be divided into several partitions. During task execution, Spark automatically monitors cache usage on each node; when an RDD needs to be stored in the cache but the space is insufficient, the system drops old data partitions in a least recently used (LRU) fashion to release more space. However, Spark has no mechanism designed specifically for caching RDDs, and LRU takes neither the dependencies among RDDs nor the needs of future stages into consideration. In this paper, we propose an optimization approach for RDD caching and LRU based on the features of partitions, which comprises three parts: a prediction mechanism for persistence, a weight model using the entropy method, and an update mechanism for weights and memory based on RDD partition features. Finally, through verification on the Spark platform, the experimental results show that our strategy can effectively reduce execution time and improve memory usage.

Keywords
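For readers unfamiliar with the eviction policy the paper sets out to improve, the following is a minimal, self-contained sketch of least-recently-used eviction. It is not Spark's actual implementation (Spark evicts block-by-block under its memory manager); the class and key names here are purely illustrative of the policy the abstract describes.

```python
from collections import OrderedDict

class LRUCache:
    """Minimal sketch of LRU eviction: when the cache is full, the
    least recently accessed entry is dropped to make room."""

    def __init__(self, capacity):
        self.capacity = capacity     # max number of cached partitions
        self.store = OrderedDict()   # order of keys tracks recency

    def get(self, key):
        if key not in self.store:
            return None
        self.store.move_to_end(key)  # mark as most recently used
        return self.store[key]

    def put(self, key, value):
        if key in self.store:
            self.store.move_to_end(key)
        self.store[key] = value
        if len(self.store) > self.capacity:
            # Evict the least recently used entry and report its key.
            evicted_key, _ = self.store.popitem(last=False)
            return evicted_key
        return None

# Illustrative usage with hypothetical partition names:
cache = LRUCache(capacity=2)
cache.put("rdd1_part0", "data-a")
cache.put("rdd2_part0", "data-b")
cache.get("rdd1_part0")                       # rdd1_part0 becomes most recent
cache.put("rdd3_part0", "data-c")             # evicts rdd2_part0, not rdd1_part0
```

Note that the policy looks only at recency of access: a partition that a future stage will soon need, or one whose loss forces an expensive recomputation through a long lineage of dependencies, is evicted just as readily as a useless one. This is exactly the blind spot the weight-based strategy in this paper targets.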
Cite This Article
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.