Suzhen Wang¹, Yanpiao Zhang¹, Lu Zhang¹, Ning Cao²,*, Chaoyi Pang³
CMC-Computers, Materials & Continua, Vol.56, No.3, pp. 415-431, 2018, DOI:10.3970/cmc.2018.03716
Abstract Spark is a fast, unified analytics engine for big data and machine learning, in which memory is a crucial resource. Resilient Distributed Datasets (RDDs) are parallel data structures that allow users to explicitly persist intermediate results in memory or on disk, and each RDD can be divided into several partitions. During task execution, Spark automatically monitors cache usage on each node. When an RDD must be stored in a cache that has insufficient space, the system evicts old data partitions in a least recently used (LRU) fashion…
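The LRU eviction behavior described in the abstract can be illustrated with a minimal sketch. This is a toy model, not Spark's actual block manager and not the method proposed in the paper: the class name `LRUPartitionCache`, the `(rdd_id, partition_index)` key format, and the byte-based capacity are all assumptions made for illustration.

```python
from collections import OrderedDict


class LRUPartitionCache:
    """Toy per-node cache that evicts partitions in LRU order (hypothetical)."""

    def __init__(self, capacity_bytes):
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        # Maps (rdd_id, partition_index) -> size in bytes.
        # The least recently used entry is always first.
        self._entries = OrderedDict()

    def get(self, key):
        """Return True on a hit and mark the partition as recently used."""
        if key not in self._entries:
            return False
        self._entries.move_to_end(key)
        return True

    def put(self, key, size_bytes):
        """Cache a partition, evicting LRU partitions until it fits.

        Returns the list of evicted keys, oldest first.
        """
        evicted = []
        if size_bytes > self.capacity_bytes:
            # Assumption: a partition larger than the whole cache is not stored.
            return evicted
        if key in self._entries:
            # Re-caching an existing partition replaces its old entry.
            self.used_bytes -= self._entries.pop(key)
        while self.used_bytes + size_bytes > self.capacity_bytes:
            old_key, old_size = self._entries.popitem(last=False)
            self.used_bytes -= old_size
            evicted.append(old_key)
        self._entries[key] = size_bytes
        self.used_bytes += size_bytes
        return evicted
```

In this sketch, a partition that was read recently survives an eviction that removes an older, untouched partition, regardless of how expensive either one would be to recompute. Recency is the only criterion plain LRU takes into account.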